Latest commit 3e52fe82b9 by Javier Segura: Add forgejo actions (2026-06-11 22:11:38 +02:00). Checks: validate / validate (push) successful in 26s.
.forgejo/workflows	Add forgejo actions	2026-06-11 22:11:38 +02:00
alertmanager	Initial import	2026-06-11 20:45:01 +02:00
alloy	Initial import	2026-06-11 20:45:01 +02:00
blackbox	Initial import	2026-06-11 20:45:01 +02:00
caddy	Initial import	2026-06-11 20:45:01 +02:00
grafana/provisioning	Initial import	2026-06-11 20:45:01 +02:00
loki	Initial import	2026-06-11 20:45:01 +02:00
prometheus	Initial import	2026-06-11 20:45:01 +02:00
deploy.sh	Add forgejo actions	2026-06-11 22:11:38 +02:00
DEPLOYMENT_GUIDE.md	Initial import	2026-06-11 20:45:01 +02:00
docker-compose.yml	Fixed port	2026-06-11 20:53:51 +02:00
OPERATIONS.md	Initial import	2026-06-11 20:45:01 +02:00
README.md	Initial import	2026-06-11 20:45:01 +02:00

README.md

Monitoring Stack — Spec & Runbook

A single-server observability stack: metrics, logs, dashboards, and alerts, run with Docker Compose. Comfortably fits a 4 GB RAM VPS.

Components & data flow

                          ┌───────────────┐
   host metrics  ───────► │ node-exporter │ ──┐
                          └───────────────┘   │ scrape
                          ┌───────────────┐   │
   container metrics ───► │   cAdvisor    │ ──┤
                          └───────────────┘   ▼
                                        ┌────────────┐  alerts  ┌──────────────┐
                                        │ Prometheus │ ───────► │ Alertmanager │ ──► email/Slack/webhook
                                        └────────────┘          └──────────────┘
                                              ▲
                                              │ query
   docker + host logs ──► ┌───────┐  push  ┌──────┐  query  ┌─────────┐
                          │ Alloy │ ─────► │ Loki │ ◄─────  │ Grafana │ ◄── you (browser)
                          └───────┘        └──────┘         └─────────┘
                                                                 ▲
                                                                 └── queries both Prometheus & Loki

| Service | Role | Port (host) |
|---|---|---|
| Grafana | Dashboards & visualization | 3000 |
| Prometheus | Metrics TSDB + alert rule evaluation | 9090 |
| Alertmanager | Alert grouping/routing/silencing | 9093 |
| Loki | Log storage & query | 3100 |
| Alloy | Log shipper (Promtail's replacement) | 12345 |
| node-exporter | Host CPU/RAM/disk/network metrics | 9100 |
| cAdvisor | Per-container resource metrics | 8080 |
| blackbox-exporter | Endpoint probing (up/SSL/TTFB) | 9115 |
| Caddy | TLS reverse proxy (public HTTPS + basic auth) | 80, 443 |
| Forgejo | Git forge / code hosting | 2222 (SSH) |
| forgejo-runner | Forgejo Actions runner (CI) | none |
| forgejo-runner-dind | Docker engine for CI jobs (isolated) | none |

Why Alloy and not Promtail? Promtail reached end-of-life on March 2, 2026. Grafana folded it into Alloy (their OpenTelemetry-based unified agent). Alloy ships logs to Loki the same way and is the supported path going forward.

Layout

monitoring/
├── docker-compose.yml
├── .env.example                       # copy to .env
├── prometheus/
│   ├── prometheus.yml                 # scrape targets + alerting
│   ├── rules/alerts.yml               # alert rules (host + endpoint health)
│   └── targets/
│       └── blackbox-endpoints.yml     # EDIT to add probed URLs (auto-reloaded)
├── blackbox/
│   └── blackbox.yml                   # endpoint probe modules
├── caddy/
│   └── Caddyfile                      # TLS reverse proxy + basic auth routes
├── alertmanager/
│   └── alertmanager.yml               # routing + receivers
├── loki/
│   └── loki-config.yml                # TSDB v13, filesystem, 30d retention
├── alloy/
│   └── config.alloy                   # log collection pipeline
└── grafana/
    └── provisioning/
        ├── datasources/datasources.yml  # Prometheus + Loki auto-wired
        └── dashboards/dashboards.yml    # drop dashboard JSON here

Run it

cd monitoring
cp .env.example .env          # then edit .env and set a real Grafana password
docker compose up -d
docker compose ps             # all services should be "running"/"healthy"

Then open:

Grafana → http://SERVER_IP:3000 (login with the creds from .env)
Prometheus → http://SERVER_IP:9090 (check Status → Targets: all UP)
Alertmanager → http://SERVER_IP:9093

Data sources are pre-wired, so in Grafana you can go straight to Explore and query Prometheus or Loki.

First dashboards (import by ID)

In Grafana: Dashboards → New → Import, paste an ID:

1860 — Node Exporter Full (host CPU/RAM/disk/network)
19792 — cAdvisor / container metrics
13639 — Loki logs / app

Public access (Caddy reverse proxy)

A caddy service fronts the stack on ports 80/443, terminates TLS with automatic Let's Encrypt certificates, and proxies three hostnames to the backends over the internal office network:

| Public URL | Backend | Auth |
|---|---|---|
| `https://grafana.javiersegura.net` | grafana:3000 | Grafana's own login |
| `https://prometheus.javiersegura.net` | prometheus:9090 | HTTP basic auth |
| `https://alertmanager.javiersegura.net` | alertmanager:9093 | HTTP basic auth |

Prometheus and Alertmanager have no authentication of their own, so Caddy gates them behind HTTP basic auth (user admin). The routing lives in caddy/Caddyfile.

Prerequisites for certificates to issue:

DNS — each hostname needs a public A/AAAA record pointing at the server.
Firewall — ports 80 and 443 must be reachable from the internet (80 is needed for the ACME challenge and the HTTP→HTTPS redirect).

Until DNS and ports are in place, those sites will fail to load (typically with a TLS error, since no certificate has been issued yet) while Caddy retries ACME. That's expected, not a misconfiguration.

Basic-auth credentials are bcrypt hashes kept in .env on the server (never committed; deploy.sh does not sync or delete it). Generate them with:

docker run --rm caddy:2.11.4-alpine caddy hash-password --plaintext 'YOUR_PASSWORD'

then add the output to .env — single-quoted:

PROMETHEUS_BASIC_AUTH_HASH='$2a$14$....'
ALERTMANAGER_BASIC_AUTH_HASH='$2a$14$....'

The single quotes are required. Compose interpolates env values by default, and a bcrypt hash contains $, so an unquoted hash makes Compose try to expand $2a, $14, etc. as variables. The result is a "variable is not set. Defaulting to a blank string." warning and an empty (broken) hash. Single-quoted dotenv values are taken literally; the quotes are stripped before the value reaches Caddy.
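A quick check over .env can catch the unquoted-hash mistake before Compose starts. The following is a minimal sketch, not part of this repository: the function name `check_env_quoting` is hypothetical, and it only flags values that contain `$` without being wrapped in single quotes.

```shell
# Hypothetical helper: flag .env values that contain "$" but are not single-quoted.
# Prints the offending variable names and returns 1 if any were found.
check_env_quoting() {
  env_file="$1"
  bad=0
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;           # skip blank lines and comments
      *=*) ;;                        # only KEY=VALUE lines are inspected
      *) continue ;;
    esac
    key="${line%%=*}"
    value="${line#*=}"
    case "$value" in
      "'"*"'") ;;                    # single-quoted: taken literally
      *'$'*) echo "$key"; bad=1 ;;   # contains $ and is not single-quoted
    esac
  done < "$env_file"
  return "$bad"
}
```

Possible usage: `check_env_quoting .env || echo "fix the quoting before running docker compose up"`.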

The caddy service loads .env via env_file, and the Caddyfile reads the hashes as {$PROMETHEUS_BASIC_AUTH_HASH} / {$ALERTMANAGER_BASIC_AUTH_HASH}. After changing a hash, run docker compose up -d caddy to apply it.

To change a username, or to add basic auth in front of Grafana too, edit caddy/Caddyfile.

CI: Forgejo Actions runner

Two services power Forgejo Actions:

forgejo-runner-dind — an isolated docker:28-dind engine where CI jobs actually run. It's privileged (a dind requirement) and reachable only on the dedicated ci network — it publishes no host port and is not on the office network, so untrusted CI workloads can't touch the monitoring stack.
forgejo-runner — registers with Forgejo once, then runs the daemon that picks up jobs and executes them via the dind engine.

One-time setup:

Make sure Actions is enabled on the Forgejo side (it is by default in recent Forgejo; [actions] ENABLED = true in app.ini).
Get a runner registration token in Forgejo: Site Administration → Actions → Runners → Create new runner (or the per-org / per-repo Runners page).
Put it in .env on the server (tokens are alphanumeric — no quoting needed):
```
FORGEJO_RUNNER_REGISTRATION_TOKEN=...
```
Start both services: docker compose up -d forgejo-runner-dind forgejo-runner

On first start the runner sees no /data/.runner file, registers using the token, and writes the registration into the forgejo_runner_data volume (/srv/office/data/forgejo-runner). That persists, so the token is only needed once — later restarts skip straight to the daemon. Confirm the runner shows up under the Forgejo Runners page as idle.

Optional overrides (defaults shown) — set in .env if you want:

| Variable | Default | Purpose |
|---|---|---|
| `FORGEJO_INSTANCE_URL` | `https://forge.javiersegura.net` | Forgejo URL the runner registers/polls against |
| `FORGEJO_RUNNER_NAME` | `office-runner` | Runner name shown in Forgejo |
| `FORGEJO_RUNNER_LABELS` | `ubuntu-latest:docker://node:22-bookworm,…` | Maps `runs-on:` labels to job images |

Labels are label:docker://image pairs: a workflow's runs-on: ubuntu-latest runs inside node:22-bookworm. node images are used because most actions need Node. For closer GitHub compatibility, point a label at a catthehacker image, e.g. ubuntu-latest:docker://ghcr.io/catthehacker/ubuntu:act-22.04.
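To see how a labels value breaks down, it can help to split it into its pairs. This is an illustrative sketch only: `parse_runner_labels` is a hypothetical helper, it assumes every entry uses the label:docker://image form, and the second label in the example call is made up for the demonstration.

```shell
# Split a FORGEJO_RUNNER_LABELS-style value into "label -> image" lines.
# Assumes every comma-separated entry has the form label:docker://image.
parse_runner_labels() {
  labels="$1"
  old_ifs="$IFS"
  IFS=','
  set -f                            # no glob expansion while word-splitting
  for pair in $labels; do
    label="${pair%%:docker://*}"    # text before the first ":docker://"
    image="${pair#*:docker://}"     # text after it
    echo "$label -> $image"
  done
  set +f
  IFS="$old_ifs"
}

parse_runner_labels 'ubuntu-latest:docker://node:22-bookworm,ubuntu-22.04:docker://ghcr.io/catthehacker/ubuntu:act-22.04'
```

Each output line shows which image a given runs-on: value would run in.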

Why the public URL? The runner registers against https://forge.javiersegura.net (not the internal forgejo:3000) because the job containers run inside dind — isolated from this stack's Docker networks — and must clone repos from an address that's reachable from there. The public URL works everywhere with internet access. This assumes the host allows NAT loopback (a container connecting to the host's public IP reaching Caddy), which is the common case. If jobs can't reach Forgejo, either enable hairpin NAT on the host, or register against the internal URL and attach the runner + dind to the office network instead (then forgejo:3000 resolves, but only if dind jobs are also given a route to it).

Useful operations

# Reload Prometheus config without restart (web.enable-lifecycle is on)
curl -X POST http://localhost:9090/-/reload

# Validate Prometheus rules / config before reloading
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Check live alert rule state (or browse to http://localhost:9090/alerts)
curl -s http://localhost:9090/api/v1/rules

# Tail a service
docker compose logs -f alloy
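
The validate and reload commands above can be chained so that a broken config is never reloaded. A minimal sketch, assuming the container name and port used in those commands; the function name `safe_reload` is hypothetical.

```shell
# Validate the Prometheus config and reload only if validation passes.
# Uses the same container name (prometheus) and port (9090) as the commands above.
safe_reload() {
  if docker exec prometheus promtool check config /etc/prometheus/prometheus.yml; then
    # -f makes curl return non-zero on an HTTP error status
    curl -fsS -X POST http://localhost:9090/-/reload && echo "reloaded"
  else
    echo "config invalid, not reloading" >&2
    return 1
  fi
}
```

Run `safe_reload` after editing prometheus.yml or the rule files.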

Hooking up alerts

Alertmanager is wired to Prometheus but sends nowhere until you fill in a receiver. Edit alertmanager/alertmanager.yml, uncomment one of the email / Slack / webhook blocks, set real values, then:

docker compose restart alertmanager

Alertmanager does not read ${ENV_VARS} from its config, so put real secrets in the file directly and keep it out of version control.
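
Once a receiver is filled in, one way to confirm delivery is to post a hand-made alert to Alertmanager's v2 API. The sketch below is an assumption-laden example: the helper names, the default alert name, and the severity label are placeholders, and whether the alert reaches your receiver depends on the routes in alertmanager.yml.

```shell
# Build a minimal Alertmanager v2 alert payload. The alert name and the
# severity label are placeholders, not values from this stack's rules.
test_alert_payload() {
  name="${1:-TestAlert}"
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"manual test alert"}}]' "$name"
}

# Post the payload to the local Alertmanager (port 9093 on localhost).
send_test_alert() {
  test_alert_payload "$1" | curl -fsS -X POST -H 'Content-Type: application/json' \
    --data-binary @- http://localhost:9093/api/v2/alerts
}
```

Calling `send_test_alert` with no argument posts an alert named TestAlert; it should appear in the Alertmanager UI shortly afterwards.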

What's collected, at a glance

| | Metrics | Logs |
|---|---|---|
| Host | Resource usage via node-exporter (auto). App `/metrics`: add a scrape job using `host.docker.internal` | `/var/log/*.log` and the systemd journal (auto) |
| Containers | Resource usage via cAdvisor (auto). App `/metrics`: put the app on the `monitoring` network + add a scrape job | All container stdout/stderr via Alloy (auto) |

"Resource usage" = CPU/RAM/disk/network. "App /metrics" = your own application's Prometheus endpoint (request counts, queue depth, etc.) — that always needs an explicit scrape target.

Adding your own application metrics

A host service exposing metrics on a host port (e.g. a Go service on :2112). Prometheus runs in a container, so it reaches the host through host.docker.internal (the compose file maps this to the host gateway, which is required on Linux):

  - job_name: 'my-host-app'
    static_configs:
      - targets: ['host.docker.internal:2112']

A container service. Attach it to the monitoring network so Prometheus can resolve it by container name, then scrape that name:

# in that app's compose file
services:
  my-app:
    networks: [monitoring]
networks:
  monitoring:
    external: true
    name: monitoring

# in prometheus/prometheus.yml
  - job_name: 'my-app'
    static_configs:
      - targets: ['my-app:8000']

Either way, reload Prometheus afterwards:

curl -X POST http://localhost:9090/-/reload

Confirm the target shows UP under Prometheus → Status → Targets.
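
The same check can be done from the command line through the Prometheus targets API. This is a rough grep-based sketch with a hypothetical function name; jq would be more robust if it is installed.

```shell
# Count scrape targets whose health is not "up" in /api/v1/targets JSON
# read from stdin. Relies on the compact "health":"..." form of the API output.
count_unhealthy_targets() {
  grep -o '"health":"[a-z]*"' | grep -vc '"health":"up"' || true
}

# Usage against the local Prometheus (port 9090 on localhost):
#   curl -s http://localhost:9090/api/v1/targets | count_unhealthy_targets
```

A result of 0 means every active target is being scraped successfully.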

Monitoring endpoints (up / SSL expiry / TTFB)

The blackbox-http job probes external URLs and emits probe_success, probe_http_status_code, probe_ssl_earliest_cert_expiry, and probe_http_duration_seconds. Matching alerts ship in rules/alerts.yml (EndpointDown, HttpStatusAbnormal, SslCertExpiring*, HighTtfb).

To add or remove a monitored endpoint, edit prometheus/targets/blackbox-endpoints.yml — that's it. Prometheus watches the file and reloads it automatically; no restart, no -/reload. You can also add more files named blackbox-*.yml (e.g. one per environment) and they're all picked up by the glob.

# prometheus/targets/blackbox-endpoints.yml
- targets:
    - https://myapp.example.com/health
  labels:
    env: production

Note the distinction from static_configs: editing a target file is live-reloaded, but editing prometheus.yml itself (job definitions, relabeling) still needs curl -X POST http://localhost:9090/-/reload.
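
Because the target file is plain YAML, adding an endpoint can be scripted. A minimal sketch, assuming the entry format shown above; `add_blackbox_endpoint` and its default env label are illustrative, not something this repository ships.

```shell
# Append one endpoint entry to a blackbox target file, using the same
# layout as the example entry above. The default env label is an assumption.
add_blackbox_endpoint() {
  target_file="$1"
  url="$2"
  env_label="${3:-production}"
  {
    printf '%s\n' '- targets:'
    printf '    - %s\n' "$url"
    printf '%s\n' '  labels:'
    printf '    env: %s\n' "$env_label"
  } >> "$target_file"
}

# Usage:
#   add_blackbox_endpoint prometheus/targets/blackbox-endpoints.yml https://myapp.example.com/health
```

Since Prometheus watches the file, the new target should show up without a reload.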

Security notes (do before exposing to the internet)

Change the Grafana admin password in .env. Don't ship the default.
Don't publish 9090 / 9093 / 3100 / 8080 / 9100 / 12345 to the public internet — they have no auth. They're already bound to 127.0.0.1 in docker-compose.yml so they're reachable only on the host (use an SSH tunnel for direct access). Public access goes exclusively through the Caddy reverse proxy with TLS — see Public access above — and Prometheus/Alertmanager sit behind basic auth there. Only Caddy publishes to 0.0.0.0 (80/443).
cAdvisor runs privileged: true — that's its normal requirement, but it's another reason not to expose its port.
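
To double-check the loopback binding described above, a compose file can be scanned for port mappings that are not bound to 127.0.0.1. This is a heuristic sketch with a hypothetical function name; it only understands short-syntax entries such as "80:80" or "127.0.0.1:9090:9090".

```shell
# List short-syntax port mappings in a compose file that are not bound to 127.0.0.1.
# Output is "line-number:original line"; nothing is printed if all are loopback-only.
find_public_ports() {
  grep -nE '^[[:space:]]*-[[:space:]]*"?[0-9.:]+:[0-9]+(/(tcp|udp))?"?[[:space:]]*$' "$1" \
    | grep -v '127\.0\.0\.1:' || true
}

# Usage:
#   find_public_ports docker-compose.yml
# For direct access to a loopback-bound port, an SSH tunnel could look like this
# (user and SERVER_IP are placeholders):
#   ssh -N -L 9090:127.0.0.1:9090 user@SERVER_IP
```

Any mapping listed here is published on all host interfaces, so review each one against the list above.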

Retention & disk

Metrics: 30 days (--storage.tsdb.retention.time=30d in compose).
Logs: 30 days (retention_period: 720h in loki-config.yml, enforced by the compactor). Loki on a single server stores chunks on the local filesystem volume — watch disk usage and adjust retention to taste.
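
Loki's retention_period is written in hours, so 30 days becomes 30 × 24 = 720h. The sketch below shows that conversion plus a simple disk-usage reading; both function names are hypothetical, and the path you check depends on where your volumes live.

```shell
# Convert a retention period in days to the hour value used by retention_period.
days_to_hours() {
  echo "$(( $1 * 24 ))h"
}

# Print the used-space percentage (without the % sign) of the filesystem holding a path.
disk_used_percent() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

days_to_hours 30        # prints 720h
```

For example, `[ "$(disk_used_percent /)" -gt 80 ] && echo "disk above 80%"` could run from cron as a crude early warning.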

Version pinning

The image tags in docker-compose.yml were current stable as of June 2026. Before deploying, verify each against its releases page and bump as needed — and keep them pinned (never :latest):

Prometheus / Alertmanager / node-exporter → github.com/prometheus
Loki / Alloy / Grafana → github.com/grafana
cAdvisor → github.com/google/cadvisor
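
A small scan can help keep every image pinned. This is a sketch under assumptions: `find_unpinned_images` is a hypothetical helper, it expects `image:` and the reference on one line, and it treats a digest reference as pinned because its last segment contains a colon.

```shell
# List image lines in a compose file that use :latest or carry no tag at all.
# Output is "line-number: image-reference".
find_unpinned_images() {
  awk '
    $1 == "image:" {
      ref = $2
      gsub(/"/, "", ref)               # strip double quotes around the reference
      n = split(ref, parts, "/")
      last = parts[n]                  # final path segment, where a tag would be
      if (last !~ /:/ || last ~ /:latest$/) print NR ": " ref
    }
  ' "$1"
}

# Usage:
#   find_unpinned_images docker-compose.yml
```

No output means every image line carries an explicit tag other than latest.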