
Monitoring Stack — Spec & Runbook

A single-server observability stack: metrics, logs, dashboards, and alerts, run with Docker Compose. Comfortably fits a 4 GB RAM VPS.

Components & data flow

                          ┌───────────────┐
   host metrics  ───────► │ node-exporter │ ──┐
                          └───────────────┘   │ scrape
                          ┌───────────────┐   │
   container metrics ───► │   cAdvisor    │ ──┤
                          └───────────────┘   ▼
                                        ┌────────────┐  alerts  ┌──────────────┐
                                        │ Prometheus │ ───────► │ Alertmanager │ ──► email/Slack/webhook
                                        └────────────┘          └──────────────┘
                                              ▲
                                              │ query
   docker + host logs ──► ┌───────┐  push  ┌──────┐  query  ┌─────────┐
                          │ Alloy │ ─────► │ Loki │ ◄─────  │ Grafana │ ◄── you (browser)
                          └───────┘        └──────┘         └─────────┘
                                                                 ▲
                                                                 └── queries both Prometheus & Loki
| Service | Role | Port (host) |
| --- | --- | --- |
| Grafana | Dashboards & visualization | 3000 |
| Prometheus | Metrics TSDB + alert rule evaluation | 9090 |
| Alertmanager | Alert grouping/routing/silencing | 9093 |
| Loki | Log storage & query | 3100 |
| Alloy | Log shipper (Promtail's replacement) | 12345 |
| node-exporter | Host CPU/RAM/disk/network metrics | 9100 |
| cAdvisor | Per-container resource metrics | 8080 |
| blackbox-exporter | Endpoint probing (up/SSL/TTFB) | 9115 |
| Caddy | TLS reverse proxy (public HTTPS + basic auth) | 80, 443 |
| Forgejo | Git forge / code hosting | 2222 (SSH) |
| forgejo-runner | Forgejo Actions runner (CI) | |
| forgejo-runner-dind | Docker engine for CI jobs (isolated) | |

Why Alloy and not Promtail? Promtail reached end-of-life on March 2, 2026. Grafana folded it into Alloy (their OpenTelemetry-based unified agent). Alloy ships logs to Loki the same way and is the supported path going forward.

Layout

monitoring/
├── docker-compose.yml
├── .env.example                       # copy to .env
├── prometheus/
│   ├── prometheus.yml                 # scrape targets + alerting
│   ├── rules/alerts.yml               # alert rules (host + endpoint health)
│   └── targets/
│       └── blackbox-endpoints.yml     # EDIT to add probed URLs (auto-reloaded)
├── blackbox/
│   └── blackbox.yml                   # endpoint probe modules
├── caddy/
│   └── Caddyfile                      # TLS reverse proxy + basic auth routes
├── alertmanager/
│   └── alertmanager.yml               # routing + receivers
├── loki/
│   └── loki-config.yml                # TSDB v13, filesystem, 30d retention
├── alloy/
│   └── config.alloy                   # log collection pipeline
└── grafana/
    └── provisioning/
        ├── datasources/datasources.yml  # Prometheus + Loki auto-wired
        └── dashboards/dashboards.yml    # drop dashboard JSON here

Run it

cd monitoring
cp .env.example .env          # then edit .env and set a real Grafana password
docker compose up -d
docker compose ps             # all services should be "running"/"healthy"

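Then open (host ports from the table above; they are bound to 127.0.0.1, so browse from the server itself or through an SSH tunnel):

  • Grafana: http://localhost:3000
  • Prometheus: http://localhost:9090
  • Alertmanager: http://localhost:9093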

Data sources are pre-wired, so in Grafana you can go straight to Explore and query Prometheus or Loki.

First dashboards (import by ID)

In Grafana: Dashboards → New → Import, paste an ID:

  • 1860 — Node Exporter Full (host CPU/RAM/disk/network)
  • 19792 — cAdvisor / container metrics
  • 13639 — Loki logs / app

Public access (Caddy reverse proxy)

A caddy service fronts the stack on ports 80/443, terminates TLS with automatic Let's Encrypt certificates, and proxies three hostnames to the backends over the internal office network:

| Public URL | Backend | Auth |
| --- | --- | --- |
| https://grafana.javiersegura.net | grafana:3000 | Grafana's own login |
| https://prometheus.javiersegura.net | prometheus:9090 | HTTP basic auth |
| https://alertmanager.javiersegura.net | alertmanager:9093 | HTTP basic auth |

Prometheus and Alertmanager have no authentication of their own, so Caddy gates them behind HTTP basic auth (user admin). The routing lives in caddy/Caddyfile.

Prerequisites for certificates to issue:

  1. DNS — each hostname needs a public A/AAAA record pointing at the server.
  2. Firewall — ports 80 and 443 must be reachable from the internet (80 is needed for the ACME challenge and the HTTP→HTTPS redirect).

Until DNS and ports are in place, those sites will 502 while Caddy retries ACME — that's expected, not a misconfiguration.

Basic-auth credentials are bcrypt hashes kept in .env on the server (never committed; deploy.sh does not sync or delete it). Generate them with:

docker run --rm caddy:2.11.4-alpine caddy hash-password --plaintext 'YOUR_PASSWORD'

then add the output to .env, single-quoted:

PROMETHEUS_BASIC_AUTH_HASH='$2a$14$....'
ALERTMANAGER_BASIC_AUTH_HASH='$2a$14$....'

The single quotes are required. Compose interpolates env values by default, and a bcrypt hash contains $, so with an unquoted hash Compose tries to expand $2a, $14, etc. as variables: you get a "variable is not set. Defaulting to a blank string." warning and an empty (broken) hash. Single-quoted dotenv values are taken literally; the quotes are stripped before the value reaches Caddy.
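A quick guard against a hash pasted without quotes could look like the sketch below. The function name unquoted_hashes is hypothetical (it is not part of this repo), and it assumes the variable names end in _BASIC_AUTH_HASH as in the example above.

```shell
# Hypothetical helper, not part of this repo: print every *_BASIC_AUTH_HASH
# line in a dotenv file whose value does not start with a single quote.
# Prints nothing (and returns non-zero, like grep) when every hash is quoted.
unquoted_hashes() {
    grep -E "^[A-Z_]+_BASIC_AUTH_HASH=[^']" "$1"
}

# Usage: unquoted_hashes .env
```

Any line it prints is one that Compose would try to interpolate; wrap that value in single quotes and rerun.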

The caddy service loads .env via env_file, and the Caddyfile reads the hashes as {$PROMETHEUS_BASIC_AUTH_HASH} / {$ALERTMANAGER_BASIC_AUTH_HASH}. After changing a hash, run docker compose up -d caddy to apply it.

To change a username, or to add basic auth in front of Grafana too, edit caddy/Caddyfile.

CI: Forgejo Actions runner

Two services power Forgejo Actions:

  • forgejo-runner-dind — an isolated docker:28-dind engine where CI jobs actually run. It's privileged (a dind requirement) and reachable only on the dedicated ci network — it publishes no host port and is not on the office network, so untrusted CI workloads can't touch the monitoring stack.
  • forgejo-runner — registers with Forgejo once, then runs the daemon that picks up jobs and executes them via the dind engine.

One-time setup:

  1. Make sure Actions is enabled on the Forgejo side (it is by default in recent Forgejo; [actions] ENABLED = true in app.ini).
  2. Get a runner registration token in Forgejo: Site Administration → Actions → Runners → Create new runner (or the per-org / per-repo Runners page).
  3. Put it in .env on the server (tokens are alphanumeric, so no quoting is needed):
     FORGEJO_RUNNER_REGISTRATION_TOKEN=...
  4. Start both services: docker compose up -d forgejo-runner-dind forgejo-runner

On first start the runner sees no /data/.runner file, registers using the token, and writes the registration into the forgejo_runner_data volume (/srv/office/data/forgejo-runner). That persists, so the token is only needed once — later restarts skip straight to the daemon. Confirm the runner shows up under the Forgejo Runners page as idle.
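The first-start behaviour described above boils down to one file check. The sketch below only illustrates that decision: the function name runner_start_action is hypothetical, and the real startup logic lives in docker-compose.yml and may differ in detail.

```shell
# Sketch only: illustrates the "register once, then daemon" decision.
# $1 is the runner's data directory (/data inside the container).
runner_start_action() {
    if [ -f "$1/.runner" ]; then
        echo "daemon"            # registration already persisted: start the daemon directly
    else
        echo "register+daemon"   # no .runner file yet: register with the token first
    fi
}

# Usage: runner_start_action /data
```

Because the check is on the persisted file rather than on the token, removing FORGEJO_RUNNER_REGISTRATION_TOKEN from .env after the first successful start should not affect later restarts.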

Optional overrides (defaults shown) — set in .env if you want:

| Variable | Default | Purpose |
| --- | --- | --- |
| FORGEJO_INSTANCE_URL | https://forge.javiersegura.net | Forgejo URL the runner registers/polls against |
| FORGEJO_RUNNER_NAME | office-runner | Runner name shown in Forgejo |
| FORGEJO_RUNNER_LABELS | ubuntu-latest:docker://node:22-bookworm,… | Maps runs-on: labels to job images |

Labels are label:docker://image pairs: a workflow's runs-on: ubuntu-latest runs inside node:22-bookworm. node images are used because most actions need Node. For closer GitHub compatibility, point a label at a catthehacker image, e.g. ubuntu-latest:docker://ghcr.io/catthehacker/ubuntu:act-22.04.
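To see how a label string breaks down, the sketch below splits a comma-separated FORGEJO_RUNNER_LABELS value into its runs-on name and job image. The function name show_label_map is hypothetical, and the ubuntu-22.04 label in the usage line is a made-up example rather than one of this stack's defaults.

```shell
# Hypothetical helper: print "label -> image" for each comma-separated
# label:docker://image pair in the string passed as $1.
show_label_map() {
    old_ifs=$IFS
    IFS=,                        # split the argument on commas
    for pair in $1; do
        # Everything before ":docker://" is the runs-on label,
        # everything after it is the container image.
        printf '%s -> %s\n' "${pair%%:docker://*}" "${pair#*:docker://}"
    done
    IFS=$old_ifs
}

# Usage:
# show_label_map 'ubuntu-latest:docker://node:22-bookworm,ubuntu-22.04:docker://ghcr.io/catthehacker/ubuntu:act-22.04'
```

The split is on the literal :docker:// separator rather than on the first colon, so image tags such as :22-bookworm stay attached to the image.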

Why the public URL? The runner registers against https://forge.javiersegura.net (not the internal forgejo:3000) because the job containers run inside dind — isolated from this stack's Docker networks — and must clone repos from an address that's reachable from there. The public URL works everywhere with internet access. This assumes the host allows NAT loopback (a container connecting to the host's public IP reaching Caddy), which is the common case. If jobs can't reach Forgejo, either enable hairpin NAT on the host, or register against the internal URL and attach the runner + dind to the office network instead (then forgejo:3000 resolves, but only if dind jobs are also given a route to it).

Useful operations

# Reload Prometheus config without restart (web.enable-lifecycle is on)
curl -X POST http://localhost:9090/-/reload

# Validate Prometheus rules / config before reloading
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Check live alert rule state (or browse to http://localhost:9090/alerts)
curl -s http://localhost:9090/api/v1/rules

# Tail a service
docker compose logs -f alloy

Hooking up alerts

Alertmanager is wired to Prometheus but sends nowhere until you fill in a receiver. Edit alertmanager/alertmanager.yml, uncomment one of the email / Slack / webhook blocks, set real values, then:

docker compose restart alertmanager

Alertmanager does not read ${ENV_VARS} from its config, so put real secrets in the file directly and keep it out of version control.

What's collected, at a glance

| | Metrics | Logs |
| --- | --- | --- |
| Host | Resource usage via node-exporter (auto). App /metrics: add a scrape job using host.docker.internal | /var/log/*.log and the systemd journal (auto) |
| Containers | Resource usage via cAdvisor (auto). App /metrics: put the app on the monitoring network + add a scrape job | All container stdout/stderr via Alloy (auto) |

"Resource usage" = CPU/RAM/disk/network. "App /metrics" = your own application's Prometheus endpoint (request counts, queue depth, etc.) — that always needs an explicit scrape target.

Adding your own application metrics

A host service exposing metrics on a host port (e.g. a Go service on :2112). Prometheus runs in a container, so it reaches the host through host.docker.internal (the compose file maps this to the host gateway, which is required on Linux):

  - job_name: 'my-host-app'
    static_configs:
      - targets: ['host.docker.internal:2112']

A container service. Attach it to the monitoring network so Prometheus can resolve it by container name, then scrape that name:

# in that app's compose file
services:
  my-app:
    networks: [monitoring]
networks:
  monitoring:
    external: true
    name: monitoring

# in prometheus/prometheus.yml
  - job_name: 'my-app'
    static_configs:
      - targets: ['my-app:8000']

Either way, reload Prometheus afterwards:

curl -X POST http://localhost:9090/-/reload

Confirm the target shows UP under Prometheus → Status → Targets.

Monitoring endpoints (up / SSL expiry / TTFB)

The blackbox-http job probes external URLs and emits probe_success, probe_http_status_code, probe_ssl_earliest_cert_expiry, and probe_http_duration_seconds. Matching alerts ship in rules/alerts.yml (EndpointDown, HttpStatusAbnormal, SslCertExpiring*, HighTtfb).

To add or remove a monitored endpoint, edit prometheus/targets/blackbox-endpoints.yml — that's it. Prometheus watches the file and reloads it automatically; no restart, no -/reload. You can also add more files named blackbox-*.yml (e.g. one per environment) and they're all picked up by the glob.

# prometheus/targets/blackbox-endpoints.yml
- targets:
    - https://myapp.example.com/health
  labels:
    env: production

Note the distinction from static_configs: editing a target file is live-reloaded, but editing prometheus.yml itself (job definitions, relabeling) still needs curl -X POST http://localhost:9090/-/reload.
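For the one-file-per-environment layout mentioned above, a small generator can keep entries in a consistent shape. This is a sketch: the function name blackbox_target, the staging URL, and the blackbox-staging.yml file name are all hypothetical; only the entry format mirrors the example above.

```shell
# Hypothetical helper: print one target-file entry in the same shape as
# prometheus/targets/blackbox-endpoints.yml. $1 = URL to probe, $2 = env label.
blackbox_target() {
    printf '%s\n' \
        "- targets:" \
        "    - $1" \
        "  labels:" \
        "    env: $2"
}

# Usage (made-up URL and file name; the file matches the blackbox-*.yml glob):
# blackbox_target https://staging.example.com/health staging >> prometheus/targets/blackbox-staging.yml
```

Appending to a target file this way needs no reload afterwards, since target files are picked up automatically.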

Security notes (do before exposing to the internet)

  • Change the Grafana admin password in .env. Don't ship the default.
  • Don't publish 9090 / 9093 / 3100 / 8080 / 9100 / 12345 to the public internet — they have no auth. They're already bound to 127.0.0.1 in docker-compose.yml so they're reachable only on the host (use an SSH tunnel for direct access). Public access goes exclusively through the Caddy reverse proxy with TLS — see Public access above — and Prometheus/Alertmanager sit behind basic auth there. Only Caddy publishes to 0.0.0.0 (80/443).
  • cAdvisor runs privileged: true — that's its normal requirement, but it's another reason not to expose its port.

Retention & disk

  • Metrics: 30 days (--storage.tsdb.retention.time=30d in compose).
  • Logs: 30 days (retention_period: 720h in loki-config.yml, enforced by the compactor). Loki on a single server stores chunks on the local filesystem volume — watch disk usage and adjust retention to taste.
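The two retention settings use different units (days for Prometheus, hours for Loki), so when changing one it helps to convert before editing the other. A minimal sketch, with a hypothetical function name:

```shell
# Convert a retention in days to the hour-based value used by
# retention_period in loki-config.yml.
days_to_hours() {
    echo "$(( $1 * 24 ))h"
}

# Usage: days_to_hours 30    # the current 30-day setting, i.e. 720h
```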

Version pinning

The image tags in docker-compose.yml were current stable as of June 2026. Before deploying, verify each against its releases page and bump as needed — and keep them pinned (never :latest):

  • Prometheus / Alertmanager / node-exporter → github.com/prometheus
  • Loki / Alloy / Grafana → github.com/grafana
  • cAdvisor → github.com/google/cadvisor