Monitoring Stack — Spec & Runbook
A single-server observability stack: metrics, logs, dashboards, and alerts, run with Docker Compose. Comfortably fits a 4 GB RAM VPS.
Components & data flow
                                         ┌───────────────┐
host metrics ──────────────────────────► │ node-exporter │ ──┐
                                         └───────────────┘   │ scrape
                                         ┌───────────────┐   │
container metrics ─────────────────────► │ cAdvisor      │ ──┤
                                         └───────────────┘   ▼
                                                      ┌────────────┐  alerts  ┌──────────────┐
                                                      │ Prometheus │ ───────► │ Alertmanager │ ──► email/Slack/webhook
                                                      └────────────┘          └──────────────┘
                                                             ▲
                                                             │ query
                       ┌───────┐  push  ┌──────┐ query  ┌─────────┐
docker + host logs ──► │ Alloy │ ─────► │ Loki │ ◄───── │ Grafana │ ◄── you (browser)
                       └───────┘        └──────┘        └─────────┘
                                                             ▲
                                                             └── queries both Prometheus & Loki
| Service | Role | Port (host) |
|---|---|---|
| Grafana | Dashboards & visualization | 3000 |
| Prometheus | Metrics TSDB + alert rule evaluation | 9090 |
| Alertmanager | Alert grouping/routing/silencing | 9093 |
| Loki | Log storage & query | 3100 |
| Alloy | Log shipper (Promtail's replacement) | 12345 |
| node-exporter | Host CPU/RAM/disk/network metrics | 9100 |
| cAdvisor | Per-container resource metrics | 8080 |
| blackbox-exporter | Endpoint probing (up/SSL/TTFB) | 9115 |
| Caddy | TLS reverse proxy (public HTTPS + basic auth) | 80, 443 |
| Forgejo | Git forge / code hosting | 2222 (SSH) |
| forgejo-runner | Forgejo Actions runner (CI) | — |
| forgejo-runner-dind | Docker engine for CI jobs (isolated) | — |
Why Alloy and not Promtail? Promtail reached end-of-life on March 2, 2026. Grafana folded it into Alloy (their OpenTelemetry-based unified agent). Alloy ships logs to Loki the same way and is the supported path going forward.
Layout
monitoring/
├── docker-compose.yml
├── .env.example                  # copy to .env
├── prometheus/
│   ├── prometheus.yml            # scrape targets + alerting
│   ├── rules/alerts.yml          # alert rules (host + endpoint health)
│   └── targets/
│       └── blackbox-endpoints.yml  # EDIT to add probed URLs (auto-reloaded)
├── blackbox/
│   └── blackbox.yml              # endpoint probe modules
├── caddy/
│   └── Caddyfile                 # TLS reverse proxy + basic auth routes
├── alertmanager/
│   └── alertmanager.yml          # routing + receivers
├── loki/
│   └── loki-config.yml           # TSDB v13, filesystem, 30d retention
├── alloy/
│   └── config.alloy              # log collection pipeline
└── grafana/
    └── provisioning/
        ├── datasources/datasources.yml  # Prometheus + Loki auto-wired
        └── dashboards/dashboards.yml    # drop dashboard JSON here
Run it
cd monitoring
cp .env.example .env # then edit .env and set a real Grafana password
docker compose up -d
docker compose ps # all services should be "running"/"healthy"
Then open:
- Grafana → http://SERVER_IP:3000 (login with the creds from .env)
- Prometheus → http://SERVER_IP:9090 (check Status → Targets: all UP)
- Alertmanager → http://SERVER_IP:9093
Data sources are pre-wired, so in Grafana you can go straight to Explore and query Prometheus or Loki.
First dashboards (import by ID)
In Grafana: Dashboards → New → Import, paste an ID:
- 1860 — Node Exporter Full (host CPU/RAM/disk/network)
- 19792 — cAdvisor / container metrics
- 13639 — Loki logs / app
Public access (Caddy reverse proxy)
A caddy service fronts the stack on ports 80/443, terminates TLS with
automatic Let's Encrypt certificates, and proxies three hostnames to the
backends over the internal office network:
| Public URL | Backend | Auth |
|---|---|---|
| https://grafana.javiersegura.net | grafana:3000 | Grafana's own login |
| https://prometheus.javiersegura.net | prometheus:9090 | HTTP basic auth |
| https://alertmanager.javiersegura.net | alertmanager:9093 | HTTP basic auth |
Prometheus and Alertmanager have no authentication of their own, so Caddy gates
them behind HTTP basic auth (user admin). The routing lives in
caddy/Caddyfile.
Prerequisites for certificates to issue:
- DNS — each hostname needs a public A/AAAA record pointing at the server.
- Firewall — ports 80 and 443 must be reachable from the internet (80 is needed for the ACME challenge and the HTTP→HTTPS redirect).
Until DNS and ports are in place, those sites will 502 while Caddy retries ACME — that's expected, not a misconfiguration.
Basic-auth credentials are bcrypt hashes kept in .env on the server
(never committed; deploy.sh does not sync or delete it). Generate them with:
docker run --rm caddy:2.11.4-alpine caddy hash-password --plaintext 'YOUR_PASSWORD'
then add the output to .env — single-quoted:
PROMETHEUS_BASIC_AUTH_HASH='$2a$14$....'
ALERTMANAGER_BASIC_AUTH_HASH='$2a$14$....'
The single quotes are required. Compose interpolates env values by default, and
a bcrypt hash contains $, so an unquoted hash makes Compose try to expand
$2a, $14, etc. as variables: you get a "variable is not set, defaulting to a
blank string" warning and an empty (broken) hash. Single-quoted dotenv values
are taken literally; the quotes are stripped before the value reaches Caddy.
The caddy service loads .env via env_file, and the Caddyfile reads the
hashes as {$PROMETHEUS_BASIC_AUTH_HASH} / {$ALERTMANAGER_BASIC_AUTH_HASH}.
After changing a hash, run docker compose up -d caddy to apply it.
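The quoting rule above is easy to get wrong when pasting a hash, so a small pre-deploy check can help. This is a minimal bash sketch, not part of the repo: the function name check_hash_quoting is hypothetical, and it assumes the hash variables all end in _BASIC_AUTH_HASH as in the examples above.

```shell
# Report *_BASIC_AUTH_HASH entries in a dotenv file whose value does not
# start with a single quote. Prints one line per offender and returns 1
# if any are found; returns 0 otherwise.
# (check_hash_quoting is a hypothetical helper, not something deploy.sh provides.)
check_hash_quoting() {
  local env_file="${1:-.env}"
  local bad
  # Match NAME_BASIC_AUTH_HASH=<first character is not a single quote>
  bad=$(grep -E "^[A-Z_]+_BASIC_AUTH_HASH=[^']" "$env_file" | cut -d= -f1)
  if [ -n "$bad" ]; then
    # $bad is intentionally unquoted so each name gets its own line.
    printf 'unquoted hash: %s\n' $bad
    return 1
  fi
  return 0
}
```

Run it as check_hash_quoting .env before docker compose up -d caddy; a non-zero exit means at least one hash would be mangled by interpolation.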
To change a username, or to add basic auth in front of Grafana too, edit
caddy/Caddyfile.
CI: Forgejo Actions runner
Two services power Forgejo Actions:
- forgejo-runner-dind: an isolated docker:28-dind engine where CI jobs actually run. It's privileged (a dind requirement) and reachable only on the dedicated ci network. It publishes no host port and is not on the office network, so untrusted CI workloads can't touch the monitoring stack.
- forgejo-runner: registers with Forgejo once, then runs the daemon that picks up jobs and executes them via the dind engine.
One-time setup:
- Make sure Actions is enabled on the Forgejo side (it is by default in recent Forgejo; [actions] ENABLED = true in app.ini).
- Get a runner registration token in Forgejo: Site Administration → Actions → Runners → Create new runner (or the per-org / per-repo Runners page).
- Put it in .env on the server (tokens are alphanumeric, so no quoting is needed): FORGEJO_RUNNER_REGISTRATION_TOKEN=...
- Run docker compose up -d forgejo-runner-dind forgejo-runner.
On first start the runner sees no /data/.runner file, registers using the
token, and writes the registration into the forgejo_runner_data volume
(/srv/office/data/forgejo-runner). That persists, so the token is only
needed once — later restarts skip straight to the daemon. Confirm the runner
shows up under the Forgejo Runners page as idle.
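The register-once behaviour described above can be sketched roughly as follows. This is an illustration only: the real entrypoint lives in docker-compose.yml and is not reproduced here, start_runner and RUNNER_FILE are hypothetical names, and the register flags are written as I understand the forgejo-runner CLI, so check them against forgejo-runner register --help for your runner version.

```shell
# Path the runner checks to decide whether it is already registered
# (assumed from the /data/.runner path mentioned above).
RUNNER_FILE="${RUNNER_FILE:-/data/.runner}"

# Register only when no registration file exists, then start the daemon.
# start_runner is a hypothetical wrapper, not the repo's actual command.
start_runner() {
  if [ ! -f "$RUNNER_FILE" ]; then
    forgejo-runner register --no-interactive \
      --instance "$FORGEJO_INSTANCE_URL" \
      --token "$FORGEJO_RUNNER_REGISTRATION_TOKEN" \
      --name "$FORGEJO_RUNNER_NAME" \
      --labels "$FORGEJO_RUNNER_LABELS" || return 1
  fi
  # Later restarts land here directly because the file persists in the volume.
  forgejo-runner daemon
}
```

Because the check is on the file rather than on the token, removing the token from .env after the first successful start should not affect later restarts.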
Optional overrides (defaults shown) — set in .env if you want:
| Variable | Default | Purpose |
|---|---|---|
| FORGEJO_INSTANCE_URL | https://forge.javiersegura.net | Forgejo URL the runner registers/polls against |
| FORGEJO_RUNNER_NAME | office-runner | Runner name shown in Forgejo |
| FORGEJO_RUNNER_LABELS | ubuntu-latest:docker://node:22-bookworm,… | Maps runs-on: labels to job images |
Labels are label:docker://image pairs: a workflow's runs-on: ubuntu-latest
runs inside node:22-bookworm. node images are used because most actions need
Node. For closer GitHub compatibility, point a label at a catthehacker image,
e.g. ubuntu-latest:docker://ghcr.io/catthehacker/ubuntu:act-22.04.
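To see what a labels value actually maps to, the comma-separated pairs can be split in plain bash. A minimal sketch: parse_runner_labels is a hypothetical helper, and the ubuntu-22.04 label in the usage line is made up for illustration.

```shell
# Split a FORGEJO_RUNNER_LABELS-style value into "label -> image" lines.
# (parse_runner_labels is a hypothetical helper for inspection only.)
parse_runner_labels() {
  local labels="$1" pair label image
  local IFS=','
  for pair in $labels; do
    # Everything before the first ':' is the runs-on label.
    label="${pair%%:*}"
    # The remainder is the target; drop the docker:// scheme if present.
    image="${pair#*:}"
    image="${image#docker://}"
    printf '%s -> %s\n' "$label" "$image"
  done
}
```

For example, parse_runner_labels 'ubuntu-latest:docker://node:22-bookworm,ubuntu-22.04:docker://ghcr.io/catthehacker/ubuntu:act-22.04' prints one line per label, which makes it easier to spot a typo before restarting the runner.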
Why the public URL? The runner registers against
https://forge.javiersegura.net (not the internal forgejo:3000) because the
job containers run inside dind — isolated from this stack's Docker networks —
and must clone repos from an address that's reachable from there. The public
URL works everywhere with internet access. This assumes the host allows NAT
loopback (a container connecting to the host's public IP reaching Caddy),
which is the common case. If jobs can't reach Forgejo, either enable hairpin
NAT on the host, or register against the internal URL and attach the runner +
dind to the office network instead (then forgejo:3000 resolves, but only
if dind jobs are also given a route to it).
Useful operations
# Reload Prometheus config without restart (web.enable-lifecycle is on)
curl -X POST http://localhost:9090/-/reload
# Validate Prometheus rules / config before reloading
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
# Check live alert rule state
open http://localhost:9090/alerts
# Tail a service
docker compose logs -f alloy
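The validate and reload commands above can be chained so a broken config never gets reloaded. A minimal bash sketch, assuming the container is named prometheus and the port is 9090 as in the commands above; safe_reload_prometheus is a hypothetical name.

```shell
# Validate the Prometheus config inside the container and reload only if
# validation passes. (safe_reload_prometheus is a hypothetical wrapper.)
safe_reload_prometheus() {
  if docker exec prometheus promtool check config /etc/prometheus/prometheus.yml; then
    # -f makes curl fail on HTTP errors so "reloaded" is printed only on success.
    curl -fsS -X POST http://localhost:9090/-/reload && echo "reloaded"
  else
    echo "config invalid; not reloading" >&2
    return 1
  fi
}
```

Prometheus keeps running with the old config when a reload fails, but checking first gives a readable promtool error instead of a terse HTTP response.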
Hooking up alerts
Alertmanager is wired to Prometheus but sends nowhere until you fill in a
receiver. Edit alertmanager/alertmanager.yml, uncomment one of the
email / Slack / webhook blocks, set real values, then:
docker compose restart alertmanager
Alertmanager does not read ${ENV_VARS} from its config, so put real
secrets in the file directly and keep it out of version control.
What's collected, at a glance
| | Metrics | Logs |
|---|---|---|
| Host | Resource usage via node-exporter (auto). App /metrics: add a scrape job using host.docker.internal | /var/log/*.log and the systemd journal (auto) |
| Containers | Resource usage via cAdvisor (auto). App /metrics: put the app on the monitoring network + add a scrape job | All container stdout/stderr via Alloy (auto) |
"Resource usage" = CPU/RAM/disk/network. "App /metrics" = your own application's
Prometheus endpoint (request counts, queue depth, etc.) — that always needs an
explicit scrape target.
Adding your own application metrics
A host service exposing metrics on a host port (e.g. a Go service on
:2112). Prometheus runs in a container, so it reaches the host through
host.docker.internal (the compose file maps this to the host gateway, which
is required on Linux):
- job_name: 'my-host-app'
  static_configs:
    - targets: ['host.docker.internal:2112']
A container service. Attach it to the monitoring network so Prometheus
can resolve it by container name, then scrape that name:
# in that app's compose file
services:
  my-app:
    networks: [monitoring]
networks:
  monitoring:
    external: true
    name: monitoring

# in prometheus/prometheus.yml
- job_name: 'my-app'
  static_configs:
    - targets: ['my-app:8000']
Either way, reload Prometheus afterwards:
curl -X POST http://localhost:9090/-/reload
Confirm the target shows UP under Prometheus → Status → Targets.
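The same check can be done from the shell through the Prometheus HTTP query API, which is handy in a deploy script. A minimal bash sketch, assuming Prometheus is reachable on localhost:9090 and using the my-app job name from the example above; target_is_up is a hypothetical helper and it matches the JSON with grep rather than a real JSON parser.

```shell
# Return 0 if at least one target of the given job currently reports up == 1.
# (target_is_up is a hypothetical helper; it greps the compact JSON that the
# query API returns instead of parsing it properly.)
target_is_up() {
  local job="$1"
  curl -fsS --get --data-urlencode "query=up{job=\"${job}\"}" \
    http://localhost:9090/api/v1/query \
    | grep -Eq '"value":\[[0-9.]+,"1"\]'
}
```

Usage: target_is_up my-app && echo "my-app is UP". A new target only shows up after the first scrape, so allow one scrape interval after the reload before checking.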
Monitoring endpoints (up / SSL expiry / TTFB)
The blackbox-http job probes external URLs and emits probe_success,
probe_http_status_code, probe_ssl_earliest_cert_expiry, and
probe_http_duration_seconds. Matching alerts ship in rules/alerts.yml
(EndpointDown, HttpStatusAbnormal, SslCertExpiring*, HighTtfb).
To add or remove a monitored endpoint, edit
prometheus/targets/blackbox-endpoints.yml — that's it. Prometheus watches
the file and reloads it automatically; no restart, no -/reload. You can also
add more files named blackbox-*.yml (e.g. one per environment) and they're
all picked up by the glob.
# prometheus/targets/blackbox-endpoints.yml
- targets:
    - https://myapp.example.com/health
  labels:
    env: production
Note the distinction from static_configs: editing a target file is
live-reloaded, but editing prometheus.yml itself (job definitions,
relabeling) still needs curl -X POST http://localhost:9090/-/reload.
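Since adding an endpoint is just a file edit, it can be scripted. A minimal bash sketch that appends a target group in the same shape as the example above; add_blackbox_target is a hypothetical helper, and it assumes one URL per group with a single env label.

```shell
# Append a target group to a file_sd targets file, skipping URLs that are
# already listed. (add_blackbox_target is a hypothetical helper.)
add_blackbox_target() {
  local file="$1" url="$2" env="$3"
  # Whole-line, fixed-string match on the list item, so prefixes don't collide.
  if grep -qxF -- "    - $url" "$file" 2>/dev/null; then
    return 0
  fi
  printf '%s\n' \
    "- targets:" \
    "    - $url" \
    "  labels:" \
    "    env: $env" >> "$file"
}
```

Usage: add_blackbox_target prometheus/targets/blackbox-endpoints.yml https://myapp.example.com/health production. No reload call is needed afterwards, because the targets file is watched.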
Security notes (do before exposing to the internet)
- Change the Grafana admin password in .env. Don't ship the default.
- Don't publish 9090 / 9093 / 3100 / 8080 / 9100 / 12345 to the public internet: they have no auth. They're already bound to 127.0.0.1 in docker-compose.yml, so they're reachable only on the host (use an SSH tunnel for direct access). Public access goes exclusively through the Caddy reverse proxy with TLS (see Public access above), and Prometheus/Alertmanager sit behind basic auth there. Only Caddy publishes to 0.0.0.0 (80/443).
- cAdvisor runs privileged: true. That's its normal requirement, but it's another reason not to expose its port.
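For the SSH tunnel mentioned above, a local port forward is enough. A minimal bash sketch; open_tunnel is a hypothetical wrapper, and the SSH target in the usage line is a placeholder, not a real host from this setup.

```shell
# Forward a localhost-bound service port on the server to the same port on
# your machine. Runs in the foreground; stop it with Ctrl-C.
# (open_tunnel is a hypothetical wrapper around plain ssh.)
open_tunnel() {
  local ssh_target="$1" port="$2"
  # -N: run no remote command; -L: local_port:remote_host:remote_port
  ssh -N -L "${port}:127.0.0.1:${port}" "$ssh_target"
}
```

Usage: open_tunnel admin@203.0.113.10 9090, then browse to http://localhost:9090 on your own machine. The service stays bound to 127.0.0.1 on the server the whole time.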
Retention & disk
- Metrics: 30 days (--storage.tsdb.retention.time=30d in compose).
- Logs: 30 days (retention_period: 720h in loki-config.yml, enforced by the compactor). Loki on a single server stores chunks on the local filesystem volume, so watch disk usage and adjust retention to taste.
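The two retention settings use different units (days for Prometheus, hours for Loki), so it helps to convert when changing them together. A minimal bash sketch with two hypothetical helpers, days_to_hours and disk_used_pct; the path passed to the second one is whatever directory your data volumes live under.

```shell
# Convert a retention period in days to the hour-based form used for
# retention_period in loki-config.yml. (Hypothetical helper.)
days_to_hours() {
  echo "$(( $1 * 24 ))h"
}

# Print the used percentage (integer, no % sign) of the filesystem that
# holds the given path, using POSIX df output. (Hypothetical helper.)
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}
```

For example, days_to_hours 30 prints 720h, matching the value quoted above, and disk_used_pct on the data directory gives a number you can compare against a threshold from cron.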
Version pinning
The image tags in docker-compose.yml were current stable as of June 2026.
Before deploying, verify each against its releases page and bump as needed —
and keep them pinned (never :latest):
- Prometheus / Alertmanager / node-exporter → github.com/prometheus
- Loki / Alloy / Grafana → github.com/grafana
- cAdvisor → github.com/google/cadvisor
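The never-:latest rule can be checked mechanically before a deploy. A minimal bash sketch, assuming images are declared with the usual one-line image: key in docker-compose.yml; find_unpinned_images is a hypothetical helper and it does not resolve ${VAR} interpolation in tags.

```shell
# List image references in a compose file that have no tag or use :latest.
# (find_unpinned_images is a hypothetical lint helper.)
find_unpinned_images() {
  local compose_file="${1:-docker-compose.yml}"
  # 1) keep only the value of each "image:" key
  # 2) drop trailing comments, quotes, and trailing whitespace
  # 3) print refs ending in :latest, or with no ':' after the last '/'
  #    (a registry port such as host:5000/app does not count as a tag)
  sed -n 's/^[[:space:]]*image:[[:space:]]*//p' "$compose_file" \
    | sed -e 's/[[:space:]]*#.*$//' -e "s/[\"']//g" -e 's/[[:space:]]*$//' \
    | awk '$0 ~ ":latest$" || $0 !~ ":[^/]+$"'
}
```

Empty output means every image is pinned to a tag or digest; anything printed is worth fixing before the next docker compose up -d.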