# Forgejo Runner Resource Collector

A lightweight metrics collector for CI/CD workloads in shared PID namespace environments. It reads `/proc` to collect CPU and memory metrics, groups them by container/cgroup, and pushes run summaries to a receiver service for storage and querying.
## Architecture
The system has two independent binaries:
```
┌─────────────────────────────────────────────┐       ┌──────────────────────────┐
│ CI/CD Pod (shared PID namespace)            │       │ Receiver Service         │
│                                             │       │                          │
│ ┌───────────┐  ┌────────┐  ┌───────────┐    │       │ POST /api/v1/metrics     │
│ │ collector │  │ runner │  │ sidecar   │    │       │   │                      │
│ │           │  │        │  │           │    │       │   ▼                      │
│ │ reads     │  │        │  │           │    │ push  │ ┌────────────┐           │
│ │ /proc for │  │        │  │           │    │──────▶│ │ SQLite     │           │
│ │ all PIDs  │  │        │  │           │    │       │ └────────────┘           │
│ └───────────┘  └────────┘  └───────────┘    │       │   │                      │
│                                             │       │   ▼                      │
└─────────────────────────────────────────────┘       │ GET /api/v1/metrics/...  │
                                                      └──────────────────────────┘
```
### Collector
Runs as a sidecar alongside CI workloads. On a configurable interval, it reads `/proc` to collect CPU and memory usage for all visible processes, groups them by container using cgroup paths, and accumulates samples. On shutdown (SIGINT/SIGTERM), it computes run-level statistics (peak, avg, percentiles) and pushes a single summary to the receiver.
```sh
./collector --interval=2s --top=10 --push-endpoint=http://receiver:8080/api/v1/metrics
```
Flags: `--interval`, `--proc-path`, `--log-level`, `--log-format`, `--top`, `--push-endpoint`, `--push-token`
Environment variables:
| Variable | Description | Example |
|---|---|---|
| `GITHUB_REPOSITORY_OWNER` | Organization name | `my-org` |
| `GITHUB_REPOSITORY` | Full repository path | `my-org/my-repo` |
| `GITHUB_WORKFLOW` | Workflow filename | `ci.yml` |
| `GITHUB_JOB` | Job name | `build` |
| `GITHUB_RUN_ID` | Unique run identifier | `run-123` |
| `COLLECTOR_PUSH_TOKEN` | Bearer token for push endpoint auth | — |
| `CGROUP_PROCESS_MAP` | JSON: process name → container name | `{"node":"runner"}` |
| `CGROUP_LIMITS` | JSON: per-container CPU/memory limits | See below |
`CGROUP_LIMITS` example:

```json
{
  "runner": { "cpu": "2", "memory": "1Gi" },
  "sidecar": { "cpu": "500m", "memory": "256Mi" }
}
```
CPU supports Kubernetes notation (`"2"` = 2 cores, `"500m"` = 0.5 cores). Memory supports `Ki`, `Mi`, `Gi`, `Ti` (binary) or `K`, `M`, `G`, `T` (decimal).
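A minimal sketch of this parsing in Go, assuming hypothetical `parseCPU`/`parseMemory` helpers (the real logic lives in `internal/cgroup` and may differ in detail):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPU converts Kubernetes CPU notation to cores:
// "2" -> 2.0 cores, "500m" -> 0.5 cores.
func parseCPU(s string) (float64, error) {
	if strings.HasSuffix(s, "m") {
		milli, err := strconv.ParseFloat(strings.TrimSuffix(s, "m"), 64)
		if err != nil {
			return 0, err
		}
		return milli / 1000, nil
	}
	return strconv.ParseFloat(s, 64)
}

// parseMemory converts a quantity like "1Gi" or "256M" to bytes, using
// binary multipliers for Ki/Mi/Gi/Ti and decimal for K/M/G/T.
func parseMemory(s string) (uint64, error) {
	multipliers := map[string]uint64{
		"Ki": 1 << 10, "Mi": 1 << 20, "Gi": 1 << 30, "Ti": 1 << 40,
		"K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12,
	}
	// Check two-letter suffixes before one-letter ones.
	for _, suffix := range []string{"Ki", "Mi", "Gi", "Ti", "K", "M", "G", "T"} {
		if strings.HasSuffix(s, suffix) {
			n, err := strconv.ParseUint(strings.TrimSuffix(s, suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * multipliers[suffix], nil
		}
	}
	return strconv.ParseUint(s, 10, 64) // bare byte count
}

func main() {
	cores, _ := parseCPU("500m")
	bytes, _ := parseMemory("1Gi")
	fmt.Println(cores, bytes) // 0.5 1073741824
}
```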
## Receiver
HTTP service that stores metric summaries in SQLite (via GORM) and exposes a query API.
```sh
./receiver --addr=:8080 --db=metrics.db --read-token=my-secret-token --hmac-key=my-hmac-key
```
Flags:

| Flag | Environment Variable | Description | Default |
|---|---|---|---|
| `--addr` | — | HTTP listen address | `:8080` |
| `--db` | — | SQLite database path | `metrics.db` |
| `--read-token` | `RECEIVER_READ_TOKEN` | Pre-shared token for read/admin endpoints (required) | — |
| `--hmac-key` | `RECEIVER_HMAC_KEY` | Secret key for push token generation/validation (required) | — |
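For orientation, a rough sketch of how summaries might be persisted with GORM over SQLite as described above; the `Metric` model shape here is an assumption, not the project's actual schema:

```go
package main

import (
	"time"

	"gorm.io/driver/sqlite"
	"gorm.io/gorm"
)

// Metric mirrors the shape of the API response shown further below:
// identifying scope columns plus the raw summary payload.
type Metric struct {
	ID           uint   `gorm:"primaryKey"`
	Organization string `gorm:"index"`
	Repository   string `gorm:"index"`
	Workflow     string
	Job          string
	RunID        string
	ReceivedAt   time.Time
	Payload      []byte // JSON summary as pushed by the collector
}

func main() {
	db, err := gorm.Open(sqlite.Open("metrics.db"), &gorm.Config{})
	if err != nil {
		panic(err)
	}
	// Create the table on startup; handlers then Create on POST and
	// Find by (org, repo, workflow, job) on GET.
	if err := db.AutoMigrate(&Metric{}); err != nil {
		panic(err)
	}

	var rows []Metric
	db.Where(&Metric{Organization: "my-org", Repository: "my-org/my-repo"}).
		Find(&rows)
}
```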
Endpoints:

- `POST /api/v1/metrics` — receive and store a metric summary (requires scoped push token)
- `POST /api/v1/token` — generate a scoped push token (requires read token auth)
- `GET /api/v1/metrics/repo/{org}/{repo}/{workflow}/{job}` — query stored metrics (requires read token auth)
Authentication:

All metrics endpoints require authentication:

- The GET endpoint requires a Bearer token matching `--read-token`
- The POST metrics endpoint requires a scoped push token (generated via `POST /api/v1/token`)
- The token endpoint itself requires the read token
Token flow:

```sh
# 1. Admin generates a scoped push token using the read token
curl -X POST http://localhost:8080/api/v1/token \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"organization":"my-org","repository":"my-repo","workflow":"ci.yml","job":"build"}'
# → {"token":"<hex-encoded HMAC>"}

# 2. Collector uses the scoped token to push metrics
./collector --push-endpoint=http://localhost:8080/api/v1/metrics \
  --push-token=<token-from-step-1>

# 3. Query metrics with the read token
curl -H "Authorization: Bearer my-secret-token" http://localhost:8080/api/v1/metrics/repo/my-org/my-repo/ci.yml/build #gitleaks:allow
```
Push tokens are HMAC-SHA256 digests derived from `--hmac-key` and the scope (org/repo/workflow/job). They are stateless — no database storage is needed. The HMAC key is separate from the read token so that a leaked push token does not expose the admin credential.
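A condensed sketch of that derivation; the HMAC-SHA256-over-scope construction is from the design above, but the exact scope encoding here is an assumption:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// pushToken derives a scoped token from the HMAC key and the run scope.
// Because the digest is recomputable, validation needs no database:
// the receiver simply re-derives it and compares.
func pushToken(hmacKey, org, repo, workflow, job string) string {
	mac := hmac.New(sha256.New, []byte(hmacKey))
	// Assumed scope encoding; the real receiver may serialize differently.
	fmt.Fprintf(mac, "%s/%s/%s/%s", org, repo, workflow, job)
	return hex.EncodeToString(mac.Sum(nil))
}

// validToken checks a presented token in constant time.
func validToken(hmacKey, org, repo, workflow, job, presented string) bool {
	expected := pushToken(hmacKey, org, repo, workflow, job)
	return hmac.Equal([]byte(expected), []byte(presented))
}

func main() {
	tok := pushToken("my-hmac-key", "my-org", "my-repo", "ci.yml", "build")
	fmt.Println(tok, validToken("my-hmac-key", "my-org", "my-repo", "ci.yml", "build", tok))
}
```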
## How Metrics Are Collected
The collector reads `/proc/[pid]/stat` for every visible process to get CPU ticks (`utime` + `stime`) and `/proc/[pid]/status` for memory (RSS). It takes two samples per interval and computes the delta to derive CPU usage rates.
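In sketch form, assuming a 100 Hz tick rate and the standard proc(5) field layout (the real parser lives in `internal/proc`):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// cpuTicks returns utime+stime (in clock ticks) for a PID from
// /proc/[pid]/stat. The comm field is wrapped in parentheses and may
// contain spaces, so fields are split only after the last ')'.
func cpuTicks(pid int) (uint64, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, err
	}
	s := string(data)
	fields := strings.Fields(s[strings.LastIndexByte(s, ')')+1:])
	utime, _ := strconv.ParseUint(fields[11], 10, 64) // field 14 in proc(5)
	stime, _ := strconv.ParseUint(fields[12], 10, 64) // field 15 in proc(5)
	return utime + stime, nil
}

func main() {
	const clkTck = 100.0 // ticks/second; strictly sysconf(_SC_CLK_TCK)
	const interval = 2 * time.Second

	before, _ := cpuTicks(os.Getpid())
	time.Sleep(interval)
	after, _ := cpuTicks(os.Getpid())

	// Delta between the two samples, converted to cores used:
	// ticks / tick-rate gives CPU-seconds; divide by wall time for cores.
	cores := float64(after-before) / clkTck / interval.Seconds()
	fmt.Printf("%.3f cores\n", cores)
}
```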
Processes are grouped into containers by reading `/proc/[pid]/cgroup` and matching cgroup paths against the `CGROUP_PROCESS_MAP`. This is necessary because in shared PID namespace pods, `/proc/stat` only shows host-level aggregates — per-container metrics must be built up from individual process data.
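A sketch of that grouping step; the substring match against the map keys is an assumption about the matching rule, which actually lives in `internal/cgroup` and `internal/metrics`:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// containerFor returns the container name for a PID by scanning its
// cgroup paths for any key of the process→container map.
func containerFor(pid int, processMap map[string]string) (string, bool) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", false
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line looks like "0::/kubepods/.../<container-scope>".
		path := sc.Text()
		for proc, container := range processMap {
			if strings.Contains(path, proc) {
				return container, true
			}
		}
	}
	return "", false
}

func main() {
	// Same shape as CGROUP_PROCESS_MAP: {"node":"runner"}.
	m := map[string]string{"node": "runner"}
	if name, ok := containerFor(os.Getpid(), m); ok {
		fmt.Println("container:", name)
	}
}
```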
Container CPU is reported in cores (not percentage) for direct comparison with Kubernetes resource limits. System-level CPU is reported as a percentage (0-100%).
Over the course of a run, the `summary.Accumulator` tracks every sample and on shutdown computes:
| Stat | Description |
|---|---|
| `peak` | Maximum observed value |
| `p99`, `p95`, `p75`, `p50` | Percentiles across all samples |
| `avg` | Arithmetic mean |
These stats are computed for CPU, memory, and per-container metrics.
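In sketch form, assuming nearest-rank percentiles (the accumulator in `internal/summary` may use a different method):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// stats computes peak, avg, and percentiles over all samples of one run.
func stats(samples []float64) map[string]float64 {
	if len(samples) == 0 {
		return nil
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	sum := 0.0
	for _, v := range sorted {
		sum += v
	}
	// Nearest-rank percentile over the sorted samples.
	pct := func(p float64) float64 {
		rank := int(math.Ceil(p / 100 * float64(len(sorted))))
		return sorted[rank-1]
	}
	return map[string]float64{
		"peak": sorted[len(sorted)-1],
		"avg":  sum / float64(len(sorted)),
		"p99":  pct(99), "p95": pct(95), "p75": pct(75), "p50": pct(50),
	}
}

func main() {
	// e.g. per-interval cpu_cores samples for one container
	fmt.Println(stats([]float64{0.2, 1.8, 2.0, 1.5, 0.9}))
}
```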
## API Response
```
GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build
```

```json
[
  {
    "id": 1,
    "organization": "my-org",
    "repository": "my-org/my-repo",
    "workflow": "ci.yml",
    "job": "build",
    "run_id": "run-123",
    "received_at": "2026-02-06T14:30:23.056Z",
    "payload": {
      "start_time": "2026-02-06T14:30:02.185Z",
      "end_time": "2026-02-06T14:30:22.190Z",
      "duration_seconds": 20.0,
      "sample_count": 11,
      "cpu_total_percent": { "peak": ..., "avg": ..., "p50": ... },
      "mem_used_bytes": { "peak": ..., "avg": ... },
      "containers": [
        {
          "name": "runner",
          "cpu_cores": { "peak": 2.007, "avg": 1.5, "p50": 1.817, "p95": 2.004 },
          "memory_bytes": { "peak": 18567168, "avg": 18567168 }
        }
      ],
      "top_cpu_processes": [ ... ],
      "top_mem_processes": [ ... ]
    }
  }
]
```
CPU metric distinction:

- `cpu_total_percent` — system-wide, 0-100%
- `cpu_cores` (containers) — cores used (e.g. `2.0` = two full cores)
- `peak_cpu_percent` (processes) — per-process, where 100% = 1 core
All memory values are in bytes.
## Running

### Docker Compose
```sh
# Start the receiver (builds image if needed):
docker compose -f test/docker/docker-compose-stress.yaml up -d --build receiver

# Generate a scoped push token for the collector:
PUSH_TOKEN=$(curl -s -X POST http://localhost:9080/api/v1/token \
  -H "Authorization: Bearer dummyreadtoken" \
  -H "Content-Type: application/json" \
  -d '{"organization":"test-org","repository":"test-org/stress-test","workflow":"stress-test-workflow","job":"heavy-workload"}' \
  | jq -r .token)

# Start the collector and stress workloads with the push token:
COLLECTOR_PUSH_TOKEN=$PUSH_TOKEN \
  docker compose -f test/docker/docker-compose-stress.yaml up -d --build collector

# ... wait for data collection ...

# Trigger shutdown summary:
docker compose -f test/docker/docker-compose-stress.yaml stop collector

# Query results with the read token:
curl -H "Authorization: Bearer dummyreadtoken" \
  http://localhost:9080/api/v1/metrics/repo/test-org/test-org%2Fstress-test/stress-test-workflow/heavy-workload
```
### Local
```sh
go build -o collector ./cmd/collector
go build -o receiver ./cmd/receiver

# Start receiver with both keys:
./receiver --addr=:8080 --db=metrics.db \
  --read-token=my-secret-token --hmac-key=my-hmac-key

# Generate a scoped push token:
PUSH_TOKEN=$(curl -s -X POST http://localhost:8080/api/v1/token \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"organization":"my-org","repository":"my-repo","workflow":"ci.yml","job":"build"}' \
  | jq -r .token)

# Run collector with the push token:
./collector --interval=2s --top=10 \
  --push-endpoint=http://localhost:8080/api/v1/metrics \
  --push-token=$PUSH_TOKEN
```
## Internal Packages
| Package | Purpose |
|---|---|
| `internal/proc` | Low-level `/proc` parsing (stat, status, cgroup) |
| `internal/metrics` | Aggregates process metrics from `/proc` into system/container views |
| `internal/cgroup` | Parses `CGROUP_PROCESS_MAP` and `CGROUP_LIMITS` env vars |
| `internal/collector` | Orchestrates the collection loop and shutdown |
| `internal/summary` | Accumulates samples, computes stats, pushes to receiver |
| `internal/receiver` | HTTP handlers and SQLite store |
| `internal/output` | Metrics output formatting (JSON/text) |
## Background
Technical reference on the Linux primitives this project builds on:

- Identifying process cgroups by PID — how to read `/proc/<PID>/cgroup` to determine which container a process belongs to
- `/proc/stat` behavior in containers — why `/proc/stat` shows host-level data in containers, and how to aggregate per-process stats from `/proc/[pid]/stat` instead, including CPU tick conversion and cgroup limit handling