# Forgejo Runner Resource Collector

A lightweight metrics collector for CI/CD workloads in shared PID namespace environments. It reads `/proc` to collect CPU and memory metrics, groups them by container/cgroup, and pushes run summaries to a receiver service for storage and querying.

## Architecture

The system has two independent binaries:
```
┌─────────────────────────────────────────────┐       ┌──────────────────────────┐
│  CI/CD Pod (shared PID namespace)           │       │  Receiver Service        │
│                                             │       │                          │
│  ┌───────────┐  ┌────────┐  ┌───────────┐   │       │  POST /api/v1/metrics    │
│  │ collector │  │ runner │  │ sidecar   │   │       │        │                 │
│  │           │  │        │  │           │   │       │        ▼                 │
│  │ reads     │  │        │  │           │   │ push  │  ┌────────────┐          │
│  │ /proc for │  │        │  │           │   │──────▶│  │ SQLite     │          │
│  │ all PIDs  │  │        │  │           │   │       │  └────────────┘          │
│  └───────────┘  └────────┘  └───────────┘   │       │        │                 │
│                                             │       │        ▼                 │
└─────────────────────────────────────────────┘       │  GET /api/v1/metrics/... │
                                                      └──────────────────────────┘
```
### Collector

Runs as a sidecar alongside CI workloads. On a configurable interval, it reads `/proc` to collect CPU and memory usage for all visible processes, groups them by container using cgroup paths, and accumulates samples. On shutdown (SIGINT/SIGTERM), it computes run-level statistics (peak, avg, percentiles) and pushes a single summary to the receiver.

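
In Go terms, that lifecycle looks roughly like the sketch below. This is a minimal sketch, not the project's internal API: `takeSample` is a hypothetical stand-in for one `/proc` scan, and flag parsing and the HTTP push are elided.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// takeSample is a hypothetical stand-in for one /proc scan.
func takeSample() float64 { return 0 }

func main() {
	// Collect until SIGINT or SIGTERM cancels the context.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	ticker := time.NewTicker(2 * time.Second) // --interval
	defer ticker.Stop()

	var samples []float64 // stand-in for the run-level accumulator
	for {
		select {
		case <-ticker.C:
			samples = append(samples, takeSample())
		case <-ctx.Done():
			// Compute peak/avg/percentiles here and POST the summary
			// to --push-endpoint before exiting.
			fmt.Printf("shutting down, summarizing %d samples\n", len(samples))
			return
		}
	}
}
```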
```bash
./collector --interval=2s --top=10 --push-endpoint=http://receiver:8080/api/v1/metrics
```

**Flags:** `--interval`, `--proc-path`, `--log-level`, `--log-format`, `--top`, `--push-endpoint`, `--push-token`

**Environment variables:**

| Variable                  | Description                           | Example             |
| ------------------------- | ------------------------------------- | ------------------- |
| `GITHUB_REPOSITORY_OWNER` | Organization name                     | `my-org`            |
| `GITHUB_REPOSITORY`       | Full repository path                  | `my-org/my-repo`    |
| `GITHUB_WORKFLOW`         | Workflow filename                     | `ci.yml`            |
| `GITHUB_JOB`              | Job name                              | `build`             |
| `GITHUB_RUN_ID`           | Unique run identifier                 | `run-123`           |
| `COLLECTOR_PUSH_TOKEN`    | Bearer token for push endpoint auth   | —                   |
| `CGROUP_PROCESS_MAP`      | JSON: process name → container name   | `{"node":"runner"}` |
| `CGROUP_LIMITS`           | JSON: per-container CPU/memory limits | See below           |
**CGROUP_LIMITS example:**

```json
{
  "runner": { "cpu": "2", "memory": "1Gi" },
  "sidecar": { "cpu": "500m", "memory": "256Mi" }
}
```

CPU supports Kubernetes notation (`"2"` = 2 cores, `"500m"` = 0.5 cores). Memory supports `Ki`, `Mi`, `Gi`, `Ti` (binary) or `K`, `M`, `G`, `T` (decimal).

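
A sketch of the parsing these rules imply, assuming only the semantics above (`parseCPU` and `parseMemory` are illustrative helpers, not the project's API; plain-number memory values treated as bytes is an extra assumption):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPU converts Kubernetes-style CPU notation to cores:
// "2" -> 2.0, "500m" -> 0.5.
func parseCPU(s string) (float64, error) {
	if strings.HasSuffix(s, "m") {
		milli, err := strconv.ParseFloat(strings.TrimSuffix(s, "m"), 64)
		if err != nil {
			return 0, err
		}
		return milli / 1000, nil
	}
	return strconv.ParseFloat(s, 64)
}

// parseMemory converts a memory quantity to bytes. Binary suffixes
// (Ki, Mi, ...) use powers of 1024; decimal ones (K, M, ...) powers of 1000.
func parseMemory(s string) (int64, error) {
	multipliers := map[string]int64{
		"Ki": 1 << 10, "Mi": 1 << 20, "Gi": 1 << 30, "Ti": 1 << 40,
		"K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12,
	}
	// Check binary suffixes first so "Ki" is not mistaken for "K".
	for _, suffix := range []string{"Ki", "Mi", "Gi", "Ti", "K", "M", "G", "T"} {
		if strings.HasSuffix(s, suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(s, suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * multipliers[suffix], nil
		}
	}
	return strconv.ParseInt(s, 10, 64) // assumption: bare numbers are bytes
}

func main() {
	cores, _ := parseCPU("500m")
	bytes, _ := parseMemory("1Gi")
	fmt.Println(cores, bytes) // 0.5 1073741824
}
```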
### Receiver

HTTP service that stores metric summaries in SQLite (via GORM) and exposes a query API.
```bash
./receiver --addr=:8080 --db=metrics.db --read-token=my-secret-token --hmac-key=my-hmac-key
```

**Flags:**

| Flag           | Environment Variable  | Description                                                 | Default      |
| -------------- | --------------------- | ----------------------------------------------------------- | ------------ |
| `--addr`       | —                     | HTTP listen address                                         | `:8080`      |
| `--db`         | —                     | SQLite database path                                        | `metrics.db` |
| `--read-token` | `RECEIVER_READ_TOKEN` | Pre-shared token for read/admin endpoints (required)        | —            |
| `--hmac-key`   | `RECEIVER_HMAC_KEY`   | Secret key for push token generation/validation (required)  | —            |

**Endpoints:**

- `POST /api/v1/metrics` — receive and store a metric summary (requires scoped push token)
- `POST /api/v1/token` — generate a scoped push token (requires read token auth)
- `GET /api/v1/metrics/repo/{org}/{repo}/{workflow}/{job}` — query stored metrics (requires read token auth)

**Authentication:**

All endpoints require authentication:

- The GET endpoint requires a Bearer token matching the read token
- The POST metrics endpoint requires a scoped push token (generated via `POST /api/v1/token`)
- The token endpoint itself requires the read token

**Token flow:**

```bash
# 1. Admin generates a scoped push token using the read token
curl -X POST http://localhost:8080/api/v1/token \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"organization":"my-org","repository":"my-repo","workflow":"ci.yml","job":"build"}'
# → {"token":"<hex-encoded HMAC>"}

# 2. Collector uses the scoped token to push metrics
./collector --push-endpoint=http://localhost:8080/api/v1/metrics \
  --push-token=<token-from-step-1>

# 3. Query metrics with the read token
curl -H "Authorization: Bearer my-secret-token" http://localhost:8080/api/v1/metrics/repo/my-org/my-repo/ci.yml/build #gitleaks:allow
```
Push tokens are HMAC-SHA256 digests derived from `--hmac-key` and the scope (org/repo/workflow/job). They are stateless — no database storage is needed. The HMAC key is separate from the read token so that compromising a push token does not expose the admin credential.

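
A sketch of the derivation (the slash-joined canonical message here is an assumption; the receiver defines the actual encoding of the scope):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// pushToken derives a stateless, scoped push token. The canonical
// message layout is illustrative only.
func pushToken(hmacKey, org, repo, workflow, job string) string {
	mac := hmac.New(sha256.New, []byte(hmacKey))
	fmt.Fprintf(mac, "%s/%s/%s/%s", org, repo, workflow, job)
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	token := pushToken("my-hmac-key", "my-org", "my-repo", "ci.yml", "build")
	fmt.Println(token)
	// Validation recomputes the digest for the claimed scope and compares
	// in constant time: hmac.Equal([]byte(presented), []byte(expected)).
}
```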
## How Metrics Are Collected

The collector reads `/proc/[pid]/stat` for every visible process to get CPU ticks (`utime` + `stime`) and `/proc/[pid]/status` for memory (RSS). It takes two samples per interval and computes the delta to derive CPU usage rates.

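
A minimal sketch of that parsing, as standalone helpers rather than the `internal/proc` API (field positions follow `proc(5)`; the `comm` field can contain spaces, so `stat` fields are counted after the last `)`):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuTicks returns utime+stime (in clock ticks) for a PID from /proc/[pid]/stat.
func cpuTicks(pid int) (uint64, error) {
	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, err
	}
	s := string(b)
	// Fields after ") ": rest[0] is field 3 (state), so utime (field 14)
	// is rest[11] and stime (field 15) is rest[12].
	rest := strings.Fields(s[strings.LastIndexByte(s, ')')+2:])
	utime, _ := strconv.ParseUint(rest[11], 10, 64)
	stime, _ := strconv.ParseUint(rest[12], 10, 64)
	return utime + stime, nil
}

// rssBytes returns resident set size from the VmRSS line of
// /proc/[pid]/status (reported by the kernel in kB).
func rssBytes(pid int) (uint64, error) {
	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(b), "\n") {
		if strings.HasPrefix(line, "VmRSS:") {
			kb, _ := strconv.ParseUint(strings.Fields(line)[1], 10, 64)
			return kb * 1024, nil
		}
	}
	return 0, nil // kernel threads have no VmRSS line
}

func main() {
	pid := os.Getpid()
	t, _ := cpuTicks(pid)
	r, _ := rssBytes(pid)
	fmt.Println(t, r)
}
```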
Processes are grouped into containers by reading `/proc/[pid]/cgroup` and matching cgroup paths against the `CGROUP_PROCESS_MAP`. This is necessary because in shared PID namespace pods, `/proc/stat` only shows host-level aggregates — per-container metrics must be built up from individual process data.

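
Reading the cgroup path itself is a one-file lookup. A sketch assuming the cgroup v2 single-entry layout (the actual matching against `CGROUP_PROCESS_MAP` lives in `internal/cgroup`):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// cgroupPath returns the cgroup path for a PID. Each line of
// /proc/[pid]/cgroup is "hierarchy-ID:controller-list:path";
// under cgroup v2 there is a single "0::<path>" line.
func cgroupPath(pid int) (string, error) {
	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	for _, line := range strings.Split(strings.TrimSpace(string(b)), "\n") {
		if parts := strings.SplitN(line, ":", 3); len(parts) == 3 {
			return parts[2], nil // first entry; v2 has exactly one
		}
	}
	return "", fmt.Errorf("no cgroup entry for pid %d", pid)
}

func main() {
	path, err := cgroupPath(os.Getpid())
	fmt.Println(path, err)
}
```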
Container CPU is reported in **cores** (not percentage) for direct comparison with Kubernetes resource limits. System-level CPU is reported as a percentage (0-100%).

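
For example, at the common `CLK_TCK` of 100 ticks per second, a container whose processes accumulate 400 additional ticks across a 2-second interval used `400 / 100 / 2 = 2.0` cores.
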
Over the course of a run, the `summary.Accumulator` tracks every sample and on shutdown computes:

| Stat                       | Description                    |
| -------------------------- | ------------------------------ |
| `peak`                     | Maximum observed value         |
| `p99`, `p95`, `p75`, `p50` | Percentiles across all samples |
| `avg`                      | Arithmetic mean                |

These stats are computed for CPU, memory, and per-container metrics.
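
Percentiles can be computed by sorting the accumulated samples. A nearest-rank sketch (the accumulator's exact method, e.g. interpolation, may differ):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile of samples.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	s := append([]float64(nil), samples...) // copy, leave input untouched
	sort.Float64s(s)
	rank := int(math.Ceil(p / 100 * float64(len(s))))
	if rank < 1 {
		rank = 1
	}
	return s[rank-1]
}

func main() {
	cpu := []float64{0.2, 1.9, 2.0, 1.5, 0.4}
	fmt.Println(percentile(cpu, 50), percentile(cpu, 95)) // 1.5 2
}
```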
## API Response

```
GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build
```
```json
[
  {
    "id": 1,
    "organization": "my-org",
    "repository": "my-org/my-repo",
    "workflow": "ci.yml",
    "job": "build",
    "run_id": "run-123",
    "received_at": "2026-02-06T14:30:23.056Z",
    "payload": {
      "start_time": "2026-02-06T14:30:02.185Z",
      "end_time": "2026-02-06T14:30:22.190Z",
      "duration_seconds": 20.0,
      "sample_count": 11,
      "cpu_total_percent": { "peak": ..., "avg": ..., "p50": ... },
      "mem_used_bytes": { "peak": ..., "avg": ... },
      "containers": [
        {
          "name": "runner",
          "cpu_cores": { "peak": 2.007, "avg": 1.5, "p50": 1.817, "p95": 2.004 },
          "memory_bytes": { "peak": 18567168, "avg": 18567168 }
        }
      ],
      "top_cpu_processes": [ ... ],
      "top_mem_processes": [ ... ]
    }
  }
]
```
**CPU metric distinction:**

- `cpu_total_percent` — system-wide, 0-100%
- `cpu_cores` (containers) — cores used (e.g. `2.0` = two full cores)
- `peak_cpu_percent` (processes) — per-process, where 100% = 1 core

All memory values are in **bytes**.
## Running

### Docker Compose

```bash
# Start the receiver (builds image if needed):
docker compose -f test/docker/docker-compose-stress.yaml up -d --build receiver

# Generate a scoped push token for the collector:
PUSH_TOKEN=$(curl -s -X POST http://localhost:9080/api/v1/token \
  -H "Authorization: Bearer dummyreadtoken" \
  -H "Content-Type: application/json" \
  -d '{"organization":"test-org","repository":"test-org/stress-test","workflow":"stress-test-workflow","job":"heavy-workload"}' \
  | jq -r .token)

# Start the collector and stress workloads with the push token:
COLLECTOR_PUSH_TOKEN=$PUSH_TOKEN \
  docker compose -f test/docker/docker-compose-stress.yaml up -d --build collector

# ... Wait for data collection ...

# Trigger shutdown summary:
docker compose -f test/docker/docker-compose-stress.yaml stop collector

# Query results with the read token:
curl -H "Authorization: Bearer dummyreadtoken" \
  http://localhost:9080/api/v1/metrics/repo/test-org/test-org%2Fstress-test/stress-test-workflow/heavy-workload
```
### Local

```bash
go build -o collector ./cmd/collector
go build -o receiver ./cmd/receiver

# Start receiver with both keys:
./receiver --addr=:8080 --db=metrics.db \
  --read-token=my-secret-token --hmac-key=my-hmac-key

# Generate a scoped push token:
PUSH_TOKEN=$(curl -s -X POST http://localhost:8080/api/v1/token \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"organization":"my-org","repository":"my-repo","workflow":"ci.yml","job":"build"}' \
  | jq -r .token)

# Run collector with the push token:
./collector --interval=2s --top=10 \
  --push-endpoint=http://localhost:8080/api/v1/metrics \
  --push-token=$PUSH_TOKEN
```
## Internal Packages

| Package              | Purpose                                                              |
| -------------------- | -------------------------------------------------------------------- |
| `internal/proc`      | Low-level `/proc` parsing (stat, status, cgroup)                     |
| `internal/metrics`   | Aggregates process metrics from `/proc` into system/container views  |
| `internal/cgroup`    | Parses `CGROUP_PROCESS_MAP` and `CGROUP_LIMITS` env vars             |
| `internal/collector` | Orchestrates the collection loop and shutdown                        |
| `internal/summary`   | Accumulates samples, computes stats, pushes to receiver              |
| `internal/receiver`  | HTTP handlers and SQLite store                                       |
| `internal/output`    | Metrics output formatting (JSON/text)                                |
## Background

Technical reference on the Linux primitives this project builds on:

- [Identifying process cgroups by PID](docs/background/identify-process-cgroup-by-pid.md) — how to read `/proc/<PID>/cgroup` to determine which container a process belongs to
- [/proc/stat behavior in containers](docs/background/proc-stat-in-containers.md) — why `/proc/stat` shows host-level data in containers, and how to aggregate per-process stats from `/proc/[pid]/stat` instead, including CPU tick conversion and cgroup limit handling