Forgejo Runner Resource Collector

A lightweight resource metrics collector designed to run alongside CI/CD workloads in shared PID namespace environments. It collects CPU and memory metrics, groups them by container/cgroup, and pushes summaries to a receiver service.

Components

  • Collector: Gathers system and per-process metrics at regular intervals, computes run-level statistics, and pushes a summary on shutdown.
  • Receiver: HTTP service that stores metric summaries in SQLite and provides a query API.

Receiver API

POST /api/v1/metrics

Receives metric summaries from collectors.

GET /api/v1/metrics/repo/{org}/{repo}/{workflow}/{job}

Retrieves all stored metrics for a specific workflow and job.

Example request:

GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build

Example response:

[
  {
    "id": 1,
    "organization": "my-org",
    "repository": "my-org/my-repo",
    "workflow": "ci.yml",
    "job": "build",
    "run_id": "run-123",
    "received_at": "2026-02-06T14:30:23.056Z",
    "payload": {
      "start_time": "2026-02-06T14:30:02.185Z",
      "end_time": "2026-02-06T14:30:22.190Z",
      "duration_seconds": 20.0,
      "sample_count": 11,
      "cpu_total_percent": { ... },
      "mem_used_bytes": { ... },
      "mem_used_percent": { ... },
      "top_cpu_processes": [ ... ],
      "top_mem_processes": [ ... ],
      "containers": [
        {
          "name": "runner",
          "cpu_cores": {
            "peak": 2.007,
            "p99": 2.005,
            "p95": 2.004,
            "p75": 1.997,
            "p50": 1.817,
            "avg": 1.5
          },
          "memory_bytes": {
            "peak": 18567168,
            "p99": 18567168,
            "p95": 18567168,
            "p75": 18567168,
            "p50": 18567168,
            "avg": 18567168
          }
        }
      ]
    }
  }
]

Understanding the Metrics

CPU Metrics

There are two different CPU metric formats in the response:

1. System and Process CPU: Percentage (cpu_total_percent, peak_cpu_percent)

These values represent CPU utilization as a percentage of total available CPU time.

  • cpu_total_percent: Overall system CPU usage (0-100%)
  • peak_cpu_percent (in process lists): Per-process CPU usage where 100% = 1 full CPU core

2. Container CPU: Cores (cpu_cores)

Important: The cpu_cores field in container metrics represents CPU usage in number of cores, not percentage.

Value  Meaning
0.5    Half a CPU core
1.0    One full CPU core
2.0    Two CPU cores
2.5    Two and a half CPU cores

This allows direct comparison with Kubernetes resource limits (e.g., cpu: "2" or cpu: "500m").

Example interpretation:

{
  "name": "runner",
  "cpu_cores": {
    "peak": 2.007,
    "avg": 1.5
  }
}

This means the "runner" container peaked at roughly 2 CPU cores and averaged 1.5 cores over the run.

Memory Metrics

All memory values are in bytes:

  • mem_used_bytes: System memory usage
  • memory_bytes (in containers): Container RSS memory usage
  • peak_mem_rss_bytes (in processes): Process RSS memory

Statistical Fields

Each metric includes percentile statistics across all samples:

Field  Description
peak   Maximum value observed
p99    99th percentile
p95    95th percentile
p75    75th percentile
p50    Median (50th percentile)
avg    Arithmetic mean

Configuration

Collector Environment Variables

Variable                 Description                                    Example
GITHUB_REPOSITORY_OWNER  Organization name                              my-org
GITHUB_REPOSITORY        Full repository path                           my-org/my-repo
GITHUB_WORKFLOW          Workflow filename                              ci.yml
GITHUB_JOB               Job name                                       build
GITHUB_RUN_ID            Unique run identifier                          run-123
CGROUP_PROCESS_MAP       JSON mapping process names to container names  {"node":"runner"}
CGROUP_LIMITS            JSON with CPU/memory limits per container      See below

CGROUP_LIMITS example:

{
  "runner": {"cpu": "2", "memory": "1Gi"},
  "sidecar": {"cpu": "500m", "memory": "256Mi"}
}

CPU values support Kubernetes notation: "2" = 2 cores, "500m" = 0.5 cores.

Memory values support: Ki, Mi, Gi, Ti (binary) or K, M, G, T (decimal).

Receiver Environment Variables

Variable     Description           Default
DB_PATH      SQLite database path  metrics.db
LISTEN_ADDR  HTTP listen address   :8080

Running

Docker Compose (stress test example)

docker compose -f test/docker/docker-compose-stress.yaml up -d
# Wait for metrics collection...
docker compose -f test/docker/docker-compose-stress.yaml stop collector
# Query results
curl http://localhost:9080/api/v1/metrics/repo/test-org/test-org%2Fstress-test/stress-test-workflow/heavy-workload

Local Development

# Build
go build -o collector ./cmd/collector
go build -o receiver ./cmd/receiver

# Run receiver
./receiver --listen=:8080 --db=metrics.db

# Run collector
./collector --interval=2s --top=10 --push-endpoint=http://localhost:8080/api/v1/metrics