Forgejo Runner Sizer

A resource sizer for CI/CD workloads in shared PID namespace environments. The collector reads /proc to gather CPU and memory metrics grouped by container/cgroup, and pushes run summaries to the receiver. The receiver stores metrics and exposes a sizer API that computes right-sized Kubernetes resource requests and limits from historical data.

Architecture

The system has two binaries — a collector and a receiver (which includes the sizer):

┌─────────────────────────────────────────────┐       ┌──────────────────────────┐
│  CI/CD Pod (shared PID namespace)           │       │  Receiver Service        │
│                                             │       │                          │
│  ┌───────────┐  ┌────────┐  ┌───────────┐   │       │  POST /api/v1/metrics    │
│  │ collector │  │ runner │  │ sidecar   │   │       │         │                │
│  │           │  │        │  │           │   │       │         ▼                │
│  │ reads     │  │        │  │           │   │  push │  ┌────────────┐          │
│  │ /proc for │  │        │  │           │   │──────▶│  │  SQLite    │          │
│  │ all PIDs  │  │        │  │           │   │       │  └────────────┘          │
│  └───────────┘  └────────┘  └───────────┘   │       │         │                │
│                                             │       │         ▼                │
└─────────────────────────────────────────────┘       │  GET /api/v1/metrics/... │
                                                      │  GET /api/v1/sizing/...  │
                                                      │       (sizer)            │
                                                      └──────────────────────────┘

Collector

Runs as a sidecar alongside CI workloads. On a configurable interval, it reads /proc to collect CPU and memory for all visible processes, groups them by container using cgroup paths, and accumulates samples. On shutdown (SIGINT/SIGTERM), it computes run-level statistics (peak, avg, percentiles) and pushes a single summary to the receiver.

./collector --interval=2s --top=10 --push-endpoint=http://receiver:8080/api/v1/metrics

Flags: --interval, --proc-path, --log-level, --log-format, --top, --push-endpoint, --push-token

Environment variables:

Variable                  Description                             Example
GITHUB_REPOSITORY_OWNER   Organization name                       my-org
GITHUB_REPOSITORY         Full repository path                    my-org/my-repo
GITHUB_WORKFLOW           Workflow filename                       ci.yml
GITHUB_JOB                Job name                                build
GITHUB_RUN_ID             Unique run identifier                   run-123
COLLECTOR_PUSH_TOKEN      Bearer token for push endpoint auth
CGROUP_PROCESS_MAP        JSON: process name → container name     {"node":"runner"}
CGROUP_LIMITS             JSON: per-container CPU/memory limits   See below

CGROUP_LIMITS example:

{
  "runner": { "cpu": "2", "memory": "1Gi" },
  "sidecar": { "cpu": "500m", "memory": "256Mi" }
}

CPU supports Kubernetes notation ("2" = 2 cores, "500m" = 0.5 cores). Memory supports Ki, Mi, Gi, Ti (binary) or K, M, G, T (decimal).

Receiver (with sizer)

HTTP service that stores metric summaries in SQLite (via GORM), exposes a query API, and provides a sizer endpoint that computes right-sized Kubernetes resource requests and limits from historical run data.

./receiver --addr=:8080 --db=metrics.db --read-token=my-secret-token --hmac-key=my-hmac-key

Flags:

Flag           Environment Variable   Description                                       Default
--addr                                HTTP listen address                               :8080
--db                                  SQLite database path                              metrics.db
--read-token   RECEIVER_READ_TOKEN    Pre-shared token for read/admin endpoints         (required)
--hmac-key     RECEIVER_HMAC_KEY      Secret key for push token generation/validation   (required)

Web UI

The receiver includes a web UI for viewing collected metrics.

  • URL: /ui
  • Authentication: The UI uses the same --read-token as the API. Enter the token in the UI to load metrics.

Endpoints:

  • POST /api/v1/metrics — receive and store a metric summary (requires scoped push token)
  • POST /api/v1/token — generate a scoped push token (requires read token auth)
  • GET /api/v1/metrics/repo/{org}/{repo}/{workflow}/{job} — query stored metrics (requires read token auth)
  • GET /api/v1/debug/metrics — return all metric rows from the database (requires read token auth)
  • GET /api/v1/sizing/repo/{org}/{repo}/{workflow}/{job} — compute container sizes from historical data (requires read token auth)

Authentication:

All metrics endpoints require authentication via --read-token:

  • The GET endpoint requires a Bearer token matching the read token
  • The POST metrics endpoint requires a scoped push token (generated via POST /api/v1/token)
  • The token endpoint itself requires the read token

Token flow:

# 1. Admin generates a scoped push token using the read token
curl -X POST http://localhost:8080/api/v1/token \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"organization":"my-org","repository":"my-repo","workflow":"ci.yml","job":"build"}'
# → {"token":"<hex-encoded HMAC>"}

# 2. Collector uses the scoped token to push metrics
./collector --push-endpoint=http://localhost:8080/api/v1/metrics \
  --push-token=<token-from-step-1>

# 3. Query metrics with the read token
curl -H "Authorization: Bearer my-secret-token" http://localhost:8080/api/v1/metrics/repo/my-org/my-repo/ci.yml/build # gitleaks:allow

# 4. Debug endpoint: dump all stored metrics
curl -H "Authorization: Bearer my-secret-token" http://localhost:8080/api/v1/debug/metrics # gitleaks:allow

Push tokens are HMAC-SHA256 digests derived from --hmac-key and the scope (org/repo/workflow/job). They are stateless — no database storage is needed. The HMAC key is separate from the read token so that compromising a push token does not expose the admin credential.

How Metrics Are Collected

The collector reads /proc/[pid]/stat for every visible process to get CPU ticks (utime + stime) and /proc/[pid]/status for memory (RSS). It takes two samples per interval and computes the delta to derive CPU usage rates.

Processes are grouped into containers by reading /proc/[pid]/cgroup and matching cgroup paths against the CGROUP_PROCESS_MAP. This is necessary because in shared PID namespace pods, /proc/stat only shows host-level aggregates — per-container metrics must be built up from individual process data.

Container CPU is reported in cores (not percentage) for direct comparison with Kubernetes resource limits. System-level CPU is reported as a percentage (0-100%).

Over the course of a run, the summary.Accumulator tracks every sample and on shutdown computes:

Stat                 Description
peak                 Maximum observed value
p99, p95, p75, p50   Percentiles across all samples
avg                  Arithmetic mean

These stats are computed for CPU, memory, and per-container metrics.

API Response

GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build
[
  {
    "id": 1,
    "organization": "my-org",
    "repository": "my-org/my-repo",
    "workflow": "ci.yml",
    "job": "build",
    "run_id": "run-123",
    "received_at": "2026-02-06T14:30:23.056Z",
    "payload": {
      "start_time": "2026-02-06T14:30:02.185Z",
      "end_time": "2026-02-06T14:30:22.190Z",
      "duration_seconds": 20.0,
      "sample_count": 11,
      "cpu_total_percent": { "peak": ..., "avg": ..., "p50": ... },
      "mem_used_bytes": { "peak": ..., "avg": ... },
      "containers": [
        {
          "name": "runner",
          "cpu_cores": { "peak": 2.007, "avg": 1.5, "p50": 1.817, "p95": 2.004 },
          "memory_bytes": { "peak": 18567168, "avg": 18567168 }
        }
      ],
      "top_cpu_processes": [ ... ],
      "top_mem_processes": [ ... ]
    }
  }
]

CPU metric distinction:

  • cpu_total_percent — system-wide, 0-100%
  • cpu_cores (containers) — cores used (e.g. 2.0 = two full cores)
  • peak_cpu_percent (processes) — per-process, where 100% = 1 core

All memory values are in bytes.

How Sizing Works

The sizer computes Kubernetes resource requests and limits by aggregating historical run data for a given workflow/job combination.

Algorithm

  1. Collect the N most recent runs (default: 5, configurable via ?runs=).

  2. Per container, across runs:

    • CPU request — take the selected percentile (default: p95) of each run's CPU usage, then take the maximum across runs.
    • Memory request — take the peak memory of each run, then take the maximum across runs.
  3. Apply a buffer to add headroom above observed values:

    • CPU uses a flat configurable buffer (default: 20%, via ?buffer=).
    • Memory uses a staircase buffer — larger allocations are inherently more stable and over-provisioning them wastes more cluster resources:
      Observed peak   Buffer
      < 1 GiB         20%
      1–4 GiB         10%
      > 4 GiB         5%
  4. Apply floor values — ensure every container gets a minimum viable allocation even if it was completely idle in all observed runs:

    Resource   Request floor   Limit floor
    CPU        10m             500m
    Memory     32Mi            128Mi

    Request and limit floors are intentionally asymmetric: a low request allows efficient scheduling bin-packing, while a higher limit prevents OOM kills or severe throttling if a previously-idle container becomes active.

  5. Apply a memory ceiling — a single container cannot be recommended more memory than the entire pod ever consumed across all observed runs, plus 20%. This caps outlier recommendations without hardcoding a node-size-specific value; the ceiling adapts automatically as more runs are collected.

  6. Round limits to clean values: CPU limits round up to the nearest 0.5 cores; memory limits round up to the next power of 2 in Mi.

Query parameters

Parameter        Default   Description
runs             5         Number of recent runs to analyse (1–100)
buffer           20        CPU headroom percentage (memory uses the staircase above)
cpu_percentile   p95       CPU stat to use: peak, p99, p95, p75, p50, avg

Sizing response

GET /api/v1/sizing/repo/my-org/my-repo/ci.yml/build?runs=10&buffer=20&cpu_percentile=p95
{
  "containers": [
    {
      "name": "runner",
      "cpu":    { "request": "960m", "limit": "1" },
      "memory": { "request": "615Mi", "limit": "1024Mi" }
    },
    {
      "name": "buildkitd",
      "cpu":    { "request": "10m",  "limit": "500m" },
      "memory": { "request": "32Mi", "limit": "128Mi" }
    }
  ],
  "total": {
    "cpu":    { "request": "970m", "limit": "1500m" },
    "memory": { "request": "647Mi", "limit": "1024Mi" }
  },
  "meta": {
    "runs_analyzed": 10,
    "buffer_percent": 20,
    "cpu_percentile": "p95"
  }
}

The total fields sum requests and limits across all containers and can be used to size the pod as a whole.

Note: For per-container sizing to work correctly, the collector must have CGROUP_PROCESS_MAP configured so that processes are grouped under stable container names. Runs collected without this mapping use raw cgroup paths as container identifiers, which change every run and will never accumulate history.

Running

Docker Compose

# Start the receiver (builds image if needed):
docker compose -f test/docker/docker-compose-stress.yaml up -d --build receiver

# Generate a scoped push token for the collector:
PUSH_TOKEN=$(curl -s -X POST http://localhost:9080/api/v1/token \
  -H "Authorization: Bearer dummyreadtoken" \
  -H "Content-Type: application/json" \
  -d '{"organization":"test-org","repository":"test-org/stress-test","workflow":"stress-test-workflow","job":"heavy-workload"}' \
  | jq -r .token)

# Start the collector and stress workloads with the push token:
COLLECTOR_PUSH_TOKEN=$PUSH_TOKEN \
  docker compose -f test/docker/docker-compose-stress.yaml up -d --build collector

# ... Wait for data collection ...

# Trigger shutdown summary:
docker compose -f test/docker/docker-compose-stress.yaml stop collector

# Query results with the read token:
curl -H "Authorization: Bearer dummyreadtoken" \
  http://localhost:9080/api/v1/metrics/repo/test-org/test-org%2Fstress-test/stress-test-workflow/heavy-workload

Local

go build -o collector ./cmd/collector
go build -o receiver ./cmd/receiver

# Start receiver with both keys:
./receiver --addr=:8080 --db=metrics.db \
  --read-token=my-secret-token --hmac-key=my-hmac-key

# Generate a scoped push token:
PUSH_TOKEN=$(curl -s -X POST http://localhost:8080/api/v1/token \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"organization":"my-org","repository":"my-repo","workflow":"ci.yml","job":"build"}' \
  | jq -r .token)

# Run collector with the push token:
./collector --interval=2s --top=10 \
  --push-endpoint=http://localhost:8080/api/v1/metrics \
  --push-token=$PUSH_TOKEN

Internal Packages

Package              Purpose
internal/proc        Low-level /proc parsing (stat, status, cgroup)
internal/metrics     Aggregates process metrics from /proc into system/container views
internal/cgroup      Parses CGROUP_PROCESS_MAP and CGROUP_LIMITS env vars
internal/collector   Orchestrates the collection loop and shutdown
internal/summary     Accumulates samples, computes stats, pushes to receiver
internal/receiver    HTTP handlers, SQLite store, and sizer logic
internal/output      Metrics output formatting (JSON/text)

Dependency Updates (Renovate)

This repository includes a scheduled Renovate workflow at .github/workflows/renovate.yaml.

Create a repository secret named RENOVATE_TOKEN containing a personal access token (PAT) from a dedicated Forgejo bot account.

Required Forgejo token scopes for RENOVATE_TOKEN:

  • repo (Read and Write)
  • user (Read)
  • issue (Read and Write)
  • organization (Read)

If Renovate needs to read Forgejo packages, also add read:packages.

Background

Technical reference on the Linux primitives this project builds on:

  • Identifying process cgroups by PID — how to read /proc/<PID>/cgroup to determine which container a process belongs to
  • /proc/stat behavior in containers — why /proc/stat shows host-level data in containers, and how to aggregate per-process stats from /proc/[pid]/stat instead, including CPU tick conversion and cgroup limit handling