Forgejo Runner Sizer

A resource sizer for CI/CD workloads in shared PID namespace environments. The collector reads /proc to gather CPU and memory metrics grouped by container/cgroup, and pushes run summaries to the receiver. The receiver stores metrics and exposes a sizer API that computes right-sized Kubernetes resource requests and limits from historical data.

Architecture

The system has two binaries — a collector and a receiver (which includes the sizer):

┌─────────────────────────────────────────────┐       ┌──────────────────────────┐
│  CI/CD Pod (shared PID namespace)           │       │  Receiver Service        │
│                                             │       │                          │
│  ┌───────────┐  ┌────────┐  ┌───────────┐   │       │  POST /api/v1/metrics    │
│  │ collector │  │ runner │  │ sidecar   │   │       │         │                │
│  │           │  │        │  │           │   │       │         ▼                │
│  │ reads     │  │        │  │           │   │  push │  ┌────────────┐          │
│  │ /proc for │  │        │  │           │   │──────▶│  │  SQLite    │          │
│  │ all PIDs  │  │        │  │           │   │       │  └────────────┘          │
│  └───────────┘  └────────┘  └───────────┘   │       │         │                │
│                                             │       │         ▼                │
└─────────────────────────────────────────────┘       │  GET /api/v1/metrics/... │
                                                      │  GET /api/v1/sizing/...  │
                                                      │       (sizer)            │
                                                      └──────────────────────────┘

Collector

Runs as a sidecar alongside CI workloads. On a configurable interval, it reads /proc to collect CPU and memory for all visible processes, groups them by container using cgroup paths, and accumulates samples. On shutdown (SIGINT/SIGTERM), it computes run-level statistics (peak, avg, percentiles) and pushes a single summary to the receiver.

./collector --interval=2s --top=10 --push-endpoint=http://receiver:8080/api/v1/metrics

Flags: --interval, --proc-path, --log-level, --log-format, --top, --push-endpoint, --push-token

Environment variables:

Variable                  Description                             Example
GITHUB_REPOSITORY_OWNER   Organization name                       my-org
GITHUB_REPOSITORY         Full repository path                    my-org/my-repo
GITHUB_WORKFLOW           Workflow filename                       ci.yml
GITHUB_JOB                Job name                                build
GITHUB_RUN_ID             Unique run identifier                   run-123
COLLECTOR_PUSH_TOKEN      Bearer token for push endpoint auth
CGROUP_PROCESS_MAP        JSON: process name → container name     {"node":"runner"}
CGROUP_LIMITS             JSON: per-container CPU/memory limits   See below

CGROUP_LIMITS example:

{
  "runner": { "cpu": "2", "memory": "1Gi" },
  "sidecar": { "cpu": "500m", "memory": "256Mi" }
}

CPU supports Kubernetes notation ("2" = 2 cores, "500m" = 0.5 cores). Memory supports Ki, Mi, Gi, Ti (binary) or K, M, G, T (decimal).

Receiver (with sizer)

HTTP service that stores metric summaries in SQLite (via GORM), exposes a query API, and provides a sizer endpoint that computes right-sized Kubernetes resource requests and limits from historical run data.

./receiver --addr=:8080 --db=metrics.db --read-token=my-secret-token --hmac-key=my-hmac-key

Flags:

Flag           Environment Variable   Description                                       Default
--addr                                HTTP listen address                               :8080
--db                                  SQLite database path                              metrics.db
--read-token   RECEIVER_READ_TOKEN    Pre-shared token for read/admin endpoints         (required)
--hmac-key     RECEIVER_HMAC_KEY      Secret key for push token generation/validation   (required)

Web UI

The receiver includes a web UI for viewing collected metrics.

  • URL: /ui
  • Authentication: The UI uses the same --read-token as the API. Enter the token in the UI to load metrics.

Endpoints:

  • POST /api/v1/metrics — receive and store a metric summary (requires scoped push token)
  • POST /api/v1/token — generate a scoped push token (requires read token auth)
  • GET /api/v1/metrics/repo/{org}/{repo}/{workflow}/{job} — query stored metrics (requires read token auth)
  • GET /api/v1/debug/metrics — return all metric rows from the database (requires read token auth)
  • GET /api/v1/sizing/repo/{org}/{repo}/{workflow}/{job} — compute container sizes from historical data (requires read token auth)

Authentication:

All metrics endpoints require authentication via --read-token:

  • The GET endpoint requires a Bearer token matching the read token
  • The POST metrics endpoint requires a scoped push token (generated via POST /api/v1/token)
  • The token endpoint itself requires the read token

Token flow:

# 1. Admin generates a scoped push token using the read token
curl -X POST http://localhost:8080/api/v1/token \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"organization":"my-org","repository":"my-repo","workflow":"ci.yml","job":"build"}'
# → {"token":"<hex-encoded HMAC>"}

# 2. Collector uses the scoped token to push metrics
./collector --push-endpoint=http://localhost:8080/api/v1/metrics \
  --push-token=<token-from-step-1>

# 3. Query metrics with the read token
curl -H "Authorization: Bearer my-secret-token" http://localhost:8080/api/v1/metrics/repo/my-org/my-repo/ci.yml/build # gitleaks:allow

# 4. Debug endpoint: dump all stored metrics
curl -H "Authorization: Bearer my-secret-token" http://localhost:8080/api/v1/debug/metrics # gitleaks:allow

Push tokens are HMAC-SHA256 digests derived from --hmac-key and the scope (org/repo/workflow/job). They are stateless — no database storage is needed. The HMAC key is separate from the read token so that compromising a push token does not expose the admin credential.

How Metrics Are Collected

The collector reads /proc/[pid]/stat for every visible process to get CPU ticks (utime + stime) and /proc/[pid]/status for memory (RSS). It takes two samples per interval and computes the delta to derive CPU usage rates.

Processes are grouped into containers by reading /proc/[pid]/cgroup and matching cgroup paths against the CGROUP_PROCESS_MAP. This is necessary because in shared PID namespace pods, /proc/stat only shows host-level aggregates — per-container metrics must be built up from individual process data.

Container CPU is reported in cores (not percentage) for direct comparison with Kubernetes resource limits. System-level CPU is reported as a percentage (0-100%).

Over the course of a run, the summary.Accumulator tracks every sample and on shutdown computes:

Stat                 Description
peak                 Maximum observed value
p99, p95, p75, p50   Percentiles across all samples
avg                  Arithmetic mean

These stats are computed for CPU, memory, and per-container metrics.

API Response

GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build
[
  {
    "id": 1,
    "organization": "my-org",
    "repository": "my-org/my-repo",
    "workflow": "ci.yml",
    "job": "build",
    "run_id": "run-123",
    "received_at": "2026-02-06T14:30:23.056Z",
    "payload": {
      "start_time": "2026-02-06T14:30:02.185Z",
      "end_time": "2026-02-06T14:30:22.190Z",
      "duration_seconds": 20.0,
      "sample_count": 11,
      "cpu_total_percent": { "peak": ..., "avg": ..., "p50": ... },
      "mem_used_bytes": { "peak": ..., "avg": ... },
      "containers": [
        {
          "name": "runner",
          "cpu_cores": { "peak": 2.007, "avg": 1.5, "p50": 1.817, "p95": 2.004 },
          "memory_bytes": { "peak": 18567168, "avg": 18567168 }
        }
      ],
      "top_cpu_processes": [ ... ],
      "top_mem_processes": [ ... ]
    }
  }
]

CPU metric distinction:

  • cpu_total_percent — system-wide, 0-100%
  • cpu_cores (containers) — cores used (e.g. 2.0 = two full cores)
  • peak_cpu_percent (processes) — per-process, where 100% = 1 core

All memory values are in bytes.

How Sizing Works

The sizer computes Kubernetes resource requests and limits by aggregating historical run data for a given workflow/job combination.

Algorithm

  1. Collect the N most recent runs (default: 5, configurable via ?runs=).

  2. Per container, across runs:

    • CPU request — take the selected percentile (default: p95) of each run's CPU usage, then take the maximum across runs.
    • Memory request — take the peak memory of each run, then take the maximum across runs.
  3. Apply a buffer to add headroom above observed values:

    • CPU uses a flat configurable buffer (default: 20%, via ?buffer=).
    • Memory uses a staircase buffer — larger allocations are inherently more stable and over-provisioning them wastes more cluster resources:
      Observed peak   Buffer
      < 1 GiB         20%
      1–4 GiB         10%
      > 4 GiB         5%
  4. Apply floor values — ensure every container gets a minimum viable allocation even if it was completely idle in all observed runs:

    Resource   Request floor   Limit floor
    CPU        10m             500m
    Memory     32Mi            128Mi

    Request and limit floors are intentionally asymmetric: a low request allows efficient scheduling bin-packing, while a higher limit prevents OOM kills or severe throttling if a previously-idle container becomes active.

  5. Apply a memory ceiling — a single container cannot be recommended more memory than the entire pod ever consumed across all observed runs, plus 20%. This caps outlier recommendations without hardcoding a node-size-specific value; the ceiling adapts automatically as more runs are collected.

  6. Round limits to clean values: CPU limits round up to the nearest 0.5 cores; memory limits round up to the next power of 2 in Mi.

Query parameters

Parameter        Default   Description
runs             5         Number of recent runs to analyse (1–100)
buffer           20        CPU headroom percentage (memory uses the staircase above)
cpu_percentile   p95       CPU stat to use: peak, p99, p95, p75, p50, avg

Sizing response

GET /api/v1/sizing/repo/my-org/my-repo/ci.yml/build?runs=10&buffer=20&cpu_percentile=p95
{
  "containers": [
    {
      "name": "runner",
      "cpu":    { "request": "960m", "limit": "1" },
      "memory": { "request": "615Mi", "limit": "1024Mi" }
    },
    {
      "name": "buildkitd",
      "cpu":    { "request": "10m",  "limit": "500m" },
      "memory": { "request": "32Mi", "limit": "128Mi" }
    }
  ],
  "total": {
    "cpu":    { "request": "970m", "limit": "1500m" },
    "memory": { "request": "647Mi", "limit": "1024Mi" }
  },
  "meta": {
    "runs_analyzed": 10,
    "buffer_percent": 20,
    "cpu_percentile": "p95"
  }
}

The total fields sum requests and limits across all containers and can be used to size the pod as a whole.

Note: For per-container sizing to work correctly, the collector must have CGROUP_PROCESS_MAP configured so that processes are grouped under stable container names. Runs collected without this mapping use raw cgroup paths as container identifiers, which change every run and will never accumulate history.

Running

Docker Compose

# Start the receiver (builds image if needed):
docker compose -f test/docker/docker-compose-stress.yaml up -d --build receiver

# Generate a scoped push token for the collector:
PUSH_TOKEN=$(curl -s -X POST http://localhost:9080/api/v1/token \
  -H "Authorization: Bearer dummyreadtoken" \
  -H "Content-Type: application/json" \
  -d '{"organization":"test-org","repository":"test-org/stress-test","workflow":"stress-test-workflow","job":"heavy-workload"}' \
  | jq -r .token)

# Start the collector and stress workloads with the push token:
COLLECTOR_PUSH_TOKEN=$PUSH_TOKEN \
  docker compose -f test/docker/docker-compose-stress.yaml up -d --build collector

# ... Wait for data collection ...

# Trigger shutdown summary:
docker compose -f test/docker/docker-compose-stress.yaml stop collector

# Query results with the read token:
curl -H "Authorization: Bearer dummyreadtoken" \
  http://localhost:9080/api/v1/metrics/repo/test-org/test-org%2Fstress-test/stress-test-workflow/heavy-workload

Local

go build -o collector ./cmd/collector
go build -o receiver ./cmd/receiver

# Start receiver with both keys:
./receiver --addr=:8080 --db=metrics.db \
  --read-token=my-secret-token --hmac-key=my-hmac-key

# Generate a scoped push token:
PUSH_TOKEN=$(curl -s -X POST http://localhost:8080/api/v1/token \
  -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"organization":"my-org","repository":"my-repo","workflow":"ci.yml","job":"build"}' \
  | jq -r .token)

# Run collector with the push token:
./collector --interval=2s --top=10 \
  --push-endpoint=http://localhost:8080/api/v1/metrics \
  --push-token=$PUSH_TOKEN

Internal Packages

Package              Purpose
internal/proc        Low-level /proc parsing (stat, status, cgroup)
internal/metrics     Aggregates process metrics from /proc into system/container views
internal/cgroup      Parses CGROUP_PROCESS_MAP and CGROUP_LIMITS env vars
internal/collector   Orchestrates the collection loop and shutdown
internal/summary     Accumulates samples, computes stats, pushes to receiver
internal/receiver    HTTP handlers, SQLite store, and sizer logic
internal/output      Metrics output formatting (JSON/text)

Dependency Updates (Renovate)

This repository includes a scheduled Renovate workflow at .github/workflows/renovate.yaml.

Create a repository secret named RENOVATE_TOKEN containing a personal access token (PAT) from a dedicated Forgejo bot account.

Required Forgejo token scopes for RENOVATE_TOKEN:

  • repo (Read and Write)
  • user (Read)
  • issue (Read and Write)
  • organization (Read)

If Renovate needs to read Forgejo packages, also add read:packages.

Background

Technical reference on the Linux primitives this project builds on:

  • Identifying process cgroups by PID — how to read /proc/<PID>/cgroup to determine which container a process belongs to
  • /proc/stat behavior in containers — why /proc/stat shows host-level data in containers, and how to aggregate per-process stats from /proc/[pid]/stat instead, including CPU tick conversion and cgroup limit handling