diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e5ceea1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,188 @@
+# Forgejo Runner Resource Collector
+
+A lightweight resource metrics collector designed to run alongside CI/CD workloads in shared PID namespace environments. It collects CPU and memory metrics, groups them by container/cgroup, and pushes summaries to a receiver service.
+
+## Components
+
+- **Collector**: Gathers system and per-process metrics at regular intervals, computes run-level statistics, and pushes a summary on shutdown.
+- **Receiver**: HTTP service that stores metric summaries in SQLite and provides a query API.
+
+## Receiver API
+
+### POST `/api/v1/metrics`
+
+Receives metric summaries from collectors.
+
+### GET `/api/v1/metrics/repo/{org}/{repo}/{workflow}/{job}`
+
+Retrieves all stored metrics for a specific workflow and job.
+
+**Example request:**
+```
+GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build
+```
+
+**Example response:**
+```json
+[
+  {
+    "id": 1,
+    "organization": "my-org",
+    "repository": "my-org/my-repo",
+    "workflow": "ci.yml",
+    "job": "build",
+    "run_id": "run-123",
+    "received_at": "2026-02-06T14:30:23.056Z",
+    "payload": {
+      "start_time": "2026-02-06T14:30:02.185Z",
+      "end_time": "2026-02-06T14:30:22.190Z",
+      "duration_seconds": 20.0,
+      "sample_count": 11,
+      "cpu_total_percent": { ... },
+      "mem_used_bytes": { ... },
+      "mem_used_percent": { ... },
+      "top_cpu_processes": [ ... ],
+      "top_mem_processes": [ ... ],
+      "containers": [
+        {
+          "name": "runner",
+          "cpu_cores": {
+            "peak": 2.007,
+            "p99": 2.005,
+            "p95": 2.004,
+            "p75": 1.997,
+            "p50": 1.817,
+            "avg": 1.5
+          },
+          "memory_bytes": {
+            "peak": 18567168,
+            "p99": 18567168,
+            "p95": 18567168,
+            "p75": 18567168,
+            "p50": 18567168,
+            "avg": 18567168
+          }
+        }
+      ]
+    }
+  }
+]
+```
+
+## Understanding the Metrics
+
+### CPU Metrics
+
+There are two different CPU metric formats in the response:
+
+#### 1. System and Process CPU: Percentage (`cpu_total_percent`, `peak_cpu_percent`)
+
+These values represent **CPU utilization as a percentage** of total available CPU time.
+
+- `cpu_total_percent`: Overall system CPU usage (0-100%)
+- `peak_cpu_percent` (in process lists): Per-process CPU usage where 100% = 1 full CPU core
+
+#### 2. Container CPU: Cores (`cpu_cores`)
+
+**Important:** The `cpu_cores` field in container metrics represents **CPU usage in number of cores**, not percentage.
+
+| Value | Meaning |
+|-------|---------|
+| `0.5` | Half a CPU core |
+| `1.0` | One full CPU core |
+| `2.0` | Two CPU cores |
+| `2.5` | Two and a half CPU cores |
+
+This allows direct comparison with Kubernetes resource limits (e.g., `cpu: "2"` or `cpu: "500m"`).
+
+**Example interpretation:**
+```json
+{
+  "name": "runner",
+  "cpu_cores": {
+    "peak": 2.007,
+    "avg": 1.5
+  }
+}
+```
+This means the "runner" container used a peak of ~2 CPU cores and averaged 1.5 CPU cores during the run.
+
+### Memory Metrics
+
+All memory values are in **bytes**:
+
+- `mem_used_bytes`: System memory usage
+- `memory_bytes` (in containers): Container RSS memory usage
+- `peak_mem_rss_bytes` (in processes): Process RSS memory
+
+### Statistical Fields
+
+Each metric includes percentile statistics across all samples:
+
+| Field | Description |
+|-------|-------------|
+| `peak` | Maximum value observed |
+| `p99` | 99th percentile |
+| `p95` | 95th percentile |
+| `p75` | 75th percentile |
+| `p50` | Median (50th percentile) |
+| `avg` | Arithmetic mean |
+
+## Configuration
+
+### Collector Environment Variables
+
+| Variable | Description | Example |
+|----------|-------------|---------|
+| `GITHUB_REPOSITORY_OWNER` | Organization name | `my-org` |
+| `GITHUB_REPOSITORY` | Full repository path | `my-org/my-repo` |
+| `GITHUB_WORKFLOW` | Workflow filename | `ci.yml` |
+| `GITHUB_JOB` | Job name | `build` |
+| `GITHUB_RUN_ID` | Unique run identifier | `run-123` |
+| `CGROUP_PROCESS_MAP` | JSON mapping process names to container names | `{"node":"runner"}` |
+| `CGROUP_LIMITS` | JSON with CPU/memory limits per container | See below |
+
+**CGROUP_LIMITS example:**
+```json
+{
+  "runner": {"cpu": "2", "memory": "1Gi"},
+  "sidecar": {"cpu": "500m", "memory": "256Mi"}
+}
+```
+
+CPU values support Kubernetes notation: `"2"` = 2 cores, `"500m"` = 0.5 cores.
+
+Memory values support: `Ki`, `Mi`, `Gi`, `Ti` (binary) or `K`, `M`, `G`, `T` (decimal).
+
+### Receiver Environment Variables
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `DB_PATH` | SQLite database path | `metrics.db` |
+| `LISTEN_ADDR` | HTTP listen address | `:8080` |
+
+## Running
+
+### Docker Compose (stress test example)
+
+```bash
+docker compose -f test/docker/docker-compose-stress.yaml up -d
+# Wait for metrics collection...
+docker compose -f test/docker/docker-compose-stress.yaml stop collector
+# Query results
+curl http://localhost:9080/api/v1/metrics/repo/test-org/test-org%2Fstress-test/stress-test-workflow/heavy-workload
+```
+
+### Local Development
+
+```bash
+# Build
+go build -o collector ./cmd/collector
+go build -o receiver ./cmd/receiver
+
+# Run receiver
+./receiver --listen=:8080 --db=metrics.db
+
+# Run collector
+./collector --interval=2s --top=10 --push-endpoint=http://localhost:8080/api/v1/metrics
+```
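
The README states that `CGROUP_LIMITS` accepts Kubernetes-style quantities (`"2"`, `"500m"`, `"1Gi"`, `"256Mi"`) while the reported metrics are in cores and bytes. As a rough illustration of that conversion, here is a minimal Go sketch. The helper names `parseCPU` and `parseMemory` are hypothetical and are not taken from the collector's source. Treating a bare number as a byte count and accepting fractional values such as `"1.5Gi"` are assumptions rather than documented behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPU converts a Kubernetes-style CPU quantity to cores:
// "2" -> 2.0, "500m" -> 0.5. Hypothetical helper for illustration only.
func parseCPU(s string) (float64, error) {
	if strings.HasSuffix(s, "m") {
		milli, err := strconv.ParseFloat(strings.TrimSuffix(s, "m"), 64)
		if err != nil {
			return 0, fmt.Errorf("invalid CPU quantity %q: %w", s, err)
		}
		return milli / 1000, nil
	}
	cores, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid CPU quantity %q: %w", s, err)
	}
	return cores, nil
}

// memUnits maps the suffixes listed in the README to byte multipliers.
var memUnits = []struct {
	suffix string
	factor int64
}{
	{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}, {"Ti", 1 << 40}, // binary
	{"K", 1000}, {"M", 1000 * 1000}, {"G", 1000 * 1000 * 1000}, {"T", 1000 * 1000 * 1000 * 1000}, // decimal
}

// parseMemory converts a quantity such as "1Gi" or "256Mi" to bytes.
// Assumption: a value without a suffix is a plain byte count.
func parseMemory(s string) (int64, error) {
	for _, u := range memUnits {
		if strings.HasSuffix(s, u.suffix) {
			n, err := strconv.ParseFloat(strings.TrimSuffix(s, u.suffix), 64)
			if err != nil {
				return 0, fmt.Errorf("invalid memory quantity %q: %w", s, err)
			}
			return int64(n * float64(u.factor)), nil
		}
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid memory quantity %q: %w", s, err)
	}
	return n, nil
}

func main() {
	// Values taken from the CGROUP_LIMITS example for the "sidecar" container.
	cpu, err := parseCPU("500m")
	if err != nil {
		panic(err)
	}
	mem, err := parseMemory("256Mi")
	if err != nil {
		panic(err)
	}
	fmt.Printf("cpu limit: %.3f cores, memory limit: %d bytes\n", cpu, mem)
}
```

With limits expressed in cores and bytes, a container's `cpu_cores.peak` or `memory_bytes.peak` from the receiver response can be compared against them directly.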