docs: add README with API documentation

Document receiver API endpoints and response format. Clarify that container cpu_cores values are in number of cores (not percentage), while system/process CPU values are percentages.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 15:38:22 +01:00 · 2026-02-06 15:38:22 +01:00 · 52f1b8b64d
commit 52f1b8b64d
parent d624d46822
1 changed file with 188 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,188 @@
+# Forgejo Runner Resource Collector
+
+A lightweight resource metrics collector designed to run alongside CI/CD workloads in shared PID namespace environments. It collects CPU and memory metrics, groups them by container/cgroup, and pushes summaries to a receiver service.
+
+## Components
+
+- **Collector**: Gathers system and per-process metrics at regular intervals, computes run-level statistics, and pushes a summary on shutdown.
+- **Receiver**: HTTP service that stores metric summaries in SQLite and provides a query API.
+
+## Receiver API
+
+### POST `/api/v1/metrics`
+
+Receives metric summaries from collectors.
+
+### GET `/api/v1/metrics/repo/{org}/{repo}/{workflow}/{job}`
+
+Retrieves all stored metrics for a specific workflow and job.
+
+**Example request:**
+```
+GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build
+```
+
+**Example response:**
+```json
+[
+  {
+    "id": 1,
+    "organization": "my-org",
+    "repository": "my-org/my-repo",
+    "workflow": "ci.yml",
+    "job": "build",
+    "run_id": "run-123",
+    "received_at": "2026-02-06T14:30:23.056Z",
+    "payload": {
+      "start_time": "2026-02-06T14:30:02.185Z",
+      "end_time": "2026-02-06T14:30:22.190Z",
+      "duration_seconds": 20.0,
+      "sample_count": 11,
+      "cpu_total_percent": { ... },
+      "mem_used_bytes": { ... },
+      "mem_used_percent": { ... },
+      "top_cpu_processes": [ ... ],
+      "top_mem_processes": [ ... ],
+      "containers": [
+        {
+          "name": "runner",
+          "cpu_cores": {
+            "peak": 2.007,
+            "p99": 2.005,
+            "p95": 2.004,
+            "p75": 1.997,
+            "p50": 1.817,
+            "avg": 1.5
+          },
+          "memory_bytes": {
+            "peak": 18567168,
+            "p99": 18567168,
+            "p95": 18567168,
+            "p75": 18567168,
+            "p50": 18567168,
+            "avg": 18567168
+          }
+        }
+      ]
+    }
+  }
+]
+```
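+
+The response can be decoded into typed values on the client side. The following is a minimal sketch, not the receiver's actual types: the struct names (`Stats`, `Container`, `Payload`, `Record`) and the helper `decodeRecords` are hypothetical, and only the fields spelled out in the example above are mapped; the elided fields are skipped.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// Stats mirrors the statistical summary attached to each metric.
+type Stats struct {
+	Peak float64 `json:"peak"`
+	P99  float64 `json:"p99"`
+	P95  float64 `json:"p95"`
+	P75  float64 `json:"p75"`
+	P50  float64 `json:"p50"`
+	Avg  float64 `json:"avg"`
+}
+
+// Container is one entry of the "containers" array.
+type Container struct {
+	Name        string `json:"name"`
+	CPUCores    Stats  `json:"cpu_cores"`
+	MemoryBytes Stats  `json:"memory_bytes"`
+}
+
+// Payload maps only the fields shown in full in the example response.
+type Payload struct {
+	DurationSeconds float64     `json:"duration_seconds"`
+	SampleCount     int         `json:"sample_count"`
+	Containers      []Container `json:"containers"`
+}
+
+// Record is one element of the top-level response array.
+type Record struct {
+	ID           int64   `json:"id"`
+	Organization string  `json:"organization"`
+	Repository   string  `json:"repository"`
+	Workflow     string  `json:"workflow"`
+	Job          string  `json:"job"`
+	RunID        string  `json:"run_id"`
+	Payload      Payload `json:"payload"`
+}
+
+// decodeRecords parses a response body into typed records.
+func decodeRecords(body []byte) ([]Record, error) {
+	var records []Record
+	if err := json.Unmarshal(body, &records); err != nil {
+		return nil, fmt.Errorf("decode metrics response: %w", err)
+	}
+	return records, nil
+}
+
+func main() {
+	// Trimmed-down copy of the example response.
+	body := []byte(`[{"id":1,"run_id":"run-123","payload":{"sample_count":11,"containers":[{"name":"runner","cpu_cores":{"peak":2.007,"avg":1.5}}]}}]`)
+	records, err := decodeRecords(body)
+	if err != nil {
+		panic(err)
+	}
+	for _, r := range records {
+		for _, c := range r.Payload.Containers {
+			fmt.Printf("%s %s: peak %.3f cores\n", r.RunID, c.Name, c.CPUCores.Peak)
+		}
+	}
+}
+```
+
+Fields that are absent from the JSON simply keep their zero value, so the same structs work for partial payloads.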
+
+## Understanding the Metrics
+
+### CPU Metrics
+
+The response reports CPU usage in two different units:
+
+#### 1. System and Process CPU: Percentage (`cpu_total_percent`, `peak_cpu_percent`)
+
+These values express **CPU utilization as a percentage**, but the scale differs between the two fields:
+
+- `cpu_total_percent`: Overall system CPU usage as a share of total available CPU time (0-100%)
+- `peak_cpu_percent` (in process lists): Per-process CPU usage where 100% = 1 full CPU core, so a multi-threaded process can exceed 100%
+
+#### 2. Container CPU: Cores (`cpu_cores`)
+
+**Important:** The `cpu_cores` field in container metrics represents **CPU usage in number of cores**, not percentage.
+
+| Value | Meaning |
+|-------|---------|
+| `0.5` | Half a CPU core |
+| `1.0` | One full CPU core |
+| `2.0` | Two CPU cores |
+| `2.5` | Two and a half CPU cores |
+
+This allows direct comparison with Kubernetes resource limits (e.g., `cpu: "2"` or `cpu: "500m"`).
+
+**Example interpretation:**
+```json
+{
+  "name": "runner",
+  "cpu_cores": {
+    "peak": 2.007,
+    "avg": 1.5
+  }
+}
+```
+This means the "runner" container used a peak of ~2 CPU cores and averaged 1.5 CPU cores during the run.
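+
+Because both units reduce to cores, a consumer can compare them against a limit with simple arithmetic. The sketch below is illustrative only: `percentToCores` and `utilization` are hypothetical helper names, and the 2-core limit is an assumed value corresponding to `cpu: "2"`.
+
+```go
+package main
+
+import "fmt"
+
+// percentToCores converts a per-process CPU percentage (100% = one core) into cores.
+func percentToCores(percent float64) float64 {
+	return percent / 100
+}
+
+// utilization returns usage as a fraction of the limit.
+// The boolean is false when the limit is not positive (for example, no limit set).
+func utilization(usedCores, limitCores float64) (float64, bool) {
+	if limitCores <= 0 {
+		return 0, false
+	}
+	return usedCores / limitCores, true
+}
+
+func main() {
+	const limitCores = 2.0 // assumed limit, equivalent to cpu: "2"
+
+	// Container values from the example above are already in cores.
+	if u, ok := utilization(1.5, limitCores); ok {
+		fmt.Printf("runner avg: %.0f%% of its CPU limit\n", u*100)
+	}
+
+	// A process-level percentage has to be converted to cores first.
+	fmt.Printf("150%% process CPU = %.1f cores\n", percentToCores(150))
+}
+```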
+
+### Memory Metrics
+
+All memory values are in **bytes**:
+
+- `mem_used_bytes`: System memory usage
+- `memory_bytes` (in containers): Container RSS memory usage
+- `peak_mem_rss_bytes` (in processes): Process RSS memory
+
+### Statistical Fields
+
+Each metric is summarized across all samples of a run with the following fields:
+
+| Field | Description |
+|-------|-------------|
+| `peak` | Maximum value observed |
+| `p99` | 99th percentile |
+| `p95` | 95th percentile |
+| `p75` | 75th percentile |
+| `p50` | Median (50th percentile) |
+| `avg` | Arithmetic mean |
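+
+To make the fields concrete, here is a sketch of how such a summary could be computed from raw samples. It assumes the nearest-rank percentile method; the collector may use a different method (for example, linear interpolation), so treat the exact percentile values as illustrative. The names `Stats`, `percentile`, and `summarize` are hypothetical.
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+	"sort"
+)
+
+// Stats holds the summary fields listed in the table above.
+type Stats struct {
+	Peak, P99, P95, P75, P50, Avg float64
+}
+
+// percentile returns the nearest-rank percentile of an ascending, non-empty slice.
+func percentile(sorted []float64, p float64) float64 {
+	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
+	if rank < 1 {
+		rank = 1
+	}
+	return sorted[rank-1]
+}
+
+// summarize computes the summary; the boolean is false for an empty input.
+func summarize(samples []float64) (Stats, bool) {
+	if len(samples) == 0 {
+		return Stats{}, false
+	}
+	// Sort a copy so the caller's slice is left untouched.
+	sorted := append([]float64(nil), samples...)
+	sort.Float64s(sorted)
+
+	sum := 0.0
+	for _, v := range sorted {
+		sum += v
+	}
+	return Stats{
+		Peak: sorted[len(sorted)-1],
+		P99:  percentile(sorted, 99),
+		P95:  percentile(sorted, 95),
+		P75:  percentile(sorted, 75),
+		P50:  percentile(sorted, 50),
+		Avg:  sum / float64(len(sorted)),
+	}, true
+}
+
+func main() {
+	stats, ok := summarize([]float64{0.4, 1.8, 2.0, 1.9, 1.4})
+	if !ok {
+		panic("no samples")
+	}
+	fmt.Printf("%+v\n", stats)
+}
+```
+
+With few samples the upper percentiles collapse onto the peak, which matches the example response, where `p99` and `p95` sit very close to `peak`.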
+
+## Configuration
+
+### Collector Environment Variables
+
+| Variable | Description | Example |
+|----------|-------------|---------|
+| `GITHUB_REPOSITORY_OWNER` | Organization name | `my-org` |
+| `GITHUB_REPOSITORY` | Full repository path | `my-org/my-repo` |
+| `GITHUB_WORKFLOW` | Workflow filename | `ci.yml` |
+| `GITHUB_JOB` | Job name | `build` |
+| `GITHUB_RUN_ID` | Unique run identifier | `run-123` |
+| `CGROUP_PROCESS_MAP` | JSON mapping process names to container names | `{"node":"runner"}` |
+| `CGROUP_LIMITS` | JSON with CPU/memory limits per container | See below |
+
+**CGROUP_LIMITS example:**
+```json
+{
+  "runner": {"cpu": "2", "memory": "1Gi"},
+  "sidecar": {"cpu": "500m", "memory": "256Mi"}
+}
+```
+
+CPU values support Kubernetes notation: `"2"` = 2 cores, `"500m"` = 0.5 cores.
+
+Memory values accept the suffixes `Ki`, `Mi`, `Gi`, `Ti` (binary, powers of 1024) or `K`, `M`, `G`, `T` (decimal, powers of 1000).
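+
+The notation above can be parsed as follows. This is a minimal sketch rather than the collector's actual parser: `parseCPU`, `parseMemory`, and the `limit` type are hypothetical names, and treating a suffix-less memory value as plain bytes is an assumption.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"strconv"
+	"strings"
+)
+
+// parseCPU converts Kubernetes-style CPU notation ("2", "500m") into cores.
+func parseCPU(s string) (float64, error) {
+	divisor := 1.0
+	if strings.HasSuffix(s, "m") {
+		s, divisor = strings.TrimSuffix(s, "m"), 1000 // millicores
+	}
+	n, err := strconv.ParseFloat(s, 64)
+	if err != nil {
+		return 0, fmt.Errorf("invalid CPU quantity %q: %w", s, err)
+	}
+	return n / divisor, nil
+}
+
+// memorySuffixes lists the binary and decimal suffixes from the text above.
+var memorySuffixes = []struct {
+	suffix string
+	factor float64
+}{
+	{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}, {"Ti", 1 << 40},
+	{"K", 1e3}, {"M", 1e6}, {"G", 1e9}, {"T", 1e12},
+}
+
+// parseMemory converts a memory quantity ("1Gi", "256M") into bytes.
+// Assumption: a value without a suffix is interpreted as plain bytes.
+func parseMemory(s string) (float64, error) {
+	factor := 1.0
+	for _, u := range memorySuffixes {
+		if strings.HasSuffix(s, u.suffix) {
+			s, factor = strings.TrimSuffix(s, u.suffix), u.factor
+			break
+		}
+	}
+	n, err := strconv.ParseFloat(s, 64)
+	if err != nil {
+		return 0, fmt.Errorf("invalid memory quantity %q: %w", s, err)
+	}
+	return n * factor, nil
+}
+
+// limit is the per-container entry of CGROUP_LIMITS.
+type limit struct {
+	CPU    string `json:"cpu"`
+	Memory string `json:"memory"`
+}
+
+func main() {
+	raw := `{"runner":{"cpu":"2","memory":"1Gi"},"sidecar":{"cpu":"500m","memory":"256Mi"}}`
+	var limits map[string]limit
+	if err := json.Unmarshal([]byte(raw), &limits); err != nil {
+		panic(err)
+	}
+	for name, l := range limits {
+		cores, err := parseCPU(l.CPU)
+		if err != nil {
+			panic(err)
+		}
+		bytes, err := parseMemory(l.Memory)
+		if err != nil {
+			panic(err)
+		}
+		fmt.Printf("%s: %.2f cores, %.0f bytes\n", name, cores, bytes)
+	}
+}
+```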
+
+### Receiver Environment Variables
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `DB_PATH` | SQLite database path | `metrics.db` |
+| `LISTEN_ADDR` | HTTP listen address | `:8080` |
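+
+The lookup-with-default behaviour in the table can be sketched like this. The helper `envOrDefault` is hypothetical, and treating an empty variable the same as an unset one is an assumption; the receiver's own handling may differ.
+
+```go
+package main
+
+import (
+	"fmt"
+	"os"
+)
+
+// envOrDefault returns the value of key, or fallback when it is unset or empty.
+func envOrDefault(key, fallback string) string {
+	if v, ok := os.LookupEnv(key); ok && v != "" {
+		return v
+	}
+	return fallback
+}
+
+func main() {
+	dbPath := envOrDefault("DB_PATH", "metrics.db")
+	listenAddr := envOrDefault("LISTEN_ADDR", ":8080")
+	fmt.Printf("db=%s listen=%s\n", dbPath, listenAddr)
+}
+```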
+
+## Running
+
+### Docker Compose (stress test example)
+
+```bash
+docker compose -f test/docker/docker-compose-stress.yaml up -d
+# Wait for metrics collection...
+docker compose -f test/docker/docker-compose-stress.yaml stop collector
+# Query results
+curl http://localhost:9080/api/v1/metrics/repo/test-org/test-org%2Fstress-test/stress-test-workflow/heavy-workload
+```
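+
+Note that the `curl` command above escapes the slash inside the repository path as `%2F`. When building the query URL in code, each path segment can be escaped individually. The sketch below uses a hypothetical helper, `metricsURL`, and reuses the base URL and identifiers from the command above.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/url"
+	"strings"
+)
+
+// metricsURL builds the query URL, escaping each path segment on its own
+// so that a slash inside a segment becomes %2F instead of a path separator.
+func metricsURL(baseURL, org, repo, workflow, job string) string {
+	segments := []string{org, repo, workflow, job}
+	for i, s := range segments {
+		segments[i] = url.PathEscape(s)
+	}
+	return strings.TrimRight(baseURL, "/") + "/api/v1/metrics/repo/" + strings.Join(segments, "/")
+}
+
+func main() {
+	fmt.Println(metricsURL(
+		"http://localhost:9080",
+		"test-org",
+		"test-org/stress-test",
+		"stress-test-workflow",
+		"heavy-workload",
+	))
+}
+```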
+
+### Local Development
+
+```bash
+# Build
+go build -o collector ./cmd/collector
+go build -o receiver ./cmd/receiver
+
+# Run receiver
+./receiver --listen=:8080 --db=metrics.db
+
+# Run collector
+./collector --interval=2s --top=10 --push-endpoint=http://localhost:8080/api/v1/metrics
+```