docs: add README with API documentation
All checks were successful
ci / build (push) Successful in 34s
Document receiver API endpoints and response format. Clarify that container cpu_cores values are in number of cores (not percentage), while system/process CPU values are percentages.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
parent d624d46822
commit 52f1b8b64d
1 changed files with 188 additions and 0 deletions
188
README.md
Normal file
@@ -0,0 +1,188 @@
# Forgejo Runner Resource Collector

A lightweight resource metrics collector designed to run alongside CI/CD workloads in shared PID namespace environments. It collects CPU and memory metrics, groups them by container/cgroup, and pushes summaries to a receiver service.

## Components

- **Collector**: Gathers system and per-process metrics at regular intervals, computes run-level statistics, and pushes a summary on shutdown.
- **Receiver**: HTTP service that stores metric summaries in SQLite and provides a query API.

## Receiver API

### POST `/api/v1/metrics`

Receives metric summaries from collectors.

### GET `/api/v1/metrics/repo/{org}/{repo}/{workflow}/{job}`

Retrieves all stored metrics for a specific workflow and job.

**Example request:**
```
GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build
```

**Example response:**
```json
[
  {
    "id": 1,
    "organization": "my-org",
    "repository": "my-org/my-repo",
    "workflow": "ci.yml",
    "job": "build",
    "run_id": "run-123",
    "received_at": "2026-02-06T14:30:23.056Z",
    "payload": {
      "start_time": "2026-02-06T14:30:02.185Z",
      "end_time": "2026-02-06T14:30:22.190Z",
      "duration_seconds": 20.0,
      "sample_count": 11,
      "cpu_total_percent": { ... },
      "mem_used_bytes": { ... },
      "mem_used_percent": { ... },
      "top_cpu_processes": [ ... ],
      "top_mem_processes": [ ... ],
      "containers": [
        {
          "name": "runner",
          "cpu_cores": {
            "peak": 2.007,
            "p99": 2.005,
            "p95": 2.004,
            "p75": 1.997,
            "p50": 1.817,
            "avg": 1.5
          },
          "memory_bytes": {
            "peak": 18567168,
            "p99": 18567168,
            "p95": 18567168,
            "p75": 18567168,
            "p50": 18567168,
            "avg": 18567168
          }
        }
      ]
    }
  }
]
```
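A client could decode this response with types along the following lines. This is a minimal sketch in Go, not the receiver's own code: the type and function names (`Stats`, `Container`, `Payload`, `MetricRecord`, `decodeRecords`) are hypothetical, only the JSON field names are taken from the example above, and the fields elided there (`cpu_total_percent`, `top_cpu_processes`, and so on) are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Stats mirrors the summary object used for cpu_cores and memory_bytes.
type Stats struct {
	Peak float64 `json:"peak"`
	P99  float64 `json:"p99"`
	P95  float64 `json:"p95"`
	P75  float64 `json:"p75"`
	P50  float64 `json:"p50"`
	Avg  float64 `json:"avg"`
}

// Container holds the per-container section of the payload.
type Container struct {
	Name        string `json:"name"`
	CPUCores    Stats  `json:"cpu_cores"`    // number of cores, not a percentage
	MemoryBytes Stats  `json:"memory_bytes"` // bytes
}

// Payload covers only the fields shown in full in the example response.
type Payload struct {
	DurationSeconds float64     `json:"duration_seconds"`
	SampleCount     int         `json:"sample_count"`
	Containers      []Container `json:"containers"`
}

// MetricRecord is one element of the returned JSON array.
type MetricRecord struct {
	ID         int64   `json:"id"`
	Repository string  `json:"repository"`
	Workflow   string  `json:"workflow"`
	Job        string  `json:"job"`
	RunID      string  `json:"run_id"`
	Payload    Payload `json:"payload"`
}

// decodeRecords parses the JSON array returned by the GET endpoint.
func decodeRecords(body []byte) ([]MetricRecord, error) {
	var records []MetricRecord
	if err := json.Unmarshal(body, &records); err != nil {
		return nil, fmt.Errorf("decode metrics response: %w", err)
	}
	return records, nil
}

func main() {
	// Trimmed copy of the example response.
	body := []byte(`[{"id":1,"repository":"my-org/my-repo","workflow":"ci.yml","job":"build","run_id":"run-123","payload":{"duration_seconds":20.0,"sample_count":11,"containers":[{"name":"runner","cpu_cores":{"peak":2.007,"avg":1.5},"memory_bytes":{"peak":18567168,"avg":18567168}}]}}]`)
	records, err := decodeRecords(body)
	if err != nil {
		panic(err)
	}
	for _, r := range records {
		for _, c := range r.Payload.Containers {
			fmt.Printf("%s %s: peak %.3f cores, peak %.0f bytes\n",
				r.RunID, c.Name, c.CPUCores.Peak, c.MemoryBytes.Peak)
		}
	}
}
```

Unknown fields are ignored by `encoding/json`, so the sketch keeps working if the payload carries more fields than the structs declare.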
## Understanding the Metrics

### CPU Metrics

There are two different CPU metric formats in the response:

#### 1. System and Process CPU: Percentage (`cpu_total_percent`, `peak_cpu_percent`)

These values represent **CPU utilization as a percentage**, but the two fields use different scales:

- `cpu_total_percent`: Overall system CPU usage as a share of total available CPU time (0-100%)
- `peak_cpu_percent` (in process lists): Per-process CPU usage where 100% = 1 full CPU core, so a process running on several cores can exceed 100%

#### 2. Container CPU: Cores (`cpu_cores`)

**Important:** The `cpu_cores` field in container metrics represents **CPU usage in number of cores**, not percentage.

| Value | Meaning |
|-------|---------|
| `0.5` | Half a CPU core |
| `1.0` | One full CPU core |
| `2.0` | Two CPU cores |
| `2.5` | Two and a half CPU cores |

This allows direct comparison with Kubernetes resource limits (e.g., `cpu: "2"` or `cpu: "500m"`).

**Example interpretation:**
```json
{
  "name": "runner",
  "cpu_cores": {
    "peak": 2.007,
    "avg": 1.5
  }
}
```
This means the "runner" container used a peak of ~2 CPU cores and averaged 1.5 CPU cores during the run.
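The comparison with a Kubernetes-style limit can be sketched in Go as follows. The helper names (`coresToMillicores`, `percentToCores`, `limitUtilization`) and the 2-core limit are assumptions made for illustration; they are not part of the collector or receiver.

```go
package main

import (
	"fmt"
	"math"
)

// coresToMillicores converts a cpu_cores value to Kubernetes-style millicores.
func coresToMillicores(cores float64) int64 {
	return int64(math.Round(cores * 1000))
}

// percentToCores converts a per-process percentage (100% = 1 core) to cores,
// which puts process values on the same scale as container cpu_cores.
func percentToCores(percent float64) float64 {
	return percent / 100
}

// limitUtilization returns usage as a fraction of the limit.
// It returns 0 when no positive limit is configured.
func limitUtilization(usedCores, limitCores float64) float64 {
	if limitCores <= 0 {
		return 0
	}
	return usedCores / limitCores
}

func main() {
	peak, avg := 2.007, 1.5 // values from the example above
	limit := 2.0            // hypothetical limit, as in cpu: "2"

	fmt.Printf("peak: %dm, %.1f%% of limit\n",
		coresToMillicores(peak), limitUtilization(peak, limit)*100)
	fmt.Printf("avg:  %dm, %.1f%% of limit\n",
		coresToMillicores(avg), limitUtilization(avg, limit)*100)
	fmt.Printf("a process at 250%% uses %.1f cores\n", percentToCores(250))
}
```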
### Memory Metrics

All memory values are in **bytes**:

- `mem_used_bytes`: System memory usage
- `memory_bytes` (in containers): Container RSS memory usage
- `peak_mem_rss_bytes` (in processes): Process RSS memory

### Statistical Fields

Each metric is summarized across all samples with the following statistics:

| Field | Description |
|-------|-------------|
| `peak` | Maximum value observed |
| `p99` | 99th percentile |
| `p95` | 95th percentile |
| `p75` | 75th percentile |
| `p50` | Median (50th percentile) |
| `avg` | Arithmetic mean |
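To make the fields concrete, here is one way such a summary could be computed from raw samples. This is a sketch in Go under a stated assumption: it uses the nearest-rank percentile method, which may differ from what the collector actually does (for example, it does not interpolate between samples). The names `Stats`, `percentile`, and `summarize` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Stats holds the six summary fields from the table above.
type Stats struct {
	Peak, P99, P95, P75, P50, Avg float64
}

// percentile returns the nearest-rank percentile of an ascending slice.
// Assumption: nearest-rank, no interpolation.
func percentile(sorted []float64, p float64) float64 {
	if len(sorted) == 0 {
		return 0
	}
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// summarize computes the summary without modifying the input slice.
func summarize(samples []float64) Stats {
	if len(samples) == 0 {
		return Stats{}
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	sum := 0.0
	for _, v := range sorted {
		sum += v
	}
	return Stats{
		Peak: sorted[len(sorted)-1],
		P99:  percentile(sorted, 99),
		P95:  percentile(sorted, 95),
		P75:  percentile(sorted, 75),
		P50:  percentile(sorted, 50),
		Avg:  sum / float64(len(sorted)),
	}
}

func main() {
	// Hypothetical cpu_cores samples taken at regular intervals.
	samples := []float64{0.4, 1.8, 2.0, 1.9, 2.0, 0.9}
	fmt.Printf("%+v\n", summarize(samples))
}
```

With few samples, the high percentiles collapse onto the peak under this method, which matches the pattern in the example response where `p99` and `p95` sit very close to `peak`.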
## Configuration

### Collector Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `GITHUB_REPOSITORY_OWNER` | Organization name | `my-org` |
| `GITHUB_REPOSITORY` | Full repository path | `my-org/my-repo` |
| `GITHUB_WORKFLOW` | Workflow filename | `ci.yml` |
| `GITHUB_JOB` | Job name | `build` |
| `GITHUB_RUN_ID` | Unique run identifier | `run-123` |
| `CGROUP_PROCESS_MAP` | JSON mapping process names to container names | `{"node":"runner"}` |
| `CGROUP_LIMITS` | JSON with CPU/memory limits per container | See below |

**CGROUP_LIMITS example:**
```json
{
  "runner": {"cpu": "2", "memory": "1Gi"},
  "sidecar": {"cpu": "500m", "memory": "256Mi"}
}
```

CPU values support Kubernetes notation: `"2"` = 2 cores, `"500m"` = 0.5 cores.

Memory values accept binary suffixes (`Ki`, `Mi`, `Gi`, `Ti`) or decimal suffixes (`K`, `M`, `G`, `T`).
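The notation can be illustrated with a small parser. The following Go sketch is not the collector's implementation: `parseCPU` and `parseMemory` are hypothetical names, and treating a memory value without a suffix as plain bytes is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// parseCPU converts Kubernetes CPU notation ("2", "500m") to cores.
func parseCPU(s string) (float64, error) {
	if strings.HasSuffix(s, "m") {
		milli, err := strconv.ParseFloat(strings.TrimSuffix(s, "m"), 64)
		if err != nil {
			return 0, fmt.Errorf("invalid CPU value %q: %w", s, err)
		}
		return milli / 1000, nil
	}
	cores, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid CPU value %q: %w", s, err)
	}
	return cores, nil
}

// memoryUnits lists binary (powers of 1024) and decimal (powers of 1000) suffixes.
var memoryUnits = []struct {
	suffix string
	factor float64
}{
	{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}, {"Ti", 1 << 40},
	{"K", 1e3}, {"M", 1e6}, {"G", 1e9}, {"T", 1e12},
}

// parseMemory converts values such as "1Gi" or "256M" to bytes.
// Assumption: a value without a suffix is already in bytes.
func parseMemory(s string) (int64, error) {
	for _, u := range memoryUnits {
		if strings.HasSuffix(s, u.suffix) {
			n, err := strconv.ParseFloat(strings.TrimSuffix(s, u.suffix), 64)
			if err != nil {
				return 0, fmt.Errorf("invalid memory value %q: %w", s, err)
			}
			return int64(n * u.factor), nil
		}
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid memory value %q: %w", s, err)
	}
	return n, nil
}

func main() {
	// The CGROUP_LIMITS example from above.
	raw := `{"runner": {"cpu": "2", "memory": "1Gi"}, "sidecar": {"cpu": "500m", "memory": "256Mi"}}`
	var limits map[string]struct {
		CPU    string `json:"cpu"`
		Memory string `json:"memory"`
	}
	if err := json.Unmarshal([]byte(raw), &limits); err != nil {
		panic(err)
	}
	for name, l := range limits {
		cores, err := parseCPU(l.CPU)
		if err != nil {
			panic(err)
		}
		memBytes, err := parseMemory(l.Memory)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %.3f cores, %d bytes\n", name, cores, memBytes)
	}
}
```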
### Receiver Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DB_PATH` | SQLite database path | `metrics.db` |
| `LISTEN_ADDR` | HTTP listen address | `:8080` |

## Running

### Docker Compose (stress test example)

```bash
docker compose -f test/docker/docker-compose-stress.yaml up -d
# Wait for metrics collection...
docker compose -f test/docker/docker-compose-stress.yaml stop collector
# Query results
curl http://localhost:9080/api/v1/metrics/repo/test-org/test-org%2Fstress-test/stress-test-workflow/heavy-workload
```
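The `%2F` in the `curl` command is the URL-encoded `/` of the repository name `test-org/stress-test`. A client written in Go could build the same URL as sketched below; `metricsURL` is a hypothetical helper, and escaping every path segment is an assumption based on this example rather than on the receiver's routing code.

```go
package main

import (
	"fmt"
	"net/url"
)

// metricsURL builds the query URL for one workflow and job.
// Each segment is escaped separately, so a "/" inside the
// repository name becomes %2F instead of splitting the path.
func metricsURL(base, org, repo, workflow, job string) string {
	return fmt.Sprintf("%s/api/v1/metrics/repo/%s/%s/%s/%s",
		base,
		url.PathEscape(org),
		url.PathEscape(repo),
		url.PathEscape(workflow),
		url.PathEscape(job),
	)
}

func main() {
	// Same values as the curl command above.
	fmt.Println(metricsURL(
		"http://localhost:9080",
		"test-org",
		"test-org/stress-test",
		"stress-test-workflow",
		"heavy-workload",
	))
}
```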
### Local Development

```bash
# Build
go build -o collector ./cmd/collector
go build -o receiver ./cmd/receiver

# Run receiver
./receiver --listen=:8080 --db=metrics.db

# Run collector
./collector --interval=2s --top=10 --push-endpoint=http://localhost:8080/api/v1/metrics
```