Update and extend documentation
This commit is contained in:
parent addab99e5d
commit 2a4c64bfb0

3 changed files with 212 additions and 571 deletions

README.md (251)

@@ -1,28 +1,100 @@
# Forgejo Runner Resource Collector

A lightweight resource metrics collector designed to run alongside CI/CD workloads in shared PID namespace environments. It collects CPU and memory metrics, groups them by container/cgroup, and pushes summaries to a receiver service.

A lightweight metrics collector for CI/CD workloads in shared PID namespace environments. Reads `/proc` to collect CPU and memory metrics, groups them by container/cgroup, and pushes run summaries to a receiver service for storage and querying.

## Components

## Architecture

- **Collector**: Gathers system and per-process metrics at regular intervals, computes run-level statistics, and pushes a summary on shutdown.
- **Receiver**: HTTP service that stores metric summaries in SQLite and provides a query API.

The system has two independent binaries:

## Receiver API

```
┌───────────────────────────────────────────┐       ┌──────────────────────────┐
│ CI/CD Pod (shared PID namespace)          │       │ Receiver Service         │
│                                           │       │                          │
│ ┌───────────┐  ┌────────┐  ┌───────────┐  │       │  POST /api/v1/metrics    │
│ │ collector │  │ runner │  │ sidecar   │  │       │            │             │
│ │           │  │        │  │           │  │       │            ▼             │
│ │ reads     │  │        │  │           │  │ push  │  ┌────────────┐          │
│ │ /proc for │  │        │  │           │  │──────▶│  │   SQLite   │          │
│ │ all PIDs  │  │        │  │           │  │       │  └────────────┘          │
│ └───────────┘  └────────┘  └───────────┘  │       │            │             │
│                                           │       │            ▼             │
└───────────────────────────────────────────┘       │ GET /api/v1/metrics/...  │
                                                     └──────────────────────────┘
```

### POST `/api/v1/metrics`

### Collector

Receives metric summaries from collectors.

Runs as a sidecar alongside CI workloads. On a configurable interval, it reads `/proc` to collect CPU and memory for all visible processes, groups them by container using cgroup paths, and accumulates samples. On shutdown (SIGINT/SIGTERM), it computes run-level statistics (peak, avg, percentiles) and pushes a single summary to the receiver.

### GET `/api/v1/metrics/repo/{org}/{repo}/{workflow}/{job}`

```bash
./collector --interval=2s --top=10 --push-endpoint=http://receiver:8080/api/v1/metrics
```

Retrieves all stored metrics for a specific workflow and job.

**Flags:** `--interval`, `--proc-path`, `--log-level`, `--log-format`, `--top`, `--push-endpoint`
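
The interval-driven loop and shutdown push described above can be sketched as follows. This is an illustrative sketch only (the 2-second ticker mirrors `--interval=2s`); the real loop lives in `internal/collector` and its structure may differ.

```go
// loop.go: sketch of the collect-then-push-on-shutdown loop described above.
// The ticker interval and the push step are placeholders, not the project's code.
package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Stop on SIGINT/SIGTERM, the signals mentioned above.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	ticker := time.NewTicker(2 * time.Second) // --interval
	defer ticker.Stop()

	samples := 0
	for {
		select {
		case <-ticker.C:
			samples++ // the real collector reads /proc here and accumulates a sample
		case <-ctx.Done():
			// Compute the run summary once and push it to the receiver, then exit.
			fmt.Printf("pushing summary built from %d samples\n", samples)
			return
		}
	}
}
```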

**Environment variables:**

| Variable | Description | Example |
|----------|-------------|---------|
| `GITHUB_REPOSITORY_OWNER` | Organization name | `my-org` |
| `GITHUB_REPOSITORY` | Full repository path | `my-org/my-repo` |
| `GITHUB_WORKFLOW` | Workflow filename | `ci.yml` |
| `GITHUB_JOB` | Job name | `build` |
| `GITHUB_RUN_ID` | Unique run identifier | `run-123` |
| `CGROUP_PROCESS_MAP` | JSON: process name → container name | `{"node":"runner"}` |
| `CGROUP_LIMITS` | JSON: per-container CPU/memory limits | See below |

**CGROUP_LIMITS example:**

```json
{
  "runner": {"cpu": "2", "memory": "1Gi"},
  "sidecar": {"cpu": "500m", "memory": "256Mi"}
}
```

CPU supports Kubernetes notation (`"2"` = 2 cores, `"500m"` = 0.5 cores). Memory supports `Ki`, `Mi`, `Gi`, `Ti` (binary) or `K`, `M`, `G`, `T` (decimal).
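
This notation can be interpreted with a few lines of parsing. The sketch below is illustrative only (function names are hypothetical); the project's actual parsing lives in `internal/cgroup` and may differ.

```go
// quantity.go: illustrative parser for the CPU/memory notation shown above.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPU turns "2" into 2.0 cores and "500m" into 0.5 cores.
func parseCPU(s string) (float64, error) {
	if strings.HasSuffix(s, "m") {
		milli, err := strconv.ParseFloat(strings.TrimSuffix(s, "m"), 64)
		if err != nil {
			return 0, err
		}
		return milli / 1000, nil
	}
	return strconv.ParseFloat(s, 64)
}

// parseMemory turns "1Gi" into bytes using binary suffixes, "1G" using decimal ones.
func parseMemory(s string) (uint64, error) {
	suffixes := []struct {
		unit string
		mult uint64
	}{
		// Binary suffixes are checked first so "Mi" is not matched as "M".
		{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}, {"Ti", 1 << 40},
		{"K", 1000}, {"M", 1000000}, {"G", 1000000000}, {"T", 1000000000000},
	}
	for _, sfx := range suffixes {
		if strings.HasSuffix(s, sfx.unit) {
			n, err := strconv.ParseUint(strings.TrimSuffix(s, sfx.unit), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * sfx.mult, nil
		}
	}
	return strconv.ParseUint(s, 10, 64)
}

func main() {
	cores, _ := parseCPU("500m")
	bytes, _ := parseMemory("1Gi")
	fmt.Println(cores, bytes) // 0.5 1073741824
}
```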

### Receiver

HTTP service that stores metric summaries in SQLite (via GORM) and exposes a query API.

```bash
./receiver --addr=:8080 --db=metrics.db
```
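
As context for how a summary might be persisted, here is a hedged sketch using GORM's SQLite driver. The model fields are assumptions for illustration; the real schema is defined in `internal/receiver`.

```go
// store.go: hedged sketch of storing a summary row with GORM and SQLite.
package main

import (
	"gorm.io/driver/sqlite"
	"gorm.io/gorm"
)

// MetricSummary is an illustrative model, not the project's actual schema.
type MetricSummary struct {
	ID       uint `gorm:"primaryKey"`
	Org      string
	Repo     string
	Workflow string
	Job      string
	RunID    string
	Payload  string // full summary JSON as received
}

func main() {
	db, err := gorm.Open(sqlite.Open("metrics.db"), &gorm.Config{})
	if err != nil {
		panic(err)
	}
	// Create the table if it does not exist yet.
	if err := db.AutoMigrate(&MetricSummary{}); err != nil {
		panic(err)
	}
	db.Create(&MetricSummary{Org: "my-org", Repo: "my-org/my-repo", Workflow: "ci.yml", Job: "build", RunID: "run-123"})
}
```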

| Variable | Description | Default |
|----------|-------------|---------|
| `DB_PATH` | SQLite database path | `metrics.db` |
| `LISTEN_ADDR` | HTTP listen address | `:8080` |

**Endpoints:**

- `POST /api/v1/metrics` — receive and store a metric summary
- `GET /api/v1/metrics/repo/{org}/{repo}/{workflow}/{job}` — query stored metrics
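
A summary can be pushed with a plain HTTP POST. The payload below is a trimmed, hypothetical example based on the response fields shown later in this README; the collector's real payload contains more fields.

```go
// push.go: hedged example of pushing a summary to the receiver endpoint above.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	payload := []byte(`{
	  "duration_seconds": 20.0,
	  "sample_count": 11,
	  "cpu_total_percent": {"peak": 42.0, "avg": 17.3},
	  "mem_used_bytes": {"peak": 18567168, "avg": 18567168}
	}`)

	resp, err := http.Post("http://localhost:8080/api/v1/metrics", "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("receiver responded with", resp.Status)
}
```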

## How Metrics Are Collected

The collector reads `/proc/[pid]/stat` for every visible process to get CPU ticks (`utime` + `stime`) and `/proc/[pid]/status` for memory (RSS). It takes two samples per interval and computes the delta to derive CPU usage rates.

Processes are grouped into containers by reading `/proc/[pid]/cgroup` and matching cgroup paths against the `CGROUP_PROCESS_MAP`. This is necessary because in shared PID namespace pods, `/proc/stat` only shows host-level aggregates — per-container metrics must be built up from individual process data.
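
A minimal sketch of that per-process read (assuming cgroup v2's `0::` entry; the project's real parsing is in `internal/proc` and `internal/metrics`):

```go
// procscan.go: minimal sketch of reading ticks and the cgroup path per PID.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

type procSample struct {
	pid    int
	ticks  uint64 // utime + stime from /proc/[pid]/stat
	cgroup string // cgroup path from /proc/[pid]/cgroup ("0::" entry on cgroup v2)
}

func readSample(pid int) (procSample, error) {
	stat, err := os.ReadFile(filepath.Join("/proc", strconv.Itoa(pid), "stat"))
	if err != nil {
		return procSample{}, err
	}
	// Fields after the closing ')' start at field 3 (state); utime and stime are
	// fields 14 and 15 overall, i.e. indexes 11 and 12 after this split.
	rest := string(stat[strings.LastIndexByte(string(stat), ')')+2:])
	f := strings.Fields(rest)
	utime, _ := strconv.ParseUint(f[11], 10, 64)
	stime, _ := strconv.ParseUint(f[12], 10, 64)

	cg, err := os.ReadFile(filepath.Join("/proc", strconv.Itoa(pid), "cgroup"))
	if err != nil {
		return procSample{}, err
	}
	path := ""
	for _, line := range strings.Split(string(cg), "\n") {
		if strings.HasPrefix(line, "0::") { // cgroup v2 unified entry
			path = strings.TrimPrefix(line, "0::")
		}
	}
	return procSample{pid: pid, ticks: utime + stime, cgroup: path}, nil
}

func main() {
	entries, _ := os.ReadDir("/proc")
	for _, e := range entries {
		if pid, err := strconv.Atoi(e.Name()); err == nil {
			if s, err := readSample(pid); err == nil {
				fmt.Printf("pid=%d ticks=%d cgroup=%s\n", s.pid, s.ticks, s.cgroup)
			}
		}
	}
}
```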

Container CPU is reported in **cores** (not percentage) for direct comparison with Kubernetes resource limits. System-level CPU is reported as a percentage (0-100%).

Over the course of a run, the `summary.Accumulator` tracks every sample and on shutdown computes:

| Stat | Description |
|------|-------------|
| `peak` | Maximum observed value |
| `p99`, `p95`, `p75`, `p50` | Percentiles across all samples |
| `avg` | Arithmetic mean |

These stats are computed for CPU, memory, and per-container metrics.
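
A minimal sketch of how these run-level statistics can be derived from accumulated samples. The nearest-rank percentile used here is an assumption; `summary.Accumulator` may use a different interpolation.

```go
// stats.go: sketch of computing the run-level statistics listed above.
package main

import (
	"fmt"
	"sort"
)

type Stats struct {
	Peak, P99, P95, P75, P50, Avg float64
}

func compute(samples []float64) Stats {
	if len(samples) == 0 {
		return Stats{}
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	percentile := func(p float64) float64 {
		// Nearest-rank: smallest value covering p percent of samples.
		idx := int(p/100*float64(len(sorted))+0.5) - 1
		if idx < 0 {
			idx = 0
		}
		if idx >= len(sorted) {
			idx = len(sorted) - 1
		}
		return sorted[idx]
	}

	sum := 0.0
	for _, v := range sorted {
		sum += v
	}
	return Stats{
		Peak: sorted[len(sorted)-1],
		P99:  percentile(99), P95: percentile(95),
		P75: percentile(75), P50: percentile(50),
		Avg: sum / float64(len(sorted)),
	}
}

func main() {
	fmt.Printf("%+v\n", compute([]float64{1.2, 1.8, 2.0, 1.5, 1.9}))
}
```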

## API Response

**Example request:**

```
GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build
```

**Example response:**

```json
[
  {

@@ -38,151 +110,66 @@ GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build
"end_time": "2026-02-06T14:30:22.190Z",
|
||||
"duration_seconds": 20.0,
|
||||
"sample_count": 11,
|
||||
"cpu_total_percent": { ... },
|
||||
"mem_used_bytes": { ... },
|
||||
"mem_used_percent": { ... },
|
||||
"top_cpu_processes": [ ... ],
|
||||
"top_mem_processes": [ ... ],
|
||||
"cpu_total_percent": { "peak": ..., "avg": ..., "p50": ... },
|
||||
"mem_used_bytes": { "peak": ..., "avg": ... },
|
||||
"containers": [
|
||||
{
|
||||
"name": "runner",
|
||||
"cpu_cores": {
|
||||
"peak": 2.007,
|
||||
"p99": 2.005,
|
||||
"p95": 2.004,
|
||||
"p75": 1.997,
|
||||
"p50": 1.817,
|
||||
"avg": 1.5
|
||||
},
|
||||
"memory_bytes": {
|
||||
"peak": 18567168,
|
||||
"p99": 18567168,
|
||||
"p95": 18567168,
|
||||
"p75": 18567168,
|
||||
"p50": 18567168,
|
||||
"avg": 18567168
|
||||
}
|
||||
"cpu_cores": { "peak": 2.007, "avg": 1.5, "p50": 1.817, "p95": 2.004 },
|
||||
"memory_bytes": { "peak": 18567168, "avg": 18567168 }
|
||||
}
|
||||
]
|
||||
],
|
||||
"top_cpu_processes": [ ... ],
|
||||
"top_mem_processes": [ ... ]
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||

## Understanding the Metrics

**CPU metric distinction:**

- `cpu_total_percent` — system-wide, 0-100%
- `cpu_cores` (containers) — cores used (e.g. `2.0` = two full cores)
- `peak_cpu_percent` (processes) — per-process, where 100% = 1 core

### CPU Metrics

There are two different CPU metric formats in the response:

#### 1. System and Process CPU: Percentage (`cpu_total_percent`, `peak_cpu_percent`)

These values represent **CPU utilization as a percentage** of total available CPU time.

- `cpu_total_percent`: Overall system CPU usage (0-100%)
- `peak_cpu_percent` (in process lists): Per-process CPU usage where 100% = 1 full CPU core

#### 2. Container CPU: Cores (`cpu_cores`)

**Important:** The `cpu_cores` field in container metrics represents **CPU usage in number of cores**, not percentage.

| Value | Meaning |
|-------|---------|
| `0.5` | Half a CPU core |
| `1.0` | One full CPU core |
| `2.0` | Two CPU cores |
| `2.5` | Two and a half CPU cores |

This allows direct comparison with Kubernetes resource limits (e.g., `cpu: "2"` or `cpu: "500m"`).

**Example interpretation:**

```json
{
  "name": "runner",
  "cpu_cores": {
    "peak": 2.007,
    "avg": 1.5
  }
}
```

This means the "runner" container used a peak of ~2 CPU cores and averaged 1.5 CPU cores during the run.
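
How a `cpu_cores` value falls out of the raw tick deltas can be shown in a few lines. This is an illustrative sketch; the constant and function name are assumptions, and the real conversion lives in `internal/metrics`.

```go
// cores.go: deriving a "cpu_cores" value from two tick samples.
package main

import "fmt"

const clkTck = 100.0 // typical USER_HZ; query with `getconf CLK_TCK`

// coresUsed converts a delta of (utime+stime) ticks over an interval into cores.
func coresUsed(deltaTicks uint64, elapsedSeconds float64) float64 {
	return float64(deltaTicks) / (clkTck * elapsedSeconds)
}

func main() {
	// 400 ticks over 2 seconds at 100 Hz means 2 cores were kept busy.
	fmt.Println(coresUsed(400, 2.0)) // 2
}
```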

### Memory Metrics

All memory values are in **bytes**:

- `mem_used_bytes`: System memory usage
- `memory_bytes` (in containers): Container RSS memory usage
- `peak_mem_rss_bytes` (in processes): Process RSS memory

### Statistical Fields

Each metric includes percentile statistics across all samples:

| Field | Description |
|-------|-------------|
| `peak` | Maximum value observed |
| `p99` | 99th percentile |
| `p95` | 95th percentile |
| `p75` | 75th percentile |
| `p50` | Median (50th percentile) |
| `avg` | Arithmetic mean |

## Configuration

### Collector Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `GITHUB_REPOSITORY_OWNER` | Organization name | `my-org` |
| `GITHUB_REPOSITORY` | Full repository path | `my-org/my-repo` |
| `GITHUB_WORKFLOW` | Workflow filename | `ci.yml` |
| `GITHUB_JOB` | Job name | `build` |
| `GITHUB_RUN_ID` | Unique run identifier | `run-123` |
| `CGROUP_PROCESS_MAP` | JSON mapping process names to container names | `{"node":"runner"}` |
| `CGROUP_LIMITS` | JSON with CPU/memory limits per container | See below |

**CGROUP_LIMITS example:**

```json
{
  "runner": {"cpu": "2", "memory": "1Gi"},
  "sidecar": {"cpu": "500m", "memory": "256Mi"}
}
```

CPU values support Kubernetes notation: `"2"` = 2 cores, `"500m"` = 0.5 cores.

Memory values support: `Ki`, `Mi`, `Gi`, `Ti` (binary) or `K`, `M`, `G`, `T` (decimal).

### Receiver Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DB_PATH` | SQLite database path | `metrics.db` |
| `LISTEN_ADDR` | HTTP listen address | `:8080` |

All memory values are in **bytes**.

## Running

### Docker Compose (stress test example)

### Docker Compose

```bash
docker compose -f test/docker/docker-compose-stress.yaml up -d

# Wait for metrics collection...
# Wait for collection, then trigger shutdown summary:
docker compose -f test/docker/docker-compose-stress.yaml stop collector

# Query results
# Query results:
curl http://localhost:9080/api/v1/metrics/repo/test-org/test-org%2Fstress-test/stress-test-workflow/heavy-workload
```

### Local Development

### Local

```bash
# Build
go build -o collector ./cmd/collector
go build -o receiver ./cmd/receiver

# Run receiver
./receiver --listen=:8080 --db=metrics.db

# Run collector
./receiver --addr=:8080 --db=metrics.db
./collector --interval=2s --top=10 --push-endpoint=http://localhost:8080/api/v1/metrics
```

## Internal Packages

| Package | Purpose |
|---------|---------|
| `internal/proc` | Low-level `/proc` parsing (stat, status, cgroup) |
| `internal/metrics` | Aggregates process metrics from `/proc` into system/container views |
| `internal/cgroup` | Parses `CGROUP_PROCESS_MAP` and `CGROUP_LIMITS` env vars |
| `internal/collector` | Orchestrates the collection loop and shutdown |
| `internal/summary` | Accumulates samples, computes stats, pushes to receiver |
| `internal/receiver` | HTTP handlers and SQLite store |
| `internal/output` | Metrics output formatting (JSON/text) |

## Background

Technical reference on the Linux primitives this project builds on:

- [Identifying process cgroups by PID](docs/background/identify-process-cgroup-by-pid.md) — how to read `/proc/<PID>/cgroup` to determine which container a process belongs to
- [/proc/stat behavior in containers](docs/background/proc-stat-in-containers.md) — why `/proc/stat` shows host-level data in containers, and how to aggregate per-process stats from `/proc/[pid]/stat` instead, including CPU tick conversion and cgroup limit handling

docs/background/identify-process-cgroup-by-pid.md

@@ -1,26 +1,14 @@

# Identify Process Cgroup by PID

# Identifying a Process's Cgroup by PID

## TL;DR

Read `/proc/<PID>/cgroup` to find which cgroup (and therefore which container) a process belongs to.

You can identify a process's cgroup by reading `/proc/<PID>/cgroup`.

## Outcome

Multiple methods to find which cgroup a process belongs to using its PID.

## Conversation

**Question:** Can I identify the process's cgroup by having its PID?

**Answer:** Yes, here are the main methods:

### Method 1: Read `/proc/<PID>/cgroup`

## /proc/PID/cgroup

```bash
cat /proc/<PID>/cgroup
```

This shows all cgroup controllers the process belongs to. For example:

Shows all cgroup controllers the process belongs to:

```
12:blkio:/user.slice
11:memory:/user.slice/user-1000.slice

@@ -28,35 +16,23 @@ This shows all cgroup controllers the process belongs to. For example:
0::/user.slice/user-1000.slice/session-1.scope
```

### Method 2: Using `ps` with format options

On cgroup v2, the path after `::` is the cgroup path under `/sys/fs/cgroup/`.

## Other Methods

```bash
# ps format options
ps -o pid,cgroup -p <PID>
```

### Method 3: Using `systemd-cgls`

If you're on a systemd system:

```bash
# systemd systems
systemd-cgls --unit <unit-name>
# or to see the whole tree
systemd-cgls
systemd-cgls   # whole tree
```

### Method 4: Check cgroup v2 unified hierarchy

On cgroup v2 systems, check:

```bash
cat /proc/<PID>/cgroup
# Output will look like: 0::/user.slice/user-1000.slice/session-1.scope
```

The path after `::` is the cgroup path under `/sys/fs/cgroup/`.

### Quick one-liner

## Quick One-Liners

```bash
cat /proc/self/cgroup   # For current shell
cat /proc/$$/cgroup     # Also current shell
cat /proc/1234/cgroup   # For PID 1234
cat /proc/self/cgroup   # current shell
cat /proc/$$/cgroup     # also current shell
cat /proc/1234/cgroup   # specific PID
```

docs/background/proc-stat-in-containers.md

@@ -1,64 +1,36 @@
# /proc/stat Behavior in Containerized Environments

# /proc/stat behavior in containerised environments

## TL;DR

`/proc/stat` in containers shows **host-level** statistics, not container-specific data. To get container-aware CPU metrics when processes span multiple cgroups (e.g., sidecars sharing a PID namespace), aggregate `/proc/[pid]/stat` for all visible processes and use cgroup limits from `/sys/fs/cgroup` for normalization.

`/proc/stat` in containers shows host-level statistics, not container-specific data. To get accurate container CPU metrics when cgroups aren't an option (e.g., sidecars sharing PID namespace with different cgroups), aggregate `/proc/[pid]/stat` for all visible processes and calculate percentages using cgroup limits from `/sys/fs/cgroup`.

## Why /proc/stat is wrong in containers

## Outcome

`/proc/stat` reports host-wide values (CPU times, context switches, boot time, process count) because `/proc` is mounted from the host kernel, which has no namespace awareness for these metrics.

- Explained why `/proc/stat` doesn't provide container-aware metrics
- Provided methods to aggregate process CPU usage from `/proc/[pid]/stat`
- Showed how to determine parent/child process relationships
- Demonstrated converting raw ticks to CPU percentages
- Covered how to read cgroup CPU limits for other containers

This means:

- Tools reading `/proc/stat` (top, htop, etc.) show **host** CPU usage, not container usage
- Cgroup CPU limits (e.g., 2 CPUs) are not reflected — all host CPUs are visible
- In shared environments, containers see each other's aggregate impact

## Conversation

### How does /proc/stat behave in containerized environments?

By default, `/proc/stat` inside a container shows **host system statistics**, not container-specific data. This includes:

- **CPU times** (user, system, idle, etc.) - reflects all CPUs on the host
- **Context switches** - host-wide count
- **Boot time** - host's boot time
- **Processes created** - host-wide total

#### Why This Happens

The `/proc` filesystem is mounted from the host kernel. Containers share the host kernel, and `/proc/stat` is a kernel-level interface that doesn't have namespace awareness for most of its metrics.

#### Practical Implications

1. **CPU usage calculations are wrong** - Tools like `top`, `htop`, or custom monitoring that read `/proc/stat` will show host CPU usage, not container usage
2. **Resource limits aren't reflected** - Even if you limit a container to 2 CPUs via cgroups, `/proc/stat` shows all host CPUs
3. **Multi-tenant confusion** - In shared environments, containers see each other's aggregate impact

#### Solutions

### Alternatives

| Approach | Description |
|----------|-------------|
| **cgroups** | Read from `/sys/fs/cgroup/cpu/` for container-specific CPU accounting |
| **LXCFS** | A FUSE filesystem that provides container-aware `/proc` files |
| **Container runtimes** | Some runtimes (like Kata) use VMs with isolated kernels |
| **Metrics APIs** | Use Docker/Kubernetes APIs instead of `/proc/stat` |

#### Example: Getting Real Container CPU

| **cgroups** | Read `/sys/fs/cgroup/cpu/` for container-specific CPU accounting |
| **LXCFS** | FUSE filesystem providing container-aware `/proc` files |
| **Container runtimes** | Some (like Kata) use VMs with isolated kernels |
| **Metrics APIs** | Docker/Kubernetes APIs instead of `/proc/stat` |

```bash
# Instead of /proc/stat, use cgroups v2:
# cgroups v2:
cat /sys/fs/cgroup/cpu.stat

# Or cgroups v1:
# cgroups v1:
cat /sys/fs/cgroup/cpu/cpuacct.usage
```

---

## Aggregating per-Process CPU from /proc/[pid]/stat

### Can I aggregate process usage to get the usage of the container?

Yes, you can aggregate process usage, but there are caveats.

#### Aggregating from /proc/[pid]/stat

When cgroup-level reads aren't an option (sidecars sharing PID namespace with different cgroups), aggregate individual process stats:

```bash
# Fields 14 (utime) and 15 (stime) in /proc/[pid]/stat

@@ -67,137 +39,76 @@ for pid in /proc/[0-9]*; do
done | awk '{sum += $1} END {print sum}'
```

#### Problems with This Approach

### Caveats

1. **Race conditions** - Processes can spawn/die between reads, causing inconsistent totals
2. **Short-lived processes** - Processes that start and exit between samples are missed entirely
3. **Zombie/exited processes** - Their CPU time may not be captured
4. **Overhead** - Scanning all PIDs repeatedly is expensive
5. **Namespace visibility** - You only see processes in your PID namespace, but that's actually what you want in a container
6. **Children accounting** - When a process exits, its CPU time gets added to the parent's `cutime`/`cstime` fields, which can cause double-counting if not handled carefully

1. **Race conditions** — processes can spawn/die between reads
2. **Short-lived processes** — missed if they start and exit between samples
3. **Zombie/exited processes** — their CPU time may not be captured
4. **Overhead** — scanning all PIDs repeatedly is expensive
5. **Namespace visibility** — you only see processes in your PID namespace (which is what you want)
6. **Children accounting** — when a process exits, its CPU time is added to the parent's `cutime`/`cstime`, risking double-counting

#### Better Alternative: cgroups

Cgroups handle these edge cases natively, but **cannot be used when sidecars share the PID namespace with different cgroups** — in that case, per-process aggregation is the best option.

Cgroups track CPU usage at the group level and handle all these edge cases:

## Parent/Child Process Relationships

Field 4 in `/proc/[pid]/stat` is the PPID (parent process ID):

```bash
# cgroups v2 (modern)
cat /sys/fs/cgroup/cpu.stat
# Shows: usage_usec, user_usec, system_usec

# cgroups v1
cat /sys/fs/cgroup/cpu/cpuacct.usage   # total nanoseconds
cat /sys/fs/cgroup/cpu/cpuacct.stat    # user/system jiffies
awk '{print $4}' /proc/1234/stat       # PPID from stat
grep PPid /proc/1234/status            # more readable
```

**However**, cgroups cannot be used when sidecars share the PID namespace but have different cgroups. In that case, aggregating `/proc/[pid]/stat` is the best option.

---

### How do I determine child/parent processes?

#### From /proc/[pid]/stat

Field 4 is the **PPID** (parent process ID):

```bash
# Get PPID for a specific process
awk '{print $4}' /proc/1234/stat

# List all processes with their parents
for stat in /proc/[0-9]*/stat; do
  pid="${stat#/proc/}"
  pid="${pid%/stat}"
  ppid=$(awk '{print $4}' "$stat" 2>/dev/null) && \
    echo "$pid -> parent: $ppid"
done
```

#### From /proc/[pid]/status (more readable)

```bash
grep PPid /proc/1234/status
# PPid: 1
```

#### Building a Process Tree

### Building a Process Tree

```bash
#!/bin/bash
declare -A parent_of
declare -A children_of
declare -A parent_of children_of

for stat in /proc/[0-9]*/stat; do
  if read -r line < "$stat" 2>/dev/null; then
    pid="${stat#/proc/}"
    pid="${pid%/stat}"

    # Extract PPID (field 4, but handle comm with spaces)
    rest="${line##*) }"
    read -ra fields <<< "$rest"
    pid="${stat#/proc/}"; pid="${pid%/stat}"
    rest="${line##*) }"; read -ra fields <<< "$rest"
    ppid="${fields[1]}"   # 4th field overall = index 1 after state

    parent_of[$pid]=$ppid
    children_of[$ppid]+="$pid "
  fi
done

# Print tree from PID 1
print_tree() {
  local pid=$1
  local indent=$2
  local pid=$1 indent=$2
  echo "${indent}${pid}"
  for child in ${children_of[$pid]}; do
    print_tree "$child" " $indent"
  done
  for child in ${children_of[$pid]}; do print_tree "$child" " $indent"; done
}

print_tree 1 ""
```

#### For CPU Aggregation: Handling cutime/cstime

### Avoiding Double-Counting with cutime/cstime

To properly handle `cutime`/`cstime` without double-counting:

Only sum `utime` + `stime` per process. The `cutime`/`cstime` fields are cumulative from children that have already exited and been `wait()`ed on — those children no longer exist in `/proc`, so their time is only accessible via the parent.

```bash
#!/bin/bash
declare -A parent_of
declare -A utime stime

# First pass: collect all data
for stat in /proc/[0-9]*/stat; do
  if read -r line < "$stat" 2>/dev/null; then
    pid="${stat#/proc/}"
    pid="${pid%/stat}"
    rest="${line##*) }"
    read -ra f <<< "$rest"

    parent_of[$pid]="${f[1]}"
    utime[$pid]="${f[11]}"
    stime[$pid]="${f[12]}"
    # cutime=${f[13]} cstime=${f[14]} - don't sum these
    pid="${stat#/proc/}"; pid="${pid%/stat}"
    rest="${line##*) }"; read -ra f <<< "$rest"
    utime[$pid]="${f[11]}"; stime[$pid]="${f[12]}"
    # cutime=${f[13]} cstime=${f[14]} — don't sum these
  fi
done

# Sum only utime/stime (not cutime/cstime)
total=0
for pid in "${!utime[@]}"; do
  ((total += utime[$pid] + stime[$pid]))
done

for pid in "${!utime[@]}"; do ((total += utime[$pid] + stime[$pid])); done
echo "Total CPU ticks: $total"
echo "Seconds: $(echo "scale=2; $total / $(getconf CLK_TCK)" | bc)"
```

**Key insight:** Only sum `utime` + `stime` for each process. The `cutime`/`cstime` fields are cumulative from children that have already exited and been `wait()`ed on—those children no longer exist in `/proc`, so their time is only accessible via the parent's `cutime`/`cstime`.

## Converting Ticks to CPU Percentages

---

### How do I convert utime/stime to percentages?

You need **two samples** over a time interval. CPU percentage is a rate, not an absolute value.

#### The Formula

CPU percentage is a rate — you need **two samples** over a time interval.

```
CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100
```

@@ -205,20 +116,17 @@ CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100

- `delta_ticks` = difference in (utime + stime) between samples
- `CLK_TCK` = ticks per second (usually 100, get via `getconf CLK_TCK`)
- `num_cpus` = number of CPUs (omit for single-CPU percentage)

#### Two Common Percentage Styles

- `num_cpus` = number of CPUs (omit for per-core percentage)

| Style | Formula | Example |
|-------|---------|---------|
| **Normalized** (0-100%) | `delta / (elapsed * CLK_TCK * num_cpus) * 100` | 50% = half of total capacity |
| **Cores-style** (0-N*100%) | `delta / (elapsed * CLK_TCK) * 100` | 200% = 2 full cores busy |

#### Practical Script

### Sampling Script

```bash
#!/bin/bash

CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)

@@ -226,267 +134,94 @@ get_total_ticks() {
  local total=0
  for stat in /proc/[0-9]*/stat; do
    if read -r line < "$stat" 2>/dev/null; then
      rest="${line##*) }"
      read -ra f <<< "$rest"
      ((total += f[11] + f[12]))   # utime + stime
      rest="${line##*) }"; read -ra f <<< "$rest"
      ((total += f[11] + f[12]))
    fi
  done
  echo "$total"
}

# First sample
ticks1=$(get_total_ticks)
time1=$(date +%s.%N)

# Wait
ticks1=$(get_total_ticks); time1=$(date +%s.%N)
sleep 1
ticks2=$(get_total_ticks); time2=$(date +%s.%N)

# Second sample
ticks2=$(get_total_ticks)
time2=$(date +%s.%N)

# Calculate
delta_ticks=$((ticks2 - ticks1))
elapsed=$(echo "$time2 - $time1" | bc)

# Percentage of total CPU capacity (all cores)
pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK * $NUM_CPUS)) * 100" | bc)
echo "CPU usage: ${pct}% of ${NUM_CPUS} CPUs"

# Percentage as "CPU cores used" (like top's 200% for 2 full cores)
cores_pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK)) * 100" | bc)
echo "CPU usage: ${cores_pct}% (cores-style)"
```

#### Continuous Monitoring

## Respecting Cgroup CPU Limits

The above calculations use `nproc`, which returns the **host** CPU count. If a container is limited to 2 CPUs on an 8-CPU host, `nproc` returns 8 and the percentage is misleading.

### Reading Effective CPU Limit

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)
INTERVAL=1

get_total_ticks() {
  local total=0
  for stat in /proc/[0-9]*/stat; do
    read -r line < "$stat" 2>/dev/null || continue
    rest="${line##*) }"
    read -ra f <<< "$rest"
    ((total += f[11] + f[12]))
  done
  echo "$total"
}

prev_ticks=$(get_total_ticks)
prev_time=$(date +%s.%N)

while true; do
  sleep "$INTERVAL"

  curr_ticks=$(get_total_ticks)
  curr_time=$(date +%s.%N)

  delta=$((curr_ticks - prev_ticks))
  elapsed=$(echo "$curr_time - $prev_time" | bc)

  pct=$(echo "scale=1; $delta / ($elapsed * $CLK_TCK * $NUM_CPUS) * 100" | bc)
  printf "\rCPU: %5.1f%%" "$pct"

  prev_ticks=$curr_ticks
  prev_time=$curr_time
done
```

---

### Does this calculation respect cgroup limits?

No, it doesn't. The calculation uses `nproc` which typically returns **host CPU count**, not your cgroup limit.

#### The Problem

If your container is limited to 2 CPUs on an 8-CPU host:

- `nproc` returns 8
- Your calculation shows 25% when you're actually at 100% of your limit

#### Getting Effective CPU Limit

**cgroups v2:**

```bash
# cpu.max contains: $quota $period (in microseconds)
# "max 100000" means unlimited
read quota period < /sys/fs/cgroup/cpu.max
if [[ "$quota" == "max" ]]; then
  effective_cpus=$(nproc)
else
  effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
echo "Effective CPUs: $effective_cpus"
```

**cgroups v1:**

```bash
quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)

if [[ "$quota" == "-1" ]]; then
  effective_cpus=$(nproc)
else
  effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
```

**Also Check cpuset Limits:**

```bash
# cgroups v2
cpuset=$(cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null)

# cgroups v1
cpuset=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null)

# Parse "0-3,5,7" format to count CPUs
count_cpus() {
  local count=0
  IFS=',' read -ra ranges <<< "$1"
  for range in "${ranges[@]}"; do
    if [[ "$range" == *-* ]]; then
      start="${range%-*}"
      end="${range#*-}"
      ((count += end - start + 1))
    else
      ((count++))
    fi
  done
  echo "$count"
}
```

#### Updated Script Respecting Limits

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)

get_effective_cpus() {
  # Try cgroups v2 first
  # cgroups v2
  if [[ -f /sys/fs/cgroup/cpu.max ]]; then
    read quota period < /sys/fs/cgroup/cpu.max
    if [[ "$quota" != "max" ]]; then
      echo "scale=2; $quota / $period" | bc
      return
    fi
    [[ "$quota" != "max" ]] && echo "scale=2; $quota / $period" | bc && return
  fi

  # Try cgroups v1
  # cgroups v1
  if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then
    quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
    period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
    if [[ "$quota" != "-1" ]]; then
      echo "scale=2; $quota / $period" | bc
      return
    fi
    [[ "$quota" != "-1" ]] && echo "scale=2; $quota / $period" | bc && return
  fi

  # Fallback to nproc
  nproc
  nproc   # fallback
}

EFFECTIVE_CPUS=$(get_effective_cpus)
echo "Effective CPU limit: $EFFECTIVE_CPUS"

# ... rest of your sampling logic using $EFFECTIVE_CPUS
```

#### Shared PID Namespace Consideration

Also check cpuset limits (`cpuset.cpus.effective` for v2, `cpuset/cpuset.cpus` for v1) which restrict which physical CPUs are available.

When sidecars share PID namespace but have different cgroups:

- Each container may have different CPU limits
- You're aggregating processes across those limits
- There's no single "correct" limit to use

### Shared PID Namespace Complication

When sidecars share a PID namespace but have different cgroups, there's no single "correct" CPU limit for normalization. Options:

**Options:**

1. **Use host CPU count** — percentage of total host capacity
2. **Sum the limits** — if you know each sidecar's cgroup, sum their quotas
3. **Report in cores** — skip normalization, just show `1.5 cores used` instead of percentage
3. **Report in cores** — skip normalization, show `1.5 cores used` instead of percentage

---

## Reading Cgroup Limits for Other Containers

### Can I get the cgroup limit for another cgroup?

Yes, if you have visibility into the cgroup filesystem.

#### 1. Find a Process's Cgroup

Every process exposes its cgroup membership:

Every process exposes its cgroup membership via `/proc/<PID>/cgroup`. If the cgroup filesystem is mounted, you can read any container's limits:

```bash
# Get cgroup for any PID you can see
cat /proc/1234/cgroup

# cgroups v2 output:
# 0::/kubepods/pod123/container456

# cgroups v1 output:
# 12:cpu,cpuacct:/docker/abc123
# 11:memory:/docker/abc123
# ...
```

#### 2. Read That Cgroup's Limits

If the cgroup filesystem is mounted and accessible:

```bash
#!/bin/bash

get_cgroup_cpu_limit() {
  local pid=$1

  # Get cgroup path for this PID
  cgroup_path=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)   # v2

  # cgroups v2
  cgroup_path=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)
  if [[ -n "$cgroup_path" ]]; then
    # cgroups v2
    limit_file="/sys/fs/cgroup${cgroup_path}/cpu.max"
    if [[ -r "$limit_file" ]]; then
      read quota period < "$limit_file"
      if [[ "$quota" == "max" ]]; then
        echo "unlimited"
      else
        echo "scale=2; $quota / $period" | bc
      fi
      [[ "$quota" == "max" ]] && echo "unlimited" || echo "scale=2; $quota / $period" | bc
      return
    fi
  fi

  # Try cgroups v1
  # cgroups v1
  cgroup_path=$(grep -oP 'cpu.*:\K.*' /proc/$pid/cgroup 2>/dev/null)
  if [[ -n "$cgroup_path" ]]; then
    quota_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_quota_us"
    period_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_period_us"
    if [[ -r "$quota_file" ]]; then
      quota=$(cat "$quota_file")
      period=$(cat "$period_file")
      if [[ "$quota" == "-1" ]]; then
        echo "unlimited"
      else
        echo "scale=2; $quota / $period" | bc
      fi
      quota=$(cat "$quota_file"); period=$(cat "$period_file")
      [[ "$quota" == "-1" ]] && echo "unlimited" || echo "scale=2; $quota / $period" | bc
      return
    fi
  fi

  echo "unknown"
}

# Example: get limit for PID 1234
get_cgroup_cpu_limit 1234
```

#### 3. Mount Visibility Requirements

### Mount Visibility

| Scenario | Can Read Other Cgroups? |
|----------|------------------------|

@@ -495,66 +230,9 @@ get_cgroup_cpu_limit 1234
| `/sys/fs/cgroup` mounted read-only from host | Yes (common in Kubernetes) |
| Only own cgroup subtree mounted | No |

Check what's visible:

### Fallbacks When Cgroups Aren't Accessible

```bash
mount | grep cgroup
ls /sys/fs/cgroup/
```

#### 4. Full Solution: Aggregate by Cgroup

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)

declare -A cgroup_ticks
declare -A cgroup_limit

for stat in /proc/[0-9]*/stat; do
  pid="${stat#/proc/}"
  pid="${pid%/stat}"

  # Get cgroup for this process
  cg=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)
  [[ -z "$cg" ]] && continue

  # Get CPU ticks
  if read -r line < "$stat" 2>/dev/null; then
    rest="${line##*) }"
    read -ra f <<< "$rest"
    ticks=$((f[11] + f[12]))

    ((cgroup_ticks[$cg] += ticks))

    # Cache the limit (only look up once per cgroup)
    if [[ -z "${cgroup_limit[$cg]}" ]]; then
      limit_file="/sys/fs/cgroup${cg}/cpu.max"
      if [[ -r "$limit_file" ]]; then
        read quota period < "$limit_file"
        if [[ "$quota" == "max" ]]; then
          cgroup_limit[$cg]="unlimited"
        else
          cgroup_limit[$cg]=$(echo "scale=2; $quota / $period" | bc)
        fi
      else
        cgroup_limit[$cg]="unknown"
      fi
    fi
  fi
done

echo "Ticks by cgroup:"
for cg in "${!cgroup_ticks[@]}"; do
  echo "  $cg: ${cgroup_ticks[$cg]} ticks (limit: ${cgroup_limit[$cg]} CPUs)"
done
```

#### If You Can't Access Other Cgroups

Fallback options:

1. **Mount the cgroup fs** — add volume mount for `/sys/fs/cgroup:ro`
2. **Use a sidecar with access** — one privileged container does the monitoring
1. **Mount the cgroup fs** — volume mount `/sys/fs/cgroup:ro`
2. **Use a sidecar with access** — one privileged container does monitoring
3. **Accept "unknown" limits** — report raw ticks/cores instead of percentages
4. **Kubernetes Downward API** — inject limits as env vars (only for your own container though)
4. **Kubernetes Downward API** — inject limits as env vars (own container only)