diff --git a/README.md b/README.md index e5ceea1..196adf2 100644 --- a/README.md +++ b/README.md @@ -1,28 +1,100 @@ # Forgejo Runner Resource Collector -A lightweight resource metrics collector designed to run alongside CI/CD workloads in shared PID namespace environments. It collects CPU and memory metrics, groups them by container/cgroup, and pushes summaries to a receiver service. +A lightweight metrics collector for CI/CD workloads in shared PID namespace environments. Reads `/proc` to collect CPU and memory metrics, groups them by container/cgroup, and pushes run summaries to a receiver service for storage and querying. -## Components +## Architecture -- **Collector**: Gathers system and per-process metrics at regular intervals, computes run-level statistics, and pushes a summary on shutdown. -- **Receiver**: HTTP service that stores metric summaries in SQLite and provides a query API. +The system has two independent binaries: -## Receiver API +``` +┌─────────────────────────────────────────────┐ ┌──────────────────────────┐ +│ CI/CD Pod (shared PID namespace) │ │ Receiver Service │ +│ │ │ │ +│ ┌───────────┐ ┌────────┐ ┌───────────┐ │ │ POST /api/v1/metrics │ +│ │ collector │ │ runner │ │ sidecar │ │ │ │ │ +│ │ │ │ │ │ │ │ │ ▼ │ +│ │ reads │ │ │ │ │ │ push │ ┌────────────┐ │ +│ │ /proc for │ │ │ │ │ │──────▶│ │ SQLite │ │ +│ │ all PIDs │ │ │ │ │ │ │ └────────────┘ │ +│ └───────────┘ └────────┘ └───────────┘ │ │ │ │ +│ │ │ ▼ │ +└─────────────────────────────────────────────┘ │ GET /api/v1/metrics/... │ + └──────────────────────────┘ +``` -### POST `/api/v1/metrics` +### Collector -Receives metric summaries from collectors. +Runs as a sidecar alongside CI workloads. On a configurable interval, it reads `/proc` to collect CPU and memory for all visible processes, groups them by container using cgroup paths, and accumulates samples. On shutdown (SIGINT/SIGTERM), it computes run-level statistics (peak, avg, percentiles) and pushes a single summary to the receiver. 
-### GET `/api/v1/metrics/repo/{org}/{repo}/{workflow}/{job}` +```bash +./collector --interval=2s --top=10 --push-endpoint=http://receiver:8080/api/v1/metrics +``` -Retrieves all stored metrics for a specific workflow and job. +**Flags:** `--interval`, `--proc-path`, `--log-level`, `--log-format`, `--top`, `--push-endpoint` + +**Environment variables:** + +| Variable | Description | Example | +|----------|-------------|---------| +| `GITHUB_REPOSITORY_OWNER` | Organization name | `my-org` | +| `GITHUB_REPOSITORY` | Full repository path | `my-org/my-repo` | +| `GITHUB_WORKFLOW` | Workflow filename | `ci.yml` | +| `GITHUB_JOB` | Job name | `build` | +| `GITHUB_RUN_ID` | Unique run identifier | `run-123` | +| `CGROUP_PROCESS_MAP` | JSON: process name → container name | `{"node":"runner"}` | +| `CGROUP_LIMITS` | JSON: per-container CPU/memory limits | See below | + +**CGROUP_LIMITS example:** +```json +{ + "runner": {"cpu": "2", "memory": "1Gi"}, + "sidecar": {"cpu": "500m", "memory": "256Mi"} +} +``` +CPU supports Kubernetes notation (`"2"` = 2 cores, `"500m"` = 0.5 cores). Memory supports `Ki`, `Mi`, `Gi`, `Ti` (binary) or `K`, `M`, `G`, `T` (decimal). + +### Receiver + +HTTP service that stores metric summaries in SQLite (via GORM) and exposes a query API. + +```bash +./receiver --addr=:8080 --db=metrics.db +``` + +| Variable | Description | Default | +|----------|-------------|---------| +| `DB_PATH` | SQLite database path | `metrics.db` | +| `LISTEN_ADDR` | HTTP listen address | `:8080` | + +**Endpoints:** + +- `POST /api/v1/metrics` — receive and store a metric summary +- `GET /api/v1/metrics/repo/{org}/{repo}/{workflow}/{job}` — query stored metrics + +## How Metrics Are Collected + +The collector reads `/proc/[pid]/stat` for every visible process to get CPU ticks (`utime` + `stime`) and `/proc/[pid]/status` for memory (RSS). It takes two samples per interval and computes the delta to derive CPU usage rates. 
+ +Processes are grouped into containers by reading `/proc/[pid]/cgroup` and matching cgroup paths against the `CGROUP_PROCESS_MAP`. This is necessary because in shared PID namespace pods, `/proc/stat` only shows host-level aggregates — per-container metrics must be built up from individual process data. + +Container CPU is reported in **cores** (not percentage) for direct comparison with Kubernetes resource limits. System-level CPU is reported as a percentage (0-100%). + +Over the course of a run, the `summary.Accumulator` tracks every sample and on shutdown computes: + +| Stat | Description | +|------|-------------| +| `peak` | Maximum observed value | +| `p99`, `p95`, `p75`, `p50` | Percentiles across all samples | +| `avg` | Arithmetic mean | + +These stats are computed for CPU, memory, and per-container metrics. + +## API Response -**Example request:** ``` GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build ``` -**Example response:** ```json [ { @@ -38,151 +110,66 @@ GET /api/v1/metrics/repo/my-org/my-repo/ci.yml/build "end_time": "2026-02-06T14:30:22.190Z", "duration_seconds": 20.0, "sample_count": 11, - "cpu_total_percent": { ... }, - "mem_used_bytes": { ... }, - "mem_used_percent": { ... }, - "top_cpu_processes": [ ... ], - "top_mem_processes": [ ... ], + "cpu_total_percent": { "peak": ..., "avg": ..., "p50": ... }, + "mem_used_bytes": { "peak": ..., "avg": ... }, "containers": [ { "name": "runner", - "cpu_cores": { - "peak": 2.007, - "p99": 2.005, - "p95": 2.004, - "p75": 1.997, - "p50": 1.817, - "avg": 1.5 - }, - "memory_bytes": { - "peak": 18567168, - "p99": 18567168, - "p95": 18567168, - "p75": 18567168, - "p50": 18567168, - "avg": 18567168 - } + "cpu_cores": { "peak": 2.007, "avg": 1.5, "p50": 1.817, "p95": 2.004 }, + "memory_bytes": { "peak": 18567168, "avg": 18567168 } } - ] + ], + "top_cpu_processes": [ ... ], + "top_mem_processes": [ ... 
] } } ] ``` -## Understanding the Metrics +**CPU metric distinction:** +- `cpu_total_percent` — system-wide, 0-100% +- `cpu_cores` (containers) — cores used (e.g. `2.0` = two full cores) +- `peak_cpu_percent` (processes) — per-process, where 100% = 1 core -### CPU Metrics - -There are two different CPU metric formats in the response: - -#### 1. System and Process CPU: Percentage (`cpu_total_percent`, `peak_cpu_percent`) - -These values represent **CPU utilization as a percentage** of total available CPU time. - -- `cpu_total_percent`: Overall system CPU usage (0-100%) -- `peak_cpu_percent` (in process lists): Per-process CPU usage where 100% = 1 full CPU core - -#### 2. Container CPU: Cores (`cpu_cores`) - -**Important:** The `cpu_cores` field in container metrics represents **CPU usage in number of cores**, not percentage. - -| Value | Meaning | -|-------|---------| -| `0.5` | Half a CPU core | -| `1.0` | One full CPU core | -| `2.0` | Two CPU cores | -| `2.5` | Two and a half CPU cores | - -This allows direct comparison with Kubernetes resource limits (e.g., `cpu: "2"` or `cpu: "500m"`). - -**Example interpretation:** -```json -{ - "name": "runner", - "cpu_cores": { - "peak": 2.007, - "avg": 1.5 - } -} -``` -This means the "runner" container used a peak of ~2 CPU cores and averaged 1.5 CPU cores during the run. 
- -### Memory Metrics - -All memory values are in **bytes**: - -- `mem_used_bytes`: System memory usage -- `memory_bytes` (in containers): Container RSS memory usage -- `peak_mem_rss_bytes` (in processes): Process RSS memory - -### Statistical Fields - -Each metric includes percentile statistics across all samples: - -| Field | Description | -|-------|-------------| -| `peak` | Maximum value observed | -| `p99` | 99th percentile | -| `p95` | 95th percentile | -| `p75` | 75th percentile | -| `p50` | Median (50th percentile) | -| `avg` | Arithmetic mean | - -## Configuration - -### Collector Environment Variables - -| Variable | Description | Example | -|----------|-------------|---------| -| `GITHUB_REPOSITORY_OWNER` | Organization name | `my-org` | -| `GITHUB_REPOSITORY` | Full repository path | `my-org/my-repo` | -| `GITHUB_WORKFLOW` | Workflow filename | `ci.yml` | -| `GITHUB_JOB` | Job name | `build` | -| `GITHUB_RUN_ID` | Unique run identifier | `run-123` | -| `CGROUP_PROCESS_MAP` | JSON mapping process names to container names | `{"node":"runner"}` | -| `CGROUP_LIMITS` | JSON with CPU/memory limits per container | See below | - -**CGROUP_LIMITS example:** -```json -{ - "runner": {"cpu": "2", "memory": "1Gi"}, - "sidecar": {"cpu": "500m", "memory": "256Mi"} -} -``` - -CPU values support Kubernetes notation: `"2"` = 2 cores, `"500m"` = 0.5 cores. - -Memory values support: `Ki`, `Mi`, `Gi`, `Ti` (binary) or `K`, `M`, `G`, `T` (decimal). - -### Receiver Environment Variables - -| Variable | Description | Default | -|----------|-------------|---------| -| `DB_PATH` | SQLite database path | `metrics.db` | -| `LISTEN_ADDR` | HTTP listen address | `:8080` | +All memory values are in **bytes**. ## Running -### Docker Compose (stress test example) +### Docker Compose ```bash docker compose -f test/docker/docker-compose-stress.yaml up -d -# Wait for metrics collection... 
+# Wait for collection, then trigger shutdown summary:
 docker compose -f test/docker/docker-compose-stress.yaml stop collector
 
-# Query results
+# Query results:
 curl http://localhost:9080/api/v1/metrics/repo/test-org/test-org%2Fstress-test/stress-test-workflow/heavy-workload
 ```
 
-### Local Development
+### Local
 
 ```bash
-# Build
 go build -o collector ./cmd/collector
 go build -o receiver ./cmd/receiver
 
-# Run receiver
-./receiver --listen=:8080 --db=metrics.db
-
-# Run collector
+./receiver --addr=:8080 --db=metrics.db
 ./collector --interval=2s --top=10 --push-endpoint=http://localhost:8080/api/v1/metrics
 ```
+
+## Internal Packages
+
+| Package | Purpose |
+|---------|---------|
+| `internal/proc` | Low-level `/proc` parsing (stat, status, cgroup) |
+| `internal/metrics` | Aggregates process metrics from `/proc` into system/container views |
+| `internal/cgroup` | Parses `CGROUP_PROCESS_MAP` and `CGROUP_LIMITS` env vars |
+| `internal/collector` | Orchestrates the collection loop and shutdown |
+| `internal/summary` | Accumulates samples, computes stats, pushes to receiver |
+| `internal/receiver` | HTTP handlers and SQLite store |
+| `internal/output` | Metrics output formatting (JSON/text) |
+
+## Background
+
+Technical reference on the Linux primitives this project builds on:
+
+- [Identifying process cgroups by PID](docs/background/identify-process-cgroup-by-pid.md) — how to read `/proc/[pid]/cgroup` to determine which container a process belongs to
+- [/proc/stat behavior in containers](docs/background/proc-stat-in-containers.md) — why `/proc/stat` shows host-level data in containers, and how to aggregate per-process stats from `/proc/[pid]/stat` instead, including CPU tick conversion and cgroup limit handling
diff --git a/docs/background/identify-process-cgroup-by-pid.md b/docs/background/identify-process-cgroup-by-pid.md
index 062abba..7d4734d 100644
--- a/docs/background/identify-process-cgroup-by-pid.md
+++ b/docs/background/identify-process-cgroup-by-pid.md
@@ -1,26 +1,14 @@
-# Identify Process Cgroup by PID
+# Identifying a Process's Cgroup by PID
 
-## TL;DR
+Read `/proc/<pid>/cgroup` to find which cgroup (and therefore which container) a process belongs to.
-
-You can identify a process's cgroup by reading `/proc/<pid>/cgroup`.
-
-## Outcome
-
-Multiple methods to find which cgroup a process belongs to using its PID.
-
-## Conversation
-
-**Question:** Can I identify the process's cgroup by having its PID?
-
-**Answer:** Yes, here are the main methods:
-
-### Method 1: Read `/proc/<pid>/cgroup`
+## /proc/PID/cgroup
 
 ```bash
 cat /proc/<pid>/cgroup
 ```
 
-This shows all cgroup controllers the process belongs to. For example:
+Shows all cgroup controllers the process belongs to:
 
 ```
 12:blkio:/user.slice
 11:memory:/user.slice/user-1000.slice
@@ -28,35 +16,23 @@ This shows all cgroup controllers the process belongs to. For example:
 0::/user.slice/user-1000.slice/session-1.scope
 ```
 
-### Method 2: Using `ps` with format options
+On cgroup v2, the path after `::` is the cgroup path under `/sys/fs/cgroup/`.
+
+## Other Methods
 
 ```bash
+# ps format options
 ps -o pid,cgroup -p <pid>
-```
 
-### Method 3: Using `systemd-cgls`
-
-If you're on a systemd system:
-```bash
+# systemd systems
 systemd-cgls --unit <unit>
-# or to see the whole tree
-systemd-cgls
+systemd-cgls # whole tree
 ```
 
-### Method 4: Check cgroup v2 unified hierarchy
-
-On cgroup v2 systems, check:
-```bash
-cat /proc/<pid>/cgroup
-# Output will look like: 0::/user.slice/user-1000.slice/session-1.scope
-```
-
-The path after `::` is the cgroup path under `/sys/fs/cgroup/`.
-
-### Quick one-liner
+## Quick One-Liners
 
 ```bash
-cat /proc/self/cgroup # For current shell
-cat /proc/$$/cgroup # Also current shell
-cat /proc/1234/cgroup # For PID 1234
+cat /proc/self/cgroup # current shell
+cat /proc/$$/cgroup # also current shell
+cat /proc/1234/cgroup # specific PID
 ```
diff --git a/docs/background/proc-stat-in-containers.md b/docs/background/proc-stat-in-containers.md
index 7c33575..9884b2a 100644
--- a/docs/background/proc-stat-in-containers.md
+++ b/docs/background/proc-stat-in-containers.md
@@ -1,64 +1,36 @@
-# /proc/stat Behavior in Containerized Environments
+# /proc/stat behavior in containerized environments
 
-## TL;DR
+`/proc/stat` in containers shows **host-level** statistics, not container-specific data. To get container-aware CPU metrics when processes span multiple cgroups (e.g., sidecars sharing a PID namespace), aggregate `/proc/[pid]/stat` for all visible processes and use cgroup limits from `/sys/fs/cgroup` for normalization.
 
-`/proc/stat` in containers shows host-level statistics, not container-specific data. To get accurate container CPU metrics when cgroups aren't an option (e.g., sidecars sharing PID namespace with different cgroups), aggregate `/proc/[pid]/stat` for all visible processes and calculate percentages using cgroup limits from `/sys/fs/cgroup`.
+## Why /proc/stat is wrong in containers
 
-## Outcome
+`/proc/stat` reports host-wide values (CPU times, context switches, boot time, process count) because `/proc` is mounted from the host kernel, which has no namespace awareness for these metrics.
 
-- Explained why `/proc/stat` doesn't provide container-aware metrics
-- Provided methods to aggregate process CPU usage from `/proc/[pid]/stat`
-- Showed how to determine parent/child process relationships
-- Demonstrated converting raw ticks to CPU percentages
-- Covered how to read cgroup CPU limits for other containers
+This means:
+- Tools reading `/proc/stat` (top, htop, etc.)
show **host** CPU usage, not container usage +- Cgroup CPU limits (e.g., 2 CPUs) are not reflected — all host CPUs are visible +- In shared environments, containers see each other's aggregate impact -## Conversation - -### How does /proc/stat behave in containerized environments? - -By default, `/proc/stat` inside a container shows **host system statistics**, not container-specific data. This includes: - -- **CPU times** (user, system, idle, etc.) - reflects all CPUs on the host -- **Context switches** - host-wide count -- **Boot time** - host's boot time -- **Processes created** - host-wide total - -#### Why This Happens - -The `/proc` filesystem is mounted from the host kernel. Containers share the host kernel, and `/proc/stat` is a kernel-level interface that doesn't have namespace awareness for most of its metrics. - -#### Practical Implications - -1. **CPU usage calculations are wrong** - Tools like `top`, `htop`, or custom monitoring that read `/proc/stat` will show host CPU usage, not container usage -2. **Resource limits aren't reflected** - Even if you limit a container to 2 CPUs via cgroups, `/proc/stat` shows all host CPUs -3. 
**Multi-tenant confusion** - In shared environments, containers see each other's aggregate impact - -#### Solutions +### Alternatives | Approach | Description | |----------|-------------| -| **cgroups** | Read from `/sys/fs/cgroup/cpu/` for container-specific CPU accounting | -| **LXCFS** | A FUSE filesystem that provides container-aware `/proc` files | -| **Container runtimes** | Some runtimes (like Kata) use VMs with isolated kernels | -| **Metrics APIs** | Use Docker/Kubernetes APIs instead of `/proc/stat` | - -#### Example: Getting Real Container CPU +| **cgroups** | Read `/sys/fs/cgroup/cpu/` for container-specific CPU accounting | +| **LXCFS** | FUSE filesystem providing container-aware `/proc` files | +| **Container runtimes** | Some (like Kata) use VMs with isolated kernels | +| **Metrics APIs** | Docker/Kubernetes APIs instead of `/proc/stat` | ```bash -# Instead of /proc/stat, use cgroups v2: +# cgroups v2: cat /sys/fs/cgroup/cpu.stat -# Or cgroups v1: +# cgroups v1: cat /sys/fs/cgroup/cpu/cpuacct.usage ``` ---- +## Aggregating per-Process CPU from /proc/[pid]/stat -### Can I aggregate process usage to get the usage of the container? - -Yes, you can aggregate process usage, but there are caveats. - -#### Aggregating from /proc/[pid]/stat +When cgroup-level reads aren't an option (sidecars sharing PID namespace with different cgroups), aggregate individual process stats: ```bash # Fields 14 (utime) and 15 (stime) in /proc/[pid]/stat @@ -67,137 +39,76 @@ for pid in /proc/[0-9]*; do done | awk '{sum += $1} END {print sum}' ``` -#### Problems with This Approach +### Caveats -1. **Race conditions** - Processes can spawn/die between reads, causing inconsistent totals -2. **Short-lived processes** - Processes that start and exit between samples are missed entirely -3. **Zombie/exited processes** - Their CPU time may not be captured -4. **Overhead** - Scanning all PIDs repeatedly is expensive -5. 
**Namespace visibility** - You only see processes in your PID namespace, but that's actually what you want in a container -6. **Children accounting** - When a process exits, its CPU time gets added to the parent's `cutime`/`cstime` fields, which can cause double-counting if not handled carefully +1. **Race conditions** — processes can spawn/die between reads +2. **Short-lived processes** — missed if they start and exit between samples +3. **Zombie/exited processes** — their CPU time may not be captured +4. **Overhead** — scanning all PIDs repeatedly is expensive +5. **Namespace visibility** — you only see processes in your PID namespace (which is what you want) +6. **Children accounting** — when a process exits, its CPU time is added to the parent's `cutime`/`cstime`, risking double-counting -#### Better Alternative: cgroups +Cgroups handle these edge cases natively, but **cannot be used when sidecars share the PID namespace with different cgroups** — in that case, per-process aggregation is the best option. -Cgroups track CPU usage at the group level and handle all these edge cases: +## Parent/Child Process Relationships + +Field 4 in `/proc/[pid]/stat` is the PPID (parent process ID): ```bash -# cgroups v2 (modern) -cat /sys/fs/cgroup/cpu.stat -# Shows: usage_usec, user_usec, system_usec - -# cgroups v1 -cat /sys/fs/cgroup/cpu/cpuacct.usage # total nanoseconds -cat /sys/fs/cgroup/cpu/cpuacct.stat # user/system jiffies +awk '{print $4}' /proc/1234/stat # PPID from stat +grep PPid /proc/1234/status # more readable ``` -**However**, cgroups cannot be used when sidecars share the PID namespace but have different cgroups. In that case, aggregating `/proc/[pid]/stat` is the best option. - ---- - -### How do I determine child/parent processes? 
- -#### From /proc/[pid]/stat - -Field 4 is the **PPID** (parent process ID): - -```bash -# Get PPID for a specific process -awk '{print $4}' /proc/1234/stat - -# List all processes with their parents -for stat in /proc/[0-9]*/stat; do - pid="${stat#/proc/}" - pid="${pid%/stat}" - ppid=$(awk '{print $4}' "$stat" 2>/dev/null) && \ - echo "$pid -> parent: $ppid" -done -``` - -#### From /proc/[pid]/status (more readable) - -```bash -grep PPid /proc/1234/status -# PPid: 1 -``` - -#### Building a Process Tree +### Building a Process Tree ```bash #!/bin/bash -declare -A parent_of -declare -A children_of +declare -A parent_of children_of for stat in /proc/[0-9]*/stat; do if read -r line < "$stat" 2>/dev/null; then - pid="${stat#/proc/}" - pid="${pid%/stat}" - - # Extract PPID (field 4, but handle comm with spaces) - rest="${line##*) }" - read -ra fields <<< "$rest" + pid="${stat#/proc/}"; pid="${pid%/stat}" + rest="${line##*) }"; read -ra fields <<< "$rest" ppid="${fields[1]}" # 4th field overall = index 1 after state - parent_of[$pid]=$ppid children_of[$ppid]+="$pid " fi done -# Print tree from PID 1 print_tree() { - local pid=$1 - local indent=$2 + local pid=$1 indent=$2 echo "${indent}${pid}" - for child in ${children_of[$pid]}; do - print_tree "$child" " $indent" - done + for child in ${children_of[$pid]}; do print_tree "$child" " $indent"; done } - print_tree 1 "" ``` -#### For CPU Aggregation: Handling cutime/cstime +### Avoiding Double-Counting with cutime/cstime -To properly handle `cutime`/`cstime` without double-counting: +Only sum `utime` + `stime` per process. The `cutime`/`cstime` fields are cumulative from children that have already exited and been `wait()`ed on — those children no longer exist in `/proc`, so their time is only accessible via the parent. 
```bash #!/bin/bash -declare -A parent_of declare -A utime stime -# First pass: collect all data for stat in /proc/[0-9]*/stat; do if read -r line < "$stat" 2>/dev/null; then - pid="${stat#/proc/}" - pid="${pid%/stat}" - rest="${line##*) }" - read -ra f <<< "$rest" - - parent_of[$pid]="${f[1]}" - utime[$pid]="${f[11]}" - stime[$pid]="${f[12]}" - # cutime=${f[13]} cstime=${f[14]} - don't sum these + pid="${stat#/proc/}"; pid="${pid%/stat}" + rest="${line##*) }"; read -ra f <<< "$rest" + utime[$pid]="${f[11]}"; stime[$pid]="${f[12]}" + # cutime=${f[13]} cstime=${f[14]} — don't sum these fi done -# Sum only utime/stime (not cutime/cstime) total=0 -for pid in "${!utime[@]}"; do - ((total += utime[$pid] + stime[$pid])) -done - +for pid in "${!utime[@]}"; do ((total += utime[$pid] + stime[$pid])); done echo "Total CPU ticks: $total" echo "Seconds: $(echo "scale=2; $total / $(getconf CLK_TCK)" | bc)" ``` -**Key insight:** Only sum `utime` + `stime` for each process. The `cutime`/`cstime` fields are cumulative from children that have already exited and been `wait()`ed on—those children no longer exist in `/proc`, so their time is only accessible via the parent's `cutime`/`cstime`. +## Converting Ticks to CPU Percentages ---- - -### How do I convert utime/stime to percentages? - -You need **two samples** over a time interval. CPU percentage is a rate, not an absolute value. - -#### The Formula +CPU percentage is a rate — you need **two samples** over a time interval. 
``` CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100 @@ -205,20 +116,17 @@ CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100 - `delta_ticks` = difference in (utime + stime) between samples - `CLK_TCK` = ticks per second (usually 100, get via `getconf CLK_TCK`) -- `num_cpus` = number of CPUs (omit for single-CPU percentage) - -#### Two Common Percentage Styles +- `num_cpus` = number of CPUs (omit for per-core percentage) | Style | Formula | Example | |-------|---------|---------| | **Normalized** (0-100%) | `delta / (elapsed * CLK_TCK * num_cpus) * 100` | 50% = half of total capacity | | **Cores-style** (0-N*100%) | `delta / (elapsed * CLK_TCK) * 100` | 200% = 2 full cores busy | -#### Practical Script +### Sampling Script ```bash #!/bin/bash - CLK_TCK=$(getconf CLK_TCK) NUM_CPUS=$(nproc) @@ -226,267 +134,94 @@ get_total_ticks() { local total=0 for stat in /proc/[0-9]*/stat; do if read -r line < "$stat" 2>/dev/null; then - rest="${line##*) }" - read -ra f <<< "$rest" - ((total += f[11] + f[12])) # utime + stime + rest="${line##*) }"; read -ra f <<< "$rest" + ((total += f[11] + f[12])) fi done echo "$total" } -# First sample -ticks1=$(get_total_ticks) -time1=$(date +%s.%N) - -# Wait +ticks1=$(get_total_ticks); time1=$(date +%s.%N) sleep 1 +ticks2=$(get_total_ticks); time2=$(date +%s.%N) -# Second sample -ticks2=$(get_total_ticks) -time2=$(date +%s.%N) - -# Calculate delta_ticks=$((ticks2 - ticks1)) elapsed=$(echo "$time2 - $time1" | bc) -# Percentage of total CPU capacity (all cores) pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK * $NUM_CPUS)) * 100" | bc) echo "CPU usage: ${pct}% of ${NUM_CPUS} CPUs" -# Percentage as "CPU cores used" (like top's 200% for 2 full cores) cores_pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK)) * 100" | bc) echo "CPU usage: ${cores_pct}% (cores-style)" ``` -#### Continuous Monitoring +## Respecting Cgroup CPU Limits + +The above calculations use `nproc`, which returns the 
**host** CPU count. If a container is limited to 2 CPUs on an 8-CPU host, `nproc` returns 8 and the percentage is misleading. + +### Reading Effective CPU Limit ```bash #!/bin/bash -CLK_TCK=$(getconf CLK_TCK) -NUM_CPUS=$(nproc) -INTERVAL=1 - -get_total_ticks() { - local total=0 - for stat in /proc/[0-9]*/stat; do - read -r line < "$stat" 2>/dev/null || continue - rest="${line##*) }" - read -ra f <<< "$rest" - ((total += f[11] + f[12])) - done - echo "$total" -} - -prev_ticks=$(get_total_ticks) -prev_time=$(date +%s.%N) - -while true; do - sleep "$INTERVAL" - - curr_ticks=$(get_total_ticks) - curr_time=$(date +%s.%N) - - delta=$((curr_ticks - prev_ticks)) - elapsed=$(echo "$curr_time - $prev_time" | bc) - - pct=$(echo "scale=1; $delta / ($elapsed * $CLK_TCK * $NUM_CPUS) * 100" | bc) - printf "\rCPU: %5.1f%%" "$pct" - - prev_ticks=$curr_ticks - prev_time=$curr_time -done -``` - ---- - -### Does this calculation respect cgroup limits? - -No, it doesn't. The calculation uses `nproc` which typically returns **host CPU count**, not your cgroup limit. 
- -#### The Problem - -If your container is limited to 2 CPUs on an 8-CPU host: -- `nproc` returns 8 -- Your calculation shows 25% when you're actually at 100% of your limit - -#### Getting Effective CPU Limit - -**cgroups v2:** - -```bash -# cpu.max contains: $quota $period (in microseconds) -# "max 100000" means unlimited -read quota period < /sys/fs/cgroup/cpu.max -if [[ "$quota" == "max" ]]; then - effective_cpus=$(nproc) -else - effective_cpus=$(echo "scale=2; $quota / $period" | bc) -fi -echo "Effective CPUs: $effective_cpus" -``` - -**cgroups v1:** - -```bash -quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us) -period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us) - -if [[ "$quota" == "-1" ]]; then - effective_cpus=$(nproc) -else - effective_cpus=$(echo "scale=2; $quota / $period" | bc) -fi -``` - -**Also Check cpuset Limits:** - -```bash -# cgroups v2 -cpuset=$(cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null) - -# cgroups v1 -cpuset=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null) - -# Parse "0-3,5,7" format to count CPUs -count_cpus() { - local count=0 - IFS=',' read -ra ranges <<< "$1" - for range in "${ranges[@]}"; do - if [[ "$range" == *-* ]]; then - start="${range%-*}" - end="${range#*-}" - ((count += end - start + 1)) - else - ((count++)) - fi - done - echo "$count" -} -``` - -#### Updated Script Respecting Limits - -```bash -#!/bin/bash -CLK_TCK=$(getconf CLK_TCK) - get_effective_cpus() { - # Try cgroups v2 first + # cgroups v2 if [[ -f /sys/fs/cgroup/cpu.max ]]; then read quota period < /sys/fs/cgroup/cpu.max - if [[ "$quota" != "max" ]]; then - echo "scale=2; $quota / $period" | bc - return - fi + [[ "$quota" != "max" ]] && echo "scale=2; $quota / $period" | bc && return fi - - # Try cgroups v1 + # cgroups v1 if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us) period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us) - if [[ "$quota" != "-1" ]]; then - echo "scale=2; $quota / $period" | bc - return - 
fi
+        [[ "$quota" != "-1" ]] && echo "scale=2; $quota / $period" | bc && return
    fi
-
-    # Fallback to nproc
-    nproc
+    nproc # fallback
 }
-
-EFFECTIVE_CPUS=$(get_effective_cpus)
-echo "Effective CPU limit: $EFFECTIVE_CPUS"
-
-# ... rest of your sampling logic using $EFFECTIVE_CPUS
 ```
 
-#### Shared PID Namespace Consideration
+Also check cpuset limits (`cpuset.cpus.effective` for v2, `cpuset/cpuset.cpus` for v1) which restrict which physical CPUs are available.
 
-When sidecars share PID namespace but have different cgroups:
-- Each container may have different CPU limits
-- You're aggregating processes across those limits
-- There's no single "correct" limit to use
+### Shared PID Namespace Complication
+
+When sidecars share a PID namespace but have different cgroups, there's no single "correct" CPU limit for normalization. Options:
 
-**Options:**
 1. **Use host CPU count** — percentage of total host capacity
 2. **Sum the limits** — if you know each sidecar's cgroup, sum their quotas
-3. **Report in cores** — skip normalization, just show `1.5 cores used` instead of percentage
+3. **Report in cores** — skip normalization, show `1.5 cores used` instead of percentage
 
----
+## Reading Cgroup Limits for Other Containers
 
-### Can I get the cgroup limit for another cgroup?
-
-Yes, if you have visibility into the cgroup filesystem.
-
-#### 1. Find a Process's Cgroup
-
-Every process exposes its cgroup membership:
+Every process exposes its cgroup membership via `/proc/[pid]/cgroup`. If the cgroup filesystem is mounted, you can read any container's limits:
 
 ```bash
-# Get cgroup for any PID you can see
-cat /proc/1234/cgroup
-
-# cgroups v2 output:
-# 0::/kubepods/pod123/container456
-
-# cgroups v1 output:
-# 12:cpu,cpuacct:/docker/abc123
-# 11:memory:/docker/abc123
-# ...
-```
-
-#### 2.
Read That Cgroup's Limits - -If the cgroup filesystem is mounted and accessible: - -```bash -#!/bin/bash - get_cgroup_cpu_limit() { local pid=$1 - - # Get cgroup path for this PID - cgroup_path=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null) # v2 - + # cgroups v2 + cgroup_path=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null) if [[ -n "$cgroup_path" ]]; then - # cgroups v2 limit_file="/sys/fs/cgroup${cgroup_path}/cpu.max" if [[ -r "$limit_file" ]]; then read quota period < "$limit_file" - if [[ "$quota" == "max" ]]; then - echo "unlimited" - else - echo "scale=2; $quota / $period" | bc - fi + [[ "$quota" == "max" ]] && echo "unlimited" || echo "scale=2; $quota / $period" | bc return fi fi - - # Try cgroups v1 + # cgroups v1 cgroup_path=$(grep -oP 'cpu.*:\K.*' /proc/$pid/cgroup 2>/dev/null) if [[ -n "$cgroup_path" ]]; then quota_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_quota_us" period_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_period_us" if [[ -r "$quota_file" ]]; then - quota=$(cat "$quota_file") - period=$(cat "$period_file") - if [[ "$quota" == "-1" ]]; then - echo "unlimited" - else - echo "scale=2; $quota / $period" | bc - fi + quota=$(cat "$quota_file"); period=$(cat "$period_file") + [[ "$quota" == "-1" ]] && echo "unlimited" || echo "scale=2; $quota / $period" | bc return fi fi - echo "unknown" } - -# Example: get limit for PID 1234 -get_cgroup_cpu_limit 1234 ``` -#### 3. Mount Visibility Requirements +### Mount Visibility | Scenario | Can Read Other Cgroups? | |----------|------------------------| @@ -495,66 +230,9 @@ get_cgroup_cpu_limit 1234 | `/sys/fs/cgroup` mounted read-only from host | Yes (common in Kubernetes) | | Only own cgroup subtree mounted | No | -Check what's visible: +### Fallbacks When Cgroups Aren't Accessible -```bash -mount | grep cgroup -ls /sys/fs/cgroup/ -``` - -#### 4. 
Full Solution: Aggregate by Cgroup - -```bash -#!/bin/bash -CLK_TCK=$(getconf CLK_TCK) - -declare -A cgroup_ticks -declare -A cgroup_limit - -for stat in /proc/[0-9]*/stat; do - pid="${stat#/proc/}" - pid="${pid%/stat}" - - # Get cgroup for this process - cg=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null) - [[ -z "$cg" ]] && continue - - # Get CPU ticks - if read -r line < "$stat" 2>/dev/null; then - rest="${line##*) }" - read -ra f <<< "$rest" - ticks=$((f[11] + f[12])) - - ((cgroup_ticks[$cg] += ticks)) - - # Cache the limit (only look up once per cgroup) - if [[ -z "${cgroup_limit[$cg]}" ]]; then - limit_file="/sys/fs/cgroup${cg}/cpu.max" - if [[ -r "$limit_file" ]]; then - read quota period < "$limit_file" - if [[ "$quota" == "max" ]]; then - cgroup_limit[$cg]="unlimited" - else - cgroup_limit[$cg]=$(echo "scale=2; $quota / $period" | bc) - fi - else - cgroup_limit[$cg]="unknown" - fi - fi - fi -done - -echo "Ticks by cgroup:" -for cg in "${!cgroup_ticks[@]}"; do - echo " $cg: ${cgroup_ticks[$cg]} ticks (limit: ${cgroup_limit[$cg]} CPUs)" -done -``` - -#### If You Can't Access Other Cgroups - -Fallback options: - -1. **Mount the cgroup fs** — add volume mount for `/sys/fs/cgroup:ro` -2. **Use a sidecar with access** — one privileged container does the monitoring +1. **Mount the cgroup fs** — volume mount `/sys/fs/cgroup:ro` +2. **Use a sidecar with access** — one privileged container does monitoring 3. **Accept "unknown" limits** — report raw ticks/cores instead of percentages -4. **Kubernetes Downward API** — inject limits as env vars (only for your own container though) +4. **Kubernetes Downward API** — inject limits as env vars (own container only)
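Fallback 3 (reporting cores instead of percentages) needs no limit information at all. A minimal sketch, using the same tick-delta aggregation shown earlier but dividing only by `CLK_TCK` rather than normalizing against a CPU count:

```bash
#!/bin/bash
# Report aggregate CPU usage of all visible processes in cores,
# without consulting nproc or any cgroup limit.
CLK_TCK=$(getconf CLK_TCK)

total_ticks() {
  local total=0 line rest
  for stat in /proc/[0-9]*/stat; do
    read -r line < "$stat" 2>/dev/null || continue
    rest="${line##*) }"
    read -ra f <<< "$rest"
    ((total += f[11] + f[12]))   # utime + stime
  done
  echo "$total"
}

t1=$(total_ticks); s1=$(date +%s.%N)
sleep 1
t2=$(total_ticks); s2=$(date +%s.%N)

elapsed=$(echo "$s2 - $s1" | bc)
# cores = tick delta / (elapsed * CLK_TCK); 2.00 means two full cores busy
echo "scale=2; ($t2 - $t1) / ($elapsed * $CLK_TCK)" | bc
```

This is the representation the collector's per-container `cpu_cores` metric uses, and it stays meaningful even when no cgroup limit can be read.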