docs: move background documentation to docs/background

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 15:40:38 +01:00 · 2026-02-06 15:40:38 +01:00 · fd02242d5e
commit fd02242d5e
parent 52f1b8b64d
2 changed files with 0 additions and 0 deletions
--- a/docs/background/identify-process-cgroup-by-pid.md
+++ b/docs/background/identify-process-cgroup-by-pid.md
@ -0,0 +1,62 @@
+# Identify Process Cgroup by PID
+
+## TL;DR
+
+You can identify a process's cgroup by reading `/proc/<PID>/cgroup`.
+
+## Outcome
+
+Multiple methods to find which cgroup a process belongs to using its PID.
+
+## Conversation
+
+**Question:** Can I identify the process's cgroup by having its PID?
+
+**Answer:** Yes, here are the main methods:
+
+### Method 1: Read `/proc/<PID>/cgroup`
+
+```bash
+cat /proc/<PID>/cgroup
+```
+
+This prints one line per cgroup hierarchy the process belongs to, in the form `hierarchy-ID:controller-list:cgroup-path`. For example:
+```
+12:blkio:/user.slice
+11:memory:/user.slice/user-1000.slice
+...
+0::/user.slice/user-1000.slice/session-1.scope
+```
+
+### Method 2: Using `ps` with format options
+
+```bash
+ps -o pid,cgroup -p <PID>
+```
+
+### Method 3: Using `systemd-cgls`
+
+If you're on a systemd system:
+```bash
+systemd-cgls --unit <unit-name>
+# or to see the whole tree
+systemd-cgls
+```
+
+### Method 4: Check cgroup v2 unified hierarchy
+
+On cgroup v2 systems, the same file contains a single entry with hierarchy ID `0` and an empty controller list:
+```bash
+cat /proc/<PID>/cgroup
+# Output will look like: 0::/user.slice/user-1000.slice/session-1.scope
+```
+
+The path after `::` is the cgroup path under `/sys/fs/cgroup/`.
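+
+A minimal sketch of that mapping, assuming the unified hierarchy is mounted at `/sys/fs/cgroup`. The helper name `cgroup_dir_for_line` is hypothetical and only illustrates the string handling:
+
+```bash
+# Hypothetical helper: map a cgroup v2 line ("0::/path") to its sysfs directory.
+# Assumes the unified hierarchy is mounted at /sys/fs/cgroup.
+cgroup_dir_for_line() {
+    local line=$1
+    local path=${line#0::}        # strip the leading "0::"
+    printf '/sys/fs/cgroup%s\n' "$path"
+}
+
+cgroup_dir_for_line "0::/user.slice/user-1000.slice/session-1.scope"
+# prints /sys/fs/cgroup/user.slice/user-1000.slice/session-1.scope
+```
+
+For a live process, the input line could come from `grep '^0::' /proc/<PID>/cgroup`.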
+
+### Quick one-liner
+
+```bash
+cat /proc/self/cgroup  # The cat process itself (inherits the shell's cgroup)
+cat /proc/$$/cgroup    # The current shell
+cat /proc/1234/cgroup  # For PID 1234
+```
--- a/docs/background/proc-stat-in-containers.md
+++ b/docs/background/proc-stat-in-containers.md
@ -0,0 +1,560 @@
+# /proc/stat Behavior in Containerized Environments
+
+## TL;DR
+
+`/proc/stat` in containers shows host-level statistics, not container-specific data. To get accurate container CPU metrics when cgroups aren't an option (e.g., sidecars sharing PID namespace with different cgroups), aggregate `/proc/[pid]/stat` for all visible processes and calculate percentages using cgroup limits from `/sys/fs/cgroup`.
+
+## Outcome
+
+- Explained why `/proc/stat` doesn't provide container-aware metrics
+- Provided methods to aggregate process CPU usage from `/proc/[pid]/stat`
+- Showed how to determine parent/child process relationships
+- Demonstrated converting raw ticks to CPU percentages
+- Covered how to read cgroup CPU limits for other containers
+
+## Conversation
+
+### How does /proc/stat behave in containerized environments?
+
+By default, `/proc/stat` inside a container shows **host system statistics**, not container-specific data. This includes:
+
+- **CPU times** (user, system, idle, etc.) - reflects all CPUs on the host
+- **Context switches** - host-wide count
+- **Boot time** - host's boot time
+- **Processes created** - host-wide total
+
+#### Why This Happens
+
+The `/proc` filesystem is mounted from the host kernel. Containers share the host kernel, and `/proc/stat` is a kernel-level interface that doesn't have namespace awareness for most of its metrics.
+
+#### Practical Implications
+
+1. **CPU usage calculations are wrong** - Tools like `top`, `htop`, or custom monitoring that read `/proc/stat` will show host CPU usage, not container usage
+2. **Resource limits aren't reflected** - Even if you limit a container to 2 CPUs via cgroups, `/proc/stat` shows all host CPUs
+3. **Multi-tenant confusion** - In shared environments, containers see each other's aggregate impact
+
+#### Solutions
+
+| Approach | Description |
+|----------|-------------|
+| **cgroups** | Read `/sys/fs/cgroup/cpu.stat` (v2) or the files under `/sys/fs/cgroup/cpu/` (v1) for container-specific CPU accounting |
+| **LXCFS** | A FUSE filesystem that provides container-aware `/proc` files |
+| **Container runtimes** | Some runtimes (like Kata) use VMs with isolated kernels |
+| **Metrics APIs** | Use Docker/Kubernetes APIs instead of `/proc/stat` |
+
+#### Example: Getting Real Container CPU
+
+```bash
+# Instead of /proc/stat, use cgroups v2:
+cat /sys/fs/cgroup/cpu.stat
+
+# Or cgroups v1:
+cat /sys/fs/cgroup/cpu/cpuacct.usage
+```
+
+---
+
+### Can I aggregate process usage to get the usage of the container?
+
+Yes, you can aggregate process usage, but there are caveats.
+
+#### Aggregating from /proc/[pid]/stat
+
+```bash
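+# Fields 14 (utime) and 15 (stime) in /proc/[pid]/stat.
+# Field 2 (comm) may contain spaces, so strip everything up to the last ") "
+# first; utime and stime are then fields 12 and 13 of the remainder.
+for pid in /proc/[0-9]*; do
+  sed 's/^.*) //' "$pid/stat" 2>/dev/null | awk '{print $12 + $13}'
+done | awk '{sum += $1} END {print sum + 0}'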
+```
+
+#### Problems with This Approach
+
+1. **Race conditions** - Processes can spawn/die between reads, causing inconsistent totals
+2. **Short-lived processes** - Processes that start and exit between samples are missed entirely
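+3. **Zombie/exited processes** - A zombie stays in `/proc` with its final `utime`/`stime` until it is reaped; after that, its time disappears from the per-process sum, so the total can drop between samples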
+4. **Overhead** - Scanning all PIDs repeatedly is expensive
+5. **Namespace visibility** - You only see processes in your PID namespace, but that's actually what you want in a container
+6. **Children accounting** - When a process exits, its CPU time gets added to the parent's `cutime`/`cstime` fields, which can cause double-counting if not handled carefully
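+
+Because processes can exit between samples, the summed total in a later sample can be lower than in the earlier one. One possible guard, shown here as a sketch rather than a recommendation, is to clamp negative deltas to zero. The helper name `clamp_delta` is hypothetical:
+
+```bash
+# Hypothetical helper: difference between two tick totals, never negative.
+# Args: previous_total current_total
+clamp_delta() {
+    local d=$(( $2 - $1 ))
+    if (( d < 0 )); then
+        d=0   # processes exited between samples and took their ticks with them
+    fi
+    echo "$d"
+}
+
+clamp_delta 100 130   # prints 30
+clamp_delta 100 90    # prints 0
+```
+
+Clamping hides the inconsistency instead of correcting it, so the affected interval under-reports usage.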
+
+#### Better Alternative: cgroups
+
+Cgroups track CPU usage at the group level and handle all these edge cases:
+
+```bash
+# cgroups v2 (modern)
+cat /sys/fs/cgroup/cpu.stat
+# Shows: usage_usec, user_usec, system_usec
+
+# cgroups v1
+cat /sys/fs/cgroup/cpu/cpuacct.usage  # total nanoseconds
+cat /sys/fs/cgroup/cpu/cpuacct.stat   # user/system jiffies
+```
+
+**However**, cgroups cannot be used when sidecars share the PID namespace but have different cgroups. In that case, aggregating `/proc/[pid]/stat` is the best option.
+
+---
+
+### How do I determine child/parent processes?
+
+#### From /proc/[pid]/stat
+
+Field 4 is the **PPID** (parent process ID):
+
+```bash
+# Get PPID for a specific process
+awk '{print $4}' /proc/1234/stat
+
+# List all processes with their parents
+for stat in /proc/[0-9]*/stat; do
+    pid="${stat#/proc/}"
+    pid="${pid%/stat}"
+    ppid=$(awk '{print $4}' "$stat" 2>/dev/null) && \
+    echo "$pid -> parent: $ppid"
+done
+```
+
+#### From /proc/[pid]/status (more readable)
+
+```bash
+grep PPid /proc/1234/status
+# PPid: 1
+```
+
+#### Building a Process Tree
+
+```bash
+#!/bin/bash
+declare -A parent_of
+declare -A children_of
+
+for stat in /proc/[0-9]*/stat; do
+    if read -r line < "$stat" 2>/dev/null; then
+        pid="${stat#/proc/}"
+        pid="${pid%/stat}"
+
+        # Extract PPID (field 4, but handle comm with spaces)
+        rest="${line##*) }"
+        read -ra fields <<< "$rest"
+        ppid="${fields[1]}"  # 4th field overall = index 1 after state
+
+        parent_of[$pid]=$ppid
+        children_of[$ppid]+="$pid "
+    fi
+done
+
+# Print tree from PID 1
+print_tree() {
+    local pid=$1
+    local indent=$2
+    echo "${indent}${pid}"
+    for child in ${children_of[$pid]}; do
+        print_tree "$child" "  $indent"
+    done
+}
+
+print_tree 1 ""
+```
+
+#### For CPU Aggregation: Handling cutime/cstime
+
+To properly handle `cutime`/`cstime` without double-counting:
+
+```bash
+#!/bin/bash
+declare -A parent_of
+declare -A utime stime
+
+# First pass: collect all data
+for stat in /proc/[0-9]*/stat; do
+    if read -r line < "$stat" 2>/dev/null; then
+        pid="${stat#/proc/}"
+        pid="${pid%/stat}"
+        rest="${line##*) }"
+        read -ra f <<< "$rest"
+
+        parent_of[$pid]="${f[1]}"
+        utime[$pid]="${f[11]}"
+        stime[$pid]="${f[12]}"
+        # cutime=${f[13]}  cstime=${f[14]} - don't sum these
+    fi
+done
+
+# Sum only utime/stime (not cutime/cstime)
+total=0
+for pid in "${!utime[@]}"; do
+    ((total += utime[$pid] + stime[$pid]))
+done
+
+echo "Total CPU ticks: $total"
+echo "Seconds: $(echo "scale=2; $total / $(getconf CLK_TCK)" | bc)"
+```
+
+**Key insight:** Sum only `utime` + `stime` for each process. The `cutime`/`cstime` fields are cumulative totals from children that have already exited and been `wait()`ed on. Those children no longer exist in `/proc`, so their time is only accessible via the parent's `cutime`/`cstime`, and the live-process sum above does not include it.
+
+---
+
+### How do I convert utime/stime to percentages?
+
+You need **two samples** over a time interval. CPU percentage is a rate, not an absolute value.
+
+#### The Formula
+
+```
+CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100
+```
+
+- `delta_ticks` = difference in (utime + stime) between samples
+- `CLK_TCK` = ticks per second (usually 100, get via `getconf CLK_TCK`)
+- `num_cpus` = number of CPUs (omit for the cores-style percentage, where 100% means one full core)
+
+#### Two Common Percentage Styles
+
+| Style | Formula | Example |
+|-------|---------|---------|
+| **Normalized** (0-100%) | `delta / (elapsed * CLK_TCK * num_cpus) * 100` | 50% = half of total capacity |
+| **Cores-style** (0-N*100%) | `delta / (elapsed * CLK_TCK) * 100` | 200% = 2 full cores busy |
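+
+A worked example of both styles, as a sketch. The helper name `cpu_pct` and the sample numbers are made up for illustration; `awk` is used only for the floating-point arithmetic:
+
+```bash
+# Hypothetical helper: convert a tick delta into a CPU percentage.
+# Args: delta_ticks elapsed_seconds clk_tck [num_cpus]
+# Omitting num_cpus gives the cores-style figure.
+cpu_pct() {
+    awk -v d="$1" -v e="$2" -v hz="$3" -v n="${4:-1}" \
+        'BEGIN { printf "%.1f\n", d * 100 / (e * hz * n) }'
+}
+
+# 400 ticks over 2 s at 100 ticks/s on a 4-CPU machine:
+cpu_pct 400 2 100 4   # normalized: prints 50.0
+cpu_pct 400 2 100     # cores-style: prints 200.0
+```
+
+The same 400 ticks read as 50% of total capacity or as two full cores, depending on whether `num_cpus` is in the denominator.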
+
+#### Practical Script
+
+```bash
+#!/bin/bash
+
+CLK_TCK=$(getconf CLK_TCK)
+NUM_CPUS=$(nproc)
+
+get_total_ticks() {
+    local total=0
+    for stat in /proc/[0-9]*/stat; do
+        if read -r line < "$stat" 2>/dev/null; then
+            rest="${line##*) }"
+            read -ra f <<< "$rest"
+            ((total += f[11] + f[12]))  # utime + stime
+        fi
+    done
+    echo "$total"
+}
+
+# First sample
+ticks1=$(get_total_ticks)
+time1=$(date +%s.%N)
+
+# Wait
+sleep 1
+
+# Second sample
+ticks2=$(get_total_ticks)
+time2=$(date +%s.%N)
+
+# Calculate
+delta_ticks=$((ticks2 - ticks1))
+elapsed=$(echo "$time2 - $time1" | bc)
+
+# Percentage of total CPU capacity (all cores)
+# Multiply before dividing: with scale=2, bc truncates each division to two decimals
+pct=$(echo "scale=2; $delta_ticks * 100 / ($elapsed * $CLK_TCK * $NUM_CPUS)" | bc)
+echo "CPU usage: ${pct}% of ${NUM_CPUS} CPUs"
+
+# Percentage as "CPU cores used" (like top's 200% for 2 full cores)
+cores_pct=$(echo "scale=2; $delta_ticks * 100 / ($elapsed * $CLK_TCK)" | bc)
+echo "CPU usage: ${cores_pct}% (cores-style)"
+```
+
+#### Continuous Monitoring
+
+```bash
+#!/bin/bash
+CLK_TCK=$(getconf CLK_TCK)
+NUM_CPUS=$(nproc)
+INTERVAL=1
+
+get_total_ticks() {
+    local total=0
+    for stat in /proc/[0-9]*/stat; do
+        read -r line < "$stat" 2>/dev/null || continue
+        rest="${line##*) }"
+        read -ra f <<< "$rest"
+        ((total += f[11] + f[12]))
+    done
+    echo "$total"
+}
+
+prev_ticks=$(get_total_ticks)
+prev_time=$(date +%s.%N)
+
+while true; do
+    sleep "$INTERVAL"
+
+    curr_ticks=$(get_total_ticks)
+    curr_time=$(date +%s.%N)
+
+    delta=$((curr_ticks - prev_ticks))
+    elapsed=$(echo "$curr_time - $prev_time" | bc)
+
+    # Multiply before dividing so bc does not truncate the ratio to one decimal first
+    pct=$(echo "scale=1; $delta * 100 / ($elapsed * $CLK_TCK * $NUM_CPUS)" | bc)
+    printf "\rCPU: %5.1f%%" "$pct"
+
+    prev_ticks=$curr_ticks
+    prev_time=$curr_time
+done
+```
+
+---
+
+### Does this calculation respect cgroup limits?
+
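+No, it doesn't. The calculation uses `nproc`, which honors CPU affinity and cpuset restrictions but, on most systems, not CFS quotas (`cpu.max` / `cpu.cfs_quota_us`). For a quota-limited container it typically returns the **host CPU count**, not your cgroup limit.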
+
+#### The Problem
+
+If your container is limited to 2 CPUs by a CFS quota on an 8-CPU host:
+- `nproc` returns 8
+- Your calculation shows 25% when you're actually at 100% of your limit
+
+#### Getting Effective CPU Limit
+
+**cgroups v2:**
+
+```bash
+# cpu.max contains: $quota $period (in microseconds)
+# "max 100000" means unlimited
+read quota period < /sys/fs/cgroup/cpu.max
+if [[ "$quota" == "max" ]]; then
+    effective_cpus=$(nproc)
+else
+    effective_cpus=$(echo "scale=2; $quota / $period" | bc)
+fi
+echo "Effective CPUs: $effective_cpus"
+```
+
+**cgroups v1:**
+
+```bash
+quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
+period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
+
+if [[ "$quota" == "-1" ]]; then
+    effective_cpus=$(nproc)
+else
+    effective_cpus=$(echo "scale=2; $quota / $period" | bc)
+fi
+```
+
+**Also Check cpuset Limits:**
+
+```bash
+# cgroups v2
+cpuset=$(cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null)
+
+# cgroups v1
+cpuset=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null)
+
+# Parse "0-3,5,7" format to count CPUs
+count_cpus() {
+    local count=0
+    IFS=',' read -ra ranges <<< "$1"
+    for range in "${ranges[@]}"; do
+        if [[ "$range" == *-* ]]; then
+            start="${range%-*}"
+            end="${range#*-}"
+            ((count += end - start + 1))
+        else
+            ((count++))
+        fi
+    done
+    echo "$count"
+}
+```
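+
+The quota and the cpuset are independent limits. One plausible way to combine them, stated here as an assumption rather than something the kernel reports directly, is to take the smaller of the two values. The helper name `min_cpus` is hypothetical:
+
+```bash
+# Hypothetical helper: the effective CPU count is assumed to be the smaller of
+# the quota-derived value (may be fractional) and the cpuset CPU count.
+# Args: quota_cpus cpuset_count
+min_cpus() {
+    awk -v q="$1" -v c="$2" 'BEGIN { print (q < c) ? q : c }'
+}
+
+min_cpus 1.5 4   # quota is the tighter limit: prints 1.5
+min_cpus 8 2     # cpuset is the tighter limit: prints 2
+```
+
+With the snippets above, this could be called as `min_cpus "$effective_cpus" "$(count_cpus "$cpuset")"`.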
+
+#### Updated Script Respecting Limits
+
+```bash
+#!/bin/bash
+CLK_TCK=$(getconf CLK_TCK)
+
+get_effective_cpus() {
+    # Try cgroups v2 first
+    if [[ -f /sys/fs/cgroup/cpu.max ]]; then
+        read quota period < /sys/fs/cgroup/cpu.max
+        if [[ "$quota" != "max" ]]; then
+            echo "scale=2; $quota / $period" | bc
+            return
+        fi
+    fi
+
+    # Try cgroups v1
+    if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then
+        quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
+        period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
+        if [[ "$quota" != "-1" ]]; then
+            echo "scale=2; $quota / $period" | bc
+            return
+        fi
+    fi
+
+    # Fallback to nproc
+    nproc
+}
+
+EFFECTIVE_CPUS=$(get_effective_cpus)
+echo "Effective CPU limit: $EFFECTIVE_CPUS"
+
+# ... rest of your sampling logic using $EFFECTIVE_CPUS
+```
+
+#### Shared PID Namespace Consideration
+
+When sidecars share PID namespace but have different cgroups:
+- Each container may have different CPU limits
+- You're aggregating processes across those limits
+- There's no single "correct" limit to use
+
+**Options:**
+1. **Use host CPU count** — percentage of total host capacity
+2. **Sum the limits** — if you know each sidecar's cgroup, sum their quotas
+3. **Report in cores** — skip normalization, just show `1.5 cores used` instead of percentage
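+
+The third option needs no limit at all. A minimal sketch, with a hypothetical helper name `cores_used` and made-up sample numbers:
+
+```bash
+# Hypothetical helper: express a tick delta as "cores used", with no normalization.
+# Args: delta_ticks elapsed_seconds clk_tck
+cores_used() {
+    awk -v d="$1" -v e="$2" -v hz="$3" 'BEGIN { printf "%.2f\n", d / (e * hz) }'
+}
+
+cores_used 150 1 100   # 150 ticks in 1 s at 100 ticks/s: prints 1.50
+```
+
+In a real run, `clk_tck` would come from `getconf CLK_TCK` and the delta from two samples of the aggregated ticks.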
+
+---
+
+### Can I get the cgroup limit for another cgroup?
+
+Yes, if you have visibility into the cgroup filesystem.
+
+#### 1. Find a Process's Cgroup
+
+Every process exposes its cgroup membership:
+
+```bash
+# Get cgroup for any PID you can see
+cat /proc/1234/cgroup
+
+# cgroups v2 output:
+# 0::/kubepods/pod123/container456
+
+# cgroups v1 output:
+# 12:cpu,cpuacct:/docker/abc123
+# 11:memory:/docker/abc123
+# ...
+```
+
+#### 2. Read That Cgroup's Limits
+
+If the cgroup filesystem is mounted and accessible:
+
+```bash
+#!/bin/bash
+
+get_cgroup_cpu_limit() {
+    local pid=$1
+
+    # Get cgroup path for this PID
+    cgroup_path=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)  # v2
+
+    if [[ -n "$cgroup_path" ]]; then
+        # cgroups v2
+        limit_file="/sys/fs/cgroup${cgroup_path}/cpu.max"
+        if [[ -r "$limit_file" ]]; then
+            read quota period < "$limit_file"
+            if [[ "$quota" == "max" ]]; then
+                echo "unlimited"
+            else
+                echo "scale=2; $quota / $period" | bc
+            fi
+            return
+        fi
+    fi
+
+    # Try cgroups v1
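+    # Select the hierarchy whose controller list contains "cpu" exactly (not "cpuset")
+    cgroup_path=$(awk -F: '$2 ~ /(^|,)cpu(,|$)/ {print $3; exit}' "/proc/$pid/cgroup" 2>/dev/null)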
+    if [[ -n "$cgroup_path" ]]; then
+        quota_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_quota_us"
+        period_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_period_us"
+        if [[ -r "$quota_file" ]]; then
+            quota=$(cat "$quota_file")
+            period=$(cat "$period_file")
+            if [[ "$quota" == "-1" ]]; then
+                echo "unlimited"
+            else
+                echo "scale=2; $quota / $period" | bc
+            fi
+            return
+        fi
+    fi
+
+    echo "unknown"
+}
+
+# Example: get limit for PID 1234
+get_cgroup_cpu_limit 1234
+```
+
+#### 3. Mount Visibility Requirements
+
+| Scenario | Can Read Other Cgroups? |
+|----------|------------------------|
+| Host system | Yes |
+| Privileged container | Yes |
+| `/sys/fs/cgroup` mounted read-only from host | Yes (common in Kubernetes) |
+| Only own cgroup subtree mounted | No |
+
+Check what's visible:
+
+```bash
+mount | grep cgroup
+ls /sys/fs/cgroup/
+```
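+
+To check one specific cgroup rather than list everything, a small probe can help. This is a sketch assuming cgroups v2 with the `cpu` controller enabled for that cgroup, so that a `cpu.max` file exists. The helper name `can_read_cgroup` and its optional mount-point argument are hypothetical:
+
+```bash
+# Hypothetical helper: report whether a cgroup's cpu.max file is readable.
+# Args: cgroup path as shown in /proc/<PID>/cgroup, optional mount point
+# (defaults to /sys/fs/cgroup).
+can_read_cgroup() {
+    local cg=$1 root=${2:-/sys/fs/cgroup}
+    if [[ -r "${root}${cg}/cpu.max" ]]; then
+        echo "readable"
+    else
+        echo "not visible"
+    fi
+}
+
+can_read_cgroup "/kubepods/pod123/container456"
+```
+
+A result of `not visible` can mean either that only your own subtree is mounted or that the `cpu` controller is not enabled for that cgroup.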
+
+#### 4. Full Solution: Aggregate by Cgroup
+
+```bash
+#!/bin/bash
+CLK_TCK=$(getconf CLK_TCK)
+
+declare -A cgroup_ticks
+declare -A cgroup_limit
+
+for stat in /proc/[0-9]*/stat; do
+    pid="${stat#/proc/}"
+    pid="${pid%/stat}"
+
+    # Get cgroup for this process
+    cg=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)
+    [[ -z "$cg" ]] && continue
+
+    # Get CPU ticks
+    if read -r line < "$stat" 2>/dev/null; then
+        rest="${line##*) }"
+        read -ra f <<< "$rest"
+        ticks=$((f[11] + f[12]))
+
+        ((cgroup_ticks[$cg] += ticks))
+
+        # Cache the limit (only look up once per cgroup)
+        if [[ -z "${cgroup_limit[$cg]}" ]]; then
+            limit_file="/sys/fs/cgroup${cg}/cpu.max"
+            if [[ -r "$limit_file" ]]; then
+                read quota period < "$limit_file"
+                if [[ "$quota" == "max" ]]; then
+                    cgroup_limit[$cg]="unlimited"
+                else
+                    cgroup_limit[$cg]=$(echo "scale=2; $quota / $period" | bc)
+                fi
+            else
+                cgroup_limit[$cg]="unknown"
+            fi
+        fi
+    fi
+done
+
+echo "Ticks by cgroup:"
+for cg in "${!cgroup_ticks[@]}"; do
+    echo "  $cg: ${cgroup_ticks[$cg]} ticks (limit: ${cgroup_limit[$cg]} CPUs)"
+done
+```
+
+#### If You Can't Access Other Cgroups
+
+Fallback options:
+
+1. **Mount the cgroup fs** — add volume mount for `/sys/fs/cgroup:ro`
+2. **Use a sidecar with access** — one privileged container does the monitoring
+3. **Accept "unknown" limits** — report raw ticks/cores instead of percentages
+4. **Kubernetes Downward API** — inject limits as env vars (only for your own container though)