# /proc/stat Behavior in Containerized Environments

## TL;DR

`/proc/stat` in containers shows host-level statistics, not container-specific data. To get accurate container CPU metrics when cgroups aren't an option (e.g., sidecars sharing a PID namespace but split across different cgroups), aggregate `/proc/[pid]/stat` for all visible processes and calculate percentages using cgroup limits from `/sys/fs/cgroup`.

## Outcome

- Explained why `/proc/stat` doesn't provide container-aware metrics
- Provided methods to aggregate process CPU usage from `/proc/[pid]/stat`
- Showed how to determine parent/child process relationships
- Demonstrated converting raw ticks to CPU percentages
- Covered how to read cgroup CPU limits for other containers

## Conversation

### How does /proc/stat behave in containerized environments?

By default, `/proc/stat` inside a container shows **host system statistics**, not container-specific data. This includes:

- **CPU times** (user, system, idle, etc.) - reflects all CPUs on the host
- **Context switches** - host-wide count
- **Boot time** - host's boot time
- **Processes created** - host-wide total

#### Why This Happens

The `/proc` filesystem is an interface to the host kernel. Containers share that kernel, and `/proc/stat` has no namespace awareness for most of its metrics.

#### Practical Implications

1. **CPU usage calculations are wrong** - Tools like `top`, `htop`, or custom monitoring that read `/proc/stat` will show host CPU usage, not container usage
2. **Resource limits aren't reflected** - Even if you limit a container to 2 CPUs via cgroups, `/proc/stat` shows all host CPUs
3. **Multi-tenant confusion** - In shared environments, containers see each other's aggregate impact

#### Solutions

| Approach | Description |
|----------|-------------|
| **cgroups** | Read from `/sys/fs/cgroup/cpu/` for container-specific CPU accounting |
| **LXCFS** | A FUSE filesystem that provides container-aware `/proc` files |
| **Container runtimes** | Some runtimes (like Kata Containers) use lightweight VMs with isolated kernels |
| **Metrics APIs** | Use Docker/Kubernetes APIs instead of `/proc/stat` |

#### Example: Getting Real Container CPU

```bash
# Instead of /proc/stat, use cgroups v2:
cat /sys/fs/cgroup/cpu.stat

# Or cgroups v1:
cat /sys/fs/cgroup/cpu/cpuacct.usage
```

---

### Can I aggregate process usage to get the usage of the container?

Yes, you can aggregate process usage, but there are caveats.

#### Aggregating from /proc/[pid]/stat

```bash
# Sum fields 14 (utime) and 15 (stime) from every /proc/[pid]/stat.
# Strip the "pid (comm) " prefix first so a comm containing spaces
# can't shift the field numbers.
total=0
for stat in /proc/[0-9]*/stat; do
    read -r line < "$stat" 2>/dev/null || continue
    rest="${line##*) }"
    read -ra f <<< "$rest"
    ((total += f[11] + f[12]))   # utime + stime
done
echo "$total"
```

#### Problems with This Approach

1. **Race conditions** - Processes can spawn/die between reads, causing inconsistent totals
2. **Short-lived processes** - Processes that start and exit between samples are missed entirely (the sketch after this list quantifies the gap)
3. **Zombie/exited processes** - Their CPU time may not be captured
4. **Overhead** - Scanning all PIDs repeatedly is expensive
5. **Namespace visibility** - You only see processes in your PID namespace, but that's actually what you want in a container
6. **Children accounting** - When a process exits and is reaped, its CPU time gets added to the parent's `cutime`/`cstime` fields, which can cause double-counting if not handled carefully
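How much do the missed short-lived processes matter? Where cgroups *are* readable, one way to gauge the gap is to compare the per-process sum against the cgroup's own counter. A rough sketch, assuming cgroups v2 is mounted at `/sys/fs/cgroup`:

```bash
#!/bin/bash
# Rough sketch: compare the per-process tick sum against the cgroup's
# own counter. The difference is (roughly) CPU time from processes
# that exited before they could be sampled.
CLK_TCK=$(getconf CLK_TCK)

proc_usec=0
for stat in /proc/[0-9]*/stat; do
    read -r line < "$stat" 2>/dev/null || continue
    rest="${line##*) }"          # strip "pid (comm) " prefix safely
    read -ra f <<< "$rest"
    ((proc_usec += (f[11] + f[12]) * 1000000 / CLK_TCK))   # utime + stime
done

# The cgroups v2 counter includes time from already-exited processes
cg_usec=$(awk '/^usage_usec/ {print $2}' /sys/fs/cgroup/cpu.stat)

echo "per-process sum: ${proc_usec} usec"
echo "cgroup counter:  ${cg_usec} usec"
echo "gap:             $((cg_usec - proc_usec)) usec"
```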
#### Better Alternative: cgroups

Cgroups track CPU usage at the group level and handle all these edge cases:

```bash
# cgroups v2 (modern)
cat /sys/fs/cgroup/cpu.stat
# Shows: usage_usec, user_usec, system_usec

# cgroups v1
cat /sys/fs/cgroup/cpu/cpuacct.usage   # total nanoseconds
cat /sys/fs/cgroup/cpu/cpuacct.stat    # user/system jiffies
```

**However**, cgroups cannot be used when sidecars share the PID namespace but have different cgroups. In that case, aggregating `/proc/[pid]/stat` is the best option.

---

### How do I determine child/parent processes?

#### From /proc/[pid]/stat

Field 4 is the **PPID** (parent process ID):

```bash
# Get PPID for a specific process
# (plain field numbers assume the comm field has no spaces;
#  the tree script below handles the general case)
awk '{print $4}' /proc/1234/stat

# List all processes with their parents
for stat in /proc/[0-9]*/stat; do
    pid="${stat#/proc/}"
    pid="${pid%/stat}"
    ppid=$(awk '{print $4}' "$stat" 2>/dev/null) && \
        echo "$pid -> parent: $ppid"
done
```

#### From /proc/[pid]/status (more readable)

```bash
grep PPid /proc/1234/status
# PPid:   1
```

#### Building a Process Tree

```bash
#!/bin/bash
declare -A parent_of
declare -A children_of

for stat in /proc/[0-9]*/stat; do
    if read -r line < "$stat" 2>/dev/null; then
        pid="${stat#/proc/}"
        pid="${pid%/stat}"
        # Extract PPID (field 4, but handle comm with spaces)
        rest="${line##*) }"
        read -ra fields <<< "$rest"
        ppid="${fields[1]}"   # 4th field overall = index 1 after state
        parent_of[$pid]=$ppid
        children_of[$ppid]+="$pid "
    fi
done

# Print tree from PID 1
print_tree() {
    local pid=$1
    local indent=$2
    echo "${indent}${pid}"
    for child in ${children_of[$pid]}; do
        print_tree "$child" "  $indent"
    done
}

print_tree 1 ""
```

#### For CPU Aggregation: Handling cutime/cstime

To properly handle `cutime`/`cstime` without double-counting:

```bash
#!/bin/bash
declare -A parent_of
declare -A utime stime

# First pass: collect all data
for stat in /proc/[0-9]*/stat; do
    if read -r line < "$stat" 2>/dev/null; then
        pid="${stat#/proc/}"
        pid="${pid%/stat}"
        rest="${line##*) }"
        read -ra f <<< "$rest"
        parent_of[$pid]="${f[1]}"
        utime[$pid]="${f[11]}"
        stime[$pid]="${f[12]}"
        # cutime=${f[13]} cstime=${f[14]} - don't sum these
    fi
done

# Sum only utime/stime (not cutime/cstime)
total=0
for pid in "${!utime[@]}"; do
    ((total += utime[$pid] + stime[$pid]))
done

echo "Total CPU ticks: $total"
echo "Seconds: $(echo "scale=2; $total / $(getconf CLK_TCK)" | bc)"
```

**Key insight:** Only sum `utime` + `stime` for each process. The `cutime`/`cstime` fields are cumulative from children that have already exited and been `wait()`ed on—those children no longer exist in `/proc`, so their time is only accessible via the parent's `cutime`/`cstime`.
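To see the `cutime` accounting in action, here is a small illustrative demo script (a sketch; it assumes this script's own comm field contains no spaces, so plain awk field numbers are safe):

```bash
#!/bin/bash
# Demo: cutime/cstime grow only after a child exits and is reaped.
show_times() {
    # Fields 14-17 of our own stat line: utime, stime, cutime, cstime
    awk '{print "utime=" $14, "stime=" $15, "cutime=" $16, "cstime=" $17}' "/proc/$$/stat"
}

echo "before child:"
show_times

# Burn some CPU in a foreground child; bash reaps it when it exits
bash -c 'i=0; while ((i < 1000000)); do ((i++)); done'

echo "after child was reaped:"
show_times
```

After the child exits, its CPU time disappears from `/proc` and reappears only in the parent's `cutime`/`cstime`, which is exactly why the aggregation above must not sum those fields for live processes.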
---

### How do I convert utime/stime to percentages?

You need **two samples** over a time interval. CPU percentage is a rate, not an absolute value.

#### The Formula

```
CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100
```

- `delta_ticks` = difference in (utime + stime) between samples
- `CLK_TCK` = ticks per second (usually 100, get via `getconf CLK_TCK`)
- `num_cpus` = number of CPUs (drop this factor to get the cores-style percentage below)

#### Two Common Percentage Styles

| Style | Formula | Example |
|-------|---------|---------|
| **Normalized** (0-100%) | `delta / (elapsed * CLK_TCK * num_cpus) * 100` | 50% = half of total capacity |
| **Cores-style** (0-N*100%) | `delta / (elapsed * CLK_TCK) * 100` | 200% = 2 full cores busy |

For example, 150 ticks over 1 second with `CLK_TCK=100` on a 4-CPU host is 37.5% normalized, or 150% cores-style (1.5 cores busy).

#### Practical Script

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)

get_total_ticks() {
    local total=0
    for stat in /proc/[0-9]*/stat; do
        if read -r line < "$stat" 2>/dev/null; then
            rest="${line##*) }"
            read -ra f <<< "$rest"
            ((total += f[11] + f[12]))   # utime + stime
        fi
    done
    echo "$total"
}

# First sample
ticks1=$(get_total_ticks)
time1=$(date +%s.%N)

# Wait
sleep 1

# Second sample
ticks2=$(get_total_ticks)
time2=$(date +%s.%N)

# Calculate
delta_ticks=$((ticks2 - ticks1))
elapsed=$(echo "$time2 - $time1" | bc)

# Percentage of total CPU capacity (all cores)
pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK * $NUM_CPUS)) * 100" | bc)
echo "CPU usage: ${pct}% of ${NUM_CPUS} CPUs"

# Percentage as "CPU cores used" (like top's 200% for 2 full cores)
cores_pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK)) * 100" | bc)
echo "CPU usage: ${cores_pct}% (cores-style)"
```

#### Continuous Monitoring

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)
INTERVAL=1

get_total_ticks() {
    local total=0
    for stat in /proc/[0-9]*/stat; do
        read -r line < "$stat" 2>/dev/null || continue
        rest="${line##*) }"
        read -ra f <<< "$rest"
        ((total += f[11] + f[12]))
    done
    echo "$total"
}

prev_ticks=$(get_total_ticks)
prev_time=$(date +%s.%N)

while true; do
    sleep "$INTERVAL"
    curr_ticks=$(get_total_ticks)
    curr_time=$(date +%s.%N)

    delta=$((curr_ticks - prev_ticks))
    elapsed=$(echo "$curr_time - $prev_time" | bc)

    pct=$(echo "scale=1; $delta / ($elapsed * $CLK_TCK * $NUM_CPUS) * 100" | bc)
    printf "\rCPU: %5.1f%%" "$pct"

    prev_ticks=$curr_ticks
    prev_time=$curr_time
done
```

---

### Does this calculation respect cgroup limits?

No, it doesn't. The calculation uses `nproc`, which typically returns the **host CPU count**, not your cgroup limit.
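You can observe the mismatch directly from inside a quota-limited container. A quick check (sketch, assuming cgroups v2; the output values shown are illustrative):

```bash
# nproc counts CPUs in the affinity mask; a CFS quota does not shrink it
nproc
# -> 8 (host CPUs)

# The cgroup quota tells the real story: "$quota $period" in microseconds
cat /sys/fs/cgroup/cpu.max
# -> 200000 100000  (i.e., 2 CPUs' worth of time per period)
```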
#### The Problem

If your container is limited to 2 CPUs on an 8-CPU host:

- `nproc` returns 8
- Your calculation shows 25% when you're actually at 100% of your limit

#### Getting Effective CPU Limit

**cgroups v2:**

```bash
# cpu.max contains: $quota $period (in microseconds)
# "max 100000" means unlimited
read -r quota period < /sys/fs/cgroup/cpu.max
if [[ "$quota" == "max" ]]; then
    effective_cpus=$(nproc)
else
    effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
echo "Effective CPUs: $effective_cpus"
```

**cgroups v1:**

```bash
quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
if [[ "$quota" == "-1" ]]; then
    effective_cpus=$(nproc)
else
    effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
```

**Also Check cpuset Limits:**

```bash
# cgroups v2
cpuset=$(cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null)

# cgroups v1
cpuset=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null)

# Parse "0-3,5,7" format to count CPUs
count_cpus() {
    local count=0
    IFS=',' read -ra ranges <<< "$1"
    for range in "${ranges[@]}"; do
        if [[ "$range" == *-* ]]; then
            start="${range%-*}"
            end="${range#*-}"
            ((count += end - start + 1))
        else
            ((count++))
        fi
    done
    echo "$count"
}
```

#### Updated Script Respecting Limits

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)

get_effective_cpus() {
    # Try cgroups v2 first
    if [[ -f /sys/fs/cgroup/cpu.max ]]; then
        read -r quota period < /sys/fs/cgroup/cpu.max
        if [[ "$quota" != "max" ]]; then
            echo "scale=2; $quota / $period" | bc
            return
        fi
    fi

    # Try cgroups v1
    if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then
        quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
        period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
        if [[ "$quota" != "-1" ]]; then
            echo "scale=2; $quota / $period" | bc
            return
        fi
    fi

    # Fallback to nproc
    nproc
}

EFFECTIVE_CPUS=$(get_effective_cpus)
echo "Effective CPU limit: $EFFECTIVE_CPUS"

# ... rest of your sampling logic using $EFFECTIVE_CPUS
```
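One gap in the script above: it ignores cpusets. When both a quota and a cpuset apply, the effective parallelism is the smaller of the two. A sketch folding that in (cgroups v2 paths only; the range parser repeats the `count_cpus` logic from above):

```bash
#!/bin/bash
# Sketch: effective CPUs = min(quota/period, CPUs in the cpuset)

count_cpus() {                      # parse "0-3,5,7" into a count
    local count=0 range start end ranges
    IFS=',' read -ra ranges <<< "$1"
    for range in "${ranges[@]}"; do
        if [[ "$range" == *-* ]]; then
            start="${range%-*}"; end="${range#*-}"
            ((count += end - start + 1))
        else
            ((count++))
        fi
    done
    echo "$count"
}

quota_cpus=$(nproc)                 # fallback if no quota is set
if [[ -r /sys/fs/cgroup/cpu.max ]]; then
    read -r quota period < /sys/fs/cgroup/cpu.max
    [[ "$quota" != "max" ]] && quota_cpus=$(echo "scale=2; $quota / $period" | bc)
fi

cpuset_cpus=$(nproc)                # fallback if no cpuset is set
if [[ -r /sys/fs/cgroup/cpuset.cpus.effective ]]; then
    cpuset=$(cat /sys/fs/cgroup/cpuset.cpus.effective)
    [[ -n "$cpuset" ]] && cpuset_cpus=$(count_cpus "$cpuset")
fi

# awk handles the fractional comparison (quotas like 1.50)
effective=$(awk -v q="$quota_cpus" -v c="$cpuset_cpus" 'BEGIN {print (q < c) ? q : c}')
echo "Effective CPUs: $effective"
```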
#### Shared PID Namespace Consideration

When sidecars share a PID namespace but have different cgroups:

- Each container may have different CPU limits
- You're aggregating processes across those limits
- There's no single "correct" limit to use

**Options:**

1. **Use host CPU count** — percentage of total host capacity
2. **Sum the limits** — if you know each sidecar's cgroup, sum their quotas
3. **Report in cores** — skip normalization, just show `1.5 cores used` instead of a percentage

---

### Can I get the cgroup limit for another cgroup?

Yes, if you have visibility into the cgroup filesystem.

#### 1. Find a Process's Cgroup

Every process exposes its cgroup membership:

```bash
# Get cgroup for any PID you can see
cat /proc/1234/cgroup

# cgroups v2 output:
# 0::/kubepods/pod123/container456

# cgroups v1 output:
# 12:cpu,cpuacct:/docker/abc123
# 11:memory:/docker/abc123
# ...
```

#### 2. Read That Cgroup's Limits

If the cgroup filesystem is mounted and accessible:

```bash
#!/bin/bash
get_cgroup_cpu_limit() {
    local pid=$1

    # Get cgroup path for this PID (v2 entries start with "0::")
    cgroup_path=$(grep -oP '0::\K.*' "/proc/$pid/cgroup" 2>/dev/null)
    if [[ -n "$cgroup_path" ]]; then
        # cgroups v2
        limit_file="/sys/fs/cgroup${cgroup_path}/cpu.max"
        if [[ -r "$limit_file" ]]; then
            read -r quota period < "$limit_file"
            if [[ "$quota" == "max" ]]; then
                echo "unlimited"
            else
                echo "scale=2; $quota / $period" | bc
            fi
            return
        fi
    fi

    # Try cgroups v1 (match only the cpu controller line, not cpuset)
    cgroup_path=$(grep -oP '^\d+:cpu(,cpuacct)?:\K.*' "/proc/$pid/cgroup" 2>/dev/null | head -n1)
    if [[ -n "$cgroup_path" ]]; then
        quota_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_quota_us"
        period_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_period_us"
        if [[ -r "$quota_file" ]]; then
            quota=$(cat "$quota_file")
            period=$(cat "$period_file")
            if [[ "$quota" == "-1" ]]; then
                echo "unlimited"
            else
                echo "scale=2; $quota / $period" | bc
            fi
            return
        fi
    fi

    echo "unknown"
}

# Example: get limit for PID 1234
get_cgroup_cpu_limit 1234
```

#### 3. Mount Visibility Requirements

| Scenario | Can Read Other Cgroups? |
|----------|------------------------|
| Host system | Yes |
| Privileged container | Yes |
| `/sys/fs/cgroup` mounted read-only from host | Yes (common in Kubernetes) |
| Only own cgroup subtree mounted | No |

Check what's visible:

```bash
mount | grep cgroup
ls /sys/fs/cgroup/
```

#### 4. Full Solution: Aggregate by Cgroup

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)

declare -A cgroup_ticks
declare -A cgroup_limit

for stat in /proc/[0-9]*/stat; do
    pid="${stat#/proc/}"
    pid="${pid%/stat}"

    # Get cgroup for this process (cgroups v2)
    cg=$(grep -oP '0::\K.*' "/proc/$pid/cgroup" 2>/dev/null)
    [[ -z "$cg" ]] && continue

    # Get CPU ticks
    if read -r line < "$stat" 2>/dev/null; then
        rest="${line##*) }"
        read -ra f <<< "$rest"
        ticks=$((f[11] + f[12]))
        ((cgroup_ticks[$cg] += ticks))

        # Cache the limit (only look up once per cgroup)
        if [[ -z "${cgroup_limit[$cg]}" ]]; then
            limit_file="/sys/fs/cgroup${cg}/cpu.max"
            if [[ -r "$limit_file" ]]; then
                read -r quota period < "$limit_file"
                if [[ "$quota" == "max" ]]; then
                    cgroup_limit[$cg]="unlimited"
                else
                    cgroup_limit[$cg]=$(echo "scale=2; $quota / $period" | bc)
                fi
            else
                cgroup_limit[$cg]="unknown"
            fi
        fi
    fi
done

echo "Ticks by cgroup:"
for cg in "${!cgroup_ticks[@]}"; do
    echo "  $cg: ${cgroup_ticks[$cg]} ticks (limit: ${cgroup_limit[$cg]} CPUs)"
done
```

#### If You Can't Access Other Cgroups

Fallback options:

1. **Mount the cgroup fs** — add a read-only volume mount for `/sys/fs/cgroup` (see the example below)
2. **Use a sidecar with access** — one privileged container does the monitoring
3. **Accept "unknown" limits** — report raw ticks/cores instead of percentages
4. **Kubernetes Downward API** — inject limits as env vars (only for your own container, though)
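As a concrete illustration of option 1 with plain Docker (the image name is hypothetical; `-v` and `--pid=host` are standard `docker run` flags):

```bash
# Give the monitoring container a read-only view of host cgroups and
# the host PID namespace, so it can resolve every process's cgroup
docker run --rm \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  --pid=host \
  my-monitoring-image   # hypothetical image name
```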