/proc/stat Behavior in Containerized Environments
TL;DR
/proc/stat in containers shows host-level statistics, not container-specific data. To get accurate container CPU metrics when cgroups aren't an option (e.g., sidecars sharing PID namespace with different cgroups), aggregate /proc/[pid]/stat for all visible processes and calculate percentages using cgroup limits from /sys/fs/cgroup.
Outcome
- Explained why /proc/stat doesn't provide container-aware metrics
- Provided methods to aggregate process CPU usage from /proc/[pid]/stat
- Showed how to determine parent/child process relationships
- Demonstrated converting raw ticks to CPU percentages
- Covered how to read cgroup CPU limits for other containers
Conversation
How does /proc/stat behave in containerized environments?
By default, /proc/stat inside a container shows host system statistics, not container-specific data. This includes:
- CPU times (user, system, idle, etc.) - reflects all CPUs on the host
- Context switches - host-wide count
- Boot time - host's boot time
- Processes created - host-wide total
Why This Happens
The /proc filesystem is mounted from the host kernel. Containers share the host kernel, and /proc/stat is a kernel-level interface that doesn't have namespace awareness for most of its metrics.
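A quick way to see the mismatch from inside a CPU-limited container (a minimal sketch, assuming cgroups v2 is mounted at /sys/fs/cgroup):
# Per-CPU lines in /proc/stat reflect every host CPU, regardless of limits
grep -c '^cpu[0-9]' /proc/stat
# The container's own quota tells a different story (cgroups v2):
# e.g. "200000 100000" means a 2-CPU limit even though /proc/stat lists them all
cat /sys/fs/cgroup/cpu.max 2>/dev/null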
Practical Implications
- CPU usage calculations are wrong - Tools like top, htop, or custom monitoring that read /proc/stat will show host CPU usage, not container usage
- Resource limits aren't reflected - Even if you limit a container to 2 CPUs via cgroups, /proc/stat shows all host CPUs
- Multi-tenant confusion - In shared environments, containers see each other's aggregate impact
Solutions
| Approach | Description |
|---|---|
| cgroups | Read from /sys/fs/cgroup/cpu/ for container-specific CPU accounting |
| LXCFS | A FUSE filesystem that provides container-aware /proc files |
| Container runtimes | Some runtimes (like Kata) use VMs with isolated kernels |
| Metrics APIs | Use Docker/Kubernetes APIs instead of /proc/stat |
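For example, the metrics-API and LXCFS approaches in practice (a sketch, assuming Docker is available and an LXCFS daemon is running with its conventional /var/lib/lxcfs mount point):
# Metrics API: ask the runtime instead of reading /proc/stat
docker stats --no-stream
# LXCFS: bind container-aware proc files over the container's own
docker run -it \
  -v /var/lib/lxcfs/proc/stat:/proc/stat \
  -v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo \
  ubuntu bash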
Example: Getting Real Container CPU
# Instead of /proc/stat, use cgroups v2:
cat /sys/fs/cgroup/cpu.stat
# Or cgroups v1:
cat /sys/fs/cgroup/cpu/cpuacct.usage
Can I aggregate process usage to get the usage of the container?
Yes, you can aggregate process usage, but there are caveats.
Aggregating from /proc/[pid]/stat
# Fields 14 (utime) and 15 (stime) in /proc/[pid]/stat
# Note: plain awk field numbers assume the comm field contains no spaces;
# the later scripts parse past the ") " separator to handle that case.
for pid in /proc/[0-9]*; do
awk '{print $14 + $15}' "$pid/stat" 2>/dev/null
done | awk '{sum += $1} END {print sum}'
Problems with This Approach
- Race conditions - Processes can spawn/die between reads, causing inconsistent totals
- Short-lived processes - Processes that start and exit between samples are missed entirely
- Zombie/exited processes - Their CPU time may not be captured
- Overhead - Scanning all PIDs repeatedly is expensive
- Namespace visibility - You only see processes in your PID namespace, but that's actually what you want in a container
- Children accounting - When a process exits, its CPU time gets added to the parent's cutime/cstime fields, which can cause double-counting if not handled carefully
Better Alternative: cgroups
Cgroups track CPU usage at the group level and handle all these edge cases:
# cgroups v2 (modern)
cat /sys/fs/cgroup/cpu.stat
# Shows: usage_usec, user_usec, system_usec
# cgroups v1
cat /sys/fs/cgroup/cpu/cpuacct.usage # total nanoseconds
cat /sys/fs/cgroup/cpu/cpuacct.stat # user/system jiffies
However, a single cgroup cannot account for everything you see when sidecars share a PID namespace but live in different cgroups. In that case, aggregating /proc/[pid]/stat across the visible processes is the best option.
How do I determine child/parent processes?
From /proc/[pid]/stat
Field 4 is the PPID (parent process ID):
# Get PPID for a specific process
awk '{print $4}' /proc/1234/stat
# List all processes with their parents
for stat in /proc/[0-9]*/stat; do
pid="${stat#/proc/}"
pid="${pid%/stat}"
ppid=$(awk '{print $4}' "$stat" 2>/dev/null) && \
echo "$pid -> parent: $ppid"
done
From /proc/[pid]/status (more readable)
grep PPid /proc/1234/status
# PPid: 1
Building a Process Tree
#!/bin/bash
declare -A parent_of
declare -A children_of
for stat in /proc/[0-9]*/stat; do
if read -r line < "$stat" 2>/dev/null; then
pid="${stat#/proc/}"
pid="${pid%/stat}"
# Extract PPID (field 4, but handle comm with spaces)
rest="${line##*) }"
read -ra fields <<< "$rest"
ppid="${fields[1]}" # 4th field overall = index 1 after state
parent_of[$pid]=$ppid
children_of[$ppid]+="$pid "
fi
done
# Print tree from PID 1
print_tree() {
local pid=$1
local indent=$2
echo "${indent}${pid}"
for child in ${children_of[$pid]}; do
print_tree "$child" " $indent"
done
}
print_tree 1 ""
For CPU Aggregation: Handling cutime/cstime
To properly handle cutime/cstime without double-counting:
#!/bin/bash
declare -A parent_of
declare -A utime stime
# First pass: collect all data
for stat in /proc/[0-9]*/stat; do
if read -r line < "$stat" 2>/dev/null; then
pid="${stat#/proc/}"
pid="${pid%/stat}"
rest="${line##*) }"
read -ra f <<< "$rest"
parent_of[$pid]="${f[1]}"
utime[$pid]="${f[11]}"
stime[$pid]="${f[12]}"
# cutime=${f[13]} cstime=${f[14]} - don't sum these
fi
done
# Sum only utime/stime (not cutime/cstime)
total=0
for pid in "${!utime[@]}"; do
((total += utime[$pid] + stime[$pid]))
done
echo "Total CPU ticks: $total"
echo "Seconds: $(echo "scale=2; $total / $(getconf CLK_TCK)" | bc)"
Key insight: Only sum utime + stime for each process. The cutime/cstime fields are cumulative from children that have already exited and been wait()ed on—those children no longer exist in /proc, so their time is only accessible via the parent's cutime/cstime.
How do I convert utime/stime to percentages?
You need two samples over a time interval. CPU percentage is a rate, not an absolute value.
The Formula
CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100
- delta_ticks = difference in (utime + stime) between samples
- CLK_TCK = ticks per second (usually 100, get via getconf CLK_TCK)
- num_cpus = number of CPUs (omit for single-CPU percentage)
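As a quick worked example: 150 ticks accumulated over a 1-second interval with CLK_TCK=100 on a 4-CPU host gives 150 / (1 * 100 * 4) * 100 = 37.5% of total capacity (or 150% cores-style).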
Two Common Percentage Styles
| Style | Formula | Example |
|---|---|---|
| Normalized (0-100%) | delta / (elapsed * CLK_TCK * num_cpus) * 100 | 50% = half of total capacity |
| Cores-style (0-N*100%) | delta / (elapsed * CLK_TCK) * 100 | 200% = 2 full cores busy |
Practical Script
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)
get_total_ticks() {
local total=0
for stat in /proc/[0-9]*/stat; do
if read -r line < "$stat" 2>/dev/null; then
rest="${line##*) }"
read -ra f <<< "$rest"
((total += f[11] + f[12])) # utime + stime
fi
done
echo "$total"
}
# First sample
ticks1=$(get_total_ticks)
time1=$(date +%s.%N)
# Wait
sleep 1
# Second sample
ticks2=$(get_total_ticks)
time2=$(date +%s.%N)
# Calculate
delta_ticks=$((ticks2 - ticks1))
elapsed=$(echo "$time2 - $time1" | bc)
# Percentage of total CPU capacity (all cores)
pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK * $NUM_CPUS)) * 100" | bc)
echo "CPU usage: ${pct}% of ${NUM_CPUS} CPUs"
# Percentage as "CPU cores used" (like top's 200% for 2 full cores)
cores_pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK)) * 100" | bc)
echo "CPU usage: ${cores_pct}% (cores-style)"
Continuous Monitoring
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)
INTERVAL=1
get_total_ticks() {
local total=0
for stat in /proc/[0-9]*/stat; do
read -r line < "$stat" 2>/dev/null || continue
rest="${line##*) }"
read -ra f <<< "$rest"
((total += f[11] + f[12]))
done
echo "$total"
}
prev_ticks=$(get_total_ticks)
prev_time=$(date +%s.%N)
while true; do
sleep "$INTERVAL"
curr_ticks=$(get_total_ticks)
curr_time=$(date +%s.%N)
delta=$((curr_ticks - prev_ticks))
elapsed=$(echo "$curr_time - $prev_time" | bc)
pct=$(echo "scale=1; $delta / ($elapsed * $CLK_TCK * $NUM_CPUS) * 100" | bc)
printf "\rCPU: %5.1f%%" "$pct"
prev_ticks=$curr_ticks
prev_time=$curr_time
done
Does this calculation respect cgroup limits?
No, it doesn't. The calculation uses nproc, which reports the CPUs the process is allowed to run on (typically the full host count, unless a cpuset restricts it), not the CFS quota that actually caps the container.
The Problem
If your container is limited to 2 CPUs on an 8-CPU host:
- nproc returns 8
- Your calculation shows 25% when you're actually at 100% of your limit
Getting Effective CPU Limit
cgroups v2:
# cpu.max contains: $quota $period (in microseconds)
# "max 100000" means unlimited
read quota period < /sys/fs/cgroup/cpu.max
if [[ "$quota" == "max" ]]; then
effective_cpus=$(nproc)
else
effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
echo "Effective CPUs: $effective_cpus"
cgroups v1:
quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
if [[ "$quota" == "-1" ]]; then
effective_cpus=$(nproc)
else
effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
Also Check cpuset Limits:
# cgroups v2
cpuset=$(cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null)
# cgroups v1
cpuset=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null)
# Parse "0-3,5,7" format to count CPUs
count_cpus() {
local count=0
IFS=',' read -ra ranges <<< "$1"
for range in "${ranges[@]}"; do
if [[ "$range" == *-* ]]; then
start="${range%-*}"
end="${range#*-}"
((count += end - start + 1))
else
((count++))
fi
done
echo "$count"
}
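A brief usage sketch tying this to the quota logic above (assuming the cpuset string read above is non-empty): the effective limit is the smaller of the quota-derived value and the cpuset size.
# Count CPUs allowed by the cpuset; compare against the quota-derived limit
if [[ -n "$cpuset" ]]; then
  cpuset_cpus=$(count_cpus "$cpuset")
  echo "cpuset allows: $cpuset_cpus CPUs"
fi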
Updated Script Respecting Limits
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
get_effective_cpus() {
# Try cgroups v2 first
if [[ -f /sys/fs/cgroup/cpu.max ]]; then
read quota period < /sys/fs/cgroup/cpu.max
if [[ "$quota" != "max" ]]; then
echo "scale=2; $quota / $period" | bc
return
fi
fi
# Try cgroups v1
if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then
quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
if [[ "$quota" != "-1" ]]; then
echo "scale=2; $quota / $period" | bc
return
fi
fi
# Fallback to nproc
nproc
}
EFFECTIVE_CPUS=$(get_effective_cpus)
echo "Effective CPU limit: $EFFECTIVE_CPUS"
# ... rest of your sampling logic using $EFFECTIVE_CPUS
Shared PID Namespace Consideration
When sidecars share PID namespace but have different cgroups:
- Each container may have different CPU limits
- You're aggregating processes across those limits
- There's no single "correct" limit to use
Options:
- Use host CPU count — percentage of total host capacity
- Sum the limits — if you know each sidecar's cgroup, sum their quotas (see the sketch after this list)
- Report in cores — skip normalization, just show 1.5 cores used instead of a percentage
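A minimal sketch of the second option, assuming cgroups v2 and that the sidecar cgroup paths (the names below are hypothetical) are known and readable:
#!/bin/bash
# Hypothetical cgroup paths for the containers sharing the PID namespace
sidecar_cgroups=("/kubepods/podX/app" "/kubepods/podX/sidecar")
total=0
for cg in "${sidecar_cgroups[@]}"; do
  read -r quota period < "/sys/fs/cgroup${cg}/cpu.max" || continue
  # Any unlimited member makes the combined limit effectively the host CPU count
  if [[ "$quota" == "max" ]]; then total=$(nproc); break; fi
  total=$(echo "scale=2; $total + $quota / $period" | bc)
done
echo "Combined CPU limit: $total"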
Can I get the cgroup limit for another cgroup?
Yes, if you have visibility into the cgroup filesystem.
1. Find a Process's Cgroup
Every process exposes its cgroup membership:
# Get cgroup for any PID you can see
cat /proc/1234/cgroup
# cgroups v2 output:
# 0::/kubepods/pod123/container456
# cgroups v1 output:
# 12:cpu,cpuacct:/docker/abc123
# 11:memory:/docker/abc123
# ...
2. Read That Cgroup's Limits
If the cgroup filesystem is mounted and accessible:
#!/bin/bash
get_cgroup_cpu_limit() {
local pid=$1
# Get cgroup path for this PID
cgroup_path=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null) # v2
if [[ -n "$cgroup_path" ]]; then
# cgroups v2
limit_file="/sys/fs/cgroup${cgroup_path}/cpu.max"
if [[ -r "$limit_file" ]]; then
read quota period < "$limit_file"
if [[ "$quota" == "max" ]]; then
echo "unlimited"
else
echo "scale=2; $quota / $period" | bc
fi
return
fi
fi
# Try cgroups v1: match only the cpu controller line (not cpuset)
cgroup_path=$(grep -oP '^[0-9]+:[^:]*\bcpu\b[^:]*:\K.*' /proc/$pid/cgroup 2>/dev/null | head -n1)
if [[ -n "$cgroup_path" ]]; then
quota_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_quota_us"
period_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_period_us"
if [[ -r "$quota_file" ]]; then
quota=$(cat "$quota_file")
period=$(cat "$period_file")
if [[ "$quota" == "-1" ]]; then
echo "unlimited"
else
echo "scale=2; $quota / $period" | bc
fi
return
fi
fi
echo "unknown"
}
# Example: get limit for PID 1234
get_cgroup_cpu_limit 1234
3. Mount Visibility Requirements
| Scenario | Can Read Other Cgroups? |
|---|---|
| Host system | Yes |
| Privileged container | Yes |
| /sys/fs/cgroup mounted read-only from host | Yes (common in Kubernetes) |
| Only own cgroup subtree mounted | No |
Check what's visible:
mount | grep cgroup
ls /sys/fs/cgroup/
4. Full Solution: Aggregate by Cgroup
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
declare -A cgroup_ticks
declare -A cgroup_limit
for stat in /proc/[0-9]*/stat; do
pid="${stat#/proc/}"
pid="${pid%/stat}"
# Get cgroup for this process
cg=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)
[[ -z "$cg" ]] && continue
# Get CPU ticks
if read -r line < "$stat" 2>/dev/null; then
rest="${line##*) }"
read -ra f <<< "$rest"
ticks=$((f[11] + f[12]))
((cgroup_ticks[$cg] += ticks))
# Cache the limit (only look up once per cgroup)
if [[ -z "${cgroup_limit[$cg]}" ]]; then
limit_file="/sys/fs/cgroup${cg}/cpu.max"
if [[ -r "$limit_file" ]]; then
read quota period < "$limit_file"
if [[ "$quota" == "max" ]]; then
cgroup_limit[$cg]="unlimited"
else
cgroup_limit[$cg]=$(echo "scale=2; $quota / $period" | bc)
fi
else
cgroup_limit[$cg]="unknown"
fi
fi
fi
done
echo "Ticks by cgroup:"
for cg in "${!cgroup_ticks[@]}"; do
echo " $cg: ${cgroup_ticks[$cg]} ticks (limit: ${cgroup_limit[$cg]} CPUs)"
done
If You Can't Access Other Cgroups
Fallback options:
- Mount the cgroup fs — add a volume mount for /sys/fs/cgroup:ro (see the sketch after this list)
- Use a sidecar with access — one privileged container does the monitoring
- Accept "unknown" limits — report raw ticks/cores instead of percentages
- Kubernetes Downward API — inject limits as env vars (only for your own container though)
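A minimal sketch of the first option with Docker (the image name monitor is a placeholder):
# Expose the host's cgroup tree read-only inside the monitoring container
docker run --rm -v /sys/fs/cgroup:/sys/fs/cgroup:ro monitor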