# /proc/stat behavior in containerised environments
/proc/stat in containers shows host-level statistics, not container-specific data. To get container-aware CPU metrics when processes span multiple cgroups (e.g., sidecars sharing a PID namespace), aggregate /proc/[pid]/stat for all visible processes and use cgroup limits from /sys/fs/cgroup for normalization.
## Why /proc/stat is wrong in containers
/proc/stat reports host-wide values (CPU times, context switches, boot time, process count) because /proc is mounted from the host kernel, which has no namespace awareness for these metrics.
This means:

- Tools reading /proc/stat (top, htop, etc.) show host CPU usage, not container usage
- Cgroup CPU limits (e.g., 2 CPUs) are not reflected: all host CPUs are visible
- In shared environments, containers see each other's aggregate impact
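A quick way to see the mismatch from inside a CPU-limited container (cgroups v2 path assumed):

```bash
# /proc/stat enumerates every host CPU; the cgroup file shows the real quota.
grep -c '^cpu[0-9]' /proc/stat           # number of host CPUs visible
cat /sys/fs/cgroup/cpu.max 2>/dev/null   # e.g. "200000 100000" = 2 CPUs
```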
## Alternatives
| Approach | Description |
|---|---|
| cgroups | Read /sys/fs/cgroup/cpu/ for container-specific CPU accounting |
| LXCFS | FUSE filesystem providing container-aware /proc files |
| Container runtimes | Some (like Kata) use VMs with isolated kernels |
| Metrics APIs | Docker/Kubernetes APIs instead of /proc/stat |
For example, reading container-specific CPU accounting from the cgroup filesystem:

```bash
# cgroups v2:
cat /sys/fs/cgroup/cpu.stat
# cgroups v1:
cat /sys/fs/cgroup/cpu/cpuacct.usage
```
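For the v2 case, cpu.stat reports cumulative usage in microseconds; a minimal sketch converting it to CPU-seconds:

```bash
# usage_usec: cumulative CPU time consumed by this cgroup, in microseconds.
usage_usec=$(awk '/^usage_usec/ {print $2}' /sys/fs/cgroup/cpu.stat)
echo "scale=2; $usage_usec / 1000000" | bc   # CPU-seconds used by the cgroup
```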
## Aggregating per-Process CPU from /proc/[pid]/stat
When cgroup-level reads aren't an option (e.g., sidecars sharing a PID namespace but running in different cgroups), aggregate individual process stats:
```bash
# Sum fields 14 (utime) and 15 (stime) from /proc/[pid]/stat for every
# visible process. Field numbers count from the start of the line and
# assume comm (field 2) contains no spaces; the scripts below parse more
# robustly by splitting after the closing ')'.
for pid in /proc/[0-9]*; do
  awk '{print $14 + $15}' "$pid/stat" 2>/dev/null
done | awk '{sum += $1} END {print sum}'
```
## Caveats
- Race conditions: processes can spawn or die between reads
- Short-lived processes: missed if they start and exit between samples
- Zombie/exited processes: their CPU time may not be captured
- Overhead: scanning all PIDs repeatedly is expensive
- Namespace visibility: you only see processes in your PID namespace (which is what you want)
- Children accounting: when a process exits, its CPU time is added to the parent's cutime/cstime, risking double-counting

Cgroups handle these edge cases natively, but they cannot be used when sidecars share the PID namespace while belonging to different cgroups; in that case, per-process aggregation is the best option.
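In that scenario the aggregation can also be broken down per cgroup by reading each process's membership from /proc/[pid]/cgroup. A minimal sketch, assuming cgroups v2 (the 0:: line):

```bash
#!/bin/bash
# Group utime+stime ticks by the cgroup each visible process belongs to.
declare -A ticks_by_cgroup
for dir in /proc/[0-9]*; do
  cg=$(grep -oP '^0::\K.*' "$dir/cgroup" 2>/dev/null) || continue
  read -r line 2>/dev/null < "$dir/stat" || continue
  rest="${line##*) }"; read -ra f <<< "$rest"
  ((ticks_by_cgroup[$cg] += f[11] + f[12]))
done
for cg in "${!ticks_by_cgroup[@]}"; do
  echo "$cg: ${ticks_by_cgroup[$cg]} ticks"
done
```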
## Parent/Child Process Relationships
Field 4 in /proc/[pid]/stat is the PPID (parent process ID):
```bash
awk '{print $4}' /proc/1234/stat    # PPID from stat (assumes comm has no spaces)
grep PPid /proc/1234/status         # more readable, and robust to odd comm values
```
### Building a Process Tree
```bash
#!/bin/bash
# Build parent -> children maps from /proc/[pid]/stat, then print a tree.
declare -A parent_of children_of

for stat in /proc/[0-9]*/stat; do
  # 2>/dev/null comes before the input redirect so that a process that
  # vanished mid-scan doesn't print a "No such file" error.
  if read -r line 2>/dev/null < "$stat"; then
    pid="${stat#/proc/}"; pid="${pid%/stat}"
    # Split after the closing ')' so a comm containing spaces can't
    # shift the field positions.
    rest="${line##*) }"; read -ra fields <<< "$rest"
    ppid="${fields[1]}"   # 4th field overall = index 1 after state
    parent_of[$pid]=$ppid
    children_of[$ppid]+="$pid "
  fi
done

print_tree() {
  local pid=$1 indent=$2
  echo "${indent}${pid}"
  for child in ${children_of[$pid]}; do print_tree "$child" " $indent"; done
}

print_tree 1 ""
```
## Avoiding Double-Counting with cutime/cstime
Only sum utime + stime per process. The cutime/cstime fields are cumulative from children that have already exited and been wait()ed on; those children no longer exist in /proc, so their time is only accessible via the parent.
```bash
#!/bin/bash
# Sum utime + stime for every live process, splitting after ')' to survive
# comm values with spaces.
declare -A utime stime

for stat in /proc/[0-9]*/stat; do
  if read -r line 2>/dev/null < "$stat"; then
    pid="${stat#/proc/}"; pid="${pid%/stat}"
    rest="${line##*) }"; read -ra f <<< "$rest"
    utime[$pid]="${f[11]}"; stime[$pid]="${f[12]}"
    # cutime=${f[13]} cstime=${f[14]}: don't sum these
  fi
done

total=0
for pid in "${!utime[@]}"; do ((total += utime[$pid] + stime[$pid])); done
echo "Total CPU ticks: $total"
echo "Seconds: $(echo "scale=2; $total / $(getconf CLK_TCK)" | bc)"
```
## Converting Ticks to CPU Percentages
CPU percentage is a rate: you need two samples over a time interval.

```
CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100
```

- delta_ticks = difference in (utime + stime) between samples
- CLK_TCK = ticks per second (usually 100; get via getconf CLK_TCK)
- num_cpus = number of CPUs (omit for per-core percentage)
| Style | Formula | Example |
|---|---|---|
| Normalized (0-100%) | delta / (elapsed * CLK_TCK * num_cpus) * 100 | 50% = half of total capacity |
| Cores-style (0-N*100%) | delta / (elapsed * CLK_TCK) * 100 | 200% = 2 full cores busy |
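A worked example with 150 delta ticks over a 1-second interval, CLK_TCK=100, and 4 CPUs (multiply before dividing so bc's scale doesn't truncate early):

```bash
echo "scale=2; 150 * 100 / (1 * 100 * 4)" | bc   # 37.50  (normalized %)
echo "scale=2; 150 * 100 / (1 * 100)" | bc       # 150.00 (cores-style %)
```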
### Sampling Script
```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)

# Total utime+stime ticks across all visible processes.
get_total_ticks() {
  local total=0 line rest
  local -a f
  for stat in /proc/[0-9]*/stat; do
    if read -r line 2>/dev/null < "$stat"; then
      rest="${line##*) }"; read -ra f <<< "$rest"
      ((total += f[11] + f[12]))
    fi
  done
  echo "$total"
}

ticks1=$(get_total_ticks); time1=$(date +%s.%N)
sleep 1
ticks2=$(get_total_ticks); time2=$(date +%s.%N)

delta_ticks=$((ticks2 - ticks1))
elapsed=$(echo "$time2 - $time1" | bc)

# Multiply before dividing: with scale=2, bc truncates at the division,
# so dividing first would lose the decimal places before the final *100.
pct=$(echo "scale=2; ($delta_ticks * 100) / ($elapsed * $CLK_TCK * $NUM_CPUS)" | bc)
echo "CPU usage: ${pct}% of ${NUM_CPUS} CPUs"

cores_pct=$(echo "scale=2; ($delta_ticks * 100) / ($elapsed * $CLK_TCK)" | bc)
echo "CPU usage: ${cores_pct}% (cores-style)"
```
## Respecting Cgroup CPU Limits
The calculations above use nproc, which reflects the CPUs the process is allowed to run on, not the cgroup quota. If a container is limited to 2 CPUs by quota on an 8-CPU host, nproc still returns 8 and the percentage is misleading.
### Reading Effective CPU Limit
```bash
#!/bin/bash
# Effective CPU limit: quota/period from the cgroup, else the visible CPU count.
get_effective_cpus() {
  local quota period
  # cgroups v2
  if [[ -f /sys/fs/cgroup/cpu.max ]]; then
    read -r quota period < /sys/fs/cgroup/cpu.max
    [[ "$quota" != "max" ]] && echo "scale=2; $quota / $period" | bc && return
  fi
  # cgroups v1
  if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then
    quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
    period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
    [[ "$quota" != "-1" ]] && echo "scale=2; $quota / $period" | bc && return
  fi
  nproc  # no quota found: fall back to the visible CPU count
}
```
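Plugged into the sampling script above, the effective limit replaces NUM_CPUS (delta_ticks, elapsed, and CLK_TCK carried over from that script):

```bash
# Normalize against the cgroup limit rather than the host CPU count.
EFFECTIVE_CPUS=$(get_effective_cpus)
pct=$(echo "scale=2; ($delta_ticks * 100) / ($elapsed * $CLK_TCK * $EFFECTIVE_CPUS)" | bc)
echo "CPU usage: ${pct}% of a ${EFFECTIVE_CPUS}-CPU limit"
```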
Also check cpuset limits (cpuset.cpus.effective for v2, cpuset/cpuset.cpus for v1), which restrict which physical CPUs are available.
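Counting the CPUs in a cpuset list takes a little parsing, since the file holds ranges like 0-3,8. A sketch for the v2 path:

```bash
# Count CPUs in a cpuset list such as "0-3,8" (= 5 CPUs).
count_cpuset_cpus() {
  local list part lo hi count=0
  local -a parts
  read -r list < /sys/fs/cgroup/cpuset.cpus.effective
  IFS=, read -ra parts <<< "$list"
  for part in "${parts[@]}"; do
    if [[ "$part" == *-* ]]; then
      lo=${part%-*}; hi=${part#*-}
      ((count += hi - lo + 1))   # a range contributes hi-lo+1 CPUs
    else
      ((count += 1))             # a single CPU number
    fi
  done
  echo "$count"
}
```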
### Shared PID Namespace Complication
When sidecars share a PID namespace but have different cgroups, there's no single "correct" CPU limit for normalization. Options:

- Use host CPU count: percentage of total host capacity
- Sum the limits: if you know each sidecar's cgroup, sum their quotas
- Report in cores: skip normalization and show 1.5 cores used instead of a percentage (see the sketch below)
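The cores-style report is the simplest; a sketch reusing delta_ticks, elapsed, and CLK_TCK from the sampling script above:

```bash
# No CPU-count normalization needed for a cores-style figure.
cores=$(echo "scale=2; $delta_ticks / ($elapsed * $CLK_TCK)" | bc)
echo "CPU usage: ${cores} cores used"
```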
## Reading Cgroup Limits for Other Containers
Every process exposes its cgroup membership via /proc/<PID>/cgroup. If the cgroup filesystem is mounted, you can read any container's limits:
```bash
# CPU limit for an arbitrary PID's cgroup: quota/period, "unlimited", or "unknown".
get_cgroup_cpu_limit() {
  local pid=$1
  local cgroup_path quota period limit_file quota_file period_file
  # cgroups v2: single line of the form "0::<path>"
  cgroup_path=$(grep -oP '^0::\K.*' "/proc/$pid/cgroup" 2>/dev/null)
  if [[ -n "$cgroup_path" ]]; then
    limit_file="/sys/fs/cgroup${cgroup_path}/cpu.max"
    if [[ -r "$limit_file" ]]; then
      read -r quota period < "$limit_file"
      [[ "$quota" == "max" ]] && echo "unlimited" || echo "scale=2; $quota / $period" | bc
      return
    fi
  fi
  # cgroups v1: match the line whose controller list contains the cpu controller
  # (a bare 'cpu.*:' pattern would also match the cpuset line)
  cgroup_path=$(grep -oP '^\d+:[^:]*\bcpu\b[^:]*:\K.*' "/proc/$pid/cgroup" 2>/dev/null | head -1)
  if [[ -n "$cgroup_path" ]]; then
    quota_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_quota_us"
    period_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_period_us"
    if [[ -r "$quota_file" ]]; then
      quota=$(cat "$quota_file"); period=$(cat "$period_file")
      [[ "$quota" == "-1" ]] && echo "unlimited" || echo "scale=2; $quota / $period" | bc
      return
    fi
  fi
  echo "unknown"
}
```
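For example, mapping every visible process to its cgroup's CPU limit:

```bash
# Print each visible PID alongside the CPU limit of its cgroup.
for dir in /proc/[0-9]*; do
  pid=${dir#/proc/}
  echo "$pid: $(get_cgroup_cpu_limit "$pid")"
done
```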
### Mount Visibility
| Scenario | Can Read Other Cgroups? |
|---|---|
| Host system | Yes |
| Privileged container | Yes |
| /sys/fs/cgroup mounted read-only from host | Yes (common in Kubernetes) |
| Only own cgroup subtree mounted | No |
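One way to tell which row applies at runtime is to probe another process's cgroup path. A minimal sketch, assuming cgroups v2 and using PID 1's cgroup as an arbitrary probe target (inconclusive if PID 1 happens to share your cgroup):

```bash
probe=$(grep -oP '^0::\K.*' /proc/1/cgroup 2>/dev/null)
if [[ -n "$probe" && -r "/sys/fs/cgroup${probe}/cpu.max" ]]; then
  echo "other cgroups readable"
else
  echo "limited to own subtree (or cgroups v1)"
fi
```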
### Fallbacks When Cgroups Aren't Accessible
- Mount the cgroup fs: volume-mount /sys/fs/cgroup:ro from the host
- Use a sidecar with access: one privileged container does the monitoring
- Accept "unknown" limits: report raw ticks/cores instead of percentages
- Kubernetes Downward API: inject limits as env vars (own container only)