# /proc/stat behavior in containerized environments

`/proc/stat` in containers shows **host-level** statistics, not container-specific data. To get container-aware CPU metrics when processes span multiple cgroups (e.g., sidecars sharing a PID namespace), aggregate `/proc/[pid]/stat` for all visible processes and use cgroup limits from `/sys/fs/cgroup` for normalization.

## Why /proc/stat is wrong in containers

`/proc/stat` reports host-wide values (CPU times, context switches, boot time, process count) because procfs is provided by the host kernel, and these particular entries are not namespaced.

This means:

- Tools reading `/proc/stat` (top, htop, etc.) show **host** CPU usage, not container usage
- Cgroup CPU limits (e.g., 2 CPUs) are not reflected — all host CPUs are visible
- In shared environments, containers see each other's aggregate impact

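A quick way to see the mismatch from inside a limited container (a minimal check, assuming cgroups v2):

```bash
# Count the per-CPU "cpuN" lines in /proc/stat -- this is the *host* CPU count
grep -c '^cpu[0-9]' /proc/stat

# The container's actual quota: "<quota> <period>", or "max" if unlimited
cat /sys/fs/cgroup/cpu.max
```
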
### Alternatives

| Approach | Description |
|----------|-------------|
| **cgroups** | Read `/sys/fs/cgroup` for container-specific CPU accounting |
| **LXCFS** | FUSE filesystem providing container-aware `/proc` files |
| **Container runtimes** | Some (like Kata) use VMs with isolated kernels |
| **Metrics APIs** | Docker/Kubernetes APIs instead of `/proc/stat` |

```bash
# cgroups v2:
cat /sys/fs/cgroup/cpu.stat

# cgroups v1:
cat /sys/fs/cgroup/cpu/cpuacct.usage
```

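Note the units differ: v2's `cpu.stat` reports `usage_usec`/`user_usec`/`system_usec` in microseconds, while v1's `cpuacct.usage` is a single cumulative total in nanoseconds.
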
## Aggregating per-Process CPU from /proc/[pid]/stat

When cgroup-level reads aren't an option (sidecars sharing a PID namespace but belonging to different cgroups), aggregate individual process stats:

```bash
# Fields 14 (utime) and 15 (stime) in /proc/[pid]/stat.
# NB: quick and naive -- a comm containing spaces shifts awk's field numbers;
# the scripts below handle that case properly.
for pid in /proc/[0-9]*; do
    awk '{print $14 + $15}' "$pid/stat" 2>/dev/null
done | awk '{sum += $1} END {print sum}'
```

### Caveats

1. **Race conditions** — processes can spawn/die between reads
2. **Short-lived processes** — missed if they start and exit between samples
3. **Zombie/exited processes** — their CPU time may not be captured
4. **Overhead** — scanning all PIDs repeatedly is expensive
5. **Namespace visibility** — you only see processes in your PID namespace (which is what you want)
6. **Children accounting** — when a process exits, its CPU time is added to the parent's `cutime`/`cstime`, risking double-counting

Cgroups handle these edge cases natively, but **cannot be used when sidecars share the PID namespace with different cgroups** — in that case, per-process aggregation is the best option.

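Caveats 1, 2, and 6 interact: a PID that dies between two samples, or gets reused, can corrupt a naive delta. One mitigation (a sketch, assuming bash 4.3+ for namerefs; short-lived processes are still missed) is to diff per-PID ticks between two snapshots and count only PIDs present in both:

```bash
#!/bin/bash
declare -A t1 t2

snapshot() {
    local -n out=$1                        # nameref to the target assoc array
    local stat line rest pid
    local -a f
    for stat in /proc/[0-9]*/stat; do
        read -r line 2>/dev/null < "$stat" || continue
        pid="${stat#/proc/}"; pid="${pid%/stat}"
        rest="${line##*) }"; read -ra f <<< "$rest"
        out[$pid]=$((f[11] + f[12]))       # utime + stime, in ticks
    done
}

snapshot t1; sleep 1; snapshot t2

delta=0
for pid in "${!t2[@]}"; do
    [[ -v t1[$pid] ]] || continue          # new PID: no baseline, skip
    d=$((t2[$pid] - t1[$pid]))
    ((d >= 0)) && ((delta += d))           # negative delta => PID reuse, skip
done
echo "Delta ticks over interval: $delta"
```
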
## Parent/Child Process Relationships

Field 4 in `/proc/[pid]/stat` is the PPID (parent process ID):

```bash
awk '{print $4}' /proc/1234/stat   # PPID from stat (assumes comm has no spaces)
grep PPid /proc/1234/status        # more readable
```

### Building a Process Tree

```bash
#!/bin/bash
declare -A parent_of children_of

for stat in /proc/[0-9]*/stat; do
    if read -r line 2>/dev/null < "$stat"; then
        pid="${stat#/proc/}"; pid="${pid%/stat}"
        rest="${line##*) }"              # strip "pid (comm) "; comm may contain spaces
        read -ra fields <<< "$rest"
        ppid="${fields[1]}"              # field 4 overall (ppid) = index 1 after comm
        parent_of[$pid]=$ppid
        children_of[$ppid]+="$pid "
    fi
done

print_tree() {
    local pid=$1 indent=$2
    echo "${indent}${pid}"
    for child in ${children_of[$pid]}; do print_tree "$child" "  $indent"; done
}
print_tree 1 ""
```

### Avoiding Double-Counting with cutime/cstime

Only sum `utime` + `stime` per process. The `cutime`/`cstime` fields are cumulative from children that have already exited and been `wait()`ed on — those children no longer exist in `/proc`, so their time is only accessible via the parent.

```bash
#!/bin/bash
declare -A utime stime

for stat in /proc/[0-9]*/stat; do
    if read -r line 2>/dev/null < "$stat"; then
        pid="${stat#/proc/}"; pid="${pid%/stat}"
        rest="${line##*) }"; read -ra f <<< "$rest"
        utime[$pid]="${f[11]}"; stime[$pid]="${f[12]}"
        # cutime=${f[13]} cstime=${f[14]} — don't sum these
    fi
done

total=0
for pid in "${!utime[@]}"; do ((total += utime[$pid] + stime[$pid])); done
echo "Total CPU ticks: $total"
echo "Seconds: $(echo "scale=2; $total / $(getconf CLK_TCK)" | bc)"
```

## Converting Ticks to CPU Percentages

CPU percentage is a rate — you need **two samples** over a time interval.

```
CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100
```

- `delta_ticks` = difference in (utime + stime) between samples
- `CLK_TCK` = ticks per second (usually 100, get via `getconf CLK_TCK`)
- `num_cpus` = number of CPUs (omit for cores-style percentage)

| Style | Formula | Example |
|-------|---------|---------|
| **Normalized** (0-100%) | `delta / (elapsed * CLK_TCK * num_cpus) * 100` | 50% = half of total capacity |
| **Cores-style** (0-N*100%) | `delta / (elapsed * CLK_TCK) * 100` | 200% = 2 full cores busy |

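For example, 400 ticks accumulated over 2 seconds with `CLK_TCK=100` on 4 CPUs gives `400 / (2 * 100 * 4) * 100 = 50%` normalized, or `400 / (2 * 100) * 100 = 200%` cores-style.
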
### Sampling Script

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)

get_total_ticks() {
    local total=0
    for stat in /proc/[0-9]*/stat; do
        if read -r line 2>/dev/null < "$stat"; then
            rest="${line##*) }"; read -ra f <<< "$rest"
            ((total += f[11] + f[12]))
        fi
    done
    echo "$total"
}

ticks1=$(get_total_ticks); time1=$(date +%s.%N)
sleep 1
ticks2=$(get_total_ticks); time2=$(date +%s.%N)

delta_ticks=$((ticks2 - ticks1))
elapsed=$(echo "$time2 - $time1" | bc)

pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK * $NUM_CPUS)) * 100" | bc)
echo "CPU usage: ${pct}% of ${NUM_CPUS} CPUs"

cores_pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK)) * 100" | bc)
echo "CPU usage: ${cores_pct}% (cores-style)"
```

## Respecting Cgroup CPU Limits

The above calculations use `nproc`, which returns the **host** CPU count. If a container is limited to 2 CPUs via a quota on an 8-CPU host, `nproc` still returns 8 and the percentage is misleading.

### Reading Effective CPU Limit

```bash
#!/bin/bash
get_effective_cpus() {
    # cgroups v2
    if [[ -f /sys/fs/cgroup/cpu.max ]]; then
        read -r quota period < /sys/fs/cgroup/cpu.max
        [[ "$quota" != "max" ]] && echo "scale=2; $quota / $period" | bc && return
    fi
    # cgroups v1
    if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then
        quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
        period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
        [[ "$quota" != "-1" ]] && echo "scale=2; $quota / $period" | bc && return
    fi
    nproc   # fallback: no quota set, all host CPUs available
}
```

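To make the earlier sampling script limit-aware, substitute this for its `nproc` call (`bc` handles the possibly fractional value):

```bash
NUM_CPUS=$(get_effective_cpus)   # instead of NUM_CPUS=$(nproc)
```
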

Also check cpuset limits (`cpuset.cpus.effective` for v2, `cpuset/cpuset.cpus` for v1), which restrict which physical CPUs are available.

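A minimal sketch of counting those, assuming cgroups v2 and the usual list syntax (`0-3,8`):

```bash
# Count CPUs allowed by the cpuset controller (cgroups v2).
# Parses comma-separated entries and ranges, e.g. "0-3,8" -> 5.
count_cpuset_cpus() {
    local spec part lo hi count=0
    local -a parts
    spec=$(cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null) || { nproc; return; }
    IFS=',' read -ra parts <<< "$spec"
    for part in "${parts[@]}"; do
        if [[ "$part" == *-* ]]; then
            lo="${part%-*}"; hi="${part#*-}"
            ((count += hi - lo + 1))
        else
            ((count += 1))
        fi
    done
    echo "$count"
}
```
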
### Shared PID Namespace Complication

When sidecars share a PID namespace but have different cgroups, there's no single "correct" CPU limit for normalization. Options:

1. **Use host CPU count** — percentage of total host capacity
2. **Sum the limits** — if you know each sidecar's cgroup, sum their quotas (sketched below)
3. **Report in cores** — skip normalization, show `1.5 cores used` instead of a percentage

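A sketch of option 2, assuming cgroups v2; the two cgroup paths are hypothetical placeholders for wherever the runtime actually puts each sidecar:

```bash
# Sum cpu.max quotas across known sidecar cgroups (paths are examples).
total=0
for cg in /sys/fs/cgroup/kubepods/app /sys/fs/cgroup/kubepods/sidecar; do
    read -r quota period 2>/dev/null < "$cg/cpu.max" || continue
    [[ "$quota" == "max" ]] && continue   # unlimited: no finite quota to add
    total=$(echo "scale=2; $total + $quota / $period" | bc)
done
echo "Combined CPU limit: $total cores"
```
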
## Reading Cgroup Limits for Other Containers

Every process exposes its cgroup membership via `/proc/<PID>/cgroup`. If the cgroup filesystem is mounted, you can read any container's limits:

```bash
get_cgroup_cpu_limit() {
    local pid=$1
    # cgroups v2: single entry of the form "0::/path"
    cgroup_path=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)
    if [[ -n "$cgroup_path" ]]; then
        limit_file="/sys/fs/cgroup${cgroup_path}/cpu.max"
        if [[ -r "$limit_file" ]]; then
            read -r quota period < "$limit_file"
            [[ "$quota" == "max" ]] && echo "unlimited" || echo "scale=2; $quota / $period" | bc
            return
        fi
    fi
    # cgroups v1: pick the hierarchy whose controller list contains exactly
    # "cpu" (so cpuset- or cpuacct-only hierarchies don't match)
    cgroup_path=$(awk -F: '$2 ~ /(^|,)cpu(,|$)/ {print $3; exit}' /proc/$pid/cgroup 2>/dev/null)
    if [[ -n "$cgroup_path" ]]; then
        quota_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_quota_us"
        period_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_period_us"
        if [[ -r "$quota_file" ]]; then
            quota=$(cat "$quota_file"); period=$(cat "$period_file")
            [[ "$quota" == "-1" ]] && echo "unlimited" || echo "scale=2; $quota / $period" | bc
            return
        fi
    fi
    echo "unknown"
}
```

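Usage, assuming the cgroup filesystem is actually visible (see the table below) and reusing the placeholder PID from earlier:

```bash
echo "PID 1234 CPU limit: $(get_cgroup_cpu_limit 1234)"
```
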
### Mount Visibility

| Scenario | Can Read Other Cgroups? |
|----------|------------------------|
| Host system | Yes |
| Privileged container | Yes |
| `/sys/fs/cgroup` mounted read-only from host | Yes (common in Kubernetes) |
| Only own cgroup subtree mounted | No |

### Fallbacks When Cgroups Aren't Accessible

1. **Mount the cgroup fs** — volume mount the host's `/sys/fs/cgroup` read-only (example below)
2. **Use a sidecar with access** — one privileged container does the monitoring
3. **Accept "unknown" limits** — report raw ticks/cores instead of percentages
4. **Kubernetes Downward API** — inject limits as env vars (own container only)

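For example, with Docker (the image name is a placeholder):

```bash
docker run -v /sys/fs/cgroup:/sys/fs/cgroup:ro my-monitor:latest
```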