# /proc/stat Behavior in Containerized Environments

## TL;DR

`/proc/stat` in containers shows host-level statistics, not container-specific data. To get accurate container CPU metrics when cgroups aren't an option (e.g., sidecars sharing a PID namespace with different cgroups), aggregate `/proc/[pid]/stat` for all visible processes and calculate percentages using cgroup limits from `/sys/fs/cgroup`.

## Outcome

- Explained why `/proc/stat` doesn't provide container-aware metrics
- Provided methods to aggregate process CPU usage from `/proc/[pid]/stat`
- Showed how to determine parent/child process relationships
- Demonstrated converting raw ticks to CPU percentages
- Covered how to read cgroup CPU limits for other containers

## Conversation
### How does /proc/stat behave in containerized environments?
By default, `/proc/stat` inside a container shows **host system statistics**, not container-specific data. This includes:

- **CPU times** (user, system, idle, etc.) - reflects all CPUs on the host
- **Context switches** - host-wide count
- **Boot time** - host's boot time
- **Processes created** - host-wide total

#### Why This Happens

The `/proc` filesystem is mounted from the host kernel. Containers share the host kernel, and `/proc/stat` is a kernel-level interface that doesn't have namespace awareness for most of its metrics.

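A quick way to see this from inside a container (the 8-CPU host here is illustrative):

```bash
# Count the per-CPU lines in /proc/stat; in a container limited to 2 CPUs
# on an 8-CPU host, this still prints 8.
grep -c '^cpu[0-9]' /proc/stat
```
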
#### Practical Implications

1. **CPU usage calculations are wrong** - Tools like `top`, `htop`, or custom monitoring that read `/proc/stat` will show host CPU usage, not container usage
2. **Resource limits aren't reflected** - Even if you limit a container to 2 CPUs via cgroups, `/proc/stat` shows all host CPUs
3. **Multi-tenant confusion** - In shared environments, containers see each other's aggregate impact

#### Solutions

| Approach | Description |
|----------|-------------|
| **cgroups** | Read from `/sys/fs/cgroup/cpu/` for container-specific CPU accounting |
| **LXCFS** | A FUSE filesystem that provides container-aware `/proc` files |
| **Container runtimes** | Some runtimes (like Kata) use VMs with isolated kernels |
| **Metrics APIs** | Use Docker/Kubernetes APIs instead of `/proc/stat` |

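For the LXCFS route, a minimal sketch (the daemon mountpoint and image name are illustrative; check your distro's packaging):

```bash
# On the host: run the LXCFS FUSE daemon over its working directory
lxcfs /var/lib/lxcfs &

# Start a container with LXCFS's container-aware stat bind-mounted over /proc/stat
docker run -v /var/lib/lxcfs/proc/stat:/proc/stat:ro my-image
```
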
#### Example: Getting Real Container CPU

```bash
# Instead of /proc/stat, use cgroups v2:
cat /sys/fs/cgroup/cpu.stat

# Or cgroups v1 (cpu and cpuacct are commonly co-mounted):
cat /sys/fs/cgroup/cpu/cpuacct.usage
```


---

### Can I aggregate process usage to get the usage of the container?

Yes, you can aggregate process usage, but there are caveats.

#### Aggregating from /proc/[pid]/stat

```bash
# Fields 14 (utime) and 15 (stime) in /proc/[pid]/stat hold user and system
# CPU time in clock ticks. Note: this quick version miscounts if a process
# name (comm) contains spaces; the scripts below parse around that.
for pid in /proc/[0-9]*; do
    awk '{print $14 + $15}' "$pid/stat" 2>/dev/null
done | awk '{sum += $1} END {print sum}'
```

#### Problems with This Approach

1. **Race conditions** - Processes can spawn/die between reads, causing inconsistent totals
2. **Short-lived processes** - Processes that start and exit between samples are missed entirely
3. **Zombie/exited processes** - Their CPU time may not be captured
4. **Overhead** - Scanning all PIDs repeatedly is expensive
5. **Namespace visibility** - You only see processes in your PID namespace, but that's actually what you want in a container
6. **Children accounting** - When a process exits, its CPU time gets added to the parent's `cutime`/`cstime` fields, which can cause double-counting if not handled carefully

#### Better Alternative: cgroups

Cgroups track CPU usage at the group level and handle all these edge cases:

```bash
# cgroups v2 (modern)
cat /sys/fs/cgroup/cpu.stat
# Shows: usage_usec, user_usec, system_usec

# cgroups v1
cat /sys/fs/cgroup/cpu/cpuacct.usage  # total nanoseconds
cat /sys/fs/cgroup/cpu/cpuacct.stat   # user/system jiffies
```

**However**, cgroups cannot be used when sidecars share the PID namespace but have different cgroups. In that case, aggregating `/proc/[pid]/stat` is the best option.
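
To check whether you're in that situation, you can list the distinct cgroups of everything visible in your PID namespace (a quick diagnostic sketch):

```bash
# One line per unique cgroup across all visible processes; more than one
# distinct entry means a single per-cgroup read won't cover everything you see.
for pid in /proc/[0-9]*; do
    cat "$pid/cgroup" 2>/dev/null
done | sort -u
```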
---

### How do I determine child/parent processes?

#### From /proc/[pid]/stat

Field 4 is the **PPID** (parent process ID):

```bash
# Get the PPID for a specific process (field 4; like the quick summing
# example earlier, awk field numbers assume comm has no spaces)
awk '{print $4}' /proc/1234/stat

# List all processes with their parents
for stat in /proc/[0-9]*/stat; do
    pid="${stat#/proc/}"
    pid="${pid%/stat}"
    ppid=$(awk '{print $4}' "$stat" 2>/dev/null) && \
        echo "$pid -> parent: $ppid"
done
```

#### From /proc/[pid]/status (more readable)

```bash
grep PPid /proc/1234/status
# PPid: 1
```

#### Building a Process Tree

```bash
#!/bin/bash
declare -A parent_of
declare -A children_of

for stat in /proc/[0-9]*/stat; do
    if read -r line < "$stat" 2>/dev/null; then
        pid="${stat#/proc/}"
        pid="${pid%/stat}"

        # Extract PPID (field 4, but handle comm with spaces by stripping
        # everything up to the ") " that terminates the comm field)
        rest="${line##*) }"
        read -ra fields <<< "$rest"
        ppid="${fields[1]}"  # 4th field overall = index 1 after state

        parent_of[$pid]=$ppid
        children_of[$ppid]+="$pid "
    fi
done

# Print tree from PID 1
print_tree() {
    local pid=$1
    local indent=$2
    echo "${indent}${pid}"
    for child in ${children_of[$pid]}; do
        print_tree "$child" "  $indent"
    done
}

print_tree 1 ""
```

#### For CPU Aggregation: Handling cutime/cstime

To properly handle `cutime`/`cstime` without double-counting:

```bash
#!/bin/bash
declare -A parent_of
declare -A utime stime

# First pass: collect all data
for stat in /proc/[0-9]*/stat; do
    if read -r line < "$stat" 2>/dev/null; then
        pid="${stat#/proc/}"
        pid="${pid%/stat}"
        rest="${line##*) }"
        read -ra f <<< "$rest"

        parent_of[$pid]="${f[1]}"   # PPID (field 4 overall)
        utime[$pid]="${f[11]}"      # field 14 overall
        stime[$pid]="${f[12]}"      # field 15 overall
        # cutime=${f[13]} cstime=${f[14]} - don't sum these
    fi
done

# Sum only utime/stime (not cutime/cstime)
total=0
for pid in "${!utime[@]}"; do
    ((total += utime[$pid] + stime[$pid]))
done

echo "Total CPU ticks: $total"
echo "Seconds: $(echo "scale=2; $total / $(getconf CLK_TCK)" | bc)"
```

**Key insight:** Only sum `utime` + `stime` for each process. The `cutime`/`cstime` fields are cumulative from children that have already exited and been `wait()`ed on—those children no longer exist in `/proc`, so their time is only accessible via the parent's `cutime`/`cstime`.
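
You can observe this in a live shell (the loop bound is arbitrary; any CPU-burning child works):

```bash
# Burn some CPU in a child, let the shell reap it, then inspect the shell's
# own stat: the child's time shows up only in cutime/cstime (fields 16/17).
sh -c 'i=0; while [ $i -lt 500000 ]; do i=$((i+1)); done'
awk '{print "cutime:", $16, "cstime:", $17}' /proc/$$/stat
```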
---

### How do I convert utime/stime to percentages?

You need **two samples** over a time interval. CPU percentage is a rate, not an absolute value.

#### The Formula

```
CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100
```

- `delta_ticks` = difference in (utime + stime) between samples
- `CLK_TCK` = ticks per second (usually 100, get via `getconf CLK_TCK`)
- `num_cpus` = number of CPUs (omit for single-CPU percentage)

#### Two Common Percentage Styles

| Style | Formula | Example |
|-------|---------|---------|
| **Normalized** (0-100%) | `delta / (elapsed * CLK_TCK * num_cpus) * 100` | 50% = half of total capacity |
| **Cores-style** (0-N*100%) | `delta / (elapsed * CLK_TCK) * 100` | 200% = 2 full cores busy |

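A worked example, assuming 150 ticks measured over 1.0 s with `CLK_TCK=100` on 4 CPUs:

```bash
echo "scale=4; 150 / (1.0 * 100 * 4) * 100" | bc   # 37.5000  (normalized)
echo "scale=4; 150 / (1.0 * 100) * 100" | bc       # 150.0000 (cores-style: 1.5 cores)
```
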
#### Practical Script

```bash
#!/bin/bash

CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)

# Sum utime+stime (in clock ticks) across every visible process
get_total_ticks() {
    local total=0
    for stat in /proc/[0-9]*/stat; do
        if read -r line < "$stat" 2>/dev/null; then
            rest="${line##*) }"
            read -ra f <<< "$rest"
            ((total += f[11] + f[12]))  # utime + stime
        fi
    done
    echo "$total"
}

# First sample
ticks1=$(get_total_ticks)
time1=$(date +%s.%N)

# Wait
sleep 1

# Second sample
ticks2=$(get_total_ticks)
time2=$(date +%s.%N)

# Calculate
delta_ticks=$((ticks2 - ticks1))
elapsed=$(echo "$time2 - $time1" | bc)

# Percentage of total CPU capacity (all cores)
pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK * $NUM_CPUS)) * 100" | bc)
echo "CPU usage: ${pct}% of ${NUM_CPUS} CPUs"

# Percentage as "CPU cores used" (like top's 200% for 2 full cores)
cores_pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK)) * 100" | bc)
echo "CPU usage: ${cores_pct}% (cores-style)"
```

#### Continuous Monitoring

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)
INTERVAL=1

get_total_ticks() {
    local total=0
    for stat in /proc/[0-9]*/stat; do
        read -r line < "$stat" 2>/dev/null || continue
        rest="${line##*) }"
        read -ra f <<< "$rest"
        ((total += f[11] + f[12]))
    done
    echo "$total"
}

prev_ticks=$(get_total_ticks)
prev_time=$(date +%s.%N)

while true; do
    sleep "$INTERVAL"

    curr_ticks=$(get_total_ticks)
    curr_time=$(date +%s.%N)

    delta=$((curr_ticks - prev_ticks))
    elapsed=$(echo "$curr_time - $prev_time" | bc)

    pct=$(echo "scale=1; $delta / ($elapsed * $CLK_TCK * $NUM_CPUS) * 100" | bc)
    printf "\rCPU: %5.1f%%" "$pct"

    prev_ticks=$curr_ticks
    prev_time=$curr_time
done
```

---

### Does this calculation respect cgroup limits?

No, it doesn't. The calculation uses `nproc`, which honors cpuset/affinity restrictions but ignores CFS quota, so in a quota-limited container it typically returns the **host CPU count**, not your cgroup limit.

#### The Problem

If your container is limited to 2 CPUs on an 8-CPU host:

- `nproc` returns 8
- Your calculation shows 25% when you're actually at 100% of your limit

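The arithmetic, for a fully used 2-CPU quota sampled over 1 s at `CLK_TCK=100` (so `delta_ticks = 200`):

```bash
echo "scale=4; 200 / (1 * 100 * 8) * 100" | bc   # 25.0000  - % of host capacity
echo "scale=4; 200 / (1 * 100 * 2) * 100" | bc   # 100.0000 - % of the cgroup limit
```
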
#### Getting Effective CPU Limit

**cgroups v2:**

```bash
# cpu.max contains: $quota $period (in microseconds)
# "max 100000" means unlimited
read quota period < /sys/fs/cgroup/cpu.max
if [[ "$quota" == "max" ]]; then
    effective_cpus=$(nproc)
else
    effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
echo "Effective CPUs: $effective_cpus"
```

**cgroups v1:**

```bash
quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)

if [[ "$quota" == "-1" ]]; then
    effective_cpus=$(nproc)   # -1 means no quota set
else
    effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
```

**Also Check cpuset Limits:**

```bash
# cgroups v2
cpuset=$(cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null)

# cgroups v1
cpuset=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null)

# Parse "0-3,5,7" format to count CPUs
count_cpus() {
    local count=0
    IFS=',' read -ra ranges <<< "$1"
    for range in "${ranges[@]}"; do
        if [[ "$range" == *-* ]]; then
            start="${range%-*}"
            end="${range#*-}"
            ((count += end - start + 1))
        else
            ((count++))
        fi
    done
    echo "$count"
}
```
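
For example, `count_cpus "0-3,5,7"` prints `6` (four CPUs from the range plus two singles).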

#### Updated Script Respecting Limits

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)

get_effective_cpus() {
    # Try cgroups v2 first
    if [[ -f /sys/fs/cgroup/cpu.max ]]; then
        read quota period < /sys/fs/cgroup/cpu.max
        if [[ "$quota" != "max" ]]; then
            echo "scale=2; $quota / $period" | bc
            return
        fi
    fi

    # Try cgroups v1
    if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then
        quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
        period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
        if [[ "$quota" != "-1" ]]; then
            echo "scale=2; $quota / $period" | bc
            return
        fi
    fi

    # Fallback to nproc
    nproc
}

EFFECTIVE_CPUS=$(get_effective_cpus)
echo "Effective CPU limit: $EFFECTIVE_CPUS"

# ... rest of your sampling logic using $EFFECTIVE_CPUS
```

#### Shared PID Namespace Consideration

When sidecars share PID namespace but have different cgroups:

- Each container may have different CPU limits
- You're aggregating processes across those limits
- There's no single "correct" limit to use

**Options:**

1. **Use host CPU count** — percentage of total host capacity
2. **Sum the limits** — if you know each sidecar's cgroup, sum their quotas (see the sketch below)
3. **Report in cores** — skip normalization, just show `1.5 cores used` instead of percentage

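A minimal sketch of option 2, assuming cgroups v2 and that you already know the sidecars' cgroup paths (the paths below are placeholders):

```bash
total=0
for cg in /sys/fs/cgroup/pod/sidecar-a /sys/fs/cgroup/pod/sidecar-b; do  # hypothetical paths
    read quota period < "$cg/cpu.max"
    [[ "$quota" == "max" ]] && continue  # unlimited: decide separately how to count it
    total=$(echo "scale=2; $total + $quota / $period" | bc)
done
echo "Combined CPU limit: $total cores"
```
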
---

### Can I get the cgroup limit for another cgroup?

Yes, if you have visibility into the cgroup filesystem.

#### 1. Find a Process's Cgroup

Every process exposes its cgroup membership:

```bash
# Get cgroup for any PID you can see
cat /proc/1234/cgroup

# cgroups v2 output:
# 0::/kubepods/pod123/container456

# cgroups v1 output:
# 12:cpu,cpuacct:/docker/abc123
# 11:memory:/docker/abc123
# ...
```

#### 2. Read That Cgroup's Limits

If the cgroup filesystem is mounted and accessible:

```bash
#!/bin/bash

get_cgroup_cpu_limit() {
    local pid=$1

    # Get cgroup path for this PID (cgroups v2: the "0::" line)
    cgroup_path=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)

    if [[ -n "$cgroup_path" ]]; then
        # cgroups v2
        limit_file="/sys/fs/cgroup${cgroup_path}/cpu.max"
        if [[ -r "$limit_file" ]]; then
            read quota period < "$limit_file"
            if [[ "$quota" == "max" ]]; then
                echo "unlimited"
            else
                echo "scale=2; $quota / $period" | bc
            fi
            return
        fi
    fi

    # Try cgroups v1: match the cpu controller line exactly so that
    # e.g. the cpuset line can't be picked up by mistake
    cgroup_path=$(grep -m1 -oP '^\d+:cpu(,cpuacct)?:\K.*' /proc/$pid/cgroup 2>/dev/null)
    if [[ -n "$cgroup_path" ]]; then
        quota_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_quota_us"
        period_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_period_us"
        if [[ -r "$quota_file" ]]; then
            quota=$(cat "$quota_file")
            period=$(cat "$period_file")
            if [[ "$quota" == "-1" ]]; then
                echo "unlimited"
            else
                echo "scale=2; $quota / $period" | bc
            fi
            return
        fi
    fi

    echo "unknown"
}

# Example: get limit for PID 1234
get_cgroup_cpu_limit 1234
```

#### 3. Mount Visibility Requirements

| Scenario | Can Read Other Cgroups? |
|----------|------------------------|
| Host system | Yes |
| Privileged container | Yes |
| `/sys/fs/cgroup` mounted read-only from host | Yes (common in Kubernetes) |
| Only own cgroup subtree mounted | No |

Check what's visible:

```bash
mount | grep cgroup
ls /sys/fs/cgroup/
```

#### 4. Full Solution: Aggregate by Cgroup

```bash
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)

declare -A cgroup_ticks
declare -A cgroup_limit

for stat in /proc/[0-9]*/stat; do
    pid="${stat#/proc/}"
    pid="${pid%/stat}"

    # Get the cgroup for this process (cgroups v2 path)
    cg=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)
    [[ -z "$cg" ]] && continue

    # Get CPU ticks
    if read -r line < "$stat" 2>/dev/null; then
        rest="${line##*) }"
        read -ra f <<< "$rest"
        ticks=$((f[11] + f[12]))

        ((cgroup_ticks[$cg] += ticks))

        # Cache the limit (only look it up once per cgroup)
        if [[ -z "${cgroup_limit[$cg]}" ]]; then
            limit_file="/sys/fs/cgroup${cg}/cpu.max"
            if [[ -r "$limit_file" ]]; then
                read quota period < "$limit_file"
                if [[ "$quota" == "max" ]]; then
                    cgroup_limit[$cg]="unlimited"
                else
                    cgroup_limit[$cg]=$(echo "scale=2; $quota / $period" | bc)
                fi
            else
                cgroup_limit[$cg]="unknown"
            fi
        fi
    fi
done

echo "Ticks by cgroup:"
for cg in "${!cgroup_ticks[@]}"; do
    echo "  $cg: ${cgroup_ticks[$cg]} ticks (limit: ${cgroup_limit[$cg]} CPUs)"
done
```

#### If You Can't Access Other Cgroups

Fallback options:

1. **Mount the cgroup fs** — add a read-only volume mount for `/sys/fs/cgroup` (example below)
2. **Use a sidecar with access** — one privileged container does the monitoring
3. **Accept "unknown" limits** — report raw ticks/cores instead of percentages
4. **Kubernetes Downward API** — inject limits as env vars (only for your own container, though)
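
A Docker-flavored sketch of option 1 (the image name is a placeholder):

```bash
docker run -v /sys/fs/cgroup:/sys/fs/cgroup:ro my-monitor-image
```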