/proc/stat Behavior in Containerized Environments
TL;DR
/proc/stat in containers shows host-level statistics, not container-specific data. To get accurate container CPU metrics when cgroups aren't an option (e.g., sidecars sharing PID namespace with different cgroups), aggregate /proc/[pid]/stat for all visible processes and calculate percentages using cgroup limits from /sys/fs/cgroup.
Outcome
- Explained why /proc/stat doesn't provide container-aware metrics
- Provided methods to aggregate process CPU usage from /proc/[pid]/stat
- Showed how to determine parent/child process relationships
- Demonstrated converting raw ticks to CPU percentages
- Covered how to read cgroup CPU limits for other containers
Conversation
How does /proc/stat behave in containerized environments?
By default, /proc/stat inside a container shows host system statistics, not container-specific data. This includes:
- CPU times (user, system, idle, etc.) - reflects all CPUs on the host
- Context switches - host-wide count
- Boot time - host's boot time
- Processes created - host-wide total
Why This Happens
The /proc filesystem is mounted from the host kernel. Containers share the host kernel, and /proc/stat is a kernel-level interface that doesn't have namespace awareness for most of its metrics.
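A quick way to see the mismatch from inside a CPU-limited container (a minimal sketch, assuming cgroups v2 is mounted at /sys/fs/cgroup):
# Per-CPU lines in /proc/stat reflect every host CPU, regardless of limits
grep -c '^cpu[0-9]' /proc/stat
# The container's own quota tells a different story (cgroups v2):
# e.g. "200000 100000" means a 2-CPU limit even though /proc/stat lists them all
cat /sys/fs/cgroup/cpu.max 2>/dev/null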
Practical Implications
- CPU usage calculations are wrong - Tools like top, htop, or custom monitoring that read /proc/stat will show host CPU usage, not container usage
- Resource limits aren't reflected - Even if you limit a container to 2 CPUs via cgroups, /proc/stat shows all host CPUs
- Multi-tenant confusion - In shared environments, containers see each other's aggregate impact
Solutions
| Approach | Description |
|---|---|
| cgroups | Read from /sys/fs/cgroup/cpu/ for container-specific CPU accounting |
| LXCFS | A FUSE filesystem that provides container-aware /proc files |
| Container runtimes | Some runtimes (like Kata) use VMs with isolated kernels |
| Metrics APIs | Use Docker/Kubernetes APIs instead of /proc/stat |
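For example, the metrics-API and LXCFS approaches in practice (a sketch, assuming Docker is available and an LXCFS daemon is running with its conventional /var/lib/lxcfs mount point):
# Metrics API: ask the runtime instead of reading /proc/stat
docker stats --no-stream
# LXCFS: bind container-aware proc files over the container's own
docker run -it \
  -v /var/lib/lxcfs/proc/stat:/proc/stat \
  -v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo \
  ubuntu bash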
Example: Getting Real Container CPU
# Instead of /proc/stat, use cgroups v2:
cat /sys/fs/cgroup/cpu.stat
# Or cgroups v1:
cat /sys/fs/cgroup/cpu/cpuacct.usage
Can I aggregate process usage to get the usage of the container?
Yes, you can aggregate process usage, but there are caveats.
Aggregating from /proc/[pid]/stat
# Fields 14 (utime) and 15 (stime) in /proc/[pid]/stat
# Note: plain awk field numbers assume the comm field contains no spaces;
# the later scripts parse past the ") " separator to handle that case.
for pid in /proc/[0-9]*; do
awk '{print $14 + $15}' "$pid/stat" 2>/dev/null
done | awk '{sum += $1} END {print sum}'
Problems with This Approach
- Race conditions - Processes can spawn/die between reads, causing inconsistent totals
- Short-lived processes - Processes that start and exit between samples are missed entirely
- Zombie/exited processes - Their CPU time may not be captured
- Overhead - Scanning all PIDs repeatedly is expensive
- Namespace visibility - You only see processes in your PID namespace, but that's actually what you want in a container
- Children accounting - When a process exits, its CPU time gets added to the parent's cutime/cstime fields, which can cause double-counting if not handled carefully
Better Alternative: cgroups
Cgroups track CPU usage at the group level and handle all these edge cases:
# cgroups v2 (modern)
cat /sys/fs/cgroup/cpu.stat
# Shows: usage_usec, user_usec, system_usec
# cgroups v1
cat /sys/fs/cgroup/cpu/cpuacct.usage # total nanoseconds
cat /sys/fs/cgroup/cpu/cpuacct.stat # user/system jiffies
However, a single cgroup cannot account for everything you see when sidecars share a PID namespace but live in different cgroups. In that case, aggregating /proc/[pid]/stat across the visible processes is the best option.
How do I determine child/parent processes?
From /proc/[pid]/stat
Field 4 is the PPID (parent process ID):
# Get PPID for a specific process
awk '{print $4}' /proc/1234/stat
# List all processes with their parents
for stat in /proc/[0-9]*/stat; do
pid="${stat#/proc/}"
pid="${pid%/stat}"
ppid=$(awk '{print $4}' "$stat" 2>/dev/null) && \
echo "$pid -> parent: $ppid"
done
From /proc/[pid]/status (more readable)
grep PPid /proc/1234/status
# PPid: 1
Building a Process Tree
#!/bin/bash
declare -A parent_of
declare -A children_of
for stat in /proc/[0-9]*/stat; do
if read -r line < "$stat" 2>/dev/null; then
pid="${stat#/proc/}"
pid="${pid%/stat}"
# Extract PPID (field 4, but handle comm with spaces)
rest="${line##*) }"
read -ra fields <<< "$rest"
ppid="${fields[1]}" # 4th field overall = index 1 after state
parent_of[$pid]=$ppid
children_of[$ppid]+="$pid "
fi
done
# Print tree from PID 1
print_tree() {
local pid=$1
local indent=$2
echo "${indent}${pid}"
for child in ${children_of[$pid]}; do
print_tree "$child" " $indent"
done
}
print_tree 1 ""
For CPU Aggregation: Handling cutime/cstime
To properly handle cutime/cstime without double-counting:
#!/bin/bash
declare -A parent_of
declare -A utime stime
# First pass: collect all data
for stat in /proc/[0-9]*/stat; do
if read -r line < "$stat" 2>/dev/null; then
pid="${stat#/proc/}"
pid="${pid%/stat}"
rest="${line##*) }"
read -ra f <<< "$rest"
parent_of[$pid]="${f[1]}"
utime[$pid]="${f[11]}"
stime[$pid]="${f[12]}"
# cutime=${f[13]} cstime=${f[14]} - don't sum these
fi
done
# Sum only utime/stime (not cutime/cstime)
total=0
for pid in "${!utime[@]}"; do
((total += utime[$pid] + stime[$pid]))
done
echo "Total CPU ticks: $total"
echo "Seconds: $(echo "scale=2; $total / $(getconf CLK_TCK)" | bc)"
Key insight: Only sum utime + stime for each process. The cutime/cstime fields are cumulative from children that have already exited and been wait()ed on—those children no longer exist in /proc, so their time is only accessible via the parent's cutime/cstime.
How do I convert utime/stime to percentages?
You need two samples over a time interval. CPU percentage is a rate, not an absolute value.
The Formula
CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100
- delta_ticks = difference in (utime + stime) between samples
- CLK_TCK = ticks per second (usually 100, get via getconf CLK_TCK)
- num_cpus = number of CPUs (omit for single-CPU percentage)
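As a quick worked example: 150 ticks accumulated over a 1-second interval with CLK_TCK=100 on a 4-CPU host gives 150 / (1 * 100 * 4) * 100 = 37.5% of total capacity (or 150% cores-style).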
Two Common Percentage Styles
| Style | Formula | Example |
|---|---|---|
| Normalized (0-100%) | delta / (elapsed * CLK_TCK * num_cpus) * 100 | 50% = half of total capacity |
| Cores-style (0-N*100%) | delta / (elapsed * CLK_TCK) * 100 | 200% = 2 full cores busy |
Practical Script
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)
get_total_ticks() {
local total=0
for stat in /proc/[0-9]*/stat; do
if read -r line < "$stat" 2>/dev/null; then
rest="${line##*) }"
read -ra f <<< "$rest"
((total += f[11] + f[12])) # utime + stime
fi
done
echo "$total"
}
# First sample
ticks1=$(get_total_ticks)
time1=$(date +%s.%N)
# Wait
sleep 1
# Second sample
ticks2=$(get_total_ticks)
time2=$(date +%s.%N)
# Calculate
delta_ticks=$((ticks2 - ticks1))
elapsed=$(echo "$time2 - $time1" | bc)
# Percentage of total CPU capacity (all cores)
pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK * $NUM_CPUS)) * 100" | bc)
echo "CPU usage: ${pct}% of ${NUM_CPUS} CPUs"
# Percentage as "CPU cores used" (like top's 200% for 2 full cores)
cores_pct=$(echo "scale=2; ($delta_ticks / ($elapsed * $CLK_TCK)) * 100" | bc)
echo "CPU usage: ${cores_pct}% (cores-style)"
Continuous Monitoring
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)
INTERVAL=1
get_total_ticks() {
local total=0
for stat in /proc/[0-9]*/stat; do
read -r line < "$stat" 2>/dev/null || continue
rest="${line##*) }"
read -ra f <<< "$rest"
((total += f[11] + f[12]))
done
echo "$total"
}
prev_ticks=$(get_total_ticks)
prev_time=$(date +%s.%N)
while true; do
sleep "$INTERVAL"
curr_ticks=$(get_total_ticks)
curr_time=$(date +%s.%N)
delta=$((curr_ticks - prev_ticks))
elapsed=$(echo "$curr_time - $prev_time" | bc)
pct=$(echo "scale=1; $delta / ($elapsed * $CLK_TCK * $NUM_CPUS) * 100" | bc)
printf "\rCPU: %5.1f%%" "$pct"
prev_ticks=$curr_ticks
prev_time=$curr_time
done
Does this calculation respect cgroup limits?
No, it doesn't. The calculation uses nproc, which reports the CPUs the process is allowed to run on (typically the full host count, unless a cpuset restricts it), not the CFS quota that actually caps the container.
The Problem
If your container is limited to 2 CPUs on an 8-CPU host:
- nproc returns 8
- Your calculation shows 25% when you're actually at 100% of your limit
Getting Effective CPU Limit
cgroups v2:
# cpu.max contains: $quota $period (in microseconds)
# "max 100000" means unlimited
read quota period < /sys/fs/cgroup/cpu.max
if [[ "$quota" == "max" ]]; then
effective_cpus=$(nproc)
else
effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
echo "Effective CPUs: $effective_cpus"
cgroups v1:
quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
if [[ "$quota" == "-1" ]]; then
effective_cpus=$(nproc)
else
effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
Also Check cpuset Limits:
# cgroups v2
cpuset=$(cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null)
# cgroups v1
cpuset=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null)
# Parse "0-3,5,7" format to count CPUs
count_cpus() {
local count=0
IFS=',' read -ra ranges <<< "$1"
for range in "${ranges[@]}"; do
if [[ "$range" == *-* ]]; then
start="${range%-*}"
end="${range#*-}"
((count += end - start + 1))
else
((count++))
fi
done
echo "$count"
}
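A brief usage sketch tying this to the quota logic above (assuming the cpuset string read above is non-empty): the effective limit is the smaller of the quota-derived value and the cpuset size.
# Count CPUs allowed by the cpuset; compare against the quota-derived limit
if [[ -n "$cpuset" ]]; then
  cpuset_cpus=$(count_cpus "$cpuset")
  echo "cpuset allows: $cpuset_cpus CPUs"
fi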
Updated Script Respecting Limits
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
get_effective_cpus() {
# Try cgroups v2 first
if [[ -f /sys/fs/cgroup/cpu.max ]]; then
read quota period < /sys/fs/cgroup/cpu.max
if [[ "$quota" != "max" ]]; then
echo "scale=2; $quota / $period" | bc
return
fi
fi
# Try cgroups v1
if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then
quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
if [[ "$quota" != "-1" ]]; then
echo "scale=2; $quota / $period" | bc
return
fi
fi
# Fallback to nproc
nproc
}
EFFECTIVE_CPUS=$(get_effective_cpus)
echo "Effective CPU limit: $EFFECTIVE_CPUS"
# ... rest of your sampling logic using $EFFECTIVE_CPUS
Shared PID Namespace Consideration
When sidecars share PID namespace but have different cgroups:
- Each container may have different CPU limits
- You're aggregating processes across those limits
- There's no single "correct" limit to use
Options:
- Use host CPU count — percentage of total host capacity
- Sum the limits — if you know each sidecar's cgroup, sum their quotas (see the sketch after this list)
- Report in cores — skip normalization, just show 1.5 cores used instead of a percentage
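A minimal sketch of the second option, assuming cgroups v2 and that the sidecar cgroup paths (the names below are hypothetical) are known and readable:
#!/bin/bash
# Hypothetical cgroup paths for the containers sharing the PID namespace
sidecar_cgroups=("/kubepods/podX/app" "/kubepods/podX/sidecar")
total=0
for cg in "${sidecar_cgroups[@]}"; do
  read -r quota period < "/sys/fs/cgroup${cg}/cpu.max" || continue
  # Any unlimited member makes the combined limit effectively the host CPU count
  if [[ "$quota" == "max" ]]; then total=$(nproc); break; fi
  total=$(echo "scale=2; $total + $quota / $period" | bc)
done
echo "Combined CPU limit: $total"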
Can I get the cgroup limit for another cgroup?
Yes, if you have visibility into the cgroup filesystem.
1. Find a Process's Cgroup
Every process exposes its cgroup membership:
# Get cgroup for any PID you can see
cat /proc/1234/cgroup
# cgroups v2 output:
# 0::/kubepods/pod123/container456
# cgroups v1 output:
# 12:cpu,cpuacct:/docker/abc123
# 11:memory:/docker/abc123
# ...
2. Read That Cgroup's Limits
If the cgroup filesystem is mounted and accessible:
#!/bin/bash
get_cgroup_cpu_limit() {
local pid=$1
# Get cgroup path for this PID
cgroup_path=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null) # v2
if [[ -n "$cgroup_path" ]]; then
# cgroups v2
limit_file="/sys/fs/cgroup${cgroup_path}/cpu.max"
if [[ -r "$limit_file" ]]; then
read quota period < "$limit_file"
if [[ "$quota" == "max" ]]; then
echo "unlimited"
else
echo "scale=2; $quota / $period" | bc
fi
return
fi
fi
# Try cgroups v1: match only the cpu controller line (not cpuset)
cgroup_path=$(grep -oP '^[0-9]+:[^:]*\bcpu\b[^:]*:\K.*' /proc/$pid/cgroup 2>/dev/null | head -n1)
if [[ -n "$cgroup_path" ]]; then
quota_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_quota_us"
period_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_period_us"
if [[ -r "$quota_file" ]]; then
quota=$(cat "$quota_file")
period=$(cat "$period_file")
if [[ "$quota" == "-1" ]]; then
echo "unlimited"
else
echo "scale=2; $quota / $period" | bc
fi
return
fi
fi
echo "unknown"
}
# Example: get limit for PID 1234
get_cgroup_cpu_limit 1234
3. Mount Visibility Requirements
| Scenario | Can Read Other Cgroups? |
|---|---|
| Host system | Yes |
| Privileged container | Yes |
| /sys/fs/cgroup mounted read-only from host | Yes (common in Kubernetes) |
| Only own cgroup subtree mounted | No |
Check what's visible:
mount | grep cgroup
ls /sys/fs/cgroup/
4. Full Solution: Aggregate by Cgroup
#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
declare -A cgroup_ticks
declare -A cgroup_limit
for stat in /proc/[0-9]*/stat; do
pid="${stat#/proc/}"
pid="${pid%/stat}"
# Get cgroup for this process
cg=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)
[[ -z "$cg" ]] && continue
# Get CPU ticks
if read -r line < "$stat" 2>/dev/null; then
rest="${line##*) }"
read -ra f <<< "$rest"
ticks=$((f[11] + f[12]))
((cgroup_ticks[$cg] += ticks))
# Cache the limit (only look up once per cgroup)
if [[ -z "${cgroup_limit[$cg]}" ]]; then
limit_file="/sys/fs/cgroup${cg}/cpu.max"
if [[ -r "$limit_file" ]]; then
read quota period < "$limit_file"
if [[ "$quota" == "max" ]]; then
cgroup_limit[$cg]="unlimited"
else
cgroup_limit[$cg]=$(echo "scale=2; $quota / $period" | bc)
fi
else
cgroup_limit[$cg]="unknown"
fi
fi
fi
done
echo "Ticks by cgroup:"
for cg in "${!cgroup_ticks[@]}"; do
echo " $cg: ${cgroup_ticks[$cg]} ticks (limit: ${cgroup_limit[$cg]} CPUs)"
done
If You Can't Access Other Cgroups
Fallback options:
- Mount the cgroup fs — add a volume mount for /sys/fs/cgroup:ro (see the sketch after this list)
- Use a sidecar with access — one privileged container does the monitoring
- Accept "unknown" limits — report raw ticks/cores instead of percentages
- Kubernetes Downward API — inject limits as env vars (only for your own container though)
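A minimal sketch of the first option with Docker (the image name monitor is a placeholder):
# Expose the host's cgroup tree read-only inside the monitoring container
docker run --rm -v /sys/fs/cgroup:/sys/fs/cgroup:ro monitor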