/proc/stat Behavior in Containerized Environments

TL;DR

/proc/stat in containers shows host-level statistics, not container-specific data. To get accurate container CPU metrics when cgroups aren't an option (e.g., sidecars that share a PID namespace but live in different cgroups), aggregate /proc/[pid]/stat for all visible processes and calculate percentages against cgroup limits read from /sys/fs/cgroup.

Outcome

  • Explained why /proc/stat doesn't provide container-aware metrics
  • Provided methods to aggregate process CPU usage from /proc/[pid]/stat
  • Showed how to determine parent/child process relationships
  • Demonstrated converting raw ticks to CPU percentages
  • Covered how to read cgroup CPU limits for other containers

Conversation

How does /proc/stat behave in containerized environments?

By default, /proc/stat inside a container shows host system statistics, not container-specific data. This includes:

  • CPU times (user, system, idle, etc.) - reflects all CPUs on the host
  • Context switches - host-wide count
  • Boot time - host's boot time
  • Processes created - host-wide total

Why This Happens

The /proc filesystem is mounted from the host kernel. Containers share the host kernel, and /proc/stat is a kernel-level interface that doesn't have namespace awareness for most of its metrics.
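
You can check this directly; a minimal demonstration, assuming a container whose CPU quota is smaller than the host's core count:

# Counts per-CPU lines in /proc/stat - inside a container this still
# reflects every CPU on the host
grep -c '^cpu[0-9]' /proc/stat

# Compare with the container's cgroups v2 quota ("max 100000" = unlimited)
cat /sys/fs/cgroup/cpu.max 2>/dev/null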

Practical Implications

  1. CPU usage calculations are wrong - Tools like top, htop, or custom monitoring that read /proc/stat will show host CPU usage, not container usage
  2. Resource limits aren't reflected - Even if you limit a container to 2 CPUs via cgroups, /proc/stat shows all host CPUs
  3. Multi-tenant confusion - In shared environments, containers see each other's aggregate impact

Solutions

  • cgroups - Read from /sys/fs/cgroup/cpu/ for container-specific CPU accounting
  • LXCFS - A FUSE filesystem that provides container-aware /proc files
  • Container runtimes - Some runtimes (like Kata) use VMs with isolated kernels
  • Metrics APIs - Use Docker/Kubernetes APIs instead of /proc/stat

Example: Getting Real Container CPU

# Instead of /proc/stat, use cgroups v2:
cat /sys/fs/cgroup/cpu.stat

# Or cgroups v1:
cat /sys/fs/cgroup/cpu/cpuacct.usage

Can I aggregate process usage to get the usage of the container?

Yes, you can aggregate process usage, but there are caveats.

Aggregating from /proc/[pid]/stat

# Fields 14 (utime) and 15 (stime) in /proc/[pid]/stat
# Caveat: plain field counting breaks if the comm name contains spaces;
# the scripts further down strip everything up to ") " to avoid this
for pid in /proc/[0-9]*; do
  awk '{print $14 + $15}' "$pid/stat" 2>/dev/null
done | awk '{sum += $1} END {print sum}'

Problems with This Approach

  1. Race conditions - Processes can spawn/die between reads, causing inconsistent totals
  2. Short-lived processes - Processes that start and exit between samples are missed entirely
  3. Zombie/exited processes - Their CPU time may not be captured
  4. Overhead - Scanning all PIDs repeatedly is expensive
  5. Namespace visibility - You only see processes in your PID namespace, but that's actually what you want in a container
  6. Children accounting - When a process exits, its CPU time gets added to the parent's cutime/cstime fields, which can cause double-counting if not handled carefully

Better Alternative: cgroups

Cgroups track CPU usage at the group level and handle all these edge cases:

# cgroups v2 (modern)
cat /sys/fs/cgroup/cpu.stat
# Shows: usage_usec, user_usec, system_usec

# cgroups v1
cat /sys/fs/cgroup/cpu/cpuacct.usage  # total nanoseconds
cat /sys/fs/cgroup/cpu/cpuacct.stat   # user/system jiffies

However, your own cgroup's counters don't help when sidecars share the PID namespace but live in different cgroups: the processes visible in /proc then span several cgroups, and /sys/fs/cgroup/cpu.stat only accounts for your own. In that case, aggregating /proc/[pid]/stat is the best option.


How do I determine child/parent processes?

From /proc/[pid]/stat

Field 4 is the PPID (parent process ID):

# Get PPID for a specific process
awk '{print $4}' /proc/1234/stat

# List all processes with their parents
for stat in /proc/[0-9]*/stat; do
    pid="${stat#/proc/}"
    pid="${pid%/stat}"
    ppid=$(awk '{print $4}' "$stat" 2>/dev/null) && \
    echo "$pid -> parent: $ppid"
done

From /proc/[pid]/status (more readable)

grep PPid /proc/1234/status
# PPid: 1

Building a Process Tree

#!/bin/bash
declare -A parent_of
declare -A children_of

for stat in /proc/[0-9]*/stat; do
    if read -r line < "$stat" 2>/dev/null; then
        pid="${stat#/proc/}"
        pid="${pid%/stat}"

        # Extract PPID (field 4, but handle comm with spaces)
        rest="${line##*) }"
        read -ra fields <<< "$rest"
        ppid="${fields[1]}"  # 4th field overall = index 1 after state

        parent_of[$pid]=$ppid
        children_of[$ppid]+="$pid "
    fi
done

# Print tree from PID 1
print_tree() {
    local pid=$1
    local indent=$2
    echo "${indent}${pid}"
    for child in ${children_of[$pid]}; do
        print_tree "$child" "  $indent"
    done
}

print_tree 1 ""

For CPU Aggregation: Handling cutime/cstime

To properly handle cutime/cstime without double-counting:

#!/bin/bash
declare -A parent_of
declare -A utime stime

# First pass: collect all data
for stat in /proc/[0-9]*/stat; do
    if read -r line < "$stat" 2>/dev/null; then
        pid="${stat#/proc/}"
        pid="${pid%/stat}"
        rest="${line##*) }"
        read -ra f <<< "$rest"

        parent_of[$pid]="${f[1]}"
        utime[$pid]="${f[11]}"
        stime[$pid]="${f[12]}"
        # cutime=${f[13]}  cstime=${f[14]} - don't sum these
    fi
done

# Sum only utime/stime (not cutime/cstime)
total=0
for pid in "${!utime[@]}"; do
    ((total += utime[$pid] + stime[$pid]))
done

echo "Total CPU ticks: $total"
echo "Seconds: $(echo "scale=2; $total / $(getconf CLK_TCK)" | bc)"

Key insight: Only sum utime + stime for each process. The cutime/cstime fields are cumulative from children that have already exited and been wait()ed on—those children no longer exist in /proc, so their time is only accessible via the parent's cutime/cstime.


How do I convert utime/stime to percentages?

You need two samples over a time interval. CPU percentage is a rate, not an absolute value.

The Formula

CPU % = (delta_ticks / (elapsed_seconds * CLK_TCK * num_cpus)) * 100
  • delta_ticks = difference in (utime + stime) between samples
  • CLK_TCK = ticks per second (usually 100, get via getconf CLK_TCK)
  • num_cpus = number of CPUs (omit for single-CPU percentage)
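
A worked example with assumed numbers (CLK_TCK=100, 4 CPUs, 150 ticks of delta over a 1-second interval):

# 150 / (1 * 100 * 4) * 100 = 37.5% of total capacity
# (multiply before dividing so bc's scale doesn't truncate the result)
echo "scale=2; 150 * 100 / (1 * 100 * 4)" | bc   # prints 37.50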

Two Common Percentage Styles

  • Normalized (0-100%): delta / (elapsed * CLK_TCK * num_cpus) * 100 - e.g., 50% = half of total capacity
  • Cores-style (0-N*100%): delta / (elapsed * CLK_TCK) * 100 - e.g., 200% = 2 full cores busy

Practical Script

#!/bin/bash

CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)

get_total_ticks() {
    local total=0
    for stat in /proc/[0-9]*/stat; do
        if read -r line < "$stat" 2>/dev/null; then
            rest="${line##*) }"
            read -ra f <<< "$rest"
            ((total += f[11] + f[12]))  # utime + stime
        fi
    done
    echo "$total"
}

# First sample
ticks1=$(get_total_ticks)
time1=$(date +%s.%N)

# Wait
sleep 1

# Second sample
ticks2=$(get_total_ticks)
time2=$(date +%s.%N)

# Calculate
delta_ticks=$((ticks2 - ticks1))
elapsed=$(echo "$time2 - $time1" | bc)

# Percentage of total CPU capacity (all cores); multiply before dividing
# so bc's scale doesn't truncate the result
pct=$(echo "scale=2; $delta_ticks * 100 / ($elapsed * $CLK_TCK * $NUM_CPUS)" | bc)
echo "CPU usage: ${pct}% of ${NUM_CPUS} CPUs"

# Percentage as "CPU cores used" (like top's 200% for 2 full cores)
cores_pct=$(echo "scale=2; $delta_ticks * 100 / ($elapsed * $CLK_TCK)" | bc)
echo "CPU usage: ${cores_pct}% (cores-style)"

Continuous Monitoring

#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)
NUM_CPUS=$(nproc)
INTERVAL=1

get_total_ticks() {
    local total=0
    for stat in /proc/[0-9]*/stat; do
        read -r line < "$stat" 2>/dev/null || continue
        rest="${line##*) }"
        read -ra f <<< "$rest"
        ((total += f[11] + f[12]))
    done
    echo "$total"
}

prev_ticks=$(get_total_ticks)
prev_time=$(date +%s.%N)

while true; do
    sleep "$INTERVAL"

    curr_ticks=$(get_total_ticks)
    curr_time=$(date +%s.%N)

    delta=$((curr_ticks - prev_ticks))
    elapsed=$(echo "$curr_time - $prev_time" | bc)

    pct=$(echo "scale=1; $delta * 100 / ($elapsed * $CLK_TCK * $NUM_CPUS)" | bc)
    printf "\rCPU: %5.1f%%" "$pct"

    prev_ticks=$curr_ticks
    prev_time=$curr_time
done

Does this calculation respect cgroup limits?

No, it doesn't. The calculation uses nproc, which honors cpuset/affinity restrictions but not CFS quota, so in a typical quota-limited container it still returns the host CPU count rather than your limit.

The Problem

If your container is limited to 2 CPUs on an 8-CPU host:

  • nproc returns 8
  • Your calculation shows 25% when you're actually at 100% of your limit

Getting Effective CPU Limit

cgroups v2:

# cpu.max contains: $quota $period (in microseconds)
# "max 100000" means unlimited
read quota period < /sys/fs/cgroup/cpu.max
if [[ "$quota" == "max" ]]; then
    effective_cpus=$(nproc)
else
    effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi
echo "Effective CPUs: $effective_cpus"

cgroups v1:

quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)

if [[ "$quota" == "-1" ]]; then
    effective_cpus=$(nproc)
else
    effective_cpus=$(echo "scale=2; $quota / $period" | bc)
fi

Also Check cpuset Limits:

# cgroups v2
cpuset=$(cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null)

# cgroups v1
cpuset=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null)

# Parse "0-3,5,7" format to count CPUs
count_cpus() {
    local count=0
    IFS=',' read -ra ranges <<< "$1"
    for range in "${ranges[@]}"; do
        if [[ "$range" == *-* ]]; then
            start="${range%-*}"
            end="${range#*-}"
            ((count += end - start + 1))
        else
            ((count++))
        fi
    done
    echo "$count"
}
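
A short usage sketch for count_cpus (assuming the cpuset variable read above is non-empty); the effective CPU count is then the smaller of this value and the quota-derived limit:

cpuset_cpus=$(count_cpus "$cpuset")   # e.g. "0-3,5" -> 5
echo "CPUs in cpuset: $cpuset_cpus"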

Updated Script Respecting Limits

#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)

get_effective_cpus() {
    # Try cgroups v2 first
    if [[ -f /sys/fs/cgroup/cpu.max ]]; then
        read quota period < /sys/fs/cgroup/cpu.max
        if [[ "$quota" != "max" ]]; then
            echo "scale=2; $quota / $period" | bc
            return
        fi
    fi

    # Try cgroups v1
    if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then
        quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
        period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
        if [[ "$quota" != "-1" ]]; then
            echo "scale=2; $quota / $period" | bc
            return
        fi
    fi

    # Fallback to nproc
    nproc
}

EFFECTIVE_CPUS=$(get_effective_cpus)
echo "Effective CPU limit: $EFFECTIVE_CPUS"

# ... rest of your sampling logic using $EFFECTIVE_CPUS

Shared PID Namespace Consideration

When sidecars share PID namespace but have different cgroups:

  • Each container may have different CPU limits
  • You're aggregating processes across those limits
  • There's no single "correct" limit to use

Options:

  1. Use host CPU count — percentage of total host capacity
  2. Sum the limits — if you know each sidecar's cgroup, sum their quotas (a sketch follows this list)
  3. Report in cores — skip normalization, just show 1.5 cores used instead of percentage
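
A minimal sketch of option 2, assuming cgroups v2 and that the other containers' subtrees are visible under /sys/fs/cgroup; the cgroup paths below are hypothetical:

#!/bin/bash
# Sum the cgroups v2 CPU quotas of several containers (replace the example
# paths with the cgroups of your actual sidecars)
total_cpus=0
for cg in /kubepods/pod123/app /kubepods/pod123/sidecar; do
    read quota period < "/sys/fs/cgroup${cg}/cpu.max" 2>/dev/null || continue
    if [[ "$quota" == "max" ]]; then
        total_cpus=$(nproc)   # one unlimited container -> fall back to host count
        break
    fi
    total_cpus=$(echo "scale=2; $total_cpus + $quota / $period" | bc)
done
echo "Combined CPU limit: $total_cpus"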

Can I get the cgroup limit for another cgroup?

Yes, if you have visibility into the cgroup filesystem.

1. Find a Process's Cgroup

Every process exposes its cgroup membership:

# Get cgroup for any PID you can see
cat /proc/1234/cgroup

# cgroups v2 output:
# 0::/kubepods/pod123/container456

# cgroups v1 output:
# 12:cpu,cpuacct:/docker/abc123
# 11:memory:/docker/abc123
# ...

2. Read That Cgroup's Limits

If the cgroup filesystem is mounted and accessible:

#!/bin/bash

get_cgroup_cpu_limit() {
    local pid=$1

    # Get cgroup path for this PID
    cgroup_path=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)  # v2

    if [[ -n "$cgroup_path" ]]; then
        # cgroups v2
        limit_file="/sys/fs/cgroup${cgroup_path}/cpu.max"
        if [[ -r "$limit_file" ]]; then
            read quota period < "$limit_file"
            if [[ "$quota" == "max" ]]; then
                echo "unlimited"
            else
                echo "scale=2; $quota / $period" | bc
            fi
            return
        fi
    fi

    # Try cgroups v1
    cgroup_path=$(grep -oP 'cpu.*:\K.*' /proc/$pid/cgroup 2>/dev/null)
    if [[ -n "$cgroup_path" ]]; then
        quota_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_quota_us"
        period_file="/sys/fs/cgroup/cpu${cgroup_path}/cpu.cfs_period_us"
        if [[ -r "$quota_file" ]]; then
            quota=$(cat "$quota_file")
            period=$(cat "$period_file")
            if [[ "$quota" == "-1" ]]; then
                echo "unlimited"
            else
                echo "scale=2; $quota / $period" | bc
            fi
            return
        fi
    fi

    echo "unknown"
}

# Example: get limit for PID 1234
get_cgroup_cpu_limit 1234

3. Mount Visibility Requirements

Whether another cgroup's files are readable depends on what is mounted:

  • Host system - yes
  • Privileged container - yes
  • /sys/fs/cgroup mounted read-only from host - yes (common in Kubernetes)
  • Only own cgroup subtree mounted - no

Check what's visible:

mount | grep cgroup
ls /sys/fs/cgroup/

4. Full Solution: Aggregate by Cgroup

#!/bin/bash
CLK_TCK=$(getconf CLK_TCK)

declare -A cgroup_ticks
declare -A cgroup_limit

for stat in /proc/[0-9]*/stat; do
    pid="${stat#/proc/}"
    pid="${pid%/stat}"

    # Get cgroup for this process
    cg=$(grep -oP '0::\K.*' /proc/$pid/cgroup 2>/dev/null)
    [[ -z "$cg" ]] && continue

    # Get CPU ticks
    if read -r line < "$stat" 2>/dev/null; then
        rest="${line##*) }"
        read -ra f <<< "$rest"
        ticks=$((f[11] + f[12]))

        ((cgroup_ticks[$cg] += ticks))

        # Cache the limit (only look up once per cgroup)
        if [[ -z "${cgroup_limit[$cg]}" ]]; then
            limit_file="/sys/fs/cgroup${cg}/cpu.max"
            if [[ -r "$limit_file" ]]; then
                read quota period < "$limit_file"
                if [[ "$quota" == "max" ]]; then
                    cgroup_limit[$cg]="unlimited"
                else
                    cgroup_limit[$cg]=$(echo "scale=2; $quota / $period" | bc)
                fi
            else
                cgroup_limit[$cg]="unknown"
            fi
        fi
    fi
done

echo "Ticks by cgroup:"
for cg in "${!cgroup_ticks[@]}"; do
    echo "  $cg: ${cgroup_ticks[$cg]} ticks (limit: ${cgroup_limit[$cg]} CPUs)"
done

If You Can't Access Other Cgroups

Fallback options:

  1. Mount the cgroup fs — add volume mount for /sys/fs/cgroup:ro (example after this list)
  2. Use a sidecar with access — one privileged container does the monitoring
  3. Accept "unknown" limits — report raw ticks/cores instead of percentages
  4. Kubernetes Downward API — inject limits as env vars (only for your own container though)
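
For option 1, a hedged example with plain Docker (Kubernetes would use an equivalent read-only hostPath volume; the image name is a placeholder):

# Expose the host's cgroup hierarchy read-only inside the monitoring container
docker run --rm -v /sys/fs/cgroup:/sys/fs/cgroup:ro my-monitor-image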