High CPU load is a common cause of issues. Let's look at how to dig into it with Prometheus and the Node exporter.
On a Node exporter's metrics page, part of the output is:
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="guest"} 0
node_cpu_seconds_total{cpu="0",mode="idle"} 2.03442237e+06
node_cpu_seconds_total{cpu="0",mode="iowait"} 3522.37
node_cpu_seconds_total{cpu="0",mode="irq"} 0.48
node_cpu_seconds_total{cpu="0",mode="nice"} 515.56
node_cpu_seconds_total{cpu="0",mode="softirq"} 953.06
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 6605.46
node_cpu_seconds_total{cpu="0",mode="user"} 23343.01
node_cpu_seconds_total{cpu="1",mode="guest"} 0
node_cpu_seconds_total{cpu="1",mode="idle"} 2.03471439e+06
node_cpu_seconds_total{cpu="1",mode="iowait"} 3633.5
node_cpu_seconds_total{cpu="1",mode="irq"} 0.58
node_cpu_seconds_total{cpu="1",mode="nice"} 542.05
node_cpu_seconds_total{cpu="1",mode="softirq"} 880.49
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 6581.92
node_cpu_seconds_total{cpu="1",mode="user"} 23171.06
This metric comes from /proc/stat and tells us how many seconds each CPU spent doing each type of work:
- user: The time spent in userland
- nice: The time spent in userland by low-priority (niced) processes
- system: The time spent in the kernel
- iowait: Time spent waiting for I/O
- idle: Time the CPU had nothing to do
- irq and softirq: Time servicing interrupts
- guest: If you are running VMs, the CPU they use
- steal: If you are a VM, time other VMs "stole" from your CPUs
These modes are mutually exclusive. A high iowait means that you are disk or network bound, while high user or system means that you are CPU bound.
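To make the breakdown concrete, here is a minimal Python sketch that parses the cpu="0" lines from the output above and works out what fraction of the time since boot went to each mode. The helper names (parse_cpu_seconds, mode_shares) are made up for illustration and are not part of Prometheus or the Node exporter; the regular expression only handles lines shaped exactly like this sample.

```python
import re

# Sample lines copied from the metrics output above (cpu="0" only).
SAMPLE = """\
node_cpu_seconds_total{cpu="0",mode="guest"} 0
node_cpu_seconds_total{cpu="0",mode="idle"} 2.03442237e+06
node_cpu_seconds_total{cpu="0",mode="iowait"} 3522.37
node_cpu_seconds_total{cpu="0",mode="irq"} 0.48
node_cpu_seconds_total{cpu="0",mode="nice"} 515.56
node_cpu_seconds_total{cpu="0",mode="softirq"} 953.06
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 6605.46
node_cpu_seconds_total{cpu="0",mode="user"} 23343.01
"""

# Assumption: label order is always cpu, then mode, as in the sample.
LINE_RE = re.compile(r'^node_cpu_seconds_total\{cpu="(\d+)",mode="(\w+)"\} (\S+)$')


def parse_cpu_seconds(text):
    """Return {(cpu, mode): seconds} for the node_cpu_seconds_total lines."""
    samples = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            cpu, mode, value = match.groups()
            samples[(cpu, mode)] = float(value)
    return samples


def mode_shares(samples, cpu):
    """Return each mode's fraction of the total seconds recorded for one CPU."""
    per_mode = {mode: v for (c, mode), v in samples.items() if c == cpu}
    total = sum(per_mode.values())
    return {mode: v / total for mode, v in per_mode.items()}


shares = mode_shares(parse_cpu_seconds(SAMPLE), "0")
for mode, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{mode:8s} {share:.4%}")
```

In this sample idle dominates by a wide margin. These shares cover the whole time since boot, which is why the rate-based queries that follow are more useful for spotting current load.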
These are counters, so to calculate the per-second values we use the rate function in the expression browser:
rate(node_cpu_seconds_total{job="node"}[1m])
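As a rough illustration of what rate does to a counter, the Python sketch below computes a per-second increase from a list of (timestamp, value) samples and compensates for counter resets. It is a simplification: the real rate function also extrapolates to the edges of the range window, which is omitted here. The function name simple_rate and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None if there are fewer than two samples or no time elapsed.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None

    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # The counter went backwards, so assume it was reset to zero
            # and count the new value as the increase since the reset.
            increase += value
        else:
            increase += value - prev
        prev = value
    return increase / elapsed


# Hypothetical idle-mode samples for one CPU, scraped every 15 seconds.
idle_samples = [(0, 1000.0), (15, 1014.4), (30, 1028.8), (45, 1043.2), (60, 1057.6)]
idle_rate = simple_rate(idle_samples)
print(idle_rate)  # roughly 0.96 idle seconds per second
```

A result near 0.96 would mean this CPU was idle about 96% of the time over that minute.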
We can aggregate this to get the overall value across all CPUs for the machine:
sum by (mode, instance) (rate(node_cpu_seconds_total{job="node"}[1m]))
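The aggregation can be pictured as grouping the per-CPU series by the labels that are kept and adding up each group. The Python sketch below mimics that for a handful of made-up rates; the instance name host1:9100 and all of the values are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-series rates, keyed by (instance, cpu, mode).
rates = {
    ("host1:9100", "0", "idle"): 0.90,
    ("host1:9100", "0", "user"): 0.10,
    ("host1:9100", "1", "idle"): 0.70,
    ("host1:9100", "1", "user"): 0.30,
}


def sum_by_mode_instance(series):
    """Drop the cpu label and sum the values that share (mode, instance)."""
    totals = defaultdict(float)
    for (instance, _cpu, mode), value in series.items():
        totals[(mode, instance)] += value
    return dict(totals)


totals = sum_by_mode_instance(rates)
for key, value in sorted(totals.items()):
    print(key, round(value, 2))
```

With two CPUs the summed values add up to about two seconds per second for the machine, one for each CPU.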
As these values always sum to one second per second for each CPU, the per-second rates are also the ratios of usage. We can use this to calculate the percentage of CPU used by subtracting the idle percentage from 100%:
100 - (avg by (instance) (rate(node_cpu_seconds_total{job="node",mode="idle"}[1m])) * 100)
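The arithmetic behind this expression is small enough to check by hand. The Python sketch below applies it to hypothetical idle rates for a two-CPU machine; cpu_used_percent is an illustrative name, not a Prometheus function.

```python
def cpu_used_percent(idle_rates):
    """Average the per-CPU idle rates and subtract the result from 100%."""
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical idle rates: CPU 0 is 90% idle, CPU 1 is 70% idle.
used = cpu_used_percent([0.90, 0.70])
print(round(used, 2))  # about 20 percent of the machine's CPU is in use
```

Averaging rather than summing keeps the result between 0 and 100 regardless of how many CPUs the machine has.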