Resource saturation: CPU, memory, disk and network
For saturation metrics you need:
- node_exporter: CPU, memory, disk (node_*_*)
- kube-state-metrics: resource limits (kube_pod_resource_*)
- kubelet cAdvisor: memory/CPU of containers
- iostat for disk I/O (if not using cAdvisor)

All node_cpu_seconds_total, node_memory_*, and node_filesystem_* queries are filtered by the $job_node dashboard variable (default: node-exporter). If your node_exporter is scraped under a different job name, update the variable in Dashboard Settings → Variables.
Saturation metrics reflect your actual workload. In a typical homelab with light usage, it is completely normal to see CPU at 2–10%, memory at 30–50%, and disk I/O near zero. Low values are not a sign of misconfiguration — they indicate your infrastructure has headroom. Thresholds are calibrated for production workloads; feel free to lower the warning/critical boundaries to suit your homelab's baseline.
┌────────────────────────────────────────┐
│ SATURATION — RESOURCE EXHAUSTION       │
├────────────────────────────────────────┤
│ CPU %   │ Memory % │ Disk %  │ Net     │
│ (gauge) │ (gauge)  │ (gauge) │ (gauge) │
├────────────────────────────────────────┤
│ CPU + Memory Time Series [12 cols]     │
│ (evolution with thresholds)            │
├────────────────────────────────────────┤
│ Disk I/O [6 cols] │ Disk Free [6 cols] │
│ (read/write ops)  │ (by mount)         │
├────────────────────────────────────────┤
│ Resource by Node [12 cols]             │
│ (table: CPU, mem, disk by host)        │
├────────────────────────────────────────┤
│ Requests vs Limits [12 cols]           │
│ (breakdown of demand vs limit)         │
└────────────────────────────────────────┘
Percentage of CPU in use (excluding idle).
Default query (excludes idle):
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
By specific instance:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle",instance="server1:9100"}[5m])) * 100)
Exclude specific modes (treat iowait and steal as not busy; the modes are summed per core before averaging):
100 - (avg(sum by (instance, cpu) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[5m]))) * 100)
Recommended thresholds: 70% yellow, 85% red
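To make the arithmetic behind the default query concrete, here is a minimal Python sketch of what `rate(...)` followed by `avg(...)` computes from two scrapes of the idle counter. The sample values and the helper name `cpu_busy_percent` are illustrative assumptions, not part of the dashboard.

```python
def cpu_busy_percent(idle_before, idle_after, interval_seconds):
    """Approximate 100 - avg(rate(idle)) * 100 from two counter samples.

    idle_before / idle_after hold the idle CPU-seconds of each core
    at the first and second scrape.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Per-core idle fraction: idle seconds gained per wall-clock second.
    idle_rates = [
        (after - before) / interval_seconds
        for before, after in zip(idle_before, idle_after)
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical 4-core host, scraped 60 seconds apart.
before = [1000.0, 1000.0, 1000.0, 1000.0]
after = [1054.0, 1057.0, 1048.0, 1057.0]
print(round(cpu_busy_percent(before, after, 60), 1))  # 10.0
```

A result of 10% sits comfortably in the range described above as normal for a lightly used homelab.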
Percentage of memory in use.
Using available (recommended):
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
Using free (less accurate):
100 * (1 - (node_memory_MemFree_bytes / node_memory_MemTotal_bytes))
By node:
avg by (instance) (100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)))
Thresholds: 80% yellow, 90% red (leave buffer for cache)
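The gap between the two formulas is easiest to see with numbers. The sketch below is a hedged illustration using made-up readings for a 16 GiB host; `memory_used_percent` is a hypothetical helper, not something the dashboard ships.

```python
GIB = 1024 ** 3


def memory_used_percent(total_bytes, unused_bytes):
    """Mirror 100 * (1 - unused / total) from the memory queries."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100 * (1 - unused_bytes / total_bytes)


# Assumed readings: most of the "missing" memory is reclaimable page cache.
mem_total = 16 * GIB
mem_free = 2 * GIB        # node_memory_MemFree_bytes
mem_available = 10 * GIB  # node_memory_MemAvailable_bytes

print(memory_used_percent(mem_total, mem_available))  # 37.5
print(memory_used_percent(mem_total, mem_free))       # 87.5
```

With MemFree the gauge would already be in the yellow band, even though 10 GiB could still be handed to applications.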
Percentage of disk used (by mountpoint).
Default (root partition; node_exporter exposes no used-bytes metric, so usage is derived from available space):
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
Exclude tmpfs and virtual systems:
max by (instance, device) (100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs|fuse.*",mountpoint!~"/sys.*|/proc.*"}
/ node_filesystem_size_bytes{fstype!~"tmpfs|devtmpfs|fuse.*",mountpoint!~"/sys.*|/proc.*"}))
By mountpoint (highest usage across nodes; summing percentages would exceed 100% with more than one node):
max by (mountpoint) (100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes))
Thresholds: 80% yellow, 90% red
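The filtering step can be sketched in Python to show which filesystems survive the exclusions. This is an illustration under assumptions: the sample filesystems are invented, used space is derived as size minus available, and `re.fullmatch` stands in for PromQL's fully anchored regex matchers.

```python
import re

# Same patterns as the fstype!~ and mountpoint!~ matchers.
EXCLUDED_FSTYPES = re.compile(r"tmpfs|devtmpfs|fuse.*")
EXCLUDED_MOUNTS = re.compile(r"/sys.*|/proc.*")


def disk_used_percent(filesystems):
    """Return {mountpoint: used %} for filesystems that pass both filters."""
    result = {}
    for fs in filesystems:
        # PromQL regex matchers must match the whole label value.
        if EXCLUDED_FSTYPES.fullmatch(fs["fstype"]):
            continue
        if EXCLUDED_MOUNTS.fullmatch(fs["mountpoint"]):
            continue
        result[fs["mountpoint"]] = 100 * (1 - fs["avail"] / fs["size"])
    return result


# Hypothetical filesystems on one node (sizes in bytes).
sample = [
    {"mountpoint": "/", "fstype": "ext4", "size": 100e9, "avail": 25e9},
    {"mountpoint": "/run", "fstype": "tmpfs", "size": 2e9, "avail": 2e9},
    {"mountpoint": "/data", "fstype": "xfs", "size": 500e9, "avail": 400e9},
]
for mount, pct in disk_used_percent(sample).items():
    print(f"{mount}: {pct:.1f}%")
```

Only `/` and `/data` are reported; the tmpfs mount is dropped before any percentage is computed.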
Percentage of bandwidth saturated.
Using interface speed (if available; node_network_speed_bytes is already in bytes per second, so no bit conversion is needed):
100 * rate(node_network_transmit_bytes_total[5m]) / node_network_speed_bytes
Without speed (normalize by the 1h average; the result is a ratio where 1.0 means typical traffic):
rate(node_network_transmit_bytes_total[5m]) / avg_over_time(rate(node_network_transmit_bytes_total[5m])[1h:1m])
By direction (in/out):
sum by (device) (rate(node_network_receive_bytes_total[5m]))
sum by (device) (rate(node_network_transmit_bytes_total[5m]))
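As a worked example of the bandwidth calculation, the sketch below turns two byte-counter readings into a saturation percentage. It assumes the link speed is expressed in bytes per second (as the metric name node_network_speed_bytes suggests); the counter values and the helper name are hypothetical.

```python
def link_saturation_percent(bytes_before, bytes_after,
                            interval_seconds, speed_bytes_per_second):
    """Throughput over the interval as a percentage of link capacity."""
    if interval_seconds <= 0 or speed_bytes_per_second <= 0:
        raise ValueError("interval and speed must be positive")
    throughput = (bytes_after - bytes_before) / interval_seconds
    return 100 * throughput / speed_bytes_per_second


# A 1 Gbit/s link carries at most 125,000,000 bytes per second.
GIGABIT_IN_BYTES = 1_000_000_000 // 8

# Hypothetical counter readings taken 60 seconds apart.
print(link_saturation_percent(0, 750_000_000, 60, GIGABIT_IN_BYTES))  # 10.0
```

Interfaces that do not report a speed (many virtual ones) cannot use this form, which is where the historical-ratio query comes in.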
Line chart showing evolution of both resources.
Query CPU:
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Query Memory:
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
Per-node breakdown:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Add threshold lines: In panel → Alerts → 70% (warning), 85% (critical)
Read/write operations per second.
Read ops/sec:
rate(node_disk_reads_completed_total{device="sda"}[5m])
Write ops/sec:
rate(node_disk_writes_completed_total{device="sda"}[5m])
Change device: sda → nvme0n1 (NVMe), vda (virtual), etc.
Utilization %:
rate(node_disk_io_time_seconds_total{device="sda"}[5m]) * 100
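Note: I/O wait above 30% indicates disk contention. I/O wait is the iowait mode of node_cpu_seconds_total, a separate signal from the device utilization above.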
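The three disk queries all reduce to counter deltas divided by the scrape interval. A minimal sketch, assuming two hypothetical readings of the `sda` counters taken 60 seconds apart:

```python
def disk_io_summary(before, after, interval_seconds):
    """Derive ops/sec and utilization % from two counter snapshots.

    Each snapshot holds completed reads, completed writes and the
    cumulative seconds the device spent doing I/O.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        "read_ops_per_s": (after["reads"] - before["reads"]) / interval_seconds,
        "write_ops_per_s": (after["writes"] - before["writes"]) / interval_seconds,
        # Fraction of wall-clock time the device was busy, as a percentage.
        "utilization_pct": (after["io_time"] - before["io_time"])
        / interval_seconds * 100,
    }


# Hypothetical snapshots.
first = {"reads": 1000, "writes": 5000, "io_time": 120.0}
second = {"reads": 1600, "writes": 8000, "io_time": 135.0}
print(disk_io_summary(first, second, 60))
```

Here the device was busy for 15 of the 60 seconds, which gives 25% utilization alongside 10 reads and 50 writes per second.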
Available space by mount point.
Query:
node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"} / 1024 / 1024 / 1024
Show by mountpoint:
sum by (mountpoint) (node_filesystem_avail_bytes) / 1024 / 1024 / 1024
Percentage available (lowest value per mountpoint):
min by (mountpoint) (100 * (node_filesystem_avail_bytes / node_filesystem_size_bytes))
Dynamic table with CPU, memory and disk by node.
Queries for the multi-metric table (one per column):
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)  # CPU
avg by (instance) (100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))  # Memory
max by (instance, device) (100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes))  # Disk
Filter production nodes (the label matcher goes inside the metric selector, not after the aggregation):
sum by (instance) (...{instance=~"prod.*"})
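Comparison of requested vs. limit resources in Kubernetes. The metric names below follow kube-state-metrics v1; v2 replaced them with kube_pod_container_resource_requests and kube_pod_container_resource_limits, selected by a resource label (cpu, memory).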
CPU requests:
sum by (namespace, pod) (kube_pod_container_resource_requests_cpu_cores)
CPU limits:
sum by (namespace, pod) (kube_pod_container_resource_limits_cpu_cores)
Memory limits (in MiB):
sum by (namespace, pod) (kube_pod_container_resource_limits_memory_bytes) / 1024 / 1024
Filter by namespace (the label matcher goes inside the metric selector):
sum by (namespace, pod) (...{namespace!~"kube-system|kube-.*"})
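What `sum by (namespace, pod)` does to per-container samples, and how the resulting requests and limits compare, can be sketched as follows. The container records are invented for illustration and `sum_by_pod` is a hypothetical helper.

```python
from collections import defaultdict


def sum_by_pod(containers, field):
    """Equivalent of sum by (namespace, pod) over per-container samples."""
    totals = defaultdict(float)
    for container in containers:
        totals[(container["namespace"], container["pod"])] += container[field]
    return dict(totals)


# Hypothetical per-container CPU values, in cores.
containers = [
    {"namespace": "media", "pod": "plex-0", "requests": 0.5, "limits": 2.0},
    {"namespace": "media", "pod": "plex-0", "requests": 0.25, "limits": 0.5},
    {"namespace": "monitoring", "pod": "grafana-0", "requests": 0.25, "limits": 0.5},
]

requests = sum_by_pod(containers, "requests")
limits = sum_by_pod(containers, "limits")
for key in requests:
    # Share of the pod's limit that is already reserved by its requests.
    print(key, f"{100 * requests[key] / limits[key]:.0f}%")
```

The two-container pod is collapsed into a single row, which is the shape the Requests vs Limits panel expects.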
| Theme | OK (<70%) | Warning (70-85%) | Critical (>85%) |
|---|---|---|---|
| GREEN | #33FF00 | #FFCC00 | #FF4444 |
| AMBER | #FFB000 | #FF8C00 | #FF4500 |
| BLUE | #00BFFF | #FFD700 | #FF1493 |
Gauge thresholds in JSON:
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 70 },
{ "color": "red", "value": 85 }
]
}
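How a reading maps onto these steps can be sketched in a few lines of Python. The sketch assumes the steps are sorted in ascending order and that each step applies from its value upward, with the null step acting as the base color; `color_for` is a hypothetical helper, not a Grafana API.

```python
import json

# The inner object of the "thresholds" fragment above.
THRESHOLDS = json.loads("""
{
  "mode": "absolute",
  "steps": [
    { "color": "green", "value": null },
    { "color": "yellow", "value": 70 },
    { "color": "red", "value": 85 }
  ]
}
""")


def color_for(value, thresholds=THRESHOLDS):
    """Return the color of the highest step whose value is <= the reading."""
    color = None
    for step in thresholds["steps"]:
        # JSON null becomes None: the base step matches every reading.
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


for reading in (8, 70, 84.9, 92):
    print(reading, color_for(reading))
```

Lowering the 70 and 85 values in the JSON is all that is needed to adapt the gauge to a quieter homelab baseline.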
| Type | Stat Card Width | Graph Height | Table Height |
|---|---|---|---|
| Mobile | 6 (stack vertical) | 10 | 12 |
| Tablet 10.9" | 12 (full width) | 12 | 14 |
| iPad Pro 12.9" | 12 (full width) | 14 | 16 |
| Desktop 1920x1080 | 6 (4 cols) | 10 | 12 |
rate(node_cpu_seconds_total{mode!="idle"}[5m]) >
avg_over_time(rate(node_cpu_seconds_total{mode!="idle"}[5m])[1h:5m]) * 1.5
Alert if CPU rises 50% above 1h average.
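The rule compares the current rate with 1.5 times its own one-hour average. A minimal sketch of that comparison, using made-up non-idle CPU rates sampled every 5 minutes (the helper name is hypothetical):

```python
from statistics import mean


def is_cpu_spike(hourly_samples, current, factor=1.5):
    """True when the current rate exceeds factor x the average of the samples."""
    return current > mean(hourly_samples) * factor


# Hypothetical non-idle CPU rates over the past hour (12 samples, mean 0.20).
baseline = [0.20, 0.22, 0.18, 0.20, 0.21, 0.19,
            0.20, 0.22, 0.18, 0.20, 0.21, 0.19]

print(is_cpu_spike(baseline, 0.25))  # False: below 1.5 x 0.20
print(is_cpu_spike(baseline, 0.35))  # True
```

On a lightly loaded homelab the baseline is small, so a purely relative rule can fire on tiny absolute changes; pairing it with an absolute floor is one possible refinement.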
(node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) < 0.3
Alert if using > 70% of swap (indicator of severe memory pressure).
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 86400)
Predicts available space in 24h. If < 10GB, alert proactively.
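`predict_linear` fits a straight line to the samples in the range by least-squares regression and extends it forward. The sketch below reproduces that idea in plain Python as an approximation: it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the evaluation time, and the sample data is invented.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) pairs, extended forward."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_timestamp = samples[-1][0]
    return intercept + slope * (last_timestamp + seconds_ahead)


# Hypothetical free space in GiB over one hour: shrinking by 1 GiB per hour.
samples = [(0, 50.0), (900, 49.75), (1800, 49.5), (2700, 49.25), (3600, 49.0)]

forecast = predict_linear(samples, 86400)
print(round(forecast, 2), "GiB expected in 24h")
print("alert" if forecast < 10 else "ok")
```

At this rate the disk loses about 24 GiB per day, so the forecast stays above the 10 GB alert line; a steeper slope would cross it.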
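histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000
+ scalar(sum(node_load1) / count(node_cpu_seconds_total{mode="idle"}) * 100)
Adds load per core (as a percentage) to p95 latency (in ms), so the value rises as saturation grows. The two terms have different units, so read the result as a relative indicator, not as a latency.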