RetroDash // SATURATION

GUIDE
Resource saturation: CPU, memory, disk and network

Requirements

For saturation metrics you need:

Compatibility Notes

$job_node variable

All node_cpu_seconds_total, node_memory_*, and node_filesystem_* queries are filtered by the $job_node dashboard variable (default: node-exporter). If your node_exporter is scraped under a different job name, update the variable in Dashboard Settings → Variables.

Low saturation values in homelabs

Saturation metrics reflect your actual workload. In a typical homelab with light usage, it is completely normal to see CPU at 2–10%, memory at 30–50%, and disk I/O near zero. Low values are not a sign of misconfiguration — they indicate your infrastructure has headroom. Thresholds are calibrated for production workloads; feel free to lower the warning/critical boundaries to suit your homelab's baseline.

Panel Layout

┌────────────────────────────────────────┐
│   SATURATION — RESOURCE EXHAUSTION    │
├────────────────────────────────────────┤
│  CPU %   │  Memory %  │  Disk %  │ Net │
│  (gauge) │  (gauge)   │  (gauge) │(gau)│
├────────────────────────────────────────┤
│  CPU + Memory Time Series [12 cols]   │
│  (evolution with thresholds)           │
├────────────────────────────────────────┤
│  Disk I/O [6 cols] │ Disk Free [6 col]│
│  (read/write ops)  │ (by mount)       │
├────────────────────────────────────────┤
│  Resource by Node [12 cols]            │
│  (table: CPU, mem, disk by host)       │
├────────────────────────────────────────┤
│  Requests vs Limits [12 cols]          │
│  (breakdown of demand vs limit)        │
└────────────────────────────────────────┘

Panel Customization

CPU % Gauge

Percentage of CPU in use (excluding idle).

100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
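
To make the arithmetic behind this query concrete, here is a minimal Python sketch that applies the same steps to two counter samples per core. The sample values and the helper name cpu_busy_percent are illustrative assumptions, not part of the dashboard.

```python
def cpu_busy_percent(idle_start, idle_end, window_seconds):
    """Mirror 100 - avg(rate(idle[window])) * 100 for a single host.

    idle_start / idle_end hold the idle-seconds counter of each core at
    the start and end of the window (hypothetical samples).
    """
    # rate(): per-second increase of each core's idle counter
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    # avg(): mean idle fraction across all cores
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores over a 300 s window: one idle for 225 s, the other for 150 s.
busy = cpu_busy_percent([1000.0, 2000.0], [1225.0, 2150.0], 300)
print(busy)
```

With these samples the cores are idle 75% and 50% of the time, so the gauge would read 37.5% busy.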

By specific instance:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle",instance="server1:9100"}[5m])) * 100)

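Exclude specific modes (treat iowait and steal as not busy):

100 - (avg(sum by (instance, cpu) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[5m]))) * 100)

The inner sum adds the excluded modes per core before averaging. Averaging the raw series directly would understate the excluded share.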

Recommended thresholds: 70% yellow, 85% red

Memory % Gauge

Percentage of memory in use.

Using available (recommended):

100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

Using free (less accurate):

100 * (1 - (node_memory_MemFree_bytes / node_memory_MemTotal_bytes))
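
A short worked example shows why the two formulas can disagree so widely. The figures below are hypothetical: a 16 GiB host where most of the memory that is not free is reclaimable page cache.

```python
GIB = 1024 ** 3


def used_percent(unused_bytes, total_bytes):
    """Same arithmetic as 100 * (1 - unused / total) in the queries above."""
    return 100 * (1 - unused_bytes / total_bytes)


# Hypothetical host: 16 GiB total, 1 GiB free, 10 GiB available
# (available counts free memory plus cache the kernel can reclaim).
total = 16 * GIB
mem_free = 1 * GIB
mem_available = 10 * GIB

print(used_percent(mem_available, total))  # 37.5
print(used_percent(mem_free, total))       # 93.75
```

The MemFree variant reports 93.75% on a host that the MemAvailable variant puts at 37.5%, which is why the latter is the better basis for thresholds.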

By node:

avg by (instance) (100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)))

Thresholds: 80% yellow, 90% red (leave buffer for cache)

Disk % Gauge

Percentage of disk used (by mountpoint).

Default (root partition):

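100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})

node_exporter does not export a used-bytes metric, so usage is derived from node_filesystem_avail_bytes (or from node_filesystem_free_bytes, which also counts blocks reserved for root).

Exclude tmpfs and virtual systems:

100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs|fuse.*",mountpoint!~"/sys.*|/proc.*"}
/ node_filesystem_size_bytes{fstype!~"tmpfs|devtmpfs|fuse.*",mountpoint!~"/sys.*|/proc.*"})

By mountpoint (worst case across hosts):

max by (mountpoint) (100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes))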

Thresholds: 80% yellow, 90% red

Network Saturation

Percentage of bandwidth saturated.

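Using interface speed (if available):

100 * rate(node_network_transmit_bytes_total[5m])
/ (node_network_speed_bytes > 0)

node_network_speed_bytes is already expressed in bytes per second, so no bit conversion is needed. The > 0 filter drops interfaces that report no valid speed, which can happen for virtual devices.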
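
A worked example of the saturation arithmetic, assuming node_network_speed_bytes reports link speed in bytes per second (a 1 Gbit/s link is then 125,000,000). The helper name and the traffic figure are hypothetical.

```python
def link_saturation_percent(tx_bytes_per_second, speed_bytes_per_second):
    """Share of the link in use; both arguments are in bytes per second."""
    if speed_bytes_per_second <= 0:
        # No usable speed reported (assumed here for virtual interfaces)
        return None
    return 100 * tx_bytes_per_second / speed_bytes_per_second


GIGABIT_IN_BYTES = 1_000_000_000 / 8  # 125,000,000 bytes per second

# 25 MB/s of outbound traffic on a 1 Gbit/s link
print(link_saturation_percent(25_000_000, GIGABIT_IN_BYTES))
```

Here 25 MB/s on a gigabit link works out to 20% saturation.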

Without speed (normalize by historical):

rate(node_network_transmit_bytes_total[5m]) /
  avg_over_time(rate(node_network_transmit_bytes_total[5m])[1h:1m])

By direction (inbound shown; use node_network_transmit_bytes_total for outbound):

sum by (device) (rate(node_network_receive_bytes_total[5m]))

CPU + Memory Time Series

Line chart showing evolution of both resources.

Query CPU:

100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Query Memory:

100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

Per-node breakdown:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Add threshold lines: In panel → Alerts → 70% (warning), 85% (critical)

Disk I/O

Read/write operations per second.

Read ops/sec:

rate(node_disk_reads_completed_total{device="sda"}[5m])

Write ops/sec:

rate(node_disk_writes_completed_total{device="sda"}[5m])

Change device: replace sda with nvme0n1 (NVMe), vda (virtual), etc.

Utilization %:

rate(node_disk_io_time_seconds_total{device="sda"}[5m]) * 100

Note: this measures the share of time the device was busy, not CPU iowait. Sustained values above 30% suggest disk contention.

Disk Free by Mount

Available space by mount point.

node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"} / 1024 / 1024 / 1024

Show by mountpoint:

sum by (mountpoint) (node_filesystem_avail_bytes) / 1024 / 1024 / 1024

Percentage available:

avg by (mountpoint) (100 * node_filesystem_avail_bytes / node_filesystem_size_bytes)

Resource by Node (Table)

Dynamic table with CPU, memory and disk by node.

Queries for the multi-metric table (one per column):

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) # CPU
avg by (instance) (100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) # Memory
max by (instance, device) (100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)) # Disk

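Filter production nodes (label matchers go inside the metric selector, not after the aggregation):

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle",instance=~"prod.*"}[5m])) * 100)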

Requests vs Limits

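Compares requested resources against limits in Kubernetes. The metric names below come from kube-state-metrics v1.x; v2.0 and later expose kube_pod_container_resource_requests and kube_pod_container_resource_limits with a resource label (for example resource="cpu") instead.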

CPU requests:

sum by (namespace, pod) (kube_pod_container_resource_requests_cpu_cores)

CPU limits:

sum by (namespace, pod) (kube_pod_container_resource_limits_cpu_cores)

Memory limits (in MiB):

sum by (namespace, pod) (kube_pod_container_resource_limits_memory_bytes) / 1024 / 1024

Filter by namespace (the matcher goes inside the metric selector):

sum by (namespace, pod) (kube_pod_container_resource_requests_cpu_cores{namespace!~"kube-.*"})

Change Color Theme

Theme   OK (<70%)   Warning (70-85%)   Critical (>85%)
GREEN   #33FF00     #FFCC00            #FF4444
AMBER   #FFB000     #FF8C00            #FF4500
BLUE    #00BFFF     #FFD700            #FF1493

Gauge thresholds in JSON:

"thresholds": {
  "mode": "absolute",
  "steps": [
    { "color": "green", "value": null },
    { "color": "yellow", "value": 70 },
    { "color": "red", "value": 85 }
  ]
}
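
As a rough illustration of how these steps are read, the Python sketch below picks the color of the highest step whose value the reading has reached, treating the null step as the base color. This mirrors how absolute thresholds are understood to behave in Grafana; the function name color_for is a hypothetical helper, and the fragment is wrapped in braces only so it parses as standalone JSON.

```python
import json

# The threshold fragment from above, wrapped in braces to form valid JSON.
THRESHOLDS = json.loads("""
{
  "mode": "absolute",
  "steps": [
    { "color": "green", "value": null },
    { "color": "yellow", "value": 70 },
    { "color": "red", "value": 85 }
  ]
}
""")


def color_for(value, thresholds):
    """Return the color of the last step whose value the reading has reached."""
    color = None
    for step in thresholds["steps"]:
        # A null value marks the base step, which always applies.
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


for reading in (42, 70, 91):
    print(reading, color_for(reading, THRESHOLDS))
```

Steps are expected in ascending order, as in the fragment above, so the last match wins.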

Adapt to Your Resolution

Type                Stat Card Width      Graph Height   Table Height
Mobile              6 (stack vertical)   10             12
Tablet 10.9"        12 (full width)      12             14
iPad Pro 12.9"      12 (full width)      14             16
Desktop 1920x1080   6 (4 cols)           10             12

Import in Grafana

  1. Export SATURATION dashboard as JSON
  2. Dashboards → Import
  3. Paste JSON
  4. Select Prometheus datasource
  5. Verify metrics from node_exporter are available
  6. If metrics are missing, check that node_exporter is running with correct flags
  7. Save and customize thresholds based on your infrastructure
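
Step 5 can be scripted against the Prometheus HTTP API (GET /api/v1/query). The sketch below is one possible check, not part of RetroDash: the metric list, the helper names, and the http://localhost:9090 address are assumptions, and the job label defaults to the node-exporter value mentioned under Compatibility Notes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Metrics this guide's node_exporter queries rely on (illustrative subset).
REQUIRED_METRICS = [
    "node_cpu_seconds_total",
    "node_memory_MemAvailable_bytes",
    "node_filesystem_avail_bytes",
]


def query_url(base_url, metric, job="node-exporter"):
    """Build an instant-query URL that counts series for one metric."""
    promql = f'count({metric}{{job="{job}"}})'
    return f"{base_url}/api/v1/query?" + urlencode({"query": promql})


def has_series(response):
    """True if a decoded API response reports at least one result."""
    return response.get("status") == "success" and len(response["data"]["result"]) > 0


def missing_metrics(base_url, fetch=None):
    """Return the required metrics for which Prometheus has no series."""
    if fetch is None:
        # Default fetcher performs a real HTTP request.
        fetch = lambda url: json.load(urlopen(url, timeout=5))
    return [m for m in REQUIRED_METRICS if not has_series(fetch(query_url(base_url, m)))]


if __name__ == "__main__":
    # Assumed address; adjust to wherever your Prometheus listens.
    print(missing_metrics("http://localhost:9090"))
```

An empty list means all three metrics are present under that job name. If every metric is listed, the job label is the first thing to check.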

Advanced Tips

Detect CPU spike

rate(node_cpu_seconds_total{mode!="idle"}[5m]) >
  avg_over_time(rate(node_cpu_seconds_total{mode!="idle"}[5m])[1h:5m]) * 1.5

Alert if CPU rises 50% above 1h average.
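
The comparison in this rule can be sketched in a few lines of Python. The sample rates are hypothetical; the subquery [1h:5m] would yield roughly twelve points like these.

```python
def is_cpu_spike(current_rate, hourly_rates, factor=1.5):
    """True when the current rate exceeds factor times the mean of the history."""
    baseline = sum(hourly_rates) / len(hourly_rates)
    return current_rate > baseline * factor


# Twelve 5-minute samples from the past hour, all at 20% busy.
history = [0.20] * 12

print(is_cpu_spike(0.35, history))  # above 1.5 x 0.20
print(is_cpu_spike(0.25, history))  # below 1.5 x 0.20
```

A reading of 0.35 crosses the 0.30 line and fires, while 0.25 does not. On hosts with a very low baseline a multiplier alone can be noisy, so pairing it with an absolute floor may be worth considering.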

Memory pressure (using swap)

(node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) < 0.3

Alert if using > 70% of swap (indicator of severe memory pressure).

Disk exhaustion projection

predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 86400)

Predicts available space in 24h. If < 10GB, alert proactively.
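
predict_linear fits a straight line through the samples in the range by simple linear regression and extends it forward. The Python sketch below reproduces that idea on hypothetical samples; it is an approximation for intuition, not the Prometheus implementation.

```python
def predict_linear(samples, eval_time, seconds_ahead):
    """Least-squares line through (timestamp, value) pairs, extended forward."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + seconds_ahead)


# Hypothetical hour of samples: free space falls from 50 GiB to 49 GiB.
samples = [(0, 50.0), (900, 49.75), (1800, 49.5), (2700, 49.25), (3600, 49.0)]

projected_gib = predict_linear(samples, eval_time=3600, seconds_ahead=86400)
print(round(projected_gib, 2))
print(projected_gib < 10)  # alert condition from the text
```

Losing 1 GiB per hour projects to about 25 GiB left after 24 hours, so the 10 GB alert would not fire yet. A one-hour window reacts quickly but also extrapolates short bursts, so a longer range may give steadier projections.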

Correlate saturation with latency

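histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000
+ (sum(node_load1) / count(node_cpu_seconds_total{mode="idle"}) * 100)

Adds load per core (as a percentage) to p95 latency in milliseconds. Both sides are aggregated down to a single unlabeled series so that the addition matches. The result is a rough combined score that rises with either latency or load, not a value in a physical unit.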