Error monitoring: HTTP failures, pod crashes and OOM kills
For error monitoring you need:
- http_requests_total — HTTP counter (with status label)
- kube-state-metrics — pod/container status (for K8s)
- kubelet cAdvisor metrics — container lifecycle events
- node_exporter — host/system metrics (optional)

Panels that reference node-level data (CPU and memory used in correlation queries) rely on the $job_node dashboard variable. The default value is node-exporter, which matches kube-prometheus-stack. If your Prometheus uses a different job label for node_exporter, update it in Dashboard Settings → Variables.
┌────────────────────────────────────────┐
│     ERRORS — FAILURES & INCIDENTS      │
├────────────────────────────────────────┤
│ Error Rate  │ Failed Pods │ CrashLoop  │
│    (%)      │    Count    │   Count    │
├────────────────────────────────────────┤
│        Error Timeline [12 cols]        │
│       (evolution of error rate)        │
├────────────────────────────────────────┤
│  OOMKilled Count   │   Restart Trend   │
│      (gauge)       │   (time series)   │
├────────────────────────────────────────┤
│  Error Types Distribution [6 cols]     │
│  Pod Status Table [6 cols]             │
├────────────────────────────────────────┤
│          Event Log [12 cols]           │
│   (recent errors by pod/container)     │
└────────────────────────────────────────┘
Error Rate (%): percentage of requests returning an error (5xx status).
Default query (5xx only):
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
To include 4xx (client errors):
100 * sum(rate(http_requests_total{status=~"[45].."}[5m]))
/ sum(rate(http_requests_total[5m]))
Only timeout errors (504):
100 * sum(rate(http_requests_total{status="504"}[5m]))
/ sum(rate(http_requests_total[5m]))
Failed Pods Count: number of pods in a Failed or Unknown state.
Base query:
sum(kube_pod_status_phase{phase=~"Failed|Unknown"})
Filter by namespace (excluding default and kube-* namespaces):
sum(kube_pod_status_phase{phase="Failed", namespace!~"default|kube-.*"})
By namespace (breakdown):
sum by (namespace) (kube_pod_status_phase{phase="Failed"})
CrashLoop Count: pods stuck in restart loops (CrashLoopBackOff).
Query:
sum(kube_pod_container_status_last_terminated_reason{reason="Error"})
Or by container state:
count(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"})
By specific pod:
count by (pod) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"})
Alert threshold: Configure alert on > 0 pods
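If you want Prometheus itself to raise this alert, the threshold can be expressed as an alerting rule. Below is a minimal sketch, assuming rules are loaded via `rule_files` in prometheus.yml; the group name, alert name, `for` duration, and severity label are illustrative, not from the original guide.

```yaml
groups:
  - name: crashloop-alerts
    rules:
      - alert: CrashLoopBackOffDetected
        # Fires when at least one container is waiting in CrashLoopBackOff
        expr: count(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} container(s) in CrashLoopBackOff"
```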
OOMKilled Count: containers terminated due to Out Of Memory (OOM).
Query:
sum(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"})
Correlate with memory limits (GiB):
sum(kube_pod_container_resource_limits_memory_bytes) / 1024 / 1024 / 1024
Containers most affected by OOM:
topk(5, sum by (pod, container) (increase(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[1h])))
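OOM kills can also be alerted on rather than only charted. A minimal rule sketch, again assuming rules are loaded via `rule_files`; the names and severity are illustrative. It reuses the base query above, so it stays active while a container's most recent termination was an OOM kill:

```yaml
groups:
  - name: oom-alerts
    rules:
      - alert: ContainerOOMKilled
        # At least one container's last termination reason is OOMKilled
        expr: sum(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} container(s) last terminated with OOMKilled"
```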
Error Timeline: temporal evolution of the error rate (5m window).
Multi-series by status code:
100 * sum by (status) (rate(http_requests_total{status=~"5.."}[5m]))
  / ignoring(status) group_left sum(rate(http_requests_total[5m]))
Change time window: [5m] → [1m] (responsive) or [15m] (smoothed)
By service:
sum by (job, status) (rate(http_requests_total{status=~"5.."}[5m]))
Restart Trend: trend of container restarts over time.
Query:
sum(increase(kube_pod_container_status_restarts_total[5m]))
By pod:
topk(10, sum by (pod) (increase(kube_pod_container_status_restarts_total[5m])))
Detect critical pods (> 10 restarts in 1h):
sum by (pod) (increase(kube_pod_container_status_restarts_total[1h])) > 10
Error Types Distribution: breakdown of errors by type (5xx, 4xx, timeout, etc.).
Query for pie chart:
sum by (status) (rate(http_requests_total{status=~"[45].."}[5m])) * 100
Custom categories: create these labels in your app (a query-time alternative is sketched after this list):
status="500" → Server Errorstatus="503" → Service Unavailablestatus="504" → Gateway Timeoutstatus="429" → Rate LimitedDynamic table of pods with status and error reasons.
Pod Status Table: dynamic table of pods with status and error reasons.
Base query:
kube_pod_status_phase{phase!="Running"}
Add columns in panel:
- kube_pod_status_ready → Ready status
- kube_pod_container_status_last_terminated_reason → Reason
- kube_pod_container_status_restarts_total → Restart count

Event Log: table with recent events (last 2 hours).
Query:
kube_events{type="Warning"} or kube_events{type="Error"}
Only error events in last 30m:
increase(kube_events{type="Error"}[30m])
Sort by timestamp: In panel → Sort by → timestamp desc
Column format: Timestamp | Pod | Namespace | Reason | Message
| Theme | OK Color | Warning | Critical |
|---|---|---|---|
| GREEN | #33FF00 | #FFCC00 | #FF4444 |
| AMBER | #FFB000 | #FF8C00 | #FF4500 |
| BLUE | #00BFFF | #FFD700 | #FF1493 |
| Type | Width (grid cols) | Stat Card Height (grid rows) | Graph Height (grid rows) |
|---|---|---|---|
| Mobile | 6 (stack) | 6 | 10 |
| Tablet 10" | 12 | 8 | 12 |
| Tablet 12.9" | 12 | 8 | 14 |
| Desktop 1920x1080 | 24 | 6 | 10 |
If kube_events is missing, ensure kube-state-metrics is available and being scraped.

Useful alert and diagnostic queries:

rate(http_requests_total{status=~"5.."}[5m])
  > avg_over_time(rate(http_requests_total{status=~"5.."}[5m])[1h:5m]) * 2
Alert if error rate doubles compared to 1h average.
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{status=~"5.."}[5m]))) * 1000
Shows the p95 latency (in milliseconds) of failed requests only (useful for detecting timeouts).
sum by (pod) (increase(kube_pod_container_status_restarts_total[30m])) > 5
Identifies pods restarting more than 5 times in 30 minutes (a symptom of a serious issue).
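To have Prometheus evaluate the threshold expressions above on its own, they can be grouped into a single rule file. A minimal sketch, assuming rules are loaded via `rule_files`; the alert names, `for` durations, and severity labels are illustrative choices:

```yaml
groups:
  - name: error-monitoring-alerts
    rules:
      - alert: ErrorRateDoubled
        # 5xx rate is more than twice its 1h average
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            > avg_over_time(rate(http_requests_total{status=~"5.."}[5m])[1h:5m]) * 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "5xx error rate has doubled vs. its 1h average"

      - alert: PodRestartingFrequently
        # More than 5 restarts in the last 30 minutes
        expr: sum by (pod) (increase(kube_pod_container_status_restarts_total[30m])) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} restarted more than 5 times in 30m"
```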