RetroDash // ERRORS

GUIDE: Error monitoring for HTTP failures, pod crashes, and OOM kills

Requirements

For error monitoring you need:

  - Prometheus configured as the Grafana datasource
  - kube-state-metrics, for pod and container state metrics (kube_pod_*)
  - node_exporter, for host/system metrics (selected via the $job_node variable)
  - An application or ingress exporting http_requests_total with a status label
  - An events exporter providing kube_events, used by the Event Log panel

Compatibility Notes

$job_node variable

Panels that reference node-level data (CPU, memory used in correlation queries) rely on the $job_node dashboard variable. The default value is node-exporter, which matches kube-prometheus-stack. If your Prometheus uses a different job label for node_exporter, update it in Dashboard Settings → Variables.
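As an illustration, a node-level query of the kind those correlation panels use filters on the variable like this (the metric below is a stand-in, not the exact panel query):

# stand-in node_exporter metric; the dashboard's panels use their own node-level queries
node_memory_MemAvailable_bytes{job="$job_node"}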

The Requirements section lists node_exporter as providing host/system metrics. Older revisions of this guide referred to "system logs"; treat any such reference as host/system metrics.

Panel Layout

┌────────────────────────────────────────┐
│    ERRORS — FAILURES & INCIDENTS       │
├────────────────────────────────────────┤
│  Error Rate  │  Failed Pods │ CrashLoop│
│  (%)         │  Count       │ Count    │
├────────────────────────────────────────┤
│  Error Timeline [12 cols]              │
│  (evolution of error rate)             │
├────────────────────────────────────────┤
│  OOMKilled Count │ Restart Trend       │
│  (gauge)         │ (time series)       │
├────────────────────────────────────────┤
│  Error Types Distribution [6 cols]     │
│  Pod Status Table [6 cols]             │
├────────────────────────────────────────┤
│  Event Log [12 cols]                   │
│  (recent errors by pod/container)      │
└────────────────────────────────────────┘

Panel Customization

Error Rate %

Percentage of requests that return an error (5xx status codes).

100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))

To include 4xx (client errors):

100 * sum(rate(http_requests_total{status=~"[45].."}[5m]))
/ sum(rate(http_requests_total[5m]))

Only timeout errors (504):

100 * sum(rate(http_requests_total{status="504"}[5m]))
/ sum(rate(http_requests_total[5m]))
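To scope the error rate to a single service, filter both numerator and denominator by the job label (the value is a placeholder):

# "my-service" is a placeholder; use the job value your instrumentation exposes
100 * sum(rate(http_requests_total{job="my-service", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="my-service"}[5m]))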

Failed Pods Count

Number of pods in Failed or Unknown state.

sum(kube_pod_status_phase{phase=~"Failed|Unknown"})

Filter by namespace, excluding system namespaces (a sketch for ignoring Job-owned pods follows the query):

sum(kube_pod_status_phase{phase="Failed", namespace!~"default|kube-.*"})

By namespace (breakdown):

sum by (namespace) (kube_pod_status_phase{phase="Failed"})

CrashLoop Count

Pods stuck in restart loops (CrashLoopBackOff).

sum(kube_pod_container_status_last_terminated_reason{reason="Error"})

Or by container state:

count(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"})

By specific pod:

count by (pod) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"})

Alert threshold: Configure alert on > 0 pods
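A minimal alert expression for that threshold, reusing the waiting-reason metric above:

count(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) > 0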

OOMKilled Count

Containers terminated due to Out Of Memory.

sum(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"})

Total configured memory limits (GiB), to correlate with OOM kills:

sum(kube_pod_container_resource_limits_memory_bytes) / 1024 / 1024 / 1024
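Note: kube-state-metrics v2.x removed the dedicated _memory_bytes metric. If you run v2, an equivalent sketch uses the generic limits metric (label values assume the default v2 naming):

# kube-state-metrics v2.x; v1.x uses kube_pod_container_resource_limits_memory_bytes
sum(kube_pod_container_resource_limits{resource="memory"}) / 1024 / 1024 / 1024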

Containers most affected by OOM:

topk(5, sum by (pod, container) (increase(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[1h])))
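Because kube_pod_container_status_last_terminated_reason is a 0/1 gauge, increase() over it only approximates kill counts. A sketch that instead counts restarts for containers whose last termination was an OOM kill (label names assume the default kube-state-metrics labels):

# restarts attributed to containers whose last termination reason was OOMKilled
topk(5,
  sum by (namespace, pod, container) (
    increase(kube_pod_container_status_restarts_total[1h])
  )
  and on (namespace, pod, container)
  kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
)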

Error Timeline

Temporal evolution of error rate (5m window).

Multi-series by status code:

sum by (status) (rate(http_requests_total{status=~"5.."}[5m])) * 100

Change the time window: swap [5m] for [1m] (more responsive) or [15m] (smoother)

By service:

sum by (job, status) (rate(http_requests_total{status=~"5.."}[5m]))
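To plot the timeline as a true percentage per status code rather than a scaled rate, one option divides by the total request rate (the ignoring/group_left matching assumes the denominator is aggregated away from the status label, as here):

100 * sum by (status) (rate(http_requests_total{status=~"5.."}[5m]))
  / ignoring (status) group_left
    sum(rate(http_requests_total[5m]))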

Restart Trend

Trend of container restarts.

sum(increase(kube_pod_container_status_restarts_total[5m]))

By pod:

topk(10, sum by (pod) (increase(kube_pod_container_status_restarts_total[5m])))

Detect critical pods (> 10 restarts in 1h):

sum by (pod) (increase(kube_pod_container_status_restarts_total[1h])) > 10

Error Types Distribution (Pie)

Breakdown of errors by type (5xx, 4xx, timeout, etc.).

Query for pie chart:

sum by (status) (rate(http_requests_total{status=~"[45].."}[5m])) * 100

Custom categories: expose an additional label (for example error_type) from your application and group by it, as sketched below.
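A sketch assuming a hypothetical error_type label exported by your application on http_requests_total:

# error_type is a hypothetical label; substitute whatever your app actually exports
sum by (error_type) (rate(http_requests_total{error_type!=""}[5m]))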

Pod Status Table

Dynamic table of pods with status and error reasons.

kube_pod_status_phase{phase!="Running"}

Add columns in the panel (Transform → Organize fields): Pod | Namespace | Phase
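To also surface an error reason column, add the waiting-reason metric used earlier as a second query; it reports per-container reasons such as CrashLoopBackOff or ImagePullBackOff:

kube_pod_container_status_waiting_reason == 1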

Event Log

Table with recent events (last 2 hours).

kube_events{type="Warning"} or kube_events{type="Error"}

Only error events in last 30m:

increase(kube_events{type="Error"}[30m])

Sort by timestamp: In panel → Sort by → timestamp desc

Column format: Timestamp | Pod | Namespace | Reason | Message

Change Color Theme

Theme   OK color   Warning   Critical
GREEN   #33FF00    #FFCC00   #FF4444
AMBER   #FFB000    #FF8C00   #FF4500
BLUE    #00BFFF    #FFD700   #FF1493

Adapt to Your Resolution

Type                 Width       Height (Stat Cards)   Height (Graphs)
Mobile               6 (stack)   6                     10
Tablet 10"           12          8                     12
Tablet 12.9"         12          8                     14
Desktop 1920x1080    24          6                     10

Import in Grafana

  1. Download or copy the JSON for the ERRORS dashboard
  2. Dashboards → Import
  3. Select your Prometheus datasource
  4. Import and verify that all metrics resolve
  5. If kube_events is missing, make sure an events exporter is being scraped (kube-state-metrics alone does not expose event metrics)
  6. Save and pin to the homepage

Advanced Tips

Alert on error rate degradation

rate(http_requests_total{status=~"5.."}[5m]) >
  avg_over_time(rate(http_requests_total{status=~"5.."}[5m])[1h:5m]) * 2

Alert if error rate doubles compared to 1h average.
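To avoid firing on negligible traffic, a sketch adds a minimum-request-rate guard; the sums aggregate across series for simplicity, and the 1 req/s floor is an arbitrary example value:

# the 1 req/s traffic floor is an arbitrary example; tune it to your workload
(
  sum(rate(http_requests_total{status=~"5.."}[5m])) >
    sum(avg_over_time(rate(http_requests_total{status=~"5.."}[5m])[1h:5m])) * 2
)
and
  sum(rate(http_requests_total[5m])) > 1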

Correlate errors with latency

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{status=~"5.."}[5m]))) * 1000

Shows the p95 latency (in ms) of failed requests only (useful for detecting timeouts).

Pods with multiple recent restarts

sum by (pod) (increase(kube_pod_container_status_restarts_total[30m])) > 5

Identifies pods restarting more than 5 times in 30 minutes (a symptom of a serious issue).