🔧 Cloud Track - Advanced

Troubleshooting Runbook

Document a professional runbook for debugging common Kubernetes pod failures and production issues.

⏱️ 6-8 hours 🎯 Advanced

📋 Overview

Kubernetes fails in predictable ways. This project teaches you to create a systematic troubleshooting guide that any team member can follow during incidents.

🔨 Runbook Structure

Scenario 1: CrashLoopBackOff

Symptom: Pod continuously restarting

Investigation Commands:

kubectl describe pod <name>
kubectl logs <name> --previous

Common Causes:

Application crash on startup
Missing environment variables
Memory limit too low for the workload (OOMKilled)
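To tell the causes above apart, it can help to read the container's last termination record directly instead of scanning the full describe output. The following is a minimal sketch, assuming the crashing container is the first one in the pod (index 0). The function names `last_termination` and `classify_termination` are hypothetical helpers, not part of kubectl.

```shell
#!/bin/sh
# Print "<reason> <exitCode>" for the previous run of the first container.
# Assumption: the crashing container is at index 0.
last_termination() {
  pod="$1"
  ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Map a reason and exit code to a likely cause. The mapping is a rough
# heuristic for triage, not an exhaustive diagnosis.
classify_termination() {
  reason="$1"
  code="$2"
  if [ "$reason" = "OOMKilled" ]; then
    echo "memory limit exceeded"
  elif [ "$code" = "137" ]; then
    echo "killed with SIGKILL"
  else
    echo "application exited on its own"
  fi
}
```

A possible use is `classify_termination $(last_termination <name>)`. When the result is "application exited on its own", the previous container's logs and the pod's environment variables are the next places to look.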

Scenario 2: ImagePullBackOff

Symptom: Cannot pull container image

Investigation:

kubectl get events --sort-by=.metadata.creationTimestamp
kubectl get pods <name> -o jsonpath='{.status.containerStatuses[0].state.waiting.message}'

Solutions:

Verify image name/tag exists
Check imagePullSecrets
Validate registry credentials
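The checks above can be sketched as a few small helpers. This is an illustration under assumptions: the function names are hypothetical, the pull secret is assumed to be of type kubernetes.io/dockerconfigjson, and `base64 -d` is assumed to be available (some systems use a different flag). `image_registry` follows the common convention that an image without a registry host defaults to docker.io.

```shell
#!/bin/sh
# List the names of the pull secrets a pod references.
pull_secret_names() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.spec.imagePullSecrets[*].name}'
}

# Decode a pull secret to see which registry hosts it covers.
# The output contains credentials, so avoid pasting it into tickets or chat.
show_pull_secret() {
  kubectl get secret "$1" -n "${2:-default}" \
    -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
}

# Guess the registry host from an image reference.
# A first path component containing "." or ":" (or "localhost") is treated
# as a registry host; anything else is assumed to live on docker.io.
image_registry() {
  first="${1%%/*}"
  case "$1" in
    */*) ;;
    *) echo "docker.io"; return ;;
  esac
  case "$first" in
    *.*|*:*|localhost) echo "$first" ;;
    *) echo "docker.io" ;;
  esac
}
```

Comparing the host printed by `image_registry` with the hosts in the decoded secret shows quickly whether the pod references a secret for the wrong registry.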

Scenario 3: Pending Pods (Resource Constraints)

Symptom: Pod stuck in Pending state

Investigation:

kubectl describe pod <name> | grep -A 5 Events
kubectl top nodes

Common Causes:

Insufficient CPU/memory on nodes
Node selector mismatch
PVC not bound (storage issue)
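The scheduler usually states which of these causes applies in its FailedScheduling event. The sketch below extracts that message and sorts it into one of the three buckets above. The function names are hypothetical, and the matched phrases are assumptions, since the exact wording of scheduler messages varies between Kubernetes versions.

```shell
#!/bin/sh
# Print the scheduler's FailedScheduling messages for one pod.
pending_reason() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling" \
    -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'
}

# Sort a scheduler message into a cause bucket.
# The patterns are illustrative and may need adjusting for your cluster version.
classify_pending() {
  case "$1" in
    *Insufficient*) echo "resources" ;;
    *"node affinity"*|*"node selector"*) echo "node-selector" ;;
    *PersistentVolumeClaim*) echo "storage" ;;
    *) echo "unknown" ;;
  esac
}
```

Depending on the bucket, a sensible follow-up is `kubectl describe node <node>` (the "Allocated resources" section shows requests, which the scheduler uses, rather than live usage), `kubectl get nodes --show-labels`, or `kubectl get pvc`.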

Scenario 4: Service Not Reachable

Symptom: Cannot access service from other pods

Debugging Steps:

# Verify endpoints
kubectl get endpoints <service-name>
# Test from debug pod
kubectl run debug --image=busybox --rm -it -- wget -qO- http://<service>
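An empty endpoints list often means the Service selector matches no pods. The sketch below lists the pods a Service selects, so the labels can be compared. It assumes a kubectl version that prints `.spec.selector` as JSON, and the function names are hypothetical.

```shell
#!/bin/sh
# Convert a JSON selector such as {"app":"web","tier":"frontend"}
# into a label query such as app=web,tier=frontend.
selector_to_query() {
  printf '%s' "$1" | tr -d '{}" ' | tr ':' '='
}

# List the pods matched by a Service's selector.
pods_behind_service() {
  svc="$1"
  ns="${2:-default}"
  selector=$(kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}')
  query=$(selector_to_query "$selector")
  if [ -z "$query" ]; then
    echo "service $svc has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$ns" -l "$query" -o wide
}
```

If `pods_behind_service <service-name>` returns no pods, the selector and the pod labels disagree. If it returns pods that are not Ready, the Service is fine and the readiness probe is the next thing to check.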

📦 Deliverables

✓ Runbook covering 5+ common failure scenarios
✓ Step-by-step investigation commands for each
✓ Decision tree or flowchart for triage
✓ Formatted as PDF or Markdown (team-ready)
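One way to start the triage decision tree is to key it on the STATUS column of `kubectl get pods`. The sketch below maps each status to the matching scenario in this brief. The function names are hypothetical, and the status list is a starting point rather than a complete set.

```shell
#!/bin/sh
# Map a pod status string to the runbook scenario to follow.
triage() {
  case "$1" in
    CrashLoopBackOff|Error|OOMKilled) echo "Scenario 1: CrashLoopBackOff" ;;
    ImagePullBackOff|ErrImagePull) echo "Scenario 2: ImagePullBackOff" ;;
    Pending) echo "Scenario 3: Pending Pods" ;;
    Running) echo "Scenario 4 if the service is unreachable" ;;
    *) echo "not covered: add a scenario" ;;
  esac
}

# Print "<pod><TAB><scenario>" for every pod in a namespace.
# Assumes the default column layout: NAME READY STATUS RESTARTS AGE.
triage_namespace() {
  kubectl get pods -n "${1:-default}" --no-headers |
    while read -r name _ready status _rest; do
      printf '%s\t%s\n' "$name" "$(triage "$status")"
    done
}
```

Statuses that fall through to the last branch are good candidates for the fifth and later scenarios in the runbook.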