🔧 Cloud Track - Advanced

Troubleshooting Runbook

Document a professional runbook for debugging common Kubernetes pod failures and production issues.

⏱️ 6-8 hours 🎯 Advanced

📋 Overview

Kubernetes fails in predictable ways. This project teaches you to create a systematic troubleshooting guide that any team member can follow during incidents.

🔨 Runbook Structure

Scenario 1: CrashLoopBackOff

Symptom: Pod restarts repeatedly; kubectl get pods shows status CrashLoopBackOff and a rising RESTARTS count

Investigation Commands:

kubectl describe pod <name>
kubectl logs <name> --previous

Common Causes:

  • Application crash on startup
  • Missing environment variables
  • Insufficient memory (OOMKilled)
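To tell these causes apart quickly, the last termination record of each container is usually the most useful field. The following is a minimal sketch, assuming the pod has been dumped with kubectl get pod <name> -o json and parsed into a dict. The function name crash_summary and the wording of the hints are illustrative, not part of any Kubernetes tooling.

```python
def crash_summary(pod: dict) -> list:
    """Return one diagnostic line per container that has terminated before."""
    lines = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated describes the previous run of the container,
        # the same run that `kubectl logs --previous` reads from.
        term = cs.get("lastState", {}).get("terminated")
        if not term:
            continue
        reason = term.get("reason", "Unknown")
        code = term.get("exitCode")
        if reason == "OOMKilled":
            hint = "memory limit exceeded; review resources.limits.memory"
        elif code not in (0, None):
            hint = "application exited with an error; read the previous logs"
        else:
            hint = "exited cleanly; check that the command is meant to keep running"
        restarts = cs.get("restartCount", 0)
        lines.append(
            f"{cs.get('name')}: reason={reason} exitCode={code} "
            f"restarts={restarts} -> {hint}"
        )
    return lines
```

A non-zero exit code with a reason other than OOMKilled usually points at the first two causes (a startup crash or missing configuration), which the previous container logs should confirm.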

Scenario 2: ImagePullBackOff

Symptom: Kubelet cannot pull the container image; pod status shows ErrImagePull or ImagePullBackOff

Investigation:

kubectl get events --sort-by=.metadata.creationTimestamp
kubectl get pods <name> -o jsonpath='{.status.containerStatuses[0].state.waiting.message}'

Solutions:

  • Verify that the image name and tag exist in the registry
  • Check imagePullSecrets
  • Validate registry credentials
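The jsonpath query above reads only the first container. As a hedged sketch, the same check can be extended to every container (init containers included) and paired with the image reference and the pull secrets the pod declares, so all three solutions can be checked from one report. It assumes the pod was exported with kubectl get pod <name> -o json; the function name image_pull_failures and the sample registry host in the usage example are hypothetical.

```python
PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}

def image_pull_failures(pod: dict) -> list:
    """List containers that are waiting because their image cannot be pulled."""
    spec = pod.get("spec", {})
    status = pod.get("status", {})
    containers = spec.get("initContainers", []) + spec.get("containers", [])
    images = {c["name"]: c.get("image") for c in containers}
    # Names of the secrets the pod references; an empty list means none are set.
    secrets = [s["name"] for s in spec.get("imagePullSecrets", [])]
    statuses = (status.get("initContainerStatuses", [])
                + status.get("containerStatuses", []))
    failures = []
    for cs in statuses:
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in PULL_REASONS:
            failures.append({
                "container": cs["name"],
                "image": images.get(cs["name"]),
                "reason": waiting["reason"],
                "message": waiting.get("message", ""),
                "pull_secrets": secrets,
            })
    return failures
```

An entry with an empty pull_secrets list for an image hosted on a private registry suggests starting with the imagePullSecrets check; a populated list shifts suspicion to the image name, the tag, or the credentials stored in the secret.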

Scenario 3: Pending Pods (Resource Constraints)

Symptom: Pod stuck in Pending state

Investigation:

kubectl describe pod <name> | grep -A 5 Events
kubectl top nodes

Common Causes:

  • Insufficient CPU/memory on nodes
  • Node selector mismatch
  • PVC not bound (storage issue)
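The scheduler records why a pod is unschedulable in the message of the pod's PodScheduled condition, which is the same text that appears in the FailedScheduling event. The sketch below maps that message onto the causes listed above. It assumes a pod dict parsed from kubectl get pod <name> -o json. The keyword list is an assumption, since the exact scheduler wording varies between Kubernetes versions, and the name pending_causes is illustrative.

```python
# Keyword -> cause pairs. The keywords are assumptions about scheduler wording
# and may need adjusting for a given Kubernetes version.
KEYWORDS = [
    ("insufficient cpu", "Insufficient CPU on nodes"),
    ("insufficient memory", "Insufficient memory on nodes"),
    ("selector", "Node selector mismatch"),
    ("persistentvolumeclaim", "PVC not bound (storage issue)"),
]

def pending_causes(pod: dict) -> list:
    """Map the PodScheduled condition message to likely causes."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            message = cond.get("message", "").lower()
            return [cause for keyword, cause in KEYWORDS if keyword in message]
    # No failed PodScheduled condition: the pod is scheduled or not yet evaluated.
    return []
```

An empty result for a pod that is still Pending means the message matched none of the keywords, in which case the raw Events output remains the source of truth.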

Scenario 4: Service Not Reachable

Symptom: Cannot access service from other pods

Debugging Steps:

# Verify endpoints
kubectl get endpoints <service-name>
# Test from a temporary debug pod (removed automatically on exit)
kubectl run debug --image=busybox --rm -it --restart=Never -- wget -qO- http://<service>
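An empty endpoints list is the most common reason a service is unreachable, so it helps to interpret that output before testing connectivity. This is a minimal sketch, assuming the object comes from kubectl get endpoints <service-name> -o json; the function name endpoint_summary and the verdict wording are illustrative.

```python
def endpoint_summary(endpoints: dict) -> dict:
    """Count ready and not-ready backend addresses behind a service."""
    ready, not_ready = [], []
    # `subsets` is absent or null when no pod matches the service selector.
    for subset in endpoints.get("subsets") or []:
        ready += [a["ip"] for a in subset.get("addresses") or []]
        not_ready += [a["ip"] for a in subset.get("notReadyAddresses") or []]
    if ready:
        verdict = "ready backends exist; test DNS and the port from a debug pod"
    elif not_ready:
        verdict = "pods match the selector but are not Ready; check readiness probes"
    else:
        verdict = "no pods match the selector; compare it with the pod labels"
    return {"ready": ready, "not_ready": not_ready, "verdict": verdict}
```

Only the first verdict makes the wget test from the debug pod worthwhile; the other two point at the selector or at pod readiness rather than at the network path.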

📦 Deliverables