How to Fix CrashLoopBackOff in Kubernetes: The Complete Debugging Playbook (2026)

Engineering Team 2026-04-26

CrashLoopBackOff is the error every Kubernetes team encounters eventually. It appears simple — a container keeps crashing — but the root cause can be anything from a missing environment variable to a subtle memory leak that only triggers under production load. At Tasrie IT Services, we have built a debugging playbook over hundreds of CrashLoopBackOff incidents that eliminates guesswork and gets to the root cause fast.

This guide is that playbook. It covers every scenario we have seen, including the tricky ones that do not show up in basic guides: init container loops, sidecar crashes, intermittent CrashLoopBackOff, and race conditions during startup.

The CrashLoopBackOff Decision Tree

Follow this flowchart to find the root cause quickly:

Pod in CrashLoopBackOff

├── kubectl describe pod → Check "Last State" exit code

├── Exit Code 137?
│   ├── Yes → OOMKilled. Check memory limits.
│   │         → See Fix: Memory Issues below
│   └── No ↓

├── Exit Code 1?
│   ├── Yes → Application error
│   │         → kubectl logs --previous
│   │         → See Fix: Application Errors below
│   └── No ↓

├── Exit Code 127?
│   ├── Yes → Command not found
│   │         → Check entrypoint/command in pod spec
│   │         → See Fix: Image/Command Issues below
│   └── No ↓

├── Exit Code 126?
│   ├── Yes → Permission denied on entrypoint
│   │         → Check file permissions and security context
│   └── No ↓

├── Exit Code 0?
│   ├── Yes → Container exited successfully
│   │         → Should this be a Job instead of a Deployment?
│   │         → Check whether the entrypoint backgrounds the main process instead of exec-ing it
│   └── No ↓

├── No exit code / "CreateContainerConfigError"?
│   ├── Yes → Missing ConfigMap, Secret, or volume
│   │         → kubectl describe pod → check Events
│   └── No ↓

└── Events show "Unhealthy" / "probe failed"?
    └── Yes → Liveness probe killing the container
              → See Fix: Probe Failures below

The Four-Step Debugging Process

Step 1: Identify

# Find the crashing pod
kubectl get pods -n <namespace> | grep CrashLoopBackOff

# Get the exit code, restart count, and timing
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "State:\|Last State:"

Key information to capture:

  • Exit code — tells you how the container died
  • Restart count — how many times it has crashed
  • Container start/finish times — how long the container runs before crashing (seconds = startup issue, hours = memory leak or intermittent failure)
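
For a container killed by the OOM killer, for example, the relevant block of the describe output looks roughly like this (timestamps illustrative):

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Sat, 25 Apr 2026 10:00:04 +0000
  Finished:     Sat, 25 Apr 2026 10:02:41 +0000
Restart Count:  12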

Step 2: Read Logs

# Previous container logs (the one that crashed)
kubectl logs <pod-name> -n <namespace> --previous

# If multi-container pod, specify the container
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

# Tail multiple pods at once with stern
stern <deployment-name> -n <namespace> --since 10m

Step 3: Check Events

# Pod events (scheduling, image pull, probe failures)
kubectl describe pod <pod-name> -n <namespace> | tail -30

# Namespace events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

Step 4: Apply the Fix

Based on the exit code and logs, apply the appropriate fix from the sections below.

Fix: Application Errors (Exit Code 1)

The container started but the application crashed due to an unhandled error.

Missing Environment Variables

# Check configured env vars
kubectl set env deployment/<deployment-name> -n <namespace> --list

# Check if env vars reference existing ConfigMaps/Secrets
kubectl get configmap -n <namespace>
kubectl get secret -n <namespace>

Fix: Add the missing variables:

kubectl set env deployment/<deployment-name> -n <namespace> KEY=value

Or create the missing ConfigMap:

kubectl create configmap app-config -n <namespace> --from-literal=DATABASE_URL=postgres://...
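
If the application expects several values from the same source, it is often cleaner to reference the ConfigMap wholesale from the deployment spec. A minimal sketch, assuming the app-config ConfigMap above (container name and image are placeholders):

containers:
  - name: app
    image: <image>:<tag>
    envFrom:
      - configMapRef:
          name: app-config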

Database or Service Unreachable

The application tries to connect to a database or external service during startup and crashes when it fails.

# Test connectivity from the pod's namespace
kubectl run nettest --image=busybox --rm -it --restart=Never -n <namespace> -- nc -zv <host> <port>

Fix options:

  • Fix the service endpoint (DNS name, port, credentials)
  • Add retry logic with exponential backoff to the application startup
  • Use an init container to wait for the dependency:
initContainers:
  - name: wait-for-db
    image: busybox
    command: ['sh', '-c', 'until nc -z postgres-service 5432; do echo waiting for db; sleep 2; done']

File or Path Not Found

# Check volume mounts
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].volumeMounts}' | jq .

# Verify ConfigMap data
kubectl get configmap <name> -n <namespace> -o yaml

For ConfigMap debugging, see our Kubernetes ConfigMap guide.
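
If the logs point to a missing config file rather than a missing key, verify the ConfigMap is actually mounted where the application expects it. A minimal mount looks like this (names and path hypothetical):

containers:
  - name: app
    volumeMounts:
      - name: config
        mountPath: /etc/app    # each ConfigMap key becomes a file under /etc/app
volumes:
  - name: config
    configMap:
      name: app-config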

Fix: Memory Issues (Exit Code 137 / OOMKilled)

# Confirm OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep OOMKilled

# Check limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].resources}'

Quick fixes by language:

Language | Issue                   | Fix
Java     | Heap > container limit  | Set -XX:MaxRAMPercentage=75.0
Node.js  | V8 heap > limit         | Set NODE_OPTIONS="--max-old-space-size=768"
Python   | Large data in memory    | Use streaming, generators, or reduce batch sizes
Go       | Goroutine leak          | Profile with pprof, find blocked goroutines
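
Whichever language-level fix applies, make sure the container's memory limit itself is realistic for the workload. A typical adjustment in the pod spec (values illustrative; size them to observed usage):

resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"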

For a complete OOMKilled debugging guide, see our dedicated OOMKilled fix guide.

Fix: Image and Command Issues (Exit Codes 126, 127)

Exit Code 127: Command Not Found

# Check what command is configured
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].command}' | jq .
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].args}' | jq .

Common mistakes:

  • Using bash on Alpine images (use /bin/sh instead; see the sketch after this list)
  • Overriding command in the pod spec, which replaces the Dockerfile ENTRYPOINT
  • Multi-stage Docker build not copying the binary to the final stage
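
For the Alpine case, overriding the command to use /bin/sh instead of bash looks like this (image and binary names are placeholders):

containers:
  - name: app
    image: <image>:<tag>
    command: ["/bin/sh", "-c"]
    args: ["exec /app/server"]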

Debug:

kubectl run debug --image=<image>:<tag> --rm -it --restart=Never -- /bin/sh
# Inside: which <binary>, find / -name <binary> 2>/dev/null

Exit Code 126: Permission Denied

# Check security context
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.securityContext}' | jq .
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].securityContext}' | jq .

Fix: Ensure entrypoint has execute permissions in the Dockerfile:

RUN chmod +x /app/entrypoint.sh
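
If the file permissions look right in the image but execution is still denied, check whether a securityContext runs the container as a user that cannot execute the file (UID/GID values illustrative):

securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  # /app/entrypoint.sh must be readable and executable by this UID/GID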

Fix: Probe Failures Causing CrashLoopBackOff

A misconfigured liveness probe kills the container repeatedly.

# Check for probe failure events
kubectl describe pod <pod-name> -n <namespace> | grep -i "unhealthy\|liveness\|readiness"

# Check probe config
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].livenessProbe}' | jq .

Fix:

# For slow-starting apps, add a startup probe
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

# Keep liveness probe simple
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3

Key rule: Liveness probes should check if the process is alive, not if dependencies are available. Do not call /health endpoints that check database connectivity in liveness probes.

Fix: Init Container CrashLoopBackOff

When you see Init:CrashLoopBackOff, the init container is crashing, not the main application container.

# Check which init container is failing
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 "Init Containers"

# Get init container logs
kubectl logs <pod-name> -n <namespace> -c <init-container-name> --previous

Common init container failures:

  • Waiting for a service that does not exist
  • Database migration script failing
  • Incorrect command or missing binary
  • Network policy preventing outbound access

Debug tip: Run the init container image manually:

kubectl run init-debug --image=<init-image> --rm -it --restart=Never -n <namespace> -- /bin/sh

Fix: Sidecar Container CrashLoopBackOff

In multi-container pods, a sidecar container crash can cause the entire pod to restart.

# Identify which container is crashing
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Container ID"

# Check each container's status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses}' | jq '.[] | {name, state, restartCount, lastState}'

Common sidecar issues:

  • Envoy/Istio sidecar — certificates expired, upstream cluster not found
  • Log forwarder — output destination unreachable
  • Config reloader — watching a ConfigMap that does not exist

Fix: Intermittent CrashLoopBackOff

This is the most difficult variant to debug: the container runs for hours or days, then crashes. The cause is almost always a memory leak, resource exhaustion, or an external dependency failure.

Diagnosis approach:

# Check how long the container runs before crashing
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State"
# Note the Started and Finished timestamps

# Monitor memory over time
watch -n 30 kubectl top pod <pod-name> -n <namespace>

If the container runs for hours before crashing: likely a memory leak or growing connection pool. See our OOMKilled guide for memory profiling.

If the crashes correlate with traffic spikes: the resource limits are likely too low for peak load. Check live usage against the configured limits:

# Check live CPU/memory usage
kubectl top pod <pod-name> -n <namespace>

Note that kubectl top shows point-in-time usage, not throttling.
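
Throttling itself shows up in cAdvisor metrics. If Prometheus scrapes them, a query along these lines gives the fraction of CPU periods that were throttled (label matcher is a placeholder):

sum(rate(container_cpu_cfs_throttled_periods_total{pod="<pod-name>"}[5m]))
/
sum(rate(container_cpu_cfs_periods_total{pod="<pod-name>"}[5m]))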

If the crashes correlate with time of day: likely a scheduled job or cron task consuming resources.

Fix: Race Condition During Startup

Some CrashLoopBackOff issues are caused by timing — the application starts before its dependencies are ready.

Fix with init containers:

initContainers:
  - name: wait-for-redis
    image: busybox
    command: ['sh', '-c', 'until nc -z redis-service 6379; do echo waiting; sleep 2; done']
  - name: wait-for-postgres
    image: busybox
    command: ['sh', '-c', 'until nc -z postgres-service 5432; do echo waiting; sleep 2; done']

Fix with retry logic in the application: This is the preferred approach for production applications. Implement exponential backoff for database connections, API calls, and message queue connections.
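
If changing the application code is not immediately feasible, a stopgap is a small entrypoint wrapper that waits for the dependency with capped exponential backoff. A sketch, assuming a postgres-service dependency and an /app/server binary:

#!/bin/sh
# Wait for the database with exponential backoff, then hand off to the app
delay=1
until nc -z postgres-service 5432; do
  echo "database not ready, retrying in ${delay}s"
  sleep "$delay"
  delay=$((delay * 2))
  [ "$delay" -gt 60 ] && delay=60
done
exec /app/server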

Automated Prevention

Prometheus Alerts

- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.pod }} in {{ $labels.namespace }} has restarted {{ $value }} times in 15 minutes"

CI/CD Pre-deployment Checks

Before deploying to production:

  1. Run the container locally with docker run <image> to verify it starts (see the sketch after this list)
  2. Run kubectl apply --dry-run=server — catches missing ConfigMaps, Secrets, and admission webhook rejections
  3. Deploy to staging first — with the same resource limits as production
  4. Set up automated rollback — roll back if restart count increases after deployment
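
Wired into a pipeline, steps 1, 2, and 4 might look like this (image, manifest path, and deployment name are placeholders):

# 1. Verify the image starts at all
docker run --rm <image>:<tag>

# 2. Server-side dry run catches missing ConfigMaps, Secrets, and webhook rejections
kubectl apply --dry-run=server -f k8s/

# 4. Gate the rollout and roll back automatically if it never becomes healthy
kubectl rollout status deployment/<deployment-name> -n <namespace> --timeout=120s \
  || kubectl rollout undo deployment/<deployment-name> -n <namespace>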

For broader Kubernetes troubleshooting methodology, see our production troubleshooting guide.


Fix CrashLoopBackOff Permanently

CrashLoopBackOff is a symptom, not a root cause. Fixing it permanently requires understanding why the container crashes and implementing the right combination of resource sizing, health checks, and deployment safeguards. Our engineers at Tasrie IT Services do this daily.

Our Kubernetes consulting services include:

  • Production incident resolution for immediate CrashLoopBackOff debugging
  • Resource right-sizing with VPA and data-driven recommendations
  • CI/CD hardening with pre-deployment validation, staging gates, and automated rollbacks
  • Monitoring setup with Prometheus alerts for restart counts, exit codes, and resource usage

Get Kubernetes expert support →
