A Kubernetes Deployment that will not roll out, gets stuck, or keeps creating failing pods is one of the most common issues teams face. At Tasrie IT Services, we debug deployment issues daily across client clusters. The good news is that deployments follow a predictable lifecycle, and once you understand it, troubleshooting becomes systematic.
This guide covers every deployment failure mode we have encountered, with the exact commands and fixes for each one.
Quick Diagnosis: Check Deployment Status
Start with these three commands to understand what is happening:
# Check deployment status
kubectl get deployment <deployment-name> -n <namespace>
# Check rollout status
kubectl rollout status deployment/<deployment-name> -n <namespace>
# Check deployment events and conditions
kubectl describe deployment <deployment-name> -n <namespace>
The kubectl rollout status command tells you whether the rollout is progressing, complete, or failed. If it hangs, the deployment is stuck.
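If you would rather not leave it hanging, you can bound the wait so the command exits with a non-zero status instead (the timeout value below is illustrative):
kubectl rollout status deployment/<deployment-name> -n <namespace> --timeout=120s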
The kubectl describe deployment output contains:
- Conditions — the `Available`, `Progressing`, and `ReplicaFailure` conditions
- Events — a chronological history of scaling actions
- Replicas — desired vs current vs ready vs available counts
Understanding Deployment Conditions
| Condition | Status | Meaning |
|---|---|---|
| Progressing | True (reason: ReplicaSetUpdated) | Rollout is actively creating new pods |
| Progressing | True (reason: NewReplicaSetAvailable) | Rollout completed successfully |
| Progressing | False (reason: ProgressDeadlineExceeded) | Rollout timed out |
| Available | True | Minimum required pods are running |
| Available | False | Not enough pods are ready |
| ReplicaFailure | True | Cannot create pods (quota, admission webhook) |
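A quick way to dump all three conditions at once, without reading the full describe output, is a jsonpath query along these lines:
kubectl get deployment <deployment-name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'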
Troubleshooting Failed Rollouts
Rollout Stuck: Pods in CrashLoopBackOff
The most common rollout failure. New pods are created but keep crashing, preventing the rollout from completing.
# Check the new ReplicaSet's pods
kubectl get replicasets -n <namespace> -l app=<app-label>
kubectl get pods -n <namespace> -l app=<app-label> | grep -v Running
# Check the failing pod's logs
kubectl logs <failing-pod> -n <namespace> --previous
# Check events on the failing pod
kubectl describe pod <failing-pod> -n <namespace>
Common causes:
- Application bug in the new version
- Missing or wrong environment variables
- Missing ConfigMap or Secret
- Wrong command or entrypoint in the container image
- Insufficient memory causing OOMKilled
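The OOMKilled case can be confirmed straight from the pod status rather than digging through events (assuming a single-container pod):
# Prints "OOMKilled" if the last restart was caused by running out of memory
kubectl get pod <failing-pod> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'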
Fix: Either fix the issue in the container image/config or roll back:
# Roll back to the previous version
kubectl rollout undo deployment/<deployment-name> -n <namespace>
# Roll back to a specific revision
kubectl rollout history deployment/<deployment-name> -n <namespace>
kubectl rollout undo deployment/<deployment-name> -n <namespace> --to-revision=<N>
For detailed CrashLoopBackOff troubleshooting, see our CrashLoopBackOff fix guide.
Rollout Stuck: ProgressDeadlineExceeded
By default, Kubernetes waits 600 seconds (10 minutes) for a deployment to make progress. If no new pods become ready within that time, the rollout is marked as failed.
# Check deployment conditions
kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Progressing")]}'
Common causes:
- New pods failing readiness probes
- New pods stuck in `Pending` (insufficient resources)
- Image pull taking too long
- Init containers timing out
Fix the deadline (if the application legitimately needs more time):
spec:
  progressDeadlineSeconds: 1200  # 20 minutes
Or investigate why pods are not becoming ready:
# Find the new ReplicaSet
kubectl get rs -n <namespace> -l app=<app-label> --sort-by='.metadata.creationTimestamp' | tail -1
# Check pods in the new ReplicaSet
kubectl get pods -n <namespace> -l pod-template-hash=<hash>
kubectl describe pod <pod-name> -n <namespace>
Rollout Stuck: Pods in Pending
New pods cannot be scheduled to any node.
kubectl describe pod <pending-pod> -n <namespace> | grep -A 10 "Events:"
Common causes and fixes:
| Event Message | Cause | Fix |
|---|---|---|
| Insufficient cpu | Not enough CPU on any node | Scale the cluster or reduce resource requests |
| Insufficient memory | Not enough memory on any node | Scale the cluster or reduce resource requests |
| node(s) had taint | Taint/toleration mismatch | Add tolerations or remove taints |
| node(s) didn't match Pod's node affinity | Node selector/affinity mismatch | Fix labels or affinity rules |
| persistentvolumeclaim not found | Missing PVC | Create the PVC |
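To confirm whether the cluster genuinely has no room, compare what each node has left against what the pending pod requests (both commands are standard kubectl; the custom-columns output is just one way to slice it):
# Requests already committed on each node
kubectl describe nodes | grep -A 8 "Allocated resources"
# Allocatable capacity per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory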
For more on scheduling issues, see our Kubernetes taints and tolerations guide.
Rollout Stuck: ImagePullBackOff
New pods cannot pull the container image.
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Events:" | grep -i "image\|pull"
Common causes:
- Typo in image name or tag
- Image tag does not exist in the registry
- Private registry without `imagePullSecrets`
- Docker Hub rate limiting
Quick check:
# Verify image exists (if using Docker Hub)
docker manifest inspect <image>:<tag>
# Check if imagePullSecrets are configured
kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.template.spec.imagePullSecrets}'
# Check if the secret exists and contains valid credentials
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
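If the secret is missing or its credentials are wrong, a typical fix is to recreate it and reference it from the pod template. The names below are placeholders; adapt them to your registry:
# Create (or recreate) the registry credential
kubectl create secret docker-registry <secret-name> \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>
Then reference it in the deployment:
spec:
  template:
    spec:
      imagePullSecrets:
        - name: <secret-name>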
Rollout Stuck: Readiness Probe Failures
New pods start but never become ready, so the deployment never progresses.
# Check for Unhealthy events
kubectl describe pod <pod-name> -n <namespace> | grep -i "unhealthy\|readiness"
# Test the health endpoint manually
kubectl exec -it <pod-name> -n <namespace> -- curl -s localhost:<port>/health
Common causes:
- Wrong probe path or port
- `initialDelaySeconds` too short for the application's startup time
- Application listening on a different port than configured
- Probe endpoint checking downstream dependencies that are unavailable
Fix:
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30  # Give the app time to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
For slow-starting applications, add a startup probe:
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
  # Gives the app up to 300 seconds (30 * 10) to start
Troubleshooting Deployment Scaling Issues
Deployment Not Scaling Up
# Check current replicas vs desired
kubectl get deployment <deployment-name> -n <namespace>
# Check if HPA is managing the deployment
kubectl get hpa -n <namespace>
# Check HPA status
kubectl describe hpa <hpa-name> -n <namespace>
Common causes:
- HPA target metric is below the scale-up threshold
- Metrics server not running
- Pod resource requests not set (HPA needs requests to calculate utilisation)
- Cluster out of capacity
# Check if metrics server is running
kubectl get pods -n kube-system | grep metrics-server
kubectl top pods -n <namespace> # If this fails, metrics server has issues
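If the problem is missing resource requests, set them on every container in the pod template so the HPA has something to calculate utilisation against (the values below are illustrative, not a recommendation):
spec:
  template:
    spec:
      containers:
        - name: <container-name>
          resources:
            requests:
              cpu: 250m
              memory: 256Mi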
Deployment Not Scaling Down
# Check PodDisruptionBudget
kubectl get pdb -n <namespace>
# Check if HPA minReplicas is preventing scale-down
kubectl get hpa <hpa-name> -n <namespace> -o jsonpath='{.spec.minReplicas}'
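Also note that autoscaling/v2 HPAs wait out a scale-down stabilization window (300 seconds by default) before removing replicas, so a deployment that appears stuck at a high replica count may simply be inside that window. The window is configurable on the HPA itself (a sketch; tune the value to your workload):
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # default; lower it for faster scale-down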
Troubleshooting Deployment Updates
Changes Not Taking Effect
You updated the deployment spec but nothing happened.
# Check if the deployment spec actually changed
kubectl get deployment <deployment-name> -n <namespace> -o yaml | grep -A 5 "image:"
# Check rollout history
kubectl rollout history deployment/<deployment-name> -n <namespace>
Common causes:
- You changed a field that does not trigger a rollout (like replicas)
- The deployment is paused
# Check if deployment is paused
kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.paused}'
# Resume if paused
kubectl rollout resume deployment/<deployment-name> -n <namespace>
Fields that trigger a rollout (new ReplicaSet):
- `spec.template.spec.containers[*].image`
- `spec.template.spec.containers[*].env`
- `spec.template.metadata.labels`
- `spec.template.metadata.annotations`
- Any change under `spec.template`
Fields that do NOT trigger a rollout:
- `spec.replicas`
- `spec.strategy`
- `spec.minReadySeconds`
- `spec.progressDeadlineSeconds`
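For example, changing the image via kubectl set image edits spec.template and therefore creates a new ReplicaSet, while kubectl scale only touches spec.replicas and reuses the existing one:
# Triggers a rollout (new ReplicaSet)
kubectl set image deployment/<deployment-name> <container-name>=<image>:<new-tag> -n <namespace>
# Does NOT trigger a rollout
kubectl scale deployment/<deployment-name> --replicas=5 -n <namespace>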
Rolling Update Causing Downtime
By default, Kubernetes performs a rolling update with maxSurge: 25% and maxUnavailable: 25%, which means up to a quarter of the old pods can be taken down before their replacements are ready. Those defaults, or other misconfigurations, can cause brief outages.
# Check the deployment strategy
kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.strategy}'
Prevent downtime during rollouts:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Create 1 extra pod at a time
      maxUnavailable: 0    # Never remove a pod until the new one is ready
  minReadySeconds: 10      # Wait 10s after a pod is ready before continuing
Also ensure:
- Readiness probes are configured and working
- `terminationGracePeriodSeconds` is long enough for graceful shutdown
- A `preStop` hook is defined to allow connections to drain (sketched below)
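A minimal sketch of the last two points, assuming a generic HTTP service; the 10-second sleep is an arbitrary drain window, not a recommendation:
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # must cover the preStop sleep plus shutdown time
      containers:
        - name: <container-name>
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]  # let load balancers drain before SIGTERM arrives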
For monitoring rollout progress, see our kubectl rollout status guide.
Troubleshooting Deployment Rollbacks
Rollback Not Working
# Check rollout history
kubectl rollout history deployment/<deployment-name> -n <namespace>
# Check if there is a previous revision to roll back to
kubectl rollout history deployment/<deployment-name> -n <namespace> --revision=1
Common issues:
- `revisionHistoryLimit` set to 0 (no previous ReplicaSets are kept)
- The previous revision has the same issues
Check revision history limit:
kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.revisionHistoryLimit}'
If it is 0 or very low, increase it:
spec:
  revisionHistoryLimit: 10  # Keep last 10 revisions
Rollback Caused the Same Problem
If rolling back does not fix the issue, the problem is likely not in the container image but in the environment:
- ConfigMap or Secret was changed independently of the deployment
- A dependent service (database, cache, API) is down
- A network policy was recently applied that blocks traffic
- PVC or storage backend is failing
# Check if ConfigMaps changed recently
kubectl get configmap -n <namespace> -o yaml | head -20
# Check dependent services
kubectl get pods -n <namespace>
kubectl get endpoints -n <namespace>
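Recent namespace events, sorted by time, often reveal what actually changed around the incident (standard kubectl; nothing cluster-specific assumed):
# Show the 20 most recent events in the namespace
kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -20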
Debugging Deployment YAML Issues
Validate Before Applying
# Dry-run to catch YAML errors before applying
kubectl apply -f deployment.yaml --dry-run=client
# Server-side dry-run (also validates admission webhooks)
kubectl apply -f deployment.yaml --dry-run=server
# Diff against current state
kubectl diff -f deployment.yaml
Common YAML Mistakes
- Label selector mismatch — `spec.selector.matchLabels` must match `spec.template.metadata.labels`:
# This will fail - labels don't match
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: my-app  # Mismatch!
- Port mismatch — `containerPort` in the pod spec must match the port the application listens on
- Resource requests exceeding limits — requests cannot be greater than limits
- Invalid characters in labels — labels may only contain alphanumeric characters, `-`, `_`, and `.`
Deployment Troubleshooting Cheat Sheet
# Full deployment status overview
kubectl get deploy,rs,pods -n <namespace> -l app=<app-label>
# Check why rollout is stuck
kubectl rollout status deployment/<name> -n <namespace>
kubectl describe deployment <name> -n <namespace>
# Check new vs old ReplicaSet
kubectl get rs -n <namespace> -l app=<app-label> --sort-by='.metadata.creationTimestamp'
# Check failing pods in new ReplicaSet
kubectl get pods -n <namespace> -l pod-template-hash=<new-rs-hash>
kubectl logs <failing-pod> -n <namespace> --previous
# Quick rollback
kubectl rollout undo deployment/<name> -n <namespace>
# Check rollout history
kubectl rollout history deployment/<name> -n <namespace>
# Force restart all pods (useful for picking up ConfigMap changes)
kubectl rollout restart deployment/<name> -n <namespace>
For more on restarting deployments, see our kubectl restart deployment guide.
For a broader Kubernetes troubleshooting methodology, see our comprehensive troubleshooting guide.
Need Help With Kubernetes Deployments?
Failed deployments in production cost money and erode user trust. Our engineers at Tasrie IT Services troubleshoot deployment issues daily and can help you build CI/CD pipelines and deployment strategies that minimise risk.
Our Kubernetes consulting services include:
- Deployment strategy design with rolling updates, canary releases, and blue-green deployments
- CI/CD pipeline setup with ArgoCD, Flux, or traditional pipelines
- Automated rollback and alerting that catches failed deployments before they impact users