KUBERNETES

How to Troubleshoot Kubernetes Deployment: We Debug Deployments Daily (2026 Guide)

Engineering Team 2026-04-26

A Kubernetes Deployment that will not roll out, gets stuck, or keeps creating failing pods is one of the most common issues teams face. At Tasrie IT Services, we debug deployment issues daily across client clusters. The good news is that deployments follow a predictable lifecycle, and once you understand it, troubleshooting becomes systematic.

This guide covers every deployment failure mode we have encountered, with the exact commands and fixes for each one.

Quick Diagnosis: Check Deployment Status

Start with these three commands to understand what is happening:

# Check deployment status
kubectl get deployment <deployment-name> -n <namespace>

# Check rollout status
kubectl rollout status deployment/<deployment-name> -n <namespace>

# Check deployment events and conditions
kubectl describe deployment <deployment-name> -n <namespace>

The kubectl rollout status command tells you whether the rollout is progressing, complete, or failed. If it hangs, the deployment is stuck.

The kubectl describe deployment output contains:

  • Conditions — Available, Progressing, and ReplicaFailure
  • Events — chronological history of scaling actions
  • Replicas — desired vs current vs ready vs available counts

Understanding Deployment Conditions

Condition        Status                                      Meaning
Progressing      True                                        Rollout is actively creating new pods
Progressing      True (reason: NewReplicaSetAvailable)       Rollout completed successfully
Progressing      False (reason: ProgressDeadlineExceeded)    Rollout timed out
Available        True                                        Minimum required pods are running
Available        False                                       Not enough pods are ready
ReplicaFailure   True                                        Cannot create pods (quota, admission webhook)
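These conditions can also be read and classified programmatically. A quick sketch using a hand-written sample of the status.conditions payload — the field names (type, status, reason) match what the API returns, but the values here are made up for the example:

```shell
# On a real cluster you would fetch the payload with:
#   kubectl get deployment <name> -n <namespace> -o jsonpath='{.status.conditions}'
conditions='[{"type":"Available","status":"False"},
             {"type":"Progressing","status":"False","reason":"ProgressDeadlineExceeded"}]'

# Pull out the Progressing condition's reason to classify the rollout state
reason=$(echo "$conditions" | python3 -c '
import json, sys
conds = json.load(sys.stdin)
prog = next(c for c in conds if c["type"] == "Progressing")
print(prog.get("reason", prog["status"]))
')
echo "$reason"
```

A reason of ProgressDeadlineExceeded here maps directly to the "Rollout timed out" row in the table above.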

Troubleshooting Failed Rollouts

Rollout Stuck: Pods in CrashLoopBackOff

The most common rollout failure. New pods are created but keep crashing, preventing the rollout from completing.

# Check the new ReplicaSet's pods
kubectl get replicasets -n <namespace> -l app=<app-label>
kubectl get pods -n <namespace> -l app=<app-label> | grep -v Running

# Check the failing pod's logs
kubectl logs <failing-pod> -n <namespace> --previous

# Check events on the failing pod
kubectl describe pod <failing-pod> -n <namespace>

Common causes:

  • Application bug in the new version
  • Missing or wrong environment variables
  • Missing ConfigMap or Secret
  • Wrong command or entrypoint in the container image
  • Insufficient memory causing OOMKilled
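The OOMKilled case in particular can be confirmed from the previous container state, which lives under status.containerStatuses[*].lastState.terminated. A sketch using a hand-written sample payload — field names match the API, values are illustrative:

```shell
# On a real cluster:
#   kubectl get pod <pod> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
status='{"containerStatuses":[{"name":"app","lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}}]}'

verdict=$(echo "$status" | python3 -c '
import json, sys
cs = json.load(sys.stdin)["containerStatuses"][0]
term = cs["lastState"]["terminated"]
print(cs["name"] + ": " + term["reason"] + " (exit " + str(term["exitCode"]) + ")")
')
echo "$verdict"
```

Exit code 137 (128 + SIGKILL) alongside reason OOMKilled confirms the container was killed for exceeding its memory limit.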

Fix: Either fix the issue in the container image/config or roll back:

# Roll back to the previous version
kubectl rollout undo deployment/<deployment-name> -n <namespace>

# Roll back to a specific revision
kubectl rollout history deployment/<deployment-name> -n <namespace>
kubectl rollout undo deployment/<deployment-name> -n <namespace> --to-revision=<N>

For detailed CrashLoopBackOff troubleshooting, see our CrashLoopBackOff fix guide.

Rollout Stuck: ProgressDeadlineExceeded

By default, Kubernetes waits 600 seconds (10 minutes) for a deployment to make progress. If no new pods become ready within that time, the rollout is marked as failed.

# Check deployment conditions
kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Progressing")]}'

Common causes:

  • New pods failing readiness probes
  • New pods stuck in Pending (insufficient resources)
  • Image pull taking too long
  • Init containers timing out

Fix the deadline (if the application legitimately needs more time):

spec:
  progressDeadlineSeconds: 1200  # 20 minutes

Or investigate why pods are not becoming ready:

# Find the new ReplicaSet
kubectl get rs -n <namespace> -l app=<app-label> --sort-by='.metadata.creationTimestamp' | tail -1

# Check pods in the new ReplicaSet
kubectl get pods -n <namespace> -l pod-template-hash=<hash>
kubectl describe pod <pod-name> -n <namespace>

Rollout Stuck: Pods in Pending

New pods cannot be scheduled to any node.

kubectl describe pod <pending-pod> -n <namespace> | grep -A 10 "Events:"

Common causes and fixes:

Event Message                               Cause                             Fix
Insufficient cpu                            Not enough CPU on any node        Scale the cluster or reduce resource requests
Insufficient memory                         Not enough memory on any node     Scale the cluster or reduce resource requests
node(s) had taint                           Taint/toleration mismatch         Add tolerations or remove taints
node(s) didn't match Pod's node affinity    Node selector/affinity mismatch   Fix labels or affinity rules
persistentvolumeclaim not found             Missing PVC                       Create the PVC
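For the taint/toleration case, adding a matching toleration to the pod template lets the pods schedule onto the tainted nodes. A minimal sketch, assuming a node tainted with dedicated=batch:NoSchedule (the key and value are illustrative):

```yaml
spec:
  template:
    spec:
      tolerations:
        - key: "dedicated"     # assumed taint key
          operator: "Equal"
          value: "batch"       # assumed taint value
          effect: "NoSchedule"
```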

For more on scheduling issues, see our Kubernetes taints and tolerations guide.

Rollout Stuck: ImagePullBackOff

New pods cannot pull the container image.

kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Events:" | grep -i "image\|pull"

Common causes:

  • Typo in image name or tag
  • Image tag does not exist in the registry
  • Private registry without imagePullSecrets
  • Docker Hub rate limiting

Quick check:

# Verify image exists (if using Docker Hub)
docker manifest inspect <image>:<tag>

# Check if imagePullSecrets are configured
kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.template.spec.imagePullSecrets}'

# Check if the secret exists and contains valid credentials
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
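For a private registry, the pod template must reference a docker-registry secret. A sketch, assuming a secret named regcred (the registry URL and image are placeholders):

```yaml
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regcred   # e.g. created with: kubectl create secret docker-registry regcred ...
      containers:
        - name: app
          image: registry.example.com/app:1.2.3
```

The secret must exist in the same namespace as the deployment; imagePullSecrets cannot reference secrets across namespaces.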

Rollout Stuck: Readiness Probe Failures

New pods start but never become ready, so the deployment never progresses.

# Check for Unhealthy events
kubectl describe pod <pod-name> -n <namespace> | grep -i "unhealthy\|readiness"

# Test the health endpoint manually
kubectl exec -it <pod-name> -n <namespace> -- curl -s localhost:<port>/health

Common causes:

  • Wrong probe path or port
  • initialDelaySeconds too short for application startup time
  • Application listening on a different port than configured
  • Probe endpoint checking downstream dependencies that are unavailable

Fix:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # Give the app time to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

For slow-starting applications, add a startup probe:

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
  # Gives the app up to 300 seconds (30 * 10) to start

Troubleshooting Deployment Scaling Issues

Deployment Not Scaling Up

# Check current replicas vs desired
kubectl get deployment <deployment-name> -n <namespace>

# Check if HPA is managing the deployment
kubectl get hpa -n <namespace>

# Check HPA status
kubectl describe hpa <hpa-name> -n <namespace>

Common causes:

  • HPA target metric is below the scale-up threshold
  • Metrics server not running
  • Pod resource requests not set (HPA needs requests to calculate utilisation)
  • Cluster out of capacity

# Check if metrics server is running
kubectl get pods -n kube-system | grep metrics-server
kubectl top pods -n <namespace>  # If this fails, metrics server has issues
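Because the HPA computes utilisation as a percentage of requests, the containers must declare them. A minimal sketch pairing container resource requests with an autoscaling/v2 HPA — the names and thresholds are illustrative:

```yaml
# In the Deployment's pod template: the requests the HPA divides usage by
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa            # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up above 70% of requested CPU
```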

Deployment Not Scaling Down

# Check PodDisruptionBudget
kubectl get pdb -n <namespace>

# Check if HPA minReplicas is preventing scale-down
kubectl get hpa <hpa-name> -n <namespace> -o jsonpath='{.spec.minReplicas}'
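A PodDisruptionBudget that demands every replica stay available (for example minAvailable equal to the replica count, or maxUnavailable: 0) will block voluntary evictions such as node drains indefinitely. A sketch of a less restrictive budget, assuming a 3-replica deployment labelled app: myapp:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb        # illustrative name
spec:
  minAvailable: 2        # permits evicting one pod at a time
  selector:
    matchLabels:
      app: myapp
```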

Troubleshooting Deployment Updates

Changes Not Taking Effect

You updated the deployment spec but nothing happened.

# Check if the deployment spec actually changed
kubectl get deployment <deployment-name> -n <namespace> -o yaml | grep -A 5 "image:"

# Check rollout history
kubectl rollout history deployment/<deployment-name> -n <namespace>

Common causes:

  • You changed a field that does not trigger a rollout (like replicas)
  • The deployment is paused

# Check if deployment is paused
kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.paused}'

# Resume if paused
kubectl rollout resume deployment/<deployment-name> -n <namespace>

Fields that trigger a rollout (new ReplicaSet):

  • spec.template.spec.containers[*].image
  • spec.template.spec.containers[*].env
  • spec.template.metadata.labels
  • spec.template.metadata.annotations
  • Any change under spec.template

Fields that do NOT trigger a rollout:

  • spec.replicas
  • spec.strategy
  • spec.minReadySeconds
  • spec.progressDeadlineSeconds

Rolling Update Causing Downtime

By default, Kubernetes performs a rolling update, replacing pods incrementally. But the default settings (maxSurge: 25%, maxUnavailable: 25%) allow some old pods to be terminated before their replacements are ready, and other misconfigurations can cause brief outages.

# Check the deployment strategy
kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.strategy}'

Prevent downtime during rollouts:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Create 1 extra pod at a time
      maxUnavailable: 0    # Never remove a pod until the new one is ready
  minReadySeconds: 10      # Wait 10s after pod is ready before continuing
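When maxSurge and maxUnavailable are given as percentages, Kubernetes converts them to pod counts by rounding maxSurge up and maxUnavailable down. A quick sketch of that arithmetic for a 10-replica deployment at the default 25%:

```shell
replicas=10
surge_pct=25
unavail_pct=25

max_surge=$(( (replicas * surge_pct + 99) / 100 ))   # percentage rounds up   -> 3
max_unavailable=$(( replicas * unavail_pct / 100 ))  # percentage rounds down -> 2

echo "maxSurge=$max_surge maxUnavailable=$max_unavailable"
```

With both left at the default, up to two of the ten old pods may be terminated before replacements are ready, which is why pinning maxUnavailable to 0 matters for zero-downtime rollouts.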

Also ensure:

  • Readiness probes are configured and working
  • terminationGracePeriodSeconds is long enough for graceful shutdown
  • A preStop hook is defined to allow connections to drain
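A sketch of those graceful-shutdown settings, with illustrative values — the sleep length should cover your load balancer's endpoint-propagation delay:

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # must exceed preStop sleep + app shutdown time
      containers:
        - name: app
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]   # keep serving while endpoints update
```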

For monitoring rollout progress, see our kubectl rollout status guide.

Troubleshooting Deployment Rollbacks

Rollback Not Working

# Check rollout history
kubectl rollout history deployment/<deployment-name> -n <namespace>

# Check if there is a previous revision to roll back to
kubectl rollout history deployment/<deployment-name> -n <namespace> --revision=1

Common issues:

  • revisionHistoryLimit set to 0 (no previous ReplicaSets are kept)
  • The previous revision has the same issues

Check revision history limit:

kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.revisionHistoryLimit}'

If it is 0 or very low, increase it:

spec:
  revisionHistoryLimit: 10  # Keep last 10 revisions

Rollback Caused the Same Problem

If rolling back does not fix the issue, the problem is likely not in the container image but in the environment:

  • ConfigMap or Secret was changed independently of the deployment
  • A dependent service (database, cache, API) is down
  • A network policy was recently applied that blocks traffic
  • PVC or storage backend is failing

# Check if ConfigMaps changed recently
kubectl get configmap -n <namespace> -o yaml | head -20

# Check dependent services
kubectl get pods -n <namespace>
kubectl get endpoints -n <namespace>
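One way to keep config and deployment versions in lockstep is to stamp a hash of the config onto the pod template, so any config change forces a rollout — the same pattern Helm charts use with a checksum/config annotation. A local sketch of computing the hash; the file name and annotation key are illustrative:

```shell
# Stand-in for the ConfigMap data you would hash
printf 'log_level: info\n' > /tmp/app-config.yaml

checksum=$(sha256sum /tmp/app-config.yaml | cut -d' ' -f1)
echo "checksum/config: $checksum"

# Then patch it onto the pod template so the deployment rolls:
# kubectl patch deployment <name> -n <namespace> -p \
#   "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"checksum/config\":\"$checksum\"}}}}}"
```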

Debugging Deployment YAML Issues

Validate Before Applying

# Dry-run to catch YAML errors before applying
kubectl apply -f deployment.yaml --dry-run=client

# Server-side dry-run (also validates admission webhooks)
kubectl apply -f deployment.yaml --dry-run=server

# Diff against current state
kubectl diff -f deployment.yaml

Common YAML Mistakes

  1. Label selector mismatch — spec.selector.matchLabels must match spec.template.metadata.labels

# This will fail - labels don't match
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: my-app  # Mismatch!

  2. Port mismatch — containerPort in the pod spec must match the port the application actually listens on

  3. Resource requests exceeding limits — requests cannot be greater than limits

  4. Invalid characters in labels — label values are limited to 63 characters, must start and end with an alphanumeric character, and may contain only alphanumerics, -, _, and .
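The label-value rule — at most 63 characters, alphanumeric at both ends, with -, _, and . allowed in between — can be sketched as a quick shell check (valid_label is a helper written for this example):

```shell
valid_label() {
  # 63 chars max; alphanumeric start and end; '-', '_', '.' allowed inside
  [ ${#1} -le 63 ] && echo "$1" | grep -Eq '^[A-Za-z0-9]([A-Za-z0-9._-]*[A-Za-z0-9])?$'
}

valid_label "my-app"  && echo "my-app: valid"
valid_label "my app"  || echo "my app: invalid"
valid_label "-myapp"  || echo "-myapp: invalid"
```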

Deployment Troubleshooting Cheat Sheet

# Full deployment status overview
kubectl get deploy,rs,pods -n <namespace> -l app=<app-label>

# Check why rollout is stuck
kubectl rollout status deployment/<name> -n <namespace>
kubectl describe deployment <name> -n <namespace>

# Check new vs old ReplicaSet
kubectl get rs -n <namespace> -l app=<app-label> --sort-by='.metadata.creationTimestamp'

# Check failing pods in new ReplicaSet
kubectl get pods -n <namespace> -l pod-template-hash=<new-rs-hash>
kubectl logs <failing-pod> -n <namespace> --previous

# Quick rollback
kubectl rollout undo deployment/<name> -n <namespace>

# Check rollout history
kubectl rollout history deployment/<name> -n <namespace>

# Force restart all pods (useful for picking up ConfigMap changes)
kubectl rollout restart deployment/<name> -n <namespace>

For more on restarting deployments, see our kubectl restart deployment guide.

For a broader Kubernetes troubleshooting methodology, see our comprehensive troubleshooting guide.


Need Help With Kubernetes Deployments?

Failed deployments in production cost money and erode user trust. Our engineers at Tasrie IT Services troubleshoot deployment issues daily and can help you build CI/CD pipelines and deployment strategies that minimise risk.

Our Kubernetes consulting services include:

  • Deployment strategy design with rolling updates, canary releases, and blue-green deployments
  • CI/CD pipeline setup with ArgoCD, Flux, or traditional pipelines
  • Automated rollback and alerting that catches failed deployments before they impact users

Get expert Kubernetes deployment support →
