Kubernetes troubleshooting is overwhelming when you do not have a system. There are pods, nodes, services, ingress controllers, storage drivers, and control plane components, and any one of them could be the problem. At Tasrie IT Services, we have handled hundreds of production incidents across EKS, AKS, and GKE. Every single time, we follow the same five-layer method.
This guide teaches you that method. Start at layer one, work down, and you will find the root cause before you waste time chasing the wrong lead.
The 5-Layer Troubleshooting Method
We debug Kubernetes issues top-down. Each layer builds on the previous one. Do not skip layers.
Layer 1: Cluster Health → Is the cluster itself working?
Layer 2: Node Health → Are nodes Ready and resourced?
Layer 3: Workload Status → Are pods running and healthy?
Layer 4: Networking → Can services reach each other?
Layer 5: Application → Is the app behaving correctly?
Layer 1: Cluster Health
Start here. If the cluster is unhealthy, nothing else matters.
# Can you reach the API server?
kubectl cluster-info
# Are all control plane components healthy?
kubectl get --raw /healthz
# Check component status (deprecated but still useful on some clusters)
kubectl get componentstatuses 2>/dev/null
# Check system pods
kubectl get pods -n kube-system
What to look for:
- API server responding slowly or timing out
- etcd pods in CrashLoopBackOff or not running
- CoreDNS pods not running (this breaks all service discovery)
- kube-proxy pods missing (this breaks all Service routing)
If the API server is unreachable:
- On managed clusters (EKS, AKS, GKE): check the cloud provider’s status page and your cluster’s control plane logs
- On self-managed clusters: SSH into a control plane node and check
journalctl -u kube-apiserver
- Check that your kubeconfig context and credentials are correct:
kubectl config current-context
Layer 2: Node Health
# Check all node statuses
kubectl get nodes
# Detailed node information
kubectl describe node <node-name>
# Resource consumption per node
kubectl top nodes
What to look for:
- Nodes in NotReady or Unknown status
- MemoryPressure, DiskPressure, or PIDPressure conditions
- High CPU or memory utilisation on specific nodes
- Node taints that might prevent pod scheduling
If you find NotReady nodes, follow our guide to fixing Kubernetes Node NotReady. For resource monitoring, see checking node CPU and memory utilisation.
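On the taint point: a tainted node rejects every pod that does not carry a matching toleration, which often looks like a mysterious scheduling failure. A minimal sketch, assuming a hypothetical dedicated=gpu:NoSchedule taint (the key, value, and pod name here are illustrative, not from any real cluster):

```yaml
# Taint applied to the node, e.g.: kubectl taint nodes <node-name> dedicated=gpu:NoSchedule
# Only pods with a matching toleration can schedule there:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job            # hypothetical pod name
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
    - name: app
      image: busybox:1.28
      command: ["sleep", "3600"]
```

Without the tolerations block, this pod would stay Pending with a "node(s) had untolerated taint" event.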
Layer 3: Workload Status
# Find all non-running pods across all namespaces
kubectl get pods -A | grep -v Running | grep -v Completed
# Check recent events (most useful single command for troubleshooting)
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Check deployment status
kubectl get deployments -A
# Check for pods with high restart counts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20
What to look for:
- Pods in CrashLoopBackOff — container keeps crashing (fix guide)
- Pods in Pending — scheduler cannot place them
- Pods in ImagePullBackOff — image cannot be pulled
- High restart counts — the pod keeps dying and recovering
- OOMKilled in pod events — memory limit exceeded (fix guide)
For detailed pod troubleshooting, see our Kubernetes pod troubleshooting guide.
Layer 4: Networking
Networking issues are the hardest to debug because they can appear as application errors. Always verify networking before blaming the application.
# Check if services have endpoints
kubectl get endpoints -n <namespace>
# Test DNS resolution from inside the cluster
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default
# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check ingress controller
kubectl get pods -n ingress-nginx
kubectl get ingress -A
What to look for:
- Services with zero endpoints (label selector mismatch or all pods failing readiness)
- DNS resolution failures (CoreDNS down or misconfigured)
- Ingress returning 502/503/504 errors
- NetworkPolicies blocking traffic between namespaces
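The zero-endpoints case is by far the most common, and it almost always comes down to a label selector mismatch. A minimal sketch of the failure mode, with hypothetical names:

```yaml
# The Service's selector must exactly match the pod template's labels.
# If the pods carry app: web but the Service selects app: webapp,
# the Service gets zero endpoints and every request fails.
apiVersion: v1
kind: Service
metadata:
  name: web                # hypothetical
spec:
  selector:
    app: web               # must equal the labels in the pod template below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web           # changing this (e.g. to app: webapp) silently empties the Service
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 8080
```

Note that pods failing their readiness probes also drop out of the endpoints list, so check readiness before concluding the selector is wrong.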
# Debug network connectivity between pods
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never -- bash
# Inside the pod:
curl <service-name>.<namespace>.svc.cluster.local:<port>
nslookup <service-name>.<namespace>.svc.cluster.local
traceroute <pod-ip>
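If NetworkPolicies are in play, remember that the moment any policy selects a pod, all traffic not explicitly allowed to that pod is denied. A hedged sketch of an allow rule, with hypothetical names, namespaces, and labels:

```yaml
# Hypothetical policy: allow ingress to app=web pods in the prod
# namespace only from pods in namespaces labelled team=frontend.
# All other ingress to those pods is implicitly denied once this exists.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend     # hypothetical name
  namespace: prod          # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              team: frontend
```

When traffic between two pods fails only across namespaces, listing policies with kubectl get networkpolicy -A is usually faster than packet tracing.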
For a deep dive into Kubernetes networking, see our CNI and service mesh guide.
Layer 5: Application
If the cluster, nodes, pods, and networking are all healthy, the problem is in the application itself.
# Check application logs
kubectl logs <pod-name> -n <namespace> --tail=200
# Follow logs in real time
kubectl logs <pod-name> -n <namespace> -f
# Exec into the pod for interactive debugging
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Check if the app is responding on its port
kubectl port-forward <pod-name> 8080:8080 -n <namespace>
# Then in another terminal: curl localhost:8080/health
What to look for:
- Application error messages in logs
- Connection refused errors to databases or external services
- Slow response times (check if CPU is being throttled)
- Missing environment variables or ConfigMap values
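Missing environment variables usually trace back to a ConfigMap or Secret reference in the pod spec. A minimal sketch of the pattern, using hypothetical names and values:

```yaml
# Hypothetical ConfigMap and the pod that consumes it via envFrom.
# If the ConfigMap is absent and optional is false, the container
# sits in CreateContainerConfigError instead of starting.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config         # hypothetical
data:
  DATABASE_URL: "postgres://db.prod.svc.cluster.local:5432/app"
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: busybox:1.28
      command: ["sleep", "3600"]
      envFrom:
        - configMapRef:
            name: app-config
            optional: false  # fail visibly rather than start without config
```

Comparing the variables the pod actually received (kubectl exec <pod-name> -- env) against this spec quickly shows whether the reference or the value is at fault.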
Troubleshooting Common Kubernetes Errors
Error: “Unable to connect to the server”
You cannot reach the Kubernetes API server at all.
# Check your kubeconfig
kubectl config view
kubectl config current-context
# Test connectivity to the API server
curl -k https://<api-server-address>:6443/healthz
# Check if your credentials are expired (common with EKS)
# EKS:
aws eks update-kubeconfig --name <cluster-name> --region <region>
# AKS:
az aks get-credentials --resource-group <rg> --name <cluster-name>
# GKE:
gcloud container clusters get-credentials <cluster-name> --zone <zone>
Error: “The connection to the server was refused”
The API server is not running or not listening on the expected port.
- On managed clusters, this usually indicates a control plane issue on the provider’s side
- On self-managed clusters, check if the API server process is running and if TLS certificates have expired
- Check if a firewall or security group is blocking port 6443
Error: “Forbidden” or “User cannot…”
RBAC permission issue. The user or service account does not have the required permissions.
# Check what you can do
kubectl auth can-i --list
# Check as a specific service account
kubectl auth can-i get pods --as=system:serviceaccount:<namespace>:<sa-name> -n <namespace>
# Check role bindings
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings
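When kubectl auth can-i confirms a missing permission, the fix is a Role (or ClusterRole) bound to the user or service account. A minimal sketch granting read access to pods, with hypothetical namespace and service account names:

```yaml
# Hypothetical Role granting read-only access to pods,
# bound to a service account in the same namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: staging       # hypothetical namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: app-sa           # hypothetical service account
    namespace: staging
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Prefer a namespaced Role over a ClusterRole unless the permission genuinely needs to span namespaces.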
For a comprehensive RBAC guide, see our Kubernetes RBAC audit best practices.
Error: “etcdserver: request is too large”
The object you are trying to store in etcd exceeds the 1.5 MB limit.
- This commonly happens with very large ConfigMaps or Secrets
- Split large configurations into multiple smaller ConfigMaps
- Use external configuration management for large datasets
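One way to split: break the monolithic object apart by concern so each piece stays well under the limit. A sketch with hypothetical names and contents:

```yaml
# Instead of one huge ConfigMap, split by concern so each object
# stays well under etcd's ~1.5 MB request limit (names hypothetical):
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-base
data:
  app.yaml: |
    logLevel: info
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-rules
data:
  rules.yaml: |
    # the large rule set lives in its own object,
    # mounted alongside app-config-base
```

The pod then mounts both ConfigMaps as volumes, and each can be updated independently.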
Essential Troubleshooting Tools
kubectl Plugins and CLI Tools
| Tool | What It Does | Install |
|---|---|---|
| k9s | Terminal-based interactive Kubernetes dashboard | brew install k9s |
| stern | Multi-pod log tailing with colour coding | brew install stern |
| kubectl-debug | Attach ephemeral debug containers | Built into kubectl |
| kubectx/kubens | Fast context and namespace switching | brew install kubectx |
| kubectl-neat | Clean up verbose kubectl output | kubectl krew install neat |
Observability Stack
For production clusters, you need metrics, logs, and traces to troubleshoot effectively:
- Prometheus for metrics and alerting
- Grafana for dashboards and visualisation
- Loki or Fluentd for centralised log aggregation
- OpenTelemetry for distributed tracing
Quick Reference: Troubleshooting Decision Tree
Problem: Something is broken
│
├── Can you reach the API server?
│ ├── No → Check kubeconfig, network, API server status
│ └── Yes ↓
│
├── Are all nodes Ready?
│ ├── No → Fix NotReady nodes (kubelet, disk, memory, network)
│ └── Yes ↓
│
├── Are all pods Running?
│ ├── Pending → Check scheduling (resources, affinity, taints)
│ ├── CrashLoopBackOff → Check logs --previous, exit codes
│ ├── ImagePullBackOff → Check image name, registry auth
│ ├── OOMKilled → Increase memory limits
│ └── Running ↓
│
├── Can services reach each other?
│ ├── No → Check DNS, endpoints, NetworkPolicies
│ └── Yes ↓
│
└── Application-level issue
└── Check app logs, config, dependencies
Common Mistakes When Troubleshooting Kubernetes
- Starting at the wrong layer — do not read pod logs before confirming the node is healthy
- Ignoring events — kubectl get events is often more useful than logs
- Not using the --previous flag — for CrashLoopBackOff, the current container has no useful logs because it just started
- Blaming the network — always verify DNS and endpoints before assuming network issues
- Not checking resource limits — CPU throttling and OOMKills cause subtle failures that look like application bugs
- Forgetting about init containers — if a pod is stuck in Init:0/1, the init container is the problem, not the main container
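The resource-limits mistake deserves a concrete sketch: a container that exceeds its memory limit is OOMKilled, while one that hits its CPU limit is merely throttled, which shows up as slow responses rather than a crash. Hypothetical values for illustration:

```yaml
# Hypothetical container resources. Exceeding the memory limit => OOMKilled;
# hitting the CPU limit => throttling (latency, but no restart).
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: busybox:1.28
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "250m"      # what the scheduler reserves
          memory: "256Mi"
        limits:
          cpu: "500m"      # throttle ceiling
          memory: "512Mi"  # hard kill ceiling
```

Because throttling never appears in pod status, compare observed latency against the CPU limit before concluding the application itself is slow.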
For an in-depth look at the 20 most common Kubernetes production issues, see our production troubleshooting guide.
Get Expert Help Troubleshooting Kubernetes
Kubernetes troubleshooting under pressure is stressful, especially when production is down. Our engineers at Tasrie IT Services have debugged hundreds of incidents across EKS, AKS, and GKE and can help you build the systems and skills to resolve issues fast.
Our Kubernetes consulting services include:
- Incident response support with experienced Kubernetes engineers available when you need them
- Observability setup with Prometheus, Grafana, and custom alerting
- Team training on systematic troubleshooting methodology and runbook creation