
How to Troubleshoot Kubernetes: The 5-Layer Method We Use on Every Incident

Engineering Team 2026-04-26

Kubernetes troubleshooting is overwhelming when you do not have a system. There are pods, nodes, services, ingress controllers, storage drivers, and control plane components, and any one of them could be the problem. At Tasrie IT Services, we have handled hundreds of production incidents across EKS, AKS, and GKE. Every single time, we follow the same five-layer method.

This guide teaches you that method. Start at layer one, work down, and you will find the root cause before you waste time chasing the wrong lead.

The 5-Layer Troubleshooting Method

We debug Kubernetes issues top-down. Each layer builds on the previous one. Do not skip layers.

Layer 1: Cluster Health     → Is the cluster itself working?
Layer 2: Node Health        → Are nodes Ready and resourced?
Layer 3: Workload Status    → Are pods running and healthy?
Layer 4: Networking         → Can services reach each other?
Layer 5: Application        → Is the app behaving correctly?

Layer 1: Cluster Health

Start here. If the cluster is unhealthy, nothing else matters.

# Can you reach the API server?
kubectl cluster-info

# Is the API server itself healthy? (use /readyz?verbose for per-check detail)
kubectl get --raw /healthz

# Check component status (deprecated but still useful on some clusters)
kubectl get componentstatuses 2>/dev/null

# Check system pods
kubectl get pods -n kube-system

What to look for:

  • API server responding slowly or timing out
  • etcd pods in CrashLoopBackOff or not running
  • CoreDNS pods not running (this breaks all service discovery)
  • kube-proxy pods missing (this breaks all Service routing)

If the API server is unreachable:

  • On managed clusters (EKS, AKS, GKE): check the cloud provider’s status page and your cluster’s control plane logs
  • On self-managed clusters: SSH into a control plane node and check journalctl -u kube-apiserver
  • Check if your kubeconfig context and credentials are correct: kubectl config current-context
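When the API server only answers intermittently, it helps to distinguish a flaky connection from a hard outage before escalating. One way to do that is a small retry wrapper, sketched here as a generic shell function; the attempt count and delay are placeholder values to tune:

```shell
#!/bin/sh
# retry CMD...: run a command up to MAX_ATTEMPTS times with a short pause,
# so a single dropped API request is not mistaken for a dead control plane.
MAX_ATTEMPTS=3
DELAY_SECONDS=2

retry() {
  attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    if "$@"; then
      return 0                # command succeeded
    fi
    echo "attempt $attempt/$MAX_ATTEMPTS failed, retrying..." >&2
    attempt=$((attempt + 1))
    sleep "$DELAY_SECONDS"
  done
  return 1                    # every attempt failed: likely a real outage
}

# Against a real cluster you would run something like:
#   retry kubectl get --raw /healthz
```

If this fails all attempts, treat it as a genuine control plane problem rather than a transient blip.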

Layer 2: Node Health

# Check all node statuses
kubectl get nodes

# Detailed node information
kubectl describe node <node-name>

# Resource consumption per node
kubectl top nodes

What to look for:

  • Nodes in NotReady or Unknown status
  • MemoryPressure, DiskPressure, or PIDPressure conditions
  • High CPU or memory utilisation on specific nodes
  • Node taints that might prevent pod scheduling

If you find NotReady nodes, follow our guide to fixing Kubernetes Node NotReady. For resource monitoring, see checking node CPU and memory utilisation.
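On large clusters the node list is long, so a small filter over kubectl get nodes output keeps only the unhealthy nodes visible. A sketch; the sample output below is illustrative:

```shell
# not_ready_nodes: read `kubectl get nodes` output on stdin and print only
# nodes whose STATUS column is not exactly "Ready" (NotReady, Unknown,
# Ready,SchedulingDisabled, and so on).
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Live cluster:  kubectl get nodes | not_ready_nodes
# Sample output for demonstration:
kubectl_sample='NAME      STATUS     ROLES    AGE   VERSION
node-a    Ready      worker   40d   v1.29.4
node-b    NotReady   worker   40d   v1.29.4'
echo "$kubectl_sample" | not_ready_nodes
```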

Layer 3: Workload Status

# Find all non-running pods across all namespaces
kubectl get pods -A | grep -v Running | grep -v Completed

# Check recent events (most useful single command for troubleshooting)
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Check deployment status
kubectl get deployments -A

# Check for pods with high restart counts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20

What to look for:

  • Pods in CrashLoopBackOff — container keeps crashing (fix guide)
  • Pods in Pending — scheduler cannot place them
  • Pods in ImagePullBackOff — image cannot be pulled
  • High restart counts — the pod keeps dying and recovering
  • OOMKilled in pod events — memory limit exceeded (fix guide)

For detailed pod troubleshooting, see our Kubernetes pod troubleshooting guide.
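High restart counts are easy to miss in a long pod listing, so it can help to filter the RESTARTS column against a threshold. This sketch parses sample kubectl get pods -A output; on a live cluster you would pipe real output through it:

```shell
# high_restart_pods THRESHOLD: read `kubectl get pods -A` output on stdin and
# print any pod that has restarted more than THRESHOLD times. With -A the
# columns are NAMESPACE NAME READY STATUS RESTARTS AGE, so RESTARTS is $5;
# "$5 + 0" keeps the leading number when the column reads like "7 (5m ago)".
high_restart_pods() {
  threshold="$1"
  awk -v t="$threshold" 'NR > 1 && $5 + 0 > t { print $1 "/" $2, "restarts=" $5 }'
}

# Live cluster:  kubectl get pods -A | high_restart_pods 5
sample='NAMESPACE   NAME        READY   STATUS    RESTARTS   AGE
default     web-1       1/1     Running   0          2d
payments    worker-9    0/1     Running   14         2d'
echo "$sample" | high_restart_pods 5
```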

Layer 4: Networking

Networking issues are the hardest to debug because they can appear as application errors. Always verify networking before blaming the application.

# Check if services have endpoints
kubectl get endpoints -n <namespace>

# Test DNS resolution from inside the cluster
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check ingress controller
kubectl get pods -n ingress-nginx
kubectl get ingress -A

What to look for:

  • Services with zero endpoints (label selector mismatch or all pods failing readiness)
  • DNS resolution failures (CoreDNS down or misconfigured)
  • Ingress returning 502/503/504 errors
  • NetworkPolicies blocking traffic between namespaces

# Debug network connectivity between pods
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never -- bash

# Inside the pod:
curl <service-name>.<namespace>.svc.cluster.local:<port>
nslookup <service-name>.<namespace>.svc.cluster.local
traceroute <pod-ip>
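Services with zero endpoints are worth checking first at this layer, because they explain most "connection refused" symptoms. The filter below (the function name is ours, and the sample data is illustrative) picks them out of kubectl get endpoints output:

```shell
# services_without_endpoints: read `kubectl get endpoints -n <namespace>`
# output on stdin and print Services whose ENDPOINTS column is "<none>",
# which usually means a label-selector mismatch or every backing pod
# failing its readiness probe.
services_without_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

# Live cluster:  kubectl get endpoints -n <namespace> | services_without_endpoints
sample='NAME       ENDPOINTS                     AGE
api        10.0.1.7:8080,10.0.1.9:8080   9d
checkout   <none>                        9d'
echo "$sample" | services_without_endpoints
```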

For a deep dive into Kubernetes networking, see our CNI and service mesh guide.

Layer 5: Application

If the cluster, nodes, pods, and networking are all healthy, the problem is in the application itself.

# Check application logs
kubectl logs <pod-name> -n <namespace> --tail=200

# Follow logs in real time
kubectl logs <pod-name> -n <namespace> -f

# Exec into the pod for interactive debugging
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Check if the app is responding on its port
kubectl port-forward <pod-name> 8080:8080 -n <namespace>
# Then in another terminal: curl localhost:8080/health

What to look for:

  • Application error messages in logs
  • Connection refused errors to databases or external services
  • Slow response times (check if CPU is being throttled)
  • Missing environment variables or ConfigMap values
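Much of this triage is pattern-matching on logs, which you can partially automate. The patterns below are assumptions to adapt to your application's log format, not an exhaustive list:

```shell
# triage_logs: scan application log text on stdin for common failure
# signatures and print the matching lines, de-duplicated. Extend the
# pattern list to match your own stack.
triage_logs() {
  grep -iE 'connection refused|connection reset|out of memory|timed out|no such host|permission denied' \
    | sort -u
}

# Live cluster:  kubectl logs <pod-name> -n <namespace> --tail=500 | triage_logs
sample='2026-04-26T10:01:02Z ERROR dial tcp 10.0.2.4:5432: connection refused
2026-04-26T10:01:03Z INFO retrying in 5s'
echo "$sample" | triage_logs
```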

Troubleshooting Common Kubernetes Errors

Error: “Unable to connect to the server”

You cannot reach the Kubernetes API server at all.

# Check your kubeconfig
kubectl config view
kubectl config current-context

# Test connectivity to the API server
curl -k https://<api-server-address>:6443/healthz

# Check if your credentials are expired (common with EKS)
# EKS:
aws eks update-kubeconfig --name <cluster-name> --region <region>
# AKS:
az aks get-credentials --resource-group <rg> --name <cluster-name>
# GKE:
gcloud container clusters get-credentials <cluster-name> --zone <zone>

Error: “The connection to the server was refused”

The API server is not running or not listening on the expected port.

  • On managed clusters, this usually indicates a control plane issue on the provider’s side
  • On self-managed clusters, check if the API server process is running and if TLS certificates have expired
  • Check if a firewall or security group is blocking port 6443
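To rule a firewall in or out quickly, test raw TCP reachability to port 6443 without depending on curl or nc being installed. This sketch uses bash's /dev/tcp pseudo-device, so it requires bash rather than plain sh:

```shell
# check_port HOST PORT: attempt a raw TCP connection and report the result.
# /dev/tcp is a bash feature, so bash is invoked explicitly; `timeout` keeps
# a silently dropped connection from hanging the check.
check_port() {
  host="$1"; port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open: $host:$port"
  else
    echo "closed or filtered: $host:$port"
    return 1
  fi
}

# Example: is the API server port reachable at all?
#   check_port <api-server-address> 6443
```

"Closed or filtered" here does not tell you which of the two it is; pair it with your cloud provider's security group rules to confirm.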

Error: “Forbidden” or “User cannot…”

RBAC permission issue. The user or service account does not have the required permissions.

# Check what you can do
kubectl auth can-i --list

# Check as a specific service account
kubectl auth can-i get pods --as=system:serviceaccount:<namespace>:<sa-name> -n <namespace>

# Check role bindings
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings

For a comprehensive RBAC guide, see our Kubernetes RBAC audit best practices.

Error: “etcdserver: request is too large”

The object you are trying to store in etcd exceeds the 1.5 MB limit.

  • This commonly happens with very large ConfigMaps or Secrets
  • Split large configurations into multiple smaller ConfigMaps
  • Use external configuration management for large datasets
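To find which objects are approaching the limit, compare each object's serialized size against etcd's default request cap. The kubectl loop in the comment is a hypothetical sketch for a single namespace; the filter itself works on plain name-and-size lines:

```shell
# oversize_objects: read "name<TAB>bytes" lines on stdin and print any object
# whose serialized size exceeds etcd's default ~1.5 MB request limit.
LIMIT_BYTES=1572864   # 1.5 MiB

oversize_objects() {
  awk -v limit="$LIMIT_BYTES" '$2 + 0 > limit + 0 { print $1, $2 " bytes" }'
}

# In a live cluster, one (hypothetical) way to generate the input:
#   kubectl get configmaps -n <namespace> -o name | while read -r cm; do
#     printf '%s\t%s\n' "$cm" "$(kubectl get "$cm" -n <namespace> -o yaml | wc -c)"
#   done | oversize_objects
# Against sample data:
printf 'configmap/app-config\t2400000\nconfigmap/small\t900\n' | oversize_objects
```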

Essential Troubleshooting Tools

kubectl Plugins and CLI Tools

Tool             What It Does                                      Install
k9s              Terminal-based interactive Kubernetes dashboard   brew install k9s
stern            Multi-pod log tailing with colour coding          brew install stern
kubectl debug    Attach ephemeral debug containers                 Built into kubectl
kubectx/kubens   Fast context and namespace switching              brew install kubectx
kubectl-neat     Clean up verbose kubectl output                   kubectl krew install neat

Observability Stack

For production clusters, you need metrics, logs, and traces to troubleshoot effectively. A typical stack pairs Prometheus and Grafana for metrics and alerting with centralised log aggregation and distributed tracing.

Quick Reference: Troubleshooting Decision Tree

Problem: Something is broken

├── Can you reach the API server?
│   ├── No → Check kubeconfig, network, API server status
│   └── Yes ↓

├── Are all nodes Ready?
│   ├── No → Fix NotReady nodes (kubelet, disk, memory, network)
│   └── Yes ↓

├── Are all pods Running?
│   ├── Pending → Check scheduling (resources, affinity, taints)
│   ├── CrashLoopBackOff → Check logs --previous, exit codes
│   ├── ImagePullBackOff → Check image name, registry auth
│   ├── OOMKilled → Increase memory limits
│   └── Running ↓

├── Can services reach each other?
│   ├── No → Check DNS, endpoints, NetworkPolicies
│   └── Yes ↓

└── Application-level issue
    └── Check app logs, config, dependencies

Common Mistakes When Troubleshooting Kubernetes

  1. Starting at the wrong layer — do not read pod logs before confirming the node is healthy
  2. Ignoring events — kubectl get events is often more useful than logs
  3. Not using --previous flag — for CrashLoopBackOff, the current container has no useful logs because it just started
  4. Blaming the network — always verify DNS and endpoints before assuming network issues
  5. Not checking resource limits — CPU throttling and OOMKills cause subtle failures that look like application bugs
  6. Forgetting about init containers — if a pod is stuck in Init:0/1, the init container is the problem, not the main container

For an in-depth look at the 20 most common Kubernetes production issues, see our production troubleshooting guide.


Get Expert Help Troubleshooting Kubernetes

Kubernetes troubleshooting under pressure is stressful, especially when production is down. Our engineers at Tasrie IT Services have debugged hundreds of incidents across EKS, AKS, and GKE and can help you build the systems and skills to resolve issues fast.

Our Kubernetes consulting services include:

  • Incident response support with experienced Kubernetes engineers available when you need them
  • Observability setup with Prometheus, Grafana, and custom alerting
  • Team training on systematic troubleshooting methodology and runbook creation

Talk to our Kubernetes team →
