Kubernetes troubleshooting is overwhelming when you do not have a system. There are pods, nodes, services, ingress controllers, storage drivers, and control plane components, and any one of them could be the problem. At Tasrie IT Services, we have handled hundreds of production incidents across EKS, AKS, and GKE. Every single time, we follow the same five-layer method.
This guide teaches you that method. Start at layer one, work down, and you will find the root cause before you waste time chasing the wrong lead.
The 5-Layer Troubleshooting Method
We debug Kubernetes issues top-down. Each layer builds on the previous one. Do not skip layers.
Layer 1: Cluster Health → Is the cluster itself working?
Layer 2: Node Health → Are nodes Ready and resourced?
Layer 3: Workload Status → Are pods running and healthy?
Layer 4: Networking → Can services reach each other?
Layer 5: Application → Is the app behaving correctly?
Layer 1: Cluster Health
Start here. If the cluster is unhealthy, nothing else matters.
# Can you reach the API server?
kubectl cluster-info
# Are all control plane components healthy?
kubectl get --raw /healthz
# Check component status (deprecated but still useful on some clusters)
kubectl get componentstatuses 2>/dev/null
# Check system pods
kubectl get pods -n kube-system
What to look for:
- API server responding slowly or timing out
- etcd pods in CrashLoopBackOff or not running
- CoreDNS pods not running (this breaks all service discovery)
- kube-proxy pods missing (this breaks all Service routing)
If the API server is unreachable:
- On managed clusters (EKS, AKS, GKE): check the cloud provider’s status page and your cluster’s control plane logs
- On self-managed clusters: SSH into a control plane node and check
journalctl -u kube-apiserver
- Check that your kubeconfig context and credentials are correct:
kubectl config current-context
Layer 2: Node Health
# Check all node statuses
kubectl get nodes
# Detailed node information
kubectl describe node <node-name>
# Resource consumption per node
kubectl top nodes
What to look for:
- Nodes in NotReady or Unknown status
- MemoryPressure, DiskPressure, or PIDPressure conditions
- High CPU or memory utilisation on specific nodes
- Node taints that might prevent pod scheduling
If you find NotReady nodes, follow our guide to fixing Kubernetes Node NotReady. For resource monitoring, see checking node CPU and memory utilisation.
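On the taint point: a tainted node rejects every pod that does not carry a matching toleration, which often looks like a mysterious scheduling failure. A minimal sketch, assuming a hypothetical dedicated=gpu:NoSchedule taint (the key, value, and pod name here are illustrative, not from any real cluster):

```yaml
# Taint applied to the node, e.g.: kubectl taint nodes <node-name> dedicated=gpu:NoSchedule
# Only pods with a matching toleration can schedule there:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job            # hypothetical pod name
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
    - name: app
      image: busybox:1.28
      command: ["sleep", "3600"]
```

Without the tolerations block, this pod would stay Pending with a "node(s) had untolerated taint" event.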
Layer 3: Workload Status
# Find all non-running pods across all namespaces
kubectl get pods -A | grep -v Running | grep -v Completed
# Check recent events (most useful single command for troubleshooting)
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Check deployment status
kubectl get deployments -A
# Check for pods with high restart counts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20
What to look for:
- Pods in CrashLoopBackOff — container keeps crashing (fix guide)
- Pods in Pending — scheduler cannot place them
- Pods in ImagePullBackOff — image cannot be pulled
- High restart counts — the pod keeps dying and recovering
- OOMKilled in pod events — memory limit exceeded (fix guide)
For detailed pod troubleshooting, see our Kubernetes pod troubleshooting guide.
Layer 4: Networking
Networking issues are the hardest to debug because they can appear as application errors. Always verify networking before blaming the application.
# Check if services have endpoints
kubectl get endpoints -n <namespace>
# Test DNS resolution from inside the cluster
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default
# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check ingress controller
kubectl get pods -n ingress-nginx
kubectl get ingress -A
What to look for:
- Services with zero endpoints (label selector mismatch or all pods failing readiness)
- DNS resolution failures (CoreDNS down or misconfigured)
- Ingress returning 502/503/504 errors
- NetworkPolicies blocking traffic between namespaces
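The zero-endpoints case is by far the most common, and it almost always comes down to a label selector mismatch. A minimal sketch of the failure mode, with hypothetical names:

```yaml
# The Service's selector must exactly match the pod template's labels.
# If the pods carry app: web but the Service selects app: webapp,
# the Service gets zero endpoints and every request fails.
apiVersion: v1
kind: Service
metadata:
  name: web                # hypothetical
spec:
  selector:
    app: web               # must equal the labels in the pod template below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web           # changing this (e.g. to app: webapp) silently empties the Service
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 8080
```

Note that pods failing their readiness probes also drop out of the endpoints list, so check readiness before concluding the selector is wrong.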
# Debug network connectivity between pods
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never -- bash
# Inside the pod:
curl <service-name>.<namespace>.svc.cluster.local:<port>
nslookup <service-name>.<namespace>.svc.cluster.local
traceroute <pod-ip>
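If NetworkPolicies are in play, remember that the moment any policy selects a pod, all traffic not explicitly allowed to that pod is denied. A hedged sketch of an allow rule, with hypothetical names, namespaces, and labels:

```yaml
# Hypothetical policy: allow ingress to app=web pods in the prod
# namespace only from pods in namespaces labelled team=frontend.
# All other ingress to those pods is implicitly denied once this exists.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend     # hypothetical name
  namespace: prod          # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              team: frontend
```

When traffic between two pods fails only across namespaces, listing policies with kubectl get networkpolicy -A is usually faster than packet tracing.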
For a deep dive into Kubernetes networking, see our CNI and service mesh guide.
Layer 5: Application
If the cluster, nodes, pods, and networking are all healthy, the problem is in the application itself.
# Check application logs
kubectl logs <pod-name> -n <namespace> --tail=200
# Follow logs in real time
kubectl logs <pod-name> -n <namespace> -f
# Exec into the pod for interactive debugging
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Check if the app is responding on its port
kubectl port-forward <pod-name> 8080:8080 -n <namespace>
# Then in another terminal: curl localhost:8080/health
What to look for:
- Application error messages in logs
- Connection refused errors to databases or external services
- Slow response times (check if CPU is being throttled)
- Missing environment variables or ConfigMap values
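Missing environment variables usually trace back to a ConfigMap or Secret reference in the pod spec. A minimal sketch of the pattern, using hypothetical names and values:

```yaml
# Hypothetical ConfigMap and the pod that consumes it via envFrom.
# If the ConfigMap is absent and optional is false, the container
# sits in CreateContainerConfigError instead of starting.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config         # hypothetical
data:
  DATABASE_URL: "postgres://db.prod.svc.cluster.local:5432/app"
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: busybox:1.28
      command: ["sleep", "3600"]
      envFrom:
        - configMapRef:
            name: app-config
            optional: false  # fail visibly rather than start without config
```

Comparing the variables the pod actually received (kubectl exec <pod-name> -- env) against this spec quickly shows whether the reference or the value is at fault.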
Troubleshooting Common Kubernetes Errors
Error: “Unable to connect to the server”
You cannot reach the Kubernetes API server at all.
# Check your kubeconfig
kubectl config view
kubectl config current-context
# Test connectivity to the API server
curl -k https://<api-server-address>:6443/healthz
# Check if your credentials are expired (common with EKS)
# EKS:
aws eks update-kubeconfig --name <cluster-name> --region <region>
# AKS:
az aks get-credentials --resource-group <rg> --name <cluster-name>
# GKE:
gcloud container clusters get-credentials <cluster-name> --zone <zone>
Error: “The connection to the server was refused”
The API server is not running or not listening on the expected port.
- On managed clusters, this usually indicates a control plane issue on the provider’s side
- On self-managed clusters, check if the API server process is running and if TLS certificates have expired
- Check if a firewall or security group is blocking port 6443
Error: “Forbidden” or “User cannot…”
RBAC permission issue. The user or service account does not have the required permissions.
# Check what you can do
kubectl auth can-i --list
# Check as a specific service account
kubectl auth can-i get pods --as=system:serviceaccount:<namespace>:<sa-name> -n <namespace>
# Check role bindings
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings
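When kubectl auth can-i confirms a missing permission, the fix is a Role (or ClusterRole) bound to the user or service account. A minimal sketch granting read access to pods, with hypothetical namespace and service account names:

```yaml
# Hypothetical Role granting read-only access to pods,
# bound to a service account in the same namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: staging       # hypothetical namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: app-sa           # hypothetical service account
    namespace: staging
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Prefer a namespaced Role over a ClusterRole unless the permission genuinely needs to span namespaces.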
For a comprehensive RBAC guide, see our Kubernetes RBAC audit best practices.
Error: “etcdserver: request is too large”
The object you are trying to store in etcd exceeds the 1.5 MB limit.
- This commonly happens with very large ConfigMaps or Secrets
- Split large configurations into multiple smaller ConfigMaps
- Use external configuration management for large datasets
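One way to split: break the monolithic object apart by concern so each piece stays well under the limit. A sketch with hypothetical names and contents:

```yaml
# Instead of one huge ConfigMap, split by concern so each object
# stays well under etcd's ~1.5 MB request limit (names hypothetical):
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-base
data:
  app.yaml: |
    logLevel: info
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-rules
data:
  rules.yaml: |
    # the large rule set lives in its own object,
    # mounted alongside app-config-base
```

The pod then mounts both ConfigMaps as volumes, and each can be updated independently.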
Essential Troubleshooting Tools
kubectl Plugins and CLI Tools
| Tool | What It Does | Install |
|---|---|---|
| k9s | Terminal-based interactive Kubernetes dashboard | brew install k9s |
| stern | Multi-pod log tailing with colour coding | brew install stern |
| kubectl-debug | Attach ephemeral debug containers | Built into kubectl |
| kubectx/kubens | Fast context and namespace switching | brew install kubectx |
| kubectl-neat | Clean up verbose kubectl output | kubectl krew install neat |
Observability Stack
For production clusters, you need metrics, logs, and traces to troubleshoot effectively:
- Prometheus for metrics and alerting
- Grafana for dashboards and visualisation
- Loki or Fluentd for centralised log aggregation
- OpenTelemetry for distributed tracing
Quick Reference: Troubleshooting Decision Tree
Problem: Something is broken
│
├── Can you reach the API server?
│ ├── No → Check kubeconfig, network, API server status
│ └── Yes ↓
│
├── Are all nodes Ready?
│ ├── No → Fix NotReady nodes (kubelet, disk, memory, network)
│ └── Yes ↓
│
├── Are all pods Running?
│ ├── Pending → Check scheduling (resources, affinity, taints)
│ ├── CrashLoopBackOff → Check logs --previous, exit codes
│ ├── ImagePullBackOff → Check image name, registry auth
│ ├── OOMKilled → Increase memory limits
│ └── Running ↓
│
├── Can services reach each other?
│ ├── No → Check DNS, endpoints, NetworkPolicies
│ └── Yes ↓
│
└── Application-level issue
└── Check app logs, config, dependencies
Common Mistakes When Troubleshooting Kubernetes
- Starting at the wrong layer — do not read pod logs before confirming the node is healthy
- Ignoring events — kubectl get events is often more useful than logs
- Not using the --previous flag — for CrashLoopBackOff, the current container has no useful logs because it just started
- Blaming the network — always verify DNS and endpoints before assuming network issues
- Not checking resource limits — CPU throttling and OOMKills cause subtle failures that look like application bugs
- Forgetting about init containers — if a pod is stuck in Init:0/1, the init container is the problem, not the main container
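The resource-limits mistake deserves a concrete sketch: a container that exceeds its memory limit is OOMKilled, while one that hits its CPU limit is merely throttled, which shows up as slow responses rather than a crash. Hypothetical values for illustration:

```yaml
# Hypothetical container resources. Exceeding the memory limit => OOMKilled;
# hitting the CPU limit => throttling (latency, but no restart).
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: busybox:1.28
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "250m"      # what the scheduler reserves
          memory: "256Mi"
        limits:
          cpu: "500m"      # throttle ceiling
          memory: "512Mi"  # hard kill ceiling
```

Because throttling never appears in pod status, compare observed latency against the CPU limit before concluding the application itself is slow.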
For an in-depth look at the 20 most common Kubernetes production issues, see our production troubleshooting guide.
Get Expert Help Troubleshooting Kubernetes
Kubernetes troubleshooting under pressure is stressful, especially when production is down. Our engineers at Tasrie IT Services have debugged hundreds of incidents across EKS, AKS, and GKE and can help you build the systems and skills to resolve issues fast.
Our Kubernetes consulting services include:
- Incident response support with experienced Kubernetes engineers available when you need them
- Observability setup with Prometheus, Grafana, and custom alerting
- Team training on systematic troubleshooting methodology and runbook creation