Troubleshooting a Kubernetes cluster is different from troubleshooting a single pod or deployment. Cluster-level issues affect everything: the control plane, networking, storage, and all workloads running on the cluster. At Tasrie IT Services, we manage over 100 production clusters across EKS, AKS, and GKE, and cluster-level problems are always the highest-severity incidents we handle.
This guide focuses on cluster-wide troubleshooting: the control plane, etcd, networking infrastructure, DNS, and node fleet health. For pod-level debugging, see our pod troubleshooting guide.
Step 1: Verify Cluster Reachability
Before anything else, confirm you can talk to the cluster.
# Test API server connectivity
kubectl cluster-info
# Check API server health
kubectl get --raw /healthz
# If that fails, check your kubeconfig
kubectl config current-context
kubectl config view --minify
If the API server is unreachable:
| Scenario | Diagnosis | Action |
|---|---|---|
| Connection refused | API server process is down | Check control plane node (self-managed) or cloud provider status (managed) |
| Connection timeout | Network issue between you and API server | Check VPN, firewall rules, security groups |
| Unauthorized | Credentials expired or revoked | Refresh kubeconfig (see cloud-specific commands below) |
| TLS handshake error | Certificate mismatch or expiry | Check API server certificates |
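For TLS handshake errors, check the certificate's validity window before anything else. A minimal sketch, assuming openssl is installed locally and, for the kubeadm command, shell access to a control plane node; <api-server-endpoint> is a placeholder:
# Check the API server certificate dates (replace <api-server-endpoint> with your cluster endpoint)
echo | openssl s_client -connect <api-server-endpoint>:6443 2>/dev/null | openssl x509 -noout -dates
# On kubeadm clusters, check all control plane certificates at once
sudo kubeadm certs check-expiration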
Refresh credentials for managed clusters:
# EKS
aws eks update-kubeconfig --name <cluster-name> --region <region>
# AKS
az aks get-credentials --resource-group <rg> --name <cluster-name> --overwrite-existing
# GKE
gcloud container clusters get-credentials <cluster-name> --zone <zone>
Step 2: Check Control Plane Components
The control plane runs the API server, scheduler, controller manager, and etcd. On managed clusters, the cloud provider handles these, but you can still observe their health.
Self-Managed Clusters
# Check control plane pods
kubectl get pods -n kube-system -l tier=control-plane
# Check each component
kubectl logs kube-apiserver-<node> -n kube-system --tail=50
kubectl logs kube-scheduler-<node> -n kube-system --tail=50
kubectl logs kube-controller-manager-<node> -n kube-system --tail=50
# On the control plane node
sudo journalctl -u kubelet --since "30 minutes ago" | grep -i "error\|fail"
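The control plane also includes etcd, and on kubeadm clusters you can probe it directly. A minimal sketch, assuming the default static pod name and certificate paths:
# Check etcd health from inside the etcd static pod (kubeadm defaults assumed)
kubectl -n kube-system exec etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health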
Managed Clusters
On EKS, AKS, and GKE, you cannot access control plane components directly. Instead:
# Check API server responsiveness
time kubectl get nodes
# Check if the API server is throttling requests
kubectl get --raw /metrics | grep apiserver_request_total | grep -i "429\|503"
# Check cloud-specific control plane logs
# EKS: CloudWatch Log Groups → /aws/eks/<cluster-name>/cluster
# AKS: Azure Monitor → Diagnostic settings → kube-apiserver, kube-controller-manager
# GKE: Cloud Logging → resource.type="k8s_cluster"
If the API server is slow (taking more than 1 second for simple commands), the cluster is under stress. Common causes include too many LIST operations on large objects, webhook timeouts, or etcd performance issues.
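To see what is generating that load, break the request counters down by verb and resource. A rough sketch, assuming the default apiserver metric labels:
# Find the busiest LIST request series (label names assume default apiserver metrics)
kubectl get --raw /metrics | grep apiserver_request_total | grep 'verb="LIST"' | sort -k2 -rn | head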
Step 3: Assess Node Fleet Health
# Get all node statuses
kubectl get nodes
# Detailed view with conditions
kubectl get nodes -o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].status,REASON:.status.conditions[-1].reason,AGE:.metadata.creationTimestamp
# Check resource pressure across all nodes
kubectl describe nodes | grep -E "Name:|Conditions:" -A 5
# Resource usage
kubectl top nodes
Red flags to watch for:
- Multiple nodes in NotReady status: likely a cluster-wide issue, not individual node problems
- All nodes showing MemoryPressure or DiskPressure: workloads are over-provisioned for the cluster size
- Nodes with very different ages: could indicate auto-scaling issues or failed node replacements
For individual node troubleshooting, see our Node NotReady fix guide.
Step 4: Check DNS (CoreDNS)
DNS is the foundation of service discovery in Kubernetes. When CoreDNS fails, every service-to-service call breaks.
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Test DNS resolution
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml
Common CoreDNS failures:
| Symptom | Cause | Fix |
|---|---|---|
| CoreDNS pods CrashLoopBackOff | Corrupted ConfigMap or loop detection | Fix the Corefile ConfigMap |
| DNS timeout from pods | CoreDNS overwhelmed | Increase replicas, add NodeLocal DNSCache |
| Partial DNS failures | Some pods cannot resolve | Check ndots setting and search domains (see the check below) |
| External DNS not resolving | Missing upstream DNS config | Check CoreDNS forward directive |
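For the ndots case, inspect the resolv.conf a pod actually receives; the same busybox image used above works:
# Show the search domains and ndots value inside a pod
kubectl run dnsconf --image=busybox:1.28 --rm -it --restart=Never -- cat /etc/resolv.conf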
Fix CoreDNS loop issues:
# Check if the loop plugin is detecting a forwarding loop
kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i "loop"
# Edit CoreDNS ConfigMap to fix upstream configuration
kubectl edit configmap coredns -n kube-system
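The typical loop fix, assuming the node's /etc/resolv.conf points back at the cluster DNS, is to give CoreDNS explicit upstreams instead of forwarding to the node resolver:
# In the Corefile, change:
#   forward . /etc/resolv.conf
# to explicit upstreams, for example:
#   forward . 8.8.8.8 1.1.1.1
# then restart CoreDNS to pick up the change
kubectl rollout restart deployment coredns -n kube-system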
Step 5: Verify Cluster Networking
kube-proxy
kube-proxy maintains the iptables or IPVS rules that route Service traffic to pods. When it fails, services stop working.
# Check kube-proxy pods
kubectl get pods -n kube-system -l k8s-app=kube-proxy
# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50
# Check kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
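If kube-proxy looks healthy but a Service still does not route traffic, confirm the rules were actually programmed on a node. A sketch assuming shell access to the node; <service-name> is a placeholder:
# iptables mode: look for rules referencing the Service
sudo iptables-save | grep <service-name>
# IPVS mode: list the virtual servers
sudo ipvsadm -Ln | head -20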
CNI Plugin
The Container Network Interface (CNI) plugin assigns IP addresses to pods and manages pod-to-pod networking.
# Check CNI plugin pods (varies by plugin)
kubectl get pods -n kube-system -l k8s-app=calico-node # Calico
kubectl get pods -n kube-system -l k8s-app=cilium # Cilium
kubectl get pods -n kube-system -l app=flannel # Flannel
kubectl get pods -n kube-system -l k8s-app=aws-node # AWS VPC CNI
# Test pod-to-pod connectivity
kubectl run nettest1 --image=nicolaka/netshoot --rm -it --restart=Never -- ping <another-pod-ip>
For a comprehensive networking deep dive, see our Kubernetes networking and CNI guide.
Ingress Controller
# Check ingress controller pods
kubectl get pods -n ingress-nginx
# Check ingress resources
kubectl get ingress -A
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller --tail=50
Common ingress issues (502, 503, 504 errors) are covered in detail in our Kubernetes production troubleshooting guide.
Step 6: Check Storage and PVCs
Storage issues affect all workloads that depend on persistent volumes.
# Check PVC status across all namespaces
kubectl get pvc -A | grep -v Bound
# Check PV status
kubectl get pv | grep -v Bound
# Check storage classes
kubectl get storageclass
# Check CSI driver pods
kubectl get pods -n kube-system | grep csi
Common storage problems:
- PVCs stuck in Pending: no matching PV, storage provisioner not running, or quota exceeded
- PVCs stuck in Terminating: a finalizer is blocking deletion (see the sketch below)
- Volume mount timeouts: CSI driver crash or cloud provider API rate limiting
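Before force-deleting a PVC stuck in Terminating, inspect its finalizers. A minimal sketch with placeholder names; clearing finalizers should be a last resort once you understand why deletion is blocked:
# Show what is holding the PVC
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
# Last resort: clear the finalizers so deletion can complete
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'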
Step 7: Check Cluster Events
Cluster events provide a timeline of everything happening in the cluster. This is often the fastest way to spot the root cause.
# All events sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
# Warning events only
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'
# Events for a specific namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Events related to specific resource types
kubectl get events -A --field-selector involvedObject.kind=Node
kubectl get events -A --field-selector involvedObject.kind=Pod
Key events to watch for:
| Event Reason | Meaning |
|---|---|
| FailedScheduling | Pods cannot be placed on any node |
| NodeNotReady | Node health check failing |
| Evicted | Pod evicted due to resource pressure |
| OOMKilling | Container exceeded memory limit |
| FailedMount | Volume cannot be attached |
| BackOff | Container crashing repeatedly |
| Unhealthy | Probe failure detected |
| FailedCreate | Controller cannot create pod (quota, admission webhook) |
Troubleshooting Managed Cluster Control Planes
EKS Troubleshooting
# Check EKS cluster status
aws eks describe-cluster --name <cluster-name> --query 'cluster.status'
# Check EKS add-on health
aws eks list-addons --cluster-name <cluster-name>
aws eks describe-addon --cluster-name <cluster-name> --addon-name <addon-name>
# Check VPC CNI plugin
kubectl get pods -n kube-system -l k8s-app=aws-node
kubectl logs -n kube-system -l k8s-app=aws-node --tail=20
# Check for IP address exhaustion: count pods per node (each pod consumes a VPC IP)
kubectl get pods -A -o wide | awk '{print $8}' | sort | uniq -c | sort -rn | head
Common EKS issues:
- Subnet IP exhaustion (each pod gets a real VPC IP with the AWS VPC CNI; see the check below)
- Node group launch template misconfigurations
- IAM role trust policy errors preventing node registration
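To confirm subnet IP exhaustion, check how many free IPs remain in each node group subnet. A sketch, assuming you know the subnet IDs attached to the node groups:
# Free IPs remaining per subnet
aws ec2 describe-subnets --subnet-ids <subnet-id> --query 'Subnets[].{ID:SubnetId,FreeIPs:AvailableIpAddressCount}'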
See our EKS architecture best practices for prevention strategies.
AKS Troubleshooting
# Check AKS cluster health
az aks show --resource-group <rg> --name <cluster-name> --query 'provisioningState'
# Check node pool status
az aks nodepool list --resource-group <rg> --cluster-name <cluster-name> -o table
# Get AKS diagnostics
az aks get-credentials --resource-group <rg> --name <cluster-name>
kubectl get pods -n kube-system
Common AKS issues:
- Azure subnet exhaustion with Azure CNI
- NSG rules blocking node-to-control-plane communication (see the check below)
- AAD integration token expiry
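For the NSG case, list the rules attached to the node subnet's NSG and look for anything blocking outbound 443 to the control plane. Resource names here are placeholders:
# List NSG rules on the node subnet
az network nsg rule list --resource-group <rg> --nsg-name <nsg-name> -o table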
GKE Troubleshooting
# Check GKE cluster status
gcloud container clusters describe <cluster-name> --zone <zone> --format='get(status)'
# Check node pool status
gcloud container node-pools list --cluster <cluster-name> --zone <zone>
# Check GKE system workloads
kubectl get pods -n kube-system --sort-by='.status.containerStatuses[0].restartCount'
Common GKE issues:
- Preemptible/Spot VM node pools causing frequent node rotations
- GKE Autopilot rejecting workloads that violate resource constraints
- Workload Identity misconfiguration
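For the Workload Identity case, first confirm the cluster has a workload pool configured at all; a quick check using the same flags as above:
# Empty output means Workload Identity is not enabled on the cluster
gcloud container clusters describe <cluster-name> --zone <zone> --format='get(workloadIdentityConfig.workloadPool)'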
For a detailed comparison of managed Kubernetes platforms, see our EKS vs AKS vs GKE guide.
Cluster Troubleshooting Checklist
Use this checklist during any cluster-level incident:
- [ ] Can you reach the API server? (kubectl cluster-info)
- [ ] Is the API server healthy? (kubectl get --raw /healthz)
- [ ] Are all nodes Ready? (kubectl get nodes)
- [ ] Are any nodes under resource pressure? (kubectl describe nodes | grep Pressure)
- [ ] Are control plane pods healthy? (kubectl get pods -n kube-system)
- [ ] Is CoreDNS running and resolving? (kubectl get pods -n kube-system -l k8s-app=kube-dns)
- [ ] Is kube-proxy running? (kubectl get pods -n kube-system -l k8s-app=kube-proxy)
- [ ] Is the CNI plugin running on all nodes? (check DaemonSet pod count)
- [ ] Are there pending PVCs? (kubectl get pvc -A | grep -v Bound)
- [ ] Are there warning events? (kubectl get events -A --field-selector type=Warning)
- [ ] Are admission webhooks responding? (check webhook pod health; see the sketch below)
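For the webhook item, list the registered admission webhooks and confirm their backing services have endpoints. A sketch, since webhook naming varies by install:
# Registered admission webhooks
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
# Heuristic: check that webhook services have endpoints
kubectl get endpoints -A | grep -i webhook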
Setting Up Proactive Cluster Monitoring
Prevent cluster-level incidents with these Prometheus alerting rules:
# Alert when nodes go NotReady
- alert: KubeNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 2m
  labels:
    severity: critical
# Alert when CoreDNS is degraded
- alert: CoreDNSDegraded
  expr: up{job="coredns"} == 0
  for: 1m
  labels:
    severity: critical
# Alert when API server latency is high
- alert: KubeAPIServerHighLatency
  expr: histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) > 1
  for: 5m
  labels:
    severity: warning
# Alert when etcd is unhealthy
- alert: EtcdHighCommitDuration
  expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
  for: 5m
  labels:
    severity: warning
For comprehensive monitoring setup, see our 10-layer monitoring framework for Kubernetes.
Get Expert Kubernetes Cluster Support
Cluster-level issues are the highest-severity incidents in Kubernetes because they affect every workload. Our engineers at Tasrie IT Services have managed over 100 production clusters and can help you build resilient infrastructure that prevents these problems.
Our Kubernetes consulting services include: