Troubleshooting a Kubernetes cluster is different from troubleshooting a single pod or deployment. Cluster-level issues affect everything: the control plane, networking, storage, and all workloads running on the cluster. At Tasrie IT Services, we manage over 100 production clusters across EKS, AKS, and GKE, and cluster-level problems are always the highest-severity incidents we handle.
This guide focuses on cluster-wide troubleshooting: the control plane, etcd, networking infrastructure, DNS, and node fleet health. For pod-level debugging, see our pod troubleshooting guide.
Step 1: Verify Cluster Reachability
Before anything else, confirm you can talk to the cluster.
# Test API server connectivity
kubectl cluster-info
# Check API server health
kubectl get --raw /healthz
# If that fails, check your kubeconfig
kubectl config current-context
kubectl config view --minify
If the API server is unreachable:
| Scenario | Diagnosis | Action |
|---|---|---|
| Connection refused | API server process is down | Check control plane node (self-managed) or cloud provider status (managed) |
| Connection timeout | Network issue between you and API server | Check VPN, firewall rules, security groups |
| Unauthorized | Credentials expired or revoked | Refresh kubeconfig (see cloud-specific commands below) |
| TLS handshake error | Certificate mismatch or expiry | Check API server certificates |
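For TLS handshake errors, check the certificate's validity window before anything else. A minimal sketch, assuming openssl is installed locally and, for the kubeadm command, shell access to a control plane node; <api-server-endpoint> is a placeholder:
# Check the API server certificate dates (replace <api-server-endpoint> with your cluster endpoint)
echo | openssl s_client -connect <api-server-endpoint>:6443 2>/dev/null | openssl x509 -noout -dates
# On kubeadm clusters, check all control plane certificates at once
sudo kubeadm certs check-expiration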
Refresh credentials for managed clusters:
# EKS
aws eks update-kubeconfig --name <cluster-name> --region <region>
# AKS
az aks get-credentials --resource-group <rg> --name <cluster-name> --overwrite-existing
# GKE
gcloud container clusters get-credentials <cluster-name> --zone <zone>
Step 2: Check Control Plane Components
The control plane runs the API server, scheduler, controller manager, and etcd. On managed clusters, the cloud provider handles these, but you can still observe their health.
Self-Managed Clusters
# Check control plane pods
kubectl get pods -n kube-system -l tier=control-plane
# Check each component
kubectl logs kube-apiserver-<node> -n kube-system --tail=50
kubectl logs kube-scheduler-<node> -n kube-system --tail=50
kubectl logs kube-controller-manager-<node> -n kube-system --tail=50
# On the control plane node
sudo journalctl -u kubelet --since "30 minutes ago" | grep -i "error\|fail"
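The control plane also includes etcd, and on kubeadm clusters you can probe it directly. A minimal sketch, assuming the default static pod name and certificate paths:
# Check etcd health from inside the etcd static pod (kubeadm defaults assumed)
kubectl -n kube-system exec etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health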
Managed Clusters
On EKS, AKS, and GKE, you cannot access control plane components directly. Instead:
# Check API server responsiveness
time kubectl get nodes
# Check if the API server is throttling requests
kubectl get --raw /metrics | grep apiserver_request_total | grep -i "429\|503"
# Check cloud-specific control plane logs
# EKS: CloudWatch Log Groups → /aws/eks/<cluster-name>/cluster
# AKS: Azure Monitor → Diagnostic settings → kube-apiserver, kube-controller-manager
# GKE: Cloud Logging → resource.type="k8s_cluster"
If the API server is slow (taking more than 1 second for simple commands), the cluster is under stress. Common causes include too many LIST operations on large objects, webhook timeouts, or etcd performance issues.
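To see what is generating that load, break the request counters down by verb and resource. A rough sketch, assuming the default apiserver metric labels:
# Find the busiest LIST request series (label names assume default apiserver metrics)
kubectl get --raw /metrics | grep apiserver_request_total | grep 'verb="LIST"' | sort -k2 -rn | head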
Step 3: Assess Node Fleet Health
# Get all node statuses
kubectl get nodes
# Detailed view with conditions
kubectl get nodes -o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].status,REASON:.status.conditions[-1].reason,AGE:.metadata.creationTimestamp
# Check resource pressure across all nodes
kubectl describe nodes | grep -E "Name:|Conditions:" -A 5
# Resource usage
kubectl top nodes
Red flags to watch for:
- Multiple nodes in NotReady status: likely a cluster-wide issue, not individual node problems
- All nodes showing MemoryPressure or DiskPressure: workloads are over-provisioned for the cluster size
- Nodes with very different ages: could indicate auto-scaling issues or failed node replacements
For individual node troubleshooting, see our Node NotReady fix guide.
Step 4: Check DNS (CoreDNS)
DNS is the foundation of service discovery in Kubernetes. When CoreDNS fails, every service-to-service call breaks.
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Test DNS resolution
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml
Common CoreDNS failures:
| Symptom | Cause | Fix |
|---|---|---|
| CoreDNS pods CrashLoopBackOff | Corrupted ConfigMap or loop detection | Fix the Corefile ConfigMap |
| DNS timeout from pods | CoreDNS overwhelmed | Increase replicas, add NodeLocal DNSCache |
| Partial DNS failures | Some pods cannot resolve | Check ndots setting and search domains (see the check below) |
| External DNS not resolving | Missing upstream DNS config | Check CoreDNS forward directive |
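For the ndots case, inspect the resolv.conf a pod actually receives; the same busybox image used above works:
# Show the search domains and ndots value inside a pod
kubectl run dnsconf --image=busybox:1.28 --rm -it --restart=Never -- cat /etc/resolv.conf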
Fix CoreDNS loop issues:
# Check if the loop plugin is detecting a forwarding loop
kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i "loop"
# Edit CoreDNS ConfigMap to fix upstream configuration
kubectl edit configmap coredns -n kube-system
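The typical loop fix, assuming the node's /etc/resolv.conf points back at the cluster DNS, is to give CoreDNS explicit upstreams instead of forwarding to the node resolver:
# In the Corefile, change:
#   forward . /etc/resolv.conf
# to explicit upstreams, for example:
#   forward . 8.8.8.8 1.1.1.1
# then restart CoreDNS to pick up the change
kubectl rollout restart deployment coredns -n kube-system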
Step 5: Verify Cluster Networking
kube-proxy
kube-proxy maintains the iptables or IPVS rules that route Service traffic to pods. When it fails, services stop working.
# Check kube-proxy pods
kubectl get pods -n kube-system -l k8s-app=kube-proxy
# Check kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50
# Check kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
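If kube-proxy looks healthy but a Service still does not route traffic, confirm the rules were actually programmed on a node. A sketch assuming shell access to the node; <service-name> is a placeholder:
# iptables mode: look for rules referencing the Service
sudo iptables-save | grep <service-name>
# IPVS mode: list the virtual servers
sudo ipvsadm -Ln | head -20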
CNI Plugin
The Container Network Interface (CNI) plugin assigns IP addresses to pods and manages pod-to-pod networking.
# Check CNI plugin pods (varies by plugin)
kubectl get pods -n kube-system -l k8s-app=calico-node # Calico
kubectl get pods -n kube-system -l k8s-app=cilium # Cilium
kubectl get pods -n kube-system -l app=flannel # Flannel
kubectl get pods -n kube-system -l k8s-app=aws-node # AWS VPC CNI
# Test pod-to-pod connectivity
kubectl run nettest1 --image=nicolaka/netshoot --rm -it --restart=Never -- ping <another-pod-ip>
For a comprehensive networking deep dive, see our Kubernetes networking and CNI guide.
Ingress Controller
# Check ingress controller pods
kubectl get pods -n ingress-nginx
# Check ingress resources
kubectl get ingress -A
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller --tail=50
Common ingress issues (502, 503, 504 errors) are covered in detail in our Kubernetes production troubleshooting guide.
Step 6: Check Storage and PVCs
Storage issues affect all workloads that depend on persistent volumes.
# Check PVC status across all namespaces
kubectl get pvc -A | grep -v Bound
# Check PV status
kubectl get pv | grep -v Bound
# Check storage classes
kubectl get storageclass
# Check CSI driver pods
kubectl get pods -n kube-system | grep csi
Common storage problems:
- PVCs stuck in Pending: no matching PV, storage provisioner not running, or quota exceeded
- PVCs stuck in Terminating: a finalizer is blocking deletion (see the sketch below)
- Volume mount timeouts: CSI driver crash or cloud provider API rate limiting
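Before force-deleting a PVC stuck in Terminating, inspect its finalizers. A minimal sketch with placeholder names; clearing finalizers should be a last resort once you understand why deletion is blocked:
# Show what is holding the PVC
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
# Last resort: clear the finalizers so deletion can complete
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'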
Step 7: Check Cluster Events
Cluster events provide a timeline of everything happening in the cluster. This is often the fastest way to spot the root cause.
# All events sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
# Warning events only
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'
# Events for a specific namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Events related to specific resource types
kubectl get events -A --field-selector involvedObject.kind=Node
kubectl get events -A --field-selector involvedObject.kind=Pod
Key events to watch for:
| Event Reason | Meaning |
|---|---|
| FailedScheduling | Pods cannot be placed on any node |
| NodeNotReady | Node health check failing |
| Evicted | Pod evicted due to resource pressure |
| OOMKilling | Container exceeded memory limit |
| FailedMount | Volume cannot be attached |
| BackOff | Container crashing repeatedly |
| Unhealthy | Probe failure detected |
| FailedCreate | Controller cannot create pod (quota, admission webhook) |
Troubleshooting Managed Cluster Control Planes
EKS Troubleshooting
# Check EKS cluster status
aws eks describe-cluster --name <cluster-name> --query 'cluster.status'
# Check EKS add-on health
aws eks list-addons --cluster-name <cluster-name>
aws eks describe-addon --cluster-name <cluster-name> --addon-name <addon-name>
# Check VPC CNI plugin
kubectl get pods -n kube-system -l k8s-app=aws-node
kubectl logs -n kube-system -l k8s-app=aws-node --tail=20
# Check for IP address exhaustion: count pods per node (each pod consumes a VPC IP)
kubectl get pods -A -o wide | awk '{print $8}' | sort | uniq -c | sort -rn | head
Common EKS issues:
- Subnet IP exhaustion (each pod gets a real VPC IP with the AWS VPC CNI; see the check below)
- Node group launch template misconfigurations
- IAM role trust policy errors preventing node registration
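To confirm subnet IP exhaustion, check how many free IPs remain in each node group subnet. A sketch, assuming you know the subnet IDs attached to the node groups:
# Free IPs remaining per subnet
aws ec2 describe-subnets --subnet-ids <subnet-id> --query 'Subnets[].{ID:SubnetId,FreeIPs:AvailableIpAddressCount}'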
See our EKS architecture best practices for prevention strategies.
AKS Troubleshooting
# Check AKS cluster health
az aks show --resource-group <rg> --name <cluster-name> --query 'provisioningState'
# Check node pool status
az aks nodepool list --resource-group <rg> --cluster-name <cluster-name> -o table
# Get AKS diagnostics
az aks get-credentials --resource-group <rg> --name <cluster-name>
kubectl get pods -n kube-system
Common AKS issues:
- Azure subnet exhaustion with Azure CNI
- NSG rules blocking node-to-control-plane communication (see the check below)
- AAD integration token expiry
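For the NSG case, list the rules attached to the node subnet's NSG and look for anything blocking outbound 443 to the control plane. Resource names here are placeholders:
# List NSG rules on the node subnet
az network nsg rule list --resource-group <rg> --nsg-name <nsg-name> -o table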
GKE Troubleshooting
# Check GKE cluster status
gcloud container clusters describe <cluster-name> --zone <zone> --format='get(status)'
# Check node pool status
gcloud container node-pools list --cluster <cluster-name> --zone <zone>
# Check GKE system workloads
kubectl get pods -n kube-system --sort-by='.status.containerStatuses[0].restartCount'
Common GKE issues:
- Preemptible/Spot VM node pools causing frequent node rotations
- GKE Autopilot rejecting workloads that violate resource constraints
- Workload Identity misconfiguration
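For the Workload Identity case, first confirm the cluster has a workload pool configured at all; a quick check using the same flags as above:
# Empty output means Workload Identity is not enabled on the cluster
gcloud container clusters describe <cluster-name> --zone <zone> --format='get(workloadIdentityConfig.workloadPool)'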
For a detailed comparison of managed Kubernetes platforms, see our EKS vs AKS vs GKE guide.
Cluster Troubleshooting Checklist
Use this checklist during any cluster-level incident:
- [ ] Can you reach the API server? (kubectl cluster-info)
- [ ] Is the API server healthy? (kubectl get --raw /healthz)
- [ ] Are all nodes Ready? (kubectl get nodes)
- [ ] Are any nodes under resource pressure? (kubectl describe nodes | grep Pressure)
- [ ] Are control plane pods healthy? (kubectl get pods -n kube-system)
- [ ] Is CoreDNS running and resolving? (kubectl get pods -n kube-system -l k8s-app=kube-dns)
- [ ] Is kube-proxy running? (kubectl get pods -n kube-system -l k8s-app=kube-proxy)
- [ ] Is the CNI plugin running on all nodes? (check DaemonSet pod count)
- [ ] Are there pending PVCs? (kubectl get pvc -A | grep -v Bound)
- [ ] Are there warning events? (kubectl get events -A --field-selector type=Warning)
- [ ] Are admission webhooks responding? (check webhook pod health; see the sketch below)
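For the webhook item, list the registered admission webhooks and confirm their backing services have endpoints. A sketch, since webhook naming varies by install:
# Registered admission webhooks
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
# Heuristic: check that webhook services have endpoints
kubectl get endpoints -A | grep -i webhook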
Setting Up Proactive Cluster Monitoring
Prevent cluster-level incidents with these Prometheus alerting rules:
# Alert when nodes go NotReady
- alert: KubeNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 2m
  labels:
    severity: critical
# Alert when CoreDNS is degraded
- alert: CoreDNSDegraded
  expr: up{job="coredns"} == 0
  for: 1m
  labels:
    severity: critical
# Alert when API server latency is high
- alert: KubeAPIServerHighLatency
  expr: histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) > 1
  for: 5m
  labels:
    severity: warning
# Alert when etcd is unhealthy
- alert: EtcdHighCommitDuration
  expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
  for: 5m
  labels:
    severity: warning
For comprehensive monitoring setup, see our 10-layer monitoring framework for Kubernetes.
Get Expert Kubernetes Cluster Support
Cluster-level issues are the highest-severity incidents in Kubernetes because they affect every workload. Our engineers at Tasrie IT Services have managed over 100 production clusters and can help you build resilient infrastructure that prevents these problems.
Our Kubernetes consulting services include: