KUBERNETES

How to Fix Kubernetes Node Not Ready: We Resolved 200+ Cases (2026 Guide)

Engineering Team 2026-04-26

A Kubernetes node stuck in NotReady status means your cluster has lost capacity and workloads are at risk. At Tasrie IT Services, we have resolved over 200 Node NotReady incidents across EKS, AKS, and GKE clusters. In most cases, the fix takes under 10 minutes once you know where to look.

This guide walks through the exact steps we follow when a node goes NotReady, starting with the fastest diagnostic commands and working through every common root cause with tested fixes.

Quick Diagnosis: Find the Root Cause in 60 Seconds

Before diving into specific fixes, run these three commands to narrow down the problem immediately:

# Step 1: Confirm which nodes are NotReady
kubectl get nodes

# Step 2: Get detailed conditions and events
kubectl describe node <node-name>

# Step 3: Check recent cluster events
kubectl get events --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp'

The describe node output contains a Conditions section that tells you exactly why the node is NotReady. Look for these condition types:

Condition                  | Meaning                      | Likely Fix
---------------------------|------------------------------|-----------------------------
Ready: False               | Kubelet stopped reporting    | Restart kubelet
Ready: Unknown             | No heartbeat received        | Network issue or node down
MemoryPressure: True       | Memory critically low        | Free memory or add capacity
DiskPressure: True         | Disk space exhausted         | Clean up disk
PIDPressure: True          | Too many processes           | Kill runaway processes
NetworkUnavailable: True   | CNI plugin not configured    | Fix network plugin
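
If you want just the conditions without the rest of the describe output, a JSONPath query works well (the node name is a placeholder):

# Print each condition as Type=Status, one per line
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'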

Fix 1: Restart the Kubelet

The kubelet is the agent that reports node health to the control plane. When it crashes or hangs, the node goes NotReady. This is the most common cause we see.

Symptoms: The node was working recently, and the Ready condition shows False with a message about the kubelet not posting status.

# SSH into the node
ssh <node-ip>

# Check kubelet status
sudo systemctl status kubelet

# Check kubelet logs for errors
sudo journalctl -u kubelet --since "10 minutes ago" --no-pager

# Restart kubelet
sudo systemctl restart kubelet

After restarting, verify the node recovers:

kubectl get nodes -w

The node should transition to Ready within 30-60 seconds. If it does not, check the kubelet logs again for recurring errors.

Common kubelet log errors and what they mean:

  • "Unable to register node" — Certificate or API server connectivity issue
  • "Container runtime is not ready" — containerd or Docker is down
  • "PLEG is not healthy" — Pod lifecycle event generator is stalled (see Fix 6)
  • "failed to run Kubelet: running with swap on is not supported" — Swap is enabled on the node

Fix 2: Resolve Disk Pressure

When a node runs out of disk space, the kubelet marks it with DiskPressure: True and eventually NotReady. Container images, logs, and emptyDir volumes are the usual culprits.

# SSH into the node and check disk usage
df -h

# Find large files consuming space
du -sh /var/lib/docker/* 2>/dev/null | sort -rh | head -10
du -sh /var/lib/containerd/* 2>/dev/null | sort -rh | head -10

# Check container log sizes
find /var/log/containers -name "*.log" -size +100M -exec ls -lh {} \;

Fix the disk pressure:

# Remove unused container images
sudo crictl rmi --prune

# For Docker-based nodes
sudo docker system prune -af

# Clear old journal logs
sudo journalctl --vacuum-time=2d

# Remove completed containers
sudo crictl rm $(sudo crictl ps -a --state exited -q)

Prevent recurrence: Configure kubelet garbage collection thresholds and implement log rotation. The kubelet has built-in image garbage collection that triggers at 85% disk usage by default. You can tune this with --image-gc-high-threshold and --image-gc-low-threshold flags.
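
If you manage the kubelet configuration file directly, the same garbage collection thresholds and container log rotation can be set declaratively. This is a sketch with illustrative values, not tuned recommendations:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Image GC starts at the high threshold and frees space until usage drops below the low threshold
imageGCHighThresholdPercent: 80
imageGCLowThresholdPercent: 70
# Rotate container logs before they fill the disk
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 3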

For a deeper look at node resource monitoring, see our guide on checking node CPU and memory utilisation in Kubernetes.

Fix 3: Resolve Memory Pressure

Memory exhaustion causes the kubelet to mark the node with MemoryPressure: True. If available memory falls below the hard eviction threshold (memory.available, 100Mi by default), the kubelet starts evicting pods and may stop responding entirely.

# Check memory on the node
free -m

# Find top memory consumers
ps aux --sort=-%mem | head -20

# Check kubelet memory thresholds
cat /var/lib/kubelet/config.yaml | grep -A5 evictionHard

Immediate fix:

# Identify and kill memory-hogging processes (not kubelet or containerd)
top -o %MEM

# Clear system caches (safe, temporary relief)
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Long-term fix: Set proper resource requests and limits on all pods so the scheduler distributes workloads evenly. Configure kubelet resource reservations to protect system processes:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  memory: "1Gi"
systemReserved:
  memory: "1Gi"
evictionHard:
  memory.available: "200Mi"
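
To go with the reservations above, here is a minimal example of per-container requests and limits; the pod name, image, and values are placeholders to adapt to your workloads:

apiVersion: v1
kind: Pod
metadata:
  name: example-app            # placeholder name
spec:
  containers:
    - name: app
      image: example/app:latest   # placeholder image
      resources:
        requests:
          memory: "256Mi"
          cpu: "250m"
        limits:
          memory: "512Mi"
          cpu: "500m"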

Fix 4: Repair Container Runtime (containerd/Docker)

If the container runtime crashes, the kubelet cannot manage pods and the node goes NotReady. The kubelet logs will show messages like "Container runtime is not ready" or "CRI endpoint not reachable".

# Check container runtime status
sudo systemctl status containerd
# or for Docker
sudo systemctl status docker

# Check runtime logs
sudo journalctl -u containerd --since "10 minutes ago"

Fix:

# Restart the container runtime
sudo systemctl restart containerd
# Then restart kubelet
sudo systemctl restart kubelet

If containerd is stuck due to a corrupted state:

# Stop both services
sudo systemctl stop kubelet
sudo systemctl stop containerd

# Clean containerd runtime state (running containers on this node are disrupted and recreated when the services come back up)
sudo rm -rf /run/containerd/*

# Start services
sudo systemctl start containerd
sudo systemctl start kubelet

Always restart kubelet after restarting the container runtime. The kubelet needs to re-establish its connection to the runtime socket.
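
A quick way to confirm the connection is back is to query the runtime over the CRI socket with crictl, which talks to the same socket the kubelet uses (assuming /etc/crictl.yaml points at your runtime endpoint):

# Show runtime status, including the RuntimeReady and NetworkReady conditions
sudo crictl info

# List containers to confirm the runtime is responding
sudo crictl ps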

Fix 5: Fix Network and CNI Plugin Issues

When the network plugin (Calico, Cilium, Flannel, Weave) fails on a node, the NetworkUnavailable condition becomes True and the node may go NotReady.

# Check if CNI plugin pods are running on the affected node
kubectl get pods -n kube-system -o wide | grep <node-name>

# Check CNI configuration on the node
ls /etc/cni/net.d/
cat /etc/cni/net.d/*.conf

# Check CNI binary exists
ls /opt/cni/bin/

Fix for Calico:

# Delete the calico-node pod on the affected node (DaemonSet recreates it)
kubectl delete pod -n kube-system -l k8s-app=calico-node --field-selector spec.nodeName=<node-name>

Fix for Cilium:

kubectl delete pod -n kube-system -l k8s-app=cilium --field-selector spec.nodeName=<node-name>

If CNI config is missing or corrupted, the quickest fix is to reinstall the CNI plugin. Check the Kubernetes networking documentation for plugin-specific instructions.
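
After restarting the CNI pod, confirm the DaemonSet is healthy and that the node's NetworkUnavailable condition has cleared. The DaemonSet name below assumes Calico; adjust it for your plugin:

# Wait for the CNI DaemonSet to roll out cleanly (Calico shown as an example)
kubectl rollout status daemonset/calico-node -n kube-system

# The NetworkUnavailable condition should report False (some plugins omit it entirely)
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")].status}'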

For a comprehensive understanding of Kubernetes networking, see our CNI and service mesh guide.

Fix 6: Handle PLEG Not Healthy

Pod Lifecycle Event Generator (PLEG) is a kubelet component that monitors container state changes. When PLEG falls behind, the kubelet stops reporting healthy status and the node goes NotReady.

Symptoms: Kubelet logs show "PLEG is not healthy: pleg was last seen active X ago".

Common causes:

  • Too many pods on the node (approaching the --max-pods limit)
  • Slow container runtime responses
  • High I/O wait on the node
  • A single container in a hung state blocking the PLEG relist

# Check number of pods on the node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> --no-headers | wc -l

# Check I/O wait on the node
iostat -x 1 5

# Check for stuck containers
sudo crictl ps -a | grep -i "created\|unknown"

Fix:

# Remove any stuck containers
sudo crictl rm <container-id>

# If too many pods, consider draining the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Restart kubelet
sudo systemctl restart kubelet

# Uncordon after recovery
kubectl uncordon <node-name>
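
To judge whether the node is genuinely near its pod limit, compare the pod count from the diagnosis step against the node's capacity (110 is the usual kubelet default):

# Maximum number of pods the kubelet on this node will accept
kubectl get node <node-name> -o jsonpath='{.status.capacity.pods}'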

Fix 7: Resolve Certificate Expiration

When kubelet certificates expire, the kubelet cannot authenticate with the API server. The node shows Ready: Unknown because the control plane stops receiving heartbeats.

# Check certificate expiration on the node
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# Check if certificate rotation is enabled
cat /var/lib/kubelet/config.yaml | grep rotateCertificates

Fix:

# Delete expired certificates (kubelet requests new ones on restart)
sudo rm /var/lib/kubelet/pki/kubelet-client-current.pem
sudo rm /var/lib/kubelet/pki/kubelet.crt
sudo rm /var/lib/kubelet/pki/kubelet.key

# Restart kubelet
sudo systemctl restart kubelet
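
If the node stays NotReady after the restart, the kubelet's new certificate signing request may be waiting for approval; automatic approval can be disabled or misconfigured on some clusters:

# Look for pending CSRs from the affected node
kubectl get csr --sort-by=.metadata.creationTimestamp

# Approve the pending request so the kubelet receives a fresh client certificate
kubectl certificate approve <csr-name>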

Enable automatic certificate rotation to prevent this in the future:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
rotateCertificates: true

If you are running a managed Kubernetes service, certificate rotation is typically handled automatically. Check our EKS architecture best practices guide for AWS-specific considerations.

Cloud-Specific Fixes

EKS (Amazon Web Services)

# Check if the node's IAM instance profile has the required permissions
aws ec2 describe-instances --instance-ids <instance-id> --query 'Reservations[].Instances[].IamInstanceProfile'

# From the node, verify it can reach the EKS API server endpoint
curl -k https://<eks-endpoint>/healthz

# Check if the AWS VPC CNI plugin is running
kubectl get pods -n kube-system -l k8s-app=aws-node

Common EKS-specific causes:

  • IAM role missing AmazonEKSWorkerNodePolicy or AmazonEKS_CNI_Policy (see the check after this list)
  • Security group rules blocking kubelet-to-API-server communication on port 443
  • Node running out of ENI-attachable IPs (secondary IP exhaustion)
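
To rule out the missing-policy case, list the managed policies attached to the node's IAM role (the role name is a placeholder):

# List managed policies attached to the worker node role
aws iam list-attached-role-policies --role-name <node-role-name>

# The output should include AmazonEKSWorkerNodePolicy and AmazonEKS_CNI_Policy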

AKS (Microsoft Azure)

# Check node status through Azure CLI
az aks nodepool list --resource-group <rg> --cluster-name <cluster> -o table

# Check networking pods (Azure CNI, kube-proxy) on the affected node
kubectl get pods -n kube-system -o wide | grep <node-name>

# Verify node can reach AKS API
kubectl cluster-info

Common AKS-specific causes:

  • Azure subnet IP address exhaustion (especially with Azure CNI)
  • Network Security Group (NSG) rules blocking required ports
  • Node pool VM scale set issues

For more on AKS troubleshooting, see our AKS consulting services.

GKE (Google Cloud)

# Check node pool status
gcloud container node-pools list --cluster <cluster-name> --zone <zone>

# Check if nodes hit resource quotas
gcloud compute instances describe <instance-name> --zone <zone>

Common GKE-specific causes:

  • Preemptible/Spot VM instances being reclaimed
  • GKE node auto-repair already in progress (check node pool settings; see the command after this list)
  • Insufficient quota for the node machine type
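
To check the auto-repair point, the node pool's management settings show whether GKE may already be repairing the node (flags follow the zonal syntax used earlier):

# Show whether auto-repair and auto-upgrade are enabled for the pool
gcloud container node-pools describe <pool-name> --cluster <cluster-name> --zone <zone> --format="value(management)"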

Fix 8: Handle Multiple Nodes NotReady Simultaneously

When multiple nodes go NotReady at the same time, the problem is almost always at the infrastructure or control plane level, not individual nodes.

Check the control plane first:

kubectl cluster-info
kubectl get componentstatuses 2>/dev/null
kubectl get --raw /healthz

Common causes of cluster-wide NotReady:

  • API server overload or crash
  • etcd quorum loss
  • Network partition between nodes and control plane
  • Cloud provider availability zone outage
  • Expired cluster CA certificate

Immediate actions:

# Check if the issue is zone-specific (adds a ZONE column next to each node's status)
kubectl get nodes -L topology.kubernetes.io/zone

# If zone-specific, check cloud provider status page
# If cluster-wide, check control plane components
kubectl get pods -n kube-system
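
If the control plane itself looks suspect, the API server's aggregated health endpoints break the result down per check, including etcd (this requires API server connectivity from wherever you run kubectl):

# Per-check readiness detail, including etcd
kubectl get --raw '/readyz?verbose'

# Liveness detail for the API server
kubectl get --raw '/livez?verbose'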

For comprehensive cluster recovery procedures, see our Kubernetes disaster recovery playbook.

When to Drain and Replace Instead of Fix

Sometimes fixing a node is not worth the effort. Consider draining and replacing when:

  • The node has been NotReady for over 30 minutes with no clear cause
  • You see kernel panics or hardware errors in system logs
  • The node is running an outdated OS or kubelet version
  • The root filesystem is corrupted

# Safely drain workloads from the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s

# Remove the node from the cluster
kubectl delete node <node-name>

# Terminate and replace the instance (cloud-specific)
# EKS: Terminate the EC2 instance; ASG provisions a new one
# AKS: az aks nodepool scale, then scale back up
# GKE: gcloud compute instances delete; MIG provisions a new one

Prevention Checklist

Prevent Node NotReady incidents before they happen:

  • Monitor node conditions with Prometheus alerts for kube_node_status_condition{condition="Ready",status="true"} == 0 (example rule after this list)
  • Set resource reservations on every node to protect kubelet and system processes
  • Enable certificate auto-rotation in kubelet configuration
  • Configure log rotation to prevent disk exhaustion from container logs
  • Use Node Problem Detector to catch hardware and kernel issues early
  • Set up cluster autoscaler to replace unhealthy nodes automatically
  • Run regular Kubernetes upgrades to avoid known kubelet bugs
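
If you run the Prometheus Operator, that alert can be expressed as a PrometheusRule. This is a sketch that assumes kube-state-metrics is installed and that your operator watches the monitoring namespace:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-not-ready          # placeholder name
  namespace: monitoring         # adjust to where your operator watches for rules
spec:
  groups:
    - name: node-health
      rules:
        - alert: KubeNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has been NotReady for more than 5 minutes"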

Common Mistakes to Avoid

From our experience resolving hundreds of Node NotReady incidents, these are the mistakes teams make most often:

  1. Force-deleting pods instead of fixing the node — this does not solve the underlying problem and can cause data loss for stateful workloads
  2. Rebooting the node without checking logs first — you lose diagnostic information
  3. Ignoring DiskPressure warnings — these always escalate to NotReady if not addressed
  4. Not reserving system resources — without kubelet and system reservations, application pods can starve the kubelet itself
  5. Skipping the container runtime check — many teams jump straight to kubelet but the runtime is the actual cause in 30% of cases we see

For a broader overview of Kubernetes issues and debugging methodology, see our Kubernetes troubleshooting guide covering 20 common production issues.


Need Help Fixing Kubernetes Node Issues?

Node NotReady incidents can cascade into full service outages if not resolved quickly. Our engineers at Tasrie IT Services have resolved hundreds of these incidents and can help you build the monitoring and automation to prevent them.

Our Kubernetes consulting services include:

  • 24/7 incident response for production Kubernetes clusters across EKS, AKS, and GKE
  • Proactive monitoring setup with custom alerts for node health conditions
  • Infrastructure hardening with proper resource reservations, certificate rotation, and auto-remediation

We have managed production clusters for startups and enterprises across multiple cloud providers.

Talk to our Kubernetes engineers today →
