
Kubernetes Node Not Ready: Troubleshooting Guide


When a Kubernetes node enters the NotReady state, your cluster’s capacity drops and workloads may fail to schedule properly. This critical issue can stem from network problems, resource exhaustion, or misconfigured components. Understanding how to quickly diagnose and resolve NotReady nodes is essential for maintaining cluster health and application availability.

Understanding Node Status in Kubernetes

Kubernetes continuously monitors node health through the kubelet agent running on each node. The kubelet sends heartbeats to the control plane every 10 seconds by default. When those heartbeats stop arriving for longer than the node monitor grace period (40 seconds by default), or the node reports failing health checks, the control plane marks it as NotReady.

The STATUS column of kubectl get nodes can show several values:

  • Ready: The node is healthy and can accept pods
  • NotReady: The node is unhealthy and cannot accept new pods
  • Unknown: The control plane hasn’t received a status update recently
  • SchedulingDisabled: The node has been cordoned for maintenance

The NotReady state triggers automatic pod eviction after a timeout period, typically 5 minutes. This behavior protects your applications but can cause disruption if not addressed promptly.
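
One way to see this mechanism in action is to check the taints the control plane places on the node; a NotReady node typically carries node.kubernetes.io/not-ready:NoExecute, and pods without a matching toleration are evicted once the default 300-second toleration expires:

kubectl describe node <node-name> | grep Taints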

Common Causes of NotReady Nodes

Network Connectivity Issues

Network problems are among the most frequent causes of NotReady nodes. The kubelet must maintain communication with the API server to report status and receive instructions. Common network-related failures include:

  • DNS resolution failures preventing API server discovery
  • Firewall rules blocking required ports (6443 for API server, 10250 for kubelet)
  • Network plugin failures causing pod networking to break
  • Route table misconfigurations in cloud environments

When working with managed Kubernetes services, network issues often relate to security group configurations or VPC settings. Our Kubernetes consulting services help teams identify and resolve these infrastructure-level problems.

Resource Exhaustion

Nodes enter NotReady state when critical resources become depleted:

  • Disk pressure: The node runs out of disk space for container images or logs
  • Memory pressure: Available memory drops below the eviction threshold
  • PID exhaustion: The node hits the maximum process limit
  • CPU throttling: Sustained high CPU usage prevents kubelet operations

Resource exhaustion typically requires immediate intervention to prevent cascading failures across the cluster.

Kubelet Service Failures

The kubelet service itself may crash or become unresponsive due to:

  • Configuration errors in kubelet flags or config files
  • Certificate expiration preventing API server authentication
  • Container runtime failures (Docker, containerd, CRI-O)
  • Corrupted kubelet state or cache files
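
On kubeadm-provisioned nodes, the kubelet configuration usually lives in /var/lib/kubelet/config.yaml and its extra flags in /var/lib/kubelet/kubeadm-flags.env (exact paths vary by distribution and installer), so reviewing both is a quick first check:

sudo cat /var/lib/kubelet/config.yaml
sudo cat /var/lib/kubelet/kubeadm-flags.env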

Container Runtime Problems

Since Kubernetes relies on the container runtime to manage pods, runtime failures directly impact node status:

  • Docker daemon crashes or hangs
  • Containerd socket unavailability
  • Image pull failures consuming all disk space
  • Runtime configuration incompatibilities after upgrades

Diagnostic Steps for NotReady Nodes

Check Node Status and Conditions

Start by examining the node’s detailed status:

kubectl get nodes
kubectl describe node <node-name>

The describe output shows condition details, including reason codes and messages. Look for specific conditions like:

  • NetworkUnavailable: Network plugin hasn’t configured the node
  • MemoryPressure: Available memory is critically low
  • DiskPressure: Available disk space is insufficient
  • PIDPressure: Too many processes are running
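
If you prefer a compact view, the same conditions can be pulled with a jsonpath query:

kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'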

The events section at the bottom of the describe output often reveals the root cause. For comprehensive monitoring strategies, check out our guide on checking node CPU and memory utilization.

Verify Kubelet Service Status

SSH into the problematic node and check the kubelet service:

sudo systemctl status kubelet
sudo journalctl -u kubelet -f

The kubelet logs typically reveal authentication errors, API server connectivity problems, or configuration issues. Common error patterns include:

  • “Unable to authenticate the request”: Certificate problems
  • “Connection refused”: API server unreachable
  • “Failed to initialize CSI”: Storage plugin failures
  • “Network plugin not ready”: CNI configuration issues
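
Filtering recent kubelet logs for errors is often faster than tailing them live; adjust the time window to suit:

sudo journalctl -u kubelet --since "1 hour ago" --no-pager | grep -iE 'error|fail|unable'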

Inspect Container Runtime

Verify the container runtime is functioning:

# For Docker
sudo systemctl status docker
sudo docker ps

# For containerd
sudo systemctl status containerd
sudo crictl ps

A non-responsive runtime prevents the kubelet from managing pods, causing the node to report NotReady status.
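
If the service reports as active but the node stays NotReady, confirm the kubelet can actually reach the runtime socket; for a default containerd setup that looks like:

# Runtime and CNI status as reported over the CRI socket
sudo crictl info | head -40
ls -l /run/containerd/containerd.sock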

Check Resource Availability

Examine disk space, memory, and process counts:

df -h
free -m
ps aux | wc -l

By default, the kubelet begins garbage-collecting unused container images once disk usage passes 85%, and it reports DiskPressure (which leads to pod eviction) when available space on the node filesystem falls below 10%. It also reclaims space by removing dead containers, but heavy log or image growth can outpace that cleanup.
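
Two related checks are easy to miss: inode exhaustion (which also triggers DiskPressure even when df -h looks fine) and the space consumed by container images as reported by the runtime:

df -i
sudo crictl imagefsinfo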

Review Network Configuration

Test network connectivity to the API server:

telnet <api-server-ip> 6443
curl -k https://<api-server-ip>:6443/healthz

For clusters using network policies, verify the CNI plugin is running:

kubectl get pods -n kube-system | grep -E 'calico|flannel|weave|cilium'

Network plugin failures often require restarting the plugin pods or reapplying the network configuration. The Kubernetes official documentation provides detailed networking troubleshooting guidance.
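
On the node itself, a missing or empty CNI configuration directory is a common reason for the “Network plugin not ready” message:

ls /etc/cni/net.d/
ls /opt/cni/bin/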

Resolution Strategies

Restart Kubelet Service

Many transient issues resolve with a kubelet restart:

sudo systemctl restart kubelet
sudo systemctl status kubelet

Monitor the node status after restart:

kubectl get nodes -w

The node should transition to Ready within 30-60 seconds if the kubelet restarts successfully.

Clear Disk Space

When disk pressure causes NotReady status, free up space immediately:

# Remove unused container images
sudo docker system prune -a

# For containerd
sudo crictl rmi --prune

# Clear old logs
sudo journalctl --vacuum-time=3d

# Remove unused container data
sudo rm -rf /var/lib/docker/tmp/*

Implement log rotation and image cleanup policies to prevent future occurrences. Consider using Kubernetes cost optimization strategies that include resource management best practices.
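
As one example of a preventive step, you can cap how much disk journald is allowed to use (the 1G limit here is just a starting point; tune it for your nodes):

echo 'SystemMaxUse=1G' | sudo tee -a /etc/systemd/journald.conf
sudo systemctl restart systemd-journald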

Resolve Certificate Issues

Expired certificates prevent kubelet authentication:

# Check certificate expiration
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# Regenerate kubelet certificates
sudo rm /var/lib/kubelet/pki/kubelet-client*
sudo systemctl restart kubelet

The kubelet requests new certificates from the API server when it restarts, provided its bootstrap kubeconfig is still valid and the cluster approves kubelet CSRs.
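
If the new certificate request is not approved automatically, you can inspect and approve it manually:

kubectl get csr
kubectl certificate approve <csr-name>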

Fix Network Plugin Issues

Restart the network plugin when network unavailability causes NotReady status:

# For Calico
kubectl delete pod -n kube-system -l k8s-app=calico-node

# For Flannel (newer installs may use the kube-flannel namespace)
kubectl delete pod -n kube-system -l app=flannel

The DaemonSet controller automatically recreates the pods. Verify network functionality:

kubectl run test-pod --image=busybox --restart=Never -- sleep 3600
kubectl exec test-pod -- ping 8.8.8.8
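
A DNS lookup against the cluster API service is also worth testing, since pod networking can be up while cluster DNS is broken; clean up the test pod afterwards:

kubectl exec test-pod -- nslookup kubernetes.default
kubectl delete pod test-pod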

Restart Container Runtime

For runtime-related failures:

sudo systemctl restart docker
# or
sudo systemctl restart containerd

sudo systemctl restart kubelet

Always restart the kubelet after restarting the container runtime to ensure proper state synchronization.

Prevention and Best Practices

Implement Robust Monitoring

Proactive monitoring prevents NotReady incidents from impacting production. Deploy comprehensive observability solutions that track:

  • Node resource utilization (CPU, memory, disk, network)
  • Kubelet health and performance metrics
  • Container runtime status and response times
  • Network connectivity to control plane components

For production environments, consider our Kubernetes production support services that include 24/7 monitoring and rapid incident response.

Configure Resource Reservations

Reserve system resources for kubelet and OS processes:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "1Gi"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "1Gi"
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"

These reservations prevent resource exhaustion that leads to node failures.
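
After applying the configuration and restarting the kubelet, you can confirm the reservations took effect by comparing the node's capacity and allocatable values:

kubectl describe node <node-name> | grep -A 7 -E 'Capacity|Allocatable'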

Automate Node Remediation

Implement automated remediation for common failure scenarios:

  • Use Node Problem Detector to identify and report node issues
  • Deploy Draino or similar tools for automatic node draining
  • Create runbooks for operations teams with step-by-step recovery procedures
  • Set up alerts for NotReady nodes with escalation policies

The CNCF landscape includes several tools for automated node management and self-healing.

Maintain Regular Updates

Keep cluster components current to avoid known bugs:

  • Update kubelet, container runtime, and kernel regularly
  • Test updates in staging before production deployment
  • Review release notes for breaking changes
  • Maintain consistent versions across all nodes

Version skew between control plane and nodes can cause unexpected NotReady conditions.
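
A quick way to spot skew is to compare the kubelet version on each node with the control plane version:

kubectl get nodes -o wide
kubectl version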

Plan for Capacity

Prevent resource exhaustion through proper capacity planning:

  • Monitor trends in resource consumption
  • Add nodes before utilization reaches critical thresholds
  • Implement horizontal pod autoscaling to distribute load
  • Use cluster autoscaling for dynamic capacity adjustment

Our EKS architecture best practices guide covers capacity planning strategies for cloud-based clusters.
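
For a quick read on current consumption, kubectl top (which requires the metrics-server add-on) is a reasonable starting point for trend monitoring:

kubectl top nodes
kubectl top pods -A --sort-by=memory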

Advanced Troubleshooting Scenarios

Persistent NotReady After All Fixes

When standard remediation fails, investigate deeper issues:

  1. Check for kernel panics or hardware failures in system logs
  2. Verify NTP synchronization across cluster nodes
  3. Examine cloud provider API rate limiting or quota issues
  4. Review security policies that might block kubelet operations
  5. Check for corrupted etcd data affecting node registration
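
The first two checks in that list can be run directly on the affected node:

# Kernel panics, OOM kills, or hardware errors
sudo dmesg -T | grep -iE 'panic|oom|hardware error'
# “System clock synchronized: yes” confirms NTP is working
timedatectl status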

In extreme cases, draining and replacing the node may be necessary:

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>

Then provision a new node to replace the failed one.

Multiple Nodes NotReady Simultaneously

Cluster-wide NotReady events indicate systemic problems:

  • Control plane overload or failure
  • Network infrastructure issues affecting the entire cluster
  • Shared storage system failures
  • Cloud provider service disruptions
  • Certificate authority problems affecting all nodes

These scenarios require immediate escalation and often involve multiple teams. Having expert support available becomes critical during such incidents.

Conclusion

Kubernetes node NotReady issues demand rapid diagnosis and resolution to maintain cluster health and application availability. By understanding common causes, following systematic diagnostic procedures, and implementing preventive measures, you can minimize downtime and improve overall cluster reliability.

Key takeaways for managing NotReady nodes:

  • Monitor node health continuously with comprehensive observability tools
  • Maintain runbooks for common failure scenarios
  • Reserve adequate system resources to prevent exhaustion
  • Keep cluster components updated and properly configured
  • Implement automated remediation where possible

For complex Kubernetes environments or when you need expert assistance, professional support can dramatically reduce resolution time and prevent recurring issues. Whether you’re running clusters on AWS, Azure, or GCP, having experienced practitioners available ensures your infrastructure remains stable and performant.
