Autoscaling in Kubernetes is not a single mechanism. It is two distinct systems operating at different layers of your infrastructure: the Horizontal Pod Autoscaler (HPA) scales your application replicas, while the Cluster Autoscaler (CA) scales the underlying nodes those replicas run on. Running one without the other creates blind spots that lead to either wasted resources or stuck deployments.
We run both in every production cluster we manage, and the difference between a well-tuned autoscaling setup and a broken one usually comes down to understanding how these two controllers interact. This guide covers exactly that, including the configuration details, common pitfalls, and the patterns we have validated across 100+ production Kubernetes environments.
Quick Comparison: Cluster Autoscaler vs HPA
Before diving into the details, here is a side-by-side breakdown of how these two autoscalers differ.
| Feature | Horizontal Pod Autoscaler (HPA) | Cluster Autoscaler (CA) |
|---|---|---|
| What it scales | Pod replicas within a deployment | Nodes in the cluster |
| Scaling trigger | Metric thresholds (CPU, memory, custom) | Pending pods that cannot be scheduled |
| Scale-down trigger | Metrics drop below target | Node utilization below threshold |
| Default check interval | 15 seconds | 10 seconds |
| Scale-down delay | Configurable via behavior policies | 10 minutes (scale-down-unneeded-time) |
| Requires Metrics Server | Yes | No (watches pod scheduling status) |
| Cloud provider dependency | None | Yes (needs cloud API to add/remove nodes) |
| Can scale to zero | No (minimum 1 replica) | Yes (can remove all nodes in a node group) |
| Configuration scope | Per-deployment/workload | Cluster-wide |
| Kubernetes resource | HorizontalPodAutoscaler | Deployment (runs as a pod) |
Understanding this table is essential for designing a scaling strategy that handles traffic spikes without wasting money on idle infrastructure. For a deeper look at controlling Kubernetes spending, see our guide on Kubernetes cost optimization strategies.
What the Horizontal Pod Autoscaler Does
The HPA is a Kubernetes controller that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics. It operates entirely at the application layer. The HPA does not know or care about nodes; it only cares about whether your workload needs more or fewer replicas to meet its performance targets.
How HPA Works Internally
The HPA controller runs a control loop that executes every 15 seconds by default (configurable via --horizontal-pod-autoscaler-sync-period on the kube-controller-manager). Each cycle follows this process:
- Query the Metrics API for the current value of the target metric
- Compare the current value against the target value defined in the HPA spec
- Calculate the desired replica count using the formula:
- Calculate the desired replica count using the formula: `desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))`
- Apply scaling constraints (`minReplicas`, `maxReplicas`) and behavior policies
- Update the replica count on the target workload
The HPA has a default tolerance of 10%, meaning it will not scale if the current metric value is within 10% of the target. This prevents constant scaling oscillation.
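The loop's arithmetic can be sketched in a few lines. This is a simplified model, assuming per-pod averaging has already happened; the real controller also handles unready pods and behavior policies, and all function and variable names here are illustrative:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=50, tolerance=0.10):
    """Simplified sketch of the HPA replica calculation."""
    ratio = current_metric / target_metric
    # Within the tolerance band, the HPA does not scale at all.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the min/max bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, desired))

# 5 replicas at 90% CPU against a 70% target -> ceil(5 * 90/70) = 7
print(desired_replicas(5, 90, 70))  # 7
# 5 replicas at 72% CPU is within the 10% tolerance -> no change
print(desired_replicas(5, 72, 70))  # 5
```

Note how the tolerance check happens before the ceiling: a metric hovering just above target never triggers a one-pod oscillation.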
HPA Metric Types
The HPA supports four categories of metrics, each suited to different scaling scenarios.
Resource Metrics (CPU and Memory)
The most common and simplest to configure. These use data from the Metrics Server.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 3
maxReplicas: 50
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Pod Metrics (Custom Per-Pod Metrics)
These come from your application directly, exposed via the Custom Metrics API. Useful for scaling on application-specific indicators like queue depth or active connections.
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: 1000
Object Metrics (Kubernetes Object Metrics)
These reference a metric from another Kubernetes object, such as an Ingress.
metrics:
- type: Object
object:
describedObject:
apiVersion: networking.k8s.io/v1
kind: Ingress
name: web-app-ingress
metric:
name: requests_per_second
target:
type: Value
value: 5000
External Metrics
These pull metrics from sources outside the cluster, such as a cloud monitoring service or a message queue.
metrics:
- type: External
external:
metric:
name: sqs_queue_length
selector:
matchLabels:
queue: orders-processing
target:
type: AverageValue
averageValue: 20
HPA Scaling Behavior Policies
Starting with autoscaling/v2, you can define granular scaling behavior to control how fast the HPA scales up and scales down. This is critical in production to prevent thrashing.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 3
maxReplicas: 100
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 65
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 60
- type: Pods
value: 10
periodSeconds: 60
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 120
selectPolicy: Min
This configuration allows aggressive scale-up (double the pods or add 10 pods per minute, whichever is greater) while enforcing conservative scale-down (remove at most 10% of pods every 2 minutes, with a 5-minute stabilization window). This asymmetry is intentional: you want to react fast to load spikes but avoid premature scale-down that causes repeated scaling cycles.
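How multiple scale-up policies combine can be modeled roughly as follows. This is a sketch only; the real controller evaluates each policy over its own periodSeconds window, and all names here are illustrative:

```python
def max_scale_up(current_replicas, percent_value=100, pods_value=10,
                 select_policy="Max"):
    """Rough model of how HPA combines scaleUp policies for one period."""
    by_percent = current_replicas * percent_value // 100  # 100% = double
    by_pods = pods_value                                  # flat pod count
    pick = max if select_policy == "Max" else min
    return current_replicas + pick(by_percent, by_pods)

# With the policies above: a 4-replica deployment can jump to 14 in one
# minute (the flat 10-pod policy wins), while 30 replicas can double to 60.
print(max_scale_up(4))   # 14
print(max_scale_up(30))  # 60
```

This is why `selectPolicy: Max` matters for small deployments: the percentage policy alone would let a 4-replica service add only 4 pods per minute.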
For more on how HPA metrics tie into broader observability, see the official Kubernetes HPA documentation.
What the Cluster Autoscaler Does
The Cluster Autoscaler is an entirely different beast. It does not look at metrics. It watches the Kubernetes scheduler for pods that cannot be placed on any existing node, and it communicates with your cloud provider’s API to add or remove virtual machines from your cluster.
CA operates at the infrastructure layer. Its job is to ensure that your cluster has enough node capacity to run all requested workloads, while also removing nodes that are no longer needed to save costs.
How Cluster Autoscaler Works Internally
The CA runs its own control loop with a default scan interval of 10 seconds. Each cycle evaluates two conditions:
Scale-Up Logic:
- Check for pods in the `Pending` state with a scheduling failure reason (insufficient CPU, memory, or other resources)
- Simulate scheduling: determine which node group could host the pending pods
- If a suitable node group exists, call the cloud provider API to increase the node count
- Wait for the new node to join the cluster and become `Ready`
Scale-Down Logic:
- Identify nodes where utilization (sum of pod requests / node allocatable) is below `scale-down-utilization-threshold` (default: 0.5, meaning 50%)
- Check if all pods on the node can be rescheduled elsewhere
- Respect Pod Disruption Budgets (PDBs), local storage constraints, and annotations that block eviction
- If the node has been underutilized for longer than `scale-down-unneeded-time` (default: 10 minutes), drain and remove it
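The utilization check behind the first criterion is straightforward to model. This is a sketch only; the real autoscaler also simulates rescheduling, honors PDBs, and tracks how long the node has been under the threshold, and all names here are illustrative:

```python
def node_underutilized(pod_cpu_requests_m, allocatable_cpu_m,
                       pod_mem_requests_mi, allocatable_mem_mi,
                       threshold=0.5):
    """A node is a scale-down candidate when both CPU and memory
    utilization (sum of pod requests / allocatable) are below the
    threshold; CA scores the node by the higher of the two."""
    cpu_util = sum(pod_cpu_requests_m) / allocatable_cpu_m
    mem_util = sum(pod_mem_requests_mi) / allocatable_mem_mi
    return max(cpu_util, mem_util) < threshold

# Two pods requesting 500m CPU / 512Mi each on a 4-core, 8Gi node:
# CPU = 1000/4000 = 25%, memory = 1024/8192 = 12.5% -> candidate
print(node_underutilized([500, 500], 4000, [512, 512], 8192))  # True
```

Note that utilization is computed from *requests*, not actual usage: a node packed with request-less pods looks empty to CA, which is exactly the failure mode described in Pitfall 3 below.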
Cluster Autoscaler Configuration
CA is typically deployed as a Deployment in the kube-system namespace. Here is a production-ready configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
app: cluster-autoscaler
spec:
replicas: 1
selector:
matchLabels:
app: cluster-autoscaler
template:
metadata:
labels:
app: cluster-autoscaler
spec:
serviceAccountName: cluster-autoscaler
containers:
- name: cluster-autoscaler
image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
- --balance-similar-node-groups
- --scale-down-utilization-threshold=0.5
- --scale-down-unneeded-time=10m
- --scale-down-delay-after-add=10m
- --max-graceful-termination-sec=600
- --max-node-provision-time=15m
resources:
requests:
cpu: 100m
memory: 600Mi
limits:
cpu: 100m
memory: 600Mi
Key CA Parameters Explained
| Parameter | Default | Description |
|---|---|---|
| `--scan-interval` | 10s | How often CA evaluates the cluster state |
| `--scale-down-utilization-threshold` | 0.5 | Node utilization below which CA considers removing it |
| `--scale-down-unneeded-time` | 10m | How long a node must be underutilized before removal |
| `--scale-down-delay-after-add` | 10m | Cooldown after adding a node before CA considers scale-down |
| `--max-node-provision-time` | 15m | Maximum time to wait for a node to become ready |
| `--expander` | random | Strategy for choosing which node group to expand (random, most-pods, least-waste, price, priority) |
| `--skip-nodes-with-local-storage` | true | Whether to skip nodes with local volumes during scale-down |
| `--balance-similar-node-groups` | false | Balance node counts across similar node groups |
The --expander flag is worth special attention. In production, we typically use least-waste, which selects the node group that will have the least idle CPU and memory after scheduling the pending pods. For multi-zone clusters, combining this with --balance-similar-node-groups ensures even distribution for high availability.
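The least-waste selection can be approximated like this. It is illustrative only, assuming a single pending workload; the real expander simulates bin-packing the pending pods onto each candidate group's node shape, and all names are made up:

```python
def least_waste_group(node_groups, pending_cpu_m, pending_mem_mi):
    """Pick the node group whose instance leaves the least idle
    CPU + memory after hosting the pending pods (a rough score)."""
    def waste(group):
        idle_cpu = group["cpu_m"] - pending_cpu_m
        idle_mem = group["mem_mi"] - pending_mem_mi
        # Normalize so CPU and memory waste are comparable fractions.
        return idle_cpu / group["cpu_m"] + idle_mem / group["mem_mi"]
    feasible = [g for g in node_groups
                if g["cpu_m"] >= pending_cpu_m and g["mem_mi"] >= pending_mem_mi]
    return min(feasible, key=waste)["name"]

groups = [
    {"name": "m5.large",  "cpu_m": 2000, "mem_mi": 8192},
    {"name": "m5.xlarge", "cpu_m": 4000, "mem_mi": 16384},
]
# 1500m CPU / 6Gi of pending pods fit the m5.large with far less waste
print(least_waste_group(groups, 1500, 6144))  # m5.large
```

The contrast with the default `random` expander is clear: random would pick the m5.xlarge half the time and leave most of it idle.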
For the full list of configuration options and FAQs, see the Cluster Autoscaler GitHub repository.
How HPA and Cluster Autoscaler Work Together
This is where the real value lies. HPA and CA are designed to work as a coordinated system, even though they operate independently. Here is the step-by-step flow of what happens during a traffic spike:
The Autoscaling Chain of Events
Step 1: Load Increases
Your application starts receiving more traffic. CPU utilization across pods rises above the HPA target (for example, above 70%).
Step 2: HPA Detects the Change
Within 15 seconds (the default sync period), the HPA controller queries the Metrics Server, calculates that more replicas are needed, and updates the Deployment’s replica count.
Step 3: Scheduler Tries to Place New Pods
The Kubernetes scheduler attempts to find nodes with enough available CPU and memory to run the new pods.
Step 4a: Nodes Available (Fast Path)
If existing nodes have enough headroom, the new pods are scheduled immediately. Scaling happens in seconds. CA is not involved.
Step 4b: No Nodes Available (CA Path)
If no existing node can fit the new pods, they enter the Pending state with a FailedScheduling event. This is the trigger for the Cluster Autoscaler.
Step 5: CA Provisions New Nodes
CA detects the pending pods, evaluates which node group can host them, and requests new nodes from the cloud provider. This typically takes 2-5 minutes depending on the cloud provider and instance type.
Step 6: New Nodes Join the Cluster
The new nodes register with the API server and become Ready. The scheduler immediately places the pending pods on these nodes.
Step 7: Application Handles the Load
The additional replicas start serving traffic. Latency returns to normal.
The Scale-Down Sequence
Scale-down follows the reverse path, but on a much slower timeline by design:
Step 1: Traffic decreases. Pod CPU utilization drops below the HPA target.
Step 2: HPA reduces the replica count (subject to the scale-down behavior policy and stabilization window).
Step 3: Fewer pods run on each node. Node utilization drops below the scale-down-utilization-threshold (default 50%).
Step 4: After the node has been underutilized for scale-down-unneeded-time (default 10 minutes), CA drains the node and removes it from the cluster.
Step 5: Cloud provider terminates the instance, and you stop paying for it.
This intentional asymmetry, fast scale-up and slow scale-down, is the right pattern for production workloads. You want to handle spikes immediately but avoid removing capacity that might be needed again in minutes.
Common Pitfalls and How to Avoid Them
After managing autoscaling across hundreds of clusters, we see the same mistakes repeatedly. Here are the most critical ones.
Pitfall 1: HPA Without Cluster Autoscaler
If HPA increases your replica count but there are no nodes available to run the new pods, those pods stay in Pending state indefinitely. Your application is under load, HPA has correctly determined it needs more replicas, but nothing happens because there is nowhere to put them.
Fix: Always deploy CA alongside HPA in production clusters. If you are on a managed Kubernetes service like EKS, AKS, or GKE, enable the managed cluster autoscaler. For a comparison of how each provider handles this, see our EKS vs AKS vs GKE comparison.
Pitfall 2: Cluster Autoscaler Without HPA
CA only reacts to pending pods. If you have a fixed replica count and your pods become overloaded, CA has no reason to add nodes because no new pods are being requested. Your application degrades under load even though the cluster has room to grow.
Fix: Use HPA to signal demand. HPA creates new pods, which in turn signal CA to expand the cluster when needed. Without HPA (or an equivalent like KEDA), CA is effectively blind to application-level load.
Pitfall 3: Missing or Incorrect Resource Requests
Both HPA and CA depend on resource requests being set correctly on your pods.
- HPA uses resource requests to calculate utilization percentages. Without requests, the `Utilization` target type will not work.
- CA uses the sum of pod resource requests on a node to determine utilization. If your pods have no requests, CA sees every node as 0% utilized and will try to remove them all.
Fix: Always set CPU and memory requests on every container. Base them on actual observed usage, not guesses. A good starting point:
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 1000m
memory: 512Mi
Pitfall 4: HPA Stabilization Window Too Short
If the HPA scale-down stabilization window is too short, you get “flapping” where the HPA repeatedly scales up and down. Each scale-down removes pods, which pushes utilization above the target again, triggering an immediate scale-up.
Fix: Set the stabilizationWindowSeconds for scale-down to at least 300 seconds (5 minutes). For latency-sensitive workloads, consider 600 seconds.
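The stabilization window's anti-flapping effect is easy to model. This is a sketch; the actual controller records a recommendation every sync period and, for scale-down, acts on the highest one seen inside the window:

```python
def stabilized_replicas(recent_recommendations):
    """For scale-down, HPA uses the highest desired-replica value
    recorded within the stabilization window, so the count only
    drops once every sample in the window agrees it can."""
    return max(recent_recommendations)

# Load briefly dipped (recommendations 8, 4, 8, 5) -> stay at 8, no flap
print(stabilized_replicas([8, 4, 8, 5]))  # 8
# Load has genuinely fallen for the whole window -> safe to drop
print(stabilized_replicas([4, 4, 3, 4]))  # 4
```

A 300-second window with a 15-second sync period means roughly 20 consecutive low recommendations are required before a single pod is removed.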
Pitfall 5: CA Scale-Down Too Aggressive
Setting scale-down-unneeded-time too low (for example, 2 minutes) causes CA to remove nodes that might be needed again shortly. This leads to repeated node provisioning cycles, which are slow (2-5 minutes each) and can cause service disruptions.
Fix: Keep the default of 10 minutes, or increase it for clusters with bursty traffic patterns.
Pitfall 6: Not Setting Pod Disruption Budgets
When CA scales down a node, it drains all pods from it. Without PDBs, CA might evict too many pods from a service at once, causing an outage.
Fix: Define PDBs for all critical workloads:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-app-pdb
spec:
minAvailable: "75%"
selector:
matchLabels:
app: web-app
Best Practices for Combining HPA and CA
Based on what we have learned running autoscaling in production, here are the patterns that consistently work well.
1. Set Resource Requests Based on Real Data
Use monitoring data or tools like Vertical Pod Autoscaler (VPA) in recommendation mode to establish accurate resource requests. Inaccurate requests undermine both HPA and CA decision-making.
2. Use Multiple HPA Metrics
Do not rely solely on CPU. Combine CPU with a business-relevant metric like requests per second or queue depth. This gives HPA a more accurate picture of actual demand.
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: 500
3. Use the Least-Waste Expander for CA
The least-waste expander chooses the node group that will leave the least idle resources after placing pending pods. This directly reduces cost by avoiding over-provisioning at the node level.
4. Keep Headroom with Priority-Based Overprovisioning
To reduce the time between HPA requesting more pods and those pods becoming schedulable, deploy low-priority “placeholder” pods that can be preempted when real workloads need the space.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: overprovisioning
value: -1
globalDefault: false
description: "Priority class for overprovisioning pods"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: overprovisioning
spec:
replicas: 3
selector:
matchLabels:
app: overprovisioning
template:
metadata:
labels:
app: overprovisioning
spec:
priorityClassName: overprovisioning
containers:
- name: pause
image: registry.k8s.io/pause:3.9
resources:
requests:
cpu: "1"
memory: 2Gi
When HPA creates new pods, the scheduler preempts these low-priority pods, placing real workloads instantly. The evicted placeholder pods become Pending, triggering CA to add nodes, which restores the buffer for the next spike.
5. Align Scale-Down Timers
Make sure HPA’s scale-down stabilization window is shorter than CA’s scale-down-unneeded-time. This way, HPA removes unnecessary pods first, which reduces node utilization, and then CA removes the empty or underutilized nodes. If CA acts faster than HPA, you may remove nodes that still have active pods.
A good default: HPA scale-down stabilization at 5 minutes, CA scale-down-unneeded-time at 10 minutes.
6. Monitor the Autoscaling Pipeline
Track these metrics to ensure your autoscaling system is healthy:
- HPA current vs desired replica count (detects when HPA is unable to scale)
- Pending pod count and duration (detects CA delays)
- Node count over time (verifies CA is responding)
- CA scaling decision events (audit why CA did or did not scale)
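A minimal health check over these signals might look like the following. The data shape is hypothetical; in practice these values come from your metrics backend, and all names are illustrative:

```python
def autoscaling_alerts(hpa_current, hpa_desired, pending_pod_ages_s,
                       pending_threshold_s=300):
    """Flag the two most common failure modes: HPA stuck below its
    desired count, and pods pending longer than the alert threshold."""
    alerts = []
    if hpa_desired > hpa_current:
        alerts.append("HPA cannot reach desired replicas")
    if any(age > pending_threshold_s for age in pending_pod_ages_s):
        alerts.append("pods pending longer than threshold")
    return alerts

# HPA wants 8 but only 5 run, and one pod has been pending for 7 minutes
print(autoscaling_alerts(5, 8, [30, 420]))
```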
Integrating this data into your observability consulting pipeline ensures you can troubleshoot scaling failures before they impact users.
KEDA: An Alternative and Complement to HPA
KEDA (Kubernetes Event-Driven Autoscaling) is an open-source project that extends Kubernetes autoscaling beyond what the built-in HPA supports. KEDA does not replace HPA. It builds on top of it.
How KEDA Differs from HPA
The built-in HPA requires metrics to be available through the Kubernetes Metrics API, which limits you to resource metrics (via Metrics Server) and custom metrics (via an adapter). KEDA provides over 60 built-in scalers that connect directly to external event sources:
- Message queues: Kafka, RabbitMQ, AWS SQS, Azure Service Bus
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- Cloud services: AWS CloudWatch, Azure Monitor, GCP Pub/Sub
- HTTP traffic: Prometheus query results, NGINX Ingress metrics
KEDA’s Key Advantage: Scale to Zero
Unlike the built-in HPA, which requires a minimum of 1 replica, KEDA can scale workloads to zero replicas when there are no events to process. When a new event arrives, KEDA spins up a pod to handle it. This is particularly valuable for event-driven microservices that process work intermittently.
Example: KEDA with Kafka
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: order-processor
namespace: production
spec:
scaleTargetRef:
name: order-processor
pollingInterval: 15
cooldownPeriod: 300
minReplicaCount: 0
maxReplicaCount: 50
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-broker:9092
consumerGroup: order-processor-group
topic: orders
lagThreshold: "100"
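Under the hood, the replica target for a lag-based trigger works out roughly like this. It is a simplified model of the Kafka scaler (the real one also caps replicas at the partition count), and all names are illustrative:

```python
import math

def kafka_scaled_replicas(total_lag, lag_threshold=100,
                          min_replicas=0, max_replicas=50):
    """Roughly one replica per lagThreshold messages of consumer lag,
    clamped to the ScaledObject's min/max, with scale-to-zero on no lag."""
    if total_lag <= 0:
        return min_replicas
    return max(min_replicas,
               min(max_replicas, math.ceil(total_lag / lag_threshold)))

print(kafka_scaled_replicas(0))     # 0  (scale to zero, nothing queued)
print(kafka_scaled_replicas(750))   # 8  (ceil(750 / 100))
print(kafka_scaled_replicas(9999))  # 50 (capped at maxReplicaCount)
```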
When to Use KEDA vs Built-In HPA
| Scenario | Use |
|---|---|
| Standard web services scaling on CPU/memory | Built-in HPA |
| Queue-based workers (Kafka, SQS, RabbitMQ) | KEDA |
| Scale-to-zero requirements | KEDA |
| Scaling based on Prometheus queries | KEDA |
| Simple deployments with minimal dependencies | Built-in HPA |
| Event-driven microservices architecture | KEDA |
KEDA works seamlessly with Cluster Autoscaler. When KEDA scales pods from zero to many, and those pods cannot be scheduled, CA adds nodes just as it would with HPA-driven scaling.
Managed Kubernetes Provider Considerations
Each major cloud provider has its own implementation of cluster autoscaling. While the core concepts are the same, configuration differs.
AWS EKS: Uses the open-source Cluster Autoscaler with Auto Scaling Groups (ASGs), or the newer Karpenter. Karpenter is AWS-specific but offers faster node provisioning (typically under 60 seconds) and more flexible instance selection.
Azure AKS: Integrates the Cluster Autoscaler directly into the managed control plane. You configure it via AKS API parameters rather than deploying it as a separate pod.
Google GKE: Offers both standard Cluster Autoscaler and Node Auto-Provisioning (NAP), which can create entirely new node pools with different machine types based on workload requirements.
For a detailed breakdown of scaling behavior across providers, see our EKS vs AKS vs GKE comparison.
Production Configuration Checklist
Before going live, verify these items:
- Resource requests and limits set on all containers
- HPA configured with appropriate `minReplicas` (at least 2 for HA)
- HPA scale-down behavior includes a stabilization window (300s minimum)
- Cluster Autoscaler deployed and connected to cloud provider
- CA `scale-down-unneeded-time` set to 10 minutes or more
- Pod Disruption Budgets defined for critical workloads
- Metrics Server installed and functioning
- Monitoring dashboards tracking HPA and CA events
- Alerting configured for prolonged pending pods (more than 5 minutes)
- Overprovisioning pods deployed for latency-sensitive workloads
- Node group max size large enough to handle peak load plus buffer
Addressing Kubernetes security best practices is equally important when configuring autoscaling, as CA’s service account requires elevated cloud provider permissions to manage nodes.
Conclusion
The Horizontal Pod Autoscaler and Cluster Autoscaler are not alternatives to each other. They are partners in a layered autoscaling strategy. HPA handles application-level demand by adjusting pod replicas. CA handles infrastructure-level capacity by adjusting node count. Together, they give you a production-grade system that responds to traffic spikes in seconds while cleaning up unused resources to control costs.
The key is getting the configuration right: accurate resource requests, appropriate scaling behavior policies, aligned scale-down timers, and proper monitoring. Skip any of these, and you end up with either pods stuck pending during peak traffic or nodes sitting idle burning through your budget.
For a broader view of Kubernetes autoscaling within the context of cost management, see the Kubernetes autoscaling overview in the official documentation.
Get Your Kubernetes Autoscaling Right
Combining Cluster Autoscaler with HPA is essential for production Kubernetes, but the configuration details matter.
Our team provides expert Kubernetes consulting services to help you:
- Design autoscaling strategies that handle traffic spikes without over-provisioning
- Configure HPA and Cluster Autoscaler for optimal performance and cost
- Implement custom metrics scaling with KEDA for event-driven workloads
We have tuned autoscaling across 100+ production clusters for startups and enterprises.