Autoscaling in Kubernetes is not a single mechanism. It is two distinct systems operating at different layers of your infrastructure: the Horizontal Pod Autoscaler (HPA) scales your application replicas, while the Cluster Autoscaler (CA) scales the underlying nodes those replicas run on. Running one without the other creates blind spots that lead to either wasted resources or stuck deployments.
We run both in every production cluster we manage, and the difference between a well-tuned autoscaling setup and a broken one usually comes down to understanding how these two controllers interact. This guide covers exactly that, including the configuration details, common pitfalls, and the patterns we have validated across 100+ production Kubernetes environments.
Quick Comparison: Cluster Autoscaler vs HPA
Before diving into the details, here is a side-by-side breakdown of how these two autoscalers differ.
| Feature | Horizontal Pod Autoscaler (HPA) | Cluster Autoscaler (CA) |
|---|---|---|
| What it scales | Pod replicas within a deployment | Nodes in the cluster |
| Scaling trigger | Metric thresholds (CPU, memory, custom) | Pending pods that cannot be scheduled |
| Scale-down trigger | Metrics drop below target | Node utilization below threshold |
| Default check interval | 15 seconds | 10 seconds |
| Scale-down delay | Configurable via behavior policies | 10 minutes (scale-down-unneeded-time) |
| Requires Metrics Server | Yes | No (watches pod scheduling status) |
| Cloud provider dependency | None | Yes (needs cloud API to add/remove nodes) |
| Can scale to zero | No (minimum 1 replica) | Yes (can remove all nodes in a node group) |
| Configuration scope | Per-deployment/workload | Cluster-wide |
| Kubernetes resource | HorizontalPodAutoscaler | Deployment (runs as a pod) |
Understanding this table is essential for designing a scaling strategy that handles traffic spikes without wasting money on idle infrastructure. For a deeper look at controlling Kubernetes spending, see our guide on Kubernetes cost optimization strategies.
What the Horizontal Pod Autoscaler Does
The HPA is a Kubernetes controller that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics. It operates entirely at the application layer. The HPA does not know or care about nodes; it only cares about whether your workload needs more or fewer replicas to meet its performance targets.
How HPA Works Internally
The HPA controller runs a control loop that executes every 15 seconds by default (configurable via --horizontal-pod-autoscaler-sync-period on the kube-controller-manager). Each cycle follows this process:
- Query the Metrics API for the current value of the target metric
- Compare the current value against the target value defined in the HPA spec
- Calculate the desired replica count using the formula:
- Calculate the desired replica count using the formula: `desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))`
- Apply scaling constraints (`minReplicas`, `maxReplicas`) and behavior policies
- Update the replica count on the target workload
The HPA has a default tolerance of 10%, meaning it will not scale if the current metric value is within 10% of the target. This prevents constant scaling oscillation.
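The loop's arithmetic can be sketched in a few lines. This is a simplified model, assuming per-pod averaging has already happened; the real controller also handles unready pods and behavior policies, and all function and variable names here are illustrative:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=50, tolerance=0.10):
    """Simplified sketch of the HPA replica calculation."""
    ratio = current_metric / target_metric
    # Within the tolerance band, the HPA does not scale at all.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the min/max bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, desired))

# 5 replicas at 90% CPU against a 70% target -> ceil(5 * 90/70) = 7
print(desired_replicas(5, 90, 70))  # 7
# 5 replicas at 72% CPU is within the 10% tolerance -> no change
print(desired_replicas(5, 72, 70))  # 5
```

Note how the tolerance check happens before the ceiling: a metric hovering just above target never triggers a one-pod oscillation.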
HPA Metric Types
The HPA supports four categories of metrics, each suited to different scaling scenarios.
Resource Metrics (CPU and Memory)
The most common and simplest to configure. These use data from the Metrics Server.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 3
maxReplicas: 50
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Pod Metrics (Custom Per-Pod Metrics)
These come from your application directly, exposed via the Custom Metrics API. Useful for scaling on application-specific indicators like queue depth or active connections.
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: 1000
Object Metrics (Kubernetes Object Metrics)
These reference a metric from another Kubernetes object, such as an Ingress.
metrics:
- type: Object
object:
describedObject:
apiVersion: networking.k8s.io/v1
kind: Ingress
name: web-app-ingress
metric:
name: requests_per_second
target:
type: Value
value: 5000
External Metrics
These pull metrics from sources outside the cluster, such as a cloud monitoring service or a message queue.
metrics:
- type: External
external:
metric:
name: sqs_queue_length
selector:
matchLabels:
queue: orders-processing
target:
type: AverageValue
averageValue: 20
HPA Scaling Behavior Policies
Starting with autoscaling/v2, you can define granular scaling behavior to control how fast the HPA scales up and scales down. This is critical in production to prevent thrashing.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 3
maxReplicas: 100
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 65
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 60
- type: Pods
value: 10
periodSeconds: 60
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 120
selectPolicy: Min
This configuration allows aggressive scale-up (double the pods or add 10 pods per minute, whichever is greater) while enforcing conservative scale-down (remove at most 10% of pods every 2 minutes, with a 5-minute stabilization window). This asymmetry is intentional: you want to react fast to load spikes but avoid premature scale-down that causes repeated scaling cycles.
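How multiple scale-up policies combine can be modeled roughly as follows. This is a sketch only; the real controller evaluates each policy over its own periodSeconds window, and all names here are illustrative:

```python
def max_scale_up(current_replicas, percent_value=100, pods_value=10,
                 select_policy="Max"):
    """Rough model of how HPA combines scaleUp policies for one period."""
    by_percent = current_replicas * percent_value // 100  # 100% = double
    by_pods = pods_value                                  # flat pod count
    pick = max if select_policy == "Max" else min
    return current_replicas + pick(by_percent, by_pods)

# With the policies above: a 4-replica deployment can jump to 14 in one
# minute (the flat 10-pod policy wins), while 30 replicas can double to 60.
print(max_scale_up(4))   # 14
print(max_scale_up(30))  # 60
```

This is why `selectPolicy: Max` matters for small deployments: the percentage policy alone would let a 4-replica service add only 4 pods per minute.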
For more on how HPA metrics tie into broader observability, see the official Kubernetes HPA documentation.
What the Cluster Autoscaler Does
The Cluster Autoscaler is an entirely different beast. It does not look at metrics. It watches the Kubernetes scheduler for pods that cannot be placed on any existing node, and it communicates with your cloud provider’s API to add or remove virtual machines from your cluster.
CA operates at the infrastructure layer. Its job is to ensure that your cluster has enough node capacity to run all requested workloads, while also removing nodes that are no longer needed to save costs.
How Cluster Autoscaler Works Internally
The CA runs its own control loop with a default scan interval of 10 seconds. Each cycle evaluates two conditions:
Scale-Up Logic:
- Check for pods in the `Pending` state with a scheduling failure reason (insufficient CPU, memory, or other resources)
- Simulate scheduling: determine which node group could host the pending pods
- If a suitable node group exists, call the cloud provider API to increase the node count
- Wait for the new node to join the cluster and become `Ready`
Scale-Down Logic:
- Identify nodes where utilization (sum of pod requests / node allocatable) is below `scale-down-utilization-threshold` (default: 0.5, meaning 50%)
- Check if all pods on the node can be rescheduled elsewhere
- Respect Pod Disruption Budgets (PDBs), local storage constraints, and annotations that block eviction
- If the node has been underutilized for longer than `scale-down-unneeded-time` (default: 10 minutes), drain and remove it
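The utilization check behind the first criterion is straightforward to model. This is a sketch only; the real autoscaler also simulates rescheduling, honors PDBs, and tracks how long the node has been under the threshold, and all names here are illustrative:

```python
def node_underutilized(pod_cpu_requests_m, allocatable_cpu_m,
                       pod_mem_requests_mi, allocatable_mem_mi,
                       threshold=0.5):
    """A node is a scale-down candidate when both CPU and memory
    utilization (sum of pod requests / allocatable) are below the
    threshold; CA scores the node by the higher of the two."""
    cpu_util = sum(pod_cpu_requests_m) / allocatable_cpu_m
    mem_util = sum(pod_mem_requests_mi) / allocatable_mem_mi
    return max(cpu_util, mem_util) < threshold

# Two pods requesting 500m CPU / 512Mi each on a 4-core, 8Gi node:
# CPU = 1000/4000 = 25%, memory = 1024/8192 = 12.5% -> candidate
print(node_underutilized([500, 500], 4000, [512, 512], 8192))  # True
```

Note that utilization is computed from *requests*, not actual usage: a node packed with request-less pods looks empty to CA, which is exactly the failure mode described in Pitfall 3 below.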
Cluster Autoscaler Configuration
CA is typically deployed as a Deployment in the kube-system namespace. Here is a production-ready configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
app: cluster-autoscaler
spec:
replicas: 1
selector:
matchLabels:
app: cluster-autoscaler
template:
metadata:
labels:
app: cluster-autoscaler
spec:
serviceAccountName: cluster-autoscaler
containers:
- name: cluster-autoscaler
image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
- --balance-similar-node-groups
- --scale-down-utilization-threshold=0.5
- --scale-down-unneeded-time=10m
- --scale-down-delay-after-add=10m
- --max-graceful-termination-sec=600
- --max-node-provision-time=15m
resources:
requests:
cpu: 100m
memory: 600Mi
limits:
cpu: 100m
memory: 600Mi
Key CA Parameters Explained
| Parameter | Default | Description |
|---|---|---|
| `--scan-interval` | 10s | How often CA evaluates the cluster state |
| `--scale-down-utilization-threshold` | 0.5 | Node utilization below which CA considers removing it |
| `--scale-down-unneeded-time` | 10m | How long a node must be underutilized before removal |
| `--scale-down-delay-after-add` | 10m | Cooldown after adding a node before CA considers scale-down |
| `--max-node-provision-time` | 15m | Maximum time to wait for a node to become ready |
| `--expander` | random | Strategy for choosing which node group to expand (random, most-pods, least-waste, price, priority) |
| `--skip-nodes-with-local-storage` | true | Whether to skip nodes with local volumes during scale-down |
| `--balance-similar-node-groups` | false | Balance node counts across similar node groups |
The --expander flag is worth special attention. In production, we typically use least-waste, which selects the node group that will have the least idle CPU and memory after scheduling the pending pods. For multi-zone clusters, combining this with --balance-similar-node-groups ensures even distribution for high availability.
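The least-waste selection can be approximated like this. It is illustrative only, assuming a single pending workload; the real expander simulates bin-packing the pending pods onto each candidate group's node shape, and all names are made up:

```python
def least_waste_group(node_groups, pending_cpu_m, pending_mem_mi):
    """Pick the node group whose instance leaves the least idle
    CPU + memory after hosting the pending pods (a rough score)."""
    def waste(group):
        idle_cpu = group["cpu_m"] - pending_cpu_m
        idle_mem = group["mem_mi"] - pending_mem_mi
        # Normalize so CPU and memory waste are comparable fractions.
        return idle_cpu / group["cpu_m"] + idle_mem / group["mem_mi"]
    feasible = [g for g in node_groups
                if g["cpu_m"] >= pending_cpu_m and g["mem_mi"] >= pending_mem_mi]
    return min(feasible, key=waste)["name"]

groups = [
    {"name": "m5.large",  "cpu_m": 2000, "mem_mi": 8192},
    {"name": "m5.xlarge", "cpu_m": 4000, "mem_mi": 16384},
]
# 1500m CPU / 6Gi of pending pods fit the m5.large with far less waste
print(least_waste_group(groups, 1500, 6144))  # m5.large
```

The contrast with the default `random` expander is clear: random would pick the m5.xlarge half the time and leave most of it idle.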
For the full list of configuration options and FAQs, see the Cluster Autoscaler GitHub repository.
How HPA and Cluster Autoscaler Work Together
This is where the real value lies. HPA and CA are designed to work as a coordinated system, even though they operate independently. Here is the step-by-step flow of what happens during a traffic spike:
The Autoscaling Chain of Events
Step 1: Load Increases
Your application starts receiving more traffic. CPU utilization across pods rises above the HPA target (for example, above 70%).
Step 2: HPA Detects the Change
Within 15 seconds (the default sync period), the HPA controller queries the Metrics Server, calculates that more replicas are needed, and updates the Deployment’s replica count.
Step 3: Scheduler Tries to Place New Pods
The Kubernetes scheduler attempts to find nodes with enough available CPU and memory to run the new pods.
Step 4a: Nodes Available (Fast Path)
If existing nodes have enough headroom, the new pods are scheduled immediately. Scaling happens in seconds. CA is not involved.
Step 4b: No Nodes Available (CA Path)
If no existing node can fit the new pods, they enter the Pending state with a FailedScheduling event. This is the trigger for the Cluster Autoscaler.
Step 5: CA Provisions New Nodes
CA detects the pending pods, evaluates which node group can host them, and requests new nodes from the cloud provider. This typically takes 2-5 minutes depending on the cloud provider and instance type.
Step 6: New Nodes Join the Cluster
The new nodes register with the API server and become Ready. The scheduler immediately places the pending pods on these nodes.
Step 7: Application Handles the Load
The additional replicas start serving traffic. Latency returns to normal.
The Scale-Down Sequence
Scale-down follows the reverse path, but on a much slower timeline by design:
Step 1: Traffic decreases. Pod CPU utilization drops below the HPA target.
Step 2: HPA reduces the replica count (subject to the scale-down behavior policy and stabilization window).
Step 3: Fewer pods run on each node. Node utilization drops below the scale-down-utilization-threshold (default 50%).
Step 4: After the node has been underutilized for scale-down-unneeded-time (default 10 minutes), CA drains the node and removes it from the cluster.
Step 5: Cloud provider terminates the instance, and you stop paying for it.
This intentional asymmetry, fast scale-up and slow scale-down, is the right pattern for production workloads. You want to handle spikes immediately but avoid removing capacity that might be needed again in minutes.
Common Pitfalls and How to Avoid Them
After managing autoscaling across hundreds of clusters, we see the same mistakes repeatedly. Here are the most critical ones.
Pitfall 1: HPA Without Cluster Autoscaler
If HPA increases your replica count but there are no nodes available to run the new pods, those pods stay in Pending state indefinitely. Your application is under load, HPA has correctly determined it needs more replicas, but nothing happens because there is nowhere to put them.
Fix: Always deploy CA alongside HPA in production clusters. If you are on a managed Kubernetes service like EKS, AKS, or GKE, enable the managed cluster autoscaler. For a comparison of how each provider handles this, see our EKS vs AKS vs GKE comparison.
Pitfall 2: Cluster Autoscaler Without HPA
CA only reacts to pending pods. If you have a fixed replica count and your pods become overloaded, CA has no reason to add nodes because no new pods are being requested. Your application degrades under load even though the cluster has room to grow.
Fix: Use HPA to signal demand. HPA creates new pods, which in turn signal CA to expand the cluster when needed. Without HPA (or an equivalent like KEDA), CA is effectively blind to application-level load.
Pitfall 3: Missing or Incorrect Resource Requests
Both HPA and CA depend on resource requests being set correctly on your pods.
- HPA uses resource requests to calculate utilization percentages. Without requests, the `Utilization` target type will not work.
- CA uses the sum of pod resource requests on a node to determine utilization. If your pods have no requests, CA sees every node as 0% utilized and will try to remove them all.
Fix: Always set CPU and memory requests on every container. Base them on actual observed usage, not guesses. A good starting point:
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 1000m
memory: 512Mi
Pitfall 4: HPA Stabilization Window Too Short
If the HPA scale-down stabilization window is too short, you get “flapping” where the HPA repeatedly scales up and down. Each scale-down removes pods, which pushes utilization above the target again, triggering an immediate scale-up.
Fix: Set the stabilizationWindowSeconds for scale-down to at least 300 seconds (5 minutes). For latency-sensitive workloads, consider 600 seconds.
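The stabilization window's anti-flapping effect is easy to model. This is a sketch; the actual controller records a recommendation every sync period and, for scale-down, acts on the highest one seen inside the window:

```python
def stabilized_replicas(recent_recommendations):
    """For scale-down, HPA uses the highest desired-replica value
    recorded within the stabilization window, so the count only
    drops once every sample in the window agrees it can."""
    return max(recent_recommendations)

# Load briefly dipped (recommendations 8, 4, 8, 5) -> stay at 8, no flap
print(stabilized_replicas([8, 4, 8, 5]))  # 8
# Load has genuinely fallen for the whole window -> safe to drop
print(stabilized_replicas([4, 4, 3, 4]))  # 4
```

A 300-second window with a 15-second sync period means roughly 20 consecutive low recommendations are required before a single pod is removed.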
Pitfall 5: CA Scale-Down Too Aggressive
Setting scale-down-unneeded-time too low (for example, 2 minutes) causes CA to remove nodes that might be needed again shortly. This leads to repeated node provisioning cycles, which are slow (2-5 minutes each) and can cause service disruptions.
Fix: Keep the default of 10 minutes, or increase it for clusters with bursty traffic patterns.
Pitfall 6: Not Setting Pod Disruption Budgets
When CA scales down a node, it drains all pods from it. Without PDBs, CA might evict too many pods from a service at once, causing an outage.
Fix: Define PDBs for all critical workloads:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-app-pdb
spec:
minAvailable: "75%"
selector:
matchLabels:
app: web-app
Best Practices for Combining HPA and CA
Based on what we have learned running autoscaling in production, here are the patterns that consistently work well.
1. Set Resource Requests Based on Real Data
Use monitoring data or tools like Vertical Pod Autoscaler (VPA) in recommendation mode to establish accurate resource requests. Inaccurate requests undermine both HPA and CA decision-making.
2. Use Multiple HPA Metrics
Do not rely solely on CPU. Combine CPU with a business-relevant metric like requests per second or queue depth. This gives HPA a more accurate picture of actual demand.
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: 500
3. Use the Least-Waste Expander for CA
The least-waste expander chooses the node group that will leave the least idle resources after placing pending pods. This directly reduces cost by avoiding over-provisioning at the node level.
4. Keep Headroom with Priority-Based Overprovisioning
To reduce the time between HPA requesting more pods and those pods becoming schedulable, deploy low-priority “placeholder” pods that can be preempted when real workloads need the space.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: overprovisioning
value: -1
globalDefault: false
description: "Priority class for overprovisioning pods"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: overprovisioning
spec:
replicas: 3
selector:
matchLabels:
app: overprovisioning
template:
metadata:
labels:
app: overprovisioning
spec:
priorityClassName: overprovisioning
containers:
- name: pause
image: registry.k8s.io/pause:3.9
resources:
requests:
cpu: "1"
memory: 2Gi
When HPA creates new pods, the scheduler preempts these low-priority pods, placing real workloads instantly. The evicted placeholder pods become Pending, triggering CA to add nodes, which restores the buffer for the next spike.
5. Align Scale-Down Timers
Make sure HPA’s scale-down stabilization window is shorter than CA’s scale-down-unneeded-time. This way, HPA removes unnecessary pods first, which reduces node utilization, and then CA removes the empty or underutilized nodes. If CA acts faster than HPA, you may remove nodes that still have active pods.
A good default: HPA scale-down stabilization at 5 minutes, CA scale-down-unneeded-time at 10 minutes.
6. Monitor the Autoscaling Pipeline
Track these metrics to ensure your autoscaling system is healthy:
- HPA current vs desired replica count (detects when HPA is unable to scale)
- Pending pod count and duration (detects CA delays)
- Node count over time (verifies CA is responding)
- CA scaling decision events (audit why CA did or did not scale)
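A minimal health check over these signals might look like the following. The data shape is hypothetical; in practice these values come from your metrics backend, and all names are illustrative:

```python
def autoscaling_alerts(hpa_current, hpa_desired, pending_pod_ages_s,
                       pending_threshold_s=300):
    """Flag the two most common failure modes: HPA stuck below its
    desired count, and pods pending longer than the alert threshold."""
    alerts = []
    if hpa_desired > hpa_current:
        alerts.append("HPA cannot reach desired replicas")
    if any(age > pending_threshold_s for age in pending_pod_ages_s):
        alerts.append("pods pending longer than threshold")
    return alerts

# HPA wants 8 but only 5 run, and one pod has been pending for 7 minutes
print(autoscaling_alerts(5, 8, [30, 420]))
```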
Integrating this data into your observability consulting pipeline ensures you can troubleshoot scaling failures before they impact users.
KEDA: An Alternative and Complement to HPA
KEDA (Kubernetes Event-Driven Autoscaling) is an open-source project that extends Kubernetes autoscaling beyond what the built-in HPA supports. KEDA does not replace HPA. It builds on top of it.
How KEDA Differs from HPA
The built-in HPA requires metrics to be available through the Kubernetes Metrics API, which limits you to resource metrics (via Metrics Server) and custom metrics (via an adapter). KEDA provides over 60 built-in scalers that connect directly to external event sources:
- Message queues: Kafka, RabbitMQ, AWS SQS, Azure Service Bus
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- Cloud services: AWS CloudWatch, Azure Monitor, GCP Pub/Sub
- HTTP traffic: Prometheus query results, NGINX Ingress metrics
KEDA’s Key Advantage: Scale to Zero
Unlike the built-in HPA, which requires a minimum of 1 replica, KEDA can scale workloads to zero replicas when there are no events to process. When a new event arrives, KEDA spins up a pod to handle it. This is particularly valuable for event-driven microservices that process work intermittently.
Example: KEDA with Kafka
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: order-processor
namespace: production
spec:
scaleTargetRef:
name: order-processor
pollingInterval: 15
cooldownPeriod: 300
minReplicaCount: 0
maxReplicaCount: 50
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-broker:9092
consumerGroup: order-processor-group
topic: orders
lagThreshold: "100"
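Under the hood, the replica target for a lag-based trigger works out roughly like this. It is a simplified model of the Kafka scaler (the real one also caps replicas at the partition count), and all names are illustrative:

```python
import math

def kafka_scaled_replicas(total_lag, lag_threshold=100,
                          min_replicas=0, max_replicas=50):
    """Roughly one replica per lagThreshold messages of consumer lag,
    clamped to the ScaledObject's min/max, with scale-to-zero on no lag."""
    if total_lag <= 0:
        return min_replicas
    return max(min_replicas,
               min(max_replicas, math.ceil(total_lag / lag_threshold)))

print(kafka_scaled_replicas(0))     # 0  (scale to zero, nothing queued)
print(kafka_scaled_replicas(750))   # 8  (ceil(750 / 100))
print(kafka_scaled_replicas(9999))  # 50 (capped at maxReplicaCount)
```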
When to Use KEDA vs Built-In HPA
| Scenario | Use |
|---|---|
| Standard web services scaling on CPU/memory | Built-in HPA |
| Queue-based workers (Kafka, SQS, RabbitMQ) | KEDA |
| Scale-to-zero requirements | KEDA |
| Scaling based on Prometheus queries | KEDA |
| Simple deployments with minimal dependencies | Built-in HPA |
| Event-driven microservices architecture | KEDA |
KEDA works seamlessly with Cluster Autoscaler. When KEDA scales pods from zero to many, and those pods cannot be scheduled, CA adds nodes just as it would with HPA-driven scaling.
Managed Kubernetes Provider Considerations
Each major cloud provider has its own implementation of cluster autoscaling. While the core concepts are the same, configuration differs.
AWS EKS: Uses the open-source Cluster Autoscaler with Auto Scaling Groups (ASGs), or the newer Karpenter. Karpenter is AWS-specific but offers faster node provisioning (typically under 60 seconds) and more flexible instance selection.
Azure AKS: Integrates the Cluster Autoscaler directly into the managed control plane. You configure it via AKS API parameters rather than deploying it as a separate pod.
Google GKE: Offers both standard Cluster Autoscaler and Node Auto-Provisioning (NAP), which can create entirely new node pools with different machine types based on workload requirements.
For a detailed breakdown of scaling behavior across providers, see our EKS vs AKS vs GKE comparison.
Production Configuration Checklist
Before going live, verify these items:
- Resource requests and limits set on all containers
- HPA configured with appropriate `minReplicas` (at least 2 for HA)
- HPA scale-down behavior includes a stabilization window (300s minimum)
- Cluster Autoscaler deployed and connected to cloud provider
- CA `scale-down-unneeded-time` set to 10 minutes or more
- Pod Disruption Budgets defined for critical workloads
- Metrics Server installed and functioning
- Monitoring dashboards tracking HPA and CA events
- Alerting configured for prolonged pending pods (more than 5 minutes)
- Overprovisioning pods deployed for latency-sensitive workloads
- Node group max size large enough to handle peak load plus buffer
Addressing Kubernetes security best practices is equally important when configuring autoscaling, as CA’s service account requires elevated cloud provider permissions to manage nodes.
Conclusion
The Horizontal Pod Autoscaler and Cluster Autoscaler are not alternatives to each other. They are partners in a layered autoscaling strategy. HPA handles application-level demand by adjusting pod replicas. CA handles infrastructure-level capacity by adjusting node count. Together, they give you a production-grade system that responds to traffic spikes in seconds while cleaning up unused resources to control costs.
The key is getting the configuration right: accurate resource requests, appropriate scaling behavior policies, aligned scale-down timers, and proper monitoring. Skip any of these, and you end up with either pods stuck pending during peak traffic or nodes sitting idle burning through your budget.
For a broader view of Kubernetes autoscaling within the context of cost management, see the Kubernetes autoscaling overview in the official documentation.
Get Your Kubernetes Autoscaling Right
Combining Cluster Autoscaler with HPA is essential for production Kubernetes, but the configuration details matter.
Our team provides expert Kubernetes consulting services to help you:
- Design autoscaling strategies that handle traffic spikes without over-provisioning
- Configure HPA and Cluster Autoscaler for optimal performance and cost
- Implement custom metrics scaling with KEDA for event-driven workloads
We have tuned autoscaling across 100+ production clusters for startups and enterprises.