KUBERNETES

Cluster Autoscaler vs HPA vs VPA: We Tested All 3 (Here's What Works)

Engineering Team 2026-03-09

Kubernetes gives you three autoscaling mechanisms, each operating at a different layer. The Horizontal Pod Autoscaler (HPA) adds or removes pod replicas. The Vertical Pod Autoscaler (VPA) adjusts CPU and memory requests on individual pods. The Cluster Autoscaler (CA) adds or removes entire nodes from your cluster.

Most teams start with one, hit a wall, then bolt on another without understanding how they interact. We have tuned autoscaling across 100+ production clusters, and the pattern is always the same: each autoscaler solves a specific problem, but combining them incorrectly creates new ones.

This guide breaks down exactly how each autoscaler works, when they conflict, and which combinations actually work in production. We also cover Karpenter and KEDA as modern alternatives that change the equation entirely.

Quick Comparison: HPA vs VPA vs Cluster Autoscaler

Before diving deep, here is the three-way comparison at a glance:

| Feature | HPA | VPA | Cluster Autoscaler |
| --- | --- | --- | --- |
| What it scales | Pod replicas | Pod resources (CPU/memory) | Nodes |
| Direction | Horizontal (more pods) | Vertical (bigger pods) | Infrastructure (more nodes) |
| Default in K8s | Yes | No (separate addon) | No (separate addon) |
| Reaction time | ~15 seconds | Minutes | Minutes |
| Best for | Stateless services | Batch/stateful workloads | All workloads (node layer) |
| Cost impact | More pods, potentially more nodes | Fewer wasted resources per pod | Right-sized cluster |
| Disruption | None (adds replicas) | Pod restart required (in Auto mode) | Node drain on scale-down |
| Metric source | CPU, memory, custom, external | CPU, memory (historical usage) | Pending pods, node utilization |

Each autoscaler addresses a different layer. HPA handles demand-driven scaling of application replicas. VPA ensures each pod is right-sized so it does not waste resources. CA ensures the cluster has enough nodes to actually run those pods. Understanding these layers is the key to building an autoscaling strategy that works.

How HPA Works: Horizontal Pod Autoscaling

The Horizontal Pod Autoscaler is the most widely used autoscaler in Kubernetes. It watches metrics on your pods and adjusts the replica count to match demand.

HPA Architecture

HPA runs as a control loop inside the Kubernetes controller manager. Every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period), it queries the metrics API, computes the desired replica count, and scales the deployment accordingly.

The formula is straightforward:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

If you have 3 replicas running at 80% CPU and your target is 50%, HPA calculates ceil(3 * 80/50) = 5 replicas.
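As a sanity check, the formula is easy to reproduce. This is a minimal Python sketch of the calculation, not code from the HPA controller itself:

```python
import math

def desired_replicas(current: int, current_metric: float, target_metric: float) -> int:
    # desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
    return math.ceil(current * (current_metric / target_metric))

print(desired_replicas(3, 80, 50))  # 3 replicas at 80% CPU with a 50% target -> 5
```

Note that when the ratio is close to 1.0, the controller applies a tolerance (10% by default) and leaves the replica count alone, which this sketch omits.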

HPA Configuration Example

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

HPA Best Practices

  • Always set resource requests on your pods. HPA calculates utilization as a percentage of the request. Without requests, utilization metrics are meaningless.
  • Use the behavior field to control scale-up and scale-down velocity. Aggressive scale-up with conservative scale-down prevents flapping.
  • Consider custom metrics for services where CPU does not reflect load. Requests per second, queue depth, or latency percentiles are often better signals than raw CPU for Kubernetes cost optimization strategies.
  • Set minReplicas to at least 2 for high-availability workloads. A single replica cannot survive a node failure.

When HPA Falls Short

HPA works well for stateless, horizontally scalable workloads like web servers and API gateways. It struggles with:

  • Stateful workloads where adding replicas requires data rebalancing
  • Workloads with long startup times where new pods take minutes to become ready
  • Services with uneven resource profiles where some pods need more CPU or memory than others

For these scenarios, VPA is often a better fit.

How VPA Works: Vertical Pod Autoscaling

The Vertical Pod Autoscaler adjusts the CPU and memory requests (and optionally limits) on individual pods based on historical usage. Instead of adding more pods, VPA makes each pod the right size.

VPA Architecture

VPA is not part of a default Kubernetes installation; you install it separately from the VPA GitHub repository. It consists of three components:

  1. Recommender - Monitors resource usage and computes recommended requests
  2. Updater - Evicts pods that need resizing (in Auto mode)
  3. Admission Controller - Sets the recommended resources on new pods at creation time

VPA Operating Modes

VPA has three update modes, and choosing the right one is critical:

| Mode | Behavior | Pod Disruption | Use Case |
| --- | --- | --- | --- |
| Off | Recommends only, no changes applied | None | Monitoring and right-sizing analysis |
| Initial | Sets resources only at pod creation | None (existing pods unchanged) | Safe production use |
| Auto | Evicts and recreates pods with new resources | Yes (pod restarts) | Non-critical or batch workloads |

For production services, start with Off mode to gather recommendations, then move to Initial once you trust the suggestions. Auto mode should be reserved for workloads that can tolerate pod restarts.

VPA Configuration Example

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-processor-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-processor
  updatePolicy:
    updateMode: "Auto"
    minReplicas: 2
  resourcePolicy:
    containerPolicies:
      - containerName: batch-processor
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
        controlledResources:
          - cpu
          - memory
        controlledValues: RequestsAndLimits

VPA in Recommendation-Only Mode

This is the safest way to start with VPA. It gives you visibility into resource waste without touching running pods:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa-recommend
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: web-app
        controlledResources:
          - cpu
          - memory

Check recommendations with:

kubectl describe vpa web-app-vpa-recommend -n production

The output shows lower bound, target, upper bound, and uncapped target recommendations. Use these to manually adjust your resource requests and start eliminating over-provisioning across your clusters.

When VPA Excels

VPA is the right choice for:

  • Batch and cron jobs where resource needs vary between runs
  • Stateful workloads like databases that cannot scale horizontally
  • Workloads with unpredictable resource profiles that are hard to size manually
  • Right-sizing exercises where you want data-driven resource requests instead of guesswork

How Cluster Autoscaler Works: Node-Level Scaling

The Cluster Autoscaler operates at the infrastructure layer. It adds nodes when pods cannot be scheduled due to insufficient resources, and removes nodes when they are underutilized.

CA Architecture

Cluster Autoscaler watches for two conditions:

  1. Scale-up trigger: Pods are in Pending state because no node has enough resources to schedule them. CA provisions a new node from the configured node group (ASGs on AWS, MIGs on GCP, VMSSs on Azure).

  2. Scale-down trigger: A node’s utilization falls below a threshold (default 50%) for a sustained period (default 10 minutes). CA drains the node and terminates it, provided all pods on it can be rescheduled elsewhere.
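The scale-down check can be sketched in a few lines. This is a simplification of what CA actually does (it also simulates whether every pod on the node can be rescheduled elsewhere), and the function name is ours, not CA's:

```python
def scale_down_candidate(cpu_requested, cpu_allocatable,
                         mem_requested, mem_allocatable,
                         threshold=0.5):
    # CA measures node utilization as requested/allocatable, taking the
    # higher of the CPU and memory ratios (simplified sketch)
    utilization = max(cpu_requested / cpu_allocatable,
                      mem_requested / mem_allocatable)
    return utilization < threshold

# 1.2 of 4 cores and 3 of 16 GiB requested -> 30% utilization -> candidate
print(scale_down_candidate(1.2, 4.0, 3.0, 16.0))
```

A node that passes this check still has to stay below the threshold for the full scale-down-unneeded-time window before CA drains it.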

CA Configuration Example

On AWS EKS, Cluster Autoscaler runs as a deployment that interacts with Auto Scaling Groups:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
            - --balance-similar-node-groups
            - --scale-down-delay-after-add=5m
            - --scale-down-unneeded-time=5m
            - --scale-down-utilization-threshold=0.5
          resources:
            limits:
              cpu: 100m
              memory: 600Mi
            requests:
              cpu: 100m
              memory: 600Mi

Key CA Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --scale-down-utilization-threshold | 0.5 | Node utilization below this triggers scale-down consideration |
| --scale-down-delay-after-add | 10m | Wait time after adding a node before considering scale-down |
| --scale-down-unneeded-time | 10m | How long a node must be underutilized before removal |
| --max-graceful-termination-sec | 600 | Max time to wait for pod termination during drain |
| --expander | random | Strategy for choosing node group (random, most-pods, least-waste, priority) |

CA Limitations

Cluster Autoscaler is reliable but has well-known limitations:

  • Slow reaction time: Adding a node takes minutes (API call to cloud provider, VM boot, kubelet registration, pod scheduling). This matters for bursty workloads.
  • Node group constraints: CA works with pre-defined node groups. You must configure the instance types, sizes, and availability zones ahead of time.
  • Bin-packing inefficiency: CA does not always choose the optimal instance type for the workload. The least-waste expander helps, but is limited to the node groups you have defined.

These limitations are exactly why Karpenter was created, which we cover later in this guide.

The HPA + VPA Conflict: Why They Fight

This is where most teams get burned. Running HPA and VPA together on the same workload seems logical: let HPA handle replica count and VPA handle pod sizing. In practice, they can create a feedback loop that destabilizes your workload.

How the Conflict Happens

Here is the scenario:

  1. Traffic increases, CPU utilization rises to 80%
  2. HPA sees high CPU and wants to add replicas to bring utilization down
  3. VPA also sees high CPU and wants to increase the CPU request on each pod
  4. HPA adds replicas, spreading the load, which brings CPU down
  5. VPA sees the lower CPU and reduces its recommendation
  6. With smaller resource requests, utilization spikes again
  7. HPA reacts, VPA reacts, and the cycle continues

The result is oscillating replica counts, unnecessary pod evictions, and unpredictable scaling behavior.
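You can reproduce the oscillation with a toy model. Assume a constant total load; HPA applies its replica formula while VPA retargets the request to the latest per-pod usage. Both the model and the numbers are illustrative, not measurements from a real cluster:

```python
import math

TOTAL_LOAD = 10.0    # total CPU cores the service consumes (held constant)
TARGET_UTIL = 0.6    # HPA target: 60% of the CPU request

replicas, request = 3, 1.0
history = []
for _ in range(6):
    usage_per_pod = TOTAL_LOAD / replicas
    utilization = usage_per_pod / request
    # HPA: desiredReplicas = ceil(current * currentUtil / targetUtil)
    replicas = math.ceil(replicas * utilization / TARGET_UTIL)
    # VPA: retarget the request to the per-pod usage it just observed
    request = usage_per_pod
    history.append(replicas)

print(history)  # the replica count keeps swinging instead of settling
```

Every time VPA shrinks the request, the same load reads as higher utilization, so HPA over-scales; the extra replicas then make VPA shrink the request again.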

The Rule: Never Use Both on the Same Metric

The official Kubernetes autoscaling documentation is clear: do not use HPA and VPA on the same resource metric for the same workload. If HPA is scaling on CPU, VPA should not be managing CPU requests in Auto mode.

Safe Ways to Combine HPA and VPA

There are legitimate patterns for using both:

Pattern 1: VPA in Off mode with HPA

Run VPA in recommendation-only mode while HPA handles scaling. Use VPA’s suggestions to periodically update your resource requests manually:

# HPA handles scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
---
# VPA provides recommendations only
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"

Pattern 2: HPA on CPU, VPA on memory only

This is the advanced pattern for teams that want both active. VPA controls memory sizing while HPA scales replicas based on CPU:

# HPA scales on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
---
# VPA manages memory only
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: worker
        controlledResources:
          - memory
        minAllowed:
          memory: "256Mi"
        maxAllowed:
          memory: "4Gi"

This pattern works because the autoscalers operate on separate metrics. HPA does not care about memory utilization, and VPA does not touch CPU requests.

Best Autoscaling Combinations for Production

After working with dozens of production environments, these are the combinations that consistently deliver results. The right choice depends on your workload type.

Combination 1: HPA + Cluster Autoscaler (Standard Web Apps)

This is the default recommendation for stateless web applications, APIs, and microservices.

How it works:

  • HPA adds pod replicas when CPU or custom metrics exceed the target
  • When new pods cannot be scheduled, CA adds nodes to the cluster
  • When traffic drops, HPA removes replicas and CA removes underutilized nodes

Best for: E-commerce platforms, REST APIs, web frontends, microservices

# HPA for the application
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: storefront-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 55
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

CA handles the node layer automatically. When HPA creates pods that cannot be scheduled, CA provisions new nodes within minutes.

Combination 2: VPA + Cluster Autoscaler (Batch and Stateful Workloads)

For workloads that cannot scale horizontally, VPA right-sizes individual pods while CA ensures the cluster has capacity.

How it works:

  • VPA monitors resource usage and adjusts CPU/memory requests
  • When VPA increases a pod’s resource request beyond what the current node can provide, the pod becomes unschedulable
  • CA detects the pending pod and provisions a larger node

Best for: Databases, message brokers, batch processing jobs, ML training workloads

Combination 3: HPA + VPA (Memory Only) + Cluster Autoscaler (Advanced)

The full three-layer stack for teams that want maximum efficiency. This is the most sophisticated setup and requires careful tuning.

How it works:

  • HPA scales replicas based on CPU or custom metrics
  • VPA right-sizes memory requests (only memory, not CPU)
  • CA ensures node capacity matches the workload

Best for: Memory-intensive APIs, Java applications with variable heap usage, services with unpredictable memory patterns

This combination is what we frequently implement during EKS architecture engagements where teams need both horizontal scaling and memory optimization.

Combination 4: HPA + Karpenter (Modern Alternative)

Karpenter replaces Cluster Autoscaler with a faster, more flexible node provisioner. Instead of working with pre-defined node groups, Karpenter selects the optimal instance type for each pending pod in real time.

How it works:

  • HPA adds replicas as demand grows
  • When pods are unschedulable, Karpenter provisions the right-sized node in seconds (not minutes)
  • Karpenter consolidates workloads onto fewer nodes during low demand, terminating underutilized instances

Best for: AWS EKS clusters, bursty workloads, cost-sensitive environments

# Karpenter NodePool (replaces ASG-based node groups)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values:
            - m5.large
            - m5.xlarge
            - m5.2xlarge
            - c5.large
            - c5.xlarge
            - r5.large
            - r5.xlarge
        - key: "topology.kubernetes.io/zone"
          operator: In
          values:
            - eu-west-1a
            - eu-west-1b
            - eu-west-1c
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

Karpenter is a significant upgrade over Cluster Autoscaler for AWS users. It provisions nodes in seconds rather than minutes, selects from a broader range of instance types, and consolidates workloads automatically. If you are running on EKS, this is the combination we recommend.

Combination 5: KEDA + Karpenter (Event-Driven Workloads)

For workloads that scale based on external events rather than CPU or memory, KEDA replaces HPA with event-driven autoscaling.

How it works:

  • KEDA scales pods based on event sources: SQS queue depth, Kafka consumer lag, Prometheus metrics, cron schedules, and 60+ other scalers
  • Karpenter provisions nodes as KEDA creates new pods
  • KEDA can scale to zero, and Karpenter removes nodes when they are empty

Best for: Queue processors, event-driven microservices, scheduled batch jobs, serverless-style workloads

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0
  maxReplicaCount: 100
  pollingInterval: 10
  cooldownPeriod: 300
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/123456789/orders
        queueLength: "5"
        awsRegion: eu-west-1
        identityOwner: operator

KEDA’s ability to scale to zero makes it particularly powerful for cost optimization. A queue processor that handles 1,000 messages during business hours and zero messages overnight can scale from 50 pods to 0 pods, and Karpenter removes the nodes entirely.
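The queue-based math is simple: roughly one replica per target batch of messages, clamped to the configured bounds. The helper below is our sketch of the effective behavior, not a KEDA API:

```python
import math

def keda_desired_replicas(queue_length: int, target_per_pod: int,
                          min_replicas: int = 0, max_replicas: int = 100) -> int:
    # KEDA activates the workload when messages appear, then the underlying
    # HPA targets `target_per_pod` messages per pod (simplified)
    if queue_length == 0:
        return min_replicas                       # scale to zero
    return min(max(math.ceil(queue_length / target_per_pod), 1), max_replicas)

print(keda_desired_replicas(250, 5))  # 250 queued messages -> 50 pods
print(keda_desired_replicas(0, 5))    # empty queue -> 0 pods
```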

Reaction Time Comparison

How fast each autoscaler responds matters, especially for bursty traffic patterns. Here is what we have measured across production clusters:

| Autoscaler | Reaction Time | What Happens |
| --- | --- | --- |
| HPA | ~15 seconds | Checks metrics every sync period, creates/removes pods |
| VPA | Minutes | Analyzes historical usage, evicts and recreates pods |
| Cluster Autoscaler | 2-5 minutes | Detects pending pods, calls cloud API, waits for node boot |
| Karpenter | 10-30 seconds | Detects pending pods, provisions optimal instance directly |
| KEDA | 10-30 seconds (configurable) | Polls event source, scales pods up or down |

The gap between Cluster Autoscaler and Karpenter is significant. For a workload that experiences sudden traffic spikes, a 4-minute delay in node provisioning means 4 minutes of degraded performance or dropped requests. Karpenter closes that gap to under 30 seconds in most cases.

If your workloads are latency-sensitive, pair HPA (or KEDA) with Karpenter instead of Cluster Autoscaler. The faster node provisioning means your scaling pipeline responds in seconds, not minutes.

Decision Flowchart: Choosing the Right Autoscaler

Use this flowchart to pick the right autoscaling strategy for each workload:

Step 1: Can your workload scale horizontally?

  • Yes (stateless services, web apps, APIs) -> Go to Step 2
  • No (databases, stateful sets, single-instance apps) -> Use VPA + CA (or VPA + Karpenter)

Step 2: What drives your scaling?

  • CPU or memory utilization -> Use HPA
  • External events (queues, streams, schedules) -> Use KEDA
  • Custom application metrics -> Use HPA with custom metrics or KEDA

Step 3: Do you need node-level scaling?

  • Yes, on AWS -> Use Karpenter (preferred) or Cluster Autoscaler
  • Yes, on GCP or Azure -> Use Cluster Autoscaler
  • No (fixed cluster size) -> Skip node autoscaling

Step 4: Do you also need right-sizing?

  • Yes -> Add VPA in Off mode for recommendations, or VPA in Auto mode on memory only
  • No -> Your setup is complete

Step 5: Is cost optimization a priority?

  • Yes -> Add VPA recommendations to identify over-provisioned pods. Use Karpenter’s spot instance support. Consider Kubernetes FinOps practices across the cluster.
  • No -> Focus on reliability and performance tuning

Advanced Configuration Tips

Tuning HPA for Stability

HPA flapping (rapid scale-up/scale-down cycles) is the most common issue teams face. These settings help:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0      # Scale up immediately
    policies:
      - type: Percent
        value: 100                       # Double capacity per period
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300    # Wait 5 min before scaling down
    policies:
      - type: Percent
        value: 10                       # Remove only 10% per period
        periodSeconds: 60

The asymmetry is intentional: scale up fast to handle traffic, scale down slowly to avoid premature removal.

Setting VPA Bounds

Always set minAllowed and maxAllowed to prevent VPA from recommending absurdly small or large resource requests:

resourcePolicy:
  containerPolicies:
    - containerName: app
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "8"
        memory: "16Gi"

Without bounds, VPA might recommend 10m CPU for a web server during a quiet period, causing immediate throttling when traffic returns.
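The clamping itself is trivial, which is exactly why it is cheap insurance. A sketch, with illustrative millicpu numbers:

```python
def clamp(recommended: float, min_allowed: float, max_allowed: float) -> float:
    # VPA holds its target between minAllowed and maxAllowed
    return min(max(recommended, min_allowed), max_allowed)

print(clamp(10, 100, 8000))     # quiet-period 10m CPU recommendation -> held at 100m
print(clamp(12000, 100, 8000))  # runaway recommendation -> capped at 8000m (8 cores)
```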

CA Expander Selection

The expander determines how Cluster Autoscaler chooses between multiple node groups:

  • random (default): Picks a random eligible node group. Simple but not cost-efficient.
  • least-waste: Chooses the node group that wastes the fewest resources after scheduling. Best for cost optimization.
  • most-pods: Chooses the node group that can schedule the most pending pods. Best for throughput.
  • priority: Uses a priority-based ordering you define. Best for teams with specific instance type preferences (e.g., prefer spot, fall back to on-demand).

For most production clusters, least-waste or priority are better choices than the default random.
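The idea behind least-waste can be sketched as picking the node group with the smallest leftover capacity after the pending pods are packed onto it. This is a rough model; the real expander scores the outcome of a full scheduling simulation:

```python
def least_waste(pending_cpu, pending_mem, node_groups):
    # Score each node group by the fraction of CPU and memory left idle
    # after fitting the pending pods; lower leftover = less waste
    def waste(ng):
        return ((ng["cpu"] - pending_cpu) / ng["cpu"]
                + (ng["mem"] - pending_mem) / ng["mem"])
    fitting = [ng for ng in node_groups
               if ng["cpu"] >= pending_cpu and ng["mem"] >= pending_mem]
    return min(fitting, key=waste)["name"]

groups = [
    {"name": "m5.large",  "cpu": 2.0, "mem": 8.0},   # 2 vCPU / 8 GiB
    {"name": "c5.xlarge", "cpu": 4.0, "mem": 8.0},   # 4 vCPU / 8 GiB
]
print(least_waste(1.5, 2.0, groups))  # m5.large leaves less idle capacity
```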

Pod Disruption Budgets with Autoscaling

When CA scales down nodes or VPA evicts pods, Pod Disruption Budgets (PDBs) protect availability:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
  namespace: production
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: web-app

This ensures at least 75% of your pods remain running during node drains or VPA-triggered evictions. Without PDBs, autoscaling events can briefly take your service offline.
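The arithmetic behind a percentage minAvailable is worth internalizing: Kubernetes rounds the required pod count up, so the eviction headroom is smaller than you might expect. A sketch of the calculation, not the eviction API:

```python
import math

def allowed_disruptions(desired: int, healthy: int, min_available_pct: float) -> int:
    # minAvailable "75%" of 10 pods means ceil(7.5) = 8 must keep running
    required = math.ceil(desired * min_available_pct)
    return max(0, healthy - required)

print(allowed_disruptions(10, 10, 0.75))  # only 2 pods may be evicted at once
print(allowed_disruptions(4, 4, 0.75))    # with 4 pods, just 1 may be evicted
```

Note that if any pods are already unhealthy, the budget shrinks further, and an overly strict PDB can block node drains entirely.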

Monitoring Your Autoscaling Setup

An autoscaling configuration you cannot observe is one you cannot trust. Instrument every layer of the stack so that scaling events, and scaling failures, are visible before your users feel them.

Key Metrics to Track

Monitor these metrics to verify your autoscaling is working correctly:

  • HPA: kube_horizontalpodautoscaler_status_current_replicas, kube_horizontalpodautoscaler_status_desired_replicas, and the delta between them
  • VPA: VPA recommendations vs actual requests, number of evictions triggered by VPA
  • CA: cluster_autoscaler_nodes_count, cluster_autoscaler_unschedulable_pods_count, time from pending pod to scheduled pod
  • Karpenter: karpenter_pods_startup_duration_seconds, karpenter_nodes_created, karpenter_nodes_terminated

Alerting Rules

Set alerts for autoscaling failures:

# Alert when pods remain pending for too long (CA not scaling fast enough)
- alert: PodsStuckPending
  expr: kube_pod_status_phase{phase="Pending"} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pods pending for over 5 minutes - check Cluster Autoscaler"

# Alert when HPA is at max replicas (cannot scale further)
- alert: HPAAtMaxReplicas
  expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "HPA at maximum replicas for 10+ minutes - consider increasing max"

Common Pitfalls and How to Avoid Them

Pitfall 1: No Resource Requests Set

HPA calculates utilization as a percentage of the resource request. If your pods have no CPU or memory requests defined, HPA cannot compute utilization and will not scale. Always set explicit resource requests on every container.

Pitfall 2: VPA and HPA on the Same CPU Metric

As covered earlier, this creates feedback loops. If you must use both, restrict VPA to memory only or run it in Off mode.

Pitfall 3: CA Scale-Down Too Aggressive

Setting --scale-down-unneeded-time too low causes nodes to be removed and re-added repeatedly. Start with 10 minutes and adjust based on your workload patterns.

Pitfall 4: Ignoring Pod Startup Time

HPA can add pods in seconds, but if those pods take 3 minutes to start serving traffic, you have a gap. Use readiness probes and startup probes to ensure HPA knows when new pods are actually ready.

Pitfall 5: Not Testing with Realistic Load

Autoscaling configurations that work under synthetic load often fail under production traffic patterns. Use tools like k6, Locust, or Vegeta to simulate realistic traffic and validate your scaling behavior before deploying.

Summary: Which Combination Should You Use?

| Workload Type | Recommended Stack | Why |
| --- | --- | --- |
| Stateless web apps/APIs | HPA + CA (or Karpenter) | Standard horizontal scaling with node provisioning |
| Batch processing | VPA + CA | Right-size individual jobs, scale nodes as needed |
| Stateful workloads (databases) | VPA + CA | Cannot scale horizontally, need vertical right-sizing |
| Memory-heavy APIs | HPA (CPU) + VPA (memory) + CA | Horizontal scaling with memory optimization |
| Event-driven processors | KEDA + Karpenter | Scale on events, fast node provisioning, scale to zero |
| Cost-sensitive environments | HPA + Karpenter + VPA (Off mode) | Fast scaling, spot instances, right-sizing recommendations |

The key takeaway: autoscaling is not a single tool. It is a multi-layer strategy where each autoscaler handles a different concern. Get the combination right, and your cluster scales efficiently while keeping costs under control. Get it wrong, and you end up with autoscalers fighting each other while your bill grows.

For teams running on AWS, the combination of HPA + Karpenter has become the standard recommendation, replacing the older HPA + Cluster Autoscaler pattern. Karpenter’s faster provisioning and intelligent instance selection make it the clear choice for production-ready EKS clusters.


Master Kubernetes Autoscaling for Your Workloads

Getting the right combination of HPA, VPA, and Cluster Autoscaler is the difference between a cluster that wastes money and one that scales efficiently.

Our team provides expert Kubernetes consulting services to help you:

  • Design multi-layer autoscaling combining HPA, VPA, and node-level scaling
  • Right-size your pods and nodes to eliminate over-provisioning
  • Implement Karpenter and KEDA for modern, event-driven autoscaling

We have optimized autoscaling strategies across 100+ production clusters.

Get a free Kubernetes autoscaling review →
