Kubernetes gives you three autoscaling mechanisms, each operating at a different layer. The Horizontal Pod Autoscaler (HPA) adds or removes pod replicas. The Vertical Pod Autoscaler (VPA) adjusts CPU and memory requests on individual pods. The Cluster Autoscaler (CA) adds or removes entire nodes from your cluster.
Most teams start with one, hit a wall, then bolt on another without understanding how they interact. We have tuned autoscaling across 100+ production clusters, and the pattern is always the same: each autoscaler solves a specific problem, but combining them incorrectly creates new ones.
This guide breaks down exactly how each autoscaler works, when they conflict, and which combinations actually work in production. We also cover Karpenter and KEDA as modern alternatives that change the equation entirely.
Quick Comparison: HPA vs VPA vs Cluster Autoscaler
Before diving deep, here is the three-way comparison at a glance:
| Feature | HPA | VPA | Cluster Autoscaler |
|---|---|---|---|
| What it scales | Pod replicas | Pod resources (CPU/memory) | Nodes |
| Direction | Horizontal (more pods) | Vertical (bigger pods) | Infrastructure (more nodes) |
| Default in K8s | Yes | No (separate addon) | No (separate addon) |
| Reaction time | ~15 seconds | Minutes | Minutes |
| Best for | Stateless services | Batch/stateful workloads | All workloads (node layer) |
| Cost impact | More pods, potentially more nodes | Fewer wasted resources per pod | Right-sized cluster |
| Disruption | None (adds replicas) | Pod restart required (in Auto mode) | Node drain on scale-down |
| Metric source | CPU, memory, custom, external | CPU, memory (historical usage) | Pending pods, node utilization |
Each autoscaler addresses a different layer. HPA handles demand-driven scaling of application replicas. VPA ensures each pod is right-sized so it does not waste resources. CA ensures the cluster has enough nodes to actually run those pods. Understanding these layers is the key to building an autoscaling strategy that works.
How HPA Works: Horizontal Pod Autoscaling
The Horizontal Pod Autoscaler is the most widely used autoscaler in Kubernetes. It watches metrics on your pods and adjusts the replica count to match demand.
HPA Architecture
HPA runs as a control loop inside the Kubernetes controller manager. Every 15 seconds (configurable via `--horizontal-pod-autoscaler-sync-period`), it queries the metrics API, computes the desired replica count, and scales the deployment accordingly.
The formula is straightforward:
```
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
```
If you have 3 replicas running at 80% CPU and your target is 50%, HPA calculates ceil(3 * 80/50) = 5 replicas.
HPA Configuration Example
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```
HPA Best Practices
- Always set resource requests on your pods. HPA calculates utilization as a percentage of the request. Without requests, utilization metrics are meaningless.
- Use the `behavior` field to control scale-up and scale-down velocity. Aggressive scale-up with conservative scale-down prevents flapping.
- Consider custom metrics for services where CPU does not reflect load. Requests per second, queue depth, or latency percentiles are often better signals than raw CPU for Kubernetes cost optimization strategies.
- Set `minReplicas` to at least 2 for high-availability workloads. A single replica cannot survive a node failure.
When HPA Falls Short
HPA works well for stateless, horizontally scalable workloads like web servers and API gateways. It struggles with:
- Stateful workloads where adding replicas requires data rebalancing
- Workloads with long startup times where new pods take minutes to become ready
- Services with uneven resource profiles where some pods need more CPU or memory than others
For these scenarios, VPA is often a better fit.
How VPA Works: Vertical Pod Autoscaling
The Vertical Pod Autoscaler adjusts the CPU and memory requests (and optionally limits) on individual pods based on historical usage. Instead of adding more pods, VPA makes each pod the right size.
VPA Architecture
VPA is not included in default Kubernetes. You need to install it separately from the VPA GitHub repository. It consists of three components:
- Recommender - Monitors resource usage and computes recommended requests
- Updater - Evicts pods that need resizing (in Auto mode)
- Admission Controller - Sets the recommended resources on new pods at creation time
VPA Operating Modes
VPA has three update modes, and choosing the right one is critical:
| Mode | Behavior | Pod Disruption | Use Case |
|---|---|---|---|
| Off | Recommends only, no changes applied | None | Monitoring and right-sizing analysis |
| Initial | Sets resources only at pod creation | None (existing pods unchanged) | Safe production use |
| Auto | Evicts and recreates pods with new resources | Yes (pod restarts) | Non-critical or batch workloads |
For production services, start with Off mode to gather recommendations, then move to Initial once you trust the suggestions. Auto mode should be reserved for workloads that can tolerate pod restarts.
VPA Configuration Example
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-processor-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-processor
  updatePolicy:
    updateMode: "Auto"
    minReplicas: 2
  resourcePolicy:
    containerPolicies:
    - containerName: batch-processor
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "4"
        memory: "8Gi"
      controlledResources:
      - cpu
      - memory
      controlledValues: RequestsAndLimits
```
VPA in Recommendation-Only Mode
This is the safest way to start with VPA. It gives you visibility into resource waste without touching running pods:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa-recommend
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: web-app
      controlledResources:
      - cpu
      - memory
```
Check recommendations with:
```shell
kubectl describe vpa web-app-vpa-recommend -n production
```
The output shows lower bound, target, upper bound, and uncapped target recommendations. Use these to manually adjust your resource requests and start eliminating over-provisioning across your clusters.
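For reference, the same recommendation also appears in the VPA object's `status` field. An illustrative excerpt is shown below; the field names (`lowerBound`, `target`, `upperBound`, `uncappedTarget`) are the real VPA status fields, but all numbers are invented for the example:

```yaml
# Illustrative VPA status excerpt -- all resource values are made up
status:
  recommendation:
    containerRecommendations:
    - containerName: web-app
      lowerBound:          # minimum the workload is likely to need
        cpu: 150m
        memory: 300Mi
      target:              # the recommendation VPA would apply
        cpu: 250m
        memory: 420Mi
      uncappedTarget:      # target ignoring minAllowed/maxAllowed bounds
        cpu: 250m
        memory: 420Mi
      upperBound:          # requests above this are likely wasted
        cpu: 600m
        memory: 1Gi
```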
When VPA Excels
VPA is the right choice for:
- Batch and cron jobs where resource needs vary between runs
- Stateful workloads like databases that cannot scale horizontally
- Workloads with unpredictable resource profiles that are hard to size manually
- Right-sizing exercises where you want data-driven resource requests instead of guesswork
How Cluster Autoscaler Works: Node-Level Scaling
The Cluster Autoscaler operates at the infrastructure layer. It adds nodes when pods cannot be scheduled due to insufficient resources, and removes nodes when they are underutilized.
CA Architecture
Cluster Autoscaler watches for two conditions:
- Scale-up trigger: Pods are in `Pending` state because no node has enough resources to schedule them. CA provisions a new node from the configured node group (ASGs on AWS, MIGs on GCP, VMSSs on Azure).
- Scale-down trigger: A node's utilization falls below a threshold (default 50%) for a sustained period (default 10 minutes). CA drains the node and terminates it, provided all pods on it can be rescheduled elsewhere.
CA Configuration Example
On AWS EKS, Cluster Autoscaler runs as a deployment that interacts with Auto Scaling Groups:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --scale-down-delay-after-add=5m
        - --scale-down-unneeded-time=5m
        - --scale-down-utilization-threshold=0.5
        resources:
          limits:
            cpu: 100m
            memory: 600Mi
          requests:
            cpu: 100m
            memory: 600Mi
```
Key CA Parameters
| Parameter | Default | Description |
|---|---|---|
| `--scale-down-utilization-threshold` | 0.5 | Node utilization below this triggers scale-down consideration |
| `--scale-down-delay-after-add` | 10m | Wait time after adding a node before considering scale-down |
| `--scale-down-unneeded-time` | 10m | How long a node must be underutilized before removal |
| `--max-graceful-termination-sec` | 600 | Max time to wait for pod termination during drain |
| `--expander` | random | Strategy for choosing node group (random, most-pods, least-waste, priority) |
CA Limitations
Cluster Autoscaler is reliable but has well-known limitations:
- Slow reaction time: Adding a node takes minutes (API call to cloud provider, VM boot, kubelet registration, pod scheduling). This matters for bursty workloads.
- Node group constraints: CA works with pre-defined node groups. You must configure the instance types, sizes, and availability zones ahead of time.
- Bin-packing inefficiency: CA does not always choose the optimal instance type for the workload. The `least-waste` expander helps, but is limited to the node groups you have defined.
These limitations are exactly why Karpenter was created, which we cover later in this guide.
The HPA + VPA Conflict: Why They Fight
This is where most teams get burned. Running HPA and VPA together on the same workload seems logical: let HPA handle replica count and VPA handle pod sizing. In practice, they can create a feedback loop that destabilizes your workload.
How the Conflict Happens
Here is the scenario:
- Traffic increases, CPU utilization rises to 80%
- HPA sees high CPU and wants to add replicas to bring utilization down
- VPA also sees high CPU and wants to increase the CPU request on each pod
- HPA adds replicas, spreading the load, which brings CPU down
- VPA sees the lower CPU and reduces its recommendation
- With smaller resource requests, utilization spikes again
- HPA reacts, VPA reacts, and the cycle continues
The result is oscillating replica counts, unnecessary pod evictions, and unpredictable scaling behavior.
The Rule: Never Use Both on the Same Metric
The official Kubernetes autoscaling documentation is clear: do not use HPA and VPA on the same resource metric for the same workload. If HPA is scaling on CPU, VPA should not be managing CPU requests in Auto mode.
Safe Ways to Combine HPA and VPA
There are legitimate patterns for using both:
Pattern 1: VPA in Off mode with HPA
Run VPA in recommendation-only mode while HPA handles scaling. Use VPA’s suggestions to periodically update your resource requests manually:
```yaml
# HPA handles scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
---
# VPA provides recommendations only
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"
```
Pattern 2: HPA on CPU, VPA on memory only
This is the advanced pattern for teams that want both active. VPA controls memory sizing while HPA scales replicas based on CPU:
```yaml
# HPA scales on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
---
# VPA manages memory only
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: worker
      controlledResources:
      - memory
      minAllowed:
        memory: "256Mi"
      maxAllowed:
        memory: "4Gi"
```
This pattern works because the autoscalers operate on separate metrics. HPA does not care about memory utilization, and VPA does not touch CPU requests.
Best Autoscaling Combinations for Production
After working with dozens of production environments, these are the combinations that consistently deliver results. The right choice depends on your workload type.
Combination 1: HPA + Cluster Autoscaler (Standard Web Apps)
This is the default recommendation for stateless web applications, APIs, and microservices.
How it works:
- HPA adds pod replicas when CPU or custom metrics exceed the target
- When new pods cannot be scheduled, CA adds nodes to the cluster
- When traffic drops, HPA removes replicas and CA removes underutilized nodes
Best for: E-commerce platforms, REST APIs, web frontends, microservices
```yaml
# HPA for the application
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: storefront-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 55
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```
CA handles the node layer automatically. When HPA creates pods that cannot be scheduled, CA provisions new nodes within minutes.
Combination 2: VPA + Cluster Autoscaler (Batch and Stateful Workloads)
For workloads that cannot scale horizontally, VPA right-sizes individual pods while CA ensures the cluster has capacity.
How it works:
- VPA monitors resource usage and adjusts CPU/memory requests
- When VPA increases a pod’s resource request beyond what the current node can provide, the pod becomes unschedulable
- CA detects the pending pod and provisions a larger node
Best for: Databases, message brokers, batch processing jobs, ML training workloads
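As a sketch of this combination (the workload name and resource bounds are hypothetical), a single VPA object on the stateful workload is all the application-level configuration needed; CA handles any pod that no longer fits on its node after resizing:

```yaml
# Hypothetical example: VPA right-sizes a database StatefulSet; CA adds a
# larger node if the recommended requests exceed current node capacity
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Initial"   # apply on pod (re)creation only; no forced evictions
  resourcePolicy:
    containerPolicies:
    - containerName: postgres
      minAllowed:
        cpu: "500m"
        memory: "1Gi"
      maxAllowed:
        cpu: "8"
        memory: "32Gi"
```

`Initial` mode is the conservative choice here: a database pod is only resized when it restarts for some other reason, so VPA never triggers an eviction on its own.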
Combination 3: HPA + VPA (Memory Only) + Cluster Autoscaler (Advanced)
The full three-layer stack for teams that want maximum efficiency. This is the most sophisticated setup and requires careful tuning.
How it works:
- HPA scales replicas based on CPU or custom metrics
- VPA right-sizes memory requests (only memory, not CPU)
- CA ensures node capacity matches the workload
Best for: Memory-intensive APIs, Java applications with variable heap usage, services with unpredictable memory patterns
This combination is what we frequently implement during EKS architecture engagements where teams need both horizontal scaling and memory optimization.
Combination 4: HPA + Karpenter (Modern Alternative)
Karpenter replaces Cluster Autoscaler with a faster, more flexible node provisioner. Instead of working with pre-defined node groups, Karpenter selects the optimal instance type for each pending pod in real time.
How it works:
- HPA adds replicas as demand grows
- When pods are unschedulable, Karpenter provisions the right-sized node in seconds (not minutes)
- Karpenter consolidates workloads onto fewer nodes during low demand, terminating underutilized instances
Best for: AWS EKS clusters, bursty workloads, cost-sensitive environments
```yaml
# Karpenter NodePool (replaces ASG-based node groups)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: "karpenter.sh/capacity-type"
        operator: In
        values: ["spot", "on-demand"]
      - key: "node.kubernetes.io/instance-type"
        operator: In
        values:
        - m5.large
        - m5.xlarge
        - m5.2xlarge
        - c5.large
        - c5.xlarge
        - r5.large
        - r5.xlarge
      - key: "topology.kubernetes.io/zone"
        operator: In
        values:
        - eu-west-1a
        - eu-west-1b
        - eu-west-1c
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```
Karpenter is a significant upgrade over Cluster Autoscaler for AWS users. It provisions nodes in seconds rather than minutes, selects from a broader range of instance types, and consolidates workloads automatically. If you are running on EKS, this is the combination we recommend.
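The NodePool's `nodeClassRef` points at an EC2NodeClass, which carries the AWS-specific settings. A minimal sketch is below; the IAM role name, discovery tags, and cluster name are placeholders you would replace with your own environment's values:

```yaml
# Hypothetical EC2NodeClass referenced by the NodePool above;
# role name, tag values, and cluster name are placeholders
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
  - alias: al2023@latest              # latest Amazon Linux 2023 AMI
  role: KarpenterNodeRole-my-cluster  # IAM role for provisioned nodes
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
```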
Combination 5: KEDA + Karpenter (Event-Driven Workloads)
For workloads that scale based on external events rather than CPU or memory, KEDA replaces HPA with event-driven autoscaling.
How it works:
- KEDA scales pods based on event sources: SQS queue depth, Kafka consumer lag, Prometheus metrics, cron schedules, and 60+ other scalers
- Karpenter provisions nodes as KEDA creates new pods
- KEDA can scale to zero, and Karpenter removes nodes when they are empty
Best for: Queue processors, event-driven microservices, scheduled batch jobs, serverless-style workloads
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0
  maxReplicaCount: 100
  pollingInterval: 10
  cooldownPeriod: 300
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/123456789/orders
      queueLength: "5"
      awsRegion: eu-west-1
      identityOwner: operator
```
KEDA’s ability to scale to zero makes it particularly powerful for cost optimization. A queue processor that handles 1,000 messages during business hours and zero messages overnight can scale from 50 pods to 0 pods, and Karpenter removes the nodes entirely.
Reaction Time Comparison
How fast each autoscaler responds matters, especially for bursty traffic patterns. Here is what we have measured across production clusters:
| Autoscaler | Reaction Time | What Happens |
|---|---|---|
| HPA | ~15 seconds | Checks metrics every sync period, creates/removes pods |
| VPA | Minutes | Analyzes historical usage, evicts and recreates pods |
| Cluster Autoscaler | 2-5 minutes | Detects pending pods, calls cloud API, waits for node boot |
| Karpenter | 10-30 seconds | Detects pending pods, provisions optimal instance directly |
| KEDA | 10-30 seconds (configurable) | Polls event source, scales pods up or down |
The gap between Cluster Autoscaler and Karpenter is significant. For a workload that experiences sudden traffic spikes, a 4-minute delay in node provisioning means 4 minutes of degraded performance or dropped requests. Karpenter closes that gap to under 30 seconds in most cases.
If your workloads are latency-sensitive, pair HPA (or KEDA) with Karpenter instead of Cluster Autoscaler. The faster node provisioning means your scaling pipeline responds in seconds, not minutes.
Decision Flowchart: Choosing the Right Autoscaler
Use this flowchart to pick the right autoscaling strategy for each workload:
Step 1: Can your workload scale horizontally?
- Yes (stateless services, web apps, APIs) -> Go to Step 2
- No (databases, stateful sets, single-instance apps) -> Use VPA + CA (or VPA + Karpenter)
Step 2: What drives your scaling?
- CPU or memory utilization -> Use HPA
- External events (queues, streams, schedules) -> Use KEDA
- Custom application metrics -> Use HPA with custom metrics or KEDA
Step 3: Do you need node-level scaling?
- Yes, on AWS -> Use Karpenter (preferred) or Cluster Autoscaler
- Yes, on GCP or Azure -> Use Cluster Autoscaler
- No (fixed cluster size) -> Skip node autoscaling
Step 4: Do you also need right-sizing?
- Yes -> Add VPA in Off mode for recommendations, or VPA in Auto mode on memory only
- No -> Your setup is complete
Step 5: Is cost optimization a priority?
- Yes -> Add VPA recommendations to identify over-provisioned pods. Use Karpenter’s spot instance support. Consider Kubernetes FinOps practices across the cluster.
- No -> Focus on reliability and performance tuning
Advanced Configuration Tips
Tuning HPA for Stability
HPA flapping (rapid scale-up/scale-down cycles) is the most common issue teams face. These settings help:
```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # Scale up immediately
    policies:
    - type: Percent
      value: 100                     # Double capacity per period
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
    policies:
    - type: Percent
      value: 10                      # Remove only 10% per period
      periodSeconds: 60
```
The asymmetry is intentional: scale up fast to handle traffic, scale down slowly to avoid premature removal.
Setting VPA Bounds
Always set `minAllowed` and `maxAllowed` to prevent VPA from recommending absurdly small or large resource requests:
```yaml
resourcePolicy:
  containerPolicies:
  - containerName: app
    minAllowed:
      cpu: "100m"
      memory: "128Mi"
    maxAllowed:
      cpu: "8"
      memory: "16Gi"
```
Without bounds, VPA might recommend 10m CPU for a web server during a quiet period, causing immediate throttling when traffic returns.
CA Expander Selection
The expander determines how Cluster Autoscaler chooses between multiple node groups:
- `random` (default): Picks a random eligible node group. Simple but not cost-efficient.
- `least-waste`: Chooses the node group that wastes the fewest resources after scheduling. Best for cost optimization.
- `most-pods`: Chooses the node group that can schedule the most pending pods. Best for throughput.
- `priority`: Uses a priority-based ordering you define. Best for teams with specific instance type preferences (e.g., prefer spot, fall back to on-demand).
For most production clusters, `least-waste` or `priority` are better choices than the default `random`.
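The priority expander reads its ordering from a ConfigMap named `cluster-autoscaler-priority-expander` in `kube-system`. A sketch is below; the node group name patterns are hypothetical and would need to match your own ASG/MIG naming:

```yaml
# Hypothetical priority-expander config: prefer spot node groups, fall back
# to on-demand. Higher numbers win; entries are regexes on node group names.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*spot.*
    10:
      - .*on-demand.*
```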
Pod Disruption Budgets with Autoscaling
When CA scales down nodes or VPA evicts pods, Pod Disruption Budgets (PDBs) protect availability:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
  namespace: production
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: web-app
```
This ensures at least 75% of your pods remain running during node drains or VPA-triggered evictions. Without PDBs, autoscaling events can briefly take your service offline.
Monitoring Your Autoscaling Setup
An autoscaling configuration you cannot observe is one you cannot trust. Make sure scaling events at every layer (replicas, resource requests, and nodes) are visible in your monitoring stack.
Key Metrics to Track
Monitor these metrics to verify your autoscaling is working correctly:
- HPA: `kube_horizontalpodautoscaler_status_current_replicas`, `kube_horizontalpodautoscaler_status_desired_replicas`, and the delta between them
- VPA: VPA recommendations vs actual requests, number of evictions triggered by VPA
- CA: `cluster_autoscaler_nodes_count`, `cluster_autoscaler_unschedulable_pods_count`, time from pending pod to scheduled pod
- Karpenter: `karpenter_pods_startup_duration_seconds`, `karpenter_nodes_created`, `karpenter_nodes_terminated`
Alerting Rules
Set alerts for autoscaling failures:
```yaml
# Alert when pods remain pending for too long (CA not scaling fast enough)
- alert: PodsStuckPending
  expr: kube_pod_status_phase{phase="Pending"} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pods pending for over 5 minutes - check Cluster Autoscaler"

# Alert when HPA is at max replicas (cannot scale further)
- alert: HPAAtMaxReplicas
  expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "HPA at maximum replicas for 10+ minutes - consider increasing max"
```
Common Pitfalls and How to Avoid Them
Pitfall 1: No Resource Requests Set
HPA calculates utilization as a percentage of the resource request. If your pods have no CPU or memory requests defined, HPA cannot compute utilization and will not scale. Always set explicit resource requests on every container.
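A minimal sketch of what every container spec should carry (the image name and resource values are placeholders to tune per workload):

```yaml
containers:
- name: app
  image: example.com/app:latest  # placeholder image
  resources:
    requests:
      cpu: "250m"      # the baseline HPA computes utilization against
      memory: "256Mi"
    limits:
      memory: "512Mi"  # cap memory; CPU limits are often left unset
```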
Pitfall 2: VPA and HPA on the Same CPU Metric
As covered earlier, this creates feedback loops. If you must use both, restrict VPA to memory only or run it in Off mode.
Pitfall 3: CA Scale-Down Too Aggressive
Setting `--scale-down-unneeded-time` too low causes nodes to be removed and re-added repeatedly. Start with 10 minutes and adjust based on your workload patterns.
Pitfall 4: Ignoring Pod Startup Time
HPA can add pods in seconds, but if those pods take 3 minutes to start serving traffic, you have a gap. Use readiness probes and startup probes to ensure HPA knows when new pods are actually ready.
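As a sketch (paths, port, and thresholds are placeholders), a startup probe buys a slow-starting app time to boot, while the readiness probe gates traffic so HPA and the Service only count pods that can actually serve:

```yaml
containers:
- name: app
  image: example.com/app:latest  # placeholder image
  startupProbe:                  # tolerates up to 30 x 5s = 150s of startup
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30
    periodSeconds: 5
  readinessProbe:                # pod receives traffic only once this passes
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 10
```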
Pitfall 5: Not Testing with Realistic Load
Autoscaling configurations that work under synthetic load often fail under production traffic patterns. Use tools like k6, Locust, or Vegeta to simulate realistic traffic and validate your scaling behavior before deploying.
Summary: Which Combination Should You Use?
| Workload Type | Recommended Stack | Why |
|---|---|---|
| Stateless web apps/APIs | HPA + CA (or Karpenter) | Standard horizontal scaling with node provisioning |
| Batch processing | VPA + CA | Right-size individual jobs, scale nodes as needed |
| Stateful workloads (databases) | VPA + CA | Cannot scale horizontally, need vertical right-sizing |
| Memory-heavy APIs | HPA (CPU) + VPA (memory) + CA | Horizontal scaling with memory optimization |
| Event-driven processors | KEDA + Karpenter | Scale on events, fast node provisioning, scale to zero |
| Cost-sensitive environments | HPA + Karpenter + VPA (Off mode) | Fast scaling, spot instances, right-sizing recommendations |
The key takeaway: autoscaling is not a single tool. It is a multi-layer strategy where each autoscaler handles a different concern. Get the combination right, and your cluster scales efficiently while keeping costs under control. Get it wrong, and you end up with autoscalers fighting each other while your bill grows.
For teams running on AWS, the combination of HPA + Karpenter has become the standard recommendation, replacing the older HPA + Cluster Autoscaler pattern. Karpenter’s faster provisioning and intelligent instance selection make it the clear choice for production-ready EKS clusters.
Master Kubernetes Autoscaling for Your Workloads
Getting the right combination of HPA, VPA, and Cluster Autoscaler is the difference between a cluster that wastes money and one that scales efficiently.
Our team provides expert Kubernetes consulting services to help you:
- Design multi-layer autoscaling combining HPA, VPA, and node-level scaling
- Right-size your pods and nodes to eliminate over-provisioning
- Implement Karpenter and KEDA for modern, event-driven autoscaling
We have optimized autoscaling strategies across 100+ production clusters.