We have deployed and tuned Kubernetes autoscaling across more than one hundred production clusters over the past four years. Some run latency-sensitive microservices that need nodes in under a minute. Others handle batch genomics workloads that spin up hundreds of Spot instances overnight and tear them down by morning. The question our clients ask most often is deceptively simple: should we use Cluster Autoscaler or Karpenter?
The short answer is that both tools solve the same fundamental problem — scaling Kubernetes compute capacity to match workload demand — but they approach it with radically different architectures, trade-offs, and cloud provider coverage. After running both side-by-side on production EKS, AKS, and GKE clusters, we have formed strong, data-backed opinions. This guide lays out everything we have learned.
Quick comparison: Cluster Autoscaler vs Karpenter at a glance
Before diving into architecture details, here is a side-by-side summary that captures the most important differences.
| Feature | Cluster Autoscaler (CA) | Karpenter |
|---|---|---|
| Provisioning model | Works through Auto Scaling Groups (ASGs) / Managed Node Groups | Provisions EC2 instances directly via Fleet API — no ASGs |
| Node provisioning speed | 3-4 minutes typical | ~55 seconds typical |
| Instance type selection | Pre-configured per node group | Dynamically selects from 800+ instance types |
| Capacity retry speed | Minutes (ASG retry cycle) | Milliseconds (immediate fallback across types/AZs) |
| Spot interruption handling | Requires external tooling (AWS Node Termination Handler) | Native SQS-based interruption handling |
| Consolidation | Removes idle nodes only | Proactive bin-packing and replacement of under-utilised nodes |
| Multi-cloud support | AWS, Azure, GCP, on-prem (8+ providers) | AWS (GA), Azure/AKS (GA via NAP), Alibaba Cloud (growing) |
| CNCF status | SIG-Autoscaling project, ships with Kubernetes release cycle | CNCF Sandbox, AWS-originated |
| Maturity | 8+ years in production | GA since v1.0 (2024), rapidly maturing |
| Configuration model | Node groups with identical scheduling properties | NodePool CRDs with flexible constraints |
| Best for | Multi-cloud, stable workloads, GPU node groups | AWS-heavy, bursty workloads, cost optimisation |
How Cluster Autoscaler works
Cluster Autoscaler is the original Kubernetes node autoscaling solution, maintained under the official CNCF SIG-Autoscaling project. It has been battle-tested for over eight years and ships with the same versioning and release cadence as Kubernetes itself.
Architecture
Cluster Autoscaler operates on a time-driven scan loop that runs every 10 seconds by default. During each cycle it:
- Scans for unschedulable pods — identifies pods stuck in `Pending` state because no existing node has sufficient resources.
- Simulates scheduling — runs the Kubernetes scheduling algorithm against each configured node group to determine which group could accommodate the pending pods.
- Requests scale-up — instructs the cloud provider’s Auto Scaling Group (ASG on AWS, VMSS on Azure, MIG on GCP) to add nodes.
- Monitors scale-down — identifies nodes with low utilisation (below 50% by default) and cordons, drains, then removes them if pods can be rescheduled elsewhere.
The critical architectural constraint is that every node in a node group must have identical scheduling properties — same instance type (or a small set), same labels, same taints. This means you typically need multiple node groups for different workload profiles: one for general compute, one for memory-optimised workloads, one for GPU jobs, and so on.
Cluster Autoscaler configuration example
Here is a typical CA deployment on EKS using a Helm chart:
```yaml
# cluster-autoscaler-values.yaml
autoDiscovery:
  clusterName: production-eks
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/production-eks

extraArgs:
  scan-interval: "10s"
  scale-down-delay-after-add: "10m"
  scale-down-delay-after-delete: "0s"
  scale-down-unneeded-time: "10m"
  scale-down-utilization-threshold: "0.5"
  skip-nodes-with-local-storage: "false"
  expander: "least-waste"
  balance-similar-node-groups: "true"
  max-node-provision-time: "15m"

rbac:
  create: true
  serviceAccount:
    name: cluster-autoscaler
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/cluster-autoscaler

resources:
  requests:
    cpu: 100m
    memory: 600Mi
  limits:
    cpu: 500m
    memory: 1Gi
```
And the corresponding ASG configuration that CA manages:
```hcl
# Node group definition (Terraform)
resource "aws_eks_node_group" "general" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "general-compute"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    desired_size = 3
    max_size     = 50
    min_size     = 1
  }

  instance_types = ["m6i.xlarge", "m6a.xlarge", "m5.xlarge"]

  labels = {
    workload-type = "general"
  }

  tags = {
    "k8s.io/cluster-autoscaler/enabled"        = "true"
    "k8s.io/cluster-autoscaler/production-eks" = "owned"
  }
}
```
The expander strategy (`least-waste` in this example) is critical: it determines which node group CA selects when multiple groups could satisfy a pending pod. Options include `random`, `most-pods`, `least-waste`, and `priority`.
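If you choose the priority expander instead, CA reads its ordering from a ConfigMap named `cluster-autoscaler-priority-expander` in the autoscaler's namespace, mapping priority values to regexes over node group names. A minimal sketch, assuming hypothetical node group names containing `spot` and `on-demand` (higher number wins):

```yaml
# Illustrative priority-expander configuration; verify the regexes
# against your actual node group names before relying on it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*spot.*       # prefer Spot-backed node groups first
    10:
      - .*on-demand.*  # fall back to On-Demand groups
```

Node groups that match no entry are never selected, so keep a low-priority catch-all pattern if you want CA to consider every group.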
How Karpenter works
Karpenter takes a fundamentally different approach. Instead of managing node groups through an intermediary autoscaling layer, Karpenter provisions compute instances directly using the cloud provider’s API — on AWS, that means the EC2 Fleet API.
Architecture
Karpenter uses event-driven reconciliation rather than a periodic scan:
- Watches for unschedulable pods — the moment the Kubernetes scheduler marks a pod as `Pending`, Karpenter’s controller receives the event.
- Evaluates constraints — inspects the pod’s resource requests, node selectors, affinities, tolerations, and topology spread constraints.
- Selects the optimal instance — queries its internal instance type database (800+ types on AWS) and uses bin-packing heuristics to find the cheapest instance that satisfies all constraints.
- Provisions directly — launches the instance via the EC2 Fleet API, bypassing ASGs entirely. The node joins the cluster, and the pod is scheduled within seconds.
- Consolidates continuously — unlike CA which only removes idle nodes, Karpenter actively looks for opportunities to replace expensive or under-utilised nodes with cheaper alternatives.
This direct-provisioning model eliminates the ASG intermediary, which is the primary reason Karpenter achieves ~55-second provisioning times compared to CA’s 3-4 minutes. When capacity is unavailable for a particular instance type, Karpenter can retry across different instance types and availability zones in milliseconds rather than waiting for ASG retry cycles.
Karpenter configuration example
Karpenter uses two primary Custom Resource Definitions: NodePool (defines scheduling constraints and limits) and EC2NodeClass (defines AWS-specific configuration):
```yaml
# nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-compute
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h # 30 days
  limits:
    cpu: "1000"
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  weight: 50
```
```yaml
# ec2nodeclass.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-production-eks
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: production-eks
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: production-eks
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
  tags:
    Environment: production
    ManagedBy: karpenter
```
Notice the difference: instead of pre-defining specific instance types in node groups, you define constraints (instance categories, generations, capacity types) and let Karpenter dynamically select the best fit from hundreds of options at provisioning time.
Performance: provisioning speed in production
This is where the architectural differences translate into measurable business impact. In our production benchmarks across client clusters, we consistently measured the following:
Scale-up latency
| Scenario | Cluster Autoscaler | Karpenter | Difference |
|---|---|---|---|
| Single pod, standard instance | 3-4 minutes | 50-60 seconds | ~3.5x faster |
| Burst of 50 pods | 5-8 minutes (batched) | 60-90 seconds | ~5x faster |
| Spot capacity unavailable, fallback | 5-10 minutes | 55-70 seconds | ~7x faster |
| GPU instance (p4d.24xlarge) | 6-12 minutes | 2-4 minutes | ~3x faster |
The speed difference is most dramatic during capacity fallback scenarios. When Cluster Autoscaler requests a c6i.xlarge Spot instance and it is unavailable, the ASG must cycle through its retry logic, which can take minutes. Karpenter, by contrast, immediately tries alternative instance types (c6a.xlarge, c5.xlarge, m6i.xlarge) and alternative availability zones in a single Fleet API call. This retry happens in milliseconds, not minutes.
For latency-sensitive services, this difference directly impacts customer experience. A 3-minute scale-up delay during a traffic spike means 3 minutes of degraded performance or 5xx errors. A 55-second scale-up keeps the blast radius minimal. We have documented these patterns extensively in our guide to EKS architecture best practices.
Scale-down behaviour
Cluster Autoscaler’s scale-down is conservative by design. It waits for `scale-down-unneeded-time` (default 10 minutes) before removing a node, and it only removes nodes that are truly idle or whose pods can be rescheduled elsewhere. It does not proactively consolidate workloads onto fewer nodes.
Karpenter’s disruption controller is fundamentally more aggressive. It continuously evaluates whether:
- A node can be replaced with a cheaper instance type that still fits the workloads.
- Multiple under-utilised nodes can be consolidated by moving pods onto fewer nodes.
- Nodes exceeding their `expireAfter` TTL should be recycled (useful for patching and AMI rotation).
This proactive consolidation is a significant cost driver. Teams we have worked with typically see 15-30% cost reduction from consolidation alone, on top of savings from better Spot utilisation.
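If that aggressiveness worries you, Karpenter's disruption budgets let you cap how fast consolidation proceeds. A hypothetical NodePool fragment (the `budgets` field is part of the v1 `disruption` block; the schedule values here are illustrative):

```yaml
# NodePool fragment: throttle voluntary disruption.
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"            # disrupt at most 10% of nodes at a time
      - nodes: "0"              # block all voluntary disruption...
        schedule: "0 9 * * 1-5" # ...starting 09:00 Mon-Fri
        duration: 8h            # ...for the 8-hour business day
```

Multiple budgets are evaluated together and the most restrictive one wins, so you can combine a steady-state cap with business-hours freezes.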
Cost optimisation: where the real savings come from
Cost is often the primary reason teams evaluate Karpenter. Here is how each tool approaches cost reduction.
Cluster Autoscaler cost levers
CA’s cost optimisation is relatively limited:
- Expander strategies — `least-waste` selects the node group that results in the least wasted resources, but you are still constrained to the instance types defined in each node group.
- Scale-down tuning — adjusting `scale-down-utilization-threshold` and timing parameters can reduce idle capacity.
- Spot node groups — you can create separate node groups for Spot instances, but you need to pre-define which instance types are eligible.
The fundamental constraint is that you must predict your instance type needs upfront when configuring node groups. If your workloads are diverse, you end up with many node groups, each with different instance types, creating operational complexity.
Karpenter cost levers
Karpenter has several structural advantages for cost optimisation:
- Dynamic instance selection — by evaluating 800+ instance types at provisioning time, Karpenter consistently finds cheaper options that CA would miss because they were not in any node group definition.
- Proactive consolidation — replacing under-utilised `m6i.2xlarge` nodes with `m6i.xlarge` nodes, or consolidating three lightly-loaded nodes into two, happens automatically.
- Spot diversification — Karpenter’s `capacity-type: ["spot", "on-demand"]` constraint with broad instance category selection means it can tap into deeper Spot pools with lower interruption rates.
- Right-sizing at the node level — instead of rounding up to the nearest node group size, Karpenter picks the instance that most closely matches the actual resource request.
Salesforce’s migration of over 1,000 EKS clusters from Cluster Autoscaler to Karpenter in 2025 validated these savings at enterprise scale. Their initial results showed approximately 5% cost savings in FY2026, with projected 5-10% further reduction in FY2027 as bin-packing and Spot utilisation continue to optimise. They also reported an 80% decrease in operational overhead as automated provisioning replaced manual node group management. You can find more detail in AWS’s case study on the Salesforce migration.
For teams pursuing deeper cost strategies, we recommend pairing autoscaler optimisation with the techniques covered in our Kubernetes cost optimization strategies guide.
Spot instance handling
Spot instances can reduce compute costs by up to 90%, but managing interruptions is critical for production reliability. This is an area where the two tools diverge significantly.
Cluster Autoscaler with Spot
CA itself has no native Spot interruption handling. On AWS, you need to deploy the AWS Node Termination Handler (NTH) as a separate DaemonSet or pod. NTH watches for EC2 Spot interruption notices (2-minute warning), scheduled events, and rebalance recommendations, then cordons and drains the affected nodes.
This works but introduces an additional component to deploy, configure, and maintain. It also means the interruption response is sequential: NTH detects the interruption, cordons the node, drains pods, and only then does CA notice the reduced capacity and begin provisioning replacement nodes.
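If you stay on CA with Spot, NTH is typically installed via its Helm chart. A sketch of values we commonly set; the key names below come from the aws-node-termination-handler chart and should be verified against your chart version:

```yaml
# node-termination-handler-values.yaml (illustrative, IMDS mode)
enableSpotInterruptionDraining: true  # act on the 2-minute Spot notice
enableRebalanceMonitoring: true       # react to rebalance recommendations
enableScheduledEventDraining: true    # drain ahead of EC2 maintenance events
podTerminationGracePeriod: 120        # give pods the full notice window
```

Even with this in place, the sequential detect-drain-then-scale flow described above still applies; NTH only shortens the detection step.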
Karpenter with Spot
Karpenter handles Spot interruptions natively through an SQS-based event pipeline:
- EC2 sends Spot interruption notices to EventBridge.
- EventBridge forwards events to a dedicated SQS queue.
- Karpenter consumes events from the queue and immediately begins draining the interrupted node and provisioning a replacement in parallel.
This parallel approach means the replacement node is often ready before the original node is terminated, resulting in near-zero disruption for workloads with properly configured pod disruption budgets.
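The "properly configured pod disruption budgets" caveat does real work here. A minimal PDB for a hypothetical 4-replica API deployment, ensuring drains never take more than one replica offline at a time:

```yaml
# pdb.yaml -- the app label and replica count are assumptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 3        # keep at least 3 of 4 replicas during drains
  selector:
    matchLabels:
      app: api
```

Both Karpenter's drain and a manual `kubectl drain` respect this budget, evicting pods only as fast as the replacement node can absorb them.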
Additionally, Karpenter supports Spot-to-Spot consolidation: when a cheaper Spot instance becomes available, Karpenter can proactively migrate workloads to it, further reducing costs. To preserve diversification, it requires at least 15 candidate instance types before it will perform a Spot-to-Spot consolidation.
```yaml
# Enable Spot with automatic fallback to On-Demand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # Spot preferred, On-Demand fallback
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"] # Broad pool for Spot availability
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"] # Gen 5+ for better price-performance
```
Multi-cloud support
This is Cluster Autoscaler’s strongest advantage and the most common reason teams choose it over Karpenter.
Cluster Autoscaler: the multi-cloud standard
CA supports virtually every Kubernetes environment through cloud-specific drivers:
- AWS (EKS, self-managed) — via ASGs
- Azure (AKS) — via Virtual Machine Scale Sets
- GCP (GKE) — via Managed Instance Groups
- On-premises — drivers for vSphere, OpenStack, and others
- Additional clouds — Alibaba, Oracle, DigitalOcean, and more
This broad support makes CA the default choice for organisations running Kubernetes across multiple cloud providers. The configuration differs per provider, but the core behaviour and tuning parameters remain consistent.
If you are evaluating which managed Kubernetes service to use, our EKS vs AKS vs GKE comparison covers the autoscaling differences across all three platforms.
Karpenter: AWS-first, expanding
Karpenter was built at AWS and is most mature on EKS. However, multi-cloud support is expanding:
- AWS (EKS) — full GA support, the reference implementation.
- Azure (AKS) — GA via Node Auto Provisioning (NAP), a managed Karpenter addon. Self-hosted mode is also available. Only supports CNI overlay with Cilium data plane and Linux nodes as of early 2026.
- Alibaba Cloud — community provider under active development.
- GKE — GKE has its own autoprovisioning system that draws inspiration from Karpenter’s approach but is not Karpenter itself.
For AWS-primary organisations, Karpenter’s cloud support is no longer a blocker. For multi-cloud or Azure/GCP-heavy environments, Cluster Autoscaler remains the safer choice in 2026.
Migration considerations: CA to Karpenter
If you are considering migration, Salesforce’s approach at scale provides a useful blueprint. Here are the key considerations based on our experience migrating dozens of clusters.
Pre-migration checklist
- Audit your node groups — document every ASG, its instance types, labels, taints, and the workloads that target each group.
- Map node groups to NodePools — each node group with distinct scheduling properties becomes a Karpenter NodePool. Consolidate where possible.
- Review pod disruption budgets — Karpenter’s consolidation is more aggressive than CA. Ensure all production workloads have appropriate PDBs.
- Set up the interruption queue — create the SQS queue and EventBridge rules for Spot interruption handling before enabling Spot in Karpenter.
- Plan IAM permissions — Karpenter needs permissions to create and terminate EC2 instances directly, which is a broader permission set than CA’s ASG-only access.
Recommended migration strategy
We recommend a parallel-running approach rather than a direct cutover:
```bash
# Step 1: Install Karpenter alongside CA
helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "1.2.0" \
  --namespace kube-system \
  --set "settings.clusterName=production-eks" \
  --set "settings.interruptionQueueName=production-eks-karpenter" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi

# Step 2: Create Karpenter NodePools (initially with weight: 10)
kubectl apply -f nodepool.yaml

# Step 3: Gradually shift workloads by increasing
# Karpenter NodePool weight and decreasing ASG desired count

# Step 4: Cordon CA-managed nodes one group at a time
kubectl cordon -l eks.amazonaws.com/nodegroup=old-general-compute

# Step 5: Drain and let Karpenter provision replacements
kubectl drain -l eks.amazonaws.com/nodegroup=old-general-compute \
  --ignore-daemonsets --delete-emptydir-data

# Step 6: Once all workloads are on Karpenter nodes,
# scale ASGs to zero and remove CA
```
Common migration pitfalls
- PDB violations during consolidation — if your PDBs are too tight, Karpenter’s consolidation controller will get stuck. Ensure `maxUnavailable` is at least 1 for all production deployments.
- Missing instance type constraints — without proper `instance-category` and `instance-generation` constraints, Karpenter might select instance types that are incompatible with your workloads (e.g., burstable `t3` instances for memory-intensive apps).
- AMI compatibility — ensure your `EC2NodeClass` AMI selector matches the AMI family your workloads expect. Switching from Amazon Linux 2 to AL2023 during migration adds unnecessary risk.
- Karpenter controller sizing — for clusters with 100+ nodes, Karpenter’s controller needs at least 1 CPU and 1Gi memory. Under-provisioning causes slow reconciliation.
When to use Cluster Autoscaler
Despite Karpenter’s advantages, Cluster Autoscaler remains the right choice in several scenarios:
Multi-cloud environments
If you run Kubernetes on AWS, Azure, and GCP, CA provides a consistent autoscaling experience across all three. Managing Karpenter on AWS and a different autoscaler on other clouds increases operational complexity.
Long-running GPU workloads
For ML training jobs that need specific GPU instance types (like p4d.24xlarge or p5.48xlarge) and run for hours or days, CA’s node group model provides more predictable behaviour. You define a GPU node group with the exact instance type, and CA scales it. Karpenter’s consolidation could theoretically disrupt long-running jobs if not carefully configured with do-not-disrupt annotations.
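The do-not-disrupt safeguard mentioned above is a per-pod annotation. A sketch for a hypothetical training Job (image name and GPU count are placeholders):

```yaml
# gpu-training-job.yaml -- opt the pod out of Karpenter's voluntary
# disruption (consolidation, drift, expiry) for the life of the job.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: registry.example.com/training:latest  # hypothetical
          resources:
            limits:
              nvidia.com/gpu: 8
```

The annotation blocks voluntary disruption only; Spot interruptions and node failures can still terminate the pod, so long-running jobs should also checkpoint.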
Heavily regulated environments
Some compliance frameworks require explicit approval of infrastructure changes. CA’s node group model, where every possible instance type is pre-defined and approved, can simplify compliance documentation compared to Karpenter’s dynamic selection from hundreds of instance types.
Teams with existing CA expertise
If your platform team has years of CA tuning experience and your autoscaling works well, the migration cost may not be justified. Evaluate whether the specific benefits of Karpenter (speed, consolidation, Spot handling) address real pain points in your environment.
When to use Karpenter
Karpenter is the stronger choice in these scenarios:
AWS-primary or AWS-only environments
If EKS is your primary Kubernetes platform, Karpenter offers strictly superior autoscaling. The provisioning speed, cost optimisation, and native Spot handling are compelling. Our EKS consulting engagements now default to Karpenter for new cluster deployments.
Bursty or unpredictable workloads
Web applications with traffic spikes, CI/CD systems that launch hundreds of build pods, event-driven architectures — any workload pattern where fast scale-up matters benefits significantly from Karpenter’s 55-second provisioning.
Cost-sensitive environments
If reducing compute spend is a priority, Karpenter’s dynamic instance selection, proactive consolidation, and Spot diversification typically deliver 15-30% savings compared to a well-tuned CA setup. Teams running significant Spot workloads report even higher savings.
Diverse workload profiles
If your cluster runs a mix of CPU-bound, memory-bound, and GPU workloads, Karpenter’s flexible NodePool model is easier to manage than maintaining dozens of CA node groups with different instance types.
Teams adopting EKS Auto Mode
AWS introduced EKS Auto Mode, which uses Karpenter as its underlying autoscaling engine. If you are adopting Auto Mode for simplified cluster management, Karpenter is the default and only node autoscaling option.
The future of Kubernetes autoscaling
The trajectory is clear: Karpenter’s architecture represents the future of Kubernetes node autoscaling. The direct-provisioning, constraint-based model is more aligned with how modern cloud-native teams think about infrastructure.
Cluster Autoscaler is not going away. As an official SIG-Autoscaling project, it will continue to be maintained and released alongside Kubernetes. But the innovation momentum has shifted. Karpenter’s feature velocity, the Salesforce-scale validation, and expanding multi-cloud support signal that it is becoming the default choice for new Kubernetes deployments on AWS.
For organisations on Azure, GKE, or multi-cloud, the decision will get easier over the next 12-18 months as the Karpenter provider ecosystem matures. For now, CA remains the pragmatic choice outside of AWS.
Regardless of which autoscaler you choose, the most important factors for effective autoscaling are proper resource requests on your pods, well-configured pod disruption budgets, and monitoring that gives you visibility into scaling events and latency. The autoscaler is only as good as the signals it receives from your workloads.
Optimize Your Kubernetes Autoscaling Strategy
Choosing between Cluster Autoscaler and Karpenter depends on your workload patterns, cloud provider, and cost targets.
Our team provides expert Kubernetes consulting services to help you:
- Evaluate and implement the right autoscaling strategy for your clusters
- Migrate from Cluster Autoscaler to Karpenter with zero downtime
- Optimize Spot instance usage to reduce compute costs by up to 90%
We have deployed and optimized autoscaling across 100+ production Kubernetes clusters.