KUBERNETES

Cluster Autoscaler vs Karpenter: We Tested Both on 100+ Clusters

Engineering Team 2026-03-09

We have deployed and tuned Kubernetes autoscaling across more than one hundred production clusters over the past four years. Some run latency-sensitive microservices that need nodes in under a minute. Others handle batch genomics workloads that spin up hundreds of Spot instances overnight and tear them down by morning. The question our clients ask most often is deceptively simple: should we use Cluster Autoscaler or Karpenter?

The short answer is that both tools solve the same fundamental problem — scaling Kubernetes compute capacity to match workload demand — but they approach it with radically different architectures, trade-offs, and cloud provider coverage. After running both side-by-side on production EKS, AKS, and GKE clusters, we have formed strong, data-backed opinions. This guide lays out everything we have learned.

Quick comparison: Cluster Autoscaler vs Karpenter at a glance

Before diving into architecture details, here is a side-by-side summary that captures the most important differences.

Feature | Cluster Autoscaler (CA) | Karpenter
Provisioning model | Works through Auto Scaling Groups (ASGs) / Managed Node Groups | Provisions EC2 instances directly via Fleet API — no ASGs
Node provisioning speed | 3-4 minutes typical | ~55 seconds typical
Instance type selection | Pre-configured per node group | Dynamically selects from 800+ instance types
Capacity retry speed | Minutes (ASG retry cycle) | Milliseconds (immediate fallback across types/AZs)
Spot interruption handling | Requires external tooling (AWS Node Termination Handler) | Native SQS-based interruption handling
Consolidation | Removes idle nodes only | Proactive bin-packing and replacement of under-utilised nodes
Multi-cloud support | AWS, Azure, GCP, on-prem (8+ providers) | AWS (GA), Azure/AKS (GA via NAP), Alibaba Cloud (growing)
CNCF status | SIG-Autoscaling project, ships with Kubernetes release cycle | CNCF Sandbox, AWS-originated
Maturity | 8+ years in production | GA since v1.0 (2024), rapidly maturing
Configuration model | Node groups with identical scheduling properties | NodePool CRDs with flexible constraints
Best for | Multi-cloud, stable workloads, GPU node groups | AWS-heavy, bursty workloads, cost optimisation

How Cluster Autoscaler works

Cluster Autoscaler is the original Kubernetes node autoscaling solution, maintained under the official CNCF SIG-Autoscaling project. It has been battle-tested for over eight years and ships with the same versioning and release cadence as Kubernetes itself.

Architecture

Cluster Autoscaler operates on a time-driven scan loop that runs every 10 seconds by default. During each cycle it:

  1. Scans for unschedulable pods — identifies pods stuck in Pending state because no existing node has sufficient resources.
  2. Simulates scheduling — runs the Kubernetes scheduling algorithm against each configured node group to determine which group could accommodate the pending pods.
  3. Requests scale-up — instructs the cloud provider’s Auto Scaling Group (ASG on AWS, VMSS on Azure, MIG on GCP) to add nodes.
  4. Monitors scale-down — identifies nodes with low utilisation (below 50% by default) and cordons, drains, then removes them if pods can be rescheduled elsewhere.

The critical architectural constraint is that every node in a node group must have identical scheduling properties — same instance type (or a small set), same labels, same taints. This means you typically need multiple node groups for different workload profiles: one for general compute, one for memory-optimised workloads, one for GPU jobs, and so on.

Cluster Autoscaler configuration example

Here is a typical CA deployment on EKS using a Helm chart:

# cluster-autoscaler-values.yaml
autoDiscovery:
  clusterName: production-eks
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/production-eks

extraArgs:
  scan-interval: "10s"
  scale-down-delay-after-add: "10m"
  scale-down-delay-after-delete: "0s"
  scale-down-unneeded-time: "10m"
  scale-down-utilization-threshold: "0.5"
  skip-nodes-with-local-storage: "false"
  expander: "least-waste"
  balance-similar-node-groups: "true"
  max-node-provision-time: "15m"

rbac:
  create: true
  serviceAccount:
    name: cluster-autoscaler
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/cluster-autoscaler

resources:
  requests:
    cpu: 100m
    memory: 600Mi
  limits:
    cpu: 500m
    memory: 1Gi

And the corresponding ASG configuration that CA manages:

# Node group definition (Terraform)
resource "aws_eks_node_group" "general" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "general-compute"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    desired_size = 3
    max_size     = 50
    min_size     = 1
  }

  instance_types = ["m6i.xlarge", "m6a.xlarge", "m5.xlarge"]

  labels = {
    workload-type = "general"
  }

  tags = {
    "k8s.io/cluster-autoscaler/enabled"         = "true"
    "k8s.io/cluster-autoscaler/production-eks"  = "owned"
  }
}

The expander strategy (least-waste in this example) is critical. It determines which node group CA selects when multiple groups could satisfy a pending pod. Options include random, most-pods, least-waste, and priority.
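
The priority expander deserves special mention because it requires extra configuration: it reads its ordering from a ConfigMap that must be named cluster-autoscaler-priority-expander in CA's namespace. A minimal sketch (the node-group name patterns here are illustrative, not from the deployment above):

```yaml
# Required by the "priority" expander: a ConfigMap with this exact
# name in the namespace where Cluster Autoscaler runs.
# Higher number = higher priority; each value is a list of regexes
# matched against node group names (patterns below are hypothetical).
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*spot.*
    10:
      - .*on-demand.*
```

With this in place, CA prefers Spot-backed node groups and falls back to On-Demand groups only when no higher-priority group can satisfy the pending pods.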

How Karpenter works

Karpenter takes a fundamentally different approach. Instead of managing node groups through an intermediary autoscaling layer, Karpenter provisions compute instances directly using the cloud provider’s API — on AWS, that means the EC2 Fleet API.

Architecture

Karpenter uses event-driven reconciliation rather than a periodic scan:

  1. Watches for unschedulable pods — the moment the Kubernetes scheduler marks a pod as Pending, Karpenter’s controller receives the event.
  2. Evaluates constraints — inspects the pod’s resource requests, node selectors, affinities, tolerations, and topology spread constraints.
  3. Selects the optimal instance — queries its internal instance type database (800+ types on AWS) and uses bin-packing heuristics to find the cheapest instance that satisfies all constraints.
  4. Provisions directly — launches the instance via the EC2 Fleet API, bypassing ASGs entirely. The node joins the cluster, and the pod is scheduled within seconds.
  5. Consolidates continuously — unlike CA which only removes idle nodes, Karpenter actively looks for opportunities to replace expensive or under-utilised nodes with cheaper alternatives.

This direct-provisioning model eliminates the ASG intermediary, which is the primary reason Karpenter achieves ~55-second provisioning times compared to CA’s 3-4 minutes. When capacity is unavailable for a particular instance type, Karpenter can retry across different instance types and availability zones in milliseconds rather than waiting for ASG retry cycles.

Karpenter configuration example

Karpenter uses two primary Custom Resource Definitions: NodePool (defines scheduling constraints and limits) and EC2NodeClass (defines AWS-specific configuration):

# nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-compute
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h  # 30 days
  limits:
    cpu: "1000"
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  weight: 50

# ec2nodeclass.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-production-eks
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: production-eks
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: production-eks
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
  tags:
    Environment: production
    ManagedBy: karpenter

Notice the difference: instead of pre-defining specific instance types in node groups, you define constraints (instance categories, generations, capacity types) and let Karpenter dynamically select the best fit from hundreds of options at provisioning time.

Performance: provisioning speed in production

This is where the architectural differences translate into measurable business impact. In our production benchmarks across client clusters, we consistently measured the following:

Scale-up latency

Scenario | Cluster Autoscaler | Karpenter | Difference
Single pod, standard instance | 3-4 minutes | 50-60 seconds | ~3.5x faster
Burst of 50 pods | 5-8 minutes (batched) | 60-90 seconds | ~5x faster
Spot capacity unavailable, fallback | 5-10 minutes | 55-70 seconds | ~7x faster
GPU instance (p4d.24xlarge) | 6-12 minutes | 2-4 minutes | ~3x faster

The speed difference is most dramatic during capacity fallback scenarios. When Cluster Autoscaler requests a c6i.xlarge Spot instance and it is unavailable, the ASG must cycle through its retry logic, which can take minutes. Karpenter, by contrast, immediately tries alternative instance types (c6a.xlarge, c5.xlarge, m6i.xlarge) and alternative availability zones in a single Fleet API call. This retry happens in milliseconds, not minutes.

For latency-sensitive services, this difference directly impacts customer experience. A 3-minute scale-up delay during a traffic spike means 3 minutes of degraded performance or 5xx errors. A 55-second scale-up keeps the blast radius minimal. We have documented these patterns extensively in our guide to EKS architecture best practices.

Scale-down behaviour

Cluster Autoscaler’s scale-down is conservative by design. It waits for the scale-down-unneeded-time (default 10 minutes) before removing a node, and it only removes nodes that are truly idle or whose pods can be rescheduled elsewhere. It does not proactively consolidate workloads onto fewer nodes.

Karpenter’s disruption controller is fundamentally more aggressive. It continuously evaluates whether:

  • A node can be replaced with a cheaper instance type that still fits the workloads.
  • Multiple under-utilised nodes can be consolidated by moving pods onto fewer nodes.
  • Nodes exceeding their expireAfter TTL should be recycled (useful for patching and AMI rotation).

This proactive consolidation is a significant cost driver. Teams we have worked with typically see 15-30% cost reduction from consolidation alone, on top of savings from better Spot utilisation.
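
When consolidation is too aggressive for a particular environment, Karpenter v1 NodePools support disruption budgets that cap the pace of voluntary disruption. A hedged sketch extending the disruption block from the NodePool example above (the percentages and schedule are illustrative):

```yaml
# Sketch: rate-limit voluntary disruption on a NodePool.
# Budgets cap how many nodes Karpenter may disrupt at once;
# values and the schedule below are illustrative.
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"                   # at most 10% of nodes at a time
      - nodes: "0"                     # block disruption entirely...
        schedule: "0 9 * * mon-fri"    # ...starting 09:00 on weekdays
        duration: 8h                   # ...for the 8-hour business day
```

This lets teams keep consolidation's cost benefits overnight while protecting peak traffic hours.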

Cost optimisation: where the real savings come from

Cost is often the primary reason teams evaluate Karpenter. Here is how each tool approaches cost reduction.

Cluster Autoscaler cost levers

CA’s cost optimisation is relatively limited:

  • Expander strategies — least-waste selects the node group that results in the least wasted resources, but you are still constrained to the instance types defined in each node group.
  • Scale-down tuning — adjusting scale-down-utilization-threshold and timing parameters can reduce idle capacity.
  • Spot node groups — you can create separate node groups for Spot instances, but you need to pre-define which instance types are eligible.

The fundamental constraint is that you must predict your instance type needs upfront when configuring node groups. If your workloads are diverse, you end up with many node groups, each with different instance types, creating operational complexity.

Karpenter cost levers

Karpenter has several structural advantages for cost optimisation:

  • Dynamic instance selection — by evaluating 800+ instance types at provisioning time, Karpenter consistently finds cheaper options that CA would miss because they were not in any node group definition.
  • Proactive consolidation — replacing under-utilised m6i.2xlarge nodes with m6i.xlarge nodes, or consolidating three lightly-loaded nodes into two, happens automatically.
  • Spot diversification — Karpenter’s capacity-type: ["spot", "on-demand"] constraint with broad instance category selection means it can tap into deeper Spot pools with lower interruption rates.
  • Right-sizing at the node level — instead of rounding up to the nearest node group size, Karpenter picks the instance that most closely matches the actual resource request.

Salesforce’s migration of over 1,000 EKS clusters from Cluster Autoscaler to Karpenter in 2025 validated these savings at enterprise scale. Their initial results showed approximately 5% cost savings in FY2026, with projected 5-10% further reduction in FY2027 as bin-packing and Spot utilisation continue to optimise. They also reported an 80% decrease in operational overhead as automated provisioning replaced manual node group management. You can find more detail in AWS’s case study on the Salesforce migration.

For teams pursuing deeper cost strategies, we recommend pairing autoscaler optimisation with the techniques covered in our Kubernetes cost optimization strategies guide.

Spot instance handling

Spot instances can reduce compute costs by up to 90%, but managing interruptions is critical for production reliability. This is an area where the two tools diverge significantly.

Cluster Autoscaler with Spot

CA itself has no native Spot interruption handling. On AWS, you need to deploy the AWS Node Termination Handler (NTH) as a separate DaemonSet or pod. NTH watches for EC2 Spot interruption notices (2-minute warning), scheduled events, and rebalance recommendations, then cordons and drains the affected nodes.

This works but introduces an additional component to deploy, configure, and maintain. It also means the interruption response is sequential: NTH detects the interruption, cordons the node, drains pods, and only then does CA notice the reduced capacity and begin provisioning replacement nodes.
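
As a sketch, the NTH Helm chart exposes a toggle per event source. The value names below follow the aws-node-termination-handler chart in its IMDS mode; verify them against the chart version you deploy:

```yaml
# node-termination-handler-values.yaml (illustrative)
enableSpotInterruptionDraining: true   # act on 2-minute Spot notices
enableScheduledEventDraining: true     # act on EC2 maintenance events
enableRebalanceMonitoring: true        # watch rebalance recommendations
enableRebalanceDraining: false         # monitor only; don't drain on rebalance
```

Each enabled source is one more failure mode to monitor, which is part of the operational overhead Karpenter's native handling removes.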

Karpenter with Spot

Karpenter handles Spot interruptions natively through an SQS-based event pipeline:

  1. EC2 sends Spot interruption notices to EventBridge.
  2. EventBridge forwards events to a dedicated SQS queue.
  3. Karpenter consumes events from the queue and immediately begins draining the interrupted node and provisioning a replacement in parallel.

This parallel approach means the replacement node is often ready before the original node is terminated, resulting in near-zero disruption for workloads with properly configured pod disruption budgets.
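
A pod disruption budget that pairs well with this behaviour can be as simple as the following (the deployment name and label are hypothetical):

```yaml
# Keep a minimum number of replicas available while Karpenter
# drains an interrupted node. Name and selector are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2          # never drain below 2 ready replicas
  selector:
    matchLabels:
      app: api-server      # hypothetical workload label
```

Without a PDB, a drain can evict every replica at once; with one, eviction waits until replacements are ready elsewhere.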

Additionally, Karpenter supports Spot-to-Spot consolidation. When a cheaper Spot instance becomes available, Karpenter can proactively migrate workloads to it, further reducing costs. To maintain diversification, it requires at least 15 eligible instance types before it will perform a Spot-to-Spot consolidation.

# Enable Spot with automatic fallback to On-Demand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # Spot preferred, On-Demand fallback
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]       # Broad pool for Spot availability
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]                  # Gen 5+ for better price-performance

Multi-cloud support

This is Cluster Autoscaler’s strongest advantage and the most common reason teams choose it over Karpenter.

Cluster Autoscaler: the multi-cloud standard

CA supports virtually every Kubernetes environment through cloud-specific drivers:

  • AWS (EKS, self-managed) — via ASGs
  • Azure (AKS) — via Virtual Machine Scale Sets
  • GCP (GKE) — via Managed Instance Groups
  • On-premises — drivers for vSphere, OpenStack, and others
  • Additional clouds — Alibaba, Oracle, DigitalOcean, and more

This broad support makes CA the default choice for organisations running Kubernetes across multiple cloud providers. The configuration differs per provider, but the core behaviour and tuning parameters remain consistent.

If you are evaluating which managed Kubernetes service to use, our EKS vs AKS vs GKE comparison covers the autoscaling differences across all three platforms.

Karpenter: AWS-first, expanding

Karpenter was built at AWS and is most mature on EKS. However, multi-cloud support is expanding:

  • AWS (EKS) — full GA support, the reference implementation.
  • Azure (AKS) — GA via Node Auto Provisioning (NAP), a managed Karpenter addon. Self-hosted mode is also available. Only supports CNI overlay with Cilium data plane and Linux nodes as of early 2026.
  • Alibaba Cloud — community provider under active development.
  • GKE — GKE has its own autoprovisioning system that draws inspiration from Karpenter’s approach but is not Karpenter itself.

For AWS-primary organisations, Karpenter’s cloud support is no longer a blocker. For multi-cloud or Azure/GCP-heavy environments, Cluster Autoscaler remains the safer choice in 2026.

Migration considerations: CA to Karpenter

If you are considering migration, Salesforce’s approach at scale provides a useful blueprint. Here are the key considerations based on our experience migrating dozens of clusters.

Pre-migration checklist

  1. Audit your node groups — document every ASG, its instance types, labels, taints, and the workloads that target each group.
  2. Map node groups to NodePools — each node group with distinct scheduling properties becomes a Karpenter NodePool. Consolidate where possible.
  3. Review pod disruption budgets — Karpenter’s consolidation is more aggressive than CA. Ensure all production workloads have appropriate PDBs.
  4. Set up the interruption queue — create the SQS queue and EventBridge rules for Spot interruption handling before enabling Spot in Karpenter.
  5. Plan IAM permissions — Karpenter needs permissions to create and terminate EC2 instances directly, which is a broader permission set than CA’s ASG-only access.
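
To illustrate the scope difference in point 5, Karpenter's controller policy grants direct EC2 lifecycle actions rather than ASG access. An abbreviated, non-exhaustive sketch; the real policy published in the Karpenter docs is considerably longer and scopes resources with tag conditions rather than "*":

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "KarpenterEC2Sketch",
      "Effect": "Allow",
      "Action": [
        "ec2:CreateFleet",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:CreateLaunchTemplate",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeSpotPriceHistory",
        "pricing:GetProducts",
        "ssm:GetParameter"
      ],
      "Resource": "*"
    }
  ]
}
```

Security teams that have only ever reviewed CA's ASG-scoped role should see this policy early in the migration, not at cutover time.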

We recommend a parallel-running approach rather than a direct cutover:

# Step 1: Install Karpenter alongside CA
helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "1.2.0" \
  --namespace kube-system \
  --set "settings.clusterName=production-eks" \
  --set "settings.interruptionQueueName=production-eks-karpenter" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi

# Step 2: Create Karpenter NodePools (initially with weight: 10)
kubectl apply -f nodepool.yaml

# Step 3: Gradually shift workloads by increasing
# Karpenter NodePool weight and decreasing ASG desired count

# Step 4: Cordon CA-managed nodes one group at a time
kubectl cordon -l eks.amazonaws.com/nodegroup=old-general-compute

# Step 5: Drain and let Karpenter provision replacements
kubectl drain -l eks.amazonaws.com/nodegroup=old-general-compute \
  --ignore-daemonsets --delete-emptydir-data

# Step 6: Once all workloads are on Karpenter nodes,
# scale ASGs to zero and remove CA

Common migration pitfalls

  • PDB violations during consolidation — if your PDBs are too tight, Karpenter’s consolidation controller will get stuck. Ensure maxUnavailable is at least 1 for all production deployments.
  • Missing instance type constraints — without proper instance-category and instance-generation constraints, Karpenter might select instance types that are incompatible with your workloads (e.g., burstable t3 instances for memory-intensive apps).
  • AMI compatibility — ensure your EC2NodeClass AMI selector matches the AMI family your workloads expect. Switching from Amazon Linux 2 to AL2023 during migration adds unnecessary risk.
  • Karpenter controller sizing — for clusters with 100+ nodes, Karpenter’s controller needs at least 1 CPU and 1Gi memory. Under-provisioning causes slow reconciliation.

When to use Cluster Autoscaler

Despite Karpenter’s advantages, Cluster Autoscaler remains the right choice in several scenarios:

Multi-cloud environments

If you run Kubernetes on AWS, Azure, and GCP, CA provides a consistent autoscaling experience across all three. Managing Karpenter on AWS and a different autoscaler on other clouds increases operational complexity.

Long-running GPU workloads

For ML training jobs that need specific GPU instance types (like p4d.24xlarge or p5.48xlarge) and run for hours or days, CA’s node group model provides more predictable behaviour. You define a GPU node group with the exact instance type, and CA scales it. Karpenter’s consolidation could theoretically disrupt long-running jobs if not carefully configured with do-not-disrupt annotations.
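
The annotation in question is karpenter.sh/do-not-disrupt, set on the pod template. A sketch on a hypothetical training job (names and image are illustrative):

```yaml
# Hypothetical GPU training job protected from Karpenter's
# voluntary disruption (consolidation, expiry). This does NOT
# protect against Spot reclaims or node failures.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: registry.example.com/train:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 8
```

With this annotation in place, Karpenter leaves the node alone until the pod completes, narrowing the gap with CA's predictability for long-running jobs.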

Heavily regulated environments

Some compliance frameworks require explicit approval of infrastructure changes. CA’s node group model, where every possible instance type is pre-defined and approved, can simplify compliance documentation compared to Karpenter’s dynamic selection from hundreds of instance types.

Teams with existing CA expertise

If your platform team has years of CA tuning experience and your autoscaling works well, the migration cost may not be justified. Evaluate whether the specific benefits of Karpenter (speed, consolidation, Spot handling) address real pain points in your environment.

When to use Karpenter

Karpenter is the stronger choice in these scenarios:

AWS-primary or AWS-only environments

If EKS is your primary Kubernetes platform, Karpenter offers strictly superior autoscaling. The provisioning speed, cost optimisation, and native Spot handling are compelling. Our EKS consulting engagements now default to Karpenter for new cluster deployments.

Bursty or unpredictable workloads

Web applications with traffic spikes, CI/CD systems that launch hundreds of build pods, event-driven architectures — any workload pattern where fast scale-up matters benefits significantly from Karpenter’s 55-second provisioning.

Cost-sensitive environments

If reducing compute spend is a priority, Karpenter’s dynamic instance selection, proactive consolidation, and Spot diversification typically deliver 15-30% savings compared to a well-tuned CA setup. Teams running significant Spot workloads report even higher savings.

Diverse workload profiles

If your cluster runs a mix of CPU-bound, memory-bound, and GPU workloads, Karpenter’s flexible NodePool model is easier to manage than maintaining dozens of CA node groups with different instance types.

Teams adopting EKS Auto Mode

AWS introduced EKS Auto Mode, which uses Karpenter as its underlying autoscaling engine. If you are adopting Auto Mode for simplified cluster management, Karpenter is the default and only autoscaling option.

The future of Kubernetes autoscaling

The trajectory is clear: Karpenter’s architecture represents the future of Kubernetes node autoscaling. The direct-provisioning, constraint-based model is more aligned with how modern cloud-native teams think about infrastructure.

Cluster Autoscaler is not going away. As an official SIG-Autoscaling project, it will continue to be maintained and released alongside Kubernetes. But the innovation momentum has shifted. Karpenter’s feature velocity, the Salesforce-scale validation, and expanding multi-cloud support signal that it is becoming the default choice for new Kubernetes deployments on AWS.

For organisations on Azure, GKE, or multi-cloud, the decision will get easier over the next 12-18 months as the Karpenter provider ecosystem matures. For now, CA remains the pragmatic choice outside of AWS.

Regardless of which autoscaler you choose, the most important factors for effective autoscaling are proper resource requests on your pods, well-configured pod disruption budgets, and monitoring that gives you visibility into scaling events and latency. The autoscaler is only as good as the signals it receives from your workloads.


Optimize Your Kubernetes Autoscaling Strategy

Choosing between Cluster Autoscaler and Karpenter depends on your workload patterns, cloud provider, and cost targets.

Our team provides expert Kubernetes consulting services to help you:

  • Evaluate and implement the right autoscaling strategy for your clusters
  • Migrate from Cluster Autoscaler to Karpenter with zero downtime
  • Optimize Spot instance usage to reduce compute costs by up to 90%

We have deployed and optimized autoscaling across 100+ production Kubernetes clusters.

Get a free Kubernetes autoscaling assessment →
