Kubernetes cluster autoscaling is no longer optional for production workloads — it is foundational. If your nodes cannot keep pace with pod demand, you face scheduling delays, degraded performance, and unhappy users. If your nodes sit idle, you are haemorrhaging money on compute you never use.
For years, Cluster Autoscaler was the default answer. It works, it is battle-tested, and every major cloud provider supports it. But Karpenter, originally developed by AWS and now part of the Kubernetes SIG Autoscaling project, has changed the conversation. Its groupless architecture, sub-minute scaling, and intelligent consolidation have made it the preferred choice for teams running EKS at scale.
At Tasrie IT Services, we have migrated more than 50 production clusters from Cluster Autoscaler to Karpenter over the past 18 months. This post distils what we learned: the architecture differences that matter, the benchmarks we measured, the cost savings we achieved, and a practical migration playbook you can follow.
Architecture Comparison: ASG-Based vs Groupless
The fundamental difference between the two tools lies in how they provision compute.
Cluster Autoscaler: The Node Group Model
Cluster Autoscaler operates through AWS Auto Scaling Groups (ASGs). Each ASG defines a fixed set of instance types, sizes, and configurations. When CA detects pending pods that cannot be scheduled, it evaluates which existing node group can satisfy the request and instructs the ASG to scale up.
The flow looks like this:
- Pending pods detected during periodic scan (every 10+ seconds)
- CA evaluates available node groups against pod requirements
- Scale-up request sent to the matching ASG
- ASG launches an EC2 instance from its pre-defined configuration
- Node registers with the cluster and pods are scheduled
This model is well understood and reliable, but it introduces constraints. You must pre-define your node groups, anticipate which instance types you will need, and manage separate ASGs for different workload profiles. Scaling decisions are limited to what the existing node group configurations permit.
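To make the node group model concrete, here is a hypothetical eksctl managed node group of the kind Cluster Autoscaler scales against; the cluster name, region, and sizes are assumptions for illustration, while the `k8s.io/cluster-autoscaler/...` tags are the standard auto-discovery tags CA looks for:

```yaml
# Hypothetical eksctl managed node group. Instance type, min/max size,
# and AMI are all fixed up front; CA can only scale this group up or down.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster          # assumed cluster name
  region: us-east-1         # assumed region
managedNodeGroups:
  - name: general-m5
    instanceType: m5.xlarge # fixed for the whole group
    minSize: 2
    maxSize: 20
    tags:
      # Auto-discovery tags Cluster Autoscaler matches on
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-cluster: "owned"
```

A cluster serving several workload profiles repeats this block per profile, which is where the ASG sprawl described below comes from.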
Karpenter: The Groupless Model
Karpenter bypasses ASGs entirely. Instead, it calls the EC2 Fleet API directly, evaluating up to 60 compatible instance types per provisioning decision. There are no pre-defined node groups — Karpenter selects the right-sized instance for the current workload in real time.
The flow is fundamentally different:
- Pending pod triggers an immediate event (no scan interval)
- Karpenter evaluates workload requirements against available instance types
- EC2 Fleet API called with Price-Capacity-Optimised allocation strategy
- Right-sized instance launched and registered
- Pod scheduled within seconds
This groupless approach means Karpenter can mix Spot and On-Demand instances in a single NodePool, select from a broad range of instance families without manual configuration, and respond to demand instantly rather than waiting for a periodic scan cycle. As the AWS EKS Best Practices guide describes it, Karpenter “redraws the picture every scheduling cycle” rather than optimising within the constraints of pre-defined groups.
Scaling Speed Benchmarks: 55 Seconds vs 3-4 Minutes
Speed is where Karpenter pulls decisively ahead. In our migrations, we consistently measured the following:
| Metric | Karpenter | Cluster Autoscaler |
|---|---|---|
| Node provisioning (cold start) | 30-60 seconds | 3-5 minutes |
| CPU-bound pod scheduling (production) | ~55 seconds | 3-4 minutes |
| Scaling trigger | Event-driven (immediate) | Time-driven scan (every 10+ seconds) |
| Spot interruption replacement | Within 2-minute notice window | Slower (ASG-dependent) |
These figures align with production benchmarks published by ScaleOps and CloudPilot AI, which report Karpenter bringing CPU-bound pods online in approximately 55 seconds while Cluster Autoscaler needed 3-4 minutes for the same workload.
The speed advantage comes from two architectural choices. First, Karpenter is event-driven: each pending pod immediately triggers a provisioning action rather than waiting for a scan cycle. Second, the direct EC2 Fleet API integration removes the ASG intermediary, eliminating an entire layer of orchestration latency.
For teams running customer-facing workloads with spiky traffic patterns, this difference translates directly into fewer 5xx errors during scaling events. We documented this for one e-commerce client, where Karpenter reduced scale-up-related error rates by over 70% during flash-sale events.
Cost Savings: Real Case Studies
Cost optimisation is not theoretical with Karpenter — the savings are well documented across organisations of all sizes. Here are the case studies we reference most often with clients.
Salesforce: 1,000+ EKS Clusters
Salesforce migrated from Cluster Autoscaler to Karpenter across their entire fleet of over 1,000 EKS clusters. The results were significant:
- Scaling latency dropped from minutes to seconds
- 80% reduction in operational overhead as automated processes replaced manual node group management
- ~5% cost savings in FY2026, with a projected further 5-10% reduction in FY2027 as bin-packing and Spot utilisation continue to mature
- Node utilisation improved through smarter bin-packing
Salesforce developed custom internal tooling to orchestrate the migration safely, cordoning and draining legacy nodes with full respect for pod disruption budgets (PDBs).
Grover: 80% Spot in Production
Grover increased their Spot instance usage to 80% of production workloads after migrating to Karpenter. The key enabler was Karpenter’s ability to mix Spot and On-Demand instances within a single NodePool — something that was impractical with Cluster Autoscaler’s rigid ASG-based model. Grover reported smoother scale-up operations and better performance during seasonal demand spikes such as Black Friday.
Tinybird: 20% AWS Bill Reduction
Tinybird reduced their AWS bill by 20% (and up to 90% on CI/CD workloads) by combining Karpenter with EC2 Spot instances. Their approach leveraged Karpenter’s intelligent instance selection and consolidation to eliminate the over-provisioning that was endemic with their previous Cluster Autoscaler setup.
Our Own Observations
Across our 50+ migrations, we typically see 20-40% compute cost reduction in the first quarter after migration, driven by three factors:
- Better bin-packing — Karpenter selects right-sized instances rather than fitting pods into pre-defined node sizes
- Aggressive consolidation — underutilised nodes are replaced with smaller, more efficient instances
- Higher Spot adoption — the flexibility to diversify across many instance types reduces Spot interruption risk, enabling higher Spot percentages
For a deeper look at Kubernetes cost strategies, see our guide to Kubernetes cost optimisation and the best tools for managing Kubernetes costs.
Configuration Comparison
One of the most practical differences is how each tool is configured. Karpenter’s configuration model is simpler and more expressive.
Karpenter: NodePool + EC2NodeClass
Karpenter uses two primary resources. The NodePool defines scheduling rules, instance requirements, resource limits, and disruption policies. The EC2NodeClass handles AWS-specific settings such as AMIs, subnets, security groups, and block device mappings.
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-workloads
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
  limits:
    cpu: "1000"
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: "KarpenterNodeRole-my-cluster"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
```
This single configuration replaces what would require multiple ASGs, launch templates, and scaling policies with Cluster Autoscaler. You can learn more about NodePool and EC2NodeClass configuration from the official Karpenter documentation.
Cluster Autoscaler: ASGs + Annotations
Cluster Autoscaler relies on pre-defined Auto Scaling Groups, each with a fixed launch template specifying instance types, AMIs, and network configuration. Tuning behaviour requires command-line flags on the CA deployment:
```shell
# Cluster Autoscaler deployment flags
--scale-down-delay-after-add=10m
--scale-down-unneeded-time=10m
--scale-down-utilization-threshold=0.5
--skip-nodes-with-local-storage=false
--expander=least-waste
```
Each unique instance type or capacity configuration requires its own ASG. For a cluster supporting web workloads, batch processing, and GPU jobs, you might end up managing six or more ASGs with their own scaling policies. Karpenter achieves the same with two or three NodePool definitions.
Advanced Karpenter Features
Beyond the core architectural advantages, Karpenter provides several advanced capabilities that have no direct equivalent in Cluster Autoscaler.
Drift Detection and Automated Node Upgrades
When you update your EKS control plane version or change the AMI in your EC2NodeClass, Karpenter automatically detects the drift and marks affected nodes for replacement. Drifted nodes are gradually replaced through a rolling deployment, without manual intervention.
This eliminates one of the most tedious operational tasks in Kubernetes: upgrading worker nodes. With Cluster Autoscaler, node upgrades typically require manual intervention or custom automation to cordon, drain, and replace nodes across multiple ASGs.
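To observe drift handling in practice, you can watch the NodeClaim resources Karpenter manages; the commands below are a sketch (the placeholder NodeClaim name and exact output shape will vary by Karpenter version):

```shell
# After changing the AMI alias in the EC2NodeClass, list NodeClaims and
# inspect one for the Drifted status condition.
kubectl get nodeclaims
kubectl describe nodeclaim <nodeclaim-name> | grep -A 2 Drifted

# Watch Karpenter roll drifted nodes; Karpenter-managed nodes carry
# the karpenter.sh/nodepool label.
kubectl get nodes -l karpenter.sh/nodepool --watch
```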
Disruption Budgets
Karpenter’s disruption budgets give you fine-grained control over when and how nodes can be disrupted. You can limit disruptions by percentage or absolute number, restrict disruptions to specific time windows using cron schedules, and scope budgets to specific disruption reasons such as Drifted, Underutilized, or Empty.
```yaml
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 1m
  budgets:
    - nodes: "10%"
    - nodes: "0"
      schedule: "0 9 * * 1-5"
      duration: 8h
      reasons:
        - Underutilized
```
This configuration allows consolidation of empty and underutilised nodes at all times, but restricts underutilisation-based disruptions during business hours (Monday to Friday, 9am to 5pm). This level of control is essential for production workloads where availability SLOs must be maintained.
Spot-to-Spot Consolidation
One of Karpenter’s most impactful cost features is Spot-to-Spot consolidation. When enabled via the SpotToSpotConsolidation feature flag, Karpenter can replace existing Spot instances with cheaper or more appropriately sized Spot alternatives, continuously optimising for cost without sacrificing capacity. Combined with the Price-Capacity-Optimised allocation strategy, this ensures you are always running on the most cost-effective Spot capacity available.
This is particularly valuable because EC2 Spot pricing fluctuates throughout the day. An m5.xlarge Spot instance that was the cheapest option at 2am may no longer be optimal at 2pm. Karpenter detects these pricing shifts and proactively consolidates onto cheaper capacity, compounding savings over time. Cluster Autoscaler has no equivalent capability — once an ASG launches a Spot instance, it remains until the workload scales down or the instance is interrupted.
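Enabling the feature is a one-line change to the controller settings. This sketch assumes the Helm-based install shown later in this post; check the feature-gate name against your Karpenter version before applying:

```shell
# Enable the SpotToSpotConsolidation feature gate on an existing
# Helm-managed Karpenter install.
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system \
  --reuse-values \
  --set settings.featureGates.spotToSpotConsolidation=true
```

Note that Karpenter only performs Spot-to-Spot replacement when the NodePool is flexible to a sufficiently broad set of instance types, so keep your `requirements` permissive if you want this feature to have room to work.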
Karpenter’s CNCF Journey and Current Status
Understanding Karpenter’s project maturity is important context for adoption decisions. AWS originally created Karpenter in 2021 as an open-source project. In 2023, AWS donated the vendor-neutral core to the CNCF through the Kubernetes SIG Autoscaling group, separating the core scheduling logic from the AWS-specific provider.
The project now lives in two repositories: kubernetes-sigs/karpenter for the vendor-neutral core, and aws/karpenter-provider-aws for the AWS-specific implementation. Karpenter v1.0.0 reached general availability in August 2024, with stable APIs that guarantee backward compatibility across 1.x releases.
Key milestones in the v1.0 release include the ability to specify disruption reasons for budgets, a forceful disruption mode for balancing availability against security patching, and expanded consolidateAfter configuration. The kubelet configuration was moved to the EC2NodeClass API, and NodeClaims became immutable after initial launch.
As of February 2026, Karpenter sits within the Kubernetes SIG Autoscaling umbrella but has not yet achieved formal CNCF graduated status. It continues to receive frequent releases, with v1.5 (July 2025) introducing faster bin-packing, new disruption metrics, and “emptiness-first” consolidation for more aggressive idle node recycling.
Multi-Cloud Support: Beyond AWS
Karpenter is no longer AWS-only. The vendor-neutral core, maintained under kubernetes-sigs/karpenter, enables cloud-specific providers to implement the same provisioning model on their own infrastructure.
| Cloud Provider | Implementation | Status | NodeClass Resource |
|---|---|---|---|
| AWS (EKS) | karpenter-provider-aws | Production-ready (v1.0+) | EC2NodeClass |
| Azure (AKS) | Node Auto Provisioning (NAP) | GA as managed addon | AKSNodeClass |
| GCP (GKE) | karpenter-provider-gcp | Alpha/Preview | GCPNodeClass |
EKS Auto Mode
EKS Auto Mode, generally available since December 2024, runs Karpenter as a managed, off-cluster component. You do not need to install, scale, or upgrade Karpenter yourself. It includes built-in General Purpose and GPU-Optimised NodePools, along with managed AWS Load Balancer Controller and EBS CSI Driver. For teams that want Karpenter’s benefits without the operational overhead of managing the controller, EKS Auto Mode is the simplest path.
AKS Node Auto Provisioning
Azure’s Node Auto Provisioning (NAP) uses Karpenter as a managed addon within AKS. It is the recommended approach for most AKS users, offering improved scale-up speed, automatic maintenance window integration, and Azure-specific configuration through the AKSNodeClass resource.
For a broader comparison of managed Kubernetes services, see our EKS vs AKS vs GKE comparison guide.
When Cluster Autoscaler Still Wins
Karpenter is not the right choice for every scenario. There are legitimate cases where Cluster Autoscaler remains the better option.
GPU and Machine Learning Workloads
For long-running GPU-based ML training jobs, Cluster Autoscaler’s explicit node group reservation can keep a small pool of GPU nodes (p4d, g5 families) running continuously. Karpenter, by default, will spin down idle GPU instances through consolidation, which can cause unnecessary pod rescheduling and resource churn for workloads that take hours or days to complete. While you can configure Karpenter to avoid this with do-not-disrupt annotations, Cluster Autoscaler’s dedicated GPU node groups provide a more straightforward solution for this specific use case.
Multi-Cloud and Hybrid Environments
If you are running Kubernetes across AWS, Azure, GCP, and on-premises infrastructure with a single unified autoscaling strategy, Cluster Autoscaler remains the more practical choice. It supports all major cloud providers and on-premises environments through a consistent interface. While Karpenter’s multi-cloud support is expanding, each cloud provider’s implementation is at a different maturity level, and there is no interoperability between them.
Simple, Stable Clusters
For small clusters with predictable, steady workloads that rarely scale, the additional complexity of migrating to Karpenter may not justify the benefits. Cluster Autoscaler is simple, well-documented, and continues to receive active maintenance from SIG Autoscaling with regular releases well into 2026.
Regulated Environments with Strict Change Controls
Some organisations in healthcare, finance, and government operate under strict change management policies that require every infrastructure component to be individually audited and approved. Cluster Autoscaler’s explicit, declarative ASG model can be easier to audit than Karpenter’s dynamic instance selection. That said, Karpenter’s NodePool constraints and disruption budgets do provide sufficient controls for most compliance frameworks — it simply requires a different documentation approach.
Migration Guide: Cluster Autoscaler to Karpenter
Based on our experience and the patterns established by Salesforce’s enterprise-scale migration, we recommend a phased approach. The official Karpenter migration guide provides the canonical reference; what follows is our operational playbook.
Phase 1: Preparation (1-2 Weeks)
Before touching your autoscaler, prepare your workloads:
- Add health checks (liveness and readiness probes) to all deployments
- Ensure multiple replicas with appropriate HPA configuration — see our Kubernetes autoscaling overview for current best practices
- Apply resource right-sizing based on actual usage data
- Configure PodDisruptionBudgets to protect critical workloads during node transitions
Phase 2: IAM and Infrastructure Setup (1-2 Days)
Create the required IAM roles:
- Karpenter node role: attach `AmazonEKSWorkerNodePolicy`, `AmazonEKS_CNI_Policy`, `AmazonEC2ContainerRegistryPullOnly`, and `AmazonSSMManagedInstanceCore`
- Karpenter controller role: configure using IAM Roles for Service Accounts (IRSA) with an OIDC endpoint
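The node-role policy attachments can be scripted; this sketch reuses the role name from the EC2NodeClass example earlier (adjust `ROLE` for your cluster):

```shell
# Attach the four managed policies to the Karpenter node role.
ROLE=KarpenterNodeRole-my-cluster   # assumed role name
for POLICY in AmazonEKSWorkerNodePolicy AmazonEKS_CNI_Policy \
              AmazonEC2ContainerRegistryPullOnly AmazonSSMManagedInstanceCore; do
  aws iam attach-role-policy \
    --role-name "$ROLE" \
    --policy-arn "arn:aws:iam::aws:policy/$POLICY"
done
```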
If you are managing your EKS infrastructure with Terraform, our Terraform EKS module guide covers the IAM setup in detail.
Phase 3: Install and Configure Karpenter (1 Day)
Deploy Karpenter using the official Helm chart:
```shell
helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "1.1.0" \
  --namespace kube-system \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi
```
Then apply your NodePool and EC2NodeClass configurations. Start conservatively — mirror your existing node group configuration before optimising.
Phase 4: Parallel Running and Validation (1-2 Weeks)
Run both autoscalers simultaneously:
- Deploy Karpenter alongside the existing Cluster Autoscaler
- Taint Karpenter-provisioned nodes initially to restrict which workloads land on them
- Gradually migrate workloads by removing taints and cordoning CA-managed nodes
- Monitor scheduling latency, pod availability, and node utilisation throughout
Phase 5: Cutover and Decommission (1-2 Days)
Once validation is complete:
- Scale the Cluster Autoscaler deployment to zero replicas
- Cordon and drain remaining CA-managed nodes (Karpenter will provision replacements)
- Scale ASG node groups to the minimum required for cluster-critical workloads
- Remove the Cluster Autoscaler deployment and associated ASG configurations
Phase 6: Optimise (Ongoing)
After migration, tune your configuration:
- Enable the `WhenEmptyOrUnderutilized` consolidation policy
- Configure disruption budgets for your availability requirements
- Experiment with Spot instance percentages, starting conservative and increasing
- Monitor with Prometheus and Grafana using Karpenter’s built-in metrics
Decision Framework: Which Tool Should You Choose?
Use this comparison to guide your decision:
| Criterion | Karpenter | Cluster Autoscaler | Winner |
|---|---|---|---|
| Scaling speed | 30-60 seconds | 3-5 minutes | Karpenter |
| Instance type flexibility | Up to 60 types per decision | Fixed per ASG | Karpenter |
| Spot instance handling | Native, multi-type diversification | Per-ASG configuration | Karpenter |
| Consolidation | Holistic, cluster-wide | Node-by-node, limited | Karpenter |
| Drift detection | Built-in, automated | Not available | Karpenter |
| Configuration complexity | 2 CRDs (NodePool + NodeClass) | Multiple ASGs + flags | Karpenter |
| Multi-cloud support | AWS (GA), Azure (GA), GCP (alpha) | All providers + on-prem | Cluster Autoscaler |
| GPU workload stability | Requires tuning | Native node group reservation | Cluster Autoscaler |
| Maturity | v1.0+ (GA since August 2024) | 10+ years in production | Cluster Autoscaler |
| Managed options | EKS Auto Mode, AKS NAP | Built into all managed K8s | Tie |
| Cost savings potential | 20-40% typical | Baseline | Karpenter |
Choose Karpenter if you run EKS (or AKS with NAP) with variable workloads, want sub-minute scaling, need cost optimisation through Spot and consolidation, and are willing to invest in a short migration effort.
Choose Cluster Autoscaler if you run multi-cloud or hybrid Kubernetes, have stable GPU/ML workloads requiring persistent node pools, or need a single autoscaling tool across diverse environments.
For most EKS teams in 2026, Karpenter is the clear choice. The architecture is more efficient, the scaling is faster, the cost savings are real, and the operational overhead is lower once you are past the initial migration.
Ready to Migrate Your EKS Clusters to Karpenter?
Migrating from Cluster Autoscaler to Karpenter is one of the highest-impact improvements you can make to your EKS infrastructure. But the migration requires careful planning, IAM configuration, workload preparation, and validation to avoid disruption.
Our team provides comprehensive Amazon EKS consulting services to help you:
- Assess your current autoscaling setup and identify cost savings opportunities
- Design and implement Karpenter configurations tailored to your workload profiles
- Execute zero-downtime migrations using our proven phased playbook
- Optimise Spot instance strategies to maximise savings while maintaining availability
We have migrated clusters ranging from 10-node development environments to 500-node production platforms, consistently delivering 20-40% compute cost reductions.
For broader Kubernetes architecture guidance beyond EKS, explore our Kubernetes consulting services.