30% Cost Reduction in AWS EKS Monthly Bill Through Spot Instance Optimization
The Challenge
The client's AWS EKS cluster costs were escalating rapidly, and they lacked proper failover mechanisms to handle spot instance terminations. Their fixed on-demand instance approach was inefficient, and they had no alerting system for spot instance interruptions. The absence of pod disruption budgets led to service disruptions during maintenance windows, and manual scaling processes couldn't keep up with traffic fluctuations.
Our Solution
We implemented a comprehensive EKS optimization strategy that combined spot instances with on-demand instances for critical workloads. We configured Pod Disruption Budgets for all mission-critical deployments, implemented failover mechanisms that kept at least three pods running on distinct nodes, and deployed a cluster over-provisioner to keep spare capacity ready during spot terminations. Real-time alerts in Slack notified the operations team of spot instance terminations.
The Results
The optimization reduced the client's monthly AWS EKS costs by 30% through the use of spot instances while maintaining service reliability. The failover design with Pod Disruption Budgets eliminated service disruptions during maintenance and spot terminations. The hybrid mix of 70% spot and 30% on-demand instances balanced cost savings against reliability. Development team productivity increased as infrastructure concerns were minimized.
Introduction
This case study describes how we helped a client in the travel and hospitality industry optimize their AWS EKS (Elastic Kubernetes Service) infrastructure and reduce their monthly bill by 30%. We combined spot instances, careful instance type selection, failover mechanisms, Pod Disruption Budgets, alerting, a baseline of on-demand instances, and cluster over-provisioning to achieve these savings without compromising performance or reliability.
Client Background
Our client, a prominent player in the travel and hospitality industry, relied heavily on their AWS EKS cluster to support their critical applications and services. However, they faced challenges regarding cost optimization and ensuring high availability, which prompted them to seek our expertise.
Challenges Faced
The client’s primary concerns revolved around the escalating costs associated with running their AWS EKS cluster. They were also keen on implementing robust failover mechanisms and maintaining a high level of availability to minimize service disruptions. The absence of proper alerting mechanisms for spot instance terminations further intensified their concerns.
Key Pain Points
Escalating Costs
- Monthly EKS bills were increasing rapidly
- Fixed on-demand instance approach was inefficient
- No optimization strategy for resource utilization
Reliability Concerns
- Lack of failover mechanisms for spot instance terminations
- Service disruptions during maintenance windows
- Manual scaling couldn’t keep up with demand
Operational Challenges
- No alerting system for spot instance interruptions
- Configuration drift between environments
- Limited visibility into cost drivers
Solution Implemented
To address the challenges faced by our client, we devised a comprehensive solution encompassing the following key elements:
Spot Instances and Instance Types
We implemented AWS spot instances, which offer significant cost savings compared to on-demand instances. By carefully selecting the appropriate instance types based on the workload requirements, we maximized cost efficiency while ensuring optimal performance.
Failover Design
We implemented a failover strategy to enhance the cluster’s resilience and minimize downtime. By distributing the workload across multiple nodes, we designed a failover mechanism that ensured at least three pods were running on distinct nodes at all times. This approach guaranteed redundancy and fault tolerance, safeguarding against single points of failure.
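One way to express the "three pods on distinct nodes" rule is a Deployment with three replicas and a required pod anti-affinity rule keyed on the node hostname. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The function name, the `app` label, and the image are illustrative assumptions, not details from the client's setup.

```python
import json


def build_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Return a Deployment manifest whose replicas must run on distinct nodes."""
    labels = {"app": name}  # hypothetical label scheme
    anti_affinity = {
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {
                # Refuse to schedule a pod onto a node that already
                # hosts a pod carrying the same labels.
                "labelSelector": {"matchLabels": labels},
                "topologyKey": "kubernetes.io/hostname",
            }
        ]
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": {"podAntiAffinity": anti_affinity},
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, for illustration only.
    manifest = build_deployment("booking-api", "example.com/booking-api:1.0")
    print(json.dumps(manifest, indent=2))
```

A required anti-affinity rule leaves a replica pending if no eligible node is free. A preferred rule or a topology spread constraint is a softer alternative when that trade-off is not acceptable.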
Pod Disruption Budget (PDB)
To further improve the availability and stability of the cluster, we configured Pod Disruption Budgets (PDBs) for all mission-critical deployments. A PDB limits how many pods of a deployment can be disrupted at the same time during maintenance or spot instance terminations. By enforcing PDBs, we minimized service disruptions and improved overall cluster reliability.
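A Pod Disruption Budget for a three-replica deployment could look like the following minimal sketch, again built as a Python dictionary for serialization to JSON. The `minAvailable` value of 2, the `app` label, and the naming scheme are assumptions chosen for illustration; the budgets used for the client are not stated here.

```python
import json


def build_pdb(app_label: str, min_available: int = 2) -> dict:
    """Return a PodDisruptionBudget that keeps `min_available` pods running."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app_label}-pdb"},  # hypothetical naming scheme
        "spec": {
            # Evictions are refused if they would leave fewer than this many pods.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pdb("booking-api"), indent=2))
```

With three replicas and `minAvailable` set to 2, a node drain can evict only one pod of the deployment at a time.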
Alerting Mechanisms
We integrated alerting that sent real-time notifications to Slack channels whenever a spot instance termination event occurred. These alerts let the operations team track how frequently each instance type was being terminated and take prompt action, such as switching to instance types with a better availability history.
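The exact alerting pipeline is not described here, so the following is only a sketch of one possible design. It assumes the "EC2 Spot Instance Interruption Warning" event from Amazon EventBridge is delivered to a small handler, which formats a message and posts it to a Slack incoming webhook. The function names and the webhook URL are hypothetical.

```python
import json
import urllib.request


def build_slack_message(event: dict) -> dict:
    """Turn a spot interruption warning event into a Slack webhook payload."""
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id", "unknown")
    action = detail.get("instance-action", "unknown")
    region = event.get("region", "unknown")
    text = (
        f"Spot interruption warning: instance {instance_id} "
        f"in {region} (action: {action})"
    )
    return {"text": text}


def post_to_slack(webhook_url: str, message: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Sample event containing only the fields this handler reads.
SAMPLE_EVENT = {
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "region": "eu-west-1",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}

if __name__ == "__main__":
    print(build_slack_message(SAMPLE_EVENT)["text"])
```

Recording the instance type alongside each alert would support the frequency analysis described above, but that requires an extra lookup because the interruption event itself carries only the instance ID and action.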
On-Demand Instances
While spot instances offer substantial cost savings, they come with a risk of sudden termination. To mitigate this risk and ensure uninterrupted operation of critical services, we added a minimum of 30% on-demand instances to the cluster. This hybrid approach provided a safety net by maintaining a guaranteed capacity to handle workload spikes or spot instance interruptions.
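As a worked example of the 30% floor, the sketch below splits a node count into on-demand and spot portions, rounding the on-demand share up so it never falls below the minimum. The function name and the use of node counts (rather than vCPUs or cost) as the unit are assumptions made for illustration.

```python
def split_capacity(total_nodes: int, min_on_demand_percent: int = 30) -> tuple:
    """Return (on_demand_nodes, spot_nodes) honoring a minimum on-demand share."""
    if total_nodes < 0:
        raise ValueError("total_nodes must be non-negative")
    if not 0 <= min_on_demand_percent <= 100:
        raise ValueError("min_on_demand_percent must be between 0 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    on_demand = (total_nodes * min_on_demand_percent + 99) // 100
    return on_demand, total_nodes - on_demand


if __name__ == "__main__":
    for nodes in (7, 10, 25):
        on_demand, spot = split_capacity(nodes)
        print(f"{nodes} nodes: {on_demand} on-demand, {spot} spot")
```

Because of the rounding, small clusters end up with more than 30% on-demand capacity: a 7-node cluster gets 3 on-demand nodes, which is about 43%.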
Cluster Over-provisioner
To further reduce the impact of spot instance terminations, we implemented a cluster over-provisioner. When configured with a low-priority priority class in Kubernetes, this tool runs a dummy deployment of configurable size that reserves a pool of CPU and memory. If mission-critical pods lose their node, the scheduler preempts the dummy pods, and the reserved CPU and memory become available to the critical pods immediately.
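The over-provisioning pattern can be sketched as two Kubernetes objects: a PriorityClass with a negative value and a Deployment of placeholder pods that request the CPU and memory to be held in reserve. The names, the replica count, the resource sizes, and the `registry.k8s.io/pause:3.9` image tag below are illustrative assumptions, not the client's configuration.

```python
import json


def build_overprovisioner(replicas: int = 2, cpu: str = "1", memory: str = "2Gi") -> list:
    """Return a PriorityClass and a placeholder Deployment that reserve capacity."""
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "overprovisioning"},
        # Below the default pod priority of 0, so regular pods can preempt these.
        "value": -1,
        "globalDefault": False,
        "description": "Placeholder pods that regular workloads may preempt.",
    }
    labels = {"app": "overprovisioning"}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": "overprovisioning",
                    "containers": [
                        {
                            "name": "pause",
                            # The pause container does nothing; only its requests matter.
                            "image": "registry.k8s.io/pause:3.9",
                            "resources": {"requests": {"cpu": cpu, "memory": memory}},
                        }
                    ],
                },
            },
        },
    }
    return [priority_class, deployment]


if __name__ == "__main__":
    for manifest in build_overprovisioner():
        print(json.dumps(manifest, indent=2))
```

The reserved pool equals the replica count multiplied by the per-pod requests, so with these example values the buffer is 2 CPUs and 4Gi of memory.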
Results and Benefits
Through the implementation of our solution, our client experienced several significant benefits:
Cost Reduction
By migrating to spot instances, selecting appropriate instance types, and implementing cost-effective failover mechanisms, our client achieved a remarkable 30% reduction in their monthly AWS EKS bill. This cost optimization allowed them to allocate resources to other areas of their business, fostering growth and innovation.
| Period | Monthly AWS EKS Cost | Savings |
|---|---|---|
| Before optimization | £50,000 | - |
| After optimization | £35,000 | 30% |
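The 30% figure follows directly from the two monthly costs in the table. The short check below reproduces the arithmetic; the function name is only illustrative.

```python
def savings_percent(before: float, after: float) -> float:
    """Return the cost reduction as a percentage of the original cost."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) * 100 / before


if __name__ == "__main__":
    before, after = 50_000, 35_000  # monthly costs in GBP, from the table
    print(f"Absolute saving: £{before - after:,} per month")
    print(f"Relative saving: {savings_percent(before, after):.0f}%")
```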
Enhanced Availability and Reliability
The failover design, along with the implementation of Pod Disruption Budgets, contributed to improved availability and resilience. The three-pod distribution across different nodes minimized the risk of service disruptions, ensuring a seamless experience for their customers.
Proactive Spot Instance Termination Handling
The alerting mechanism integrated with Slack enabled the operations team to respond promptly to spot instance terminations. This proactive approach minimized downtime and maintained uninterrupted service availability.
Resource Optimization
The hybrid approach of running on-demand instances alongside spot instances provided the capacity needed to handle workload spikes and mitigated the risks associated with spot instance interruptions. The cluster over-provisioner kept a reserve of CPU and memory available, so displaced pods could be rescheduled without waiting for new nodes to start.
Conclusion
Through a combination of spot instances, instance type selection, failover design, Pod Disruption Budgets, alerting mechanisms, hybrid on-demand instances, and cluster over-provisioner, our services successfully helped our travel and hospitality client achieve a significant 30% cost reduction in their monthly AWS EKS bill. The enhanced availability, reliability, and resource optimization further solidified their infrastructure, enabling them to focus on their core business while enjoying substantial cost savings.