Technology Infrastructure Automation

30% Cost Reduction in AWS EKS Monthly Bill Through Spot Instance Optimization

Leading Travel & Hospitality Company • 3 months • Team size: 4 consultants

Key Results

  • 30% Cost Reduction
  • 70% Spot Instance Usage
  • Service Disruptions Eliminated
  • £15K Monthly Savings

The Challenge

The client's AWS EKS cluster costs were escalating rapidly, and they lacked proper failover mechanisms to handle spot instance terminations. Their fixed on-demand instance approach was inefficient, and they had no alerting system for spot instance interruptions. The absence of pod disruption budgets led to service disruptions during maintenance windows, and manual scaling processes couldn't keep up with traffic fluctuations.

Our Solution

We implemented an EKS optimization strategy that combines spot instances with on-demand instances for critical workloads. We configured Pod Disruption Budgets for all mission-critical deployments, implemented failover mechanisms that keep at least three pods running on distinct nodes, and deployed a cluster over-provisioner to provide instant capacity during spot terminations. Real-time alerting was integrated with Slack to notify the operations team of spot instance terminations.

The Results

The optimization transformed the client's cloud economics. Monthly AWS EKS costs decreased by 30% through the use of spot instances, with service reliability maintained. The failover design with Pod Disruption Budgets eliminated service disruptions during maintenance and spot terminations. The hybrid split of 70% spot and 30% on-demand instances balanced cost savings against reliability. Development team productivity increased as infrastructure concerns were minimized.

Introduction

This case study describes how we helped a client in the travel and hospitality industry optimize their AWS EKS (Elastic Kubernetes Service) infrastructure and cut their monthly bill by 30%. We combined spot instances, careful instance type selection, failover mechanisms, Pod Disruption Budgets, alerting, a baseline of on-demand instances, and a cluster over-provisioner to achieve these savings without compromising performance or reliability.

Client Background

Our client, a prominent player in the travel and hospitality industry, relied heavily on their AWS EKS cluster to support their critical applications and services. However, they faced challenges regarding cost optimization and ensuring high availability, which prompted them to seek our expertise.

Challenges Faced

The client’s primary concerns revolved around the escalating costs associated with running their AWS EKS cluster. They were also keen on implementing robust failover mechanisms and maintaining a high level of availability to minimize service disruptions. The absence of proper alerting mechanisms for spot instance terminations further intensified their concerns.

Key Pain Points

Escalating Costs

  • Monthly EKS bills were increasing rapidly
  • Fixed on-demand instance approach was inefficient
  • No optimization strategy for resource utilization

Reliability Concerns

  • Lack of failover mechanisms for spot instance terminations
  • Service disruptions during maintenance windows
  • Manual scaling couldn’t keep up with demand

Operational Challenges

  • No alerting system for spot instance interruptions
  • Configuration drift between environments
  • Limited visibility into cost drivers

Solution Implemented

To address the challenges faced by our client, we devised a comprehensive solution encompassing the following key elements:

Spot Instances and Instance Types

We implemented AWS spot instances, which offer significant cost savings compared to on-demand instances. By carefully selecting the appropriate instance types based on the workload requirements, we maximized cost efficiency while ensuring optimal performance.

Failover Design

We implemented a failover strategy to enhance the cluster’s resilience and minimize downtime. By distributing the workload across multiple nodes, we designed a failover mechanism that ensured at least three pods were running on distinct nodes at all times. This approach guaranteed redundancy and fault tolerance, safeguarding against single points of failure.
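One way to express the "three pods on distinct nodes" rule is sketched below. The case study does not say how the placement was enforced, so the use of a required pod anti-affinity rule on `kubernetes.io/hostname` is an assumption, and the application name and image are hypothetical. The function only builds the manifest as a Python dictionary; it does not talk to a cluster.

```python
import json


def build_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Build a Deployment manifest whose replicas must be scheduled on distinct nodes."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Assumption: a hard anti-affinity rule keyed on the node
                    # hostname, so no two replicas share a node.
                    "affinity": {
                        "podAntiAffinity": {
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": {"matchLabels": labels},
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ]
                        }
                    },
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


# Hypothetical workload name and image, for illustration only.
manifest = build_deployment("booking-api", "example.com/booking-api:1.0")
print(json.dumps(manifest, indent=2))
```

With a required rule, a replica stays pending if no separate node is available, so this choice trades schedulability for strict separation. A preferred rule or a topology spread constraint would be a softer alternative.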

Pod Disruption Budget (PDB)

To further improve the availability and stability of the cluster, we configured Pod Disruption Budgets for all mission-critical deployments. A PDB gives fine-grained control over how many pods can be disrupted at the same time during maintenance or spot instance terminations. By enforcing PDBs, we minimized service disruptions and improved overall cluster reliability.
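A minimal sketch of such a budget follows, again as a plain Python dictionary. The `minAvailable` value of 2 and the `app` label are assumptions chosen to match a three-replica deployment; the case study does not state the actual values.

```python
def build_pdb(app: str, min_available: int = 2) -> dict:
    """Build a PodDisruptionBudget manifest for pods labelled app=<app>."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        "spec": {
            # Assumption: keep at least two of the three replicas running.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app}},
        },
    }


def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be evicted at once without violating the budget."""
    return max(0, healthy_pods - min_available)


pdb = build_pdb("booking-api")       # hypothetical application name
print(allowed_disruptions(3, 2))     # prints 1
```

With three healthy pods and `minAvailable: 2`, only one pod can be evicted at a time, so a node drain proceeds one replica at a time rather than taking the service down.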

Alerting Mechanisms

We integrated alerting that sends a real-time notification to a Slack channel whenever a spot instance termination event occurs. This lets the operations team respond promptly. It also lets them review how often each instance type is being terminated and switch to instance types with a better availability history.
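The sketch below shows one possible shape for this alerting path. It assumes the termination notice arrives as an EventBridge "EC2 Spot Instance Interruption Warning" event and that Slack is reached through an incoming webhook. The webhook URL, the instance-to-type lookup table, and all function names are hypothetical, and the case study does not describe the actual integration.

```python
import json
import urllib.request
from collections import Counter


def format_interruption_alert(event: dict) -> dict:
    """Turn a spot interruption warning event into a Slack webhook payload."""
    detail = event.get("detail", {})
    text = (
        "Spot interruption warning: "
        f"instance {detail.get('instance-id', 'unknown')} "
        f"will {detail.get('instance-action', 'terminate')} "
        f"(region {event.get('region', 'unknown')}, time {event.get('time', 'unknown')})"
    )
    return {"text": text}


def interruptions_by_type(events: list, instance_types: dict) -> Counter:
    """Count interruptions per instance type.

    `instance_types` maps instance ID to instance type; it is an assumed
    lookup table, since the warning event itself carries only the instance ID.
    """
    return Counter(
        instance_types.get(e["detail"]["instance-id"], "unknown") for e in events
    )


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Keeping the per-type counter separate from the notification makes it easy to report which instance types are interrupted most often, which is the history the team uses when changing instance types.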

On-Demand Instances

While spot instances offer substantial cost savings, they come with a risk of sudden termination. To mitigate this risk and ensure uninterrupted operation of critical services, we added a minimum of 30% on-demand instances to the cluster. This hybrid approach provided a safety net by maintaining a guaranteed capacity to handle workload spikes or spot instance interruptions.
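As a small worked example of the 30% floor, the sketch below splits a node count into on-demand and spot capacity, rounding the on-demand share up so it never falls below the minimum. The node counts are illustrative and not taken from the client's cluster.

```python
def split_capacity(total_nodes: int, min_on_demand_pct: int = 30) -> tuple[int, int]:
    """Return (on_demand_nodes, spot_nodes) with on-demand rounded up to the floor."""
    # Integer ceiling division avoids floating-point rounding surprises.
    on_demand = (total_nodes * min_on_demand_pct + 99) // 100
    return on_demand, total_nodes - on_demand


print(split_capacity(20))  # prints (6, 14)
print(split_capacity(7))   # prints (3, 4)
```

Rounding up matters for small clusters: with 7 nodes, 30% is 2.1, so 3 nodes stay on-demand to honour the minimum.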

Cluster Over-provisioner

To absorb spot instance terminations more quickly, we deployed a cluster over-provisioner. It runs a placeholder deployment, assigned a low Kubernetes priority class, whose pods reserve a configurable pool of CPU and memory. When mission-critical pods are displaced and need to be rescheduled, the scheduler preempts the placeholder pods and frees their reserved CPU and memory for the critical workloads.
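The pattern can be sketched as two manifests: a low-value PriorityClass and a deployment of pause containers that request the reserved resources. The names, replica count, resource sizes, priority value of -10, and pause image tag below are all assumptions for illustration, not the client's configuration.

```python
def build_overprovisioner(
    replicas: int = 2, cpu: str = "1", memory: str = "2Gi", priority: int = -10
) -> list[dict]:
    """Build a PriorityClass and a placeholder Deployment that reserve capacity."""
    name = "overprovisioning"  # hypothetical name
    labels = {"app": name}
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        # Lower than the default priority of 0, so regular pods can preempt these.
        "value": priority,
        "globalDefault": False,
        "description": "Placeholder pods that regular workloads may preempt.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": name,
                    "containers": [
                        {
                            "name": "pause",
                            # The pause container does nothing; only its
                            # resource requests matter.
                            "image": "registry.k8s.io/pause:3.9",
                            "resources": {"requests": {"cpu": cpu, "memory": memory}},
                        }
                    ],
                },
            },
        },
    }
    return [priority_class, deployment]
```

The reserved pool is `replicas` times the per-pod request, so the defaults here hold back 2 CPUs and 4Gi of memory. Sizing that pool is a trade-off between how fast critical pods recover and how much idle capacity is paid for.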

Results and Benefits

Through the implementation of our solution, our client experienced several significant benefits:

Cost Reduction

By migrating to spot instances, selecting appropriate instance types, and implementing cost-effective failover mechanisms, our client achieved a remarkable 30% reduction in their monthly AWS EKS bill. This cost optimization allowed them to allocate resources to other areas of their business, fostering growth and innovation.

Period   Previous Cost   Optimized Cost   Savings
Before   £50,000         -                -
After    -               £35,000          30%
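The figures above can be checked with a few lines of arithmetic. The function name is illustrative; the monthly amounts come from the table.

```python
def monthly_savings(previous: int, optimized: int) -> tuple[int, float]:
    """Return (absolute saving, percentage saving) for a monthly bill."""
    saved = previous - optimized
    return saved, saved * 100 / previous


print(monthly_savings(50_000, 35_000))  # prints (15000, 30.0)
```

The £15,000 difference matches the £15K monthly savings reported in the key results.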

Enhanced Availability and Reliability

The failover design, along with the implementation of Pod Disruption Budgets, contributed to improved availability and resilience. The three-pod distribution across different nodes minimized the risk of service disruptions, ensuring a seamless experience for their customers.

Proactive Spot Instance Termination Handling

The alerting mechanism integrated with Slack enabled the operations team to respond promptly to spot instance terminations. This proactive approach minimized downtime and maintained uninterrupted service availability.

Resource Optimization

The hybrid approach of running on-demand instances alongside spot instances provided the capacity needed to handle workload spikes and mitigated the risks associated with spot instance interruptions. The cluster over-provisioner complemented this by keeping a reserve of CPU and memory available, so displaced critical pods could be rescheduled immediately instead of waiting for new nodes.

Conclusion

Through a combination of spot instances, instance type selection, failover design, Pod Disruption Budgets, alerting mechanisms, hybrid on-demand instances, and cluster over-provisioner, our services successfully helped our travel and hospitality client achieve a significant 30% cost reduction in their monthly AWS EKS bill. The enhanced availability, reliability, and resource optimization further solidified their infrastructure, enabling them to focus on their core business while enjoying substantial cost savings.

Technologies Used

  • AWS EKS
  • Kubernetes
  • Spot Instances
  • Pod Disruption Budget
  • Cluster Autoscaler
  • Cluster Over-provisioner
  • Slack Integration
  • Prometheus
