In so many ways, open-source projects are like “happy hours”. The options look good, the price is right, and you go for it! Sometimes it works out great. But often, hours after the “happy hour” has passed, you start feeling sick, wondering whether “happy hours” ever have happy endings.
There have been times when running production-grade Kubernetes clusters on AWS has given us that same feeling.
In this series of blog posts, we’ll share our story of getting a reliable, scalable, production-grade Kubernetes cluster to run on AWS. We will focus on the pitfalls we ran into and, one by one, share our experiences and the solutions we found for each challenge.
We will cover:
- AWS Limits and Cloud Hygiene (Ops strategy)
- Volumes
- Autoscaling Groups are your friend!
AWS Limits and Cloud Hygiene (Ops strategy)
After some hiccups related to installation and configuration, we had a Kubernetes cluster up and running on AWS. We were running a workload that created hundreds of containers per minute. The details of the workload can be found here. The hope was that the cluster would run overnight. However, disappointment was all that we got!
As the cluster ran overnight, components of our cluster started to get HTTP 500 errors when calling AWS APIs. It turned out that we were hitting some AWS limits [1].
AWS has rate limits on the API calls that clients make. A couple of our components were making these API calls at a rate that was too high for AWS. In particular, the "cluster-autoscaler" was making calls to describe auto-scaling groups and the Kubernetes dynamic storage provisioner was making calls to create and describe EBS volumes. The situation was made worse by the fact that we were running several Kubernetes clusters in the same AWS account, each making these AWS API calls.
This prevented the clusters from autoscaling successfully and left lots of pods stuck in the Pending state because the volumes they needed were not being provisioned. The overall health of the cluster quickly deteriorated and it became unusable.
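If you want to spot this kind of failure early, here is a minimal sketch (using the official Kubernetes Python client; this was not part of our original setup) that counts Pending pods per namespace:

```python
# Sketch: count pods stuck in Pending across the cluster.
# Assumes the `kubernetes` Python client and a working kubeconfig.
from collections import Counter

from kubernetes import client, config


def pending_pods_by_namespace():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    pending = v1.list_pod_for_all_namespaces(
        field_selector="status.phase=Pending"
    )
    return Counter(pod.metadata.namespace for pod in pending.items)


if __name__ == "__main__":
    for namespace, count in pending_pods_by_namespace().most_common():
        print(f"{namespace}: {count} pending pod(s)")
```

A sustained, growing count of Pending pods was exactly the symptom we saw once volume provisioning calls started getting throttled.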
We worked around the rate-limiting problem by reducing the cluster-autoscaler’s frequency of polling AWS (via --scan-interval) and by reducing the number of volumes required in our jobs.
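A related mitigation, sketched below with boto3, is illustrative rather than what we actually shipped: any custom component of your own that calls AWS directly can enable the SDK’s adaptive retry mode, so throttled calls back off instead of retrying aggressively.

```python
# Sketch: a boto3 client with adaptive client-side rate limiting, so bursts
# of API calls back off when AWS starts throttling us.
import boto3
from botocore.config import Config

# "adaptive" retry mode adds client-side rate limiting on top of exponential
# backoff; max_attempts caps how long a single call keeps retrying.
throttle_friendly = Config(retries={"max_attempts": 10, "mode": "adaptive"})

autoscaling = boto3.client("autoscaling", config=throttle_friendly)

# The same kind of call the cluster-autoscaler makes: describing ASGs.
paginator = autoscaling.get_paginator("describe_auto_scaling_groups")
for page in paginator.paginate():
    for group in page["AutoScalingGroups"]:
        print(group["AutoScalingGroupName"], group["DesiredCapacity"])
```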
With AWS API call rates under control, we hit the next set of limits, this time while running Kubernetes clusters over a period of several days. These were the resource caps that AWS enforces on a per-account basis. In particular, we hit caps on the number of EBS volumes, S3 buckets, VPCs, launch configurations, and so on. The problem was again made worse by running multiple Kubernetes clusters in the same AWS account, and as our load increased, our clusters consumed more and more resources.
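Here is a rough sketch of the kind of usage check that would have warned us earlier, assuming boto3. The thresholds are placeholders we made up for illustration, not the actual AWS limits for any account:

```python
# Sketch: count a few cap-prone resources in the current account/region and
# warn when usage approaches a locally configured threshold.
# The thresholds below are illustrative placeholders, not AWS's real limits.
import boto3

THRESHOLDS = {"ebs_volumes": 400, "vpcs": 5, "s3_buckets": 90}


def current_usage(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    s3 = boto3.client("s3")

    volumes = sum(
        len(page["Volumes"])
        for page in ec2.get_paginator("describe_volumes").paginate()
    )
    vpcs = len(ec2.describe_vpcs()["Vpcs"])
    buckets = len(s3.list_buckets()["Buckets"])  # S3 buckets are account-wide

    return {"ebs_volumes": volumes, "vpcs": vpcs, "s3_buckets": buckets}


if __name__ == "__main__":
    for resource, used in current_usage().items():
        limit = THRESHOLDS[resource]
        flag = "WARN" if used >= 0.8 * limit else "ok"
        print(f"[{flag}] {resource}: {used}/{limit}")
```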
This is where we had to discipline ourselves to simply operate our AWS accounts better. We had to:
- Tag all resources so that we knew what each was being used for and by whom
- Regularly identify and delete unused AWS resources (also saves money!); see the sketch after this list
- Use multiple AWS accounts
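As promised above, a minimal sketch of that clean-up step, assuming boto3. Treating unattached EBS volumes as “unused” and requiring an “owner” tag are our illustrative conventions, not AWS requirements:

```python
# Sketch: flag EBS volumes that are unattached ("available") or missing an
# owner tag, as candidates for clean-up.
# The "owner" tag key is an illustrative convention, not an AWS requirement.
import boto3

REQUIRED_TAG = "owner"


def cleanup_candidates(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    candidates = []

    for page in ec2.get_paginator("describe_volumes").paginate():
        for volume in page["Volumes"]:
            tags = {t["Key"]: t["Value"] for t in volume.get("Tags", [])}
            unattached = volume["State"] == "available"
            untagged = REQUIRED_TAG not in tags
            if unattached or untagged:
                candidates.append(
                    (volume["VolumeId"], volume["State"], tags.get(REQUIRED_TAG, "<none>"))
                )
    return candidates


if __name__ == "__main__":
    for volume_id, state, owner in cleanup_candidates():
        print(f"{volume_id}\tstate={state}\towner={owner}")
```

The same pattern extends to other cap-prone resources (launch configurations, security groups, and so on): list them, check for the tags your team has agreed on, and flag anything orphaned.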
Scripting and automation helped, but fundamentally, this was about communicating better about who was using which resources and for what purpose. The next blog in this series will focus on “Volumes” and “Autoscaling Groups”. Will our “happy hour” have a happy ending?