In part 2 of this blog series, we discussed the problems we ran into with EBS volumes when running a Kubernetes cluster on AWS. Once we got over them, we were able to run our cluster reliably overnight. The next mountain to scale was Scalability!
Our production Kubernetes cluster had a single AutoScalingGroup (ASG) for the worker nodes/minions. All the pods could loosely be classified into two categories: "system" pods and "user" pods. When starting pods on a new cluster, we made sure the system pods were started before the user pods, so that user pods could not use up all the resources and prevent the system pods from being scheduled.
On one occasion, all the system pods were running, some user pods were running and hundreds of user pods were waiting to run. Then the EC2 instances were terminated and new instances were brought up by the AWS AutoScalingGroup.
As the new instances came up, Kubernetes scheduled all the pods, both system and user, in no particular order, leaving many system pods in a “Pending” state waiting for resources. Unfortunately, many of the user pods required system services to make progress, so the system was essentially deadlocked. The fundamental problem was that while we knew about "system" and "user" pods, neither Kubernetes nor AWS understood the distinction.
To avoid this problem, we decided to use two ASGs instead of one: one dedicated to system pods and one dedicated to user pods. The minions in each ASG were assigned a Kubernetes label matching their role, and every pod was given a node-selector when it was run. For system pods, the node-selector used was "system", whereas for user pods, the node-selector used was "user".
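As a rough illustration of what this looks like on the pod side, here is a minimal Python sketch that builds a pod manifest carrying a node-selector. The label key `role`, the helper name `pod_manifest`, and the example image are assumptions made for this sketch; the only details taken from our setup are the two selector values, "system" and "user".

```python
# Hypothetical label key: the values "system" and "user" are the ones we used,
# but the key name "role" is an assumption for this sketch.
NODE_LABEL_KEY = "role"
POD_CLASSES = ("system", "user")


def pod_manifest(name, image, pod_class):
    """Build a minimal Pod manifest pinned to one class of nodes.

    The nodeSelector only matches nodes that carry the same label,
    so a "system" pod can never land on a node labelled "user".
    """
    if pod_class not in POD_CLASSES:
        raise ValueError(f"unknown pod class: {pod_class!r}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": {NODE_LABEL_KEY: pod_class},
            "containers": [{"name": name, "image": image}],
        },
    }


# Example with a made-up image name.
dns_pod = pod_manifest("dns", "example/dns:1.0", "system")
```

For the selector to match anything, the nodes in each ASG need the corresponding label, for example via `kubectl label nodes <node-name> role=system` or an equivalent step at node startup.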
With this approach, the AWS infrastructure and Kubernetes were both aware of the differences in the pods and automatically had resources reserved for each type. In the event of all nodes going down and coming back up, the system pods would automatically go to the system nodes and user pods would only compete for resources on the user nodes. Awesome!
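To make the difference concrete, here is a toy simulation of the restart scenario. It is not how the Kubernetes scheduler works internally: it is a first-fit placement over made-up node capacities (one "slot" per pod), with the user pods deliberately arriving first as the worst case. All names and numbers are assumptions chosen for illustration.

```python
def make_nodes():
    # One node in the "system" pool and three in the "user" pool.
    # With a single ASG, the pool field is simply ignored.
    return [
        {"name": "node-sys-0", "pool": "system", "free": 4},
        {"name": "node-user-0", "pool": "user", "free": 2},
        {"name": "node-user-1", "pool": "user", "free": 2},
        {"name": "node-user-2", "pool": "user", "free": 2},
    ]


def schedule(pods, nodes, use_selectors):
    """Place each pod on the first node with a free slot.

    Returns the names of the pods left pending. When use_selectors is
    True, a pod may only be placed on a node whose pool matches its class.
    """
    pending = []
    for name, pod_class in pods:
        candidates = [
            n for n in nodes
            if n["free"] > 0 and (not use_selectors or n["pool"] == pod_class)
        ]
        if candidates:
            candidates[0]["free"] -= 1
        else:
            pending.append(name)
    return pending


def pending_system_pods(use_selectors):
    # Worst case after a mass restart: every user pod is seen before any system pod.
    pods = [(f"user-{i}", "user") for i in range(20)]
    pods += [(f"sys-{i}", "system") for i in range(3)]
    pending = schedule(pods, make_nodes(), use_selectors)
    return [p for p in pending if p.startswith("sys-")]


single_asg = pending_system_pods(use_selectors=False)
two_asgs = pending_system_pods(use_selectors=True)
```

In this toy setup, the single-pool run leaves all three system pods pending because user pods fill every slot first, while the two-pool run places every system pod and leaves only user pods waiting.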
After having gone through all these pains, the Kubernetes cluster was up and running reliably at scale. Well, at least until we upgrade to the next version of Kubernetes...