What does it take to run 7500 containers supporting 300 micro-service based applications, while concurrently running a continuous integration and testing (CI) job every few tens of seconds, on a Kubernetes cluster that can autoscale up to 30 minions (240 CPU cores and 1TB memory)?
In this blog series, I'm going to share our war stories with you, and save you the pain of learning the lessons on your own.
First things first. There are many ways to create a Kubernetes cluster. Whether you use kops, kubeadm, or kube-up, it's a very satisfying moment when you finally connect to the Kubernetes API and launch your first Pod! But this joy may be short-lived if you expect to have a stable, scalable, production-ready cluster with no additional work.
The default configurations for Kubernetes components are not designed for the heavy, dynamic workloads characteristic of DevOps environments and micro-service based application deployments, where containers are quickly created and destroyed. As a result, you will see unstable behavior from many core Kubernetes components. At times, it may seem like your cluster is going crazy! Welcome to the brave new world.
At Applatix, we invested significant effort in understanding fundamental Kubernetes behaviors under various loads and learned how to configure clusters so that they run stably under heavy, dynamic load conditions. We have also modified our core system-level micro-services to ensure that they work with Kubernetes to maintain platform stability.
Our Production Workload
The production workloads we run are challenging on many fronts.
- We need to handle mixed workloads with greatly varying time spans. Some workloads are short-running DevOps jobs that last up to tens of minutes, while others are long-running applications that can run from hours to months.
- There can be hundreds of pods created and destroyed every minute due to automatically triggered CI jobs.
- Many container workloads from users are unpredictable. Some workloads may be misconfigured and provide inaccurate estimates of expected resource usage.
The following is based on a single-master cluster running Kubernetes 1.4.3. Each micro-service consists of one or more pods and other infrastructure, such as volumes, with run times varying from hours to months. Each CI task is a multi-step workflow that performs code checkouts, parallelized builds and tests, and creates and uses fixtures (containerized services) for running complex tests.
During busy times, we have about 2300 Pods (~7500 containers) running on a single cluster, and the cluster will autoscale up to 30 m3.2xlarge minions (240 CPU cores and 1TB memory cluster-wide). We also run larger stress tests for shorter periods of time, but the production cluster has the greatest longevity.
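The capacity figures above can be checked with a little arithmetic. The sketch below assumes the published m3.2xlarge specification of 8 vCPUs and 30 GiB of memory per instance; the function name is purely illustrative. It shows that the "1TB" figure is a rounded 900 GiB, and it gives a rough sense of pod density at peak.

```python
# Assumed per-instance specs for m3.2xlarge: 8 vCPUs, 30 GiB memory.
VCPUS_PER_NODE = 8
MEM_GIB_PER_NODE = 30


def cluster_capacity(nodes):
    """Return (total vCPUs, total memory in GiB) for identical nodes."""
    return nodes * VCPUS_PER_NODE, nodes * MEM_GIB_PER_NODE


max_nodes = 30
pods, containers = 2300, 7500  # peak figures quoted in the text

cores, mem_gib = cluster_capacity(max_nodes)
print(f"cores at full scale: {cores}")
print(f"memory at full scale: {mem_gib} GiB")
print(f"containers per pod: {containers / pods:.1f}")
print(f"pods per node at full scale: {pods / max_nodes:.0f}")
```

At full scale this works out to roughly 77 pods per minion, which is part of why per-node and master-side limits matter for this workload.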
Using Default Values Is Dangerous
Running the Applatix workload with the default configurations for Kubernetes components is dangerous. There are two broad categories of negative consequences:
- resource related
- performance related
Kubernetes components have fixed default configurations for their resource consumption. For example, the maximum numbers of inflight mutating and non-mutating calls for the API server are set to 200 and 400 respectively; the lookup cache sizes for Replica Sets, Replication Controllers, and Daemon Sets are 4GB, 4GB, and 1GB respectively for the controller manager; and the API queries-per-second limit for the scheduler is 100.
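To make these defaults concrete, here is a minimal sketch that records them per component and renders them as command-line flags. The flag names and the integer cache sizes (4096, 4096, 1024) are my assumptions about how Kubernetes releases of that era exposed these settings; check each component's `--help` output for your version before relying on them. `render_flags` is a hypothetical helper, not part of any Kubernetes tooling.

```python
# Defaults quoted in the text, keyed by component and (assumed) flag name.
DEFAULTS = {
    "kube-apiserver": {
        "max-mutating-requests-inflight": 200,
        "max-requests-inflight": 400,
    },
    "kube-controller-manager": {
        # The text describes these as 4GB, 4GB and 1GB; the integer
        # values below are an assumption about the flag encoding.
        "replicaset-lookup-cache-size": 4096,
        "replication-controller-lookup-cache-size": 4096,
        "daemonset-lookup-cache-size": 1024,
    },
    "kube-scheduler": {
        "kube-api-qps": 100,
    },
}


def render_flags(component, overrides=None):
    """Merge overrides into a component's defaults and return flag strings."""
    settings = dict(DEFAULTS[component])
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(settings))
    if unknown:
        raise KeyError(f"unknown settings for {component}: {unknown}")
    settings.update(overrides)
    return [f"--{name}={value}" for name, value in settings.items()]


for component in DEFAULTS:
    print(component, " ".join(render_flags(component)))
```

Keeping the defaults and the overrides in one place like this makes it easy to see which values a given cluster size actually changes.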
Using limits that are too large will cause instability by significantly increasing CPU and memory consumption during peak loads while limits that are too small will result in suboptimal performance and wasted resources.
For example, for a large Kubernetes cluster, you may want to use an instance with 128GB of memory for the master. However, Kubernetes would not be able to use such a large instance effectively with the default limits, leading to poor master node performance, which can cause clients communicating with the master to time out.
Conversely, when you use an instance with 8GB of memory for the master, the Kubernetes controller manager can use up most of the memory, making the cluster unstable. To avoid either extreme, you must tune component limits for the maximum size of the cluster and the type of load you expect.
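One way to picture this tuning is to scale a limit in proportion to the master's memory. The sketch below is purely illustrative and is not the method used at Applatix: `scale_limit`, the 32GB reference size, and the linear scaling rule are all hypothetical choices, and real values should come from measuring your own cluster under load.

```python
def scale_limit(default, master_mem_gb, reference_mem_gb=32,
                floor=1, ceiling=None):
    """Scale a default limit linearly with master memory.

    reference_mem_gb is a hypothetical master size at which the default
    is assumed to be appropriate; floor and ceiling clamp the result.
    """
    scaled = default * master_mem_gb // reference_mem_gb
    if ceiling is not None:
        scaled = min(scaled, ceiling)
    return max(floor, scaled)


# Non-mutating inflight limit of 400 (from the text) on two master sizes.
for mem_gb in (8, 128):
    print(f"{mem_gb}GB master -> inflight limit {scale_limit(400, mem_gb)}")
```

Under these assumptions, the 8GB master gets a lower limit than the default and the 128GB master a higher one, which matches the two failure modes described above.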
At Applatix, we deploy Kubernetes clusters of various sizes depending on customers' needs. We provision resources according to cluster sizes. If Kubernetes limits (such as the default values) are too high compared with the amount of available resources, OOM (Out of Memory) conditions can quickly occur on a master node when workloads add up, resulting in repeated killing of important system components which destabilizes the system.
Although those system components can be restarted automatically, the restarted containers will just balloon up again and be repeatedly killed due to the pending load. As system components get repeatedly killed, the status of objects such as Pods, Services, and Persistent Volume Claims may become inconsistent cluster-wide. This can quickly become catastrophic as the confusion starts to spread from one micro-service to another.
Careful configuration is essential for running Kubernetes clusters in production. The configuration must take into account the desired maximum size of the cluster as well as the expected workloads. Without the correct configurations, a Kubernetes cluster will suffer from instability and/or waste precious resources.
In subsequent posts, I will delve into the Kubernetes configuration details (Part 2) and go over how we architected our micro-services for stability and availability (Part 3). The configuration settings and the architectural decisions allow us to reliably run large Kubernetes clusters with hundreds of micro-services and thousands of pods in production for long periods of time. Stay tuned!
Harry Zhang is a member of technical staff at Applatix. He works with the platform team, focusing on building containerized micro-services with Kubernetes and AWS. Reach him on LinkedIn for questions, comments, or information about Applatix.