Making Kubernetes Production Ready – Part 2

In Part 1, I described the production workloads we run at Applatix and the many reasons why Kubernetes default configurations are ill-suited for running these workloads. In this post, I'm going to go over exactly how we configured Kubernetes to handle these production workloads. After much research and poring over the source code, we found more than 30 performance- and workload-related knobs provided by Kubernetes components: some are documented and some are not. We carefully tuned and tested many combinations of these knob settings to identify the small set of configurations that are reliable and provide good performance for our workloads.
 
Kubernetes Overview

[Figure: block diagram of Kubernetes components and how they interact]

As a quick overview, consider the block diagram above of Kubernetes components and how they interact with each other. (A short example follows the component list.)

  • kube-apiserver: Kubernetes' REST API entry point that processes operations on Kubernetes objects, e.g. Pods, Deployments, StatefulSets, Persistent Volume Claims, Secrets, etc. An operation mutates (create / update / delete) or reads a spec describing the REST API object(s).
  • etcd: A highly available key-value store that kube-apiserver uses to persist cluster state.
  • kube-controller-manager: Runs control loops that manage objects from kube-apiserver and perform actions to make sure these objects maintain the states described by their specs.
  • kube-scheduler: Gets pending Pods from kube-apiserver, assigns each Pod to a minion (node) on which it should run, and writes the assignment back to the API server. kube-scheduler selects minions based on available resources, QoS, data locality, and other policies described in its scheduling algorithm.
  • kubelet: A Kubernetes worker that runs on each minion. It watches Pods via kube-apiserver, looks for Pods assigned to itself, and syncs them if possible. Syncing a Pod involves provisioning resources (e.g. mounting volumes) and talking to the container runtime to manage the Pod's life cycle (e.g. pulling images, running containers, checking container health, deleting containers, and garbage-collecting containers).
  • kube-proxy: A network proxy that runs on each node and reflects Services as defined in the Kubernetes REST API. It watches Service and Endpoint objects from kube-apiserver and modifies the underlying kernel iptables rules for routing and redirection.
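
To make the interaction concrete, here is a minimal sketch of how a workload flows through these components. The object name and image are illustrative, and kubectl behavior varies slightly across versions:

    # Create a Deployment; kubectl sends the spec to kube-apiserver, which persists it in etcd.
    kubectl run web --image=nginx --replicas=2

    # kube-controller-manager creates the ReplicaSet and Pods; kube-scheduler assigns each
    # pending Pod to a minion. The NODE column shows the scheduler's decision.
    kubectl get pods -o wide

    # The kubelet on each assigned node pulls the image, starts the containers, and reports
    # status back through kube-apiserver.
    kubectl describe pod -l run=web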

Knobs, knobs, and more knobs

A rule of thumb is that Kubernetes workload and resource consumption are directly related to the number of Pods and the rate of Pod churn (starts and stops) cluster-wide. Based on benchmarking information about Kubernetes [1] [2] and perusing the Kubernetes source code [3], Kubernetes officially recommends 32 cores and 120GB of memory for a 2,000-node, 60,000-Pod cluster. Although a certain fraction of mutating workload is included in such benchmarks [2], the benchmark workloads are relatively static compared to a typical DevOps workload, where hundreds of Pods can spin up and down every minute. With the Applatix production workload, the Kubernetes master components' memory usage is very sensitive to Pod churn, and significantly more memory is needed than the official recommendation suggests. In general, careful consideration of your particular workload is needed to properly configure your Kubernetes cluster for the desired level of stability and performance.

The following is a summary of the knobs that we adjust for Applatix's production clusters. As a reminder, our usage of these flags was tested specifically for high-churn workloads. Your configuration may vary:

kube-apiserver [4] [5] [6]

Function: Throttle API requests
Flags: --max-requests-inflight
Description: This flag limits the number of API calls processed in parallel and is a great control point for kube-apiserver memory consumption. The API server can be very CPU intensive when processing a lot of requests in parallel. Newer Kubernetes releases provide more fine-grained API throttling via "--max-requests-inflight" and "--max-mutating-requests-inflight".
Recommendation: Adjust this value from the default (400) until you find a good balance. If it is too low, too many requests will be rejected with request-limit-exceeded errors. If it is too high, kube-apiserver will get OOM (Out Of Memory) killed because it tries to process too many requests in parallel. Generally speaking, 15 parallel requests per 25~30 Pods is sufficient.

Function: Control memory consumption
Flags: --target-ram-mb
Description: kube-apiserver uses this value to guess the size of the cluster and to size its deserialization cache and watch cache [9]. It uses the same assumption as the Kubernetes benchmark mentioned above: 120GB for ~60,000 Pods on 2,000 nodes, which works out to 60MB per node at 30 Pods per node.
Recommendation: Generally speaking, 60MB per 20~30 Pods is a good assumption to make. The kube-apiserver container's memory request can be set equal to or greater than this value. An illustrative command line follows this table.
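
To make the arithmetic concrete, here is a minimal sketch for a hypothetical 20-node cluster running about 600 Pods. The numbers are assumptions derived from the rules of thumb above, not tested values, and the other required flags (etcd endpoints, certificates, and so on) are omitted:

    # Illustrative sizing, assuming ~20 nodes x ~30 Pods = ~600 Pods:
    #   --max-requests-inflight: ~15 in-flight requests per 25~30 Pods  -> ~350
    #   --target-ram-mb:         ~60MB per 20~30 Pods                   -> ~1500
    # --max-mutating-requests-inflight is set to a fraction of the total (assumption).
    kube-apiserver \
      --max-requests-inflight=350 \
      --max-mutating-requests-inflight=120 \
      --target-ram-mb=1500

The kube-apiserver container's memory request would then be set to at least 1500MB, per the table above.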

kube-controller-manager [7] [8]

Function: Control level of parallelism
Flags:
 --concurrent-deployment-syncs
 --concurrent-endpoint-syncs
 --concurrent-gc-syncs
 --concurrent-namespace-syncs
 --concurrent-replicaset-syncs
 --concurrent-resource-quota-syncs
 --concurrent-service-syncs
 --concurrent-serviceaccount-token-syncs
 --concurrent-rc-syncs
Description: kube-controller-manager has a set of flags that provide fine-grained control over parallelism. Increasing parallelism makes Kubernetes more agile when updating specs, but also allows the controller manager to consume more CPU and memory.
Recommendation: In general, increase the settings for the controllers you use most intensively. For larger clusters, feel free to raise the default values as long as you are OK with the extra memory usage. For smaller clusters that are tight on memory, you can lower them. The Kubernetes default values can be found in the kube-controller-manager documentation.

Function: Control memory consumption
Flags:
 --replication-controller-lookup-cache-size
 --replicaset-lookup-cache-size
 --daemonset-lookup-cache-size
Description: This set of flags is not documented but is still available for use. Increasing a lookup cache size speeds up syncing for the corresponding controller, but increases the controller manager's memory consumption. Note that after Kubernetes 1.6, the replication controller, replica set, and daemon set controllers no longer require a lookup cache, so these flags are no longer needed.
Recommendation: The default values (4GB for ReplicationController, 4GB for ReplicaSet, and 1GB for DaemonSet) work fine even for large workloads; tune them down for smaller clusters to save memory. In practice, we set the controller manager container's memory request to slightly greater than the sum of these three values.

Function: Throttle API query rate
Flags:
 --kube-api-burst
 --kube-api-qps
Description: These two flags set the sustained and burst rates at which the controller manager talks to kube-apiserver. We increase these values for larger Applatix cluster configurations.
Recommendation: The default values (20 for QPS and 30 for burst) work pretty well. Increase them for large production clusters. An illustrative command line follows this table.
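
As a sketch only (the concurrency and QPS numbers below are assumptions for a deployment-heavy, high-churn cluster, not values from our configs), bumping the busiest controllers and the API client rate might look like this:

    # Illustrative kube-controller-manager flags; raise concurrency only for the
    # controllers you use heavily, and raise the container's memory request with it.
    kube-controller-manager \
      --concurrent-deployment-syncs=10 \
      --concurrent-replicaset-syncs=10 \
      --concurrent-endpoint-syncs=10 \
      --concurrent-gc-syncs=30 \
      --kube-api-qps=40 \
      --kube-api-burst=60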

kube-scheduler [10]

Function: Throttle API query rate
Flags:
 --kube-api-burst
 --kube-api-qps
Description: These two flags set the sustained and burst rates at which kube-scheduler talks to kube-apiserver, since kube-scheduler polls "need-to-schedule" Pods from the API server and writes scheduling decisions back. [11] [12] kube-scheduler's memory consumption can increase noticeably when there is a burst of Pod creation inside the cluster.
Recommendation: Setting QPS and burst to roughly 20% and 30%, respectively, of the API server's --max-requests-inflight value is a good starting point; see the sketch after this table.
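
Continuing the hypothetical kube-apiserver example above (assumed values, not tested ones), a --max-requests-inflight of roughly 350 translates to:

    # Illustrative kube-scheduler client rates: ~20% and ~30% of the API server's
    # --max-requests-inflight value (~350 in the earlier sketch).
    kube-scheduler \
      --kube-api-qps=70 \
      --kube-api-burst=105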

kube-proxy [13]

Function: Throttle API query rate
Flags:
 --kube-api-burst
 --kube-api-qps
Description: These two flags set the sustained and burst rates at which kube-proxy talks to kube-apiserver. kube-proxy mainly uses kube-apiserver to watch for changes in Service and Endpoint objects.
Recommendation: The default values are fine.

etcd

Function: Control memory consumption
Flags: --snapshot-count
Description: etcd memory consumption and disk usage are directly affected by `--snapshot-count`, and memory bursts are likely when there is a lot of Pod churn cluster-wide. See etcd's tuning documentation for more information.
Recommendation: We reduced the default value significantly while keeping it large enough for our usage. The default is unnecessarily large for our use case and can easily cause etcd to be OOM killed. We referred to etcd's documentation for hardware provisioning. An illustrative setting follows this table.
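
A minimal sketch of this knob is shown below. The value is an assumption for illustration (we do not publish our exact setting), and the default differs between etcd releases, so check yours before lowering it:

    # Illustrative etcd setting: snapshot (and allow the raft log to be compacted)
    # every 5,000 commits instead of the much larger default.
    etcd \
      --snapshot-count=5000 \
      --data-dir=/var/lib/etcd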

kubelet [14]

Function: Throttle API query rate
Flags:
 --kube-api-burst
 --kube-api-qps
Description: These two flags set the sustained and burst rates at which kubelet talks to kube-apiserver. Since each node runs a limited number of Pods, the default values worked well for us in our stress tests.
Recommendation: The Kubernetes benchmark assumes 30 Pods per node, and the defaults are good enough for that density. Scale these values based on the estimated number of Pods you want to run on each minion, using the defaults as a minimum guideline.

Function: Control event generation rate
Flags:
 --event-burst
 --event-qps
Description: These two flags control the rate at which kubelet creates events. More events means more work for the master node to process.
Recommendation: Tune these values if you have micro-services that analyze (and especially cache) Kubernetes event streams, as such services can get OOM killed when kubelets generate too many events globally.

Function: Throttle container registry query rate
Flags:
 --registry-burst
 --registry-qps
Description: These two flags control the rate at which kubelet talks to the container registry.
Recommendation: The default values are fine. Increase them if your application is sensitive to delays in pulling container images.

Function: Protect host
Flags:
 --max-open-files
 --max-pods
 --kube-reserved
 --system-reserved
Description: These four flags help kubelet protect the host.
Recommendation: Set `--max-pods` to prevent too many small containers from overloading the kubelet on a node; we generally limit it to around 80 Pods (roughly 300 containers) per m3.2xlarge node. `--kube-reserved` and `--system-reserved` reserve resources for system and Kubernetes components (such as the kernel, docker, kubelet, and other Kubernetes DaemonSets); kube-scheduler takes these reservations into account when scheduling Pods. An illustrative command line putting these knobs together follows this table.
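
Putting the kubelet knobs together, a hedged sketch for a large node expected to run up to about 80 Pods might look like the following. The reservation sizes are assumptions that depend on your instance type and system daemons:

    # Illustrative kubelet flags; measure your node's system daemons before
    # choosing the reserved values.
    kubelet \
      --max-pods=80 \
      --max-open-files=1000000 \
      --kube-reserved=cpu=500m,memory=1Gi \
      --system-reserved=cpu=500m,memory=1Gi \
      --event-qps=5 --event-burst=10 \
      --registry-qps=5 --registry-burst=10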


In addition to configuring component resources, the master node's root device size needs to be tuned based on cluster size so that log rotation happens frequently enough for the verbosity level we set on the Kubernetes master components. If log rotation is not aggressive enough, the master node can become unstable when the root device fills up, making the whole cluster unstable.
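
If your master components log to files on the root device, a simple logrotate policy keeps them from filling the disk. The file name, log paths, and sizes below are assumptions that vary with how the components are deployed and how verbose their logging is:

    # Hypothetical contents of /etc/logrotate.d/kube-master (logrotate config);
    # adjust paths and sizes to your deployment and root device.
    /var/log/kube-apiserver.log /var/log/kube-controller-manager.log /var/log/kube-scheduler.log {
        daily
        rotate 5
        maxsize 100M
        missingok
        notifempty
        compress
        copytruncate
    }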

To recap, tuning these knobs properly based on your production workload is a critical step towards creating a production-ready cluster. Even so, it may still be a painful experience to identify root causes and come up with remedies for the instabilities and anomalies you will likely observe each time you scale your load to new levels. In the end, a stable and performant Kubernetes cluster can play a critical role in helping you migrate your apps to a scalable container-based infrastructure, particularly in the public cloud. Welcome to the new era of cloud computing!

In Part 3, we will look into how we architected Applatix software and our micro-services to work with Kubernetes to ensure cluster stability and availability. Stay tuned!

*Making Kubernetes Production Ready - Part 1
*Making Kubernetes Production Ready - Part 3

Ref [1]: http://blog.kubernetes.io/2016/07/kubernetes-updates-to-performance-and-scalability-in-1.3.html
Ref [2]: http://blog.kubernetes.io/2016/03/1000-nodes-and-beyond-updates-to-Kubernetes-performance-and-scalability-in-12.html
Ref [3]: https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-apiserver/app/server.go#L505
Ref [4]: https://kubernetes.io/docs/admin/kube-apiserver/
Ref [5]: https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-apiserver/app/options/options.go
Ref [6]: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/server_run_options.go
Ref [7]: https://kubernetes.io/docs/admin/kube-controller-manager/
Ref [8]: https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-controller-manager/app/options/options.go
Ref [9]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apiserver-watch.md
Ref [10]: https://kubernetes.io/docs/admin/kube-scheduler/
Ref [11]: https://coreos.com/blog/improving-kubernetes-scheduler-performance.html
Ref [12]: https://docs.google.com/presentation/d/1HYGDFTWyKjJveAk_t10L6uxoZOWTiRVLLCZj5Zxw5ok/edit#slide=id.gd6d8abb5d_0_2805
Ref [13]: https://kubernetes.io/docs/admin/kube-proxy/
Ref [14]: https://kubernetes.io/docs/admin/kubelet/

 

Harry Zhang is a Member of Technical Staff at Applatix. He works with the platform team, focusing on building containerized micro-services with Kubernetes and AWS. Reach him on LinkedIn for questions, comments, or information about Applatix.
 
