In Part 2, I discussed how to tune workload and performance related knobs in Kubernetes for production. However, this is not the end of the story. The way you manage your micro-services should also be carefully choreographed to work smoothly with Kubernetes and ensure a stable system. At Applatix, we run large Kubernetes clusters with a mix of DevOps workloads. Some workloads are short-lived while others run for long periods of time. Some use very little resources while others use lots of CPU, memory, and disk. Meanwhile, Applatix micro-services need to talk to Kubernetes frequently to perform DevOps tasks. As a result, we must cater to the stability needs of the most demanding workload. In other words, we favor stability over efficiency. Here are some things we do to ensure the stability of our Kubernetes clusters.
Protect Every Node
Not only the master node but also every minion should be protected from overload and resource exhaustion. Minions can be protected by properly configuring their kubelets. We set the "--max-pods" flag on the kubelet to limit the number of Pods admitted for scheduling. It is not uncommon for large batch jobs or runaway applications to spawn very large numbers of Pods in a short period of time. Without limiting Pod admission, the kubelet and other system Pods can be overwhelmed, resulting in unresponsive and unstable nodes. In combination with limiting Pod admission, you should reserve the resources needed by Kubernetes and other high-priority services using the "--kube-reserved" and "--system-reserved" kubelet flags. The kubelet deducts these reservations from the node's total allocatable resources when registering the node with the Kubernetes master, and kube-scheduler honors the resulting allocatable values when scheduling Pods. Finally, setting appropriate resource requests and limits on all Pods is equally important for protecting nodes, as it ensures that no runaway Pod can monopolize resources, starve other Pods, or trigger host-level OOM (out-of-memory) kills.
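As a config sketch, a kubelet invocation using these flags might look like the following. The specific values are illustrative examples, not recommendations; the right numbers depend on your instance sizes and workloads:

```shell
# Illustrative kubelet flags (example values, not recommendations).
# --max-pods caps how many Pods this node will admit for scheduling;
# --kube-reserved and --system-reserved carve out resources for
# Kubernetes daemons and OS-level services respectively.
kubelet --max-pods=60 \
        --kube-reserved=cpu=500m,memory=1Gi \
        --system-reserved=cpu=500m,memory=512Mi
```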
Throttle Object Creations
Originally, we allowed large numbers of DevOps tasks to be submitted to Kubernetes even when the cluster did not have sufficient spare resources to run them. We believed that Kubernetes could store the backlog of work and schedule it as existing tasks finished. This strategy turned out to introduce instability: as pending tasks accumulated, the Kubernetes master became overloaded trying to schedule too many Pods at once. These attempts to schedule massive backlogs dramatically increased the CPU and memory consumption of kube-scheduler, kube-apiserver, and etcd, frequently causing OOM kills and making Kubernetes unresponsive. To improve this situation, we implemented high-level admission control algorithms that limit both the number and the rate of tasks submitted to Kubernetes, and that avoid resource deadlocks when running workflows that must sequence the execution of multiple Pods. This made our Kubernetes clusters much more stable under high load.
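Our actual admission control is more elaborate, but the core idea, keep excess tasks queued outside Kubernetes instead of piling Pending Pods onto the master, can be sketched as follows (the class name and cap are hypothetical):

```python
import threading

class TaskAdmissionGate:
    """Minimal sketch of high-level admission control. A production
    version would also rate-limit submissions and order workflow steps
    to avoid resource deadlocks."""

    def __init__(self, max_in_flight):
        # Cap the number of tasks allowed into Kubernetes at once.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self):
        # Non-blocking: rejected tasks stay in the caller's own queue
        # rather than accumulating as Pending Pods in Kubernetes.
        return self._slots.acquire(blocking=False)

    def done(self):
        # Free a slot when a task finishes, letting the next one in.
        self._slots.release()

gate = TaskAdmissionGate(max_in_flight=2)
print([gate.try_admit() for _ in range(3)])  # [True, True, False]
```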
Back-off and Retry
In an Applatix cluster, some system-level micro-services need to communicate regularly with Kubernetes. These services include the cluster auto-scaler, the workflow and application management engines, and various daemons for managing user Pods. Because these micro-services are distributed, they can generate bursts of activity cluster-wide, and many of these activities involve talking to Kubernetes. Since we use Kubernetes API rate limits to avoid overloading the kube-apiserver, excess API calls are rejected and clients receive HTTP status code 429 (Too Many Requests). In this case, we use exponential back-off and retry to cool down the cluster while still ensuring progress. It is also important to add jitter to the back-off to prevent synchronized API request storms after a kube-apiserver crash. Kubernetes has clearly defined API conventions; please refer to the documentation for more information.
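A common way to implement this (our exact parameters differ) is capped exponential back-off with "full jitter": on each failed attempt, sleep a random duration up to an exponentially growing ceiling, so that many clients retrying at once naturally spread out:

```python
import random
import time

def call_with_backoff(fn, base=0.5, cap=60.0, max_attempts=6):
    """Retry fn() with capped exponential back-off and full jitter.
    fn should raise an exception on failure (e.g., on an HTTP 429)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Sleep a random time in [0, min(cap, base * 2**attempt)).
            # The randomness prevents synchronized retry storms after
            # a kube-apiserver crash and restart.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```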
Avoid Talking to Master if Possible
API calls to the Kubernetes master are relatively expensive, especially when you can get the same information from local services. For example, the kubelet is reachable over the host network and can open a read-only port, via the "--read-only-port" flag, to serve a limited but useful set of read-only APIs. These APIs are not documented, but can be found by reading the source code:
- "/pods" path returns information about all Pods on that host
- "/spec/" path returns host information
- "/stats/" path returns cAdvisor stats for the host, etc.
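For illustration, here is how a node-local daemon might query the read-only port. This assumes the common default port 10255 when the flag is enabled; since these paths are unofficial, they may change between Kubernetes releases:

```python
import json
import urllib.request

KUBELET_READONLY_PORT = 10255  # common default for --read-only-port

def kubelet_url(path, host="127.0.0.1"):
    # Build a URL for the node-local, read-only kubelet API.
    return "http://%s:%d%s" % (host, KUBELET_READONLY_PORT, path)

def local_pods():
    # List Pods on this node without a round trip to the master.
    with urllib.request.urlopen(kubelet_url("/pods"), timeout=5) as resp:
        return json.load(resp)
```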
Finally, in some cases it may be better to retrieve cloud-level metadata such as IP addresses and region info directly from the cloud provider instead of from the Kubernetes master. For AWS, http://169.254.169.254 is the "magic" URL you can use, and for GCP/GKE, http://metadata.google.internal/computeMetadata/v1/ is the URL of the metadata server.
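A small sketch of querying both metadata services follows. Note that GCP requires the "Metadata-Flavor: Google" request header, while AWS serves plain GETs:

```python
import urllib.request

AWS_METADATA = "http://169.254.169.254/latest/meta-data/"
GCP_METADATA = "http://metadata.google.internal/computeMetadata/v1/"

def metadata_request(provider, path):
    # Build a request for the instance metadata service; GCP rejects
    # requests that lack the Metadata-Flavor header.
    if provider == "aws":
        return urllib.request.Request(AWS_METADATA + path)
    if provider == "gcp":
        return urllib.request.Request(GCP_METADATA + path,
                                      headers={"Metadata-Flavor": "Google"})
    raise ValueError("unknown provider: %s" % provider)

# Usage (on a cloud instance):
#   urllib.request.urlopen(metadata_request("aws", "local-ipv4")).read()
```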
Cross Monitoring Nodes
Like any virtual machine, Kubernetes nodes can suffer from many software-related problems. For example, Linux kernel bugs are often triggered by interactions with container managers such as Docker and can cause kernel crashes; Docker can hang for long periods of time, leaving Kubernetes with inconsistent Pod/container status; and critical Kubernetes components can get OOM killed and lose connection with each other. Such situations are much more frequent on nodes with high levels of churn (frequent Pod starts and stops). Kubernetes relies heavily on synchronizing object state between minions and the master. As a result, such software problems initially affect only a single Kubernetes node, but can quickly escalate and affect the entire cluster if you do not react in a timely manner to fix the unhealthy node.
At Applatix, we run special micro-services and system daemons that monitor critical cluster components such as Docker, the kubelet, and the Kubernetes master. In the event that any of these components enters a non-recoverable, inconsistent state (e.g., a hanging Docker daemon that fails to restart), our monitoring service terminates the node, and an identical replacement node is automatically launched. Although terminating an unhealthy node introduces additional churn by killing and restarting all Pods running on it, and the replacement node takes time to come up, this is much better than waiting conservatively for the node to (possibly) recover: unhealthy nodes left hanging around accumulate and spread local errors, and can eventually make the entire cluster unusable.
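The terminate-and-replace decision itself can be as simple as a consecutive-failure threshold per component. This is a deliberately simplified, hypothetical sketch; real monitoring daemons would also attempt repairs, such as restarting Docker, before giving up on the node:

```python
class NodeHealthMonitor:
    """Track health checks for one critical component (Docker, kubelet,
    ...) and decide when the node should be terminated and replaced."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record_check(self, healthy):
        # A single healthy check resets the counter; repeated failures
        # eventually mark the node as non-recoverable.
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
        return self.should_terminate()

    def should_terminate(self):
        # True once the component has failed too many checks in a row.
        return self.consecutive_failures >= self.max_consecutive_failures
```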
At Applatix, we put significant effort into learning how Kubernetes actually behaves under DevOps production workloads. After configuring Kubernetes and architecting our micro-services properly, our clusters have become very stable and responsive. This concludes my 3-part blog series sharing our experiences in making Kubernetes production-ready. Kubernetes plays a critical role in modern containerized cloud platforms. I hope this series encourages you to try Kubernetes for your production workloads, and that it saves you much pain in making your cluster production-ready. If you find this useful or have questions, just drop us a note at firstname.lastname@example.org!
Ref: https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md
Ref: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/server.go
Ref: https://github.com/google/cadvisor
Ref: http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-instance-metadata.html#instancedata-data-retrieval
Ref: https://cloud.google.com/compute/docs/storing-retrieving-metadata
Ref: https://github.com/kubernetes/kops/issues/874#issuecomment-278824037
Harry Zhang is a Member of Technical Staff at Applatix. He works with the platform team, focusing on building containerized micro-services with Kubernetes and AWS. Reach him on LinkedIn for questions, comments, or information about Applatix.