In part 1 of this blog series, we discussed the AWS limits and caps we hit when running a Kubernetes cluster on AWS over sustained periods of time. We worked around those problems by tuning Kubernetes and building better operational processes around our use of AWS resources. Was that the end of our problems? Far from it …
Even with a single cluster in an AWS account, some pods got stuck in the Pending state forever. That's when we discovered multiple problems related to storage volumes.
Wrong volumes associated with pods
In one instance, two pods running on two different nodes were each using their own EBS volume. The EC2 instances where these pods were running were terminated, and new instances were brought up to replace them. The pods were properly assigned to the new nodes; however, their EBS volumes got swapped and were mounted on the wrong nodes! As a result, the pods could not access the volumes they needed.
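A quick way to detect this kind of swap is to cross-check Kubernetes's view of each pod against the EC2 attachment state. A minimal sketch in Python, assuming the two mappings have already been fetched from the Kubernetes and EC2 APIs (all names here are illustrative, not part of our actual tooling):

```python
def misattached_volumes(expected, attached):
    """Find pods whose EBS volume is attached to the wrong node.

    expected: {pod: (node, volume_id)} from pod placement and
              PersistentVolume specs (hypothetical, pre-fetched).
    attached: {volume_id: node} from EC2 DescribeVolumes.
    Returns {pod: volume_id} for every mismatch.
    """
    bad = {}
    for pod, (node, vol_id) in expected.items():
        if attached.get(vol_id) != node:
            bad[pod] = vol_id
    return bad
```

In the swap described above, this check would flag both pods, since each pod's volume shows up on the other pod's node.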
Failure to unmount volumes
Another problem we faced with pods migrating between instances involved unmounting the EBS volumes those pods used. In our tests, some pods were using EBS volumes to store data, with PersistentVolumes and PersistentVolumeClaims associated with those volumes. Sometimes, when the pods moved between nodes, the EBS volumes were not unmounted from the source EC2 instances. As a result, the pods were stuck on the new EC2 instances waiting for the volumes to show up. Unfortunately, they never did.
Manually unmounting and detaching the volumes from the source EC2 instances did the trick, and the pods were able to access the volumes they needed and run again. We wrote a “volume mount fixer” to periodically scan for stuck volumes and unmount them.
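The core of such a fixer can be sketched as a reconciliation loop: compare where Kubernetes thinks each volume belongs with where EC2 says it is attached, and detach the stragglers. A simplified sketch, where the lookups and the detach helper are hypothetical stand-ins for the real Kubernetes and EC2 API calls:

```python
def find_stuck_volumes(desired, actual):
    """desired: {volume_id: node} per current Kubernetes pod placement.
    actual:  {volume_id: node} per EC2 attachment state.
    A volume is "stuck" if it is still attached to a node that
    Kubernetes no longer expects it on."""
    return [vol for vol, node in actual.items() if desired.get(vol) != node]

def fix_stuck_volumes(desired, actual, detach):
    # detach is a hypothetical callable wrapping unmount plus
    # an EC2 DetachVolume call for one volume id.
    for vol in find_stuck_volumes(desired, actual):
        detach(vol)
```

Running this on a schedule approximates the manual unmount/detach step described above.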
Leaked volumes due to "Tagging Race"
The only convenient way to tell which Kubernetes cluster was using which EBS volume was to label the volumes with AWS tags. However, AWS did not allow atomic creation and tagging of an EBS volume in a single API call: the first call created the volume, and a subsequent one tagged it. If the second call didn't happen for some reason (say, the network was hosed or AWS API rate limits kicked in), the volume would remain untagged and unused.
Over time, many “orphaned” unused EBS volumes would accumulate in the account, occasionally pushing us over the cap on the number of volumes permitted per AWS account. To address this, we had to periodically identify and delete orphaned volumes. One trick we used was to create the Kubernetes volumes with an unusual volume size, making potential orphans easier to spot. The good news is that AWS now supports an API call to atomically create and tag volumes. Happy days are here again!
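With the newer API, the create and the tag happen in one request via the TagSpecifications parameter of CreateVolume, so there is no window for the race. A boto3-style sketch of the request parameters (the tag key and values are illustrative, not the ones we actually used):

```python
def create_volume_params(cluster_name, size_gib, az):
    """Build parameters for a single atomic CreateVolume call that
    tags the volume at creation time via TagSpecifications."""
    return {
        "AvailabilityZone": az,
        "Size": size_gib,
        "VolumeType": "gp2",
        "TagSpecifications": [{
            "ResourceType": "volume",
            "Tags": [{"Key": "KubernetesCluster", "Value": cluster_name}],
        }],
    }

# With boto3 this would be used roughly as:
#   ec2 = boto3.client("ec2")
#   ec2.create_volume(**create_volume_params("prod-cluster", 100, "us-east-1a"))
```

If the call succeeds, the volume is tagged; if it fails, no untagged volume is left behind.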
Created volumes didn't actually exist
This is a rare event, but it did happen to us twice. When the AWS call to create a volume returned with a volume-id, we expected the volume to exist and be usable. However, subsequent calls to attach the volume to an EC2 instance failed with an error indicating that the volume did not exist.
This was odd because the call to create the volume had just succeeded, and we had verified that no other thread or process was deleting volumes at that point. This is possibly a bug in AWS!
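Whatever the root cause, a defensive pattern for this kind of read-after-write inconsistency is to retry the attach with exponential backoff instead of failing on the first "volume does not exist" error. A minimal sketch, with a hypothetical attach callable standing in for the real EC2 AttachVolume call:

```python
import time

def attach_with_retry(attach_fn, volume_id, retries=5, delay=1.0):
    """Retry a flaky attach call with exponential backoff.

    attach_fn is a hypothetical callable that raises when AWS reports
    the volume as missing; real code would wrap ec2.attach_volume.
    """
    for attempt in range(retries):
        try:
            return attach_fn(volume_id)
        except Exception:
            if attempt == retries - 1:
                raise  # still failing after all retries; give up
            time.sleep(delay * (2 ** attempt))
```

If the phantom volume eventually becomes visible, a later attempt succeeds; if it truly never materializes, the last error propagates after the retries are exhausted.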
Here's the AWS CloudTrail log of one of these occasions: