On 6 December at 10:08 UTC, an unexpected issue occurred in our Mumbai region's Kubernetes cluster, rendering it inaccessible and impacting all clients in that region. The root cause was the accidental deletion of a critical configuration, which disrupted communication between cluster nodes and the control plane and resulted in a region-wide outage.
The incident occurred during a routine infrastructure update involving auto-scaling groups and Kubernetes add-on upgrades. An unforeseen error during this process caused the deletion of the aws-auth ConfigMap, which manages node and user access to the Kubernetes cluster (a representative manifest is sketched after the list below). This led to:

- Worker nodes losing authorization to communicate with the control plane
- Workloads across the Mumbai cluster becoming unreachable, producing the region-wide outage described above
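For context, the aws-auth ConfigMap lives in the kube-system namespace and maps IAM roles to Kubernetes identities; without it, worker nodes cannot authenticate to the API server. The sketch below is a minimal, representative manifest, not the cluster's actual configuration; the account ID and role name are placeholders, and real contents vary per cluster:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  # Maps the node IAM role (placeholder ARN) to the Kubernetes groups
  # that allow EC2 worker nodes to register with the control plane.
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/example-node-instance-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```

When this mapping is absent, node kubelets fail authentication, nodes drop to NotReady, and workloads scheduled on them become unreachable.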
Access was restored by recreating the deleted configuration. Nodes were reauthorized to communicate with the control plane, and services gradually returned to normal. Additional capacity was added to expedite recovery.
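A minimal sketch of what the restore step could look like, assuming a known-good copy of the manifest was available (the file name is illustrative, not the actual artifact used during recovery):

```sh
# Reapply the saved aws-auth ConfigMap so worker nodes can
# re-authenticate with the control plane.
kubectl apply -f aws-auth-backup.yaml

# Watch nodes transition from NotReady back to Ready as they rejoin.
kubectl get nodes -w
```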
The system had fully recovered and was operational by 11:10 UTC.
To avoid such incidents in the future:
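One common safeguard against this failure mode, sketched here as an illustration rather than a committed action item, is snapshotting the aws-auth ConfigMap before any infrastructure change so it can be restored immediately (the file name is illustrative):

```sh
# Take a restorable snapshot of aws-auth before upgrades or
# auto-scaling-group changes touch the cluster.
kubectl get configmap aws-auth -n kube-system -o yaml > aws-auth-backup.yaml
```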