System Down

Incident Report for Dyte

Postmortem

Incident Summary

On 6 December 2024 at 10:08 UTC, the Kubernetes cluster in our Mumbai region became inaccessible, impacting all clients. The root cause was the accidental deletion of a critical configuration, which broke communication between the cluster nodes and the control plane and led to a system-wide outage in the region.

Root Cause

The incident occurred during a routine infrastructure update involving auto-scaling groups and Kubernetes add-on upgrades. An unforeseen error during this process deleted the aws-auth ConfigMap, which manages node and user access to the Kubernetes cluster. This led to:

Nodes losing authorization to communicate with the control plane.
Nodes transitioning to an unhealthy state, preventing pod scheduling.
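
As a rough illustration of why losing the aws-auth ConfigMap cuts nodes off, the sketch below models the lookup in simplified form. It is not the actual authenticator logic. The function name, the role ARN, and the assumption that mapRoles has already been parsed into a list of dictionaries are all hypothetical; only the field names (rolearn, username, groups) and the node groups follow the conventional aws-auth layout.

```python
# Groups that worker-node roles are conventionally mapped to in aws-auth.
NODE_GROUPS = {"system:bootstrappers", "system:nodes"}


def is_node_role_authorized(map_roles, role_arn):
    """Return True if role_arn is mapped to the node groups in mapRoles.

    map_roles is the already-parsed mapRoles list, or None when the
    ConfigMap no longer exists (the situation described in this report).
    """
    if not map_roles:
        return False
    for entry in map_roles:
        groups = set(entry.get("groups", []))
        if entry.get("rolearn") == role_arn and NODE_GROUPS <= groups:
            return True
    return False


# Hypothetical role ARN, for illustration only.
NODE_ROLE = "arn:aws:iam::111122223333:role/example-node-role"

healthy_map_roles = [
    {
        "rolearn": NODE_ROLE,
        "username": "system:node:{{EC2PrivateDNSName}}",
        "groups": ["system:bootstrappers", "system:nodes"],
    }
]

# With the mapping present the node role is recognised; with the
# ConfigMap deleted (modelled as None) it is not.
print(is_node_role_authorized(healthy_map_roles, NODE_ROLE))
print(is_node_role_authorized(None, NODE_ROLE))
```

Once the mapping is gone, every node using that role fails the check at the same time, which is consistent with the cluster-wide loss of node health described above.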

Impact

The cluster became inaccessible to both users and system components.
All workloads were disrupted, leading to downtime for all clients in the Mumbai region.

Resolution

Access was restored by recreating the deleted configuration. Nodes were reauthorized to communicate with the control plane, and services gradually returned to normal. Additional capacity was added to expedite recovery.
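
The report does not say how the configuration was recreated. One possible approach, sketched here under that caveat, is to regenerate the aws-auth manifest from a known node role ARN and apply it with standard tooling. The function names and the role ARN are hypothetical; the ConfigMap name, namespace, and mapRoles layout follow the conventional aws-auth structure.

```python
import json


def render_map_roles(node_role_arn):
    """Render the mapRoles value as YAML text for a single node role."""
    return (
        f"- rolearn: {node_role_arn}\n"
        "  username: system:node:{{EC2PrivateDNSName}}\n"
        "  groups:\n"
        "    - system:bootstrappers\n"
        "    - system:nodes\n"
    )


def build_aws_auth_manifest(node_role_arn):
    """Build a ConfigMap manifest that re-maps the node role."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "aws-auth", "namespace": "kube-system"},
        # mapRoles is stored as a single string holding YAML text.
        "data": {"mapRoles": render_map_roles(node_role_arn)},
    }


# Hypothetical role ARN, for illustration only.
manifest = build_aws_auth_manifest(
    "arn:aws:iam::111122223333:role/example-node-role"
)

# Manifests can be written as JSON, so the output could be saved to a
# file and applied with the usual cluster tooling.
print(json.dumps(manifest, indent=2))
```

A real recovery would also need to restore any user or additional role mappings that the original ConfigMap contained, which this sketch does not attempt.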

The system was fully recovered and operational by 11:10 UTC.

Preventive Actions

To avoid such incidents in the future:

Safeguards will be implemented to protect critical configurations.
Enhanced validation checks will be added to infrastructure changes.
Processes for rapid recovery and backups will be strengthened.
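
As one way to picture the validation checks mentioned above, the sketch below scans a list of planned changes and refuses to proceed if any of them would delete a protected resource. The plan format, the function name, and the protected list are assumptions made for illustration; the report does not describe the actual safeguards.

```python
# Resources that must never be deleted by an automated change,
# identified by (kind, namespace, name). Hypothetical list.
PROTECTED = {("ConfigMap", "kube-system", "aws-auth")}


def find_protected_deletions(planned_changes, protected=PROTECTED):
    """Return the protected resources that the plan would delete."""
    violations = []
    for change in planned_changes:
        key = (change["kind"], change["namespace"], change["name"])
        if change["action"] == "delete" and key in protected:
            violations.append(key)
    return violations


# Hypothetical plan, resembling an add-on upgrade that also removes
# the aws-auth ConfigMap.
plan = [
    {"action": "update", "kind": "DaemonSet",
     "namespace": "kube-system", "name": "kube-proxy"},
    {"action": "delete", "kind": "ConfigMap",
     "namespace": "kube-system", "name": "aws-auth"},
]

blocked = find_protected_deletions(plan)
if blocked:
    print("Refusing to apply; protected resources would be deleted:", blocked)
else:
    print("Plan is safe to apply.")
```

Running a check like this before an infrastructure change is applied would surface the deletion as a blocking error rather than letting it reach the cluster.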

Posted Dec 09, 2024 - 13:12 IST

Resolved

This incident has been resolved.

Posted Dec 06, 2024 - 16:42 IST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Dec 06, 2024 - 16:41 IST

Update

We are continuing to work on a fix for this issue.

Posted Dec 06, 2024 - 16:35 IST

Identified

The issue has been identified and a fix is being implemented.

Posted Dec 06, 2024 - 16:32 IST

Investigating

We are currently investigating this issue.

Posted Dec 06, 2024 - 16:09 IST

This incident affected: Media (India Region, US Region, Singapore Region, Frankfurt Region, Other Regions), Meetings, Recordings, Webhooks, Analytics, api.dyte.io, and Signalling/Socket (India Region, Singapore Region, US Region).