Service Interruption: Video Calls unavailable for a subset of customers

Incident Report for Dyte

Postmortem

On July 20, 2024, from 12:35 AM to 1:30 AM IST, we experienced an incident that affected our call services. This resulted in a 55-minute service disruption for some of our customers.

Incident Summary:

A configuration change in our production Kubernetes cluster caused certain critical pods to terminate unexpectedly.
New pods were unable to start due to authentication issues with the cluster.
This led to a temporary unavailability of our call functionality for both ongoing and new calls.

Root Cause:

The incident was caused by an unintended removal of necessary permissions in our cluster configuration during a routine maintenance operation.

Resolution:

Our team quickly identified the issue and restored the required permissions, allowing the affected services to recover fully.

Actions Taken to Prevent Recurrence:

We are implementing stricter review processes for configuration changes.
We are enhancing our monitoring systems to detect similar issues more quickly in the future.
We are updating our documentation and training to prevent similar oversights.
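
As an illustration of the kind of check a stricter review process could include, the sketch below compares a current and a proposed permission set and reports anything the change would remove. It is a minimal sketch under stated assumptions, not our actual tooling: the function name removed_permissions, the role name call-state-service, and the permission strings are all hypothetical, and the report above does not describe how permissions are modelled in the cluster.

```python
def removed_permissions(
    current: dict[str, set[str]],
    proposed: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Return, per role, the permissions present in `current` but absent from `proposed`."""
    removed: dict[str, set[str]] = {}
    for role, perms in current.items():
        # A role missing from the proposed config loses all of its permissions.
        missing = perms - proposed.get(role, set())
        if missing:
            removed[role] = missing
    return removed


# Hypothetical example data: role and permission names are illustrative only.
current = {"call-state-service": {"pods:get", "pods:list", "secrets:get"}}
proposed = {"call-state-service": {"pods:get", "pods:list"}}

removed = removed_permissions(current, proposed)
if removed:
    # In a review gate, this is where the change would be flagged for explicit sign-off.
    print("Change removes permissions:", removed)
```

A check like this does not decide whether a removal is intended; it only makes the removal visible so that a reviewer has to approve it explicitly.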

We sincerely apologize for any inconvenience this may have caused. We appreciate your patience and understanding as we worked to resolve this issue. Our team remains committed to providing you with the highest level of service reliability.

If you have any questions or concerns, please don't hesitate to contact our support team.

Thank you for your continued trust in our services.

Posted Aug 22, 2024 - 14:05 IST

Resolved

Date and Time: 20th July 00:35 IST to 20th July 01:30 IST (19 July 19:05 to 19 July 20:00 UTC)

Affected Service: Video Calls

Issue: Video calls on the new media layer were unavailable for about 55 minutes.

Root Cause: During a routine maintenance task, permissions associated with a class of instances were accidentally revoked. As a result, a service that manages the state of calls on the new media engine was unavailable, causing this interruption.

Resolution: Our engineering team identified the issue and resolved it by restoring the permissions associated with the instances. The maintenance task has been paused and will be resumed after reviewing the health of all associated systems.

Impact: Existing calls ended abruptly, and new calls could not be started during the interruption. Only organisations using the new media layer were affected. The older media layer continued functioning, and some calls would have been rerouted to the working systems.
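
The incident window given under Date and Time can be cross-checked with a short sketch. It assumes only that IST is UTC+05:30; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# IST is a fixed offset of UTC+05:30 (no daylight saving time).
IST = timezone(timedelta(hours=5, minutes=30), name="IST")

start_ist = datetime(2024, 7, 20, 0, 35, tzinfo=IST)
end_ist = datetime(2024, 7, 20, 1, 30, tzinfo=IST)

# Convert both ends of the window to UTC.
start_utc = start_ist.astimezone(timezone.utc)
end_utc = end_ist.astimezone(timezone.utc)

# Length of the window in whole minutes.
duration_minutes = int((end_ist - start_ist).total_seconds() // 60)

print(start_utc.isoformat())  # 2024-07-19T19:05:00+00:00
print(end_utc.isoformat())    # 2024-07-19T20:00:00+00:00
print(duration_minutes)       # 55
```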

Next Steps:

* Provide a detailed RCA.
* Fix the underlying infrastructure code to prevent this revocation of permissions from happening again.
* Complete the maintenance task.

Posted Jul 20, 2024 - 00:30 IST