Partial Degradation: Delay in joining the meetings
Incident Report for Dyte
Postmortem

Incident Report: Delay in joining meetings.

Date and Time: 27th Aug 12:50 UTC to 27th Aug 13:50 UTC

Affected Service: Meeting Service (Meeting Join) 

Issue: Meeting join requests were taking more than 5 seconds to process, causing SDKs to retry repeatedly until a join succeeded in under 5 seconds. As a result, joins took 1-2 minutes to complete.
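
To illustrate how a 5-second limit per attempt can add up to a 1-2 minute join, here is a minimal sketch. The function name, the per-attempt latencies, and the assumption that the SDK retries immediately with no backoff are all hypothetical; the actual SDK retry logic is not described in this report.

```python
def total_join_time(attempt_latencies, timeout=5.0):
    """Return (seconds_waited, attempts_used) for the first attempt that beats the timeout.

    Hypothetical model: each attempt that takes `timeout` seconds or longer is
    abandoned at the timeout and retried immediately (no backoff assumed).
    """
    waited = 0.0
    for attempts, latency in enumerate(attempt_latencies, start=1):
        if latency < timeout:
            # This attempt finished in time, so the join succeeds here.
            return waited + latency, attempts
        # Attempt abandoned after `timeout` seconds; the SDK tries again.
        waited += timeout
    raise TimeoutError("no attempt completed within the timeout")


# Illustrative numbers only: 15 slow attempts followed by one fast attempt.
seconds, attempts = total_join_time([6.0] * 15 + [2.0])
print(f"joined after {seconds:.0f}s and {attempts} attempts")
```

Under these assumed latencies, the join completes after 77 seconds, which falls in the 1-2 minute range described above.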

Root Cause: Confirmations from RabbitMQ were delayed, which caused a bottleneck in the message publishing flow. As a result, responses were published to RabbitMQ slowly, even though the messages themselves were processed in time.

Resolution: A mitigation was pushed to the RabbitMQ code in the affected service. It makes writes to the confirmation channel asynchronous and logs the timing of each operation.
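
The general idea behind this kind of mitigation can be sketched as follows. This is not the service's actual code and does not use a RabbitMQ client: `FakeBroker`, `publish_blocking`, and `publish_async` are hypothetical names, and the broker is a stand-in that simply delays each confirmation. The sketch only shows why waiting for each confirmation inline slows publishing, while handing confirmations to a background worker does not.

```python
import queue
import threading
import time


class FakeBroker:
    """Hypothetical stand-in for a broker with publisher confirms (not a RabbitMQ client)."""

    def __init__(self, confirm_delay: float) -> None:
        self.confirm_delay = confirm_delay

    def publish(self, message: str) -> threading.Event:
        # Return an event that is set once the simulated confirmation arrives.
        confirmed = threading.Event()
        timer = threading.Timer(self.confirm_delay, confirmed.set)
        timer.daemon = True
        timer.start()
        return confirmed


def publish_blocking(broker, messages):
    """Wait for each confirmation before publishing the next message."""
    start = time.monotonic()
    for message in messages:
        broker.publish(message).wait()
    return time.monotonic() - start


def publish_async(broker, messages):
    """Publish without waiting; a background thread drains confirmations and records timings."""
    pending = queue.Queue()
    timings = []

    def drain_confirms():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more messages
                break
            message, sent_at, confirmed = item
            confirmed.wait()
            # Record how long each confirmation took, for logging.
            timings.append((message, time.monotonic() - sent_at))

    worker = threading.Thread(target=drain_confirms, daemon=True)
    worker.start()

    start = time.monotonic()
    for message in messages:
        pending.put((message, time.monotonic(), broker.publish(message)))
    publish_time = time.monotonic() - start

    pending.put(None)
    worker.join()
    return publish_time, timings


if __name__ == "__main__":
    broker = FakeBroker(confirm_delay=0.05)
    msgs = ["join-1", "join-2", "join-3", "join-4"]
    print("blocking publish took", round(publish_blocking(broker, msgs), 3), "s")
    publish_time, timings = publish_async(broker, msgs)
    print("async publish took", round(publish_time, 3), "s")
    for message, latency in timings:
        print(message, "confirmed after", round(latency, 3), "s")
```

In the blocking version, a slow confirmation delays every message queued behind it. In the asynchronous version, the publish loop finishes quickly and the per-message confirmation timings remain available for logging.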

Impact: Users experienced significant delays when joining sessions, with join times extending to 1-2 minutes. This degraded the user experience and may have disrupted scheduled meetings or sessions.

Next Steps:

  1. Testing: Conduct thorough testing of the internal pub-sub layer's integration with the media service to help prevent similar issues in the future.
  2. Alerting: Set up alerts for the meeting join timings to help detect similar problems in the future.
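
As a rough illustration of the alerting step, the check below flags a window of join durations when a high percentile exceeds a threshold. The function names, the choice of the 95th percentile, and the use of the 5-second SDK limit as the threshold are assumptions made for this sketch, not the alert configuration that was actually deployed.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; `pct` is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_alert(join_durations, threshold_seconds=5.0, pct=95):
    """Return True when the chosen percentile of join durations exceeds the threshold."""
    if not join_durations:
        return False  # no joins in this window, nothing to evaluate
    return percentile(join_durations, pct) > threshold_seconds


# Illustrative windows of join durations in seconds (made-up values).
healthy_window = [0.8] * 100
degraded_window = [0.8] * 90 + [6.0] * 10

print(should_alert(healthy_window))
print(should_alert(degraded_window))
```

Alerting on a percentile rather than the mean keeps a handful of fast joins from hiding a slow tail like the one seen in this incident.
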
Posted Sep 04, 2024 - 11:52 IST

Resolved
Users experienced issues joining the room. A few requests timed out, leading to a degraded user experience. The join room flow is now fully operational.
Posted Aug 27, 2024 - 06:30 IST