Degradation of the Recording Service

Incident Report for Dyte

Postmortem

Summary
On Wednesday, Nov 12, 2025, a slow degradation of messaging connectivity between the Recording API layer and recording control plane, exacerbated by a failed periodic restart job caused recording failures across multiple Orgs. The Recording success indicators wouldn't have worked, nor the recording status webhooks would have been triggered. Only a percentage of our Recordings were failing to start successfully.

Date and Time:
Start time: 2025-11-12 16:50:00
Fix time: 2025-11-13 08:21:00

Detailed analysis
Our Recording API plane uses a gRPC connection pool for gRPC communication with the recording control plane (behind a load balancer). These connections became stale on the Load balancer without being marked as disconnected, creating a broken communication channel that clients could not detect.

Mitigation:
This issue was mitigated after we restarted the Recording API plane. We also replaced the restart mechanism with a service level periodic restart, to arrest any such event happening further.

Next Steps

  • Added a logger alert to catch failed INVOKE recordings
  • Reinstated periodic restart of the API plane
  • Our newer Cloudflare stack based services are on Cloudflare workers and this will not be impacted, as gRPC connections are done every worker invocation.
  • Detection mechanism for GRPC staleness for current service
Posted Nov 14, 2025 - 17:03 IST

Resolved

Between 12 Nov 16:50 UTC and 13 Nov 08:21 UTC on Nov 13, we experienced a service degradation that prevented a percentage of new recordings from starting successfully. Affected recordings were stuck in the "INVOKED" state, and corresponding status webhooks also failed. The underlying issue has been resolved, and all services are stable
Posted Nov 12, 2025 - 16:30 IST