Summary
On Wednesday, Nov 12, 2025, a slow degradation of messaging connectivity between the Recording API layer and recording control plane, exacerbated by a failed periodic restart job caused recording failures across multiple Orgs. The Recording success indicators wouldn't have worked, nor the recording status webhooks would have been triggered. Only a percentage of our Recordings were failing to start successfully.
Date and Time:
Start time: 2025-11-12 16:50:00
Fix time: 2025-11-13 08:21:00
Detailed analysis
Our Recording API plane uses a gRPC connection pool for gRPC communication with the recording control plane (behind a load balancer). These connections became stale on the Load balancer without being marked as disconnected, creating a broken communication channel that clients could not detect.
Mitigation:
This issue was mitigated after we restarted the Recording API plane. We also replaced the restart mechanism with a service level periodic restart, to arrest any such event happening further.
Next Steps