We use Flagsmith in web-core, web-sdk, room-manager and recording service for controlling feature rollout.
On 10th July at 11:12 PM, Flagsmith had an outage. We handle 5XX responses from Flagsmith, but in this case the requests timed out instead, so our services got stuck awaiting Flagsmith.identify().
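One way to avoid hanging on a call that never settles is to race it against a timer and fall back to a default. The sketch below is a minimal illustration, not what was deployed: withTimeout, the timeout value, and the fallback are all hypothetical, and the wrapped promise stands in for whatever the Flagsmith client returns.

```typescript
// Hypothetical helper (not part of Flagsmith): resolves with `fallback`
// if `task` rejects or does not settle within `ms` milliseconds.
async function withTimeout<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    // Whichever settles first wins: the real call or the timer.
    return await Promise.race([task, timeout]);
  } catch {
    // Treat a rejection (for example a 5XX surfaced as an error) like a timeout.
    return fallback;
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}

// Usage sketch, assuming `identifyUser()` is our own wrapper around the
// flag client and 2000 ms is an acceptable budget:
//   const ok = await withTimeout(identifyUser(), 2000, false);
```

With a wrapper like this, a hung request degrades to the fallback value rather than blocking the caller indefinitely.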
The Flagsmith outage was caused by one AWS Availability Zone in eu-west-2 going down due to thermal issues.
We quickly patched all of the above services to temporarily remove Flagsmith, and deployed the patch around 11:50 PM.
Service was partially restored at 11:50 PM, but because we couldn't patch the frontend SDKs, clients were only able to connect intermittently (requests were resolving with 5XX about half the time).
Web-core was affected more severely than web-sdk: it crashed when a certain flag wasn't available.
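A crash on a missing flag can usually be avoided by reading flags through a lookup that falls back to a local default. The sketch below assumes a simple in-memory flag map. The FlagMap shape, the DEFAULT_FLAGS table, and the flag name are hypothetical and do not reflect the actual web-core code or the Flagsmith client API.

```typescript
// Hypothetical shape of the flags we hold in memory after a fetch.
type FlagMap = Record<string, { enabled: boolean } | undefined>;

// Hypothetical local defaults, used whenever the flag service omits a flag.
const DEFAULT_FLAGS: Record<string, boolean> = {
  new_layout: false,
};

// Never throws: a missing map or missing flag falls back to the local
// default, and to `false` if no default is defined either.
function isEnabled(flags: FlagMap | null | undefined, name: string): boolean {
  return flags?.[name]?.enabled ?? DEFAULT_FLAGS[name] ?? false;
}
```

Defaulting unknown flags to off is a design choice in this sketch; a flag that gates existing behaviour may need a default of on instead.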
The outage was resolved on 11th July, around 2:00–2:30 AM.