Resolved -
The incident has been resolved. All systems are healthy and back online following the reboot and the rollout of the final fixes across all environments.
Jan 22, 10:12 UTC
Monitoring -
Status Update: The reboot was successful and we are seeing positive results as our applications come back online. We are still actively monitoring the situation as we work to ensure everything is back to a stable and healthy state. We have also started to process a large backlog of transactions.
We currently estimate full resolution in approximately one hour, but we will continue to provide updates should anything change.
Jan 22, 09:30 UTC
Update -
Current status: In-Progress
The backup is complete and we are now rebooting the messaging system and restoring all of the data. Following that, we will bring all impacted services back online as quickly as possible. In parallel, a fix for the underlying root cause of the outage has been prepared and will be rolled out once the messaging system reboot has finished.
Thank you for your patience. Our team is all hands on deck, and we are optimistic that the incident will be resolved soon. We will provide another update in 30 minutes.
Jan 22, 09:06 UTC
Update -
We are still actively in the process of backing up all the data in the cluster. Once the backup is done, we will reset the internal messaging system and bring all services back online.
For full transparency and to keep everyone in the loop, here is a recap of the incident:
Multiple services are degraded due to the outage of our internal messaging system. Trading, Participants, Transact, Deposits, and Withdrawals are impacted.
Root cause of messaging system outage:
The root cause was traced to an inefficiency in how our internal messaging system queries data. A specific function was creating and destroying connections much more rapidly than intended, which placed excessive stress on the infrastructure and ultimately led to the overload.
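To illustrate the failure mode described above, here is a minimal, hypothetical sketch (Python, not our actual service code; the names MessagingClient and fetch_offsets are invented for illustration). It contrasts a query path that opens and tears down a connection on every call with one that reuses a single connection for a batch of queries:

```python
# Illustrative sketch only; MessagingClient and fetch_offsets_* are hypothetical
# stand-ins, not the real service code or the real messaging-system client.

class MessagingClient:
    """Stand-in client that counts how many connections get opened."""
    connections_opened = 0

    def __init__(self):
        MessagingClient.connections_opened += 1  # each instance = one new connection

    def query(self, key):
        return f"value-for-{key}"

    def close(self):
        pass


def fetch_offsets_churn(keys):
    # Anti-pattern resembling the reported root cause:
    # a brand-new connection is created and destroyed for every single query.
    results = []
    for key in keys:
        client = MessagingClient()
        results.append(client.query(key))
        client.close()
    return results


def fetch_offsets_pooled(keys):
    # Remediation pattern: one long-lived connection reused for the whole batch.
    client = MessagingClient()
    try:
        return [client.query(key) for key in keys]
    finally:
        client.close()


if __name__ == "__main__":
    keys = [f"partition-{i}" for i in range(1000)]
    fetch_offsets_churn(keys)
    churn = MessagingClient.connections_opened
    MessagingClient.connections_opened = 0
    fetch_offsets_pooled(keys)
    pooled = MessagingClient.connections_opened
    print(f"churn: {churn} connections, pooled: {pooled} connection(s)")
```

Under this sketch's assumptions, the churn path opens one connection per query (1000 here) while the pooled path opens one in total, which is the kind of difference that can push connection-handling infrastructure into overload at scale.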
Root cause of delay in messaging system recovery:
As this is the core communication system for our services, we have been all hands on deck backing up the cluster. Due to the amount of data that needs to be recovered, the backup is taking a long time, and we are assessing all of our options to accelerate the recovery process.
Current status:
We are almost at the point where all the data is backed up. Once the backup is complete, we can reset the messaging system and restore the data to bring all services back online.
Jan 22, 08:29 UTC
Update -
We are continuing to work on the fix and sincerely apologise for the delay in resolving the incident. To keep you updated on where we stand: we are almost at the point where all the data is backed up, after which we can reset the messaging system and restore the data. We will provide a detailed RCA after this is resolved.
Jan 22, 07:27 UTC
Update -
We are continuing to work on the fix. Due to the large amount of data to recover in our internal messaging system, recovery is taking longer than expected. We are now switching strategy to back up the data and restart the system. An ETA is still not available, but our entire Engineering team and leadership are continuing to work with an external vendor to address this as soon as possible.
Jan 22, 06:00 UTC
Update -
We are continuing to work on the fix. The incident was caused by an outage of our internal messaging system, and we are now in the final stage of restoring it. We will keep you posted on our progress.
Jan 22, 03:25 UTC
Identified -
We have identified the root cause of the issue and are working on a fix. We are still all hands on deck for this issue, but do not have an ETA at this stage.
Jan 22, 00:11 UTC
Investigating -
We are actively investigating an issue causing elevated error rates and degraded service availability across multiple endpoints. Our team is aware of the impact and is currently all hands on deck for this incident. We will provide updates as soon as we can.
Jan 21, 22:54 UTC