Summary:
The database that serves critical metadata for Pipelines, Activations, Models, Workflows, and Destinations hit peak CPU utilisation because some pipelines had accumulated a very large number of failed events. As a result, the EU cluster could not respond to any service that relies on this metadata for processing. This affected every interaction with the database, whether made through the dashboard, the APIs, or system jobs such as pipeline runs and destination load jobs.
Timeline (UTC):
- Dec 01 2022 6:17 AM UTC, an internal alert was triggered when the CPU utilisation of the database reached critical levels and the cluster was moved to read-only mode to prevent any data loss.
- Dec 01 2022 7:25 AM UTC, a fix was identified to make sure that the DB does not get bloated further. Some non-critical reads to the database were blocked by disabling some components on the dashboard.
- Dec 01 2022 8:00 AM UTC, the abnormal load on the database had subsided as a result of the above fixes. Meanwhile, operational tasks to fix the affected pipelines were started.
- Dec 01 2022 1:20 PM UTC, the load on the database had fully returned to normal and all systems were operational. The disabled dashboard components were re-enabled, while tasks to recover the affected pipelines were still ongoing.
Date of the first occurrence:
Impact of the incident on the customer:
Impact: HIGH
Major service outage in the EU cluster affecting the following:
- Pipelines
- Destinations
- Activations
- Models
- Workflows
- Dashboard
Action Items:
- Implement the necessary transformations to fix the failed events, so that new events no longer fail and the existing failed events can be replayed.
- We will soon start replaying the failed events in batches rather than all at once, for better performance.
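
The two action items above can be illustrated with a minimal sketch of a batched replay. This is not the actual implementation: the names `replay_failed_events`, `transform`, and `process_batch`, the event shape, and the default batch size are all hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape: a plain dictionary.
Event = Dict[str, Any]


def batched(events: Iterable[Event], batch_size: int) -> Iterator[List[Event]]:
    """Yield consecutive lists of at most batch_size events."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def replay_failed_events(
    events: Iterable[Event],
    transform: Callable[[Event], Event],
    process_batch: Callable[[List[Event]], None],
    batch_size: int = 500,  # assumed value, to be tuned for the real system
) -> Tuple[int, List[Event]]:
    """Apply the fixing transformation, then replay events batch by batch.

    Returns the number of events replayed and the events that the
    transformation could not fix, so they can be inspected separately.
    """
    replayed = 0
    still_failing: List[Event] = []
    for batch in batched(events, batch_size):
        fixed: List[Event] = []
        for event in batch:
            try:
                fixed.append(transform(event))
            except Exception:
                # Keep unfixable events aside instead of aborting the replay.
                still_failing.append(event)
        if fixed:
            process_batch(fixed)
            replayed += len(fixed)
    return replayed, still_failing
```

Replaying in bounded batches caps the amount of work submitted at any one time, and collecting the events that still fail keeps one bad event from blocking the rest of the replay.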