Major Outage in the EU region

Incident Report for Hevo

Postmortem

Summary:

The database that serves critical metadata for Pipelines, Activations, Models, Workflows, and Destinations hit peak CPU utilisation because some pipelines had accumulated a very large number of failed events. As a result, the EU cluster could not respond to any service that relies on this metadata for processing. This affected all interactions with the database, whether through the dashboard, the APIs, or system jobs such as pipeline runs and destination load jobs.

Timeline (UTC):

Dec 01 2022 6:17 AM UTC, an internal alert was triggered when the CPU utilisation of the database reached critical levels, and the cluster was moved to read-only mode to prevent any data loss.
Dec 01 2022 7:25 AM UTC, a fix was identified to keep the database from bloating further. Some non-critical reads to the database were blocked by disabling some dashboard components.
Dec 01 2022 8:00 AM UTC, the abnormal load on the database had subsided as a result of the above fixes. Meanwhile, operational tasks to fix the affected pipelines were started.
Dec 01 2022 1:20 PM UTC, the database load had fully returned to normal and all systems were operational. The disabled dashboard components were re-enabled, while tasks to recover the affected pipelines were still ongoing.

Date of the first occurrence:

Dec 01, 2022

Impact of the incident on the customer:

Impact: HIGH

Major service outage in the EU cluster, affecting the following:

Pipelines
Destinations
Activations
Models
Workflows
Dashboard

Action Items:

Implement the necessary transformations to fix the failed events, so that newer events do not fail and the failed events can be replayed.
We will soon start processing the failed events batch by batch for better performance.
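
As a rough illustration of the batch-by-batch approach mentioned above, the sketch below replays a stream of failed events in fixed-size batches, so that no single pass puts the full backlog on the database at once. It is only a sketch under stated assumptions, not Hevo's actual implementation: the names `batched`, `replay_failed_events`, and `replay_batch`, as well as the default batch size of 500, are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls the next batch_size items without loading everything.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def replay_failed_events(
    events: Iterable[T],
    replay_batch: Callable[[List[T]], None],
    batch_size: int = 500,  # hypothetical default, not taken from the report
) -> int:
    """Replay events one batch at a time; return how many were replayed.

    replay_batch is a hypothetical callback standing in for whatever
    actually re-submits a batch of events for processing.
    """
    replayed = 0
    for batch in batched(events, batch_size):
        replay_batch(batch)
        replayed += len(batch)
    return replayed
```

For example, calling `replay_failed_events(range(7), handler, batch_size=3)` would invoke `handler` three times, with batches of 3, 3, and 1 events. A smaller batch size trades throughput for a lower, steadier load on the database.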

Posted Dec 05, 2022 - 14:26 UTC

Resolved

This incident has been resolved.

Posted Dec 01, 2022 - 13:20 UTC

Monitoring

The systems are operational now. We are working on the fix for a UI component.

Posted Dec 01, 2022 - 08:00 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Dec 01, 2022 - 07:25 UTC

Update

We are continuing to investigate this issue.

Posted Dec 01, 2022 - 06:39 UTC

Investigating

We are currently investigating this issue.

Posted Dec 01, 2022 - 06:17 UTC

This incident affected: Europe Cluster (UI Console, Sources, Destinations, Data Pipelines, Public API, Alerts).