Degraded Performance in US cluster

Incident Report for Hevo

Postmortem

Summary:

Between 13/08/22 at 06:30 AM IST and 13/08/22 at 09:45 AM IST, we experienced dashboard slowness for all customers in the US cluster.

Timeline (IST):

13 August, 2022 | 06:30 AM - UI stopped working. Ingestion stopped on US servers.
13 August, 2022 | 07:30 AM - Upgraded our DB server.
13 August, 2022 | 09:00 AM - Did a rolling restart of the application.
13 August, 2022 | 09:45 AM - Tuned DB queries. All components, including the Dashboard, were operational again.

Date of the first occurrence:

13 August, 2022 05:15 AM IST

Impact of the incident on the customer:

Impact: HIGH

Event consumption stopped for all pipelines and all customers.

Root cause:

We observed slow queries and deadlocks on a few rows and tables, which caused APIs to fail and left the cluster in an inoperable state.

Solution:

We have identified and fixed the query/code that caused the deadlock.
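
This report does not describe the actual code change. As a general illustration only, one common way to avoid this class of deadlock is to have every worker acquire its locks in the same fixed order. The sketch below uses in-process Python locks as hypothetical stand-ins for row-level database locks; the names `row_a`, `row_b`, `update_rows` and `run_demo` are assumptions made for the example and do not come from the incident.

```python
import threading

# Hypothetical stand-ins for row-level locks. The report does not name
# the affected rows or tables, so these identifiers are illustrative only.
row_locks = {row_id: threading.Lock() for row_id in ("row_a", "row_b")}


def update_rows(row_ids, results, label):
    """Lock every requested row in sorted order, then record the work.

    Because all workers acquire locks in the same global order, no two
    workers can each hold a lock that the other is waiting for.
    """
    ordered = sorted(row_ids)
    for row_id in ordered:
        row_locks[row_id].acquire()
    try:
        # Placeholder for the real update performed while the rows are locked.
        results.append((label, ordered))
    finally:
        # Release in reverse order of acquisition.
        for row_id in reversed(ordered):
            row_locks[row_id].release()


def run_demo():
    """Run two workers that request the same rows in opposite orders."""
    results = []
    workers = [
        threading.Thread(target=update_rows, args=(["row_a", "row_b"], results, "t1")),
        threading.Thread(target=update_rows, args=(["row_b", "row_a"], results, "t2")),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join(timeout=5)
    return results


if __name__ == "__main__":
    print(run_demo())
```

Without the `sorted` call, the two workers could each take one lock and wait indefinitely for the other. In a database the same idea applies to the order in which a transaction touches rows, though the details depend on the engine and isolation level in use.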

Posted Aug 19, 2022 - 06:44 UTC

Resolved

This incident has been resolved.

Posted Aug 13, 2022 - 04:27 UTC

Monitoring

The cluster is operational and we are still working on figuring out the root cause.

Posted Aug 13, 2022 - 03:46 UTC

Investigating

We are currently investigating the issue.

Posted Aug 12, 2022 - 23:47 UTC

This incident affected: US Cluster (Sources, Destinations, Data Pipelines, Public API).