Degraded Performance in US cluster
Incident Report for Hevo
Postmortem

Summary:

Between 13/08/22 at 06:30 AM IST and 09/08/22 at 09:45 AM IST, we experienced dashboard slowness for all customers in the US cluster. 

Timeline (IST):

  • 13 August, 2022 | 06:30 AM - UI stopped working. Ingestion stopped on US servers.
  • 13 August, 2022 | 07:30 AM - Upgraded our DB server.
  • 13 August, 2022 | 09:00 AM - Did a rolling restart of the application.
  • 13 August, 2022 | 09:45 AM - Made some DB query tuning.  All the components along with Dashboard were operable.

Date of the first occurrence:

13 August, 2022 05:15 AM IST

Impact of the incident on the customer:

Impact: HIGH

Event consumption stopped for all pipelines and all customers.

Root cause:

We have noticed some slow queries and deadlocks on few rows and tables because of which APIs were failing and the cluster went in inoperable state.

Solution:

We have identified and fixed the query/code which caused deadlock.

Posted Aug 19, 2022 - 06:44 UTC

Resolved
This incident has been resolved.
Posted Aug 13, 2022 - 04:27 UTC
Monitoring
The cluster is operational and we are still working on figuring out the root cause.
Posted Aug 13, 2022 - 03:46 UTC
Investigating
We are currently investigating the issue.
Posted Aug 12, 2022 - 23:47 UTC
This incident affected: US Cluster (Sources, Destinations, Data Pipelines, Public API).