Infrastructure issues with Pipelines and APIs
Incident Report for Hevo
Postmortem

On 17th September 2019 at 3:00 AM UTC, Hevo's Asia cluster experienced an outage. The outage was caused by the way Hevo handles event schemas. A pipeline created around 2:30 AM UTC had a schema with tens of thousands of fields, and the schema was continuously growing in size. Due to the sheer size of the schema, serializing and deserializing it consumed a noticeable amount of time and memory. As a result, other requests to the application started queuing up, and after a while some of the application nodes crashed due to lack of available memory.

Timeline of the Events

  • 3:00 AM UTC - Application started experiencing heavy memory usage.
  • 3:06 AM UTC - Application started slowing down.
  • 3:10 AM UTC - Two application nodes crashed after running out of memory.
  • 3:18 AM UTC - Engineers jumped on a call to investigate the issue.
  • 3:35 AM UTC - Engineers traced the issue to the large schema of an event type in one of the pipelines.
  • 3:40 AM UTC - The offending pipeline was paused.
  • 4:10 AM UTC - New application nodes were commissioned with a larger memory allocation (around twice the previous allotment) and the application was restarted.

Resolution

  • We are now marking events that have more fields than the mapped destination supports as failed. Users will be asked to drop unnecessary fields from such events through transformations.
  • We have allocated more memory to the application for the time being until we find a permanent resolution.
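As a rough illustration of the first resolution item, the field-count guard could be sketched as follows. This is a hypothetical sketch, not Hevo's actual implementation: the limit `MAX_DESTINATION_FIELDS`, the schema shape, and the function names are assumptions introduced here.

```python
# Hypothetical sketch of the field-count guard described above.
# MAX_DESTINATION_FIELDS and the schema representation are assumptions,
# not Hevo's actual code.

MAX_DESTINATION_FIELDS = 1000  # assumed per-destination field limit


def count_fields(schema: dict) -> int:
    """Count the leaf fields in a (possibly nested) event schema."""
    total = 0
    for value in schema.values():
        if isinstance(value, dict):
            total += count_fields(value)  # recurse into nested structures
        else:
            total += 1  # a leaf field
    return total


def should_fail_event(schema: dict,
                      destination_limit: int = MAX_DESTINATION_FIELDS) -> bool:
    """Return True when the event's schema exceeds the destination's limit."""
    return count_fields(schema) > destination_limit
```

An event whose schema exceeds the limit would then be routed to the failed-events queue instead of being serialized, where the user can drop unnecessary fields via a transformation before replaying it.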

Next Steps

We will be changing the way we handle the schema of an incoming event to reduce the memory footprint of event schemas.

We appreciate your patience and understanding. If you have any questions or suggestions, please reach out to support. We apologize for letting this happen and will work to ensure it does not happen again.

Posted Sep 18, 2019 - 16:16 UTC

Resolved
This incident has been resolved.
Posted Sep 17, 2019 - 07:28 UTC
Monitoring
Services have been stabilized. There will be a lag in pipelines for a couple of hours. We are monitoring the system.
Posted Sep 17, 2019 - 06:07 UTC
Identified
Hello, we are currently dealing with an infrastructure issue with Pipelines and APIs that is impacting the throughput of the pipelines as well as the availability of the UI. We have identified the problem and are working to restore all services as soon as possible.
Posted Sep 17, 2019 - 04:54 UTC
This incident affected: Asia Cluster (Data Pipelines).