Summary

Shortly after midnight on 30th October (Melbourne time) / 9 AM on 29th October (EST), CPU usage on our primary database began to rise. Over the following hour it continued to climb until the database experienced persistently high latency, directly impacting the availability of Vivi Central until approximately 7 AM (Melbourne time) / 4 PM (EST).

...

In responding to this incident, we reduced traffic on our heartbeat service, which may have caused some rooms to appear offline in the Vivi app. After the initial outage was resolved, heartbeat services remained degraded until approximately 11:30 AM (Melbourne time) / 8 PM (EST).

Resolution

Resolution was delayed because the initial spike in database CPU usage was traced to a runaway backend service. This service, responsible for handling Google authentication, is essential for user access to documents stored on Google Drive and other Google services.

...

Ultimately, restarting the affected backend services restored functionality at approximately 7 AM (Melbourne time) / 4 PM (EST).

Affected Users

Following resolution, and based on the tickets raised, we confirmed that this outage primarily affected Vivi Central users. Between midnight and 7 AM (Melbourne time), Vivi Central was largely unresponsive.

...

As noted, some Vivi app users also experienced issues, particularly those attempting to log in on a device for the first time. Based on our data and the timing of the outage, we estimate that approximately 11% of our North American customers may have been affected during this period.

Timeline

  • Midnight, 30th October 2024 (Melbourne Time) / 9 AM, 29th October 2024 (EST) - CPU usage on the primary database begins to rise.

  • 1:16 AM (AEST) / 10:16 AM (EST, 29th October) - First customer ticket raised.

  • 1:20 - 1:47 AM (AEST) / 10:20 - 10:47 AM (EST, 29th October) - Support engineers begin investigating and conduct standard troubleshooting.

  • 1:48 AM (AEST) / 10:48 AM (EST, 29th October) - Issue escalated to on-call engineers.

  • 1:57 AM (AEST) / 10:57 AM (EST, 29th October) - First on-call engineer goes online.

  • 2:51 AM (AEST) / 11:51 AM (EST, 29th October) - Initial troubleshooting deemed ineffective; heartbeat traffic reduced to ease database load. Further actions deferred until 6 AM, when additional engineers would be available.

  • 6:00 AM (AEST) / 3:00 PM (EST, 29th October) - Two additional engineers come online.

  • 6:47 AM (AEST) / 3:47 PM (EST, 29th October) - Third additional engineer joins.

  • 6:50 - 6:57 AM (AEST) / 3:50 - 3:57 PM (EST, 29th October) - Decision to sequentially restart backend services and further reduce non-critical service traffic.

  • 7:34 AM (AEST) / 4:34 PM (EST, 29th October) - Initial issue marked as resolved. Heartbeat degradation anticipated due to the earlier reduction in heartbeat traffic.

  • 9:52 - 10:50 AM (AEST) / 6:52 - 7:50 PM (EST, 29th October) - Heartbeat traffic restored to 100%.

  • 11:00 AM (AEST) / 8:00 PM (EST, 29th October) - Resolution of heartbeat issues communicated to customers; major incident resolved.

Root Cause Analysis

The outage stemmed from a combination of increased resource demands and integration issues.

Google Integration

A partial Google outage triggered an unusually high rate of backend requests from our Google integration. These requests slowed the processing of other database queries and ultimately exhausted resources on our primary database.
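To illustrate the failure mode, the sketch below (Python, written for this review and not taken from Vivi's codebase) shows how bounding and backing off retries against a partially failing upstream can stop a request storm before it reaches the database. The fetch_google_token function and its parameters are hypothetical placeholders for the Google authentication call.

    import random
    import time
    from typing import Optional

    class GoogleAPIUnavailable(Exception):
        """Stands in for the transient errors seen during a partial Google outage."""

    def fetch_google_token(user_id: str) -> str:
        """Hypothetical placeholder for the Google authentication request."""
        raise GoogleAPIUnavailable("upstream 5xx")

    def fetch_token_with_backoff(user_id: str, max_attempts: int = 5) -> Optional[str]:
        """Retry with capped exponential backoff and jitter.

        Bounding the number of attempts and spacing them out keeps a partial
        upstream outage from multiplying into a flood of backend requests,
        each of which would otherwise add load to the primary database.
        """
        for attempt in range(max_attempts):
            try:
                return fetch_google_token(user_id)
            except GoogleAPIUnavailable:
                if attempt == max_attempts - 1:
                    return None  # give up and surface a degraded-auth error to the caller
                delay = min(30.0, 2 ** attempt) + random.uniform(0, 1)
                time.sleep(delay)

Without a cap of this kind, every failed Google call can be retried immediately and indefinitely, which is consistent with the runaway request pattern described above.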

Database Load

While our primary database is provisioned for variable load throughout the school day, the Google integration issue led to additional traffic that it could not accommodate.
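As a complementary safeguard on the database side, the sketch below (again Python and illustrative only; the limit of 20 concurrent queries is an assumption, not a value from our configuration) shows one way to bound the load a single integration can place on the primary database, so that a surge from one caller sheds load early instead of saturating the database for everyone.

    import threading
    from contextlib import contextmanager

    # Bound the number of in-flight queries this service may hold against the
    # primary database. The value 20 is illustrative only.
    _db_slots = threading.BoundedSemaphore(value=20)

    class DatabaseOverloaded(Exception):
        """Raised when the local query budget is exhausted; callers should back off."""

    @contextmanager
    def db_query_slot(timeout: float = 0.5):
        """Acquire a query slot, or fail fast rather than queueing against a saturated database."""
        if not _db_slots.acquire(timeout=timeout):
            raise DatabaseOverloaded("query budget exhausted, try again later")
        try:
            yield
        finally:
            _db_slots.release()

    # Usage sketch:
    # with db_query_slot():
    #     run_google_integration_query()

Failing fast in this way keeps one misbehaving integration from consuming the headroom the database needs for normal school-day load.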

Corrective and Preventative Measures

Following this incident, our engineering team identified several key corrective and preventative actions.

...