Post Incident Report (2024-10-30)

Summary

Shortly after midnight on 30th October (Melbourne time) / 9 AM on 29th October (EDT), CPU usage on our primary database began to rise. Over the following hour it continued to climb until the database experienced persistent high latency, which directly impacted the availability of Vivi Central until approximately 7 AM (Melbourne time) / 4 PM (EDT).

During this period, some functions of the Vivi app were also affected. Specifically, users attempting to log in may have encountered issues, though those already logged in were generally able to connect to rooms and share screens using Offline Mode.

In responding to this incident, we reduced traffic on our heartbeat service, which may have caused some rooms to appear offline in the Vivi app. After the initial outage was resolved, heartbeat services remained degraded until approximately 11 AM (Melbourne time) / 8 PM (EDT).

Resolution

The initial spike in database CPU usage was triggered by a runaway backend service, which delayed resolution. This service handles Google authentication and is essential for user access to documents stored on Google Drive and other Google services.

In analysing the outage, we noted reports from other sources indicating that Google services were experiencing global disruptions.

Ultimately, restarting the affected backend services restored functionality at approximately 7 AM (Melbourne time) / 4 PM (EDT).

Affected Users

Following resolution and based on raised tickets, we confirmed that this outage primarily affected Vivi Central users. Between midnight and 7 AM (Melbourne time), Vivi Central was largely unresponsive.

Other signage functionality was impacted during this time; however, Vivi devices that had already loaded signage media remained unaffected.

As noted, some Vivi app users also experienced issues, particularly those attempting to log in on a device for the first time. Based on our data and the timing of the outage, we estimate that approximately 11% of our North American customers may have been affected during this period.

Timeline

  • Midnight, 30th October 2024 (Melbourne Time) / 9 AM, 29th October 2024 (EDT) - CPU usage on the primary database begins to rise.

  • 1:16 AM (AEDT) / 10:16 AM (EDT, 29th October) - First customer ticket raised.

  • 1:20 - 1:47 AM (AEDT) / 10:20 - 10:47 AM (EDT, 29th October) - Support engineers begin investigating and conduct standard troubleshooting.

  • 1:48 AM (AEDT) / 10:48 AM (EDT, 29th October) - Issue escalated to on-call engineers.

  • 1:57 AM (AEDT) / 10:57 AM (EDT, 29th October) - First on-call engineer goes online.

  • 2:51 AM (AEDT) / 11:51 AM (EDT, 29th October) - Initial troubleshooting deemed ineffective; heartbeat traffic reduced to ease database load. Further actions deferred until 6 AM for additional engineer support.

  • 6:00 AM (AEDT) / 3:00 PM (EDT, 29th October) - Two additional engineers come online.

  • 6:47 AM (AEDT) / 3:47 PM (EDT, 29th October) - Third additional engineer joins.

  • 6:50 - 6:57 AM (AEDT) / 3:50 - 3:57 PM (EDT, 29th October) - Decision to sequentially restart backend services and further reduce non-critical service traffic.

  • 7:34 AM (AEDT) / 4:34 PM (EDT, 29th October) - Initial issue marked as resolved. Heartbeat degradation anticipated due to redirected traffic.

  • 9:52 - 10:50 AM (AEDT) / 6:52 - 7:50 PM (EDT, 29th October) - Heartbeat traffic restored to 100%.

  • 11:00 AM (AEDT) / 8:00 PM (EDT, 29th October) - Resolution of heartbeat issues communicated to customers; major incident resolved.
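
The paired times in the timeline above are 15 hours apart. The short sketch below shows one way to check a conversion, assuming fixed UTC offsets of +11 hours for Melbourne and -4 hours for US Eastern, which is what daylight saving rules give for these dates. The helper name `to_us_eastern` is hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for the dates of this incident (daylight saving in effect).
MELBOURNE = timezone(timedelta(hours=11))
US_EASTERN = timezone(timedelta(hours=-4))


def to_us_eastern(year, month, day, hour, minute):
    """Convert a Melbourne wall-clock time to US Eastern wall-clock time."""
    local = datetime(year, month, day, hour, minute, tzinfo=MELBOURNE)
    return local.astimezone(US_EASTERN)


# First customer ticket: 1:16 AM on 30th October in Melbourne.
first_ticket = to_us_eastern(2024, 10, 30, 1, 16)
print(first_ticket.strftime("%Y-%m-%d %H:%M"))  # 2024-10-29 10:16
```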

Root Cause Analysis

The outage stemmed from a combination of increased resource demands and integration issues.

Google Integration

A partial Google outage triggered an unusually high rate of backend requests related to our Google integration, which then impacted processing of other database queries, ultimately leading to resource exhaustion on our primary database.
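
One common way to keep a failing third-party dependency from generating a flood of backend work is a circuit breaker, which stops calling the dependency after repeated failures and retries only after a cool-down. The sketch below is illustrative only and does not describe Vivi's actual backend. The class name `CircuitBreaker`, the threshold and the cool-down values are all hypothetical.

```python
import time


class CircuitBreaker:
    """Skips calls to a failing dependency until a cool-down has elapsed."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_after_s = reset_after_s          # hypothetical default
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the cool-down has passed, let a trial call through.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self.opened_at = self.clock()

    def call(self, fn, *args, **kwargs):
        if not self.allow():
            raise RuntimeError("circuit open: dependency call skipped")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result
```

With a wrapper like this around calls to an external provider, requests fail fast while the provider is unhealthy instead of piling up retries that compete with other database queries.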

Database Load

While our primary database is provisioned for variable load throughout the school day, the Google integration issue led to additional traffic that it could not accommodate.

Corrective and Preventative Measures

Following this incident, our engineering team identified several key corrective and preventative actions.

  • Improved Monitoring for Google Integration
    Our analysis revealed inadequate monitoring of our Google integration. As a result, our systems could not shed excess load when this third-party service was degraded, which delayed our response. We are implementing more robust monitoring for all third-party integrations.

  • Vivi Central Monitoring Enhancements
    Monitoring of Vivi Central’s uptime did not reflect the customer experience in a timely or accurate way. This has now been corrected.

  • Service Isolation Improvements
    At the time of the incident, we were in the late stages of implementing enhanced service isolation and load balancing for backend services. Once complete, this will ensure a runaway backend service cannot disrupt unrelated services. This work is set for completion within the week.

  • Internal Documentation and Training
    Two areas for improvement were identified in internal documentation to better equip on-call engineers with the information and tools necessary for efficient incident response. Updated documentation has been distributed within Vivi’s engineering team, with additional training scheduled.

  • Status Pages
    During this outage, we became aware that one of our service status pages incorrectly displayed all services as online. Steps are being taken to ensure our service status pages reflect accurate, real-time information when services are offline or degraded.
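
The service isolation measure listed above is often implemented with a "bulkhead": each backend service gets its own cap on concurrent work, so a runaway service is refused once it hits its cap instead of exhausting shared resources. The following is a minimal sketch of that general pattern, not a description of Vivi's implementation. The class name `Bulkhead` and the limit used in the example are hypothetical.

```python
import threading
from contextlib import contextmanager


class Bulkhead:
    """Caps concurrent work for one service and sheds anything beyond the cap."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Do not queue: refuse immediately so callers fail fast under overload.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: request shed")
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical usage: at most 20 concurrent database-bound requests
# for one integration, independent of every other service's limit.
integration_bulkhead = Bulkhead(max_concurrent=20)

with integration_bulkhead.slot():
    pass  # the service's database work would run here
```

Refusing excess requests immediately, rather than queueing them, is a deliberate choice in this sketch: a queue would only postpone the load on the shared database.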