Post Incident Report (2023-08-24)

Summary

On 24th August 2023, at approximately 9am AEST (11pm UTC on 23rd August), Vivi’s engineering team were notified that a number of Vivi devices were showing an incorrect online/offline status. These issues were determined to have stemmed from attempted upgrades to our Heartbeat service. This service is responsible for handling the ‘heartbeat’ requests from Vivi devices, which in turn are used to determine whether a device is marked as online or offline.

A number of changes had been deployed to the Heartbeat service in an attempt to improve its overall reliability and scalability. It appears that at least one of these deployments caused the service to enter a state where it could no longer process requests fast enough. Heartbeat requests were delayed as a result, and affected devices showed an incorrect online status.

This issue resulted in a number of customers seeing their devices reported as offline in Vivi Central, as well as greyed out and shown as offline in the Vivi Client.
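
To illustrate how delayed heartbeat processing can make a connected device appear offline, the following is a minimal sketch and not Vivi's actual implementation. It assumes a device is treated as online when its most recently processed heartbeat falls within a timeout window. The 60-second timeout, the function name and the example timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical timeout: the real service's threshold is not stated in this report.
HEARTBEAT_TIMEOUT = timedelta(seconds=60)


def device_status(last_processed_heartbeat, now):
    """Return 'online' if the last processed heartbeat is recent enough, else 'offline'."""
    if last_processed_heartbeat is None:
        return "offline"
    age = now - last_processed_heartbeat
    return "online" if age <= HEARTBEAT_TIMEOUT else "offline"


# 23:00 UTC on 23 August corresponds to 9:00am AEST on 24 August.
now = datetime(2023, 8, 23, 23, 0, tzinfo=timezone.utc)

# Healthy service: a heartbeat sent 20 seconds ago has already been processed.
healthy = device_status(now - timedelta(seconds=20), now)

# Backlogged service: the device is still sending heartbeats, but the newest one
# the service has processed is 5 minutes old, so the device appears offline.
backlogged = device_status(now - timedelta(minutes=5), now)

print(healthy, backlogged)
```

Under these assumptions, the device itself is unchanged in both cases; only the processing delay differs, which matches the observation that connectivity was unaffected while the displayed status was wrong.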

Resolution

The service was monitored for an hour following the initial report, after which it was deemed to have recovered and returned to a working state.

However, at approximately 4pm AEST (6am UTC), the engineering team determined that it would be safer to redeploy an older version of the Heartbeat service that had proven reliable, to ensure that this issue would not occur again.

Affected Users

Following resolution, and based on the tickets raised by a number of schools, we estimate that this issue affected approximately 4500 devices. The issue only affected the ability to see a box as online or offline; it did not affect actual connectivity to, or usage of, each device and/or room.

As this occurred at the start of the school day, it is likely that a large number of teachers and/or admins would have been attempting to use the devices, and would have experienced issues with seeing their rooms as online.

Timeline

All times below are in AEST.

24th August 2023

9:00am: Initial reports were raised with Vivi Support by schools that were incorrectly seeing their devices/rooms marked as offline.

9:30am: Continued monitoring of the Vivi Heartbeat service showed that the operating status of the service had degraded; however, it was determined to be slowly recovering.

10:00am: The Heartbeat service had returned to a reliable working state, and monitoring of engineering devices showed that the issue with incorrect online statuses had been resolved.

4:00pm: Continued monitoring throughout the day showed that the Heartbeat service had continued working as expected. However, to completely eliminate the chance of this happening again, it was determined that we would redeploy an older version of the service with proven reliability.

Root Cause Analysis

A root cause analysis identified two key issues that led to this outage:

  1. Scale Testing: It was determined that the updated service had not been tested for long enough with a large enough number of requests to reliably prove that it would have no issues handling production loads.

  2. Inadequate Monitoring: It was also determined that with more reliable monitoring of this service, the issue could have been identified much earlier and an older version of the service could have been deployed much more quickly, meaning customers would likely not have noticed an outage.

Corrective and Preventative Measures

This outage has led the engineering team to identify two key points of correction and prevention to ensure that an outage like this does not occur again:

Prioritised Scale Testing

  • It is clear that upgrades to this service need to be thoroughly tested at production loads (and beyond) for a longer period of time, so that any issues are caught before the upgrades are deployed to production.

  • Vivi’s engineering team will perform rigorous scale testing of the new service in our staging environment for an extended period of time, to ensure that customers will not experience a similar outage again.
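
As one possible shape for such a scale test, the sketch below sends many concurrent simulated heartbeats and reports the 95th-percentile latency. It is an illustration under stated assumptions rather than Vivi's test harness: send_heartbeat is a hypothetical stand-in that sleeps instead of calling a real staging endpoint, and the device count and concurrency are arbitrary small values chosen so the example runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def send_heartbeat(device_id):
    """Hypothetical stand-in for one heartbeat request; returns latency in seconds.

    A real test would replace the sleep with a request to the staging service.
    """
    start = time.perf_counter()
    time.sleep(0.001)  # placeholder for the network call
    return time.perf_counter() - start


def run_load_test(num_devices, concurrency):
    """Send one heartbeat per simulated device; return (p95 latency, request count)."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(send_heartbeat, range(num_devices)))
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = quantiles(latencies, n=100)[94]
    return p95, len(latencies)


p95_latency, total = run_load_test(num_devices=200, concurrency=20)
print(f"{total} heartbeats sent, p95 latency {p95_latency * 1000:.1f} ms")
```

A longer-running version of this loop, repeated at and above production request volumes, would show whether latency stays flat over time or drifts upward as it did in this incident.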

Effective Monitoring

  • This issue has made it clear that more effective monitoring of our Heartbeat service needs to be put in place so that similar problems can be identified and resolved quickly.

  • The Vivi engineering team will work to introduce better monitoring of the Heartbeat service so that any issues, and their causes, are identified quickly and resolved promptly.
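
One simple check of this kind is to alert when the share of devices with stale heartbeats rises sharply, since a sudden jump is more likely to indicate a service problem than many devices going offline at once. The sketch below is a hypothetical example only: the function name, the 60-second timeout and the 5% limit are assumptions, not values from Vivi's monitoring.

```python
def should_alert(heartbeat_ages_seconds, timeout_seconds=60.0, max_stale_fraction=0.05):
    """Return True when the share of devices with stale heartbeats exceeds the limit."""
    if not heartbeat_ages_seconds:
        return False
    stale = sum(1 for age in heartbeat_ages_seconds if age > timeout_seconds)
    return stale / len(heartbeat_ages_seconds) > max_stale_fraction


# Normal operation: 2 of 100 devices have a stale heartbeat (2%, below the limit).
normal_ages = [10.0] * 98 + [300.0] * 2

# Degraded service: 40 of 100 devices have a stale heartbeat (40%, above the limit).
degraded_ages = [10.0] * 60 + [300.0] * 40

print(should_alert(normal_ages), should_alert(degraded_ages))  # prints: False True
```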