Post Incident Report (2022-11-23)

Summary

On 22nd November 2022, at approximately 4pm AEST (6am UTC), Vivi’s engineering team deployed a change to one of our backend services to address an issue related to SSL certificate management. This service is responsible for provisioning the SSL certificates that allow for secure communication with Vivi boxes. The change was expected to have no customer-facing side effects.

Initial testing showed that the change had been deployed successfully, and that communication between the Vivi app and Vivi boxes was unaffected.

Approximately 10 hours later, it became apparent to Vivi’s support team that some customers were experiencing issues connecting to Vivi boxes using the Vivi web app. Some customers were also reporting issues updating boxes on Vivi Central (via the Device Configuration page). Both scenarios rely on SSL certificates for secure communication.

Resolution

On 23rd November 2022, at 7am AEST (9pm 22nd November UTC), Vivi’s engineering team was made aware of the issue and implemented a fix that was rolled out over the following few hours. We advised customers who needed an immediate resolution to reboot their Vivi boxes.

Vivi’s engineering and support teams have identified several process and infrastructure improvements that will prevent similar outages from occurring in the future.

Affected Users

Following resolution, we were able to determine that the outage could only have affected users of the Vivi web app (approximately 4,100 users) and IT administrators trying to configure or update Vivi boxes using Google Chrome or Microsoft Edge. We estimate that only 15% of Vivi web app users (approximately 600 users) were active during the outage window.

Timeline

All times below are in AEST.

22nd November 2022

4.00pm: Vivi’s engineering team deploys a fix for an ongoing certificate management issue, intended to prevent an anticipated future failure.
4.30pm: Initial testing in our staging environment shows that Vivi box communication is not affected by the change.
5.00pm: Further fixes deferred to the following day.

23rd November 2022

12.00am: At approximately this time, Vivi boxes begin to sync their configuration with Vivi’s backend servers, causing them to be updated with an expired SSL certificate. Around this time, we receive the first report of a customer being unable to connect to rooms using the Vivi web app.
3.15am: Vivi’s support team notices that several incidents are being raised reporting the same problem, and other customers report issues with device updates/configuration via Vivi Central on Google Chrome.
7.00am: Vivi’s engineering team becomes aware of this as a widespread issue, and implements a fix. Vivi boxes are instructed to sync their configuration over the course of the next several hours.
10.00am: Configuration sync process is completed.

Root Cause Analysis

A root cause analysis identified two key issues that led to this outage:

  1. Testing process. Gaps in our testing process, coupled with differences between our staging and production environments, meant that Vivi’s engineering team was unaware of the issue after the change was deployed. Our change management process did not account for these risks.

  2. Escalation process. Vivi does not have a formalised out-of-hours (AEST) level 3 (Vivi engineering team) escalation process to cater for multi-region outages.

Corrective and Preventative Measures

Over the coming weeks, we will be implementing a number of changes to our infrastructure and internal processes to prevent this kind of outage from occurring again.

Testing process

  • Vivi’s engineering team will invest in the development of virtualised box testing infrastructure and synthetic transactions that will rapidly identify issues that could lead to Vivi box connection failures. We will also make changes to our backend services to make them more resilient to failure.

  • We will review our change management processes so that they account for risks that can lead to similar outages, and so that we can rapidly identify the cause of any outage.
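
As an illustration only, one kind of synthetic transaction that could catch this class of failure is a periodic check of the SSL certificate an endpoint presents. The sketch below is written in Python and is not Vivi’s actual tooling: the function names, the 14-day warning threshold and any host passed to check_endpoint are hypothetical, and it assumes the certificate chains to a publicly trusted certificate authority.

```python
import socket
import ssl
import time

SECONDS_PER_DAY = 86400


def classify_certificate(not_after: str, now: float, warn_days: int = 14) -> str:
    """Classify a certificate by its notAfter timestamp.

    not_after uses the format returned by SSLSocket.getpeercert(),
    for example "Nov 22 00:00:00 2022 GMT". now is seconds since the epoch.
    The 14-day default threshold is an arbitrary, hypothetical choice.
    """
    remaining = ssl.cert_time_to_seconds(not_after) - now
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days * SECONDS_PER_DAY:
        return "expiring_soon"
    return "ok"


def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port over TLS and classify the certificate it presents.

    The default context verifies the certificate chain and hostname, so an
    expired or otherwise untrusted certificate fails the handshake; that
    failure is reported as "invalid" rather than raised.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError:
        return "invalid"
    return classify_certificate(cert["notAfter"], time.time())
```

A check like this would be run on a schedule against both staging and production endpoints, raising an alert on any result other than "ok", so that a certificate problem is visible before customers report it.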

Escalation process

  • Vivi’s support and engineering teams will review existing documentation and processes, as they pertain to potential outages. We will ensure that Vivi’s support team has the capability to rapidly escalate issues that are identified as potential outages.

Crisis management

  • Vivi will implement a global crisis management process to enable formal communications with our customers if an outage occurs.