Summary

On 22nd November 2022, at approximately 4pm AEST (3am UTC), Vivi’s engineering team deployed a change to one of our backend services to address an issue related to SSL certificate management. This service is responsible for provisioning the SSL certificates that allow for secure communication with Vivi boxes. This was expected to have no customer-facing side effects.

...

Vivi’s engineering and support teams have identified several process and infrastructure improvements that will prevent similar outages from occurring in the future.

Affected Users

Following resolution, we determined that the outage could only have affected users of the Vivi web app (approximately 2% of our user base, or 4100 users) and IT administrators trying to configure or update Vivi boxes using Google Chrome or Microsoft Edge. We estimate that only 15% of Vivi web app users (600 users) were active during the outage window.

Timeline

All times below are in AEST.

...

12.00am: At approximately this time, Vivi boxes began to sync their configuration with Vivi’s backend servers, causing them to be updated with an expired SSL certificate. Around this time, we received the first report of a customer being unable to connect to rooms using the Vivi web app.
3.15pm: Vivi’s support team notices that several incidents are being raised reporting the same problem, and other customers report issues with device updates/configuration via Vivi Central on Google Chrome.
7.00am: Vivi’s engineering team becomes aware of this as a widespread issue and implements a fix. Vivi boxes are instructed to sync their configuration over the course of the next several hours.
10.00am: Configuration sync process is completed.

Root Cause Analysis

A root cause analysis identified two key issues that led to this outage:

  1. Testing process. Coupled with differences between our staging and production environments, this led to Vivi’s engineering team being unaware of the issue. Our change management process did not account for these risks.

  2. Escalation process. Vivi does not have a formalised out-of-hours (AEST) level 3 (Vivi engineering team) escalation process to cater for multi-region outages.

Corrective and Preventative Measures

Over the coming weeks, we will be implementing a number of changes to our infrastructure and internal processes to prevent this kind of outage from occurring again.

Testing process

  • Vivi’s engineering team will invest in the development of virtualised box testing infrastructure and synthetic transactions that will rapidly identify issues that could lead to Vivi box connection failures. We will make changes to our backend services to make them more resilient to failure.

  • We will review our change management processes, so that we account for risks that can lead to similar outages, and rapidly identify the cause of outages.
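As an illustration of the kind of synthetic transaction described above, the following is a minimal sketch of a check that flags an expired or soon-to-expire SSL certificate. It is not Vivi’s implementation: the function names `classify` and `probe`, the 14-day warning window, and the status strings are all assumptions made for this example, and the host to probe is left to the caller.

```python
import socket
import ssl
import time

SECONDS_PER_DAY = 86400


def classify(not_after, now=None, warn_days=14):
    """Classify a certificate by its 'notAfter' field.

    not_after uses the format returned by SSLSocket.getpeercert(),
    for example "Nov 22 06:00:00 2022 GMT". The warn_days threshold
    is a hypothetical default chosen for this sketch.
    """
    if now is None:
        now = time.time()
    remaining = ssl.cert_time_to_seconds(not_after) - now
    if remaining <= 0:
        return "expired"
    if remaining < warn_days * SECONDS_PER_DAY:
        return "expiring"
    return "ok"


def probe(host, port=443, timeout=5.0):
    """Connect to host over TLS and report the certificate status.

    With the default context, an expired or otherwise invalid
    certificate fails the handshake, which is reported separately.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError:
        return "failed-verification"
    return classify(cert["notAfter"])
```

A check like this could run on a schedule against staging and production endpoints, raising an alert on any status other than "ok" so that a certificate problem is seen before boxes sync it.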

Escalation process

  • Vivi’s support and engineering teams will review existing documentation and processes, as they pertain to potential outages. We will ensure that Vivi’s support team has the capability to rapidly escalate issues that are identified as potential outages.

Crisis management

  • Vivi will implement a global crisis management process to enable formal communications with our customers if an outage occurs.