...
On 22nd November 2022, at approximately 4pm AEST (3am UTC), Vivi’s engineering team deployed a change to one of our backend services to address an issue related to SSL certificate management. This service is responsible for provisioning the SSL certificates that allow for secure communication with Vivi boxes. This was expected to have no customer-facing side effects.
...
Following resolution, we were able to determine that the outage could only have affected users of the Vivi web app (approximately 2% of our user base, or 4100 users), and IT administrators trying to configure/update Vivi boxes using Google Chrome or Microsoft Edge. We estimate that only 15% of Vivi web app users (600 users) were active during the outage window.
...
A root cause analysis identified two key issues that led to this outage:
Testing infrastructure and process. Coupled with differences between our staging and production environments, this led to Vivi’s engineering team being unaware of the issue. Our change management process did not account for these risks.
Escalation process. Vivi does not have a formalised out-of-hours (AEST) level 3 (Vivi engineering team) escalation process to cater for multi-region outages.
...
Over the coming weeks, we will be implementing a number of changes to our infrastructure and internal processes to prevent this kind of outage from occurring again.
Testing
...
process
Vivi’s engineering team will invest in the development of virtualised box testing infrastructure and synthetic transactions that will rapidly identify issues that could lead to Vivi box connection failures. We will make changes to our backend services to make them more resilient to failure.
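As an illustration only, a synthetic transaction of the kind described above could be as simple as a scheduled probe that performs a verified TLS handshake against an endpoint and checks the certificate it receives. The sketch below is not Vivi's implementation: the names `ProbeResult`, `check_expiry` and `probe`, the 14-day threshold and the choice of checks are all assumptions made for the example, and it uses only the Python standard library.

```python
import socket
import ssl
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ProbeResult:
    """Outcome of one synthetic check (hypothetical type for this sketch)."""
    ok: bool
    detail: str


def check_expiry(not_after: str, now: datetime, min_days: int = 14) -> ProbeResult:
    """Fail if the certificate has expired or expires within min_days.

    The 14-day default is an assumed threshold, not a Vivi policy.
    """
    # ssl.cert_time_to_seconds parses the "notAfter" format returned by getpeercert().
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    remaining = expires - now
    if remaining < timedelta(0):
        return ProbeResult(False, f"certificate expired on {expires:%Y-%m-%d}")
    if remaining < timedelta(days=min_days):
        return ProbeResult(False, f"certificate expires in {remaining.days} days")
    return ProbeResult(True, f"certificate valid until {expires:%Y-%m-%d}")


def probe(host: str, port: int = 443, timeout: float = 5.0) -> ProbeResult:
    """Perform a verified TLS handshake, then check the certificate's expiry."""
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except OSError as exc:
        # ssl.SSLError is a subclass of OSError, so handshake and
        # verification failures are reported here as well.
        return ProbeResult(False, f"TLS connection failed: {exc}")
    return check_expiry(cert["notAfter"], datetime.now(timezone.utc))
```

A scheduler could call `probe()` with the hostname of each endpoint under test every few minutes and raise an alert whenever `ok` is false. Keeping `check_expiry` separate from the network call means the expiry logic can be tested without a live endpoint.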
We will review our change management processes so that they account for risks that can lead to similar outages and allow us to rapidly identify the cause of an outage.
Escalation process
Vivi’s support and engineering teams will review existing documentation and processes, as they pertain to potential outages. We will ensure that Vivi’s support team has the capability to rapidly escalate issues that are identified as potential outages.
Crisis management
Vivi will implement a global crisis management process to enable formal communications with our customers if an outage occurs.