Summary


Vivi undertook an upgrade of its AWS infrastructure on the evening of Monday 7th May at 1800 AEST. The upgrade had two goals: to increase capacity to keep up with demand for Vivi's services, and to upgrade the software frameworks that the Vivi admin portal is built on so that they remained under support, as the version in use was due to be deprecated.  While internal testing showed the upgrade to be successful, we encountered an incident at 0840 AEST when customer load started to hit the systems. This was partially resolved at 1007 and completely resolved by 1209. Following the resolution there was a further short intermittent outage at 1404 AEST when a database server had to be restarted.  The incident caused intermittent access to the admin portal and prevented customers with particular LDAP configurations from authenticating. 


In summary, there were three underlying root causes: 


A database became overloaded, causing intermittent issues with the admin portal. This would have affected all customers, causing issues with login attempts to both the admin portal and the client, and therefore affecting screen sharing for users who were not already logged in. 


An incompatibility between our LDAP authentication library and the new version of the software framework we had upgraded to, which only occurred within a specific code path.  This would have affected up to 54 customers using LDAPS, causing client logins to fail and therefore affecting screen sharing for users who were not already logged in.


A database server required an unplanned restart.  This would have affected all customers, causing issues with login attempts to both the admin portal and the client, and therefore affecting screen sharing for users who were not already logged in. 


Timeline

 


Root Cause

 

The abnormal load on the database server was caused by an increased number of connections and a doubling up of scheduled jobs. Ironically, this was caused by the old servers that had been retired but left on hot standby, in case we encountered any issues with the new servers and needed to fail back quickly.
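
Vivi's scheduler is not detailed in this report; as a minimal sketch of the kind of safeguard that prevents a standby fleet from doubling up scheduled work, the example below uses a PostgreSQL advisory lock so that only one server runs a given job at a time. The connection string, lock key, and job name are illustrative placeholders, not Vivi's actual implementation.

```python
import psycopg2

# Hypothetical lock key for a scheduled job -- any stable 64-bit integer
# agreed on by every server that might attempt to run the job.
CLEANUP_JOB_LOCK_KEY = 42

def run_job_once(dsn, lock_key, job):
    """Run `job` only if no other server (active or hot standby) holds the lock."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # pg_try_advisory_lock returns immediately: True if this session
            # acquired the lock, False if another session already holds it.
            cur.execute("SELECT pg_try_advisory_lock(%s)", (lock_key,))
            if not cur.fetchone()[0]:
                return False  # another server is already running this job
            try:
                job()
            finally:
                cur.execute("SELECT pg_advisory_unlock(%s)", (lock_key,))
        return True
    finally:
        conn.close()
```

With a guard like this, even if retired servers are left on hot standby, only one session executes each scheduled job rather than both fleets running it in parallel.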


A specific LDAP configuration caused a code path to be executed that was not covered by our test plans, and the bug was therefore not identified in testing.
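
The affected configuration is not specified above; as a hedged sketch of the sort of test case that would exercise the LDAPS code path, the example below uses the Python ldap3 library to open an LDAP-over-SSL connection and perform a simple bind, which is roughly what a client login against a customer directory does. The host, bind DN, and credentials are placeholders.

```python
import ssl
from ldap3 import Server, Connection, Tls

def ldaps_bind_succeeds(host, bind_dn, password):
    """Attempt a simple bind over LDAPS (port 636) and report success."""
    tls = Tls(validate=ssl.CERT_REQUIRED, version=ssl.PROTOCOL_TLSv1_2)
    server = Server(host, port=636, use_ssl=True, tls=tls)
    try:
        # auto_bind=True raises an LDAP exception if the connection or bind
        # fails -- the failure mode customers using LDAPS would have seen.
        conn = Connection(server, user=bind_dn, password=password, auto_bind=True)
        conn.unbind()
        return True
    except Exception:
        return False

# Example usage with placeholder values:
# ldaps_bind_succeeds("ldap.example.com", "cn=svc,dc=example,dc=com", "secret")
```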


A database server required an unplanned restart due to the abnormal load it was placed under during the initial incident.


Resolution and Recovery

 

 

Corrective and Preventative Measures