Summary
Vivi undertook an upgrade of its AWS infrastructure on the evening of Monday 7th May at 1800 AEST. The upgrade had two goals: to increase capacity to keep up with demand for Vivi’s services, and to upgrade the software frameworks that the Vivi admin portal is built on, since the version in use was due to be deprecated and would otherwise have fallen out of support. While internal testing showed the upgrade to be successful, we encountered an incident at 0840 AEST when customer load started to hit the systems. The incident was partially resolved at 1007 AEST and completely resolved by 1209 AEST; following the resolution there was a further short intermittent outage at 1404 AEST when a database server had to be restarted. The incident caused intermittent access to the admin portal, and customers with particular LDAP configurations were unable to authenticate.
In summary, there were three underlying root causes:
A database became overloaded, causing intermittent issues with the admin portal. This would have affected all customers, causing issues with login attempts to both the admin portal and the client, and therefore affecting screen sharing for users that weren’t already logged in.
An incompatibility between our LDAP authentication library and the new version of the software framework we had upgraded to, which only occurred within a specific code path. This would have affected up to 54 customers using LDAPS, causing logins to the client to fail and therefore affecting screen sharing for users that weren’t already logged in.
A database server required an unplanned restart. This would have affected all customers, causing issues with login attempts to both the admin portal and the client, and therefore affecting screen sharing for users that weren’t already logged in.
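For context on the second root cause, the specific code path involved LDAPS connections made through the net-ldap library. The sketch below shows what such a connection typically looks like; the host, credentials and TLS options are hypothetical and this is illustrative only, not the actual configuration change that was deployed:

```ruby
require 'net/ldap' # gem 'net-ldap'

# Hypothetical LDAPS connection; all names and credentials are placeholders.
ldap = Net::LDAP.new(
  host: 'ldap.example.com',   # hypothetical directory host
  port: 636,                  # standard LDAPS port
  encryption: {
    method: :simple_tls,      # LDAPS: TLS from the start of the connection
    tls_options: OpenSSL::SSL::SSLContext::DEFAULT_PARAMS
  },
  auth: {
    method: :simple,
    username: 'cn=service,dc=example,dc=com',
    password: 'secret'
  }
)

# A framework upgrade can change TLS or socket behaviour underneath a
# library like net-ldap, which is the kind of narrow code path where an
# incompatibility such as this one can surface only for LDAPS customers.
puts ldap.bind ? 'authenticated' : ldap.get_operation_result.message
```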
Timeline
0927 AEST First support call was received with regards to LDAP authentication
0928 AEST Our monitoring system alerted us that 500 errors were being returned from our load balancer
0937 AEST After initial troubleshooting, the support team handed the incident over to the engineering team
0940 AEST Engineering team ascertained that load on the database had been rising gradually since 0840 and was now at maximum
0951 AEST Engineering team identified the cause of the abnormal load and took measures to stop it
1000 AEST Engineering team restarted the database server to clear the queue
1005 AEST Database load returned to normal
1007 AEST 500 errors from the load balancer ceased and the issue with the admin portal and client logins was resolved
1010 AEST Customers still reporting issues with LDAP authentication
1023 AEST Engineering team identified an incompatibility between our LDAP authentication and the newly deployed version of the software framework that only occurred within a specific code path related to certain configurations
1151 AEST Engineering team identified the cause of the incompatibility
1200 AEST Engineering team implemented and started testing fix
1209 AEST Engineering team deployed fix and LDAP issue was resolved
1320 AEST During the post-incident analysis the engineering team discovered that the abnormal database load had had an adverse effect on the database server and that performance was degrading
1340 AEST The engineering team concluded that the degradation of the performance of the database server would continue until it was restarted
1404 AEST The engineering team made the decision to restart the database to restore the performance as it was starting to cause intermittent issues
1412 AEST The database restarted and performance was returned to normal
Root Cause
The abnormal load on the database server was caused by an increased number of connections and a doubling up of scheduled jobs, ironically caused by the old servers that had been retired: they had been left on hot standby, in case we encountered any issues with the new servers, to enable us to fail back quickly.
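The doubled-up scheduled jobs illustrate a general hazard when an old and a new environment run side by side: both hosts fire the same jobs. One common mitigation is a mutual-exclusion lock around each job so that only one runner proceeds. The sketch below uses a file lock for brevity (which only protects a single host; across hosts, as in this incident, a shared lock such as a database advisory lock would be used instead); it is an illustration of the technique, not necessarily what was deployed here:

```ruby
require 'tmpdir'

# Run a named job only if no other runner currently holds its lock.
# Returns :ran if the job executed, :skipped if another runner held the lock.
# NOTE: a file lock is per-host; for cross-host exclusion use a shared lock
# (e.g. a database advisory lock). Names here are illustrative.
def with_job_lock(name)
  lock = File.open(File.join(Dir.tmpdir, "#{name}.lock"),
                   File::RDWR | File::CREAT, 0o644)
  if lock.flock(File::LOCK_EX | File::LOCK_NB)
    begin
      yield
      :ran
    ensure
      lock.flock(File::LOCK_UN)
    end
  else
    :skipped # another scheduler instance already holds the lock
  end
ensure
  lock&.close
end
```

A second runner attempting the same job while the first holds the lock gets `:skipped` instead of duplicating the work.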
...
A database server required an unplanned restart due to the abnormal load that it was placed under during the initial incident.
Resolution and Recovery
0954 AEST The old application servers were isolated from the network
1000 AEST The database server was restarted
1007 AEST Issues with the admin portal confirmed as resolved
1151 AEST Engineering team isolated issue to the net-ldap library and configuration affecting LDAPS
1200 AEST Engineering team identified configuration changes required to resolve issue
1209 AEST Engineering team deployed fix and issues with LDAP authentication were resolved
1340 AEST The engineering team concluded that the database server needed to be restarted and configuration modified
1404 AEST The database server was restarted
1412 AEST The intermittent issues were resolved
...
Corrective and Preventative Measures
...
Increased monitoring added to the database server to ensure earlier notification of potential issues
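As an illustration of the kind of earlier-warning monitoring this measure describes, a CloudWatch alarm can alert when database CPU stays elevated well before it reaches maximum, rather than only once errors appear at the load balancer. The sketch below is hypothetical in every specific (alarm name, instance identifier, region, thresholds and SNS topic); it shows the shape of such an alarm, not Vivi's actual configuration:

```ruby
require 'aws-sdk-cloudwatch' # gem 'aws-sdk-cloudwatch'

# Hypothetical early-warning alarm: fire when average RDS CPU exceeds 70%
# for three consecutive 5-minute periods. All identifiers are placeholders.
cloudwatch = Aws::CloudWatch::Client.new(region: 'ap-southeast-2')
cloudwatch.put_metric_alarm(
  alarm_name: 'vivi-db-cpu-high',                  # hypothetical name
  namespace: 'AWS/RDS',
  metric_name: 'CPUUtilization',
  dimensions: [{ name: 'DBInstanceIdentifier', value: 'vivi-db' }], # placeholder
  statistic: 'Average',
  period: 300,
  evaluation_periods: 3,
  threshold: 70.0,
  comparison_operator: 'GreaterThanThreshold',
  alarm_actions: ['arn:aws:sns:ap-southeast-2:123456789012:ops-alerts'] # placeholder
)
```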
...