Post-Incident Report (2018-05-08)

Summary

Vivi upgraded its AWS infrastructure on the evening of Monday 7th May, starting at 1800 AEST. The upgrade had two goals: to increase capacity to keep up with demand for Vivi’s services, and to upgrade the software frameworks that the Vivi admin portal is built on so that they remained under support, as the version in use was due to be deprecated. Internal testing showed the upgrade to be successful, but we encountered an incident at 0840 AEST the following morning when customer load started to hit the systems. The incident was partially resolved at 1007 and completely resolved by 1209. Following the resolution there was a further short intermittent outage at 1404 AEST, when a database server had to be restarted. The incident caused intermittent access to the admin portal and prevented customers with particular LDAP configurations from authenticating.

In summary, there were three underlying root causes:

  • A database became overloaded, causing intermittent issues with the admin portal. This would have affected all customers, causing issues with login attempts to both the admin portal and the client, and therefore affecting screen sharing for users who weren’t already logged in.

  • An incompatibility between our LDAP authentication library and the new version of the software framework we had upgraded to, which only occurred within a specific code path. This would have affected up to 54 customers using LDAPS, causing logins to the client to fail, and therefore affecting screen sharing for users who weren’t already logged in.

  • A database server required an unplanned restart. This would have affected all customers, causing issues with login attempts to both the admin portal and the client, and therefore affecting screen sharing for users who weren’t already logged in.

Timeline

  • 0927 AEST First support call was received regarding LDAP authentication

  • 0928 AEST Our monitoring system alerted us that 500 errors were being returned from our load balancer

  • 0937 AEST After initial troubleshooting, the support team handed the incident over to the engineering team

  • 0940 AEST Engineering team ascertained that load had been rising gradually on the database since 0840 and it was now at maximum

  • 0951 AEST Engineering team identified the cause of the abnormal load and took measures to stop it

  • 1000 AEST Engineering team restarted the database server to clear the queue

  • 1005 AEST Database load returned to normal

  • 1007 AEST 500 errors from the load balancer ceased and the issue with the admin portal and client logins was resolved

  • 1010 AEST Customers still reporting issues with LDAP authentication

  • 1023 AEST Engineering team identified an incompatibility between our LDAP authentication and the newly deployed version of the software framework, which only occurred within a specific code path related to certain configurations

  • 1151 AEST Engineering team identified the cause of the incompatibility

  • 1200 AEST Engineering team implemented a fix and started testing it

  • 1209 AEST Engineering team deployed the fix and the LDAP issue was resolved

  • 1320 AEST During the post-incident analysis, the engineering team discovered that the abnormal database load had adversely affected the database server and that its performance was degrading

  • 1340 AEST The engineering team concluded that the performance of the database server would continue to degrade until it was restarted

  • 1404 AEST The engineering team decided to restart the database to restore performance, as the degradation was starting to cause intermittent issues

  • 1412 AEST The database restarted and performance returned to normal

Root Cause

The abnormal load on the database server was caused by an increased number of connections and a doubling up of scheduled jobs. Ironically, both came from the old servers that had been retired: they had been left on hot standby so that we could fail back quickly if we encountered any issues with the new servers.
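To illustrate how standby servers can double up scheduled jobs, the following is a minimal sketch only. It assumes, hypothetically, that each application server checks a shared list of active hosts before running scheduled jobs; the host names, `ACTIVE_HOSTS`, and both functions are illustrative and are not taken from Vivi's systems.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical active pool: only these hosts should run scheduled jobs.
ACTIVE_HOSTS: FrozenSet[str] = frozenset({"app-new-1", "app-new-2"})


def should_run_scheduled_jobs(
    hostname: str, active_hosts: FrozenSet[str] = ACTIVE_HOSTS
) -> bool:
    """Return True only for hosts in the active pool."""
    return hostname in active_hosts


def job_runners(
    all_hosts: Iterable[str], active_hosts: FrozenSet[str] = ACTIVE_HOSTS
) -> List[str]:
    """List the hosts that would execute each scheduled job."""
    return [h for h in all_hosts if should_run_scheduled_jobs(h, active_hosts)]


if __name__ == "__main__":
    fleet = ["app-old-1", "app-old-2", "app-new-1", "app-new-2"]
    # Without the guard, all four hosts would run every job.
    print(job_runners(fleet))
```

With a guard of this kind, servers kept on hot standby remain available for failback without also opening job connections to the database.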

A specific LDAP configuration caused a code path to be executed that was not in our test plans and therefore the bug was not identified in testing.
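One way to reduce the chance of an untested configuration is to generate the test cases from the configuration options themselves. The sketch below is illustrative only: the option names and values (`ENCRYPTION`, `BIND_METHOD`, `VERIFY_CERT`) are assumptions, since this report does not list the real configuration fields beyond LDAPS.

```python
from itertools import product
from typing import Dict, List, Union

# Illustrative option values; the real configuration fields are assumptions.
ENCRYPTION = ("none", "start_tls", "ldaps")
BIND_METHOD = ("anonymous", "simple")
VERIFY_CERT = (True, False)


def ldap_test_matrix() -> List[Dict[str, Union[str, bool]]]:
    """Enumerate every valid combination of the options above."""
    cases = []
    for encryption, bind_method, verify_cert in product(
        ENCRYPTION, BIND_METHOD, VERIFY_CERT
    ):
        # Certificate verification is meaningless without encryption, so skip it.
        if encryption == "none" and verify_cert:
            continue
        cases.append(
            {
                "encryption": encryption,
                "bind_method": bind_method,
                "verify_cert": verify_cert,
            }
        )
    return cases


if __name__ == "__main__":
    for case in ldap_test_matrix():
        print(case)
```

Each generated case would then be run against a test directory server, so that a new option value automatically adds its combinations to the plan.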

A database server required an unplanned restart due to the abnormal load that it was placed under during the initial incident.

Resolution and Recovery

  • 0954 AEST The old application servers were isolated from the network

  • 1000 AEST The database server was restarted

  • 1007 AEST Issues with the admin portal confirmed as resolved

  • 1151 AEST Engineering team isolated the issue to the net-ldap library and a configuration affecting LDAPS

  • 1200 AEST Engineering team identified the configuration changes required to resolve the issue

  • 1209 AEST Engineering team deployed the fix and the issues with LDAP authentication were resolved

  • 1340 AEST The engineering team concluded that the database server needed to be restarted and its configuration modified

  • 1404 AEST The database server was restarted

  • 1412 AEST The intermittent issues were resolved

Corrective and Preventative Measures

  • Increased monitoring added to the database server to ensure earlier notification of potential issues

  • Upgrade and migration procedure updated with lessons learned to prevent a repeat occurrence

  • Problematic LDAP configuration added to test plans to ensure that all code paths are tested

  • Increase test automation suite coverage

  • Review risk assessment process
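As an illustration of the earlier notification mentioned in the first measure, the sketch below raises an alert when database load stays above a warning threshold for several consecutive samples, rather than waiting for load to reach its maximum. The threshold, the sample count, and the function name are hypothetical and do not describe the monitoring actually deployed.

```python
from typing import Iterable, Optional


def first_alert_index(
    samples: Iterable[float], threshold: float = 0.7, sustained: int = 3
) -> Optional[int]:
    """Return the index of the sample at which an alert would fire, or None.

    An alert fires once `sustained` consecutive samples are at or above
    `threshold` (load expressed as a fraction of capacity).
    """
    run = 0
    for index, load in enumerate(samples):
        # Extend the run while load stays high; reset it on any lower sample.
        run = run + 1 if load >= threshold else 0
        if run >= sustained:
            return index
    return None


if __name__ == "__main__":
    gradual_rise = [0.3, 0.5, 0.7, 0.8, 0.9, 1.0]
    print(first_alert_index(gradual_rise))
```

Requiring a sustained run avoids alerting on a single brief spike while still catching a gradual rise such as the one seen between 0840 and 0940.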