...

A database server required an unplanned restart. This would have affected all customers, causing login attempts to both the admin portal and the client to fail and therefore affecting screen sharing for users who weren’t already logged in.

 

Timeline

 

  • 0927 AEST First support call was received regarding LDAP authentication
  • 0928 AEST Our monitoring system alerted us that 500 errors were being returned from our load balancer
  • 0937 AEST After initial troubleshooting, the support team handed the incident over to the engineering team
  • 0940 AEST Engineering ascertained that load had been rising gradually on the database since 0840 and it was now at maximum
  • 0951 AEST Engineering team identified the cause of the abnormal load and took measures to stop it
  • 1000 AEST Engineering team restarted the database server to clear the queue
  • 1005 AEST Database load returned to normal
  • 1007 AEST 500 errors from the load balancer ceased and the issue with the admin portal and client logins was resolved
  • 1010 AEST Customers still reporting issues with LDAP authentication
  • 1023 AEST Engineering team identified an incompatibility between our LDAP authentication and the newly deployed version of the software framework, which occurred only in a specific code path related to certain configurations
  • 1151 AEST Engineering team identified the cause of the incompatibility
  • 1200 AEST Engineering team implemented a fix and started testing it
  • 1209 AEST Engineering team deployed the fix and the LDAP issue was resolved
  • 1320 AEST During the post-incident analysis, the engineering team discovered that the abnormal database load had adversely affected the database server and that performance was degrading
  • 1340 AEST The engineering team concluded that database server performance would continue to degrade until the server was restarted
  • 1404 AEST The engineering team decided to restart the database to restore performance, as the degradation was starting to cause intermittent issues
  • 1412 AEST The database was restarted and performance returned to normal

...

Resolution and Recovery

 

  • 0954 AEST The old application servers were isolated from the network
  • 1000 AEST The database server was restarted
  • 1007 AEST Issues with the admin portal confirmed as resolved
  • 1151 AEST Engineering team isolated issue to the net-ldap library and configuration affecting LDAPS
  • 1200 AEST Engineering team identified configuration changes required to resolve issue
  • 1209 AEST Engineering team deployed fix and issues with LDAP authentication were resolved
  • 1340 AEST The engineering team concluded that the database server needed to be restarted and configuration modified
  • 1404 AEST The database server was restarted
  • 1412 AEST The intermittent issues were resolved

...

  • Increased monitoring added to the database server to ensure earlier notification of potential issues
  • Upgrade and migration procedure updated with lessons learned to prevent a recurrence
  • Problematic LDAP configuration added to test plans to ensure that all code paths are tested
  • Increase test automation suite coverage
  • Review risk assessment process
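
As an illustration of the monitoring item above, the following is a minimal sketch of a check that would flag a gradual rise in database load (such as the one that began at 0840) before the load reaches maximum. It assumes load is sampled at regular intervals and expressed as a fraction of capacity. The class name, window size, and thresholds are hypothetical placeholders and do not describe the monitoring that was actually deployed.

```python
from collections import deque
from statistics import mean


class LoadTrendMonitor:
    """Flags load that is sustained at a high level or climbing steadily.

    All defaults are hypothetical placeholders, not values from the incident.
    """

    def __init__(self, window=6, warn_level=0.7, min_rise=0.02):
        self.samples = deque(maxlen=window)  # keeps only the latest readings
        self.warn_level = warn_level         # average load treated as "high"
        self.min_rise = min_rise             # average rise per sample treated as "rising"

    def add_sample(self, load):
        """Record a load reading between 0.0 (idle) and 1.0 (maximum)."""
        self.samples.append(load)

    def check(self):
        """Return 'ok', 'rising', or 'high' for the current window."""
        if len(self.samples) < self.samples.maxlen:
            return "ok"  # not enough data to judge a trend yet
        readings = list(self.samples)
        if mean(readings) >= self.warn_level:
            return "high"
        # Average change per sampling interval across the window.
        rise = (readings[-1] - readings[0]) / (len(readings) - 1)
        if rise >= self.min_rise:
            return "rising"
        return "ok"


monitor = LoadTrendMonitor()
for reading in [0.20, 0.25, 0.30, 0.35, 0.40, 0.45]:
    monitor.add_sample(reading)
print(monitor.check())
```

Checking the trend as well as the absolute level is the point of the sketch: a steady climb can be reported while the load is still well below the level at which users see errors.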