
...

  • 0927 AEST First support call was received regarding LDAP authentication
  • 0928 AEST Our monitoring system alerted us that 500 errors were being returned from our load balancer
  • 0937 AEST After initial troubleshooting, the support team handed the incident over to the engineering team
  • 0940 AEST Engineering team ascertained that load on the database had been rising gradually since 0840 and was now at maximum
  • 0951 AEST Engineering team identified the cause of the abnormal load and took measures to stop it
  • 1000 AEST Engineering team restarted the database server to clear the queue
  • 1005 AEST Database load returned to normal
  • 1007 AEST 500 errors from the load balancer ceased and the issue with the admin portal and client logins was resolved
  • 1010 AEST Customers still reporting issues with LDAP authentication
  • 1023 AEST Engineering team identified an incompatibility between our LDAP authentication and the newly deployed version of the software framework; it occurred only within a specific code path related to certain configurations
  • 1151 AEST Engineering team identified the cause of the incompatibility
  • 1200 AEST Engineering team implemented a fix and started testing it
  • 1209 AEST Engineering team deployed the fix and the LDAP issue was resolved
  • 1320 AEST During the post-incident analysis, the engineering team discovered that the abnormal database load had adversely affected the database server and that performance was degrading
  • 1340 AEST The engineering team concluded that the database server's performance would continue to degrade until it was restarted
  • 1404 AEST The engineering team decided to restart the database to restore performance, as the degradation was starting to cause intermittent issues
  • 1412 AEST The database was restarted and performance returned to normal

...

The abnormal load on the database server was caused by an increased number of connections and a doubling up of scheduled jobs. Ironically, this was caused by the old servers that had been retired, which had been left on hot standby so that we could fail back quickly if we encountered any issues with the new servers.

...

A database server required an unplanned restart due to the abnormal load that it was placed under during the initial incident.

...