
Server Outage Post-mortems

Issue Summary

We received the first outage report at 9:26 AM today (17/05/2022). Our Appolous application was inaccessible because our Application server and Main Database server could not be reached. Our team investigated immediately to resolve the issue as quickly as possible. Users who tried to access the application between 9:26 AM and 10:34 AM (about 68 minutes) could not log in, and all of our users were affected. The root cause was a network failure at our main cloud service provider, which left our internal servers unable to connect to each other.

Steps Taken
  1. Rebooted our Application server, but it did not return to normal.
  2. Rebooted our Main Database server, which did not help either.
  3. Pointed https://app.appolous.com to our Front End server, which was able to show the login page.
  4. Tried to log in, but it failed with a 500 SERVER ERROR.
  5. Checked the error log and found that the Cache server could not be reached. We pointed the application to another Cache server, which resolved that error.
  6. Tried to log in again, but still got a 500 SERVER ERROR.
  7. Identified that the connection to the Main Database had failed. We quickly switched to the Secondary Database server and performed an upgrade on it.
  8. After the upgrade, the Secondary Database server refused to boot up. In the meantime, we were replicating the Main Database server.
  9. Spun up a Replacement Main Database server to replace the failed server.
  10. Reconnected all servers to the new Main Database server.
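Steps 5 and 7 both came down to the same move: the usual server could not be reached, so we switched to a standby. The Python sketch below illustrates that idea only; it is not the tooling we used during the incident. The function names and the ordering of the endpoint list are assumptions made for the example.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def is_reachable(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint opens within `timeout` seconds."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def pick_first_reachable(
    endpoints: Sequence[Endpoint],
    probe: Callable[[Endpoint], bool] = is_reachable,
) -> Optional[Endpoint]:
    """Return the first endpoint that answers, in priority order, or None if none do."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None


# Hypothetical usage (host names and ports are placeholders, not our real servers):
#   target = pick_first_reachable([("cache-main.internal", 1111),
#                                  ("cache-standby.internal", 1111)])
#   if target is None:
#       ...  # nothing reachable: alert a human instead of guessing
```

A successful TCP connection only shows that the port is open. It does not prove the service behind it is healthy, so a real check would likely also run a small query or ping command.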
What Went Well
  • No data was lost during this outage.
  • The Front End server was not affected and was able to serve our users almost immediately.
What Went Wrong
  • The failed upgrade of our Secondary Database server prolonged the recovery.
  • Our daily backup was not useful in this case because restoring from it would have lost data.
  • We did not have another Passive Back-up Database server to fail over to.
  • Our Secondary Cloud Provider was not ready to take on a large volume of traffic at short notice.
Going Forward
  • We will replace the daily backup with a real-time Passive Back-up Database, and run an upgrade check on it every month.
  • We will add more monitoring tools, covering internal connections as well.
  • We will enhance our Secondary Cloud Provider infrastructure so that it is ready to take over at short notice.
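As a rough illustration of what monitoring internal connections could look like, the Python sketch below checks a set of named server-to-server links and reports the ones that do not answer. It is a sketch under stated assumptions, not a description of the monitoring tools we will deploy: the link names, hosts and ports are hypothetical, and a plain TCP connection is used as a stand-in for a proper health check.

```python
import socket
from typing import Callable, Dict, List, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_check(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint opens within `timeout` seconds."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Refused connection, timeout or DNS failure all count as "down".
        return False


def find_unreachable(
    links: Dict[str, Endpoint],
    check: Callable[[Endpoint], bool] = tcp_check,
) -> List[str]:
    """Return the names of all links whose endpoint fails the check, in input order."""
    return [name for name, endpoint in links.items() if not check(endpoint)]


# Hypothetical usage (names, hosts and ports are placeholders):
#   links = {
#       "app -> main database": ("db-main.internal", 1111),
#       "app -> cache": ("cache-main.internal", 2222),
#   }
#   down = find_unreachable(links)
#   if down:
#       ...  # raise an alert listing the failed links
```

Running a check like this on a schedule from each server, rather than from a single place, would show which internal paths are broken even when the public login page still loads.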