*By Jeppe Liisberg, founder at myphoner*
Earlier this week, myphoner was unavailable for 3 hours and 39 minutes. I understand how important myphoner is for your daily work and how much some of you rely on us to be able to work at all. We consider system availability one of our core features. Over the past four years, we have worked hard to ensure that myphoner is accessible 24/7, but on Tuesday we failed to maintain the uptime you all rightfully expect. I am deeply sorry for this, and I would like to share the details of what happened and what we are doing to prevent similar events from happening again.
## The Event
At 11.11am (CEST) on Tuesday, April 25, 2017, a sudden increase in response times from our main database caused web requests to begin timing out. In response, our application began returning HTTP 503 response codes, and users started seeing our error page with the ‘It looks like something is broken’ message.
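For readers who want a concrete picture of the failure mode: when database queries stall, each web process should give up after a deadline and answer with a 503 rather than hang while requests pile up behind it. The sketch below is purely illustrative and written in Python, which is not our actual stack; the connection string, table name and timeout are made up for the example.

```python
# Illustrative only (not myphoner's actual stack): translate a database query
# that exceeds a deadline into an HTTP 503 instead of letting the request hang.
import psycopg2
from psycopg2 import errors

def handle_lead_request(lead_id: int) -> tuple[int, str]:
    """Return an (HTTP status, body) pair for a single lead lookup."""
    try:
        # Cap server-side execution time at 5 seconds for this connection.
        conn = psycopg2.connect(
            "dbname=app_production",
            options="-c statement_timeout=5000",
        )
        with conn, conn.cursor() as cur:
            cur.execute("SELECT name FROM leads WHERE id = %s;", (lead_id,))
            row = cur.fetchone()
            return (200, row[0]) if row else (404, "Not found")
    except errors.QueryCanceled:
        # The database took too long: fail fast with a 503 so the error page
        # shows instead of the request sitting in an ever-growing queue.
        return (503, "It looks like something is broken")
```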
We were immediately notified by our availability monitors, and at 11.13am we posted the first update to our status page.
Upon investigating, we realised that what we had initially taken for a partial outage caused by a few slow requests was in fact a major outage: practically all database queries were extremely slow, causing widespread timeouts for web requests and huge request queues that effectively took down the entire app.
For a while, we tried easing the load on the main database by scaling down web processes and shutting down non-critical sub-systems relying on the database. However, this had no visible effect on the issues we were facing.
About 30 minutes after the initial alerts, we began investigating detailed metrics for the main database. We discovered that one of the indices on lead data used for duplicate detection had become huge (> 110 GB), with a bloat ratio of 28. We put the app in maintenance mode, turned off all schedulers and background workers, and started an emergency vacuum procedure on the database.
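For context, the kind of inspection this involved looks roughly like the sketch below: list the largest indices, check how far vacuuming has fallen behind, and run a manual vacuum on the suspect table. It is a simplified illustration in Python; the connection string and the `leads` table name are placeholders, not our actual schema.

```python
# Simplified sketch of inspecting index bloat and running an emergency vacuum.
import psycopg2

conn = psycopg2.connect("dbname=app_production")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    # The largest indices are where runaway bloat usually shows up first.
    cur.execute("""
        SELECT indexrelname,
               pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
        FROM pg_stat_user_indexes
        ORDER BY pg_relation_size(indexrelid) DESC
        LIMIT 10;
    """)
    for name, size in cur.fetchall():
        print(f"{name}: {size}")

    # Dead-tuple counts show how far behind autovacuum has fallen.
    cur.execute("""
        SELECT relname, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;
    """)
    for row in cur.fetchall():
        print(row)

    # Emergency maintenance on the suspect table. Note that a plain VACUUM
    # marks dead rows as reusable but does not shrink an already bloated
    # index on disk; that would require a REINDEX.
    cur.execute("VACUUM (VERBOSE, ANALYZE) leads;")

conn.close()
```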
However, the vacuum process was very slow, and when it finally finished, we saw no real signs of improvement.
We decided to fork the database cluster to enable working with more aggressive recovery strategies without risking any data loss; unfortunately, that meant waiting for the fork to finish, which took time.
Once the fork was done, we tried booting up a few processes against the new database cluster. Initial results looked good, so we pointed all processes to the new cluster and started booting them up. The app resurfaced 3 hours and 39 minutes after the outage began.
The combination of high load and uncontrolled bloat brought our main database to its knees, and it took a complete halt of incoming load and a time-consuming maintenance procedure to get it back up. The switch to a new cluster was probably unnecessary, but it allowed us to investigate the state of the old cluster thoroughly after the app recovered.
We spent the following hours performing further maintenance to reduce the remaining bloat, confirming that all systems were performing normally and verifying that there was no data loss from this incident. We are grateful that the disaster-mitigation work we had put in place succeeded in keeping all your leads, activity feeds and other critical data safe and secure.
## Future Work
It is clear that we overlooked the growing bloat in the database until it became fatal. PostgreSQL has auto-vacuum procedures in place by default to automatically detect and reduce bloat, and only in rare circumstances is anything else needed. However, our implementation of duplicate housekeeping created such a circumstance, and we did not realise that until it was too late.
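As an example of the kind of adjustment this implies, PostgreSQL lets autovacuum be tuned per table, so a large, high-churn table is vacuumed long before the default dead-tuple threshold is reached. The snippet below is a hedged sketch only; the table name and thresholds are illustrative, not our actual configuration.

```python
# Hypothetical sketch: make autovacuum more aggressive on a high-churn table
# so dead tuples are reclaimed well before bloat becomes a problem.
import psycopg2

conn = psycopg2.connect("dbname=app_production")
conn.autocommit = True

with conn.cursor() as cur:
    # By default autovacuum waits until roughly 20% of a table is dead tuples;
    # on a very large table that can mean many gigabytes of bloat before any
    # cleanup runs.
    cur.execute("""
        ALTER TABLE leads SET (
            autovacuum_vacuum_scale_factor  = 0.02,
            autovacuum_vacuum_threshold     = 1000,
            autovacuum_analyze_scale_factor = 0.02
        );
    """)

conn.close()
```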
In addition, recovery took longer than necessary because we had to research the specific conditions we were facing before any recovery work could be performed.
We are working on several strategies to avoid similar incidents in the future:
- In the short term, we are implementing improved monitoring of database health through better diagnostic tools and regular maintenance to reduce bloat (a sketch of such a health check follows this list).
- In the mid-term, we will strengthen our competencies around database performance, scalability and ad-hoc maintenance, to ensure we are better prepared and able to recover faster from future events.
- In the long term, we will reconsider the implementation of duplicate detection and housekeeping. This incident adds to several other factors that might make it worthwhile to rethink duplicate detection and related features.
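To make the short-term item above concrete, here is a minimal sketch of the kind of scheduled health check we have in mind: it watches dead-tuple counts and index sizes and raises an alert well before bloat reaches dangerous levels. The thresholds, connection string and alerting hook are placeholders, not our production setup.

```python
# Hypothetical monitoring sketch: flag tables and indices whose dead-tuple
# counts or on-disk sizes cross a threshold. Values are placeholders.
import psycopg2

DEAD_TUPLE_LIMIT = 1_000_000
INDEX_SIZE_LIMIT_BYTES = 20 * 1024 ** 3  # 20 GB

def check_database_health(dsn: str) -> list[str]:
    warnings = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT relname, n_dead_tup FROM pg_stat_user_tables;")
        for relname, n_dead_tup in cur.fetchall():
            if n_dead_tup > DEAD_TUPLE_LIMIT:
                warnings.append(f"table {relname}: {n_dead_tup} dead tuples")

        cur.execute("""
            SELECT indexrelname, pg_relation_size(indexrelid)
            FROM pg_stat_user_indexes;
        """)
        for indexname, size in cur.fetchall():
            if size > INDEX_SIZE_LIMIT_BYTES:
                warnings.append(f"index {indexname}: {size / 1024 ** 3:.1f} GB")
    return warnings

if __name__ == "__main__":
    for warning in check_database_health("dbname=app_production"):
        print("ALERT:", warning)  # in practice this would page the on-call engineer
```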
## In Conclusion
I realise how important myphoner is to the work you do. Everyone at myphoner would like to apologise for the impact of this outage. We identified the issue that caused the outage, and we will use the knowledge we gained through this incident to improve our monitoring, maintenance and recovery procedures going forward.