- Update
*By Jeppe Liisberg, founder at myphoner*

Earlier this week, myphoner was unavailable for 3 hours and 39 minutes. I understand how important myphoner is for your daily work and how much some of you rely on us to be able to work at all. We consider system availability one of our core features. Over the past four years, we have worked hard to ensure that myphoner is accessible 24/7, but on Tuesday we failed to maintain the uptime you all rightfully expect. I am deeply sorry for this, and I would like to share the details of what happened and what we are doing to prevent similar events from happening again.

## The Event

At 11.11am (CEST) on Tuesday, April 25, 2017, a sudden increase in response times from our main database caused web requests to begin timing out. In response, our application began returning HTTP 503 response codes, which meant that our error page with the ‘It looks like something is broken’ message began appearing. We were immediately notified by our availability monitors, and at 11.13 the status page was updated for the first time.

Upon investigating, we realised that what we initially thought was a partial outage caused by a few slow requests was in fact a major outage: practically all database requests were extremely slow, causing extensive timeouts for web requests and huge request queues that effectively took down the entire app. For a while, we tried easing the load on the main database by scaling down web processes and shutting down non-critical sub-systems that rely on the database. However, this had no visible effect on the issues we were facing.

About 30 minutes after the initial alerts, we began investigating detailed metrics of the main database. We discovered that one of the indices on lead data relating to duplicate detection had become huge (> 110 GB), with a bloat ratio of 28. We put the app in maintenance mode, turned off all schedulers and background workers, and started an emergency vacuum procedure on the database. However, the vacuum process was very slow, and when it finally finished, we saw no real signs of improvement.

We decided to fork the database cluster so that we could work with more aggressive recovery strategies without risking any data loss; unfortunately, that meant waiting for the fork to finish, which took time. Once the fork was done, we tried booting up a few processes against the new database cluster. Initial results looked good. We then pointed all processes to the new cluster and started booting them up. The app resurfaced 3 hours and 39 minutes later.

The combination of high load and uncontrolled bloat brought our main database to its knees, and it took a complete lockdown on load and a time-consuming maintenance procedure to get it back up. The switch to a new cluster was probably unnecessary, but it allowed us to investigate the state of the old cluster thoroughly after the app recovered. We spent the following hours performing further maintenance to reduce the remaining bloat, confirming that all systems were performing normally and verifying that there was no data loss from this incident. We are grateful that much of the disaster mitigation work already in place was successful in guaranteeing that all your leads, activity feeds and other critical data remained safe and secure.

## Future Work

It is clear that we overlooked the growing bloat in the database to a point at which it became fatal.
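For readers who run PostgreSQL themselves and wonder what keeping an eye on this looks like in practice, index sizes are visible through the built-in statistics views. The sketch below is a rough illustration in Python; the connection string, threshold and reporting are assumptions for the example, not a description of our actual tooling.

```python
# Minimal sketch: report the largest indexes so a runaway index (like the
# >110 GB one in this incident) stands out long before it becomes fatal.
# Connection details and the alert threshold are illustrative only.
import psycopg2

INDEX_SIZE_ALERT_BYTES = 50 * 1024**3  # e.g. warn on any index above 50 GB

conn = psycopg2.connect("dbname=app host=db.example.com user=monitor")
with conn.cursor() as cur:
    cur.execute("""
        SELECT relname, indexrelname, pg_relation_size(indexrelid) AS bytes
        FROM pg_stat_user_indexes
        ORDER BY bytes DESC
        LIMIT 10
    """)
    for table, index, size in cur.fetchall():
        flag = "WARNING" if size > INDEX_SIZE_ALERT_BYTES else "ok"
        print(f"[{flag}] {index} on {table}: {size / 1024**3:.1f} GB")
conn.close()
```

Raw size alone is not a bloat ratio (extensions such as pgstattuple can estimate that), but a single index growing far beyond its peers is exactly the kind of early warning we were missing.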
PostgreSQL ships with auto-vacuum procedures that automatically detect and reduce bloat, and only in rare circumstances is anything more needed. However, our implementation of duplicate housekeeping created exactly such a circumstance, and we did not realise that until it was too late. In addition, recovery took longer than necessary because we first had to research the specific conditions we were facing before recovery work could begin.

We are working on several strategies to avoid similar incidents in the future:

- In the short term, we are implementing improved monitoring of database health through better diagnosis tools, along with regular maintenance to reduce bloat (a minimal sketch of such a maintenance job appears at the end of this update).
- In the mid-term, we will strengthen our competencies around database performance, scalability and ad-hoc maintenance, to ensure we are better prepared and able to recover faster from future events.
- In the long term, we will reconsider the implementation of duplicate detection and housekeeping. This incident adds to several other factors that make it worthwhile to rethink duplicate detection and its related features.

## In Conclusion

I realise how important myphoner is to the work you do. Everyone at myphoner would like to apologise for the impact of this outage. We have identified the issue that caused it, and we will use the knowledge gained through this incident to improve our monitoring, maintenance and recovery procedures going forward.
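For the technically curious, the regular maintenance mentioned above could, in its simplest form, look something like the scheduled job sketched below. Again, the threshold and connection details are assumptions for illustration, not a description of our setup; note also that a plain VACUUM reclaims dead rows for reuse but does not shrink an already bloated index, which typically requires a rebuild.

```python
# Minimal sketch of a scheduled maintenance job: vacuum and analyze tables
# whose dead-tuple ratio has crept above a threshold. All names, thresholds
# and connection details are illustrative assumptions.
import psycopg2

DEAD_TUPLE_RATIO_THRESHOLD = 0.2  # e.g. act when >20% of a table's tuples are dead

conn = psycopg2.connect("dbname=app host=db.example.com user=maintenance")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup
        FROM pg_stat_user_tables
        WHERE n_live_tup + n_dead_tup > 0
    """)
    bloated = [
        table for table, live, dead in cur.fetchall()
        if dead / (live + dead) > DEAD_TUPLE_RATIO_THRESHOLD
    ]
    for table in bloated:
        print(f"Vacuuming {table} ...")
        cur.execute(f'VACUUM (VERBOSE, ANALYZE) "{table}"')

conn.close()
```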
- Resolved
All systems are now operational and everything looks good. We are continuing to monitor the situation closely.
- Monitoring
We have reprovisioned our database cluster and are now starting services one by one, monitoring performance as we go. Initial results are good.
- Update
The emergency maintenance procedures did not seem to increase the responsiveness of our database significantly. We're working with our infrastructure provider to troubleshoot and resolve the issue, while at the same time forking the database so we can boot it up on new hardware as a fallback strategy. We'll provide updates as soon as we know more, or at the latest within one hour (3 pm CEST).
- Update
We are running emergency maintenance on our main database and awaiting the results. We'll update the incident as soon as we know more, or at the latest within one hour (2 pm CEST). We're incredibly sorry about this, and we are very much aware of the impact it has on you, our customers. We are doing everything we can to get back online as fast as possible!
- Update
We are still performing emergency maintenance on our main database. No ETA yet.
- Identified
We are having serious performance issues with our main database, and it is not recovering even though we have eased the load. We are now performing emergency maintenance on our database server.
- Update
Our backend database is suffering from overload. We are looking into offloading and scaling it.
- Investigating
We are investigating a partial outage.