At about 8:54 CEST this morning, the response times of our caching service suddenly grew to the point where cache requests started to time out and fell back to querying the main database. This in turn put a heavy load on the database, which caused longer response times for database queries in general.
Combined, this meant every request to the app server had to:
1. wait for the cache service request to time out
2. request the cached content from the database instead
3. wait for the longer database response times
Eventually, requests started queueing up, and some of them timed out (requests time out after 30 seconds of sitting in the queue).
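The request path described above can be illustrated with a minimal Python sketch. This is not our production code: the function names, the sample key, and the timeout value are all hypothetical, and the stalled cache is simulated with a sleep. It only shows why an unresponsive cache makes every request pay the full timeout before the database is even asked.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CacheTimeout

# Hypothetical value; the real cache timeout is not stated in this report.
CACHE_TIMEOUT_SECONDS = 0.1


def fetch(key, cache_get, db_get, pool, timeout=CACHE_TIMEOUT_SECONDS):
    """Return (value, source, elapsed_seconds) for a single lookup."""
    started = time.monotonic()
    future = pool.submit(cache_get, key)
    try:
        value = future.result(timeout=timeout)
        source = "cache"
    except CacheTimeout:
        # The full timeout has already been spent waiting, and the
        # database now has to serve content the cache normally holds.
        value = db_get(key)
        source = "database"
    return value, source, time.monotonic() - started


def healthy_cache(key):
    return f"cached:{key}"


def stalled_cache(key):
    time.sleep(0.5)  # simulates an unresponsive cache service
    return f"cached:{key}"


def database(key):
    return f"db:{key}"


with ThreadPoolExecutor(max_workers=2) as pool:
    fast = fetch("lead-42", healthy_cache, database, pool)
    slow = fetch("lead-42", stalled_cache, database, pool)

print(fast[:2])
print(slow[:2], f"{slow[2]:.3f}s")
```

In the stalled case the lookup still succeeds, but only after waiting out the timeout, and the extra database query adds to the load that slows down every other query.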
While we were working on solutions, eliminating bottlenecks and optimising database performance for long-running requests, the caching service started responding again at around 9:40 CEST, and everything returned to normal shortly after.
When key components like our caching layer or the database become unresponsive, it’s hard to work around them, as those components are both hard to replace and hard to scale. However, there are always takeaways from incidents, and this one was no exception:
* We gathered a lot of useful data identifying bottlenecks at certain API endpoints and other pages in the app, which we can use to optimise performance in general going forward.
* We learned how important the caching layer is. Treating it as a single point of failure will help us plan and scale going forward.
We apologise for the inconvenience to all who were affected. We know how important Myphoner is to our clients, and we work hard to ensure the stability of the platform.
Including this incident, our uptime over the past 30 days is 99.958%.
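To put that percentage in perspective, the short calculation below converts it into minutes. It assumes the figure is a simple time-based ratio over a 30-day window, which is a simplification; the variable names are illustrative only.

```python
from datetime import timedelta

UPTIME_PERCENT = 99.958       # figure quoted above
WINDOW = timedelta(days=30)   # assumed measurement window

window_minutes = WINDOW.total_seconds() / 60
downtime_minutes = window_minutes * (100 - UPTIME_PERCENT) / 100

print(f"{window_minutes:.0f} minutes in the window")
print(f"{downtime_minutes:.1f} minutes counted as downtime")
```

Under that assumption, 99.958% corresponds to roughly 18 minutes out of 43,200.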