Myphoner - Heavy load – Incident details

Heavy load

Resolved
Partial outage
Started almost 5 years ago · Lasted about 1 hour

Affected

Web Application

Operational from 7:12 AM to 7:12 AM, Partial outage from 7:12 AM to 8:05 AM

Updates
  • Update

    At about 8:54 CEST this morning, the response times of our caching service suddenly grew to a level where cache requests started to time out, falling back to requesting the main database. This in turn put a heavy load on the database, which caused longer response times for database queries in general. Combined, this meant every request to the app server had to:

    1. wait for the cache service timeout
    2. request the cached content from the database instead
    3. wait for the longer database response times

    Eventually this led to requests queueing up, and some requests started timing out (requests time out after 30 seconds of sitting in the queue).

    As we were working to find solutions, eliminating bottlenecks and optimising database performance on long-running requests, the caching service started responding again at around 9:40 CEST, and everything returned to normal shortly after.

    When key components like our caching layer or the database become unresponsive, it is hard to work around them, as those components are both hard to replace and hard to scale. However, there are always takeaways from incidents, and this one was no exception:

    * We gathered a lot of useful data identifying bottlenecks at certain API endpoints and other pages in the app, which we can use to optimise performance in general going forward.
    * We learned how important the caching layer is, and regarding it as a single point of failure will help us plan and scale going forward.

    We apologise for the inconvenience to all who were affected. We know how important Myphoner is to our clients, and we work hard to ensure the stability of the platform. This incident included, our uptime over the past 30 days is 99.958%.

  • Resolved

    This incident has been resolved.

  • Monitoring

    Caching service response times are returning to normal. We are continuing to monitor the situation closely.

  • Identified

    We're having issues with our caching service, which is causing a heavy load on our databases. We are working to ease the load one step at a time.

  • Investigating

    We are currently investigating this issue.
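
The failure mode described in the first update (a slow cache forcing every request to pay the cache timeout plus a heavier database query) can be sketched as a cache-with-database-fallback read path. This is a minimal, hypothetical illustration; the function names, timeout values, and backends are assumptions for the sketch, not Myphoner's actual code.

```python
CACHE_TIMEOUT = 0.05  # hypothetical per-request cache timeout, in seconds


def fetch(key, cache_get, db_get):
    """Return the value for key, preferring the cache.

    If the cache call times out, fall back to the main database.
    When the cache is unresponsive, every request pays the full
    cache timeout *plus* the (now heavier) database query time --
    the compounding latency the incident report describes, which
    eventually queues requests past their 30-second limit.
    """
    try:
        return cache_get(key, timeout=CACHE_TIMEOUT)
    except TimeoutError:
        # Cache unresponsive: fall back to the main database.
        return db_get(key)


# Simulated backends for illustration:
def healthy_cache(key, timeout):
    return f"cached:{key}"


def stalled_cache(key, timeout):
    raise TimeoutError  # cache never answers within the timeout


def database(key):
    return f"db:{key}"
```

With a healthy cache, `fetch("lead", healthy_cache, database)` returns the cached value; with a stalled cache, the same call falls through to the database, which is exactly why the database load spiked during the incident.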