Myphoner - Queue Performance – Incident details

Queue Performance

Resolved
Partial outage
Started 7 months ago
Lasted about 5 hours

Affected

Web Application

Partial outage from 11:48 AM to 4:47 PM

Updates
  • Resolved
    Update

    The incident was caused by a version upgrade of our main database. The new version plans some queries slightly differently, and one of our indexes "confused" its query planner into choosing a more "expensive" query plan. The fix was to remove that index. Once the cause was identified, the fix was easy, but diagnosing it was not: it took a significant amount of research and experimentation to confirm that we understood the problem correctly and that the fix was sound. We apologize to everyone affected by the degraded performance during this incident.

  • Resolved
    Resolved

    The implemented solution appears to have worked, and response times are back to normal. The incident was caused by a version upgrade of our main database. The new version plans some queries slightly differently, and one of our indexes "confused" its query planner into choosing a more "expensive" query plan. The fix was to remove that index. Once the cause was identified, the fix was easy, but diagnosing it was not: it took a significant amount of research and experimentation to confirm that we understood the problem correctly and that the fix was sound. We apologize to everyone affected by the degraded performance over the past 36 hours.

  • Monitoring
    Monitoring

    After analyzing the situation and running experiments in a safe environment, we have adjusted the database indexes and are now monitoring the results.

  • Identified
    Update

    We are performing emergency maintenance on the main database to repair an index that is not performing as expected. Expect long response times for 5-10 minutes. Some outages may occur.

  • Identified
    Identified

    We are currently working on solutions for degraded performance when calculating queues and the next best lead. Large lists with 50K+ leads are especially slow at the moment.