First, our apologies for the inconvenience this incident has caused our customers and their clients. During our investigation, we found that related issues had also occurred on Thursday last week, but at a smaller scale, and they went undetected by us at the time. We understand how important a reliable platform is for our customers, and we do everything we can to build, scale and maintain a stable and fast product.
We have now found the root cause of the issues, corrected it, set up monitoring and alarms, and adjusted the relevant procedures to avoid similar issues in the future.
We have included a more detailed account below for anyone interested.
On Monday afternoon, we became aware that customers were occasionally getting errors when trying to access the app. We started investigating and were puzzled by the symptoms, which led us to believe we were dealing with an issue in the underlying infrastructure, so we escalated to our infrastructure partner within 15 minutes of first becoming aware of the issue.
There were no load issues, no spikes, and no other anomalies or outliers in our monitoring tools.
We continued investigating, trying different things to reduce the impact, and eventually found that spawning more web server instances seemed to resolve the issue - even though the instances already running were nowhere near capacity.
Over the following day, as we continued digging for the root cause, our infrastructure partner politely pointed us toward an interesting error message in the logs. It led us to an application server configuration that still used an outdated default setting for its maximum number of worker connections. The application server had drifted out of sync with the HTTP/reverse proxy server in front of it, causing connections to be dropped before they ever reached the application stack - which is why our normal monitoring tools did not reveal the issue.
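To make the mismatch concrete, here is a rough sketch of the kind of deploy-time sanity check this points to: compare the connection limit configured on each side and refuse to rely on a packaged default. The file paths and directive names below are hypothetical placeholders, since this write-up does not name the specific proxy or application server involved.

```python
import re
import sys

def read_limit(path, directive):
    """Return the integer value of `directive` from a simple key-value config file."""
    pattern = re.compile(rf"^\s*{re.escape(directive)}\s+(\d+)", re.MULTILINE)
    with open(path) as f:
        match = pattern.search(f.read())
    if match is None:
        sys.exit(f"{directive} is not set explicitly in {path}; "
                 "refusing to rely on a packaged default")
    return int(match.group(1))

# Hypothetical paths and directive names -- adjust for the servers actually in use.
proxy_limit = read_limit("/etc/proxy/proxy.conf", "worker_connections")
app_limit = read_limit("/etc/appserver/server.conf", "max_worker_connections")

# The failure mode from this incident: the application server's (defaulted) limit
# fell below what the proxy could forward, so excess connections were dropped
# before they ever reached the application stack.
if app_limit < proxy_limit:
    sys.exit(f"application server limit ({app_limit}) is below the proxy "
             f"limit ({proxy_limit}); connections may be dropped under load")
print("connection limits are in sync")
```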
Earlier today, we deployed an updated configuration to the application server and scaled the number of web instances back down, carefully monitoring the impact. As expected, our servers are now spinning along, serving all requests without error.
Going forward, all default settings will be reviewed and updated whenever we upgrade the application server. In addition, we have set up monitoring of error response codes that occur before requests reach the application stack.
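For the curious, the following is a minimal sketch of the idea behind that new check (our production alerting lives in our monitoring stack, not in a script like this): watch the reverse proxy's access log for error responses the proxy generated itself, i.e. requests that never reached the application. The log path, log format, status codes and threshold are illustrative assumptions, not our actual configuration.

```python
import re
import time
from collections import deque

LOG_PATH = "/var/log/proxy/access.log"   # hypothetical log location
PROXY_ERRORS = {"502", "503", "504"}     # statuses a proxy typically emits itself
WINDOW_SECONDS = 60
THRESHOLD = 5

# Assumes a common combined-log format, where the status code follows the
# quoted request line.
status_re = re.compile(r'"\s(\d{3})\s')
recent_errors = deque()

with open(LOG_PATH) as log:
    log.seek(0, 2)                       # start at the end, like `tail -f`
    while True:
        line = log.readline()
        if not line:
            time.sleep(0.5)
            continue
        match = status_re.search(line)
        if match and match.group(1) in PROXY_ERRORS:
            now = time.time()
            recent_errors.append(now)
            # Keep only errors inside the sliding window.
            while recent_errors and now - recent_errors[0] > WINDOW_SECONDS:
                recent_errors.popleft()
            if len(recent_errors) >= THRESHOLD:
                print(f"ALERT: {len(recent_errors)} proxy-level errors "
                      f"in the last {WINDOW_SECONDS}s")
```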