May. 25th, 2020

[staff profile] mark
Hi all,

First of all, sorry for the trouble tonight -- I know a lot of you had issues accessing Dreamwidth.

TL;DR: the issue is fixed and service is restored. We solved it by changing the configuration of the web server software we use (Apache).

The short version is that we believe all of our Apache workers were tied up holding keep-alive connections open to clients. After you request a page, we wait up to 5 seconds for you to request another -- and if you don't, we close the connection. While a worker is waiting like that, it can't serve anyone else.
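To get a feel for how much a 5 second keep-alive wait can cost, here is a rough back-of-the-envelope sketch. It assumes the worst case, where every client makes one request and then sits idle until the timeout, and it uses made-up traffic numbers (100 new connections per second, 200 ms per request), not real Dreamwidth figures. The function name and parameters are purely illustrative.

```python
def workers_held(new_conns_per_sec, request_ms, keepalive_timeout_ms):
    """Estimate the average number of busy workers (Little's law).

    Average workers held = arrival rate * time each connection holds a worker.
    Assumes each connection holds one worker for the request itself plus the
    full keep-alive timeout (worst case: the client never asks for more).
    """
    hold_ms = request_ms + keepalive_timeout_ms
    return new_conns_per_sec * hold_ms / 1000


# Hypothetical numbers, for illustration only.
with_keepalive = workers_held(100, 200, 5000)
without_keepalive = workers_held(100, 200, 0)

print(with_keepalive)     # 520.0
print(without_keepalive)  # 20.0
```

Under those assumptions, most of the workers are doing nothing but waiting, which is why turning keep-alive off can free up so much capacity.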

It's possible for there to be enough users that we run out of connections, and then you basically have to wait in line for one to free up. This is usually very fast, but with lots of people trying to use Dreamwidth tonight, we ran out of free connections. That caused a cascading problem, because our upstream load balancer _also_ needs connections in order to do health checks that ask whether the web servers are doing OK.

So, we started seeing oscillating health check failures. A web server would be marked as failed (because the load balancer couldn't ask it how it was doing), which left us with even fewer available connections and just exacerbated the problem.
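The feedback loop can be sketched with a toy model. This is a deliberate simplification and not how our load balancer actually works: it assumes traffic is split evenly across healthy servers, that a server with no free worker fails its health check, that one server gets pulled per round, and that once everything is pulled the servers drain and pass their checks again. All names and numbers are hypothetical.

```python
def simulate_pool(servers, capacity, demand, rounds):
    """Return the number of healthy servers after each health check round.

    servers:  total web servers behind the load balancer
    capacity: workers (connections) per server
    demand:   total client connections that want to be served
    """
    healthy = servers
    history = []
    for _ in range(rounds):
        if healthy == 0:
            # No traffic reaches the servers, so workers free up
            # and every health check passes again.
            healthy = servers
        elif demand / healthy >= capacity:
            # Every worker is busy, so the health check can't get a
            # connection and one server is marked as failed.
            healthy -= 1
        history.append(healthy)
    return history


print(simulate_pool(4, 100, 300, 6))  # [4, 4, 4, 4, 4, 4]
print(simulate_pool(4, 100, 420, 6))  # [3, 2, 1, 0, 4, 3]
```

With demand just under capacity, nothing happens. Push it slightly over and each failure shifts more load onto the remaining servers, so the pool collapses, recovers, and collapses again.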

It was hard to track down because this really shouldn't happen -- the load balancer should be multiplexing requests to different backend connections. We need to do some more investigation to understand why we were seeing this behavior. At any rate, we resolved it by disabling keep-alive.

Again, we're sorry for the interruption of service, and that we didn't tweet faster. That's on me: I forgot to update over there while we were working on the issue.

Big thanks to [personal profile] jennifer and [personal profile] alierak for the help here! Ultimately, it was [personal profile] alierak's suggestion that pointed us in the right direction and got the issue solved.

-Mark