Tonight's site interruptions
May. 25th, 2020 09:48 pm
Hi all,
First of all, sorry for the trouble tonight -- I know a lot of you had issues accessing Dreamwidth.
TL;DR: the issue is fixed and service is restored. We solved it by changing the configuration of the web server software we use (Apache).
The basic description of the issue is that we believe all of our Apache workers were being tied up holding keep-alive connections to clients. After you request a page, there is a 5 second timeout during which we wait for you to request another -- and if you don't, we close the connection.
It's possible for there to be enough users that we run out of connections, and then you basically have to wait in line for another one. This is usually very fast, but with lots of people trying to use Dreamwidth tonight, we ran out of free connections. This caused a cascading problem, though, because our upstream load balancer _also_ needs connections in order to run health checks that ask whether the web servers are doing OK.
So, we started seeing oscillating health check failures. Web servers would start to be considered failed (because the load balancer couldn't ask them how they were doing), and that left us with even fewer connections available, which just exacerbated the problem.
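To give a rough feel for why a 5 second keep-alive wait can eat up workers, here is a back-of-the-envelope sketch using Little's law (average busy workers = arrival rate × time each worker is held). The function name, the traffic figure, and the 0.2 second page time are made-up assumptions for illustration, not Dreamwidth's real numbers, and it assumes the worst case where every connection sends one request and then idles for the full timeout.

```python
def workers_needed(requests_per_second: float,
                   service_seconds: float,
                   keepalive_idle_seconds: float) -> float:
    """Estimate the average number of workers busy at once.

    Little's law: busy workers = arrival rate * time each worker is held.
    Worst-case assumption: each connection makes one request, then sits
    idle for the whole keep-alive timeout before the worker is freed.
    """
    return requests_per_second * (service_seconds + keepalive_idle_seconds)


# Hypothetical traffic: 100 requests/second, 0.2 seconds to serve a page.
with_keepalive = workers_needed(100, 0.2, 5.0)     # 5 second keep-alive wait
without_keepalive = workers_needed(100, 0.2, 0.0)  # keep-alive disabled

print(f"busy workers with keep-alive:    {with_keepalive:.0f}")
print(f"busy workers without keep-alive: {without_keepalive:.0f}")
```

Under these assumed numbers, most of the time a worker is held is spent waiting on an idle connection rather than serving a page, which is why the idle wait dominates how many workers are needed.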
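The feedback loop described above can be sketched with a toy model. This is a deliberate simplification under stated assumptions: load is spread evenly, a server fails its health check as soon as its share of connections exceeds its worker count, and failed servers never recover (so it shows the downward spiral, not the oscillation we actually saw). The function name and all numbers are hypothetical.

```python
def healthy_after_cascade(servers: int,
                          workers_per_server: int,
                          concurrent_connections: int) -> int:
    """Return how many servers stay healthy in a toy overload model.

    Each pass, the load is split evenly across the healthy servers. If a
    server's share exceeds its worker count, one server is marked failed
    and its load is pushed onto the remaining ones.
    """
    healthy = servers
    while healthy > 0 and concurrent_connections / healthy > workers_per_server:
        healthy -= 1  # one more server fails its health check
    return healthy


# Hypothetical pool: 10 web servers with 100 workers each.
print(healthy_after_cascade(10, 100, 900))   # under capacity: all stay healthy
print(healthy_after_cascade(10, 100, 1100))  # slightly over: the whole pool fails
```

The point of the sketch is that being only a little over capacity is enough: every server that drops out makes the remaining ones more overloaded, so nothing stops the slide.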
It was hard to track down because this really shouldn't be the case -- the load balancer should be multiplexing requests to different backend connections. We need to do some more investigation to understand why we were seeing this behavior. At any rate, we resolved it by disabling keep-alive.
Again, we're sorry for the interruption of service, and that we didn't tweet faster. That's on me; I forgot to update over there while we were working on the issue.
Big thanks to jennifer and alierak for the help here! Ultimately, it was alierak's suggestion that pointed us in the right direction and got the issue solved.
-Mark