Mark Smith (mark) wrote in dw_maintenance,
@ 2011-12-29 05:14 pm UTC
The summary of the issue is that when you connect to Dreamwidth, you are actually connecting to a bit of software called Perlbal. This software handles routing your request to one of our web servers (we have a bunch) and it also does some other nifty stuff.
The main problem with it is that it's single-threaded. That means that, even on our machines with eight or more CPU cores (most modern stuff!), it can only ever run on one of them. This leaves the machine very underutilized -- i.e., it's mostly idle!
This was never a problem for us because even just one core was enough to handle all of the Dreamwidth traffic. At some point we split it up so that static traffic (images, CSS, etc.) goes to a second Perlbal instance, but most of the main web traffic still goes through that primary instance.
Today, we finally hit the threshold where Perlbal was taking 100% of the one core it was on and couldn't go any faster. This caused it to queue up requests -- making the site feel really slow. The backend has plenty of capacity; it's just that the frontend wasn't able to go fast enough to handle the traffic.
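To make the queueing effect concrete, here is a rough sketch of what happens once requests arrive faster than a single saturated core can serve them. The function name and all of the numbers are made up for illustration; they are not measurements from Dreamwidth.

```python
def queue_depth_over_time(arrivals_per_sec, capacity_per_sec, seconds):
    """Return the number of waiting requests at the end of each second.

    Assumes a constant arrival rate and a constant service capacity,
    which is a simplification of real traffic.
    """
    depth = 0
    history = []
    for _ in range(seconds):
        # Whatever the core cannot serve this second stays in the queue.
        depth = max(0, depth + arrivals_per_sec - capacity_per_sec)
        history.append(depth)
    return history


# Hypothetical numbers: one core serves 1000 requests/sec.
overloaded = queue_depth_over_time(1200, 1000, 3)  # backlog keeps growing
healthy = queue_depth_over_time(800, 1000, 3)      # backlog stays empty
```

As long as arrivals stay above capacity, the backlog grows every second, which is why the site felt slow even though the web servers behind Perlbal were mostly idle.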
The fix was to put a much faster load balancer in front and use it to balance traffic to two different Perlbal instances. Now we have a bit of software called Pound that runs in front. We have always been using Pound, but it was only serving SSL requests. Now it is also serving unencrypted HTTP traffic and is then passing that traffic on to two Perlbal instances. In short, it's a load balancer for our load balancers!
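As a minimal sketch of the idea, a front-end balancer only needs to pick one of the two Perlbal instances for each incoming request. The example below uses simple round-robin; that policy, the class name, and the backend addresses are assumptions for illustration, not a description of how Pound actually schedules requests or how it is configured here.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in rotation, one per request."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(backends)

    def pick(self):
        # Each call returns the next backend in the rotation.
        return next(self._rotation)


# Hypothetical addresses for the two Perlbal instances.
balancer = RoundRobinBalancer(["perlbal-1:8080", "perlbal-2:8080"])
assignments = [balancer.pick() for _ in range(4)]
```

With two backends, each Perlbal instance sees roughly half of the requests, so each one only needs half as much of its single core.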
This lets us scale further, since Pound is an order of magnitude more efficient than Perlbal. By the time we reach the limits of scalability on Pound, we'll have to legitimately move to bigger hardware. (And actually by the time we get there, we will probably be colocating! Exciting!)