Photo of Mark's face, taken in standard office fluorescent.
[staff profile] mark
Hi all,

We are currently working around a problem with our static file serving -- i.e., JavaScript, CSS, and image files. I had to move these back to our main web frontend instead of using the fast static frontend.

The root cause is that our hosting provider, ServerBeach, reassigned the IP address we were using for static content. They assigned it to one of the new machines they are building for us. Given the vagaries of networks, I can't unassign that IP easily on my end. I would have to kill the machine, and it's still in provisioning, so I don't have access to it yet.

To work around this problem, I've had to make two changes: use the main Perlbal infrastructure and turn off an optimization we use. This is going to cause things to load a little slower. I will be working this evening to resolve the problem in a more efficient way.

My apologies for the issue here. I don't know why they reassigned our IP address, but as soon as I figure out what happened, I will let you know.

Thanks for your patience.
[staff profile] mark
Just to keep you all in the loop -- the servers have been really busy lately! [personal profile] alierak, one of our sysadmins, has been working on things and found an issue with our load balancer. He fixed that, and that has resolved a number of issues people have had connecting to Dreamwidth during periods of peak load.

(For the technical, the issue was that our connection tracking table in the kernel was set to the default 64k, and we've started peaking past that. He raised the limits.)
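
For the curious, here is a minimal sketch of how one might watch for this kind of limit. It assumes a Linux machine that exposes the netfilter counters under `/proc/sys/net/netfilter/` (the exact paths vary by kernel version, and older kernels used `ip_conntrack` names instead). The 90% warning threshold and the function names are illustrative assumptions, not the monitoring actually in use on Dreamwidth.

```python
from pathlib import Path

# Assumed paths; these differ between kernel versions.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count: int, limit: int) -> float:
    """Return the fraction of the connection tracking table in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return count / limit


def is_near_limit(count: int, limit: int, threshold: float = 0.9) -> bool:
    """True when usage is at or above the (illustrative) threshold."""
    return conntrack_usage(count, limit) >= threshold


def read_counter(path: Path) -> int:
    """Read a single integer counter from a /proc file."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    if COUNT_PATH.exists() and MAX_PATH.exists():
        count, limit = read_counter(COUNT_PATH), read_counter(MAX_PATH)
        print(f"conntrack: {count}/{limit} ({conntrack_usage(count, limit):.0%})")
        if is_near_limit(count, limit):
            print("warning: connection tracking table is nearly full")
    else:
        print("conntrack counters not found on this machine")
```

With a 64k table (65536 entries), a check like this would start warning at roughly 59,000 tracked connections, before new connections begin to be dropped.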

We have also submitted an order for two more web servers. This will make things faster by just giving us more horsepower to work with. The rest of the system (databases, memcache, network, etc) isn't anywhere near capacity, but we're running low on CPU on our web servers.

We also have some of our developers working on optimizations for entries that have many comments. That work is in progress and we will continue to iterate to make things faster.

If you have any questions, let me know! I'll do my best to answer them quickly and accurately. :)
[staff profile] mark
Hi all,

I'm going to be doing a code push shortly. As always, if you see anything stop working or break, please comment here and we'll get it fixed up. Thank you!

Edit: We had an issue with delayed notifications. They should be moving normally again!
[staff profile] mark
I made a post on my own journal that talks about the scalability/load/capacity of Dreamwidth. The short summary is that we're nowhere near our ultimate capacity on anything and the few things that have started to hit capacity are being expanded. I am very comfortable with our current and future status!

More details here:

http://mark.dreamwidth.org/21787.html

I'm going to disable comments here and ask that if you want to comment/ask questions/etc that you do it over there on the relevant post.

Thank you!
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
Hello, my lovely Dreamwidthians! I bring to you information about several site issues this afternoon/evening/other-time-of-day-as-appropriate-for-your-timezone, and the steps we're taking to fix them:

* Some people have noticed intermittent site slowdowns today. This is partially because of increased traffic, and partially, as we feared, because the import queue, and the speed at which we're processing import jobs, is putting a lot of load on the database servers. Short-term fix: [staff profile] mark is juggling things around to reduce database load and trying to direct some queries to the backup database machine. Long-term fix: we have a ticket in with our hosting provider to upgrade the database machines to make them bigger, better, faster, and more.

* You may get some "internal server error" messages when submitting data to the site, or messages saying that the site sent no data or the connection was reset. This is almost certainly related to the new load balancing solution we put into place last night, and Mark is doing some configuration tweaks and fixes that will hopefully stop it from happening. We're not 100% sure on the exact cause -- there are a few things it could be -- but we're going to keep whacking at it until the errors go away.

As part of these fixes, Mark may need to take the site down for a brief maintenance window -- he's hoping he'll be able to make the fixes without having to put the site into maintenance mode, but he might not be able to. The site may be slow for a while in the next few hours, and there may be brief periods of downtime. We're working to fix the problems as fast as we can!

(Well, okay, Mark is working to fix the problems. I'm cheering him on and passing him towels and Gatorade.)
[staff profile] mark
Hi all,

I'm doing some updates on our backend file storage system. This should be transparent, but there may be a brief window where icons aren't loading. Please let me know if you are having any trouble or see any problems!
[staff profile] mark
Dreamwidth was having issues loading for ~15 minutes. This is now resolved.

The summary of the issue is that when you connect to Dreamwidth, you are actually connecting to a bit of software called Perlbal. This software handles routing your request to one of our web servers (we have a bunch) and it also does some other nifty stuff.

The main problem with it is that it's single-threaded. That means that, on the machines we have that have eight or more CPU cores (most modern stuff!), it can only ever run on one of those. This leaves the machine very underutilized -- i.e., it's mostly idle!

This was never a problem for us because even just one core was enough to handle all of the Dreamwidth traffic. At some point we split it up so that static traffic (images, CSS, etc) goes to a second Perlbal instance, but most of the main web traffic still goes through that primary instance.

Today, we finally hit the threshold where Perlbal was taking 100% of the one core it was on and couldn't go any faster. This caused it to queue up requests, which made the site feel really slow. The backend has plenty of capacity; it's just that the frontend wasn't able to go fast enough to handle the traffic.

The fix was to put a much faster load balancer in front and use it to balance traffic to two different Perlbal instances. Now we have a bit of software called Pound that runs in front. We have always been using Pound, but it was only serving SSL requests. Now it is also serving unencrypted HTTP traffic and is then passing that traffic on to two Perlbal instances. In short, it's a load balancer for our load balancers!

This lets us scale more since Pound is an order of magnitude more efficient than Perlbal. By the time we reach the limits of scalability on Pound, we'll have to legitimately move to bigger hardware. (And actually by the time we get there, we will probably be collocating! Exciting!)
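
To make the "load balancer for our load balancers" idea concrete, here is a toy sketch of a front balancer that spreads incoming requests across two backends. It uses plain round-robin, which is an assumption for illustration; the post does not say which balancing strategy Pound is configured with, and the class and backend names below are hypothetical.

```python
from collections import Counter
from itertools import cycle


class FrontBalancer:
    """Toy stand-in for the front load balancer: hands each request
    to the next backend in turn (plain round-robin, an assumption)."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("need at least one backend")
        self._next_backend = cycle(backends)

    def route(self, request_id):
        """Pick a backend for this request and return (request_id, backend)."""
        return request_id, next(self._next_backend)


# Hypothetical backend names, for illustration only.
balancer = FrontBalancer(["perlbal-1", "perlbal-2"])
assignments = [balancer.route(i) for i in range(10)]

# Count how many requests each backend received.
load = Counter(backend for _, backend in assignments)
print(load)
```

Each backend ends up with half of the requests, so each single-threaded instance only has to keep up with half of the traffic on its one core.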
[staff profile] mark
I'm going to be doing a code push shortly as soon as I'm happy with the positioning of my ducks. (I don't need them to all have the same Y coordinates, but they should at least be near each other.)

This is a smaller push than most; there isn't a ton of stuff that's changed. There are a few major changes to how the importer works, though, which is my area of most concern and where I'll be spending a lot of time watching.

As always, please comment and let me know if there are any issues. Thank you!

Update: Code pushed. There was some temporary slowness: our load balancer (Perlbal) got into a weird state where it was using 100% CPU. I restarted it and things returned to normal.

Update #2: There was an issue affecting status messages for community imports. This should be fixed now.
[staff profile] denise
Thanks to a few people reporting problems with their import jobs stuck on the "verify" step for an extended period of time, we discovered a bottleneck in the import process that we hadn't known about. We've taken steps to fix it. If your import was showing as "ready to be inserted into the queue", those jobs are now being moved into the import queue more quickly. (That's why, if you've been watching the queue on the Import Journal page, the numbers just jumped like whoa.)

It will take time for the importer to process all the queued jobs -- whenever there's a surge in account creation, there's a corresponding surge in import jobs -- but fear not, once they're scheduled your import jobs will run. You don't have to leave the page open: just schedule the job and wander off, and sooner or later you will look at your journal and all of your stuff will be there like magic. :)


EDIT, 8:40PM EDT: Sorry about the rampant internal server error problems -- we thought it was a problem with the new webserver, but it turned out that imports were happening too fast and were locking up the database. Mark has throttled back the import speed enough that the errors should go away now. (This means that imports will be happening more slowly, but the queue's backed up enough right now that it probably won't make much difference anyway!)

EDIT, 4:30 PM EDT, 12/23: As always happens whenever we have an influx of new users, the import queue is very, very busy right now. Your import will almost certainly take at least a day to finish. Please be patient! Once your job is in the queue, it will complete eventually and you don't need to stay logged into the site or leave your computer on. Just start it and go do other things, and eventually your stuff will catch up with you. :)
[staff profile] mark
Hi all,

Dreamwidth is seeing some extra traffic today, so we're rolling out a new webserver to handle the extra load. We'll be paying close attention to the system load though and will be doing what we can to make sure everything is speedy and working.

Also, today I am joined by my infant son Oliver, who is helping me with the servers. Hello from Mark and Oliver!

Back up!

Dec. 13th, 2011 09:51 am
[staff profile] mark
Looks like we're back online. Please let me know if anything is not working like you expect!

Update: DNS was broken. This affected imports, crossposts, and emails. It's back up and running now and the email backlog will slowly be cleared. Nothing was lost, just delayed.
[staff profile] mark
Hi everybody.

The extended downtime I posted about a few days ago is upon us. Dreamwidth will be going offline in about 45 minutes and will be down for about two hours.

Thanks for all of your patience. I'll see you on the flip side.
[staff profile] denise
We're running a little late with tonight's code push (took some extra time to get some last-minute things committed), but we'll be starting tonight's push in a bit. If it goes well, y'all will barely notice. If it goes poorly, well, we'll fix it as quickly as we can!

EDIT: And, we're done! If you notice any problems with the site in general, comment here. If you notice problems with the beta test of the new update page, comment on the reporting post in dw-beta.
[staff profile] mark
Our hosting provider, ServerBeach, will be performing some maintenance on our servers next week. They're going to be unplugging and carrying them across the data center and reinstalling them in a new location. They're doing some reorganization and our rack is one of the ones affected.

The main impact from this is that Dreamwidth will be hard down (no response at all and no maintenance page) from 9AM to 11AM PST (1700 to 1900 UTC) on Tuesday, December 13th. There is a secondary impact in that when we come back up, none of our caches will be warm. The site will be slow for some period of time after it comes back -- up to several hours.

I will post again when we get closer to the outage. As always, we will update our @dreamwidth Twitter account as this happens.
[staff profile] denise
We are scheduling a 2-hour code push and maintenance window starting at 11PM EST Fri, 9 Dec and ending at 1AM EST Sat, 10 Dec (beginning 4AM UTC Sat, 10 Dec; convert to your local time).
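
For anyone who wants to double-check the conversion, here is a small sketch using only the Python standard library. It treats EST as a fixed UTC-5 offset, which holds in December when daylight saving time is not in effect; the variable names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset; valid for EST in December (no daylight saving).
EST = timezone(timedelta(hours=-5), "EST")

# Window as announced: 11PM EST Fri, 9 Dec 2011, lasting two hours.
start_est = datetime(2011, 12, 9, 23, 0, tzinfo=EST)
end_est = start_est + timedelta(hours=2)

# Convert both ends of the window to UTC.
start_utc = start_est.astimezone(timezone.utc)
end_utc = end_est.astimezone(timezone.utc)

print("starts:", start_utc.strftime("%a %d %b %Y %H:%M UTC"))
print("ends:  ", end_utc.strftime("%a %d %b %Y %H:%M UTC"))
```

To convert to another zone, swap in a different fixed offset in place of `timezone.utc` when calling `astimezone`.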

In the best case scenario, this push will involve putting the site into "maintenance mode" very briefly (five minutes or so) during that window to run updates. In the worst case scenario, there may be some unexpected errors after the push: this update will include a lot of backend changes and reorganization, and although these changes have been tested, there will always be that one bug that doesn't show up until you've suddenly got thousands of users trying a whole bunch of things all at once! We'll be on top of things to watch out for problems, but we wanted to give you a bit of warning, and that's why we're scheduling the full 2-hour window even though if all goes well it shouldn't take anywhere near that long.

Either [staff profile] fu or I will post a reminder before maintenance begins.
[staff profile] denise
LJ has let us know that they've re-enabled third-party application access to the site, which has allowed us to re-enable imports from LJ and means that crossposting to LJ will be less likely to fail. (Crossposts still may fail due to temporary site unavailability; as mentioned in the previous update, until you get a failure message in your inbox, the crosspost is still retrying even if it hasn't actually posted to the remote account yet.)

Thanks to LJ for keeping us in the loop on this. We salute y'all over there for your tireless efforts to stay online throughout this DDoS.
[staff profile] denise
The DDoS mitigation steps that LiveJournal is taking to make sure their users can continue to access the site have, unfortunately, meant that third-party applications (such as DW's importer!) temporarily can't access the site. This means that importing from LJ is temporarily disabled again, because all imports would fail due to being blocked. We'll let you know when they're re-enabled!

Many people are also experiencing problems with crossposts. If you try to crosspost an entry and it doesn't immediately crosspost, your crosspost attempt has not mysteriously wandered off into the deepest darkest depths of the internet: until you receive a failure message in your inbox or the crosspost attempt succeeds, the worker is still trying to contact LJ in order to make your crosspost, even if you then edit the entry and the crosspost box is unchecked. (The crosspost box will not be checked on the entry edit page until the crosspost is successful.) So, just hang tight until you get the final failure message in your inbox, then wait a day or two and try to crosspost again by editing the entry, checking the crosspost box, and hitting 'save'.

Because of the problems with accessing LJ, and in order to help reduce the traffic on LJ's servers, we've lowered the number of times a crosspost attempt will try to contact LJ before failing and we've extended the delays between attempts (it was five tries at intervals of 10 seconds, 30 seconds, 60 seconds, 5 minutes, and 10 minutes; now it's three tries at 5 minutes, 15 minutes, and 30 minutes). This may help more crossposts to succeed because of the reduced load, or it might have no effect at all, but we're hoping it will also help reduce the traffic to LJ and help them recover from their DDoS more quickly.
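
As a rough illustration of the difference between the two schedules, here is a minimal sketch of a retry loop. It assumes each try happens after waiting the listed delay and that the delays add up one after another; the function names and the `attempt` callback are hypothetical, and this is not the actual crosspost worker code.

```python
import time

# Delays in seconds, taken from the schedules described above.
OLD_DELAYS = [10, 30, 60, 5 * 60, 10 * 60]  # five tries
NEW_DELAYS = [5 * 60, 15 * 60, 30 * 60]     # three tries


def total_wait(delays):
    """Total seconds spent waiting if every try fails."""
    return sum(delays)


def crosspost_with_retries(attempt, delays, sleep=time.sleep):
    """Wait each delay in turn, then call attempt(); stop at the first success.

    Returns True if some try succeeded, False if all tries failed
    (the point at which a failure message would be sent).
    """
    for delay in delays:
        sleep(delay)
        if attempt():
            return True
    return False


print("old:", len(OLD_DELAYS), "tries over", total_wait(OLD_DELAYS), "seconds")
print("new:", len(NEW_DELAYS), "tries over", total_wait(NEW_DELAYS), "seconds")
```

Under this reading, the old schedule made five requests within about 17 minutes (1,000 seconds), while the new one makes three requests spread over 50 minutes (3,000 seconds): fewer hits on the remote site, and a longer window for it to recover before the crosspost gives up.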

Best of luck to LJ with their DDoS mitigation efforts!
[staff profile] denise
After a bunch of testing, we are pleased to announce that we've been able to re-enable imports from LiveJournal. Apologies for the delay: we were waiting for the LJ cookie changes to stabilize and be documented!

Thanks go to LJ for making their changes more backwards-compatible. If you'd like to start a new import from LJ, you can do so at the Import Content page.
[staff profile] denise
The code push is now complete, and we're working on fixing the issues that have been reported!

(Biggest issue right now: JavaScript is not being properly loaded on the site. [staff profile] fu and the rest of the Usual Suspects are working to fix it.)
[staff profile] denise
We are now beginning our code push. The site will be down for a few minutes during the process. (And when it comes back up, we will have the new update page beta omg omg omg.)
Page generated Feb. 10th, 2012 03:17 am