Photo of Mark's face, taken in standard office fluorescent.
[staff profile] mark
Hi all,

I'm going to be doing a code push shortly. As always, if you see anything stop working or break, please comment here and we'll get it fixed up. Thank you!

Edit: We had an issue with delayed notifications. They should be moving normally again!
Photo of Mark's face, taken in standard office fluorescent.
[staff profile] mark
I made a post on my own journal that talks about the scalability/load/capacity of Dreamwidth. The short summary is that we're nowhere near our ultimate capacity on anything and the few things that have started to hit capacity are being expanded. I am very comfortable with our current and future status!

More details here:

http://mark.dreamwidth.org/21787.html

I'm going to disable comments here and ask that if you want to comment/ask questions/etc that you do it over there on the relevant post.

Thank you!
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
Hello, my lovely Dreamwidthians! I bring to you information about several site issues this afternoon/evening/other-time-of-day-as-appropriate-for-your-timezone, and the steps we're taking to fix them:

* Some people have noticed intermittent site slowdowns today. This is partially because of increased traffic, and partially, as we feared, because the import queue, and the speed at which we're processing import jobs, is putting a lot of load on the database servers. Short-term fix: [staff profile] mark is juggling things around to reduce database load and trying to direct some queries to the backup database machine. Long-term fix: we have a ticket in with our hosting provider to upgrade the database machines to make them bigger, better, faster, and more.

* You may get some "internal server error" messages when submitting data to the site, or messages saying that the site sent no data or the connection was reset. This is almost certainly related to the new load balancing solution we put into place last night, and Mark is doing some configuration tweaks and fixes that will hopefully stop it from happening. We're not 100% sure on the exact cause -- there are a few things it could be -- but we're going to keep whacking at it until the errors go away.

As part of these fixes, Mark may need to take the site down for a brief maintenance window -- he's hoping he'll be able to make the fixes without having to put the site into maintenance mode, but he might not be able to. The site may be slow for a while in the next few hours, and there may be brief periods of downtime. We're working to fix the problems as fast as we can!

(Well, okay, Mark is working to fix the problems. I'm cheering him on and passing him towels and Gatorade.)
Photo of Mark's face, taken in standard office fluorescent.
[staff profile] mark
Hi all,

I'm doing some updates on our backend file storage system. This should be transparent, but there may be a brief window where icons aren't loading. Please let me know if you are having any trouble or see any problems!
Photo of Mark's face, taken in standard office fluorescent.
[staff profile] mark
Dreamwidth was having issues loading for ~15 minutes. This is now resolved.

The summary of the issue is that when you connect to Dreamwidth, you are actually connecting to a bit of software called Perlbal. This software handles routing your request to one of our web servers (we have a bunch) and it also does some other nifty stuff.

The main problem with it is that it's single-threaded. That means that, on the machines we have that have eight or more CPU cores (most modern stuff!), it can only ever run on one of those. This leaves the machine very underutilized -- i.e., it's mostly idle!

This was never a problem for us because even just one core was enough to handle all of the Dreamwidth traffic. At some point we split it up so that static traffic (images, CSS, etc) goes to a second Perlbal instance, but most of the main web traffic still goes through that primary instance.

Today, we finally hit the threshold where Perlbal was taking 100% of the one core it was on and couldn't go any faster. This caused it to queue up requests -- making the site feel really slow. The backend has plenty of capacity; it's just that the frontend wasn't able to go fast enough to handle the traffic.

The fix was to put a much faster load balancer in front and use it to balance traffic to two different Perlbal instances. Now we have a bit of software called Pound that runs in front. We have always been using Pound, but it was only serving SSL requests. Now it is also serving unencrypted HTTP traffic and is then passing that traffic on to two Perlbal instances. In short, it's a load balancer for our load balancers!
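For the curious, here is a minimal Python sketch of the "load balancer for our load balancers" idea: one front-end process hands each incoming request to one of two backend instances in turn. It is illustrative only and is not how Pound or Perlbal are actually configured. The backend names `perlbal-1` and `perlbal-2` are made up, and round-robin is chosen only for simplicity; the post does not say how Pound distributes requests.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand each request to the next backend in a fixed rotation."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        # cycle() repeats the backend list forever: a, b, a, b, ...
        self._rotation = cycle(list(backends))

    def route(self, request):
        """Return (backend, request) for the backend that should serve it."""
        return next(self._rotation), request


# Hypothetical names for the two Perlbal instances behind the front end.
balancer = RoundRobinBalancer(["perlbal-1", "perlbal-2"])
routes = [balancer.route(f"GET /page/{i}")[0] for i in range(4)]
print(routes)
```

With two single-threaded backends, each one sees roughly half of the requests, so each has about twice the headroom before its one core is saturated.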

This lets us scale more since Pound is an order of magnitude more efficient than Perlbal. By the time we reach the limits of scalability on Pound, we'll have to legitimately move to bigger hardware. (And actually by the time we get there, we will probably be colocating! Exciting!)
Photo of Mark's face, taken in standard office fluorescent.
[staff profile] mark
I'm going to be doing a code push shortly as soon as I'm happy with the positioning of my ducks. (I don't need them to all have the same Y coordinates, but they should at least be near each other.)

This is a smaller push than most; there isn't a ton of stuff that's changed. There are a few major changes to how the importer works, though, which is my area of greatest concern and where I'll be spending a lot of time watching.

As always, please comment and let me know if there are any issues. Thank you!

Update: Code pushed. There was some temporary slowness: our load balancer (Perlbal) got into a weird state where it was using 100% CPU. I restarted it and things returned to normal.

Update #2: There was an issue affecting status messages for community imports. This should be fixed now.
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
Thanks to a few people reporting problems with their import jobs stuck on the "verify" step for an extended period of time, we discovered a bottleneck in the import process that we hadn't been aware of. We've taken steps to fix it. If your import was showing as "ready to be inserted into the queue", those jobs are now being moved into the import queue more quickly. (That's why, if you've been watching the queue on the Import Journal page, the numbers just jumped like whoa.)

It will take time for the importer to process all the queued jobs -- whenever there's a surge in account creation, there's a corresponding surge in import jobs -- but fear not, once they're scheduled your import jobs will run. You don't have to leave the page open: just schedule the job and wander off, and sooner or later you will look at your journal and all of your stuff will be there like magic. :)


EDIT, 8:40PM EDT: Sorry about the rampant internal server error problems -- we thought it was a problem with the new webserver, but it turned out that imports were happening too fast and were locking up the database. Mark has throttled back the import speed enough that the errors should go away now. (This means that imports will be happening more slowly, but the queue's backed up enough right now that it probably won't make much difference anyway!)
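As a rough illustration of what "throttling back" a job queue can mean, here is a minimal Python sketch that spaces out job start times so that no two jobs begin less than a fixed gap apart. It is not the importer's actual code: the function name, the two-second gap, and the sample timestamps are all made up for the example.

```python
def throttle_schedule(ready_times, min_gap):
    """Return start times so consecutive jobs start at least min_gap apart.

    ready_times: times (in seconds) at which each job is ready to run.
    min_gap: minimum number of seconds between two job starts.
    """
    starts = []
    next_allowed = float("-inf")  # the first job is never delayed
    for ready in sorted(ready_times):
        # A job starts when it is ready, or when the throttle allows,
        # whichever comes later.
        start = max(ready, next_allowed)
        starts.append(start)
        next_allowed = start + min_gap
    return starts


# Three jobs ready at once and one that arrives later.
schedule = throttle_schedule([0, 0, 0, 10], min_gap=2)
print(schedule)
```

The burst of three simultaneous jobs is spread out over time, which is the trade-off described above: less pressure on the database at any one moment, at the cost of jobs finishing later.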

EDIT, 4:30 PM EDT, 12/23: As always happens whenever we have an influx of new users, the import queue is very, very busy right now. Your import will almost certainly take at least a day to finish. Please be patient! Once your job is in the queue, it will complete eventually and you don't need to stay logged into the site or leave your computer on. Just start it and go do other things, and eventually your stuff will catch up with you. :)
Photo of Mark's face, taken in standard office fluorescent.
[staff profile] mark
Hi all,

Dreamwidth is seeing some extra traffic today, so we're rolling out a new webserver to handle the extra load. We'll be paying close attention to the system load though and will be doing what we can to make sure everything is speedy and working.

Also, today I am joined by my infant son Oliver, who is helping me with the servers. Hello from Mark and Oliver!

Back up!

Dec. 13th, 2011 09:51 am
Photo of Mark's face, taken in standard office fluorescent.
[staff profile] mark
Looks like we're back online. Please let me know if anything is not working like you expect!

Update: DNS was broken. This affected imports, crossposts, and emails. It's back up and running now and the email backlog will slowly be cleared. Nothing was lost, just delayed.
Photo of Mark's face, taken in standard office fluorescent.
[staff profile] mark
Hi everybody.

The extended downtime I posted about a few days ago is upon us. Dreamwidth will be going offline in about 45 minutes and will be down for about two hours.

Thanks for all of your patience. I'll see you on the flip side.
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
We're running a little late with tonight's code push (took some extra time to get some last-minute things committed), but we'll be starting tonight's push in a bit. If it goes well, y'all will barely notice. If it goes poorly, well, we'll fix it as quickly as we can!

EDIT: And, we're done! If you notice any problems with the site in general, comment here. If you notice problems with the beta test of the new update page, comment on the reporting post in dw-beta.
Photo of Mark's face, taken in standard office fluorescent.
[staff profile] mark
Our hosting provider, ServerBeach, will be performing some maintenance on our servers next week. They're going to be unplugging and carrying them across the data center and reinstalling them in a new location. They're doing some reorganization and our rack is one of the ones affected.

The main impact from this is that Dreamwidth will be hard down (no response at all and no maintenance page) from 9AM to 11AM PST (1700 to 1900 UTC) on Tuesday, December 13th. There is a secondary impact in that when we come back up, none of our caches will be warm. The site will be slow for some period of time after it comes back -- up to several hours.

I will post again when we get closer to the outage. As always, we will update our @dreamwidth Twitter account as this happens.
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
We are scheduling a 2-hour code push and maintenance window starting at 11PM EST Fri, 9 Dec and ending at 1AM EST Sat, 10 Dec (beginning 4AM UTC Sat, 10 Dec; convert to your local time).
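If you would rather compute the conversion than look it up, the following Python sketch shows one way to do it using only the standard library. The year 2011 is an assumption inferred from the dated posts elsewhere on this page, EST is treated as a fixed UTC-5 offset, and the `to_offset` helper is a made-up name for the example.

```python
from datetime import datetime, timedelta, timezone

# EST as a fixed offset of five hours behind UTC.
EST = timezone(timedelta(hours=-5), "EST")

# Window start: 11PM EST on Friday, 9 December (year assumed to be 2011).
start_est = datetime(2011, 12, 9, 23, 0, tzinfo=EST)
end_est = start_est + timedelta(hours=2)

start_utc = start_est.astimezone(timezone.utc)
end_utc = end_est.astimezone(timezone.utc)


def to_offset(moment, utc_offset_hours):
    """Convert an aware datetime to a fixed UTC offset given in hours."""
    return moment.astimezone(timezone(timedelta(hours=utc_offset_hours)))


print(start_utc.strftime("%a %d %b %H:%M UTC"))
print(end_utc.strftime("%a %d %b %H:%M UTC"))
# Example: the same start time for someone at UTC+1.
print(to_offset(start_est, 1).strftime("%a %d %b %H:%M"))
```

A fixed offset is enough here because the whole window falls in December; for dates near a daylight-saving change, a full time zone database would be the safer choice.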

In the best case scenario, this push will involve putting the site into "maintenance mode" very briefly (five minutes or so) during that window to run updates. In the worst case scenario, there may be some unexpected errors after the push: this update will include a lot of backend changes and reorganization, and although these changes have been tested, there will always be that one bug that doesn't show up until you've suddenly got thousands of users trying a whole bunch of things all at once! We'll be on top of things to watch out for problems, but we wanted to give you a bit of warning, and that's why we're scheduling the full 2-hour window even though if all goes well it shouldn't take anywhere near that long.

Either [staff profile] fu or I will post a reminder before maintenance begins.
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
LJ has let us know that they've re-enabled third-party application access to the site, which has allowed us to re-enable imports from LJ and means that crossposting to LJ will be less likely to fail. (Crossposts still may fail due to temporary site unavailability; as mentioned in the previous update, until you get a failure message in your inbox, the crosspost is still retrying even if it hasn't actually posted to the remote account yet.)

Thanks to LJ for keeping us in the loop on this. We salute y'all over there for your tireless efforts to stay online throughout this DDoS.
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
The DDoS mitigation steps that LiveJournal is taking to make sure their users can continue to access the site have, unfortunately, meant that third-party applications (such as DW's importer!) temporarily can't access the site. This means that importing from LJ is temporarily disabled again, because all imports would fail due to being blocked. We'll let you know when they're re-enabled!

Many people are also experiencing problems with crossposts. If you try to crosspost an entry and it doesn't immediately crosspost, your crosspost attempt has not mysteriously wandered off into the deepest darkest depths of the internet: until you receive a failure message in your inbox or the crosspost attempt succeeds, the worker is still trying to contact LJ in order to make your crosspost, even if you then edit the entry and the crosspost box is unchecked. (The crosspost box will not be checked on the entry edit page until the crosspost is successful.) So, just hang tight until you get the final failure message in your inbox, then wait a day or two and try to crosspost again by editing the entry, checking the crosspost box, and hitting 'save'.

Because of the problems with accessing LJ, and in order to help reduce the traffic on LJ's servers, we've lowered the number of times a crosspost attempt will try to contact LJ before failing and we've extended the delays between attempts (it was five tries at intervals of 10 seconds, 30 seconds, 60 seconds, 5 minutes, and 10 minutes; now it's three tries at 5 minutes, 15 minutes, and 30 minutes). This may help more crossposts to succeed because of the reduced load, or it might have no effect at all, but we're hoping it will also help reduce the traffic to LJ and help them recover from their DDoS more quickly.
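To make the two retry schedules concrete, here is a minimal Python sketch. It is illustrative only and is not the crosspost worker's actual code. It assumes each listed delay is waited before the corresponding try (the post does not say exactly where the waits fall), and the names `OLD_DELAYS`, `NEW_DELAYS`, `attempt_offsets`, and `crosspost_with_retries` are made up for the example.

```python
import time
from itertools import accumulate

# Delays in seconds, taken from the figures quoted above.
OLD_DELAYS = [10, 30, 60, 5 * 60, 10 * 60]   # five tries
NEW_DELAYS = [5 * 60, 15 * 60, 30 * 60]      # three tries


def attempt_offsets(delays):
    """Seconds after the original request at which each try happens."""
    return list(accumulate(delays))


def crosspost_with_retries(attempt, delays, sleep=time.sleep):
    """Wait each delay in turn, then call attempt(); stop on first success.

    attempt: a callable returning True on success, False on failure.
    sleep: injectable so the schedule can be exercised without waiting.
    """
    for delay in delays:
        sleep(delay)
        if attempt():
            return True
    return False  # every try failed; this is when a failure notice is due


old_offsets = attempt_offsets(OLD_DELAYS)
new_offsets = attempt_offsets(NEW_DELAYS)
print(old_offsets)
print(new_offsets)

# Dry run: the first try fails, the second succeeds, and no real waiting occurs.
waits = []
outcomes = iter([False, True])
succeeded = crosspost_with_retries(lambda: next(outcomes), NEW_DELAYS, sleep=waits.append)
print(succeeded, waits)
```

Under that assumption, the old schedule gives up after roughly 17 minutes with five requests, while the new one keeps trying for 50 minutes but sends only three requests, which is where the reduction in traffic to LJ comes from.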

Best of luck to LJ with their DDoS mitigation efforts!
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
After a bunch of testing, we are pleased to announce that we've been able to re-enable imports from LiveJournal. Apologies for the delay: we were waiting for the LJ cookie changes to stabilize and be documented!

Thanks go to LJ for making their changes more backwards-compatible. If you'd like to start a new import from LJ, you can do so at the Import Content page.
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
The code push is now complete, and we're working on fixing the issues that have been reported!

(Biggest issue right now: JavaScript is not being properly loaded on the site. [staff profile] fu and the rest of the Usual Suspects are working to fix it.)
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
We are now beginning our code push. The site will be down for a few minutes during the process. (And when it comes back up, we will have the new update page beta omg omg omg.)
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
As our Halloween gift to y'all, we are planning a fairly impromptu code push sometime in the next few hours. (It's hard for us to schedule these further ahead of time, due to the difficulty of timezones; [staff profile] fu and I live 12 timezones apart!) The site will be down for approximately 2 minutes as part of the process. We will update here and on our offsite status Twitter account when the process begins.

As we mentioned in this week's [site community profile] dw_news update, this code push includes a new profile section for communities you maintain, tied to the display of the list of communities you are a member of. To disable this list, uncheck the "communities" box in the final section of the Edit Profile page.

This code push will also activate the beta testing of our new Create Entries page. Check out [site community profile] dw_beta for more details.

More details on changes going live in this push can be found in the following code tours:

11 Aug - 24 Aug (currently live on site)
25 Aug - 5 Sep
6 Sep - 25 Sep
26 Sep - 11 Oct
12 Oct - 20 Oct
21 Oct - 30 Oct

Highlights of this push can be found in the 21 October [site community profile] dw_news entry.

omg the new create entries page is going into beta omg
Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome
[staff profile] denise
Due to the previously announced login cookie and protocol changes LiveJournal has made, our content import system is having serious trouble reaching the site or accessing the exportable data in order to import it. Many people have noticed that our import queue is tremendously backed up, and is hanging on the 'verify' step (which is intended to just be a quick check that the username and password provided are accurate before starting the whole import process). We've looked into things, and it appears to be a combination of the changes LJ made and a bug in the import verification step that causes the process to hang when the remote site isn't available or is refusing logins.

Until we can work out the best way to handle the changes that LJ has made, we've temporarily disabled imports from LiveJournal, in order to allow people to still import from InsaneJournal (our other current import source) without getting stuck in the endless wait with the LJ import queue.

This is only a temporary step and is being used until we can make the necessary fixes, although unfortunately I can't estimate how long that will be. (We're currently waiting for more information from LJ on the changes they've made, and it's looking like we will need them to make a few additional code changes before we can make it work again; as things stand right now, we would only be able to log you into the site -- necessary for accessing the export files we use for importing -- by screen-scraping the login page, which is in violation of their bot access guidelines and very bad internet manners in addition.)

In the meantime, if you have requested an import from LJ that has neither failed nor succeeded -- you get a message in your inbox when that happens -- just hang tight. When we re-enable imports, the queued imports will go back to running.

I'm really sorry about the hassle, everyone. We will work over the next few days to see what we can do to re-enable importing from LJ, and I'll update everyone here if and when it happens.
Page generated Jan. 29th, 2012 02:27 am