(Getting one more code push in before I'm out who knows how long for surgical recovery!)
EDIT: And we're done! As always, we're watching for issues, but let us know of any problems.
* A dw_news post was posted just before 0800 EDT (see in your time zone). Comment notifications may be delayed for up to an hour or two, due to the high volume of notifications generated by each news post. Please don't worry about missing notifications until at least 1000 EDT.
* If you have a custom journal style, and you're seeing entries in the site default (Practicality/Neutral Good) rather than your custom layout, please recompile your layout layer and then try again. (This code push included S2 changes; we recompiled all the site styles, but we can't force custom styles to recompile.)
The biggest change in this push is a new and improved frontend for community administration tasks, but there's a bunch more! We'll update you shortly after the push with what new stuff you can expect.
We have no reason to believe that anyone was exploiting this vulnerability against us or that any user data has been compromised. We'll be changing our security certificates for extra confidence.
On the other hand, the nature of this vulnerability means that it's impossible for a website to know for absolute certain whether someone was exploiting it. If someone was exploiting the vulnerability, against us or against any other website, they potentially have access to any information you sent to the site, including your username/password for the site and any data you sent to the site under HTTPS. It's a good idea to change your passwords pretty much everywhere, but don't do it until you can verify that a site is no longer vulnerable.
If you have any questions, feel free to ask!
Today we started getting some alerts on the database, so I'm going to do some maintenance to verify the health and wellness of the machine and get it back to a non-alerting status.
I should be able to do this without any downtime, but just in case, you might want to make sure to use your favorite text editor to save a copy of any long entries or comments you're working on.
Once I've got things sorted out, I'll update this with more details for the technically curious.
Update [4:50PM PST]: sb-db06 (the slave) has been rebooted and is recovering. I'm doing system updates on it, since the problem looks like a kernel bug (it struck both databases at the same time). Next: master failover, then recover the other database.
Update [5:05PM PST]: I'm doing what we call a "master failover" now. This means I'm shifting all traffic from the database that was active (sb-db05) to the spare database (sb-db06). I have to shut off "extra" services like imports, feeds, and searches while this happens.
Update [5:30PM PST]: Well, that was unexpectedly bumpy. Sorry for that. There should be no further bumps: we're now on the spare database, so I can do maintenance on the original master.
Update [6:20PM PST]: If you had userpics not loading, they should be back to normal.
EDIT: And we're done! We're watching for issues now, but if you spot anything, sing out.
This push contains some sweeping backend changes, so you either won't notice anything at all, or things will be Very Broken. :) (We're pretty sure things won't be Very Broken, since things have been working out fine in testing, but there's always the chance of things getting screwy when the new changes get widespread adoption.) We'll have everyone on hand to make sure problems get dealt with quickly.
Hi all --
Some have noticed that re-importing to get new comments hasn't been working for a while. This has been fixed; it was an operational issue (the importer cache wasn't being cleaned).
Anyway, if you have been having trouble getting recent comments to import onto DW, things should be working now. Please give it a shot.
Edited: Also, if you are still having importer troubles, please open a new request and let us know here:
Thanks and sorry for the trouble!
* In some rare cases, trying to complete a payment will result in an error message. It seems to be even odds whether the failure happens before or after your card has been charged, but either way, the items won't be applied to your account. If this happens to you, the error message asks you to open a support request in the Account Payments category: please do so! I'll be able to check whether your card has been charged, and if so, make sure you get the items you paid for.
(EDIT: There was another point here about a different problem that had cropped up since last night, but further investigation turned up that it was only a variant of the above, and only a single payment was affected by it. So, false alarm there!)
I'm really sorry about the hassle, folks! We've been trying to work out what's causing this to happen, but no luck so far. If you ever have questions about whether or not your payment has gone through, just open a support request to ask, and I'll get to it as soon as possible. (This weekend there might be some delay, since my sister's getting married! But usually it's a pretty quick turnaround.)
1) Random "403 Forbidden" errors when loading site pages, mostly in journals but sometimes on site pages.
2) Shop carts paid for by credit card where you get an error when trying to check out, then the cart is set to "waiting for payment" status.
3) When crossposting to LiveJournal, the crosspost not going through and error messages in your inbox that read "Failed to connect to http://www.livejournal.com/interface/xm
All three of these errors are only happening occasionally.
The first two are problems on our end that we're working to track down -- we've added some extra debugging code that should help us pinpoint the cause, and we'll get it fixed as quickly as possible after that. (Things that only happen occasionally are very hard to diagnose and fix, since you can't always know for sure what the cause is, and you can't usually verify that the fixes you're putting into place actually fix the problem.)
The crossposting problem is, unfortunately, on LJ's end and not on ours: that error means that LJ was unreachable at the time the crosspost attempt was sent. If you get the error, your crosspost attempt will retry up to five times, at increasingly longer intervals, before sending you a final failure notice in your inbox. Once that happens, you can edit the post and re-check the crosspost box, then save the post, to start it trying again. Whether or not it succeeds depends entirely on whether we can reach LJ at the time it tries. (It doesn't matter whether you can load LJ on your own computer when that happens -- the crosspost attempt is sent from our servers, so our servers have to be able to reach LJ, not your computer.)
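For the technically curious, the retry behavior described above can be sketched roughly like this. This is an illustration only, not our actual crossposting code: the function names, the 60-second starting delay, and the doubling of the delay between retries are all assumptions made for the example. The only details taken from this post are "up to five retries", "increasingly longer intervals", and "a final failure notice at the end".

```python
import time
from typing import Callable, List


def retry_delays(max_retries: int = 5, base_delay: float = 60.0) -> List[float]:
    """Return the wait (in seconds) before each retry.

    Doubling the delay each time is an assumption for illustration;
    the post only says the intervals get increasingly longer.
    """
    return [base_delay * (2 ** i) for i in range(max_retries)]


def crosspost_with_retries(
    send: Callable[[], bool],
    notify_failure: Callable[[], None],
    max_retries: int = 5,
    base_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try a crosspost once, then retry up to max_retries times.

    `send` is a hypothetical callable that returns True when the remote
    site was reachable and accepted the post. `notify_failure` is a
    hypothetical callable standing in for the final inbox notice.
    """
    if send():
        return True
    for delay in retry_delays(max_retries, base_delay):
        sleep(delay)  # wait longer before each successive retry
        if send():
            return True
    notify_failure()  # every retry failed: send the final failure notice
    return False
```

Note that `send` runs on the server side in this sketch, which mirrors the point above: what matters is whether our servers can reach the remote site at the moment of each attempt, not whether your own computer can.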
If you run into problems with a payment, open a support request in the Account Payments category, and I'll get things fixed up for you as soon as I possibly can. (There's only one of me, though, so it probably won't be an immediate response, unless you happen to catch me while I'm sitting right in front of the computer!)
If you run into either of the other two problems, you can let us know by leaving a comment here, just so we can get a rough sense of how often the problems are coming up. We might not be able to tell you anything more than what's in this post, though!
If you're having an issue that isn't one of these three, open a support request, describing the issue as thoroughly as possible, and somebody will help you troubleshoot it.
I will be doing a small code push tonight in about an hour or two. It hasn't been long since our last push so this one isn't particularly large, hence the short notice.
I'll post on Twitter when we start. There should be no downtime for this.
The code push and downtime will be happening in one hour. Remember, the site will be down for ~10 minutes while I do a database failover to move us completely on to the new hardware.
I will post again when the push happens, and as always, you can check out our dreamwidth account on Twitter to keep up with the code push!
I'd like to do a brief site maintenance and code push in about 32 hours. The scheduled time:
- 2100 PDT Thursday
- 0400 UTC Friday
This code push will involve a period of downtime. I'm estimating it will take about 10 minutes. During that window, I will be switching us from our old sb-db01 database master to our new sb-db05 cluster. I have to take the site offline for this since it's what we call a master failover, and our system isn't designed to do that without downtime.
After this maintenance, we will be completely on new hardware -- and all of the trusty hardware we've had for the past three years since moving to ServerBeach will be completely retired.
As always, there will be another post here as well as on our Twitter account when the time comes. Feel free to shoot me any questions or comments; I'll be watching this post!
PS HELLO PLURKERS.
EDIT: This problem has been fixed now!
I did some database maintenance today -- moving our workers around! -- and this caused a glitch in the replication between our old databases and the new ones, so the new ones weren't getting all the updated data.
What this means to you: if you saw problems trying to update your access list or subscription filters, or with community invitations, or viewing support requests, that was caused by the glitch in replication. I'm really sorry for the inconvenience.
This particular issue won't recur, since it was caused by a very specific circumstance related to moving the workers around. Since I'm done moving them, the problem won't happen again.
As part of our new hardware project, I'm going to be failing us over to our new load balancers. This will involve a brief downtime for the site while everything fails over, but it should be less than 60 seconds.
Thanks for your patience, and sorry for the interruption!