mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
Mark Smith ([staff profile] mark) wrote in [site community profile] dw_maintenance 2013-07-01 17:17

Database maintenance

Hi all,

I did some database maintenance today -- moving our workers around! -- and this caused a glitch in the replication between our old databases and the new ones, so the new ones weren't getting all the updated data.

What this means to you: if you saw problems trying to update your access list or subscription filters, or with community invitations, or viewing support requests, that was caused by the glitch in replication. I'm really sorry for the inconvenience.

This particular issue won't recur, since it was caused by a very specific circumstance related to moving the workers around. Since I'm done moving them, the problem won't happen again.

Technically:

Right now we're migrating from our old master databases (db01 and db02) to the new pair (db05 and db06). To do this sanely, I have it set up in a replication chain so that any changes made at the top will trickle down to the bottom ones, like this:

db01 -> db02 -> db05 -> db06
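
To make the "trickle down" idea concrete, here is a toy sketch of a replication chain in Python. It is only an illustration: the `Node` class and its `apply` method are made up for this example and have nothing to do with how the real databases replicate internally.

```python
class Node:
    """One database in the chain: applies a write, then forwards it downstream."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream  # next database in the chain, if any
        self.rows = {}

    def apply(self, key, value):
        # Apply the change locally first...
        self.rows[key] = value
        # ...then pass the same change to the next database down the chain.
        if self.downstream is not None:
            self.downstream.apply(key, value)


# Build the chain from the bottom up: db01 -> db02 -> db05 -> db06
db06 = Node("db06")
db05 = Node("db05", downstream=db06)
db02 = Node("db02", downstream=db05)
db01 = Node("db01", downstream=db02)

# A write made at the top of the chain...
db01.apply("user:42", "updated access list")

# ...ends up on every database below it.
print(db06.rows)
```

As long as every write enters at db01, all four copies stay identical, which is what makes the later config switch safe.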

The idea is that, to migrate seamlessly from the old ones to the new ones, at some point in time I just change the configuration files that used to say 01/02 and make them say 05/06. Then, magically and nearly instantaneously, we're using the new databases and after some days I can get rid of the old ones.

Anyway, today I moved our TheSchwartz-based workers (they do notifications, emails, and some other tasks). I switched them to the new database cluster -- but of course, nothing is actually instantaneous. What happened was that some of the web servers started using db05 a split second (literally) before some of the others, so we had a few hundred milliseconds where db01 (the OLD master) received some writes after db05 (the new one) did.

The problem was then that both databases assigned the same number to different jobs. (When a job gets inserted, it gets assigned an ID. Since both databases had a slight overlap where they both thought they were boss, both created the same ID!)
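
Here is a small Python sketch of that collision. The `Master` class, the starting counter of 1000, and the job payloads are all invented for illustration; the only point is that two databases with the same "last ID" will each hand out the same next ID.

```python
class Master:
    """Toy database that assigns the next sequential ID to each inserted job."""

    def __init__(self, name, last_id):
        self.name = name
        self.last_id = last_id
        self.jobs = {}

    def insert_job(self, payload):
        self.last_id += 1
        self.jobs[self.last_id] = payload
        return self.last_id


# Both databases are fully in sync, so they agree on the last ID used.
# (1000 is an arbitrary example value.)
db01 = Master("db01", last_id=1000)
db05 = Master("db05", last_id=1000)

# During the overlap, one server writes to the new master...
id_new = db05.insert_job("send notification email")
# ...and a moment later another server still writes to the old one.
id_old = db01.insert_job("send comment notification")

# Same ID, different jobs.
collision = id_new == id_old and db05.jobs[id_new] != db01.jobs[id_old]
print(id_new, id_old, collision)
```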

This is where the sadness happened, because when db05 tried to replicate the commands that db01 had done in that split second, there was a conflict: two jobs had the same ID. So db05 stopped replicating from db01 (technically from db02), and we didn't get an alert because db05 is going to be a master (i.e., it's not supposed to be replicating long term, so I never set up a replication alarm for it).
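
A rough Python sketch of why one conflicting row blocked everything behind it, plus the kind of check a replication alarm would have made. The `Replica` class, `needs_alert`, and the sample log entries are hypothetical; this is a simplified model, not the real replication code.

```python
class Replica:
    """Toy replica that replays an upstream log and halts on a conflicting ID."""

    def __init__(self, jobs):
        self.jobs = dict(jobs)
        self.running = True
        self.applied = 0  # how many upstream log entries were applied

    def replay(self, log):
        for job_id, payload in log:
            if job_id in self.jobs and self.jobs[job_id] != payload:
                # Same ID, different job: stop rather than overwrite data.
                self.running = False
                return
            self.jobs[job_id] = payload
            self.applied += 1


def needs_alert(replica, log):
    """The check that was missing: is replication stopped or behind?"""
    return (not replica.running) or replica.applied < len(log)


# db05 already holds its own job 1001 from the overlap.
db05 = Replica({1001: "job written directly to db05"})

# The upstream log holds a different job 1001, then unrelated later changes.
upstream_log = [
    (1001, "job written to db01"),
    (1002, "later access list update"),
]

db05.replay(upstream_log)
print(db05.running, db05.applied, needs_alert(db05, upstream_log))
```

Note that the second log entry never gets applied even though nothing is wrong with it. That is why unrelated things like access list updates went missing on the new databases.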

Anyway, someone reported an issue which I tracked down to a replication problem. It's been fixed, the database is now fully replicated, and the problem won't repeat because the switchover has already happened. db05 is now the master for generating job IDs, and db01 is deprecated.

Thanks for reading.

dragondancer5150: (General - Problem Solver)

[personal profile] dragondancer5150 2013-07-02 01:21 (UTC)
Thanks for the detailed explanation! I wasn't on at any time earlier to notice a problem, but I'll be forever grateful that you and Denise (and everyone!) are always really good about telling us what's going on. \o/
Edited 2013-07-02 01:21 (UTC)
azurelunatic: Azz and best friend grabbing each other's noses.  (Default)

[personal profile] azurelunatic 2013-07-02 01:46 (UTC)
Thanks!
jordannamorgan: Bob Crane as Col. Robert Hogan, "Hogan's Heroes". (Trouble)

[personal profile] jordannamorgan 2013-07-02 02:25 (UTC)
Can you people please take over my awful ISP? :Þ
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2013-07-02 02:30 (UTC)
Ha. I don't know if we'd be any better at ISP-ing than they are, but at least we could tell you when something went wrong...
jordannamorgan: Frank McHugh as Francis, "Footlight Parade". (Get Real)

[personal profile] jordannamorgan 2013-07-02 03:02 (UTC)
That in itself is a *huge* thing. At this moment, my ISP's Facebook page is covered with frustrated questions from customers wanting to know why they can't get into webmail--because the company won't take thirty seconds to post "FYI, we're having X problem, and we're working on it". (Just to get that much, I had to *convince* one of those out-of-country customer service chat reps to check whether there was a problem on their end instead of in my computer.)

[/rant] By all of which I just mean to say, every business should be required by law to model its customer service on DW's. :)
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2013-07-02 03:12 (UTC)

Aww. We try!

monk111: (Default)

[personal profile] monk111 2013-07-02 03:02 (UTC)
Might this be affecting cross-posting?
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2013-07-02 03:08 (UTC)
It shouldn't be! If you're getting failures, check your DW inbox, which will have details about what the problem is. (I took a look at the importer logs for your username and don't see any attempted imports, or I'd help you to diagnose things a bit more directly!)
monk111: (Default)

[personal profile] monk111 2013-07-02 03:17 (UTC)
I think I'm exporting rather than importing, trying to cross-post from here to LJ. The inbox says that it "failed to connect" to LJ.
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2013-07-02 03:23 (UTC)

Oh -- yes, cross-posting is a different system than importing. Sorry about the confusion!

That error message means that our servers weren't able to reach LiveJournal's servers at the time you tried to crosspost. It can happen for a number of different reasons, but it's usually a transient error having to do with LJ being unavailable at the exact second the crosspost was attempted. The system will retry the crosspost up to five times, at progressively longer intervals, before failing (each attempt will be numbered in your inbox, so if you only get one failure rather than five, that means the second attempt was successful). After the fifth failure, it won't try again; if that happens, double-check that your LJ password is correct, then edit the post and check the unchecked crosspost box to get it to try again. (Then, if you still keep getting failures after that, open a support request.)
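
For the curious, the retry behavior described above could be sketched roughly like this in Python. This is not the site's actual code: the function name, the 60-second base delay, and the doubling between attempts are all assumptions made up for the example. The only details taken from the description are the five attempts, the growing intervals, and the numbered failure messages.

```python
import time

MAX_ATTEMPTS = 5          # from the description above
BASE_DELAY_SECONDS = 60   # assumption: the real intervals are not stated


def crosspost_with_retries(send, notify, sleep=time.sleep):
    """Try send() up to MAX_ATTEMPTS times, waiting longer after each failure.

    send   -- callable that raises ConnectionError when the remote site is down
    notify -- callable that records a numbered failure message (the "inbox")
    sleep  -- injectable so the example can run without actually waiting
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            send()
            return True
        except ConnectionError as err:
            notify(f"Crosspost attempt {attempt} failed: {err}")
            if attempt < MAX_ATTEMPTS:
                # Hypothetical schedule: 60s, 120s, 240s, 480s
                sleep(BASE_DELAY_SECONDS * 2 ** (attempt - 1))
    return False  # gave up after the fifth failure
```

With this shape, a single failure message followed by nothing else means the second attempt went through, which matches what you would see in your inbox.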

monk111: (Default)

[personal profile] monk111 2013-07-02 03:29 (UTC)
Thanks! I was just noticing that cross-posting was proving to be significantly more difficult today, and I thought it might be connected to this maintenance.
jordannamorgan: The artwork "Ascending and Descending", by M. C. Escher. (No Hero)

[personal profile] jordannamorgan 2013-07-02 03:37 (UTC)
I don't know if it's also related, but I did notice earlier today that a PM I sent was slow to show up. (Both the sent copy in my outbox, and the incoming copy-to-myself in my inbox.) Usually they're there immediately, but it took a considerable span of minutes this morning.
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2013-07-02 03:43 (UTC)

That was an unrelated problem, and should be fixed now!

denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2013-07-02 03:44 (UTC)
As it happens, a bunch of people are reporting the problem (so it's not just a transient error) -- we're looking into whether it's a problem on LJ's end or ours.
subluxate: Sophia Bush leaning against a piano (Default)

[personal profile] subluxate 2013-07-02 09:19 (UTC)
I don't know if you've found out the problem/if it's fixed, nor if this is related, but most of my feeds on LJ updated a whole lotta posts from various blogs (I think three XKCD, three or four Get Rich Slowly...) around the same time today. So it looked to me like they might be having problems.
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2013-07-02 18:38 (UTC)
Not related, no. That's probably a side effect of switching machines; if it happens again, open a support request, but otherwise just chalk it up as "occasional weirdness" :)
littlebutfierce: (k-on mio laptop)

[personal profile] littlebutfierce 2013-07-02 05:48 (UTC)
I did notice for a few hours yesterday in the morning (BST) that PMs weren't appearing in my sent-mail folder, nor getting to their recipients, & same w/comment notifications -- though after several hours this all seemed to be ironed out (& I was in a training all day yesterday so not really having the time to file a support request) -- not sure if this was related? As I said, it seems to be ironed out now, but mentioning it in case it's useful data.
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2013-07-02 18:36 (UTC)
That was something else! And fixed now. :)
littlebutfierce: (10 things win)

[personal profile] littlebutfierce 2013-07-02 19:37 (UTC)
Yay, thanks for letting me know!
kaberett: A sleeping koalasheep (Avatar: the Last Airbender), with the dreamwidth logo above. (dreamkoalasheep)

[personal profile] kaberett 2013-07-02 12:00 (UTC)
Thank you so much for the detailed explanation - I love learning about what is going on with the site.
mildred_of_midgard: (Default)

[personal profile] mildred_of_midgard 2013-07-04 03:23 (UTC)
Forgive me for asking something that I could probably track down if I looked through other posts, but what database platform(s) do you use? As a DBA, I saw the words "database maintenance" and "replication" and couldn't resist asking. ;)
mildred_of_midgard: (Default)

[personal profile] mildred_of_midgard 2013-07-06 02:54 (UTC)
Cool, thanks. (If you'd said Postgres, there would have been a lot more follow-up questions.)