mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
Mark Smith ([staff profile] mark) wrote in [site community profile] dw_maintenance 2020-05-25 09:48 pm

Tonight's site interruptions

Hi all,

First of all, sorry for the trouble tonight -- I know a lot of you had issues accessing Dreamwidth.

TL;DR: the issue is fixed and service is restored. We solved it by changing the configuration of the web server software we use (Apache).

The basic description of the issue is that we believe all of our Apache workers were being tied up holding keep-alive connections open to clients. After you request a page, we wait up to 5 seconds for you to request another -- and if you don't, we close the connection.
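
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The worker count and page service time are made-up values for illustration, not Dreamwidth's real figures; only the 5 second timeout comes from the description above. It also assumes the worst case, where every client makes one request and then sits idle for the full timeout.

```python
def max_new_connections_per_second(workers, service_seconds, keepalive_seconds):
    """Upper bound on new client connections per second for a worker pool.

    Each connection occupies one worker for the time spent serving the
    request plus the idle keep-alive wait before the connection is closed.
    """
    seconds_held_per_connection = service_seconds + keepalive_seconds
    return workers / seconds_held_per_connection


# Hypothetical pool: 150 workers, 0.25 s to serve a page (both assumed).
with_keepalive = max_new_connections_per_second(150, 0.25, 5.0)
without_keepalive = max_new_connections_per_second(150, 0.25, 0.0)

print(f"with 5 s keep-alive:  {with_keepalive:.1f} new connections/s")
print(f"without keep-alive:   {without_keepalive:.1f} new connections/s")
```

Under these assumptions, most of each worker's time goes to waiting rather than serving, which is why an idle timeout of a few seconds can shrink the effective capacity of the pool so sharply.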

If there are enough users, we can run out of connections, and then you basically have to wait in line for one to free up. That wait is usually very short, but with lots of people trying to use Dreamwidth tonight, we ran out of free connections entirely. This caused a cascading problem, though, because our upstream load balancer _also_ needs connections in order to run the health checks that ask whether the web servers are doing OK.

So, we started seeing oscillating health check failures. Web servers would be marked as failed (because the load balancer couldn't ask them how they were doing), which left us with even fewer connections available and just exacerbated the problem.
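
The feedback loop described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of the real load balancer: client connections are split evenly across the servers in rotation, a health check passes only if the server has a free worker, and a server taken out of rotation drains and passes its next check. The function name and all numbers are hypothetical.

```python
def simulate(worker_counts, demand, steps):
    """Return how many servers are in rotation at each step of a toy model."""
    in_rotation = [True] * len(worker_counts)
    history = []
    for _ in range(steps):
        active = [i for i, up in enumerate(in_rotation) if up]
        history.append(len(active))
        # The load balancer splits client connections evenly across
        # the servers that are currently in rotation.
        share = demand / len(active) if active else 0
        next_state = []
        for i, workers in enumerate(worker_counts):
            if in_rotation[i]:
                # The health check needs a free worker; if clients hold
                # every worker, the check fails and the server is pulled.
                next_state.append(share < workers)
            else:
                # Out of rotation: no client traffic, so workers free up
                # and the next health check passes.
                next_state.append(True)
        in_rotation = next_state
    return history


stable = simulate([100, 100, 100, 100], demand=350, steps=4)
flapping = simulate([100, 100, 100, 100], demand=450, steps=4)
cascade = simulate([100, 100, 120, 120], demand=420, steps=4)
print(stable, flapping, cascade)
```

In this model, a pool with enough headroom stays fully in rotation, while a slightly overloaded one flips between all servers up and all servers down. When capacities differ, the weaker servers fail first and push their load onto the rest, which then fail in turn.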

It was hard to track down because this really shouldn't happen -- the load balancer should be multiplexing requests across different backend connections. We need to do some more investigation to understand why we were seeing this behavior. At any rate, we resolved it by disabling keep-alive.

Again, we're sorry for the interruption of service, and that we didn't tweet faster. That's on me: I forgot to update over there while we were working on the issue.

Big thanks to [personal profile] jennifer and [personal profile] alierak for the help here! Ultimately, it was [personal profile] alierak's suggestion that pointed us in the right direction and got the issue solved.

-Mark
egret: egret in Harlem Meer (Default)

[personal profile] egret 2020-05-26 05:49 am (UTC)(link)
Thank you for getting it running again!
darkoshi: (Default)

[personal profile] darkoshi 2020-05-26 06:15 am (UTC)(link)
Thank you! I'd been wondering if it was a problem on my side or not.
bemused_writer: Noblewoman in blue (Coco)

[personal profile] bemused_writer 2020-05-26 07:44 am (UTC)(link)
Thank you for letting us know and for getting DW up and running again!
coinneachf: (Default)

[personal profile] coinneachf 2020-05-26 08:18 am (UTC)(link)
Load balancer healthcheck issues are waaaay too familiar to me. Good on yer.
weofodthignen: selfportrait with Rune the cat (Default)

[personal profile] weofodthignen 2020-05-26 08:32 am (UTC)(link)
Thank you for the fix and for being open about it (and trying to explain it to non-techies). I'd thought it was at my end.
mc776: The blocky spiral motif based on the golden ratio that I use for various ID icons, ending with a red centre. (Default)

[personal profile] mc776 2020-05-26 08:35 am (UTC)(link)
Y'all are amazing at both fixing and documenting what goes on around here. Thank you.
lunabee34: (Default)

[personal profile] lunabee34 2020-05-26 11:44 am (UTC)(link)
Thank you so much for fixing the problem!
dewline: Text - "On the DEWLine" (Default)

[personal profile] dewline 2020-05-26 01:56 pm (UTC)(link)
Thanks for clearing it up for us.

Also, mildly and perversely grateful because I already needed to call it a night when the problems started affecting my connection to Dreamwidth. Good timing in my case, and I recognize that such was not the case for all my Dreamwidth "neighbours".
darkoshi: (Default)

[personal profile] darkoshi 2020-05-26 02:36 pm (UTC)(link)
Oh, that is a useful site. I'm bookmarking it.
primwood: (Default)

[personal profile] primwood 2020-05-26 02:40 pm (UTC)(link)
Thanks, Mark, Jennifer and Alierak!
havocthecat: the lady of shalott (Default)

[personal profile] havocthecat 2020-05-26 02:57 pm (UTC)(link)
Thanks for all your hard work!
numb3r_5ev3n: 7 from Matrix Online (Default)

[personal profile] numb3r_5ev3n 2020-05-26 03:18 pm (UTC)(link)
Thanks for the explanation and the swift recovery!
kore: (Dreamwidth - green)

[personal profile] kore 2020-05-26 03:21 pm (UTC)(link)
Thank you for being so transparent and fast about it! I love Dreamwidth.
isis: (awesome)

[personal profile] isis 2020-05-26 03:46 pm (UTC)(link)
Thanks for getting on it and fixing it, and also for coming here to tell us about the problem. Dreamwidth is awesome!
paynesgrey: Marilyn (Default)

[personal profile] paynesgrey 2020-05-26 03:49 pm (UTC)(link)
Thanks for all your hard work!
renay: photo of the milky way from new zealand on a clear night (Default)

[personal profile] renay 2020-05-26 05:18 pm (UTC)(link)
Thanks for the quick response and explanation. Y'all are great.
tozka: title character sitting with a friend (Default)

[personal profile] tozka 2020-05-26 05:42 pm (UTC)(link)
Thanks for the quick fix!
monanotlisa: symbol, image, ttrpg, party, pun about rolling dice and getting rolling (Default)

[personal profile] monanotlisa 2020-05-26 06:37 pm (UTC)(link)
Good to hear -- thanks, Mark!
sistawendy: a cartoon of me saying "Praise Bob!" (prabob)

[personal profile] sistawendy 2020-05-26 08:07 pm (UTC)(link)
From one erstwhile ops person to another, I salute you. Your load balancer wouldn't happen to be in AWS, would it? (I know, you shouldn't answer that.) My co-workers observed some annoying behavior in the distant past whereby AWS LBs would just point a firehose at one server until it fell over, then move on to the next one instead of proper round robin.
sine_nomine: (Default)

[personal profile] sine_nomine 2020-05-26 09:44 pm (UTC)(link)
Tried that last night. It insisted the site was up. While I was getting 504 messages.

Pobody's nefect?
claidheamhmor: (Default)

[personal profile] claidheamhmor 2020-05-28 06:22 pm (UTC)(link)
Thank you for the clear explanation. Load balancers can be fun!
mudousetsuna: (Default)

[personal profile] mudousetsuna 2020-06-09 10:35 pm (UTC)(link)
Would this error have anything to do with why the LJ Juggler extension on Firefox no longer works? It keeps saying that I have probably typoed the username or password, and I know I haven't.
cesy: "Cesy" - An old-fashioned quill and ink (Default)

[personal profile] cesy 2020-06-10 09:58 am (UTC)(link)
mudousetsuna: (Default)

[personal profile] mudousetsuna 2020-06-10 02:09 pm (UTC)(link)
Oh I see now, thank you very much!