mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
Mark Smith ([staff profile] mark) wrote in [site community profile] dw_maintenance 2019-12-02 11:50 am

Notifications slow -- but recovering

Hi all,

Due to some behind-the-scenes maintenance last night, our notifications system got delayed. I've fixed the issue now and it's working on catching up.

For details -- I've been experimenting with Kubernetes as a way to make managing production easier (and hopefully reduce costs!), but it turns out that one of our worker jobs that handles notifications doesn't use much CPU (it mostly spends time waiting on the database).

Since CPU usage stayed low, the pod autoscaler reduced the size of that particular deployment below what we needed to sustain throughput on our notifications service. The temporary fix is to pin that deployment size to something much larger; the better fix will be to integrate Kubernetes' pod autoscaler with the ability to monitor the queue depth on our task queue.
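
To illustrate the failure mode, here is a minimal sketch (not Dreamwidth's actual configuration) of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function name, the replica limits, and all of the numbers are assumptions chosen for illustration, and the real autoscaler also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=50):
    """Simplified HPA rule: scale by the ratio of observed to target metric."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (hypothetical values).
    return max(min_replicas, min(max_replicas, raw))


# Scaling on CPU: workers that mostly wait on the database sit near idle
# (assumed 5% average utilization against a 50% target), so the deployment
# shrinks even though work is piling up.
cpu_based = desired_replicas(10, current_metric=5, target_metric=50)

# Scaling on queue depth: an assumed 400 pending jobs per worker against a
# target of 100 per worker pushes the deployment up instead.
queue_based = desired_replicas(10, current_metric=400, target_metric=100)

print(cpu_based, queue_based)
```

With these made-up inputs the same rule shrinks the deployment when fed CPU and grows it when fed queue depth, which is why the choice of metric matters for an I/O-bound worker.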

Sorry for the trouble, and thank you to the person who pinged us on Twitter. When I checked last night, everything was working, but as traffic came back up we fell behind and I wasn't watching anymore. My bad.
ninetydegrees: Art: self-portrait (Default)

[personal profile] ninetydegrees 2019-12-02 08:04 pm (UTC)(link)
I hate to be that user but I stopped getting notifications (email and inbox) almost 2 years ago and never got any reply to my request. Is there *anything* I can do on my side to solve this issue?
mildred_of_midgard: (Default)

[personal profile] mildred_of_midgard 2019-12-02 08:08 pm (UTC)(link)
Glad you got it catching up!

Is there a particular database bottleneck? (asks the DBA)
siderea: (Default)

[personal profile] siderea 2019-12-02 08:37 pm (UTC)(link)
Oh thank goodness - I thought it was on my end and was getting a bit panicky that my email service (which I also use for business email) was bouncing incoming email (again!) Thanks for letting us know, very appreciated!
kore: (Default)

[personal profile] kore 2019-12-02 08:46 pm (UTC)(link)
Thank you for keeping us all in the loop!
dennisgorelik: 2020-06-13 in my home office (Default)

Watchdog service

[personal profile] dennisgorelik 2019-12-02 08:48 pm (UTC)(link)
Could you add a "watchdog" service that will check every hour how far behind your message-sending queue is?
Then, if the queue is more than 2 hours behind, it would email a notification to you (to devops).
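
A minimal sketch of the check described above, assuming the queue can report the enqueue time of its oldest pending job. The names `queue_lag`, `should_alert`, and `MAX_LAG` are hypothetical, and fetching that timestamp and sending the email are left out because they depend on the queue and mail setup in use.

```python
from datetime import datetime, timedelta, timezone

# Alert threshold from the suggestion above: more than 2 hours behind.
MAX_LAG = timedelta(hours=2)


def queue_lag(oldest_pending_at, now):
    """Return how long the oldest pending job has been waiting."""
    if oldest_pending_at is None:
        # An empty queue is not behind at all.
        return timedelta(0)
    return max(now - oldest_pending_at, timedelta(0))


def should_alert(oldest_pending_at, now, max_lag=MAX_LAG):
    """True when the queue is further behind than the allowed lag."""
    return queue_lag(oldest_pending_at, now) > max_lag


# Example timestamps (made up) for one hourly check.
now = datetime(2019, 12, 2, 20, 0, tzinfo=timezone.utc)
behind = should_alert(datetime(2019, 12, 2, 16, 30, tzinfo=timezone.utc), now)
caught_up = should_alert(datetime(2019, 12, 2, 19, 45, tzinfo=timezone.utc), now)
empty = should_alert(None, now)
```

A scheduler such as cron would run this once an hour and send the email whenever `should_alert` returns `True`.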
trobadora: (Default)

[personal profile] trobadora 2019-12-02 08:55 pm (UTC)(link)
Thank you, so glad to hear it's catching up!

[personal profile] justice 2019-12-02 09:48 pm (UTC)(link)
I'm glad someone reached out to you on Twitter because I tried the support@dreamwidth.org method, and it turns out it just goes into the support queue. The Twitter said not to reach out to you there - but is it what you'd prefer in scenarios like this one where even the site notifications wouldn't be reaching you?
sovay: (Rotwang)

[personal profile] sovay 2019-12-02 09:51 pm (UTC)(link)
Thanks for the explanation!
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2019-12-02 09:53 pm (UTC)(link)
It's fine to ping us on Twitter for a problem or outage that's affecting a bunch of people! It's more that we can't do individual troubleshooting in 280 characters for specific problems only affecting one person.
mildred_of_midgard: (Default)

[personal profile] mildred_of_midgard 2019-12-02 09:55 pm (UTC)(link)
Makes sense!
niqaeli: cat with arizona flag in the background (Default)

[personal profile] niqaeli 2019-12-02 10:14 pm (UTC)(link)
Truly, one of the many best things about DW is that I can rely on y'all to consistently and without bullshit cop to whatever caused the last major problem! Even when it's "yeah, so today I have custody of the commit-and-ditch pony." <3

[personal profile] justice 2019-12-02 10:16 pm (UTC)(link)
Thanks! I'll keep that in mind for the future.
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Watchdog service

[personal profile] dennisgorelik 2019-12-02 10:50 pm (UTC)(link)
> I hadn't ported it over to Kubernetes yet

That makes me wonder what other Dreamwidth backend services may be silently falling behind now?

Are you sure that the switch to Kubernetes was the right call?
If your bottlenecks are:
1) Database performance.
2) Complexity of monitoring your queues.
Then Kubernetes is, probably, not the right tool to address these issues, right?
schematise: (5)

[personal profile] schematise 2019-12-02 11:06 pm (UTC)(link)
Bad pod autoscaler!

Kubernetes is brilliant when it works, though. We use it for our media streaming servers and its ability to just self-heal and self-manage is amazing once the setup is steady! I hope it continues to be everything you dreamed of.
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

Re: Watchdog service

[staff profile] denise 2019-12-02 11:28 pm (UTC)(link)
Ultimately, the problem is that DW-as-a-codebase and DW-as-a-platform were both born in 1999 (as LiveJournal), and at the time there weren't any of the contemporary backend services that people can choose from now. The prototypes we-as-LJ invented to solve scaling problems were what morphed into those services (in some cases directly, in some cases as parallel evolution), but there have been a lot of newer backend services coming and going in those 20 years. We haven't made use of what's been developed in a lot of cases because the existing "battle-hardened" prototypes worked just fine, they had been optimized for our use cases over that exceptionally long time, a rewrite would run the risk of introducing more issues that weren't worth the performance gain we'd get from the switch, and a lot of the examples would require more rearchitecting than a platform that was written from the beginning to swap those newer services more easily would need. Which is not to say that we don't do the work when we need to, when the benefits we'd get start to outweigh the downsides, but it's not as simple a switch as it would be on a project that had its genesis more recently.

It's a problem we run into pretty often -- I actually have a talk I give at tech conferences called "When Your Code Is Old Enough To Vote" about, among other things, how to do the cost/benefit analysis for deciding whether or not to make major rewrites to benefit from new systems and services that were developed long after you started. If you jump on every new advance in technology, your list of "modernization projects we are in the middle of" outgrows your "new features/enhancements" list really, really quickly. We are in the middle of a lot of modernization projects right now (some you can see; some backend) and we have to be really judicious about which additional ones we take on.
metahacker: Half of an unusual keyboard, its surface like two craters with keys within. (keys)

Re: Watchdog service

[personal profile] metahacker 2019-12-02 11:37 pm (UTC)(link)
I actually have a talk I give at tech conferences called "When Your Code Is Old Enough To Vote"

Awesome! Is this a talk you make available other ways, or am I going to have to get invited to the right conferences?
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

Re: Watchdog service

[staff profile] denise 2019-12-02 11:53 pm (UTC)(link)
I have an older version of the slides at https://www.slideshare.net/mobile/dreamwidth/when-your-code-is-nearly-old-enough-to-vote but I don't think there's a video of me delivering it anywhere -- I'll look, though!
nonelvis: (Default)

Re: Watchdog service

[personal profile] nonelvis 2019-12-03 12:02 am (UTC)(link)
This was fascinating. Thanks for the link!
devilbear: Markiplier with bright red hair is in the process of falling. The word "HECK" indicates his reaction to the situation. (Heck!)

[personal profile] devilbear 2019-12-03 12:06 am (UTC)(link)
Hi! I don't know if this is the right place to put this, since I basically rarely use this site yet. I'm assuming that it's maintenance related because this definitely didn't use to be a function of the site...?

Anyway, I had a lot of my journal layouts in a private community and the coding, put within textboxes, is now broken. It would seem that things like @ keyframes [without spaces; I just don't want to ping a person] have been turned into the raw username code for mentioning a user. Which means that now when I try to copy-paste the layout codes featuring that, and presumably anything else using @ in it, I just get a completely broken layout I then have to go through and fix.

Is there a way to disable this and revert the changes without all of my entries needing manual fixing / being ruined?
ilyena_sylph: picture of Labyrinth!faerie with 'careful, i bite' as text (Default)

Re: Watchdog service

[personal profile] ilyena_sylph 2019-12-03 12:19 am (UTC)(link)
I watched the video of you giving it somewhere!!!!
ilyena_sylph: picture of Labyrinth!faerie with 'careful, i bite' as text (Default)

[personal profile] ilyena_sylph 2019-12-03 01:24 am (UTC)(link)
Hey, this is from a few months ago when Mark added an "/@username" feature in the Markdown step.

Last I heard, they're working on a fix for the stuff in textboxes and such, so, speaking as just another DW-izen... hang on a bit still?
Edited 2019-12-03 01:24 (UTC)
