dw_maintenance | Notifications slow -- but recovering

Hi all,

Due to some behind-the-scenes maintenance last night, our notifications system got delayed. I've fixed the issue now and it's working on catching up.

For details -- I've been experimenting with Kubernetes as a way to make managing production easier (and hopefully reduce costs!), but it turns out that one of our worker jobs that handles notifications doesn't use much CPU (it mostly spends time waiting on the database).

This caused the pod autoscaler to reduce the size of that particular deployment below what we needed to sustain throughput on our notifications service. The temporary fix is to pin that deployment size to something much larger; the better fix will be to integrate Kubernetes' pod autoscaler with the ability to monitor the queue depth on our task queue.
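To make the failure mode concrete, here is a minimal sketch of the replica calculation given in the Kubernetes Horizontal Pod Autoscaler documentation (desired replicas = ceil(current replicas * current metric / target metric)). The `desired_replicas` helper, the worker counts, the CPU percentages, and the per-pod queue target are all illustrative assumptions, not Dreamwidth's actual configuration, and the real autoscaler also applies tolerances and stabilization windows that are left out here.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1) -> int:
    """HPA-style rule: scale in proportion to (current metric / target metric)."""
    wanted = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, wanted)

# Hypothetical: 20 workers that mostly wait on the database, so they sit
# at about 5% CPU against a 50% CPU target. The rule scales them down.
cpu_based = desired_replicas(20, current_metric=5, target_metric=50)

# Same 20 workers, scaled on queue depth instead: 2,000 queued tasks per
# pod on average against a target of 500 queued tasks per pod.
queue_based = desired_replicas(20, current_metric=2000, target_metric=500)

print(cpu_based)    # 2
print(queue_based)  # 80
```

With these made-up numbers, the CPU signal says "almost idle" while the queue signal says "badly behind", which is the mismatch described above.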

Sorry for the trouble, and thank you to the person who pinged us on Twitter. When I checked last night, everything was working, but as traffic came back up we fell behind and I wasn't watching anymore. My bad.

Yeah, that's a good call! We have something like that, but I hadn't ported it over to Kubernetes yet. I manually inspected things and saw they were working and hoped they would keep working for 24 hours...

Fool's hope, of course.

> I hadn't ported it over to Kubernetes yet

That makes me wonder what other Dreamwidth backend services may be silently falling behind now?

Are you sure that the switch to Kubernetes was the right call?
If your bottlenecks are:
1) Database performance.
2) Complexity of monitoring your queues.
Then Kubernetes is, probably, not the right tool to address these issues, right?

Ultimately, the problem is that DW-as-a-codebase and DW-as-a-platform were both born in 1999 (as LiveJournal), and at the time there weren't any of the contemporary backend services that people can choose from now. The prototypes we-as-LJ invented to solve scaling problems were what morphed into those services (in some cases directly, in some cases as parallel evolution), but there have been a lot of newer backend services coming and going in those 20 years. We haven't made use of what's been developed in a lot of cases because the existing "battle-hardened" prototypes worked just fine, they had been optimized for our use cases over that exceptionally long time, a rewrite would run the risk of introducing more issues that weren't worth the performance gain we'd get from the switch, and a lot of the examples would require more rearchitecting than a platform that was written from the beginning to swap those newer services more easily would need. Which is not to say that we don't do the work when we need to, when the benefits we'd get start to outweigh the downsides, but it's not as simple a switch as it would be on a project that had its genesis more recently.

It's a problem we run into pretty often -- I actually have a talk I give at tech conferences called "When Your Code Is Old Enough To Vote" about, among other things, how to do the cost/benefit analysis for deciding whether or not to make major rewrites to benefit from new systems and services that were developed long after you started. If you jump on every new advance in technology, your list of "modernization projects we are in the middle of" outgrows your "new features/enhancements" list really, really quickly. We are in the middle of a lot of modernization projects right now (some you can see; some backend) and we have to be really judicious about which additional ones we take on.

> I actually have a talk I give at tech conferences called "When Your Code Is Old Enough To Vote"

Awesome! Is this a talk you make available other ways, or am I going to have to get invited to the right conferences?

I have an older version of the slides at https://www.slideshare.net/mobile/dreamwidth/when-your-code-is-nearly-old-enough-to-vote but I don't think there's a video of me delivering it anywhere -- I'll look, though!

This was fascinating. Thanks for the link!

I watched the video of you giving it somewhere!!!!

Thanks! very helpful.

(I mean, I'm not stuck in a 35+ year old codebase any more, but...still relevant.)

So what you are saying is that even if there are reasons to switch to Kubernetes, you still should be cautious in order to prevent increase in complexity.

But you do not even have compelling reasons to switch to Kubernetes, right?

Not sure how familiar you are with Kubernetes vs hand-maintained-legacy-stuff, but it's actually a pretty compelling story to move in the direction of having something that can autoscale based on performance inputs (CPU, queue depth, etc).

In the legacy stack, if we're falling behind I have to log in and add more cron jobs/spin up new workers. Given that Dreamwidth is a side project for the tech staff (none of us are full-time, we don't make that much money!), I'm not actually going to log in and rebalance workers very often. User experience suffers.

Kubernetes (and other technologies like it, ECS, Fargate, Nomad, etc) give us the ability to tie performance metrics to scaling decisions and then we get a better user experience and a lower technical staff overhead for management.

FWIW, this is only a possibility because we're using managed Kubernetes. I don't have to run it myself and our use case is small enough (~20 nodes) that we aren't really going to run into the gnarly edge cases that SIG Scalability talk about. It does add some complexity, but that's a trade I'm willing to make for flexibility and the possibility of a better user experience.

> if we're falling behind I have to log in and add more cron jobs/spin up new workers

Why do you need more workers if the bottleneck is in the database?
Usually, calling application code is cheap from a performance perspective (microseconds), but database calls (retrieving/saving data) take much longer (milliseconds to seconds).

How many emails does Dreamwidth send per day?

Somewhere at about a million, I think I heard last time we discussed notifications.

Edited (Spelling, ugh) 2019-12-08 00:54 (UTC)

Are you using Amazon SES to send Dreamwidth emails?

It looks like the SPF record is not correctly configured in the "Dreamwidth - Amazon SES" setup.
This is part of the email headers I received for your reply comment notification:
~~~~~~~~
Received-SPF: fail (google.com: domain of 0100016ee300049e-64c13222-3e92-4437-b31f-9d7dbd124d02-000000@amazonses.com does not designate 209.85.220.69 as permitted sender) client-ip=209.85.220.69;
~~~~~~~~

I don't mean to be rude, but what part of that Received-SPF header do you think has anything to do with Dreamwidth?

Most likely Dreamwidth.org DNS is misconfigured (specifically, SPF record), and that is causing Gmail to classify emails that Amazon SES sends on behalf of Dreamwidth.org -- as "SPF: fail".

In order to make it right, Dreamwidth.org DNS should explicitly declare that Amazon SES IP addresses have permission to send emails on behalf of dreamwidth.org

I'm sorry, but that's not correct, and no part of the header you showed has anything to do with Dreamwidth. An SPF check is performed against the SPF record of the envelope sender's domain. In this case the Received-SPF header states that the sender is something at amazonses.com, so the relevant SPF record is Amazon's, not Dreamwidth's. If Amazon hasn't allowed the IP address from which you received the message, then it should be a fail.

In this case the IP address from which you received the message appears to be one of Google's, not one of Amazon's or Dreamwidth's. Neither Dreamwidth nor Amazon uses Google to send email, so the failure is correct and there is nothing we need to adjust.

1) Thank you - you are correct.

I see now that there are 2 "Received-SPF" headers in the email sent by dreamwidth.org that I received through gmail.com:
~~~~~~
Received-SPF: fail (google.com: domain of 0100016ee886fbff-ceef5459-337e-4a6a-a337-e936f2318a2d-000000@amazonses.com does not designate 209.85.220.69 as permitted sender) client-ip=209.85.220.69;
~~~~~~
Received-SPF: pass (google.com: domain of 0100016ee886fbff-ceef5459-337e-4a6a-a337-e936f2318a2d-000000@amazonses.com designates 54.240.11.151 as permitted sender) client-ip=54.240.11.151;
~~~~~~

The "pass" record indicates that google.com recognizes the IP address 54.240.11.151 (which belongs to Amazon SES) as a permitted sender.

"fail" record, probably (I am guessing here) refers to email forwarding feature inside of Gmail (forwarding from one email address to another).

I incorrectly assumed that this "fail" record indicates misconfiguration in dreamwidth.org

"pass" record really matters and indicates that dreamwidth.org DNS configured SPF record correctly.

I am sorry for misreading my email headers.

> the relevant SPF record is Amazon's, not Dreamwidth's

Are you sure it is a correct statement?

Wouldn't it be more correct to state that both Amazon DNS SPF and Dreamwidth DNS SPF together define whether specific sender IP address is permitted for sending emails on behalf of dreamwidth.org?

Specifically, Dreamwidth DNS SPF delegates defining email sending permissions to Amazon SES:
~~~~~~~~~~~~~~~~~
dreamwidth.org text =
"v=spf1 include:amazonses.com ~all"
~~~~~~~~~~~~~~~~~

Then Amazon SES uses these permissions to permit 54.240.11.151 IP address to send emails that originate from dreamwidth.org (and other domains that Amazon SES serves)
~~~~~~~~~~~~~
amazonses.com text =

"v=spf1 ip4:199.255.192.0/22 ip4:199.127.232.0/22 ip4:54.240.0.0/18 ip4:69.169.224.0/20 ip4:76.223.180.0/23 ip4:76.223.188.0/24 ip4:76.223.189.0/24 ip4:76.223.190.0/24 -all"
~~~~~~~~~~~~~
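As a rough illustration of why the two Received-SPF headers differ, the sketch below checks both client IPs from the headers against the ip4 ranges of the amazonses.com record quoted above. It is deliberately simplified: the `spf_ip4_result` helper is hypothetical, it handles only `ip4:` mechanisms followed by `-all`, and it ignores `include:`, `a`, `mx`, and the other parts of a real SPF evaluation.

```python
import ipaddress
from typing import Iterable

# ip4 ranges copied from the amazonses.com SPF record quoted above.
AMAZONSES_IP4 = [
    "199.255.192.0/22", "199.127.232.0/22", "54.240.0.0/18",
    "69.169.224.0/20", "76.223.180.0/23", "76.223.188.0/24",
    "76.223.189.0/24", "76.223.190.0/24",
]

def spf_ip4_result(client_ip: str, ranges: Iterable[str]) -> str:
    """Return "pass" if client_ip falls in any ip4 range, else "fail" (-all)."""
    ip = ipaddress.ip_address(client_ip)
    if any(ip in ipaddress.ip_network(cidr) for cidr in ranges):
        return "pass"
    return "fail"

# The Amazon SES address from the "pass" header.
print(spf_ip4_result("54.240.11.151", AMAZONSES_IP4))  # pass
# The Google address from the "fail" header.
print(spf_ip4_result("209.85.220.69", AMAZONSES_IP4))  # fail
```

54.240.11.151 sits inside 54.240.0.0/18, while 209.85.220.69 matches none of the listed ranges, which agrees with the two headers shown earlier.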

1 million per day is about 12 per second (if email sending speed was consistent during the day).

I would say that a single email sending service with about 10 threads -- should be sufficient to send all your emails.

I would set it up this way:
When it is time to send a message:
1) Take MessageId of the message.
2) Calculate remainder of MessageId dividing by 10:
Remainder = MessageId % 10
If messageId = 123, then Remainder = 3.
3) Every thread will process messages with corresponding "MessageId Remainder".
If Remainder = 0 -- process it by EmailSendingThread0
If Remainder = 1 -- process it by EmailSendingThread1
If Remainder = 2 -- process it by EmailSendingThread2
Etc.
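The steps above can be sketched roughly as follows. This is only an illustration of the remainder-based routing idea, not Dreamwidth's worker code: the `shard_for` and `worker` names are made up, the message IDs are sample values, and the actual send to the mail provider is replaced by appending to a list.

```python
import queue
import threading

NUM_THREADS = 10  # the "about 10 threads" suggested above

def shard_for(message_id: int) -> int:
    """Pick the worker index from the remainder of the message ID."""
    return message_id % NUM_THREADS

def worker(inbox: queue.Queue, sent: list) -> None:
    """Drain one shard's queue; None is the shutdown signal."""
    while True:
        message_id = inbox.get()
        if message_id is None:
            break
        # A real worker would call the mail provider here (hypothetical).
        sent.append(message_id)

inboxes = [queue.Queue() for _ in range(NUM_THREADS)]
sent_by_shard = [[] for _ in range(NUM_THREADS)]
threads = [
    threading.Thread(target=worker, args=(inboxes[i], sent_by_shard[i]))
    for i in range(NUM_THREADS)
]
for t in threads:
    t.start()

# Route ten sample messages, then tell every worker to stop.
for message_id in range(120, 130):
    inboxes[shard_for(message_id)].put(message_id)
for inbox in inboxes:
    inbox.put(None)
for t in threads:
    t.join()

print(shard_for(123))    # 3
print(sent_by_shard[3])  # [123]
```

Fixed routing like this keeps each worker's stream independent, but it only balances load if message IDs are spread evenly across remainders; a single shared queue that all threads pull from is the usual alternative.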

Email sending process is not a heavy process and does not consume too much resources.
The biggest delay -- is the wait when command that is sent to Amazon SES completes (~200 ms).

I'm the support volunteer, not the half owner and developer of the site. I keep up with stats and flag things that are an issue up to the developers.

The knowledge is here; however, the time and ability, due to life, aren't.

Hey, look, I don't think you're coming across the way you want to. Maybe back off a bit and assume that people do know what they're doing, and ask questions that are phrased as being about helping you to understand what's happening. Many of the questions you've been asking sound confrontational, and as if you think you know better than the people who've been working with the codebase for years. It sounds like you could have some really good ideas, but right now I think you're alienating people.

> Maybe back off a bit and assume that people do know what they're doing

I definitely assume that people who run a website with 3 million total users per month know what they are doing. Most developers do not reach that.

I also assume that these same competent developers/devops -- make occasional mistakes. There is no shame in making mistakes. I make mistakes too.

I do not mean that my suggestions are necessarily correct.
Actually many of my suggestions are likely to be suboptimal for implementing on dreamwidth.org -- for one or another reason.
I do not know these reasons, and I am looking forward to feedback from somebody who would point me to these specific reasons.
That would allow me to adjust my suggestions so they fit better to what dreamwidth.org needs.
So, hopefully, some of my suggestions would, actually, help to make dreamwidth.org to work better.

Seconded.

Speaking as a very very non-tech person but one who has used DW since closed beta...

The thing is, notification sending isn't consistent, like, at all. If most of the people in a big community have a notification set up for 'Notify me when there is a new post to this comm', DW suddenly needs to send, say, a thousand emails (halving the numbers I found on a big comm) at once any time a post is made. If ten posts get made in half an hour, that's 10k emails in that half hour, but not spaced evenly because we all want our notifications as fast as possible.

Or if, another hypothetical here, a lot of people have tracking set up on, say, fail_fandomanon, which gets 6k comments per post (posts go up roooooooughly every two days as the 6k mark is reached), all of those people tracking it need instant notification of each comment.

And those are just a couple of examples I can think of, not even counting the dw_news posts, that push notifications to [afaik] basically everyone with an account unless you deliberately set it up not to. That's a lot of email that needs to go out very fast.

> If ten posts get made in half an hour, that's 10k emails in that half hour

That is, actually, routine and easy scenario, because it is well below an average rate of email sending on Dreamwidth.
If Dreamwidth sends 30M emails per day, then it sends 1M+ emails per hour anyway.

> we all want our notifications as fast as possible

What do you think is an acceptable delay for notification in case when posting on dw_news goes up?
Would several minutes delay be acceptable in that situation?

> fail_fandomanon, which gets 6k comments per post

That, probably, generates a lot of emails, but these emails will be almost evenly spread (a lot of relatively small batches of emails).

I think that if service is able to send 10x faster than average email sending speed is - that should be sufficient.
That means that service should be able to send about 12 million emails per hour.
Which is ~200,000 emails per minute.
Which is ~3,300 emails per second.
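For anyone who wants to check the arithmetic, here is the same calculation written out. The 30 million per day figure and the 10x headroom factor are the assumptions stated in this comment, not measured values.

```python
EMAILS_PER_DAY = 30_000_000  # figure assumed in the comment above
HEADROOM = 10                # "10x faster than average"

avg_per_hour = EMAILS_PER_DAY / 24
peak_per_hour = avg_per_hour * HEADROOM
peak_per_minute = peak_per_hour / 60
peak_per_second = peak_per_minute / 60

print(f"{avg_per_hour:,.0f} per hour on average")   # 1,250,000
print(f"{peak_per_hour:,.0f} per hour at 10x")      # 12,500,000
print(f"{peak_per_minute:,.0f} per minute at 10x")  # 208,333
print(f"{peak_per_second:,.0f} per second at 10x")  # 3,472
```

The exact values come out slightly higher than the figures quoted above because those start from a rounded 12 million per hour.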

> Would several minutes delay be acceptable in that situation?

...dude. Several minutes would be freaking amazing.

When dw_news posts go out, notifications, all notifications sitewide, are slowed for at least an hour, normally more like 2. That's what mark is trying to fix with all the work he's doing -- among other things -- and what you wandered into the middle of.
