mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
Mark Smith ([staff profile] mark) wrote in [site community profile] dw_maintenance 2019-12-02 11:50 am

Notifications slow -- but recovering

Hi all,

Due to some behind-the-scenes maintenance last night, our notifications system got delayed. I've fixed the issue now and the system is catching up.

For details -- I've been experimenting with Kubernetes as a way to make managing production easier (and hopefully reduce costs!), but it turns out that one of our worker jobs that handles notifications doesn't use much CPU (it mostly spends time waiting on the database).

This caused the pod autoscaler to reduce the size of that particular deployment below what we needed to sustain throughput on our notifications service. The temporary fix is to pin that deployment size to something much larger; the better fix will be to integrate Kubernetes' pod autoscaler with the ability to monitor the queue depth on our task queue.
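
As a rough illustration of why CPU-based scaling shrank the deployment, here is a minimal Python sketch of the replica formula given in the Kubernetes Horizontal Pod Autoscaler documentation: desired = ceil(current * currentMetric / targetMetric). It ignores tolerances, stabilization windows, and min/max bounds, and every number in it (replica count, CPU percentages, queue depths, targets) is made up for illustration and is not Dreamwidth's real configuration.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Replica count per the documented HPA formula:
    ceil(current_replicas * current_metric / target_metric), never below 1."""
    return max(1, math.ceil(current_replicas * current_metric / target_metric))


# Hypothetical numbers, for illustration only.
replicas = 20

# CPU-based scaling: workers sit at 5% CPU while waiting on the database,
# against a 50% CPU target, so the autoscaler shrinks the deployment.
by_cpu = desired_replicas(replicas, current_metric=5.0, target_metric=50.0)

# Queue-depth-based scaling: 4,000 queued jobs per worker on average,
# against a target of 1,000 per worker, so the deployment grows instead.
by_queue = desired_replicas(replicas, current_metric=4000.0, target_metric=1000.0)

print(by_cpu, by_queue)  # 2 80
```

With a metric that tracks the backlog rather than CPU, the same formula scales the workers up while they are falling behind, which is the behavior the better fix is after.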

Sorry for the trouble, and thank you to the person who pinged us on Twitter. When I checked last night, everything was working, but as traffic came back up we fell behind and I wasn't watching anymore. My bad.
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Watchdog service

[personal profile] dennisgorelik 2019-12-02 10:50 pm (UTC)
> I hadn't ported it over to Kubernetes yet

That makes me wonder what other Dreamwidth backend services may be silently falling behind now?

Are you sure that the switch to Kubernetes was the right call?
If your bottlenecks are:
1) Database performance.
2) Complexity of monitoring your queues.
Then Kubernetes is, probably, not the right tool to address these issues, right?
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

Re: Watchdog service

[staff profile] denise 2019-12-02 11:28 pm (UTC)
Ultimately, the problem is that DW-as-a-codebase and DW-as-a-platform were both born in 1999 (as LiveJournal), and at the time there weren't any of the contemporary backend services that people can choose from now. The prototypes we-as-LJ invented to solve scaling problems were what morphed into those services (in some cases directly, in some cases as parallel evolution), but there have been a lot of newer backend services coming and going in those 20 years. We haven't made use of what's been developed in a lot of cases because the existing "battle-hardened" prototypes worked just fine, they had been optimized for our use cases over that exceptionally long time, a rewrite would run the risk of introducing more issues that weren't worth the performance gain we'd get from the switch, and a lot of the examples would require more rearchitecting than a platform that was written from the beginning to swap those newer services more easily would need. Which is not to say that we don't do the work when we need to, when the benefits we'd get start to outweigh the downsides, but it's not as simple a switch as it would be on a project that had its genesis more recently.

It's a problem we run into pretty often -- I actually have a talk I give at tech conferences called "When Your Code Is Old Enough To Vote" about, among other things, how to do the cost/benefit analysis for deciding whether or not to make major rewrites to benefit from new systems and services that were developed long after you started. If you jump on every new advance in technology, your list of "modernization projects we are in the middle of" outgrows your "new features/enhancements" list really, really quickly. We are in the middle of a lot of modernization projects right now (some you can see; some backend) and we have to be really judicious about which additional ones we take on.
metahacker: Half of an unusual keyboard, its surface like two craters with keys within. (keys)

Re: Watchdog service

[personal profile] metahacker 2019-12-02 11:37 pm (UTC)
> I actually have a talk I give at tech conferences called "When Your Code Is Old Enough To Vote"

Awesome! Is this a talk you make available other ways, or am I going to have to get invited to the right conferences?
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

Re: Watchdog service

[staff profile] denise 2019-12-02 11:53 pm (UTC)
I have an older version of the slides at https://www.slideshare.net/mobile/dreamwidth/when-your-code-is-nearly-old-enough-to-vote but I don't think there's a video of me delivering it anywhere -- I'll look, though!
nonelvis: (Default)

Re: Watchdog service

[personal profile] nonelvis 2019-12-03 12:02 am (UTC)
This was fascinating. Thanks for the link!
ilyena_sylph: picture of Labyrinth!faerie with 'careful, i bite' as text (Default)

Re: Watchdog service

[personal profile] ilyena_sylph 2019-12-03 12:19 am (UTC)
I watched the video of you giving it somewhere!!!!
metahacker: Half of an unusual keyboard, its surface like two craters with keys within. (keys)

Re: Watchdog service

[personal profile] metahacker 2019-12-06 03:08 am (UTC)
Thanks! very helpful.

(I mean, I'm not stuck in a 35+ year old codebase any more, but...still relevant.)
dennisgorelik: 2020-06-13 in my home office (Default)

Legacy vs Kubernetes

[personal profile] dennisgorelik 2019-12-03 02:46 am (UTC)
So what you are saying is that even if there are reasons to switch to Kubernetes, you should still be cautious in order to prevent an increase in complexity.

But you do not even have compelling reasons to switch to Kubernetes, right?
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Legacy vs Kubernetes

[personal profile] dennisgorelik 2019-12-03 06:10 pm (UTC)
> if we're falling behind I have to log in and add more cron jobs/spin up new workers

Why do you need more workers if the bottleneck is in the database?
Usually, calling application code is cheap from a performance perspective (microseconds), but database calls (retrieving/saving data) take much longer (milliseconds to seconds).

How many emails does Dreamwidth send per day?
sporky_rat: It's a rat!  With a spork!  It's ME! (Default)

Re: Legacy vs Kubernetes

[personal profile] sporky_rat 2019-12-08 12:54 am (UTC)
Somewhere at about a million, I think I heard last time we discussed notifications.
Edited (Spelling, ugh) 2019-12-08 00:54 (UTC)
dennisgorelik: 2020-06-13 in my home office (Default)

Amazon SES - misconfigured SPF record?

[personal profile] dennisgorelik 2019-12-08 10:55 am (UTC)
Are you using Amazon SES to send Dreamwidth emails?

It looks like the SPF record is not correctly configured in the "Dreamwidth - Amazon SES" setup.
This is part of the email headers I received for your reply comment notification:
~~~~~~~~
Received-SPF: fail (google.com: domain of 0100016ee300049e-64c13222-3e92-4437-b31f-9d7dbd124d02-000000@amazonses.com does not designate 209.85.220.69 as permitted sender) client-ip=209.85.220.69;
~~~~~~~~
alierak: (Default)

Re: Amazon SES - misconfigured SPF record?

[personal profile] alierak 2019-12-09 02:39 am (UTC)
I don't mean to be rude, but what part of that Received-SPF header do you think has anything to do with Dreamwidth?
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Amazon SES - misconfigured SPF record?

[personal profile] dennisgorelik 2019-12-09 03:18 am (UTC)
Most likely the Dreamwidth.org DNS is misconfigured (specifically, the SPF record), and that is causing Gmail to classify emails that Amazon SES sends on behalf of Dreamwidth.org as "SPF: fail".

In order to make it right, the Dreamwidth.org DNS should explicitly declare that Amazon SES IP addresses have permission to send emails on behalf of dreamwidth.org.
alierak: (Default)

Re: Amazon SES - misconfigured SPF record?

[personal profile] alierak 2019-12-09 03:57 am (UTC)
I'm sorry, but that's not correct, and no part of the header you showed has anything to do with Dreamwidth. An SPF check is performed against the SPF record of the envelope sender's domain. In this case the Received-SPF header states that the sender is something at amazonses.com, so the relevant SPF record is Amazon's, not Dreamwidth's. If Amazon hasn't allowed the IP address from which you received the message, then it should be a fail.

In this case the IP address from which you received the message appears to be one of Google's, not one of Amazon's or Dreamwidth's. Neither Dreamwidth nor Amazon uses Google to send email, so the failure is correct and there is nothing we need to adjust.
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Amazon SES - misconfigured SPF record?

[personal profile] dennisgorelik 2019-12-09 05:04 am (UTC)
1) Thank you - you are correct.

I see now that there are 2 "Received-SPF" headers in the email sent by dreamwidth.org that I received through gmail.com:
~~~~~~
Received-SPF: fail (google.com: domain of 0100016ee886fbff-ceef5459-337e-4a6a-a337-e936f2318a2d-000000@amazonses.com does not designate 209.85.220.69 as permitted sender) client-ip=209.85.220.69;
~~~~~~
Received-SPF: pass (google.com: domain of 0100016ee886fbff-ceef5459-337e-4a6a-a337-e936f2318a2d-000000@amazonses.com designates 54.240.11.151 as permitted sender) client-ip=54.240.11.151;
~~~~~~

The "pass" record indicates that google.com recognizes the 54.240.11.151 IP address (which belongs to Amazon SES) as a permitted sender.

The "fail" record probably (I am guessing here) refers to the email forwarding feature inside Gmail (forwarding from one email address to another).

I incorrectly assumed that this "fail" record indicated a misconfiguration in dreamwidth.org.

The "pass" record is the one that really matters, and it indicates that the dreamwidth.org DNS has its SPF record configured correctly.

I am sorry for misreading my email headers.


> the relevant SPF record is Amazon's, not Dreamwidth's

Are you sure it is a correct statement?

Wouldn't it be more correct to state that the Amazon DNS SPF and the Dreamwidth DNS SPF together define whether a specific sender IP address is permitted to send emails on behalf of dreamwidth.org?

Specifically, Dreamwidth DNS SPF delegates defining email sending permissions to Amazon SES:
~~~~~~~~~~~~~~~~~
dreamwidth.org text =
"v=spf1 include:amazonses.com ~all"
~~~~~~~~~~~~~~~~~

Then Amazon SES uses these permissions to permit the 54.240.11.151 IP address to send emails that originate from dreamwidth.org (and other domains that Amazon SES serves):
~~~~~~~~~~~~~
amazonses.com text =

"v=spf1 ip4:199.255.192.0/22 ip4:199.127.232.0/22 ip4:54.240.0.0/18 ip4:69.169.224.0/20 ip4:76.223.180.0/23 ip4:76.223.188.0/24 ip4:76.223.189.0/24 ip4:76.223.190.0/24 -all"
~~~~~~~~~~~~~
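
To make the check concrete, here is a minimal Python sketch that tests the two client IPs from the Received-SPF headers above against the ip4 ranges in the amazonses.com record as quoted in this comment. It handles only ip4 mechanisms and is not a full SPF evaluator; the function name is made up for illustration, and the published record may have changed since.

```python
import ipaddress

# The amazonses.com SPF record as quoted above (it may have changed since).
AMAZONSES_SPF = (
    "v=spf1 ip4:199.255.192.0/22 ip4:199.127.232.0/22 ip4:54.240.0.0/18 "
    "ip4:69.169.224.0/20 ip4:76.223.180.0/23 ip4:76.223.188.0/24 "
    "ip4:76.223.189.0/24 ip4:76.223.190.0/24 -all"
)


def ip4_spf_result(spf_record: str, client_ip: str) -> str:
    """Return "pass" if client_ip falls inside any ip4: range of the record,
    otherwise "fail". Hypothetical helper: it ignores every other SPF
    mechanism (include, a, mx, ...) and treats the trailing -all as a hard fail."""
    address = ipaddress.ip_address(client_ip)
    for term in spf_record.split():
        if term.startswith("ip4:"):
            network = ipaddress.ip_network(term[len("ip4:"):])
            if address in network:
                return "pass"
    return "fail"


print(ip4_spf_result(AMAZONSES_SPF, "54.240.11.151"))  # pass
print(ip4_spf_result(AMAZONSES_SPF, "209.85.220.69"))  # fail
```

The Amazon address falls inside 54.240.0.0/18, while the Google address matches none of the listed ranges, which lines up with the two Received-SPF results.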

dennisgorelik: 2020-06-13 in my home office (Default)

Sending emails

[personal profile] dennisgorelik 2019-12-08 11:23 am (UTC)
1 million per day is about 12 per second (if email sending speed was consistent during the day).

I would say that a single email sending service with about 10 threads -- should be sufficient to send all your emails.

I would set it up this way:
When it is time to send a message:
1) Take MessageId of the message.
2) Calculate the remainder of dividing MessageId by 10:
Remainder = MessageId % 10
If MessageId = 123, then Remainder = 3.
3) Every thread will process messages with corresponding "MessageId Remainder".
If Remainder = 0 -- process it by EmailSendingThread0
If Remainder = 1 -- process it by EmailSendingThread1
If Remainder = 2 -- process it by EmailSendingThread2
Etc.
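
The routing scheme in the steps above can be sketched in Python as follows. The thread count of 10 comes from the comment; the queue-per-thread layout, the `send_email` stub, and all other names are hypothetical, and a real sender would call the mail service instead of recording the ID.

```python
import queue
import threading

NUM_THREADS = 10  # thread count suggested in the comment above

# One queue per sending thread; queue i holds messages with MessageId % 10 == i.
queues = [queue.Queue() for _ in range(NUM_THREADS)]
sent_by_thread = [[] for _ in range(NUM_THREADS)]


def send_email(thread_index: int, message_id: int) -> None:
    """Hypothetical stand-in for the real send call; it only records the ID."""
    sent_by_thread[thread_index].append(message_id)


def worker(thread_index: int) -> None:
    """Drain this thread's queue until a None sentinel arrives."""
    while True:
        message_id = queues[thread_index].get()
        if message_id is None:
            break
        send_email(thread_index, message_id)


def route(message_id: int) -> int:
    """Pick the thread for a message: the remainder of MessageId divided by 10."""
    return message_id % NUM_THREADS


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()

for message_id in range(120, 130):
    queues[route(message_id)].put(message_id)

for q in queues:
    q.put(None)  # tell each worker to stop
for t in threads:
    t.join()

print(route(123))         # 3
print(sent_by_thread[3])  # [123]
```

Fixed modulo routing keeps every message on exactly one thread, but a slow shard cannot borrow capacity from idle ones; a single shared queue drained by all threads is a common alternative.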

Sending email is not a heavy process and does not consume many resources.
The biggest delay is waiting for the command sent to Amazon SES to complete (~200 ms).
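
A back-of-the-envelope check of the thread count, as a sketch: by Little's law, the average number of sends in flight equals the arrival rate times the time each send takes. The inputs are the figures quoted in this thread (about 1 million emails per day, about 200 ms per Amazon SES call); both are estimates, and the calculation assumes a perfectly even send rate, which real bursts will violate.

```python
import math

EMAILS_PER_DAY = 1_000_000   # estimate quoted earlier in the thread
SEND_LATENCY_SECONDS = 0.2   # ~200 ms per send, as estimated above
SECONDS_PER_DAY = 24 * 60 * 60

average_rate = EMAILS_PER_DAY / SECONDS_PER_DAY    # emails per second
in_flight = average_rate * SEND_LATENCY_SECONDS    # Little's law: L = rate * latency
threads_needed = math.ceil(in_flight)              # blocking threads at average load
capacity_10_threads = 10 / SEND_LATENCY_SECONDS    # emails per second with 10 threads

print(round(average_rate, 2))         # 11.57
print(round(in_flight, 2))            # 2.31
print(threads_needed)                 # 3
print(round(capacity_10_threads, 1))  # 50.0
```

Under these assumptions, 10 blocking threads give roughly four times the average rate, so the open question is how large the bursts are rather than whether the average can be met.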
sporky_rat: Effie Trinket in pink at the first District Twelve Reaping (everything is FINE)

Re: Sending emails

[personal profile] sporky_rat 2019-12-09 02:18 am (UTC)
I'm the support volunteer, not the half owner and developer of the site. I keep up with stats and flag things that are an issue up to the developers.

The knowledge is here; however, the time and ability, due to life, aren't.
madgastronomer: detail of Astral Personne by Remedios Varo (Default)

Re: Sending emails

[personal profile] madgastronomer 2019-12-09 02:51 am (UTC)
Hey, look, I don't think you're coming across the way you want to. Maybe back off a bit and assume that people do know what they're doing, and ask questions that are phrased as being about helping you to understand what's happening. Many of the questions you've been asking sound confrontational, and as if you think you know better than the people who've been working with the codebase for years. It sounds like you could have some really good ideas, but right now I think you're alienating people.
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Sending emails

[personal profile] dennisgorelik 2019-12-09 03:46 am (UTC)
> Maybe back off a bit and assume that people do know what they're doing

I definitely assume that people who run a website with 3 million/month total users know what they are doing. Most developers do not reach that.

I also assume that these same competent developers/devops -- make occasional mistakes. There is no shame in making mistakes. I make mistakes too.

I do not mean that my suggestions are necessarily correct.
Actually, many of my suggestions are likely to be suboptimal for dreamwidth.org, for one reason or another.
I do not know these reasons, and I am looking forward to feedback from somebody who can point me to them.
That would allow me to adjust my suggestions so they better fit what dreamwidth.org needs.
So, hopefully, some of my suggestions would actually help make dreamwidth.org work better.
kore: (Default)

Re: Sending emails

[personal profile] kore 2019-12-12 03:47 pm (UTC)
Seconded.
ilyena_sylph: picture of Labyrinth!faerie with 'careful, i bite' as text (Default)

Re: Sending emails

[personal profile] ilyena_sylph 2019-12-09 02:30 pm (UTC)
Speaking as a very very non-tech person but one who has used DW since closed beta...

The thing is, notification sending isn't consistent, like, at all. If most of the people in a big community have a notification set up for 'Notify me when there is a new post to this comm', DW suddenly needs to send, say, a thousand emails (halving the numbers I found on a big comm) at once any time a post is made. If ten posts get made in half an hour, that's 10k emails in that half hour, but not spaced evenly because we all want our notifications as fast as possible.

Or if, another hypothetical here, a lot of people have tracking set up on, say, [community profile] fail_fandomanon, which gets 6k comments per post (posts go up roooooooughly every two days as the 6k mark is reached), all of those people tracking it need instant notification of each comment.

And those are just a couple of examples I can think of, not even counting the [site community profile] dw_news posts, that push notifications to [afaik] basically everyone with an account unless you deliberately set it up not to. That's a lot of email that needs to go out very fast.
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Sending emails

[personal profile] dennisgorelik 2019-12-09 03:09 pm (UTC)
> If ten posts get made in half an hour, that's 10k emails in that half hour

That is actually a routine and easy scenario, because it is well below the average rate of email sending on Dreamwidth.
If Dreamwidth sends 30M emails per day, then it sends 1M+ emails per hour anyway.

> we all want our notifications as fast as possible

What do you think is an acceptable delay for notifications when a post on dw_news goes up?
Would several minutes delay be acceptable in that situation?

> fail_fandomanon, which gets 6k comments per post

That, probably, generates a lot of emails, but these emails will be almost evenly spread (a lot of relatively small batches of emails).


I think that if the service can send 10x faster than the average email sending speed, that should be sufficient.
That means that the service should be able to send about 12 million emails per hour.
Which is ~200,000 emails per minute.
Which is ~3,300 emails per second.
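
The arithmetic above can be checked with a short Python sketch. Two different daily volumes appear in this thread (roughly 1 million per day in an earlier reply, and 30M per day in this comment), so the sketch computes the 10x peak rate for both; both volumes are unverified estimates, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
HEADROOM = 10  # "10x faster than the average", as proposed above


def peak_rate_per_second(emails_per_day: int, headroom: int = HEADROOM) -> float:
    """Emails per second the service must sustain at headroom x the daily average."""
    return emails_per_day / SECONDS_PER_DAY * headroom


# Both daily volumes are unverified estimates taken from this thread.
for emails_per_day in (1_000_000, 30_000_000):
    print(emails_per_day, round(peak_rate_per_second(emails_per_day)))
# 1000000 116
# 30000000 3472
```

The ~3,300 per second quoted above comes from rounding 12.5 million per hour down to 12 million before dividing; at the lower daily estimate the same 10x headroom works out to about 116 per second.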
ilyena_sylph: picture of Labyrinth!faerie with 'careful, i bite' as text (Default)

Re: Sending emails

[personal profile] ilyena_sylph 2019-12-09 05:10 pm (UTC)
> Would several minutes delay be acceptable in that situation?

...dude. Several minutes would be freaking amazing.

When [site community profile] dw_news posts go out, notifications, all notifications sitewide, are slowed for at least an hour, normally more like 2. That's what [staff profile] mark is trying to fix with all the work he's doing -- among other things -- and what you wandered into the middle of.
