mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
Mark Smith ([staff profile] mark) wrote in [site community profile] dw_maintenance 2019-12-02 11:50 am

Notifications slow -- but recovering

Hi all,

Due to some behind-the-scenes maintenance last night, our notifications system got delayed. I've fixed the issue now and it's working on catching up.

For details -- I've been experimenting with Kubernetes as a way to make managing production easier (and hopefully reduce costs!), but it turns out that one of our worker jobs that handles notifications doesn't use much CPU (it mostly spends time waiting on the database).

This caused the pod autoscaler to reduce the size of that particular deployment below what we needed to sustain throughput on our notifications service. The temporary fix is to pin that deployment size to something much larger; the better fix will be to integrate Kubernetes' pod autoscaler with the ability to monitor the queue depth on our task queue.
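
A rough sketch of why a CPU-based autoscaler shrinks a database-bound worker pool, and why a queue-depth metric would not. It uses the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas * current metric / target metric)). The worker count, CPU percentages, and queue figures are invented for illustration and are not Dreamwidth's real numbers; the floor of one replica is also an assumption of this sketch.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Scaling rule documented for the Horizontal Pod Autoscaler:
    ceil(current_replicas * current_metric / target_metric).
    The floor of 1 replica is an assumption of this sketch."""
    return max(1, math.ceil(current_replicas * current_metric / target_metric))


# Hypothetical: 20 workers that mostly wait on the database,
# averaging 5% CPU against a 50% CPU target.
cpu_based = desired_replicas(20, current_metric=5.0, target_metric=50.0)

# Hypothetical: the same 20 workers scaled on queue depth instead,
# with 4,000 queued tasks per worker against a target of 1,000 per worker.
queue_based = desired_replicas(20, current_metric=4000.0, target_metric=1000.0)

print(cpu_based)    # 2: low CPU makes the autoscaler shrink the deployment
print(queue_based)  # 80: a growing backlog drives the deployment up
```

With a CPU metric, idle-looking workers are read as spare capacity even while the queue grows; with a queue-depth metric, the backlog itself is the scaling signal.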

Sorry for the trouble, and thank you to the person who pinged us on Twitter. When I checked last night, everything was working, but as traffic came back up we fell behind and I wasn't watching anymore. My bad.
dennisgorelik: 2020-06-13 in my home office (Default)

Legacy vs Kubernetes

[personal profile] dennisgorelik 2019-12-03 02:46 am (UTC)
So what you are saying is that even if there are reasons to switch to Kubernetes, you still should be cautious in order to prevent an increase in complexity.

But you do not even have compelling reasons to switch to Kubernetes, right?
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Legacy vs Kubernetes

[personal profile] dennisgorelik 2019-12-03 06:10 pm (UTC)
> if we're falling behind I have to log in and add more cron jobs/spin up new workers

Why do you need more workers if the bottleneck is in the database?
Usually, calling application code is cheap from a performance perspective (microseconds), but database calls (retrieve/save data) take much longer (milliseconds to seconds).

How many emails does Dreamwidth send per day?
sporky_rat: It's a rat!  With a spork!  It's ME! (Default)

Re: Legacy vs Kubernetes

[personal profile] sporky_rat 2019-12-08 12:54 am (UTC)
Somewhere at about a million, I think I heard last time we discussed notifications.
Edited (Spelling, ugh) 2019-12-08 00:54 (UTC)
dennisgorelik: 2020-06-13 in my home office (Default)

Amazon SES - misconfigured SPF record?

[personal profile] dennisgorelik 2019-12-08 10:55 am (UTC)
Are you using Amazon SES to send Dreamwidth emails?

It looks like the SPF record is not configured correctly in the "Dreamwidth - Amazon SES" setup.
This is part of the email headers I received for your reply comment notification:
~~~~~~~~
Received-SPF: fail (google.com: domain of 0100016ee300049e-64c13222-3e92-4437-b31f-9d7dbd124d02-000000@amazonses.com does not designate 209.85.220.69 as permitted sender) client-ip=209.85.220.69;
~~~~~~~~
alierak: (Default)

Re: Amazon SES - misconfigured SPF record?

[personal profile] alierak 2019-12-09 02:39 am (UTC)
I don't mean to be rude, but what part of that Received-SPF header do you think has anything to do with Dreamwidth?
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Amazon SES - misconfigured SPF record?

[personal profile] dennisgorelik 2019-12-09 03:18 am (UTC)
Most likely the Dreamwidth.org DNS is misconfigured (specifically, the SPF record), and that is causing Gmail to classify emails that Amazon SES sends on behalf of Dreamwidth.org as "SPF: fail".

In order to make it right, the Dreamwidth.org DNS should explicitly declare that Amazon SES IP addresses have permission to send emails on behalf of dreamwidth.org.
alierak: (Default)

Re: Amazon SES - misconfigured SPF record?

[personal profile] alierak 2019-12-09 03:57 am (UTC)
I'm sorry, but that's not correct, and no part of the header you showed has anything to do with Dreamwidth. An SPF check is performed against the SPF record of the envelope sender's domain. In this case the Received-SPF header states that the sender is something at amazonses.com, so the relevant SPF record is Amazon's, not Dreamwidth's. If Amazon hasn't allowed the IP address from which you received the message, then it should be a fail.

In this case the IP address from which you received the message appears to be one of Google's, not one of Amazon's or Dreamwidth's. Neither Dreamwidth nor Amazon uses Google to send email, so the failure is correct and there is nothing we need to adjust.
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Amazon SES - misconfigured SPF record?

[personal profile] dennisgorelik 2019-12-09 05:04 am (UTC)
1) Thank you - you are correct.

I see now that there are 2 "Received-SPF" headers in the email sent by @dreamwidth.org that I received through @gmail.com:
~~~~~~
Received-SPF: fail (google.com: domain of 0100016ee886fbff-ceef5459-337e-4a6a-a337-e936f2318a2d-000000@amazonses.com does not designate 209.85.220.69 as permitted sender) client-ip=209.85.220.69;
~~~~~~
Received-SPF: pass (google.com: domain of 0100016ee886fbff-ceef5459-337e-4a6a-a337-e936f2318a2d-000000@amazonses.com designates 54.240.11.151 as permitted sender) client-ip=54.240.11.151;
~~~~~~

The "pass" record indicates that google.com recognizes the 54.240.11.151 IP address (which belongs to Amazon SES) as a permitted sender.

The "fail" record probably (I am guessing here) refers to the email forwarding feature inside Gmail (forwarding from one email address to another).

I incorrectly assumed that this "fail" record indicates a misconfiguration in dreamwidth.org.

The "pass" record is the one that really matters, and it indicates that the dreamwidth.org DNS has its SPF record configured correctly.

I am sorry for misreading my email headers.


> the relevant SPF record is Amazon's, not Dreamwidth's

Are you sure it is a correct statement?

Wouldn't it be more correct to state that both the Amazon DNS SPF record and the Dreamwidth DNS SPF record together define whether a specific sender IP address is permitted to send emails on behalf of dreamwidth.org?

Specifically, Dreamwidth DNS SPF delegates defining email sending permissions to Amazon SES:
~~~~~~~~~~~~~~~~~
dreamwidth.org text =
"v=spf1 include:amazonses.com ~all"
~~~~~~~~~~~~~~~~~

Then Amazon SES uses these permissions to permit the 54.240.11.151 IP address to send emails that originate from dreamwidth.org (and other domains that Amazon SES serves):
~~~~~~~~~~~~~
amazonses.com text =

"v=spf1 ip4:199.255.192.0/22 ip4:199.127.232.0/22 ip4:54.240.0.0/18 ip4:69.169.224.0/20 ip4:76.223.180.0/23 ip4:76.223.188.0/24 ip4:76.223.189.0/24 ip4:76.223.190.0/24 -all"
~~~~~~~~~~~~~
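
For readers following along, here is a minimal sketch of how these two records evaluate, including the "include" mechanism. It is a heavily simplified stand-in for the check_host() function of RFC 7208: only the "ip4", "include" and "all" mechanisms are modelled, the records are the snapshots quoted above rather than live DNS, and error results are ignored. Note that under RFC 7208 the receiver runs the check against the domain of the envelope sender (MAIL FROM), so which of the two records is looked up depends on that domain.

```python
import ipaddress

# SPF TXT records as quoted in this thread (a snapshot, not live DNS).
SPF_RECORDS = {
    "dreamwidth.org": "v=spf1 include:amazonses.com ~all",
    "amazonses.com": (
        "v=spf1 ip4:199.255.192.0/22 ip4:199.127.232.0/22 ip4:54.240.0.0/18 "
        "ip4:69.169.224.0/20 ip4:76.223.180.0/23 ip4:76.223.188.0/24 "
        "ip4:76.223.189.0/24 ip4:76.223.190.0/24 -all"
    ),
}

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def check_host(ip: str, domain: str) -> str:
    """Simplified SPF evaluation: handles only ip4, include and all."""
    record = SPF_RECORDS.get(domain)
    if record is None:
        return "none"
    addr = ipaddress.ip_address(ip)
    for term in record.split()[1:]:  # skip the "v=spf1" version tag
        qualifier = "+"  # default qualifier when none is written
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        if term == "all":
            matched = True
        elif term.startswith("ip4:"):
            matched = addr in ipaddress.ip_network(term[len("ip4:"):])
        elif term.startswith("include:"):
            # "include" matches only if the nested evaluation returns "pass".
            matched = check_host(ip, term[len("include:"):]) == "pass"
        else:
            matched = False  # mechanisms not modelled in this sketch
        if matched:
            return QUALIFIERS[qualifier]
    return "neutral"


# Envelope sender at amazonses.com: only Amazon's record is consulted.
print(check_host("54.240.11.151", "amazonses.com"))   # pass
print(check_host("209.85.220.69", "amazonses.com"))   # fail

# Envelope sender at dreamwidth.org (hypothetical): "include" recurses.
print(check_host("54.240.11.151", "dreamwidth.org"))  # pass
print(check_host("209.85.220.69", "dreamwidth.org"))  # softfail
```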
alierak: (Default)

Re: Amazon SES - misconfigured SPF record?

[personal profile] alierak 2019-12-09 02:55 pm (UTC)
1) Thanks, apology accepted.

2) Yes, I'm sure it's a correct statement. Amazon isn't sending on behalf of dreamwidth.org; an email can only have one envelope sender address and in this case it's something at amazonses.com, so that is the only domain whose SPF record is relevant. That email could be from some other Amazon customer and the behavior would be the same. Amazon guarantees that emails sent thru SES will pass an SPF check by default, by using their domain to do the sending. It is possible to use a custom domain for the envelope sender in SES, and then you're responsible for the SPF record, but as far as I know we're not doing that.

There is another, completely different, use case for SPF records involved in DMARC, where the check would be performed against the domain of the address in the From: header instead of the envelope sender. That's the part where Dreamwidth's SPF record might be relevant, except that isn't what you asked about, and we aren't making any attempt to use DMARC as far as I know. That would also require using a custom envelope sender domain, since DMARC requires the domains of From: header and envelope sender to match.

I would recommend that you educate yourself further on these topics using the abundant documentation available on the web, as well as running your own email services to gain experience if it's something that really interests you. There also exist "header analyzer" services that will help you understand the various email delivery issues if you paste in the headers from an email you've received. All my efforts here aside, Dreamwidth is not one of those services and this [site community profile] dw_maintenance post is about the processing of notifications, not email deliverability.
dennisgorelik: 2020-06-13 in my home office (Default)

Dreamwidth.org SPF record

[personal profile] dennisgorelik 2019-12-09 03:50 pm (UTC)
> an email can only have one envelope sender address and in this case it's something at amazonses.com

I agree here.

> so that is the only domain whose SPF record is relevant.

I disagree with that part.
If the "From" email is "...@dreamwidth.org" then dreamwidth.org SPF settings are relevant too, even if the sender IP address belongs to Amazon SES (and does not belong to dreamwidth.org).

> Amazon guarantees that emails sent thru SES will pass an SPF check by default

... only if Dreamwidth DNS specifies something like
dreamwidth.org text =
"v=spf1 include:amazonses.com ~all"

Or do you mean that even without dreamwidth.org granting permissions to Amazon SES, Amazon SES will guarantee SPF check to pass?

> that isn't what you asked about

Correct. I am not referring to DMARC.
I am only referring to "v=spf1" DNS text record.
alierak: (Default)

Re: Dreamwidth.org SPF record

[personal profile] alierak 2019-12-09 04:47 pm (UTC)
All right, best of luck with your disagreement with RFC 7208.
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Dreamwidth.org SPF record

[personal profile] dennisgorelik 2019-12-09 06:13 pm (UTC)
> your disagreement with RFC 7208

I do NOT disagree with RFC 7208.

RFC 7208 supports my position:
~~~~~~~~~
https://tools.ietf.org/html/rfc7208
The "include" mechanism triggers a recursive evaluation of
check_host().
~~~~~~~~~

dennisgorelik: 2020-06-13 in my home office (Default)

Processing notifications and email deliverability

[personal profile] dennisgorelik 2019-12-09 03:52 pm (UTC)
> processing of notifications, not email deliverability

Why do you separate these two?
If Dreamwidth successfully sends out all notifications, but most of them end up in the recipients' spam folders, then the effect is almost the same as not sending these notifications at all, right?

For example, a sudden 100x spike in email sending speed may look like a spam attack from the perspective of the spam filters that email service providers (such as Gmail) have.
If that is the case, then the Dreamwidth team may choose not to send emails too fast, even if it is technically possible (e.g. allow email sending 10x faster than normal, but not 100x faster).
Edited 2019-12-09 15:56 (UTC)
alierak: (Default)

Re: Processing notifications and email deliverability

[personal profile] alierak 2019-12-09 04:58 pm (UTC)
In the codebase we inherited from LJ, a notification is an asynchronous message to the user, and it is not the same thing as an email. Email is just one of the available delivery methods for notifications. The workers [staff profile] mark was talking about were dealing with notifications, not email, so that's what he said and I have repeated.

The code is available if you're interested, and I think there are some wiki entries about how the various parts work, including notifications ("ESN" might be the term to search for). However, SES isn't really one of those parts; it and any of its necessary rate-limiting controls are implemented by Amazon.
dennisgorelik: 2020-06-13 in my home office (Default)

Notification types

[personal profile] dennisgorelik 2019-12-09 06:35 pm (UTC)
> mark was talking about were dealing with notifications, not email

Thanks, I see the difference now.

For me, as a Dreamwidth user, the only notification that practically matters is email.
Is it different for most other Dreamwidth users?

ESN wiki lists 3 notification types:
~~~~~~~~~~~
https://wiki.dreamwidth.net/wiki/index.php/ESN
Email
Inbox
DebugLog
~~~~~~~~~~~
but it is not clear how important "Inbox" and "DebugLog" notification types are.
dennisgorelik: 2020-06-13 in my home office (Default)

Sending emails

[personal profile] dennisgorelik 2019-12-08 11:23 am (UTC)
1 million per day is about 12 per second (if email sending speed was consistent during the day).

I would say that a single email sending service with about 10 threads -- should be sufficient to send all your emails.

I would set it up this way:
When it is time to send a message:
1) Take MessageId of the message.
2) Calculate the remainder of MessageId divided by 10:
Remainder = MessageId % 10
If MessageId = 123, then Remainder = 3.
3) Every thread will process messages with the corresponding "MessageId Remainder".
If Remainder = 0 -- process it by EmailSendingThread0
If Remainder = 1 -- process it by EmailSendingThread1
If Remainder = 2 -- process it by EmailSendingThread2
Etc.
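
The steps above could be sketched roughly as follows. This is only an illustration: send_email() is a hypothetical placeholder for the real call to the mail provider, and the function and variable names are made up for this example.

```python
import queue
import threading

NUM_THREADS = 10  # the thread count proposed above


def shard_for(message_id: int) -> int:
    """Pick the worker for a message: remainder of MessageId divided by 10."""
    return message_id % NUM_THREADS


def send_email(message_id: int) -> None:
    """Hypothetical placeholder for the real call to the mail provider."""


def worker(inbox: queue.Queue, sent: list) -> None:
    """Send every message placed in this worker's own queue."""
    while True:
        message_id = inbox.get()
        if message_id is None:  # sentinel: no more work
            break
        send_email(message_id)
        sent.append(message_id)


def run(message_ids):
    """Route each message to the thread matching its remainder; return per-thread logs."""
    inboxes = [queue.Queue() for _ in range(NUM_THREADS)]
    sent = [[] for _ in range(NUM_THREADS)]
    threads = [
        threading.Thread(target=worker, args=(inboxes[i], sent[i]))
        for i in range(NUM_THREADS)
    ]
    for t in threads:
        t.start()
    for message_id in message_ids:
        inboxes[shard_for(message_id)].put(message_id)
    for inbox in inboxes:
        inbox.put(None)
    for t in threads:
        t.join()
    return sent


print(shard_for(123))                         # 3
print([len(s) for s in run(range(1, 31))])    # [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
```

A single shared queue that every thread pulls from is a common alternative that balances load automatically; the remainder scheme is shown here only because it is the one described above.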

Sending email is not a heavy process and does not consume many resources.
The biggest delay is waiting for the command sent to Amazon SES to complete (~200 ms).
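
As a quick back-of-the-envelope check of these figures: the sketch below assumes each thread sends one email at a time and blocks for the full ~200 ms, and it ignores database time entirely. The inputs are the rough estimates from this thread, not measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60

emails_per_day = 1_000_000  # rough figure quoted earlier in the thread
send_latency_s = 0.2        # ~200 ms wait per Amazon SES call (estimate above)
threads = 10                # proposed thread count

average_rate = emails_per_day / SECONDS_PER_DAY  # emails per second needed on average
per_thread_rate = 1 / send_latency_s             # one blocking send at a time per thread
total_rate = threads * per_thread_rate           # emails per second across all threads

print(round(average_rate, 1))               # 11.6
print(total_rate)                           # 50.0
print(round(total_rate / average_rate, 1))  # 4.3
```

Under these assumptions, 10 threads give roughly 4x headroom over the daily average, which says nothing yet about how well short bursts are absorbed.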
sporky_rat: Effie Trinket in pink at the first District Twelve Reaping (everything is FINE)

Re: Sending emails

[personal profile] sporky_rat 2019-12-09 02:18 am (UTC)
I'm the support volunteer, not the half owner and developer of the site. I keep up with stats and flag things that are an issue up to the developers.

The knowledge is here; however, the time and ability, due to life, aren't.
madgastronomer: detail of Astral Personne by Remedios Varo (Default)

Re: Sending emails

[personal profile] madgastronomer 2019-12-09 02:51 am (UTC)
Hey, look, I don't think you're coming across the way you want to. Maybe back off a bit and assume that people do know what they're doing, and ask questions that are phrased as being about helping you to understand what's happening. Many of the questions you've been asking sound confrontational, and as if you think you know better than the people who've been working with the codebase for years. It sounds like you could have some really good ideas, but right now I think you're alienating people.
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Sending emails

[personal profile] dennisgorelik 2019-12-09 03:46 am (UTC)
> Maybe back off a bit and assume that people do know what they're doing

I definitely assume that people who run a website with 3 million/month total users know what they are doing. Most developers do not reach that.

I also assume that these same competent developers/devops make occasional mistakes. There is no shame in making mistakes. I make mistakes too.

I do not mean that my suggestions are necessarily correct.
Actually, many of my suggestions are likely to be suboptimal for implementing on dreamwidth.org, for one reason or another.
I do not know these reasons and am looking forward to feedback from somebody who can point me to them.
That would allow me to adjust my suggestions so they better fit what dreamwidth.org needs.
So, hopefully, some of my suggestions would actually help make dreamwidth.org work better.
kore: (Default)

Re: Sending emails

[personal profile] kore 2019-12-12 03:47 pm (UTC)
Seconded.
ilyena_sylph: picture of Labyrinth!faerie with 'careful, i bite' as text (Default)

Re: Sending emails

[personal profile] ilyena_sylph 2019-12-09 02:30 pm (UTC)
Speaking as a very very non-tech person but one who has used DW since closed beta...

The thing is, notification sending isn't consistent, like, at all. If most of the people in a big community have a notification set up for 'Notify me when there is a new post to this comm', DW suddenly needs to send, say, a thousand emails (halving the numbers I found on a big comm) at once any time a post is made. If ten posts get made in half an hour, that's 10k emails in that half hour, but not spaced evenly because we all want our notifications as fast as possible.

Or if, another hypothetical here, a lot of people have tracking set up on, say, [community profile] fail_fandomanon, which gets 6k comments per post (posts go up roooooooughly every two days as the 6k mark is reached), all of those people tracking it need instant notification of each comment.

And those are just a couple of examples I can think of, not even counting the [site community profile] dw_news posts, that push notifications to [afaik] basically everyone with an account unless you deliberately set it up not to. That's a lot of email that needs to go out very fast.
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Sending emails

[personal profile] dennisgorelik 2019-12-09 03:09 pm (UTC)
> If ten posts get made in half an hour, that's 10k emails in that half hour

That is actually a routine and easy scenario, because it is well below the average rate of email sending on Dreamwidth.
If Dreamwidth sends 30M emails per day, then it sends 1M+ emails per hour anyway.

> we all want our notifications as fast as possible

What do you think is an acceptable delay for notifications when a posting on dw_news goes up?
Would several minutes delay be acceptable in that situation?

> fail_fandomanon, which gets 6k comments per post

That probably generates a lot of emails, but these emails will be spread almost evenly (a lot of relatively small batches of emails).


I think that if the service is able to send 10x faster than the average email sending speed, that should be sufficient.
That means that the service should be able to send about 12 million emails per hour,
which is ~200,000 emails per minute,
which is ~3,300 emails per second.
ilyena_sylph: picture of Labyrinth!faerie with 'careful, i bite' as text (Default)

Re: Sending emails

[personal profile] ilyena_sylph 2019-12-09 05:10 pm (UTC)
> Would several minutes delay be acceptable in that situation?

...dude. Several minutes would be freaking amazing.

When [site community profile] dw_news posts go out, notifications, all notifications sitewide, are slowed for at least an hour, normally more like 2. That's what [staff profile] mark is trying to fix with all the work he's doing -- among other things -- and what you wandered into the middle of.
dennisgorelik: 2020-06-13 in my home office (Default)

Re: Sending emails

[personal profile] dennisgorelik 2019-12-09 07:13 pm (UTC)
> Several minutes would be freaking amazing.

Cool: having clear and realistic goals should help.
"5 minutes to send 1M notifications" is much more specific than "Send notifications immediately".

1 million in 5 minutes = 200,000 per minute
Which is about 10x faster than the average email sending speed on Dreamwidth now.

> are slowed for at least an hour, normally more like 2

Most likely, the current Dreamwidth email sending speed is only ~1.5x faster than average. So in case of a big spike it takes the notification delivery service a long time to catch up, because there is only ~50% spare capacity.

If my ~1.5x estimate is correct, then only ~7x speed improvement is needed.
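
The catch-up reasoning can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: an average of 20,000 emails per minute (one tenth of the 200,000 per minute above), a burst of 1 million notifications, and normal traffic that keeps arriving while the backlog drains. None of these are measured Dreamwidth values.

```python
def catch_up_minutes(backlog: int, average_per_min: float, capacity_factor: float) -> float:
    """Minutes needed to drain a backlog while normal traffic keeps arriving."""
    capacity = average_per_min * capacity_factor
    spare = capacity - average_per_min  # only spare capacity works on the backlog
    if spare <= 0:
        raise ValueError("no spare capacity: the backlog never drains")
    return backlog / spare


average_per_min = 20_000  # assumed average sending rate
backlog = 1_000_000       # assumed size of the burst

print(catch_up_minutes(backlog, average_per_min, 1.5))   # 100.0
print(catch_up_minutes(backlog, average_per_min, 10.0))  # about 5.6
```

With only ~1.5x capacity the burst takes about 100 minutes to clear, which is in line with the reported slowdown of one to two hours; at 10x capacity it clears in under six minutes.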

Incremental speed improvement, in my opinion, is the best strategy in this situation:
- Gradually add more threads to notification sending service.
- Monitor database performance and delays along the way.
- Fine-tune database queries and indexes.

Replacing the existing system with a new external component would probably be much riskier and more involved.