Mark Smith (mark) wrote in dw_maintenance, 2019-12-02 11:50 am
Notifications slow -- but recovering
Hi all,
Due to some behind the scenes maintenance last night, our notifications system got delayed. I've fixed the issue now and it's working on catching up.
For details -- I've been experimenting with Kubernetes as a way to make managing production easier (and hopefully reduce costs!), but it turns out that one of our worker jobs that handles notifications doesn't use much CPU (it mostly spends time waiting on the database).
This caused the pod autoscaler to reduce the size of that particular deployment below what we needed to sustain throughput on our notifications service. The temporary fix is to pin that deployment size to something much larger; the better fix will be to integrate Kubernetes' pod autoscaler with the ability to monitor the queue depth on our task queue.
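As an illustration of the queue-depth idea, here is a minimal sketch of the sizing arithmetic, assuming each worker can comfortably own a fixed number of queued tasks. The function name, the per-worker target, and the worker bounds are hypothetical and are not Dreamwidth's actual configuration.

```python
import math


def desired_workers(queue_depth: int, target_per_worker: int,
                    min_workers: int = 1, max_workers: int = 50) -> int:
    """Size a worker deployment from queue backlog rather than CPU usage.

    All parameter names and defaults are illustrative assumptions.
    """
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    # Enough workers that each one owns at most target_per_worker queued tasks.
    wanted = math.ceil(queue_depth / target_per_worker)
    # Clamp: keep a floor when the queue is empty, and cap growth during a spike.
    return max(min_workers, min(max_workers, wanted))
```

Because the input is backlog rather than CPU, a worker that mostly waits on the database no longer looks idle to the scaling decision.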
Sorry for the trouble, and thank you to the person who pinged us on Twitter. When I checked last night, everything was working, but as traffic came back up we fell behind and I wasn't watching anymore. My bad.

Re: Watchdog service
Fool's hope, of course.
Re: Watchdog service
That makes me wonder what other Dreamwidth backend services may be silently falling behind now?
Are you sure that the switch to Kubernetes was the right call?
If your bottlenecks are:
1) Database performance.
2) Complexity of monitoring your queues.
Then Kubernetes is, probably, not the right tool to address these issues, right?
Re: Watchdog service
It's a problem we run into pretty often -- I actually have a talk I give at tech conferences called "When Your Code Is Old Enough To Vote" about, among other things, how to do the cost/benefit analysis for deciding whether or not to make major rewrites to benefit from new systems and services that were developed long after you started. If you jump on every new advance in technology, your list of "modernization projects we are in the middle of" outgrows your "new features/enhancements" list really, really quickly. We are in the middle of a lot of modernization projects right now (some you can see; some backend) and we have to be really judicious about which additional ones we take on.
Re: Watchdog service
Awesome! Is this a talk you make available other ways, or am I going to have to get invited to the right conferences?
Re: Watchdog service
(I mean, I'm not stuck in a 35+ year old codebase any more, but...still relevant.)
Legacy vs Kubernetes
But you do not even have compelling reasons to switch to Kubernetes, right?
Re: Legacy vs Kubernetes
In the legacy stack, if we're falling behind I have to log in and add more cron jobs/spin up new workers. Given that Dreamwidth is a side project for the tech staff (none of us are full-time, we don't make that much money!), I'm not actually going to log in and rebalance workers very often. User experience suffers.
Kubernetes (and other technologies like it, ECS, Fargate, Nomad, etc) give us the ability to tie performance metrics to scaling decisions and then we get a better user experience and a lower technical staff overhead for management.
FWIW, this is only a possibility because we're using managed Kubernetes. I don't have to run it myself and our use case is small enough (~20 nodes) that we aren't really going to run into the gnarly edge cases that SIG Scalability talk about. It does add some complexity, but that's a trade I'm willing to make for flexibility and the possibility of a better user experience.
Re: Legacy vs Kubernetes
Why do you need more workers if the bottleneck is in the database?
Usually, calling application code is cheap from a performance perspective (microseconds), but database calls (retrieving/saving data) take much longer (milliseconds to seconds).
How many emails does DreamWidth send per day?
Re: Legacy vs Kubernetes
Amazon SES - misconfigured SPF record?
It looks like the SPF record is not correctly configured in the "DreamWidth - Amazon SES" setup.
This is part of the email headers I received for your reply comment notification:
~~~~~~~~
Received-SPF: fail (google.com: domain of 0100016ee300049e-64c13222-3e92-4437-b31f-9d7dbd124d02-000000@amazonses.com does not designate 209.85.220.69 as permitted sender) client-ip=209.85.220.69;
~~~~~~~~
Re: Amazon SES - misconfigured SPF record?
Re: Amazon SES - misconfigured SPF record?
In order to make it right, Dreamwidth.org DNS should explicitly declare that Amazon SES IP addresses have permission to send emails on behalf of dreamwidth.org
Re: Amazon SES - misconfigured SPF record?
In this case the IP address from which you received the message appears to be one of Google's, not one of Amazon's or Dreamwidth's. Neither Dreamwidth nor Amazon uses Google to send email, so the failure is correct and there is nothing we need to adjust.
Re: Amazon SES - misconfigured SPF record?
I see now that there are 2 "Received-SPF" headers in the email sent by dreamwidth.org that I received through gmail.com:
~~~~~~
Received-SPF: fail (google.com: domain of 0100016ee886fbff-ceef5459-337e-4a6a-a337-e936f2318a2d-000000@amazonses.com does not designate 209.85.220.69 as permitted sender) client-ip=209.85.220.69;
~~~~~~
Received-SPF: pass (google.com: domain of 0100016ee886fbff-ceef5459-337e-4a6a-a337-e936f2318a2d-000000@amazonses.com designates 54.240.11.151 as permitted sender) client-ip=54.240.11.151;
~~~~~~
The "pass" record indicates that google.com recognizes the IP address 54.240.11.151 (which belongs to Amazon SES) as a permitted sender.
The "fail" record probably (I am guessing here) refers to the email forwarding feature inside Gmail (forwarding from one email address to another).
I incorrectly assumed that this "fail" record indicated a misconfiguration in dreamwidth.org.
The "pass" record is the one that really matters, and it indicates that the dreamwidth.org DNS has its SPF record configured correctly.
I am sorry for misreading my email headers.
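To avoid misreading headers like these by eye, a small sketch that pulls the verdict and the checked IP out of each "Received-SPF" line. The function name and the regular expression are my own assumptions; it only handles the Gmail-style layout quoted above, not every format a mail server may emit.

```python
import re
from typing import Optional, Tuple

# Captures the verdict word (pass, fail, ...) and the client-ip field.
# Written against the Gmail-style header quoted above; other layouts may differ.
_SPF_RE = re.compile(
    r"Received-SPF:\s*(\w+)\b.*?client-ip=([0-9a-fA-F.:]+)",
    re.DOTALL,
)


def parse_received_spf(header: str) -> Optional[Tuple[str, str]]:
    """Return (verdict, client_ip) for one Received-SPF header, or None."""
    match = _SPF_RE.match(header.strip())
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)
```

Running it over both headers makes it clear that the two verdicts refer to two different client IPs, so they are not contradicting each other.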
> the relevant SPF record is Amazon's, not Dreamwidth's
Are you sure it is a correct statement?
Wouldn't it be more correct to state that the Amazon DNS SPF record and the Dreamwidth DNS SPF record together define whether a specific sender IP address is permitted to send emails on behalf of dreamwidth.org?
Specifically, Dreamwidth DNS SPF delegates defining email sending permissions to Amazon SES:
~~~~~~~~~~~~~~~~~
dreamwidth.org text =
"v=spf1 include:amazonses.com ~all"
~~~~~~~~~~~~~~~~~
Then Amazon SES uses these permissions to permit the IP address 54.240.11.151 to send emails that originate from dreamwidth.org (and from other domains that Amazon SES serves).
~~~~~~~~~~~~~
amazonses.com text =
"v=spf1 ip4:199.255.192.0/22 ip4:199.127.232.0/22 ip4:54.240.0.0/18 ip4:69.169.224.0/20 ip4:76.223.180.0/23 ip4:76.223.188.0/24 ip4:76.223.189.0/24 ip4:76.223.190.0/24 -all"
~~~~~~~~~~~~~
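The last step of that chain, checking a sender IP against the ip4 ranges, can be sketched as follows. This is only the range comparison: a real SPF check also resolves the "include:" mechanism through DNS, which is skipped here. The ranges are copied from the amazonses.com record quoted above (and may have changed since), and the function name is hypothetical.

```python
import ipaddress

# ip4 ranges copied from the amazonses.com SPF record quoted above.
AMAZON_SES_RANGES = [
    "199.255.192.0/22", "199.127.232.0/22", "54.240.0.0/18", "69.169.224.0/20",
    "76.223.180.0/23", "76.223.188.0/24", "76.223.189.0/24", "76.223.190.0/24",
]


def is_permitted_sender(ip: str, ranges=AMAZON_SES_RANGES) -> bool:
    """True if ip falls inside any of the listed ip4 networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)
```

With these ranges, 54.240.11.151 falls inside 54.240.0.0/18, while 209.85.220.69 matches none of them, which agrees with the "pass" and "fail" headers quoted earlier.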
Sending emails
I would say that a single email sending service with about 10 threads should be sufficient to send all your emails.
I would set it up this way:
When it is time to send a message:
1) Take MessageId of the message.
2) Calculate the remainder of MessageId divided by 10:
Remainder = MessageId % 10
If MessageId = 123, then Remainder = 3.
3) Every thread will process messages with corresponding "MessageId Remainder".
If Remainder = 0 -- process it by EmailSendingThread0
If Remainder = 1 -- process it by EmailSendingThread1
If Remainder = 2 -- process it by EmailSendingThread2
Etc.
Sending email is not a heavy process and does not consume many resources.
The biggest delay is waiting for the command sent to Amazon SES to complete (~200 ms).
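The routing described above can be sketched like this. It is an illustration only: send_email is a hypothetical stand-in for the real call to the mail provider, and the queue-per-thread layout is my assumption, not how Dreamwidth's workers are built.

```python
import queue
import threading

NUM_THREADS = 10


def send_email(message_id: int) -> None:
    """Hypothetical placeholder for the real (slow, network-bound) send call."""


def worker(inbox: queue.Queue, handled: list) -> None:
    """Drain one inbox until the None sentinel arrives."""
    while True:
        message_id = inbox.get()
        if message_id is None:
            break
        send_email(message_id)
        handled.append(message_id)  # each thread writes only to its own list


def dispatch(message_ids):
    """Route every message to thread number (MessageId % NUM_THREADS)."""
    inboxes = [queue.Queue() for _ in range(NUM_THREADS)]
    handled = [[] for _ in range(NUM_THREADS)]
    threads = [
        threading.Thread(target=worker, args=(inboxes[i], handled[i]))
        for i in range(NUM_THREADS)
    ]
    for thread in threads:
        thread.start()
    for message_id in message_ids:
        inboxes[message_id % NUM_THREADS].put(message_id)
    for inbox in inboxes:
        inbox.put(None)  # tell each worker there is no more work
    for thread in threads:
        thread.join()
    return handled
```

Calling dispatch([123]) places message 123 in the list for thread 3, matching the Remainder = 3 example.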
Re: Sending emails
The knowledge is here, however the time and ability due to life aren't
Re: Sending emails
Re: Sending emails
I definitely assume that people who run a website with 3 million/month total users know what they are doing. Most developers do not reach that.
I also assume that these same competent developers/devops make occasional mistakes. There is no shame in making mistakes. I make mistakes too.
I do not mean that my suggestions are necessarily correct.
Actually, many of my suggestions are likely to be suboptimal for dreamwidth.org, for one reason or another.
I do not know these reasons, and I look forward to feedback from somebody who can point me to them.
That would allow me to adjust my suggestions so they better fit what dreamwidth.org needs.
So, hopefully, some of my suggestions would actually help make dreamwidth.org work better.
Re: Sending emails
Re: Sending emails
The thing is, notification sending isn't consistent, like, at all. If most of the people in a big community have a notification set up for 'Notify me when there is a new post to this comm', DW suddenly needs to send, say, a thousand emails (halving the numbers I found on a big comm) at once any time a post is made. If ten posts get made in half an hour, that's 10k emails in that half hour, but not spaced evenly because we all want our notifications as fast as possible.
Or if, another hypothetical here, a lot of people have tracking set up on, say,
And those are just a couple of examples I can think of, not even counting the
Re: Sending emails
That is, actually, a routine and easy scenario, because it is well below the average rate of email sending on Dreamwidth.
If Dreamwidth sends 30M emails per day, then it sends 1M+ emails per hour anyway.
> we all want our notifications as fast as possible
What do you think is an acceptable delay for notifications when a post on dw_news goes up?
Would a delay of several minutes be acceptable in that situation?
> fail_fandomanon, which gets 6k comments per post
That, probably, generates a lot of emails, but these emails will be almost evenly spread (a lot of relatively small batches of emails).
I think that if the service is able to send 10x faster than the average email sending rate, that should be sufficient.
That means that service should be able to send about 12 million emails per hour.
Which is ~200,000 emails per minute.
Which is ~3,300 emails per second.
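The arithmetic behind those figures, written out as a quick check. The 30M/day input is the figure assumed earlier in this thread, not a measured number, and the variable names are mine. Without rounding 12.5M down to 12M, the results come out slightly higher than the rounded values above.

```python
EMAILS_PER_DAY = 30_000_000  # assumed daily volume from earlier in the thread
HEADROOM = 10                # "10x faster than average"

avg_per_hour = EMAILS_PER_DAY / 24        # 1,250,000 per hour on average
peak_per_hour = avg_per_hour * HEADROOM   # 12,500,000 per hour at 10x
peak_per_minute = peak_per_hour / 60      # roughly 208,000 per minute
peak_per_second = peak_per_minute / 60    # roughly 3,470 per second

print(f"{avg_per_hour:,.0f}/h average, {peak_per_hour:,.0f}/h peak, "
      f"{peak_per_minute:,.0f}/min, {peak_per_second:,.0f}/s")
```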
Re: Sending emails
...dude. Several minutes would be freaking amazing.
When
Re: Sending emails