Mark Smith ([staff profile] mark) wrote in [site community profile] dw_maintenance 2019-12-02 11:50 am

Notifications slow -- but recovering

Hi all,

Due to some behind-the-scenes maintenance last night, our notifications system got delayed. I've fixed the issue now and it's working on catching up.

For details -- I've been experimenting with Kubernetes as a way to make managing production easier (and hopefully reduce costs!), but it turns out that one of our worker jobs that handles notifications doesn't use much CPU (it mostly spends time waiting on the database).

This caused the pod autoscaler to reduce the size of that particular deployment below what we needed to sustain throughput on our notifications service. The temporary fix is to pin that deployment size to something much larger; the better fix will be to integrate Kubernetes' pod autoscaler with the ability to monitor the queue depth on our task queue.
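
To illustrate the queue-depth idea, here is a minimal sketch of sizing a deployment from the backlog instead of from CPU usage. It is not the actual autoscaler integration: the function name, the per-worker target, and the replica bounds are all hypothetical values chosen for illustration.

```python
import math


def desired_replicas(queue_depth: int, target_per_worker: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Return a worker count sized to the task backlog rather than CPU usage.

    All parameters are illustrative assumptions, not real production settings.
    """
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    # One worker per `target_per_worker` queued tasks, rounded up.
    wanted = math.ceil(queue_depth / target_per_worker)
    # Clamp so the deployment never shrinks below a safe floor
    # (the failure described above) or grows without bound.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 1_000, 100_000):
        print(depth, desired_replicas(depth, 100, 2, 50))
```

With a floor like `min_replicas`, an idle-looking worker that is mostly waiting on the database would no longer be scaled down just because its CPU usage is low.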

Sorry for the trouble, and thank you to the person who pinged us on Twitter. When I checked last night, everything was working, but as traffic came back up we fell behind and I wasn't watching anymore. My bad.

Re: Sending emails

[personal profile] ilyena_sylph 2019-12-09 05:10 pm (UTC)
> Would several minutes delay be acceptable in that situation?

...dude. Several minutes would be freaking amazing.

When [site community profile] dw_news posts go out, notifications, all notifications sitewide, are slowed for at least an hour, normally more like 2. That's what [staff profile] mark is trying to fix with all the work he's doing -- among other things -- and what you wandered into the middle of.

Re: Sending emails

[personal profile] dennisgorelik 2019-12-09 07:13 pm (UTC)
> Several minutes would be freaking amazing.

Cool: having clear and realistic goals should help.
"5 minutes to send 1M notifications" is much more specific than "Send notifications immediately".

1 million in 5 minutes = 200,000 per minute
Which is about 10x faster than the average email sending speed on Dreamwidth now.

> are slowed for at least an hour, normally more like 2

Most likely, Dreamwidth's current maximum email sending speed is only ~1.5x the average sending rate. So after a big spike it takes the notification delivery service a long time to catch up, because there is only ~50% spare capacity.

If my ~1.5x estimate is correct, then only a ~7x speed improvement is needed (10x / 1.5x ≈ 6.7x).
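
The arithmetic behind these estimates can be sketched as follows. Every number here is one of my guesses from above (1M notifications per spike, a 10x gap to the average rate, ~1.5x capacity), not a measured Dreamwidth figure, and the model assumes average traffic keeps arriving while the backlog drains.

```python
def catch_up_minutes(burst: float, average_per_min: float,
                     capacity_factor: float) -> float:
    """Minutes needed to drain a burst using only spare capacity.

    capacity_factor is maximum sending speed divided by the average rate.
    """
    spare_per_min = average_per_min * (capacity_factor - 1.0)
    if spare_per_min <= 0:
        return float("inf")  # no spare capacity: the backlog never drains
    return burst / spare_per_min


# Estimated inputs (assumptions, not measurements).
target_rate = 1_000_000 / 5          # goal: 200,000 notifications per minute
average_rate = target_rate / 10      # goal is assumed ~10x the current average
capacity = 1.5 * average_rate        # assumed current maximum sending speed

speedup_needed = target_rate / capacity
backlog_minutes = catch_up_minutes(1_000_000, average_rate, 1.5)

if __name__ == "__main__":
    print(f"speedup needed: {speedup_needed:.1f}x")
    print(f"catch-up time today: {backlog_minutes:.0f} minutes")
```

Under these assumptions a 1M spike takes on the order of 100 minutes to clear, which is roughly in line with the 1 to 2 hour delays described.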

Incremental speed improvement, in my opinion, is the best strategy in this situation:
- Gradually add more threads to the notification sending service.
- Monitor database performance and delays along the way.
- Fine-tune database queries and indexes.
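
As a rough illustration of the first two steps, here is a small sketch of a sender whose thread count is a tunable setting and which reports elapsed time so each increase can be measured. It is not Dreamwidth's code: `fake_send` is a hypothetical stand-in that only sleeps to imitate waiting on the database or mail server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_send(notification_id: int) -> int:
    """Hypothetical stand-in for sending one notification (I/O wait only)."""
    time.sleep(0.01)
    return notification_id


def send_all(notification_ids, workers: int):
    """Send every notification with `workers` threads.

    Returns (number sent, elapsed seconds) so throughput can be compared
    before and after each change to the thread count.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sent = list(pool.map(fake_send, notification_ids))
    return len(sent), time.perf_counter() - start


if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        count, elapsed = send_all(range(40), workers)
        print(f"{workers} threads: {count / elapsed:.0f} notifications/sec")
```

Because the work is mostly waiting, more threads should raise throughput until the database becomes the bottleneck, which is the point where monitoring and query tuning matter.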

Replacing the existing system with a new external component would probably be much riskier and more involved.