Notification queue fills with single down AM instance #7676
Comments
How exactly was the bad AM down? Would it have been a failed TCP connection, or would it have hung until timeout?
Whole DC was down, so failed TCP connection.
We have a timeout on the sends and do things concurrently, so this doesn't seem likely to be a bug. What I'd say happened is that with so many alerts, you ran into a throughput issue. We send batches of 64 alerts with a default 10s timeout, which limits things to 6.4 alerts/s, or around 384 active alerts given that the default resend delay is 1m. So most likely what we need to do here is adjust our defaults to allow for much higher throughput.
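The arithmetic in that comment can be checked directly. The constants below are taken from the comment above, not read from the Prometheus source, so treat them as that commenter's assumptions:

```go
package main

import "fmt"

func main() {
	// Defaults as cited in the comment above:
	const batchSize = 64   // alerts sent per batch
	const timeout = 10     // seconds; default per-send timeout
	const resendDelay = 60 // seconds; default resend delay (1m)

	// Worst case: every batch takes the full timeout, so throughput
	// is capped at batchSize/timeout alerts per second.
	throughput := float64(batchSize) / float64(timeout)

	// Each active alert is re-sent once per resend delay, so the
	// sustainable number of active alerts is throughput * resendDelay.
	sustainable := batchSize * resendDelay / timeout // integer math

	fmt.Printf("max throughput: %.1f alerts/s\n", throughput)      // 6.4
	fmt.Printf("sustainable active alerts: %d\n", sustainable)     // 384
}
```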
Your comment got me thinking and digging further: the AM instance went down ~03:50 UTC on the 24th (unpingable), but the notification queue didn't really build up until the border gateway router went down ~11:00. I'm not certain what the behavior would have been at that point - whether it'd start timing out or just get ICMP unreachable messages. Unfortunately for this situation, the border router is back at this point, so I can't test, but our legacy Nagios instance was reporting UNREACHABLE for the AM instance at that point (it flipped back to just host down ~05:52 the next day).
What does the prometheus_notifications_latency_seconds metric show?
The .5 bucket for that AM instance shot up to 10s as soon as the AM instance went down ~04:00. No appreciable change when the AM went unreachable ~11:00, but that's when prometheus_notifications_queue_length shot straight up.
I have a similar issue - 2 DCs, 2x Prometheus + Alertmanager per DC. DC1 went down for maintenance. The issue: in DC2, there are no active alerts in Alertmanager, while there are in Prometheus (web UI - Alerts). From the logs I see:
From the docs - https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config - I don't see a way to increase the throughput. The only setting that looks relevant from my point of view is in the remote_write section, which is for a different purpose, to my understanding ( https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write ). Prometheus version: prometheus-2.30.0-1.el7.x86_64
Hello from the bug scrub: @krajorama takes an action to verify whether there is a single queue or multiple queues for the attached AMs in the ruler.
prometheus/notifier/notifier.go Line 564 in 4b7a44c
I could imagine that waiting for all alertmanagers to return causes the queue to fill up, if one doesn't respond.
This answers your question, I believe? (i.e. a shared queue, where each batch is sent concurrently to all AMs) prometheus/notifier/notifier.go Line 451 in 4b7a44c
Yes, that certainly looks like it. Also, I don't know if the code actually respects the context timeout set on this line:
prometheus/notifier/notifier.go Line 215 in 2524a91
https://pkg.go.dev/net/http#Request.WithContext
On cursory view, the timeout (default 10s) should work.
Ref: prometheus#7676 Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
I wrote a unit test in #14099 that shows what I think is happening: alert delivery does not stop, but instead of emptying the queue on each iteration, we incur a 10s penalty (the default timeout) for every sendAll call, which means that throughput drops dramatically. I put some ideas into the PR regarding a solution. Comments welcome on the test as well as the ideas.
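Putting rough numbers on that failure mode: once every sendAll call costs the full timeout, the drain rate collapses to batchSize/timeout, and any higher arrival rate grows the queue until it hits capacity and drops alerts. In this back-of-envelope sketch the arrival rate is an illustrative assumption, and 10000 is my understanding of the default --alertmanager.notification-queue-capacity, so treat both as assumptions:

```go
package main

import "fmt"

func main() {
	const batchSize = 64.0
	const timeout = 10.0          // s; default per-send timeout
	const arrivalRate = 50.0      // alerts/s; assumed for illustration
	const queueCapacity = 10000.0 // assumed default queue capacity

	// With a dead AM, every batch takes the full timeout to "send".
	drainRate := batchSize / timeout // 6.4 alerts/s
	growth := arrivalRate - drainRate
	secondsToFull := queueCapacity / growth

	fmt.Printf("queue grows at %.1f alerts/s; full in ~%.0f s (~%.1f min)\n",
		growth, secondsToFull, secondsToFull/60)
}
```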
What did you do?
DC went down, taking 1 of 3 Alertmanager instances with it and also causing a large number of alerts
(there is cross DC monitoring right now for $reasons - that's being addressed, but is not directly relevant to this issue).
What did you expect to see?
Prometheus sends notifications to remaining 2 AM instances, skipping down AM instance.
Not sure if it'd be as straight-forward as a queue per AM - I could see odd timing issues with that.
What did you see instead? Under which circumstances?
prometheus_notifications_queue_length eventually filled and Prometheus started dropping notifications, even though a majority of AM instances were perfectly functional.
Alertmanagers are configured via static_configs.
Commenting-out the down AM instance from the config addressed the issue - things have been fine since then.
DC02 was down - most of the notifications were generated from the prom pair in DC01, and those notification queues filled.
DC03 was also sending notifications, albeit at a far lower rate, and prometheus_notifications_queue_length on those instances topped out around 300.
Environment
Linux 3.10.0-1127.el7.x86_64 x86_64
prometheus, version 2.19.2 (branch: HEAD, revision: c448ada)
build user: root@dd72efe1549d
build date: 20200626-09:02:20
go version: go1.14.4
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
build user: root@dee35927357f
build date: 20200617-08:54:02
go version: go1.14.4
Relevant bit:
prod-am02.dc02
was down (along with the rest of DC02) - commenting that out of the config fixed the issue. Don't think it's relevant, but let me know.
Sample from DC01 prom instance: