Notification queue fills with single down AM instance #7676

Open
britcey opened this issue Jul 27, 2020 · 14 comments · May be fixed by #14099

britcey commented Jul 27, 2020

What did you do?

A DC went down, taking 1 of 3 Alertmanager instances with it and also generating a large number of alerts.

(There is cross-DC monitoring right now for $reasons - that's being addressed, but it is not directly relevant to this issue.)

What did you expect to see?

Prometheus sends notifications to the remaining 2 AM instances, skipping the down AM instance.

Not sure if it'd be as straightforward as a queue per AM - I could see odd timing issues with that.

What did you see instead? Under which circumstances?

The notification queue (prometheus_notifications_queue_length) eventually filled and Prometheus started dropping notifications, even though a majority of AM instances were perfectly functional.

Alertmanagers are configured via static_configs.

Commenting out the down AM instance from the config addressed the issue - things have been fine since then.

DC02 was down - most of the notifications were generated by the Prometheus pair in DC01, and it was those notification queues that filled.

DC03 was also sending notifications, albeit at a far lower rate, and prometheus_notifications_queue_length on those instances topped out at ~300.

Environment

  • System information:

Linux 3.10.0-1127.el7.x86_64 x86_64

  • Prometheus version:

prometheus, version 2.19.2 (branch: HEAD, revision: c448ada)
build user: root@dd72efe1549d
build date: 20200626-09:02:20
go version: go1.14.4

  • Alertmanager version:

alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
build user: root@dee35927357f
build date: 20200617-08:54:02
go version: go1.14.4

  • Prometheus configuration file:
    Relevant bit:
alerting:
  alert_relabel_configs:
    - regex: 'replica'
      action: labeldrop
  alertmanagers:
    - scheme: https
      path_prefix: alertmanager/
      basic_auth:
        username: prometheus
        password: "xxxxx"
      static_configs:
        - targets:
          - 'prod-am01.dc01'
          - 'prod-am02.dc02'
          - 'prod-am03.dc03'

prod-am02.dc02 was down (along with the rest of DC02) - commenting that out of the config fixed the issue.

  • Alertmanager configuration file:

Don't think it's relevant, but let me know.

  • Logs:
    Sample from DC01 prom instance:
Jul 24 12:15:26 prod-prom01 prometheus: level=warn ts=2020-07-24T12:15:26.539Z caller=notifier.go:379 component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=29
@brian-brazil (Contributor)

How exactly was the bad AM down? Would it have been a failed TCP connection, or would it have hung until timeout?


britcey commented Jul 27, 2020

The whole DC was down, so a failed TCP connection.

@brian-brazil (Contributor)

We have a timeout on the sends and do things concurrently, so this doesn't seem likely to be a bug. What I'd say happened is that with so many alerts, you ran into a throughput issue. We do batches of 64 alerts, and there's a default 10s timeout - so that limits things to 6.4 alerts/s, or around 384 active alerts given that the default resend delay is 1m.

So most likely what we need to do here is adjust our defaults to allow for much higher throughput.
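For reference, a minimal sketch of that arithmetic, assuming the defaults quoted above (batch size 64, 10s send timeout, 1m resend delay); the figures are illustrative, not read from the Prometheus source:

// Back-of-the-envelope version of the arithmetic above; constants are the
// defaults quoted in this thread, not values read from the Prometheus code.
package main

import "fmt"

func main() {
	const (
		batchSize      = 64.0 // alerts per batch sent to each Alertmanager
		sendTimeoutSec = 10.0 // default per-batch send timeout
		resendDelaySec = 60.0 // default resend delay for still-firing alerts
	)
	worstCaseRate := batchSize / sendTimeoutSec         // ~6.4 alerts/s if every batch waits out the timeout
	sustainableActive := worstCaseRate * resendDelaySec // ~384 active alerts that can be re-sent every minute
	fmt.Printf("worst-case throughput: %.1f alerts/s\n", worstCaseRate)
	fmt.Printf("sustainable active alerts: %.0f\n", sustainableActive)
}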


britcey commented Jul 27, 2020

Your comment got me thinking & digging further:

The AM instance went down ~ 03:50 UTC on the 24th (unpingable), but the notification queue didn't really build up until the border gateway router went down ~ 11:00.

I'm not certain what the behavior would have been at that point - whether it'd start timing out or just get ICMP unreachable messages.

Unfortunately for this situation, the border router is back at this point, so I can't test, but our legacy Nagios instance was reporting UNREACHABLE for the AM instance at that point (it flipped back to just host-down at ~05:52 the next day).

@brian-brazil (Contributor)

What does the prometheus_notifications_latency_seconds metric show?


britcey commented Jul 27, 2020

The 0.5 quantile of that metric for the down AM instance shot up to 10s as soon as the AM went down at ~04:00. There was no appreciable change when the AM became unreachable at ~11:00, but that is when prometheus_notifications_queue_length shot straight up.


britcey commented Jul 27, 2020

Graphs of the above - you can see that latency went up well before the notifications queue filled.

[Two screenshots: graphs of notification latency and notification queue length]

@CoolCold

I have a similar issue - 2 DCs, 2x Prometheus + Alertmanager per DC. DC1 went down for maintenance.

The issue: in DC2 there are no active alerts in Alertmanager, while there are in Prometheus (web UI - Alerts).

From the logs I see:

Dec 14 17:46:22 s2217.j prometheus[12471]: level=error ts=2023-12-14T17:46:22.713Z caller=notifier.go:527 component=notifier alertmanager=http://s3066.j:9093/api/v2/alerts count=64 msg="Error sending alert" err="Post \"http://s3066.j:9093/api/v2/alerts\": dial tcp 10.215.79.17:9093: i/o timeout"
Dec 14 17:46:32 s2217.j prometheus[12471]: level=warn ts=2023-12-14T17:46:32.238Z caller=notifier.go:379 component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=754
Dec 14 17:46:32 s2217.j prometheus[12471]: level=warn ts=2023-12-14T17:46:32.592Z caller=notifier.go:379 component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=277


From the docs - https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config - I don't see a way to increase the throughput. The only setting that looks relevant to me is in the remote_write section, which to my understanding is for a different purpose (https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write).

Prometheus version: prometheus-2.30.0-1.el7.x86_64
Alertmanager version: alertmanager-0.23.0-1.el7.x86_64

@krajorama (Member)

Hello from the bug scrub: @krajorama to take an action item to verify whether there is a single queue or multiple queues for the attached AMs in the ruler.


nielsole commented May 7, 2024


I could imagine that waiting for all alertmanagers to return causes the queue to fill up, if one doesn't respond.


nielsole commented May 7, 2024

This answers your question, I believe? (i.e. a shared queue, where each batch is sent concurrently to all AMs)

// sendAll sends the alerts to all configured Alertmanagers concurrently.
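A minimal sketch of that pattern (hypothetical names, not the actual notifier code): one shared queue, each batch fanned out to every Alertmanager concurrently, with the loop moving on only once all sends have returned, so a single timing-out AM paces the whole batch.

package main

import (
	"fmt"
	"sync"
	"time"
)

// sendAllSketch stands in for sendAll: fan the batch out, then wait for all sends.
func sendAllSketch(perAMSendTime []time.Duration) {
	var wg sync.WaitGroup
	for _, d := range perAMSendTime {
		wg.Add(1)
		go func(d time.Duration) {
			defer wg.Done()
			time.Sleep(d) // stand-in for the HTTP POST to one Alertmanager
		}(d)
	}
	wg.Wait() // the batch is only "done" when the slowest send finishes
}

func main() {
	// Two healthy AMs (~50ms) and one that hangs until a 10s timeout.
	latencies := []time.Duration{50 * time.Millisecond, 50 * time.Millisecond, 10 * time.Second}
	start := time.Now()
	sendAllSketch(latencies)
	fmt.Println("one batch took", time.Since(start)) // ~10s, gated by the down AM
}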

@krajorama (Member)

Yes, that certainly looks like it. Also, I don't know if the code actually respects the context timeout set on line 544:

ctx, cancel := context.WithTimeout(n.ctx, time.Duration(ams.cfg.Timeout))


nielsole commented May 7, 2024

return client.Do(req.WithContext(ctx))

https://pkg.go.dev/net/http#Request.WithContext

For outgoing client request, the context controls the entire lifetime of a request

On a cursory view, the timeout (default 10s) should work.
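A small self-contained check of that mechanism, using only the standard library: a request carrying a context deadline is cancelled once the deadline expires. The slow test server stands in for an unreachable Alertmanager; names and timings are illustrative.

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A server that takes far longer than our timeout to answer.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	req, err := http.NewRequest(http.MethodPost, srv.URL+"/api/v2/alerts", nil)
	if err != nil {
		panic(err)
	}
	_, err = http.DefaultClient.Do(req.WithContext(ctx))
	fmt.Println(err) // context deadline exceeded: the per-send timeout is honoured
}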

krajorama added a commit to krajorama/prometheus that referenced this issue May 14, 2024
@krajorama (Member)

I wrote a unit test in #14099 that shows what I think is happening: alert delivery does not stop, but instead of emptying the queue on each iteration, we incur a 10s penalty (the default timeout) for every sendAll call, which means that throughput drops dramatically. I put some ideas into the PR regarding a solution. Comments are welcome on the test as well as the ideas.
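A rough model of the resulting queue dynamics, with assumed numbers (a queue capacity of 10000, which I believe is the --alertmanager.notification-queue-capacity default, and an illustrative inflow of 50 alerts/s): once the drain rate collapses to ~6.4 alerts/s, any sustained inflow above that fills the queue and alerts are dropped.

package main

import "fmt"

func main() {
	const (
		queueCapacity = 10000.0 // assumed default queue capacity
		batchSize     = 64.0    // alerts per sendAll call
		penaltySec    = 10.0    // each sendAll call waits out the timeout
		inflowPerSec  = 50.0    // illustrative alert arrival rate during the outage
	)
	drainPerSec := batchSize / penaltySec // ~6.4 alerts/s leave the queue
	secondsToFill := queueCapacity / (inflowPerSec - drainPerSec)
	fmt.Printf("drain: %.1f alerts/s, queue full after ~%.0f s, then alerts are dropped\n",
		drainPerSec, secondsToFill)
}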
