Notification queue fills with single down AM instance #7676

Open
britcey opened this issue Jul 27, 2020 · 14 comments · May be fixed by #14099

britcey commented Jul 27, 2020

What did you do?

A DC went down, taking 1 of 3 Alertmanager instances with it and also generating a large number of alerts.

(There is cross-DC monitoring right now for $reasons - that's being addressed, but it is not directly relevant to this issue.)

What did you expect to see?

Prometheus sends notifications to the remaining 2 AM instances, skipping the down AM instance.

Not sure if it'd be as straightforward as a queue per AM - I could see odd timing issues with that.

What did you see instead? Under which circumstances?

The notification queue (prometheus_notifications_queue_length) eventually filled and Prometheus started dropping notifications, even though a majority of AM instances were perfectly functional.

Alertmanagers are configured via static_configs.

Commenting out the down AM instance from the config addressed the issue - things have been fine since then.

DC02 was down - most of the notifications were generated by the Prometheus pair in DC01, and it was those notification queues that filled.

DC03 was also sending notifications, albeit at a far lower rate, and prometheus_notifications_queue_length on those instances topped out at ~300.

Environment

  • System information:

Linux 3.10.0-1127.el7.x86_64 x86_64

  • Prometheus version:

prometheus, version 2.19.2 (branch: HEAD, revision: c448ada)
build user: root@dd72efe1549d
build date: 20200626-09:02:20
go version: go1.14.4

  • Alertmanager version:

alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
build user: root@dee35927357f
build date: 20200617-08:54:02
go version: go1.14.4

  • Prometheus configuration file:
    Relevant bit:
alerting:
  alert_relabel_configs:
    - regex: 'replica'
      action: labeldrop
  alertmanagers:
    - scheme: https
      path_prefix: alertmanager/
      basic_auth:
        username: prometheus
        password: "xxxxx"
      static_configs:
        - targets:
          - 'prod-am01.dc01'
          - 'prod-am02.dc02'
          - 'prod-am03.dc03'

prod-am02.dc02 was down (along with the rest of DC02) - commenting that out of the config fixed the issue.

  • Alertmanager configuration file:

Don't think it's relevant, but let me know.

  • Logs:
    Sample from DC01 prom instance:
Jul 24 12:15:26 prod-prom01 prometheus: level=warn ts=2020-07-24T12:15:26.539Z caller=notifier.go:379 component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=29
@brian-brazil (Contributor)

How exactly was the bad AM down? Would it have been a failed TCP connection, or would it have hung until timeout?


britcey commented Jul 27, 2020

The whole DC was down, so a failed TCP connection.

@brian-brazil (Contributor)

We have a timeout on the sends and do things concurrently, so this doesn't seem likely to be a bug. What I'd say happened is that with so many alerts, you ran into a throughput issue. We do batches of 64 alerts, and there's a default 10s timeout - so that limits things to 6.4 alerts/s, or around 384 active alerts given that the default resend delay is 1m.

So most likely what we need to do here is adjust our defaults to allow for much higher throughput.
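For reference, a minimal sketch of that arithmetic, assuming the defaults quoted above (batch size 64, 10s send timeout, 1m resend delay); the figures are illustrative, not read from the Prometheus source:

// Back-of-the-envelope version of the arithmetic above; constants are the
// defaults quoted in this thread, not values read from the Prometheus code.
package main

import "fmt"

func main() {
	const (
		batchSize      = 64.0 // alerts per batch sent to each Alertmanager
		sendTimeoutSec = 10.0 // default per-batch send timeout
		resendDelaySec = 60.0 // default resend delay for still-firing alerts
	)
	worstCaseRate := batchSize / sendTimeoutSec         // ~6.4 alerts/s if every batch waits out the timeout
	sustainableActive := worstCaseRate * resendDelaySec // ~384 active alerts that can be re-sent every minute
	fmt.Printf("worst-case throughput: %.1f alerts/s\n", worstCaseRate)
	fmt.Printf("sustainable active alerts: %.0f\n", sustainableActive)
}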


britcey commented Jul 27, 2020

Your comment got me thinking & digging further:

The AM instance went down ~ 03:50 UTC on the 24th (unpingable), but the notification queue didn't really build up until the border gateway router went down ~ 11:00.

I'm not certain what the behavior would have been at that point - whether it'd start timing out or just get ICMP unreachable messages.

Unfortunately for this situation, the border router is back at this point, so I can't test, but our legacy Nagios instance was reporting UNREACHABLE for the AM instance at that point (it flipped back to just host-down at ~05:52 the next day).

@brian-brazil (Contributor)

What does the prometheus_notifications_latency_seconds metric show?


britcey commented Jul 27, 2020

The 0.5 quantile of that metric for the down AM instance shot up to 10s as soon as the AM went down at ~04:00. There was no appreciable change when the AM became unreachable at ~11:00, but that is when prometheus_notifications_queue_length shot straight up.


britcey commented Jul 27, 2020

Graphs of the above - you can see that latency went up well before the notifications queue filled.

[Two screenshots: graphs of notification latency and notification queue length]

@CoolCold

I have a similar issue - 2 DCs, 2x Prometheus + Alertmanager per DC. DC1 went down for maintenance.

The issue: in DC2 there are no active alerts in Alertmanager, while there are in Prometheus (web UI - Alerts).

From the logs I see:

Dec 14 17:46:22 s2217.j prometheus[12471]: level=error ts=2023-12-14T17:46:22.713Z caller=notifier.go:527 component=notifier alertmanager=http://s3066.j:9093/api/v2/alerts count=64 msg="Error sending alert" err="Post \"http://s3066.j:9093/api/v2/alerts\": dial tcp 10.215.79.17:9093: i/o timeout"
Dec 14 17:46:32 s2217.j prometheus[12471]: level=warn ts=2023-12-14T17:46:32.238Z caller=notifier.go:379 component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=754
Dec 14 17:46:32 s2217.j prometheus[12471]: level=warn ts=2023-12-14T17:46:32.592Z caller=notifier.go:379 component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=277


From the docs - https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config - I don't see a way to increase the throughput. The only setting that looks relevant to me is in the remote_write section, which to my understanding is for a different purpose (https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write).

Prometheus version: prometheus-2.30.0-1.el7.x86_64
Alertmanager version: alertmanager-0.23.0-1.el7.x86_64

@krajorama (Member)

Hello from the bug scrub: @krajorama to take an action item to verify whether there is a single queue or multiple queues for the attached AMs in the ruler.


nielsole commented May 7, 2024


I could imagine that waiting for all alertmanagers to return causes the queue to fill up, if one doesn't respond.


nielsole commented May 7, 2024

This answers your question, I believe? (i.e. a shared queue, where each batch is sent concurrently to all AMs)

// sendAll sends the alerts to all configured Alertmanagers concurrently.
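A minimal sketch of that pattern (hypothetical names, not the actual notifier code): one shared queue, each batch fanned out to every Alertmanager concurrently, with the loop moving on only once all sends have returned, so a single timing-out AM paces the whole batch.

package main

import (
	"fmt"
	"sync"
	"time"
)

// sendAllSketch stands in for sendAll: fan the batch out, then wait for all sends.
func sendAllSketch(perAMSendTime []time.Duration) {
	var wg sync.WaitGroup
	for _, d := range perAMSendTime {
		wg.Add(1)
		go func(d time.Duration) {
			defer wg.Done()
			time.Sleep(d) // stand-in for the HTTP POST to one Alertmanager
		}(d)
	}
	wg.Wait() // the batch is only "done" when the slowest send finishes
}

func main() {
	// Two healthy AMs (~50ms) and one that hangs until a 10s timeout.
	latencies := []time.Duration{50 * time.Millisecond, 50 * time.Millisecond, 10 * time.Second}
	start := time.Now()
	sendAllSketch(latencies)
	fmt.Println("one batch took", time.Since(start)) // ~10s, gated by the down AM
}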

@krajorama (Member)

Yes, that certainly looks like it. Also, I don't know if the code actually respects the context timeout set on line 544:

ctx, cancel := context.WithTimeout(n.ctx, time.Duration(ams.cfg.Timeout))


nielsole commented May 7, 2024

return client.Do(req.WithContext(ctx))

https://pkg.go.dev/net/http#Request.WithContext

For outgoing client request, the context controls the entire lifetime of a request

On a cursory view, the timeout (default 10s) should work.
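A small self-contained check of that mechanism, using only the standard library: a request carrying a context deadline is cancelled once the deadline expires. The slow test server stands in for an unreachable Alertmanager; names and timings are illustrative.

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A server that takes far longer than our timeout to answer.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	req, err := http.NewRequest(http.MethodPost, srv.URL+"/api/v2/alerts", nil)
	if err != nil {
		panic(err)
	}
	_, err = http.DefaultClient.Do(req.WithContext(ctx))
	fmt.Println(err) // context deadline exceeded: the per-send timeout is honoured
}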

krajorama added a commit to krajorama/prometheus that referenced this issue May 14, 2024
@krajorama (Member)

I wrote a unit test in #14099 that shows what I think is happening: alert delivery does not stop, but instead of emptying the queue on each iteration, we incur a 10s penalty (the default timeout) for every sendAll call, which means that throughput drops dramatically. I put some ideas into the PR regarding a solution. Comments are welcome on the test as well as the ideas.
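A rough model of the resulting queue dynamics, with assumed numbers (a queue capacity of 10000, which I believe is the --alertmanager.notification-queue-capacity default, and an illustrative inflow of 50 alerts/s): once the drain rate collapses to ~6.4 alerts/s, any sustained inflow above that fills the queue and alerts are dropped.

package main

import "fmt"

func main() {
	const (
		queueCapacity = 10000.0 // assumed default queue capacity
		batchSize     = 64.0    // alerts per sendAll call
		penaltySec    = 10.0    // each sendAll call waits out the timeout
		inflowPerSec  = 50.0    // illustrative alert arrival rate during the outage
	)
	drainPerSec := batchSize / penaltySec // ~6.4 alerts/s leave the queue
	secondsToFill := queueCapacity / (inflowPerSec - drainPerSec)
	fmt.Printf("drain: %.1f alerts/s, queue full after ~%.0f s, then alerts are dropped\n",
		drainPerSec, secondsToFill)
}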
