The summary of our alerts (shown in the title and as the Slack message) includes all of the labels in the message, making it very hard to see what the real problem is. An example:
[FIRING:1] NodeMemoryMajorPagesFaults dev node-exporter http-metrics 10.92.10.18:9100 node-exporter monitoring kube-prometheus-stack-prometheus-node-exporter-n4gq8 monitoring/kube-prometheus-stack-prometheus kube-prometheus-stack-prometheus-node-exporter warning infra
Labels:
- alertname = NodeMemoryMajorPagesFaults
- cluster_id = dev
- container = node-exporter
- endpoint = http-metrics
- instance = 10.92.10.18:9100
- job = node-exporter
- namespace = monitoring
- pod = kube-prometheus-stack-prometheus-node-exporter-n4gq8
- prometheus = monitoring/kube-prometheus-stack-prometheus
- service = kube-prometheus-stack-prometheus-node-exporter
- severity = warning
- team = infra
Annotations:
- description = Memory major pages are occurring at very high rate at 10.92.10.18:9100, 500 major page faults per second for the last 15 minutes, is currently at 1400.70.
Please check that there is enough memory available at this instance.
- runbook_url = https://runbooks.prometheus-operator.dev/runbooks/node/nodememorymajorpagesfaults
- summary = Memory major page faults are occurring at very high rate.
Source: https://prometheus.ourdomain/graph?g0.expr=rate%28node_vmstat_pgmajfault%7Bjob%3D%22node-exporter%22%7D%5B5m%5D%29+%3E+500&g0.tab=1
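For reference, the g0.expr parameter in that source link URL-decodes to the expression behind the alert:

```
rate(node_vmstat_pgmajfault{job="node-exporter"}[5m]) > 500
```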
I've tried to override the summary in the AlertmanagerConfig like so:
details:
  - key: summary
    value: '{{ `{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}` }}'
This doesn't change anything, however. I'm pretty much out of ideas, and I would like to keep using this new CRD rather than falling back to our older method. Any help is appreciated.
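Based on my reading of the AlertmanagerConfig CRD (monitoring.coreos.com/v1alpha1), the PagerDuty incident title comes from the receiver's `description` field and the Slack message from `title`/`text`, while `details` only appends extra key/value pairs below the title, which would explain why overriding it changes nothing. Here is a minimal sketch of what I'd expect to work; the resource, receiver, and Secret names are made up:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example-alert-routing        # hypothetical name
  namespace: monitoring
spec:
  route:
    receiver: pagerduty-and-slack
  receivers:
    - name: pagerduty-and-slack
      pagerDutyConfigs:
        - routingKey:
            name: pagerduty-credentials   # hypothetical Secret
            key: routingKey
          # PagerDuty renders `description` as the incident title;
          # `details` only adds key/value pairs underneath it.
          description: '{{ range .Alerts }}{{ .Annotations.summary }} {{ end }}'
      slackConfigs:
        - apiURL:
            name: slack-credentials       # hypothetical Secret
            key: url
          # Slack builds the message from `title` and `text`.
          title: '{{ .CommonAnnotations.summary }}'
          text: '{{ range .Alerts }}{{ .Annotations.description }} {{ end }}'
```

One thing worth ruling out: the Helm escaping in my snippet above (wrapping the template in {{ ` ... ` }}) is only needed when the manifest is rendered by Helm. If the AlertmanagerConfig is applied directly with kubectl, that outer pair is itself a valid Go template action that emits the inner text as a literal string, so the range never executes.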
Steps to Reproduce
Expected Result
The summary displayed on PagerDuty and in Slack should simply be "Memory major page faults are occurring at very high rate.", as shown above.
Actual Result
Prometheus Operator Version

Kubernetes Version

Kubernetes Cluster Type
EKS

How did you deploy Prometheus-Operator?
helm chart: prometheus-community/kube-prometheus-stack

Manifests
No response

prometheus-operator log output

Anything else?
No response