Surface the problem the alert is reporting in favour of an alert id #14059
-
The idea for this discussion came from a chat that @MrZammler, @ilyam8 and I were having about a potential solution for the bug reported in netdata/netdata-cloud#130. We need to agree conceptually on the key information to display when an alert is triggered, whether new attributes need to be added, and what the purpose/role of each attribute in the alert definition will be. Tagging a few folks for visibility and input: @amalkov @ktsaou @ralphm @sashwathn
-
My tl;dr: we need:
We use the info field for 1. The 2nd (which we don't have at the moment) should be used instead of the alert id (2nd column).
-
Hi guys, one small point that might be a technicality, but still: currently the notification script builds the final subject line for the notification like this:
If we aim to replace the alert_id with the new field, we need to decide whether to include, e.g., the value or the hostname as part of it. If we do, then we need to rethink the subject line so that the value does not appear there twice. Similarly, if this new field is propagated to the cloud to be shown in the alert list, we already have the value in a column there as well (so it would actually need to be stripped out...). So even if the initial implementation of this field had variables for value, hostname, severity, etc., I think we should not include them. Rather, the field should only have variables for family and chart labels, like the current info field has. I like what @shyamvalsan shared there as well; this seems like what this new field can be. I would argue that we don't need to include words like
but it's a minor nitpick. Of course, since the field is part of the alert config, the user can choose to use whatever they want there.
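To make the "family and chart labels only" idea concrete, here is a minimal Python sketch of how such a field could be expanded. It is not the agent's implementation: the function name `render_summary`, the `${family}` / `${label:key}` placeholder syntax, and the example strings are all assumptions made for illustration.

```python
import re
from typing import Mapping

# Matches ${name} and ${label:key}. The placeholder syntax is an assumption
# for this sketch, not a statement about how the agent parses alert configs.
_VAR = re.compile(r"\$\{([^}]+)\}")


def render_summary(template: str, family: str, labels: Mapping[str, str]) -> str:
    """Expand only family and chart-label variables; reject anything else."""

    def expand(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name == "family":
            return family
        if name.startswith("label:"):
            # Unknown labels expand to an empty string in this sketch.
            return labels.get(name[len("label:"):], "")
        # value, hostname, severity, etc. are deliberately not supported, so
        # they cannot end up duplicated in the subject line or the alert list.
        raise ValueError(f"variable not allowed in summary: {name}")

    return _VAR.sub(expand, template)


# Hypothetical usage: the label name and the wording are made up.
print(render_summary("Disk ${label:mount_point} on ${family} is filling up",
                     family="sda", labels={"mount_point": "/var"}))
```

Rejecting other variables outright is only one option; silently leaving them unexpanded would also keep the value out of the field.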
-
@ilyam8 @MrZammler I think the pending question is whether to stop using family as a variable and start using chart labels instead, right? I understand it will probably take more effort to review the alerts that use family and identify whether there is a chart label that could be used instead, but the introduction of this new field doesn't need to be done in one shot; we could roll it out gradually across the alert definitions.
-
This came up again in light of #14006, which aims to add a configurable title to alerts. I want to point to other work in this area, which I see as a generalization of that solution. In Prometheus alerting rules, besides the alert name/identifier (
An example: As you can see, the annotations in this example are templated using the Go template language. Alert labels can be templated, too. I think we can do something very similar. In our case,
-
As we needed to represent the
-
Current state
Our current approach is mostly to surface the alert id across the Netdata Cloud UI and alert notification templates, adding some additional details (info field).
Examples:
Email template
Netdata Cloud UI - Alerts tab
Netdata Cloud UI - Dedicated Landing page or side drawer
Proposal
When presenting alerts on the Netdata Cloud UI and/or through notification channels, we should aim to display more clearly what problem the alert is referring to. This would mean that instead of displaying the alert id and the info field as the key prominent information, we should display a clearer short phrase describing the problem that is occurring. Some examples:
CPU usage alert
Current: Critical, 10min_cpu_usage = <value> %, on <node-name>
Current: 10min cpu usage, with tooltip or extra details: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
Proposed: CPU load is very high on <node-name> with <value> % (optionally add instances, for example the CPU core)
Proposed: CPU load is very high
PostgreSQL connections
Current: Warning, postgres_total_connection_utilization = <value> %, on <node-name>
Current: postgres total connection utilization, with tooltip or extra details: average total connection utilization over the last minute
Proposed: PostgreSQL: High connection count utilization on <node-name> with <value> %
Proposed: PostgreSQL: High connection count utilization
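The phrasing in the examples above could be produced from a single short problem phrase, with each surface adding only the context it needs. The Python sketch below is illustrative only: the `Alert` class, its field names (in particular `summary`), and both helper functions are hypothetical and do not exist in Netdata.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    summary: str   # hypothetical new field, e.g. "CPU load is very high"
    node: str      # node name as shown to the user
    value: float   # current alert value
    units: str     # e.g. "%"


def ui_title(alert: Alert) -> str:
    """Short phrase only; the UI already shows node and value elsewhere."""
    return alert.summary


def notification_title(alert: Alert) -> str:
    """Longer phrase for channels where node and value are not shown separately."""
    return f"{alert.summary} on {alert.node} with {alert.value:g} {alert.units}"


# Hypothetical usage with made-up node name and value.
alert = Alert("CPU load is very high", node="node-1", value=92.5, units="%")
print(ui_title(alert))
print(notification_title(alert))
```

Keeping the node and value out of the stored phrase means each surface can decide whether to append them.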
Discussion
This discussion aims to clarify that we seem to need: