Surface the problem the alert is reporting in favour of an alert id #14059

hugovalente-pm · 2022-11-28T15:32:07Z

hugovalente-pm
Nov 28, 2022

Current state

Our current approach is mostly to surface the alert id across the Netdata Cloud UI and alert notification templates, adding some additional details (info field).
Examples:

Email template

Alert id on the email title
Alert info displayed after the value

Netdata Cloud UI - Alerts tab

Alert id on the second column of the table
Alert info displayed on tooltip on hover of alert id

Netdata Cloud UI - Dedicated Landing page or side drawer

Alert id on top of the page
Alert info displayed on top position of the page

Proposal

When presenting alerts on the Netdata Cloud UI and/or through notification channels, we should aim to display more clearly what problem the alert is referring to. This would mean that instead of displaying the alert id and the info field as the key prominent information, we should display a clear, short phrase describing the problem that is occurring. Some examples:

CPU usage alert

Current:
- Notification email title: Critical, 10min_cpu_usage = <value> %, on <node-name>
- Netdata Cloud UI: 10min cpu usage, with tooltip or extra details: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
Proposal:
- Notification email title: CPU load is very high on <node-name> with <value> % (optionally add instances, example CPU core)
- Netdata Cloud UI: CPU load is very high

PostgreSQL connections

Current:
- Notification email title: Warning, postgres_total_connection_utilization = <value> %, on <node-name>
- Netdata Cloud UI: postgres total connection utilization, with tooltip or extra details: average total connection utilization over the last minute
Proposal:
- Notification email title: PostgreSQL: High connection count utilization on <node-name> with <value> %
- Netdata Cloud UI: PostgreSQL: High connection count utilization

Discussion

This discussion aims to clarify that we seem to need:

alert id - unique identifier of the alert, but most probably not a key piece of info to show on alert notification titles and the UI
info - this field still makes sense to display on notification templates and the UI, since it brings clarity to the way the value was calculated, but not as the key information to show.
title/problem - clear text surfacing what issue the alert is referring to with additional context on instances where it applies, e.g. httpcheck alerts, etc.
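
To make the three roles concrete, here is a minimal Python sketch. The `AlertDisplay` class and its field names are hypothetical and not part of Netdata; it only illustrates preferring the problem title where one exists, falling back to the alert id otherwise, and keeping the info text as secondary detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertDisplay:
    """Hypothetical container for the three pieces of text discussed above."""

    alert_id: str                # unique identifier, e.g. "10min_cpu_usage"
    info: str                    # explains how the value was calculated
    title: Optional[str] = None  # short statement of the problem (proposed field)

    def headline(self) -> str:
        # Prefer the problem title; fall back to the alert id when none is set.
        return self.title or self.alert_id

    def details(self) -> str:
        # Secondary text for a tooltip or the notification body.
        return f"{self.alert_id}: {self.info}"


cpu = AlertDisplay(
    alert_id="10min_cpu_usage",
    info="average cpu utilization for the last 10 minutes "
         "(excluding iowait, nice and steal)",
    title="CPU load is very high",
)
print(cpu.headline())
print(cpu.details())
```

Alerts that have not yet been given a title would keep showing the alert id, so a gradual rollout across alert definitions would not leave gaps.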

hugovalente-pm · 2022-11-28T15:35:17Z

hugovalente-pm
Nov 28, 2022
Author

the idea to trigger this discussion came from a chat that @MrZammler, @ilyam8 and I were having around a potential solution to tackle what had been reported in this bug: netdata/netdata-cloud#130

we need to agree conceptually on the key information to display when an alert is triggered, whether new attributes need to be added, and what the purpose/role of each attribute in the alert definition will be.

tagging a couple of folk for visibility and getting inputs: @amalkov @ktsaou @ralphm @sashwathn

0 replies

ilyam8 · 2022-11-28T15:59:45Z

ilyam8
Nov 28, 2022
Collaborator

My tl;dr

We need:

1. A human-readable description of the value (lookup + calc).
2. A short human-readable description of the problem (warn/crit).

We use the info field for 1. We have nothing for 2 at the moment; it should be used instead of the alert id (2nd column).

6 replies

ilyam8 Nov 29, 2022
Collaborator

We can auto generate these for our existing alerts.

Can you please share a few examples? Also, the tool (?) you use to auto-generate.

shyamvalsan Nov 29, 2022
Collaborator

I'm using GPT-3 (it's an LLM, a large language model, which can be used for generative tasks like this)

Here are some examples; the "Problem: " statement is auto-generated.

 template: cockroachdb_used_usable_storage_capacity
       on: cockroachdb.storage_used_capacity_percentage
    class: Utilization
     type: Database
component: CockroachDB
     calc: $capacity_usable_used_percent
    units: %
    every: 10s
     warn: $this > (($status >= $WARNING)  ? (80) : (85))
     crit: $this > (($status == $CRITICAL) ? (85) : (95))
    delay: down 15m multiplier 1.5 max 1h
     info: storage usable space utilization
       to: dba
Problem: “CockroachDB: High storage capacity utilization”


 template: postgres_total_connection_utilization
       on: postgres.connections_utilization
    class: Utilization
     type: Database
component: PostgreSQL
    hosts: *
   lookup: average -1m unaligned of used
    units: %
    every: 1m
     warn: $this > (($status >= $WARNING)  ? (70) : (80))
     crit: $this > (($status == $CRITICAL) ? (80) : (90))
    delay: down 15m multiplier 1.5 max 1h
     info: average total connection utilization over the last minute
       to: dba

Problem: “PostgreSQL: High total connection utilization”

 template: postgres_db_cache_io_ratio
       on: postgres.db_cache_io_ratio
    class: Workload
     type: Database
component: PostgreSQL
    hosts: *
   lookup: average -1m unaligned of miss
     calc: 100 - $this
    units: %
    every: 1m
     warn: $this < (($status >= $WARNING)  ? (70) : (60))
     crit: $this < (($status == $CRITICAL) ? (60) : (50))
    delay: down 15m multiplier 1.5 max 1h
     info: average cache hit ratio in db $label:database over the last minute
       to: dba

Problem: “PostgreSQL: Low cache hit ratio for database”

 template: vcsa_database_storage_health
       on: vcsa.components_health
    class: Errors
     type: Virtual Machine
component: VMware vCenter
   lookup: max -10s unaligned of database_storage
    units: status
    every: 10s
     warn: $this == 1
     crit: ($this == 2) || ($this == 3)
    delay: down 1m multiplier 1.5 max 1h
     info: database storage health status \
           (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey)
       to: sysadmin

Problem: “vCenter: Database storage health status is not green”

 template: vernemq_queue_message_expired
       on: vernemq.queue_undelivered_messages
    class: Latency
     type: Messaging
component: VerneMQ
   lookup: average -1m unaligned absolute of queue_message_expired
    units: expired messages
    every: 1m
     warn: $this > (($status >= $WARNING) ? (0) : (5))
    delay: up 2m down 5m multiplier 1.5 max 2h
     info: number of messages which expired before delivery in the last minute
       to: sysadmin

Problem: “VerneMQ: High number of expired messages before delivery”

ilyam8 Nov 30, 2022
Collaborator

That is good, yes. But I think it is too literal. And it probably can't handle variable replacement ($label:name).

shyamvalsan Nov 30, 2022
Collaborator

If we fine-tune with a few examples, it will probably learn the right way to do it (without being too literal). But it's still better than what's currently being shown in terms of "usefulness" imo.

As for variable replacement, if it's something that is present in the alert definition it might be able to do it; otherwise we'll need to add that logic ourselves.

hugovalente-pm Nov 30, 2022
Author

But where will we store this "problem" description per alert? Should it be a new field in the alert definition?

Yes, the proposal is to have a new field on the alert definition.

I agree with @shyamvalsan it is better than what we show currently and we can tweak the ones that need to be reviewed as we go.

MrZammler · 2022-12-01T07:58:30Z

MrZammler
Dec 1, 2022

Hi guys, one little point that might be technical, but still:

Currently the notification script builds the final subject line for the notification like this:

Severity, alert_id = value, on host.

If we aim to replace the alert_id with the new field, we need to decide whether or not to include e.g. the value or the hostname as part of it.

If we do, then we need to re-think the subject line, so as not to include the value twice there. Similarly, if this new field is propagated to the cloud to be shown in the alert list, we already have the value in a column there as well (so it would need to actually be stripped out...).

So even if the initial implementation of this field had variables for value, hostname, severity, etc, I think we should not include them. Rather, the field should only have variables for family and chart labels, like the current info field has.

I like what @shyamvalsan shared there as well, this seems like what this new field can be. I would argue that we don't need to include words like High or Low etc, since the severity will actually show what it is. I.e. it could just be: PostgreSQL: Total connection utilization on my_database which would produce notifications like:

Warning, PostgreSQL: Total connection utilization on my_database_name = value, on hostname.
Recovered, PostgreSQL: Total connection utilization on my_database_name = value, on hostname.

but it's a minor nitpick. Of course by being part of the alert config the user can choose to use whatever they want there.
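
As a rough illustration of the subject line described above, here is a small Python sketch. It is not the actual notification script: the function names are hypothetical, and the only placeholder it handles is the `$label:<name>` form that appears in the info fields quoted earlier in this thread. The title carries only chart-label variables, while severity, value and host are added by the subject line itself, so nothing appears twice.

```python
import re
from typing import Mapping

# Matches placeholders of the form $label:<name>, as seen in existing info fields.
LABEL_VAR = re.compile(r"\$label:([A-Za-z_][A-Za-z0-9_]*)")


def expand_labels(text: str, chart_labels: Mapping[str, str]) -> str:
    """Replace each $label:<name> with its chart label value.

    Unknown label names are left untouched so the gap stays visible.
    """
    return LABEL_VAR.sub(
        lambda m: chart_labels.get(m.group(1), m.group(0)), text
    )


def subject_line(severity: str, title: str, value: str, host: str,
                 chart_labels: Mapping[str, str]) -> str:
    """Build 'Severity, title = value, on host.' with the title in place of the alert id."""
    return f"{severity}, {expand_labels(title, chart_labels)} = {value}, on {host}."


title = "PostgreSQL: Total connection utilization on $label:database"
labels = {"database": "my_database_name"}
print(subject_line("Warning", title, "85%", "hostname", labels))
print(subject_line("Recovered", title, "40%", "hostname", labels))
```

Because the title has no severity wording of its own, the same text reads naturally for both the Warning and the Recovered notification.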

7 replies

ilyam8 Dec 6, 2022
Collaborator

title/problem - clear text surfacing what issue the alert is referring to with additional context on instances where it applies, e.g. httpcheck alerts, etc.

I 100% agree with that. I can only add that it should be short/concise. One part that is a bit unclear is "the alert is referring", but we will likely figure it out along the way (while updating stock alarms).

MrZammler Dec 6, 2022

Just from my perspective:

Starting with this, I was not sure (and I'm still not) what exactly we should use as a string in the title. I know we can do "better" than what we currently show in, e.g., the list in the cloud (where some can be cryptic).

I don't have any strong opinion on what can be used and I believe we can decide together. It also will vary between different alerts.

hugovalente-pm Dec 6, 2022
Author

I think the proposed change, taking the email notification title as an example, is something like:

current: Severity, alert_id = value, on host.
target: Severity, title on <family> = value, on host (family is placed there assuming that, for example, on PostgreSQL it would be the db-name)

for the actual title text, we can get the suggested values from @shyamvalsan and do a sanity check on the outcome.
the tweaking of each specific alert will, I think, happen as we go, based on feedback to improve it

ilyam8 Dec 6, 2022
Collaborator

I think family shouldn't be ever used. All the context comes from chart labels.

hugovalente-pm Dec 7, 2022
Author

@ilyam8 I mentioned family because of this

So even if the initial implementation of this field had variables for value, hostname, severity, etc, I think we should not include them. Rather, the field should only have variables for family and chart labels, like the current info field has.

If we can use chart labels instead, I'm all up for that

hugovalente-pm · 2022-12-22T09:52:48Z

hugovalente-pm
Dec 22, 2022
Author

@ilyam8 @MrZammler I think the pending question is about whether to stop using family as a variable and start using chart labels, right?

I understand this will probably take a bigger effort to review the alerts that are using family and identify what/if there is a chart label that could be used instead, but the introduction of this new field doesn't seem to need to be done in one shot; we could roll it out gradually across the alert definitions

2 replies

ilyam8 Dec 22, 2022
Collaborator

I understand this will probably take a bigger effort to review the alerts that are using family

Addressed in #14173

hugovalente-pm Dec 22, 2022
Author

yes, I saw you had done the first alerts using chart labels, so I imagined the foundations would be there to use it on others; I didn't know you had already done it 🙌
@MrZammler guess we can consider this discussion closed and proceed with the work using this bug, no? netdata/netdata-cloud#130

ralphm · 2023-01-24T16:04:48Z

ralphm
Jan 24, 2023
Maintainer

This came up again in light of #14006, which aims to add a configurable title to alerts. I want to point to other work in this area, which I see as a generalization of that solution. In Prometheus alerting rules, besides the alert name/identifier (alert), expression (expr) and trigger delay (for), it allows one to set two sets of labels:

labels: these are labels that are attached to the alert. Existing labels are overridden.
annotations: these are a separate set of labels that allow for conveying additional information to external systems through AlertManager.

An example:

groups:
- name: example
  rules:

  # Alert when a certificate will expire in less than 7 days (condition held for 1 hour).
  - alert: CertExpiring
    expr: probe_ssl_earliest_cert_expiry - time() < 24 * 60 * 60 * 7
    for: 1h
    labels:
      severity: critical
      category: cert-expiry
    annotations:
      summary: "Certificate for {{ $labels.instance }} will expire in {{ $value | humanizeDuration }}"
      description: "Certificate is expiring soon"
      suggestions: "Try to manually renew the certificate via cert-manager"
      runbook_url: "https://example.com/infra/wiki/Renew-Certificate"
      dashboard_url: "https://grafana.example.com/d/323hDhdU12/certificates"

As you can see, the annotations in this example are templated, using the Go Template language. Alert labels can be templated, too.

I think we can do something very similar. In our case, $labels could come from chart and host labels, augmented by the labels defined on the alert definition. Our collectors should define an instance label similarly to Prometheus. The annotations can be passed to the respective notification integrations to be used to generate their message payloads, or, in the case of generic webhooks, passed as-is.
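
A minimal Python sketch of that idea follows, under stated assumptions: the function names are hypothetical, the precedence order (host labels, then chart labels, then alert-defined labels overriding both) is my reading of the proposal, and the renderer only understands `{{ $labels.<name> }}` and `{{ $value }}`. It is a tiny stand-in for a real template engine, so Go Template functions such as `humanizeDuration` are not handled.

```python
import re
from typing import Dict, Mapping

# Recognises only {{ $labels.<name> }} and {{ $value }}; pipes/functions are not supported.
PLACEHOLDER = re.compile(
    r"\{\{\s*\$(labels\.([A-Za-z_][A-Za-z0-9_]*)|value)\s*\}\}"
)


def merge_labels(host: Mapping[str, str], chart: Mapping[str, str],
                 alert: Mapping[str, str]) -> Dict[str, str]:
    """Combine label sets; later sources override earlier ones (assumed order)."""
    merged = dict(host)
    merged.update(chart)
    merged.update(alert)  # labels set on the alert definition win
    return merged


def render(template: str, labels: Mapping[str, str], value: object) -> str:
    """Substitute the supported placeholders; unknown labels render as empty."""
    def repl(match: "re.Match[str]") -> str:
        if match.group(1) == "value":
            return str(value)
        return labels.get(match.group(2), "")
    return PLACEHOLDER.sub(repl, template)


def render_annotations(annotations: Mapping[str, str],
                       labels: Mapping[str, str],
                       value: object) -> Dict[str, str]:
    """Render every annotation so integrations receive ready-to-use text."""
    return {key: render(text, labels, value) for key, text in annotations.items()}


labels = merge_labels(
    host={"instance": "host-a", "env": "prod"},
    chart={"database": "orders"},
    alert={"severity": "critical", "env": "staging"},
)
rendered = render_annotations(
    {"summary": "High connection utilization on {{ $labels.database }} "
                "({{ $labels.instance }}): {{ $value }}%"},
    labels,
    91,
)
print(rendered["summary"])
```

A notification integration could then pick `summary` as its title, while a generic webhook could forward the whole rendered dictionary as-is.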

6 replies

hugovalente-pm Jan 26, 2023
Author

@ralphm thanks for sharing that, what you describe is definitely more flexible and powerful. what we were aiming for is, in some sort of way, annotations.summary being this title that we are speaking about.

we will need inputs from both @MrZammler and @car12o to see the effort needed to achieve something like what you explain and compare it to the plan we had; not sure if we couldn't make our current plan work as an interim step towards something like what you are suggesting

we will also need to surface these annotations to our UI, or at least the equivalent to the title, to solve this and the alert dedicated landing page

ralphm Jan 27, 2023
Maintainer

Yes, I understand. We could just choose a name for an annotation that will be used in our UI and notifications as the title. It could even be named title, but my thinking was that if we're going to let the user configure this now, I'd much rather go to the final thing than have a temporary new field in the alert config that will be replaced soon after.

hugovalente-pm Jan 30, 2023
Author

@MrZammler let's not proceed with the PR and, as also discussed, let's have a call where we will discuss this approach.
will try to book something in the coming weeks

MrZammler Jan 30, 2023

Yes, we'll put this on hold!

amalkov Feb 1, 2023

I agree with @ralphm, the suggested approach is more flexible for the future

ralphm · 2023-07-07T10:35:24Z

ralphm
Jul 7, 2023
Maintainer

As we needed to represent the info property from current alert definitions, which are already templated, I've gone ahead and introduced an annotations property for feed events. Should we indeed move ahead with annotations as suggested above, other (freeform) annotations can be represented in feed events naturally.

0 replies
