Parent & Child approach to anomaly detection. #12477

andrewm4894 · 2022-03-21T14:01:45Z

Currently, anomaly_detection.* charts live on the node running ML. This means that when a parent runs ML on its children, the anomaly_detection.* charts all live on the parent, as in the image below (parent-ml-ml-stress-0 is running ML for its 10 children, child-ml-ml-stress-0 to child-ml-ml-stress-9), where you can see that each anomaly detection chart is duplicated on the parent for each of the children:

As a result, Netdata Cloud cannot associate this subset of charts that live on the parent with the child node instances, and so cannot "find" the ML charts for the children. We only have the one blue line below for the parent itself; the child node instances (parent/host/child) don't have anomaly_detection.* charts and so cannot be picked up by the new cloud architecture.

Simply storing the child anomaly_detection.* charts at parent/host/child will not solve the issue: the child node itself could also be doing ML and so have its own anomaly_detection.* charts, in which case the two sets would clash and things could get very messy.

We have discussed two potential solutions.

Option 1: Add parent machine guid suffix and store at `parent/hosts/child`

The idea here would be to add a suffix to the chart name, like anomaly_detection.<chart_name>_<parent_netdata_machine_guid>.

Then, if both the child and the parent are running ML, you just have two different sets of chart IDs, each with a different suffix. As long as they all share the same chart contexts, cloud (we think) would be able to discover them all and aggregate them just like any other composite data.
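To make the naming idea concrete, here is a minimal Python sketch of Option 1. It is only an illustration, not agent code: the helper names `suffixed_chart_id` and `group_by_context` and both GUIDs are made up for this example, and the grouping step only mimics what we think cloud would do with matching contexts.

```python
from collections import defaultdict

def suffixed_chart_id(chart_name: str, machine_guid: str) -> str:
    """Build an Option 1 style id: anomaly_detection.<chart_name>_<machine_guid>."""
    return f"anomaly_detection.{chart_name}_{machine_guid}"

def group_by_context(charts):
    """Group (chart_id, context) pairs by context (hypothetical aggregation step)."""
    grouped = defaultdict(list)
    for chart_id, context in charts:
        grouped[context].append(chart_id)
    return dict(grouped)

# Hypothetical machine GUIDs, for illustration only.
PARENT_GUID = "11111111-2222-3333-4444-555555555555"
CHILD_GUID = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"

# Both the parent and the child run ML for the same child host.
charts = [
    (suffixed_chart_id("anomaly_rate", PARENT_GUID), "anomaly_detection.anomaly_rate"),
    (suffixed_chart_id("anomaly_rate", CHILD_GUID), "anomaly_detection.anomaly_rate"),
]
by_context = group_by_context(charts)
```

The two chart IDs differ, so they do not clash at parent/host/child, while the shared context keeps them discoverable as one group.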

Then each of the anomaly_detection.* charts from parent/host/child would appear as a node instance line in a chart like this one from the overview. So below we would expect to see 10 extra lines, one for each of the child node instances coming from the parent-ml-ml-stress-0 parent.

This would then power charts 1-3 and the anomaly rate sorting for the anomaly advisor tab (the raw anomaly bits are another discussion, around how they would be propagated).

One downside could be that some anomaly_detection.* charts, like anomaly_detection.training_stats, would end up living at parent/host/child even though those charts are actually about processes being run on parent/hosts/parent.

Option 2: Ability to inform cloud that a subset of charts on `parent/hosts/parent` actually relate to the child node instances and should in some way be treated as such.

This way things could stay as they currently are on the agent, but with some ability to tag specific charts as relating to the child node instances and have cloud "understand" this association.
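As a rough sketch of what such tagging could look like, the Python below attaches a label to a chart that stays on the parent and resolves which node instance it should be attributed to. Everything here is an assumption for illustration: the `Chart` class, the `relates_to_node` label key and the `effective_node` helper do not correspond to any existing agent or cloud API.

```python
from dataclasses import dataclass, field

@dataclass
class Chart:
    chart_id: str
    stored_on: str                      # node instance where the chart lives
    labels: dict = field(default_factory=dict)

def effective_node(chart: Chart, label_key: str = "relates_to_node") -> str:
    """Return the node the chart is about: the tagged node if present, else where it is stored."""
    return chart.labels.get(label_key, chart.stored_on)

# A chart stored on the parent but tagged as describing a child (hypothetical label).
tagged = Chart(
    chart_id="anomaly_detection.anomaly_rate",
    stored_on="parent-ml-ml-stress-0",
    labels={"relates_to_node": "child-ml-ml-stress-0"},
)
# An untagged chart is attributed to the node it lives on.
untagged = Chart(chart_id="anomaly_detection.training_stats", stored_on="parent-ml-ml-stress-0")
```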

@ktsaou @stelfrag @amalkov @shadycuz, @vkalintiris and I thought it was not obvious which option is best (both have pros and cons), or whether there is another option we have not considered, and that it's complex enough to merit more discussion before we decide one way or the other and try to implement anything.

My own view is that option 1 is a step in the right direction. Maybe we move anomaly_detection.prediction_stats and anomaly_detection.training_stats into the netdata monitoring section or something similar, so that they always live on parent/hosts/parent. The other charts (anomaly_detection.dimensions, anomaly_detection.anomaly_rate, anomaly_detection.detector_window, anomaly_detection.detector_events) are then perfectly fine to live, with the parent suffix, side by side with any others that may already live at parent/host/child, and just get discovered and aggregated by cloud like any other charts under the new architecture. This would give us what we need for all but the anomaly-bit charts on the anomaly advisor tab, which we could then tackle separately.

andrewm4894 · 2022-03-24T11:14:36Z

Following on from the meeting yesterday, we decided to:

1. Move diagnostic-type charts like anomaly_detection.prediction_stats and anomaly_detection.training_stats into an ml submenu under the "Netdata Monitoring" menu. The convention here is that you would see these charts where the ML is being run. So in the case of a parent with 10 children, where the parent is running ML for those 10 children and itself, you would have 11 of these charts, one per host, on the parent dashboard in this ml submenu.
2. For the other charts (anomaly_detection.dimensions, anomaly_detection.anomaly_rate, anomaly_detection.detector_window, anomaly_detection.detector_events), we decided they should be injected into the child instance on the parent, i.e. live at the parent/host/child dashboard, and will have a human-readable name like anomaly_detection.anomaly_rate_on_<hostname where ml is running>. In Netdata Cloud we would like this to appear in the dropdown below like anomaly_detection.anomaly_rate_on_<parent hostname> @ <child>

Note: for 2., when we end up with clashing names for whatever reason, we will add a short suffix of the last 6 digits of the relevant netdata machine guid to the chart name so that it is as unique as we can make it.
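The naming rule in 2. and the clash fallback in the note could be sketched as follows. This is a minimal Python illustration under stated assumptions, not the agent implementation: `chart_name_on_host` is a made-up helper, the GUID is hypothetical, and "last 6 digits" is read here as the last 6 characters of the GUID string.

```python
def chart_name_on_host(base: str, ml_hostname: str, ml_machine_guid: str, existing: set) -> str:
    """Name a chart <base>_on_<hostname>; on a clash, append the last 6 GUID characters."""
    name = f"{base}_on_{ml_hostname}"
    if name in existing:
        # Clash: disambiguate with a short suffix taken from the machine GUID.
        name = f"{name}_{ml_machine_guid[-6:]}"
    return name

# Hypothetical machine GUID of the host running ML, for illustration only.
PARENT_GUID = "11111111-2222-3333-4444-555555555555"

no_clash = chart_name_on_host(
    "anomaly_detection.anomaly_rate", "parent-ml-ml-stress-0", PARENT_GUID, existing=set()
)
clash = chart_name_on_host(
    "anomaly_detection.anomaly_rate", "parent-ml-ml-stress-0", PARENT_GUID, existing={no_clash}
)
```

Here `no_clash` keeps the plain human-readable name, while `clash` gains the `_555555` suffix. A 6-character suffix makes collisions unlikely rather than impossible, which matches the "as unique as we can" intent.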

vkalintiris Mar 30, 2022
Collaborator

For the other charts, anomaly_detection.dimensions ...

I think we also said that anomaly_detection.* charts should not get streamed by default, right?
