Parent & Child approach to anomaly detection. #12477
andrewm4894
started this conversation in
Ideas
Replies: 1 comment 1 reply
-
Following on from meeting yesterday we decided to
Note: for 2, when we end up with clashing chart names for whatever reason, we will append a short suffix (the last 6 digits of the relevant netdata machine guid) to the chart name so that it is as unique as we can make it.
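As a rough illustration of the clash rule in the note above, here is a minimal Python sketch. The function name, the set of existing names, and the example guid are all made up for illustration; this is not the agent's actual code.

```python
def unique_chart_name(chart_name, machine_guid, existing_names, suffix_len=6):
    """Return chart_name unchanged, or with a short guid suffix if it clashes.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    if chart_name not in existing_names:
        return chart_name
    # Use the last `suffix_len` characters of the machine guid as the suffix.
    return f"{chart_name}_{machine_guid[-suffix_len:]}"


# Made-up guid, for illustration only.
EXAMPLE_GUID = "0a1b2c3d-0000-1111-2222-333344445555"

taken = {"anomaly_detection.anomaly_rate"}
clashing = unique_chart_name("anomaly_detection.anomaly_rate", EXAMPLE_GUID, taken)
free = unique_chart_name("anomaly_detection.dimensions", EXAMPLE_GUID, taken)
```

A 6 character suffix only makes a clash unlikely, not impossible, which matches the "as unique as we can" wording.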
-
Currently, anomaly_detection.* charts live on the node running the ML. This means that when a parent runs ML on its children, the anomaly_detection.* charts all live on the parent, as in the image below (parent-ml-ml-stress-0 is running ML for its 10 children, child-ml-ml-stress-0 to child-ml-ml-stress-9), where you can see that each anomaly detection chart is duplicated on the parent for each of the children:
As a result, Netdata Cloud cannot associate this subset of charts that live on the parent with the child node instances, and so cannot "find" the ML charts for the children. We end up with only the one blue line below for the parent itself; the child node instances (parent/host/child) don't have anomaly_detection.* charts and so cannot be picked up by the new cloud architecture.
Simply storing the child anomaly_detection.* charts at parent/host/child will not solve the issue, because the root child node itself could also be doing ML and so have its own anomaly_detection.* charts, in which case you would have a clash and things could get very messy.
We have discussed two potential solutions.
Option 1: Add parent machine guid suffix and store at parent/hosts/child
The idea here would be to add a suffix to the chart name, like anomaly_detection.<chart_name>_<parent_netdata_machine_guid>. Then, if you have both child and parent running ML, you just have two different sets of chart IDs, each with a different suffix. As long as they all have the same chart contexts, cloud (we think) would be able to discover them all and aggregate them just like any other composite data.
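To make the naming idea concrete, here is a small Python sketch of how suffixed chart IDs could coexist while still sharing a context. The guids, the dictionary layout, and the context string are assumptions made for illustration; how cloud actually discovers and aggregates charts is not specified here.

```python
from collections import defaultdict


def suffixed_chart_id(chart_name, ml_host_guid):
    """Build anomaly_detection.<chart_name>_<guid of the node running ML>."""
    return f"anomaly_detection.{chart_name}_{ml_host_guid}"


def group_by_context(charts):
    """Group chart IDs by context, as a stand-in for context-based discovery."""
    groups = defaultdict(list)
    for chart in charts:
        groups[chart["context"]].append(chart["id"])
    return dict(groups)


# Made-up guids, for illustration only.
PARENT_GUID = "aaaaaaaa-0000-0000-0000-000000000001"
CHILD_GUID = "bbbbbbbb-0000-0000-0000-000000000002"

# Assumed context string; both charts would be stored for the same child.
CONTEXT = "anomaly_detection.anomaly_rate"
charts_for_child = [
    {"id": suffixed_chart_id("anomaly_rate", PARENT_GUID), "context": CONTEXT},
    {"id": suffixed_chart_id("anomaly_rate", CHILD_GUID), "context": CONTEXT},
]

grouped = group_by_context(charts_for_child)
```

The two chart IDs differ only by suffix, so they do not clash, while the shared context keeps them in one group.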
Then each of the anomaly_detection.* charts from parent/host/child would appear as node instance lines in a chart like this from the overview. So below we would expect to see 10 extra lines, one for each of the child node instances coming from the parent-ml-ml-stress-0 parent.
This would then power charts 1-3 and the anomaly rate sorting for the anomaly advisor tab (how the raw anomaly bits would be propagated around is another discussion).
One downside with this could be that you would end up with some anomaly_detection.* charts, like anomaly_detection.training_stats, living at parent/host/child even though those charts are actually about processes being run on parent/hosts/parent.
Option 2: Ability to inform cloud that a subset of charts on parent/hosts/parent actually relate to the child node instances and should in some way be treated as such.
This way things could stay as they currently are on the agent, but with some ability to tag specific charts as relating to the child node instances and have cloud "understand" this association.
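One way to picture option 2 is a tag on each chart that names the node it relates to, which the consumer then uses to regroup charts. The sketch below is purely hypothetical: the relates_to_node label, the chart ID format, and the regrouping function are invented for illustration and do not describe an existing agent or cloud feature.

```python
def regroup_by_related_node(charts, default_node):
    """Map each chart ID to the node it relates to.

    Charts without the (hypothetical) relates_to_node label are attributed
    to the node they physically live on, i.e. default_node.
    """
    by_node = {}
    for chart in charts:
        node = chart.get("labels", {}).get("relates_to_node", default_node)
        by_node.setdefault(node, []).append(chart["id"])
    return by_node


# Charts as they might live on the parent; the IDs are made up.
charts_on_parent = [
    {"id": "anomaly_detection.training_stats"},
    {
        "id": "anomaly_detection.anomaly_rate_on_child-ml-ml-stress-0",
        "labels": {"relates_to_node": "child-ml-ml-stress-0"},
    },
    {
        "id": "anomaly_detection.anomaly_rate_on_child-ml-ml-stress-1",
        "labels": {"relates_to_node": "child-ml-ml-stress-1"},
    },
]

by_node = regroup_by_related_node(charts_on_parent, "parent-ml-ml-stress-0")
```

Nothing moves on the agent in this sketch; only the consumer's view of which node a chart belongs to changes.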
@ktsaou @stelfrag @amalkov @shadycuz myself and @vkalintiris thought it was not obvious which option is best (both have pros and cons), or whether there are other options we have not considered, and that it's complex enough to merit more discussion before we decide one way or the other and try to implement anything.
My own view is that option 1 is a step in the right direction. Maybe we move the anomaly_detection.prediction_stats and anomaly_detection.training_stats charts into the netdata monitoring section or something, such that they always live on parent/hosts/parent, and then the other charts (anomaly_detection.dimensions, anomaly_detection.anomaly_rate, anomaly_detection.detector_window, anomaly_detection.detector_events) are perfectly fine to live (with the parent suffix) side by side with any others that may already live at parent/host/child, and just get discovered and aggregated by cloud like any other charts under the new architecture. This would give us what we need for all but the anomaly-bit charts on the anomaly advisor tab, which we could then tackle separately.
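The split suggested above could be sketched as a simple placement rule. This is only an illustration of the proposal: the function, the location strings, and the guid are assumptions, and the stats charts keep their current names here even though moving them to a monitoring section might rename them.

```python
# Charts about the ML process itself: always stay with the node running ML.
ML_PROCESS_CHARTS = {"prediction_stats", "training_stats"}
# Charts about a monitored node: stored for the child, with the parent suffix.
PER_NODE_CHARTS = {"dimensions", "anomaly_rate", "detector_window", "detector_events"}


def placement(chart_name, parent_guid):
    """Return (location, chart_id) for a chart a parent produces for a child."""
    if chart_name in ML_PROCESS_CHARTS:
        return ("parent/hosts/parent", f"anomaly_detection.{chart_name}")
    if chart_name in PER_NODE_CHARTS:
        return ("parent/host/child", f"anomaly_detection.{chart_name}_{parent_guid}")
    raise ValueError(f"unknown anomaly_detection chart: {chart_name}")


# Made-up guid, for illustration only.
PARENT_GUID = "aaaaaaaa-0000-0000-0000-000000000001"

stats_placement = placement("training_stats", PARENT_GUID)
rate_placement = placement("anomaly_rate", PARENT_GUID)
```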