Extending ML training data window #12763
8 comments · 13 replies
-
Thanks @andrewm4894 for the detailed explanation.
Let's say we have the following models: M(h - 4), M(h - 8), M(h - 12), M(h - 16), M(h - 20), M(h - 24). If model M(h - 4) produces a 0 anomaly bit, why do we need to consult the rest of the models?
-
Adding here for reference. Here is a colab notebook with an initial rough python implementation of some ideas discussed above. Here is a quick screencast of the main idea.
-
@ktsaou just an fyi that this is where we're discussing approaches to extending the "memory" of the models used as part of AA.
-
@andrewm4894 @vkalintiris some questions
-
@ktsaou some previous discussion in here on extending the ml training windows and models. just pinging as reference.
-
cross posting this PR where we are actually trying some of the stuff here: #14065
-
Updating this with some findings from the public netdata demo machine learning room: https://app.netdata.cloud/spaces/netdata-demo/rooms/machine-learning I have 3 nodes in here doing the same workloads I use for testing ML stuff.

As you would expect, the node anomaly rates are lower for 48h and 72h. Interestingly, the netdata CPU overhead is not all that different. Memory is a little higher for 48h and 72h, but only a few MB each way really. Disk reads are generally similar, and disk writes are generally similar too.

So some initial findings here that extending to 48h or 72h by default might actually not be that difficult. Will keep dogfooding this on the ml demo space and see over the next few weeks how things behave.
-
@andrewm4894 promising, what I'd need on my end to be really sold on this is a CPU consumption graph when many things become anomalous at the same time.
-
@vkalintiris here is my idea for one reasonable way to extend the training window for the ML.

Introduce a new param called `trained models expire after`, which is the number of seconds after a model is trained during which it is eligible for use in prediction. Once a model is "older than" `trained models expire after`, it's no longer considered valid for use in prediction.

So, a config like this:

Would mean that each trained model gets used in predictions for up to 24 hours after it's been trained. The idea here is that at prediction time you are now using a set of trained models, each of which produces a 1 or 0, and they majority vote to decide if the anomaly bit should be set to 1 or 0. In other words, you are using models trained over the last 24 hours, and if enough of them say the latest feature vector is anomalous then that's basically saying: this data looks stranger than anything I have learned about in the last 24 hours.
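For concreteness, a hypothetical config fragment along these lines (the proposed `trained models expire after` option does not exist yet, and the other option names and values here are illustrative assumptions, not netdata defaults):

```
[ml]
    # train a new model for each metric every hour (illustrative value)
    train every = 3600
    # each model is trained on roughly the preceding 4 hours of data
    maximum num samples to train = 14400
    # proposed new option: a trained model stays eligible for use in
    # prediction for 24 hours after it was trained
    trained models expire after = 86400
```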
So, our current default config is almost just a special case like:

Where, in the above config, models expire every `train every` seconds, and so you only ever use the most recent one at prediction time.

One thing I'm less sure on: assuming my original config above, at prediction time you could now have, for each metric, maybe 24 or 25 trained models (assuming it trains a model each hour on the preceding 4 hours), each of which spans a moving 4-hour window. In reality, we would just need to pick some subset of ~6 trained models from this set of ~24 that best spans the full `trained models expire after` window. For example, if you just picked the most recent 6 models then you would not be reaching far enough back to the earliest trained models. So we'd need to figure out how to best pick the models from the set of all trained models (assuming we can't just use them all; if we could, then maybe we'd just do that). I think what we would want is the set of trained models with as little overlap between them as possible, to make sure we are picking the models that best "cover" the `trained models expire after` period. Unsure how easy or not implementing something like that would be.

Mainly I like the idea of a user just having to think about deciding how long to keep models alive for, and in that way extend the training window (indirectly, via saved models) that's used at prediction time. It's just the internal complexity of how to efficiently pick the best set of models that cover the `trained models expire after` window with as little waste as possible that I'm still not 100% clear on.
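A rough sketch of the two ideas above (not Netdata's actual implementation; the function names, the greedy selection strategy, and the simple majority rule are all my own illustrative assumptions): walk backwards through the expiry window in 4-hour strides picking the newest model whose training window tiles each slot, then majority-vote the selected models' anomaly bits.

```python
TRAIN_WINDOW = 4 * 3600   # each model is trained on a 4h window
EXPIRE_AFTER = 24 * 3600  # "trained models expire after" = 24h

def pick_covering_models(trained_at, now, step=TRAIN_WINDOW):
    """Greedily pick a minimal-overlap subset of training timestamps
    whose 4h windows tile the last EXPIRE_AFTER seconds.
    `trained_at` is a list of training timestamps in seconds."""
    valid = sorted(t for t in trained_at if now - t <= EXPIRE_AFTER)
    picked = []
    target = now  # walk backwards through the window in `step` strides
    for t in reversed(valid):
        if target <= now - EXPIRE_AFTER:
            break  # the whole expiry window is already covered
        if t <= target:
            picked.append(t)
            target = t - step  # this model covers roughly [t - step, t]
    return picked

def majority_vote(bits):
    """Set the anomaly bit only if more than half the models agree."""
    return 1 if sum(bits) > len(bits) / 2 else 0

# Usage: 25 hourly models exist, but only 6 are needed to cover 24h.
now = 100 * 3600
hourly = [now - i * 3600 for i in range(25)]
subset = pick_covering_models(hourly, now)
print(len(subset))                         # -> 6
print(majority_vote([1, 0, 1, 1, 0, 1]))   # -> 1
print(majority_vote([0, 0, 1, 0, 0, 1]))   # -> 0
```

The greedy pick here prefers the newest model that still reaches each slot, which matches the intuition in the comment: you want models spread across the whole window, not just the most recent few.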