Extending ML training data window #12763
8 comments · 13 replies
-
Thanks @andrewm4894 for the detailed explanation.
Let's say we have the following models: M(h - 4), M(h - 8), M(h - 12), M(h - 16), M(h - 20), M(h - 24). If model M(h - 4) produces a 0 anomaly bit, why do we need to consult the rest of the models?
-
Adding here for reference. Here is a colab notebook with an initial rough python implementation of some ideas discussed above. Here is a quick screencast of the main idea.
-
@ktsaou just an fyi that this is where we're discussing approaches to extending the "memory" of the models used as part of AA.
-
@andrewm4894 @vkalintiris some questions
-
@ktsaou some previous discussion in here on extending the ml training windows and models. just pinging as reference.
-
cross posting this PR where we are actually trying some of the stuff here: #14065
-
Updating this with some findings from the public netdata demo machine learning room: https://app.netdata.cloud/spaces/netdata-demo/rooms/machine-learning I have 3 nodes in here doing the same workloads I use for testing ML stuff.

As you would expect, the node anomaly rates are lower for 48h and 72h. Interestingly, the netdata CPU overhead is not all that different. Memory is a little higher for 48h and 72h, but only a few MB each way really. Disk reads are generally similar, and disk writes are generally similar too.

So some initial findings here that extending to 48h or 72h by default might actually not be that difficult. Will keep dogfooding this on the ml demo space and see over the next few weeks how things behave.
-
@andrewm4894 promising, what I'd need on my end to be really sold on this is a CPU consumption graph when many things become anomalous at the same time.
-
@vkalintiris here is my idea for one reasonable way to extend the training window for the ML.

Introduce a new param called `trained models expire after`, which is the number of seconds after a model is trained during which it is eligible for use in prediction. Once a model is "older than" `trained models expire after`, it's no longer considered valid for use in prediction.

So, a config like this:

Would mean that each trained model gets used in predictions for up to 24 hours after it's been trained. The idea here is that at prediction time you are now using a set of trained models, each of which produces a 1 or 0, and they majority vote to decide if the anomaly bit should be set to 1 or 0. In other words, you are using models trained over the last 24 hours, and if enough of them say the latest feature vector is anomalous then that's basically saying: this data looks stranger than anything I have learned about in the last 24 hours.
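For concreteness, a hypothetical config fragment along these lines (the proposed `trained models expire after` option does not exist yet, and the other option names and values here are illustrative assumptions, not netdata defaults):

```
[ml]
    # train a new model for each metric every hour (illustrative value)
    train every = 3600
    # each model is trained on roughly the preceding 4 hours of data
    maximum num samples to train = 14400
    # proposed new option: a trained model stays eligible for use in
    # prediction for 24 hours after it was trained
    trained models expire after = 86400
```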
So, our current default config is almost just a special case like:

Where, in the above config, models expire every `train every` seconds, and so you only ever use the most recent one at prediction time.

One thing I'm less sure on: assuming my original config above, at prediction time you could now have, for each metric, maybe 24 or 25 trained models (assuming it trains a model each hour on the preceding 4 hours), each of which spans a moving 4-hour window. In reality, we would just need to pick some subset of ~6 trained models from this set of ~24 that best spans the full `trained models expire after` window. For example, if you just picked the most recent 6 models then you would not be reaching far enough back to the earliest trained models. So we'd need to figure out how to best pick the models from the set of all trained models (assuming we can't just use them all; if we could, then maybe we'd just do that). I think what we would want is the set of trained models with as little overlap between them as possible, to make sure we are picking the models that best "cover" the `trained models expire after` period. Unsure how easy or not implementing something like that would be.

Mainly I like the idea of a user just having to think about deciding how long to keep models alive for, and in that way extend the training window (indirectly, via saved models) that's used at prediction time. It's just the internal complexity of how to efficiently pick the best set of models that cover the `trained models expire after` window with as little waste as possible that I'm still not 100% clear on.
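A rough sketch of the two ideas above (not Netdata's actual implementation; the function names, the greedy selection strategy, and the simple majority rule are all my own illustrative assumptions): walk backwards through the expiry window in 4-hour strides picking the newest model whose training window tiles each slot, then majority-vote the selected models' anomaly bits.

```python
TRAIN_WINDOW = 4 * 3600   # each model is trained on a 4h window
EXPIRE_AFTER = 24 * 3600  # "trained models expire after" = 24h

def pick_covering_models(trained_at, now, step=TRAIN_WINDOW):
    """Greedily pick a minimal-overlap subset of training timestamps
    whose 4h windows tile the last EXPIRE_AFTER seconds.
    `trained_at` is a list of training timestamps in seconds."""
    valid = sorted(t for t in trained_at if now - t <= EXPIRE_AFTER)
    picked = []
    target = now  # walk backwards through the window in `step` strides
    for t in reversed(valid):
        if target <= now - EXPIRE_AFTER:
            break  # the whole expiry window is already covered
        if t <= target:
            picked.append(t)
            target = t - step  # this model covers roughly [t - step, t]
    return picked

def majority_vote(bits):
    """Set the anomaly bit only if more than half the models agree."""
    return 1 if sum(bits) > len(bits) / 2 else 0

# Usage: 25 hourly models exist, but only 6 are needed to cover 24h.
now = 100 * 3600
hourly = [now - i * 3600 for i in range(25)]
subset = pick_covering_models(hourly, now)
print(len(subset))                         # -> 6
print(majority_vote([1, 0, 1, 1, 0, 1]))   # -> 1
print(majority_vote([0, 0, 1, 0, 0, 1]))   # -> 0
```

The greedy pick here prefers the newest model that still reaches each slot, which matches the intuition in the comment: you want models spread across the whole window, not just the most recent few.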