"train sample percent" param for anomaly detection #12350
andrewm4894 started this conversation in Ideas
Replies: 1 comment
Some discussion and a potential initial implementation approach for this is here: #12173 (comment)
@vkalintiris following up on something you mentioned.
I'd like to create a new param called `train sample percent` that would default to 1, meaning all training data is used in training, as we currently do. But if a user were to set, for example, `train sample percent = 0.5`, then we would just train on a random 50% of feature vectors.

The thinking here is that in ML it's often the case that a lot of the feature vectors can be quite similar and not really 'teaching' the model much, so we can often just sample randomly and get as-good (or maybe slightly less good) models a lot cheaper. This quite possibly applies to our current implementation.
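To make the idea concrete, here is a minimal sketch of the kind of sampling I mean. The function name and data shapes are hypothetical, not anything that exists in the codebase today:

```python
import random

def sample_feature_vectors(feature_vectors, train_sample_percent=1.0, seed=None):
    """Keep each feature vector with probability train_sample_percent.

    With train_sample_percent=1.0 every vector is kept (current behaviour);
    with 0.5 each vector is kept with probability 0.5, so on average half
    of the training data survives.
    """
    rng = random.Random(seed)
    return [fv for fv in feature_vectors if rng.random() < train_sample_percent]

# toy example: 1000 fake 6-dimensional feature vectors
vectors = [[float(i + j) for j in range(6)] for i in range(1000)]
half = sample_feature_vectors(vectors, train_sample_percent=0.5, seed=42)
print(len(half))  # roughly 500, varies with the seed
```

Training would then run on the returned subset instead of the full set, which is where the cost saving comes from.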
So a new param like `train sample percent` could give the user more flexibility to reduce overhead even further, at the cost of a maybe slightly less optimal model (we don't really have a measure of accuracy or anything, so it's hard to quantify this trade-off). I have a feeling that, on average, we can still train useful models even on, say, a random 50% of feature vectors.
So this would be another general param that would apply to any model we might implement, and be a sort of core lever users could pull as part of the config.
This could even be an obvious, easy way to get to models trained on 24 hours of data, for example, if you sample maybe 20% (although this might be stretching things a bit; we could experiment and see).
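The back-of-the-envelope here: to extend the window without increasing training cost, the sample percent scales as the ratio of the window sizes. The 4-hour figure below is an illustrative assumption, not a confirmed current default:

```python
# Illustrative only: the current-window figure is an assumption.
current_window_hours = 4    # hypothetical current training window
target_window_hours = 24    # desired longer window

# Keeping the number of trained-on feature vectors roughly constant
# means sampling at the ratio of the two windows.
train_sample_percent = current_window_hours / target_window_hours
print(round(train_sample_percent, 3))  # ~0.167, i.e. sample roughly 17% of 24h
```

So something in the ~17-20% range would train on 24 hours of data for roughly the same cost as today, under those assumed numbers.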
I think, if possible, the ideal way to implement this would be at the time of appending a feature vector to the buffer: you pick a random number, and if it's less than `train sample percent` you append to the buffer, else do nothing. Not sure if this is too complex given how we have implemented things. It might be easier if, just before training, you randomly sample `train sample percent` of the feature vectors from the buffer. The downside of that approach is that you end up building a full feature buffer that you then partly ignore, whereas the first approach, I think, avoids building up that buffer in the first place and ends up with the same result: a sample-based bag of feature vectors used during training.

Anyway, no idea how easy or hard such a capability would be to implement, but I'm pretty sure it will make sense to add as part of the next phase of improvements and iterations. In general, being able to sample training data may even open up more design space for potentially more complicated or heavy models at some later stage. For example, a deep-learning-based autoencoder (possible with ingredients in dlib) that maybe trains just once a day on the last 3 days of data, sampled at 25%, or something like that; just a rough example/idea. Not proposing we get fancy like that yet, just that sample-based approaches will probably be core to enabling more stuff.
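The two options above can be sketched side by side. This is a hypothetical API, not how the buffer is actually structured in the code:

```python
import random

class FeatureBuffer:
    """Sketch of the two sampling strategies discussed above."""

    def __init__(self, train_sample_percent=1.0, seed=None):
        self.train_sample_percent = train_sample_percent
        self.rng = random.Random(seed)
        self.buffer = []

    # Approach 1: sample at append time. The buffer never holds vectors
    # that training would ignore, so memory use shrinks too.
    def append_sampled(self, feature_vector):
        if self.rng.random() < self.train_sample_percent:
            self.buffer.append(feature_vector)

    # Approach 2: append everything, sample just before training.
    # Simpler to bolt on, but builds a buffer it then partly ignores.
    def append_all(self, feature_vector):
        self.buffer.append(feature_vector)

    def training_sample(self):
        k = max(1, int(len(self.buffer) * self.train_sample_percent))
        return self.rng.sample(self.buffer, k)
```

Both end up handing training a sample-based bag of feature vectors; approach 1 just pays for the decision per append instead of per training run, and keeps the buffer itself smaller.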
So I just wanted to kick off this discussion so we have it as a reference, and get your thoughts.