"train sample percent" param for anomaly detection #12350
andrewm4894 started this conversation in Ideas
Replies: 1 comment
Some discussion and a potential initial implementation approach for this is here: #12173 (comment)
@vkalintiris following up on something you mentioned.
I'd like to create a new param called `train sample percent` that would default to 1, meaning all training data is used in training, as we currently do. But if a user were to set, for example, `train sample percent = 0.5`, then we would just train on a random 50% of feature vectors.

The thinking here is that in ML it's often the case that a lot of the feature vectors can be quite similar and not really 'teaching' the model much, so we can often just sample randomly and get as-good (or maybe slightly less good) models a lot cheaper. This quite possibly applies to our current implementation.
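To make the idea concrete, here is a minimal sketch of the kind of sampling I mean. The function name and data shapes are hypothetical, not anything that exists in the codebase today:

```python
import random

def sample_feature_vectors(feature_vectors, train_sample_percent=1.0, seed=None):
    """Keep each feature vector with probability train_sample_percent.

    With train_sample_percent=1.0 every vector is kept (current behaviour);
    with 0.5 each vector is kept with probability 0.5, so on average half
    of the training data survives.
    """
    rng = random.Random(seed)
    return [fv for fv in feature_vectors if rng.random() < train_sample_percent]

# toy example: 1000 fake 6-dimensional feature vectors
vectors = [[float(i + j) for j in range(6)] for i in range(1000)]
half = sample_feature_vectors(vectors, train_sample_percent=0.5, seed=42)
print(len(half))  # roughly 500, varies with the seed
```

Training would then run on the returned subset instead of the full set, which is where the cost saving comes from.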
So a new param like `train sample percent` could give the user more flexibility to reduce overhead even further, at the cost of a maybe slightly less optimal model (we don't really have a measure of accuracy or anything, so it's hard to quantify this trade-off). I have a feeling that, on average, we can still train useful models even on, say, a random 50% of feature vectors.
So this would be another general param that would apply to any model we might implement, and be a sort of core lever users could pull as part of the config.
This could even be an obvious, easy way to get to models trained on 24 hours of data, for example, if you sample maybe 20% (although this might be stretching things a bit; we could experiment and see).
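The back-of-the-envelope here: to extend the window without increasing training cost, the sample percent scales as the ratio of the window sizes. The 4-hour figure below is an illustrative assumption, not a confirmed current default:

```python
# Illustrative only: the current-window figure is an assumption.
current_window_hours = 4    # hypothetical current training window
target_window_hours = 24    # desired longer window

# Keeping the number of trained-on feature vectors roughly constant
# means sampling at the ratio of the two windows.
train_sample_percent = current_window_hours / target_window_hours
print(round(train_sample_percent, 3))  # ~0.167, i.e. sample roughly 17% of 24h
```

So something in the ~17-20% range would train on 24 hours of data for roughly the same cost as today, under those assumed numbers.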
I think, if possible, the ideal way to implement this would be at the time of appending a feature vector to the buffer: you pick a random number, and if it's less than `train sample percent` you append to the buffer, else do nothing. Not sure if this is too complex given how we have implemented things. It might be easier if, just before training, you randomly sample `train sample percent` of the feature vectors from the buffer. The downside of that approach is that you end up building a full feature buffer that you then partly ignore, whereas the first approach, I think, avoids building up that buffer in the first place and ends up with the same result: a sample-based bag of feature vectors used during training.

Anyway, no idea how easy or hard such a capability would be to implement, but I'm pretty sure it will make sense to add as part of the next phase of improvements and iterations. In general, being able to sample training data may even open up more design space for potentially more complicated or heavy models at some later stage. For example, a deep-learning-based autoencoder (possible with ingredients in dlib) that maybe trains just once a day on the last 3 days of data, sampled at 25%, or something like that; just a rough example/idea. Not proposing we get fancy like that yet, just that sample-based approaches will probably be core to enabling more stuff.
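The two options above can be sketched side by side. This is a hypothetical API, not how the buffer is actually structured in the code:

```python
import random

class FeatureBuffer:
    """Sketch of the two sampling strategies discussed above."""

    def __init__(self, train_sample_percent=1.0, seed=None):
        self.train_sample_percent = train_sample_percent
        self.rng = random.Random(seed)
        self.buffer = []

    # Approach 1: sample at append time. The buffer never holds vectors
    # that training would ignore, so memory use shrinks too.
    def append_sampled(self, feature_vector):
        if self.rng.random() < self.train_sample_percent:
            self.buffer.append(feature_vector)

    # Approach 2: append everything, sample just before training.
    # Simpler to bolt on, but builds a buffer it then partly ignores.
    def append_all(self, feature_vector):
        self.buffer.append(feature_vector)

    def training_sample(self):
        k = max(1, int(len(self.buffer) * self.train_sample_percent))
        return self.rng.sample(self.buffer, k)
```

Both end up handing training a sample-based bag of feature vectors; approach 1 just pays for the decision per append instead of per training run, and keeps the buffer itself smaller.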
So I just wanted to kick off this discussion so we have it as a reference, and get your thoughts.