What if data volume is low?

Needless to say, deep learning has the upper hand when the volume of data is high. But what if you have only a few hundred rows and still need to build an ML model on top of them?

The major risk with such a model is that it can be biased, overfit, or fail to generalize at all.

The best way to test for this is the following procedure (a code sketch follows the list):

  1. Use 20-fold cross-validation, each time holding out a sufficiently large test set.
  2. On the training set in each fold, expand the data using artificial synthesis techniques (e.g. SMOTE or noise-based augmentation).
  3. If the evaluation metrics and the most important features are consistent across the folds, go ahead and build a final model on the full dataset, or keep all 20 fold models and combine them with a voting scheme for real-time prediction.
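
Here is a minimal sketch of that procedure for a tabular classification task, assuming scikit-learn and NumPy. The `augment_with_noise` helper is a hypothetical stand-in for whatever synthesis technique you prefer (SMOTE, a generative model, etc.); the key point is that it runs inside each fold, on the training rows only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def augment_with_noise(X, y, n_copies=3, scale=0.05):
    """Hypothetical synthesis step: jitter each training row with Gaussian noise.

    Stand-in for whatever technique you actually use (SMOTE, CTGAN, ...).
    """
    X_new, y_new = [X], [y]
    for _ in range(n_copies):
        X_new.append(X + np.random.normal(0, scale * X.std(axis=0), X.shape))
        y_new.append(y)
    return np.vstack(X_new), np.concatenate(y_new)


def small_data_check(X, y, n_folds=20, seed=0):
    """Fold-wise check from the list above.

    Returns per-fold accuracies and feature importances (to inspect
    consistency) plus the fitted models (for voting at prediction time).
    """
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores, importances, models = [], [], []
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]        # split FIRST ...
        X_te, y_te = X[test_idx], y[test_idx]
        X_tr, y_tr = augment_with_noise(X_tr, y_tr)    # ... synthesize only the train side
        model = RandomForestClassifier(n_estimators=200, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
        importances.append(model.feature_importances_)
        models.append(model)
    return np.array(scores), np.array(importances), models
```

If the spread of `scores` is small and the top-ranked features in `importances` are roughly the same in every fold, the check has passed. For real-time prediction with the voting option, average `model.predict_proba(x)` over the 20 stored models (a simple soft-voting scheme).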

Caution: Do not synthesize data first and then split into train and test sets. This causes data leakage, because synthetic points derived from test rows end up in the training set.
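
To make the caution concrete, here is the difference in ordering, reusing the hypothetical `augment_with_noise` helper from the sketch above with scikit-learn's `train_test_split` standing in for whatever splitting you use:

```python
from sklearn.model_selection import train_test_split

# Leaky: synthetic rows derived from future test points end up in training.
X_aug, y_aug = augment_with_noise(X, y)                                   # synthesize everything ...
X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y_aug, test_size=0.2)   # ... then split

# Safe: split first, then synthesize only the training portion.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
X_tr, y_tr = augment_with_noise(X_tr, y_tr)
```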