How much data should you allocate for your training, validation, and test sets?

priyanka-gaikwad-9f6e5281 · 11 August 2020 12:19

Data management

ruble-joseph · 14 August 2020 08:14

You have to find a balance, and there’s no single right answer for every problem. If your test set is too small, your estimate of model performance will be unreliable (the performance statistic will have high variance). If your training set is too small, the fitted model parameters will have high variance.
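To get a feel for the test-set side of that trade-off, here is a rough sketch, assuming the metric is plain accuracy and the test samples are independent, so the number of correct predictions is approximately binomial. The function name `accuracy_standard_error` and the 0.9 accuracy figure are made up for illustration.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy estimate from n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Hypothetical model with a true accuracy of 0.9, scored on test sets
# of increasing size. The half-width uses the normal approximation.
for n_test in (100, 1000, 10000):
    se = accuracy_standard_error(0.9, n_test)
    print(f"n_test={n_test:>6}: standard error {se:.4f}, "
          f"approx. 95% interval 0.90 +/- {1.96 * se:.3f}")
```

The standard error shrinks with the square root of the test set size, so going from 100 to 10,000 test samples only tightens the estimate by a factor of 10.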

A good rule of thumb is to use an 80/20 train/test split. Then, your train set can be further split into train/validation or into partitions for cross-validation.
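A minimal sketch of that rule of thumb using only the standard library. The helper names, the 1,000-sample dataset size, the choice to hold out a quarter of the training portion for validation, and the 5 folds are all assumptions for illustration; in practice a library such as scikit-learn offers equivalent utilities.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into (train, test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_partitions(indices, k=5):
    """Split indices into k nearly equal, non-overlapping folds."""
    return [indices[i::k] for i in range(k)]

# 80/20 train/test split on a hypothetical dataset of 1,000 samples.
train_idx, test_idx = train_test_split_indices(1000, test_fraction=0.2)

# Option 1: hold out a quarter of the training portion for validation
# (an illustrative choice that gives 60/20/20 overall).
n_val = len(train_idx) // 4
val_idx, fit_idx = train_idx[:n_val], train_idx[n_val:]

# Option 2: partition the training portion into 5 folds for
# cross-validation; each fold serves once as the validation set.
folds = k_fold_partitions(train_idx, k=5)

print(len(fit_idx), len(val_idx), len(test_idx))
print([len(fold) for fold in folds])
```

Either way, the test indices are set aside first and never touched while tuning, so the final performance estimate stays independent of the model selection.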

chirag-garg · 7 August 2021 16:14

There is no one-size-fits-all answer to this question, but you need to strike a balance when allocating data to the training, validation, and test sets.

If you make the training set too small, the fitted model parameters may have high variance. If the test set is too small, the estimate of model performance is likely to be unreliable. A general rule of thumb is to use an 80/20 train/test split. After that, the training set can be further split into training and validation sets.