Dataset Size Sensitivity Analysis

The amount of training data required for a machine learning predictive model is an open question.

It depends on your choice of model, on the way you prepare the data, and on the specifics of the data itself.

One way to approach this problem is to perform a sensitivity analysis and discover how the performance of your model on your dataset varies with more or less data.

This might involve evaluating the same model on datasets of different sizes and looking for a relationship between dataset size and performance, or for a point of diminishing returns.
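As a rough illustration, the procedure can be sketched with the Python standard library alone. Everything in this sketch is an assumption made for the example rather than part of the analysis described here: the synthetic two-class Gaussian data, the toy nearest-centroid classifier standing in for your real model, and the helper names make_data, fit_centroids, accuracy, and size_sensitivity. Fresh training samples are drawn at each size and scored against one fixed test set, so the spread of scores reflects only the effect of the training data.

```python
import random
import statistics


def make_data(n, seed):
    """Draw n labeled points from two 2-D Gaussian classes (synthetic, illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        center = 0.0 if label == 0 else 1.5
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    rng.shuffle(rows)
    return rows


def fit_centroids(rows):
    """Toy stand-in model: the mean point of each class."""
    centroids = {}
    for label in (0, 1):
        points = [x for x, y in rows if y == label]
        centroids[label] = tuple(statistics.fmean(c) for c in zip(*points))
    return centroids


def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid matches the true label."""
    correct = 0
    for x, y in rows:
        predicted = min(
            centroids,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])),
        )
        correct += predicted == y
    return correct / len(rows)


def size_sensitivity(sizes, repeats=20, test_size=2000):
    """Return {training size: (mean accuracy, std of accuracy)} over repeated samples."""
    test = make_data(test_size, seed=0)  # one fixed hold-out set for every run
    results = {}
    for n in sizes:
        scores = []
        for r in range(repeats):
            train = make_data(n, seed=1000 * n + r + 1)  # a fresh sample per repeat
            scores.append(accuracy(fit_centroids(train), test))
        results[n] = (statistics.fmean(scores), statistics.stdev(scores))
    return results


results = size_sensitivity([10, 50, 250, 1250])
for n, (mean, std) in results.items():
    print(f"n={n:5d}  accuracy={mean:.3f}  std={std:.3f}")
```

Reporting the standard deviation alongside the mean matters here, because the spread of scores across repeats is often what changes most as the training set grows. With a real dataset you would subsample your own data rather than generate new points, and swap in the model you actually intend to use.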

Typically, there is a strong relationship between training dataset size and model performance, especially for nonlinear models. The relationship often involves an improvement in performance to a point and a general reduction in the expected variance of the model as the dataset size is increased.

Knowing this relationship for your model and dataset can help you:

  • Evaluate more models.
  • Find a better model.
  • Decide to gather more data.

You can evaluate a large number of models and model configurations quickly on a smaller sample of the dataset with confidence that the performance will likely generalize in a specific way to a larger training dataset.

This may let you evaluate many more models and configurations than the available time would otherwise allow and, in turn, perhaps discover a better-performing model overall.

You may also be able to extrapolate the expected performance of the model to much larger datasets and estimate whether gathering more training data is worth the effort or expense.
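One hedged way to make such an estimate is to fit a simple curve to the error rates measured at a few training sizes and extrapolate it. The sketch below assumes the error follows a power law, error ≈ a * n^(-b), which is a common modeling choice but not something this analysis guarantees for any particular model or dataset. The function names fit_power_law and predict_error are hypothetical, and the "observed" errors are generated from a known power law purely so the fit can be checked; they are not measured results.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


def predict_error(a, b, n):
    """Error rate predicted by the fitted power law at training size n."""
    return a * n ** -b


# Synthetic "observations" generated from a known curve (a=0.5, b=0.3),
# used only to show that the fit recovers the parameters.
sizes = [100, 300, 1000, 3000, 10000]
errors = [0.5 * n ** -0.3 for n in sizes]

a, b = fit_power_law(sizes, errors)
estimate = predict_error(a, b, 100000)
print(f"a={a:.3f}  b={b:.3f}  predicted error at n=100000: {estimate:.4f}")
```

Real measurements are noisy and often flatten toward a floor that a pure power law does not capture, so an extrapolation like this is best treated as a rough guide to whether more data is likely to help, not as a forecast.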