Finding a Skillful Model Is Not Enough
Applied machine learning is the process of discovering the model that performs best for a given predictive modeling dataset.
In fact, it’s more than this.
In addition to discovering which model performs the best on your dataset, you must also discover:
- Data transforms that best expose the unknown underlying structure of the problem to the learning algorithms.
- Model hyperparameters that result in a good, or even the best, configuration of a chosen model.
There may also be further considerations, such as techniques that transform the predictions made by the model, like threshold moving or calibration of predicted probabilities.
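As a rough illustration of a prediction transform, the sketch below moves the decision threshold on a handful of made-up predicted probabilities and picks the value that maximizes the F1 score. The data, the `f1_at` helper, and the threshold grid are all hypothetical; in practice the threshold would be chosen on held-out data rather than on the data used to report performance.

```python
# Hypothetical predicted probabilities for the positive class and true labels.
probs = [0.10, 0.35, 0.40, 0.45, 0.62, 0.80]
labels = [0, 0, 1, 1, 0, 1]

def f1_at(threshold):
    """F1 score when probabilities >= threshold are mapped to class 1."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Candidate thresholds from 0.05 to 0.95 in steps of 0.05.
thresholds = [i / 100 for i in range(5, 100, 5)]
best = max(thresholds, key=f1_at)

print("F1 at default 0.5: %.3f" % f1_at(0.5))
print("Best threshold: %.2f, F1: %.3f" % (best, f1_at(best)))
```

The chosen threshold is itself something that has to be fit and evaluated, which is why it counts as one more step to keep track of.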
As such, it is common to think of applied machine learning as a large combinatorial search problem across data transforms, models, and model configurations.
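To get a feel for how quickly this search space grows, the sketch below enumerates every combination of a few candidate transforms, models, and configurations. All of the names and values are hypothetical labels chosen for illustration; nothing is actually fit.

```python
from itertools import product

# Hypothetical candidates; the strings are illustrative labels only.
transforms = ["none", "standardize", "normalize", "power"]
models = ["logistic_regression", "decision_tree", "svm"]
configs_per_model = {
    "logistic_regression": [{"C": c} for c in (0.1, 1.0, 10.0)],
    "decision_tree": [{"max_depth": d} for d in (3, 5, None)],
    "svm": [{"C": c, "kernel": k} for c in (0.1, 1.0) for k in ("linear", "rbf")],
}

# Every (transform, model, configuration) triple is one candidate to evaluate.
candidates = [
    (t, m, cfg)
    for t, m in product(transforms, models)
    for cfg in configs_per_model[m]
]

print(len(candidates))  # 4 transforms * (3 + 3 + 4) configurations = 40
```

Even this tiny example yields 40 candidates, and each one must be evaluated under the same test harness for the comparison to be fair.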
This can be quite challenging in practice because the full sequence, comprising one or more data preparation schemes, the model, the model configuration, and any prediction transform schemes, must be evaluated consistently and correctly on a given test harness.
Although tricky, this may be manageable with a simple train-test split, but it quickly becomes unmanageable with k-fold cross-validation or repeated k-fold cross-validation, where every step must be refit on the training portion of each fold and then applied to the held-out portion.
The solution is to use a modeling pipeline to keep everything straight.
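A minimal sketch of the idea, assuming scikit-learn is the library in use: a `Pipeline` bundles a data transform and a model into a single object, so that repeated k-fold cross-validation refits both on the training portion of each fold. The synthetic dataset, the choice of `StandardScaler` and `LogisticRegression`, and the step names are assumptions made for illustration.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# The transform and the model are treated as one unit.
pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# 10 folds repeated 3 times gives 30 fits; the scaler is refit on each
# training portion, so no statistics leak from the held-out portion.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)

print("Accuracy: %.3f (%.3f)" % (mean(scores), std(scores)))
```

Because the pipeline behaves like a single estimator, swapping the transform or the model changes one line rather than the whole evaluation procedure.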