Stacked Generalization

Stacked generalization, or stacking for short, is an ensemble machine learning algorithm.

Stacking involves using a machine learning model to learn how to best combine the predictions from contributing ensemble members.

In voting, ensemble members are typically a diverse collection of model types, such as a decision tree, naive Bayes, and support vector machine. Predictions are made by averaging the predictions, such as selecting the class with the most votes (the statistical mode) or the largest summed probability.
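As a point of comparison, a voting ensemble of this kind can be sketched with scikit-learn's VotingClassifier; the models, dataset, and settings below are illustrative choices, not a prescription.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification dataset for demonstration.
X, y = make_classification(n_samples=200, random_state=1)

# Hard voting: each model casts one vote and the modal class wins.
# voting="soft" would instead sum the predicted class probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=1)),
        ("nb", GaussianNB()),
        ("lr", LogisticRegression()),
    ],
    voting="hard",
)
ensemble.fit(X, y)
predictions = ensemble.predict(X[:5])
```

Switching `voting="hard"` to `voting="soft"` selects the class with the largest summed probability instead of the most votes.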

… (unweighted) voting only makes sense if the learning schemes perform comparably well.

— Page 497, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

An extension to voting is to weigh the contribution of each ensemble member in the prediction, providing a weighted sum prediction. This allows more weight to be placed on models that perform better on average and less on those that don’t perform as well but still have some predictive skill.

The weight assigned to each contributing member must be learned, such as from the performance of each model on the training dataset or a holdout dataset.

Stacking generalizes this approach and allows any machine learning model to be used to learn how to best combine the predictions from contributing members. The model that combines the predictions is referred to as the meta-model, whereas the ensemble members are referred to as base-models.

The problem with voting is that it is not clear which classifier to trust. Stacking tries to learn which classifiers are the reliable ones, using another learning algorithm—the metalearner—to discover how best to combine the output of the base learners.

— Page 497, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

In the language taken from the paper that introduced the technique, base models are referred to as level-0 learners, and the meta-model is referred to as a level-1 model.

Naturally, the stacking of models can continue to any desired level.

Stacking is a general procedure where a learner is trained to combine the individual learners. Here, the individual learners are called the first-level learners, while the combiner is called the second-level learner, or meta-learner.

— Page 83, Ensemble Methods, 2012.

Importantly, the way that the meta-model is trained is different from the way the base-models are trained.

The inputs to the meta-model are the predictions made by the base-models, not the raw inputs from the dataset; the target remains the same expected target value. The predictions made by the base-models and used to train the meta-model are for examples not used to train the base-models, meaning that they are out of sample.

For example, the dataset can be split into train, validation, and test datasets. Each base-model can then be fit on the training set and make predictions on the validation dataset. The predictions from the validation set are then used to train the meta-model.

This means that the meta-model is trained to best combine the capabilities of the base-models when they are making out-of-sample predictions, i.e. on examples not seen during training.

… we reserve some instances to form the training data for the level-1 learner and build level-0 classifiers from the remaining data. Once the level-0 classifiers have been built they are used to classify the instances in the holdout set, forming the level-1 training data.

— Page 498, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Once the meta-model is trained, the base models can be re-trained on the combined training and validation datasets. The whole system can then be evaluated on the test set by passing examples first through the base models to collect base-level predictions, then passing those predictions through the meta-model to get final predictions. The system can be used in the same way when making predictions on new data.
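The holdout procedure described above can be sketched end to end; the dataset, split sizes, and choice of base and meta-models here are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset, split into train (fits the base models),
# validation (fits the meta-model), and test (evaluates the stack).
X, y = make_classification(n_samples=600, random_state=1)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

base_models = [DecisionTreeClassifier(random_state=1), GaussianNB()]
for model in base_models:
    model.fit(X_train, y_train)

# Level-1 training data: out-of-sample predictions on the validation set.
meta_X_val = np.column_stack([m.predict(X_val) for m in base_models])
meta_model = LogisticRegression().fit(meta_X_val, y_val)

# Re-fit the base models on the combined train + validation data,
# then evaluate the whole stack on the held-out test set.
X_full = np.vstack([X_train, X_val])
y_full = np.concatenate([y_train, y_val])
for model in base_models:
    model.fit(X_full, y_full)

meta_X_test = np.column_stack([m.predict(X_test) for m in base_models])
accuracy = meta_model.score(meta_X_test, y_test)
```

New data would follow the same two-step path as the test set: base models first, then the meta-model.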

This approach to training, evaluating, and using a stacking model can be further generalized to work with k-fold cross-validation.
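The k-fold variant is what scikit-learn's StackingClassifier implements: it generates the out-of-fold predictions that train the meta-model internally. The estimators below are example choices, not requirements.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)

# cv=5 means each base model's level-1 training predictions come from
# 5-fold cross-validation, so they are always out of sample.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=1)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
score = stack.score(X, y)
```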

Typically, base models are prepared using different algorithms, meaning that the ensembles are a heterogeneous collection of model types providing a desired level of diversity to the predictions made. However, this does not have to be the case: different configurations of the same model can be used, or the same model can be trained on different datasets.

The first-level learners are often generated by applying different learning algorithms, and so, stacked ensembles are often heterogeneous

— Page 83, Ensemble Methods, 2012.

On classification problems, the stacking ensemble often performs better when base-models are configured to predict probabilities instead of crisp class labels, as the added uncertainty in the predictions provides more context for the meta-model when learning how to best combine the predictions.
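One way to use probabilities as the level-1 data is to build the meta-model's training features from out-of-fold `predict_proba` outputs; this is a minimal sketch, with the dataset and models chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Two-class synthetic dataset.
X, y = make_classification(n_samples=300, random_state=1)

base_models = [DecisionTreeClassifier(random_state=1), GaussianNB()]

# Level-1 features: out-of-fold class probabilities rather than crisp
# labels, giving the meta-model a sense of each base model's confidence.
# Each model contributes one column per class (2 here), so 4 columns total.
meta_X = np.hstack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")
    for m in base_models
])
meta_model = LogisticRegression().fit(meta_X, y)
```

With scikit-learn's StackingClassifier, the `stack_method` parameter controls the same choice.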

… most learning schemes are able to output probabilities for every class label instead of making a single categorical prediction. This can be exploited to improve the performance of stacking by using the probabilities to form the level-1 data.

— Page 498, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

The meta-model is typically a simple linear model, such as a linear regression for regression problems or a logistic regression model for classification. Again, this does not have to be the case, and any machine learning model can be used as the meta-learner.

… because most of the work is already done by the level-0 learners, the level-1 classifier is basically just an arbiter and it makes sense to choose a rather simple algorithm for this purpose. […] Simple linear models or trees with linear models at the leaves usually work well.

— Page 499, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

This is a high-level summary of the stacking ensemble method, yet we can generalize the approach and extract the essential elements.