Bootstrap Aggregation

Bootstrap Aggregation, or bagging for short, is an ensemble machine learning algorithm.

The technique involves creating a bootstrap sample of the training dataset for each ensemble member, training a decision tree model on each sample, and then combining the predictions directly using a statistic like the average of the predictions. Each sample of the training dataset is created using the bootstrap method, which involves selecting examples randomly with replacement.
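As a minimal sketch, drawing one bootstrap sample amounts to sampling row indices with replacement, for example with NumPy (the toy arrays below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
X = np.arange(10).reshape(10, 1)  # toy training inputs
y = np.arange(10)                 # toy training targets

# Sample row indices with replacement to form one bootstrap sample.
indices = rng.integers(0, len(X), size=len(X))
X_sample, y_sample = X[indices], y[indices]
print(indices)  # some indices repeat; others are absent entirely
```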

Replacement means that each selected example is effectively returned to the pool of candidate rows and may be selected again, possibly many times, within a single sample of the training dataset. It is also possible that some examples in the training dataset are not selected at all for a given bootstrap sample. The bootstrap method has the desired effect of making each sample of the dataset usefully different, which is exactly what is needed for creating an ensemble.
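A quick sketch makes this concrete: when the sample is the same size as the training dataset, roughly 63.2% of the rows (1 − 1/e) appear at least once on average, and the remainder are left out of that sample entirely:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 1000

# One bootstrap sample of n rows, drawn with replacement.
indices = rng.integers(0, n, size=n)
unique = np.unique(indices)

# Expect roughly 0.632 unique and 0.368 left out, per 1 - (1 - 1/n)^n.
print(f"fraction of rows in sample: {len(unique) / n:.3f}")
print(f"fraction of rows left out:  {1 - len(unique) / n:.3f}")
```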

A decision tree is then fit on each sample of data. Each tree will be a little different given the differences in the training dataset. Typically, the decision tree is configured with an increased depth or without pruning. This makes each tree more specialized to its training sample and, in turn, further increases the differences between the trees.
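A minimal sketch of this procedure, assuming scikit-learn and a synthetic dataset for illustration, fits one unpruned tree per bootstrap sample and averages the predictions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(seed=1)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

trees = []
for _ in range(50):
    # Each ensemble member sees a different bootstrap sample.
    idx = rng.integers(0, len(X), size=len(X))
    # max_depth=None leaves the tree unpruned, so each member fits
    # its own sample closely and the members differ from one another.
    tree = DecisionTreeRegressor(max_depth=None)
    trees.append(tree.fit(X[idx], y[idx]))

# Combine predictions directly by averaging across the ensemble.
y_hat = np.mean([tree.predict(X) for tree in trees], axis=0)
```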

Differences between trees are desirable as they increase the “diversity” of the ensemble, meaning that the ensemble members have a lower correlation in their predictions or prediction errors. It is generally accepted that ensembles composed of members that are skillful and diverse (skillful in different ways, or making different errors) perform better. A benefit of bagging is that it generally does not overfit the training dataset, so the number of ensemble members can continue to be increased until performance on a holdout dataset stops improving.
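As a sketch of this last point, holdout performance can be tracked as the ensemble grows, here using scikit-learn's BaggingClassifier on a synthetic dataset (the member counts below are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

for n in [10, 50, 100, 200]:
    model = BaggingClassifier(n_estimators=n, random_state=1)
    model.fit(X_train, y_train)
    # Holdout accuracy typically plateaus rather than degrades
    # as more members are added.
    print(n, model.score(X_test, y_test))
```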