Bagging & Boosting Algorithms
Bagging is used when the goal is to reduce the variance of a decision tree classifier. The idea is to create several subsets of the training data by sampling randomly with replacement, and to train a separate decision tree on each subset. As a result, we get an ensemble of different models. The average of the predictions from all the trees is then used, which is more robust than a single decision tree classifier.
Bagging Steps:
- Suppose there are N observations and M features in the training data set. A bootstrap sample of size N is drawn from the training data randomly with replacement.
- At each node, a subset of the M features is selected at random, and whichever of those features gives the best split is used to split the node; this is repeated as the tree grows.
- Each tree is grown to its largest size, without pruning, as in the sketch below.
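To make these steps concrete, here is a minimal sketch assuming scikit-learn and a synthetic dataset; the number of trees, the `max_features="sqrt"` choice and the seeds are illustrative assumptions, not part of the recipe above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)
n_trees = 25

trees = []
for _ in range(n_trees):
    # Step 1: a bootstrap sample -- N observations drawn randomly with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Steps 2-3: grow an unpruned tree; max_features considers a random
    # feature subset at each split, as described above
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregate the trees: majority vote (an average would be used for regression)
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the ensemble:", (ensemble_pred == y).mean())
```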
Boosting is used to create a collection of predictors. In this technique, learners are trained sequentially: early learners fit simple models to the data, and the data are then analyzed for errors. Consecutive trees (each fit on a random sample) are added, and at every step the goal is to improve on the accuracy of the prior tree. When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. This process converts weak learners into a better performing ensemble.
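To make the reweighting idea concrete, here is a minimal sketch of an AdaBoost-style sample-weight update, assuming scikit-learn decision stumps; the function name `reweight` and the learning rate of 0.5 are illustrative assumptions, not a definitive implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def reweight(weights, y_true, y_pred, learning_rate=0.5):
    """Increase the weights of misclassified samples (AdaBoost-style update)
    so that the next weak learner focuses on them."""
    misclassified = (y_true != y_pred)
    new_weights = weights * np.exp(learning_rate * misclassified)
    return new_weights / new_weights.sum()  # renormalise to a distribution

# Usage: fit a decision stump, then boost the weights of its mistakes
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = np.full(len(X), 1 / len(X))
stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
w = reweight(w, y, stump.predict(X))  # misclassified points now weigh more
```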
Boosting Steps:
- Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1
- Draw a second random training subset d2 without replacement from the training set, add 50 percent of the samples that were previously misclassified by C1, and use this set to train a weak learner C2
- Find the training samples d3 in the training set D on which C1 and C2 disagree, and use them to train a third weak learner C3
- Combine all the weak learners via majority voting.
- The above steps are repeated n times, and the final prediction is the aggregation of the predictions from the n weak learners, as illustrated below.
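As one possible end-to-end illustration, scikit-learn's `AdaBoostClassifier` implements this kind of sequential, weighted combination of weak learners; the 50 estimators and the synthetic data below are arbitrary choices for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 sequentially trained weak learners (decision stumps by default),
# combined by weighted majority voting
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```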
The initial idea behind boosting was to achieve high performance from a model represented by an ensemble of “weak” learners, e.g. a set of relatively shallow trees. This was probably driven by the fact that building a lot of deep trees is computationally intensive and time consuming, and overfitting often becomes a problem. Looking at models based on the idea of boosting (XGBoost, AdaBoost, CatBoost, etc.), most boosting algorithms iteratively learn weak classifiers with respect to a distribution over the training data and add them to a final strong classifier. Compared to the Random Forest model, where a bunch of (often deep) trees are built independently, in models such as XGBoost each next tree is built using knowledge of the errors made by the previous trees, so the decision trees we are building improve at each iteration.
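One way to picture the “learn from the previous trees’ errors” idea is a hand-rolled gradient-boosting loop on a toy regression task; the shallow depth, 100 iterations and 0.1 learning rate below are assumptions for illustration, not how XGBoost is configured internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

prediction = np.zeros_like(y)      # start from a trivial model
learning_rate = 0.1
trees = []

for _ in range(100):
    residual = y - prediction                   # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)   # a "weak", shallow tree
    tree.fit(X, residual)                       # the next tree learns those errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE after boosting:", np.mean((y - prediction) ** 2))
```

Libraries such as XGBoost refine this basic loop with regularization, second-order gradient information and clever tree construction, but the core idea of fitting each new tree to the current errors is the same.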
Bagging is a different technique, even though it also aims at reducing variance as much as possible, and it too is typically applied to decision trees. The simplest explanation is that for bagging we perform sampling on our training data set: for each tree we pick a subset of samples by sampling uniformly and with replacement. That means, e.g., if we have a set [A, B, C, D, E, F, G, H, …], with bagging we can get sets such as [A, A, D, F, G, …] and [B, C, C, G, H, …], and so on (samples may repeat because we draw with replacement). Typically the share of unique samples in each bootstrap set is about (1 − 1/e) ≈ 63% of the total training data set. So, when training m trees, m subsets are generated and used, and then the results of all m models are averaged.
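The ~63% figure can be checked empirically with a quick simulation; the sample size below is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
bootstrap = rng.integers(0, N, size=N)   # sampling uniformly with replacement
unique_share = np.unique(bootstrap).size / N
print(unique_share)                      # ~0.632, i.e. about 1 - 1/e
```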