‘Random’ in Random Forest refers mainly to two processes –
- Random observations to grow each tree.
- Random variables selected for splitting at each node.
Random Record Selection: Each tree in the forest is grown on a bootstrap sample – data points drawn at random with replacement from the original training dataset, usually the same size as the original. Because of the replacement, such a sample contains roughly 2/3 of the distinct training observations (about 63.2%, i.e. 1 − 1/e as the dataset grows large). This sample acts as the training set for growing the tree.
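The 63.2% figure above is easy to verify empirically. This short sketch (plain Python, no specific library assumed) draws one bootstrap sample and counts what fraction of the original observations it contains:

```python
import random

random.seed(0)

n = 10_000                       # size of the original training set
indices = range(n)

# Bootstrap sample: n points drawn at random WITH replacement.
bootstrap = [random.choice(indices) for _ in range(n)]

# Fraction of distinct original points that appear in the sample.
unique_fraction = len(set(bootstrap)) / n
print(f"{unique_fraction:.3f}")  # close to 1 - 1/e ≈ 0.632
```

Each original point is missed by one draw with probability (1 − 1/n), so it is missed by all n draws with probability (1 − 1/n)^n → 1/e ≈ 0.368, leaving ≈ 63.2% of points included.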
Random Variable Selection: At each node, some number m of the independent variables (predictors) is selected at random out of all the predictor variables, and the best split among these m is used to split the node.
- By default, m is taken as the square root of the total number of predictors for classification, whereas for regression m is the total number of predictors divided by 3.
- The value of m is held constant during the algorithm run, i.e., while the forest is grown; only the subset of variables drawn changes from node to node.
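The per-node variable selection can be sketched as follows. This is an illustrative helper (the name `candidate_features` and its signature are mine, not from any library), showing the default choices of m and the fact that a fresh subset is drawn at every node while m itself stays fixed:

```python
import math
import random


def candidate_features(n_features, task="classification", rng=random):
    """Return the random subset of feature indices considered at one node."""
    # Common defaults: m = sqrt(p) for classification, p / 3 for regression.
    if task == "classification":
        m = max(1, int(math.sqrt(n_features)))
    else:
        m = max(1, n_features // 3)
    # m is fixed for the whole forest, but the subset of m features
    # is redrawn independently at every node.
    return rng.sample(range(n_features), m)


print(candidate_features(16))                     # 4 of the 16 feature indices
print(candidate_features(9, task="regression"))   # 3 of the 9 feature indices
```

In scikit-learn, for example, this knob is exposed as the `max_features` parameter of `RandomForestClassifier` / `RandomForestRegressor`.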