# R Random Forest Tutorial with Example

## What is Random Forest in R?

Random forests are based on a simple idea: ‘the wisdom of the crowd’. Aggregating the results of multiple predictors often gives a better prediction than the best individual predictor. A group of predictors is called an ensemble, so this technique is called ensemble learning.

In an earlier tutorial, you learned how to use decision trees to make a binary prediction. To improve on that technique, we can train a group of decision tree classifiers, each on a different random subset of the training set. To make a prediction, we obtain the predictions of all the individual trees, then predict the class that gets the most votes. This technique is called random forest.
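The voting step can be sketched in base R. This is only an illustration: the matrix `tree_votes` and the helper `majority_vote()` are hypothetical names, and the votes are made up rather than produced by real trees.

```r
# Hypothetical predictions from five trees for three observations.
# Rows are observations, columns are trees; the values are made up.
tree_votes <- matrix(
  c("Yes", "No",  "Yes", "Yes", "No",
    "No",  "No",  "Yes", "No",  "No",
    "Yes", "Yes", "Yes", "No",  "Yes"),
  nrow = 3, byrow = TRUE
)

# Return the most frequent label in a vector of votes.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# One forest-level prediction per observation (row).
forest_prediction <- apply(tree_votes, 1, majority_vote)
print(forest_prediction)
```

Using an odd number of trees, as here, avoids ties in a two-class problem.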

Step 1) Import the data
To make sure you have the same dataset as in the decision tree tutorial, the train set and test set are stored on the internet. You can import them without making any changes.

library(dplyr)
glimpse(data_train)
glimpse(data_test)
Step 2) Train the model
One way to evaluate the performance of a model is to train it on a number of different smaller datasets and evaluate it on the held-out remainder each time. This is called k-fold cross-validation. R has a function to randomly split the data into k folds of almost the same size. For example, if k = 10, the model is trained on nine folds and tested on the remaining fold. This process is repeated until every fold has been used once as the test fold. This technique is widely used for model selection, especially when the model has parameters to tune.
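The fold logic described above can be sketched in base R. The row count `n` and the variable names are assumptions for illustration; later in the tutorial, caret handles this splitting for you.

```r
set.seed(1234)

n <- 100   # hypothetical number of rows in the training data
k <- 10    # number of folds

# Assign each row to one of k folds of (almost) equal size, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  test_rows  <- which(fold_id == fold)   # the held-out fold
  train_rows <- which(fold_id != fold)   # the remaining k - 1 folds
  # A model would be fit on train_rows and scored on test_rows here.
  cat("Fold", fold, ": train =", length(train_rows),
      ", test =", length(test_rows), "\n")
}
```

Each row is used for testing exactly once, and the k scores are typically averaged into a single estimate.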

Now that we have a way to evaluate our model, we need to figure out how to choose the parameters that generalize best to the data.

Random forest builds many decision trees, each considering a random subset of the features. The model then aggregates the predictions of all the trees: a majority vote for classification, an average for regression.

Random forest has some parameters that can be changed to improve the generalization of the prediction. You will use the function randomForest() to train the model.

The syntax for random forest is

randomForest(formula, ntree=n, mtry=FALSE, maxnodes = NULL)
Arguments:

• formula: Formula of the fitted model
• ntree: Number of trees in the forest
• mtry: Number of variables randomly sampled as candidates at each split. By default, for classification, it is the square root of the number of predictors.
• maxnodes: Maximum number of terminal nodes each tree in the forest can have
• importance=TRUE: Whether the importance of the independent variables should be assessed

Note: Random forest can be trained on more parameters. You can refer to the vignette to see the different parameters.
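As a small worked example of the default for mtry, the arithmetic can be reproduced in base R. The value of `p` is hypothetical, and the regression rule is included only for comparison, following the defaults documented for the randomForest package.

```r
p <- 10   # hypothetical number of predictor columns

# Default for classification: floor of the square root of p.
mtry_classification <- floor(sqrt(p))

# Default for regression: one third of p, but at least 1.
mtry_regression <- max(floor(p / 3), 1)

mtry_classification
mtry_regression
```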

Tuning a model is very tedious work. There are many possible combinations of the parameters, and you don’t necessarily have the time to try all of them. A good alternative is to let the machine find the best combination for you. There are two methods available:

• Random Search
• Grid Search

We will define both methods, but during the tutorial we will train the model using grid search.

Grid Search definition
The grid search method is simple: the model is evaluated over every combination you pass to the function, using cross-validation.

For instance, suppose you want to try the model with 10, 20, and 30 trees, and each number of trees is tested with mtry equal to 1, 2, 3, 4, and 5. The machine will then test 15 different models:

```
   .mtry ntrees
1      1     10
2      2     10
3      3     10
4      4     10
5      5     10
6      1     20
7      2     20
8      3     20
9      4     20
10     5     20
11     1     30
12     2     30
13     3     30
14     4     30
15     5     30
```
The algorithm will evaluate:

randomForest(formula, ntree=10, mtry=1)
randomForest(formula, ntree=10, mtry=2)
randomForest(formula, ntree=10, mtry=3)
randomForest(formula, ntree=20, mtry=2)
...

Each combination is evaluated with cross-validation. One shortcoming of grid search is the number of experiments: it grows very quickly as the number of combinations increases. To overcome this issue, you can use random search.
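The 15-row grid in the example can be generated with base R's expand.grid(), which also makes the growth in the number of combinations easy to see. The third parameter and its range in `bigger_grid` are assumptions added for illustration.

```r
# Candidate values taken from the example: 5 values of mtry, 3 of ntrees.
grid <- expand.grid(.mtry = 1:5, ntrees = c(10, 20, 30))

nrow(grid)   # number of models to evaluate
head(grid)

# The grid size is the product of the number of values per parameter,
# so adding a parameter with 10 candidates multiplies the work by 10.
bigger_grid <- expand.grid(.mtry = 1:5,
                           ntrees = c(10, 20, 30),
                           maxnodes = 5:14)
nrow(bigger_grid)
```

With 10-fold cross-validation, every row of the grid means 10 model fits, so the total cost is the number of rows times the number of folds.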

Random Search definition
The big difference between random search and grid search is that random search does not evaluate every combination of hyperparameters in the search space. Instead, it randomly chooses a combination at each iteration. The advantage is that it lowers the computational cost.
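One simple way to picture random search is to draw a few rows at random from the full grid. This is a simplified sketch, not how caret implements it internally; the name `n_iter` and the choice of 5 iterations are assumptions.

```r
set.seed(1234)

# Full search space from the grid search example (15 combinations).
grid <- expand.grid(.mtry = 1:5, ntrees = c(10, 20, 30))

# Evaluate only a few randomly chosen combinations instead of all 15.
n_iter <- 5
random_candidates <- grid[sample(nrow(grid), n_iter), ]
print(random_candidates)
```

Only `n_iter` models are cross-validated, so the cost no longer depends on the size of the full grid, at the price of possibly missing the best combination.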

Set the control parameter
You will proceed as follows to construct and evaluate the model:

• Evaluate the model with the default setting
• Find the best number of mtry
• Find the best number of maxnodes
• Find the best number of ntrees
• Evaluate the model on the test dataset

Before you begin exploring the parameters, you need to install two libraries.

• caret: R machine learning library. If you have installed R with r-essential, it is already in the library
  • Anaconda: conda install -c r r-caret
• e1071: R machine learning library.
  • Anaconda: conda install -c r r-e1071

You can import them along with randomForest

library(randomForest)
library(caret)
library(e1071)
Default setting
K-fold cross-validation is controlled by the trainControl() function

trainControl(method = "cv", number = n, search = "grid")
Arguments:

• method = "cv": The method used to resample the dataset.
• number = n: Number of folds to create
• search = "grid": Use the grid search method. For the randomized method, use "random"

Note: You can refer to the vignette to see the other arguments of the function.

You can try to run the model with the default parameters and see the accuracy score.

Note: You will use the same controls throughout the tutorial.

# Define the control

trControl <- trainControl(method = "cv",
    number = 10,
    search = "grid")
You will use the caret library to evaluate your model. The library has one function called train() that can evaluate almost any machine learning algorithm. In other words, you can use this function to train other algorithms too.

The basic syntax is:

train(formula, df, method = "rf", metric = "Accuracy", trControl = trainControl(), tuneGrid = NULL)
Arguments:

• `formula`: Define the formula of the algorithm
• `method`: Define which model to train. Note, at the end of the tutorial, there is a list of all the models that can be trained
• `metric` = "Accuracy": Define how to select the optimal model
• `trControl = trainControl()`: Define the control parameters
• `tuneGrid = NULL`: A data frame with the combinations of tuning parameters to evaluate. If NULL, caret chooses the candidate values itself

Let’s try to build the model with the default values.

set.seed(1234)

# Run the model

rf_default <- train(survived~.,
    data = data_train,
    method = "rf",
    metric = "Accuracy",
    trControl = trControl)

# Print the results

print(rf_default)
Code Explanation

• trainControl(method = "cv", number = 10, search = "grid"): Evaluate the model with 10-fold cross-validation and a grid search
• train(…): Train a random forest model. The best model is chosen using the accuracy measure.
Output:

## The final value used for the model was mtry = 2.

The algorithm uses 500 trees and tested three different values of mtry: 2, 6, 10.

The final value used for the model was mtry = 2 with an accuracy of 0.78. Let’s try to get a higher score.

Step 2) Search best mtry

You can test the model with values of mtry from 1 to 10

set.seed(1234)
tuneGrid <- expand.grid(.mtry = c(1:10))
rf_mtry <- train(survived~.,
    data = data_train,
    method = "rf",
    metric = "Accuracy",
    tuneGrid = tuneGrid,
    trControl = trControl,
    importance = TRUE,
    nodesize = 14,
    ntree = 300)
print(rf_mtry)
Code Explanation

• tuneGrid <- expand.grid(.mtry = c(1:10)): Construct a data frame with a single column .mtry holding the values 1 to 10

The final value used for the model was mtry = 4.

Output:

## The final value used for the model was mtry = 4.

The best value of mtry is stored in:

rf_mtry$bestTune$mtry
You can store it and use it when you need to tune the other parameters.

max(rf_mtry$results$Accuracy)
Output:

## [1] 0.8110729

best_mtry <- rf_mtry$bestTune$mtry
best_mtry
Output:

## [1] 4
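To see how the best value relates to the results table, here is a base R sketch on a hypothetical data frame shaped like the `results` element of a caret model. The accuracies are made up and `results_demo` and `best_mtry_demo` are illustrative names; with caret's default selection rule, the chosen row is the one with the highest value of the metric.

```r
# Hypothetical cross-validation results: one row per candidate mtry.
# The accuracy values are made up for illustration.
results_demo <- data.frame(
  mtry     = 1:5,
  Accuracy = c(0.76, 0.79, 0.80, 0.81, 0.80)
)

# Row with the highest accuracy, and the mtry value it corresponds to.
best_row       <- which.max(results_demo$Accuracy)
best_mtry_demo <- results_demo$mtry[best_row]

best_mtry_demo
max(results_demo$Accuracy)
```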

Step 3) Search the best maxnodes
You need to create a loop to evaluate the different values of maxnodes. In the following code, you will:

• Create a list
• Create a variable with the best value of the parameter mtry (compulsory)
• Create the loop
• Store the model fitted with the current value of maxnodes
• Summarize the results

store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(5:15)) {
    set.seed(1234)
    rf_maxnode <- train(survived~.,
        data = data_train,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = maxnodes,
        ntree = 300)
    current_iteration <- toString(maxnodes)
    store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry)
Code explanation:

• store_maxnode <- list(): The results of the model will be stored in this list
• expand.grid(.mtry = best_mtry): Use the best value of mtry
• for (maxnodes in c(5:15)) { … }: Compute the model with values of maxnodes from 5 to 15.
• maxnodes = maxnodes: For each iteration, maxnodes is equal to the current value of maxnodes, i.e. 5, 6, 7, …
• current_iteration <- toString(maxnodes): Store the value of maxnodes as a string.
• store_maxnode[[current_iteration]] <- rf_maxnode: Save the result of the model in the list under that name.
• resamples(store_maxnode): Arrange the results of the models
• summary(results_mtry): Print the summary of all the combinations.

Output:

## 15 0.4014252 0.5689401 0.6117973 0.5867010 0.6507194 0.6955990 0

The last value of maxnode has the highest accuracy. You can try with higher values to see if you can get a higher score.

store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(20:30)) {
    set.seed(1234)
    rf_maxnode <- train(survived~.,
        data = data_train,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = maxnodes,
        ntree = 300)
    key <- toString(maxnodes)
    store_maxnode[[key]] <- rf_maxnode
}
results_node <- resamples(store_maxnode)
summary(results_node)
Output:

## 30 0.3297872 0.5534173 0.6202632 0.5843432 0.6590982 0.7189781 0

The highest accuracy score is obtained with a maxnodes value of 22.

Step 4) Search the best ntrees
Now that you have the best values of mtry and maxnodes, you can tune the number of trees. The method is exactly the same as for maxnodes.

store_maxtrees <- list()
for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
    set.seed(5678)
    rf_maxtrees <- train(survived~.,
        data = data_train,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = 24,
        ntree = ntree)
    key <- toString(ntree)
    store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree)
Output:

## 2000 0.4601542 0.5482030 0.5961590 0.5862151 0.6440678 0.6656337 0

You have your final model. You can train the random forest with the following parameters:

• ntree = 800: 800 trees will be trained
• mtry = 4: 4 features are considered as candidates at each split
• maxnodes = 24: Each tree has at most 24 terminal nodes (leaves)

fit_rf <- train(survived~.,
    data_train,
    method = "rf",
    metric = "Accuracy",
    tuneGrid = tuneGrid,
    trControl = trControl,
    importance = TRUE,
    nodesize = 14,
    ntree = 800,
    maxnodes = 24)
Step 5) Evaluate the model
The caret library has a function to make predictions.

predict(model, newdata = df)
Arguments:

• `model`: Define the model evaluated before.
• `newdata`: Define the dataset to make predictions on

prediction <- predict(fit_rf, data_test)
You can use the prediction to compute the confusion matrix and see the accuracy score.

confusionMatrix(prediction, data_test$survived)
Output:

## ‘Positive’ Class : No

You have an accuracy of 0.7943 (about 79.4 percent), which is higher than the accuracy obtained with the default settings.
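To see where the accuracy figure in a confusion matrix comes from, it can be recomputed by hand in base R. The labels below are made up for illustration and are not the tutorial's predictions; `predicted` and `observed` are hypothetical names.

```r
# Hypothetical predicted and observed labels; the values are made up.
predicted <- factor(c("No", "No", "Yes", "Yes", "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))
observed  <- factor(c("No", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
                    levels = c("No", "Yes"))

# Rows are predictions, columns are observed classes.
conf_table <- table(Predicted = predicted, Observed = observed)
print(conf_table)

# Accuracy is the share of observations on the diagonal,
# that is, the cases where prediction and observation agree.
accuracy <- sum(diag(conf_table)) / sum(conf_table)
accuracy
```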

Step 6) Visualize Result
Lastly, you can look at the feature importance with the function varImp(). It seems that the most important features are sex and age. That is not surprising because important features are likely to appear closer to the root of the tree, while less important features will often appear close to the leaves.

varImpPlot(fit_rf$finalModel)
Output:

varImp(fit_rf)

## embarkedS 0.000

Summary
We can summarize how to train and evaluate a random forest with the table below:

| Library | Objective | Function | Parameters |
|---|---|---|---|
| randomForest | Create a random forest | randomForest() | formula, ntree=n, mtry=FALSE, maxnodes = NULL |
| caret | Create k-fold cross-validation | trainControl() | method = "cv", number = n, search = "grid" |
| caret | Train a random forest | train() | formula, df, method = "rf", metric = "Accuracy", trControl = trainControl(), tuneGrid = NULL |
| caret | Predict out of sample | predict() | model, newdata = df |
| caret | Confusion matrix and statistics | confusionMatrix() | prediction, y test |
| caret | Variable importance | varImp() | model |

Appendix
List of models available in caret

names(getModelInfo())
Output: