What Is Random Forest?

What is the health condition of a heart patient: normal, suspect, or pathologic, according to the CTG?

Here, we predict the health condition of heart patients by analyzing their CTG data with the Random Forest algorithm.

What Is Random Forest?

A Random Forest is an ensemble of decision trees. It builds many decision trees and combines their outputs to get a more accurate prediction. Each individual tree is a weak, unstable model on its own, but the combined ensemble is stable.

Random forest can be used for both classification and regression. For regression, each tree predicts the mean of the observations in a terminal node, and the forest averages the predictions of all trees. For classification, each tree predicts the mode (the most common class) of the observations in a terminal node, and the forest takes the majority vote across trees. Each tree is split top-down, starting from the root node.
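
The aggregation step can be sketched in a few lines of base R. This is only an illustration: the per-tree predictions below are made-up values, and the helper name majority_vote is hypothetical rather than part of any package.

```r
# Hypothetical predictions from five trees (columns) for three observations (rows).
class_votes <- matrix(
  c("normal",     "normal",     "suspect", "normal",     "normal",
    "suspect",    "pathologic", "suspect", "suspect",    "normal",
    "pathologic", "pathologic", "normal",  "pathologic", "pathologic"),
  nrow = 3, byrow = TRUE
)

# Classification: the forest's output is the mode (majority vote) of each row.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}
forest_class <- apply(class_votes, 1, majority_vote)
print(forest_class)

# Regression: the forest's output is the mean of the trees' predictions.
reg_preds <- matrix(
  c(1.0, 1.2, 0.8, 1.1, 0.9,
    2.5, 2.7, 2.4, 2.6, 2.3),
  nrow = 2, byrow = TRUE
)
forest_value <- rowMeans(reg_preds)
print(forest_value)
```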

Each tree is split in the same way as an ordinary decision tree, using an impurity measure such as the Gini index or entropy.
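
As a rough illustration of these two measures, the base R sketch below computes the Gini index and the entropy of a node, and the weighted impurity of one candidate split. The function names (gini_index, entropy, split_impurity) and the labels are made up for this example; a real tree implementation evaluates many candidate splits and keeps the one with the lowest impurity.

```r
# Gini index of a vector of class labels: 1 - sum(p_k^2).
gini_index <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Entropy (in bits) of a vector of class labels: -sum(p_k * log2(p_k)).
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]  # drop empty classes so that 0 * log2(0) is never evaluated
  -sum(p * log2(p))
}

# Weighted impurity of a candidate split into a left and a right child node.
split_impurity <- function(left, right, impurity = gini_index) {
  n <- length(left) + length(right)
  (length(left) / n) * impurity(left) + (length(right) / n) * impurity(right)
}

# Made-up labels for a parent node and one candidate split of it.
parent <- c("normal", "normal", "normal", "suspect", "suspect", "pathologic")
left   <- c("normal", "normal", "normal")
right  <- c("suspect", "suspect", "pathologic")

print(gini_index(parent))           # impurity before the split
print(split_impurity(left, right))  # weighted Gini impurity after the split
print(entropy(parent))
```

A split is worth making when the weighted impurity of the children is lower than the impurity of the parent, as it is here.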

Model Tuning:

The randomForest package provides a function, tuneRF(), for tuning the model. The relevant parameters are ntree, mtry, stepFactor and improve.

ntree:

This parameter sets the number of trees to grow. The default value in randomForest() is 500. In tuneRF(), the corresponding argument is ntreeTry, which defaults to 50.

mtry:

This is the number of variables randomly sampled as split candidates at each node. By default, it is sqrt(p) for classification and p/3 for regression, where p is the number of predictor variables.

stepFactor:

At each iteration, mtry is inflated (or deflated) by this factor.

improve:

The relative improvement in OOB error must be at least this much for the search to continue.

tuneRF() reports the OOB (out-of-bag) error for each mtry value it tries, so you can pick the mtry with the lowest error.

In the example plot (not reproduced here), the OOB error is lowest at mtry = 5.

Variable importance:

There are different functions to check the importance of the variables used in the model:

1. varImpPlot():

It plots the variables in order of importance.

Syntax: varImpPlot(x, sort = TRUE)

Here x is the fitted model object. To show only the top N variables, pass n.var = N.

2. importance():

It returns the importance measures for each variable as a matrix (for example, the mean decrease in Gini index).

Syntax: importance(x)

3. varUsed():

It returns the number of times each variable was used in building the trees.

Syntax: varUsed(x)

Now, let's walk through a practical demo in R.

Random forest model:

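The randomForest package provides the function randomForest() for fitting the model.

Syntax: randomForest(x, y, ntree = 500, mtry = 5), or with a formula: randomForest(formula, data, ntree = 500, mtry = 5)

Here x contains the independent variables (predictors) and y is the dependent variable (response). In the formula form, data is the data frame that holds the variables. ntree and mtry are optional. These parameters were discussed above, and the model tuning part explains how to choose their values.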
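
A minimal sketch of the model fit is shown below. It assumes the randomForest package is installed and uses the built-in iris data as a stand-in for the CTG dataset, so the column names, the 70/30 split and mtry = 2 are illustrative choices, not the article's actual values.

```r
# install.packages("randomForest")  # uncomment if the package is not installed
library(randomForest)

set.seed(123)  # make the random split and the forest reproducible

# iris is a stand-in for the CTG data; Species plays the role of the class label.
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Formula interface: response on the left, "." means all other columns.
model_rf <- randomForest(Species ~ ., data = train, ntree = 500, mtry = 2)

# Equivalent x/y interface: x holds the predictors, y the response.
model_xy <- randomForest(x = train[, 1:4], y = train$Species,
                         ntree = 500, mtry = 2)

print(model_rf)  # shows the OOB error estimate and the OOB confusion matrix
```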

Prediction on test data:

Confusion Matrix:
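
The prediction and confusion matrix steps might look like the sketch below. As an assumption, it again uses the built-in iris data in place of the CTG dataset and refits a default model so that the block runs on its own; the names pred, conf_mat and accuracy are illustrative.

```r
library(randomForest)

set.seed(123)
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]
model_rf <- randomForest(Species ~ ., data = train)

# Predict the class of each held-out observation.
pred <- predict(model_rf, newdata = test)

# Confusion matrix: rows are predicted classes, columns are actual classes.
conf_mat <- table(Predicted = pred, Actual = test$Species)
print(conf_mat)

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```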

Model Tuning:

To increase the accuracy of our model, we tune it with different parameter values.

The parameters of the tuneRF() function were discussed earlier. Running tuneRF() on the training data gives the mtry value with the lowest OOB error.
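
A tuning run could be sketched as follows. This is an assumption-laden example: it uses the built-in iris data (only four predictors) instead of the CTG dataset, so the best mtry it finds will differ from the value reported for the article's data, and the argument values are illustrative.

```r
library(randomForest)

set.seed(123)
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]

# Try different mtry values, starting from mtryStart and scaling by stepFactor.
# The result is named t to match the print(t) call used in the article.
t <- tuneRF(x = train[, 1:4], y = train$Species,
            mtryStart = 2,    # first mtry value to try
            ntreeTry = 500,   # trees grown for each candidate mtry
            stepFactor = 2,   # mtry is multiplied or divided by this each step
            improve = 0.01,   # minimum relative OOB improvement to keep searching
            trace = TRUE,     # print the OOB error for each mtry tried
            plot = TRUE)      # draw OOB error against mtry

print(t)  # matrix with one row per mtry tried and its OOB error

# Pick the mtry with the lowest OOB error.
best_mtry <- t[which.min(t[, "OOBError"]), "mtry"]
print(best_mtry)
```

The selected value can then be passed as mtry to randomForest() when refitting the model.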

In the resulting graph for the CTG data, the OOB error is lowest at mtry = 8. To see the values as numbers, store the result of tuneRF() in a variable (here t) and type print(t).

Checking Variable Importance:

The variable importance functions were discussed above. Applied to our model, model_rf, they are:

1. varImpPlot(model_rf, sort = TRUE)

2. importance(model_rf)

3. varUsed(model_rf)

This gives the number of times each variable is used in the model.
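
The three calls can be tried together with the sketch below. It is only an illustration under stated assumptions: the model is fitted on the built-in iris data rather than the CTG dataset, and importance = TRUE is set at fit time so that the permutation-based measure is available as well.

```r
library(randomForest)

set.seed(123)
# importance = TRUE also computes the permutation-based (accuracy) measure.
model_rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# 1. Dot chart of the variables, sorted by importance.
varImpPlot(model_rf, sort = TRUE, n.var = 4)

# 2. Importance measures as a numeric matrix, one row per variable.
imp <- importance(model_rf)
print(imp)

# 3. How many times each variable was used for a split across the forest.
used <- varUsed(model_rf, count = TRUE)
names(used) <- rownames(imp)  # label the counts with the variable names
print(used)
```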

Reference: You can get the source code and dataset from the link below.

https://github.com/maddipatikiran/Random_Forest-BI_Desk

Happy Machine Learning
