Decision tree is a supervised learning algorithm which is used for both classification and regression. As the name goes, it uses a tree-like model of decisions, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
It consists of a set of rules for dividing large heterogenous population into smaller groups with respect to the target label, the algorithm used for this tree construction is recursive partitioning. It follows top- down approach.
Decision tree Terminology:
- Root node: Represents entire population
- Splitting: Process of dividing sample
- Decision Node: Node splits into further sub nodes
- Leaf / Terminal Node: Last stage of node (output label)
- Pruning: Opposite to splitting (to reduce size of tree)
How is a tree split is done?
Decision tree uses multiple algorithms like Gini index, Entropy, Chi-Square and Information Gain to decide to split a node into sub nodes.
It is used to decide a split of node, depends on purity and information required, less impure node requires less information.
For example, in above picture c is pure node compare to other, so it requires less information for tree splitting, more impure node requires more information. The measure of impurity is called Entropy. The following formula is used to calculate Entropy:
Entropy= -plog2p – qlog2q
Where p and q are the probability of success and failure, if sample is homogenous, then entropy is zero and if sample is equally divided (50% – 50 %), then the entropy is zero.
Gini Index is a metric to measure how often a randomly chosen element would be incorrectly identified. It means an attribute with lower Gini Index should be preferred.
It is an algorithm for statistically calculating the significance between different sub nodes and parent node, it is used for categorical variables.
But, there is no need to bother about all of this in this algorithm as r can handle all this calculation automatically. The above definitions and their explanations are just for the sake of understanding how decision tree works.
In a decision tree, the main disadvantage is overfitting of the tree, which can lead to issues like too many branches, less accurate and unseen samples. To avoid overfitting, we need to follow two approaches here:
- Pre-pruning and
In pre-pruning, it stops the tree construction a bit early. It is preferred not to split a node if its goodness measure is below a threshold value. But it’s difficult to choose an appropriate stopping point. This can be done by providing controls while building the model.
It is used to pass maximum depth of a tree i.e. length of longest path from a root node to leaf node. By passing this parameter, we can stop the growth of the tree.
It is the minimum number of records that must exit in a node to split a node.
It is the minimum number of records that can be present in a terminal node.
In this approach first, we will allow the tree to grow fully and observe the CP (Complexity parameter) value. Next, we prune (cut the tree) with optimal cp value. Complexity parameter is used to control the size of a decision tree.
Now, let’s go to practical demo with CTG (Cardiotocography) data set to predict fetal state class code (N=normal; S=suspect; P=pathologic)
Loading data set
Checking NA values
Checking dependent variable levels and change data types of dependent columns
Building model using rpart ()
rpart. Plot () function is used to plot the tree build by model
Now our model is ready to predict with test data
Here dtree is my model and ctg_test is my test data, check the performance of our model using confusion matrix ()
Now let’s check critical point of tree and pruning techniques
We can see from the graph minimum critical point is at 7 with 0.305 error rate, let’s prune the tree with minimum CP value
We have a function called prune () for pruning the tree
After pruning again, you can check your model accuracy
You can get data set and source code from below link
Happy Machine Learning