Which model should we choose for our problem?
A decision tree is one of the best-known supervised machine learning models. It can be used for both regression and classification, which is why the approach is also known as CART (Classification and Regression Trees). One of the main advantages of trees is that we can draw the fitted tree and see the decisions the model made, which helps us understand the model better. The main focus of this article is to understand decision trees and their advantages and disadvantages compared with the OLS regression model.
A decision tree model looks like an inverted tree. It has three parts: it starts with a root node, continues through decision nodes, and ends in leaf nodes. It is a non-parametric model, which means the model structure depends purely on the data. First, the model is structured based on the training data we supply, following the splitting criterion we selected. When we run it on test data, each observation follows the previously built tree structure through its split points, and whatever y-value the leaf node it reaches holds is the model's prediction.
This is in fact like a sampling exercise: a prediction is not a projection into the future. The model goes back to the historical sample and takes the mean or mode of that sample.
You can think of trees as asking your friend 20 yes/no questions to guess what they are thinking about. Every question gives you some information, which you use to decide which question to ask next, and by the end of your 20 questions you can predict the object your friend is thinking of.
Similarly, in decision tree models splitting follows certain criteria that make sure the resulting regions are distinct and non-overlapping. For every observation that falls into a region, the prediction is the mean of the training responses in that region when the response is continuous (i.e., regression) and the mode when it is discrete (i.e., classification). We can think of each leaf node as a sample.
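To make the "leaf as a sample" idea concrete, here is a minimal sketch in plain Python. The data, the single split point THRESHOLD, and the function names are all made up for illustration; a real tree learns its split points from the training data.

```python
from statistics import mean, mode

# Hypothetical training data: (feature value, numeric response, class label).
train = [
    (1.0, 10.0, "low"),
    (2.0, 12.0, "low"),
    (3.0, 11.0, "high"),
    (7.0, 30.0, "high"),
    (8.0, 34.0, "high"),
    (9.0, 32.0, "low"),
]

THRESHOLD = 5.0  # assumed split point; a real tree learns this from the data


def region_of(x):
    """Return which of the two non-overlapping regions x falls into."""
    return "left" if x <= THRESHOLD else "right"


def predict_regression(x):
    """Mean response of the training rows in the same region as x."""
    return mean(y for xi, y, _ in train if region_of(xi) == region_of(x))


def predict_class(x):
    """Most common label (mode) of the training rows in the same region as x."""
    return mode(label for xi, _, label in train if region_of(xi) == region_of(x))


print(predict_regression(2.5), predict_class(2.5))
print(predict_regression(100.0), predict_class(100.0))
```

Note that a new observation at x = 100 still receives the mean and mode of the right-hand region, because the prediction always comes from the training sample that ended up in that leaf.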
Attribute Selection Criteria
How the model chooses its split points is very important. We cannot let it pick a random attribute for each node, because that would produce poor trees with low accuracy.
Entropy and information gain are among the attribute selection criteria we generally use.
Entropy is the amount of randomness in the data: the higher the entropy, the harder it is to draw conclusions from the data.
Information gain is a statistical property that measures how well a given attribute separates the data according to the target variable. It is the reduction in entropy achieved by splitting on that attribute.
“Constructing a decision tree is all about finding attributes that gives us the highest information gain and lowest entropy”
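As a rough illustration of these two quantities, the sketch below computes Shannon entropy and information gain for a small, made-up set of labels. The function names and the example splits are assumptions for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical labels: 4 "yes" and 4 "no" in the parent node.
parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely recovers all of the parent's entropy as gain, while a split that leaves both children as mixed as the parent gains nothing.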
Another well-known attribute selection criterion is the Gini index.
The Gini index is a cost function that evaluates splits. It tends to favour larger partitions, whereas information gain tends to favour smaller partitions with many distinct values.
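For comparison with entropy, here is a small sketch of the Gini calculation (often called Gini impurity). The function names and the toy labels are hypothetical; a lower weighted value means a cleaner split.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(children):
    """Size-weighted Gini impurity of the child nodes produced by a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini(child) for child in children)


# Hypothetical two-class labels.
print(gini(["a", "a", "b", "b"]))
print(weighted_gini([["a", "a"], ["b", "b"]]))
print(weighted_gini([["a", "a", "a", "b"], ["b", "b", "b", "a"]]))
```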
We can select one of these splitting criteria using the criterion parameter of the decision tree. The default option is “gini”, which selects the Gini index; we can select information gain as our splitting criterion by setting criterion = “entropy”.
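The parameter names here match scikit-learn's DecisionTreeClassifier, so the sketch below assumes that library. The iris dataset is only a convenient stand-in for real data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" is the default; "entropy" switches to information gain.
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(gini_tree.criterion, gini_tree.get_n_leaves())
print(entropy_tree.criterion, entropy_tree.get_n_leaves())
```

The two criteria often produce similar trees, so it is worth comparing them on your own data rather than assuming one is better.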
Problem of overfitting
A decision tree with too many splits can overfit, and one with too few splits can underfit. The depth of the tree is the tuning parameter that gives us the optimal tree. The default value of this parameter is None, which lets the tree grow until every leaf is pure or holds fewer observations than the minimum needed for a split. That typically gives 100% training accuracy and leads to overfitting.
As the diagram above shows, because the max depth is not specified, the tree kept growing, splitting the training data until each leaf node was as pure as it could be.
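The same effect can be reproduced on synthetic data. The sketch below assumes scikit-learn and a made-up noisy dataset (20% of labels flipped), and compares an unrestricted tree with one limited to a depth of 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped at random (an assumed noise level).
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

full_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Print depth, training accuracy and test accuracy for each tree.
for name, model in [("max_depth=None", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, model.get_depth(), model.score(X_train, y_train), model.score(X_test, y_test))
```

The unrestricted tree memorizes the training set, including its noisy labels, so the gap between its training and test accuracy is the thing to look at.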
Pruning and Random Forest
To avoid overfitting we can use Pruning or Random Forest.
In the pruning method, we trim branches off the tree by removing some decision nodes from either side of the tree in such a way that the overall accuracy is not disturbed. Here we will simply tune the max depth hyperparameter, which removes the nodes of the original tree that lie below that depth.
Random forest is an example of ensemble learning (i.e., combining multiple models to improve overall performance). In this case, it combines multiple decision trees to improve predictive performance.
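One possible way to tune the max depth is a cross-validated grid search. The sketch below assumes scikit-learn, a synthetic dataset, and an arbitrary grid of candidate depths. Strictly speaking it limits growth up front rather than cutting branches from a finished tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data standing in for a real training set.
X, y = make_classification(n_samples=300, n_features=8, flip_y=0.1, random_state=0)

# Try several depths and keep the one with the best 5-fold cross-validated accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5, 8, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

scikit-learn also offers cost-complexity pruning through the ccp_alpha parameter, which removes branches after the tree has been grown.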
Why the name “Random”?
- Each tree in the model is trained on a random sample of the data.
- Each tree uses a random subset of features, which ensures that the ensemble does not rely heavily on any one feature and makes use of all potentially predictive features.
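As a brief sketch of where these two sources of randomness appear in practice, the example below assumes scikit-learn's RandomForestClassifier and a synthetic dataset. In that implementation the random feature subset is drawn at each split rather than once per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,
    bootstrap=True,       # each tree trains on a random sample drawn with replacement
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
)
forest.fit(X, y)
print(len(forest.estimators_))
```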
Let's leave the random forest model for another article and continue with trees for now.
Advantages of Decision Tree over OLS
- For linear regression to work well we need normally distributed errors and features that are not strongly correlated with one another, but decision trees are not sensitive to the underlying distribution, hence they are more robust.
- Too many features complicate OLS regression, which is not the case with trees.
- A common problem when predicting with OLS is extrapolating too far beyond the data. Tree predictions are more reliable in this respect because they always come from values that were actually observed in the training data.
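The extrapolation point can be seen in a small experiment. The sketch below assumes scikit-learn and made-up data following y = 2x on the range 0 to 10, then asks both models for a prediction at x = 20.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: y = 2x on the range 0..10.
X_train = np.arange(0, 11, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

ols = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

X_new = np.array([[20.0]])  # far outside the training range
ols_pred = ols.predict(X_new)[0]
tree_pred = tree.predict(X_new)[0]
print(ols_pred, tree_pred)
```

OLS continues the fitted line beyond the data, while the tree returns the value of the nearest leaf, which never exceeds the largest response it saw in training. Whether that is preferable depends on whether the trend really continues outside the observed range.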
Disadvantages of Decision Trees
- It is difficult to come up with good splitting points in all cases.
- The objective is to minimize the prediction error, which depends on how many leaves we allow the tree to branch into: the more leaf nodes we have, the smaller the training error, but the greater the risk of overfitting. So there is a trade-off between the number of nodes we branch into and the error we can accept.
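The relationship between the number of leaves and the training error can be sketched as follows. This assumes scikit-learn's max_leaf_nodes parameter and a made-up noisy sine curve as the target.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=200)  # hypothetical noisy target

# Training mean squared error for trees with an increasing number of leaves.
train_mse = {}
for n_leaves in [2, 4, 8, 16, 64]:
    model = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0).fit(X, y)
    train_mse[n_leaves] = float(np.mean((model.predict(X) - y) ** 2))
    print(n_leaves, round(train_mse[n_leaves], 4))
```

The training error keeps falling as leaves are added, but part of that improvement is the tree fitting the noise, which is why the error on held-out data is the better guide when choosing the number of leaves.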
Even though OLS regression is easy to use and straightforward, we cannot use it for all problems. When we fit data using OLS we are optimizing a cost function. The problem with optimization is that, if the model is wrong for the data, we may end up magnifying the error rather than reducing it. So we should make sure the model is appropriate, because optimizing a poorly specified model will increase the error. Decision trees and random forests are alternative regression models we can use.
Chauhan, N. S. (2020, Jan). Decision Tree Algorithm Explained. Retrieved from KDnuggets: https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
Gupta, P. (2017, May 17). Decision Trees in Machine Learning. Retrieved from Medium: https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
Maini, V. (2017, Aug 19). Machine Learning for Humans, Part 2.3: Supervised Learning III. Retrieved from Medium: https://medium.com/machine-learning-for-humans/supervised-learning-3-b1551b9c4930