Decision Tree Essentials for Every Data Scientist
By Robert Wood
Introduction to Decision Trees
Decision trees can be an incredibly useful classification method, and they lend themselves well to getting up and running with minimal code.
I have used some form of decision tree to predict customer churn, customer conversion, new product adoption, and new feature adoption, among many other useful applications.
This quick intro will serve to give you an understanding of the main benefits & limitations of using decision trees as a classification tool.
I’ll also walk you through the steps to build your own decision tree and, just as important, test its performance.
When and Why to Use Decision Trees
When it comes to classification, a decision tree classifier is one of the easiest methods to use.
Why to use a decision tree
Incredibly easy to interpret
They handle missing data & outliers very well and as such require far less up-front cleaning
You get to forgo categorical variable encoding, as decision trees handle categoricals well!
Without diving into the specifics of recursive partitioning, decision trees are able to model non-linear relationships.
Why not to use a decision tree
With all that good said, they’re not always the perfect option.
In the same way they can be simple, they can also be overly complicated, making them nearly impossible to conceptualize or interpret.
To take this idea a tad further, a tree that is overly complex may be catering too well to its training data and, as a result, is overfit: it performs well on the data it was trained on and poorly on new data.
With that said, let’s jump into it. I won’t talk about cross validation or the train/test split much, but will post the code below. Be sure to comment if there’s something you’d like more explanation on.
First we’ll break the data into training & test sets.
Also note that we’ll be using the classic Titanic dataset that’s included in base R.
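As a rough sketch of the split step, the code below assumes the base R `Titanic` object, which is a contingency table of counts (Class, Sex, Age, Survived) rather than one row per passenger, so it is expanded first. That table has no sibling/spouse column, so a tree built on it will not match the post's tree exactly. The 80/20 proportion, the seed, and the names `passengers`, `train` and `test` are assumptions made for illustration.

```r
# Base R ships `Titanic` as a 4-way table of counts.
# Expand it to one row per passenger so it can be split and modelled.
counts <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]
rownames(passengers) <- NULL

# Assumed 80/20 split; the seed is arbitrary and only makes the split repeatable.
set.seed(42)
n <- nrow(passengers)
test_idx <- sample(n, size = round(0.2 * n))
train <- passengers[-test_idx, ]  # rows used to fit the tree
test  <- passengers[test_idx, ]   # rows held back for evaluation

cat("train rows:", nrow(train), " test rows:", nrow(test), "\n")
```

Holding the test rows back until evaluation is what lets the later metrics say something about unseen data.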
Now we’ll train the model using the rpart function from the rpart package. The key thing to notice here is that the variable we want to predict is Survived, so we want to understand the likelihood any given individual survived according to some data. The ~ can be read as "by"; in other words, let’s understand Survived by some variables. If there is a . after the ~, that means we want to use every other variable in the dataset to predict Survived. Alternatively, as shown below, we can call out the variables we want to use explicitly.
Another thing to note is that the method is class. That is because we want to create a classification tree predicting categorical outcomes, as opposed to a regression tree that would be used for numerical outcomes. And finally the data we're using to train the model is train.
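A minimal sketch of the training call described above. The data preparation is an assumption (the base R `Titanic` table expanded to one row per passenger, with an arbitrary 80/20 split), and the predictor names come from that table rather than from the post's original data.

```r
library(rpart)  # a recommended package that ships with standard R installs

# Assumed data prep: expand base R's `Titanic` table to one row per passenger.
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]
rownames(passengers) <- NULL

# Assumed 80/20 split with an arbitrary seed.
set.seed(42)
test_idx <- sample(nrow(passengers), size = round(0.2 * nrow(passengers)))
train <- passengers[-test_idx, ]

# Explicit predictors: "Survived by Sex, Age and Class".
fit <- rpart(Survived ~ Sex + Age + Class, data = train, method = "class")

# The dot shorthand: every other column in `train` is used as a predictor.
fit_all <- rpart(Survived ~ ., data = train, method = "class")

print(fit)
```

Because this expanded table only has three predictor columns, the two formulas here describe the same model; with a wider dataset the dot form would pull in every remaining column.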
As previously mentioned one of the things that makes a decision tree so easy to use is that it’s incredibly easy to interpret. You’re able to follow the different branches of the tree to different outcomes.
It’s a bit difficult to read there, but if you zoom in a tad, you’ll see that the first criterion for whether someone likely lived or died on the Titanic was whether they were male. If you were male, you move to the left branch and work down two nodes: whether you were an adult, and your sibling/spouse count onboard. So if you were a single man, your odds of survival were pretty slim.
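One way to draw and read a fitted tree, sketched with the plotting methods that come with rpart. The data preparation is assumed (expanded base R `Titanic` table, arbitrary split), so the splits shown will differ from the tree described above; in particular, that table has no sibling/spouse variable.

```r
library(rpart)

# Assumed data prep: expand base R's `Titanic` table to one row per passenger.
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]
rownames(passengers) <- NULL

set.seed(42)  # arbitrary seed, assumed 80/20 split
test_idx <- sample(nrow(passengers), size = round(0.2 * nrow(passengers)))
train <- passengers[-test_idx, ]

fit <- rpart(Survived ~ Sex + Age + Class, data = train, method = "class")

# Text view: one line per node with the split rule, counts and predicted class.
print(fit)

# Graphical view: draw the branches, then label them.
# pretty = 0 prints full factor level names; use.n = TRUE adds class counts.
plot(fit, margin = 0.1)
text(fit, pretty = 0, use.n = TRUE, cex = 0.8)
```

In the rpart plot, observations that satisfy the condition printed at a node go down the left branch.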
Before we break out the metrics, let’s predict values for your test set. Similar to the call to train, you select the data and the type of prediction. The core difference is that you also pass in the fitted model.
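The prediction step might look something like this. As before, the data preparation (expanded base R `Titanic` table, arbitrary 80/20 split) and the object names are assumptions for illustration.

```r
library(rpart)

# Assumed data prep: expand base R's `Titanic` table to one row per passenger.
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]
rownames(passengers) <- NULL

set.seed(42)  # arbitrary seed, assumed 80/20 split
test_idx <- sample(nrow(passengers), size = round(0.2 * nrow(passengers)))
train <- passengers[-test_idx, ]
test  <- passengers[test_idx, ]

fit <- rpart(Survived ~ Sex + Age + Class, data = train, method = "class")

# type = "class" returns the predicted label for each test row.
pred_class <- predict(fit, newdata = test, type = "class")

# type = "prob" returns one column of class probabilities per outcome level.
pred_prob <- predict(fit, newdata = test, type = "prob")

head(pred_class)
head(pred_prob)
```

The class labels are what the metrics below compare against the actuals; the probabilities are useful if you want to pick your own cutoff.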
There are a variety of performance evaluation metrics which will come in very handy when understanding the efficacy of your decision tree.
Accuracy is very simple: it’s the percentage of your predictions that were correct. The confusionMatrix function from caret reports it.
The confusionMatrix function from the caret package is incredibly useful for assessing classification model performance. Load up the package, and pass it your predictions & the actuals.
The first thing this function shows you is what’s called a confusion matrix. This is a table of how predictions and actuals lined up. The diagonal cells, where the prediction and reference are the same, represent what we got correct. Adding those up gives 149 (106 + 43), and dividing by the total number of records, 178, we arrive at our accuracy of about 83.7%.
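To make the accuracy arithmetic concrete, here is a sketch that builds the confusion matrix with base R's `table()` and sums the diagonal. The data preparation is assumed (expanded base R `Titanic` table, arbitrary split), so its numbers will not match the 106 and 43 quoted above. The caret call is only run if caret happens to be installed.

```r
library(rpart)

# Assumed data prep: expand base R's `Titanic` table to one row per passenger.
counts <- as.data.frame(Titanic)
passengers <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                     c("Class", "Sex", "Age", "Survived")]
rownames(passengers) <- NULL

set.seed(42)  # arbitrary seed, assumed 80/20 split
test_idx <- sample(nrow(passengers), size = round(0.2 * nrow(passengers)))
train <- passengers[-test_idx, ]
test  <- passengers[test_idx, ]

fit <- rpart(Survived ~ Sex + Age + Class, data = train, method = "class")
pred_class <- predict(fit, newdata = test, type = "class")

# Rows are predictions, columns are actuals; the diagonal holds the correct ones.
cm <- table(Prediction = pred_class, Reference = test$Survived)
print(cm)

# Accuracy = correct predictions / all predictions.
accuracy <- sum(diag(cm)) / sum(cm)
cat("accuracy:", round(accuracy, 3), "\n")

# The same arithmetic on the counts quoted in the post: about 0.837.
post_accuracy <- (106 + 43) / 178

# caret prints the same table plus accuracy, sensitivity and specificity.
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(data = pred_class,
                               reference = test$Survived,
                               positive = "Yes"))
}
```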
True positive: The cell in the quadrant where both the reference and the prediction are 1. This indicates that you predicted survival and they did in fact survive.
False positive: Here you predicted positive, but you were wrong.
True negative: When you predict negative, and you are correct.
False negative: When you predict negative, and you are incorrect.
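The four cells can be pulled out of a confusion table by name. The vectors below are made-up toy data (ten records, with 1 as the positive class), not results from the Titanic model; they are only there to show which cell is which.

```r
# Toy data, invented for illustration: 1 = survived (positive), 0 = did not.
actual    <- factor(c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 0, 1, 0, 1, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are the reference (actual) values.
cm <- table(Prediction = predicted, Reference = actual)
print(cm)

tp <- cm["1", "1"]  # predicted 1, actually 1
fp <- cm["1", "0"]  # predicted 1, actually 0
tn <- cm["0", "0"]  # predicted 0, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1

cat("TP:", tp, " FP:", fp, " TN:", tn, " FN:", fn, "\n")
```

Fixing the factor levels up front keeps the table 2 x 2 even if one class is never predicted.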
A couple more key metrics to keep in mind are sensitivity and specificity. Sensitivity is the percentage of actual positive records that you predicted correctly.
Specificity, on the other hand, measures what portion of the actual negative records you predicted correctly.
Specificity is one to keep in mind when predicting on an imbalanced dataset. A very common example of this is classifying email spam. If 99% of email is not spam and you predict that nothing is ever spam, you’d have 99% accuracy. But treating "not spam" as the positive class, your specificity would be 0, because every piece of spam would be accepted.
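The spam example can be worked through in a few lines. The 1,000 emails below are synthetic and chosen only to match the 99% figure, and "not_spam" is treated as the positive class so that the numbers line up with the description above. If you label spam as the positive class instead, the roles swap and it is sensitivity that drops to 0.

```r
# Synthetic inbox: 990 legitimate emails and 10 spam (99% not spam).
lvls <- c("not_spam", "spam")
actual <- factor(c(rep("not_spam", 990), rep("spam", 10)), levels = lvls)

# A "model" that never flags anything as spam.
predicted <- factor(rep("not_spam", 1000), levels = lvls)

cm <- table(Prediction = predicted, Reference = actual)
print(cm)

# Positive class assumed to be "not_spam".
tp <- cm["not_spam", "not_spam"]  # legitimate mail correctly let through
fn <- cm["spam", "not_spam"]      # legitimate mail wrongly flagged
tn <- cm["spam", "spam"]          # spam correctly caught
fp <- cm["not_spam", "spam"]      # spam wrongly let through

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # share of actual positives predicted correctly
specificity <- tn / (tn + fp)  # share of actual negatives predicted correctly

cat("accuracy:", accuracy,
    " sensitivity:", sensitivity,
    " specificity:", specificity, "\n")
```

Accuracy alone looks excellent here, which is why it is worth checking both of the other numbers on imbalanced data.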
To wrap up our discussion on decision trees, we know they can be incredibly useful because they’re extremely interpretable, there is minimal pre-processing required, they can model non-linear relationships, and they have functionality that makes it easy to address imbalanced classification problems.
On the other hand, when modeling a more complicated relationship, a decision tree can be very difficult to understand and can be easily over-fit.
Keep this in mind as you begin to leverage this modeling technique.
I hope you enjoyed this quick lesson in decision trees. Let me know if there was something you wanted more info on or if there’s something you’d like me to cover in a different post.
Happy Data Science-ing! If you enjoyed this, come check out other posts like this at datasciencelessons.com