Decision Trees Pt. 2: Bagged Trees and Random Forest
In the last tutorial, we saw the basics of a single decision tree. While easy to interpret and understand, it still leaves some things to be desired. In general, CARTs (Classification and Regression Trees) are greedy, high-variance learners: they often don’t perform much better than basic regression, and models built on slightly different data can produce very different outcomes and interpretations. To address this, we can use ensemble methods, creating several tree models rather than relying on a single tree. Two related ensemble tree methods we’ll look at in this tutorial are bagged trees and random forest.
Data
We’re going to use a dataset that’s a bit larger than some of our past datasets to get a better idea of model performance. The data comes from kaggle.com, which hosts a lot of datasets and data science competitions. The dataset contains shot logs from the first half of the 2014-15 NBA season. This data was formerly available on the NBA stats website, but they took it down a year or so ago, so we have to work with what we can get!
You can download the dataset here: Kaggle Link. It’s just a simple CSV file that we can load right into R.
#Remember to specify where YOU saved it in your directory!
shot_log<- read.csv("~/R/blog/shot log.csv")
The data contains shots from throughout the season, with several pieces of info on each one, such as distance from the basket, distance from the nearest defender, number of dribbles taken before the shot, etc. We’ll be using our ensemble methods to predict SHOT_RESULT.
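Before modeling, it can help to take a quick peek at what we’re working with. A minimal sketch, using the same columns that show up in the model formulas later in the post:
#Quick look at the variables we'll use later on
str(shot_log[, c("SHOT_RESULT", "LOCATION", "PERIOD", "SHOT_CLOCK",
                 "DRIBBLES", "TOUCH_TIME", "SHOT_DIST", "CLOSE_DEF_DIST")])

#How balanced are makes and misses?
table(shot_log$SHOT_RESULT)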
Bagged Decision Trees
As explained earlier, single decision tree models can be highly variable from model to model. A model built on one subset of data could have a very different output than a model built on a separate subset. Bagged decision trees try to address this problem.
Bagged trees use a process known as bootstrapping to create different subsets of the data. Bootstrapping means sampling from a dataset with replacement. So, for example, imagine a dataset is a bag filled with 100 observations. If I wanted to create a bootstrapped sample, I would pull one observation from the bag, record it, and put it back in. I would continue until I have a sample the same size as the original dataset. There’s no limit to how many times an observation can be picked for a bootstrapped sample; I might see that observation 1 was picked 5 times while observation 2 was never picked.
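Here’s a quick toy version of that bag of 100 observations in R, just to make the idea concrete:
#A toy bootstrap: sample 100 "observations" with replacement
set.seed(1234)
bag<- 1:100
boot_sample<- sample(bag, size=length(bag), replace=TRUE)

#Some observations show up several times...
head(sort(table(boot_sample), decreasing=TRUE))
#...and some never get picked at all
sum(!(bag %in% boot_sample))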
Applying this process to decision tree building, we can create some number x of bootstrapped samples and build a model on each one. To get a prediction out of all these trees, we either average the predictions across trees for regression problems, or use a majority vote for classification problems (each tree gets a vote as to which class each observation should be predicted as).
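As a tiny illustration of the majority vote for a classification problem, suppose five trees each voted on the same shot:
#Pretend five trees voted on a single shot
votes<- c("made", "missed", "missed", "made", "missed")

#Majority vote: the class with the most votes wins (here, "missed")
names(which.max(table(votes)))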
Observations that weren’t picked for a given sample can act as a sort of test set for that tree; the error measured on these left-out observations is known as the out-of-bag (OOB) error. For each observation, we can find the trees that didn’t include it in their bootstrapped sample and use only those trees to predict it.
This helps lower issues of variance because with a bagged tree model, we’re actually creating several models from slightly different datasets and averaging the results together!
Bagged Decision Trees in R
We’ll be using the ipred package to create a bagged model. You’ll need to install this before loading it into R!
library("ipred")
Let’s create a basic training and testing dataset with caret.
library(caret)
set.seed(1234)
#75/25 training testing split
sl_split<- createDataPartition(shot_log$SHOT_RESULT, p=.75, list=F)
training<- shot_log[sl_split, ]
testing<- shot_log[-sl_split, ]
Now we’ll use the bagging() function to create a bagged tree model. To keep it simple, we’ll select some easy-to-interpret variables as predictors. Let’s use location (home or away), the quarter the shot was taken in, the number of seconds left on the shot clock, the number of dribbles the player took before shooting, the amount of time the player had the ball before shooting, the distance of the shot, and the distance of the closest defender.
We also add two additional arguments to the model: coob=T and nbagg=10. Setting coob to true gives us an out-of-bag error rate. nbagg specifies the number of bootstrapped samples, the default of which is 25. I lowered it a bit to save some processing time; this can take a minute to run because classification trees are grown as large as possible with this package. This can be altered with the control argument, but in general, bagged decision trees are usually grown out fully (overfitting isn’t as much of a problem when you’re combining a bunch of models).
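For reference, if you did want to rein the trees in, the call might look something like the sketch below, passing rpart’s control settings through bagging(). The maxdepth and cp values here are purely illustrative, not tuned.
#Hypothetical: cap tree size via rpart's control settings (values not tuned)
library(rpart)
model_small<- bagging(SHOT_RESULT ~ LOCATION + PERIOD + SHOT_CLOCK + DRIBBLES +
                      TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST, data=training,
                      coob=T, nbagg=10,
                      control=rpart.control(maxdepth=5, cp=0.01))
We’ll stick with the fully grown defaults for the model we actually fit: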
set.seed(1234)
model<- bagging(SHOT_RESULT ~ LOCATION + PERIOD + SHOT_CLOCK + DRIBBLES +
TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST, data=training,
coob=T, nbagg=10)
print(model)
##
## Bagging classification trees with 10 bootstrap replications
##
## Call: bagging.data.frame(formula = SHOT_RESULT ~ LOCATION + PERIOD +
## SHOT_CLOCK + DRIBBLES + TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST,
## data = training, coob = T, nbagg = 10)
##
## Out-of-bag estimate of misclassification error: 0.4478
Here we got an OOB error rate of about 45%, meaning we classified slightly more than half of our left-out observations correctly.
We can get a more in depth view of our accuracy using a confusion matrix. Let’s look at how we performed on our held out test set.
predictions<- predict(model, newdata=testing)
#using confusionMatrix from caret
confusionMatrix(predictions, testing$SHOT_RESULT)
## Confusion Matrix and Statistics
##
## Reference
## Prediction made missed
## made 6632 5993
## missed 7844 11548
##
## Accuracy : 0.5678
## 95% CI : (0.5624, 0.5733)
## No Information Rate : 0.5479
## P-Value [Acc > NIR] : 3.494e-13
##
## Kappa : 0.1178
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4581
## Specificity : 0.6583
## Pos Pred Value : 0.5253
## Neg Pred Value : 0.5955
## Prevalence : 0.4521
## Detection Rate : 0.2071
## Detection Prevalence : 0.3943
## Balanced Accuracy : 0.5582
##
## 'Positive' Class : made
##
Our sensitivity is a bit lower than our specificity. This indicates that our model has a harder time predicting the positive class, in this case made shots, than it would predicting missed shots.
Random Forest
Random forest uses the same general bagging principle, but offers an important improvement. Decision trees are naturally greedy; what I mean by this is that they choose the best split at each step. “What’s the problem with that,” you may be asking. Well, read the next sentence and I’ll tell you! The local optimum is not always the global optimum. The best split at a specific step may not be the best split for the overall model. Bagged trees don’t really fix this; since every tree sees mostly the same data, we’ll most likely end up with several models making very similar, locally optimized splits.
Random forest gives us a way to deal with this issue: at each split, the tree is only allowed to consider a random subset of the variables. We specify the size of that subset, but the variables in it are chosen completely at random.
For example, imagine I had a dataset with variables A, B, and C. Splitting on A usually leads to the highest increase in node purity, so my simple decision tree and my bagged trees usually focus on it. But with random forest, I might specify that only two variables can be considered at each step. My model might then randomly select variables B and C for the first split, see that splitting on C leads to the larger increase in purity of the two, and split on it. This process continues for each step in the tree, allowing for splits and interactions that we may have missed using a purely greedy algorithm.
Again, random forest uses the same bootstrapping architecture as bagged trees; it just adds a mechanism that nudges the model a bit closer to being globally optimal.
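The variable selection piece really is just random sampling without replacement at every split. A toy sketch of the A/B/C example above:
#At each split, only a random subset of predictors is considered
predictors<- c("A", "B", "C")
set.seed(1)
sample(predictors, size=2)   #candidates for the first split
sample(predictors, size=2)   #a (possibly different) pair for the next split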
Random Forest in R
We’ll be using the randomForest package to create our model. Again, remember to install it before trying to load it up.
library(randomForest)
One thing to note about the randomForest() function used to develop the model is that it requires a way to deal with missing values. In this example, I’ll just remove NAs from the dataset, but there are imputation methods available to fill in the blanks. We’ll go over some imputation in the future, but for now, we’ll keep it simple.
training.complete<- na.omit(training)
testing.complete<- na.omit(testing)
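As a quick aside, the randomForest package does ship with a rough imputation helper, na.roughfix(), which fills numeric NAs with the column median. A sketch of how it could stand in for the na.omit() step on the numeric predictors (we’ll keep using the complete-case data for the rest of this post):
#Alternative sketch: impute medians instead of dropping rows
num_vars<- c("SHOT_CLOCK", "DRIBBLES", "TOUCH_TIME", "SHOT_DIST", "CLOSE_DEF_DIST")
training.imputed<- training
training.imputed[, num_vars]<- na.roughfix(training[, num_vars])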
Now we’ll use randomForest() to create our model. We’ll specify three arguments here: mtry, which sets the number of variables to randomly select from at each split; ntree, which is the number of trees grown; and importance, which we’ll get into in a minute. We’ll grow far more trees in this model than we did in the bagging example, mainly as a way to make sure we get a good selection of random variable subsets. The model also runs a bit quicker than the bagging() function in ipred, so it’s not as big of a computational burden.
set.seed(1234)
model.rf<- randomForest(SHOT_RESULT ~ LOCATION + PERIOD + SHOT_CLOCK +
DRIBBLES + TOUCH_TIME + SHOT_DIST +
CLOSE_DEF_DIST, data=training.complete,
ntree=500, mtry=3, importance=T)
model.rf
##
## Call:
## randomForest(formula = SHOT_RESULT ~ LOCATION + PERIOD + SHOT_CLOCK +
## DRIBBLES + TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST,
## data = training.complete,
## ntree = 500, mtry = 3, importance = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 40.77%
## Confusion matrix:
## made missed class.error
## made 18045 23889 0.5696809
## missed 13582 36393 0.2717759
Our out-of-bag error decreased a fair amount from bagging, showing how random forest’s randomized split selection works towards a better overall model.
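If you’re curious whether we really needed all 500 trees, the package has a plot method for the fitted model that shows the OOB and per-class error rates as trees are added; something like this would let you eyeball where the error curve flattens out:
#OOB and per-class error as trees are added; a flat curve means more trees
#won't help much
plot(model.rf)
legend("topright", colnames(model.rf$err.rate), col=1:3, lty=1:3)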
One of the benefits of randomForest is that it allows us to measure variable importance in a couple of different ways. The one we’ll look at is called mean decrease in accuracy. This essentially compares the accuracy of the trees with all variables as-is to their accuracy when one variable is randomly permuted across the dataset. The OOB error rate is recorded for each tree and compared to the OOB error for the same tree with the permuted variable, and the average is taken of all these differences. Because we set importance=T in our model call, we can now pull the mean decrease in accuracy measure using the importance() function.
importance(model.rf, type=1)
## MeanDecreaseAccuracy
## LOCATION 6.729370
## PERIOD 5.830155
## SHOT_CLOCK 26.039685
## DRIBBLES 35.046074
## TOUCH_TIME 65.238486
## SHOT_DIST 218.148084
## CLOSE_DEF_DIST 24.674467
The output of the importance() function gives us the average decrease in accuracy of the trees divided by the standard deviation of those decreases. Larger numbers imply more important variables, so shot distance is our most important variable by mean decrease in accuracy.
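randomForest also includes a quick visual for this; varImpPlot() draws a dotchart of the importance measures, which can be easier to scan than the raw numbers:
#Dotchart of variable importance for the random forest model
varImpPlot(model.rf)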
Let’s look at the confusion matrix of our test set predictions.
#predict() silently drops rows with NAs, so compare against the complete-case test set
predictions.rf<- predict(model.rf, newdata=testing.complete)
confusionMatrix(predictions.rf, testing.complete$SHOT_RESULT)
## Confusion Matrix and Statistics
##
## Reference
## Prediction made missed
## made 5962 4469
## missed 7984 12178
##
## Accuracy : 0.5929
## 95% CI : (0.5874, 0.5985)
## No Information Rate : 0.5441
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.1624
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4275
## Specificity : 0.7315
## Pos Pred Value : 0.5716
## Neg Pred Value : 0.6040
## Prevalence : 0.4559
## Detection Rate : 0.1949
## Detection Prevalence : 0.3410
## Balanced Accuracy : 0.5795
##
## 'Positive' Class : made
##
Our overall accuracy increased, but our sensitivity actually decreased. The random forest model worked well at predicting misses, but struggled a bit to predict makes. It might be worth it to look at changing up some model arguments to see if we can get more balanced results (but I’m tired of typing so you do it).
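To give you a starting point (untested, and the values below are just a guess), randomForest has a cutoff argument that shifts the share of tree votes each class needs to win; lowering the cutoff for “made” should trade some specificity for sensitivity. There’s also sampsize for drawing more balanced bootstrap samples.
#Hypothetical tweak: make it easier for "made" to win the vote (values not tuned)
model.rf2<- randomForest(SHOT_RESULT ~ LOCATION + PERIOD + SHOT_CLOCK +
                         DRIBBLES + TOUCH_TIME + SHOT_DIST +
                         CLOSE_DEF_DIST, data=training.complete,
                         ntree=500, mtry=3, cutoff=c(0.45, 0.55))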
Conclusion
Bagging procedures help with the variance issues related to single decision trees; you’ll more than likely get better and more reproducible results. Random forest improves on bagging’s greedy process, so if bagged models sound fun to you, I’d suggest going that route over a regular old bagged decision tree. You do lose some interpretability by moving towards ensemble tree methods: we’re dealing with multiple trees here, and pulling out just one to look at or plot doesn’t make much sense. Although some interpretation is lost, bagging methods still provide strong performance as well as a fairly simple structure.
Stay tuned for the conclusion of the decision tree lessons where we go over boosted trees!