Logistic Regression

Written on July 6, 2017

With linear regression under our belt, we have a method to predict a numeric dependent variable. But what if the response isn’t numeric? In that case, you need to start playing around with classification. Logistic regression is one of the more basic forms of classification. It is used for binary predictions (i.e. two-class responses like yes/no), estimating the probability that an observation belongs to a certain class. We’ll explore logistic regression today by trying to predict whether a team wins or loses a game based on different box score statistics.

Data

For this tutorial, we are going to try to predict the outcome of games the Pacers played in the 2016-17 season. I’ll try to hide any disappointment from the Paul George trade. If I start sounding depressed, just remind me that I get to watch Lance Stephenson be our starting point guard next season (tanking with style!). Anyway, we’ll be scraping this data from Basketball Reference.

library(dplyr)
library(rvest)

pacers<-
  "http://www.basketball-reference.com/teams/IND/2017/gamelog/" %>%
  read_html() %>%
  #grabbing the game log table by its id
  html_nodes('#tgl_basic') %>%
  html_table() %>%
  #removing the scraped dataframe from a list
  .[[1]] %>%
  #selecting specific columns
  .[,2:23] %>%
  #changing column names to names in the first row
  `colnames<-` (make.names(.[1,], unique=T)) %>%
  #removing excess headers in dataframe
  filter(Date!="Date" & Date!="") %>%
  #renaming the home/away column (blank header) and the opponent score column
  rename(H.A=X, Opp.Score=Opp.1) %>%
  #creating a home and away column
  mutate(H.A=ifelse(H.A=="@", "A", "H")) %>%
  #converting the stat columns to numeric and W.L / H.A to factors
  mutate_at(vars(Tm:TOV), funs(as.numeric)) %>%
  mutate_at(vars(W.L, H.A), funs(factor))

Now if you just looked at the raw scraped data, you’d get a pretty dirty result. First off, html_table() returns the table inside a list, so we pulled it out into a dataframe for easier access. Basketball Reference also includes opponent stats in their game logs. This could be helpful, but for simplicity, we’ll look at just Pacer stats. This does create an issue with column names: the game logs on Basketball Reference have two rows of column names, the first specifying team or opponent and the second being the actual stats. R reads the first row as the column names while inserting the second into the dataframe itself. To get around this, we can simply assign that first row of data to the column names.

Other issues we cleaned up were removing the excess headers Basketball Reference repeats every 20 rows, renaming a few columns, and recoding the home/away column, which originally marked away games with an "@" and left home games blank. Finally, the columns we wanted to treat as numeric or factors were converted from their original character type.
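
As a quick sanity check, we can look at the structure of the dataframe to make sure each column ended up with the type we intended:

#the stat columns should be numeric and W.L / H.A should be factors
str(pacers)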

With a clean dataframe, we can start getting into the fun stuff.

Understanding Logistic Regression

Looking at our data, we have a lot of statistics that could influence whether the Pacers win or lose. To keep it simple, let’s choose only a few. Looking over the stats, I chose FG%, total rebounds, home or away, and turnovers as our predictors. W.L is the result of the game and will be our dependent variable.

pacers<- 
  pacers %>%
  select(W.L, FG., TRB, H.A, TOV)

head(pacers)

##   W.L   FG. TRB H.A TOV
## 1   W 0.505  52   H  16
## 2   L 0.378  49   A  13
## 3   L 0.489  33   A  13
## 4   W 0.471  43   H  15
## 5   L 0.469  32   A  21
## 6   W 0.535  36   H  11

Let’s look at a scatter plot of field goal % against the game’s result.

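Here’s a rough sketch of how a plot like this can be made with ggplot2 (a recreation on my end, not necessarily how the original figure was built):

library(ggplot2)

#plotting wins (1) and losses (0) against FG% with a linear (blue) and logistic (red) fit
pacers %>%
  mutate(Win=as.numeric(W.L=="W")) %>%
  ggplot(aes(x=FG., y=Win)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE, color="blue") +
  geom_smooth(method="glm", method.args=list(family="binomial"),
              se=FALSE, color="red") +
  labs(x="Field Goal %", y="Probability of Win")
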
In the plot, the y-value is the probability of a win. A linear regression line is shown in blue, while a logistic line is shown in red. We can see that the linear line goes beyond 1 and 0 and isn’t very flexible; it doesn’t make much sense to have probabilities that are greater than 1 or negative. The logistic line is s-shaped and bounded between 0 and 1, making it better suited for a binary problem like this.

Logistic regression uses the logistic function to find the probability of the positive class in the dependent variable (in this example, Win). The logistic function looks like:

Probability(Y) = e^(B0 + B1X) / (1 + e^(B0 + B1X))

Remember from the linear regression that B0 is the intercept term and Bi (in this case B1) is the coefficient for the predictor.

The odds of an event are the event’s probability of happening divided by the probability of it not happening. So if there was an 80% chance of the Pacers winning a game, their odds would be .8 / (1 - .8), or 4 to 1. Think of this game being played 5 times; the odds say the Pacers would win 4 times compared to 1 loss. The logistic equation can be manipulated to give us the odds like so:

Probability(Y) / (1 - Probability(Y)) = e^(B0 + B1X)

Now if we take the logarithm of both sides we get:

log(Probability(Y) / (1 - Probability(Y))) = B0 + B1X

The left-hand side of the equation is called the log odds. We can see that the right side now looks like the formula we used in linear regression! The log odds are linear in X; a one unit increase in X leads to a change of B1 in the log odds (an increase or decrease depending on the sign of B1). This is similar to how a one unit increase in X changes Y by B1 in linear regression. Remember, a change in X changes the LOG ODDS by the value of B1, NOT the probability.
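
To make the 80% example concrete, here’s a quick check in R (plogis() is base R’s logistic function):

p<- .8
#odds of winning: 4, i.e. 4 to 1
p / (1 - p)
#log odds of winning
log(p / (1 - p))
#the logistic function converts the log odds back into the probability
plogis(log(p / (1 - p)))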

Performing Logistic Regression

In the linear regression tutorial, we used lm to develop a model. For logistic regression, we are going to use glm. glm stands for generalized linear model and can fit several different families of models, including logistic regression. Here, we’ll fit a model predicting W.L with all of our predictors. You can write each predictor out, like we did in the linear regression tutorial, or you can just use a . to tell R that we want all predictors included.

set.seed(1234)
pacers.glm<- glm(W.L ~ ., data=pacers, family=binomial)

The command is similar to lm, except that we need to set the family argument, which we specify as binomial to indicate logistic regression. We can use summary to look at the general output of our model.

summary(pacers.glm)

## 
## Call:
## glm(formula = W.L ~ ., family = binomial, data = pacers)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8954  -0.6090   0.1407   0.5709   2.4511  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -25.28876    6.34006  -3.989 6.64e-05 ***
## FG.          34.37660    9.30910   3.693 0.000222 ***
## TRB           0.24763    0.06854   3.613 0.000303 ***
## H.AH          1.43891    0.64625   2.227 0.025977 *  
## TOV          -0.12941    0.09874  -1.311 0.189971    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 113.627  on 81  degrees of freedom
## Residual deviance:  66.142  on 77  degrees of freedom
## AIC: 76.142
## 
## Number of Fisher Scoring iterations: 5

Here we get an output similar to the output of lm. It appears as though field goal percentage, total rebounds, and the game being at home are all significant based on their p-values. All three of these variables have a positive coefficient estimate, indicating that higher field goal percentages and rebound totals, as well as the game being at home rather than away, increase the log odds of a Pacers win. The number of turnovers has a negative relationship with the log odds of a Pacers win, but is not significant. The regression equation would look like this:

log odds(Win) = -25.29 + 34.38(FG.) + .25(TRB) + 1.44(H.AH) - .13(TOV)
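
To see the equation in action, we can plug in the first game from the head() output above (FG% of .505, 52 rebounds, at home so H.AH = 1, and 16 turnovers), using the unrounded coefficients from the summary:

#log odds of a win in game 1
log.odds<- -25.28876 + 34.37660*.505 + 0.24763*52 + 1.43891*1 - 0.12941*16
#converting the log odds into a probability with the logistic function
exp(log.odds) / (1 + exp(log.odds))

This works out to roughly .987, which matches the first probability we’ll get from predict() in a moment.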

The deviance is a measure of fit for the model. The null deviance is the deviance of the model with only the intercept while the residual deviance is the deviance of the model with all of our predictors. We want our residual deviance to be lower than the null (while also not sacrificing a ton of degrees of freedom). Here, our residual deviance is significantly lower than the null while only sacrificing 4 degrees of freedom, indicating that this model is better than simply making predictions with the intercept.
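
If you want to put a p-value on that comparison, the drop in deviance can be tested against a chi-squared distribution with 4 degrees of freedom (the degrees of freedom we gave up):

#testing the overall drop in deviance (null vs. residual)
pchisq(pacers.glm$null.deviance - pacers.glm$deviance,
       df=pacers.glm$df.null - pacers.glm$df.residual, lower.tail=FALSE)

The p-value is tiny, backing up the eyeball comparison.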

Another, more straightforward way of assessing the model is by checking the accuracy of predictions. Let’s start by using the predict function on our model. We need to specify type = "response" to predict probabilities rather than log odds.

pacer.prediction<- predict(pacers.glm, type="response")

head(pacer.prediction)

##          1          2          3          4          5          6 
## 0.98682737 0.13669425 0.12033458 0.74045082 0.01871131 0.88418547

Now, because probabilities were predicted and not classes, we need to set a probability threshold of what is considered a predicted win and loss. Here, we’ll just say that .5 and above is a win. There could be a scenario where you might want to change this up, but this seems like a generally accepted cutoff.

pacer.prediction<- factor(ifelse(pacer.prediction>=.5, "W", "L"))

head(pacer.prediction)

## 1 2 3 4 5 6 
## W L L W L W 
## Levels: L W

We can now compare this vector of predictions to the actual values:

mean(pacer.prediction==pacers$W.L)

## [1] 0.7926829

Overall, our accuracy is close to 80%. Keep in mind that this accuracy is measured on the same games the model was trained on; ideally we would measure it on observations the model didn’t see during training (a form of cross-validation), but I’ll save that for a future tutorial.
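
One more quick check: overall accuracy doesn’t tell us whether the misses lean toward predicted wins or predicted losses. A simple confusion table breaks the predictions down against the actual results:

#rows are predicted results, columns are actual results
table(Predicted=pacer.prediction, Actual=pacers$W.L)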

With a good idea of the logistic regression process, we can now venture into the world of classification. Whether it’s predicting a game result or a shot result, classification opens up a lot more ways to apply statistics with a basketball frame of mind.