Time Series and ARIMA
Previously in these blog posts, we’ve looked at predicting some variable using a bunch of available statistics. In this post, we’ll look at using time to predict a value. Time series models allow us to make predictions using a more flexible framework for… time series data (crazy, I know)! There are many time series models, each with their own pros and cons, but in this post, I’m going to focus on ARIMA models and their components.
Time Series Basics
A time series is generally made up of three components: trend, seasonality, and an error term. Trend is the direction of the series over time; is it going up or down as we move forward? Seasonality is the seasonal effect on our response. For example, we might see sales of a product increase every year at Christmas time.
If there is a trend present, it can either be stochastic or deterministic. A deterministic trend follows a fixed function of time (think of a steady straight-line increase), while a stochastic trend changes randomly over time.
Another term we should know is stationarity. A stationary time series is one whose mean and variance are constant over time. If we have a stationary time series, we shouldn’t be seeing a trend, or any sort of change in the variance of the series.
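To get a feel for the difference, here’s a minimal simulated sketch (not from any real data): white noise is stationary, while its running total, a random walk, is not.
set.seed(42)
# white noise has a constant mean and variance; its cumulative sum
# (a random walk) has a mean that wanders over time
noise <- rnorm(100)      # stationary series
walk  <- cumsum(noise)   # non-stationary series
plot.ts(cbind(noise, walk))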
ARIMA
ARIMA models are a class of models for stationary time series. They consist of three components:
- Autoregressive
- Integration
- Moving Average
Autoregressive Models
The autoregressive component uses previous time values (lags) to predict the current time value. We use a specific number of previous terms in our time series to predict the current term using regression; this is our autoregressive order, written as $p$. So, for example, if we determined that the previous three terms were significant in predicting the present value, $y_t$, we’d have an AR(3) process and we’d use a model that would look like:

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \varepsilon_t$$

In this case, $y_{t-i}$ represents the value at a previous lag, $\phi_i$ is a coefficient fit to minimize the error, and $\varepsilon_t$ is an error term.
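To make the idea concrete, here’s a small simulated sketch; the $\phi$ values below are made up purely for illustration, not estimated from any data.
set.seed(1)
# simulate 200 observations from an AR(3) process with arbitrary
# illustrative coefficients phi = (0.5, 0.2, 0.1)
ar3 <- arima.sim(model = list(ar = c(0.5, 0.2, 0.1)), n = 200)
plot.ts(ar3)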
Moving Average Models
The moving average part of the model is the same concept as the autoregressive component, but instead of regressing on previous values in time, we’re regressing on previous error terms. The order of the moving average model is written as $q$. The error terms of a stationary process are white noise, and each one is represented by the difference between a value of the series and the value before it. So if we determined that the 3 previous error terms were significant in predicting the current value, $y_t$, our equation would look like:

$$y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \theta_3 \varepsilon_{t-3}$$

In this case, $\varepsilon_{t-i}$ represents the error term at each lag, and $\theta_i$ is again acting as a coefficient for each lag. Since a stationary error term is just the difference between a time point and the time point before it, we could rewrite the lagged errors to get:

$$y_t = \varepsilon_t + \theta_1 (y_{t-1} - y_{t-2}) + \theta_2 (y_{t-2} - y_{t-3}) + \theta_3 (y_{t-3} - y_{t-4})$$
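And the moving average analogue of the earlier sketch, again with arbitrary $\theta$ values chosen only for illustration:
set.seed(1)
# simulate 200 observations from an MA(3) process built from lagged
# white-noise errors, theta = (0.6, 0.3, 0.2)
ma3 <- arima.sim(model = list(ma = c(0.6, 0.3, 0.2)), n = 200)
plot.ts(ma3)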
Integration
So with a stationary time series we shouldn’t see a trend. But a lot of time series actually have a trend. Do we just throw these away? No! We can remove the trend using a method called differencing (this is the “integrated” part of ARIMA). All differencing does is subtract from each value the value at the previous lag. Differencing once usually removes the trend, but we can difference again if a trend still remains.
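In symbols, the first difference is

$$y'_t = y_t - y_{t-1}$$

and if a trend is still visible, we can difference the differenced series again: $y''_t = y'_t - y'_{t-1}$.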
The $d$ component represents the amount of differencing that goes into the ARIMA model. So a $d = 1$ parameter would represent differencing each value by its previous lag.
Putting it All Together
We can combine all three aspects into one model, ARIMA($p$, $d$, $q$). Each parameter takes on the same meaning it does when the three parts of the model are separate. So an ARIMA(1, 1, 1) model would look like:

$$y'_t = \phi_1 y'_{t-1} + \theta_1 \varepsilon_{t-1} + \varepsilon_t$$

where $y'_t = y_t - y_{t-1}$ is the differenced series.
In this example, we would first difference the time series to remove any trend, then use the previous lag (the autoregressive component) and the previous error term (the moving average component) to predict the current value, $y'_t$.
Another thing to note, but something we won’t get into in this post’s example: ARIMA models can also take on seasonality. Seasonal ARIMA models have an additional (P, D, Q) term that adds on to the first, but is strictly for seasonal values. So if the ARIMA model looked like ARIMA(0,1,0)(1,0,0)[12] and the seasonal period was 12 (monthly), we would first do first order differencing, and then use the value from 12 months ago to predict the current time point.
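As a quick sketch of what that looks like in code (using the forecast package we’ll meet below, with a made-up monthly series purely to show the interface):
library(forecast)
# a hypothetical monthly series, only here to illustrate the call
set.seed(1)
x <- ts(rnorm(120), frequency = 12)
# ARIMA(0,1,0)(1,0,0)[12]: first differencing, plus a seasonal AR term
# that uses the value from 12 months ago
fit <- Arima(x, order = c(0, 1, 0),
             seasonal = list(order = c(1, 0, 0), period = 12))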
Now let’s take a look at the data!
Data
Over the past few seasons, we’ve seen a huge emphasis on three point shooting throughout the NBA. What better way to introduce ourselves to time series analysis than by trying to forecast the number of three pointers attempted over future seasons?
We’ll scrape the data we need from Basketball Reference. Essentially, what we’re doing with this code chunk is scraping the player stats page for each NBA season going back to 1980 (when the three point line was introduced), counting up the total number of three pointers attempted, and then combining all of that into one data frame. This code takes a minute to run, so feel free to grab a coffee or reminisce about all of those three pointers you’ve made throughout your life.
library(rvest)
library(dplyr)

# for each season since 1980, scrape the player totals table and
# add up the league-wide three point attempts
three.df <-
  lapply(1980:2018, function(x)
    paste0("https://www.basketball-reference.com/leagues/NBA_", x,
           "_totals.html") %>%
      read_html() %>%
      html_node(xpath = '//*[@id="totals_stats"]') %>%
      html_table() %>%
      `colnames<-`(make.names(colnames(.), unique = TRUE)) %>%
      mutate(Season = x) %>%
      # traded players get a combined "TOT" row first; keep one row each
      distinct(Player, .keep_all = TRUE) %>%
      # drop the header rows that repeat inside the table
      filter(Player != "Player") %>%
      mutate(X3PA = as.numeric(X3PA)) %>%
      group_by(Season) %>%
      summarize(tot.3PA = sum(X3PA, na.rm = TRUE))
  ) %>%
  bind_rows()
Still with me? Good. Now how about we get into actually modeling this stuff.
ARIMA in R
Before we get into any analysis, we’re going to put our data into a time series object. To do this, we’ll use the ts() function.
# convert the data frame into a time series object
three.ts <- ts(three.df)
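As an aside (an optional variant we won’t use below), ts() can also take a start argument, so the series is indexed by season rather than by 1, 2, 3, …:
# index the series by year so plots and forecast tables show seasons
three.ts.yr <- ts(three.df$tot.3PA, start = 1980)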
Let’s take a look at the number of three point attempts over time graphically.
library(ggplot2)
library(zoo)

as.zoo(three.ts) %>%
  ggplot(aes(x = Season, y = tot.3PA)) +
  geom_line(col = "lightblue", size = 1.1) +
  labs(x = "Season", y = "3 Point Attempts")
We see that there is a pretty clear positive trend. Because of that, we can say that the mean value is not constant, indicating that the series is not stationary. The variance also isn’t constant throughout, as some seasons see much larger changes from the previous year (especially the years from 1997-1999).
Luckily, ARIMA can be used for non-stationary series. The integration component differences the time series to remove the trend. To identify how many differences we should use, we’ll look at the results of the differencing graphically. Usually differencing once is enough to remove the trend.
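If we’d rather not just eyeball it, the forecast package also has a helper, ndiffs(), that uses unit root tests to suggest how many differences are needed; a quick sanity check to run alongside the graphical approach:
library(forecast)
# suggested number of differences, based on unit root tests
ndiffs(three.ts[,2])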
as.zoo(diff(three.ts)) %>%
  ggplot(aes(x = index(as.zoo(diff(three.ts))), y = tot.3PA)) +
  geom_line(col = "lightblue", size = 1.1) +
  labs(x = "Season index", y = "Differenced 3 Point Attempts")
We can see that the trend is pretty much removed. So we’ll set the $d$ component of our ARIMA model to 1.
So one component down, two more to go. To determine $p$, the number of previous values we’ll use in our model, and $q$, the number of previous error terms we’ll use in our model, we need to get a sense of autocorrelation and partial autocorrelation.
Autocorrelation is just a variable’s correlation with itself at two different time points.
Partial autocorrelation is the correlation between any two time points when the linear effect of the time points in between is removed. That’s a little more confusing, but we’re essentially saying that it’s the correlation between two time points once we take the mutual correlations with the intervening lags into account.
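In symbols, the autocorrelation at lag $k$ is

$$\rho_k = \mathrm{Corr}(y_t,\, y_{t-k})$$

while the partial autocorrelation at lag $k$ is that same correlation after the effects of lags $1$ through $k-1$ have been regressed out.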
We can visualize these through an ACF plot for autocorrelations and a PACF plot for partial autocorrelations.
To determine the autoregressive, $p$, component, we’ll look at a plot of the partial autocorrelations. Any spikes that go beyond the dashed lines imply significant partial autocorrelations at a .05 significance level. We should keep in mind that roughly 5% of spikes will cross those lines by chance alone, so some significant spikes are false positives. Let’s look at the PACF plot of the differenced time series.
pacf(diff(three.ts[,2]))
We can see we only have one semi-significant spike, at lag 13. This is most likely a false positive, as there’s nothing to indicate that three point attempts from that far back would be helpful in predicting the current season’s attempts. Because of this, I would suggest setting $p$ to 0. If we thought it wasn’t a false positive, we’d use each of the previous seasons’ three point attempts going back 13 years in our model.
We can use the ACF plot in a similar way to determine the number of previous error terms to use to predict the current value, $y_t$.
acf(diff(three.ts[,2]))
There doesn’t appear to be much of interest in this plot either; it may be best to go with 0 for the $q$ component.
Wait a second, you might be saying: there’s a huge spike at the first lag in the ACF plot. However, notice that the first spike in the ACF plot is at lag 0, which is just how correlated the most recent time point is with itself (that correlation is always 1). In comparison, the PACF plot starts at lag 1.
So we have an idea of a general model we can create: ARIMA(0, 1, 0). If we wanted to visualize this, it would look like:

$$y_t - y_{t-1} = \varepsilon_t$$

Which can be rewritten as:

$$y_t = y_{t-1} + \varepsilon_t$$

This is a random walk. Our value is only dependent on the previous term and some error. Let’s use the Arima function from the forecast package to build this model.
library(forecast)
# fit a random walk: ARIMA(0,1,0) on the three point attempt totals
model1 <- Arima(three.ts[,2], order = c(0,1,0))
model1
## Series: three.ts[, 2]
## ARIMA(0,1,0)
##
## sigma^2 estimated as 25723953: log likelihood=-378.12
## AIC=758.23 AICc=758.34 BIC=759.87
forecast(model1, h = 5)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 40 71337 64837.13 77836.87 61396.30 81277.70
## 41 71337 62144.79 80529.21 57278.73 85395.27
## 42 71337 60078.89 82595.11 54119.20 88554.80
## 43 71337 58337.25 84336.75 51455.60 91218.40
## 44 71337 56802.84 85871.16 49108.92 93565.08
We can see that we get several measures of model fit. Let’s take a look at AIC; here it’s 758.23. We can also see that our forecasted values for the next 5 years are all the same. This is what happens when we forecast a random walk: the forecast for every future point is just the last observed value.
Looking at the plot of three point attempts, we know that it’s probably not the case that we’ll get the same number of attempts next year. It’s likely to increase. To adjust for this, we can include a drift term in the equation. A drift term is a constant added at each step, estimated as the average change between consecutive lags.
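In equation form, with drift $c$, the $h$-step-ahead forecast becomes the last observed value plus $h$ steps of drift:

$$\hat{y}_{T+h} = y_T + h \cdot c$$

which is why the forecasts below climb linearly instead of staying flat.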
model2 <- Arima(three.ts[,2], order = c(0,1,0), include.drift = TRUE)
model2
## Series: three.ts[, 2]
## ARIMA(0,1,0) with drift
##
## Coefficients:
## drift
## 1745.8684
## s.e. 772.4866
##
## sigma^2 estimated as 23288758: log likelihood=-375.72
## AIC=755.44 AICc=755.78 BIC=758.71
forecast(model2, h = 5)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 40 73082.87 66898.30 79267.43 63624.39 82541.35
## 41 74828.74 66082.44 83575.03 61452.43 88205.04
## 42 76574.61 65862.62 87286.59 60192.04 92957.17
## 43 78320.47 65951.34 90689.61 59403.52 97237.43
## 44 80066.34 66237.23 93895.45 58916.54 101216.14
plot(forecast(model2, h = 5))
Notice now that we are adding a constant value of 1745.87 to each prediction. That seems to better capture the general trend we’ve seen in the time series. We’ve also decreased the AIC, indicating a better model fit.
Now, although we’ve gone through some general rules for determining which parameters to put into our ARIMA model, it’s not an exact science. We may be able to get better results by playing around with different parameters, whether or not they’re obvious in the ACF and PACF plots.
The forecast package has a handy auto.arima function that uses a stepwise procedure to determine the optimal ARIMA model parameters. Let’s try it on our data.
model3 <- auto.arima(three.ts[,2])
model3
## Series: three.ts[, 2]
## ARIMA(0,1,0) with drift
##
## Coefficients:
## drift
## 1745.8684
## s.e. 772.4866
##
## sigma^2 estimated as 23288758: log likelihood=-375.72
## AIC=755.44 AICc=755.78 BIC=758.71
This method seems to indicate that the random walk with drift was the ideal model. I would suggest being careful with this procedure and just making sure that the model it chooses at least makes some sense!
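One quick sanity check for whatever model we land on is to look at its residuals; if the model has captured the structure in the series, they should look roughly like white noise. The forecast package’s checkresiduals() function makes this easy:
# residual diagnostics: time plot, ACF, and a Ljung-Box test;
# residuals that look like white noise suggest an adequate fit
checkresiduals(model3)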
Conclusion
With this post, we have a general idea of how to use ARIMA models to make predictions on time series data. ARIMA can be very helpful, but there are other models out there that are more general or that handle patterns ARIMA doesn’t. We’ll probably get into those in the future, but for now, play around with AR, MA, and ARIMA processes!