Back to Basics: Measuring Spread and Correlation
Up till now, we’ve gone over a lot of different topics, some complicated and some not so much. In this tutorial, I want to go over a collection of more basic ideas that may not require their own posts, but are helpful to look back on and understand better. Yes, putting on an air of knowledge is fun and makes us look smart (and in some circles “cool”), but actually knowing what we’re talking about gives us even MORE credibility. Take off the temporary mask of faux-know-it-all-ism and get the permanent facial reconstruction of actual know-it-all-ism in a series I like to call “Back to Basics”.
Data
The data we’ll be using today are player weights and heights from the 2014 season. This info was gathered by Simon Warchol and generously posted on GitHub. We’ll grab it right from his repo and put it into R.
library(tidyverse)
# Grabbing the csv directly from Simon's Github page
player_info <- read_csv("https://raw.githubusercontent.com/simonwarchol/NBA-Height-Weight/master/CSVs/Yearly/2014.csv") %>%
  rename(Height = `Height (Inches)`)
Measuring Spread
One thing that gets brought up a lot about data is how spread out it is. We have a couple of ways to measure this, but the most common are variance and standard deviation. The variance can be represented as

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$

while the standard deviation is just $\sigma = \sqrt{\sigma^2}$.
The variance is the average squared difference from the mean value. So if we were interested in the variance of player weights, we would find the difference of each weight from the average and square it, then average all of those together.
In this case, we are working with a sample from a larger population of basketball players, so instead of averaging over the entire dataset, $n$, we average over $n - 1$:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$

By subtracting 1, we remove the bias associated with calculating a metric like variance on a sample rather than the entire population.
# Calculating variance of weight by hand
weight_squared_diff <- (player_info$Weight - mean(player_info$Weight))^2
sum(weight_squared_diff) / (nrow(player_info) - 1)
## [1] 689.1512
# Using base R
var(player_info$Weight)
## [1] 689.1512
By default, var() calculates the sample variance rather than the population variance (instead of just averaging the squared differences, it sums them up and then divides by the total number of observations minus 1).
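To make the distinction concrete, here’s a quick sketch comparing the two denominators, reusing the weight_squared_diff vector from above (the population version isn’t built into base R, so we compute it by hand):

# Population variance: divide by n (biased low for a sample)
n <- nrow(player_info)
sum(weight_squared_diff) / n

# Sample variance: divide by n - 1 (identical to var())
sum(weight_squared_diff) / (n - 1)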
The standard deviation is just the square root of the variance. It is a bit easier to interpret because it puts the measure of spread back in the same units as the original data.
# Standard deviation by hand
sqrt(sum(weight_squared_diff) / (nrow(player_info) - 1))
## [1] 26.25169
# Using base R
sd(player_info$Weight)
## [1] 26.25169
Correlation
We can see whether the relationship between two variables is positive or negative by calculating the covariance. The sample covariance equation looks like this:

$$\text{cov}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
We first find the difference between each value of both $x$ and $y$ and their respective means. We then multiply each pairing and sum all of the products. Finally, we divide by the total number of observations minus 1.
Let’s see how we’d compute the covariance between player weight and height in R.
# Calculating covariance manually
sum((player_info$Weight - mean(player_info$Weight)) * (player_info$Height - mean(player_info$Height))) / (nrow(player_info) - 1)
## [1] 73.38047
# Using base R
cov(player_info$Weight, player_info$Height)
## [1] 73.38047
We can tell from the output that there is a positive relationship between player weight and height.
When most people talk about correlation coefficient values, they mean the Pearson correlation. Let’s take a look at the formula to get an understanding of what exactly is being calculated:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Hmm… This looks pretty familiar… It’s actually equivalent to this:

$$r = \frac{\text{cov}(x, y)}{s_x s_y}$$
The correlation is actually just the covariance of $x$ and $y$ divided by the product of the standard deviations of $x$ and $y$. Correlation is really just a normalized covariance; the covariance can tell us whether two variables are positively or negatively related, but we can’t gain too much info about the strength of the relationship from that number alone. By dividing by the product of the standard deviations, we put the covariance on a fixed scale (-1 to 1) so we can gauge the strength of the relationship. Correlation tells us how the two variables move together.
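One quick way to see why the raw covariance is hard to gauge is that it changes with the units of the data, while the correlation does not. A small sketch (the 0.4536 factor converts pounds to kilograms):

# The covariance shrinks when weight is measured in kilograms instead of pounds...
cov(player_info$Weight * 0.4536, player_info$Height)

# ...but the correlation is unit-free and stays exactly the same
cor(player_info$Weight * 0.4536, player_info$Height)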
Let’s go back to our player weight and height info. A covariance value of 73 doesn’t tell us much beyond the fact that heavier players tend to be taller. What scale is that number on? Is it large? Calculating the correlation coefficient will help tell us.
# Calculating correlation manually
cov(player_info$Weight, player_info$Height) / (sd(player_info$Weight) * sd(player_info$Height))
## [1] 0.8078669
# Using base R
cor(player_info$Weight, player_info$Height)
## [1] 0.8078669
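We can also reproduce the same number straight from the raw sums in the Pearson formula above, as a sanity check that the two forms really are equivalent:

# Pearson correlation computed directly from the raw sums
x <- player_info$Weight
y <- player_info$Height
sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
## [1] 0.8078669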
Now, the Pearson correlation coefficient does have an assumption of normality. If the variables we were comparing did not come from a normal distribution, we could calculate a different correlation coefficient. For example, the Spearman correlation coefficient follows the same formula, but is applied to the ranks of the values rather than the values themselves. You can specify different methods using the method argument in cor().
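Since Spearman’s coefficient is just Pearson’s formula applied to ranks, we can check that for ourselves. A quick sketch (rank() uses average ranks for ties, which matches what cor() does internally):

# Spearman correlation via the method argument
cor(player_info$Weight, player_info$Height, method = "spearman")
## [1] 0.8309678

# The same thing: Pearson correlation on the ranked values
cor(rank(player_info$Weight), rank(player_info$Height))
## [1] 0.8309678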
We can also run a correlation test to check whether the correlation between two variables is significantly different from 0. The cor.test() function can be used to run this test in R.
cor.test(player_info$Weight, player_info$Height)
##
## Pearson's product-moment correlation
##
## data: player_info$Weight and player_info$Height
## t = 30.031, df = 480, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7744309 0.8368026
## sample estimates:
## cor
## 0.8078669
Here we can see that our p-value is below .05, indicating a significant linear relationship between player weight and height. Our estimated correlation is again about 0.81, and the 95% confidence interval runs from roughly 0.77 to 0.84, which does not include 0.
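As a side note, cor.test() returns a list-like htest object, so if we ever need these numbers programmatically we can pull them out directly (a minimal sketch; weight_height_test is just a name for illustration):

# Store the test result to grab individual pieces
weight_height_test <- cor.test(player_info$Weight, player_info$Height)
weight_height_test$p.value   # the p-value
weight_height_test$conf.int  # the 95% confidence interval
weight_height_test$estimate  # the correlation estimate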
Pearson’s correlation test does have assumptions of linearity and normality. If we didn’t meet those, though, we could simply change the method argument in cor.test() to a non-parametric method, such as Spearman’s. In our case, although the variables appear to be linearly related, neither passes the Shapiro-Wilk test for normality. We can re-run cor.test() using a non-parametric method instead.
# Checking normality with a Shapiro-Wilk test
shapiro.test(player_info$Weight)
##
## Shapiro-Wilk normality test
##
## data: player_info$Weight
## W = 0.98832, p-value = 0.0006742
shapiro.test(player_info$Height)
##
## Shapiro-Wilk normality test
##
## data: player_info$Height
## W = 0.97054, p-value = 2.901e-08
# Checking the linearity assumption graphically
player_info %>%
  ggplot(aes(Weight, Height)) +
  geom_point()
cor.test(player_info$Weight, player_info$Height, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: player_info$Weight and player_info$Height
## S = 3154695, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8309678
It appears that the non-parametric test still indicates a significant (monotonic) relationship between player weight and height.
Conclusion
Today we took a step back and looked at some more basic ideas. It’s always helpful to go back and make sure we know the simple things, even when we have a head full of advanced topics!