Getting Confident About Confidence Intervals
In most statistical research, we take a sample of data from a larger population to analyze. This allows us to come to conclusions that are representative of our population faster and at lower cost. Confidence intervals provide a range around a sample estimate that likely contains the actual population parameter. In this post, we’ll dive into how we can use and properly explain confidence intervals.
Sample vs. Population
When working with data, we generally use a representative sample of a larger population to perform analysis. Performing analysis on an entire population can be extremely costly, and in some cases, impossible.
A good example of the difficulties of working with a whole population is polling. If we wanted to get the public’s opinion on the MJ vs. LeBron debate, we wouldn’t seek out an answer from every basketball fan in the world. We would take a representative sample and collect their answers. This sample estimate could be used to come to a conclusion about the population within some margin of error.
Confidence intervals provide an upper and lower bound to our sample estimate, allowing us to represent the uncertainty that comes with the sample estimate.
How to Interpret Confidence Intervals
There are a lot of misconceptions about what confidence intervals represent. They can be a bit confusing when we think of them in a single analysis, but they make a lot more sense when we think of repeatedly sampling from a population.
Over many repeated samples, the share of confidence intervals that contain the true population value approaches the confidence level. So if our confidence level was 95%, we’d expect that out of 100 independent samples, around 95 would have a confidence interval that contains the true population value.
If looking at a single event rather than a long-term run of samples, we could say that we are x% confident the interval contains the true population value. Note that we would not attach a probability to this statement (i.e. there is x% probability); the event has happened and the confidence interval either contains the population value or it does not.
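The long-run reading can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population is a made-up set of 10,000 normally distributed values (not the player data used later), and each interval comes from base R’s t.test, which returns a 95% interval for the mean by default.

```r
set.seed(42)

# Hypothetical population (an assumption for illustration only)
population <- rnorm(10000, mean = 50, sd = 10)
pop_mean <- mean(population)

n_reps <- 1000  # number of repeated samples
n <- 30         # size of each sample

# For each repeated sample, check whether its 95% interval covers pop_mean
covers <- replicate(n_reps, {
  x <- sample(population, n)
  ci <- t.test(x)$conf.int
  ci[1] <= pop_mean && pop_mean <= ci[2]
})

# Share of intervals that contain the true population mean
coverage_rate <- mean(covers)
coverage_rate
```

Under these assumptions the coverage rate should land close to 0.95, and any single interval in the run either covers the population mean or misses it.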
Exercise Run-Through
In this run-through, we’ll explore ages of players. Specifically, we’ll look at creating confidence intervals around the mean age of everyone who played in a given season. In this scenario, we’ll treat all NBA players in the 2021 season as our population.
To get a sense of how confidence intervals work, we’ll take samples from this population and calculate average player age within each sample, as well as confidence intervals for those sample means. In this case, we know the true population mean, so we can compare our confidence intervals against the population mean.
Data
We’ll pull player ages from Basketball Reference. We can use the player stats tables for the 2021 season to gather all players who actually played (our population). These tables also include player ages.
Here, we pull in the total stats table and do a bit of cleaning to get a tibble with a player and age column.
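library(rvest)
library(tidyverse)

# Scrape the 2021 totals table and keep one row per player
player_age <- read_html("https://www.basketball-reference.com/leagues/NBA_2021_totals.html") %>%
  html_node("#totals_stats") %>%
  html_table() %>%
  rename_all(.funs = ~tolower(.)) %>%
  select(player, age) %>%
  # Keep the first row for any player listed more than once
  distinct(player, .keep_all = TRUE) %>%
  # Drop header rows that are repeated inside the table body
  filter(player != "Player") %>%
  mutate(age = as.numeric(age))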
We can calculate the average age of all players who played during the 2021 season to get our population mean. Just a reminder that this is a toy example; most of the time we won’t be able to calculate population level statistics. If we were able to, confidence intervals wouldn’t be necessary!
# Population mean is around 25-26
mean(player_age$age)
## [1] 25.55556
Now, we’re going to create several independent samples from this player population. We’ll create 20 sample groups from this dataset. To do this, we’ll set up a random vector called sample_groups; to create this vector, we’ll sample from the values 1-20 as many times as there are rows in the player_age dataframe. We’ll then assign this field to the player_age dataframe. This allows every player to have a sample group value between 1 and 20.
# Setting seed so we can reproduce samples
set.seed(123)
sample_groups <- sample(1:20, nrow(player_age), replace = TRUE)
# Assigning sample groups vector to dataframe
sample_df <- player_age %>%
  mutate(sample = sample_groups)
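One side effect of assigning groups this way is that the group sizes are random rather than fixed. Here is a minimal sketch of that behavior, using a hypothetical population size of 500 (an assumption, not the actual number of rows in player_age):

```r
set.seed(123)

# Hypothetical population size; stands in for nrow(player_age)
n_players <- 500

# Same assignment approach as above: one random label in 1-20 per player
groups <- sample(1:20, n_players, replace = TRUE)

# Count players per group, keeping all 20 levels even if one were empty
group_sizes <- table(factor(groups, levels = 1:20))

# Smallest and largest group sizes
range(group_sizes)
```

Because the sizes differ, each sample gets its own standard error and its own degrees of freedom, which is why the sample size is tracked per group in the steps that follow.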
Now we have 20 samples of players and their respective ages. The formula for the confidence interval differs based on what parameter we’re estimating, but for the mean specifically, we’ll need the following:
- Sample mean
- Sample standard deviation
- Sample size
- Standard error of the mean
  - Measures how far the sample mean is likely to be from the population mean
  - Calculated as the sample standard deviation divided by the square root of the sample size
We’ll calculate each of the above values for every sample:
sample_summary <- sample_df %>%
  group_by(sample) %>%
  summarize(avg = mean(age),
            st_dev = sd(age),
            sample_size = n()) %>%
  mutate(st_error = st_dev / sqrt(sample_size))
With the standard error, we can calculate the margin of error, which is half the width of the confidence interval. To calculate the margin of error, we’ll multiply the standard error by the t-value corresponding to a 95% confidence level.
We’ll use the qt function to calculate the t-value. We need to specify our degrees of freedom (sample size - 1) and our probability (.975). We use .975 rather than .95 because a two-sided 95% interval leaves 2.5% in each tail.
Once we have the margin of error, we just need to add it to and subtract it from our sample average to get our confidence interval.
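To make the arithmetic concrete, here is a small sketch of the same calculation on a single sample. The ages are made-up values chosen only for illustration (they are not taken from the scraped data). The result is cross-checked against base R’s t.test, which computes the same t-based interval for a mean.

```r
# Hypothetical ages for one sample of 25 players (made-up values)
ages <- c(22, 31, 27, 24, 29, 23, 26, 34, 21, 25, 28, 30, 24,
          26, 23, 27, 32, 25, 22, 29, 26, 24, 28, 33, 25)

n <- length(ages)
st_error <- sd(ages) / sqrt(n)    # standard error of the mean
t_value <- qt(0.975, df = n - 1)  # about 2.06 for 24 degrees of freedom
margin <- t_value * st_error      # margin of error (half the interval width)

# Interval built by hand
manual_ci <- c(mean(ages) - margin, mean(ages) + margin)

# Interval from base R's one-sample t-test
builtin_ci <- as.numeric(t.test(ages, conf.level = 0.95)$conf.int)

# The two approaches should agree up to floating point tolerance
all.equal(manual_ci, builtin_ci)
```

Building the interval by hand keeps every ingredient visible. In everyday work, t.test is a convenient way to get the same bounds in one call.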
sample_ci <- sample_summary %>%
  mutate(error = qt(.975, df = sample_size - 1) * st_error) %>%
  mutate(low_ci = avg - error,
         high_ci = avg + error) %>%
  select(sample, avg, low_ci, high_ci)
Now we have a 95% confidence interval for each of our samples. We’d expect that around 95% of our samples’ confidence intervals will include the population mean. Let’s find out what percentage of our confidence intervals contain the population mean:
sample_ci %>%
  mutate(contain_pop = ifelse(low_ci <= mean(player_age$age) &
                                high_ci >= mean(player_age$age), 1, 0)) %>%
  summarize(contain_pop_rate = mean(contain_pop))
## # A tibble: 1 x 1
## contain_pop_rate
## <dbl>
## 1 0.9
So 90% of our confidence intervals include the population mean. This is a little less than expected, but we’re dealing with only 20 samples. Over the long run, we’d expect to see this rate shift up to 95%.
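A quick way to see that 18 out of 20 is not surprising is to treat the number of covering intervals as a binomial count. This is a rough sketch, assuming the 20 intervals are independent and each has exactly a 95% chance of covering the population mean. Those assumptions hold only approximately here, since the groups are carved out of one finite population.

```r
n_intervals <- 20         # number of confidence intervals
nominal_coverage <- 0.95  # assumed per-interval coverage probability

# Probability that 18 or fewer of the 20 intervals cover the population mean
p_18_or_fewer <- pbinom(18, size = n_intervals, prob = nominal_coverage)
round(p_18_or_fewer, 3)
```

Under these assumptions there is roughly a one-in-four chance of seeing a coverage rate of 90% or lower with only 20 intervals, so the result above is consistent with a 95% confidence level.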
With this tutorial, we now have a good understanding of what confidence intervals truly represent. They’re easy to misinterpret, but when we look at them in a sampling / population context, their meaning becomes much clearer.