Getting Confident About Confidence Intervals

Written on March 9, 2022

In most statistical research, we take a sample of data from a larger population to analyze. This allows us to come to conclusions that are representative of our population faster and at lower cost. Confidence intervals provide a range around a sample estimate that likely contains the actual population parameter. In this post, we’ll dive into how we can use and properly explain confidence intervals.

Sample vs. Population

When working with data, we generally use a representative sample of a larger population to perform analysis. Performing analysis on an entire population can be extremely costly, and in some cases, impossible.

A good example of the difficulty of working with a population is polling. If we wanted to get the public’s opinion on the MJ vs. LeBron debate, we wouldn’t seek out an answer from every basketball fan in the world. We would take a representative sample and collect their answers. This sample estimate could be used to come to a conclusion about the population within some margin of error.

Confidence intervals provide an upper and lower bound to our sample estimate, allowing us to represent the uncertainty that comes with the sample estimate.

How to Interpret Confidence Intervals

There are a lot of misconceptions about what confidence intervals represent. They can be a bit confusing when we just think of them in a single analysis, but they make a lot more sense when we think of sampling from a population.

Over many repeated samples, the share of confidence intervals that contain the true population value matches the confidence level. So if our confidence level was 95%, we’d expect that out of 100 independent samples, around 95 would produce a confidence interval that contains the true population value.
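To make the repeated-sampling idea concrete, here is a minimal simulation sketch. It assumes a hypothetical normally distributed population with a known mean of 50; the names true_mean, n_sims, and n are made up for illustration and are not part of the NBA exercise later in this post.

```r
# Hypothetical population: normal with a known mean of 50 and sd of 10
set.seed(42)
true_mean <- 50
n_sims <- 10000   # number of repeated samples
n <- 30           # observations per sample

# For each repeated sample, build a 95% interval and check whether it covers the true mean
covers <- replicate(n_sims, {
  x <- rnorm(n, mean = true_mean, sd = 10)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Share of intervals that contain the true mean; this should land close to 0.95
coverage <- mean(covers)
coverage
```

Any single interval either covers the true mean or it doesn't; the 95% describes how often the procedure succeeds across all of the repeated samples.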

If looking at a single event rather than a long-term run of samples, we could say that we are x% confident the interval contains the true population value. Note that we would not attach a probability to this statement (i.e. there is x% probability); the event has happened and the confidence interval either contains the population value or it does not.

Exercise Run-Through

In this run-through, we’ll explore the ages of NBA players. Specifically, we’ll look at creating confidence intervals around the mean age of everyone who played in a given season. In this scenario, we’ll treat all NBA players in the 2021 season as our population.

To get a sense of how confidence intervals work, we’ll take samples from this population and calculate average player age within each sample, as well as confidence intervals for those sample means. In this case, we know the true population mean, so we can compare our confidence intervals against the population mean.

Data

We’ll pull player ages from Basketball Reference. We can use the player stats tables for the 2021 season to gather all players who actually played (our population). These tables also include player ages.

Here, we pull in the total stats table and do a bit of cleaning to get a tibble with a player and age column.

library(rvest)
library(tidyverse)

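player_age <- read_html("https://www.basketball-reference.com/leagues/NBA_2021_totals.html") %>%
  # Grab the totals table by its element id and parse it into a tibble
  html_node("#totals_stats") %>%
  html_table() %>%
  # Lowercase the column names
  rename_with(tolower) %>%
  select(player, age) %>%
  # Keep one row per player (traded players appear more than once)
  distinct(player, .keep_all = TRUE) %>%
  # Drop the header rows that repeat inside the table body
  filter(player != "Player") %>%
  mutate(age = as.numeric(age))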

We can calculate the average age of all players who played during the 2021 season to get our population mean. Just a reminder that this is a toy example; most of the time we won’t be able to calculate population level statistics. If we were able to, confidence intervals wouldn’t be necessary!

# Population mean is around 25-26
mean(player_age$age)
## [1] 25.55556

Now, we’re going to split this player population into 20 samples. To do this, we’ll build a vector called sample_groups by drawing from the values 1-20, with replacement, once for each row in the player_age dataframe. We’ll then add this vector to the player_age dataframe as a column, so every player gets a sample group value between 1 and 20. Because each player lands in exactly one group, the samples are disjoint random subsets of the population rather than fully independent draws, which is close enough for this demonstration.

# Setting seed so we can reproduce samples
set.seed(123)
sample_groups <- sample(1:20, nrow(player_age), replace = TRUE)

# Assigning sample groups vector to dataframe
sample_df <- player_age %>%
  mutate(sample = sample_groups)

Now we have 20 samples of players and their respective ages. The formula for the confidence interval differs based on what parameter we’re estimating, but for the mean specifically, we’ll need the following:

  • Sample mean
  • Sample standard deviation
  • Sample size
  • Standard error of the mean
    • Measures how far the sample mean is likely to be from the population mean
    • Calculated as the sample standard deviation divided by the square root of the sample size

We’ll calculate each of the above values for every sample:

sample_summary <- sample_df %>%
  group_by(sample) %>%
  summarize(avg = mean(age),
            st_dev = sd(age),
            sample_size = n()) %>%
  mutate(st_error = st_dev / sqrt(sample_size))

With the standard error, we can calculate the margin of error, which is half the width of the confidence interval. To calculate the margin of error, we’ll multiply the standard error by the t-value corresponding to a 95% confidence level.

We’ll use the qt function to calculate the t-value. We need to specify our degrees of freedom (sample size - 1) and our probability (.975, since a two-sided 95% interval leaves 2.5% in each tail). Once we have the margin of error, we just need to add it to and subtract it from our sample average to get our confidence interval.
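Written out, the interval described above looks like this. The symbols are notation introduced here for illustration: x-bar is the sample mean, s the sample standard deviation, n the sample size, and t is the .975 quantile of the t-distribution with n - 1 degrees of freedom.

```latex
\[
\bar{x} \;\pm\; t_{0.975,\,n-1} \cdot \frac{s}{\sqrt{n}}
\]
```

The fraction is the standard error of the mean, and the whole term after the plus/minus sign is the margin of error.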

sample_ci <- sample_summary %>%
  mutate(error = qt(.975, df = sample_size - 1) * st_error) %>%
  mutate(low_ci = avg - error,
         high_ci = avg + error) %>%
  select(sample, avg, low_ci, high_ci)
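As a sanity check on the manual calculation, we can compare it against base R's t.test, which reports a confidence interval for the mean of a single sample. This is a small sketch on a made-up vector of ages (the values below are hypothetical, not pulled from the scraped data):

```r
# Hypothetical ages for a single sample (not taken from the scraped data)
ages <- c(22, 24, 25, 27, 28, 30, 31, 23, 26, 29)

# Manual 95% interval, following the same steps as above
n <- length(ages)
st_error <- sd(ages) / sqrt(n)
error <- qt(.975, df = n - 1) * st_error
manual_ci <- c(mean(ages) - error, mean(ages) + error)

# Interval reported by the one-sample t-test; as.numeric drops the conf.level attribute
builtin_ci <- as.numeric(t.test(ages, conf.level = 0.95)$conf.int)

# The two intervals should agree up to floating point tolerance
all.equal(manual_ci, builtin_ci)
```

Doing the calculation by hand keeps every step visible, while t.test is a handy shortcut once the pieces make sense.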

Now we have a 95% confidence interval for each of our samples. We’d expect that around 95% of our samples’ confidence intervals will include the population mean. Let’s find out what percentage of our confidence intervals contain the population mean:

sample_ci %>%
  mutate(contain_pop = ifelse(low_ci <= mean(player_age$age) &
                                high_ci >= mean(player_age$age), 1, 0)) %>%
  summarize(contain_pop_rate = mean(contain_pop))
## # A tibble: 1 x 1
##   contain_pop_rate
##              <dbl>
## 1              0.9

So 90% of our confidence intervals (18 of 20) include the population mean. This is a little less than expected, but we’re dealing with only 20 samples. Over the long run, we’d expect this rate to settle close to 95%.
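To get a rough sense of how unusual a 90% rate is with only 20 intervals, we can lean on the binomial distribution. This sketch assumes each interval independently covers the population mean with probability 0.95, which is only approximately true for our samples; the variable names are made up for illustration.

```r
# Assumption: each of the 20 intervals independently covers the population mean
# with probability 0.95 (only approximately true for the samples above)
n_intervals <- 20
coverage <- 0.95

# Probability that 18 or fewer of the 20 intervals contain the population mean
p_18_or_fewer <- pbinom(18, size = n_intervals, prob = coverage)
p_18_or_fewer
```

Under those assumptions the probability comes out to roughly 0.26, so seeing 18 or fewer of 20 intervals cover the mean is not surprising.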

With this tutorial, we now have a good understanding of what confidence intervals truly represent. They’re easy to misinterpret, but when we look at them in a sampling / population context, we can get a better understanding of their meaning.