Power Analysis
Statistical models are a lot like us: in order to become stronger, they have to first bulk up. While my daily diet of handfuls of birthday cake flavored protein powder hasn’t yet paid off in the strength department, it’s no mystery what a model needs to get stronger: data.
Power analyses can tell us how much data we need to identify an effect of a certain size in an inferential model. Come with me as we judge the proverbial statistical model Mr. Universe contest.
Understanding Power
In hypothesis testing, there are two types of errors:
- Type 1: We reject the null hypothesis when it is actually true (false positive)
- Type 2: We fail to reject the null hypothesis when it is actually false (false negative)
We account for type 1 error by setting a designated alpha that we compare our p-value to; this value is usually .05. Remember what the p-value represents: it’s the probability of seeing a given result (or something more extreme) if the null hypothesis were true. Getting a p-value below .05 doesn’t prove that the null is false; it just means our result would be very unlikely if the null were true. The researcher can adjust the alpha value as they see fit based on how much they care about type 1 error.
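As a quick aside (not part of the original analysis), a small simulation makes alpha concrete: if we repeatedly run a t-test on data where the null really is true, roughly 5% of those tests will still reject it at an alpha of .05, and that 5% is exactly the type 1 error rate we signed up for.
#simulate 10,000 t-tests on data where the null (a true mean of 0) holds
set.seed(1)
p.values <- replicate(10000, t.test(rnorm(30, mean = 0), mu = 0)$p.value)
#the share of p-values below .05 is the empirical type 1 error rate (~5%)
mean(p.values < .05)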
So what do we have in place to prevent type 2 error? That’s where power comes into play.
If alpha is the type 1 error rate, then beta is the type 2 error rate. Power is simply 1 - beta. Generally, an acceptable power value is .8. So at this baseline value, we’d fail to reject the null when the alternative is true 20% of the time (not great, but false negatives are usually seen as less troublesome than false positives).
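We can make power just as concrete with a short simulation sketch (again, an optional aside): generate data where the alternative really is true and count how often the test rejects the null. Using a true effect of d = .5 and 27 observations per test (numbers we’ll arrive at later in this lesson), the rejection rate lands right around .8.
#simulate 10,000 one-sided t-tests where the true mean is 0.5 and sd is 1 (d = .5)
set.seed(1)
rejected <- replicate(10000,
                      t.test(rnorm(27, mean = 0.5, sd = 1), mu = 0,
                             alternative = "greater")$p.value < .05)
#the share of tests that reject the null is the empirical power (roughly .8)
mean(rejected)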
When we’re looking at a power analysis, there are four numbers that are all interacting:
- Effect Size
- Power (1-beta)
- Alpha
- Sample Size
How do these parameters affect power? In general, the larger the parameter, the larger the power:
- Larger effect sizes lead to increased power; it’s easier to identify larger effect sizes than smaller ones.
- Larger values of alpha lead to increased power; a less stringent threshold for declaring a result significant means we’re less likely to miss a real effect.
- Larger sample sizes lead to increased power; more samples lead to more precise distributions of the test statistic, creating more distinction between the actual and hypothesized test statistic distribution.
With a power analysis, we need to know three of these parameters to calculate the fourth. The best way to increase power is to increase the sample size, but for most research that can get very expensive. Rather than sampling a huge amount and paying through the nose for it, you can use a power analysis to calculate the minimum sample size you’d need to reach an expected amount of power.
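Here’s a minimal sketch of that “know three, solve for the fourth” idea, using the pwr package we’ll set up properly in a moment: whichever of the four values you leave out of pwr.t.test() is the one it solves for (the numbers here are just placeholders).
library(pwr)
#fix n, effect size, and alpha; power is left unspecified, so pwr.t.test() solves for it
pwr.t.test(n = 40, d = .5, sig.level = .05, type = "two.sample")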
Now here’s a warning that a lot of people don’t really pay attention to: try not to perform a post-hoc power analysis, where you compute the power of a test after it has already been run (Daniël Lakens has a good blog post on the subject). A researcher who is always looking for a significant result can take an experiment that has low power and use it as an excuse for not finding a significant outcome. Instead, use power as a set baseline value (again, .8 is a generally accepted value) to determine what sample size is necessary to capture the minimum effect size you or your stakeholders deem practically significant.
Data
We’re going to go back in time and pull up the Derrick Rose and LeBron James game score data from the 2011 season. We used this data back in the t-test lesson, and guess what… we’ll be using it for more t-tests! This time, though, we’ll be using a power analysis in our study.
library(tidyverse)
rose <- tibble(Player = "Rose",
GmSc = c(14, 26.4, 14.7, 22.2,
9.7, 8, 23.2, 19.7, 27.6, 19, 14.9, 21.5, 20.7,
24.5, 6, 12.6, 26.3, 7.9, 21.4, 19.3, 20.7, 13.8,
7.3, 26.4, 26.2, 15.6, 16.1, 14.9, 18.1, 16.7,
21.4, 17.2, 9.1, 22, 24.4, 25.9, 9.9, 23.2, 26.8,
22.1, 20.2, 16.6, 18.9, 18.5, 17.9, 16.5, 30.9, 5,
22, 21, 16.9, 13.6, 36.4, 23.9, 13.3, 9.9, 19.4,
4.2, 13.8, 16.7, 19, 20.8, 23.7, 16.2, 17, 13.5,
24, 16.1, 28, 14.4, 32.5, 16.6, 22.8, 23.1, 32.6,
12.9, 27.7, 6.7, 31.6, 15.9, 8.6))
james <- tibble(Player = "James",
GmSc = c(16, 9.6, 10, 15.6, 23.7,
20.6, 19.4, 22.8, 29, 17.6, 20.6, 20.6, 21.7,
15.5, 17, 18, 11.7, 19.8, 12, 33.7, 19.7, 12.5,
28.6, 21.7, 19, 11.1, 14.5, 27, 24.4, 11.2, 26.8,
33.3, 12.9, 17.6, 23.1, 28.8, 24, 20.1, 35.1, 17.4,
20.5, 30.1, 15, 34, 25.6, 22.8, 46.7, 17.2, 5.3,
39.3, 16.9, 14.5, 20.3, 17.7, 17.9, 25.5, 27, 19.7,
26.2, 20.2, 24, 26.5, 17.8, 23.8, 16.5, 12.7, 32.2,
20.5, 14.9, 28.2, 28.6, 21.9, 34.6, 26.1, 26.6, 22,
24.7, 23.2, 26.4))
Power Analysis in R
Let’s first look again at whether Derrick Rose’s average game score was significantly higher than a made-up league average of 15. This is a one-sided, one-sample t-test: we are only interested in whether Rose’s game scores were better, not worse, than the league average.
We’ll use the pwr package to do our power calculations. Install it first and then bring it up with library().
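If you haven’t used it before, the usual install-then-load pattern looks like this (the install only needs to happen once):
#one-time install; after that, library() is all you need each session
install.packages("pwr")
library(pwr)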
Let’s think about how large of a sample of game scores we would need for this experiment. To calculate it, we need: effect size, power, and alpha. Power and alpha can be kept at their general values of .8 and .05. But how do we determine effect size?
The pwr package has conventional effect sizes for different tests, laid out by Jacob Cohen, along with an explanation of the size of each effect. We can access these with the cohen.ES function. We should note that effect sizes aren’t one-size-fits-all; it usually takes some discussion with stakeholders to identify what size effect is practically significant.
library(pwr)
#we specify the type of test as t and the effect size as medium
cohen.ES(test = "t", size = "medium")
##
## Conventional effect size from Cohen (1982)
##
## test = t
## size = medium
## effect.size = 0.5
We can see that the conventional medium effect size for a t-test is 0.5. Knowing this, we can calculate the required sample size to capture a medium effect with power of .8 and alpha of .05 using the pwr.t.test function.
pwr.t.test(d = .5, sig.level = .05, power = .8, type = "one.sample",
alternative = "greater")
##
## One-sample t test power calculation
##
## n = 26.13753
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = greater
In this function, d represents effect size, sig.level is alpha, and power is… well, power. The type of t-test is also specified by type and alternative. We can see that we’d need about 27 games to identify a medium effect size. Knowing we have 81 games, let’s see what the smallest effect size we’d be able to identify is.
pwr.t.test(n = 81, sig.level = .05, power = .8, type = "one.sample",
alternative = "greater")
##
## One-sample t test power calculation
##
## n = 81
## d = 0.2786405
## sig.level = 0.05
## power = 0.8
## alternative = greater
With a power of .8, we’d be able to identify an effect size of about .28. This is slightly larger than Cohen’s predefined small effect size of .2.
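To see why falling a bit short of that small-effect benchmark matters, we can flip the question around (a quick aside, not part of the original analysis): with our 81 games, a true effect as small as Cohen’s .2 would be detected well under 80% of the time.
#same test setup, but solving for power at a true effect size of .2
pwr.t.test(n = 81, d = .2, sig.level = .05, type = "one.sample",
           alternative = "greater")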
Knowing this, let’s perform the t-test.
t.test(rose$GmSc, mu = 15, alternative = "greater")
##
## One Sample t-test
##
## data: rose$GmSc
## t = 4.8825, df = 80, p-value = 2.633e-06
## alternative hypothesis: true mean is greater than 15
## 95 percent confidence interval:
## 17.4552 Inf
## sample estimates:
## mean of x
## 18.72469
We get a p-value below .05, indicating that Rose’s average game score was significantly greater than 15.
Now you might be wondering: what exactly was the effect size here? There isn’t one direct formula to calculate it, as it differs between tests. For t-tests, though, it’s Cohen’s d:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$$
The numerator is pretty simple, but the denominator is a bit more involved: it’s the pooled standard deviation, a single standard deviation estimate across multiple groups. In the case of a one-sample t-test, the denominator is just the standard deviation of the one sample, so we can calculate our effect size like so:
(mean(rose$GmSc) - 15) / sd(rose$GmSc)
## [1] 0.5425054
So we saw roughly a medium effect size. Based on the power analysis we ran before the t-test, we know an effect of this size can be detected with relatively high power.
Let’s look at the two-sample example, comparing LeBron to D-Rose.
pwr.t.test(d = .5, sig.level = .05, power = .8, type = "two.sample")
##
## Two-sample t test power calculation
##
## n = 63.76561
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
We’d need about 64 games in each group to detect a medium effect size at our target power, and we have more than that from both players.
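As one more aside (not in the original lesson), the two groups aren’t exactly the same size: 81 Rose games versus 79 James games. The pwr package’s pwr.t2n.test() handles unequal group sizes directly; leaving d out solves for the smallest effect detectable at 80% power.
#n1 and n2 can differ; d is left unspecified, so it gets solved for
pwr.t2n.test(n1 = length(rose$GmSc), n2 = length(james$GmSc),
             sig.level = .05, power = .8)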
t.test(james$GmSc, rose$GmSc, var.equal = T)
##
## Two Sample t-test
##
## data: james$GmSc and rose$GmSc
## t = 2.6911, df = 158, p-value = 0.007888
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.8017541 5.2248125
## sample estimates:
## mean of x mean of y
## 21.73797 18.72469
We can see that we got a significant result; LeBron’s average game score was significantly higher than Rose’s. We can estimate the effect size manually, or use the cohen.d function from the effsize package. Here I show both methods:
#manual method; dividing difference in means by pooled standard deviation
diff.in.means <- (mean(james$GmSc) - mean(rose$GmSc))
pooled.sd <- sqrt((((length(james$GmSc)-1)*var(james$GmSc)) +
((length(rose$GmSc)-1)*var(rose$GmSc))) /
(length(james$GmSc)+length(rose$GmSc)-2))
diff.in.means / pooled.sd
## [1] 0.4255382
#using cohen.d(); specify pooled = T
library(effsize)
cohen.d(james$GmSc, rose$GmSc, pooled = T)
##
## Cohen's d
##
## d estimate: 0.4255382 (small)
## 95 percent confidence interval:
## lower upper
## 0.1097100 0.7413664
This effect size is slightly lower than the medium effect size we had put into the sample size calculation prior to running the test.
Post-test, power is pretty much irrelevant. It represents the probability of an event that already happened: the test correctly rejecting the null when it is false. It doesn’t really matter how this smaller effect size impacted power or sample size; the results were significant.
With an understanding of power, we can now account for both type 1 and type 2 errors. Doing so helps us both adequately plan and analyze research. A lot of researchers focus only on type 1 error, but understanding the pieces tied to type 2 error, such as effect size and sample size, can help us design better, more reliable experiments.