Chapter 31 Power and Sample Size
31.1 Introduction
cfcdae includes three functions of use here: power.f.test(), sample.size.f.test(), and sample.size.t.interval(); there is also a sample.size.bayes.interval() in bcfcdae. You won’t be surprised to learn that the first computes power for an F test and the last two compute sample sizes, for F tests and t-intervals respectively. There is a builtin power.anova.test(), but I prefer the way that power.f.test() parameterizes things.
31.2 Sample size for t intervals
The sample.size.t.interval() function computes the minimum sample size such that the t-interval for some contrast \(\sum_i w_i \hat{\alpha}_i\) is no wider than some amount. The arguments are the desired width (twice the margin of error), the error variance, the sum of the squared \(w_i\) values (\(\sum_i w_i^2\)), the number of groups \(g\), and the type I error rate (.05 by default).
Suppose that you wanted to compare two political poll results on the approval of the president on inauguration day and the approval one year later. We will compute a 95% confidence interval on the difference of the two rates, and we want it to be no longer than .02 (margin of error 1%). The worst case variance for a binomial is when p=.5 for a variance of .25 (and in a polarized USA, approval of .5 might not be too far off). The sum of the squared coefficients in the contrast is 2, and there are two groups.
Computing the sample size below, we find that we would need to survey a ton of people at each time point. This is an order of magnitude more than most political surveys.
> sample.size.t.interval(.02,.25,2,2,.05)
[1] 19209
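To see what is behind that number: assuming the function uses the standard equal-\(n\) half-width \(t_{1-\alpha/2,\,g(n-1)} \sqrt{\sigma^2 \sum_i w_i^2/n}\) (an assumption about its internals, but consistent with the output), we can check that 19209 is the first \(n\) whose interval width drops below .02.
> n <- 19209
> 2*qt(.975,2*(n-1))*sqrt(.25*2/n)     # just under .02
> 2*qt(.975,2*(n-2))*sqrt(.25*2/(n-1)) # n = 19208 is just over .02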
If you could settle for a 3% margin of error you get a somewhat more sane sample size.
> sample.size.t.interval(.06,.25,2,2)
[1] 2136
The squared ratio of our requirements is \((.06/.02)^2=9\), and the ratio of our two sample sizes (19209/2136) is also 9 (well, 8.993).
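A quick arithmetic check of that claim:
> (.06/.02)^2
[1] 9
> 19209/2136
[1] 8.992978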
Most political polls are in the 1,500 range for sample size.
31.2.1 Example 7.2 VOR in Ataxia Patients
We want the 95% confidence interval for the pairwise difference to be no longer than .5, with an assumed error variance of .075. Pairwise differences have \(\sum_i w_i^2 = 2\), and there are \(g=2\) groups.
> sample.size.t.interval(.5,.075,2,2)
[1] 11
If the error variance were instead as large as .15, the required sample size roughly doubles.
> sample.size.t.interval(.5,.15,2,2)
[1] 20
31.3 Power and Sample Size for ANOVA
In order to compute power you need to be able to compute noncentrality, which depends on means, sample sizes, and the error variance. In the power function, you can specify the noncentrality in two ways: you can either give the means, the sample sizes, and the error variance, or you can give the noncentrality, the numerator degrees of freedom, and the denominator degrees of freedom. You also need to provide the type I error rate, which defaults to .05.
The value for power.f.test() is simply the power for the parameters in the arguments.
Sample size is computed to obtain a particular power. In this case you
must specify the power that you wish to obtain and the type I error rate.
You also have to help it out with how to compute noncentrality. You
do this by providing either treatment means and error variance or
“n=1” noncentrality and number of groups.
The value for sample.size.f.test() is a list with the required sample sizes and the achieved power.
When finding sample size with the means specified, you can also provide an argument called propvec, which is shorthand for proportionality vector. The propvec should have the same number of entries as there are means. If you provide a propvec, then the computed sample sizes will be roughly proportional to the entries of propvec.
The noncentrality parameter is \(\sum_{i=1}^g n_i \alpha_i^2/\sigma^2\), where the \(\alpha_i\)s here use the restriction that \(\sum_i n_i\alpha_i = 0\). The \(n=1\) “noncentrality parameter” is \(\sum_{i=1}^g \alpha_i^2/\sigma^2\).
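For example, for the means used repeatedly below, with equal \(n_i\) the \(\alpha_i\) are just the means minus their average, and the \(n=1\) noncentrality works out to 4:
> mns <- c(6,6,5,4,4,5)
> alphas <- mns - mean(mns) # effects: 1 1 0 -1 -1 0
> sum(alphas^2)/1           # n=1 noncentrality with sigma^2 = 1
[1] 4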
31.3.1 Lots of little examples
OK, suppose that we have six groups, and we think that an interesting alternative would be two with mean 4, two with mean 5, and two with mean 6. We think that the error variance is 1, and we’re testing at the .05 level. What is the power? About .55.
> power.f.test(means=c(6,6,5,4,4,5),ns=rep(3,6),sigma2=1,alpha=.05)
[1] 0.5529867
You can add or subtract a constant from all the means, and it won’t change the power. It’s only the differences between means that matter.
> power.f.test(means=c(1,1,0,-1,-1,0),ns=rep(3,6),sigma2=1,alpha=.05)
[1] 0.5529867
If the sample sizes are all the same, you can just use the common value instead of a vector of sizes.
> power.f.test(means=c(1,1,0,-1,-1,0),ns=3,sigma2=1,alpha=.05)
[1] 0.5529867
Smaller sample size means less power.
> power.f.test(means=c(1,1,0,-1,-1,0),ns=2,sigma2=1,alpha=.05)
[1] 0.2715161
Larger sample size means more power.
> power.f.test(means=c(1,1,0,-1,-1,0),ns=4,sigma2=1,alpha=.05)
[1] 0.7640361
Compute the noncentrality parameter for the sample size. Then use that and the degrees of freedom instead of means and sample sizes. You get the same power.
> (4*1^2+4*1^2+4*0+4*(-1)^2+4*(-1)^2+4*0)/1 # sample size 4
[1] 16
> power.f.test(ncp=16,df1=5,df2=18,alpha=.05) # 18 = 6*(4-1) or 24-6
[1] 0.7640361
If you make the type I error rate smaller, then you make the type II error rate bigger, and thus the power smaller.
> power.f.test(means=c(6,6,5,4,4,5),ns=rep(3,6),sigma2=1,alpha=.01)
[1] 0.2561233
A bigger error variance decreases power (makes the noncentrality parameter smaller).
> power.f.test(means=c(1,1,0,-1,-1,0),ns=3,sigma2=2,alpha=.05)
[1] 0.2888767
In the next example we have the same total sample size (18) as three in every group, but we put more replications on the treatments in the extremes and fewer on those in the middle. More data on means in the extremes increases power more than more data on those in the center. Of course, you have to know where the extremes are to use this in practice, and that is rather unrealistic.
> power.f.test(means=c(6,6,5,4,4,5),ns=c(4,3,2,4,3,2),sigma2=1,alpha=.05)
[1] 0.6283721
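As a check, this is just the noncentrality formula from above, with the \(\alpha_i\) taken around the weighted mean so that \(\sum_i n_i\alpha_i = 0\):
> ns <- c(4,3,2,4,3,2); mns <- c(6,6,5,4,4,5)
> alphas <- mns - weighted.mean(mns,ns) # sum(ns*alphas) is 0
> sum(ns*alphas^2)/1                    # noncentrality; df2 = 18 - 6 = 12
[1] 14
> power.f.test(ncp=14,df1=5,df2=12,alpha=.05) # should reproduce the power above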
Here we find the smallest sample size that will give us at least power .4 when testing at the .05 level with the means and error variance as specified. It shows us the \(n_i\)s and the achieved power. Note:
- We can’t hit the requested power spot on; we’ve overshot. Above we saw that n=2 gave us power .27, so we need n=3.
- No one is going to design for power .4; we are using it for illustration to match our earlier findings.
> sample.size.f.test(.4,.05,means=c(6,6,5,4,4,5),sigma2=1)
$nis
[1] 3 3 3 3 3 3
$power
[1] 0.5529867
Alternatively, we can specify the noncentrality parameter for n=1 and the number of groups.
> sample.size.f.test(.4,.05,ncp1=4,ngrps=6)
$nis
[1] 3 3 3 3 3 3
$power
[1] 0.5529867
Asking for more power means that you’ll need more data.
> sample.size.f.test(.9,.05,ncp1=4,ngrps=6)
$nis
[1] 6 6 6 6 6 6
$power
[1] 0.9520672
Testing at a lower type I error rate means that you’ll need more data to keep the same power.
> sample.size.f.test(.9,.01,ncp1=4,ngrps=6)
$nis
[1] 7 7 7 7 7 7
$power
[1] 0.9112071
Larger error variance (which gives you a smaller noncentrality parameter) means you need more data to get the required power.
> sample.size.f.test(.9,.01,means=c(6,6,5,4,4,5),sigma2=2)
$nis
[1] 13 13 13 13 13 13
$power
[1] 0.9220537
Often you can’t say what you think all the treatment effects will be. If you can give a value D, such that any treatment differences of D or greater are “interesting,” then the smallest possible noncentrality with two treatments D units apart is \(nD^2/(2\sigma^2)\). Designing for this noncentrality gives us worst case power for differences of size D. So if we get the power we want for this worst case noncentrality, then we’ll get at least that much power for any other noncentrality from the same error variance and differences of at least D.
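To see where that worst case comes from, put the two treatments at \(\pm D/2\) and the remaining treatments at the overall mean, so that \[ \sum_{i=1}^g \alpha_i^2 = (D/2)^2 + (-D/2)^2 = \frac{D^2}{2} \quad \mbox{and the noncentrality is} \quad n\frac{D^2}{2\sigma^2} \;. \]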
Suppose we have a D of 2 (=6-4) and our error variance is still 1. The minimal n=1 noncentrality in this case is 2 (instead of the 4 we had in the examples above, which had more groups spread out). We now need at least 4 in each group to get at least power .4 (worst case, remember?).
> sample.size.f.test(.4,.05,ncp1=2^2/2/1,ngrps=6)
$nis
[1] 4 4 4 4 4 4
$power
[1] 0.4322055
Same thing but directly using means that show the difference D.
> sample.size.f.test(.4,.05,means=c(-1,1,0,0,0,0),sigma2=1)
$nis
[1] 4 4 4 4 4 4
$power
[1] 0.4322055
Finally, the sample size function has a limited ability to find unequal sample sizes using the propvec argument. For the sake of argument, assume that treatments 1 and 2 are more expensive than the other treatments, so in general we would like to use fewer replications there. In this example we ask for (roughly) twice as many replications in treatments 3 through 6.
> sample.size.f.test(.9,.01,means=c(6,6,5,4,4,5),sigma2=2,propvec=c(1,1,2,2,2,2))
$nis
[1] 9 9 17 17 17 17
$power
[1] 0.9047962
Here is a useful fact: for the same amount of noncentrality and total number of data, it is easier to detect a difference with fewer groups than with more groups. In some sense it is easier to find the difference if there is only one direction to look in.
> power.f.test(ncp=16,df1=1,df2=18,alpha=.05) # two groups of 10
[1] 0.9655275
> power.f.test(ncp=16,df1=3,df2=16,alpha=.05) # four groups of 5
[1] 0.8556421
> power.f.test(ncp=16,df1=4,df2=15,alpha=.05) # five groups of 4
[1] 0.791441
> power.f.test(ncp=16,df1=9,df2=10,alpha=.05) # ten groups of 2
[1] 0.4608024
It is sometimes useful to compute power for a sequence of sample sizes and then make a plot.
What should be obvious is that there are serious diminishing returns. Once the power gets pretty big, we can add a lot of additional sample size with very little actual increase in power. (However, standard errors for contrasts, estimated effects, etc. will continue to get smaller as sample size increases.)
> ntotry <- 2:20 # per-group sample sizes to evaluate
> power <- ntotry # preallocate a vector of the right length
> for(i in 1:length(ntotry)) {
+ power[i] <- power.f.test(means=c(6,6,5,4,4,5),ns=ntotry[i],sigma2=1,alpha=.05)
+ }
> plot(ntotry,power)

Figure 31.1: Power curve
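Because power.f.test() accepts vectors of noncentrality and degrees of freedom (a feature we lean on again in Section 31.5), the loop above can be replaced by a single vectorized call; a sketch, using the \(n=1\) noncentrality of 4 computed earlier:
> power2 <- power.f.test(ncp=4*ntotry,df1=5,df2=6*(ntotry-1),alpha=.05)
> plot(ntotry,power2) # same curve as Figure 31.1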
31.3.2 Example 7.3 VOR in Ataxia Patients
We have two potential sets of means that are of interest: (1, 1.2, 1.4) or (1, 1.4, 1.4). We think the error variance is about .075, but it might easily be as large as .15. We are testing at the .01 level and we want power at least .9.
> sample.size.f.test(.9,alpha=.01,means=c(1,1.4,1.4),sigma2=.075)
$nis
[1] 14 14 14
$power
[1] 0.904201
> sample.size.f.test(.9,alpha=.01,means=c(1,1.2,1.4),sigma2=.075)
$nis
[1] 18 18 18
$power
[1] 0.901666
> sample.size.f.test(.9,alpha=.01,means=c(1,1.4,1.4),sigma2=.15)
$nis
[1] 27 27 27
$power
[1] 0.9127837
> sample.size.f.test(.9,alpha=.01,means=c(1,1.2,1.4),sigma2=.15)
$nis
[1] 35 35 35
$power
[1] 0.9080998
The means (1, 1.2, 1.4) are less dispersed than the means (1, 1.4, 1.4), so we need more data to achieve power .9 for the former set than the latter set. Doubling the error variance (almost) doubles the required sample size.
31.4 Serious example
We want to find the effect of certain diets on the blood concentration of estradiol in premenopausal women. We have historical data on six subjects. For each woman, estradiol was measured twice in the midfollicular phase on diet 1, twice in the midfollicular phase on diet 2, twice in the early follicular phase on diet 1, and twice in the early follicular phase on diet 2. A cursory inspection of the data reveals that variance changes with mean. On the log scale, variance seems fairly constant at about .109 and does not seem to change across diet or cycle phase.
The response of interest is the post-diet minus pre-diet change in (log) blood estradiol. Being the difference of two measurements, each with variance about .109, this change should have a variance of about \(2 \times .109 = .218\).
We want 95% power to detect a change of 0.5 log units (roughly 65%), testing at the 0.05 level. We need 24 women per group.
> sample.size.f.test(.95,.05,ncp1=.5^2/2/.218,ngrps=2)
$nis
[1] 24 24
$power
[1] 0.9526565
If we were instead really interested in a larger change (.75 on the log scale, slightly more than doubling), we need fewer subjects.
> sample.size.f.test(.95,.05,ncp1=.75^2/2/.218,ngrps=2)
$nis
[1] 12 12
$power
[1] 0.9638958
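As a quick check on those percentage interpretations, exponentiate the log changes to get the multiplicative effects:
> exp(c(.5,.75)) # about a 65% increase and slightly more than doubling
[1] 1.648721 2.117000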
31.5 Power for a Contrast
The ANOVA F-test is an omnibus test of all possible departures from the null hypothesis that all treatment means are equal. In some situations, we might be interested in testing the null that a specific contrast is zero. The test in this situation is an F-test with 1 and N-g degrees of freedom and noncentrality parameter \[ \frac{(\sum_{i=1}^g w_i \alpha_i )^2}{\sigma^2 \sum_{i=1}^g w_i^2/n_i} \;. \] Assuming the sample sizes are all the same, the noncentrality is \[ n \frac{(\sum_{i=1}^g w_i \alpha_i )^2}{\sigma^2 \sum_{i=1}^g w_i^2} \;. \]
We cannot use sample.size.f.test() for this problem, because it is looking at the F-test with \(g-1\) numerator degrees of freedom. However, power.f.test() will still give us the correct power if we can tell it the correct noncentrality parameter and numerator (1) and denominator (\(N-g = g(n-1)\) for equal sample sizes) degrees of freedom. What you do is compute the noncentrality parameter for a bunch of different sample sizes \(n\) and find the smallest \(n\) that gives us the power we want.
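A small helper for the \(n=1\) contrast noncentrality makes this less error prone; it is not part of cfcdae, just a direct transcription of the formula above:
> ncp1.contrast <- function(w,mns,sigma2) sum(w*mns)^2/(sigma2*sum(w^2))
> ncp1.contrast(c(1,-.5,-.5),c(1,1.4,1.4),.075) # matches ncp1a below
[1] 1.422222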
31.5.1 Example 7.4 VOR in Ataxia Patients
We want power .95 for the contrast with coefficients (1, -.5, -.5) when testing at the .05 level. We have two sets of possible means to consider: (1, 1.2, 1.4) and (1, 1.4, 1.4). We think that the error variance is .075, but it could be as big as .15.
What we will do is choose a range of potential sample sizes \(n\), compute the power for each sample size, and find the smallest sample size that gives us the power we want. Note that changing \(n\) changes both the noncentrality and the error degrees of freedom.
First compute \[ \frac{(\sum_{i=1}^g w_i \alpha_i )^2}{\sigma^2 \sum_{i=1}^g w_i^2} \] for the different scenarios. Note that because we are using a contrast, we can use the means in the formula instead of the treatment effects.
> ncp1a <- (1*1 - .5*1.4 - .5*1.4)^2/.075/(1+.5^2+.5^2)
> ncp2a <- (1*1 - .5*1.2 - .5*1.4)^2/.075/(1+.5^2+.5^2)
> ncp1b <- (1*1 - .5*1.4 - .5*1.4)^2/.15/(1+.5^2+.5^2)
> ncp2b <- (1*1 - .5*1.2 - .5*1.4)^2/.15/(1+.5^2+.5^2)
> ncp1a;ncp2a;ncp1b;ncp2b
[1] 1.422222
[1] 0.8
[1] 0.7111111
[1] 0.4
Because these \(n=1\) noncentralities are decreasing, we should anticipate that the corresponding required sample sizes will increase.
Let’s investigate sample sizes from 5 to 50. For each scenario, get the power at all of the sample sizes. Fortunately, power.f.test() can take vectors of noncentrality and degrees of freedom!
> ns <- 5:50
> power1a <- power.f.test(ncp=ns*ncp1a,df1=1,df2=3*(ns-1),alpha=.05)
> power1b <- power.f.test(ncp=ns*ncp1b,df1=1,df2=3*(ns-1),alpha=.05)
> power2a <- power.f.test(ncp=ns*ncp2a,df1=1,df2=3*(ns-1),alpha=.05)
> power2b <- power.f.test(ncp=ns*ncp2b,df1=1,df2=3*(ns-1),alpha=.05)
> min(ns[power1a >= .95])
[1] 10
> min(ns[power1b >= .95])
[1] 19
> min(ns[power2a >= .95])
[1] 17
> min(ns[power2b >= .95])
[1] 34
In the most favorable situation, we only need 10 units per treatment, but when the means are less spread out and the error variance is doubled we need 34 units per treatment.
31.6 Sample Size for Bayesian Intervals
The function sample.size.bayes.interval() implements Adcock’s method for choosing a sample size. You must specify:
- The width of the interval;
- The prior expected value of the error variance;
- The prior degrees of freedom for the error variance;
- The equivalent sample size in the prior for the mean;
- The sum of squared coefficients for the contrast;
- The error rate for the interval (default .05).
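For reference, here is the call from the next subsection with each positional argument annotated; the mapping is inferred from that example, not from bcfcdae documentation:
> sample.size.bayes.interval(.5,   # interval width
+                            .075, # prior expected error variance
+                            14,   # prior degrees of freedom
+                            1,    # equivalent prior sample size for the mean
+                            2)    # sum of squared contrast coefficients
[1] 11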
31.6.1 Example 7.5 VOR in Ataxia Patients
We want a width of .5 with (average) coverage of .95 for a pairwise comparison (coefficients 1 and -1). Our prior information for the error variance is a mean of .075 with 14 degrees of freedom. Assume we have just a single unit’s worth of information in the prior for the mean.
> library(bcfcdae)
> sample.size.bayes.interval(.5,.075,14,1,2)
[1] 11
So we need 11 observations per treatment.
If we choose a tighter prior for the mean (more information in the prior), it just reduces the required sample size by an equivalent amount.
> sample.size.bayes.interval(.5,.075,14,3,2)
[1] 9
Similarly, more prior degrees of freedom for the error variance reduce the required sample size, and fewer increase it.
> sample.size.bayes.interval(.5,.075,24,1,2)
[1] 10
> sample.size.bayes.interval(.5,.075,4,1,2)
[1] 18
31.7 Simulating Bayes Factors
Nearly all Bayesian sample size problems require considerable computation. One approach is to simulate the situation at hand.
Suppose that we have three groups and will have n data values in each group. Model 1 is that all three groups have the same mean, and model 2 is that the groups can have different means. We would like the sample size to be big enough to have probability .95 that the Bayes factor of model 2 relative to model 1 will be at least 3 under various conditions.
Iteration \(i\) of the simulation has these steps:
- Choose the actual means under model 2. In this simulation, the model 2 means \(\mu_i\) are (0, .4, .4) plus a normal deviate with standard deviation \(\sigma_\mu\).
- Get \(n\) data values for each group as normal centered at \(\mu_i\) with standard deviation \(\sigma_{yi}\).
- Compute and record the Bayes factor for model 2 relative to model 1.
After completing all the iterations, determine the fraction of iterations that had a Bayes factor greater than 3; this fraction estimates the probability. Manipulate the sample sizes until you find one that is just large enough to meet the probability goal.
We wrap all of this in a function:
> library(BayesFactor)
> BFfrac <- function(K,nis,mu0,sigma.mu,sigma0,M=1000) {
+   # K is the Bayes factor minimum
+   # nis are the sample sizes per group
+   # mu0 is the vector of means for the groups in model 2
+   # sigma.mu is the sd of actual means around mu0
+   # sigma0[i] is the standard deviation of data around the mean
+   #   for iteration i
+   # M is the number of iterations
+   count <- 0                # initialize the count
+   g <- length(nis)          # number of groups
+   x <- factor(rep(1:g,nis)) # factor to indicate groups
+   N <- sum(nis)             # total sample size
+   if(length(sigma0) < M) {
+     sigma0 <- rep(sigma0,length=M) # need a sigma for each iteration
+   }
+   for(i in 1:M) {
+     mus <- rnorm(g,mu0,sigma.mu)         # get means
+     y <- rnorm(N,rep(mus,nis),sigma0[i]) # get data
+     fit <- lmBF(y~x,data.frame(x,y),progress=FALSE) # get BF
+     if(extractBF(fit)[1] > K) count <- count+1      # increment if > K
+   }
+   count/M # return fraction
+ }
Note that we are using the lmBF() function from package BayesFactor. That function is very fast and makes the time involved in the simulation bearable. Using bglmm() would be thousands of times slower.
> set.seed(10040) # for repeatability
> BFfrac(K=3,nis=rep(14,3),mu0=c(0,.4,.4),sigma.mu=.0001,sigma0=.075^.5)
[1] 0.938
> BFfrac(K=3,nis=rep(15,3),mu0=c(0,.4,.4),sigma.mu=.0001,sigma0=.075^.5)
[1] 0.96
Bearing in mind that these fractions themselves are random, it looks like a sample size of 15 would work.
If we are not quite so sure about the prior means, then the probability of the Bayes factor being greater than 3 decreases.
> BFfrac(K=3,nis=rep(15,3),mu0=c(0,.4,.4),sigma.mu=.1,sigma0=.075^.5)
[1] 0.898
The expression .075*14/rchisq(1000,14) gives us 1000 random error variances that are compatible with the information we have; these are draws from a scaled inverse chi-squared distribution with 14 degrees of freedom and scale .075. Using this sample of potential error variances further reduces the probability of having a sufficiently large Bayes factor, thus requiring a larger sample size to achieve the goal.
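(As an aside, we can sanity check that recipe by simulation; the theoretical mean of such draws is \(\nu s^2/(\nu-2) = 14 \times .075/12 \approx .0875\).)
> mean(.075*14/rchisq(1e5,14)) # should be near 14*.075/12 = 0.0875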
> BFfrac(K=3,nis=rep(15,3),mu0=c(0,.4,.4),sigma.mu=.1,sigma0=sqrt(.075*14/rchisq(1000,14)))
[1] 0.84
> BFfrac(K=3,nis=rep(31,3),mu0=c(0,.4,.4),sigma.mu=.1,sigma0=sqrt(.075*14/rchisq(1000,14)))
[1] 0.951