A colleague recently asked about computing sample size when comparing two group means. My initial response was to compute sample size based on the two-sample t-test. After a few probing questions, it became clear that the colleague did not expect the means to differ; demonstrating that they were similar was the aim of the study.

There are two primary methods for computing sample size: one is based on hypothesis testing and the other on confidence intervals. The goal of hypothesis testing is to demonstrate differences in means (or proportions, variances, etc.), which is the antithesis of what the colleague wanted. Sample size based on a confidence interval is a much better option here.

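The large-sample confidence interval for the difference in two population means $\mu_1$ and $\mu_2$ is

$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$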

where $z_{\alpha/2}$ is the quantile of a normal distribution corresponding to confidence level $(1 - \alpha)100\%$ and $\bar{x}_i$, $s_i^2$ and $n_i$ are the sample mean, sample variance and sample size respectively for group $i = 1, 2$.

The right-hand part of this expression is often termed the *margin of error* $E$, i.e.,

$$E = z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
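As a quick numerical check, the margin of error can be computed directly. This is a minimal sketch using only the Python standard library; the function names `margin_of_error` and `difference_ci`, and the input values in the usage lines, are illustrative assumptions rather than anything from the study.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(s1, n1, s2, n2, confidence=0.95):
    """Large-sample margin of error for the difference in two means.

    s1, s2 are sample standard deviations; n1, n2 are sample sizes.
    """
    alpha = 1 - confidence
    # Upper alpha/2 quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt(s1**2 / n1 + s2**2 / n2)


def difference_ci(xbar1, xbar2, s1, n1, s2, n2, confidence=0.95):
    """Confidence interval for mu_1 - mu_2 as (lower, upper)."""
    e = margin_of_error(s1, n1, s2, n2, confidence)
    diff = xbar1 - xbar2
    return diff - e, diff + e


# Hypothetical inputs, chosen only to exercise the functions.
print(margin_of_error(s1=1, n1=8, s2=1, n2=8))
print(difference_ci(10.0, 9.0, s1=1, n1=8, s2=1, n2=8))
```

The interval is symmetric about the observed difference, so its half-width is exactly the margin of error.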

This formula can be simplified if it’s reasonable to assume a common variance, i.e., $s^2 = s_1^2 = s_2^2$, and equal sample sizes, i.e., $n = n_1 = n_2$. The equation simplifies to

$$E = z_{\alpha/2} \, s \sqrt{\frac{2}{n}}$$

We can then solve for $n$ to get

$$n = 2 \left( \frac{z_{\alpha/2} \, s}{E} \right)^2$$
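The per-group sample size under the common-variance, equal-allocation assumptions, $n = 2 (z_{\alpha/2} s / E)^2$, is easy to wrap in a small helper. The sketch below uses only the standard library; the name `sample_size_per_group` and the example inputs are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(margin, sd, confidence=0.95):
    """Exact (unrounded) per-group n for a given margin of error.

    Assumes a common standard deviation `sd` and equal group sizes.
    """
    if margin <= 0 or sd <= 0:
        raise ValueError("margin and sd must be positive")
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (z * sd / margin) ** 2


# Hypothetical inputs: unit margin of error with sd = 1 and sd = 2.
for sd in (1, 2):
    n_exact = sample_size_per_group(margin=1, sd=sd)
    # Round up so the achieved margin is no larger than requested.
    print(sd, round(n_exact, 2), ceil(n_exact))
```

Because $n$ scales with $(s/E)^2$, doubling the standard deviation (or halving the margin of error) quadruples the required sample size.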

As an example, the sample size (per group) for a 95% confidence interval with a unit margin of error, $E = 1$, and standard deviation $s$ is

$$n = 2 \left( \frac{1.96 \, s}{1} \right)^2 \approx 7.68 \, s^2$$

In practice, this would be rounded up to the next whole number per group. It is common to choose 95% confidence (i.e., $\alpha = 0.05$), whereas the margin of error and standard deviation are context specific. One strategy for the margin of error is to choose the smallest value that represents a meaningful difference, so that any smaller value would be considered inconsequential. The choice of standard deviation can be informed by previous research.

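Another consideration would be loss to follow-up (if, say, the outcome were the difference between pre- and post-measurements). With, say, a 20% attrition rate, the sample size per group would be increased to $n / (1 - 0.2) = 1.25 \, n$, rounded up, so that roughly $n$ participants per group remain at the end of the study.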
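One common way to allow for attrition is to divide the required sample size by the expected retention proportion, so that about $n$ participants per group remain after dropout. The sketch below assumes that convention; the function name `adjust_for_attrition` and the starting value of 31 per group are hypothetical.

```python
from math import ceil


def adjust_for_attrition(n, attrition_rate):
    """Inflate a per-group sample size to allow for expected dropout.

    Divides by the expected retention proportion (1 - attrition_rate)
    and rounds up to a whole number of participants.
    """
    if not 0 <= attrition_rate < 1:
        raise ValueError("attrition_rate must be in [0, 1)")
    return ceil(n / (1 - attrition_rate))


# Hypothetical example: 31 per group needed, 20% expected attrition.
print(adjust_for_attrition(31, 0.20))  # 31 / 0.8 = 38.75, so 39
```

Multiplying by 1.2 instead would give a slightly smaller number (37.2, so 38 here) and would leave fewer than 31 participants on average after 20% dropout, which is why dividing by the retention proportion is used in this sketch.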

Of course, the computation gets far more complex, and possibly intractable, when the equal variance and sample size assumptions are not reasonable.
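One case that does appear to stay manageable is when the allocation ratio is fixed in advance. Writing $n_2 = k n_1$ (with $k$ an assumed, pre-specified ratio) and substituting into the margin of error gives $n_1 = (z_{\alpha/2} / E)^2 (s_1^2 + s_2^2 / k)$. The sketch below implements that special case only; the function name and inputs are illustrative, and it does not address the general problem of choosing both sample sizes freely.

```python
from math import ceil
from statistics import NormalDist


def sample_sizes_fixed_ratio(margin, s1, s2, ratio=1.0, confidence=0.95):
    """Group sizes (n1, n2) with n2 = ratio * n1 and unequal variances.

    `ratio` is an assumed, pre-specified allocation ratio n2 / n1.
    """
    if margin <= 0 or ratio <= 0:
        raise ValueError("margin and ratio must be positive")
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n1_exact = (z / margin) ** 2 * (s1**2 + s2**2 / ratio)
    # Round both groups up so the achieved margin does not exceed `margin`.
    return ceil(n1_exact), ceil(ratio * n1_exact)


# Hypothetical inputs: unit margin, sd of 2 and 3, equal allocation.
print(sample_sizes_fixed_ratio(margin=1, s1=2, s2=3, ratio=1))
# Twice as many participants in group 2, common sd of 1.
print(sample_sizes_fixed_ratio(margin=1, s1=1, s2=1, ratio=2))
```

With `ratio=1` and `s1 == s2`, this reduces to the equal-variance, equal-size formula used earlier.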