The case–control design balances practical efficiency against statistical precision.^{1} To achieve the most favorable balance between these two, the sampling of the controls in case–control studies requires much attention.

Suppose one is interested in designing an unmatched case–control study to estimate an odds ratio for the relation between an exposure and an outcome with as much precision as possible. There is only a limited budget for a fixed number of measurements. Given that the total number of measurements (i.e., the sum of the number of cases and controls) is fixed, one is interested in finding the ratio of cases-to-controls that minimizes the expected variance of the (logarithm of the) odds ratio. We will provide a simple but effective approach for finding this optimal case-to-control ratio. As we will demonstrate, a balanced design (case-to-control ratio of 1) is typically not optimal. First, we will show the generally known derivation of the benefit of an increasing number of controls when the number of cases is fixed. Second, we will derive the statistically optimal case-to-control ratio when the total number of measurements is fixed.

NOTATION
We consider an unmatched case–control study of a binary exposure. Let a , b , c , and d denote the number of exposed cases, exposed controls, unexposed cases, and unexposed controls in a case–control study, respectively. Define n _{1} = a + c as the number of cases and n _{0} = b + d as the number of controls in the study and n = n _{1} + n _{0} as the total number of subjects in the study. Furthermore, let p _{1} = a/n _{1} and p _{0} = b/n _{0} denote the prevalence of exposure among the cases and controls, respectively, and q _{1} = 1 − p _{1} and q _{0} = 1 − p _{0} . The effect of exposure is expressed by the odds ratio (ψ), ψ = (p _{1} /q _{1} )/(p _{0} /q _{0} ). The estimator of ψ is asymptotically normally distributed on the natural log scale, with a variance that, by large sample approximation, is given by: var(log[ψ]) = 1/a + 1/b + 1/c + 1/d .^{2} ^{,} ^{3} Because a = n _{1} p _{1} , c = n _{1} q _{1} , b = n _{0} p _{0} , and d = n _{0} q _{0} , the variance can alternatively be expressed as follows:

$$\operatorname{var}(\log[\psi]) = \frac{1}{n_1 p_1 q_1} + \frac{1}{n_0 p_0 q_0} \qquad (1)$$

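One can also express the number of controls as a multiple (k ) of the number of cases: n _{0} = k n _{1} . In that case, the variance of the log(ψ) can be expressed as follows:

$$\operatorname{var}(\log[\psi]) = \frac{1}{n_1}\left(\frac{1}{p_1 q_1} + \frac{1}{k\, p_0 q_0}\right) \qquad (2)$$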

Further, the variance of log(ψ) is proportional to (k + 1)/k if p _{1} = p _{0} (and hence q _{1} = q _{0} ), i.e., when the odds ratio equals 1.^{4}
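As a rough numerical check of the notation above, the following Python sketch computes the odds ratio and the large-sample variance of its logarithm from a 2×2 table, once from the four cell counts and once from the group sizes and exposure prevalences. The function names and the cell counts are hypothetical and chosen for illustration only.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from exposed cases (a), exposed controls (b),
    unexposed cases (c), and unexposed controls (d)."""
    return (a * d) / (b * c)


def var_log_or_counts(a, b, c, d):
    """Large-sample variance of log(OR) from the four cell counts."""
    return 1 / a + 1 / b + 1 / c + 1 / d


def var_log_or_prevalence(n1, n0, p1, p0):
    """Same variance, written in terms of group sizes and exposure prevalences:
    1/(n1*p1*q1) + 1/(n0*p0*q0)."""
    return 1 / (n1 * p1 * (1 - p1)) + 1 / (n0 * p0 * (1 - p0))


# Hypothetical table (illustrative numbers only): 100 cases, 200 controls.
a, b, c, d = 20, 20, 80, 180
n1, n0 = a + c, b + d      # number of cases and controls
p1, p0 = a / n1, b / n0    # exposure prevalence among cases and controls

psi = odds_ratio(a, b, c, d)
v_counts = var_log_or_counts(a, b, c, d)
v_prev = var_log_or_prevalence(n1, n0, p1, p0)

print(f"odds ratio = {psi:.2f}")
print(f"var(log OR) from counts      = {v_counts:.6f}")
print(f"var(log OR) from prevalences = {v_prev:.6f}")
```

Both variance expressions should agree up to floating-point error, since they are algebraic rearrangements of one another.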

SAMPLING CONTROLS WHEN THE NUMBER OF CASES IS FIXED
Now consider a case–control study in which the number of cases (n _{1} ) is fixed (e.g., because only so many patients present at the single hospital where the study is conducted within an acceptable time frame), yet the number of controls (n _{0} ) to be sampled is a choice to be made by the researcher. We assume that the goal of this exercise is to optimize precision (i.e., minimize the variance) of the estimated log(ψ). From equation 2, it can be seen that, given that the number of cases (n _{1} ) is fixed, the variance decreases monotonically, but nonlinearly, as the number of controls (n _{0} ) increases. This principle has also been demonstrated by others (e.g., see Refs. ^{4} ^{,} ^{5} ).

Figure 1 depicts the relationship between the number of controls per case and the variance of the estimated log(ψ) for different values of the odds ratio. The variance tends to the horizontal asymptote:

$$\operatorname{var}(\log[\psi]) \to \frac{1}{n_1 p_1 q_1} \quad \text{as } k \to \infty.$$

Hence, from an optimal precision point of view, a larger number of controls per case is to be preferred. A practical optimum may lie around five controls per case. That is, the gain in precision due to collecting information on more than five controls per case may not outweigh the additional costs of gathering more data.^{6} Previous work considered the situation in which the number of cases is fixed, while the optimal number of controls per case is to be determined, given a constraint on the total costs.^{7} ^{,} ^{8} In that case, the costs for sampling may differ between cases and controls, or may differ between strata of confounders. Notably, when stratification is needed (e.g., to control for confounding), when the primary interest lies in an interaction effect of the exposure rather than the main effect of exposure, or in case of a matched design,^{9} the required number of controls per case may be substantially higher. In the following, we assume equal costs for sampling cases and controls, and only consider the sum of the number of cases and controls fixed.
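The diminishing return of adding controls can be illustrated with a short Python sketch. It evaluates the variance of log(ψ) for a fixed number of cases and several values of k, and compares each value with the limit reached for infinitely many controls. The number of cases (100), the chosen values of k, and the helper names are assumptions made for illustration; p _{0} = 0.05 and ψ = 2 mirror values used elsewhere in the text.

```python
def p1_from_or(p0, psi):
    """Exposure prevalence among cases implied by p0 and the odds ratio psi."""
    odds1 = psi * p0 / (1 - p0)
    return odds1 / (1 + odds1)


def var_log_or(n1, k, p1, p0):
    """Large-sample variance of log(OR) with n1 cases and k controls per case:
    (1/n1) * (1/(p1*q1) + 1/(k*p0*q0))."""
    return (1 / n1) * (1 / (p1 * (1 - p1)) + 1 / (k * p0 * (1 - p0)))


n1 = 100          # assumed fixed number of cases (illustrative)
p0, psi = 0.05, 2.0
p1 = p1_from_or(p0, psi)

# Limit of the variance as the number of controls per case grows without bound.
asymptote = 1 / (n1 * p1 * (1 - p1))

ks = (1, 2, 3, 4, 5, 10, 20)
variances = {k: var_log_or(n1, k, p1, p0) for k in ks}

for k, v in variances.items():
    print(f"k={k:2d}  var={v:.4f}  ratio to asymptote={v / asymptote:.2f}")
```

The printed ratio shows how far each design remains from the best variance attainable with the given cases; the point at which a further gain is no longer worth its cost depends on the setting.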

FIGURE 1: Relationship between the number of controls per case and the variance of the estimated log(ψ), when the number of cases in a case–control study is fixed. The variance of log(ψ) not only depends on the value of ψ but also on the prevalence of exposure among the controls (p _{0} ). The figures show the relation between number of controls per case and the variance of the estimated log(ψ) for (A) p _{0} = 0.01 and (B) p _{0} = 0.05.

SAMPLING CASES AND CONTROLS WHEN THE TOTAL NUMBER OF MEASUREMENTS IS FIXED
Consider a case–control study in which the number of measurements (n ) is fixed (e.g., because the research budget only allows a certain number of measurements to be made), yet the ratio of cases-to-controls to be sampled is a choice to be made by the researcher. Again, the goal is to minimize the variance of the log(ψ) (i.e., to optimize precision). Let the proportion of cases among all measurements in a case–control study be denoted by R : R = n _{1} /n , so that n _{1} = nR and n _{0} = n (1 − R ). The variance of log(ψ) (equation 1) can be re-expressed in terms of n and R as follows:

$$\operatorname{var}(\log[\psi]) = \frac{1}{n}\left(\frac{1}{R\, p_1 q_1} + \frac{1}{(1 - R)\, p_0 q_0}\right) \qquad (3)$$

In the case of no effect of exposure on the outcome (i.e., when p _{1} = p _{0} , ψ = 1), it can be shown that var(log[ψ]) is proportional to 1/(R [1 − R ]). To minimize this variance, one can take the derivative of this expression with respect to R , set it equal to zero, and solve for R , leading to an optimal ratio of cases-to-controls of 1 (R = 0.5).^{8}
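The minimization for the null case can be written out explicitly. In the sketch below, f (R ) is shorthand introduced here (it does not appear in the original notation) for the factor of the variance that depends on R when p _{1} = p _{0} :

```latex
\begin{align*}
f(R) &= \frac{1}{R(1 - R)}, \qquad 0 < R < 1, \\
\frac{df}{dR} &= \frac{2R - 1}{R^{2}(1 - R)^{2}}, \\
\frac{df}{dR} = 0 \;&\Longleftrightarrow\; R = \tfrac{1}{2}.
\end{align*}
```

The derivative is negative for R < 1/2 and positive for R > 1/2, so R = 1/2 is a minimum rather than a maximum.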

More generally, for any values of p _{1} and p _{0} , taking the derivative of expression (3) with respect to R , setting it equal to zero, and solving for R shows that the optimal proportion of cases among all subjects in a case–control study is given by:

$$R = \frac{\sqrt{p_0 q_0}}{\sqrt{p_0 q_0} + \sqrt{p_1 q_1}} \qquad (4)$$

The optimal ratio of cases-to-controls thus depends on the prevalence of exposure among the controls (p _{0} ) as well as the prevalence of exposure among the cases (p _{1} ), or alternatively, the optimal ratio of cases-to-controls depends on p _{0} and ψ. For example, when the prevalence of the exposure among the controls (p _{0} ) is 0.05, while ψ = 2 (meaning p _{1} = 0.095), R = 0.426 and the optimal ratio of cases-to-controls is 0.426:0.574, or 1:1.35. This means that if, e.g., n = 200, the number of cases and controls that minimize the variance of the log(ψ) would be 85 and 115, respectively. Consequently, if ψ ≠ 1, the optimal ratio of cases-to-controls, in terms of the minimal variance of the estimated log(ψ), is not 1:1, which is illustrated in Figure 2 .
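The worked example above can be reproduced with a few lines of Python. The sketch below evaluates a closed-form expression for the optimal proportion of cases, written in one algebraically convenient form obtained by setting the derivative of the variance with respect to R to zero, and cross-checks it against a brute-force search over a grid of R values. Function names and the grid resolution are assumptions made for illustration.

```python
import math


def p1_from_or(p0, psi):
    """Exposure prevalence among cases implied by p0 and the odds ratio psi."""
    odds1 = psi * p0 / (1 - p0)
    return odds1 / (1 + odds1)


def optimal_case_fraction(p1, p0):
    """Proportion of cases R that minimizes var(log OR) for fixed n:
    sqrt(p0*q0) / (sqrt(p0*q0) + sqrt(p1*q1))."""
    s1 = math.sqrt(p1 * (1 - p1))
    s0 = math.sqrt(p0 * (1 - p0))
    return s0 / (s0 + s1)


def var_log_or(n, R, p1, p0):
    """Large-sample variance of log(OR) with n subjects, of whom a fraction R are cases."""
    return (1 / n) * (1 / (R * p1 * (1 - p1)) + 1 / ((1 - R) * p0 * (1 - p0)))


p0, psi, n = 0.05, 2.0, 200
p1 = p1_from_or(p0, psi)

R_opt = optimal_case_fraction(p1, p0)
n_cases = round(n * R_opt)
n_controls = n - n_cases

# Brute-force check: search R on a grid of 0.001, 0.002, ..., 0.999.
grid = [i / 1000 for i in range(1, 1000)]
R_grid = min(grid, key=lambda R: var_log_or(n, R, p1, p0))

print(f"p1 = {p1:.3f}, optimal R = {R_opt:.3f}, grid search R = {R_grid:.3f}")
print(f"cases = {n_cases}, controls = {n_controls}")
```

Rounding the optimal allocation to whole subjects is a simplification; with small n one may want to compare the variance at the neighbouring integer allocations directly.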

FIGURE 2: Relationship between the optimal proportion of cases, leading to the smallest variance of the estimated log(ψ) when the sum of the number of cases and controls in a case–control study is fixed.

We note that equation 4 also holds if ψ < 1, in which case p _{1} and p _{0} need to be interchanged (as well as q _{1} and q _{0} ) and R denotes the proportion of controls among all measurements in a case–control study.

DISCUSSION
The case–control design balances statistical efficiency with practical efficiency. Given a fixed number of measurements (or a fixed sum of the number of cases and controls) in a case–control study, 1:1 sampling of cases and controls is often not optimal in terms of statistical precision. This is especially true for situations where the exposure is rare and for stronger relationships between the exposure and the outcome under study.

This phenomenon has been described before in the context of diagnostic accuracy studies, where the measure of association is the area under the receiver operating characteristic (ROC) curve.^{10} Literature about sampling in case–control studies, in which the primary interest is in estimating the odds ratio for the effect of a certain exposure on a certain outcome, focuses on the situation in which the number of cases is given, while an optimal (in terms of precision, or in terms of costs) number of controls is to be determined.^{4} ^{,} ^{7} ^{,} ^{8} In the context of randomized trials, the allocation to the experimental and the control treatment is often in a 1:1 ratio. Literature on imbalanced designs indicates that there too, a 1:1 ratio may not be optimal in terms of statistical precision or costs.^{11} ^{,} ^{12} We did not explicitly take costs into account; yet implicitly these are considered fixed, if we assume the costs of sampling cases and controls to be the same.

The formulae presented here do not take confounding adjustment or matching into account, in which case optimal sampling of cases-to-controls may be different.^{6} ^{,} ^{9} Also, equations and conclusions presented here rely on large sample approximation. Hence, when conducting a case–control study of small size, the variance of log(ψ) may not be well approximated.

When designing a case–control study of a binary exposure, in which the expected odds ratio deviates from the null and the sum of the number of cases and controls is fixed, 1:1 sampling of cases and controls may yield suboptimal statistical efficiency. The equations presented here may provide researchers with a more efficient sampling strategy of cases and controls.

ACKNOWLEDGMENTS
We would like to thank Professor J. P. Vandenbroucke for fruitful discussion and his comments on an earlier version of this manuscript. We gratefully acknowledge financial contribution from the Netherlands Organisation for Scientific Research.

REFERENCES
1. Vandenbroucke JP, Pearce N. Case-control studies: basic concepts. Int J Epidemiol. 2012;41:1480–1489.

2. Kirkwood BR, Sterne JAC. Comparing two proportions. In: Essential Medical Statistics. 2nd ed. Oxford: Wiley-Blackwell; 2003:164.

3. Greenland S, Rothman KJ. Introduction to categorical statistics. In: Rothman KJ, Greenland S, Lash TL, eds. Modern Epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008:249.

4. Breslow NE, Day NE. Statistical methods in cancer research. Volume I - The analysis of case–control studies. IARC Sci Publ. 1980:5–338.

5. Mittleman MA, Maclure M, Robins JM. Control sampling strategies for case-crossover studies: an assessment of relative efficiency. Am J Epidemiol. 1995;142:91–98.

6. Hennessy S, Bilker WB, Berlin JA, Strom BL. Factors influencing the optimal control-to-case ratio in matched case-control studies. Am J Epidemiol. 1999;149:195–197.

7. Ury HK. Efficiency of case-control studies with multiple controls per case: continuous or dichotomous data. Biometrics. 1975;31:643–649.

8. Nam JM, Fears TR. Optimum allocation of samples in strata-matching case-control studies when cost per sample differs from stratum to stratum. Stat Med. 1990;9:1475–1483.

9. Dupont WD. Power calculations for matched case-control studies. Biometrics. 1988;44:1157–1168.

10. Janes H, Pepe M. The optimal ratio of cases to controls for estimating the classification accuracy of a biomarker. Biostatistics. 2006;7:456–468.

11. Nam JM. Optimum sample sizes for the comparison of the control and treatment. Biometrics. 1973;29:101–108.

12. Liu X. Statistical power and optimum sample allocation ratio for treatment and control having unequal costs per unit of randomization. J Edu Behav Stats. 2003;28:231–248.