From the Department of Preventive Medicine, University of Southern California, Los Angeles, CA.
Editors' note: This series addresses topics that affect epidemiologists across a range of specialties. Commentaries start as invited talks at symposia organized by the Editors. This paper was presented at the 2009 Society for Epidemiologic Research Annual Meeting in Anaheim, CA.
Editors' note: Related articles appear on pages 13 and 3.
Correspondence: Bryan Langholz, Keck School of Medicine, University of Southern California, 1540 Alcazar St, CHP-220, Los Angeles, CA 90033. E-mail: email@example.com.
Many epidemiologists and statisticians believe that the odds ratio is the only measure that can be reliably estimated from case-control studies. I have received many reviews of case-control study papers instructing us to change “rate ratio” to “odds ratio” when, in fact, it was the rate ratio that we had estimated. Further, in discussions about methods we have developed to estimate absolute risk measures from nested case-control studies, I have found that many are surprised to learn that absolute risk can be estimated systematically and reliably in the case-control setting. While case-control studies are best suited for estimation of relative measures, and I do not wish to minimize the challenges of estimation on other scales from case-control data generally, there are reliable methods for doing so. Reporting of absolute risk estimates seems desirable to supplement the usual case-control relative measure analyses,1 but this is only rarely done even when it is feasible.
So why is there an odds-ratio fixation? I believe the core problem is that epidemiologists often think of case-control studies as a “retrospective model.” According to this view, we start with a set of cases and controls. Then the covariates (exposures and other factors) occur as independent realizations with distribution dependent on disease status. This is in contrast to the “prospective model” of cohort data, in which we start with a group of subjects with given covariates, and disease status is the result of independent realizations with probability dependent on the covariate values. Students learn that the odds ratio parameters in the retrospective logistic model are the same as the odds ratio parameters in the corresponding prospective logistic model, but that other measures do not translate. The matter is further confused because valid estimation of odds ratio parameters from retrospective model case-control data may be obtained using the corresponding prospective cohort data logistic regression, with the estimated “baseline odds” a nuisance parameter to be ignored.2,3
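The textbook observation that the odds ratio carries over while other measures do not, when the sample is analyzed as if it were a cohort, can be checked with simple arithmetic. The following is a minimal Python sketch using a hypothetical 2×2 cohort table and the expected counts obtained by sampling all cases and 10% of non-cases; the counts, the sampling fraction, and the function names are illustrative assumptions and are not taken from any study cited here.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio for a 2x2 table.

    a = exposed cases,   b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    return (a * d) / (b * c)


def ratio_of_proportions(a, b, c, d):
    """Ratio of case proportions, treating each exposure row as a cohort group."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical cohort (assumed numbers, for illustration only).
cohort = (100, 900, 50, 950)

# Assumed design: keep every case, sample 10% of non-cases.
control_fraction = 0.10
a, b, c, d = cohort
sample = (a, b * control_fraction, c, d * control_fraction)  # expected counts

cohort_or = odds_ratio(*cohort)
sample_or = odds_ratio(*sample)
cohort_rr = ratio_of_proportions(*cohort)
naive_sample_rr = ratio_of_proportions(*sample)

print("odds ratio: cohort", cohort_or, "sample", sample_or)
print("risk ratio: cohort", cohort_rr, "naive from sample", naive_sample_rr)
```

The cross-product ratio is unchanged by the sampling, whereas the naive ratio of proportions is not. This shows only what fails when the sampling is ignored; it says nothing about what can be estimated once the sampling design is brought into the analysis.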
But case-control studies are not backwards cohort (“trohoc”) studies.4 A more realistic way to represent case-control designs is as sampling from a prospective cohort that depends on disease outcomes and other information available on cohort subjects—what we have called the nested case-control model.5 While this representation is certainly not new, and is often used in epidemiology textbooks as a “conceptual framework” to think about basic sampling and bias issues, almost invariably the retrospective model approach is used to develop the analysis methods. The alternative approach my colleagues and I have taken is to develop case-control study methods based completely on the nested case-control model, and to provide a unifying framework across cohort and case-control analysis methods, as well as across individually matched and unmatched case-control study designs. While there are still some important gaps to be filled, we have made progress. After working within the nested case-control model over a number of years, I have come to find the retrospective model both unnatural and limiting. There are 3 commonly held misconceptions as a result of the retrospective model mindset.
The first is that cases and controls need to be “random samples,” meaning having equal probability of being sampled, from their respective populations (eg, selection cannot depend on covariates). This misconception is not directly relevant to the topic of estimation of measures other than odds ratios, so I will not discuss it at length here except to reiterate our opinion that “The ‘random sampling of controls’ principle needs to be replaced by the principle that ‘the method of control selection must be incorporated into the analysis.’”1,6
The second misconception is that only odds ratio parameters can be estimated from case-control studies. As a quick counterexample, rate-ratio parameters can be estimated when controls are sampled from the risk sets in disease incidence cohort data.7–9 With additional information related to the overall cohort rates or size, absolute risk or rates can be estimated. In the retrospective model, there is no element of an “underlying cohort,” so that the connection between a case-control sample and the cohort is not integrated into the conceptual framework. I refer to our work in this area, but there are other approaches.10–12 Our methods for case-control data are a natural extension of well-established methods for disease incidence cohort studies, such as Kaplan-Meier estimates of survival and risk,13,14 and are easily implemented using Cox regression software, such as SAS, Stata, S-Plus, or R, that computes baseline survival or cumulative hazard functions.5 As originally presented, these methods require the number in each risk set, and are thus appropriate when the underlying cohort can be fully enumerated. However, as I discuss below, the methods also apply in more general situations. Estimation of other measures is also possible, including excess risk estimation via nonparametric15,16 and Poisson regression17,18 methods.
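One way to see how the risk-set sizes enter is through a weighted Breslow-type estimator of the baseline cumulative hazard, in which each member of a sampled risk set (the case plus its sampled controls) is weighted by n/m, the number at risk in the cohort divided by the number in the sampled set. The Python sketch below is an illustration under stated assumptions rather than a reproduction of the published methods: it assumes simple random sampling of controls from each risk set, a single covariate, a log rate ratio that has already been estimated, no competing risks, and invented data. The function names are hypothetical.

```python
import math


def baseline_cumulative_hazard(sampled_risk_sets, beta):
    """Running weighted Breslow-type estimate of the baseline cumulative hazard.

    sampled_risk_sets: list, ordered by failure time, of tuples
        (n_at_risk, covariates) where n_at_risk is the size of the full
        cohort risk set and covariates holds the covariate value of the
        case and of each sampled control.
    beta: log rate ratio per unit of the covariate (taken as given here).
    """
    total = 0.0
    curve = []
    for n_at_risk, covariates in sampled_risk_sets:
        m = len(covariates)
        weight = n_at_risk / m  # assumed weight for simple random sampling
        denominator = sum(weight * math.exp(beta * z) for z in covariates)
        total += 1.0 / denominator
        curve.append(total)
    return curve


def absolute_risk(cumulative_hazard, beta, z):
    """Risk over the interval for covariate value z, ignoring competing risks."""
    return 1.0 - math.exp(-cumulative_hazard * math.exp(beta * z))


# Invented data: three case failure times, each with two sampled controls.
data = [(100, [1, 0, 0]), (90, [1, 1, 0]), (75, [0, 0, 1])]

beta = math.log(2.0)  # assumed rate ratio of 2 for the binary covariate
curve = baseline_cumulative_hazard(data, beta)
risk_exposed = absolute_risk(curve[-1], beta, 1)
risk_unexposed = absolute_risk(curve[-1], beta, 0)

# With beta = 0 the estimate collapses to the sum of 1/n over failure times.
null_curve = baseline_cumulative_hazard(data, 0.0)
print(curve[-1], risk_exposed, risk_unexposed, null_curve[-1])
```

Setting the log rate ratio to zero reduces the estimate to the sum of 1/n over the failure times, which is the familiar cohort Nelson-Aalen form; this is the sense in which such estimators extend standard cohort methods, and it is also why the number at risk in each risk set is needed.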
The third misconception is that exact specification of the sampling design in the analysis does not matter as long as the controls are randomly sampled. The crux of the argument can be discussed in the context of estimation of the baseline odds. As mentioned earlier, analysis of case-control data using cohort (unconditional) logistic regression arose based on the retrospective view. From this standpoint, the natural and common approach to estimation of baseline odds is to “fix up” the logistic regression intercept parameter using the case and control sampling fractions.11 While this is certainly valid for point estimation of the baseline odds, variance estimation requires more precise specification of the sampling19—a complication not addressed in most basic textbooks advocating the approach. The prospective view of case-control studies naturally suggests an approach to analysis of case-control data based on a conditional logistic likelihood. This likelihood depends on the specific sampling method, and the baseline odds may (or may not) be estimable from the case-control data, without supplemental information, depending on the sampling design. So, for instance, the baseline odds is estimable if controls are drawn as Bernoulli trials or as case-base samples, but not as frequency matched samples.6
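The intercept “fix-up” described above can be illustrated with a saturated logistic model for a single binary exposure, in which the intercept is simply the log odds of disease among the unexposed. The following minimal Python sketch assumes hypothetical cohort counts, known sampling fractions for cases and non-cases, and expected (not randomly drawn) sample counts. It covers the point estimate only; it does not address the variance, which depends on the details of the sampling.

```python
import math

# Hypothetical unexposed group in the underlying cohort (assumed numbers).
cohort_cases_unexposed = 50
cohort_noncases_unexposed = 950

# Assumed sampling fractions: all cases, 10% of non-cases.
case_fraction = 1.0
control_fraction = 0.10

# Expected numbers of unexposed subjects in the case-control sample.
sample_cases = cohort_cases_unexposed * case_fraction
sample_controls = cohort_noncases_unexposed * control_fraction

# Intercept that a saturated logistic fit to the sample would return.
sample_intercept = math.log(sample_cases / sample_controls)

# Correct the intercept by the log ratio of the sampling fractions.
corrected_intercept = sample_intercept - math.log(case_fraction / control_fraction)
baseline_odds = math.exp(corrected_intercept)

# Log baseline odds computed directly from the cohort, for comparison.
cohort_intercept = math.log(cohort_cases_unexposed / cohort_noncases_unexposed)
print(sample_intercept, corrected_intercept, cohort_intercept, baseline_odds)
```

The corrected intercept matches the cohort log baseline odds because the sample odds are inflated by exactly the ratio of the case to the control sampling fraction. The agreement is exact here only because expected counts are used; with a real sample, the uncertainty of the corrected estimate is where the sampling design matters.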
The “retrospective” model has led to the mindset of “case-control study = odds ratio.” This equation is a fallacy. In many situations, absolute risk and other measures can be reliably estimated from case-control data and, in fact, from a much wider range of case-control studies than the nested case-control study setting I have discussed thus far. In none of the nested case-control studies in which our methods were applied were the cohorts fully enumerated.20,21 In each, the target cohort was a subset of the assembled cohort, with eligibility determined as part of the process of identifying cases and controls. Our methods can accommodate this situation, incorporating the numbers of potential cases and controls that were sampled and disqualified from the analysis, as well as the number of controls in the assembled cohort risk sets.5,22 Moreover, I conjecture that the methods can be extended to situations where controls are matched to the case within some group characteristic (unit) that can be enumerated, such as school or defined geographic areas, and where the numbers of potential controls in the units in which cases occurred can be ascertained.
In conclusion, reliable estimation of risk and other non-odds-ratio measures from case-control studies is certainly possible as long as the ancillary information about the underlying cohort is obtained and the details of the case and control selection process are recorded. Such analysis requires a little extra planning and the use of appropriate design and analysis techniques, but little added expense.
I would like to thank Drs. David Richardson and Duncan Thomas for their excellent feedback on the manuscript.
1. Greenland S. Model-based estimation of relative risks and other epidemiologic measures in studies of common outcomes and in case-control studies. Am J Epidemiol
2. Prentice R, Pyke R. Logistic disease incidence models and case-control studies. Biometrika
3. Carroll RJ, Wang S, Wang CY. Prospective analysis of logistic case-control studies. J Am Stat Assoc
4. Feinstein AR. Clinical Biostatistics XX. The epidemiologic trohoc, the ablative risk ratio, and “retrospective” research. Clin Pharmacol Ther
5. Langholz B. Use of cohort information in the design and analysis of case-control studies. Scand J Stat
6. Langholz B, Goldstein L. Conditional logistic analysis of case-control studies with complex sampling. Biostatistics
7. Greenland S, Thomas D. On the need for the rare disease assumption in case-control studies. Am J Epidemiol
8. Oakes D. Survival times: aspects of partial likelihood (with discussion). Int Stat Rev
9. Borgan Ø, Goldstein L, Langholz B. Methods for the analysis of sampled cohort data in the Cox proportional hazards model. Ann Stat
10. Wacholder S. The case-control study as data missing by design: estimating risk differences. Epidemiology
11. Rothman KJ, Greenland S. Modern Epidemiology. Philadelphia: Lippincott-Raven Publishers; 1998:chap 21.
12. Benichou J, Gail M. Methods of inference for estimates of absolute risk derived from population-based case-control studies. Biometrics
13. Langholz B, Borgan Ø. Estimation of absolute risk from nested case-control data. Biometrics
14. Borgan Ø, Langholz B. Risk set sampling designs for proportional hazards models. In: Everitt BS, Dunn G, eds. Statistical Analysis of Medical Data: New Developments. London: Arnold; 1998:75–100.
15. Aalen OO. Nonparametric inference for a family of counting processes. Ann Stat
16. Borgan Ø, Langholz B. Estimation of excess risk from case-control data using Aalen's linear regression model. Biometrics
17. Samuelsen SO. A pseudolikelihood approach to analysis of nested case-control data. Biometrika
18. Samuelsen SO, Ånestad H, Skrondal A. Stratified case-cohort analysis of general cohort sampling designs. Scand J Stat
19. Arratia R, Goldstein L, Langholz B. Local central limit theorems, the high order correlations of rejective sampling, and logistic likelihood asymptotics. Ann Stat
20. Habel LA, Shak S, Jacobs MK, et al. A population-based study of tumor gene expression and risk of breast cancer death among lymph node-negative patients. Breast Cancer Res
21. Lacey JV Jr, Sherman ME, Ronnett ME, et al. Absolute risk of endometrial carcinoma over 20-year follow-up among women diagnosed with endometrial hyperplasia. J Clin Oncol
22. Langholz B. Estimation of Absolute Risk When Controls are Sequentially Sampled Until a Specified Number of Valid Controls are Found. Los Angeles: USC Department of Preventive Medicine Biostatistics Division; 2005. Technical Report 172.
© 2010 Lippincott Williams & Wilkins, Inc.