Mandrekar, Jayawant N. PhD; Mandrekar, Sumithra J. PhD
Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota.
Disclosure: The authors declare no conflicts of interest.
Address for correspondence: Sumithra J. Mandrekar, PhD, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901. E-mail: firstname.lastname@example.org
In this issue of the Journal of Thoracic Oncology, Berthiller et al.1 present the results of a pooled analysis of three hospital-based case-control studies performed to estimate the risk of lung cancer from cannabis smoking in men. Data regarding smoking, demographics, and occupational exposures were gathered through a questionnaire. In two of the three studies, each case (a man with primary incident lung cancer) was matched to one or two controls on age and place of residence; in the third study, matching was on age, gender, and place of residence. The authors use an unconditional logistic regression model to obtain the study-specific and pooled odds ratio estimates. All analyses were performed on the complete set of cases and controls as well as on data with the missing variables recoded. The analysis techniques used in this article include unconditional logistic regression models, pooling of data without stratifying by study, testing for between-study heterogeneity, and (biased) approaches for recoding the missing data. These approaches are somewhat inappropriate; however, in this editorial we will focus only on the case-control study design, highlighting its benefits, limitations, and appropriate analysis methods.
Recent research has shown that a substantial majority of highly cited case-control studies are subsequently contradicted or found to be substantially over-ambitious when more rigorous investigations are conducted.2 While a prospective, double-blind, randomized controlled trial remains the "gold standard" of study design, such trials may not be logistically or scientifically feasible for diseases with infrequent events, which require large populations to be tracked for lengthy periods to observe disease development. Case-control studies are a practical alternative to randomized controlled trials in such cases, as they use data from patients who already have the disease or condition of interest and look back to see whether the characteristics of these patients differ from those of people who do not have it. This design essentially compares two groups of people with regard to the risk factor(s) of interest: those with the disease or condition under study (cases) and a very similar group of people who do not have the disease or condition (controls).3
An important aspect of case-control study design is the concept of matching controls to cases based on certain features. Matching on well-established confounders helps to address confounder-related distributional imbalance between cases and controls at the design stage; however, matching on a factor that is a consequence of the exposure is inappropriate. Matching can be done in two different ways. Group or frequency matching ensures that there are the same number of controls as cases within each level of the confounder. While this helps reduce bias, it does not necessarily eliminate it. A finer level of matching is achieved by matching each case individually with one or more controls. Berthiller et al.1 do not provide details on the type of matching, but given the factors used for matching, specifically age and place of residence, it is likely that group/frequency matching was used. Next is the decision on the ideal matching ratio, i.e., 1:1, 1:2, 1:3, and so on. This typically depends on the number of variables to be matched, the availability of controls, and the cost of matching. In many instances, this decision is based on an investigator's prior knowledge and experience about the disease. The gain in statistical efficiency diminishes beyond 4 or 5 controls for each case; specifically, the efficiency for 1, 2, 3, 4, and 5 controls per case is approximately 0.5, 0.67, 0.75, 0.8, and 0.83, respectively.4 It is common practice, however, to match two controls per case, as done in two of the three studies in the article by Berthiller et al.1
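The efficiency values quoted above are consistent with the commonly used approximation M/(M + 1) for M controls per case, measured relative to an unlimited supply of controls. The short Python sketch below reproduces them under that assumption; the function name `relative_efficiency` is ours, for illustration only, and the formula is assumed rather than taken from the article.

```python
def relative_efficiency(controls_per_case):
    """Approximate efficiency of M controls per case, relative to an
    unlimited number of controls, assuming the M / (M + 1) rule."""
    if controls_per_case < 1:
        raise ValueError("need at least one control per case")
    return controls_per_case / (controls_per_case + 1)


# Tabulate the efficiency and the gain from adding one more control.
previous = 0.0
for m in range(1, 6):
    eff = relative_efficiency(m)
    print(f"1:{m} matching  efficiency={eff:.2f}  gain={eff - previous:.2f}")
    previous = eff
```

Under this approximation, moving from one to two controls per case adds about 0.17 in efficiency, whereas moving from four to five adds only about 0.03, which illustrates why ratios beyond 1:4 or 1:5 are rarely worth the cost.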
A conditional logistic regression model, which involves a stratified analysis on the matching factors, is used for the analysis of a matched case-control study design.5 Simply speaking, unlike an unconditional (i.e., ordinary) logistic regression model used for an unmatched case-control study, the conditional approach keeps track in the logistic regression model of which case was matched with which control(s) at the design stage, thus providing greater statistical efficiency. This concept is analogous to using a paired t test (versus a two-sample t test) when the measurements (i.e., data) are paired and the outcome of interest is continuous. An unconditional logistic regression model used in a matched case-control design setting, as done by Berthiller et al.,1 does not use the matching information, resulting in possibly incorrect model specification and biased estimates.
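To make the contrast concrete, consider the simplest special case: 1:1 matching with a single binary exposure and no other covariates. In that setting the conditional maximum likelihood estimate of the odds ratio reduces to the ratio of the two kinds of discordant pairs, whereas an analysis that ignores the matching uses the collapsed 2 x 2 table. The Python sketch below compares the two; the function name and the pair counts are hypothetical and are not data from Berthiller et al.

```python
def matched_pair_odds_ratios(pairs):
    """Return (conditional_or, unmatched_or) for 1:1 matched pairs.

    pairs: iterable of (case_exposed, control_exposed) booleans,
    one tuple per matched case-control pair.
    """
    both = case_only = control_only = neither = 0
    for case_exposed, control_exposed in pairs:
        if case_exposed and control_exposed:
            both += 1
        elif case_exposed:
            case_only += 1
        elif control_exposed:
            control_only += 1
        else:
            neither += 1
    if control_only == 0:
        raise ValueError("no pairs with only the control exposed")

    # Conditional estimate: only discordant pairs carry information.
    conditional_or = case_only / control_only

    # Unmatched estimate: collapse to a 2 x 2 table, ignoring the pairing.
    exposed_cases = both + case_only
    unexposed_cases = control_only + neither
    exposed_controls = both + control_only
    unexposed_controls = case_only + neither
    if unexposed_cases == 0 or exposed_controls == 0:
        raise ValueError("collapsed table has an empty cell")
    unmatched_or = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls
    )
    return conditional_or, unmatched_or


# Hypothetical data: 100 matched pairs (illustrative counts only).
example_pairs = (
    [(True, True)] * 20      # both exposed
    + [(True, False)] * 30   # only the case exposed
    + [(False, True)] * 10   # only the control exposed
    + [(False, False)] * 40  # neither exposed
)
conditional_or, unmatched_or = matched_pair_odds_ratios(example_pairs)
print(f"conditional OR = {conditional_or:.2f}, unmatched OR = {unmatched_or:.2f}")
```

With these illustrative counts the conditional estimate is 3.00, while the analysis that ignores the pairing gives about 2.33, closer to the null. With more than one control per case or additional covariates, the conditional likelihood has no such closed form and a dedicated conditional logistic regression routine would be needed.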
In summary, case-control studies, when designed and analyzed correctly, are a valuable investigative tool that can provide rapid results at low cost for diseases that have a low incidence and take a long time to manifest. Caution should be exercised in interpreting results from these studies because of their retrospective and non-randomized nature. Moreover, ascertainment of disease, selection of controls, and measurement of exposure need careful consideration in a case-control study design.
1. Berthiller J, Straif K, Boniol M, et al. Cannabis smoking and risk of lung cancer in men: a pooled analysis of three studies in Maghreb. J Thorac Oncol
2. Petitti D. Hormone replacement therapy and coronary heart disease: four lessons. Int J Epidemiol
3. Breslow NE, Day NE. Statistical methods in cancer research. Volume I: The analysis of case-control studies. IARC Sci Publ
4. Ury HK. Efficiency of case-control studies with multiple controls per case: continuous or dichotomous data. Biometrics
5. Breslow NE, Day NE, Halvorsen KT, Prentice RI, Sabai C. Estimation of multiple relative risk functions in matched case-control studies. Am J Epidemiol