Secondary Logo

Journal Logo

Analysis of Matched Case-Control Data With Incomplete Strata: Applying Longitudinal Approaches

Lin, I-Feng*†; Lai, Ming-Yun; Chuang, Pei-Hung*

doi: 10.1097/EDE.0b013e318064630a
METHODS: Original Article

Background: Matched case-control data have a structure that is similar to longitudinal data with correlated outcomes, except for a retrospective sampling scheme. In conditional logistic regression analysis, sets that are incomplete due to missing covariates and sets with identical values of the covariates do not contribute to the estimation; both situations may cause a loss in efficiency. These problems are more severe when sample sizes are small. We evaluated retrospective models for longitudinal data as alternatives in analyzing matched case-control data.

Methods: We conducted simulations to compare the properties of matched case-control data analyses using conditional likelihood and a commonly used longitudinal approach generalized estimating equation (GEE). We simulated scenarios for one-to-one and one-to-two matching designs, each with various sizes of matching strata, with complete and incomplete strata, and with dichotomous and normal exposures.

Results and Conclusions: The simulations show that the estimates by conditional likelihood and GEE methods are consistent, and a proper coverage was reached for both binary and continuous exposures. The estimates produced by conditional likelihood have greater standard errors than those obtained by GEE. These relative efficiency losses are more substantial when data contain incomplete matched sets and when the data have small sizes of matching strata; these can be improved by including more controls in the strata. These losses of efficiency also increase as the magnitude of the association increases.

From the *Division of Biostatistics, Institute of Public Health; and †Department of Social Medicine, Faculty of Medicine, National Yang-Ming University, Taipei, Taiwan.

Submitted 18 July 2006; accepted 28 February 2007.

Supported in part by the National Science Council of Taiwan, the Republic of China, NSC 92-2320-B010-045.

Correspondence: I-Feng Lin, Division of Biostatistics, Institute of Public Health, Faculty of Medicine, National Yang-Ming University, 155, Sec. 2, Lih-Nong Street, Taipei 11221, Taiwan. E-mail:

Matching has been widely used to enhance efficiency in case-control studies.1 In an individual “one-to-n” matching, one or more controls (nondiseased subjects) are matched to a case (diseased subject) on variables such as age, sex, and race/ethnicity. Due to the retrospective sampling scheme and matching procedure in matched case-control data, fitting a prospective model with unconditional logistic regression that treats disease status as the dependent variable and exposure as the independent variable may lead to biased and inconsistent estimates even in large data sets.2 For this reason, conditional logistic regression has been preferred for the analysis of matched case-control data.3–5

Conditional likelihood treats the matching effects as nuisance parameters and estimates the parameters of interest (the disease-exposure log odds ratio) conditioning on the matching sets. Since this approach uses only information within a stratum, neither incomplete matching sets (ie, the strata containing only cases or only controls) or sets with identical values of the covariates contribute to the estimation. A missing covariate for a case causes the model to discard the corresponding observed controls within the same stratum and, similarly, the strata with cases only are left out from analysis even though covariates for case are observed. Therefore, the conditional logistic regression with missing covariates loses more information than unconditional logistic regression. These problems may be even more severe when sample sizes are small.

Matched case-control data have a data structure that is similar to longitudinal correlated data except for the retrospective sampling scheme. The role of the matched sets is similar to the “subjects” or “clusters” in longitudinal data. In a matched case-control study, the exposure measures for the case and controls within a “matched set” are dependent on sharing the same matching factors. In a longitudinal study, the outcome measures at different time points are correlated. In this article, we use the terms “matched set,” “stratum,” and “cluster” interchangeably.

Methods developed for analyzing longitudinal correlated outcomes—such as the generalized estimating equations (GEEs) and generalized linear mixed models—have been widely used.6 The GEE approach is a marginal model based on quasi-likelihood, and accounts for the cluster effect by estimating additional parameters for the correlations in outcome measures between different time points. The generalized linear mixed models are random-effects models accounting for the cluster effect by assuming cluster-specific effects, (eg, cluster-specific intercepts and cluster-specific slopes) and jointly estimating the parameters of interest and variance of the random effects.

The conditional logistic regression is a special case of a logistic model with stratum-specific intercepts; this approach estimates the parameter of interest conditioning on the random effect and does not require any assumption on the distribution of the random effect. Unlike conditional likelihood approach, the GEE approach and the generalized linear mixed models utilize all observations in estimating the parameter of interest in the main mean model, treating all missing outcomes completely at random.7 The generalized linear mixed model is a likelihood approach that requires information within stratum to estimate the parameters. It involves more complicated computation procedures and usually takes a long time to run for problems with many parameters or for nonlinear models. For individual matching designs with only few observations in each cluster, the precision of the estimates is sometimes unreliable, especially in nonlinear models.

This study focus on the longitudinal approach with GEE models only. The purpose of this paper is to compare the estimates obtained by conditional likelihood and the GEE based on a matched case-control study. We conducted Monte Carlo simulation studies to compare the estimates for the data with complete strata and the estimates obtained with missing covariates. The analysis was carried out for varying sample sizes, varying true parameters, and for matching designs of 1:1 (1 case and 1 control for a stratum) and a 1:2 (1 case and 2 controls for a stratum).

Back to Top | Article Outline


Conditional logistic regression and the other 2 longitudinal models can handle data with more than one case and multiple controls per matched set. For notation simplicity, we will illustrate the models for a 1: n matching design (n controls per case for a stratum). Let Dij be the case indicator for the j th subject, j = 1, …, n(i), in the i th matched set, i = 1, …, m. That is, Di1 = 1 for case, and Di2 = 0, …, Din(i) = 0 for controls, where the number of controls for matched set i is n(i) − 1 and the total number of strata is m. Let Xij be the variable of interest (eg, exposure) for the j th subject, Wij be the covariate vector of potential confounders to be adjusted, and Zi be a vector of matching factors in the i th matched set (eg, age). Given the data, we assume that the linear logistic disease prevalence model to be

The ln is the natural logarithm function, β0 is a constant, β = (β12T)T is a vector of regression coefficients, and g(.) is a function linking the relationship between Z and D. The exponential of β1, exp(β1), is the interested disease-exposure odds ratio.

Back to Top | Article Outline

Prospective and Retrospective Models in Analyzing Matched Data

Consider a binary exposure variable X of primary interest with value 1 as exposed and 0 as unexposed. In matched cohort studies (matching on Z between exposed and unexposed groups), we can directly estimate the probability of disease (D) given the exposure status (X) adjusting for matching variable (Z) and other covariates (W), P(D|X,Z,W), as well as the parameter of interest, β1, by accounting explicitly for the effect of Z on the disease status in an unconditional logistic regression model, if the Z is quantifiable.8 In retrospective case-control studies, logistic regression directly models the probability of exposure (X), P(X|D,Z,W). According to the invariant property of the odds ratio, the coefficient of D in a retrospective model (ie, fitting the model with X as the dependent and D as the independent variables) is the disease-exposure log odds ratio.9 In an unmatched case-control study, a prospective model using a standard unconditional logistic regression (ie, fitting the model with D as the dependent and X as the independent variables) still provides a valid estimate of the disease-exposure odds ratio except for the intercept term, if there is no selection bias.10 Therefore, the prospective analysis with unconditional logistic model is valid in unmatched cohort, unmatched case-control, and matched cohort studies. In matched case-control studies, however, prospective analysis with unconditional logistic models may introduce additional biases due to model misspecification.2 In special situations where matching variables are categorical or the log odds ratio β1 is close to zero, this bias is minimal.2 Conditional likelihood approach has been used to analyze this kind of data; fortunately, the prospective analysis works for conditional logistic regression in the matched case-control data as well.11 Alternatively, one can apply retrospective analysis to matched case-control studies.8 This setting requires either parametric models with an explicit term of matching effect fully specified, or more complicated multivariate models accounting for matching sets. Different types of exposure may require different models.

Back to Top | Article Outline

The Conditional Logistic Regression Model (CL)

The conditional likelihood for matched case-control data is as follows,

The g(Z) terms are cancelled out within each stratum and thus there is no need to estimate the nuisance parameters corresponding to the matching factors in this likelihood. The conditional maximum likelihood estimate for β is a solution to the first derivative of the logarithm of Lc(β), the conditional score function, equal to zero. This score function is equivalent to the one in Cox regression; the conditional likelihood is conditional on the matched sets whereas the partial likelihood of the Cox model is conditional on the risk sets. Any software package handling the Cox model can also handle conditional logistic regression. Note that the matched case-control data that contains more than one case in a stratum is equivalent to survival data with “tied deaths” (more than one person died at the same time). Since the number of observations is small within a matched set (eg, for 2:2 matching, there are only 4 observations in a matched set), approximate methods for handling “tied deaths” for large samples using the default settings of Cox model in most software packages are not applied in conditional likelihood; an exact method handling discrete “tied survival time” in the Cox model is needed to estimate β in conditional logistic regression for data with more than one case per stratum.

Back to Top | Article Outline

The Retrospective Models of the GEE for Matched Case-Control Data

In the retrospective version of GEE for correlated outcomes, one estimates the disease-exposure log odds ratio, simultaneously taking into account the matching effect by estimating additional parameters for the correlations within the clusters. Let á be the vector of parameters for association of X between case and control within a stratum. The “prospective analysis” on matched case-control data would not work in the GEE because á would not be estimable due to the identical values at the first time point (all cases have an “outcome” of one) and at the second or further time points (all controls have an “outcome” of zero) within the same stratum. To estimate the disease-exposure odds ratio with the GEE we now consider the following 2 retrospective models with logit link,


We estimate an additional vector of the parameters, α, which quantifies the association between Xij and Xil in the same stratum i for jl.

The exp(γ1) in equations (2) and (2a) is the disease-exposure odds ratio of interest, namely, exp(β1) in equation 1. The association parameters, α, may be formulated in several ways, for example, in terms of correlations6 or odds ratios.12 In a special case with only 2 correlated values within a cluster (1 case and 1 control each matched set), only 1 α parameter needs to be estimated. Note that the equation (2) does not include the term γ3(Zi) and at the first glance the matching effect seems not to be adjusted for. In fact, the association vector of the α parameters quantifies the matching effect Z in a different way, and the α and γ3(Zi) are traded off. Under the circumstances in equation (2a) which estimates both α and γ3(Zi), we will expect to find no substantial differences in the parameter estimate γ1 between models by equation (2) and by equation (2a).

The above retrospective GEE model can be generalized and applied to an exposure variable that belongs to an exponential family such as binomial, normal, Poisson, and gamma distributions.

The main mean model in equation (2a) can be rewritten as

where μ is the expected value of X and h(.) is a link function in generalized linear models.

When the exposure variable X is normally distributed, the relationship between β and γ is β1 = γ12 and σ2 is the variance of X.13

Back to Top | Article Outline

The Simulation

We simulated the model with one continuous matching variable, Z, and 2 covariates, X and W. The X was the exposure of interest, and the W was a potential confounder. The values of the matching variable (Z) mimic the age distribution of the matched case-control study investigating the protective effect of alcohol consumption on stroke in a multirace/ethnicity area.14 We first generated W given Y and Z, then generated X given W, Y, and Z; several steps of the Bayes’ rules were applied. The design and details of the Bayes’ steps in this simulation study are similar to the ones in previous papers that studied missing covariates13 and selection bias15 for matched case-control data. When the exposure variable is binary, all retrospective models by GEE are also logistic models, as shown in the previous sections. We also simulated a special case of continuous X that is normally distributed with a mean of zero and a standard deviation of one. In practice, any normally distributed variable can be transformed into this standardized variable. The true parameters for β1 were set up as 0.7 and 1.6 (the disease-exposure odds ratio is approximately 2 and 5, respectively). The retrospective models with normal X were fitted by GEE with identity links and normal distributions. The W was binary and β2 was 1 (odds ratio is approximately 2.7). We considered the situations for 1-to-1, 1-to-2 matching, and mixture of 1-to-1 and 1-to-2 matching, with sample sizes 100, 200, and 500, respectively. Since the results had similar trends, we report the results only for one-to-one matching with 500 strata and the results for the worst situation, with 100 strata. We generated the data with 10% and 30% chance for missing values on X with the mechanism missing set to be completely at random.16 For each combination of these scenarios, 2000 replicated samples were generated and analyzed by conditional likelihood and the GEE with and without matching Z in the models. The association parameter α was set up as a uniform correlation coefficient. We also experimented with other forms of α such as odds ratio.12 However, the choice of form for α does not change the results substantially in our simulations.

Several summarized measures from the simulation are presented in the tables. Let 1 be a parameter estimate of disease-exposure log odds ratio. The mean 1 is the average of β1 estimates from the 2000 generated samples. The absolute relative bias is calculated by the absolute value of (mean 1 − true β1/(true β1). The mean standard error of 1 is the square root of the average of estimated asymptotic variance of 1. For conditional likelihood, the asymptotic variance is the inverse of the first derivative of the conditional score function. For the GEE, the asymptotic variance is referred to as the “sandwich” or “robust” variance estimate based on the quasi-likelihood.6 The simulation standard error of 1 is the square root of the variance of all simulation 1. These 2 standard errors of 1 are compared to verify the asymptotic properties of the variance estimates of the above models on different situations. Since standard errors are associated with the mean of the estimates even when the sample size are the same, the coefficient of variation (CV) type of measure is presented, which is calculated by [(simulation standard error of 1)/(mean 1)] as a percentage. To compare the relative efficiency of the estimates among different analyses, the “relative CV” is calculated by taking the ratio of each CV to the one obtained by conditional likelihood for data without missing values; a number greater than 100% refers to a bigger variation of the estimates relative to the one in conditional likelihood for complete data. The 95% coverage percentage is calculated by observing how many times the true parameter was contained in the confidence intervals generated. All computations were performed using the SAS 8.2 statistical software package (SAS Institute Inc., Cary, NC). The conditional likelihood and GEE were estimated by the procedures PHREG and GENMOD, respectively.

Back to Top | Article Outline


Tables 1 and 2 show the simulation results for matched case-control data based on 2000 replicated samples each with 500 strata. Table 1 is performed on data with binary exposure variable and Table 2 is for standard normally distributed X. The results from these 2 situations are similar. For both conditional likelihood and GEE, regardless of type of matching or existence of missing covariate, the mean 1 is close to the true parameter. The values of absolute relative bias range from less than 0.1% up to 2%. The mean standard error of 1 is approximately equivalent to the simulation standard error of 1 for the same model. The asymptotic variance estimates of 1 are valid both for conditional likelihood and for the GEE on matched case-control data. The 95% coverage percentages are all close to 95%. As one expected, the performance by the GEE with and without Z terms in the models are very similar in our simulations. The variation of the parameter estimates (in terms of relative CV for comparison purpose) by the GEE are generally smaller than the ones by the GEEs applied for the same data. These efficiency losses are more substantial when data contain incomplete matched sets, and these losses increase as the magnitude of the association increases, as the proportions of the missing covariates increases, and as the numbers of matching strata decreases.





Table 3 shows the measures of performance by conditional likelihood and GEE excluding the Z terms for data with 100 strata. Patterns are similar to those in previous tables. The absolute relative bias for data with 100 strata is not as small as those for 500 strata due to the fact that relatively small numbers of strata require more replications to converge to the true parameter. Given the same numbers of cases, the estimate of 1 is more efficient for 1-to-2 matching data than for 1-to-1 matching. Note that in the case of β1 = 1.6 for the 1-to-1 matched data with 30% missing covariates (equivalent to complete data with 70 strata), the conditional likelihood approach estimates extreme values of 1 thus producing unusually large standard errors.



Back to Top | Article Outline


The simulation results show that both conditional likelihood and the retrospective GEE produced consistent point estimates of the parameter in matched case-control data. This consistency property depends on correct specification of the mean model in these 2 approaches. Both approaches are robust on specification of the matching effect g(Z) in equation (1) since they have implicitly taken into account the effect of matching effect by conditioning or by estimation of the working correlation parameters α. Data with unquantified matching factors such as neighborhood or friend matching can be applied. The conditional logistic regression model is a special case of the family of random effect models with random intercept, except that the random effect is conditioned, and therefore does not require any specification of the form or distribution of the effect of Z. The GEE model is a population-average model based on quasi-likelihood with robust variance estimator. The correct specification of α affects only the efficiency of the parameter estimates in these retrospective models. In a situation where the term γ3(Zi) in equation (2a) can fully quantify the matching effect and is correctly specified, a likelihood method using retrospective unconditional logistic regression with explicit parametric term γ3(Zi) is valid, too. However, the GEE is more robust in terms of the specification of matching effect.

The estimates produced by conditional likelihood generally have larger standard errors than those obtained by GEE. These relative losses in efficiency are more substantial when data contain incomplete matched sets, or when the data have small matching strata. The GEE approach utilizes the incomplete strata as well as the matched sets with identical X values. On the other hand, the conditional likelihood approach utilizes only the complete matching sets with discordant values of the X; in other words it uses only the “longitudinal information” (ie, the information within the same matched set), not the “cross-sectional information.”7 Previous studies shows that bias arises in the estimates by conditional likelihood when number of discordant pairs is small even in a simple matched-pair study without confounding, selection bias, and measurement error.17 This efficiency loss also increases as the magnitude of the association (β1) increases. Imagine that in a simple McNemar χ2 analysis for a paired matching design without W, and given the same total number of discordant pairs, an increase of odds ratio (exp(β1)) corresponds to a greater difference between the frequencies in one of the discordant cell over the one in the other discordant cell. When the total number of discordant pairs is small, the frequency in one of the discordant cells could be very low (the extreme situation is a zero cell) which could produce very unstable estimates of the odds ratio. Given the same number of cases, the efficiency can be improved by including more controls in the strata under this potential “sparse data” problem (Table 3).

In general, the estimates by both conditional likelihood and those by retrospective GEE are consistent, and the GEE generally obtained more efficient estimates. Especially in relatively smaller sample size, the GEE has fewer computation problems. However, the GEE approach in our study was focused on an exposure variable and disease status adjusting for matching and other confounding variables—ie, except for the exposures, all other variables are not of interest and are to be adjusted for. If multiple exposure variables are of interest and are to be estimated simultaneously, the retrospective models require more complicated multivariate models or several models. The performance of models in these situations needs further study. In such situations, the conditional likelihood approaches with prospective analysis are still preferred when number of informative strata is large.

Another approach that utilizes the observed values in incomplete strata more efficiently is to merge neighboring strata with similar values of the matching variables.18 Once the strata become completed (ie, all with at least 1 case and 1 control), the merged strata can be included in the conditional logistic regression analysis. When continuous matching variables are involved, however, arbitrariness of grouping or the possibility of an outlier might produce different estimates. Instead of combining strata, one can also apply several methods for small-sample bias correction in a simple matched-pair data without confounding, or turn to Bayesian types of methods.17

In this simulation study, data missing completely at random were generated. We also performed a simulation study on 1-to-1 matched data with 500 strata for each of 2000 replications using a mechanism of missing-at-random (missing not depending on unobserved values but depending on variables of interest; eg, probabilities of missing are different between case and controls in case-control study; for details of the missing value mechanisms see reference 16). The probability of missing X was assumed to be 0.1 for W = 1 in the controls, 0.9 for W = 0 in the cases, and 0.5 for both W = 1 in cases and W = 0 in controls, ie, the missing of X depended on the values of W and D. Numerically, the simulation results under this condition (table not shown) show a similar behavior to the reported tables. Nevertheless, the longitudinal models in theory does not always produce consistent estimates for data missing at random.7 Statistical methods have been proposed to handle data that are missing at random for conditional logistic regression14,19,20 and for the GEE.21,22 However, software performing this task is not yet commercially available.

Back to Top | Article Outline


The authors thank the referees for their helpful comments.

Back to Top | Article Outline


1. Rothman KJ, Greenland S. Modern Epidemiology. 2nd ed. Lippincott-Raven; 1998.
2. Levin B, Paik MC. The unreasonable effectiveness of a biased logistic regression procedure in the analysis of pair-matched case-control studies. J Stat Plan Inference. 2001;96:371–385.
3. Breslow NE, Day NE, Halvorsen KT, et al. Estimation of multiple relative risk functions in matched case-control studies. Am J Epidemiol. 1978;108:299–307.
4. Breslow NE, Day NE. Statistical Methods in Cancer Research: The Analysis of Case-Control Studies. Vol. 1. IARC Scientific publications No. 32. Lyon, France: International Agency for Research on Cancer; 1980.
5. Holford TR, White C, Kelsey JL. Multivariate analysis for matched case-control studies. Am J Epidemiol. 1978;107:245–256.
6. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
7. Diggle PJ, Heagerty P, Liang K-Y, et al. Analysis of Longitudinal Data. 2 ed. Oxford: Oxford: Oxford University Press; 2002.
8. Fleiss JL, Levin B, Paik MC. Regression models for matched samples. In: Statistical Methods for Rates and Proportions. 3rd ed. Wiley; 2003:407–439.
9. Prentice RL. Use of the logistic model in retrospective studies. Biometrics. 1976;32:599–606.
10. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.
11. Weinberg CR, Wacholder S. The design and analysis of case-control studies with biased sampling. Biometrics. 1990;46:953–975.
12. Liang K-Y, Zeger SL, Qaqish B. Multivariate regression analyses for categorical data (with discussion). J R Stat Soc Ser B. 1992;54:3–40.
13. Paik MC, Sacco RL. Matched case-control data analyses with missing covariates. Appl Stat. 2000;49(Part 1):145–156.
14. Sacco RL, Elkind M, Boden-Albala B, et al. The protective effect of moderate alcohol consumption on ischemic stroke. JAMA. 1999;281:53–60.
15. Lin I-F, Paik MC. Matched case-control data analysis with selection bias. Biometrics. 2001;57:1106–1112.
16. Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
17. Greenland S. Small-sample bias and corrections for conditional maximum-likelihood odds-ratio estimators. Biostatistics. 2000;1:113–122.
18. Brookmeyer RON, Liang KY, Linet M. Matched case-control designs and overmatched analyses. Am J Epidemiol. 1986;124:693–701.
19. Lipsitz SR, Parzen M, Ewell M. Inference using conditional logistic regression with missing covariates. Biometrics. 1998;54:295–303.
20. Gibbons LE, Hosmer DW. Conditional logistic regression with missing data. Commun Stat Simul Comput. 1991;20:109–120.
21. Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc. 1995;90:106–121.
22. Paik MC. The generalized estimating equation approach when data are not missing completely at random. J Am Stat Assoc. 1997;92:1320–1329.
© 2007 Lippincott Williams & Wilkins, Inc.