# Analysis of Case-Crossover Designs Using Longitudinal Approaches: A Simulation Study

Background: Application of case-crossover designs provides an alternative to time-series analysis for analyzing the health-related effects of air pollution. Although some case-crossover studies can control for trend and seasonality by design, to date they have been analyzed as matched case-control studies. Such analyses may exhibit biases and a lower statistical efficiency than traditional time series analyzed with Poisson.

Methods: In this article, case-crossover studies are treated as cohort studies in which each subject is observed for a short period of time before and/or after the event, thus making possible analyzing with Andersen-Gill and generalized linear mixed models. We conducted a simulation study to compare the behavior of these models applied to case-crossover designs with time series analyzed with Poisson and with case-crossover analyzed by conditional logistic regression. To this end, we created a random variable that follows a Poisson distribution of low (2/day) and high mean events (22/day). This variable is a function of an unobserved confounding variable (that introduces trend and seasonality) and data on small particulate matter (PM_{10}) from Barcelona. In addition, scenarios were created to assess the effect on exposure exerted by autocorrelation and the magnitude of the pollutant coefficient.

Results: The full semisymmetric design analyzed with generalized linear mixed models yields good coverage and a high statistical power for air-pollution effect magnitudes close to the real values but shows bias for high effect magnitudes. This bias seems to be attributable to autocorrelation in the exposure variable.

Conclusions: Longitudinal approaches applied to case-crossover designs may prove useful for analyzing the acute effects of environmental exposures.

From the*Department of Preventive Medicine and Public Health, University of Santiago de Compostela; and †Statistics, Applied Economics and Health Research Group (Grup de Recerca en Estadística, Economia Aplicada i Salut - GRECS), University of Girona, Spain.

Submitted 5 March 2003; final version accepted 23 November 2004.

Supported by the Spanish Ministry of Health. Health Research Fund: Fondo de Investigación Sanitaria grants 00/0010–05 and 99/1189.

Supplemental material for this article is available with the online version of the journal at www.epidem.com

Correspondence: Adolfo Figueiras, Dto. de Medicina Preventiva y Salud Pública. Facultad de Medicina, 15705 Santiago de Compostela (A Coruña) Spain. E-mail: adolfo.figueiras@usc.es.

**ArticlePlus**

Click on the links below to access all the ArticlePlus for this article.

Please note that ArticlePlus files may launch a viewer application outside of your web browser.

In recent years, keen interest has been shown in the application of case-crossover analyses to the study of the relationship between air pollution and health.^{1–5} Maclure^{6} proposed case-crossover designs to study the effects of transient exposures on the occurrence of acute events. In these designs, only cases are sampled, and estimates are based on within-subject comparisons of exposures at the time of and before the effect (“case” and “control” periods, respectively). This design has several advantages. Control subjects do not have to be sampled, thereby reducing study costs and eliminating the source of bias attributable to poor selection of controls. In addiction, when intrasubject comparisons are made, all possible confounders (whether measured or not), not modified between the control and case periods, are automatically controlled for by design, since they remain constant for each subject.^{6–8}

Initially, variations of case-crossover designs were proposed in which the control periods were in all cases situated before the date of the effect.^{9,10} When these designs are applied to exposures exhibiting a time trend, risk estimates could be confounded by the time trend in such an exposure.^{11} Indeed, this might well be the situation in studies addressing the effect of air pollutants on mortality and morbidity. Navidi^{7} proposed bidirectional case-crossover, in which “case” exposures are compared with “control” exposures selected both before and after the event, as well as 2 modes of control-period selection (full-stratum bidirectional case-crossover and random bidirectional case-crossover). This modification proved adequate to control for trend in exposure, but not for confounding due to unobserved variables with a systematic temporal component, which are the main source of seasonality in mortality and morbidity studies.^{8}

Bateson and Schwartz^{8} proposed symmetric bidirectional case-crossover as a new way of selecting “control” periods, with a 1-day control period being chosen a week before and after the event. In their simulation they observed that symmetric case-crossover controls for trend and seasonality in unmeasured confounding variables.^{8} In subsequent simulation analyses,^{12} the same authors reported that symmetric case-crossover may display biases attributable both to selection (subjects at the beginning or end of the series have fewer control periods, and these are asymmetric) and confounding (where exposure and effect share the same short-term temporal pattern); however, when the control periods are reduced to 1 week, both biases are minimized.^{12} According to a later simulation analysis by Navidi et al,^{13} symmetric case-crossover may nevertheless exhibit biases in cases where exposure is characterized by a time trend. These authors therefore proposed a new design, called semisymmetric bidirectional case-crossover, in which a single control period is randomly chosen one week before or after the event.^{13} To analyze such a study they applied a likelihood function derived from conditional logistic regression adapted to the sampling method used.^{13,14} In their simulations, Navidi et al^{13} have found that semi-symmetric bidirectional case-crossover affords good control of exposure trends and unobserved confounding variables. Moreover, according to Levy et al,^{5} in designs in which the control periods are selected at fixed intervals before or after the case event, only one data configuration is possible. Under these conditions, the likelihood used by conditional logistic regression is not the correct one, and may give rise to certain biases in the estimates. The authors report that this effect does not exist where the cases may occur at any point in the window from which the controls are chosen, as in the time-stratified case-crossover proposed by Lumley et al^{15} This design is based on dividing the study time a priori into fixed strata (eg, calendar months) and selecting the control days from among the remaining days of the stratum.

In all previous designs case-crossover studies were analyzed as matched case-control studies, as initially envisaged by McClure,^{6} using conditional logistic regression^{7,8,12} or derivations.^{13} This gives rise to 2 limitations. Conditional logistic regression-based analysis can lead to biased estimates in cases where exposure displays a time trend or seasonality,^{13,16,17} and the likelihood used by conditional logistic regression is not correct.^{5,15} Secondly, because this is a marginal approach in which all the information available, particularly within-subject correlation, is not used, it entails an important loss of statistical efficiency vis-à-vis time-series studies with Poisson regression studies.^{8}

From an epidemiologic standpoint, case-crossover designs can be regarded as the observational equivalent of clinical crossover trials, with the difference that only a sample of the population base is studied.^{6} A feature common to all case-crossover designs is that each subject has measures of exposure and effect variables at different points in time. The period of time that elapses between the first and last assessments can be viewed as the follow-up period during which each subject's exposure and/or effect is modified. Indeed, when Maclure^{6} first proposed the case-crossover design he linked it to cohort studies, describing it as “the counterpart to a cohort study with crossover of subjects between periods of exposure and non exposure.” The principal difference between case-crossover studies and traditional cohort designs is that for case-crossover designs only those subjects that have suffered the event are sampled.

From a statistical point of view, case-crossover designs can be regarded as longitudinal studies since “each individual is repeatedly measured over time,”^{18} and this measure may be prospective or retrospective.^{18} One of the characteristics of longitudinal data is their high statistical efficiency^{18} and the fact that each subject acts as his own control, thereby eliminating the influence of genetic factors or habits that remain constant over time.^{18} Thanks to the advantages afforded by longitudinal analysis techniques, application of this analyses techniques to cohort studies,^{19,20} parallel clinical trials^{21} or crossover trials is very widespread.^{18} If case-crossover studies are analyzed as longitudinal data, the application of random effects in the independent term enables heterogeneity to be eliminated among the various subject—heterogeneity that may stem from unmeasured confounding variables such as trend and seasonality. On the other hand, the statistical efficiency of such conditional models is very high, because—unlike conditional logistic regression—the fact that the study has repeated measures is taken into account. The other possibility is for case-crossover studies to be considered as cohort studies in which the explanatory variables are dependent on time, to analyze them as survival models.

The precise statistical approach best suited to analyzing case-crossover designs as follow-up studies is not known, and so we suggest to evaluate various analytic options. We conducted a simulation study in which we compared the results of analyzing case-crossover designs using longitudinal analysis methods (generalized linear mixed models; GLMM), survival analysis methods (Andersen-Gill approach to proportional hazards regression) and methods used in matched case–control designs (conditional logistic regression). A further comparison was run between the results obtained and those yielded by time-series studies with Poisson regression.

## METHODS

### Design

The simulation framework is taken from a scenario used in earlier studies (Bateson et al^{8} and Navidi et al^{13}) to compare our results with previous studies we constructed a 3-year (1096-day) series in which the number of events per day followed a Poisson distribution, as shown in the following model:

where *Y*_{t} denotes the number of events that can follow a mean of 2 or 22, and β_{1} denotes the log RR of *Y* associated with increases in particulates of less than 10 μm in diameter (PM_{10}). As in other studies,^{13} β_{1} was set at 0.001, though other scenarios were also assessed with β_{1} assigned values from 0 to 0.005. The PM_{10} series was drawn from data recorded for the city of Barcelona, Spain, during the period 1995–1997 (Fig. 1A). In expression (1), *trend* is an unobserved confounding variable that has a seasonal and trend component (Fig. 1B), was used in other simulation studies,^{8,13} and has the following formula:

In Eq 1, coefficient β_{2} assumes a value of 0.1 or 0 as respectively established by scenarios with or without confounding due to unobserved variables. To assess the effect of PM_{10} autocorrelation on the estimates, we created a new time series without autocorrelation. To this end, the original PM_{10} series was repermuted (random reassortment of existing data) on the basis of the original data series.

On the basis of the time series generated by Eq 1, data were obtained using different structures. First, we used a symmetric case-crossover design having 2 control days, one 7 days before and the other 7 days after the event,^{8} with subjects being excluded from the first and last 7 days of the series to avoid selection bias.^{12} We also employed a semisymmetric case-crossover design,^{13} in which a single control day was chosen at random 7 days before or after the event (Table 1). When conditional logistic regression was applied to this design, weighting was performed to avoid the influence, which is exerted by subjects failing near the beginning or end of the observation time and which results in only one of the 2 potential control times being observable.^{13} In the semisymmetric design, a stratum is formed for each case by choosing a control time, with equal probability, before or after the referent time. This design calls for the use of a likelihood function altered to achieve a weighted conditional logistic regression,^{13} but the procedure is nevertheless difficult to implement with standard software. We therefore created the effect of weighting by means of sampling, so as to be able to use standard conditional logistic regression: where a case in any given stratum had a weighting of 0.5, a second control was added on the same day as that on which the case had occurred; and where a control in any given stratum had a weighting of 0.5, a second case was added on the same day as that on which the first case had occurred.

Third, time-stratified case-crossover was applied.^{5,6} To this end, calendar months were taken as strata, and controls for each case day were taken as being all those days of the same stratum (month) having the same day of the week as the referent day, for example, where the case day was Saturday 9 December 1995, the control days would be the remaining Saturdays in December.

Fourth, we used a full-symmetric case-crossover design, in which the 7 days before and 7 days after the event were chosen as the control period (Figure available with the electronic version of this article). Finally, we examined a full semisymmetric case-crossover design, for which a control period of 14 days before or after the event was chosen at random. A control period of 14 days was used so that the statistical power would be comparable with that of the full-symmetric design. The S-Plus syntax for implementing these samplings will be made available to any reader who requests it from the corresponding author.

### Data Analysis

Data were analyzed using different methods. In the first place, we applied an ecological time series study, using generalized additive models with Poisson response, and smoothing splines with 7 degrees of freedom per year as smoothers.^{6} For model estimation purposes, the convergence criteria were modified as proposed by Dominici et al,^{6} and the standard errors were calculated using the gam.exact function.^{22}

Data transformed by the case-crossover–structure approach were analyzed as matched case–control studies, using exact conditional logistic regression^{23} generalized to more than one control period, without weighting. We also analyzed these data following a proportional hazards model, with the possibility of recurrent events and time-dependent explanatory variables. We used the Andersen–Gill approach to proportional hazards regression for this purpose.^{24–26} This approach deems every observation for every individual an independent unit of analysis, albeit dependent on explanatory variables. Under such conditions there is no guarantee of the efficiency of the estimates. Consequently, standard errors were robustly estimated by grouping jackknife estimates.^{26,27}

GLMMs,^{28–30} which are an extension of generalized linear models,^{31} also were applied to allow for additional sources of variability caused by unobservable random effects. GLMMs describe the relationship between a variable response and covariables in data that are grouped by one or more factors (in case-crossover studies the groupings would be the individuals) and are thus very useful for analyzing longitudinal data, repeated measures or multilevel data. The GLMMs are estimated using the penalized pseudo-likelihood method.^{28} Notwithstanding the fact that the estimators of the parameters are consistent and efficient,^{29} the estimation of the variance of the fixed effects may be slightly biased.^{30}

A binary-response GLMM was constructed including, as fixed effects, the independent term, exposure and individual trend, and as random effect, the independent term. We defined the individual trend as the calendar time of follow-up for each individual, by taking the value of 1 for the first day of follow-up for each such individual and so on successively, until 15. Inclusion of the independent term as a random effect in the GLMM enables the initial heterogeneity among subjects to be controlled for, and its application to a case-crossover design of acute effects of environmental exposures could enable trend and seasonality to be controlled for. Individual trend was included as a fixed effect with the aim of eliminating any possible confounding due to exposure trend. Similarly, to eliminate the influence of autocorrelation on exposure, the inclusion of a first-order autocorrelation variance structure was evaluated.

Using the S-Plus 6.0 statistical software package (Insightful, Seattle, WA), 1000 replicates were generated and analyzed for each of the scenarios. To ensure that the results of the different scenarios would be comparable, we used the same set of randomization seeds (consisting of the replicate number, ie, 1 to 1000) in all scenarios. The glmmPQL function was obtained from the *Venables and Ripley MASS Library* (4th ed.).^{32}

### Expression of Results

For each of the simulations, we calculated: (1) the percentage of mean bias with respect to the coefficient's true value, (Ax − β) · 100/β; (2) the standard deviation of the coefficients, which in turn enabled us to approximate the dispersion and instability of the estimates of Ax; (3) the percentage of coverage, which consists of calculating the percentage of replicates that include the true value of the coefficient within their 95% confidence interval (CI). This measure is complementary to bias because even though mean bias is low, if the calculation of the coefficients is very imprecise (great dispersion) the percentage of coverage may well be small. In addition, the percentage of coverage tends to indicate whether the standard errors of the coefficients have been properly calculated. Obviously, percentage coverage must be equal to or higher than 95%; and (4) the percentage of replicates that rejected the null hypothesis. We felt that the measure of coverage should be complemented by an indicator of statistical power, in view of the fact that a very high percentage of coverage could be attained thanks to overestimated standard errors, which result in wide 95% confidence interval and ensuing low statistical power. As is the case with coverage, the best model is that which has the greatest statistical power.

## RESULTS

Figures 1 and 2 set out the simulation conditions. Figure 1A shows the PM_{10} sequence that was used as an independent variable to generate the replicates. Shown in Figure 1B is the confounding variable with trend and seasonality. Figure 2A is the graph of close autocorrelations (1–30 days) and Figure 2B that of distant autocorrelations (28–364 days, at intervals of 28 days) in respect of PM_{10.} In Barcelona, the PM_{10} variable plotted a statistically significant negative trend (*P* < 0.001) of −5.96 10^{−3}. Furthermore, the confounding variable had a correlation of 0.3670 with PM_{10}.

From Table 1, it can be seen that case-crossover designs (whether symmetric, semi-symmetric or time-stratified) analyzed with conditional logistic regression achieved coverage percentages of around 95%, though with a statistical efficiency that in some cases was half that of time-series studies with Poisson regression (23.5 vs. 61.4%). Biases in the case of symmetric and time-stratified designs were higher than in the case of their semisymmetric counterparts and appeared to display a certain sensitivity to seasonal confounding. When case-crossover designs were analyzed by means of survival models, they did not appear to afford any advantage over matched case–control models (data not shown). Yet, when the longitudinal GLMM design was applied, whether full-symmetric or full semisymmetric, this generally ensured attainment of high coverage percentages, a statistical power very much higher than that of conditional logistic regression and time-series studies with Poisson regression, as well as low biases; thus, where the mean number of events was high (22 events/day), full semisymmetric case-crossover designs analyzed with GLMM yielded biases of 7.9%. Furthermore, where the daily mean number of events was low (2/day), full semi-symmetric case-crossover designs analyzed as longitudinal studies showed themselves to be very superior to all the remaining models, not only in terms of bias but also in terms of coverage and statistical efficiency.

Table 2 (available with the electronic version of this article) shows the influence exerted by effect magnitude (from 0 to 0.005) and exposure autocorrelation on the estimates of the different models. Semisymmetric case-crossover and time-series studies with Poisson regression models seemed to be least affected by the magnitude of the effect of the pollutant. In general, case-crossover tended to register biases for low coefficient values, except in semisymmetric case-crossover. Whereas full semisymmetric CC model designs analyzed as longitudinal data displayed moderate biases and optimal coverages for coefficient values below 0.003, bias increased and coverages decreased for high coefficients. When data without autocorrelation were applied, a decrease in bias was in evidence across all case-crossover designs, yet the effect was clearer in case-crossover analyzed as longitudinal data, in which bias disappeared for high coefficients.

## DISCUSSION

The results of this simulation study show that longitudinal approaches applied to case-crossover designs may prove useful for analyzing the acute effects of environmental exposures. Specifically, the full semisymmetric case-crossover design analyzed with generalized linear mixed models applied to scenarios with pollutant effect magnitudes similar to those described in the literature^{33} displays optimal confidence-interval coverage and high statistical efficiency (very superior to that of case-crossover designs analyzed with conditional logistic regression). When applied to scenarios with magnitudes greater than the real values, however, an underestimate of the effect and a decrease in coverage are in evidence, possibly attributable to the effect of autocorrelation present in the exposure variable.

The greatest advantage of case-crossover designs with respect time-series studies with Poisson regression is that time-variable functions are not needed to control for the effect of unmeasured confounding variables. This leads to an automatically control (by design) of any possible confounders (known and unknown). In turn, this avoids the establishment of degrees of freedom nor implies problems of concurvity. Bateson and Schwartz^{8} comment that a price must be paid if case-crossover studies analyzed with conditional logistic regression are to control for trend and seasonality by design and, in addition, are to investigate potential effect modifiers at an individual level. The price referred to lies in the fact that the relative efficiency of semisymmetric case-crossover designs analyzed with conditional logistic regression is only 66% of that of time-series studies with Poisson regression.^{8} In our simulations, in which underestimation of standard errors due to concurvity had already been eliminated, time-series studies with Poisson regression were as much as doubly efficient when compared with case-crossover analyzed with conditional logistic regression. Nevertheless, case-crossover analyzed as longitudinal data can be even more efficient than time-series studies with Poisson regression; thus, in analyzing case-crossover as longitudinal data, one would no longer have to pay the price of efficiency in order for these advantages to be obtained from case-crossover studies.

The principal limitation of the case-crossover design analyzed with generalized linear mixed models is the existence of a sensitivity on the part of the model to autocorrelation present in the pollutant variable. While the effect of autocorrelation on estimates yielded by case-crossover designs analyzed with conditional logistic regression has been studied in depth in previous articles,^{5,34} this effect is not yet known in cases where such designs are analyzed as longitudinal studies. Our data indicate that longitudinal case-crossover designs have a greater sensitivity to autocorrelation than case-crossover with conditional logistic regression and that this effect becomes more pronounced in response to a rise in the magnitude of the effect of the pollutant. For coefficients below 0.003 this effect appears to be small, but it rises sharply thereafter, leading to ensuing low coverage. We feel that new studies are needed to identify the control-selection techniques best suited to reducing the effect of autocorrelation where case-crossover are analyzed as longitudinal data, as has already been done in the case of case-crossover designs analyzed with conditional logistic regression.^{5,34}

Consideration of case-crossover studies with a longitudinal approach and their analysis with mixed models would afford an additional advantage, namely, multilevel analysis, which allows for joint analysis of many cities and so eliminates the need for application of meta-analysis,^{35} pooled analysis^{36} or 2-stage hierarchical normal model methodologies.^{37} In these multilevel case-crossover models, one level could be the city and another the individual, and intercity heterogeneity could be studied by including it as a random effect. This would enable assessment of: (1) the effect of overall dose-response relationships for all the cities as a whole; (2) the forms of possible interaction between climatological variables and pollutants; and (3) the interactions between individual characteristics on the one hand, and climatological variables and pollutants, on the other, for all cities as a whole. The application of mixed models could also be extended to geographical analyses, with parts of cities being deemed intermediate levels of analysis, and areas of highest risk thus being assessed.

Once the influence of autocorrelation on estimates of very marked effects has been controlled for, case-crossover designs analyzed as longitudinal studies could prove a very good alternative to time-series studies with Poisson regression, in that they may well afford a considerable number of additional advantages. Hence, the effects of pollutants and their dose-response relationships could be more specifically ascertained in line with the characteristics of individual subjects (eg, previous disease history) or geographic location (specific areas of cities), with optimal statistical power. This could in turn open the way to a new type of public health guideline addressing the health-related effects of air pollution, targeted at specific population groups or areas of cities.

## ACKNOWLEDGMENTS

We thank William Navidi for his advice and guidance on semisymmetric case-crossover design and Michael Benedict for his comments on the translation.

## REFERENCES

*Thorax*. 2002;57:687–693.

*Int J Epidemiol*. 2001;30:1138–1140.

*Epidemiology*. 2001;12:413–419.

*Circulation*. 2001;103:2810–2815.

*Epidemiology*. 2001;12:193–199.

*Am J Epidemiol*. 1991;133:144–153.

*Biometrics*. 1998;54:596–605.

*Epidemiology*. 1999;10:539–544.

*Am J Epidemiol*. 1995;142:91–98.

*Stat Med*. 1993;12:2333–2341.

*Epidemiology*. 1996;7:231–239.

*Epidemiology*. 2001;12:654–661.

*Epidemiology*. 2002;13:100–105.

*Lifetime Data Anal*. 2000;6:39–58.

*Environmetrics*. 2000;11:689–704.

*Stat Med*. 1999;18:1–15.

*Stat Med*. 2001;20:3039–3049.

*Analysis of Longitudinal Data*. Oxford: Oxford University Press; 1994.

*N Engl J Med*. 2003;348:2508–2516.

*N Engl J Med*. 2003;348:1517–1526.

*Med Care*. 2001;39:158–167.

*Stat Med*. 1995;14:2143–2160.

*Ann Stat*. 1982;10:1100–1120.

*Stat Med*. 1997;16:2029–2047.

*JAMA*. 1989;84:1074–1078.

*J Am Stat Assoc*. 1993;88:9–25.

*J Stat Computation Simulation*. 1993;48:233–243.

*Gac Sanitaria*. 2001;15:347–352.

*Generalized Linear Models*. London: Chapman and Hall; 1989.

*Modern Applied Statistics with S-PLUS*. Oxford: Springer; 1999.

*J Royal Stat Soc A*. 2000;163:263–302.

*Health Effects Institute Res Report*. 2000;14–26.

*Environ Health Perspect*. 2002;110:221–228.

*J Epidemiol Community Health*. 2001;55:423–432.

*Am J Epidemiol*. 2002;156:193–203.