Figueiras, Adolfo*; Carracedo-Martínez, Eduardo*; Saez, Marc†; Taracido, Margarita*
Click on the links below to access all the ArticlePlus for this article.
Please note that ArticlePlus files may launch a viewer application outside of your web browser.
In recent years, keen interest has been shown in the application of case-crossover analyses to the study of the relationship between air pollution and health.1–5 Maclure6 proposed case-crossover designs to study the effects of transient exposures on the occurrence of acute events. In these designs, only cases are sampled, and estimates are based on within-subject comparisons of exposures at the time of and before the effect (“case” and “control” periods, respectively). This design has several advantages. Control subjects do not have to be sampled, thereby reducing study costs and eliminating the source of bias attributable to poor selection of controls. In addiction, when intrasubject comparisons are made, all possible confounders (whether measured or not), not modified between the control and case periods, are automatically controlled for by design, since they remain constant for each subject.6–8
Initially, variations of case-crossover designs were proposed in which the control periods were in all cases situated before the date of the effect.9,10 When these designs are applied to exposures exhibiting a time trend, risk estimates could be confounded by the time trend in such an exposure.11 Indeed, this might well be the situation in studies addressing the effect of air pollutants on mortality and morbidity. Navidi7 proposed bidirectional case-crossover, in which “case” exposures are compared with “control” exposures selected both before and after the event, as well as 2 modes of control-period selection (full-stratum bidirectional case-crossover and random bidirectional case-crossover). This modification proved adequate to control for trend in exposure, but not for confounding due to unobserved variables with a systematic temporal component, which are the main source of seasonality in mortality and morbidity studies.8
Bateson and Schwartz8 proposed symmetric bidirectional case-crossover as a new way of selecting “control” periods, with a 1-day control period being chosen a week before and after the event. In their simulation they observed that symmetric case-crossover controls for trend and seasonality in unmeasured confounding variables.8 In subsequent simulation analyses,12 the same authors reported that symmetric case-crossover may display biases attributable both to selection (subjects at the beginning or end of the series have fewer control periods, and these are asymmetric) and confounding (where exposure and effect share the same short-term temporal pattern); however, when the control periods are reduced to 1 week, both biases are minimized.12 According to a later simulation analysis by Navidi et al,13 symmetric case-crossover may nevertheless exhibit biases in cases where exposure is characterized by a time trend. These authors therefore proposed a new design, called semisymmetric bidirectional case-crossover, in which a single control period is randomly chosen one week before or after the event.13 To analyze such a study they applied a likelihood function derived from conditional logistic regression adapted to the sampling method used.13,14 In their simulations, Navidi et al13 have found that semi-symmetric bidirectional case-crossover affords good control of exposure trends and unobserved confounding variables. Moreover, according to Levy et al,5 in designs in which the control periods are selected at fixed intervals before or after the case event, only one data configuration is possible. Under these conditions, the likelihood used by conditional logistic regression is not the correct one, and may give rise to certain biases in the estimates. The authors report that this effect does not exist where the cases may occur at any point in the window from which the controls are chosen, as in the time-stratified case-crossover proposed by Lumley et al15 This design is based on dividing the study time a priori into fixed strata (eg, calendar months) and selecting the control days from among the remaining days of the stratum.
In all previous designs case-crossover studies were analyzed as matched case-control studies, as initially envisaged by McClure,6 using conditional logistic regression7,8,12 or derivations.13 This gives rise to 2 limitations. Conditional logistic regression-based analysis can lead to biased estimates in cases where exposure displays a time trend or seasonality,13,16,17 and the likelihood used by conditional logistic regression is not correct.5,15 Secondly, because this is a marginal approach in which all the information available, particularly within-subject correlation, is not used, it entails an important loss of statistical efficiency vis-à-vis time-series studies with Poisson regression studies.8
From an epidemiologic standpoint, case-crossover designs can be regarded as the observational equivalent of clinical crossover trials, with the difference that only a sample of the population base is studied.6 A feature common to all case-crossover designs is that each subject has measures of exposure and effect variables at different points in time. The period of time that elapses between the first and last assessments can be viewed as the follow-up period during which each subject's exposure and/or effect is modified. Indeed, when Maclure6 first proposed the case-crossover design he linked it to cohort studies, describing it as “the counterpart to a cohort study with crossover of subjects between periods of exposure and non exposure.” The principal difference between case-crossover studies and traditional cohort designs is that for case-crossover designs only those subjects that have suffered the event are sampled.
From a statistical point of view, case-crossover designs can be regarded as longitudinal studies since “each individual is repeatedly measured over time,”18 and this measure may be prospective or retrospective.18 One of the characteristics of longitudinal data is their high statistical efficiency18 and the fact that each subject acts as his own control, thereby eliminating the influence of genetic factors or habits that remain constant over time.18 Thanks to the advantages afforded by longitudinal analysis techniques, application of this analyses techniques to cohort studies,19,20 parallel clinical trials21 or crossover trials is very widespread.18 If case-crossover studies are analyzed as longitudinal data, the application of random effects in the independent term enables heterogeneity to be eliminated among the various subject—heterogeneity that may stem from unmeasured confounding variables such as trend and seasonality. On the other hand, the statistical efficiency of such conditional models is very high, because—unlike conditional logistic regression—the fact that the study has repeated measures is taken into account. The other possibility is for case-crossover studies to be considered as cohort studies in which the explanatory variables are dependent on time, to analyze them as survival models.
The precise statistical approach best suited to analyzing case-crossover designs as follow-up studies is not known, and so we suggest to evaluate various analytic options. We conducted a simulation study in which we compared the results of analyzing case-crossover designs using longitudinal analysis methods (generalized linear mixed models; GLMM), survival analysis methods (Andersen-Gill approach to proportional hazards regression) and methods used in matched case–control designs (conditional logistic regression). A further comparison was run between the results obtained and those yielded by time-series studies with Poisson regression.
The simulation framework is taken from a scenario used in earlier studies (Bateson et al8 and Navidi et al13) to compare our results with previous studies we constructed a 3-year (1096-day) series in which the number of events per day followed a Poisson distribution, as shown in the following model:
Equation (Uncited)Image Tools
where Yt denotes the number of events that can follow a mean of 2 or 22, and β1 denotes the log RR of Y associated with increases in particulates of less than 10 μm in diameter (PM10). As in other studies,13 β1 was set at 0.001, though other scenarios were also assessed with β1 assigned values from 0 to 0.005. The PM10 series was drawn from data recorded for the city of Barcelona, Spain, during the period 1995–1997 (Fig. 1A). In expression (1), trend is an unobserved confounding variable that has a seasonal and trend component (Fig. 1B), was used in other simulation studies,8,13 and has the following formula:
In Eq 1, coefficient β2 assumes a value of 0.1 or 0 as respectively established by scenarios with or without confounding due to unobserved variables. To assess the effect of PM10 autocorrelation on the estimates, we created a new time series without autocorrelation. To this end, the original PM10 series was repermuted (random reassortment of existing data) on the basis of the original data series.
On the basis of the time series generated by Eq 1, data were obtained using different structures. First, we used a symmetric case-crossover design having 2 control days, one 7 days before and the other 7 days after the event,8 with subjects being excluded from the first and last 7 days of the series to avoid selection bias.12 We also employed a semisymmetric case-crossover design,13 in which a single control day was chosen at random 7 days before or after the event (Table 1). When conditional logistic regression was applied to this design, weighting was performed to avoid the influence, which is exerted by subjects failing near the beginning or end of the observation time and which results in only one of the 2 potential control times being observable.13 In the semisymmetric design, a stratum is formed for each case by choosing a control time, with equal probability, before or after the referent time. This design calls for the use of a likelihood function altered to achieve a weighted conditional logistic regression,13 but the procedure is nevertheless difficult to implement with standard software. We therefore created the effect of weighting by means of sampling, so as to be able to use standard conditional logistic regression: where a case in any given stratum had a weighting of 0.5, a second control was added on the same day as that on which the case had occurred; and where a control in any given stratum had a weighting of 0.5, a second case was added on the same day as that on which the first case had occurred.
Third, time-stratified case-crossover was applied.5,6 To this end, calendar months were taken as strata, and controls for each case day were taken as being all those days of the same stratum (month) having the same day of the week as the referent day, for example, where the case day was Saturday 9 December 1995, the control days would be the remaining Saturdays in December.
Fourth, we used a full-symmetric case-crossover design, in which the 7 days before and 7 days after the event were chosen as the control period (Figure available with the electronic version of this article). Finally, we examined a full semisymmetric case-crossover design, for which a control period of 14 days before or after the event was chosen at random. A control period of 14 days was used so that the statistical power would be comparable with that of the full-symmetric design. The S-Plus syntax for implementing these samplings will be made available to any reader who requests it from the corresponding author.
Data were analyzed using different methods. In the first place, we applied an ecological time series study, using generalized additive models with Poisson response, and smoothing splines with 7 degrees of freedom per year as smoothers.6 For model estimation purposes, the convergence criteria were modified as proposed by Dominici et al,6 and the standard errors were calculated using the gam.exact function.22
Data transformed by the case-crossover–structure approach were analyzed as matched case–control studies, using exact conditional logistic regression23 generalized to more than one control period, without weighting. We also analyzed these data following a proportional hazards model, with the possibility of recurrent events and time-dependent explanatory variables. We used the Andersen–Gill approach to proportional hazards regression for this purpose.24–26 This approach deems every observation for every individual an independent unit of analysis, albeit dependent on explanatory variables. Under such conditions there is no guarantee of the efficiency of the estimates. Consequently, standard errors were robustly estimated by grouping jackknife estimates.26,27
GLMMs,28–30 which are an extension of generalized linear models,31 also were applied to allow for additional sources of variability caused by unobservable random effects. GLMMs describe the relationship between a variable response and covariables in data that are grouped by one or more factors (in case-crossover studies the groupings would be the individuals) and are thus very useful for analyzing longitudinal data, repeated measures or multilevel data. The GLMMs are estimated using the penalized pseudo-likelihood method.28 Notwithstanding the fact that the estimators of the parameters are consistent and efficient,29 the estimation of the variance of the fixed effects may be slightly biased.30
A binary-response GLMM was constructed including, as fixed effects, the independent term, exposure and individual trend, and as random effect, the independent term. We defined the individual trend as the calendar time of follow-up for each individual, by taking the value of 1 for the first day of follow-up for each such individual and so on successively, until 15. Inclusion of the independent term as a random effect in the GLMM enables the initial heterogeneity among subjects to be controlled for, and its application to a case-crossover design of acute effects of environmental exposures could enable trend and seasonality to be controlled for. Individual trend was included as a fixed effect with the aim of eliminating any possible confounding due to exposure trend. Similarly, to eliminate the influence of autocorrelation on exposure, the inclusion of a first-order autocorrelation variance structure was evaluated.
Using the S-Plus 6.0 statistical software package (Insightful, Seattle, WA), 1000 replicates were generated and analyzed for each of the scenarios. To ensure that the results of the different scenarios would be comparable, we used the same set of randomization seeds (consisting of the replicate number, ie, 1 to 1000) in all scenarios. The glmmPQL function was obtained from the Venables and Ripley MASS Library (4th ed.).32
Expression of Results
For each of the simulations, we calculated: (1) the percentage of mean bias with respect to the coefficient's true value, (Axβ̂ − β) · 100/β; (2) the standard deviation of the coefficients, which in turn enabled us to approximate the dispersion and instability of the estimates of Axβ̂; (3) the percentage of coverage, which consists of calculating the percentage of replicates that include the true value of the coefficient within their 95% confidence interval (CI). This measure is complementary to bias because even though mean bias is low, if the calculation of the coefficients is very imprecise (great dispersion) the percentage of coverage may well be small. In addition, the percentage of coverage tends to indicate whether the standard errors of the coefficients have been properly calculated. Obviously, percentage coverage must be equal to or higher than 95%; and (4) the percentage of replicates that rejected the null hypothesis. We felt that the measure of coverage should be complemented by an indicator of statistical power, in view of the fact that a very high percentage of coverage could be attained thanks to overestimated standard errors, which result in wide 95% confidence interval and ensuing low statistical power. As is the case with coverage, the best model is that which has the greatest statistical power.
Figures 1 and 2 set out the simulation conditions. Figure 1A shows the PM10 sequence that was used as an independent variable to generate the replicates. Shown in Figure 1B is the confounding variable with trend and seasonality. Figure 2A is the graph of close autocorrelations (1–30 days) and Figure 2B that of distant autocorrelations (28–364 days, at intervals of 28 days) in respect of PM10. In Barcelona, the PM10 variable plotted a statistically significant negative trend (P < 0.001) of −5.96 10−3. Furthermore, the confounding variable had a correlation of 0.3670 with PM10.
From Table 1, it can be seen that case-crossover designs (whether symmetric, semi-symmetric or time-stratified) analyzed with conditional logistic regression achieved coverage percentages of around 95%, though with a statistical efficiency that in some cases was half that of time-series studies with Poisson regression (23.5 vs. 61.4%). Biases in the case of symmetric and time-stratified designs were higher than in the case of their semisymmetric counterparts and appeared to display a certain sensitivity to seasonal confounding. When case-crossover designs were analyzed by means of survival models, they did not appear to afford any advantage over matched case–control models (data not shown). Yet, when the longitudinal GLMM design was applied, whether full-symmetric or full semisymmetric, this generally ensured attainment of high coverage percentages, a statistical power very much higher than that of conditional logistic regression and time-series studies with Poisson regression, as well as low biases; thus, where the mean number of events was high (22 events/day), full semisymmetric case-crossover designs analyzed with GLMM yielded biases of 7.9%. Furthermore, where the daily mean number of events was low (2/day), full semi-symmetric case-crossover designs analyzed as longitudinal studies showed themselves to be very superior to all the remaining models, not only in terms of bias but also in terms of coverage and statistical efficiency.
Table 2 (available with the electronic version of this article) shows the influence exerted by effect magnitude (from 0 to 0.005) and exposure autocorrelation on the estimates of the different models. Semisymmetric case-crossover and time-series studies with Poisson regression models seemed to be least affected by the magnitude of the effect of the pollutant. In general, case-crossover tended to register biases for low coefficient values, except in semisymmetric case-crossover. Whereas full semisymmetric CC model designs analyzed as longitudinal data displayed moderate biases and optimal coverages for coefficient values below 0.003, bias increased and coverages decreased for high coefficients. When data without autocorrelation were applied, a decrease in bias was in evidence across all case-crossover designs, yet the effect was clearer in case-crossover analyzed as longitudinal data, in which bias disappeared for high coefficients.
The results of this simulation study show that longitudinal approaches applied to case-crossover designs may prove useful for analyzing the acute effects of environmental exposures. Specifically, the full semisymmetric case-crossover design analyzed with generalized linear mixed models applied to scenarios with pollutant effect magnitudes similar to those described in the literature33 displays optimal confidence-interval coverage and high statistical efficiency (very superior to that of case-crossover designs analyzed with conditional logistic regression). When applied to scenarios with magnitudes greater than the real values, however, an underestimate of the effect and a decrease in coverage are in evidence, possibly attributable to the effect of autocorrelation present in the exposure variable.
The greatest advantage of case-crossover designs with respect time-series studies with Poisson regression is that time-variable functions are not needed to control for the effect of unmeasured confounding variables. This leads to an automatically control (by design) of any possible confounders (known and unknown). In turn, this avoids the establishment of degrees of freedom nor implies problems of concurvity. Bateson and Schwartz8 comment that a price must be paid if case-crossover studies analyzed with conditional logistic regression are to control for trend and seasonality by design and, in addition, are to investigate potential effect modifiers at an individual level. The price referred to lies in the fact that the relative efficiency of semisymmetric case-crossover designs analyzed with conditional logistic regression is only 66% of that of time-series studies with Poisson regression.8 In our simulations, in which underestimation of standard errors due to concurvity had already been eliminated, time-series studies with Poisson regression were as much as doubly efficient when compared with case-crossover analyzed with conditional logistic regression. Nevertheless, case-crossover analyzed as longitudinal data can be even more efficient than time-series studies with Poisson regression; thus, in analyzing case-crossover as longitudinal data, one would no longer have to pay the price of efficiency in order for these advantages to be obtained from case-crossover studies.
The principal limitation of the case-crossover design analyzed with generalized linear mixed models is the existence of a sensitivity on the part of the model to autocorrelation present in the pollutant variable. While the effect of autocorrelation on estimates yielded by case-crossover designs analyzed with conditional logistic regression has been studied in depth in previous articles,5,34 this effect is not yet known in cases where such designs are analyzed as longitudinal studies. Our data indicate that longitudinal case-crossover designs have a greater sensitivity to autocorrelation than case-crossover with conditional logistic regression and that this effect becomes more pronounced in response to a rise in the magnitude of the effect of the pollutant. For coefficients below 0.003 this effect appears to be small, but it rises sharply thereafter, leading to ensuing low coverage. We feel that new studies are needed to identify the control-selection techniques best suited to reducing the effect of autocorrelation where case-crossover are analyzed as longitudinal data, as has already been done in the case of case-crossover designs analyzed with conditional logistic regression.5,34
Consideration of case-crossover studies with a longitudinal approach and their analysis with mixed models would afford an additional advantage, namely, multilevel analysis, which allows for joint analysis of many cities and so eliminates the need for application of meta-analysis,35 pooled analysis36 or 2-stage hierarchical normal model methodologies.37 In these multilevel case-crossover models, one level could be the city and another the individual, and intercity heterogeneity could be studied by including it as a random effect. This would enable assessment of: (1) the effect of overall dose-response relationships for all the cities as a whole; (2) the forms of possible interaction between climatological variables and pollutants; and (3) the interactions between individual characteristics on the one hand, and climatological variables and pollutants, on the other, for all cities as a whole. The application of mixed models could also be extended to geographical analyses, with parts of cities being deemed intermediate levels of analysis, and areas of highest risk thus being assessed.
Once the influence of autocorrelation on estimates of very marked effects has been controlled for, case-crossover designs analyzed as longitudinal studies could prove a very good alternative to time-series studies with Poisson regression, in that they may well afford a considerable number of additional advantages. Hence, the effects of pollutants and their dose-response relationships could be more specifically ascertained in line with the characteristics of individual subjects (eg, previous disease history) or geographic location (specific areas of cities), with optimal statistical power. This could in turn open the way to a new type of public health guideline addressing the health-related effects of air pollution, targeted at specific population groups or areas of cities.
We thank William Navidi for his advice and guidance on semisymmetric case-crossover design and Michael Benedict for his comments on the translation.
1.Sunyer J, Basagana X, Belmonte J, et al. Effect of nitrogen dioxide and ozone on the risk of dying in patients with severe asthma. Thorax. 2002;57:687–693.
2.Sunyer J, Basagana X. Particles, and not gases, are associated with the risk of death in patients with chronic obstructive pulmonary disease. Int J Epidemiol. 2001;30:1138–1140.
3.Kwon HJ, Cho SH, Nyberg F, et al. Effects of ambient air pollution on daily mortality in a cohort of patients with congestive heart failure. Epidemiology. 2001;12:413–419.
4.Peters A, Dockery DW, Muller JE, et al. Increased particulate air pollution and the triggering of myocardial infarction. Circulation. 2001;103:2810–2815.
5.Levy D, Sheppard L, Checkoway H, et al. A case-crossover analysis of particulate matter air pollution and out-of-hospital primary cardiac arrest. Epidemiology. 2001;12:193–199.
6.Maclure M. The case-crossover design: a method for studying transient effects on the risk of acute events. Am J Epidemiol. 1991;133:144–153.
7.Navidi W. Bidirectional case-crossover designs for exposures with time trends. Biometrics. 1998;54:596–605.
8.Bateson TF, Schwartz J. Control for seasonal variation and time trend in case-crossover studies of acute effects of environmental exposures. Epidemiology. 1999;10:539–544.
9.Mittleman MA, Maclure M, Robins JM. Control sampling strategies for case-crossover studies: an assessment of relative efficiency. Am J Epidemiol. 1995;142:91–98.
10.Marshall RJ, Jackson RT. Analysis of case-crossover designs. Stat Med. 1993;12:2333–2341.
11.Greenland S. Confounding and exposure trends in case-crossover and case-time-control designs. Epidemiology. 1996;7:231–239.
12.Bateson TF, Schwartz J. Selection bias and confounding in case-crossover analyses of environmental time-series data. Epidemiology. 2001;12:654–661.
13.Navidi W, Weinhandl E. Risk set sampling for case-crossover designs. Epidemiology. 2002;13:100–105.
14.Borgan O, Langholz B, Samuelsen SO, et al. Exposure stratified case-cohort designs. Lifetime Data Anal. 2000;6:39–58.
15.Lumley T, Levy D. Bias in the case-crossover design: implications for studies of air pollution. Environmetrics. 2000;11:689–704.
16.Greenland S. A unified approach to the analysis of case-distribution (case-only) studies. Stat Med. 1999;18:1–15.
17.Vines SK, Farrington CP. Within-subject exposure dependency in case-crossover studies. Stat Med. 2001;20:3039–3049.
18.Diggle PJ, Liang KY, Zeger SL. Analysis of Longitudinal Data. Oxford: Oxford University Press; 1994.
19.Verghese J, Lipton RB, Katz MJ, et al. Leisure activities and the risk of dementia in the elderly. N Engl J Med. 2003;348:2508–2516.
20.Canfield RL, Henderson CR Jr., Cory-Slechta DA, et al. Intellectual impairment in children with blood lead concentrations below 10 microg per deciliter. N Engl J Med. 2003;348:1517–1526.
21.Figueiras A, Sastre I, Tato F, et al. One-to-one versus group sessions to improve prescription in primary care: a pragmatic randomized controlled trial. Med Care. 2001;39:158–167.
23.Mehta CR, Patel NR. Exact logistic regression: theory and examples. Stat Med. 1995;14:2143–2160.
24.Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann Stat. 1982;10:1100–1120.
25.Barceló MA. Modelos de fragilidad en el análisis multivariante de supervivencia. PhD Dissertation. Girona: Servei de Publicacions, Universitat de Girona; 2001.
26.Therneau TM, Hamilton SA. rhDNase as an example of recurrent event analysis. Stat Med. 1997;16:2029–2047.
27.Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. JAMA. 1989;84:1074–1078.
28.Breslow NE, Clayton DG. Approximate inference in Generalized Linear Mixed Models. J Am Stat Assoc. 1993;88:9–25.
29.Wolfinger R, O'Connell M. Generalized linear mixed models: a pseudo-likelihood approach. J Stat Computation Simulation. 1993;48:233–243.
30.Saez M. El problema de las medidas repetidas. Análisis longitudinal en epidemiología. Gac Sanitaria. 2001;15:347–352.
31.McCullagh P, Nelder JA. Generalized Linear Models. London: Chapman and Hall; 1989.
32.Venables WN, Ripley BD. Modern Applied Statistics with S-PLUS. Oxford: Springer; 1999.
33.Dominici F, Zeger SL, Samet JM. Combining Evidence on Air Pollution and Daily Mortality from the Largest 20 US cities: a hierarchical modeling strategy. J Royal Stat Soc A. 2000;163:263–302.
34.Checkoway H, Levy D, Sheppard L, et al. A case-crossover analysis of fine particulate matter air pollution and out-of-hospital sudden cardiac arrest. Health Effects Institute Res Report. 2000;14–26.
35.Saez M, Ballester F, Barcelo MA, et al. A combined analysis of the short-term effects of photochemical air pollutants on mortality within the EMECAM project. Environ Health Perspect. 2002;110:221–228.
36.Saez M, Figueiras A, Ballester F, et al. Comparing meta-analysis and ecological-longitudinal analysis in time-series studies. A case study of the effects of air pollution on mortality in three Spanish cities. J Epidemiol Community Health. 2001;55:423–432.
37.Dominici F, McDermott A, Zeger SL, et al. On the use of generalized additive models in time-series studies of air pollution and health. Am J Epidemiol. 2002;156:193–203.
© 2005 Lippincott Williams & Wilkins, Inc.