Analogous results are shown in the Table and Figure 2 for σ2 = 0.1, representing a situation where one of the geographic covariates is less variable in the distribution of monitoring locations than are the other geographic covariates. The smaller value of σ2 results in more variability in estimating α3, but this parameter is still estimated well and is statistically significant in 83% of Monte Carlo simulations. There is clear improvement in the exposure predictions from using the full model with RW2 at least 0.67 in 95% of simulations, as compared with the misspecified model with RW2′ consistently near 0.50. But in this situation, the health effect estimates are more precise when we use the misspecified exposure model, with the standard deviation of β̂′X equal to 0.16, compared with 0.23 for β̂X using the fully specified model. The misspecified model also results in less bias and a modest improvement in coverage probability.
We vary the number of subjects as well as σ2 and summarize the results in Figure 3 by plotting, against N on the horizontal axis, the difference between the standard deviation of β̂′X (based on the misspecified exposure model) and that of β̂X (based on the correct exposure model); a positive difference indicates that the correctly specified model is more efficient. We restrict to 5000 Monte Carlo simulations because this is sufficient to estimate the standard deviations (the biases are smaller and would require more Monte Carlo simulations to estimate reliably). The difference is positive for σ2 = 1.0 and 4.0, consistent with the prior expectation that more accurate exposure predictions result in more efficient health effect estimation. But it is negative for σ2 = 0.1 except for the case where there are only N = 100 subjects, demonstrating that in larger health studies the misspecified exposure model results in more efficient health effect estimation, even though it gives less accurate exposure predictions. For all simulations, the average out-of-sample exposure model prediction accuracies are R̄W2 between 0.73 and 0.75 for the correctly specified model and R̄W2′ between 0.49 and 0.50 for the misspecified model that omits the third geographic covariate.
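The comparison described above can be sketched in a few lines of Python with NumPy. This is an illustrative reconstruction, not the authors' code, and every numerical setting in it is an assumption made for the sketch: 100 monitoring locations, unit coefficients on all 3 geographic covariates, unit-variance η, a true βX of 2, a standard deviation of 5 for ϵ, and an out-of-sample R2 defined as 1 minus mean squared prediction error over the variance of the true exposure. The function names are hypothetical.

```python
import numpy as np


def ols(design, y):
    """Least-squares coefficients for y ~ design (design includes an intercept column)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def one_replicate(rng, n_subjects, sigma2, n_monitors=100, beta_x=2.0, sd_eps=5.0):
    """Simulate one data set; return (out-of-sample R2, beta_x estimate) per exposure model.

    All parameter values are assumptions chosen for illustration.
    """
    alpha = np.array([0.0, 1.0, 1.0, 1.0])  # assumed intercept and covariate effects

    # Monitoring locations: the third covariate has variance sigma2.
    s_mon = rng.normal(size=(n_monitors, 3)) * np.array([1.0, 1.0, np.sqrt(sigma2)])
    x_mon = alpha[0] + s_mon @ alpha[1:] + rng.normal(size=n_monitors)

    # Subject locations: all covariates have unit variance.
    s_sub = rng.normal(size=(n_subjects, 3))
    x_sub = alpha[0] + s_sub @ alpha[1:] + rng.normal(size=n_subjects)
    y = 1.0 + beta_x * x_sub + sd_eps * rng.normal(size=n_subjects)

    out = {}
    for name, k in (("full", 3), ("misspecified", 2)):
        # Stage 1: fit the exposure model on monitoring data using the first k covariates.
        a_hat = ols(np.column_stack([np.ones(n_monitors), s_mon[:, :k]]), x_mon)
        w_hat = a_hat[0] + s_sub[:, :k] @ a_hat[1:]
        r2 = 1.0 - np.mean((x_sub - w_hat) ** 2) / np.var(x_sub)
        # Stage 2: regress the health outcome on the predicted exposure.
        b_hat = ols(np.column_stack([np.ones(n_subjects), w_hat]), y)[1]
        out[name] = (r2, b_hat)
    return out


def summarize(n_subjects, sigma2, n_sims=200, seed=0):
    """Monte Carlo mean R2 and standard deviation of the beta_x estimate for both models."""
    rng = np.random.default_rng(seed)
    reps = [one_replicate(rng, n_subjects, sigma2) for _ in range(n_sims)]
    summary = {}
    for name in ("full", "misspecified"):
        r2 = np.array([rep[name][0] for rep in reps])
        b = np.array([rep[name][1] for rep in reps])
        summary[name] = {"mean_r2": float(r2.mean()), "sd_beta": float(b.std(ddof=1))}
    return summary


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        res = summarize(n_subjects=n, sigma2=0.1)
        diff = res["misspecified"]["sd_beta"] - res["full"]["sd_beta"]
        print(n, round(diff, 3))
```

Under these assumed settings, the printed difference plays the role of the vertical axis in Figure 3; the number of replicates is kept small for speed and would need to be increased for stable estimates.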
THEORETICAL INTERPRETATION IN A MEASUREMENT ERROR FRAMEWORK
The results of our simulation study seem paradoxical in that more accurate exposure predictions do not necessarily lead to improved health effect estimation. The Table shows that for σ2 = 0.1, the correctly specified model consistently gives more variable exposure predictions and more accurate out-of-sample prediction, compared with the misspecified exposure model. However, a substantial part of the additional exposure variability is induced by error in estimating α3, which leads to less efficient estimation of βX. These findings can be understood in a theoretical context by referring to the statistical measurement error framework developed for this setting.11,12
Briefly, for a fairly general class of exposure models there are 2 components to the measurement error. The Berkson-like component of error results from smoothing the exposure surface using a model that may not account for all sources of variation and can be thought of as the part of the true exposure that is not predictable from the model. It is similar to standard Berkson error16 in that it inflates the standard deviation of the health effect estimate and introduces little or no bias. However, it is different from Berkson error in that it is correlated in space and is not completely independent of the predicted exposures.11,12 The classical-like component comes from uncertainty in estimating the exposure model parameters. It is similar to classical measurement error in that it is a source of variability in the predicted exposures and can introduce bias in health effect estimates as well as change their standard errors. The classical-like component is also different from classical measurement error in that the additional variability from exposure model parameter estimation is shared across all prediction locations rather than being independent.11
For the simple land-use regression exposure model considered here, the Berkson-like component is pure Berkson error because there is no spatial dependence structure in η and η* and the Sij and S*ij are independent. When we use the correctly specified exposure model, the Berkson error is just η, but misspecifying the model by omitting the third geographic covariate increases the Berkson error substantially, resulting in a degradation of prediction accuracy. However, Berkson error plays the same role mathematically as the random ϵ in the disease model, and so its impact on the health effect estimation error diminishes for large N. On the other hand, each coefficient that needs to be estimated in the exposure model contributes to the classical-like error, and this part of the error remains important regardless of the number of subjects. In some situations, this could result in a bias-variance tradeoff because classical-like error induces bias while Berkson-like error does not.
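For concreteness, the two error components can be written out for a linear land-use regression of this kind. The display below is a sketch under assumptions not stated explicitly in this section: the exposure model is taken to be linear in 3 mean-zero, mutually independent geographic covariates, Ŵ denotes the predicted exposure, and primes mark quantities from the misspecified model.

```latex
\begin{align*}
X_i &= \alpha_0 + \sum_{j=1}^{3} \alpha_j S_{ij} + \eta_i
  && \text{(true exposure for subject $i$)} \\
X_i - \hat{W}_i &= \underbrace{\eta_i}_{\text{Berkson}}
  + \underbrace{(\alpha_0 - \hat{\alpha}_0)
  + \sum_{j=1}^{3} (\alpha_j - \hat{\alpha}_j)\, S_{ij}}_{\text{classical-like}}
  && \text{(full model)} \\
X_i - \hat{W}'_i &= \underbrace{\eta_i + \alpha_3 S_{i3}}_{\text{Berkson}}
  + \underbrace{(\alpha_0 - \hat{\alpha}'_0)
  + \sum_{j=1}^{2} (\alpha_j - \hat{\alpha}'_j)\, S_{ij}}_{\text{classical-like}}
  && \text{(misspecified model)}
\end{align*}
```

Written this way, omitting the third covariate moves α3 Si3 into the Berkson part, which is independent across subjects, and removes the term (α3 − α̂3) Si3 from the classical-like part, which is shared across all subjects through the common estimate α̂3.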
It turns out that for σ2 = 0.1 in the monitoring data, we get relatively variable estimates of α3 when using the full exposure model, while still improving out-of-sample prediction accuracy at subjects' locations. This results in substantial classical-like measurement error that (for sufficiently large N) is more important than the additional Berkson error that is introduced by omitting the corresponding geographic covariate. There is very little bias in any of our simulations, and so the dominant classical-like error primarily results in more variable estimates of βX.
IMPLICATIONS FOR FUTURE RESEARCH
We have shown a class of examples in which more accurate exposure prediction does not lead to improved health effect estimation. It bears emphasis that this does not result from overfitting the exposure model, at least not as overfitting is traditionally understood for prediction models.13 In all cases, using the correctly specified model that includes all 3 geographic covariates leads to improved prediction accuracy, as measured by out-of-sample RW2 evaluated at subjects' locations.
Our findings have important implications for the design and analysis of environmental epidemiologic studies. Development of models for exposure prediction and health effect estimation should be considered simultaneously, in contrast with the current practice of first selecting an exposure model to optimize prediction accuracy and then using the resulting predictions for health effect estimation. Recent papers that address measurement error in air pollution cohort studies represent progress in this direction.11,12,14,17 Our results do not necessarily suggest employing a joint statistical estimation model for the exposure and health parameters in which the health data would influence estimation of the exposure model parameters. The issue we have highlighted relates more directly to model selection than to parameter estimation.
There is extensive literature on penalization and other methods for optimizing accuracy of prediction models,13 but these techniques are not directly applicable because better prediction accuracy may induce less precise health effect estimation. New statistical methodology is needed to select exposure models to optimize efficiency of health effects inference, perhaps involving alternative forms of penalization that account for the structure in both the monitoring and health outcome data. It is also worth exploring asymptotic methods to estimate the bias and variance of β̂X to select optimal geographic covariates, particularly when there is a relatively large number of monitoring locations compared with the geographic covariates.
The relative benefits of various air-pollution exposure models depend on the variability of geographic covariates in the subject population and monitor locations, and on the size of the cohort. It is evident that study design can be improved by accounting for statistical issues at the intersection of exposure prediction and health effect estimation. All else being equal, it is preferable to design an exposure monitoring campaign to maximize the variability of pertinent geographic covariates across monitor locations. A location-allocation algorithm may be useful for optimizing the monitoring design to predict exposures in an epidemiology study with known subjects' locations.18
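The design recommendation follows from a standard property of least squares: the standard error of a regression coefficient shrinks as the corresponding covariate becomes more variable across the sampled locations. The following Python sketch illustrates this for the third geographic covariate. The settings (100 monitors, unit-variance η, standard normal draws for the other covariates) and the function name are assumptions made for illustration.

```python
import numpy as np


def slope_standard_error(n_monitors, sigma2, sd_eta=1.0, seed=0):
    """Model-based standard error of the third slope in an OLS exposure model.

    The third covariate is drawn with variance sigma2 at the monitoring
    locations; the other two have unit variance (assumed values).
    """
    rng = np.random.default_rng(seed)
    scales = np.array([1.0, 1.0, np.sqrt(sigma2)])
    s_mon = rng.normal(size=(n_monitors, 3)) * scales
    design = np.column_stack([np.ones(n_monitors), s_mon])
    # Covariance of the OLS estimates: sd_eta^2 * (X'X)^{-1}
    cov = sd_eta ** 2 * np.linalg.inv(design.T @ design)
    return float(np.sqrt(cov[3, 3]))


for sigma2 in (0.1, 1.0, 4.0):
    print(sigma2, round(slope_standard_error(100, sigma2), 3))
```

The standard error scales roughly as 1/sqrt(N* σ2), with N* here denoting the number of monitors, so spreading monitors over a wider range of the covariate reduces the classical-like error contributed by its coefficient.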
We have considered only the relatively simple setting of a linear disease model with an exposure model that is land-use regression with independent geographic covariates. Even in this case, we have shown that more accurate exposure prediction does not necessarily lead to improved health effect estimation. We expect that similar phenomena can occur in other settings, but further research is needed to identify general conditions and assess the implications of more complex situations.
We thank 3 anonymous referees for their valuable suggestions and Sverre Vedal for helpful comments on a draft of this manuscript.
1. Yanosky JD, Paciorek CJ, Suh H. Predicting chronic fine and coarse particulate exposure using spatio-temporal models for the northeastern and midwestern United States. Environ Health Perspect.
2. Szpiro AA, Sampson PD, Sheppard L, Lumley T, Adar SD, Kaufman JD. Predicting intra-urban variation in air pollution concentrations with complex spatio-temporal dependencies. Environmetrics.
3. Brauer M. How much, how long, what, and where: Air pollution exposure assessment for epidemiologic studies of respiratory disease. Proc Am Thoracic Soc.
4. Fanshawe TR, Diggle PJ, Rushton S, et al. Modelling spatio-temporal variation in exposure to particulate matter: a two-stage approach. Environmetrics.
5. Su JG, Jerrett M, Beckerman B, Wilhelm M, Ghosh JK, Ritz B. Predicting traffic-related air pollution in Los Angeles using a distance decay regression selection strategy. Environ Res.
6. Jerrett M, Arain A, Kanaroglou P, et al. A review and evaluation of intraurban air pollution exposure models. J Expo Anal Environ Epidemiol.
7. Hoek G, Beelen R, de Hoogh K, et al. A review of land-use regression models to assess spatial variation in outdoor air pollution. Atmos Environ.
8. Jerrett M, Burnett RT, Ma R, et al. Spatial analysis of air pollution mortality in Los Angeles. Epidemiology.
9. Kunzli N, Jerrett M, Mack WJ, et al. Ambient air pollution and atherosclerosis in Los Angeles. Environ Health Perspect.
10. Puett RC, Hart JE, Yanosky JD, et al. Chronic fine and coarse particulate exposure, mortality, and coronary heart disease in the Nurses' Health Study. Environ Health Perspect.
11. Szpiro AA, Sheppard L, Lumley T. Efficient measurement error correction for spatially misaligned data. Biostatistics. In Press.
12. Gryparis A, Paciorek CJ, Zeka A, et al. Measurement error caused by spatial misalignment in environmental epidemiology. Biostatistics.
13. Hastie T, Tibshirani R, Friedman J. Elements of Statistical Learning. New York: Springer; 2001.
14. Kim SY, Sheppard L, Kim H. Health effects of long-term air pollution: influence of exposure prediction methods. Epidemiology.
15. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2010. ISBN 3-900051-07-0.
16. Carroll RJ, Ruppert D, Stefanski LA, et al. Measurement Error in Nonlinear Models: A Modern Perspective. 2nd ed. Boca Raton, FL: Chapman and Hall/CRC; 2006.
17. Madsen L, Ruppert D, Altman NS. Regression with spatially misaligned data. Environmetrics.
18. Kanaroglou PS, Jerrett M, Morrison J, et al. Establishing an air pollution monitoring network for intraurban population exposure assessment: A location-allocation approach. Atmos Environ.
© 2011 Lippincott Williams & Wilkins, Inc.