# Does More Accurate Exposure Prediction Necessarily Improve Health Effect Estimates?

A unique challenge in air pollution cohort studies and similar applications in environmental epidemiology is that exposure is not measured directly at subjects' locations. Instead, pollution data from monitoring stations at some distance from the study subjects are used to predict exposures, and these predicted exposures are used to estimate the health effect parameter of interest. It is usually assumed that minimizing the error in predicting the true exposure will improve health effect estimation. We show in a simulation study that this is not always the case. We interpret our results in light of recently developed statistical theory for measurement error, and we discuss implications for the design and analysis of epidemiologic research.

From the ^{a}Department of Biostatistics, University of Washington, Seattle, WA; ^{b}Department of Biostatistics, Harvard School of Public Health, Boston, MA; ^{c}Department of Statistics, University of California, Berkeley, CA; and ^{d}Department of Environmental and Occupational Health Sciences, University of Washington, Seattle, WA.

Submitted October 2010; accepted 7 April 2011; posted 30 June 2011.

Supported by the United States Environmental Protection Agency through R831697 and Assistance Agreement CR-83407101 and by the National Institute of Environmental Health Sciences through R01-ES009411 and 5P50ES015915 (Drs. Szpiro and Sheppard) and by the National Institute of Environmental Health Sciences through R01 ES017017 (Dr. Paciorek).

Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com).

Correspondence: Adam A. Szpiro, Department of Biostatistics, University of Washington, Seattle, WA 98195. E-mail: aszpiro@u.washington.edu.

There has been a major effort in air pollution epidemiology research to develop statistical models to predict exposures at subjects' locations in situations where measurements at the desired locations are not available.^{1–7} These efforts assume that exposure predictions with less measurement error relative to the unknown true values will improve health effect estimation.^{8–10} We demonstrate, in a simulation study, that this assumption is not always true, and we interpret our results using recently developed statistical theory for measurement error resulting from spatially misaligned data.^{11}

## MATHEMATICAL FRAMEWORK AND SIMULATION STUDY

Most modern statistical models for predicting long-term average air pollution concentrations are based on land-use regression. In land-use regression modeling, a linear regression model with geographic (land-use) covariates such as population density, proximity to traffic, and proximity to commercial areas is fit to monitoring data and then used to predict concentrations at subjects' locations. Elaborations on this framework account for spatial and spatiotemporal correlation and various approaches to model selection, but land-use regression remains a central component. We focus on a pure land-use regression model in this paper.

### Stochastic Data-generating Model

Consider an association study with the *N* × 1 vector of observed health outcomes *Y*, *N* × 1 vector of exposures *X*, and *N* × *m* matrix of covariates *Z*. Assume a linear regression model

$$Y = \beta_{0} + X\beta_{X} + Z\beta_{Z} + \epsilon \qquad (1)$$

with coefficient of interest β_{X} and ϵ an *N* × 1 random vector with independent elements distributed as Gaussian random variables with mean 0 and variance σ_{ϵ}^{2} (ie, N(0,σ_{ϵ}^{2})).

We are interested in the situation where *Y* and *Z* are observed, but instead of *X*, we observe the *N** × 1 vector *X** of exposures at various locations. *N** is the number of exposure monitors. Assume that *X*, the subjects' exposures, and *X**, the exposure concentrations measured at the monitors, are jointly distributed as

$$\begin{pmatrix} X \\ X^{*} \end{pmatrix} = \begin{pmatrix} S \\ S^{*} \end{pmatrix} \alpha + \begin{pmatrix} \eta \\ \eta^{*} \end{pmatrix}$$

In this expression, *S* and *S** are random *N* × *k* and *N** × *k* dimensional matrices of the *k* geographic covariates used in the land-use regression model observed without error, α is an unknown *k* × 1 vector of coefficients, and η and η* are independent vectors with elements distributed as N(0,σ_{η}^{2}). The stochasticity in *S* and *S** derives from random selection of subject and monitor locations. If the exposure model is known, it is standard practice to estimate α based on *X** and then use *W* = *S*α̂ in place of *X* in equation (1) to estimate β_{X}. That is, predictions from the land-use regression model are used as estimated exposures in place of the unknown true values, a form of regression calibration.^{12}
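The two-stage procedure described above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch, not the authors' R code from the eAppendix: the function name `simulate_once`, its arguments, and the default cohort size are hypothetical, and the default parameter values are borrowed from the simulation settings reported in this paper, with no additional covariates *Z*.

```python
import numpy as np

def simulate_once(rng, n_subj=1000, n_mon=100, sigma2_mon=1.0):
    """Draw one data set and return the plug-in estimate of beta_X and alpha-hat."""
    alpha = np.array([0.0, 4.0, 4.0, 4.0])   # intercept + 3 geographic covariates
    beta0, beta_x = 1.0, 2.0
    sigma_eps, sigma_eta = 25.0, 4.0

    # Geographic covariates at subject (S) and monitor (S*) locations.
    S = np.column_stack([np.ones(n_subj), rng.standard_normal((n_subj, 3))])
    S_mon = np.column_stack([np.ones(n_mon), rng.standard_normal((n_mon, 3))])
    S_mon[:, 3] *= np.sqrt(sigma2_mon)       # third covariate has variance sigma^2 at monitors

    # True exposures: unobserved at subjects (X), observed at monitors (X*).
    X = S @ alpha + sigma_eta * rng.standard_normal(n_subj)
    X_mon = S_mon @ alpha + sigma_eta * rng.standard_normal(n_mon)

    # Health outcomes from the linear disease model (no extra covariates Z).
    Y = beta0 + beta_x * X + sigma_eps * rng.standard_normal(n_subj)

    # Stage 1: estimate alpha from the monitoring data by least squares.
    alpha_hat, *_ = np.linalg.lstsq(S_mon, X_mon, rcond=None)
    # Stage 2: predict W = S alpha-hat and regress Y on W (with an intercept).
    W = S @ alpha_hat
    design = np.column_stack([np.ones(n_subj), W])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef[1], alpha_hat

rng = np.random.default_rng(0)
beta_hat, alpha_hat = simulate_once(rng)
print("plug-in estimate of beta_X:", beta_hat)
print("estimated alpha:", alpha_hat)
```

The true exposure `X` is generated only so that outcomes can be simulated; the estimation steps use `X_mon`, `S_mon`, `S`, and `Y` alone, mirroring what an analyst would actually observe.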

We quantify the accuracy in approximating *X* by *W* by

$$R_{W}^{2} = 1 - \frac{\sum_{i=1}^{N} (X_{i} - W_{i})^{2}}{\sum_{i=1}^{N} (X_{i} - \bar{X})^{2}}$$

where larger R_{W}^{2} values correspond to less measurement error. This defines an out-of-sample measure of prediction accuracy, as it is based on prediction error at subjects' locations, and it is not subject to bias from overfitting the exposure model to the monitoring data.^{13} R_{W}^{2} is a random quantity that varies for each realization of the data-generating model, and we denote its expectation R̄_{W}^{2}.
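As a concrete illustration, an out-of-sample accuracy measure of this kind can be computed as below. The sketch assumes the conventional form of the coefficient of determination (1 minus the ratio of squared prediction error to total variation in the true exposure); the helper name `out_of_sample_r2` is hypothetical, and in a real study the true exposures would not be available, so this calculation is only possible in simulation.

```python
import numpy as np

def out_of_sample_r2(x_true, w_pred):
    """R^2 of predictions w_pred against true exposures x_true at subject locations."""
    x_true = np.asarray(x_true, dtype=float)
    w_pred = np.asarray(w_pred, dtype=float)
    sse = np.sum((x_true - w_pred) ** 2)          # squared prediction error
    sst = np.sum((x_true - x_true.mean()) ** 2)   # total variation in true exposure
    return float(1.0 - sse / sst)

# Toy example: predictions that capture most, but not all, of the variation.
print(out_of_sample_r2([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5]))  # → 0.8
```

Because the predictions are not fit to the subjects' exposures, this quantity can be negative for a poor model, unlike an in-sample R² from least squares.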

There are a number of criteria for evaluating the validity and reliability of health effect estimates. We consider bias, standard deviation, root mean squared error, and coverage probability (the proportion of 95% confidence intervals that include the true β_{X}).^{14}
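These four criteria can be computed from a set of Monte Carlo replicates as in the following sketch. The function name `summarize` and its arguments are hypothetical, and the use of the normal quantile 1.96 to form 95% confidence intervals is an assumption about how the intervals are constructed.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.96):
    """Bias, SD, RMSE, and 95% CI coverage across Monte Carlo replicates."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    bias = est.mean() - truth
    sd = est.std(ddof=1)                           # Monte Carlo standard deviation
    rmse = np.sqrt(np.mean((est - truth) ** 2))    # combines bias and variability
    covered = (est - z * se <= truth) & (truth <= est + z * se)
    return {"bias": float(bias), "sd": float(sd),
            "rmse": float(rmse), "coverage": float(covered.mean())}

# Toy example with four replicates and a true value of 2.0.
print(summarize([1.8, 2.0, 2.2, 2.4], [0.1] * 4, truth=2.0))
```

Coverage below the nominal 95% indicates that the reported standard errors understate the actual variability of the estimates, for instance when they ignore exposure measurement error.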

### Misspecified Exposure Model

We generally do not know the exact form of the exposure model and may use a misspecified model for prediction. One form of model misspecification is to omit a geographic covariate from the land-use regression model. This corresponds to observing only the *N* × (*k* − 1) and *N** × (*k* − 1) matrices *S*′ and *S**^{′} obtained by deleting the *k*th columns of *S* and *S**. We then estimate the corresponding (*k* − 1) × 1 vector of coefficients α′ and replace *X* in equation (1) by *W*′ = *S*′α̂′ to obtain β̂^{′}_{X}. We denote measures of exposure prediction accuracy R_{W}^{2′} and R̄_{W}^{2′} as in the case of the correctly specified exposure model.

We generally expect R_{W}^{2} to be larger than R_{W}^{2′}, which from the perspective of exposure modeling implies that the correctly specified exposure model gives better predictions than the misspecified one. It is reasonable to expect that this will also lead to improved health effect estimation. However, in the next subsection, we will demonstrate a class of examples in which R_{W}^{2} is consistently larger than R_{W}^{2′}, but β̂_{X} has more error than β̂^{′}_{X} as measured in terms of bias, variance, root mean squared error, and coverage probability. We emphasize that R_{W}^{2} is not inflated by overfitting, because it is based on the correctly specified exposure model and quantifies out-of-sample prediction accuracy at subjects' locations.

### Simulation Study

We set *k* = 4 (3 geographic covariates and an intercept) and consider scenarios with *N* between 100 and 10,000 subjects and *N** = 100 monitors. We assume the 3 geographic covariates are independent of each other at all locations and are independent between subjects. In particular, for each subject *i*, we assume the *j*th geographic covariate *S*_{ij} is independently distributed as *N*(0,1).

Similarly, we assume the *S**_{ij} are distributed as *N*(0,1) for *j* = 1,2, but the third geographic covariate for the monitoring sites is distributed as *N*(0,σ^{2}) for σ^{2} = 0.1, 1.0, or 4.0. Finally, we set α_{0} = 0, α_{j} = 4 for *j* = 1, 2, 3, β_{0} = 1, β_{X} = 2, σ_{ϵ} = 25, and σ_{η} = 4, and we assume there are no additional covariates *Z*. Example simulation code in R^{15} can be found in the eAppendix (http://links.lww.com/EDE/A497).

The choice of σ^{2} controls the level of variability in the third geographic covariate at the monitoring locations. By comparing the misspecified model (ie, the model that does not contain the third geographic covariate) to the correctly specified full model, we are able to assess the added value of including the third geographic covariate in predictions, depending on its variability. The situation with σ^{2} = 0.1 is of particular interest, as it represents a geographic covariate that has limited variability in the monitoring data compared with the other geographic covariates but is equally variable in the subject data where it will be used to predict exposures. This is realistic, for example, if the covariate measures near-road traffic exposure. Regulatory monitors are often sited away from roadways to measure background pollution levels, and so they may not span the full range of covariate values relevant for predicting exposures at subjects' home locations, a significant fraction of which are near major roads.

In the Table and Figure 1, we show the results from 80,000 Monte Carlo simulations with *N* = 10,000 subjects, *N** = 100 monitoring sites, and σ^{2} = 1.0. The coefficient for the third geographic covariate α_{3} is estimated well in the full model and is statistically significant in all simulations. The corresponding exposure prediction accuracy R_{W}^{2} is consistently near 0.75, compared with R_{W}^{2′} near 0.50 with the misspecified model. Health effect estimation efficiency is improved by using the correctly specified exposure model, which gives a standard deviation for β̂_{X} of 0.12 compared with 0.21 for β̂^{′}_{X} with the misspecified model. The coverage probabilities for both models are poor, as the standard error estimates fail to account for exposure measurement error. The correctly specified exposure model results in a modest improvement in coverage probability, although it also introduces slightly more bias than the misspecified model.

Analogous results are shown in the Table and Figure 2 for σ^{2} = 0.1, representing a situation where one of the geographic covariates is less variable in the distribution of monitoring locations than are the other geographic covariates. The smaller value of σ^{2} results in more variability in estimating α_{3}, but this parameter is still estimated well and is statistically significant in 83% of Monte Carlo simulations. There is clear improvement in the exposure predictions from using the full model with R_{W}^{2} at least 0.67 in 95% of simulations, as compared with the misspecified model with R_{W}^{2′} consistently near 0.50. But in this situation, the health effect estimates are more precise when we use the misspecified exposure model, with the standard deviation of β̂^{′}_{X} equal to 0.16, compared with 0.23 for β̂_{X} using the fully specified model. The misspecified model also results in less bias and a modest improvement in coverage probability.

We vary the number of subjects as well as σ^{2} and summarize the results in Figure 3 by plotting the difference between the standard deviation of β̂^{′}_{X}, based on the misspecified exposure model, and β̂_{X}, based on the correct exposure model, on the vertical axis against *N* on the horizontal axis; a positive difference indicates that the correctly specified model is more efficient. We restrict to 5000 Monte Carlo simulations because this is sufficient to estimate the standard deviations (the biases are smaller and require more Monte Carlo simulations). The difference is positive for σ^{2} = 1.0 and 4.0, consistent with the prior expectation that more accurate exposure predictions result in more efficient health effect estimation. But it is negative for σ^{2} = 0.1 except for the case where there are only *N* = 100 subjects, demonstrating that in larger health studies the misspecified exposure model results in more efficient health effect estimation, even though it gives less accurate exposure predictions. For all simulations, the average out-of-sample exposure model prediction accuracies are R̄_{W}^{2} between 0.73 and 0.75 for the correctly specified model and R̄_{W}^{2′} between 0.49 and 0.50 for the misspecified model that omits the third geographic covariate.
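A scaled-down version of this comparison can be sketched in Python with NumPy. It is not the authors' R code from the eAppendix. To keep the run time short it assumes *N* = 2000 subjects and 400 replicates instead of the cohort sizes and replicate counts reported above, and the names `plug_in_slope`, `compare_models`, and `results` are hypothetical.

```python
import numpy as np

ALPHA = np.array([0.0, 4.0, 4.0, 4.0])      # alpha_0 and alpha_1..alpha_3
BETA0, BETA_X = 1.0, 2.0
SIGMA_EPS, SIGMA_ETA = 25.0, 4.0

def plug_in_slope(S, S_mon, X_mon, Y, cols):
    """Fit the exposure model on the chosen columns, predict W, regress Y on W."""
    a_hat, *_ = np.linalg.lstsq(S_mon[:, cols], X_mon, rcond=None)
    W = S[:, cols] @ a_hat
    Wc = W - W.mean()
    return Wc @ (Y - Y.mean()) / (Wc @ Wc), W    # OLS slope with intercept

def compare_models(sigma2_mon, n_subj=2000, n_mon=100, n_rep=400, seed=1):
    rng = np.random.default_rng(seed)
    models = {"full": [0, 1, 2, 3], "mis": [0, 1, 2]}   # "mis" omits covariate 3
    slopes = {name: [] for name in models}
    r2 = {name: [] for name in models}
    for _ in range(n_rep):
        S = np.column_stack([np.ones(n_subj), rng.standard_normal((n_subj, 3))])
        S_mon = np.column_stack([np.ones(n_mon), rng.standard_normal((n_mon, 3))])
        S_mon[:, 3] *= np.sqrt(sigma2_mon)       # monitor variance of covariate 3
        X = S @ ALPHA + SIGMA_ETA * rng.standard_normal(n_subj)
        X_mon = S_mon @ ALPHA + SIGMA_ETA * rng.standard_normal(n_mon)
        Y = BETA0 + BETA_X * X + SIGMA_EPS * rng.standard_normal(n_subj)
        sst = np.sum((X - X.mean()) ** 2)
        for name, cols in models.items():
            b, W = plug_in_slope(S, S_mon, X_mon, Y, cols)
            slopes[name].append(b)
            r2[name].append(1.0 - np.sum((X - W) ** 2) / sst)   # out-of-sample R^2
    return {name: {"sd": float(np.std(slopes[name], ddof=1)),
                   "mean_r2": float(np.mean(r2[name]))} for name in models}

results = {s2: compare_models(s2) for s2 in (0.1, 1.0)}
for s2, res in results.items():
    print(f"sigma^2 = {s2}: {res}")
```

Under these assumed settings, the pattern to look for is the one described in the text: the full model has the higher average out-of-sample R² in both scenarios, while the ordering of the standard deviations reverses between σ^{2} = 1.0 and σ^{2} = 0.1.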

## THEORETICAL INTERPRETATION IN A MEASUREMENT ERROR FRAMEWORK

The results of our simulation study seem paradoxical in that more accurate exposure predictions do not necessarily lead to improved health effect estimation. The Table shows that for σ^{2} = 0.1, the correctly specified model consistently gives more variable exposure predictions and more accurate out-of-sample prediction, compared with the misspecified exposure model. However, a small part of the additional exposure variability is induced by error in estimating α_{3}, which leads to less efficient estimation of β_{X}. These findings can be understood in a theoretical context by referring to the statistical measurement error framework developed for this setting.^{11,12}

Briefly, for a fairly general class of exposure models there are 2 components to the measurement error. The Berkson-like component of error results from smoothing the exposure surface using a model that may not account for all sources of variation and can be thought of as the part of the true exposure that is not predictable from the model. It is similar to standard Berkson error^{16} in that it inflates the standard deviation of the health effect estimate and introduces little or no bias. However, it is different from Berkson error in that it is correlated in space and is not completely independent of the predicted exposures.^{11,12} The classical-like component comes from uncertainty in estimating the exposure model parameters. It is similar to classical measurement error in that it is a source of variability in the predicted exposures and can introduce bias in health effect estimates as well as change their standard errors. The classical-like component is also different from classical measurement error in that the additional variability from exposure model parameter estimation is shared across all prediction locations rather than being independent.^{11}

For the simple land-use regression exposure model considered here, the Berkson-like component is pure Berkson error because there is no spatial dependence structure in η and η* and the *S*_{ij} and *S**_{ij} are independent. When we use the correctly specified exposure model, the Berkson error is just η, but misspecifying the model by omitting the third geographic covariate increases the Berkson error substantially, resulting in a degradation of prediction accuracy. However, Berkson error plays the same role mathematically as the random ϵ in the disease model, and so its impact on the health effect estimation error diminishes for large *N*. On the other hand, each coefficient that needs to be estimated in the exposure model contributes to the classical-like error, and this part of the error remains important regardless of the number of subjects. In some situations, this could result in a bias-variance tradeoff because classical-like error induces bias while Berkson-like error does not.
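The two error components can be written out explicitly for the land-use regression model considered here. The decomposition below is a sketch that follows algebraically from the exposure model and the definitions of *W* and *W*′. The symbols *S*_{3} (the omitted *k*th column of *S*) and α′ (the first *k* − 1 entries of α) are notation introduced here for illustration, and reading α̂′ as an estimate of α′ relies on the simulation's assumption of independent, mean-zero covariates.

```latex
\begin{aligned}
X - W  &= \underbrace{\eta}_{\text{Berkson-like}}
          + \underbrace{S(\alpha - \hat{\alpha})}_{\text{classical-like}}, \\
X - W' &= \underbrace{\eta + S_{3}\alpha_{3}}_{\text{Berkson-like}}
          + \underbrace{S'(\alpha' - \hat{\alpha}')}_{\text{classical-like}}.
\end{aligned}
```

With the simulation settings, the Berkson-like variance per subject is σ_{η}^{2} = 16 under the full model and 16 + 4^{2} × 1 = 32 when the third covariate is omitted, against a total exposure variance of 3 × 4^{2} + 16 = 64. Ignoring the classical-like term, this gives prediction accuracies of roughly 1 − 16/64 = 0.75 and 1 − 32/64 = 0.50, in line with the simulated values.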

It turns out that for σ^{2} = 0.1 in the monitoring data, we get relatively variable estimates of α_{3} when using the full exposure model, while still improving out-of-sample prediction accuracy at subjects' locations. This results in substantial classical-like measurement error that (for sufficiently large *N*) is more important than the additional Berkson error that is introduced by omitting the corresponding geographic covariate. There is very little bias in any of our simulations, and so the dominant classical-like error primarily results in more variable estimates of β_{X}.
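A back-of-the-envelope calculation suggests why α_{3} is estimated so variably when σ^{2} = 0.1. This is a rough sketch: the first line uses the usual least-squares variance formula together with the approximation that the monitor covariates are uncorrelated in sample (rows and columns indexed from 0 for the intercept). The second line, the large-*N* limit of the plug-in estimate conditional on α̂, is derived here from the independence of the unit-variance subject covariates and is not a formula taken from the cited theory.

```latex
\begin{aligned}
\operatorname{Var}(\hat{\alpha}_{3} \mid S^{*})
  &= \sigma_{\eta}^{2}\,\bigl[(S^{*\top} S^{*})^{-1}\bigr]_{33}
   \approx \frac{\sigma_{\eta}^{2}}{N^{*}\sigma^{2}}
   = \begin{cases}
       16 / (100 \times 0.1) = 1.6,  & \sigma^{2} = 0.1, \\
       16 / (100 \times 1.0) = 0.16, & \sigma^{2} = 1.0,
     \end{cases} \\
\hat{\beta}_{X}
  &\xrightarrow[N \to \infty]{}
   \beta_{X}\,\frac{\sum_{j=1}^{3} \hat{\alpha}_{j}\,\alpha_{j}}
                   {\sum_{j=1}^{3} \hat{\alpha}_{j}^{2}} .
\end{aligned}
```

The implied standard deviation of α̂_{3} is about 1.26 for σ^{2} = 0.1 and 0.4 for σ^{2} = 1.0, compared with a true value of α_{3} = 4. Because the limit in the second line depends on α̂ but not on *N*, variability in α̂ carries over to β̂_{X} no matter how large the cohort is.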

## IMPLICATIONS FOR FUTURE RESEARCH

We have shown a class of examples in which more accurate exposure prediction does not lead to improved health effect estimation. It bears emphasis that this does not result from overfitting the exposure model, at least not as overfitting is traditionally understood for prediction models.^{13} In all cases, using the correctly specified model that includes all 3 geographic covariates leads to improved prediction accuracy, as measured by out-of-sample R_{W}^{2} evaluated at subjects' locations.

Our findings have important implications for the design and analysis of environmental epidemiologic studies. Development of models for exposure prediction and health effect estimation should be considered simultaneously, in contrast with the current practice of first selecting an exposure model to optimize prediction accuracy and then using the resulting predictions for health effect estimation. Recent papers that address measurement error in air pollution cohort studies represent progress in this direction.^{11,12,14,17} Our results do not necessarily suggest employing a joint statistical estimation model for the exposure and health parameters in which the health data would influence estimation of the exposure model parameters. The issue we have highlighted relates more directly to model selection than to parameter estimation.

There is extensive literature on penalization and other methods for optimizing accuracy of prediction models,^{13} but these techniques are not directly applicable because better prediction accuracy may induce less precise health effect estimation. New statistical methodology is needed to select exposure models to optimize efficiency of health effects inference, perhaps involving alternative forms of penalization that account for the structure in both the monitoring and health outcome data. It is also worth exploring asymptotic methods to estimate the bias and variance of β̂_{X} to select optimal geographic covariates, particularly when there is a relatively large number of monitoring locations compared with the geographic covariates.

The relative benefits of various air-pollution exposure models depend on the variability of geographic covariates in the subject population and monitor locations, and on the size of the cohort. It is evident that study design can be improved by accounting for statistical issues at the intersection of exposure prediction and health effect estimation. All else being equal, it is preferable to design an exposure monitoring campaign to maximize the variability of pertinent geographic covariates across monitor locations. An asset-allocation algorithm may be useful for optimizing the monitoring design to predict exposures in an epidemiology study with known subjects' locations.^{18}

We have considered only the relatively simple setting of a linear disease model with an exposure model that is land-use regression with independent geographic covariates. Even in this case, we have shown that more accurate exposure prediction does not necessarily lead to improved health effect estimation. We expect that similar phenomena can occur in other settings, but further research is needed to identify general conditions and assess the implications of more complex situations.

## ACKNOWLEDGMENTS

We thank 3 anonymous referees for their valuable suggestions and Sverre Vedal for helpful comments on a draft of this manuscript.

## REFERENCES

*Environ Health Perspect.* 2009;117:522–529.

*Environmetrics.* 2010;21:606–631.

*Proc Am Thoracic Soc.* 2010;7:111–115.

*Environmetrics.* 2008;19:549–566.

*Environ Res.* 2009;109:657–670.

*J Expo Anal Environ Epidemiol.* 2005;15:185–204.

*Atmos Environ.* 2008;42:7561–7578.

*Epidemiology.* 2005;16:727–736.

*Environ Health Perspect.* 2005;113:201–206.

*Environ Health Perspect.* 2009;117:1697–1701.

*Biostatistics*. In Press.

*Biostatistics.* 2009;10:258–274.

*Elements of Statistical Learning*. New York: Springer; 2001.

*Epidemiology.* 2009;20:442–450.

*Measurement Error in Nonlinear Models: A Modern Perspective*. 2nd ed. Boca Raton, FL: Chapman and Hall/CRC; 2006.

*Environmetrics.* 2008;19:453–467.

*Atmos Environ.* 2005;39:2399–2409.