# Model Selection and Health Effect Estimation in Environmental Epidemiology

In air pollution epidemiology, improvements in statistical analysis tools can help improve signal-to-noise ratios, and untangle large correlations between exposures and confounders. For this reason, we welcome a novel model-selection approach that helps to identify the time-windows of exposure to pollutants that produces adverse health effects. However, there are concerns about approaches that select a model based on a given data set, and then estimate health effects in the same data. This can create problems when (1) the sample size is small in relation to the magnitude of the health effects; and (2) candidate predictors are highly correlated and likely to have similar effects. Bayesian Model Averaging has been advocated as a way to estimate health effects that accounts for model uncertainty. However, implementations where posterior model probabilities are approximated using BIC, as well as other default choices, may not reflect the ability of each model to provide an estimate of the health effect that is properly adjusted for confounding. Air pollution studies need to focus on estimating health effects while accounting for the uncertainty in the adjustment for confounding factors. This is true especially when model choice and estimation are performed on the same data. The development of appropriate statistical tools remains an open area of investigation.

From the ^{a}Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore; and ^{b}The Sydney Kimmel Comprehensive Cancer Center, Johns Hopkins University

Submitted 1 February 2008; accepted 7 March 2008.

Supported by US Environmental Protection Agency (RD-83241701), the National Institute for Environmental Health Sciences (ES012054-03), and by the NIEHS Center in Urban Environmental Health (P30 ES 03819).

*Editorâ€™s Note:**The authorsâ€™ response appears on page 561.*

Correspondence: Francesca Dominici, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205. E-mail: fdominic@jhsph.edu.

In this issue of Epidemiology, Mortimer et al^{1} estimate the association between prenatal and lifetime exposures to air pollutants and pulmonary function in a cohort of children with asthma. They find large correlations among various pollutants as well as among different exposure time-windows for the same pollutant. Therefore, rather than estimate the health effect of each exposure separately, they use a recently-developed model selection procedure, the Deletion/Substitution/Addition algorithm^{2} to identify the best predictive model. This procedure is an iterative model-search algorithm that optimizes global measures of prediction performanceâ€”here, the mean squared error of residuals. Compared with stepwise model selection procedures, the Deletion/Substitution/Addition algorithm has some advantages. It is less sensitive to outliers (via the use of cross-validation during the search), and it allows the search to move among statistical models that are not nested.

Air pollution epidemiology is an area where progress in statistical modeling strategies can translate into important scientific advances. Air pollution studies often have unfavorable signal-to-noise ratios, and substantial correlations among both the exposures and the potential confounders. These conditions are exemplified in the study of Mortimer et al.^{1} We welcome the thorough exploration of novel model-selection approaches in this well-conducted and statistically challenging study. However, we are also concerned about 2 potential issues: (1) Is it safe to estimate health effects by assuming that the best predictive model is correct? and (2) How can one account for uncertainty in the selection of population characteristics when estimating the association between an exposure and a health outcome?

### Potential Pitfalls of Model Selection followed by Estimation

The authorsâ€™ implementation of the Deletion/Substitution/Addition algorithm is ambitious: it aims to identify which exposures (X_{i}), population characteristics (Z_{i}), functional form of the X, functional form of the Z, and interactions between the X and the Z will lead to a model with the best predictive power. The chosen maximum model size is 10 terms, but the order of the interaction between covariates is up to 2, and the sum of the power of the polynomial function of each predictor is up to 3. Thus, there are a large number of candidate terms for the algorithm to choose from.

Health effect estimates and their variances are then obtained using GEE (see Table 3 of the paper by Mortimer et al^{1}) assuming that the selected model is correct. Previous literature has shown that approaches that select a model based on a given data set and then estimate health effects in the same data (assuming that the chosen model is correct), can lead to misleading inferences. These approaches can identify exposure variables that appear to have strong predictive power even in randomly generated data with no true health effects.^{3,4} In addition, confidence intervals of regression coefficients calculated after model selection can have poor statistical properties.^{5,6} These problems can be particularly severe when the sample size is small for estimating the magnitude of the health effects, and when the candidate predictors are highly correlated and likely to have a similar effect on the health outcome.

We illustrate these points using a simulation study modeled after the approach taken by Mortimer and colleagues.^{1} For 232 subjects, we generate the 8 pollutant-specific metrics (prenatal and lifetime exposure to CO, NO_{2}, O_{3}, and PM_{10}), from a multivariate normal distribution with mean zero, variance 1, and correlation matrix (as in Table 2 of Mortimer et al). This table does not include correlations between prenatal exposures and pollutants and thus we assumed that these correlations are the same as those estimated for lifetime exposures. We generate a continuous outcomeâ€”for example, forced vital capacity (FVC)â€”from a linear regression model with intercept 1.95 (from Table 1 of Mortimer et al). We include all 8 predictors as linear terms in the true model, and we assume that they all have the same regression coefficient of 0.1 (model 1). Finally we assume that errors are independent and identically distributed as normal with mean zero and variance 1. We then apply the Deletion/Substitution/Addition algorithm using the inputs chosen by Mortimer et al (maxsize = 10, maxsumofpow = 3, maxorderint = 2, nsplits = 10). We used the R statistical analysis package.7 Table 1 of this commentary shows the predictors that were identified by this algorithm and the estimates, with 95% confidence intervals. Even though all predictors have the same predictive power, the Deletion/Substitution/Addition algorithm selects only lifetime exposure to PM_{10} and prenatal exposure of NO_{2}. The estimates of the health effects of these 2 pollutant metrics are severely biased upward, and their 95% confidence intervals (CIs) do not include the true value of 0.1.

In this example, the chosen model may be useful for prediction, but its interpretation in terms of etiology would be misleading in 2 ways. It would fail (by singling out 1 pollutant as predictive) to indicate that the pollutants have similar effects, and it would overstate the effect of that single pollutant on the outcome.

We repeated the simulation study assuming that the true regression model not only included all the linear terms, but also the quadratic and cubic terms of lifetime exposure to CO (model 2) with true regression coefficients of 0.01 and 0.001, respectively. In this scenario (Table 2), Deletion/Substitution/Addition algorithm selects the linear terms of prenatal exposures to O_{3} and NO_{2} and the linear terms of lifetime exposure to NO_{2} and O_{3}. Again, the estimated regression coefficients of these 4 predictors are biased and their 95% confidence intervals do not include the true values. Although we recognize that we may have simulated data from a somewhat unfavorable (though not unrealistic) situation, these simulations results suggest caution in the interpretation of the results reported by Mortimer and colleagues. A critical consideration in this discussion is the relation of the sample size to the number of candidate predictors. If we generate data with a 100-fold larger sample size (23,200) under model 1, then Deletion/Substitution/Addition algorithm identifies all the correct predictors and provides unbiased estimates of the corresponding regression coefficients (Table 3).

### Model Uncertainty Versus Adjustment Uncertainty

The challenges highlighted in our simulations emphasize the importance of accounting for uncertainty arising from the variable selection stage when reporting estimates of health effects. Bayesian model averaging^{8â€“10} has been advocated as a way to estimate health effects accounting for model uncertainty. This approach treats the true model as an unknown random variable, and estimates health effects by a weighted average of model-specific health effects using the posterior model probabilities as weights.^{8,9} We^{11} have demonstrated that Bayesian model averaging with posterior weights approximated using the BIC might not be as effective for effect estimation as it has proven to be for prediction. (See also Thomas et al.^{12}) Heuristically, this appears to be because the posterior model probabilities so approximated might not reflect the limited ability of the model to provide an estimate of the health effect properly adjusted for confounding.

For example, large weights may be assigned to models that do not adequately adjust for confounders, leading to a biased estimate of the health effect. Consider an exposure X, an outcome Y, and 2 covariates Z_{1} and Z_{2}. Assume that Z_{1} is independent from X, but a good predictor of Y, and that Z_{2} is highly correlated with X but a poor predictor of Y. Table 4 summarizes the 3 possible models and hypothetical weights that would be assigned based on both predictive ability and ability to estimate the health effects properly adjusting for confounding. In some common implementations Bayesian model averaging is likely to assign a high weight to models that do not include Z_{2} and therefore would provide a biased estimate of the health effect.

Bayesian model averaging can over- or under-estimate health effects, depending on the correlations of the variables involved. When this approach is applied to the simulated examples above, the health effect estimates are higher than the true ones for 6 of 8 pollutants. The 95% posterior intervals are larger than Deletion/Substitution/Addition algorithm procedure, with 3 of the 8 intervals including 0, but 2 of the 8 intervals are still failing to include the true value of 0.1. (We used the implementation in the BMA package in R with default settings, constraining exposures to be included in all models. The default prior distribution and hyperparameters are specified in Raftery et al, 1997.^{13})

In studies of air pollution and health, it is important to focus on estimating health effects that are properly adjusted for all the confounding factors,^{14} including exposure to other pollutants. Ideally, adjustment uncertainty should be fully incorporated into statistical inference whenever estimation is sensitive to model choice. This is especially important when model choice and estimation are performed on the same data. However, the developent of appropriate statistical tools to achieve this goal is still in progress. In the case of a single exposure, we^{11} have discussed an approach to estimate a health effect accounting for uncertainty in the confounding adjustment. Similar methods for multiple exposures, as would have been required by Mortimer et al, are not currently available.

Mortimer et al present interesting results on the association between lifetime and prenatal exposure to air pollution and pulmonary functions in a cohort of asthmatic children. The authors used an innovative model-search approach to select the exposure variables that led to the best predictive model. In doing so they faced a challenge common in environmental epidemiology, in which the health effects to be estimated are small and both the exposure variables and the potential confounders are correlated. In this context, using ambitious model selection methods that can search efficiently through a large number of possible models may illuminate important novel directions for etiological research. However, inference that ignores the biases and uncertainty in the selection of predictors may lead to inflated effects and overly optimistic estimates of precision. Therefore, unless the sample size far exceeds the number of potential terms in the model, findings of this type of analysis require further validation.

## REFERENCES

*Epidemiology*. 2008;19;550â€“557.

*Stat Appl Genet Mol Biol.*2004;3:Article 18. Available at: http://www.bepress.com/sagmb/vol3/iss1/art18.

*J Am Stat Assoc*. 1997;92:179â€“191.

*J R Stat Soc B*. 1995;57:45â€“97.

*J Am Stat Assoc*. 2005;11:71â€“81.

*Epidemiology*. 2007;18:186â€“190.

*Environmetrics*. 2000;11:745â€“763.

*J Environ Econ Manage*. 2004;47:30â€“54.

*Sociologic Methodol*. 1994;25:111â€“163.

*Biometrika.*In press. Technical report available at: http://www.bepress.com/jhubiostat/paper89/.

*J Toxicol Environ Health A*. 2007;70;1â€“5.

*J Am Stat Assoc*. 1997;92:179â€“191.

*J Am Stat Assoc*. 2004;468:938â€“948.