We thank Francesca Dominici et al1 for their thoughtful examination of our article.2 We share many of their concerns about the model fitting process and would like to address several points raised in their commentary.
As explicitly noted in our paper, the analysis conducted was exploratory because there are few published data on prenatal exposure to air pollution and pulmonary function later in life. There is, however, strong evidence that prenatal exposure to environmental tobacco smoke influences pulmonary function.3,4 Because environmental tobacco smoke and outdoor air pollutants share common components, we undertook an analysis of the FACES-LiTE data to discover whether one or more pollutants could predict, and thus were associated with, altered pulmonary function in childhood. The analytic objective was not to evaluate specific health effects according to a priori hypotheses about the effect of particular pollutants on health outcomes. We did not aim to identify all pollutants causing health effects via pregnancy and early life exposure, or to evaluate the precise magnitude of these effects, as was suggested by the simulation study conducted by Dominici and colleagues. We acknowledged that our findings could not necessarily be interpreted “causally.” The term “effect” was meant in the traditional epidemiologic sense of “association.” The terminology of variable importance could have been used instead.
Nevertheless, even if causal inference is not the primary aim of discovery analyses, inference about the (adjusted) associations between the explanatory covariates considered and the outcome remains desirable. This is true even if, as Dominici and colleagues have further emphasized, the theoretical framework for such inference has not been developed to date for the commonly used statistical approaches in machine learning (such as the DSA algorithm). This limitation was underscored in our paper when we indicated that the inference obtained from GEE did not account for the model selection undertaken and, thus, could lead to underestimation of the uncertainty regarding the associations reported. We welcome skepticism about our results and agree with the need for future research undertakings to validate, refute, or refine these results.
Despite our inability to provide an exact measure of the uncertainty around the associations reported in our paper, we would like to underscore several points that should motivate confidence in those associations. First, the Deletion/Substitution/Addition algorithm is based on cross-validation to avoid over-fitting, ie, model choice is based on data sets independent from the one used to fit each model considered. Second, cross-validation is not used to compare directly the performance of all models considered by the algorithm but, instead, aims to select the class of models that is most predictive of the outcome among only a few (user-specified) classes of models. As noted by Dominici and colleagues, the number of models considered by the algorithm in our analyses is very large and, thus, even a cross-validation criterion such as the cross-validated L2 risk would not protect against the selection of covariates in the regression model that, in fact, are independent of the outcome, if it were used to compare all of those models directly. However, the Deletion/Substitution/Addition algorithm does not attempt such comparisons but instead uses cross-validation to select a class of models among a more restricted number of model classes. Third, in our experience, the algorithm tends to select very low dimensional models (often the intercept model) as the number of candidate covariates and the complexity of the models under consideration increase. Based on all of these points, the fact that the selected models included the prenatal pollutant metrics, rather than just demographic or other possible confounders, supports the inference that the pollutants may influence the pulmonary function measures in this analysis.
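As a rough illustration of the second point, the following Python sketch uses cross-validation only to choose among a handful of model classes (here, model size), not to compare every candidate model. It is not the Deletion/Substitution/Addition algorithm: greedy forward selection stands in for the search within each size, and all function names are hypothetical.

```python
import numpy as np


def fit_ols(X, y, cols):
    """Least-squares fit with an intercept on the chosen columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict(X, cols, beta):
    """Predictions from an intercept-plus-columns linear model."""
    A = np.column_stack([np.ones(X.shape[0])] + [X[:, j] for j in cols])
    return A @ beta


def forward_path(X, y, max_size):
    """Greedy forward selection (a stand-in for the DSA search, by assumption).

    Returns one covariate subset per model size 0..max_size.
    """
    path, chosen = [[]], []
    for _ in range(max_size):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]

        def rss(j):
            cols = chosen + [j]
            resid = y - predict(X, cols, fit_ols(X, y, cols))
            return float(resid @ resid)

        chosen = chosen + [min(remaining, key=rss)]
        path.append(list(chosen))
    return path


def cv_select_size(X, y, max_size, n_folds=5, seed=0):
    """Pick the model size with the smallest cross-validated L2 risk.

    Only max_size + 1 classes are compared by cross-validation, however
    many individual models the search visits within each training fold.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    risk = np.zeros(max_size + 1)
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        # The search is rerun on each training fold, so the validation
        # fold is independent of both the selection and the fit.
        path = forward_path(X[train], y[train], max_size)
        for size, cols in enumerate(path):
            beta = fit_ols(X[train], y[train], cols)
            err = y[val] - predict(X[val], cols, beta)
            risk[size] += np.mean(err ** 2) / n_folds
    best_size = int(np.argmin(risk))
    # Refit the search on all data, restricted to the selected size.
    final_cols = forward_path(X, y, best_size)[best_size]
    return best_size, final_cols, risk
```

Calling `cv_select_size(X, y, max_size=4)` on a covariate matrix `X` and outcome vector `y` returns the selected size, the covariates chosen at that size, and the cross-validated risk for each size. Size 0 is the intercept model, which is selected when no larger class predicts better on held-out data.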
Finally, we would like to draw attention to recent statistical progress with regard to variable importance problems and, in particular, the derivation of proper inference in this context. In fact, we propose the use of this variable importance methodology in combination with targeted maximum likelihood estimation5–7 to infer the effects of pollutants with adjustment for a user-supplied set of control variables. We propose this for 2 main reasons. First, it defines a variable importance measure for all explanatory covariates considered. This is unlike the approach we used in our paper, where the importance of covariates that were not selected in the regression model is zero. Second, it provides the theoretical framework to derive proper P-values and confidence intervals for the variable importance measure of each explanatory covariate under consideration, and does so even when aggressive model selection algorithms (such as the Deletion/Substitution/Addition algorithm) are implemented as part of the general estimation procedure. We believe that software implementation should allow broader application of this new methodology in epidemiologic research generally and, in particular, in etiologic research efforts such as the one we attempted.
1. Dominici F, Wang C, Crainiceanu C, et al. Model selection and health effect estimation in environmental epidemiology [commentary]. Epidemiology
2. Mortimer KM, Neugebauer R, Lurmann F, et al. Air pollution and pulmonary function in asthmatic children: effects of prenatal and lifetime exposure. Epidemiology
3. Tepper R W-NT, Martinez T, Kisling J, et al. Parental smoking and airway reactivity in healthy infants. Am J Respir Crit Care Med
4. Gilliland F, Berhane K, McConnell R, et al. Maternal smoking during pregnancy, environmental tobacco smoke exposure and childhood lung function. Thorax
5. van der Laan M. Statistical Inference for Variable Importance. The International Journal of Biostatistics 2. 1; 2006. Available at: http://works.bepress.com/markvanderlaan/156
6. van der Laan M, Rubin D. Targeted Maximum Likelihood Learning. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 213; 2006. Available at: http://www.bepress.com/ucbbiostat/paper213
7. Bembom O, Petersen L, Rhee S, et al. Biomarker Discovery Using Targeted Maximum Likelihood Estimation: Application to the Treatment of Antiretroviral Resistant HIV Infection. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 221; 2007. Available at: http://www.bepress.com/ucbbiostat/paper22