Epidemiology

Epidemiology:
November 2008 - Volume 19 - Issue 6 - pp S282-S283
doi: 10.1097/01.ede.0000340321.57437.f0
Abstracts: ISEE 20th Annual Conference, Pasadena, California, October 12-16, 2008: Contributed Abstracts

Variable Selection Methods Influence the Identification of Factors That Predict Serum Dioxin Concentrations in Michigan, USA

Hong, B; Garabrant, D H; Chen, Q; Chang, C-W; Jiang, X; Hedgeman, E; Gillespie, B W; Lepkowski, J; Franzblau, A

Author Information

*Risk Science Center and Dept of Environmental Health Sciences, University of Michigan School of Public Health, Ann Arbor, Michigan, USA; †Dept of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan, USA; and ‡Institute for Social Research, University of Michigan, Ann Arbor, Michigan, USA.

Abstracts published in Epidemiology have been reviewed by the organizations of Epidemiology Affiliate Societies at whose meetings the abstracts have been accepted for presentation. These abstracts have not undergone review by the Editorial Board of Epidemiology.

ISEE-1377

Methods:

We fitted linear regression models to identify factors associated with serum dioxin concentrations in Midland and Saginaw Counties, Michigan, using data from 946 participants in the University of Michigan Dioxin Exposure Study (UMDES) who were selected from the study area through a complex sample design. We used two variable selection approaches in the linear regression models: backward selection and forward stepwise selection. We then used DFBETAS as an influence diagnostic to examine the effect of influential observations on the regression coefficients. The purpose of this paper is to compare the results of the two approaches and to determine which is more sensitive to influential observations in the data. In linear regression, when the number of potential predictors is large relative to the number of observations, there are typically not enough degrees of freedom to run a single step of backward selection from the complete variable list. Although this problem is often addressed with a multi-stage backward selection procedure, there is no guarantee that the resulting model will be optimal. Forward stepwise selection allows variable selection in a single step, but no existing software implements it for multiply-imputed survey data. We solved this problem by writing a SAS macro that implements forward stepwise selection in the setting of multiply-imputed survey data.
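The general idea of forward stepwise selection can be illustrated with a minimal Python sketch. This is not the authors' SAS macro: it uses ordinary unweighted least squares, ignores the complex sample design and multiple imputation, and uses partial F-statistic thresholds for entry and removal. The function names, the thresholds (`f_enter`, `f_remove`), and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, f_enter=4.0, f_remove=3.9, max_steps=100):
    """Hypothetical helper: add the best candidate if its partial F exceeds
    f_enter, then drop any selected variable whose partial F falls below
    f_remove. Unweighted OLS only (no survey weights, no imputation)."""
    n, p = X.shape
    selected = []
    for _ in range(max_steps):
        changed = False

        # Forward step: find the candidate with the largest partial F.
        rss_cur = rss(X, y, selected)
        best_f, best_j = 0.0, None
        for j in range(p):
            if j in selected:
                continue
            rss_new = rss(X, y, selected + [j])
            df_resid = n - len(selected) - 2  # intercept + selected + candidate
            f_stat = (rss_cur - rss_new) / (rss_new / df_resid)
            if f_stat > best_f:
                best_f, best_j = f_stat, j
        if best_j is not None and best_f > f_enter:
            selected.append(best_j)
            changed = True

        # Backward check: remove the weakest selected variable if it no longer qualifies.
        if len(selected) > 1:
            rss_full = rss(X, y, selected)
            df_resid = n - len(selected) - 1
            worst_f, worst_j = np.inf, None
            for j in selected:
                reduced = [k for k in selected if k != j]
                f_stat = (rss(X, y, reduced) - rss_full) / (rss_full / df_resid)
                if f_stat < worst_f:
                    worst_f, worst_j = f_stat, j
            if worst_f < f_remove:
                selected.remove(worst_j)
                changed = True

        if not changed:
            break
    return selected

# Simulated data (an assumption): only columns 0 and 3 truly affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(size=200)
selected = forward_stepwise(X, y)
print(sorted(selected))
```

Because the procedure starts from an empty model, it never needs to fit the complete variable list at once, which is the property that matters when predictors are numerous relative to observations.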

Results:

We found that the most important predictors of serum dioxins (age, gender, body mass index, smoking status, the length of breastfeeding, etc.) were consistently identified by both backward and forward stepwise variable selection in our regression models. These factors were stable (in terms of parameter estimates and p-values) and collectively explained a large proportion of the variance in serum dioxin levels (measured by the adjusted R²). However, some factors that depended on a small number (1 to 3) of observations tended to be identified in backward selection but not in forward stepwise selection. These factors should be interpreted with caution because the associations were highly dependent on a few influential observations, the factors explained little of the variation in serum dioxin levels, and their inclusion in the model depended on the variable selection procedure. For example, living in Midland or Saginaw Counties in the 1940s and 1950s was identified in backward selection and was statistically significantly associated with increased serum 2378-TCDD (P-value = 0.019). If the most influential observation is excluded from the regression analysis, there is no statistically significant association between the years living in Midland or Saginaw Counties in the 1940s and 1950s and serum 2378-TCDD. This factor was not identified by forward stepwise selection, however, even when the most influential observation was included in the regression analysis.
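The dependence of a coefficient on a single observation, as described above, is what DFBETAS measures: the change in each coefficient when one observation is deleted, scaled by that coefficient's standard error computed without the observation. The following Python sketch computes DFBETAS for an unweighted OLS fit using the standard closed-form leave-one-out identities. It is an illustration only; the function name, the simulated data, and the use of the conventional 2/√n screening cutoff are assumptions, and the abstract does not say which cutoff the authors applied or how survey weights entered the diagnostic.

```python
import numpy as np

def dfbetas(A, y):
    """Return an (n, p) array of DFBETAS for an OLS fit of y on design matrix A.
    A must already contain an intercept column. Unweighted OLS only."""
    n, p = A.shape
    xtx_inv = np.linalg.inv(A.T @ A)
    beta = xtx_inv @ A.T @ y
    resid = y - A @ beta
    # Leverage h_ii = x_i' (X'X)^-1 x_i for each observation.
    h = np.einsum("ij,jk,ik->i", A, xtx_inv, A)
    # Row i holds beta - beta_(i): the coefficient change when observation i is deleted.
    delta = (A @ xtx_inv) * (resid / (1.0 - h))[:, None]
    # Residual variance estimated with observation i deleted.
    sse = resid @ resid
    s2_deleted = (sse - resid**2 / (1.0 - h)) / (n - p - 1)
    se = np.sqrt(s2_deleted[:, None] * np.diag(xtx_inv)[None, :])
    return delta / se

# Simulated data (an assumption): no true slope, plus one extreme observation
# that by itself creates an apparent association.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 1.0 + rng.normal(size=50)
x = np.append(x, 10.0)
y = np.append(y, 15.0)
A = np.column_stack([np.ones(len(x)), x])

d = dfbetas(A, y)
cutoff = 2.0 / np.sqrt(len(y))  # conventional screening threshold
flagged = np.where(np.abs(d[:, 1]) > cutoff)[0]
print(flagged)
```

An observation flagged this way is a candidate for the kind of sensitivity check reported in the Results: refit the model without it and see whether the association persists.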

Conclusion:

Based on the above findings, we recommend using forward stepwise variable selection in linear regression analysis in studies like ours. Our implementation of a SAS program to do this is a substantial forward step in the analysis of multiply-imputed survey data.

© 2008 Lippincott Williams & Wilkins, Inc.