Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine, Stanford, CA firstname.lastname@example.org
Division of Epidemiology and Division of Social and Behavioral Sciences, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
Supported by the Stanford Department of Medicine.
The authors report no conflicts of interest.
Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com). This content is not peer-reviewed or copy-edited; it is the sole responsibility of the author.
To the Editor:
Geographic disparities in age-adjusted premature mortality have been extensively catalogued across the United States.1 For example, Wayne County, Michigan (Detroit), recently lost 10,263 years of potential life per 100,000 population per year,2 whereas nearby Washtenaw County lost only 5,096 years per 100,000. These large differences have prompted federal public health agencies to attempt to identify the most vulnerable areas for intervention (“hot spotting”).3 However, traditional indicators such as education and health-care access are inadequate to predict geographic disparities in mortality.1 A widening array of new indicators has therefore been developed, ranging from smoking and alcohol consumption rates to the density of fast-food restaurants. The proliferation of new indicators presents a further challenge: which are most efficient at predicting vulnerable areas?
Here, we “open-source” an approach using readily available data sets to identify key predictors of US geographic disparities in premature mortality. As detailed in the eAppendix, http://links.lww.com/EDE/A775 (which includes full statistical code), we analyzed 50 key indicators of socioeconomic, demographic, behavioral, and environmental conditions available in 20 commonly used, geocoded, publicly available data sets from all US counties. The primary outcome was age-adjusted years of potential life lost before 75 years of age, as computed by the National Center for Health Statistics.2 This end point is a principal target of the US Centers for Disease Control and Prevention for reducing geographic disparities.4 We also investigated alternative outcomes and found similar solutions (see eAppendix, http://links.lww.com/EDE/A775).
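As a rough illustration of the outcome measure, the sketch below computes a crude rate of years of potential life lost before age 75 per 100,000 population. It is an assumption-laden simplification: the function name `ypll_rate` and the example county are hypothetical, and the age adjustment applied by the National Center for Health Statistics is deliberately omitted.

```python
def ypll_rate(ages_at_death, population, cutoff=75, per=100_000):
    """Crude years of potential life lost before `cutoff`, per `per` residents.

    Deaths at or after the cutoff age contribute zero years.
    No age adjustment is applied (a simplification for illustration).
    """
    years_lost = sum(cutoff - age for age in ages_at_death if age < cutoff)
    return years_lost * per / population


# Hypothetical county: four deaths in one year among 10,000 residents.
ages = [30, 60, 74, 80]
rate = ypll_rate(ages, population=10_000)
print(rate)  # 45 + 15 + 1 = 61 years lost, i.e. 610.0 per 100,000
```

The published figures are age-adjusted, so a crude rate like this one would not be directly comparable across counties with different age structures.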
We analyzed the data using regression tree analysis, which can avoid bias in the presence of multicollinearity.5 This approach tests all possible combinations of interactions among all available indicator variables to identify a logical sequence of indicators associated with mortality rates. A standard complexity parameter was used to prevent overfitting.6 “Random forest” bootstrapping was performed by repeatedly drawing random subsets comprising approximately two-thirds of the complete data set and then selecting the estimators that explained the most variance in the remaining one-third of the sample. This procedure generated a large number of bootstrapped trees, from which we present the convergent solution.7,8
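The two building blocks described above can be sketched generically as follows. This is not the authors' code (that is in the eAppendix); it is a minimal illustration, assuming a single predictor, a squared-error splitting criterion, and sampling with replacement. The function names and the toy data are hypothetical.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, sse_after) for the single cut on x that minimizes SSE in y."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # baseline: no split at all
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate tied predictor values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = ((lo + hi) / 2, total)
    return best


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


# Toy data (made up): a predictor and an outcome with an obvious break.
teen_birth = [20, 30, 40, 50, 60, 70]
ypll = [5000, 5200, 5100, 9000, 9300, 9100]
threshold, remaining_sse = best_split(teen_birth, ypll)
print(threshold)  # 45.0, the midpoint between 40 and 50

# One bootstrap draw: roughly two-thirds of rows land in the bag,
# and the rest are held out for evaluating the fitted tree.
rng = random.Random(0)
in_bag, oob = bootstrap_indices(1000, rng)
unique_share = len(set(in_bag)) / 1000
```

A full regression tree applies `best_split` recursively over every candidate predictor, and a complexity parameter stops the recursion when a further split improves the fit too little.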
We identified combinations of traditional indicators that, together with some less commonly used indicators, could explain approximately 70% of geographic disparities in premature mortality. (Income, education, and race combined explained only one-third of the variance.) As illustrated in the Figure, the largest division in premature mortality among counties was between those with more than and those with fewer than 46.5 teen births per 1,000 women 15–19 years of age. A second branch of the tree further separated counties by median household income (greater or less than $42,330/year). The wealthier group had the lowest rates of premature mortality (group 1: mean 5,623 years of potential life lost before age 75 per 100,000 population; N = 503 counties). On the right side of the tree are counties with the highest rates of premature mortality. The percentage of the population that was Native American was key among counties with a high teen birth rate. The 14 US counties with the highest rates of premature mortality (group 10: mean 19,102 per 100,000) had a teen birth rate above 46.5 per 1,000, Native Americans as more than 46.6% of the population, and more than 12.5% of children uninsured.
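The two tree paths spelled out in the text can be written as simple decision rules. The sketch below encodes only those two paths, assuming that the income split sits under the low teen birth branch; the remaining groups are shown only in the Figure and are not reconstructed here. The function name, argument names, and example counties are hypothetical.

```python
def classify_county(teen_birth_rate, median_income,
                    pct_native_american, pct_children_uninsured):
    """Return 'group 1', 'group 10', or None for paths not described in the text.

    teen_birth_rate: births per 1,000 women 15-19 years of age
    median_income: median household income in dollars per year
    pct_native_american, pct_children_uninsured: percentages (0-100)
    """
    high_teen_births = teen_birth_rate > 46.5
    if not high_teen_births and median_income > 42_330:
        return "group 1"   # lowest premature mortality (mean 5,623 per 100,000)
    if (high_teen_births
            and pct_native_american > 46.6
            and pct_children_uninsured > 12.5):
        return "group 10"  # highest premature mortality (mean 19,102 per 100,000)
    return None            # other terminal groups are not given in the text


# Made-up counties for illustration only.
print(classify_county(30.0, 55_000, 1.0, 6.0))    # group 1
print(classify_county(60.0, 30_000, 60.0, 20.0))  # group 10
print(classify_county(60.0, 30_000, 5.0, 20.0))   # None
```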
FIGURE. Data-mining ...
The eAppendix (http://links.lww.com/EDE/A775) provides summary statistics, further diagnostic and cross-validation plots, additional trees with alternative outcomes and subsamples, and complete code that runs in less than 5 minutes on a standard laptop computer. As shown here, just a few parsimonious combinations of key indicators can quickly identify vulnerable counties.
1. Murray CJ, Kulkarni SC, Michaud C, et al. Eight Americas: investigating mortality disparities across races, counties, and race-counties in the United States. PLoS Med. 2006;3:e260
3. Gawande A. The hot spotters. New Yorker. 2011;86:41
4. US Department of Health and Human Services. Healthy People 2020: General Health Status Topics and Objectives. Washington, DC: US Department of Health and Human Services; 2011
5. Howard-Pitney B, Winkleby MA. Chewing tobacco: who uses and who quits? Findings from NHANES III, 1988–1994. Am J Public Health. 2002;92:250–256
6. Lewis RJ. An introduction to classification and regression tree (CART) analysis. Annual Meeting of the Society for Academic Emergency Medicine. San Francisco, CA: Citeseer; 2000:1–14
7. Breiman L. Random forests. Mach Learn. 2001;45:5–32
8. Garge NR, Bobashev G, Eggleston B. Random forest methodology for model-based recursive partitioning: the mobForest package for R. BMC Bioinformatics. 2013;14:125