In large epidemiologic cohort studies, researchers often record data on dozens or even hundreds of variables that are potentially relevant for regression models. When interaction effects are suspected, the number of potentially relevant predictors can exceed the sample size. With such high-dimensional covariate spaces, specifying the predictors to include in models can be daunting. In finding potentially important predictors, analysts often use automated model selection procedures such as stepwise regression and its variants,1 regression trees,2 and penalized regression approaches.3,4
Most of these techniques are designed for estimating conditional means of the response variable. Often, however, the effects of covariates on percentiles of the response distribution are relevant. For example, if some exposure decreases the birth weights of babies who otherwise would have average to high birth weights by some modest amount, but it does not decrease the birth weights of babies who otherwise would have low birth weights, it might not be considered deleterious. On the other hand, an exposure would be troubling if it lowers the birth weights of babies who already would have low birth weights, even if it does not have a large impact on babies who otherwise would have average to high birth weights.
Figure 1 illustrates this scenario graphically using artificial data. The relationship between the conditional mean response and the covariate is the same in each panel. Accordingly, the fitted least squares lines are nearly identical. However, the lower tail of the response in panel A is barely affected by the covariate, whereas lower quantiles in panel B are highly sensitive to the covariate's value. Such differential effects can be detected with quantile regression,5 as evident in the different slopes for the fitted quantile regressions at the fifth percentile.
Unlike least squares regression, quantile regression lets covariates affect more than the mean structure. However, analysts still must decide which among many possible covariates and interactions to include in the model. We present 2 approaches for identifying potentially important predictors in quantile regression that are particularly useful for high-dimensional covariate spaces. First, we estimate quantile regression coefficients subject to the lasso penalty.4 This penalty forces many coefficients to equal zero, leaving the most important ones to be nonzero. Second, we implement a quantile regression version of the elastic net.6 Like lasso, the elastic net encourages coefficients to equal zero. But, it more effectively identifies groups of important predictors that are highly correlated with each other.
Our exploratory framework complements existing methodology for penalized quantile regression.7–11 Specifically, we adapt Zhao and Yu's12 boosting technique to provide computationally expedient and theoretically justified algorithms for penalized quantile regression. Using both simulated data and data from a prospective study on the determinants of birth weights, we demonstrate how epidemiologists can use these algorithms to guide model specification for quantile regression in high-dimensional settings.
Standard linear regressions are of the form
where εi are independent and identically distributed errors with mean 0, β is a vector of regression coefficients with length p, and x′i is a row vector of covariates for the ith individual. The linear regression model implies that the distribution of y shifts in its mean but not in its shape for different xi.
Quantile regression allows for both the mean and shape of the distribution of y to change with x, without assuming a particular error distribution. Hence, it is more flexible than linear regression. Quantile regression uses the same basic model as (1), but assumes that εi are independent errors whose τth quantile is equal to zero, ie, Pr(εi ≤ 0) = τ. Thus, x′iβ is interpreted as the conditional τth quantile of y given xi.
Standard quantile regression does not require a formal probability model.5 Instead, coefficients are estimated by minimizing the empirical loss function
Techniques exist for estimating standard errors of the estimated coefficients, testing hypotheses, and constructing interval estimates. Koenker and Hallock13 provide an accessible introduction to the topic with several case studies, including an application to birth weight data. However, they do not provide guidance on navigating the high-dimensional covariate spaces that motivate this article.
Quantile regression has advantages for modeling birth outcomes over other regression approaches. Categorizing birth outcomes into low and high values14 or a small number of ordered categories15 is common in this field. However, using a logistic regression on an indicator for low birth weight (<2500 g) treats a birth at 2499 g as fundamentally different from a birth at 2501 g; it also treats a birth at 2499 g like one at 2000 g. Neither treatment is scientifically or clinically optimum.
Relatedly, least squares regression assumes that changing covariates merely shifts the conditional distribution. Although the marginal distribution of birth weights is nearly Gaussian, its conditional distribution is known to do more than shift with covariates. For example, baby boys tend to weigh more than girls, but this difference attenuates in the lower quantiles: small baby boys and girls are much closer in weight than a linear regression would suggest.13 Quantile regression can illuminate such changing effects.
We now present 2 approaches for exploring high-dimensional quantile regression covariate spaces. We emphasize that these are exploratory techniques for identifying useful models and do not provide formal inference.
Boosted Lasso for Quantile Regression
Model selection techniques operate by penalizing large models. For example, in linear regression, one may select the minimum Akaike Information Criterion (AIC)16 or Bayesian Information Criterion (BIC)17 model. Adding a predictor penalizes these log likelihood-based criteria values by 2 and log(n), respectively. Thus, the penalty increases with the number of included predictors p.
The lasso method uses a penalty related to the sum of the absolute values of the regression coefficients.4 Using simplified notation, lasso solutions for β minimize the function
for an empirical loss function L(β). For large values of λ1, many components of β are estimated to equal zero. As λ1 shrinks toward 0, the estimates of β move toward an unpenalized estimate. Equivalently, one can solve (4) by minimizing L(β) subject to the restriction
for a value of t1 that corresponds to λ1.4 We follow the convention of displaying lasso solutions using this representation.
AIC, BIC, and lasso each weigh how closely a candidate model fits the data (ie, the value of the loss function) against how “big” the model is. AIC and BIC measure the size of the model by counting the number of nonzero elements in β; lasso measures the size by summing the absolute values of the β components. Although these methods are similar in spirit, enumerative minimization of AIC or BIC is impractical in high-dimensional applications. For example, our motivating analysis equates to more than 600 possible predictors. Even restricting a search to models with 10 or fewer predictors results in more than 1021 possible models. In contrast, there are efficient methods to find lasso solutions, even in high dimensions.18
Before fitting a lasso or elastic net model, we center and scale covariates so that the penalties exert their force equally on all components of β. Further, we work with a centered version of the response to penalize the intercept minimally. We also add any potentially relevant interactions among the covariates to xi.
To adapt the lasso to exploratory quantile regression, we set L(β) to be the loss function in (2). Rather than use one specific t1, we investigate the solutions of (4) for a wide range of t1, beginning with zero and ending at a large value. We find the lasso solutions via a boosting algorithm12; see the eAppendix (http://links.lww.com/EDE/A495).
The solution paths, ie, plots of estimated β versus t1, can be used to identify important variables in the quantile regression. In particular, we search for variables whose estimates take on nonzero values early in the solution path. To illustrate with simulated data, consider the first panel of Figure 2, which shows the lasso solution paths for a 25th percentile regression with 10 variables, 5 of which have nonzero true coefficients. The estimates of 4 parameters quickly take on nonzero values as we move from left to right, suggesting that these are important variables. The other estimated coefficients take much longer to rise above zero, suggesting they are less important. Hence, if seeking to fit a parsimonious quantile regression, the first 4 variables would be a starting point for model building.
Although this example has a modest number of predictors, lasso works in settings with more predictors than observations; examples are given in the eAppendix (http://links.lww.com/EDE/A495). We simply follow the solution path until the penalty is weak enough that the penalized and unpenalized estimates are equal, at which point the algorithm stops. With numerous predictors, we typically are interested in models with significantly fewer nonzero β elements than observations, so we can stop the algorithm early to save computational time.
It is possible to select one value of t1, eg, via cross validation, and use the fitted model at that point in the solution path as a final model. When analysts' primary goal is making out-of-sample predictions, or when there is scant domain knowledge to guide model selection, this approach can provide useful models. However, as an exploratory tool, we prefer to use penalized quantile regression as a process for ranking the importance of groups of covariates. We then combine domain knowledge with the information gleaned from the exploratory fits in the formal model building process. This process is described further later in the text and in the eAppendix (http://links.lww.com/EDE/A495).
Boosted Elastic Net for Quantile Regression
As an exploratory tool, lasso can be suboptimal when there are highly correlated groups of predictors, as it will tend to grant only one of them a nonzero β coefficient. Here, the elastic net6 may better identify groups of important predictors. It penalizes the empirical loss function by adding a term of
to the lasso criterion. Thus, we seek the value of β that minimizes
for given λ1 and λ2. As with lasso, larger values of λ1 encourage more zeros in the solution for β. With a large value of λ2, components of β are encouraged to take on small but nonzero values, because the penalty disfavors large coefficients.
To see why the elastic net penalty encourages groups of correlated predictors to enter the model simultaneously relative to lasso, consider covariates X1 and X2 that are nearly identical for all subjects. The lasso penalty cannot easily distinguish between a solution with (β1, β2) = (0,φ) and (β1, β2) = (φ/2, φ/2) for some φ ≠ 0, so that the solution may present only one nonzero predictor. The elastic net penalty with λ2 > 0 prefers (φ/2, φ/2), so that both predictors take on nonzero values at similar points in the solution path.
To adapt the elastic net to exploratory quantile regression, we let L(β) be the loss function in (2). Rather than select one value of (λ1, λ2), we investigate the solutions for a range of values to identify important predictors. To facilitate the exploration, we replace λ2 with λ1λ*2, where λ*2 is a positive constant. For a fixed λ*2, we fit the solution path as λ1 shrinks toward 0. As with lasso, we look for groups of coefficients that move away from zero early and fast in the solution path. We again use boosting to fit the solution paths given λ*2; see the eAppendix (http://links.lww.com/EDE/A495).
We examine the solution paths for several values of λ*2, searching for a value of λ*2 large enough to allow more variables to enter the model early in the solution path (compared with the lasso fit), but small enough that some parsimony is maintained early in the solution path. Appropriate values of λ*2 depend on the data. We recommend starting with λ*2 = 1 and moving up or down by factors of 5 or 10 until these goals are achieved. It may be helpful to inspect intermediate values of λ*2, but we have experienced only moderate sensitivity to λ*2 settings.
To illustrate, consider Figure 2, which shows the fits for λ*2 between 0 and 1000. The fit with λ*2 = 1 is nearly unchanged from the lasso (λ*2 = 0) fit. When λ*2 is 10 or 100, a fifth variable enters the model early, thus appearing as an important predictor. When we increase λ*2 to 500 or 1000, it becomes difficult to discern exactly which variables are likely to be most important because most of them take on nonzero values early in the solution path; we argue that these fits are not useful for an exploratory analysis. Thus, in this simulated dataset, fits with λ*2 equal to 10 or 100 satisfy our goals.
A Framework for Penalized Exploratory Quantile Regression
The lasso and elastic net quantile regressions can be combined to improve exploratory analyses for quantile regression. We recommend the following steps.
First, we fit the lasso regression solution path at the quantile of interest. We isolate variables with coefficients that take on nonzero values early in the solution path. These variables are the starting point for formal model building. We can rank the importance of the remaining variables by the order in which their corresponding coefficients take on nonzero values. We can choose to include some of these or additional variables in the final model based on scientific considerations and formal hypothesis testing.
Second, as a check on the lasso results, we fit the elastic net at several values of λ*2, as outlined above. Here, we look for variables that appear early in the elastic net fit but are not prominent in the lasso. These variables can be considered for inclusion based on scientific merits and formal hypothesis tests. The elastic net is especially useful in settings where interaction effects are suspected, because interaction variables are often collinear with main effects, and thus may be excluded by the lasso.
This is an exploratory framework designed to highlight prominent groups of predictors. Like all exploratory model-building techniques, it should supplement—not replace—scientific rationales for model building. Analysts should add or subtract variables accordingly. Nonetheless, in high-dimensional settings with many predictors, exploratory quantile regression tools can help analysts find important predictors that might otherwise be difficult to uncover, as we now illustrate on genuine data.
We apply the exploratory framework to the Healthy Pregnancy, Healthy Baby Study, an observational prospective cohort study in Durham, NC, focused on the etiology of adverse birth outcomes. These adverse outcomes, including low birth weight and preterm birth, have been linked to many problems19 including blindness,20 deafness,21 and behavioral problems.22 Unfortunately, the causes of adverse birth outcomes are poorly understood, although predictors cited as important include smoking,23 lead exposure,24 environmental exposures more generally,19 and psychologic stress.25–27
We focus on non-Hispanic black mothers, with a sample size of 881 births. We seek models for birth weight and gestational age, with particular emphasis on low quantiles of these distributions, eg, the 10th, 20th, and 30th percentiles. The data comprise 35 covariates, including maternal demographics such as age, education, and income; maternal medical history variables such as previous birth outcomes; maternal environmental exposures to cadmium, nicotine, cotinine, mercury, and lead as measured in blood; and maternal psychosocial factors such as scores on the Interpersonal Support Evaluation List, the Center for Epidemiologic Studies Depression scale, the NEO Five-Factor Personality Inventory, perceived racism, and availability of social support. We believe that interactions among these variables may be important predictors of birth outcomes. Therefore, we include in x the main effects of all predictors (using indicator variables for multicategory covariates), linear and squared terms for continuous predictors, and all two-way interactions, resulting in over 600 covariates for consideration. Summary statistics for the covariates are given in Table 1, and a histogram of birth weights is provided in eFigure 1 (http://links.lww.com/EDE/A495). Missing data in x (birth outcomes are complete) were multiply imputed using regression trees.28
With lasso quantile regressions, measures of tobacco smoke exposure take on primary importance, particularly for birth weight. When modeling the 10th percentile of birth weight, 4 of the first 9 covariates to enter the model are interactions that include a tobacco measure (Fig. 3); at the 20th percentile, 4 of the first 7 are tobacco-related; at the 30th percentile, 5 of the first 6 are interactions involving tobacco. These results are consistent with the literature that extensively documents the impact of smoking on birthweight. Across the quantiles, a lead/tobacco interaction is consistently flagged.
Other important variables from the lasso 10th percentile regression for birth weight include the mother's age selected as a squared term and in an interaction with environmental tobacco smoke exposure. The former suggests that the association between age and birth weight is a U-shaped curve, which is again consistent with the literature.29 In addition, the Interpersonal Support Evaluation List appraisal score, which is a measure of the availability of someone with whom to discuss problems, was selected in an interaction with a variable indicating perceived social standing and an interaction with having a “visiting” relationship with the father of the child. At the 20th percentile, prominent variables (other than those involving tobacco) include an interaction between negative paternal support and NEO-“conscientiousness” (a measure of organization and persistence) and an interaction between negative paternal support and the perceived self-efficacy score, which measures a woman's belief that she can steer her own life's trajectory.30 Qualitatively, the results are similar at the 30th percentile of birth weight. Smoking measures are prominent, and environmental measures other than a tobacco/lead interaction are slow to enter the model.
In the presented results, we focus on birth weight as the response because lasso fits of gestational age reveal fewer useful explanatory variables, with less consistency across the response quantiles. This may be evidence that the covariates poorly capture the determinants of gestational age.
With the elastic net, the fits are essentially identical to the lasso fit when λ*2 = 0.01. When λ*2 = 0.1, dozens of variables take on nonzero values very early in the fitting process, which makes identifying the important variables difficult. Hence, we focus on the λ*2 = 0.05 elastic net. This fit generally selects the same variables as lasso, but it also identifies other potentially important covariates: interactions between blood lead levels and mother's self-reported tobacco use at the start of the pregnancy, and between blood lead levels and mother's tobacco use during the pregnancy, are among the early variables with nonzero coefficients.
Building on the exploratory analyses, we estimate regressions for the 10th, 20th, and 30th percentiles of birth weight. Although not selected in the exploratory stage, we include standard controls known to be predictors of birth weight, including the baby's sex and an indicator for first birth.29,31 Because the mother's age appeared important as a squared term, we include age and age squared. We also consider the interaction of tobacco use at prenatal intake with lead, which was the first variable to appear in our 10th percentile lasso and elastic net fits, and we add in the main effects to comply with the hierarchy principle. We also fit models including parental relationship status, appraisal scores, and their interaction; however, statistically insignificant F tests of these effects suggested that they could be removed from the models without much sacrifice. We similarly remove interaction of maternal age and tobacco after an F test.
The results are summarized in Table 2, along with a median regression for completeness. Although age does not approach significance in any of the presented quantiles, (centered) age squared is consistently negative. This suggests a U-shaped effect of maternal age on birth weight; ie, women at the bottom and top end of the age distribution will tend to have babies with lower birth weights. It is not until the 30th percentile that infant sex nearly reaches significance, and parity (first or higher-order birth) never reaches significance in the quantiles presented. Since infant sex and parity are well-established predictors of birth weight in traditional estimates of the mean regression analysis, we might reasonably conclude that at lower quantiles, other processes overwhelm the effects that dominate at the mean.
The coefficients on the environmental variables (the continuous blood lead measurement, tobacco use at prenatal care intake, and their interaction) are especially interesting. The combined main and interaction effects of lead and tobacco are essentially a wash at the 10th percentile. At the 20th percentile, we found a net decrease in birth weight of 174 g among women who both use tobacco and who have lead exposure one standard deviation above the mean. However, at the 10th and 20th percentiles, these coefficient estimates are not significant. At the 30th percentile, the combined main and interaction effects are associated with a 168 g decrement in birth weight. Here, the interaction effect is significant, and the 2 main effects approach significance.
The meta-analysis of Navas et al32p. 478 found that “evidence is sufficient to infer a causal relationship between lead exposure and high blood pressure.” Concurrently, hypertension is associated with poorer birth outcomes.29 We also know the anomalous but persistent result that smoking during pregnancy reduces the risk for preeclampsia33–35 (although among women with preeclampsia, smoking increases the risk for perinatal mortality and fetal growth restriction). Both preeclampsia and maternal hypertension are associated with lower birth weights. Because hypertension can be either the cause or the result of problems in pregnancy (especially if it progresses to preeclampsia), this exploratory analysis suggests that a nuanced treatment of the lead-hypertension-smoking nexus may be critical to understanding adverse birth outcomes. Such modeling is part of our ongoing research agenda.
A recent report14p. 229 noted that, while the relationships between environmental exposures and preterm birth are in general poorly understood, “possible exceptions are lead and environmental tobacco smoke, for which the weight of evidence suggests that maternal exposure to these pollutants increases the risk for preterm birth.” These variables are also important contributors to low birth weights.19 The potential interaction of lead and smoking—while apparently unreported in the birth outcomes literature—has been highlighted in other contexts. For example, the interaction has been found to be important for predicting mortality due to cancer36 and attention deficit/hyperactivity disorder.37
When analysts suspect that the effect of some covariate is not constant across the distribution of a continuous response variable, as in Figure 1, they can use quantile regression to estimate the potential differences. Quantile regression can be particularly useful to epidemiologists, as scientific interest often focuses on persons who have high or low—rather than average—values of the response, given an arrangement of covariates. For example, epidemiologists studying blood pressures could be more interested in identifying covariates that increase the upper tails of blood pressure to medically dangerous levels compared with ones that increase the mean blood pressure within the normal range.38 Examples of other uses of quantile regression include studying the effects of tobacco in relation to sleep patterns,39 degradation of mental acuity in multiple sclerosis patients,40 and determinants of obesity.41
We presented an exploratory framework for identifying potentially important covariates in quantile regression for high-dimensional predictor spaces. The underlying algorithms, based on the lasso and elastic net penalties, are computationally fast and simple to use, thereby offering methods for investigating the importance of hundreds of potential regressors on noncentral features of the response distribution. We applied the framework to a study of adverse birth outcomes and identified an interaction between maternal lead exposure and smoking that was not previously discussed in the literature. As quantile regression gains popularity and datasets become richer, these exploratory techniques may be a useful addition to the epidemiologist's toolbox.
We thank Geeta Swamy and 2 anonymous referees for comments that improved the manuscript.
1. Bendel RB, Afifi AA. Comparison of stopping rules in forward “stepwise” regression. J Am Stat Assoc
2. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees
. Boca Raton: Chapman & Hall/CRC; 1984.
3. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics
4. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol
5. Koenker R, Bassett G Jr. Regression quantiles. Econometrica
6. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol
7. Fenske N, Kneib T, Hothorn T. Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. Technical Report, University of Munich Department of Statistics, 2009.
8. Wu Y, Liu Y. Variable selection in quantile regression. Stat Sin
9. Koenker R. Quantile regression for longitudinal data. J Multivar Anal
10. Li Y, Zhu J. L1
-norm quantile regression. J Comput Graph Stat
11. Li Q, Xi R, Lin N. Bayesian regularized quantile regression. Bayesian Anal
12. Zhao P, Yu B. Boosted lasso. In: Lui H, Stine R, Auslender L, chairs. Workshop on Feature Selection for Data Mining
. Newport Beach, CA: SIAM; 2005:35–44.
13. Koenker R, Hallock KF. Quantile regression. J Econ Perspect
14. Behrman RE, Butler AS. Preterm Birth: Causes, Consequences and Prevention
. Washington, DC: National Academy Press; 1997.
15. Windham GC, Hopkins B, Fenster L, Swan SH. Prenatal active or passive tobacco smoke exposure and the risk of preterm delivery or low birth weight. Epidemiology
16. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrox BN, Caski F, eds. Second International Symposium on Information Theory
. Budapest: Academiai Kiado; 1973:267–281.
17. Schwarz G. Estimating the dimension of a model. Ann Stat
18. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat
19. Miranda ML, Maxson P, Edwards S. Environmental contributions to disparities in pregnancy outcomes. Epidemiol Rev
20. Crofts BJ, King R, Johnson A. The contribution of low birth weight to severe vision loss in a geographically defined population. Br J Ophthalmol
21. Lorenz JM, Wooliever DE, Jetton JR, Paneth N. A quantitative review of mortality and developmental disability in extremely premature newborns. Arch Pediatr Adolesc Med
22. McCormick MC, Gortmaker SL, Sobol AM. Very low birth weight children: behavior problems and school difficulty in a national sample. J Pediatr
23. Savitz DA, Dole N, Terry JW Jr, Zhou H, Thorp JM Jr. Smoking and pregnancy outcome among African-American and white women in central North Carolina. Epidemiology
24. Gonzalez-Cossio T, Peterson KE, Sanin LH, et al. Decrease in birth weight in relation to maternal bone-lead burden. Pediatrics
25. Newton RW, Hunt LP. Psychosocial stress in pregnancy and its relation to low birth weight. BMJ
26. Orr ST, Reiter JP, Blazer DG, James SA. Maternal prenatal pregnancy-related anxiety and spontaneous preterm birth in Baltimore, Maryland. Psychosom Med
27. Orr ST, Blazer DG, James SA, Reiter JP. Depressive symptoms and indicators of maternal health status during pregnancy. J Womens Health
28. Burgette LF, Reiter JP. Multiple imputation for missing data via sequential regression trees. Am J Epidemiol
29. Swamy GK, Edwards S, Gelfand A, James SA, Miranda ML. Maternal age, birth order and race: differential effects on birthweight. J Epidemiol Community Health
. In press.
31. Magnus P, Berg K, Bjerkedal T. The association of parity and birth weight: testing the sensitization hypothesis. Early Hum Dev
32. Navas-Acien A, Guallar E, Silbergeld EK, Rothenberg SJ. Lead exposure and cardiovascular disease—A systematic review. Environ Health Perspect
33. Cnattingius S, Mills JL, Yuen J, Eriksson O, Salonen H. The paradoxical effect of smoking in preeclamptic pregnancies: smoking reduces the incidence but increases the rates of perinatal mortality, abruptio placentae, and intrauterine growth restriction. Am J Obstet Gynecol
34. Castles A, Adams EK, Melvin CL, Kelsch C, Boulton ML. Effects of smoking during pregnancy: five meta-analyses. Am J Prev Med
35. Lain KY, Powers RW, Krohn MA, Ness RB, Crombleholme WR, Roberts JM. Urinary cotinine concentration confirms the reduced risk of preeclampsia with tobacco exposure. Am J Obstet Gynecol
36. Lustberg M, Silbergeld E. Blood lead levels and mortality. Arch Intern Med
37. Froehlich TE, Lanphear BP, Auinger P, et al. Association of tobacco and lead exposures with attention deficit/hyperactivity disorder. Pediatrics
38. Muntner P, He J, Cutler JA, Wildman RP, Whelton PK. Trends in blood pressure among children and adolescents. JAMA
39. Zhang L, Samet J, Caffo B, Punjabi NM. Cigarette smoking and nocturnal sleep architecture. Am J Epidemiol
40. Marrie RA, Dawson NV, Garland A. Quantile regression and restricted cubic splines are useful for exploring relationships between continuous variables. J Clin Epidemiol
41. Kan K, Tsai WD. Obesity and risk knowledge. J Health Econ