# Exploratory Quantile Regression With Many Covariates: An Application to Adverse Birth Outcomes

Covariates may affect continuous responses differently at various points of the response distribution. For example, some exposure might have minimal impact on conditional means, whereas it might lower conditional 10th percentiles sharply. Such differential effects can be important to detect. In studies of the determinants of birth weight, for instance, it is critical to identify exposures like the one above, since low birth weight is a risk factor for later health problems. Effects of covariates on the tails of distributions can be obscured by models (such as linear regression) that estimate conditional means; however, effects on tails can be detected by quantile regression. We present 2 approaches for exploring high-dimensional predictor spaces to identify important predictors for quantile regression. These are based on the lasso and elastic net penalties. We apply the approaches to a prospective cohort study of adverse birth outcomes that includes a wide array of demographic, medical, psychosocial, and environmental variables. Although tobacco exposure is known to be associated with lower birth weights, the analysis suggests an interesting interaction effect not previously reported: tobacco exposure depresses the 20th and 30th percentiles of birth weight more strongly when mothers have high levels of lead in their blood compared with those who have low blood lead levels.

SUPPLEMENTAL DIGITAL CONTENT IS AVAILABLE IN THE TEXT.

From the ^{a}Department of Statistical Science, Duke University, Durham, NC; and ^{b}Nicholas School of the Environment and Department of Pediatrics, Duke University, Durham, NC.

Submitted October 2010; accepted May 2011.

Supported by U.S. Environmental Protection Agency (grant R833293).

Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com).

Correspondence: Lane F. Burgette, Department of Statistical Science, Box 90251, Duke University, Durham, NC 27708. E-mail: lb131@stat.duke.edu.

In large epidemiologic cohort studies, researchers often record data on dozens or even hundreds of variables that are potentially relevant for regression models. When interaction effects are suspected, the number of potentially relevant predictors can exceed the sample size. With such high-dimensional covariate spaces, specifying the predictors to include in models can be daunting. In finding potentially important predictors, analysts often use automated model selection procedures such as stepwise regression and its variants,^{1} regression trees,^{2} and penalized regression approaches.^{3,4}

Most of these techniques are designed for estimating conditional means of the response variable. Often, however, the effects of covariates on percentiles of the response distribution are relevant. For example, if some exposure decreases the birth weights of babies who otherwise would have average to high birth weights by some modest amount, but it does not decrease the birth weights of babies who otherwise would have low birth weights, it might not be considered deleterious. On the other hand, an exposure would be troubling if it lowers the birth weights of babies who already would have low birth weights, even if it does not have a large impact on babies who otherwise would have average to high birth weights.

Figure 1 illustrates this scenario graphically using artificial data. The relationship between the conditional mean response and the covariate is the same in each panel. Accordingly, the fitted least squares lines are nearly identical. However, the lower tail of the response in panel A is barely affected by the covariate, whereas lower quantiles in panel B are highly sensitive to the covariate's value. Such differential effects can be detected with quantile regression,^{5} as evident in the different slopes for the fitted quantile regressions at the fifth percentile.

Unlike least squares regression, quantile regression lets covariates affect more than the mean structure. However, analysts still must decide which among many possible covariates and interactions to include in the model. We present 2 approaches for identifying potentially important predictors in quantile regression that are particularly useful for high-dimensional covariate spaces. First, we estimate quantile regression coefficients subject to the lasso penalty.^{4} This penalty forces many coefficients to equal zero, leaving the most important ones to be nonzero. Second, we implement a quantile regression version of the elastic net.^{6} Like lasso, the elastic net encourages coefficients to equal zero. But, it more effectively identifies groups of important predictors that are highly correlated with each other.

Our exploratory framework complements existing methodology for penalized quantile regression.^{7–11} Specifically, we adapt Zhao and Yu's^{12} boosting technique to provide computationally expedient and theoretically justified algorithms for penalized quantile regression. Using both simulated data and data from a prospective study on the determinants of birth weights, we demonstrate how epidemiologists can use these algorithms to guide model specification for quantile regression in high-dimensional settings.

## METHODS

Standard linear regressions are of the form

where *εi* are independent and identically distributed errors with mean 0, β is a vector of regression coefficients with length *p*, and *x′i* is a row vector of covariates for the *i*th individual. The linear regression model implies that the distribution of *y* shifts in its mean but not in its shape for different *xi*.

Quantile regression allows for both the mean and shape of the distribution of *y* to change with *x*, without assuming a particular error distribution. Hence, it is more flexible than linear regression. Quantile regression uses the same basic model as (1), but assumes that *εi* are independent errors whose τth quantile is equal to zero, ie, Pr(*εi* ≤ 0) = τ. Thus, *x′iβ* is interpreted as the conditional τth quantile of *y* given *xi*.

Standard quantile regression does not require a formal probability model.^{5} Instead, coefficients are estimated by minimizing the empirical loss function

where

Techniques exist for estimating standard errors of the estimated coefficients, testing hypotheses, and constructing interval estimates. Koenker and Hallock^{13} provide an accessible introduction to the topic with several case studies, including an application to birth weight data. However, they do not provide guidance on navigating the high-dimensional covariate spaces that motivate this article.

Quantile regression has advantages for modeling birth outcomes over other regression approaches. Categorizing birth outcomes into low and high values^{14} or a small number of ordered categories^{15} is common in this field. However, using a logistic regression on an indicator for low birth weight (<2500 g) treats a birth at 2499 g as fundamentally different from a birth at 2501 g; it also treats a birth at 2499 g like one at 2000 g. Neither treatment is scientifically or clinically optimum.

Relatedly, least squares regression assumes that changing covariates merely shifts the conditional distribution. Although the marginal distribution of birth weights is nearly Gaussian, its conditional distribution is known to do more than shift with covariates. For example, baby boys tend to weigh more than girls, but this difference attenuates in the lower quantiles: small baby boys and girls are much closer in weight than a linear regression would suggest.^{13} Quantile regression can illuminate such changing effects.

We now present 2 approaches for exploring high-dimensional quantile regression covariate spaces. We emphasize that these are exploratory techniques for identifying useful models and do not provide formal inference.

### Boosted Lasso for Quantile Regression

Model selection techniques operate by penalizing large models. For example, in linear regression, one may select the minimum Akaike Information Criterion (AIC)^{16} or Bayesian Information Criterion (BIC)^{17} model. Adding a predictor penalizes these log likelihood-based criteria values by 2 and log(*n*), respectively. Thus, the penalty increases with the number of included predictors *p*.

The lasso method uses a penalty related to the sum of the absolute values of the regression coefficients.^{4} Using simplified notation, lasso solutions for β minimize the function

for an empirical loss function *L*(β). For large values of λ_{1}, many components of β are estimated to equal zero. As λ_{1} shrinks toward 0, the estimates of β move toward an unpenalized estimate. Equivalently, one can solve (4) by minimizing *L*(β) subject to the restriction

for a value of *t* _{1} that corresponds to λ_{1}.^{4} We follow the convention of displaying lasso solutions using this representation.

AIC, BIC, and lasso each weigh how closely a candidate model fits the data (ie, the value of the loss function) against how “big” the model is. AIC and BIC measure the size of the model by counting the number of nonzero elements in β; lasso measures the size by summing the absolute values of the β components. Although these methods are similar in spirit, enumerative minimization of AIC or BIC is impractical in high-dimensional applications. For example, our motivating analysis equates to more than 600 possible predictors. Even restricting a search to models with 10 or fewer predictors results in more than 10^{21} possible models. In contrast, there are efficient methods to find lasso solutions, even in high dimensions.^{18}

Before fitting a lasso or elastic net model, we center and scale covariates so that the penalties exert their force equally on all components of β. Further, we work with a centered version of the response to penalize the intercept minimally. We also add any potentially relevant interactions among the covariates to *xi*.

To adapt the lasso to exploratory quantile regression, we set *L*(β) to be the loss function in (2). Rather than use one specific *t* _{1}, we investigate the solutions of (4) for a wide range of *t* _{1}, beginning with zero and ending at a large value. We find the lasso solutions via a boosting algorithm^{12}; see the eAppendix (http://links.lww.com/EDE/A495).

The solution paths, ie, plots of estimated β versus *t* _{1}, can be used to identify important variables in the quantile regression. In particular, we search for variables whose estimates take on nonzero values early in the solution path. To illustrate with simulated data, consider the first panel of Figure 2, which shows the lasso solution paths for a 25th percentile regression with 10 variables, 5 of which have nonzero true coefficients. The estimates of 4 parameters quickly take on nonzero values as we move from left to right, suggesting that these are important variables. The other estimated coefficients take much longer to rise above zero, suggesting they are less important. Hence, if seeking to fit a parsimonious quantile regression, the first 4 variables would be a starting point for model building.

Although this example has a modest number of predictors, lasso works in settings with more predictors than observations; examples are given in the eAppendix (http://links.lww.com/EDE/A495). We simply follow the solution path until the penalty is weak enough that the penalized and unpenalized estimates are equal, at which point the algorithm stops. With numerous predictors, we typically are interested in models with significantly fewer nonzero β elements than observations, so we can stop the algorithm early to save computational time.

It is possible to select one value of *t* _{1}, eg, via cross validation, and use the fitted model at that point in the solution path as a final model. When analysts' primary goal is making out-of-sample predictions, or when there is scant domain knowledge to guide model selection, this approach can provide useful models. However, as an exploratory tool, we prefer to use penalized quantile regression as a process for ranking the importance of groups of covariates. We then combine domain knowledge with the information gleaned from the exploratory fits in the formal model building process. This process is described further later in the text and in the eAppendix (http://links.lww.com/EDE/A495).

### Boosted Elastic Net for Quantile Regression

As an exploratory tool, lasso can be suboptimal when there are highly correlated groups of predictors, as it will tend to grant only one of them a nonzero β coefficient. Here, the elastic net^{6} may better identify groups of important predictors. It penalizes the empirical loss function by adding a term of

to the lasso criterion. Thus, we seek the value of β that minimizes

for given λ_{1} and λ_{2}. As with lasso, larger values of λ_{1} encourage more zeros in the solution for β. With a large value of λ_{2}, components of β are encouraged to take on small but nonzero values, because the penalty disfavors large coefficients.

To see why the elastic net penalty encourages groups of correlated predictors to enter the model simultaneously relative to lasso, consider covariates *X* _{1} and *X* _{2} that are nearly identical for all subjects. The lasso penalty cannot easily distinguish between a solution with (β_{1}, β_{2}) = (0,φ) and (β_{1}, β_{2}) = (φ/2, φ/2) for some φ ≠ 0, so that the solution may present only one nonzero predictor. The elastic net penalty with λ_{2} > 0 prefers (φ/2, φ/2), so that both predictors take on nonzero values at similar points in the solution path.

To adapt the elastic net to exploratory quantile regression, we let *L*(β) be the loss function in (2). Rather than select one value of (λ_{1}, λ_{2}), we investigate the solutions for a range of values to identify important predictors. To facilitate the exploration, we replace λ_{2} with λ_{1}λ*_{2}, where λ*_{2} is a positive constant. For a fixed λ*_{2}, we fit the solution path as λ_{1} shrinks toward 0. As with lasso, we look for groups of coefficients that move away from zero early and fast in the solution path. We again use boosting to fit the solution paths given λ*_{2}; see the eAppendix (http://links.lww.com/EDE/A495).

We examine the solution paths for several values of λ*_{2}, searching for a value of λ*_{2} large enough to allow more variables to enter the model early in the solution path (compared with the lasso fit), but small enough that some parsimony is maintained early in the solution path. Appropriate values of λ*_{2} depend on the data. We recommend starting with λ*_{2} = 1 and moving up or down by factors of 5 or 10 until these goals are achieved. It may be helpful to inspect intermediate values of λ*_{2}, but we have experienced only moderate sensitivity to λ*_{2} settings.

To illustrate, consider Figure 2, which shows the fits for λ*_{2} between 0 and 1000. The fit with λ*_{2} = 1 is nearly unchanged from the lasso (λ*_{2} = 0) fit. When λ*_{2} is 10 or 100, a fifth variable enters the model early, thus appearing as an important predictor. When we increase λ*_{2} to 500 or 1000, it becomes difficult to discern exactly which variables are likely to be most important because most of them take on nonzero values early in the solution path; we argue that these fits are not useful for an exploratory analysis. Thus, in this simulated dataset, fits with λ*_{2} equal to 10 or 100 satisfy our goals.

### A Framework for Penalized Exploratory Quantile Regression

The lasso and elastic net quantile regressions can be combined to improve exploratory analyses for quantile regression. We recommend the following steps.

First, we fit the lasso regression solution path at the quantile of interest. We isolate variables with coefficients that take on nonzero values early in the solution path. These variables are the starting point for formal model building. We can rank the importance of the remaining variables by the order in which their corresponding coefficients take on nonzero values. We can choose to include some of these or additional variables in the final model based on scientific considerations and formal hypothesis testing.

Second, as a check on the lasso results, we fit the elastic net at several values of λ*_{2}, as outlined above. Here, we look for variables that appear early in the elastic net fit but are not prominent in the lasso. These variables can be considered for inclusion based on scientific merits and formal hypothesis tests. The elastic net is especially useful in settings where interaction effects are suspected, because interaction variables are often collinear with main effects, and thus may be excluded by the lasso.

This is an exploratory framework designed to highlight prominent groups of predictors. Like all exploratory model-building techniques, it should supplement—not replace—scientific rationales for model building. Analysts should add or subtract variables accordingly. Nonetheless, in high-dimensional settings with many predictors, exploratory quantile regression tools can help analysts find important predictors that might otherwise be difficult to uncover, as we now illustrate on genuine data.

## APPLICATION

We apply the exploratory framework to the Healthy Pregnancy, Healthy Baby Study, an observational prospective cohort study in Durham, NC, focused on the etiology of adverse birth outcomes. These adverse outcomes, including low birth weight and preterm birth, have been linked to many problems^{19} including blindness,^{20} deafness,^{21} and behavioral problems.^{22} Unfortunately, the causes of adverse birth outcomes are poorly understood, although predictors cited as important include smoking,^{23} lead exposure,^{24} environmental exposures more generally,^{19} and psychologic stress.^{25–27}

We focus on non-Hispanic black mothers, with a sample size of 881 births. We seek models for birth weight and gestational age, with particular emphasis on low quantiles of these distributions, eg, the 10th, 20th, and 30th percentiles. The data comprise 35 covariates, including maternal demographics such as age, education, and income; maternal medical history variables such as previous birth outcomes; maternal environmental exposures to cadmium, nicotine, cotinine, mercury, and lead as measured in blood; and maternal psychosocial factors such as scores on the Interpersonal Support Evaluation List, the Center for Epidemiologic Studies Depression scale, the NEO Five-Factor Personality Inventory, perceived racism, and availability of social support. We believe that interactions among these variables may be important predictors of birth outcomes. Therefore, we include in *x* the main effects of all predictors (using indicator variables for multicategory covariates), linear and squared terms for continuous predictors, and all two-way interactions, resulting in over 600 covariates for consideration. Summary statistics for the covariates are given in Table 1, and a histogram of birth weights is provided in eFigure 1 (http://links.lww.com/EDE/A495). Missing data in *x* (birth outcomes are complete) were multiply imputed using regression trees.^{28}

With lasso quantile regressions, measures of tobacco smoke exposure take on primary importance, particularly for birth weight. When modeling the 10th percentile of birth weight, 4 of the first 9 covariates to enter the model are interactions that include a tobacco measure (Fig. 3); at the 20th percentile, 4 of the first 7 are tobacco-related; at the 30th percentile, 5 of the first 6 are interactions involving tobacco. These results are consistent with the literature that extensively documents the impact of smoking on birthweight. Across the quantiles, a lead/tobacco interaction is consistently flagged.

Other important variables from the lasso 10th percentile regression for birth weight include the mother's age selected as a squared term and in an interaction with environmental tobacco smoke exposure. The former suggests that the association between age and birth weight is a U-shaped curve, which is again consistent with the literature.^{29} In addition, the Interpersonal Support Evaluation List appraisal score, which is a measure of the availability of someone with whom to discuss problems, was selected in an interaction with a variable indicating perceived social standing and an interaction with having a “visiting” relationship with the father of the child. At the 20th percentile, prominent variables (other than those involving tobacco) include an interaction between negative paternal support and NEO-“conscientiousness” (a measure of organization and persistence) and an interaction between negative paternal support and the perceived self-efficacy score, which measures a woman's belief that she can steer her own life's trajectory.^{30} Qualitatively, the results are similar at the 30th percentile of birth weight. Smoking measures are prominent, and environmental measures other than a tobacco/lead interaction are slow to enter the model.

In the presented results, we focus on birth weight as the response because lasso fits of gestational age reveal fewer useful explanatory variables, with less consistency across the response quantiles. This may be evidence that the covariates poorly capture the determinants of gestational age.

With the elastic net, the fits are essentially identical to the lasso fit when λ*_{2} = 0.01. When λ*_{2} = 0.1, dozens of variables take on nonzero values very early in the fitting process, which makes identifying the important variables difficult. Hence, we focus on the λ*_{2} = 0.05 elastic net. This fit generally selects the same variables as lasso, but it also identifies other potentially important covariates: interactions between blood lead levels and mother's self-reported tobacco use at the start of the pregnancy, and between blood lead levels and mother's tobacco use during the pregnancy, are among the early variables with nonzero coefficients.

Building on the exploratory analyses, we estimate regressions for the 10th, 20th, and 30th percentiles of birth weight. Although not selected in the exploratory stage, we include standard controls known to be predictors of birth weight, including the baby's sex and an indicator for first birth.^{29,31} Because the mother's age appeared important as a squared term, we include age and age squared. We also consider the interaction of tobacco use at prenatal intake with lead, which was the first variable to appear in our 10th percentile lasso and elastic net fits, and we add in the main effects to comply with the hierarchy principle. We also fit models including parental relationship status, appraisal scores, and their interaction; however, statistically insignificant *F* tests of these effects suggested that they could be removed from the models without much sacrifice. We similarly remove interaction of maternal age and tobacco after an *F* test.

The results are summarized in Table 2, along with a median regression for completeness. Although age does not approach significance in any of the presented quantiles, (centered) age squared is consistently negative. This suggests a U-shaped effect of maternal age on birth weight; ie, women at the bottom and top end of the age distribution will tend to have babies with lower birth weights. It is not until the 30th percentile that infant sex nearly reaches significance, and parity (first or higher-order birth) never reaches significance in the quantiles presented. Since infant sex and parity are well-established predictors of birth weight in traditional estimates of the mean regression analysis, we might reasonably conclude that at lower quantiles, other processes overwhelm the effects that dominate at the mean.

The coefficients on the environmental variables (the continuous blood lead measurement, tobacco use at prenatal care intake, and their interaction) are especially interesting. The combined main and interaction effects of lead and tobacco are essentially a wash at the 10th percentile. At the 20th percentile, we found a net decrease in birth weight of 174 g among women who both use tobacco and who have lead exposure one standard deviation above the mean. However, at the 10th and 20th percentiles, these coefficient estimates are not significant. At the 30th percentile, the combined main and interaction effects are associated with a 168 g decrement in birth weight. Here, the interaction effect is significant, and the 2 main effects approach significance.

The meta-analysis of Navas et al^{32} ^{p. 478} found that “evidence is sufficient to infer a causal relationship between lead exposure and high blood pressure.” Concurrently, hypertension is associated with poorer birth outcomes.^{29} We also know the anomalous but persistent result that smoking during pregnancy reduces the risk for preeclampsia^{33–35} (although among women with preeclampsia, smoking increases the risk for perinatal mortality and fetal growth restriction). Both preeclampsia and maternal hypertension are associated with lower birth weights. Because hypertension can be either the cause or the result of problems in pregnancy (especially if it progresses to preeclampsia), this exploratory analysis suggests that a nuanced treatment of the lead-hypertension-smoking nexus may be critical to understanding adverse birth outcomes. Such modeling is part of our ongoing research agenda.

A recent report^{14} ^{p. 229} noted that, while the relationships between environmental exposures and preterm birth are in general poorly understood, “possible exceptions are lead and environmental tobacco smoke, for which the weight of evidence suggests that maternal exposure to these pollutants increases the risk for preterm birth.” These variables are also important contributors to low birth weights.^{19} The potential interaction of lead and smoking—while apparently unreported in the birth outcomes literature—has been highlighted in other contexts. For example, the interaction has been found to be important for predicting mortality due to cancer^{36} and attention deficit/hyperactivity disorder.^{37}

## SUMMARY

When analysts suspect that the effect of some covariate is not constant across the distribution of a continuous response variable, as in Figure 1, they can use quantile regression to estimate the potential differences. Quantile regression can be particularly useful to epidemiologists, as scientific interest often focuses on persons who have high or low—rather than average—values of the response, given an arrangement of covariates. For example, epidemiologists studying blood pressures could be more interested in identifying covariates that increase the upper tails of blood pressure to medically dangerous levels compared with ones that increase the mean blood pressure within the normal range.^{38} Examples of other uses of quantile regression include studying the effects of tobacco in relation to sleep patterns,^{39} degradation of mental acuity in multiple sclerosis patients,^{40} and determinants of obesity.^{41}

We presented an exploratory framework for identifying potentially important covariates in quantile regression for high-dimensional predictor spaces. The underlying algorithms, based on the lasso and elastic net penalties, are computationally fast and simple to use, thereby offering methods for investigating the importance of hundreds of potential regressors on noncentral features of the response distribution. We applied the framework to a study of adverse birth outcomes and identified an interaction between maternal lead exposure and smoking that was not previously discussed in the literature. As quantile regression gains popularity and datasets become richer, these exploratory techniques may be a useful addition to the epidemiologist's toolbox.

## ACKNOWLEDGMENTS

We thank Geeta Swamy and 2 anonymous referees for comments that improved the manuscript.

## REFERENCES

*J Am Stat Assoc*. 1977;72:46–53.

*Classification and Regression Trees*. Boca Raton: Chapman & Hall/CRC; 1984.

*Technometrics*. 2000;42:80–86.

*J R Stat Soc Series B Stat Methodol*. 1996;58:267–288.

*Econometrica*. 1978;46:33–50.

*J R Stat Soc Series B Stat Methodol*. 2005;67:301–320.

*Stat Sin*. 2009;19:801–817.

*J Multivar Anal*. 2004;91:74–89.

_{1}-norm quantile regression.

*J Comput Graph Stat*. 2008;17:163–185.

*Bayesian Anal*. 2010;5:533–556.

*Workshop on Feature Selection for Data Mining*. Newport Beach, CA: SIAM; 2005:35–44.

*J Econ Perspect*. 2001;15:143–156.

*Preterm Birth: Causes, Consequences and Prevention*. Washington, DC: National Academy Press; 1997.

*Epidemiology*. 2000;11:427–433.

*Second International Symposium on Information Theory*. Budapest: Academiai Kiado; 1973:267–281.

*Ann Stat*. 1978;6:461–464.

*Ann Stat*. 2004;32:407–499.

*Epidemiol Rev*. 2009;31:67–83.

*Br J Ophthalmol*. 1998;82:9–13.

*Arch Pediatr Adolesc Med*. 1998;152:425–435.

*J Pediatr*. 1990;117:687–693.

*Epidemiology*. 2001;12:636–642.

*Pediatrics*. 1997;100:856–862.

*BMJ*. 1984;288:1191–1194.

*Psychosom Med*. 2007;69:566–570.

*J Womens Health*. 2007;16:535–542.

*Am J Epidemiol*. 2010;172:1070–1076.

*J Epidemiol Community Health*. In press.

*Corsini Encyclopedia of Psychology*. Wiley Online Library, 2010. Available at: http://onlinelibrary.wiley.com/doi/10.1002/9780470479216.corpsy0836/abstract.

*Early Hum Dev*. 1985;12:49–54.

*Environ Health Perspect*. 2007;115:472–482.

*Am J Obstet Gynecol*. 1997;177:156–161.

*Am J Prev Med*. 1999;16:208–215.

*Am J Obstet Gynecol*. 1999;181:1192–1196.

*Arch Intern Med*. 2002;162:2443–2449.

*Pediatrics*. 2009;124:e1050–e1063.

*JAMA*. 2004;291:2107–2113.

*Am J Epidemiol*. 2006;164:529–537.

*J Clin Epidemiol*. 2009;62:511–517.

*J Health Econ*. 2004;23:907–934.