Multiple regression techniques and the broader class of generalized linear models have a long history in statistics and are widely used in epidemiology. Recent years have seen important frequentist and Bayesian advances in hierarchical and multilevel generalized linear models (ie, generalized linear mixed models).^{1} These approaches provide a unified framework to tackle many problems that are not well addressed by standard statistical models, particularly issues involving various levels of clustering in one’s data. Their traditional use in social sciences is to address problems involving correlated outcomes, as arise in cluster sampling designs.^{2}

Of perhaps greater interest to epidemiologists is the use of hierarchical models to study highly correlated exposures, as investigated by MacLehose et al^{3} in this issue. Here, one can improve exposure effect estimates by modeling them in a higher-level model. MacLehose and colleagues assume that exposures with similar effect sizes arise from the same prior distribution. One can also more directly incorporate external information about the exposures into a higher-level model. For example, one could use information on the sources of air pollution components^{4} or their interactions with metabolizing enzymes^{5} as a way of dissecting which constituents are etiologically relevant or which sources are responsible. Other examples of external information include exposures’ chemical structure or composition,^{6,7} nutrient content,^{8} or genomic properties.^{9,10}

Another issue addressed by hierarchical models has arisen with the “-omics” revolution: the “large *p*, small *n*” problem, namely many more variables (*p*) than observations (*n*).^{11} Standard multiple regression techniques break down here because some combination of variables can always be found that will perfectly predict any other variable (ie, the design matrix is singular). Models are also becoming more complex,^{12} as investigators seek to understand biologic pathways leading from multiple exposures to disease through intermediates that are influenced by various genes.^{13} Hierarchical models can overcome these problems by modeling many fewer variables that retain characteristics of the original *p* exposures.^{14}

Moreover, techniques like hierarchical^{15} and Bayesian model averaging^{16} can be used to reflect the uncertainty in model form. This is especially important because a single model is seldom fitted in the course of data analysis; yet all too often, the estimates from a single “best” model are presented along with confidence limits that do not reflect the uncertainty about the form of the model. Hierarchical model averaging, in particular, allows inclusion of all candidate variables in the model and thus eliminates the need for variable selection and its attendant artifacts.^{7,15}

Observational studies are also subject to many different types of potential bias, which are often considered in a qualitative manner in the discussion sections of papers or treated in some form of sensitivity analysis. But as we will discuss below, Bayesian methods are available for formally modeling the different sources of bias.^{17–19}

#### HIERARCHICAL MODELING OF MULTIPLE EXPOSURES

MacLehose and colleagues^{3} describe 4 broad classes of higher-level (prior) distributions for incorporating similarities among multiple exposures in their analysis. Two of the prior distributions are fully parametric and 2 are semiparametric, and all can include a nonzero probability mass at the null hypothesis of no exposure effect.

Parametric models have the advantage that covariates are readily incorporated in the prior distribution. These covariates define “exchangeability classes,” a priori groupings of variables that are assumed to have equivalent effects on disease (ie, that derive from the same prior distribution). The specification of such groupings does not, however, require the investigator to specify exactly how such groups might differ or even to declare that they really do differ. The investigator merely specifies groups that potentially differ, and the data indicate whether the grouping has any predictive value.

Although the presentation by MacLehose et al of prior covariates as defining a set of distinct categories is simple, the general hierarchical modeling framework does not require discrete groupings, and indeed one can incorporate a whole vector of prior covariates on continuous or categorical scales. Thus, an investigator does not need to define explicit prior probabilities, but rather a set of prior covariates that might define distinguishable subgroups, and the general form of the prior probability distribution (along with its dependence on the covariates or “hyperparameters”). For this approach to provide notable statistical benefits, the number of effects under study (first-stage parameters) must be substantially larger than the number of prior (second-stage) parameters to be estimated.
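The two-stage structure described above can be sketched numerically. The following is a minimal illustration (all numbers and groupings hypothetical) of semi-Bayes estimation: first-stage effect estimates are shrunk toward fitted second-stage means defined by a single prior covariate, with the second-stage variance prespecified rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 12 exposure log-rate-ratios with equal first-stage
# sampling variance v, and one binary prior covariate z (membership in one
# of 2 a priori exposure classes). All numbers are illustrative.
n = 12
z = np.repeat([0, 1], 6)
Z = np.column_stack([np.ones(n), z])      # second-stage design matrix
true_beta = Z @ np.array([0.0, 0.5])      # hypothetical class means
v = 0.25                                  # first-stage sampling variance
beta_hat = true_beta + rng.normal(0.0, np.sqrt(v), n)

tau2 = 0.1                                # prespecified second-stage variance
w = np.full(n, 1.0 / (v + tau2))          # weights (equal here)

# Weighted least squares for the second-stage coefficients pi
W = np.diag(w)
pi_hat = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ beta_hat)

# Each first-stage estimate is pulled toward its fitted class mean;
# the shrinkage factor B grows as the first-stage variance grows
B = v / (v + tau2)
beta_shrunk = (1 - B) * beta_hat + B * (Z @ pi_hat)
```

Here the number of first-stage effects (12) comfortably exceeds the number of second-stage parameters (2), which is the condition noted above for the approach to yield a notable statistical benefit.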

MacLehose and colleagues do not focus on use of prior knowledge, although they mention using broad classes of pesticides as a grouping variable. Instead, they consider the case in which one suspects some meaningful similarities among the variables but has no idea what they are. They employ a semiparametric Dirichlet process prior, which allows each effect under study to come from a discrete but unknown distribution. Different effects may come from the same distribution, but the grouping of effects according to their shared distribution is also unknown. Ultimately, effects are grouped by similarity in estimated size rather than by external properties (such as chemical structure), and each effect estimate is shrunk toward effects of similar initial size. Such an analysis also allows other kinds of inferences, such as estimation of the posterior probability that a given effect is null, and of the posterior probability that 2 given variables belong to the same cluster.
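For readers unfamiliar with the Dirichlet process, the prior it implies on groupings can be sketched through its Chinese restaurant process representation. The following illustrative code (not the authors’ implementation) draws one random partition of n effects; the concentration parameter alpha controls how readily new clusters form.

```python
import numpy as np

def crp_partition(n, alpha, rng):
    """Draw one random partition of n items from the Chinese restaurant
    process, the partition distribution implied by a Dirichlet process
    with concentration parameter alpha."""
    labels = [0]          # the first item starts the first cluster
    counts = [1]          # current cluster sizes
    for i in range(1, n):
        # item i joins existing cluster k with probability counts[k]/(i + alpha)
        # or opens a new cluster with probability alpha/(i + alpha)
        probs = np.array(counts + [alpha]) / (i + alpha)
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels

rng = np.random.default_rng(0)
print(crp_partition(10, 1.0, rng))
```

In the full model, effects sharing a cluster label would then share a common effect size, which is what produces shrinkage toward effects of similar initial size rather than toward externally defined groups.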

The flexibility of semiparametric methods is seductive, but has the drawback of allowing (and often producing) possibilities that are simply not credible in most applications, such as multiple modes and groupings that contain physically dissimilar exposures. The semiparametric model thus represents the opposite extreme of rigid 2-parameter priors such as the normal distribution, which offer essentially no flexibility apart from location and scale. Such priors are generally chosen for convenience (ease of representation) or ignorance (we want a continuous distribution but have only a vague idea what shape to use). If an investigator is seriously worried about such assumptions, however, they can perform a sensitivity analysis. Alternatively, a more flexible parametric class of distributions can be used that can approximate reasonable possibilities such as skewness and heavy tails, but conforms to desired constraints such as smoothness, unimodality, and strict monotonicity. An example is the 4-parameter generalized conjugate (log-*F*) family of distributions.^{20–22}

To summarize our point: the Dirichlet process prior model in the paper by MacLehose et al does not use any external information to derive the grouping, in contrast to the use of prior covariates in a hierarchical model. Instead, exposures are clustered together based on their observed effect sizes, so the resulting clusters need not have anything else in common. This can be helpful when there are outliers that really should not be grouped with the bulk of the data. It can, however, also accentuate spurious clustering that has no biologic basis but merely reflects chance similarities. Fortunately, in their application to the pesticide data this does not seem to have happened, as the observations are generally shrunk toward a highly concentrated distribution quite near the null.

#### SHRINKAGE, VARIABLE SELECTION, OR MODEL AVERAGING?

The traditional approach to dealing with multiple covariates is some form of variable selection, such as stepwise regression. As is well known, such techniques can lead to biased point and interval estimates and generally do a poor job of choosing the right variables, particularly when the variables are highly correlated, although they may produce reasonable prediction models. Even for the latter purpose, however, stepwise regression generally yields less accurate prediction equations than putting all variables into the equation and using hierarchical modeling techniques to stabilize their coefficients.^{23–26}
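The contrast with variable selection can be illustrated with ridge regression, one of the simplest “keep all variables, stabilize their coefficients” techniques (equivalent to putting a normal prior centered at zero on every coefficient). The data-generating choices and penalty below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: 30 observations of 10 highly correlated exposures,
# each with a weak true effect (a setting where stepwise selection does
# a poor job of picking the "right" variables)
n, p = 30, 10
shared = rng.normal(size=(n, 1))
X = 0.9 * shared + 0.3 * rng.normal(size=(n, p))   # correlated columns
beta = np.full(p, 0.1)
y = X @ beta + rng.normal(size=n)

# All-variables OLS vs ridge; the penalty lam is illustrative, not tuned
lam = 5.0
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge trades a small increase in in-sample error for much more stable
# coefficients, which often (though not always) improves prediction on
# fresh data from the same process
X2 = 0.9 * rng.normal(size=(n, 1)) + 0.3 * rng.normal(size=(n, p))
y2 = X2 @ beta + rng.normal(size=n)
mse = lambda b: float(np.mean((y2 - X2 @ b) ** 2))
print(mse(b_ols), mse(b_ridge))
```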

One objection to the hierarchical approach is that it does not yield a parsimonious model, if only because all variables are included. This objection has led to the development of a variety of techniques that include some form of variable selection, often in a Bayesian framework.^{27,28} Combined with frequentist^{29} or Bayesian^{15,16,30} model averaging, one can then make statements about both the relative contributions of the different variables and the coefficients of the variables, allowing for uncertainty about which other variables to adjust for. As with hierarchical models, these averaged models generally yield better predictions than conventional variable-selection methods; however, they are admittedly more complex, as again all variables must be considered. Although these techniques can be applied without use of external data, they can easily incorporate prior information, either in models for the regression coefficients in each model, or in the model for the selection of variables, or in both. For example, Conti et al^{13} described a hierarchical framework for modeling complex metabolic pathways. They used a logistic model for the probability of a nonzero coefficient for each gene effect, environmental factor effect, and product among them (“interaction”) and a linear model for the magnitude of the coefficient given it was nonzero. This approach also provides a natural way of exploiting any hierarchical structure of the covariates themselves, as in the analysis of SNPs within linkage disequilibrium blocks within genes within pathways.^{31}

#### WHY HAVE HIERARCHICAL MODELS SEEN LIMITED USE IN EPIDEMIOLOGY?

Considering the potential applications and value of hierarchical models, it is surprising that they are not used more often. One explanation is their computational difficulty, but this argument is obsolete in light of the growing availability of software^{32,33} and approaches such as data augmentation priors^{21,34,35} for hierarchical modeling. With the latter, prior information is expressed as “prior data” and added to the observed data, allowing Bayesian analysis with conventional software.
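The prior-data idea can be sketched in miniature. In the illustration below (all numbers hypothetical), a normal prior on a log odds ratio is encoded as a pseudo 2 × 2 stratum whose cell counts reproduce the prior mean and variance, and the observed and prior strata are then combined by inverse-variance (Woolf) pooling as a simple stand-in for refitting a conventional regression with the prior records appended.

```python
import numpy as np

# Hypothetical observed 2x2 table: rows = exposed/unexposed,
# columns = cases/controls
obs = np.array([[20.0, 80.0],
                [10.0, 90.0]])

def log_or(t):
    """Log odds ratio of a 2x2 table."""
    return np.log(t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0]))

def var_log_or(t):
    """Woolf variance of the log odds ratio: sum of reciprocal cells."""
    return float((1.0 / t).sum())

# A normal prior on the log odds ratio with mean 0 and variance 0.5 is
# encoded as a pseudo-stratum of 4 equal cells a, since 4/a = 0.5 => a = 8
prior = np.full((2, 2), 8.0)

# Inverse-variance pooling across the 2 strata approximates the
# normal-normal posterior mean for the log odds ratio
w_obs, w_pri = 1 / var_log_or(obs), 1 / var_log_or(prior)
post = (w_obs * log_or(obs) + w_pri * log_or(prior)) / (w_obs + w_pri)
print(np.exp(post))   # posterior odds ratio, pulled toward the null
```

Exactly the same arithmetic falls out of running conventional stratified software on the augmented data, which is what makes the approach attractive to analysts without access to specialized Bayesian tools.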

This leads to another reason voiced for not using hierarchical models: they require thoughtful application and thus can be difficult to specify. A key to their proper use is understanding that problems are solved by the introduction of external information concerning properties of and relations among the variables. This information is most needed—and potentially most helpful and most hazardous—when the data under analysis have little or no information about these relations. The quality of the external information then becomes as central to the analysis as the quality of the data themselves.

There may also be some stigma attached to Bayesian methods, reflecting a misguided belief that inferences are entirely dependent on the user’s subjective choices of priors. This belief probably reflects a basic misunderstanding of how hierarchical priors are specified: as described earlier, these priors define only potentially distinguishable exchangeability classes, not the magnitude of differences among the classes. The False Positive Report Probability^{36} was presented as a means of combining an investigator’s prior belief about the probability that the null hypothesis is false with observed data to obtain a posterior probability by straightforward application of Bayes’ formula. Of course, an investigator who was uncomfortable with specifying a single number for that prior probability could always redo the calculation over a range of priors to obtain a range of posterior probabilities, but reducing the problem to a simple binary decision to accept or reject the null hypothesis is arguably unsatisfactory.^{37} More appropriate would be to consider the entire range of possible sizes for (say) a relative risk, give this range a prior distribution informed by external information, and see how the data modify that distribution.
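The Bayes-formula calculation behind the False Positive Report Probability is simple enough to show directly; the test level, power, and prior probabilities below are illustrative values only.

```python
def fprp(alpha, power, pi):
    """False Positive Report Probability: the probability that the null is
    true given a 'significant' result, where alpha is the test level,
    power is the power at the assumed alternative, and pi is the prior
    probability that the alternative is true."""
    return alpha * (1 - pi) / (alpha * (1 - pi) + power * pi)

# For a long-shot hypothesis (pi = 0.01), most significant findings are
# expected to be false positives
print(round(fprp(0.05, 0.80, 0.01), 3))   # 0.861

# Redoing the calculation over a range of priors, as suggested above,
# yields a range of posterior probabilities rather than a single number
print([round(fprp(0.05, 0.80, pi), 3) for pi in (0.001, 0.01, 0.1, 0.5)])
```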

#### SHOULDN’T WE LET DATA “SPEAK FOR THEMSELVES”?

These issues highlight the need for reasonable assumptions and high-quality external information in hierarchical modeling. The latter may sound daunting, but is in fact no different from the quality of external information demanded for more common tasks, such as proper selection of variables for confounder control (eg, do not control variables affected by exposure). More generally, the harsh reality is that data by themselves say nothing at all; every inference, even from nonparametric methods, depends crucially on sound judgment and external information.^{18} Models are always necessary to provide a context for interpreting the data. This is true even for randomized trials, whose results depend on assumptions about noncompliance and dropout, and on the assumption that there was no ad hoc data manipulation; these assumptions do not always hold (as the recent Vioxx controversy reminds us).^{38}

Every analysis that fails to explicitly model a potential source of bias assumes that the bias is absent. In most observational data analyses, few biases are modeled. It is thus no surprise that underestimation of biases is a recurring problem, as refutation of observational results by controlled trials reminds us. Some statisticians speak of analyses being “data-driven” as if that were a virtue, when in some contexts it is a liability, because it may be the data, and not the models or priors, that are wrong. The more data-driven a method, the more heavily it leans on the few assumptions that remain, which include notions that translate to “the data are valid.” By asserting the primacy of the data, we assert that the data reveal the structure of the effects we are after, rather than the structure of biases. Thus, the more data-driven a method, the more it becomes driven by uncontrolled methodologic defects or inescapable limitations of the study, such as bias and random error.

The challenge is striking a balance between validity (zero-bias) assumptions and the external contextual information that we inject as modeling assumptions or priors. Because the optimal balance is never known, we need to do sensitivity analyses in which the relative strengths of the 2 are varied. To the extent that more sophisticated methods like those of MacLehose et al allow us to explore further options, they are a valuable addition to our toolkit. The danger lies only in presuming inherent superiority because of adjectives like “nonparametric” and “semiparametric.” These descriptors are often taken as virtues, but should really raise a red flag, because they indicate that the data will be taken as unquestionably superior to any model along certain dimensions. It is not hard to see such methods break down in simple examples in which the data are neither pure nor ample. A nonparametric curve going wild at the ends is a nice way to show that, even with perfect random sampling, we always want to rely somewhat on prior information, such as monotonicity and smoothness.^{39}

At the other extreme, we have methods for bias modeling that do not take the data as completely trustworthy, and so are heavily model-driven. Because these methods admit uncertainties along dimensions that are not even identified by the data, they cannot be nonparametric; they rely entirely on investigator inputs about validity problems. These Bayesian approaches view the data as nothing but an observed margin of a much higher-dimensional distribution, and one that can be practically connected to the desired effects only by using parametric models and priors. These methods exist in the realm of nonidentified modeling, meaning they do not make assumptions strong enough to produce point estimates unless one introduces explicit priors (as opposed to the zero-bias priors implicit in conventional methods). This realm is arguably far more appropriate for observational study analysis than assuming biases away.^{7,40,41}

From the nonidentified modeling perspective, nonparametric and semiparametric methods are modeling only a marginal distribution that may have tenuous connection to our real targets. The latter targets may be buried deep in the full joint distribution of biases and effects. A fair comparison of nonparametric and semiparametric methods with parametric methods would evaluate them both, allowing for uncertainty due to uncontrolled data problems. When this is done, both approaches will be seen to understate the uncertainty warranted by our limited background understanding. The differences in mean squared error of nonparametric and semiparametric methods relative to parametric methods will no longer be so dramatic, and may easily favor the parametric approach, depending critically on the prior used to evaluate the nonidentified bias components.

This comparison also bears on the issue of the practicality of effort needed to set up and interpret each method.^{42} We cannot do everything when analyzing a study. Where should our limited analytic resources be focused if we want to most improve our overall performance? Many if not most epidemiologists think they can deal with biases informally and intuitively, saving them the effort of formal bias modeling. Meanwhile, massive computing power has enabled statisticians to implement dazzling and sophisticated techniques for dealing with random error. To the extent that biases are due to use of parametric models rather than fundamental study problems, these methods can help, but no amount of sophistication in this regard can compensate for study problems. The growing list of topics in which informal validity judgments have proven wrong (eg, beta-carotene and cancer prophylaxis, hormone replacement therapy and cardioprotection) indicates that we need to do better in evaluating the strength (or rather, weakness) of the epidemiologic evidence when health policy and medical practice are at stake. Reanalysis of the epidemiologic data from past controversies might shed more light on the merits of various approaches than would analytic or simulation studies. It would also help inform us about how much analytic sophistication—as opposed to more sophisticated data collection, as in “validation studies”—could have saved us from past mistakes.

#### CONCLUSION

In the end, we use informative priors, whether we admit it to ourselves and others or not. This is true even for so-called “objective” methods such as conventional frequentist methods and objective-Bayesian methods. How informative our conclusions appear will depend entirely on how informative our analysis assumptions are, even if we have billions of subjects. It is therefore of utmost importance that we recognize and admit those assumptions and make the best effort to found them on sound scientific information. Nor should we miss any opportunity to exploit what information may be available from other studies or allied fields, for this information will help us better discern signal from noise and bias in our data.

#### ABOUT THE AUTHOR

DUNCAN C. THOMAS is Verna Richter Professor in the Department of Preventive Medicine at the University of Southern California. JOHN S. WITTE is Professor of Epidemiology and Biostatistics and the Associate Director of the Institute for Human Genetics at UC San Francisco. SANDER GREENLAND is Professor of Epidemiology and Statistics at UC Los Angeles. Drs. Thomas, Witte, and Greenland have longstanding research interests in epidemiologic methods, including the development and application of hierarchical models. Over the years they have enjoyed numerous collaborations.