Environmental exposures, such as chemicals or dietary intake, often consist of an array of related individual factors that may be highly correlated with one another, raising concern about the impact of collinearity when attempting to identify individual effects. The literature provides several common methods for handling collinear variables, such as scaling, ridge regression, principal components analysis, and the “leave one variable out” approach.1 These methods can provide stable and efficient estimators, but they assume that the relationship between the exposure of interest and the outcome is correctly specified with respect to appropriate adjustment for confounders and adequate representation of the relationship’s functional form. Under highly correlated data, correct specification of the model may be at odds with achieving stable and efficient estimators, highlighting the dilemma researchers face when answering questions about specific exposure–outcome relationships.
In epidemiologic research focused on causal questions, estimation tends to be restricted to consistent estimators, perhaps ignoring estimation procedures that yield inconsistent estimators with far superior small-sample performance in terms of mean squared error and confidence interval coverage. Thus, any bias is thought to arise from model misspecification due to improper omission or inclusion of covariates, or misspecification of the functional form. Moreover, the correct specification of the model is often based on the hypothesized causal framework, with directed acyclic graphs (DAGs) recommended as a tool for documenting this framework.2 These two concerns, reducing bias and increasing precision, can lead to different modeling strategies for handling collinear variables, depending on whether the motivation is primarily predictive or causal.
A search of four leading topic-specific environmental and nutritional journals revealed that, of all articles published from 2010 to 2014, 74 mentioned “collinearity” or “highly correlated” in the manuscript text. Of those, 24 (32%) attempted to mitigate collinearity by leaving one or more variables out, 31 (42%) created a composite variable from the collinear predictors, and 19 (26%) ignored the impact of collinearity, primarily because the correlation was not very high, rather than as a deliberate choice to correctly specify the model based on the underlying causal structure. This review of the literature provides evidence that collinearity concerns are common in certain fields of epidemiology and that there is considerable heterogeneity in how collinearity is handled.
Herein, we evaluate the effects of highly correlated data in epidemiologic research when the focus is on a correctly specified model, while assuming statistical consistency. Specifically, we use DAGs to illustrate three fundamental structural scenarios in which high correlation between a covariate and the exposure can arise, and we evaluate the consequences of increasing levels of correlation between the two. We derive closed-form solutions for the bias and variance of the total effect using linear regression, and we use simulations to empirically investigate bias and variance in nonlinear models. Finally, we discuss a real-data example of highly correlated variables in the context of reproductive epidemiology.
CAUSAL SCENARIOS WHERE COLLINEARITY CAN ARISE
Two variables are defined as collinear if one can be expressed as an exact or near-exact linear combination of the other. For example, X1 and X2 are collinear if X1 = a + λX2, where a and λ are constants. In general, collinearity between variables is described by the magnitude of their correlation.
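For intuition, a minimal numerical sketch of this definition (Python/NumPy; the values of a, λ, and the small noise term are illustrative only, not drawn from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

x2 = rng.normal(size=n)
# X1 is an (almost) exact linear combination of X2: X1 = a + lambda*X2 + tiny noise
a, lam = 1.0, 0.5
x1 = a + lam * x2 + rng.normal(scale=0.01, size=n)

print(np.corrcoef(x1, x2)[0, 1])  # correlation very close to 1, i.e., near-perfect collinearity
```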
We illustrate three fundamental causal scenarios in which high correlation between a given variable and the exposure can arise: (1) intermediates, (2) confounders, and (3) colliders (Figure 1). For each scenario, we modeled the effect of an exposure (E) on the outcome (D), first not adjusting and then adjusting for the third variable, identified in Figure 1 by an outlined box and denoted as V in the equations below. For all scenarios, the total effect of E on D was of primary interest. Equations (1) and (2) represent these unadjusted and adjusted models, respectively.
FIGURE 1: Expected value, bias, and variance of parameter estimates from linear regression models, with β1 and β2 representing unadjusted and adjusted effect estimates of exposure on outcome, respectively.
D = β0 + β1E + ε1 (1)
D = β0 + β2E + β3V + ε2 (2)
The coefficient β1 represents the effect of the exposure (E) on the outcome (D). The interpretation of β2 and β3 depends on the causal scenario, which will be discussed below. All regression coefficients correspond to centered and standardized variables.
Intermediate Variable
In the first causal scenario (structure 1), we estimate the total effect of E on D with increasing correlation between E and I. The magnitude of the effect of E on I is represented by ρ, I on D by a, and E on D by b (Figure 1). In this case, the unadjusted model (Equation 1) is correctly specified, according to DAG theory, and should yield an unbiased estimate of the total effect of E on D. The adjusted model (Equation 2) is misspecified and therefore may lead to overadjustment bias.3 The magnitude of overadjustment bias and the effects on precision under increasing correlation between I and E will be the foci of the analysis for this causal scenario.
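A minimal simulation sketch of structure 1 (Python/NumPy; the parameter values and variable names are illustrative and not those of the eAppendix 2 code): adjusting for the intermediate I removes the indirect pathway and biases the total-effect estimate by about −aρ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho, a, b = 100_000, 0.9, 0.5, 0.2          # illustrative values

E = rng.normal(size=n)
I = rho * E + np.sqrt(1 - rho**2) * rng.normal(size=n)   # Corr(E, I) = rho
D = b * E + a * I + rng.normal(size=n)                   # total effect of E on D = b + a*rho

X1 = np.column_stack([np.ones(n), E])       # Equation 1: unadjusted (correctly specified here)
X2 = np.column_stack([np.ones(n), E, I])    # Equation 2: adjusted (misspecified here)
b1 = np.linalg.lstsq(X1, D, rcond=None)[0][1]
b2 = np.linalg.lstsq(X2, D, rcond=None)[0][1]

print("true total effect:", b + a * rho)    # 0.65
print("unadjusted beta1: ", round(b1, 3))   # ~0.65, unbiased
print("adjusted beta2:   ", round(b2, 3))   # ~b = 0.2, biased by about -a*rho
```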
Confounder Variable
Structure 2 evaluates the impact of increasing correlation between a confounder C and the exposure on the E and D relationship. The magnitude of the effect of C on E is represented by ρ, E on D by a, and C on D by b (Figure 1). The model adjusting for C (Equation 2) would be the correctly specified model for the total effect of E on D; therefore, the misspecified unadjusted model (Equation 1) would be potentially biased. The bias and the effects on precision under increasing correlation between E and C will be the foci of the analysis for this causal scenario.
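An analogous sketch for structure 2 (again with illustrative values): here the adjusted model is the correct one, and omitting the highly correlated confounder C biases the estimate by roughly bρ.

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho, a, b = 100_000, 0.9, 0.5, 0.2

C = rng.normal(size=n)
E = rho * C + np.sqrt(1 - rho**2) * rng.normal(size=n)   # C causes E; Corr(E, C) = rho
D = a * E + b * C + rng.normal(size=n)                   # total effect of E on D = a

X1 = np.column_stack([np.ones(n), E])       # unadjusted (misspecified: confounded)
X2 = np.column_stack([np.ones(n), E, C])    # adjusted (correctly specified)
b1 = np.linalg.lstsq(X1, D, rcond=None)[0][1]
b2 = np.linalg.lstsq(X2, D, rcond=None)[0][1]

print("unadjusted beta1:", round(b1, 3))    # ~a + b*rho = 0.68, biased by b*rho
print("adjusted beta2:  ", round(b2, 3))    # ~a = 0.5, unbiased despite Corr(E, C) = 0.9
```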
Collider Variable
Structure 3 depicts the effect of adjusting for an intermediate collider variable that is correlated with the exposure.3–6 An unmeasured variable, U, is a common cause of both D and I (which is a descendant of E), with a representing the direct effect of I on D, b the direct effect of U on D, γ the effect of U on I, and ρ the effect of E on I (Figure 1). Adjusting for I creates a spurious association between E and U, potentially biasing the total effect of E on D. The correctly specified model for the total effect of E on D would thus be the unadjusted model (Equation 1). Note that this scenario is an extension of structure 1, where I is an intermediate variable, except that in structure 3, I becomes a collider because of U, an unobserved confounder between I and D. The direct effect of E on D is assumed to be null in this scenario. The bias and the effects on precision under increasing correlation between E and I will be the foci of the analysis for this causal scenario.
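A corresponding sketch for structure 3 (illustrative values, with gamma denoting the U–I effect): adjusting for the collider I opens the E–U path and biases the total-effect estimate, while the unadjusted model stays unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho, gamma, a, b = 100_000, 0.7, 0.5, 0.5, 0.2

E = rng.normal(size=n)
U = rng.normal(size=n)                                    # unmeasured confounder of I and D
I = rho * E + gamma * U + np.sqrt(1 - rho**2 - gamma**2) * rng.normal(size=n)
D = a * I + b * U + rng.normal(size=n)                    # no direct E -> D effect; total effect = a*rho

X1 = np.column_stack([np.ones(n), E])       # unadjusted (correctly specified)
X2 = np.column_stack([np.ones(n), E, I])    # adjusted (misspecified: conditions on the collider)
b1 = np.linalg.lstsq(X1, D, rcond=None)[0][1]
b2 = np.linalg.lstsq(X2, D, rcond=None)[0][1]

print("true total effect:", a * rho)        # 0.35
print("unadjusted beta1: ", round(b1, 3))   # ~0.35
print("adjusted beta2:   ", round(b2, 3))   # roughly -b*gamma*rho/(1 - rho^2), far from 0.35
```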
EFFECTS OF COLLINEARITY ON BIAS AND PRECISION
Given these three causal scenarios, we investigated the effect of increasing correlation (ρ) between E and V (confounding or intermediate variable) on estimation of the total effect of E on D using both correctly specified and misspecified models. We examined these scenarios with a continuous outcome using linear regression and with a binary outcome using logistic regression.
Continuous Outcome
We derived the closed-form solutions for the bias, variance, and covariance for each causal scenario under the normality assumption (for analytical details see eAppendix 1; https://links.lww.com/EDE/B107). The closed forms were confirmed empirically via simulation across a range of possible correlations between E and V (ρ = −0.99 to 0.99), with the number of observations held fixed and a = 0.5 and b = 0.2 (simulation code provided in eAppendix 2; https://links.lww.com/EDE/B107). Relative bias was calculated for β1 and β2 in relation to the parameter from the correctly specified model. For example, in structure 1 the relative bias for β2 was calculated as (β2 − β1)/β1 because the correctly specified model is the unadjusted model (Equation 1); in structure 2, the relative bias for β1 was calculated as (β1 − β2)/β2 because the correctly specified model is the adjusted model (Equation 2). Root mean squared error (RMSE) was calculated separately for β1 and β2.
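As one way to operationalize these definitions, a short sketch (illustrative, not the eAppendix 2 code) of computing relative bias and RMSE from repeated simulation draws, given the parameter from the correctly specified model:

```python
import numpy as np

def relative_bias(estimates, truth):
    """Relative bias of a set of simulated estimates with respect to the
    parameter from the correctly specified model."""
    return (np.mean(estimates) - truth) / truth

def rmse(estimates, truth):
    """Root mean squared error: combines squared bias and variance."""
    return np.sqrt(np.mean((np.asarray(estimates) - truth) ** 2))

# e.g., for structure 1 the reference value for both beta1 and beta2 is b + a*rho:
# rel_bias_b2 = relative_bias(beta2_draws, b + a * rho)
# rmse_b2     = rmse(beta2_draws, b + a * rho)
```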
Closed-form solutions show that bias due to model misspecification in each causal scenario is a function of the degree of correlation present (Figure 1, see “Bias for Total Effect of E on D”). This bias, as a function of increasing ρ, is shown graphically in Figure 2. Specifically, for the intermediate variable scenario (structure 1), β1 is equivalent to b + aρ, the total effect (indirect plus direct) of E on D; therefore, β2 will be biased by an amount equivalent to −aρ. Similarly, for the confounding scenario (structure 2), the unadjusted model is biased by bρ, and for the collider scenario (structure 3) the adjusted model is biased by −aρ − bγρ/(1 − ρ²). In these scenarios, increasing correlation exacerbates any model misspecification bias. This is illustrated by the relative bias plots shown in Figure 2 for structures 2 and 3, with greater bias for the misspecified model as ρ approaches 1 and −1. For structure 1, the relative bias plots show an asymptote because ρ appears in the denominator, although absolute bias does increase with collinearity (as in structures 2 and 3). In contrast, collinearity does not induce any measurable bias when the model is correctly specified and |ρ| < 0.99. This is especially important for the confounding variable scenario, where correctly specifying the model may require adjusting for a nearly collinear variable. However, under the unlikely scenario of extreme collinearity (|ρ| > 0.999), the numerical matrix inversion required for estimation fails, resulting in unstable and biased parameter estimates (not shown in Figure 2).
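For example, the structure-1 result follows from the covariance algebra of standardized variables (a sketch under the assumed linear data-generating model D = bE + aI + ε with I = ρE + ε_I, Var(E) = Var(I) = 1, and Cov(E, I) = ρ):

```latex
\begin{align*}
E\!\left[\hat{\beta}_1\right]
  &= \frac{\operatorname{Cov}(E, D)}{\operatorname{Var}(E)}
   = \operatorname{Cov}(E,\, bE + aI + \varepsilon)
   = b + a\rho && \text{(total effect)}\\[4pt]
E\!\left[\hat{\beta}_2\right]
  &= \frac{\operatorname{Cov}(E, D) - \rho\,\operatorname{Cov}(I, D)}{1 - \rho^2}
   = \frac{(b + a\rho) - \rho\,(a + b\rho)}{1 - \rho^2}
   = b && \text{(direct effect only)}\\[4pt]
\operatorname{bias}\!\left(\hat{\beta}_2\right)
  &= b - (b + a\rho) = -a\rho
\end{align*}
```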
FIGURE 2: Relative bias and root mean squared error (RMSE) for parameter estimates β1 (unadjusted model) and β2 (adjusted model) with increasing correlation ρ between the given variable and exposure, for continuous outcomes.
Root mean squared error for each causal scenario also depends on both the degree of correlation present and whether the model is correctly specified. In the intermediate and collider structures (structures 1 and 3, respectively), where β1 is the correctly specified parameter, we observe that efficiency increases slightly (as noted by the decrease in RMSE in Figure 2) as the magnitude of correlation increases. This trend is due to the improved predictive capacity of the exposure on the outcome, which reduces the model’s random error. In other words, the variation in D attributable to I is now better captured by the relationship between E and D, owing to the increasing similarity between E and I. When the model is misspecified by adjusting for I, precision drastically decreases as the magnitude of correlation increases, resulting in highly imprecise estimates.
In the confounding variable scenario (structure 2), the unadjusted model, which is misspecified, also gains precision as the correlation increases, similar to the unadjusted models for structures 1 and 3. However, Figure 2 shows that the corresponding increase in bias dominates the RMSE and can result in invalid inference by yielding a highly precise but inaccurate estimate of the regression parameter of interest.
Thus, while the correctly specified model loses precision as correlation increases, especially at high levels of correlation, its ability to maintain an unbiased estimate justifies preferring this accurate, although imprecise, model over the misspecified version. In addition, only for extreme values of ρ close to 1 or −1 is β2 nonidentifiable due to numerical instability; otherwise, the correctly specified model produces the most desirable (i.e., unbiased) estimates of the total effect of the exposure on the outcome.
Binary Outcome
For binary outcomes, Equations 1 and 2 were modified to represent logistic regressions:
logit(π) = β0 + β1E (3)
logit(π) = β0 + β2E + β3V (4)
where π is the conditional probability that D = 1 given the specified set of predictors. Equations 3 and 4 represent the unadjusted and adjusted models, respectively, for structures 1–3. The coefficient β1 represents the marginal log odds ratio for the effect of the exposure on the outcome, and the interpretation of β2 and β3 depends on the causal scenario.
The bias and precision under logistic regression models were estimated using simulated data with the same node-to-node relations as in the linear regression models, with the exception that D was a binary variable with a marginal outcome probability of p (see eAppendix 1; https://links.lww.com/EDE/B107).
Coefficients in Equations 3 and 4 were estimated from the average values from 10,000 simulations of 10,000 observations for each causal scenario across a range of possible correlations between E and V (ρ = −0.99 to 0.99) with a = 0.5, b = (0.5, 1.0) and the prevalence of the outcome set to 3%. Relative bias and root mean squared error were calculated in the same manner as for the continuous outcome models.
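A condensed sketch of one such replicate for the confounder structure (Python with statsmodels, assumed available; the intercept is chosen only roughly to give a rare outcome, and the values are illustrative rather than those of the eAppendix code):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, rho, a, b = 10_000, 0.8, 0.5, 0.5

# Structure 2 with a binary outcome: C confounds the E-D relationship
C = rng.normal(size=n)
E = rho * C + np.sqrt(1 - rho**2) * rng.normal(size=n)
logit_p = -4.0 + a * E + b * C            # intercept around -4 keeps the outcome rare (a few percent)
D = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

fit1 = sm.Logit(D, sm.add_constant(E)).fit(disp=0)                        # Equation 3: unadjusted
fit2 = sm.Logit(D, sm.add_constant(np.column_stack([E, C]))).fit(disp=0)  # Equation 4: adjusted

print("outcome prevalence:", D.mean())
print("unadjusted log odds for E:", round(fit1.params[1], 3))  # biased upward by the confounding
print("adjusted log odds for E:  ", round(fit2.params[1], 3))  # near the conditional effect a = 0.5
# Averaging such estimates over many replicates gives curves like those summarized in Figure 3.
```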
The simulated expected values of β1 and β2 for structures 2 and 3 (Figure 3) were remarkably close to the corresponding closed-form solutions from linear regression (shown graphically in Figure 2). In addition, bias of the parameter estimate from the misspecified model was greater when the C–D (structure 2) or U–D (structure 3) relationship was stronger. For structure 1, unlike linear regression, the bias in β2 increased slightly as ρ approached 1 (with little deviation due to changes in the strength of the E–D (b) relationship), and increased more markedly as ρ approached −1, with greater bias for the weaker E–D relationship. For all causal scenarios, the correctly specified model remained unbiased even under high correlation between E and V.
FIGURE 3: Relative bias and root mean squared error (RMSE) for parameter estimates β1 (unadjusted model) and β2 (adjusted model) with increasing correlation ρ between the given variable and exposure, for binary outcomes.
Although not readily apparent from the RMSE panels in Figure 3, the variance estimates for the binary outcome models do not exhibit trends identical to those found in the linear regression simulations. For structures 1–3, in most cases the variances of both β1 and β2 increased as ρ increased, with the variance of β1 increasing to a lesser extent than that of β2 (Figure 3). Indeed, once |ρ| exceeded 0.6, the variance of β2 increased markedly, more than doubling at |ρ| = 0.9 compared with |ρ| = 0.1.
In sensitivity analyses, we increased the strength of the association between E and D (setting a equal to 1) and observed changes in the magnitude of relative bias and RMSE consistent with the linear closed forms (eFigure 1; https://links.lww.com/EDE/B107), most notably a sizable change in relative bias and RMSE for the misspecified model in structure 1. We additionally expanded the confounding scenario to evaluate the impact of two independent confounders (eFigure 2; https://links.lww.com/EDE/B107), simulated under a framework similar to structure 2 in which C1 and C2 are causes of both D and E, with b1 and b2 representing their respective strengths of association with D and ρ1 and ρ2 their strengths of association with E. The expected unadjusted total effect of E on D was therefore a + b1ρ1 + b2ρ2, which implies that the bias equals the sum of the associations of C1 and C2 with D, weighted by ρ1 and ρ2, respectively. In this confounding scenario, increasing correlation exacerbates the bias due to misspecification, similar to the results for a single confounder. We also observed that the corresponding increases in bias tend to dominate the RMSE, thus potentially resulting in invalid inference, as was found in the single-confounder scenario (eFigure 3; https://links.lww.com/EDE/B107). One also needs to recognize that nonlinear models, such as logistic regression, are susceptible to noncollapsibility or omitted-variable bias, especially when the outcome is common.7–9 This is particularly important for estimation of the conditional effect when the leave-one-variable-out approach is employed, creating bias due both to lack of collapsibility and to residual confounding.7–9 Therefore, misspecification of the model adds another level of complexity to the interpretation of E–D relationships.
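A quick numerical check of the two-confounder expression (illustrative values; C1 and C2 independent, as in the expanded scenario):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
a, b1, b2, rho1, rho2 = 0.5, 0.3, 0.2, 0.6, 0.4

C1, C2 = rng.normal(size=n), rng.normal(size=n)          # independent confounders
E = rho1 * C1 + rho2 * C2 + np.sqrt(1 - rho1**2 - rho2**2) * rng.normal(size=n)
D = a * E + b1 * C1 + b2 * C2 + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), E])                    # unadjusted model
beta1 = np.linalg.lstsq(X1, D, rcond=None)[0][1]

print("expected unadjusted effect:", a + b1 * rho1 + b2 * rho2)  # 0.76
print("simulated unadjusted effect:", round(beta1, 3))           # ~0.76, i.e., bias b1*rho1 + b2*rho2
```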
REAL DATA EXAMPLE: LEPTIN AND ESTROGEN
Often in practice, investigators are faced with decisions of how to handle highly correlated variables. Studies examining the effects of leptin on estrogen levels provide a natural setting for a discussion of collinearity because of high correlation between leptin and measures of body fat. In fact, a correlation of 0.76 was observed in the BioCycle Study.10,11 Under such strong correlation, a common analytical approach might be to leave body fat or leptin out, or to include a less correlated proxy of adiposity or composite variables,12–15 to avoid potential consequences of collinearity.
For our study, we were interested in the effect of leptin on estrogen concentrations on the day of ovulation. The University at Buffalo Health Sciences Institutional Review Board (IRB) approved the study, and served as the IRB designated by the National Institutes of Health under a reliance agreement. All participants provided written informed consent. Based on a priori knowledge of the associations between leptin, estrogen, and body fat, we hypothesized that body fat was a confounder (structure 2) of the leptin–estrogen relationship because body fat is a source of both leptin and estrogen production (i.e., common cause).16 Using equations in Figure 1, we can estimate the expected absolute bias due to model misspecification if we did not adjust for body fat in our analyses (eTable 1; https://links.lww.com/EDE/B107). We estimated the absolute bias to be −0.102, algebraically equivalent to
β1 − β2, representing a relative bias of −40%.
We further observed that the biased β1 was more precisely estimated than the unbiased β2
(eTable 1; https://links.lww.com/EDE/B107). Although adjustment for body fat increased the variance estimate due to correlation between the predictors, failure to account for the confounding effect of body fat on this relationship resulted in a biased estimate of association. This example highlights the need to clearly specify the causal structure to determine the appropriate modeling strategy and not solely rely on correlation between covariates.
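For completeness, a sketch of the bias arithmetic (the coefficient values below are hypothetical placeholders, not the BioCycle estimates, which appear in eTable 1):

```python
# Bias of the unadjusted leptin coefficient relative to the body-fat-adjusted one.
beta_unadjusted = 0.15   # hypothetical placeholder for the crude leptin-estrogen coefficient
beta_adjusted = 0.25     # hypothetical placeholder for the body-fat-adjusted coefficient

absolute_bias = beta_unadjusted - beta_adjusted   # the article reports -0.102
relative_bias = absolute_bias / beta_adjusted     # the article reports about -40%
print(f"absolute bias = {absolute_bias:.3f}, relative bias = {relative_bias:.0%}")
```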
DISCUSSION
Our findings highlight the importance of considering the causal framework under study when specifying regression models in the presence of highly correlated data. Here, we evaluated the effects of collinearity in epidemiologic research under three fundamental causal scenarios and found that, when models were correctly specified based on the underlying causal structure, parameter estimates of the effect of interest remained unbiased. This was the case even when there was almost perfect collinearity of the exposure with another variable under intermediate, confounder, or collider scenarios. Incorrectly specifying the models, such as adjusting for an intermediate or collider variable, resulted in biased effect estimates, with the bias being magnified as correlation increased.
There is often concern that including two highly correlated variables in a model will result in numerical instability. Indeed, as discussed in Kleinbaum et al.,1 the potential danger of collinearity stems from the potential singularity of the Hessian matrix, which is obtained by taking the second derivative of the log-likelihood. When the correlation between predictors is perfect (= 1), inverting the Hessian to obtain standard errors is equivalent to dividing by 0, which produces infinite or undefined standard errors. This issue is similar to the problem of complete separation in logistic regression, which violates the positivity assumption. This real concern about numerical instability has been somewhat alleviated by advances in computing, as the ability of statistical software to accurately invert nearly singular matrices has considerably improved. In addition, it is highly unlikely in practice that distinct predictors will be perfectly or nearly perfectly correlated. Consistent with Kleinbaum et al.,1 we observed that numerical instability occurs when correlation exceeds 0.99, equivalent to an R² value greater than 0.95 or a variance inflation factor >10. Such correlations can arise naturally when different powers of the same variable (such as age and age²) or multiple interaction terms are included in the model. In epidemiologic studies, however, this degree of correlation is unlikely to occur between distinct variables, even when variables are known to be highly associated, such as spousal characteristics like age, body mass index, or education.17,18 If such a degree of correlation does occur, then the two variables are essentially measuring the same construct in two different ways. Thus, misspecifying the causal structure of the model to allay collinearity concerns is much more likely to create bias than to be of practical use in avoiding numerical instability.
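To make the instability point concrete, a small illustrative sketch (not part of the original analyses): the variance inflation factor and the condition number of X'X grow without bound as the correlation approaches 1, but remain modest at correlations typical of distinct epidemiologic variables (e.g., the 0.76 observed between leptin and body fat).

```python
import numpy as np

def vif_and_condition(r, n=100_000, seed=0):
    """Variance inflation factor 1/(1 - R^2) and condition number of X'X
    for two standardized predictors with correlation r."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)
    X = np.column_stack([x1, x2])
    r2 = np.corrcoef(x1, x2)[0, 1] ** 2
    return 1 / (1 - r2), np.linalg.cond(X.T @ X)

for r in (0.76, 0.95, 0.99, 0.9999):
    vif, cond = vif_and_condition(r)
    print(f"r = {r}: VIF ~ {vif:.1f}, condition number of X'X ~ {cond:.0f}")
```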
When the goal is to choose covariates based on their predictive value, concerns about collinearity stem from the desire to preserve the precision of parameter estimates, whose variances can be inflated when highly correlated covariates are included in the model. In a causal framework, on the other hand, interest is often centered on a specific effect from a correctly specified causal model, even in the presence of highly correlated data and despite a potential loss of statistical efficiency.2 Under this viewpoint, bias due to misspecification of the causal associations is a more devastating result than reduced precision of correctly estimated parameters. Furthermore, a precisely but incorrectly estimated parameter may lead to invalid inferences. When correct specification of the model still leads to problems for conventional regression methods, one can apply modern techniques such as Bayesian hierarchical models with shrinkage estimators.19 In sum, relying on conventional variable-selection algorithms and standard analyses can ignore important confounders and even produce extremely imprecise confidence intervals.20,21
Herein, we have analytically derived the bias induced in the total effect by overadjusting for an intermediate or collider variable. To our knowledge, this is the first such derivation under the causal structures of intermediate and collider variables. In addition, we used simulation studies to empirically assess the potential magnitude of the bias and loss of precision under various levels of collinearity with binary outcome data. While we have demonstrated that high correlation between variables can inflate standard errors, the causal relationships among the variables play a more important role in delivering valid inference and should be the primary consideration when choosing which covariates to include in the regression formulation. As previously mentioned, it is also important to recognize that nonlinear regression models, such as logistic regression or Cox proportional hazards models, are susceptible to noncollapsibility or omitted-variable bias when not all predictors are included in the model and when the outcome is common.7–9 In the absence of confounding, adjustment for a strong predictor of the outcome may provide different results compared with crude analyses and thus confuse interpretation, as the conditional and marginal causal effects may differ in nonlinear models due to noncollapsibility.3 Robinson and Jewell,9 among others, discussed this topic in detail.7,8 They demonstrated that the standard error for the crude association (marginal effect) is smaller than or equal to that for the adjusted (covariate-conditional) effect for a strong predictor of the outcome, but that the adjusted (conditional) point estimate will be larger than or equal to the crude (marginal) estimate. Therefore, model misspecification adds another level of complexity to the interpretation of exposure–outcome relationships in these cases and further highlights the importance of correct model specification. To illustrate these findings and promote the application of these methods, we conducted a step-by-step demonstration of the association between leptin and estrogen levels in the BioCycle Study, adjusting for body fat, which is highly correlated with leptin.
While previous research has explored the potential for bias when inappropriately adjusting for a collider or mediator,2,6,21,22 our study generalizes this idea to continuous outcomes and explicitly quantifies the change in precision due to collinearity in these circumstances. As with any etiologic study, however, the validity of the inferences is a reflection of the accuracy of the assumptions; a faulty DAG could produce flawed inference, regardless of the impact of collinearity. Because only a limited number of scenarios were considered here, alternative variable associations are likely to produce varied results, although the main conclusions are expected to remain the same. While we focused on total effects, the same principles apply to the estimation of direct or indirect effects. It is important to note, however, that attempts to decompose the direct and indirect effects of an exposure by controlling for an intermediate are additionally subject to unmeasured confounding of the intermediate–outcome relationship, which could induce collider bias (structure 3). In conclusion, highly correlated or collinear data are present under a variety of causal structures. Our article highlights the danger of adjusting, or not adjusting, for variables without considering their causal role; selecting models based solely on how precisely parameters can be estimated may lead to biased effect estimates for the original question of interest.
REFERENCES
1. Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied Regression Analysis and Other Multivariable Methods. 3rd ed. Pacific Grove, CA: Duxbury Press; 1998:237–248.
2. Hernán MA, Hernández-Díaz S, Werler MM, Mitchell AA. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. Am J Epidemiol. 2002;155:176–184.
3. Schisterman EF, Cole SR, Platt RW. Overadjustment bias and unnecessary adjustment in epidemiologic studies. Epidemiology. 2009;20:488–495.
4. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15:615–625.
5. Whitcomb BW, Schisterman EF, Perkins NJ, Platt RW. Quantification of collider-stratification bias and the birthweight paradox. Paediatr Perinat Epidemiol. 2009;23:394–402.
6. Cole SR, Platt RW, Schisterman EF, et al. Illustrating bias due to conditioning on a collider. Int J Epidemiol. 2010;39:417–420.
7. Pang M, Kaufman JS, Platt RW. Studying noncollapsibility of the odds ratio with marginal structural and logistic regression models. Stat Methods Med Res. 2013;0:1–13.
8. Greenland S. Absence of confounding does not correspond to collapsibility of the rate ratio or rate difference. Epidemiology. 1996;7:498–501.
9. Robinson LD, Jewell NP. Some surprising results about covariate adjustment in logistic regression models. Int Stat Rev. 1991;58:227–240.
10. Ahrens K, Mumford SL, Schliep KC, et al. Serum leptin levels and reproductive function during the menstrual cycle. Am J Obstet Gynecol. 2014;210:248.e1–248.e9.
11. Wactawski-Wende J, Schisterman EF, Hovey KM, et al.; BioCycle Study Group. BioCycle study: design of the longitudinal study of the oxidative stress and hormone variation during the menstrual cycle. Paediatr Perinat Epidemiol. 2009;23:171–184.
12. Ajala OM, Ogunro PS, Elusanmi GF, Ogunyemi OE, Bolarinde AA. Changes in serum leptin during phases of menstrual cycle of fertile women: relationship to age groups and fertility. Int J Endocrinol Metab. 2013;11:27–33.
13. Asimakopoulos B, Milousis A, Gioka T, et al. Serum pattern of circulating adipokines throughout the physiological menstrual cycle. Endocr J. 2009;56:425–433.
14. Cella F, Giordano G, Cordera R. Serum leptin concentrations during the menstrual cycle in normal-weight women: effects of an oral triphasic estrogen-progestin medication. Eur J Endocrinol. 2000;142:174–178.
15. Hardie L, Trayhurn P, Abramovich D, Fowler P. Circulating leptin in women: a longitudinal study in the menstrual cycle and during pregnancy. Clin Endocrinol (Oxf). 1997;47:101–106.
16. Ahima RS, Flier JS. Adipose tissue as an endocrine organ. Trends Endocrinol Metab. 2000;11:327–332.
17. Price RA, Vandenberg SG. Spouse similarity in American and Swedish couples. Behav Genet. 1980;10:59–71.
18. Knuiman MW, Divitini ML, Bartholomew HC, Welborn TA. Spouse correlations in cardiovascular risk factors and the effect of marriage duration. Am J Epidemiol. 1996;143:48–53.
19. Greenland S. Invited commentary: variable selection versus shrinkage in the control of multiple confounders. Am J Epidemiol. 2008;167:523–9; discussion 530.
20. MacLehose RF, Dunson DB, Herring AH, Hoppin JA. Bayesian methods for highly correlated exposure data. Epidemiology. 2007;18:199–207.
21. Howards PP, Schisterman EF, Heagerty PJ. Potential confounding by exposure history and prior outcomes: an example from perinatal epidemiology. Epidemiology. 2007;18:544–551.
22. Greenland S. Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology. 2003;14:300–306.