It has long been known that conditioning on the determinants of exposure can reduce confounding bias when estimating causal effects.^{1} This insight led to the development of propensity score methods^{2} and inverse-probability-of-treatment weighting.^{3} The latter technique is used for the estimation of marginal structural models, where the word “marginal” in this name refers to the potential outcomes *Y* | SET(*X* = *x*) for exposure *X*, since this model estimates the marginal distribution of these (counterfactual) quantities. Inverse-probability-of-treatment weighting has also been shown to be equivalent to standardization for point-time treatment models.^{4,5}

Confounding has traditionally been combated with a variety of techniques in epidemiologic research, and some important distinctions in the interpretations generated by these techniques are often overlooked in practice. In this brief comment, I note that effect measures adjusted simply via inverse-probability-of-treatment weights have a marginal interpretation with respect to the covariates, and in this sense the word “marginal” has nothing to do with the first word in “marginal structural models.” Moreover, stabilization of the weights in a marginal structural model can change this interpretation from marginal to conditional, a potentially important consequence that appears not yet to have been widely discussed in published work. This can have implications for the interpretation of effect estimates and for the comparison of adjusted effect measures across different models.

#### INVERSE PROBABILITY OF TREATMENT AND STANDARDIZATION

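When measured binary covariate *Z* is the only common cause of binary point-time exposure *X* and binary outcome *Y*, then adjustment for *Z* is sufficient to identify the causal effect of *X* on *Y*, where the causal effect is defined as some contrast of Pr(*Y* = 1 | SET(*X* = 1)) with Pr(*Y* = 1 | SET(*X* = 0)). This contrast is most commonly a risk difference (RD), risk ratio (RR), or odds ratio (OR) defined as:

$$
\begin{aligned}
\mathrm{RD} &= \Pr(Y = 1 \mid \mathrm{SET}(X = 1)) - \Pr(Y = 1 \mid \mathrm{SET}(X = 0)) \\
\mathrm{RR} &= \frac{\Pr(Y = 1 \mid \mathrm{SET}(X = 1))}{\Pr(Y = 1 \mid \mathrm{SET}(X = 0))} \\
\mathrm{OR} &= \frac{\Pr(Y = 1 \mid \mathrm{SET}(X = 1)) \,/\, \Pr(Y = 0 \mid \mathrm{SET}(X = 1))}{\Pr(Y = 1 \mid \mathrm{SET}(X = 0)) \,/\, \Pr(Y = 0 \mid \mathrm{SET}(X = 0))}
\end{aligned}
$$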

The SET statement to the right of the conditioning bar is a hypothetical intervention used to define the causal effect. Confounding in the observed data requires that this effect be estimated by controlling for *Z*.

The first potential outcome, Pr(*Y* = 1 | SET(*X* = 1)), can be constructed as a weighted average of the observed stratum-specific risks in the 2 levels of *Z*. To obtain this quantity in the entire study population, the weights would reflect the observed distribution of *Z* in this same study population. Alternate weights, such as those reflecting the distribution of *Z* in the *X* = 1 or *X* = 0 subpopulations, would be relevant when other target populations are of interest. Thus, the expression:

$$
\Pr(Y = 1 \mid \mathrm{SET}(X = 1)) = \sum_{z} \Pr(Y = 1 \mid X = 1, Z = z)\,\Pr(Z = z)
$$

is the *Z*-standardized risk in the exposed, and provides the causal effect on *Y* of manipulating *X* to the value of 1 (left-hand side) in terms of observed proportions involving *X*, *Y*, and *Z* (right-hand side). The corresponding expression for manipulating *X* to the value of 0 is:

$$
\Pr(Y = 1 \mid \mathrm{SET}(X = 0)) = \sum_{z} \Pr(Y = 1 \mid X = 0, Z = z)\,\Pr(Z = z)
$$

Since these 2 right-hand-side expressions contain only observed proportions, they may be contrasted to obtain the causal RD, RR, or OR estimates of interest.
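As a minimal sketch of this standardization, the following Python snippet applies it to the counts of the point exposure example discussed in the next section. The cell counts (5 persons per *X*-by-*Z* cell) are reconstructed from the risks quoted in the text, and the names `cases`, `totals`, and `standardized_risk` are illustrative assumptions rather than anything from the original analysis.

```python
# Counts reconstructed from the risks quoted in the text (4/5, 3/5, 2/5, 1/5);
# the exact layout of the original Table is an assumption.
# Keys are (x, z): cases = number with Y = 1, totals = number of persons.
cases = {(1, 1): 4, (0, 1): 3, (1, 0): 2, (0, 0): 1}
totals = {(1, 1): 5, (0, 1): 5, (1, 0): 5, (0, 0): 5}


def standardized_risk(x):
    """Z-standardized risk under SET(X = x), weighted by Pr(Z = z) in the whole population."""
    n_total = sum(totals.values())
    risk = 0.0
    for z in (0, 1):
        pr_z = (totals[(1, z)] + totals[(0, z)]) / n_total
        risk += cases[(x, z)] / totals[(x, z)] * pr_z
    return risk


def odds(p):
    """Convert a risk to an odds."""
    return p / (1.0 - p)


r1 = standardized_risk(1)
r0 = standardized_risk(0)
rd = r1 - r0                 # standardized risk difference
rr = r1 / r0                 # standardized risk ratio
or_ = odds(r1) / odds(r0)    # standardized (Z-marginal) odds ratio

print(f"risks: {r1:.2f} vs {r0:.2f}; RD = {rd:.2f}, RR = {rr:.2f}, OR = {or_:.2f}")
```

Because the 2 risks are standardized first and contrasted afterward, all 3 measures produced here are marginal with respect to *Z*.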

Alternatively, one may take the weighted average of the effect contrast itself (eg, RD) to adjust for confounding by covariate *Z*:

$$
\mathrm{RD} = \sum_{z} \mathrm{RD}_{z}\,\Pr(Z = z)
$$

where RD*z* is the observed risk difference in stratum *Z* = *z*. Robins et al^{3} (p. 559) comment that this is the “usual way” to estimate adjusted effect measures.^{6} Nonetheless, there is an important statistical distinction between the 2 approaches, since the first is marginal with respect to covariate *Z* and the second is conditional on covariate *Z*. This distinction is consequential for noncollapsible measures such as odds ratios and hazard ratios.

#### A POINT EXPOSURE EXAMPLE

The Table shows data on exposure *X*, outcome *Y*, and covariate *Z*. The risks in the exposed and unexposed persons in the *Z* = 1 stratum are 4/5 = 0.80 and 3/5 = 0.60, whereas the risks are 2/5 = 0.40 and 1/5 = 0.20 in the exposed and unexposed of the *Z* = 0 stratum. Collapsing across *Z*-strata, the crude risks in the exposed and unexposed are 6/10 = 0.60 and 4/10 = 0.40, respectively. The 2 stratum-specific risk differences are therefore 0.20 and 0.20, and the RD value ignoring *Z* is also 0.20. Clearly, any weighted average of the 2 equal stratum-specific values must also yield 0.20, and so any *Z*-standardized estimate will equal the crude, suggesting an absence of confounding by *Z*. Indeed, inspection of the Table shows clearly that Pr(*Z* = 1 | *X* = 1) = Pr(*Z* = 1 | *X* = 0) = 0.5, and therefore that no confounding is possible. Nonetheless, the odds ratio is not so well behaved.^{7} The *Z*-stratum-specific OR values are:

$$
\mathrm{OR}_{Z=1} = \frac{0.80 / 0.20}{0.60 / 0.40} = 2.67, \qquad \mathrm{OR}_{Z=0} = \frac{0.40 / 0.60}{0.20 / 0.80} = 2.67
$$

which are distinct from the crude value of:

$$
\mathrm{OR}_{\text{crude}} = \frac{0.60 / 0.40}{0.40 / 0.60} = 2.25
$$

This oft-noted noncollapsibility of the OR means that change-in-estimate approaches for detecting confounders are not reliable when the OR is the measure of effect and the risk of disease is large in at least one stratum of the covariate.^{8}

This distinction between marginal and conditional estimates is directly applicable to inverse-probability-of-treatment models, which are equivalent to standardizing the individual risk estimates, rather than taking a weighted average of the covariate-stratum-specific effect estimates. For example, inverse-probability-of-treatment weights are trivial to compute in the example shown in the Table because the probability of exposure *X* does not depend on the covariate *Z* (ie, Pr(*X* = 1 | *Z* = 1) = Pr(*X* = 1 | *Z* = 0) = 0.5). Therefore, all inverse-probability-of-treatment weights are 1/0.5 = 2. Hence, if one doubles the value of every cell in the Table and computes a crude RD or OR in this “pseudopopulation,” this estimate has a causal interpretation (ie, is unconfounded by *Z*). Such an analysis produces the expected RD = 0.20, but perhaps less widely anticipated is that the calculation of the OR yields the *Z*-marginal value of 2.25, rather than the *Z*-conditional estimate of 2.67, the latter being equivalent to the weighted average of the *Z*-stratum-specific OR values using any choice of standardization weights.
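The pseudopopulation calculation described above can be sketched in a few lines of Python. The cell counts are again reconstructed from the risks quoted in the text (5 persons per *X*-by-*Z* cell), and the names `cells`, `pr_exposure`, and `pseudo` are hypothetical helpers introduced only for illustration.

```python
# Counts reconstructed from the text; keys are (x, z), values are (cases, noncases).
cells = {
    (1, 1): (4, 1),
    (0, 1): (3, 2),
    (1, 0): (2, 3),
    (0, 0): (1, 4),
}


def pr_exposure(x, z):
    """Observed Pr(X = x | Z = z)."""
    n_x = sum(cells[(x, z)])
    n_z = sum(cells[(1, z)]) + sum(cells[(0, z)])
    return n_x / n_z


# Weight every cell by 1 / Pr(X = x | Z = z) and collapse over Z.
weights = {}
pseudo = {1: [0.0, 0.0], 0: [0.0, 0.0]}  # x -> [weighted cases, weighted noncases]
for (x, z), (n_case, n_noncase) in cells.items():
    w = 1.0 / pr_exposure(x, z)
    weights[(x, z)] = w
    pseudo[x][0] += w * n_case
    pseudo[x][1] += w * n_noncase

risk1 = pseudo[1][0] / sum(pseudo[1])
risk0 = pseudo[0][0] / sum(pseudo[0])
iptw_rd = risk1 - risk0
iptw_or = (pseudo[1][0] / pseudo[1][1]) / (pseudo[0][0] / pseudo[0][1])

print(f"weights: {sorted(set(weights.values()))}")
print(f"IPTW RD = {iptw_rd:.2f}, IPTW OR = {iptw_or:.2f}")
```

Under these assumed counts every weight equals 2, so the weighted analysis reproduces the crude contrasts and therefore the *Z*-marginal OR.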

The *Z*-stratum-specific OR values are homogeneous in the Table example, making Mantel-Haenszel and regression-based adjustment (eg, logistic regression) entirely appropriate methods. Both of these methods will produce adjusted OR values of 2.67 if fit properly, whereas a correctly specified inverse-probability-of-treatment analysis will produce a value of 2.25. It is common in the marginal structural models literature to compare modeled estimates to “traditional” methods (ie, models that do not adjust properly for time-dependent confounding).^{9} When these 2 approaches produce distinct estimates, the usual conclusion is that time-dependent confounding was not successfully adjusted in the standard model. The simple example above illustrates that the estimates from these 2 approaches would not generally agree even in the absence of time-dependent confounding. Indeed, as shown above, the 2 approaches do not agree even when there is no confounding at all.
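To make the 2.67 versus 2.25 contrast concrete, here is a small sketch that computes the Mantel-Haenszel summary OR by hand and sets it beside the collapsed (marginal) OR. The 2 × 2 counts are reconstructed from the risks in the text, and the function names are assumptions made for this illustration; no statistical package is used.

```python
# Each stratum is (a, b, c, d):
#   a = exposed cases, b = exposed noncases, c = unexposed cases, d = unexposed noncases.
# Counts reconstructed from the risks quoted in the text.
strata = [
    (4, 1, 3, 2),  # Z = 1
    (2, 3, 1, 4),  # Z = 0
]


def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio: sum(a*d/n) / sum(b*c/n)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


def collapsed_or(tables):
    """Odds ratio after summing the strata into a single 2 x 2 table."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    d = sum(t[3] for t in tables)
    return (a * d) / (b * c)


mh_or = mantel_haenszel_or(strata)      # Z-conditional summary
marginal_or = collapsed_or(strata)      # Z-marginal (equals the IPTW value here)

print(f"Mantel-Haenszel OR = {mh_or:.2f}, collapsed OR = {marginal_or:.2f}")
```

The collapsed OR coincides with the inverse-probability-of-treatment value only because exposure is independent of *Z* in this example, so that all weights are equal.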

#### STABILIZED WEIGHTS

Consider now that *Z* is a baseline covariate in a longitudinal follow-up of the same study described above, but that another measured covariate *W* is predictive of outcome *Y*, and may be both an effect of baseline exposure *X* and a cause of subsequent exposure at follow-up. In this situation, traditional methods are not able to control for the potential time-dependent confounding by *W*, and so a technique such as inverse-probability-of-treatment weighting would be necessary. *Z* is controlled at baseline and its value does not change over follow-up time. The standard marginal structural model is constructed by predicting the probability of exposure at each follow-up time, as a function of past treatment, baseline *Z*, and time-varying values of *W*, and weighting by the inverse of the probability that each subject obtained the observed exposure. As described above, this model is marginal with respect to covariates *Z* and *W*, and so for the OR and for other noncollapsible measures will not generally match standard regression-model estimates even if correctly specified and if time-dependent confounding is absent.

Most researchers would not fit this model, however, opting instead to improve precision of the estimates by stabilizing the weights with the inclusion of another exposure model in the numerator.^{10} With a model for exposure *X* in the numerator as a function of previous levels of exposure, this estimate would still be marginal with respect to *Z* and *W*. Most researchers would still not stop there, however, and would opt to obtain even better precision by including in the numerator model not just previous values of *X*, but also baseline values of *Z*. This would also be necessary if one wished to consider modification of the exposure effect by a baseline variable. With *Z* included in both numerator and denominator, there is no longer any control for potential confounding by *Z*, and so it would be added to the outcome model as a covariate using standard methods. This changes the interpretation of the marginal structural model from marginal with respect to *Z* to *Z*-conditional. When the exposure effect is marginal with respect to some covariates (eg, the time-dependent covariates) and conditional with respect to others (eg, the time-invariant covariates), there is no way to compare with any standard model so long as a noncollapsible measure, such as an OR, is employed.
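The loss of control for *Z* when it appears in both numerator and denominator can be illustrated with a deliberately simplified point-exposure sketch. The counts below are hypothetical (they are not from the Table or from any study) and were chosen so that exposure depends on *Z*; the longitudinal history and the covariate *W* are omitted for brevity.

```python
# HYPOTHETICAL person counts by (x, z), chosen so that exposure depends on Z.
n = {(1, 1): 30, (0, 1): 10, (1, 0): 10, (0, 0): 30}


def pr_x_given_z(x, z):
    """Denominator model: Pr(X = x | Z = z)."""
    return n[(x, z)] / (n[(1, z)] + n[(0, z)])


def pr_x(x):
    """Numerator model without covariates: Pr(X = x)."""
    return (n[(x, 1)] + n[(x, 0)]) / sum(n.values())


def weighted_pr_z1_given_x(weights, x):
    """Pr(Z = 1 | X = x) in the weighted pseudopopulation."""
    w1 = weights[(x, 1)] * n[(x, 1)]
    w0 = weights[(x, 0)] * n[(x, 0)]
    return w1 / (w1 + w0)


unstabilized = {k: 1.0 / pr_x_given_z(*k) for k in n}
stabilized_marginal = {k: pr_x(k[0]) / pr_x_given_z(*k) for k in n}
stabilized_on_z = {k: pr_x_given_z(*k) / pr_x_given_z(*k) for k in n}

for name, w in [("unstabilized", unstabilized),
                ("stabilized, numerator Pr(X)", stabilized_marginal),
                ("stabilized, numerator Pr(X | Z)", stabilized_on_z)]:
    p1 = weighted_pr_z1_given_x(w, 1)
    p0 = weighted_pr_z1_given_x(w, 0)
    print(f"{name}: Pr(Z=1 | X=1) = {p1:.2f}, Pr(Z=1 | X=0) = {p0:.2f}")
```

In this sketch the first 2 weighting schemes balance *Z* across exposure groups, whereas the third leaves every weight equal to 1 and the original imbalance intact, which is why *Z* must then enter the outcome model and the resulting estimate becomes *Z*-conditional.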

#### NEIGHBORHOOD POVERTY AND ALCOHOL USE

Cerdá and colleagues^{11} have published an extraordinary article on the effects of neighborhood poverty on alcohol consumption. Their careful enumeration of assumptions and extensive model checking, as presented in elaborately detailed appendices, is a paragon of rigorous analysis and its assiduous documentation. These authors fit several types of adjusted models, however, and the comparison of the estimates from these various models is bedeviled by the considerations described above. Moreover, it can be quite difficult for the reader to keep straight what is being conditioned on and what is integrated over in each instance. The “traditional” model with time-varying covariates is fit with marginal generalized-estimating-equation (GEE) regression. Despite the appearance of the word “marginal,” however, the estimated odds ratios from this model are not marginal with respect to the covariates. Rather, this term refers to the treatment of the clusters (census tracts). In contrast, the marginal structural model estimates would be marginal with respect to all covariates, except that the authors stabilized the weights with an exposure model in the numerator of the weights that conditions on baseline covariates. These baseline covariates are then included in the outcome model, making the OR estimates conditional with respect to these factors. The time-varying covariates, such as income and occupation, however, are dealt with only in the denominator of the weights, and therefore the OR estimates are marginal with respect to these variables. Thus, even if there were no confounding by these factors, the estimates from the GEE model and from the marginal structural model would differ on the basis of the noncollapsibility of the OR alone. The results, however, suggest that there is indeed substantial confounding by these factors, not only because of the bivariate associations detailed in their eTable 2, but also because the marginal OR estimates are further from the null than the conditional OR estimates (their Table 2), which is the opposite of the pattern expected in the absence of confounding.^{12}

To avoid being alarmist about this distinction, it is important to stress that collapsible effect measures such as risk differences and risk ratios are equivalent when calculated marginally or conditionally over covariates. Furthermore, when the outcome is rare, the OR approximates the RR and is therefore comparable between marginal and conditional calculations. The outcome in the Cerdá et al study, however, occurs in roughly 15%–25% of subjects (depending on the year) and therefore could not be considered sufficiently rare to obviate this concern. The authors note explicitly that no such direct contrast of coefficients is possible.^{11} Nonetheless, the side-by-side presentation in Table 2 naturally invites such a comparison, which motivates this comment as a more explicit explanation of why this is not straightforward. It would be laudable if other authors were similarly cautious about this distinction, pointing out the noncomparability of the estimates, or alternatively focusing on RR and RD models in which these distinctions are moot.
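The role of outcome rarity can be checked numerically. The sketch below holds a common *Z*-conditional OR fixed and computes the implied *Z*-marginal OR for 2 sets of baseline risks: those of the point exposure example, and a hypothetical rare-outcome variant with the baseline risks divided by 100. The function `marginal_or`, the equal split of *Z*, and the rare-outcome risks are assumptions introduced only for this illustration.

```python
def odds(p):
    """Convert a risk to an odds."""
    return p / (1.0 - p)


def risk_from_odds(o):
    """Convert an odds back to a risk."""
    return o / (1.0 + o)


def marginal_or(conditional_or, baseline_risks, pr_z):
    """Z-marginal OR implied by a common Z-conditional OR.

    baseline_risks[i] is the unexposed risk in stratum i and pr_z[i] is the
    share of the population in that stratum (exposure independent of Z).
    """
    p0 = sum(r * w for r, w in zip(baseline_risks, pr_z))
    p1 = sum(risk_from_odds(conditional_or * odds(r)) * w
             for r, w in zip(baseline_risks, pr_z))
    return odds(p1) / odds(p0)


conditional = 8.0 / 3.0  # the 2.67 of the point exposure example

# Baseline risks from the text example (common outcome).
common = marginal_or(conditional, [0.60, 0.20], [0.5, 0.5])

# HYPOTHETICAL rare-outcome variant: same structure, baseline risks divided by 100.
rare = marginal_or(conditional, [0.006, 0.002], [0.5, 0.5])

print(f"conditional OR = {conditional:.3f}")
print(f"marginal OR, common outcome = {common:.3f}")
print(f"marginal OR, rare outcome   = {rare:.3f}")
```

Under these assumptions the marginal OR moves from 2.25 toward the conditional value as the outcome becomes rare, consistent with the OR behaving approximately like the collapsible RR in that setting.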