When generalizing inferences from a randomized trial to a target population, two classes of estimators are used: g-formula estimators,^{1–3} which depend on modeling the conditional outcome mean among trial participants, and inverse probability (IP) weighting estimators,^{1,4–6} which depend on modeling the probability of participation in the trial.

In this article, we take a careful look at the relation between g-formula and IP weighting generalizability estimators for time-fixed (non-time-varying) treatments.^{7} We propose IP weighting estimators that involve weighting by the inverse of the estimated probability of participating in the trial and receiving the treatment actually received. We show that these estimators are equivalent to the g-formula estimator when conditional probabilities are estimated using nonparametric “frequency” methods,^{8} a result that does not depend on whether the estimators have a causal interpretation and relates to well-known results in the context of confounding adjustment in observational studies.^{9–11} In practice, nonparametric frequency-based estimation of the models for trial participation or the outcome is typically infeasible because the curse of dimensionality means that some modeling assumptions are almost always necessary. Thus, we argue that in applied generalizability analyses, augmented IP weighting (doubly robust) estimators that combine models for the probability of trial participation and the expectation of the outcome should be preferred. Finally, we compare the finite-sample behavior of different estimators when using parametric models in a simulation study.

## TARGETS OF INFERENCE AND IDENTIFIABILITY CONDITIONS

Consider a randomized trial nested within a cohort of trial-eligible individuals.^{1,12} Let $X$ denote the baseline covariates; $S$ the trial participation indicator; $A$ the time-fixed treatment assignment variable; and $Y$ the observed outcome. Data are available from $n$ individuals (the cohort of trial-eligible patients) of whom $\sum_{i=1}^{n} S_i$ are randomly assigned to treatment. The data are independent realizations of $(X_i, S_i, S_i \times A_i, S_i \times Y_i)$, $i = 1, \ldots, n$. Throughout, we use uppercase letters to denote random variables and lowercase letters to denote realizations.

Let $Y^a$ denote the potential (counterfactual) outcome under an intervention to set treatment $A$ to $a$. We use the following identifiability conditions:^{1,7}

- I. Consistency of potential outcomes, so that if $A_i = a$ then $Y_i^a = Y_i$, for every unit $i$ in the target population.
- II. Conditional mean exchangeability over $A$ among trial participants, $\mathrm{E}[Y^a \mid X = x, S = 1, A = a] = \mathrm{E}[Y^a \mid X = x, S = 1]$.
- III. Positivity of treatment assignment in the trial, $\Pr[A = a \mid X = x, S = 1] > 0$ for each $a$ and each $x$ with positive density in the trial.
- IV. Conditional mean generalizability from trial participants to the target population (conditional mean exchangeability over $S$), $\mathrm{E}[Y^a \mid X = x, S = 1] = \mathrm{E}[Y^a \mid X = x]$.
- V. Positivity of trial participation, $\Pr[S = 1 \mid X = x] > 0$ for each $x$ with positive density in the target population. Note that the probability of trial participation need only be bounded away from 0, but not 1.

Implicit in the notation above is that the invitation to participate in the trial and trial participation do not affect the outcome, except through treatment. Furthermore, to simplify exposition, we assume the absence of interference, measurement error, missing data in baseline covariates, or dropout in the trial; we also only deal with causal quantities that are most meaningful under perfect adherence to assigned treatment. In other words, the methods we discuss only address selective trial participation, and will often need to be combined with additional remedies if any other complications affect the observed data mechanism.

Conditions (II) and (III) are expected to hold by design in randomized trials. We have used $X$ generically to denote baseline covariates. It is possible, however, that strict subsets of $X$ are adequate to satisfy the different exchangeability conditions. For example, in a marginally randomized trial, the mean exchangeability among trial participants holds unconditionally. Note, however, that in a marginally randomized trial, both $\mathrm{E}[Y^a \mid S = 1, A = a] = \mathrm{E}[Y^a \mid S = 1]$ and $\mathrm{E}[Y^a \mid X, S = 1, A = a] = \mathrm{E}[Y^a \mid X, S = 1]$ are expected to hold by design because, in this article, we assume that the covariates $X$ are measured at baseline and not affected by treatment.

In this article, to focus on issues related to estimation, we view conditions (I)–(V) as primitive, and do not discuss graphical models^{13} under which (some of) the conditions may be implied by the assumed structure. Instead, our starting point is that, when conditions (I)–(V) hold, the observed data functional

$$\psi(a) \equiv \mathrm{E}\big[\mathrm{E}[Y \mid X, S = 1, A = a]\big] \tag{1}$$

is equal to $\mathrm{E}[Y^a]$, that is, $\psi(a)$ can be interpreted as the potential outcome mean under intervention to set treatment to $a$ in the target population.^{1} In the eAppendix; http://links.lww.com/EDE/B586, we show that, if investigators are interested in the average treatment effect, identifiability condition (IV) can be replaced by a weaker condition of generalizability in measure^{14} (an informal suggestion along these lines was made in reference^{15}). The potential outcome means $\mathrm{E}[Y^a]$ are typically of inherent scientific and policy interest, and in the remainder of the article, we focus on estimating $\psi(a)$ using the observed data.

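Importantly, the relations between g-formula and IP weighting pertain to the observed variables, not the counterfactual variables. To see this, we re-express $\psi(a)$ using IP weighting:

$$
\begin{aligned}
\psi(a) &= \mathrm{E}\big[\mathrm{E}[Y \mid X, S = 1, A = a]\big] \\
&= \mathrm{E}\left[\frac{\mathrm{E}[Y \mid X, S = 1, A = a]\,\Pr[S = 1 \mid X]\,\Pr[A = a \mid X, S = 1]}{\Pr[S = 1 \mid X]\,\Pr[A = a \mid X, S = 1]}\right] \\
&= \mathrm{E}\left[\frac{I(S = 1, A = a)\,Y}{\Pr[S = 1 \mid X]\,\Pr[A = a \mid X, S = 1]}\right],
\end{aligned}
\tag{2}
$$

where $I(\cdot)$ is the indicator function and to obtain the expression after the second equality we multiplied and divided by $\Pr[S = 1 \mid X]\,\Pr[A = a \mid X, S = 1]$, a legitimate operation because the positivity conditions imply $\Pr[S = 1 \mid X = x]\,\Pr[A = a \mid X = x, S = 1] > 0$ for every $x$ with positive density in the target population. The derivation above relies on the positivity conditions but does not use any conditions about counterfactual variables; thus, the result holds regardless of whether $\psi(a)$ has a causal interpretation. Furthermore, under the positivity conditions (see eAppendix; http://links.lww.com/EDE/B586, for the derivation), we obtain the following identity that will prove useful in deriving an estimator of $\psi(a)$:

$$\mathrm{E}\left[\frac{I(S = 1, A = a)}{\Pr[S = 1 \mid X]\,\Pr[A = a \mid X, S = 1]}\right] = 1. \tag{3}$$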
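The normalization identity, namely that the inverse probability of participation and treatment weights have expectation 1, can be sketched with iterated expectations alone. This is an illustrative argument, not necessarily the one given in the eAppendix; the shorthand $p(X) = \Pr[S = 1 \mid X]$ and $e_a(X) = \Pr[A = a \mid X, S = 1]$ is introduced here only for compactness.

$$
\begin{aligned}
\mathrm{E}\left[\frac{I(S = 1, A = a)}{p(X)\,e_a(X)}\right]
&= \mathrm{E}\left[\frac{\mathrm{E}[I(S = 1, A = a) \mid X]}{p(X)\,e_a(X)}\right] \\
&= \mathrm{E}\left[\frac{\Pr[S = 1 \mid X]\,\Pr[A = a \mid X, S = 1]}{p(X)\,e_a(X)}\right] = 1.
\end{aligned}
$$

The positivity conditions are what make the division legitimate; no condition about counterfactual variables is used.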

## G-FORMULA AND IP WEIGHTING ESTIMATORS

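Equation (1) suggests the g-formula estimator,

$$\hat{\psi}_{\text{g}}(a) = \int \hat{g}_a(x)\, d\hat{F}_X(x) = \frac{1}{n} \sum_{i=1}^{n} \hat{g}_a(X_i),$$

where $\hat{g}_a(X)$ is a (possibly nonparametric) estimator of $\mathrm{E}[Y \mid X, S = 1, A = a]$; $\hat{F}_X$ is the empirical cumulative distribution function of $X$, and the last equality uses the usual (nonparametric) estimator of the distribution of $X$.

Furthermore, the re-expression in (2) suggests the Horvitz-Thompson-style^{16} IP weighting estimator,

$$\hat{\psi}_{\text{IPW1}}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{I(S_i = 1, A_i = a)\, Y_i}{\hat{p}(X_i)\, \hat{e}_a(X_i)}, \tag{4}$$

where $\hat{p}(X)$ is a (possibly nonparametric) estimator of $\Pr[S = 1 \mid X]$ and $\hat{e}_a(X)$ is a (possibly nonparametric) estimator of $\Pr[A = a \mid X, S = 1]$.

Identity (3), combined with the estimator in (4), suggests the Hájek-style^{17} IP weighting estimator,

$$\hat{\psi}_{\text{IPW2}}(a) = \left\{ \sum_{i=1}^{n} \frac{I(S_i = 1, A_i = a)}{\hat{p}(X_i)\, \hat{e}_a(X_i)} \right\}^{-1} \sum_{i=1}^{n} \frac{I(S_i = 1, A_i = a)\, Y_i}{\hat{p}(X_i)\, \hat{e}_a(X_i)}. \tag{5}$$

The estimator in (5) can be thought of as a variant of the estimator in (4), where the inverse of the probability of trial participation and treatment weights are normalized to sum to $n$.

The properties of empirical probability and expectation imply that, when the conditional mean of the outcome and the probability of trial participation and treatment are estimated using nonparametric frequency-based (nonsmooth) methods,^{8} the g-formula estimator and the two proposed IP weighting estimators are finite-sample equivalent, $\hat{\psi}_{\text{g}}(a) = \hat{\psi}_{\text{IPW1}}(a) = \hat{\psi}_{\text{IPW2}}(a)$. In the eAppendix, http://links.lww.com/EDE/B586, we derive this result for nonparametric frequency-based estimators.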
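The finite-sample equivalence can be checked numerically. The sketch below is illustrative only: it assumes a single discrete covariate so that every conditional probability and conditional mean can be estimated by a within-stratum sample proportion or mean, and the simulated cohort, the function name `frequency_estimators`, and all parameter values are hypothetical rather than taken from the article.

```python
import numpy as np

def frequency_estimators(x, s, a, y, a_level):
    """Return (g-formula, Horvitz-Thompson IPW, Hajek IPW) estimates for one treatment level.

    x is a discrete covariate; s is trial participation (0/1); a and y are only
    used for trial participants. Every nuisance quantity is a within-stratum
    sample proportion or sample mean (a "frequency" estimator).
    """
    n = len(x)
    g_hat = np.empty(n)  # estimate of E[Y | X, S = 1, A = a_level]
    p_hat = np.empty(n)  # estimate of Pr[S = 1 | X]
    e_hat = np.empty(n)  # estimate of Pr[A = a_level | X, S = 1]
    for level in np.unique(x):
        in_stratum = x == level
        in_trial = in_stratum & (s == 1)
        in_arm = in_trial & (a == a_level)
        p_hat[in_stratum] = in_trial.sum() / in_stratum.sum()
        e_hat[in_stratum] = in_arm.sum() / in_trial.sum()
        g_hat[in_stratum] = y[in_arm].mean()
    in_arm_all = (s == 1) & (a == a_level)
    weights = in_arm_all / (p_hat * e_hat)
    y_observed = np.where(in_arm_all, y, 0.0)  # outcomes outside the arm never enter the sums
    psi_g = g_hat.mean()
    psi_ht = np.sum(weights * y_observed) / n
    psi_hajek = np.sum(weights * y_observed) / np.sum(weights)
    return psi_g, psi_ht, psi_hajek

# Hypothetical cohort: three covariate strata, participation depends on the stratum.
rng = np.random.default_rng(0)
n = 600
x = rng.integers(0, 3, size=n)
s = rng.binomial(1, 0.2 + 0.2 * x)
a = rng.binomial(1, 0.5, size=n)
y = np.where(s == 1, 1.0 + a + x + rng.normal(size=n), np.nan)  # unobserved outside the trial

psi_g, psi_ht, psi_hajek = frequency_estimators(x, s, a, y, a_level=1)
print(psi_g, psi_ht, psi_hajek)
```

The three printed values agree up to floating-point rounding, because within each stratum the product of the two estimated probabilities equals the fraction of the stratum that is in the trial arm, so the weighted sum of outcomes reduces to the stratum size times the stratum-specific outcome mean.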

## IP WEIGHTING WITHOUT THE PROBABILITY OF TREATMENT

When the randomized trial is marginally randomized, one may be tempted to use an IP weighting estimator that uses the probability of trial participation but does not use the probability of treatment among randomized individuals (as proposed in reference^{7}):

$$\hat{\psi}_{\text{IPW3}}(a) = \left\{ \sum_{i=1}^{n} \frac{I(S_i = 1, A_i = a)}{\hat{p}(X_i)} \right\}^{-1} \sum_{i=1}^{n} \frac{I(S_i = 1, A_i = a)\, Y_i}{\hat{p}(X_i)}.$$

Note that $\hat{\psi}_{\text{IPW3}}(a)$ uses outcome information from trial participants, $S = 1$, who received treatment $A = a$, weighted by the inverse of the (estimated) probability of trial participation, $\{\hat{p}(X)\}^{-1}$. In contrast, the Horvitz-Thompson-style and Hájek-style estimators in (4) and (5), $\hat{\psi}_{\text{IPW1}}(a)$ and $\hat{\psi}_{\text{IPW2}}(a)$, use the same outcome information, but weighted by the inverse of the (estimated) probability of trial participation and of receiving the treatment actually received, $\{\hat{p}(X)\, \hat{e}_a(X)\}^{-1}$. In essence, $\hat{\psi}_{\text{IPW3}}(a)$ does not account for possible baseline imbalances in the randomized trial sample, whereas $\hat{\psi}_{\text{IPW1}}(a)$ and $\hat{\psi}_{\text{IPW2}}(a)$ do. The g-formula estimator, $\hat{\psi}_{\text{g}}(a)$, also accounts for baseline covariate imbalances in the trial because it standardizes (“marginalizes”) the conditional outcome mean among all trial participants (regardless of their treatment status), as well as the nonparticipants. Intuitively, $\hat{\psi}_{\text{g}}(a)$, $\hat{\psi}_{\text{IPW1}}(a)$, and $\hat{\psi}_{\text{IPW2}}(a)$ are finite-sample equivalent when the conditional outcome mean, and the probability of trial participation and treatment, are nonparametrically estimated because they achieve the same two goals: (1) adjust for differences in the distribution of measured covariates between trial participants and the target population, and (2) adjust for chance imbalances in measured baseline covariates in the trial. In contrast, $\hat{\psi}_{\text{IPW3}}(a)$ only addresses the first goal. That is why in finite samples, where imbalances between the randomized groups almost always occur (even in large marginally randomized trials), $\hat{\psi}_{\text{IPW3}}(a)$ is not equivalent to the other IP weighting estimators or the g-formula estimator.
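A small hand-checkable example may help make the role of chance imbalance concrete. The sketch below is a toy illustration with made-up data: 12 individuals in two covariate strata, with the treated fraction among trial participants differing between strata. It assumes the participation-only estimator is a weighted mean with normalized weights; the helper `stratum_frequency` and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical cohort of 12 individuals in two covariate strata (6 each);
# half of each stratum participates in the trial, so Pr[S = 1 | X] = 0.5 in both.
x = np.array([0] * 6 + [1] * 6)
s = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])
# Chance imbalance: 1 of 3 participants is treated when x = 0, 2 of 3 when x = 1.
a = np.array([1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0])
# Outcomes of non-participants are placeholders and are never used.
y = np.array([2.0, 1.0, 1.0, 0.0, 0.0, 0.0, 4.0, 6.0, 3.0, 0.0, 0.0, 0.0])

def stratum_frequency(event, given, x):
    """Within-stratum sample proportion of `event` among units where `given` holds."""
    out = np.empty(len(x))
    for level in np.unique(x):
        in_stratum = x == level
        out[in_stratum] = (event & given & in_stratum).sum() / (given & in_stratum).sum()
    return out

everyone = np.ones(len(x), dtype=bool)
in_trial = s == 1
in_arm = in_trial & (a == 1)
p_hat = stratum_frequency(in_trial, everyone, x)  # estimated Pr[S = 1 | X]
e_hat = stratum_frequency(a == 1, in_trial, x)    # estimated Pr[A = 1 | X, S = 1]

# Weights that ignore the probability of treatment versus weights that include it.
w_participation = in_arm / p_hat
w_both = in_arm / (p_hat * e_hat)
ipw_participation_only = np.sum(w_participation * y) / np.sum(w_participation)
ipw_hajek = np.sum(w_both * y) / np.sum(w_both)

# g-formula: stratum-specific treated means, averaged over the whole cohort.
g_hat = np.where(x == 0, y[in_arm & (x == 0)].mean(), y[in_arm & (x == 1)].mean())
g_formula = g_hat.mean()

print(ipw_participation_only, ipw_hajek, g_formula)
```

Working through the arithmetic, the participation-only estimate is (2 + 4 + 6)/3 = 4, whereas the estimate that also weights by the treatment probability and the g-formula estimate both equal (6 × 2 + 6 × 5)/12 = 3.5. The gap arises only because the treated fraction differs between strata; with equal treated fractions the three would coincide.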

In the eAppendix, http://links.lww.com/EDE/B586, we examine the conditions under which

and

produce identical estimates and provide an alternative argument to show that any discrepancy between these two estimators is due to baseline covariate imbalances among trial participants.

### Generalizability in Conditionally Randomized and Observational Studies

Because the IP weighting estimators in (4) and (5), $\hat{\psi}_{\text{IPW1}}(a)$ and $\hat{\psi}_{\text{IPW2}}(a)$, include in their weights a component for the estimated probability of receiving the treatment actually received, these estimators can be used to generalize the findings of conditionally randomized trials (provided that $X$ includes all the covariates that determine the randomization probability) and observational studies (provided that $X$ is adequate to control confounding). In contrast, the estimator that omits the probability of treatment, $\hat{\psi}_{\text{IPW3}}(a)$, cannot be used in these contexts, because it does not model the probability of treatment among those with $S = 1$.

## WHEN NONPARAMETRIC ESTIMATION IS INFEASIBLE

In practice, nonparametric estimation of the probability of trial participation or the expectation of the outcome is rarely feasible. When additional modeling assumptions are needed, the finite-sample behavior of the estimators of $\psi(a)$ changes in important ways. First, the Horvitz-Thompson-style estimator $\hat{\psi}_{\text{IPW1}}(a)$ and the Hájek-style estimator $\hat{\psi}_{\text{IPW2}}(a)$ are no longer equivalent. The latter estimator should usually be preferred because it is bounded^{18} and often has smaller variance than the former estimator.^{1,19} Second, none of the IP weighting estimators discussed in this article is equivalent to the g-formula estimator. Third, the g-formula estimator typically has smaller variance than the IP weighting estimators.^{1,19}
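The boundedness property can be seen in a tiny numerical sketch. The numbers below are made up for illustration: four trial participants in the treatment arm of interest with binary outcomes, one of whom has a very small estimated probability of participation and treatment, in a hypothetical cohort of 20.

```python
import numpy as np

n = 20                                   # hypothetical cohort size, including non-participants
y = np.array([1.0, 0.0, 1.0, 1.0])       # binary outcomes of participants in the arm of interest
prob = np.array([0.02, 0.5, 0.4, 0.5])   # estimated Pr[S=1 | X] * Pr[A=a | X, S=1] for each of them
weights = 1.0 / prob                     # everyone else in the cohort has weight 0

# Horvitz-Thompson style: divide the weighted sum by the cohort size.
psi_ht = np.sum(weights * y) / n
# Hajek style: divide by the sum of the weights, giving a weighted average of observed outcomes.
psi_hajek = np.sum(weights * y) / np.sum(weights)

print(psi_ht, psi_hajek)
```

Here the Horvitz-Thompson-style estimate is 54.5/20 = 2.725, outside the range of a binary outcome, while the Hájek-style estimate is 54.5/56.5, which necessarily lies between the smallest and largest observed outcomes.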

The large-sample behavior of the estimators is also different. First, whether different estimators converge in probability to $\psi(a)$ depends on correct specification of the models they rely on, in the sense that the assumed models “contain” the corresponding true models (see the references for detailed discussions of model misspecification in the context of likelihood^{20,21} or estimating equation^{22,23} approaches). When using correctly specified parametric models, the estimators converge in probability to $\psi(a)$, and the g-formula estimator has the smallest large-sample variance.

Second, the large-sample variances of the IP weighting estimators are unequal; often, but not always, the Hájek-style estimator $\hat{\psi}_{\text{IPW2}}(a)$ has smaller large-sample variance than the Horvitz-Thompson-style estimator $\hat{\psi}_{\text{IPW1}}(a)$.^{1,19} Third, estimating the probability of treatment among trial participants (even though that probability is known) can actually improve the large-sample variance of the IP weighting estimators.^{19,24}

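When modeling assumptions are necessary, generalizability analyses need not rely on just modeling the conditional outcome or the probability of trial participation. A better approach is to combine both models to obtain “augmented” IP weighting (doubly robust) estimators.^{1,25} The following is one of several possible^{1,18,25} augmented IP weighting estimators for $\psi(a)$:

$$\hat{\psi}_{\text{AIPW}}(a) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{I(S_i = 1, A_i = a)}{\hat{p}(X_i)\, \hat{e}_a(X_i)} \big\{ Y_i - \hat{g}_a(X_i) \big\} + \hat{g}_a(X_i) \right\}, \tag{6}$$

where $\hat{g}_a(X)$, $\hat{p}(X)$, and $\hat{e}_a(X)$ are estimators of $\mathrm{E}[Y \mid X, S = 1, A = a]$, $\Pr[S = 1 \mid X]$, and $\Pr[A = a \mid X, S = 1]$, respectively.

When the expectation of the outcome and the probability of trial participation and treatment can be estimated using nonparametric frequency-based estimators (as done in our eAppendix; http://links.lww.com/EDE/B586 and Lesko et al.^{7}), there is no benefit to using $\hat{\psi}_{\text{AIPW}}(a)$ instead of the g-formula estimator, $\hat{\psi}_{\text{g}}(a)$, or the IP weighting estimators in (4) and (5), $\hat{\psi}_{\text{IPW1}}(a)$ and $\hat{\psi}_{\text{IPW2}}(a)$. When nonparametric estimation is infeasible, however, $\hat{\psi}_{\text{AIPW}}(a)$ converges in probability to $\psi(a)$ and is asymptotically normal when either the model for the expectation of the outcome or the model for the probability of participation is correctly specified. When both models are correctly specified, $\hat{\psi}_{\text{AIPW}}(a)$ has large-sample variance less than or equal to that of the IP weighting estimators, but larger than or equal to that of the g-formula estimator.^{26} In finite samples, $\hat{\psi}_{\text{AIPW}}(a)$ often produces estimates that are strikingly more precise than those of the IP weighting estimators.^{1,19,25}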
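As a rough illustration of how an augmented IP weighting estimate might be computed with parametric working models, the sketch below uses logistic regressions for trial participation and treatment and a linear regression for the outcome. Everything here is an assumption made for the example rather than the article's implementation: the specific form (weighted outcome-model residuals plus the average outcome-model prediction), the functions `fit_logistic` and `aipw`, the single covariate, and the data-generating values.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(design, target, n_iter=25):
    """Logistic regression fitted by Newton-Raphson; returns the coefficient vector."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        mu = expit(design @ beta)
        gradient = design.T @ (target - mu)
        hessian = (design * (mu * (1.0 - mu))[:, None]).T @ design
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def aipw(x, s, a, y, a_level):
    """Augmented IP weighting estimate for one treatment level with parametric working models."""
    design = np.column_stack([np.ones(len(x)), x])
    in_trial = s == 1
    in_arm = in_trial & (a == a_level)
    # Working model for trial participation, fitted in the whole cohort.
    p_hat = expit(design @ fit_logistic(design, s.astype(float)))
    # Working model for treatment, fitted among trial participants only.
    beta_e = fit_logistic(design[in_trial], (a[in_trial] == a_level).astype(float))
    e_hat = expit(design @ beta_e)
    # Working model for the outcome: least squares among participants in the arm,
    # then predictions for everyone in the cohort.
    coef, *_ = np.linalg.lstsq(design[in_arm], y[in_arm], rcond=None)
    g_hat = design @ coef
    weights = in_arm / (p_hat * e_hat)
    residual = np.where(in_arm, y - g_hat, 0.0)  # residuals only exist inside the arm
    return np.mean(weights * residual + g_hat)

# Hypothetical data in which both working models are correctly specified
# and the target quantity for a_level = 1 equals 2.
rng = np.random.default_rng(2019)
n = 5000
x = rng.normal(size=n)
s = rng.binomial(1, expit(-1.0 + 0.5 * x))
a = rng.binomial(1, 0.5, size=n)
y = np.where(s == 1, 1.0 + a + x + rng.normal(size=n), np.nan)

estimate = aipw(x, s, a, y, a_level=1)
print(estimate)
```

Replacing either working model with a misspecified one (but not both) and re-running on large samples is one way to explore the double robustness property informally.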

## SIMULATION STUDY

We conducted a small simulation study to illustrate the behavior of different estimators when the probability of trial participation and the expectation of the outcome are estimated using parametric models. For a total cohort size of

equal to 2,000, 5,000, or 10,000, we generated baseline covariates

and

from independent standard normal distributions; trial participation

from a Bernoulli distribution with parameter

, with

values chosen to lead to a marginal probability of trial participation of 0.05, 0.1, or 0.2; treatment assignment in the trial by simple randomization with assignment probability of 0.5; and outcomes

, with independent and identically distributed errors,

. All estimators of

used correctly specified parametric models.
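A data-generating step of this general kind could be sketched as follows. The coefficient values, the number of covariates (two), the linear outcome model with standard normal errors, and the bisection search for the intercept are all assumptions made for illustration and should not be read as the simulation settings actually used; `simulate_cohort` is a hypothetical helper.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_cohort(n, target_participation, rng):
    """Simulate one cohort with a nested, marginally randomized trial.

    All numeric settings below are hypothetical placeholders.
    """
    x = rng.normal(size=(n, 2))                 # independent standard normal covariates
    linear = x @ np.array([0.5, 0.5])           # hypothetical participation coefficients
    # Bisection on the intercept so that the average participation probability
    # matches the requested marginal probability.
    low, high = -20.0, 20.0
    for _ in range(60):
        intercept = (low + high) / 2.0
        if expit(intercept + linear).mean() < target_participation:
            low = intercept
        else:
            high = intercept
    s = rng.binomial(1, expit(intercept + linear))
    # Simple randomization with probability 0.5 inside the trial; -1 marks "not randomized".
    a = np.where(s == 1, rng.binomial(1, 0.5, size=n), -1)
    # Hypothetical linear outcome model; outcomes are unobserved outside the trial.
    y = np.where(s == 1, 1.0 + a + x.sum(axis=1) + rng.normal(size=n), np.nan)
    return x, s, a, y

x, s, a, y = simulate_cohort(n=2_000, target_participation=0.05, rng=np.random.default_rng(2019))
print(s.mean())
```

Calibrating the intercept within each simulated cohort is a convenience of this sketch; fixing the intercept once from a very large sample would be an equally reasonable choice.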

We present simulation results for the potential outcome means for

and

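, based on 10,000 runs per scenario, in Table 1 in eAppendix (http://links.lww.com/EDE/B586). In the Table in eAppendix (http://links.lww.com/EDE/B586), we also report results for the average treatment effect by taking the difference of the estimated potential outcome means.^{27,28} The g-formula estimator $\hat{\psi}_{\text{g}}(a)$, the augmented IP weighting estimator $\hat{\psi}_{\text{AIPW}}(a)$, and the Horvitz-Thompson-style IP weighting estimator $\hat{\psi}_{\text{IPW1}}(a)$ were approximately unbiased, even in the scenario with the smallest cohort size ($n = 2{,}000$) and lowest trial participation probability ($\Pr[S = 1] = 0.05$). In that scenario, both IP weighting estimators with normalized weights, $\hat{\psi}_{\text{IPW2}}(a)$ and $\hat{\psi}_{\text{IPW3}}(a)$, had some finite-sample bias; this bias became negligible in scenarios with larger sample sizes and higher trial participation probabilities.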

Variance results for treatment-specific potential outcome means are summarized in the Figure (additional results are in Table 1 in eAppendix; http://links.lww.com/EDE/B586). Overall, the g-formula estimator $\hat{\psi}_{\text{g}}(a)$ had the smallest variance; the augmented IP weighting estimator $\hat{\psi}_{\text{AIPW}}(a)$ had slightly larger variance, but much smaller than that of the IP weighting estimators. The better bias properties of the Horvitz-Thompson-style estimator $\hat{\psi}_{\text{IPW1}}(a)$ in small cohorts with low trial participation probabilities came at the price of much larger variance compared with $\hat{\psi}_{\text{IPW2}}(a)$ and $\hat{\psi}_{\text{IPW3}}(a)$. These two estimators had smaller variance than $\hat{\psi}_{\text{IPW1}}(a)$ even with larger cohorts and higher probabilities of participation. For the estimation of treatment-specific potential outcome means, using the estimated probability of treatment in $\hat{\psi}_{\text{IPW2}}(a)$ did not substantially increase variance in smaller cohorts and improved variance in larger cohorts, compared with $\hat{\psi}_{\text{IPW3}}(a)$. The differences in terms of bias and variance performance between the two IP weighting estimators that use normalized weights, $\hat{\psi}_{\text{IPW2}}(a)$ and $\hat{\psi}_{\text{IPW3}}(a)$, seem small for all practical purposes; much more important drivers of performance are the total cohort sample size, the marginal probability of trial participation, and the use of outcome modeling, as is done in $\hat{\psi}_{\text{AIPW}}(a)$.

## DISCUSSION

The g-formula and IP weighting are fundamentally connected in ways that extend well beyond the derivations in this article.^{29} Epidemiologists steeped in the analysis of observational comparative effectiveness studies are familiar with the connection because of the well-known equivalence of g-formula and IP weighting estimators for confounding control, when the outcome mean and probability of treatment are nonparametrically estimated.^{11} We examined this connection in the context of analyses generalizing trial findings to a target population. We proposed IP weighting estimators that use models for the probability of treatment (among trial participants) and the probability of trial participation and showed that these estimators are equivalent to the g-formula estimator, when all models are estimated by nonparametric frequency estimators. We also provided an intuitive explanation for the nonequivalence of the IP weighting estimator proposed by Lesko et al.^{7} and the g-formula estimator.

In realistic generalizability analyses, nonparametric estimation of models for the probability of trial participation, the conditional outcome mean, or the probability of treatment is infeasible because of the curse of dimensionality.^{30} Instead, some modeling assumptions will be necessary to obtain estimates with reasonable precision. Nevertheless, the equivalence of g-formula and IP weighting estimators when using nonparametric models ensures that, when using flexible semiparametric or parametric models, in large samples, the point estimates of g-formula and IP weighting procedures will be fairly similar. Any large discrepancy between these estimates suggests model misspecification or some other failure of assumptions.^{31}

Augmented IP weighting estimators provide analysts with two opportunities for valid inference in large samples because the estimators converge in probability to the desired parameter when either the working model for the outcome or the working model for the probability of trial participation is correctly specified.^{25} Furthermore, in finite samples, augmented IP weighting estimators are often more efficient than nonaugmented IP weighting estimators and almost as efficient as g-formula estimators. We illustrated this behavior when using correctly specified parametric models in the small simulation study reported here, as well as in the more extensive studies reported elsewhere.^{1} Thus, we argue that in practical generalizability analyses, augmented IP weighting (doubly robust) estimators should be preferred, provided that empirical positivity violations are not present or can be addressed.^{32}

In this article, we have considered two extremes on the spectrum of approaches for estimating working models: nonparametric frequency-based (nonsmooth) and fully parametric methods. These extremes have well-known limitations: frequency-based methods cannot handle problems of even modest dimensions in realistic datasets, because of the curse of dimensionality; fully parametric methods converge at the usual $\sqrt{n}$-rate under such conditions, but do so at the risk of serious misspecification.^{8} There is increasing interest in using approaches that fall somewhere in between these two extremes, such as semiparametric and machine learning methods that reduce the risk of misspecification while being able to handle high-dimensional data. The increase in modeling flexibility compared with parametric models, however, comes at the cost of slower rates of convergence of the working models and attendant poor performance of g-formula–based and nonaugmented IP weighting generalizability estimators.^{33} In contrast, when using flexible modeling approaches, doubly robust generalizability estimators are particularly attractive because, by appropriately combining two flexible models, they converge to the parameter value “faster” than estimators that rely on just one^{31,34} (for technical details and additional techniques that can be used to improve performance see, e.g., reference^{33}).

Finally, we note that further research is needed to examine the performance of generalizability estimators when models are misspecified and weights are highly variable.^{35} In such cases, variants of the augmented IP weighting estimator in (6) are likely to perform better.^{27,28,36}

## ACKNOWLEDGMENTS

We thank Drs. Ashley Buchanan (University of Rhode Island), Stephen Cole (University of North Carolina), and Catherine Lesko (Johns Hopkins University) for helpful comments on earlier versions of this article.

## REFERENCES

1. Dahabreh IJ, Robertson SE, Tchetgen Tchetgen EJ, Stuart EA, Hernán MA. Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics. 2019;75:685–694.

2. Goldstein BA, Phelan M, Pagidipati NJ, Holman RR, Stuart MJ. An outcome model approach to translating a randomized controlled trial results to a target population. 2018.

3. Rudolph JE, Cole SR, Eron JJ, Kashuba AD, Adimora AA. Estimating human immunodeficiency virus (HIV) prevention effects in low-incidence settings. Epidemiology. 2019;30:358–364.

4. Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. Am J Epidemiol. 2010;172:107–115.

5. Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. J R Stat Soc Ser A Stat Soc. 2011;174:369–386.

6. Buchanan AL, Hudgens MG, Cole SR, et al. Generalizing evidence from randomized trials using inverse probability of sampling weights. J R Stat Soc Ser A Stat Soc. 2018;181:1193–1209.

7. Lesko CR, Buchanan AL, Westreich D, Edwards JK, Hudgens MG, Cole SR. Practical considerations when generalizing study results: a potential outcomes perspective. Epidemiology. 2017;28:553–561.

8. Li Q, Racine JS. Nonparametric Econometrics: Theory and Practice. Princeton, NJ: Princeton University Press; 2007.

9. Sato T, Matsuyama Y. Marginal structural models as a tool for standardization. Epidemiology. 2003;14:680–686.

10. Hernán MA, Robins JM. Estimating causal effects from epidemiological data. J Epidemiol Community Health. 2006;60:578–586.

11. Hernán MA, Robins JM. Causal Inference (forthcoming). Boca Raton, FL: Chapman & Hall/CRC; 2019.

12. Dahabreh IJ, Haneuse SJ-PA, Robins JM, et al. Study designs for extending causal inferences from a randomized trial to a target population. 2019.

13. Pearl J, Bareinboim E. External validity: from do-calculus to transportability across populations. Stat Sci. 2014;29:579–595.

14. VanderWeele TJ. Confounding and effect modification: distribution and measure. Epidemiol methods. 2012;1:55–82.

15. Huitfeldt A, Stensrud MJ. Re: Generalizing study results: a potential outcomes perspective. Epidemiology. 2018;29:e13–e14.

16. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663–685.

17. Hájek J. Comment on “An essay on the logical foundations of survey sampling by D. Basu”. In: Godambe VP, Sprott DA, eds. Foundations of Statistical Inference. New York City, NY: Holt, Rinehart, and Winston; 1971:236.

18. Robins JM, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: performance of double-robust estimators when “inverse probability” weights are highly variable. Stat Sci. 2007;22:544–559.

19. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med. 2004;23:2937–2960.

20. Huber PJ, et al. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Volume 1. Berkeley, CA: University of California Press; 1967:221–233.

21. White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25.

22. Newey WK, McFadden D. Large sample estimation and hypothesis testing. Handbook of Econometrics. 1994;4:2111–2245.

23. Boos DD, Stefanski LA. Essential Statistical Inference: Theory and Methods. Vol 120. New York City, NY: Springer Science & Business Media; 2013.

24. Hahn J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica. 1998;66:315–331.

25. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–973.

26. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89:846–866.

27. Tan Z. A distributional approach for causal inference using propensity scores. J Am Stat Assoc. 2006;101:1619–1637.

28. Cao W, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96:723–734.

29. Richardson TS, Robins JM. Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Technical Report 128. 2013.Seattle, WA: Center for Statistics and the Social Sciences, University of Washington.

30. Robins JM, Ritov Y. Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Stat Med. 1997;16:285–319.

31. Robins JM, Rotnitzky A. Comments. Stat Sin. 2001;11:920–936.

32. Petersen ML, Porter KE, Gruber S, Wang Y, van der Laan MJ. Diagnosing and responding to violations in the positivity assumption. Stat Methods Med Res. 2012;21:31–54.

33. Chernozhukov V, Chetverikov D, Demirer M, et al. Double/debiased machine learning for treatment and structural parameters. Economet J. 2018;21:C1–C68.

34. Naimi AI, Kennedy EH. Nonparametric double robustness. 2017.

35. Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci. 2007;22:523–539.

36. Vermeulen K, Vansteelandt S. Bias-reduced doubly robust estimation. J Am Stat Assoc. 2015;110:1024–1036.