In public health and medicine, there is a tension between internal and external validity.^{1–12} Many interventions are known to be efficacious at the individual level, but less is known about (1) their impact once scaled to a population level, (2) effect modification by both measured and unmeasured factors, and (3) which intervention components should be implemented universally and which adapted to local context. The field of human immunodeficiency virus (HIV) prevention and treatment provides an illustrative example. Randomized trials and observational studies have shown that immediate initiation of antiretroviral therapy for an HIV-positive individual improves his or her health and prevents transmission between couples and from mothers to children.^{13–16} Four community randomized trials aim to examine the impact of “universal test and treat” (population-wide HIV testing with immediate antiretroviral therapy initiation for all HIV-positive persons) on HIV incidence in several countries in Eastern and Southern Africa.^{17–20} These trials are pragmatic in that they aim to learn about effectiveness and implementation in real-world conditions. Nonetheless, the specific components of their interventions, their implementation, and their impact are expected to vary within and across trials. Given promising interim results,^{21} open questions remain about nation-wide rollout and the heterogeneity in expected impact.

In this issue of *Epidemiology*, Lesko et al.^{12} highlight the distinction between estimating the effect for the study units and the effect for the target population. Concretely, the sample average treatment effect^{22},^{23} is the mean difference in the counterfactual (potential) outcomes for the enrolled units, whereas the population average treatment effect is the expected difference in the counterfactual (potential) outcomes for the population from which the study units were selected. It is worth emphasizing that sample and population effects are fundamentally different causal parameters, even when the units are drawn as a simple random sample from the target population, there is only one version of the treatment, and there is no interference.^{1},^{2},^{11} In other words, even if all the assumptions for both internal and external validity held, the sample and population effects would likely differ.

Consider the simulation study conducted by Lesko et al.,^{12} hereafter called “the authors.” There is one value of the population average treatment effect, calculated analytically or by Monte Carlo simulations as −5.5%. In contrast, the sample average treatment effect is a data-adaptive parameter; with each new selection of study units, a new value of the sample effect is obtained. To illustrate this point, we replicate the authors’ simulation 5,000 times. For increasing enrollment sizes, we draw a simple random sample of *n* units and calculate the sample effect as the average difference in the counterfactual outcomes for the enrolled units (R code in Appendix C). The resulting minimum, median, and maximum values of the sample effect and its variability from our study are shown in Table 1. For the smallest size of 100, the true value of the sample effect ranges from −14.8% to 9.2% with a median value of −5.8%. In some studies, the intervention is highly protective and in others, quite harmful. Our simple simulation highlights the potential dangers of immediately generalizing the sample effect to the population level, even when the conditions for internal and external validity hold.
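The sampling variability of the sample effect can be illustrated with a minimal sketch. The data-generating process below is hypothetical (a single binary covariate *W* with made-up risks, and counterfactual outcomes drawn independently given *W*); it is not the authors' simulation and will not reproduce Table 1. Independent draws from the data-generating process stand in for simple random sampling from a very large target population, and the number of repetitions is reduced to keep the example fast.

```python
import random
import statistics

# Hypothetical data-generating process (an assumption for illustration only).
P_W = 0.2                                   # P(W = 1)
RISK = {0: (0.15, 0.12), 1: (0.45, 0.30)}   # W -> (P(Y(0) = 1), P(Y(1) = 1))
PATE = (1 - P_W) * (RISK[0][1] - RISK[0][0]) + P_W * (RISK[1][1] - RISK[1][0])

def draw_unit_effect(rng):
    """Draw one unit and return its individual effect Y(1) - Y(0)."""
    w = int(rng.random() < P_W)
    y0 = int(rng.random() < RISK[w][0])
    y1 = int(rng.random() < RISK[w][1])
    return y1 - y0

def simulate_sate(n, reps, seed=1):
    """Return the true sample effect for each of `reps` studies of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(draw_unit_effect(rng) for _ in range(n))
            for _ in range(reps)]

results = {}
for n in (100, 1000):
    values = simulate_sate(n, reps=500)
    results[n] = (min(values), statistics.median(values), max(values))
    print(f"n={n}: min={results[n][0]:.3f}, "
          f"median={results[n][1]:.3f}, max={results[n][2]:.3f}")
print(f"population effect: {PATE:.3f}")
```

Under these assumptions, the range of the sample effect narrows as enrollment grows, while the population effect stays fixed.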

The sample effect is also an appealing causal parameter but has received less attention in public health literature. As discussed by the authors, we commonly assume the existence of a real or hypothetical target population from which the study units were selected and about which we wish to make inferences. Concretely defining this population is challenging. In contrast, the sample effect avoids all assumptions about a “vaguely defined super-population of study units”^{7} and is simply the intervention effect for units at hand. In the Sustainable East Africa Research in Community Health (SEARCH) trial, for example, the sample effect corresponds to the average difference in the counterfactual cumulative HIV incidence under the test and treat strategy and under the standard of care for the *n* = 32 study communities.^{11} In this example, the sample effect captures the impact for approximately 320,000 people living in rural Uganda and Kenya. As Lesko et al.^{12} and others^{1–11} discuss, generalizing this intervention effect to a wider population (e.g., all of Uganda and Kenya) or transporting it to a different population (e.g., Boston or San Francisco) requires additional assumptions and distinct estimators. Finally, the sample effect will be estimated with at least as much precision as the population effect,^{1},^{22},^{23} especially under pair matching.^{11},^{24} In particular, if there is heterogeneity in the intervention effect by measured or unmeasured factors, a given study will have more power to detect a sample than a population effect. In other words, the price for generalizing the sample to the population is higher variance. Altogether, the interpretability, relevance, and increased precision from specifying the sample average treatment effect as the target of inference make this causal parameter an appealing alternative to the population average treatment effect.
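The variance argument can be made explicit with a short decomposition. As a sketch, let $\hat{\psi}$ (a symbol introduced here for illustration) be an estimator that is unbiased for the sample effect conditional on the enrolled units and their counterfactual outcomes, such as the difference in means in a randomized trial, and assume simple random sampling so that the expected value of the *SATE* equals the *PATE*. Then

$$
\begin{aligned}
E\big[(\hat{\psi} - PATE)^2\big]
&= E\big[(\hat{\psi} - SATE)^2\big] + E\big[(SATE - PATE)^2\big] + 2\,E\big[(\hat{\psi} - SATE)(SATE - PATE)\big] \\
&= E\big[(\hat{\psi} - SATE)^2\big] + Var(SATE).
\end{aligned}
$$

The cross term vanishes because, conditional on the enrolled units, $SATE - PATE$ is fixed and $\hat{\psi} - SATE$ has mean zero. With independent draws, $Var(SATE) = Var\big(Y(1) - Y(0)\big)/n$, which is zero only when the unit-level effect is constant. Under these assumptions, heterogeneity in the effect is exactly what makes the population effect harder to estimate than the sample effect.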

The authors’ presentation is focused on the enrollment (sampling) mechanism, and their estimators are derived under the potential outcomes framework.^{1–4},^{7},^{9},^{22–26} An alternative would be to consider the structural causal model of Pearl.^{27} Recall the authors’ notation with *W* as the set of characteristics influencing enrollment, *S* as an indicator of being selected into the study, *A* as an indicator of receiving the exposure, and *Y* as the outcome. Specifically, we consider a binary exposure with *A* = 1 for the intervention and *A* = 0 for the control. For simplicity, we define the exposure and outcome (*A*, *Y*) to be 0 for units not selected (*S* = 0). Then the following causal model would describe a study (observational or randomized) wherein units are enrolled as a function of baseline covariates and the exposure is assigned as a function of baseline covariates and enrollment:

$$
\begin{aligned}
W &= f_W(U_W) \\
S &= f_S(W, U_S) \\
A &= S \times f_A(W, U_A) \\
Y &= S \times f_Y(W, A, U_Y)
\end{aligned}
$$

Here (*U*_{W}, *U*_{S}, *U*_{A}, *U*_{Y}) denote the corresponding set of background or unmeasured factors that have some joint distribution $\mathbb{P}_U$. This approach for representing selection into the study (a threat to external validity) could easily be extended to include missing data after enrollment (a threat to internal validity) as well as postenrollment covariates that influence the exposure assignment (additional confounders, another threat to internal validity; Appendix A). Furthermore, this causal model is defined at the unit level and implicitly assumes no interference. (This framework could also be extended to handle interference.) The authors’ simulation example is one possible data-generating process compatible with this structural causal model. In a randomized trial (such as considered by the authors), the unmeasured factors contributing to the intervention assignment *U*_{A} are independent of the others and the covariates *W* do not impact randomization *A*.

We assume that the above causal model provides a description of the data-generating process under existing conditions and under specific interventions.^{27} (See Appendix 1 in Bareinboim and Pearl^{8} for a short introduction.) Counterfactual outcomes are generated by intervening on this causal model. To define the sample effect, we intervene to set the exposure *A* = *a* to generate the counterfactual outcome $Y_i(a)$: the outcome that would have been observed if unit *i* received exposure level *A* = *a*. Then the sample average treatment effect (*SATE*) is defined as the average difference in these counterfactual outcomes among enrolled units (*S = 1*):

$$
SATE = \frac{1}{n} \sum_{i:\, S_i = 1} \big[ Y_i(1) - Y_i(0) \big]
$$

where *n* denotes the total number of units in the study. To define the population average treatment effect (*PATE*) in the context of biased sampling, we consider a hypothetical intervention to enroll the entire target population (i.e., set *S* = 1) and assign the exposure *A* = *a*. We denote the counterfactual outcome under this joint intervention as *Y*(*s* = 1, *a*). The *PATE* is then given by the expected difference in these counterfactual outcomes across the target population of interest:

$$
PATE = E\big[ Y(s = 1, a = 1) - Y(s = 1, a = 0) \big]
$$

For illustration, we repeat the authors’ simulation study 5,000 times: (1) generate a target population of 50,000 units; (2) from that population, draw a biased sample of *n* = 2,000 units; and (3) for each sample, calculate the *SATE* as the average difference in the counterfactual outcomes for the enrolled units (R code in Appendix C). As shown in Table 1, the sample effect under the biased sampling scheme ranges from −12.8% to −8.3% and is −10.4% on average. The sample effect is larger in magnitude on average than the population effect (−10.4% versus −5.5%), because units at higher risk of the outcome are preferentially selected into the study. Practically, this may suggest that instead of rolling out the intervention to the entire population, the greatest impact could be obtained by targeting the intervention to high-risk groups. Likewise, the effect heterogeneity may suggest alternative parameters of interest, such as the conditional average treatment effect, an intermediate between the sample and population effects (Appendix B).^{1},^{28},^{29}
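The three steps above can be sketched as follows. This is a simplified illustration rather than the R code in Appendix C: the covariate distribution, risks, and selection probabilities are hypothetical, each unit is enrolled independently with probability depending on *W* (so the sample size varies around 1,200 rather than being fixed at 2,000), and the population size and number of repetitions are reduced for speed.

```python
import random
import statistics

# Hypothetical values chosen for illustration only.
P_W = 0.2                                   # P(W = 1), the high-risk stratum
RISK = {0: (0.15, 0.12), 1: (0.45, 0.30)}   # W -> (P(Y(0) = 1), P(Y(1) = 1))
P_SELECT = {0: 0.025, 1: 0.20}              # P(S = 1 | W): high-risk units over-sampled
PATE = (1 - P_W) * (RISK[0][1] - RISK[0][0]) + P_W * (RISK[1][1] - RISK[1][0])

def one_study(rng, pop_size):
    """Generate a target population, enroll a biased sample, return its SATE."""
    effects = []
    for _ in range(pop_size):
        w = int(rng.random() < P_W)
        if rng.random() < P_SELECT[w]:      # S = 1: the unit is enrolled
            # Counterfactuals are only needed for enrolled units.
            y0 = int(rng.random() < RISK[w][0])
            y1 = int(rng.random() < RISK[w][1])
            effects.append(y1 - y0)
    return statistics.fmean(effects)

rng = random.Random(2017)
sate_values = [one_study(rng, pop_size=20_000) for _ in range(100)]
mean_sate = statistics.fmean(sate_values)

print(f"SATE: min={min(sate_values):.3f}, mean={mean_sate:.3f}, "
      f"max={max(sate_values):.3f}")
print(f"PATE: {PATE:.3f}")
```

With these made-up inputs, two thirds of enrolled units are expected to come from the high-risk stratum, so the sample effect is centered well below the population effect.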

The structural causal model representation also draws a connection between the authors’ assumptions and estimators for external validity and the standard machinery for controlling for confounding, selection bias, and/or unrepresentative sampling.^{3},^{4},^{8},^{10},^{30–35} Given the sequential randomization assumption,

$$
Y(s = 1, a) \perp S \mid W \quad \text{and} \quad Y(s = 1, a) \perp A \mid S = 1, W,
$$

and the corresponding positivity assumptions, we have the G-computation identifiability result:^{32}

$$
PATE = E\big[ E(Y \mid A = 1, S = 1, W) - E(Y \mid A = 0, S = 1, W) \big].
$$

This estimand is written equivalently in inverse probability weighting (*IPW*) form as

$$
E\left[ \left( \frac{I(A = 1, S = 1)}{P(A = 1, S = 1 \mid W)} - \frac{I(A = 0, S = 1)}{P(A = 0, S = 1 \mid W)} \right) Y \right],
$$

where the weights could be factorized into a product of propensity score $P(A = a \mid S = 1, W)$ and selection mechanism $P(S = 1 \mid W)$.^{4},^{30} While stabilized weights are also possible,^{3} the above estimands showcase the equivalence when nonparametric estimators are used for the outcome regression and selection/exposure mechanisms in both observational and trial settings.
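The equivalence of the G-computation and *IPW* estimands follows from iterated expectations. As a brief sketch for a single exposure level *a*, assuming the positivity condition $P(A = a, S = 1 \mid W) > 0$:

$$
\begin{aligned}
E\left[ \frac{I(A = a, S = 1)}{P(A = a, S = 1 \mid W)}\, Y \right]
&= E\left[ \frac{E\big[ I(A = a, S = 1)\, Y \mid W \big]}{P(A = a, S = 1 \mid W)} \right] \\
&= E\left[ \frac{E(Y \mid A = a, S = 1, W)\, P(A = a, S = 1 \mid W)}{P(A = a, S = 1 \mid W)} \right] \\
&= E\big[ E(Y \mid A = a, S = 1, W) \big],
\end{aligned}
$$

where $P(A = a, S = 1 \mid W) = P(A = a \mid S = 1, W)\, P(S = 1 \mid W)$ by the chain rule. Taking the difference between *a* = 1 and *a* = 0 recovers the two forms of the estimand.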

For our simulation study, we implemented the corresponding G-computation (also known as standardization)^{36},^{37} and *IPW* estimators for the population effect. For comparison, we also implemented the unadjusted estimator, as the difference in the mean outcomes between enrolled treated units and enrolled control units. The results of 5,000 repetitions are shown in Table 2 (R code in Appendix C). As expected, the unadjusted estimator is unbiased for the sample effect but exhibits substantial bias when the target of inference is the population effect. Also as expected, the G-computation and *IPW* estimators are able to correct for the biased sampling scheme and are identical. (The algorithms are equivalent when nonparametric estimators are used for the outcome regression and the selection/exposure mechanism.)
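The three estimators can be sketched for a single simulated dataset. The code below is an illustration under a hypothetical data-generating process (one binary covariate *W*, made-up risks and selection probabilities, and a 1:1 randomized exposure among enrolled units); it is not the R code in Appendix C. The function names are introduced here for illustration, and the outcome regression and the selection/exposure mechanism are estimated nonparametrically by cell counts.

```python
import random
from collections import defaultdict

# Hypothetical values chosen for illustration only.
RISK = {0: (0.15, 0.12), 1: (0.45, 0.30)}   # W -> (P(Y(0) = 1), P(Y(1) = 1))
P_SELECT = {0: 0.025, 1: 0.20}              # P(S = 1 | W)
TRUE_PATE = 0.8 * (0.12 - 0.15) + 0.2 * (0.30 - 0.45)

def simulate(pop_size, seed):
    """Return (W, S, A, Y) per unit, with A and Y set to 0 when S = 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(pop_size):
        w = int(rng.random() < 0.2)
        s = int(rng.random() < P_SELECT[w])
        a = s * int(rng.random() < 0.5)          # randomized among enrolled units
        y = s * int(rng.random() < RISK[w][a])
        data.append((w, s, a, y))
    return data

def cell_counts(data):
    n_w = defaultdict(int)      # population units with W = w
    n_cell = defaultdict(int)   # enrolled units with (A = a, W = w)
    sum_y = defaultdict(int)    # outcome total among enrolled units with (A = a, W = w)
    for w, s, a, y in data:
        n_w[w] += 1
        if s == 1:
            n_cell[(a, w)] += 1
            sum_y[(a, w)] += y
    return n_w, n_cell, sum_y

def unadjusted(data):
    _, n_cell, sum_y = cell_counts(data)
    mean = {a: sum(sum_y[(a, w)] for w in (0, 1)) /
               sum(n_cell[(a, w)] for w in (0, 1)) for a in (0, 1)}
    return mean[1] - mean[0]

def g_computation(data):
    n_w, n_cell, sum_y = cell_counts(data)
    # Average the stratum-specific risk differences over the population distribution of W.
    return sum(n_w[w] / len(data) *
               (sum_y[(1, w)] / n_cell[(1, w)] - sum_y[(0, w)] / n_cell[(0, w)])
               for w in n_w)

def ipw(data):
    n_w, n_cell, _ = cell_counts(data)
    total = 0.0
    for w, s, a, y in data:
        if s == 1:
            weight = n_w[w] / n_cell[(a, w)]     # 1 / P_hat(A = a, S = 1 | W = w)
            total += weight * y if a == 1 else -weight * y
    return total / len(data)

data = simulate(200_000, seed=2017)
estimates = {"unadjusted": unadjusted(data),
             "g_computation": g_computation(data),
             "ipw": ipw(data)}
for name, value in estimates.items():
    print(f"{name}: {value:.4f}")
print(f"true PATE: {TRUE_PATE:.4f}")
```

In this sketch the G-computation and *IPW* estimates agree up to floating-point error, because both reduce to the same weighted average of cell means, whereas the unadjusted estimate reflects the over-sampled high-risk stratum.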

The structural causal model representation also emphasizes that a rich toolkit of estimators could be used to correct for biased sampling, which is presented by the authors as a threat to external validity. The nonparametric estimators, implemented by the authors and here, will break down when many covariates or a single continuous covariate influence unit selection (and/or the exposure mechanism in an observational setting). As an alternative, we could immediately implement augmented inverse probability weighting or targeted maximum likelihood estimation.^{35},^{38},^{39} These methods are double robust and can incorporate data-adaptive (machine learning) algorithms to relax parametric modeling assumptions while retaining valid statistical inference.
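The double robustness property can be illustrated with a small sketch of augmented inverse probability weighting for the population effect. This is an assumption-laden toy example, not targeted maximum likelihood estimation and not the author's code: the data-generating process is hypothetical, and the "misspecified" outcome regression and weights simply ignore the covariate *W*.

```python
import random
from collections import defaultdict

# Hypothetical values chosen for illustration only.
RISK = {0: (0.15, 0.12), 1: (0.45, 0.30)}   # W -> (P(Y(0) = 1), P(Y(1) = 1))
P_SELECT = {0: 0.025, 1: 0.20}              # P(S = 1 | W)
TRUE_PATE = 0.8 * (0.12 - 0.15) + 0.2 * (0.30 - 0.45)

def simulate(pop_size, seed):
    """Return (W, S, A, Y) per unit, with A and Y set to 0 when S = 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(pop_size):
        w = int(rng.random() < 0.2)
        s = int(rng.random() < P_SELECT[w])
        a = s * int(rng.random() < 0.5)
        y = s * int(rng.random() < RISK[w][a])
        data.append((w, s, a, y))
    return data

def nuisance(data):
    """Estimate the outcome regression q and the mechanism g, with and without W."""
    n_w, n_cell, sum_y = defaultdict(int), defaultdict(int), defaultdict(int)
    for w, s, a, y in data:
        n_w[w] += 1
        if s == 1:
            n_cell[(a, w)] += 1
            sum_y[(a, w)] += y
    q_good = {k: sum_y[k] / n_cell[k] for k in n_cell}        # E_hat(Y | A, S = 1, W)
    g_good = {k: n_cell[k] / n_w[k[1]] for k in n_cell}       # P_hat(A, S = 1 | W)
    q_bad, g_bad = {}, {}
    for a in (0, 1):
        n_a = n_cell[(a, 0)] + n_cell[(a, 1)]
        for w in (0, 1):
            q_bad[(a, w)] = (sum_y[(a, 0)] + sum_y[(a, 1)]) / n_a   # ignores W
            g_bad[(a, w)] = n_a / len(data)                          # ignores W
    return q_good, g_good, q_bad, g_bad

def aipw(data, q, g):
    total = 0.0
    for w, s, a, y in data:
        total += q[(1, w)] - q[(0, w)]                  # plug-in (outcome regression) part
        if s == 1:
            sign = 1 if a == 1 else -1
            total += sign * (y - q[(a, w)]) / g[(a, w)]  # weighted residual correction
    return total / len(data)

data = simulate(200_000, seed=2017)
q_good, g_good, q_bad, g_bad = nuisance(data)
estimates = {
    "both correct": aipw(data, q_good, g_good),
    "outcome regression wrong": aipw(data, q_bad, g_good),
    "weights wrong": aipw(data, q_good, g_bad),
    "both wrong": aipw(data, q_bad, g_bad),
}
for label, value in estimates.items():
    print(f"{label}: {value:.4f}")
print(f"true PATE: {TRUE_PATE:.4f}")
```

In this toy setting, the estimate is unchanged when either one of the two nuisance estimators ignores *W*, and drifts toward the unadjusted contrast only when both do. With many or continuous covariates, the cell-count estimators would be replaced by regression or machine learning fits, which this sketch does not attempt.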

An important open question, not addressed by Lesko et al.^{12} nor in this commentary, is generalizability and transportability when the intervention (or its specific components) must be adapted to local context. We should be wary of assuming the sample effect is immediately generalizable to the population. We should also be wary of assuming that a one-size-fits-all intervention is best.

## ACKNOWLEDGMENTS

The author thanks Maya Petersen and Victor DeGruttola for their expert advice.

## ABOUT THE AUTHOR

LAURA B. BALZER is a postdoctoral fellow in Biostatistics at the Harvard T.H. Chan School of Public Health. She is a methodologist with substantive interests in global health, community-based participatory research, and social determinants of health. Laura is an expert in causal inference and semiparametric estimation. With Maya Petersen, she was awarded the 2014 Causality in Statistics Education Award from the American Statistical Association.

## REFERENCES

1. Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat. 2004;86:4–29.

2. Imai K, King G, Stuart EA. Misunderstandings between experimentalists and observationalists about causal inference. J R Stat Soc Ser A. 2008;171(part 2):481–502.

3. Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. Am J Epidemiol. 2010;172:107–115.

4. Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. J R Stat Soc Ser A. 2011;174(part 2):369–386.

5. Rothman KJ, Gallacher JEJ, Hatch EE. Why representativeness should be avoided. Int J Epidemiol. 2013;42:1012–1014.

6. Elwood JW. On representativeness [Commentary]. Int J Epidemiol. 2013;42:1014–1015.

7. Schochet P. Estimators for clustered education RCTs using the Neyman model for causal inference. J Educ Behav Stat. 2013;38:219–238.

8. Bareinboim E, Pearl J. A general algorithm for deciding transportability of experimental results. J Causal Inference. 2013;1:107–134.

9. Hartman E, Grieve R, Ramsahai R, Sekhon JS. From sample average treatment effect to population average treatment effect on the treated: combining experimental with observational studies to estimate population treatment effects. J R Stat Soc Ser A. 2015;178:757–778.

10. Pearl J. Generalizing experimental findings. J Causal Inference. 2015;3:259–266.

11. Balzer LB, Petersen ML, van der Laan MJ; SEARCH Collaboration. Targeted estimation and inference for the sample average treatment effect in trials with and without pair-matching. Stat Med. 2016;35:3717–3732.

12. Lesko CR, Buchanan AL, Westreich D, Edwards JK, Hudgens MG, Cole SR. Generalizing study results: a potential outcomes perspective. Epidemiology. 2017;28:553–561.

13. The TEMPRANO ANRS 12136 Study Group. A trial of early antiretrovirals and isoniazid preventive therapy in Africa. N Engl J Med. 2015;373:808–822.

14. The INSIGHT START Study Group. Initiation of antiretroviral therapy in early asymptomatic HIV infection. N Engl J Med. 2015;373:795–807.

15. Cohen M, Chen Y, Mccauley M, et al. Final results of the HPTN 052 randomized controlled trial: antiretroviral therapy prevents HIV transmission. J Acquir Immune Defic Syndr. 2015;18(suppl 4):20479.

17. French National Institute for Health and Medical Research-French National Agency for Research on AIDS and Viral Hepatitis (Inserm-ANRS). Impact of immediate versus South African recommendations guided ART initiation on HIV incidence (TasP). Available at: https://clinicaltrials.gov/ct2/show/NCT01509508. Accessed 12 March 2017.

21. Petersen M, Balzer L, Kwarisiima D, et al. SEARCH test and treat study in Uganda and Kenya exceeds the UNAIDS 90-90-90 cascade target by achieving over 80% population-level viral suppression after 2 years. 21st International AIDS Conference, Durban, South Africa, 2016.

22. Neyman J. Sur les applications de la theorie des probabilites aux experiences agricoles: Essai des principes (In Polish). English translation by D.M. Dabrowska and T.P. Speed (1990). Stat Sci. 1923;5:465–480.

23. Rubin DB. Neyman (1923) and causal inference in experiments and observational studies [Comment]. Stat Sci. 1990;5:472–480.

24. Imai K. Variance identification and efficiency analysis in randomized experiments under the matched-pair design. Stat Med. 2008;27:4857–4873.

25. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.

26. Holland PW. Statistics and causal inference. J Am Stat Assoc. 1986;81:945–960.

27. Pearl J. Causality: Models, Reasoning and Inference. 2nd ed. New York, NY: Cambridge University Press; 2009.

28. Abadie A, Imbens G. Simple and bias-corrected matching estimators for average treatment effects. Technical Report 283, NBER technical working paper, 2002.

29. Balzer LB, Petersen ML, van der Laan MJ; SEARCH Consortium. Adaptive pair-matching in randomized trials with unbiased and efficient effect estimation. Stat Med. 2015;34:999–1011.

30. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663–685.

31. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.

32. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods—application to control of the healthy worker survivor effect. Math Model. 1986;7:1393–1512.

33. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560.

34. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15:615–625.

35. van der Laan MJ, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data. New York, Dordrecht, Heidelberg, London: Springer; 2011.

36. Miettinen OS. Standardization of risk ratios. Am J Epidemiol. 1972;96:383–388.

37. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. Philadelphia, PA: Lippincott Williams & Wilkins; 2008.

38. Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. In: 1999 Proceedings of the American Statistical Association. Alexandria, VA: American Statistical Association; 2000:6–10.

39. van der Laan MJ, Robins JM. Unified Methods for Censored Longitudinal Data and Causality. New York, Berlin, Heidelberg: Springer-Verlag; 2003.

40. Taubman SL, Robins JM, Mittleman MA, Hernán MA. Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. Int J Epidemiol. 2009;38:1599–1611.

41. Bodnar LM, Davidian M, Siega-Riz AM, Tsiatis AA. Marginal structural models for analyzing causal effects of time-dependent treatments: an application in perinatal epidemiology. Am J Epidemiol. 2004;159:926–934.

42. Petersen ML, Schwab J, Gruber S, Blaser N, Schomaker M, van der Laan MJ. Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. J Causal Inference. 2014;2:147–185.

43. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2015.

## APPENDIX A

Let *Z* denote the set of postenrollment characteristics influencing exposure assignment and Δ be an indicator that a unit has its outcome measured (i.e., is not lost to follow-up). For simplicity, we define the postenrollment characteristics, the exposure, the missing data indicator, and the outcome (*Z*, *A*, Δ, *Y*) equal to 0 for units not enrolled (*S* = 0). Then the following structural causal model would describe a study (observational or randomized) wherein units are enrolled as a function of baseline covariates, the exposure is rolled out as a function of baseline and postenrollment characteristics, and missingness on the outcome is not random:

$$
\begin{aligned}
W &= f_W(U_W) \\
S &= f_S(W, U_S) \\
Z &= S \times f_Z(W, U_Z) \\
A &= S \times f_A(W, Z, U_A) \\
\Delta &= S \times f_\Delta(W, Z, A, U_\Delta) \\
Y &= S \times f_Y(W, Z, A, \Delta, U_Y)
\end{aligned}
$$

Here (*U*_{W}, *U*_{S}, *U*_{Z}, *U*_{A}, *U*_{Δ}, *U*_{Y}) denote the corresponding set of background or unmeasured factors with some distribution. Let *Y*(*s* = 1, *a*, δ = 1) denote the counterfactual outcome for a given unit under a hypothetical intervention to ensure it is enrolled (i.e., set *S* = 1), assign the exposure *A* = *a*, and ensure its outcome is measured (i.e., set Δ = 1). Then the PATE is defined as

$$
PATE = E\big[ Y(s = 1, a = 1, \delta = 1) - Y(s = 1, a = 0, \delta = 1) \big].
$$

Under the sequential randomization and positivity assumptions,^{32} the corresponding statistical estimand could be estimated with a variety of methods, including longitudinal parametric G-computation,^{40} longitudinal inverse probability weighting,^{33},^{41} and longitudinal targeted maximum likelihood estimation.^{35},^{42}

## APPENDIX B

In 2002, Abadie and Imbens^{28} proposed the conditional average treatment effect as

$$
CATE = \frac{1}{n} \sum_{i = 1}^{n} E\big[ Y_i(1) - Y_i(0) \mid W_i \big]
$$

where *i* indexes the *n* = 2,000 units selected for the study. The conditional effect is interpreted as the average intervention effect given the covariates of the study units and is equal to −10.4% under this biased sampling scheme:

## APPENDIX C

Simulation studies were conducted in R-3.3.2.^{43} Full R code and the resulting dataset are available at https://github.com/LauraBalzer/On-Generalizability.