Hoggatt, Katherine J.; Greenland, Sander
In recent decades, there has been an explosion of methods for causal analysis. In this issue of Epidemiology, Gatto et al1 provide a valuable schema for coping with the numerous distinctions that have arisen. Although the schema is already quite comprehensive, we will take it as a starting point for further discussion and development.
Among the crucial dimensions highlighted by Gatto and colleagues1 is the target population in which effects are defined. As the authors point out, this dimension is often left vague or implicit, and yet at least some (and usually all) effect measures depend on the population or person. In theory, both the effect measure to be estimated (the causal estimand or target parameter) and the population to which it applies should be determined from stated research hypotheses or policy goals. In practice, however, analysis methods and selection restrictions may result in estimates of effects for a different population, necessitating adjustment to yield estimates for the population of interest or target population for intervention.2–4 Furthermore, identifying effect variation across subpopulations or individuals (who can be viewed as subpopulations of size 1) is critical to efforts to tailor clinical practice to provide “patient-centered” care and to inform personal treatment choices.
In this commentary, we explore further classification dimensions arising from the population aspect of epidemiologic measures: (1) population effects versus subgroup effects versus average individual effects and (2) collapsible versus noncollapsible effects. We also discuss method-induced subsetting and generalization problems, the latter being crucial to the translation of epidemiologic science into public policy and healthcare recommendations. We end by pointing out the importance of considering potential interference between units. As in the article by Gatto et al,1 our discussion will concern causal parameters (actual effects) in populations and subpopulations rather than estimates of effects.
CONCEPTS AND NOTATION ASSUMING NO INTERFERENCE
Suppose Y is an observed outcome of interest, which may be an indicator (eg, 1 = dead, 0 = alive at end of study), a rate, or a quantity (eg, blood pressure); X is a treatment or exposure variable whose effects on Y are under study; and Yx is the potential outcome when X = x is received, possibly counter-to-fact. For the moment, suppose that Y and X are individual variables, so that in a population of N individuals indexed by i = 1, …, N we have a Yi and Xi for each individual i. Y and X then denote the variables for an unspecified individual.
Gatto et al1 use a now-standard model for the effect of a simple binary treatment X (eg, X = 1 for tetanus immunization, 0 for none) in which Y is supplemented by two potential outcomes, which we label Y1 = outcome if X = 1 is received and Y0 = outcome if X = 0 is received. Y is then the actual outcome: Y = Y1 if X = 1 and Y = Y0 if X = 0.5–7 The average absolute treatment effect in a population p is defined as
E[Y1 − Y0],
where the expectation E[·] is a shorthand for averaging over individuals in p:
E[Y1 − Y0] = (1/N) Σi (Y1i − Y0i), with the sum taken over i = 1, …, N.
This average individual effect is both the average change in Y, E[Y1 − Y0], and the change in average Y, E[Y1] − E[Y0], produced by treating everyone (all X = 1) versus treating no one (all X = 0).8 This has been labeled the average treatment effect,5 the marginal treatment effect,9 and the population-average treatment effect.10,11 We emphasize that (as in the article by Gatto et al1) the above notation and terms assume that neither the treatment nor the outcome of any given person affects the treatment or outcome of any other person, an assumption we will refer to as “no interference” among individuals, discussed further below.
When Y1 and Y0 are binary outcome indicators, the average E[Y1] equals the proportion Pr(Y1 = 1) with Y1 = 1, the average risk (incidence proportion) if everyone receives X = 1; similarly E[Y0] = Pr(Y0 = 1) is the average risk if everyone receives X = 0. Thus, E[Y1 − Y0] = Pr(Y1 = 1) − Pr(Y0 = 1) is the average-risk difference. If Y1 and Y0 are quantities, E[Y1 − Y0] = E[Y1] – E[Y0] is simply the mean difference in outcome when X = 1 versus X = 0. For example, if Y is time to death, then E[Y1 − Y0] is the expected years of life lost (if negative) or saved (if positive) from having X = 1 versus X = 0. Given a set of covariates sufficient for confounding control and a sufficiently large and unbiased sample, these absolute population effects can be estimated by standardization to the total population covariate distribution (G-computation) and by inverse probability of treatment weighting2,12,13; when the expected outcome is always positive, parallel formulas yield relative effect measures such as E[Y1]/E[Y0]. It has been argued that other measures of effect, such as years of life lost, are important in studying disease etiology, as well as for policy and legal purposes.14–17 We thus recommend that dimension 2 proposed by Gatto and colleagues1 be expanded to include measures beyond those comparing risks or rates. This expansion is easily accommodated using the standard framework just described, by letting Y represent any outcome measure. We now turn to more intricate issues.
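As a rough illustration of the standardization and weighting estimators just mentioned, the following sketch uses a hypothetical population with a single binary covariate Z. The stratum sizes, the risks, and every name in the code are assumptions made for illustration; Z is assumed sufficient for confounding control, and the risks are treated as expected values with no sampling error.

```python
# Hypothetical population with one binary covariate Z (assumed sufficient
# for confounding control). Risks are expected values, not sample estimates.
strata = [
    # n1 = number treated, n0 = number untreated,
    # risk1 / risk0 = risk of Y = 1 if treated / if untreated
    {"n1": 200, "n0": 600, "risk1": 0.10, "risk0": 0.05},  # Z = 0
    {"n1": 150, "n0": 50, "risk1": 0.40, "risk0": 0.20},   # Z = 1
]
N = sum(s["n1"] + s["n0"] for s in strata)

# Standardization (G-computation): average the stratum-specific risks over
# the covariate distribution of the total population.
ey1_std = sum((s["n1"] + s["n0"]) / N * s["risk1"] for s in strata)
ey0_std = sum((s["n1"] + s["n0"]) / N * s["risk0"] for s in strata)


def iptw_risk(arm):
    """Inverse-probability-of-treatment weighted risk in one arm ("1" or "0").

    Each person in the arm gets weight 1 / Pr(receiving that arm | Z).
    """
    weighted_cases = 0.0
    weighted_people = 0.0
    for s in strata:
        n_z = s["n1"] + s["n0"]
        n_arm = s["n" + arm]
        weight = n_z / n_arm  # inverse of the stratum-specific probability
        weighted_cases += n_arm * weight * s["risk" + arm]
        weighted_people += n_arm * weight
    return weighted_cases / weighted_people


ey1_ipw = iptw_risk("1")
ey0_ipw = iptw_risk("0")

# Crude (unadjusted) risks, for contrast.
crude1 = sum(s["n1"] * s["risk1"] for s in strata) / sum(s["n1"] for s in strata)
crude0 = sum(s["n0"] * s["risk0"] for s in strata) / sum(s["n0"] for s in strata)

rd_std = ey1_std - ey0_std   # absolute effect, E[Y1] - E[Y0]
rr_std = ey1_std / ey0_std   # relative effect, E[Y1] / E[Y0]
rd_ipw = ey1_ipw - ey0_ipw
rd_crude = crude1 - crude0

print(f"standardized RD = {rd_std:.3f}, RR = {rr_std:.2f}")
print(f"IPTW RD = {rd_ipw:.3f}")
print(f"crude RD = {rd_crude:.3f}")
```

In this saturated example, the two approaches coincide (a risk difference of 0.08 and a risk ratio of 2), whereas the crude risk difference is about 0.17 because treatment is more common in the high-risk stratum.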
INDIVIDUAL VERSUS POPULATION MEASURES
One dimension we would add to a classification scheme is that of population measures versus individual measures. This dimension is distinct from item (6) in Gatto et al,1 insofar as certain population measures have no individual analog and other measures may exhibit discrepancies between averages over individuals and other aggregate (population) measures.
Suppose that the outcome of interest is annual per-capita healthcare expenditure. While at first glance this variable might seem individualized by the “per capita,” it is not derived by averaging individual expenditures. Instead, to adjust for different sizes of compared populations, total population expenditures are divided by N; nonetheless, the result is an aggregate macroeconomic (system-wide) measure of complex capital outlays not only by individual persons but also by hospitals, insurance carriers, health plans, and governments (which are funded by fees and taxes on corporations, sales, and property, as well as on individual persons). We could average direct individual payments for care and insurance, but that would not capture the same concept.
Now suppose there is an unambiguous individual-level outcome variable Y. For some measures of this outcome, the average value may not equal the aggregate population or subgroup measure estimated by our statistical procedures. For example, if (as in most potential-outcome expositions) Y is deterministic, the odds of Y = 1 versus Y = 0 for an individual i can be only 0/1 = 0 if Yi = 0 or 1/0 = ∞ if Yi = 1. Hence, the average individual odds will be zero if no one has Y = 1 and infinite otherwise. If, as is usually the case, the individual-level outcome Y varies across persons in the population, the population (marginal) outcome odds E(Y)/E(1 − Y) = Pr(Y = 1)/Pr(Y = 0) will be a positive finite number and hence not equal to the average individual odds. To eliminate zeros and infinities, consider next a stochastic (probabilistic) individual model, in which each individual i is like a coin about to be tossed that has probability Pi of Yi = 1 and 1 − Pi for Yi = 0. Such Bernoulli “random-effects” models arise when allowing for unmeasured individual risk factors so complex (eg, the entire genome and its expression) that no two individuals are alike. Individual odds are then ωi = Pi/(1 − Pi). If Y is uncertain (0 < Pi < 1) for everyone, these ωi and hence their average E(ω) will be finite and positive, but if the risks Pi vary E(ω) will exceed the population odds E(Y)/E(1 − Y) = E(P)/E(1 − P), although their difference will be small if the odds ωi are very small for everyone.8
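The gap between the average individual odds and the population odds can be checked numerically. The sketch below is only an illustration under the stochastic individual model described above; the risk values and function names are hypothetical.

```python
def odds(p):
    """Odds corresponding to a risk p, with 0 < p < 1."""
    return p / (1.0 - p)


def compare(risks):
    """Return (average individual odds, population odds) for individual risks Pi."""
    average_individual_odds = sum(odds(p) for p in risks) / len(risks)
    mean_risk = sum(risks) / len(risks)   # E(P)
    population_odds = odds(mean_risk)     # E(P) / E(1 - P)
    return average_individual_odds, population_odds


# Hypothetical risks that vary widely across three equally common types.
common_avg, common_pop = compare([0.1, 0.5, 0.9])

# Hypothetical risks that are very small for everyone.
rare_avg, rare_pop = compare([0.001, 0.002, 0.003])

print(f"common outcome: average odds = {common_avg:.3f}, population odds = {common_pop:.3f}")
print(f"rare outcome:   average odds = {rare_avg:.6f}, population odds = {rare_pop:.6f}")
```

With widely varying risks, the average individual odds (about 3.37) far exceeds the population odds of 1; with uniformly small risks, the two nearly agree.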
These inequalities extend to causal effect measures. Suppose each person has a well-defined risk P1i when given X = 1 and a smaller risk P0i when given X = 0, with corresponding odds ω1i and ω0i. The average causal odds ratio will then be further from the null than the population causal odds ratio:
that is, the average effect of X on the individual odds exceeds the effect of X on the population odds, illustrating how an odds ratio can be noncollapsible even when it is unconfounded.18,19 The same noncollapsibility problem arises when comparing a total population odds ratio to average subgroup odds ratios or when using person-time rate ratios or differences instead of odds to measure effects.19,20 It does not arise, however, with risk differences or risk ratios.18,19 We are thus led to another dimension for classifying measures of effect: whether the effect measure itself is collapsible (as with risk ratios and risk differences) or noncollapsible (as with rate ratios and odds ratios). Note that here we are speaking of the effect measures themselves, not estimates or substituted associations. By definition, these measures cannot be biased, and thus, noncollapsibility is not confounding but is rather a change in an actual effect measure as stratification increases.19 This phenomenon becomes important when the outcome under study is common in at least one comparison group so that rate and odds ratios no longer approximate risk ratios. When this happens, we prefer to use collapsible measures to ease causal interpretation.19
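A small numerical sketch may help fix ideas about noncollapsibility. It assumes, purely for illustration, that every individual has the same causal odds ratio of 4 while baseline risks differ; because potential risks are compared directly, there is no confounding by construction. All values and names are hypothetical.

```python
def odds(p):
    """Odds corresponding to a risk p."""
    return p / (1.0 - p)


def risk_from_odds(w):
    """Risk corresponding to odds w."""
    return w / (1.0 + w)


INDIVIDUAL_OR = 4.0            # assumed identical for every individual
baseline_risks = [0.2, 0.8]    # hypothetical P0i for two equally common types
treated_risks = [risk_from_odds(INDIVIDUAL_OR * odds(p0)) for p0 in baseline_risks]

n = len(baseline_risks)
mean_p0 = sum(baseline_risks) / n   # population risk if no one is treated
mean_p1 = sum(treated_risks) / n    # population risk if everyone is treated

# Odds ratio: the population value differs from the individual value.
population_or = odds(mean_p1) / odds(mean_p0)
average_individual_or = sum(
    odds(p1) / odds(p0) for p1, p0 in zip(treated_risks, baseline_risks)
) / n

# Risk difference: the population value equals the average individual value.
population_rd = mean_p1 - mean_p0
average_individual_rd = sum(
    p1 - p0 for p1, p0 in zip(treated_risks, baseline_risks)
) / n

print(f"individual OR = {average_individual_or:.2f}, population OR = {population_or:.2f}")
print(f"average individual RD = {average_individual_rd:.4f}, population RD = {population_rd:.4f}")
```

Here, the population odds ratio is about 2.58 although every individual odds ratio is 4, while the risk difference is the same (about 0.22) whether computed for the population or averaged over individuals.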
The individual versus population dimension can be subsumed under the more common statistical differentiation between conditional and marginal effects,9,21 by allowing for conditioning so fine that individuals are distinct. At that fine level of conditioning, however, statistical methods must impose parametric constraints to identify the conditional effects of interest. Typically, these constraints imply that individuals with identical covariate values will have identical risks when given the same treatment (conditional exchangeability). Although such models are essential for statistical analysis, molecular epidemiology demonstrates that these models may omit many important risk factors and effect modifiers. Thus, conceptually, we prefer formulations (as given above) that allow individual risks to vary within levels of measured covariates.
To describe issues concerning subgroups of the initial population (Gatto et al1 characteristic 6), suppose S indicates selection from the population into a subgroup, with S = 1 for those in the selected subgroup and S = 0 otherwise. Then, the average absolute treatment effect in the selected subpopulation is E[Y1 − Y0|S = 1], where E[·|S = 1] is the average over individuals in the selected subpopulation. When S = X, the subpopulation is the treated or exposed (those with S = X = 1); the effect has been labeled the “effect of treatment on the treated”22 and the “average effect of treatment on the treated”.4,5 This measure is estimated by classical exposed population (“indirect” or standardized morbidity ratio-weighted) standardization and by odds-of-treatment–weighted analyses.2,12,23
The composition of the selected subpopulation relative to the intended target population can be crucial in determining the relevance of the resulting estimate of E[Y1 − Y0|S = 1]. In some studies (eg, of occupational hazards), the selected subpopulation may itself be the target, and thus, the effect in this subpopulation is the target parameter. If the selected subpopulation can be expected to have the same distribution as the target for relevant factors (as for a simple random sample from the target), then we may also expect the subpopulation effect to equal the target effect, that is, E[Y1 − Y0|S = 1] = E[Y1 − Y0], where the right-hand side is the average over the target population. Nonetheless, we would often expect this equality to fail under typical selection strategies and analysis methods. As an example, if the subpopulation is selected from the target with balanced matching of untreated subjects to treated subjects or by propensity score matching, the subpopulation will have a distribution of matching factors that follows that of the treated subpopulation (X = 1), not the original population. Consequently, if these factors include important effect modifiers, the average effect in the matched subpopulation will usually not equal the average effect in the original population. In particular, if there are no uncontrolled modifiers, it will equal the effect of treatment on the treated.3
In a similar fashion, odds-of-treatment weighting estimates the effect of treatment on the treated by reweighting subjects in the unmatched population to have a weighted distribution equal to the unweighted distribution of the exposed.2,23 Method-induced selection is thus a benefit of the matching (and reweighting a benefit of the odds-of-treatment weighting) if the effect of treatment on the treated is indeed the target,24 but it is a potential source of bias if the average effect in the total unmatched population is the target. In contrast, adjustment using outcome regression or inverse-probability-of-treatment weighting or both on the original unmatched population will tend to estimate the effect of treatment on that original population.3,24,25 This can be seen as an advantage if the total unmatched population is the target, but a potential source of bias otherwise.
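The contrast between the two weighting schemes can be sketched with a hypothetical population in which a binary covariate Z is both a predictor of treatment and an effect modifier. The numbers and names below are assumptions for illustration only, with Z taken as sufficient for confounding control and risks treated as expected values.

```python
# Hypothetical strata of a binary covariate Z that modifies the effect.
strata = [
    # n1 = treated, n0 = untreated, risk1 / risk0 = risk if treated / untreated
    {"n1": 200, "n0": 600, "risk1": 0.10, "risk0": 0.05},  # Z = 0, RD = 0.05
    {"n1": 150, "n0": 50, "risk1": 0.40, "risk0": 0.20},   # Z = 1, RD = 0.20
]


def weighted_risk(arm, weight_fn):
    """Weighted risk in one arm ("1" or "0"); weight_fn maps (stratum, arm) to a weight."""
    cases = sum(s["n" + arm] * weight_fn(s, arm) * s["risk" + arm] for s in strata)
    people = sum(s["n" + arm] * weight_fn(s, arm) for s in strata)
    return cases / people


def iptw(s, arm):
    """Inverse probability of the treatment actually received, given Z."""
    return (s["n1"] + s["n0"]) / s["n" + arm]


def odds_of_treatment(s, arm):
    """Weight 1 for the treated; odds of treatment given Z for the untreated."""
    return 1.0 if arm == "1" else s["n1"] / s["n0"]


# Effect in the total population (weighted groups resemble everyone).
ate = weighted_risk("1", iptw) - weighted_risk("0", iptw)

# Effect of treatment on the treated (untreated reweighted to resemble the treated).
att = weighted_risk("1", odds_of_treatment) - weighted_risk("0", odds_of_treatment)

print(f"average effect in total population = {ate:.3f}")
print(f"effect of treatment on the treated = {att:.3f}")
```

Because treatment is concentrated in the stratum with the larger effect, the effect on the treated (about 0.114) exceeds the total-population effect (0.08); neither is wrong, but each answers a question about a different population.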
Some methods may estimate parameters with no interpretation as an average causal effect in any specified population of interest. Consider propensity score matching in which some treated persons have no untreated match. This is a type of “positivity failure,” with a consequence that the targeted effect is E[Y1 − Y0|S = 1] where S = 1 means “X = 1 and match available.” This quantity may not equal the intended target E[Y1 − Y0|X = 1], although the discrepancy may be slight if matching failure is rare.4 A related problem arises in inverse-probability-of-treatment weighting estimation when extreme weights are trimmed or truncated.26 The population corresponding to the weights then does not correspond to the original population, and the estimated effect need not equal the average treatment effect in that target or any natural population. Alternative weight estimation methods do not appear to require weight truncation27,28 and thus can be recommended on interpretational and statistical grounds.
Simple instrumental variable methods estimate the average treatment effect among compliers E[Y1 − Y0|C = 1], where C is the compliance indicator that takes a value of 1 for persons who follow the treatment to which they are assigned, regardless of what they are assigned.29 E[Y1 − Y0|C = 1] has been called the “local average treatment effect”29 or the “complier average causal effect.”30,31 Whether the complier average causal effect is in itself an effect of substantive interest is debatable, especially if such compliers cannot be identified in advance. Whether the complier average causal effect equals another average effect of interest depends heavily on whether the predictors of compliance are also modifiers, as would be expected if side effects predict both compliance and treatment effects.30–32 Similar if not stronger reservations apply to principal stratum effects.30,31,33,34
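To illustrate how the complier effect can differ from the total-population effect, the sketch below uses hypothetical compliance types with known average potential outcomes. It assumes randomized assignment, no defiers, and no effect of assignment on outcome except through treatment received; these assumptions, the numbers, and the names are ours for illustration. The ratio computed is the simple instrumental variable (Wald-type) estimand.

```python
# Hypothetical compliance types: population share and mean potential outcomes.
types = {
    "complier":     {"share": 0.5, "y1": 0.30, "y0": 0.10},
    "always_taker": {"share": 0.2, "y1": 0.50, "y0": 0.40},
    "never_taker":  {"share": 0.3, "y1": 0.15, "y0": 0.10},
}


def treated_under(assignment, kind):
    """Treatment actually received (1 or 0) by each type under an assignment."""
    if kind == "always_taker":
        return 1
    if kind == "never_taker":
        return 0
    return assignment  # compliers follow their assignment


def mean_outcome(assignment):
    """Expected outcome in the arm assigned to `assignment`."""
    return sum(
        t["share"] * (t["y1"] if treated_under(assignment, k) else t["y0"])
        for k, t in types.items()
    )


def uptake(assignment):
    """Proportion actually treated in the arm assigned to `assignment`."""
    return sum(t["share"] * treated_under(assignment, k) for k, t in types.items())


itt_effect = mean_outcome(1) - mean_outcome(0)        # effect of assignment
iv_estimand = itt_effect / (uptake(1) - uptake(0))    # simple IV ratio

complier_effect = types["complier"]["y1"] - types["complier"]["y0"]
population_effect = sum(t["share"] * (t["y1"] - t["y0"]) for t in types.values())

print(f"IV estimand = {iv_estimand:.3f}, complier effect = {complier_effect:.3f}")
print(f"total-population effect = {population_effect:.3f}")
```

Under these assumptions, the ratio recovers the complier effect (0.20), which differs from the total-population effect (0.135) because compliance type is also an effect modifier in this example.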
GENERALIZATION (PROJECTION) OF EFFECTS
If we wish to project estimates from the subgroup selected for analysis to a different target population, we will need the modifier-specific effects and the distribution of modifiers in the new target. Given large enough numbers, this may be a relatively simple requirement when projecting to the original parent population of our analyzed subgroup, because given the data it can be carried out with regression standardization or equivalent methods.12 However, owing to lack of data, projection can be daunting and hazardous when the goal is to generalize or transport treatment effect estimates to external targets.
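The projection step can be sketched as a reweighting of modifier-specific effects by the modifier distribution of the target. The effects and distributions below are hypothetical, and the sketch assumes that the listed modifier is the only one that matters, which is exactly the kind of information that may be unavailable for external targets.

```python
# Hypothetical modifier-specific risk differences from the analyzed subgroup.
effect_by_modifier = {"low_risk": 0.05, "high_risk": 0.20}

# Hypothetical distributions of the modifier in two populations.
parent_population = {"low_risk": 0.8, "high_risk": 0.2}
external_target = {"low_risk": 0.3, "high_risk": 0.7}


def project(effects, modifier_distribution):
    """Average the modifier-specific effects over a target's modifier distribution."""
    total = sum(modifier_distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("modifier distribution must sum to 1")
    return sum(modifier_distribution[m] * effects[m] for m in effects)


effect_in_parent = project(effect_by_modifier, parent_population)
effect_in_external = project(effect_by_modifier, external_target)

print(f"projected effect in parent population = {effect_in_parent:.3f}")
print(f"projected effect in external target   = {effect_in_external:.3f}")
```

The same modifier-specific effects yield an average effect of 0.08 in the parent population but 0.155 in the external target, solely because the modifier is distributed differently.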
Randomized clinical trials (RCTs) offer cautionary examples. Development and approval trials are often conducted on patients selected on characteristics one should expect to be modifiers, such as outcome prognosis, expected responsiveness to treatment, unresponsiveness to standard treatments, and low risk of untoward side effects. Yet, RCTs typically enroll study groups that are too small to allow detection of every important modifier, let alone estimate modification accurately. As a result, it is unsurprising when effects observed in clinical practice and in postmarketing surveillance fail to match those seen in earlier trials.
We believe that when the goal of our research is to inform public-health and medical practice, external targets are the main targets of interest, for the only interventions or practices that can be informed by our research results are in the future and thus are unobserved. Furthermore, as some RCT examples show, it can be dangerous to view available observations as random samples from a population that includes our future targets. While there are now various methodological tools for projecting from observed populations to external targets,13,35–38 as with all tools these methods require certain basic information that is often absent or must be imputed from past data. We thus would extend any schema to include a category for measures applying to populations yet unseen and hence unsampled. We leave open the question of how best to incorporate this category, although perhaps expanding item (6) of Gatto et al1 to more categories would suffice.
MEASURES UNDER INTERFERENCE
As do Gatto et al1 and the vast majority of other writers, we have so far assumed no interference. Nonetheless, violations abound. Infectious disease and social research provide many important settings in which the treatment, outcome, and effect of treatment on outcome depend on the treatment, outcome, and effects for other population members.39–42 For example, sufficiently high population prevalence of vaccination can lead to steep risk reduction for those remaining unvaccinated (herd immunity), limiting the effect of vaccination on those so protected, and possibly affecting their choice to be vaccinated as well as their outcome if they choose not to. Further details on the vaccination distribution will affect the risk of unvaccinated persons, including whether family members or neighbors are vaccinated.
As a consequence, it can be ambiguous to discuss effects of an individual vaccine indicator X on an individual in the population. Furthermore, given that some people may avoid or miss vaccination and yet large numbers will be vaccinated, it may be of no policy use to consider effects of everyone versus no one vaccinated. Thus, any useful outcome and effect measure (whether for the population, a subgroup, or an individual person) will instead be a function of the distribution of vaccination across the entire population. Note that “distribution” here means much more than a generic probability distribution applying to everyone in the population; it also includes details of exactly who gets the vaccine and who does not (including their age, school enrollment, family members), as well as considerations of vaccine effectiveness. Potential outcomes are then as numerous as such distributions, and the expression of population effects as averages across individual effects appears useless.
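A toy model may make the dependence on the vaccination distribution concrete. In the sketch below, a person's risk depends on both their own vaccination and the coverage among the other members of a three-person group. The risk function and its parameters are hypothetical and are not taken from any cited model.

```python
from itertools import product

# Hypothetical parameters of a toy risk model.
BASE_RISK = 0.4          # risk when no one in the group is vaccinated
VACCINE_EFFICACY = 0.8   # proportional risk reduction from one's own vaccination
HERD_PROTECTION = 0.5    # proportional risk reduction at full coverage of others


def risk(person, allocation):
    """Risk for `person` under a full allocation of vaccination (tuple of 0/1)."""
    others = [x for j, x in enumerate(allocation) if j != person]
    coverage_among_others = sum(others) / len(others)
    direct = 1.0 - VACCINE_EFFICACY * allocation[person]
    indirect = 1.0 - HERD_PROTECTION * coverage_among_others
    return BASE_RISK * direct * indirect


GROUP_SIZE = 3
# Every possible vaccination distribution; each defines its own potential outcomes.
allocations = list(product([0, 1], repeat=GROUP_SIZE))


def effect_on_person0(others):
    """Change in person 0's risk from being vaccinated, holding others fixed."""
    return risk(0, (1,) + others) - risk(0, (0,) + others)


effect_if_others_unvaccinated = effect_on_person0((0, 0))
effect_if_others_vaccinated = effect_on_person0((1, 1))

print(f"number of allocations = {len(allocations)}")
print(f"effect on person 0, others unvaccinated = {effect_if_others_unvaccinated:.2f}")
print(f"effect on person 0, others vaccinated   = {effect_if_others_vaccinated:.2f}")
```

Even in this small group there are 8 allocations, and the effect of vaccinating one person is halved (from −0.32 to −0.16) when the others are vaccinated, so no single individual effect of X is defined without specifying the rest of the distribution.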
It is beyond the scope of this commentary to address the subtleties and complexities that follow from these observations. We do suggest, however, that it is important to extend effect definition and measurement to situations with interference, and that efforts to do so39–42 will raise classification dimensions beyond those discussed above and by Gatto and colleagues.1
We have not attempted to raise, let alone address, all the issues that could arise from a classification scheme for effects and their measures. As with terminology, any classification scheme will have arbitrary elements, but a good scheme can be invaluable for teaching and communication. We therefore thank Gatto and colleagues1 for providing a well-developed foundation. We hope this discussion and expansion might encourage further contributions to this topic.
ABOUT THE AUTHORS
KATHERINE J. HOGGATT is a Research Health Scientist at the VA Greater Los Angeles and an Adjunct Assistant Professor of Epidemiology at the UCLA Fielding School of Public Health, Los Angeles, California. SANDER GREENLAND is Professor Emeritus of Epidemiology at the UCLA Fielding School of Public Health and Professor Emeritus of Statistics at the UCLA College of Letters and Science, Los Angeles, California.
1. Gatto NM, Campbell UB, Schwartz S. An organizational schema for epidemiologic causal effects. Epidemiology. 2014; 25:88–97
2. Sato T, Matsuyama Y. Marginal structural models as a tool for standardization. Epidemiology. 2003; 14:680–686
3. Kurth T, Walker AM, Glynn RJ, et al. Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. Am J Epidemiol. 2006; 163:262–270
4. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci. 2010; 25:1–21
5. Morgan SL, Winship C. Counterfactuals and Causal Inference: Methods and Principles for Social Research (Analytical Methods for Social Research). 2007; New York Cambridge University Press
6. Pearl J. Causality: Models, Reasoning, and Inference. 2009; 2nd ed New York Cambridge University Press
7. Pearl J. On the consistency rule in causal inference: axiom, definition, assumption, or theorem? Epidemiology. 2010; 21:872–875
8. Greenland S. Interpretation and choice of effect measures in epidemiologic analyses. Am J Epidemiol. 1987; 125:761–768
9. Martens EP, Pestman WR, de Boer A, Belitser SV, Klungel OH. Systematic differences in treatment effect estimates between propensity score methods and logistic regression. Int J Epidemiol. 2008; 37:1142–1147
10. Imbens G. Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat. 2004; 86:4–30
11. Imai K, King G, Stuart EA. Misunderstandings between experimentalists and observationalists about causal inference. J Roy Stat Soc. 2008; 171:481–502
12. Greenland S. Introduction to regression analysis. In: Rothman KJ, Greenland S, Lash TL, eds. Modern Epidemiology. 2008; 3rd ed Philadelphia Wolters Kluwer Health/Lippincott Williams & Wilkins 418–455
13. Vansteelandt S, Keiding N. Invited commentary: G-computation–lost in translation? Am J Epidemiol. 2011; 173:739–742
14. Greenland S, Frerichs RR. On measures and models for the effectiveness of vaccines and vaccination programmes. Int J Epidemiol. 1988; 17:456–463
15. Robins J, Greenland S. Estimability and estimation of expected years of life lost due to a hazardous exposure. Stat Med. 1991; 10:79–93
16. Boshuizen HC, Greenland S. Average age at first occurrence as an alternative occurrence parameter in epidemiology. Int J Epidemiol. 1997; 26:867–872
17. Greenland S, Robins JM. Epidemiology, justice, and the probability of causation. Jurimetrics. 2000; 40:321
18. Samuels ML. Matching and design efficiency in epidemiological studies. Biometrika. 1981; 68:577–588
19. Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Stat Sci. 1999; 14:29–46
20. Greenland S. Absence of confounding does not correspond to collapsibility of the rate ratio or rate difference. Epidemiology. 1996; 7:498–501
21. Austin PC, Grootendorst P, Normand SLT, Anderson GM. Authors’ reply. Stat Med. 2007; 26:3210–3212
22. Shpitser I, Pearl J. Effects of treatment on the treated: Identification and generalization. In: Bilmes J, Ng AY, eds. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 2009; Arlington, VA AUAI Press 514–521
23. Ridgeway G, MacDonald JM. Doubly robust internal benchmarking and false discovery rates for detecting racial bias in police stops. J Am Stat Assoc. 2009; 104:661–668
24. Stürmer T, Rothman KJ, Glynn RJ. Insights into different results from different causal contrasts in the presence of effect-measure modification. Pharmacoepidemiol Drug Saf. 2006; 15:698–709
25. Greenland S, Maldonado G. The interpretation of multiplicative-model parameters as standardized parameters. Stat Med. 1994; 13:989–999
26. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008; 168:656–664
27. Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J Clin Epidemiol. 2010; 63:826–833
28. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One. 2011; 6:e18174
29. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Am Stat Assoc. 1996; 91:444–455
30. Joffe M. Principal stratification and attribution prohibition: good ideas taken too far. Int J Biostat. 2011; 7:Article 35
31. VanderWeele TJ. Principal stratification—uses and limitations. Int J Biostat. 2011; 7:Article 28
32. Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist’s dream? Epidemiology. 2006; 17:360–372
33. Pearl J. Principal stratification—a goal or a tool? Int J Biostat. 2011; 7:20
34. Sjolander A. Reaction to Pearl’s critique of principal stratification. Int J Biostat. 2011; 7:1–5
35. Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. Am J Epidemiol. 2010; 172:107–115
36. Bareinboim E, Pearl J. Meta-transportability of causal effects: a formal approach. Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics. 2013; Scottsdale, AZ
37. Hartman E, Grieve R, Ramsahai R, Sekhon JS. From SATE to PATT: Combining Experimental with Observational Studies to Estimate Population Treatment Effects. Working Paper. 2013; 1–32
38. Pressler TR, Kaizar EE. The use of propensity scores and observational data to estimate randomized controlled trial generalizability bias. Stat Med. 2013; 32:3552–3568
39. Halloran ME, Struchiner CJ. Causal inference in infectious diseases. Epidemiology. 1995; 6:142–151
40. Hudgens MG, Halloran ME. Toward Causal Inference With Interference. J Am Stat Assoc. 2008; 103:832–842
41. Rosenbaum PR. Interference between units in randomized experiments. J Am Statist Assoc. 2007; 102:191–200
42. Tchetgen EJ, VanderWeele TJ. On causal inference in the presence of interference. Stat Methods Med Res. 2012; 21:55–75