# Concerning the Consistency Assumption in Causal Inference

Cole and Frangakis (*Epidemiology.* 2009;20:3–5) introduced notation for the consistency assumption in causal inference. I extend this notation and propose a refinement of the consistency assumption that makes clear that the consistency statement, as ordinarily given, is in fact an assumption and not an axiom or definition. The refinement is also useful in showing that additional assumptions (referred to here as treatment-variation irrelevance assumptions), stronger than those given by Cole and Frangakis, are in fact necessary in articulating the ordinary assumptions of ignorability or exchangeability. The refinement furthermore sheds light on the distinction between intervention and choice in reasoning about causality. A distinction between the range of treatment variations for which potential outcomes can be defined and the range for which treatment comparisons are made is discussed in relation to issues of nonadherence. The use of stochastic counterfactuals can help relax what is effectively being presupposed by the treatment-variation irrelevance assumption and the consistency assumption.

From the Departments of Epidemiology and Biostatistics, Harvard School of Public Health, Boston, MA.

Submitted 22 December 2008; accepted 14 April 2009.

Correspondence: Tyler J. VanderWeele, Departments of Epidemiology and Biostatistics, Harvard School of Public Health, 677 Huntington Ave, Boston, MA 02115. E-mail: tvanderw@hsph.harvard.edu.

Cole and Frangakis^{1} have provided an interesting commentary on the consistency assumption in the causal inference literature. I would like to raise a number of additional points.

## REFINEMENTS IN THE ARTICULATION OF AND NOTATION FOR THE CONSISTENCY ASSUMPTION

Cole and Frangakis define *Yj(x, k)* as the potential outcome for individual *j* if exposure *X* is set to value *x* by means *k*. They let *Xj* and *Yjobs* denote, respectively, the observed values of *X* and *Y* for individual *j*. The consistency assumption is then reformulated as the assumption that “*Yjobs=Yj(x, k)* if *x=Xj*, no matter the value of *k*.” One issue not addressed in their commentary is that the range of *k* will generally vary with *x*, so we will in general need to index treatment variations by *kx*, which may be different for each treatment option *x*. For example, in comparing surgery (*x=1*) and chemotherapy (*x=0*) as cancer treatments, the range of treatment variations for surgery (eg, which specific surgeon) differs considerably from the range of treatment variations for chemotherapy (eg, dose). Similarly, one might imagine countless variations on the implementation of a job training program for participants, but there are arguably far fewer variations for a control arm of “not participating.” Some aspects of treatment variation (eg, specific hospital, time of treatment initiation) may be common to the treatments being compared, but generally not all will be. In addition to specifying the components of *kx*, one might also wish to specify some set of values *K* _{x} and require *Yjobs=Yj(x, kx)* when *x=Xj* only for *kx*∈*K* _{x}, rather than necessarily requiring this consistency condition for all *kx*.

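With this additional notation, I would like to propose a further refinement. Consider the following 2 conditions. First, whether for each *x*,

$$Y_j(x, k_x) = Y_j(x, k_x') \quad \text{for all } j \text{ and all } k_x, k_x' \in K_x. \tag{C1}$$

If condition *(C1)* holds, then, for any *kx*∈*K* _{x}, we could define *Yj(x)* as *Yj(x):=Yj(x, kx)*. Second, we might be interested in whether for each *j*,

$$Y_j^{obs} = Y_j(x, k_x) \quad \text{when } x = X_j, \text{ for some } k_x \in K_x. \tag{C2}$$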

The condition *(C1*) requires for each *x* that the potential outcomes *Yj(x, kx*) take the same value irrespective of what means *kx* is used to set *X* to *x* so long as *k* _{x}∈*K* _{x}. Condition *(C2*) then requires that for some *kx* the potential outcome *Yj(x, kx)* is equal to the observed outcome *Yjobs* when *x=Xj*. If *(C1*) and *(C2*) hold then for all *kx*∈*K* _{x}, the potential outcome *Yj(x, kx*) is equal to the observed outcome *Yjobs* when *x=Xj*.
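As a minimal sketch of how the two conditions differ, the following Python fragment checks *(C1)* and *(C2)* against a toy table of potential outcomes for a single individual. The outcome values, the labels for the treatment variations, and the helper names `holds_c1` and `holds_c2` are all hypothetical and serve only to illustrate the definitions.

```python
from typing import Dict, Set, Tuple

# Hypothetical potential outcomes for one individual j: (x, k_x) -> Y_j(x, k_x).
PotentialOutcomes = Dict[Tuple[int, str], int]


def holds_c1(po: PotentialOutcomes, x: int, K_x: Set[str]) -> bool:
    """(C1) for treatment x: Y_j(x, k_x) takes one value across all k_x in K_x."""
    return len({po[(x, k)] for k in K_x}) <= 1


def holds_c2(po: PotentialOutcomes, x_obs: int, y_obs: int, K_x: Set[str]) -> bool:
    """(C2): the observed outcome equals Y_j(X_j, k_x) for at least one k_x in K_x."""
    return any(po[(x_obs, k)] == y_obs for k in K_x)


# Illustrative values only: x=1 is surgery (variation: surgeon),
# x=0 is chemotherapy (variation: dose).
po_j: PotentialOutcomes = {
    (1, "surgeon_a"): 1,
    (1, "surgeon_b"): 1,
    (0, "low_dose"): 0,
    (0, "high_dose"): 1,
}
K = {1: {"surgeon_a", "surgeon_b"}, 0: {"low_dose", "high_dose"}}

c1_surgery = holds_c1(po_j, 1, K[1])       # the surgeon does not matter here
c1_chemo = holds_c1(po_j, 0, K[0])         # the dose matters here
c1_chemo_narrow = holds_c1(po_j, 0, {"low_dose"})  # restricting K_0 restores (C1)
c2_observed = holds_c2(po_j, x_obs=1, y_obs=1, K_x=K[1])
```

In this toy table *(C1)* holds for surgery but fails for chemotherapy unless the set of admissible doses is narrowed, which is the role played by the choice of *K* _{x}.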

Several consequences follow from this refinement of the consistency assumption. First, we can better understand how the consistency statement, as ordinarily articulated, is an assumption, and not an axiom or definition. Ordinarily the consistency assumption is articulated as *Yjobs=Yj(x)* when *x=Xj*. Now if condition *(C1)* is satisfied, we can define *Yj(x):=Yj(x, kx)* for any *kx*∈*K* _{x}. Condition *(C2)* then simply becomes *Yjobs=Yj(x)* when *x=Xj*, the ordinary consistency assumption. I would argue that when researchers in the past made the consistency assumption, 2 things were being assumed. First, it was assumed that the notation *Yj(x)* was justified, ie, that some range of potential interventions under consideration to set *X* to *x* would all yield the same potential outcomes. This is what is articulated above as assumption *(C1)*; the notation itself presupposed *(C1)*. Second, the consistency assumption, stated as *Yjobs=Yj(x)* when *x=Xj*, was assumed and required *(C2)*.

The very use of the notation *Yj(x)* conceals assumption *(C1)*. A similar issue arises in discussions of interference. The notation *Yj(x)* is justified only if the outcome for one individual *j* does not depend on the treatments received by other individuals. This is sometimes referred to as the “no-interference” assumption. Several authors^{2–6} have relaxed this assumption by introducing notation such as *Yj(xj,* **x** _{−j}) to denote the potential outcome for individual *j* if individual *j* received treatment *xj* and if the individuals other than *j* had received treatments corresponding to the vector **x** _{−j}. The no-interference assumption can then be articulated as *Yj(xj,* **x** _{−j})=*Yj(xj,* **x’** _{−j}) for all **x** _{−j}, **x’** _{−j}. Most work in causal inference does not formally introduce this notation to make the no-interference assumption; the assumption is simply implicit in the ordinary use of the notation *Yj(x)*. Notation can conceal substantive assumptions. In the case of the consistency assumption, I would argue that the notation *Yj(x)* presupposes assumption *(C1)*; the consistency assumption itself, stated as *Yjobs=Yj(x)* when *x=Xj*, is then the assumption *(C2)*.
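The no-interference condition can be illustrated in the same spirit. The sketch below is only a toy check under assumed outcome functions: `y_isolated` and `y_spillover` are hypothetical examples, not taken from the cited work, and `no_interference` simply tests whether individual *j*'s outcome is invariant to the treatment vector of the others.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

# Y_j as a function of own treatment x_j and the others' treatment vector x_{-j}.
Outcome = Callable[[int, Tuple[int, ...]], int]


def no_interference(y_j: Outcome,
                    own_levels: Iterable[int],
                    others_vectors: Iterable[Tuple[int, ...]]) -> bool:
    """True if Y_j(x_j, x_{-j}) = Y_j(x_j, x'_{-j}) for all x_{-j}, x'_{-j}."""
    vectors = list(others_vectors)
    return all(
        len({y_j(x_own, x_rest) for x_rest in vectors}) == 1
        for x_own in own_levels
    )


def y_isolated(x_own: int, x_rest: Tuple[int, ...]) -> int:
    """Hypothetical outcome that depends only on the individual's own treatment."""
    return x_own


def y_spillover(x_own: int, x_rest: Tuple[int, ...]) -> int:
    """Hypothetical outcome that also benefits when both others are treated."""
    return int(x_own == 1 or sum(x_rest) >= 2)


# All treatment vectors for two other individuals.
others = list(product([0, 1], repeat=2))

isolated_ok = no_interference(y_isolated, [0, 1], others)
spillover_ok = no_interference(y_spillover, [0, 1], others)
```

Only for an outcome like `y_isolated` is the shorthand *Yj(x)* justified; for `y_spillover` the shorter notation would hide a dependence on the treatments of others.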

A second consequence of this refinement of the consistency assumption concerns the values of *x* for which *(C1*) is to hold. Cole and Frangakis require that “*Yjobs=Yj(x, k*) if *x=Xj*, no matter the value of *k*.” However, once we follow Cole and Frangakis and introduce the notation *Yj(x, kx*) for the potential outcome for individual *j* if exposure *X* is set to value *x* by means *kx*, the question arises whether *(C1*) holds for *x≠Xj*. The drawing of causal inferences often makes use not only of the consistency assumption but also, as noted by Cole and Frangakis, of the “exchangeability” or “ignorability” assumption. This assumption is often articulated as the independence of the potential outcome *Yj(x*) and actual treatment *Xj*, conditional on some set of confounding variables *Z*. The very statement of this ignorability assumption requires condition *(C1*) but does not require condition *(C2)*. Similarly, assumption *(C1)*, but not *(C2)*, would be required for causal effects of the form *Yj(1)−Yj(0*) to be well-defined. Importantly, to state the ignorability assumption or to define causal effects such as *Yj(1)−Yj(0)*, assumption *(C1*) must hold for values of *x* other than the observed value *Xj*. The consistency assumption as articulated by Cole and Frangakis does not imply that *(C1*) holds for *x≠Xj*; nothing in the formulation of Cole and Frangakis ensures that *Yj(x*) is well-defined if *x≠Xj*.

One further consequence follows from this refinement of the consistency assumption: we can begin to distinguish between intervention and choice. Many analyses of average causal effects seek to understand the impact of intervening to ensure that all individuals in a population received some particular treatment. Consider the use of observational data in the evaluation of a job training program; suppose the outcome is employment after 3 months and that *Yj(1, k1*) denotes employment status with job training under instructor *k1∈K1*. Assumption *(C1*) would then require that the specific instructor does not affect employment status after 3 months. What then is the meaning of *(C2)*? If *Xj=1*, then *(C2*) requires that *Yjobs*, individual *j*'s employment status if the individual chose to participate in the job training program, is equal to *Yj(1)*, the individual's employment status if there had been an intervention to ensure the individual participated in the job training program. The values *Yjobs* and *Yj(1*) need not be equal when *Xj=1*; for example, resentment arising from compulsion might hinder training. In this setting, the assumption *(C2*) requires that the outcome arising from choosing a treatment, the naturally occurring outcome, is in fact equal to the outcome arising from an intervention to ensure treatment. I would argue that in a number of observational settings this is the real meaning of assumption *(C2)*. In some contexts, *(C1*) may hold without *(C2)*; similarly, in other contexts, it is also possible that *(C2*) would hold without *(C1)*. In any case, the consistency assumption, *(C2)*, at least in some contexts, can allow us to distinguish between choice and intervention: the assumption *(C2*) requires that the outcome *Yjobs* arising from choice, the outcome naturally arising, is equal to the outcome *Yj(x*) arising from intervention. 
On matters of choice, the econometrics literature on causality^{7–9} is certainly ahead of the literature in epidemiology or statistics. However, this may well be due to the fact that exposure to particulate matter, for instance, is not chosen in the same way or with the same intentionality as a job training program.

In light of the above considerations, I propose that *(C1*) be referred to as the assumption of treatment-variation irrelevance (implying for all practical purposes “no multiple-versions-of-treatment”) and that if assumption *(C1*) holds, then *(C2*) be referred to as the consistency assumption. Assumption *(C1*) is required not only to state the consistency assumption but also to be able to state the ignorability assumption and to define potential outcomes of the form *Y* *j*(*x*). Assumption *(C2)*, however, relates counterfactual quantities (the potential outcomes) to the data.

Note that if the actual effect of treatment on the treated is of interest, the causal estimand *E[Y|X=1] − E[Y(0)|X=1]* would only require that *Y(0)* be well defined; that is, *(C1)* would only have to hold for *x=0*, so that *Yj(0, k0)=Yj(0, k0’)* for all *k0*, *k0’*∈*K* _{0}. In some settings this may be more plausible because, as noted above, there will often be far fewer possible variations for a control condition.

## THE CONSISTENCY ASSUMPTION AND NONCOMPLIANCE

I would now like to return to the issue of specifying the set *K* _{x}. As noted above, it will often be desirable to specify some set of values *K* _{x} for which the consistency assumption is to hold. Cole and Frangakis consider a setting in which *x* is taken to be “once daily intake of 40 mg of buffered aspirin” and they suggest that *k* might include components to indicate whether the aspirin is taken in the morning, or whether it is taken with or without food. Suppose we are interested in a treatment comparison of “once daily intake of 40 mg of buffered aspirin taken in the morning” and “no aspirin intake” where “morning” is defined, say, as being between 5:30 am and 11:00 am. The consistency assumption and treatment-variation irrelevance assumption for this treatment comparison would then specify *K* _{x} to be those values of *kx* that indicate the aspirin was taken between 5:30 am and 11:00 am and require that *(C1)* and *(C2)* hold for all *kx*∈*K* _{x}, and thus that *Yjobs=Yj(x, kx)* for all *kx*∈*K* _{x}. Alternatively, we might instead be interested in the treatment comparison of “once daily intake of 40 mg of buffered aspirin taken between 7:00 am and 9:00 am” and “no aspirin intake.” The consistency assumption and treatment-variation irrelevance assumption for this treatment comparison would then specify *K* _{x} to be those values of *kx* that indicate the aspirin was taken between 7:00 am and 9:00 am and require that *(C1)* and *(C2)* hold for all *kx*∈*K* _{x}, and thus that *Yjobs=Yj(x, kx)* for all *kx*∈*K* _{x}. In any case, a distinction can be drawn between the range of *kx* for which the potential outcomes *Yj(x, kx)* can be defined and the range of *kx* for which we wish to compare treatments. For a particular treatment comparison, we would require *(C1)* and *(C2)* only for values of *kx*∈*K* _{x} corresponding to the particular treatment comparison of interest. The set *K* _{x} would be specified for each *x* depending on the treatment comparison of interest.

Similarly, we could define a set of individuals *Jc* such that *j∈Jc* indicates that individual *j* in actual fact obtained treatment *Xj* in a manner consistent with the treatment comparison being made; individuals not in *Jc* could be considered as cases of “noncompliance” or “nonadherence.” For example, in the treatment comparisons described above, individuals actually taking aspirin at 4:00 pm could be considered as noncompliant or nonadherent to the treatment “once daily intake of 40 mg of buffered aspirin taken between 5:30 am and 11:00 am.” Only individuals *j* who either do not take aspirin (and are thus in the control group) or who take aspirin between 5:30 am and 11:00 am would be included in the set *Jc*. We might then require that the consistency assumption *(C2)* hold not for all *j* but only for *j∈Jc*. If the extent of treatment variations when treatment is naturally selected is the same as the extent of the treatment variations under interventions, then for individual *j* receiving treatment *Xj* one could potentially define *Kj* as the treatment variant for *Xj* actually received by individual *j*. We could then potentially define the compliance set *Jc* by reference to individuals *j* such that *Kj*∈*K* _{Xj}. In some settings, to address noncompliance, we may wish to define the set *Jc* by principal strata defined by potential outcomes.^{10}

When there are individuals who are “noncompliant” with respect to a particular treatment comparison, various techniques for addressing noncompliance could potentially be used in the analysis.^{11–14} At least in theory, the same data could be used for several different treatment comparisons. For example, an individual taking aspirin at 10:30 am would be considered adherent in the first treatment comparison above but not in the second treatment comparison. The same data could thus potentially be used to compare “once daily intake of 40 mg of buffered aspirin taken between 5:30 am and 11:00 am” and “no aspirin intake” and also to compare “once daily intake of 40 mg of buffered aspirin taken between 7:00 am and 9:00 am” and “no aspirin intake,” with different groups being considered adherent to the treatment of interest.
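The dependence of the compliance set on the treatment comparison can be sketched in a few lines of Python. The records below are invented for illustration, the helper `compliance_set` is a hypothetical name, and the sketch assumes that the actually received variant *Kj* (here, the time of intake) is observed for each individual.

```python
from datetime import time
from typing import Dict, Optional, Set, Tuple

Window = Tuple[time, time]

# Hypothetical records: individual -> time aspirin was taken (None = no aspirin).
intake_time: Dict[str, Optional[time]] = {
    "a": time(8, 0),
    "b": time(10, 30),
    "c": time(16, 0),
    "d": None,
}


def compliance_set(records: Dict[str, Optional[time]], window: Window) -> Set[str]:
    """J_c: individuals who took no aspirin or took it inside the window K_x."""
    start, end = window
    return {j for j, t in records.items() if t is None or start <= t <= end}


morning: Window = (time(5, 30), time(11, 0))   # first treatment comparison
narrow: Window = (time(7, 0), time(9, 0))      # second treatment comparison

J_c_morning = compliance_set(intake_time, morning)
J_c_narrow = compliance_set(intake_time, narrow)
```

Individual `b`, who takes aspirin at 10:30 am, falls in the first compliance set but not the second, while individual `c`, at 4:00 pm, falls in neither.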

## STOCHASTIC COUNTERFACTUALS AND VIOLATIONS OF THE TREATMENT-VARIATION IRRELEVANCE ASSUMPTION

The treatment-variation irrelevance assumption and the consistency assumption, like almost all assumptions, are at best approximations; in drawing causal inferences we hope that these are reasonable approximations. Perhaps one further extension will help make these assumptions more reasonable in some contexts. Consider the effect on mortality of driving under the influence of alcohol. One might let *kx* indicate the time an individual starts driving under the influence, where *x* denotes the level of intoxication. If the driving begins at 12:01 am, there might be an oncoming driver at a particular turn, and driving under the influence would then lead to a crash and so to death; if the intoxicated driving begins at 12:02 am, the road might be empty and no crash would result. In this case, *(C1)* will not hold, since we would presumably want to include both 12:01 am and 12:02 am as possible times in the set *K* _{x}, and so it will not be the case that *Yj(x, kx)=Yj(x, kx’)* for all *kx*, *kx’*∈*K* _{x}. To circumvent this difficulty and the violation of *(C1)* in this case, we might introduce stochastic counterfactuals,^{15–18} that is, we might allow *Yjobs* and *Yj(x, kx)* to be random variables for each individual *j*. The outcome obtained both in actuality and under various potential interventions would follow some random distribution. This random distribution would allow for variation in, say, the time when other drivers are on the road. In general, the likelihood of an oncoming driver if the intoxicated driving began at 12:01 am would likely be approximately the same as the likelihood of an oncoming driver if the intoxicated driving began at 12:02 am. The assumption of treatment-variation irrelevance, *(C1)*, and the consistency assumption, *(C2)*, could still be formulated as above but the equalities would be equalities in distribution rather than for single values. The distribution of outcomes if intoxicated driving began at 12:01 am versus at 12:02 am would be approximately the same and so the stochastic version of *(C1)* would likely be reasonable in this case.
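The contrast between equality of single values and equality in distribution can be made concrete with a small exact calculation. The sketch below rests on simplifying assumptions that are not in the text: an oncoming driver is present at the turn in any given minute independently with the same probability `p` (an arbitrary illustrative value), and death occurs exactly when a driver is present at the minute the intoxicated driving begins.

```python
from fractions import Fraction
from itertools import product

# Assumed probability of an oncoming driver at the turn in any given minute.
p = Fraction(1, 10)


def world_prob(world):
    """Probability of one traffic pattern under independence across minutes."""
    prob = Fraction(1)
    for present in world:
        prob *= p if present else 1 - p
    return prob


# Each world records (driver present at 12:01?, driver present at 12:02?).
worlds = list(product([0, 1], repeat=2))

# Distribution of the outcome under each start time.
p_death_1201 = sum(world_prob(w) for w in worlds if w[0] == 1)
p_death_1202 = sum(world_prob(w) for w in worlds if w[1] == 1)

# Probability that the two deterministic outcomes differ within a single world.
p_outcomes_differ = sum(world_prob(w) for w in worlds if w[0] != w[1])
```

Under these assumptions the two start times give the same outcome distribution, so the stochastic version of *(C1)* holds, even though in a nonnegligible share of individual worlds the two deterministic outcomes disagree.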

Certain aspects of treatment variation, such as which instructor is assigned in a job training program, might also be conceptualized in terms of stochastic counterfactuals. For example, *Xj* may simply be an indicator of whether individual *j* participated in the job training program. If there were 3 instructors so that *K* _{1}={1, 2, 3} and if each of the instructors taught any particular trainee with probability 1/3 then the stochastic potential outcome *Yj(1*) would take values *Yj(1, k1=1)*, *Yj(1, k1=2*) or *Yj(1, k1=3)*, each with probability 1/3. In such cases, the stochastic nature of the potential outcomes effectively circumvents the need for *(C1*), and the consistency assumption for *Yj(1*) could then simply be stated as *Yj(1)=Yj* when *Xj=1*, with *Yj(1*) and *Yj* both conceived of as random variables. Suppose now that the extent of treatment variations when treatment is naturally selected is the same as the extent of treatment variations under interventions so that one could define *Kj* as the treatment variant for *Xj* actually received by individual *j*, as considered above. In these settings, a quantity such as *E[Y(1)]* would represent the expected outcome under a stochastic intervention to set each individual's treatment *X* to 1, by a means *k1* that randomly varies across the population according to how the means is actually randomly determined when treatment is given. In some cases, the probability that *Yj(x*) takes a particular value *Yj(x, kx)* may vary with the set of covariates *Z*. See Taubman et al^{19} for a description of data analysis methods for estimating the effects of such hypothetical stochastic interventions.
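The instructor example can be written out as a short calculation. The employment outcomes under each instructor are hypothetical values chosen for illustration, and `stochastic_outcome_distribution` is an assumed helper name; the assignment probabilities of 1/3 follow the example in the text.

```python
from fractions import Fraction
from typing import Dict

# Hypothetical outcomes Y_j(1, k_1) for one trainee under instructors 1, 2, 3.
y_by_instructor: Dict[int, int] = {1: 1, 2: 0, 3: 1}

# Each instructor teaches any particular trainee with probability 1/3.
assign_prob: Dict[int, Fraction] = {k: Fraction(1, 3) for k in y_by_instructor}


def stochastic_outcome_distribution(y_by_k: Dict[int, int],
                                    prob_by_k: Dict[int, Fraction]) -> Dict[int, Fraction]:
    """Distribution of the stochastic potential outcome Y_j(1) over its values."""
    dist: Dict[int, Fraction] = {}
    for k, y in y_by_k.items():
        dist[y] = dist.get(y, Fraction(0)) + prob_by_k[k]
    return dist


dist_y1 = stochastic_outcome_distribution(y_by_instructor, assign_prob)

# Expected value of Y_j(1) for this trainee under the stochastic intervention.
expected_y1 = sum(y * pr for y, pr in dist_y1.items())
```

Here the outcome differs across instructors, so the deterministic *(C1)* fails, yet *Yj(1)* remains well defined as a random variable; averaging such expectations over individuals would give a quantity of the form *E[Y(1)]*.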

## REFERENCES

*Epidemiology*. 2009;20:3–5.

*J Am Stat Assoc*. 2006;101:901–910.

*J Am Stat Assoc*. 2006;101:1398–1407.

*J Am Stat Assoc*. 2007;102:191–200.

*J Am Stat Assoc*. 2008;103:832–842.

*Sociol Methods Res*. In press.

*Sociol Methodol*. 2005;1–98.

*Handbook of Econometrics*. Vol. 6B. Amsterdam: Elsevier; 2007:4875–5144.

*Int Stat Rev*. 2008;76:1–27.

*Biometrics*. 2002;58:21–29.

*J Am Stat Assoc*. 1996;91:928–934.

*Ann Stat*. 1997;25:305–327.

*Stat Med*. 1998;17:269–302.

*Biometrics*. 2000;56:779–788.

*Am J Epidemiol*. 1987;125:761–768.

*Biometrics*. 1989;45:1125–1138.

*J Am Stat Assoc*. 2000;95:477–482.

*Am J Epidemiol*. 2005;162:621–622.