Selection Bias with Outcome-dependent Sampling : Epidemiology

Methods

Selection Bias with Outcome-dependent Sampling

Sjölander, Arvid

Epidemiology 34(2):p 186-191, March 2023. | DOI: 10.1097/EDE.0000000000001567

Abstract

Beside confounding, nonrandom selection is probably the most important source of bias in epidemiologic studies. Hernán et al.1 provided a systematic classification of selection biases, based on causal diagrams. In their classification, the selection into the study is a “collider” on a noncausal path between the exposure and the outcome. One such scenario is depicted in Figure 1, where the exposure, outcome and selection indicator are denoted A, Y, and S, respectively. The conditioning on being selected into the study is illustrated with the square box around S=1. This conditioning induces a noncausal association between the exposure and the outcome, even in the absence of a causal exposure effect, thus invalidating both hypothesis testing and effect estimation.2,3

FIGURE 1. A causal diagram illustrating selection that is directly influenced by both the exposure and the outcome.

Hernán4 discussed another type of selection bias, where the selection is statistically independent of the exposure but associated with the outcome through common causes, as in Figure 2. He showed, using standard graphical rules, that this selection is more benign than those considered by Hernán et al.1 in that it preserves the sharp causal null, that is, it does not induce an association between the exposure and the outcome unless the exposure has a causal effect on the outcome for at least some individuals. Thus, a hypothesis test of the sharp causal null remains valid. However, he showed with numerical examples that this selection may give bias away from the sharp causal null in that the exposure–outcome association in the selected population is not generally equal to the exposure effect in the source population, that is, the population from which the data were taken.

FIGURE 2. A causal diagram illustrating outcome-associated selection, where the selection is statistically independent of the exposure, but associated with the outcome through common causes.

There are several issues with the discussion by Hernán.4 Although many readers may be content with the numerical examples that he gave, some may find the lack of formal proofs unsatisfactory. Furthermore, whereas the causal diagram in Figure 2 represents one possible mechanism of outcome-dependent selection, there are other such mechanisms. In particular, the causal diagram in Figure 3 represents the important scenario where the selection is directly influenced by the outcome, but not by the exposure. An obvious example is the unmatched case–control study, in which the selection is by definition influenced by the outcome status alone. To distinguish between the scenarios in Figures 2 and 3, we refer to them as “outcome-associated selection” and “outcome-influenced selection,” respectively. Finally, Hernán4 focused on causal effects in the source population, that is, the population from which the observed data were selectively drawn. However, one may also wonder to what extent the observed data can help to estimate causal effects in the selected population, that is, the population defined by those selected into the study. To the best of our knowledge, this question has not been addressed in other literature either.

FIGURE 3. A causal diagram illustrating outcome-influenced selection, where the selection is directly influenced by the outcome, but not by the exposure.

In this article, we address these issues. We consider both outcome-associated selection and outcome-influenced selection, and we use formal yet intuitive methods based on counterfactual diagrams5 to show whether causal effects, both in the source population and in the selected population, are estimable under these selection mechanisms. We will assume that the reader is familiar with causal diagrams, and in particular with the rules of d-separation.2,3 We will focus on selection bias and ignore other possible biases, confounding in particular. Real observational studies typically control for measured confounders to reduce the degree of confounding bias; all our arguments and results hold within levels of (conditional on) such measured confounders, provided that these are sufficient for confounding control. We restrict attention to binary exposures and outcomes, but some of our arguments and results carry over to other types of variables; we indicate this as we proceed. Throughout, we ignore uncertainty due to sampling variability.

POTENTIAL OUTCOMES AND CAUSAL TARGET PARAMETERS

We use standard potential outcome notation2,3 to define causal effects. Let Ya denote the potential outcome of Y for a given subject if, possibly contrary to fact, exposed to level A=a. We say that the exposure has a causal effect for a given individual if Y0 ≠ Y1 for that individual. Let p(Ya=1) be the counterfactual probability of the outcome had everyone in the source population (both those selected into the study and those not selected) been exposed to level A=a. We say that the exposure has a causal effect in the source population if p(Y0=1) ≠ p(Y1=1). To measure the causal effect in the source population we may, for instance, use the causal risk difference p(Y1=1) − p(Y0=1) or the causal risk ratio p(Y1=1)/p(Y0=1). Finally, let p(Ya=1|S=1) be the counterfactual probability of the outcome had everyone in the selected population been exposed to level A=a. We say that the exposure has a causal effect in the selected population if p(Y0=1|S=1) ≠ p(Y1=1|S=1). To measure the causal effect in the selected population we may, for instance, use the causal risk difference p(Y1=1|S=1) − p(Y0=1|S=1) or the causal risk ratio p(Y1=1|S=1)/p(Y0=1|S=1).

We emphasize that, since we are not concerned with sampling variability, we use the term “selected population” in an asymptotic sense. That is, we do not use the term to refer to the limited sample of selected individuals in the particular study, but rather to an infinite “super-population” of individuals, generated under the same selection mechanism as the factual sample. Probabilities conditional on S=1, such as p(Y0=1|S=1) and p(Y1=1|S=1), can be interpreted as proportions in this super-population.

CONSISTENCY AND EXCHANGEABILITY

We make the standard consistency assumption6–8 that the potential outcome Ya is equal to the factual (observed) outcome Y for subjects who are factually exposed to level A=a:

$$A = a \;\Rightarrow\; Y = Y_a \quad \text{for all } a. \tag{1}$$

From consistency (1) it follows that

$$p(Y=1 \mid A=a, S=1) = p(Y_a=1 \mid A=a, S=1) \quad \text{for all } a;$$

that is, the probability of the outcome Y=1 among those exposed to level A=a is equal to the probability of the potential outcome Ya=1 among those exposed to level A=a, in the selected population. If we would further have that

$$p(Y_a=1 \mid A=a, S=1) = p(Y_a=1 \mid S=1) \quad \text{for all } a, \tag{2}$$

then we could interpret the exposure–outcome association in the selected population as the corresponding causal effect; for example, we could interpret the risk difference p(Y=1|A=1,S=1) − p(Y=1|A=0,S=1) in the selected population as the causal risk difference p(Y1=1|S=1) − p(Y0=1|S=1) in the selected population. Finally, if we would also have that

$$p(Y_a=1 \mid S=1) = p(Y_a=1) \quad \text{for all } a, \tag{3}$$

then we could interpret the causal effect in the selected population as the corresponding causal effect in the source population; for example, we could interpret the causal risk difference p(Y1=1|S=1) − p(Y0=1|S=1) in the selected population as the causal risk difference p(Y1=1) − p(Y0=1) in the source population.

The relation in (2) states that the potential outcome Ya has the same distribution among those factually exposed to A=a as among everyone in the selected population or, equivalently, that Ya is conditionally independent of A, given S=1:

$$Y_a \perp\!\!\!\perp A \mid S=1 \quad \text{for all } a; \tag{4}$$

this is often referred to as “conditional exchangeability.”2,3 Similarly, the relation in (3) states that the potential outcome Ya has the same distribution in the selected population as in the source population or, equivalently, that Ya is independent of S:

$$Y_a \perp\!\!\!\perp S \quad \text{for all } a. \tag{5}$$

The concepts and definitions above are related to a recent study by Lu et al.9 These authors assumed that the causal effect in the source population is the target parameter, and showed that the total bias of the exposure–outcome association in the selected population can be decomposed into two parts. They used the terms “type 1 selection bias” and “type 2 selection bias” for the bias components due to violations of (4) and (5), respectively. This decomposition is also related to the modern literature on transportability of causal effects, where we say that the causal effect in the selected population is “transportable” to the source population if it is equal to the causal effect in that population, that is, if there is no type 2 selection bias; see Bareinboim and Pearl10 and the references therein.
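On the risk-difference scale, the decomposition described above can be written as a simple add-and-subtract identity. The display below is a sketch in the notation of this article, not necessarily the form used by Lu et al.9; the labels under the braces are ours.

```latex
\begin{align*}
&\underbrace{\{p(Y=1 \mid A=1,S=1) - p(Y=1 \mid A=0,S=1)\} - \{p(Y_1=1) - p(Y_0=1)\}}_{\text{total bias}} \\
&\quad = \underbrace{\{p(Y=1 \mid A=1,S=1) - p(Y=1 \mid A=0,S=1)\} - \{p(Y_1=1 \mid S=1) - p(Y_0=1 \mid S=1)\}}_{\text{type 1: zero under (1) and (4)}} \\
&\qquad + \underbrace{\{p(Y_1=1 \mid S=1) - p(Y_0=1 \mid S=1)\} - \{p(Y_1=1) - p(Y_0=1)\}}_{\text{type 2: zero under (5)}}
\end{align*}
```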

ESTIMATION OF CAUSAL EFFECTS IN THE SELECTED POPULATION

Outcome-associated Selection

It is difficult to judge whether counterfactual independencies like (4) and (5) hold in a causal diagram using intuitive reasoning alone, because standard causal diagrams do not include potential outcomes like Ya. Fortunately, there exists a simple method based on counterfactual diagrams.5 In this method, the causal diagram illustrating the factual world is augmented with a parallel diagram, illustrating the counterfactual world where the exposure is set to a certain level for everyone. The factual and counterfactual worlds are joined by exogenous error terms, corresponding to all (measured or unmeasured) factors that influence the variables under consideration, apart from those explicitly depicted on the original causal diagram. For instance, in the causal diagram of Figure 3 there is only one variable, A, that influences Y. However, there are of course always other factors (genetics, lifestyle etc) that influence Y as well, which are not explicitly depicted on the diagram; heuristically, the error term for Y is the whole set of all these implicit factors. Pearl2 provides a formal connection between causal diagrams and potential outcomes through nonparametric structural equations. Once the counterfactual diagram has been constructed, counterfactual independencies like (4) and (5) can easily be evaluated using standard rules of d-separation.

The counterfactual diagram corresponding to the causal diagram in Figure 2 for outcome-associated selection is shown in Figure 4. The left part of the diagram represents the factual world where the exposure A varies randomly, and the right part represents the counterfactual world where the exposure is set to level a for everyone. The subindex a on Ya reflects that this is a potential outcome under the hypothetical intervention setting A to a. The variable εY is the error term for Y. Because U, S, and εY are not descendants of A, these are unaffected by interventions on A, and are thus shared between (that is, have the same value in) both worlds.

FIGURE 4. Counterfactual diagram corresponding to the causal diagram in Figure 2.

In Figure 4, there are two paths between Ya and A, Ya ← εY → Y ← A and Ya ← U → Y ← A; however, both pass through the collider Y. Because we have not conditioned on Y in (4), only on S=1, all paths between Ya and A are blocked. It follows that conditional exchangeability (4) holds under the causal diagram in Figure 2, so that the conditional association between A and Y, given S=1, can be interpreted as the corresponding causal effect of A on Y in the selected population. In the jargon of Lu et al.,9 we say that there is no type 1 selection bias. Because the counterfactual diagram in Figure 4 makes no assumption about A and Y being binary, this result holds for nonbinary exposures and outcomes as well.
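The graphical argument can be checked numerically with a small exact calculation. The sketch below builds one hypothetical data-generating model compatible with Figure 2 (A → Y ← U → S, with A independent of U) and enumerates its joint distribution. All parameter values and function names are illustrative assumptions, not taken from the article; Y0 and Y1 are drawn independently given U purely for convenience.

```python
from itertools import product

# Hypothetical parameter values, chosen only for illustration.
P_U = 0.4                        # p(U=1)
P_A = 0.5                        # p(A=1); A is independent of U and S
P_S_GIVEN_U = {0: 0.2, 1: 0.7}   # p(S=1 | U=u): selection depends on U only
RISK = {0: {0: 0.1, 1: 0.5},     # RISK[a][u] = p(Y_a=1 | U=u)
        1: {0: 0.3, 1: 0.6}}


def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1.0 - p


def joint():
    """Yield ((u, a, y0, y1, s), probability) for every configuration."""
    for u, a, y0, y1, s in product((0, 1), repeat=5):
        pr = (bern(P_U, u) * bern(P_A, a)
              * bern(RISK[0][u], y0) * bern(RISK[1][u], y1)
              * bern(P_S_GIVEN_U[u], s))
        yield (u, a, y0, y1, s), pr


def prob(event, given=lambda r: True):
    """Conditional probability p(event | given) by exact enumeration."""
    num = den = 0.0
    for row, pr in joint():
        if given(row):
            den += pr
            if event(row):
                num += pr
    return num / den


def y_obs(row):
    """Factual outcome implied by consistency: Y = Y_a when A = a."""
    _, a, y0, y1, _ = row
    return y1 if a == 1 else y0


# Observed association among the selected (S=1).
assoc_rd = (prob(lambda r: y_obs(r) == 1, lambda r: r[1] == 1 and r[4] == 1)
            - prob(lambda r: y_obs(r) == 1, lambda r: r[1] == 0 and r[4] == 1))
# Causal risk difference in the selected population.
causal_rd_selected = (prob(lambda r: r[3] == 1, lambda r: r[4] == 1)
                      - prob(lambda r: r[2] == 1, lambda r: r[4] == 1))
# Causal risk difference in the source population, for comparison.
causal_rd_source = prob(lambda r: r[3] == 1) - prob(lambda r: r[2] == 1)

print(assoc_rd, causal_rd_selected, causal_rd_source)
```

In this model the first two quantities coincide, as conditional exchangeability (4) implies, whereas the source-population risk difference takes a different value because U is distributed differently among the selected.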

Outcome-influenced Selection

Lu et al.9 considered a variation of outcome-influenced selection shown in Figure 5, where there is a covariate L that affects Y, but not A. Although the causal diagram in Figure 3 does not explicitly include such a covariate, there will always be other predictors for the outcome than the exposure of interest; these are usually subsumed into the implicit error term εY. Thus, the scenarios in Figures 3 and 5 are essentially equivalent from our perspective. Lu et al.9 stated that, for the causal diagram in Figure 5, the exposure–outcome association in the selected population “suffers from type 1 selection bias by restricting to one level of a descendant of the collider Y, leading to a biased effect estimate on both risk difference and risk ratio scales.” This argument is somewhat unsatisfactory. Although it is technically correct that Y is a collider between A and L in Figure 5, it is not clear from the causal diagram why conditioning on its descendant S would give bias, because the causal diagram does not include the potential outcome Ya that we ultimately care about.

FIGURE 5. A causal diagram illustrating outcome-influenced selection, borrowed from Lu et al.9

To provide a more rigorous argument, we again use counterfactual diagrams. The counterfactual diagram corresponding to the causal diagram in Figure 3 for outcome-influenced selection is shown in Figure 6. Because S is now a descendant of, and thus influenced by, A in the causal diagram, S and Sa are not generally equal, and must be distinguished in the counterfactual diagram. In Figure 6, there are two paths between Ya and A: Ya → Sa ← εS → S ← Y ← A and Ya ← εY → Y ← A. The first path is blocked at the collider Sa, because we have not conditioned on this variable in (4). However, the second path is open by conditioning on S=1. This is because S is a descendant of the collider Y on the path, and conditioning on a collider or any of its descendants opens a path on a causal diagram. It follows that conditional exchangeability (4) does not hold under the causal diagram in Figure 3, so that the conditional association between A and Y, given S=1, cannot be interpreted as the corresponding causal effect of A on Y in the selected population. In the jargon of Lu et al.,9 we say that there is type 1 selection bias. Thus, Lu et al.9 were correct in that bias occurs because of conditioning on a collider, but this collider appears on the counterfactual diagram, not on the original causal diagram. This bias occurs for nonbinary exposures and outcomes as well.
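A small exact calculation illustrates the failure of conditional exchangeability. The sketch below uses one hypothetical model compatible with Figure 3 (A → Y → S); the parameter values are assumptions made for illustration, and Y0 and Y1 are taken to be independent, which is only one of many joint distributions consistent with the stated marginal risks.

```python
from itertools import product

# Hypothetical parameter values, chosen only for illustration.
P_A = 0.5                        # p(A=1), randomly assigned
RISK = {0: 0.2, 1: 0.4}          # RISK[a] = p(Y_a=1) in the source population
P_S_GIVEN_Y = {0: 0.1, 1: 0.9}   # p(S=1 | Y=y): selection depends on Y only


def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1.0 - p


def joint():
    """Yield ((a, y0, y1, y, s), probability) for every configuration."""
    for a, y0, y1, s in product((0, 1), repeat=4):
        y = y1 if a == 1 else y0          # consistency: Y = Y_a when A = a
        pr = (bern(P_A, a) * bern(RISK[0], y0) * bern(RISK[1], y1)
              * bern(P_S_GIVEN_Y[y], s))
        yield (a, y0, y1, y, s), pr


def prob(event, given=lambda r: True):
    """Conditional probability p(event | given) by exact enumeration."""
    num = den = 0.0
    for row, pr in joint():
        if given(row):
            den += pr
            if event(row):
                num += pr
    return num / den


# Observed association among the selected (S=1).
assoc_rd = (prob(lambda r: r[3] == 1, lambda r: r[0] == 1 and r[4] == 1)
            - prob(lambda r: r[3] == 1, lambda r: r[0] == 0 and r[4] == 1))
# Causal risk difference in the selected population.
causal_rd_selected = (prob(lambda r: r[2] == 1, lambda r: r[4] == 1)
                      - prob(lambda r: r[1] == 1, lambda r: r[4] == 1))

print(assoc_rd, causal_rd_selected)
```

Here the observed risk difference among the selected differs from the causal risk difference in the selected population, which is the type 1 selection bias described above.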

FIGURE 6. Counterfactual diagram corresponding to the causal diagram in Figure 3.

In eAppendix 1, https://links.lww.com/EDE/B995, we prove the stronger result that causal effects in the selected population are not estimable in any way under the causal diagram in Figure 3. We only prove this result for binary variables, but we conjecture that a similar result holds for other types of variables as well.

That causal effects in the selected population cannot be estimated under outcome-influenced selection does not mean that data are completely uninformative about such causal effects. A straightforward application of arguments by Robins11 and Manski12 leads to the conclusion that the counterfactual probability p(Ya=1|S=1) is confined to the range

$$p(A=a, Y=1 \mid S=1) \;\le\; p(Y_a=1 \mid S=1) \;\le\; p(A=a, Y=1 \mid S=1) + p(A=1-a \mid S=1). \tag{6}$$

For completeness, we prove this relation in eAppendix 2, https://links.lww.com/EDE/B995. By maximizing p(Y0=1|S=1) and minimizing p(Y1=1|S=1) within the range in (6), we obtain lower bounds for the causal risk difference and risk ratio in the selected population. Similarly, by minimizing p(Y0=1|S=1) and maximizing p(Y1=1|S=1), we obtain upper bounds for the causal risk difference and risk ratio in the selected population. These bounds are given by

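$$\begin{aligned}
&p(A=1, Y=1 \mid S=1) - \{p(A=0, Y=1 \mid S=1) + p(A=1 \mid S=1)\} \\
&\quad \le\; p(Y_1=1 \mid S=1) - p(Y_0=1 \mid S=1) \\
&\quad \le\; \{p(A=1, Y=1 \mid S=1) + p(A=0 \mid S=1)\} - p(A=0, Y=1 \mid S=1)
\end{aligned} \tag{7}$$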

and

$$\frac{p(A=1, Y=1 \mid S=1)}{p(A=0, Y=1 \mid S=1) + p(A=1 \mid S=1)} \;\le\; \frac{p(Y_1=1 \mid S=1)}{p(Y_0=1 \mid S=1)} \;\le\; \frac{p(A=1, Y=1 \mid S=1) + p(A=0 \mid S=1)}{p(A=0, Y=1 \mid S=1)}. \tag{8}$$
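The bounds in (6) to (8) depend only on the joint distribution of A and Y among the selected, so they are straightforward to compute. The following sketch takes the four cell probabilities p(A=a,Y=y|S=1) as input; the function name, argument names, and example values are hypothetical.

```python
import math


def selected_population_bounds(p11, p10, p01, p00):
    """Bounds (6)-(8), where p_ay denotes p(A=a, Y=y | S=1)."""
    if abs(p11 + p10 + p01 + p00 - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    p_a1 = p11 + p10                      # p(A=1 | S=1)
    p_a0 = p01 + p00                      # p(A=0 | S=1)
    # (6): p(A=a,Y=1|S=1) <= p(Y_a=1|S=1) <= p(A=a,Y=1|S=1) + p(A=1-a|S=1)
    y1_lo, y1_hi = p11, p11 + p_a0
    y0_lo, y0_hi = p01, p01 + p_a1
    # (7): risk difference bounds.
    rd = (y1_lo - y0_hi, y1_hi - y0_lo)
    # (8): risk ratio bounds; guard against division by zero.
    rr_lo = y1_lo / y0_hi if y0_hi > 0 else math.nan
    rr_hi = y1_hi / y0_lo if y0_lo > 0 else math.inf
    return {"risk_y1": (y1_lo, y1_hi), "risk_y0": (y0_lo, y0_hi),
            "rd": rd, "rr": (rr_lo, rr_hi)}


# Hypothetical cell probabilities among the selected.
bounds = selected_population_bounds(p11=0.3, p10=0.2, p01=0.1, p00=0.4)
print(bounds)
```

Simplifying (7) shows that the lower risk-difference bound equals −{p(A=1,Y=0|S=1) + p(A=0,Y=1|S=1)} and the upper bound equals p(A=1,Y=1|S=1) + p(A=0,Y=0|S=1), so the interval always has width 1 and always contains 0.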

ESTIMATION OF CAUSAL EFFECTS IN THE SOURCE POPULATION

Outcome-associated Selection

From the counterfactual diagram in Figure 4, we observe that Ya and S are associated via the open path Ya ← U → S. Thus, the independence relation in (5) does not hold under outcome-associated selection, which implies that the causal effect in the selected population is not equal to the causal effect in the source population; in the jargon of Lu et al.,9 we say that there is type 2 selection bias. This bias occurs for nonbinary variables as well.

In fact, the situation is even worse. In eAppendix 3, https://links.lww.com/EDE/B995, we prove that the observed data have no information about causal effects in the source population under outcome-associated selection. Thus, whatever data we observe under outcome-associated selection, the causal risk difference in the source population can be anywhere between −1 and 1, and the causal risk ratio in the source population can be anywhere between 0 and infinity. We only prove this result for binary variables, but we conjecture that a similar result holds for other types of variables as well.

This result may seem reasonable to some readers, because we have not assumed that the sampling fraction p(S=1) is known. Thus, it is possible that the selected population only constitutes a tiny (technically 0) proportion of the source population, in which case it makes intuitive sense that the observed data have no information about causal effects in the source population. However, this explanation cannot be taken for granted, because it is not valid for all sampling schemes. Specifically, we show in the next section that the observed data do have some information about causal effects in the source population under outcome-influenced selection, irrespective of the sampling fraction.

To other readers, the fact that the observed data have no information about causal effects in the source population under outcome-associated selection may seem to contradict the result by Hernán,4 that we can test whether the sharp causal null holds under outcome-associated selection. However, we remind the reader that the sharp causal null means that the exposure has no effect for any single individual. A violation of the sharp causal null does not imply that the exposure has an effect on the population level, because the exposure may have positive effects for some individuals and negative effects for other individuals, which may cancel out in the population.
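A minimal worked example of such cancellation, using a hypothetical population in which half of the individuals are helped and half are harmed by the exposure:

```latex
\begin{align*}
p\{(Y_0, Y_1) = (0, 1)\} &= 0.5, & p\{(Y_0, Y_1) = (1, 0)\} &= 0.5, \\
p(Y_1 = 1) &= 0.5, & p(Y_0 = 1) &= 0.5, \\
p(Y_1 = 1) - p(Y_0 = 1) &= 0, & p(Y_1 = 1)/p(Y_0 = 1) &= 1.
\end{align*}
```

Here Y0 ≠ Y1 for every individual, so the sharp causal null is violated, yet the causal risk difference is 0 and the causal risk ratio is 1.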

Outcome-influenced Selection

From the counterfactual diagram in Figure 6, we observe that Ya and S are associated via the open path Ya ← εY → Y → S. Thus, the independence relation in (5) does not hold under outcome-influenced selection, which implies that the causal effect in the selected population is not equal to the causal effect in the source population; in the jargon of Lu et al.,9 we say that there is type 2 selection bias, in addition to the type 1 selection bias noted above. This bias occurs for nonbinary variables as well.

However, in contrast to outcome-associated selection, the data under outcome-influenced selection do have some information about causal effects in the source population. Under outcome-influenced selection, it can be shown that the causal odds ratio in the source population is equal to the odds ratio in the selected population,

$$\frac{p(Y_1=1)/p(Y_1=0)}{p(Y_0=1)/p(Y_0=0)} = \frac{p(Y=1 \mid A=1, S=1)/p(Y=0 \mid A=1, S=1)}{p(Y=1 \mid A=0, S=1)/p(Y=0 \mid A=0, S=1)} = \mathrm{OR},$$

and can therefore be estimated; this is true irrespective of the sampling fraction p(S=1). This has been shown by various authors13,14; for completeness, we give a detailed proof in eAppendix 4, https://links.lww.com/EDE/B995. By using the causal odds ratio in the source population, it is possible to provide bounds on the causal risk difference and risk ratio in the source population. Specifically, it follows from results in King and Zeng15 that

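$$\min\!\left(0, \frac{\sqrt{\mathrm{OR}} - 1}{\sqrt{\mathrm{OR}} + 1}\right) \;\le\; p(Y_1=1) - p(Y_0=1) \;\le\; \max\!\left(0, \frac{\sqrt{\mathrm{OR}} - 1}{\sqrt{\mathrm{OR}} + 1}\right) \tag{9}$$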

and

$$\min(1, \mathrm{OR}) \;\le\; \frac{p(Y_1=1)}{p(Y_0=1)} \;\le\; \max(1, \mathrm{OR}). \tag{10}$$

The bounds in (7) for the causal risk difference in the selected population are qualitatively different from those in (9) for the causal risk difference in the source population, in that the former include both positive and negative values, whereas the latter include either positive or negative values, but not both. Similarly, the bounds in (8) for the causal risk ratio in the selected population include both values above and below 1, whereas the bounds in (10) for the causal risk ratio in the source population include either values above or below 1, but not both. Thus, using these bounds we are able to tell the direction of the exposure effect in the source population, but not in the selected population.
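The invariance of the odds ratio and the bounds in (9) and (10) can be illustrated with a short calculation. The sketch below assumes a hypothetical model compatible with Figure 3, with no confounding and selection probabilities p(S=1|Y=y) that depend on Y only; all names and values are illustrative. The risk-difference limit is taken as (√OR − 1)/(√OR + 1), which is the largest risk difference compatible with a given odds ratio; the final grid search checks this numerically.

```python
import math


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


def selected_risk(source_risk, q):
    """p(Y=1 | A=a, S=1) by Bayes' rule, where q[y] = p(S=1 | Y=y)."""
    num = source_risk * q[1]
    return num / (num + (1.0 - source_risk) * q[0])


def source_bounds(odds_ratio):
    """Bounds (9) and (10) on the source-population RD and RR."""
    root = math.sqrt(odds_ratio)
    rd_limit = (root - 1.0) / (root + 1.0)
    return {"rd": (min(0.0, rd_limit), max(0.0, rd_limit)),
            "rr": (min(1.0, odds_ratio), max(1.0, odds_ratio))}


# Hypothetical source-population risks p(Y_a=1) and selection probabilities.
r0, r1 = 0.2, 0.4
q = {0: 0.05, 1: 0.8}

or_source = odds(r1) / odds(r0)
or_selected = odds(selected_risk(r1, q)) / odds(selected_risk(r0, q))
bounds = source_bounds(or_selected)

# Grid search: largest risk difference p1 - p0 compatible with the odds ratio.
max_rd = max(or_selected * p0 / (1.0 - p0 + or_selected * p0) - p0
             for p0 in (i / 10000 for i in range(1, 10000)))

print(or_source, or_selected, bounds, max_rd)
```

The selection probabilities cancel from the odds ratio, so the two odds ratios agree whatever values are chosen for q, and the true risk difference and risk ratio in this example lie inside the computed bounds.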

CONCLUSIONS

In this note, we have contrasted outcome-associated selection and outcome-influenced selection. We have shown that causal effects in the selected population are estimable under outcome-associated selection but not under outcome-influenced selection. We have shown that data have no information about causal effects in the source population under outcome-associated selection, but that the causal odds ratio in the source population can be estimated, and the causal risk ratio and risk difference can be bounded, both in the selected population and in the source population, under outcome-influenced selection. For some of these results, we have used counterfactual diagrams, but we note that it may also be possible to prove these results with Single World Intervention Graphs (SWIGs).16

We have presented bounds for the causal risk difference and risk ratio in the source population and in the selected population under outcome-influenced sampling. Other authors have presented related bounds, but under somewhat different conditions. Kuroki et al.17 derived bounds for the causal risk difference and the causal risk ratio in the source population under case-control sampling, which is a special case of outcome-influenced sampling. Unlike us, though, these authors allowed for both confounding and biased selection, so we would expect their bounds to be less informative (i.e. wider) than our bounds in (9) and (10); we verify this in eAppendix 5, https://links.lww.com/EDE/B995. Gabriel et al.18 derived bounds for the causal risk difference in the source population in scenarios with missing data, which is analogous to selection. Unlike us though, these authors allowed simultaneous outcome-associated and outcome-influenced missingness or selection (their Figure 1A), and they assumed that the proportion of nonmissingness, corresponding to p(S=1) in our exposition, is known. Thus, their bounds are not directly comparable to our bounds. Neither of these authors considered causal effects in the selected population.

We have not taken a stance on which parameter is most relevant from a scientific perspective: a causal effect in the source population or in the selected population. We believe that most researchers would prefer to estimate causal effects in the source population, but given all sources of errors in real epidemiologic studies (e.g., selection bias, measurement bias, confounding bias), we conjecture that many researchers would be content with an estimate of any causal effect that is at least approximately unbiased. Thus, the fact that outcome-associated selection admits estimation of causal effects in the selected population, whereas outcome-influenced selection does not, may be useful information to practitioners.

REFERENCES

1. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15:615–625.
2. Pearl J. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press; 2009.
3. Hernán MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC; 2020.
4. Hernán MA. Invited commentary: selection bias without colliders. Am J Epidemiol. 2017;185:1048–1050.
5. Shpitser I, Pearl J. What counterfactuals can be tested. Proceedings of the 23rd Annual Conference on Uncertainty in Artificial Intelligence. 2007:437–444.
6. Cole S, Frangakis C. The consistency statement in causal inference: a definition or an assumption? Epidemiology. 2009;20:3–5.
7. VanderWeele T. Concerning the consistency assumption in causal inference. Epidemiology. 2009;20:880–883.
8. Pearl J. On the consistency rule in causal inference: axiom, definition, assumption, or theorem? Epidemiology. 2010;21:872–875.
9. Lu H, Cole SR, Howe CJ, Westreich D. Toward a clearer definition of selection bias when estimating causal effects. Epidemiology. 2022;33:699–706.
10. Bareinboim E, Pearl J. A general algorithm for deciding transportability of experimental results. J Causal Inference. 2013;1:107–134.
11. Robins JM. The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In: Sechrest L, Freeman H, Mulley A, eds. Health Service Research Methodology: A Focus on AIDS. US Public Health Service, National Center for Health Services Research; 1989:113–159.
12. Manski CF. Nonparametric bounds on treatment effects. Am Econ Rev. 1990;80:319–323.
13. Didelez V, Kreiner S, Keiding N. Graphical models for inference under outcome-dependent sampling. Stat Sci. 2010;25:368–387.
14. Bareinboim E, Pearl J. Controlling selection bias in causal inference. Artif Intell Stat. 2012;22:100–108.
15. King G, Zeng L. Estimating risk and rate levels, ratios and differences in case-control studies. Stat Med. 2002;21:1409–1427.
16. Richardson TS, Robins JM. Single world intervention graphs (SWIGs): a unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Series. Working Paper. 2013;128:1–148.
17. Kuroki M, Cai Z, Geng Z. Sharp bounds on causal effects in case-control and cohort studies. Biometrika. 2010;97:123–132.
18. Gabriel EE, Sjölander A, Sachs MC. Nonparametric bounds for causal effects in imperfect randomized experiments. J Am Stat Assoc. 2021. doi:10.1080/01621459.2021.1950734.
Keywords: Causal diagrams; Causality; Counterfactual graphs; Outcome-dependent sampling; Selection bias


Copyright © 2022 Wolters Kluwer Health, Inc. All rights reserved.