Commentary: How to Report Instrumental Variable Analyses (Suggestions Welcome)

Swanson, Sonja A.a; Hernán, Miguel A.a,b,c

doi: 10.1097/EDE.0b013e31828d0590
Author Information

From the aDepartment of Epidemiology, Harvard School of Public Health, Boston, MA; bDepartment of Biostatistics, Harvard School of Public Health, Boston, MA; and cHarvard-MIT Division of Health Sciences and Technology, Boston, MA.

Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article. This content is not peer-reviewed or copy-edited; it is the sole responsibility of the author.

Editors’ note: Related articles appear on pages 352 and 363.

Correspondence: Sonja A. Swanson, 677 Huntington Avenue, Kresge 9th Floor, Boston, MA 02115. E-mail:

Instrumental variable (IV) methods are becoming mainstream in comparative effectiveness research,1 but IV methods are radically different from traditional epidemiologic methods. The goal of IV methods is to eliminate confounding without ever measuring the confounders. This apparent miracle can be achieved only when four conditions are met (see below).2,3 Here we suggest a checklist for investigators who use IV methods. Like others before,4 we hope this step-by-step guide will improve the reporting of IV estimates and increase the transparency of the underlying assumptions.

Our discussion focuses on reports of causal effects of medical interventions and is informed by two papers by Davies et al5,6 that appear in this issue of EPIDEMIOLOGY: an application that estimates the causal effects of cyclooxygenase-2 (COX-2) selective versus nonselective nonsteroidal anti-inflammatory drugs (NSAIDs), and a literature review of IV papers that describes their use, and perhaps misuse, in epidemiology. We supplement this review with additional information from our own review of IV analyses of observational studies with a relatively well-defined medical intervention. Details of our review can be found in the online supplement.

IV methods require a variable—the “instrument”—that meets the three so-called instrumental conditions: (1) the instrument is associated with the treatment, (2) the instrument does not affect the outcome except through treatment (also known as the exclusion restriction assumption), and (3) the instrument does not share any causes with the outcome. An example of such an instrument is the randomization indicator in double-blind randomized experiments.2 Davies et al6 summarize instruments that have been proposed in epidemiologic studies, including the physician preference type they use themselves.5 Unfortunately, no variable can be proved to be an instrument in observational studies because only condition (1) can be empirically verified. We outline some steps for reporting IV analyses based on the variables proposed as instruments (Figure) and discuss how these steps have been reported in previous studies (Table). A detailed specification of how IV methods should be implemented is beyond the scope of this commentary.
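As a minimal sketch of why only condition (1) is empirically verifiable, consider simulated data (all variable names and parameter values here are hypothetical, not from Davies et al) in which Z satisfies all three conditions: the Z–T association is estimable from the data, while the bias in the naive treatment–outcome contrast, driven by the unmeasured confounder, is invisible without knowing the data-generating process.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process in which Z meets conditions (1)-(3):
# U is an unmeasured confounder of treatment T and outcome Y; Z affects Y
# only through T and shares no causes with Y.
U = rng.binomial(1, 0.5, n)                    # unmeasured confounder
Z = rng.binomial(1, 0.5, n)                    # proposed instrument
T = rng.binomial(1, 0.2 + 0.3 * Z + 0.4 * U)   # condition (1): Z affects T
Y = rng.binomial(1, 0.1 + 0.2 * T + 0.3 * U)   # no direct Z -> Y pathway

# Only condition (1) can be checked in the data: the Z-T risk difference.
zt_rd = T[Z == 1].mean() - T[Z == 0].mean()    # close to 0.30 by construction

# The naive T-Y contrast is confounded by U (the true effect is 0.20),
# and nothing in the observed data reveals the size of that bias.
naive_rd = Y[T == 1].mean() - Y[T == 0].mean()
```

The point of the sketch is that `zt_rd` is an estimable quantity, whereas conditions (2) and (3) hold only because we wrote the simulation that way.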

The first step in our reporting flowchart is to empirically verify condition (1). When the association between the proposed instrument and the treatment is weak, the proposed “weak” instrument may amplify biases due to small violations of conditions (2) or (3), producing very biased effect estimates.7,8 Alternatively, if the proposed instrument is very strong, it may be more likely to violate conditions (2) or (3); in the extreme, a perfect correlation between the proposed instrument and the observational treatment implies the proposed instrument is associated with the same set of (possibly unmeasured) confounders as the treatment.9 Most prior studies have evaluated the strength of their proposed instrument via risk differences, odds ratios, or partial R2 or F statistics like Davies et al5 did and several authors have recommended.6,8 (In contrast, the use of R2 or F statistics as effect measures is subject to many study-specific artifacts and should generally be avoided.10,11)
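The strength checks above can be sketched as follows on simulated data (variable names and parameters are illustrative). For a single dichotomous instrument, the first-stage F statistic is simply the square of the t statistic from the regression of treatment on the proposed instrument; this naive version ignores heteroskedasticity and covariates.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical dichotomous instrument Z and treatment T.
Z = rng.binomial(1, 0.5, n)
T = rng.binomial(1, 0.3 + 0.25 * Z)   # first-stage risk difference ~0.25

# Instrument strength reported as a risk difference.
rd = T[Z == 1].mean() - T[Z == 0].mean()

# First-stage F statistic for a single instrument: the squared t statistic
# from the OLS regression of T on Z (a simplification for illustration).
X = np.column_stack([np.ones(n), Z])
beta, *_ = np.linalg.lstsq(X, T, rcond=None)
resid = T - X @ beta
sigma2 = resid @ resid / (n - 2)               # residual variance
var_beta = sigma2 * np.linalg.inv(X.T @ X)     # OLS covariance matrix
F = beta[1] ** 2 / var_beta[1, 1]              # large F suggests a strong first stage
```

A conventional (and contested) rule of thumb treats F above roughly 10 as evidence against a weak first stage; the risk difference remains the more interpretable strength measure.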

Unlike condition (1), conditions (2) and (3) are not empirically verifiable: they are assumptions. Therefore, the next step is to use subject-matter knowledge to build a case for why the proposed instrument may be reasonably assumed to meet each of these two conditions; this is similar to how one would use subject-matter knowledge to justify the no unmeasured confounding assumption of traditional methods. When dichotomizing a polytomous or continuous treatment, condition (2) may hold for the original treatment, but not for its dichotomous version. Investigators can provide some support for condition (3) by showing the proposed instrument is not associated with measured confounders.6 Although any associations found with measured covariates could be adjusted for by modeling (see below), associations with measured confounders may raise concerns over associations with unmeasured confounders. Davies et al5 provide arguments to support conditions (2) and (3) when physician’s preference is proposed as an instrument; other authors have provided counterarguments.2
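The covariate-balance check described above can be sketched as follows (simulated data; the covariates X1 and X2 are hypothetical stand-ins for measured confounders such as prior GI complications). An association between the proposed instrument and a measured covariate does not prove a violation, but it raises concern about associations with unmeasured confounders.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical data: Z is the proposed instrument; X1, X2 are measured covariates.
Z = rng.binomial(1, 0.5, n)
X1 = rng.binomial(1, 0.2 + 0.1 * Z)   # imbalanced: Z is associated with X1
X2 = rng.binomial(1, 0.3, n)          # balanced across levels of Z

def risk_difference(cov, z):
    """Covariate prevalence difference across levels of the proposed instrument."""
    return cov[z == 1].mean() - cov[z == 0].mean()

rd1 = risk_difference(X1, Z)   # ~0.10: raises concern about condition (3)
rd2 = risk_difference(X2, Z)   # ~0.00: no evidence of imbalance
```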

In our review, we notice a tendency for conditions (2) and (3) to be treated as a single condition. Although these conditions are statistically similar, it is important to consider them separately in order to incorporate subject-matter knowledge in discussions about their validity and in decisions on adjustment. For example, Davies et al5 found some evidence that the proposed instrument is associated with prior gastrointestinal (GI) complications, suggesting that different types of patients are seen at different practices, thus violating condition (3) but not condition (2). They also found an association with gastro-protective drug use; this evidence suggests that physicians who prefer one type of NSAID may also prefer prescribing other medications concomitantly, which would violate condition (2) but not condition (3). On the other hand, if gastro-protective medications were prescribed because the NSAID induced GI symptoms, neither condition is violated and no adjustment for gastro-protective medications would be necessary.

Though conditions (2) and (3) cannot be proven true, it is sometimes possible to find empirical evidence against them; that is, conditions (2) and (3) can be falsified. The next step in the flowchart is to perform these falsification tests, which are based on leveraging prior causal assumptions, assessing inequalities that can detect extreme violations, making use of effect modifiers, or comparing estimates obtained from several potential instruments. Examples of each of these have been presented elsewhere.12 In our review, only 20% of studies explicitly reported a falsification test; none reported two or more. Bear in mind that falsification tests may fail to reject a proposed instrument even when conditions (2) and (3) are violated.
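One of the inequality-based falsification tests mentioned above is Pearl's instrumental inequality: for dichotomous Z, T, and Y, the sum over y of the maximum over z of P(Y=y, T=t | Z=z) must not exceed 1 for any t. A minimal sketch on simulated data (all names and parameters hypothetical); note that passing the test does not establish the conditions.

```python
import numpy as np

def instrumental_inequality_violated(y, t, z):
    """Pearl's instrumental inequality for binary Y, T, Z:
    for each t, sum over y of max over z of P(Y=y, T=t | Z=z) must be <= 1.
    A violation falsifies the instrumental conditions; passing does not prove them.
    """
    for t_val in (0, 1):
        total = 0.0
        for y_val in (0, 1):
            total += max(
                np.mean((y == y_val) & (t == t_val) & (z == z_val))
                / np.mean(z == z_val)
                for z_val in (0, 1)
            )
        if total > 1:
            return True
    return False

rng = np.random.default_rng(3)
n = 50_000

# Data generated to be consistent with conditions (1)-(3) should pass.
Z = rng.binomial(1, 0.5, n)
T = rng.binomial(1, 0.3 + 0.4 * Z)
Y = rng.binomial(1, 0.2 + 0.3 * T)
ok = instrumental_inequality_violated(Y, T, Z)        # False

# A gross violation of condition (2): Z determines Y directly, T is constant.
T2 = np.ones(n, dtype=int)
viol = instrumental_inequality_violated(Z.copy(), T2, Z)   # True
```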

Having an instrument is insufficient to obtain a point estimate for the causal effect. Conditions (1)–(3) only identify bounds,13,14 that is, lower and upper limits for the effect that are consistent with the data. One more untestable condition, often not given explicitly, is required to obtain a point estimate. Therefore, we suggest investigators estimate such bounds, either for the average treatment effect in the population13,14 or within subgroups.15,16 Bounds are typically very wide (even wider if one estimates 95% confidence intervals around them): for example, Davies et al5 report bounds that range from a strong protective effect to a strong harmful effect of COX-2–selective NSAIDs. However, these bounds are important because they convey the uncertainty about the causal effect when one combines the data with conditions (1)–(3). That is, the bounds show how much “information” needs to be provided by a fourth condition to fill in the blanks left by the data and the instrument.17 No studies in our review besides Davies et al5 reported bounds.
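The bounds described above can be sketched for dichotomous Z, T, and Y on simulated data. This sketch uses the simple "intersection" bounds (intersect, across levels of Z, the range of each counterfactual mean consistent with the data); the sharp Balke–Pearl bounds13 can be narrower, and a real analysis would also report confidence intervals around the bounds.

```python
import numpy as np

def ate_bounds(y, t, z):
    """Intersection bounds on the average treatment effect for binary Y, T, Z
    under conditions (1)-(3). Not the sharp Balke-Pearl bounds."""
    lo1, hi1 = -np.inf, np.inf   # bounds on E[Y under treatment]
    lo0, hi0 = -np.inf, np.inf   # bounds on E[Y under no treatment]
    for z_val in (0, 1):
        pz = np.mean(z == z_val)
        p_y1t1 = np.mean((y == 1) & (t == 1) & (z == z_val)) / pz
        p_y1t0 = np.mean((y == 1) & (t == 0) & (z == z_val)) / pz
        p_t1 = np.mean((t == 1) & (z == z_val)) / pz
        p_t0 = np.mean((t == 0) & (z == z_val)) / pz
        # Within each level of Z, the unobserved counterfactual outcomes can
        # range only over [0, 1]; intersect the implied ranges across levels.
        lo1, hi1 = max(lo1, p_y1t1), min(hi1, p_y1t1 + p_t0)
        lo0, hi0 = max(lo0, p_y1t0), min(hi0, p_y1t0 + p_t1)
    return lo1 - hi0, hi1 - lo0

# Hypothetical simulated data with unmeasured confounder U; true ATE is 0.20.
rng = np.random.default_rng(4)
n = 200_000
U = rng.binomial(1, 0.5, n)
Z = rng.binomial(1, 0.5, n)
T = rng.binomial(1, 0.1 + 0.5 * Z + 0.3 * U)
Y = rng.binomial(1, 0.2 + 0.2 * T + 0.3 * U)
lo, hi = ate_bounds(Y, T, Z)   # a wide interval that covers the true 0.20
```

The width of the resulting interval is exactly the "information" a fourth condition would need to supply for point identification.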

Whether estimating bounds or effects (see below), one needs to decide the causal effect of interest. Two options are the average treatment effect in the population and the local average treatment effect (LATE) in the subpopulation of “compliers.” For Davies et al,5 “compliers” are patients who would be prescribed a selective NSAID had they seen a physician who preferred selective NSAIDs, but would be prescribed a nonselective NSAID had they seen a physician who preferred nonselective NSAIDs. For nondichotomous instruments, LATE-like effects are weighted averages of the effect in multiple subgroups.2

The relative relevance of the average treatment effect and the LATE for understanding etiology and informing policy is debatable.17,18 Because "compliers" cannot be identified, the LATE is a causal effect in an unknown subset of the population; for example, if one found a beneficial effect of treatment, it would remain unclear which patients are the "compliers" who stand to benefit. Because the average treatment effect and the LATE will generally differ, investigators should be explicit about their reasons for choosing one over the other; only 61% of studies in our review were explicit.

If interest is in the average treatment effect, the untestable fourth condition for point estimation may be either condition (4c) of identical (ie, constant) treatment effect for all individuals in the population—generally impossible for dichotomous outcomes and highly unlikely for others—or the weaker homogeneity condition (4h) of no additive effect modification across levels of the instrument within the treated and the untreated (no effect modification only in the treated would allow valid estimation of the effect in the treated).2,19 The next step in the flowchart is to assess this homogeneity condition (4h) using subject-matter knowledge. Because assessing effect modification by the instrument conditional on treatment is not straightforward, Hernán and Robins2 presented a more natural (but still untestable) sufficient condition in terms of effect modification by confounders: for example, for Davies et al,5 condition (4h) would be violated if the causal effect of NSAIDs varies by underlying risk of GI complications, an unmeasured variable that likely informs treatment decisions.

If interest is in the LATE, investigators need to assess a different untestable condition (4m): monotonicity or no “defiers.” For Davies et al,5 “defiers” are patients who would be prescribed a nonselective NSAID had they seen a physician who preferred selective NSAIDs, but would be prescribed a selective NSAID had they seen a physician who preferred nonselective NSAIDs.20 Under some study designs monotonicity is plausible,20,21 but this may not be the case for many IV applications in observational studies.22 Monotonicity is generally implausible for the commonly proposed preference-based instruments, which are dichotomous, because such potential instruments are surrogates for a likely continuous preference variable.2

Although one cannot identify who is a “complier,” under conditions (1)–(3) and (4m) one can estimate the proportion of the study population that is composed of “compliers” when the instrument is dichotomous.20 For example, Davies et al’s5 LATE estimate is pertinent for 27% of the study population; the magnitude of the effect, if any, in the other 73% is unknown. Further, one can characterize the “compliers” in terms of their distribution of observed covariates relative to the study population.23 The next steps in our flowchart are to report the proportion of “compliers” and their relevant characteristics. In our review, 70% of studies with dichotomous instruments reported enough information to estimate the proportion of “compliers,” but only 5% (two studies) stated it explicitly; 10% of studies reported information to understand “complier” characteristics. The proportion of “compliers” ranged from 1% to 62% (median = 11%).
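The two quantities above can be sketched on simulated data (all names and parameters hypothetical) in which monotonicity holds by construction: the proportion of "compliers" is the first-stage risk difference, and "complier" characteristics follow the reweighting described by Angrist and Pischke,23 in which the prevalence of a covariate among "compliers" equals its overall prevalence scaled by the ratio of the covariate-specific first stage to the overall first stage.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Hypothetical data with a binary covariate X. Potential treatments are
# built so that treatment under Z=1 >= treatment under Z=0 for everyone:
# monotonicity (no "defiers") holds by construction.
X = rng.binomial(1, 0.4, n)
Z = rng.binomial(1, 0.5, n)
T0 = rng.binomial(1, 0.2, n)            # treatment that would occur under Z=0
C = rng.binomial(1, 0.2 + 0.3 * X)      # responds to the instrument if untreated
T = np.where(Z == 1, T0 | C, T0)        # observed treatment

def first_stage(t, z):
    return t[z == 1].mean() - t[z == 0].mean()

# Under conditions (1)-(3) and monotonicity, the first-stage risk
# difference equals the proportion of "compliers".
p_compliers = first_stage(T, Z)

# P(X=1 | complier): overall prevalence of X scaled by the ratio of the
# first stage among X=1 to the overall first stage.
fs_x1 = first_stage(T[X == 1], Z[X == 1])
p_x1_given_complier = X.mean() * fs_x1 / p_compliers   # enriched above X.mean()
```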

If conditions (1)–(3) and either (4h) or (4m) appear justifiable, it may be appropriate to obtain effect estimates using IV methods. The average treatment effect and LATE can be estimated nonparametrically using the standard19,20 or other2 IV estimators. Alternatively, one can use models that estimate these effects within levels of measured covariates, which means the requisite conditions need to hold conditional on these covariates. The last step in the flowchart calls for a detailed description of the modeling approach; Davies et al6 summarize previously used modeling approaches, including the common two-stage least squares methods and the structural mean models (used by Davies et al5) that rely on condition (4h). As always when adjusting for covariates, one should be aware that adjustment for some covariates (ie, colliders) may introduce new bias.24–26
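The standard IV (Wald) estimator and its two-stage least squares counterpart can be sketched as follows on simulated data with a constant treatment effect (condition 4c) of 2.0; all names and parameters are hypothetical. With a single dichotomous instrument and no covariates, the two computations coincide.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Hypothetical continuous-outcome data with unmeasured confounder U;
# the true treatment effect is 2.0 for everyone (condition 4c holds).
U = rng.normal(0, 1, n)
Z = rng.binomial(1, 0.5, n)
T = rng.binomial(1, 0.2 + 0.4 * Z + 0.2 * (U > 0))
Y = 2.0 * T + 1.5 * U + rng.normal(0, 1, n)

# Standard IV (Wald) estimator: ratio of instrument-outcome to
# instrument-treatment mean differences.
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (T[Z == 1].mean() - T[Z == 0].mean())

# The same estimate via explicit two-stage least squares.
X1 = np.column_stack([np.ones(n), Z])
t_hat = X1 @ np.linalg.lstsq(X1, T, rcond=None)[0]    # first stage: predict T from Z
X2 = np.column_stack([np.ones(n), t_hat])
tsls = np.linalg.lstsq(X2, Y, rcond=None)[0][1]       # second stage: regress Y on T-hat
# Caveat: standard errors from this naive second stage are invalid;
# real analyses should use dedicated 2SLS software.
```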

Our reporting flowchart does not cover more advanced IV analyses. For example, IV analyses based on structural nested models can incorporate time-varying instruments and treatments.27,28 Other models can incorporate multiple instruments to help with condition (1), although then the other three conditions must be jointly satisfied by all instruments. As for all causal inference methods, sensitivity analyses can illuminate how robust estimates are to violations of the untestable conditions. Some sensitivity analyses have been suggested for IV methods,7,29–31 but further development and implementation of such approaches are warranted.

Causal inference relies on transparency of assumptions and on triangulation of results from methods that depend on different sets of assumptions. Because of the wide 95% confidence intervals typical of IV estimates, the value added by using this approach will often be small. More importantly, IV methods present special challenges as relatively minor violations of the four conditions may result in large biases of unpredictable or counterintuitive direction. Further efforts are needed to understand the direction and magnitude of bias when the conditions do not hold perfectly. Until then, IV methods are particularly dangerous, which is why our flowchart conservatively suggests avoiding IV methods when the necessary conditions are unlikely to hold. At the very least, we hope that our flowchart will help investigators report IV analyses in such a way that colleagues can better evaluate the estimates.

ACKNOWLEDGMENTS

We thank Alan Brookhart, James Robins, Sander Greenland, Sonia Hernández-Díaz, Debbie Lawlor, and Dylan Small for helpful comments.

REFERENCES

1. Chen Y, Briesacher BA. Use of instrumental variable in prescription drug research with observational data: a systematic review. J Clin Epidemiol. 2011;64:687–700
2. Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist’s dream? Epidemiology. 2006;17:360–372
3. Greenland S. An introduction to instrumental variables for epidemiologists. Int J Epidemiol. 2000;29:1102
4. Brookhart MA, Rassen JA, Schneeweiss S. Instrumental variable methods in comparative safety and effectiveness research. Pharmacoepidemiol Drug Saf. 2010;19:537–554
5. Davies NM, Davey-Smith G, Windmeijer F, Martin RM. COX-2 selective nonsteroidal anti-inflammatory drugs and risk of gastrointestinal tract complications and myocardial infarction: an instrumental variable analysis. Epidemiology. 2013;24:352–362
6. Davies NM, Davey-Smith G, Windmeijer F, Martin RM. Issues in the reporting and conduct of instrumental variable studies: A systematic review. Epidemiology. 2013;24:363–369
7. Small DS, Rosenbaum P. War and wages: the strength of instrumental variables and their sensitivity to unobserved biases. J Am Statist Assoc. 2008;103:924–933
8. Bound J, Jaeger D, Baker R. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. J Am Statist Assoc. 1995;90:443–450
9. Martens EP, Pestman WR, de Boer A, Belitser SV, Klungel OH. Instrumental variables: application and limitations. Epidemiology. 2006;17:260–267
10. Greenland S, Schlesselman JJ, Criqui MH. The fallacy of employing standardized regression coefficients and correlations as measures of effect. Am J Epidemiol. 1986;123:203–208
11. Greenland S, Maclure M, Schlesselman JJ, Poole C, Morgenstern H. Standardized regression coefficients: a further critique and review of some alternatives. Epidemiology. 1991;2:387–392
12. Glymour MM, Tchetgen EJ, Robins JM. Credible Mendelian randomization studies: approaches for evaluating the instrumental variable assumptions. Am J Epidemiol. 2012;175:332–339
13. Balke A, Pearl J. Bounds on treatment effects for studies with imperfect compliance. J Am Statist Assoc. 1997;92:1171–1176
14. Pearl J. Imperfect experiments: Bounding effects and counterfactuals. In: Causality. 2009 New York City, NY Cambridge University Press:259–281
15. Richardson T, Robins JM. Analysis of the binary instrumental variable model. In: Dechter R, Geffner H, Halpern JY, eds. Heuristics, Probability, and Causality: A Tribute to Judea Pearl. 2010 London: College Publications:415–444
16. Cheng J, Small DS. Bounds on causal effects in three-arm trials with noncompliance. J R Stat Soc Series B. 2006;68:815–836
17. Robins JM, Greenland S. Comment: Identification of causal effects using instrumental variables. J Am Statist Assoc. 1996;91:456–458
18. Imbens GW. Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009) J Econ Lit. 2010;48:399–423
19. Robins JM. The analysis of randomized and nonrandomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In: Sechrest L, Freeman H, Mulley A, eds. Health Service Research Methodology: A Focus on AIDS. 1989 Washington, DC US Public Health Service:113–159
20. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Am Statist Assoc. 1996;91:444–455
21. Imbens GW, Angrist JD. Identification and estimation of local average treatment effects. Econometrica. 1994;62:467–475
22. Korn EL, Baumrind S. Clinician preferences and the estimation of causal treatment differences. Statistical Science. 1998;13:209–235
23. Angrist JD, Pischke J. Instrumental variables in action: sometimes you get what you need. In: Mostly Harmless Econometrics: An Empiricist's Companion. 2009 Princeton, NJ Princeton University Press:113–218
24. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10:37–48
25. Hernán MA, Hernández-Díaz S, Werler MM, Mitchell AA. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. Am J Epidemiol. 2002;155:176–184
26. Greenland S. Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology. 2003;14:300–306
27. Robins JM. Correcting for non-compliance in randomized trials using structural nested mean models. Commun Stat Theory Methods. 1994;23:2379–2412
28. Robins JM. Structural nested failure time models. In: Andersen PK, Keiding N, eds. The Encyclopedia of Biostatistics. 1998 Chichester, UK John Wiley and Sons:4372–4389
29. Baiocchi M, Small DS, Lorch S, Rosenbaum P. Building a stronger instrument in an observational study of perinatal care for premature infants. J Am Statist Assoc. 2010;105:1285–1296
30. Brookhart MA, Schneeweiss S. Preference-based instrumental variable methods for the estimation of treatment effects: assessing validity and interpreting results. Int J Biostat. 2007;3:14
31. Small DS. Sensitivity analysis for instrumental variables regression with overidentifying restrictions. J Am Statist Assoc. 2007;102:1049–1058

© 2013 by Lippincott Williams & Wilkins, Inc