We would like to congratulate Zubizarreta et al1 for a carefully done nonexperimental study on the effects of the 2010 Chilean earthquake on posttraumatic stress. The article illustrates a compelling mix of complementary techniques for improving a nonexperimental study—clever design elements (broadly termed “design sensitivity”) to increase power and robustness, propensity score matching to adjust for observed confounders, and analysis of sensitivity to potential unobserved confounders. Using rare longitudinal data on persons before and after the earthquake to minimize recall bias and ensure appropriate temporal ordering, they compare persons with very disparate exposure to the earthquake and allow for heterogeneous treatment effects, finding that posttraumatic stress is “dramatically but unevenly elevated” among affected residents. Although there are many interesting aspects to the article, we focus here on the idea of design sensitivity because we believe it to be unfamiliar to most epidemiologists, with few examples in the applied literature.
Design sensitivity can be thought of as a formalization of the general idea of using smart design elements to improve the validity of a nonexperimental study. Since the work of Campbell,2 Cornfield et al,3 Hill,4 and others, the assessment of threats to validity has played a central role in epidemiology and other applied sciences. Design sensitivity is a natural outgrowth of that earlier foundation. In particular, design sensitivity methods aim to produce studies that are less sensitive to unobserved confounding. Rosenbaum5–7 introduces and formalizes design sensitivity: “For a specific treatment effect and a specific research design with a large sample size, the design sensitivity asks the question of how much hidden bias would need to be present to render plausible the null hypothesis of no effect. The answer is a number, and it provides a quantitative comparison of alternative research designs. Other things being equal, we prefer the design that is less sensitive to hidden biases.”5 (p154) Examples of potential design elements include assessing dose-response relationships and formalizing tests of multiple hypotheses (eg, multiple outcomes) within one study.5
In their use of design sensitivity, Zubizarreta et al1 compare people with particularly high versus low exposure to the earthquake and tailor the analysis to the theory that earthquakes may have “uncommon but dramatic” effects6 on posttraumatic stress. The particular test used to increase design sensitivity at the analysis stage is Stephenson’s test, a variation of Wilcoxon’s signed-rank test, which allows for no effect in a subset of the sample but large effects in others.8 Together, these two strategies yield an effect estimate that is highly insensitive to unobserved confounders.
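To make the idea concrete, here is a minimal sketch of Stephenson’s signed-rank statistic for paired differences, with a one-sided p-value obtained from the usual sign-flip randomization null for paired data. The function names, the default subset size m = 5, and the permutation null are illustrative choices of ours, not details taken from Zubizarreta et al1; ties among the absolute differences are not handled.

```python
import numpy as np
from math import comb

def stephenson_statistic(d, m=5):
    """Stephenson's signed-rank statistic for paired differences d.

    Each absolute difference receives the score C(rank - 1, m - 1):
    the number of size-m subsets of the sample in which it is the
    largest.  The statistic sums the scores of the positive
    differences, so it is dominated by the largest responses.
    """
    d = np.asarray(d, dtype=float)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1  # ranks 1..n; ties not handled
    scores = np.array([comb(int(r) - 1, m - 1) for r in ranks])
    return float(scores[d > 0].sum())

def sign_flip_pvalue(d, m=5, n_perm=10_000, seed=0):
    """One-sided p-value under the paired randomization null,
    obtained by randomly flipping the signs of the differences."""
    rng = np.random.default_rng(seed)
    d = np.asarray(d, dtype=float)
    observed = stephenson_statistic(d, m)
    hits = sum(
        stephenson_statistic(d * rng.choice([-1.0, 1.0], size=d.size), m) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

With m = 2 the statistic is a shifted form of Wilcoxon’s signed-rank statistic; larger m concentrates weight on the largest differences, which is what tunes the test toward effects that are large for only a few people.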
One benefit is that design sensitivity offers a formal way to incorporate qualitative information and beliefs about expected patterns of effects. For example, if researchers really believe a priori that only a subset of the population will benefit from the exposure or treatment of interest (as in the article by Zubizarreta et al1), then the statistical design and corresponding tests should be tailored to address that question. Of concern, however, are the trade-offs when those prior beliefs are not accurate. For example, what do we lose by thinking there may be a small subgroup with large effects if in fact there really is a constant effect? As stated by Zubizarreta et al,1 “Using methods designed to detect such a response pattern will reduce sensitivity to bias if the anticipated pattern does, in fact, occur.” But what if it does not occur? And on what basis should the a priori theories be made? Subject matter knowledge? Previous studies? Theories developed using the data at hand should be considered exploratory until results can be replicated in other studies. One approach to something like replication but in the context of a single dataset is a strategy described by Heller et al,9 which uses a random (and small) subset of the sample at hand to help plan the analysis in the remaining sample.
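As a rough illustration of that split-sample idea, the sketch below reserves a small random planning subsample to choose the Stephenson subset size m, leaving the remainder of the data untouched for the confirmatory analysis. The 10% planning fraction, the candidate values of m, and all function names are hypothetical choices of ours; Heller et al9 describe the actual procedure.

```python
import numpy as np
from math import comb

def stephenson_pvalue(d, m, n_perm=2_000, seed=1):
    """Sign-flip p-value for Stephenson's signed-rank statistic with
    subset size m (compact version; ties not handled).  Sign flips
    leave the absolute ranks unchanged, so the scores are computed once."""
    rng = np.random.default_rng(seed)
    d = np.asarray(d, dtype=float)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1
    scores = np.array([comb(int(r) - 1, m - 1) for r in ranks])
    observed = scores[d > 0].sum()
    hits = sum(
        scores[rng.random(d.size) < 0.5].sum() >= observed for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

def split_sample_plan(d, plan_frac=0.1, candidate_m=(2, 5, 8), seed=0):
    """Use a small random planning subsample to choose the test, then
    return the untouched remainder for the confirmatory analysis."""
    rng = np.random.default_rng(seed)
    d = np.asarray(d, dtype=float)
    idx = rng.permutation(d.size)
    n_plan = max(2, int(plan_frac * d.size))
    plan, analysis = d[idx[:n_plan]], d[idx[n_plan:]]
    # choose m using the planning sample only, so the confirmatory
    # analysis is not biased by the choice
    best_m = min(candidate_m, key=lambda m: stephenson_pvalue(plan, m))
    return best_m, analysis
```

Because m is chosen using only the planning subsample, the test carried out on the remaining sample is not distorted by that choice, at the cost of a modest loss of sample size.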
A related concern is that the use of such methods can look like one is fishing for the largest effect, for example, by comparing those with the highest exposure to those with the lowest exposure. This reemphasizes the need to prespecify analyses. In some settings, there may also be a trade-off between enhancing design sensitivity and ensuring that the comparison remains relevant and of scientific interest. It should be made clear, for example, that the plan all along was to look for “uncommon but dramatic”6 responses to treatment and that this focus did not arise only after a main effect was not found.
Another issue that deserves further exploration is the estimand targeted by these methods. Stephenson’s test does not estimate an average treatment effect; rather, it is built on the assumption that the “average” is uninformative because many people have small (or no) effects, while only a few have large effects. Although that may be an important question and design for this investigation (and likely also for others), it seems potentially less useful for some clinical or policy purposes that require more generalizable inferences. An obvious next step for the Chilean study is to identify or predict the types of persons who may experience a large effect. More generally, it will be crucial to make sure that “clever” study designs still answer the scientific question at hand.
One potentially interesting result of attention to design sensitivity may be to encourage researchers to think more broadly about study design and how, for example, to convince reviewers that a given design is appropriate. Epidemiologists and applied researchers routinely use power analyses to demonstrate sufficient power to detect effects; design sensitivity methods provide an analogous way to assess the “power” of a nonexperimental study to be robust to unobserved confounding. In fact, robustness to unobserved confounding may be as important as traditional statistical power.
In conclusion, we thank Zubizarreta et al1 for providing readers of EPIDEMIOLOGY with these innovative methods and designs. We encourage epidemiologists to think in these terms—to be creative about study designs, using “choice as an alternative to control.”10 Making smart choices in a design sensitivity framework may help improve the answers to important scientific questions. We also encourage statisticians and methodologists to advance these ideas because it is likely that more exploration and development will be needed before the methods become widely accepted.
ABOUT THE AUTHORS
ELIZABETH A. STUART is an Associate Professor in the Departments of Mental Health and Biostatistics at Johns Hopkins Bloomberg School of Public Health (JHSPH). She is an expert on methods for causal inference, particularly propensity score methods for nonexperimental studies and methods for generalizing trial results to target populations. DAVID B. HANNA received his PhD in Epidemiology at JHSPH and is now an Instructor in the Department of Epidemiology and Population Health at Albert Einstein College of Medicine. He is a former teaching assistant for Elizabeth A. Stuart’s course, “Causal Inference for Medicine and Public Health” at JHSPH.
1. Zubizarreta JR, Cerdá M, Rosenbaum PR. Effect of the 2010 Chilean earthquake on posttraumatic stress: reducing unmeasured bias through study design. Epidemiology. 2013;24:79–87
2. Campbell DT. Factors relevant to the validity of experiments in social settings. Psychol Bull. 1957;54:297–312
3. Cornfield J, Haenszel W, Hammond EC, Lilienfeld AM, Shimkin MB, Wynder EL. Smoking and lung cancer: recent evidence and a discussion of some questions. J Natl Cancer Inst. 1959;22:173–203
4. Hill AB. The environment and disease: association or causation? Proc R Soc Med. 1965;58:295–300
5. Rosenbaum PR. Design sensitivity in observational studies. Biometrika. 2004;91:153–164
6. Rosenbaum PR. Confidence intervals for uncommon but dramatic responses to treatment. Biometrics. 2007;63:1164–1171
7. Rosenbaum PR. Design sensitivity and efficiency in observational studies. J Am Stat Assoc. 2010;105:692–702
8. Stephenson WR. A general class of one-sample nonparametric test statistics based on subsamples. J Am Stat Assoc. 1981;76:960–966
9. Heller R, Rosenbaum PR, Small DS. Split samples and design sensitivity in observational studies. J Am Stat Assoc. 2009;104:1090–1101
10. Rosenbaum PR. Choice as an alternative to control in observational studies. Stat Sci. 1999;14:259–278