Replicability, Reproducibility, and Fragility of Research Findings—Ultimately, Caveat Emptor

Vetter, Thomas R. MD, MPH; McGwin, Gerald Jr. MS, PhD; Pittet, Jean-Francois MD

doi: 10.1213/ANE.0000000000001398
The Open Mind

From the Departments of Anesthesiology and Perioperative Medicine and Epidemiology, University of Alabama at Birmingham, Birmingham, Alabama.

Accepted for publication April 2, 2016.

Funding: None.

The authors declare no conflicts of interest.

Reprints will not be available from the authors.

Address correspondence to Thomas R. Vetter, MD, MPH, Department of Anesthesiology and Perioperative Medicine, University of Alabama at Birmingham, 619 19th St. South, JT-862, Birmingham, AL 35249. Address e-mail to tvetter@uabmc.edu.

A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines. With consistency a great soul has simply nothing to do. He may as well concern himself with his shadow on the wall.

Ralph Waldo Emerson, Self-Reliance (1841)

It is traditionally held that the first comparative clinical trial was performed in 1747 by Dr. James Lind (1716–1794) of the British Royal Navy to identify a treatment for scurvy. Lind evenly assigned 12 scurvy-afflicted sailors aboard the HMS Salisbury to receive cider, vitriol (a weak acid), vinegar, sea water, oranges and lemons, or nutmeg paste.1 After 6 days, only the 2 sailors who had received oranges and lemons—and thus adequate amounts of vitamin C—had sufficiently recovered to return to duty.

Remarkably, it took approximately 50 years for the British Navy to make lemon juice a compulsory part of its sailors’ diet, and this regimen was in turn replaced by lime juice because it was cheaper—even though likely less effective for very long sea voyages.2,3 Hence, the 19th-century nickname “Limey” was applied to sailors in the Royal Navy.

Two centuries later, the advent of commercial pharmaceuticals and the evolution of methodologic concepts led to the first randomized controlled clinical trial, performed in 1947 to 1948, which observed the effects of streptomycin on pulmonary tuberculosis to be significantly different from those of a placebo.4,5

Since then, the volume of research and its published findings has grown tremendously; however, Ioannidis6,7 has raised the concern that the majority of the findings of these clinical trials, epidemiological studies, and molecular research may be false.

In this Open Mind article, we explore the current controversy and contribute to the debate surrounding the replicability, reproducibility, and fragility, and thus the uncertain validity, of the research findings published in many clinical and health care journals, including Anesthesia & Analgesia.

WHAT MIGHT BRADFORD HILL HAVE TO SAY?

In his classic 1965 article entitled “The Environment and Disease: Association or Causation?”8,9 Sir Austin Bradford Hill set forth 9 self-described “viewpoints” to be systematically addressed when attempting to infer causation from statistically significant associations observed in epidemiologic studies (Table 1).10

Specifically, Hill8,9 asked, “In what circumstances can [one] pass from [an] observed association to a verdict of causation?” Pertinent to the present discussion, Hill’s viewpoint of strength can alternatively be interpreted as effect size and that of consistency as reproducibility.

Hill’s 9 elements have since been widely applied as requisite “causal criteria” by clinicians and epidemiologists who assess association versus causation in a number of arenas.11 However, Hill8,9 never intended his 9 “viewpoints” to be applied as “hard-and-fast rules…to bring evidence for or against the cause-and-effect hypothesis” but rather “to help us make up our minds on the fundamental question—is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect?”

Hill’s salient proviso has been reiterated more recently by Phillips and Goodman11 as the 3 frequently “missed lessons”: do not equate (1) statistical significance with substantial association, (2) an observed simple association with definitive causation, or (3) statistical precision with scientific validity. Ioannidis12 has offered a rigorous, contemporary reinterpretation, and in large part a refutation, of Hill’s classic criteria.

Ioannidis7 also has proposed a thought-provoking series of research practices that may help increase the proportion of “true research findings” (Table 2). As Ioannidis12 has done with Bradford Hill’s criteria, we offer here our critique of the practicality of Ioannidis’ series of proposed research practices (Table 2).

REPLICABILITY VERSUS REPRODUCIBILITY

Ioannidis7 has proposed the “adoption of replication culture” and “reproducibility practices” (Table 2). A key distinction, however, should be made between the replicability versus the reproducibility of research findings.13,14 Replicability is the ability to obtain identical results when an experiment or trial is performed under precisely identical conditions.13 Reproducibility refers to an outcome that can be predicted to recur even when experimental conditions may vary to some degree.13 As concisely observed by Drummond,14 “Reproducibility requires changes; replicability avoids them. Although reproducibility is desirable, I contend that the impoverished version, replicability, is one not worth having.”

We contend that in medical research—and especially in clinical outcomes and health services research—changes in the study subjects and clinical setting, whether at the individual or the population level, are both inevitable and desirable, because they serve to confirm or to challenge the external validity of initially reported findings. Furthermore, true replicability is, as a practical matter, unrealistic.

THE DILEMMA OF REPLICABILITY/REPRODUCIBILITY AND ONE PROPOSED YET AT LEAST CURRENTLY UNTENABLE SOLUTION

It has been opined that “there may be no more important issue for authors and reviewers than the question of reproducibility (replicability), a bedrock principle in the conduct and validation of experimental science”13; however, the same authors further observed:

Given the requirement for reproducibility in experimental science, we face two apparent contradictions. First, published science is expected to be reproducible, yet most scientists are not interested in replicating published experiments or reading about them. Many reputable journals…are unlikely to accept manuscripts that precisely replicate published findings, despite the explicit requirement that experimental protocols must be reported in sufficient detail to allow repetition. This leads to a second paradox that published science is assumed to be reproducible, yet only rarely is the reproducibility of such work tested or known. In fact, the emphasis on reproducing experimental results becomes important only when work becomes controversial or called into doubt.13

One possible minimal standard that can fill the void between full replication and nothing is “reproducible research,” which requires that complete data sets and computer code be made available to others to verify published results and to conduct alternative analyses.15 Ioannidis7 has likewise proposed greater “registration (of studies, protocols, analysis codes, datasets, raw data, and results)” and “sharing (of data, protocols, materials, software, and other tools)” (Table 2).

Although this may be feasible in basic science and computational science, such open access to complete human subjects research data sets is innately fraught with problems. In the United States, these original patient-level, raw clinical data are considered protected health information under the 1996 federal Health Insurance Portability and Accountability Act (HIPAA). We do not envision that an IRB or research ethics committee will allow a medical journal access to these raw data without a HIPAA-compliant data use agreement. Furthermore, it is unlikely that a journal and its parent organization will want to assume the responsibility and the attendant major civil and criminal liability for maintaining such a shared repository of protected health information.

It should be noted that the National Institutes of Health (NIH) expects that the results and accomplishments of the activities that it funds be made available to the research community and to the public at large.a Nevertheless, the “NIH recognizes that data sharing may be complicated or limited, in some cases, by institutional policies or local IRB rules, as well as by local, state and Federal laws and regulations like the Privacy Rule.”b In February 2015, with “Plan for Increasing Access to Scientific Publications and Digital Scientific Data from NIH Funded Scientific Research,” the NIH noted that it “will explore the development of policies to require NIH-funded researchers to make the data underlying the conclusions of peer-reviewed scientific research publications freely available in public repositories at the time of initial publication in machine readable formats.”c

Perhaps medical journals like Anesthesia & Analgesia can follow the lead of the NIH, if the policy, infrastructure, and security requirements for such sharing of human subjects research data do not prove to be excessively onerous.

ANOTHER PROPOSED YET LIKELY SEISMIC SOLUTION

Nonreproducibility in scientific studies can be attributed to flawed research designs, improper statistical analyses, or scientific misconduct.16 We focus here on problems that can be attributed to the use of widely accepted and hence applied hypothesis and inferential statistical testing procedures.16,17

Ioannidis7 has proposed the use of “more appropriate statistical methods” and “more stringent thresholds for claiming discoveries or ‘successes’” (Table 2). Johnson16 has similarly observed that concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of classical statistical testing at unjustifiably high levels of significance (i.e., large α values for the risk of a type I error). To correct this problem, he argues, the evidence thresholds required for declaring a significant finding should be raised, especially for declaring a “highly significant” finding.

For the conventional frequentist hypothesis testing used in the majority of clinical trials, Johnson,16 like Ioannidis,7 has proposed that more rigorous evidence standards—and hence ostensibly more reproducible findings—would require routinely conducting hypothesis testing with a P value threshold of 0.005 or 0.001 for concluding significance.

It is conventionally held that a small P value implies a high likelihood that the alternative hypothesis is correct (i.e., the clinical intervention made a real difference) and that a repetition or replication of the same experiment will have a high probability of again resulting in statistical significance.18,19 It should be clearly noted that the probability of obtaining a significant finding in a subsequent study—given a significant finding in the first study—depends largely on the assumed effect size in the second study being the same as the observed effect size in the first study.18 If the true effect size is instead smaller, then the probability of finding a significant difference in the subsequent study would be substantially lower; the second study could thus simply be underpowered to detect a meaningful but smaller absolute treatment difference. These previously observed and currently assumed effect sizes need to be plainly stated by the investigators.

Given these assumed (and plainly stated) equal true effect sizes, as shown by Cumming,19 if an initial experiment results in 2-tailed P = 0.05, there is an 80% chance that the 1-tailed P value from a replicated experiment will fall in the interval (0.00008–0.44) and a 10% chance of a P > 0.44. As noted by Cumming,19 “it is more natural to consider one-tailed P-values given by replications…because the initial experiment indicates a direction for the effect—although of course this could be wrong.”
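
Cumming’s19 80% prediction interval can be checked with a short Monte Carlo sketch. The framing below is our simplification of his setup, not his derivation: the initial 2-tailed P = 0.05 is treated as an observed z of approximately 1.96, the unknown true effect carries the uncertainty implied by that single observation, and the replication uses the same design; the variable names and the use of Python with NumPy/SciPy are our assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sim = 200_000

z_initial = norm.ppf(1 - 0.05 / 2)   # observed z for an initial 2-tailed P = 0.05 (~1.96)

# Given only the initial result, the replication z carries 2 units of sampling error:
# 1 for the uncertainty about the true effect and 1 for the replication itself,
# so its predictive distribution is N(z_initial, sqrt(2)).
true_effect = rng.normal(z_initial, 1.0, n_sim)   # uncertainty about the true effect
z_replicate = rng.normal(true_effect, 1.0)        # sampling error of the replication
p_one_tailed = norm.sf(z_replicate)               # 1-tailed P value of the replication

lo, hi = np.quantile(p_one_tailed, [0.10, 0.90])
print(f"80% P interval: ({lo:.5f}, {hi:.2f})")                          # ~ (0.00008, 0.44)
print(f"P(replication P > 0.44) = {np.mean(p_one_tailed > 0.44):.2f}")  # ~ 0.10
```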

As observed by Shafer and Dexter,17 many journal readers and other stakeholders thus often erroneously interpret a very small P value (e.g., 0.006) as strong evidence that the study results are surely replicable and thus represent some more widely applicable underlying truth. Shafer and Dexter17 further noted that, applying Cumming’s methodology, ensuring that an identically replicated study would achieve P < 0.05 would require the initial study to have a P value of 0.0001 or smaller.

We agree with Cumming’s19 recommendation that confidence intervals for point estimates be routinely reported, not simply P values. Nevertheless, as per Cumming and Maillardet,20 “on average, a 95% confidence interval will include just 83.4% of future replication means.” We also endorse Shafer and Dexter’s17 recommendation that authors report actual P values (not simply P < 0.05). However, although requiring P < 0.005 or P < 0.001 may or may not actually enhance the likelihood of replicability, this more stringent cut point would be an onerous paradigm shift and would likely not ensure the more pertinent and desirable construct of reproducibility, as defined earlier by Drummond.14
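
Cumming and Maillardet’s20 83.4% figure also follows from a short calculation under the usual normal model with a known standard error, a simplification of their derivation: the original and replication means each carry 1 standard error of sampling variability, so their difference has a standard error of √2 standard errors.

```python
from math import sqrt
from scipy.stats import norm

# Original 95% CI: m1 ± 1.96·SE. A replication mean m2 from the same design and the
# same true mean satisfies m2 - m1 ~ N(0, sqrt(2)·SE), so the capture probability
# is P(|m2 - m1| < 1.96·SE).
z_crit = norm.ppf(0.975)                       # ~1.96
capture = 2 * norm.cdf(z_crit / sqrt(2)) - 1
print(f"Expected capture of a future replication mean: {capture:.1%}")   # ~83.4%
```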

THE FRAGILITY OF FINDINGS

Ioannidis7 has proposed “large-scale collaborative research” (Table 2). Devereaux et al.21 similarly observed a largely unmet need for such large-scale clinical studies in perioperative medicine. Sessler and Devereaux22 subsequently identified 3 emerging trends in clinical trial design: (1) (larger) trial size, (2) factorial randomization, and (3) composite outcomes. Although all 3 are relevant to the present discussion, we focus here on trial size.

A large sample size can effectively evaluate a small yet clinically important treatment effect and affords a better estimate of that effect.22 However, a large sample presents the limitations of greater cost, longer data-acquisition periods, and the possible need for multiple collaborating centers.21,22

Historically, small-scale randomized controlled trials, with total sample sizes of 60 to 100, have commonly been published in the anesthesia, critical care, and pain medicine literature. With many complex, multifactorial perioperative conditions, however, only a moderate treatment effect size or relative risk reduction of 15% to 30% can be expected from a single intervention.21,22 Furthermore, clinically very important, dichotomous perioperative complications (e.g., myocardial infarction, stroke, acute renal failure, sepsis, and death) occur at low event rates. Consequently, very large sample sizes, upward of 10,000 patients, are needed to demonstrate statistical significance for a clinically important observed treatment effect.21,22
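
The sample size arithmetic behind these statements can be sketched with the standard normal-approximation formula for comparing 2 proportions. The baseline event rate, relative risk reduction, α, and power used below are illustrative assumptions of ours, not values taken from the cited trials.

```python
from math import ceil, sqrt
from scipy.stats import norm

def n_per_group(p_control: float, rrr: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for comparing 2 proportions (normal approximation)."""
    p_treat = p_control * (1 - rrr)           # event rate after the relative risk reduction
    p_bar = (p_control + p_treat) / 2
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar)) +
                 z_b * sqrt(p_control * (1 - p_control) + p_treat * (1 - p_treat))) ** 2
    return ceil(numerator / (p_control - p_treat) ** 2)

# Illustrative (assumed) values: a 5% baseline complication rate and a 25% relative
# risk reduction already require several thousand patients per arm.
n = n_per_group(p_control=0.05, rrr=0.25)
print(f"{n} patients per group ({2 * n} total)")
```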

As first reported in 2005 by Ioannidis,23 initially stronger and subsequently contradicted treatment effects are not unusual in highly cited research on clinical interventions and their outcomes. A subsequent 2014 study by Ioannidis and colleagues24 identified only 37 published reanalyses of randomized controlled trials, with 35% of these reporting salient findings and conclusions different from those of the original article.

A closely related and commonly underappreciated problem is that small-scale trials, even when statistically significant, can often prove to be wrong because of their “fragility.”22 Fragility of findings is characterized by a substantial change in the P value (e.g., from 0.02 to 0.13) with a very small change in the number of patients experiencing the dichotomous primary outcome or event (e.g., myocardial infarction or stroke) in the treatment group.21,22

Because the statistically significant results of many clinical trials “hinge on a small number of events,” the “fragility index” has been developed to complement the P value and to identify possibly less robust results.25 The fragility index is calculated by sequentially (1) adding an event to the group with the smaller number of events; (2) subtracting a nonevent from the same group to keep the total number of patients constant; and (3) recalculating the 2-sided P value for the Fisher exact test. Events are iteratively reassigned in this way until the calculated P value first becomes ≥0.05. The number of reassigned events required to reach a P value ≥0.05 is the fragility index for a given trial result.25
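
For illustration, the calculation just described can be scripted directly. The following is a minimal sketch, not the program used by Walsh et al.,25 and the example event counts are hypothetical; it assumes Python with SciPy’s Fisher exact test.

```python
from scipy.stats import fisher_exact

def fragility_index(events_a: int, n_a: int, events_b: int, n_b: int,
                    alpha: float = 0.05) -> int:
    """Minimal sketch of the fragility index described above (Walsh et al.25)."""
    # Work on the group with the smaller number of events.
    if events_b < events_a:
        events_a, n_a, events_b, n_b = events_b, n_b, events_a, n_a
    index = 0
    _, p = fisher_exact([[events_a, n_a - events_a], [events_b, n_b - events_b]])
    while p < alpha and events_a < n_a:
        events_a += 1        # reassign 1 nonevent as an event; group size stays constant
        index += 1
        _, p = fisher_exact([[events_a, n_a - events_a], [events_b, n_b - events_b]])
    return index             # number of reassignments needed to reach P >= alpha

# Hypothetical trial: 5/100 events with treatment vs 18/100 with control.
print(fragility_index(5, 100, 18, 100))
```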

A recent analysis examined 399 randomized controlled trials, published in high-impact journals, that reported at least 1 dichotomous or time-to-event outcome as statistically significant (P < 0.05 or a 95% confidence interval excluding the null value).25 These trials had a median sample size of 682 patients and a median of 112 events, and 53% reported a P value <0.01. The median fragility index was 8 (absolute range, 0–109), and 25% of the trials had a fragility index of 3 or less.25

Importantly, as evidenced by a narrower confidence interval, a larger study will give a more precise, and thus less fragile, estimated treatment effect than a smaller study. It has accordingly been proposed that the narrower the confidence interval for the observed point estimate, the stronger the inference and the more “durable” the study findings.26 However, the aforementioned fragility index for P values corresponds to the number of events required to change the bounds of a 95% confidence interval from one that excludes the null value to one that includes it. The 95% confidence interval may thus suffer from the same limitations as the P value.25

The findings of a small-scale study, with a group sample size of <100 or even 200, are more prone to fragility than those of a large-scale study with a group sample size of more than 1000.21 Furthermore, fragility is considerably magnified when the proportion of patients with the event of interest is low. In other words, with n = 100 per group, the results of a study are much more fragile when the event rate for the primary outcome is 1% than when it is 20%.27
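
To illustrate this event-rate effect, the sketch below recomputes the Fisher exact P value as treatment-group nonevents are reassigned as events, for 2 hypothetical trials of 100 patients per group: 1 with low event rates (1% vs 10%) and 1 with high event rates (20% vs 40%). Both begin statistically significant; the event counts are our assumptions, chosen only for illustration.

```python
from scipy.stats import fisher_exact

# Two hypothetical trials, each with n = 100 per group, both initially significant.
# Reassigning a handful of treatment-group nonevents as events shows how much sooner
# the low-event-rate result crosses P >= 0.05.
scenarios = {
    "low event rates (1% vs 10%)":   (1, 10),
    "high event rates (20% vs 40%)": (20, 40),
}

for label, (treat_events, control_events) in scenarios.items():
    print(label)
    for added in range(4):               # reassign 0 to 3 treatment-group nonevents
        a = treat_events + added
        _, p = fisher_exact([[a, 100 - a],
                             [control_events, 100 - control_events]])
        print(f"  {added} reassigned: P = {p:.3f}")
```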

Researchers, editors, reviewers, and readers should thus be cautious in interpreting and applying marginally statistically significant results that likely demonstrate such fragility because the observed/reported results might easily no longer be significant if the study were repeated (i.e., exactly replicated).21,22

However, despite their potential fragility, smaller scale trials can generate very important preliminary data and hypotheses for larger scale trials. Initial smaller scale clinical trials are also an opportunity (a) to test key elements of clinical trial design, including the feasibility of patient selection and enrollment, (b) to generate estimates of effect size, and (c) to determine whether the therapy or intervention has the intended biological or clinical effect.28 As candidly observed by Matthay and Liu,28 “it is important that new therapies are tested, and clinicians and researchers learn from mistakes related to these (preliminary) studies.”

CONCLUSIONS

There is undoubtedly much room for improvement in research practices and journal publication policies, but we recommend a measured, transparent, and inclusive solution. To this end, we wholeheartedly endorse Ioannidis’7 proposed “improvement of study design standards” and “improvements in peer review, reporting, and dissemination of research” (Table 2). As described in our editorial in this month’s issue of Anesthesia & Analgesia,29 the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network has been created to monitor, propagate, and promote the proper use of guidelines for study design and reporting of research findings.30 The primary goal of the EQUATOR Network is “to improve the quality of scientific publications by promoting transparent and accurate reporting of health research.”d Anesthesia & Analgesia has thus incorporated the recommendations of the EQUATOR Network into its new “Instructions for Authors.”e

DISCLOSURES

Name: Thomas R. Vetter, MD, MPH.

Contributions: This author helped write the manuscript.

Attestation: Thomas R. Vetter approved the final manuscript.

Name: Gerald McGwin, Jr., MS, PhD.

Contributions: This author helped write the manuscript.

Attestation: Gerald McGwin, Jr., approved the final manuscript.

Name: Jean-Francois Pittet, MD.

Contributions: This author helped write the manuscript.

Attestation: Jean-Francois Pittet approved the final manuscript.

This manuscript was handled entirely by: Edward J. Mascha, PhD.

FOOTNOTES

a National Institutes of Health: NIH Sharing Policies and Related Guidance on NIH-Funded Research Resources. Available at https://grants.nih.gov/grants/sharing.htm. Accessed March 21, 2016.

b National Institutes of Health: Frequently Asked Questions, Data Sharing. Available at http://grants.nih.gov/grants/policy/data_sharing/data_sharing_faqs.htm#924. Accessed March 21, 2016.

c National Institutes of Health: Plan for Increasing Access to Scientific Publications and Digital Scientific Data from NIH-Funded Scientific Research. Available at https://grants.nih.gov/grants/NIH-Public-Access-Plan.pdf. Accessed March 21, 2016.

d Enhancing the QUAlity and Transparency Of health Research (EQUATOR) Network. Available at http://www.equator-network.org/. Accessed March 1, 2016.

e Instructions for Authors—Anesthesia & Analgesia and A&A Case Reports. Available at http://www.editorialmanager.com/aa/default.aspx. Accessed March 15, 2016.

REFERENCES

1. Collier R. Legumes, lemons and streptomycin: a short history of the clinical trial. CMAJ 2009;180:23–4.
2. Bhatt A. Evolution of clinical research: a history before and beyond James Lind. Perspect Clin Res 2010;1:6–10.
3. Baron JH. Sailors’ scurvy before and after James Lind—a reassessment. Nutr Rev 2009;67:315–32.
4. D’Arcy Hart P. A change in scientific approach: from alternation to randomised allocation in clinical trials in the 1940s. BMJ 1999;319:572–3.
5. Medical Research Council Streptomycin in Tuberculosis Trials Committee. Streptomycin treatment of pulmonary tuberculosis. Br Med J 1948;2:769–82.
6. Ioannidis JP. Why most published research findings are false. PLoS Med 2005;2:e124.
7. Ioannidis JP. How to make more published research true. PLoS Med 2014;11:e1001747.
8. Hill AB. The environment and disease: association or causation? Proc R Soc Med 1965;58:295–300.
9. Hill AB. The environment and disease: association or causation. Bull World Health Organ 2005;83:796–8.
10. Höfler M. The Bradford Hill considerations on causality: a counterfactual perspective. Emerg Themes Epidemiol 2005;2:11.
11. Phillips CV, Goodman KJ. The missed lessons of Sir Austin Bradford Hill. Epidemiol Perspect Innov 2004;1:3.
12. Ioannidis JP. Exposure-wide epidemiology: revisiting Bradford Hill. Stat Med 2016;35:1749–62.
13. Casadevall A, Fang FC. Reproducible science. Infect Immun 2010;78:4972–5.
14. Drummond C. Replicability is not reproducibility: nor is it good science. 26th International Conference on Machine Learning, Montreal, Quebec, Canada, 2009.
15. Peng RD. Reproducible research and biostatistics. Biostatistics 2009;10:405–8.
16. Johnson VE. Revised standards for statistical evidence. Proc Natl Acad Sci U S A 2013;110:19313–7.
17. Shafer SL, Dexter F. Publication bias, retrospective bias, and reproducibility of significant results in observational studies. Anesth Analg 2012;114:931–2.
18. Goodman SN. A comment on replication, p-values and evidence. Stat Med 1992;11:875–9.
19. Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspect Psychol Sci 2008;3:286–300.
20. Cumming G, Maillardet R. Confidence intervals and replication: where will the next mean fall? Psychol Methods 2006;11:217–27.
21. Devereaux PJ, Chan MT, Eisenach J, Schricker T, Sessler DI. The need for large clinical studies in perioperative medicine. Anesthesiology 2012;116:1169–75.
22. Sessler DI, Devereaux PJ. Emerging trends in clinical trial design. Anesth Analg 2013;116:258–61.
23. Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA 2005;294:218–28.
24. Ebrahim S, Sohani ZN, Montoya L, Agarwal A, Thorlund K, Mills EJ, Ioannidis JP. Reanalyses of randomized clinical trial data. JAMA 2014;312:1024–32.
25. Walsh M, Srinathan SK, McAuley DF, Mrkobrada M, Levine O, Ribic C, Molnar AO, Dattani ND, Burke A, Guyatt G, Thabane L, Walter SD, Pogue J, Devereaux PJ. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. J Clin Epidemiol 2014;67:622–8.
26. Poole C. Low P-values or narrow confidence intervals: which are more durable? Epidemiology 2001;12:291–4.
27. Mascha EJ. Alpha, beta, meta: guidelines for assessing power and type I error in meta-analyses. Anesth Analg 2015;121:1430–3.
28. Matthay MA, Liu KD. New strategies for effective therapeutics in critically ill patients. JAMA 2016;315:747–8.
29. Pittet JF, Vetter TR. Continuing the terra firma and establishing a new EQUATOR for Anesthesia & Analgesia. Anesth Analg 2016;123:8–9.
30. Simera I, Altman DG, Moher D, Schulz KF, Hoey J. Guidelines for reporting health research: the EQUATOR network’s survey of guideline authors. PLoS Med 2008;5:e139.
© 2016 International Anesthesia Research Society