Publication Bias, Retrospective Bias, and Reproducibility of Significant Results in Observational Studies

Shafer, Steven L., MD*; Dexter, Franklin, MD, PhD

doi: 10.1213/ANE.0b013e31824a0b5b
Editorials

From the *Department of Anesthesiology, Columbia University, New York, New York; and Department of Anesthesia, University of Iowa, Iowa City, Iowa.

Funding: None.

Conflicts of Interest: See Disclosures at the end of the article.

Reprints will not be available from the authors.

Address correspondence to Steven L. Shafer, MD, 630 West 168th Street, P&S Box 46, New York, NY 10032. Address e-mail to

Accepted January 4, 2012

In this issue of Anesthesia & Analgesia, Greenberg and colleagues reveal an unexpected association between rain and the annual meeting of the Society of Pediatric Anesthesia (SPA).1 This is not a small effect! The odds ratio for rain on the first day of the SPA meeting is 2.63 when compared to historical controls. The finding is statistically significant, with a P value of 0.006. The authors conclude that “further investigation is warranted” and recommend that “SPA leadership may use [its] rainmaking ability to benefit drought-stricken areas.”

This is not the April Fool's issue of Anesthesia & Analgesia. The report by Greenberg and colleagues went through multiple rounds of peer review. Dr. Greenberg provided the raw data for his analysis, which we both vetted. The association demonstrated by Dr. Greenberg is real. All statistical assumptions of the mixed (random) effects logistic regression model were satisfied.

It is also complete nonsense. There is no chance that the weather is influenced by the SPA meeting, and it is equally impossible that the SPA leaders can predict the weather when they schedule their meeting. Despite the large effect and high level of statistical certainty, the finding of an association between rain and the SPA meeting is pure chance. We published Greenberg et al.'s report because it is an outstanding example of 2 types of bias common in the medical literature, and is an opportunity to comment on the relationship between small P values and reproducibility.

Publication bias has been discussed extensively, including a report in this issue of the journal.2 A submission to Anesthesia & Analgesia that documented no association between rain and the SPA meeting would be rejected outright, probably with a polite recommendation that the author seek research mentorship or mental health counseling. We published the report by Greenberg and colleagues because of the large effect and small P value. The thousands of counterexamples showing no association between rain and other meetings would not have been published.

Retrospective studies are intrinsically biased by our innate human ability to discern patterns in random noise. Consider the digits of π, which are random but full of patterns.a Starting at the 69th digit after the decimal point is a sequence of 10 consecutive even digits. The probability that any given run of 10 digits is all even is (1/2)^10, or 1 in 1024. We would not expect to find 10 consecutive even digits in the first 100 digits of π, but we do, because random noise, by its very nature, contains patterns. Our ability to see patterns in noise is why we are able to see the face of Jesus on a pierogi,b the face of Satan in the smoke from the World Trade Center fire,c and an unidentified face on the surface of Mars.d We excel at finding patterns in noise.

The results of many retrospective studies are published after observant investigators notice an unexpected association in their patients. Such observations have led to medical breakthroughs, such as finding an association between an unidentified bacterium and chronic gastritis,3 leading to the discovery of the causative role of Helicobacter pylori in peptic ulcer disease. However, many retrospective associations are just patterns in random noise, like the association of rain and the SPA meeting.

Our ability to discern patterns in random noise introduces profound bias into retrospective studies. After noticing a pattern, we formally analyze the data, and our statistical analysis confirms the pattern. This is expected, of course, because we analyzed the pattern only after we saw the association. It is as though someone read out loud the digits of π, and after hearing “6 4 0 6 2 8 6 2 0 8” (decimal digits 69 through 78), looked back at the last 10 numbers and determined with a high level of certainty (P < 0.001) that the digits of π are even.
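The run of even digits in π can be checked directly. The short Python sketch below is an illustration, not part of the original analysis: it hard-codes the first 100 decimal digits of π, finds the longest run of even digits, and compares it with what chance alone would predict. The function name `longest_even_run` is our own, and positions are counted from the first digit after the decimal point.

```python
# First 100 decimal digits of pi (the leading 3 is not included).
PI_DECIMALS = (
    "1415926535" "8979323846" "2643383279" "5028841971" "6939937510"
    "5820974944" "5923078164" "0628620899" "8628034825" "3421170679"
)


def longest_even_run(digits):
    """Return (start, length) of the longest run of even digits.

    `start` is 1-based; zero counts as an even digit.
    """
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for position, ch in enumerate(digits, start=1):
        if int(ch) % 2 == 0:
            if run_len == 0:
                run_start = position
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len


start, length = longest_even_run(PI_DECIMALS)
run = PI_DECIMALS[start - 1 : start - 1 + length]

# Chance that one pre-specified window of 10 random digits is all even.
p_window = 0.5 ** 10
# Expected number of all-even windows among the 91 possible windows of
# length 10 in 100 digits, if each digit were independently random.
expected_windows = (len(PI_DECIMALS) - 10 + 1) * p_window

print(f"longest even run: {run} (decimal digits {start}-{start + length - 1})")
print(f"P(one window all even) = {p_window:.6f}")
print(f"expected all-even windows in 100 digits = {expected_windows:.3f}")
```

The calculation that yields P < 0.001 treats the 10-digit window as if it had been chosen in advance. It was not: the window was selected only after the pattern was noticed, which is exactly the retrospective bias described above.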

It is no different if an investigator notices that a certain anesthetic technique has worked in a particular population of patients. A statistical analysis will probably demonstrate significance, simply because of the bias introduced by the investigator's seeing a pattern in random noise. Adjustment of a retrospective analysis for observer bias is difficult. Generally, we try the best that we can, but most importantly treat retrospective studies as hypothesis generating. The analysis identifies an association that might be true. The only way to know, for certain, is to perform a prospective trial, ideally using a double-blind randomized design.

Readers might interpret the P value of 0.006 for the association between rain and the SPA meeting as strong evidence that the study results are surely reproducible. That is wrong. As documented by Goodman,4 and more recently by Cumming,5 the smaller the P value, the larger the chance of reproducing a significant finding. However, P values need to be surprisingly small to achieve a desired chance of reproducibility. If you perform a conventional power analysis, then you will likely want your upcoming study to have a 90% chance of finding a clinically significant difference at P < 0.05. However, consider the better criterion: powering your study so that your results of 2-sided P < 0.05 are reproducible. In other words, you want a 90% chance that if your upcoming study finds a significant difference at P < 0.05, then an identical second trial (e.g., by an investigator who doesn't believe your findings) would also identify the difference at P < 0.05. Surprisingly, the P value for the first trial must be 0.0001 or less.e This is why we ask authors to report the actual P value, not “P < 0.05,” when the P value exceeds 0.0001.
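The relationship between a first-trial P value and the chance of replication can be sketched numerically. The Python example below is a minimal illustration under a stated assumption, and may not match equation (B4) of Cumming's paper term for term: it assumes that the z statistic of an identical second trial is normally distributed around the z statistic of the first trial with variance 2, and it counts only replications in the same direction as the first result. The function name `replication_probability` is our own.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def replication_probability(p_first, alpha=0.05):
    """Approximate chance that an identical second trial reaches
    2-sided P < alpha in the same direction as the first trial.

    Assumption (ours, for illustration): the second trial's z statistic
    is normal, centered on the first trial's z, with variance 2.
    """
    z_first = STD_NORMAL.inv_cdf(1 - p_first / 2)  # z for the first trial
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)     # about 1.96 for alpha = 0.05
    return STD_NORMAL.cdf((z_first - z_crit) / sqrt(2))


for p in (0.05, 0.006, 0.00016):
    chance = replication_probability(p)
    print(f"first-trial P = {p}: chance of replication ~ {chance:.2f}")
```

Under this assumption, a first-trial P of exactly 0.05 gives only an even chance of replication, the P of 0.006 reported for rain and the SPA meeting gives roughly 70%, and P = 0.00016 gives about 90%, leaving a chance of failure near the 0.0996 quoted in footnote e.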

What is the likelihood that a prospective study of rain and the SPA meeting will verify the observed association? For a prospective study of limited duration, the probability is low, but not zero. That is the nature of randomness, and the limited utility of the small P value in predicting reproducibility.

We thank Drs. Greenberg, Bembea, and Heitmiller for their amusing and useful report. At our request, they discussed their findings in a conventional manner, seemingly oblivious to the obvious impossibility of the underlying hypothesis. Their entertaining paper offers an outstanding example of publication bias, retrospective bias, and the questionable reproducibility of even highly significant results in retrospective studies.

Name: Steven L. Shafer, MD.

Conflict of Interest: Steven Shafer serves as Editor-in-Chief for Anesthesia & Analgesia.

Name: Franklin Dexter, MD, PhD.

Conflict of Interest: Franklin Dexter serves as a section Editor and Statistical Editor for Anesthesia & Analgesia.

This manuscript was handled by: James G. Bovill, MD, PhD, FCARCSI, FRCA.

a Available at: Last accessed December 15, 2011.

b Available at: Last accessed December 15, 2011.

c Available at: Last accessed December 15, 2011.

d Available at: Last accessed December 15, 2011.

e Substituting 2-sided P = 0.00016 into equation (B4) of Appendix B in Cumming's paper5 gives replication 2-tailed probability P = 0.0996.

1. Greenberg RS, Bembea M, Heitmiller E. Rainy days for the Society of Pediatric Anesthesia. Anesth Analg 2012;114:1102–3
2. De Oliveira GS Jr, Chang R, Kendall MC, Fitzgerald PC, McCarthy RJ. Publication bias in the anesthesiology literature. Anesth Analg 2012;114:1042–8
3. Marshall BJ, Warren JR. Unidentified curved bacilli in the stomach of patients with gastritis and peptic ulceration. Lancet 1984;1:1311–5
4. Goodman SN. A comment on replication, p-values and evidence. Stat Med 1992;11:875–9
5. Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspect Psychol Sci 2008;3:286–300
© 2012 International Anesthesia Research Society