The idea that p values are misused is almost as old as the concept of p values themselves. While R.A. Fisher did not invent them, his work popularized p values in the 1920s [3]. The ink had hardly dried on his work about p values before others were ready to point out their considerable shortcomings [1], and none did so more colorfully than computer scientist Robert Matthews, who wrote: “Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding.” Stanford methodologist and professor of medicine John Ioannidis then took this to its logical endpoint when he used problems with p values, particularly the blind adherence to a p value threshold of 0.05, to substantiate a claim that “most published research findings are false” [8]. In fairness to Sir Ronald, he never intended his tool to be used this way.
There is no sense in crying over spilled ink, and plenty has been spilled on this topic already. The more-important question is how surgeons (and the journals that serve them) can protect their patients from statistical approaches that promote misleading conclusions.
Suggestions abound. In 2016, the American Statistical Association published several explanations and caveats about p values with which it is impossible to disagree [20]:
- “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.”
- “Proper inference requires full reporting and transparency.”
- “A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.”
These suggestions are completely reasonable (and the last point has been covered on the editorial pages of Clinical Orthopaedics and Related Research® in some detail already [10-12]). The American Statistical Association’s coverage of the topic is entirely comprehensible to the nonstatistician, and well worth the time of anyone who reads medical research. But it seems to us that these good general suggestions stop short of serving as specific recommendations, which we believe are warranted given the persistent confusion on this topic.
Dr. Ioannidis suggests changing the “traditional” p value threshold from 0.05 to 0.005 [7], but this strikes us as too simplistic. It is analogous to raising the cutoff value on a diagnostic test to make the test more specific and less sensitive. While that is sometimes desirable, in other situations we prefer tests with more sensitivity, even at the cost of specificity. The same may be true when applying statistical analyses. Dr. Ioannidis’ suggestion also does nothing to address the concern that sometimes the p value is simply the wrong tool for the job.
Surgeons reading clinical research must not outsource the essential responsibility of discerning truth from fiction, not even to statisticians. As another longtime commentator on this topic has thoughtfully written: “The determinants of the truth of a knowledge claim lie in combination of evidence both within and outside a given experiment, including the plausibility and evidential support of the proposed underlying mechanism. If that mechanism is unlikely, as with homeopathy or perhaps intercessory prayer, a low P value is not going to make a treatment based on that mechanism plausible” [5]. For that reason, and since crafty statistical analyses cannot overcome shortcomings in research design, readers should focus carefully on the limitations inherent in each study’s design and interpret statistical inferences with more flexibility than often is applied. For example, we might consider “more or less likely” rather than “true or false” as the possible interpretations of each assertion an author makes. To help the reader interpret thoughtfully, clinician-scientists designing studies must make it possible for readers to infer from a study’s design how likely its contentions are to be true, and it is the journal editor’s job to apply sensible standards not just to scientific reporting but also to inference testing.
We favor approaching the evaluation of statistical claims just as a clinician might when evaluating the results of a diagnostic test.
Consider what happens when a clinician wishes to determine whether a prosthetic joint infection is present. (S)he may set a cutoff value for histological analysis of a frozen section from the joint at 5 polymorphonuclear cells per high-powered field (pmn/hpf), or (s)he can use 10 pmn/hpf [22]; at least one study explored a threshold as low as 2 pmn/hpf [21]. Lower thresholds are likely to detect more infections, at the risk of misclassifying some patients as having infection when in fact they do not. By contrast, higher thresholds are more specific for infection, but may miss some patients who have that diagnosis. Instead of slavish adherence to one cutoff value, clinicians probably should vary the thresholds they use based on a pretest estimation of the likelihood that infection is present. If the clinical presentation suggests the absence of infection, perhaps 15 pmn/hpf is a more-reasonable standard to convince the surgeon that infection is, in fact, present; on the other hand, if a surgeon believes prior to surgery that infection likely is present, then 5 pmn/hpf might suffice as the needed confirmatory evidence.
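The tradeoff can be made concrete with a minimal sketch. The pmn/hpf counts below are fabricated purely for illustration (they come from no study cited here); the point is only how moving the cutoff trades sensitivity against specificity.

```python
# Fabricated pmn/hpf counts, for illustration only (not from any study).
infected = [4, 6, 8, 9, 12, 15, 20, 25]   # patients with confirmed infection
not_infected = [0, 1, 1, 2, 3, 4, 6, 7]   # patients without infection

def sens_spec(threshold):
    """Call 'infected' any count at or above the threshold."""
    true_pos = sum(count >= threshold for count in infected)
    true_neg = sum(count < threshold for count in not_infected)
    return true_pos / len(infected), true_neg / len(not_infected)

for t in (2, 5, 10, 15):
    sens, spec = sens_spec(t)
    print(f"threshold {t:>2} pmn/hpf: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

In this toy dataset, the 2 pmn/hpf cutoff catches every infection but misclassifies several uninfected patients, while the 15 pmn/hpf cutoff does the reverse.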
We propose the same could be done with p values. A “default” p value threshold of 0.05 makes no intuitive, philosophical, or scientific sense. P value thresholds can and should be adjusted up or down as prior knowledge dictates. While readers can do this as they interpret published research, it probably makes more sense for the adjustment to occur when the clinician-scientist designs the project. Based on what is known before reading (or designing) the study, if the contention being evaluated is more of a long shot, and assuming an otherwise-robust study design, it should take a lower (stricter) p value to alleviate skepticism. The same standard (that is, a more-stringent p value threshold) should be used if the clinical intervention the study recommends carries with it avoidable risk, such as elective surgery, or a medication with an unfavorable side-effect profile.
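The logic behind matching the threshold to prior plausibility can be sketched with a back-of-the-envelope calculation in the spirit of Ioannidis's argument. This is our own sketch under idealized assumptions (the stated statistical power, no bias); the priors are illustrative, not drawn from any particular study.

```python
def ppv(prior, alpha, power=0.8):
    """P(the effect is real | p < alpha), by Bayes' rule.

    prior: pre-study probability the hypothesis is true (an assumption)
    alpha: the p value threshold used to declare "significance"
    power: probability of detecting a real effect (assumed, idealized)
    """
    true_positives = prior * power          # real effects crossing the threshold
    false_positives = (1 - prior) * alpha   # null effects crossing it by chance
    return true_positives / (true_positives + false_positives)

# A long-shot hypothesis (10% prior) vs. a plausible one (50% prior):
print(f"long shot, alpha 0.05:  {ppv(0.10, 0.05):.2f}")   # ~0.64
print(f"long shot, alpha 0.005: {ppv(0.10, 0.005):.2f}")  # ~0.95
print(f"plausible, alpha 0.05:  {ppv(0.50, 0.05):.2f}")   # ~0.94
```

Under these assumptions, a long-shot finding that crosses p < 0.05 is real only about two times in three; tightening the threshold to 0.005 restores roughly the same confidence that a plausible hypothesis earns at 0.05.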
By contrast, there should be room for what we like to call “conversation-starters,” which we define as papers that open our minds to new ways of thinking about education or practice, but that don’t expose patients to serious risk. In the interest of keeping a nascent and potentially helpful topic alive for subsequent validation studies, a more-relaxed p value threshold (such as 0.1) may be appropriate, provided that the conclusions are not too grandiose and that the work is identified as exploratory. We note that there is precedent for this kind of recommendation [16]. The relaxation of a p value threshold should be done a priori (that is, before the experiment begins), and reasonable proof that the threshold was not changed post hoc (such as a dated IRB approval or a prospective clinical trial registry number) should be provided. And, again, raising a p value threshold generally is not appropriate where the conclusion of a study might drive readers to make more-liberal use of an intervention with avoidable risk, such as elective surgery.
One might call this a “Bayes through the back door” approach, in that it involves using a concept analogous to a Bayesian prior probability in concert with a frequentist (that is, p-value driven) approach to inference testing. We accept this, and see no problem arising from it, since it is entirely consistent with the approach that clinicians use every day in the office.
Although at least one journal now bans p values [19], and Robert Matthews (of “baloney into breakthroughs” fame) suggests we “pull the plug” on Fisher’s p-value machine, we believe that doing so is neither necessary nor helpful. That said, we agree that the machine badly needs a tune-up. We hope that clinician-scientists will consider our “backdoor Bayesian” approach as they test their research inferences. We are open to this and any other well-substantiated statistical methods in the work we will publish in CORR®, and indeed, we have already welcomed several Bayesian approaches onto our pages [4, 14, 15], though these studies generally used Bayesian methods to make predictions rather than (as we’ve been discussing here) to assess efficacy. Regardless, we think these approaches are underutilized. They can be handy in diverse research settings even outside survivorship estimation [9, 18], and, of course, we believe our “backdoor Bayesian” approach has merit when assessing treatment efficacy if thoughtfully applied. There also are other ways to convey the statistical robustness of research findings, such as Bayes factors [6] and clinical significance curves [17], and we believe these are worth exploring in orthopaedic clinical research. We would welcome manuscripts that make use of them.
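For readers curious what a Bayes factor looks like in practice, the minimum Bayes factor described by Goodman [6] can be computed directly from a two-sided p value. This small sketch is ours, for illustration only:

```python
from math import exp
from statistics import NormalDist

def min_bayes_factor(p):
    """Goodman's minimum Bayes factor, exp(-z^2/2), for a two-sided p value.

    This is the *most* the data could favor the alternative over the null;
    smaller values mean stronger evidence against the null."""
    z = NormalDist().inv_cdf(1 - p / 2)  # z score matching the p value
    return exp(-z * z / 2)

for p in (0.05, 0.01, 0.005):
    bf = min_bayes_factor(p)
    print(f"p = {p}: minimum Bayes factor ~ 1/{1 / bf:.0f}")
```

Even a p value of 0.05 corresponds, at best, to odds of roughly 7-to-1 against the null, considerably weaker evidence than the “1 in 20” shorthand suggests.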
Finally, although we have focused on p value thresholds, we do not believe they are the best way to think about datasets, even within a frequentist analytic framework. As we have written before, confidence intervals are essential, and we insist on them in the papers we publish [11]. Still more important are effect-size estimates, which ideally should be presented alongside minimum clinically important differences when the latter are available [12]. Effect-size estimation is critical, since clinicians (and their patients) do not think in terms of p values, but rather in terms of effect sizes. For that reason, when we evaluate papers here, we push authors to present their results in terms of effect sizes, and to ask whether those effect sizes justify the interventions in question. We urge readers to look askance at any research that omits them, which, sadly, remains most research not published in CORR [2].
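The point about effect sizes can be made concrete with a deliberately fabricated comparison. Every number below, including the minimum clinically important difference, is invented for the sketch; the aim is only to show how a result can be statistically significant yet clinically unimportant.

```python
from statistics import NormalDist

# Fabricated summary statistics, for illustration only.
mean_diff = 3.0   # observed between-group difference on some outcome score
se = 1.2          # standard error of that difference
mcid = 10.0       # minimum clinically important difference (assumed)

z = mean_diff / se
p = 2 * (1 - NormalDist().cdf(abs(z)))
ci_low, ci_high = mean_diff - 1.96 * se, mean_diff + 1.96 * se

print(f"p = {p:.3f}")  # "statistically significant"
print(f"difference = {mean_diff} points (95% CI {ci_low:.1f} to {ci_high:.1f})")
print(f"entire CI below the MCID of {mcid}: {ci_high < mcid}")
```

The p value alone would announce a “significant” result, while the confidence interval makes plain that even the most optimistic estimate of the effect falls short of clinical importance.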
The authors thank Lee Beadling BA, Matthew B. Dobbs MD, Mark C. Gebhardt MD, Terence J. Gioe MD, Paul A. Manner MD, Clare M. Rimnac PhD, and Montri D. Wongworawat MD, who helped guide the message shared in this editorial.
1. Berkson J. Tests of significance considered as evidence. J Am Stat Assoc. 1942;37:325–335.
2. Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA. Evolution of reporting p values in the biomedical literature. JAMA. 2016;315:1141–1148.
3. Fisher RA. Statistical Methods for Research Workers. Edinburgh, UK: Oliver and Boyd; 1925.
4. Forsberg JA, Wedin R, Boland PJ, Healey JH. Can we estimate short- and intermediate-term survival in patients undergoing surgery for metastatic bone disease? Clin Orthop Relat Res. 2017;475:1252–1261.
5. Goodman S. A dirty dozen: Twelve p-value misconceptions. Seminars in Hematology. 2008;45:135–140.
6. Goodman S. Toward evidence-based medical statistics. 2: The Bayes factor. Ann Intern Med. 1999;130:1005–1013.
7. Ioannidis JPA. The proposal to lower p value thresholds to .005. JAMA. 2018;319:1429–1430.
8. Ioannidis JPA. Why most published research findings are false. PLOS Medicine. 2005;2:e124.
9. Jiménez-Almonte JH, Wyles CC, Wyles SP, Norambuena-Morales GA, Báez PJ, Murad MH, Sierra RJ. Is local infiltration analgesia superior to peripheral nerve blockade for pain management after THA: A network meta-analysis. Clin Orthop Relat Res. 2016;474:495–516.
10. Leopold SS. Editorial: No-difference studies make a big difference. Clin Orthop Relat Res. 2015;473:3329–3331.
11. Leopold SS, Porcher R. Editorial: Reporting statistics in abstracts in Clinical Orthopaedics and Related Research. Clin Orthop Relat Res. 2013;471:1739–1740.
12. Leopold SS, Porcher R. Editorial: The minimum clinically important difference—The least we can do. Clin Orthop Relat Res. 2017;475:929–932.
14. Nandra R, Parry M, Forsberg J, Grimer R. Can a Bayesian belief network be used to estimate 1-year survival in patients with bone sarcomas? Clin Orthop Relat Res. 2017;475:1681–1689.
15. Ogura K, Gokita T, Shinoda Y, Kawano, Takegi T, Ae K, Kawai A, Wedin R, Forsberg JA. Can a multivariate model for survival estimation in skeletal metastases (PATHFx) be externally validated using Japanese patients? Clin Orthop Relat Res. 2017;475:2263–2270.
16. Rubinstein LV, Korn EL, Freidlin B, Hunsberger S, Ivy SP, Smith MA. Design issues of randomized phase II trials and a proposal for phase II screening trials. J Clin Oncol. 2005;23:7199–7206.
17. Shakespeare TP, Gebski VJ, Veness MJ, Simes J. Improving interpretation of clinical studies by use of confidence levels, clinical significance curves, and risk-benefit contours. Lancet. 2001;357:1349–1353.
18. Takenaka S, Aono H. Prediction of postoperative clinical recovery of drop foot attributable to lumbar degenerative diseases, via a Bayesian network. Clin Orthop Relat Res. 2017;475:872–880.
19. Trafimow D, Marks M. Editorial. Basic Appl Soc Psych. 2015;37:1–2.
20. Wasserstein RL, Lazar NA. The ASA's statement on p-values: Context, process, and purpose. Am Stat. 2016;70:129–133.
21. Wu C, Qu X, Mao Y, Li H, Dai K, Liu F, Zhu Z. Utility of intraoperative frozen section in the diagnosis of periprosthetic joint infection. PLoS One. 2014;9:e102346.
22. Zhao X, Guo C, Zhao GS, Lin T, Shi ZL, Yan SG. Ten versus five polymorphonuclear leukocytes as threshold in frozen section tests for periprosthetic infection: A meta-analysis. J Arthroplasty. 2013;28:913–917.