In science and in clinical practice, the common challenge is to understand—to determine the underlying cause or best course of treatment for clinical signs and symptoms, or to provide fundamental explanations of life and our physical world. The problem is, it is really hard to know anything with certainty. This point was given new context for me years ago when I met a scientist from the Scripps Institution of Oceanography whose life’s work was quantifying uncertainty. I challenged him first on temperature, which seemed like a rather safe and well-established physical phenomenon. He answered with disputes about establishing absolute zero. I tried again with time, touting atomic clocks as a firm standard; he parried with gravitational effects (Einstein’s theory of relativity). What a discouraging reality, I thought, to have so little reliable knowledge.
The dearth of reliable knowledge extends well into biomedical research, and for that, I blame the p-value; I am not alone.1–3 In 2005, Ioannidis published a disturbing study on the veracity of published research findings, arguing that published findings are often just accurate measures of the prevailing bias in a particular field,4 i.e. confirmation bias. In 2014, Nuzzo published a call for more sensible use and interpretation of p-values in a report on statistical errors.2 She nicely illustrated how the p-value cannot answer questions about the odds that a hypothesis is correct because those odds depend on the strength of the observed effect and the plausibility of the hypothesis. These two notable criticisms are part of a growing movement to fundamentally change the culture of communicating scientific results, and such a change is needed to address the sorry state of reproducibility in science today.
The misuse of p-values has recently drawn the attention of The American Statistical Association (ASA), a venerable institution in existence for more than 175 years. On March 7, 2016, the ASA published a statement on p-values that addresses their interpretation and use and illuminates some common abuses.5 This publication was motivated in part by problems with the reproducibility of published research. I raise this issue here to encourage the community of OVS readers, authors, and educators to help combat the common misunderstandings and misuses of p-values that have led to a history of unhelpful practices in biomedical research, including in our own journal.
I suggest that there are at least five reasons that p-values have become so widely used and abused. These reasons are not an exhaustive list but a starting point for further reflection. I enumerate them below and propose alternatives and solutions that OVS authors should adopt.
1. Using p-values is what we were taught to do.
This is an issue raised by George Cobb in a recent ASA discussion forum that I will dub Cobb’s Circularity: We teach p = 0.05 because that is what our community does; we use p = 0.05 because that is what we teach.5 Hypothesis testing in statistics can be a powerful working illustration of how several basic scientific principles (e.g. quantitative analytical methods, probability theory, and inferential reasoning) come together in a way that is useful for both instructor and student. Getting beyond the cultural indoctrination of our training requires thoughtful use of each of these basic tools. We must dare to question the basis for our own practices and be brave enough to seek a better approach when one exists. Inferential thinking and hypothesis testing are cornerstones of scientific discovery, but there are more powerful alternatives to the p-value that authors should embrace to help tell the story of their discoveries (e.g. graphical presentation, confidence intervals, distribution plots). “Conformity is the jailer of freedom and the enemy of growth.”─John F. Kennedy.
2. It is easy to generate a p-value.
The availability of personal computers and statistical software packages contributes to the misuse of p-values. Given data and a statistics software package, one can surely generate results, but they may have no meaning or value. Proper use of these powerful analytical tools is a perennial training requirement. Reviewing a student’s draft manuscript recently, I noticed that it included some astonishing p-values (e.g. p = 1.39 × 10−39). This was emblematic for me of how good intentions can go wrong when study results are on the line. Yes, the analytical tools produced a result, but the result requires thoughtful consideration before reporting.
There is another problem related to the ease of generating p-values. Because the cost associated with running additional statistical tests is low, authors can be tempted to look everywhere for findings that cross the magical limit of p < 0.05. Exploratory analysis has a legitimate place in scientific discovery, but integrity demands full disclosure of the nature and purpose of the exploration; that is, exploration should be an intentional and explicit exercise. “It is commonly believed that anyone who tabulates numbers is a statistician. This is like believing that anyone who owns a scalpel is a surgeon.”─Robert Hooke.6
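The danger of this temptation is easy to demonstrate with a short simulation of my own (not from any study; the sample sizes and seed are arbitrary): when two-group comparisons are run on pure noise, about 5% cross p < 0.05 by chance alone, so twenty exploratory "looks" at null data can be expected to yield roughly one spurious significant finding.

```python
# A sketch with simulated data: run many two-group comparisons in which
# the true effect is exactly zero, and count how often p < 0.05 anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def null_pvalue():
    # Both groups drawn from the SAME population: any "effect" is noise.
    a = rng.normal(0.0, 1.0, size=30)
    b = rng.normal(0.0, 1.0, size=30)
    return stats.ttest_ind(a, b).pvalue

pvals = [null_pvalue() for _ in range(2000)]
rate = sum(p < 0.05 for p in pvals) / len(pvals)

# With 20 such looks at null data, about one spurious "significant"
# finding is expected (20 * 0.05 = 1).
print(f"false-positive rate under the null: {rate:.3f}")
```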
3. The definition of the p-value is convoluted.
At the center of this problem is the difficulty of understanding the p-value itself. In the 1920s, Fisher introduced the notion of the p-value as a quantitative criterion for guiding decisions about whether or not observations were worthy of further study. The p-value has since become dogma for researchers seeking simplicity in the presentation and interpretation of study results. The ASA’s recently published simplified description of the p-value did not offer much clarity or hope for the casual statistician:
“…a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.”5
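A small simulation of my own may make this wording more concrete (the model, sample sizes, and observed difference below are hypothetical): the p-value is simply how often the specified model, on its own, produces a summary at least as extreme as the one observed.

```python
# Unpacking the ASA definition by brute force: simulate the specified
# model many times and count summaries as or more extreme than observed.
import numpy as np

rng = np.random.default_rng(3)

# Specified statistical model: both groups drawn from Normal(0, 1),
# n = 20 per group. Observed summary (hypothetical): a mean difference
# of 0.8 between the two groups.
observed_diff = 0.8
n, sims = 20, 100_000

null_diffs = (rng.normal(0, 1, (sims, n)).mean(axis=1)
              - rng.normal(0, 1, (sims, n)).mean(axis=1))

# Two-sided p-value: the fraction of model-generated differences at
# least as extreme as the observed one.
p = np.mean(np.abs(null_diffs) >= observed_diff)
print(f"simulated p-value: {p:.4f}")
```

Nothing in this calculation says anything about whether the hypothesis behind the observed difference is plausible; it only describes the behavior of the assumed model.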
“Why should things be easy to understand?”─Thomas Pynchon.
4. The consequences of dichotomous thinking inherent to p-values are not immediately obvious.
A central challenge for those who practice statistics is to communicate clearly the meaning of a numerical analysis that is sometimes complex and nuanced. Epidemiologists have been sounding alarms about p-values for decades.7–9 P-values dichotomize results into significant and nonsignificant outcomes. They are confounded measures that entangle the magnitude and variability of an effect. If one knows only the p-value, it is impossible to discern whether the parameter magnitude or the measurement uncertainty explains a p-value below 0.05. Large studies may generate many significant findings of no real importance because a large sample size yields more precise parameter estimates. Conversely, small studies with few subjects may fail to show statistical significance despite a strong relationship. The oversimplification inherent to p-values also squelches further meaningful consideration and thoughtful discussion of what might explain the observations regardless of the associated p-value. Such discussions are equally important and far more interesting.
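This entanglement of magnitude and precision is easy to demonstrate with a sketch using hypothetical data (not taken from any study): a trivial effect in a huge sample is "significant," while a strong effect in a tiny sample is not.

```python
# Hypothetical data: how sample size, not effect size, can drive p.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Large study, clinically trivial effect: a true difference of 0.1 SD,
# n = 10,000 per group. Precision alone pushes p below 0.05.
big_a = rng.normal(0.0, 1.0, size=10_000)
big_b = rng.normal(0.1, 1.0, size=10_000)
p_big = stats.ttest_ind(big_a, big_b).pvalue

# Small study, strong effect: a difference of more than 1 SD, but only
# n = 4 per group. The effect is large, yet the test is "nonsignificant".
small_a = np.array([-1.2, 0.8, 0.3, -0.5])
small_b = np.array([1.9, 0.2, 0.7, 1.4])
p_small = stats.ttest_ind(small_a, small_b).pvalue

print(f"trivial effect, huge n: p = {p_big:.2e}")   # far below 0.05
print(f"strong effect, tiny n:  p = {p_small:.3f}") # above 0.05
```

Knowing only the two p-values, a reader would draw exactly the wrong conclusion about which effect matters.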
The preferred alternative to significance testing is more descriptive and involves point estimates that describe the magnitude of study parameters (e.g. means, ratios, relative risk) along with the associated bounds of error (confidence intervals). This approach embraces the publication of interesting results from small studies and encourages discussion of the possible sources and effects of bias, confounding, or how chance might have influenced the reported results. In a 1998 editorial titled “Writing for Epidemiology,” editor Kenneth Rothman took a bold stand against tests for statistical significance when he wrote: “When writing for Epidemiology, you can also enhance your prospects if you omit tests of statistical significance.” Eighteen years later, this idea is still heresy to most biomedical researchers. Nevertheless, the clear message is that p-values are overly simplistic summaries that are easily misinterpreted and have led to unrepeatable results in biomedical research. The focus should not be on the value of p, but on the magnitude of the effect and its variability.
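As a sketch of this reporting style (the measurements below are hypothetical and the units arbitrary), the point estimate and its 95% confidence interval can be computed directly, conveying both the magnitude of the effect and its uncertainty in one statement:

```python
# Report the estimate and its uncertainty, not a significance verdict.
import numpy as np
from scipy import stats

a = np.array([12.1, 11.4, 13.0, 12.6, 11.9])  # hypothetical treatment group
b = np.array([10.8, 11.2, 10.5, 11.6, 10.9])  # hypothetical control group

diff = a.mean() - b.mean()  # point estimate of the mean difference

# Standard error of the difference (pooled variance, equal group sizes).
n = len(a)
sp2 = (a.var(ddof=1) + b.var(ddof=1)) / 2
se = np.sqrt(sp2 * (2 / n))

# 95% CI from the t distribution with n1 + n2 - 2 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=2 * n - 2)
lo, hi = diff - t_crit * se, diff + t_crit * se

print(f"mean difference = {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The interval invites discussion of whether the plausible range of effects is clinically meaningful, which a bare p-value cannot do.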
“Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital.”—Aaron Levenstein.
5. A p-value can give you what you are hoping to find.
The familiar inferential approach to research involves evaluating observations against a null hypothesis (one that is nearly always false a priori) using a statistical test. The product of this exercise is a p-value that is, more often than not (because the null hypothesis is so rarely true), significant. This is often accompanied by very little discussion about the plausibility of the hypothesis tested, potential sources of bias, confounding, or even the validity of the statistical test assumptions. This approach has led to an explosion of studies reporting significant results for marginal effects that are not repeatable despite reported p-values below 0.05.10,11 This lack of reliability is not just an academic concern when decisions about health and well-being are involved.
Robert Bolles published a comment in Biological Psychiatry in 1988 titled Why You Should Avoid Statistics.12 His 4th and 5th rules state simply and succinctly how one might address the problems enumerated above:
Rule 4: “What you want to do in your research is measure things. You do not want to test hypotheses. You want to measure things.”
Rule 5: “…rather than looking at the statistics, you should look at the data. Always, look at the data to see what the numbers say. The numbers that you collect in an experiment will tell you if you have found something, even while statistical tests are fibbing, lying, and deceiving you.”
Both rules are fundamentally good advice.
“The path of least resistance leads to crooked rivers and crooked men.”—Henry David Thoreau. This sentiment seems to be at the heart of how biomedical research has wandered astray, chasing the simplistic allure of the ever-rewarded p-value. The apparent clarity provided by a definitive result is compelling, yet the meaning and thoughtful interpretation of results require more effort than simply judging whether p < 0.05. While OVS will not be abandoning tests of statistical significance, their value will receive increasing scrutiny, and appropriate alternatives such as confidence intervals will be strongly encouraged.
Michael D. Twa
Optometry and Vision Science
1. Rothman KJ. Curbing type I and type II errors. Eur J Epidemiol 2010; 25: 223–4.
2. Nuzzo R. Scientific method: statistical errors. Nature 2014; 506: 150–2.
3. Greenland S. Null misinterpretation in statistical testing and its impact on health risk assessment. Prev Med 2011; 53: 225–8.
4. Ioannidis JP. Why most published research findings are false. PLoS Med 2005; 2: e124.
5. Wasserstein RL, Lazar NA. The ASA’s statement on p-values: context, process, and purpose. Am Stat 2016; 70: 129–33.
6. Hooke R. How to tell the liars from the statisticians. New York: M. Dekker; 1983.
7. Lang JM, Rothman KJ, Cann CI. That confounded P-value. Epidemiology 1998; 9: 7–8.
8. Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995; 311: 485.
9. Poole C. Low P-values or narrow confidence intervals: which are more durable? Epidemiology 2001; 12: 291–4.
10. Ioannidis JP. An epidemic of false claims. Competition and conflicts of interest distort too many medical findings. Sci Am 2011; 304: 16.
11. Ioannidis JP. Discussion: Why “An estimate of the science-wise false discovery rate and application to the top medical literature” is false. Biostatistics 2014; 15: 28–36; discussion 9–45.
12. Bolles RC. Why you should avoid statistics. Biol Psychiatry 1988; 23: 79–85.