Statistical significance using the p-value has historically been used as the standard for establishing the importance of a measured effect. Unfortunately, this standard is flawed and has led to misinterpretations, erroneous conclusions, and poor decisions that have cost lives, money, and resources (1). In fact, the reliance on statistical significance as an indicator of importance is so misguided and problematic that the American Statistical Association, the world's largest professional association of statisticians, recently broke with the organization's policy of neutrality about statistical methods and for the first time in its 181-year history published a statement about statistical significance and interpretation of the p-value (2).
The process of significance testing entails comparing the p-value to a subjectively chosen threshold (conventionally 0.05) and declaring statistical significance if the p-value falls below the cutoff. This practice forces dichotomous thinking onto a continuous reality; it has a long history rooted in the desire for an objective way to state importance. However, the threshold for defining significance is not objectively determined, and the p-value measures neither the size of an effect nor the clinical importance of a result. Misunderstanding on this point has led to misuse: confusing statistical significance with clinical importance is the most common statistical reporting error in the biomedical literature (3).
In hypothesis testing studies, researchers intuitively want to know the probability that their research hypothesis is true and often try to turn the p-value into a statement about this. However, the p-value is not the probability of a true hypothesis. Wasserstein and Lazar in 2016 informally defined the p-value as the probability, under a specified statistical model, that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value (2). In other words, the p-value is a statement about how likely it is to observe an effect of a similar or larger size due to chance alone. The meaning of a p-value is not straightforward or easily understood, and statisticians have cautioned for decades about the misuse of p-values. Wasserstein and Lazar proposed a solution for misinterpretation of the p-value: “We believe that a reasonable prerequisite for reporting any p-value is the ability to interpret it appropriately”. At present, this ability likely extends to only a minority of scientific authors.
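The Wasserstein and Lazar definition can be made concrete with a small simulation. The sketch below (not from the article; all data values are invented for illustration) uses a permutation test: it computes how often, under the null model of "no group difference," a relabeled dataset produces a mean difference at least as extreme as the one observed. That relative frequency is precisely what a p-value estimates — and it says nothing about the probability that the research hypothesis is true.

```python
# Illustration only: a permutation test showing what a p-value measures --
# the probability, under a null model of no group difference, of a summary
# statistic "equal to or more extreme than" the observed one.
# The group data below are hypothetical.
import random

random.seed(0)

treatment = [5.1, 6.3, 7.0, 5.8, 6.9, 7.4]   # invented outcome scores
control   = [4.2, 5.0, 5.5, 4.8, 5.9, 5.1]

def mean(xs):
    return sum(xs) / len(xs)

observed_diff = mean(treatment) - mean(control)

pooled = treatment + control
n_extreme = 0
n_perm = 10_000
for _ in range(n_perm):
    random.shuffle(pooled)                    # relabel groups under the null
    perm_diff = mean(pooled[:6]) - mean(pooled[6:])
    if abs(perm_diff) >= abs(observed_diff):  # "equal to or more extreme"
        n_extreme += 1

p_value = n_extreme / n_perm
print(f"observed mean difference: {observed_diff:.2f}")
print(f"two-sided permutation p-value: {p_value:.4f}")
```

Note what the simulation does not tell us: whether a mean difference of this size matters clinically, which is exactly the gap between statistical and clinical significance.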
Researchers are usually interested in the size of an effect. The effect size is reflected in statistical measures of magnitude, such as the mean difference, risk ratio, odds ratio, hazard ratio, regression coefficient, and correlation coefficient. Unlike the p-value, statistics that quantify the size of an effect are informative and understandable with respect to a study's outcome and measurements. Similarly, a confidence interval gives a range of plausible values for the effect size. Confidence intervals quantify the uncertainty about the effect size, but they suffer from the same plight as significance testing in that they require a subjectively determined confidence level (e.g., 95%). Nonetheless, confidence intervals provide an estimate of the magnitude of an effect, as well as the precision of that estimate.
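As a sketch of this style of reporting (again with invented data, and assuming the common pooled-variance form of the two-sample interval), the example below reports a mean difference together with its 95% confidence interval, conveying both magnitude and precision in the units of the outcome itself:

```python
# Illustration only: an effect size (difference in means) paired with a
# 95% confidence interval. Data values are hypothetical.
from statistics import mean, stdev
from math import sqrt

group_a = [5.1, 6.3, 7.0, 5.8, 6.9, 7.4]
group_b = [4.2, 5.0, 5.5, 4.8, 5.9, 5.1]

diff = mean(group_a) - mean(group_b)          # effect size: mean difference

# Standard error of the difference (pooled, equal-variance form)
na, nb = len(group_a), len(group_b)
sp2 = ((na - 1) * stdev(group_a) ** 2
       + (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
se = sqrt(sp2 * (1 / na + 1 / nb))

t_crit = 2.228                                # t quantile for df = 10, 95% level
lo, hi = diff - t_crit * se, diff + t_crit * se
print(f"mean difference: {diff:.2f} (95% CI, {lo:.2f} to {hi:.2f})")
```

The interval is directly interpretable in the outcome's units; the 95% level, of course, remains a convention rather than an objective choice.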
The use of p-values and reliance on statistical significance as an indicator of clinical importance is pervasive and entrenched in the medical literature. For example, Chavalarias et al. (4) in 2016 conducted a review of 1,000 abstracts for research studies indexed in PubMed and MEDLINE between 1990 and 2015. They found a vast overuse of p-value reporting in medicine, with the p-value serving as the primary measure for indicating clinical significance. In addition, among the abstracts that reported p-values, only 18 (2.3%) reported confidence intervals and 179 (22.5%) included at least one effect size; only a single abstract out of 1,000 mentioned clinical significance.
Reporting of effect size and use of a confidence interval have been found to be more prevalent in the otolaryngology research literature as compared with general medicine. Karadaghy et al. (5) conducted a descriptive review of analytical studies published in JAMA Otolaryngology–Head & Neck Surgery between 2012 and 2015. They selected a random sample of 121 study articles and found that 58 (48%) reported an effect size for an outcome of interest. Of the 58 articles with an effect size, 29 (50%) also included a confidence interval. While it is promising to see effect sizes and confidence intervals being reported, it is nonetheless concerning that roughly half of the articles did not provide the reader with any information about the size of an effect.
One proposed solution to the issues with significance testing is to change the cutoff for statistical significance from 0.05 to 0.005 (6). However, this does not address the misinterpretation and misuse of p-values, which would likely persist, and the distinction between statistical and clinical significance is unchanged by a different threshold. Another solution was offered by the editor of the Journal of Statistics Education, an official publication of the American Statistical Association (7). Witmer published an editorial advocating a change in language from statistically significant to statistically discernible, aimed at preventing a small p-value from being read as an indication of importance. While the term discernible may describe a statistical result more accurately, it may still mislead, since judging the meaningfulness of an effect is a clinical judgment.
In attempting to address the issue, prominent statisticians recently launched a large-scale effort aimed at “Moving to a World Beyond ‘p < 0.05’” (8). This included a 19-page editorial with this title in the American Statistical Association-sponsored journal The American Statistician, along with 43 thought-provoking articles from prominent statisticians and other experts on the topic. In the editorial, the authors called for abandoning the phrase “statistically significant” and discontinuing the practice of categorizing p-values based on an arbitrary threshold such as 0.05. It is important to note that the call to move to a world beyond p < 0.05 was not a call to ban p-values; rather, it was a call to stop using an arbitrary cutoff for statistical significance and, instead, to report effect sizes and interval estimates. Some journals have taken note, including The New England Journal of Medicine, which requires authors to “replace p-values with estimates of effects or association and 95% confidence intervals when neither the protocol nor the statistical analysis plan has specified methods used to adjust for multiplicity” (9).
In light of recent efforts to improve statistical reporting and alleviate the issues with statistical significance, we recommend the following manuscript submission guidelines when a p-value is reported:
1. State the exact p-value regardless of how small or large it may be.
2. Avoid using 0.05 or any other cutoff for a p-value as the basis for a decision about the meaningfulness/importance of an effect.
3. Include a measure of the effect size, along with a corresponding interval estimate (e.g., confidence interval).
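The three guidelines above amount to a reporting template. As a minimal sketch (the function name and the hazard ratio, interval, and p-value below are invented placeholders, not results from any cited study), a result can be stated with an exact p-value, an effect size, and an interval estimate, and without any significance label:

```python
# Illustration only: formatting a result per the three guidelines --
# exact p-value, no significance cutoff, effect size with an interval.

def report(effect_name, effect, ci_low, ci_high, p_value):
    """Format a result without invoking a significance threshold."""
    return (f"{effect_name} = {effect:.2f} "
            f"(95% CI, {ci_low:.2f} to {ci_high:.2f}); p = {p_value:.3f}")

# e.g., a hazard ratio from a hypothetical survival analysis
line = report("hazard ratio", 0.81, 0.66, 0.99, 0.041)
print(line)
# preferred over: "the treatment effect was statistically significant (p < 0.05)"
```

A statement in this form hands the reader the magnitude and precision of the effect, leaving the judgment of clinical importance where it belongs.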
Researchers performing data analysis should quantify the magnitude of an effect. The clinician is then tasked with interpreting the meaningfulness of this effect size. Statistics can provide a platform for quantifying uncertainty, but context matters and clinical importance necessitates clinical knowledge and judgment. Statistical significance is not an objective measure and does not provide an escape from the requirement to think carefully and judge the clinical and practical importance of a study's results.
1. Ziliak ST, McCloskey DN. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, MI: The University of Michigan Press; 2008.
2. Wasserstein RL, Lazar NA. The ASA's statement on p-values: Context, process, and purpose. Am Statist.
3. Lang T. The need for accurate statistical reporting. A commentary on “Guidelines for reporting statistics in journals published by the American Physiological Society: The sequel.” Adv Physiol Educ.
4. Chavalarias D, Wallach JD, Li AH, Ioannidis JP. Evolution of reporting p-values in the biomedical literature, 1990-2015. JAMA.
5. Karadaghy OA, Hong H, Scott-Wittenborn N, et al. Reporting of effect size and confidence intervals in JAMA Otolaryngology–Head & Neck Surgery. JAMA Otolaryngol Head Neck Surg.
6. Ioannidis JPA. The proposal to lower p-value thresholds to .005. JAMA.
7. Witmer J. Editorial. J Statist Educ.
8. Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p < 0.05.” Am Statist.
9. Harrington D, D’Agostino RB, Gatsonis C, et al. New guidelines for statistical reporting in the journal. N Engl J Med.