In modern medical research, P values have become, in the minds of many, the most important indicator of the truth of a scientific proposition and the key instrument differentiating “real” effects from those due to “random chance.” Consequently, it is no exaggeration to say that the landscape of medical research has been decisively shaped by a pervasive belief in the power of P values. They have come to determine not only which studies are published but also which projects are funded and which ideas pursued. It is, therefore, shocking to many physicians to learn that a legitimate suspicion exists among statisticians regarding the real usefulness of P values, particularly as a stand-alone measure of validity. This skepticism has its roots in, of all places, the Guinness brewery, the birthplace of P values and the concept of statistical significance. In 1908, William Sealy Gosset, head experimental brewer for Guinness in Dublin and a mathematician, published an article on “statistical significance,” in which he endeavored to explain how to determine which inputs in the brewing process made the greatest difference in the quality of the drink. Yet, in this same article, Gosset wrote, “The important thing is to have a low real error, not to have a ‘significant’ result at a particular station. The latter seems to me to be nearly valueless in itself” (1). Another statistician of the time, Ronald Fisher, is often credited with establishing the ubiquitous use of 5% as the benchmark for scientific legitimacy, although this attribution often rests on a misunderstanding of Gosset's work. Debates between Fisher and 2 other statisticians of the time, Jerzy Neyman and Egon Pearson, perhaps best illustrate the controversy regarding the P value: is it a continuous measure to be interpreted in context, or simply a value that falls either above or below a prespecified benchmark, typically 5%?
These debates regarding the use of P values have been inherited by subsequent generations of statisticians; yet, as with many traits that are passed from one generation to the next and become muted with time, P values are frequently misinterpreted and vastly misunderstood (2–7). When asked to explain a P value, a not uncommon response is “It is the probability that the null hypothesis is true.” This is incorrect. In fact, it is the exact opposite of the truth. The operating principle of the P value is the assumption that the null hypothesis is correct, that is, that there is no real effect of a given intervention. The information provided by a P value is therefore the probability of the observed data, given the assumption that the intervention is ineffectual. As a result, the P value provides no information about (and makes no attempt to quantify) the probability that the null hypothesis is true.
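The distinction can be made concrete with a short, purely illustrative calculation (the prior probabilities and power below are assumptions for the sketch, not figures from this article): by Bayes' theorem, the probability that the null hypothesis is true despite a “significant” result can be far larger than 5%, depending on how plausible the tested hypotheses were to begin with.

```python
# Illustrative Bayes' theorem sketch (assumed numbers, not from the article):
# among results declared "significant" at alpha = 0.05, the probability that
# the null hypothesis is nonetheless true depends on the prior plausibility
# of the hypotheses tested and on the power of the study.
alpha, power = 0.05, 0.80

for prior_null in (0.5, 0.9):
    # P(H0 | significant) = P(sig | H0) P(H0) / P(sig)
    p_h0_given_sig = (alpha * prior_null) / (
        alpha * prior_null + power * (1 - prior_null))
    print(prior_null, round(p_h0_given_sig, 3))
```

With half of tested hypotheses truly null, roughly 6% of significant results are false positives; when 90% of tested hypotheses are null, that figure rises to 36%, even though every test used the same 5% threshold.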
Another misconception is that a P value <0.05 is “statistically significant” and therefore must be clinically important. This is not correct for several reasons. First, the difference may be too small to be clinically meaningful. The P value carries no information about the magnitude of the effect or the precision of the estimate, which are captured by the point estimate and the confidence interval. A very small P value, such as <0.01, does not necessarily mean a strong association (8,9). The strength of the association comes from the effect size, a measure of the strength of the relationship between 2 variables (e.g., odds ratio, relative risk, and correlation coefficient) (10). Second, the end point itself may not be clinically important (e.g., use of surrogate outcomes). Third, it is possible to achieve a P value <0.05 simply by increasing the sample size.
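The third point can be sketched numerically. The minimal example below (a simple two-sample z-test on hypothetical numbers, not an analysis from this article) holds a clinically trivial effect fixed and varies only the number of subjects per group:

```python
import math

def two_sample_z_p(diff, sd, n):
    """Two-sided P value for a two-sample z-test with a common SD
    and n subjects per group (hypothetical illustration)."""
    se = sd * math.sqrt(2.0 / n)              # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability

# A clinically trivial difference of 0.1 units (SD = 1) crosses P < 0.05
# purely because the sample size grows; the effect itself never changes.
for n in (50, 500, 5000):
    print(n, round(two_sample_z_p(0.1, 1.0, n), 4))
```

The same 0.1-unit difference moves from P ≈ 0.62 at n = 50 per group to P < 0.001 at n = 5000 per group, with no change whatsoever in clinical relevance.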
A far greater problem arises from misinterpretation of nonsignificant findings. A P value >0.05 is often called “nonsignificant.” The term wrongly implies that the study has shown there is no difference between groups and that a nonsignificant P value is good evidence of a true null hypothesis (4). A nonsignificant result does not mean that the treatment is not beneficial; it means only that chance alone could plausibly have produced a difference of the observed size, so the “significance” of the treatment effect could not be demonstrated. Although it is usually reasonable not to accept a new treatment unless there is positive evidence in its favor, when issues of public health are concerned, the absence of evidence is not always valid justification for inaction (4). Rather, other evidence is needed to appropriately accept the null hypothesis as true. The magnitude of the association and the confidence interval can help researchers make an informed interpretation (11). The other problem is that this terminology perpetuates the idea that results must fall on one side or the other of a demarcation, as if the study conclusively proved whether a certain phenomenon existed, when, in reality, statistical evidence is continuous and results are not simply significant or nonsignificant (10,12,13).
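A confidence interval makes plain what a “nonsignificant” P value conceals. In the hypothetical sketch below (assumed numbers, not data from any study cited here), a small trial observes a meaningful benefit, yet the result is nonsignificant; the interval shows the data remain compatible with a large true effect:

```python
import math

def mean_diff_ci(diff, sd, n, z=1.96):
    """95% confidence interval for a difference in means, assuming a
    common SD and n subjects per group (hypothetical illustration)."""
    se = sd * math.sqrt(2.0 / n)
    return diff - z * se, diff + z * se

# A small trial: observed benefit of 0.5 units (SD = 1, n = 20 per arm).
lo, hi = mean_diff_ci(0.5, 1.0, 20)
print(round(lo, 2), round(hi, 2))
```

The interval spans roughly -0.12 to 1.12 units: it crosses zero (hence P > 0.05), but it is equally compatible with a benefit exceeding 1 unit. Calling this “no difference” would mistake absence of evidence for evidence of absence.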
Having described what a P value is not, the question remains: what is a P value? A P value is the probability of obtaining a result at least as extreme as the observed result of a study, assuming that the null hypothesis is true (14). The P value is a continuous measure ranging from zero to one (and, when the null hypothesis holds, it is uniformly distributed over that range); conventionally, however, it is dichotomized at 0.05. If the P value is below 0.05, the null hypothesis is rejected and the observed results are called “significant.” A P value <0.05 is an arbitrary cut-point for a statistic that captures only one of many possible sources of error, so the correct evaluation of a P value is fundamentally a qualitative process. Moreover, it is but one of several pieces of information that should be used to interpret the results of scientific research, alongside the magnitude of the effect and its associated, typically 95%, confidence interval. Taken together, this information provides the researcher, and perhaps more importantly, the scientific community and the public, a more robust perspective on a study's results. It allows us to appropriately dismiss statistically significant results associated with clinically meaningless effect sizes while advancing nonsignificant results wherein the effect size was large. The value of P values is not their ability to serve as an isolated, easily understood scientific seal of approval; rather, they are but one piece of scientific evidence that, when properly applied and combined with all other available evidence, can be used to begin a conversation regarding the proper interpretation of a study's results.
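This definition can be computed directly. The sketch below runs an exact permutation test on made-up data: assuming the null hypothesis that group labels are exchangeable, it counts the fraction of all possible relabelings whose mean difference is at least as extreme as the one observed, which is precisely a P value.

```python
import itertools
import statistics

def permutation_p(a, b):
    """Exact two-sided permutation P value for a difference in means:
    the fraction of all group relabelings at least as extreme as the
    observed split, computed under the null hypothesis that the group
    labels are exchangeable (illustrative data only)."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = a + b
    count, total = 0, 0
    for idx in itertools.combinations(range(len(pooled)), len(a)):
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        diff = abs(statistics.mean(grp_a) - statistics.mean(grp_b))
        if diff >= observed - 1e-12:   # "at least as extreme"
            count += 1
        total += 1
    return count / total

print(permutation_p([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

For these data the observed split is one of the 2 most extreme of the 20 possible relabelings, so the P value is 2/20 = 0.10: the probability of a result this extreme, were the null hypothesis true.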
1. Ziliak S, McCloskey D. The Cult of Statistical Significance. Ann Arbor, MI: The University of Michigan Press, 2008.
2. Berkson J. Tests of significance considered as evidence. J Am Stat Assoc. 1942;37:325–335.
3. Rothman KJ. Significance questing. Ann Intern Med. 1986;105:445–447.
4. Altman DG, Bland JM. Absence of evidence is not evidence of absence. Br Med J. 1995;311:485.
5. Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol. 1998;51:355–360.
6. Goodman SN. Toward evidence-based medical statistics. 1: the P value fallacy. Ann Intern Med. 1999;130:995–1004.
7. Pharoah P. How not to interpret a P value? J Natl Cancer Inst. 2007;99:332–333.
8. Fisher R. Statistical Methods for Research Workers. Edinburgh, Scotland: Oliver and Boyd, 1950.
9. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev Camb Philos Soc. 2007;82:591–605.
10. Rothman KJ, Greenland S. Modern Epidemiology. 2nd edition. Philadelphia, PA: Lippincott Williams & Wilkins, 1998.
11. Poole C. Low P-values or narrow confidence intervals: which are more durable? Epidemiology. 2001;12:291–294.
12. Poole C. Beyond the confidence interval. Am J Public Health. 1987;77:195–199.
13. Rothman KJ, Lanes S, Robins J. Causal inference. Epidemiology. 1993;4:555–556.
14. Schervish MJ. P values: what they are and what they are not. Am Stat. 1996;50:203–206.