Economics, Education, and Health Systems Research: Editorial

The Legend of the P Value

Kain, Zeev N. MD, MBA

Editor(s): Miller, Ronald D.

doi: 10.1213/01.ANE.0000181331.59738.66

Although there is a growing body of literature criticizing the use of mere statistical significance as a measure of clinical impact, much of this literature remains out of the purview of the discipline of anesthesiology. Currently, the magical boundary of P < 0.05 is a major factor in determining whether a manuscript will be accepted for publication or a research grant will be funded. Similarly, the Food and Drug Administration does not currently consider the magnitude of the advantage that a new drug shows over placebo. As long as the difference is statistically significant, a drug can be advertised in the United States as “effective” whether clinical trials proved it to be 10% or 200% more effective than placebo. We submit that if a treatment is to be useful to our patients, it is not enough for treatment effects to be statistically significant; they also need to be large enough to be clinically meaningful.

Unfortunately, physicians often misinterpret statistically significant results as showing clinical significance as well. One should realize, however, that with a large sample it is quite possible to have a statistically significant result between groups despite a minimal impact of treatment (i.e., a small effect size). Also, study outcomes with lower P values are typically misinterpreted by physicians as having stronger effects than those with higher P values. That is, most clinicians would say that a result with P = 0.002 reflects a much greater treatment effect than a result with P = 0.045. Although this is true if the sample size is the same in both studies, it is not true if the sample size is larger in the study with the smaller P value. This is of particular concern when one realizes that most pharmaceutically funded studies have very large sample sizes and effect sizes are typically not reported in these types of studies. In the following editorial I highlight some of the issues related to this complex problem. Please note that a detailed discussion of the underlying statistics involved in this topic is beyond the scope of this editorial.
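To make the first point concrete, the following is a minimal simulation sketch (Python with NumPy and SciPy assumed available) showing how a trivially small treatment effect becomes “statistically significant” once the sample is large enough. The effect size (about 0.05 sd) and the sample sizes are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_trial(n_per_group, true_effect_sd=0.05):
    """Two-arm trial with a tiny true effect, expressed in sd units."""
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    treated = rng.normal(loc=true_effect_sd, scale=1.0, size=n_per_group)
    t_stat, p_value = stats.ttest_ind(treated, control)
    return p_value

for n in (50, 500, 5000, 50000):
    print(f"n per group = {n:6d}  P = {simulated_trial(n):.4f}")

# With n = 50 the tiny effect is nowhere near significance; with n = 50,000 the
# very same effect routinely yields P < 0.05, although its clinical impact is
# negligible.
```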

When examining the report of a clinical trial investigating a new treatment, clinicians should be interested in answering the following three basic questions:

  1. Could the findings of the clinical trial be solely a result of a chance occurrence? (i.e., statistical significance)
  2. How large is the difference between the primary end-points of the study groups? (i.e., impact of treatment, effect size)
  3. Is the difference of primary end-points between groups meaningful to a patient? (i.e., clinical significance)

It was Sir Ronald A. Fisher, an extraordinarily influential British statistician, who first suggested the use of a boundary to accept or reject a null hypothesis, and he arbitrarily set this boundary at P = 0.05, where “P” stands for the probability that the result is attributable to chance (1,2). That is, the level of statistical significance as defined by Fisher in 1925 and as used today refers to the probability that the difference between two groups would have occurred solely by chance (i.e., a probability of <5 in 100 is reported as P < 0.05). Fisher’s emphasis on significance testing and the arbitrary boundary of P < 0.05 has been widely criticized over the past 80 yr. This criticism was based on the rationale that focusing on the P value does not take into account the size and clinical significance of the observed effect. That is, a small effect in a study with a large sample size can have the same P value as a large effect in a study with a small sample size. Also, the P value is commonly misinterpreted when there are multiple comparisons, in which case the traditional level of statistical significance of P < 0.05 is no longer valid. Fisher himself indicated some 25 yr after his initial publication that “If P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05…” (3). Indeed, this issue has been addressed in multiple recent review articles and editorials in the general medical and psychological literature (4–8).
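The multiple-comparisons problem mentioned above can be illustrated with a minimal simulation sketch (Python with NumPy and SciPy assumed). With 20 independent tests of a true null hypothesis, the chance of at least one “significant” result is far above 5%; the number of tests and simulation settings are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n_sims, n_per_group = 20, 1000, 30

false_positive_any = 0
for _ in range(n_sims):
    p_values = []
    for _ in range(n_tests):
        # Both groups are drawn from the same distribution: the null is true.
        a = rng.normal(size=n_per_group)
        b = rng.normal(size=n_per_group)
        p_values.append(stats.ttest_ind(a, b).pvalue)
    if min(p_values) < 0.05:
        false_positive_any += 1

# Theory: 1 - 0.95**20, i.e. roughly a 64% chance of a spurious "finding".
print(f"P(at least one P < 0.05 across {n_tests} null tests) ~= "
      f"{false_positive_any / n_sims:.2f}")
print(f"Bonferroni-adjusted threshold per test: {0.05 / n_tests:.4f}")
```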

In an attempt to address some of the limitations of the P value, the use of confidence intervals (CIs) has been advocated by some clinicians (9). One should realize, however, that these two definitions of statistical significance are essentially reciprocal (10). That is, obtaining P < 0.05 is the same as having a 95% CI for the difference that does not overlap zero. CIs can, however, also be used to estimate the size of the difference between groups, in addition to merely indicating the presence or absence of statistical significance (11). This latter approach is not widely used in the medical and psychological literature, and today CIs are mostly used as surrogates for the hypothesis test rather than as a means of conveying the full range of likely effect sizes.
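A minimal sketch of this reciprocity (Python with NumPy and SciPy assumed, pooled-variance t test, hypothetical data): the 95% CI for the difference in means excludes zero exactly when P < 0.05, but unlike the bare P value it also conveys the plausible range of effect magnitudes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(loc=0.0, scale=1.0, size=200)
treated = rng.normal(loc=0.3, scale=1.0, size=200)

n1, n2 = treated.size, control.size
diff = treated.mean() - control.mean()

# Pooled-variance standard error, matching the default two-sample t test.
sp2 = ((n1 - 1) * treated.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"difference = {diff:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f}), P = {p_value:.4f}")
# If the CI excludes zero then P < 0.05, and vice versa; the CI additionally
# shows how large (or how small) the treatment effect could plausibly be.
```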

The group of statistics called “effect sizes” comprises indices that measure the magnitude of the difference between groups while controlling for variation within the groups; an effect size can be thought of as a standardized difference. In other words, although a P value denotes whether the difference between two groups in a particular study is likely to have occurred solely by chance, the effect size quantifies the amount of difference between the two groups. Quantification of effect size does not rely on sample size but instead reflects the strength of the intervention. There are a number of different types of effect sizes, and a description of these various types and their formulae is beyond the scope of this editorial. We refer the interested reader to review articles that describe the various types of effect sizes and their calculation methodology (12,13). Effect sizes of the d type are the most commonly used in the medical literature, as they are primarily used to compare two treatment groups. The d-type effect size is defined as the magnitude of the difference between two means divided by the sd [(mean of control group − mean of treatment group)/sd of the control group]. Thus, the d effect size depends on the variation within the control group and the difference between the control and intervention groups. Values of d-type effect sizes range from −∞ to +∞, where zero denotes no effect; values below or above zero are interpreted by their absolute value when judging magnitude. Conventionally, d-type effect sizes near 0.20 are interpreted as “small,” effect sizes near 0.50 are considered “medium,” and effect sizes in the range of 0.80 are considered “large” (14). However, interpretation of the magnitude of an effect size depends on the type of data gathered and the discipline involved. Effect sizes of another type, the risk potency type, include likelihood statistics such as the odds ratio, risk ratio, risk difference, and relative risk reduction. Clinicians are probably more familiar with these less abstract statistics, and it may be helpful to realize that likelihood statistics are themselves a type of effect size.
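A minimal sketch of the d-type calculation exactly as defined above (Python with NumPy assumed); the pain-score data are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical postoperative pain scores (0-10 scale) for two groups.
control   = np.array([6.1, 5.8, 7.0, 6.4, 5.9, 6.7, 6.2, 6.5])
treatment = np.array([5.2, 4.9, 5.8, 5.5, 5.1, 5.7, 5.3, 5.6])

# d = (mean of control group - mean of treatment group) / sd of the control group
d = (control.mean() - treatment.mean()) / control.std(ddof=1)

print(f"d = {d:.2f}")  # by convention, |d| near 0.2 is small, 0.5 medium, 0.8 large
```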

Clinicians should be cautioned not to interpret the magnitude of change (effect size) as an indication of clinical significance. The clinical significance of a treatment should be based on external standards provided by patients and clinicians. That is, a small effect size may still be clinically significant and, likewise, a large effect size may not be clinically significant, depending on what is being studied. Indeed, there is a growing recognition that the traditional methods, such as statistical significance tests and effect sizes, should be supplemented with methods for determining clinically significant change. Although there is little consensus about the criteria for these efficacy standards, the most prominent definitions of clinically significant change include: 1) treated patients make a statistically reliable improvement in their change scores; 2) treated patients are empirically indistinguishable from a normal population after treatment; or 3) treated patients change by at least one sd. The most frequently used method for evaluating the reliability of change scores is the Jacobson-Truax method in combination with clinical cutoff points (15). Using this method, change is considered reliable, or unlikely to be the product of measurement error, if the reliable change index (RCI) is more than 1.96. That is, when an individual’s RCI exceeds 1.96, one can reasonably assume that the individual has truly improved.
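A minimal sketch of the Jacobson-Truax reliable change index described above (Python, standard library only); the baseline sd, test-retest reliability, and patient scores are hypothetical values used purely for illustration.

```python
import math

def reliable_change_index(pre, post, sd_baseline, reliability):
    """RCI = (post - pre) / s_diff, where s_diff reflects measurement error alone."""
    se_measurement = sd_baseline * math.sqrt(1.0 - reliability)
    s_diff = math.sqrt(2.0) * se_measurement
    return (post - pre) / s_diff

# Hypothetical anxiety score falling from 24 to 15 on an instrument with
# baseline sd = 6 and test-retest reliability = 0.85.
rci = reliable_change_index(pre=24.0, post=15.0, sd_baseline=6.0, reliability=0.85)
print(f"RCI = {rci:.2f}; reliable change: {abs(rci) > 1.96}")
# |RCI| > 1.96 means the change is unlikely (at the 0.05 level) to be the
# product of measurement error alone.
```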

Unfortunately, most of the methods above are difficult to adopt in the perioperative arena, as comparison with a normal population is not an option in most trials, and the RCI, which controls for statistical issues involving the assessment tool, is a somewhat complicated and controversial technique. Thus, clinical significance in the perioperative arena may be best assessed by posing a specific question such as “is an 8.5% reduction in intraoperative bleeding clinically significant?” or “how many sd does this change represent?” Obviously, both of these questions have a subjective component, and although it is traditionally agreed that at least a 1-sd change is needed for clinical significance, this boundary has no scientific underpinning. The validity of a clinical cutoff for these last two methods can be improved by establishing external validity (e.g., the patient perspective) for the decision. For example, Flor et al. (16) conducted a large meta-analysis aimed at evaluating the effectiveness of multidisciplinary rehabilitation for chronic pain. The investigators found that pain among the patients who received the intervention was indeed reduced by 25%. This reduction was certainly statistically significant and had an effect size of 0.7. Colvin et al. (17), however, had reported earlier that patients consider a treatment a “success” only if it improves their pain by at least 50%. Thus, in this example, a reduction of 25% in pain scores may be statistically, but not clinically, significant. Clearly this is a developing area that warrants further discussion.
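A minimal sketch of the two screening questions posed above (Python, standard library only): how many sd does an observed change represent, and does it meet a patient-defined threshold for “success”? The percentages echo the pain example in the text (a 25% observed reduction against a 50% patient-defined success threshold); the baseline mean and sd are hypothetical.

```python
baseline_mean, baseline_sd = 60.0, 18.0   # hypothetical 0-100 pain scale
observed_reduction_pct = 0.25             # 25% reduction reported in the trial
patient_threshold_pct = 0.50              # patients call >= 50% reduction a "success"

absolute_change = baseline_mean * observed_reduction_pct
change_in_sd_units = absolute_change / baseline_sd

print(f"change = {absolute_change:.1f} points = {change_in_sd_units:.2f} sd")
print(f"clinically significant by the patient-defined standard: "
      f"{observed_reduction_pct >= patient_threshold_pct}")
# Statistically significant and sizeable in sd terms, yet short of the
# improvement that patients themselves consider meaningful.
```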

In conclusion, we suggest that the reporting of perioperative medical research should move beyond results that consist primarily of descriptive statistics and statistically significant or nonsignificant findings. Findings should be interpreted in the context of the magnitude of the change that occurred and the clinical significance of that change.

References

1. Fisher RA. Statistical methods for research workers, 1st ed. Edinburgh: Oliver and Boyd, 1925. Reprinted by Oxford University Press.
2. Fisher RA. Design of experiments. 1st ed. Edinburgh: Oliver and Boyd, 1935. Reprinted by Oxford University Press.
3. Fisher RA. Statistical methods for research workers. London: Oliver and Boyd, 1950:80.
4. Borenstein M. Hypothesis testing and effect size estimation in clinical trials. Ann Allergy Asthma Immunol 1997;78:5–11.
5. Matthey S. P < 0.05: but is it clinically significant? Practical examples for clinicians. Behav Change 1998;15:140–6.
6. Cummings P, Rivara FP. Reporting statistical information in medical journal articles. Arch Pediatr Adolesc Med 2003;157:321–4.
7. Greenstein G. Clinical versus statistical significance as they relate to the efficacy of periodontal therapy. J Am Dent Assoc 2003;134:1168–70.
8. Sterne JAC, Smith GD, Cox DR. Sifting the evidence: what’s wrong with significance tests? Another comment on the role of statistical methods. BMJ 2001;322:226–31.
9. Simon R. Confidence intervals for reporting results of clinical trials. Ann Intern Med 1986;105:429–35.
10. Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol 1998;51:355–60.
11. Gardner MG, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986;292:746–50.
12. Kirk R. Practical significance: A concept whose time has come. Educ Psychol Meas 1996;56:746–59.
13. Snyder P, Lawson S. Evaluating results using corrected and uncorrected effect size estimates. J Exp Educ 1993;61:334–49.
14. Cohen J. Statistical power analysis for the behavioral sciences, 2nd ed. Mahwah, New Jersey: Lawrence Erlbaum, 1988.
15. Jacobson NS, Truax P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J Consult Clin Psychol 1991;59:12–9.
16. Flor H, Fydrich T, Turk DC. Efficacy of multidisciplinary pain treatment centers: a meta-analytic review. Clin J Pain 1992;49:221–30.
17. Colvin DF, Bettinger R, Knapp R, et al. Characteristics of patients with chronic pain. South Med J 1980;73:1020–3.
© 2005 International Anesthesia Research Society