What should be the role of P-values and confidence intervals in the interpretation of scientific results? This question is not new 1 and our field of epidemiology is far from alone in struggling with it. 2,3 I have four suggestions for authors and readers. The first is quite broad, so I offer that one before describing current practices. I then turn to the other three. My remarks are confined to settings in which P-values and confidence intervals accompany estimates of effect measures, such as the relative risk.
Briefly, here are my suggestions. One, we should work harder than ever to avoid strict or exact interpretations of P-values and confidence intervals in observational research, where these statistics lack a theoretical basis. Two, we should stop interpreting P-values and confidence intervals as though they measure the probability of hypotheses. Three, when we want to know the probability of hypotheses, we should use Bayesian methods, which are designed expressly for that purpose. Four, we should get serious about precision and look for narrow confidence intervals instead of low P-values to identify results that are least influenced by random error.
Real Life Is Not Randomized
When treatment or exposure is randomized, we have a solid theoretical basis, testable in simulations, for the probability models from which P-values, confidence intervals, and likelihoods are deduced. In observational research, all we can do is hope that the social, behavioral, and physical processes by which people become exposed to risk factors in the unrandomized real world do not differ too greatly from randomization. 4 Unfortunately, each time we find that risk factors are associated with each other in observational studies, we find evidence against that hope. We cannot remind ourselves too often of this fundamental problem. At the very least, it should cause us to avoid hairsplitting interpretations of probabilistic statistics in observational research, where they are intrinsically fuzzy.
Contemporary Uses of P-Values and Confidence Intervals
Significance testing unquestionably dominates epidemiology today. In attempting to refrain from this practice over the past 17 years, 5 I have often been expected, assumed, encouraged, and sometimes even forced to engage in it by editors, reviewers, colleagues, professors, students, funding sources, regulators, attorneys, and journalists. It is not easy to be a non-tester in a testing world.
After Rothman’s highly influential 1978 essay, “A Show of Confidence,” 6 an immense and easily documented shift in reporting style took place. 7 Whereas P-values or “S” (significant) and “NS” (not significant) once were reported exclusively, the reporting of confidence intervals has now become accepted practice, with or without P-value accompaniment. Confidence intervals have a survival advantage for the tiny non-testing minority to which I belong. They enable us to gauge the precision of estimates easily, but without depriving the established majority of its beloved tests.
Epidemiologists who see no purpose to a confidence interval other than its use in significance testing sometimes wonder why this shift in reporting practice has occurred. The P-value provides the information they desire more efficiently and exactly. Some are vaguely aware that confidence intervals supposedly convey information that P-values do not, but are unsure what that extra information is and even less sure how it might be useful. The word “precision” seems to be used with increasing regularity nowadays, and confidence intervals are occasionally described as “wide,” but “wide” and “imprecise” often seem nothing more than code words for “includes the null value” and hence for “not statistically significant.”
Improbable Observations Do Not Imply Improbable Hypotheses
When we estimate a parameter such as the relative risk, each possible value of that parameter is the expected value under some hypothesis, and each hypothesis has a P-value. 8,9 What we call “the” P-value is the P-value for the null hypothesis. Approximately, each P-value is the probability of obtaining an estimate at least as far from a specified value as the estimate we have obtained, if that specified value were the true value. It follows that no P-value, for the null hypothesis or any other, is the probability that the specified hypothesis is true. As an obvious example, the hypothesis corresponding to the point estimate has a (two-sided) P-value of 1.0. However, we do not treat our point estimates as absolutely certain to be true. Neither is the point estimate, in general, the most probable value.
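The idea that every hypothesized value has its own P-value can be sketched in a few lines. The sketch assumes a Wald-type normal approximation on the log relative risk scale; the function name `p_value`, the estimate of 2.0, and the standard error of 0.35 are hypothetical choices made only for illustration.

```python
import math
from statistics import NormalDist

def p_value(rr_hat, se_log_rr, rr_hyp):
    """Two-sided P-value for the hypothesis that the true RR equals rr_hyp.

    Assumes log(rr_hat) is approximately normal with standard error se_log_rr.
    """
    z = (math.log(rr_hat) - math.log(rr_hyp)) / se_log_rr
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical study result: RR estimate of 2.0, standard error of log(RR) 0.35.
rr_hat, se = 2.0, 0.35
for rr_hyp in (1.0, 1.5, 2.0, 3.0, 4.0):
    print(f"hypothesized RR = {rr_hyp:.1f}   P = {p_value(rr_hat, se, rr_hyp):.3f}")
```

The hypothesis equal to the point estimate receives P = 1.0, and the P-values fall away symmetrically (on the log scale) on either side of it. None of these numbers is the probability that the corresponding hypothesis is true.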
For a given estimate, the 95% confidence interval is the set of all parameter values for which P ≥ 0.05. For the value at each limit of a 95% confidence interval, P = 0.05 (two-sided). Thus, if either of the 95% confidence limits for a relative risk estimate equals 1.0 (the null value of this parameter), we can infer that the null P-value is 0.05. From this link between confidence intervals and P-values, it follows that a 95% confidence interval is not a range of values within which the unknown true value lies with 95% probability.
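The link between confidence limits and P-values can be checked numerically. The following sketch again assumes a Wald-type normal approximation on the log scale; the helper names and the input values are hypothetical. The standard error is chosen deliberately so that the lower 95% limit falls at the null value of 1.0.

```python
import math
from statistics import NormalDist

def wald_ci(rr_hat, se_log_rr, level=0.95):
    """Confidence limits for RR under a normal approximation on the log scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return rr_hat * math.exp(-z * se_log_rr), rr_hat * math.exp(z * se_log_rr)

def p_value(rr_hat, se_log_rr, rr_hyp):
    """Two-sided P-value for the hypothesis that the true RR equals rr_hyp."""
    z = (math.log(rr_hat) - math.log(rr_hyp)) / se_log_rr
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical estimate; the standard error is picked so the lower limit is 1.0.
rr_hat = 2.0
se = math.log(rr_hat) / NormalDist().inv_cdf(0.975)
lower, upper = wald_ci(rr_hat, se)

print(f"95% CI: {lower:.3f} to {upper:.3f}")
print(f"P at lower limit: {p_value(rr_hat, se, lower):.4f}")
print(f"P at upper limit: {p_value(rr_hat, se, upper):.4f}")
print(f"null P-value:     {p_value(rr_hat, se, 1.0):.4f}")
```

Values strictly inside the interval have P greater than 0.05 and values outside it have P less than 0.05, which is all the interval asserts.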
The well-known “coverage probability” of confidence intervals pertains to a parameter value that is known to be true and the probability that an as yet unknown confidence interval will contain it. Coverage probability does not pertain to a known confidence interval and an unknown true value. To interpret a given 95% confidence interval as having a 95% probability of including the unknown true value is to mistake a frequentist confidence interval for a Bayesian probability interval. 10 This error is merely an extension of the logical fallacy of mistaking the null P-value for the probability that the null hypothesis is true.
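A small simulation may help to fix what coverage probability does describe. In this sketch the true relative risk is known in advance, and many hypothetical studies each yield a not-yet-observed interval. The true value of 1.5, the standard error of 0.3, the number of studies, and the function name `coverage` are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def coverage(true_rr, se_log_rr, n_studies, level=0.95, seed=1):
    """Fraction of simulated Wald intervals that contain the known true RR."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    true_log = math.log(true_rr)
    hits = 0
    for _ in range(n_studies):
        # One hypothetical study: its log(RR) estimate varies around the truth.
        est = rng.gauss(true_log, se_log_rr)
        if est - z * se_log_rr <= true_log <= est + z * se_log_rr:
            hits += 1
    return hits / n_studies

print(coverage(true_rr=1.5, se_log_rr=0.3, n_studies=20000))
```

The proportion printed should be close to 0.95. It is a property of the interval-generating procedure given a fixed true value, not a statement about any single interval once it has been computed.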
Why do we turn probability logic on its head in this way? We very much want to know the probabilities of hypotheses, which require Bayesian methods to determine, but our biostatistical teachers give us the P-values and confidence intervals of frequentist statistics. We are thus led into a basic fallacy, by which the probability of A given B is mistaken for the probability of B given A. 10 A P-value of 0.04 tells us that, if the null hypothesis were true, an association at least as strong as the one we observed would occur with a probability of 4%. We find it quite natural to reverse the terms, and conclude mistakenly that the probability of the null hypothesis is 4%, given the association we observed.
The null hypothesis or any other hypothesis can be highly probable even though its P-value is less than 0.05. The null hypothesis or any other hypothesis can have a low probability even though its P-value is greater than 0.05. A relative risk can be highly probable even though it lies outside a 95% confidence interval. A relative risk can be highly improbable even though it lies inside a 95% confidence interval. The indispensable role of hypotheses in the computation of P-values and confidence intervals, with each hypothesis assigning a probability to each estimate we might possibly obtain, means that these measures are not the descriptive statistics they are sometimes said to be. 11 P-values and confidence intervals are inferential statistics, but the flow of the inference is a deductive flow, in which hypotheses confer probability “down” to estimates. 12,13 Inductive statistical inference, in which the direction of the probability flow is from estimates back “up” to hypotheses, properly takes place only when prior probabilities are updated with new data, by means of Bayes’s theorem, to form posterior probabilities. 10,13
The only way we can determine the probability of the null hypothesis, or a range of values within which the true value lies with a given level of probability, is by using Bayesian methods. 10,13–15 Bayesian methods cannot be employed without the specification of prior probabilities for the hypothetical values of interest (eg, all possible values of relative risk, from zero to infinity). Since we do not specify prior probability distributions when we compute conventional (frequentist) confidence intervals, those intervals have no generally valid interpretation as Bayesian probability intervals.
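To make the role of the prior concrete, here is a minimal sketch of one simple Bayesian analysis: a normal prior on the log relative risk combined with a normal approximation to the likelihood. This conjugate shortcut is an assumption adopted for brevity, not the only or necessarily the best way to proceed, and the prior, the estimate, and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def posterior_log_rr(rr_hat, se_log_rr, prior_median_rr, prior_sd_log_rr):
    """Posterior for log(RR) from a normal prior and a normal likelihood."""
    w_prior = 1.0 / prior_sd_log_rr ** 2   # precision of the prior
    w_data = 1.0 / se_log_rr ** 2          # precision of the estimate
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * math.log(prior_median_rr)
                            + w_data * math.log(rr_hat))
    return NormalDist(post_mean, math.sqrt(post_var))

# Hypothetical inputs: estimate RR = 2.0 (SE of log RR 0.35),
# prior centered on RR = 1.0 with SD 0.5 on the log scale.
post = posterior_log_rr(2.0, 0.35, prior_median_rr=1.0, prior_sd_log_rr=0.5)

lower = math.exp(post.inv_cdf(0.025))
upper = math.exp(post.inv_cdf(0.975))
print(f"95% posterior interval for RR: {lower:.2f} to {upper:.2f}")
print(f"posterior probability that RR > 1: {1.0 - post.cdf(0.0):.3f}")
```

Only an interval of this kind can be read as containing the true value with 95% probability, and that reading is conditional on the stated prior. Changing the prior changes the interval.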
Many familiar expressions - some employing probabilistic language, others avoiding it - have the effect of leading us into this misinterpretation. It has been said that being located inside a 95% confidence interval makes values plausible, probable, likely, reasonably included by the data, or even possible. Values exterior to 95% confidence intervals have been said to be implausible, improbable, unlikely, reasonably excluded by the data, or even ruled out. None of these variations on a rhetorical theme can change a simple fact of statistical life: If we want to know which values are more and less likely, more and less plausible, etc., we must specify prior probabilities for those values and use Bayes’s theorem to update those probabilities when new data are in hand.
It has become increasingly clear that the null P-value (hereafter called “the” P-value) does not do a very good job of the task for which it was originally intended: to quantify the statistical evidence against the null hypothesis. The reason is simple. The familiar Type I and Type II error rates upon which Neyman and Pearson taught us to focus 16,17 beg vitally important questions.
One minus the Type I error rate is the specificity of a significance test: the probability of not declaring “significance” when the null hypothesis is true. One minus the Type II error rate is the test’s power or sensitivity: the probability of declaring “significance” when the alternative hypothesis is true. No informed patient would be satisfied with a diagnostic test result knowing only the test’s specificity and sensitivity. That patient would want to know the test’s predictive value (positive or negative, depending on the result).
Significance tests are no different. In the same frequency terms that Neyman and Pearson used, 16,17 the researcher who wishes to be fully informed should be interested in questions such as the following: How often is the null hypothesis true when we fail to reject it? When we do reject the null hypothesis, how often is the alternative hypothesis true? These are the probabilities of ultimate concern in significance testing – the predictive values of “NS” and “S.” There is no way to determine them without postulating (stated again in frequency terms) how often the null and alternative hypotheses are true. The interest many epidemiologists express in how low the P-value is, if it is lower than 0.05, 18 raises still other questions. How much evidence against the null hypothesis do we have when P = 0.04, or when P = 0.001? To answer these questions, we need to consider the probabilities under the null and alternative hypotheses of obtaining these particular P-values, not just the probabilities of obtaining P < 0.05.
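The diagnostic-test analogy translates directly into arithmetic. The sketch below computes the predictive values of “S” and “NS” from the Type I error rate, the power, and a postulated frequency with which null hypotheses are true. That frequency (0.5 here) is exactly the quantity that must be postulated; it, the other inputs, and the function name are hypothetical.

```python
def predictive_values(alpha, power, freq_null_true):
    """Predictive values of a significance test, in frequency terms.

    Returns (how often the alternative is true given "S",
             how often the null is true given "NS").
    """
    freq_alt_true = 1.0 - freq_null_true
    true_s = power * freq_alt_true               # alternative true, "S" declared
    false_s = alpha * freq_null_true             # null true, "S" declared
    true_ns = (1.0 - alpha) * freq_null_true     # null true, "NS" declared
    false_ns = (1.0 - power) * freq_alt_true     # alternative true, "NS" declared
    return true_s / (true_s + false_s), true_ns / (true_ns + false_ns)

# Hypothetical inputs: alpha = 0.05, power = 0.80, half of tested nulls true.
pv_s, pv_ns = predictive_values(alpha=0.05, power=0.80, freq_null_true=0.5)
print(f"alternative true given S:  {pv_s:.3f}")
print(f"null true given NS:        {pv_ns:.3f}")
```

Rerunning the function with `freq_null_true` set to 0.9 or 0.1 shows how strongly both predictive values depend on an input that the significance test itself never supplies.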
Statisticians who have examined these questions in detail 19–26 have found, under widely ranging conditions, that P-values on the order of 0.05, 0.01, and even lower provide much less evidence against the null hypothesis than they appear to provide at face value. As a general matter, P-values in the vicinity of 0.05 provide almost no evidence against the null hypothesis at all. P = 0.04, for instance, is typically found to be almost equally probable under the null and alternative hypotheses.
One upshot of this work has been a statistical research program devoted to calibrating, standardizing, conditioning, or adjusting low P-values to make them higher, so that they reflect more realistically the limited statistical evidence they provide against the null hypothesis. 19–26 Now that Bayesian methods are computationally feasible, one wonders whether these efforts to patch up P-values will ultimately be viewed as a transitional stopgap.
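As one illustration of such a calibration, the sketch below uses the lower bound −e·p·ln(p) on the Bayes factor in favor of a precise null hypothesis, which, as I understand it, is the calibration proposed in reference 24 and applies for P-values below 1/e. The function names and the even prior odds are assumptions made for illustration, and other calibrations in the cited literature give somewhat different numbers.

```python
import math

def bayes_factor_bound(p):
    """Lower bound on the Bayes factor (null vs. alternative): -e * p * ln(p)."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("this calibration applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

def min_posterior_null(p, prior_null=0.5):
    """Smallest posterior probability of the null implied by the bound."""
    prior_odds = prior_null / (1.0 - prior_null)
    post_odds = bayes_factor_bound(p) * prior_odds
    return post_odds / (1.0 + post_odds)

for p in (0.05, 0.01, 0.001):
    print(f"P = {p:<6}  Bayes factor bound = {bayes_factor_bound(p):.3f}  "
          f"minimum posterior probability of null = {min_posterior_null(p):.3f}")
```

Under even prior odds, the calibrated posterior probability of the null hypothesis at P = 0.05 comes out several times larger than 0.05, which is the sense in which the adjustment makes low P-values “higher.”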
Taking Precision Seriously
Transitional stopgaps should not be dismissed lightly, especially when the transitions in question take decades to unfold. Stopgaps can be particularly valuable when it seems that the only alternative is to cry in the (frequentist) wilderness for a (Bayesian) revolution. In epidemiology, the advent of confidence intervals creates an opportunity to take another small step toward more widespread use of Bayesian methods, while at the same time improving overall interpretation. This step is merely to take precision seriously.
Epidemiologists have many reasons to emphasize certain results over others. Some results may pertain to particularly topical research questions. Some may be more valid than others. And some may be less influenced by random error. This last consideration seems to be an important one to many epidemiologists, who regularly use P-values to determine the degree to which chance influences their results. They believe that the lower the P-value, the less the influence of chance. Unfortunately, this extremely common use of the P-value is a misuse and an abuse of that statistic. The estimates least influenced by chance are not those with low P-values, but those with narrow confidence intervals.
Consider the four hypothetical relative risk estimates in Table 1. The ratio of the upper to lower 95% confidence limits (CLR) is a handy measure of confidence interval width, and thus of precision. (For a difference measure such as the risk difference, the difference between the upper and lower confidence limits would serve the same purpose.) The example was devised to dramatize four clear-cut combinations of statistical “significance” and precision.
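The confidence limit ratio is simple to compute alongside the null P-value. The four estimates in this sketch are invented for illustration and are not the values of Table 1; they merely reproduce the same four combinations of precision and “significance.” The P-value is recovered from the limits under an assumed normal approximation on the log scale.

```python
import math
from statistics import NormalDist

def clr(lower, upper):
    """Confidence limit ratio: upper limit divided by lower limit."""
    return upper / lower

def null_p_from_ci(rr, lower, upper, level=0.95):
    """Two-sided null P-value implied by a Wald-type interval for RR."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    se = math.log(upper / lower) / (2.0 * z_crit)
    z = math.log(rr) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Invented estimates (RR, lower limit, upper limit); NOT the Table 1 values.
examples = {
    "imprecise, not significant": (2.0, 0.50, 8.00),
    "precise, significant":       (1.3, 1.10, 1.54),
    "imprecise, significant":     (6.0, 1.50, 24.0),
    "precise, not significant":   (1.1, 0.93, 1.30),
}
for label, (rr, lo, hi) in examples.items():
    print(f"{label:28s} RR = {rr:.1f}  CLR = {clr(lo, hi):5.1f}  "
          f"P = {null_p_from_ci(rr, lo, hi):.3f}")
```

Sorting such a list by CLR rather than by P-value gives a different ordering, which is the point of the example.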
To the extent that the role of chance would be taken into account in deciding which of these results to emphasize, the conventional choices would be the statistically “significant” estimates B and C. These would be the “associations unlikely to be due to chance alone.” But one of them, estimate C, is very unstable. That estimate is influenced much more by random error, and from that standpoint is much less dependable, than estimate B.
Of equal importance, when C is compared with D, estimate C is influenced much more by chance and in that regard is much less trustworthy, even though estimate C is statistically “significant” and estimate D is not. Estimates B and D – not B and C – are this study’s most precise estimates. Estimates B and D stand the best chance of holding up, conditional on their validity, in the context of existing and future research. Estimates B and D would weigh more heavily into meta-analyses and would exert stronger influences on probability distributions in properly conducted Bayesian analyses. Estimates B and D are the results that should be put forth for emphasis as the most statistically stable results this study has to offer.
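The claim about meta-analyses can be illustrated with inverse-variance weights, the weighting used in a conventional fixed-effect summary. The sketch assumes each interval is a Wald-type interval on the log scale, so that the standard error can be recovered from the limits; the two intervals and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def inverse_variance_weights(intervals, level=0.95):
    """Normalized inverse-variance weights from (lower, upper) RR limits."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    raw = []
    for lower, upper in intervals:
        se = math.log(upper / lower) / (2.0 * z)   # SE of log(RR) from the limits
        raw.append(1.0 / se ** 2)
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical intervals: one wide (CLR = 16) and one narrow (CLR = 1.4).
weights = inverse_variance_weights([(1.5, 24.0), (1.10, 1.54)])
print([round(w, 3) for w in weights])
```

The weights depend only on the width of each interval on the log scale, that is, on the CLR. The P-values play no part in them.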
It is sometimes said that confidence intervals are especially valuable, and that increases in sample size and statistical efficiency are particularly needed, when statistical “significance” has not been attained. To the contrary, an estimate that has a wide confidence interval is imprecise and unstable no matter how low its P-value. Based solely on the results in Table 1, larger sample sizes, special study populations and statistically more efficient designs would be particularly desirable for A and C, regardless of the fact that one of these estimates is statistically “significant” and the other is not.
Some epidemiologists wonder what all the fuss over P-values and confidence intervals is about. This hypothetical example shows how an emphasis on precision rather than statistical “significance” can affect which results we may choose to highlight. I invite the reader to examine published research reports in which the estimates with the lowest P-values have been singled out for emphasis, and to imagine how differently those papers would read if the estimates with the narrowest confidence intervals had been highlighted instead.
Our results that deserve the greatest reliance are those that are most stable and trustworthy. With regard to random error, a very poor way of identifying dependable results is to select associations with impressively low P-values. Inference and decision-making would be far better served by choosing estimates with narrow confidence intervals, which are least vulnerable to the play of chance. These are the results for which, by virtue of intentional or accidental features of our research methods, our studies provide the most evidence (as distinguished from the most valid evidence).
By taking precision seriously, we can easily identify those research questions on which our studies provide the greatest quantity of statistical evidence, and those questions for which larger and more statistically efficient studies are needed. In terms of resistance to random error, our most durable results are our most precise estimates - however unspectacular, unsensational, and “non-significant” many of those estimates might be.
1. Berkson J. Some difficulties of interpretation encountered in the application of the chi-squared test. J Am Stat Assoc 1938; 33: 526–536.
2. Anderson DR, Burnham KP, Thompson WL. Null hypothesis testing: problems, prevalence, and an alternative. J Wildlife Management 2000; 64: 912–923.
3. Walter SD. Methods of reporting statistical results from medical research studies. Am J Epidemiol 1995; 141: 896–906.
4. Greenland S. Randomization, statistics, and causal inference. Epidemiology 1990; 1: 421–429.
5. Poole C, Lanes SF, Rothman KJ. Analyzing data from ordered categories (letter). N Engl J Med 1984; 311: 1382.
6. Rothman KJ. A show of confidence. N Engl J Med 1978; 299: 1362–1363.
7. Savitz D, Tolo K-A, Poole C. Statistical significance testing in the American Journal of Epidemiology 1970 to 1990. Am J Epidemiol 1994; 139: 1047–1052.
8. Poole C. Beyond the confidence interval. Am J Public Health 1987; 77: 195–199.
9. Poole C. Confidence intervals exclude nothing. Am J Public Health 1987; 77: 492–493.
10. Lindley DV. The philosophy of statistics (with discussion). The Statistician 2000; 49: 293–337.
11. Savitz DA, Olshan AF. Describing data requires no adjustment for multiple comparisons: a reply from Savitz and Olshan. Am J Epidemiol 1998; 147: 813–814.
12. Poole C. Induction does not exist in epidemiology, either. In: Rothman KJ (ed), Causal Inference. Chestnut Hill, MA: Epidemiology Resources Inc., 1988: 153–162.
13. Greenland S. Probability logic and probabilistic induction. Epidemiology 1998; 9: 322–332.
14. Gelman AB, Carlin JS, Stern HS, Rubin DB. Bayesian data analysis. Boca Raton, FL: Chapman & Hall/CRC, 1995; 42–45.
15. Berry DA. Statistics: a Bayesian perspective. Belmont, CA: Duxbury Press, 1996; 147–161.
16. Neyman J, Pearson E. On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika 1928; 20: 175–240.
17. Neyman J, Pearson E. On the problem of the most efficient tests of statistical hypotheses. Philos Trans R Soc Lond A 1933; 231: 289–337.
18. Goodman SN. p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate (with discussion). Am J Epidemiol 1993; 137: 485–501.
19. Casella G, Berger R. Reconciling Bayesian and frequentist evidence in the one-sided testing problem (with discussion). J Am Stat Assoc 1987; 82: 106–111.
20. Berger JO, Sellke T. Testing a point null hypothesis: the irreconcilability of P-values and evidence. J Am Stat Assoc 1987; 82: 112–122.
21. Berger JO, Delampady M. Testing precise hypotheses (with discussion). Stat Sci 1987; 3: 317–352.
22. Delampady M, Berger JO. Lower bounds on Bayes factors for the multinomial distribution, with application to chi-squared tests of fit. Ann Stat 1990; 18: 1295–1316.
23. Berger J, Boukai B, Wang Y. Unified frequentist and Bayesian testing of a precise hypothesis (with discussion). Stat Sci 1997; 12: 133–160.
24. Sellke T, Bayarri MJ, Berger J. Calibration of P-values for testing precise null hypotheses. Am Stat 2001; 55: 62–71.
25. Goodman SN. Towards evidence-based medical statistics. I. The P-value fallacy. Ann Intern Med 1999; 130: 995–1004.
26. Goodman SN. Towards evidence-based medical statistics. II. The Bayes factor. Ann Intern Med 1999; 130: 1005–1013.