# It’s Time to Rehabilitate the P-Value

Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park NC 27709

Address correspondence to: Clarice R. Weinberg, Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park NC 27709

Submitted and accepted January 19, 2001.

Any biostatistician worth her salt has had extensive training in hypothesis testing. Those of us who then find ourselves working in epidemiology have been frustrated when we try to use our tools. The *P*-value has become politically incorrect, at least among a circle of vocal and influential epidemiologists. This journal has taken a leading role in discouraging the use of *P*-values and hypothesis tests, in favor of confidence intervals and estimation. ^{1} This view has produced a near-complete prohibition, and the few of us intrepid diehards who have secured the former editor’s permission to publish an actual *P*-value in these pages wear that achievement as a badge of courage.

The *P*-value quantifies the discrepancy between a given data set and the null hypothesis, as the probability of results being as discrepant or more so, given the null hypothesis. It thus assesses the degree of inconsistency between the stated null and the observed data. A fuller description of this simple idea can be found in many introductory textbooks. ^{2}
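To make the definition concrete, here is a minimal sketch in Python. It assumes a binomial outcome and a one-sided alternative; those choices, the function name, and the counts are illustrative assumptions and do not come from any real study. The function simply adds up the null probabilities of every result at least as discrepant as the one observed.

```python
from math import comb

def binomial_p_value(successes, trials, null_prob=0.5):
    """One-sided exact P-value: P(X >= successes) when the null is true."""
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical data: 15 events in 20 trials, null probability 0.5.
p = binomial_p_value(15, 20)
print(f"P = {p:.4f}")  # prints P = 0.0207
```

Note that the calculation uses only the null model; no alternative has to be specified to obtain the number.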

How has the *P*-value come to be so despised? There are many reasons. Certainly, less abstract ways to summarize study results are appealing. I like to see papers that stay close to the data by showing graphs and point estimates where possible. Moreover, the point estimate for a risk ratio can speak more directly to the possible public health implications than can a *P*-value, which inherently confounds sample size and magnitude of effect.

Furthermore, the *P*-value is easy to misinterpret. (No, it is *not* the probability that a null hypothesis is true, nor is it the probability that we are making a certain error.) Mindless efforts to dichotomize results into significant *vs* nonsignificant, based on some cutpoint, such as 0.05, get us into trouble: results were significant for males but not for females, and therefore? One of the most pernicious abuses of automated decision making occurs when clinical treatments are asserted to be equivalent, based on a nonsignificant *P*-value for the observed difference. ^{3} Finally, we have all read literature reviews where silly claims are made that the research on a given question has been “inconsistent,” because some were “positive” (statistically significant) and others “negative” (not statistically significant) - when in fact there is no inconsistency among the findings. Such abuses remain painfully common in epidemiology.

On the other hand, the propensity to misuse or misunderstand a tool should not necessarily lead us to prohibit its use. The theory of estimation is also often misunderstood. How many epidemiologists can explain the meaning of their 95% confidence interval? There are other simple concepts susceptible to fuzzy thinking. I once quizzed a class of epidemiology students and discovered that most had only a foggy notion of what is meant by the word “bias.” Should we then abandon all discussion of bias, and dumb down the field to the point where no subtleties need trouble us?

There is no need to choose one set of techniques to the dogmatic exclusion of another. The challenge is to have a healthy respect for the uses and limitations of the available methods. Sometimes estimation alone is insufficient and hypothesis testing should play an important supplemental role. Here are some examples.

Consider a case-control study in which an exposure is divided according to quintiles for analysis. This creates a four-dimensional contrast parameter in which the four estimates are not statistically independent. The appropriate confidence region is not four separate confidence intervals (one per odds ratio), but a curved region in four-dimensional space. Showing such results graphically is problematic. But the separate evaluation of the marginal confidence interval for each odds ratio is not a satisfactory alternative.

My first question in evaluating the effect of a multi-level factor will always be: Is there overall evidence for an influence of this exposure on this outcome? We could go right to estimation, and there are good methods for estimating an exposure-response curve, *eg*, using splines or generalized additive models (GAMs). However, empirical estimates, whether based on category-based odds ratios or spline fits or GAM approaches, will rarely give a flat estimated exposure-response, even when the truth is flat. I want to avoid being deceived by the wiggles in the estimated exposure-response, wiggles that might only reflect the random noise present in all finite data sets. Is there much evidence for a difference in risk across the five exposure categories, or across the range of measured exposures? If a test suggests there is, I can allow myself to become fascinated by the wiggles in the estimated exposure-response. If not, then I should wonder if trying to find inferential meaning in the curve, however interesting, is more akin to reading tea leaves than science.
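One simple version of such an overall test is sketched below: a Pearson chi-square test on the 2 × 5 table of cases and controls by exposure quintile, giving a single statistic on four degrees of freedom. This is only an illustration under stated assumptions. The counts are hypothetical, the function names are invented for the example, and the tail probability uses a closed form that holds only for even degrees of freedom. In practice one might prefer a likelihood ratio test from a logistic model that adjusts for confounders.

```python
from math import exp, factorial

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    return exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))

def overall_test(cases, controls):
    """Pearson chi-square test of no difference in risk across categories."""
    total_cases, total_controls = sum(cases), sum(controls)
    total = total_cases + total_controls
    stat = 0.0
    for a, b in zip(cases, controls):
        n_category = a + b
        for observed, row_total in ((a, total_cases), (b, total_controls)):
            expected = row_total * n_category / total
            stat += (observed - expected) ** 2 / expected
    df = len(cases) - 1
    return stat, df, chi2_sf_even_df(stat, df)

# Hypothetical counts by exposure quintile, lowest to highest.
cases = [40, 45, 52, 58, 65]
controls = [60, 55, 48, 42, 35]
stat, df, p = overall_test(cases, controls)
print(f"chi-square = {stat:.2f} on {df} df, P = {p:.4f}")
```

The point of the single four-degree-of-freedom statistic is that it respects the joint behavior of the four contrasts, which four separate marginal intervals do not.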

Another context where the estimate is naturally multivariate is when assessing possible heterogeneity of odds ratios (*eg*, across cities). The individual differences may not be of interest. What we want to judge is whether the estimates are similar enough to justify pooling across the strata to form a single estimate of effect. The most convenient approach may be to test homogeneity. Analogous problems arise when assessing multiplicative interactions, or etiologic heterogeneity across a multi-category outcome based on a polytomous logistic model.

A related and important use of test-based statistics is in evaluating the fit of a model to the data, *eg*, by using the likelihood ratio statistic or Akaike’s information criterion (AIC). For example, in the logistic analysis of case-control data one usually omits many higher order terms (two-way, three-way interactions, etc.). One should carry out a comparative goodness-of-fit procedure to evaluate whether this reduction to a more parsimonious model has markedly degraded the fit to the data. A similar example arises in the context of the proportional hazards model, where a prudent investigator uses the data to examine possible violations of the proportionality assumption.

One key point often overlooked is that for estimation to be valid, the model specified must be correct. By contrast, for hypothesis testing to be valid, the model has to be correct only under the null hypothesis. For this reason, hypothesis-testing tools are particularly useful wherever one can specify a sensible model under the null, but cannot confidently specify an alternative model.

Time-space clustering is a good example. We looked for seasonal variation in the risk of very early pregnancy loss as evidence of action by some seasonally varying risk factor. ^{4} We fitted seasonal risk using trigonometric regression with data-determined phase and amplitude. Under the null hypothesis, the risk should be constant across the days of the year, which allows a valid likelihood ratio test. In our data, this test provided good evidence (based on the *P*-value) that risk was not flat across the days of the year. However, we could not accept a sine function as the correct representation of the seasonal pattern. This model had served merely to provide a simply parameterized alternative that offered reasonable power under a wide variety of unimodal seasonal patterns. ^{4} Rather, we chose to estimate the relation between season and risk by using a nonparametric, moving average method, estimating the risk function across the calendar year.

This approach is conceptually similar to the strategy employed by toxicologists who evaluate dose-response by a trend test. Toxicologists do not necessarily believe in linearity of the dose-response, but a linear function can provide good statistical power against a wide range of monotone alternatives to the flat, no-effect null.

Yet another example is found in survival analysis applications. ^{5} Suppose one is unwilling to specify a proportional hazards or additive hazards model, but wants to compare time-to-event for an exposed *vs* an unexposed group. The method of choice is the log-rank procedure. Without a parametrically specified alternative model, no single parameter summarizes the difference, though Kaplan-Meier curves can give a visual sense of the contrast between the two survival curves.
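The log-rank computation itself is short, as the following sketch suggests. It is written in Python with only the standard library. The function name and the toy follow-up times are hypothetical, and the *P*-value relies on the usual chi-square approximation with one degree of freedom. At each distinct event time the observed number of events in the exposed group is compared with the number expected if the two groups shared the same hazard.

```python
from math import erfc, sqrt

def logrank_test(times_exposed, events_exposed, times_unexposed, events_unexposed):
    """Two-group log-rank test; events are 1 (event observed) or 0 (censored)."""
    records = [(t, e, 1) for t, e in zip(times_exposed, events_exposed)]
    records += [(t, e, 0) for t, e in zip(times_unexposed, events_unexposed)]
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({time for time, event, _ in records if event}):
        # Numbers still at risk just before time t, overall and among the exposed.
        n = sum(1 for time, _, _ in records if time >= t)
        n_exp = sum(1 for time, _, g in records if time >= t and g == 1)
        # Events occurring exactly at time t, overall and among the exposed.
        d = sum(1 for time, e, _ in records if time == t and e)
        d_exp = sum(1 for time, e, g in records if time == t and e and g == 1)
        observed_minus_expected += d_exp - d * n_exp / n
        if n > 1:
            variance += d * (n_exp / n) * (1 - n_exp / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("no informative event times")
    chi_square = observed_minus_expected**2 / variance
    p_value = erfc(sqrt(chi_square / 2))  # upper tail of chi-square, 1 df
    return chi_square, p_value

# Hypothetical follow-up times (months); 1 = event, 0 = censored.
chi_square, p = logrank_test(
    [3, 5, 7, 9, 12], [1, 1, 1, 0, 1],
    [6, 10, 14, 15, 18], [1, 0, 1, 1, 0],
)
print(f"log-rank chi-square = {chi_square:.2f}, P = {p:.3f}")
```

Nothing in the calculation refers to a hazard ratio or any other effect parameter, which is why the procedure yields a test but no single summary estimate.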

The fact that a null model is often easier to specify than other alternatives may shed light on a difference that has puzzled me: Epidemiologists have a strong preference for estimation, while lod scores and *P*-values continue to reign in the human genetics literature. In genetics, the question often centers on possible linkage disequilibrium between a marker and a putative (unknown) disease gene. The magnitude of this association may vary among populations, depending on the age of the mutation and the marker haplotype background on which it occurred historically in shared ancestors. Measures of association between risk and a marker allele should thus vary across populations, and are not particularly meaningful in themselves. This illustrates the setting described above, in which one cannot specify an alternative model with confidence. What one can specify is the null model, *eg*, transmissions of alleles from parents to affected individuals should occur randomly under a null specifying no linkage or no association. For this reason, the methodologic focus in genetics is often on null models, which we can confidently specify, rather than estimation.

The *P*-value provides one useful strategy to weigh data-based evidence against a particular hypothesis (although even this notion has provoked controversy ^{6}). Other methods, such as the Bayes factor ^{7} or a Bayesian posterior probability, can also be used. These measures, together with prior information, estimation, and scientific judgment, can serve to inform our decision-making. However, we should continue to resist any attempts to automate our decisions, as in formal hypothesis testing.

One point often neglected in the criticism of hypothesis testing is that confidence regions correspond in a natural way to hypothesis tests. A 95% confidence region excludes the null parameter value if and only if the corresponding *P* < 0.05. Those who like to focus on whether or not their confidence interval excludes some null value are covertly committing a hypothesis test, the arbitrariness of 0.05 being matched by the arbitrariness of 0.95.
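The correspondence is easy to check numerically. The sketch below assumes the simplest case, a Wald interval and Wald test for a log odds ratio with a normal approximation. The estimates and standard errors are hypothetical, and the function name is invented for the example.

```python
from statistics import NormalDist

def wald_summary(log_or, se, level=0.95):
    """Wald confidence interval and two-sided P-value for a log odds ratio."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    lower, upper = log_or - z_crit * se, log_or + z_crit * se
    p_value = 2 * (1 - std_normal.cdf(abs(log_or / se)))
    return (lower, upper), p_value

# Hypothetical estimates: (log odds ratio, standard error).
for log_or, se in [(0.50, 0.20), (0.30, 0.20), (-0.45, 0.25)]:
    (lower, upper), p = wald_summary(log_or, se)
    excludes_null = lower > 0 or upper < 0  # null value: log OR = 0
    print(f"log OR = {log_or:+.2f}: CI excludes 0: {excludes_null}, P < 0.05: {p < 0.05}")
```

In each row the two answers agree, because the interval and the test are built from the same statistic and the same critical value.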

One of the most pernicious effects of attempts to prohibit *P*-values may be the *de facto* avoidance of decision-making itself. Wacholder has suggested (personal communication) that the fundamental philosophical discord between those who would do only estimation and those who prefer a more eclectic approach is in the refusal of the former to assign any particular importance to the null hypothesis. Investigators who confine themselves to estimation in effect treat the whole parameter space as equally interesting. It is as if we do not need to answer any of the either/or questions, as if we can do science without ever having to decide what factors are etiologic and what are not, as if we can set policy without deciding what drugs are effective, what screening strategies are effective, etc. In my opinion, the either/or questions should continue to motivate much of our research, even if we do occasionally get the answers wrong. Again, this is not to say that decisions should be automated, as in formal hypothesis testing, but that *P*-values, together with supplemental evidence and scientific judgment, can inform the policy and research priority decisions that must be made.

In conclusion, my view is that the *P*-value (and its Bayesian counterparts) has important uses, and should remain an important tool for inference in epidemiology. In the words of Joseph Fleiss, ^{5} data analysts who choose to calculate test statistics and

“. . . who interpret the resulting *P*-values in an appropriately cautious fashion, should neither apologize for doing so nor tolerate unreasonable demands to reanalyze their data in a manner contrary to what they believe to be appropriate.”

It is not just the *P*-value, but statistics in general that needs to be rehabilitated to a fuller role in epidemiology. Training in epidemiology should include both statistical methods for estimation and conceptual appreciation for the statistical theory underlying hypothesis testing. Full competence, including the ability to weigh the literature from other fields with insight and care, and the ability to avoid the ever-seductive temptation to overinterpret one’s own data, requires no less. It is time to stop blaming the tools, and turn our attention to the investigators who misuse them. We need to discourage the abuse of *P*-values, but not constrain progress in epidemiology by prohibiting their use.