From the perspective of practicing epidemiologists, what do we really want to know about the results of our studies? In particular, what are the questions that lead us to draw on frequentist inferential statistics such as P values and calculation of confidence intervals? I have yet to hear an epidemiologist ask about infinite repetitions of an experiment that was never conducted, or the maximum error rate in rejecting or failing to reject a null hypothesis. Yet, as a result of training and either forced capitulation or willing cooperation, we invoke such frequentist techniques with little thought, having lost sight of the original questions and substituting statistical answers for substantive ones. Our misinterpretation of P values comes in large part because they do not answer the questions we are asking—so we mistakenly pretend that they do.1,2 We readily drift into the mistaken belief and language that a P value of 0.05 means that there is only a 5% chance that the null hypothesis is true. Or even worse, we dichotomize results and conclude that if P < 0.05, the association is not because of chance and if P ≥ 0.05, the association may be because of chance.
Statistics, like the many other tools in the armamentarium of epidemiologists, are helpful only insofar as they improve our ability to advance scientific understanding and, ultimately, public health. They are, at best, the means, not the end. To paraphrase from a recent commentary,3 we are susceptible to the “tyranny of statistics,” bowing before the God of scientific objectivity and conservative interpretation, despite the fact that conventional statistical tools are neither objective nor conservative.1,4 Although there are good reasons to substitute Bayesian for frequentist thinking,2 that is not likely to happen any time soon. The solution, therefore, is not simply to accept the frequentist framework, but to exploit the technology to the extent that it helps answer the questions we have posed of the data, even if that use deviates from what is technically correct or from the approaches that would be ideal.5
The motivation for these comments comes from a class session in which I carefully explained that randomization-based statistical inferences are not applicable to studies in which exposure is not randomly allocated (ie, observational studies)6; epidemiology is an exercise in measurement rather than hypothesis testing7; results should not be degraded into “statistically significant” and “nonsignificant findings”8; and we should incorporate some more thoughtful Bayesian approaches to evaluating evidence.4 That led to the perhaps predictable rejoinder, “But then what do you do in practice? How do you ever get a paper published?” In fact, I have found a reasonably comfortable compromise that strikes a balance between methodologic purity and pragmatism, recognizing the need to make peace with the many rigidly frequentist reviewers and editors, but without capitulation or dishonesty.
The solution is simple and practiced quietly by many researchers—use P values descriptively, as one of many considerations to assess the meaning and value of epidemiologic research findings. We consider the full range of information provided by P values,9 from 0 to 1, recognizing that 0.04 and 0.06 are essentially the same, but that 0.20 and 0.80 are not. There are no discontinuities in the evidence at 0.05 or 0.01 or 0.001 and no good reason to dichotomize a continuous measure. We recognize that in the majority of reasonably large observational studies, systematic biases are of greater concern than random error as the leading obstacle to causal interpretation. For example, in judging any non-null association, we consider the apparent shape of the dose-response function. We are much less impressed with nonmonotonic associations, whatever the null P value may be. P values provide an estimate of the compatibility of the data with the null value in the absence of biases, a rough index of the statistical support for the phenomenon of interest. How strong is the statistical evidence that the association is non-null, that there is a linear trend, that there is effect-measure modification on an additive scale, that one predictor is more strongly associated with the outcome than another, or that the results from a set of studies are compatible? P values answer none of these questions directly or formally, yet provide one piece of a more comprehensive answer to the questions we are really interested in.
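One way to see why 0.04 and 0.06 are essentially the same while 0.20 and 0.80 are not is to put the P values back on the scale of the test statistic. The following is a minimal sketch, assuming a two-sided P value from a normally distributed test statistic; the function name `p_to_z` is hypothetical and used only for illustration.

```python
from statistics import NormalDist


def p_to_z(p_two_sided):
    """Absolute z-score implied by a two-sided P value (normal approximation assumed)."""
    if not 0.0 < p_two_sided <= 1.0:
        raise ValueError("P value must be in (0, 1]")
    return NormalDist().inv_cdf(1.0 - p_two_sided / 2.0)


# Compare neighbours on either side of 0.05 with two clearly different values.
for p in (0.04, 0.06, 0.20, 0.80):
    print(f"P = {p:.2f}  ->  |z| = {p_to_z(p):.2f}")
```

Under this assumption, 0.04 and 0.06 correspond to estimates roughly 2.05 and 1.88 standard errors from the null, a trivial difference, whereas 0.20 and 0.80 correspond to roughly 1.28 and 0.25 standard errors.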
There are several important questions of interpretation that can benefit from calculation of null P values or other related frequentist indices such as confidence intervals, despite their being subject to abuse.
1. How likely is it that an apparent association is truly null but has arisen because of random error? In other words, how certain are we that any non-null association is present?
Although epidemiologic research should be conceptualized as an exercise in measurement, with the product of that effort being an estimate with an indication of its precision,7 there are circumstances in which any deviation from the null value is of special interest. When we are considering multiple hypotheses about exposures and disease, for example, in wide-ranging exploratory studies, we do not devote equal scrutiny to every measure of association that is calculated; rather, we want to make a judgment about what we should consider more thoroughly. If there is some degree of statistical support for an association, we may then want to consider whether biases are distorting the estimate or whether there is a plausible biologic mechanism that might underlie an observed association. Obviously, biases could also cause distortion to mask a causal effect, but we are often more willing to risk false negatives than false positives when screening multiple associations. The more extensive the array of candidate predictors, and the more limited the prior knowledge, the more central this question of “any association” becomes. Genome-wide association studies represent the extreme case among observational studies, in which hundreds of thousands of associations with essentially no prior hypotheses are screened, thus elevating random error to the primary concern and making P values a key tool for focusing attention. But assessing arrays of dietary constituents or environmental toxicants may also require some judgment about whether there is sufficient statistical evidence of an association to justify the additional steps needed to evaluate it.10
2. How do we characterize the improved precision of a larger study relative to a smaller study?
Intuitively, it is obvious that a larger study has advantages over a smaller study in reducing the potential for distortion because of random processes such as the particular persons who enroll or the occurrence of confounding because of chance alone. We seek some means of quantifying that advantage, to indicate that the larger study is less susceptible to such error. This is sometimes done by calculating power to detect associations, but power estimates are laden with arbitrary, unrealistic assumptions. A straightforward way to compare the size of studies is to focus on confidence intervals for the main effects of interest—considering confidence interval width (ratio of upper to lower boundary for ratio measures, upper minus lower boundary for difference measures)8 or, even more informally, simply perusing the range that is reflected in a 95% confidence interval and considering whether the larger study has a meaningfully narrower range than the smaller study.
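The width comparison described above can be sketched in a few lines. This is only an illustration: the function name `interval_width` and the two sets of confidence limits are hypothetical, not results from any study.

```python
def interval_width(lower, upper, ratio_measure=True):
    """Width of a confidence interval.

    Ratio measures (risk ratio, odds ratio): upper limit divided by lower limit.
    Difference measures (risk difference): upper limit minus lower limit.
    """
    if upper < lower:
        raise ValueError("upper limit must not be below lower limit")
    if ratio_measure:
        if lower <= 0:
            raise ValueError("limits of a ratio measure must be positive")
        return upper / lower
    return upper - lower


# Hypothetical 95% confidence limits for the same risk ratio in two studies.
small_study = (0.8, 2.8)
large_study = (1.2, 1.9)

print("small study width:", round(interval_width(*small_study), 2))
print("large study width:", round(interval_width(*large_study), 2))
```

A smaller value indicates a more precise estimate; because the ratio form is unit-free, it can be compared across studies of the same ratio measure.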
The 95% confidence interval is defined as the interval that would, if constructed repeatedly across repetitions of the experiment, contain the true parameter value 95% of the time. The methods for calculating confidence intervals are as arbitrary and assumption-laden as those for the P value, but the difference is that through repeated use we can develop an intuitive grasp of how they reflect precision, using them to inform our judgment about the susceptibility of the study to random error and thus to compare the precision of one study with another. For the purposes of enhancing familiarity and comparability, it is advantageous to adhere to the admittedly arbitrary choice of 95% confidence intervals (as opposed to 90%, 99%, or some other value).
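The repeated-sampling definition can be made concrete with a small simulation. The sketch below rests on deliberately idealized assumptions that are not drawn from the text: normally distributed data, a known standard deviation, and no bias of any kind. The function name `coverage` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def coverage(true_mean=0.0, sigma=1.0, n=50, reps=5000, level=0.95, seed=1):
    """Fraction of simulated studies whose interval contains the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma / n ** 0.5  # known-sigma interval for a mean
    hits = 0
    for _ in range(reps):
        sample_mean = fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / reps


print(f"simulated coverage: {coverage():.3f}")
```

The simulated proportion should land close to 0.95, but only because every assumption behind the interval holds by construction, which is rarely the case in an observational study.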
3. How much evidence is there for specific patterns in the data, such as effect-measure modification or a linear trend across categories?
There is a difference between asking whether a particular measure of association is non-null and asking whether a more complex pattern is present. The pattern is defined by a series of measures, and we need to make a judgment about whether the evidence reaches a threshold to make the phenomenon worthy of more detailed scrutiny and discussion. In the case of effect-measure modification, the data are divided more finely than for the main effects, and it is easy to be distracted by statistically unstable estimates. (I recall a manuscript I drafted with a full paragraph devoted to the meaning of apparent effect-measure modification by race/ethnicity. A coauthor suggested calculating a P value for modification of the risk ratio on a multiplicative scale, which turned out to be ~0.80. We concluded that this apparent “pattern” was not worthy of discussion.) Linear trends across categories or other dose-response patterns also pose challenges for intuitive judgment. The question of how much statistical support there is for such associations is a useful element in focusing attention.9 Although both interaction and trends can be quantified with point estimates and confidence intervals, these measures are less familiar, making the point estimates and especially the confidence intervals less easily interpreted.
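A P value for modification of a risk ratio on a multiplicative scale, of the kind mentioned in the anecdote above, can be approximated from two stratum-specific estimates and their confidence limits. This is a minimal sketch of one common approach (a Wald test on the difference of log risk ratios), not necessarily the calculation used for that manuscript. The function names and the numbers are hypothetical, and the standard errors are back-calculated from 95% limits under a normal approximation on the log scale.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def log_se_from_ci(lower, upper):
    """Standard error of a log ratio, recovered from its 95% confidence limits."""
    return (math.log(upper) - math.log(lower)) / (2.0 * Z95)


def ratio_of_ratios_p(rr_a, ci_a, rr_b, ci_b):
    """Two-sided P value for the null that two independent risk ratios are equal."""
    diff = math.log(rr_a) - math.log(rr_b)
    se = math.hypot(log_se_from_ci(*ci_a), log_se_from_ci(*ci_b))
    z = diff / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical stratum-specific risk ratios with 95% confidence limits.
p = ratio_of_ratios_p(1.8, (0.9, 3.6), 1.6, (0.8, 3.2))
print(f"P for multiplicative modification: {p:.2f}")
```

With imprecise stratum-specific estimates like these, the P value is large, which is the kind of signal that an apparent difference between strata may not merit a paragraph of discussion.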
4. Are the results from a series of studies meaningfully different or are the differences that appear likely to result from random processes?
Integrating results across a series of studies calls for some judgment as to whether there is a common finding that is blurred by random differences or a true difference among the studies. We typically make such assessments informally and use statistical tools in ways that are not very helpful. Perhaps the worst options are to compare which studies did and did not find statistically significant results (of no use whatsoever in judging their compatibility under any theoretical framework) or to assess whether their confidence intervals overlap. A purely subjective graphical assessment of effect estimates and confidence intervals is preferable to such tallies, but the assessment of heterogeneity can be further aided by calculation of a Q-statistic and corresponding P value in a meta-analysis. We are asking whether the array of results from two or more studies likely reflects the same underlying value, with the observed variation resulting from random error. Statistical tools can help in making that judgment.
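The heterogeneity calculation can be sketched as follows, assuming that the Q-statistic referred to is Cochran's Q with inverse-variance weights on the log scale, and that standard errors are recovered from 95% confidence limits. The function names and the three study results are hypothetical. The chi-square tail probability is computed with a simple series so that the example needs only the standard library; it is intended for the modest Q values typical of a handful of studies.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def chi2_sf(x, df):
    """Upper-tail chi-square probability via the series for the lower incomplete gamma."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def cochran_q(studies):
    """studies: list of (ratio, lower95, upper95). Returns (Q, df, P value)."""
    logs = [math.log(rr) for rr, _, _ in studies]
    ses = [(math.log(hi) - math.log(lo)) / (2.0 * Z95) for _, lo, hi in studies]
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, logs)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, logs))
    df = len(studies) - 1
    return q, df, chi2_sf(q, df)


# Hypothetical risk ratios and 95% confidence limits from three studies.
q, df, p = cochran_q([(1.4, 0.9, 2.2), (1.1, 0.8, 1.5), (1.9, 1.0, 3.6)])
print(f"Q = {q:.2f} on {df} df, P = {p:.2f}")
```

As with any P value discussed here, the result is best read descriptively, as one indication of how compatible the studies are with a single underlying value, rather than as a verdict on heterogeneity.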
Despite the layers of unrealistic assumptions and scenarios, the mismatch of the substantive questions and the statistical answers, and radically varying views within the research community, there is a “middle road” that balances these issues reasonably. So what shall we do at the end of the day?
— For measures of association between an exposure and outcome, simply adhere to the presentation of point estimates and 95% confidence intervals. Those intervals give a sense for the precision or susceptibility to random error, providing a standardized and familiar index for variability. Narrower confidence intervals are more precise and less susceptible to the influence of random error than wider confidence intervals. For the studies designed for screening large numbers of associations, a mechanism for selecting findings worthy of further scrutiny is needed, and P values may be a part of such an algorithm.
— For more complex patterns, such as effect-measure modification or the shape of the dose-response function, P values convey a general sense of the statistical support for the presence of the pattern of interest. The P value provides a rough suggestion of how well the data conform to a specific pattern (eg, additive effects, linear trend across categories), but says nothing about whether it is large enough to be of interest or how likely it is to result from bias.
— In evaluating the results of a study comprehensively, random error is a consideration that needs to be entertained in a thoughtful, not mechanical or algorithmic, manner. Just as we incorporate multiple considerations into assessing the potential for distortion because of other biases, including use of quantitative tools,11 the assessment of random error is not subsumed in a P value or, worse, a test of statistical significance. The study size is taken into account, along with the internal consistency and pattern of results, and the similarity or dissimilarity to the patterns seen by others using similar methods.
When authors consider the role of random error and entertain the ever-present potential for spurious results to have arisen by chance, reviewers and editors can be assured that random error has been given its due. The statistics do not directly or objectively answer the question of what role random error has played, constraining us to “accept” some results and “reject” others. At best, P values provide clues for interpretation, one element of a more informed assessment of the role of random error by the paper’s authors and others who interpret and apply the study’s results.
I thank my colleagues Joe Braun, Chanelle Howe, and Greg Wellenius for helpful comments on the manuscript.