False Positives: Commentary

On Standards of Evidence

Wacholder, Sholom

doi: 10.1097/EDE.0b013e31821d127d

In this issue of the journal, Ioannidis and colleagues1 call for a more conservative criterion than the standard P value <5% (5 × 10−2) for some areas of epidemiology. As a successful example, they point to genetic epidemiology, in which the standard P value for genome-wide association (GWA) studies is one-million-fold more extreme (P < 5 × 10−8).2 The authors also point out that random variation is not the only cause of false-positive claims of association, and that many forms of bias can also cause false positives.

Two invited commentaries3,4 on the Ioannidis proposal identify different reasons why genetic epidemiology might be considered a special case. Davey Smith3 highlights how genetic studies might provide potential insights that lead to improved clinical practice efficiently and quickly, without some of the bias from confounding that encumbers nonrandomized research with more direct measurement of exposures. Fallin and Kao4 assert that GWA studies have not yet led to any clinically useful interventions, and that many difficult steps, mostly outside of epidemiology, lie between identification of a marker with a low P value and impact on clinical practice.

Genetic epidemiology differs from clinical, environmental, occupational, or nutritional epidemiology, and perhaps even from other forms of molecular epidemiology, in several other attributes. First, the per-unit marginal cost of studying a million markers is almost trivial compared with the cost of enrolling participants and collecting DNA for studying a single genotype. In contrast, the marginal cost of collecting each exposure with minimal measurement error is often substantial.5 Second, each genetic marker can signal an association between disease and several highly correlated genetic variants, through linkage disequilibrium. Third, the sheer number of markers tested in GWA studies means that only a very small fraction can be usefully called significant. Fourth, data from GWA studies, at least those funded by NIH, are typically available to other researchers freely, with minor restrictions to respect participants' confidentiality,6 and pooled analyses are routine.

How might following the suggestion of Ioannidis and colleagues affect epidemiology? First, consider design implications. A cost of the additional rigor of requiring a P value less than 0.001 instead of less than 0.05 is a 3.5-fold increase in the sample size needed to achieve 50% power, and a 2.2-fold increase to achieve 90% power. Thus, the percentage of true effects that are false negatives will increase unless sample sizes, and study costs, increase substantially. The effect on the study of rare diseases, for which ascertainment of large numbers of cases is challenging, might be especially damaging.
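The sample-size figures can be checked with a short calculation. The sketch below assumes a normal-approximation z-test, for which the required sample size is proportional to the square of the sum of the critical value and the power quantile, so that effect size and variance cancel in the ratio. The function name and the one-sided default are assumptions made for illustration; the commentary does not say whether one-sided or two-sided tests were used.

```python
from statistics import NormalDist


def sample_size_ratio(alpha_new, alpha_old, power, one_sided=True):
    """Ratio of required sample sizes when tightening alpha_old to alpha_new.

    Assumes a z-test, where n is proportional to (z_alpha + z_power)**2,
    so the effect size and variance cancel out of the ratio.
    """
    z = NormalDist().inv_cdf
    tails = 1 if one_sided else 2
    z_new = z(1 - alpha_new / tails)   # critical value at the stricter level
    z_old = z(1 - alpha_old / tails)   # critical value at the usual level
    z_pow = z(power)                   # zero when power is 50%
    return ((z_new + z_pow) / (z_old + z_pow)) ** 2


if __name__ == "__main__":
    for power in (0.5, 0.9):
        one = sample_size_ratio(0.001, 0.05, power, one_sided=True)
        two = sample_size_ratio(0.001, 0.05, power, one_sided=False)
        print(f"power={power}: one-sided {one:.1f}, two-sided {two:.1f}")
```

Under the one-sided assumption this gives ratios of about 3.5 and 2.2, matching the figures in the text; with two-sided tests the same calculation gives about 2.8 and 2.0.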

Second, no change in P value criterion will eliminate the inevitable difficulties we face in integrating all available evidence for clinical decision-making, public health policy-setting, and choosing the next scientific steps. For example, the standard hypothesis-based approach to statistical testing of data can address several but not all of the Bradford Hill7 elements for evaluating claims of causality; an integrated approach to P values might require higher or lower thresholds of evidence from an epidemiologic study, depending on how convincingly other criteria are satisfied. We do not typically rely on Bayesian methods, which attempt to integrate other information through a prior distribution, mostly because people have difficulty integrating information from nonepidemiologic studies. Even members of the same study team struggle to achieve consensus about questions such as the relevance of animal studies. Similarly, the preset statistical analysis plan with precise criteria for claims of success in a pivotal randomized controlled trial is only a final step in a process informed by earlier scientific research, including laboratory work. Meeting a preset P value criterion does not ensure licensure if adverse events or other harmful effects are apparent, or if some feature of the findings is not consistent with the putative mode of action of the agent. Questions remaining about interpreting and integrating evidence at the analysis stage of the current study, not reflected in the P value, will be crucial at the next stage, whether that be changes to clinical practice, policy setting, or resource allocation for a new potential study.

Third, the proposed change at the level of subdiscipline within epidemiology will affect studies at different stages of the scientific process, as Fallin and Kao4 note for genetic epidemiology. Requiring preliminary studies to convincingly rule out bias and random variation as explanations of findings might stifle or delay further work that could provide stronger evidence; using the wrong biomarker of exposure or disease, poor measurement of exposure, or misclassification of disease might have obscured important relationships not apparent in the preliminary data. Measurement difficulties of competing technologies of viral detection might have delayed the consensus that HPV is the necessary cause of cervical cancer, and slowed the start of translational efforts at prevention through vaccination and HPV testing. An older parallel is what appear in retrospect to be bogus arguments against the smoking-lung cancer association. These arguments delayed effective efforts to prevent a major public health problem, thereby putting another generation at increased risk of smoking-related disease. Still, rigorous consideration of bias and random variation as explanations of a finding is necessary for evaluation of decisions about policy. Simply put, the relative cost of false negatives and false positives depends on the maturity of the research and the purpose of the analysis. As a research area progresses, accumulated evidence from earlier work might allow more realistic formal and informal assessment of strength of evidence.

When desired, alternative methods to reduce rates of false positives are available, though implementation will be subjective and likely contentious. A change of the null hypothesis from no effect to a nontrivial effect in the desired direction will lower the percentage of false positives and reduce power without requiring a change in the usual criterion. Insisting on evidence that a vaccine or drug reduces disease endpoints by significantly more than 30%, or that a new treatment extends survival by significantly more than 6 months, might allow direct evaluation of whether the total benefits of an intervention are greater than the total costs.1
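A minimal sketch of the shifted-null idea follows. It assumes the effect estimate is approximately normal with a known standard error and is tested one-sided at the usual 5% level. The function name, the standard error of 0.10, and the effect values are hypothetical numbers chosen only for illustration.

```python
from statistics import NormalDist


def prob_reject(true_effect, null_effect, se, alpha=0.05):
    """Probability of rejecting H0: effect <= null_effect with a one-sided z-test.

    Assumes the estimate is normally distributed around true_effect
    with known standard error se.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)
    shift = (true_effect - null_effect) / se
    return 1 - nd.cdf(z_crit - shift)


if __name__ == "__main__":
    se = 0.10  # hypothetical standard error of the efficacy estimate
    for true_effect in (0.10, 0.30, 0.50):
        p_zero = prob_reject(true_effect, 0.0, se)      # usual null of no effect
        p_shifted = prob_reject(true_effect, 0.30, se)  # null of 30% or less
        print(f"true effect {true_effect:.2f}: "
              f"null=0 -> {p_zero:.4f}, null=0.30 -> {p_shifted:.4f}")
```

With the shifted null, an effect that is real but trivial (0.10 here) is rarely declared significant, while the chance of detecting a worthwhile effect (0.50 here) also falls, which is the loss of power noted above.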

A less conventional approach to manage the tradeoff between false positives and false negatives, based on the false-positive report probability8 or Bayesian false discovery probability,9 retains the frequentist framework, explicitly considers prior evidence in a very simple way, and uses Wakefield's loss function formulation.9 The lowest total expected loss is achieved when α is chosen to minimize the sum of the expected loss from false-positive errors, α(1 − π)Lα, and the expected loss from false-negative errors, βπLβ, where Lα is the loss from a false-positive conclusion (type I error), Lβ is the loss from a false-negative conclusion (type II error), π/(1 − π) is the prior odds of a true effect, and β is the complement of power to find an effect of specified magnitude when the statistical size is α. Note that power decreases with decreasing α when the sample size is fixed. The loss calculations should encompass the costs of future work to rectify any wrong conclusions, and the realistic chance that prospects for other good studies are diminished by launching studies that are unlikely to provide the most useful information or are overly large and expensive.
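The minimization can be illustrated numerically. The sketch below takes π to be the prior probability of a true effect (so that the prior odds are π/(1 − π)) and assumes a one-sided z-test whose power depends on a noncentrality value, the effect divided by its standard error. The function names, the grid search, and the numbers used are illustrative assumptions, not part of the cited methods. The false-positive report probability is computed as α(1 − π) / [α(1 − π) + (1 − β)π].

```python
from math import log10
from statistics import NormalDist

_ND = NormalDist()


def beta(alpha, ncp):
    """Type II error of a one-sided z-test with noncentrality ncp = effect / SE."""
    return _ND.cdf(_ND.inv_cdf(1 - alpha) - ncp)


def expected_loss(alpha, prior, ncp, loss_fp=1.0, loss_fn=1.0):
    """alpha * (1 - pi) * L_alpha  +  beta * pi * L_beta."""
    return alpha * (1 - prior) * loss_fp + beta(alpha, ncp) * prior * loss_fn


def best_alpha(prior, ncp, loss_fp=1.0, loss_fn=1.0, grid_size=2000):
    """Grid search for the alpha that minimizes total expected loss."""
    lo, hi = -8.0, log10(0.5)  # search alpha from 1e-8 up to 0.5, log-spaced
    grid = [10 ** (lo + (hi - lo) * i / (grid_size - 1)) for i in range(grid_size)]
    return min(grid, key=lambda a: expected_loss(a, prior, ncp, loss_fp, loss_fn))


def fprp(alpha, prior, ncp):
    """False-positive report probability at size alpha."""
    false_pos = alpha * (1 - prior)
    true_pos = (1 - beta(alpha, ncp)) * prior
    return false_pos / (false_pos + true_pos)


if __name__ == "__main__":
    ncp = 3.0  # hypothetical effect size in standard-error units
    for prior in (0.5, 0.01):
        a = best_alpha(prior, ncp)
        print(f"prior={prior}: best alpha ~ {a:.4g}, FPRP there ~ {fprp(a, prior, ncp):.3f}")
```

With equal losses and this hypothetical noncentrality of 3, the loss-minimizing α is roughly 0.067 when the prior probability is 0.5 and roughly 0.001 when it is 0.01, so a less plausible hypothesis calls for a stricter criterion under the same losses.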

An instructive example far from epidemiology might be the controversy over the weight of evidence needed to establish the existence of extrasensory perception (ESP).10,11 Someone skeptical of the existence of the ESP meant to be detected by an experiment likely requires a far more extreme P value than researchers who try to establish its existence; I presume everyone is careful about bias, given the history of this kind of research. The scientific community, it seems to me, is far better served by explicit consideration of study design, prior knowledge, and expected losses than by discipline-specific P value standards.

In general, I seem to assign more weight to the loss from false negatives relative to loss from false positives than do Ioannidis and colleagues,1 and so I am inclined to act based on slightly less rigorous evidence. I agree with them that current practice is not ideal. But I would rather see explicit arguments on the evidence for individual hypotheses at issue, perhaps formulated into a prior probability and a Bayes factor from the data at hand, than rely on the more general rules that Ioannidis and colleagues suggest.1


SHOLOM WACHOLDER is in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute. His work on methods is informed by long-term involvement in primary research on HPV natural history and clinical epidemiology. He is the statistician on the Costa Rica vaccine trial and is a collaborator on studies of molecular, genetic, and occupational epidemiology.


1. Ioannidis J, Tarone R, McLaughlin J. The false-positive to false-negative ratio in epidemiologic studies. Epidemiology. 2011;22:450–456.
2. Chanock SJ, Manolio T, Boehnke M, et al. Replicating genotype-phenotype associations. Nature. 2007;447:655–660.
3. Davey Smith G. Random allocation in observational data: how small but robust effects could facilitate hypothesis-free causal inference. Epidemiology. 2011;22:460–463.
4. Fallin D, Kao L. Is “X”-WAS the future for all of epidemiology? Epidemiology. 2011;22:457–459.
5. Khoury MJ, Wacholder S. Invited commentary: from genome-wide association studies to gene-environment-wide interaction studies—challenges and opportunities. Am J Epidemiol. 2009;169:227–230; discussion 234–235.
6. Office of Extramural Research, National Institutes of Health, US Department of Health and Human Services. Genome-wide association studies (GWAS). Available at: Accessed March 27, 2011.
7. Hill AB. The environment and disease: association or causation? Proc R Soc Med. 1965;58:295–300.
8. Wacholder S, Chanock S, Garcia-Closas M, El Ghormli L, Rothman N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst. 2004;96:434–442.
9. Wakefield J. A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet. 2007;81:208–227.
10. Carey B. You might already know this. New York Times. January 11, 2011.
11. Miller G. ESP paper rekindles discussion about statistics. Science. 2011;331:272–273.
© 2011 Lippincott Williams & Wilkins, Inc.