The discovery and replication of associations is a core activity of quantitative research. This article will not deal with the debate on whether research findings are credible.^{1} I will focus instead on the interesting subset of research findings that are true. Research findings discussed here encompass all types of associations that emerge from quantitative measurements, and are expressed as effect metrics. This includes treatment effects from clinical trials, measures of risk for observational risk factors, prognostic effects for prognostic studies, and so forth. I start here with the assumption that a research finding is indeed true (non-null), ie, it reflects a genuine association that is not entirely due to chance or biases (confounding, misclassification, selection biases, selective reporting, or other). The question is: do the effect sizes for such associations, at the time they are first discovered and published in the scientific literature, accurately reflect the true effect sizes?

The article has the following sections: a brief literature review on inflated early-effect sizes based on theoretical and empirical considerations; a description of the major reasons why early discovered effects are inflated and the major countering forces that may occasionally lead to deflated effects (underestimates); and suggestions on how to deal with these problems.

##### Evidence About Inflated Early-Effect Sizes

Table 1 cites articles suggesting that early studies give (on average) inflated estimates of effect.^{2–34} I list here only selected evaluations that cover either many different articles/effects or a whole research domain or method. This list is nowhere close to exhaustive. For some topics, such as the inflation of regression coefficients for variables selected through stepwise statistical-significance-based processes, the literature is vast. The theme of inflated early effects has been encountered in various disguises in many scientific disciplines in the biomedical sciences and beyond. For empirical studies, it may not be known whether the subsequent studies are more correct than the original discovery, but when a pattern is seen repeatedly in a field, the association is probably real, even if its exact extent can be debated. One should also acknowledge the difficulty in differentiating between an early inflated but true (non-null) effect and an entirely false (null) one. In addition to empirical studies, however, Table 1 also includes theoretical work that proves why inflation is anticipated; some of these arguments are discussed in the next section.

I mention here a few examples to demonstrate the seriousness of the problem. The prognostic significance of a 70-gene expression signature for lymph-node-negative breast cancer is accepted beyond doubt.^{35} However, while the first study published in *Nature* showed almost perfect sensitivity and specificity, even in an independent replication exercise of 19 patients,^{36} subsequent evaluation in a cohort of 307 women showed sensitivity of 90% and specificity of only 40% (AUC for survival 0.648).^{37} Prognostic ability is present, but the difference between an almost-perfect predictor and a modest-to-poor predictor is prominent.^{35}

Many high-profile clinical trials are stopped early during their conduct. This is performed according to robust rules that suggest termination when a demanding threshold of statistical significance is crossed during an interim analysis.^{3,4} These interventions are indeed effective (the null of “no effectiveness” is correctly rejected). However, as shown both in theory^{3,4} and in practice,^{5} the effect sizes derived from such early terminated trials are inflated. With very early termination, the effect sizes may be markedly inflated,^{5} with implications for decision-making in the use of these interventions.

Theoretical considerations prove that linkage signals of genome-wide linkage studies are inflated.^{12–15} These studies have aimed to reveal loci that harbor genetic variants that are related to various phenotypes. Several thousands of such studies conducted over 2 decades have yielded very few replicated hits. Although the replication record is better with genome-wide association studies, theoretical considerations again show the early discovered effects are inflated.^{23,24} Furthermore, if the observed effects are used as estimates in designing replication studies, these subsequent studies will be underpowered, and genuine effects will be falsely nonreplicated.^{38}
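The underpowering of replication studies planned around an inflated estimate can be illustrated with a rough power calculation. The sketch below is only an illustration, not a method from the cited literature. It assumes a case-control design with equal group sizes, a normal approximation for the log OR, and hypothetical values: 30% exposure among controls, an inflated discovery OR of 1.5, and a true OR of 1.1. All function names are invented for this example.

```python
import math
from statistics import NormalDist

def case_exposure_prob(p_control, odds_ratio):
    """Exposure probability among cases implied by the control probability and the OR."""
    odds = odds_ratio * p_control / (1 - p_control)
    return odds / (1 + odds)

def approx_power(n_per_group, odds_ratio, p_control=0.30, alpha=0.05):
    """Approximate two-sided power of a Wald test on the log OR (normal approximation)."""
    p_case = case_exposure_prob(p_control, odds_ratio)
    variance = (1 / (n_per_group * p_control) + 1 / (n_per_group * (1 - p_control))
                + 1 / (n_per_group * p_case) + 1 / (n_per_group * (1 - p_case)))
    shift = abs(math.log(odds_ratio)) / math.sqrt(variance)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the small chance of a significant result in the wrong direction.
    return NormalDist().cdf(shift - z_crit)

def n_for_power(odds_ratio, target=0.80, p_control=0.30, alpha=0.05):
    """Smallest per-group sample size reaching the target power (simple linear search)."""
    n = 10
    while approx_power(n, odds_ratio, p_control, alpha) < target:
        n += 1
    return n

if __name__ == "__main__":
    n_planned = n_for_power(1.5)            # replication sized for the inflated OR (assumed 1.5)
    print("planned n per group:", n_planned)
    print("power if true OR is 1.1:", round(approx_power(n_planned, 1.1), 3))
    print("n per group needed for true OR 1.1:", n_for_power(1.1))
```

Under these assumptions, a replication sized for 80% power at the inflated OR retains only a small fraction of that power when the true OR is 1.1, so a genuine effect would usually fail to replicate.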

##### Inflated Effect Sizes Due to Selection Thresholds and Suboptimal Power

Effect sizes of newly discovered true (non-null) associations are inherently inflated on average. This is due to a key characteristic of the discovery process. Inflation is expected when, to claim success (discovery), an association has to pass a certain threshold of statistical significance, and the study that leads to the discovery has suboptimal power to make the discovery at the required threshold of statistical significance. Both conditions are necessary to inflate effect sizes. If investigators were not fixated on claiming discoveries based on *P* value thresholds, this would not be an issue. Similarly, if the discovery studies were fully powered, inflation would not be an issue. Selection usually entails *P* values, but a similar pattern may be seen if selection is based on effect size or some other threshold measure.

For illustrative purposes, I use here a simulation approach to demonstrate this phenomenon and the relationship between inflation and lack of power. Suppose that the true odds ratio (OR) for an association is 1.10 or 1.25 and that the proportion of exposed individuals in the control group is 30%. We can simulate a set of studies that have an equal number of participants (n) in each of the 2 compared groups. The number of exposed in the control group in each simulated study is drawn randomly from a binomial distribution with probability 0.30. The number of exposed in the case group in each simulated study is drawn randomly from a binomial distribution with probability 0.3203 or 0.3488, so as to correspond to OR = 1.10 and 1.25, respectively. The median OR of these simulated studies is expected to be 1.10 or 1.25, respectively. However, this is not so when we focus only on the simulated studies that have a *P* value for the association crossing a specific level of statistical significance. Table 2 shows the median and IQR of the ORs that cross the “*P* value = 0.05” threshold of statistical significance for different values of n. As shown, even though the true OR is 1.10, the median observed OR when a study discovers this association (*P* < 0.05) is 1.51 when n = 250 (a study of 500 participants total). With similar sample sizes, when the true OR = 1.25, the discovered median OR is 1.60. When the studies have n = 50 (100 participants total), the median discovered OR is 2.73 instead of 1.25, representing huge inflation. One should note also the skewed nature of the distributions of discovered effects.
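The simulation described above can be sketched in a few lines. This is a minimal reconstruction under stated assumptions, not the code behind Table 2. It uses a Wald test on the log OR with a critical value of 1.96, counts as "discovered" only significant results in the direction of the true effect, and skips tables with a zero cell. The original simulation may have differed on each of these points, so the output need not match Table 2 exactly. Function names and the seed are arbitrary.

```python
import math
import random
import statistics

def case_exposure_prob(p_control, odds_ratio):
    """Exposure probability among cases implied by the control probability and the OR."""
    odds = odds_ratio * p_control / (1 - p_control)
    return odds / (1 + odds)

def draw_binomial(rng, n, p):
    """Binomial draw as a sum of Bernoulli trials (standard library only)."""
    return sum(rng.random() < p for _ in range(n))

def simulate_discovered_ors(true_or, n_per_group, p_control=0.30,
                            n_studies=2000, z_crit=1.96, seed=1):
    """Return (median OR of all studies, median OR of 'discovered' studies, fraction discovered)."""
    rng = random.Random(seed)
    p_case = case_exposure_prob(p_control, true_or)
    all_ors, discovered = [], []
    for _ in range(n_studies):
        a = draw_binomial(rng, n_per_group, p_case)     # exposed cases
        c = draw_binomial(rng, n_per_group, p_control)  # exposed controls
        b, d = n_per_group - a, n_per_group - c         # unexposed cases, unexposed controls
        if min(a, b, c, d) == 0:
            continue  # assumption: drop tables with a zero cell instead of correcting them
        log_or = math.log(a * d / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        all_ors.append(math.exp(log_or))
        if log_or / se > z_crit:  # significant and in the direction of the true effect
            discovered.append(math.exp(log_or))
    median_discovered = statistics.median(discovered) if discovered else float("nan")
    return statistics.median(all_ors), median_discovered, len(discovered) / len(all_ors)

if __name__ == "__main__":
    for true_or in (1.10, 1.25):
        for n in (50, 250):
            med_all, med_disc, frac = simulate_discovered_ors(true_or, n)
            print(true_or, n, round(med_all, 2), round(med_disc, 2), round(frac, 3))
```

The median over all simulated studies stays close to the true OR, whereas the median among the studies that cross the threshold is much larger, and more so for the smaller sample size.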

One may argue that we do not know the true effect sizes necessary to make these simulations for specific hypotheses. In the example above, if the true OR were 500, then studies with 250 participants per group would have excellent power to detect it at α = 0.05 and the discovered effects would not be inflated compared with the true OR = 500. In some fields, there may be considerable uncertainty about the magnitude of the true effect sizes. However, in most fields, we can make reasonable guesses about the effect sizes, with only modest uncertainty. For example, in genetic associations of common variants with common diseases, we have repeatedly found that effect sizes of consistently and extensively replicated associations tend to be small or even very small (most ORs = 1.1–1.4; a few, 1.4–2).^{39–41} Similarly, for most medical interventions with hard clinical outcomes (including mortality) relative risk decreases of 10%–30% are the best we can hope for. Some fields that have proposed much larger effect sizes may simply need a reality check. Perhaps some of these fields have been stuck in doing underpowered studies, and thus effects circulating in their literature appear large when they are actually much smaller.

##### Inflated Effects Due to Flexible Analyses (Vibration of Effects) and Selective Reporting

Until now, we have assumed that the (simulated) studies arise out of the play of chance alone. We have assumed that there is no human intervention in the analysis process and there is only one analysis based on the observed results. This situation is rare in discovery research. The hallmark of discovery is the performance of exploratory analyses. Flexible analyses lead to vibration of effects. Vibration conveys the extent to which an effect may change in alternative analytical approaches.

Vibration is mostly due to the availability of alternative options in statistical model selection (eg, Cox model for time-to-death vs. logistic regression for death in 30 days); statistical inference machinery (eg, different methods for computation of the odds ratio [eg, with or without Woolf correction and with different corrections of zero cells] and its variance^{42}); data selection (eg, possibility to exclude or include some participants based on some partly prespecified, prespecified but ambivalent, or entirely post hoc criteria); dependent arbitration of equivocal data; and wide choice of adjustments for other covariates (especially when there are many such). Changes may affect not only the analytic core but also the question formulation itself, eg, changing eligibility criteria may modify the research question.

I define the vibration ratio for effect size as the ratio between the extremes of effect sizes that can be obtained in the same study under different analytical options. In Figure 1, I have analyzed the same dataset (250 participants) with different approaches. Unadjusted analysis yields OR = 2.10 (95% confidence interval [CI] = 1.18–3.72). I simulate 2 random variables and also perform analyses adjusting the association for each one of them. The vibration ratio is only 1.01. I simulate another random variable and perform analyses where the top 6% or the top 10% of the participants for this random variable are considered noneligible for the analysis. The vibration ratio is 1.18. Then, I also simulate 5 observations (only 2% of the data) for which exposure is considered equivocal, and is either changed to specifically agree with the direction of the association or is changed to specifically disagree with the direction of the association. The vibration ratio is 1.55. The possible combinations of random adjustment, random eligibility, and dependent arbitration, as above, yield a vibration ratio of 1.95: ORs as divergent as 1.48 (CI 0.81–2.70) and 2.88 (1.55–5.35) are obtained with these relatively subtle options. Without trying hard, I changed the OR 2-fold.
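As a small worked illustration of the definition, the vibration ratio is simply the largest effect size divided by the smallest across the analyses of one dataset. The helper name below is hypothetical; the three ORs are the ones quoted in the text (the unadjusted estimate and the two most divergent analyses).

```python
def vibration_ratio(effect_sizes):
    """Ratio of the largest to the smallest ratio-scale effect size from one dataset."""
    if not effect_sizes or min(effect_sizes) <= 0:
        raise ValueError("need at least one positive ratio-scale effect size")
    return max(effect_sizes) / min(effect_sizes)

# ORs quoted in the text for the same 250-participant dataset.
ors = [2.10, 1.48, 2.88]
print(round(vibration_ratio(ors), 2))  # 1.95
```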

The vibration ratio will be larger in small datasets and in those with hazy definitions of variables, unclear eligibility criteria, large numbers of covariates, and no consensus in the field about what analysis should be the default. In most discovery research, this explosive mix is the rule. It is difficult to obtain funding to run very large studies for taking a first shot into the dark, and discovery is inherently related to situations where hazy definitions and iterative searching abound. The wealth of databases in covariates has also grown over time.

Even if enormous, vibration alone would not lead to inflated discovered effects if one eventually presents all the applied analytical options without any preference. However, typically only one or a few analyses are presented. Moreover, vibration would not be a problem if the one or few analyses selected for presentation were a random choice of the possible ones, selected with an impartial view and no interest in making a discovery. However, this is counterintuitive to the discovery process. One makes exploratory analyses specifically to find something. The effects selected for presentation are likely to be among the largest observed, if not the largest possible. Secondary analyses similarly may be chosen to show that they are consistent with the main selected analysis.

Selective analyses and outcome reporting have been extensively demonstrated in clinical-trials research comparing protocols against reported results.^{43–45} In theory, randomized trials have more inflexible protocols compared with observational epidemiology and fully exploratory research. For observational research, similar evaluations are more difficult to conduct because protocols are not readily available; often there is no protocol at all. Empirical evidence has demonstrated across a large sample of 379 epidemiologic studies that investigators selected the contrasts for continuous variables so as to show effects as being larger: more extreme contrasts were presented when effects were inherently smaller.^{46}

Post hoc demonstration of selective analysis and outcome reporting is difficult. Recently, a test was proposed to examine whether the number of reported study results that pass certain levels of statistical significance is reasonable or larger than what one would expect, even if the effect sizes for the proposed associations (eg, as suggested by meta-analyses of all relevant studies) were true.^{47} Testing has suggested substantial selective reporting biases in both clinical trials and observational epidemiology.^{12,47–49}
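The logic of such a test can be sketched as a comparison between the observed number of significant results, O, and the number expected, E, where E is the sum of each study's power to detect the assumed true effect. The chi-square form used below, (O − E)²/E + (O − E)²/(n − E) with 1 degree of freedom, is one possible formulation and is an assumption of this sketch; the power values are hypothetical and the function name is invented.

```python
import math

def excess_significance(observed_significant, powers):
    """Compare the observed count of significant studies with the sum of study powers.

    Returns (expected count, chi-square statistic, P value on 1 degree of freedom).
    """
    n = len(powers)
    expected = sum(powers)
    if not 0 < expected < n:
        raise ValueError("expected count must lie strictly between 0 and the number of studies")
    diff = observed_significant - expected
    chi2 = diff ** 2 / expected + diff ** 2 / (n - expected)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return expected, chi2, p_value

# Hypothetical field: 20 studies, each with 20% power for the assumed true effect,
# of which 12 report a statistically significant result.
expected, chi2, p_value = excess_significance(12, [0.2] * 20)
print(round(expected, 1), round(chi2, 1), p_value)
```

A small P value here indicates more significant results than the assumed effect size and the study sizes can plausibly explain; it does not identify which studies are affected.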

##### Inflated Interpretation for Effect Sizes

Inflated interpretation is the toughest of all sources of inflation to tackle. In a culture that rewards discovery, investigators may make an extra effort to present results in the most favorable way. This goes beyond selective reporting and enters the realm of qualitative interpretation of quantitative effects. Typical variants of inflated interpretation include unwarranted extrapolations and overstated generalizability,^{50} silencing or downplaying limitations and caveats,^{51} mishandling external evidence,^{52,53} and extension of promises to different inferential levels. In the last category, some typical leaps of faith in the epidemiologic literature include the interpretation of association as causation, the interpretation of association or even causation as anticipated treatment effects, and the interpretation of optimal efficacy as effectiveness in everyday life and clinical practice. In the molecular literature, a typical leap of faith is the interpretation that a modest association pointing to a new biologic pathway can be translated into a major benefit for treatment of diseases that may somehow be involved in this pathway. The sparse successful clinical translation of major promises made in the most high-profile basic science journals shows that this over-interpretation is common.^{54}

##### Why Published True Associations May Sometimes Have Deflated Effects

Contrary to the above, some discovered associations may have deflated effect sizes compared with the true ones. For example, this may occur with overpowered studies, where interim looks at the data are performed at early stages and discovery happens late. If the association does not cross the desired threshold of significance at the interim looks, but only at the very end, the effect may be deflated, although the deflation is typically small.^{3,4} The same situation would arise if the discovery process occurs as a regularly updated prospective meta-analysis, a true association gets discovered (becomes formally significant for the first time) only after many studies have been performed and combined in the meta-analysis, and the power of these combined studies is high to detect such an association. Nevertheless, in most fields, overpowered studies at the discovery phase are still a small minority compared with underpowered studies^{55–60}; moreover, the paradigm of prospective cumulative meta-analysis as a discovery tool has not been widely disseminated.

Another reason for deflated effect sizes is independent nondifferential misclassification due to measurement error in the associated variables. There is an extensive literature on misclassification and how to correct effect sizes for misclassification.^{61} However, such corrections have never become mainstream. Perhaps this is because usually nonindependent and differential misclassification has been difficult to exclude, and these can either deflate or inflate observed effects.^{62,63} Measurement error has decreased over time for many fields of research in the current era. For example, genetic measurements have very minor measurement error if measurement platforms are used properly. Conversely, for some other variables (eg, lifestyle), measurement error may remain substantial. Even in molecular/genetic epidemiology, misclassification remains important for evaluating gene-environment interactions.^{64–67} Of note, when effects diminish because of misclassification, power to detect them also diminishes sharply^{68}; this enhances the inflation upon discovery (inflation of a deflated effect), as above.
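The attenuating effect of independent nondifferential misclassification can be shown with a short worked example. The sketch assumes a binary exposure measured with the same sensitivity and specificity in cases and controls; the values used (true OR = 1.25, 30% exposure among controls, sensitivity 0.80, specificity 0.90) are hypothetical, as are the function names.

```python
def observed_prob(true_prob, sensitivity, specificity):
    """Probability of being classified as exposed, given imperfect measurement."""
    return sensitivity * true_prob + (1 - specificity) * (1 - true_prob)

def odds_ratio(p_case, p_control):
    """OR comparing exposure odds in cases with exposure odds in controls."""
    return (p_case / (1 - p_case)) / (p_control / (1 - p_control))

def attenuated_or(true_or, p_control, sensitivity, specificity):
    """Expected observed OR when exposure is misclassified equally in both groups."""
    odds_case = true_or * p_control / (1 - p_control)
    p_case = odds_case / (1 + odds_case)
    return odds_ratio(observed_prob(p_case, sensitivity, specificity),
                      observed_prob(p_control, sensitivity, specificity))

print(round(attenuated_or(1.25, 0.30, 0.80, 0.90), 3))
```

With these inputs the expected observed OR lies between 1 and the true OR of 1.25, that is, the effect is biased toward the null. A smaller apparent effect then requires a larger study to detect, which is the loss of power noted above.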

Furthermore, vibration of effects with selective reporting and interpretation of effects may sometimes reflect reverse biases. Various conflicts of interest may work in the direction of silencing or diminishing newly discovered associations that do not fit financial or other dogmatic perspectives. For therapeutic research, although financial conflicts may lead to inflation of treatment effects for new interventions,^{69} they may similarly lead to deflation of the magnitude of adverse events.^{70} For example, although most meta-analyses^{71,72} of rosiglitazone found ORs for myocardial infarction of around 1.43, a meta-analysis originally conducted by Glaxo found a more conservative OR and the company did not consider it to be of concern.^{73} However, the literature on adverse events of interventions is small compared with the literature on effectiveness.^{74} Most harms probably remain unknown rather than silenced.^{75}

Finally, conflicts may be of nonfinancial nature. Some investigators may fervently support their line of research and beliefs. For example, even the most strongly refuted associations continue to have supporters many years after the refutation.^{76} Investigators may suppress new findings when they do not suit their beliefs.

##### What To Do

At the time of first postulated discovery, we usually cannot tell whether an association exists at all,^{1} let alone judge its effect size. As a starting principle, one should be cautious about effect sizes. Uncertainty is not conveyed simply by CIs (no matter if these are 95%, 99%, or 99.9% CIs) (Table 3).

For a new proposed association, credibility and accuracy of its proposed effect varies depending on the case. One may ask the following questions: does the research community in this field adopt widely statistical significance or similar selection thresholds for claiming research findings? Did the discovery arise from a small study? Is there room for large flexibility in the analyses? Are we unprotected from selective reporting (eg, was the protocol not fully available upfront)? Are there people or organizations interested in finding and promoting specific “positive” results? Finally, are the counteracting forces that would deflate effects minimal?

Modeling or correcting some of the sources of inflation is possible with (more) appropriate methods, such as for genetic linkage or association^{17,23} or for regression coefficients in general.^{33,77} These methods are probably more useful in estimating expected effect sizes, so as to perform more proper power calculations for future replication efforts, rather than for claiming that accurate “corrected” estimates of effect are known. In each case, one has to ask whether it is appropriate to ignore completely the effect size for a new proposed association. It may be best to wait for additional, larger studies and cumulative evidence to reach a more firm conclusion on whether an effect exists at all, and then worry about its size later. Most fields can wait for the conduct of replication studies.

The conduct of larger studies in the discovery phase will diminish inflation due to suboptimal power. However, this is not always feasible. Discovery may sometimes arise from small investigations or even unanticipated case observations.^{70} However, even if many discoveries in the past arose out of haphazard encounters of scientists with phenomena, this does not mean that we cannot improve in the future by running larger discovery-oriented studies. Agnostic genome-wide associations provide such an example.^{78}

Using a strict protocol for the design, conduct, and analysis of a study can diminish vibration, but would this stifle creativity? Flexible analyses will not cause a problem if they are accompanied by complete and transparent reporting of all results. Despite demonstrable progress and the availability of evidence-based guidance for reporting, such as CONSORT,^{79,80} STROBE,^{81} and STARD,^{82} full reporting remains an unattained target even in fields such as randomized trials, which are further ahead in registration and reporting efforts.^{83,84} Making databases publicly available is more easily said than done, and there are many challenges in making this a widespread practice.^{85,86} Still, the antithesis of practices among various fields is striking. For example, genome-wide association studies currently test hundreds of thousands of associations, ask for very demanding thresholds (eg, *P* < 10^{−7}), report all results in a single paper, and then often make the data publicly available.^{87,88} Conversely, in traditional risk-factor epidemiology (eg, nutritional epidemiology), each (or a few) of the thousands of tested associations is reported as a single separate paper, “*P* < 0.05” rules are still widespread, and databases rarely become public. Imagine what would happen if the criteria of genome-wide association studies were applied to nutritional epidemiology associations. There are clearly other major differences among such fields,^{89} but one wonders whether such widely discrepant practices are justified. Inclusive consortia of investigators may also help enhance transparency and completeness of reporting of results.^{90}

Discovery can be unfettered, haphazard, exploratory, opportunistic, selective, and highly subjectively interpreted. Conversely, these same characteristics that are perfectly fine for discovery are not desirable for replication. Replication is essential for all discoveries and with few exceptions (eg, treatment effects in interventional studies) only resource constraints and prioritization issues would prohibit replication ad infinitum. Replication offers a wider evidence base on which to try to make inferences about the truth and biases that may affect it.

A crucial question is whether replication suffices to correct the inflated effects that arise in early studies.^{91,92} For example, should a meta-analysis worry about including an early terminated study? In principle, the replication process, if unbiased, should correct the inflation^{91} and if stopping is not very early, inflation is small regardless.^{93} However, the replication process may not be unbiased, and may sometimes suffer from similar problems as (or more problems than) the discovery. Observational evidence has been attacked as unreliable, and even the best meta-analyses of observational data meet with skepticism for their spurious precision.^{94} Problems may arise, however, even for the supposedly more rigorous design of randomized trials. To demonstrate this problem, an evaluation of the whole Cochrane Library shows 1011 systematic reviews that have at least one meta-analysis with at least 4 studies.^{95} Selecting the largest meta-analysis in each of these reviews, 256 of the 1011 meta-analyses have formally statistically significant results (*P* < 0.05) by random effects calculations in the OR scale. The effect sizes of these “positive” meta-analyses are inversely related to the amount of evidence accumulated (Fig. 2). Perhaps large anticipated effects lead to the conduct of small trials and small anticipated effects promote several large trials. However, the observed pattern is consistent with what one would expect based on the inflation biases described above. Most meta-analyses remain largely underpowered for small-to-modest effects.^{96} Superimposed selective reporting can also be operating. Thus, even in the theoretically most rigorous study design (randomized trials), not only discoveries but also pragmatically limited replication efforts may not eliminate inflation of effects, and may not even ensure that any effect at all is present.

What constitutes fair interpretation of new discoveries is unavoidably subjective. However, critical discussion of limitations, caveats, and a reserved stance against one's findings is useful. Thresholds of significance that dictate a discovery may have to be abolished. Instead, all results would be reported, grading their credibility and the uncertainty thereof in a Bayesian framework. Suggestions to adopt Bayesian views of research results have long been made.^{1,11,97–103} However, inflation of effects may still be an issue, even if effects are selected based on Bayes factor thresholds rather than *P* value thresholds. This depends on how Bayes factors are calculated. For example, direct translation^{99} of *P* values (or z-scores) to minimum Bayes factors, exp(−z^{2}/2), would face the same problem, whereas if priors assume that small effects are plausible but large effects are implausible, Bayes factors become most promising for small effects.^{103} Bayesian views are useful when coupled with unselective presentation of all results. In this way, one can see which results are more interesting based on different prior assumptions, and whether there is consistency in highlighting specific results. New results modify future priors. If new results are biased because of selection, priors get biased and we may keep pursuing, believing, and expecting nonexistent large effects.
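The point about direct translation of *P* values can be made concrete. The minimum Bayes factor exp(−z^{2}/2) is a monotone function of |z|, so a cutoff on it selects exactly the same studies as a cutoff on the *P* value and carries the same inflation. The sketch below assumes a two-sided *P* value converted to a z-score under a normal approximation; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def z_from_two_sided_p(p_value):
    """z-score corresponding to a two-sided P value under a normal approximation."""
    return NormalDist().inv_cdf(1 - p_value / 2)

def minimum_bayes_factor(z):
    """Minimum Bayes factor exp(-z^2 / 2): the strongest evidence against the null that z allows."""
    return math.exp(-z * z / 2)

for p in (0.05, 0.01, 0.001):
    z = z_from_two_sided_p(p)
    print(p, round(z, 2), round(minimum_bayes_factor(z), 4))
```

For *P* = 0.05 the minimum Bayes factor is about 0.15, so a threshold such as "minimum Bayes factor below 0.15" is just the "*P* < 0.05" rule restated. Only Bayes factors that depend on a prior for the effect size break this one-to-one mapping.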

Finally, Table 4 summarizes 2 stances in hunting associations—the aggressive discoverer versus the reflective replicator. These stances may underlie the root of the problems that I discussed here, and their possible solutions. In trying to reward or punish scientists for their stance and in shaping the new generation of scientists, we need to think hard about which of the 2 modes we want to promote, and whether some good elements can be picked from each list.

#### ACKNOWLEDGMENTS

I am grateful to Duncan Thomas for helpful comments.