In response to the widespread misuse of statistics in research, several biomedical organizations have published statistical guidelines in their journals, including the International Committee of Medical Journal Editors (www.icmje.org), the American Psychological Association (^{2}), and the American Physiological Society (^{8}). Expert groups have also produced statements about how to publish reports of various kinds of medical research (Table 1). Some medical journals now include links to these statements as part of their instructions to authors.

In this article, we provide our view of best practice for the use of statistics in sports medicine and the exercise sciences. The article is similar to those referenced in Table 1 but includes more practical and original material. It should achieve three useful outcomes. First, it should stimulate interest and debate about constructive change in the use of statistics in our disciplines. Secondly, it should help legitimize the innovative or controversial approaches that we and others sometimes have difficulty including in publications. Finally, it should serve as a statistical checklist for researchers, reviewers, and editors at the various stages of the research process. Not surprisingly, some of the reviewers of this article disagreed with some of our advice, so we emphasize here that the article represents neither a general consensus among experts nor editorial policy for this journal. Indeed, some of our innovations may take decades to become mainstream.

Most of this article is devoted to advice on the various kinds of sample-based studies that comprise the bulk of research in our disciplines. Table 2 and the accompanying notes deal with issues common to all such studies, arranged in the order that the issues arise in a manuscript. This table applies not only to the usual studies of samples of individuals but also to meta-analyses (in which the sample consists of various studies) and quantitative nonclinical case studies (in which the sample consists of repeated observations on one subject). Table 3, which should be used in conjunction with Table 2, deals with additional advice specific to each kind of sample-based study and with clinical and qualitative single-case studies. The sample-based studies in this table are arranged in the approximate descending order of quality of evidence they provide for causality in the relationship between a predictor and a dependent variable, followed by the various kinds of method studies, meta-analyses, and single-case studies. For more on causality and other issues in choice of design for a study, see Hopkins (^{14}).

#### Note 1

**Inferences** are evidence-based conclusions about the true nature of something. The traditional approach to inferences in research on samples is an assertion about whether the effect is statistically significant or "real," based on a *P* value. Specifically, when the range of uncertainty in the true value of an effect, represented by the 95% confidence interval, does not include the zero or null value, *P* is <0.05 and the effect "cannot be zero," so the null hypothesis is rejected and the effect is termed significant; otherwise, *P* is >0.05 and the effect is nonsignificant. A fundamental theoretical dilemma with this approach is the fact that the null hypothesis is always false; indeed, with a large-enough sample size, all effects are statistically significant. On a more practical level, the failure of this approach to deal adequately with the real-world importance of an effect is evident in the frequent misinterpretation of a nonsignificant effect as a null or trivial effect, even when it is likely to be substantial. A significant effect that is likely to be trivial is also often misinterpreted as substantial.

A more realistic and intuitive approach to inferences is based on where the confidence interval lies in relation to threshold values for substantial effects rather than the null value (^{4}). If the confidence interval includes values that are substantial in some positive and negative sense, such as beneficial and harmful, you state in plain language that the effect could be substantially positive *and* negative, or more simply that the effect is *unclear*. Any other disposition of the confidence interval relative to the thresholds represents a clear outcome that can be reported as trivial, positive, or negative, depending on the observed value of the effect. Such magnitude-based inferences about effects can be made more accurate and informative by qualifying them with probabilities that reflect the uncertainty in the true value: *possibly harmful*, *very likely substantially positive*, *unclear but likely to be beneficial*, and so on. The qualitative probabilistic terms can be assigned using the following scale (^{16}): <0.5%, most unlikely or almost certainly not; 0.5-5%, very unlikely; 5-25%, unlikely or probably not; 25-75%, possibly; 75-95%, likely or probably; 95-99.5%, very likely; >99.5%, most likely or almost certainly. Research on the perception of probability could result in small adjustments to this scale in future.
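
The following is a minimal sketch of how such a magnitude-based inference could be computed, not the authors' spreadsheet. It assumes a normal sampling distribution (a *t* distribution would be more accurate for small samples), and the function names and the handling of values that fall exactly on a boundary of the scale are our own illustrative choices.

```python
from statistics import NormalDist

# Qualitative terms from the scale quoted above, keyed by upper bound (%).
# Assigning exact boundary values to the higher category is an assumption.
SCALE = [
    (0.5, "most unlikely"),
    (5.0, "very unlikely"),
    (25.0, "unlikely"),
    (75.0, "possibly"),
    (95.0, "likely"),
    (99.5, "very likely"),
]


def qualitative_term(chance_percent):
    """Return the qualitative term for a chance expressed in percent."""
    for upper_bound, term in SCALE:
        if chance_percent < upper_bound:
            return term
    return "most likely"


def magnitude_based_inference(effect, standard_error, smallest_important, level=0.90):
    """Return (outcome, chance negative %, chance positive %) for an effect."""
    sampling = NormalDist(mu=effect, sigma=standard_error)
    chance_negative = 100 * sampling.cdf(-smallest_important)
    chance_positive = 100 * (1 - sampling.cdf(smallest_important))

    # Confidence limits at the chosen level (normal approximation).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = effect - z * standard_error
    upper = effect + z * standard_error

    if lower < -smallest_important and upper > smallest_important:
        outcome = "unclear"  # interval spans both substantial thresholds
    elif effect >= smallest_important:
        outcome = qualitative_term(chance_positive) + " positive"
    elif effect <= -smallest_important:
        outcome = qualitative_term(chance_negative) + " negative"
    else:
        chance_trivial = 100 - chance_negative - chance_positive
        outcome = qualitative_term(chance_trivial) + " trivial"
    return outcome, chance_negative, chance_positive
```

For example, `magnitude_based_inference(0.5, 0.2, 0.2)` describes an observed effect of 0.5 with a standard error of 0.2 and a smallest important value of 0.2, all in the same (hypothetical) units.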

Use of thresholds for moderate and large effects allows even more informative inferential assertions about magnitude, such as *probably moderately positive*, *possibly associated with small increase in risk*, *almost certainly large gain*, and so on. As yet, only a few effect statistics have generally accepted magnitude thresholds for this purpose. Thresholds of 0.1, 0.3, and 0.5 for small, moderate, and large correlation coefficients suggested by Cohen (^{7}) can be augmented with 0.7 and 0.9 for very large and extremely large; these translate approximately into 0.20, 0.60, 1.20, 2.0, and 4.0 for standardized differences in means (the mean difference divided by the between-subject SD) and into risk differences of 10%, 30%, 50%, 70%, and 90% (newstats.org/effectmag.html). The latter applied to chances of a medal provides thresholds for change in an athlete's competition time or distance of 0.3, 0.9, 1.6, 2.5, and 4.0 of the within-athlete variation between competitions (^{17}) (W.G.H., unpublished observations). The smallest effect for a risk ratio depends on the application: 1.10 will begin to affect one or more groups in a community, whatever the prevalence of the condition, whereas individuals accept higher risk for increasingly rare conditions; however, some epidemiologists consider that confounding inherent in most cohort and case-control studies can bias an effect so much that observed effects have to be moderate (risk ratios, ∼3.0) before you can be confident that the true effect is at least small (^{26}). Thresholds have been suggested for some diagnostic statistics (^{20}), but more research is needed on these and on thresholds for the more usual measures of validity and reliability.
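
The approximate correspondence between the correlation thresholds and the standardized-difference thresholds can be checked with a short sketch. It assumes the standard conversion for a 2-SD difference in a predictor, d = 2r/√(1 − r²), which the text does not state explicitly; the function names and the label "trivial" for values below the smallest threshold are illustrative.

```python
from bisect import bisect_right
from math import sqrt

CORRELATION_THRESHOLDS = [0.1, 0.3, 0.5, 0.7, 0.9]
STANDARDIZED_THRESHOLDS = [0.2, 0.6, 1.2, 2.0, 4.0]
LABELS = ["trivial", "small", "moderate", "large", "very large", "extremely large"]


def correlation_to_standardized(r):
    """Standardized difference for 2 SD of the predictor (assumed conversion)."""
    return 2 * r / sqrt(1 - r ** 2)


def magnitude_label(standardized_difference):
    """Classify a standardized difference in means using the thresholds above."""
    index = bisect_right(STANDARDIZED_THRESHOLDS, abs(standardized_difference))
    return LABELS[index]


if __name__ == "__main__":
    for r in CORRELATION_THRESHOLDS:
        print(r, round(correlation_to_standardized(r), 2))
```

The converted values are close to, but not identical with, the rounded thresholds quoted in the text, which is consistent with the word "approximately".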

An appropriate default level of confidence for the confidence interval is 90%, because it implies quite reasonably that an outcome is clear if the true value is very unlikely to be substantial in a positive and/or negative sense. Use of 90% rather than 95% has also been advocated as a way of discouraging readers from reinterpreting the outcome as significant or nonsignificant at the 5% level (^{24}). In any case, a symmetrical confidence interval of whatever level is appropriate for making only nonclinical or mechanistic inferences. An inference or decision about clinical or practical use should be based on probabilities of harm and benefit that reflect the greater importance of avoiding use of a harmful effect than failing to use a beneficial effect. Suggested default probabilities for declaring an effect clinically beneficial are <0.5% (most unlikely) for harm and >25% (possible) for benefit (^{16}). A clinically unclear effect is therefore possibly beneficial (>25%) with an unacceptable risk of harm (>0.5%). These probabilities correspond to a ratio of ∼60 for odds of benefit to odds of harm, a suggested default when sample sizes are suboptimal or supraoptimal (^{16}). Note that even when an effect is unclear, you can often make a useful probabilistic statement about how big or small it could be, and your findings should contribute to a meta-analysis.
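
The arithmetic behind the odds ratio and the clinical decision rule can be sketched as follows. The function names and the wording of the returned labels are illustrative assumptions; only the default probabilities (0.5% for harm and 25% for benefit) come from the text.

```python
def odds(probability):
    """Convert a probability (0 to 1) to odds."""
    return probability / (1 - probability)


def benefit_to_harm_odds_ratio(chance_benefit, chance_harm):
    """Ratio of the odds of benefit to the odds of harm."""
    return odds(chance_benefit) / odds(chance_harm)


def clinical_inference(chance_benefit, chance_harm, max_harm=0.005, min_benefit=0.25):
    """Apply the default thresholds for declaring an effect clinically beneficial."""
    if chance_benefit > min_benefit and chance_harm < max_harm:
        return "beneficial"
    if chance_benefit > min_benefit:
        return "unclear"  # possibly beneficial, but unacceptable risk of harm
    return "not beneficial"


if __name__ == "__main__":
    # At the default thresholds the odds ratio is a little above 60.
    print(round(benefit_to_harm_odds_ratio(0.25, 0.005), 1))
```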

Magnitude-based inferences as outlined above represent a subset of the kinds of inference that are possible using so-called Bayesian statistics, in which the researcher combines the study outcome with uncertainty in the effect before the study to get the posterior (updated) uncertainty in the effect. A qualitative version of this approach is an implicit and important part of the discussion section of most studies, but in our view, specification of the prior uncertainty is too subjective to apply the approach quantitatively. Researchers may also have difficulty accessing and using the computational procedures. On the other hand, confidence limits and probabilities related to threshold magnitudes can be derived readily via a spreadsheet (^{16}) by making the same assumptions about sampling distributions that statistical packages use to derive *P* values. Bootstrapping, in which a sampling distribution for an effect is derived by resampling from the original sample thousands of times, also provides a robust approach to computing confidence limits and magnitude-based probabilities when data or modeling are too complex to derive a sampling distribution analytically.
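
As an illustration of the bootstrapping idea, the sketch below resamples two independent groups to get confidence limits and magnitude-based probabilities for a difference in means. It uses the simple percentile method, which is one of several bootstrap variants and is our assumption rather than a method specified in the text; the function name, the fixed seed, and the data in the usage note are hypothetical.

```python
import random
from statistics import mean


def bootstrap_difference(group_a, group_b, smallest_important,
                         n_resamples=5000, level=0.90, seed=1):
    """Percentile-bootstrap limits and chances for mean(group_b) - mean(group_a)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    differences = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original sizes.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        differences.append(mean(resample_b) - mean(resample_a))
    differences.sort()

    tail = (1 - level) / 2
    lower = differences[int(tail * n_resamples)]
    upper = differences[int((1 - tail) * n_resamples) - 1]

    # Proportion of resampled effects beyond each threshold.
    chance_positive = sum(d > smallest_important for d in differences) / n_resamples
    chance_negative = sum(d < -smallest_important for d in differences) / n_resamples
    return lower, upper, chance_negative, chance_positive
```

A call such as `bootstrap_difference(control, treated, smallest_important=1.0)` returns the lower and upper 90% limits followed by the chances that the true difference is substantially negative and substantially positive.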

#### Note 2

**Public access** to depersonalized data, when feasible, serves the needs of the wider community by allowing more thorough scrutiny of data than that afforded by peer review and by leading to better meta-analyses. Make this statement in your initial application for ethics approval, and state that the data will be available indefinitely at a Web site or on request without compromising the subjects' privacy.

#### Note 3

**Multiple inferences.** Any conclusive inference about an effect could be wrong, and the more effects you investigate, the greater the chance of making an error. If you test multiple hypotheses, there is inflation of the Type I error rate: an increase in the chance that a null effect will turn up statistically significant. The usual remedy of making the tests more conservative is not appropriate for the most important preplanned effect, it is seldom applied consistently to all other effects reported in a paper, and it creates problems for meta-analysts and other readers who want to assess effects in isolation. We therefore concur with others (e.g., ^{23}) who advise against adjusting the Type I error rate or confidence level of confidence intervals for multiple effects.
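
The inflation of the Type I error rate can be illustrated with a one-line calculation. The sketch assumes that the tests are independent and that every true effect is null, which is a simplification; the function name is illustrative.

```python
def chance_of_any_false_positive(n_effects, alpha=0.05):
    """Chance that at least one of n independent null effects is significant."""
    return 1 - (1 - alpha) ** n_effects


if __name__ == "__main__":
    for n_effects in (1, 5, 10, 20):
        print(n_effects, round(chance_of_any_false_positive(n_effects), 2))
```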

For several important clinical or practical effects, you should constrain the increase in the chances of making clinical errors. Overall chances of benefit and harm for several interdependent effects can be estimated properly by bootstrapping, but a more practical and conservative approach is to assume the effects are independent and to estimate errors approximately by addition. The sum of the chances of harm of all the effects that separately are clinically useful should not exceed 0.5% (or your chosen maximum rate for Type 1 clinical errors; Note 4); otherwise, you should declare fewer effects useful and acknowledge that your study is underpowered. Your study is also underpowered if the sum of chances of benefit of all effects that separately are not clinically useful exceeds 25% (or your chosen Type 2 clinical error rate). When your sample size is small, reduce the chance that the study will be underpowered by designing and analyzing it for fewer effects.
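
The additive check described above could be sketched as follows, assuming independent effects. The tuple layout, function name, and example values are hypothetical; the default limits of 0.5% and 25% are those given in the text.

```python
def check_clinical_error_rates(effects, max_harm=0.5, max_missed_benefit=25.0):
    """Sum chances (%) across effects and compare with the chosen limits.

    `effects` is a list of (name, chance_benefit, chance_harm, declared_useful),
    with chances in percent.
    """
    # Type 1 clinical error: using effects that are in fact harmful.
    total_harm = sum(harm for _, _, harm, useful in effects if useful)
    # Type 2 clinical error: not using effects that are in fact beneficial.
    total_missed = sum(benefit for _, benefit, _, useful in effects if not useful)
    return {
        "total_harm": total_harm,
        "total_missed_benefit": total_missed,
        "harm_acceptable": total_harm <= max_harm,
        "missed_benefit_acceptable": total_missed <= max_missed_benefit,
    }


if __name__ == "__main__":
    example = [
        ("effect A", 60.0, 0.3, True),
        ("effect B", 40.0, 0.3, True),
        ("effect C", 15.0, 2.0, False),
    ]
    print(check_clinical_error_rates(example))
```

In this hypothetical example, each of the two useful effects has an acceptable chance of harm on its own, but their sum exceeds 0.5%, so fewer effects should be declared useful.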

A problem with inferences about several effects with overlapping confidence intervals is misidentification of the largest (or smallest) and upward (or downward) bias in its magnitude. In simulations, the bias is of the order of the average SE of the outcome statistic, which is approximately one-third the width of the average 90% confidence interval (W.G.H., unpublished observations). Acknowledge such bias when your aim is to quantify the largest or smallest of several effects.

#### Note 4

**Sample sizes** that give acceptable precision with 90% confidence limits are similar to those based on a Type 1 clinical error of 0.5% (the chance of using an effect that is harmful) and a Type 2 clinical error of 25% (the chance of not using an effect that is beneficial). The sample sizes are approximately one-third those based on the traditional approach of an 80% chance of statistical significance at the 5% level when the true effect has the smallest important value. Until hypothesis testing loses respectability, you should include the traditional and new approaches in applications for ethical approval and funding.
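
The approximate one-third ratio can be reproduced with a rough sketch for a comparison of two independent group means. The formulas below are our reconstruction under a normal approximation, not the authors' spreadsheet: the traditional approach requires the smallest important difference to be detected, whereas the two magnitude-based approaches are assumed to require only that the uncertainty fit between the harmful and beneficial thresholds, which are two smallest important differences apart. All function names are illustrative.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist().inv_cdf  # standard normal quantile function


def n_traditional(sd, smallest, alpha=0.05, power=0.80):
    """Per-group n for a given chance of significance (normal approximation)."""
    return ceil(2 * (Z(1 - alpha / 2) + Z(power)) ** 2 * (sd / smallest) ** 2)


def n_confidence_interval(sd, smallest, level=0.90):
    """Per-group n so the confidence interval just fits between -smallest and +smallest."""
    return ceil(2 * Z(0.5 + level / 2) ** 2 * (sd / smallest) ** 2)


def n_clinical(sd, smallest, type1=0.005, type2=0.25):
    """Per-group n for the stated clinical error rates (assumed formula)."""
    return ceil(2 * (Z(1 - type1) + Z(1 - type2)) ** 2 * (sd / (2 * smallest)) ** 2)


if __name__ == "__main__":
    sd, smallest = 1.0, 0.2  # smallest important difference of 0.2 SD
    print(n_traditional(sd, smallest),
          n_confidence_interval(sd, smallest),
          n_clinical(sd, smallest))
```

Under these assumptions, the two magnitude-based sample sizes are similar to each other and roughly one-third of the traditional sample size.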

Whatever approach you use, sample size needs to be quadrupled to adequately estimate individual differences or responses and effects of covariates on the main effect. Larger samples are also needed to keep clinical error rates for clinical or practical decisions acceptable when there is more than one important effect in a study (Note 3). See Hopkins (^{12}) for a spreadsheet and details of these and many other sample-size issues.

#### Note 5

**Mechanisms.** In a mechanisms analysis, you determine the extent to which a putative mechanism variable mediates an effect through being in a causal chain linking the predictor to the dependent variable of the effect. For an effect derived from a linear model, the contribution of the mechanism (or mediator) variable is represented by the reduction in the effect when the variable is included in the model as another predictor. Any such reduction is a necessary but not sufficient condition for the variable to contribute to the mechanism of the effect, because a causal role can be established definitively only in a separate controlled trial designed for that purpose.

For interventions, you can also examine a plot of change scores of the dependent variable versus those of potential mediators, but beware that a relationship will not be obvious in the scattergram if individual responses are small relative to measurement error. Mechanism variables are particularly useful in unblinded interventions, because evidence of a mechanism that cannot arise from expectation (placebo or nocebo) effects is also evidence that at least part of the effect of the intervention is not due to such effects.

#### Note 6

**Linear models.** An effect statistic is derived from a model (equation) linking a dependent to a predictor and usually other predictors (covariates). The model is linear if the dependent can be expressed as a sum of terms, each term being a coefficient times a predictor or a product of predictors (interactions, including polynomials), plus one or more terms for random errors. The effect statistic is the predictor's coefficient or some derived form of it. It follows from the additive nature of such models that the value of the effect statistic is formally equivalent to the value expected when the other predictors in the model are held constant. Linear models, therefore, automatically provide adjustment for potential confounders and estimates of the effect of potential mechanism variables. A variable that covaries with a predictor and dependent variable is a confounder if it causes some of the covariance and is a mechanism if it mediates it. The reduction of an effect when such a variable is included in a linear model is the contribution of the variable to the effect, and the remaining effect is independent of (adjusted for) the variable.
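
The reduction in an effect when a covariate is added can be illustrated with a small sketch that fits a one-predictor and a two-predictor linear model by ordinary least squares. The closed-form expressions are standard; the function names and the data are hypothetical, and a statistical package would normally be used for real analyses.

```python
from statistics import mean


def _covariance(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def unadjusted_effect(dependent, predictor):
    """Coefficient of the predictor in the model: dependent ~ predictor."""
    return _covariance(predictor, dependent) / _covariance(predictor, predictor)


def adjusted_effect(dependent, predictor, covariate):
    """Coefficient of the predictor in the model: dependent ~ predictor + covariate."""
    s_pp = _covariance(predictor, predictor)
    s_cc = _covariance(covariate, covariate)
    s_pc = _covariance(predictor, covariate)
    s_py = _covariance(predictor, dependent)
    s_cy = _covariance(covariate, dependent)
    return (s_py * s_cc - s_cy * s_pc) / (s_pp * s_cc - s_pc ** 2)


if __name__ == "__main__":
    # Hypothetical data constructed so that dependent = predictor + 2 * covariate.
    predictor = [0, 1, 2, 3, 4, 5]
    covariate = [0, 1, 1, 2, 2, 3]
    dependent = [p + 2 * c for p, c in zip(predictor, covariate)]
    before = unadjusted_effect(dependent, predictor)
    after = adjusted_effect(dependent, predictor, covariate)
    print(before, after, before - after)  # the difference is the covariate's contribution
```

Whether the reduction is interpreted as confounding or as a mechanism depends on the causal role of the covariate, not on the arithmetic.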

The usual models are linear and include the following: regression, ANOVA, general linear, and mixed for a continuous dependent; logistic regression, Poisson regression, negative binomial regression, and generalized linear modeling for events (a dichotomous or count dependent); and proportional hazards regression for a time-to-event dependent. Special linear models include factor analysis and structural equation modeling.

For repeated measures or other clustering of observations of a continuous dependent variable, avoid the problem of interdependence of observations by using within-subject modeling, in which you combine each subject's repeated measurements into a single measure (unit of analysis) for subsequent modeling; alternatively, account for the interdependence using the more powerful approach of mixed (multilevel or hierarchical) modeling, in which you estimate different random effects or errors within and between clusters. Avoid repeated-measures ANOVA, which sometimes fails to account properly for different errors. For clustered event-type dependents (proportions or counts), use generalized estimating equations.

#### Note 7

**Nonparametric analysis.** A requirement for deriving inferential statistics with the family of general linear models is normality of the sampling distribution of the outcome statistic. Although there is no test that data meet this requirement, the central-limit theorem ensures that the sampling distribution is close enough to normal for accurate inferences, even when sample sizes are small (∼10) and especially after a transformation that reduces any marked skewness in the dependent variable or nonuniformity of error. Testing for normality of the dependent variable and any related decision to use purely nonparametric analyses (which are based on rank transformation and do not use linear or other parametric models) are therefore misguided. Such analyses lack power for small sample sizes, do not permit adjustment for covariates, and do not permit inferences about magnitude. Rank transformation followed by parametric analysis can be appropriate (Note 8), and ironically, the distribution of a rank-transformed variable is grossly nonnormal.

#### Note 8

**Nonuniformity** of effect or error in linear models can produce incorrect estimates and confidence limits. Check for nonuniformity by comparing SD of the dependent variable in different subgroups or by examining plots of the dependent variable or its residuals for differences in scatter (heteroscedasticity) with different predicted values and/or different values of the predictors.

Differences in SD or errors between groups can be taken into account for simple comparisons of means by using the unequal-variances *t* statistic. With more complex models, use mixed modeling to allow for and estimate different SD in different groups or with different treatments. For a simpler robust approach with independent subgroups, perform separate analyses then compare the outcomes using a spreadsheet (^{15}).
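
For a simple comparison of two independent means, the unequal-variances *t* statistic can be computed as in the sketch below, which uses the usual Welch form with Satterthwaite's approximation for the degrees of freedom. The function name is illustrative. The standard library has no *t* quantile function, so confidence limits would need a statistical package or tables.

```python
from math import sqrt
from statistics import mean, variance


def unequal_variances_t(group_a, group_b):
    """Return (difference in means, its SE, t statistic, approximate df)."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error of each group mean.
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    difference = mean(group_b) - mean(group_a)
    standard_error = sqrt(se2_a + se2_b)
    # Satterthwaite approximation for the degrees of freedom.
    degrees_of_freedom = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return difference, standard_error, difference / standard_error, degrees_of_freedom
```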

Transformation of the dependent variable is another approach to reducing nonuniformity, especially when there are differences in scatter for different predicted values. For many dependent variables, effects and errors are uniform when expressed as factors or percents; log transformation converts these to uniform additive effects, which can be modeled linearly then expressed as factors or percents after back transformation. Always use log transformation for such variables, even when a narrow range in the dependent variable effectively eliminates nonuniformity.
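
A minimal sketch of the log-transformation workflow for a difference between two group means is shown below; the function names are illustrative. Confidence limits derived on the log scale are back-transformed in the same way as the effect itself.

```python
from math import exp, log
from statistics import mean


def log_scale_difference(group_a, group_b):
    """Difference in means of the log-transformed values (group_b minus group_a)."""
    return mean([log(v) for v in group_b]) - mean([log(v) for v in group_a])


def back_transform(log_value):
    """Express a log-scale effect or confidence limit as (factor, percent)."""
    factor = exp(log_value)
    return factor, 100 * (factor - 1)


if __name__ == "__main__":
    # Hypothetical data in which every value in the second group is 10% higher.
    effect = log_scale_difference([100, 200], [110, 220])
    print(back_transform(effect))
```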

Rank transformation eliminates nonuniformity for most dependent variables and models, but it results in loss of precision with a small sample size and should therefore be used as a last resort. To perform the analysis, sort all observations by the value of the dependent variable, assign each observation a rank (consecutive integer), then use the rank as the dependent variable in a linear model. Such analyses are often referred to incorrectly as nonparametric.
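
The ranking step can be sketched as follows. Giving tied observations the average of their ranks is a common convention that we assume here; the text itself mentions only consecutive integers. The function name is illustrative.

```python
def rank_transform(values):
    """Return the rank of each observation, with tied values sharing an average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the following sorted values are tied.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks start at 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks
```

The returned ranks are then used as the dependent variable in the linear model in place of the raw values.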

Use the transformed variable, not the raw variable, to gauge magnitudes of correlations and of standardized differences or changes in means. Back-transform the mean effect to a mean in raw units and its confidence limits to percents or factors (for log transformation) or to raw units at the mean of the transformed variable or at an appropriate value of the raw variable (for all other transformations). When analysis of a transformed variable produces impossible values for an effect or a confidence limit (e.g., a negative rank with the rank transformation), the assumption of normality of the sampling distribution of the effect is violated and the analysis is therefore untrustworthy. Appropriate use of bootstrapping avoids this problem.

#### Note 9

**Outliers** for a continuous dependent variable represent a kind of nonuniformity that appears on a plot of residuals versus predicteds as individual points with much larger residuals than other points. To delete the outliers in an objective fashion, set a threshold by first standardizing the residuals (dividing by their SD). The resulting residuals are *t* statistics, and with the assumption of normality, a threshold for values that would occur rarely (<5% of the time is a good default) depends on the sample size. Approximate thresholds for the absolute value of *t* are as follows: >3.5 for sample sizes less than ∼50; >4.0 for ∼500; >4.5 for ∼5000; and >5.0 for ∼50,000. Some packages identify outliers more accurately using statistics that account for the lower frequency of large residuals further away from the mean predicted value of the dependent.
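
The dependence of the threshold on sample size can be explored with the sketch below. It interprets "rarely" as a less than 5% chance that any residual in a sample of the given size exceeds the threshold under normality, which is our reading rather than a stated derivation; the function names are illustrative. With this reading, the computed thresholds fall a little below the rounded values given above.

```python
from statistics import NormalDist, stdev


def outlier_threshold(sample_size, chance=0.05):
    """Standardized-residual threshold exceeded by chance in < `chance` of samples."""
    # Per-observation two-sided probability giving the overall chance.
    per_observation = 1 - (1 - chance) ** (1 / sample_size)
    return NormalDist().inv_cdf(1 - per_observation / 2)


def flag_outliers(residuals, chance=0.05):
    """Return indices of residuals whose standardized value exceeds the threshold."""
    sd = stdev(residuals)  # residuals from a fitted model have a mean of ~0
    threshold = outlier_threshold(len(residuals), chance)
    return [i for i, r in enumerate(residuals) if abs(r / sd) > threshold]


if __name__ == "__main__":
    for n in (50, 500, 5000, 50000):
        print(n, round(outlier_threshold(n), 2))
```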

#### Note 10

**Effect of continuous predictors.** The use of two SD to gauge the effect of a continuous predictor ensures congruence between Cohen's threshold magnitudes for correlations and standardized differences (Note 1). Two SD of a normally distributed predictor also corresponds approximately to the mean separation of lower and upper tertiles (2.2 SD). The SD is ideally the variation in the predictor after adjustment for other predictors; the effect of 2 SD in a correlational study is then equivalent to, and can be replaced by, the partial correlation (the square root of the fraction of variance explained by the predictor after adjustment for all other predictors).
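
The separation between the means of the lower and upper tertiles of a normal distribution can be verified with a short calculation. The sketch uses the standard result that the mean of the upper tail of a standard normal distribution beyond a cut point is the density at the cut point divided by the tail probability; the function name is illustrative.

```python
from statistics import NormalDist


def tertile_mean_separation():
    """Difference (in SD) between the means of the upper and lower tertiles."""
    standard_normal = NormalDist()
    cut_point = standard_normal.inv_cdf(2 / 3)  # boundary of the upper tertile
    upper_tertile_mean = standard_normal.pdf(cut_point) / (1 / 3)
    # By symmetry, the lower tertile mean is the negative of the upper one.
    return 2 * upper_tertile_mean


if __name__ == "__main__":
    print(round(tertile_mean_separation(), 2))
```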

A grossly skewed predictor can produce incorrect estimates or confidence limits, so it should be transformed to reduce skewness. Log transformation is often suitable for skewed predictors that have only positive values; as simple linear predictors, their effects are then expressed per factor or percent change of their original units. Alternatively, a skewed predictor can be parsed into quantiles (usually two to five subgroups with equal numbers of observations) and included in the model as a nominal variable or as an ordinal variable (a numeric variable with integer values). Parsing is also appropriate for a predictor that is likely to have a nonlinear effect not easily or realistically modeled as a polynomial.

#### Note 11

**SEM versus SD.** The standard error of the mean (SEM = SD/√(group sample size)) is the sampling variation in a group mean, which is the expected typical variation in the mean from sample to sample. Some researchers argue that, as such, this measure communicates uncertainty in the mean and is therefore preferable to the SD. A related widespread belief is that nonoverlap of SEM bars on a graph indicates a difference that is statistically significant at the 5% level. Even if statistical significance were the preferred approach to inferences, this belief is justified only when the SEM in the two groups are equal, and for comparisons of changes in means, only when the SEM are for means of change scores. SEM bars on a time-series graph of means of repeated measurements thus convey a false impression of significance or nonsignificance, and therefore, to avoid confusion, SEM should not be shown for any data. In any case, researchers are interested not in the uncertainty in a single mean but in the uncertainty of an effect involving means, usually a simple comparison of two means. Confidence intervals or related inferential statistics are used to report uncertainty in such effects, making the SEM redundant and inferior.

The above are compelling arguments for not using the SEM, but there are even more compelling arguments for using the SD. First, it helps to assess nonuniformity, which manifests as different SD in different groups. Secondly, it can signpost the likely need for log transformation, when the SD of a variable that can have only positive values is of magnitude similar to or greater than the mean. Finally and most importantly, the SD communicates the magnitude of differences or changes between means, which, by default, should be assessed relative to the usual between-subject SD (Note 1). The manner in which the SEM depends on sample size makes it unsuitable for any of these applications, whereas the SD is practically unbiased for sample sizes ∼10 or more (^{9}).

#### Note 12

**Error-related bias.** Random error or random misclassification in a variable attenuates effects involving the variable and widens the confidence interval. (Exception: random error in a continuous dependent variable does not attenuate effects of predictors on means of the variable.) After adjustment of the variable for any systematic difference from a criterion in a validity study with subjects similar to those in your study, it follows from statistical first principles that the correction for attenuation of an effect derived directly from the variable's coefficient in a linear model is 1/*v*², where *v* is the validity correlation coefficient; the correction for a correlation with the variable is 1/*v*. In this context, a useful estimate for the upper bound of *v* is the square root of the short-term reliability correlation.
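
These corrections amount to simple arithmetic, sketched below with illustrative function names and hypothetical values. Because the square root of the reliability correlation is an upper bound for the validity correlation, corrections based on it are the smallest that are consistent with the data.

```python
from math import sqrt


def validity_upper_bound(reliability_correlation):
    """Upper bound for the validity correlation v from short-term reliability."""
    return sqrt(reliability_correlation)


def correct_coefficient(observed_coefficient, validity):
    """Correct a linear-model coefficient of the error-prone variable (factor 1/v^2)."""
    return observed_coefficient / validity ** 2


def correct_correlation(observed_correlation, validity):
    """Correct a correlation involving the error-prone variable (factor 1/v)."""
    return observed_correlation / validity


if __name__ == "__main__":
    v = validity_upper_bound(0.81)       # hypothetical reliability correlation
    print(v)
    print(correct_coefficient(0.81, v))  # hypothetical observed coefficient
    print(correct_correlation(0.45, v))  # hypothetical observed correlation
```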

When one variable in an effect has *systematic* error or misclassification that is substantially correlated with the value of the other variable, the effect will be biased up or down, depending on the correlation. Example: a spurious beneficial effect of physical activity on health could arise from healthier people exaggerating their self-reported activity.

Substantial random or systematic error of measurement in a covariate used to adjust for confounding results in partial or unpredictable adjustment, respectively, and thereby renders untrustworthy any claim about the presence or absence of the effect after adjustment. This problem applies also to a mechanisms analysis involving such a covariate.

#### Note 13

**Limits of agreement.** Bland and Altman introduced limits of agreement (defining a reference interval for the difference between measures) and a plot of subjects' difference versus mean scores of the measures (for checking relative bias and nonuniformity) to address what they thought were shortcomings arising from misuse of validity and reliability correlation coefficients in measurement studies. Simple linear regression, nevertheless, provides superior statistics in validity studies, for the following reasons: the SEE and the validity correlation can show that a measure is entirely suitable for the clinical assessment of individuals and for sample-based research, yet the measure would not be interchangeable with a criterion according to the limits of agreement; the validity correlation provides a correction for attenuation (Note 12), but no such correction is available with limits of agreement; the regression equation provides trustworthy estimates of the bias of one measure relative to the other, whereas the Bland-Altman plot shows artifactual bias for measures with substantially different errors (^{11}); regression statistics can be derived in all validity studies, whereas limits of agreement can be derived from difference scores in only a minority of validity studies ("method-comparison" studies, where both measures are in the same units); finally, limits of agreement in a method-comparison study of a new measure with an existing imprecise measure provide no useful information about the validity of the new measure, whereas the regression validity statistics can be combined with published validity regression statistics for the imprecise measure to correctly estimate validity regression statistics for the new measure.

Arguments have also been presented against the use of limits of agreement as a measure of reliability (^{13}). In addition, data generally contain several sources of random error, which are invariably estimated as variances in linear models then combined and expressed as standard errors of measurement and/or correlations. Transformation to limits of agreement is of no further clinical or theoretical value.

#### Note 14

**Qualitative inferences.** Some qualitative researchers believe that it is possible to use qualitative methods to generalize from a sample of qualitatively analyzed cases (or assessments of an individual) to a population (or the individual generally). Others do not even recognize the legitimacy of generalizing. In our view, generalizing is a fundamental obligation that is best met quantitatively, even when the sample is a series of qualitative case studies or assessments.

Chris Bolter, Janet Dufek, Doug Curran-Everett, Patria Hume, George Kelley, Ken Quarrie, Chris Schmid, David Streiner, and Martyn Standage provided valuable feedback on drafts, as did nine reviewers on the submitted manuscript. The authors have no professional relationship with a for-profit organization that would benefit from this study; publication does not constitute endorsement by ACSM.

No funding was received for this work from any organization, other than salary support for the authors from their respective institutions.

*Editor-in-Chief*'s *note*: This article by Hopkins et al. should be considered invited commentary by the authors. The article has undergone peer review by eight other scientists, each an acknowledged expert in experimental design, statistical analysis, data interpretation, and reporting, and the authors have undertaken extensive revision in response to those reviews and my own reviews. The majority of reviewers recommended publication of the article, but there remain several specific aspects of the discussion on which authors and reviewers strongly disagreed. Therefore, the Associate Editors and I believe that our scientific community has not yet achieved sufficient "consensus" to establish formal editorial policy about appropriate reporting of research design, data analysis, and results. However, we also believe that Dr. Hopkins and his colleagues have presented a thoughtful, provocative framework of "progressive" recommendations, which merit consideration and discussion. Readers are advised that the recommendations remain the authors' opinion and not the journal's editorial policy, and we encourage your feedback.