I was recently asked to weigh in on a statistical debate that has been brewing in the sports science literature. Some researchers have advocated the use of a new statistical method they are calling “magnitude-based inference” (MBI) as an alternative to standard hypothesis testing (1–5). The method is being used in practice in the sports science literature (4,6), which makes it imperative to resolve this debate.
Several statisticians have criticized MBI due to its lack of a sound theoretical framework (7–9). In a 2015 article in Medicine & Science in Sports & Exercise, Welsh and Knight provided a statistical review of MBI in which they identified theoretical problems with the method, including that it creates unacceptably high false-positive rates (8). In response, MBI’s proponents, Hopkins and Batterham (4), published a rebuttal in Sports Medicine in 2016 in which they claim that MBI “outperforms” standard null hypothesis testing in terms of both type I (false-positive) and type II (false-negative) error rates for most cases.
At face value, this conclusion is dubious. There is a tradeoff between type I and type II error: when you improve one, you sacrifice the other. Thus, you do not need to be a statistician to immediately be skeptical of their paper. Indeed, their article is flawed in both its methods and conclusions.
First, Hopkins and Batterham (4) have obscured the systematic behavior of MBI in the way they presented their results. I have reproduced the exact numbers they report in their article, but have regraphed them in a more transparent and informative way. This single change reveals the fundamental problem with MBI. Second, Hopkins and Batterham have incorrectly defined type I and type II error. When I correct these mistakes, I show that the problem with MBI holds for all cases. Finally, I derive general mathematical equations for the type I and type II error rates for MBI; these equations confirm the findings from the simulations.
The problem boils down to this: MBI creates peaks of false positives at specific sample sizes; and MBI’s creators provide sample size calculators (1) that specifically find these peaks. For example, for a particular statistical comparison, Hopkins and Batterham (4) conclude that 50 participants per group is the “optimal” sample size when using MBI. It turns out that 50 per group is precisely where the false-positive rate peaks for that case.
MAGNITUDE-BASED INFERENCE: A BRIEF SYNOPSIS
The motivation behind MBI is a good one. Hopkins and Batterham encourage researchers to pay more attention to confidence intervals and effect sizes. By doing so, researchers can avoid many common statistical errors, such as mistakenly concluding that a significant but trivially small effect is clinically important (10).
In MBI, researchers start by defining a trivial range, in which effect sizes are too small to care about. For example, researchers might declare that changes in resting heart rate within 1 bpm are trivial. Effects outside of this range are either beneficial (when resting heart rate is lowered) or harmful (when resting heart rate is increased). Researchers then interpret their confidence intervals relative to these ranges. For example, if a supplement reduces resting heart rate a statistically significant amount but the 95% confidence interval is −0.9 to −0.1 bpm, one should conclude that the supplement has only a trivial biologic effect. Conversely, if the reduction in resting heart rate is statistically nonsignificant, but the 95% confidence interval is −10 to +0.1 bpm—which predominantly spans the beneficial range—one should not conclude that the supplement is ineffective. This is a good approach.
Where Hopkins and Batterham’s method breaks down is when they go beyond simply making qualitative judgments like this and advocate translating confidence intervals into probabilistic statements such as: the effect of the supplement is “very likely trivial” or “likely beneficial.” This requires interpreting confidence intervals incorrectly, as if they were Bayesian credible intervals. For example, they incorrectly interpret a 95% confidence interval that falls completely within the trivial range as meaning that there is a 95% chance that the effect is trivial. Others have pointed out the problems with this misinterpretation (7–9); I will avoid a lengthy discussion of this issue here, because my primary goal is to demonstrate the empirical behavior of MBI when implemented as Hopkins and Batterham propose.
Magnitude-based inference provides probabilities that the effect is beneficial, trivial, and harmful. Then these probabilities are interpreted using the following scale: <0.5% = most unlikely; 0.5% to 5% = very unlikely; 5% to 25% = unlikely; 25% to 75% = possibly; 75% to 95% = likely; 95% to 99.5% = very likely; >99.5% = most likely (2).
For example, suppose our supplement study yields a 90% confidence interval of −8 bpm to −1 bpm. This is translated to: “There is a 90% probability that the true effect lies between −8 bpm and −1 bpm.” This leaves a 5% chance that the true effect is > −1 bpm and thus not beneficial. So, MBI concludes: “There is a 95% chance that the supplement is beneficial” or the supplement is “very likely beneficial.”
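To make this translation concrete, here is a small sketch of the arithmetic MBI performs. This is my own illustration, not Hopkins and Batterham’s code; it uses a normal approximation, and the function names are hypothetical:

```python
from statistics import NormalDist

def chance_of_benefit(estimate, se, benefit_threshold):
    """MBI's 'chance of benefit': the probability mass of a
    Normal(estimate, se) distribution below the benefit threshold
    (benefit here means a *reduction*, e.g. in resting heart rate).
    This treats the confidence interval as a Bayesian credible
    interval -- exactly the misinterpretation criticized above."""
    return NormalDist(mu=estimate, sigma=se).cdf(benefit_threshold)

def mbi_label(p):
    """Map a probability to Hopkins and Batterham's qualitative scale."""
    for cut, label in [(0.005, "most unlikely"), (0.05, "very unlikely"),
                       (0.25, "unlikely"), (0.75, "possibly"),
                       (0.95, "likely"), (0.995, "very likely")]:
        if p < cut:
            return label
    return "most likely"

# The example above: a 90% CI of -8 to -1 bpm implies a point estimate
# of -4.5 bpm and, under the normal approximation, se = 3.5 / z_0.95.
z95 = NormalDist().inv_cdf(0.95)
p = chance_of_benefit(-4.5, 3.5 / z95, benefit_threshold=-1.0)
print(round(p, 3))  # 0.95 -> "very likely beneficial" on the scale
```

The half-width of the interval (3.5 bpm) divided by the 95th normal quantile recovers the standard error, so the 95% “chance of benefit” falls out by construction.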
Hopkins and Batterham describe two versions of MBI: clinical and nonclinical (4). I will consider each of these cases separately. I believe that the difference is that clinical MBI entails a one-sided test whereas nonclinical MBI entails a two-sided test; I will delve into this distinction later.
When you are interested in testing whether a clinical intervention is beneficial or not, Hopkins and Batterham call this “clinical MBI.” In clinical MBI, users define a trivial range by setting thresholds for harm and benefit; these are usually assigned the same value, but they don’t have to be. Hopkins and Batterham contend that an intervention is implementable if it is at least “possibly” beneficial (≥25% chance of benefit) and “most unlikely” harmful (<0.5% risk of harm). Equivalently, the upper limit of the 50% confidence interval (UCL50) must equal or exceed the threshold for benefit and the lower limit of the 99% confidence interval (LCL99) must be above the threshold for harm. For example, if the thresholds for benefit and harm are +0.2 standard deviations and −0.2 standard deviations, respectively, the intervention would be implementable when UCL50 ≥ +0.2 and LCL99 > −0.2.
It is possible to change the “minimum chance of benefit” and “maximum risk of harm” from their defaults of 25% and 0.5%. For example, if you want to require a 75% minimum chance of benefit (i.e., “likely” beneficial), then the intervention would be implementable when LCL50 ≥ +0.2 and LCL99 > −0.2.
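The clinical decision rule can be sketched as follows. This is my own Python rendering under simplifying assumptions (normal quantiles in place of t quantiles; standard error supplied directly); the function name and defaults are mine:

```python
from statistics import NormalDist

def clinical_mbi_implementable(d_obs, se,
                               delta_b=0.2, delta_h=0.2,
                               eta_b=0.25, eta_h=0.005):
    """Clinical MBI decision rule (normal approximation).  An
    intervention is 'implementable' when the chance of benefit is at
    least eta_b AND the risk of harm is below eta_h -- equivalently,
    when the observed effect clears both a benefit constraint
    (e.g. UCL50 >= +0.2) and a harm constraint (e.g. LCL99 > -0.2)."""
    z = NormalDist().inv_cdf
    benefit_ok = d_obs >= delta_b - z(1 - eta_b) * se
    harm_ok = d_obs > -delta_h + z(1 - eta_h) * se
    return benefit_ok and harm_ok

# An observed effect of 0.15 SD with a wide CI fails the harm constraint:
print(clinical_mbi_implementable(0.15, se=0.45))  # False
# The same observed effect with a tighter CI clears both constraints:
print(clinical_mbi_implementable(0.15, se=0.10))  # True
```

Raising the minimum chance of benefit to 75% amounts to passing `eta_b=0.75`, which flips the sign of the benefit quantile and reproduces the LCL50 ≥ +0.2 condition.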
When you are simply trying to determine whether an effect exists (either positive or negative), Hopkins and Batterham call this nonclinical MBI and refer to effects as positive or negative rather than beneficial and harmful. Researchers can claim various degrees of certainty for a positive effect when the 90% confidence interval excludes the negative range (LCL90 > −0.2, for example, corresponding to <5% chance of a negative effect) but overlaps the positive range: “Unlikely” positive is when just the upper limit of the 90% confidence interval (UCL90) makes it into the positive range; “possibly” positive is when the upper limit of the 50% confidence interval (UCL50) makes it into the positive range; “likely” positive is when the lower limit of the 50% confidence interval (LCL50) makes it into the positive range; and “very likely” positive is when the lower limit of the 90% confidence interval (LCL90) makes it into the positive range. Negative effects follow the same pattern.
Magnitude-based inference’s proponents advocate reporting the probabilistic statements and allowing individuals to judge how much uncertainty they can tolerate for a given decision. For example, a laboratory scientist might choose to move a drug to clinical tests if it shows at least a “likely” effect in laboratory experiments.
In summary, MBI imposes two constraints: a constraint on harm (or negative effects) and a constraint on benefit (or positive effects). Each constraint is determined by two parameters: the threshold for harm/benefit, and the maximum risk of harm/minimum chance of benefit. Following Welsh and Knight (8), I will use the following symbols for these parameters: δb = threshold for benefit; δh = threshold for harm; ηb = minimum chance of benefit; ηh = maximum risk of harm.
It is possible to write general mathematical formulas for the constraints. I will focus on clinical MBI here; nonclinical MBI is similar but considers both directions. For the specific case of a two-group comparison of means with equal variances and equal group sizes of n per group, an effect is implementable if the following conditions are met (for full derivation, see Document, Supplemental Digital Content 1, derivation of the constraints, http://links.lww.com/MSS/B270):
1. Constraint on benefit: observed effect ≥ δb − t(1−ηb; 2n−2)·σ·√(2/n)

2. Constraint on harm: observed effect > −δh + t(1−ηh; 2n−2)·σ·√(2/n)

where σ is the pooled standard deviation and t(p; 2n−2) denotes the pth quantile of the t distribution with 2n−2 degrees of freedom.

Note that we can simplify the above to:

observed effect ≥ max[δb − t(1−ηb; 2n−2)·σ·√(2/n), −δh + t(1−ηh; 2n−2)·σ·√(2/n)]

In other words, the observed value must be greater than whichever constraint is bigger for a given example.
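The behavior of this “whichever is bigger” rule can be sketched numerically. The following is my own illustration (normal quantiles standing in for t quantiles, σ = 1, names mine), not code from the paper:

```python
from statistics import NormalDist

def mbi_cutoff(n, sigma=1.0, delta_b=0.2, delta_h=0.2,
               eta_b=0.25, eta_h=0.005):
    """Minimum observed effect that clinical MBI declares
    implementable: the larger of the benefit and harm constraints.
    Normal quantiles approximate the t quantiles for simplicity."""
    z = NormalDist().inv_cdf
    se = sigma * (2 / n) ** 0.5        # SE of a two-group mean difference
    benefit_cut = delta_b - z(1 - eta_b) * se
    harm_cut = -delta_h + z(1 - eta_h) * se
    return max(benefit_cut, harm_cut)

# The harm constraint dominates at small n; the benefit constraint
# takes over (and starts rising toward delta_b) at large n:
for n in (10, 50, 150, 300):
    print(n, round(mbi_cutoff(n), 3))
```

The cutoff falls steeply while the harm constraint dominates and then creeps back up once the benefit constraint takes over, which is the mechanism behind the false-positive peaks discussed later in the text.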
Magnitude-based inference is built on the same confidence intervals used in standard hypothesis testing, so it is not surprising that the two methods can converge. If you set the thresholds for harm/benefit to 0, clinical MBI simply reverts to a one-sided null hypothesis test with a significance level of ηh (because, presumably, the minimum chance of benefit will always be set higher than the maximum risk of harm).
Type I errors are false positives and can occur only when the true effect is trivial. Type II errors are false negatives and can occur only when the true effect is nontrivial. (Note: for a one-sided test, type I errors occur when the true effect is trivial or in the direction you do not care about, and type II errors occur when the true effect is real and in the direction you care about.)
To estimate type I and type II error rates, Hopkins and Batterham (4) ran simulations for a particular scenario: comparing a continuous outcome, measured in standard deviation units, between two groups of athletes in a pre–post design (4). Simulations used the defaults for minimum chance of benefit (25% for clinical MBI, varying for nonclinical MBI) and maximum risk of harm (0.5% for clinical MBI, 5% for nonclinical MBI); a threshold for harm/benefit of 0.2 standard deviations; and sample sizes of 10, 50, and 144 per group. They generated type I errors for trivial effect sizes of 0, ±0.1, and ±0.199, and type II errors for nontrivial effect sizes of ±0.2 to ±0.6.
I commend Hopkins and Batterham for providing their simulation code as a supplement to their 2016 paper (4). With their code, I was able to figure out how their method works, and to reproduce their results. I re-ran their simulations with 100,000 repetitions, and was able to match the numbers they report in Figure 3 of their 2016 Sports Medicine paper (4). I then systematically varied both the sample size (from 10 to 150 per group) and the threshold for harm/benefit (from 0.1 to 0.3 for sample sizes of 5 to 300 per group), and plotted type I and type II error against sample size. I used either 100,000 or 30,000 repetitions in my simulations, depending on the size of the simulation.
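The simulations in this paper were run in SAS; the following Python sketch only shows the shape of such a simulation for clinical MBI under simplifying assumptions that are mine, not the paper’s (parallel groups, known σ = 1, normal rather than t quantiles, the observed difference drawn directly). Because of these simplifications the peak lands at a larger n than in the paper’s pre–post design, but the qualitative behavior, a false-positive peak at intermediate sample sizes, is the same:

```python
import random
from statistics import NormalDist

def clinical_mbi_positive(d_obs, se, delta=0.2, eta_b=0.25, eta_h=0.005):
    """True when clinical MBI declares the effect implementable
    (chance of benefit >= eta_b and risk of harm < eta_h)."""
    z = NormalDist().inv_cdf
    return (d_obs >= delta - z(1 - eta_b) * se and
            d_obs > -delta + z(1 - eta_h) * se)

def type1_rate(n, true_effect=0.0, reps=20000, seed=1):
    """Fraction of simulated two-group studies (sigma = 1, n per
    group, null true effect) that MBI wrongly declares beneficial."""
    rng = random.Random(seed)
    se = (2 / n) ** 0.5
    hits = sum(clinical_mbi_positive(rng.gauss(true_effect, se), se)
               for _ in range(reps))
    return hits / reps

# In this simplified setting the false-positive rate peaks at an
# intermediate sample size and falls off on both sides:
print({n: round(type1_rate(n), 3) for n in (50, 130, 300)})
```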
I then corrected mistakes I found in Hopkins and Batterham’s definitions of type I and type II error. Table 1a shows my corrected definitions.
First, Hopkins and Batterham treat an “unclear” result—when the confidence intervals are so wide that they span from harmful to beneficial—as error-free. However, when a study misses a real effect because the sample size is too small, this is clearly a type II error.
To illustrate this point further, consider the case when you have effect sizes that straddle the border of trivial—for example, 0.199 and 0.2 when the threshold for benefit is 0.2. The errors associated with these two effect sizes must be mirror images of one another. A correct call at 0.199 must be an incorrect call at 0.2, and vice versa. For example, if you correctly dismiss an effect at 0.199, you would have wrongly dismissed it at 0.2 (a type II error). My plots of type I and type II errors for these “border cases” are indeed mirror images of each other, whereas Hopkins and Batterham’s plots are not (see Figure, Supplemental Digital Content 2, simulation results for effect sizes 0.199 and 0.2, http://links.lww.com/MSS/B271). Correctly counting unclear cases as type II errors fixes their plots.
Second, it appears that Hopkins and Batterham intend clinical MBI as a one-sided test. The goal is to determine whether an intervention is beneficial or not beneficial. There is no distinction between an inference of harmful and an inference of trivial—in both cases, you will not implement the intervention. This is consistent with a one-sided test for benefit.
Moreover, Hopkins and Batterham are confused about what to call cases in which there is a true nontrivial effect, but an inference is made in the wrong direction (i.e., inferring that a beneficial effect is harmful or that a harmful effect is beneficial). In the text, they switch between calling these type I and type II errors; and, in their calculations, they treat them both as type II errors (Table 1a). However, they cannot both be type II errors at the same time. Inferring that a beneficial effect is harmful constitutes a type II error only for a one-sided test for benefit; and inferring that a harmful effect is beneficial constitutes a type II error only for a one-sided test for harm. Recognizing that clinical MBI is in fact a one-sided test for benefit clears up this confusion: inferring that a beneficial effect is harmful is a type II error, whereas inferring that a harmful effect is beneficial is a type I error.
In addition to these corrections, I also corrected Hopkins and Batterham’s definitions of type I and type II error for standard hypothesis testing to reflect a one-sided test (Table 1a). Using my corrected definitions, I reran my simulations to estimate type I and type II error for clinical MBI and standard hypothesis testing for true effect sizes of 0, ±0.1, ±0.199, ±0.2, and ±0.3.
Hopkins and Batterham also used incorrect definitions of type I and type II error for nonclinical MBI. Table 1b shows my corrected definitions.
As before, “unclear” cases are considered type II errors when there is a real effect (either positive or negative).
Hopkins and Batterham are again confused about how to treat cases in which researchers correctly infer a nontrivial effect, but in the wrong direction. Hopkins and Batterham call these type II errors, but—as previously discussed—these are type II errors only for one-sided tests. Clearly, Hopkins and Batterham intend nonclinical MBI as a two-sided test, since positive and negative are treated equivalently. For a two-sided test that incorporates direction, inferences in the wrong direction are actually neither type I nor type II errors—these are type III errors (11).
For nonclinical MBI, there is a third issue. Hopkins and Batterham tabulate type I and type II error such that most inferential choices are guaranteed to be error-free. If you infer that the effect is “unlikely,” “possibly,” or “likely” positive or negative, you can never be wrong. These inferences are not counted as type I errors when the true effect is trivial, and are not counted as type II errors when the true effect is nontrivial. Hopkins and Batterham’s logic is that as long as you acknowledge even a small chance (5%–25%) that the effect might be trivial when it is, then you have not made a type I error; and as long as you acknowledge even a small chance (5%–25%) that the effect might be positive (or negative) when it is, then you have not made a type II error.
However, this seems specious. Is concluding that an effect is “likely” positive really an error-free conclusion when the effect is in fact trivial? Again, the errors for the border cases—0.199 and 0.2, when the threshold for harm/benefit is 0.2—should be mirror images of one another; but they are clearly not under this logic.
The solution is to allow degrees of error. If you conclude that an effect is “likely” positive when it is trivial, you are mostly wrong. And if you conclude that an effect is “unlikely” positive when it is trivial, you are mostly right. Thus, I introduced partial errors in my calculations. For scoring the partial errors, I used values of 0.15, 0.50, and 0.85, corresponding to the midpoints of the probability ranges assigned by Hopkins and Batterham: 0.05 to 0.25 (unlikely), 0.25 to 0.75 (possibly), and 0.75 to 0.95 (likely). For example, declaring an effect “unlikely” positive when it is trivial is 0.15 of a type I error, whereas the same inference is 0.85 of a type II error when the effect is positive. Note that it is possible to change the weighting of these partial errors—this will simply alter the balance between type I and type II errors.
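The partial-error bookkeeping just described is simple enough to state as code. This is my own sketch of the scoring scheme, not the paper’s SAS implementation, and it covers only the positive direction for brevity:

```python
# Midpoints of Hopkins and Batterham's probability bands define the
# partial-error weights used in the corrected calculations.
PARTIAL_ERROR = {"unlikely": 0.15, "possibly": 0.50, "likely": 0.85}

def error_contribution(inference, truth):
    """Fractional type I / type II error charged to a hedged inference
    ('unlikely'/'possibly'/'likely' positive).  truth is 'trivial' or
    'positive'."""
    w = PARTIAL_ERROR[inference]
    if truth == "trivial":   # overstating a trivial effect: type I
        return {"type I": w, "type II": 0.0}
    else:                    # partly dismissing a real effect: type II
        return {"type I": 0.0, "type II": 1.0 - w}

print(error_contribution("unlikely", "trivial"))   # {'type I': 0.15, 'type II': 0.0}
print(error_contribution("unlikely", "positive"))  # {'type I': 0.0, 'type II': 0.85}
```

As in the text, reweighting the `PARTIAL_ERROR` values would simply shift the balance between type I and type II errors.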
Using my corrected definitions, I re-ran my simulations to estimate type I and type II error for nonclinical MBI and standard hypothesis testing for true effect sizes of 0, ±0.1, ±0.199, ±0.2, and ±0.3. Note that in my corrected plots, the border cases of 0.199 and 0.2 are nearly mirror images (Fig. 4), as they should be. The slight asymmetry is due to the presence of type III errors, which occur more frequently in nonclinical MBI than standard hypothesis testing (see Figure, Supplemental Digital Content 3, type III errors, http://links.lww.com/MSS/B272). The code for my simulations is available as a supplement to this paper (see Document, Supplemental Digital Content 4, SAS code for simulations, http://links.lww.com/MSS/B273).
I also ran simulations setting the trivial threshold to 0.00001 (effectively 0) to demonstrate that: (1) MBI and standard hypothesis testing converge as expected, and (2) Hopkins and Batterham’s definitions for type I and type II error produce inaccurate values for these known cases.
Finally, I explored MBI mathematically: I worked out the derivation of Hopkins and Batterham’s sample size formula; and I derived general mathematical equations for the type I and type II error rates for clinical MBI for the problem of comparing two means.
Figure 1 shows the type I error for clinical MBI and standard hypothesis testing when the true effect is null-to-trivial in the beneficial direction (0, 0.1, 0.199). Figure 1 reveals an important characteristic of MBI that Hopkins and Batterham’s paper obscures: MBI causes peaks of false positives. For specific sample sizes, the false positive rate spikes to double or triple that of standard hypothesis testing. Note that this pattern is the same whether I use Hopkins and Batterham’s incorrect definitions (panels A, C, and E) or the correct definitions (panels B, D, and F), as their definitional errors did not impact this range of true effects.
It is easy to explain why these peaks occur. When the true effect is 0 to 0.199, observed effects often end up in or near this range. At small sample sizes, confidence intervals around these observed effects are so wide that they cross into the harmful range (LCL99 ≤ −0.2). As sample size increases, however, the confidence intervals narrow out of the harmful range (LCL99 > −0.2), while still overlapping the beneficial range (UCL50 ≥ +0.2)—resulting in a sharp increase in type I error rates. Beyond a certain sample size, the confidence intervals become sufficiently narrow to drop out of the beneficial range (UCL50 < +0.2)—and thus type I error rates drop.
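These three regimes can be made concrete with a quick numerical sketch. The numbers below are mine (normal quantiles, σ = 1, a fixed observed effect of 0.15) and are purely illustrative, not the simulation results:

```python
from statistics import NormalDist

def mbi_intervals(d_obs, n, sigma=1.0):
    """LCL99 and UCL50 for an observed two-group mean difference,
    using normal quantiles as a stand-in for the t quantiles."""
    z = NormalDist().inv_cdf
    se = sigma * (2 / n) ** 0.5
    return d_obs - z(0.995) * se, d_obs + z(0.75) * se

# An observed effect of 0.15 (trivial territory) at three sample sizes:
for n in (10, 150, 2000):
    lcl99, ucl50 = mbi_intervals(0.15, n)
    harm_ok = lcl99 > -0.2      # harm ruled out?
    benefit_ok = ucl50 >= 0.2   # still overlaps the beneficial range?
    print(n, round(lcl99, 2), round(ucl50, 2), harm_ok and benefit_ok)
```

At n = 10 the interval still spills into the harmful range; at n = 150 it has narrowed out of the harmful range while still overlapping the beneficial range (the false-positive window); at very large n the UCL50 drops below the benefit threshold and the inference is correctly trivial.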
In mathematical terms, recall that the bigger of the constraints on harm and benefit decides the inference. At small-to-moderate sample sizes, the constraint on harm tends to be bigger; at larger sample sizes, the constraint on benefit tends to be bigger. The peak occurs at the point at which the two constraints are equally likely to decide the inference (50% chance that the harm constraint is bigger, 50% chance that the benefit constraint is bigger).
This pattern also explains why MBI has higher false-positive rates than standard hypothesis testing at small-to-moderate sample sizes, where the harm constraint dominates. Clinical MBI’s constraint on harm (LCL99 > −0.2) is almost always less stringent than the corresponding constraint for a standard one-sided hypothesis test (LCL90 > 0). Figure 2 illustrates the typical case in which clinical MBI incurs a false positive whereas standard hypothesis testing does not.
The false-positive rate peaks at a sample size of 50 per group. Interestingly, this is exactly the sample size that Hopkins and Batterham proclaim as “optimal” for MBI for this example (4). In fact, their sample size formula explicitly finds these peaks, as I will later show mathematically.
Changing the threshold for harm/benefit changes the location of the peaks (Fig. 1, panels G and H). For example, the peak is at about 20 per group when the threshold is 0.3 standard deviations rather than 0.2. This is because you can rule out harmful effects faster (since the threshold for harm is lower, −0.3 rather than −0.2) but can also rule out beneficial effects faster (since the threshold for benefit is higher, 0.3 rather than 0.2). Smaller trivial thresholds have the opposite effect. For example, the peak in type I error occurs at a sample size of about 200 per group when the threshold is 0.1.
In contrast to MBI, standard hypothesis testing has predictable type I error rates (Fig. 1). When the true effect is 0, the type I error rate is constant at 5%. When the true effect is trivial, but nonzero, the type I error rate climbs steadily with increasing sample sizes. This is because as precision improves, it becomes clear that there is a true nonzero effect. At sample sizes above about 100 per group, clinical MBI has lower type I error rates than standard hypothesis testing. However, the false positives prevented by MBI are all cases where researchers need only look at the narrow confidence intervals to realize that the effect—though statistically significant—is small or trivial in size.
Figure 3 shows the type II errors when the true effect is beneficial (0.2, 0.3). Clinical MBI reduces the type II error rate at the same sample sizes where it increases the type I error rate. This is expected, because loosening the constraints on false positives makes it easier to avoid false negatives. At Hopkins and Batterham’s “optimal” sample size of 50, clinical MBI cuts the type II error rate in half compared with standard hypothesis testing for a true effect of 0.2. Clinical MBI has the biggest advantage for type II error when the true effect is near the trivial threshold. With larger true effects, the advantage of clinical MBI shrinks. In fact, when the true effect is 0.5 standard deviations—what Cohen calls a medium effect size (12)—clinical MBI has almost no advantage over standard hypothesis testing (see Figure, Supplemental Digital Content 5, simulation for effect size of 0.5, http://links.lww.com/MSS/B274).
Figure 3 shows the type I errors when the true effect is harmful, whether by a trivial (−0.1) or nontrivial (−0.2, −0.3) amount. The type I error rates are low overall (<3%), but generally higher for clinical MBI than standard hypothesis testing. Clinical MBI also exhibits peaks in the type I error rate around the “optimal” sample size of 50, as before. The explanation for the higher false-positive rates is the same as before—false positives occur for MBI but not standard hypothesis testing when LCL99 > −0.2 but LCL90 < 0.
Figure 4 shows the results for type I and type II error for nonclinical MBI. The patterns are similar to those seen in clinical MBI, namely, 1) nonclinical MBI creates peaks of false positives at specific sample sizes (here ~ 30 per group). Within these peaks, the type I error rate is two to six times higher than for standard hypothesis testing. 2) The location of these peaks depends on the value chosen for the threshold for harm/benefit. 3) Nonclinical MBI has lower type II error rates than standard hypothesis testing at the same sample sizes at which the type I error rates peak. 4) The difference in type II error between nonclinical MBI and standard hypothesis testing is most pronounced when the true effect is close to the trivial threshold and less pronounced when the true effect is larger.
Where MBI and Standard Hypothesis Testing Converge
If you reduce the harm/benefit threshold to 0, then MBI reverts to a one-sided null hypothesis test with a significance level equal to the maximum risk of harm. Thus, in my simulations, clinical MBI should revert to a one-sided null hypothesis test with a significance level of 0.005, and nonclinical MBI should revert to a two-sided null hypothesis test with a significance level of 0.10.
Indeed, simulations show that the type I and type II errors for clinical and nonclinical MBI are as expected for a one-sided null hypothesis test with a significance level of 0.005 and a two-sided null hypothesis test with a significance level of 0.10, respectively (see Figure, Supplemental Digital Content 6, simulations for threshold for harm/benefit of 0, http://links.lww.com/MSS/B275). As further proof that Hopkins and Batterham’s definitions for the errors are incorrect, their definitions wildly underestimate the type II errors for these known cases (see Figure, Supplemental Digital Content 6, simulations for threshold for harm/benefit of 0, http://links.lww.com/MSS/B275).
Mathematical Confirmation (Clinical MBI)
Sample size formula
Hopkins and Batterham provide sample size calculators in Excel spreadsheets for clinical MBI (1), though they do not appear to have published the formulas that underlie these calculators. Welsh and Knight (8) were able to work out the formula for comparing two means from the algorithms in the spreadsheets but were unsure of the derivation. Based on an online presentation by Hopkins (13), I believe that I have figured out the derivation. For simplicity, I will treat the group sizes as equal, but one can generalize to unequal group sizes by substituting (r + 1)σ²/r for 2σ², and changing the degrees of freedom to (r + 1)n − 2, where r is the ratio of the larger to the smaller group.
Hopkins and Batterham start by assuming that the true effect is a random variable t-distributed around a fixed observed effect (this mirrors their misinterpretation of frequentist confidence intervals as Bayesian credible intervals). This distribution will have a variance of 2σ²/n, assuming a pooled variance and equal group sizes. They then solve for the sample size that will make the following true: P(true effect < −δh) = ηh and P(true effect ≥ δb) = ηb. This occurs when MBI’s constraints on harm and benefit are equal:

δb − t(1−ηb; 2n−2)·σ·√(2/n) = −δh + t(1−ηh; 2n−2)·σ·√(2/n)

Solving this equation leads to an n per group of:

n = 2σ²·[t(1−ηh; 2n−2) + t(1−ηb; 2n−2)]² / (δb + δh)²

(because the t quantiles themselves depend on n, the formula is applied iteratively).
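A hedged Python sketch of this sample-size calculation follows. It is mine, not the spreadsheet’s code: normal quantiles replace the t quantiles (so n is slightly underestimated for small samples), and the defaults assume a plain two-group comparison in standard deviation units:

```python
import math
from statistics import NormalDist

def mbi_sample_size(sigma=1.0, delta_b=0.2, delta_h=0.2,
                    eta_b=0.25, eta_h=0.005):
    """n per group at which MBI's benefit and harm constraints are
    equal -- the 'optimal' sample size.  Normal quantiles stand in
    for the t quantiles."""
    z = NormalDist().inv_cdf
    num = 2 * sigma**2 * (z(1 - eta_h) + z(1 - eta_b))**2
    return math.ceil(num / (delta_b + delta_h)**2)

# For a generic two-group comparison with these defaults the formula
# gives 133 per group; the exact figure in Hopkins and Batterham's
# example depends on the design's variance structure.
print(mbi_sample_size())  # 133
```

Note how raising the minimum chance of benefit (`eta_b`) shrinks the z-quantile sum and hence the computed n, which is why the “optimal” sample sizes sit exactly where the two constraints trade dominance.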
This formula indeed matches the formula that Welsh and Knight (8) report. Furthermore, it explains why Hopkins and Batterham’s “optimal” sample sizes correspond to the peaks in type I error—these peaks occur precisely at the point at which the constraints on harm and benefit are equally important.
General equations for MBI’s error rates
I derived general mathematical equations for the type I and type II error rates of clinical MBI for the case of comparing two means. These equations assume equal group sizes but can be generalized to unequal group sizes as described above (for full derivation see: Document, Supplemental Digital Content 7, math derivation, http://links.lww.com/MSS/B276):
ES = true effect size (difference in means); n = sample size per group; σ = pooled standard deviation.
Let C(n) = max[δb − t(1−ηb; 2n−2)·σ·√(2/n), −δh + t(1−ηh; 2n−2)·σ·√(2/n)] denote the larger of MBI’s two constraints, and let Φ denote the standard normal cumulative distribution function.

Type I error (type I error can only occur when ES < δb):

type I error ≈ 1 − Φ([C(n) − ES] / (σ·√(2/n)))

Type II error (type II error can only occur when ES ≥ δb):

type II error ≈ Φ([C(n) − ES] / (σ·√(2/n)))
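The logic of these error-rate expressions can be sketched numerically. The following is my own normal-approximation version (σ treated as known, names mine), not the paper’s full derivation or its SAS implementation:

```python
from statistics import NormalDist

def mbi_error_rates(es, n, sigma=1.0, delta_b=0.2, delta_h=0.2,
                    eta_b=0.25, eta_h=0.005):
    """Approximate clinical-MBI error rates: the observed difference
    is treated as Normal(es, 2*sigma^2/n), and an 'implementable'
    inference occurs when it clears the larger of the benefit and
    harm constraints."""
    z = NormalDist().inv_cdf
    se = sigma * (2 / n) ** 0.5
    cutoff = max(delta_b - z(1 - eta_b) * se,
                 -delta_h + z(1 - eta_h) * se)
    p_implement = 1 - NormalDist(mu=es, sigma=se).cdf(cutoff)
    if es < delta_b:
        return {"type I": p_implement}   # falsely declared beneficial
    return {"type II": 1 - p_implement}  # truly beneficial but missed

# Type I error for a null effect peaks at intermediate sample sizes:
print({n: round(mbi_error_rates(0.0, n)["type I"], 3)
       for n in (50, 130, 300)})
```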
Values generated from these equations closely match the results of my simulations (Fig. 5). My equations may slightly overestimate the type I error rates at the peaks (Fig. 5) due to a simplification made in the math for tractability (for details see: text, Supplemental Digital Content 7, math derivation, http://links.lww.com/MSS/B276). In contrast, the mathematically predicted results do not match the simulations that use Hopkins and Batterham’s definitions of type I and type II error, again demonstrating that their definitions are incorrect.
These equations eliminate the need for simulation, making it easy to calculate the type I and type II error rates for many combinations of parameters, and allowing further exploration of MBI’s behavior. For example, I changed the minimum chance of benefit from 25% (“possibly” beneficial) to 75% (“likely” beneficial). For an effect size of 0, the type I error is low—peaking at 4.9% when n = 19 per group. However, for an effect size of 0.2, the type II error is at least 75% for all sample sizes; and for an effect size of 0.3, MBI requires 165 participants per group to achieve a type II error of 20%, whereas standard hypothesis testing requires only 50 per group (see Figure, Supplemental Digital Content 8, simulation for minimum chance of benefit of 75%, http://links.lww.com/MSS/B277).
I have provided a SAS program that calculates type I and type II error for clinical MBI (see Document, Supplemental Digital Content 9, SAS implementation of error rate formulas, http://links.lww.com/MSS/B278).
Though Hopkins and Batterham claim that MBI has “superior” error rates compared with standard hypothesis testing for most cases (4), this is false. MBI exhibits very specific tradeoffs between type I and type II error. Magnitude-based inference creates peaks in the false positive rate and corresponding dips in the false negative rate at specific sample sizes. Sample size calculators provided by MBI’s creators are tuned to find these peaks—which typically occur at small-to-moderate sample sizes. At these peaks, the type I error rates are two to six times that of standard hypothesis testing. The use of MBI may therefore result in a proliferation of false positive results.
In their Sports Medicine paper, Hopkins and Batterham do acknowledge inflated type I error rates in one specific case—clinical MBI when the effect is marginally trivial in the beneficial direction (4). However, due to incorrect definitions of type I and type II error, they failed to recognize that type I error is actually inflated in all cases at their “optimal” sample sizes (for null-to-trivial effects in both directions and for nonclinical MBI). The increase in false positives occurs because MBI’s constraint on harm—which dominates at small-to-moderate sample sizes—is less stringent than the corresponding constraint in standard hypothesis testing (as illustrated in Fig. 2). Incorrect definitions also led Hopkins and Batterham to dramatically underestimate type II error in several of their simulations.
Hopkins and Batterham might argue that one can simply change the minimum chance of benefit, for example from a “possibly” inference (25%) to a “likely” inference (75%), to reduce the type I error. This indeed controls the type I error rate, but greatly increases the type II error rate, meaning that clinical MBI will require much larger sample sizes than standard hypothesis testing to achieve comparable statistical power (and for true effects on the border of trivial, the statistical power will never be higher than 25%).
Whereas standard hypothesis testing has predictable type I error rates, MBI has type I error rates that vary greatly depending on the sample size; choice of thresholds for harm/benefit; and choice of maximum risk of harm/minimum chance of benefit. This is problematic because unless researchers calculate and report the type I error for every application, it will remain hidden from readers. Furthermore, the dependence on the thresholds for harm/benefit as well as the maximum risk of harm/minimum chance of benefit makes it easy to game the system. A researcher could tweak these values until they get an inference they like. Hopkins and Batterham dismiss this issue by saying: “Researchers should justify a value within a published protocol in advance of data collection, to show they have not simply chosen a value that gives a clear outcome with the data. Users of NHST [Null Hypothesis Significance Testing] are not divested of this responsibility, as the smallest important effect informs sample size” (4). But there’s an obvious difference: You cannot change the sample size once the study is done, but it is easy to fiddle with MBI’s harm and benefit parameters. And, though it would be ideal for researchers to publish protocols ahead of time, in reality they rarely do (14).
There is no doubt that standard hypothesis testing has pitfalls. Too often, researchers place undue emphasis on p-values (15) and fail to consider other key pieces of statistical information—including graphical displays, effect sizes, confidence intervals, and consistency across multiple analyses (16). I appreciate Hopkins and Batterham’s attempt to educate researchers about the importance of magnitude and precision. However, although their intentions may have been good, they introduced a statistical method into the sports science literature without adequately understanding its systematic behavior. Also, they continue to defend it with demonstrably false arguments. For example, I have established that Hopkins and Batterham defined type I and type II error incorrectly in their paper in Sports Medicine (4), and consequently made false claims about MBI. This should give users of MBI pause.
Additionally, in their 2016 Sports Medicine article (4), Hopkins and Batterham claim that, “We also provide published evidence of the sound theoretical basis of MBI [10,11,16].” However, the three references they cite provide no such evidence. Gurrin et al. (17) suggest that it may sometimes be reasonable to interpret frequentist confidence intervals as Bayesian credible intervals by assuming a uniform prior probability, but they explicitly state: “Although the use of a uniform prior probability distribution provides a neat introduction to the Bayesian process, there are a number of reasons why the uniform prior distribution does not provide the foundation on which to base a bold new theory of statistical analysis!” Shakespeare et al. (18) provide only general information on confidence intervals and do not address anything directly related to MBI. Finally, the last reference, by Hopkins and Batterham (19), is a short letter pointing to empirical evidence from a simulation that is only a preliminary version of the simulations reported in Sports Medicine (4).
Hopkins and Batterham’s sample size spreadsheets may also mislead users. They ask users to enter values for the minimum chance of benefit (ηb) and maximum risk of harm (ηh), but they label these the “type I clinical error” and “type II clinical error.” In fact, the minimum chance of benefit and maximum risk of harm are not interchangeable with type I and type II error, as previously noted by Welsh and Knight (8). At the specified sample size, the type I error rate will equal the maximum risk of harm only if the true effect equals the threshold for harm (−δh); similarly, the type II error rate will equal the minimum chance of benefit only if the true effect equals the threshold for benefit (δb). However, because of this conflation of terms, users may mistakenly infer that they are setting the type I and type II errors at fixed values. Moreover, users may be left with the false impression that they are controlling the overall type I error when, in fact, they are controlling only the type I error from inferring that a harmful effect is beneficial, which accounts for only a tiny fraction of type I errors; the predominant source of type I error, inferring that a trivial effect is beneficial, is left unconstrained. Users (or potential users) of MBI are encouraged to use my general equations to explore the true type I and type II error rates for MBI under different scenarios.
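The relationship between these parameters and the actual error rates can be made concrete under simplifying assumptions. The function below is not the general equations derived in this article (which would use the t-distribution); it is a known-variance, flat-prior analogue giving the probability that clinical MBI declares benefit as a function of the true standardized effect, from which type I and type II error rates follow. All default values are illustrative assumptions.

```python
from statistics import NormalDist

Z = NormalDist()

def prob_declare_benefit(mu, n, delta=0.2, eta_h=0.005, eta_b=0.25):
    """Probability that clinical MBI infers benefit when the true
    standardized effect is mu, for two groups of size n with known unit
    variance. Normal approximation with assumed, illustrative defaults;
    not the exact published equations."""
    se = (2 / n) ** 0.5
    # "risk of harm < eta_h" holds exactly when d_hat exceeds this cutoff:
    harm_cutoff = -delta + Z.inv_cdf(1 - eta_h) * se
    # "chance of benefit > eta_b" holds exactly when d_hat exceeds this one:
    benefit_cutoff = delta - Z.inv_cdf(1 - eta_b) * se
    cutoff = max(harm_cutoff, benefit_cutoff)  # both constraints must hold
    return 1 - Z.cdf((cutoff - mu) / se)

# Type I error when the true effect sits exactly at the harm threshold
# (n = 50, where the harm constraint binds): equals eta_h.
type1_at_harm = prob_declare_benefit(mu=-0.2, n=50)
# Type II error when the true effect sits exactly at the benefit threshold
# (n = 200, where the benefit constraint binds): equals eta_b.
type2_at_benefit = 1 - prob_declare_benefit(mu=0.2, n=200)
# Type I error for a truly null effect at n = 50: controlled by neither
# parameter, and well above what users might expect.
type1_at_null = prob_declare_benefit(mu=0.0, n=50)
```

In this sketch, the entered parameters match the error rates only at the two boundary effects noted in the comments; for null and trivial true effects, the type I error is governed by the sample size and thresholds rather than by anything the user explicitly sets.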
In this article, I have explored type I and type II errors only for a specific statistical test: comparing two means. However, my error equations could easily be adapted to other statistical tests, and the patterns would be the same. I have also not addressed the veracity of MBI’s probabilistic statements; for example, when MBI states that there is a 75% to 95% chance that the effect is beneficial, is this accurate? One can largely ignore these statements by simply viewing MBI as a method that evaluates the location of confidence intervals relative to certain ranges. However, I will note that these probabilistic statements are unlikely to be accurate, given that Hopkins and Batterham are interpreting confidence intervals as Bayesian credible intervals without doing a proper Bayesian analysis. Arguably, many interventions tested in sports science will have a low prior probability of effectiveness, and failing to account for this will lead to overly optimistic conclusions. As Welsh and Knight argue (8), if Hopkins and Batterham want to provide probabilistic statements, they should adopt a fully Bayesian analysis.
Hopkins and Batterham may be correct in recognizing that when it comes to testing interventions to enhance athletic performance, our tolerance for type I error may be higher than in other situations (for example, testing a cancer drug). Coaches and athletes may be willing to occasionally adopt ineffective interventions so long as they are not detrimental to athletic performance. However, this is not without cost: ineffective interventions waste time and money, and may cause side effects. What is needed is an approach that is less conservative than a two-sided null hypothesis test but still adequately reins in the type I error. The answer is simple: use a one-sided null hypothesis test for benefit. Compared with clinical MBI, this approach: 1) guards more strictly against inferring benefit when the effect is harmful, 2) has lower type I error rates for small-to-moderate sample sizes, and 3) has higher type II error rates only when the effect sizes are close to trivial. Researchers would, of course, need to consider the accompanying effect size and confidence interval; if the confidence interval contains only trivial effects, the result should be interpreted accordingly. For an excellent reference on how to interpret confidence intervals, see Curran-Everett (20).
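As a sketch of this alternative, the function below runs a one-sided large-sample test for benefit and reports a conventional two-sided 95% confidence interval alongside the decision. It uses a normal approximation (adequate for moderate samples; a t-based version would be preferred for small n), and the function name and defaults are illustrative rather than a prescribed procedure.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

Z = NormalDist()

def one_sided_test_for_benefit(treatment, control, alpha=0.05):
    """Test H0: effect <= 0 against H1: effect > 0 (benefit) and return
    (decision, one-sided p-value, two-sided 95% CI for the effect).
    Normal approximation; an illustrative sketch, not the article's
    exact recommended implementation."""
    n1, n2 = len(treatment), len(control)
    d_hat = mean(treatment) - mean(control)
    se = sqrt(stdev(treatment) ** 2 / n1 + stdev(control) ** 2 / n2)
    p_one_sided = 1 - Z.cdf(d_hat / se)
    ci = (d_hat - 1.96 * se, d_hat + 1.96 * se)
    # The decision alone is not enough: if the CI contains only trivial
    # effects, a "significant" benefit should be interpreted accordingly.
    return p_one_sided < alpha, p_one_sided, ci

# Deterministic toy data: the treatment group is shifted up by 1 unit.
control = [i % 5 for i in range(50)]    # values 0-4, mean 2
treatment = [x + 1 for x in control]    # mean 3
significant, p, ci = one_sided_test_for_benefit(treatment, control)
```

Reporting the confidence interval together with the one-sided decision preserves the magnitude information that MBI's proponents rightly emphasize, while keeping the type I error at a known, fixed level.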
In conclusion, “magnitude-based inference” exhibits undesirable systematic behavior and should not be used. As Welsh and Knight (8) have already pointed out, MBI should be replaced with a fully Bayesian approach or should simply be scrapped in favor of making qualitative statements about confidence intervals. In addition, a one-sided null hypothesis test for benefit—interpreted alongside the corresponding confidence interval—would achieve most of the objectives of clinical MBI while properly controlling type I error.