Some people try to find things in this game that don’t exist but football is only two things—blocking and tackling.

—Vince Lombardi (1913–1970), American football player, coach, and executive

The previous basic statistical tutorial focused on the appropriate use of the common unadjusted bivariate tests for comparing data, namely, the unpaired t test, paired t test, chi-square test for association, Wilcoxon-Mann-Whitney test, and Wilcoxon signed rank test.^{1} These unadjusted bivariate tests will serve as the basis for this tutorial, which includes these fundamental concepts:

Central limit theorem and law of large numbers
Definition of statistical significance
Effect size or Cohen’s d
Type I error, alpha (α), significance level
Type II error, beta (β), power
Power analysis
Determining the sample size
Interim analyses and early closure of a study
We also discuss the importance of sample size justification, including for single-group estimation studies; the need for and methods of controlling the type I error with multiple comparisons; and the unique aspects of equivalence and noninferiority trials.

Central Limit Theorem and Law of Large Numbers
Research in anesthesia, perioperative medicine, critical care, and pain medicine typically involves inferential statistics. Inferential statistics essentially allow investigators to make a valid inference about an association of interest (eg, a therapeutic intervention and its clinical outcome) for a specific population that is based on data derived from a random sample of that population. An unknown population parameter representing the clinical association or treatment effect of interest can thus be estimated from the study sample.^{1,2}

Inferential statistics relies heavily on the central limit theorem and the related law of large numbers.^{3,4} Let us assume that you (1) randomly generate many random samples of equal size from a given population; (2) tabulate the means of each of these random samples; and then (3) graph the frequency distribution of these mean values.^{5} The central limit theorem states that if these random samples are large enough, the distribution of their means will approximate a normal (Gaussian) distribution—with a distribution mean that is equal to the true, underlying population mean—even if the distribution of the underlying population is not normal.^{5}

Therefore, according to the central limit theorem, regardless of the distribution of the source population, the means of sufficiently large samples from that population will be approximately normally distributed.^{4,6} The related law of large numbers holds that, as samples become larger, their means converge on the true population mean; in practice, “large enough” for these theorems is usually defined as n ≥ 30.^{6} Thus, the use of conventional inferential statistics may not be valid with small sample sizes (n < 30). Moreover, with so-called heavy-tailed distributions (where outliers are common and/or the data are heavily skewed to the right or left), larger sample sizes of 100 or even 300 may be needed.^{7,8}
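These two theorems are easy to verify empirically. The following illustrative sketch (standard-library Python; the exponential population, sample size of 30, and seed are arbitrary choices for demonstration) draws repeated samples from a markedly right-skewed population and shows that the sample means are nonetheless centered on the true population mean, with the spread the central limit theorem predicts:

```python
import random
import statistics

random.seed(42)

# Population: a right-skewed exponential distribution (mean = 1), far from normal.
# Draw 10,000 random samples of n = 30 and record each sample's mean.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(30))
    for _ in range(10_000)
]

# Per the central limit theorem, these means are approximately normally
# distributed, centered on the true population mean of 1.0, with spread
# sigma / sqrt(n) = 1 / sqrt(30), or about 0.18, despite the skewed population.
print(round(statistics.fmean(sample_means), 2))  # close to 1.0
print(round(statistics.stdev(sample_means), 2))  # close to 0.18
```

Plotting a histogram of `sample_means` would show the familiar bell shape, even though a histogram of the raw exponential draws is strongly right-skewed.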

If a test for data normality like the Shapiro-Wilk test concludes that your study data deviate significantly from a normal (Gaussian) distribution,^{9} rather than substantially increasing your sample size to comply with the law of large numbers, this problem can potentially instead be remedied by judiciously and transparently (1) performing a data transformation of all the data values or (2) eliminating any obvious data outlier(s).^{9–11} Most commonly, logarithmic, square root, or reciprocal data transformations are applied to achieve data normality.^{12}
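A simple numerical illustration of why such transformations work is sketched below (this is not the Shapiro-Wilk test itself, which requires a statistics package; the right-skewed lognormal data and the moment-based skewness check are illustrative assumptions):

```python
import math
import random
import statistics

random.seed(7)

def skewness(xs):
    """Sample moment coefficient of skewness (near 0 for symmetric data)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])

# Right-skewed data, as seen with many biomarker, length-of-stay, or cost variables.
raw = [random.lognormvariate(0.0, 1.0) for _ in range(2_000)]
logged = [math.log(x) for x in raw]

print(round(skewness(raw), 1))     # strongly right-skewed (well above 0)
print(round(skewness(logged), 1))  # near 0: approximately normal after transform
```

The logarithm compresses the long right tail, so the transformed values can validly be analyzed with normality-based methods; results should then be reported on the transformed (or back-transformed) scale.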

Definition of P Value and Statistical Significance
In everyday language, the word “significant” connotes something that is important or consequential, whereas in research-related hypothesis testing, the term “statistically significant” describes an observed difference or association that has met a certain threshold.^{13} This threshold is referred to as the significance level for a statistical test (eg, an unpaired t test that compares 2 independent sample means).^{13,14} This significance threshold or cut-point is denoted as α and is typically set at .05. When the observed or calculated P value is less than α, one can reject the null hypothesis (Ho) and accept the alternative hypothesis (Ha).^{13}

It is understandably tempting for researchers to report their observed difference or association and its calculated P value as “marginally significant” or “almost significant” (eg, P = .08) and thus indicative of a “trend.”^{13,15} Alternatively, they may use terms like “highly significant” with a P < .01 or “extremely significant” with a P < .001.^{13} However, neither practice is valid because the a priori P value of .05 is a binary (yes/no, all-or-none) cut-point.^{13}

Moreover, as the sample size increases, the P value will become smaller for the same observed difference or association.^{16} Theoretically, as the sample size approaches infinity, any observed difference or association—no matter how infinitesimal—will become statistically significant.
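This dependence on sample size is easy to demonstrate. The sketch below (a normal approximation to the unpaired t test; the fixed difference of 0.1 SD units is an arbitrary illustrative choice) holds the observed difference constant while the sample size grows:

```python
import math

def two_sample_p(mean_diff, sd, n_per_group):
    """Approximate 2-sided P value for a 2-sample comparison of means
    (normal approximation to the unpaired t test)."""
    se = sd * math.sqrt(2.0 / n_per_group)       # standard error of the difference
    z = abs(mean_diff) / se
    return math.erfc(z / math.sqrt(2.0))         # 2-sided tail probability

# The identical small difference (0.1 SD units) at increasing sample sizes:
for n in (50, 500, 5000):
    print(n, round(two_sample_p(0.1, 1.0, n), 4))
```

The same 0.1-SD difference moves from clearly nonsignificant (P ≈ .62 at n = 50 per group) to highly significant (P < .0001 at n = 5000 per group), even though nothing about the effect itself has changed.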

The innate limitations of significance testing have led some experts to recommend abandoning it altogether in the medical literature.^{17} However, old habits are hard to break, especially given unfamiliarity with better methods and authors’ concerns about whether less commonly applied, albeit superior, techniques will be equally well received by journal editors, reviewers, and readers.^{17}

Effect Size and Cohen's d
As an example, researchers dutifully report the observed difference in the means of their 2 study groups and the confidence interval (CI) for this observed difference. A bivariate statistical test (eg, unpaired t test) indicates that the observed difference or association is statistically significant (ie, P < .05). Yet the equally if not more important question is whether this observed difference or association is clinically important or otherwise meaningful.^{14,18–21} Reporting the observed effect size can help answer this question.^{21}

Effect size is an objective and standardized measure of the magnitude of the observed difference or association.^{6} Effect size is a measure of how different 2 study groups are from one another and thus helps answer “How big is big?”^{18} One of the most widely applied measures of effect size is Cohen’s d for a t test.^{16,22}

The important role of the minimal or expected effect size in calculating the needed study sample size will be explored below.^{23} However, when computing the observed effect size, the sample size is not included in the calculation.^{18,20} For example, Cohen’s d is calculated as the difference between the 2 sample means divided by the pooled standard deviation (SD) for all the data in the 2 independent samples. It is thus the difference in means in SD units, which allows for comparisons of results across studies.^{16,20}
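A minimal calculation of Cohen's d from raw data might look like the following sketch (the two groups are hypothetical pain scores invented purely for illustration):

```python
import math
import statistics

def cohens_d(group1, group2):
    """Cohen's d: difference in means divided by the pooled SD.
    The sample sizes appear only in the pooling weights, not in the result."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    return (statistics.fmean(group1) - statistics.fmean(group2)) / pooled_sd

# Hypothetical pain scores (0-10 scale) in 2 treatment groups:
control = [6, 7, 5, 8, 6, 7, 6, 5]
treated = [4, 5, 4, 6, 5, 3, 5, 4]
print(round(cohens_d(control, treated), 2))  # 1.78: a large effect
```

Doubling both sample sizes would leave d essentially unchanged, whereas the P value from the accompanying t test would shrink, which is why effect size and statistical significance convey complementary information.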

The following guidelines or cut-points are conventionally applied in interpreting Cohen’s d^{16,22}:

A small effect size with a Cohen’s d of approximately 0.2
A medium effect size with a Cohen’s d of approximately 0.5
A large effect size with a Cohen’s d of approximately 0.8 or greater
A CI should be reported for the observed effect size.^{21,24} The formula for this CI includes the sample size, with a larger sample giving a narrower interval.^{19}
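One common large-sample approximation for this CI, sketched below, makes the role of sample size explicit (the variance term is the Hedges-Olkin approximation; this is one of several available methods, not the only one):

```python
import math

def cohens_d_ci(d, n1, n2, z=1.96):
    """Approximate 95% CI for Cohen's d using the Hedges-Olkin
    large-sample variance approximation."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return (round(d - z * se, 2), round(d + z * se, 2))

# The same observed d = 0.5 is estimated far more precisely with a larger sample:
print(cohens_d_ci(0.5, 20, 20))    # wide interval that includes 0
print(cohens_d_ci(0.5, 200, 200))  # much narrower interval
```

With 20 patients per group the interval spans roughly −0.13 to 1.13 and includes 0, whereas with 200 per group it narrows to roughly 0.30 to 0.70, illustrating how the CI (unlike d itself) reflects sample size.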

Null hypothesis significance testing (NHST),^{2} in which the Ho is either rejected or not rejected based on the P value, remains the dominant statistical approach in clinical and basic science research. When used in isolation, NHST has often-unappreciated limitations.^{17,19} Specifically, relying on P values alone does not provide 2 crucial pieces of information: (1) the magnitude of the effect of interest and (2) the precision of the estimated magnitude of that effect.^{19}

Researchers and journal stakeholders should thus be equally interested in clinical or biological importance, which can be assessed using the magnitude of an effect—but not by its statistical significance.^{19} Therefore, we strongly advocate also presenting appropriate measures of the magnitude of effect and their CIs in Anesthesia & Analgesia.

In their Cochrane Review of ultrasound guidance for perioperative neuraxial and peripheral nerve blocks in children, Guay et al^{25} noted that the increased success rate with use of ultrasound was most evident for peripheral nerve blocks, where the amplitude of the effect size (difference between ultrasound guidance and no ultrasound guidance) was inversely proportional to the age of the participant. In their Cochrane Review of peripheral nerve blocks for hip fractures, Guay et al^{26} observed that the effect size (difference between peripheral nerve blocks and systemic analgesia) is proportional to the concentration of local anesthetic.

Type I Error and Alpha Level
A type I error occurs when the Ho that there is no difference, association, or correlation between the study groups is falsely rejected.^{13,16,27} The type I error rate, or significance level, is the probability of rejecting the Ho when in fact there is no difference (or association, or correlation) in the population of interest. A type I error represents a false-positive study result or finding^{13} (Figure 1).

Figure 1. Decision in a study versus truth in the population of interest. This figure shows the difference between a type I error and a type II error when making a decision to either reject or not reject the null hypothesis.

In undertaking conventional NHST,^{2} researchers predetermine the α level to assess whether the Ho can be rejected and the result of their study is statistically significant.^{16} The α level is typically set at .05, implying that if the probability of obtaining the observed (or a more extreme) result simply by chance, when the Ho is true, is less than this α level of .05 (5%), then one can conclude that the observed study result is unlikely to have occurred simply by chance and that it is statistically significant.^{14,16}

Alpha Level Versus the P Value.
It is important to note that the α level and the P value are not the same.^{13} The α level is the predetermined acceptable probability of committing a type I error.^{27} The P value is generated by the applied statistical test based on the collected study data. If the calculated P value is less than the predetermined α, you can reject the Ho and conclude that the observed difference, association, or correlation is statistically significant.^{13} One can also plausibly conclude that the observed study group difference is associated with the study intervention.

Beta Level and Type II Error
A type II error occurs when there is a true difference, association, or correlation between the groups in the population of interest, so that the Ha is true, but no difference (or association, or correlation) is found in the study.^{13,16,27} A type II error represents a false-negative study result or finding^{13} (Figure 1).

In undertaking conventional NHST,^{2} the β level is the predetermined acceptable probability of committing a type II error, or failing to detect a true difference, effect, or association of a prespecified size.^{23,27} The β level is typically set at .10 or .20, implying that the accepted probability of committing a type II error is .10 (10%) or .20 (20%).^{23}

Power.
Power is the probability of rejecting the Ho when there is a true difference (or association, or correlation) of a specified magnitude in the population of interest. Thus, power can only be discussed in context of a specific underlying true treatment effect of a defined magnitude, as is discussed in more detail below.

The converse of a type II error is correctly rejecting the Ho when the Ha is true—thus detecting a true difference, effect, or association.^{27} The probability of detecting this true difference, effect, or association is referred to as the power of the applied statistical test. Because these 2 probabilities are complementary, their sum is 1, and power is thus equal to 1 − β.^{6,27} Typically, researchers choose to have 80% power with a β of .2 (20%) or 90% power with a β of .1 (10%).^{22,28} As explained below, 90% power is markedly better.
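Power has a concrete frequency interpretation that can be checked by simulation. In the sketch below (standard-library Python; the true difference of 0.5 SD and 86 patients per group are assumed values, chosen because that combination corresponds to roughly 90% power), power is simply the fraction of simulated trials that reject the Ho:

```python
import math
import random
import statistics

random.seed(1)

def simulated_power(true_diff, sd, n_per_group, n_sims=2000):
    """Estimate power by simulation: the fraction of simulated trials in which a
    2-sample comparison rejects Ho (normal approximation, 2-sided alpha = .05)."""
    rejections = 0
    for _ in range(n_sims):
        a = [random.gauss(0.0, sd) for _ in range(n_per_group)]
        b = [random.gauss(true_diff, sd) for _ in range(n_per_group)]
        se = math.sqrt(statistics.variance(a) / n_per_group
                       + statistics.variance(b) / n_per_group)
        if abs(statistics.fmean(b) - statistics.fmean(a)) / se > 1.96:
            rejections += 1
    return rejections / n_sims

# A true 0.5-SD difference with 86 patients/group is detected about 90% of the time:
print(round(simulated_power(0.5, 1.0, 86), 2))
```

Rerunning the simulation with `true_diff = 0` would instead estimate the type I error rate, which should hover near .05.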

Power Analysis and Sample Size Justification: Prestudy
Power analysis and sample size justification are integral parts of the a priori, prestudy design of any research in which an inference will be made. Determining the appropriate sample size is challenging because there are unknown parameters that need to be estimated. However, because it is crucial to the success of a research study, it is incumbent on researchers to invest the time and resources to do it well.^{29,30} Even if the goal is only estimation, sample size justification is still needed and important, as discussed below. Achieving the planned sample size can also be challenging, and we will specifically discuss below study dropouts and interim analyses.

When planning a study, one strives to minimize the 2 above-discussed error rates: α, the probability of rejecting the Ho when it is true; and β, the probability of failing to reject the Ho when it is false and the Ha is true. Designing a study with sufficient power (1 − β) and a sufficiently small α, or chance of type I error, is a crucial aspect of study planning.^{31}

The need for careful planning of sample size is evidenced by past reviews, which have shown that a large percentage of clinical trials are underpowered to detect moderate (eg, 25%) or even large (eg, 50%) relative reductions in binary outcomes.^{32–34} Conducting underpowered clinical trials raises ethical concerns,^{35} including an elevated probability that beneficial treatment effects will be missed (ie, a high risk of a type II error with a false-negative conclusion). Underpowered trials not only expose subjects to potential harm with an inadequate probability of detecting a beneficial treatment, but also waste resources and time that might be better spent on sufficiently powered studies. In choosing the primary outcome, many factors should be considered, including that continuous outcomes generally have much higher power to detect clinically important effects than binary (dichotomous) outcomes. However, binary outcomes are often more attractive in definitive clinical trials or large database studies.

Determining the Sample Size
When conducting a sample size calculation for a study comparing groups, 4 essential ingredients are needed: (1) the treatment-related effect to be detected; (2) the SD of the primary outcome variable (if it is a continuous data variable); (3) the desired significance level (α); and (4) the power (1 − β).^{23,31} Each of these 4 required ingredients has key characteristics to consider in study planning.

Exemplary sample size calculations can be found in the randomized, controlled trial of the effect of multiple ports on the analgesic efficacy of wire-reinforced flexible catheters used for labor epidural analgesia reported by Philip et al^{36} and the randomized, controlled trial of the effect of intraperitoneal instillation of lidocaine on postoperative analgesia after cesarean delivery reported by Patel et al.^{37}

Treatment Effect.
In calculating a sample size, one must specify the treatment effect of interest (ie, the difference in means or proportions, relative risk, odds ratio, or correlation) that one would like to be able to detect, were it true in the population of interest. The specified treatment effect should correspond to the primary outcome variable. One is thus assuming and specifying the Ha; statistical power by definition requires specifying a certain effect, not just that the Ho is false. It is best to base the calculation on the minimally important difference, if that is known or can be estimated.^{29}

The specified treatment effect should be expressed in the units of the outcome variable of interest, so that readers can understand it. In addition, it is helpful to give the expected effect size, such as Cohen’s d, with designation as small, medium, or large (see above section on effect size and Cohen’s d).

Variability—SD.
Researchers must specify the expected variability of the primary outcome variable to calculate sample size. The SD of a continuous outcome variable can often be gleaned from the relevant literature.^{31} Pooling several SD estimates is optimal, and any SD estimate should be increased somewhat so that the sample size calculation is conservative.

Contrary to what some researchers understand, it is not necessary to obtain SD estimates from a study that used the same intervention as the proposed study; this might be very difficult or impossible. Instead, it is only important to identify studies that enrolled a similar patient population, ideally with a control group similar to that of the proposed study.^{31}

For binary (dichotomous) outcomes, researchers need to specify the expected proportion of subjects with the outcome in each of the 2 or more groups being compared.^{23,31} This is because the variability of a proportion depends on the proportion itself. For example, researchers might specify the control group’s proportion and then hypothesize that a specific relative reduction in that value will occur in the intervention group.^{29}

Significance Level (α ) and Type I Error.
For 2-tailed tests for superiority, the significance level is typically set at 5%, thus allowing a 5% chance of making a false-positive conclusion (type I error). For the occasional 1-sided tests, such as with a noninferiority study design (see below), the standard is to use α = .025, matching the α allotted to each side of a 2-sided test with overall α = .05.^{38,39}

Multiple Comparisons and Testing.
It is very important for researchers to maintain the risk of a type I error for at least the primary study aim and hypothesis at 5% (or another prespecified level), and sample size calculations should include the a priori plans to do so. For example, with an overall α of .05 and 2 primary outcomes, when either one (but not necessarily both) is required to be statistically significant before the intervention is deemed better than control, the sample size would be calculated using a Bonferroni correction at an α of .025. However, if the intervention would only be deemed effective if found superior on both outcomes, then no adjustment of α would be needed.^{40} There are many methods to control the type I error in the presence of multiple comparisons among several groups or across multiple outcomes, including procedures by Holm-Bonferroni, Tukey, Duncan, and Dunnett, most of which are less conservative than Bonferroni.^{41,42}
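As an illustration of one such less conservative procedure, the sketch below implements Holm's step-down method (the P values are hypothetical): the smallest P value is tested at α/m, the next at α/(m − 1), and so on, stopping at the first failure:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure: controls the familywise type I error rate
    at alpha, yet is uniformly less conservative than plain Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):  # alpha/m, then alpha/(m-1), ...
            reject[i] = True
        else:
            break  # once one test fails, all larger P values also fail
    return reject

# Hypothetical P values for 3 outcomes:
print(holm_bonferroni([0.020, 0.001, 0.049]))  # [True, True, True]
```

In this example, Holm rejects all 3 hypotheses, whereas plain Bonferroni (testing everything at .05/3 ≈ .0167) would reject only the smallest P value.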

Power.
Power is typically set at 80% or 90%, although 90% is much preferred for the following reason. With 80% power, the type II error (the probability of missing a true difference or association) is 20%, but with 90% power the type II error is only 10%—a halving of the risk of a false-negative conclusion. Therefore, we strongly recommend that studies be planned for 90% power, if at all possible.

A smaller difference of interest (effect size) increases the required sample size, as does a larger SD. Smaller α and smaller β (larger power) also increase the required sample size. Figure 2 shows the relationship among power, sample size, and difference in means for a given SD.

Figure 2. Total sample size as a function of mean difference and power for a 2-sample, 2-tailed t test with α = .05 and standard deviation = 1. Since the standard deviation = 1, the mean difference can be interpreted as the number of standard deviations in the outcome that we want to be able to detect. This is equivalent to the effect size known as Cohen’s d. Sample size increases markedly as the effect size decreases from 1.0 (large) to 0.2 (small). Higher power requires higher sample size.
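The curves in Figure 2 follow from a standard closed-form approximation, sketched below (a normal approximation with 2-sided α = .05; exact t-based software such as PS reports slightly larger n, and the function name is our own):

```python
import math

def n_per_group_ttest(effect_size, power=0.90):
    """Approximate n per group for a 2-sided, 2-sample t test at alpha = .05.
    Normal approximation: exact t-based software gives slightly larger n."""
    z_alpha = 1.959964                                # z for 2-sided alpha = .05
    z_beta = {0.80: 0.841621, 0.90: 1.281552}[power]  # z for the chosen power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size**2)

# Sample size grows steeply as the standardized effect size (Cohen's d) shrinks:
for d in (1.0, 0.5, 0.2):
    print(d, n_per_group_ttest(d))  # 22, 85, and 526 per group at 90% power
```

Note that halving the effect size roughly quadruples the required n, which is the steep rise visible in Figure 2.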

The sample size for a binary (dichotomous) outcome depends heavily on its incidence in the control group. Figure 3 shows that for the same relative reduction, the control group incidence changes the required sample size markedly, and it thus should be chosen carefully and conservatively.^{29}

Figure 3. Total sample size as a function of relative risk and control group incidence (3 reference proportion lines) for a Pearson chi-square test comparing 2 proportions, assuming α = .05 and power = 0.90. Sample size increases markedly as relative risk moves from 0.70 (large effect) to 0.90 (small effect). For the same relative risk, sample size also increases considerably as the reference proportion decreases.

For example, suppose investigators plan a study to detect a 20% relative reduction in the proportion of patients with any major complication after surgery (relative risk of 0.80) between the experimental and control groups. If they expect a 30% incidence in the control group, and thus a 20% relative reduction to 24% with treatment, the required sample size per group is 1149. However, if they instead expect only a 15% incidence in the control group, the required sample size per group increases to 2726. Conversely, if the true control group incidence is only 15% but the authors had planned for 30%, they would have only 56% power to detect a relative risk of 0.80. Therefore, it is crucial to be conservative and not overestimate the proportion experiencing the event in the control group, or the study may be markedly underpowered.
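These worked numbers can be reproduced with the classical pooled-variance formula for comparing 2 proportions. The sketch below uses a normal approximation and our own function names, so dedicated software may differ by a patient or two because of rounding:

```python
import math

Z_ALPHA = 1.959964  # z for 2-sided alpha = .05
Z_BETA = 1.281552   # z for 90% power

def n_per_group_binary(p1, p2):
    """n per group for comparing 2 proportions
    (pooled-variance normal-approximation formula)."""
    p_bar = (p1 + p2) / 2
    num = (Z_ALPHA * math.sqrt(2 * p_bar * (1 - p_bar))
           + Z_BETA * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil((num / abs(p1 - p2)) ** 2)

def power_binary(p1, p2, n):
    """Power of that comparison for a given n per group (normal approximation)."""
    p_bar = (p1 + p2) / 2
    z = ((abs(p1 - p2) * math.sqrt(n)
          - Z_ALPHA * math.sqrt(2 * p_bar * (1 - p_bar)))
         / math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z

print(n_per_group_binary(0.30, 0.24))            # 1149 per group
print(n_per_group_binary(0.15, 0.12))            # ~2725 per group
print(round(power_binary(0.15, 0.12, 1149), 2))  # ~0.56, ie, only ~56% power
```

The last line shows how badly underpowered the 1149-per-group design becomes if the true control incidence is 15% rather than the planned 30%.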

Available Software.
An easy-to-use, freely available program for calculating sample size is the “PS: Power and Sample Size Calculation” program from the Vanderbilt Department of Biostatistics <http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize>.^{43,44} Interested readers are encouraged to download this program and perform calculations for various study designs. The software also generates a write-up that can be included (sometimes with minor modifications) in a research protocol and manuscript.

Power Analysis and Sample Size Justification: Poststudy
It is always preferable that the sample size for a study be planned a priori, or if “fixed” (constrained) by nonstatistical restrictions (eg, the size of an existing clinical care dataset), at least justified as being sufficient for the a priori determined study objectives.^{31} However, when this is not done a priori, authors should still include a post hoc, poststudy justification of the sample size in their statistical methods section. Even though the study is completed, and a poststudy calculation will not change the results, reporting the available power to detect a clinically important difference is key information that can help in interpreting the study findings, as well as in gauging the strength of the study itself.

Suppose that in an observational study, the researchers assess the relationship between receiving a specific intraoperative intervention (eg, red blood cell transfusion) and a postoperative outcome (eg, 30-day mortality) in all 200 available patients. Importantly, poststudy power calculations for such studies should never be based on the treatment effect that was observed during the study, because the observed effect is a random variable and may not be close to what would be considered clinically important. In this example, if the observed difference in the proportion with 30-day mortality is zero, the power to detect that difference would be very small. Instead, the calculation should be based on what is clinically important, such as a 10% reduction in mortality. This helps in interpreting results: with the sample size of 200, there would be low power to detect a 10% reduction, and a nonsignificant study finding should thus not be interpreted as a true lack of effect in the population.

Importantly, the observed effect and its CI provide the best estimate of the population effect of interest. For example, a wide CI for relative risk, such as 0.5–2.0, indicates that the true effect could range from a 50% decrease to a 2-fold increase in risk, encompassing both large positive and large negative effects. While a CI gives specific information about what is known about the true effect, the available power allows readers to gauge the strength of a study design by putting results, especially negative ones, in the proper context.

Sample Size for Single-Group Estimation Studies
Sometimes there is no need for a power calculation because no hypothesis is being tested. Researchers instead aim to estimate an underlying parameter such as the prevalence of a condition, the incidence of an event, or the change from baseline in a laboratory measurement. In estimation studies, the sample size should be planned to achieve sufficient precision when estimating the parameter of interest, gauged by the width of the (typically 95%) CI. For example, if researchers aim to estimate the incidence of myocardial infarction in a population of interest to a CI half-width of ±0.05 (±5%), they would need about 100 total patients, assuming a true incidence of 5%, and 200 total patients for a true incidence of 15%.
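The required n for a single-proportion estimation study follows directly from the CI formula. The sketch below uses the simple Wald approximation (exact binomial methods require somewhat more patients when the outcome is rare, and the incidence of 15% is an illustrative assumption):

```python
import math

def n_for_ci_halfwidth(p, halfwidth):
    """n so that the 95% Wald CI for a proportion has the given half-width.
    Exact binomial CIs need somewhat more patients when p is near 0 or 1."""
    z = 1.959964  # z for a 2-sided 95% CI
    return math.ceil(z**2 * p * (1 - p) / halfwidth**2)

# Tighter precision requirements rapidly inflate the required sample size:
for hw in (0.10, 0.05, 0.02):
    print(hw, n_for_ci_halfwidth(0.15, hw))  # 49, 196, and 1225 patients
```

Halving the desired CI half-width quadruples the required n, so the precision target should be chosen with the clinical use of the estimate in mind.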

Study Dropouts
Sample size estimation should also consider the potential for dropouts and accordingly recruit more patients.^{23,31} For example, if the sample size calculation indicates that 100 patients are needed per group, but 20% are expected to drop out before either receiving the intervention or having the outcome measured, the planned enrollment should be increased by a factor of [1 ÷ (1 − 0.20)], or 1.25, to 125 patients per group.^{31,45}

Interim Analyses and Early Closure of a Study
Clinical trials can be designed to assess efficacy or harm, and often futility (ie, no evidence of effect), at planned interim analyses during the trial, such as at each 25% of the maximum planned enrollment.^{46} The advantage of these “group sequential” designs is that the study question of whether an intervention is effective or not can be answered faster and with fewer patients compared to a study with no interim analyses.^{23,46,47} The design of a group sequential trial controls for the multiple testing over time by applying so-called α (type I error) and β (type II error) spending functions,^{48–50} making appropriate adjustments to what is considered statistically significant.^{51}

A common critique of such interim looks is that stopping a trial early because the study signal crossed an efficacy or futility boundary may result in CIs that are too wide to make a meaningful inference, even if the result is statistically significant. This can be a valid concern, and so spending functions should be set up so as not to require a trial to necessarily stop if a boundary is crossed. The data and safety monitoring board for the trial instead should weigh all evidence available to them, including, for example, safety and logistical issues, and decide if it is preferable to continue the trial in order to gain more precision in the estimation of the treatment effect, even though the P value is already significant.^{52}

Equivalence and Noninferiority Trials
Occasionally, the study goal is to rule out a meaningful association between the independent predictor variable and the dependent outcome variable.^{46} An equivalence trial tests whether a new drug has essentially the same efficacy as an established drug, so the effect of interest is very small or zero.^{31} A noninferiority trial is a 1-sided version of this study design that tests whether the new drug is either better than or at least not unacceptably worse than an established drug, where “unacceptably worse” is defined a priori by what is called the noninferiority margin or delta.^{31} The power analysis and sample size calculation for these study designs are complex,^{38,53–56} typically necessitating collaboration with a suitably experienced statistician.^{31}

CONCLUSIONS
Using exemplary data from a randomized controlled trial of postoperative analgesic drug efficacy, Andersen et al^{57} recently demonstrated that widely divergent significance levels and estimates of effect size are obtained across various data processing procedures and standard statistical methods.

Nevertheless, researchers, as well as journal editors, peer reviewers, and readers, should take into equal consideration the observed statistical significance and effect size in interpreting the findings of a study. Doing so can provide important information about the reliability and the importance of the findings of the study.^{16,19,58}

Furthermore, sample size calculation is a key aspect of study design. It enables researchers to recruit or include enough observations to have sufficient power to detect clinically important effects and to avoid elevated chances of false-positive findings. Careful and conservative sample size planning for randomized as well as nonrandomized studies will increase the reliability of anesthesia, perioperative medicine, critical care, and pain medicine research findings.

DISCLOSURES
Name: Edward J. Mascha, PhD.

Contribution: This author helped write and revise the manuscript.

Name: Thomas R. Vetter, MD, MPH.

Contribution: This author helped write and revise the manuscript.

This manuscript was handled by: Jean-Francois Pittet, MD.

REFERENCES
1. Vetter TR, Mascha EJ. Unadjusted bivariate two-group comparisons—when simpler is better. Anesth Analg. 2018;126:338–342.

2. Vetter TR, Mascha EJ. In the beginning—there is the introduction—and your study hypothesis. Anesth Analg. 2017;124:1709–1711.

3. Matz DC, Hause EL. “Dealing” with the central limit theorem. Teach Psychol. 2008;35:198–200.

4. Rumsey DJ. Sampling distributions and the central limit theorem. In: Statistics for Dummies. 2016:2nd ed. Hoboken, NJ: John Wiley & Sons171–186.

5. Motulsky H. The Gaussian distribution. In: Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. 2014:New York, NY: Oxford University Press85–89.

6. Field A. Everything you never wanted to know about statistics. In: Discovering Statistics Using IBM SPSS Statistics: And Sex and Drugs and Rock ‘n’ Roll. 2013:Los Angeles, CA: Sage40–88.

7. Wilcox RR. Introduction. In: Introduction to Robust Estimation and Hypothesis Testing. 2017:Amsterdam, the Netherlands: Elsevier/Academic Press1–24.

8. Field A. The beast of bias. In: Discovering Statistics Using IBM SPSS Statistics: And Sex and Drugs and Rock ‘n’ Roll. 2013:Los Angeles, CA: Sage163–212.

9. Vetter TR. Fundamentals of research data and variables: the devil is in the details. Anesth Analg. 2017;125:1375–1380.

10. Motulsky H. Comparing proportions. In: Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. 2014:New York, NY: Oxford University Press233–241.

11. Motulsky H. Outliers. In: Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. 2014:New York, NY: Oxford University Press209–215.

12. Manikandan S. Data transformation. J Pharmacol Pharmacother. 2010;1:126–127.

13. Motulsky H. Statistical significance and hypothesis testing. In: Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. 2014:New York, NY: Oxford University Press137–148.

14. Salkind NJ. Significantly significant: what it means for you and me. In: Statistics for People Who (Think They) Hate Statistics. 2016:6th ed. Thousand Oaks, CA: Sage Publications177–196.

15. Hankins MC. Still not significant. Probable Error: I don’t mean to sound critical, but I am; so that’s how it comes across. 2013Available at:

https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/ . Accessed October 19, 2017.

16. Urdan TC. Statistical significance and effect size. In: Statistics in Plain English. 2017:4th ed. New York, NY: Routledge, Taylor & Francis Group73–91.

17. The B. Significance testing—are we ready yet to abandon its use? Curr Med Res Opin. 2011;27:2087–2090.

18. Salkind NJ. Only the lonely: the one-sample z-test. In: Statistics for People Who (Think They) Hate Statistics. 2016:6th ed. Thousand Oaks, CA: Sage Publications197–210.

19. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev Camb Philos Soc. 2007;82:591–605.

20. Berben L, Sereika SM, Engberg S. Effect size estimation: methods and examples. Int J Nurs Stud. 2012;49:1039–1047.

21. Coe R. It’s the effect size, stupid: what effect size is and why it is important. Paper presented at: Annual Conference of the British Educational Research Association; September 12–14, 2002; University of Exeter, England.

22. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.

23. Motulsky H. Choosing a sample size. In: Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. New York, NY: Oxford University Press; 2014:216–229.

24. Howell DC. Confidence intervals on effect size. 2011. Available at: https://www.uvm.edu/~dhowell/. Accessed October 28, 2017.

25. Guay J, Suresh S, Kopp S. The use of ultrasound guidance for perioperative neuraxial and peripheral nerve blocks in children: a Cochrane review. Anesth Analg. 2017;124:948–958.

26. Guay J, Parker MJ, Griffiths R, Kopp SL. Peripheral nerve blocks for hip fractures: a Cochrane review. Anesth Analg. 2017 October 4 [Epub ahead of print].

27. Mendenhall W, Beaver RJ, Beaver BM. Large-sample tests of hypotheses. In: Introduction to Probability and Statistics. Boston, MA: CL-Wadsworth; 2013:324–363.

28. Cohen J. A power primer. Psychol Bull. 1992;112:155–159.

29. Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB. Sample size. In: Fundamentals of Clinical Trials. 5th ed. New York, NY: Springer-Verlag; 2015:165–200.

30. Lachin JM. Introduction to sample size determination and power analysis for clinical trials. Control Clin Trials. 1981;2:93–113.

31. Browner WS, Newman TB, Hulley SB. Estimating sample size and power: applications and examples. In: Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB, eds. Designing Clinical Research. 4th ed. Philadelphia, PA: Wolters Kluwer Health/Lippincott Williams & Wilkins; 2013:55–83.

32. Breau RH, Carnat TA, Gaboury I. Inadequate statistical power of negative clinical trials in urological literature. J Urol. 2006;176:263–266.

33. Chan AW, Altman DG. Epidemiology and reporting of randomised trials published in PubMed journals. Lancet. 2005;365:1159–1162.

34. Freiman J, Chalmers TC, Smith H, Kuebler R. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. N Engl J Med. 1978;299:690–694.

35. Halpern SD, Karlawish JH, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002;288:358–362.

36. Philip J, Sharma SK, Sparks TJ, Reisch JS. Randomized controlled trial of the clinical efficacy of multiport versus uniport wire-reinforced flexible catheters for labor epidural analgesia. Anesth Analg. 2018;126:537–544.

37. Patel R, Carvalho JC, Downey K, Kanczuk M, Bernstein P, Siddiqui N. Intraperitoneal instillation of lidocaine improves postoperative analgesia at cesarean delivery: a randomized, double-blind, placebo-controlled trial. Anesth Analg. 2017;124:554–559.

38. Mascha EJ, Sessler DI. Equivalence and noninferiority testing in regression models and repeated-measures designs. Anesth Analg. 2011;112:678–687.

39. United States Food and Drug Administration: Center for Drug Evaluation and Research (CDER); Center for Biologics Evaluation and Research (CBER). The non-inferiority margin. In: Guidance for Industry: Non-Inferiority Clinical Trials. 2016:8–11. Available at: https://www.fda.gov/downloads/Drugs/Guidances/UCM202140.pdf. Accessed November 1, 2017.

40. Mascha EJ, Turan A. Joint hypothesis testing and gatekeeping procedures for studies with multiple endpoints. Anesth Analg. 2012;114:1304–1317.

41. Aickin M, Gensler H. Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am J Public Health. 1996;86:726–728.

42. Miller RG. Simultaneous Statistical Inference. 2nd ed. New York, NY: Springer-Verlag; 1981.

43. Dupont WD, Plummer WD Jr. Power and sample size calculations. A review and computer program. Control Clin Trials. 1990;11:116–128.

44. Dupont WD, Plummer WD. PS: Power and Sample Size Calculation Version 3.1.2. 2014. Available at: http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize. Accessed November 1, 2017.

45. Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB. Participant adherence. In: Fundamentals of Clinical Trials. 5th ed. New York, NY: Springer-Verlag; 2015:297–318.

46. Grady DG, Cummings SR, Hulley SB. Alternative clinical trial designs and implementation issues. In: Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB, eds. Designing Clinical Research. 4th ed. Philadelphia, PA: Wolters Kluwer Health/Lippincott Williams & Wilkins; 2013:151–166.

47. Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB. Statistical methods used in interim monitoring. In: Fundamentals of Clinical Trials. 5th ed. New York, NY: Springer-Verlag; 2015:373–401.

48. DeMets DL, Lan G. The alpha spending function approach to interim data analyses. Cancer Treat Res. 1995;75:1–27.

49. DeMets DL, Lan KK. Interim analysis: the alpha spending function approach. Stat Med. 1994;13:1341–1352.

50. O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics. 1979;35:549–556.

51. Grady DG, Cummings SR, Hulley SB. Interim monitoring of trial outcomes and early stopping. In: Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB, eds. Designing Clinical Research. 4th ed. Philadelphia, PA: Wolters Kluwer Health/Lippincott Williams & Wilkins; 2013:168–170.

52. Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB. Monitoring committee structure and function. In: Fundamentals of Clinical Trials. 5th ed. New York, NY: Springer-Verlag; 2015:343–372.

53. Guo JH, Chen HJ, Luh WM. Sample size planning with the cost constraint for testing superiority and equivalence of two independent groups. Br J Math Stat Psychol. 2011;64:439–461.

54. Zhang P. A simple formula for sample size calculation in equivalence studies. J Biopharm Stat. 2003;13:529–538.

55. Stucke K, Kieser M. A general approach for sample size calculation for the three-arm ‘gold standard’ non-inferiority design. Stat Med. 2012;31:3579–3596.

56. Julious SA, Owen RJ. A comparison of methods for sample size estimation for non-inferiority studies with binary outcomes. Stat Methods Med Res. 2011;20:595–612.

57. Andersen LPK, Gögenur I, Torup H, Rosenberg J, Werner MU. Assessment of postoperative analgesic drug efficacy: method of data analysis is critical. Anesth Analg. 2017;125:1008–1013.

58. Kelley K, Preacher KJ. On effect size. Psychol Methods. 2012;17:137–152.