The Importance of A Priori Sample Size Estimation in Strength and Conditioning Research

Beck, Travis W.

Journal of Strength and Conditioning Research: August 2013 - Volume 27 - Issue 8 - p 2323–2337
doi: 10.1519/JSC.0b013e318278eea0
Technical Report

Beck, TW. The importance of a priori sample size estimation in strength and conditioning research. J Strength Cond Res 27(8): 2323–2337, 2013—The statistical power, or sensitivity of an experiment, is defined as the probability of rejecting a false null hypothesis. Only 3 factors can affect statistical power: (a) the significance level (α), (b) the magnitude or size of the treatment effect (effect size), and (c) the sample size (n). Of these 3 factors, only the sample size can be manipulated by the investigator because the significance level is usually selected before the study, and the effect size is determined by the effectiveness of the treatment. Thus, selection of an appropriate sample size is one of the most important components of research design but is often misunderstood by beginning researchers. The purpose of this tutorial is to describe procedures for estimating sample size for a variety of different experimental designs that are common in strength and conditioning research. Emphasis is placed on selecting an appropriate effect size because this step fully determines sample size when power and the significance level are fixed. There are many different software packages that can be used for sample size estimation. However, I chose to describe the procedures for the G*Power software package (version 3.1.4) because this software is freely downloadable and capable of estimating sample size for many of the different statistical tests used in strength and conditioning research. Furthermore, G*Power provides a number of different auxiliary features that can be useful for researchers when designing studies. It is my hope that the procedures described in this article will be beneficial for researchers in the field of strength and conditioning.

Department of Health and Exercise Science, University of Oklahoma, Norman, Oklahoma

Address correspondence to: Travis W. Beck, tbeck@ou.edu.

Introduction

Statistical power and sample size calculations are an important component of experimental design and are required for all research studies submitted to the Journal of Strength and Conditioning Research. It is quite common for graduate students and beginning researchers to ask the question: How many subjects do I need to sample to achieve an appropriate level of statistical power in my study? As one might expect, the answer to this question is not simple. Strength and conditioning research often uses a variety of different statistical tests, each of which requires different procedures for estimating sample size. Thus, the notion that a single procedure can be used for all power and sample size calculations is false. Many excellent reviews and textbooks have been written on statistical power (1,2,4,10,12,13), and the interested reader is encouraged to consult these resources for detailed theoretical information and a comprehensive list of power and sample size calculations. The purpose of this tutorial is not to describe all of these calculations but rather to provide some basic theoretical information on statistical power and sample size estimation, followed by example calculations for some experimental designs that are common in strength and conditioning research. It is my hope that this information will be useful for strength and conditioning researchers during the initial stages of study design.

Power Theory

The statistical power, or sensitivity, of an experiment is the probability of detecting a treatment effect when that effect actually exists. Stated differently, statistical power is your ability as a researcher to reject the null hypothesis when it is false. Researchers often think of statistical power in relation to the Type II error rate (β). Because β is the probability of failing to reject a null hypothesis that should have been rejected, power is 1 − β. For example, an experiment with a β of 0.20 has a statistical power of 0.80, or an 80% chance of detecting a treatment effect that is real. Only 3 factors can affect power: (a) the significance level (α), (b) the magnitude or size of the treatment effect (effect size), and (c) the sample size (n) (9). Of these 3 factors, only sample size can be manipulated by the researcher because the significance level is usually fixed (e.g., at 0.05 or 0.10), and the effect size is determined by the effectiveness of the treatment. In addition, for a given power level, identification of any 2 of these factors fully determines the third. Thus, it is quite common (and important) for researchers to use the significance level and anticipated effect size to estimate the sample size required to achieve a given power level. This practice is typically done before the study as part of research planning and is referred to as a priori power analysis (7). The a priori power analysis should not be confused with a post hoc power analysis, which is done after the statistical test has been performed and often provides an “observed power” value. A post hoc power analysis uses the effect size produced by the treatment, the significance level, and the sample size to calculate the actual power of the test after it has been performed. However, observed power assumes that the effect size for the sample is an accurate estimate of the effect size in the population. This is not always a safe assumption (16), and observed power does nothing to help with estimating sample size before the study, which is the focus of this article. For these reasons, I will focus exclusively on a priori power analysis. The interested reader is encouraged to read Faul et al. (6,7) for a review of the different power analyses.
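
The way these factors interact can be illustrated with a short calculation. The following is a minimal sketch, not the method used by G*Power: it assumes a two-tailed comparison of 2 equal-sized independent groups and replaces the t distribution with the normal distribution, so it reads slightly high for small samples. The function name and the sample sizes in the loop are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power for a two-tailed comparison of 2 independent means.

    Assumption: normal approximation to the t distribution.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # critical value set by alpha
    shift = d * sqrt(n_per_group / 2)      # grows with effect size and n
    # Probability of exceeding either critical value when the effect is real.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Power rises with sample size when alpha and effect size are held fixed.
for n in (20, 40, 64, 100):
    print(n, round(approx_power_two_groups(0.5, n), 3))
```

Raising α, the effect size, or the sample size each increases the returned power, which is the relationship described above.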

G*Power

G*Power is a statistical power analysis software program that was first developed by Erdfelder et al. (5). Since this first version, 2 revisions (G*Power 2 and G*Power 3) have been released (6,7). The program is compatible with both Macintosh and Windows-based computers and is capable of performing a variety of power and sample size calculations. Perhaps the most attractive feature for many researchers, however, is the fact that the program is freely downloadable. It should also be emphasized that G*Power is not the only power analysis software available for researchers. In fact, Goldstein (8) provided an excellent review of 13 power analysis programs that were available in the late 1980s, and it is likely that many more programs are available today. Many of these programs are similar, with slight differences in compatibility and ease of use, but the basic principles of how they estimate power and sample size are the same. Thus, researchers are welcome to apply the principles described in this article to other software programs.

The Importance of Estimating Effect Size

The first step in estimating sample size with an a priori power analysis involves estimating the effect size that the researcher expects to see in the experiment. As stated previously, the significance level and power are usually fixed (e.g., at 0.05 and 0.80, respectively). Thus, the only factor that can affect sample size is the effect size, with large effect sizes resulting in smaller required sample sizes (for a given power and significance level), and vice versa. Selection of an appropriate effect size is therefore essential because it fully determines sample size when the power and significance level are fixed. However, this also confronts the researcher with a very important question: How big must an effect size be for it to be considered meaningful? For some research questions, the answer may be relatively simple, whereas for others, it can be considerably more difficult. Consider an example where an 8-week training program caused an increase in average bench press strength of 20 lbs. Most strength and conditioning researchers are familiar with bench press strength and would agree that this 20-lb improvement is meaningful. But what about a somewhat less intuitive variable such as plasma hormone concentration? In many cases, collection of pilot data and/or literature review is necessary to identify the effect sizes typically observed for a given dependent variable. However, what if a researcher is attempting to measure a variable that is not commonly used in the literature, or even a completely new variable? And what if it is too time consuming or expensive to collect pilot data? How can the researcher estimate meaningful change in these situations? An interesting feature of effect sizes is that they are often standardized and classified (somewhat arbitrarily) into “small,” “medium,” and “large” ranges. The researcher can then select which range they feel is most appropriate for their experiment and estimate the sample size necessary for their designated power level. Although this strategy seems very attractive, it does have drawbacks (11). Most notable is the fact that what is considered a “large” effect size can vary substantially among different scientific disciplines. Thus, the most common effect size ranges, which were originally designated by Cohen (1) for the behavioral and social sciences, should not be blindly generalized to other fields. A more in-depth discussion of effect size ranges for strength and conditioning research will be provided later, along with an argument against using them for sample size estimation. They do, however, provide the researcher with a last option if literature review and collection of pilot data are not feasible. I will now describe some of the effect sizes used by G*Power.

Effect Sizes for Comparing Means

The effect size used by G*Power for independent-samples and dependent-samples t-tests is d (1):

$$d = \frac{\mu_1 - \mu_2}{\sigma}$$

where $\mu_1$ and $\mu_2$ are the group means and $\sigma$ is the pooled standard deviation. Cohen's (1) ranges of 0.2, 0.5, and 0.8 are used to define small, medium, and large d values in G*Power. The effect size used by G*Power for all analyses of variance (ANOVAs) is f:

$$f = \frac{\sigma_\mu}{\sigma}$$

where $\sigma_\mu$ reflects variability due to the treatment, and $\sigma$ is the pooled standard deviation. The small, medium, and large ranges used by G*Power for Cohen's f are 0.10, 0.25, and 0.40, respectively (1). Alternatively, f can also be calculated with the following formula:

$$f = \frac{\sqrt{\sum_{j=1}^{k} (\mu_j - \mu)^2 / k}}{\sigma_{\text{error}}}$$

where $\mu_j$ is the population mean for an individual group, $\mu$ is the overall mean, k is the number of groups, and $\sigma_{\text{error}}$ is the within-group standard deviation. This formula is much more intuitive than the previous one because researchers can generally estimate group means and the standard deviation more effectively than the treatment variability. For this reason, this formula will be used to estimate f for all of the ANOVA examples.
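
Both effect sizes can be computed by hand, but a short script makes it easier to try several sets of estimates. The following is a minimal sketch that assumes equal group sizes (so that the overall mean is the simple average of the group means) and reports d as an absolute value. The function names are hypothetical, and the inputs are the estimates used later in Examples 1 and 3.

```python
from math import sqrt


def cohens_d(mean_1, mean_2, pooled_sd):
    """Standardized difference between 2 means (reported as a magnitude)."""
    return abs(mean_1 - mean_2) / pooled_sd


def cohens_f(group_means, sd_error):
    """Cohen's f from group means and the within-group standard deviation.

    Assumption: equal group sizes.
    """
    k = len(group_means)
    overall_mean = sum(group_means) / k
    # Standard deviation of the group means around the overall mean.
    sd_means = sqrt(sum((m - overall_mean) ** 2 for m in group_means) / k)
    return sd_means / sd_error


d = cohens_d(500, 525, 50)            # back squat estimates from Example 1
f = cohens_f([325, 350, 365], 35)     # bench press estimates from Example 3
print(f"d = {d:.3f}, f = {f:.3f}")    # d = 0.500, f = 0.471
```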

Effect Sizes for Regression

Unlike t-tests and ANOVAs, G*Power does not require the researcher to specify an effect size for bivariate linear regression. Instead, the slope for the alternative hypothesis must be selected, in addition to the slope for the null hypothesis (which is usually 0) and the standard deviations of the independent and dependent variables. The effect size used by G*Power for multiple linear regression is f²:

$$f^2 = \frac{R^2_{Y \cdot A}}{1 - R^2_{Y \cdot A}}$$

where $R^2_{Y \cdot A}$ is the proportion of the variance explained by the independent variables. The small, medium, and large ranges used by G*Power for f² are 0.02, 0.15, and 0.35, respectively.

Estimating Effect Sizes

Obviously, when researchers are conducting an a priori power analysis, they do not know exactly how much of a treatment effect to expect, nor do they know what the pooled standard deviation will be. As a result, it is often very difficult to make an educated guess on what effect size to use for the sample size estimation. If at all possible, researchers should strongly consider collecting pilot data to make the effect size estimate because this technique takes into account factors that are unique to the laboratory, such as instrumentation, personnel, and environmental factors. This strategy is also beneficial when examining new variables. However, if collection of pilot data is not possible, the next best option is to use relevant literature to make an educated guess on the magnitude of the treatment effect and pooled variance. An estimate of the effect size can then be made from these values. A final option is to use standardized effect size ranges. As discussed previously, these ranges can vary based not only on the effect size that is being used but also on the research area to which they are being applied. For example, the values of 0.2, 0.5, and 0.8 for small, medium, and large effect sizes, respectively, as described by Cohen for research in the behavioral and social sciences (1), are lower than those discussed by Rhea (15) for applications in strength and conditioning studies. For highly trained individuals, trivial, small, moderate, and large effect sizes were defined as <0.25, 0.25–0.50, 0.50–1.0, and >1.0, respectively, with even higher values for recreationally trained and untrained individuals (15). Although this article discourages the use of standardized effect sizes for sample size estimation, they can be implemented as a last resort, and it is certainly better to use them than to perform no sample size estimation at all.

Additional Considerations

The above discussion illustrates the inverse relationship between effect size and sample size for a given power level. Thus, a large estimated effect size is beneficial for researchers because it reduces the sample size needed to achieve appropriate power. There are 2 ways to maximize effect size: (a) increase the magnitude of the treatment effect and (b) decrease the error variability in the data. Although the first option cannot be easily manipulated by the researcher, he/she does have control over the heterogeneity of the sample used in the study. Specifically, use of a more homogeneous sample reduces the error variability, thereby increasing the effect size. This strategy is useful in situations where researchers are concerned about having a potentially small effect size and/or have a limited ability to achieve an appropriate sample size. Another strategy that can be used to increase the power of an experiment is the use of an appropriate covariate in the analysis of variance (i.e., ANCOVA) (17). The ANCOVA is particularly useful in the pretest-posttest control group experimental design because it corrects for pretest group differences in the dependent variable. In essence, ANCOVA is a method that reduces error variance with statistical techniques, rather than direct manipulation of the subject pool. This increases the sensitivity (i.e., power) of the experiment for a given sample size, or, alternatively, reduces the sample size needed to achieve a given level of power.

It is also important to point out the fact that, just like selecting the alpha level of 0.05, setting the power level at 0.80 is a somewhat arbitrary decision. Recall that a power level of 0.80 means that the researcher is willing to make a Type II error (i.e., fail to reject a false null hypothesis) 20% of the time. Thus, why not reduce this risk by using a power level of, for example, 0.90? The obvious argument is that this strategy requires a greater sample size (assuming equivalent effect sizes). For studies in which time and resources are not a major concern, it is beneficial to use this strategy. However, researchers should carefully consider this trade-off because the required sample size grows faster than power does: a modest gain in power from 0.80 to 0.90 can require roughly one-third more subjects. It is strongly recommended that separate sample size estimations be done for power levels of 0.80 and 0.90 such that the researcher can see the trade-off between sample size and power. An example of this trade-off will be demonstrated below with the sample size estimation for an independent-samples t-test.
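
The size of this trade-off can be approximated by hand. As a rough sketch, and as an assumption that is not part of the G*Power procedure, the required sample size for a two-tailed comparison of 2 means is approximately proportional to $(z_{1-\alpha/2} + z_{1-\beta})^2$, where z denotes a quantile of the standard normal distribution. With α = 0.05, moving from a power of 0.80 to 0.90 therefore multiplies the sample size by about

$$\frac{n_{0.90}}{n_{0.80}} \approx \frac{(1.960 + 1.282)^2}{(1.960 + 0.842)^2} = \frac{10.51}{7.85} \approx 1.34$$

regardless of the effect size. This agrees with the G*Power results in Example 1, where the total sample size increases from 128 to 172 subjects (172/128 ≈ 1.34).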

Examples

Here I will provide some examples of the procedures used to estimate sample size with G*Power (version 3.1.4) for some common designs in strength and conditioning research.

Example 1: Independent-Samples t-Test

A strength and conditioning coach is interested in comparing the lower body strength levels of offensive vs. defensive linemen who compete in American football. He/she plans on using the 1 repetition maximum (1RM) back squat as the indicator of lower body strength but is unsure about how many subjects are needed. Refer to Figure 1 as you go through the following steps.

  • Step 1: In the “Test family” panel, select “t tests.”
  • Step 2: In the “Statistical test” panel, select “Means: Difference between two independent means (two groups).”
  • Step 3: In the “Type of power analysis” panel, select “A priori: Compute required sample size – given α, power, and effect size.”
  • Step 4: In the “Input Parameters” panel, select “Two” for the number of “Tail(s),” 0.05 for the “α err prob,” 0.80 for the “Power (1-β err prob),” and 1 for the “Allocation ratio N2/N1.” Select “Determine,” which will open up a window where the difference between the 2 means can be estimated, along with their standard deviations. In the “n1 = n2” panel, type “500” for “Mean group 1” and “525” for “Mean group 2.” These are the estimated mean 1RM back squat values (in pounds) for the 2 groups. Type “50” for “SD σ group 1” and “50” for “SD σ group 2.” These are the estimated standard deviations of the 1RM back squat values (in pounds) for the 2 groups. Select “Calculate” and then “Calculate and transfer to main window.”
  • Step 5: Select “Calculate” in the main window, and observe the “Total sample size,” “Sample size group 1,” and “Sample size group 2” fields. Figure 1 shows that with an estimated mean difference of 25 lbs and estimated standard deviations of 50 lbs, Cohen's d = 0.5, and 128 total subjects (64 in each group) are needed to achieve a power of 0.80. To perform the sample size estimation for a power level of 0.90, leave all fields the same, except change the “Power (1-β err prob)” to 0.90. Once again, select “Calculate” and observe the “Total sample size” of 172 subjects (86 in each group) required for a power level of 0.90.
Figure 1
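
The G*Power result can be sanity-checked with a quick hand calculation. The following is a minimal sketch that assumes the normal approximation n per group ≈ 2(z₁₋α/₂ + z₁₋β)²/d², rather than the noncentral t distribution that an exact calculation requires. Because of that simplification, it returns slightly smaller values than G*Power. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def approx_n_per_group(d, power=0.80, alpha=0.05):
    """Approximate subjects per group for a two-tailed, two-group comparison.

    Assumption: normal approximation, equal group sizes.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


d = abs(525 - 500) / 50     # Example 1 estimates, giving d = 0.5
for power in (0.80, 0.90):
    print(power, approx_n_per_group(d, power))
```

The approximation gives 63 and 85 subjects per group for power levels of 0.80 and 0.90, respectively, which is 1 fewer per group than the 64 and 86 reported by G*Power.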

Example 2: Dependent-Samples t-Test

A strength and conditioning coach would like to examine the acute effects of stretching on 40-yd dash time in American Football players. He/she develops a stretching protocol and would like to measure the same group of players before and after the stretching. How many subjects are needed to achieve a power of 0.80? Refer to Figure 2 as you go through the following steps.

  • Step 1: In the “Test family” panel, select “t tests.”
  • Step 2: In the “Statistical test” panel, select “Means: Difference between two dependent means (matched pairs).”
  • Step 3: In the “Type of power analysis” panel, select “A priori: Compute required sample size – given α, power, and effect size.”
  • Step 4: In the “Input Parameters” panel, select “Two” for the number of “Tail(s),” 0.05 for the “α err prob,” and 0.80 for the “Power (1-β err prob).” Select “Determine,” which will open up a window where the difference between the 2 means can be estimated, along with their standard deviations. In the “From group parameters” panel, type “4.7” for “Mean group 1” and “4.6” for “Mean group 2.” These are the estimated mean 40-yd dash times (in seconds) before and after stretching, which G*Power labels as 2 groups. Type “0.2” for “SD group 1” and “0.2” for “SD group 2.” These are the estimated standard deviations of the 40-yd dash times (in seconds) for the 2 testing occasions. Use “0.5” for the “Correlation between groups.” Select “Calculate” and then “Calculate and transfer to main window.”
  • Step 5: Select “Calculate” in the main window and observe the “Total sample size” field. Figure 2 shows that with an estimated mean difference of 0.1 seconds and estimated standard deviations of 0.2 seconds, the effect size (dz) = 0.5, and 34 subjects are needed to achieve a power of 0.80.
Figure 2
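
The effect size dz in Step 5 depends on the correlation entered in Step 4 as well as the means and standard deviations. The following is a minimal sketch of that relationship, assuming the standard formula for the standard deviation of a difference between 2 correlated scores. The function name and the alternative correlations of 0.3 and 0.8 are hypothetical values chosen for illustration.

```python
from math import sqrt


def cohens_dz(mean_1, mean_2, sd_1, sd_2, r):
    """Effect size for matched pairs: mean difference / SD of the differences."""
    sd_difference = sqrt(sd_1 ** 2 + sd_2 ** 2 - 2 * r * sd_1 * sd_2)
    return abs(mean_1 - mean_2) / sd_difference


# A higher pre-post correlation shrinks the SD of the differences,
# which raises dz and lowers the required sample size.
for r in (0.3, 0.5, 0.8):
    print(r, round(cohens_dz(4.7, 4.6, 0.2, 0.2, r), 3))
```

With the correlation of 0.5 used in this example, the function returns the dz of 0.5 shown in Figure 2.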

Example 3: 1-Way Between-Subjects ANOVA

A strength and conditioning coach would like to compare the upper body strength levels of running backs, linebackers, and tight ends that compete in American Football. He/she has decided to use the bench press 1RM as the measure of upper body strength but is unsure how large the sample size must be. Refer to Figure 3 as you go through the following steps.

  • Step 1: In the “Test family” panel, select “F tests.”
  • Step 2: In the “Statistical test” panel, select “ANOVA: Fixed effects, omnibus, one-way.”
  • Step 3: In the “Type of power analysis” panel, select “A priori: Compute required sample size – given α, power and effect size.”
  • Step 4: In the “Input Parameters” panel, select 0.05 for the “α err prob,” 0.80 for the “Power (1-β err prob),” and 3 for the “Number of groups.”
  • Step 5: Select the “Determine =>” button, which will open up a new panel that allows the user to estimate the effect size from the means or from the variance. We will estimate it from the means because these values are more commonly reported in the literature. Select 325, 350, and 365 as the mean values for groups 1, 2, and 3, respectively. These correspond to the estimated bench press 1RM values (in pounds) for the running backs, linebackers, and tight ends. Select 35 as the “SD σ within each group.” This is the estimated standard deviation in bench press 1RM for each of the groups. Select 20 in the box just to the right of the “Equal n” button and select “Equal n.” Finally, select “Calculate” and “Calculate and transfer to main window.”
  • Step 6: Select the “Calculate” button and observe the “Total sample size” of 48 subjects required to achieve a power level of 0.80. Thus, 16 subjects (48 subjects/3 groups) are needed in each group.
Figure 3
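
One way to check a sample size estimate like this one is to simulate the planned experiment many times and count how often the ANOVA is significant. The following is a minimal sketch of that idea, not the calculation performed by G*Power. It assumes normally distributed scores with the means and standard deviation from Step 5 and estimates the 5% critical F value empirically from simulated experiments in which the 3 group means are equal. The function names, the seed, and the number of repetitions are arbitrary choices.

```python
import random


def f_statistic(groups):
    """F ratio for a 1-way between-subjects ANOVA on a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def simulate_f(means, sd, n_per_group, reps, rng):
    """F ratios from repeated simulated experiments with normal scores."""
    results = []
    for _ in range(reps):
        groups = [[rng.gauss(m, sd) for _ in range(n_per_group)] for m in means]
        results.append(f_statistic(groups))
    return results


rng = random.Random(2013)   # arbitrary seed so the run is repeatable
reps = 4000
means, sd, n_per_group = [325, 350, 365], 35, 16

# Empirical critical value: 95th percentile of F when all means are equal.
null_f = sorted(simulate_f([350, 350, 350], sd, n_per_group, reps, rng))
f_crit = null_f[int(0.95 * reps)]

# Power: proportion of simulated experiments that exceed the critical value.
alt_f = simulate_f(means, sd, n_per_group, reps, rng)
power = sum(value > f_crit for value in alt_f) / reps
print(f"critical F ~ {f_crit:.2f}, simulated power ~ {power:.2f}")
```

With 16 subjects per group, the simulated power should land close to the target of 0.80, with some run-to-run variation because both the critical value and the power are estimated from random samples.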

Example 4: 1-Way Within-Subjects ANOVA

A strength and conditioning coach would like to examine changes in vertical jump height during the season for male collegiate basketball players. He/she schedules 3 separate testing sessions: preseason, midseason, and postseason but is unsure how many subjects are needed to achieve a power level of 0.80. Refer to Figure 4 as you go through the following steps.

  • Step 1: In the “Test family” panel, select “F tests.”
  • Step 2: In the “Statistical test” panel, select “ANOVA: Repeated measures, within factors.”
  • Step 3: In the “Type of power analysis” panel, select “A priori: Compute required sample size – given α, power, and effect size.”
  • Step 4: In the “Input Parameters” panel, select 0.05 for the “α err prob,” 0.80 for the “Power (1-β err prob),” 1 for the “Number of groups,” and 3 for the “Number of Measurements.”
  • Step 5: The researcher must then provide an estimate of the “Effect size f.” Let us estimate the average vertical jump heights to be 34, 32, and 31 inches for the pre-, mid-, and postseason testing times, respectively. Let us also assume that the within-group standard deviation is 5 inches. Using the previously described formula for estimating f results in f = 0.25.
  • Step 6: The researcher must also estimate the correlation among the repeated measurements and the nonsphericity correction factor. For this example, we will use 0.5 as the correlation among repeated measures, and 1 as the nonsphericity correction. A nonsphericity correction of 1 assumes sphericity.
  • Step 7: Select the “Calculate” button and observe the “Total sample size” required. For this example with f = 0.25, the total sample size needed to achieve a power level of 0.80 is 28 subjects. The researcher should manipulate the estimated effect size, correlation among the repeated measures, and nonsphericity correction factor to examine their effects on the sample size estimate.
Figure 4
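
For readers who would like to verify the effect size in Step 5, the arithmetic can be written out as follows. The calculation divides the standard deviation of the 3 estimated means around their overall mean by the within-group standard deviation, with values rounded at each step:

$$\mu = \frac{34 + 32 + 31}{3} \approx 32.33, \qquad f = \frac{\sqrt{\left[(1.67)^2 + (-0.33)^2 + (-1.33)^2\right]/3}}{5} = \frac{\sqrt{4.67/3}}{5} \approx \frac{1.25}{5} \approx 0.25$$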

Example 5: 2-Way Within-Subjects ANOVA

A strength and conditioning researcher is interested in examining cross education during a unilateral training program for the leg extensors. He/she plans on measuring isokinetic leg extension strength for both the trained and untrained limbs at the beginning of the unilateral training program, after 4 weeks of training, and again after 8 weeks of training. Thus, the study has 2 separate within-subjects factors: limb (trained vs. untrained) and time (pretraining vs. midtraining vs. posttraining). The researcher is not sure how many subjects are needed to achieve an appropriate level of statistical power. Refer to Figure 5 as you go through the following steps.

  • Step 1: In the “Test family” panel, select “F tests.”
  • Step 2: In the “Statistical test” panel, select “ANOVA: Repeated measures, within factors.”
  • Step 3: In the “Type of power analysis” panel, select “A priori: Compute required sample size – given α, power, and effect size.” With multiple factor ANOVAs, the researcher must do a sample size estimate for the interaction, as well as both main effects. G*Power does not do a sample size estimate specifically for the within-within interaction. However, when 1 of the 2 within-subjects factors has only 2 levels (as limb does here), the F ratio for the interaction is equivalent to that for a 1-way ANOVA on difference scores. Thus, if the researcher estimated the differences between the trained and untrained limbs at the pretraining, midtraining, and posttraining time points and then performed a 1-way ANOVA across time on the difference scores, the resulting F ratio would equal that for the interaction. To estimate the sample size necessary to achieve a power of 0.80 for this interaction, the researcher must therefore estimate the magnitude of the difference scores.
  • Step 4: Let us estimate the mean isokinetic leg extension strength values for the trained limb as 200, 210, and 220 ft-lbs at the pre-, mid-, and posttraining time points, respectively. The corresponding values for the untrained limb are 200, 205, and 210 ft-lbs, respectively. This results in difference scores of 0, 5, and 10 ft-lbs, for the pre-, mid-, and posttraining time points, respectively. Also assume that the standard deviation of the difference scores is 20 ft-lbs. The procedure for estimating sample size is now the same as that for a 1-way within subjects ANOVA, except for the use of difference scores, rather than mean values. Using the formula for calculating f and the data described above, we come up with f = 0.204.
  • Step 5: In the “Input Parameters” panel, select 0.204 for the “Effect size f,” 0.05 for the “α err prob,” 0.80 for the “Power (1-β err prob),” 1 for the “Number of groups,” 3 for the “Number of measurements,” 0.5 for the “Corr among rep measures,” and 1 for the “Nonsphericity correction ε.”
  • Step 6: Select “Calculate” and observe the “Total sample size.” For this example, 41 subjects are needed to achieve a power level of 0.80 for the interaction. These same procedures are also used to estimate sample size for the main effects, with the only difference being that the estimated mean values should be used, rather than the difference scores.
Figure 5
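
The difference-score calculation described in the steps above can be sketched in a few lines. This is a minimal illustration that uses the estimates from this example; the variable names are hypothetical.

```python
from math import sqrt

trained = [200, 210, 220]      # ft-lbs at pre-, mid-, and posttraining
untrained = [200, 205, 210]
sd_difference = 20             # estimated SD of the difference scores

# Trained minus untrained limb at each time point.
differences = [t - u for t, u in zip(trained, untrained)]
mean_difference = sum(differences) / len(differences)

# SD of the difference scores across time, divided by the estimated SD.
spread = sqrt(sum((x - mean_difference) ** 2 for x in differences) / len(differences))
f = spread / sd_difference
print(differences, round(f, 3))   # [0, 5, 10] 0.204
```

Changing any of the estimated means changes the difference scores, and therefore f, so it is worth rerunning the calculation for a range of plausible values before settling on a sample size.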

Example 6: 2-Way Within-Between ANOVA

A strength and conditioning researcher would like to compare changes in back squat 1RM strength during a training program among running backs and linemen who compete in American football. He/she will schedule pre-, mid-, and posttraining testing but is unsure how many subjects are needed in each group to achieve a power level of 0.80. Refer to Figure 6 as you go through the following steps.

  • Step 1: Select “F tests” in the “Test family” panel.
  • Step 2: Select “ANOVA: Repeated measures, within-between interaction” in the “Statistical test” panel.
  • Step 3: Select “A priori: Compute required sample size – given α, power, and effect size.”
  • Step 4: In the “Input Parameters” panel, select 0.05 for the “α err prob,” 0.80 for the “Power (1-β err prob),” 2 for the “Number of groups,” 3 for the “Number of measurements,” 0.5 for the “Corr among rep measures,” and 1 for the “Nonsphericity correction ε.”
  • Step 5: Let us estimate the mean back squat 1RM values for the running backs at the pre-, mid-, and posttraining time points as 440, 450, and 465 lbs, respectively. We also estimate the corresponding values for the linemen to be 525, 530, and 550 lbs, respectively. Let us also assume the standard deviation of the 1RMs to be 10 lbs for both groups and at each time point. Comparison of the mean values results in difference scores of 85, 80, and 85 lbs for the pre-, mid-, and posttraining time points, respectively. Using the previously described equation results in f = 0.236.
  • Step 6: Select “Calculate” and observe the “Total sample size” required. For this example, a sample size of 32 subjects is required to achieve a power level of 0.80 for the interaction. It is important for the researcher to keep in mind that separate sample size estimations must also be done for the main effects. For the within-between ANOVA, the between-subjects main effect will often require the greatest sample size.
Figure 6
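
The effect size for this example can be verified by applying the same calculation to the between-group difference scores of 85, 80, and 85 lbs, with the assumed standard deviation of 10 lbs in the denominator (values rounded at each step):

$$\bar{D} = \frac{85 + 80 + 85}{3} \approx 83.33, \qquad f = \frac{\sqrt{\left[(1.67)^2 + (-3.33)^2 + (1.67)^2\right]/3}}{10} = \frac{\sqrt{16.67/3}}{10} \approx \frac{2.36}{10} \approx 0.236$$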

Example 7: Bivariate Linear Regression

A strength and conditioning coach is interested in examining the linear association between body weight and 1RM strength during the deadlift exercise. He/she believes that there will be a strong positive linear relationship between these 2 variables but is unsure how many subjects are needed to achieve a power level of 0.80. Refer to Figure 7 as you go through the following steps.

  • Step 1: In the “Test family” panel, select “t tests.”
  • Step 2: In the “Statistical test” panel, select “Linear bivariate regression: One group, size of slope.”
  • Step 3: In the “Type of power analysis” panel, select “A priori: Compute required sample size – given α, power, and effect size.”
  • Step 4: In the “Input Parameters” panel, select “Two” for the “Tail(s),” 0.40 for the “Slope H1,” 0.05 for the “α err prob,” 0.80 for the “Power (1-β err prob),” 0 for the “Slope H0,” 1 for the “Std dev σ_x,” and 1 for the “Std dev σ_y.”
  • Step 5: Select “Calculate” and observe the “Total sample size” of 44 subjects required to achieve a power of 0.80. The “Slope H1” is analogous to the effect sizes used in the previous examples, and the researcher is encouraged to manipulate the “Std dev σ_x” and “Std dev σ_y” values to see their effects on sample size.
Figure 7
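
When experimenting with the standard deviation fields, it helps to know that a slope and the 2 standard deviations together imply a correlation. The following is a minimal sketch based on the standard identity slope = correlation × (σ_y/σ_x). The function name and the alternative standard deviations are hypothetical values chosen for illustration.

```python
def implied_correlation(slope, sd_x, sd_y):
    """Correlation implied by a regression slope and the 2 standard deviations."""
    rho = slope * sd_x / sd_y
    if abs(rho) > 1:
        raise ValueError("slope and SDs imply a correlation outside [-1, 1]")
    return rho


# Same slope of 0.40, different assumed standard deviations.
for sd_x, sd_y in [(1, 1), (1, 2), (2, 1)]:
    print(sd_x, sd_y, implied_correlation(0.40, sd_x, sd_y))
```

With both standard deviations set to 1, as in this example, the slope of 0.40 is also the correlation. Doubling σ_y halves the implied correlation, and doubling σ_x doubles it, which is why these fields have such a strong effect on the sample size estimate.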

Example 8: Multiple Linear Regression (Omnibus R²)

A strength and conditioning researcher is interested in using the linear combination of back squat 1RM and bodyweight to predict vertical jump height. He/she believes that back squat 1RM will be positively related to vertical jump height, whereas the partial correlation for bodyweight will be negative once the variance accounted for by back squat 1RM is partitioned out. How many subjects are needed to achieve a power level of 0.80? Refer to Figure 8 as you go through the following steps.

  • Step 1: Select “F tests” in the “Test family” panel.
  • Step 2: In the “Statistical test” panel, select “Linear multiple regression: Fixed model, R² deviation from zero.”
  • Step 3: In the “Type of power analysis” panel, select “A priori: Compute required sample size – given α, power, and effect size.”
  • Step 4: In the “Input Parameters” panel, select “Determine =>.” This will open up a new panel that allows the user to estimate the “Effect size f²” based on the estimated R² for the whole model, or on the estimated partial correlations among the predictors and the dependent variable. We will go through procedures for both methods. First, assume that the estimated R² for the whole model is 0.6. Thus, in the “Squared multiple correlation ρ²” box, select 0.6. Then, select “Calculate,” and “Calculate and transfer to main window.”
  • Step 5: Select 0.05 for the “α err prob,” 0.80 for the “Power (1-β err prob),” and 2 for the “Number of predictors.”
  • Step 6: Select “Calculate” and observe the “Total sample size” of 11 subjects required to achieve a power level of 0.80.
  • Step 7: Alternatively, the “Effect size f²” can be estimated from the estimated partial correlations among the predictors and the dependent variable. To estimate the “Effect size f²” using this method, select “Determine =>” in the “Input Parameters” panel. Next, select “From predictor correlations,” and specify 2 for the “Number of predictors.” Select “Specify matrices,” which will open a new window that allows the user to enter the estimated correlations. In the “Corr between predictors and outcome” tab, select 0.6 as the correlation between “P1” and “corr with outcome Y,” and −0.25 as the correlation between “P2” and “corr with outcome Y.” In the “Corr between predictors” tab, select 0.5 as the correlation between “P1” and “P2.” Thus, by using these procedures, we are estimating that the partial correlations between back squat 1RM and vertical jump height and between bodyweight and vertical jump height will be 0.6 and −0.25, respectively. We are also estimating that the correlation between back squat 1RM and bodyweight will be 0.5. Select “Calc ρ²,” “Accept values,” “Calculate,” “Calculate and transfer to main window,” and “Calculate.” Observe the “Total sample size” of 7 subjects required to achieve a power level of 0.80.
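The two routes to the effect size in Steps 4 and 7 can be checked by hand. The sketch below is a Python illustration, not G*Power's own code: it assumes the standard conversion f² = R² / (1 − R²) and the textbook two-predictor formula for R², which treats the three entered values as ordinary (zero-order) correlations. The function names are hypothetical.

```python
def f2_from_r2(r2):
    """Cohen's f^2 for a squared multiple correlation: R^2 / (1 - R^2)."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return r2 / (1.0 - r2)


def r2_two_predictors(r_y1, r_y2, r_12):
    """R^2 for two predictors from their correlations with the outcome
    (r_y1, r_y2) and with each other (r_12)."""
    return (r_y1**2 + r_y2**2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12**2)


# Step 4: an estimated whole-model R^2 of 0.6.
print(round(f2_from_r2(0.6), 3))                    # 1.5

# Step 7: correlations of 0.6 and -0.25 with the outcome, 0.5 between predictors.
r2 = r2_two_predictors(0.6, -0.25, 0.5)
print(round(r2, 3), round(f2_from_r2(r2), 3))       # 0.763 3.225
```

Under these assumptions the second route implies a larger R² (about 0.76 rather than 0.6), which is why it yields a smaller sample size than the first.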
Figure 8

Example 9: Multiple Linear Regression (R² Increase)

A strength and conditioning coach would like to predict 40-yd dash time. He/she has measured body weight, back squat 1RM, and vertical jump but is unsure which combination of these variables will provide the best prediction of 40-yd dash time. How many subjects are needed to achieve a power level of 0.80? Refer to Figure 9 as you go through the following steps.

  • Step 1: Select “F tests” in the “Test family” panel.
  • Step 2: In the “Statistical test” panel, select “Linear multiple regression: Fixed model, R² increase.”
  • Step 3: In the “Type of power analysis” panel, select “A priori: Compute required sample size – given α, power, and effect size.”
  • Step 4: In the “Input Parameters” panel, select “Determine =>.” This will open up a new panel that allows the user to estimate the “Effect size f²” either from the variance explained by the regression model and the residual variance or from the estimated R² for the whole regression model. We will go through procedures for both methods. First, assume that the “Variance explained by special effect” is 0.5, and the “Residual variance” is 1. Then, select “Calculate,” and “Calculate and transfer to main window.” This will transfer the estimated “Effect size f²” of 0.5 to the main window.
  • Step 5: Select 0.05 for the “α err prob,” 0.80 for the “Power (1-β err prob),” 2 for the “Number of tested predictors,” and 3 for the “Total number of predictors.”
  • Step 6: Select “Calculate” and observe the “Total sample size” of 23 subjects required to achieve a power level of 0.80.
  • Step 7: Alternatively, the “Effect size f²” can be estimated from the estimated R² for the whole regression model. To estimate the “Effect size f²” using this method, select “Determine =>” in the “Input Parameters” panel. Next, select “Direct,” and specify 0.6 as the “Partial R².” Select “Calculate,” and “Calculate and transfer to main window.” This will transfer the estimated “Effect size f²” of 1.5 to the main window. Select “Calculate,” and observe the “Total sample size” of 11 subjects required to achieve a power level of 0.80.
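The arithmetic behind the two effect sizes in Steps 4 and 7 can be written as a short worked equation. The symbols below are shorthand introduced for this illustration; they stand for the G*Power fields “Variance explained by special effect,” “Residual variance,” and “Partial R².”

```latex
\begin{align*}
f^2 &= \frac{\sigma^2_{\text{effect}}}{\sigma^2_{\text{residual}}}
     = \frac{0.5}{1} = 0.5
     && \text{(Step 4)} \\
f^2 &= \frac{R^2_{\text{partial}}}{1 - R^2_{\text{partial}}}
     = \frac{0.6}{1 - 0.6} = 1.5
     && \text{(Step 7)}
\end{align*}
```

The second set of inputs implies an effect size three times as large as the first, which is why it calls for 11 subjects rather than 23.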
Figure 9

Important Considerations and Additional Features

Those of you who have been following along and calculating the sample sizes for each of the examples have likely seen the huge influence that effect size estimates have on sample size calculations. In fact, it could be argued that estimating the effect size is, in itself, more important than estimating the sample size. Thus, it is strongly recommended that researchers make the best possible estimate of effect size and err toward the lower end of that estimate, because an overestimated effect size reduces the sample size that appears necessary to achieve a given level of power. Nothing is more frustrating than basing sample size on an overestimated effect size, and then finding out that the study was underpowered after it was completed, despite an a priori power analysis. It should also be emphasized that with an a priori power analysis, the researcher is estimating, rather than calculating, the sample size that should be used in the study. All components of the effect size come from estimations. As such, the resulting sample size is simply an estimate of what is needed to achieve the designated power level. In addition, factors such as calibration errors, subject attrition, measurement errors, and other unforeseen events can quickly reduce power during the study. Thus, researchers should always strive to achieve an actual sample size that is slightly above that provided by G*Power. If the a priori power analysis calls for 25 subjects, the researcher should attempt to sample at least 30 subjects (and possibly even more), depending on the study design and the likelihood that he/she will get complete and accurate data from all subjects.
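One simple way to turn this advice into a number is to divide the estimated sample size by the proportion of subjects expected to provide usable data. The Python sketch below assumes that rule of thumb; the function name `inflate_for_attrition` and the 15% loss rate are hypothetical choices for illustration, not values taken from G*Power.

```python
import math


def inflate_for_attrition(n_required, expected_loss):
    """Subjects to recruit so that about n_required remain after losses.

    expected_loss is the anticipated proportion of unusable subjects
    (dropout, missing or invalid data); it is a planning guess, not a known value.
    """
    if not 0.0 <= expected_loss < 1.0:
        raise ValueError("expected_loss must lie in [0, 1)")
    return math.ceil(n_required / (1.0 - expected_loss))


# 25 subjects from the power analysis, assuming roughly 15% are lost.
print(inflate_for_attrition(25, 0.15))   # 30
```

A study with several testing sessions or a long training period would likely justify a larger assumed loss rate.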

G*Power also provides some very useful auxiliary features that can be helpful when estimating sample size. For example, clicking the “X-Y plot for a range of values” button allows the researcher to create a plot of sample size as a function of power for the statistical test being performed (Figure 10). This allows the researcher to track the relationship between sample size and power across a range of sample sizes. For example, it may be very expensive or time-consuming for the researcher to increase the sample size by a given amount. This graphing function allows the researcher to examine the trade-off between improvements in power and the accompanying increase in time or cost. It is strongly recommended that researchers use this feature when performing their a priori power analysis. Although the trade-off between sample size and power can lead to a somewhat sobering realization of the number of subjects needed to correctly perform a research study, it is an absolutely necessary part of research planning. It could be argued that, in the long run, the consequences of performing an underpowered study outweigh any costs that might be accrued from using large (but appropriate from a power standpoint) sample sizes. A final useful feature of G*Power is that a record of each sample size estimation is kept under the “Protocol of power analysis” tab. This record contains a description of the test, as well as the input and output parameters. Researchers can save and/or print this record for future use.
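The relationship that this plot displays can also be tabulated outside G*Power. The following is a standard-library Python sketch for the omnibus regression test of Example 8 (f² = 1.5, 2 predictors, α = 0.05); it is an illustration rather than G*Power's own code. The function names `betainc` and `regression_power` are hypothetical, and the sketch assumes a noncentrality parameter of f² × n.

```python
import math


def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b) via a continued fraction."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if x > (a + 1.0) / (a + b + 2.0):       # symmetry keeps convergence fast
        return 1.0 - betainc(b, a, 1.0 - x)
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x)) / a
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 1000):
        # Even and odd terms of the modified Lentz recurrence.
        for num in (m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m)),
                    -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            delta = c * d
            h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return front * h


def regression_power(n, f2, n_predictors, alpha=0.05):
    """Power of the omnibus F test that R^2 = 0 with n subjects.

    Assumes noncentrality = f2 * n, with n_predictors numerator and
    n - n_predictors - 1 denominator degrees of freedom.
    """
    a = n_predictors / 2.0
    b = (n - n_predictors - 1) / 2.0
    if b <= 0:
        raise ValueError("n must exceed n_predictors + 1")
    # Critical value on the beta scale: solve I_x(a, b) = 1 - alpha by bisection.
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if betainc(a, b, mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    x_crit = (lo + hi) / 2.0
    # Noncentral F CDF as a Poisson-weighted mixture of central beta CDFs.
    half_nc = f2 * n / 2.0
    weight = math.exp(-half_nc)
    cdf, total, j = 0.0, 0.0, 0
    while total < 1.0 - 1e-12 and j < 2000:
        cdf += weight * betainc(a + j, b, x_crit)
        total += weight
        j += 1
        weight *= half_nc / j
    return 1.0 - cdf


# Power across a range of sample sizes for the Example 8 settings.
for n in range(6, 16):
    print(n, round(regression_power(n, 1.5, 2), 3))
```

With these inputs, power is about 0.79 with 10 subjects and about 0.85 with 11, which is consistent with the 11 subjects reported in Example 8. Scanning the printed table shows how much power each additional subject buys, which is the trade-off the plot is meant to reveal.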

Figure 10

Summary and Closing Remarks

Hopefully, this tutorial will be helpful for strength and conditioning researchers. Sample size estimation during study design always requires 3 steps: (a) select the significance level, (b) select the power level that needs to be achieved, and (c) estimate the effect size for the experiment. The effect size estimate is inevitably subjective, but it is also essential. Even a rough guess about the effect size is better than none. The researcher should certainly not shy away from this step for fear of providing an inaccurate estimate. In the ideal situation, a range of plausible effect sizes should be estimated, with each entered into the G*Power software to examine its effect on sample size. Then, if the resources are available, the researcher should use the sample size from the smallest effect size because this will provide the best chance of achieving enough power. The necessity of the effect size estimate is best described by Cohen (1), p. 285, who stated: “The investigator who insists that he has absolutely no way of knowing how large an ES [effect size] to posit fails to appreciate that this necessarily means that he has no rational basis for deciding whether he needs to make ten observations or ten thousand.”

It is also important to point out that, as described previously, there are many other software programs available that will perform sample size estimations. I chose to describe the G*Power software because it accommodates most of the research designs that are common in strength and conditioning. However, researchers are welcome to apply the procedures that I described to other software programs. In addition, I have not addressed the limitations of null hypothesis significance testing. Despite its historical precedent and widespread use, the null hypothesis significance test has been heavily criticized as a basis for making definitive decisions in research (14). Some have even suggested that the procedure should be abandoned and replaced with something that provides information regarding treatment magnitude, such as effect sizes and confidence interval adjustments (3,14). There are certainly 2 sides to this issue, and it is not the purpose of this article to endorse 1 method over the other. Researchers should always be aware of the limitations of the statistical procedures they are using and the methods that can be used to overcome these limitations.

References

1. Cohen J. Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers, 1988.
2. Cohen J. A power primer. Psychol Bull 112: 155–159, 1992.
3. Cohen J. The earth is round (p < .05). Am Psychol 49: 997–1003, 1994.
4. Ellis PD. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge, UK: Cambridge University Press, 2010.
5. Erdfelder E, Faul F, Buchner A. GPOWER: a general power analysis program. Behav Res Methods Instrum Comput 28: 1–11, 1996.
6. Faul F, Erdfelder E, Buchner A, Lang A-G. Statistical power analyses using G*Power 3.1: tests for correlation and regression analyses. Behav Res Methods 41: 1149–1160, 2009.
7. Faul F, Erdfelder E, Lang A-G, Buchner A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods 39: 175–191, 2007.
8. Goldstein R. Power and sample size via MS/PC-DOS computers. Am Stat 43: 253–260, 1989.
9. Keppel G. Design and Analysis: A Researcher’s Handbook (3rd ed.). Upper Saddle River, NJ: Prentice Hall, 1991.
10. Kraemer HC, Thiemann S. How Many Subjects? Statistical Power Analysis in Research. Newbury Park, CA: Sage Publications, 1987.
11. Lenth RV. Some practical guidelines for effective sample size determination. Am Stat 55: 187–193, 2001.
12. Lipsey MW. Design Sensitivity: Statistical Power for Experimental Research. Newbury Park, CA: Sage Publications, 1990.
13. Murphy KR, Myors B, Wolach A. Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests. New York, NY: Routledge, 2009.
14. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev 82: 591–605, 2007.
15. Rhea MR. Determining the magnitude of treatment effects in strength training research through the use of the effect size. J Strength Cond Res 18: 918–920, 2004.
16. Richardson JTE. Measures of effect size. Behav Res Methods Instrum Comput 28: 12–22, 1996.
17. Vickers AJ, Altman DG. Analysing controlled trials with baseline and follow up measurements. Br Med J 323: 1123–1124, 2001.
Keywords: power; effect size; statistics

Copyright © 2013 by the National Strength & Conditioning Association.