The dispensing profession has always been driven by a reliance on evidence to support its activities. While various leaders in our field^{1–3} have outlined the different levels of evidence and specified the criteria to judge the validity of evidence, we continue to come across “evidence” that is quasi-evidence at best. Either the results of the studies are not treated statistically, or the statistical treatment lacks sufficient power to be meaningful.

Such evidence can mislead readers to believe in the effectiveness of a particular feature when such efficacy is not present; or to believe in the ineffectiveness of a feature when its true efficacy has not been adequately evaluated. This paper briefly reviews an important factor that affects the interpretation of the results of a study — sample size — in hopes that the information presented will help clinicians and students better understand how to interpret the results of any clinical study.

## WHY SAMPLE SIZE IS IMPORTANT

All human studies involve a pool of human participants or subjects. The subjects involved in any study constitute the sample size.

In a research design, an adequate sample size should allow one to accept or refute the experimental hypothesis based on the choice of an appropriate statistical test. A common practice in speech and hearing research is to choose a sample size of 20. Unfortunately, convention alone is not an adequate rationale: depending on the study design and the strength of the findings, 20 may be too many subjects in some instances and too few in others.

In general, the larger the sample size, the more reliable the findings. It stands to reason that one may want to recruit as many subjects as possible for all studies in order to increase the reliability of the results. On the other hand, a larger number of subjects would mean greater expenses, longer data collection time, and difficulty finding the required number of subjects. It is also true that, once a certain number of subjects has been recruited, additional subjects will not substantially change the conclusions of the study.

For researchers, knowing the optimal number of subjects to recruit for a study will provide more certainty in the conclusion without wasting resources (such as time and money). For clinicians who review the research, knowing that the study was performed with an adequate number of subjects could instill additional confidence in the reported findings. Thus, it is important that both researchers and consumers of the research understand whether the study was conducted with an adequate sample of subjects. One way to choose an appropriate sample is to use power analysis.^{4–6}

## PROCEDURES IN POWER ANALYSIS

### What is power analysis?

The power (P) of a test is the chance of detecting a treatment effect, given that such effect indeed exists. It is expressed as a number between 0 and 1, with a larger number meaning greater certainty. Therefore, a P of 0.8 means that researchers will detect a true effect in the real world 80% of the time. The value of P also gives the readers an idea of the practical impact of the treatment on the population. Hence, P is also called practical significance.^{2}

Power analysis can be done after the data has been collected in order to report the chance (P) of detecting a true effect. This is known as a *post-hoc* power analysis. The American Psychological Association^{7} recommends that researchers include a *post-hoc* power analysis as a good practice when non-significant results are reported in the paper. This information can be used to improve the design of future independent studies that replicate the non-significant study. Unfortunately, it is uncommon to find *post-hoc* power analyses in publications.

Another useful application of power analysis is to determine the minimal number of subjects in a research design prior to data collection. This is known as *a priori* power analysis. An increasing number of journal reviewers are asking for *a priori* power analyses to justify the sample size in a submitted paper, especially for those studies with a relatively small pool of subjects.

### Post-hoc power analysis to determine the impact of treatment

When applying power analysis, four essential, inter-related parameters are required: sample size (N), effect size (ES), significance criterion (α), and power (P). The sample size N denotes the number of subjects who take part in the experiment. The effect size is also known as the sensitivity of the test.

ES is conventionally expressed as different indices (with different notations) according to the statistical test being employed. For example, the effect size index is *d* for a t-test with two independent groups, and *r* for a Pearson correlation test. Effect size is further classified into small, medium, and large effects according to different cutoff values of the ES indices, with the intention that a medium ES should be noticeable by researchers with normal experience.^{5} Table 1 shows several ES indices with ranges of classification and their exemplary formats in reporting the results. For the formulas that were used in calculating ES indices, see the studies by Cohen^{5} and Meline and Schmitt^{8}.

The calculation of the *p*-value is covered in any standard textbook on statistics. The significance criterion (α) in power analysis is the chance of error the researcher is willing to accept when the test shows a significant treatment effect, i.e., *p* < α. When α is set to 0.05, there is a 5% chance of making the error of believing that there is a true effect in the population when there is none.
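The meaning of P and α can be illustrated by simulating many hypothetical studies and counting how often the test declares a significant effect. The sketch below is an illustration only. It assumes normally distributed difference scores with a known standard deviation of 1 and uses a two-tailed z-test rather than a t-test. The function name `rejection_rate` and the chosen values (32 subjects, a true effect of 0.5, 5,000 simulated studies) are arbitrary assumptions, not taken from any study.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_effect, n_subjects, alpha=0.05, n_studies=5000, seed=1):
    """Fraction of simulated studies whose two-tailed z-test rejects the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(n_studies):
        # Each subject contributes one difference score; SD is assumed to be 1.
        scores = [rng.gauss(true_effect, 1.0) for _ in range(n_subjects)]
        z = fmean(scores) * sqrt(n_subjects)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_studies


# A true effect exists: the rejection rate estimates the power P.
power = rejection_rate(true_effect=0.5, n_subjects=32)
# No effect exists: the rejection rate estimates the false-positive rate (alpha).
false_alarm = rejection_rate(true_effect=0.0, n_subjects=32)

print(f"estimated power: {power:.2f}")
print(f"estimated false-positive rate: {false_alarm:.2f}")
```

With these settings the first rate should come out near 0.8 and the second near 0.05, subject to simulation noise.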

In a *post-hoc* power analysis, the parameters N, ES, and α are typically known, and the calculation of P is of most interest to the researchers and their intended audience. If P is found to be greater than 0.8, the treatment effect is considered practically significant, with an impact in the real world. If P is less than 0.8, the value of ES can help in estimating how many subjects will be needed to yield a P greater than 0.8. In that case, the *post-hoc* power analysis becomes an *a priori* power analysis.

### A priori power analysis to estimate the required number of subjects

N, ES, α, and P are the same parameters considered in an *a priori* power analysis. It is reasonable to assume that the experimenter would not tolerate more than a 5% chance of error of believing there is a true effect when there is none; therefore, it is reasonable to set α to 0.05. With P set to the acceptable level of 0.8, only the effect size then needs to be specified before estimating the required number of subjects.

There are three common approaches to estimating ES prior to the actual experiment. All methods begin with a literature survey. The first method is to look up the sensitivity (magnitude) of the treatment reported in published data. For example, a researcher who wants to replicate an experiment may directly adopt the effect size reported in the published papers and use it in the power analysis to determine the required sample size. However, Meline and Wang^{9} surveyed five volumes of American Speech-Language-Hearing Association journals from 1999–2003 and found that effect size was reported in less than 30% of the manuscripts in which the results of statistical tests were reported.

Alternatively, when the effect size cannot be estimated from the literature, conducting a pilot study with a smaller sample of subjects can also provide a tentative ES of the treatment. For example, one can start a pilot study with five subjects to calculate the tentative effect size *d* in a t-test. Based on this preliminary effect size, the researchers can calculate an adequate sample size for the actual study using *a priori* power analysis.

When a pilot study is not feasible because of limited resources or for other reasons, one may guess the magnitude of the treatment effect. If the researcher anticipates the treatment will only produce a small effect, one may choose a small ES index, such as 0.2 for a t-test (Table 1). If the treatment effect is expected to be large, a large ES index may be used to estimate the number of subjects, such as 0.8 for a t-test. Or, one may use the average (medium) effect size index of 0.5 that was averaged over many individual studies.^{5}, ^{10}

Once the magnitude of the treatment effect is estimated, the minimal sample size (N) can be calculated from the inter-relationships among N, ES, α, and P. It is good practice to conduct an *a priori* power analysis before conducting any study.
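As a rough illustration of how the four parameters constrain one another, the sketch below uses a large-sample normal approximation for a two-tailed one-sample or matched-pairs test. It is not the computation described in this paper: power analysis programs typically work with the exact t-distribution, so their results tend to come out somewhat larger, especially for small samples. The function name `approx_sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def approx_sample_size(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation estimate of N for a two-tailed matched-pairs test."""
    if effect_size <= 0:
        raise ValueError("effect size must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the criterion alpha
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    # N grows with the square of (z_alpha + z_power) / ES; round up to whole subjects.
    return ceil(((z_alpha + z_power) / effect_size) ** 2)


# Small, medium, and large effect sizes for a t-test.
for es in (0.2, 0.5, 0.8):
    print(f"ES = {es}: approximately {approx_sample_size(es)} subjects")
```

Under this approximation, halving the assumed effect size roughly quadruples the estimate, which is why the choice of ES dominates the calculation.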

## TOOLS IN POWER ANALYSIS

The computation involved in power analysis can be very complex and cumbersome.^{11} Fortunately, there are many tools available that allow researchers to perform power analysis conveniently, without laboring over the mathematics. Some are software packages installed on a stand-alone PC, while others are website calculators that work independently of the computer's operating system.

Table 2 summarizes the capabilities of different tools that allow calculation of *a priori* sample size. The tools given in Table 2 are by no means exclusive, and their capabilities and website addresses may have been changed or updated.

There are at least three considerations in choosing a power analysis tool. They include:

- Cost: Free programs which are usually provided by individuals and universities may not be as flexible and easy to use as commercial packages.
- Flexibility: A suitable tool for a researcher is one that covers the most common test situations for him or her. Choose a tool that you are most comfortable using.
- Accuracy: Power estimates among programs may slightly differ by as much as 0.3 in a t-test, for example.^{12} To ensure accuracy of results, look for a tool that provides well-documented manuals.^{13}

In the following examples, we use the free software G*Power 3.1 to perform *a priori* power analysis. This widely used program also provides references that validate its statistical results.^{14}, ^{15} Furthermore, it covers most statistical tests of hypothesis, including the *F*-test, *t*-test, χ²-test, z-test, and exact tests.

## EXAMPLES FOR DETERMINING ENOUGH SUBJECTS

To illustrate the steps in determining *a priori* sample size, we have used several papers published at the Widex ORCA-USA research center as references. These examples cover two new features in the Widex Passion and Mind hearing aids. One is the linear frequency transposition feature, called the audibility extender (AE), in which the higher frequencies of a signal are transposed to a lower frequency region.^{16–19} The other feature is a fractal music generator called Zen, which is part of the Mind hearing aids and is used for relaxation and as a tool in tinnitus management.^{20}, ^{21}

### Example 1: Effect size estimated from guesswork

Kuk et al^{17} wanted to examine the efficacy of the audibility extender in aiding speech perception when it was first introduced. Although there were reports in the literature on the efficacy of various frequency lowering techniques, none used the same technology as the AE, and few included formal training as part of the research design. Thus, it was impossible to reference any specific studies for the effect size.

Although there were no formal data on the performance of the AE at the time, we were encouraged by observations that the majority (>80%) of wearers accepted the AE during their initial two-week use of the algorithm. This suggested a possibly strong effect size, but since we were unfamiliar with what to expect, we decided to let the G*Power software show us how many subjects we would need for various effect sizes.

To perform *a priori* estimation, one has to specify the test conditions. In this case, the same subjects would be tested with and without the AE. The intended statistical test to evaluate the hypothesis would be a two-tailed, matched-pairs t-test. The following are the entries into the G*Power calculator:

- Test family = t-test
- Statistical test = Means: Difference between two dependent means (matched pairs)
- Type of power analysis = *A priori*: Compute required sample size
- Tail(s) = Two
- α error probability = 0.05
- Power = 0.8

The output of the G*Power program, showing the relationship between sample size and effect size for a fixed P of 0.8 and a criterion α of 0.05, is shown in Figure 1. A steeply decreasing function is seen: the required number of subjects drops sharply when the effect size increases from 0.2 to 0.4 (from a small toward a medium effect size).

The required number of subjects decreases only slightly when the ES increases from medium to strong. As a numeric example from Figure 1, an ES of 0.2 would require 198 subjects. A medium effect size (i.e., 0.5) requires 33 subjects, while a large effect size (i.e., 0.8) needs only 13 subjects. Based on the anticipation that a large difference may be obtained, Kuk et al^{17} decided to use a sample size of 13 in their study.
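One way to see why the curve flattens is the large-sample approximation in which, for a fixed α and P, the required N is inversely proportional to the square of the effect size. This proportionality is an assumption introduced here for illustration (an exact calculation may deviate from it slightly), but the sample sizes quoted above follow it closely:

$$
\frac{N_{0.2}}{N_{0.5}} \approx \left(\frac{0.5}{0.2}\right)^{2} = 6.25 \quad \left(\text{quoted: } \frac{198}{33} = 6.0\right), \qquad
\frac{N_{0.5}}{N_{0.8}} \approx \left(\frac{0.8}{0.5}\right)^{2} = 2.56 \quad \left(\text{quoted: } \frac{33}{13} \approx 2.5\right)
$$

In absolute terms, moving from a small to a medium ES removes 165 subjects from the requirement (198 − 33), whereas moving from a medium to a large ES removes only 20 (33 − 13).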

Clearly, there is an element of “educated guess or luck” when using this approach. It is a good idea to use a smaller estimate of ES to be conservative, and this approach should only be used after an exhaustive literature review fails to provide a concrete suggestion.

### Example 2: Effect size estimated from a pilot study

A pilot study may be used to fine-tune the design of a study. For that reason, it may follow the same protocol as the actual study, or it may focus on only a limited number of test conditions of the final study. For example, the design of the study by Sweetow et al^{21} on the effectiveness of the Zen tones in managing tinnitus was the result of a pilot study conducted at ORCA-USA prior to the actual study.

In the study, the experimental hypothesis was that the intervention with the Zen fractal tones would reduce tinnitus in sufferers as measured on a tinnitus handicap scale. Ten tinnitus subjects took part in the pilot study. The subjects were seen several times for an evaluation of their tinnitus severity consequent to using the Zen feature. The tinnitus handicap questionnaire scores measured with no Zen experience were compared to the scores measured with Zen after one month of experience. The descriptive statistics in the pilot study (values not reported in Sweetow et al^{21}) are as follows:

- Mean difference in tinnitus handicap scores between two visits = 10.6
- Standard deviation of difference = 12.36

Based on this descriptive information, the effect size (*dz*) is calculated to be 0.85 (i.e., mean difference divided by standard deviation of difference) when a matched pairs t-test is used. One may also use the G*Power calculator to obtain the effect size by entering these two input values. With this tentative effect size obtained from the pilot study, the G*Power calculator is used to compute the achieved power of the test using the values for the following input parameters:

- Test family = t-test
- Statistical test = Means
- Type of power analysis = *Post-hoc*: Compute achieved power
- Tail(s) = Two
- Effect size (dz) = 0.85
- α error probability = 0.05
- Total sample size = 10

The achieved power (P) is found to be 0.67, which is less than the acceptable level of 0.8. Therefore, using 10 subjects (as in the pilot study) was not enough to reliably detect a true effect. One advantage of the G*Power software is that, based on the supplied information, it also generates data showing the relationship between the achieved power (P) and the sample size (N). This is shown in Figure 2.

In this case, a tentative effect size of 0.85 would require a minimum of 13 subjects to reach a power of 0.8. Therefore, a sample size of 14 subjects was chosen in the Sweetow et al^{21} study.
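The two steps in this example can be approximated without G*Power. The sketch below is a standard-library illustration, not G*Power's own code. It assumes that the power of a two-tailed matched-pairs t-test follows the noncentral t distribution with noncentrality dz·√N, and it evaluates that distribution by numerical integration. All function names are hypothetical, and small differences from the values quoted above could arise from the rounding of dz or from the numerical method.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def nct_cdf(t, df, ncp, steps=1000):
    """P(T <= t) for a noncentral t variable with df >= 2 (Simpson's rule).

    Writes T = (Z + ncp) / X with X = sqrt(chi-square(df) / df), so the CDF is
    the average of PHI(t * x - ncp) over the density of X.
    """
    log_c = log(2.0) + (df / 2) * log(df / 2) - lgamma(df / 2)
    upper = 1.0 + 8.0 / sqrt(df)  # the density of X is negligible beyond this
    h = upper / steps             # steps must be even for Simpson's rule

    def integrand(x):
        if x == 0.0:
            return 0.0  # the density of X vanishes at zero when df >= 2
        log_density = log_c + (df - 1) * log(x) - df * x * x / 2
        return PHI(t * x - ncp) * exp(log_density)

    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3


def t_critical(df, alpha):
    """Two-tailed critical value: solve central-t CDF = 1 - alpha/2 by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if nct_cdf(mid, df, 0.0) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def paired_t_power(dz, n, alpha=0.05):
    """Power of a two-tailed matched-pairs t-test with n pairs (n >= 3)."""
    df = n - 1
    ncp = dz * sqrt(n)
    t_crit = t_critical(df, alpha)
    return (1.0 - nct_cdf(t_crit, df, ncp)) + nct_cdf(-t_crit, df, ncp)


def required_pairs(dz, alpha=0.05, power=0.8, n_max=1000):
    """Smallest n whose power reaches the target, found by stepping upward."""
    for n in range(3, n_max + 1):
        if paired_t_power(dz, n, alpha) >= power:
            return n
    raise ValueError("target power not reached within n_max")


dz = 0.85  # tentative effect size from the pilot study, as quoted above
pilot_power = paired_t_power(dz, 10)
n_needed = required_pairs(dz)
print(f"power with 10 subjects: {pilot_power:.2f}")
print(f"subjects needed for a power of 0.8: {n_needed}")
```

Stepping N upward one subject at a time is slow but transparent; it mirrors reading the power-versus-sample-size curve from left to right until it crosses 0.8.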

### Example 3: Effect size referenced to a paper

Kuk et al^{17} first reported the efficacy of the audibility extender in adults tested in quiet. Subsequently, Kuk et al^{18} wanted to further examine the efficacy of the AE in noise.

The findings of the first study indicated that the benefit of the AE was more significant at the softer level (30 dB HL) after the subjects had worn the AE for one month. In the follow-up evaluation, Kuk et al^{18} used the data measured under this test condition (consonant scores measured at 30 dB HL with AE experience) to determine the minimum number of required subjects. The descriptive statistics reported for this test condition were:

- Mean difference (%) between AE and no AE = 13.3
- Standard deviation of difference = 10.3

The effect size (*dz*) in a t-test with matched pairs is 1.29 (i.e., mean difference divided by standard deviation of difference). Cohen's ES convention table (Table 1) would suggest that this is a large effect size, implying that the AE is potentially effective in enhancing consonant identification.
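The arithmetic behind this effect size can be written as a short sketch. The function names are hypothetical, and treating the 0.2, 0.5, and 0.8 conventions for a t-test as lower bounds of the small, medium, and large categories is an assumption made here for illustration; Table 1 remains the reference for the actual ranges.

```python
def cohens_dz(mean_difference, sd_difference):
    """Effect size for matched pairs: mean difference / SD of the differences."""
    if sd_difference <= 0:
        raise ValueError("standard deviation of the differences must be positive")
    return mean_difference / sd_difference


def classify_effect(dz):
    """Label an effect size, assuming 0.2 / 0.5 / 0.8 are category lower bounds."""
    size = abs(dz)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


# Descriptive statistics reported for consonant scores at 30 dB HL.
dz = cohens_dz(mean_difference=13.3, sd_difference=10.3)
print(f"dz = {dz:.2f} ({classify_effect(dz)})")
```

For the reported values this prints a dz of 1.29 labeled as large, in line with the text.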

To calculate the minimal sample size in the follow-up study, the following input parameters were chosen and entered into the G*Power calculator:

- Test family = t-test
- Statistical test = Means
- Type of power analysis = A priori
- Tail(s) = Two
- Effect size (dz) = 1.29
- α error probability = 0.05
- Power = 0.8

The output parameters show that the minimal sample size is 7 in order to achieve a power of 0.8. The assumption is that the new subjects recruited for the follow-up test would behave similarly to the previous group, and that their performance in noise is similar to that in quiet. Therefore, Kuk et al^{18} reported that a pool of 10 subjects would likely be sufficient to examine the efficacy of the AE in noise.

## DISCUSSION

This brief overview suggests that as a researcher and/or a consumer of research articles, one has to have a basic understanding of the relationship between effect size, sample size, significance criterion, and power of the test when evaluating the adequacy of a research report.

As a researcher, one should start with an *a priori* power analysis to estimate the optimal number of subjects for a study, and conclude the study with a *post-hoc* power analysis to determine the achieved power of the test. If the results do not reach significance, the report should also indicate how many subjects might be needed to reach an acceptable power (i.e., P > 0.8).

As a consumer, one should look for documentation within the report itself showing that an adequate sample size was sought before agreeing with the conclusions of the study. This is especially true for studies that report a lack of significant effect. Under-powered studies are frequently encountered in the area of speech and hearing research.^{9}

It needs to be remembered that conducting *a priori* power analysis before a study does not guarantee statistical significance or positive outcome in the new study. *A priori* power analysis ensures that an optimal number of subjects are present for the evaluation of the null hypothesis, while assuming the same effect size in the new study. This assumption may be faulty at many levels.

Studies that differ in subject characteristics (age, gender, degree of hearing loss, etc.) and test conditions (e.g., type of signal processing, quiet vs. noise, SNR) could lead to a different outcome and thus a different effect size. This will be true regardless of how the ES is estimated.

Despite such a limitation, a study that has undergone a proper power analysis is more likely to achieve an acceptable chance of detecting a treatment effect (when such an effect really exists). For researchers and readers alike, a study report that includes a proper power analysis increases confidence in the findings; it shows that the authors did their homework when designing the clinical study.