“If someone understands both the disease and the medicine, only then is he a wise physician” (“rog daaroo dovai bujhai taa vaid sujaan”) - Shri Guru Angad Dev Ji, the second Sikh guru.
Power and sample size are critical aspects of the design of any population-based hypothesis testing study, and they must be addressed at the planning stage of a high-quality study. A study needs to be adequately powered to be able to detect a treatment effect, if one exists, as well as to avoid the wastage of resources. An underpowered study provides insufficient evidence to accept or reject a hypothesis. On the other hand, an overpowered study, in which the sample size is larger than required, wastes resources. For any given statistical method and significance level, there are four major considerations while designing a study: sample size, power, clinically meaningful effect size, and the variability of the parameter of interest in the target population. If three of them are known, the fourth can be calculated. Generally, researchers outsource the sample size calculation of a study to a statistician, but critical inputs from the clinician are necessary to decide the relevant outcomes and clinically meaningful differences. This review article covers the basics of statistical power, including its interpretation, misinterpretations, uses, and limitations, with the objective of helping clinicians understand the concept and its practical aspects.
MATERIALS AND METHODS
An online literature search was conducted in various databases, including PubMed, Embase, Cochrane, and Google, using a planned scheme, as depicted in Figure 1. The search terms used included “statistical power,” “clinical importance,” “type II error,” and “statistical significance.” Articles published in the past 20 years were considered. We excluded duplicate citations, articles that were unrelated to the topic of statistical power, non-human studies, articles for which the full text was not available, and those published in a language other than English. We then searched the bibliographies of selected articles to identify any other relevant literature that had been missed. Finally, we included 18 articles that contained relevant information and illustrations for this review. Two more articles were selected by the manual search of the references of the selected articles, and one reference was from Shri Guru Granth Sahib.
Concept and interpretation
Like the P value, statistical power is a conditional probability. Conventionally, in hypothesis testing, the alternative hypothesis is the assumption that the null hypothesis is false. If the alternative hypothesis is actually true, the power indicates the probability of correctly rejecting the null hypothesis. The statistical power is best determined prospectively to decide whether a clinical study is worth doing, considering the required effort, expenditure, time, manpower, and patient exposure. A hypothesis test with low power may yield large P values and wide confidence intervals (CI). Hence, a low-power study may fail to reject the null hypothesis, even if a clinically meaningful difference exists between the treatments being compared.
While analyzing the results of a research study, the conclusion of the investigator could be true or false. The conclusion would be false if the alternative hypothesis was chosen when, in actuality, the null hypothesis was true. This scenario (incorrectly rejecting the null hypothesis, or a false positive) is a type I error, and the probability of committing this error is called “α,” the significance level. The second scenario, that is, failing to reject the null hypothesis when it is actually false (a false negative), is called the type II error, and the probability of committing this error is called “β.” The power of a study is the complement of the type II error probability (1 − β), and it signifies the probability of correctly rejecting a false null hypothesis. The levels of α and β are decided based upon the phase of the study, available resources, and the effect size (which is the quantification of the difference between two groups). The smaller these values are, the better will be the quality of the evidence generated by the study, but the larger will be the required sample size. For instance, in earlier phases, a larger α and β may be acceptable. Both α and β are chosen by convention rather than derived, and are decided by balancing the sample size against the effect size. If the sample size is pre-fixed, for example, in previously collected data or in a retrospective analysis, one can calculate the power for a given effect size or vice versa. A graph of the power against the effect size can be plotted for a fixed sample size. If the sample size is too small, the effect size has to be very large to be detected with a given power; otherwise, the study may be futile if the actual effect size is small. Power, effect size, and sample size are best decided a priori, while planning a study. In certain situations, investigators may perform post hoc analyses, but this is not ideal and should be a rarity.
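The relationship between power and effect size at a fixed sample size can be sketched numerically. The following is a minimal illustration, not a substitute for dedicated software; it assumes a two-sided comparison of two means with equal group sizes, a known common SD, and a normal (z) approximation. The function name `power_two_sample_z` and all example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def power_two_sample_z(delta, sd, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-sample z-test.

    Assumptions (illustrative only): equal group sizes, known common SD,
    normally distributed test statistic.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic if the true difference is delta
    shift = abs(delta) / (sd * sqrt(2.0 / n_per_group))
    # Probability of landing in either rejection region
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Power against effect size for a fixed sample size of 30 per group
    for delta in (2, 4, 6, 8, 10):
        p = power_two_sample_z(delta, sd=10, n_per_group=30)
        print(f"difference = {delta:>2}  power = {p:.2f}")
```

With a true difference of zero, the function returns α, which is the type I error rate; as the difference grows, the returned power approaches 1.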
The calculation of the sample size with desired values of α and β requires specification of the effect size and the standard deviation (SD). The effect size and the SD may be taken from a similar previous study in the literature or may be obtained from a pilot study. There are inherent limitations to each method, including sampling errors, publication bias, and study design. To calculate the optimal sample size, an investigator can use sensitivity analysis, with multiple possible values of effect size and SD. This is explained in more detail in the next section.
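A sensitivity analysis of this kind can be sketched as a small grid of sample sizes over several plausible effect sizes and SDs. The sketch below assumes a two-sided comparison of two means with equal group sizes and the usual normal-approximation formula n = 2(z₁₋α/₂ + z₁₋β)²·SD²/Δ² per group; the names `n_per_group` and `sensitivity_grid` and the input values are hypothetical, chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # quantile for the significance level
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def sensitivity_grid(deltas, sds, alpha=0.05, power=0.80):
    """Sample size per group for every (effect size, SD) combination."""
    return {(d, s): n_per_group(d, s, alpha, power) for d in deltas for s in sds}


if __name__ == "__main__":
    # Hypothetical candidate values, e.g. from a pilot study and the literature
    grid = sensitivity_grid(deltas=(4, 5, 6), sds=(8, 10, 12))
    for (delta, sd), n in sorted(grid.items()):
        print(f"difference = {delta}, SD = {sd}: n per group = {n}")
```

The grid makes visible how strongly the required sample size depends on the assumed inputs: halving the effect size roughly quadruples the sample size under this formula.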
Factors affecting the power of a study
Four factors influence the power of a study. These need to be accounted for during the planning stage of a study.
Precision and standard deviation of the data
Precision describes how closely repeated measurements agree with each other. It may be influenced by certain modifiable factors like observer and measurement biases as well as the actual variability (measured as the SD) of the population parameter. Observer bias occurs when a researcher's expectations, opinions, or prejudices influence what the researcher perceives or records in a study, whereas measurement bias refers to any systematic or non-random error that occurs in the collection of data in a study. Their collective impact is reflected in the 95% CI. The larger these biases and the SD, the broader is the CI. The bigger the sample size, the narrower is the CI, and the closer the result is likely to be to the actual population value.
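The effect of the SD and the sample size on the width of the CI can be illustrated with a short calculation. This sketch covers random variability only (it cannot capture observer or measurement bias) and assumes a CI for a single mean based on the normal approximation; the function name `ci_half_width` and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(sd, n, level=0.95):
    """Half-width of a normal-approximation CI for a mean: z * SD / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return z * sd / sqrt(n)


if __name__ == "__main__":
    for n in (25, 100, 400):
        print(f"n = {n:>3}  95% CI = estimate ± {ci_half_width(sd=10, n=n):.2f}")
```

Under these assumptions, quadrupling the sample size halves the width of the CI, whereas doubling the SD doubles it.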
The magnitude of the effect size
Detecting minute differences between intervention effects requires very precise estimates, and hence a bigger sample size, for the study to achieve adequate power. A wide CI may be acceptable when the effect size is large.
Type I or type II error
A smaller type II error indicates a higher probability that an actual existing effect size will be detected, with the given α and sample size.
Type of statistical tests
Two types of statistical tests can be used as the basis for calculating the sample size. Parametric tests are generally preferable when their assumptions are met, as they require a smaller sample size. However, parametric tests (e.g., Student's t test) require the data to be normally distributed, unlike non-parametric tests (e.g., Mann–Whitney U).
- The investigator can estimate the sample size required to test a specific hypothesis if the power has been pre-decided.
- The power of a study can be calculated for an already existing dataset with a fixed sample size and a pre-defined effect size.
- The effect size can be calculated with a given power and a fixed sample size.
- Addressing these issues a priori while designing the study allows a tighter and more rigorous study.
- Sample size and power calculations help decide the feasibility of a study within the available resources.
- Power derivations give us the sample size with a known type I and type II error.
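As an illustration of solving for the effect size when the power and the sample size are fixed, the following sketch computes the smallest true difference that a study could detect. It assumes a two-sided comparison of two means with equal group sizes and a normal approximation; the function name `minimal_detectable_effect` and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimal_detectable_effect(sd, n_per_group, alpha=0.05, power=0.80):
    """Smallest true difference in means detectable with the given power
    (two-sided, two-sample normal approximation, equal group sizes)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return (z_alpha + z_beta) * sd * sqrt(2.0 / n_per_group)


if __name__ == "__main__":
    # Hypothetical existing dataset with a fixed number of patients per group
    for n in (20, 63, 200):
        mde = minimal_detectable_effect(sd=10, n_per_group=n)
        print(f"n per group = {n:>3}  detectable difference = {mde:.2f}")
```

If the detectable difference returned for a fixed dataset is larger than the clinically meaningful difference, the dataset is too small to answer the question reliably under these assumptions.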
A study needs to have an adequate sample size to be practically feasible, clinically valuable, and reliable. A low-power study fails to answer the research question reliably. An overpowered study is wasteful in terms of resources and can also make a clinically trivial difference appear statistically significant.
The power increases with the sample size, although the relationship is not linear. A power threshold of 80% is often used. This means that if the treatment truly has an effect of the assumed size, the results obtained from the study will be statistically significant 80% of the time. If a large sample size is not feasible for any reason, the power of the study will be compromised. Hence, many studies in rare diseases are conducted without calculating the power beforehand. Rather, some investigators calculate the power in the post hoc setting, that is, the power is calculated based on the observed effect after conducting the study. By demonstrating a low power with an observed meaningful effect, they tend to recommend lower power thresholds in such studies. In principle, the statistical power is about the population being sampled, whereas a post hoc power analysis assumes that the observed effect size is the real effect size. Post hoc power estimates have been shown to be illogical and misleading. Investigators do these post hoc power calculations either to justify the study design or to explain why the study did not yield a statistically significant effect. Ironically, post hoc power estimates do not serve these purposes validly. A post hoc power estimate is a justification akin to the statement, “If the axe chopped down the tree, it was sharp enough.”
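One way to see why post hoc power adds no new information is that, once the observed effect is treated as the true effect, the "observed power" is determined entirely by the P value. The sketch below illustrates this for a two-sided z-test, which is an assumption made here for simplicity; the function name `observed_power_from_p` is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def observed_power_from_p(p_value, alpha=0.05):
    """Post hoc ('observed') power of a two-sided z-test, computed by
    treating the observed test statistic as the true effect."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| implied by the P value
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)    # rejection threshold
    return STD_NORMAL.cdf(z_obs - z_crit) + STD_NORMAL.cdf(-z_obs - z_crit)


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.20, 0.50):
        print(f"P = {p:.2f}  observed power = {observed_power_from_p(p):.2f}")
```

Under this assumption, a result with P exactly equal to α has an observed power of about 50%, and every non-significant result has an observed power below that, so a low post hoc power merely restates that the result was not significant.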
Another scenario is when the power is calculated for a post hoc secondary data analysis of an existing dataset. The kind of research question here would be, “Would studying a dataset of this size provide sufficient evidence to reject the null hypothesis?” This is a somewhat redundant question as, in reality, that is the only dataset the investigator has, and therefore, the sample size is fixed. The only conceivable benefit of performing a power analysis here would be to avoid type I or type II errors. Although a secondary data analyst cannot gain access to a larger dataset in case the sample size of the existing dataset is inadequate, the investigator could avoid wasting time and risking misunderstandings by not doing the analysis at all.
When to do sample size calculations for a study?
Sample size calculations should ideally be done before the study, so that the correct answer to the research question is obtained in the most efficient way.
Sometimes, an interim power calculation is performed during an already ongoing study. However, the researcher needs to take care that the study is not stopped early merely because statistical significance has been attained at an interim look. Conversely, an interim power calculation can be used to avoid prolongation of a study in the case of life-saving or hazardous therapies. This provision must be a part of the research protocol a priori.
Rarely, after a negative study, one can retrospectively interrogate the data to assess if the study was underpowered and whether the negative study represents a false negative, that is, a type II error.
When can an underpowered study be acceptable?
Conventionally, the power should be at least 80%. However, a researcher may choose to conduct an underpowered study in certain situations, such as:
- In an exploratory analysis or a pilot study
- When there is only one available dataset
- When studying a very interesting question with limited time and resources
- In rare diseases and situations in which limited scientific knowledge is available
- In resource-limited settings
- In a laboratory study or a retrospective correlation study
The answer that the post hoc power estimates try to obtain may simply be found in the relevant CI. For instance, in a study in which the clinically meaningful effect size is 4 units and the 95% CI is (−0.3, +0.2), the researcher may reasonably conclude that no clinically meaningful effect exists, because every value in the CI is far smaller than 4 units. Conversely, if the CI is very broad (−13, +22), it is clearly quite unsafe to assume that the true effect size is zero simply because the result is statistically non-significant; it might be zero, but it might be highly positive or highly negative, and recognition of this uncertainty is necessary. Although both (−0.3, +0.2) and (−13, +22) correspond to non-significant hypothesis tests in that they include the null value of zero, they do not have the same practical implications.
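The reasoning in the two CI examples can be written as a simple decision rule. This is only a sketch: the function name `interpret_ci` and the category labels are hypothetical, and applying the 4-unit clinically meaningful effect to the second interval is an assumption made here for illustration.

```python
def interpret_ci(lower, upper, meaningful_effect, null_value=0.0):
    """Classify a CI against the null value and a clinically meaningful
    effect size (treated symmetrically around the null value)."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    # Significant if the interval does not contain the null value
    significant = lower > null_value or upper < null_value
    # All plausible values are smaller in magnitude than the meaningful effect
    within_margin = (null_value - meaningful_effect < lower
                     and upper < null_value + meaningful_effect)
    if significant and within_margin:
        return "significant; smaller than clinically meaningful effect"
    if significant:
        return "significant; clinically meaningful effect possible"
    if within_margin:
        return "non-significant; clinically meaningful effect ruled out"
    return "non-significant; inconclusive"


if __name__ == "__main__":
    print(interpret_ci(-0.3, 0.2, meaningful_effect=4))
    print(interpret_ci(-13, 22, meaningful_effect=4))
```

The first interval is classified as ruling out a clinically meaningful effect, whereas the second is classified as inconclusive, even though both correspond to non-significant tests.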
The Bayesian method is a different approach to judging the correctness of the findings from a study. In a Bayesian analysis, a researcher can specify prior probabilities for different population hypotheses and/or prior distributions over possible values of population parameters. Based on these priors, which must often be chosen subjectively, Bayes' theorem can then be used to calculate posterior probabilities. A more detailed discussion of the Bayesian method is beyond the scope of this article.
Augmenting the power without increasing the sample size
- Correction for non-adherence of the participants
- Adjustment for multiple comparisons
- Innovative study designs
A well-powered study is critical to generate reliable and reproducible results. In certain situations, a low-powered study may be acceptable. Prospective power analysis is the ideal and standard statistical requirement. Resorting to a post hoc power estimate is not a valid technique. When trying to interpret a failure to reject an important null hypothesis, extrapolating the information from the CI or a Bayesian analysis is better in many situations than using a post hoc power estimate.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
1. Gurmukhi to English Translation and Phonetic Transliteration of Siri Guru Granth Sahib.:148. Available from: https://www.srigurugranth.org/0148.html. [Last accessed on 2022 Jun 13]
2. Darling HS. Basics of statistics – 4: Sample size calculation (ii): A narrative review. Cancer Res Stat Treat. 2020;3:817–28
3. Darling HS. Basics of statistics-3: Sample size calculation – (i). Cancer Res Stat Treat. 2020;3:317–22
4. Schmidt SAJ, Lo S, Hollestein LM. Research techniques made simple: Sample size estimation and power calculation. J Invest Dermatol. 2018;138:1678–82
5. Darling HS. To “P” or not to “P”, that is the question: A narrative review on P value. Cancer Res Stat Treat. 2021;4:756–62
6. Darling HS. Are you confident about your confidence in confidence intervals? Cancer Res Stat Treat. 2022;5:139–44
7. Case LD, Ambrosius WT. Power and sample size. Methods Mol Biol. 2007;404:377–408
8. Hulley SB, Cummings SR, Browner WS, Grady D, Newman TB. Designing Clinical Research. 3rd ed. Philadelphia, PA: A Wolters Kluwer Business, Lippincott Williams & Wilkins; 2013
9. Anderson JL, Mulligan TS, Shen MC, Wang H, Scahill CM, Tan FJ, et al. mRNA processing in mutant zebrafish lines generated by chemical and CRISPR-mediated mutagenesis produces unexpected transcripts that escape nonsense-mediated decay. PLoS Genet. 2017;13:1007105
10. Dziak JJ, Dierker LC, Abar B. The interpretation of statistical power after the data have been gathered. Curr Psychol. 2020;39:870–7
11. Jones SR, Carley S, Harrison M. An introduction to power and sample size estimation. Emerg Med J. 2003;20:453–8
12. Griffith KN, Feyman Y. Amplifying the noise: The dangers of post hoc power analyses. J Surg Res. 2021;259:9–11
13. Bielby WT, Kluegel JR. Statistical inference and statistical power in applications of the general linear model. Sociol Methodol. 1977;8:283–312
14. Kraemer H, Blasey C. How Many Subjects? 2nd ed. SAGE Publications Ltd.; 2016
15. Bierman EJ, Comijs HC, Gundy CM, Sonnenberg C, Jonker C, Beekman AT. The effect of chronic benzodiazepine use on cognitive functioning in older persons: Good, bad or indifferent? Int J Geriatr Psychiatry. 2007;22:1194–200
16. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011;22:1359–66
17. Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, du Sert NP, et al. A manifesto for reproducible science. Nat Hum Behav. 2017;1:0021
18. Detsky AS, Sackett DL. When was a “negative” clinical trial big enough? How many patients you needed depends on what you found. Arch Intern Med. 1985;145:709–12
19. Hoenig JM, Heisey DM. The abuse of power: The pervasive fallacy of power calculations for data analysis. Am Statist. 2012;55:19–24
20. Schulz KF, Grimes DA. Multiplicity in randomised trials II: Subgroup and interim analyses. Lancet. 2005;365:1657–61
21. Lakens D. Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Soc Psychol Personal Sci. 2017;8:355–62