
Is There Truly “No Significant Difference”?

Underpowered Randomized Controlled Trials in the Orthopaedic Literature

Abdullah, Leath BS1; Davis, Daniel E. MD, MSc2; Fabricant, Peter D. MD, MPH3; Baldwin, Keith MD, MPH, MSPT3; Namdari, Surena MD, MSc4

doi: 10.2106/JBJS.O.00012
The Orthopaedic Forum

Background: Randomized controlled trials (RCTs) are considered the gold standard in evidence-based medicine. Underpowered RCTs that describe comparative outcomes without significance are of questionable benefit. The present study hypothesizes that a substantial proportion of RCTs in the orthopaedic literature that do not note significant differences between groups are inadequately powered.

Methods: Using the ISI Web of Science database, we searched all English-language journals in the orthopaedic category for RCTs published from January 2012 to December 2013. Qualifying articles were analyzed with regard to whether the null hypothesis was rejected (a positive study) for the primary outcome or if it was not (a negative study), whether a power analysis was performed, and whether the study was adequately powered. We performed a power analysis based on the primary outcome or outcomes of interest for the studies that did not describe a power analysis.

Results: After inclusion and exclusion criteria were applied, 456 RCTs were selected for complete review. Of those studies, 215 (47.1%) had negative findings and 241 (52.9%) had positive findings for the primary outcomes. Twenty-five studies that failed to reject the null hypothesis had noted inadequate power. On the basis of our own power analyses, we found an additional thirty-five negative studies without power calculations to be underpowered. In total, sixty (27.9%) of the 215 negative studies were underpowered. In a binary logistic regression model, only the journal impact factor was a significant predictor of whether a study was underpowered.

Conclusions: If an RCT lacks adequate statistical power to distinguish a true absence of a clinically meaningful difference between groups from a missed difference, there is an unacceptable risk of inappropriately failing to reject the null hypothesis. The present study found that a sizable proportion of RCTs in orthopaedic surgery in which the null hypothesis is not rejected are inadequately powered. Researchers should consider this when designing clinical trials, and journal editors and reviewers should be wary of underpowered RCTs when considering manuscripts for publication.

1Drexel University College of Medicine, 2900 West Queen Lane, Philadelphia, PA 19129

2Department of Orthopaedic Surgery, Thomas Jefferson University Hospital, 1025 Walnut Street, Suite 516, Philadelphia, PA 19107

3Department of Orthopaedic Surgery, Children’s Hospital of Philadelphia, 2 Wood Building, Philadelphia, PA 19104

4The Rothman Institute at Thomas Jefferson University, 925 Chestnut St, 5th Floor, Philadelphia, PA 19107. E-mail address: Surena.Namdari@rothmaninstitute.com

Peer Review: This article was reviewed by the Editor-in-Chief and one Deputy Editor, and it underwent blinded review by two or more outside experts. It was also reviewed by an expert in methodology and statistics. The Deputy Editor reviewed each revision of the article, and it underwent a final review by the Editor-in-Chief prior to publication. Final corrections and clarifications occurred during one or more exchanges between the author(s) and copyeditors.

A well-designed randomized controlled trial (RCT) is generally considered the gold standard in clinical research and is assigned the highest level of evidence (Level I). These studies are often highly cited in evidence-based clinical decision-making and are highly valued by journal editors. The conclusion of rejecting or retaining the null hypothesis is particularly dependent on statistical power. A priori power analyses are performed before beginning an RCT in order to establish the sample size required to differentiate a true lack of a clinically meaningful difference from a false-negative finding (in which a difference may exist but the study is underpowered to detect it) (Fig. 1). By convention, 80% power is considered adequate; in other words, a 20% chance of falsely concluding that there is no difference between study groups is accepted. This probability of a false-negative conclusion is the type-II (beta) error1. Beyond the methodological importance of achieving adequate power, multiple studies have argued that underpowered RCTs are unethical2,3. If an RCT falsely reports no clinical difference, clinical decision-making could be misguided. It has been proposed that many studies in the orthopaedic literature have inadequate sample sizes and thus inaccurately describe an absence of clinically meaningful differences when one may in fact exist4.
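The a priori calculation described above can be sketched with the standard normal-approximation formula for a two-arm comparison of means, n = 2((z_(1-alpha/2) + z_(1-beta))/d)^2 per group. This is a minimal illustration of the convention, not the method of any study reviewed here; the function name and defaults are ours.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per arm for a two-sample,
    two-sided comparison of means at the given alpha and power."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided critical value (1.96 for alpha = 0.05)
    z_beta = z(power)            # quantile corresponding to the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# A "medium" standardized effect of 0.5 SD at 80% power needs ~63 subjects per arm
print(n_per_group(0.5))  # 63
```

Note how quickly the requirement grows as the detectable effect shrinks, which is why small trials so often end up underpowered for modest but clinically relevant differences.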

Fig. 1

The purpose of the current study was to systematically evaluate RCTs in a representative sample from the recent orthopaedic literature and determine (1) the prevalence of power analyses in the orthopaedic literature, (2) the proportion of negative studies that were adequately powered compared with those not adequately powered, and (3) the factors that were predictive of adequate statistical power in negative studies. We hypothesized that a substantial proportion of RCTs in the orthopaedic literature that do not note significant differences between groups are inadequately powered.


Materials and Methods

Search Strategy

The methodology utilized in the present study has been previously used in the rheumatology literature5. In order to identify orthopaedic RCTs, we searched the ISI Web of Science under the orthopaedic category for English-language studies, published from January 2012 to December 2013, containing the terms “random,” “randomized,” “randomization,” “randomly,” “controlled,” or “control.” The resulting titles were then reviewed by one author (D.E.D.) to eliminate duplicates or studies that were not RCTs on the basis of the title. The abstracts were then further reviewed by two authors (D.E.D. and S.N.), and nonprospective RCTs were rejected. Disagreements between these two authors were reviewed by the entire team and resolved by consensus. Pilot studies and secondary reviews of previous RCT data were excluded. The full texts of the remaining studies were thoroughly reviewed to confirm that the studies were prospective RCTs, and they were assessed according to the criteria listed below.


Evaluation for RCT Quality

The quality of each study was determined by the calculation of the Jadad score for each investigation4. The Jadad score was calculated by giving 1 point for the presence of the following components: randomization, double-blinding, and a description of withdrawn subjects. Additionally, by reviewing the acknowledgment section and statistics section of each article, we noted whether a statistician had been involved in the paper or if one of the primary authors held an MPH or PhD in statistics.
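The three-component scoring rule described above reduces to a simple tally. The sketch below mirrors the modified 3-point scheme as applied in this study (the full Jadad scale awards up to 5 points); the function name is ours.

```python
def jadad_score(randomized: bool, double_blinded: bool, withdrawals_described: bool) -> int:
    """Modified 3-point Jadad score as applied in the text:
    1 point each for randomization, double-blinding, and a
    description of withdrawn subjects."""
    return sum([randomized, double_blinded, withdrawals_described])

# A randomized, double-blinded trial that does not describe withdrawals
print(jadad_score(True, True, False))  # 2
```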


Data Extraction

The studies were reviewed by one author (L.A.) for data extraction. The following variables were recorded: presence of a significant treatment effect on the primary outcome(s) (i.e., a positive study) or no significant difference (i.e., a negative study), level of evidence, study type, subject number, pathology studied, outcome measures, statistical testing methods, reported statistical power (or beta error), reported sample-size adequacy, statistician involvement, subspecialty, and journal impact factor. Positive studies were those in which a significant result (p < 0.05) was noted for the primary outcome of interest and in which, therefore, a type-II error could not have been committed. Negative studies were those in which the authors failed to reject the null hypothesis (Fig. 1).


Post Hoc Power Analysis

The primary outcome variable(s) in each study were defined as those explicitly stated by the authors in the specific aims. Articles claiming no significant difference between study groups in terms of the primary outcome variables were reviewed for the presence of a power analysis. If one had not been performed, we performed a power analysis based on the primary outcome of interest utilizing one of the following methods, as appropriate. If a published minimal clinically important difference (MCID) existed for the primary outcome of interest, we used the MCID as the target difference in the power analysis. The MCID is the smallest difference in an outcome score that a patient perceives as beneficial. MCIDs can provide the basis for determining whether significant differences in outcomes after treatment are clinically relevant. For example, Tashjian et al. determined that patients with rotator cuff disease who are treated without surgery and have a 2-point change in the Simple Shoulder Test score or a 12 to 17-point change in the American Shoulder and Elbow Surgeons score experience a clinically important change in self-assessed outcome6. If no MCID existed for the primary outcome of interest or if an article had multiple outcomes, an MCID of one-half of a standard deviation was used, on the basis of data suggesting that one-half of a standard deviation correlates well with the MCID in health-related quality-of-life studies7,8. The difference between means was calculated, and the standard deviation was derived using an effect size of 0.5 (one-half of a standard deviation)8. If an article investigated multiple primary outcomes, a Bonferroni correction was applied: the conventionally accepted alpha (0.05) was divided by the number of primary outcomes, and that value was converted to the appropriate two-sided z-score, yielding a new, smaller p value threshold.
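The post hoc calculation described above, including the half-SD fallback and the Bonferroni adjustment for multiple primary outcomes, can be sketched with a normal approximation for a two-sample comparison. This is an illustrative reconstruction under stated assumptions, not the authors' exact procedure; the function name and defaults are ours.

```python
from math import sqrt
from statistics import NormalDist

def posthoc_power(n_per_arm: int, mcid: float, sd: float,
                  alpha: float = 0.05, n_outcomes: int = 1) -> float:
    """Approximate achieved power of a two-sample comparison to detect
    a difference of `mcid`, using a Bonferroni-adjusted two-sided alpha
    when several primary outcomes were tested."""
    nd = NormalDist()
    adj_alpha = alpha / n_outcomes            # Bonferroni correction
    z_alpha = nd.inv_cdf(1 - adj_alpha / 2)   # stricter two-sided critical value
    d = mcid / sd                             # standardized effect size
    return nd.cdf(d * sqrt(n_per_arm / 2) - z_alpha)

# With no published MCID, the text falls back to half a standard deviation (d = 0.5):
print(round(posthoc_power(63, mcid=0.5, sd=1.0), 2))  # 0.8
```

With three primary outcomes the same trial's power drops noticeably, which is the point of applying the correction before judging adequacy.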


Statistical Methods

Continuous variables were compared using independent-sample t tests. Binary or categorical variables were compared with the Pearson chi-square test for independence. A receiver operating characteristic (ROC) analysis was performed on continuous variables found to be significant in order to dichotomize them for the regression analysis. Binary logistic regression was used to determine independent predictors of whether a study was underpowered or truly negative. We initially included all variables and then used backward likelihood-ratio methodology to eliminate variables in a stepwise fashion. The criterion for retention in the final model was p < 0.10. All statistics were calculated with SPSS software.
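A common way an ROC analysis selects a dichotomization cutoff is Youden's J statistic (sensitivity + specificity - 1); the article does not state which criterion was used, so the sketch below, with toy impact-factor data, is purely illustrative.

```python
def youden_cutoff(scores, labels):
    """Return the cutoff on a continuous variable that maximizes
    Youden's J = sensitivity + specificity - 1, treating values
    at or above the cutoff as test-positive."""
    best_j, best_cut = -1.0, None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y)
        fn = sum(1 for s, y in zip(scores, labels) if s < cut and y)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and not y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and not y)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut

# Toy data: journal impact factors, labeled 1 if the negative study was adequately powered
impact = [0.5, 1.0, 1.8, 2.2, 2.4, 3.0, 3.5]
adequate = [0, 0, 0, 1, 1, 1, 1]
print(youden_cutoff(impact, adequate))  # 2.2
```

The chosen cutoff then serves as the binary predictor in the logistic regression, as in the impact-factor threshold reported in the Results.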


Source of Funding

No internal or external funding was obtained for the preparation of this manuscript.


Results

The initial search resulted in 1008 titles. After inclusion and exclusion criteria were applied, 456 full-text articles were reviewed (Fig. 2). The articles were equally distributed between 2012 and 2013 and covered all areas of musculoskeletal care. Of the 456 articles included, 241 (52.9%) noted significant differences in the primary outcome between study groups (positive studies) and 215 (47.1%) noted no difference (negative studies) (Table I). Of the 215 negative studies, twenty-five had performed a power analysis and noted that the study was underpowered. An additional forty-four (20.5%) of the 215 negative studies had not performed a power analysis; our power analysis revealed that thirty-five of these had an inadequate sample size and were underpowered. As a result, sixty (27.9%) of the 215 negative studies were underpowered and thus could not determine whether their lack of a significant difference between study groups represented a true absence of clinically important differences or a difference that existed but was missed because of low statistical power (Table II). Of all 456 studies, sixty-nine (15.1%) were negative studies that had inadequate or unreported power.

Fig. 2

TABLE I - A Summary of All 456 Randomized Controlled Trials Reviewed in the Current Study
Characteristic No. (%) of Studies
Study conclusions
 Positive findings 241 (52.9)
 Negative findings 215 (47.1)
Study level
 I 256 (56.1)
 II 199 (43.6)
 III* 1 (0.2)
Study design
 Crossover 5 (1.1)
 Parallel 451 (98.9)
Subspecialty
 Adult reconstruction 119 (26.1)
 Trauma 91 (20.0)
 Rehabilitation 58 (12.7)
 Sports 50 (11.0)
 Spine 46 (10.1)
 Shoulder and elbow 27 (5.9)
 Miscellaneous 22 (4.8)
 Hand 18 (3.9)
 Foot and ankle 12 (2.6)
 Pediatrics 9 (2.0)
 Podiatry 3 (0.7)
 Tumor 1 (0.2)
*While RCTs are by definition Level I or II, one study self-reported as a Level-III study and thus was reported here as such.

TABLE II - Summary of 456 RCTs in Which an a Priori Power Analysis Was Reported and Those That Were Underpowered or Had No Power Analysis Reported
Power Analysis No. (%) of Studies
Power analysis reported
 Yes—adequate 299 (65.6)
 Yes—underpowered 34 (7.5)
 No 123 (27.0)
Underpowered studies or no power analysis reported (n = 157)
 Positive findings (rejected the null) 88 (56.1)
 Negative findings (failed to reject the null) 69 (43.9)

The mean sample size, study design, mean length of follow-up, mean Jadad score, statistician involvement, and mean journal impact factor are summarized in Table III. There was no difference in Jadad scores between RCTs with positive findings and those with negative findings (2.071 versus 2.060; p = 0.860). The mean impact factor of the journals whose articles were included in this study was 2.42. The mean impact factor was slightly higher for negative studies overall (2.5) than for positive studies (2.3; p = 0.042). The mean impact factor for the 155 adequately powered negative studies was significantly higher than that for the sixty underpowered negative studies (2.6 versus 2.19; p = 0.001).

TABLE III - A Summary of 215 Negative Studies with Power Noted or Calculated and Comparisons of the Underpowered and Adequately Powered Negative Studies
Negative Studies Underpowered (N = 60) Adequate Power (N = 155) P Value*
Study level (no.[%])
 I 12 (20.0) 110 (71.0) <0.001*
 II 48 (80.0) 45 (29.0) <0.001*
Mean no. of subjects (range) 72.0 (20-200) 112.5 (18-1120) 0.009
Mean follow-up (range) (mo) 26.3 (0-144) 18.4 (0-149) 0.060
Double-blind design (no.[%])
 Yes 12 (20.0) 42 (27.1) 0.282
 No 48 (80.0) 113 (72.9) 0.282
Mean Jadad score (and stand. dev.) 1.950 ± 0.59 2.103 ± 0.59 0.091
Input from statistician (no.[%])
 Yes 12 (20.0) 49 (31.6) 0.09
 No 48 (80.0) 106 (68.4) 0.09
Mean journal impact factor (range) 2.19 (0.18-4.44) 2.6 (0.08-4.44) 0.001
*P values were calculated using a two-tailed Student t test for continuous variables and the chi-square test for count variables; p < 0.05 was considered significant.

In a binary logistic regression model, only the journal impact factor was identified as an independent predictor of whether a negative study was underpowered (p = 0.002). For each increase of 1 point in the impact factor, the odds of a study being adequately powered increased by 1.82 times (95% confidence interval [CI]: 1.26, 2.63). An ROC analysis showed an optimal inflection point of an impact factor of 2.15. Journals with an impact factor of ≥2.15 had a 2.35 (95% CI: 1.26, 4.37) times greater odds of publishing an adequately powered negative study compared with journals with impact factors below this marker.
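The reported odds ratio of 1.82 per impact-factor point corresponds to a logistic-regression coefficient of ln(1.82), and a Wald interval recovers the published 95% CI. The standard error below (~0.188) is back-solved from that interval, not reported in the text, so this is a reconstruction for illustration only.

```python
from math import exp, log

def odds_ratio_ci(beta: float, se: float, z: float = 1.96):
    """Convert a logistic-regression coefficient and its standard error
    into an odds ratio with a Wald 95% confidence interval."""
    return exp(beta), (exp(beta - z * se), exp(beta + z * se))

# Coefficient ln(1.82); SE of ~0.188 inferred from the published CI (assumption)
or_, (lo, hi) = odds_ratio_ci(log(1.82), 0.188)
print(round(or_, 2), round(lo, 2), round(hi, 2))  # 1.82 1.26 2.63
```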


Discussion

As technological advancement continues and the cost of health care increases, creating high-quality clinical evidence through RCTs is paramount to supporting best practices. Underpowered RCTs are a concern both because they may inappropriately influence clinical decision-making and because they may unethically place patients at risk without the benefit of elucidating a meaningful conclusion3. The present study of RCTs in the English-language orthopaedic literature found that 27.9% of articles that did not find a difference in their primary outcome were in fact underpowered to support that conclusion.

Underpowered RCTs are not limited to the orthopaedic literature. Keen et al. performed a similar examination of RCTs in the rheumatology literature5. Of 205 Phase-III RCTs, eighty-six studies (41.9%) were negative or indeterminate. Thirty-seven studies (18.0%) noted sample-size calculations, and thirty-three (16.1%) were adequately powered. Forty-nine of the negative or indeterminate studies did not describe a power analysis; of those, ten were adequately powered, twenty-seven were underpowered, and twelve did not provide enough data to perform an analysis. The results of the current study are consistent with those findings. Similarly, in the oncology literature, Bedard et al. examined RCTs with negative findings that were presented at the annual meetings of the American Society of Clinical Oncology from 1995 to 20039. Four hundred and twenty-three negative studies underwent a post hoc analysis of their power to detect small, medium, and large effect sizes. The authors found that 89.4% of the studies were underpowered to detect a small difference between groups, 67.4% to detect a medium difference, and 44.9% to detect a large difference.

Improperly performed RCTs have been examined in the orthopaedic literature, particularly with respect to the Consolidated Standards of Reporting Trials (CONSORT) criteria10. The CONSORT criteria were defined in a Special Communication in The Journal of the American Medical Association in 1996 and included a twenty-one-item checklist pertaining to the methods, results, and conclusion sections of RCT study designs11. The goal was to help guide researchers in planning meaningful and complete study designs. The criteria and checklist were further updated and amended in 2010 and remain a continually evolving guideline12. In a 2011 review of the spine literature, Naunheim et al. examined thirty-two RCTs in three prominent spine journals to evaluate their adherence to the CONSORT criteria13. Interestingly, when the abstracts and the body of the text were scored separately, adherence was 78% versus 60%, respectively. McCormick et al. reviewed RCTs focused on rotator cuff disorders using the CONSORT criteria and the Jadad score to determine the quality of the studies10. In fifty-four studies identified in the six orthopaedic journals with the highest impact factors, they found that 66% were considered high quality (a Jadad score of 3), and the mean CONSORT criteria score was 70%. There was an association between a higher Jadad score and a higher CONSORT criteria score10. Among the deficiencies listed, the absence of a power analysis, noted in 64.8% of the studies, was one of the most common. Importantly, we found that underpowered studies were self-reported as lower levels of evidence and were published in journals with lower impact factors, which suggests that the review process of higher-impact journals may place greater emphasis on the power analyses of submitted articles.

The present study does have limitations. When calculating power for negative studies in which no power calculation was reported, we used one-half of a standard deviation when MCID information was not available. This methodology is accepted in measures of health-related quality of life, but may not be as accurate as power analyses performed for specific data sets8. Ideally, as suggested by the International Committee of Medical Journal Editors, primary outcomes and power analyses would have been available in RCT registries (e.g., clinicaltrials.gov). Unfortunately, no study that neglected to note a power calculation was found in any RCT registry14. Additionally, while the Jadad scale is a well-accepted tool to evaluate RCTs, further analysis using the CONSORT criteria would have aided in determining the quality of the reviewed studies. Finally, in the present investigation, studies that showed a significant difference for primary outcome variables were considered to be adequately powered. It is possible that a portion of these studies would have been underpowered if a significant difference between groups had not been shown.

In conclusion, the present study provides a comprehensive analysis of RCTs reported in the orthopaedic literature over a recent two-year period. These findings echo concerns previously raised in other medical specialties that studies with high levels of evidence frequently are not sufficiently powered to draw accurate conclusions from the reported data10,13,15. This creates both ethical dilemmas in data reporting and the clinical difficulty of incorporating such research into daily practice. When reporting a lack of clinical differences between study groups, an adequate sample size must be ensured in order to keep the type-II error at an acceptable level; yet this was not the case in 27.9% of the negative studies analyzed. As the clinical literature in orthopaedics continues to expand, researchers must strive to achieve adequate power and thus add relevant findings (both positive and negative) on which to base clinical decisions. In addition, orthopaedic surgeons should pay particular attention to whether studies are adequately powered to support their negative findings, and journal editors and reviewers should be wary of underpowered RCTs when considering manuscripts for publication.

Disclosure: None of the authors received payments or services, either directly or indirectly (i.e., via his or her institution), from a third party in support of any aspect of this work. One or more of the authors, or his or her institution, has had a financial relationship, in the thirty-six months prior to submission of this work, with an entity in the biomedical arena that could be perceived to influence or have the potential to influence what is written in this work. In addition, one or more of the authors has a patent or patents, planned, pending, or issued, that is broadly relevant to the work. No author has had any other relationships, or has engaged in any other activities, that could be perceived to influence or have the potential to influence what is written in this work. The complete Disclosures of Potential Conflicts of Interest submitted by authors are always provided with the online version of the article.


References

1. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. 3rd ed. Upper Saddle River, NJ: Pearson/Prentice Hall; 2008.
2. Edwards SJ, Lilford RJ, Braunholtz D, Jackson J. Why “underpowered” trials are not necessarily unethical. Lancet. 1997 Sep 13;350(9080):804-7.
3. Halpern SD, Karlawish JHT, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002 Jul 17;288(3):358-62.
4. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, McQuay HJ. Assessing the quality of reports of randomized clinical trials: Is blinding necessary? Control Clin Trials. 1996 Feb;17(1):1-12.
5. Keen HI, Pile K, Hill CL. The prevalence of underpowered randomized clinical trials in rheumatology. J Rheumatol. 2005 Nov;32(11):2083-8.
6. Tashjian RZ, Deloach J, Green A, Porucznik CA, Powell AP. Minimal clinically important differences in ASES and Simple Shoulder Test scores after nonoperative treatment of rotator cuff disease. J Bone Joint Surg Am. 2010 Feb;92(2):296-303.
7. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
8. Norman GR, Sloan JA, Wyrwich KW. Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation. Med Care. 2003 May;41(5):582-92.
9. Bedard PL, Krzyzanowska MK, Pintilie M, Tannock IF. Statistical power of negative randomized controlled trials presented at American Society for Clinical Oncology annual meetings. J Clin Oncol. 2007 Aug 10;25(23):3482-7.
10. McCormick F, Cvetanovich GL, Kim JM, Harris JD, Gupta AK, Abrams GD, Romeo AA, Provencher MT. An assessment of the quality of rotator cuff randomized controlled trials: utilizing the Jadad score and CONSORT criteria. J Shoulder Elbow Surg. 2013 Sep;22(9):1180-5. Epub 2013 Mar 17.
11. Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, Pitkin R, Rennie D, Schulz KF, Simel D, Stroup DF. Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA. 1996 Aug 28;276(8):637-9.
12. Schulz KF, Altman DG, Moher D; CONSORT Group. CONSORT 2010 Statement: Updated guidelines for reporting parallel group randomised trials. J Clin Epidemiol. 2010 Aug;63(8):834-40. Epub 2010 Mar 25.
13. Naunheim MR, Walcott BP, Nahed BV, Simpson AK, Agarwalla PK, Coumans JV. The quality of randomized controlled trial reporting in spine literature. Spine (Phila Pa 1976). 2011 Jul 15;36(16):1326-30.
14. De Angelis CD, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, Kotzin S, Laine C, Marusic A, Overbeke AJ, Schroeder TV, Sox HC, Van Der Weyden MB; International Committee of Medical Journal Editors. Is this clinical trial fully registered?—A statement from the International Committee of Medical Journal Editors. N Engl J Med. 2005 Jun 9;352(23):2436-8. Epub 2005 May 23.
15. Dodwell E, Dua S, Dulai SK, Astone K, Mulpuri K. The quality of randomized controlled trials in pediatric orthopaedics: Are we improving? J Pediatr Orthop. 2015 Jul-Aug;35(5):536-45.
Copyright © 2015 by The Journal of Bone and Joint Surgery, Incorporated