The Oxford Levels of Evidence are now widely used to classify scientific publications in orthopaedic surgery journals, such as this one, in order to promote critical appraisal of the data according to the relative rigor of various study designs1-5. This process has been criticized for undervaluing useful data from studies with lower levels of evidence6. Furthermore, studies with the highest level of evidence (prospective randomized trials) may be overvalued, and serious study flaws may be overlooked because of a designation of Level I or II.
To test the hypothesis that a high level of evidence does not ensure a high-quality scientific study, we used quantitative measures of study quality to evaluate prospective randomized controlled therapeutic trials (Level-I and Level-II studies) of treatments of lateral epicondylitis. This clinical issue was selected because of the relatively large number of Level-I and II studies available.
Materials and Methods
A search of PubMed7, with use of various combinations of the terms “lateral,” “epicondylitis,” “tennis,” “elbow,” and “pain,” identified 861 articles addressing lateral epicondylitis spanning from September 1950 to December 2005. From this list, we identified fifty-four therapeutic prospective randomized clinical trials. Two observers (one with an MD degree and additional postgraduate training in epidemiology and statistics [S.L.-C.] and the other a premedical student with no specific epidemiological training [J.C.]) classified each trial according to three systems: the Oxford Levels of Evidence1, a modified Coleman Methodology Score8-10, and the revised CONSORT (Consolidated Standards of Reporting Trials) score11.
Each observer assigned an Oxford Level of Evidence. With this system, Level-I therapeutic studies are defined as high-quality randomized controlled trials demonstrating a significant difference or no significant difference but with narrow confidence intervals. Level-II studies are lower-quality randomized controlled trials (e.g., with <80% follow-up, no blinding, improper randomization, and so on).
The Coleman Methodology Score was developed as a means of grading the methodology of clinical studies on patellar and Achilles tendinopathy8-10. With this instrument, a maximum possible score of 100 points indicates that chance, biases, and confounding factors had generally been avoided in the study8. To accommodate the topic and objectives of the present investigation, it was necessary to modify both the categories and the point allocation of the Coleman Methodology Score. These modifications consisted of (1) expanding the inclusion criteria category to include a description of criteria and enrollment rates; (2) adding a category to assess whether the statistical power of the study and the methods used to calculate that power were reported; (3) adding a category to assess the extent of randomization and whether it was blinded; (4) expanding the patient follow-up category to include the percentage of patients who were retained in the study; (5) adding a clinical effect measurement category to account for whether the authors of the study had reported the effect size, relative risk reduction, or absolute risk reduction; and (6) adding categories to address each study's alpha error, patient analysis, level of blinding, co-interventions, group comparability, and number of patients to treat (Table I). This modified score was scaled to result in values between 0 and 100. The categorical rating was considered to be excellent if the score was 85 to 100 points, good if it was 70 to 84 points, fair if it was 55 to 69 points, and poor if it was ≤54 points.
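The categorical cutoffs for the modified Coleman Methodology Score can be written as a simple lookup. The following Python sketch is illustrative only: the function name coleman_rating is hypothetical and was not part of the study's analysis, and the treatment of non-integer scores (assigned to the band below the next cutoff) is an assumption, since the cutoffs are stated for whole points.

```python
def coleman_rating(score: float) -> str:
    """Map a modified Coleman Methodology Score (0 to 100) to a categorical rating.

    Cutoffs follow the text: excellent 85-100, good 70-84, fair 55-69, poor <=54.
    Assumption: a fractional score falls into the band below the next cutoff.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must lie between 0 and 100")
    if score >= 85:
        return "excellent"
    if score >= 70:
        return "good"
    if score >= 55:
        return "fair"
    return "poor"


if __name__ == "__main__":
    # Illustrative inputs only; these are not scores of individual trials.
    for example_score in (43, 59, 80):
        print(example_score, coleman_rating(example_score))
```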
The revised CONSORT statement consisted of a twenty-two-item checklist pertaining to the content within various sections of a report on a prospective randomized trial11. The purpose of the CONSORT checklist is to provide a means with which to compare the conduct of trials and the validity of their results11. For each of the twenty-two items on the checklist, a trial was given 1 point if it met the criteria of the CONSORT statement and 0 points if it did not. Thus, the maximum possible score was 22 points. The categorical rating was considered to be excellent if the score was 18 to 22 points, good if it was 13 to 17 points, fair if it was 8 to 12 points, and poor if it was ≤7 points (Table II).
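The CONSORT scoring rule described above (1 point per checklist item met, 22 items in total) can be sketched as follows. This Python example is a minimal illustration under stated assumptions: the function names are hypothetical, and each item is represented as a simple true/false judgment, which is how the scoring is described here but not necessarily how the observers recorded it.

```python
from typing import Sequence

CONSORT_ITEM_COUNT = 22  # number of items on the revised CONSORT checklist


def consort_score(items_met: Sequence[bool]) -> int:
    """Return the total CONSORT score: 1 point per item met, 0 otherwise."""
    if len(items_met) != CONSORT_ITEM_COUNT:
        raise ValueError("expected one judgment for each of the 22 checklist items")
    return sum(1 for met in items_met if met)


def consort_rating(score: int) -> str:
    """Map a CONSORT score to the categorical rating used in the text.

    Cutoffs: excellent 18-22, good 13-17, fair 8-12, poor <=7.
    """
    if not 0 <= score <= CONSORT_ITEM_COUNT:
        raise ValueError("score must lie between 0 and 22")
    if score >= 18:
        return "excellent"
    if score >= 13:
        return "good"
    if score >= 8:
        return "fair"
    return "poor"


if __name__ == "__main__":
    # Hypothetical trial that satisfies 12 of the 22 items.
    hypothetical_trial = [True] * 12 + [False] * 10
    total = consort_score(hypothetical_trial)
    print(total, consort_rating(total))
```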
The Student t test was used to assess the interobserver reliability of the scoring systems by testing the differences in the continuous scores. The kappa statistic was used to evaluate the interobserver reliability of the categorical ratings. The kappa statistic is a chance-corrected proportion of agreement (i.e., reliability) calculated when independent observers make categorical classifications. A categorical rating of the reliability was assigned with use of the benchmarks for the kappa statistic described by Landis and Koch12. Descriptive frequency statistics were obtained for each nominal variable and expressed as percentages for each observer and as a global average. Standard descriptive statistics with measures of dispersion were also applied to continuous variables. Statistical analysis was performed with use of SPSS version 14.0 software (SPSS, Chicago, Illinois).
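For readers unfamiliar with the kappa statistic, the calculation can be sketched in a few lines. The Python example below is not the SPSS procedure used in this study. It is a minimal sketch of the standard unweighted (Cohen) kappa for two observers, paired with the Landis and Koch descriptive bands. The function names and the six example ratings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(ratings_a: Sequence[Hashable], ratings_b: Sequence[Hashable]) -> float:
    """Unweighted kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both observers must rate the same non-empty set of items")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: product of the two observers' marginal proportions, per category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa: float) -> str:
    """Descriptive strength of agreement according to Landis and Koch."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Hypothetical categorical ratings of six trials by two observers.
    observer_1 = ["good", "fair", "fair", "poor", "fair", "poor"]
    observer_2 = ["good", "fair", "poor", "poor", "fair", "fair"]
    k = cohen_kappa(observer_1, observer_2)
    print(round(k, 2), landis_koch_label(k))
```

With these bands, the kappa values of 0.73 and 0.53 reported in this article correspond to substantial and moderate agreement, respectively.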
Results
Oxford Levels of Evidence
There was substantial agreement in the classifications according to the Oxford Levels of Evidence (κ = 0.73, p < 0.01). Observer 1 believed that five of the fifty-four publications fulfilled the criteria to be considered Level I and forty-nine could be considered Level II, whereas observer 2 considered three of the publications to be Level I and fifty-one to be Level II.
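The cross-tabulation of the two observers' Level-I and Level-II assignments is not reported, but the reported kappa can be checked against the marginal counts (five and three Level-I assignments among fifty-four trials). The Python sketch below, with hypothetical function names, computes kappa for each possible number of trials that both observers rated Level I. It is offered as a consistency check under that assumption, not as a reproduction of the original analysis.

```python
N_TRIALS = 54          # trials classified by both observers
OBS1_LEVEL_I = 5       # Level-I assignments by observer 1
OBS2_LEVEL_I = 3       # Level-I assignments by observer 2


def kappa_2x2(both_i: int, only_obs1_i: int, only_obs2_i: int, both_ii: int) -> float:
    """Unweighted kappa for a 2 x 2 agreement table (Level I versus Level II)."""
    n = both_i + only_obs1_i + only_obs2_i + both_ii
    observed = (both_i + both_ii) / n
    obs1_i = both_i + only_obs1_i
    obs2_i = both_i + only_obs2_i
    # Chance agreement from the marginal totals of each observer.
    expected = (obs1_i * obs2_i + (n - obs1_i) * (n - obs2_i)) / n ** 2
    return (observed - expected) / (1 - expected)


def kappa_by_overlap() -> dict:
    """Kappa (rounded to 2 decimals) for each possible count of trials rated Level I by both."""
    results = {}
    for both_i in range(min(OBS1_LEVEL_I, OBS2_LEVEL_I) + 1):
        only_obs1_i = OBS1_LEVEL_I - both_i
        only_obs2_i = OBS2_LEVEL_I - both_i
        both_ii = N_TRIALS - both_i - only_obs1_i - only_obs2_i
        results[both_i] = round(kappa_2x2(both_i, only_obs1_i, only_obs2_i, both_ii), 2)
    return results


if __name__ == "__main__":
    for overlap, kappa in kappa_by_overlap().items():
        print(f"trials rated Level I by both observers: {overlap}, kappa = {kappa}")
```

Under these assumptions, only the case in which all three of observer 2's Level-I trials were also rated Level I by observer 1 yields a kappa of 0.73, which matches the reported value.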
Modified Coleman Methodology Score
There was no significant difference between the mean modified Coleman Methodology Scores of the two observers (59.8 compared with 59.0 points; p = 0.61). The two observers had substantial agreement, according to the standards of Landis and Koch12, with regard to their assignment of categorical ratings based on the modified Coleman Methodology Score (κ = 0.73; p < 0.01).
Combining the assessments of both observers resulted in a mean total score of 59 points (range, 43 to 80 points) on the scale of 100 points. Observer 1 and observer 2 assigned a good categorical rating of quality to 13% and 11% of the studies, respectively; a fair rating to 59% and 56%; and a poor rating to 28% and 33%. None of the studies were rated excellent.
Both observer 1 and observer 2 found that many articles (61% and 63%, respectively) failed to describe how patients had been enrolled or the percentage of enrollment. Both found an enrollment rate of <80% in many of the trials (15% according to both observers). Description of the enrollment process was often absent (in 15% of the trials according to both observers). Another shortcoming of the reports on the trials is that most (52% and 54%) failed to state how the power and the sample size were calculated. A substantial percentage of the articles (35% and 33%) stated that the power level was >80%, but did not explain the methodology used to obtain this value, and a power level of >90% was reported in very few trials. Similarly, 91% of the reports did not provide the number of patients needed for successful treatment (also known as the “number needed to treat” to see one treatment success). Effect sizes were not reported in 69% of the trials and were reported to be <50% in 26% of them. There was no description of a rehabilitation protocol in the majority of the reports (96% and 94%), and only a few reports evaluating surgical techniques (2% and 3%) described postoperative management adequately.
The reporting of patient follow-up, blinding, and sample size may not be considered areas of failure in these trials, but these categories were certainly not strengths of the trials in general. The most common score in the follow-up category was 4 points (assigned to 48% and 51% of the trials), which corresponded to a follow-up period of more than twenty-four months and a <80% retention rate, a six to twenty-four-month follow-up and an 80% to 90% retention rate, or a less than six-month follow-up period and a >90% retention rate. The next most common score was 6 points (assigned to 26% and 32% of the trials), whereas scores of 0 or 2 points were less frequent. Both observers found the majority of the trials (59% and 52%) to be double-blinded. While both observers also found many trials to be single-blinded (22%), almost as many trials did not have any type of blinding whatsoever (19%). Regarding the sample-size criteria, most trials (44% and 46%) had more than sixty patients, many had forty-one to sixty patients (28% and 26%) or twenty to forty patients (22%), and a smaller percentage (6%) had fewer than twenty patients.
There were other categories, however, in which the majority of the trials scored well according to the modified Coleman Methodology Score. Most trials (78% and 80%) were randomized and blinded, some (17% and 19%) were randomized and not blinded, and only a few (6% and 2%) were not randomized. In the majority of the trials (100% and 94%), the analysis was performed on an intention-to-treat basis. Also, treatment was usually adequately described (in 96% and 94% of the trials). Treatment groups were comparable in most trials (89% and 85%) and were partially comparable in the remainder. In the majority of the studies (98% and 93%), outcome was reported by the recruited patients, and an independent investigator's assessment was the sole measure of outcome in only approximately 3% of the trials. The majority of the articles (74% and 76%) stated that no co-interventions were permitted during the trial period. However, co-interventions were observed in some trials (13% and 19%). A smaller percentage of the studies (13% and 6%) did not satisfy the criteria for similarity of treatment. Most reports (93% and 85%) stated that the level of significance had been set at 0.05, whereas some (6% and 11%) did not provide any alpha error and even fewer (2% and 4%) stated that the level of significance had been set at 0.01.
Revised CONSORT Statement
The difference between the average CONSORT scores assigned by the two observers was not significant (12.0 compared with 11.6 points; p = 0.55). Interobserver agreement for the revised CONSORT statement was moderate according to the Landis and Koch benchmarks (κ = 0.53; p < 0.01).
The two observers' combined assessments yielded a mean total score of 12 points (12 and 12 points). The maximum scores assigned by observer 1 and observer 2 were 19 and 20 points, respectively, and the minimum scores were 6 and 5 points. Most trials were given either a fair (56% and 50%) or a good (32% and 32%) categorical rating. A smaller percentage of trials were considered to be excellent (7% and 6%) or poor (6% and 13%).
In the majority of the reports reviewed in this study, the Introductory section met the guidelines of the CONSORT checklist: most included an adequate title and abstract (83% and 87%) as well as complete background information (87% and 85%).
The reporting in the Methods sections was not as consistent. Outcome variables and objectives were clearly defined and reported in almost all of the articles (98% and 100%), as were the details of the interventions in the treatment groups, including how and where the interventions took place (91% and 94%). However, the majority of the articles (72% and 74%) did not report sample size sufficiently because most did not provide the methodology or statistics behind the determination of this calculation. The eligibility criteria for participants were not adequately reported in >60% of the articles (67% and 65%), as these articles did not include information regarding the specific settings and locations where data were collected. Approximately one-third of the reports fulfilled the revised CONSORT criteria in this category. Most (91% and 89%) did not adhere to the CONSORT guidelines for describing their specific objectives and hypotheses.
More than half of the articles (57% and 59%) did not adequately describe the methods used to generate the random allocation sequence for participants. Similarly, on the average, almost 85% of the articles (82% and 85%) did not provide sufficient information with regard to implementation of randomization and almost 90% (89% and 91%) did not fully describe blinding. While the authors of most articles reported on blinding generally, details on measuring the success of blinding, which is required by the revised CONSORT statement, were rarely included. Approximately 65% of the articles (67% and 63%) did not sufficiently describe concealment of allocation from participants. Regarding methods and randomization, >80% of the articles (87% and 89%) did not include adequate information about the statistical methods used for comparing treatment groups.
As in the Methods sections, the Results sections of the reports rarely included all of the details required by the revised CONSORT statement. Most were adequate in terms of reporting the number of participants included in the analysis of each treatment group (85% and 83%), baseline data (68% and 68%), and outcomes and estimations (80% and 69%). However, only about half of the articles included a description of the flow of participants through each stage of the study (54% and 46%) and the adverse effects of therapy (56% and 50%). Both observers found that the reporting of the results of the trials was quite poor in two important areas. On the average, the location and dates of patient recruitment were not reported in >90% of the articles (91% and 94%) and ancillary analyses were not provided in >80% (80% and 85%).
The interpretation of results was adequately reported in about three-fourths (70% and 79%) of the Discussion sections of the articles. Similarly, both observers agreed that 78% of the Discussion sections presented overall evidence or the general interpretation of results in the context of available literature according to CONSORT guidelines. However, on the average, less than half of the Discussion sections (46% and 52%) adequately presented any external validation of results.
Discussion
To determine the most effective treatments for our patients, we follow the principles of evidence-based medicine and turn to prospective randomized trials as our gold standard. This study was performed to measure the relative quality of prospective randomized clinical trials. The modified Coleman Methodology Score and the revised CONSORT statement proved to be reliable for the evaluation of trial quality.
The vast majority (>90%) of the published therapeutic clinical trials on lateral epicondylitis were considered to be of low quality, or Level II, according to our application of the Oxford Levels of Evidence. The modified Coleman Methodology Score was used to further quantify the inadequacies of published clinical trials for the treatment of lateral epicondylitis, with at least 87% of the studies being rated fair or poor. The strength of these trials lies in the reporting of the alpha error, randomization, intention-to-treat analysis, treatment description, group comparability, and outcome measures, rather than in areas that indicate a higher level of reporting quality. The authors of these therapeutic trials frequently failed to describe how patients were enrolled and the percentages of enrollment as well as the methodology of determining a power level and sample size. Other areas of deficiency included follow-up, sample size, and blinding.
According to the CONSORT statement, at least 62% of the clinical trials were rated fair or poor. Again, areas of strength included the most straightforward aspects of the paper, including the introductory details, the descriptions of the interventions under evaluation and the statistical methods, analysis and reporting of the results, interpretation of findings, and analysis of overall evidence. Areas of weakness included inadequate reporting of eligibility criteria, allocation concealment, patient recruitment, ancillary analysis, and sample size determination. The phenomenon of providing only basic information without including the details of methodology was also prevalent in areas such as sequence generation, implementation of the randomization, blinding, and follow-up protocol. We also found room for improvement with regard to reporting of the flow of participants through the stages of the trial, the inclusion of baseline data in the Results section, outcomes and estimations, adverse effects, and external validation of the results.
Stratifying orthopaedic scientific reports according to levels of evidence implies a degree of respect for Level-I and II trials that may not always be merited. Our data suggest that the quality of these higher-level trials—at least those involving lateral epicondylitis—varies substantially and is often unsatisfactory. There may be other criteria that correlate with trial quality. Many trials were reported in relatively obscure journals such as the Journal of Manipulative and Physiological Therapeutics, Photomedicine and Laser Surgery, Clinical Rehabilitation, Prosthetics and Orthotics International, The Surgeon: Journal of the Royal Colleges of Surgeons of Edinburgh and Ireland, Journal of Traditional Chinese Medicine, and Australian Family Physician, among others. Others, such as studies evaluating extracorporeal shock wave therapy13-17, laser treatment18,19, and anti-inflammatory medications20-22 or other medications23,24, were performed under the influence of a manufacturer or a strong advocate of the technique. Positive findings in therapeutic clinical trials have been associated with the presence of commercial funding25,26.
There is no doubt that performance of a trial with a well-defined study question and a prospectively applied protocol, with an appropriate control group and blinded treatment allocation and assessment, is the best way to overcome many of the shortcomings of the current literature. On the other hand, the use of this study design ensures neither quality science nor internally and externally valid data. This report serves to emphasize the need to critically evaluate all scientific reports regardless of study design or level of evidence.
Disclosure: In support of their research for or preparation of this work, one or more of the authors received, in any one year, outside funding or grants in excess of $10,000 from the AO Foundation, Small Bone Innovations, Wright Medical, Joint Active Systems, and Smith and Nephew. Neither they nor a member of their immediate families received payments or other benefits or a commitment or agreement to provide such benefits from a commercial entity. No commercial entity paid or directed, or agreed to pay or direct, any benefits to any research fund, foundation, division, center, clinical practice, or other charitable or nonprofit organization with which the authors, or a member of their immediate families, are affiliated or associated.
A commentary is available with the electronic versions of this article, on our web site (www.jbjs.org) and on our quarterly CD-ROM (call our subscription department, at 781-449-9780, to order the CD-ROM).
Investigation performed at the Orthopaedic Hand and Upper Extremity Service, Massachusetts General Hospital, Boston, Massachusetts
1. Levels of evidence and grades of recommendation. Centre for Evidence-Based Medicine. http://www.cebm.net/levels_of_evidence.asp. 2001. Accessed 2007 April 4.
2. The periodic health examination. Canadian Task Force on the Periodic Health Examination. Can Med Assoc J. 1979;121: 1193-254.
3. . Rules of evidence and clinical recommendations on the use of antithrombotic agents. Chest. 1986;89(2 Suppl): 2S-3S.
4. , Guyatt GH, Laupacis A, Sackett DL, Goldberg RJ. Clinical recommendations using levels of evidence for antithrombotic agents. Chest. 1995;108(4 Suppl): 227S-30S.
5. , Cairns JA, Camm AJ, Fallen EL, Gersh BJ. Evidenced based cardiology. London: BMJ Publishing Group; 1998.
6. . Evidence-based orthopaedic surgery: what type of research will best improve clinical practice? J Bone Joint Surg Br. 2005;87: 1593-4.
8. , Engebretsen L, Slauterbeck JR. An analysis of the quality of cartilage repair studies. J Bone Joint Surg Am. 2005;87: 2232-9.
9. , Khan KM, Maffulli N, Cook JL, Wark JD. Studies of surgical outcome after patellar tendinopathy: clinical significance of methodological deficiencies and guidelines for future studies. Victorian Institute of Sport Tendon Study Group. Scand J Med Sci Sports. 2000;10: 2-11.
10. , Coleman BD, Khan KM, Maffulli N. Outcome of surgery for chronic Achilles tendinopathy. A critical review. Am J Sports Med. 2001;29: 315-20.
11. , Schulz KF, Altman DG; CONSORT. The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials. BMC Med Res Methodol. 2001;1: 2.
12. , Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33: 159-74.
13. , Wiley JP. Effectiveness of extracorporeal shock wave therapy in the treatment of previously untreated lateral epicondylitis: a randomized controlled trial. Am J Sports Med. 2004;32: 1660-7.
14. , Riedel C, Betz U, Fink C. Chronic lateral epicondylitis of the elbow: a prospective study of low-energy shockwave therapy and low-energy shockwave therapy plus manual therapy of the cervical spine. Arch Phys Med Rehabil. 2001;82: 578-82.
15. , Nichols D, Richards C, Humphreys H, Wies JT, Burnet S, Hazleman BL. Extracorporeal shock wave therapy for lateral epicondylitis—a double blind randomised controlled trial. J Orthop Res. 2002;20: 895-8.
16. , Bannister GC, Huma H, Rooker GD. A prospective, randomised study to compare extracorporeal shock-wave therapy and injection of steroid for the treatment of tennis elbow. J Bone Joint Surg Br. 2002;84: 678-9.
17. , Hope C, Kullmer K, Heine J, Burger R. Analgesic effect of extracorporeal shock-wave therapy on chronic tennis elbow. J Bone Joint Surg Br. 1996;78: 233-7.
18. , Sheffield CG, Cieslak KR. Laser therapy: a randomized, controlled trial of the effects of low intensity Nd:YAG laser irradiation on lateral epicondylitis. Arch Phys Med Rehabil. 2000;81: 1504-10.
19. , Hoeg N, Kjeldstad B, Johnsson A, Larsen S. Low level laser versus placebo in the treatment of tennis elbow. Scand J Rehabil Med. 1992;24: 37-42.
20. , Walenkamp GH, van Mameren H, Kester AD, van der Linden AJ. Local corticosteroid injection versus Cyriax-type physiotherapy for tennis elbow. J Bone Joint Surg Br. 1996;78: 128-32.
21. , Gunal I, Ozturk H. Local injection treatment for lateral epicondylitis. Clin Orthop Relat Res. 2002;398: 127-30.
22. , Laskowski ER, Idank DM, McLean TJ, Egan KS. Corticosteroid injection in early treatment of lateral epicondylitis. Clin J Sport Med. 2001;11: 214-22.
23. , Santini AJ, Hughes PJ, Frostick SP, Trail IA, Stanley JK. Botulinum toxin injection in the treatment of tennis elbow. A double-blind, randomized, controlled, pilot study. J Bone Joint Surg Am. 2005;87: 503-7.
24. , Rutten HP, Pilot P, Morre HH, van Os JJ, Verburg AD. Botulinum toxin injection versus surgical treatment for tennis elbow: a randomized pilot study. Clin Orthop Relat Res. 2002;401: 125-31.
25. , Chen W, Gluud C, Kjaergard LL. Association of funding and conclusions in randomized drug trials: a reflection of treatment effect or adverse events? JAMA. 2003;290: 921-8.
26. , Torres J. Reported outcomes in major cardiovascular clinical trials funded by for-profit and not-for-profit organizations: 2000-2005. JAMA. 2006;295: 2270-4. Erratum in: JAMA. 2006;295:2726.