Level of Evidence in Orthopaedic Journals

Obremskey, William T. MD, MPH; Pappas, Nick MS; Attallah-Wasif, Emad MD; Tornetta, Paul III MD; Bhandari, Mohit MD

Journal of Bone & Joint Surgery - American Volume
doi: 10.2106/JBJS.E.00370
Scientific Articles
Abstract

Background: The American edition of The Journal of Bone and Joint Surgery (JBJS-A) has included a level-of-evidence rating for each of its clinical scientific papers published since January 2003. The purpose of this study was to assess the type and level of evidence found in nine different orthopaedic journals by applying this level-of-evidence rating system.

Methods: We reviewed all clinical articles published from January through June 2003 in nine orthopaedic journals. Studies of animals, studies of cadavera, basic-science articles, review articles, case reports, and expert opinions were excluded. The remaining 382 clinical articles were randomly assigned to three experienced reviewers and two inexperienced reviewers, who rated them with the JBJS-A grading system. Each reviewer determined whether the studies were therapeutic, prognostic, diagnostic, or economic, and each rated the level of evidence as I, II, III, or IV. Reviewers were blinded to the grades assigned by the other reviewers.

Results: According to the reviewers' ratings, 70.7% of the articles were therapeutic, 19.9% were prognostic, 8.9% were diagnostic, and 0.5% were economic. The reviewers graded 11.3% as Level I, 20.7% as Level II, 9.9% as Level III, and 58.1% as Level IV. The kappa values for the interobserver agreement between the experienced reviewers and the inexperienced reviewers were 0.62 for the level of evidence and 0.76 for the study type. The kappa values for the interobserver agreement between the experienced reviewers were 0.75 for the level of evidence and 0.85 for the study type. The kappa values for the agreement between the reviewers' grades and the JBJS-A grades were 0.84 for the level of evidence and 1.00 for the study type. All kappa values were significantly different from zero (p < 0.0001 for all). The percentage of articles that were rated Level I or II increased in accordance with the 2003 journal impact factors for the individual journals (p = 0.0061).

Conclusions: Orthopaedic journals with a higher impact factor are more likely to publish Level-I or II articles. The type and level of information in orthopaedic journals can be reliably classified, and clinical investigators should pursue studies with a higher level of evidence whenever feasible.

Author Information

1 Department of Orthopaedic Surgery, Vanderbilt University, 131 Medical Center South, 2100 Pierce Avenue, Nashville, TN 37212. E-mail address for W.T. Obremskey: william.obremskey@vanderbilt.edu

2 3104 Dudley Avenue, Nashville, TN 37212

3 Department of Orthopaedic Surgery, Boston Medical Center, Dowling 2 North, 850 Harrison Avenue, Boston, MA 02118

4 Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton Health Sciences–General Site, 7 North, Suite 723, Hamilton, ON L8L 2X2, Canada

Almost twenty years ago, David Sackett and his colleagues first proposed a system for grading different levels of medical evidence1. Sackett altered this system several times over the years, and his latest version, published in May 2001, can currently be found on the Oxford Centre for Evidence-Based Medicine web site (www.cebm.net/levels_of_evidence.asp#levels). In January 2003, the editorial board of the American edition of The Journal of Bone and Joint Surgery (JBJS-A) adopted a level-of-evidence rating system that is a modification of Sackett's most recent proposal. Since that time, JBJS-A has included a level-of-evidence rating for each of the clinical scientific papers that it has published. The system employed by JBJS-A categorizes each article as one of five levels (I, II, III, IV, or V) on the basis of its design and as one of four different types (therapeutic, prognostic, diagnostic, or economic or decision analysis) on the basis of its content. The goal of implementing this rating system was to introduce both authors and readers to the notion of levels of evidence and to stress the importance of using the level-of-evidence rating as a guide in the interpretation of a clinical paper2.

Different study designs merit different level-of-evidence ratings. The basis for rating study designs on a hierarchical scale is to ensure that the best available evidence is used in patient care. In the so-called hierarchy of evidence, randomized controlled trials are Level I or II, cohort studies are Level II or III, case-control studies are Level III, case series are Level IV, and expert opinions are Level V. Studies with higher level-of-evidence ratings (Level I or II) are generally more reproducible for general application to patient care than are those with lower ratings. The conclusions of clinical studies with lower ratings can still be effectively applied to patient care, but they may be more difficult to apply to general use. For example, the results of a well-constructed randomized controlled trial comparing the efficacy of two different types of surgical treatments for fractures (a Level-I study) should be trusted with fewer reservations than should the findings in a case series in which patients with the same type of injury received one type of treatment and there was no comparison group (a Level-IV study). Of course, some clinical questions may never be suitable for randomized controlled trials because of cost or the infrequency of the disease. In such instances, a well-designed case series may be the only option.
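
As a compact illustration only (the design labels and the helper function below are ours, not part of the JBJS-A grading instrument, which assigns the final level from additional design details), the hierarchy just described can be sketched as a lookup from study design to the level(s) of evidence it can receive:

    # Sketch of the hierarchy of evidence described above (illustrative only).
    # Each design maps to the range of levels it can be assigned.
    HIERARCHY_OF_EVIDENCE = {
        "randomized controlled trial": ("I", "II"),
        "cohort study": ("II", "III"),
        "case-control study": ("III",),
        "case series": ("IV",),
        "expert opinion": ("V",),
    }

    def possible_levels(study_design: str) -> tuple:
        """Return the level(s) of evidence a given study design can receive."""
        return HIERARCHY_OF_EVIDENCE[study_design]

    print(possible_levels("randomized controlled trial"))  # ('I', 'II')
    print(possible_levels("case series"))                  # ('IV',)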

A study recently published in JBJS-A showed a high degree of interobserver agreement when the level-of-evidence rating system employed by JBJS-A was applied to clinical scientific articles published in six monthly issues of JBJS-A3. While journals such as Clinical Orthopaedics and Related Research, Arthroscopy, and American Journal of Sports Medicine have recently begun including level-of-evidence ratings for their clinical papers, there have been no studies, to our knowledge, assessing the collective level of evidence in these or any other orthopaedic journals4,5. Our aim was to evaluate both the type and the level of evidence found in nine different orthopaedic journals, including JBJS-A, by using the hierarchical rating system provided in the January 2003 issue of JBJS-A. It is important to note that JBJS-A has been using an updated, very similar version of this rating system since January 2005 (see Appendix). We also evaluated the interobserver agreement among the reviewers who classified the type and level of evidence in the clinical articles assessed in this study.

Materials and Methods

Eligibility Criteria and Study Identification

Three hundred and eighty-two clinical articles published from January through June 2003 were identified through a combined comprehensive computer and manual library search. The interval from January through June 2003 was used as the time-frame in order to obtain a reasonable number of relatively recent articles for reviewers to classify. The articles were obtained from nine different orthopaedic journals intended to represent a broad spectrum of orthopaedic subspecialties. The journals included the American and British editions of The Journal of Bone and Joint Surgery, Journal of Orthopaedic Trauma, Journal of Shoulder and Elbow Surgery, American Journal of Sports Medicine, Journal of Pediatric Orthopaedics, Foot & Ankle International, the American edition of Journal of Hand Surgery, and Journal of Arthroplasty. Studies of animals, studies of cadavera, basic-science articles, review articles, case reports, and expert opinions were excluded. Case reports were excluded because they are not studies of a particular injury or disease but reports of unusual events or procedures. Expert opinions (i.e., Level-V studies) were excluded because they are easily recognized and therefore would yield very low interobserver variability. Each article was masked so that the reviewers could view only the title, abstract, introduction, and methods section, as these are the only sections required for accurate classification. Articles that satisfied the above criteria were randomly assigned to one of five three-ring binders, with approximately equal numbers of articles from each journal placed in each binder.

Assessment of Level of Evidence

We selected five reviewers to participate in this study so that the findings would be based on a broad range of experience in orthopaedics and in level-of-evidence grading. The reviewers consisted of a medical student, a resident, a recently fellowship-trained orthopaedic surgeon with extensive epidemiologic training and experience, an experienced orthopaedic surgeon with no formal epidemiologic training, and an experienced orthopaedic surgeon with epidemiologic training. The three orthopaedic surgeons practice in three different academic centers in North America (two in the United States and one in Canada). Each binder was randomly assigned to one of the five independent (and blinded) reviewers. The three orthopaedic surgeons were classified as experienced reviewers, while the resident and medical student were considered to be inexperienced reviewers. Each reviewer used the grading system provided in the January 2003 issue of JBJS-A2 to classify the clinical papers in terms of (1) the type of study (therapeutic, prognostic, diagnostic, or economic or decision analysis) and (2) the level of evidence (I, II, III, or IV). The articles classified by the inexperienced reviewers were graded again by an experienced reviewer, and the articles classified by two experienced reviewers were reevaluated by a third experienced reviewer. Each reviewer was blinded to the grades assigned by the others. When an experienced and an inexperienced reviewer disagreed regarding the proper classification of either the type or the level of a clinical paper, those reviewers subsequently discussed the article in detail until they reached a consensus.

Data Analysis

We determined the frequency with which the different study types (therapeutic, prognostic, diagnostic, and economic or decision analysis) and levels of evidence (I through IV) were found in the various journals. We also compared the proportion of Level-I and Level-II articles with the 2003 journal impact factor for each journal (Fig. 1)6. The interobserver agreement between the experienced and inexperienced reviewers, between the experienced reviewers, and between all of the reviewers' ratings and the JBJS-A ratings was calculated with use of kappa values. The kappa values were interpreted with use of the method described by Landis and Koch7. Equidistant and adjusted weighting was used to calculate each kappa value. Weighted kappa statistics measure observer agreement on ranked scales, accounting for chance agreement while incorporating the magnitude of the disagreement into the overall calculation.
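
As a minimal sketch of how such a weighted kappa can be computed (illustrative only: the ratings below are invented, and we use scikit-learn's cohen_kappa_score with linear, i.e., equidistant, weights rather than the authors' actual statistical software):

    # Minimal sketch: weighted kappa for two raters' level-of-evidence grades.
    # The grades below are invented; Levels I-IV are coded as ordered integers 1-4.
    from sklearn.metrics import cohen_kappa_score

    rater_a = [1, 2, 4, 4, 3, 2, 4, 1, 4, 2]
    rater_b = [1, 2, 4, 3, 3, 2, 4, 2, 4, 2]

    # weights="linear" applies equidistant weights, so a Level I-vs-III
    # disagreement is penalized more heavily than a Level I-vs-II disagreement.
    kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
    print(f"weighted kappa = {kappa:.2f}")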

Results

Three hundred and eighty-two articles from nine different orthopaedic journals were reviewed. The greatest number of articles came from JBJS-A (seventy-six; 19.9%) and the fewest were from the Journal of Orthopaedic Trauma (twenty-two; 5.8%). Most articles were therapeutic (270; 70.7%), and most reflected Level-IV evidence (222; 58.1%) (Tables I and II). Only forty-three articles (11.3%) were classified as Level I. Seventy-nine (20.7%) were Level II, and thirty-eight (9.9%) were Level III. Of the nine orthopaedic journals involved in the study, American Journal of Sports Medicine had the highest percentage (31.4%) of Level-I articles and Foot & Ankle International had the highest percentage (75.0%) of Level-IV articles (Fig. 2). The percentage of articles that were rated Level I or II increased in accordance with the 2003 journal impact factors for the individual journals (Fig. 3) (p = 0.0061).

The kappa values for the interobserver agreement between the experienced and inexperienced reviewers were 0.62 for the level of evidence and 0.76 for the study type, and the kappa values for the agreement between the experienced reviewers were 0.75 and 0.85, respectively (p < 0.0001 for all four kappa values). The kappa values for the agreement between all of the reviewers' ratings and the JBJS-A ratings were 0.84 for the level of evidence and 1.00 for the study type (p < 0.0001). When the experienced and inexperienced reviewers disagreed on either the level of evidence or the type of study (thirty-six instances of such disagreement), they subsequently agreed to use the experienced reviewer's rating 72% of the time and the inexperienced reviewer's rating only 28% of the time.

Discussion

The primary aim of this study was to assess the type and level of evidence, according to the level-of-evidence rating system adopted by JBJS-A in January 2003, in articles published in nine different orthopaedic journals. We found that most of the articles were therapeutic in type, which reflects the trend in orthopaedic journals to publish studies that address the results of specific treatments. Economic studies, which are particularly valuable because they can have a large-scale impact on decision-making in our health-care system, were the least frequently encountered, accounting for only 0.5% of all articles reviewed. The relative lack of economic studies in orthopaedic journals suggests that few economic studies are being conducted.

The majority of the articles represented Level-IV evidence. This finding exposes a well-known weakness in the orthopaedic literature, which is its tendency to contain retrospective studies with a lower level of evidence. Studies with a lower level of evidence do have their place in orthopaedic research. A Level-IV study may be the only research design feasible to address a given clinical question regarding an infrequently occurring problem, and such a study can provide valuable information for patient care. Features of a well-designed, well-executed Level-IV study include a population in which 100% of the patients have the same diagnosis, use of strict inclusion and exclusion criteria, prospective patient enrollment, use of a standard treatment protocol, follow-up of patients at specified time-intervals, well-defined outcome measures that include clinical parameters, and use of patient-derived validated instruments for functional assessment. In addition, the follow-up rate should be >80%, and the investigators should account for all patients who have been lost to follow-up. Inclusion of these features in a Level-IV study can help to decrease the known and unknown factors that can lead to bias in the design or interpretation of the study.

Although a well-designed Level-IV study can provide valuable information for patient care, investigators should, whenever possible, pursue studies with a higher level of evidence to answer a given clinical inquiry. Our review of the orthopaedic literature revealed that this has not always been the case. For example, several papers that we reviewed could have qualified as cohort studies (Level II or III) instead of case series (Level IV) if the author(s) had included a control group for comparison. We recommend that authors make every attempt to use a control group in their studies, since this can substantially raise the level-of-evidence rating for the study and thus improve the confidence with which one can apply the information to a clinical setting. However, the ultimate goal should be to produce more Level-I and II studies. Studies with a higher level of evidence are generally more reliable and therefore more appropriate for application to patient care. We hope that by formally studying the percentages of Level-I and II studies in the literature, as compared with the percentages of Level-III and IV studies, we have demonstrated the need for editors to choose articles with the highest levels of evidence available and for clinical investigators to pursue more Level-I and II studies instead of retrospective studies.

We also investigated a potential association between the number of Level-I and/or II articles published in a particular journal and the 2003 journal impact factor for that journal. The journal impact factor is a numerical value assigned to a particular journal on the basis of the frequency with which its articles are cited in other journals. The Institute for Scientific Information, in Philadelphia, produces annual Journal Citation Reports that list the yearly citation rates for selected scientific journals6,8. The journal impact factor is calculated by dividing the total number of citations, in the past year, of articles published in the journal in the preceding two years by the number of substantive articles (i.e., source items) published in the journal in the preceding two years8-20. For example, the 2004 journal impact factor for JBJS-A was determined by counting the number of citations, in 2004, to articles published in JBJS-A during 2002 and 2003, and dividing that number by the total number of scientific articles published in JBJS-A during 2002 and 2003. The journal impact factor is often used to quantify journal quality on the basis of the notion that a larger number of citations reflects a higher-quality journal. As a very general rule, journals with exceptionally high impact factors are among those widely considered to be the most prestigious. For example, The New England Journal of Medicine and the similarly well-respected JAMA (Journal of the American Medical Association) had two of the highest 2003 journal impact factors (34.8 and 21.5, respectively)6. However, the journal impact factor is only a quantitative measure and may not always reflect the true quality of a particular journal8,18. In fact, in recent years, several critics have questioned whether it is appropriate to use the journal impact factor as an assessment of journal quality. Some have criticized the journal impact factor because they believe that its arbitrary two-year citation period favors journals publishing articles in quickly evolving disciplines19. Others think that the journal impact factor is flawed because it can be manipulated by journal editors through keen selection of certain types of articles (e.g., review articles, which historically have higher citation rates)20. Despite its potential shortcomings, we still believe that the journal impact factor is the best and most accepted tool available (although far from perfect) for assessing the quality of a particular journal.
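
In equation form, the impact-factor calculation described above (restating the JBJS-A example from the text) is:

    \mathrm{IF}_{2004}(\text{JBJS-A}) =
        \frac{\text{citations in 2004 to articles published in JBJS-A in 2002 and 2003}}
             {\text{number of source items published in JBJS-A in 2002 and 2003}}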

Analysis of our data showed that the percentage of Level-I and II articles increased in accordance with the 2003 journal impact factor (Fig. 3) for each journal (p = 0.0061)—that is, journals that published higher percentages of studies with a high level of evidence had higher citation rates than journals that published lower percentages of high-level studies. This finding suggests that the frequency with which a particular journal publishes studies with a high level of evidence correlates with the size of that journal's readership and the overall perception of its quality.

Finally, we assessed, with the use of kappa values, the interobserver reliability of the level-of-evidence grading system employed by JBJS-A to rate its clinical papers. Landis and Koch suggested that kappa values ranging from 0.0 to 0.2 indicate slight agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, substantial agreement; and 0.81 to 1.0, almost perfect agreement7. In general, kappa values of >0.5 indicate a high level of agreement and kappa values of <0.3, poor agreement. With an increasing number of observers in the various subgroups, the kappa value stabilizes around a mean value, indicating a decrease in the sampling variation and standard error21. In our study, the kappa statistics showed that, between experienced reviewers, there was a substantial level of agreement with regard to the level of evidence (0.75) and an almost perfect level of agreement with regard to the study type (0.85). Between experienced and inexperienced reviewers, there was a substantial level of agreement with regard to both the level of evidence (0.62) and the study type (0.76). This result suggests that even reviewers without epidemiologic training can reliably apply the level-of-evidence definitions to various types of clinical studies, and it confirms the findings with regard to agreement in the study by Bhandari et al.3.
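
As an illustration only, the Landis and Koch bands quoted above can be written as a small helper (the function name is ours; the cut-points are taken directly from the text, and kappa is assumed to lie between 0 and 1):

    # Hypothetical helper mapping a kappa value (assumed 0 to 1) to the
    # Landis and Koch descriptive bands quoted above.
    def landis_koch_band(kappa: float) -> str:
        if kappa <= 0.20:
            return "slight"
        if kappa <= 0.40:
            return "fair"
        if kappa <= 0.60:
            return "moderate"
        if kappa <= 0.80:
            return "substantial"
        return "almost perfect"

    # Kappa values reported in this study:
    for k in (0.62, 0.75, 0.76, 0.84, 0.85, 1.00):
        print(k, landis_koch_band(k))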

The strengths of our study were that we reviewed a large number of articles (382) from a diversity of orthopaedic journals (nine), included reviewers both with and without epidemiologic training, and blinded those reviewers to the ratings assigned to the articles by the other reviewers. However, our study also had several limitations. First, no single reviewer assessed all of the articles in the study. Such a task would have been time-consuming, and the purpose of our study was to confirm the high correlations that had been previously reported3. Second, we reviewed an unequal number of articles from each journal. Third, the reviewers were not blinded to the journal in which each article had appeared, and this is a potential source of detection bias. Fourth, the limited number of inexperienced reviewers resulted in a lower kappa value. Fifth, although our study design was adequate for evaluating the overall reliability of the level-of-evidence categories, it was inadequate for examining the reliability of each individual level, as none of the clinical papers in our study were subclassified within a particular level of evidence.

In conclusion, orthopaedic journals with higher journal impact factors (JBJS-A, JBJS-B, and American Journal of Sports Medicine) are more likely to publish Level-I or II articles. The type and level of information in orthopaedic journals can be reliably classified, and editors should continue to choose articles providing the best evidence available. Moreover, clinical investigators should pursue studies with a higher level of evidence whenever feasible. We believe that the level-of-evidence rating as pioneered by JBJS-A represents an important move in the right direction for evidence-based orthopaedic research. Quantifying the level of evidence for each clinical study both provides the proper context in which each study should be interpreted and promotes greater recognition of higher-level-of-evidence studies, which are generally more reliable for application to patient care. Thus, a level-of-evidence rating not only tells a clinician how to interpret the results of a particular study but also helps him or her to decide whether clinical practices should be altered because of them. It is for this reason that the entire level-of-evidence concept is so fundamentally important. Since the level-of-evidence rating can have such an impact on patient care, we encourage the editors of other orthopaedic journals to add these ratings (i.e., type and level of evidence) to their abstracts. The Board of Directors of the American Academy of Orthopaedic Surgeons recently voiced this same sentiment when they approved, in February 2005, the use of level-of-evidence ratings for proper classification of orthopaedic studies22. The revised level-of-evidence table most recently endorsed by the Academy, in April 2005, can be found on the AAOS website (www.aaos.org/wordhtml/bulletin/apr05/fline9.htm). Of course, the importance of rating articles applies to research conducted in all fields of medicine, not just orthopaedics. We envision a future in which all peer-reviewed publications include a level-of-evidence rating for clinical papers.▪

Appendix

The currently used JBJS-A level-of-evidence rating system is available with the electronic versions of this article, on our web site at jbjs.org (go to the article citation and click on “Supplementary Material”) and on our quarterly CD-ROM (call our subscription department, at 781-449-9780, to order the CD-ROM).

The authors did not receive grants or outside funding in support of their research or preparation of this manuscript. They did not receive payments or other benefits or a commitment or agreement to provide such benefits from a commercial entity. No commercial entity paid or directed, or agreed to pay or direct, any benefits to any research fund, foundation, educational institution, or other charitable or nonprofit organization with which the authors are affiliated or associated.

Investigation performed at the Departments of Orthopaedic Surgery at Vanderbilt University, Nashville, Tennessee; Boston Medical Center, Boston University, Boston, Massachusetts; and McMaster University, Hamilton, Ontario, Canada

1. Sackett DL. Rules of evidence and clinical recommendations on the use of antithrombotic agents. Chest. 1986;89(2 Suppl): 2S-3S.
2. Wright JG, Swiontkowski MF, Heckman JD. Introducing levels of evidence to the journal. J Bone Joint Surg Am. 2003;85: 1-3.
3. Bhandari M, Swiontkowski MF, Einhorn TA, Tornetta P 3rd, Schemitsch EH, Leece P, Sprague S, Wright JG. Interobserver agreement in the application of levels of evidence to scientific papers in the American volume of the Journal of Bone and Joint Surgery. J Bone Joint Surg Am. 2004;86: 1717-20.
4. . Read early and often. Am J Sports Med. 2005;33: 21-2.
5. . Understanding evidence-based arthroscopy. Arthroscopy. 2004;20: 1-3.
6. The Thomson Corporation. ISI web of knowledge. Accessed August 10, 2004. isi15.isiknowledge.com.
7. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33: 159-74.
8. . Journal impact factor: a brief review. CMAJ. 1999;161: 979-80.
9. . A deficiency in the algorithm for calculating the impact factor of scholarly journals: the Journal Impact Factor. Cortex. 2001;37: 590-4.
10. . Impact factor: the numbers game. AJR Am J Roentgenol. 2002;178: 541-2.
11. , Pistotti V, Gabutti G, Klersy C. Impact factor and electronic versions of biomedical scientific journals. Haematologica. 2001;86: 1015-20.
12. , Hecht BK, Sandberg AA. The journal “impact factor”: a misnamed, misleading, misused measure. Cancer Genet Cytogenet. 1998;104: 77-81.
13. , Barnaby DP. Evidence of methodologic bias in the derivation of the Science Citation Index impact factor. Ann Emerg Med. 1998;31: 83-6.
14. . Impact factor as a misleading tool in evaluation of medical journals. Lancet. 1995;346: 906. Erratum in: Lancet. 1995;346:1172.
15. . Impact factors: facts and myths. Eur Radiol. 2002;12: 715-7.
16. . On the pitfalls of journal ranking by Impact Factor. Eur J Oral Sci. 1998;106: 525-6.
17. . [The impact factor as an assessment criterion of scientific achievements—the right to equal chances]. Strahlenther Onkol. 1999;175: 141-53. German.
18. , Weinstein JN. Spine: scientific citation index and its impact factor. Spine. 1998;23: 1087-90.
19. . The impact of the impact factor. J Hand Surg [Am]. 2004;29: 983-6.
20. . Understanding the limitations of the journal impact factor. J Bone Joint Surg Am. 2003;85: 2449-54.
21. , Olsen LH, Nielsen ST. Kappa statistics in the assessment of observer variation: the significance of multiple observers classifying ankle fractures. J Orthop Sci. 2002;7: 163-6.
22. . Board approves SOPs, patient safety projects, position statements. AAOS Bull. 2005;53: 45-8.
Copyright 2005 by The Journal of Bone and Joint Surgery, Incorporated