The kappa values for the interobserver agreement between the experienced and inexperienced reviewers were 0.62 for the level of evidence and 0.76 for the study type, and the kappa values for the agreement between the experienced reviewers were 0.75 and 0.85, respectively (p < 0.0001 for all four kappa values). The kappa values for the agreement between all of the reviewers' ratings and the JBJS-A ratings were 0.84 for the level of evidence and 1.00 for the study type (p < 0.0001). When the experienced and inexperienced reviewers disagreed on either the level of evidence or the type of study (thirty-six instances of such disagreement), they subsequently agreed to use the experienced reviewer's rating 72% of the time and the inexperienced reviewer's rating only 28% of the time.
The primary aim of this study was to assess the type and level of evidence, according to the level-of-evidence rating system adopted by JBJS-A in January 2003, in articles published in nine different orthopaedic journals. We found that most of the articles were therapeutic in type, which reflects the trend in orthopaedic journals toward publishing studies that address the results of specific treatments. Economic studies, which are particularly valuable because they can have a large-scale impact on decision-making in our health-care system, were the least frequently encountered, accounting for only 0.5% of all articles reviewed. The relative scarcity of economic studies in orthopaedic journals suggests that few such studies are being conducted.
The majority of the articles represented Level-IV evidence. This finding exposes a well-known weakness of the orthopaedic literature: its tendency to contain retrospective studies with a lower level of evidence. Studies with a lower level of evidence do have their place in orthopaedic research. A Level-IV study may be the only research design feasible to address a given clinical question regarding an infrequently occurring problem, and such a study can provide valuable information for patient care. Features of a well-designed, well-executed Level-IV study include a population in which 100% of the patients have the same diagnosis, use of strict inclusion and exclusion criteria, prospective patient enrollment, use of a standard treatment protocol, follow-up of patients at specified time intervals, well-defined outcome measures that include clinical parameters, and use of patient-derived validated instruments for functional assessment. In addition, the follow-up rate should be >80%, and the investigators should account for all patients who have been lost to follow-up. Inclusion of these features in a Level-IV study can help to decrease the known and unknown factors that can lead to bias in the design or interpretation of the study.
Although a well-designed Level-IV study can provide valuable information for patient care, investigators should, whenever possible, pursue studies with a higher level of evidence to answer a given clinical question. Our review of the orthopaedic literature revealed that this has not always been the case. For example, several papers that we reviewed could have qualified as cohort studies (Level II or III) rather than case series (Level IV) if the authors had included a control group for comparison. We recommend that authors make every attempt to include a control group in their studies, since doing so can substantially raise the level-of-evidence rating for the study and thus improve the confidence with which one can apply the information to a clinical setting. However, the ultimate goal should be to produce more Level-I and II studies. Studies with a higher level of evidence are generally more reliable and therefore more appropriate for application to patient care. We hope that by formally studying the percentages of Level-I and II studies in the literature, as compared with the percentages of Level-III and IV studies, we have demonstrated the need for editors to choose articles with the highest levels of evidence available and for clinical investigators to pursue more Level-I and II studies instead of retrospective studies.
We also investigated a potential association between the number of Level-I and/or II articles published in a particular journal and the 2003 journal impact factor for that journal. The journal impact factor is a numerical value assigned to a particular journal on the basis of the frequency with which its articles are cited in other journals. The Institute for Scientific Information (ISI) in Philadelphia produces annual Journal Citation Reports that list the yearly citation rates for selected scientific journals6,8. The journal impact factor is calculated by dividing the total number of citations, in the past year, of articles published in the journal in the preceding two years by the number of substantive articles (i.e., source items) published in the journal in the preceding two years8-20. For example, the 2004 journal impact factor for JBJS-A was determined by counting the number of citations, in 2004, to articles published in JBJS-A during 2002 and 2003, and dividing that number by the total number of scientific articles published in JBJS-A during 2002 and 2003. The journal impact factor is often used to quantify journal quality on the basis of the notion that a larger number of citations reflects a higher-quality journal. As a very general rule, journals with exceptionally high impact factors are among those widely considered to be the most prestigious. For example, The New England Journal of Medicine and the similarly well-respected JAMA (Journal of the American Medical Association) had two of the highest 2003 journal impact factors (34.8 and 21.5, respectively)6. However, the journal impact factor is only a quantitative measure and may not always reflect the true quality of a particular journal8,18. In fact, in recent years, several critics have questioned whether it is appropriate to use the journal impact factor as an assessment of journal quality. Some have criticized the journal impact factor because they believe that its arbitrary two-year citation period favors journals publishing articles in quickly evolving disciplines19. Others think that the journal impact factor is flawed because it can be manipulated by journal editors through the selective publication of certain types of articles (e.g., review articles, which historically have higher citation rates)20. Despite these potential shortcomings, we still believe that the journal impact factor, although far from perfect, is the best and most widely accepted tool available for assessing the quality of a particular journal.
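Stated symbolically, the calculation just described is as follows; the symbols $C$ and $N$ are introduced here purely for illustration and do not appear in the Journal Citation Reports themselves:

$$\mathrm{JIF}_{y} = \frac{C_{y}(y-1) + C_{y}(y-2)}{N_{y-1} + N_{y-2}},$$

where $C_{y}(t)$ is the number of citations received in year $y$ by articles that the journal published in year $t$, and $N_{t}$ is the number of substantive articles (source items) that the journal published in year $t$. For the JBJS-A example above, $y$ = 2004, so the numerator counts citations made in 2004 to articles from 2002 and 2003, and the denominator counts the source items published in those same two years.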
Analysis of our data showed that the percentage of Level-I and II articles in a journal increased with that journal's 2003 impact factor (Fig. 3) (p = 0.0061); that is, journals that published higher percentages of studies with a high level of evidence had higher citation rates than journals that published lower percentages of such studies. This finding suggests that the frequency with which a particular journal publishes studies with a high level of evidence correlates with the size of that journal's readership and the overall perception of its quality.
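The specific statistical test behind the p value reported above is not restated here. Purely as an illustration of how an association between nine journals' impact factors and their percentages of Level-I and II articles could be examined, the following sketch applies a Spearman rank correlation to hypothetical values:

```python
# Illustrative sketch only: testing for a monotonic association between
# a journal's impact factor and its percentage of Level-I/II articles.
# All nine values below are hypothetical placeholders, not the study's data.
from scipy.stats import spearmanr

impact_factor_2003 = [1.8, 1.6, 1.1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
pct_level_1_or_2 = [29.0, 33.0, 18.0, 15.0, 9.0, 12.0, 8.0, 7.0, 5.0]

rho, p_value = spearmanr(impact_factor_2003, pct_level_1_or_2)
# A rho near 1 with a small p value indicates a strong positive association.
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
```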
Finally, we assessed, with the use of kappa values, the interobserver reliability of the level-of-evidence grading system employed by JBJS-A to rate its clinical papers. Landis and Koch suggested that kappa values ranging from 0.0 to 0.2 indicate slight agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, substantial agreement; and 0.81 to 1.0, almost perfect agreement7. In general, kappa values of >0.5 indicate a high level of agreement and kappa values of <0.3, poor agreement. With an increasing number of observers in the various subgroups, the kappa value stabilizes around a mean value, indicating a decrease in the sampling variation and standard error21. In our study, the kappa statistics showed that, between experienced reviewers, there was a substantial level of agreement with regard to the level of evidence (0.75) and an almost perfect level of agreement with regard to the study type (0.85). Between experienced and inexperienced reviewers, there was a substantial level of agreement with regard to both the level of evidence (0.62) and the study type (0.76). This result suggests that even reviewers without epidemiologic training can reliably apply the level-of-evidence definitions to various types of clinical studies, and it confirms the findings with regard to agreement in the study by Bhandari et al.3.
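As a concrete illustration of how a kappa value is computed, the sketch below implements Cohen's kappa for two raters, defined as kappa = (p_o - p_e)/(1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance given each rater's marginal rating frequencies. The ratings shown are hypothetical, and comparisons involving groups of observers, as in this study, require multi-rater extensions of this two-rater form:

```python
# Minimal sketch of Cohen's kappa for two raters (hypothetical ratings,
# not the data from this study). kappa = (p_o - p_e) / (1 - p_e).
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # p_o: observed proportion of items on which the two raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement, from the product of each rater's marginal frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[cat] * freq_b[cat] for cat in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Level-of-evidence ratings assigned to ten articles by two reviewers
reviewer_1 = ["IV", "IV", "III", "II", "IV", "I", "III", "IV", "II", "IV"]
reviewer_2 = ["IV", "III", "III", "II", "IV", "I", "IV", "IV", "II", "IV"]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
# By the Landis and Koch bands quoted above, 0.61 to 0.80 is substantial agreement.
```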
The strengths of our study were that we reviewed a large number of articles (382) from a diverse group of orthopaedic journals (nine), included reviewers both with and without epidemiologic training, and blinded those reviewers to the ratings assigned to the articles by the other reviewers. However, our study also had several limitations. First, no single reviewer assessed all of the articles in the study. Such a task would have been prohibitively time-consuming, and the purpose of our study was to confirm the high correlations that had been previously reported3. Second, we reviewed an unequal number of articles from each journal. Third, the reviewers were not blinded to the journal in which each article had appeared, which is a potential source of detection bias. Fourth, the limited number of inexperienced reviewers resulted in a lower kappa value. Fifth, although our study design was adequate for evaluating the overall reliability of the level-of-evidence categories, it was inadequate for examining the reliability of each individual level, as none of the clinical papers in our study were subclassified within a particular level of evidence.
In conclusion, orthopaedic journals with higher journal impact factors (JBJS-A, JBJS-B, and the American Journal of Sports Medicine) are more likely to publish Level-I or II articles. The type and level of information in orthopaedic journals can be reliably classified, and editors should continue to choose articles providing the best evidence available. Moreover, clinical investigators should pursue studies with a higher level of evidence whenever feasible. We believe that the level-of-evidence rating pioneered by JBJS-A represents an important step in the right direction for evidence-based orthopaedic research. Quantifying the level of evidence for each clinical study both provides the proper context in which the study should be interpreted and promotes greater recognition of studies with a higher level of evidence, which are generally more reliable for application to patient care. Thus, a level-of-evidence rating not only tells a clinician how to interpret the results of a particular study but also helps him or her to decide whether clinical practices should be altered on the basis of those results. It is for this reason that the entire level-of-evidence concept is so fundamentally important. Since the level-of-evidence rating can have such an impact on patient care, we encourage the editors of other orthopaedic journals to add these ratings (i.e., type and level of evidence) to their abstracts. The Board of Directors of the American Academy of Orthopaedic Surgeons voiced this same sentiment when it approved, in February 2005, the use of level-of-evidence ratings for the proper classification of orthopaedic studies22. The revised level-of-evidence table most recently endorsed by the Academy, in April 2005, can be found on the AAOS website (www.aaos.org/wordhtml/bulletin/apr05/fline9.htm). Of course, the importance of rating articles applies to research conducted in all fields of medicine, not just orthopaedics. We envision a future in which all peer-reviewed publications include a level-of-evidence rating for clinical papers.
The currently used JBJS-A level-of-evidence rating system is available with the electronic versions of this article, on our web site at jbjs.org (go to the article citation and click on “Supplementary Material”) and on our quarterly CD-ROM (call our subscription department, at 781-449-9780, to order the CD-ROM).
The authors did not receive grants or outside funding in support of their research or preparation of this manuscript. They did not receive payments or other benefits or a commitment or agreement to provide such benefits from a commercial entity. No commercial entity paid or directed, or agreed to pay or direct, any benefits to any research fund, foundation, educational institution, or other charitable or nonprofit organization with which the authors are affiliated or associated.
Investigation performed at the Departments of Orthopaedic Surgery at Vanderbilt University, Nashville, Tennessee; Boston Medical Center, Boston University, Boston, Massachusetts; and McMaster University, Hamilton, Ontario, Canada
1. Sackett DL. Rules of evidence and clinical recommendations on the use of antithrombotic agents. Chest. 1986;89(2 Suppl):2S-3S.
2. Wright JG, Swiontkowski MF, Heckman JD. Introducing levels of evidence to the journal. J Bone Joint Surg Am. 2003;85:1-3.
3. Bhandari M, Swiontkowski MF, Einhorn TA, Tornetta P 3rd, Schemitsch EH, Leece P, Sprague S, Wright JG. Interobserver agreement in the application of levels of evidence to scientific papers in the American volume of the Journal of Bone and Joint Surgery. J Bone Joint Surg Am. 2004;86:1717-20.
4. . Read early and often. Am J Sports Med. 2005;33:21-2.
5. . Understanding evidence-based arthroscopy. Arthroscopy. 2004;20:1-3.
6. The Thomson Corporation. ISI Web of Knowledge. isi15.isiknowledge.com. Accessed August 10, 2004.
7. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-74.
8. Garfield E. Journal impact factor: a brief review. CMAJ. 1999;161:979-80.
9. . A deficiency in the algorithm for calculating the impact factor of scholarly journals: the Journal Impact Factor. Cortex. 2001;37:590-4.
10. . Impact factor: the numbers game. AJR Am J Roentgenol. 2002;178:541-2.
11. , Pistotti V, Gabutti G, Klersy C. Impact factor and electronic versions of biomedical scientific journals. Haematologica. 2001;86:1015-20.
12. Hecht F, Hecht BK, Sandberg AA. The journal "impact factor": a misnamed, misleading, misused measure. Cancer Genet Cytogenet. 1998;104:77-81.
13. Gallagher EJ, Barnaby DP. Evidence of methodologic bias in the derivation of the Science Citation Index impact factor. Ann Emerg Med. 1998;31:83-6.
14. . Impact factor as a misleading tool in evaluation of medical journals. Lancet. 1995;346:906. Erratum in: Lancet. 1995;346:1172.
15. . Impact factors: facts and myths. Eur Radiol. 2002;12:715-7.
16. . On the pitfalls of journal ranking by Impact Factor. Eur J Oral Sci. 1998;106:525-6.
17. . [The impact factor as an assessment criterion of scientific achievements—the right to equal chances]. Strahlenther Onkol. 1999;175:141-53. German.
18. , Weinstein JN. Spine: scientific citation index and its impact factor. Spine. 1998;23:1087-90.
19. . The impact of the impact factor. J Hand Surg [Am]. 2004;29:983-6.
20. . Understanding the limitations of the journal impact factor. J Bone Joint Surg Am. 2003;85:2449-54.
21. , Olsen LH, Nielsen ST. Kappa statistics in the assessment of observer variation: the significance of multiple observers classifying ankle fractures. J Orthop Sci. 2002;7:163-6.
22. . Board approves SOPs, patient safety projects, position statements. AAOS Bull. 2005;53:45-8.