What Is the Reliability of a New Classification for Bone Defects in Revision TKA Based on Preoperative Radiographs?

Belt, Maartje MSc; Smulders, Katrijn PhD; van Houten, Albert MD; Wymenga, Ate MD, PhD; Heesterbeek, Petra PhD; van Hellemondt, Gijs MD

Clinical Orthopaedics and Related Research: September 2020 - Volume 478 - Issue 9 - p 2057-2064
doi: 10.1097/CORR.0000000000001084



The frequency of revision TKA is increasing, with the incidence of revision TKA in the Netherlands doubling from 9.8 to 17.8 per 100,000 persons between 2010 and 2017 [14]. Patients are younger at the time of their primary TKA [9, 19] and have a longer life expectancy, increasing the likelihood of revision TKA [3]. Revision TKA is generally more challenging than primary TKA [1, 2, 5, 10, 15, 18], and orthopaedic surgeons often must treat bone defects. The management of bone defects predominantly depends on their size and location [15, 17]. Surgical options include the use of newly developed cones and sleeves for larger epiphyseal or metaphyseal bone defects, and variations in stem length and type of fixation for diaphyseal defects. Such options seem to be successful in creating a stable implant in most patients with a metaphyseal bone defect. However, clear indications for which option is the best available solution are absent and outcomes of different surgical options are rarely studied. A reproducible and accurate classification of the bone defects is required to aid such research. Moreover, standard classification of bone defects facilitates comparisons of patients between cohorts or registries.

The Anderson Orthopaedic Research Institute (AORI) classification is the most commonly used classification for bone defects in the femur and tibia [4, 17]. However, AORI only partially quantifies the metaphyseal area and does not quantify diaphyseal bone loss, and might be less suited for detailed assessment in revision TKA patients. It should also be noted that obtaining implant fixation in two of three anatomic zones (epiphysis, metaphysis, diaphysis), as recommended to ensure sufficient stability of the revision implant, might be aided by preoperative planning, allowing for a more detailed assessment of bone defects [15].

In clinical practice, the primary assessment of bone loss is performed with radiographs. However, additional CT images may theoretically result in better estimates of bone loss and location of bone defects because of the 3-D nature of CT scans [11]. The modality used for evaluating bone defects may thus influence reliability. In this study, we developed a new three-zone bone defect classification to evaluate bone defects in patients undergoing revision TKA, which includes a separate evaluation of the size and severity of the defect in the epiphysis, metaphysis, and diaphysis.

We tested (1) the intraobserver and interobserver reliability of this classification for revision TKA based on preoperative radiographs, and (2) whether additional CT images might improve interobserver reliability.

Patients and Methods

Study Design and Setting

This study was registered on the Open Science Framework before data were collected. The study protocol, raw data, and analytical code are deposited and accessible via:

Design of the Bone Defect Classification

First, a concept classification was designed, and the diaphyseal, metaphyseal, and epiphyseal zones of the femur and tibia were described using anatomic landmarks (Fig. 1). The concept classification was tested in a pilot study by four orthopaedic surgeons (three experienced [VB, GvH, JS] and one resident in training [AvH]) from our center. The orthopaedic surgeons independently rated the bone defects of 15 patients on de-identified radiographs using the new classification. AP and lateral radiographs of the knee were available. A researcher was present to take notes for further discussion about necessary adjustments. After the first pilot test, two changes were made to the classification. First, the definition of epiphysis defects was altered from the percentage of bone loss (cutoff of 50%) to the size of bone loss (cutoffs of 5 mm and 10 mm) to remain consistent with the AORI classification [4]. Second, in the definition of diaphyseal defects, the state of the cortex was also incorporated. A distinction was made between an intact cortex, partial intrusion into the cortex, and discontinuation of the cortex. The adjusted classification was subsequently tested in a new pilot test by three other orthopaedic surgeons (all residents in training [SvG, AvH, BN]) (Fig. 2). The difference in ratings among these observers was described and used in a consensus meeting with all orthopaedic staff who specialized in knee arthroplasty. No changes were deemed necessary during this consensus meeting. An instructional video (see Videos 1-5, Supplemental Content 1 through 5) was made to illustrate the definition of the zones to be rated and where to measure the bone defects. This is accessible via

Fig. 1:
This shows the definition of femoral and tibial zones used for rating bone defects. The white lines indicate the cutoff points for the zones. The dotted lines (blue) indicate where the measurements of the bone defects for the specific zones should be taken. The anatomic landmarks for the measurements per zone are described on the right and indicated in the picture (black lines).
Fig. 2:
These tables show the bone defect classification for the (A) femur and (B) tibia.

Bone Defect Classification

The bone defect classification consisted of four rating options for bone defects (none, mild, moderate, severe), which are rated separately per zone (epiphysis, metaphysis, and diaphysis) for the femur and tibia (Fig. 2). The zones were defined using anatomic landmarks (Fig. 1). The epiphysis is defined from the original saw cut to the epicondyle (femur) or to the tip of the fibular head (tibia). The bone defect of the metaphysis is rated at the adductor tubercle (femur) or at the widest part of the fibular head (tibia). The diaphysis is measured at the worst part of the bone defect, which is usually, but not necessarily, at the tip of the stem. A bone defect was defined as the volume in which normal bone is absent. This included volumes occupied by the implant, cement, osteolytic lesions, and radiolucent lines, as no bone is present in these areas. Bone quality is not incorporated in the classification because an additional DEXA scan is needed for an adequate and consistent evaluation of bone quality, and bone quality is not part of standard preoperative radiologic examinations. For the epiphysis, the AORI classification was maintained. For the metaphysis, the bone defect was classified as mild when the defect covered less than 50% of the AP or mediolateral distance of the metaphyseal zone. When the defect covered more than 50% of the AP or mediolateral distance of the metaphyseal zone, a contained defect was classified as moderate, and when there was discontinuation of the cortex, it was classified as severe. The description of the diaphyseal bone defect was also based on 50% as a cutoff of the AP or mediolateral diameter, but less than 50% was classified as none. When the defect was more than 50%, a distinction was made between an intact cortex (mild), partial intrusion of the defect into the cortex (moderate), and discontinuation of the cortex (severe).
To illustrate the new bone defect classification, we have collected example radiographs for every type of bone defect (Fig. 3A-B).
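
The metaphyseal and diaphyseal rules described above can be written as a small decision function. The following Python sketch is illustrative only and is not part of the published classification: the function names, the `Cortex` type, and the use of a fraction between 0 and 1 are hypothetical. The text does not say how a defect of exactly 50% or a metaphysis without any defect is graded, nor how a metaphyseal defect below 50% with a discontinued cortex is graded, so those cases are handled by labeled assumptions. The epiphysis is left out because its grading follows the AORI classification shown in Fig. 2.

```python
from enum import Enum


class Cortex(Enum):
    """State of the cortex at the level where the defect is measured."""
    INTACT = "intact"
    PARTIAL_INTRUSION = "partial intrusion"
    DISCONTINUED = "discontinued"


def grade_metaphysis(defect_fraction: float, cortex_discontinued: bool) -> str:
    """Grade a metaphyseal defect.

    defect_fraction is the defect width divided by the AP or mediolateral
    width of the metaphyseal zone (0.0 to 1.0).
    """
    if defect_fraction <= 0.0:
        return "none"  # assumption: no measurable defect is graded as none
    if defect_fraction < 0.5:
        return "mild"  # assumption: cortex state is ignored below the 50% cutoff
    # Assumption: exactly 50% is treated like "more than 50%".
    return "severe" if cortex_discontinued else "moderate"


def grade_diaphysis(defect_fraction: float, cortex: Cortex) -> str:
    """Grade a diaphyseal defect from its fraction of the AP or mediolateral diameter."""
    if defect_fraction < 0.5:
        return "none"
    # Assumption: exactly 50% is treated like "more than 50%".
    if cortex is Cortex.INTACT:
        return "mild"
    if cortex is Cortex.PARTIAL_INTRUSION:
        return "moderate"
    return "severe"


if __name__ == "__main__":
    print(grade_metaphysis(0.7, cortex_discontinued=False))
    print(grade_diaphysis(0.6, Cortex.PARTIAL_INTRUSION))
```

Keeping the cortex state as an explicit input mirrors the way the classification separates the size of a defect from its containment.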


The sample size calculation was based on the agreement probability and chance agreement probability of the ratings done during the design phase of the classification. The chance agreement reflects the agreement between observers based on random rating, not on true agreement, and is used to adjust the agreement to avoid overestimation of the agreement probability. The sample size was powered at 80%, with an expected overall agreement probability of 0.8, a chance agreement probability of 0.4, and assuming the sample was drawn from a population of n = 100. This resulted in a required sample size of 61 [8]. Preoperative clinical images of all patients who underwent revision TKA or a repeat revision TKA in our hospital in 2018 were collected. Patients were excluded when (1) no CT image was available, (2) more than 6 months elapsed between the radiograph and CT, (3) the radiograph and CT image were taken more than 6 months before surgery, or (4) a fracture of the tibia or femur was evident on the radiographs. This resulted in 61 patients being included in the study.

All images were de-identified. Five orthopaedic surgeons (KD, RdJ, GvH, JL, JS) independently rated the severity of bone defects on the images of all patients. All five observers were members of the clinical knee reconstruction unit of our clinic and were experienced in revision TKA, with between 5 and 23 years of experience. No observers had participated in the pilot study during the design phase. All observers scored all defects twice on radiographs, with a minimum of 2 weeks between the two timepoints (Timepoint 1 [T1] and Timepoint 2 [T2]) (Fig. 4). The order of the radiographs was identical at T1 and T2. After the second rating, the observer was provided with the CT image of each patient and was asked to adjust their bone defect rating if they deemed it necessary (Timepoint 3 [T3]).

Fig. 3:
A-B Shown here are example radiographs for every type of bone defect, for both (A) the femur and (B) the tibia.
Fig. 4:
This figure shows the measurement schedule and comparisons for the reliability testing. T1 = Timepoint 1; T2 = Timepoint 2 (minimum of 2 weeks after T1); T3 = Timepoint 3 (directly after T2); O1-O5 = Observer 1 to Observer 5.

Typically, the grading of bone defects, classified according to the rating of most of the observers, was none or mild (Table 1). Moderate to severe bone defects were most frequently observed in the epiphysis of the tibia. The duration of the T1 measurements on radiographs ranged between 1:10 minutes and 2:15 minutes per radiograph (median: 1:50 minutes per radiograph).

Table 1:
Number of patients per type of bone defect, specified by the six zones

Radiographs and CT Images

All clinical images were collected retrospectively. Preoperative AP and lateral knee radiographs were made with the patient in the supine position and the knee in extension (Philips Healthcare, Best, the Netherlands). The distance from the beam was adjusted to make sure the entire knee prosthesis was visible on the radiograph. Preoperative CT scanning of the knee was performed with the patient in the supine position. The patients underwent scanning in the axial plane. CT images were collected using the Toshiba Aquilion 32 (Otawara, Japan) with metal artefact reduction (135 kV/250 mAs; slice thickness: 1.0 mm) or the Philips Ingenuity (Philips Healthcare, Best, the Netherlands), 128 slice, with metal artefact reduction for large orthopaedic implants (140 kV; slice thickness: 1.0 mm).

Statistical Analysis

Given the categorical nature of the classification, we used Gwet’s agreement coefficient (AC) to test reliability. This is considered a better alternative to Cohen’s kappa because Gwet’s AC is less affected by prevalence [20]. Also, Gwet’s AC is often close to the percentage agreement between observers, and is thereby easily interpretable. We analyzed the intraobserver reliability by comparing the ratings on radiographs at T1 and T2, using Gwet’s AC with second-order chance correction (AC2) with linear weights [6]. We analyzed interobserver reliability at T1 by comparing ratings between observers using Gwet’s AC2 with linear weights. Interobserver reliability was also tested for the ratings on radiographs at T2 and for the CT ratings at T3. All statistical tests were performed using R version 3.5.3 (The R Foundation for Statistical Computing, Vienna, Austria). The agreement coefficient function was used for calculating Gwet’s AC [7]. The agreement coefficient was interpreted using the Landis and Koch scale for kappa statistics because there is no equivalent scale for Gwet’s AC [13]. In the Landis and Koch scale for kappa statistics, κ < 0 reflects poor agreement, 0 to 0.20 is slight, 0.21 to 0.40 is fair, 0.41 to 0.60 is moderate, 0.61 to 0.80 is substantial, and above 0.80 is almost perfect.
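
To make the reliability statistic concrete, the following is a minimal Python sketch of Gwet’s AC2 with linear weights for several raters. It is not the R code used in this study; it follows the general multi-rater formulation published by Gwet, and the function names (`linear_weights`, `gwet_ac2`) and the data layout are hypothetical. It assumes every subject was rated by every rater, with categories coded 0 (none) to 3 (severe), and it returns only the point estimate, not the confidence interval.

```python
def linear_weights(q):
    """Linear agreement weights: 1 on the diagonal, 0 for the two most distant categories."""
    return [[1.0 - abs(k - l) / (q - 1) for l in range(q)] for k in range(q)]


def gwet_ac2(ratings, q, weights=None):
    """Point estimate of Gwet's AC2.

    ratings: one list per subject holding the category index (0..q-1) given by each rater.
    q: number of ordered categories.
    weights: q x q agreement weights; linear weights are used when omitted.
    """
    w = weights if weights is not None else linear_weights(q)
    n = len(ratings)

    # counts[i][k] = number of raters who placed subject i in category k
    counts = []
    for subject in ratings:
        row = [0] * q
        for category in subject:
            row[category] += 1
        counts.append(row)

    # Weighted observed agreement, averaged over subjects
    p_a = 0.0
    for row in counts:
        r_i = sum(row)
        weighted = [sum(w[k][l] * row[l] for l in range(q)) for k in range(q)]
        p_a += sum(row[k] * (weighted[k] - 1.0) for k in range(q)) / (r_i * (r_i - 1))
    p_a /= n

    # Average share of ratings falling in each category
    pi = [sum(row[k] / sum(row) for row in counts) / n for k in range(q)]

    # Chance agreement under Gwet's model
    t_w = sum(sum(weight_row) for weight_row in w)
    p_e = t_w / (q * (q - 1)) * sum(p * (1.0 - p) for p in pi)

    return (p_a - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Invented ratings of three patients by five observers (0 = none ... 3 = severe)
    example = [[0, 0, 1, 0, 0], [2, 2, 3, 2, 2], [1, 1, 1, 1, 1]]
    print(round(gwet_ac2(example, q=4), 3))
```

The same function covers the intraobserver comparison if the T1 and T2 ratings of one observer are treated as two raters. With linear weights, a disagreement between adjacent grades such as none and mild is penalized less than one between none and severe.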


Intraobserver Reliability

The intraobserver reliability (Table 2) of the radiography ratings at T1 and T2 varied between 0.55 and 0.99. The lowest agreement was observed in the epiphysis of the tibia, with reliability ranging between 0.55 (95% CI 0.40 to 0.71) and 0.78 (95% CI 0.69 to 0.88). Agreement in the metaphysis was substantial to almost perfect for both the tibia and femur, ranging between 0.69 (95% CI 0.58 to 0.80) and 0.98 (95% CI 0.95 to 1). For the diaphysis, the reliability ranged between 0.95 (95% CI 0.90 to 0.99) and 0.99 (95% CI 0.97 to 1). The reliability was similar for the femur and tibia.

Table 2:
Intraobserver reliability per zone on radiographs

Interobserver Reliability

The interobserver reliability (Table 3) using radiographs varied from 0.48 (95% CI 0.39 to 0.57) to 0.97 (95% CI 0.95 to 0.99). The lowest reliability was observed in the epiphysis (between 0.48 [95% CI 0.39 to 0.57] and 0.55 [95% CI 0.46 to 0.64]), for both the femur and tibia. The metaphysis and diaphysis had almost perfect reliability (between 0.81 [95% CI 0.75 to 0.87] and 0.97 [95% CI 0.95 to 0.99]), according to the Landis and Koch scale. The interobserver reliability on CT (T3) ranged between 0.44 (95% CI 0.38 to 0.51) and 0.96 (95% CI 0.93 to 0.99), and thus did not substantially differ from reliability using only radiographs. Similar to the intraobserver agreement, the lowest reliability coefficients were observed for the ratings of the epiphysis.
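
The Landis and Koch bands used to interpret these coefficients can be expressed as a simple lookup. This Python sketch is illustrative only; the function name `landis_koch_label` is hypothetical, and it assumes that each band includes its upper bound, so that values between 0.20 and 0.21, for example, are read as slight.

```python
def landis_koch_label(coefficient):
    """Map an agreement coefficient to its Landis and Koch label."""
    if coefficient < 0:
        return "poor"
    # Assumption: each band is closed at its upper bound.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper, label in bands:
        if coefficient <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Lower and upper interobserver coefficients reported in the text
    for value in (0.48, 0.55, 0.81, 0.97, 0.44, 0.96):
        print(value, landis_koch_label(value))
```

Under these assumptions, 0.44, 0.48, and 0.55 fall in the moderate band, whereas 0.81, 0.96, and 0.97 are almost perfect.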

Table 3:
Interobserver reliability per zone, separately for timepoint of the rating and modality


For revision TKA, a reproducible and extensive classification for bone defects in all anatomic zones of the tibia and femur is needed to compare the outcome of surgical options for revision TKA and for comparisons of patients between cohorts and registries. We developed and described here a new bone defect classification, including the diaphysis, which is not part of the most commonly used AORI system for bone defects. We found that this bone defect classification had high intra- and inter-rater agreement for bone defects in the metaphysis and diaphysis, but performed worse for bone defects in the epiphysis.

This study has several limitations that merit attention. First, our study only tested reliability of the bone defect classification. Evaluation of the validity of the classification by comparing it to intraoperative findings and its clinical value for decision making is required before implementation for research or clinical purposes. Future studies of a prospective nature, and thus an independent data set, are necessary. Second, all observers in this study work in the same high-volume clinic. Therefore, they are all familiar with evaluating radiographs of a prosthesis in situ and discussing patients based on radiographs, which may improve agreement between raters and thus limit generalizability. Future evaluation with observers from other centers is warranted to substantiate our findings. Third, the order of the radiographs was not randomized due to practical issues involving a software limitation. We attempted to reduce recall bias by having a minimum of 2 weeks between the radiographic ratings.

We also extensively described the classification to the observers to minimize the confounding effect of learning on reliability [21]. We provided standardized verbal instructions that were supplemented by instructional videos (see Videos 1-5, Supplemental Content 1 through 5) describing the bone defects of five patients using the new classification. However, we consider this instruction part of the bone defect classification and have made it publicly available (

Intra- and Interobserver Reliability of the Epiphysis

Overall, the reliability was substantially lower for the epiphysis than for the other two zones. This might be because the prosthesis is in situ, obscuring bony defects and complicating an evaluation of them. In particular, the visibility of epiphyseal defects in the femur was influenced to a great extent by the component type. TKAs with a posterior-stabilized design resulted in poorer visibility of the epiphysis than did a cruciate-retaining design. Lower agreement between raters may also be explained by the larger bone defects in the epiphysis that result from the presence of the prosthesis. It should be noted that the scoring of the epiphysis differs from that of the other zones in the new bone defect classification: the epiphysis is scored based on the size of the defect in mm (according to AORI), whereas the diaphysis and metaphysis are scored as a percentage of bone loss. This may be an alternative explanation for the difference in reliability between the zones. We are aware of only one study that has assessed agreement between observers using the AORI classification [16]. The authors reported moderate agreement between observers, with the outcome of the study reported as the percentage of physicians scoring the same way. To enable a direct comparison, we recalculated the interobserver reliability based on that study’s results using the same statistical test. This resulted in an interobserver reliability of 0.39 (95% CI 0.27 to 0.51) for the tibia and 0.57 (95% CI 0.45 to 0.69) for the femur; agreement in that previous study was thus slightly lower than in ours.

Use of CT Did Not Improve Reliability of Classification

The interobserver reliability when both CT and radiographic images were used to rate bone defects (T3) was similar to the reliability at T2 using radiographs only. This suggests that, in most cases, CT images did not add value to agreement among and between raters. In particular, the metal of the prosthesis resulted in artefacts in the epiphysis that obscured the images, even with metal subtraction software, decreasing the visibility of small defects. This was contrary to our expectations because CT generally provides more detailed images [11], and a previous study on the reliability of a classification of ossifications found that the reliability improved when CT images were added [12]. However, in some cases, CT was decisive for the size or severity of the defect. For example, discontinuation of the cortex caused by intrusion of the stem tip was sometimes missed on radiographs but was visible on CT.


This study was a first step to standardize bone defects in all zones relevant for revision TKA. In addition to this single-center reliability study, further studies testing reliability of use of the bone defect classification by raters outside our clinic and validity testing are necessary. Such studies should clarify if the new bone defect classification can be used for research purposes, such as development of treatment algorithms and the evaluation of the outcomes of different treatment options for large bone defects. Such a classification scheme could also enable comparisons of revision TKA patients across different registries.


We thank José Smolders PhD, Vincent Busch PhD, Bram Nijsse MD, Stijn van Gennip MD, Koen Defoort MD, Joris Lansdaal MD, and Richard de Jong MD for their contribution by rating the radiographs during the development and/or study of this classification.


1. Bae DK, Song SJ, Heo DB, Lee SH, Song WJ. Long-term survival rate of implants and modes of failure after revision total knee arthroplasty by a single surgeon. J Arthroplasty. 2013;28:1130-1134.
2. Baier C, Luring C, Schaumburger J, Köck F, Beckmann J, Tingart M, Zeman F, Grifka J, Springorum HR. Assessing patient-oriented results after revision total knee arthroplasty. J Orthop Sci. 2013;18:955-961.
3. Bayliss LE, Culliford D, Monk AP, Glyn-Jones S, Prieto-Alhambra D, Judge A, Cooper C, Carr AJ, Arden NK, Beard DJ, Price AJ. The effect of patient age at intervention on risk of implant revision after total replacement of the hip or knee: a population-based cohort study. Lancet. 2017;389:1424-1430.
4. Engh GA, Ammeen DJ. Classification and preoperative radiographic evaluation: knee. Orthop Clin North Am. 1998;29:205-217.
5. Greidanus NV, Peterson RC, Masri BA, Garbuz DS. Quality of life outcomes in revision versus primary total knee arthroplasty. J Arthroplasty. 2011;26:615-620.
6. Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008;61:29-48.
7. Gwet KL. R functions for calculating agreement coefficients. Available at: Accessed May 6, 2019.
8. Gwet KL. Sample size determination. Available at: Accessed May 6, 2019.
9. Hamilton DF, Howie CR, Burnett R, Simpson AH, Patton JT. Dealing with the predicted increase in demand for revision total knee arthroplasty. Bone Joint J. 2015;97:723-728.
10. Hardeman F, Londers J, Favril A, Witvrouw E, Bellemans J, Victor J. Predisposing factors which are relevant for the clinical outcome after revision total knee arthroplasty. Knee Surg Sports Traumatol Arthrosc. 2012;20:1049-1056.
11. Keiler A, Riechelmann F, Thöni M, Brunner A, Ulmar B. Three-dimensional computed tomography reconstruction improves the reliability of tibial pilon fracture classification and preoperative surgical planning. Arch Orthop and Trauma Surg. 2019;1-9 [Published online ahead of print September 16, 2019]. DOI: 10.1007/s00402-019-03259-8.
12. Kudo H, Yokoyama T, Tsushima E, Ono A, Numasawa T, Wada K, Tanaka S, Toh S. Interobserver and intraobserver reliability of the classification and diagnosis for ossification of the posterior longitudinal ligament of the cervical spine. Eur Spine J. 2013;22:205-210.
13. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174.
14. LROI. LROI report 2019. Available at: Accessed May 10, 2019.
15. Morgan-Jones R, Oussedik SIS, Graichen H, Haddad FS. Zonal fixation in revision total knee arthroplasty. Bone Joint J. 2015;97:147-149.
16. Pecora JR, Hinckel BB, Demange MK, Gobbi RG, Tirico LE IM. Interobserver correlation in classification of bone loss in total knee arthroplasty. Acta Ortop Bras. 2011;19:368-372.
17. Qiu YY, Yan CH, Chiu KY, Ng FY. Review article: treatments for bone loss in revision total knee arthroplasty. J Orthop Surg. 2012;20:78-86.
18. Van Kempen RW, Schimmel JJ, Van Hellemondt GG, Vandenneucker H, Wymenga AB. Reason for revision TKA predicts clinical outcome: prospective evaluation of 150 consecutive patients with 2-years followup knee. Clin Orthop Relat Res. 2013;471:2296-2302.
19. Wainwright C, Theis JC, Garneti N, Melloh M. Age at hip or knee joint replacement surgery predicts likelihood of revision surgery. J Bone Joint Surg Br. 2011;93:1411-1415.
20. Wongpakaran N, Wongpakaran T, Wedding D, Gwet KL. A comparison of Cohen’s kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Med Res Methodol. 2013;13:61.
21. Yu R, Hofstaetter JG, Sullivan T, Costi K, Howie DW, Solomon LB. Validity and reliability of the Paprosky acetabular defect classification hip. Clin Orthop Relat Res. 2013;471:2259-2265.


© 2019 by the Association of Bone and Joint Surgeons