In contrast to x-ray, magnetic resonance imaging (MRI) provides ample information for a differentiated assessment of the degenerative state of an intervertebral disc (IVD). It has been shown, for instance, that the signal intensity is correlated with degenerative changes related to the biochemical composition of the disc, such as water content or, with appropriate techniques, the proteoglycan concentration in the nucleus.^{1–7} To formalize evaluations and make them comparable across studies, Pfirrmann et al^{8} defined 5 grades of disc degeneration depending on image attributes of the midsagittal slice of T2-weighted (T2w) MRI scans, including the signal intensity and homogeneity of the nucleus, the discriminability of annulus and nucleus, and the disc height of each IVD. While easy to understand and apparently intuitive, applying grading systems such as Pfirrmann's scheme in practice in a consistent, unbiased manner has proven to be notoriously difficult. According to the literature and our own investigations, the intrarater reliability for Pfirrmann's grading scheme can be expected to be around κ ≈ 0.8 (Cohen's κ) on average, whereas the interrater reliability is probably closer to only κ ≈ 0.6.^{8–12}

One reason for this limited reliability may be that the distinction between qualitative image attributes such as “homogeneous/nonhomogeneous” or “bright-white/white” is highly subjective. Another reason is the existence of “hybrid stages” of degeneration that do not conform fully to any of the 5 degrees as defined by Pfirrmann. Unsurprisingly, different raters regularly come to different conclusions based on the same evidence due to personal bias, differences in experience, or different interpretations of the grading rules. In fact, observers often do not agree even with themselves when repeating evaluations, resulting in relatively high intraobserver variability. This lack of reproducibility may add enough noise to hide interesting effects or patterns, interfering with the requirements of studies that rely on consistent grading, such as long-term follow-up studies with small sample sizes that investigate, for example, the potential impact of medical treatment of the spine on the progression of disc degeneration in either the treated or adjacent segments. The only way to eradicate, or at least mitigate, these sources of error when using semiquantitative grading schemes such as Pfirrmann's is to reduce the influence of human subjectiveness and erratic behavior by replacing human raters with a deterministic algorithm, that is, with software.

Alomari et al^{13} used classical computer vision methods combined with (shallow) machine learning to detect “abnormal” discs on T2w sagittal images, whereas Ghosh et al^{14} and Hashia and Mir^{15} created similar systems to classify lumbar disc herniation. Da Silva Barreiro et al^{16} proposed a semiautomatic method, where the disc has to be first segmented manually and then a multilayer perceptron uses up to 15 texture features to classify the degeneration grade according to Pfirrmann. This system achieved an average sensitivity of 75.2%, although the performance varied widely for different grades. The training data (n = 210 discs) did not include any grade 5 discs. Castro-Mateos et al^{17} developed a similar system, which only requires manual selection of the discs' centroid. They report an average sensitivity of 87.3% based on a multilayer perceptron with a single hidden layer.

Although these techniques based on shallow learning have the advantage of requiring only small amounts of training data, suitable features need to be handcrafted specifically for each application, and the results are often not as robust as one would wish with regard to unforeseen variability in the input data. The recent emergence of practical deep learning methods based on deep artificial neural networks established ways to create robust, reliable image classifiers for a wide range of applications, including medical image processing.^{18} Jamaludin et al^{19} used a multitask deep convolutional neural network (CNN) to classify disc image volumes of T2w sagittal MRI scans with respect to 6 different labels, including the Pfirrmann grade.^{19,20} Based on a large training set of n = 12,018 discs derived from the Genodisc project, they achieved an average sensitivity of 70.1% for the Pfirrmann degeneration grading.

In this study, we also used deep learning to train a CNN-based classifier for disc degeneration according to the Pfirrmann scheme, which automatically and reproducibly classifies MRI scans of discs with high discriminatory power, in a self-consistent manner, and with a negligible error rate. Because the availability of large quantities of high-quality training data is crucial for achieving accurate and robust performance, we focused on creating a training database with as little label noise as possible and performed extensive hyperparameter optimization to achieve state-of-the-art classification performance. The result is a system that not only exceeds human-level capabilities, but also outperforms existing AI solutions.

MATERIALS AND METHODS
Training Data and Preprocessing
The classifier developed by the authors of this study is based on a deep CNN, which learns the connection between input (T2w MRI sagittal image of the IVD) and output (Pfirrmann grade) via supervised learning by training on a large pool of input-output samples. Imaging data of 2075 patients from the Genodisc project (The European Union Health Project on Genes and Disc Degeneration, FP7 project HEALTH-F2–2008-201626) served as the basis for creating a sufficiently large training data set. Genodisc was a multicenter study with participants in the United Kingdom, Italy, Hungary, and Slovenia, and the image data originate from a wide array of different scanners using varying image acquisition parameters, depending on clinical demands. Each participating center obtained ethical approval independently from its local ethics committee (as required by national laws). The approvals included the possibility of using the images in anonymized form, and of sharing them among the Genodisc partners, for scientific and educational purposes.

For the current study, a single expert (rater 1) reevaluated the lumbar disc degeneration of each disc of a random subset of 1238 patients (6179 discs) using a custom evaluation tool written in C++ with Qt and OpenCV, without access to the original evaluations from the Genodisc project. For each disc, the expert first selected the midsagittal slice of the T2w MRI scan data and then defined a region of interest (ROI) encapsulating the disc and approximately 30% of the upper and lower vertebrae as well as some spinal canal/cerebrospinal fluid as a reference for a signal intensity considered “bright white.”

Then the evaluator assigned an extended, fractional Pfirrmann score that we defined based on the conventional 5-grade scheme, but which features 0.3 and 0.7 transitional grades between each integral degeneration grade (ie, in this scheme, valid grades are 1.0, 1.3, 1.7, 2.0, 2.3, 2.7, 3.0, 3.3, …, 5.0). This allowed us to express uncertainties or tendencies when the image attributes did not fit exactly 1 of the 5 categories defined by Pfirrmann. For instance, for any conventional Pfirrmann grade x ∈ {1, 2, 3, 4}, a grade of x + 0.3 means that the disc is probably slightly more degenerated than a Pfirrmann grade x would suggest, but it also means that it is closer to x than to grade x + 1. x + 0.7 on the other hand expresses that the grade should be rounded to x + 1 if an (integral) Pfirrmann grade was desired. Note that fractional Pfirrmann grades can be easily converted to conventional (integral) Pfirrmann grades by rounding to the closest integer.
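The mapping between the extended fractional scale and conventional Pfirrmann grades can be sketched as follows (the helper names are ours, for illustration; this is not the authors' C++ evaluation-tool code):

```python
# Valid grades on the extended fractional Pfirrmann scale
VALID_FRACTIONAL_GRADES = [1.0, 1.3, 1.7, 2.0, 2.3, 2.7,
                           3.0, 3.3, 3.7, 4.0, 4.3, 4.7, 5.0]

def to_integral_grade(fractional: float) -> int:
    """Convert a fractional grade to a conventional Pfirrmann grade:
    x + 0.3 rounds down to x, x + 0.7 rounds up to x + 1."""
    if fractional not in VALID_FRACTIONAL_GRADES:
        raise ValueError(f"not a valid fractional grade: {fractional}")
    return round(fractional)
```

For example, a fractional grade of 2.3 maps to a conventional grade 2, while 2.7 maps to grade 3.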

To estimate the intrarater reliability, the evaluator graded 840 discs a second time after 4 weeks had passed (rater 1*). To determine the interrater reliability, a second rater also evaluated 96 discs from the same random subset (rater 2).

As it turned out, our evaluation of the subset of the Genodisc database yielded roughly normally distributed degeneration grades, such that, in particular, grade 1 and grade 5 discs were underrepresented in the initial training data. After receiving a positive vote from our institution's ethics committee (vote no. 50/20), we therefore acquired supplemental, completely anonymized sagittal T2w image data of 1868 nonconsecutive patients in an age range between 18 and 95 years, to whom lumbar MRI had been prescribed for medical reasons and who were scanned between 2016 and 2018 using either Siemens 1.5-T or 3.0-T scanners. From this set, we selected both maximally healthy cases (201 cases, equaling 973 discs; average age, 34 ± 12 years) and severely degenerated cases (160 cases, equaling 796 discs; average age, 69 ± 12 years). After selecting the midsagittal slices and defining ROIs for each disc, we applied the classifier trained on the Genodisc data set to generate pseudo-labeled data, which we then reviewed manually, correcting obvious misclassifications in all cases where the pseudo-label deviated by at least 1.7 Pfirrmann grades from the grade assigned by the human rater (84 of 1769 discs ≈ 4.8%).

We randomly shuffled the data patientwise before extracting the ROI of each disc from the full image data (varying between 38 and 154 pixels in height and 18 and 93 pixels in width), padding the ROI to a 1:1 aspect ratio, and resampling the cropped and padded disc image data to 128 × 128 pixels. Because zero-centering did not result in improved learning, we decided to simply normalize the 16-bit gray values of each image to [−1, 1] (single-precision floating point). From each training sample, we generated 4 augmented samples by flipping the images horizontally and applying a random rotation of ±20 degrees.
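The preprocessing steps above can be sketched in NumPy as follows; this is a minimal illustration in which nearest-neighbor indexing stands in for the unspecified resampling filter, and the function names are ours:

```python
import numpy as np

def preprocess_roi(roi: np.ndarray, size: int = 128) -> np.ndarray:
    """Pad a cropped disc ROI to a 1:1 aspect ratio, resample it to
    size × size pixels (nearest-neighbor here, as a stand-in for the
    unspecified interpolation), and normalize 16-bit gray values to
    [-1, 1] in single precision."""
    h, w = roi.shape
    side = max(h, w)
    padded = np.zeros((side, side), dtype=roi.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    padded[top:top + h, left:left + w] = roi
    # nearest-neighbor resampling to the target resolution
    rows = np.arange(size) * side // size
    cols = np.arange(size) * side // size
    resized = padded[np.ix_(rows, cols)]
    return (resized.astype(np.float32) / 65535.0) * 2.0 - 1.0

def augment(img: np.ndarray):
    """Yield the image and its horizontal flip; the random ±20°
    rotations we also apply would typically use scipy.ndimage.rotate."""
    yield img
    yield img[:, ::-1]
```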

Network Architecture, Losses, and Training
The starting point for the architecture of the classification network was an off-the-shelf VGG16-like architecture^{21} that takes a 128 × 128 gray value image as input and maps it to 1 of the 13 possible degeneration classes (fractional Pfirrmann grades). The network initially consisted of 8 convolutional layers and 2 fully connected classification layers with 1024 units each, as well as a SoftMax output layer. The parameters were initialized as proposed by He et al.^{22} During the process of optimizing our system, we also improved the architecture of our network step by step: we replaced ReLU (rectified linear unit) activations with leaky ReLUs (α = 0.01) and added postactivation batch normalization after each convolutional layer, leading to faster convergence. Using dilated convolutions made it possible to reduce the number of max pooling layers and thus increase the spatial resolution, resulting in better predictions.^{23} To combat overfitting, we added spatial dropout layers after each convolutional block (ie, before max pooling) as well as L2 regularization (λ = 5 · 10^{−4}) and replaced the 2 fully connected (dense) classification layers of the original network with a global average pooling layer that averages the feature maps of the last convolutional layer.^{24} This reduced the number of parameters from 22 million to 2.4 million and significantly reduced overfitting, while improving prediction accuracy (Table 1). The current implementation of the CNN is based on Python 3.7.6 (Python Software Foundation, Beaverton, OR) and TensorFlow 2.1.0 (Google LLC, Mountain View, CA).

TABLE 1 -
Architecture of the CNN Classifier Network

Block  Layer  Description                                     Output Shape     Parameters
1      1      Convolution 2D, 3 × 3 × 32                      128 × 128 × 32   320
       2      Leaky ReLU, α = 0.01                            128 × 128 × 32   0
       3      Batch normalization                             128 × 128 × 32   128
       4      Dropout, P = 0.2                                128 × 128 × 32   0
2      5      Convolution 2D, 3 × 3 × 64, dilation rate = 2   128 × 128 × 64   18,496
       6      Leaky ReLU, α = 0.01                            128 × 128 × 64   0
       7      Batch normalization                             128 × 128 × 64   256
       8      Convolution 2D, 3 × 3 × 64, dilation rate = 2   128 × 128 × 64   36,928
       9      Leaky ReLU, α = 0.01                            128 × 128 × 64   0
       10     Batch normalization                             128 × 128 × 64   256
       11     Dropout, P = 0.2                                128 × 128 × 64   0
       12     Max pooling, 2 × 2                              64 × 64 × 64     0
3      13     Convolution 2D, 3 × 3 × 128, dilation rate = 2  64 × 64 × 128    73,856
       14     Leaky ReLU, α = 0.01                            64 × 64 × 128    0
       15     Batch normalization                             64 × 64 × 128    512
       16     Convolution 2D, 3 × 3 × 128, dilation rate = 2  64 × 64 × 128    147,584
       17     Leaky ReLU, α = 0.01                            64 × 64 × 128    0
       18     Batch normalization                             64 × 64 × 128    512
       19     Dropout, P = 0.2                                64 × 64 × 128    0
4      20     Convolution 2D, 3 × 3 × 256, dilation rate = 3  64 × 64 × 256    295,168
       21     Leaky ReLU, α = 0.01                            64 × 64 × 256    0
       22     Batch normalization                             64 × 64 × 256    1024
       23     Convolution 2D, 3 × 3 × 256, dilation rate = 3  64 × 64 × 256    590,080
       24     Leaky ReLU, α = 0.01                            64 × 64 × 256    0
       25     Batch normalization                             64 × 64 × 256    1024
       26     Dropout, P = 0.2                                64 × 64 × 256    0
       27     Max pooling, 2 × 2                              32 × 32 × 256    0
5      28     Convolution 2D, 3 × 3 × 512                     28 × 28 × 512    1,180,160
       29     Leaky ReLU, α = 0.01                            28 × 28 × 512    0
       30     Batch normalization                             28 × 28 × 512    2048
       31     Dropout, P = 0.2                                28 × 28 × 512    0
6      32     Global average pooling                          512              0
       33     Fully connected, n = 13                         13               6669
       34     SoftMax                                         13               0

If trained on integral grades, the classifier layer's (#33) output shape is reduced to 5. For the regression models, the fully connected layer (#33) has a single node and the SoftMax activation (#34) is replaced by a linear activation layer.

CNN, convolutional neural network; 2D, 2-dimensional; ReLU, rectified linear unit.
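The parameter counts in Table 1 follow from the standard formulas for convolutional, batch-normalization, and dense layers; as a small self-check (our own illustration, not the authors' code; note that the batch-normalization counts include the 2 non-trainable moving statistics per channel):

```python
def conv2d_params(kernel, c_in, c_out):
    # kernel*kernel*c_in weights per filter, plus one bias per filter
    return kernel * kernel * c_in * c_out + c_out

def batchnorm_params(channels):
    # gamma, beta, moving mean, moving variance (4 per channel)
    return 4 * channels

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

# spot checks against Table 1
assert conv2d_params(3, 1, 32) == 320            # layer 1
assert conv2d_params(3, 32, 64) == 18_496        # layer 5
assert conv2d_params(3, 128, 128) == 147_584     # layer 16
assert conv2d_params(3, 256, 512) == 1_180_160   # layer 28
assert batchnorm_params(512) == 2_048            # layer 30
assert dense_params(512, 13) == 6_669            # layer 33
```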

During the entire hyperparameter optimization process, we maintained the same 90%/10% training/validation data set split. We employed plain SGD (stochastic gradient descent) with momentum (learning rate, 0.001; momentum, 0.9; batch size, 32) and explicit learning rate scheduling, reducing the learning rate by a factor of 10 after 10 epochs on a loss plateau. We performed model checkpointing by saving the model's weights after each epoch whenever the validation accuracy had improved compared with the previous checkpointed version, ensuring that we retained the version of the model that performed best during the entire training period. Training was performed on an NVIDIA Titan X Pascal (12 GB) and, more recently, an NVIDIA Titan RTX (24 GB) GPU, supported by CUDA 10.1 with cuDNN (NVIDIA Corporation, Santa Clara, CA).
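The plateau-based schedule can be sketched in a few lines; this is a simplified stand-in mirroring the behavior of Keras' ReduceLROnPlateau callback (factor 0.1, patience 10), not our actual training code:

```python
class PlateauScheduler:
    """Reduce the learning rate by a factor of 10 after `patience`
    epochs without improvement of the monitored validation loss."""

    def __init__(self, lr=0.001, patience=10, factor=0.1):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Call once per epoch; returns the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```

Model checkpointing follows the same pattern: after each epoch, the weights are saved only if the validation accuracy improved over the previously saved version.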

By default, we treated the assignment of a degeneration grade as an image classification problem and used categorical cross-entropy (CCE) as a loss function to optimize the network's parameters. Yet, Pfirrmann grading could equally well be considered an ordinal regression problem, interpreting Pfirrmann grades as ranks ordered by severity of degeneration and thus constituting an ordinal scale. In that case, we could additionally utilize the distance between predicted and ground truth degeneration rank as a penalty, which might help improve the accuracy of the system further. We therefore also trained models using a modified CCE loss that additionally considers grading rank distance (“ordinal categorical cross-entropy”):

$$\mathcal{L}_{\text{ordinal CCE}} = -\frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{1}{m-1}\,\Bigl|\operatorname*{arg\,max}_{j}\,y_{ij} - \operatorname*{arg\,max}_{j}\,p_{ij}\Bigr|\right)\sum_{j=1}^{m} y_{ij}\,\log p_{ij}$$

$y_i$ is the one-hot coded, rank-transformed ground truth Pfirrmann grade for sample $i$ of $n$; that is, for rank $k$ of $m$ possible ranks, $y_{ik} = 1$ and $y_{ij} = 0$ for all $j \ne k$. $p_i$ is the $m$-vector of predicted class probabilities after SoftMax, with $\sum_{j=1}^{m} p_{ij} \equiv 1$.
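A direct NumPy transcription of this loss (useful for checking the definition; not the TensorFlow implementation used during training) might look like:

```python
import numpy as np

def ordinal_cce(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Ordinal categorical cross-entropy as defined above.

    y_true: (n, m) one-hot, rank-transformed ground truth grades.
    y_pred: (n, m) SoftMax class probabilities (rows sum to 1, no zeros).
    """
    n, m = y_true.shape
    rank_dist = np.abs(y_true.argmax(axis=1) - y_pred.argmax(axis=1))
    weights = 1.0 + rank_dist / (m - 1)                  # in [1, 2]
    cce_terms = np.sum(y_true * np.log(y_pred), axis=1)  # <= 0 per sample
    return float(-np.mean(weights * cce_terms))
```

When the predicted and true ranks coincide, the weight is 1 and the loss reduces to plain categorical cross-entropy; misranked samples are penalized up to twice as strongly.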

Another option is to use a loss based on a “softened,” weighted Cohen's κ, as proposed by de la Torre et al,^{25} where the objective is to minimize

$$\mathcal{L}_{\text{soft } \kappa} = \log\left(1 - \kappa\right)$$

Because the output of the CNN is not a predicted class, but a probability distribution over all possible classes, the definition of κ needs to be adapted. The interested reader may find a detailed derivation of this “soft kappa” in the study by de la Torre et al.^{25}

Assuming rough equidistance of grades, one could even pretend that the Pfirrmann grades were indeed defined on a continuous scale and simply round the predicted continuous grade to the next valid discrete grade, rephrasing the problem as a regression task. We used both mean square error (MSE) loss

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - p_i\right)^2$$

as well as mean absolute error (MAE) loss

$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - p_i\right|$$

to investigate this possibility. In both cases, $y_i$ refers to the fractional Pfirrmann grade of the $i$-th training sample and $p_i$ to the predicted continuous value.
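For the regression variants, the continuous prediction is snapped to the nearest valid fractional grade, which can be sketched as (our own illustration):

```python
import numpy as np

# all valid grades on the extended fractional Pfirrmann scale
GRADES = np.array([1.0, 1.3, 1.7, 2.0, 2.3, 2.7,
                   3.0, 3.3, 3.7, 4.0, 4.3, 4.7, 5.0])

def snap_to_grade(prediction: float) -> float:
    """Round a continuous regression output to the nearest valid
    fractional Pfirrmann grade."""
    return float(GRADES[np.abs(GRADES - prediction).argmin()])
```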

Aside from training to predict fractional grades, we also experimented with training on integral degeneration grades to increase the number of samples per class or using class weighting to mitigate the effects of class imbalance. In addition, we checked how the size of the training data set and the chosen image resolution affect classification performance.

Validation/Performance Evaluation
For all variants, we evaluated the performance on the validation subset using the following classification and/or correlation metrics:

- Cohen's κ (unweighted and with linear error weights)
- Accuracy (exact agreement of prediction with ground truth)
- Matthews correlation coefficient (MCC)
- Sensitivity, precision, specificity, F1 score, and Jaccard similarity coefficient
- Lin's concordance correlation coefficient
- MAE
- Pearson's r
We chose these metrics either because they are particularly meaningful for the overall performance of the system or because they establish at least some degree of comparability with solutions by other groups (eg, sensitivity, specificity, and Lin's concordance).
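For reference, linearly weighted Cohen's κ, the headline metric throughout this paper, can be computed from two label sequences as follows; this is a self-contained sketch, equivalent in spirit to sklearn.metrics.cohen_kappa_score with weights='linear':

```python
import numpy as np

def weighted_kappa(a, b, labels, weights="linear"):
    """Cohen's κ with linear (or quadratic) error weights for two
    label sequences a and b over the given ordered label set."""
    m = len(labels)
    idx = {label: i for i, label in enumerate(labels)}
    observed = np.zeros((m, m))
    for x, y in zip(a, b):
        observed[idx[x], idx[y]] += 1
    observed /= observed.sum()
    # expected agreement from the raters' marginal distributions
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    i, j = np.indices((m, m))
    w = np.abs(i - j) / (m - 1)  # linear error weights
    if weights == "quadratic":
        w = w ** 2
    return 1.0 - (w * observed).sum() / (w * expected).sum()
```

Perfect agreement yields κ = 1; agreement at chance level yields κ ≈ 0.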

Because we lack an external test set, there is a danger of overfitting to the validation set during hyperparameter optimization. As a safeguard, we performed 10-fold cross-validation using the best/default model and computed mean scores and corresponding 3σ credible intervals (defined as the highest posterior density interval) using Bayesian bootstrapping with 10,000 resamples.
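The Bayesian bootstrap replaces multinomial resampling with Dirichlet(1, …, 1) weights over the observed fold scores; a compact sketch of the interval estimation (our own illustration, not the authors' exact implementation):

```python
import numpy as np

def bayesian_bootstrap_hdi(scores, n_resamples=10_000, mass=0.9973, seed=0):
    """Bayesian bootstrap of the mean of `scores` with a highest-
    density interval covering `mass` (≈ 3σ) of the posterior draws."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Dirichlet(1, ..., 1) weights instead of multinomial resampling
    weights = rng.dirichlet(np.ones(len(scores)), size=n_resamples)
    means = np.sort(weights @ scores)
    k = int(np.ceil(mass * n_resamples))  # draws inside the interval
    # narrowest window of k consecutive sorted draws = HDI estimate
    widths = means[k - 1:] - means[:n_resamples - k + 1]
    lo = int(np.argmin(widths))
    return float(means.mean()), (float(means[lo]), float(means[lo + k - 1]))
```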

RESULTS
Training Data Characteristics and Rater Reliability
The blue bars in Figure 1 illustrate the distribution of Pfirrmann grades in the ground truth after evaluating the subset of the Genodisc data. Although it contains a sufficient number of grade 2, 3, and 4 samples, there are only 61 grade 1 and 315 grade 5 discs present. Adding the supplemental data (Fig. 1, orange bars) that we selected for very healthy and strongly degenerated spines boosts the sample counts for grade 1 from 61 to 354 (+293), for grade 2 from 1067 to 1557 (+490), for grade 3 from 2953 to 3354 (+401), for grade 4 from 1783 to 2147 (+364), and for grade 5 from 315 to 536 (+221). Although the data distribution is still imbalanced, combined with 4-fold augmentation, it is now sufficiently large for training a deep learning classifier.

FIGURE 1: Distribution of gradings for the Genodisc data subset (blue), the supplemental data used to boost the amount of grade 1 and grade 5 samples (orange), and the combined data set.

We determined an intrarater reliability of κ = 0.852 (Cohen's κ with linear error weights) for fractional Pfirrmann grades and κ = 0.867 for integral grades. Both evaluations agree exactly with each other in 74% of cases (integral, 88%). In contrast, the interrater reliability is only κ = 0.541 (integral, κ = 0.585), and both raters agreed in only 38% of cases (integral, 73%). Table 2 lists global performance scores for intrarater and interrater reliability, and Table 3 breaks down precision, sensitivity, F1 score, and Jaccard similarity coefficient for each grade separately, assuming rater 1 as the ground truth. Note that, according to our main rater, the subset of 96 discs evaluated by both raters contained only one grade 1 and three grade 2 discs, resulting in highly uncertain statistics for these grades.

TABLE 2 -
Comparison of Global Performance Metrics for Human and AI Raters

                                            Cohen κ  Acc.    MCC    Spec.   Sens.   Prec.  F1     Jacc.  Lin    MAE    r
Human intrarater
  This study                                0.867    0.881   0.823  0.965   0.860   0.858  0.858  0.753  0.916  0.141  0.935
  Pfirrmann et al, 2001^{8}                 0.880*   0.908†  —      —       —       —      —      —      —      —      —
  Pfirrmann et al, 2004^{9}                 0.745*   —       —      —       —       —      —      —      —      —      —
  Carrino et al, 2009^{11}                  0.740    —       —      —       —       —      —      —      —      —      —
  Arana et al, 2010^{10}                    0.689    —       —      —       —       —      —      —      —      —      —
  Urrutia et al, 2016^{12}                  0.890    0.753   —      —       —       —      —      —      —      —      —
Human interrater
  This study                                0.585    0.729   0.605  0.920   0.491   0.571  0.500  0.377  0.616  0.350  0.752
  Pfirrmann et al, 2001^{8}                 0.780*   0.830†  —      —       —       —      —      —      —      —      —
  Pfirrmann et al, 2004^{9}                 0.645*   —       —      —       —       —      —      —      —      —      —
  Carrino et al, 2009^{11}                  0.660    —       —      —       —       —      —      —      —      —      —
  Arana et al, 2010^{10}                    0.491    —       —      —       —       —      —      —      —      —      —
  Urrutia et al, 2016^{12}                  0.830    0.549   —      —       —       —      —      —      —      —      —
Machine learning models
  Default model                             0.920    0.930   0.899  0.979   0.902   0.925  0.913  0.841  0.942  0.082  0.960
  Training on integral grades               0.921    0.929   0.897  0.979   0.894   0.926  0.908  0.833  0.946  0.077  0.947
  Balanced class weights                    0.910    0.919   0.883  0.977   0.902   0.885  0.892  0.807  0.935  0.108  0.951
  Ordinal categorical cross-entropy loss    0.926    0.934   0.904  0.980   0.898   0.924  0.911  0.838  0.947  0.083  0.960
  Squared ordinal categorical cross-entropy 0.919    0.930   0.898  0.979   0.894   0.927  0.909  0.834  0.940  0.084  0.957
  Soft κ loss                               0.865    0.874   0.822  0.965   0.745   0.693  0.716  0.650  0.917  0.127  0.927
  MSE loss (regression)                     0.900    0.904   0.862  0.972   0.847   0.922  0.877  0.784  0.939  0.120  0.965
  MAE loss (regression)                     0.901    0.906   0.864  0.972   0.853   0.913  0.877  0.783  0.940  0.112  0.966
  da Silva Barreiro et al, 2014^{16}        —        —       —      0.791‡  0.626‡  —      —      —      —      —      —
  Castro-Mateos et al, 2016^{17}            —        —       —      0.955§  0.873§  —      —      —      —      —      —
  Jamaludin et al, 2017^{19}                —        —       —      —       0.701∥  —      —      —      0.880  —      —

Acc. = accuracy; Spec., Sens., Prec., F1, and Jacc. denote average specificity, sensitivity, precision, F1 score, and Jaccard similarity across grades; Lin = Lin's concordance correlation coefficient; r = Pearson's r.

Cohen κ, accuracy, MCC, specificity, sensitivity, precision, F1, Jaccard similarity, as well as Lin's concordance are computed based on integral Pfirrmann grades. MAE and Pearson r are computed based on fractional Pfirrmann grades, if available. Values in bold indicate the best results for each metric.

*Averaged, unweighted κ.

†Averaged agreement = accuracy.

‡Unweighted average 1-false positive rate = specificity and true positive rate = sensitivity; grades 1–4 only.

§Differentiates only 4 classes/levels, as grade 1 and 2 were pooled together.

∥Class average accuracy as defined by Maji et al^{26} = average sensitivity.

MCC, Matthews correlation coefficient; MAE, mean absolute error.

TABLE 3 -
Per-Grade (Human) Intrarater and Interrater Reliability Compared With the Default Model's Classification Performance

                    Grade 1                Grade 2                Grade 3                Grade 4                Grade 5
                    Intra  Inter  Model   Intra  Inter  Model   Intra  Inter  Model   Intra  Inter  Model   Intra  Inter  Model
Precision           0.857  0.000  0.901   0.752  0.500  0.924   0.929  0.579  0.926   0.881  0.944  0.952   0.870  0.833  0.921
Sensitivity         0.857  0.000  0.893   0.826  0.333  0.937   0.825  1.000  0.969   0.959  0.667  0.897   0.833  0.455  0.816
F1 score            0.857  0.000  0.897   0.788  0.400  0.930   0.874  0.732  0.947   0.918  0.782  0.924   0.851  0.588  0.865
Jaccard similarity  0.750  0.000  0.814   0.650  0.250  0.869   0.776  0.577  0.899   0.849  0.642  0.858   0.741  0.417  0.762

The first evaluation by rater 1 is considered as the ground truth. Values in bold indicate the best results per grade and per metric.

When interpreted as a regression problem, the evaluation of rater 1 correlates with the repeated evaluation (rater 1*) with r = 0.935 (Pearson's r ), corresponding to an MAE of 0.141 grades. The correlation of rater 1 with rater 2 on the other hand is r = 0.752, and the MAE doubles to 0.350 grades.

Model Evaluation
After training for 100 epochs, the CNN exactly agrees with the human evaluator in 83% of the validation set cases. For integral grades, the accuracy reaches 93%. In 99.2% of cases, the CNN's grading deviates by no more than one full Pfirrmann grade from the ground truth. If we take a closer look at the “wrongly” classified samples, we can see that it is not always clear whether the ground truth is actually more correct than the model's prediction (Fig. 2). Cohen's κ = 0.916 (integral, κ = 0.920) indicates excellent agreement with the human “teacher,” even when accounting for class imbalance. The MAE of the prediction is 0.082 grades, and the CNN's prediction correlates with the ground truth with r = 0.960. Table 2 compares the model's global predictive performance scores to both human raters as well as other machine learning models. Table 3 compares the performance of the model to human intrarater and interrater reliability on a per-grade basis.

FIGURE 2: Ground truth grades versus grades predicted by the default model. Each point in this scatterplot represents one single validation sample whose location is determined by its ground truth evaluation (x -axis) and the corresponding model prediction (y -axis). Note that, for visualization purposes only, all points have been jittered randomly by ±0.1 Pfirrmann grades to improve visibility of individual samples, but the underlying values are still constrained to valid (fractional) Pfirrmann grades (light-gray grid). Many of the “outliers” are hard to categorize unambiguously within the traditional Pfirrmann scheme (highlighted samples A–D).

When comparing the results from the 10-fold cross-validation run of the same model to the results obtained on the fixed train/test split using the last 10% of the training set as validation data, it seems that the latter tends to slightly overestimate the performance of the model, although the differences are generally within the margin of error and thus do not signify considerable overfitting (Table 4). The full size of the training set is 7152 discs, which is quadrupled using image augmentation. Repeatedly halving the training data set by randomized subsampling reduces the overall classification performance, as expected, although with κ > 0.7, the global performance is still far beyond random guessing, even with only 1/32nd of the full training set (Fig. 3, left). Yet, due to the class imbalance of the training data, Pfirrmann grades 1 and 5 benefit particularly from an increase of training data and the trajectories of the F1 scores suggest that additional training data could further improve classification of those degeneration states (Fig. 3, right).

TABLE 4 -
Comparison of Performance Metrics Obtained for a Fixed 90/10 Train/Test Split and 10-Fold Cross-Validation

Metric                 Fixed Train/Test Split   10-Fold Cross-Validation
Accuracy               83%                      81% (79%, 82%)
Accuracy (integral)    93%                      91% (90%, 92%)
Cohen's κ              0.916                    0.906 (0.899, 0.914)
Cohen's κ (integral)   0.920                    0.903 (0.890, 0.916)
MCC (integral)         0.899                    0.871 (0.854, 0.888)
MAE                    0.082                    0.096 (0.086, 0.105)
Pearson r              0.960                    0.958 (0.951, 0.964)

The values in the cross-validation column are arithmetic mean values followed by the lower and upper limits (in parentheses) of the 3σ Bayesian credible interval (highest density interval of the posterior).

MCC, Matthews correlation coefficient; MAE, mean absolute error.

FIGURE 3: Dependency of classification performance on training set size. Decreasing the training set size reduces the overall classification reliability, as expected (left). As they are underrepresented in the training data, classification scores for grades 1 and 5 react strongly to a decrease in the training data amount (right).

In contrast to training set size, the image resolution does not have a strong impact on accuracy or reliability of the classifier, implying that the chosen resolution of 128 × 128 pixels is indeed sufficient to capture all the required information for classifying disc degeneration (Fig. 4 ). Even 64 × 64 pixels yield comparable results, although lower resolutions may lead to stronger artifacts when augmenting images. The usefulness of very high resolutions on the other hand is limited by the native resolution of the MRI scan itself.

FIGURE 4: Dependency of classification performance on image resolution. The accuracy of the classifier is not sensitive to the chosen image resolution, particularly beyond image resolutions of 64 × 64 pixels.

Further Attempts to Improve Classification
Training on Integral Grades
Rounding the ground truth data to integral Pfirrmann grades reduces the number of classes from 13 to 5 and increases the number of samples per class, which, in theory, should lead to overall improved classification. In practice though, the differences are only marginal and both approaches yield similar results (Table 2 ).

Class Weights
Another way to mitigate class imbalance is to use loss weights that penalize misclassifications of rare classes (grades 1 and 5) more strongly than those of common classes (eg, grades 2, 3, and 4). We used scikit-learn's (v0.22.1) compute_class_weight utility function to determine “balanced” class weights^{27,28} (1.0, 2.3820; 1.3, 6.1139; 1.7, 7.6424; 2.0, 0.5923; 2.3, 1.3291; 2.7, 1.1707; 3.0, 0.3171; 3.3, 0.6948; 3.7, 2.0843; 4.0, 0.3478; 4.3, 5.7921; 4.7, 6.3983; 5.0, 1.4001). Unfortunately, instead of an improvement, we see a slight degradation in performance compared with the unweighted version (Table 2 ).
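The “balanced” heuristic used by scikit-learn assigns each class a weight of n_samples / (n_classes · count_c), so rare classes receive proportionally larger weights; a minimal reimplementation for illustration:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Reimplementation of scikit-learn's 'balanced' class-weight
    heuristic: weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}
```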

Ordinal Categorical Cross-Entropy
Categorical cross-entropy loss does not consider the distance between the predicted grade and the ground truth. Using custom (ordinal) variations of categorical cross-entropy with either linear or squared error weights should guide the optimization process to focus on cases with large errors. In practice, we, again, do not see any noteworthy differences compared with common cross-entropy loss (Table 2 ).

Soft κ Loss
Instead of minimizing cross-entropy, we could attempt to optimize Cohen's κ directly. In our case, however, using the “softened,” quadratically weighted version proposed by de la Torre et al^{25} not only leads to much worse overall performance, but also results in a classifier that does not predict some very rare grades (1.0, 1.7, 4.7, 5.0) at all (Table 2).

Regression
Instead of predicting a discrete class for each disc, we also trained regression models that predict a continuous Pfirrmann grade, which is then rounded to the nearest valid discrete grade. Using MAE as a loss instead of MSE proved to be marginally superior (Table 2). Although the overall classification performance is slightly worse compared with a cross-entropy–based classifier, this model does have unique advantages: the predictions of the regression models are noisier on average (ie, they produce a larger number of small errors compared with the classifier models), but they also tend to create fewer outliers, where prediction and ground truth deviate by ≫ 1 grade (Fig. 5).

FIGURE 5: Ground truth grades versus grades predicted by the regression. Compared with the classifier models, the regression models (here: mean absolute error loss) deviate more from the ground truth, on average, but they also produce fewer outlier predictions.

DISCUSSION
Our system is able to accurately classify the degree of degeneration of IVDs based on sagittal T2w MRI slices. While our main rater was able to achieve a very good intrarater reliability of κ ≈ 0.87, the neural network proves to be even better at reproducing the evaluations of its human teacher with a reliability of κ ≈ 0.92, not to mention its vast outperformance of human interrater reliability, which we measured at only κ ≈ 0.59. Considering that, according to the literature, the intrarater reliability of the Pfirrmann grading scheme is a modest κ somewhere between 0.7 and 0.9 and the interrater reliability an estimated κ ≈ 0.5…0.8 (Table 2), our system achieves overall superior reproducibility compared with human evaluators.

According to Pfirrmann et al^{8} themselves, the intrarater reliability for their proposed grading system ranged from 0.84 up to 0.90 (unweighted κ), whereas the interrater reliability varied between 0.69 and 0.81. In a later study with a much larger patient sample, Pfirrmann et al^{9} reported considerably lower ranges of 0.72 … 0.77 and 0.62 … 0.67, respectively, a finding that was later corroborated by Carrino et al,^{11} whose estimates range between 0.67 … 0.82 (intrarater) and 0.62 … 0.70 (interrater). Arana et al,^{10} on the other hand, placed both intrarater and interrater reliability considerably lower, at approximately 0.69 and 0.49, whereas Urrutia et al^{12} reported a more optimistic intrarater estimate of 0.89, but also a very high interrater reliability of 0.83.

Our system also seems to score better than other systems proposed for automating the evaluation of disc degeneration. Da Silva Barreiro et al^{16} reported an average sensitivity of 62.6% and an average specificity of 79.1%, which is clearly surpassed by our system's sensitivity of 90.2% and specificity of 97.9%. The solution by Castro-Mateos et al^{17} scores considerably better, with a sensitivity of 87.3% and a specificity of 95.5%, but it is still inferior to our system. The same applies to the deep learning–based solution developed by Jamaludin et al,^{19} who reported an average sensitivity of 70.1%. According to Lin's concordance correlation coefficient ρ, which Jamaludin et al used to compare the human and the AI readings, our system performs markedly better with ρ ≈ 0.94, compared with ρ ≈ 0.88 for the system of Jamaludin et al.
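Lin's concordance correlation coefficient is straightforward to compute; the following is a minimal NumPy sketch of the standard definition, not the code used in either study:

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two readings.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)

    Unlike Pearson's r, it penalizes systematic shifts and scale
    differences between the two readers, not only lack of correlation.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)
```

A reader that is perfectly correlated with the reference but biased by a constant offset therefore still scores below 1.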

We want to stress that such comparisons to other studies should be treated with caution. Most classification metrics depend not only on the classification performance, but also, at least to some degree, on the distribution of the underlying validation data. The most obvious case is accuracy, which is easily inflated artificially by imbalanced validation data. To enable meaningful cross-study comparisons, we would need something like Kaggle machine learning competitions for IVD degeneration grading, offering completely independent test data that is concealed from the developers of the individual systems to prevent any kind of information leakage. Until then, we can only offer rough estimates of how well our system would likely perform compared with its competitors. Moreover, we did not validate our system on a completely independent test set, which may have led to overfitting to the validation set and thus to overestimating the system's predictive performance. Yet, the results from the 10-fold cross-validation run we performed after the hyperparameter optimization do not indicate significant overfitting.
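A toy example illustrates how imbalanced validation data inflates accuracy; the class counts here are invented purely for illustration:

```python
import numpy as np

# Toy validation set: 95 grade-2 discs and only 5 grade-5 discs.
y_true = np.array([2] * 95 + [5] * 5)

# A useless "classifier" that always predicts the majority grade...
y_pred = np.full_like(y_true, 2)

# ...still reaches 95% accuracy on this imbalanced data.
accuracy = (y_true == y_pred).mean()

# Balanced accuracy (mean per-class recall) exposes the failure,
# since the minority class is never predicted correctly.
recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
balanced_accuracy = np.mean(recalls)
```

This is why accuracy figures reported on differently distributed validation sets cannot be compared directly across studies.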

Our training data set shows an approximately Gaussian distribution of degeneration grades. Deep learning requires a sufficient amount of training data for each class; however, grades 1 and 5 only just about meet this criterion. Consequently, the prediction errors are noticeably higher for grades 1 and 5 compared with the remaining grades (Table 2). Attempts at mitigating this class imbalance using the usual recipes (pooling classes, class loss weighting) were largely unsuccessful, although "ordinal categorical cross-entropy" loss, our modification of cross-entropy for ordinal regression problems, might perform slightly, though not phenomenally, better (Table 1). Regression using either MSE or MAE loss might be an interesting tradeoff for applications where avoiding a few large errors is more important than minimizing the average grading error. The only promising solution to the existing class imbalance is thus to collect further training data covering in particular grades 1 and 5, which would probably also allow us to employ deeper, more complex networks and thus achieve better overall performance.
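Class loss weighting, one of the recipes mentioned above, can be sketched as follows; the grade histogram is hypothetical and only loosely mimics a Gaussian-shaped grade distribution with under-represented extremes:

```python
# Hypothetical per-grade sample counts (not our actual data set).
counts = {1: 12, 2: 160, 3: 310, 4: 150, 5: 18}

# Inverse-frequency class weights, normalized so that a perfectly
# balanced data set would assign every class a weight of 1.0.
n, k = sum(counts.values()), len(counts)
weights = {g: n / (k * c) for g, c in counts.items()}
```

Each sample's loss contribution is then scaled by the weight of its class, so that the rare grades 1 and 5 are not drowned out by the common middle grades; as noted above, in our experiments this alone did not resolve the imbalance.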

Despite this shortcoming with respect to the training data, our system's gradings deviate by more than 1 Pfirrmann grade from the ground truth in only 0.8% of all validation cases. All of those cases are either false-positives or false-negatives related to grades 1 and 5, that is, either discs of these grades that were not detected as such, or discs of other grades that were wrongly categorized as grade 1 or 5. Yet, it is not always clear what the correct grading actually is, particularly for cases that feature combinations of attributes not covered by the Pfirrmann grading system, for example, a nearly collapsed disc combined with a bright white nucleus (Fig. 2). Aside from these rather rare, more extreme cases, grading deviations of ≤ 1 Pfirrmann grade are also often hard to adjudicate definitively due to the vagueness inherent in the Pfirrmann grading system (What separates a homogeneous nucleus from a nonhomogeneous one? When is the nucleus/annulus distinction "clear" or "unclear"? What is the difference between "white" and "light gray"?).

Nevertheless, in contrast to human raters, our system is completely deterministic and has proven excellent at reproducing its human teacher's way of rating discs. With this level of self-consistency, our system is well suited for application in studies where it is critical to discern even small, relative changes in disc quality over time, for instance, when assessing the progression of degeneration in adjacent levels after fixation (adjacent segment disease) or when monitoring the effect of stem cell injections for disc regeneration. Systems such as the one presented in this study might also be of interest to the implant industry, where the performance of new devices must be judged objectively to comply with regulatory requirements.

CONCLUSIONS
We have presented a novel deep learning–based system for automatically grading disc degeneration according to the Pfirrmann scheme. Our system not only outperforms its human teachers at producing consistent evaluations, but also improves upon existing AI solutions. This system should prove useful for analyzing the results of longitudinal studies, even with regard to subtle changes in disc quality that would otherwise remain undetectable.

ACKNOWLEDGMENTS
The authors of this study would like to thank Margarita Darminow for her valuable contributions to this study and her relentless efforts to improve the quality of our training data.

REFERENCES
1. Schiebler ML, Camerino VJ, Fallon MD, et al. In vivo and ex vivo magnetic resonance imaging evaluation of early disc degeneration with histopathologic correlation. Spine (Phila Pa 1976). 1991;16:635–640.

2. Tertti M, Paajanen H, Laato M, et al. Disc degeneration in magnetic resonance imaging: a comparative biochemical, histologic, and radiologic study in cadaver spines. Spine. 1991;16:629–634.

3. Modic MT, Masaryk TJ, Ross JS, et al. Imaging of degenerative disk disease. Radiology. 1988;168:177–186.

4. Pearce RH, Thompson JP, Bebault GM, et al. Magnetic resonance imaging reflects the chemical changes of aging degeneration in the human intervertebral disk. J Rheumatol Suppl. 1991;27:42–43.

5. Johannessen W, Auerbach JD, Wheaton AJ, et al. Assessment of human disc degeneration and proteoglycan content using T1ρ-weighted magnetic resonance imaging. Spine. 2006;31:1253–1257.

6. Brayda-Bruno M, Tibiletti M, Ito K, et al. Advances in the diagnosis of degenerated lumbar discs and their possible clinical application. Eur Spine J. 2014;23:315–323.

7. Ogon I, Takebayashi T, Takashima H, et al. Imaging diagnosis for intervertebral disc. JOR Spine. 2020;3:e1066.

8. Pfirrmann CW, Metzdorf A, Zanetti M, et al. Magnetic resonance classification of lumbar intervertebral disc degeneration. Spine. 2001;26:1873–1878.

9. Pfirrmann CWA, Dora C, Schmid MR, et al. MR image–based grading of lumbar nerve root compromise due to disk herniation: reliability study with surgical correlation. Radiology. 2004;230:583–588.

10. Arana E, Royuela A, Kovacs FM, et al. Lumbar spine: agreement in the interpretation of 1.5-T MR images by using the Nordic Modic consensus group classification form. Radiology. 2010;254:809–817.

11. Carrino JA, Lurie JD, Tosteson ANA, et al. Lumbar spine: reliability of MR imaging findings. Radiology. 2009;250:161–170.

12. Urrutia J, Besa P, Campos M, et al. The Pfirrmann classification of lumbar intervertebral disc degeneration: an independent inter- and intra-observer agreement assessment. Eur Spine J. 2016;25:2728–2733.

13. Alomari RS, Corso JJ, Chaudhary V, et al. Computer-aided diagnosis of lumbar disc pathology from clinical lower spine MRI. Int J Comput Assist Radiol Surg. 2010;5:287–293.

14. Ghosh S, Alomari RS, Chaudhary V, et al. Computer-aided diagnosis for lumbar MRI using heterogeneous classifiers. In: 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro. 2011:1179–1182.

15. Hashia B, Mir AH. Texture features' based classification of MR images of normal and herniated intervertebral discs. Multimed Tools Appl. 2018. Available at: https://doi.org/10.1007/s11042-018-7011-4. Accessed March 16, 2020.

16. da Silva Barreiro M, Nogueira-Barbosa MH, Rangayyan RM, et al. Semiautomatic classification of intervertebral disc degeneration in magnetic resonance images of the spine. In: 5th ISSNIP-IEEE Biosignals and Biorobotics Conference (2014): Biosignals and Robotics for Better and Safer Living (BRC). 2014:1–5.

17. Castro-Mateos I, Hua R, Pozo JM, et al. Intervertebral disc classification by its degree of degeneration from T2-weighted magnetic resonance images. Eur Spine J. 2016;25:2721–2727.

18. Hosny A, Parmar C, Quackenbush J, et al. Artificial intelligence in radiology. Nat Rev Cancer. 2018;18:500–510.

19. Jamaludin A, Lootus M, Kadir T, et al. ISSLS PRIZE IN BIOENGINEERING SCIENCE 2017: automation of reading of radiological features from magnetic resonance images (MRIs) of the lumbar spine without human intervention is comparable with an expert radiologist. Eur Spine J. 2017;26:1374–1383.

20. Jamaludin A, Kadir T, Zisserman A. SpineNet: automated classification and evidence visualization in spinal MRIs. Med Image Anal. 2017;41:63–73.

21. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [cs]. 2014. Available at: http://arxiv.org/abs/1409.1556. Accessed October 12, 2016.

22. He K, Zhang X, Ren S, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. arXiv:1502.01852 [cs]. 2015. Available at: http://arxiv.org/abs/1502.01852. Accessed March 21, 2017.

23. Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122 [cs]. 2015. Available at: http://arxiv.org/abs/1511.07122. Accessed July 17, 2017.

24. Lin M, Chen Q, Yan S. Network in Network. arXiv:1312.4400 [cs]. 2013. Available at: http://arxiv.org/abs/1312.4400. Accessed May 30, 2018.

25. de la Torre J, Puig D, Valls A. Weighted kappa loss function for multi-class classification of ordinal data in deep learning. Pattern Recognit Lett. 2018;105:144–154.

26. Maji S, Rahtu E, Kannala J, et al. Fine-grained visual classification of aircraft. arXiv:1306.5151 [cs]. 2013. Available at: http://arxiv.org/abs/1306.5151. Accessed April 7, 2020.

27. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.

28. King G, Zeng L. Logistic regression in rare events data. Polit Anal. 2001;9:137–163.