

Current Concepts Review

Deep Learning and Imaging for the Orthopaedic Surgeon

How Machines “Read” Radiographs

Hill, Brandon G. MSc1; Krogue, Justin D. MD2,3; Jevsevar, David S. MD, MBA1,4; Schilling, Peter L. MD, MSc1,4,a

The Journal of Bone and Joint Surgery 104(18):1675-1686, September 21, 2022. DOI: 10.2106/JBJS.21.01387

Abstract

In 2012, a research group led by Geoffrey Hinton achieved a dramatic milestone in the ability of a computer to automatically identify objects in images1. Their work was a testament to the power and capability of machine learning—the idea of using algorithms to identify patterns in data, to build mathematical models based on those patterns, and to use those models to make a determination or prediction about something in the world2-4. This is a departure from the usual way of doing things. Traditionally, the way to enable a computer to do anything was to enter explicit instructions, line-by-line in the form of human-authored computer code. Machine learning is different. It does not require step-by-step guidance from a human; rather, humans need only supply data and a learning system, and then the computer learns patterns on its own5. Hinton’s group specifically demonstrated the capacity of deep learning, a branch of machine learning in which mathematical models learn to make predictions directly from unprocessed data (such as images)3,6. Excitement over their advancement ignited a research renaissance in deep learning and the broader field of machine learning. Since then, the results have entered our daily lives in numerous forms, from digital assistants we talk to, to self-driving cars, to automated drug discovery7-17.

The impact of deep learning is expected to be just as profound in clinical medicine as it already has been in daily life. This is particularly true for medical imaging. Deep learning has demonstrated remarkable progress in the analysis of medical imaging across a range of modalities including radiographs, computed tomographic (CT) scans, and magnetic resonance imaging (MRI) scans18-35. There is a growing body of evidence showing clinical utility for deep learning in musculoskeletal radiography (Table I), as evidenced by studies that use deep learning to achieve an expert or near-expert level of performance for the identification and localization of fractures on radiographs (Table II)31,36-43. Until recently, these deep learning algorithms had been confined to research papers, narrow tasks, and specific regions of human anatomy, but the technology is advancing rapidly. Deep learning is now in the early stages of entering the clinical setting, involving validation and proof-of-concept studies44,45. Only time will tell, but we believe that one thing is certain: The success of deep learning in the analysis of medical imaging has been propelling the field forward so rapidly that now is the time for surgeons to pause and understand how this technology works at a conceptual level, before the technology ends up in front of us and our patients. This article is intended to provide surgeons with this basic level of understanding of how current deep learning methods work. We do this with a foundational example—explaining how a deep learning method called convolutional neural networks (CNNs) enables a computer to “read” radiographs to detect the presence of a fracture.

TABLE I - A Nonexhaustive Review of Use Cases and Representative Studies Demonstrating the Current State of Deep Learning Across Musculoskeletal Radiography
Use Case | Representative Study | Level of Evidence | Description of Study*
Diagnosis (including classification, staging, and severity of disease)
 Fracture identification, classification, and localization: see Table II
 Implant loosening: Shah et al.33 (2020), Level of Evidence II. Incremental inputs improve the automated detection of implant loosening using machine-learning algorithms:
The study used a CNN to predict implant loosening from radiographs in first-time revision total hip and knee arthroplasty patients, achieving 88% accuracy when radiographs were combined with patient characteristics.
 Bone tumor classification: von Schacky et al.34 (2021), Level of Evidence II. Multitask deep learning for segmentation and classification of primary bone tumors on radiographs:
This work trained a CNN to simultaneously localize and classify primary bone tumors on radiographs. The model was trained on 934 radiographs, with patient histopathologic findings as the reference standard for benign versus malignant tumors. For the classification of bone tumors as malignant or benign, the model achieved 80% accuracy, 63% sensitivity, and 88% specificity. This classification accuracy was higher than that of 2 radiology residents (71% and 65%; p = 0.002 and p < 0.001, respectively) and comparable with that of 2 musculoskeletal fellowship-trained radiologists (84% and 83%; p = 0.13 and p = 0.25, respectively).
 Knee osteoarthritis severity: Norman et al.21 (2019), Level of Evidence II. Applying densely connected convolutional neural networks for staging osteoarthritis (OA) severity from radiographs:
The study used a CNN to automatically stage osteoarthritis severity according to Kellgren-Lawrence grade from radiographs of the knee in an adult population. The model was trained on 4,490 bilateral PA fixed-flexion knee radiographs from adults in the U.S. The model achieved sensitivity rates for no, mild, moderate, and severe OA of 84%, 70%, 69%, and 86%, respectively. The corresponding specificity rates were 86%, 84%, 97%, and 99%.
 Osteoporosis: Yamamoto et al.35 (2020), Level of Evidence II. Deep learning for osteoporosis classification using hip radiographs and patient clinical covariates:
The authors used deep learning to diagnose osteoporosis from radiographs of the hip (T-score < −2.5) using a data set of 1,131 images from patients who underwent both skeletal bone mineral density measurement and hip radiography at a single hospital in Japan from 2014 to 2019. The CNN applied to hip radiographs alone exhibited high accuracy (approx. 84%), sensitivity (approx. 81%), and specificity (approx. 88%), and the performance improved further with the addition of clinical covariates from patient records (approx. 89%, 89%, and 88%, respectively).
 Pediatric bone age: Halabi et al.27 (2019), Level of Evidence II. The RSNA pediatric bone age machine learning challenge:
The challenge asked competitors to create a model using machine learning that could accurately determine skeletal age in a data set of pediatric hand radiographs (n = 14,236). The best models were all CNNs, and the top performing models were, on average, accurate to within 4.2 to 4.5 months of the reference standard, bone age.
 Cause of shoulder pain: Grauhan et al.26 (2022), Level of Evidence II. Deep learning for accurately recognizing common causes of shoulder pain:
The authors trained a CNN to automatically detect the most common causes of shoulder pain on radiographs, including proximal humeral fractures, joint dislocations, periarticular calcification, osteoarthritis, osteosynthesis, and a joint endoprosthesis. The model was trained on 2,700 shoulder radiographs from multiple institutions. Model performance was variable across the 6 diagnoses: sensitivity and specificity were 75% and 86% for fractures, 95% and 65% for joint dislocation, and 90% and 86% for osteoarthrosis.
Implant identification
 Arthroplasty implant identification: Karnuta et al.29 (2021), Level of Evidence II. Artificial intelligence to identify arthroplasty implants from radiographs of the hip:
The authors trained, validated, and tested a CNN to classify total hip arthroplasty femoral implants as 1 of 18 different models from 1,972 retrospectively collected radiographs of the hip. The CNN discriminated 18 implant models with AUC of 0.99, accuracy of 99%, sensitivity of 94%, and specificity of 99% in the external-testing data set.
Prognosis and risk stratification
 Risk of dislocation in total hip arthroplasty: Rouzrokh et al.32 (2021), Level of Evidence II. Deep learning artificial intelligence model for assessment of hip dislocation risk following primary total hip arthroplasty from postoperative radiographs:
The study presented a CNN trained to predict the risk of future hip dislocation from postoperative radiographs and succeeded in achieving high sensitivity (89%) and high negative predictive value (99%) in the external-testing data set. The study also demonstrated the use of saliency maps to highlight the radiographic features that the model found to be most predictive of prosthesis dislocation.
 Bone mineral density: Hsieh et al.28 (2021), Level of Evidence II. Automated bone mineral density prediction and fracture risk assessment using radiographs via deep learning:
This study examined how successfully a CNN can measure bone mineral density and evaluate fracture risk through a radiograph. Results were compared against dual x-ray absorptiometry (DXA) measurements and fracture risk assessment tool (FRAX) scores. Both hips (5,633 training radiographs) and the lumbar spine (7,307 training radiographs) were examined. The AUC and accuracy were 0.89 and 92% for detecting hip osteoporosis and 0.96 and 90% for high hip fracture risk. When applied to the lumbar spine radiographs, the scores were 0.89 and 86% for spine osteoporosis, and 0.83 and 95% for high 10-year fracture risk. This capability would allow for evaluating fracture risk using radiographs already made for other indications to identify at-risk patients, without additional cost, time, and radiation from DXA.
Measurements
 Leg lengths: Zheng et al.25 (2020), Level of Evidence II. Deep learning measurement of leg-length discrepancy in children based on radiographs:
This study created a method for automating the measurement of leg length discrepancies in pediatric patients using deep learning. The authors trained and tested a CNN using 179 radiographs and found that deep learning-derived measurements were strongly correlated with those that were manually derived by pediatric radiologists (r = 0.92 and mean absolute error of 0.51 cm for full leg length discrepancy, p < 0.001).
 Cobb angle: Horng et al.18 (2019), Level of Evidence II. Cobb angle measurement of the spine from radiographs using convolutional neural networks:
The authors created an automatic system for measuring spine curvature via Cobb angle by applying a CNN to 595 AP spinal radiographs. The deep learning-derived measurements did not demonstrate any significant differences compared with the manual measurements made by clinical doctors (p < 0.98).
Scientific discovery
 Insights into pain disparities in underserved populations: Pierson et al.22 (2021), Level of Evidence II. An algorithmic approach to reducing unexplained pain disparities in underserved populations:
Underserved populations experience higher levels of pain. Using osteoarthritis as an example, this study examined whether this pain can be detected in radiographs. Traditional methods of objectively measuring osteoarthritic severity (i.e., Kellgren-Lawrence grade) account for only 9% of this disparity. A CNN trained to predict pain levels from 25,049 radiographs accounted for 43% of the disparity. This implies that much of the osteoarthritic pain experienced by underserved patients stems from visible factors within the knee that are not captured in standard radiographic measures of severity such as the Kellgren-Lawrence grade.
*PA = posteroanterior, RSNA = Radiological Society of North America, AUC = area under the receiver operating characteristic curve, and AP = anteroposterior.

TABLE II - Select Studies That Demonstrate the Potential of Deep Learning for the Automated Detection of Bone Fractures in Radiographs Across Different Regions of Human Anatomy*
Study by Anatomical Region | Level of Evidence | Objectives and Prediction Classes | Study Cohort and Image Sources | Reference Standard* | Training Data Set Size | Performance of Model (AUC, Sensitivity, Specificity, Accuracy)
Hip
 Krogue et al.31 (2020), Level of Evidence II. Objectives:
- Fracture detection
- Fracture classification
- Fracture localization
Prediction classes:
- Fracture or no fracture
- Multiclass: No fracture, displaced femoral neck fracture, nondisplaced femoral neck fracture, intertrochanteric fracture, and preexisting implant
- Localization of fracture by bounding box
Retrospective cohort:
All hip and pelvic radiographic studies obtained in the ED of a single institution from 1998 to 2017 with the words “intertrochanteric” or “femoral neck” occurring near “fracture” in the radiology report
Reference standard: Reviewed by 2 orthopaedic residents; in cases of uncertainty, CT and MRI scans and postop. imaging were reviewed. Training data set size: 1,815. Performance: AUC 0.98, sensitivity 93%, specificity 94%, accuracy 94%
 Cheng et al.37 (2019), Level of Evidence II. Objective:
- Fracture detection
Prediction classes:
- Fracture or no fracture
Retrospective cohort:
Pelvic radiographs from trauma patients seen at a single institution from 2008 to 2017
Reference standard: Diagnosis derived from the trauma registry and reviewed by a trauma surgeon; other imaging modalities and the clinical course were reviewed in equivocal cases. Training data set size: 3,605. Performance: AUC 0.98, sensitivity 98%, specificity 84%, accuracy 91%
 Urakawa et al.42 (2019), Level of Evidence II. Objective:
- Fracture detection (intertrochanteric)
Prediction classes:
- Fracture or no fracture
Retrospective cohort:
All hip radiographs from patients with intertrochanteric hip fractures treated with compression hip screws from 2006 to 2017
Reference standard: Radiology reports were reviewed by a single board-certified orthop. surgeon. Training data set size: 2,678. Performance: AUC 0.98, sensitivity 94%, specificity 97%, accuracy 96%
Shoulder
 Chung et al.38 (2018), Level of Evidence II. Objectives:
- Fracture detection
- Fracture classification
Prediction classes:
- Neer classification + no fracture
Retrospective cohort:
All shoulder AP radiographs from 7 hospitals
Reference standard: 2 experienced shoulder orthop. specialists and 1 musculoskeletal radiologist; when the independent reports did not agree, CT scans and other imaging were reviewed. Training data set size: 1,891. Performance: AUC 1.00, sensitivity 99%, specificity 97%, accuracy 96%
Wrist
 Thian et al.43 (2019), Level of Evidence II. Objectives:
- Fracture detection
- Fracture localization
Prediction classes:
- Fracture or no fracture
- Estimates of fracture localization
Retrospective cohort:
AP and lateral wrist radiographs in a single institution between 2015 and 2017
Reference standard: 3 experienced radiologists; questionable images were labeled by consensus. Training data set size: 14,614. Performance: AUC 0.90, sensitivity 98%, specificity 73%, accuracy NS
 Kim and MacKinnon40 (2018), Level of Evidence II. Objective:
- Fracture detection
Prediction classes:
- Fracture or no fracture
Retrospective cohort:
Lateral wrist radiographs at a single institution between 2015 and 2016
Reference standard: Report reviewed by a radiology registrar competent in the reporting of radiographs and with 3 yr of radiology experience. Training data set size: 1,111. Performance: AUC 0.95, sensitivity 90%, specificity 88%, accuracy NS
Spine
 Chen et al.36 (2021), Level of Evidence II. Objective:
- Fracture detection
Prediction classes:
- Fracture or no fracture
Retrospective cohort:
Plain abdominal frontal radiographs obtained from a single institution from 2015 to 2018
Reference standard: Initially labeled based on the diagnosis in the registry; final diagnosis determined via the agreement of a radiologist and a spine surgeon using any supportive images available. Training data set size: 1,045. Performance: AUC 0.72, sensitivity 74%, specificity 73%, accuracy 74%
Ankle
 Kitamura et al.41 (2019), Level of Evidence II. Objective:
- Fracture detection
Prediction classes:
- Fracture or no fracture
Retrospective cohort:
Studies of ankle fractures were identified by parsing radiology reports
Reference standard: Reviewed by a board-certified radiologist and a fourth-year radiology resident. Training data set size: 1,441. Performance: AUC NS, sensitivity 80%, specificity 83%, accuracy 81%
All
 Jones et al.39 (2020), Level of Evidence II. Objectives:
- Fracture detection
- Fracture localization
Prediction classes:
- Fracture or no fracture
- Estimates of fracture localization
Retrospective cohort:
Radiographs of 16 anatomical regions from 15 hospitals and outpatient care centers in the U.S.
Reference standard: Data set manually annotated by 18 orthop. surgeons and 11 radiologists. Training data set size: 715,343. Performance: AUC 0.97, sensitivity 95%, specificity 81%, accuracy NS
*AUC = area under the receiver operating characteristic curve, ED = emergency department, CT = computed tomographic, MRI = magnetic resonance imaging, AP = anteroposterior, and NS = not specified
The reference standard is the source of “ground truth,” in this case, the source of truth for whether a radiograph demonstrates (or does not demonstrate) a fracture. The size of the training data set is the number of example radiographs used to train the model. The performance measures used for deep learning models are conceptually the same as those used to evaluate diagnostic screening tests used in medicine and include measures such as the AUC, sensitivity, specificity, and accuracy. In general, these are measures that compare a model’s predictions (e.g., fracture/no fracture) to ground truth using a testing data set.
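To make these performance measures concrete, the short Python sketch below compares a set of hypothetical fracture predictions with hypothetical ground-truth labels and computes sensitivity, specificity, and accuracy; the numbers are invented for illustration and do not come from any of the studies above.

```python
# A minimal sketch with invented data: comparing model predictions to ground truth.
import numpy as np

truth       = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # 1 = fracture, 0 = no fracture (reference standard)
predictions = np.array([1, 1, 0, 0, 0, 1, 1, 0])  # model output after thresholding its confidence

tp = np.sum((predictions == 1) & (truth == 1))  # true positives
tn = np.sum((predictions == 0) & (truth == 0))  # true negatives
fp = np.sum((predictions == 1) & (truth == 0))  # false positives
fn = np.sum((predictions == 0) & (truth == 1))  # false negatives

sensitivity = tp / (tp + fn)       # fraction of true fractures that were flagged
specificity = tn / (tn + fp)       # fraction of normal radiographs that were cleared
accuracy = (tp + tn) / len(truth)  # fraction of all radiographs classified correctly

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, accuracy={accuracy:.2f}")
```

The AUC summarizes the same comparison across all possible confidence thresholds rather than at a single threshold.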

How a CNN Model Works

Overview: Layers of Mathematical Functions

Deep learning is not magic; it is mathematics. Computers “think” in numbers, so for a computer to “see” a radiograph, the information in the image must be put into a form that computers can process, perform calculations on, and analyze. Thus, computers “see” a digitized image as numbers, and make sense of these images-made-numeric using mathematical functions. Stated simply, CNNs are mathematical functions that are uniquely suited for the analysis of images. As in all mathematical functions, CNNs have inputs (in this case, the numeric representation of the image) and a final output (a numeric prediction of what it “sees” in the image)3. CNNs are very complex, but one does not need to understand the mathematical details to get a sense of how these functions work.

The original inspiration for the mathematical structure of all deep learning systems (including CNNs) came from the network of neurons in the brain (Fig. 1)46-48. In place of a network of interconnected biologic neurons, deep learning systems substitute a network of interconnected mathematical functions referred to as an artificial neural network (ANN). While biologic neurons exchange electrical signals, the artificial neurons in an ANN exchange the results of their mathematical functions. One should think of the ANN in its entirety as a complex mathematical function made up of a large number of more basic mathematical functions that “connect” and “send signals” between each other in ways that are reminiscent of the synapses and action potentials of biologic neurons.

Fig. 1:
A biologic neuron versus an artificial neuron. The original inspiration for the mathematical structure of all deep-learning systems (including CNNs) came from the network of neurons in the visual cortex of mammals. In place of a vast network of interconnected biologic neurons, deep learning systems substitute a vast network of interconnected mathematical functions. While biologic neurons exchange electrical signals, the artificial neurons of these networks exchange the results of their functions. That is, an artificial neural network (ANN) can be thought of as a complex mathematical function made up of a large number of more basic mathematical functions that “connect” and “send signals” between each other in ways that are reminiscent of the synapses and action potentials of biologic neurons. ANNs are typically organized in layers of interconnected neurons. The mathematical function in each neuron takes weighted inputs (that is, excitatory or inhibitory signals at a biologic neuron’s dendrites) and sums them to produce an output or “activation” (something akin to a biologic neuron’s action potential). The output is then passed on to the neuron(s) in the next layer of the network, where it is received as a weighted input. Thus, the “deep” in deep learning refers to the mathematical depth—the number of layers of mathematical functions that make up the more complex mathematical function that is the neural network in its totality. It is this layering of mathematical functions that enables ANNs (like CNNs) to capture complex, nonlinear relationships, as is required in the identification of a fracture within a radiograph.

ANNs are organized in layers of interconnected neurons. A layer is a general term for a group of neurons that work together at a specific depth within the ANN. The mathematical function in each neuron takes weighted inputs (think excitatory or inhibitory signals at a biologic neuron’s dendrites) and sums them to produce an output or “activation” (something akin to a biologic neuron’s action potential). The output is then passed on to the neuron(s) in the next layer of the network, where it is received as a weighted input.
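As a rough illustration of this weighted-sum-and-activation idea, the Python sketch below implements a single artificial neuron with made-up signals and weights; real networks use the same arithmetic, only at vastly larger scale.

```python
import numpy as np

def artificial_neuron(inputs, weights, bias):
    """Weighted sum of incoming signals plus a bias, passed through a ReLU activation."""
    weighted_sum = np.dot(inputs, weights) + bias
    return max(0.0, weighted_sum)  # "fire" only if the combined signal is positive

# Hypothetical signals arriving from 3 neurons in the previous layer
signals = np.array([0.2, 0.8, 0.5])
weights = np.array([0.9, -0.4, 0.3])  # values like these are learned during training
print(artificial_neuron(signals, weights, bias=0.1))
```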

Different layers in an ANN perform distinct roles. These layers are categorized as input, hidden, and output layers3. Every ANN starts with an input layer and ends with an output layer. Not surprisingly, the input layer is responsible for receiving the inputs (in our example, a radiograph). The output layer is responsible for providing us an answer (e.g., how confident the ANN is that a radiograph demonstrates a fracture). An ANN also typically has a variable number of hidden layers—hidden because they are stacks of mathematical functions that reside in the middle, between input and output layers, shielded from view (Fig. 2).

Fig. 2:
An ANN is a network of artificial neurons, depicted here as circles and typically arranged in layers. The network starts with the input layer: a column of hundreds of thousands of neurons (mathematical functions) that are each fed a small part of the raw input (for example, a radiograph). The output of each neuron is then sent to hundreds or thousands of neurons in the first hidden layer. Each mathematical function in this hidden layer takes the hundreds or thousands of inputs it receives and sends its own mathematical output to the next hidden layer. This process repeats itself throughout the ANN: each layer takes its input, performs calculations via its neurons, and then transmits its output on to the subsequent layer. The final output layer typically has only a small number of neurons. In the case of 1 neuron, the final output is a number showing how confident the ANN is that something is true (e.g., the input radiograph shows a fracture). In the case of several output neurons (as depicted here), each output number would show how confident the ANN is about a different prediction (e.g., the input radiograph shows a femoral neck fracture versus an intertrochanteric fracture versus a subtrochanteric hip fracture, etc.).

The hidden layers are where most of the “magic” happens. That is because the addition of hidden layers of mathematical functions adds depth. The “deep” in deep learning refers to the mathematical depth—the number of layers of mathematical functions that make up the more complex mathematical function that is the neural network in its totality. It is this layering of mathematical functions that enables ANNs to capture complex, nonlinear relationships. There may be dozens of these so-called hidden layers within a deep ANN, but not all layers are the same3. Different types of hidden layers use different mathematical functions, and some layers are better suited for some tasks than others. The types of layers commonly used for “reading” a radiograph include convolutional layers, pooling layers, and fully connected layers, all of which will be discussed3.

The final output layer is responsible for providing an answer (e.g., does the radiograph demonstrate a fracture?). The output layer typically has only a small number of neurons. In the case of 1 output neuron, the final output is a number showing how confident the ANN is that something is true (e.g., the input radiograph shows a fracture). In the case of several output neurons, each output number would show how confident the ANN is about a different prediction (e.g., the input radiograph shows a femoral neck versus an intertrochanteric hip fracture, etc.).

CNNs are a form of ANN that takes this biologic inspiration further, with chains of interconnected artificial neurons responsible for detecting different shapes or visual features within an image47. As in our visual cortex, the earliest neurons in this chain detect relatively simple visual features, while the downstream neurons use this output to detect more complex shapes referred to as high-level features. Low-level features are shapes like lines and edges. Mid-level features might be something akin to a cortical border or the edge of the joint line in a radiograph. High-level features might look to the human eye like a femoral diaphysis or a femoral head in its entirety (Fig. 3). In a CNN, like the visual cortex, it is the stacking of layers of neurons (i.e., mathematical functions) that enables vision. The layers of mathematical functions define details of images, little bits at a time, starting simply and then building to eventually identify entire objects46.

Fig. 3:
As in the visual cortex of mammals, the earliest neurons in a CNN detect relatively simple features while the downstream neurons use this output to detect more complex shapes referred to as high-level features. Low-level features are shapes like lines and edges. Mid-level features might be something akin to a cortical border, the edge of the joint line, etc. High-level features might look to a human like a femoral diaphysis or a femoral head in its entirety. It is critical to understand that humans do not tell the computer what visual features to look for—a jagged edge, a disruption of the cortex—rather, the computer selects the features on its own, choosing those that are most predictive for the task at hand—in this case, the identification of a fracture.

A More Detailed Look: Computing Pixels

Clearly, there are no rods and cones in a CNN. Instead, a computer uses pixels—the small, illuminated dots that make up images on computer displays. Yet, computers do not “see” illuminated pixels as a human does. Instead, computers represent pixels numerically, assigning a number to each pixel that represents its brightness49-51. White pixels have a value of 1, black pixels have a value of 0, and gray is somewhere in between (Fig. 4). For grayscale images such as radiographs, an image is just a 2-dimensional matrix of numbers. As in a Microsoft Excel spreadsheet, this matrix is merely a table of numbers stored in columns and rows (width times height), with each cell of the table storing the brightness of a pixel in a particular location49,51.
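In code, this numeric representation is simply a 2-dimensional array. The sketch below uses made-up brightness values to show how a tiny patch of a grayscale radiograph, like the 9 pixels enlarged in Figure 4, might be stored.

```python
import numpy as np

# A hypothetical 3 x 3 patch of a grayscale radiograph:
# 0.0 = black, 1.0 = white, values in between = shades of gray.
patch = np.array([
    [0.05, 0.10, 0.85],
    [0.05, 0.15, 0.90],
    [0.10, 0.20, 0.95],
])

print(patch.shape)  # (3, 3): rows (height) by columns (width)
print(patch[0, 2])  # brightness of the pixel in row 0, column 2 -> 0.85
```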

Fig. 4:
Digital images are composed of pixels, and computers represent pixels numerically, by assigning a number to each pixel that encodes brightness. For illustrative purposes, a small square of the image along the edge of the iliac crest has been enlarged to show the brightness of 9 pixels and their numeric representation in a matrix.

Using this format, a CNN must mathematically detect shapes within an image’s matrix of numbers. The first layer of the CNN starts simply, with neurons containing mathematical functions to detect elementary shapes such as lines or edges (so-called low-level features). To do this, a neuron uses a “drawing” of the shape being detected. This “drawing” is called a filter51. The filter’s “drawing” is itself a matrix that looks like a small table of numbers, with each cell of the table characterizing the pixel brightness of the shape. Figure 5 shows a 3 × 3 filter (3 pixels in width times 3 pixels in height) for detecting a right-sided edge. The CNN starts by aligning this filter with the first 3 × 3 pixels in the image and measures how closely those 9 pixels in the image match with the filter. The filter is then moved over 1 pixel to the right, and the degree of similarity is again measured. This process is repeated in a scanning fashion across the entire image, left to right, top to bottom. Mathematically, this is convolving the image with the filter, and it is the reason that a CNN is called a convolutional neural network51. The output of this neuron is called a feature map—a map of how strongly right-sided edges were detected at each point in the original image, with white pixels where the pattern exists and black ones where it does not. Each neuron in the first layer of the CNN has a unique filter and creates its own feature map for the shape it was designed to detect (horizontal, diagonal, or curved lines, etc.). Together, all feature maps are passed to the next layer of the network3,49,51.
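The scanning operation just described can be written out directly. The sketch below, with illustrative values only, slides a 3 × 3 right-edge filter across a tiny image and records at each position how strongly the underlying pixels match the filter; the resulting matrix is the feature map.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    feature_map = np.zeros((ih - kh + 1, iw - kw + 1))
    for row in range(feature_map.shape[0]):
        for col in range(feature_map.shape[1]):
            window = image[row:row + kh, col:col + kw]
            feature_map[row, col] = np.sum(window * kernel)  # multiply element-wise, then sum
    return feature_map

# A hypothetical 3 x 3 filter that responds to a bright-to-dark transition (a right-sided edge)
edge_filter = np.array([
    [1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0],
])

# A tiny 5 x 5 "image": a bright vertical bar on a dark background
image = np.array([
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
], dtype=float)

print(convolve2d(image, edge_filter))  # largest values where the bar's right edge lies
```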

Fig. 5:
To detect a shape (i.e., a visual feature) within an image, a “drawing” of the shape is passed over the image. This drawing is called a filter. The figure shows a 3 × 3 filter used for detection of an edge (right-sided) along with its corresponding matrix values. This filter is passed over the entire image in a scanning fashion, left to right, top to bottom. The value for each pixel is multiplied by the value of the corresponding cell within the 3 × 3 filter, and the result is summed to produce a single value. Mathematically, this is called convolving the image with a filter, and it is repeated over the whole image. The output of this mathematical operation is a new matrix of numerical values called a feature map, presented here as an image. This feature map shows how strongly right-sided edges were detected at each point in the original image, with white pixels where the pattern exists and black pixels where it does not.

Individually, feature maps of lines and edges do not capture very complex shapes and patterns. So, the neurons in the next layer detect more complex shapes by looking at several feature maps from the prior layer. For example, consider a slightly more complex visual feature, such as a corner. For a neural filter in the next layer of the CNN to detect a corner, it must sense the end of a vertical edge meeting the end of a horizontal edge. Only in places where both the vertical and horizontal edge maps from the prior layer record the end of an edge will the corner filter signal the presence of a corner at that location. In reality, each filter can use all of the feature maps from a prior layer of neurons to measure the presence of the compound shape it was designed to detect3.
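To give a feel for how a deeper-layer neuron combines feature maps, the purely illustrative sketch below builds a “corner” map that responds only where a vertical-edge map and a horizontal-edge map are both strong at the same location; in a real CNN the weights of such a combination are learned rather than fixed by hand.

```python
import numpy as np

# Hypothetical 4 x 4 feature maps from the previous layer (values between 0 and 1)
vertical_edges = np.array([
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
horizontal_edges = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.9, 0.9, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

# A hand-made second-layer "neuron": weight each input map, sum, subtract a bias,
# and keep only positive values. The result is large only where both edges meet.
corner_map = np.maximum(0.0, 0.5 * vertical_edges + 0.5 * horizontal_edges - 0.6)
print(corner_map)  # nonzero only near row 2, column 1, where the two edges intersect
```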

As described, the system cannot detect any features bigger than 3 × 3 pixels. Larger filters may be used (e.g., 9 × 9 pixels), but they tend to become overly specific. To avoid this problem, special layers called pooling layers are included. Pooling layers summarize the visual features that exist in each area of an image52. A pooling layer pools together the most relevant information and abstracts away the less helpful details, creating a lower-resolution version of the feature maps. It keeps a record of the important and predictive features while omitting the fine details that may not be useful for the task.

For each input feature map, these layers output smaller summarized feature maps. In a max-pooling layer, for example, a 100 × 100 input feature map would be divided into 5 × 5 areas and the largest value in each would be returned53. The end result is a 20 × 20 output that records the presence of important visual features while reducing the input from areas that lack features (e.g., the radiograph’s black background).
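A max-pooling layer can be sketched just as simply: divide the feature map into non-overlapping blocks and keep only the largest value in each block. The example below is a minimal version with a made-up 4 × 4 map and 2 × 2 blocks, not any particular library’s implementation.

```python
import numpy as np

def max_pool(feature_map, block_size):
    """Return a smaller map that keeps only the largest value in each block."""
    h, w = feature_map.shape
    pooled = np.zeros((h // block_size, w // block_size))
    for row in range(pooled.shape[0]):
        for col in range(pooled.shape[1]):
            block = feature_map[row * block_size:(row + 1) * block_size,
                                col * block_size:(col + 1) * block_size]
            pooled[row, col] = block.max()
    return pooled

# A hypothetical 4 x 4 feature map pooled with 2 x 2 blocks -> a 2 x 2 summary
feature_map = np.array([
    [0.1, 0.0, 0.0, 0.7],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.0],
    [0.3, 0.0, 0.0, 0.0],
])
print(max_pool(feature_map, block_size=2))
# [[0.9 0.7]
#  [0.3 0.2]]
```

The same operation with 5 × 5 blocks turns the 100 × 100 map described above into a 20 × 20 summary.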

After several layers in a CNN, the feature maps contain measures of how strongly each higher-level feature was detected. At this stage, the final layers are dedicated to answering whatever question the model was trained for (e.g., is a fracture present?). This decision is made by hundreds of neurons connected in what are called the fully connected layers of the CNN. These fully connected neurons are quite different from their predecessors. Each neuron considers all of the feature measurements from the prior layer3. Each neuron is designed to look for a specific combination of higher-level features to be present and outputs a confidence measure indicating how certain it is that they are present. With multiple fully connected layers, complex decisions can be made as to whether the found features exist in the right combination, amount, spatial layout, and locations for the model to make a decision. To indicate fracture presence, the final layer would be a single neuron, whose output would be a single number—that is, its certainty of whether a fracture exists on a scale of 0 to 1. In the case of fracture classification, the number of neurons in the final layer would be equal to the number of different types of fractures the system can identify (plus 1 for no fracture). Again, the output of each neuron would be a single number, the model’s confidence that the given image contains a fracture of a given type (e.g., femoral neck, intertrochanteric, subtrochanteric hip fractures, etc., or no fracture at all).
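The decision step itself can also be sketched in a few lines: a fully connected output neuron weights all of the higher-level feature measurements, sums them, and squashes the result into a confidence between 0 and 1 with a sigmoid function. The feature values and weights below are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical measurements of higher-level features from the final feature maps,
# flattened into one vector (e.g., how strongly a cortical break, a lucent line,
# and other learned shapes were detected).
features = np.array([0.8, 0.1, 0.9, 0.0])

# Learned weights: how much each feature argues for or against "fracture"
weights = np.array([2.0, -0.5, 1.5, 0.3])
bias = -1.0

fracture_confidence = sigmoid(np.dot(features, weights) + bias)
print(f"confidence that a fracture is present: {fracture_confidence:.2f}")
```

For multiclass fracture classification, the same idea extends to several output neurons whose outputs are normalized (e.g., with a softmax function) so that the confidences across fracture types sum to 1.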

Figure 6 presents a simplified example of a complete CNN. It has a total of 2 convolutional layers and 2 max-pooling layers before the fully connected layers. Real-world CNNs would have many more convolutional and max-pooling layers to detect enough visual features. In this example, with a score of 0.96, the network is highly confident that a fracture is present.
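For readers curious what such a network looks like in code, the sketch below uses the PyTorch library (one common choice; the article does not prescribe any particular software) to assemble a toy version of the Figure 6 architecture: 2 convolutional layers, 2 max-pooling layers, and fully connected layers ending in a single fracture-confidence output. The layer sizes are arbitrary illustrative choices, not those of any published fracture-detection model.

```python
import torch
import torch.nn as nn

class TinyFractureCNN(nn.Module):
    """A toy CNN mirroring Figure 6: 2 convolution + max-pooling blocks, then fully connected layers."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 8 low-level filters (lines, edges)
            nn.ReLU(),
            nn.MaxPool2d(2),                             # summarize each 2 x 2 area
            nn.Conv2d(8, 16, kernel_size=3, padding=1),  # 16 filters for more complex shapes
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 56 * 56, 64),                 # fully connected layers weigh the features
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),                                # confidence between 0 and 1
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyFractureCNN()
radiograph = torch.rand(1, 1, 224, 224)  # one hypothetical 224 x 224 grayscale radiograph
print(model(radiograph))                 # e.g., tensor([[0.52]]) before any training
```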

Fig. 6:
A simplified example of a complete CNN. This CNN has a total of 2 convolutional layers and 2 max-pooling layers before the fully connected layers. Real-world networks have many more convolutional and max-pooling layers to detect enough visual features to make accurate predictions. At the end of the fully connected network, a different type of neuron reads the final feature maps and “decides” on the basis of the presence and combination of different visual features whether it believes a fracture is present (expressed on a scale of 0 to 1). In this example, with a score of 0.96, the network is highly confident that a fracture is present.

How a CNN Is Created

Overview: Teaching Through Testing

Traditionally, software is created by humans writing precise instructions for executing a task (e.g., if “this,” then “do that,” etc.). In the case of human vision, however, we do not really know how to express these instructions. For example, imagine trying to write a computer program that can enable computers to see fractures in radiographs (e.g., if “cortical disruption here” then “femoral neck fracture,” or if “jagged edge there” then “intertrochanteric fracture”). Approaching the problem in this manner would make for an endless task of describing every fracture pattern ever exhibited (or possible) within the human skeleton. This presents a seeming impasse when we attempt to create a computer that can “see” fractures in radiographs: we cannot write instructions for what we cannot comprehensively explain ourselves. However, if we start with the presumption that a mathematical model must exist that can do a given task (e.g., fracture detection), the challenge changes. Rather than trying to write a program that sees radiographs as a human does, we instead write a computer program that will find a mathematical model (i.e., a mathematical function) that can “see” fractures in radiographs. This is the premise behind all of machine learning—that is, that reality can be represented with a mathematical function that we do not know in advance but that we can attempt to estimate through iterative testing and adjustments2.

Creating a program to find a mathematical model that can “see” fractures requires 3 things: a rough draft model, data, and a teacher to teach the model from that data. Trying to write a teacher program that knows how to “see” only returns to the problem of explaining vision. So, instead, the programmer does not use a teacher at all. Rather, the teacher becomes a proctor. It does not teach, it tests. It does this testing with data. The programmer provides the proctor with data in the form of tests and answer keys (e.g., inputting thousands of radiographs and indicating which ones contain fractures). The first draft model being tested is a CNN populated by mathematical functions with random values, and not surprisingly, it scores very poorly on its first test: few fractures are accurately identified. But this is just the starting point. Given the poor test result, the proctor program looks at the CNN and suggests what changes in mathematical functions would have given a better score on the test. For a CNN, this translates to tasks such as changing what kind of shapes the filters should find in the radiographs and how much each neuron should weight the input from different earlier neurons. The CNN is updated with these adjustments, and a brand-new version of the test is then given with the intention that this updated version of the CNN will achieve a better score by accurately identifying a larger share of radiographs with fractures. This cycle is iterative. It continues until the accuracy of the CNN model levels off and the training is ended. A CNN is said to be successfully trained when it can find the most important patterns of the training data, apply those patterns to new prospective data in the real world, and maintain an acceptable degree of accuracy.
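In code, the “proctor” is a loss function paired with an optimizer that nudges the model’s numbers after each test. The sketch below continues the hypothetical TinyFractureCNN defined earlier and uses randomly generated stand-in data rather than real radiographs; it shows one minimal version of this test-and-adjust cycle.

```python
import torch
import torch.nn as nn

# Stand-in data: 32 random "radiographs" and an invented answer key (1 = fracture).
images = torch.rand(32, 1, 224, 224)
labels = torch.randint(0, 2, (32, 1)).float()

model = TinyFractureCNN()                                  # starts with random filters and weights
loss_fn = nn.BCELoss()                                     # grades how badly the model scored
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # proposes the adjustments

for epoch in range(5):                    # repeat the test-and-adjust cycle
    predictions = model(images)           # take the test
    loss = loss_fn(predictions, labels)   # compare answers with the answer key
    optimizer.zero_grad()
    loss.backward()                       # work out which adjustments would improve the score
    optimizer.step()                      # update the filters and weights accordingly
    print(f"epoch {epoch}: loss = {loss.item():.3f}")
```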

A More Detailed Look: Born from Data, Not Logic

The process of repeatedly testing and adjusting the mathematical functions in the model is called supervised learning2. A model starts off with no preconceived notions; all filters and neuron functions are initialized with random values. Initially, when an image is inputted, the output is random as well. The model needs to be “trained” on a task. The training images are typically inputted in small batches. After each batch, the functions and filters of the model are updated by the proctor program to achieve better accuracy. At some point, the system reaches its accuracy limit. The final accuracy is determined by the difficulty of the task, the available training data, and the sophistication of the model.

Since machine learning is a methodology for creating computer programs through examples rather than logic, a model is only as good as the data it is trained on. Too little data and the model may “memorize” the data it trained on, and therefore not predict well on new data—a concept referred to as overfitting2. Even with substantial amounts of data, those data may not have enough difficult radiographs to permit learning how to manage challenging cases. Similarly, it is common for the task itself to have accuracy limits. Models are often trained to do a task that humans themselves cannot agree on (some radiographs are difficult enough that 2 experts disagree). Thus, a model trained to match expert opinion will naturally reflect these limits of human judgement in its own accuracy.
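One common way to check for this kind of overfitting is to hold out radiographs that the model never trained on and compare accuracy on the two sets; a large gap suggests memorization rather than learning. The sketch below reuses the hypothetical TinyFractureCNN from the earlier sketches with invented stand-in data.

```python
import torch

def accuracy(model, images, labels):
    """Fraction of radiographs classified correctly at a 0.5 confidence threshold."""
    with torch.no_grad():
        predicted = (model(images) >= 0.5).float()
    return (predicted == labels).float().mean().item()

# Invented stand-in data; in practice these would be two separate sets of real
# radiographs, and the held-out set would never be shown to the model during training.
train_images, train_labels = torch.rand(64, 1, 224, 224), torch.randint(0, 2, (64, 1)).float()
test_images, test_labels = torch.rand(16, 1, 224, 224), torch.randint(0, 2, (16, 1)).float()

model = TinyFractureCNN()  # the toy network sketched earlier, assumed already trained
print(f"training accuracy: {accuracy(model, train_images, train_labels):.2f}")
print(f"held-out accuracy: {accuracy(model, test_images, test_labels):.2f}")
# A much higher training accuracy than held-out accuracy is the classic signature of overfitting.
```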

While the specific filters of a CNN are not developed by a human programmer, the model architecture (the kind and count of neurons that make up each layer) is created by the programmer. Just as the available building materials determine what kind of structure can be built and how complex it can be, the internal architecture of a CNN determines the limits of what it can and cannot “see.” Exploring different architectures consumes much of the effort in research and model development.

Conclusions

CNNs have shown great promise in many areas of computer vision, including the interpretation of medical images, and will soon be entering the clinical setting for validation and proof of concept (Table III). A CNN does not know what an image is; it does not perceive an image as humans do. Rather, a CNN reads pixels as numerical values of brightness and proximity. A CNN is “neural” in that its mathematical structure is inspired by biologic neural networks. This enables a CNN to extract visual features (i.e., shapes) from an image that can ultimately be used to make useful predictions about the image. In the case of fracture identification, a machine “reads” a radiograph by applying a CNN trained for fracture detection to predict the likelihood of the image containing a fracture.

TABLE III - Grades of Recommendation
Recommendation Grade*
Deep learning has demonstrated remarkable progress in the analysis of medical imaging across a range of modalities that are commonly used in orthopaedics. I
There is a growing body of evidence showing clinical utility for deep learning in musculoskeletal radiography, as evidenced by studies that use deep learning to achieve expert or near-expert-level performance for the identification and localization of fractures on radiographs. I
Deep learning is now in the very early stages of entering the clinical setting, involving validation and proof-of-concept studies. I
*According to Wright54, grade A indicates good evidence (Level-I studies with consistent findings) for or against recommending intervention; grade B, fair evidence (Level-II or III studies with consistent findings) for or against recommending intervention; grade C, poor-quality evidence (Level-IV or V studies with consistent findings) for or against recommending intervention; and grade I, insufficient or conflicting evidence not allowing a recommendation for or against intervention.

Source of Funding

There was no source of funding for this study.

References

1. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017 Jun;60(6):84-90.
2. Bishop CM. Pattern recognition and machine learning. Springer; 2006.
3. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016.
4. Samuel AL. Some studies in machine learning using the game of checkers. IBM J Res Develop. 1959;3(3):210-29.
5. Domingos P. The master algorithm: how the quest for the ultimate learning machine will remake our world. Basic Books; 2015.
6. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533-6.
7. Condliffe J. In 2016, AI Home Assistants Won Our Hearts. 2016 Dec 20. Accessed February 18, 2022. https://www.technologyreview.com/2016/12/20/155032/in-2016-ai-home-assistants-won-our-hearts/
8. Dahl GE, Yu D, Deng L, Acero A. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Trans Audio Speech Lang Process. 2012 Jan;20(1):30-42.
9. Fagnant DJ, Kockelman K. Preparing a nation for autonomous vehicles: opportunities, barriers and policy recommendations. Transp Res Part A Policy Pract. 2015;77:167-81.
10. Hinton G, Deng L, Yu D, Dahl GE, Mohamed A-R, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN, Kingsbury B. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process Mag. 2012 Nov;29(6):82-97.
11. Lewis-Kraus G. The Great A.I. Awakening. 2016 Dec 14. https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
12. Mikolov T, Deoras A, Povey D, Burget L, Černocký J. Strategies for training large scale neural network language models. In: Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011 Dec 11-15. Institute of Electrical and Electronics Engineers; 2011. p 196-201.
13. Sallab AE, Abdou M, Perot E, Yogamani S. Deep Reinforcement Learning Framework for Autonomous Driving. In: Proceedings of the Imaging Science and Technology International Symposium on Electronic Imaging: Autonomous Vehicles and Machine; 2017. Society for Imaging Science and Technology; 2017. p 70-6.
14. Thrun S, Montemerlo M, Dahlkamp H, Stavens D, Aron A, Diebel J, Fong P, Gale J, Halpenny M, Hoffmann G, Lau K, Oakley C, Palatucci M, Pratt V, Stang P, Strohband S, Dupont C, Jendrossek L-E, Koelen C, Markey C, Rummel C, van Niekerk J, Jensen E, Alessandrini P, Bradski G, Davies B, Ettinger S, Kaehler A, Nefian A, Mahoney P. Stanley: The robot that won the DARPA Grand Challenge. J Field Robot. 2006;23(9):661-92.
15. Upson S. The A.I. Takeover Is Coming. Let’s Embrace It. 2016 Dec 22. https://www.wired.com/2016/12/the-ai-takeover-is-coming-lets-embrace-it/#:∼:text=Takeover%20Is%20Coming.-,Let's%20Embrace%20It.,from%20the%20coming%20AI%20revolution
16. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015 May 28;521(7553):436-44.
17. Ma J, Sheridan RP, Liaw A, Dahl GE, Svetnik V. Deep neural nets as a method for quantitative structure-activity relationships. J Chem Inf Model. 2015 Feb 23;55(2):263-74.
18. Horng MH, Kuok CP, Fu MJ, Lin CJ, Sun YN. Cobb Angle Measurement of Spine from X-Ray Images Using Convolutional Neural Network. Comput Math Methods Med. 2019 Feb 19;2019:6357171.
19. Liu F, Zhou Z, Samsonov A, Blankenbaker D, Larison W, Kanarek A, Lian K, Kambhampati S, Kijowski R. Deep Learning Approach for Evaluating Knee MR Images: Achieving High Diagnostic Performance for Cartilage Lesion Detection. Radiology. 2018 Oct;289(1):160-9.
20. Nguyen TP, Chae D-S, Park S-J, Kang K-Y, Yoon J. Deep learning system for Meyerding classification and segmental motion measurement in diagnosis of lumbar spondylolisthesis. Biomed Signal Process Control. 2021;65:102371.
21. Norman B, Pedoia V, Noworolski A, Link TM, Majumdar S. Applying Densely Connected Convolutional Neural Networks for Staging Osteoarthritis Severity from Plain Radiographs. J Digit Imaging. 2019 Jun;32(3):471-7.
22. Pierson E, Cutler DM, Leskovec J, Mullainathan S, Obermeyer Z. An algorithmic approach to reducing unexplained pain disparities in underserved populations. Nat Med. 2021 Jan;27(1):136-40.
23. Tiulpin A, Thevenot J, Rahtu E, Lehenkari P, Saarakkala S. Automatic Knee Osteoarthritis Diagnosis from Plain Radiographs: A Deep Learning-Based Approach. Sci Rep. 2018 Jan 29;8(1):1727.
24. Tolpadi AA, Lee JJ, Pedoia V, Majumdar S. Deep Learning Predicts Total Knee Replacement from Magnetic Resonance Images. Sci Rep. 2020 Apr 14;10(1):6371.
25. Zheng Q, Shellikeri S, Huang H, Hwang M, Sze RW. Deep Learning Measurement of Leg Length Discrepancy in Children Based on Radiographs. Radiology. 2020 Jul;296(1):152-8.
26. Grauhan NF, Niehues SM, Gaudin RA, Keller S, Vahldiek JL, Adams LC, Bressem KK. Deep learning for accurately recognizing common causes of shoulder pain on radiographs. Skeletal Radiol. 2022 Feb;51(2):355-62.
27. Halabi SS, Prevedello LM, Kalpathy-Cramer J, Mamonov AB, Bilbily A, Cicero M, Pan I, Pereira LA, Sousa RT, Abdala N, Kitamura FC, Thodberg HH, Chen L, Shih G, Andriole K, Kohli MD, Erickson BJ, Flanders AE. The RSNA Pediatric Bone Age Machine Learning Challenge. Radiology. 2019 Feb;290(2):498-503.
28. Hsieh CI, Zheng K, Lin C, Mei L, Lu L, Li W, Chen FP, Wang Y, Zhou X, Wang F, Xie G, Xiao J, Miao S, Kuo CF. Automated bone mineral density prediction and fracture risk assessment using plain radiographs via deep learning. Nat Commun. 2021 Sep 16;12(1):5472.
29. Karnuta JM, Haeberle HS, Luu BC, Roth AL, Molloy RM, Nystrom LM, Piuzzi NS, Schaffer JL, Chen AF, Iorio R, Krebs VE, Ramkumar PN. Artificial Intelligence to Identify Arthroplasty Implants From Radiographs of the Hip. J Arthroplasty. 2021 Jul;36(7S):S290-S294.
30. Karnuta JM, Luu BC, Roth AL, Haeberle HS, Chen AF, Iorio R, Schaffer JL, Mont MA, Patterson BM, Krebs VE, Ramkumar PN. Artificial Intelligence to Identify Arthroplasty Implants From Radiographs of the Knee. J Arthroplasty. 2021 Mar;36(3):935-40.
31. Krogue JD, Cheng KV, Hwang KM, Toogood P, Meinberg EG, Geiger EJ, Zaid M, McGill KC, Patel R, Sohn JH, Wright A, Darger BF, Padrez KA, Ozhinsky E, Majumdar S, Pedoia V. Automatic Hip Fracture Identification and Functional Subclassification with Deep Learning. Radiol Artif Intell. 2020 Mar 25;2(2):e190023.
32. Rouzrokh P, Ramazanian T, Wyles CC, Philbrick KA, Cai JC, Taunton MJ, Maradit Kremers H, Lewallen DG, Erickson BJ. Deep Learning Artificial Intelligence Model for Assessment of Hip Dislocation Risk Following Primary Total Hip Arthroplasty From Postoperative Radiographs. J Arthroplasty. 2021 Jun;36(6):2197-2203 e3.
33. Shah RF, Bini SA, Martinez AM, Pedoia V, Vail TP. Incremental inputs improve the automated detection of implant loosening using machine-learning algorithms. Bone Joint J. 2020 Jun;102-B(6_Supple_A):101-6.
34. von Schacky CE, Wilhelm NJ, Schäfer VS, Leonhardt Y, Gassert FG, Foreman SC, Gassert FT, Jung M, Jungmann PM, Russe MF, Mogler C, Knebel C, von Eisenhart-Rothe R, Makowski MR, Woertler K, Burgkart R, Gersing AS. Multitask Deep Learning for Segmentation and Classification of Primary Bone Tumors on Radiographs. Radiology. 2021 Nov;301(2):398-406.
35. Yamamoto N, Sukegawa S, Kitamura A, Goto R, Noda T, Nakano K, Takabatake K, Kawai H, Nagatsuka H, Kawasaki K, Furuki Y, Ozaki T. Deep Learning for Osteoporosis Classification Using Hip Radiographs and Patient Clinical Covariates. Biomolecules. 2020 Nov 10;10(11):E1534.
36. Chen HY, Hsu BW, Yin YK, Lin FH, Yang TH, Yang RS, Lee CK, Tseng VS. Application of deep learning algorithm to detect and visualize vertebral fractures on plain frontal radiographs. PLoS One. 2021 Jan 28;16(1):e0245992.
37. Cheng CT, Ho TY, Lee TY, Chang CC, Chou CC, Chen CC, Chung IF, Liao CH. Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs. Eur Radiol. 2019 Oct;29(10):5469-77.
38. Chung SW, Han SS, Lee JW, Oh KS, Kim NR, Yoon JP, Kim JY, Moon SH, Kwon J, Lee HJ, Noh YM, Kim Y. Automated detection and classification of the proximal humerus fracture by using deep learning algorithm. Acta Orthop. 2018 Aug;89(4):468-73.
39. Jones RM, Sharma A, Hotchkiss R, Sperling JW, Hamburger J, Ledig C, O’Toole R, Gardner M, Venkatesh S, Roberts MM, Sauvestre R, Shatkhin M, Gupta A, Chopra S, Kumaravel M, Daluiski A, Plogger W, Nascone J, Potter HG, Lindsey RV. Assessment of a deep-learning system for fracture detection in musculoskeletal radiographs. NPJ Digit Med. 2020 Oct 30;3:144.
40. Kim DH, MacKinnon T. Artificial intelligence in fracture detection: transfer learning from deep convolutional neural networks. Clin Radiol. 2018 May;73(5):439-45.
41. Kitamura G, Chung CY, Moore BE 2nd. Ankle Fracture Detection Utilizing a Convolutional Neural Network Ensemble Implemented with a Small Sample, De Novo Training, and Multiview Incorporation. J Digit Imaging. 2019 Aug;32(4):672-7.
42. Urakawa T, Tanaka Y, Goto S, Matsuzawa H, Watanabe K, Endo N. Detecting intertrochanteric hip fractures with orthopedist-level accuracy using a deep convolutional neural network. Skeletal Radiol. 2019 Feb;48(2):239-44.
43. Thian YL, Li Y, Jagmohan P, Sia D, Chan VEY, Tan RT. Convolutional Neural Networks for Automated Fracture Detection and Localization on Wrist Radiographs. Radiol Artif Intell. 2019 Jan 30;1(1):e180001
44. Mosquera C, Binder F, Diaz FN, Seehaus A, Ducrey G, Ocantos JA, Aineseder M, Rubin L, Rabinovich DA, Quiroga AE, Martinez B, Beresnak AD, Benitez SE, Luna DR. Integration of a deep learning system for automated chest x-ray interpretation in the emergency department: A proof-of-concept. Intelligence-Based Medicine. 2021;5:100039.
45. Lindsey R, Daluiski A, Chopra S, Lachapelle A, Mozer M, Sicular S, Hanel D, Gardner M, Gupta A, Hotchkiss R, Potter H. Deep neural network improves fracture detection by clinicians. Proc Natl Acad Sci U S A. 2018 Nov 6;115(45):11591-6.
46. Fukushima K. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern. 1980;36(4):193-202.
47. Hubel DH, Wiesel TN. Receptive fields of single neurones in the cat’s striate cortex. J Physiol. 1959 Oct;148:574-91.
48. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. 1943. Bull Math Biol. 1990;52(1-2):99-115, discussion 73-97.
49. Forsyth DA, Ponce J. Computer vision: a modern approach. 2nd ed. Pearson; 2012.
50. Kirsch RA. SEAC and the start of image processing at the National Bureau of Standards. IEEE Ann Hist Comput. 1998;20(2):7-13.
51. Szeliski R. Computer Vision: Algorithms and Applications. 1st ed. Springer; 2011.
52. Weng J, Ahuja N, Huang TS. Learning recognition and segmentation of 3-D objects from 2-D images. In: Proceedings of the 1993 (4th) International Conference on Computer Vision; 1993 May 11-14. Institute of Electrical and Electronics Engineers; 1993. p 121-8.
53. Yamaguchi K, Sakamoto K, Akabane T, Fujimoto Y. A neural network for speaker-independent isolated word recognition. In: Proceedings of the First International Conference on Spoken Language Processing; 1990 Nov 18-22. International Conference on Spoken Language Processing; 1990. p 1077-80.
54. Wright JG. Revised grades of recommendation for summaries or reviews of orthopaedic surgical studies. J Bone Joint Surg Am. 2006 May;88(5):1161-2.


Copyright © 2022 The Authors. Published by The Journal of Bone and Joint Surgery, Incorporated. All rights reserved.