CORR Insights®: Deep Learning Assistance Closes the Accuracy Gap in Fracture Detection Across Clinician Types

Wilson, Nicole A. PhD, MD, FACS1

Author Information
Clinical Orthopaedics and Related Research 481(3):589-591, March 2023. | DOI: 10.1097/CORR.0000000000002463

Where Are We Now?

The term artificial intelligence (AI) evokes the idea of a “thinking machine” that uses algorithms to perform tasks that require learning and problem-solving [7, 8]. Machine learning (ML) is a subset of AI that uses algorithms trained to perform specific tasks based on features provided in a large dataset. Deep learning is a further subset of ML that learns features and tasks directly from the data without the requirement of explicit programming [11]. For more than 30 years, AI and ML have promised to make healthcare decisions more accurate and less variable, and thereby revolutionize the healthcare field [8]. Although AI and ML already impact almost all aspects of modern life outside healthcare, adoption of AI and ML systems in healthcare has been surprisingly slow [9] and largely behind the scenes (for example, disease risk prediction based on electronic health records, bone density calculations from radiographs, and predictions of the risk of readmission) [13]. This is in large part because of the myriad ethical issues that arise when allowing computers to augment or even replace traditional medical decision-making.

Traditionally, medical decision-making represents a complex synthesis of basic science, clinical evidence, and patient preferences [4]. When making medical decisions, clinicians also shoulder the burden of needing to explain or defend the rationale behind a decision. However, when decisions are based on or supported by potentially opaque AI or ML systems, the ability to explain why or how a decision was made may be compromised [1]. This lack of understanding of how ML models generate outputs, commonly called the black-box problem, may be amplified in novice practitioners, who lack the underlying domain expertise to guide decision-making; the likely result is greater reliance on algorithmic output combined with less ability to explain the “why” of a decision.

The authors of “Deep Learning Assistance Closes the Accuracy Gap in Fracture Detection Across Clinician Types” demonstrated that using a deep learning system improves clinician accuracy for detecting fractures [3]. The authors show the largest improvements were seen in clinicians with limited training in musculoskeletal imaging [3] and argue that use of the deep learning system enabled those clinicians to detect fractures with diagnostic accuracy comparable to that of radiologists and orthopaedic surgeons [3]. Based on this research, the authors received FDA clearance for their deep learning algorithm, enabling it to be used for fracture detection in the clinical setting.

Where Do We Need To Go?

Given the overtaxed state of the United States healthcare system, we can all appreciate the benefits of enabling and supporting clinicians outside the hospital in performing tasks more easily and efficiently. Although it is tempting to celebrate the triumph of finally realizing the promise of AI in healthcare, we must ask, is this a good thing? That is, should a relatively inexperienced practitioner simply be paired with an AI system to diagnose and treat medical conditions? What are the pitfalls of introducing AI tools into the clinical space, and what level of understanding is sufficient for a clinician to safely use and interpret clinical AI or ML outputs? AI systems may be uniquely situated to aid frontline clinicians because AI can often be well-trained to appropriately handle most standard procedures it might encounter. However, unlike humans, algorithms are only as good as their data, and they begin to fail as they encounter real-world data that deviate from examples used during training. For example, depending on the task, a deep learning algorithm will likely have a relative blind spot for an 82-year-old male patient if the algorithm was trained on data from a group of patients that was 65% female and 93.9% younger than 80 years [6]. Although an expert clinician may recognize that the patient in front of him or her is not the typical patient, an algorithm cannot always recognize these “edge cases,” where available data are sparse.

Edge cases are rare occurrences in which an algorithm is presented with a scenario it has not encountered before, usually because of an extreme value for a single parameter. Unfortunately, given the extraordinarily large volume of patients who move through healthcare systems and the vast numbers of parameters used to represent each patient, edge cases are relatively common in healthcare. These are not only the “zebra” diagnoses (that is, the very unlikely diagnostic possibilities) [10], but also the areas in which data variability can limit an algorithm’s ability to make connections, find patterns, or provide consistent recommendations. Worse, when dealing with an algorithm, an end user may not know where the edges lie. For example, when detecting fractures, a deep learning algorithm may learn to recognize cortical irregularities, periosteal reaction, and soft tissue swelling. However, if the energy and exposure settings of the X-ray beam are also learned as salient features, changes between facilities may result in unexpected (and unrecognized) edge cases. Crucially, at these edges, algorithms may start producing incorrect or harmful predictions. How do we better understand edge cases and their impact on predictions in healthcare AI?
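One pragmatic safeguard is to flag inputs that fall outside the range of the data a model was trained on, so the clinician is warned before trusting a prediction on a potential edge case. The sketch below illustrates the idea; the feature names and training ranges are hypothetical, not taken from the algorithm studied in the article.

```python
from dataclasses import dataclass

@dataclass
class FeatureRange:
    low: float
    high: float

# Ranges observed in a hypothetical training set (for example, a cohort
# that was 93.9% younger than 80 years, imaged on one facility's scanner).
TRAINING_RANGES = {
    "age_years": FeatureRange(18, 79),
    "tube_voltage_kvp": FeatureRange(50, 70),
}

def edge_case_warnings(patient: dict) -> list:
    """Return a warning for each feature outside the training distribution."""
    warnings = []
    for name, value in patient.items():
        r = TRAINING_RANGES.get(name)
        if r is not None and not (r.low <= value <= r.high):
            warnings.append(
                f"{name}={value} outside training range [{r.low}, {r.high}]"
            )
    return warnings

# An 82-year-old patient imaged at a higher tube voltage triggers two warnings.
print(edge_case_warnings({"age_years": 82, "tube_voltage_kvp": 80}))
```

Such range checks cannot capture every edge (interactions between in-range features can still be out of distribution), but they make one class of blind spot visible to the end user.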

In the current healthcare environment, we rely on clinical expertise to navigate edge cases. Unfortunately, the combination of a novice practitioner and an AI system does not equate to the wisdom and experience of an expert clinician. So, how do we bridge that gap? How do we teach novice practitioners enough about AI systems that they can safely interpret algorithmic predictions and know when to (or not to) trust algorithmic output?

How Do We Get There?

The first step is to ensure that AI systems behave robustly. Much like autonomous vehicles, AI in healthcare is high stakes and must be engineered for reliability. Situations that can result in unpredictable algorithmic output or recommendations for a dangerous course of action can have disastrous consequences. Therefore, we need to study and understand edge cases. However, evaluating edge cases requires large amounts of data from diverse sources. The new National Institutes of Health policy on data management and sharing will certainly make more data accessible and decrease data silos [5]. In the case of proprietary software, we must insist on transparency about the data used to train and validate each system. Finally, research opportunities exist in the development of adversarial systems that can generate “deepfake” inputs to augment an AI system’s training data and reduce its edge cases, as well as in challenging systems with invalid, malformed, or unexpected inputs and monitoring their behavior. Before implementation, challenging the behavior of novel AI systems will build confidence that each system is sufficiently flexible and robust to fare well in the clinical setting.
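Challenging a system with invalid or malformed inputs can be sketched as a simple pre-deployment harness. Here, `predict_fracture` is a hypothetical stand-in for a vendor model, not the algorithm from the article; the point is that bad inputs should raise a controlled error rather than yield a silent, arbitrary prediction.

```python
def predict_fracture(pixel_values):
    """Hypothetical model wrapper with explicit input validation."""
    if not pixel_values:
        raise ValueError("empty image")
    if any(not isinstance(v, (int, float)) for v in pixel_values):
        raise ValueError("non-numeric pixel data")
    if any(v < 0 or v > 4095 for v in pixel_values):  # 12-bit radiograph range
        raise ValueError("pixel value out of range")
    # Placeholder decision rule standing in for the real model.
    return sum(pixel_values) / len(pixel_values) > 2000

MALFORMED_INPUTS = [
    [],            # empty study
    ["a", "b"],    # corrupted, non-numeric data
    [-5, 10],      # negative pixel value
    [99999],       # value beyond the detector's range
]

def challenge(model, inputs):
    """Count how many malformed inputs the model rejects with a clear error."""
    rejected = 0
    for case in inputs:
        try:
            model(case)
        except ValueError:
            rejected += 1
    return rejected

print(challenge(predict_fracture, MALFORMED_INPUTS))  # ideally rejects all 4
```

A system that swallows any of these cases and still emits a prediction has an unmonitored edge, which is exactly the failure mode this kind of challenge testing is meant to expose before clinical use.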

Assuming a robust AI system, how do we address the black-box problem? Much of the debate around AI in healthcare has focused on explainability, meaning that clinicians should be able to understand why a particular recommendation has been made [1]. Many have argued for limiting healthcare AI models to “interpretable” algorithms, where each decision point can be explained [12]. However, this approach is not feasible because it eliminates neural networks and other deep learning techniques, as these are not inherently interpretable [2]. Others have developed model-agnostic explainability methods that can derive post hoc explanations from black-box models [2]. Regardless of the particulars, it is imperative that clinicians develop a basic working knowledge of AI algorithms: how they work, why they fail, and where their edge cases lie. In fact, there is an urgent need for curriculum development surrounding these issues. Just as medical students learn the basics of interpreting and critiquing healthcare research, a forward-thinking, introductory AI curriculum would prepare clinicians-in-training for the future they will almost certainly face.
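The intuition behind model-agnostic, post hoc explanation can be shown in a few lines: perturb each input feature of a black-box model and measure how much the output moves. This is a toy perturbation analysis, not a production method such as LIME or SHAP, and `black_box` with its weights and feature names is purely illustrative.

```python
def black_box(features: dict) -> float:
    """Stand-in for an opaque fracture-probability model; weights are illustrative."""
    return (0.7 * features["cortical_irregularity"]
            + 0.2 * features["soft_tissue_swelling"]
            + 0.1 * features["patient_age_scaled"])

def perturbation_importance(model, features, delta=0.1):
    """Rank features by how much the output changes when each is nudged by delta."""
    baseline = model(features)
    scores = {}
    for name in features:
        perturbed = dict(features)
        perturbed[name] += delta
        scores[name] = abs(model(perturbed) - baseline)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

example = {
    "cortical_irregularity": 0.8,
    "soft_tissue_swelling": 0.3,
    "patient_age_scaled": 0.5,
}
ranking = perturbation_importance(black_box, example)
print(ranking)  # cortical irregularity dominates this prediction
```

The clinician-facing output is a ranked list ("this prediction was driven mainly by cortical irregularity"), which is the kind of post hoc rationale a novice practitioner could weigh against his or her own reading of the film.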


1. Adadi A, Berrada M. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access. 2018;6:52138-52160.
2. Amann J, Blasimme A, Vayena E, Frey D, Madai VI; Precise4Q Consortium. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Med Inform Decis Mak. 2020;20:310.
3. Anderson PG, Baum GL, Keathley N, et al. Deep learning assistance closes the accuracy gap in fracture detection across clinician types. Clin Orthop Relat Res. 2023;481:580-588.
4. Diprose WK, Buist N, Hua N, Thurier Q, Shand G, Robinson R. Physician understanding, explainability, and trust in a hypothetical machine learning risk calculator. J Am Med Inform Assoc. 2020;27:592-600.
5. National Institutes of Health. Final NIH policy for data management and sharing. Available at: Accessed September 18, 2022.
6. Food and Drug Administration Center for Devices and Radiological Health. Evaluation of automatic Class III designation for OsteoDetect DEN180005. Available at: Accessed September 18, 2022.
7. Haan M, Ongena YP, Hommes S, Kwee TC, Yakar D. A qualitative study to understand patient perspective on the use of artificial intelligence in radiology. J Am Coll Radiol. 2019;16:1416-1419.
8. Handelman GS, Kok HK, Chandra RV, Razavi AH, Lee MJ, Asadi H. eDoctor: machine learning and the future of medicine. J Intern Med. 2018;284:603-619.
9. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nat Med. 2019;25:30-36.
10. Joseph C. The enigma of the edge case in healthcare. Nordic Global. Available at: Accessed September 18, 2022.
11. Palmisciano P, Jamjoom AAB, Taylor D, Stoyanov D, Marcus HJ. Attitudes of patients and their relatives toward artificial intelligence in neurosurgery. World Neurosurg. 2020;138:e627-e633.
12. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell. 2019;1:206-215.
13. Senders JT, Arnaout O, Karhade AV, et al. Natural and artificial intelligence in neurosurgery: a systematic review. Neurosurgery. 2018;83:181-192.
© 2022 by the Association of Bone and Joint Surgeons