New methodological approaches to answer important questions can be intimidating even to experienced readers. One informal metric we have here at CORR® that gives some insight into how tough a paper might be is the number of reviewers we must invite in order to get three peer reviewers to take up the challenge. Typically, we have to ask about five to get three. Methodologically intensive papers often take more.
With its talk of logistic LASSO and linear support vector machines, this month’s Editor’s Spotlight paper ran through a double-digit quantity of our best reviewers, which did not surprise us. But don’t be intimidated; its methodologic bark is louder than its bite. Stay with me here; it’ll be well worth your while.
In this month’s Clinical Orthopaedics and Related Research®, a team led by Catherine H. MacLean MD, PhD, Chief Value Medical Officer at the Hospital for Special Surgery and head of its Center for the Advancement of Value in Musculoskeletal Care, shows how machine learning can help surgeons anticipate which patients will enjoy a clinically important amount of improvement following elective hip or knee replacement [2]. In this case more than some, it’s worth saying something about the team that did the work: The lead analyst was Senior Director of Data Science, economist, and health policy researcher Mark Alan Fontana PhD, and Drs. MacLean and Fontana partnered here with methodology guru and two-time Editor’s Spotlight veteran Stephen Lyman PhD [8, 9]. This truly interdisciplinary team brought together expertise in health policy, machine learning, survey and experimental design, mobile health, and behavioral economics.
I believe that the tool they evaluated here, machine learning, is the most-exciting methodological innovation our specialty has seen in the last decade. Supervised machine-learning systems, such as the ones evaluated by Drs. MacLean and Fontana’s team, are computer algorithms that produce models to estimate results based on large numbers of variables and vast quantities of data. These systems begin with a dataset of potential predictors and a known set of later outcomes that one wishes to predict (called a training dataset; learning from known outcomes is what makes these systems “supervised”), and then the developing algorithm derives functions that can be applied to future datasets to make useful predictions. These algorithms identify factors among the many that their human supervisors feed into them (elements of demography, diagnosis, surgical approach, or any of a host of other parameters we see in practice) that are most-strongly predictive of the outcomes being assessed. With more data (for machines, more data is like more time and practice), these algorithms adjust to provide ever-more-refined predictions, and the system’s guesses then can be tested (in a separate validation sample that does not contain data from the training dataset). In an iterative process that involves humans, these models can be further tested with more validation data, retrained, and further refined so they perform even better.
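For readers who like to see the moving parts, the train-then-validate workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the "patients" are synthetic, the three predictors (a standardized preoperative score, a second informative variable, and pure noise) are invented, and every function name is hypothetical. The model is a small logistic regression with an L1 (LASSO-style) penalty, fitted by proximal gradient descent.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    """Shrink a weight toward zero; weak predictors can land at exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def predict_proba(w, b, x):
    """Estimated probability that one patient achieves the outcome."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_l1_logistic(X, y, l1=0.02, lr=0.5, epochs=300):
    """Fit L1-penalized logistic regression by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        # Gradient step followed by shrinkage (the L1 penalty).
        w = [soft_threshold(wj - lr * gj / n, lr * l1)
             for wj, gj in zip(w, grad_w)]
    return w, b


def accuracy(w, b, X, y):
    """Share of patients whose outcome is predicted correctly at a 0.5 cutoff."""
    hits = sum((predict_proba(w, b, xi) >= 0.5) == (yi == 1)
               for xi, yi in zip(X, y))
    return hits / len(y)


def make_patients(n, rng):
    """Synthetic data (an assumption): feature 2 is noise by construction."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        logit = 0.5 - 1.5 * x[0] + 0.8 * x[1]
        X.append(x)
        y.append(1 if rng.random() < sigmoid(logit) else 0)
    return X, y


rng = random.Random(0)
X_train, y_train = make_patients(400, rng)   # training dataset
X_valid, y_valid = make_patients(200, rng)   # separate validation sample
weights, bias = fit_l1_logistic(X_train, y_train)
print("weights:", [round(w, 2) for w in weights])
print("validation accuracy:", round(accuracy(weights, bias, X_valid, y_valid), 2))
```

The point of the sketch is the separation of roles: the model only ever sees the training sample while it is being fitted, and its performance is judged on patients it has never seen.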
The potential—the dream now, but someday soon likely to be a reality—is that these systems will help clinicians use information available in advance of a proposed surgical procedure to anticipate its eventual benefits or harms.
Dr. MacLean’s team used machine learning on a hospital’s arthroplasty database to identify those patients most likely to achieve a minimally clinically important difference (MCID) in outcomes scores after joint replacement. As clinicians, we need to direct our attention to MCIDs rather than p values to decide whether our procedures have helped our patients, since patients perceive effect sizes but are utterly blind to p values [7, 12, 13].
Machine learning has already proven its value in our specialty in topics as diverse as anticipating survival in patients with orthopaedic tumors [16], predicting complications after arthroplasty [5], and deciding which spine-surgery research trends are going to become “hot” [15]. The parameters that MacLean and colleagues identified as important in their article in this month’s CORR® are less important than the fact that the three kinds of machine-learning models they tested all were pretty effective at predicting outcomes of interest. This may prove to be especially important, since experienced surgeons are surprisingly ineffective at anticipating which patients will achieve an MCID after joint replacement [3].
I particularly like that this paper is published in this year’s Proceedings of the International Society of Arthroplasty Registers (ISAR). CORR is proud of its relationship with ISAR, whose members are doing some of the most-important arthroplasty research in the world. Because of this, my colleagues and I at CORR are thrilled to see one of ISAR’s papers in the Editor's Spotlight this month. Leveraging the power of machine learning in the context of the world’s large registries is undoubtedly a key next step both for registry-based researchers and machine-learning enthusiasts.
Please enjoy the other papers in ISAR’s proceedings in this month’s CORR, and join me in the Take 5 interview that follows with Drs. MacLean and Fontana, two of the authors of “Can Machine Learning Algorithms Predict Which Patients Will Achieve Minimally Clinically Important Differences From Total Joint Arthroplasty?” They share their vision of how machine learning will transform surgical practice in the years to come. I have no doubt that the same technology that brought us self-driving cars, online recommendations from our favorite e-tailers, “curated” content on social-media feeds, and credit-card fraud detection can improve clinical practice in similarly dramatic ways.
Take 5 Interview with Catherine H. MacLean MD, PhD and Mark Alan Fontana PhD, coauthors of “Can Machine Learning Algorithms Predict Which Patients Will Achieve Minimally Clinically Important Differences From Total Joint Arthroplasty?”
Seth S. Leopold MD: Congratulations on this intriguing paper. You say that your machine-learning models are not ready for “off-the-shelf” application by surgeons in practice. Why, then, should clinicians read this work? (Spoiler alert: They should!)
Catherine H. MacLean MD, PhD: Thank you for your interest in our paper. To clarify, it’s not that machine-learning methods are not ready for clinical application, but rather that the specific models we developed using those techniques are not applicable in an “off-the-shelf” way to clinical settings other than the one from which the data used in the model came. In other words, machine learning-derived models developed using data from one institution may not be generalizable to other institutions. This is partially because the variables used in our models may not be available at other institutions, and partially because the variables used in our models may be measured differently or not apply at other institutions. For example, certain variables may have different categories, or variables representing a particular surgeon would not apply. That said, the same methods could be used leveraging data from other institutions to develop models that, depending on the available data elements and sample size, might produce clinically useful models.
Dr. Leopold: One of the consistent findings seen in all your machine-learning models was that preoperative outcomes scores were associated with postoperative outcomes scores; this is not new or surprising. In what ways did the machine-learning models surprise you, in terms of finding unexpected parameters that mattered, and why (again, if these models aren’t ready to be exported to other practices) is that still so important?
Mark Alan Fontana PhD: I think it’s important to understand exactly what being a “top predictor” (or “top feature” using machine learning terminology) really means, and how that should or (more importantly) should not be interpreted. Machine learning focuses on maximizing predictive power, and we often want to throw many variables into the algorithm and allow it to figure out what’s most predictive. Because we’re throwing together so many correlated predictors, combined with potential confounding, we have to take the top features with a grain of salt. The top features mechanically contribute most to the predictions and can be used for brainstorming and discussion with clinicians, or to ascertain whether a model is behaving strangely or something is miscoded or otherwise awry, but they are NOT necessarily unbiased estimates of associations nor indicative of causal relationships. I’ve found it’s hard to get people to internalize this point. For example, some of the top features related to health insurance are perhaps markers for the availability of services and/or social determinants of health. Another example—and one that probably caught the eye of many readers—was the fact that the fifth best predictor for the SF-36 physical component score MCID was one particular surgeon. It would be a mistake to interpret this as some indication of surgeon quality, positive or negative, given the myriad issues that could confound such an association, as well as all the other variables potentially highly correlated with this variable. So, the top features (and any surprises in them) need to be interpreted very cautiously; the important part is having machinery that can ingest a large number of potential predictors. Unboxing the causal relationships underlying these predictors and models is more in the realm of econometrics or epidemiology than machine learning. But machine learning can spark discussion.
Dr. Leopold: Dr. MacLean, I like your title at the Hospital for Special Surgery: Chief Value Medical Officer. I have not seen that title before, but I suspect there will be more CVMOs in the years to come. How do you see machine learning changing how hospitals approach the concept of measuring and delivering value to the patients they treat?
Dr. MacLean: I view machine learning as one of many methodologies available to hospitals that can be used to promote better clinical care and operational efficiencies, both of which drive value. The promise of machine learning is that it can be used to identify patterns in large datasets that otherwise could not be identified and can also be used to identify known patterns more accurately or efficiently. On the clinical side of things, I anticipate that early hospital use cases will include improving diagnostic accuracy and determining the likelihoods of various clinical outcomes with different therapeutic options. The first applications improving diagnostic accuracy will be in areas where pattern recognition is used to make diagnoses using limited discrete data and for which large amounts of data are available. Examples include reading radiology images and pathology slides, areas in which promising machine learning work is underway [10]. Initial applications of machine-read films and slides might include a first read, perhaps with a call out of abnormal areas on the study, which then may be overread by a human radiologist or pathologist. This could both improve diagnostic accuracy and reduce the time needed to read studies. Machine-learning models could also be used to construct highly personalized patient profiles that detail the probabilities of complications and of achieving various functional outcomes for individual patients. This would be enormously helpful to patients and clinicians in framing clinical decisions. Avoiding procedures that are unlikely to help or very likely to result in complications would be huge value wins both in terms of improving quality and reducing costs. The work we present in this issue is one step toward being able to do this.
In terms of promoting operational efficiencies, I anticipate that these methods will be used to optimize supply chains and logistics in much the same way they have been used in other industries including online retail and shipping.
Dr. Leopold: You looked at patients who underwent surgery; plainly the future of machine-learning assessments must include patients at a decision point, that is, whether to have surgery or not. We should not assume the findings drawn from groups of patients who had surgery will apply to those who are just thinking about it, and we often lack data on what becomes of patients who decide against having an operation. What will it take for us to develop machine-learning models that can help patients decide whether to have the intervention in question, and how far are we from having such models?
Dr. Fontana: This is a really important point. Taking models derived from surgical patients and applying them to patients contemplating surgery requires something of a logical leap, as we obviously only measure surgical outcomes on surgical patients. Whether this is appropriate is something that would need to be directly, empirically tested via a prospective, randomized study. That is, we would need to compare a group contemplating surgery whose decision-making included a discussion of the output of some machine learning model to a group contemplating surgery whose decision-making did not include such a discussion, evaluated in terms of their propensity to have surgery, as well as subsequent outcomes and utilization. This is another example of why these models should not be used “off-the-shelf.” Even with a perfect prediction (which we are far from), whether the output of these models is useful in practice should depend on improving downstream outcomes, not just predictive power.
Another goal might be to make and test specific predictions about results following sundry surgical and nonsurgical interventions. This would allow us to provide patients with a sort of “menu” of treatment options and personally curated predictions for each: “This is what we think will happen if you have surgery” (based on your information and past patients who had surgery), “this is what we think will happen if you have treatment X” (based on your information and past patients who had treatment X, for many treatment options), and “this is what we think will happen if you do nothing” (based on your information and past patients who did nothing). Building these models requires longitudinal data on these various treatment options. What will it take to get there? Time, for one, and certainly many more data scientists, data engineers, and analysts, along with herds of unicorns (a term in the industry for creative, insightful scientists who both solve hard problems and communicate their ideas in plain language, combining computer science, statistics, and subject matter expertise, nicknamed for a creature that does not exist), working on this sort of research. The fact that people with expertise in machine learning are more often attracted to tech than health care is a problem that we, as an industry, need to address if we want to get there sooner.
Dr. Leopold: I found it especially ironic that an earlier study from your institution that also tried to anticipate which patients would enjoy an improvement in knee scores greater than the MCID after arthroplasty found that surgeons’ predictions were no better than chance [3]. So, if I get my knee replaced at the Hospital for Special Surgery, should I talk to a surgeon or a computer? (Sorry, couldn’t resist…). More seriously, if we take your study on machine learning together with that one, what should we anticipate the future of preoperative patient counseling to look like?
Dr. MacLean: You should talk both to a surgeon and a machine! The models we present are pretty good but not perfect at predicting whether a patient will achieve an MCID. Some of the AUROCs (area under the receiver operating characteristic curve, a marker of diagnostic utility) that we reported were similar to those of commonly used diagnostic tests such as digital mammography (AUROC = 0.84-0.88) and CT pulmonary angiography (AUROC = 0.88) [6, 11]. This level of accuracy is helpful to inform clinical decisions but not good enough to be the sole driver of such decisions. Importantly, many clinical elements that are assessed as part of an evaluation for total joint arthroplasty such as medical history, physical exam, and radiographic findings are not included in our current models. While we plan to test the addition of such variables to future models, it is unlikely that any model will ever include every clinical element that might inform a surgical decision. Likewise, many other factors including patient goals, preferences, chances of having a complication, and degree of risk aversion should factor into deciding whether surgery is the best therapeutic option for the patient.
We think future preoperative counseling will be informed by highly personalized patient profiles that detail the probabilities of complications and of achieving various functional outcomes not only for the surgery being considered, but also for other therapeutic options. Machine-learning techniques will surely contribute to the development of these profiles. Regarding when this will happen: It’s already happening to varying degrees. We are currently testing the use of decision aids based on the models we present here and have a several-year production plan to develop and test models that can be used to inform various treatment decisions.
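A brief aside on the AUROC that Dr. MacLean mentions in her answer: one way to read it is as the probability that a randomly chosen patient who achieved the outcome received a higher predicted score than a randomly chosen patient who did not. The sketch below computes it directly from that definition. The function name and the toy numbers are illustrative assumptions, not data from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 if the patient achieved the outcome, else 0.
    scores: the model's predicted probability (or any ranking score).
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (invented numbers): four patients, two of whom achieved the outcome.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auroc(labels, scores))  # 3 of the 4 positive-negative pairs are ranked correctly: 0.75
```

An AUROC of 0.5 corresponds to guessing, and 1.0 to perfect ranking, which is why values in the 0.8s are useful for informing a decision without being sufficient to make it.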
2. Fontana MA, Lyman S, Sarker GK, Padgett DE, MacLean CH. Can machine learning algorithms predict which patients will achieve minimally clinically important differences from total joint arthroplasty? Clin Orthop Relat Res. [Published online ahead of print]. DOI: 10.1097/CORR.0000000000000687.
3. Ghomrawi HM, Mancuso CA, Dunning A, Gonzalez Della Valle A, Alexiades M, Cornell C, Sculco T, Bostrom M, Mayman D, Marx RG, Westrich G, O’Dell M, Mushlin AI. Do surgeon expectations predict clinically important improvements in WOMAC scores after THA and TKA? Clin Orthop Relat Res. 2017;475:2150–2158.
5. Harris AH, Kuo AC, Weng Y, Trickey AW, Bowe T, Giori N. Can machine learning methods produce accurate and easy-to-use prediction models of 30-day complications and mortality after knee or hip arthroplasty? Clin Orthop Relat Res. 2019;477:452–460.
6. Lehman CD, Wellman RD, Buist DS, Kerlikowske K, Tosteson AN, Miglioretti DL. Breast Cancer Surveillance Consortium. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Intern Med. 2015;175:1828–1837.
7. Leopold SS, Porcher R. Editorial: The minimum clinically important difference—The least we can do. Clin Orthop Relat Res. 2017;475:929–932.
8. Leopold SS. Editor’s Spotlight/Take 5: Is changing hospitals for revision total joint arthroplasty associated with more complications? Clin Orthop Relat Res. 2014;472:2002–2005.
9. Leopold SS. Editor’s Spotlight/Take 5: Do patient race and sex change surgeon recommendations for TKA? Clin Orthop Relat Res. 2015;473:406–409.
10. Lindsey R, Daluiski A, Chopra S, Lachapelle A, Mozer M, Sicular S, Hanel D, Gardner M, Gupta A, Hotchkiss R, Potter H. Deep neural network improves fracture detection by clinicians. PNAS. 2018;115:11591–11596.
11. Macdonald WB, Patrikeos AP, Thompson RI, Adler BD, van der Schaaf AA. Diagnosis of pulmonary embolism: Ventilation perfusion scintigraphy versus helical computed tomography pulmonary angiography. Australas Radiol. 2005;49:32–38.
12. Maltenfort M, Díaz-Ledezma C. Statistics In Brief: Minimum clinically important difference—Availability of reliable estimates. Clin Orthop Relat Res. 2017;475:933–946.
13. Maltenfort M. CORR Insights: What is the minimum clinically important difference for the WOMAC index after TKA? Clin Orthop Relat Res. 2018;476:2015–2016.
15. Sing DC, Metz LN, Dudli S. Machine learning-based classification of 38 years of spine-related literature into 100 research topics. Spine (Phila Pa 1976). 2017;42:863–870.
16. Thio QCBS, Karhade AV, Ogink PT, Raskin KA, De Amorim Bernstein K, Lozano Calderon SA, Schwab JH. Can machine-learning techniques be used for 5-year survival prediction of patients with chondrosarcoma? Clin Orthop Relat Res. 2018;476:2040–2048.