Using Machine Learning to Assess Physician Competence: A Systematic Review

Dias, Roger D. MD, MBA, PhD; Gupta, Avni BDS, MPH; Yule, Steven J. PhD

doi: 10.1097/ACM.0000000000002414


Medical education is in the middle of a global revolution that is characterized by a movement toward competency-based instead of time-based metrics of progression.1,2 This has triggered a need for objective structured assessments of performance, catalyzing the emergence of outcomes-based milestones and entrustable professional activities. It has also set an expectation that outcomes will be monitored throughout medical training to ensure that trainees gain expertise in the six core Accreditation Council for Graduate Medical Education (ACGME) competency domains, allowing progression from supervised training to independent clinical practice.3,4

Worldwide adoption of competency-based assessment will necessitate periodic evaluations of all physicians, both during training and in unsupervised practice, to guarantee the achievement and maintenance of competence. Feasible, viable, and scalable assessment systems could maximize cost-effectiveness by using validated tools with high interrater reliability and low implementation costs. However, current assessment strategies primarily rely on trained evaluators and may have high costs and suboptimal reliability.5–7

Artificial intelligence has developed rapidly in the last decade, prompting the application of several advanced techniques to medicine.8,9 To reduce the burden and reliance on human evaluators, some of this technology has been used to assess physician competence via automated systems.10,11 Machine learning, for example, is a branch of artificial intelligence that uses large datasets to create algorithmic models capable of recognizing patterns and making predictions. There are many ways in which data can be used for machine learning analysis. Supervised machine learning occurs when both input data and the corresponding output variables are used for algorithm creation. In this case, the aim is to learn a predictive function that maps the input to the output. When the output variable is continuous (e.g., a numeric performance score), the problem is one of regression, and when the output is categorical (e.g., differentiation between junior and senior surgeons), it is a classification problem. In unsupervised machine learning, only input variables are available, and the aim is to model the structure and distribution of the input data. The most common problem addressed by unsupervised machine learning is clustering (e.g., establishing distinct groups of physicians according to performance similarity).12 Computer vision, speech recognition, and natural language processing are examples of machine learning techniques that have been used to assess physicians’ knowledge, skills, and behaviors. These automated methods can be used to set standards for competence assessment, with a large potential for scalability while improving reliability of assessments and reducing costs.12–14
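The distinction between regression, classification, and clustering described above can be sketched with a few lines of standard-library Python. This is only an illustrative sketch: the task times, ratings, and expertise labels are hypothetical, and the simple estimators (least squares, nearest centroid, one-dimensional k-means) are assumptions chosen for brevity, not methods taken from any reviewed study.

```python
from statistics import mean

# Hypothetical toy data: one input feature per surgeon (task time in seconds).
task_time = [210.0, 190.0, 185.0, 120.0, 110.0, 95.0]
# Outputs supplied by human evaluators (assumed values, for illustration only).
rating = [12.0, 14.0, 15.0, 24.0, 26.0, 28.0]
level = ["junior", "junior", "junior", "senior", "senior", "senior"]


def fit_line(x, y):
    """Supervised regression: least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_centroids(x, labels):
    """Supervised classification: store the mean input value of each class."""
    return {c: mean([a for a, lab in zip(x, labels) if lab == c]) for c in set(labels)}


def classify(centroids, value):
    """Assign the class whose centroid lies closest to the new input."""
    return min(centroids, key=lambda c: abs(centroids[c] - value))


def two_means(x, iterations=10):
    """Unsupervised clustering: 1-D k-means with k = 2, using inputs only."""
    centers = [min(x), max(x)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for a in x:
            nearest = 0 if abs(a - centers[0]) <= abs(a - centers[1]) else 1
            groups[nearest].append(a)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups


slope, intercept = fit_line(task_time, rating)
centroids = fit_centroids(task_time, level)
clusters = two_means(task_time)
print(f"regression: rating = {slope:.3f} * time + {intercept:.1f}")
print("classification of a 100 s trial:", classify(centroids, 100.0))
print("clusters found without labels:", clusters)
```

The two supervised functions need the evaluator-supplied outputs, whereas `two_means` sees only the inputs, which is the practical difference between the two families of methods.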

In the present systematic review, our aim is to identify the different machine learning techniques that have been applied to automate physician competence assessment. We also aim to evaluate how these novel methods can be used to assess different competence domains in several medical specialties.


Method

We used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to design the present study and report the review findings.15,16

Search strategy and data sources

We developed a literature search strategy in collaboration with a librarian who has expertise in systematic reviews. Using medical subject headings (MeSH) and keywords, in May 2017, we searched the following databases for articles published from inception to April 30, 2017: MEDLINE, EMBASE, PsycINFO, Web of Science, ACM (Association for Computing Machinery) Digital Library, IEEE Xplore Digital Library, International Prospective Register of Systematic Reviews (PROSPERO), and Cochrane Database of Systematic Reviews. We adapted the MeSH terms and keywords from our MEDLINE search strategy (see Supplemental Digital Appendix 1) to the other databases according to the specific syntax required for each database. To ensure literature saturation, we also hand-searched the reference lists of all articles for other potential inclusions.

Selection process and data extraction

We only considered original articles published in peer-reviewed journals. No restrictions were imposed regarding language, study design, or setting. Articles were included if they assessed attending physicians’, residents’, fellows’, or medical students’ competence and used at least one machine learning technique. After duplicates were removed, two independent authors (R.D.D. and A.G.) screened the titles and abstracts of all search results and identified relevant articles on the basis of the inclusion criteria. For the selected articles, the full text was read by both authors independently, who then jointly decided whether the study met final inclusion criteria for the systematic review. In the case of disagreement, the authors would discuss and reach a decision by consensus. We noted the reasons for excluding articles after full-text review (see Figure 1). Using standardized data extraction forms created in REDCap (Vanderbilt University, Nashville, Tennessee),17 two authors (R.D.D. and A.G.) independently extracted the following information from each included article: sample size, participant characteristics, study setting and design, medical specialty, machine learning techniques, competence domains assessed, study outcomes, and methodological quality metrics. The PRISMA flow diagram showing screening and selection results is given in Figure 1.

Figure 1: Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of screening and selection process for a systematic review of the literature on using machine learning techniques to assess physician competence, May 2017.

Data synthesis and quality assessment

We conducted a qualitative narrative synthesis focusing on the medical specialties, machine learning techniques used, and competence domains evaluated in the included articles. In addition, we discussed the applications of machine learning techniques in each study and how they were integrated into current physician competence metrics. Two authors (R.D.D. and A.G.) independently assessed the methodological quality of all included studies using the Medical Education Research Study Quality Instrument (MERSQI), the gold standard of medical education research evaluation.18 This is a 10-item instrument that assesses six domains of research quality (study design, sampling, data type, validity of assessments, data analysis, and outcomes). Each domain receives a score from 0 to 3 for a maximum total score of 18. We recorded the mean MERSQI score, based on the two authors’ individual ratings, for each included study.
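The scoring arithmetic described above (six domains, each scored from 0 to 3, and a per-study mean of two raters' totals) can be illustrated as follows. The domain names come from the text; the two sets of ratings are hypothetical, and the function names are ours rather than part of the MERSQI instrument.

```python
# The six MERSQI domains named in the text; each is capped at 3 points.
DOMAINS = ("study design", "sampling", "data type",
           "validity of assessments", "data analysis", "outcomes")
MAX_PER_DOMAIN = 3.0
MAX_TOTAL = MAX_PER_DOMAIN * len(DOMAINS)  # 18, as stated in the text


def total_score(domain_scores):
    """Sum one rater's domain scores after checking names and ranges."""
    if set(domain_scores) != set(DOMAINS):
        raise ValueError("scores must cover exactly the six MERSQI domains")
    for name, score in domain_scores.items():
        if not 0.0 <= score <= MAX_PER_DOMAIN:
            raise ValueError(f"{name}: score {score} is outside 0 to {MAX_PER_DOMAIN}")
    return sum(domain_scores.values())


def study_mean(rater_a, rater_b):
    """Mean of the two raters' total scores for a single study."""
    return (total_score(rater_a) + total_score(rater_b)) / 2


# Hypothetical ratings of one study by two independent raters.
rater_1 = dict(zip(DOMAINS, (1.5, 1.5, 3.0, 2.0, 3.0, 2.0)))
rater_2 = dict(zip(DOMAINS, (1.5, 2.0, 3.0, 2.0, 3.0, 2.5)))
print(study_mean(rater_1, rater_2), "out of", MAX_TOTAL)
```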


Results

A total of 4,942 articles were initially identified through database searching, and an additional 11 articles were identified through hand-searching (for a total of 4,953 initial articles). After duplicate removal and title and abstract screening, 199 full-text articles were assessed for inclusion, with 69 studies meeting final inclusion criteria.19–87 Figure 2 shows the distribution of the included studies by year of publication; 36 (52.2%) studies were published in the last six years (2012–April 2017). Appendix 1 presents detailed data on all included studies.

Figure 2: Year of publication of the 69 studies included in a systematic review of the literature on using machine learning techniques to assess physician competence, May 2017. Note: for 2017, only studies published from January through the end of April were included.

Study design and setting

Thirty-two (46.4%) studies were cross-sectional studies, 21 (30.4%) were retrospective cohorts, 14 (20.3%) were prospective cohorts, and 2 (2.9%) were randomized controlled trials. Most studies were carried out in clinical settings: 17 (24.6%) in diagnostic units, 7 (10.1%) in outpatient services, 5 (7.2%) in inpatient units, 4 (5.8%) in in- and outpatient units, and 3 (4.3%) in operating rooms. A large proportion of the included studies used simulation (28; 40.6%), while 3 (4.3%) used animal models and 2 (2.9%) involved qualifying examinations.


Participants

Thirteen (18.8%) studies did not report the number of participants. Among the studies that did, the total number of participants was 2,683. Thirty-two (46.4%) studies assessed competence at more than one level of expertise (e.g., residents vs. attending physicians). A total of 53 (76.8%) studies assessed attending physicians, 42 (60.9%) assessed residents or fellows, and 22 (31.9%) assessed medical students. Only 2 (2.9%) studies assessed competence at the team level, with both of the studies involving multidisciplinary teams that also included either nurse practitioners20 or nurse practitioners and physician assistants.19

Competence domains and medical specialties

General surgery (24; 34.8%) and radiology (15; 21.7%) were the most studied specialties. To classify the different domains being evaluated in physician competence assessments, we used the six ACGME core competencies (patient care, medical knowledge, practice-based learning and improvement, interpersonal and communication skills, professionalism, and systems-based practice), as well as the outcomes-based milestone subcompetencies developed by each medical specialty group88 (see Figure 3). The numbers of studies related to each competence domain were as follows: patient care (63; 91.3%), medical knowledge (45; 65.2%), interpersonal and communication skills (6; 8.7%), professionalism (5; 7.2%), practice-based learning and improvement (2; 2.9%), and systems-based practice (1; 1.4%).
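As a quick arithmetic check, the domain percentages above can be recomputed from the reported counts and the 69 included studies. The sketch below uses only figures stated in the text; the variable names are ours.

```python
TOTAL_STUDIES = 69

# Counts of included studies per ACGME competence domain, as reported above.
domain_counts = {
    "patient care": 63,
    "medical knowledge": 45,
    "interpersonal and communication skills": 6,
    "professionalism": 5,
    "practice-based learning and improvement": 2,
    "systems-based practice": 1,
}


def percent(count, total=TOTAL_STUDIES):
    """Share of included studies, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {domain: percent(n) for domain, n in domain_counts.items()}
for domain, share in shares.items():
    print(f"{domain}: {share}%")

# The counts add up to more than 69 because a study can address several domains.
print("domain assignments in total:", sum(domain_counts.values()))
```

Because the counts sum to more than the number of studies, the percentages are shares of studies addressing each domain and are not expected to add up to 100%.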

Figure 3: The six Accreditation Council for Graduate Medical Education core competencies and their corresponding milestones subcompetencies (developed by medical specialty groups)88 evaluated in the included studies of a systematic review of the literature on using machine learning techniques to assess physician competence, May 2017. The width of each specialty, competency, and subcompetency bar represents the relative number of included studies that reviewed these. Abbreviations: PC indicates patient care; MK, medical knowledge; ICS, interpersonal and communication skills; PROF, professionalism; PBLI, practice-based learning and improvement; SBP, systems-based practice; OR, operating room.

Machine learning techniques

A wide variety of machine learning techniques12,89 were used to assess physician competence in the included studies (Table 1). Natural language processing (24; 34.8%), support vector machines (15; 21.7%), and hidden Markov models (14; 20.3%) were the machine learning techniques most often applied. Drawing on the included studies, we created a conceptual diagram (see Supplemental Digital Appendix 2) to illustrate how different machine learning techniques can be integrated with current competence assessment methods.

Table 1: Machine Learning Techniques Used to Assess Physician Competence, From a Systematic Review of the Literature on Using Machine Learning Techniques to Assess Physician Competence, May 2017

Study quality

The mean MERSQI score for all included studies was 13.1 (standard deviation = 0.8), with individual studies’ mean scores ranging from 11.5 to 15.5. Although several of the included studies established construct validity (e.g., expertise classification) and concurrent validity (e.g., correlation with gold standards, such as objective structured assessment of technical skills [OSATS]), no study evaluated predictive validity as related to patient outcomes. Further, only one study investigated the correlation between physicians’ performance as assessed by natural language processing and patient experience (patient-reported communication quality).20


Discussion

To automate physician competence assessment, a growing number of studies have incorporated machine learning techniques within medical education. The rapid growth of interest in and application of machine learning techniques to physician competence assessment is evidenced by our finding that more than 50% of the included studies were published in the past six years.

In the following sections, we evaluate the evidence on using machine learning techniques for competence assessment, structured according to the six ACGME core competencies.

Patient care

The majority of included studies (63; 91.3%) sought to assess competencies related to the patient care domain. This domain involves “[providing] patient-centered care that is compassionate, appropriate, and effective for the treatment of health problems and the promotion of health.”90 Most of the assessed subcompetencies in this domain were related to technical steps in the performance of surgical procedures and radiological interpretations. These assessments mainly involved surgical motion-tracking technologies (e.g., infrared cameras and wearable devices). In the studies that used these motion-tracking technologies, various automated dexterity metrics, such as economy of movement and acceleration, were assessed and showed substantial correlation with gold standard observation-based technical skills tools (e.g., OSATS and fundamentals of laparoscopic surgery [FLS]),21,23,59 thus providing construct and concurrent validity for the automated metrics. These studies represent examples of supervised machine learning, in which input data (e.g., time to completion, path length, control effort) predict output data (numerical ratings from human evaluators) using regression analysis. In two studies, the same input data were also used to classify surgeons’ expertise (e.g., novices vs. seniors), achieving high classification accuracy.22,57 Although the studies assessing the patient care domain used similar machine learning techniques, they varied widely in how the input data were extracted and how training or validation datasets were built.
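To make the idea of an automated dexterity metric concrete, the sketch below computes path length from tracked instrument positions and correlates it with expert ratings. It is a minimal illustration under stated assumptions: the trajectories and ratings are hypothetical, Pearson correlation is our choice of statistic, and none of the function names come from the reviewed studies.

```python
import math


def path_length(positions):
    """Total distance travelled by a tracked point (same unit as the inputs)."""
    return sum(math.dist(p, q) for p, q in zip(positions, positions[1:]))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical tracked positions (cm) for three trials of the same task.
trials = [
    [(0, 0, 0), (3, 4, 0), (3, 4, 12)],
    [(0, 0, 0), (6, 8, 0), (6, 8, 24)],
    [(0, 0, 0), (9, 12, 0), (9, 12, 36)],
]
# Assumed ratings from a human evaluator for the same three trials.
expert_rating = [30.0, 20.0, 10.0]

lengths = [path_length(t) for t in trials]
r = pearson(lengths, expert_rating)
print("path lengths:", lengths)
print("correlation with expert ratings:", round(r, 3))
```

In this toy data a longer path goes with a lower rating, so the correlation is strongly negative; a regression model of the kind described above would use such metrics as inputs and the human ratings as outputs.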

Among the patient care competency studies that were conducted in radiology settings, text-mining techniques (e.g., natural language processing) were applied and demonstrated substantial accuracy of automated systems, as compared with pathology reports, providing predictive validity evidence of automated algorithms for competency assessment.26,27,38 Additionally, these studies suggest that automated systems have the potential to generate information that could be used for real-time feedback during radiologic report generation, allowing diagnosis corrections or additional discussion with more expert radiologists before releasing the final report. Other relevant applications of automated systems that were identified in the included studies related to (1) obtaining medical history and synthesizing essential information accurately using natural language processing tools and (2) extracting information from electronic health records to assess the relevance of clinical notes in internal medicine and geriatrics.25,73 In addition, four studies extracted established quality measures in colonoscopy (e.g., adenoma detection rate, preparation quality, indication), finding that natural language processing extracted these quality metrics more accurately than human raters did.70,77,78,82

The studies in this review that used machine learning for patient care assessment largely focused on the technical aspect of performance; humanistic and compassionate patient care was not addressed in the reviewed studies. Because there is a slow but progressive interest in integrating these relevant skills and attitudes into competency-based medical education, future studies investigating the application of machine learning to physician assessment should also attempt to develop reliable metrics for assessing humanism and other relational aspects of patient care.

Medical knowledge

The medical knowledge domain is related to “[demonstrating] knowledge of established and evolving biomedical, clinical, epidemiological and social-behavioral sciences, as well as the application of this knowledge to patient care.”90 Forty-five (65.2%) included studies assessed medical knowledge subcompetencies, including interpretation of examinations, diagnostic knowledge, basic science and clinical knowledge, performance of operations and procedures, and care for diseases and conditions. Natural language processing was the most commonly used approach to attempt assessment of medical knowledge, with information extracted from clinical notes, diagnostic reports, verbal communications, and written responses.20,25,26,31,35,65,70–87 Additionally, one study sought to automate medical knowledge assessment in qualifying examinations involving medical students.81

This domain offers broad applicability for machine learning because accurate tools for speech recognition and free-text analysis have been developed rapidly in the last few years, driven by advanced techniques such as convolutional neural networks and deep learning.91 Most of the machine learning techniques used in this context attempted to extract patterns from nonstructured data and assess their relationship with an expert-based assessment. Chen et al,25 for example, applied natural language processing to classify medical students’ clinical notes as low, medium, high, or no relevance (output data), using AAMC geriatric competencies. This system presented moderate to high accuracy in detecting the relevance of clinical notes, as compared with human evaluations. This technology, therefore, has the potential to automate some of the processes used to assess medical students’ knowledge and could be translated and scaled to other areas of knowledge elicitation and classification.

Interpersonal and communication skills

In the interpersonal and communication skills domain, physicians are expected to “demonstrate interpersonal and communication skills that result in the effective exchange of information and collaboration with patients, their families, and health professionals.”90 A total of 6 (8.7%) included studies used machine learning techniques (mainly natural language processing) to assess competence in this domain. In 2 of these studies, an online platform was developed to automatically provide feedback to medical students regarding their communication skills performance during clinical conversations with standardized patients.68,69 These studies used face recognition, gesture tracking, and speech recognition to extract objective metrics of verbal and nonverbal communication. The automated metrics were validated against human assessments of communication skill. In a third study, an automated system was able to code and assess patient–provider conversations by measuring a speaker’s information-giving ratio, defined as the speaker’s balance between giving and requesting information.20 The authors found that automated metrics were substantially correlated with human assessments. It is important to point out that most of the studies assessing this domain investigated the feasibility and face validity of automated systems; thus, more research is needed to establish predictive validity related to patient-centered outcomes.

Many other emergent machine learning techniques, such as facial or expression recognition; speech recognition (e.g., sentiment analysis); and gaze, gesture, or pose tracking have also been used in fields outside medicine.92–94 These techniques attempt to provide objective measures of a wide variety of behaviors and emotions (e.g., nonverbal cues, attention, engagement, empathy, stress, and frustration) and should be studied further for their use in assessing physicians’ interpersonal, communication, and nontechnical skills, such as leadership, situational awareness, and teamwork.95
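The information-giving ratio mentioned above can be illustrated with a small sketch. The text defines it only as the balance between giving and requesting information, so the formula used here (information-giving utterances divided by giving plus requesting utterances) is an assumption, as are the coded transcript and the code labels.

```python
from collections import Counter

# Hypothetical coded transcript: (speaker, code) pairs, where each utterance
# has already been labeled as giving or requesting information.
conversation = [
    ("physician", "request"),
    ("patient", "give"),
    ("physician", "give"),
    ("physician", "give"),
    ("patient", "request"),
    ("physician", "give"),
]


def information_giving_ratio(utterances, speaker):
    """Assumed formula: give / (give + request) for one speaker.

    Returns None when the speaker has no coded utterances.
    """
    counts = Counter(code for who, code in utterances if who == speaker)
    total = counts["give"] + counts["request"]
    return counts["give"] / total if total else None


print("physician:", information_giving_ratio(conversation, "physician"))
print("patient:", information_giving_ratio(conversation, "patient"))
```

In an automated system, the coding step itself would be performed by a trained classifier; the sketch starts from already-coded utterances to keep the metric separate from the labeling.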


Professionalism

Only 5 (7.2%) included studies sought to assess the professionalism domain, which involves “[demonstrating] a commitment to carrying out professional responsibilities and an adherence to ethical principles.”90 Primary data sources for machine learning techniques in this domain were usually surveys, self-assessment, and patient–doctor conversations (audio and transcribed data). For instance, a study involving physicians from multiple National Health Service sites in England and Wales collected multisource feedback in the form of open-text data related to professional performance. Using machine learning techniques (including support vector machines and natural language processing), the authors found high interrater agreement between the algorithms and the human coders in the professionalism domain.35 Because this type of assessment using open-text feedback has been implemented broadly in the United Kingdom, an automated system providing reliable measures of professionalism may reduce costs while also providing a standardized methodology for the assessment of competence in this domain.
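Agreement between an algorithm and a human coder is often summarized with a chance-corrected statistic. The sketch below uses Cohen's kappa as one common option; the text does not state which agreement statistic the cited study used, so this choice is an assumption, and the labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same, non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both raters pick the same class at random.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes for ten open-text comments: does the comment concern
# professionalism ("prof") or something else ("other")?
human = ["prof", "prof", "other", "other", "prof", "other", "prof", "prof", "other", "other"]
algorithm = ["prof", "prof", "other", "other", "prof", "other", "prof", "other", "other", "prof"]

kappa = cohen_kappa(human, algorithm)
print("kappa:", round(kappa, 2))
```

Here the two coders agree on 8 of 10 comments, but half of that agreement would be expected by chance, which is why kappa is lower than the raw agreement rate.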

Practice-based learning and improvement

Only 2 (2.9%) included studies addressed whether physicians “[demonstrated] the ability to investigate and evaluate [their] care of patients, to appraise and assimilate scientific evidence, and to continuously improve patient care based on constant self-evaluation and life-long learning.”90 One of these studies focused on peer assessment using an open-text survey, applying data-mining techniques such as natural language processing to extract relevant themes associated with physician performance.35 The second study identified patterns of clinical communication among interdisciplinary teams during handoffs.19 The authors assessed the teams’ shared mental models by measuring content overlap of several handoffs and established clusters of different types of clinical content (patient presentation, assessment, plan, and professional environment). Future automated systems similar to these (e.g., speech recognition and natural language processing) could be used to assess interprofessional communications and handoffs, which are known to relate to patient outcomes. A real-time system could be used to assess risk of adverse events and complications related to team communication and to identify avoidable safety events early.
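One simple way to quantify content overlap between handoffs is a set-similarity measure. The sketch below uses the Jaccard index over sets of content terms; this is an assumed stand-in, since the text does not specify how the cited study computed overlap, and the handoff contents are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: shared items divided by all distinct items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_overlap(handoffs):
    """Average Jaccard index over every pair of handoffs."""
    pairs = list(combinations(handoffs, 2))
    if not pairs:
        raise ValueError("at least two handoffs are needed")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical content terms mentioned by three team members about one patient.
handoffs = [
    {"sepsis", "ventilator", "plan"},
    {"sepsis", "plan", "family"},
    {"sepsis", "ventilator", "plan", "family"},
]

overlap = mean_pairwise_overlap(handoffs)
print("mean pairwise overlap:", round(overlap, 3))
```

A higher mean overlap would suggest that team members mention more of the same content, which is one possible proxy for a shared mental model.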

Systems-based practice

In competency-based medical education, the systems-based practice domain involves “[demonstrating] an awareness of and responsiveness to the larger context and system of health care, as well as the ability to call effectively on other resources in the system to provide optimal health care.”90 Few researchers have addressed the development of tools for assessing this domain: only 1 (1.4%) study in our review examined it, assessing tasks related to interprofessionalism and transitions of care using categorical cluster analysis.19 Methods used for systems-based practice competence assessment should capture the complexity inherent to this domain by transcending the individual level and moving toward a systemic approach.96 Machine learning techniques have the advantage of being able to capture complex behavior patterns that humans would be unable to observe alone and may therefore be well suited to assessing systems-based practice in the future.97,98

Limitations and future directions

Machine learning has been used for physician assessment in several competence domains and medical specialties and, in some cases, provides metrics that are significantly correlated with gold standard competence assessments (e.g., OSATS and FLS in surgery).21,23,59 Nevertheless, as with all novel technology, there are important limitations that should be considered. The wide variety of existing machine learning techniques, in addition to small sample sizes, results in significant challenges for reproducibility and generalizability. Although machine learning techniques can generate accurate predictive models, the relationship between variables and model functioning is not always explained. This limitation, known as “the black-box problem,” has been highlighted not only in the medical field but also in various other settings in which machine learning is applied99,100 and should be considered when attempting to understand the relationship between predictive factors (e.g., personal characteristics and training level) and physician performance metrics. This is especially relevant when establishing competence evaluation standards for assessments that have high stakes, such as eligibility to practice. Another related challenge is the translation of these automated metrics into a competency-based framework, which would not only enable the identification of substandard performance but also provide pragmatic information for remediation or intervention. Future research should also investigate the predictive validity of machine learning techniques as they relate to patient-centered outcomes. An advantage of using automated systems for assessment is the possibility of integrating the assessment measures with patient outcomes data from electronic health records, enhancing the predictive validity of these metrics.

The potential applications of machine learning in medical education are vast, and one intrinsically related to competence assessment is the ability of these novel tools to assess performance in real time, enabling physicians and other health professionals to receive immediate corrective feedback.


Conclusions

In the past decade, a near-exponential increase in the use of artificial intelligence and machine learning techniques has been observed in medical education. Concurrent with the shift toward competency-based medical education, several studies have attempted to apply machine learning techniques to assess physician competence in different medical specialties. Despite the rapid growth of research applying machine learning to competence assessment, most of the studies included in the present systematic review were at early stages. As such, they describe the feasibility of such technology yet lack rigorous validity evidence. In the included studies, surgery and radiology were the main medical specialties to apply these novel approaches, and patient care and medical knowledge were the most commonly assessed competence domains, with relatively few studies attempting to develop automated systems to assess the other domains (professionalism, practice-based learning and improvement, systems-based practice, and interpersonal and communication skills). This technology may have the potential to integrate and analyze pragmatic information that could be used in real-time assessments and interventions. To further advance its application in medical education, the next generation of machine learning research should use robust methodological approaches to test and demonstrate the validity of the proposed tools, examine the use of these tools in other specialties, and study their use for assessing the other competence domains. It will also be important to educate physicians about the potential for these tools to augment care, support training, and enhance patient safety to facilitate widespread research and future adoption.


References

1. Frank JR, Snell LS, Cate OT, et al. Competency-based medical education: Theory to practice. Med Teach. 2010;32:638–645.
2. Hawkins RE, Welcher CM, Holmboe ES, et al. Implementation of competency-based medical education: Are we addressing the concerns and challenges? Med Educ. 2015;49:1086–1102.
3. Nasca TJ, Philibert I, Brigham T, Flynn TC. The next GME accreditation system—Rationale and benefits. N Engl J Med. 2012;366:1051–1056.
4. Englander R, Carraccio C. From theory to practice: Making entrustable professional activities come to life in the context of milestones. Acad Med. 2014;89:1321–1323.
5. Landon BE, Normand SL, Blumenthal D, Daley J. Physician clinical performance assessment: Prospects and barriers. JAMA. 2003;290:1183–1189.
6. Epstein RM, Hundert EM. Defining and assessing professional competence. JAMA. 2002;287:226–235.
7. Accreditation Council for Graduate Medical Education. Data Resource Book: Academic Year 2015–2016. 2016. Chicago, IL: Accreditation Council for Graduate Medical Education; Accessed July 19, 2018.
8. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316:2402–2410.
9. Svensson CM, Hübler R, Figge MT. Automated classification of circulating tumor cells and the impact of interobsever variability on classifier training and performance. J Immunol Res. 2015;2015:573165.
10. Rutherford DN, D’Angelo AL, Law KE, Pugh CM. Advanced engineering technology for measuring performance. Surg Clin North Am. 2015;95:813–826.
11. Oropesa I, Sánchez-González P, Lamata P, et al. Methods and tools for objective assessment of psychomotor skills in laparoscopic surgery. J Surg Res. 2011;171:e81–e95.
12. Kelleher JD, Namee BM, D’Arcy A. Fundamentals of Machine Learning for Predictive Data Analytics. 2015. Cambridge, MA: MIT Press.
13. Mitchell TM. Machine Learning. 1997. Boston, MA: McGraw Hill.
14. Obermeyer Z, Emanuel EJ. Predicting the future—Big data, machine learning, and clinical medicine. N Engl J Med. 2016;375:1216–1219.
15. Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. BMJ. 2009;339:b2535.
16. Shamseer L, Moher D, Clarke M, et al.; PRISMA-P Group. Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015: Elaboration and explanation. BMJ. 2015;350:g7647.
17. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42:377–381.
18. Reed DA, Cook DA, Beckman TJ, Levine RB, Kern DE, Wright SM. Association between funding and quality of published medical education research. JAMA. 2007;298:1002–1009.
19. Mamykina L, Jiang S, Collins SA, et al. Revealing structures in narratives: A mixed-methods approach to studying interdisciplinary handoff in critical care. J Biomed Inform. 2016;62:117–124.
20. Mayfield E, Laws MB, Wilson IB, Penstein Rosé C. Automating annotation of information-giving for analysis of clinical conversation. J Am Med Inform Assoc. 2014;21(e1):e122–e128.
21. Ahmidi N, Hager GD, Ishii L, Gallia GL, Ishii M. Robotic path planning for surgeon skill evaluation in minimally-invasive sinus surgery. Med Image Comput Comput Assist Interv. 2012;15(pt 1):471–478.
22. Allen B, Nistor V, Dutson E, Carman G, Lewis C, Faloutsos P. Support vector machines improve the accuracy of evaluation for the performance of laparoscopic training tasks. Surg Endosc. 2010;24:170–178.
23. Andreu-Perez J, Leff DR, Shetty K, Darzi A, Yang GZ. Disparity in frontal lobe connectivity on a complex bimanual motor task aids in classification of operator skill level. Brain Connect. 2016;6:375–388.
24. Brown JD, O’Brien CE, Leung SC, Dumon KR, Lee DI, Kuchenbecker KJ. Using contact forces and robot arm accelerations to automatically rate surgeon skill at peg transfer. IEEE Trans Biomed Eng. 2017;64:2263–2275.
25. Chen Y, Wrenn J, Xu H, et al. Automated assessment of medical students’ clinical exposures according to AAMC geriatric competencies. AMIA Annu Symp Proc. 2014;2014:375–384.
26. Garla V, Taylor C, Brandt C. Semi-supervised clinical text classification with Laplacian SVMs: An application to cancer case management. J Biomed Inform. 2013;46:869–875.
27. Kaya A, Can AB. A weighted rule based method for predicting malignancy of pulmonary nodules by nodule characteristics. J Biomed Inform. 2015;56:69–79.
28. Kumar R, Jog A, Malpani A, et al. Assessing system operation skills in robotic surgery trainees. Int J Med Robot. 2012;8:118–124.
29. Kumar R, Jog A, Vagvolgyi B, et al. Objective measures for longitudinal assessment of robotic surgery training. J Thorac Cardiovasc Surg. 2012;143:528–534.
30. Oropesa I, Sánchez-González P, Chmarra MK, et al. Supervised classification of psychomotor competence in minimally invasive surgery based on instruments motion analysis. Surg Endosc. 2014;28:657–670.
31. Pham AD, Névéol A, Lavergne T, et al. Natural language processing of radiology reports for the detection of thromboembolic diseases and clinically relevant incidental findings. BMC Bioinform. 2014;15:266.
32. Rafii-Tari H, Payne CJ, Bicknell C, et al. Objective assessment of endovascular navigation skills with force sensing. Ann Biomed Eng. 2017;45:1315–1327.
33. Watson RA. Use of a machine learning algorithm to classify expertise: Analysis of hand motion patterns during a simulated surgical task. Acad Med. 2014;89:1163–1167.
34. Zhu J, Luo J, Soh JM, Khalifa Y. A computer vision-based approach to grade simulated cataract surgeries. Mach Vis Appl. 2015;26:115–125.
35. Gibbons C, Richards S, Valderas JM, Campbell J. Supervised machine learning algorithms can classify open-text feedback of doctor performance with human-level accuracy. J Med Internet Res. 2017;19:e65.
36. Ashizawa K, MacMahon H, Ishida T, et al. Effect of an artificial neural network on radiologists’ performance in the differential diagnosis of interstitial lung disease using chest radiographs. AJR Am J Roentgenol. 1999;172:1311–1315.
37. Ikeda M, Ito S, Ishigaki T, Yamauchi K. Evaluation of a neural network classifier for pancreatic masses based on CT findings. Comput Med Imaging Graph. 1997;21:175–183.
38. Jesneck JL, Lo JY, Baker JA. Breast mass lesions: Computer-aided diagnosis models with mammographic and sonographic descriptors. Radiology. 2007;244:390–398.
39. Li G, Kim H, Tan JK, et al. Semantic characteristics prediction of pulmonary nodule using artificial neural networks. Conf Proc IEEE Eng Med Biol Soc. 2013;2013:5465–5468.
40. Matsuki Y, Nakamura K, Watanabe H, et al. Usefulness of an artificial neural network for differentiating benign from malignant pulmonary nodules on high-resolution CT: Evaluation with receiver operating characteristic analysis. AJR Am J Roentgenol. 2002;178:657–663.
41. Nakamura K, Yoshida H, Engelmann R, et al. Computerized analysis of the likelihood of malignancy in solitary pulmonary nodules with use of artificial neural networks. Radiology. 2000;214:823–830.
42. Richstone L, Schwartz MJ, Seideman C, Cadeddu J, Marshall S, Kavoussi LR. Eye metrics as an objective assessment of surgical skill. Ann Surg. 2010;252:177–182.
43. Stevens RH, Najafi K. Artificial neural networks as adjuncts for assessing medical students’ problem solving performances on computer-based simulations. Comput Biomed Res. 1993;26:172–187.
44. Stevens RH, Lopo AC, Wang P. Artificial neural networks can distinguish novice and expert strategies during complex problem solving. J Am Med Inform Assoc. 1996;3:131–138.
45. Yost MJ, Gardner J, Bell RM, et al.; TriMetrix and Success Research Group. Predicting academic performance in surgical training. J Surg Educ. 2015;72:491–499.
46. Chmarra MK, Klein S, de Winter JC, Jansen FW, Dankelman J. Objective classification of residents based on their psychomotor laparoscopic skills. Surg Endosc. 2010;24:1031–1039.
47. Lin HC, Shafran I, Yuh D, Hager GD. Towards automatic skill evaluation: Detection and segmentation of robot-assisted surgical motions. Comput Aided Surg. 2006;11:220–230.
48. Varadarajan B, Reiley C, Lin H, Khudanpur S, Hager G. Data-derived models for segmentation with application to surgical assessment and training. Med Image Comput Comput Assist Interv. 2009;12(pt 1):426–434.
49. Macke T, Rosen J, Pugh C. Data mining of the E-pelvis simulator database: A quest for a generalized algorithm for objectively assessing medical skill. Stud Health Technol Inform. 2006;119:355–360.
50. Sewell C, Morris D, Blevins NH, et al. Providing metrics and performance feedback in a surgical simulator. Comput Aided Surg. 2008;13:63–81.
51. Venkatesh SS, Levenback BJ, Sultan LR, Bouzghar G, Sehgal CM. Going beyond a first reader: A machine learning methodology for optimizing cost and performance in breast ultrasound diagnosis. Ultrasound Med Biol. 2015;41:3148–3162.
52. Forestier G, Lalys F, Riffaud L, Trelhu B, Jannin P. Classification of surgical processes using dynamic time warping. J Biomed Inform. 2012;45:255–264.
53. Rosen J, Brown JD, Chang L, Sinanan MN, Hannaford B. Generalized approach for modeling minimally invasive surgery as a stochastic process using a discrete Markov model. IEEE Trans Biomed Eng. 2006;53:399–413.
54. Zia A, Sharma Y, Bettadapura V, et al. Automated video-based assessment of surgical skills for training and evaluation in medical schools. Int J Comput Assist Radiol Surg. 2016;11:1623–1636.
55. D’Negri CE, De Vito EL. Making it possible to measure knowledge, experience and intuition in diagnosing lung injury severity: A fuzzy logic vision based on the Murray score. BMC Med Inform Decis Mak. 2010;10:70.
56. Huang J, Payandeh S, Doris P, Hajshirmohammadi I. Fuzzy classification: Towards evaluating performance on a surgical simulator. Stud Health Technol Inform. 2005;111:194–200.
57. Ahmidi N, Hager GD, Ishii L, Fichtinger G, Gallia GL, Ishii M. Surgical task and skill classification from eye tracking and tool motion in minimally invasive surgery. Med Image Comput Comput Assist Interv. 2010;13(pt 3):295–302.
58. Cavallo F, Sinigaglia S, Megali G, et al. Biomechanics-machine learning system for surgical gesture analysis and development of technologies for minimal access surgery. Surg Innov. 2014;21:504–512.
59. Leong JJ, Nicolaou M, Atallah L, Mylonas GP, Darzi AW, Yang GZ. HMM assessment of quality of movement trajectory in laparoscopic surgery. Comput Aided Surg. 2007;12:335–346.
60. Megali G, Sinigaglia S, Tonet O, Dario P. Modelling and evaluation of surgical performance using hidden Markov models. IEEE Trans Biomed Eng. 2006;53:1911–1919.
61. Reiley CE, Hager GD. Task versus subtask surgical skill evaluation of robotic minimally invasive surgery. Med Image Comput Comput Assist Interv. 2009;12(pt 1):435–442.
62. Reiley CE, Plaku E, Hager GD. Motion generation of robotic surgical tasks: Learning from expert demonstrations. Conf Proc IEEE Eng Med Biol Soc. 2010;2010:967–970.
63. Rosen J, Hannaford B, Richards CG, Sinanan MN. Markov modeling of minimally invasive surgery based on tool/tissue interaction and force/torque signatures for evaluating surgical skills. IEEE Trans Biomed Eng. 2001;48:579–591.
64. Rosen J, Solazzo M, Hannaford B, Sinanan M. Objective laparoscopic skills assessments of surgical residents using hidden Markov models based on haptic information and tool/tissue interactions. Stud Health Technol Inform. 2001;81:417–423.
65. Guo X, Yu Q, Alm CO, et al. From spoken narratives to domain knowledge: Mining linguistic data for medical image understanding. Artif Intell Med. 2014;62:79–90.
66. Mazomenos EB, Chang PL, Rippel RA, et al. Catheter manipulation analysis for objective performance and technical skills assessment in transcatheter aortic valve implantation. Int J Comput Assist Radiol Surg. 2016;11:1121–1131.
67. Forestier G, Lalys F, Riffaud L, et al. Multi-site study of surgical practice in neurosurgery based on surgical process models. J Biomed Inform. 2013;46:822–829.
68. Liu C, Lim RL, McCabe KL, Taylor S, Calvo RA. A web-based telehealth training platform incorporating automated nonverbal behavior feedback for teaching communication skills to medical students: A randomized crossover study. J Med Internet Res. 2016;18:e246.
69. Liu C, Scott KM, Lim RL, Taylor S, Calvo RA. EQClinic: A platform for learning communication skills in clinical consultations. Med Educ Online. 2016;21:31801.
70. Abdul-Baki H, Schoen RE, Dean K, et al. Public reporting of colonoscopy quality is associated with an increase in endoscopist adenoma detection rate. Gastrointest Endosc. 2015;82:676–682.
71. Castro SM, Tseytlin E, Medvedeva O, et al. Automated annotation and classification of BI-RADS assessment from radiology reports. J Biomed Inform. 2017;69:177–187.
72. Crowley RS, Tseytlin E, Jukic D. ReportTutor—An intelligent tutoring system that uses a natural language interface. AMIA Annu Symp Proc. 2005;2005:171–175.
73. Denny JC, Bastarache L, Sastre EA, Spickard A 3rd. Tracking medical students’ clinical experiences using natural language processing. J Biomed Inform. 2009;42:781–789.
74. Denny JC, Spickard A 3rd, Speltz PJ, Porier R, Rosenstiel DE, Powers JS. Using natural language processing to provide personalized learning opportunities from trainee clinical notes. J Biomed Inform. 2015;56:292–299.
75. Dutta S, Long WJ, Brown DF, Reisner AT. Automated detection using natural language processing of radiologists recommendations for additional imaging of incidental findings. Ann Emerg Med. 2013;62:162–169.
76. El Saadawi GM, Tseytlin E, Legowski E, et al. A natural language intelligent tutoring system for training pathologists: Implementation and evaluation. Adv Health Sci Educ Theory Pract. 2008;13:709–722.
77. Gawron AJ, Thompson WK, Keswani RN, Rasmussen LV, Kho AN. Anatomic and advanced adenoma detection rates as quality metrics determined via natural language processing. Am J Gastroenterol. 2014;109:1844–1849.
78. Harkema H, Chapman WW, Saul M, Dellon ES, Schoen RE, Mehrotra A. Developing a natural language processing application for measuring the quality of colonoscopy procedures. J Am Med Inform Assoc. 2011;18(suppl 1):i150–i156.
79. Hazlehurst B, Sittig DF, Stevens VJ, et al. Natural language processing in the electronic medical record: Assessing clinician adherence to tobacco treatment guidelines. Am J Prev Med. 2005;29:434–439.
80. Itri JN, Kim W, Scanlon MH. Orion: A web-based application designed to monitor resident and fellow performance on-call. J Digit Imaging. 2011;24:897–907.
81. Latifi S, Gierl MJ, Boulais AP, De Champlain AF. Using automated scoring to evaluate written responses in English and French on a high-stakes clinical competency examination. Eval Health Prof. 2016;39:100–113.
82. Mehrotra A, Dellon ES, Schoen RE, et al. Applying a natural language processing tool to electronic health records to assess performance on colonoscopy quality measures. Gastrointest Endosc. 2012;75(6):1233–1239.e14.
83. Oliven A, Nave R, Gilad D, Barch A. Implementation of a web-based interactive virtual patient case simulation as a training and assessment tool for medical students. Stud Health Technol Inform. 2011;169:233–237.
84. Pakhomov S, Bjornsen S, Hanson P, Smith S. Quality performance measurement using the text of electronic medical records. Med Decis Making. 2008;28:462–470.
85. Payne VL, Medvedeva O, Legowski E, et al. Effect of a limited-enforcement intelligent tutoring system in dermatopathology on student errors, goals and solution paths. Artif Intell Med. 2009;47:175–197.
86. Sevenster M, Qian Y, Abe H, Buurman J. Cross-sectional relatedness between sentences in breast radiology reports: Development of an SVM classifier and evaluation against annotations of five breast radiologists. J Digit Imaging. 2013;26:977–988.
87. Spickard A 3rd, Ridinger H, Wrenn J, et al. Automatic scoring of medical students’ clinical notes to monitor learning in the workplace. Med Teach. 2014;36:68–72.
88. Hamstra SJ, Edgar L, Yamazaki K, Holmboe ES. Milestones: Annual Report 2016. Chicago, IL: Accreditation Council for Graduate Medical Education; 2016. Accessed July 20, 2018.
89. Foster I, Ghani R, Jarmin RS, Kreuter F, Lane J. Big Data and Social Science: A Practical Guide to Methods and Tools. Boca Raton, FL: Chapman and Hall; 2017.
90. Englander R, Cameron T, Ballard AJ, Dodge J, Bull J, Aschenbrener CA. Toward a common taxonomy of competency domains for the health professions and competencies for physicians. Acad Med. 2013;88:1088–1094.
91. Sainath TN, Kingsbury B, Saon G, et al. Deep convolutional neural networks for large-scale speech tasks. Neural Netw. 2015;64:39–48.
92. Harley JM. Measuring emotions: A survey of cutting-edge methodologies used in computer-based learning environment research. In: Tettegah SY, Gartmeier M, eds. Emotions, Technology, Design, and Learning. London, UK: Academic Press; 2015:89–114.
93. Pelachaud C. Modelling multimodal expression of emotion in a virtual agent. Philos Trans R Soc Lond B Biol Sci. 2009;364:3539–3548.
94. Pantic M. Machine analysis of facial behaviour: Naturalistic and dynamic behaviour. Philos Trans R Soc Lond B Biol Sci. 2009;364:3505–3513.
95. Yule S, Flin R, Maran N, Rowley D, Youngson G, Paterson-Brown S. Surgeons’ non-technical skills in the operating room: Reliability testing of the NOTSS behavior rating system. World J Surg. 2008;32:548–556.
96. Dekker SW, Leveson NG. The systems approach to medicine: Controversy and misconceptions. BMJ Qual Saf. 2015;24:7–9.
97. Van Esbroeck A, Rubinfeld I, Hall B, Syed Z. Quantifying surgical complexity with machine learning: Looking beyond patient factors to improve surgical models. Surgery. 2014;156:1097–1105.
98. Sakhre V, Jain S, Sapkal VS, Agarwal DP. Fuzzy counter propagation neural network control for a class of nonlinear dynamical systems. Comput Intell Neurosci. 2015;2015:719620.
99. Castelvecchi D. Can we open the black box of AI? Nature. 2016;538:20–23.
100. Cabitza F, Rasoini R, Gensini GF. Unintended consequences of machine learning in medicine. JAMA. 2017;318:517–518.
Appendix 1
Included Studies (N = 69) in a Systematic Review of the Literature on Using Machine Learning Techniques to Assess Physician Competence, May 2017

Copyright © 2018 by the Association of American Medical Colleges