Leveraging Natural Language Processing: Toward Computer-Assisted Scoring of Patient Notes in the USMLE Step 2 Clinical Skills Exam : Academic Medicine

Invited Commentaries

Leveraging Natural Language Processing: Toward Computer-Assisted Scoring of Patient Notes in the USMLE Step 2 Clinical Skills Exam

Salt, Jessica MD, MBE; Harik, Polina PhD; Barone, Michael A. MD, MPH

Academic Medicine 94(3):p 314-316, March 2019. | DOI: 10.1097/ACM.0000000000002558

The United States Medical Licensing Examination Step 2 Clinical Skills (CS) exam uses physician raters to evaluate patient notes written by examinees. In this Invited Commentary, the authors describe the ways in which the Step 2 CS exam could benefit from adopting a computer-assisted scoring approach that combines physician raters’ judgments with computer-generated scores based on natural language processing (NLP). Since 2003, the National Board of Medical Examiners has researched NLP technology to determine whether it offers the opportunity to mitigate challenges associated with human raters while continuing to capitalize on the judgment of physician experts. The authors discuss factors to consider before computer-assisted scoring is introduced into a high-stakes licensure exam context. They suggest that combining physician judgments and computer-assisted scoring can enhance and improve performance-based assessments in medical education and medical regulation.

Medical education, with its formative and summative assessments, and medical regulation, with its high-stakes assessments for licensure and certification, strive to accurately and reliably measure competencies relevant to practice. State medical licensing boards value the information provided by the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) exam because of its focus on measuring competencies critical to patient care, such as communication skills and clinical reasoning. In this high-stakes licensure exam, standardized patients (trained actors portraying patients with medical conditions) interact with examinees to simulate 12 patient–doctor encounters designed to assess “the ability of examinees to apply medical knowledge, skills, and understanding of clinical science essential for the provision of patient care under supervision.”1 On completion of each encounter, examinees record their findings in a postencounter patient note.

Although skill-based performance assessments are an integral part of medical education and regulation, the scoring of such assessments may be challenging. For example, when evaluating an examinee’s postencounter patient note in the Step 2 CS exam, the physician raters must consider the following difficult questions: “Did the examinee provide detailed documentation of pertinent history and physical findings?” and “Did the examinee support diagnoses with pertinent findings obtained from the history and physical examination?”1

When performance-based assessments include complex and nuanced behaviors, they require the scoring of complex and nuanced performances. Human raters are advantageous insofar as they may be highly capable of judging such performances; however, being human, they are also a source of challenges with respect to standardization and quality control.2 In assessments, human raters may fail to remain objective or may suffer from rater fatigue, which may prevent them from applying rating rules uniformly. They may score sections of a performance in a compensatory manner even though those sections were intended to be judged independently, which may lead to halo or horns effects. Human raters may also vary in their stringency, which may cause them to over- or underrate performances relative to one another or prevent them from using the entire rating scale.

These rater behaviors can make it difficult to effectively discriminate among performances, which can decrease the reliability of an assessment and affect the validity of the inferences made based on its scores. For this reason, it would be ideal if we could retain the advantages of human raters while also addressing some of the disadvantages. Recent advances in computer science and computational linguistics have led us to consider the use of technology to supplement human judgment in the assessment of skill-based performance, raising a compelling question: What if a computer-assisted scoring system using natural language processing (NLP) could be used to score the patient notes in the Step 2 CS exam?

Current Use of Physician Raters in the Step 2 CS Exam

The physicians who rate the patient notes in the Step 2 CS exam use case-specific scoring rubrics to “provide ratings on the documented summary of the findings of the patient encounter (history and physical examination), diagnostic impressions, justification of the potential diagnoses, and initial patient diagnostic studies.”1 The examinee’s patient note ratings are combined with the examinee’s physical exam performance, scored by the standardized patients using case-specific checklists, to produce an Integrated Clinical Encounter (ICE) score that is used to determine the examinee’s ICE pass/fail classification. The examinee must receive a passing ICE score to pass the Step 2 CS exam.

The physician raters are clinically active, licensed, and board certified. They undergo initial training and regular refreshers, and they are subject to quality control oversight. The use of physicians to rate patient notes ensures that examinee performances are evaluated by experts familiar with both the medical content and the expectations appropriate for medical licensure consideration. Yet, despite this advantage, physician raters are no less susceptible to some of the limitations associated with human raters more generally. This led us to embark on a program of research at the National Board of Medical Examiners (NBME) with the goal of further improving the measurement properties of the Step 2 CS patient note scores.

Development of a Tool for Computer-Assisted Scoring

NLP combines the principles of linguistics and computer science to derive meaning from written or spoken language. NLP has become ubiquitous and, to a large extent, now characterizes how humans typically interact with technology. Autocorrect, e-mail spam filters, Internet search engines, and virtual assistants (e.g., Amazon’s Alexa) all use NLP. NLP is also becoming an indispensable tool in health care and educational assessment. Since 2007, the National Library of Medicine has cataloged more than 17,000 publications that focused on applications of NLP and other related information-extraction methods in health care. In 2018 alone, more than 2,000 publications cited by Google Scholar addressed the use of NLP and machine learning in educational assessment.* In educational assessment, e-rater, a well-known NLP-based scoring engine developed by the Educational Testing Service, has been in use since 1999. It evaluates grammar, vocabulary usage, style, organization, and development of essays in the writing sections of the Test of English as a Foreign Language (TOEFL) and Graduate Record Examination (GRE) tests.3

A well-designed computer-based scoring system is superior to human judgment with respect to consistency, objectivity, and efficiency because it is not susceptible to the effects of fatigue or human biases, and it produces scores in a fraction of the time it takes human raters to assign scores. With these advantages in mind, the NBME embarked in 2003 on work with academic computational linguists on the feasibility of using NLP to extract content-specific information from clinical texts such as the patient notes of the Step 2 CS exam. In 2016, in another partnership with academia, the NBME began developing an NLP-based scoring system designed to replicate the expert judgments of physician raters assessing patient notes. This system’s detailed annotation process allows experienced patient note raters to match each essential case-specific concept with its lexical variants, synonyms, and acceptable alternatives. For example, “rotator cuff impingement” and “overuse injury of an anterior rotator cuff” might both be matched to an essential concept such as “rotator cuff injury.” Once concepts are annotated, various NLP and machine learning methods are used to train the system to recognize the presence or absence of pertinent clinical concepts within actual examinee patient notes for scoring purposes.
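The concept-matching idea described above can be illustrated with a deliberately simple sketch. It is not the NBME system, which is described as using various NLP and machine learning methods; it only shows how an annotated table of variants could be used to flag the presence or absence of an essential concept in a note. The table contents beyond the rotator cuff example, the function names, and the exact-phrase matching rule are all assumptions made for illustration, and the sketch does not handle negation ("no rotator cuff injury"), misspellings, or paraphrase.

```python
import re

# Hypothetical annotation table: each essential concept maps to the lexical
# variants, synonyms, and acceptable alternatives that raters matched to it.
CONCEPT_VARIANTS = {
    "rotator cuff injury": [
        "rotator cuff injury",
        "rotator cuff impingement",
        "overuse injury of an anterior rotator cuff",
    ],
}


def normalize(text):
    """Lowercase the text and collapse punctuation and whitespace to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def find_concepts(note, table):
    """Return {concept: True/False} for whether any variant appears in the note.

    Matching is on whole normalized phrases only; this is an illustrative
    simplification, not a trained model.
    """
    padded = f" {normalize(note)} "
    return {
        concept: any(f" {normalize(variant)} " in padded for variant in variants)
        for concept, variants in table.items()
    }


if __name__ == "__main__":
    note = "Likely rotator-cuff impingement, right shoulder."
    print(find_concepts(note, CONCEPT_VARIANTS))
```

A dictionary lookup of this kind makes the role of the annotation step concrete: the quality of the variant list, supplied by experienced raters, determines what the system can recognize.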

The NBME has developed NLP-based case-specific scoring algorithms for more than 100 clinical cases. We have compared the accuracy of the computer-generated scores from these algorithms with that of human ratings and found the two to be interchangeable for all subpopulations of examinees. For most cases, physician-rater-to-NLP-score agreement is as high as or higher than the agreement between physician raters. The level of agreement with human ratings is one criterion we are using to identify which cases could benefit from a hybrid system that combines human ratings and computer-generated scores and increases the consistency and efficiency of the scoring process.
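One way to make such an agreement comparison concrete is with a chance-corrected statistic such as Cohen's kappa. The commentary does not say which agreement measure was used, so the choice of kappa here is an assumption, and the ratings in the example are invented solely to show the calculation.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Proportion of items on which the two raters gave the same rating.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented ratings (1 = concept credited, 0 = not credited) for one case.
    physician_1 = [1, 1, 0, 1, 0, 1]
    physician_2 = [1, 0, 0, 1, 0, 1]
    nlp_scores = [1, 1, 0, 1, 0, 0]
    print("physician vs physician:", cohens_kappa(physician_1, physician_2))
    print("physician vs NLP:", cohens_kappa(physician_1, nlp_scores))
```

Comparing the physician-to-NLP value with the physician-to-physician value for each case mirrors the criterion described above: a case is a candidate for hybrid scoring when the computer agrees with a physician about as well as a second physician would.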

Considerations for Implementing Computer-Assisted Scoring

There are two main considerations we believe must be addressed before a computer-assisted scoring process can be introduced into the Step 2 CS exam. First, the crucial role of clinical judgment in the evaluation of Step 2 CS performances will need to be preserved. Second, the examinee experience must not be adversely affected.

In implementing a computer-assisted scoring process, we must ensure that the clinical expertise of trained human raters will continue to make an important contribution to Step 2 CS scoring. We will need physicians’ expert judgment on an ongoing basis for the development of patient note scoring guidelines and in the training of the computer-based scoring system. Further, all new cases will require physicians to score in parallel with the computer until we are confident in the accuracy of the computer’s ICE pass/fail classifications. Additionally, human raters will manually score examinees whose computer-scored performance falls below or near the passing standard to ensure the most accurate pass/fail outcomes for these candidates by further bolstering reliability in this critical performance band.
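The routing rule in the preceding paragraph can be sketched as a small decision function. The score scale, the passing standard, the width of the "near the standard" band, and the function and parameter names are all hypothetical; the commentary does not specify any of them.

```python
def needs_physician_rating(computer_score, passing_standard, margin, case_validated=True):
    """Decide whether a patient note should also be scored by a physician.

    A note is routed to a physician when the case is new (not yet validated
    against parallel physician scoring) or when the computer score falls
    below, or within `margin` above, the passing standard.
    All thresholds here are illustrative assumptions.
    """
    if margin < 0:
        raise ValueError("margin must be non-negative")
    if not case_validated:
        return True
    return computer_score < passing_standard + margin


if __name__ == "__main__":
    # Hypothetical 0-1 score scale with a passing standard of 0.60.
    for score in (0.40, 0.62, 0.70):
        print(score, needs_physician_rating(score, passing_standard=0.60, margin=0.05))
```

Concentrating physician effort on new cases and on the band around the passing standard is what allows the hybrid design to keep expert judgment where pass/fail outcomes are most sensitive to scoring error.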

Although we expect that neither the examination administration process nor the score reporting timeline will be affected by implementing computer-assisted scoring, we are sensitive to the importance of controlling the cost of the exam. We believe the cost of development and deployment of NLP-based technology for scoring the patient notes should not increase the exam fees for examinees. We anticipate that operational expenses associated with NLP, along with our ongoing physician rater costs, will be offset by gains in efficiency.

Looking Ahead: Benefits of Computer-Assisted Scoring

Computer-assisted scoring of patient notes has the potential to improve the accuracy of Step 2 CS scoring while preserving the vital contribution of physician raters. More than a decade of NBME research on the patient note section of the Step 2 CS exam has demonstrated that computer-assisted scoring is equal, and in some ways superior, to human rater judgment. On the basis of this research, we believe this technology can be adopted without affecting scoring time or examination fees. When it is adopted, the USMLE program will use parallel scoring by physician raters over a representative time frame and diverse examinee population to model the ICE pass/fail decisions generated by computer-assisted scoring. If these results demonstrate favorable outcomes, then the USMLE program will seek support from its governing committees, which consist of representatives of medical regulation, medical education, and the public, for the full implementation of computer-assisted scoring of the Step 2 CS exam’s patient notes.

Adopting an NLP-based system may benefit the Step 2 CS exam beyond patient note scoring. Because the system can capture and analyze examinee responses in a centralized and uniform manner, it will become feasible to monitor exam security in ways that are impractical or impossible using human raters alone. In addition, computer-assisted scoring can provide valuable research data to correlate measures within the exam with other training and practice outcomes.

Our program of research applying NLP to the problem of scoring open-ended medical text, like patient notes, suggests that this technology can enhance and improve performance-based assessments in medical education and medical regulation. Moreover, it suggests that the time is near for the USMLE program to use this technology to score patient notes in the Step 2 CS exam.

*Based on November 2018 searches of PubMed and Google Scholar.


1. United States Medical Licensing Examination. Step 2 Clinical Skills: Content description and general information. https://www.usmle.org/pdfs/step-2-cs/cs-info-manual.pdf. Updated October 2017. Accessed November 27, 2018.
2. Engelhard G Jr, Wang J, Wind SA. A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings. Psychol Test Assess Model. 2018;60:33–52.
3. Burstein J. The e-rater scoring engine: Automated essay scoring with natural language processing. In: Shermis MD, Burstein J, eds. Automated Essay Scoring: A Cross-Disciplinary Perspective. Mahwah, NJ: Lawrence Erlbaum Associates Publishers; 2003:113–121.
Copyright © 2018 by the Association of American Medical Colleges