We thank the authors for expressing their concerns regarding potential bias in the application of machine learning (ML) to the scoring of United States Medical Licensing Examination Step 2 Clinical Skills (CS) patient notes (PN), and we appreciate this opportunity to address them ahead of an upcoming technical report submitted for publication. We have worked with experts in educational measurement and medical informatics to minimize threats to validity, including implicit bias. Rather than using a black-box ML approach, we incorporated transparency in the system design, eliminated the use of individual human ratings in ML training sets, and built in fail-safe steps to ensure accuracy of pass/fail classifications. In the cited example of bias in the article “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings,”1 word representations were learned from news articles, which reflect the centuries-old association of women with household activities. The language model trained on such data therefore captured those associations. By contrast, our training sets are compiled from cross-validated judgments of multiple independent groups of experts (rather than individual human ratings), which allows us to monitor, detect, and minimize potential bias.
The authors’ skepticism about an ML system that is “more objective than the humans it emulates” is valid. Typically, ML and natural language processing systems are evaluated relative to a standard set by human judgment and so, theoretically, cannot be superior to human ratings. However, in practice, human raters suffer from fatigue and other challenges associated with quality control. Therefore, while an automated scoring system cannot be superior to the ideal human rater, it can be more adherent to predefined scoring rubrics than the typical human rater in practice.
We agree with the authors that producing reliable and valid scores for the Step 2 CS exam is critical. For this reason, we included a number of verification steps in our system design. Human raters review notes flagged by the system as containing unusual features, notes containing terminology that the system cannot recognize, and all notes for examinees whose performance falls near or below the level of minimum competence. In our Invited Commentary, we proposed a hybrid scoring system whereby no examinee receives a failing score without an additional independent review by human raters. We are confident that this will allow PN scoring to benefit from the strengths of ML while maintaining the psychometric rigor of the Step 2 CS exam.
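For readers who find it helpful, the review rules described above can be illustrated with a minimal sketch. It is not the operational scoring system: the class and field names, the numeric score scale, the cut score, and the width of the "near minimum competence" margin are all hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredNote:
    """Hypothetical summary of one automatically scored patient note."""
    ml_score: float               # automated score on an assumed numeric scale
    has_unusual_features: bool    # note was flagged as containing unusual features
    has_unrecognized_terms: bool  # note contains terminology the system could not recognize


def needs_human_review(note: ScoredNote, cut_score: float, margin: float) -> bool:
    """Return True if the note should be routed to human raters.

    A note is routed when it is flagged for unusual features, contains
    unrecognized terminology, or scores near or below the cut score.
    `cut_score` and `margin` are illustrative assumptions.
    """
    near_or_below_cut = note.ml_score <= cut_score + margin
    return note.has_unusual_features or note.has_unrecognized_terms or near_or_below_cut


# Illustrative values only: assumed cut score of 60 with a 5-point margin.
clear_pass = ScoredNote(ml_score=80.0, has_unusual_features=False, has_unrecognized_terms=False)
borderline = ScoredNote(ml_score=63.0, has_unusual_features=False, has_unrecognized_terms=False)
print(needs_human_review(clear_pass, cut_score=60.0, margin=5.0))
print(needs_human_review(borderline, cut_score=60.0, margin=5.0))
```

Under a rule of this shape, every score at or below the cut falls inside the review band, so a failing result could only be issued after human review, which is the property the hybrid design is meant to guarantee.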
Jessica Salt, MD, MBE
Assistant vice president of assessment, Clinical Skills Evaluation Collaboration, Philadelphia, Pennsylvania; email@example.com.
Polina Harik, PhD
Senior measurement scientist, National Board of Medical Examiners, Philadelphia, Pennsylvania.
Michael A. Barone, MD, MPH
Vice president of licensure programs, National Board of Medical Examiners, Philadelphia, Pennsylvania.
1. Bolukbasi T, Chang K-W, Zou J, Saligrama V, Kalai A. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16. Red Hook, NY: Curran Associates Inc.; 2016:4356–4364. http://dl.acm.org/citation.cfm?id=3157382.3157584. Accessed April 8, 2018.