Letters to the Editor
To the Editor:
We read with interest the Invited Commentary by Salt and colleagues1 and agree with the potential benefits of incorporating natural language processing (NLP) and machine learning (ML) to improve the validity evidence for scoring of high-stakes examinations such as the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS). However, we urge caution regarding the authors’ assertion that an NLP/ML scoring system “is superior to human judgment with respect to … objectivity … because it is not susceptible to the effects of … human biases.”1 Research on algorithmic bias has shown that algorithms do indeed carry bias. Algorithms, for example, have recommended parole for white prisoners more frequently than for black prisoners.2 Word2Vec, a landmark NLP algorithm developed at Google, completes the phrase “man is to computer programmer as woman is to _____” with “homemaker,” and “father is to doctor as mother is to _____” with “nurse.”3
Algorithms carry these biases because we train them to. ML systems attempt to fulfill a human-defined performance metric against a human-defined dataset. The authors’ NLP-based scoring system is designed to “replicate the expert judgments of physician raters assessing patient notes.”1 If the algorithm’s goal is to emulate the performance of human raters, and in training it is penalized for not producing the same output as humans, how can it simultaneously be more objective than the humans it emulates?
While emulation of human performance does not by itself ensure objectivity, it is possible to explicitly audit algorithms for bias. The degree of scrutiny an algorithm receives should be proportional to the stakes of the decision it makes, and decision-making software with the potential to profoundly affect the career of a medical student should be examined closely. Transparency of implementation is critical for such a high-stakes application, but the National Board of Medical Examiners (NBME) has not published details regarding the system’s technical function, its performance, or any investigations into its potential for bias.
It is a common misconception that a process performed by a computer is automatically more objective than its human-performed equivalent. The authors describe a system with the potential to drastically improve the consistency and efficiency of Step 2 CS scoring, but not necessarily its objectivity. Given the high-stakes nature of the examination, we believe that the NBME should ensure the objectivity of its system by proceeding with increased caution, transparency, and, if possible, regular and independent algorithmic audits.
Third-year medical student, University of Michigan Medical School, Ann Arbor, Michigan; firstname.lastname@example.org; ORCID: http://orcid.org/0000-0001-5927-1428.
Seetha U. Monrad, MD
Assistant dean for assessment, evaluation, and quality improvement and associate professor of internal medicine and learning health sciences, University of Michigan Medical School, Ann Arbor, Michigan; ORCID: http://orcid.org/0000-0002-3374-2989.
1. Salt J, Harik P, Barone MA. Leveraging natural language processing: Toward computer-assisted scoring of patient notes in the USMLE Step 2 Clinical Skills exam. Acad Med. 2019;94:314–316.
3. Bolukbasi T, Chang K-W, Zou J, Saligrama V, Kalai A. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16. Red Hook, NY: Curran Associates Inc.; 2016:4356–4364. http://dl.acm.org/citation.cfm?id=3157382.3157584. Accessed March 20, 2019.