Developing medical students’ clinical reasoning requires a structured longitudinal curriculum with frequent targeted assessment and feedback. Performance-based assessments, which have the strongest validity evidence, are currently not feasible for this purpose because they are time-intensive to score. This study explored the potential of using machine learning technologies to score one such assessment—the diagnostic justification essay.
From May to September 2018, machine scoring algorithms were trained to score a sample of 700 diagnostic justification essays written by 414 third-year medical students from the Southern Illinois University School of Medicine classes of 2012–2017. The algorithms applied semantically based natural language processing metrics (e.g., coherence, readability) to assess essay quality on 4 criteria (differential diagnosis, recognition and use of findings, workup, and thought process); the scores for these criteria were summed to create overall scores. Three sources of validity evidence (response process, internal structure, and association with other variables) were examined.
Machine scores correlated more strongly with faculty ratings than faculty ratings did with each other (machine: .28–.53, faculty: .13–.33) and were less case-specific. Machine scores and faculty ratings were similarly correlated with medical knowledge, clinical cognition, and prior diagnostic justification. Machine scores were more strongly associated with clinical communication than were faculty ratings (.43 vs .31).
Machine learning technologies may be useful for assessing medical students’ long-form written clinical reasoning. Semantically based machine scoring may capture the communicative aspects of clinical reasoning better than faculty ratings, offering the potential for automated assessment that generalizes to the workplace. These results underscore the potential of machine scoring to capture an aspect of clinical reasoning performance that is difficult to assess with traditional analytic scoring methods. Additional research should investigate machine scoring generalizability and examine its acceptability to trainees and educators.