The shift toward outcomes-based inquiry in medical education has prompted better integration of assessments with learning aims and methods. Such inquiry increasingly informs our understanding of whether and how essential competencies are attained during the course of clinical training.1 Outcomes-based inquiry in medical education also has opened up novel research questions, including the exploration of potential associations between medical learners’ performance on assessments and key markers of their developmental progression.2,3
Studies by Cullen et al4 and Cuddy et al5 appearing in this issue of Academic Medicine are examples of empirical projects that examine how performance on written assessments may be related to subsequent professional behaviors. In their hypothesis-driven study, Cullen et al found that the overall Situational Judgment Test (SJT) scores of trainees (n = 256) correlated positively with midyear and year-end professionalism ratings by their program’s clinical competence committee and were negatively associated with whether the trainees were identified as manifesting professionalism behaviors of concern. In their analysis of United States Medical Licensing Examination (USMLE) Step 3 data from 2000 to 2017 (n = 275,392), Cuddy et al found that 1-standard-deviation increases in Step 3 total scores, Step 3 computer-based simulation scores, and Step 3 multiple-choice question scores were each associated with a decrease in the odds of receiving a disciplinary action in practice, based on data from the Federation of State Medical Boards Physician Data Center.
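Findings of this kind are typically read in odds-ratio terms: exponentiating a logistic regression coefficient for a standardized score gives the multiplicative change in the odds of the outcome per 1-standard-deviation increase. A minimal sketch, using a purely hypothetical coefficient rather than any value from the study:

```python
import math

# Illustrative only: a logistic model's coefficient for a z-scored
# test score. The value -0.4 is hypothetical, not taken from Cuddy et al.
# A negative coefficient means higher scores lower the odds of the
# outcome (e.g., a disciplinary action in practice).
beta_per_sd = -0.4

odds_ratio = math.exp(beta_per_sd)   # odds multiplier per 1-SD increase
pct_change = (1 - odds_ratio) * 100  # percent decrease in odds

print(round(odds_ratio, 2), round(pct_change, 1))  # 0.67 33.0
```

Under this hypothetical coefficient, each additional standard deviation of score is associated with roughly one-third lower odds of the outcome; the published effect sizes should, of course, be read from the study itself.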
It is tempting to draw a line between 2 points. And, indeed, studies that examine associations between assessments and behaviors are of tremendous interest to the field of medicine, especially when the hypothesized relationships align with a given assessment tool. Yet drawing a line between 2 points can be problematic. The rationale for linking 2 points must be questioned, especially when the second point on the line is an outcome with important consequences for the developmental trajectories of the trainees involved. In assessment, there is a tradition of gathering validity evidence to support the interpretation and use of scores.2,6 In particular, studies using assessment scores to predict crucially important outcomes have heightened attention to consequential validity evidence7,8 that may support trainees’ longitudinal learning and outcomes. Such consequential validity studies have the potential, for example, to align assessment practices with efforts to identify struggling learners or to provide early remediation.
While predictive assessment studies can strengthen the validity argument for an assessment, one must be wary of the associated inferences and the potential unintended consequences of their use and interpretation. Cullen et al are to be commended for their care and restraint in interpreting their findings, outlining the limitations of their project, and proffering recommendations. They caution against the use of SJT scores in high-stakes admissions or selection contexts and state that the SJT and other professionalism measures should be studied further to evaluate their usefulness. Cuddy et al suggest that their work may help provide “some” validity evidence for the use of USMLE Step 3 scores in making licensure decisions about candidates.
Assessments, at their core, are developed from blueprinted content and are intended to capture the measured construct through a constrained, yet systematic, sampling of behaviors and skills. Generalizing and linking specific assessment scores to other outcomes should be done cautiously, particularly because such actions may deviate from the intended purpose of the original assessment design. Studies examining consequential validity evidence are particularly subject to this pitfall; when an assessment inference is made at the individual level, it may have unintended negative effects on a learner’s trajectory and professional development. This concern is especially important for professionalism-related assessments and behaviors, where such inferences may perpetuate bias or result in disproportionately negative repercussions.9,10
Predictive studies that use statistical models to examine associations are also prone to spurious correlations, confounders, and construct-irrelevant variance, all of which can threaten the implied causal claims concerning outcomes. Most quantitative approaches rely on regression-based techniques that may not account for the multifaceted complexities of the learning environment that act as confounders. Constructs in education and social science inherently carry measurement error (noise) that can attenuate the detected association (signal) with the intended outcome; conversely, estimation uncertainties can sometimes inflate the reported relationships.
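The attenuating effect of measurement error follows the classical attenuation formula, r_observed ≈ r_true × √(reliability_x × reliability_y). A minimal simulation, in which the true correlation of 0.5 and the reliabilities of 0.8 and 0.7 are illustrative assumptions, not values from either study:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent (error-free) construct scores, truly correlated at r = 0.5.
true_r = 0.5
latent = rng.multivariate_normal([0, 0], [[1, true_r], [true_r, 1]], size=n)

# Observed scores = latent score + measurement error; the error variance
# is chosen so each observed score has the stated reliability.
rel_x, rel_y = 0.8, 0.7  # illustrative reliabilities
x = latent[:, 0] + rng.normal(0, np.sqrt(1 / rel_x - 1), n)
y = latent[:, 1] + rng.normal(0, np.sqrt(1 / rel_y - 1), n)

observed_r = np.corrcoef(x, y)[0, 1]
expected_r = true_r * np.sqrt(rel_x * rel_y)  # classical attenuation formula
print(round(observed_r, 3), round(expected_r, 3))  # both near 0.37
```

Even with a genuine underlying correlation of 0.5, ordinary score reliabilities shrink the observable association to roughly 0.37, which is one reason modest observed correlations should not be over- or under-read.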
Furthermore, statistical techniques generally rely on linear methods and assumptions of conditional independence, when, in fact, learning and developmental pathways toward becoming a clinician may be nonlinear and entangled with many other influences and extraneous factors. Reducing these complexities to a linear statistical model built on a subset of variables cannot possibly capture the broad educational and social interactions that trainees and practicing physicians encounter in providing care to patients. As such, extrapolating inferences from reported effect sizes and the intended unit of analysis becomes extremely challenging for translational inquiries in outcomes-based medical education.
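The hazard of imposing a single linear slope on a nonlinear developmental trajectory can be sketched as follows; the plateauing "competency" curve and every parameter value are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training timeline (months) and a plateauing competency
# trajectory: rapid early growth that levels off, plus rating noise.
months = rng.uniform(0, 36, 500)
competency = 1 - np.exp(-months / 10) + rng.normal(0, 0.05, 500)

# Ordinary least-squares line fit to the nonlinear trajectory.
slope, intercept = np.polyfit(months, competency, 1)

# The single fitted slope implies uniform growth, but the true
# month-to-month gains shrink sharply over training.
early_gain = np.exp(-0 / 10) - np.exp(-12 / 10)   # true gain, months 0-12
late_gain = np.exp(-24 / 10) - np.exp(-36 / 10)   # true gain, months 24-36
print(round(slope * 12, 2), round(early_gain, 2), round(late_gain, 2))
```

The linear fit reports one annualized gain for the whole period, even though the underlying curve grows roughly tenfold faster in the first year than in the third; an effect size averaged over such a trajectory can badly misstate what happens at any particular stage of training.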
Developing robust assessment systems that inform learning carries additional value, such as strengthening approaches to medical education or regulation, with real meaning and possible repercussions. Understanding the signals that assessment scores provide to learners, faculty, and societal stakeholders adds to the rigor of scientific inquiry in education. That said, awareness of the nuanced interconnections among educational constructs and the complexity of learning and practice environments should shape our interpretation and application of research findings. We encourage thoughtful interpretation of assessment inferences and opportunities to enrich the methodologies that contribute to outcomes-based inquiry in medical education.
Drawing a line between 2 points, such as making a connection between performance on a written test and the likelihood of an educational outcome of great importance, is a natural and often worthy exercise. But when the educational outcome is the emergence of professionalism difficulties later in training or in practice, studying and interpreting the association is also a weighty act. Ethical use of assessments should guide broader interpretation and inference in our medical education mission of training outstanding and caring clinicians. Assessments have an inherent role in the learning process and serve to ensure accountability to the public for the quality of medical education. For these reasons, we must take extra care in selecting educational outcomes and study designs that integrate assessments with consequential implications for learning, and we must also foster broader inquiry into outcomes-based medical education.
1. Holmboe E, Batalden P. Achieving the desired transformation: Thoughts on next steps for outcomes-based medical education. Acad Med. 2015;90:1215–1223.
2. Yudkowsky R, Park YS, Downing S. Assessment in Health Professions Education. 2nd ed. New York, NY: Routledge; 2019.
3. Dewan M, Walia K, Meszaros ZS, Manring J, Satish U. Using meaningful outcomes to differentiate change from innovation in medical education. Acad Psychiatry. 2017;41:100–105.
4. Cullen MJ, Zhang C, Sackett PR, Thakker K, Young JQ. Can a situational judgment test identify trainees at risk of professionalism issues? A multi-institutional, prospective cohort study. Acad Med. 2022;97:1494–1503.
5. Cuddy MM, Liu C, Ouyang W, Barone MA, Young A, Johnson DA. An examination of the associations among USMLE Step 3 scores and the likelihood of disciplinary action in practice. Acad Med. 2022;97:1504–1510.
6. Jibson MD, Agarwal G, Anzia JM, Summers RF, Young JQ, Seyfried LS. Psychiatry clinical skills evaluation: A multisite study of validity. Acad Psychiatry. 2021;45:413–419.
7. Kane MT. Current concerns in validity theory. J Educ Meas. 2001;38:319–342.
8. Cook DA, Lineberry M. Consequences validity evidence: Evaluating the impact of educational assessments. Acad Med. 2016;91:785–795.
9. Frye V, Camacho-Rivera M, Salas-Ramirez K, et al. Professionalism: The wrong tool to solve the right problem? Acad Med. 2020;95:860–863.
10. Roberts LW. High road, low road: Professionalism, trust, and medical education. Acad Med. 2020;95:817–818.