The Federation of State Medical Boards and the National Board of Medical Examiners announced in February 2020 that United States Medical Licensing Examination (USMLE) Step 1 score reporting will change from 3-digit numerical scores to pass/fail beginning no earlier than January 1, 2022.1 This paradigm shift was driven by a desire to retain the role of the exam in medical licensure while addressing the overemphasis of Step 1 scores in residency selection.2
Most medical educators can attest to the prominent role of Step 1 scores in residency selection—as a metric for advising students considering competitive specialties, as a residency application screening tool, and as a key criterion in applicant ranking decisions by programs. The top selection criteria used by residency programs include grades in required clerkships, Step 1 scores, recommendation letters, specialty elective grades, honors grades, and USMLE Step 2 Clinical Knowledge (CK) scores.3–12 These criteria are often used in screening for interviews, and the Step 1 score is a leading criterion for initial selection decisions.
Emphasizing Step 1 scores in residency selection may be perceived as being objective and providing equal opportunities for medical students from rural state schools or highly ranked Ivy League schools to excel. However, despite their seemingly impartial nature, Step 1 scores have been shown to disadvantage female and underrepresented minority (URM) applicants, running counter to a common goal of enhancing diversity in the physician workforce.13,14 In addition, emphasizing Step 1 scores may have a negative impact on medical students’ learning and lead to neglect of educational content not directly applicable to the exam.15–17 For many students, Step 1 preparation is a financial burden and a source of substantial anxiety.16,18 Poor performance on this one test, with no opportunity to repeat or remediate, can influence students’ future careers. Medical educators have seen otherwise outstanding applicants fail to advance through initial screenings for residency programs because of their Step 1 scores. Given these high stakes, it is not surprising that low Step 1 scores are associated with a life crisis for medical students.19
While Step 1 scores may be predictive of future written board exam performance, neither they nor Step 2 CK scores, which may be an even better predictor, have been found to correlate with overall performance in residency training, including clinical outcomes, professionalism, and success in other core competencies (patient care, practice-based learning and improvement, interpersonal and communication skills, professionalism, and systems-based practice).11,20–23 These findings point to limited utility of Step 1 scores as a critical single criterion for residency selection.
Although many medical educators and learners agree that Step 1 scores are overemphasized, the decision to shift to pass/fail reporting has nonetheless met with substantial controversy. Stakeholders, including medical students, have expressed concerns about unintended consequences. One study suggested that third- and fourth-year medical students would prefer to keep the familiar Step 1 numerical scores.24 Given that Step 1 scores are one of the few remaining objective residency applicant assessment metrics, these reactions are not surprising. Medical school grades have gradually been relaxed or eliminated; dean’s letters (Medical Student Performance Evaluations, or MSPEs) do not consistently provide class rank or peer comparison data, nor are they always available in time for initial screening for early-match residency programs; and Alpha Omega Alpha Honor Medical Society (AΩA) elections, which are not available in all schools, may be delayed until after the Match.
Without Step 1 numerical scores, many residency programs will be left to make ranking decisions based on applicants’ publications/presentations (which may speak to productivity as a researcher but not necessarily skill as a clinician), subjective recommendation letters, and first impressions in a 10- to 15-minute interview. There is understandable concern that this information is inadequate, misses the bigger picture of applicants’ merit and medical school performance, and may lead to unintended consequences—such as shifting weight to other metrics. For example, medical school reputation may acquire greater influence, perhaps unfairly disadvantaging students from lower-tier schools. Competitive specialties have already begun to discuss requiring Step 2 CK scores as an application prerequisite. An advantage of emphasizing Step 2 CK scores is the exam’s greater relevance for clinical skills compared with Step 1 content. If there is increased emphasis on Step 2 CK scores for applicant screening, then exam timing and standardized requirements by programs must be considered for the exam to be a fair and useful screening test.
If Not Step 1 Scores, Then What?
While there are strong views about the role of Step 1 scores in residency selection, the magnitude of the controversy is a symptom of a deeper issue. Residency selection inherently requires metrics to rank applicants. However, some students do well in coursework while others excel in research or as community leaders. All of these merits should be cultivated and recognized, and performance in each should be considered. Increasing the breadth of metrics and what is considered excellence will increase the likelihood that both applicants and programs can match in a way that will yield appropriate selection and meet future health care workforce needs. We believe there is a need for broader, systemic change—beyond Step 1 reporting—to evaluate applicants and their achievements more effectively, equitably, and holistically while making efforts to mitigate implicit bias. As steps toward such change, we propose revisiting medical school assessment metrics, addressing application overload, and considering alternative metrics to screen applicants for interviews.
Revisiting medical school assessment metrics
In a statement arguing against Step 1 pass/fail scoring, the Association of Program Directors in Radiology indicated that “the outsize role of the USMLE Step 1 score in resident selection is due to lack of standardization in medical school transcripts, grade inflation, and the lack of class standing in many summative assessments.”25 Perhaps the root cause of overemphasizing Step 1 performance is a lack of more meaningful assessment of medical student achievements. With few standardized or objective metrics, and with the high volume of applications, residency programs have had to increasingly rely on Step 1 scores to measure applicants’ medical school performance. There would be less consternation about the anticipated lack of numerical Step 1 scores if programs had access to better data than the information currently provided by grades, MSPEs, and AΩA selection. Even earlier release of MSPEs might help: Early-match residency programs currently make initial screening decisions without the benefit of MSPEs. Clerkship shelf examination scores offer another potential metric; however, increasing emphasis on shelf test preparation could shift focus away from clinical learning and add more stress during clinical rotations.
The shift of many medical schools to pass/fail curricula has been a seismic change. Although pass/fail grading is attractive to medical school applicants and beneficial to students’ well-being during medical school, it may disadvantage students in the end by limiting how they can distinguish themselves in the inherently competitive residency application process. A return to traditional grades would not be sufficient, however, as there is compelling evidence that traditional grading systems are flawed. A study from the University of California, San Francisco (UCSF), School of Medicine, for example, found that URM students received approximately half as many clerkship honors grades as non-URM students and were 3 times less likely to be selected for AΩA membership—the amplified effect of small differences in assessed performance, highlighting the profound impact of factors such as implicit bias.26
This does not mean that medical school performance metrics such as grades and AΩA membership should be abandoned. Instead, it suggests opportunities for improvement in assessment. For example, selection to AΩA is intended to recognize excellence in educational achievement, leadership, research, humanism, professionalism, and service.27 However, at many schools, AΩA selection emphasizes grades and scholarly achievement, which are subject to bias. Responding to criticism that disproportionately low numbers of URM students are selected, AΩA has worked with chapters across the country to recognize and promote diversity.28 At UCSF School of Medicine, a holistic AΩA selection process was successfully implemented, including selection committee education to mitigate bias and increase inclusion, and equal weighting of academic achievement and professional contributions in final selection. Using the new holistic process, URM and non-URM students were similarly likely to be selected to AΩA.29
In addition, clerkship evaluations could follow a competency-based framework. For example, in 2010, Stanford University School of Medicine implemented the Criterion-Based Evaluation System for required clerkships.30 Under this system, student performance is assessed in 3 domains: patient care, professionalism and interpersonal communication, and medical knowledge (measured by written exam). Students who meet criteria in all 3 domains earn pass with distinction (honors) for the clerkship, and performance in each domain is reported in the MSPE.
Addressing application overload
Though the examples above suggest ways to enable more holistic assessment, they do not address the problem of application overload. Applicants apply to more residency programs every year, challenging programs to sift through as many as 100 applicants for each position31 and prompting many to implement Step 1 screening thresholds. The Association of American Medical Colleges (AAMC) has identified the point of diminishing return on additional applications for 19 specialties, based on Step 1 score and degree type, and has provided this information online via an interactive tool.32 Potential solutions to application overload have been suggested, such as limiting the number of applications, early-decision matching, and tiering applications (where applicants indicate their program preferences in groups at the outset). To successfully reduce application volume, medical school advisors will also need to counsel students on appropriate and realistic program choices to enhance likelihood of a successful match.31,33–35
Considering alternative metrics: Situational judgment tests and competency-based assessment
Scores on situational judgment tests (SJTs) could be preferable to Step 1 scores as metrics to screen residency applicants. SJTs are standardized tests in which applicants review hypothetical scenarios and identify behavioral responses relevant to role-specific competencies. SJTs evaluate general domain knowledge and cognitive ability, can be designed to assess job-specific knowledge, are good predictors of job performance, and may be better able to predict professionalism than alternative screening approaches.36–38 The AAMC has piloted an SJT for medical school admissions39; other SJTs are commercially available and potentially customizable.40,41
There has been minimal exploration of SJTs as screening tools in graduate medical education. Done well, SJTs would provide another source of standardized and objective information about applicants in addition to medical school performance information. Further, SJTs could help enhance recruitment of URM residents. In a study of 2,742 general surgery applicants across 7 residency programs, program use of an online SJT as a screening test resulted in a higher likelihood that a URM applicant would be considered for a residency position. Eight percent more URM applicants were recommended for an interview when programs used SJTs compared with when they used traditional screening methods (e.g., Step 1 score thresholds).42 This finding—a more diverse applicant pool invited to interview—has been echoed in other studies.43
To be useful as residency selection screening tests, SJTs would require careful design and validity testing to ensure reproducibility and predictability for competency performance. SJTs could be developed for general residency selection, and specialties could adapt them for specific domains. Assessing professionalism accurately via SJTs may be particularly challenging, since professionalism can be more subjective and domain-specific than other competencies and relevant scenarios may vary across specialties.44–46 Yet despite these considerations, 85% of a small sample of general surgery residency applicants said that SJTs would not deter them from applying to programs. They felt traditional interviews were more accurate but less objective than SJTs.47
The end of Step 1 numerical scores is a substantial change. It will be important to evaluate the impact of Step 1 pass/fail reporting, both at the program and national levels; to reexamine the weight that residency programs assign to other applicant metrics; and to assess outcomes such as changes in resident diversity and performance. Medical school educators should strive for an academic environment that can support and recognize trainees across all domains of excellence—including academic performance, research, clinical acumen, respect from peers, leadership, service and community outreach, global health, mentorship, and diversity. Residency programs should strive to evaluate applicants holistically across these domains when making residency selection decisions. To do this, we need systemic commitment at the level of medical schools and residency programs.
We also need to address residency application overload by considering approaches such as application limits or tiering, and to broaden our view of assessment to incorporate factors that are predictive of residency performance across competencies. The change in Step 1 score reporting must spur us toward developing and implementing better metrics, such as meaningful and fair grades; providing MSPEs earlier; adopting holistic selection; and introducing new screening tools, such as SJTs. Only with diversity of metrics will we be capable of cultivating diversity of excellence.
1. Invitational Conference on USMLE Scoring. Change to pass/fail score reporting for Step 1. United States Medical Licensing Examination. https://www.usmle.org/incus/#decision. Accessed April 23, 2020.
2. Invitational Conference on USMLE Scoring. Summary report and preliminary recommendations from the Invitational Conference on USMLE Scoring (InCUS), March 11-12, 2019. United States Medical Licensing Examination. https://www.usmle.org/pdfs/incus/incus_summary_report.pdf. Accessed April 23, 2020.
3. Green M, Jones P, Thomas JX Jr. Selection criteria for residency: Results of a national program directors survey. Acad Med. 2009;84:362–367.
4. Gorouhi F, Alikhan A, Rezaei A, Fazel N. Dermatology residency selection criteria with an emphasis on program characteristics: A national program director survey. Dermatol Res Pract. 2014;2014:692760.
5. Crane JT, Ferraro CM. Selection criteria for emergency medicine residency applicants. Acad Emerg Med. 2000;7:54–60.
6. Puscas L, Sharp SR, Schwab B, Lee WT. Qualities of residency applicants: Comparison of otolaryngology program criteria with applicant expectations. Arch Otolaryngol Head Neck Surg. 2012;138:10–14.
7. Stain SC, Hiatt JR, Ata A, et al. Characteristics of highly ranked applicants to general surgery residency programs. JAMA Surg. 2013;148:413–417.
8. Makdisi G, Takeuchi T, Rodriguez J, Rucinski J, Wise L. How we select our residents–a survey of selection criteria in general surgery residents. J Surg Educ. 2011;68:67–72.
9. Melendez MM, Xu X, Sexton TR, Shapiro MJ, Mohan EP. The importance of basic science and clinical research as a selection criterion for general surgery residency programs. J Surg Educ. 2008;65:151–154.
10. Al Khalili K, Chalouhi N, Tjoumakaris S, et al. Programs selection criteria for neurological surgery applicants in the United States: A national survey for neurological surgery program directors. World Neurosurg. 2014;81:473–477.
11. Raman T, Alrabaa RG, Sood A, Maloof P, Benevenia J, Berberian W. Does residency selection criteria predict performance in orthopaedic surgery residency? Clin Orthop Relat Res. 2016;474:908–914.
12. Bernstein AD, Jazrawi LM, Elbeshbeshy B, Della Valle CJ, Zuckerman JD. An analysis of orthopaedic residency selection criteria. Bull Hosp Jt Dis. 2002;61:49–57.
13. Edmond MB, Deschenes JL, Eckler M, Wenzel RP. Racial bias in using USMLE Step 1 scores to grant internal medicine residency interviews. Acad Med. 2001;76:1253–1256.
14. Rubright JD, Jodoin M, Barone MA. Examining demographics, prior academic performance, and United States Medical Licensing Examination scores. Acad Med. 2019;94:364–370.
15. Chen DR, Priest KC, Batten JN, Fragoso LE, Reinfeld BI, Laitman BM. Student perspectives on the “Step 1 climate” in preclinical medical education. Acad Med. 2019;94:302–304.
16. Prober CG, Kolars JC, First LR, Melnick DE. A plea to reassess the role of United States Medical Licensing Examination Step 1 scores in residency selection. Acad Med. 2016;91:12–15.
17. Carmody JB, Sarkany D, Heitkamp DE. The USMLE Step 1 pass/fail reporting proposal: Another view. Acad Radiol. 2019;26:1403–1406.
18. Andolsek KM. One small step for Step 1. Acad Med. 2019;94:309–313.
19. Greenburg DL, Durning SJ, Cruess DL, Cohen DM, Jackson JL. The prevalence, causes, and consequences of experiencing a life crisis during medical school. Teach Learn Med. 2010;22:85–92.
20. Kenny S, McInnes M, Singh V. Associations between residency selection strategies and doctor performance: A meta-analysis. Med Educ. 2013;47:790–800.
21. McCaskill QE, Kirk JJ, Barata DM, Wludyka PS, Zenni EA, Chiu TT. USMLE Step 1 scores as a significant predictor of future board passage in pediatrics. Ambul Pediatr. 2007;7:192–195.
22. Lee AG, Oetting TA, Blomquist PH, et al. A multicenter analysis of the ophthalmic knowledge assessment program and American Board of Ophthalmology written qualifying examination performance. Ophthalmology. 2012;119:1949–1953.
23. Lee M, Vermillion M. Comparative values of medical school assessments in the prediction of internship performance. Med Teach. 2018;40:1287–1292.
24. Lewis CE, Hiatt JR, Wilkerson L, Tillou A, Parker NH, Hines OJ. Numerical versus pass/fail scoring on the USMLE: What do medical students and residents want and why? J Grad Med Educ. 2011;3:59–66.
25. Rozenshtein A, Mullins ME, Marx MV. The USMLE Step 1 pass/fail reporting proposal: The APDR position. Acad Radiol. 2019;26:1400–1402.
26. Teherani A, Hauer KE, Fernandez A, King TE Jr, Lucey C. How small differences in assessed clinical performance amplify to large differences in grades and awards: A cascade with serious consequences for students underrepresented in medicine. Acad Med. 2018;93:1286–1292.
27. Alpha Omega Alpha. About Alpha Omega Alpha. http://www.alphaomegaalpha.org/about.html. Updated February 7, 2020. Accessed April 30, 2020.
28. Byyny RL, Martinez D, Cleary L, et al. Alpha Omega Alpha Honor Medical Society: A commitment to inclusion, diversity, equity, and service in the profession of medicine. Acad Med. 2020;95:670–673.
29. Teherani A, Harleman E, Hauer KE, Lucey C. Toward creating equity in awards received during medical school: Strategic changes at one institution. Acad Med. 2020;95:724–729.
30. Stanford University School of Medicine. The criterion-based evaluation system. http://med.stanford.edu/md/office-of-medical-education/faculty-resources/cbes.html. Accessed April 30, 2020.
31. Berger JS, Cioletti A. Viewpoint from 2 graduate medical education deans: Application overload in the residency match process. J Grad Med Educ. 2016;8:317–321.
32. Association of American Medical Colleges. Apply smart: Data to consider when applying to residency. https://students-residents.aamc.org/applying-residency/filteredresult/apply-smart-data-consider-when-applying-residency. Accessed April 23, 2020.
33. Hueston W. A proposal to address the increasing number of residency applications [letter]. Acad Med. 2017;92:896–897.
34. Weissbart SJ, Kim SJ, Feinn RS, Stock JA. Relationship between the number of residency applications and the yearly match rate: Time to start thinking about an application limit? J Grad Med Educ. 2015;7:81–85.
35. Katsufrakis PJ, Uhler TA, Jones LD. The residency application process: Pursuing improved outcomes through better understanding of the issues. Acad Med. 2016;91:1483–1487.
36. Krumm S, Lievens F, Hüffmeier J, Lipnevich AA, Bendels H, Hertel G. How “situational” is judgment in situational judgment tests? J Appl Psychol. 2015;100:399–416.
37. Lievens F, Buyse T, Sackett PR. The operational validity of a video-based situational judgment test for medical college admissions: Illustrating the importance of matching predictor and criterion construct domains. J Appl Psychol. 2005;90:442–452.
38. Lievens F, Sackett PR. The validity of interpersonal skills assessment via situational judgment tests for predicting academic success and job performance. J Appl Psychol. 2012;97:460–468.
39. Association of American Medical Colleges. AAMC Medical Education: Situational Judgment Test Research. https://www.aamc.org/services/admissions-lifecycle/situational-judgment-test. Accessed April 23, 2020.
40. Strahan J, Fogarty GJ, Machin MA. Predicting performance on a situational judgement test: The role of communication skills, listening skills, and expertise. In: Proceedings of the 40th Annual Conference of the Australian Psychological Society. Sydney, Australia: Australian Psychological Society; 2005:323–327.
41. Dore KL, Reiter HI, Kreuger S, Norman GR. CASPer, an online pre-interview screen for personal/professional characteristics: Prediction of national licensure scores. Adv Health Sci Educ Theory Pract. 2017;22:327–336.
42. Gardner AK, Cavanaugh KJ, Willis RE, Dunkin BJ. Can better selection tools help us achieve our diversity goals in postgraduate medical education? Comparing use of USMLE Step 1 scores and situational judgment tests at 7 surgical residencies. Acad Med. 2020;95:751–757.
43. Lyons J, Bingmer K, Ammori J, Marks J. Utilization of a novel program-specific evaluation tool results in a decidedly different interview pool than traditional application review. J Surg Educ. 2019;76:e110–e117.
44. Ginsburg S, Regehr G, Lingard L. Basing the evaluation of professionalism on observable behaviors: A cautionary tale. Acad Med. 2004;79(10 suppl):S1–S4.
45. Rees CE, Knight LV. The trouble with assessing students’ professionalism: Theoretical insights from sociocognitive psychology. Acad Med. 2007;82:46–50.
46. Goss BD, Ryan AT, Waring J, et al. Beyond selection: The use of situational judgement tests in the teaching and assessment of professionalism. Acad Med. 2017;92:780–784.
47. Shipper ES, Mazer LM, Merrell SB, Lin DT, Lau JN, Melcher ML. Pilot evaluation of the computer-based assessment for sampling personal characteristics test. J Surg Res. 2017;215:211–218.