The United States Medical Licensing Examination (USMLE) is a four-step test designed to assess physicians for licensure. Cosponsored by the Federation of State Medical Boards (FSMB) and the National Board of Medical Examiners (NBME), it was developed in the late 1980s and implemented in the early 1990s, replacing the prior NBME and FLEX examinations. Step 1, the first of the steps, primarily measures knowledge about foundational sciences and organ systems. The test was psychometrically designed as pass/fail for licensing boards to decide whether physician candidates meet minimum standards to obtain the medical licensure necessary to practice. Results are reported as a three-digit number. Since its inception, Step 1 has been increasingly used, and some would argue misused, by medical schools to promote or graduate students and evaluate curricula. Residency program directors (PDs) use Step 1 scores to “screen” residency applicants to select which candidates to interview. Applicants judge the competitiveness of specialties by the average Step 1 scores of residents in that discipline. However, the test correlates best with performance on subsequent multiple-choice tests, such as residency in-service training and specialty board examinations, and less well with the clinical performance of residents.
Chen and colleagues1 are the latest contributors to a debate that began in the early 1990s when the USMLE Step 1’s role in residency selection was contested even before the exam was implemented.2 As current and recent medical students, Chen and colleagues’1 poignant perspective adds thoughtful learner voices to the debate. They characterize a new facet of our learning environments, the “Step 1 climate”: the collective attitudes, processes, and behaviors situated within medical education that adversely impact “education, diversity, and student well-being.” They note the cottage industry of Step 1 test prep, which has hijacked, and frequently replaced, the individual curricula of medical schools. Even as student debt escalates, students spend an additional $50 to $825 on proprietary resources, many of which fail to provide any feedback.
Many of these arguments are well known. Step 1 assesses only one of six core competencies recognized by the Accreditation Council for Graduate Medical Education (ACGME) and the American Board of Medical Specialties (ABMS). Its silence on the others devalues them. Step 1 is psychometrically designed to discriminate a passing grade from a failing one for use in medical licensure. It does not differentiate “better” from “less better” performance and fails to correlate with clinical skills,3 residency progress,4 or faculty evaluations.5
But Step 1 does discriminate against people—by race, gender, and means. Performance differs by gender and race.6 Women,7 those historically underrepresented in medicine,8 nontraditional students, and students with financial need perform less well.9,10
At the extremes, higher Step 1 scores predict higher scores on future multiple-choice tests. Yet, over 90% of residency graduates in pediatrics, obstetrics, orthopedics, anesthesia, and internal medicine pass board exams if their Step 1 scores are 200 to 227,11 levels some PDs routinely “screen out.”
Tackling these well-documented issues is challenging. Various stakeholders frequently spend more time demonizing one another than listening, addressing what lies within their respective control, and working collaboratively toward better long-term solutions. We know, for example, that medical students do not feel their course work alone prepares them for Step 1, that “prep” has become the de facto preclinical curriculum, and that faculty do not like to teach to a test. One solution would be for Step 1 to become an eligibility requirement for medical school, as the Medical College Admission Test (MCAT) is now. Alternatively, medical students could matriculate and spend the first several months preparing for and taking it, using resources from their school, as well as supplemental proprietary and freely available resources. Khan Academy MCAT prep could be a model for future free Step 1 resources. With Step 1 completed, students could then focus on the curriculum the school provides. These solutions, however, do not address the lack of an evidence-based process to support PDs in residency selection. So, what actions can we take to meet the needs of all stakeholders? Each of our constituencies can act now to improve this situation while we await aspirational future solutions.
Considerations for Stakeholders
Schools of medicine
Step 1 anxiety is at least partially rooted in students’ concern that they will fail to match. Schools can ease this anxiety by (1) repaying tuition of any unmatched student or (2) guaranteeing (and funding) an intern position either directly or through an affiliation with another medical school or teaching hospital.
Medical schools and PD communities must move beyond the linguistic and cultural divides between ACGME milestones and Association of American Medical Colleges (AAMC) Core Entrustable Professional Activities for Entering Residency. We need a shared mental model of what last-day medical students/day 1 residents “look like,” and we need better tools to measure student achievement. The validity and credibility of assessments should be improved, and the learner “handoff” between undergraduate medical education (UME) and graduate medical education (GME) must be honest and transparent. Several assessment methods could be optimized and embedded into the medical student performance evaluation. Medical student peer assessments correlate with later PD ratings. Peer assessments could be standardized across UME and GME settings.12 Longer-duration experiences with the same faculty, such as in a longitudinal integrated clerkship, improve the faculty’s formative and summative assessments of students.13 Learners who have greater continuity with faculty and patients have improved clinical skills, and clinical schedules could be adapted to maximize these opportunities.14 “Machine learning” and artificial intelligence have already been used to assess interpersonal and communication skills and professionalism.15 “Progress testing” intermittently assesses all necessary competencies.16 Strategically employing these types of curricular and assessment methods and reporting students’ performance in an honest, transparent, and standardized way would enable a smoother transition between UME and GME.
To begin mitigating Step 1 anxiety even earlier in the pipeline, more students could be admitted to medical school with conditional acceptance to residencies, avoiding the Match altogether and potentially reducing tuition if transitions could be accelerated.17 Deliberate curation of the curriculum could allow the student aspiring to pediatrics, for example, to spend required clerkships in pediatric surgery, adolescent medicine, and child psychiatry. We must be sufficiently agile to optimize learning for students who know their specialty preference at school entry as well as those who discover or change their specialty preferences later.
The NBME could allow students to repeat Step 1 if they passed but were not pleased with their score. Unlike those taking the SAT or MCAT, students who pass Step 1 are not allowed a “do-over.” By contrast, 24% of medical students took the MCAT twice, and 9% took it three or more times. Ninety-one percent improved their scores.18 Admittedly, a retake adds cost and time. However, knowing a retake is possible may lessen anxiety.
Substituting pass/fail scoring for Step 1’s numeric score may be desirable, but PDs who currently use Step 1 scores to screen residency applicants would doubtless substitute a different screening method. Pass/fail scoring may have other benefits, though. The performance of women and men on the Step 2 Clinical Knowledge (CK) exam is essentially the same, and racial/ethnic differences on this exam are attenuated.6 Reporting Step 1 scores as pass/fail and numerically reporting a different portion of the USMLE, the Step 2 CK, would enhance “fairness.” An even better option might be to report Step 2 Clinical Skills (CS) scores numerically and to score Step 1 and Step 2 CK pass/fail. Step 2 CS is the only opportunity for all medical students to be directly observed in a standardized simulated clinical environment, similar to the “selection tests” used successfully in other countries.19,20 Step 2 CS predicts first-year resident history-taking and physical exam skills.21 It is far closer to the skills PDs expect at GME entry. Step 2 CS is the most expensive “step” because of registration costs and the need for most students to travel, yet it has limited value to students or PDs. Students receive no feedback except whether they pass or fail. PDs don’t value it because 96% of students pass. Step 2 CS could be repurposed, its psychometric properties enhanced, and relevant basic science content added. PD associations could contribute by collaboratively developing some specialty-specific “stations.” Applicants could consent to have a video of their performance on some of these stations sent to the residency programs to which they applied, similar to the current emergency medicine video interview pilot. If CS became a truly competency-based standardized exam, it might help level the playing field for diverse applicants and contribute to holistic review in GME.
Underrepresented minority interns have been shown to perform similarly to majority counterparts on simulated clinical performance examinations (similar in style to Step 2 CS) despite statistically significantly lower Step 1 scores.22
If PDs need a numeric score to use in screening applicants, assessments of new areas of essential scientific knowledge (health inequities, cultural humility, communication, ethics, data science, patient safety, interprofessional care), which are currently not well assessed in Step 1, could be included. Alternatively, structured interviews might be included in the application process: Structured interview scores outperformed Step 1 and Step 2 CK in predicting intern performance.23 A situational judgment test (SJT), using work-based scenarios to assess noncognitive professional attributes such as empathy, integrity, teamwork, and resilience, is another option.24–26 Measures of personality27,28 or “grit”29 may provide even greater value.
The NBME could also consider eliminating Step 1 entirely and integrating more basic sciences into the other USMLE Steps, better representing how physicians apply basic science concepts in practice and supporting greater integration of clinical and basic sciences throughout the curriculum. Step 2 CS should be sunsetted if it cannot be enhanced to provide greater value.30
The FSMB should reexamine the need for USMLE examinations to support state licensing decisions regarding a physician’s readiness for supervised and unsupervised practice.31 Residency programs assess the former on a daily basis, and PDs of accredited programs assess the latter at GME completion. Because nearly all physicians are board certified, board certification and maintenance of certification may be a better substitute for the USMLE exams in supporting licensing decisions. NBME’s expertise might be better tapped in constructing a multicomponent test that is intentionally, intelligently, and psychometrically designed for residency selection.32
Rather than allowing Step 1 preparation to be the de facto preclinical curriculum, PDs should collaborate with UME colleagues to create a UME curriculum that ensures that residency applicants have the knowledge, attitudes, and skills required to be successful residents. PDs could collaborate with NBME to develop decision support specifically designed for resident selection. PDs could begin by making the ACGME-required description of their program’s unique attributes available to applicants along with the explicit criteria they use to evaluate them.33 In this spirit, one Canadian medical school specifically recommends, among other things, that selection of applicants should reflect the residency program’s goals, emphasize all competencies equally, and promote diversity.34
In the 2018 National Resident Matching Program PD survey, the Step 1 score ranked first out of 33 factors in deciding which applicants to interview, with 94% of respondents citing it as a factor.35 Forty years ago, PDs ranked Part 1 of the NBME exam, Step 1’s predecessor, 23rd of 31 factors.36 Yet, PDs reported that 8 factors were more important than Step 1 in deciding which applicants who passed the exam to interview. These factors were no prior Match violation; professionalism and ethics; perceived commitment to the specialty; grades in the specialty’s clerkship; specialty-specific letters of recommendation (LoRs); personal knowledge of the applicant; audition rotations; and Step 2 CS! Required clerkship grades, leadership, and perceived interest in the program were tied in importance with Step 1 scores.
If PDs truly need a “number” by which to screen applicants, perhaps an algorithm could generate one by capturing these elements. A multicomponent “test” to predict resident performance could integrate these factors.37 A well-constructed SJT could assess professionalism, ethics, and leadership.
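To make the idea of an algorithmically generated “number” concrete, the following is a minimal sketch, not a validated instrument. It assumes that each of the factors PDs ranked above Step 1 has already been rated on a 0-to-1 scale, and it defaults to equal weights. The factor names, the rating scale, the weights, and the function `composite_score` are all hypothetical choices made for illustration; none comes from the survey or from any existing tool.

```python
# Factors PDs rated above Step 1 in the survey discussed in the text.
# ASSUMPTION: the 0-to-1 rating scale and the equal default weights are
# illustrative only; the source assigns no weights to these factors.
FACTORS = (
    "no_match_violation",
    "professionalism_ethics",
    "specialty_commitment",
    "specialty_clerkship_grade",
    "specialty_letters",
    "personal_knowledge",
    "audition_rotation",
    "step2_cs",
)


def composite_score(ratings, weights=None):
    """Return the weighted mean of the rated factors, on a 0-to-1 scale.

    Factors absent from `ratings` are skipped and the remaining weights
    are renormalized, so an unrated item (for example, no audition
    rotation) does not count against the applicant.
    """
    if weights is None:
        weights = {name: 1.0 for name in FACTORS}
    rated = [name for name in weights if name in ratings]
    if not rated:
        raise ValueError("at least one weighted factor must be rated")
    for name in rated:
        if not 0.0 <= ratings[name] <= 1.0:
            raise ValueError(f"rating for {name!r} must lie in [0, 1]")
    total_weight = sum(weights[name] for name in rated)
    if total_weight <= 0:
        raise ValueError("weights of the rated factors must sum to a positive value")
    weighted_sum = sum(ratings[name] * weights[name] for name in rated)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical applicant with three rated factors.
    applicant = {
        "professionalism_ethics": 0.9,
        "specialty_clerkship_grade": 0.8,
        "specialty_letters": 0.7,
    }
    print(round(composite_score(applicant), 3))
```

A specialty could pass its own `weights` dictionary to reflect its program goals. Any real weighting would need to be derived from and validated against resident performance data, and checked for the same kinds of group differences described for Step 1.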
Emergency medicine PDs have modeled the improved reliability of specialty LoRs by implementing standardized letters of evaluation (SLOEs). They now rank SLOEs as the single most important factor in deciding which applicants to interview; Step 1 ranks 10th.38 PD associations could design their own SLOE templates and create a standardized grading rubric for their specialty’s clerkship, committing to being accountable to one another in accurately portraying student performance.
ACGME’s new common program requirements change the threshold for program graduates’ first-time board passage rates to higher than the bottom fifth percentile of programs in that specialty.39 This should make PDs more comfortable in selecting an otherwise highly desirable candidate with slightly lower Step 1 scores if they are concerned about the impact of a graduate failing to pass the board examination on the first attempt.
Applicants are split on whether they prefer Step 1 to be pass/fail. Not unexpectedly, those with higher scores prefer numeric grading, and those with lower scores prefer pass/fail.40 Applicants should work with advisors to plan appropriate residency application strategies. They can use tools such as the AAMC’s Apply Smart for Residency,41 which includes data on how many applications it takes, given a particular Step 1 score, to maximize the chance of a match and reach the point of diminishing returns. For example, at a given Step 1 score, an applicant in pediatrics needs to apply to 19 to 29 programs for a 71% to 81% chance of matching, whereas an otolaryngology applicant must apply to 38 to 40 programs for a 61% to 82% chance of matching. Students can use these data to make better informed decisions about their residency application strategy and avoid the arms race of ever-increasing numbers of applications.
Board exams should be criterion referenced. Specialty boards should provide greater transparency regarding the correlation between USMLE Step exams and board exams. Doing so might demonstrate that many of these correlations are at best modest and highlight the many other factors influencing board performance that are within the PD’s control, such as structured reading assignments.42,43 Such evidence may enhance PDs’ confidence in recruiting candidates with skills beyond test taking.
If Step 1 were a diagnostic test, we would all educate learners and faculty to apply evidence-based principles in interpreting the results. Step 1 was not designed to, nor does it, predict success as a physician. Its misuse has created a “Step 1 climate”1 inimical to learning, diversity, and well-being. We are collectively responsible for and must collaboratively solve this problem. The quickest solution would be to report Step 1 and Step 2 CK scores as pass/fail and Step 2 CS scores numerically. Existing data could be analyzed to determine how women, underrepresented minorities, and low-income students fare compared with majority candidates. Obviously, any changes must be carefully implemented in a way that is mindful of the kind of unintended consequences that have befallen Step 1. There is a huge opportunity for improvement in the upcoming Invitational Conference on USMLE Scoring (InCUS), which will convene stakeholders (AAMC, American Medical Association, Educational Commission for Foreign Medical Graduates, FSMB, and NBME) to explore options for addressing some of the challenges related to Step 1.44 In addition to InCUS’s important work, I have outlined ways in which we can all take (at least) one small step toward fixing Step 1 today.
1. Chen DR, Priest KC, Batten JN, Fragoso LE, Reinfeld BI, Laitman BM. Student perspectives on the “Step 1 climate” in preclinical medical education. Acad Med. 2019;94:302–304.
2. Berner ES, Brooks CM, Erdmann JB. Use of the USMLE to select residents. Acad Med. 1993;68:753–759.
3. McGaghie WC, Cohen ER, Wayne DB. Are United States Medical Licensing Exam Step 1 and 2 scores valid measures for postgraduate medical residency selection decisions? Acad Med. 2011;86:48–52.
4. Prober CG, Kolars JC, First LR, Melnick DE. A plea to reassess the role of United States Medical Licensing Examination Step 1 scores in residency selection. Acad Med. 2016;91:12–15.
5. Zuckerman SL, Kelly PD, Dewan MC, et al. Predicting resident performance from preresidency factors: A systematic review and applicability to neurosurgical training. World Neurosurg. 2018;110:475–484.e10.
6. Rubright JD, Jodoin M, Barone MA. Examining demographics, prior academic performance, and United States Medical Licensing Examination scores. Acad Med. 2019;94:364–370.
7. Gauer JL, Jackson JB. Relationships of demographic variables to USMLE physician licensing exam scores: A statistical analysis on five years of medical student data. Adv Med Educ Pract. 2018;9:39–44.
8. Edmond MB, Deschenes JL, Eckler M, Wenzel RP. Racial bias in using USMLE Step 1 scores to grant internal medicine residency interviews. Acad Med. 2001;76:1253–1256.
9. Giordano C, Hutchinson D, Peppler R. A predictive model for USMLE Step 1 scores. Cureus. 2016;8:e769.
10. Teherani A, Hauer KE, Fernandez A, King TE Jr, Lucey C. How small differences in assessed clinical performance amplify to large differences in grades and awards: A cascade with serious consequences for students underrepresented in medicine. Acad Med. 2018;93:1286–1292.
12. Lurie SJ, Lambert DR, Nofziger AC, Epstein RM, Grady-Weliky TA. Relationship between peer assessment during medical school, dean’s letter rankings, and ratings by internship directors. J Gen Intern Med. 2007;22:13–16.
13. Snow SC, Gong J, Adams JE. Faculty experience and engagement in a longitudinal integrated clerkship. Med Teach. 2017;39:527–534.
14. Teherani A, Irby DM, Loeser H. Outcomes of different clerkship models: Longitudinal integrated, hybrid, and block. Acad Med. 2013;88:35–43.
15. Dias RD, Gupta A, Yule SJ. Using machine learning to assess physician competence: A systematic review. Acad Med. 2019;94:427–439.
16. DeMuth RH, Gold JG, Mavis BE, Wagner DP. Progress on a new kind of progress test: Assessing medical students’ clinical skills. Acad Med. 2018;93:724–728.
17. Cangiarella J, Fancher T, Jones B, et al. Three-year MD programs: Perspectives from the Consortium of Accelerated Medical Pathway Programs (CAMPP). Acad Med. 2017;92:483–490.
19. Kelly ME, Patterson F, O’Flynn S, Mulligan J, Murphy AW. A systematic review of stakeholder views of selection methods for medical schools admission. BMC Med Educ. 2018;18:139.
20. Gale TC, Roberts MJ, Sice PJ, et al. Predictive validity of a selection centre testing non-technical skills for recruitment to training in anaesthesia. Br J Anaesth. 2010;105:603–609.
21. Cuddy MM, Winward ML, Johnston MM, Lipner RS, Clauser BE. Evaluating validity evidence for USMLE Step 2 Clinical Skills data gathering and data interpretation scores: Does performance predict history-taking and physical examination ratings for first-year internal medicine residents? Acad Med. 2016;91:133–139.
22. Lypson ML, Ross PT, Hamstra SJ, Haftel HM, Gruppen LD, Colletti LM. Evidence for increasing diversity in graduate medical education: The competence of underrepresented minority residents measured by an intern objective structured clinical examination. J Grad Med Educ. 2010;2:354–359.
23. Marcus-Blank B, Dahlke JA, Braman JP, et al. Predicting performance of first-year residents: Correlations between structured interview, licensure exam, and competency scores in a multi-institutional study. Acad Med. 2019;94:378–387.
24. Koczwara A, Patterson F, Zibarras L, Kerrin M, Irish B, Wilkinson M. Evaluating cognitive ability, knowledge tests and situational judgement tests for postgraduate selection. Med Educ. 2012;46:399–408.
25. Patterson F, Zibarras L, Ashworth V. Situational judgement tests in medical education and training: Research, theory and practice: AMEE guide no. 100. Med Teach. 2016;38:3–17.
26. Smith DT, Tiffin PA. Evaluating the validity of the selection measures used for the UK’s foundation medical training programme: A national cohort study. BMJ Open. 2018;8:e021918.
27. Phillips D, Egol KA, Maculatis MC, et al. Personality factors associated with resident performance: Results from 12 Accreditation Council for Graduate Medical Education accredited orthopaedic surgery programs. J Surg Educ. 2018;75:122–131.
28. Valley B, Camp C, Grawe B. Non-cognitive factors predicting success in orthopedic surgery residency. Orthop Rev (Pavia). 2018;10:7559.
29. Salles A, Lin D, Liebert C, et al. Grit as a predictor of risk of attrition in surgical residency. Am J Surg. 2017;213:288–291.
30. Alvin MD. The USMLE Step 2 CS: Time for a change. Med Teach. 2016;38:854–856.
31. Haist SA, Katsufrakis PJ, Dillon GF. The evolution of the United States Medical Licensing Examination (USMLE): Enhancing assessment of practice-related competencies. JAMA. 2013;310:2245–2246.
32. Haist SA, Butler AP, Paniagua MA. Testing and evaluation: The present and future of the assessment of medical professionals. Adv Physiol Educ. 2017;41:149–153.
33. Giang D; Vice president for graduate medical education, designated institutional official, and professor of neurology. Loma Linda School of Medicine. Personal communication with K.M. Andolsek, July 19, 2018.
34. Bandiera G, Abrahams C, Ruetalo M, Hanson MD, Nickell L, Spadafora S. Identifying and promoting best practices in residency application and selection in a complex academic health network. Acad Med. 2015;90:1594–1601.
36. Wagoner NE, Gray GT. Report on a survey of program directors regarding selection factors in graduate medical education. J Med Educ. 1979;54:445–452.
37. Schoenmakers B, Wens J. Proficiency testing for admission to the postgraduate family medicine education. J Family Med Prim Care. 2018;7:58–63.
38. Negaard M, Assimacopoulos E, Harland K, Van Heukelom J. Emergency medicine residency selection criteria: An update and comparison. AEM Educ Train. 2018;2:146–153.
40. Lewis CE, Hiatt JR, Wilkerson L, Tillou A, Parker NH, Hines OJ. Numerical versus pass/fail scoring on the USMLE: What do medical students and residents want and why? J Grad Med Educ. 2011;3:59–66.
42. Kim RH, Tan TW. Interventions that affect resident performance on the American Board of Surgery In-Training Examination: A systematic review. J Surg Educ. 2015;72:418–429.
43. Ferrell BT, Tankersley WE, Morris CD. Using an accountability program to improve psychiatry resident scores on in-service examinations. J Grad Med Educ. 2015;7:555–559.
44. Whelan A; Chief medical education officer. Association of American Medical Colleges. Personal communication with K.M. Andolsek, November 19, 2018.