The United States Medical Licensing Examination (USMLE) has a mission to protect the health of the public. Passing these examinations is required for U.S. states and territories to consider granting an unrestricted medical license to a physician. USMLE comprises three Steps (four exams). Step 1 is a multiple-choice examination assessing an examinee’s knowledge of foundational science concepts applicable to medicine. Step 2 Clinical Knowledge (CK) assesses the ability to apply scientific concepts to clinical medicine. Step 2 Clinical Skills (CS) uses standardized patients to test the examinee’s ability to gather information from patients, perform physical examinations, and communicate findings to patients and colleagues. Step 3 uses multiple-choice questions and computerized patient cases to assess an examinee’s ability to practice in an unsupervised setting.
This examination series may represent a barrier to practice for certain aspiring physicians. A rich body of research exists for the USMLE, including research on demographic differences in USMLE scores. A number of subgroups have been examined, including analyses grouped by sex and self-identified race. These previous studies have examined total reported scores with a focus on secondary use, such as postgraduate residency screening and selection.
Examining differences by sex on the precursor to the current USMLE Step 1, the National Board of Medical Examiners (NBME) Part I examination, Case and colleagues1 found that men performed better than women on average by about 0.3 standard deviations (SDs). This difference was at least partly explained by covariates such as Medical College Admission Test (MCAT) scores, undergraduate grade point average (GPA), and college selectivity. This finding has been replicated.2 A later study analyzing Step 1 scores showed a similar pattern of men performing better than women, even after controlling for covariates.3 Analyses on NBME Part II, the precursor to Step 2 CK, showed women performing as well as or better than men.1 This effect was again seen using the current Step 2 format, showing women moderately outperforming men on Step 2 CS and CK.4–6
Comparatively less research has been performed on racial differences in USMLE scores. Our literature search identified only one study, using data from the older Part I format. That analysis showed racial differences wherein white students performed highest among self-identified racial groups, followed by Asian/Pacific Islanders, Hispanics, then blacks. Controlling for the MCAT, undergraduate GPA, and college selectivity reduced, but did not eliminate, differences.2
USMLE Step 3 scores have received less attention than Steps 1 or 2. Successfully passing Step 3 was most associated with being a native English-speaking U.S. citizen from a U.S. school. Although sex appeared statistically significant, with men outperforming women, the practical significance was small.7 Together, previous work suggests that men outperform women on Step 1, yet the trend is reversed for Step 2 and negligible for Step 3. Some racial differences have also been seen, albeit from a study using older data on a test format no longer used.
These studies have told a story of average demographic differences across the USMLE series. Yet the story goes back 24 years, spans outdated test formats, examines demographic characteristics individually, and uses a variety of methodological approaches. To provide data on possible subgroup performance differences, this study examines many demographic characteristics of interest simultaneously within one modeling framework, under the current Step testing format, for all computer-based USMLE Step exams. Current information on subgroup performance differences may inform how accreditation organizations, medical schools, and postgraduate training programs use USMLE data above and beyond the primary intended use of assessing passing scores for medical licensure.
Design, sample, and data collection
We used a cross-sectional analysis of historical, deidentified data. Ethical approval with “exempt” status was granted by the American Institutes for Research, Washington, DC. Examinees’ first-time scores for Step 1, Step 2 CK, and Step 3 were included if the examinee took Step 1 during or after 2010, completed Step 3 by 2015, and reported demographic information. As our research was intended to address secondary use of scores, we sampled examinees who had progressed through the examination series and taken each of the computerized Steps. To focus on results from U.S. and Canadian allopathic and osteopathic medical schools, we did not include international medical graduates in this analysis.
Dependent variables were scores on computer-based USMLE Step examinations: Step 1, Step 2 CK, and Step 3. Test-taker characteristics were self-reported on the application to sit for the first USMLE examination, and included sex (male as reference category), race (self-identified: Asian/Pacific Islander; black not of Hispanic origin; and Hispanic, with white as reference category), U.S. citizenship status (U.S. citizen as reference category), English as a second language (ESL) (native English speaker as reference category), and age at first Step 1 attempt (grand mean centered). Composite MCAT scores (from first take, grand mean centered) and undergraduate GPA (grand mean centered) were obtained from the Association of American Medical Colleges (AAMC). The MCAT composite included the verbal reasoning, biological sciences, and physical sciences sections and excluded the writing sample, as the former sections have been shown to be related to USMLE scores and one another while the latter section has not.3 We did not include racial categories with too few examinees (American Indian/Alaskan Native, n = 175), nor examinees in the categories “do not wish to respond,” “multiple,” or “other.” Examinees were included if they agreed to allow their deidentified data to be used for research purposes.
Hierarchical linear modeling (HLM)8 has been used previously in this line of research, with most score variance found within, not between, schools9 or cases.5 Still, HLM is the more appropriate approach for datasets with a nested structure, and in this analysis medical students were nested within medical schools. Models were fit using SAS statistical software, version 9.3 (SAS Institute Inc., Cary, North Carolina), with maximum likelihood estimation. Multicollinearity among predictors is not a concern here because the variables likely to be correlated are used as control variables and not as variables of interest. Additionally, centering of variables is used to aid in the interpretation of the resulting coefficients, and has the secondary benefit of reducing the relationships among the variables under study.
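As a minimal sketch of what grand mean centering does, the Python snippet below subtracts the pooled mean from each value, so that a centered value of zero corresponds to an examinee at the overall average. The function name and the age values are hypothetical and chosen only for illustration; they are not taken from the study data, and the study itself used SAS rather than Python.

```python
from statistics import mean


def grand_mean_center(values):
    """Subtract the overall (grand) mean, pooled across all schools, from each value."""
    grand_mean = mean(values)
    return [v - grand_mean for v in values]


# Hypothetical ages at first Step 1 attempt (illustrative values, not study data).
ages = [24.0, 25.0, 26.0, 27.0, 28.0]
centered_ages = grand_mean_center(ages)

# After centering, the values sum to zero, so the model intercept refers to
# an examinee of average age rather than to an examinee of age zero.
print(centered_ages)
```

With centered predictors, the intercept becomes the predicted score for an examinee at the sample average on each centered variable, which is the interpretation used for the models in this study.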
First, we produced descriptive statistics for all included variables. Principally interested in how examinee characteristics predicted USMLE performance and not in how these relationships varied by school, we estimated random intercept models allowing schools to have different intercepts but not slopes. This decision was driven by our interest in overall demographic effects and also by small sample sizes from school-level clusters. These models constrain the relationships between demographic characteristics and USMLE performance to remain the same across schools, although school intercepts may vary.
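In conventional multilevel notation, a random intercept model of this kind can be sketched as follows. The symbols are assumptions introduced for illustration and do not appear in the original analysis: $Y_{ij}$ is the Step score of examinee $i$ in school $j$, $X_{kij}$ is the $k$-th examinee-level predictor, $u_{0j}$ is the deviation of school $j$ from the overall intercept, and $r_{ij}$ is the examinee-level residual.

```latex
\begin{aligned}
Y_{ij} &= \gamma_{00} + \sum_{k=1}^{K} \gamma_{k0}\, X_{kij} + u_{0j} + r_{ij}, \\
u_{0j} &\sim N(0, \tau_{00}), \qquad r_{ij} \sim N(0, \sigma^{2}).
\end{aligned}
```

Because no school-specific term is attached to any slope, each $\gamma_{k0}$ is the same for every school, and only the intercept $\gamma_{00} + u_{0j}$ varies across schools.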
Model building was guided by the research questions: whether scores differed by demographic characteristics and whether covariates attenuated those differences. We ran the following models with Step 1, Step 2 CK, and then Step 3 as the dependent variable:
- An unconditional model to calculate the intraclass correlation (ICC), which is the ratio of between-school variance to total variance. This value tells us the proportion of variance attributable to clustering at the medical school level.
- A random intercept model using the demographic characteristics U.S. citizenship, self-identified racial category, ESL status, sex, and age at first Step 1 attempt. Here, this will be referred to as the demographics model.
- A random intercept model including the variables above, along with GPA and MCAT score as covariates, to assess whether demographic relationships associated with USMLE performance are attenuated. Here, this will be referred to as the covariates model. With Step 2 CK scores as the dependent variable, Step 1 was entered in the covariates model grand mean centered. With Step 3 scores as the dependent variable, both Step 1 and Step 2 CK were added grand mean centered.
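The ICC from the unconditional model can be illustrated with a short Python sketch. It assumes the model yields two variance components, a between-school variance and a within-school (examinee-level) variance. The function name and the numeric values are hypothetical, picked only so that the arithmetic is easy to follow; they are not the variance estimates from this study.

```python
def intraclass_correlation(between_var, within_var):
    """Return the share of total variance that lies between clusters (schools)."""
    total_var = between_var + within_var
    if total_var <= 0:
        raise ValueError("total variance must be positive")
    return between_var / total_var


# Hypothetical variance components (illustrative only, not study estimates).
between_school_var = 48.0
within_school_var = 352.0

icc = intraclass_correlation(between_school_var, within_school_var)
within_share = 1.0 - icc  # share of variance attributable to examinee differences

print(f"ICC = {icc:.2f}, within-school share = {within_share:.2f}")
```

The complement of the ICC is the proportion of score variance attributable to differences among examinees within schools.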
For the dichotomous variables in all models, we generated an effect size measure along with each coefficient. Because coefficients are interpretable in terms of USMLE score points, and all Step examinations are scaled to a base reference group with an SD of 20 points, the effect size used was the coefficient divided by 20 and is interpretable as differences in SD units. Cohen suggested that an effect size in SD units could be considered small if ≥ 0.2 yet < 0.5, medium if ≥ 0.5 yet < 0.8, and large if ≥ 0.8.10 We provided effect sizes because, given the sample size we used, statistical significance is likely.
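The effect size calculation described above can be sketched in Python. The divisor of 20 and the thresholds come from the text; the function names, the label used for values under 0.2, and the example coefficients are assumptions made for illustration and are not results from the tables.

```python
SCALE_SD = 20.0  # SD of the base reference group on the USMLE score scale, per the text


def effect_size(coefficient, sd=SCALE_SD):
    """Convert a coefficient in score points to SD units."""
    return coefficient / sd


def cohen_label(d):
    """Classify the magnitude of an effect size using the thresholds cited in the text."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "below small"  # hypothetical label; the text defines no category under 0.2


# Hypothetical coefficients in score points (illustrative only).
for coef in (-5.0, 12.0, 1.5):
    d = effect_size(coef)
    print(coef, round(d, 3), cohen_label(d))
```

The sign of the effect size carries the direction of the difference relative to the reference category, while the label depends only on its magnitude.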
A total of 45,154 examinees from 172 schools fit study criteria (average 262.52 examinees per school, SD 190.27, range 1–820). Table 1 shows descriptive statistics for the sample. Tables 2, 3, and 4 sequentially show the modeling results with USMLE Steps 1, 2, and 3 as the dependent variable. The ICC for predicting Step 1 scores is 0.12. Therefore, 88% of the variance in scores was due to student differences. Examining Step 1 results in Table 2, the intercept for the demographics model is the predicted performance when all demographic variables represent the reference category—that is, for a native English-speaking white male U.S. citizen at average age. The coefficients are interpreted as the difference in predicted Step 1 scores compared with the reference group with all others constant. Thus, a female ESL test taker, or any nonwhite test taker, would be predicted to have a lower Step 1 score. Similarly, scores are predicted to be lower for each year of age above average. Being a non-U.S. citizen would increase the predicted score.
Adding GPA and MCAT score to arrive at the covariates model (penultimate column of Table 2) improved predictions of Step 1 scores, as shown by the lower error variance at both levels along with improved fit indices (−2 log likelihood, Akaike information criterion and Bayesian information criterion). Because the added covariates were grand mean centered, the intercept is now interpreted as the predicted Step 1 performance of a test taker with the demographic characteristics described above who is also of average GPA and MCAT score. For every 1-point increase in GPA above the average value, predicted Step 1 performance increased by 11.91 points. Predicted scores also increased if an individual had above-average composite MCAT performance. After including these variables, the variables representing U.S. citizenship and ESL status were no longer significant. That is, these demographic differences were explained by differences in GPA and MCAT scores. The coefficients for black or Hispanic test takers were attenuated, although the Asian coefficient remained similar.
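To make the interpretation of a grand mean centered covariate concrete, the sketch below applies the GPA coefficient reported above (11.91 Step 1 points per 1-point increase in GPA). The function name and the grand mean GPA of 3.5 are hypothetical; the actual sample mean appears in the descriptive statistics and is not reproduced here.

```python
GPA_COEFFICIENT = 11.91  # Step 1 points per 1-point GPA increase, as reported in the text


def gpa_contribution(gpa, grand_mean_gpa):
    """Predicted shift in Step 1 score, relative to the intercept, due to GPA alone."""
    return GPA_COEFFICIENT * (gpa - grand_mean_gpa)


# Hypothetical grand mean GPA (illustrative assumption, not the study value).
ASSUMED_MEAN_GPA = 3.5

# An examinee half a grade point above the assumed mean.
shift = gpa_contribution(4.0, ASSUMED_MEAN_GPA)
print(f"Predicted shift: {shift:.3f} points")
```

An examinee exactly at the grand mean contributes a shift of zero, which is why the intercept can be read as the predicted score at average GPA.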
The ICC for Step 2 CK is similar to that of Step 1: 0.10. Table 3 displays results with Step 2 CK scores as the dependent variable; all demographic variables under study were statistically significant. The intercept retained the same interpretation as that of the Step 1 demographics model, albeit for the prediction of Step 2 CK scores. All demographic variables altered the prediction of Step 2 CK performance in the same direction as in the Step 1 model, except for sex. Similar to previous studies of Step 2 performance, we found that women were predicted to have higher performance than men (by 0.34 points). Adding covariates again improved the model, as shown by the decrease in error variance and fit indices. The demographic variable coefficients again changed under this model, with the impact of sex increased and U.S. citizenship status no longer a significant model predictor. Individuals with above-average GPA, composite MCAT, and Step 1 scores were predicted to have higher performance, while those with above-average age were predicted to have lower performance. The addition of the GPA and MCAT covariates again attenuated differences for Asian, black, Hispanic, and ESL examinees.
The ICC for Step 3 is 0.12. Lastly, Table 4 reports the parameters for the prediction of USMLE Step 3 performance. The direction and magnitude of the demographic variables were similar to those from Tables 2 and 3, except for sex, which was nonsignificant. Adding covariates to the model again aided in the prediction of Step 3 scores, with higher levels of Step 1, Step 2 CK, GPA, and composite MCAT increasing the prediction of Step 3 performance and higher age decreasing the predicted score. With added covariates, U.S. citizenship was no longer significant, and the racial and ESL indicators were attenuated.
This study extends and updates previous analyses by using the modern USMLE Step format, examining the impact of all self-reported examinee characteristics simultaneously across all computerized Steps, and examining the impact of important premedical school covariates. Our findings show that, on average, demographic differences exist in USMLE scores. In the nonadjusted models, sex effects were present, although they varied depending on the Step under consideration. Men outperformed women on Step 1, women outperformed men on Step 2, and there was no difference on Step 3. ESL test takers and self-identified nonwhite groups consistently performed lower on all three Steps; although their practical significance varies, the size of the coefficients remained similar across Steps. Citizenship and ESL status showed statistical, yet not practical, significance. Age consistently showed a negative relationship with Step scores, with examinees above average age predicted to have lower scores.
Another consistent finding emerged: Adding covariates on a test taker’s previous examination and undergraduate performance increases the accuracy of prediction and, with the exception of sex, substantially reduces the predicted effects of demographic characteristics. In some cases, the effects of citizenship and ESL status were erased entirely. In others, the effects were attenuated. For example, self-identified blacks were predicted to score 16 points lower on all Step examinations compared with whites in the demographics-only model, representing more than three-fourths of an SD. When additional premedical school covariates were included, these differences were reduced to 4 or 5 points, around one-quarter of an SD. More than 10 points of a black test taker’s predicted performance were explained by covariates.
There are limitations to this study. First, although our analysis aimed at understanding individual characteristics and their association with USMLE performance, 10% to 12% of score performance remains to be explained by medical school characteristics. Medical schools have different ways of supporting students through their curricula, and different policies concerning whether students need to take USMLE Steps for promotion or graduation (see, for example, https://www.aamc.org/initiatives/cir/406442/10b.html). Measuring and understanding how schools contribute to examination performance across demographic groups could be useful in understanding examinee performance and may further attenuate the demographic effects seen here. Second, additional aspects of training, including self-selected specialties, also have been shown to affect USMLE performance11 yet are not considered here. Third, undergraduate institutions vary in their grading standards, which affects the comparability of GPAs for individuals across institutions. Fourth, this analysis only examines the computer-based USMLE Step exams; comparable analyses for Step 2 CS are planned.
Implications of these findings are relevant to two increasingly important concerns in medicine and medical education: the use of a score, on an examination intended for medical licensure, as a high-stakes screen or selection criterion for residency selection; and the recruitment and retention of a diverse physician workforce.
It is widely accepted that residency program directors, with the daunting task of screening numerous applications, use USMLE scores to screen applicants for interviews.12,13 Furthermore, this practice has been associated in the past with potential bias against certain racial and ethnic minorities.14 If applicants do not meet this screen, they are no longer considered despite their potentially having qualities or experiences that translate to becoming effective physicians. More recently, there has been a consistent message from leaders in the academic community as well as from the NBME to reduce or eliminate the use of USMLE scores, particularly Step 1, as a barrier to residency selection.15,16 These calls acknowledge the mission of the USMLE program, and point to evidence where USMLE scores can be predictive of performance on subsequent assessments, such as specialty in-training and certification examinations.17 Relationships have also been demonstrated between scores on subcomponents of the USMLE and residency program director performance ratings, as well as for scores on certain USMLE Steps and disciplinary action in practice.18–20 While research is ongoing regarding the predictive value of licensing examinations on clinical practice measures,21 the debate remains over the evidence, or lack thereof, for using USMLE scores as a threshold for residency candidate consideration.22 Some investigators have reported that, despite consistently lower scores on the USMLE obtained by underrepresented minority residents, no difference existed in objective structured clinical examinations at the start of residency.23
In 2015, black medical students comprised less than 6% of medical school graduates in the United States, and Latinos less than 5%.24 Over the past 10 years, the AAMC’s Holistic Review initiative has provided guidance and resources for medical admissions programs to “widen the lens” when viewing prospective candidates, emphasizing the applicants’ experiences and personal attributes, in addition to their academic metrics.25 An admissions process that focuses on mission-based initiatives is likely to produce diverse students, viewpoints, experiences, and ultimately a workforce reflecting the same. The concept of holistic review has carried into graduate medical education, particularly given the need for program directors to assess professionalism and communication competencies during the brief selection season, as well as the priority that graduate medical education programs are placing on recruiting and retaining diverse cohorts of trainees.26,27 Given our findings, residency program directors may be able to more effectively engage in holistic review of applicants, and may also be motivated to provide additional resources to trainees in need of support for success on licensure and certification examinations. Some health professions education programs have demonstrated the effectiveness that targeted resources or mentoring may have on standardized test scores.28 Furthermore, it would be important to consider how traditional program evaluation metrics, such as certifying board pass rates, might hinder efforts to advance diversity in medicine across specialties.29
Subgroup examinee performance on standardized tests need not be equal for a test to meet the standard of fairness.30 In the case of our study, as in one previous study,2 prior academic performance explains much of the demographic difference in scores. Although differences in mean performance between racial categories, especially for blacks and Hispanics, appear initially large, “the observed racial and ethnic differences reflect the lower mean MCAT scores and GPAs of underrepresented minority students.”2(p678) Moreover, MCAT scores themselves have not shown evidence of bias against underrepresented minority test takers.31 As the remaining performance differences are unexplained, additional work is required to identify the factors contributing to them and to help medical educators identify examinees who may need additional help with USMLE preparation.
The authors thank Monica Cuddy and Kimberly Swygert for their valuable comments on early drafts of this manuscript.
1. Case SM, Becker DF, Swanson DB. Performances of men and women on NBME Part I and Part II: The more things change. Acad Med. 1993;68(10 suppl):S25–S27.
2. Dawson B, Iwamoto CK, Ross LP, Nungester RJ, Swanson DB, Volle RL. Performance on the National Board of Medical Examiners Part I examination by men and women of different race and ethnicity. JAMA. 1994;272:674–679.
3. Cuddy MM, Swanson DB, Clauser BE. A multilevel analysis of examinee gender and USMLE Step 1 performance. Acad Med. 2008;83(10 suppl):S58–S62.
4. Cuddy MM, Swygert KA, Swanson DB, Jobe AC. A multilevel analysis of examinee gender, standardized patient gender, and United States medical licensing examination Step 2 clinical skills communication and interpersonal skills scores. Acad Med. 2011;86(10 suppl):S17–S20.
5. Swygert KA, Cuddy MM, van Zanten M, Haist SA, Jobe AC. Gender differences in examinee performance on the Step 2 Clinical Skills data gathering (DG) and patient note (PN) components. Adv Health Sci Educ Theory Pract. 2012;17:557–571.
6. Cuddy MM, Swanson DB, Clauser BE. A multilevel analysis of the relationships between examinee gender and United States Medical Licensing Exam (USMLE) Step 2 CK content area performance. Acad Med. 2007;82(10 suppl):S89–S93.
7. De Champlain A, Sample L, Dillon GF, Boulet JR. Modeling longitudinal performances on the United States Medical Licensing Examination and the impact of sociodemographic covariates: An application of survival data analysis. Acad Med. 2006;81(10 suppl):S108–S111.
8. Raudenbush SW, Bryk AS. Hierarchical Linear Models: Applications and Data Analysis Methods. 2nd ed. Newbury Park, CA: Sage; 2002.
9. Cuddy MM, Swanson DB, Dillon GF, Holtman MC, Clauser BE. A multilevel analysis of the relationships between selected examinee characteristics and United States Medical Licensing Examination Step 2 Clinical Knowledge performance: Revisiting old findings and asking new questions. Acad Med. 2006;81(10 suppl):S103–S107.
10. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
11. Sawhill AJ, Dillon GF, Ripkey DR, Hawkins RE, Swanson DB. The impact of postgraduate training and timing on USMLE Step 3 performance. Acad Med. 2003;78(10 suppl):S10–S12.
12. Green M, Jones P, Thomas JX Jr. Selection criteria for residency: Results of a national program directors survey. Acad Med. 2009;84:362–367.
13. National Resident Matching Program, Data Release and Research Committee. Results of the 2016 NRMP Program Director Survey. Washington, DC: National Resident Matching Program; 2016.
14. Edmond MB, Deschenes JL, Eckler M, Wenzel RP. Racial bias in using USMLE Step 1 scores to grant internal medicine residency interviews. Acad Med. 2001;76:1253–1256.
15. Prober CG, Kolars JC, First LR, Melnick DE. A plea to reassess the role of United States Medical Licensing Examination Step 1 scores in residency selection. Acad Med. 2016;91:12–15.
16. Katsufrakis PJ, Uhler TA, Jones LD. The residency application process: Pursuing improved outcomes through better understanding of the issues. Acad Med. 2016;91:1483–1487.
17. Dillon GF, Swanson DB, McClintock JC, Gravlee GP. The relationship between the American Board of Anesthesiology Part 1 certification examination and the United States Medical Licensing Examination. J Grad Med Educ. 2013;5:276–283.
18. Cuddy MM, Winward ML, Johnston MM, Lipner RS, Clauser BE. Evaluating validity evidence for USMLE Step 2 Clinical Skills data gathering and data interpretation scores: Does performance predict history-taking and physical examination ratings for first-year internal medicine residents? Acad Med. 2016;91:133–139.
19. Winward ML, Lipner RS, Johnston MM, Cuddy MM, Clauser BE. The relationship between communication scores from the USMLE Step 2 Clinical Skills examination and communication ratings for first-year internal medicine residents. Acad Med. 2013;88:693–698.
20. Cuddy MM, Young A, Gelman A, et al. Exploring the relationships between USMLE performance and disciplinary action in practice: A validity study of score inferences from a licensure examination. Acad Med. 2017;92:1780–1785.
21. Tamblyn R, Abrahamowicz M, Dauphinee WD, et al. Association between licensure examination scores and practice in primary care. JAMA. 2002;288:3019–3026.
22. McGaghie WC, Cohen ER, Wayne DB. Are United States Medical Licensing Exam Step 1 and 2 scores valid measures for postgraduate medical residency selection decisions? Acad Med. 2011;86:48–52.
23. Lypson ML, Ross PT, Hamstra SJ, Haftel HM, Gruppen LD, Colletti LM. Evidence for increasing diversity in graduate medical education: The competence of underrepresented minority residents measured by an intern objective structured clinical examination. J Grad Med Educ. 2010;2:354–359.
26. King A, Mayer C, Starnes A, Barringer K, Beier L, Sule H. Using the Association of American Medical Colleges standardized video interview in a holistic residency application review. Cureus. 2017;9:e1913.
27. Van Voorhees AS, Enos CW. Diversity in dermatology residency programs. J Investig Dermatol Symp Proc. 2017;18:S46–S49.
28. Girotti JA, Park YS, Tekian A. Ensuring a fair and equitable selection of students to serve society’s health care needs. Med Educ. 2015;49:84–92.
29. Berger JS, Cioletti A. Viewpoint from 2 graduate medical education deans: Application overload in the residency Match process. J Grad Med Educ. 2016;8:317–321.
30. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 2014.
31. Davis D, Dorsey JK, Franks RD, Sackett PR, Searcy CA, Zhao X. Do racial and ethnic group differences in performance on the MCAT exam reflect test bias? Acad Med. 2013;88:593–602.
© 2019 by the Association of American Medical Colleges