Gender Differences in Milestone Ratings and Medical Knowledge Examination Scores Among Internal Medicine Residents : Academic Medicine


Research Reports

Gender Differences in Milestone Ratings and Medical Knowledge Examination Scores Among Internal Medicine Residents

Hauer, Karen E. MD, PhD; Jurich, Daniel PhD; Vandergrift, Jonathan MS; Lipner, Rebecca S. PhD; McDonald, Furman S. MD, MPH; Yamazaki, Kenji PhD; Chick, Davoren MD; McAllister, Kevin MEd; Holmboe, Eric S. MD

Academic Medicine 96(6):p 876-884, June 2021. | DOI: 10.1097/ACM.0000000000004040


Competency-based medical education should support each trainee’s individual path toward competence. Optimally, observations of performance are collected and interpreted by educators to ensure that each trainee progresses as expected and has needed learning opportunities. 1–3 In U.S. internal medicine (IM) residency programs, milestone ratings submitted by program directors working with clinical competency committees (CCCs) track a trainee’s developmental trajectory and correlate with performance on medical knowledge (MK) certification examinations. 4–6 However, in a multiple linear regression model, averaged MK milestone ratings accounted for only 3% of explained variation in IM Certification Examination (IM-CE) scores, while IM In-Training Examination (IM-ITE) scores had the strongest relationship with IM-CE scores. 7 Milestone ratings may be affected by factors other than actual performance, including bias related to trainee gender. Group procedures, such as the use of CCCs, can minimize, or conversely contribute to, bias in the decision-making process. 8 Understanding whether and how milestone ratings from program directors working with CCCs nationally differ based on gender and whether differences are associated with performance on MK examinations could help identify and minimize any bias in the milestone ratings process.

Studies in graduate medical education have yielded conflicting findings about differences in performance ratings based on the gender of the resident or the supervisor providing the ratings. Among emergency medicine residents at 8 programs, postgraduate year (PGY)-2 and PGY-3 male residents achieved slightly higher milestone ratings (approximately 0.15 points higher on a 5-point scale) from their supervisors than female residents at the same training level. 9 Nationally, emergency medicine milestone ratings revealed negligibly higher ratings (<0.01 points on a 5-point scale) for men at graduation in 4 of 22 subcompetency milestones, all addressing patient care (PC). 10 Studies of IM trainees have yielded variable findings related to ratings of trainees stratified by gender. Some single-institution studies have shown higher ratings for men, including when rated by a male attending physician, 11 and an interaction effect between trainee and evaluator gender at the resident 12 and fellow level, 13 with male evaluators rating male trainees the highest. In contrast, there were no gender-based differences in a single-institution study of faculty ratings of residents at Yale 14 or in a multicenter study of ratings of residents in scripted videos. 15

Analyses of narrative comments describing trainees’ performance show differences based on gender stereotypes. Narrative analysis of written descriptions of students’ performance across specialties showed small differences in word choice, with women’s personality traits more likely to be highlighted, while for men, their competence was more likely to be described. 16 Narrative analysis of comments about emergency medicine residents showed that men were more likely to be praised for demonstrating ideal emergency medicine skills than women. 17 In family medicine, female residents assessed by male attendings received more praise for their interpersonal interactions than for their competence compared with other gender combinations of trainees and faculty. 18 These studies suggest areas in which bias in ratings, or true differences in performance, may arise based on particular gender-based stereotypes.

An understanding of any gender differences in ratings of female and male trainees in IM is needed across programs at the population level. The aim of this national study is to examine milestone ratings of female and male IM residents, using ratings submitted by program directors working with CCCs, to determine (1) whether there are group differences in milestone ratings of residents based on gender and (2) whether women and men rated similarly on milestones perform comparably on subsequent in-training and certification examinations.


Data sources

This national retrospective cohort study used deidentified data merged across 4 organizations. The American Board of Internal Medicine (ABIM) provided residency program, gender, IM-CE scores, and IM-CE test date. The Accreditation Council for Graduate Medical Education (ACGME)/ABIM provided PGY and milestone ratings. The American College of Physicians provided IM-ITE scores and IM-ITE test dates. The National Board of Medical Examiners provided United States Medical Licensing Examination (USMLE) Step 1 and Step 2 Clinical Knowledge (CK) scores, native English-speaker status, birth date, medical school location, and degree type. These covariates were selected based on theoretical and empirical differences in examination performance and milestone ratings related to program size, program region, and resident native language that have been discussed in the literature. 5,7,19–23

The ACGME/ABIM milestones are used to rate PGY-1 to PGY-3 residents along 22 subcompetencies on a scale of 1 to 5 in half-point increments, giving 9 possible ratings (1.0, 1.5, 2.0, 2.5, etc.). 24 This study focused on 2 end-of-year subcompetency averages, which were submitted by program directors based on CCC discussions: the average of the 2 MK ratings and the average of the 5 PC ratings. We focused on comparing the MK ratings with future examination performance because both are intended to measure a theoretically similar construct: MK. We also examined PC ratings because aspects of PC require MK to execute effectively. 25,26 For each resident, we computed an averaged MK rating and averaged PC rating to improve reliability of these ratings. 6
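
As a minimal illustration of the averaging described above, the sketch below builds the 9-point half-step scale and averages one resident's MK or PC ratings. The function name, the validation step, and the example ratings are hypothetical and are not taken from the study data.

```python
# Hypothetical sketch: average a resident's subcompetency milestone ratings.
# The scale runs from 1.0 to 5.0 in half-point steps (9 possible ratings).
VALID_RATINGS = [1.0 + 0.5 * step for step in range(9)]


def average_rating(ratings, expected_count):
    """Return the mean of `ratings` after checking the count and the scale."""
    if len(ratings) != expected_count:
        raise ValueError(f"expected {expected_count} ratings, got {len(ratings)}")
    for rating in ratings:
        if rating not in VALID_RATINGS:
            raise ValueError(f"{rating} is not on the 1-to-5 half-point scale")
    return sum(ratings) / len(ratings)


# Illustrative values only: 2 MK ratings and 5 PC ratings for one resident.
mk_average = average_rating([2.5, 3.0], expected_count=2)
pc_average = average_rating([3.0, 3.0, 3.5, 2.5, 3.0], expected_count=5)
print(mk_average, pc_average)
```

Note that an averaged rating can fall between scale points (for example, 2.75), which is one reason the averages behave more like continuous predictors than the individual ratings do.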

The IM-ITE contains 300 multiple-choice items (260 scored, 40 unscored pretests) developed to resemble the IM-CE blueprint. Percent correct scores are reported for individuals and programs for formative feedback (range, 0–100). IM-ITE scores have been shown to be comparable across this study’s time frame (2014–2018). 7 IM-ITE reliability coefficients typically exceed 0.89. Most residents take the IM-ITE in all 3 residency years.

The IM-CE contains 200 scored multiple-choice items addressing MK, diagnostic reasoning, and clinical judgment for IM practice. Scores are reported on a standardized scale (mean, 500; standard deviation, 100; range, 200–800) and are statistically equated to be comparable across examination forms. IM-CE reliability estimates typically exceed 0.89 across administrations. Residents complete the IM-CE after residency training.


We matched 21,440 U.S. IM residents at ACGME-accredited programs with PGY-1, PGY-2, or PGY-3 end-of-year MK and PC milestone ratings. These residents comprised 2 cohorts (2014–2017 and 2015–2018). As shown in Supplemental Digital Appendix 1, residents in combined programs (e.g., medicine-pediatrics) and preliminary residents were excluded. We excluded 58 and 1,122 residents rated as “not yet assessed” on any MK or PC milestone, respectively. The gender distribution of those rated as “not yet assessed” mirrored that of the final sample. We also removed 36 residents with missing data on native language. Finally, we excluded 126 residents whose programs comprised only 1 gender (9 programs were male-only, another 9 were female-only) to avoid potential confounding if these programs had irregular ratings. The final sample included 20,098 (94%) residents: 17,652 (88%) MD and 2,446 (12%) DO, with 9,424 (47%) women and 10,674 (53%) men from 380 residency programs. The 380 programs ranged in size from 4 to 213 residents and represented all U.S. Department of Health & Human Services (HHS) geographic regions. 27
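
The sample counts reported above can be reconciled with simple arithmetic. The short check below assumes that the four numbered exclusion groups do not overlap and that combined-program and preliminary residents were removed before the count of 21,440; under those assumptions the reported totals and percentages agree. The variable names are illustrative.

```python
# Reconcile the reported sample flow using the counts stated in the text.
matched = 21_440
exclusions = {
    "MK milestone not yet assessed": 58,
    "PC milestone not yet assessed": 1_122,
    "missing native-language data": 36,
    "single-gender program": 126,
}
final_sample = matched - sum(exclusions.values())


def percent_of_final(count):
    """Percentage of the final sample, rounded to a whole number."""
    return round(100 * count / final_sample)


print("final sample:", final_sample)
print("retained (%):", round(100 * final_sample / matched))
print("MD / DO (%):", percent_of_final(17_652), percent_of_final(2_446))
print("women / men (%):", percent_of_final(9_424), percent_of_final(10_674))
```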


We first examined descriptive statistics for each subcompetency to evaluate observed differences between genders. To disentangle potential performance-based gender differences (associated with another measure of the same construct) from bias (not associated with another measure of the same construct), we used differential prediction techniques to determine statistically whether women and men with similar MK or PC ratings perform equivalently on future IM examinations. 28 Before conducting these analyses, we compared correlations between the examinations and all subcompetency averages (i.e., the systems-based practice, practice-based learning and improvement, professionalism, and interpersonal and communication skills subcompetencies, as well as the MK and PC subcompetencies) to support the examinations as meaningful outcomes that measure characteristics more closely related to MK and PC ratings than to other subcompetency milestones. Our application of differential prediction techniques used regression models to evaluate whether women and men with similar milestone ratings perform comparably on their next IM examination (IM-ITE or IM-CE), an assumed objective assessment of related content. A series of hierarchical linear models (HLMs) then tested whether including gender as a predictor improves model fit. An increase in model fit with gender indicates that women and men with comparable milestone ratings score differently on the corresponding IM examination.

The differential prediction models used IM-ITE or IM-CE as dependent variables, while gender and subcompetency served as independent variables. For PGY-1 and PGY-2 milestone ratings, the following year’s IM-ITE served as the dependent variable in the HLMs. For PGY-3 milestone ratings, the IM-CE served as the dependent variable in the HLMs. To adjust for other potentially meaningful variables given that residents are nonrandomly assigned to programs, we included the following covariates in the HLMs: USMLE Step 1 and Step 2 CK scores (as baseline assessments of knowledge), native English-speaker status, age when taking the IM-ITE or IM-CE (depending on which was serving as the dependent variable in the model; age was computed by subtracting the resident’s birth date from the respective test date), degree type, medical school location, and residency program HHS geographic region.

Within each PGY and subcompetency combination, we compared 3 different HLM specifications in sequential order: a baseline model, interaction model, and main-effect model. The baseline model specifies the examination score predicted from all covariates and the main effect of the subcompetency; gender is excluded from this model. We tested this model first to ensure the subcompetency significantly related to the examination after adjusting for covariates. A nonsignificant subcompetency coefficient considerably weakens any inferences made regarding bias as it would indicate the subcompetency does not measure the same construct as the examination does after adjusting for the remaining covariates. Given that the subcompetency was significant, we then tested the interaction model. The interaction model specifies the examination score predicted from all covariates, the main effects of gender and the subcompetency, and the interaction between gender and subcompetency. This model examines whether gender differences in examination performance vary across the subcompetency rating scale (e.g., does one gender outperform the other at higher subcompetency ratings but not at lower ratings). The main-effect model drops the interaction between gender and subcompetency, thus testing whether the difference in examination performance between women and men is constant across the milestone rating scale. As noted, the baseline model drops gender from the model entirely. Thus, if the baseline model best represents the data, this suggests that each gender performs statistically equivalently on the examinations across subcompetency ratings. Simplified depictions of the 3 HLM specifications in order of model complexity are as follows:

  • Baseline model: examination = subcompetency + covariates
  • Main-effect model: examination = gender + subcompetency + covariates
  • Interaction model: examination = gender × subcompetency + gender + subcompetency + covariates
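
The sequential comparison described above can be sketched as a small decision rule. This is one plausible reading of the procedure, not the authors' code: the function name, its arguments, and the exact order of the checks are assumptions, and the P values passed in would come from the model comparison tests.

```python
# Hypothetical sketch of the sequential model-selection logic.
def select_model(p_subcompetency, p_interaction_vs_main, p_main_vs_baseline,
                 alpha=0.01):
    """Pick the best-fitting specification from nested-model P values.

    p_subcompetency:       P value for the subcompetency in the baseline model
    p_interaction_vs_main: P value comparing interaction vs. main-effect model
    p_main_vs_baseline:    P value comparing main-effect vs. baseline model
    """
    if p_subcompetency >= alpha:
        # The rating does not track the examination, so differential
        # prediction inferences would be weak.
        return "subcompetency not predictive"
    if p_interaction_vs_main < alpha:
        return "interaction"   # gender gap varies across the rating scale
    if p_main_vs_baseline < alpha:
        return "main-effect"   # constant gender gap across the rating scale
    return "baseline"          # no detectable gender difference


# Illustrative P values only.
print(select_model(0.0005, 0.40, 0.001))
print(select_model(0.0005, 0.40, 0.20))
```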

These 3 HLM specifications were compared using likelihood ratio tests and the difference in R2 values estimated via the marginal coefficient of determination for mixed models. 29 Because we made multiple statistical significance comparisons across PGYs and the 2 subcompetencies, while also recognizing the importance of detecting bias, we used a more conservative significance threshold of P < .01. We assessed the magnitude of performance differences between genders by examining mean differences after adjusting for covariates on the examination’s reported (raw) score scale and standardized score. The standardized mean difference reflects the difference between female and male examination performance in standard deviation units after accounting for the covariates and thus is analogous to the traditional Cohen’s d effect size. 30
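
For readers unfamiliar with likelihood ratio tests, the sketch below shows the calculation for two nested models that differ by a single parameter, which would apply here if gender is coded as one binary indicator and the subcompetency average is treated as continuous (both are assumptions). The log-likelihood values are invented for illustration, and the function names are hypothetical. For 1 degree of freedom the chi-square tail probability has a closed form using the complementary error function, so only the standard library is needed.

```python
import math


def lrt_statistic(loglik_reduced, loglik_full):
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


# Invented log-likelihoods for a baseline and a main-effect model.
loglik_baseline = -35210.4
loglik_main_effect = -35201.1

statistic = lrt_statistic(loglik_baseline, loglik_main_effect)
p_value = chi2_sf_df1(statistic)
print(f"LRT statistic = {statistic:.2f}, P = {p_value:.2g}")
print("adding gender improves fit at P < .01:", p_value < 0.01)
```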

The 3 HLM specifications were used to account for dependencies introduced by residents being nested within programs. We examined the intraclass correlations (ICCs) for the examination variables to evaluate whether the dependencies within residency programs necessitated the use of HLMs. In our context, the ICC indicates the proportion of examination score variation due to residency programs. All 3 HLMs were specified to have random intercepts with fixed slopes, which allows programs to differ in average examination performance but holds the relationships between the independent variables and dependent variables constant across all programs. All predictors were specified at the level of the resident, except program HHS geographic region. Analyses were conducted using the lme4 package 31 in R version 3.6.3 (R Core Team, Vienna, Austria), which computes the statistical test P values via the Satterthwaite degrees of freedom method.
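
The random-intercept structure described above can be written out explicitly. The notation below is ours rather than the authors': resident i is nested in program j, y is the examination score, S the averaged subcompetency rating, G a gender indicator, and x the vector of covariates. The ICC expression is the standard one for a two-level random-intercept model and is typically computed from an intercept-only model.

```latex
\begin{align*}
\text{Baseline:}\quad
  y_{ij} &= \beta_0 + \beta_1 S_{ij}
            + \boldsymbol{\gamma}^{\top}\mathbf{x}_{ij} + u_j + \varepsilon_{ij} \\
\text{Main-effect:}\quad
  y_{ij} &= \beta_0 + \beta_1 S_{ij} + \beta_2 G_{ij}
            + \boldsymbol{\gamma}^{\top}\mathbf{x}_{ij} + u_j + \varepsilon_{ij} \\
\text{Interaction:}\quad
  y_{ij} &= \beta_0 + \beta_1 S_{ij} + \beta_2 G_{ij} + \beta_3 G_{ij} S_{ij}
            + \boldsymbol{\gamma}^{\top}\mathbf{x}_{ij} + u_j + \varepsilon_{ij} \\[4pt]
  u_j &\sim N(0, \tau^2), \qquad \varepsilon_{ij} \sim N(0, \sigma^2), \qquad
  \mathrm{ICC} = \frac{\tau^2}{\tau^2 + \sigma^2}
\end{align*}
```

Here the program-level intercept u_j lets average examination performance differ across programs, while the slopes are shared by all programs, matching the fixed-slope specification described in the text.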

The American Institutes for Research Institutional Review Board deemed this study exempt.


Milestone ratings

Table 1 presents milestone ratings, examination scores, and demographics by gender and PGY, unadjusted for program clustering and other covariates. For MK milestone ratings in PGY-1, men and women showed no statistical difference at our significance level of .01 (P = .02). In PGY-2 and PGY-3, however, men received statistically higher average MK ratings than women (P = .002 and P < .001, respectively), though the absolute differences were small. In contrast, men and women received equivalent average PC ratings in each PGY (P = .47, P = .72, and P = .80, for PGY-1, PGY-2, and PGY-3, respectively). Men also scored slightly higher on both the IM-ITE and IM-CE.

Table 1:
Descriptive Statistics for Average Subcompetency Milestone Ratings, Medical Knowledge Assessment Scores, and Demographic Characteristics for Internal Medicine Residents by PGYa

Table 2 presents the unadjusted correlations among subcompetencies and examinations. Average MK ratings correlated more strongly than the other subcompetency averages with examination performance. PC ratings showed the second highest set of correlations, whereas the remaining subcompetencies exhibited a similar range of weaker correlations.

Table 2:
Correlations for Average Subcompetency Milestone Ratings and Medical Knowledge Assessment Scores for Internal Medicine Residents by PGYa

Differential prediction

The ICCs were 0.21, 0.19, and 0.09 for PGY-1, PGY-2, and PGY-3 examinations, respectively. These results suggest that residency programs account for some score variance and justify the use of HLMs. Supplemental Digital Appendix 2 presents the likelihood ratio and R2 change model comparison tests for each PGY, subcompetency, and HLM specification. Table 3 contains regression coefficients related to gender and subcompetency for the best-fitting model within each PGY-subcompetency combination.* Both MK and PC were statistically significant predictors of examination performance across the baseline models (all P values < .001), allowing us to test further model specifications. The main-effect model statistically fit best for the PGY-1 and PGY-2 ratings across both subcompetencies, indicating a significant main effect of gender on examination performance among residents with comparable subcompetency milestone ratings. Despite the statistical differences in model fit, the R2 changes were trivial, with the increase in R2 never exceeding 0.010. This result indicates that including gender in the model adds negligible value in accounting for examination performance variability after adjusting for the other variables. For PGY-3 ratings, the baseline model fit best for both subcompetencies, suggesting similar performance on the IM-CE regardless of gender after adjusting for the other variables.

Table 3:
Milestone- and Gender-Related Regression Coefficients From Each Best-Fitting Hierarchical Linear Model Specificationa

The main-effect models yielded small examination performance differences between genders. Figures 1 and 2 show adjusted examination scores from the main-effect model for MK and PC ratings analyses, respectively. Although the baseline model fit best in PGY-3 for both subcompetencies, we plotted the main-effect model to illustrate the similarity in examination performance between genders across the ratings. Men slightly outperformed women who received similar MK ratings in PGY-1 and PGY-2 on the IM-ITE with adjusted difference scores of 1.7 (Cohen’s d = 0.28) and 1.5 (Cohen’s d = 0.25) percentage points, respectively, translating to men answering about 4 more questions correctly on the 260 scored items (Figure 1 and Supplemental Digital Appendix 3). The PGY-3 graph highlights the similarity in adjusted IM-CE performance as the regression lines for each gender are nearly indistinguishable. Results for PC (Figure 2 and Supplemental Digital Appendix 4) showed nearly identical trends and magnitudes of difference. The PGY-3 graph again highlights the similarity in adjusted IM-CE performance with nearly indistinguishable regression lines for each gender. Across both MK and PC ratings, the standardized mean differences for PGY-1 and PGY-2 never exceeded 0.30 and would be classified as small effects by traditional guidelines.
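
The conversion from percentage points to test items quoted above is straightforward to verify. The sketch below also back-calculates the standard deviation implied by each reported difference and Cohen's d; that implied value is our own derivation under the assumption that d equals the adjusted difference divided by a single standard deviation, and it is not a figure reported in the study.

```python
# Convert adjusted IM-ITE differences (percentage points) into scored items.
SCORED_ITEMS = 260


def extra_items(difference_pct_points):
    """Number of scored items corresponding to a percentage-point gap."""
    return SCORED_ITEMS * difference_pct_points / 100


def implied_sd(difference_pct_points, cohens_d):
    """Standard deviation implied by d = difference / SD (our assumption)."""
    return difference_pct_points / cohens_d


for label, diff, d in [("PGY-1", 1.7, 0.28), ("PGY-2", 1.5, 0.25)]:
    print(label,
          f"items = {extra_items(diff):.2f}",
          f"implied SD = {implied_sd(diff, d):.1f} percentage points")
```

Both differences correspond to roughly 4 of the 260 scored items, consistent with the figure given in the text.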

Figure 1:
Adjusted examination performance by MK milestone ratings and gender for the main-effects model within each PGY. From a national retrospective cohort study seeking to determine whether women and men rated similarly on milestones perform comparably on subsequent in-training and certification examinations, using data on U.S. internal medicine residents at Accreditation Council for Graduate Medical Education–accredited programs, 2014–2017 and 2015–2018 cohorts. Solid lines denote the estimated performance on the corresponding examination across the scale of MK ratings after adjusting for the variables included in the model. The shaded area reflects the 99% confidence interval around the estimated value. A color version of this figure is available in Supplemental Digital Appendix 3. Abbreviations: MK, medical knowledge; PGY, postgraduate year; ITE, Internal Medicine In-Training Examination; M, male; F, female; IM-CE, Internal Medicine Certification Examination.
Figure 2:
Adjusted examination performance by PC milestone ratings and gender for the main-effects model within each PGY. From a national retrospective cohort study seeking to determine whether women and men rated similarly on milestones perform comparably on subsequent in-training and certification examinations, using data on U.S. internal medicine residents at Accreditation Council for Graduate Medical Education–accredited programs, 2014–2017 and 2015–2018 cohorts. Solid lines denote the estimated performance on the corresponding examination across the scale of PC ratings after adjusting for the variables included in the model. The shaded area reflects the 99% confidence interval around the estimated value. A color version of this figure is available in Supplemental Digital Appendix 4. Abbreviations: PC, patient care; PGY, postgraduate year; ITE, Internal Medicine In-Training Examination; M, male; F, female; IM-CE, Internal Medicine Certification Examination.


This study of milestone ratings and subsequent examination scores for 2 national cohorts of IM residents demonstrates largely similar MK and PC milestone ratings for women and men. That is, with our large sample size, we identified milestone ratings differences in the hundredths of a point (on a 5-point scale) that are of questionable educational or clinical significance. Subsequent certification examination scores for women and men with similar milestone ratings were also similar. These results provide evidence of comparable demonstration of competence achievement using different measures. Overall, these findings provide reassuring evidence of a lack of systemic bias related to gender in milestone ratings, which are meant to document residents’ progressive development of competence toward entrustment for unsupervised practice.

Our findings add to prior studies of gender and milestone ratings that suggest milestone ratings from individual evaluators may show different results by gender than milestone ratings generated by program directors working with CCCs. A recent study of individual IM evaluator ratings demonstrated variable patterns for men and women residents, with slightly higher ratings for women in PGY-2 and for men in PGY-3. 32 In emergency medicine, while differences in milestone ratings favoring men were reported in a study of faculty ratings from 8 programs, a subsequent study of CCC ratings from all emergency medicine programs found only very small differences favoring men. 9,10 Attenuation of gender differences in the latter national study may have resulted from the larger sample size or from using ratings from the CCC rather than from individual faculty. Our study also used milestone ratings generated by program directors working with CCCs. Taken together, these studies of milestone ratings generated through CCCs may provide indirect evidence of the benefits of group decision making and bias mitigation through aggregating multiple data points into a single rating that CCCs may afford. 33 These benefits may arise through committee training about implicit bias (though the extent and impact of such training in CCCs are unknown), engagement in group discussion among diverse members, or procedures for structured committee discussions. Though our study may be interpreted as demonstrating a lack of gender bias in milestone ratings, individual programs can nonetheless engage in reviewing their own data to identify any local patterns that could signal differential ratings based on gender.

A strength of our study compared with other studies examining milestone ratings in relation to gender is the inclusion of in-training examination and certification examination scores. Previous studies have documented the association between IM MK ratings and certification examination scores 5,6 and between the IM-ITE and IM-CE. 7 Our differential prediction analyses indicate that men may slightly outperform women with comparable milestone ratings on examinations early in training (i.e., on the subsequent year’s IM-ITE), but these differences are small and disappear by the end of residency (i.e., on the IM-CE). These results also provide reasonable evidence that MK ratings and IM examinations assess a similar construct and that the PC subcompetency also relates to examination scores more so than the remaining subcompetencies.

The origin of the small differences favoring men in IM-ITE performance among residents with comparable milestone ratings is unknown. It is possible that men are underrated on the milestones relative to their early examination performance. Conversely, women may perform below expectations on the IM-ITE due to differences in learning opportunities and/or in assessments of their performance during training. For example, in a single-institution study of surgery residents, women reported receiving less respect for their professional role and receiving less mentoring. 34 Evaluation of narrative comments about women in performance evaluations and letters of recommendation reveals some evidence of bias in the language used, with women in some studies more likely to be described with relation-focused than competence-focused language. 16,32,35,36 The influence of stereotype threat for women, in which the fear of confirming a stereotype about one’s group leads to lower performance, could also be at play. 37 The IM-ITE may capture aspects of MK where women and men differ; however, the lack of difference in scores based on gender on the IM-CE makes this possibility less likely. Alternatively, the IM-ITE may manifest subtle bias favoring men. Overall, in our study, the small magnitude of gender differences, which diminished over time, casts doubt on the practical educational or clinical significance of these score differences.

This study has limitations. The data we used included milestone ratings from a single specialty across 2 cohorts and may not generalize to other specialties or cohorts. Our findings represent program director ratings submitted based on CCC data. It is possible that individual program directors or CCCs may have adjusted individual evaluators’ ratings to reduce bias. The survey items we used did not include nonbinary genders, and we did not know the gender of program directors and CCC members. Although PC ratings correlated more strongly with examination performance than any other subcompetency besides MK, there were limitations to using PC ratings as a predictor in the differential prediction analysis. PC ratings likely capture performance aspects outside those measured by examinations, and bias that may manifest in ratings may not manifest in examination scores. Additionally, milestone rating differences could be caused by differences in other resident characteristics, such as demographics, socioeconomic status, or prior training. Nonetheless, we believe the theoretical and observed relations between MK and PC ratings and examination performance justified investigation of potential differential performance, as differential prediction methods assume that the outcome variable is unbiased. We used highly reliable standardized examinations that have objective answers as outcome variables, though the possibility exists that some bias occurs within these examinations.


In this large national study of IM residents, MK and PC milestone ratings and IM-ITE and IM-CE performance for examinees with similar milestone ratings were largely comparable for female and male residents. Generally, our results suggest equitable outcomes for residents rated similarly on MK and PC subcompetencies. Although men slightly outperformed similarly rated women on examinations early in training, these differences disappeared by the final year of training. The lack of differential performance on the IM-CE after accounting for milestone ratings provides evidence for fair, unbiased milestone ratings generated by program directors and CCCs assessing residents’ MK and PC competence. While these results from point-in-time data are reassuring, further analyses of individual residents’ longitudinal data could elucidate developmental trajectories. It will be important to conduct further investigations as assessments, in particular milestones, change in future years to ensure that women have comparable opportunities to men to learn and build successful careers in medicine.


*The authors also conducted the differential prediction analyses with only milestone ratings, gender, and residency program to evaluate whether the results were robust to the exclusion of the other covariates. These analyses yielded similar results in model selection and magnitude of effects (data not shown).


1. Nasca TJ, Philibert I, Brigham T, Flynn TC. The next GME accreditation system—Rationale and benefits. N Engl J Med. 2012; 366:1051–1056
2. Green ML, Aagaard EM, Caverzagie KJ, et al. Charting the road to competence: Developmental milestones for internal medicine residency training. J Grad Med Educ. 2009; 1:5–20
3. Iobst W, Aagaard E, Bazari H, et al. Internal medicine milestones. J Grad Med Educ. 2013; 5(suppl 1):14–23
4. Hauer KE, Clauser J, Lipner RS, et al. The internal medicine reporting milestones: Cross-sectional description of initial implementation in U.S. residency programs. Ann Intern Med. 2016; 165:356–362
5. Hauer KE, Vandergrift J, Lipner RS, Holmboe ES, Hood S, McDonald FS. National internal medicine milestone ratings: Validity evidence from longitudinal three-year follow-up. Acad Med. 2018; 93:1189–1204
6. Hauer KE, Vandergrift J, Hess B, et al. Correlations between ratings on the resident annual evaluation summary and the internal medicine milestones and association with ABIM Certification Examination scores among US internal medicine residents, 2013-2014. JAMA. 2016; 316:2253–2262
7. McDonald FS, Jurich D, Duhigg LM, et al. Correlations between the USMLE Step examinations, American College of Physicians In-Training Examination, and ABIM Internal Medicine Certification Examination. Acad Med. 2020; 95:1388–1395
8. Hauer KE, Cate OT, Boscardin CK, et al. Ensuring resident competence: A narrative review of the literature on group decision making to inform the work of clinical competency committees. J Grad Med Educ. 2016; 8:156–164
9. Dayal A, O’Connor DM, Qadri U, Arora VM. Comparison of male vs female resident milestone evaluations by faculty during emergency medicine residency training. JAMA Intern Med. 2017; 177:651–657
10. Santen SA, Yamazaki K, Holmboe ES, Yarris LM, Hamstra SJ. Comparison of male and female resident milestone assessments during emergency medicine residency training: A national study. Acad Med. 2020; 95:263–268
11. Rand VE, Hudes ES, Browner WS, Wachter RM, Avins AL. Effect of evaluator and resident gender on the American Board of Internal Medicine evaluation scores. J Gen Intern Med. 1998; 13:670–674
12. Krause ML, Elrashidi MY, Halvorsen AJ, McDonald FS, Oxentenko AS. Impact of pregnancy and gender on internal medicine resident evaluations: A retrospective cohort study. J Gen Intern Med. 2017; 32:648–653
13. Thackeray EW, Halvorsen AJ, Ficalora RD, Engstler GJ, McDonald FS, Oxentenko AS. The effects of gender and age on evaluation of trainees and faculty in gastroenterology. Am J Gastroenterol. 2012; 107:1610–1614
14. Brienza RS, Huot S, Holmboe ES. Influence of gender on the evaluation of internal medicine residents. J Womens Health (Larchmt). 2004; 13:77–83
15. Holmboe ES, Huot SJ, Brienza RS, Hawkins RE. The association of faculty and residents’ gender on faculty evaluations of internal medicine residents in 16 residencies. Acad Med. 2009; 84:381–384
16. Rojek AE, Khanna R, Yim JWL, et al. Differences in narrative language in evaluations of medical students by gender and under-represented minority status. J Gen Intern Med. 2019; 34:684–691
17. Mueller AS, Jenkins TM, Osborne M, Dayal A, O’Connor DM, Arora VM. Gender differences in attending physicians’ feedback to residents: A qualitative analysis. J Grad Med Educ. 2017; 9:577–585
18. Loeppky C, Babenko O, Ross S. Examining gender bias in the feedback shared with family medicine residents. Educ Prim Care. 2017; 28:319–324
19. Sharma A, Schauer DP, Kelleher M, Kinnear B, Sall D, Warm E. USMLE Step 2 CK: Best predictor of multimodal performance in an internal medicine residency. J Grad Med Educ. 2019; 11:412–419
20. Falcone JL, Gonzalo JD. Relationship between internal medicine program board examination pass rates, accreditation standards, and program size. Int J Med Educ. 2014; 5:11–14
21. Falcone JL, Hamad GG. The state of performance on the American Board of Surgery Qualifying Examination and Certifying Examination and the effect of residency program size on program pass rates. Surgery. 2012; 151:639–642
22. Falcone JL, Middleton DB. Pass rates on the American Board of Family Medicine Certification Exam by residency location and size. J Am Board Fam Med. 2013; 26:453–459
23. Garibaldi RA, Subhiyah R, Moore ME, Waxman H. The in-training examination in internal medicine: An analysis of resident performance over time. Ann Intern Med. 2002; 137:505–510
24. Accreditation Council for Graduate Medical Education, American Board of Internal Medicine. The Internal Medicine Milestone Project. Published July 2015. Accessed February 8, 2021
25. Donnon T, Violato C. Medical students’ clinical reasoning skills as a function of basic science achievement and clinical competency measures: A structural equation model. Acad Med. 2006; 81(suppl 10):S120–S123
26. Woods NN. Science is fundamental: The role of biomedical knowledge in clinical reasoning. Med Educ. 2007; 41:1173–1177
27. Department of Health & Human Services Office of the Assistant Secretary for Health. OASH Regional Offices. Accessed February 8, 2021
28. Brennan RL, ed. Educational Measurement. 4th ed. Westport, CT: Praeger Publishers, 2006
29. Camilli G. Test fairness. In: Brennan RL, ed. Educational Measurement. 4th ed. Westport, CT: Praeger Publishers, 2006:221–256
30. Nakagawa S, Schielzeth H. A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods Ecol Evol. 2013; 4:133–142
31. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. J Stat Software. 2015; 67:1–48
32. Klein R, Julian KA, Snyder ED, et al.; From the Gender Equity in Medicine (GEM) workgroup. Gender bias in resident assessment in graduate medical education: Review of the literature. J Gen Intern Med. 2019; 34:712–719
33. Kinnear B, Warm EJ, Hauer KE. Twelve tips to maximize the value of a clinical competency committee in postgraduate medical education. Med Teach. 2018; 40:1110–1115
34. Myers SP, Hill KA, Nicholson KJ, et al. A qualitative study of gender differences in the experiences of general surgery trainees. J Surg Res. 2018; 228:127–134
35. Madera JM, Hebl MR, Martin RC. Gender and letters of recommendation for academia: Agentic and communal differences. J Appl Psychol. 2009; 94:1591–1599
36. French JC, Zolin SJ, Lampert E, et al. Gender and letters of recommendation: A linguistic comparison of the impact of gender on general surgery residency applicants. J Surg Educ. 2019; 76:899–905
37. Spencer SJ, Steele CM, Quinn DM. Stereotype threat and women’s math performance. J Exp Soc Psychol. 1999; 35:4–28


Copyright © 2021 by the Association of American Medical Colleges