The United States Medical Licensing Examination (USMLE) provides a single evaluation system for allopathic physicians seeking an initial license to practice medicine in the United States. Step 1 of this system assesses important basic science concepts, and Step 2 Clinical Knowledge (CK) assesses clinical science concepts essential for patient care under supervision.
Past research has documented a strong positive relationship between Steps 1 and 2 CK performance.1 It has also documented differences in Steps 1 and 2 CK performance by gender, indicating that men generally outperformed women on Step 1 and that women generally outperformed or performed similarly to men on Step 2 CK.1–4 When examined by subscore, the relationship between gender and Step 2 CK performance appears more complicated, with women outperforming men in some areas and men outperforming women in others.2,3
Controlling for differences in prematriculation measures, such as undergraduate grade point average (GPA) and Medical College Admission Test (MCAT) scores, partially reduced the gender-related score difference for Step 1.3,4 For Step 2 CK, controlling for collateral information, such as Step 1 score, increased the gender-related score difference.4 Although past research demonstrates gender-related differences in Steps 1 and 2 CK performance, research also indicates similar ultimate pass rates for men and women for both examinations.1,5
Given that the majority of this research was conducted over a decade ago, recent changes in medical students, basic science and clinical curricula at medical schools, and the USMLE may have affected the aforementioned relationships. One such change is the decrease in the number of Step 2 CK items in 2002. This resulted in an increase in the allotted time per item from 72 seconds (50 items per hour) to approximately 78 seconds (46 items per hour). A recent study showed that this Step 2 CK timing increase generally resulted in improvements in overall Step 2 CK examinee performance.6 No changes in the number of items or the time allotted per item occurred for Step 1 during this time period.
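The timing figures above follow from simple arithmetic on the item rates. A minimal sketch, assuming only the two rates quoted in the text (the function name is illustrative):

```python
def seconds_per_item(items_per_hour: int) -> float:
    """Average time allotted per item, in seconds, for a given hourly item rate."""
    return 3600 / items_per_hour


before_change = seconds_per_item(50)  # rate before the 2002 change
after_change = seconds_per_item(46)   # rate after the 2002 change

print(f"before: {before_change:.1f} s, after: {after_change:.1f} s")
```

The second value is slightly above 78 seconds, which the text rounds to 78.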
This paper is part of an ongoing validity-related investigation into the relationships between various examinee characteristics and USMLE performance. Such research is important because it can provide some understanding of how demographic information, educational influences, and testing characteristics affect examinee performance. Also, studying the relationships between the various components of the USMLE can help establish that the components are related in ways that ensure they collectively signal readiness to practice medicine, while simultaneously verifying that each component measures distinct proficiencies essential to safe and effective practice. The present analyses primarily focus on the relationships between examinee gender, examination timing (time per item), Step 1 score, and Step 2 CK performance. This paper has three main objectives: (1) to reevaluate earlier findings with more recent data; (2) to investigate the effect of gender and time per item on the relationship between Steps 1 and 2 CK performance; and (3) to examine how medical school characteristics influence the relationships between examinee characteristics and Step 2 CK performance.
The data came from the National Board of Medical Examiners databases and included scores and demographic information for examinees who matriculated into U.S. Liaison Committee on Medical Education–accredited medical schools between 1997 and 2002, took Step 1 for the first time between 1999 and 2004, and took Step 2 CK for the first time between 2000 and 2005. The sample included 54,487 examinees from 114 medical schools and spanned six entering cohorts. All examinees were given the option to decline the use of their score and demographic data for research purposes. Only examinees who allowed the use of their data for research purposes were included in the sample. Cohorts smaller than 80% of the median cohort size at a school and schools with fewer than five cohorts were excluded. Because there is an alternate route to licensure in Puerto Rico that can affect the perceived stakes of the examinations, Puerto Rican schools were excluded. Different campuses from the same school were included in the sample separately.
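The two size-based exclusion rules can be illustrated with a short sketch. It assumes a hypothetical mapping from school to entering-cohort sizes, and it assumes that the five-cohort rule is applied after undersized cohorts are dropped; the text does not state the order, so this is one plausible reading rather than the procedure actually used.

```python
from statistics import median


def apply_exclusions(cohort_sizes, min_fraction=0.8, min_cohorts=5):
    """Apply the cohort-size and cohort-count rules.

    cohort_sizes maps school -> {entering_year: number_of_examinees}
    (a hypothetical layout chosen for illustration).
    """
    kept = {}
    for school, sizes in cohort_sizes.items():
        # Drop cohorts smaller than 80% of the school's median cohort size.
        cutoff = min_fraction * median(sizes.values())
        retained = {year: n for year, n in sizes.items() if n >= cutoff}
        # Assumed order: count cohorts only after the size rule is applied.
        if len(retained) >= min_cohorts:
            kept[school] = retained
    return kept


# Made-up cohort sizes, not study data.
example = {
    "A": {1997: 100, 1998: 102, 1999: 98, 2000: 101, 2001: 60, 2002: 99},
    "B": {1997: 50, 1998: 52, 1999: 20, 2000: 51, 2001: 18, 2002: 49},
}
kept = apply_exclusions(example)
print(sorted(kept))        # schools that remain in the sample
print(sorted(kept["A"]))   # cohorts retained for school A
```

In this made-up example, school A loses its 2001 cohort but keeps five cohorts, while school B is left with four cohorts and is excluded.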
Hierarchical linear modeling approach
Because medical students are nested within medical schools, medical education data are often structured hierarchically. In other words, individual medical students attend certain medical schools, which themselves may have characteristics that affect the relationships between student characteristics and various outcome measures. For example, GPA may be a better predictor of Step 1 performance for students attending schools with low levels of grade inflation and challenging curricula. Specifically designed to analyze hierarchically structured data, Hierarchical Linear Modeling (HLM) simultaneously conducts regression analyses at both the student and school levels. Unlike single-level regression techniques, HLM can handle data that violate the assumption of independence, which is that an individual's data are not in any way systematically related to another individual's data. Data sets with medical students from the same medical school potentially violate this assumption, and therefore nonhierarchical regression procedures may be inappropriate. HLM techniques allow one to properly analyze hierarchically structured medical education data where both the characteristics of students and schools may matter for various outcomes.
In terms of this project, HLM techniques provide a mechanism for investigating the relationships between examinee characteristics, medical school characteristics, and Step 2 CK performance. In order to do this, HLM conceptually estimates a separate regression line predicting Step 2 CK scores from examinee characteristics for each medical school (random-coefficients models), and then the results of these regression analyses (intercepts and slopes) are used as dependent variables in the school-level analyses that include school characteristics as between-school predictors (means-and-slopes-as-outcomes models). Bryk and Raudenbush's7 comprehensive text provides more detailed information on the HLM approach.
Using the software HLM 6.0, a series of examinees-nested-in-schools analyses was fit to the data set with Step 2 CK total score as the dependent variable. These analyses included: (1) a random-effects analysis of variance (ANOVA) that partitioned total variation in Step 2 CK scores into within-school and between-school components; (2) a series of random-coefficients models used to determine (a) which examinee characteristics to use as within-school predictors, and (b) which characteristics should be fixed and which should be allowed to vary across schools; and (3) a series of means-and-slopes-as-outcomes models used to predict the impact of school characteristics on both mean Step 2 CK scores and the relationships between examinee characteristics and Step 2 CK performance.
Step 1 score and an indicator variable for gender (0 = male, 1 = female) were included in the models as examinee-level independent variables (within-school predictors). Another indicator variable included in the models as a within-school predictor specified whether an examinee took Step 2 CK before or after the decrease in Step 2 CK items in 2002 that resulted in an increase in the allotted time per item (0 = examinee received less time per item, 1 = examinee received more time per item). To control for possible variation in Step 2 CK scores due to language, an English as a Second Language (ESL) indicator variable was also included in the models as a within-school predictor (0 = native English speaker, 1 = ESL). The within-school predictors further included three interaction terms: Step 1*gender, Step 1*timing, and gender*timing. These interaction terms were included in the models to investigate the influence of gender and timing on the relationship between Step 1 score and Step 2 CK performance. All within-school predictors were group-mean centered. Table 1 provides means and standard deviations for Step 1 and Step 2 CK observed scores by the ESL, gender, and time per item indicator variables.
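In conventional multilevel notation, the examinee-level (level-1) model described above can be sketched as follows. The symbols are introduced here for illustration and are not taken from the original analyses: S, G, T, and E denote the group-mean-centered Step 1 score, gender, timing, and ESL variables for examinee i in school j. How the product terms were centered is not stated, so they are written simply as products.

```latex
\begin{aligned}
Y_{ij} ={}& \beta_{0j} + \beta_{1j} S_{ij} + \beta_{2j} G_{ij} + \beta_{3j} T_{ij} + \beta_{4j} E_{ij} \\
          & + \beta_{5j} (S_{ij} G_{ij}) + \beta_{6j} (S_{ij} T_{ij}) + \beta_{7j} (G_{ij} T_{ij}) + r_{ij},
\qquad r_{ij} \sim N(0, \sigma^{2})
\end{aligned}
```

Here Y is the Step 2 CK total score and r is the within-school residual. Whether each coefficient varies across schools or is held fixed is an empirical question that the random-coefficients analyses address.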
The school-level independent variables (between-school predictors) included percent of students who are female (mean = 46.0, SD = 6.2), percent of students who are native English speakers (mean = 90.4, SD = 6.8), school size as indicated by the average cohort size (mean = 94.4, SD = 43.3), and average Step 1 score (mean = 217.3, SD = 6.4). All between-school predictors were grand-mean centered. Because complete school-level data were not available for all medical schools included in the sample, the between-school predictors were calculated by aggregating examinee-level data.
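The two centering schemes and the aggregation of examinee-level data into school-level predictors can be illustrated with a small sketch. The function names and toy values are hypothetical, and the grand mean is taken here as the unweighted mean of the school-level values, which is an assumption.

```python
from collections import defaultdict
from statistics import mean


def group_mean_center(values, groups):
    """Subtract each school's own mean from its examinees' values."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    group_means = {g: mean(vs) for g, vs in by_group.items()}
    centered = [v - group_means[g] for v, g in zip(values, groups)]
    return centered, group_means


def grand_mean_center(school_values):
    """Subtract the mean across schools from each school-level value."""
    grand = mean(school_values.values())
    return {school: v - grand for school, v in school_values.items()}


# Toy Step 1 scores for four examinees in two schools (not study data).
step1 = [200, 220, 210, 230]
school = ["A", "A", "B", "B"]

centered_step1, school_mean_step1 = group_mean_center(step1, school)
centered_school_means = grand_mean_center(school_mean_step1)

print(centered_step1)         # within-school deviations
print(centered_school_means)  # school means relative to the grand mean
```

The school means produced by the first function are the aggregated between-school predictor; the second function then centers that predictor across schools.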
The top portion of Table 2 provides the results of the random-effects ANOVA. Results indicated that 94.2% of the total variation in Step 2 CK scores was within schools and 5.8% was between schools. Although some variability in Step 2 CK performance exists from school to school, the majority occurs among examinees within schools.
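The between-school share reported above corresponds to the intraclass correlation from the random-effects ANOVA. In standard notation (the symbols are introduced here for illustration), with between-school variance \tau_{00} and within-school variance \sigma^{2}:

```latex
\rho = \frac{\tau_{00}}{\tau_{00} + \sigma^{2}} \approx 0.058,
\qquad
1 - \rho = \frac{\sigma^{2}}{\tau_{00} + \sigma^{2}} \approx 0.942
```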
Exploratory results showed that Step 1 score included in a model alone explained 48.9% of the within-school variation in Step 2 CK performance (variance component of 244.37), and gender included alone explained 0.35% of the within-school variation in Step 2 CK performance (variance component of 476.35). Although gender used alone explains little variation in Step 2 CK score, the effect was statistically significant, with women generally outperforming men by approximately 2.6 points.
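The percentages above are proportional reductions in the within-school variance component relative to the random-effects ANOVA. The sketch below back-calculates the baseline within-school variance implied by each reported pair of figures. The baseline itself is not quoted in the text, so the derived values are approximations, and the function names are illustrative.

```python
def proportion_explained(null_variance: float, residual_variance: float) -> float:
    """Proportional reduction in within-school variance relative to the null model."""
    return (null_variance - residual_variance) / null_variance


def implied_null_variance(residual_variance: float, explained: float) -> float:
    """Baseline variance implied by a residual variance and a proportion explained."""
    return residual_variance / (1 - explained)


# Reported pairs: (residual variance component, proportion explained).
from_step1 = implied_null_variance(244.37, 0.489)
from_gender = implied_null_variance(476.35, 0.0035)

print(round(from_step1, 1), round(from_gender, 1))
```

The two implied baselines agree to within rounding of the reported percentages, which suggests the two sets of figures are mutually consistent.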
The initial random-coefficients model included all within-school predictors. The gender*timing interaction explained little variation in Step 2 CK scores and was removed. All other within-school predictors were statistically significant and retained for subsequent analyses. Results indicated that the intercept and slope for Step 1 scores should be allowed to vary across schools, while all other predictors could be fixed across schools.
The middle portion of Table 2 presents the results of the final random-coefficients model, which explained 52.2% of the within-school variation in Step 2 CK performance. The average Step 2 CK score across schools was 218.41 with a standard error of 0.52. Step 1 score was positively related to Step 2 CK score, with an approximate 7.5-point (0.75 × 10) increase in Step 2 CK expected for every 10-point increase in Step 1. Controlling for Step 1 scores, ESL examinees generally scored lower than native English speakers (approximately 4.8 points), and examinees with more time per item generally scored slightly higher than examinees with less time per item (approximately 1.5 points).
Consistent with past research, women generally scored higher than men on Step 2 CK. Controlling for differences in Step 1 score increased the gender-related difference in Step 2 CK performance, with women outperforming men by approximately 7.0 points (compared to 2.6).
Overall, the regression of Step 2 CK performance on Step 1 scores was steeper for men compared with women, and the regression of Step 2 CK performance on exam time per item was steeper for examinees with lower Step 1 scores compared with examinees with higher Step 1 scores. The specific results by gender by time are as follows: for examinees who received less time per item, with every 10-point increase in Step 1 score, Step 2 CK score was expected to increase approximately 7.6 points (0.76 × 10) for men and approximately 7.2 points ([0.76 - 0.04] × 10) for women. For examinees who received more time per item, for every 10-point increase in Step 1 score, Step 2 CK score was expected to increase approximately 7.2 points ([0.76 - 0.04] × 10) for men and approximately 6.8 points ([0.76 - 0.04 - 0.04] × 10) for women.
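The four slopes above can be reproduced from the three coefficients quoted in the text. This is a minimal sketch that reads the indicators as 0/1 contrasts, as the text does; the constant and function names are illustrative.

```python
STEP1 = 0.76             # Step 1 slope for men who received less time per item
STEP1_X_GENDER = -0.04   # change in Step 1 slope for women
STEP1_X_TIMING = -0.04   # change in Step 1 slope with more time per item


def step1_slope(female: int, more_time: int) -> float:
    """Expected change in Step 2 CK score per 1-point increase in Step 1 score."""
    return STEP1 + STEP1_X_GENDER * female + STEP1_X_TIMING * more_time


for more_time in (0, 1):
    for female in (0, 1):
        per_ten_points = 10 * step1_slope(female, more_time)
        print(f"more_time={more_time} female={female}: {per_ten_points:.1f}")
```

Because the two interaction coefficients are equal in size, women with less time per item and men with more time per item share the same expected slope.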
Means-and-slopes-as-outcomes models were estimated to identify school characteristics that explain between-school variation in: (1) mean Step 2 CK scores; and (2) the effect of Step 1 scores on Step 2 CK performance. The final means-and-slopes-as-outcomes model included the same within-school predictors as the random-coefficients model. It also included schools' mean Step 1 score and percent native English speaker as between-school predictors of mean Step 2 CK scores, and school size and mean Step 1 score as between-school predictors of the relationship between Steps 1 and 2 CK. Percent female explained little variation in mean Step 2 CK scores or the relationship between Steps 1 and 2 CK and was excluded.
The lower portion of Table 2 presents the results of the final means-and-slopes-as-outcomes model, which explained 52.2% of the within-school variation and 67.3% of the between-school variation in Step 2 CK scores. The within-school regression coefficients were all statistically significant, and the results were consistent with the final random-coefficients model.
The between-school variables indicated that school characteristics affected: (1) school mean Step 2 CK scores; and (2) the relationship between examinee performance on Steps 1 and 2 CK. Schools with higher mean Step 1 scores and schools with more native English speakers, on average, had higher mean Step 2 CK scores. More specifically, for every 10-point increase in schools' average Step 1 score, the mean Step 2 CK score increased, on average, 7.3 points (0.73 × 10). For every 10% increase in the percentage of native English speakers, schools' mean Step 2 CK scores increased, on average, 1.5 points (0.15 × 10).
In addition, mean Step 1 scores and school size affected the slope of the regression of Step 2 CK performance on Step 1 scores. The regression was steeper at schools with higher mean Step 1 scores: the slope increased, on average, 0.02 points (0.002 × 10) for every 10-point increase in schools' average Step 1 scores. At larger schools, the regression of Step 2 CK scores on Step 1 was flatter: for every 10-person increase in a school's average cohort size, the predicted slope for Step 1 decreased 0.003 points (-0.0003 × 10).
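The school-level effects on the Step 1 slope can be sketched as a simple prediction function. The coefficients 0.002 and -0.0003 and the grand means (217.3 and 94.4) come from the text. The baseline slope of 0.76 is an assumption borrowed from the within-school results for illustration, not a reported intercept of this school-level equation, and the names are hypothetical.

```python
GRAND_MEAN_STEP1 = 217.3    # mean of schools' average Step 1 scores
GRAND_MEAN_COHORT = 94.4    # mean of schools' average cohort sizes
BASE_SLOPE = 0.76           # assumed Step 1 slope at an average school

SLOPE_PER_STEP1_POINT = 0.002     # change in slope per point of school mean Step 1
SLOPE_PER_COHORT_MEMBER = -0.0003  # change in slope per additional cohort member


def predicted_step1_slope(school_mean_step1: float, cohort_size: float) -> float:
    """Predicted within-school Step 1 slope, using grand-mean-centered predictors."""
    return (
        BASE_SLOPE
        + SLOPE_PER_STEP1_POINT * (school_mean_step1 - GRAND_MEAN_STEP1)
        + SLOPE_PER_COHORT_MEMBER * (cohort_size - GRAND_MEAN_COHORT)
    )


average = predicted_step1_slope(217.3, 94.4)
higher_scoring = predicted_step1_slope(227.3, 94.4)
larger = predicted_step1_slope(217.3, 104.4)
print(round(higher_scoring - average, 4), round(larger - average, 4))
```

Both school-level effects are small relative to the baseline slope, so the Step 1 and Step 2 CK relationship remains strongly positive across the range of schools in the sample.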
Using recent data, the current study replicated previous findings related to the relationship between gender and Step 2 CK performance. On average, women outperformed men on Step 2 CK. Previous researchers have provided possible explanations for this difference. For example, women may have performed better in specialty areas that have historically interested women and are included in Step 2 CK, such as obstetrics and gynecology.2,3,8 A closer examination of the relationship between gender and Step 2 CK performance by subscore is warranted in future research.
This study also examined two issues rarely addressed in the literature: (1) the influence of gender and timing on the relationship between Steps 1 and 2 CK; and (2) the effect of school characteristics on the relationship between examinee characteristics, such as Step 1 score, and Step 2 CK performance. Step 1 score was more strongly associated with Step 2 CK performance for men than for women. In addition to the specific concepts measured by Step 1, other factor(s) may have affected women's Step 2 CK performance. For example, as others have noted, women may have entered medical schools with weaker basic science backgrounds than men, as reflected in their generally lower Step 1 scores, but closed this gap through their medical school training.2,3 In this sense, women's medical school training may have influenced their Step 2 CK performance in a way that it may not have for men.
Further, time per item was more strongly associated with Step 2 CK performance for examinees with lower Step 1 scores than for examinees with higher Step 1 scores. Step 1 performance may have reflected not only what an examinee generally knew, but also how well they knew it, and consequently how quickly and accurately they could recall it. Consistent with this notion, one recent study found that Step 1 scores were negatively related to item-response time, with higher-scoring examinees, on average, using less time per item than lower-scoring examinees.9 Therefore, it is not surprising that examinees who scored higher on Step 1 benefited less from increases in time per item, because such increases in time may disproportionately help examinees who are less certain about their answers. More research is needed to fully understand the influence of time per item on the relationship between Steps 1 and 2 CK performance.
In terms of school characteristics, mean Step 1 performance and school size influenced the relationship between Steps 1 and 2 CK. The regression of Step 2 CK performance on Step 1 scores was higher and steeper for students from schools with higher mean Step 1 scores than for students from schools with lower mean Step 1 scores. This finding may simply indicate that students with high academic potential are more likely to see that potential activated in terms of Step 2 CK performance when they are surrounded by other high-potential students.
The regression of Step 2 CK performance on Step 1 scores was also higher and steeper for students from larger schools compared to students from smaller schools. This may reflect a narrower curricular focus at many small schools, which may activate a student's potential in a few specific areas but leave a significant portion of that potential untapped in terms of the total range of areas covered by Step 2 CK.
Although the present study revisits existing research and provides some useful information from new inquiries, there is much related research still to be done. In particular, analyses similar to those conducted in this study should be done predicting scores on Steps 1, 2 Clinical Skills, and 3 of the USMLE. Future research should also continue to explore the role of medical school characteristics in examinee performance, and as such, HLM techniques should be more actively incorporated into future research, since multilevel analyses are clearly important for studies involving hierarchically structured medical education data.
1 Case SM, Swanson DB, Ripkey DR, Bowles LT, Melnick DE. Performance of the class of 1994 in the new era of USMLE. Acad Med. 1996;71:S91–S93.
2 Weinberg E, Rooney JF. The academic performance of women students in medical school. J Med Educ. 1973;48:240–7.
3 Case SM, Becker DF, Swanson DB. Performance of men and women on NBME Part I and Part II: The more things change. Acad Med. 1993;68:S25–S27.
4 Dawson B, Iwamoto CK, Ross LP, et al. Performance on the National Board of Medical Examiners Part I Examination by men and women of different race and ethnicity. JAMA. 1994;272:674–9.
5 DeChamplain AF, Winward ML, Dillon GF. Modeling passing rates on a computer-based medical licensing examination: An application of survival data analysis. Educ Meas. 2004;23:15–22.
6 Swygert K, Muller E, Clauser BE, Dillon GF, Swanson DB. The impact of timing changes on examinee pacing on the USMLE Step 2 Exam. Acad Med. 2004;79:S52–S54.
7 Bryk AS, Raudenbush SW. Hierarchical Linear Models: Applications and Data Analysis Methods. Newbury Park, CA: Sage Publications, 1992.
8 Krueger PM. Do women medical students outperform men in obstetrics and gynecology? Acad Med. 1998;73:S101–S102.
9 Swanson DB, Case SM, Ripkey DR, Clauser BE, Holtman MC. Relationships among item characteristics, examinee characteristics, and response times on USMLE Step 1. Acad Med. 2001;76:S114–S116.