Graduates of MD-granting medical schools who wish to practice medicine in the United States must achieve passing scores on all components of the United States Medical Licensing Examination (USMLE).1 The USMLE consists of four separate examinations designed to assess an examinee’s understanding of and ability to apply concepts and principles that are important in providing effective and safe patient care. The four examinations are Step 1, Step 2 Clinical Knowledge (CK), Step 2 Clinical Skills (CS), and Step 3. Although medical education, both undergraduate and graduate, is not primarily directed at teaching physicians to achieve high performance on these exams, passing these examinations en route to medical licensure is unquestionably an important outcome of medical education.
Clinical clerkships often include objective examinations to assess students’ performance. For example, many programs use the National Board of Medical Examiners (NBME) clinical subject examinations (“subject exams”) as one component of student assessment.2,3 Subject exams are intended to assess a medical student’s ability to solve scientific and clinical problems. One reason these exams are useful to medical educators and administrators is that a student’s scores can be compared directly against the exams’ national norms.
Because the USMLE Step 2 (CK and CS) and Step 3 exams assess clinically relevant knowledge and skills, it is important to examine the correlations of performance on other assessments of clinical knowledge during medical school (e.g., scores on subject exams) with performance on these Step exams. If it is shown that a student’s subject exam performance is associated with poor performance or failure on Steps 2 or 3, remediation efforts could be instituted early in clinical education to help the student avert future difficulties on licensing examinations.
To date, few studies have evaluated the association between subject exam performance and Step exam performance, and most of these studies focused on Step 1 and Step 2 CK (taken during undergraduate education) and included results from only one clerkship.4–9 Ogunyemi and De Taylor-Harris5 investigated the relationship between poor performance on the obstetrics–gynecology subject exam and performance on Step 2 CK and found that failure on the subject exam was a stronger predictor of Step 2 failure than undergraduate grade point average, score on the Medical College Admission Test (MCAT), clinical faculty evaluation, race, or gender. Several studies have reversed the direction of interest, using subject exams as the outcome variable, and have shown that poor performance on Step 1 is associated with subsequent poor performance on subject exams in individual clerkships.4,7–9 We recently examined correlations between subject exam scores and Step 1 and Step 2 CK scores,10 as well as Step 2 CS and Step 3 scores.11 The moderate correlations between subject exam scores and Step 1 and Step 2 CK scores were not surprising, as subject exam items are built in the same rigorous manner as the Step 1 and Step 2 CK exams, but little research exists on the relationship between subject exam scores and Step 3 performance.12 Our main purpose in carrying out the present study was to help fill this gap and determine whether poor performance on one or more subject exams across six core clerkships is associated with a significant increase in the likelihood of failing Step 3.
Study context and participants
The F. Edward Hébert School of Medicine, Uniformed Services University of the Health Sciences (USU) matriculates approximately 170 students annually. Upon entry into USU, students are commissioned as officers in one of four of the United States uniformed services (Army, Navy, Air Force, and Public Health Service). After graduation, nearly all students enter a military-affiliated postgraduate training program for at least one year, with the vast majority continuing in a military-affiliated residency program after completion of the first postgraduate year. The sample for the present study, which we conducted in 2012, came from the cohort of students graduating between 2007 and 2011 (N = 853). For those academic years, USU provided a traditional four-year curriculum, including two years of basic science-focused courses followed by two years of clinically oriented education.
The third-year curriculum consisted of the school’s core clerkship rotations: family medicine (6 weeks), internal medicine (12 weeks), surgery (12 weeks), psychiatry (6 weeks), pediatrics (6 weeks), and obstetrics–gynecology (6 weeks). All core clerkship rotations, then and now, use the relevant clinical subject exam, which is given near the end of the core rotation as a component of a student’s clerkship grade; the weight assigned to the examination varies by clerkship. Students are required to pass the subject exam in order to achieve a passing grade for the clerkship, and each clerkship has policies in place to address failure on an initial attempt at the examination. These policies include suggested remediation methods and a retesting schedule so that the student may, hopefully, pass the clerkship and progress to the fourth year. In addition, all military trainees are required, according to military service regulations, to possess a medical license within one year of completing the first year of postgraduate training. For this purpose, they must pass the Step 3 exam.
The data for this analysis were obtained from two sources: the USU Long-Term Career Outcome Study (LTCOS) and the NBME. The end dates of the data from the LTCOS and NBME were June 2010 and June 2011, respectively. The LTCOS is a comprehensive investigation of performance measures of USU medical students. The LTCOS follows the academic and clinical careers of its graduates from admission to clinical practice. For the present study, students’ subject exam scores were extracted from the LTCOS database, and Step 3 scores for all USU graduates in the initial sample were provided by the NBME. Data from the NBME were provided in accordance with a collaborative data-sharing agreement between the USU and the NBME (included in the institutional review board approval), which allowed NBME staff to assemble a deidentified matched dataset, under secure conditions, of Step 3 scores for research purposes only. USMLE policy allows the use of such data for collaborative research provided that the data are kept confidential and that individual students are not identified in the study.13 None of the USU students included in the study had indicated that their scores should not be included in research initiatives. In the LTCOS dataset, the subject exam score recorded was the score achieved on the student’s first attempt at the examination during the clerkship year. For Step 3, the score used was also for the student’s first attempt at the exam. Given the time gap between test taking and score report and the requirement of passing Step 3 within one year of graduation for USU students, the first-attempt performance of Step 3 is particularly important. Further, first-takers represent the full spectrum of abilities and have all taken the exam under equal and standardized conditions (i.e., none of them have ever seen live Step 3 material before).
For the purpose of this analysis, categorization of exam scores as “poor performance” or “failures” depended on the examination. Failure on Step 3 was a reported score of Fail by the NBME, using the minimum passing standard in use at the time of sitting for the exam.14 Although poor performance criteria for subject exams are not as universally defined, the NBME does provide supplementary information and criteria regarding categorization of subject exam scores and potential interpretation of scores relative to student performance. However, individual medical schools usually determine their own criteria for failure. For the current study, we defined poor performance on subject exams by using the national norms (mean and standard deviation [SD]) reported by the NBME for the corresponding test year.15 For example, for the internal medicine clerkship of the Class of 2011, the national mean was 75.8 with an SD of 8.3. A USU student in this class who took the internal medicine subject exam would receive one “flag,” indicating poor performance, if that student’s score was at least 1 SD below this mean, which corresponds to a score of 67.5 or less. We repeated this procedure for each clerkship and class to identify poor performance for each of the subject exams. We then calculated the total number of flags a student received (e.g., scoring 1 SD or more below the mean on three subject exams would result in three flags) and categorized the students into three groups. The first group consisted of those who did not receive any flags on the subject exams, the second group consisted of those who received only one flag, and the third group consisted of those who received two or more flags.
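The flagging rule described above can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function names and the `NORMS` table are hypothetical, and only the internal medicine norm (mean 75.8, SD 8.3) comes from the text. The other norms are placeholders, and in practice the norms would also vary by class year.

```python
from typing import Dict, Tuple

# (national mean, national SD) per clerkship for one test year.
# Only the internal medicine values are taken from the text; the rest are
# hypothetical placeholders used for illustration.
NORMS: Dict[str, Tuple[float, float]] = {
    "internal medicine": (75.8, 8.3),
    "family medicine": (72.0, 8.0),
    "surgery": (73.0, 8.5),
    "psychiatry": (78.0, 8.0),
    "pediatrics": (76.0, 8.5),
    "obstetrics-gynecology": (74.0, 8.0),
}


def is_flagged(score: float, mean: float, sd: float) -> bool:
    """Return True if the score is at least 1 SD below the national mean."""
    return score <= mean - sd


def count_flags(scores: Dict[str, float],
                norms: Dict[str, Tuple[float, float]]) -> int:
    """Count the subject exams on which a student was flagged."""
    return sum(is_flagged(score, *norms[clerkship])
               for clerkship, score in scores.items())


def flag_group(n_flags: int) -> str:
    """Map a flag count to one of the three analysis groups."""
    if n_flags == 0:
        return "no flag"
    if n_flags == 1:
        return "one flag"
    return "two or more flags"


# Hypothetical student: flagged on internal medicine and surgery only.
student = {
    "internal medicine": 66.0,
    "family medicine": 75.0,
    "surgery": 64.0,
    "psychiatry": 80.0,
    "pediatrics": 78.0,
    "obstetrics-gynecology": 76.0,
}
group = flag_group(count_flags(student, NORMS))
```

Here `group` evaluates to "two or more flags", because the hypothetical student falls at least 1 SD below the (partly assumed) norms on two exams.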
The statistical analyses consisted of three parts. The first part involved computing descriptive statistics on the subject exam and Step 3 scores, as well as the Pearson correlation coefficients among these scores. The second part of the analysis used contingency tables to investigate the association between poor performance on subject exams and the probability of passing/failing Step 3. Using these contingency tables, we compared the observed frequencies in each cell with the expected frequencies. Expected frequencies were calculated assuming the null hypothesis that there was no association between the flag group in which an examinee fell and the examinee’s pass/fail outcome on Step 3. A larger and consistent discrepancy between observed and expected frequencies would indicate a higher likelihood of rejection of the null hypothesis. A chi-square (χ2) test of independence was performed against the null hypothesis. Finally, we fit logistic regression models to determine the odds of failing Step 3 for students who performed poorly on subject exam(s) when compared with those without poor performance flags. All statistical analyses were performed with SPSS Version 21.0 (Armonk, New York). This study was approved by the USU institutional review board.
Among the 853 students graduating between 2007 and 2011, 802 had initial Step 3 scores in the NBME database (157 of Class of 2007, 159 of Class of 2008, 157 of Class of 2009, 163 of Class of 2010, and 166 of Class of 2011). This dataset was reduced to 654 students, all of whom had complete data for all six clerkship subject exam scores (119 of Class of 2007, 117 of Class of 2008, 150 of Class of 2009, 157 of Class of 2010, and 111 of Class of 2011). The primary reason for the incomplete data was a lack of scores on the obstetrics–gynecology subject exam because of turnover in the clerkship director position. The availability of the data from a class was not linked to subject exam performance, nor do we suspect that the availability of NBME data is linked to performance. As a result, we do not think that the incompleteness introduces bias into the sample or otherwise contaminates the results. Among these 654 students, 23 (3.5%) failed Step 3 on the first attempt. The numbers (percentages) of failures in each of the five class years were as follows: 2 out of 119 (1.7%) in the Class of 2007, 6 out of 117 (5.1%) in 2008, 8 out of 150 (5.3%) in 2009, 5 out of 157 (3.2%) in 2010, and 2 out of 111 (1.8%) in 2011. Table 1 presents the means and SDs of subject exam scores for students who passed Step 3 and for those who did not, as well as correlation coefficients between scores on each subject exam and the Step 3 score. Subject exam means were consistently lower for students who did not pass Step 3. The bivariate correlations between the subject exams and Step 3 were moderate (r values ranging from 0.49 to 0.57; all P values < .01).
Contingency table analyses
Table 2 presents the results of the contingency table analyses. For the “no flag” group, the observed number of students failing Step 3 was lower than the expected number: 1 student failed, whereas 11.7 failures were expected. In contrast, for both the “one flag” and “two or more flags” groups, the observed number who failed Step 3 was higher than the expected number. The discrepancy between the observed number and the expected number was larger for the “two or more flags” group than for the “one flag” group. This consistent pattern is a sign of association between subject exam performance and the chance of failing/passing Step 3. The test statistic was significant, χ2 = 26.63 (df = 2, P < .0005), indicating that these two factors were associated, and the frequencies showed the association to be in the expected direction. In fact, the majority of the students who failed Step 3 could be identified by poor performance on subject exams; students who showed poor performance on two or more subject exams (representing 27% of the cohort) made up 16 (70%) of the total of 23 students who failed Step 3.
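As an illustration of how the expected frequencies and the χ2 statistic are obtained, the sketch below works through a 3 × 2 table in Python. The cell counts are an assumption: they are not copied from Table 2 but were chosen to be consistent with the figures reported in the text (654 students, 23 failures, 1 observed versus 11.7 expected failures in the “no flag” group, and 16 failures in the “two or more flags” group). The function names are hypothetical, and the study itself used SPSS.

```python
import math
from typing import Dict, Tuple

# group -> (passed Step 3, failed Step 3).
# ASSUMED counts, consistent with the totals reported in the text but not
# taken directly from Table 2.
OBSERVED: Dict[str, Tuple[int, int]] = {
    "no flag": (332, 1),
    "one flag": (140, 6),
    "two or more flags": (159, 16),
}


def expected_counts(observed):
    """Expected (pass, fail) counts under independence:
    row total * column total / grand total."""
    n = sum(p + f for p, f in observed.values())
    total_pass = sum(p for p, _ in observed.values())
    total_fail = sum(f for _, f in observed.values())
    return {
        group: ((p + f) * total_pass / n, (p + f) * total_fail / n)
        for group, (p, f) in observed.items()
    }


def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for an r x 2 table."""
    expected = expected_counts(observed)
    stat = 0.0
    for group in observed:
        for obs, exp in zip(observed[group], expected[group]):
            stat += (obs - exp) ** 2 / exp
    df = (len(observed) - 1) * (2 - 1)
    return stat, df


def p_value_df2(stat: float) -> float:
    """Upper-tail P value; for df = 2 the chi-square tail is exactly exp(-x/2)."""
    return math.exp(-stat / 2)


expected = expected_counts(OBSERVED)
stat, df = chi_square(OBSERVED)
p = p_value_df2(stat)
```

With these assumed counts, the expected number of failures in the “no flag” group rounds to 11.7 and the statistic comes out close to the reported 26.63 on 2 degrees of freedom. The closed-form P value applies only because df = 2; a general implementation would use a statistics library instead.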
Logistic regression modeling
Finally, the logistic regression modeling was intended to quantify the magnitude of the associations of interest, with the null hypotheses for the two models being that the odds of failing Step 3 would be no different for those students receiving one or more flags than for those receiving no flag. Those students receiving one flag on the subject exams had 14.23-fold higher odds of failing Step 3 than did those with no flags (95% CI 1.7–119.3), whereas those receiving two or more flags had 33.41-fold higher odds when compared with those with no flags (95% CI 4.4–254.2). This indicates that the odds of failing are considerably greater when an examinee has at least one flag (i.e., when that examinee performs poorly on a subject exam).
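When a logistic regression model contains only indicator variables for group membership, each fitted odds ratio equals the cross-product ratio of the corresponding 2 × 2 table, and its Wald confidence interval can be computed by hand. The Python sketch below shows that calculation. It is not the SPSS procedure used in the study, the function name is hypothetical, and the cell counts are assumed values chosen to be consistent with the figures reported in the text rather than taken from Table 2.

```python
import math
from typing import Tuple


def odds_ratio_wald_ci(fail_group: int, pass_group: int,
                       fail_ref: int, pass_ref: int,
                       z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio of failing (group vs. reference) with a Wald 95% CI.

    The standard error of log(OR) is the square root of the sum of the
    reciprocals of the four cell counts.
    """
    odds_ratio = (fail_group * pass_ref) / (pass_group * fail_ref)
    se_log_or = math.sqrt(1 / fail_group + 1 / pass_group
                          + 1 / fail_ref + 1 / pass_ref)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# ASSUMED counts (not copied from Table 2): reference group is "no flag".
NO_FLAG_FAIL, NO_FLAG_PASS = 1, 332

one_flag = odds_ratio_wald_ci(6, 140, NO_FLAG_FAIL, NO_FLAG_PASS)
two_plus_flags = odds_ratio_wald_ci(16, 159, NO_FLAG_FAIL, NO_FLAG_PASS)
```

Under these assumed counts, the results agree closely with the odds ratios and confidence intervals reported above. The very wide intervals follow from the reference group containing a single failure: the 1/1 term dominates the standard error of the log odds ratio.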
Discussion and Conclusions
The goal of this study was to determine whether subject exam performance is associated with Step 3 performance; more specifically, we wanted to know whether students who did poorly on the subject exams were at a significantly higher risk of failing Step 3. We believe this goal is important for two reasons. First, as noted previously, although several studies have addressed failure or poor performance as predictors of poor USMLE performance, most of those studies addressed performance in a single clerkship or did not include Step 3 as an outcome variable. Second, if a statistically significant link were found between one or more subject exams and Step 3 failure, this would allow medical educators to identify students as “at risk of failing Step 3” a full two years before they are expected to sit for the Step exam; thus, educators could institute remediation even before students’ Step 2 CK and CS outcomes are reported. To this end, we began by quantifying the relationship between the scores of the core clerkship subject exams and Step 3 and found moderate positive correlations. Our subsequent analyses showed that, in our sample, poor performance on subject exams, defined by scoring at least 1 SD below the national mean, was associated with increased chances of failing Step 3. The increased odds of failing Step 3 were statistically significant even if a student performed poorly on only one subject exam, although poor performance on more than one subject exam increased the chances of failing Step 3 to an even greater extent.
The identification of risk factors for failure on Step 3 is important because students at MD-granting schools who do not pass Step 3 cannot become licensed to practice medicine in the United States. These students will have spent at least five years in training, developing knowledge, skills, and attitudes that they will ultimately not be able to use, representing a considerable waste of their time and money as well as institutional and societal resources. Although it may appear obvious that students performing very poorly on a subject exam may be at risk of failing one of the subsequent Step exams, it may not be as apparent that students who score as little as 1 SD below the national mean are also at significant risk. However, as our findings demonstrate, poor performance defined by this criterion, even for one subject exam, is strongly associated with failing Step 3.
Although there is some research to show that certain “premedical school” measures, such as the MCAT scores, have a small amount of predictive validity for the Step 3 exam,16 the present study revealed that the use of subject exam performance and a national norm-based flag may also provide useful parameters for identifying students at risk of future difficulty on licensing examinations. Such identification affords residency program directors an early opportunity to begin remediation and mentoring to assist students at risk. For example, identified trainees could be offered ancillary educational experiences such as guided reading programs, test-taking strategy sessions, and more intensive board review courses soon after beginning their internships.17 Such early intervention may also help prepare trainees for their first in-training examination, which is often offered five to six months after the start of internship. Program directors could use this information as an “early warning” for examination difficulty and could institute programs early in residency training that could prevent poor performance on this examination, one that is essential for licensure.
The limitations of the present study include the single-institution nature of the investigation, which makes it difficult to extrapolate the results to other medical schools. However, the vast majority of medical schools accredited by the Liaison Committee on Medical Education use subject exams,2 and so we believe our findings are important and may generalize beyond our single institution. Moreover, the use of the national norms of mean and SD of subject exams to define poor performance provides additional generalizability of our results. That being said, although the 1 SD method of defining poor performance was associated with later Step 3 failure at our institution, other schools will need to perform similar sets of analyses for their students to find the cutoff point that is most helpful in assessing which of their students is at risk, and in a way that identifies most of those at risk for failing while not flagging too many who are not at as significant a risk. In addition, it should be noted that the USMLE has announced changes expected for Step 3 over the next few years.18 The redesigned exam will increase the test items that assess knowledge of foundational science. Whether the relationship between subject exams and Step 3 would remain the same as reported in the present study is yet to be examined.
Notwithstanding these limitations, we believe this study is the first to investigate the relationship between performance on subject exams across core clerkships and performance on Step 3. Our findings provide evidence for a method and analysis that may be useful in identifying at-risk students on the basis of poor performance on subject exams. Using this information, medical educators in both undergraduate and graduate medical education will be in a better position to intervene in their trainees’ education to enhance their chances of successfully passing Step 3.
1. Information about the United States Medical Licensing Examination (USMLE). http://www.usmle.org. Accessed January 10, 2014
3. Torre D, Papp K, Elnicki M, Durning S. Clerkship directors’ practices with respect to preparing students for and using the National Board of Medical Examiners Subject Exam in medicine: Results of a United States and Canadian Survey. Acad Med. 2009;84:867–871
4. Myles T, Galvez-Myles R. USMLE Step 1 and 2 scores correlate with family medicine clinical and examination scores. Fam Med. 2003;35:510–513
5. Ogunyemi D, De Taylor-Harris S. NBME obstetrics and gynecology clerkship final examination scores: Predictive value of standardized tests and demographic factors. J Reprod Med. 2004;49:978–982
6. Spellacy WN, Dockery JL. A comparison of medical student performance on the obstetrics and gynecology National Board Part II examination and a comparable examination given during the clerkship. J Reprod Med. 1980;24:76–78
7. Myles TD, Henderson RC. Medical licensure examination scores: Relationship to obstetrics and gynecology examination scores. Obstet Gynecol. 2002;100(5 pt 1):955–958
8. Myles TD. United States Medical Licensure Examination Step 1 scores and obstetrics–gynecology clerkship final examination. Obstet Gynecol. 1999;94:1049–1051
9. Armstrong A, Dahl C, Haffner W. Predictors of performance on the National Board of Medical Examiners obstetrics and gynecology subject examination. Obstet Gynecol. 1998;91:1021–1022
10. Zahn CM, Saguil A, Artino AR Jr, et al. Correlation of National Board of Medical Examiners scores with United States Medical Licensing Examination Step 1 and Step 2 scores. Acad Med. 2012;87:1348–1354
11. Dong T, Swygert KA, Durning SJ, et al. Validity evidence for medical school OSCEs: Predicting performance on USMLE step examinations. TLM. 2014; in press.
16. Donnon T, Paolucci EO, Violato C. The predictive validity of the MCAT for medical school performance and medical board licensing examinations: A meta-analysis of the published research. Acad Med. 2007;82:100–106
17. Feinberg RA, Swygert KA, Haist SA, Dillon GF, Murray CT. The impact of postgraduate training on USMLE® Step 3® and its computer-based case simulation component. J Gen Intern Med. 2012;27:65–70