The National Board of Medical Examiners (NBME) developed Clinical Science Subject Examinations (NBME Subject Examinations) to assess students’ clinical knowledge in the clerkship phase of undergraduate medical education. As of 2013, 97% of Liaison Committee on Medical Education–accredited medical schools used one or more NBME Subject Examinations as a contributor to students’ final clerkship grade(s).1 The NBME Subject Examinations contain high-quality medical knowledge questions, are administered in a convenient and reliable format, and allow for comparison of individual student performance with national norms.2,3
Prior research shows a moderate to strong positive correlation between NBME Subject Examination scores and United States Medical Licensing Examination (USMLE) Step 1,4–7 Step 2 Clinical Knowledge (CK),6–8 and Step 39 scores. A positive correlation (convergent relationship) between NBME Subject Examination scores, Step 2 CK, and Step 3 scores may suggest that all three test types are measuring some form of medical knowledge. However, a convergent relationship between the NBME Subject Examination and other examinations is only one component in “relationship to other variables” validity evidence.10 Evidence of a divergent relationship is equally imperative to demonstrate whether measures are sufficiently specific to the context of interest.10
The purpose of this study was to examine the relationship between Step 1 and NBME Subject Examinations and, as an extension, final clerkship grades, to assess the validity of using scores from NBME Subject Examinations as a component in final clerkship grades. For each student in each clerkship, we replaced the standard score achieved on the NBME Subject Examination with his or her standard score on Step 1. Final clerkship grades based on the Step 1 substitution were compared with actual final clerkship grades. Because of the convergent relationship between NBME Subject Examination scores and Step 1 scores, we expected minimal variation between Step 1 substituted clerkship grades and actual clerkship grades.
Participants and setting
This was a retrospective cohort study. The study sample included all medical students at the Virginia Commonwealth University School of Medicine (VCU-SOM) who completed core clerkships during academic year 2012–2013 or academic year 2013–2014.
VCU-SOM is a large, public medical school affiliated with a comprehensive university in Richmond, Virginia. Throughout the study period, VCU-SOM enrolled approximately 200 students in each first-year class. The first two years consisted of a basic-science-oriented curriculum. In this phase, students progressed through courses arranged by organ system. In addition, a longitudinal course introduced students to history-taking and physical examination skills. After concluding the first two years, VCU-SOM required that all students pass Step 1. During the third year, students rotated through seven core clinical clerkships. After completing their clerkships, students were required to complete two Acting Internships and successfully pass Step 2 CK and Clinical Skills to graduate.
We selected academic years 2012–2013 and 2013–2014 to reflect a period in which each clerkship required the NBME Subject Examination and the duration of each clerkship remained stable. During the study period, students completed core clerkships in family medicine (4 weeks), internal medicine (12 weeks), neurology (4 weeks), obstetrics–gynecology (6 weeks), pediatrics (8 weeks), psychiatry (6 weeks), and surgery (8 weeks). The NBME Subject Examination constituted 35% of the final grade for family medicine, internal medicine, obstetrics–gynecology, and neurology; 30% of the final grade for surgery; and 25% of the final grade for pediatrics and psychiatry. We collected the following data elements for each student: demographic variables, Step 1 score, and NBME Subject Examination scores in family medicine, internal medicine, neurology, obstetrics–gynecology, pediatrics, psychiatry, and surgery.
Calculation of clerkship grades
VCU-SOM used a four-tiered grading system for all clerkships: “Honors,” “High Pass,” “Pass,” or “Fail/Marginal.” To determine final grade assignment, all clerkships followed the same multistep process. First, numerical scores from all assessments (preceptor evaluations, Subject Examination scores, encounter notes, etc.) were converted to a standard score with a mean of 50 and standard deviation of 10. Standard scores were then weighted and summed (e.g., the NBME Subject Examination was worth 25% of the final grade in pediatrics) to determine a final numerical score.
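The standardization and weighting steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the school's grading code: the function names, the component names, and every weight other than the 25% examination weight are hypothetical, and the use of the population standard deviation is an assumption because the text does not specify it.

```python
from statistics import mean, pstdev

def to_standard_scores(raw_scores, target_mean=50.0, target_sd=10.0):
    """Rescale raw scores so they have a mean of 50 and an SD of 10."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # assumption: population SD; the source does not say
    if sd == 0:
        raise ValueError("all scores are identical; cannot standardize")
    return [target_mean + target_sd * (x - m) / sd for x in raw_scores]

def weighted_final_score(standard_scores, weights):
    """Weighted sum of one student's standard scores, keyed by component."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * standard_scores[name] for name in weights)

# Hypothetical pediatrics weighting: only the 25% exam weight is from the text.
PEDIATRICS_WEIGHTS = {
    "subject_exam": 0.25,
    "preceptor_evaluations": 0.60,
    "encounter_notes": 0.15,
}
```

Because every component is on the same 50/10 scale before weighting, a component's influence on the final score depends on its weight rather than on its raw-score range.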
If a student’s final standard score was in the top 15% compared with values from the previous year, he or she would receive a suggested grade of Honors. If a student’s final standard score was at or above the mean from the previous year but below the top 15%, he or she would receive a suggested grade of High Pass. Students with a final score below the mean, but still within two standard deviations of it, received a suggested grade of Pass. Finally, students with a final score two or more standard deviations below the mean received a suggested grade of Fail/Marginal. For the purposes of this study, standard clerkship scores and grades are henceforth called actual final clerkship score and actual final clerkship grade, respectively.
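The norm-referenced cutoffs described above can be expressed as a small decision function. This is a sketch only: the function name and parameters are hypothetical, the Honors cutoff is assumed to be supplied by the caller as the prior-year score that marks the top 15%, and the handling of scores that fall exactly on a boundary is an assumption.

```python
def suggested_grade(final_score, honors_cutoff, prior_mean, prior_sd):
    """Map a final standard score to a suggested grade using prior-year norms.

    honors_cutoff: prior-year score marking the top 15% (assumed input).
    prior_mean, prior_sd: prior-year mean and standard deviation.
    """
    if final_score >= honors_cutoff:
        return "Honors"
    if final_score >= prior_mean:
        return "High Pass"
    # Assumption: a score exactly two SDs below the mean is Fail/Marginal.
    if final_score > prior_mean - 2 * prior_sd:
        return "Pass"
    return "Fail/Marginal"
```

For example, with a hypothetical prior-year mean of 50, standard deviation of 10, and Honors cutoff of 60.4, a final score of 55 maps to High Pass and a final score of 45 maps to Pass.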
We included students in the analysis if they completed Step 1 anytime in the academic year prior to entry into the core clerkships; we excluded students if they completed Step 1 during any other time. For students who took any examination more than once, only the first score was included in the analysis. We included scores from students even if they did not complete all clerkships in the same academic year. We did this to account for students who may have interrupted their clerkships because of remediation of courses, clerkships, or USMLE examinations, as well as those who required a leave of absence for academic or personal reasons.
We then created a “substituted” final standard score for each student on each clerkship and compared that score with the actual score and grade that the student received. First, we converted each student’s three-digit score on Step 1 to a standard score using published national means and standard deviations from the USMLE for the corresponding administration year of the examination. For each clerkship, we substituted the NBME Subject Examination standard score with the student’s standard score on Step 1. Finally, we recalculated scores for each clerkship using the standard score from Step 1 to create a “substituted” final standard score. We then reassigned clerkship grades using the same norm-referenced values for distinguishing Honors, High Pass, Pass, and Fail/Marginal to develop a “substituted” suggested clerkship grade.
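The substitution procedure can be sketched in a few lines. The function names, the component keys, and the national mean and standard deviation in the usage note are hypothetical placeholders, not the published USMLE values used in the study.

```python
def step1_standard_score(step1_score, national_mean, national_sd):
    """Convert a three-digit Step 1 score to the 50/10 standard-score scale."""
    return 50.0 + 10.0 * (step1_score - national_mean) / national_sd

def substituted_final_score(standard_scores, weights, step1_std,
                            exam_key="subject_exam"):
    """Recompute the weighted final score with Step 1 in place of the exam.

    standard_scores and weights are dicts keyed by assessment component;
    exam_key names the Subject Examination component (hypothetical key).
    """
    swapped = dict(standard_scores)  # copy so the actual scores are untouched
    swapped[exam_key] = step1_std
    return sum(weights[name] * swapped[name] for name in weights)
```

With placeholder national values of 228 (mean) and 21 (standard deviation), a Step 1 score of 249 sits one standard deviation above the mean and converts to a standard score of 60. The substituted final score is then passed through the same grade cutoffs as the actual final score.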
We computed the following: frequencies and percentages for demographic variables; means and standard deviations for Step 1 and NBME Subject Examination scores; and frequencies and percentages for actual and substituted clerkship grades. We converted letter grades to a numerical identifier (Fail/Marginal = 0, Pass = 1, High Pass = 2, Honors = 3) to compare actual and Step 1 substituted grades with the Wilcoxon rank test.
We then employed separate linear regression models for each NBME Subject Examination to determine whether Step 1 scores significantly predicted Subject Examination scores. We computed the percentage of shared variance (R²) to determine the degree of prediction for each regression equation. We then applied the regression equation to our sample to determine expected NBME Subject Examination scores. Lastly, we compared expected and actual NBME Subject Examination scores using a Mann–Whitney U test. Alpha was set at 0.05 for all tests. We used the statistical software SPSS Statistics, Version 22 (IBM SPSS Inc., Armonk, New York) for all analyses. The VCU institutional review board deemed this study exempt.
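A single-predictor regression of this kind can be sketched with ordinary least squares. This is an illustration in plain Python, not the SPSS procedure used in the study; the function name and the optional `center` argument are assumptions, the latter included because the Results report Step 1 scores relative to 185.

```python
def fit_simple_regression(step1_scores, exam_scores, center=0.0):
    """Ordinary least squares fit of exam score on (Step 1 score - center).

    Returns (intercept, slope, r_squared).
    """
    n = len(step1_scores)
    if n != len(exam_scores) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x = [s - center for s in step1_scores]
    x_bar = sum(x) / n
    y_bar = sum(exam_scores) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in exam_scores)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, exam_scores))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in one of the variables")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = (sxy * sxy) / (sxx * syy)  # shared variance for one predictor
    return intercept, slope, r_squared
```

Centering changes only the intercept, which then reads as the expected examination score at the centering value; the slope and R² are unaffected.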
Out of a possible 479 students, a total of 432 students met eligibility criteria (90.2%). From that sample, we analyzed the results from 2,777 NBME Subject Examinations. The number of examinations by clerkship was 398 family medicine, 397 internal medicine, 401 neurology, 396 obstetrics–gynecology, 393 pediatrics, 399 psychiatry, and 395 surgery. Descriptive statistics for demographic variables and Step 1 scores are presented in Table 1.
Percentages of shared variance (R²), regression equations, and predicted NBME Subject Examination scores with their respective P values are summarized in Table 2. Results from the linear regression models suggested that Step 1 scores significantly predicted NBME Subject Examination scores for all clerkships (P < .001). The percentage of shared variance between NBME Subject Examination scores and Step 1 scores was high for all clerkships except pediatrics, which demonstrated a moderate correlation and a medium amount of shared variance. To aid in interpretation of the regression equations, Step 1 scores were expressed relative to the lowest mean (185). Thus, for every 1-point increase in Step 1 above 185, the Subject Examination scores increased by 0.23 to 0.30 points. The varying intercept values (e.g., 65.21 for family medicine) suggest that mean NBME Subject Examination scores vary across clerkships. Therefore, a given Step 1 score does not guarantee the same NBME Subject Examination score in each clerkship. The expected means based on the regression for each NBME Subject Examination were closely aligned with observed means. Figure 1 depicts a graphical comparison of expected and actual NBME Subject Examination means.
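The way such an equation is read can be shown with a short worked example. The family medicine intercept (65.21) and the slope range (0.23 to 0.30) are taken from the text; the function name and the Step 1 score of 225 are hypothetical, and the family medicine slope itself is not stated here, so the example only brackets the prediction.

```python
def expected_subject_score(step1_score, intercept, slope, center=185.0):
    """Expected Subject Examination score from a centered regression equation."""
    return intercept + slope * (step1_score - center)

# Family medicine intercept from the text; slopes are the reported range
# across clerkships, used here only to bracket the prediction.
low = expected_subject_score(225, intercept=65.21, slope=0.23)
high = expected_subject_score(225, intercept=65.21, slope=0.30)
print(round(low, 2), round(high, 2))  # 74.41 77.21
```

A student 40 points above 185 would therefore be expected to score roughly 9 to 12 points above the intercept on that examination.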
There was no significant difference between actual grades and Step 1 substituted grades (P = .49). Specifically, the majority (73%; 2,038) of Step 1 substituted grades matched actual grades. For the 739 nonmatching cases, the change was usually one grade up or down (e.g., Honors to High Pass; High Pass to Pass) (88%; 651) rather than two grades up or down (e.g., Honors to Pass) (12%; 88). The percentage of matches was generally similar between clerkships. Neurology had the highest (320; 80%) and obstetrics–gynecology the lowest (262; 66%) percentage of matches, while all other clerkships ranged from 71% to 77%. At the student level, the 739 nonmatching cases affected almost all students in at least one clerkship, since only 5% (60 students) had a match between their actual and Step 1 substituted grades for all seven clerkships. However, 46% of students had a match between their actual and Step 1 substituted grade in at least six (out of seven) of their clerkships. Few students (16%) had matches in fewer than half of their clerkships. Figure 2 provides percentages of matches and nonmatch types between actual final clerkship grades and Step 1 substituted final clerkship grades.
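The match and nonmatch tabulation can be sketched as below, using the numerical grade coding given in the Method. The function names are hypothetical; the counts in the final lines are the ones reported in the text, included only to show how the percentages follow from them.

```python
from collections import Counter

GRADE_RANK = {"Fail/Marginal": 0, "Pass": 1, "High Pass": 2, "Honors": 3}

def tabulate_grade_changes(actual_grades, substituted_grades):
    """Count paired grades by how many tiers apart they are (0 = match)."""
    if len(actual_grades) != len(substituted_grades):
        raise ValueError("grade lists must be the same length")
    steps = Counter(
        abs(GRADE_RANK[a] - GRADE_RANK[s])
        for a, s in zip(actual_grades, substituted_grades)
    )
    return dict(steps)

def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

# Counts reported in the text: 2,038 matches of 2,777 grades; 739 nonmatches,
# of which 651 moved one grade and 88 moved two grades.
print(percent(2038, 2777), percent(651, 739), percent(88, 739))  # 73 88 12
```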
The findings from our study demonstrate the relationship between Step 1 scores, NBME Subject Examination scores, and, ultimately, final clerkship grades. We found little to no difference between expected NBME Subject Examination scores and actual scores in each clerkship. Additionally, we identified convergent relationships between Step 1 scores and all clerkship NBME Subject Examination scores to the point that substitution of Step 1 scores for NBME Subject Examination scores resulted in no change to approximately three-quarters of suggested grade assignments. These findings suggest that students who perform poorly on Step 1 are likely to perform poorly on NBME Subject Examinations. In addition, these findings suggest that if educators use NBME Subject Examinations to calculate final clerkship grades, students who perform poorly on Step 1 are unlikely to achieve high grades in their clerkships.
While the majority of NBME Subject Examination scores could be predicted from USMLE Step 1 scores, some students performed significantly better (or worse) than expected on several NBME Subject Examinations. Two factors may explain this observation. First, students’ own motivation to excel in a particular clerkship likely impacts their study habits, and therefore performance, on specific NBME Subject Examinations. For example, students who intend to match into pediatrics are probably motivated to perform as well as they can on the pediatrics NBME Subject Examination.
In addition, previous studies have shown that students tend to improve on NBME Subject Examinations over the course of the academic year.11–14 This improvement is related to familiarity with the structure of the examinations as well as overlap in content between related disciplines (e.g., family medicine and internal medicine).14 Thus, mean performance on NBME Subject Examinations in the middle of the year may correlate strongly with Step 1 scores, while examination scores near the beginning or end of the year could be more extreme and result in more variation. Further research is needed to determine the individual-level factors that influence performance on NBME Subject Examinations. In particular, it may be informative to determine whether, and why, some students perform better than expected on the NBME Subject Examinations throughout the year.
Are NBME Subject Examination scores valid measures of clerkship-specific knowledge?
In assessment literature, a convergent relationship between scores from two examinations suggests that the instruments represent “measures of the same achievement or ability.”10 Our findings highlight the convergent relationship between NBME Subject Examinations and Step 1 scores and are similar to the results of a recent study conducted by Zahn and colleagues.7 In that study, the authors found a moderate correlation (0.46–0.56) between Step 1 scores and each of six NBME Subject Examinations.7 Other studies have demonstrated the positive correlation between individual NBME Subject Examinations and Step 1 as well,4–6 thus suggesting the validity of NBME Subject Examination scores as markers of medical knowledge. However, this relationship is not necessarily surprising given the similar format and structure of these examinations, and it is not, on its own, sufficient evidence to determine the validity of using the examinations as a marker for clerkship-specific knowledge acquisition.
The flipside to convergence is divergence, or the ability for a measure of “different achievement or ability” to demonstrate negative or no correlation with the instrument of interest.10 To date, only one study has demonstrated less than a moderate positive correlation between Step 1 and any NBME Subject Examination.15 In that 2003 study, Myles and Galvez-Myles15 found a weakly positive correlation (0.18) between the Family Medicine Subject Examination and Step 1. More recent findings from Zahn and colleagues7 (2012) as well as those from our current study have demonstrated a much higher correlation between the family medicine examination and Step 1 (0.48 and 0.57, respectively). Thus, the cumulative data would suggest that there is not currently sufficient evidence to support a divergent relationship between Step 1 scores and NBME Subject Examination scores.
In addition to considering the correlation between examination scores, one should also consider the consequences of administering a given assessment. Consequences evidence is another measure of validity and takes into consideration the anticipated and unanticipated impact of assessment on various stakeholders.16 A recent study from a single institution suggests that students spend up to 20 hours per week studying for the NBME Subject Examinations.17 In addition, clinical time is often intentionally reduced toward the conclusion of the clerkship to provide formal study time and allow for administration of the examination. The impact of these practices on development of competent clinical practice is unclear, but should also be considered when balancing the pros and cons of administering the NBME Subject Examinations.
Collectively, these issues call into question the validity of using NBME Subject Examination scores to determine final clerkship grades. Clerkship-level grades are generally regarded as independent evaluations of student performance in a particular context and are valued highly by program directors across resident training programs.18 Students expect that each clerkship signifies a “clean slate” and an ability to demonstrate clerkship-specific skills: Receiving an Honors grade in one clerkship should have no influence on future clerkship grades.
Similarly, failure of Step 1 should help identify students who are at risk for future examination failures, but should not preclude success in future clerkships. The NBME acknowledges that the Subject Examinations reflect some degree of longitudinal knowledge acquisition independent of clerkship-specific content.19 However, 95% of clerkships nationally require students to pass the respective Subject Examination to pass the clerkship. Moreover, up to 46% of clerkships require a specific threshold score on the Subject Examination to determine “honors-level” performance.1
If a student achieves a poor score on Step 1, does he or she have a realistic opportunity to receive Honors in his or her clerkships? Our findings and the practice of many clerkships nationally suggest that outcome is unlikely. Therefore, the current practice of using NBME Subject Examinations to determine grades should be challenged. We suggest that perhaps the highest and best use of Subject Examination data is not in assigning individual clerkship grades but, rather, as a marker of aggregate medical knowledge acquisition across the clinical years of the medical school curriculum.
We conducted this study at a single institution. Although the relative weights of NBME Subject Examinations for computing final clerkship grades are similar to the weights at other institutions,1 there is probably variability in the other assessments used by clerkships across medical schools. Thus, the impact on final grade assignments must be considered relative to the local grading strategy. Similarly, although we suggest that any school could use regression in this way, each school should use local data to compute its own equations.
Another limitation of the study is the retrospective design. Because we computed the regression equations with the same data we used to generate expected scores, there is a higher chance that the actual and expected scores would be similar. We will need to apply the regression equations prospectively to a future cohort before we can draw any firm conclusions regarding the accuracy of expected scores. The ability to draw these conclusions may be muddied if we use expected scores to provide enhanced study tools or remediation to students we identify as at risk, since their actual Subject Examination performance is likely to be affected by these efforts. Finally, the NBME recently modified its score reports in favor of an “equated percent correct score” rather than the scaled score used in this study. Future research will need to verify the results using equated percent correct scores.
Implications and conclusion
The results of this study have two major implications. The first relates to counseling and advising. The regression models developed from our study demonstrated the ability to predict future NBME Subject Examination performance using Step 1 scores. Such predictive value may allow educators to provide targeted counseling to students who are at risk of poor performance on the NBME Subject Examinations. This may be particularly advantageous for identifying learners who may not have failed Step 1, but are still at substantial risk in future standardized tests. Generally, only a failing Step 1 score is seen as a “red flag,” but there is clearly a wider range of scores that indicate a struggling student who may need additional counseling or study skills remediation. Future research could include a multi-institutional study in partnership with the NBME and Association of American Medical Colleges to replicate these findings and develop strategies for working with students at risk.
The second major implication concerns the utility of the NBME Subject Examinations in assessing clerkship-specific knowledge. In recent years, other national clerkship-specific examinations have been developed to replace or serve as an adjunct to the NBME Subject Examinations.20 One cited advantage to these examinations includes their direct link to national clerkship objectives and case-based learning materials.20 Preliminary evidence suggests that these examinations may show divergent correlation with the NBME Subject Examinations.21 However, further investigation is required to draw conclusions regarding these relatively new examination options.
In conclusion, performance on the NBME Subject Examinations can be predicted on the basis of Step 1 scores. These findings can be used to counsel students at risk for poor performance on the NBME Subject Examinations. In some circumstances, students perform better on NBME Subject Examinations than would be predicted by Step 1 scores. Further research may elucidate the successful strategies used by these students. In addition, the lack of divergent correlations between Step 1 scores and NBME Subject Examination scores suggests the need to reconsider how these examinations are used in determining final clerkship grades. Although Subject Examinations may be beneficial in preparing for Step 2 CK and assessing overall clinical knowledge, the results of this study suggest that these examinations may not be valid assessments of the knowledge obtained during a clinical clerkship.
2. Suskie L. Chapter 14: Selecting a published test or survey. In: Assessing Student Learning: A Common Sense Guide. 2nd ed. San Francisco, CA: Jossey-Bass; 2009:216–217.
3. Sissom T, Grum C. Chapter 15: Clerkship examinations. In: Pangaro LN, McGaghie WC, eds. Handbook on Medical Student Evaluation and Assessment. 1st ed. North Syracuse, NY: Gegensatz Press; 2015:178–180.
4. Ogunyemi D, De Taylor-Harris S. NBME obstetrics and gynecology clerkship final examination scores: Predictive value of standardized tests and demographic factors. J Reprod Med. 2004;49:978–982.
5. Armstrong A, Dahl C, Haffner W. Predictors of performance on the National Board of Medical Examiners obstetrics and gynecology subject examination. Obstet Gynecol. 1998;91:1021–1022.
6. Ogunyemi D, Taylor-Harris D. Factors that correlate with the U.S. Medical Licensure Examination Step-2 scores in a diverse medical student population. J Natl Med Assoc. 2005;97:1258–1262.
7. Zahn CM, Saguil A, Artino AR Jr, et al. Correlation of National Board of Medical Examiners scores with United States Medical Licensing Examination Step 1 and Step 2 scores. Acad Med. 2012;87:1348–1354.
8. Spellacy WN, Dockery JL. A comparison of medical student performance on the obstetrics and gynecology National Board Part II examination and a comparable examination given during the clerkship. J Reprod Med. 1980;24:76–78.
9. Dong T, Swygert KA, Durning SJ, et al. Is poor performance on NBME clinical subject examinations associated with a failing score on the USMLE Step 3 examination? Acad Med. 2014;89:762–766.
10. Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837.
11. Cho JE, Belmont JM, Cho CT. Correcting the bias of clerkship timing on academic performance. Arch Pediatr Adolesc Med. 1998;152:1015–1018.
12. Gerhardt JD, Filipi CJ, Watson P, Tselentis R, Reeves J. Are long hours and hard work detrimental to end-clerkship examination scores? Am J Surg. 1999;177:132–135.
13. Ripkey DR, Case SM, Swanson DB. Predicting performances on the NBME Surgery Subject Test and USMLE Step 2: The effects of surgery clerkship timing and length. Acad Med. 1997;72(10 suppl 1):S31–S33.
14. Reteguiz JA, Crosson J. Clerkship order and performance on family medicine and internal medicine National Board of Medical Examiners exams. Fam Med. 2002;34:604–608.
15. Myles T, Galvez-Myles R. USMLE Step 1 and 2 scores correlate with family medicine clinical and examination scores. Fam Med. 2003;35:510–513.
16. Cook DA, Lineberry M. Consequences validity evidence: Evaluating the impact of educational assessments. Acad Med. 2016;91:785–795.
17. Manguvo A, Litzau M, Quaintace J, Ellison S. Medical students’ NBME subject exam preparation habits and their predictive effects on actual scores. J Contemp Med Edu. 2015;3:143–149.
Copyright © 2017 by the Association of American Medical Colleges
21. Gold J, Sharman M, Mavis B. Do the CLIPP exam and NBME shelf exam test different knowledge? Poster presented at: Council on Medical Student Education in Pediatrics Annual Meeting; March 27, 2014; Ottawa, Ontario, Canada.