Research Report

The Relationship Between the National Board of Medical Examiners’ Prototype of the Step 2 Clinical Skills Exam and Interns’ Performance

Taylor, Marcia L. MD; Blue, Amy V. PhD; Mainous, Arch G. III PhD; Geesey, Mark E. MS; Basco, William T. Jr MD


In recent years, medical educators have sought a more objective measure of a student's ability, especially in the area of clinical skills. In response, more medical schools are using standardized patients (SPs) and objective structured clinical examinations (OSCEs).1 Some studies have shown that these clinical performance exams are reliable tools for measuring students’ performances.2,3 OSCEs have also been shown to identify deficits in students’ clinical knowledge and to test different skills than those examined by more traditional methods.4–6 Recent survey data for the Liaison Committee on Medical Education (LCME) indicate that 65% of medical schools require students to take an OSCE during their final year of medical school, with 65% of these schools requiring passage to graduate.7

Can an OSCE administered in the final year of medical school predict how a medical student will perform as an intern? Several studies have looked at this issue and have yielded mixed results. Some have shown a positive correlation between medical students’ performances on an OSCE and interns’ performances.8–10 Others have shown that an OSCE can identify deficits in clinical knowledge and test different skills than those examined in more traditional methods.4–6 However, other studies have shown no relationship between students’ performances on an OSCE and their performances as interns.11,12 Studies relating more traditional factors for predicting residency performance, such as class rank, United States Medical Licensing Examination (USMLE) Step 1 and 2 scores, grade point average (GPA), and membership in Alpha Omega Alpha, have also shown mixed results. Several studies have shown a positive correlation between one or more of these measures and interns’ performances,13–21 while other studies have shown no such relationships.10,22–25

In July 1998, the Educational Commission for Foreign Medical Graduates (ECFMG) required all international medical graduates to pass a clinical skills assessment (CSA) before beginning a U.S. residency. One study examined differences in interns’ performance levels based on residency directors’ evaluation scores for those who had taken and passed the CSA and those who were interns before the implementation of the CSA. The study found that interns who passed the CSA were less likely to be rated deficient, especially in interpersonal skills, compared with those who had not taken the test.26

Responding to the need for an objective measure of clinical skills, the National Board of Medical Examiners (NBME) implemented a clinical skills exam as part of USMLE Step 2 (the Step 2 CS) beginning in mid-2004. (The CSA for foreign medical graduates has been phased out with the introduction of Step 2 CS.) The Step 2 CS consists of a one-day test in which each student examines 11 or 12 SPs. Students are expected to establish rapport with the patient, elicit key elements of the history, perform a focused physical examination, communicate effectively with the patient, document findings, and formulate a plan. Each encounter is to last 15 minutes with an additional ten minutes for documentation related to the encounter. The NBME states that this additional clinical skills examination is necessary to protect patients’ safety by establishing a national minimal standard for clinical and communication skills, which it feels is not represented by the previous steps of the USMLE. The general public favors students’ passing a clinical skills examination before receiving their medical licenses.27

Several organizations in the medical community have opposed the implementation of the Step 2 CS, including the American Medical Association (AMA), the AMA-Medical Student Section (AMA-MSS), and the American Academy of Family Physicians (AAFP). In its policy statement, the AMA affirms its commitment to the demonstration of clinical competence but argues that competence can best be assessed by medical schools themselves. The AMA-MSS has also raised concern about the financial and travel burdens the exam places on medical students. The exam costs $975 to take and is offered initially in only five cities (Philadelphia, Atlanta, Los Angeles, Chicago, and Houston).28 In November 2003, the AMA sent a letter to the deans of all medical schools encouraging them to continue emphasizing clinical skills assessment in the curriculum and to help students with the financial burden of Step 2 CS. The letter also requested that deans not make passing the Step 2 CS a requirement for graduation for at least five years.29 The AAFP also voiced concern about cost and travel issues and noted that the Step 2 CS has not yet been tested for its association with future clinical performance.30 Moreover, it is unclear whether the Step 2 CS is a better predictor of future clinical performance than are more traditional measures like medical school GPA and USMLE Step 1 and 2 scores. Thus, the purpose of this study was to examine whether students’ performance on a prototype of the Step 2 CS was associated with their future clinical performance as interns (that is, during their first year of residency).



In 2000 and 2001, the NBME sponsored a prototype of the Step 2 CS at over 50 U.S. medical schools. The Medical University of South Carolina (MUSC) was one of the test sites for both years (July 24 to August 10, 2000, and July 18 to August 7, 2001). In those years, all fourth-year MUSC students (the graduating classes of 2001 and 2002) were required to take the prototype exam.


The Step 2 CS prototype.

The Step 2 CS prototype consisted of 12 clinical stations in 2000 and 20 stations in 2001. Half of the stations each year were encounters with SPs focusing on history and physical examination skills and communication skills. The other half consisted of written exercises. SPs were trained extensively by the NBME to portray patients in certain clinical situations and to document students’ performances. The training typically lasted 12 to 15 hours and included a mock examination with students in the role of physicians. During the prototype examination itself, one SP performed the encounter while a second one monitored the encounter in a separate viewing room. A third served as a backup for the day.

During the Step 2 CS prototype examination, students’ history and physical examination and communication skills were evaluated using case-specific checklists. Their interpersonal skills were measured using two instruments: a seven-question Patient Perception Questionnaire using a five-point Likert-type rating scale (1 = poor to 5 = excellent) completed by the SP after each encounter; and the Physical Examination Questionnaire, a six-item, five-point Likert-type (1 = poor to 5 = excellent) questionnaire completed by the SP after each physical examination encounter. The NBME, when returning data to the study institutions, did not provide MUSC with information regarding the interrater reliability of the prototype because of the small sample of students encountering a given SP. The NBME reported scores both per case and as overall skill scores for both history and physical examination skills (referred to as the checklist score) and interpersonal skills.31 The NBME did not provide scores for the written stations, so these are not included in this study.

For the first year of the Step 2 CS prototype's administration, the NBME provided raw scores and scores scaled against the cohort taking the examination at other institutions. However, for the second year of the prototype, the NBME provided raw scores and scaled scores at the institution level only. To combine data from both years and to control for a station's difficulty, the raw scores from both years of the prototype at MUSC were transformed by year and station to a mean of 65 and a standard deviation (SD) of 10 by using the following formula: Transformed score = 65 + 10 × [(student's station score − mean score for that station) / SD of scores for that station]. Both the interpersonal and checklist scores for each station were transformed this way. We determined an average interpersonal and checklist score using these transformed station scores. All analyses were conducted using the average interpersonal and checklist scores.
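The transformation above can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function name transform_station_scores is hypothetical, and the use of the sample SD (n − 1 denominator) is an assumption, since the text does not say whether the sample or population SD was used.

```python
from statistics import mean, stdev


def transform_station_scores(raw_scores, target_mean=65.0, target_sd=10.0):
    """Rescale the raw scores for one station in one year.

    Implements: transformed = 65 + 10 * (score - station mean) / station SD.
    Assumption: stdev() is the sample SD; the article does not specify.
    """
    station_mean = mean(raw_scores)
    station_sd = stdev(raw_scores)
    if station_sd == 0:
        raise ValueError("all scores are identical; the SD is zero")
    return [
        target_mean + target_sd * (score - station_mean) / station_sd
        for score in raw_scores
    ]


# Made-up raw scores for a single station, for illustration only.
example = transform_station_scores([50, 60, 70, 80, 90])
```

Applied separately to each station within each year, this puts every station on the same scale, so a student's average across stations is not dominated by the harder or easier cases.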

USMLE scores.

The USMLE Step 1 is the first part of the medical licensing examination administered to all medical students after their second year. This one-day computerized examination is designed to “assess the understanding and application of important concepts of the sciences basic to the practice of medicine, with special emphasis on principles and mechanisms underlying health, disease, and modes of therapy.”32 The USMLE Step 2 is administered at the end of medical students’ third year. Also a one-day computerized examination, Step 2 is designed to “assess the application of medical knowledge, skills, and understanding of clinical science essential for the provision of patient care under supervision and includes emphasis on health promotion and disease prevention.”32 For this study, we used the students’ scores from their first attempt at Steps 1 and 2. In this article, USMLE Step 2 score refers to the student's score on the computerized examination only and does not include any scores on SP encounters.


GPA is a student's cumulative four-year grade point average for course work throughout medical school.

Interns’ performance.

MUSC sends a residency director's evaluation form to each graduate's residency program director approximately 18 months after graduation. This form inquires about the intern's performance in six core competencies defined by the Accreditation Council for Graduate Medical Education (ACGME): patient care, medical knowledge, practice-based learning, interpersonal and communication skills, professionalism, and systems-based practice. Each competency is rated on a five-point Likert scale (1 = lowest to 5 = highest). For this study, we report the average competency score. The form also asks the residency directors to give a global quartile ranking of the intern's performance compared with other interns (1 = 76–100%, 2 = 51–75%, 3 = 26–50%, 4 = 0–25%).


Our analyses focused on determining how well the checklist and the interpersonal scores from the Step 2 CS prototype correlated with the interns’ quartile ranking and the average score for the six core competencies. We computed Pearson product moment correlation coefficients for these relationships as well as for the relationships of GPA, USMLE Step 1 scores, and USMLE Step 2 scores with the quartile ranking and average competency score. We then computed Pearson partial correlation coefficients of the same associations while controlling for age, gender, race (defined as white or other), and specialty. For this study, we defined specialty as primary care (family medicine, internal medicine, pediatrics, obstetrics–gynecology) or other. We chose to control for specialty because the criteria for quartile rankings may differ among specialties.

We also computed both Pearson and Pearson partial correlation coefficients relating the interpersonal score from the Step 2 CS prototype with the interpersonal and communication competency score from the residency director's evaluation.
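For readers unfamiliar with partial correlation, the following Python sketch shows one standard way to compute it. It is not the SAS procedure used in the study. The function names pearson_r and partial_r are hypothetical, and the recursive first-order formula is one of several equivalent approaches; categorical covariates such as gender or specialty are assumed to be coded numerically (for example, 0 or 1).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_r(x, y, controls):
    """Correlation of x and y controlling for each sequence in `controls`.

    Removes one control variable at a time with the recursion
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    """
    if not controls:
        return pearson_r(x, y)
    *rest, z = controls
    r_xy = partial_r(x, y, rest)
    r_xz = partial_r(x, z, rest)
    r_yz = partial_r(y, z, rest)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up data: x and y are related only through the shared variable z.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 2, 5, 6, 5, 6, 9]
y = [2, 1, 4, 3, 4, 7, 6, 9]
raw = pearson_r(x, y)            # strong raw correlation
adjusted = partial_r(x, y, [z])  # close to zero once z is controlled for
```

In this made-up example the raw correlation is high because both variables track z; once z is held constant, the association essentially disappears. That is the sense in which the partial coefficients in this study adjust for age, gender, race, and specialty.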

Finally, we performed a forward stepwise linear regression to determine the best predictors of both quartile ranking and average competency score. In this type of regression, variables are added to the model sequentially, beginning with the independent variable that is most highly correlated with the dependent variable. As each independent variable is added, a partial F test is performed for every variable in the current model. Only variables that had a p value of less than .05 for the partial F test were included in the model.

We performed all analyses using the SAS System, version 8.02. The MUSC Institutional Review Board approved this study.


A total of 265 MUSC graduates from the classes of 2001 and 2002 took the Step 2 CS prototype. The response rate for the residency director's evaluation was 64% for the class of 2001 and 75% for the class of 2002. One hundred thirty-five individuals had complete data for all of our measures, and those individuals were the sample used in the analysis. Of those with complete data, 41% were women, 20% were nonwhite, and 42% were in specialties other than primary care. Their average age was 24.9 years. Table 1 shows the average scores on both parts of the Step 2 CS prototype, average scores for the USMLE Step 1 and Step 2, average GPA, and average competency scores and quartile rankings from the residency directors’ evaluations.

Table 1:
Results of Undergraduate and Intern Clinical Skills Measures for 135 Students Taking the National Board of Medical Examiners (NBME) Step 2 CS Prototype, the Medical University of South Carolina, 2000–2001

Table 2 shows both the Pearson and Pearson partial correlation coefficients for each of the analyses. GPA, USMLE Step 1 score, USMLE Step 2 score, interpersonal score on the Step 2 CS prototype, and checklist score on the Step 2 CS prototype all correlated significantly with both the quartile rankings and the average competency scores from the residency director's evaluation (p < .05). Quartile rankings were most highly correlated with GPA, followed by interpersonal score on the Step 2 CS prototype, USMLE Step 2 score, USMLE Step 1 score, and checklist score on the Step 2 CS prototype. Average competency score from the residency director's evaluation was most highly correlated with GPA, followed by Step 2 score, Step 1 score, interpersonal score on the Step 2 CS prototype, and checklist score on the Step 2 CS prototype.

Table 2:
Pearson Product Moment Correlation Coefficients and Partial Coefficients for Undergraduate and Intern Clinical Skills Measures, the Medical University of South Carolina, 2000–2001

The results of the stepwise regression indicated that only two variables were significant predictors of the interns’ quartile ranking. The best predictor was GPA, which had an R2 of 0.16, while interpersonal score on the Step 2 CS prototype improved the model R2 to 0.26. Similarly, for average competency score, GPA was the strongest predictor (R2 = 0.21) and interpersonal score on the Step 2 CS prototype improved the model R2 to 0.28. No other variable significantly improved either model.


We found a positive correlation between undergraduate clinical skills measures (GPA, USMLE Step 1 and 2 scores, and performance on the Step 2 CS prototype) and interns’ measures (quartile ranking and average score in the six core ACGME competencies as determined by a graduate's residency director). Our linear regression shows that GPA and interpersonal score from the Step 2 CS prototype appear to be the best predictors of an intern's performance as determined by the residency director. Adding interpersonal score to the model containing GPA raises the R2 from 0.16 to 0.26 for quartile ranking and from 0.21 to 0.28 for average competency score, so interpersonal score contributes 38.5% and 25%, respectively, of the final models’ R2. These data suggest that the interpersonal score on the Step 2 CS prototype may add information in predicting a medical school graduate's future performance.
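The percentages quoted for the contribution of the interpersonal score can be checked against the R2 values reported in Results. The short Python sketch below expresses the gain in R2 in two ways: as a share of the final model's R2 and relative to the GPA-only model. The function name r2_gain is hypothetical, and reading the quoted figures as shares of the final R2 is our inference from the arithmetic, not something the text states.

```python
def r2_gain(r2_base, r2_full):
    """Describe the gain in R2 when a predictor is added to a base model."""
    gain = r2_full - r2_base
    return {
        "gain": gain,
        "share_of_full": gain / r2_full,     # gain as a fraction of the final R2
        "relative_to_base": gain / r2_base,  # gain as a fraction of the base R2
    }


# R2 values reported in Results (GPA only, then GPA plus interpersonal score).
quartile = r2_gain(0.16, 0.26)
competency = r2_gain(0.21, 0.28)
```

For the reported values, the share of the final R2 is about 0.385 for quartile ranking and 0.25 for average competency score, which matches the quoted 38.5% and 25%. Relative to the GPA-only models, the same gains are about 0.625 and 0.333.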

The Step 2 CS prototype checklist score, however, did not appear to correlate as well with measures of future performance, and it did not contribute significantly to our model that already contained GPA and interpersonal score. One would have theorized that the checklist score would have correlated with an intern's performance since it is a fairly direct measure of how well the medical student performed history and physical examination skills, but this was not the case. In the clinical years of education, clinical performance is a component of the GPA; thus, it may not be surprising that a measure combining both cognitive knowledge (preclinical GPA) and clinical skills (clinical GPA) correlates better than does a measure focused solely on clinical skills (the checklist score). The finding that the interpersonal score predicts an intern's overall performance suggests that residency directors value interpersonal communication abilities in their overall ranking of an intern. However, the correlation coefficient between the residency directors’ ranking of the interpersonal communication skills competency and the interpersonal score on the Step 2 CS prototype is not as high as one would expect. It appears that the two interpersonal skills assessments may measure different aspects of interpersonal communication and interactions; the SPs likely place emphasis upon features of dyadic interaction, whereas residency directors may include a consideration of teamwork abilities. Future research could explore the specific interpersonal attributes that residency directors value in interns.

Our findings are similar to others that indicate GPA predicts interns’ performances,17,18 but they differ from studies that have found strong associations between OSCE performance and residency directors’ ratings.8–10 These differences may be attributed to differences in the instruments used for dependent and independent measures. One strength of our findings is that the clinical skills examination measure was based upon a prototype of a national examination and not a locally developed clinical skills examination.

Some limitations to this study must be considered. One is that the data are for only one institution. As we explain in Method, the Step 2 CS prototype scores returned to institutions were scaled to each institution for the second year of the study. Because this was a pilot program, the sample was necessarily limited. A second limitation was our use of residency directors’ ratings as a measure of interns’ performances. There is a possibility that the Step 2 CS prototype may have measured different aspects of future clinical performance not measured by the residency directors’ ratings. However, residency directors’ ratings have been used as a measure of performance in other studies.8,13–15,22–24,26

Another limitation is that we used scores from a prototype to make inferences about the Step 2 CS implemented in 2004. The Step 2 CS has changed since this prototype. As we described previously, however, the differences between the prototype and the Step 2 CS are minor. On the Step 2 CS, students have 15 minutes with the SP, then they leave the room and have ten minutes to complete a progress note on the encounter. As in the prototype, the SP uses a checklist to rate the student on history and physical performance and interpersonal and English communication skills. However, the written exercise on the Step 2 CS is graded by physicians and used in score calculations. Unlike the prototype's scoring, which was reported numerically, scores for the Step 2 CS are reported as pass/fail for clinical encounter, interpersonal skills, and English proficiency. To pass the examination overall, the student must pass each area. The USMLE has used both practicing physicians and medical educators to develop and review the cases on the Step 2 CS, as well as to determine standards for performance.32 We believe that despite some changes from the prototype, the prototype examination is still a fair representation of the actual Step 2 CS.

Our research indicates that traditional measures of medical students’ performance (GPA and USMLE Step 1 and Step 2 scores) correlated with the students’ future performance as interns, as measured by their residency directors’ evaluations. Also, the interpersonal and checklist scores from the Step 2 CS prototype correlated with their future performance as interns. However, the association between the Step 2 CS prototype's checklist score and future performance as an intern was not as strong as the associations for other indicators. Because GPA was most strongly associated with future performance as an intern, an argument can be made that medical schools may be in the best position to predict a student's future performance as an intern. Also, 65% of medical schools already have in place some type of formal assessment of medical students’ clinical skills.7 One of the arguments against implementing the Step 2 CS has been based on concerns for students having to travel to take the examination.28,29 This raises the question of whether the NBME could begin to certify as testing sites for Step 2 CS the examinations that many medical schools already have in place. In our study, however, even the two best predictors of residency performance (GPA and interpersonal score from the Step 2 CS prototype) explained only roughly 30% of the variation in an intern's clinical performance, suggesting that further research is needed to find better measures of students’ future clinical performance.

Funded in part by grants 1 D14 HP00161 and 1 D12 HP00023 from the Health Resources and Services Administration. The Robert Wood Johnson Foundation Generalist Physician Faculty Scholars Program (#039179) provided funding for Dr. Basco's efforts for this project.


1. Kassebaum DG, Eaglen RH. Shortcomings in the evaluation of students’ clinical skills and behaviors in medical school. Acad Med. 1999;74:842–9.
2. Sloan DA, Donnelly MB, Johnson SB, Schwartz RW, Strodel WE. Use of an objective clinical examination (OSCE) to measure improvement in clinical competence during the surgical internship. Surgery. 1994;114:343–50.
3. Stillman P, Swanson D, Regan MB, et al. Assessment of clinical skills of residents utilizing standardized patients. Ann Intern Med. 1991;114:393–401.
4. Schwartz RW, Witzke DB, Donnelly MB, Stratton T, Blue AV, Sloan DA. Assessing residents’ clinical performance: Cumulative results of a four-year study with objective structured clinical examination. Surgery. 1998;124:307–12.
5. Simon SR, Volkan K, Hamann C, Duffey C, Fletcher SW. The relationship between second-year medical students’ OSCE scores and USMLE Step 1 scores. Med Teach. 2002;24:535–9.
6. Dupras DM, Li JT. Use of an objective structured clinical examination to determine clinical competence. Acad Med. 1995;70:1029–34.
7. Liaison Committee on Medical Education (LCME) Questionnaire (Part II) Medical School Profile System. Washington, DC: Association of American Medical Colleges, 2003.
8. Durning SJ, Cation LJ, Markert RJ, Pangaro LN. Assessing the reliability and validity of the mini-clinical evaluation exercise for internal medicine residency training. Acad Med. 2002;77:900–4.
9. Martin IG, Jolly B. Predictive validity and estimated cut score of an objective structured clinic examination (OSCE) used as an assessment of clinical skills at the end of the first clinical year. Med Educ. 2002;36:418–25.
10. Smith SR. Correlations between graduates’ performance as first-year residents and their performance as medical students. Acad Med. 1993;68:633–4.
11. Chessman AW, Blue AV, Gilbert GE. Assessing students’ communication and interpersonal skills across evaluation settings. Fam Med. 2003;35:487–92.
12. Campos-Outcalt D, Watkins A, Fulginiti Kutob R, Gordon P. Correlations of family medicine clerkship evaluations and objective structured clinical examination scores and residency director ratings. Fam Med. 1999;31:90–3.
13. Blacklow RS, Goepp CE, Mohammadreza H. Further psychometric evaluations of a class-ranking model as a predictor of graduates’ clinical competence in the first year of residency. Acad Med. 1993;68:295–7.
14. Loftus LS, Arnold L, Willoughby TL, Connolly A. First-year residents’ performance compared with their medical school class ranks as determined by three ranking systems. Acad Med. 1992;67:319–23.
15. Gunzburger LK, Frazier RG, Yang LM, Rainey ML, Wronski T. Premedical and medical school performance in predicting first-year residency performance. J Med Educ. 1987;62:379–84.
16. Erlandson EE, Calhoun JG, Barrack FM, et al. Resident selection: applicant selection criteria compared with performance. Surgery. 1982;92:270–5.
17. Hojat M, Gonnella JS, Veloski JJ, Erdmann JB. Is the glass half full or half empty? A reexamination of the associations between assessment measures during medical school and clinical competence after graduation. Acad Med. 1993;68(2 suppl):S69–76.
18. Markert RJ. The relationship of academic measures in medical school to performance after graduation. Acad Med. 1993;68(2 suppl):S31–4.
19. Keck JW, Arnold L, Willoughby L, Calkins V. Efficacy of cognitive/noncognitive measures in predicting resident-physician performance. J Med Educ. 1979;54:759–65.
20. Veloski J, Herman MW, Gonnella JS, Zeleznik C, Kellow WF. Relationships between performance in medical school and first post graduate year. J Med Educ. 1979;54:909–16.
21. Amos DE, Massagli TL. Medical school achievements as predictors of performance in a physical medicine rehabilitation residency. Acad Med. 1996;71:678–80.
22. Kahn MJ, Merrill WW, Anderson DS, Szerlip HM. Residency program director evaluations do not correlate with performance on a required 4th-year objective structured clinical examination. Teach Learn Med. 2001;13:9–12.
23. Yindra KJ, Rosenfeld PS, Donnelly MB. Medical school achievements as predictors of residency performance. J Med Educ. 1988;63:356–63.
24. Borowitz SM, Saulsbury FT, Wilson WG. Information collected during the residency match process does not predict clinical performance. Arch Pediatr Adolesc Med. 2000;154:256–60.
25. Wood PS, Smith WL, Altmaier EM, Tarico VS, Franken EA. A prospective study of cognitive and noncognitive selection criteria as predictors of resident performance. Invest Radiol. 1990;25:855–9.
26. Boulet JR, McKinley DW, Whelan GP, van Zanten M, Hambleton RK. Clinical skills deficiencies among first-year residents: utility of the ECFMG Clinical Skills Assessment. Acad Med. 2002;77:S33–5.
27. National Board of Medical Examiners Information of Step 2 CS 〈〉. Accessed 12 November 2003.
28. American Medical Association Statement of Clinical Skills Assessment Exam 〈〉. Accessed 12 November 2003.
29. Papadakis MA. The Step 2 Clinical-Skills Examination. N Engl J Med. 2004;350:1703–5.
30. American Academy of Family Physicians NBME adds clinical skills exam despite protests 〈〉. Accessed 12 November 2003.
31. Prototype Nine and Ten Clinical Skills Examination Prepared by the National Board of Medical Examiners.
32. United States Medical Licensing Examination 〈〉. Accessed 10 August 2004.
© 2005 Association of American Medical Colleges