The addition of the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) examination to the medical licensure sequence provides the opportunity for standardized assessment of a physician’s history and physical examination skills, ability to document findings, communication and interpersonal skills, and spoken English proficiency.1 Although this addition has the potential to increase the credibility of the licensing process, any new test raises questions about how its scores correlate internally and how they correlate with related measurements.
Taylor and colleagues2 studied the relationship between the performance of 256 students on a prototype of Step 2 CS and program directors’ ratings of their performance as interns. They reported correlations of 0.19 between checklist scores on the prototype and ratings of clinical ability as interns and 0.26 between interpersonal scores on the prototype and ratings of interpersonal skills. Harik and colleagues3 provided early information about the correlations among the components of Step 2 CS and their correlations with total scores on Step 1 and Step 2 Clinical Knowledge (CK). They reported correlations within Step 2 CS ranging from a high of 0.48 between Data Gathering and Documentation scores to a low of 0.05 between Data Gathering and Spoken English Proficiency. The correlations between the Step 2 CK score and the components of the Step 2 CS examination ranged from 0.39 for Documentation to 0.14 for Spoken English Proficiency. These modest correlations suggest that the subtests of the Clinical Skills examination measure independent proficiencies, all of which are substantially different from that measured by the multiple-choice Step 2 CK examination.
A question that follows logically from these studies is how performance on Step 2 CS relates to performance on the clinical skills examinations administered in medical schools. Many schools have a long history of experience with formal assessments of clinical skills using standardized patients. A vast majority have now implemented skills assessments.4,5 One would expect a positive correlation between local and national assessments if their goals were similar. Furthermore, if a school’s examination were intended to gauge its students’ readiness for high-stakes assessment, then a high correlation or consistent classification across the two assessments would be very desirable. A very high correlation might also be interpreted to mean that Step 2 CS is redundant with assessment already occurring in a school. On the other hand, if the purpose of the school’s assessment were to measure its own unique curricular goals, a low correlation might not be surprising. Nevertheless, to the extent that the school-based and national assessments were intended to measure similar proficiencies, negligible correlations would call into question the empirical validity of the scores for either or both assessments.4
Consistent with these concerns, the purpose of this study was to examine the relationship between students’ performance on a medical school’s third-year Objective Structured Clinical Skills Examination (OSCE) and Step 2 CS. We expected positive relationships because of similarities in the structure and content of the two tests.
The university’s institutional review board determined that this study of existing data was exempt from review related to human subject protection.
The sample included 217 students who completed Step 2 CS within 12 months after completing the OSCE at the end of their third year of medical school. Their mean age was 26.8 years. They included a balance of men (53%) and women (47%). Their mean scores on Step 1 and Step 2 CK were 221 and 229, respectively, which are close to the mean scores for U.S. medical schools.
The procedures used to administer Step 2 CS have been described previously.1,6 Examinees have 15 minutes to interact with each of 12 standardized patients. They are instructed to take a focused history based on the reason for the visit and, in most encounters, are also instructed to perform a focused physical examination. After each encounter, they are allowed 10 minutes to summarize their findings in a postencounter note. The patient completes a checklist after the encounter to document history questions asked and physical examination maneuvers performed. The patient also completes rating scales to evaluate the examinee’s communication and interpersonal skills and the examinee’s spoken English proficiency. The postencounter notes are scored by trained physicians.
The medical school’s OSCE, which has been administered annually since 2004, is similar. Its goals, which are comparable with those of assessments conducted at many schools,5 include the summative evaluation of students’ clinical competence, evaluation of the clinical curriculum, and the provision of feedback to students preparing for Step 2 CS. Students have 15 minutes to perform a focused history and, in most cases, a physical examination on each of 10 standardized patients. They summarize their findings from the history, physical examination, differential diagnosis, and workup plan in a postencounter note. The patients are trained to complete checklists and rating scales of 30 to 40 items per case, which are similar to those used in Step 2 CS and by other medical schools. The case content is determined by a faculty committee on the basis of the objectives of the medical school and the major clerkships required in the third year. The case mix included acute and chronic conditions. A passing score of 70 in data gathering, documentation, and communication and interpersonal skills is required for graduation.
Examinees must pass the following three subcomponents of Step 2 CS: Integrated Clinical Encounter (ICE), Spoken English Proficiency, and Communication and Interpersonal Skills. The ICE score is a combination of a Data Gathering score based on checklists and a Documentation score based on postencounter notes. We analyzed the scores for Data Gathering, Documentation, and Communication and Interpersonal Skills. The scores for Spoken English Proficiency were not analyzed because there was little variance for this sample of U.S. citizens.
Nine scores from the medical school’s OSCE were used. One case was dropped from scoring on the basis of the results of item-analysis procedures. Separate percentage-done scores for History and Physical Examination were calculated on the basis of the standardized patients’ checklists. These were averaged to produce an overall Data Gathering score. Within Communication and Interpersonal Skills, three separate Likert-scale scores for Medical Interviewing Skills, Patient Counseling Skills, and Professional Relationships were calculated on the basis of patients’ ratings. These scores were averaged to produce an overall Communication and Interpersonal Skills score. Within the postencounter note, four separate percentage-correct Documentation scores were assigned to each student by one of the authors (K.W.): History, Physical Examination, Differential Diagnosis, and Workup Plan. Again, these were averaged to produce an overall Documentation score.
The USMLE and medical school scores for students were linked using the students’ USMLE identification numbers, which were then removed from the analytic file to mask individual identities. Pearson product–moment correlations were calculated between the OSCE and USMLE scores and were corrected for unreliability.7
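The correction for unreliability cited above is Spearman’s classical disattenuation formula, in which the observed correlation is divided by the square root of the product of the two scores’ reliabilities. A minimal sketch follows; the reliability values in the example are hypothetical, chosen only for illustration, and do not come from this study.

```python
import math

def disattenuate(r_obs: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction for attenuation: estimate the correlation
    between the underlying proficiencies from an observed correlation
    (r_obs) and the two scores' reliability coefficients."""
    return r_obs / math.sqrt(rel_x * rel_y)

# Hypothetical illustration (reliabilities NOT taken from the study):
# an observed correlation of 0.35 between two scores, each with a
# reliability of about 0.41, disattenuates to roughly 0.85.
corrected = disattenuate(0.35, 0.41, 0.41)
```

Because the correction divides by a quantity less than 1, corrected values are always at least as large as the observed ones, and scores with low reliabilities are corrected the most.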
The observed and corrected (in parentheses) correlation coefficients for scores within the medical school’s OSCE, within Step 2 CS, and across the tests are shown in Table 1. The highest correlations are between scores measuring similar proficiencies within and between the tests. For example, the highest observed correlation, 0.51, is between Documentation and Data Gathering within Step 2 CS. The next-highest observed value, 0.35, occurs twice in the table: once between the Documentation and Data Gathering scores within the medical school’s OSCE, and again between the Documentation scores across the tests. The third-highest correlation, 0.32, is between the Communication and Interpersonal Skills scores across the tests.
Conversely, the correlations between scores measuring different proficiencies are low. The lowest observed correlation, 0.04, is between the Documentation and Communication and Interpersonal Skills scores within the medical school’s OSCE. The next-lowest, 0.05, is between the same two scores across the two tests. The third-lowest, 0.08, is between the Step 2 CS Data Gathering scores and the medical school’s Communication and Interpersonal Skills scores.
The corrected values, which estimate the relationships between the proficiencies measured by the scores after removing measurement error, are, on average, about 0.15 higher than the observed values. The largest difference, 0.51, involves the medical school’s Documentation and Data Gathering scores, indicating that this observed correlation is low because of the unreliability of the measures. The corrected value of 0.86 is close to the corrected value of 0.79 for the correlation between the corresponding Documentation and Data Gathering scores within Step 2 CS.
The magnitude of the correlations across tests is relevant for assessing the redundancy of the national and school-based assessments. These correlations are also important for evaluating the extent to which the school’s test is useful in predicting performance on the licensing test. When evaluating the level of redundancy and the predictive value of the tests, it is also appropriate to look at the extent to which their scores classify examinees at the same level, especially around a pass/fail cutoff. In some instances, even scores with low correlations may be useful in identifying weak examinees. Inspection of the scatter plots for the three pairs of related scores on the two tests indicated that the two tests did not consistently identify the same examinees as being weak.
We analyzed the relationship between performance on a medical school’s comprehensive OSCE administered at the end of the third year and Step 2 CS. The school’s 10-case OSCE was similar in structure to Step 2 CS and comparable with the comprehensive clinical skills assessments used in many other schools.4,5
Table 1 provides an opportunity to examine the relationship between the two tests from a multitrait–multimethod perspective.8 That is, it allows for comparison of the strength of relationships between those scores purported to measure similar proficiencies across assessments and those scores intended to measure different proficiencies within assessments. The corrected and uncorrected correlations lead to similar conclusions. The strongest relationship within each test is between Data Gathering and Documentation. The next-strongest correlations across tests are between their respective Documentation scores and between their respective Communication and Interpersonal Skills scores.
The finding of strong correlations across tests for the Documentation and the Communication and Interpersonal scores provides support for their construct validity. It is not surprising that the Data Gathering and Documentation scores for Step 2 CS are highly correlated. This strong relationship is consistent with the intended structure of the test, the decision to combine these two scores for licensing, and the findings of earlier studies.3
The observed and corrected correlations of 0.18 and 0.32, respectively, between Data Gathering scores across tests are conspicuously low compared with the corresponding values for Documentation and Communication and Interpersonal Skills. It is unlikely that this is attributable to differences in the content of the two examinations, because the observed and corrected correlations for the Documentation scores on the tests were 0.35 and 0.75. It seems more likely that the low correlation is attributable to differences in the tests’ checklists, which may differ in the specific processes underlying their development or in the extent to which they reward thoroughness. It has been shown that different groups of experts working independently create substantially different checklists depending on their personal preferences or other factors irrelevant to the construct being measured.9 The lower correlations for the scores based on checklists suggest that more work is needed in this area.
An important finding is that the observed correlations between scores on the school-based OSCE and the licensing examination are low. These results provide little or no support for the argument that the two examinations are redundant. Similarly, the predictive value of the school-based test seems to be limited.
Interpretation of these findings is complicated by several limitations of this initial study from one school. The most important confounding factor is the varying time lag between the two tests, which ranged from 3 to 12 months for these students. To the extent that the ranking of examinees’ proficiency changes over time, the correlations between the tests will be reduced, and interpretation will be further complicated if the pattern of change varies across the three measures. We were unable to estimate the magnitude of this effect because of the small sample size within time periods. Furthermore, the 217 students who took the OSCE received score reports, and the 10 students who failed completed a one-month remedial course. The potential effect of case or task specificity may also enter into the interpretation of these findings. The within-test relationship between Data Gathering and Documentation may be influenced by the fact that both scores are based on the same cases; the extent to which individual examinees excel on certain cases because of their experiences may distort the within-test correlations. Cases across the two tests are independent, so no such effect should be present. The ideal study would involve students from multiple schools with content-equivalent OSCEs, a sample large enough to adjust statistically for the effect of the time lag between the OSCE and Step 2 CS, and control of other interventions such as feedback and remediation.
This report represents an initial attempt to examine the relationship between a medical school’s OSCE and scores on USMLE Step 2 CS. The pattern of positive relationships between similar proficiencies within and across tests supports the tests’ construct validity. The correlations’ magnitude indicates that the tests are not redundant and that the scores on the school’s OSCE are of limited value in predicting performance on Step 2 CS. Future studies of the relationship between medical school and licensing tests need to address the time lag between the two assessments and the effect of feedback and remediation. Continued work in this area will lead to a better understanding of the assessment process, improvements in both types of assessments, and a better-integrated overall assessment process.
Emily Gavin reviewed several versions of the manuscript and provided valuable comments and suggestions.
1 Hawkins RE, Swanson DB, Dillon GF, et al. The introduction of clinical skills assessment into the United States Medical Licensing Examination (USMLE): Description of USMLE Step 2 Clinical Skills (CS). J Med Licensure Discipline. 2005;91:21–25.
2 Taylor ML, Blue AV, Mainous AG 3rd, Geesey ME, Basco WT Jr. The relationship between the National Board of Medical Examiners’ prototype of the Step 2 clinical skills exam and interns’ performance. Acad Med. 2005;80:496–501.
3 Harik P, Clauser BE, Grabovsky I, Margolis MJ, Dillon GF, Boulet JR. Relationships among subcomponents of the USMLE Step 2 Clinical Skills examination, the Step 1, and the Step 2 Clinical Knowledge examinations. Acad Med. 2006;81(10 suppl):S21–S24.
4 Hauer KE, Hodgson CS, Kerr KM, Teherani A, Irby DM. A national study of medical student clinical skills assessment. Acad Med. 2005;80(10 suppl):S25–S29.
5 Hauer KE, Teherani A, Kerr KM, O’Sullivan PS, Irby DM. Impact of the United States Medical Licensing Examination Step 2 Clinical Skills exam on medical school clinical skills assessment. Acad Med. 2006;81(10 suppl):S13–S16.
6 McKinley DW, Boulet JR. Using factor analysis to evaluate checklist items. Acad Med. 2005;80(10 suppl):S102–S105.
7 Gulliksen H. Theory of Mental Tests. New York, NY: John Wiley & Sons; 1950.
8 Campbell DT, Fiske DW. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychol Bull. 1959;56:81–105.
9 Gorter S, Rethans J, Scherpbier A, van der Heijde D, van der Vleuten C, van der Linden S. Developing case-specific checklists for standardized-patient-based assessments in internal medicine: A review of the literature. Acad Med. 2000;75:1130–1137.