After well over a decade of research and development effort at the National Board of Medical Examiners, the United States Medical Licensing Examination (USMLE) was expanded in June 2004 to include the Step 2 Clinical Skills (CS) Examination.1 This new component of the licensing examination uses a standardized patient (SP) format to assess examinees on communication and interpersonal skills (CIS), spoken English proficiency (SEP), and the integrated clinical encounter (ICE), which combines assessment of skills in history taking, focused physical examination, and completion of a postencounter patient note (PN). A pass/fail decision is reported for each examination component; to pass the examination, examinees must obtain passing scores on ICE, CIS, and SEP. The Step 2 CS examination was implemented in collaboration with the Educational Commission for Foreign Medical Graduates (ECFMG), which had been delivering a similar examination for certification of international medical graduates seeking clinical training in the United States since 1998.
With the introduction of a new component to the USMLE sequence, it is appropriate to ask how the subcomponent scores from this assessment relate to each other and to available scores from other Step Examinations. These relationships have been examined in previous studies, but none of these studies provides direct information about the performance of both U.S. medical students and international graduates on a test for licensure. For example, Margolis and colleagues2 presented analyses based on a sample of U.S. medical students who completed the ECFMG’s Clinical Skills Assessment (CSA) as part of a pilot test implemented in preparation for Step 2 CS. The results showed a moderate to strong correlation between performance on data gathering (DG; the combination of history taking and physical examination) and the PN, a moderate correlation between DG and interpersonal skills, and a relatively weak relationship between the PN and interpersonal skills. The generalizability of the results to the operational Step 2 CS examination is limited in two ways: (1) analyses were based on a small sample of U.S. students who completed the test under relatively low stakes; and (2) the individual subcomponent scores reported for that examination are not identical to those produced for Step 2 CS.
A paper by Muller and colleagues3 described the relationship between scores on the ECFMG’s CSA and the USMLE Step 2 Clinical Knowledge (CK) examination. The paper reported both on the performance of a sample of U.S. medical students who completed the CSA as part of the pilot examination described in the paper by Margolis and colleagues2 and on the performance of a sample of international medical graduates (IMGs) who took the test under high-stakes conditions. The results showed low to modest correlations between scores from the CSA and the Step 2 CK examination; these correlations were modestly higher for the U.S. students than for the IMGs. The results reported by Muller and colleagues are consistent with previous studies comparing the performance of U.S. medical students on clinical skills examinations and tests for licensure.4 Once again, however, the results for the U.S. students are limited because of the low-stakes conditions.
The present study examines the relationships among the subcomponents of Step 2 CS and the relationships between those subcomponents and scores from Step 1 and Step 2 CK. Because the analyses are based on data collected as part of the operational test administration, they provide, for the first time, a basis for examining the performance of U.S. medical students testing under high-stakes conditions.
Each Step 2 CS examination session lasts approximately eight hours, during which examinees encounter 12 SPs. Examinees have up to 15 minutes to interact with each SP; they are informed about the reason for the visit and are instructed to take a focused patient history, and for most encounters they are also instructed to complete a focused physical examination. After completing the encounter, examinees have 10 minutes to document their findings in a structured patient note.
Three skill areas are assessed using different assessment instruments that are completed by SPs for each encounter: (1) DG is assessed using a checklist on which the SP records whether the examinee has asked specified critical questions about the patient’s history and performed necessary physical examination maneuvers; (2) CIS is assessed using a set of rating scales designed to measure the examinee’s ability to gather and share information and the examinee’s professional manner and rapport; and (3) SEP is similarly assessed using a rating scale. All of these assessments are completed immediately after the encounter. A fourth instrument, the PN, is completed by the examinee following each encounter; PNs are reviewed and rated by trained physician raters. CIS and SEP results are scored independently; DG and PN scores are combined to yield the ICE score.
The cases that make up each form of the Step 2 CS examination are based upon an examination blueprint designed to ensure that each examinee is presented with an appropriate sample of the types of patients and problems typically encountered in medical practice in the United States. The criteria used to define the blueprint and create individual examinations focus primarily on presenting complaints and conditions. Presentation categories include, but are not limited to, cardiovascular, constitutional, gastrointestinal, genitourinary, musculoskeletal, neurological, psychiatric, respiratory, and women’s health. Additional specifications balance the mix of patients in terms of problem acuity, age, and gender. Although not every examinee will see cases in each of these defined categories, the test specifications ensure that each examinee encounters a reasonable sample of such cases. In addition to the use of a blueprint to standardize the administration across test sites and dates, the scores produced for each candidate are adjusted to account for any differences in case, SP, and rater difficulty that may result from the test construction and administration process.
Two separate examinee samples were analyzed for this research. (All USMLE examinees are given an option to decline use of their score data for research purposes. Fewer than 0.01% of examinees completing the Step 2 CS examination declined to make their scores available for research; they were excluded from the current samples.) The first sample included 15,800 first-time takers from U.S. medical schools who tested during the first year of administration (June 2004 to July 2005). The second sample included 12,300 IMGs who were first-time takers during the same time period. Because of a problem related to scoring the PN, scores for this subcomponent were unavailable for some of these examinees. Analyses were based on all examinees in the previously defined groups except those affected by the note-scoring problem.
Pearson product-moment correlations between scores were estimated for each Step 2 CS subcomponent. In addition, variance components were estimated using an Examinee by SP/Case design for the DG, CIS, and SEP scores and an Examinee by Rater/Case design for the PN score. An estimate of reliability was computed as the ratio of examinee variance to the overall variance. These estimates were then used to correct (disattenuate) the correlations for unreliability. This produced what is typically referred to as a true-score correlation.
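The correction described above can be illustrated with a minimal sketch. It assumes the classical correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities, and it treats the overall variance as examinee variance plus a single pooled error term. The function names and the numeric inputs are hypothetical and are not taken from the study; the exact error terms entering the authors' variance-component estimates are not specified here.

```python
import math


def reliability(examinee_var: float, error_var: float) -> float:
    """Reliability as the ratio of examinee variance to overall variance.

    Assumption: overall variance = examinee variance + pooled error variance.
    """
    return examinee_var / (examinee_var + error_var)


def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Classical correction for attenuation (true-score correlation)."""
    if rel_x <= 0 or rel_y <= 0:
        raise ValueError("reliabilities must be positive")
    return r_observed / math.sqrt(rel_x * rel_y)


if __name__ == "__main__":
    # Illustrative values only; not results from the paper.
    rel_a = reliability(examinee_var=0.64, error_var=0.36)
    rel_b = reliability(examinee_var=0.81, error_var=0.19)
    print(round(disattenuate(0.30, rel_a, rel_b), 3))
```

Because both reliabilities are at most 1, the corrected value is never smaller in magnitude than the observed correlation; with low reliabilities and sampling error it can exceed 1, which is one reason such estimates are usually interpreted with caution.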
Correlations provide a mathematically tractable index of the relationship between variables (and one that is often practically meaningful); however, a belief expressed by some medical educators is that in spite of low correlations, examinees who fail the test of clinical knowledge will also fail the test of clinical skills. To examine this relationship, cross tabulations were done to identify: (1) the proportion of examinees failing multiple components of Step 2 CS; and (2) the proportion of examinees failing Step 2 CK and one or more components of Step 2 CS.
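A cross tabulation of this kind might be sketched as follows. The record layout (a `fail_ck` flag and a `fail_cs` dictionary keyed by component) and the function names are assumptions made for illustration; they do not describe the authors' data files or software, and the sample records are invented.

```python
from collections import Counter


def cross_tab(records):
    """Count examinees in each (failed CK, failed any CS component) cell."""
    return Counter(
        (r["fail_ck"], any(r["fail_cs"].values())) for r in records
    )


def conditional_fail_rates(records):
    """Proportion failing CS among CK failures, and the reverse."""
    table = cross_tab(records)
    both = table[(True, True)]
    ck_total = both + table[(True, False)]
    cs_total = both + table[(False, True)]
    return {
        "cs_given_ck": both / ck_total if ck_total else float("nan"),
        "ck_given_cs": both / cs_total if cs_total else float("nan"),
    }


if __name__ == "__main__":
    # Invented records for illustration only.
    sample = [
        {"fail_ck": True, "fail_cs": {"ICE": True, "CIS": False, "SEP": False}},
        {"fail_ck": True, "fail_cs": {"ICE": False, "CIS": False, "SEP": False}},
        {"fail_ck": False, "fail_cs": {"ICE": False, "CIS": True, "SEP": False}},
        {"fail_ck": False, "fail_cs": {"ICE": False, "CIS": False, "SEP": False}},
    ]
    print(conditional_fail_rates(sample))
```

The two conditional proportions answer different questions and need not be equal, because the denominators (examinees failing Step 2 CK versus examinees failing any Step 2 CS component) differ.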
Table 1 presents correlations between Step 1, Step 2 CK, and the subcomponents of Step 2 CS for both the U.S. and IMG samples. The values below the diagonal are the corrected (or true-score) correlations, the values above the diagonal are the observed correlations, and the values on the diagonal (in bold) are the reliabilities for the respective components based on the specific examinee sample.
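The layout of such a table can be made concrete with a short sketch that assembles one matrix from observed correlations and reliabilities: observed values above the diagonal, corrected values below it, and reliabilities on the diagonal. The function name and the numeric inputs are hypothetical, and the correction used is the classical attenuation formula, assumed here for illustration.

```python
import math


def build_correlation_table(observed, reliabilities):
    """Combine observed and corrected correlations in one square matrix.

    observed: symmetric matrix (list of lists) of observed correlations.
    reliabilities: one reliability estimate per score, in the same order.
    """
    n = len(reliabilities)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        table[i][i] = reliabilities[i]  # diagonal: reliability
        for j in range(i + 1, n):
            r = observed[i][j]
            table[i][j] = r  # above the diagonal: observed correlation
            # below the diagonal: corrected (true-score) correlation
            table[j][i] = r / math.sqrt(reliabilities[i] * reliabilities[j])
    return table


if __name__ == "__main__":
    # Illustrative values only; not the entries of Table 1.
    obs = [[1.0, 0.36, 0.20],
           [0.36, 1.0, 0.10],
           [0.20, 0.10, 1.0]]
    rel = [0.64, 0.81, 0.90]
    for row in build_correlation_table(obs, rel):
        print([round(v, 2) for v in row])
```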
Relationships among the Step 2 CS subcomponents are generally consistent with expectations. For the U.S. students, the PN and DG scores have a moderate to high true-score correlation, the CIS and DG scores have moderate correlations, and the remaining pairs of scores have modest to low correlations. It is not surprising that SEP is only weakly related to the other scores, because most U.S. students receive similarly high scores on this component.
When the correlations for the U.S. sample are compared to those for the IMGs, the underlying pattern is similar but the impact of spoken English proficiency is immediately apparent. The correlations between this score and the other three component scores are approximately twice as large for the IMGs. There also appears to be a stronger relationship between the PN and CIS scores for the IMGs; one potential explanation is that both of these scores are influenced by spoken English in a manner not apparent with U.S. students.
Correlations between the subcomponent scores for Step 2 CS and the scores for Step 1 and Step 2 CK are again similar to expectations for both the U.S. students and the IMGs. The DG and PN scores have modest correlations with scores from the multiple-choice tests; the CIS and SEP scores have low correlations with these tests. These relationships are similar for the U.S. students and the IMGs; curiously, the relationships between the SEP score and scores on the multiple-choice examinations are slightly weaker for the international graduates.
Table 2 shows the proportion of examinees failing Step 2 CK and each of the three subcomponents for which pass/fail decisions are made on Step 2 CS; the table also shows the proportion of examinees failing each combination of the four components. In general, there is not excessive redundancy between individuals identified as nonproficient by the Step 2 CS components and individuals identified as nonproficient by Step 2 CK. For example, only about 10% of U.S. students failing Step 2 CK also failed one or more components of Step 2 CS; about 15% who failed one or more components of Step 2 CS also failed Step 2 CK. Similarly, only 27% of IMGs failing Step 2 CK on their first attempt subsequently failed one or more components of Step 2 CS, and about 37% who failed Step 2 CS failed Step 2 CK on their first attempt.
Discussion and Conclusions
The analyses reported in this paper provide information about the relationships between the subcomponents of Step 2 CS and more broadly between the subcomponents of Step 2 CS and Steps 1 and 2 CK. This is the first report of analysis of the performance of U.S. medical students completing a clinical skills examination under the high stakes associated with a licensing examination. The results in this paper provide new insight regarding the relationships between scores of this type for IMGs because SEP is included as a separate component of the test.
The IMG results highlight the extent to which performance on Step 2 CS is moderated by spoken English proficiency. This is consistent with expectations in that although this dimension is intended to be a separate and conceptually independent component of the test, for examinees with proficiency below a certain threshold it is unavoidable that English language skills will interfere with the ability to gather data, share information, and establish rapport.
More generally, the results conform to expectations. Both the correlational evidence and the information about failure rates confirm that the proficiency measured by Step 2 CK is substantially different from the proficiencies measured by the components of Step 2 CS. This does not provide direct support for the usefulness of the Step 2 CS scores, but it does demonstrate that there is relatively little redundancy between the scores.
The results also support the reasonableness of combining the DG and PN scores into a single composite score. The two measures are conceptually related, and the statistical evidence shows a moderate relationship between the scores. Finally, the results provide a strong argument against combining the SEP and CIS scores with any of the other scores if the instrument is to be applied to U.S. medical students.