The Step 2 Clinical Skills (CS) examination is a performance-based assessment during which examinees interact as physicians with a series of actors who portray patients in a standardized fashion (aka, standardized patients or SPs). Each Step 2 CS examination form comprises a total of 12 encounters—each a different clinical scenario, or case, portrayed by a different SP. Step 2 CS is one of four parts of the United States Medical Licensing Examination (USMLE) series, and all physicians who wish to practice medicine in the United States, regardless of the location of their training, must pass it.1 Step 2 CS is administered continually throughout the year (at least five days a week, each week) in five testing centers (Philadelphia, Pennsylvania; Atlanta, Georgia; Chicago, Illinois; Houston, Texas; and Los Angeles, California) to approximately 35,000 examinees per year. Similar to many other organizations that use performance-based assessments (medical and otherwise), USMLE has a limited amount of resources for case development and SP/rater training, yet must not only develop many complex case scenarios that fit the examination blueprint and timing constraints but also train a substantial number of SPs and raters to perform the scenarios and provide ratings. These constraints, coupled with a high-stakes environment and a continuous testing environment, necessitate complex form-assembly software that must balance multiple factors in order to determine which SPs and which cases each examinee experiences.
Currently, the software used for Step 2 CS includes as one of its constraints the rule that repeat examinees should not experience information they have previously encountered. In operational use, however, the software must balance this constraint with many other factors such as examination content, the availability of SPs, and so on. Avoiding repeat information (the same SP, the same case, or the same SP portraying the same case) has a high weight in the software, but certain other aspects of content may have an even higher weight, depending on the particular balance of examinees testing and the SPs and cases available at that time. The end result is that some examinees who repeat the Step 2 CS do encounter repeat SPs and/or cases. The impact of item exposure on repeat examinees' performance is a concern even with multiple-choice items (which are more easily generated than performance-based assessment material); therefore, when, as with Step 2 CS, the stakes are high, the cases relatively few, and the examinee's exposure to each “item” (case) lengthy, the question of whether prior exposure to specific exam content impacts performance is crucial.
The Educational Commission for Foreign Medical Graduates' Clinical Skills Assessment (CSA), a historical precursor to Step 2 CS, was operational for international medical graduates (IMGs) from 1998 to 2004. The extent of score gains and the impact of previous exposure to exam items on the CSA were investigated in 2003.2 The authors examined data from the first four years of the CSA's six-year run, and they found that although repeat examinees seemed to make significant score gains for each component of the CSA (i.e., Integrated Clinical Encounter [ICE] and Doctor–Patient Communication), prior exposure to the SP, the case, or both potentially affected only 15% of the CSA examinee–SP encounters, indicating that repeat examinees were fairly unlikely to encounter repeat information. On further investigation, the researchers found little evidence to suggest that prior exposure to the SP, case, or SP–case combination led to higher scores for an examinee–patient encounter.
Several other researchers, using both actual and artificial item exposure, have investigated the impact of potential security breaches on SP-based performance assessments. One study involving a medical school objective structured clinical exam (OSCE) found no significant gains in scores when the same SP–cases were presented across a 15-day testing window.3 Another similar study found no significant gains when the same cases were used on an OSCE for an entire academic year.4 Studies have shown that even when examinees received detailed postexam feedback,5 had access to exam material ahead of time,6 or were allowed to discuss exam material with one another, they still did not make significant gains in overall exam performance.7,8 Examinees made substantial gains only when they received the actual checklist items or the communications scoring scale for a large proportion of the cases.7,8
The literature seems to show that examinees do not necessarily benefit from repeat exposure to clinical skills examination material, but most of the studies cited here address only whether significant mean gains occur across all examinees in the study. Even when no mean gains occurred, individual examinees could still have received some benefit from prior exposure to exam material, and in the clinical skills scenario, limiting this prior exposure—and measuring its impact—is important. Regardless, the results from these studies, especially the CSA-based research,2 are encouraging from a test-validity standpoint for large-scale, high-stakes, SP-based exams.
It is essential that medical educators also investigate whether examinees who retake Step 2 CS perform better, over and above any expected improvement due to practice, if they encounter repeat information. Very little recent literature examines the effect of repeat information on repeat examinees' scores; many of the studies cited above were published more than 10 years ago.3–7 Also, although Step 2 CS is very similar to the CSA, Step 2 CS has slightly different components than did the CSA; Step 2 CS measures spoken English skills and communication skills on discrete scales, whereas the CSA combined them into one scale. The passing standards for Step 2 CS are different from those for the CSA, and, further, Step 2 CS has a much more heterogeneous examinee pool that includes United States and Canadian medical graduates (USMGs)—not just IMGs.
Assessing the score gains made by Step 2 CS repeat examinees and quantifying any impact on performance (over and above the expected score gains) from repeat exposure to exam material are the focus of the current study, as this information could be an essential contribution both to the body of evidence supporting the validity of Step 2 CS scores specifically and to the literature on the unfair potential benefits of repeat information in clinical skills assessments in general.
In Step 2 CS, examinees rotate through a series of 12 encounters, and in each encounter they interact with an SP portraying a specific case, spending a maximum of 15 minutes with each SP. The exam requires examinees to perform each of the following tasks: gather a complete and focused history (in all encounters), perform a set of relevant physical exam maneuvers (in most encounters), and write a patient note (PN) following all encounters. The SPs then rate the examinees on three components. The first, Spoken English Proficiency (SEP), is a global rating of the examinee's spoken English skills, and the second, Communication and Interpersonal Skills (CIS), allows the SP to assess, via a global rating, the examinee's ability to gather information in an appropriate manner, share relevant information with the SP, and demonstrate an appropriate personal manner and develop an appropriate rapport with the patient. SEP is rated via a nine-point scale; higher numbers indicate that the SP has little or no trouble understanding the examinee. CIS is a combination of three 9-point scales, so a CIS rating can range from 3 to 27; a higher rating indicates that the examinee demonstrates proficiency in gathering and sharing information, in demonstrating a professional manner, and in establishing good rapport with patients. The third component, Data Gathering (DG), is a percent-correct checklist on which the SP reports each medical history question and physical exam maneuver as either completed or not completed. One final rating comes from the PN; physicians rate these following the examinees' completion of the exam.
CIS, DG, and PN ratings are calibrated at the encounter level to adjust ratings for SP/rater stringency and case difficulty; SEP is not calibrated because, historically, SP stringency and case difficulty account for a very small portion of the overall variability in SEP ratings. The calibrated DG component, combined with the calibrated PN component, makes up the ICE portion of the exam. The SEP-, CIS-, and ICE-component calibrated scores are scaled via a linear transformation to have a mean of 70 and a standard deviation (SD) of 10; these final scaled scores determine the final pass/fail status for each component. Each component has its own separate passing standard, and examinees must pass all three components in a single attempt in order to pass the exam. For the current study, we used the encounter-level calibrated CIS, DG, and PN scores, along with encounter-level uncalibrated SEP scores.
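The scaling and pass/fail rule described above can be illustrated with a minimal sketch. It assumes that the mean and SD used for the transformation are taken from the same set of scores being scaled (operationally, they may come from a separate reference group), that a score equal to the standard counts as a pass, and that the cut scores shown are hypothetical placeholders; the function names are illustrative and are not part of the Step 2 CS scoring system.

```python
from statistics import mean, pstdev


def scale_scores(calibrated, target_mean=70.0, target_sd=10.0):
    """Linearly rescale scores so they have the target mean and SD."""
    m = mean(calibrated)
    s = pstdev(calibrated)  # assumption: population SD of these same scores
    if s == 0:
        raise ValueError("scores have zero variance; cannot rescale")
    return [target_mean + target_sd * (x - m) / s for x in calibrated]


def passes_exam(scaled, standards):
    """Conjunctive rule: every component must meet its own standard."""
    return all(scaled[c] >= standards[c] for c in standards)


# Hypothetical cut scores, for illustration only (not the real standards).
HYPOTHETICAL_STANDARDS = {"SEP": 60.0, "CIS": 60.0, "ICE": 60.0}
```

Under this rule, an examinee with strong SEP and CIS scores who falls below the ICE standard still fails the attempt, because the components do not compensate for one another.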
From April 1, 2005, to December 31, 2007, a total of 84,836 examinees took the Step 2 CS exam at least once. Only examinees who fail the exam are allowed to repeat it; examinees who pass any USMLE exam may not retake it in the hopes of improving their scores. A total of 5,074 examinees, representing 6.0% of the examinee population, took the exam more than once during this time period. We included in this study only those repeat examinees who retested within six months of their initial exam (N = 3,045), assuming that we would be able to best detect any benefit attained from prior exposure to information during this time period. When all Step 2 CS examinees register for the examination, they give permission for their deidentified examination data to be used in research, and all data used in this study were deidentified prior to access and analysis by the authors.
This repeat examinee dataset does not represent all those who failed the exam on their first attempt during this time period, as some examinees chose not to retake the exam at all and still others had not retaken the test by the end of 2007. There is no limit on the number of times an examinee may take Step 2 CS in order to obtain a passing score (although states may limit the number of attempts allowed for licensure), and, beginning in May 2007, examinees who failed were not required to wait a minimum amount of time before retaking the exam (before that point, a 60-day wait was required between attempts).
For the purposes of this study, we classified examinees as USMGs (students who attended or graduated from U.S. or Canadian medical schools), US-IMGs (American citizens who attended international medical schools), or non-US-IMGs (citizens of other countries who attended international medical schools). We calculated means and SDs by Step 2 CS component (i.e., DG, PN, CIS, SEP) and examinee subgroup (i.e., USMGs, US-IMGs, non-US-IMGs), and we used a series of paired t tests to determine, for each component and subgroup, whether the score gain from the initial to the repeat examination attempt was significant. We chose a P value of <.001 for significance (using the Bonferroni method to adjust the family-wise error rate of .05), as 12 separate paired t tests were included in the analyses. The different components are on different scales, so although score gains on the same component can be directly compared across examinee subgroups, gains on the original scale metric (the raw scores) cannot be directly compared across different components. Thus, in addition to the t tests, we computed standardized-mean-score gains by component and subgroup, so that we would have a set of values that could be directly compared across subgroups and components.
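A minimal sketch of these three calculations follows. The text does not state the denominator used for the standardized gain, so dividing the mean gain by the SD of first-attempt scores is an assumption made here for illustration; the function names are likewise hypothetical. Note that dividing .05 by 12 tests gives roughly .0042, so the <.001 threshold is stricter than that quotient.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Paired t statistic and degrees of freedom for attempt 2 vs. attempt 1."""
    if len(first) != len(second):
        raise ValueError("each examinee needs a score on both attempts")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def standardized_gain(first, second):
    """Mean gain in SD units (assumption: SD of first-attempt scores)."""
    return (mean(second) - mean(first)) / stdev(first)


def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance level under the Bonferroni method."""
    return family_alpha / n_tests
```

Because the standardized gain is unit-free, a value computed for DG can be set beside one computed for CIS, which is not true of the raw gains.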
On the basis of what each repeat examinee had encountered on his or her first examination attempt, we classified each encounter of his or her second attempt as an encounter with a new or repeat SP, with a new or repeat case, or with a new or repeat SP–case combination. We then performed a series of analysis of variance (ANOVA) tests on the repeat encounter-level data separately by examinee subgroup and component, comparing encounters without repeat information against those with repeat information. Again, we chose a P value of <.001 for significance, as 12 separate ANOVAs were performed.
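The classification and comparison steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: encounters are represented as hypothetical (SP identifier, case identifier) pairs, the three repeat flags are set independently of one another, and the one-way ANOVA is reduced to its F statistic for the groups supplied.

```python
from statistics import mean


def classify_encounters(first_attempt, second_attempt):
    """Flag each second-attempt encounter as repeat SP, repeat case, repeat pair.

    Both arguments are lists of (sp_id, case_id) tuples (hypothetical format).
    """
    seen_sps = {sp for sp, _ in first_attempt}
    seen_cases = {case for _, case in first_attempt}
    seen_pairs = set(first_attempt)
    return [
        {
            "repeat_sp": sp in seen_sps,
            "repeat_case": case in seen_cases,
            "repeat_pair": (sp, case) in seen_pairs,
        }
        for sp, case in second_attempt
    ]


def one_way_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within
```

An encounter can carry a repeat SP and a repeat case without being a repeat SP–case combination, for example when a previously seen SP portrays a case the examinee first met with a different SP.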
In Table 1 we present the means, SDs, and score gains (both unstandardized [i.e., raw] and standardized) by examinee subgroup and component. With only one exception, all of the score gains on all of the components were significant (P < .001) for each of the examinee subgroups, which was not surprising considering the large sample sizes. As expected, USMGs on SEP showed no significant gains (most USMGs are native English speakers). The standardized score gains show that the USMGs who retook the exam within six months made the biggest standardized gains on the DG component of the encounter (+1.96), followed by the PN component (+1.58). The non-US-IMGs, on the other hand, experienced their biggest standardized score gains on the CIS component of the exam (+1.28). The US-IMGs also experienced their biggest standardized score gains on the CIS component, but their increase on the PN component was roughly the same as for non-US-IMGs, whereas their gain on the DG component was slightly higher than that of the non-US-IMGs, thus placing the US-IMGs, along with the non-US-IMGs, well below the USMGs in score gains on these two components.
Table 2 shows the same examinee subgroups and their mean encounter-level scores based on whether the encounters included new or repeat SPs, new or repeat cases, or a new or repeat SP–case combination. The percentage of repeat information that repeat examinees encountered was relatively consistent across examinee subgroups: A total of 2,676 encounters (4% of all 66,981 Step 2 CS encounters experienced in two attempts by repeat examinees) contained a repeat SP, 3,814 (6%) contained a repeat case, and 1,168 (2%) contained a repeat SP–case combination. None of the new-versus-repeat score comparisons was significant for USMGs or US-IMGs. Non-US-IMGs' scores show significant differences on SEP for repeated SPs and repeated cases; however, for each, the SEP score on attempt two is lower than on attempt one, suggesting that for this one group and component, examinees actually do worse when encountering repeat test information.
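The reported percentages follow directly from the encounter counts, as the short check below shows. The counts are those given in the text; the variable and function names are illustrative only.

```python
TOTAL_ENCOUNTERS = 66_981  # total encounters reported in the text

# Encounters containing each kind of repeat information (from the text).
repeat_counts = {
    "repeat SP": 2_676,
    "repeat case": 3_814,
    "repeat SP-case combination": 1_168,
}


def percent_of_total(count, total):
    """Share of all encounters, expressed as a percentage."""
    return 100.0 * count / total


shares = {
    label: percent_of_total(count, TOTAL_ENCOUNTERS)
    for label, count in repeat_counts.items()
}

for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
```

Rounded to whole numbers, the three shares match the 4%, 6%, and 2% reported above.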
Discussion and Conclusions
Examinees who fail Step 2 CS on their initial attempt may retake the examination multiple times, and no minimum time is required between attempts. Repeat examinees may encounter repeat test information on multiple retakes. Although assessing whether examinees show improvement on subsequent examination attempts is crucial, determining whether repeat examinees perform better when exposed to repeat examination information is even more so. Better performance by repeat examinees on previously experienced encounters, compared with new encounters, would indicate that these examinees have an unfair advantage. In that case, to preserve the integrity and validity of the examination scores, it would be necessary to constrain the test assembly software or registration availability even further than it is already constrained so that repeat examinees never encounter repeat information.
These data demonstrate that repeat examinees show improvements on all four Step 2 CS components between their first and second attempts, and this finding is consistent with previous findings from the CSA study. Little else in the clinical skills literature describes the expected score gains on repeat attempts for clinical assessments, but we can surmise several potential reasons that scores would fluctuate, and increase, on subsequent attempts. The first is that, because the examination is required—for graduation from medical school for some U.S. students, for the Residency Match for some U.S. and all international students, and ultimately for licensure for all physicians practicing in the United States (whether they trained inside or outside the United States or Canada)—there is a great deal of incentive for a failing examinee to review the material and retake the examination as soon as possible. Second, Step 2 CS scores, like all test scores, contain some amount of measurement error and do not have perfect reliability, so the mean observed scores may fluctuate across takes. Finally, Step 2 CS is quite innovative and different from other standardized exams, so examinees who have not previously encountered a clinical skills examination or an OSCE may have performed poorly on their first attempt (despite the availability of practice materials) simply as a result of the exam's unique format.
Although repeat examinees improve their scores on their second attempt, we are reassured to see that virtually no significant relationship emerged between performances on repeat attempts and exposure to repeat information. What's more, the significant findings that did emerge suggest that examinees do not benefit from seeing repeat information, and that—for one component and examinee group in particular—they perform worse on subsequent encounters with the same information. The impact of the repeat information is difficult to detect because the number of encounters with repeat SPs, repeat cases, or repeat SP–case combinations represents a very small percentage of all encounters; however, these small percentages provide, in and of themselves, a reassurance that the form-assembly software is performing as expected.
Also reassuring is that these results are consistent with prior research on CSA,2 including the findings that non-US-IMGs do slightly worse when exposed to repeat information. In the current study, the only significant finding was that non-US-IMGs do worse on SEP when exposed to a repeat SP or a repeat case, but the means for both IMG groups were lower for almost all components when they encountered repeat information. This effect may be due to a reaction on the part of the examinee: An examinee who is not expecting to see the same SP again may become either confused about encountering the same information or overconfident about his or her proficiencies. Either of these could lead to a poorer performance. The National Board of Medical Examiners (NBME) is continuing to investigate the impact of repeat information and will focus on isolating potential causes of this effect.
We should note two caveats. First, although the results suggest that examinees do not benefit from repeat information, this is solely within the context of a high-stakes, high-volume clinical skills assessment with a large pool of SPs and case information. Step 2 CS examinees have a relatively low risk of seeing repeat information as compared with a smaller assessment program that is forced to reuse SPs or cases more often. Second, we would like to reiterate that the results show only that there is no mean improvement due to exposure to repeat information. An individual examinee, of course, might perform better after having been exposed to the SP or the case. Because of this possibility, continuing to limit examinee exposure to repeat information as much as possible is still necessary. The NBME is currently investigating multiple efficient, cost-effective ways in which to limit repetition, including increasing the numbers of cases and SPs at each site and modifying the form-assembly software.
The authors would like to acknowledge the staff of the Clinical Skills Evaluation Collaboration for their input and support on this project.