Passing the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (Step 2 CS), a standardized patient (SP)-based examination in place since 2004, is part of the licensure process for physicians wishing to practice in the United States. Step 2 CS employs SPs in 12 simulated encounters per exam session, with each SP portraying a different clinical scenario but rating examinees on the same set of clinical skills. Examinees who fail the exam are allowed to repeat it and must pass every component in a single attempt to pass the exam. Previous research on Step 2 CS and its predecessor, the Educational Commission for Foreign Medical Graduates Clinical Skills Assessment, has shown score increases in the range of 0.5 to 1.0 standard deviations (SDs), depending on the component, across testing sessions for repeat examinees.1–3 Possible explanations for this phenomenon include the presence of repeated test content across subsequent attempts, regression to the mean, practice effects due to the novel test format, and true improvement in clinical skills. This research on score gains across testing sessions has provided varying levels of support for these explanations. Two notable publications reported that repeat examinees do not seem to benefit from seeing the same case or SP twice.1,2 Another reported that some portion of repeat examinees’ score gains across sessions can be explained by practice effects associated with the novel SP format.3
A separate line of research reported that within the span of a single testing session, examinees’ scores increased by 0.2 to 0.7 SDs, depending on the component.4,5 Such within-session score gains have not been as thoroughly documented as across-session gains, but, in the work that has been done, researchers have shown these gains to be possibly related to the novel testing environment, decreases in examinees’ test-taking anxiety, or increases in their self-confidence.5,6 These within-session score gains have been termed sequence effects,5,7 practice effects,8 and learning effects,9 but they may be described best as effects of “mere repetition,”10 in which scores increase in the absence of intervening studying or instruction.
In contrast, score gains across two (or more) testing sessions might reflect true gains in examinees’ clinical skills because they have the opportunity to increase their skills through studying and instruction. Thus, across-session score gains may reflect genuine improvement in clinical skills, whereas within-session score gains can only reflect construct-irrelevant factors associated with mere repetition. We would expect gains due to repetition to be largest for the first few SP encounters, then to diminish as examinees become more familiar with the test, more confident, or less anxious—possibly disappearing altogether within the course of a single testing session. Indeed, studies with SP exams have shown that the largest score gains occur between the first and second encounters and subsequently level off.4,5,7–9
This report extends the literature on within-session score gains with the use of examinees who initially failed and subsequently repeated Step 2 CS. This focus adds a unique component to the literature on repeat examinees, as we are able to observe both within-session and across-session score changes for the same set of examinees, and we can address two research questions with one set of data. First, do examinees experience within-session score gains on both their first and second attempts at Step 2 CS, and, if so, are their within-session score patterns similar or different across attempts? One possibility is that within-session score gains occur only on the first few SP encounters during their first attempt but not during their second attempt. Such an outcome would imply that mere repetition produces a lasting effect on performance, possibly due to an increase in familiarity and confidence with the SP testing environment. Another possibility is that scores increase on the first few SP encounters of both examinees’ first and second attempts. Such an outcome would imply that repetition yields a temporary effect on performance, consistent with the explanations of test anxiety, which educators might expect to “reset” between testing attempts. We answered our first question by comparing examinees’ score increase trajectories for their first and second attempts at Step 2 CS.
Our second question addresses whether examinees’ score gains across sessions can be explained wholly as a continuation of their gains within the first session. That is, is an examinee’s improvement for the second session greater than what we predict it would be on the basis of his or her gains during the first session? A consistent finding has been that mean second-attempt scores are higher than mean first-attempt scores for Step 2 CS,1–3 so it is possible that we can attribute this difference entirely to the trajectory of increasing scores during the first attempt. If so, then second-attempt scores should simply follow the first-attempt score trajectory, at least on the first few SP encounters. Such an outcome would imply that across-session score increases are due entirely to factors irrelevant to clinical skills. If, instead, there is true skill improvement across testing attempts, then initial second-attempt scores should exhibit a discontinuity from the first-attempt score trajectory, “jumping” to a higher score level during the initial SP encounter of the second attempt. To assess this hypothesis, we tested whether examinees’ initial scores on their second attempt at Step 2 CS differed significantly from the scores we predicted using their first-attempt score trajectory. The results from these two sets of analyses will reveal the amount of clinical skills learning that would-be physicians potentially can demonstrate across attempts on Step 2 CS, as well as the amount of construct-irrelevant variance due to mere repetition (e.g., from test anxiety or practice effects) that remains in this testing format.
Step 2 CS measures examinees’ clinical skills across four domains: communication and interpersonal skills (CIS), data gathering (DG), patient note documentation (PN), and spoken English proficiency (SEP). During each of the 12 SP encounters, examinees have up to 15 minutes to interact with the SP. Examinees are informed of the patient’s initial complaint and vital signs before entering the SP’s room and are instructed to take a focused medical history and perform relevant physical examination maneuvers. At the conclusion of the encounter, examinees have 10 minutes to document their findings and provide differential diagnoses in a structured patient note. The SPs use these 10 minutes to complete the checklist and rating scales that result in examinees’ scores for CIS, DG, and SEP. After the examination, trained physicians assign examinees ratings for PN. According to 2010 performance data, the fail rate was 12% for all first-attempt examinees; for only the U.S. and Canadian medical students and graduates (USMGs),* the first-attempt fail rate was 3%.11
We analyzed the first- and second-attempt scores of the 2,165 USMGs who took Step 2 CS at least twice between April 1, 2005 and December 31, 2010, and whose first attempt was on or after April 1, 2005. We chose to focus only on USMGs because they are more homogeneous in native language, language of instruction, exposure to the clinical skills testing format, and medical education. Examinees are informed on registration for the USMLE that their deidentified data may be used for research. Analytic data sets and derivative analyses are stored securely under conditions that preserve privacy and prevent release to third parties. Because the risk of harm to any individual was negligible, we did not submit this study for institutional review board review.
To facilitate comparisons across score components, we standardized the set of four scores (CIS, DG, PN, and SEP) by subtracting the mean and dividing by the SD for each component across both attempts. A unit gain in standardized CIS score could therefore be compared directly with a unit gain in standardized DG score. SEP scores showed virtually no within-session score change, so we excluded this component from our analyses. Thus, the dependent (outcome) variables were the standardized CIS, DG, and PN scores. The independent (predictor) variables included SP encounter number (1–12) and attempt number (first or second).
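The pooled standardization described above can be illustrated with a short sketch. It is written in Python for illustration only (the analyses themselves were run in R), the raw scores are hypothetical, and the choice of the population SD is an assumption, since the text does not say which SD formula was used.

```python
from statistics import mean, pstdev


def standardize_pooled(first_attempt, second_attempt):
    """Z-score one component using the mean and SD pooled over both attempts."""
    pooled = list(first_attempt) + list(second_attempt)
    mu = mean(pooled)
    sd = pstdev(pooled)  # assumption: population SD; the source does not specify
    if sd == 0:
        raise ValueError("scores have zero variance and cannot be standardized")
    z_first = [(x - mu) / sd for x in first_attempt]
    z_second = [(x - mu) / sd for x in second_attempt]
    return z_first, z_second


# Hypothetical raw CIS scores, invented purely for illustration.
cis_first = [60.0, 62.0, 65.0, 66.0]
cis_second = [68.0, 70.0, 72.0, 73.0]
z1, z2 = standardize_pooled(cis_first, cis_second)
```

Because both attempts share one mean and one SD, a second-attempt improvement remains visible as a higher mean z-score rather than being removed by standardizing each attempt separately.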
Our comparison of score gain trajectories over examinees’ first and second attempts at Step 2 CS required that we model the trajectories. We initially applied locally weighted scatterplot smoothing (LOESS)12 to explore the functional form of scores over encounters, as the LOESS framework allows us to use the power and simplicity of linear ordinary least squares to fit linear and nonlinear patterns in different regions of the data. This framework is useful when significant nonlinearity, such as a break in trajectory, is expected, as it was here. Our inspection of the LOESS plots (see Figure 1) suggested a piecewise linear relationship, in which scores increased linearly over the first few SP encounters up to a “breakpoint” and then flattened out. To model this pattern of scores, we estimated piecewise regression models, fitting breakpoints to match the LOESS curves. We also fit a linear regression of scores on encounter number, in the interest of specifying the simplest model that would adequately capture the score trajectories. For each score component, the piecewise model fit significantly better than the linear model (each P < .0001) and, with the exception of CIS scores, no worse than the LOESS model (P > .05). Because the piecewise model provided a reasonable and parsimonious description of the score pattern, we used it to test our hypotheses. In addition, we added examinee-level random effects (intercept and slopes) to the piecewise model to determine whether there was significant variability in trajectory shape across examinees.13 We used the open-source language R (Vienna, Austria) for all analyses.14
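The piecewise form described above can be sketched as follows. This is a minimal Python illustration, not the R code used for the analyses: it fits a single fixed-effects trajectory by ordinary least squares with the breakpoint supplied by hand, omits the examinee-level random effects, and runs on synthetic scores. The hinge parameterization, the function names, and the intercept are assumptions; the slope of 0.09 SDs and the breakpoint at the third encounter echo the CIS estimates reported in the Results.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_piecewise(encounters, scores, breakpoint):
    """Least-squares fit of score = b0 + b1*min(e, bp) + b2*max(e - bp, 0).

    b1 is the slope before the breakpoint and b2 the slope after it.
    """
    rows = [[1.0, min(e, breakpoint), max(e - breakpoint, 0.0)] for e in encounters]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic mean scores over 12 encounters: rise 0.09 SD per encounter up to
# encounter 3, then stay flat. The intercept of -0.30 is hypothetical.
encounters = list(range(1, 13))
scores = [-0.30 + 0.09 * min(e, 3) for e in encounters]
coefs = fit_piecewise(encounters, scores, breakpoint=3)
```

With noise-free synthetic data the fit recovers the generating intercept and slopes; with real scores, the breakpoint would be chosen by inspecting the smoothed curve, as the text describes.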
Within-session score gains on first and second attempts
Figure 1 shows mean standardized scores for each component by encounter number and attempt number. During both the first and second attempts, within-session scores increased over the first three to six encounters and then leveled off, although the rate of increase varied across the three components. First-attempt scores increased most quickly for CIS and least quickly for DG, as is evident from their respective breakpoint estimates at the third and sixth encounters.
To test our observations, we evaluated estimates from the piecewise linear mixed-effects model (see Table 1). For both the first and second attempts, all slopes before the breakpoint were positive and statistically significant (each P < .0001), confirming the presence of an initial within-session score increase for all components. The largest of these slopes was for CIS, which had a predicted per-encounter initial score increase of 0.09 SDs on both attempts. PN had an initial slope of 0.05 SDs on both attempts, and DG had an initial slope of 0.03 and 0.04 SDs on the first and second attempts, respectively. For each component, the difference in initial slopes across attempts was not statistically significant (each P > .05). In addition, slopes after the breakpoint were small for all components for both the first and second attempts (0.00–0.01 SDs), indicating that scores flattened out after the first three to six encounters. Finally, with the exception of the initial slope for CIS, we found no statistically significant examinee-to-examinee variation in slopes (see the random effects estimates in Table 1). Thus, examinees exhibited considerable consistency both in their initial score increases and in their subsequent flattening-out of scores.
Mere repetition and across-session score gains
Table 2 shows the magnitude of the average across-session score gain. Because previous studies5,7–9 have reported within-session score gains of a magnitude similar to the across-session gains we report in Table 2, we sought to determine whether the pattern of increasing scores during the first attempt could account entirely for the across-session score gains. Figure 1 suggests that this possibility is unlikely because extrapolating the first-attempt trajectory past the final SP encounter (i.e., for a hypothetical “13th” encounter) would underpredict the initial encounter score of the second attempt for all components.
To test this hypothesis, we used the first-attempt data reported in Table 1 to generate predicted scores for the initial SP encounter of the second attempt. A comparison of these modeled scores with actual scores (see Table 3) revealed that mere repetition alone could not explain examinees’ score gains, as the modeled scores were significantly lower than examinees’ actual scores. We reached the same conclusion when we used linear regression fits for this comparison.
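The comparison described above amounts to extrapolating the first-attempt trajectory to a hypothetical 13th encounter and asking whether observed second-attempt opening scores sit above that prediction. The Python sketch below illustrates the idea under loud assumptions: the coefficients and observed scores are invented, the hinge parameterization is hypothetical, and the one-sample t statistic treats the prediction as a fixed constant, which is a simplification and not necessarily the test used in the study.

```python
from math import sqrt
from statistics import mean, stdev


def predict_piecewise(coefs, encounter, breakpoint):
    """Predicted score from b0 + b1*min(e, bp) + b2*max(e - bp, 0)."""
    b0, b1, b2 = coefs
    return b0 + b1 * min(encounter, breakpoint) + b2 * max(encounter - breakpoint, 0.0)


def one_sample_t(observed, predicted):
    """t statistic for the mean of observed scores against a fixed predicted value."""
    n = len(observed)
    return (mean(observed) - predicted) / (stdev(observed) / sqrt(n))


# Hypothetical first-attempt coefficients: intercept, slope before the
# breakpoint, slope after the breakpoint (all in SD units).
first_attempt_coefs = (-0.60, 0.09, 0.01)
predicted_13 = predict_piecewise(first_attempt_coefs, encounter=13, breakpoint=3)

# Hypothetical standardized scores on encounter 1 of the second attempt.
observed_second_start = [0.10, 0.25, -0.05, 0.30, 0.15, 0.20]
t_stat = one_sample_t(observed_second_start, predicted_13)
```

A large positive statistic would indicate that second-attempt starting scores exceed what continued within-session gains alone would predict, which is the pattern the text reports.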
The literature shows clearly that examinees improve their clinical skills performance across repeated SP exam sessions. The purpose of our study was to supplement that literature by investigating their performance within both their first and second attempts at Step 2 CS. Three significant findings came to light: (1) on all three of the Step 2 CS components that we analyzed, scores increased over the first few SP encounters on both the first and second attempts, as indicated by the significant positive slopes at the beginning of the exam session, though the size of the slopes and the point at which they leveled off differed by component; (2) on the second attempt, examinees’ starting points were higher than what we predicted from the slope and end point of the first attempt; and (3) the strength of the initial score gains was fairly consistent across examinees. The score increases over the first few encounters of both the first and second attempts indicate that there was a temporary “warm-up” effect on performance that “reset” between attempts; this finding does not support the notion of a lasting practice effect and suggests that examinees must reorient themselves to the exam when taking it a second time. The following discussion elaborates on these findings and their practical implications for medical education.
Figure 1 highlights the initial score gains for examinees on both their first and second attempts. Examinees’ scores show the steepest, but briefest, warm-up effect on the CIS scale, whereas their improvement on both the DG and PN scales is less steep but takes longer to reach a leveling-off point. This finding is consistent with prior studies of the sequence effects within Step 2 CS, where results showed a steeper curve and quicker leveling of CIS scores compared with scores for other components.5 One explanation is that this component may be the area in which USMGs demonstrate the most test anxiety; despite the fact that all U.S. medical schools have communication skills curricula in place, research shows that these skills deteriorate over the course of medical school.15,16 Also, unlike in the real world, examinees must get quickly into the mind-set of interviewing a real patient while being given little feedback from the SP about how to proceed, which may add to their test anxiety. The DG and PN slopes are less steep; it may be that examinees are more comfortable at this point in their education with the tasks of taking a history and physical and summarizing the results, or they may perceive these tasks as more important and pay more attention to them than to their interview style, empathic statements, and other factors that affect their CIS score. The less steep DG and PN slopes may reflect the examinee merely having to adjust to the specific exam format and time limit.
Not only do examinees’ scores jump between the first and second attempts, but we also saw score increases over the first few encounters on the second attempt, indicating that repeat examinees continue to experience a warm-up effect. This pattern clarifies and extends previous research showing that both performance and measurement quality improve on the repeat attempt.17 Although this previous study suggested that some of the score gain could be attributed to a true improvement in performance, it left open the possibility that effects of mere repetition from the first attempt accounted for a large part of the gain. That is, the between-session score gain could be the result of the examinees overcoming construct-irrelevant difficulty18 on their first attempt and this effect carrying over to the second attempt. However, our results indicate that this possibility is unlikely, given that (1) the median time between attempts is five months, (2) the repeat examinees start out at a higher skill level on their second attempt than would be predicted on the basis of the slopes of their first-attempt scores, and (3) the warm-up effect persists on the second attempt. Instead, our results indicate that, between the first and second attempts, examinees do in fact improve their performance and that the effect of mere repetition across exam sessions is minimal.
Because of the consistency of our findings across examinees, our study further supports the presence of a structure (remediation, additional clinical experience, etc.) within the U.S. medical education system to actively help students who fail Step 2 CS on their first attempt. The difference in scores between the first and second attempt is large enough to lead us to the conclusion that at least some effective remediation is already taking place, but the persistent warm-up effect is an indicator that the medical school faculty providing this remediation may be choosing to focus primarily on the skills themselves. However, it might behoove medical educators to help students who fail Step 2 CS in a way that allows students to reduce their test anxiety and improve their pacing within the test. It is possible that this effect appears on school-level clinical skills exams as well as on national ones like the USMLE, so exam administrators may consider this multifaceted approach to helping the students who struggle with those exams.
Our results, although intriguing, are only a first step toward understanding the performance of examinees across repeat attempts of a challenging, performance-based certification exam, such as Step 2 CS. There are several potential next steps that build logically on our results. One possible avenue of investigation is to select a matched group of USMGs who passed the exam on their first attempt and see whether an initial score gain appears for them as well, and, if so, whether the magnitudes of the within-session increases for these examinees are similar to those for repeat examinees. If these initial score gain differences do not exist, or are not as large, for examinees who passed Step 2 CS on the first attempt, this additional information can better inform us why some USMGs initially fail the exam. Another possibility is to extend our analyses to international medical students and graduates (IMGs), as it is likely that their performance will differ from that of USMGs because of the variability in their native languages, medical education and training, and familiarity with the SP format. If IMGs are less prepared for this exam format than USMGs, their initial score gains may be steeper on their first attempt, and if they do not show improvement at the start of their second attempt, then an investigation into the impact of differing opportunities for remediation between IMGs and USMGs should be considered. Regardless of future endeavors, our current findings are consistent with both the policy that allows examinees to retake Step 2 CS and the use of the exam in physician licensure.
Acknowledgments: The authors thank Dr. Steve Haist and Dr. Gerry Dillon for their feedback.
Funding/Support: The report was prepared within the scope of the authors’ employment with the National Board of Medical Examiners (NBME).
Other disclosures: None.
Ethical approval: Not applicable.
Disclaimer: The opinions expressed here are the authors’ and do not necessarily represent the opinions of the NBME or USMLE.
* The vast majority of examinees who are educated in the United States and Canada sit for Step 2 CS while they are still enrolled in undergraduate medical education; nonetheless, we chose to use “USMG” to identify our examinee group for the sake of convenience and consistency with other Step 2 CS publications.
1. Boulet JR, McKinley DW, Whelan GP, Hambleton RK. The effect of task exposure on repeat candidate scores in a high-stakes standardized patient assessment. Teach Learn Med. 2003;15:227–232
2. Swygert KA, Balog KP, Jobe A. The impact of repeat information on examinee performance for a large-scale standardized-patient examination. Acad Med. 2010;85:1506–1510
3. Raymond MR, Kahraman N, Swygert KA, Balog KP. Evaluating construct equivalence and criterion-related validity for repeat examinees on a standardized patient examination. Acad Med. 2011;86:1253–1259
4. McKinley DW, Boulet JR. The effects of task sequence on examinee performance. Teach Learn Med. 2004;16:18–22
5. Ramineni C, Harik P, Margolis MJ, Clauser BE, Swanson DB, Dillon GF. Sequence effects in the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) examination. Acad Med. 2007;82(10 suppl):S101–S104
6. Messick S. Standard of validity and the validity of standards in performance assessment. Educ Meas Issues Pract. 1995;14:5–8
7. Colliver JA, Nu VV, Verhulst SJ, Barrows HS. Effect of position-within-sequence on case performance in a multiple-stations examination using standardized-patient cases. Eval Health Prof. 1991;14:343–355
8. Lloyd JS, Williams RG, Simonton DK, Sherman D. Order effects in standardized patient examinations. Acad Med. 1990;65(9 suppl):S51–S52
9. Gispert R, Rué M, Roma J, Martinez-Carretero JM. Gender, sequence of cases and day effects on clinical skills assessment with standardized patients. Med Educ. 1999;33:499–503
10. Hausknecht JP, Halpert JA, Di Paolo NT, Moriarty Gerrard MO. Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. J Appl Psychol. 2007;92:373–385
12. Cleveland WS. Robust locally weighted regression and smoothing scatterplots. J Am Statist Assoc. 1979;74:829–836
13. Goldstein H, Browne WJ, Rasbash J. Partitioning variation in multilevel models. Underst Stat. 2002;1:223–231
14. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2008. http://www.R-project.org/. Accessed January 22, 2013
15. Pfeiffer C, Madray H, Ardolino A, Willms J. The rise and fall of students’ skill in obtaining a medical history. Med Educ. 1998;32:283–288
16. Hojat M, Mangione S, Nasca TJ, et al. An empirical study of decline in empathy in medical school. Med Educ. 2004;38:934–941
17. Raymond MR, Swygert KA, Kahraman N. Psychometric equivalence of ratings for repeat examinees on a performance assessment for physician licensure. J Educ Meas. 2012;49:339–361
18. Messick S. Validity. In: Linn R, ed. Educational Measurement. 3rd ed. Washington, DC: American Council on Education; 1989