Raymond, Mark R. PhD; Kahraman, Nilufer PhD; Swygert, Kimberly A. PhD; Balog, Kevin P. MS
The United States Medical Licensing Examination (USMLE) includes an assessment of clinical proficiency known as the Step 2 Clinical Skills (CS) Examination. Step 2 CS uses standardized patients (SPs) to assess an examinee's ability to acquire medical information through patient interviews, to perform physical examinations, and to summarize and communicate their findings. Determining the extent to which scores on the exam actually assess these skills is an important aspect of construct validity.1 Previous correlational and factor analytic studies provide evidence that Step 2 CS measures different but related aspects of clinical skill.2,3 In addition, Step 2 CS scores make a unique contribution to the assessment of competence, as suggested by the low to moderate correlations between Step 2 CS and other exams comprising the USMLE sequence.3
An issue that has received limited attention is the validity of score interpretations for examinees who initially fail and then later repeat a performance assessment such as Step 2 CS. Although scores should increase for examinees who remediate deficiencies, score gains can also occur for reasons that compromise validity. Construct-irrelevant variance is a concern when the score increase can be attributed to an improvement unrelated to the purpose of the assessment, such as increased examinee self-confidence or test sophistication (i.e., test-taking skills).1,4 Researchers have observed large score gains for repeat examinees on performance assessments such as oral exams, and it seems likely that construct-irrelevant variance explains some portion of those increases.5,6 Scores might also increase for repeat examinees who remember previously seen test materials. If the increase is specific to memorized content and does not reflect an overall improvement in skill, then scores on the second attempt will overestimate true proficiency.7
Researchers have extensively studied retest effects in the context of multiple-choice tests. Results generally indicate that repeat examinees obtain significantly higher scores on their second attempt, and this benefit is considerably more pronounced for examinees who see identical test items on their second try.8 One particularly relevant study reported score gains exceeding a standard deviation (SD) on the reused portion of a medical school admissions test administered in Belgium.9 Furthermore, the factor structure changed for repeat examinees from their first to second attempt, and scores on the second attempt were less predictive of performance in medical school.9 The advantage of seeing the same form twice does not seem to occur on licensure and certification tests, at least not in the few studies that have been conducted. Although scores improve on the second attempt, the average increase seems to be the same whether examinees see the same test form or a different form on their second attempt.10 Nor does the same-test advantage seem to hold for clinical skills exams. One study of Step 2 CS found gains averaging about 0.87 SD units across all examinee groups and skill domains, but there was no additional advantage for examinees who saw one or two of the same cases on their second attempt.7 Studies of clinical skill exams at medical schools have shown similar findings.11–13
The cumulative findings suggest that although repeat examinees experience large score gains on performance tests, the increase cannot be attributed to memorization of test content. However, the possibility remains that construct-irrelevant variance explains some portion of the score gain. The purpose of the current study was to further investigate the validity of scores for examinees who repeat the Step 2 CS exam by evaluating its internal structure as well as its relationship to external variables. The internal and external characteristics of a test are important aspects of validity,1 but to our knowledge no studies have investigated both properties for a sample of repeat examinees.
For this study, we used correlations and factor analysis to evaluate the internal structure of Step 2 CS scores separately for single-take and repeat examinees. Although a dissimilar correlational structure for single-take and repeat examinees would suggest differences in the constructs being assessed, such results would leave unanswered any questions regarding which scores were more or less valid. Therefore, we also examined the relationships between Step 2 CS scores and scores on three external measures of physician knowledge and skill for repeat examinees. Prior studies3 report external correlations that range from about .10 to .40; similar correlations in the present study would support the external validity of scores for repeat examinees.
Step 2 CS is designed to measure clinical skills in four domains: (1) communication–interpersonal skills, (2) spoken English proficiency, (3) data gathering, and (4) patient note documentation. Successful completion of Step 2 CS is required for entry into graduate medical education (residency) in the United States; therefore, students generally take this exam just before graduating from medical school and/or immediately before entering residency. The exam is administered five or six days a week, year-round, at five testing centers throughout the United States (Atlanta, Chicago, Houston, Los Angeles, and Philadelphia). Exam forms are generated daily within and across test centers to ensure that examinees who test on one day do not see the same cases that were administered on the previous or subsequent days. Exams are assembled according to a detailed blueprint to ensure that different forms are comparable in terms of case difficulty and content. As a result of logistic constraints, some examinees who repeat Step 2 CS may see the same case on two occasions, but this occurs for only 8% of all encounters.7
Examinees encounter 12 cases during a testing session, and a different SP portrays each case. During each encounter, examinees have up to 15 minutes to interact with the SP. Examinees are informed of the reason for the patient's visit before entering the SP's room and are instructed to take a medical history and perform a physical examination. At the conclusion of the encounter, examinees have 10 minutes to document their findings in a structured patient note. The SPs use these 10 minutes to complete the checklist and rating scales that result in scores for communication–interpersonal skills, spoken English proficiency, and data gathering. Trained physicians assign patient note ratings subsequent to the examination.
Approximately 34,000 examinees take Step 2 CS each year. To pass, examinees must exceed cutoff scores in each of three areas: (1) communication–interpersonal skills, (2) spoken English proficiency, and (3) a composite consisting of data gathering and patient notes. An examinee who fails in one or more of these three areas and who wishes to take the exam again must repeat the entire Step 2 CS. Fail rates average about 14% each year. Of those who fail, 59% fail communication–interpersonal skills, 50% fail the composite of data gathering and patient notes, and 22% fail spoken English proficiency. Some examinees fail in more than one area; thus, the sum of these percentages is greater than 100. Although pass–fail decisions are based on the three areas just described, this study analyzes scores on all four domains (i.e., communication–interpersonal skills, spoken English proficiency, data gathering, and patient notes) separately.
Scores were also available for the three written examinations: The Step 1 exam is a measure of basic science (BS) knowledge, the Step 2 exam is a measure of clinical knowledge (CK), and the Step 3 exam is a measure of the ability to apply CK to patient management (PM). For this report, we have designated these tests as Step 1 BS, Step 2 CK, and Step 3 PM. Although the time interval from the first test to the last test varies, particularly for international medical graduates (IMGs), the three tests are generally taken in that order.
The potential sample consisted of all examinees completing Step 2 CS between July 16, 2007, and September 12, 2009, under normal test administration conditions. Participants gave prior approval for their scores to be used for research purposes. A National Board of Medical Examiners research panel deemed this study to be exempt research. All personal identifying information had been removed from examinee records prior to analysis. The data of interest initially included 5,184 examinees who completed Step 2 CS on two occasions. As previously noted, 22% of the repeat examinees failed the spoken English proficiency portion. Given that Step 2 CS performance is positively influenced by spoken English proficiency in the general population of examinees,3 we were concerned that the data contained a potential confounding factor: that any differences in correlations between single-take and repeat examinees might be a function of differences in English language skills rather than retest status. Therefore, we matched single-take and repeat examinees on spoken English proficiency scores.
The goal of matching was to ensure that single-take and repeat examinees had similar levels of spoken English proficiency. Given that all single-take examinees earned a passing score on spoken English proficiency, we first excluded from the sample any repeat examinees (n = 1,154) who did not earn a passing score. It was apparent that the matching process would result in a ratio of single-take to repeat examinees of approximately three to one. Therefore, we next drew a random sample of scores for single-take examinees with the constraint that for every spoken English score (e.g., 71, 72, 73), there would be three times as many single-take examinees as repeat examinees. This matching produced almost identical score distributions on spoken English proficiency for single-take and repeat examinees.
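The matching step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually run for the study: the record layout (pairs of examinee identifier and spoken English score), the function name `match_single_take`, and the fixed random seed are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def match_single_take(single_take, repeat, ratio=3, seed=0):
    """Sample single-take examinees so that every spoken English score has
    `ratio` times as many single-take as repeat examinees.

    Both inputs are lists of (examinee_id, spoken_english_score) tuples;
    this layout is an assumption made for illustration.
    """
    rng = random.Random(seed)

    # Number of repeat examinees at each spoken English score.
    needed = Counter(score for _, score in repeat)

    # Pool of single-take examinee ids available at each score.
    pool = defaultdict(list)
    for examinee_id, score in single_take:
        pool[score].append(examinee_id)

    matched = []
    for score, n_repeat in needed.items():
        k = ratio * n_repeat
        candidates = pool.get(score, [])
        if len(candidates) < k:
            raise ValueError(f"too few single-take examinees at score {score}")
        # Random draw without replacement within the score stratum.
        matched.extend((eid, score) for eid in rng.sample(candidates, k))
    return matched
```

Because the draw is made separately within each score value, the two groups end up with the same relative score distribution by construction, which is the property the matching was meant to secure.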
We assigned the examinees' scores to three “groups” based on their repeat status: single-take examinees, repeat examinees on their first attempt (repeat-1), and repeat examinees on their second attempt (repeat-2). That is, we measured all repeat examinees twice on all four Step 2 CS domains.
We completed three sets of analyses. First, we obtained descriptive statistics both to determine the magnitude of the score gains for repeat examinees and to compare correlations among the four subscores internal to Step 2 CS.
Second, we used multigroup confirmatory factor analysis to evaluate the similarity of the correlations for single-take and repeat examinees. The primary purpose of the confirmatory factor analysis for this investigation was to formally assess the equivalence of the correlation matrices and factor structure of test scores for the two groups of examinees.14 At the first stage, the model assumes that the factor structure is the same by constraining factor loadings to be equal across groups. At the second stage, the factor loadings are unconstrained, allowing each group to have its own factor structure. We evaluated model fit at each stage using a conventional χ² goodness-of-fit test, which indicates the degree to which the observed correlation or covariance matrix is predicted by the factor model.15 If the unconstrained model at the second stage provides a significantly better model fit than the constrained model at stage one, then we can conclude that each group is best described by its own factor structure.14 The χ² goodness-of-fit test is useful for statistical testing, but it does not lend itself to useful interpretation because large sample sizes tend to produce large and statistically significant results. Therefore, we also reported the comparative fit index (CFI), which is an R² type of statistic ranging from 0 to 1, with values close to 1 indicating good fit. We conducted two confirmatory factor analyses. The first compared single-take examinees with repeat-1 examinees, whereas the second compared single-take examinees with repeat-2 examinees.
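For readers less familiar with these statistics, the two quantities used to compare the constrained and unconstrained models can be written out as follows. These are the standard textbook definitions, given here as an aid to interpretation; the article itself does not print the formulas, and the subscripts (c for constrained, u for unconstrained, M for the fitted model, B for the baseline independence model) are notation introduced for this illustration.

```latex
% Chi-square difference test for nested models
\Delta\chi^2 = \chi^2_{c} - \chi^2_{u}, \qquad
\Delta df = df_{c} - df_{u}

% Comparative fit index (Bentler), bounded between 0 and 1
\mathrm{CFI} = 1 -
  \frac{\max\left(\chi^2_{M} - df_{M},\, 0\right)}
       {\max\left(\chi^2_{M} - df_{M},\; \chi^2_{B} - df_{B},\; 0\right)}
```

Under the hypothesis that the constrained model holds, the difference statistic is referred to a χ² distribution with Δdf degrees of freedom. The CFI expresses how much of the misfit of the baseline model is removed by the fitted model, which is why it depends less on sample size than the raw χ² value does.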
Third (and last), we obtained correlations between Step 2 CS scores and scores on the three written external measures separately for each group. These correlations were based on scores from the first attempt for the three external measures.
We used SPSS (version 16.0, Chicago, Illinois) to compute descriptive statistics and correlations and Mplus (version 4.1, Los Angeles, California)14 for the factor analyses.
The final sample included 12,090 single-take examinees and 4,030 repeat examinees. Table 1 summarizes the demographic characteristics of the two groups. The repeat group contains a larger proportion of males and a smaller proportion of females than the single-take group. In addition, compared with single-take examinees, the group of repeat examinees has a smaller proportion of graduates of U.S. medical schools and a larger proportion of U.S. citizens who graduated from an international medical school. The proportion of true IMGs (i.e., IMGs who are not U.S. citizens) is similar in the two groups of examinees (64% for single-take examinees, 63% for repeat examinees).
Means and correlations
Table 2 presents means, SDs, and correlations for the three sets of scores. The nearly identical means and SDs on spoken English proficiency for single-take and repeat-1 examinees are a consequence of the matching process for the present sample. Repeat examinees exhibited increases in mean scores in all four areas, but most notably in communication–interpersonal skills. Eighty percent of repeat examinees passed all areas, a value which approaches the pass rate of 86% for first-time examinees. The score increases are slightly less than those reported by Swygert and colleagues,7 which might be due, in part, to matching on spoken English proficiency scores. The SDs for single-take, repeat-1, and repeat-2 examinees for each of the four measures are comparable. For example, the SDs for the three groups on communication–interpersonal skills are, respectively, 6.6, 6.3, and 5.8. This is noteworthy because the similarity in SDs across groups contributes to the interpretability of subsequent analyses in that too little variability for one or more groups would suppress the correlations.
Consistent with prior research,3 correlations for single-take examinees are positive and moderately strong, ranging from 0.24 to 0.56. However, correlations for the repeat-1 examinees are mostly low. In fact, three of the six are negative. Some of the differences in correlations between single-take and repeat-1 examinees are large, particularly those involving the two communication scales. For single-take examinees, the correlation between communication–interpersonal skills and data gathering is 0.47, and the correlation between communication–interpersonal skills and patient notes is 0.53. In contrast, those correlations for repeat-1 examinees are, respectively, –0.25 and –0.15. The correlations of spoken English proficiency with data gathering and with patient notes are also unexpectedly low (–0.17 and 0.15). In other words, for repeat-1 examinees, performance on the two communication measures seems to be independent of obtaining a history, performing a physical, and/or documenting findings. Meanwhile, for repeat-2 examinees, the negative and low correlations move in a positive direction. The largest shift in correlation is for communication–interpersonal skills and data gathering, which changes from –0.25 to 0.36. The results suggest that the constructs assessed for repeat-1 examinees are different from the constructs assessed for single-take examinees but that the differences diminish by the time repeat examinees complete their second attempt. The confirmatory factor analysis provides a formal evaluation of this observation.
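As a rough check on the size of these differences, a single pair of correlations from independent groups can be compared with the Fisher z transformation. This is not the method used in the study, which relied on multigroup confirmatory factor analysis; the sketch below is a simpler illustration for one correlation at a time, and the function name `fisher_z_difference` is hypothetical. The inputs are the correlations between communication–interpersonal skills and data gathering and the final group sizes reported in this article.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic and two-sided p value for the difference between two
    correlations observed in independent samples."""
    z1 = math.atanh(r1)  # Fisher transformation of the first correlation
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p


# Single-take (r = 0.47, n = 12,090) versus repeat-1 (r = -0.25, n = 4,030).
z, p = fisher_z_difference(0.47, 12090, -0.25, 4030)
print(f"z = {z:.1f}")
```

With samples this large, almost any visible difference in correlations will be statistically significant, so the magnitude and sign of the difference are more informative than the p value.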
The first multigroup confirmatory factor analysis compared correlation matrices for single-take examinees with repeat-1 examinees. Given the unusual correlations for repeat-1 examinees, reasonable model fit was achieved only by allowing certain error terms to correlate. Even so, the fit for the constrained model at stage one was quite poor (CFI = 0.78; χ² = 3,611), whereas the unconstrained model at stage two fit slightly better (CFI = 0.86; χ² = 2,349). The difference in model fit is statistically significant by the χ² difference test (χ²diff = 1,262; P < .001), indicating that the groups have different underlying factor structures. The second confirmatory factor analysis compared single-take examinees with repeat-2 examinees. Fit indices for the constrained model at stage one were good (CFI = 0.96; χ² = 663) and improved only slightly for the unconstrained model at stage two (CFI = 0.96; χ² = 625); however, the unconstrained model did provide significantly better fit (χ²diff = 38; P < .001). Although statistically significant, the small value of χ²diff and the identical CFIs indicate that the differences between single-take and repeat-2 examinees are small.
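The arithmetic behind the two difference tests can be reproduced from the fit statistics reported above. The sketch below is illustrative only. The degrees of freedom for the difference are not reported in this article, so the value of 4 used here is a labeled assumption, and the helper names `chi2_sf` and `chi2_difference` are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate for integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range((df - 1) // 2):
        total += term
        term *= x / (2 * k + 3)
    return total


def chi2_difference(chi2_constrained, chi2_unconstrained, df_diff):
    """Chi-square difference between nested models and its p value."""
    diff = chi2_constrained - chi2_unconstrained
    return diff, chi2_sf(diff, df_diff)


ASSUMED_DF_DIFF = 4  # assumption: the article does not report this value

diff_1, p_1 = chi2_difference(3611, 2349, ASSUMED_DF_DIFF)  # single-take vs repeat-1
diff_2, p_2 = chi2_difference(663, 625, ASSUMED_DF_DIFF)    # single-take vs repeat-2
print(diff_1, p_1 < 0.001)
print(diff_2, p_2 < 0.001)
```

The two differences match the reported values of 1,262 and 38. Whether the second difference stays below the .001 threshold depends on the true degrees of freedom, which is why the assumed value is exposed as a named constant rather than buried in the call.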
We subjected correlations for the three groups to single-group factor analyses to further investigate differences among them. The results for the single-factor solutions appear in Table 3. The factor loadings for single-take examinees are high and positive, ranging from 0.60 to 0.78. In addition, the model fit is good (CFI = 0.93). The factor loadings for the repeat-1 group are different from those for single-take examinees, ranging from –0.39 to 0.75; furthermore, model fit is very poor (CFI = 0.51). The negative loadings indicate that the two communication scales are inversely related to data gathering and patient notes for repeat-1 examinees. Meanwhile, the factor loadings for repeat-2 examinees range from 0.40 to 0.69; they are similar to the loadings for single-take examinees and different from the loadings for repeat-1 examinees. The notable difference in factor loadings between single-take and repeat-2 examinees is for spoken English proficiency (0.64 versus 0.40). That is, spoken English proficiency is less related to overall clinical proficiency for repeat-2 examinees than it is for single-take examinees.
Correlations with external criteria
Correlations between the four Step 2 CS skill domains and the three external measures appear in Table 4. As noted in a footnote to Table 4, the number of examinees is different for the three external measures. The sample size for Step 3 PM is the smallest because some examinees who completed Step 2 CS had not yet taken Step 3 PM. The 12 correlations for single-take examinees range from 0.16 to 0.44, with a median of 0.33; these values are comparable to those reported in previous studies.3 Correlations for repeat-1 examinees range from –0.04 to 0.31, with a median of 0.15. The largest difference in correlations between repeat-1 examinees and single-take examinees is for communication–interpersonal skills and Step 3 PM (0.02 [repeat-1] versus 0.42 [single-take]). Meanwhile, correlations are slightly higher at repeat-2 than repeat-1, ranging from 0.00 to 0.37, with a median of 0.27. The correlations for repeat examinees generally approach the magnitude of the correlations for single-take examinees as the former move from their first to second attempt. The largest increases in correlation from repeat-1 to repeat-2 are for communication–interpersonal skills and Step 2 CK (–0.01 to 0.24) and for communication–interpersonal skills and Step 3 PM (0.02 to 0.27). Taken as a whole, the correlations indicate that criterion-related validity of scores on Step 2 CS improves for repeat examinees on their second attempt.
Discussion and Conclusions
Two lines of evidence suggest that the construct underlying performance on Step 2 CS is markedly different for single-take examinees and repeat examinees on their first attempt. Not only were correlations and factor structure among Step 2 CS components weak and difficult to interpret for repeat examinees on their initial attempt, but correlations of Step 2 CS scores with external measures of medical knowledge were lower than expected for this group. However, much of the difference between repeat and single-take examinees diminished with experience. By the time repeat examinees completed their second attempt, their factor structure was similar to that of single-take examinees, and their correlations with external measures approached expectations. These outcomes extend previous research7,11–13 by going beyond score gains to evaluate the impact of these gains on validity.
One puzzling outcome concerns the correlation between spoken English proficiency and other scores. Although we matched the groups, the relationship between spoken English proficiency and other scores still varied based on repeater status. Additional analyses confirmed that lower spoken English proficiency scores were associated with higher scores on data gathering for both native and nonnative speakers of English, suggesting that impaired performance on the first attempt is not a simple function of language differences. It may be, for example, that examinees who struggle with English speak enough in the encounter to ensure that they cover all or most data-gathering checklist questions, but this thoroughness exposes more fully the limitations of their English skills.
These findings raise questions regarding the source and validity of the score gains. Although some of the gain can be attributed to examinees improving their clinical skills, other factors may contribute. Part of the increase can be attributed to random measurement error. Even in the absence of any improvement in proficiency, low scores on performance tests tend to regress toward the mean on retesting by nontrivial amounts.6,16 The shift in correlations also implicates construct-irrelevant variance as a possible source. If construct-irrelevant variance is introduced after the first attempt, then the validity of scores for the second attempt will be compromised. Such a compromise could occur when repeat examinees learn between their first and second attempts certain test-taking tactics that are not available to most examinees on their first attempt. That is, scores may increase because examinees have become skilled test takers rather than skilled clinicians. In contrast, if construct-irrelevant variance decreases after the first attempt, then score validity on the second attempt should improve. A reduction in construct-irrelevant variance occurs when performance on the first attempt is suppressed by factors—such as anxiety or unfamiliarity with an assessment format—that are irrelevant to the construct being measured but which become neutralized by the second attempt. The consequence is that the latter scores will more accurately reflect an examinee's true proficiency.
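The contribution of regression toward the mean can be illustrated with the classical linear prediction of a retest score. The sketch below assumes equal means and SDs on both occasions and no change in true proficiency. The function names and every numeric value in the example (a group mean of 70, a first score of 60, a test-retest correlation of 0.70) are hypothetical and are not taken from Step 2 CS data.

```python
def expected_retest_score(first_score, group_mean, retest_correlation):
    """Expected second score when true proficiency has not changed.

    Assumes equal means and SDs on both occasions, so the prediction is
    the mean plus the correlation times the first-attempt deviation.
    """
    if not 0.0 <= retest_correlation <= 1.0:
        raise ValueError("retest_correlation must lie between 0 and 1")
    return group_mean + retest_correlation * (first_score - group_mean)


def gain_from_regression(first_score, group_mean, retest_correlation):
    """Portion of an observed gain expected from measurement error alone."""
    expected = expected_retest_score(first_score, group_mean, retest_correlation)
    return expected - first_score


# Hypothetical values: an examinee 10 points below the mean on attempt one.
print(gain_from_regression(60.0, 70.0, 0.70))
```

In this hypothetical case the examinee would be expected to gain about 3 points with no improvement in skill. The lower the reliability of the score, the larger the share of any observed gain that could reflect measurement error.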
The existence of a large practice effect implies that some examinees are not well prepared for the innovative SP format on their initial attempt. Given that the vast majority of U.S. medical schools now use SP-based clinical skills exams for student evaluation, most U.S. graduates are now likely to have considerable preparation with this format.17 The situation for IMGs is not as clear; although some IMGs may have considerable experience with the SP format, others may have little or no prior experience. That much said, the unusual pattern of correlations observed for repeat-1 examinees did not seem to be attributable to IMG status. Although IMGs are more likely to repeat Step 2 CS,7 the sampling method employed here resulted in approximately equal percentages of IMGs in each group (single-take = 64% IMG; repeat = 63% IMG). Regardless of country of medical education, medical schools likely differ in the extent to which their SP exams are similar to Step 2 CS, and examinees from schools with less similar assessment formats may feel less certain and more challenged on their first attempt.
Further research is needed to better understand the relationships between IMG status, English language proficiency, and test performance for repeat examinees. We plan to evaluate score gains and patterns of correlations for each of these groups; the practical problem is that sample sizes become exceedingly small as repeat examinees are partitioned into groups based on IMG status and English as a first language. Investigating practice effects within a testing session would also be informative. Previous research detected a sequence effect by which examinees perform better after their initial few SP encounters18; future studies should evaluate the magnitude of the within-session effect for repeat or other low-scoring examinees. Studying scores for examinees who take Step 2 CS three or more times would be similarly useful; however, small sample sizes would limit the utility of such efforts. Additional studies might seek to verify our assumption that lack of experience with the SP format is a source of construct-irrelevant variance for some examinees (e.g., by surveying repeat examinees). Such results could identify interventions that would help minimize the effect of test format. Plans are also under way to determine the portion of score increases that can be attributed to random measurement error because such gains may have implications for the manner in which passing scores are established.6,16
In summary, we interpret the more consistent correlations for repeat examinees on their second attempt as a sign that construct-irrelevant variance decreased and that inferences based on scores from the second assessment will be more valid than inferences based on the first attempt. The first attempt seems to serve as a practice test for many repeat examinees, who are better equipped to demonstrate their skill once they learn the examination format.4 However, alternative explanations are certainly possible. For example, the improved correlation between Step 2 CS and Step 1 BS might be attributable to a type of test-taking skill that is common to most assessment formats.
Even if the practice effect explanation is correct, characteristics of the study design may limit the extent to which the findings generalize to other settings, including medical schools. The SP exam studied here was completed by a heterogeneous and highly motivated group of examinees. A more homogeneous group of less motivated examinees may change their behaviors less between their first and second attempts. Another potential limitation is that repeat examinees in the present study had a low probability of encountering the same case or SP on their second attempt. Score gains on exams for which examinees see identical content may produce different outcomes than we observed,9 although previous studies suggest that repeating content on performance exams does not tend to lead to better scores.7,11–13 A related issue is that Step 2 CS is highly standardized, which means that many of its features remain the same across time. Even if the content changes, repeat examinees know what to expect. Possibly, the results would be different for less standardized SP exams. In short, although the present results may not generalize to assessment contexts too dissimilar from the one studied here, they do serve as a general reminder to medical educators to use caution when drawing inferences from test scores of examinees who have limited experience with a novel assessment format.4
The authors received many helpful comments from Marc Gessaroli, PhD, and Ron Nungester, PhD, of the National Board of Medical Examiners; from three anonymous reviewers; and from Academic Medicine's editorial staff.
This article was prepared within the scope of the authors' employment with the National Board of Medical Examiners.
Exam participants gave prior approval for their scores to be used for research purposes. A National Board of Medical Examiners research panel deemed this study to be exempt research. All personal identifying information had been removed from examinee records prior to analysis.
The opinions expressed here are those of the authors alone and do not necessarily represent the opinions of the National Board of Medical Examiners or United States Medical Licensing Exam.