For over a decade, performance assessments using standardized patients (SPs) have served as the foremost approach to evaluating the clinical skills of medical students. Indeed, most medical schools include such assessments as part of their student evaluation systems, and the United States Medical Licensing Examination (USMLE) has included a performance assessment—Step 2 Clinical Skills (CS)—since 2004. Typically, these assessments aim to measure a range of clinical abilities, such as communication and interpersonal skills, the capacity to gather pertinent patient information by taking a history and performing a physical examination, and the ability to write a structured patient note that documents clinical findings and proposed diagnoses.
Performance assessments allow for an evaluation of important skill sets typically not well assessed with traditional multiple-choice examinations. Yet, because they rely on human raters and only approximate real-world settings, particular attention must be paid to ensure their efficacy. Furthermore, when scores from performance assessments are used to make high-stakes decisions, it is essential to evaluate the validity of the intended score interpretations.
In his seminal work on validity theory, Messick1,2 outlines several important aspects of a unified validity model. One of these aspects relates to the relationships between examination scores and external criterion measures (i.e., other variables representing the same construct as the examination). Messick defines validity as “an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores”2(p5) and indicates that “the meaning of the scores is substantiated externally by appraising the degree to which empirical relationships with other measures, or the lack thereof, is consistent with that meaning.”2(p7) More recently, Kane3 reiterated the importance of relating test scores to the outcomes they are intended to represent by identifying extrapolation, the establishment of relationships between assessment outcomes and their associated real-world behaviors, as one of four essential components of a validity argument.
With this theoretical framework in mind, a review of past research indicates that a limited number of studies have investigated the relationships between physicians’ scores on clinical skills examinations and real-world outcomes related to patient care. One notable study by Tamblyn and colleagues4 found that examinees with unusually low scores on the communication and clinical decision-making components of the Medical Council of Canada’s licensing examination were significantly more likely than other examinees to receive nontrivial patient complaints in practice. These findings suggest that the Medical Council of Canada’s licensing examination scores are useful predictors of effective patient care and thus provide support for their use in medical licensing decisions in Canada.
Other research has explored whether clinical skills examination scores are able to predict subsequent ratings of clinical skills performance during residency training. Most of this research demonstrates a modest positive association between examination scores and performance ratings. For example, Taylor and colleagues5 examined the relationship between scores on a National Board of Medical Examiners clinical skills examination administered to fourth-year medical students at a single institution (as part of a research and development project) and ratings of their subsequent performance in residency training as provided by program directors. The authors reported correlations of 0.25 for interpersonal skills scores and ratings and 0.20 for data gathering scores and ratings.5 Smith6 also focused on outcome measures from a clinical skills examination at a single institution and found a correlation of 0.27 between data collection scores for senior medical students and ratings of their subsequent performance in data collection during their residency training. More recently, using data from a national sample, we found a modest association between Step 2 CS communication and interpersonal skills (CIS) scores and communication skills ratings assigned by residency directors to internal medicine residents in their first year of residency training.7 These studies provide some validity evidence in support of performance assessment scores that are intended to measure the clinical skills of medical students. However, given the limited number of studies and the rare use of national samples, more research is needed. This is especially true for Step 2 CS scores because they are used to make medical licensure decisions in the United States.
Step 2 CS uses SPs to assess examinees on their ability to gather information from patients, perform physical examinations, and communicate their findings. Examinees are scored in three subcomponents: integrated clinical encounter (ICE), CIS, and spoken English proficiency (SEP). ICE includes separate assessments of (1) data gathering and (2) data interpretation. Data gathering scores assess history-taking and physical examination skills, and data interpretation scores assess how well examinees compile patient notes based on their ability to gather pertinent patient data.
This study extends our previous work7 by using data from the same national sample to examine the relationships between Step 2 CS data gathering and data interpretation scores and subsequent proficiency in history taking and physical examination during the first year of residency training. As in our earlier study, we focus on the aspect of validity that relates to relationships with external measures and on the extrapolation element of a validity argument.
The data set included demographic information, USMLE scores, and ratings of performance from residency directors as reported to the American Board of Internal Medicine (ABIM). Residency programs with fewer than 10 examinees (n = 147) were excluded from analysis (about 3% of the examinees included in the initial study sample) because they were too small to yield reliable results. The final study sample comprised 6,306 examinees from 238 internal medicine residency programs who completed Step 2 CS for the first time in 2005 and for whom ratings were available for their first year of internal medicine residency training.
The study was reviewed by the American Institutes for Research institutional review board and qualified for exempt status because it was based on operational data and involved minimal or no risk to study subjects.
Hierarchical linear modeling
The data used in this study are hierarchically structured in that examinees are nested within internal medicine residency programs. In other words, each examinee is associated with a single program, and each program includes multiple examinees. Given this data structure, there may be dependency in the data because residents within the same training program likely are more similar to one another than to residents in different programs. Hierarchical linear modeling (HLM) techniques allow for appropriate analysis of nested data.8
For this study, HLM can be conceptualized as estimating a unique regression line predicting internal medicine ratings from Step 2 CS scores for each internal medicine residency program (random-coefficients models). The results of these examinee-level regression analyses (the intercept and slopes) are then treated as dependent variables in program-level analyses that include program characteristics as between-program predictors (intercept-and-slopes-as-outcomes models). This allows for the possibility that residency program characteristics affect the examinee-level relationships. For example, it may be that the effect of Step 2 CS data gathering scores on physical examination ratings is stronger for residents in smaller training programs, as smaller programs may allow for the provision of more individualized instruction.
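In generic two-level notation (following the conventions of Raudenbush and Bryk8), this idea can be sketched as follows for a single examinee-level predictor X and a single program-level predictor W. The symbols are illustrative assumptions rather than the exact specification fit in this study, which included several predictors at each level.

```latex
\begin{align*}
Y_{ij} &= \beta_{0j} + \beta_{1j}\,(X_{ij} - \bar{X}) + r_{ij}, \qquad r_{ij} \sim N(0, \sigma^2) \\
\beta_{0j} &= \gamma_{00} + \gamma_{01}\,(W_j - \bar{W}) + u_{0j} \\
\beta_{1j} &= \gamma_{10} + \gamma_{11}\,(W_j - \bar{W}) + u_{1j}
\end{align*}
```

Here Y_ij is the rating of examinee i in program j, X_ij is a Step 2 CS score, and W_j is a program characteristic such as program size. The coefficient gamma_11 is the cross-level interaction: it captures how the within-program slope changes with the program characteristic. Setting gamma_01 and gamma_11 to zero gives a random-coefficients model, and additionally dropping X gives the one-way random-effects ANOVA.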
Two distinct sets of examinees-nested-in-programs HLM analyses were fit to the data, one that treated history-taking ratings as the dependent variable and the other that treated physical examination ratings as the dependent variable. Each set of analyses included the following:
- a one-way random-effects analysis of variance (ANOVA) that partitioned the total variation in ratings into within-program and between-program components;
- a series of random-coefficients models used to examine the relationships between Step 2 CS scores and ratings; and
- a series of intercept-and-slopes-as-outcomes models used to estimate the effects of program characteristics on both mean ratings and the relationships between Step 2 CS scores and ratings.
For ease of interpretation, all independent variables were grand-mean centered.
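As a minimal sketch of what grand-mean centering does, the snippet below subtracts the overall mean from each value, ignoring program membership. The function name and the score values are hypothetical and are not taken from the study data.

```python
from statistics import mean


def grand_mean_center(values):
    """Subtract the overall (grand) mean from every value, ignoring grouping."""
    grand_mean = mean(values)
    return [v - grand_mean for v in values]


# Hypothetical data interpretation scores pooled across all programs.
scores = [65.0, 70.0, 75.0, 80.0, 85.0]
centered = grand_mean_center(scores)
print(centered)  # [-10.0, -5.0, 0.0, 5.0, 10.0]
```

With every predictor centered this way, the model intercept can be read as the expected rating for an examinee with average values on all predictors.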
Using the ABIM’s standardized Web-based system for rating internal medicine residents’ competence, residency directors rated residents annually in skills such as clinical judgment, history taking, and physical examination. Ratings were based on information collected through the residency program’s assessment system, which typically included data from end-of-rotation evaluations, direct observations, and chart audits. Directors rated residents using a nine-point scale that was divided into three categories: unsatisfactory (ratings of 1–3), satisfactory (ratings of 4–6), and superior (ratings of 7–9). Residents must receive satisfactory ratings on all components of the evaluation during their final year of training to be eligible to take the ABIM certification examination. Previous studies have shown that clinical competence ratings of internal medicine residents by program directors correlate significantly with certification examination scores and physician peer ratings.9,10 We examined history-taking and physical examination ratings from the end of a resident’s first year in a training program, rather than the beginning of the first year, because Step 2 CS scores are intended to indicate readiness to enter supervised practice.
The primary examinee-level independent variables were Step 2 CS data gathering and data interpretation scores. SPs completed dichotomous checklists to record examinees’ data gathering activities including history taking and physical examination. Trained physician raters assessed examinees’ data interpretation skills, including documentation of findings, initial differential diagnoses, and necessary diagnostic studies. Both data gathering scores and data interpretation scores were statistically adjusted for the effects of SP/rater stringency and content difficulty.
Additional examinee-level independent variables included Step 2 CS CIS and SEP scores because they may affect the relationships between Step 2 CS data gathering and data interpretation scores and residency director ratings. Step 1 and Step 2 Clinical Knowledge (CK) scores likewise were treated as examinee-level independent variables to allow for evaluation of the usefulness of Step 2 CS scores above and beyond the information available from these two USMLE multiple-choice examinations. Demographic examinee-level independent variables included gender (0 = female, 1 = male); native language (0 = nonnative English speaker, 1 = native English speaker); and medical school location (0 = international medical school graduate [IMG], 1 = U.S./Canadian medical school graduate).
Program-level independent variables included residency program quartile ranking, program size (i.e., number of residents), proportion of examinees with English as a second language, and proportion of IMGs. The quartile rankings were based on first-time taker pass rate data from the internal medicine certification examination for 385 residency programs spanning the years 2007 to 2009 and provide an indicator of a program’s selectivity. These rankings were computed prior to excluding programs with fewer than 10 residents.
Of the 6,306 first-year internal medicine residents included in the study sample, 3,468 (55%) were male, 4,856 (77%) were native English speakers, and 4,919 (78%) were graduates of U.S./Canadian medical schools. Table 1 provides descriptive statistics for the USMLE variables used in the analyses for both the study sample and the entire population of examinees who took Step 2 CS in 2005. For the study sample, the mean (standard deviation [SD]) Step 2 CS data gathering score was 71.3 (9.2), and the mean (SD) Step 2 CS data interpretation score was 73.6 (8.7). The examinees in the study sample appear generally more capable than the total population of examinees who took Step 2 CS in 2005. The examinees in the total population had mean (SD) data gathering and data interpretation scores of 69.5 (10.0) and 70.8 (9.8), respectively. (Each of the score differences between the study sample and the total population is statistically significant at the P < .01 level.) It is possible that examinees who initially failed a USMLE component, and thus had low scores, were delayed or restricted from entering residency training in internal medicine, a highly selective specialty area.
The simple bivariate correlation between Step 2 CS data gathering scores and data interpretation scores for the study sample was 0.51. Generalizability coefficients for Step 2 CS data gathering scores and data interpretation scores, as reported in earlier work by Clauser and colleagues,11 were 0.69 and 0.72, respectively. With respect to the dependent variables, for the study sample, the mean (SD) for both the history-taking ratings and the physical examination ratings was 7 (1).
At the program level, the mean (SD) program quartile ranking for the 238 residency programs included in the study sample was 2.5 (1.1), and the mean (SD) program size was 23.8 (12.5) residents. The mean (SD) proportion of examinees with English as a second language in a program was 0.3 (0.2) and of IMGs in a program was 0.3 (0.3).
One-way random-effects ANOVAs
The top portions of Charts 1 and 2 present the results of the one-way random-effects ANOVAs. Of the total variation in history-taking ratings, 66% (0.71 / [0.36 + 0.71]) occurred between examinees within the same program, and 34% (0.36 / [0.36 + 0.71]) occurred between programs. Similarly, 64% (0.67 / [0.38 + 0.67]) of the total variation in physical examination ratings occurred between examinees within the same program, and 36% (0.38 / [0.38 + 0.67]) occurred between programs. Thus, while the majority of variation in ratings existed from examinee to examinee, a nontrivial amount existed between programs.
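The variance partition above can be reproduced with a few lines of arithmetic. This is a small sketch using the variance components quoted in the text; the function name is ours, and the between-program share corresponds to what is commonly called the intraclass correlation.

```python
def variance_shares(between_var, within_var):
    """Return (between-program share, within-program share) of total variance."""
    total = between_var + within_var
    return between_var / total, within_var / total


# Variance components as quoted in the text for each rating.
history_between, history_within = variance_shares(0.36, 0.71)
physical_between, physical_within = variance_shares(0.38, 0.67)

print(f"History taking: {history_between:.0%} between, {history_within:.0%} within")
print(f"Physical examination: {physical_between:.0%} between, {physical_within:.0%} within")
```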
For history taking, the final random-coefficients model included the following examinee-level independent variables: Step 2 CS data interpretation scores, Step 2 CS CIS scores, Step 2 CS SEP scores, Step 1 scores, and Step 2 CK scores. The other examinee-level independent variables were not related to history-taking ratings, after accounting for other factors, and were excluded. Preliminary results showed between-program variation in mean history-taking ratings, the effect of SEP scores on ratings, and the effect of Step 1 scores on ratings. The intercept and the slopes for SEP and Step 1 scores, therefore, were allowed to vary randomly across programs.
For physical examination, the final random-coefficients model included the following examinee-level independent variables: Step 2 CS data interpretation scores, Step 2 CS CIS scores, Step 1 scores, Step 2 CK scores, and medical school location. The other examinee-level independent variables were unrelated to physical examination ratings, after accounting for other factors, and were excluded. Preliminary results showed between-program variation in mean physical examination ratings, the effect of Step 1 scores on ratings, and the effect of medical school location on ratings. Thus, the intercept and the slopes for Step 1 scores and medical school location were allowed to vary randomly across programs.
The middle portions of Charts 1 and 2 present the results of the final random-coefficients models predicting history-taking ratings and physical examination ratings, respectively. As reported in Chart 1, on average, Step 2 CS data interpretation scores were positively related to history-taking ratings, with a 0.01-point increase in ratings expected for every 1-point increase in data interpretation scores. Notably, Step 2 CS data gathering scores were unrelated to history-taking ratings, after accounting for Step 2 CS data interpretation scores. Step 2 CS CIS, Step 2 CS SEP, Step 1, and Step 2 CK scores were all positively related to history-taking ratings.
As reported in Chart 2, Step 2 CS data interpretation scores were positively related to physical examination ratings, with a 0.01-point increase in ratings expected for every 1-point increase in data interpretation scores. As with history-taking ratings, Step 2 CS data gathering scores were unrelated to physical examination ratings, after accounting for Step 2 CS data interpretation scores. Again, Step 2 CS CIS, Step 1, and Step 2 CK scores were all positively related to physical examination ratings. On average, U.S./Canadian medical school graduates were rated 0.16 points higher than IMGs.
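To put the size of the data interpretation coefficient in context, a rough back-of-the-envelope calculation combines it with the study sample SD of data interpretation scores (8.7) and the SD of the ratings (about 1). Because both the coefficient and the rating SD are reported as rounded values, the result should be read as an approximation only.

```latex
\[
\underbrace{0.01}_{\text{rating points per score point}} \times \underbrace{8.7}_{\text{1 SD of data interpretation scores}}
= 0.087 \approx 0.09 \ \text{rating points}
\]
```

With a rating SD of roughly 1, an examinee scoring 1 SD higher on data interpretation would be expected to receive a rating about 0.09 SD higher, holding the other predictors constant.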
The final intercept-and-slopes-as-outcomes models included the same examinee-level independent variables as the final random-coefficients models. For history taking, the proportion of IMGs was treated as a between-program predictor of (1) average program history-taking ratings (i.e., the intercept); and (2) the relationship between SEP scores and history-taking ratings (i.e., the SEP slope). No other program-level independent variables were significantly related to either the intercept or the SEP slope. None of the program-level independent variables were significantly related to the Step 1 slope.
As reported in the bottom portion of Chart 1, residency programs with higher proportions of IMGs, on average, had lower mean history-taking ratings. The expected difference in mean history-taking ratings for a program including no IMGs compared with a program including only IMGs is 0.46. As the proportion of IMGs increases, the effect of SEP scores on history-taking ratings decreases: Each 1-unit increase in the proportion of IMGs decreases the SEP slope by 0.04.
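The two program-level effects can be illustrated with a short sketch. The coefficients of -0.46 and -0.04 follow from the figures quoted above and the reference proportion of 0.3 is the reported mean proportion of IMGs; the baseline SEP slope of 0.05 is a hypothetical placeholder, since its value is not given in the text, and the function names are ours.

```python
IMG_EFFECT_ON_MEAN = -0.46       # change in mean rating per 1.0 change in IMG proportion
IMG_EFFECT_ON_SEP_SLOPE = -0.04  # change in SEP slope per 1.0 change in IMG proportion


def mean_rating_shift(prop_a, prop_b):
    """Expected mean-rating difference (program b minus program a)."""
    return IMG_EFFECT_ON_MEAN * (prop_b - prop_a)


def sep_slope(prop_img, slope_at_reference, reference_prop=0.3):
    """SEP slope for a program, given the slope at the reference IMG proportion."""
    return slope_at_reference + IMG_EFFECT_ON_SEP_SLOPE * (prop_img - reference_prop)


HYPOTHETICAL_BASELINE_SLOPE = 0.05  # placeholder only, not a reported estimate

# All-IMG program compared with a no-IMG program.
print(round(mean_rating_shift(0.0, 1.0), 2))
slope_gap = sep_slope(1.0, HYPOTHETICAL_BASELINE_SLOPE) - sep_slope(0.0, HYPOTHETICAL_BASELINE_SLOPE)
print(round(slope_gap, 2))
```

The slope difference between the two programs does not depend on the placeholder baseline, which only sets the level around which the slope varies.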
For physical examination, none of the program-level independent variables were statistically related to the between-program variation in mean ratings or the slopes for Step 1 scores and medical school location. Thus, the final intercept-and-slopes-as-outcomes model was identical to the final random-coefficients model.
Two primary results of this study have important implications for the USMLE Step 2 CS examination specifically and for clinical skills examinations more generally. First, our results indicate that Step 2 CS data interpretation scores provide useful information for predicting ratings of examinees’ subsequent performance in history taking and physical examination during their first year of supervised practice. This is not surprising given that Step 2 CS data interpretation scores are based on experts’ ratings of patient notes in which examinees document and interpret findings using information they gathered by taking a patient history and performing a physical examination. Conversely, Step 2 CS data gathering scores did not provide information about subsequent performance in history taking and physical examination above and beyond the information provided by other USMLE scores.
The finding that Step 2 CS data interpretation scores are at least partially related to history-taking and physical examination ratings obtained during the first year of residency training provides some validity evidence for the intended use of Step 2 CS data interpretation scores as an indication of examinees’ readiness to enter supervised practice. Although the relationships between data interpretation scores and ratings in history taking and physical examination are modest, there are several plausible explanations for why these relationships may be attenuated. Presumably, over the course of their first year of residency training, examinees improve their clinical skills, and residency directors’ ratings should reflect these advances whereas Step 2 CS scores would not. The difference in skill level between the time examinees take Step 2 CS and the time directors evaluate them as residents could weaken the relationships between scores and ratings. Another factor may be the reliability of the ratings themselves. Although formal reliability analyses are not available, it is reasonable to assume that the ratings have, at most, moderate reliability, given that they are based on the overall assessment of a single rater (i.e., the residency program director).
Our second important finding is the lack of validity evidence for the intended use of Step 2 CS data gathering scores. One possible reason for the lack of a relationship between Step 2 CS data gathering scores and later performance in residency training may be that some examinees employ testing strategies that may artificially inflate their scores. As stated above, Step 2 CS data gathering scores are based on dichotomous checklists completed by SPs. Examinees receive points for asking relevant questions about the patient’s history and performing indicated physical examination maneuvers. If examinees memorize standard lists of history-taking questions that are appropriate across a range of clinical presentations—such as the lists that are commonly provided by test preparation courses and texts12—they may be able to increase the number of history-taking points that they receive. This may lead to artificially inflated data gathering scores, which possibly reflect rote memorization rather than actual skill. Furthermore, medical school faculty and residency directors may emphasize a hypothesis-driven and focused history and physical examination,13,14 so that appropriate history-taking skills and physical examination maneuvers may be conceptualized differently in supervised practice than they are during the Step 2 CS examination. Lastly, scores based on checklists completed by SPs may be less reliable than other scores typically included in performance assessments of clinical skills. Indeed, in a multivariate generalizability study, Clauser and colleagues11 found that data gathering scores were the least generalizable among the score components included in Step 2 CS.
An understanding that test-taking strategies and/or contrasting conceptualizations of the appropriate ways to take a patient history and perform a physical examination may have affected data gathering scores, at least in part, factored into the USMLE’s decision to assess history taking as part of the Step 2 CS data interpretation score, rather than the data gathering score.15 Our finding that data gathering scores assessing both history taking and physical examination did not predict history-taking and physical examination performance in the first year of residency training provides some confirmation that this adjustment was warranted.
Although our study sample is nationally representative, it is limited in scope to the single, albeit largest, medical specialty of internal medicine, and thus the current findings may not be generalizable to residency training programs in other specialties. Additionally, the fact that history-taking and physical examination ratings reflect residents’ performance at the end of the first year of residency training may introduce the confounding effect of differential growth during the training year.
With these limitations in mind, the current study remains among the first large-scale investigations focusing on validity evidence for scores from the Step 2 CS examination. As mentioned above, we previously found that Step 2 CS CIS scores provide useful information related to the level of communication skills demonstrated in supervised practice.7 In a similar vein, the results of this study provide some support for the usefulness of Step 2 CS data interpretation scores in predicting later performance in both history taking and physical examination. We found no evidence to support the usefulness of Step 2 CS data gathering scores for predicting such performance.
In addition to providing validity evidence specifically relevant for Step 2 CS score interpretations, our findings contribute to the larger body of work exploring the usefulness of scores from performance assessments of clinical skills for predicting later performance in practice settings. As such, this study provides important information for practitioners interested specifically in Step 2 CS or more generally in fine-tuning performance assessments of medical students’ clinical skills that use SPs.
1. Messick S. Validity. In: Linn RL, ed. Educational Measurement. 3rd ed. New York, NY: Macmillan; 1989:13–103
2. Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas. 1995;14:5–8
3. Kane MT. Validation. In: Brennan RL, ed. Educational Measurement. 4th ed. Westport, Conn: American Council on Education/Praeger; 2006:17–64
4. Tamblyn R, Abrahamowicz M, Dauphinee D, et al. Physician scores on a national clinical skills examination as predictors of complaints to medical regulatory authorities. JAMA. 2007;298:993–1001
5. Taylor ML, Blue AV, Mainous AG 3rd, Geesey ME, Basco WT Jr. The relationship between the National Board of Medical Examiners’ prototype of the Step 2 clinical skills exam and interns’ performance. Acad Med. 2005;80:496–501
6. Smith SR. Correlations between graduates’ performances as first-year residents and their performances as medical students. Acad Med. 1993;68:633–634
7. Winward ML, Lipner RS, Johnston MM, Cuddy MM, Clauser BE. The relationship between communication scores from the USMLE Step 2 Clinical Skills examination and communication ratings for first-year internal medicine residents. Acad Med. 2013;88:693–698
8. Raudenbush SW, Bryk AS. Hierarchical Linear Models: Applications and Data Analysis Methods. Thousand Oaks, Calif: Sage Publications; 2002
9. Shea JA, Norcini JJ, Kimball HR. Relationships of ratings of clinical competence and ABIM scores to certification status. Acad Med. 1993;68(10 suppl):S22–S24
10. Lipner RS, Blank LL, Leas BF, Fortna GS. The value of patient and peer ratings in recertification. Acad Med. 2002;77(10 suppl):S64–S66
11. Clauser BE, Harik P, Margolis MJ. A multivariate generalizability analysis of data from a performance assessment of physicians’ clinical skills. J Educ Meas. 2006;43:173–191
12. Le T, Bhushan V, Sheikh-Ali M, Shahin FA. First Aid for the USMLE Step 2 CS. New York, NY: McGraw-Hill; 2012
13. Yudkowsky R, Otaki J, Lowenstein T, Riddle J, Nishigori H, Bordage G. A hypothesis-driven physical examination learning and assessment procedure for medical students: Initial validity evidence. Med Educ. 2009;43:729–740
14. Yudkowsky R, Park YS, Riddle J, Palladino C, Bordage G. Clinically discriminating checklists versus thoroughness checklists: Improving the validity of performance test scores. Acad Med. 2014;89:1057–1062
15. Federation of State Medical Boards, National Board of Medical Examiners (NBME). Step 2 Clinical Skills (CS) Content Description and General Information. 2012;11. Philadelphia, Pa: Federation of State Medical Boards and National Board of Medical Examiners