Scores from standardized patient (SP)-based examinations such as the Step 2 Clinical Skills (CS) component of the United States Medical Licensing Examination (USMLE) are often used to make high-stakes decisions informing licensure and certification. Because of the importance of these decisions, gathering evidence to evaluate the validity of the score interpretations is essential. We intend for the present research to add to that body of evidence.
Kane1 conceptualizes the process of validating test score interpretations as a structured argument that supports the proposed interpretations. The argument has four parts: (1) scoring, reflecting the extent to which the scores are accurately recorded under defined assessment conditions, (2) generalization, supporting the stability of the scores across replications of the data collection procedure, (3) extrapolation, demonstrating the relationship between the conditions represented in the assessment and the criterion behaviors in the real world, and (4) decision, supporting the ultimate interpretation of the scores. The strength of the validity argument is limited by the weakest link in this chain of inference.1
Multiple-choice tests tend to provide strong evidence to support the scoring and generalization parts of the validity argument. Ensuring that such tests are administered under standardized conditions is relatively straightforward, as is scoring them accurately. The multiple-choice format is also efficient—allowing assessors to sample many items in a relatively short time—leading to highly generalizable scores. By contrast, clinical skills examinations using SPs are more difficult to standardize, and the resulting scores tend to be less generalizable.2 Proponents of SP-based clinical skills examinations contend, however, that the use of SPs strengthens both the extrapolation and decision parts of the validity argument because the assessment task closely approximates the criterion behavior of interest. Advocates of SP-based clinical skills examinations further argue that this ability to approximate real life offsets the limitations of the format.
Relatively little empirical evidence supports the contention that success on these examinations correlates with actual ability. In a book chapter summarizing the state of the art for clinical performance assessments in 2002, Petrusa3 reports information about the reliability of SP-based clinical skills examinations and provides validity evidence demonstrating that scores on these examinations tend to increase with examinees’ levels of training. He also describes positive relationships between clinical skills examination scores and other concurrent measures of examinee performance such as clerkship grades. However, he concludes that the face validity of these examinations—“the degree to which the clinical challenges in the examination look real”—remains the strongest evidence of validity.3
More recent studies have provided additional evidence. First, Taylor and colleagues4 examined the relationship between residency directors’ ratings of first-year residents and those residents’ scores on a National Board of Medical Examiners’ SP-based examination (an examination delivered at medical schools as part of a research and development project). The authors reported a correlation of 0.25 between the directors’ interpersonal skills ratings and the interpersonal skills scores on the SP examination and a correlation of 0.30 between the overall quartile residency ranking of each resident and his or her interpersonal skills scores on the SP examination; both correlations were statistically significant.4 Second, Tamblyn and colleagues5 reported that examinees with unusually low performance on the communications component of the Canadian Clinical Skills Exam were significantly more likely to draw nontrivial complaints in subsequent practice.
Although Taylor and colleagues4 base their study on a relatively small sample of examinees from a single institution, their findings provide valuable information related to the relationship between performance on an SP-based examination and later performance in residency as measured by residency director ratings. Similarly, Tamblyn and colleagues’ report5 provides important information about the relationship between performance on a national SP-based examination and patient complaints in clinical practice. Both studies provide predictive validity evidence for SP examination scores, but given the limited number of studies, further research is needed.
To this end, the current study, which included a national sample of examinees, extends available evidence about the relationship between performance on SP examinations and subsequent performance in residency. Specifically, it focuses on the relationship between Step 2 CS communication and interpersonal skills scores and communication skills ratings that residency directors assign to residents in their first year of postgraduate internal medicine training. Because Step 2 CS is intended to measure the clinical skills essential for safe and effective patient care under supervision, we used residency director ratings as the external criterion measure; these ratings represent performance in supervised practice. Furthermore, previous research has shown that patient ratings of a physician’s overall communication skills are related to residency director ratings of the physician’s humanistic qualities.6
To the best of our knowledge, this study represents the first large-scale effort to provide evidence about the extent to which Step 2 CS communication and interpersonal skills scores can be extrapolated to examinee performance in supervised practice.
The data set consisted of examinees’ demographic information, USMLE scores, and residency director ratings as reported to the American Board of Internal Medicine (ABIM). The sample comprised 6,306 examinees from 238 internal medicine residency programs who completed Step 2 CS for the first time in 2005 and for whom we had ratings of their first year of internal medicine residency training. Residency programs with fewer than 10 examinees were excluded from the sample to support the hierarchical linear modeling (HLM; see below) used in the analysis.
This study was exempt from institutional review board review because data were collected as part of routine operational activities, results were reported in the aggregate with no individual examinee identified, and examinees who requested that their data be withheld from research (<1% of all examinees) were excluded.
The data are hierarchically structured in that examinees are nested within internal medicine residency programs. This structure allows for the possibility that examinee characteristics (i.e., demographics and USMLE scores) and residency program characteristics (i.e., program size, proportion of international medical graduates)—and the interaction between the two—may influence residency ratings. These relationships are more than a statistical nicety; there is no reason to believe that the slope of a regression line predicting examinee-level residency training ratings from Step 2 CS scores within a single residency program will be the same as (or even similar to) the slope predicting the same relationship across all programs. Ignoring the nested data structure may result in compelling, but misleading, results.7
HLM techniques allow for appropriate analysis of nested data. For this study, HLM can be conceptualized as estimating a unique regression line predicting internal medicine communication skills ratings from Step 2 CS scores and other examinee characteristics for each internal medicine residency program (random-coefficients models). Then the results of these examinee-level regression analyses (the intercepts and slopes) are used as dependent variables in program-level analyses that include program characteristics as between-program predictors. These models allow for the possibility that residency program characteristics may affect the examinee-level relationships. Raudenbush and Bryk’s8 comprehensive text provides more detailed information on the HLM approach.
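In generic two-level notation, following Raudenbush and Bryk,8 the structure described above can be sketched as follows. The symbols are illustrative rather than the study’s exact specification: Y_ij is the communication skills rating of examinee i in program j, X_ij an examinee-level score, and W_j a program-level characteristic.

```latex
% Level 1 (examinee i within program j): the rating is predicted
% from an examinee-level score, with residual r_ij.
Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij}

% Level 2 (program j): the program-specific intercept and slope are
% themselves predicted from a program-level characteristic, with
% random effects u_{0j} and u_{1j}.
\beta_{0j} = \gamma_{00} + \gamma_{01} W_{j} + u_{0j}
\beta_{1j} = \gamma_{10} + \gamma_{11} W_{j} + u_{1j}
```

In the random-coefficients models, the level-2 equations contain no W_j terms; the intercepts-and-slopes-as-outcomes models add them as between-program predictors.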
A series of examinees-nested-in-programs analyses was fit to the data; ABIM’s residency program communication skills ratings were treated as the dependent variable. These analyses included the following:
- (1) a random-effects analysis of variance (ANOVA) that partitioned the total variation in the ratings into within-program and between-program components;
- (2) a series of random-coefficients models used to determine (a) which examinee characteristics to use as within-program independent variables and (b) which characteristics should be fixed and which should be allowed to vary across programs; and
- (3) a series of intercepts-and-slopes-as-outcomes models to predict the impact of program characteristics on both (a) mean communication skills ratings and (b) the relationships between Step 2 CS scores and communication skills ratings.
The ABIM Residents Evaluation Summary is a standardized, Web-based rating of clinical competence. Residency directors rate internal medicine residents annually on several components including communication skills. The ratings are based on data that are collected through the program’s assessment system. Assessment information typically includes data from end-of-rotation evaluations, direct observation, and chart audits. It also can include data from chart-stimulated recall, SP or objective structured clinical examinations, in-service examinations or multiple-choice tests, and feedback from a program’s competency or promotions committee. Ratings are scored on a nine-point Likert-like scale that is divided into three categories: “Unsatisfactory” (scores of 1–3), “Satisfactory” (scores of 4–6), and “Superior” (scores of 7–9). At the completion of residency training, each resident must have satisfactory ratings in all individual components and a satisfactory rating on overall clinical competence to be able to take the ABIM certification examination. For the current study, we used communication skills ratings from the end of residents’ first year in the program because the Step 2 CS examination is intended to indicate readiness to enter supervised practice. Prior research has shown that the overall clinical competence ratings correlate significantly with certification examination scores and physician peer ratings.6,9
Examinees completing the USMLE Step 2 CS examination receive three scores representing (1) communication and interpersonal skills, (2) spoken English proficiency, and (3) the integrated clinical encounter. The communication and interpersonal skills score, in turn, includes three subcomponents: questioning skills, information-sharing skills, and professional manner and rapport. Results of these three subcomponents are summed to obtain an overall communication and interpersonal skills score. The spoken English proficiency component measures clarity of spoken English communication within the context of the doctor–patient encounter (e.g., pronunciation, word choice, minimizing the need to repeat questions or statements). The integrated clinical encounter score comprises two subscores representing data gathering and documentation. SPs provide ratings of communication and interpersonal skills and spoken English proficiency, and they complete a checklist to record the examinee’s data gathering activities. Trained physicians provide ratings to assess the examinee’s documentation of the encounter.
Examinee-level independent variables comprised the following:
- two Step 2 CS scores (communication and interpersonal skills and spoken English proficiency),
- two Step 2 CS subscores (data gathering and documentation),
- Step 1 scores, and
- Step 2 Clinical Knowledge (CK) scores.
Although only pass/fail results are reported for the Step 2 CS examination, the Step 2 CS scores used in this study are the underlying numeric scores used to make the pass/fail decisions. We used both USMLE Step 1 and Step 2 CK scores to allow for an evaluation of the usefulness of the CS scores above and beyond the information already available from the computer-based examinations. Additional variables specified examinee gender, identified examinees whose native language was English, and distinguished U.S./Canadian medical school graduates from international medical school graduates. We computed descriptive statistics for both the study sample and the entire population of examinees who took the Step 2 CS examination in 2005.
Program-level independent variables included residency program quartile ranking, program size, proportion of examinees with English as a second language, proportion of graduates of international medical schools, and average Step 2 CS spoken English proficiency scores. We computed the quartile ranking on the basis of first-taker pass rate data for 385 residency programs on the internal medicine certification examination spanning three years (2007, 2008, and 2009). All independent variables were grand-mean centered (i.e., centered around the overall mean rather than the program mean).
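The distinction between grand-mean centering (used here) and group-mean centering can be illustrated with a minimal sketch; the program labels and score values are hypothetical.

```python
# Minimal sketch of grand-mean vs. group-mean centering.
# Program labels and scores are hypothetical illustrations.
scores = {"A": [78.0, 80.0], "B": [72.0, 74.0]}  # program -> examinee scores

all_scores = [s for prog in scores.values() for s in prog]
grand_mean = sum(all_scores) / len(all_scores)  # 76.0

# Grand-mean centering: subtract the overall mean from every score,
# so 0 represents the average examinee in the whole sample.
grand_centered = {p: [s - grand_mean for s in v] for p, v in scores.items()}

# Group-mean centering (not used in this study): subtract each
# program's own mean, so 0 represents the average examinee in that program.
group_centered = {
    p: [s - sum(v) / len(v) for s in v] for p, v in scores.items()
}
```

Under grand-mean centering, program A’s examinees remain above zero and program B’s below it, preserving between-program differences; group-mean centering would erase them.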
Of the 6,306 first-year internal medicine residents who completed the Step 2 CS for the first time in 2005, 2,819 (44.7%) were female, 1,425 (22.6%) spoke English as a second language, and 1,368 (21.7%) were graduates of international medical schools.
The mean quartile ranking of the 238 internal medicine residency programs that the examinees represented was 2.5 (standard deviation [SD] = 1.1). The mean program size was 23.8 residents (SD = 12.5), the mean proportion of examinees with English as a second language was 0.3 (SD = 0.2), the mean proportion of graduates of international medical schools was 0.3 (SD = 0.3), and the mean Step 2 CS spoken English proficiency score for programs was 77.5 (SD = 4.6).
Table 1 provides sample sizes and mean scores (plus SDs) for relevant USMLE scores, both for the sample of examinees assessed in the current study and for the entire population of examinees who took the Step 2 CS examination in 2005. The data suggest that the examinees in our sample are generally more proficient than the full population of examinees who took the examination. (Each of the differences is statistically significant at P ≤ .001.) One explanation for this difference is that examinees who failed any USMLE examination component may have been delayed or prevented from entering residency training. The mean (and SD) communication skills rating from residency directors for the study sample was 6.8 (and 1.1).
The top portion of Chart 1 provides the results of the random-effects one-way ANOVA. Results indicate that approximately 70% (0.90 / [0.38 + 0.90]) of the total variation in communication skills ratings occurred between examinees (within-program variance), and approximately 30% (0.38 / [0.38 + 0.90]) occurred between internal medicine training programs (intercept variance).
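The variance partition above is the familiar intraclass correlation calculation. A short sketch using the variance components reported in the text:

```python
# Variance components reported in Chart 1 (random-effects one-way ANOVA).
within_program = 0.90   # variance between examinees within programs
between_program = 0.38  # variance between programs (intercept variance)

total = within_program + between_program  # 1.28

# Proportion of total variation attributable to each level.
prop_within = within_program / total    # ~0.70
icc = between_program / total           # ~0.30 (intraclass correlation)

print(round(prop_within, 2), round(icc, 2))  # 0.7 0.3
```

The intraclass correlation of roughly 0.30 indicates substantial clustering of ratings by program, which is what motivates the HLM approach.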
Initial random-coefficients models included all examinee-level independent variables. Examinee gender, native language, medical school location, Step 2 CS data gathering scores, and Step 1 scores were not significantly related to communication skills ratings and were removed. Step 2 CS communication and interpersonal skills scores, Step 2 CS spoken English proficiency scores, Step 2 CS documentation scores, and Step 2 CK scores displayed statistically significant relationships to communication skills ratings and were retained in the random-coefficients models. Because preliminary results indicated that program mean communication ratings and the effect of spoken English proficiency scores on communication skills ratings both varied by training program, we allowed the intercept and the slope for the spoken English proficiency scores to vary randomly across programs.
The middle portion of Chart 1 presents the results of the final random-coefficients model, which explained approximately 13% ([0.90 − 0.78] / 0.90) of the within-program variation in communication skills ratings. On average, communication and interpersonal skills scores were positively related to internal medicine communication skills ratings, controlling for other examinee scores, with a 0.02-point increase in ratings expected for every 1-point increase in communication and interpersonal skills scores. Spoken English proficiency scores, documentation scores, and Step 2 CK scores also were positively related to communication skills ratings, with a 0.03-point increase in ratings expected for every 1-point increase in spoken English proficiency scores and a 0.01-point increase in ratings expected both for every 1-point increase in documentation scores and for every 1-point increase in Step 2 CK scores.
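To make the size of these effects concrete, the rounded coefficients reported above imply the following expected changes in the 9-point rating (the function and variable names are illustrative, not part of the study’s analysis):

```python
# Rounded examinee-level coefficients as reported in the text.
COEF = {
    "cis": 0.02,  # Step 2 CS communication and interpersonal skills
    "sep": 0.03,  # Step 2 CS spoken English proficiency
    "doc": 0.01,  # Step 2 CS documentation
    "ck": 0.01,   # Step 2 CK
}

def expected_rating_change(deltas):
    """Expected change in the 9-point rating for given score changes,
    holding the remaining scores constant."""
    return sum(COEF[k] * deltas.get(k, 0.0) for k in COEF)

# A resident scoring 10 points higher on communication and
# interpersonal skills, all else equal:
print(expected_rating_change({"cis": 10}))  # 0.2
```

A 10-point difference in communication and interpersonal skills scores thus corresponds to an expected rating difference of about 0.2 points on the 9-point scale.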
The final intercepts-and-slopes-as-outcomes model included the same examinee-level independent variables as the final random-coefficients model. It also included average spoken English proficiency score as a between-program predictor of (1) average communication skills ratings (i.e., the intercept) and (2) the relationship between spoken English proficiency scores and ratings (i.e., the spoken English proficiency slope). No other between-program independent variables were significantly related to either the intercept or the spoken English proficiency slope, so we removed these other variables from the model.
The final intercepts-and-slopes-as-outcomes model explained approximately 14% ([0.90 − 0.77] / 0.90) of the within-program variation in communication skills ratings and approximately 24% ([0.38 − (0.29 + 0.0006)] / 0.38) of the between-program variation in communication skills ratings.
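The variance-explained figures above follow the usual proportional-reduction-in-variance calculation, comparing the final model’s residual components with the unconditional ANOVA components:

```python
# Variance components from the unconditional random-effects ANOVA.
anova_within = 0.90           # within-program variance
anova_between = 0.38          # between-program variance

# Residual components from the final
# intercepts-and-slopes-as-outcomes model, as reported in the text.
final_within = 0.77
final_between = 0.29 + 0.0006  # residual intercept + slope variance

# Proportional reduction in variance at each level.
r2_within = (anova_within - final_within) / anova_within        # ~0.14
r2_between = (anova_between - final_between) / anova_between    # ~0.24

print(round(r2_within, 2), round(r2_between, 2))  # 0.14 0.24
```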
The lower portion of Chart 1 presents selected results of the final intercepts-and-slopes-as-outcomes model. The examinee-level regression coefficients were consistent with the results of the final random-coefficients model in terms of significance, direction, and magnitude and were excluded from the lower portion of Chart 1. Results indicated that average spoken English proficiency scores affected (1) program mean communication skills ratings and (2) the relationship between spoken English proficiency scores and ratings. Training programs with higher spoken English proficiency scores, on average, had higher mean communication skills ratings; for every 1-point increase in programs’ average spoken English proficiency scores, the program mean rating is expected to increase, on average, 0.02 points. As average spoken English proficiency score increases, the effect of examinee-level spoken English proficiency scores on communication skills ratings increases; each 1-point increase in average spoken English proficiency scores increases the spoken English proficiency slope by 0.003.
Discussion and Conclusions
The key finding of the present study is that the Step 2 CS communication and interpersonal skills score does predict the communication skills rating of first-year internal medicine residents even after accounting for other scores (i.e., spoken English proficiency, documentation, and Step 2 CK). Admittedly, the strength of the relationship is relatively weak, but several factors argue against the conclusion that a modest conditional relationship between the communication and interpersonal skills scores and the subsequent ratings is evidence of poor measurement of the construct of interest.
One reason that the reported results may underestimate the usefulness of the communication and interpersonal skills score is that the sample we used in this study is unavoidably incomplete. We excluded examinees with failing scores on the communication and interpersonal skills component of Step 2 CS from the sample because they could not enter residency training. Although the number with failing scores in this component is a modest percentage of examinees trained in the United States and Canada (approximately 1%), approximately 12% of internationally trained examinees fail the communication and interpersonal skills component of Step 2 CS. Eliminating the lowest-scoring examinees is likely to attenuate the observed relationship between scores and ratings. Eliminating smaller programs (in this case, those with fewer than 10 examinees) may affect the observed relationship as well.
Additionally, although the internal medicine residency ratings represent an important outcome measure for an assessment designed to evaluate readiness to enter supervised practice and have been shown in previous research6,9 to be significantly related to other measures of cognitive performance, these ratings are a less-than-ideal measure of the criterion of interest, communication skills. Ratings that use only one judge generally have relatively low reliability. The present data set provided no basis for estimating the reliability of communication skills ratings, but assuming that the reliability of the ratings is well below 1.00 is reasonable. This level of reliability may result in additional attenuation of the observed relationship between the scores and the construct measured by the ratings.
In addition to reliability issues, the residency-director-produced communication skills ratings and the Step 2 CS communication and interpersonal skills scores are clearly not directly equivalent measures. Specifically, the Step 2 CS communication and interpersonal skills score is not intended to reflect spoken English proficiency; the corresponding residency rating does not distinguish between communication skills and spoken English proficiency. This lack of perfect alignment between the constructs measured by the scores and ratings again may suppress the strength of the relationship between the two measures.
One other characteristic of the data set is worth noting. Because USMLE reports only pass/fail status for the communication and interpersonal skills score, the numeric scores used in this study were unavailable to the residency programs. The fact that residency programs did not have numeric scores means both that there could be no selection bias based on the use of the scores for residency selection and that prior knowledge of the scores could not have influenced the subsequent residency-director-produced ratings.
Viewed in light of these factors, the results of this national study make a reasonable case that the Step 2 CS communication and interpersonal skills scores provide information that is useful in predicting the level of communication skills that examinees will display in their first year of internal medicine residency training. This finding, at a minimum, shows some level of extrapolation from the testing context to behavior in supervised practice and, thus, strengthens the validity argument. The ability to extrapolate real-life behavior from SP testing is important for those responsible for making licensing decisions and for those affected by such decisions. Extrapolation, as Kane1 argues, provides support for the Step 2 CS examination’s intended use: signaling a resident’s readiness with respect to clinical skills for supervised practice. Of course, a study of this sort, particularly because it is limited to a single year, is not sufficient in and of itself. A more complete validity argument would need to demonstrate the utility over time of decisions made using the cut score and would provide evidence that higher scores were associated with safer and more effective treatment and, therefore, with better patient outcomes.
Other disclosures: None.
Ethical approval: This study was considered exempt from institutional review board review because data were collected as part of routine operational activities, all records were deidentified, and examinees who requested that their data be withheld from research were excluded.
1. Kane MT. Validation. In: Brennan RL, ed. Educational Measurement. 4th ed. Westport, Conn: American Council on Education/Praeger; 2006:17–64.
2. Swanson DB, Clauser BE, Case SM. Clinical skills assessment with standardized patients in high-stakes tests: A framework for thinking about score precision, equating, and security. Adv Health Sci Educ Theory Pract. 1999;4:67–106.
3. Petrusa ER. Clinical performance assessments. In: Norman GR, van der Vleuten CPM, Newble DI, eds. International Handbook of Research in Medical Education. Dordrecht, Netherlands: Kluwer Academic Publishers; 2002:673–709.
4. Taylor ML, Blue AV, Mainous AG 3rd, Geesey ME, Basco WT Jr. The relationship between the National Board of Medical Examiners’ prototype of the Step 2 Clinical Skills Exam and interns’ performance. Acad Med. 2005;80:496–501.
5. Tamblyn R, Abrahamowicz M, Dauphinee D, et al. Physician scores on a national clinical skills examination as predictors of complaints to medical regulatory authorities. JAMA. 2007;298:993–1001.
6. Lipner RS, Blank LL, Leas BF, Fortna GS. The value of patient and peer ratings in recertification. Acad Med. 2002;77(10 suppl):S64–S66.
7. Cronbach LJ, Deken JE, Webb N. Research on Classrooms and Schools: Formulation of Questions, Design, and Analysis. Occasional Papers of the Stanford Evaluation Consortium. Stanford, Calif: Stanford University; 1976.
8. Raudenbush SW, Bryk AS. Hierarchical Linear Models: Applications and Data Analysis Methods. 2nd ed. Thousand Oaks, Calif: Sage Publications; 2002.
9. Shea JA, Norcini JJ, Kimball HR. Relationships of ratings of clinical competence and ABIM scores to certification status. Acad Med. 1993;68(10 suppl):S22–S24.