The large body of literature dedicated to the use of standardized patients (SPs) in clinical skills examinations (CSEs) attests to their popularity and acceptance in the medical education community as valuable teaching and assessment tools.1 Additionally, both the Educational Commission for Foreign Medical Graduates (ECFMG®) and the Medical Council of Canada currently require the successful completion of a CSE component as part of their licensure and certification processes, respectively.2,3 The National Board of Medical Examiners (NBME®), in conjunction with the ECFMG, is planning to implement Step 2 Clinical Skills (CS) as part of the United States Medical Licensing Examination (USMLE™) in the summer of 2004.4
One aspect of validity that is of particular importance to assess with clinical skills examinations, given the presence of SPs and examinees of differing ethnicity, is the consequential aspect of validity.5 The latter aspect pertains to sources of invalidity related to fairness and bias. That is, can we gather evidence to support that the use of CSEs has no adverse consequences for students of different genders, ethnic groups, etc.?
Assessment of Fairness Issues in Clinical Skills Examinations
Few studies have been undertaken to ascertain the effect of the interaction of SP and examinee ethnicity on test scores in large-scale CSEs.6,7 Typically, these studies focused on comparing case scores when the ethnicity of both the SP and the examinee was varied. Findings reported in these investigations indicate that significant interactions between examinee and SP ethnicity do occur on certain SP cases.6,7 However, the authors of this research did not detect any clear pattern across either case or skill, leading them to suggest that this source of error does not appear to pose a significant threat to the validity of inferences derived from these test scores. That is, based on the entire examination, candidates of a certain ethnic background did not perform significantly better when facing an SP of the same ethnicity.6,7
Unfortunately, the findings reported in the latter studies are difficult to interpret because no effort was made to control for initial differences in clinical skill level that may have existed between examinees of different ethnic backgrounds. That is, the analyses performed were aimed at measuring impact or simple differences between means, without adjusting for initial differences in proficiency that may have existed between candidates of varying ethnicity.8 Similarly, no information was provided with respect to the variability in portrayal and rating stringency levels of SPs assigned to the same case. Differences that may have existed between SPs of the same ethnic background portraying the same case were not factored into the analyses performed. Consequently, the results reported in investigations aimed at assessing the relationship between examinee and SP ethnicity on test scores are potentially confounded by a priori differences in examinee proficiency and SP stringency levels.
Comparing matched groups of examinees, with respect to the proficiency targeted by the examination, is critical when undertaking research to gather evidence of the consequential validity of any examination program. Simple mean differences between two groups of examinees on a given test may reflect impact and are not necessarily indicative of an unfair examination. Chambers and colleagues9 compared interpersonal skills scores as a function of candidate and standardized patient gender, after matching on a number of covariates. No significant SP gender by candidate gender interaction was noted. However, no study has been undertaken to compare the performance of candidates from varying ethnic backgrounds exposed to SPs of similar and different ethnicities, after matching for overall clinical skill level, as well as after accounting for SP stringency level.
The purpose of the present study was therefore to assess whether data gathering and written communication scores differed significantly as a function of the ethnicity of the candidate, the ethnicity of the SP, and, more critically, the interaction of the latter two factors, after matching for overall proficiency level on a large-scale CSE used for certification purposes. It is hoped that the findings reported in this investigation will provide valuable information in regard to the fairness of this type of performance assessment, that is, yield evidence of consequential validity for this assessment modality.
Sample and Examination
The sample that was the focus of the present investigation was selected from the population of 9,551 candidates who completed the ECFMG's Clinical Skills Assessment (CSA) between May 1, 2002, and May 31, 2003. During this period, over 110,000 SP-candidate encounters were portrayed by over 50 SPs. The ECFMG's CSA is composed of ten scored cases and assesses the clinical skills (history taking, physical examination, and written communication), interpersonal skills, as well as spoken-English proficiency level of international medical graduates (IMGs). Until the implementation of Step 2 CS, the ECFMG's CSA is compulsory for all IMGs wishing to complete a residency in the United States. History taking and physical examination skills are assessed via case-specific, dichotomously scored checklists completed by the SP following each encounter. Written communication (WC) skills are measured using generic patient notes, which require that the student document pertinent history taking and physical examination findings as well as provide a list of differential diagnoses and a diagnostic management plan. Patient notes are scored globally by trained physicians using a 1–9 rating scale, where scores of 1–3 denote unacceptable performance and 7–9 reflect superior performance.
Based on this sample of encounters, four cases that provided a sufficient number of encounters across different examinee and SP ethnicity groups were selected as the focus of the study. For two of the cases, comparisons were done between black and white SPs/examinees. For the remaining two cases, these comparisons were undertaken between Hispanic and white SPs/examinees. Ethnicity was self-reported by the candidate. Table 1 provides a breakdown of the frequency of examinees for each case by candidate, as well as SP, ethnicity. It is important to note that examinees were nonidentical across the four cases. However, analyses were undertaken at the case level, and as such, this is not a hindrance to the overall analysis. Additionally, for any given case, the pair of SPs was identical. Thus, differences in SP stringency level were controlled. Case 1 dealt with a lifestyle change issue whereas Case 2 focused on a malignancy. Case 3 dealt with chest pain whereas Case 4 focused on an abdominal problem. For security reasons, it is not possible to provide additional descriptive information for these four cases.
For each of the four cases, the following two component scores were computed:
* A data gathering (DG) score, which corresponds to the percentage of actions on the checklist completed by the candidate.
* A WC score, ranging from 1–9.
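As a minimal sketch of how these two component scores could be represented, the Python snippet below computes a DG percentage from a dichotomously scored checklist and checks that a WC rating falls on the 1–9 scale. The function names and the example checklist are hypothetical and chosen for illustration only; they are not the ECFMG's scoring code.

```python
def dg_score(checklist):
    """Percentage of checklist actions credited (each item scored 0 or 1)."""
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    if any(item not in (0, 1) for item in checklist):
        raise ValueError("checklist items must be scored 0 or 1")
    return 100.0 * sum(checklist) / len(checklist)


def validate_wc_score(rating):
    """Return the rating if it lies on the 1-9 patient-note scale."""
    if not 1 <= rating <= 9:
        raise ValueError("WC ratings range from 1 to 9")
    return rating


# Hypothetical 20-item checklist with 14 actions credited by the SP.
example_checklist = [1] * 14 + [0] * 6
example_dg = dg_score(example_checklist)  # 70.0
```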
For each of these two scores and four cases, separate analyses of covariance (ANCOVA) were undertaken. ANCOVA allows researchers to compare group means on a dependent variable, after the group means have been adjusted for differences between groups on a relevant covariate. The independent variables in the model were SP ethnicity (two levels) and examinee ethnicity (two levels). The dependent variable was the component score. The covariate, for any given analysis, was the corresponding equated component score. Equating is the process by which examinee scores are adjusted to account for the varying difficulty of a test form. Equating places scores on a common scale, which enables their direct comparison, even though examinees may have encountered forms of varying difficulty level portrayed by SPs of different stringency levels. The ECFMG does compute equated DG and WC scores as a means of reporting measures that are comparable across candidates, irrespective of the test form completed. For example, the ANCOVA aimed at comparing DG scores on Case 1 used the equated DG score for all examinees as the covariate in the model. Readers interested in obtaining more information on equating should consult other sources.10 In the present study, using an equated score as a covariate enabled us to ensure that examinee/SP comparisons for each of the four cases were based on test takers that were comparable, with respect to the targeted overall proficiency level (i.e., DG or WC), at the onset. Not doing so would complicate the interpretation of any significant effect for any case examined, as the latter might be confounded by initial examinee differences in overall DG or WC proficiency level.
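The 2 × 2 ANCOVA design described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the software or code used in the study: the function names, the effect coding, the model-comparison approach to the F tests, and the simulated data are all hypothetical. Each effect is tested by comparing the full model (covariate, SP ethnicity, examinee ethnicity, and their interaction) with a reduced model that omits the term of interest.

```python
import numpy as np


def residual_ss(design, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def ancova_2x2(score, covariate, sp_group, examinee_group):
    """F tests for a 2 x 2 ANCOVA with one covariate.

    sp_group and examinee_group are arrays of 0/1 group codes.
    Returns {effect: (F, df_effect, df_residual)}.
    """
    y = np.asarray(score, dtype=float)
    cov = np.asarray(covariate, dtype=float)
    # Effect coding (-1/+1) so each main effect is averaged over the other factor.
    sp = 2.0 * np.asarray(sp_group, dtype=float) - 1.0
    ex = 2.0 * np.asarray(examinee_group, dtype=float) - 1.0
    terms = {"sp": sp, "examinee": ex, "interaction": sp * ex}
    base = [np.ones_like(y), cov]

    full = np.column_stack(base + list(terms.values()))
    rss_full = residual_ss(full, y)
    df_resid = len(y) - full.shape[1]

    results = {}
    for name in terms:
        # Drop one term and measure how much the fit worsens.
        kept = [v for k, v in terms.items() if k != name]
        rss_reduced = residual_ss(np.column_stack(base + kept), y)
        f_stat = (rss_reduced - rss_full) / (rss_full / df_resid)
        results[name] = (f_stat, 1, df_resid)
    return results


def simulate_case(n=400, sp_effect=5.0, seed=0):
    """Synthetic (hypothetical) case data: SP main effect, no interaction."""
    rng = np.random.default_rng(seed)
    sp_group = rng.integers(0, 2, n)
    examinee_group = rng.integers(0, 2, n)
    covariate = rng.normal(65.0, 8.0, n)  # stands in for the equated score
    noise = rng.normal(0.0, 2.0, n)
    score = 10.0 + 0.8 * covariate + sp_effect * sp_group + noise
    return score, covariate, sp_group, examinee_group
```

Calling `ancova_2x2(*simulate_case())` returns an F statistic and degrees of freedom for the SP effect, the examinee effect, and their interaction; a p value could then be obtained from the F distribution (for example with `scipy.stats.f.sf`). In the simulated data, only the SP effect is built in, which parallels the situation in which SP stringency differs but the SP-by-examinee interaction is absent.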
Table 2 provides mean DG and WC scores, as well as standard deviation values, for the various examinee/SP combinations by case, adjusted for initial differences on the corresponding covariate. None of the SP ethnicity/examinee ethnicity interactions were statistically significant on any of the four cases with regard to DG or WC scores.
With regard to Case 1, a significant examinee main effect was obtained, F(1, 649) = 4.79; p = .03. Specifically, averaged over SP ethnicity, black examinees (M = 60.94%) outperformed white examinees (M = 58.30%) with respect to DG scores. Regarding Case 2, results show that the mean DG score for examinees who encountered the white SP (M = 74.45%) was significantly higher than that associated with candidates who interacted with the black SP (M = 71.02%, F(1, 592) = 16.37; p = .0001). Similarly, with respect to Case 3, a significant SP ethnicity main effect was obtained (F(1, 910) = 233.47; p = .0001), where the mean DG score associated with examinees who completed the encounter with the Hispanic SP (M = 74.86%) was significantly higher than the mean associated with encounters completed with the white SP (M = 61.50%). Additionally, the mean WC score, when based on an encounter with the Hispanic SP (M = 5.55), was significantly higher than when based on an encounter with the white SP (M = 4.95, F(1, 910) = 69.39; p = .0001). Also, the mean WC score for Hispanic examinees (M = 5.33) was significantly higher than the mean rating of white students (M = 5.17) on Case 3 (F(1, 910) = 5.09; p = .02). A significant SP ethnicity main effect was also noted for the Case 4 DG score (F(1, 637) = 5.99; p = .01). DG scores for students encountering the Hispanic SP (M = 71.49%) were significantly higher than those obtained in an encounter with the white SP (M = 69.55%). In addition, Hispanic examinees (M = 71.33%) significantly outperformed white examinees (M = 69.70%) with regard to their mean DG rating on Case 4.
The Standards for Educational and Psychological Testing clearly underscore the need to compare the performances of members of different subgroups on cases and tests to ensure that these assessments do not unduly favor one group.11 However, relatively little research has addressed bias-related issues in performance assessments. What role, if any, does the interaction of the ethnicity of the examinee and SP play in a clinical skills examination?
The purpose of this study was to assess whether the interaction of examinee and SP ethnicity impacts DG and WC scores in a large-scale clinical skills examination program. The results of this study were very encouraging and show that, for the subset of cases examined, there is little advantage to be gained by interacting with an SP of similar ethnic makeup. In this regard, our findings mirror those reported in similar past investigations.6,7 However, unlike those investigations, our findings are more defensible because we adjusted for initial differences in DG and WC proficiency levels between groups of examinees using appropriate covariates (equated test scores).
It is also interesting to note that the significant main effects reported in our study are largely attributable to small differences in SPs, irrespective of examinee ethnicity. As such, this potentially points to differences in stringency, both at the portrayal and checklist recording levels, between SPs assigned to portray a common case. Currently, the ECFMG (and the NBME) treat the SP-case combination as the unit of analysis, rather than simply the case. That is, “Chest Pain” is treated differently whether it is portrayed by SP 1 as opposed to SP 2. The findings of the present study lend support to implementing this scoring approach. It is important to underscore, however, that the practical significance of the majority of these SP effects is very small. With the exception of the SP effect noted for Case 3, differences in mean percent-correct checklist scores due to these SP effects would amount to less than one item on a checklist.
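To make the "less than one item" statement concrete, the sketch below converts each reported SP difference in mean percent-correct DG scores into checklist items. The means are those reported in the results; checklist lengths are not reported, so the 20-item length is purely a hypothetical value chosen for illustration, as are the function and variable names.

```python
def items_equivalent(diff_pct, n_items):
    """Convert a difference in percent-correct points into checklist items."""
    return diff_pct / 100.0 * n_items


# Reported mean DG scores by SP (higher mean, lower mean) for each SP main effect.
sp_effects = {
    "Case 2": (74.45, 71.02),
    "Case 3": (74.86, 61.50),
    "Case 4": (71.49, 69.55),
}

HYPOTHETICAL_CHECKLIST_LENGTH = 20  # assumption: not given in the text

for case, (high, low) in sp_effects.items():
    diff = high - low
    items = items_equivalent(diff, HYPOTHETICAL_CHECKLIST_LENGTH)
    print(f"{case}: {diff:.2f} points = {items:.2f} items on a 20-item checklist")
```

Under this assumed length, the Case 2 and Case 4 differences (3.43 and 1.94 percentage points) correspond to roughly 0.69 and 0.39 of an item, whereas the Case 3 difference (13.36 points) corresponds to about 2.67 items. Longer checklists would scale these values proportionally.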
Although the findings reported in this study were encouraging, it is important to interpret them in light of several caveats. First, the analyses were based on a small set of cases. Consequently, generalizations to a greater pool of cases should be made cautiously. Nonetheless, the cases were fairly representative of the entire set of scenarios. Second, it was not possible, owing to the small number of encounters, to base our analyses on several SPs of the same ethnic background for any given case. It would be important to undertake this research so as to be able to separate effects due to SP ethnicity from simple differences between SPs, possibly attributable to training discrepancies. Third, there are a host of additional examinee and SP characteristics (e.g., native language, gender) that could potentially impact scores and that could be further examined. With regard to WC, it would also be important to model the effect of physician rater characteristics on scores. Also, main effects noted as part of this paper, although small, are based on perceived ethnicity. For many individuals, especially those of mixed race, categorizing ethnicity is error-prone. Finally, this study was undertaken with a group of IMGs, and as such, generalizations to U.S. medical students need to be made with caution until additional research is completed.
Despite these limitations, it is hoped that the present findings will provide useful information not only for the present CSE program, but also for other performance assessments, as well as foster future research in the area. This accumulating body of research will lead to a better understanding of the impact of potential sources of bias with this type of examination and provide informative feedback for future training, psychometric, and case-development efforts.