Ethnicity poses a unique but serious question for assessments using standardized patients (SPs): Is there an interaction between examinees' ethnicity and SPs' ethnicity? With conventional examinations, this is not an issue because the examinee typically encounters only written, paper-and-pencil materials; however, with clinical-performance assessments, the examinee interacts with an SP, and their ethnicities are necessarily a part of the interaction. Thus, the concern for assessment is that examinees of a given ethnic background might be marked higher (or lower), depending upon the ethnic background of the SP. The usual concern is that examinees might perform and score better in encounters with SPs of the same ethnic background than they perform in encounters with SPs of different ethnicity. Naturally, if this were to occur, it would seriously undermine the validity of the examination scores, and worse, it could be indicative of an inherent and irreparable flaw in the SP-assessment approach. Despite the seriousness of this concern, to our knowledge, research has been limited to our recently published study.1 The problem is that most SP examinations do not have sufficient numbers of examinees and SPs of different ethnic backgrounds to make the necessary comparisons to conduct a study.
We were able to address this important question by using data from the large-scale fourth-year assessment program involving medical students in the eight medical schools of the New York City Consortium.2 In that study, we compared the performances of white and black examinees in encounters with white SPs versus their performances in encounters with black SPs.1 Specifically, we were interested in whether examinees performed and scored better in encounters with SPs of the same ethnicity, as might be expected given their shared cultural backgrounds and the many things that implies. The results showed no pattern across the 24 analyses (involving different outcome measures, different cases, and two different classes) that would suggest any problem of an interaction between examinee ethnicity and SP ethnicity. Only three of the 24 interactions were statistically significant, and their results showed different patterns. One showed better examinee performance in encounters with SPs of the same ethnic background, consistent with what might be expected; however, the other two showed the opposite: better examinee performance in encounters with SPs of different ethnic backgrounds. In general, the results for all 24 interaction analyses showed only weak effects. The differences between white and black examinees were quite small regardless of the SPs' ethnicity, and those differences were nearly identical for encounters with white SPs and those with black SPs, showing no interaction.
The purpose of this study was to extend the analyses performed in our previous study by including data for a sizable group of Asian students. In our previous study, which focused on the comparison of performances of examinees with same and different ethnicities as the SPs, there were no Asian SPs, so the Asian students' data were not used. There was no “same” condition for the Asian examinees' data. The focus of the present study was the performances of Asian examinees in encounters with white and black SPs, which were compared with those of white and black examinees in encounters with the white and black SPs.
The examination. A fourth-year assessment that uses SP-based cases is administered by the Morchand Center at Mount Sinai School of Medicine to students in member schools of the New York City Consortium.2 This consortium consists of the following eight schools: Albert Einstein College of Medicine, Columbia University College of Physicians and Surgeons, Cornell University Medical School, Mount Sinai School of Medicine, New York Medical College, New York University School of Medicine, State University of New York Health Sciences Center at Brooklyn, and State University of New York Health Sciences Center at Stony Brook. This represents approximately 10% of all the graduating fourth-year medical students in the United States.
The examination consists of seven SP cases that represent commonly encountered problems in internal medicine, surgery, pediatrics, gynecology, and psychiatry. Each case requires 30 minutes for its administration, with 20 minutes for the student-SP encounter, during which the student performs a complete focused history and physical examination, and another ten minutes after the encounter for the student to answer case-related written questions concerning pathophysiology, differential diagnosis, test selection, and test interpretation. During the post-encounter period, SPs complete checklists on which they record students' actions in history taking and physical examination and rate interpersonal and communication skills. Cases are selected and developed by faculty representatives from the eight member schools, who meet every six weeks to discuss examination direction and policy and who work in ad hoc committees to complete the many tasks involved in conducting a testing program of this magnitude. Inter-case reliabilities of the seven-case examination were .68 for the class of 1995 and .64 for the class of 1996.
Data analysis. Data were available for 1,048 students in the class of 1995 and 1,024 in the class of 1996. Ethnicity for this study was derived from the “self-description” code used by the American Medical College Application Service (AMCAS), resulting in 644 white, 50 black, and 205 Asian students for the class of 1995 and 604, 64, and 218 for the class of 1996. Separate analyses were performed on data for the two classes, with analyses performed on three outcome measures: history and physical examination scores, interpersonal and communication scores, and post-station written scores.
The primary analyses were two-way (3 × 2) analyses to test the main and interaction effects of examinee's ethnicity and SP's ethnicity. For all analyses, the examinee's ethnicity was a between-subjects factor, but the SP's ethnicity was a between-subjects factor for some analyses and a within-subjects factor for others. Cases 2, 5, and 7 on the seven-case examination were simulated by SPs of different ethnicities; that is, both white SPs and black SPs served as simulators for each of the three cases. For these cases, SP's ethnicity was a between-subjects factor, and separate analyses were performed on data for each of the three cases. Of the remaining four cases, two were simulated by white SPs only (Cases 3 and 4) and the other two by black SPs only (Cases 1 and 6). For these four cases, SP's ethnicity was a within-subjects factor, and split-plot-type analyses were performed on data involving all four cases to compare mean performance on Cases 3 and 4 (white SPs) with that on Cases 1 and 6 (black SPs). In total, 24 analyses were performed: four case comparisons for each of three outcome measures for each of two classes.
Effect-size measures (standardized mean differences, d) were computed to provide a sharper picture of the ethnicity effects.3 A d value shows the difference between two means divided by their pooled standard deviation. For each interaction, six d values were computed. Three involved the white SPs: the first compared white and Asian examinees, the second compared black and Asian examinees, and the third compared white and black examinees. The other three d values involved black SPs and consisted of the same three examinee comparisons as for white SPs.
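As an illustration of the d statistic described above, the following is a minimal sketch in Python. It assumes the conventional pooled standard deviation weighted by degrees of freedom; the text does not state which pooling formula was used, and the function names and scores are hypothetical and are not data from the study.

```python
import math
from statistics import mean, variance


def pooled_sd(group_a, group_b):
    """Pooled standard deviation of two groups (assumed df-weighted formula)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return math.sqrt(pooled_var)


def standardized_mean_difference(group_a, group_b):
    """d = (mean of group_a - mean of group_b) / pooled standard deviation."""
    return (mean(group_a) - mean(group_b)) / pooled_sd(group_a, group_b)


# Hypothetical scores for two examinee groups seen by the same SP group.
scores_a = [72.0, 75.0, 78.0, 81.0]
scores_b = [70.0, 73.0, 76.0, 79.0]

d = standardized_mean_difference(scores_a, scores_b)
print(f"d = {d:.2f}")
```

A positive d means the first group scored higher than the second. Six such values per interaction (three examinee comparisons for each of the two SP groups) correspond to the layout described above.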
Results were considered statistically significant if p values were less than .05. The study required a large number of statistical tests (i.e., 72 tests were performed—two main effects and one interaction effect for each of 24 analyses). However, a correction for inflated error rate due to multiple testing was not employed, in order to avoid seriously compromising the power of the analyses. Instead, the effect sizes were weighed heavily, in conjunction with the findings of the statistical tests, in arriving at conclusions about the theoretical and practical significance of the relationships between the examinees' and SPs' ethnicities and the examination scores.
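The arithmetic behind the multiple-testing concern can be sketched as follows. The Bonferroni adjustment is used here only as a familiar example of a correction; the text does not say which correction was considered, and the variable names are illustrative.

```python
ALPHA = 0.05             # significance level used in the study
N_ANALYSES = 24          # 4 case comparisons x 3 outcome measures x 2 classes
TESTS_PER_ANALYSIS = 3   # two main effects and one interaction effect

n_tests = N_ANALYSES * TESTS_PER_ANALYSIS

# Number of significant results expected by chance if every null hypothesis were true.
expected_false_positives = ALPHA * n_tests

# Per-test threshold under a Bonferroni correction (an assumed example of a
# correction, not one named in the text).
bonferroni_alpha = ALPHA / n_tests

print(n_tests, expected_false_positives, bonferroni_alpha)
```

Under this assumed correction each test would have to reach a threshold of roughly .0007, which illustrates how much power a correction of this kind could cost.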
Results for the class of 1995 are presented in Table 1. Results for the class of 1996 were similar and are not presented because of space limitations. However, summaries of results for both classes (separately and combined) are presented in the last three rows of Table 1. In general, the results showed no evidence of an interaction between examinees' and SPs' ethnicity. Five of the 24 interactions were statistically significant, which is significantly more than would be expected by chance (.05 × 24 = 1.2). However, means and d values showed no consistent pattern across these five interactions. One of the significant interactions was for history and physical scores, two were for interpersonal and communication scores, and two were for post-station scores. Three significant interactions were obtained for the class of 1995, and two for the class of 1996. To see the difference in the patterns of interaction effects, compare for illustration the d values for history and physical scores (for Cases 3 and 4 versus Cases 1 and 6) with the d values for interpersonal and communication scores (for Case 2) in Table 1. For example, with white SPs, the history and physical scores for black examinees were .02 standard deviations lower than (nearly equal to) those for Asian examinees, whereas with black SPs, black examinees. For interpersonal and communication scores, with white SPs, black examinees scored .42 standard deviations higher than did Asian examinees, whereas with black SPs, black examinees scored lower (d = −.33).
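The comparison of five significant interactions with the 1.2 expected by chance can be checked roughly with a binomial tail probability. The sketch below treats the 24 tests as independent, which is an assumption made only for illustration; the analyses share examinees and cases, so the result is approximate. The function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_INTERACTIONS = 24
ALPHA = 0.05
N_SIGNIFICANT = 5

expected_by_chance = ALPHA * N_INTERACTIONS
p_five_or_more = binomial_upper_tail(N_SIGNIFICANT, N_INTERACTIONS, ALPHA)

print(f"expected by chance: {expected_by_chance:.1f}")
print(f"P(at least {N_SIGNIFICANT} significant): {p_five_or_more:.4f}")
```

Under the independence assumption the tail probability is well below .05, in line with the statement that five significant interactions exceed chance expectation. The count alone says nothing about whether the interactions share a direction.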
Similarly, for all 24 interactions, significant or not, the effects were generally weak and showed no consistent pattern across the two classes, three outcome measures, and four case comparisons. For the classes of 1995 and 1996 combined, the means of the interaction effects were quite small, ranging from d = .06 to d = .19. (See the last row in Table 1.) On average, white examinees scored .19 standard deviations higher than did Asian examinees in encounters with white SPs, and in encounters with black SPs they scored .17 standard deviations higher, a very small difference between the d values (.19 − .17 = .02). Black examinees scored slightly higher than did Asian examinees in encounters with white SPs (d = .07) and in encounters with black SPs (d = .06), a difference of only .07 − .06 = .01. White examinees scored slightly higher than did black examinees in encounters with white SPs (d = .12) and in encounters with black SPs (d = .11), again a small difference (.12 − .11 = .01). In other words, the examinee's ethnicity was not an important factor in the SP's assessment, regardless of the ethnicity of the SP. These small d values of .19, .17, .07, .06, .12, and .11 indicate that the examinee groups differed only slightly, showing no main effect of examinee ethnicity. The differences between these d values were even smaller (.02, .01, and .01), showing no interaction effect of examinee and SP ethnicity.
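The differences between d values reported above can be reproduced with a short calculation. The d values are those given in the text for the two classes combined; the dictionary layout and key names are illustrative only.

```python
# Mean d values for the classes of 1995 and 1996 combined, as reported in the text.
combined_d = {
    "white vs. Asian examinees": {"white_sp": 0.19, "black_sp": 0.17},
    "black vs. Asian examinees": {"white_sp": 0.07, "black_sp": 0.06},
    "white vs. black examinees": {"white_sp": 0.12, "black_sp": 0.11},
}

# For each examinee comparison, the difference between the d value obtained with
# white SPs and the d value obtained with black SPs indexes the interaction.
d_differences = {
    comparison: round(d["white_sp"] - d["black_sp"], 2)
    for comparison, d in combined_d.items()
}

for comparison, difference in d_differences.items():
    print(f"{comparison}: {difference:.2f}")
```

A difference near zero means that an examinee-group gap of about the same size appeared with white SPs and with black SPs, which is what the absence of an interaction implies.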
For the main effect of SP's ethnicity, 18 of the 24 main effects were statistically significant (p < .05). Again, there was no consistent pattern across the 24 main-effect analyses. Nine of the 18 significant main effects showed higher scores with white SPs (positive sign) and nine showed higher scores with black SPs (negative sign). The mean of the nine d values showing higher scores with white SPs was .51, and the mean of the nine showing higher scores with black SPs was −.36. These d values show a moderate effect of the SP's ethnicity. However, the results were equally divided between positive and negative effects, which suggests that the finding may simply be due to differences among individual SPs in portraying the case and completing the checklists rather than an effect of their ethnicity. In other words, this seems to be more an effect of “multiple SPs,” as it is referred to in the SP research literature, than an effect of ethnicity.4,5
The results of this extended analysis are consistent with the findings of our previous study, showing no evidence for an interaction of examinees' ethnicity and SP's ethnicity. Ethnicity has been a continuing concern for the validity of performance assessment that uses SPs. This was one of four concerns raised by participants at the Consensus Conference on the Use of Standardized Patients organized by the Association of American Medical Colleges nearly ten years ago.6 The recommendation was that “studies are needed to evaluate racial, ethnic, and cultural factors affecting performance in clinical skills testing.”7 This recommendation has been difficult to follow, as mentioned above, because most testing programs lack enough examinees and SPs of different ethnicities. Thus, the results of this extended analysis are important because they add to what may continue to be a very limited base of research. Clearly more research is needed to address this important concern.