Evaluations of faculty teaching performance are widely used for decisions regarding promotion and retention. Although there are many new evaluation instruments to measure teaching effectiveness, such as 360-degree evaluation and peer review,1,2 trainee ratings remain the most common source of evaluative data. Fortunately, several studies have shown that learner ratings are robust and valid assessments of teaching skills.3 These favorable measurement characteristics, combined with factors such as accessibility and affordability, likely mean that learner ratings will remain a popular source of data regarding teaching effectiveness. Therefore, it is important to look critically at these data to understand the nuances that lead to variability in ratings.
Gender and race, of both teacher and learner, are two issues that merit further study. Outside of the teacher–learner arena, there are suggestions that gender and race concordance are important for building trust and communication.4 In the scant literature about the influence of race in faculty evaluation, minorities received lower ratings than nonminorities, although the race of the evaluator was not considered.5 There is a slightly larger literature regarding gender effects in faculty evaluation. In one ambulatory clerkship, female students rated male faculty significantly higher than female faculty on the ability to teach problem solving, whereas male students rated female faculty significantly higher than male faculty on the same measure.6 Other literature suggests a significant faculty–student gender interaction in select disciplines.7 The simple existence of a difference does not necessarily indicate bias; it may be that the female (or male) faculty were better teachers in the samples studied. However, abundant literature suggests that women have stronger communication and clinical skills than men.8 On the other hand, clinical teaching effectiveness also depends on a strong knowledge base and procedural skills, areas where men often perform better, at least on testing.9
Accordingly, the purpose of this study was to examine how ratings of clinical faculty teaching effectiveness vary according to teacher (faculty attending) and learner (resident) gender. We hypothesized that, overall, gender-concordant pairs would receive the highest ratings. Second, we asked how ratings of teaching effectiveness varied by faculty and resident underrepresented minority (URM) status, and we hypothesized that race-concordant pairs would receive the highest ratings. For both hypotheses, we examined individual record data and then explored how ratings varied by gender and URM status when data were aggregated to the level of an individual faculty member. Discovering significant race or gender effects could have implications for how ratings of faculty are used. Moreover, much of what is known about teaching evaluations comes from classroom settings rather than the extended and intense clinical settings that characterize many residencies.
Between July 2005 and June 2006, 10,443 evaluations of teaching effectiveness were collected for 720 faculty members by 516 residents across 18 clinical departments. The evaluations were completed confidentially via a Web-based system at the end of a training period, usually two to four weeks, as identified by the program director. The clinical teaching effectiveness instrument consisted of eight items and a global item rated on a scale where 1 = poor, 2 = fair, 3 = good, 4 = very good, and 5 = excellent. Questions were phrased as two statements, one summarizing behaviors of the effective teacher and one those of the ineffective teacher. The scale was developed and pilot tested over several months using consensus-building processes with representatives from multiple clinical departments. A scale score was computed for each evaluation response as the mean of all items except the global item. Cronbach’s alpha was estimated at 0.96. According to the office of faculty affairs, the data set included 82 Asian, 24 black, 600 Caucasian, and 14 Hispanic faculty members, and 197 female and 523 male faculty members. According to definitions provided by the graduate medical education office, the data set included 131 Asian, 18 black, 340 Caucasian, 12 Hispanic, 10 other-non-URM, and 5 other-URM residents, and 211 female and 305 male residents. All faculty had gender information, but 30% of faculty did not have race information; 32% of residents did not have race and gender information. Faculty received an average of 11 evaluations (range 1–89). This study was reviewed by the institutional review board and classified as exempt.
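The scale-score and internal-consistency computations described above can be sketched as follows. The ratings below are invented for illustration (scaled down to three items and four raters to keep the arithmetic visible); the study's raw data are not reproduced here.

```python
from statistics import variance

# Illustrative ratings only: each row is one completed evaluation,
# each column is one behavioral item (the global item is excluded).
ratings = [
    [5, 4, 5],
    [3, 3, 4],
    [4, 4, 4],
    [2, 3, 3],
]

# Scale score for each evaluation: mean of all items except the global item.
scale_scores = [sum(row) / len(row) for row in ratings]

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)
k = len(ratings[0])
item_vars = [variance(col) for col in zip(*ratings)]
total_var = variance([sum(row) for row in ratings])
alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
print(round(alpha, 3))  # 0.9 for this toy data
```

With the study's eight items and real response data, the same formula yielded the reported alpha of 0.96.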
For gender and then for minority status, we conducted a series of four analyses. First, at the level of individual evaluation records, we examined average teaching effectiveness scores according to faculty and resident gender or URM status using a factorial ANOVA. As defined by the faculty affairs office, Hispanic and black faculty were included in the URM category for faculty. For residents, the Association of American Medical Colleges underrepresented in medicine definition was used, which includes black, Hispanic, Middle Eastern, and Native Hawaiian trainees. Second, factorial ANOVAs were repeated using records of the subset of faculty evaluated by both male and female residents (N = 7,321) or by both URM and non-URM residents (N = 5,230). The reanalysis with the smaller sample was intended to guard against any possible selection effects. Selection effects are thought to be minimal because residents are usually assigned to services, but it is nevertheless important to rule them out as a source of potential bias. Third, with factorial ANOVAs we looked at item-level data to see whether there were differences related to item content (e.g., women receiving higher ratings for the communication item). Finally, using data aggregated to the faculty level, we created means for evaluations by male and female residents and by URM and non-URM residents, and we compared means within gender/minority subgroups with t tests and computed effect sizes (Cohen’s d).
In the total data set, the main effect for faculty gender was not significant (P = .06); mean ratings of female and male faculty were 4.49 (SD = 0.66) and 4.46 (SD = 0.74), respectively. Similarly, the main effect for resident gender was not significant (P = .13); mean ratings by female and male residents were 4.47 (SD = 0.72) and 4.48 (SD = 0.72), respectively. However, there was a significant interaction effect (P < .001). Female faculty were rated highest by female residents (mean = 4.56, SD = 0.60; versus mean = 4.44, SD = 0.70 by male residents). In contrast, male faculty were rated highest by male residents (mean = 4.49, SD = 0.73; versus mean = 4.43, SD = 0.76 by female residents). As shown in the left panel of Figure 1, when the analysis was repeated with the subset of faculty rated by both men and women, the interaction effect remained significant: mean ratings of female faculty by female residents = 4.55 (SD = 0.61), female faculty by male residents = 4.43 (SD = 0.70), male faculty by male residents = 4.49 (SD = 0.73), and male faculty by female residents = 4.41 (SD = 0.78; P < .001). There were significant interactions for all items (P < .05). Females consistently rated female faculty highest, and males consistently rated male faculty highest, with no apparent differences related to item content (data not shown).
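The crossover pattern in the subset analysis can be summarized as a difference-in-differences on the four reported cell means. The contrast below is our own illustration of what the ANOVA interaction term tests, not a statistic reported by the study.

```python
# Cell means from the subset of faculty rated by both male and
# female residents (values reported in the text).
mean_ff = 4.55  # female faculty, female residents
mean_fm = 4.43  # female faculty, male residents
mean_mm = 4.49  # male faculty, male residents
mean_mf = 4.41  # male faculty, female residents

# Difference-in-differences: the female residents' preference for
# female faculty, relative to male residents' rating of them.
# A nonzero value is the crossover the interaction term detects.
contrast = (mean_ff - mean_fm) - (mean_mf - mean_mm)
print(f"{contrast:.2f}")  # 0.20
```

On a 5-point scale this 0.20-point crossover is statistically detectable with thousands of records, even though each pairwise gap is only about 0.1 point.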
In the total data set, the main effect for faculty URM status was not significant (P = .23). The overall mean ratings of URM faculty and non-URM faculty by all residents were 4.46 (SD = 0.71) and 4.48 (SD = 0.72), respectively. Main effects for resident URM status and the interaction were not significant (P = .69 and P = .10, respectively). However, when we repeated the ANOVA including only those faculty evaluated by both URM and non-URM residents, there was a significant interaction term (P = .04). As shown in the right panel of Figure 1, mean ratings were: non-URM faculty, URM residents = 4.38 (SD = 0.82), non-URM faculty, non-URM residents = 4.44 (SD = 0.75), URM faculty, URM residents = 4.60 (SD = 0.68), and URM faculty, non-URM residents = 4.33 (SD = 0.76). Significant interactions were observed for three items: ability to teach critical thinking, communication, and professionalism (P < .05). Consistently, the highest scores were for URM–URM pairs.
On the level of the faculty, all of the findings favoring gender or minority concordance remained, but the effects were mostly small. As shown in Table 1, gender concordance still benefited female faculty (ES = 0.07) and male faculty (ES = 0.06). On average, non-URM faculty received slightly higher ratings from non-URM residents (ES = 0.05). However, there was a large observed difference in ratings given to URM faculty between URM residents (mean = 4.66, SD = 0.43) and non-URM residents (mean = 4.40; SD = 0.40; ES = 0.61).
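A pooled-SD version of Cohen's d on the reported URM-faculty means approximately reproduces this effect size. Because the group sizes are not given here, the sketch below pools the two SDs with equal weight, which is why it lands near, rather than exactly on, the published 0.61.

```python
import math

# Faculty-level means and SDs for URM faculty, from the text.
mean_urm, sd_urm = 4.66, 0.43        # ratings by URM residents
mean_nonurm, sd_nonurm = 4.40, 0.40  # ratings by non-URM residents

# Cohen's d with an equal-weight pooled SD (an approximation,
# since the actual group sizes are not reported here).
pooled_sd = math.sqrt((sd_urm**2 + sd_nonurm**2) / 2)
d = (mean_urm - mean_nonurm) / pooled_sd
print(f"{d:.2f}")  # ~0.63, close to the reported 0.61
```

By conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), this is a medium-to-large effect, in contrast to the 0.05–0.07 effects elsewhere in Table 1.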
With data for 720 faculty from 18 clinical departments who were rated by 516 residents, we asked questions regarding the importance, if any, of gender and URM status concordance on observed teaching evaluations. Few studies in medical education have asked this question, particularly in the graduate medical education setting, where interactions often last for weeks and teaching can occur at the bedside and/or regarding patients in very intense situations. This study is novel in that it used data from multiple departments. Moreover, the large amount of data enabled subanalyses for faculty rated by both male and female residents and by both URM and non-URM residents, thus reducing the likelihood that selection by residents of similar preceptors influenced the results. In short, we observed significant race and gender interactions, supporting our concordance hypotheses overall, but the effect sizes were small for all but URM faculty. For the latter, ratings by URM residents showed a very large positive effect.
Our findings are limited in several ways. First, only data submitted by residents for clinical faculty are included; learners at different points in their training or in different teaching venues might have different perspectives. Second, analyses were restricted to evaluations for which race and gender information was available about the faculty and resident; some programs could not provide these data. Third, the URM definitions were not the same for residents and faculty, and it will likely be several years until faculty data are aligned with resident data. Finally, as is reflective of academic medicine in general, there were very small numbers of URM faculty.
The sizable literature suggesting the importance of concordance in patient–physician interactions4 comes from a setting that allows freedom to choose one’s physician. In a more restrictive training environment, a trainee typically cannot choose his or her faculty preceptor. Additionally, the learning environment focuses on multiple skills in addition to communication. Overall, the significant interaction effects observed in the total data set largely disappeared when data were aggregated to the faculty level. We interpret this to mean that residents are knowledgeable consumers: they desire, appreciate, and recognize good precepting, and they rate it accordingly. In fact, an average of 11 ratings per faculty member is more than enough to capture true scores. The good news is that gender and race are largely irrelevant. This parallels studies reporting that race and gender concordance are not important in the mentoring of junior faculty,10 a situation that may more closely resemble the resident–attending relationship.
The take-home meaning of these results is not entirely clear. Although the effects were significant, they were also small. Their impact could become much larger if current efforts to recruit URM residents and faculty are successful. It is certainly reasonable to think that faculty communicate differently with gender- and race-concordant residents, and it may be that URM faculty are better teachers for URM residents. Future research examining faculty and resident attitudes regarding gender and minority status may help determine whether our results are related to teaching at all, a function of preexisting demographic conditions, or simply a halo effect. Additionally, including time of year in the analysis may produce different results: as nonconcordant faculty–resident pairs become acquainted, comfort may increase, reducing any effect.
1. Beckman TJ, Lee MC, Rohren CH, Pankratz VS. Evaluating an instrument for the peer review of inpatient teaching. Med Teach. 2003;25:131–135.
2. Joshi R, Ling FW, Jaeger J. Assessment of a 360-degree instrument to evaluate residents’ competency in interpersonal and communication skills. Acad Med. 2004;79:458–463.
3. Beckman TJ, Cook DA, Mandrekar JN. What is the validity evidence for assessments of clinical teaching? J Gen Intern Med. 2005;20:1159–1164.
4. Cooper-Patrick L, Gallo JJ, Gonzales JJ, et al. Race, gender, and partnership in the patient–physician relationship. JAMA. 1999;282:583–589.
6. Leone-Perkins M, Schnuth R, Kantner T. Preceptor–student interactions in an ambulatory clerkship: gender differences in student evaluations of teaching. Teach Learn Med. 1999;11:164–167.
7. Centra JA, Gaubatz JA. Is there gender bias in student evaluations of teaching? J Higher Educ. 2000;71:17–33.
8. Cuddy MM, Swanson DB, Dillon GF, Holtman MC, Clauser BE. A multilevel analysis of the relationships between selected examinee characteristics and United States Medical Licensing Examination Step 2 Clinical Knowledge performance: revisiting old findings and asking new questions. Acad Med. 2006;81(10 suppl):S103–S107.
9. Day SC, Norcini JJ, Shea JA, Benson JA Jr. Gender differences in the clinical competence of residents in internal medicine. J Gen Intern Med. 1989;4:309–312.
10. Jackson VA, Palepu A, Szalacha L, Caswell C, Carr PL, Inui T. “Having the right chemistry”: a qualitative study of mentoring in academic medicine. Acad Med. 2003;78:328–334.