Necessary but Insufficient and Possibly Counterproductive: The Complex Problem of Teaching Evaluations

Ginsburg, Shiphra MD, MEd, PhD1; Stroud, Lynfa MD, MEd2

Academic Medicine 98(3):p 300-303, March 2023. | DOI: 10.1097/ACM.0000000000005006


The evaluation of clinical teachers’ performance has been a subject of research and debate for decades. The literature is rife with papers that either review the evidence for students’ evaluations of teachers or attempt to create better rating instruments.1–3 Despite this extensive literature, teaching evaluations (TEs) remain problematic. For example, there is little evidence that students’ evaluations are associated with learning, yet we rely on these evaluations almost exclusively as assessments of and feedback to teachers. While there is some evidence of validity in certain settings,4,5 TEs are subject to many forms of bias and are known to be confounded by construct-irrelevant factors, such as the teacher’s physical attractiveness or personality.6 We have also blurred the distinction between formative (i.e., to help teachers improve) and summative (e.g., for promotion) purposes of these tools.

Despite these limitations, TEs have been used to promote teachers, award tenure, and bestow awards, all of which may serve to emphasize the value of excellent teaching. But TEs have also been used to deny promotion and to hold faculty back from advancement. In some institutions, teaching scores from learners affect clinical teachers’ financial compensation. The same tools can therefore serve either rewarding or punitive purposes. Given that TEs carry so much weight for faculty, is it any wonder that faculty are not satisfied with how these tools are implemented?7,8 In this commentary, we briefly review the literature on what TEs are meant to do, what they actually do in the real world, and their overall impact. We then consider productive ways forward.

Aspirations and Intended Goals: What Are TEs Meant to Do?

On the surface, asking learners for their opinions about their teachers makes sense. By collecting TEs, we ensure that learners have a voice in who is teaching them and how they are being taught. By evaluating teachers, students can help select for and reward the best individuals and identify those who may require coaching or even removal. This is especially important for teachers who create hostile learning environments, as learners may be the only group that can comment from this perspective. It also helps ensure that teachers maintain a learner-focused approach; that is, that they respond to their learners, teach them what they need, give appropriate feedback, and help learners learn and grow.

For clinical teachers, TEs signal what is important. What is delineated on the evaluation form is what is seen to matter to various stakeholders (e.g., trainees, clinical teachers, hospital and academic leadership) and sets expectations for performance. TEs also allow teachers (especially those on education tracks) to accrue evidence of their effectiveness and even excellence to show they are teaching well (and enough) for purposes of promotion and tenure, for awards, and even for financial gain. Ideally, good TE tools should be embedded in faculty development systems that include reflection and coaching for improvement.9

Real-World Performance: What Do TEs Actually Do?

Evidence suggests that TEs may not actually capture teaching very well. Surprisingly, there is little evidence that higher-rated teachers produce more or better learning in their students. In the higher education classroom setting, research has found only tenuous links between TEs and classroom learning.10 In the clinical setting, an older study showed only a minimal association between teaching quality and students’ performance during clerkship, although teachers may have an effect on students’ career choice.11 In another study, feedback to lower-performing teachers actually made them worse.12 In some ways, this lack of correlation between learning and TEs may not be surprising, given that students usually lack expertise both in the content domain in which they are being taught and in best pedagogical practices.

If not learning, what do TEs actually measure? In some studies, teacher ratings were subject to significant halo effects, being influenced by charisma, personality, and context.6,13,14 This is not to say that enjoyment and psychological safety in a classroom or on a ward rotation are unimportant. But we should be careful not to equate these factors with “teaching.” Indeed, when we ask learners to evaluate their clinical teachers, most rating scales include constructs much broader than teaching or instruction, such as treating learners with respect, providing effective feedback, and being available. Yet research examining associations between TEs and learning is largely limited to tests of students’ knowledge, as these are standardized and easy to conduct. Clearly, knowledge is only one aspect of being a good physician. In one interesting study, ratings of clinicians by residents were inversely correlated with ratings by patients, suggesting that resident ratings may miss important elements of what it means to be a good doctor.15 Thus, TEs may miss important aspects of being a good physician while also capturing the construct-irrelevant factors we described above. As Naftulin and colleagues wrote in a seminal article in 1973, “Student satisfaction with learning may represent little more than the illusion of having learned.”16

In addition, TEs have often been reported to show biases against women and individuals from visible minority groups, although effects are not seen in all studies and are not all in the same direction. At a minimum, students may use different language to describe faculty by gender.17,18 While this may not negatively impact TE ratings of women in all specialties,19,20 in some instances, gender bias may lead to significant differences in TEs between men and women. This is particularly problematic in specialties with a lower representation of women at the faculty level, such as surgery.19–21 There may also be intersectionality effects between gender and race/ethnicity, both positive and negative.22

Additionally, we have identified differences in teaching scores and comments based on the gender of the person completing the evaluation.18,20 Individual dyadic relationships between student and teacher may therefore be important, yet they are rarely explored. Beyond demographic biases, other construct-irrelevant factors have been shown to positively influence TEs in medicine, including the perceived degree of faculty members’ involvement with trainees,23 charisma and physical attractiveness,16,24 extraversion,25 and the provision of cookies.6 Given the extent to which TEs are vulnerable to bias, some institutions have gone so far as to disallow students’ evaluations of teaching in promotion and tenure decisions.26

What Is the Overall Impact of TEs?

Given the above, what is the net effect of TEs? There is clearly some evidence of the reliability and validity of TEs, although research on consequential validity is notably lacking.5 Yet there are also drawbacks, as noted above. The net effect of the benefits and potential harms of TEs is thus hard to determine, as research in the workplace that includes relevant outcomes is very limited. Teachers want and need to be good teachers and to receive good evaluations, that is, evaluations that demonstrate their effectiveness and help them improve. Yet clinical teachers are wary of the power and weight these evaluations hold, not just externally but also internally. Anyone who has ever received negative TEs can attest to how threatening they can feel to one’s self-esteem, self-efficacy, and professional identity.7,27 These feelings may be heightened by the anonymity of TEs, which also makes it challenging to use the feedback provided to improve. Anonymous feedback received months after an encounter usually lacks sufficient context for the teacher to understand what went wrong, from whose perspective, and how they might do better in the future.

Understandably, teachers may be reluctant to rock the boat: to try new teaching methods or to do anything that may risk their learners feeling too challenged or even uncomfortable. In higher education settings, classes that are easier, and teachers who are more lenient, are rated more highly.28 Indeed, we see this as the most vexing problem in medical education today: clinical teachers who are beholden to their learners for career advancement are inhibited from giving honest, critical feedback and assessments to learners when required.29,30 This is a potentially fatal flaw in creating a meaningful workplace-based assessment system. A study at the University of Toronto a number of years ago found a statistically significant relationship between the scores faculty assign to learners in the workplace and the scores they receive as teachers.31 That is, the score you give is sometimes the score you get. In an article titled “Student Evaluations of Teaching Encourages Poor Teaching and Contributes to Grade Inflation: A Theoretical and Empirical Analysis,” Stroebe writes:

Students and faculty are in an implicit negotiation situation, where each side has a “good” that is valuable to the other side. Faculty can provide good grades and easy courses, and students can provide positive [teaching evaluations].28

While this study was not conducted in a health care setting, we have seen this scenario play out as a quid pro quo in medical education. We are aware of senior faculty who instruct junior faculty to give all learners 4s and 5s on their evaluations so as not to put themselves at risk. This issue is critical: it warps the system by depriving learners of the constructive and corrective feedback they need to become excellent clinicians, and it paradoxically rewards faculty who may inflate scores and put less effort into providing meaningful feedback.

Where Do We Go From Here?

In this commentary, we have outlined numerous threats to the use of TEs. However, we would not for a minute advocate doing away with this critical source of data. TEs provide important feedback for faculty growth and, vitally, they provide information about the learning environment a teacher establishes, information that may not be available otherwise. The problem is the overreliance on TEs as often the sole arbiter by which a faculty member’s teaching skills are judged. We propose that, just as medical education has moved toward programmatic assessment of learners, the time has come to move toward programmatic evaluation of faculty teaching effectiveness.

Within such a program, TEs would provide the student voice, which is essential but alone an insufficient measure of teaching. We also need to be thoughtful about how TEs are implemented. First, what is being measured must be explicit. Many TEs in the clinical setting conflate teaching with other skills, such as supervision and coaching. This is not in itself a problem as long as we align the items on TEs with what we wish to evaluate in teachers and make this explicit to both teachers and students. Second, the rating scales used on TEs should ideally focus on specific behaviors or skills, rather than on subjective, poorly defined anchors. Third, there needs to be recognition that, for learners, completing TEs anonymously is a privilege, one that comes with a responsibility to be constructive and professional. Recognizing the inherent power dynamic between learners and teachers, we would not necessarily advocate for de-anonymizing TEs, although this is an interesting prospect that has been studied.32 While rare, comments that are overtly sexist, racist, or otherwise personally demeaning are unprofessional and wholly inappropriate. Schools should have transparent, arm’s-length mechanisms for identifying learners who use such language and should follow up on these behaviors as a professionalism issue. Schools should also establish processes to remove such comments from TEs. Ideally, learners should receive education and training on how to provide constructive and inspiring narrative feedback.

There must then be a robust, transparent system in place that dictates how TE data are used. For example, many schools use a norm-referenced range to report their TEs and often select faculty for awards from those in the very top percentiles. This practice is problematic in two ways. First, the vast majority of faculty are very good or excellent teachers, so there is a significant ceiling effect, and those in the top few percentiles are separated from the rest by a razor’s edge. Second, these extremely narrow margins may be influenced by the numerous biases listed above. Just as we do for learners, a criterion-referenced standard is more appropriate for identifying those falling below an acceptable level of performance, but other metrics may be better suited to identifying outstanding teachers.

So, what might these other metrics be that could contribute to programmatic TEs? To start, just as with the move toward competency-based medical education for residents, there should be a greater focus on the narrative comments included on TEs, although comments are not without their own limitations.33 One might argue that the scores, given the ceiling effect, are only helpful in identifying low-performing outliers who clearly need remediation. A deeper dive into the comments, combined with faculty reflection on what they plan to work on over the next year and collated into a portfolio over time, may be a more meaningful process. Moving toward formative TEs that are shared only with the teacher is being considered in some jurisdictions.28 Regardless of what data are provided to teachers, TEs should be collected and used as part of a system that includes effective faculty development strategies.2 Merely handing learner feedback to teachers is not an effective way to foster growth.9

Beyond the actual TEs, additional sources of data could be collected to develop a fuller, more informed picture of teaching effectiveness. For example, near-peer assessments of teaching sessions (such as from colleagues attending teaching rounds), periodic formal external TEs, learning outcomes (the most challenging to measure and the holy grail of teaching impact), portfolios, and 360-degree evaluations from others may all contribute to a higher-resolution, more complete picture of teaching effectiveness.34 However, we must ensure that any new methods or processes are actually meaningful and helpful for faculty and not just extra work. Lastly, and somewhat provocatively, we propose greater recognition (perhaps even an award) for faculty who do not “fail to fail” learners. Those faculty who take the significant time and energy to identify a learner in difficulty and who then support, teach, and coach that individual to become a better doctor are the educators we should all aspire to be. Our learners need this support to succeed, and our obligation to protect our patients demands it.


The authors would like to thank Dr. Brian Wong and Dr. Martina Trinkaus for their valuable insights and suggestions on this commentary and for the work they have each done to improve the evaluation of clinical teachers at the University of Toronto.


References

1. Snell L, Tallett S, Haist S, et al. A review of the evaluation of clinical teaching: New perspectives and challenges. Med Educ. 2000;34:862–870.
2. Steinert Y, Mann K, Anderson B, et al. A systematic review of faculty development initiatives designed to enhance teaching effectiveness: A 10-year update: BEME guide no. 40. Med Teach. 2016;38:769–786.
3. Fluit CRMG, Bolhuis S, Grol R, Laan R, Wensing M. Assessing the quality of clinical teachers. J Gen Intern Med. 2010;25:1337–1345.
4. Boerboom TBB, Mainhard T, Dolmans DHJM, Scherpbier AJJA, Van Beukelen P, Jaarsma ADC. Evaluating clinical teachers with the Maastricht clinical teaching questionnaire: How much “teacher” is in student ratings? Med Teach. 2012;34:320–326.
5. Beckman TJ, Cook DA, Mandrekar JN. What is the validity evidence for assessments of clinical teaching? J Gen Intern Med. 2005;20:1159–1164.
6. Hessler M, Popping DM, Hollstein H, et al. Availability of cookies during an academic course session affects evaluation of teaching. Med Educ. 2018;52:1064–1072.
7. Hammer R, Peer E, Babad E. Faculty attitudes about student evaluations and their relations to self-image as teacher. Social Psychol Educ. 2018;21:517–537.
8. Wong WY, Moni K. Teachers’ perceptions of and responses to student evaluation of teaching: Purposes and uses in clinical education. Assess Eval High Educ. 2014;39:397–411.
9. Boerboom TBB, Stalmeijer RE, Dolmans DHJM, Jaarsma DADC. How feedback can foster professional growth of teachers in the clinical workplace: A review of the literature. Stud Educ Eval. 2015;46:47–52.
10. Uttl B, White CA, Gonzalez DW. Meta-analysis of faculty’s teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Stud Educ Eval. 2017;54:22–42.
11. Griffith CH 3rd, Wilson JF, Haist SA, Ramsbottom-Lucier M. Relationships of how well attending physicians teach to their students’ performances and residency choices. Acad Med. 1997;72:S118–S120.
12. Litzelman DK, Stratos GA, Marriott DJ, Lazaridis EN, Skeff KM. Beneficial and harmful effects of augmented feedback on physicians’ clinical-teaching performances. Acad Med. 1998;73:324–332.
13. Ware JE, Williams RG. The Dr. Fox effect: A study of lecturer effectiveness and ratings of instruction. J Med Educ. 1975;50:149–156.
14. Scheepers RA, Lombarts KMJMH, van Aken MAG, Heineman MJ, Arah OA. Personality traits affect teaching performance of attending physicians: Results of a multi-center observational study. PLoS One. 2014;9:e98107.
15. Dobbs MR, Smith JH. Evaluations of neurologists by their patients and residents are inversely correlated. J Patient Exp. 2016;3:17–19.
16. Naftulin DH, Ware JE Jr, Donnelly FA. The Doctor Fox lecture: A paradigm of educational seduction. J Med Educ. 1973;48:630–635.
17. Heath JK, Clancy CB, Carillo-Perez A, Dine CJ. Assessment of gender-based qualitative differences within trainee evaluations of faculty. Ann Am Thorac Soc. 2020;17:621–626.
18. Ginsburg S, Stroud L, Lynch M, Melvin L, Kulasegaram K. Beyond the ratings: Gender effects in written comments from clinical teaching assessments. Adv Health Sci Educ Theory Pract. 2022;27:355–374.
19. Fassiotto M, Li J, Maldonado Y, Kothary N. Female surgeons as counter stereotype: The impact of gender perceptions on trainee evaluations of physician faculty. J Surg Educ. 2018;75:1140–1148.
20. Stroud L, Freeman R, Kulasegaram K, Cil TD, Ginsburg S. Gender effects in assessment of clinical teaching: Does concordance matter? J Grad Med Educ. 2020;12:710–716.
21. Morgan HK, Purkiss JA, Porter AC, et al. Student evaluation of faculty physicians: Gender differences in teaching evaluations. J Womens Health (Larchmt). 2016;25:453–456.
22. McOwen KS, Bellini LM, Guerra CE, Shea JA. Evaluation of clinical faculty: Gender and minority implications. Acad Med. 2007;82:S94–S96.
23. Irby DM, Gillmore GM, Ramsey PG. Factors affecting ratings of clinical teachers by medical students and residents. J Med Educ. 1987;62:1–7.
24. Rannelli L, Coderre S, Paget M, Woloschuk W, Wright B, McLaughlin K. How do medical students form impressions of the effectiveness of classroom teachers? Med Educ. 2014;48:831–837.
25. Scheepers RA, Arah OA, Heineman MJ, Lombarts KMJMH. How personality traits affect clinician-supervisors’ work engagement and subsequently their teaching performance in residency training. Med Teach. 2016;38:1105–1111.
26. Kaplan W. Ryerson University v Ryerson Faculty Association, 2018 CanLII 58446. Accessed September 20, 2022.
27. Rubino D. 5 Strategies to Manage the Hurt of Student Evaluations. Published 2021. Accessed September 20, 2022.
28. Stroebe W. Student evaluations of teaching encourages poor teaching and contributes to grade inflation: A theoretical and empirical analysis. Basic Appl Soc Psych. 2020;42:276–294.
29. Scarff CE, Bearman M, Chiavaroli N, Trumble S. Keeping mum in clinical supervision: Private thoughts and public judgements. Med Educ. 2019;53:133–142.
30. Yepes-Rios M, Dudek N, Duboyce R, Curtis J, Allard RJ, Varpio L. The failure to fail underperforming trainees in health professions education: A BEME systematic review: BEME guide no. 42. Med Teach. 2016;38:1092–1099.
31. Bandiera G, Fung K, Iglar K, et al. Best practices in teacher assessment: Summary of recommendations. Postgraduate Medical Education, University of Toronto. Published 2010. Accessed September 27, 2022.
32. Dudek NL, Dojeiji S, Day K, Varpio L. Feedback to supervisors: Is anonymity really so important? Acad Med. 2016;91:1305–1312.
33. Ginsburg S, Watling CJ, Schumacher DJ, Gingerich A, Hatala R. Numbers encapsulate, words elaborate: Toward the best use of comments for assessment and feedback on entrustment ratings. Acad Med. 2021;96:S81–S86.
34. Berk RA. Top five flashpoints in the assessment of teaching effectiveness. Med Teach. 2013;35:15–26.
Copyright © 2022 by the Association of American Medical Colleges