Research Reports

Frequency and Determinants of Residents’ Narrative Feedback on the Teaching Performance of Faculty

Narratives in Numbers

van der Leeuw, Renée M. MD; Overeem, Karlijn MD, PhD; Arah, Onyebuchi A. MD, PhD; Heineman, Maas Jan MD, PhD; Lombarts, Kiki M.J.M.H. MHA, PhD

doi: 10.1097/ACM.0b013e31829e3af4


The quality of residency training depends a great deal on the quality of the faculty who supervise residents.1,2 However, faculty receive training primarily as physicians, whereas formal training in teaching is often optional.3 Clinical faculty often feel unprepared for nonclinical tasks, such as supervision, after finishing residency training.4 Improving residency training to better prepare future faculty for nonclinical tasks is a possible solution; however, there is a need to develop tools that can aid today’s faculty in developing their teaching skills. Some reliable and valid measurement instruments, often using residents’ feedback, are currently available for evaluating the teaching performance of faculty.5–9 Although many complexities influence the effectiveness of feedback in practice,10 several recent reviews have elucidated some characteristics of feedback that are likely to make it more useful.11,12 One important point these studies make is that feedback should contain narrative comments. The studies suggest that such comments increase the value of the feedback for learning and that narrative, descriptive, and linguistic information is much richer and more deeply appreciated by learners than numeric information.13,14

Because learners increasingly perceive narrative comments as a valuable part of feedback,13 knowing whether residents are indeed motivated to provide written comments is essential. A previous study in this journal demonstrated that comments submitted by residents are unlikely to provide faculty with substantive feedback,15 but this previous study lacked predictive models to investigate the determinants (which we define as predictors—not causes) of the numbers of positive comments and suggestions for improvement that residents gave to faculty and, hence, that faculty received from residents. Studying factors that potentially influence the number (frequency) and type of narrative feedback that residents provide could be an important first step in improving the use of these narrative comments in evaluating teaching performance.

We conducted this study, which had three goals, to increase our understanding of the frequency and determinants of narrative feedback. First, we aimed to quantify the frequency of two main types of narrative comments (namely, positive comments and suggestions for improvement) in an evaluation-and-feedback system; that is, we wanted to assess how frequently residents gave positive comments and suggestions for improvement within their narrative feedback to their teaching faculty. Second, we tested the hypotheses that, compared with those with lower teaching performance evaluations, faculty with higher evaluations would receive (A) positive comments more frequently, and (B) suggestions for improvement less frequently; that is, we hoped to determine whether higher teaching performance scores accompanied a higher frequency of positive comments and a lower frequency of suggestions for improvement. Third, we hoped to increase our understanding of the determinants of the frequency of the positive comments and suggestions for improvement, so we investigated the extent to which the two types of narrative comments were associated with hospital, specialty, faculty, and resident factors.


Method

Study context and the faculty evaluation system

We developed the System for Evaluation of Teaching Qualities (SETQ) to generate reliable and valid, individual feedback for faculty on their teaching performance and to aid in their self-directed learning.6–9 In short, SETQ allows residents to provide feedback to each faculty member with whom they have worked using a feasible, password-protected, Internet-based environment. Faculty also complete a self-evaluation through the same Web-based environment. The specialty-specific SETQ questionnaires contain between 23 and 30 specialty-relevant teaching performance items covering five predefined and validated domains of teaching qualities: learning climate, professional attitude and behavior towards residents, communication of goals, evaluation of residents, and feedback.6–9 All items require responses on a five-point Likert scale (1 = completely disagree, 3 = neutral, 5 = completely agree). Previous reports have described the development and assessment of the psychometric properties of specialty-specific SETQ instruments.6–9,16

In addition to the set of numerical items, the SETQ instruments include open comment fields that invite residents to record narrative feedback, through which they may list the strengths of and formulate specific suggestions for the improvement of individual faculty members’ teaching performance. Instructions on the survey read, “Please provide strengths of faculty’s teaching performance” and “Please provide concrete suggestions for improvement of faculty’s teaching performance.” The automatically created individual feedback reports summarize faculty self-evaluation scores, the mean values generated from the scores of all residents who evaluated the faculty member, and a verbatim list of all the positive comments and all the suggestions for improvement that residents provided. To protect residents’ anonymity, individual feedback reports contain no resident characteristics. In the Netherlands, SETQ is the most widely used system for evaluating individual teaching faculty members: 41 teaching hospitals, representing over 190 residency training programs, 2,500 faculty members, and 2,300 residents, use the system.

Study population and ethical considerations

From September 2008 through May 2010, 964 faculty members and 839 residents representing 56 residency training programs in 20 academic and nonacademic hospitals received an invitation to participate in the SETQ system. To prevent sample dominance and potential selection bias by faculty who participated in subsequent years, we included only the first year of each faculty member’s evaluations from the SETQ system. All faculty and residents consented to provide their anonymous data for research analysis. We provided no financial incentives. We consulted the institutional ethical review board of the Academic Medical Center of the University of Amsterdam, which confirmed that the Medical Research Involving Human Subjects Act did not apply to this study and that an official approval of this study by the committee was not required.

Frequency of positive comments and suggestions for improvement

To answer our first research question regarding the frequencies of narrative comments in the SETQ feedback system, we investigated the positive comments and the suggestions for improvement recorded in each resident-completed SETQ evaluation of a faculty member. We performed a structured coding of the data, counting only the number of comments that were either positive or offered suggestions for improvement. Some suggestions for improvement were phrased, for example: “None. Stay the way you are.” Myers and colleagues15 referred to such comments as “embedded positives,” which is why we included these in the positive comments counts. We considered feedback that was not specifically a positive comment or a suggestion for improvement to be positive when it was presented in the column of positive comments and, likewise, a suggestion for improvement when it appeared in the suggestions column (see Chart 1). Two of us (R.v.d.L. and K.O.) independently counted and documented the number and nature of phrases in sets of 100 evaluations at a time, and we concurrently calculated interrater reliability using the Kappa statistic. As long as the Kappa statistic remained > 0.8, we each continued coding one-half of our dataset while frequently discussing the coded evaluations and resolving possible issues with a third researcher (K.L.). After the coding of all evaluations, we calculated the mean number of positive comments and the mean number of suggestions that residents gave to faculty.
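The agreement check described above can be illustrated with a minimal sketch. It assumes an unweighted Cohen's kappa computed on the per-evaluation comment counts treated as categories; the text says only "Kappa statistic", so the exact variant is an assumption, and the function name and the ten example counts are hypothetical rather than study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical numbers of positive comments counted in ten evaluations.
coder_1 = [2, 0, 1, 3, 2, 1, 0, 2, 1, 4]
coder_2 = [2, 0, 1, 3, 2, 1, 0, 2, 2, 4]

kappa = cohen_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}; continue split coding: {kappa > 0.8}")
```

Treating counts as unordered categories is the strictest reading of agreement; a weighted kappa would give partial credit for near-identical counts.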

Chart 1 Examples of the Coding of the Narrative Feedback in the Positive Column and in the Column With Suggestions for Improvement

Finally, we performed the Mann–Whitney U test to analyze differences in the frequency of narrative comments across hospital, specialty, faculty, and resident characteristics. To adjust for multiple significance testing, we used false discovery rate (FDR)-adjusted P values.17–20 For classical unadjusted tests, we set the significance threshold at P < .05. For the FDR-adjusted multiple testing, we set the maximum acceptable FDR at 0.05. We estimated the FDR-derived threshold for each test and adjusted each P value using both the classical one-stage17,18 and the sharpened two-stage techniques.19 Because both techniques yielded similar results, we have reported only the more conservative results of the classical one-stage approach, declaring significance only when each subsequent FDR-adjusted P value was less than the FDR-derived significance threshold.
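To make the one-stage adjustment concrete, the following is a minimal sketch of the Benjamini–Hochberg step-up procedure for FDR-adjusted P values. It is an illustration under stated assumptions, not the analysis code used in this study (which was run in SPSS and Excel); the function name and the five raw P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg (one-stage, step-up) adjusted p-values, in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values from five univariate tests (not study data).
raw = [0.0002, 0.01, 0.02, 0.08, 0.20]
adjusted = benjamini_hochberg(raw)

max_fdr = 0.05  # maximum acceptable FDR, as stated in the text
significant = [p_adj < max_fdr for p_adj in adjusted]
for p, p_adj, flag in zip(raw, adjusted, significant):
    print(f"raw P = {p:.4f}  adjusted P = {p_adj:.4f}  significant: {flag}")
```

The sharpened two-stage technique differs in that it first estimates the number of true null hypotheses and then reruns the step-up procedure at a correspondingly relaxed level, which is why the one-stage results are the more conservative of the two.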

Teaching performance and its association with the frequency of narrative feedback

To address our hypotheses that, compared with those with lower teaching performance evaluation scores, faculty with higher evaluation scores would receive (A) positive comments more frequently, and (B) suggestions for improvement less frequently, we calculated the teaching performance mean score of all residents’ ratings per faculty member. We divided this mean score into quartiles as follows: The scores of the top quartile ranged from ≥ 4.06 to 5.0, the scores of the middle two quartiles ranged from ≥ 3.81 to < 4.06 and from ≥ 3.51 to < 3.81, and the scores of the lowest quartile ranged from 0 to < 3.51. We used these quartiles as dummy variables to perform multivariable Poisson regression analysis using generalized estimating equations (GEE) with robust variances.21 This type of regression analysis enabled us to correct for clustering in our sample. The evaluations were clustered within both residents and faculty because each resident could evaluate different faculty members and each faculty member could receive an evaluation from different residents. The GEE modeling framework allowed us to model faculty-averaged SETQ mean teaching performance scores adjusted for the cross-classified hierarchical clustering of the evaluations within residents and faculty. A Poisson-type GEE was an appropriate regression because the study outcomes were counts of the number of narrative comments per evaluation. In line with our a priori expectation that hospital, specialty, faculty, and resident characteristics could relate to the frequencies of positive comments and suggestions for improvement, and in keeping with our preference for mutually adjusted (possibly direct) associations, we entered the corresponding variables simultaneously into the GEE models for each outcome.
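The quartile coding described above can be sketched as follows. The cut points are those given in the text; the quartile labels, the function names, and the choice of the bottom quartile as the reference category are assumptions made for illustration only.

```python
QUARTILE_LEVELS = ["top", "upper_middle", "lower_middle", "bottom"]


def performance_quartile(mean_score):
    """Map a faculty member's mean SETQ score to the quartile bands in the text."""
    if mean_score >= 4.06:
        return "top"
    if mean_score >= 3.81:
        return "upper_middle"
    if mean_score >= 3.51:
        return "lower_middle"
    return "bottom"


def dummy_code(quartile, reference="bottom"):
    """Return one 0/1 indicator for each quartile other than the reference."""
    return {
        f"q_{level}": int(quartile == level)
        for level in QUARTILE_LEVELS
        if level != reference
    }


# Hypothetical faculty mean scores (not study data).
for score in (4.30, 3.90, 3.60, 3.20):
    quartile = performance_quartile(score)
    print(score, quartile, dummy_code(quartile))
```

With three indicators entered into a regression model, each coefficient compares one quartile with the omitted reference quartile, holding the other covariates fixed.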

Determinants of the amount of narrative feedback given by residents and received by faculty

To explore the relationships between various characteristics and the number of comments per evaluation (positive comments or suggestions for improvement), for our third goal, we developed four categories of potential determinants: hospital characteristics, specialty-related characteristics, faculty-related characteristics, and resident-related characteristics.

First, we grouped all hospitals into either academic (university-based hospitals) or nonacademic (university-affiliated hospitals or community-based teaching hospitals). Second, we categorized all specialties into either medical specialties (internal medicine and its subspecialties, pediatrics, dermatology, oncology, psychiatry, radiology, anesthesiology, pathology) or surgical specialties (surgery, orthopedic surgery, urology, obstetrics–gynecology, ophthalmology, otolaryngology, thoracic surgery, vascular surgery, neurosurgery).

Third, our faculty-related characteristics included years of work experience as a registered specialist (less than 10 years and 10 or more years [based on achieving similar group sizes]), gender, age, whether or not faculty had participated in teacher training, and the quartile into which their mean SETQ score fell. Fourth, resident-related characteristics consisted of training year and gender. To create two equal-sized groups, we dichotomized residents’ year-in-training into years 1–3 and years 4–6.

We then used the descriptive statistics of the potential determinants to examine the distributions of respondents’ demographic characteristics. Finally, we used the same mutually adjusted GEE models that we had used to study our second research objective to investigate the relationship between potential determinants and the number of positive comments and suggestions for improvement.

We conducted all analyses using SPSS version 18.0.1 (IBM SPSS Inc., Chicago, Illinois, 2009) and Microsoft Excel 2011 for Mac version 14.2.3 (Microsoft Corporation, Redmond, Washington, 2010).


Results

Frequency of positive comments and suggestions for improvement

In total, 659 residents (79%) completed 6,216 evaluations of 917 faculty members (95%), resulting in an average of 6.8 (standard deviation [SD] = 4.4) evaluations per individual faculty member. The interrater reliability of the number of narrative comments was 0.98 (P < .0001). Within the 6,216 evaluations, residents formulated a total of 11,574 positive comments and 4,870 suggestions for improvement. On average, each resident completed 9.4 evaluations (SD = 6.4) and provided 17.5 positive comments (interquartile range [IQR]: 4–26) and 7.4 suggestions for improvement (IQR: 1–11). On average, each faculty member received 12.6 positive comments (IQR: 5–18) and 5.3 suggestions for improvement (IQR: 2–8).
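Several of the response rates and averages above follow directly from the reported totals. A minimal arithmetic check, using only figures stated in this report (the invitation counts come from the description of the study population), might look like this; the variable names are illustrative.

```python
# Totals as stated in the text.
residents_invited, residents_responding = 839, 659
faculty_invited, faculty_evaluated = 964, 917
evaluations = 6216
positive_comments, suggestions = 11574, 4870

summary = {
    "resident_response_pct": round(100 * residents_responding / residents_invited),
    "faculty_evaluated_pct": round(100 * faculty_evaluated / faculty_invited),
    "evaluations_per_faculty": round(evaluations / faculty_evaluated, 1),
    "evaluations_per_resident": round(evaluations / residents_responding, 1),
    "positive_per_faculty": round(positive_comments / faculty_evaluated, 1),
    "suggestions_per_faculty": round(suggestions / faculty_evaluated, 1),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```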

Table 1 provides demographic information about the setting and participants as well as details on the frequency of both the positive comments and the suggestions for improvement that residents gave to faculty. Table 2 presents the results of the univariate analysis (Mann–Whitney U test), showing the differences in the number of positive comments and the suggestions for improvement for hospital, specialty, faculty, and resident characteristics.

Table 1: Characteristics of Evaluations, Faculty, and Residents in a Study of the Nature and Frequency of Narrative Feedback Provided by Residents and Received by Faculty, 2008–2010

Table 2: Univariate Analysis of Differences Between Hospitals, Specialties, Faculty,* and Residents in the Number of Positive Comments and Suggestions for Improvement

Teaching performance and its association with the frequency of narrative feedback

Through our FDR-adjusted univariate analysis, displayed in Table 2, we found that higher teaching performance scores were associated with a higher frequency of positive comments and a lower frequency of suggestions for improvement. Compared with faculty in the bottom quartile of teaching performance, faculty in the top quartile received four more positive comments (P = .0002) and four fewer suggestions for improvement (P = .0002). Also, as shown in Tables 3 and 4, our multivariable analysis, adjusted for hospital, specialty, faculty, and resident characteristics, indicated that a higher teaching performance score was associated with a higher frequency of positive comments (regression coefficient 0.538; 95% confidence interval [CI]: 0.464 to 0.613; P < .0001; Table 3) and a lower frequency of suggestions for improvement (−0.802; 95% CI: −0.911 to −0.692; P < .0001; Table 4).
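For readers who prefer multiplicative effects, the coefficients above can be exponentiated. The sketch below assumes that the coefficients are reported on the log scale, as is conventional for Poisson regression with a log link; the text does not state the link function explicitly, so the resulting rate ratios should be read as an illustration under that assumption. The function name is hypothetical.

```python
import math


def to_rate_ratio(coef, ci_low, ci_high):
    """Exponentiate a log-scale coefficient and its confidence limits."""
    return math.exp(coef), math.exp(ci_low), math.exp(ci_high)


# Coefficients and 95% confidence limits as reported in the text.
positive = to_rate_ratio(0.538, 0.464, 0.613)
fewer_suggestions = to_rate_ratio(-0.802, -0.911, -0.692)

for label, (ratio, low, high) in (
    ("positive comments", positive),
    ("suggestions for improvement", fewer_suggestions),
):
    print(f"{label}: rate ratio {ratio:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Under this assumption, a ratio above 1 indicates more comments per evaluation relative to the comparison group defined in the corresponding table, and a ratio below 1 indicates fewer.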

Table 3: Multivariable Analysis* of the Associations of Hospital, Resident, and Faculty Characteristics With the Number of Positive Comments

Table 4: Multivariable Analysis* of the Associations of Hospital, Resident, and Faculty Characteristics With the Number of Suggestions for Improvement

Other determinants of the amount of narrative feedback given by residents and received by faculty

Tables 3 and 4 show the results of the multivariable adjusted analysis for, respectively, positive comments and suggestions for improvement. Working in a nonacademic hospital, participating in teacher training, and being evaluated by female residents were positively associated with receiving more positive comments and more suggestions for improvement.

Working in a surgical specialty, being a female faculty member, and increasing years of work experience were associated with a higher frequency of positive comments (Table 3). Being a male faculty member was associated with a higher frequency of suggestions for improvement (Table 4).


Discussion

Main findings

We garnered a wide overview of the use of narrative comments in this multicenter quantitative study of the frequency and determinants of residents’ narrative feedback regarding the teaching performance of their clinical faculty. First, we found that residents did, in fact, frequently provide narrative feedback within the context of formative evaluation of faculty members’ teaching performances. Second, our findings showed that the faculty members who received high teaching performance scores were significantly more likely than their peers with low teaching performance scores to receive positive comments and that, similarly, the faculty members who received lower teaching performance scores were significantly more likely than those with high teaching performance scores to receive suggestions for improvement. In other words, residents elaborated on the scores they gave to individual faculty members, and faculty members received individualized explanations in line with the ratings they received from residents.

Explanations for and interpretations of findings

First, our finding that residents provided more positive comments than suggestions for improvement is in line with previous research in undergraduate medical education.22 Although this tendency may suggest that “critiquing your boss” is difficult for residents when providing feedback to their clinical faculty, we found that residents were also generous in providing suggestions for improvement. Of the 6,216 evaluations in our study, 48% (n = 2,971) contained one or more suggestions for improvement. Our findings contrast with those of an earlier study reporting that only 17% of evaluations contained suggestions for improvement.15 Importantly, the residents in this previous (2011) study may have evaluated as many as 50 faculty members annually.15 In comparison, the residents in our study completed, on average, nine evaluations in one year, and this relatively low burden could have enabled them to provide more narrative feedback per evaluation. One study of residents’ perspectives on clinical teaching assessments reported perceived evaluation burden as one of three main themes.23 Therefore, system characteristics may have caused the difference between our findings and those of the 2011 study, but other unknown factors and/or the cultural context (the previous study occurred in Canada and ours in the Netherlands) should also be taken into account.

Second, our findings confirm our hypotheses regarding the association between high and low overall teaching performance score and, respectively, more numerous positive comments and suggestions for improvements. This finding aligns with a UK study of free-text comments from faculty on colleagues’ overall performances, which reported that higher mean scores correlated with a greater number of positive comments.24 Research on assessments of medical students22 and trainee doctors25 showed similar patterns.

Furthermore, we determined which variables (those related to hospital, specialty, faculty, and resident characteristics) were associated with the number of positive comments and the number of suggestions for improvement. Two remarkable findings bear further explanation. First, hospital type was associated with the number of narrative comments: Faculty in nonacademic hospitals received more narrative comments than their peers in academic health centers. We cannot provide a straightforward explanation for this difference because hospital type is itself a variable that contains many other variables within it. We therefore hypothesize that aspects we did not consider in our analysis, such as the organization of the hospitals, their residency training programs, their learning climates, their feedback culture, and the characteristics of their residents and faculty members, could explain this finding. Moreover, other studies have demonstrated the difficulty of identifying the characteristics that could clarify the impact of hospital type on patient outcomes.26,27

Second, the multivariable analysis showed that teacher training determined the number of positive comments, but, interestingly, the univariate analysis indicated that teacher training was not associated with a significant absolute difference in the amount of narrative feedback received. A possible explanation for this divergence could be that completing teacher training is associated with having higher SETQ scores,28 and having higher SETQ scores is, in turn (according to our results) associated with more narrative feedback such that teacher training has an indirect impact on the number of positive comments faculty members receive. In other words, teacher training may indirectly influence the amount of narrative feedback faculty receive from residents through its effect on teaching performance scores.

Strengths and limitations

The multicenter, multispecialty design of this study, the large sample size, and the high response rate increase the external validity of our study. Another important strength of this study is the relatively low number of evaluations required from residents per year. Furthermore, the emphasis on preserving the anonymity of the residents’ feedback reduces the likelihood of socially desirable answers. Finally, we achieved a high level of interrater reliability during our initial count of the positive comments and suggestions for improvement per evaluation.

A limitation of our study could be that it was conducted within the context of residency training in the Netherlands. Although investigators have found similar results in U.S.-based and UK-based studies,22,24 dissimilar results were found in a Canadian study,15 and extrapolating our findings to settings outside the Netherlands warrants further investigation. Finally, the dichotomization of narrative feedback into either positive comments or suggestions for improvement may not have done justice to the nuances that often exist in narrative feedback. Although there were no comments that we could not code within either of these two categories, future research could take a truly qualitative approach to investigate the content of narrative feedback in order to clarify the richness of the data.

Implications for future research and practice

Although this study, along with previous studies, adds to the evidence supporting the feasibility of the SETQ system and its teaching performance measurement instruments,6–9 further research is necessary to investigate the effect of feedback on the teaching performance of an individual faculty member. To this end, a next step could be a content analysis of the narrative feedback that focuses on which skills or attitudes faculty should improve in order to be perceived as good teachers by residents. Myers and colleagues15 analyzed written comments provided by residents to faculty and found that residents often expressed their need for “more” teaching. Therefore, it is also important to investigate whether narrative comments have the characteristics of effective feedback, such as specificity, a direct relationship to a behavior, and information that is actionable.29

The finding that faculty with low teaching performance scores received more suggestions for improvement provides a practical implication. Faculty members can read the narrative comments they have received to find individualized elaborations from residents on their teaching performance and, in turn, use these comments to further improve their teaching skills.


Conclusions

The high frequency of narrative feedback provided by residents to faculty demonstrates the feasibility of a system for evaluation of individual teaching performance of faculty members. Furthermore, this study shows that faculty who received higher teaching performance scores also tended to receive more positive comments and fewer suggestions for improvement and that faculty with a lower teaching performance score received more suggestions for improvement. Thus, faculty would be wise to attend to the narrative feedback they receive because it offers invaluable individualized insights that could shape and improve their teaching performance.

Acknowledgments: The authors would like to thank all faculty and residents who generously participated in our study. They also thank for developing and maintaining the Web application. Finally, they thank the Heusden Crew for their splendid social support while they worked on this report.


References

1. Dolmans DH, Wolfhagen IH, Essed GG, Scherpbier AJ, van der Vleuten CP. The impacts of supervision, patient mix, and numbers of students on the effectiveness of clinical rotations. Acad Med. 2002;77:332–335
2. Hore CT, Lancashire W, Fassett RG. Clinical supervision by consultants in teaching hospitals. Med J Aust. 2009;191:220–222
3. Wilkerson L, Irby DM. Strategies for improving teaching practices: A comprehensive approach to faculty development. Acad Med. 1998;73:387–396
4. Westerman M, Teunissen PW, van der Vleuten CP, et al. Understanding the transition from resident to attending physician: A transdisciplinary, qualitative study. Acad Med. 2010;85:1914–1919
5. Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971–977
6. Lombarts KM, Bucx MJ, Arah OA. Development of a system for the evaluation of the teaching qualities of anesthesiology faculty. Anesthesiology. 2009;111:709–716
7. van der Leeuw R, Lombarts K, Heineman MJ, Arah O. Systematic evaluation of the teaching qualities of obstetrics and gynecology faculty: Reliability and validity of the SETQ tools. PLoS One. 2011;6:e19142
8. Arah OA, Hoekstra JB, Bos AP, Lombarts KM. New tools for systematic evaluation of teaching qualities of medical faculty: Results of an ongoing multi-center survey. PLoS One. 2011;6:e25983
9. Boerebach BC, Arah OA, Busch OR, Lombarts KM. Reliable and valid tools for measuring surgeons’ teaching performance: Residents’ vs. self evaluation. J Surg Educ. 2012;69:511–520
10. Archer JC. State of the science in health professional education: Effective feedback. Med Educ. 2010;44:101–108
11. Hattie J, Timperley H. The power of feedback. Rev Educ Res. 2007;77:81–112
12. Shute VJ. Focus on formative feedback. Rev Educ Res. 2008;78:153–189
13. Burford B, Illing J, Kergon C, Morrow G, Livingston M. User perceptions of multi-source feedback tools for junior doctors. Med Educ. 2010;44:165–176
14. van der Vleuten CP, Schuwirth LW, Scheele F, Driessen EW, Hodges B. The assessment of professional competence: Building blocks for theory development. Best Pract Res Clin Obstet Gynaecol. 2010;24:703–719
15. Myers KA, Zibrowski EM, Lingard L. A mixed-methods analysis of residents’ written comments regarding their clinical supervisors. Acad Med. 2011;86(10 suppl):S21–S24
16. Lombarts MJ, Arah OA, Busch OR, Heineman MJ. Using the SETQ system to evaluate and improve teaching qualities of clinical teachers [in Dutch]. Ned Tijdschr Geneeskd. 2010;154:A1222
17. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J Royal Stat Soc Series B. 1995;57:289–300
18. Benjamini Y, Hochberg Y. On the adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat. 2000;25:60–83
19. Benjamini Y, Krieger A, Yekutieli D. Adaptive linear step-up procedures that control the false discovery rate. Biometrika. 2006;93:491–507
20. Storey J. The positive false discovery rate: A Bayesian interpretation and the q-value. Ann Stat. 2003;31:2013–2035
21. Hanley JA, Negassa A, Edwardes MD, Forrester JE. Statistical analysis of correlated data using generalized estimating equations: An orientation. Am J Epidemiol. 2003;157:364–375
22. Frohna A, Stern D. The nature of qualitative comments in evaluating professionalism. Med Educ. 2005;39:763–768
23. Myers K, Zibrowski EM, Lingard L. Engaged at the extremes: Residents’ perspectives on clinical teaching assessment. Acad Med. 2012;87:1397–1400
24. Richards SH, Campbell JL, Walshaw E, Dickens A, Greco M. A multi-method analysis of free-text comments from the UK General Medical Council Colleague Questionnaires. Med Educ. 2009;43:757–766
25. Bullock AD, Hassell A, Markham WA, Wall DW, Whitehouse AB. How ratings vary by staff group in multi-source feedback assessment of junior doctors. Med Educ. 2009;43:516–520
26. Kohn GP, Galanko JA, Meyers MO, Feins RH, Farrell TM. National trends in esophageal surgery—are outcomes as good as we believe? J Gastrointest Surg. 2009;13:1900–1910
27. Lee SL, Yaghoubian A, de Virgilio C. A multi-institutional comparison of pediatric appendicitis outcomes between teaching and nonteaching hospitals. J Surg Educ. 2011;68:6–9
28. Arah OA, Heineman MJ, Lombarts KM. Factors influencing residents’ evaluations of clinical faculty member teaching qualities and role model status. Med Educ. 2012;46:381–389
29. Canavan C, Holtman MC, Richmond M, Katsufrakis PJ. The quality of written comments on professional behaviors in a developmental multisource feedback program. Acad Med. 2010;85(10 suppl):S106–S109
© 2013 by the Association of American Medical Colleges