Judgments about the clinical performance of residents and medical students are based primarily on subjective ratings of performance. The quality of decisions made using this information depends on the extent to which these ratings accurately reflect performance and clinical competence. This topic has been the focus of much research and deliberation over the years.1–4 In 1987, Tonesk and Buchanan5 reported the results of a survey regarding clinical evaluation conducted under the auspices of the Association of American Medical Colleges (AAMC). They reported that one of the most commonly reported clinical evaluation problems was unwillingness to record negative performance evaluations. A second problem was unwillingness to act on negative evaluations. Magarian and Mazur6,7 conducted a nationwide survey of U.S. internal medicine clerkship directors and found that 36% of clerkship directors could not recall having ever given a failing grade in their clerkship. Many studies have documented uniformly high clinical performance ratings and low failure rates over the years.8 These findings are not unique to medicine. In a review chapter, Tesser and Rosen9 described research demonstrating a bias among communicators to transmit messages that are pleasant for recipients to receive and to avoid transmitting unpleasant messages. They reviewed a number of studies that suggest good news is communicated more frequently, more quickly and more fully than bad news. Harris et al.10 had supervisors rate employees to validate a performance test. The supervisors knew that the information would not be provided to the employees and would not harm the employees’ promotion and salary increases in any way. They found that under these circumstances, almost 7% of employees were rated in need of improvement. 
When the same supervisors rated the same employees using the same instruments during annual evaluations where the information would be shared with the employees and would have economic consequences for those employees, the supervisors rated only slightly more than 1% of the employees as in need of improvement. Similarly, Waldman and Thornton11 found that ratings used for promotion and salary increases and those that were to be shared with employees were more lenient than confidential ratings of the same employees. Finally, Bretz et al.12 found that 60% to 70% of the workforce in 3,587 business and industrial organizations received ratings in the top two performance levels on the organization's performance rating scale.
Speer et al.13 reported on efforts to modify the clinical rating process in an internal medicine clerkship with a goal of reducing grade inflation. In the original grading system, the average grade assigned to students by preceptors was an “A.” The authors documented that preceptors believed the “average” student should receive a grade of “B.” The authors introduced a more behaviorally anchored definition for each grade and assigned final grades through an “objective group review process.” This revised system resulted in students’ receiving an average grade of “B+,” a grade that is more consistent with preceptor expectations. Because two interventions were introduced together, it is not possible to establish the relative contributions of the behavioral anchors and the group review process to this improvement. A study by Hemmer et al.14 is instructive in this regard. These investigators found that a group review process significantly improved the detection of unprofessional behavior in a medicine clerkship. In almost 25% of the cases, the group review process was the only evaluation method that detected instances of unprofessional behavior. No individual rater noted the unprofessional behavior in written comments or reflected poor performance in his or her individual ratings of the students.
The purpose of our study was to use a research method similar to that of Hemmer et al. to determine whether individual attendings’ post-rotation performance ratings and written comments detected deficits in surgery residents’ clinical performance.
Performance Evaluation Instruments
We collected evaluations from all attending physicians in the Department of Surgery, Southern Illinois University (SIU) School of Medicine at the end of each four- to six-week resident rotation for the years 1997–2002. The 1997–2001 evaluation form contained a seven-point scale (truly exceptional, outstanding, very good, good, adequate, significant deficits, severe deficits) for evaluating overall clinical performance as well as a three-point scale (excellent, satisfactory, marginal) for each of four specific performance areas: applied knowledge, clinical performance, technical skills, and professional behavior. A space for comments was provided after each rating item. In 2002 the clinical evaluation form was changed to a five-point scale (excellent, very good, good, fair, poor) for rating overall performance and for evaluating two performance areas: clinical performance and professional behavior. For the purposes of our study, the five-point scale was translated to the seven-point scale (excellent = outstanding, very good = very good, good = good, fair = adequate, and poor = significant deficits).
Residents work with multiple attending physicians for varying lengths of time during each rotation. The extent of exposure to any one attending physician is not recorded on the end-of-rotation evaluation form.
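The scale translation described above can be sketched as a simple lookup. This is a minimal illustration, not the authors' procedure: the label pairs come from the text, while the names `FIVE_TO_SEVEN`, `SEVEN_POINT_SCALE`, `translate_rating`, and `unreachable_levels` are hypothetical.

```python
# Mapping stated in the text: 2002 five-point label -> 1997-2001 seven-point label.
FIVE_TO_SEVEN = {
    "excellent": "outstanding",
    "very good": "very good",
    "good": "good",
    "fair": "adequate",
    "poor": "significant deficits",
}

# Seven-point scale in the order listed in the text (best to worst).
SEVEN_POINT_SCALE = [
    "truly exceptional",
    "outstanding",
    "very good",
    "good",
    "adequate",
    "significant deficits",
    "severe deficits",
]


def translate_rating(label_2002):
    """Translate a 2002 five-point overall rating to the seven-point scale."""
    key = label_2002.strip().lower()
    if key not in FIVE_TO_SEVEN:
        raise ValueError(f"Unknown five-point rating: {label_2002!r}")
    return FIVE_TO_SEVEN[key]


def unreachable_levels():
    """Seven-point levels that no five-point label maps onto."""
    mapped = set(FIVE_TO_SEVEN.values())
    return [level for level in SEVEN_POINT_SCALE if level not in mapped]


if __name__ == "__main__":
    print(translate_rating("fair"))
    print(unreachable_levels())
```

One consequence of this mapping is that a 2002 form can never produce the two extreme seven-point levels, “truly exceptional” and “severe deficits.”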
In the spring of each year, all attending physicians meet to evaluate the clinical performance of residents and make a decision regarding the resident's academic progress. Typically 17 to 25 attending physicians attend the meeting, which is conducted by the program director for general surgery. Attending physicians are provided a book containing a summary of all performance data for each resident, including end-of-rotation ratings in each performance area and written comments offered by attending physicians in their evaluations of residents after each rotation. The evaluation portfolio also includes performance data for the American Board of Surgery In-Training Examination (ABSITE), oral examination, conference attendance, medical records management, teaching performance, and performance summaries written by faculty advisors semiannually.
Chief residents participate in the evaluation of all residents other than themselves. The resident's advisor begins the discussion by summarizing the resident's progress and offers one of five recommendations: advancement with statement of exemplary performance, advancement with statement of deficiencies to be improved, advancement with one year of probation, no advancement with one year of probation, or unsatisfactory performance and dismissal from program. The attending physicians at the meeting discuss the resident's performance and the recommendation. After the discussion, group consensus is reached regarding a progress decision for the resident as well as the deficiencies to be documented and the plan for remediating any deficits. For purposes of our study, any resident who received a recommendation other than advancement with statement of exemplary performance was considered to have received a negative progress decision.
Research Design and Data Analysis
We obtained Institutional Review Board approval for our study. Using a retrospective cohort study design, we reviewed the clinical performance records for all surgery residents in the SIU General Surgery Residency program for the years 1997–2002. We focused on residents who were determined by the resident evaluation committee to have clinical performance or professional behavior deficits that were serious enough to require formal remediation. We excluded from our study any resident whose deficit was identified solely on the basis of ABSITE scores. No resident received a negative end-of-year decision based on conference attendance, medical records management, or teaching performance (other characteristics that faculty might not know about when filling out end-of-rotation evaluation forms). Using quantitative and qualitative methods, we abstracted the records for all residents. Records included evaluation forms completed by an individual attending physician after each rotation, written records reflecting attendings’ discussion of an individual resident's performance during annual evaluation meetings, and written records reflecting the final decision regarding the resident's progress for the year and recommended remediation, if any. In our study, we focused on determining whether individual faculty's post-rotation performance ratings or written comments were useful indicators (predictors) of the final decisions made by consensus about a resident's performance and progress.
First, we determined individual faculty members’ post-rotation ratings of residents’ overall clinical performance. We then collected individual faculty's post-rotation ratings of specific skills. Finally, we collected and analyzed faculty's post-rotation comments about skills. The results for residents identified as having performance deficits were compared with results for residents who were determined not to have performance deficits.
Our analysis sought to answer the following research questions:
1. Do independent post-rotation overall performance ratings by attending physicians reflect resident performance deficits as established during the end-of-year evaluation meeting?
2. Do independent post-rotation attending ratings of specific skills reflect these deficits?
3. Do comments written by attending physicians about residents on post-rotation evaluation forms reflect these deficits?
During annual evaluation meetings from 1997 to 2002, the faculty identified 30 residents as having deficiencies needing remediation. Of these residents, two had deficiencies noted in more than one area: one resident had two deficiencies and one resident had three. One resident was placed on probation and repeated the year, and three were placed on probation but advanced. The remaining 26 residents were advanced without probation but with deficiencies that required remediation. Thus, the prevalence of residents with deficiencies requiring remediation across these years was 28%, ranging from 21% in both 1998 and 1999 to 42% in 2001.
An average of 18 evaluations were submitted for each resident each year (range, seven to 40). Only 13 out of 1,986 evaluations (0.7%) nominally indicated a deficit. Seventy-one percent of individual global ratings were either outstanding or very good.
Residents with positive end-of-year decisions (no deficiencies) received an average of 19 evaluations per year (95% confidence interval [CI], 17–20; range, seven to 40 evaluations) while those with negative end-of-year decisions (deficiencies requiring remediation) were evaluated 17 times per year (95% CI, 15–18; range, nine to 29). An average of 14 comments were written (95% CI, 13–16; range, six to 29) for residents with positive end-of-year decisions while residents with negative end-of-year decisions (deficiencies present) received an average of 13 written comments (95% CI, 11–14; range, six to 23).
Figure 1 shows the overall ratings assigned to residents with negative end-of-year progress decisions and those for residents with positive end-of-year progress decisions. The figure shows that no resident received a post-rotation evaluation indicating severe deficits. Residents with positive end-of-year decisions were more likely to receive “truly exceptional” or “outstanding” post-rotation ratings than were their counterparts who received negative end-of-year decisions. On the other hand, residents who received negative end-of-year decisions were more likely to receive “good,” “adequate,” or “significant deficit” post-rotation ratings. However, some residents who received negative end-of-year decisions did receive “truly exceptional” post-rotation ratings and almost 20% received “outstanding” post-rotation ratings. As mentioned previously, only 13 post-rotation evaluations carried a rating that nominally pointed to performance deficits (an overall rating of “significant deficit”). Nine of these ratings were for residents who received negative end-of-year progress decisions and four were for residents who received positive end-of-year progress decisions. The positive predictive value was calculated and indicated that 46% of the ratings of “good” or below were assigned to residents who received negative end-of-year progress decisions.
Table 1 shows the nature of performance deficits identified during the evaluation committee meeting and indicates the number and percentages of attendings’ post-rotation ratings and written comments that indicated this deficiency. The table also shows ratings and written comments that contradicted the presence of a deficit in this area. From 1997 to 2002, 12 of the noted deficits were in the technical skills area, ten were in the applied knowledge area (knowledge applied to patient care), and 11 were in the area of professional behavior.
As shown in Table 1, 4% of the individual ratings of technical skills were “marginal” (the lowest rating) for those individuals ultimately determined to have a deficit in this area. Twenty-three percent of the technical skills ratings for these individuals were excellent, seemingly contradicting the presence of a deficit in the area. Similar but even more extreme results were documented for the areas of applied knowledge and professional behavior.
Eighteen percent of the residents determined by the end-of-year progress committee to have some deficiency requiring remediation received no post-rotation performance rating indicating that specific deficiency from any attending throughout the year.
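The arithmetic behind a positive predictive value can be illustrated with the only counts given in the text: 13 “significant deficit” ratings among 1,986 evaluations, nine of them for residents with negative decisions and four for residents with positive decisions. The sketch below is an illustration under that narrower cutoff, not a reproduction of the reported 46%, which uses a cutoff of “good” or below whose underlying counts are not given here. The function name `positive_predictive_value` is hypothetical.

```python
def positive_predictive_value(true_positives, false_positives):
    """Share of flagged ratings that belong to residents with a negative decision."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("No flagged ratings; PPV is undefined.")
    return true_positives / flagged


# Counts taken from the text for the "significant deficit" cutoff only.
total_evaluations = 1986
flagged_negative_decision = 9   # flagged ratings, resident had a negative decision
flagged_positive_decision = 4   # flagged ratings, resident had a positive decision

flag_rate = (flagged_negative_decision + flagged_positive_decision) / total_evaluations
ppv = positive_predictive_value(flagged_negative_decision, flagged_positive_decision)

if __name__ == "__main__":
    print(f"Share of evaluations flagged: {flag_rate:.1%}")  # 0.7%
    print(f"PPV at 'significant deficit' cutoff: {ppv:.1%}")  # 69.2%
```

Moving the cutoff from “significant deficit” up to “good” flags many more ratings, so the predictive value at that wider cutoff can differ substantially from the one computed here.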
Written comments did a better job of detecting deficits, especially in the area of technical skills. Twenty percent of evaluations had written comments noting technical skills deficits for those ultimately identified as having these deficits, and only 7% of the evaluations had written comments contradicting this assessment. The results for applied knowledge and professional behavior were not as promising. Again, residents identified as having deficits in these areas had more individual written post-rotation comments contradicting these deficits than supporting them.
A large percentage of deficiencies became apparent only when the attending physicians came together to discuss performance at the annual evaluation meeting, possibly for one of three reasons: (1) the annual evaluation meeting allowed for triangulation on a resident's performance, revealing a pattern of behavior that had not been apparent to individual attendings; (2) the annual evaluation meeting provided evidence that strengthened individual attendings’ preexisting convictions about residents’ performance deficiencies, leading to a collective judgment more stringent than that of individual raters; or (3) the annual evaluation meeting led to erroneous conclusions about deficiencies. After attending these sessions and analyzing the substance of the discussions and the supporting evidence, we concluded that the deficiencies identified were real and that the true explanation for our findings is some combination of reasons one and two.
Hemmer et al.14 found that face-to-face group evaluation sessions significantly increased the detection of one class of performance problems among medical students. The results reported by Speer et al.13 also suggested that there are benefits to face-to-face formal group evaluation sessions for making decisions about the progress of medical students. Our results parallel those of Hemmer et al. Further, our results extend those of Hemmer et al. and Speer et al. from the domain of medical students to that of residency training and from the domain of internal medicine to surgery. Finally, our results broaden the range of competencies evaluated when compared to the study by Hemmer et al.
Some individuals have expressed concern that making progress decisions using a face-to-face, collective decision-making process (i.e., via an evaluation committee) may compromise validity by allowing a single outspoken individual to unduly sway the decision of the committee. Williams et al.8 completed an extensive review of the clinical performance appraisal literature and reported that they found no research addressing this question. Our findings, together with those of Hemmer et al. and Speer et al., suggest some major benefits associated with a group approach to making decisions about residents’ and students’ progress. These results suggest the importance of studying the dynamics of evaluation meetings for making decisions about clinical performance progress to determine whether there are major weaknesses associated with this approach. Barring the presence of such weaknesses, our results argue for increased use of a committee approach to making resident progress decisions.
1. Kwolek CJ, Donnelly MB, Sloan DA, Birrell SN, Strodel WE, Schwartz RW. Ward evaluations: should they be abandoned? J Surg Res. 1997;69:1–6.
2. Yao DC, Wright SM. National survey of internal medicine residency program directors regarding problem residents. JAMA. 2000;284:1099–104.
3. Day RP, Hewson MG, Kindy P Jr, Van Kirk J. Evaluation of resident performance in an outpatient internal medicine clinic using standardized patients. J Gen Intern Med. 1993;8:193–8.
4. Scheuneman AL, Carley JP, Baker WH. Residency evaluations. Are they worth the effort? Arch Surg. 1994;129:1067–73.
5. Tonesk X, Buchanan RG. An AAMC pilot study by 10 medical schools of clinical evaluation of students. J Med Educ. 1987;62:707–18.
6. Magarian GJ, Mazur DJ. A national survey of grading systems used in medicine clerkships. Acad Med. 1990;65:636–9.
7. Magarian GJ, Mazur DJ. Evaluation of students in medicine clerkships. Acad Med. 1990;65:341–5.
8. Williams RG, Klamen DA, McGaghie WC. Cognitive, social and environmental sources of bias in clinical competence ratings. Teach Learn Med. 2003;15:270–92.
9. Tesser A, Rosen S. The reluctance to transmit bad news. In: Berkowitz L (ed). Advances in Experimental Social Psychology, Vol. 8. New York: Academic Press, 1975;193–232.
10. Harris MM, Smith DE, Champagne D. A field study of performance appraisal purpose: research- versus administrative-based ratings. Personnel Psychol. 1995;48:151–60.
11. Waldman DA, Thornton GC. A field study of rating conditions and leniency in performance appraisal. Psychol Rep. 1988;63:835–40.
12. Bretz RD, Milkovich GT, Read W. The current state of performance appraisal research and practice: concerns, directions, and implications. J Manag. 1992;18:321–52.
13. Speer AJ, Solomon DJ, Ainsworth MA. An innovative evaluation method in an internal medicine clerkship. Acad Med. 1996;71:S76–8.
14. Hemmer PA, Hawkins R, Jackson JL, Pangaro LN. Assessing how well three evaluation methods detect deficiencies in medical students’ professionalism in two settings of an internal medicine clerkship. Acad Med. 2000;75:167–73.