In-training evaluation is an essential method for assessing trainee clinical competence and is commonly documented on an in-training evaluation report (ITER).1 ITERs are also referred to as, among other terms, clinical performance reports, performance assessment forms, clinical performance progress reports, and end-of-clinical-rotation reports. They usually consist of a list of items on a checklist or rating scale and written comments. Completion of an ITER is thus an educational task frequently requested of faculty supervising medical students and residents on clinical rotations. Unfortunately, there is evidence to suggest that the final assessment (i.e., pass versus fail) written on the ITER is not always consistent with the evaluator’s judgment of a trainee’s performance, especially for the poorly performing resident.2–4
Several authors, including the Advisory Committee on Educational Outcome Assessment,5 have proposed that assessor training is a key component in addressing the problem of quality assessments in residency programs, with some suggesting that rater training may be the “missing link” in improving assessment quality.6 Clinical supervisors have also indicated that they want faculty development (FD) programs to help them improve their ability to complete evaluation reports.7 It seems logical that rater training would improve report quality. However, there is remarkable controversy regarding the effectiveness of FD for improving rater-based assessment. Although there is some evidence to suggest that training can improve the quality of faculty members’ assessments,8,9 evidence also exists to suggest that such training is largely ineffective,10,11 leading several authors to suggest that faculty might be largely untrainable in this regard.10–12
Recently, Dudek et al13 reported on a workshop designed to improve ITER quality. They demonstrated an improvement in the quality of completed ITERs following workshop participation, adding to the literature supporting rater training as an effective means of improving the assessments provided by faculty at the ends of rotations. However, there are known difficulties with recruitment for FD workshops,14 and these difficulties were borne out in Dudek and colleagues’ study, where there were significant challenges in participant recruitment.
We wanted to develop an FD program to improve completed ITER quality that would provide content similar to that offered in the workshop but be more appealing for clinical supervisors to participate in. Three key changes were proposed: (1) provide an “at-home” program, (2) incorporate a feedback component, and (3) have a control group, as the previous workshop study lacked this key element. The addition of a control group is rare in FD studies but can provide useful information about the value of a program.15
Feedback has been included as an integral element of all learning theories, including behavioral, cognitive, and social constructivist learning theories.16,17 Feedback can be provided to an individual in the form of an assessment, such as a rating of faculty teaching performance. This assessment provides the individual with information about the quality of his or her performance. Previous studies have demonstrated that physicians have improved their performance in response to assessments in both educational and clinical practice settings.18,19 It has also been suggested that change can be further enhanced when recipients are guided in their interpretation of the feedback.16,17
Acceptance of feedback is influenced by a recipient’s belief that the feedback is (1) credible, (2) accurate, (3) offered in a nonthreatening manner, and (4) from someone whom the recipient trusts.16 Feedback, when compared with a known standard, has been shown to have a significant positive effect on performance.20 What is less clear is how often feedback should be given and in what format. On the basis of this body of evidence, we incorporated various feedback components into our “at-home” FD program, where we altered the content and frequency of the feedback delivery.
The objectives of this particular study were to continue to add to the conversation regarding the value of FD for improving rater-based assessments and to evaluate the impact of a different style of FD program, incorporating feedback, that would be less onerous for a faculty member to participate in.
We recruited 98 participants from four Canadian medical schools (school one = 35, school two = 41, school three = 4, school four = 18) during the 14-month enrollment period, January 2009 to February 2010. We estimated that with five groups (which included a control group) and 30 participants per group (for a total of 150 participants), it would be possible to find a significant main effect of group as small as a four-point difference between the highest- and lowest-scoring groups. We also expected that some participants would drop out of the study over time; therefore, our goal was to recruit 240 participants in total (60 participants per site).
We invited physicians who supervise and assess medical trainees on clinical rotations to participate. Recruitment occurred through various means: e-mails to all teaching faculty, FD office Web site announcements, e-mails to all program directors, letters to all program directors, and presentation of the FD program opportunity at multiple program director meetings. We invited program directors to participate and advertise the program to their faculty members. Clinical supervisors who agreed to complete all segments of the program, including the pre–post evaluations, were enrolled. Only supervisors using ITER forms that fit the design criteria for use of the completed clinical evaluation report rating (CCERR, a validated tool described below) could be included.
The CCERR provides a reliable rating of the quality of ITERs completed by clinical supervisors. Nine items are each rated on five-point partially anchored scales (where a score of 1 is described with the anchor “not at all,” 3 with the anchor “acceptable,” and 5 with the anchor “exemplary”), resulting in a total score that ranges from 9 to 45. A full description of the tool, its development, and its validation has been published.21 In brief, the CCERR was developed using a focus group to determine key features of high-quality completed ITERs. The features were used to create the CCERR, which was pilot-tested locally, analyzed, modified, and then tested on a national level. In the national field test, the reliability of the CCERR was 0.82. Evidence for validity was demonstrated by the CCERR’s ability to discriminate between groups of completed ITERs previously judged by experts to be of high, average, and poor quality. The CCERR can be used on any style of ITER form, provided that it has a list of items to be evaluated on a checklist or rating scale and a space for comments. By using the CCERR, we are able to evaluate the effectiveness of our FD program at the Kirkpatrick III (change in behavior) level.22
An “at-home” FD program was developed whereby participants did not need to attend any type of “in-person” session during the program. Rather, participants received different types of feedback (described below) regarding the quality of their submitted ITERs. We hypothesized that participants would use this information to improve their ITERs’ quality. Feedback was provided to the participants in two ways. First, some participants received feedback on their performance in the form of a copy of the CCERR and their mean scores on each item. Second, some participants received the CCERR information plus additional feedback in the form of a “feedback guide.” The feedback guide sensitized participants to varying ratings on the CCERR. Specific items on the CCERR were identified as areas for improvement and/or maintenance of the participant’s strong performance using a standard set of feedback responses. The guide also provided participants with more detailed information about how to improve these identified areas with the goal of improving their overall ITER quality. This content was adapted from the previously developed workshop, which demonstrated ITER improvement.13 We then assessed the impact of repeated feedback by offering the two forms of feedback to some of the participants at subsequent study time points.
Participants were randomly assigned to one of five groups, as outlined in Table 1. Table 1 also provides the study time line. Participants were asked to submit three recently completed ITERs to be evaluated using the CCERR at the beginning of the study (Time 1 = T1). This number was chosen because we felt that it would be representative of the ITERs that they had recently completed. We also needed to be practical and acknowledge that supervisors in different programs may evaluate vastly different numbers of trainees. We felt that the majority of potential participants would have completed three ITERs in the six months prior to the study onset. All ITERs in this study were blinded for trainee, supervisor, medical school, and study group and scored by two trained research assistants. Previously, we demonstrated that research assistants could be trained to reliably use the CCERR in a manner consistent with physician raters23 and that two trained research assistant raters are required to achieve a reliability of >0.8.23 Therefore, on the basis of this result, we trained two research assistants to evaluate each of the submitted ITERs.
After the ITERs were evaluated using the CCERR, participants received feedback according to their group assignment (Table 1). Six months later, the process was repeated with the participants in all groups submitting their most recently completed ITERs (to a maximum of five). ITERs were evaluated using the CCERR by the two trained raters (Time 2 = T2). This process was repeated as outlined in Table 1, resulting in a total of four ITER evaluation times (T1, T2, T3, and T4). Data collection for this study took place from the end of February 2010 to January 2012.
All analyses used total CCERR scores, the sum of the ratings on the nine items. To minimize the effect of variation in the number of ITERs that a participant may have submitted, a mean total CCERR score for each participant, averaging over the submitted ITERs at each of the collection times, was calculated for each rater. An intraclass correlation was then calculated to ensure high interrater reliability before averaging the ratings assigned by the two raters. These averaged scores were used for all comparative analyses. A mixed ANOVA using time as a repeated-measures factor and group as a between-subjects variable was used to assess for an effect of group assignment at each of the time points.
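The rater-agreement check described above can be sketched in code. The snippet below uses synthetic scores, not study data, and assumes the ICC(2,1) form (two-way random effects, absolute agreement, single rating); the paper does not state which ICC variant was used, so that choice is an illustration only. Two raters' mean total CCERR scores are compared, and the ratings are averaged only once agreement is confirmed.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.
    `scores` is an (n_subjects, n_raters) array of mean total CCERR scores."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # subjects
    ms_cols = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # raters
    ss_err = np.sum((x - grand) ** 2) - (n - 1) * ms_rows - (k - 1) * ms_cols
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical per-participant mean total CCERR scores from two trained raters
rater_a = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
rater_b = rater_a + 0.5  # near-perfect agreement, small constant offset
icc = icc_2_1(np.column_stack([rater_a, rater_b]))
consensus = (rater_a + rater_b) / 2  # averaged only after confirming high ICC
```

With raters this consistent, the ICC comes out close to 1, mirroring the >0.8 threshold the study required before averaging.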
This project was approved by the institutional review boards of the Ottawa Hospital Research Ethics Board, Health Sciences II–University of Toronto Research Ethics Board, Capital Health Research Ethics Board at Dalhousie University, and the Health Research Ethics Board at the University of Alberta. Informed consent was provided by all participants. Participation was voluntary and independent of evaluation, promotion, and tenure. All participants received a $20 bookstore gift card on completion of the study.
As stated earlier, 98 participants from four different medical schools (school one = 35, school two = 41, school three = 4, school four = 18) were recruited during the 14-month enrollment period. Despite not achieving our recruitment goal, we felt that additional recruitment efforts were unlikely to greatly increase our participant numbers, and therefore the study was initiated with 98 participants. However, only 37 participants from three of the medical schools (school one = 15, school two = 15, school four = 7) submitted ITERs at all collection times (T1, T2, T3, T4). Among the other participants, there was one formal withdrawal from the study, and the remaining ones either did not respond to requests for ITER submissions at all time points or indicated that they had not completed any ITERs in the past six months. Analyses were completed using the data from the 37 participants who submitted an average of 2.97 ITERs at T1 (range = 2–3), 3.81 ITERs at T2 (range = 1–5), 3.70 ITERs at T3 (range = 1–5), and 3.62 ITERs at T4 (range = 1–5).
The interrater reliability for the CCERR scores at each time point ranged from 0.88 to 0.91.
Table 2 displays the mean CCERR scores for each group at each of the four time points. We first assessed the effect of time on all participants’ CCERR scores irrespective of group assignment. A significant effect of time was demonstrated (F(3,96) = 12.03, P < .0001, partial η² = 0.27), suggesting a benefit of time in that the CCERR scores for the participants, when considered as an entire group, are better at later time points. We then assessed the effect of group assignment irrespective of time. The main effect of group assignment was not significant (F(4,32) = 0.54, P > .05, partial η² = 0.06). Finally, we assessed the interaction between time and group assignment. It was not significant (F(12,96) = 1.22, P > .05, partial η² = 0.13). This pattern indicates that although there is a benefit of time in that ITERs at the later time points are better when assessing the group as a whole, none of the feedback interventions appeared to have an effect.
That said, the effect size associated with the interaction (partial η² = 0.13) would be considered moderate in size and might suggest the study was underpowered to detect a difference between groups as a function of time. We decided to explore the interaction in more detail by first collapsing the participants into two conditions, control (n = 10) versus intervention (all four feedback groups, n = 27). A 2 × 4 mixed ANOVA with condition as the between-subject variable and time as the repeated-measures variable demonstrated that time remained significant, but neither condition nor the interaction between time and condition was significant. However, as shown in Table 3, which displays the resulting means and standard deviations, it appears that the mean CCERR ratings do not increase over time for the control group but do go up over time for the intervention group, suggesting a benefit of feedback. This observation was confirmed by using paired t tests to compare the ratings at baseline (T1) with each of the ratings at the other time points. For the control group (Group 1), ratings at baseline do not differ from T2 (t(9) = 1.10, P = .30), T3 (t(9) = 1.33, P = .22), or T4 (t(9) = 0.58, P = .58). For the intervention group (Groups 2–5 combined), ratings at baseline differ from ratings at T2 (t(26) = 2.69, P = .01), T3 (t(26) = 3.72, P = .001), and T4 (t(26) = 3.92, P = .001), indicating that there may be a benefit over time with receiving feedback.
Given the apparent power issue, it would be of interest to determine how many participants would have been needed to produce a significant result. To examine this, we reanalyzed the group data at each time point from the original overall ANOVA, calculated a standardized effect size measure (Cohen’s f), and then determined how many participants per group would have been needed to produce significant results given the means and standard deviations displayed in Table 2.24 It would have required at least 27 people per group.
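The mechanics of this kind of post hoc sample-size calculation can be sketched via the noncentral F distribution. The snippet below assumes a simple one-way ANOVA approximation with a given Cohen's f; the study's actual mixed (between-within) design would require a more elaborate calculation, so this is an illustration of the method, not a reproduction of the reported figure of 27 per group.

```python
from scipy import stats

def anova_power(f_effect, n_per_group, k_groups, alpha=0.05):
    """Power of a one-way ANOVA given Cohen's f, via the noncentral F."""
    n_total = n_per_group * k_groups
    df1, df2 = k_groups - 1, n_total - k_groups
    nc = (f_effect ** 2) * n_total            # noncentrality parameter
    f_crit = stats.f.ppf(1 - alpha, df1, df2)  # rejection threshold
    return stats.ncf.sf(f_crit, df1, df2, nc)  # P(F > f_crit | effect)

def required_n(f_effect, k_groups, target=0.80, alpha=0.05):
    """Smallest per-group n reaching the target power."""
    n = 2
    while anova_power(f_effect, n, k_groups, alpha) < target:
        n += 1
    return n

# Example: five groups (as in this study), medium vs. large assumed effects
n_medium = required_n(0.25, 5)  # Cohen's f = 0.25
n_large = required_n(0.40, 5)   # Cohen's f = 0.40
```

As expected, larger assumed effects need fewer participants per group, which is why the moderate interaction effect observed here implies a substantially larger sample than was retained.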
We examined the impact of an “at-home” FD program on improving ITER quality by comparing quality over time, with four different methods of providing feedback and a control group (a manipulation that is seldom used in FD studies). The primary analysis showed an effect of time but not a differential effect of the feedback interventions, which may have occurred because the study did not have enough power to detect a significant interaction. Subgroups were analyzed, and we found that participants who received one of the four feedback interventions improved during our study, whereas those in the control group did not. This pattern suggests that faculty are able to improve ITER quality, and this study adds to the literature that has found success with improving the quality of trainee assessments following rater training.8,9,13
In addition to being underpowered, other possibilities could explain the pattern of results. It is possible that we did not see a difference in the amount of improvement between the four different feedback interventions because there was not enough difference between the interventions in terms of learning. On the other hand, the results may suggest that clinical supervisors simply need a small amount of guidance in order to improve their ITER quality, and that it does not matter what type of guidance it is. Another possibility is that just the knowledge that someone would monitor their progress (a type of Hawthorne effect25) is enough to improve quality. However, it should be noted that the control group did not improve, suggesting that the provision of feedback was at least a part of that improvement.
The value of the feedback provision could be questioned, given that we found a statistically significant pre/post difference of only about three points on the CCERR. Previous work21 demonstrated that ITERs evaluated as “poorly completed” (mean CCERR score ≈ 16) versus those rated as “average” (mean CCERR score ≈ 24) differed by eight points on the CCERR. A similar difference of eight points was found between the ITERs rated as “average” and those rated as “excellently” completed. We suggest that making over 30% of the improvement toward the next level of performance on the basis of one FD intervention can be argued to be educationally significant.
It is important to note that we made no effort in this study to examine the reliability of ratings assigned by supervisors on the ITERs (and our study design precludes this type of analysis). This was a deliberate choice based on previous research that found that most of the features of a high-quality completed ITER deal with the comments as opposed to the ratings.21 This approach is in line with recent discussions in the literature that suggest that we move beyond focusing solely on numeric rating scales and incorporate more qualitative types of assessment.6 The use of this approach might have contributed to our study result, as perhaps faculty are more capable of change when it comes to improving the quality of their comments as opposed to their ratings on ITERs.
Traditionally, FD programs have been evaluated using satisfaction surveys, self-reported ratings of confidence, or, at best, pre/post knowledge tests. Objective assessment of teaching and evaluation skills in clinical environments is rarely considered.15,26 Given that we did evaluate the impact of our FD program at the Kirkpatrick III (behavior change) level, our study adds to the small group of studies in the literature that demonstrate that it is possible to assess an FD program’s ability to create behavior change.13,22,26–28
Beyond developing an FD program that would result in ITER quality improvement, we wanted to design an FD program that would encourage greater participation than traditional workshops do. Certainly, when compared with a multisite workshop study that taught similar content but only had 22 participants,13 this program seems to be more palatable to supervisors and produced similar results (an approximately three-point increase on the CCERR). However, we still recruited only a fraction of potentially eligible faculty.
The difficulties in recruiting physicians to participate in FD initiatives are well documented.14 We did not formally study the reasons faculty declined to participate in our study. However, anecdotally, many potential participants cited a concern with the time it might take them to collect their ITERs. We took many steps to ensure that the time it would take faculty to participate in the study would be minimal, and study participants indicated that the time commitment was very small. As well, it is possible that the addition of the evaluation component (having their ITERs evaluated), something that is not typically part of an FD program, deterred some individuals. Informal comments made to us suggested that despite the anonymous nature of the analysis, some faculty members did not want an assessment of their ITER performance. Certainly, these and many other possibilities exist to explain the recruitment challenges. This is an important area for future studies to address.
On top of issues with recruitment, we had difficulties with participant retention, as only 38% of the participants completed all phases of the study. However, only one person formally withdrew, so it is likely that many faculty simply did not complete evaluations routinely throughout the study. That is certainly to be expected given the length of our study, as many faculty rotate on and off clinical services throughout a given year. Recruiting a larger number of participants at the outset would have mitigated this problem.
In conclusion, despite using an “at-home” FD program rather than a traditional “in-person” FD workshop, participating in our study resulted in ITER quality improvement for those who received feedback. This is encouraging, as the majority of faculty do not attend traditional workshops. It also adds to the literature demonstrating improved trainee assessment quality following rater training. Future studies with larger participant numbers are required to more fully clarify what type of feedback intervention is needed to see an improvement in ITER quality.
Dedication: The authors would like to dedicate this project to their friend, mentor, and colleague, the late Dr. Meridith Marks. Her desire to see medical education research disseminated through peer-reviewed publications motivated us to finalize our work.
Acknowledgments: The authors wish to thank Dr. Glenn Regehr for his thoughtful review and suggestions regarding the manuscript.
1. Turnbull J, Van Barneveld C. Assessment of clinical performance: In-training evaluation. In: Norman GR, van der Vleuten CPM, Newble DI, eds. International Handbook of Research in Medical Education. London, UK: Kluwer Academic Publishers; 2002
2. Speer AJ, Solomon DJ, Ainsworth MA. An innovative evaluation method in an internal medicine clerkship. Acad Med. 1996;71(1 suppl):S76–S78
3. Cohen GS, Blumberg P, Ryan NC, Sullivan PL. Do final grades reflect written qualitative evaluations of student performance? Teach Learn Med. 1993;5:10–15
4. Hatala R, Norman GR. In-training evaluation during an internal medicine clerkship. Acad Med. 1999;74(10 suppl):S118–S120
5. Swing SR, Clyman SG, Holmboe ES, Williams RG. Advancing resident assessment in graduate medical education. J Grad Med Educ. 2009;1:278–286
6. Holmboe ES, Ward DS, Reznick RK, et al. Faculty development in assessment: The missing link in competency-based medical education. Acad Med. 2011;86:460–467
7. Dudek NL, Marks MB, Regehr G. Failure to fail: The perspectives of clinical supervisors. Acad Med. 2005;80(10 suppl):S84–S87
8. Holmboe ES, Hawkins RE, Huot SJ. Effects of training in direct observation of medical residents’ clinical competence: A randomized trial. Ann Intern Med. 2004;140:874–881
9. Littlefield JH, Darosa DA, Paukert J, Williams RG, Klamen DL, Schoolfield JD. Improving resident performance assessment data: Numeric precision and narrative specificity. Acad Med. 2005;80:489–495
10. Newble DI, Hoare J, Sheldrake PF. The selection and training of examiners for clinical examinations. Med Educ. 1980;14:345–349
11. Cook DA, Dupras DM, Beckman TJ, Thomas KG, Pankratz VS. Effect of rater training on reliability and accuracy of mini-CEX scores: A randomized, controlled trial. J Gen Intern Med. 2009;24:74–79
12. Williams RG, Klamen DA, McGaghie WC. Cognitive, social and environmental sources of bias in clinical performance ratings. Teach Learn Med. 2003;15:270–292
13. Dudek NL, Marks MB, Wood TJ, et al. Quality evaluation reports: Can a faculty development program make a difference? Med Teach. 2012;34:e725–e731
14. Rubeck RF, Witzke DB. Faculty development: A field of dreams. Acad Med. 1998;73(9 suppl):S32–S37
15. Steinert Y, Mann K, Centeno A, et al. A systematic review of faculty development initiatives designed to improve teaching effectiveness in medical education: BEME guide no. 8. Med Teach. 2006;28:1–30
16. Ilgen DR, Fisher CD, Taylor MS. Consequences of individual feedback on behaviour in organizations. J Appl Psychol. 1979;64:349–371
17. Wilkerson L, Irby DM. Strategies for improving teaching practices: A comprehensive approach to faculty development. Acad Med. 1998;73:387–396
18. Cohen PA. Effectiveness of student rating feedback for improving college instruction: A meta-analysis. Res Higher Educ. 1980;13:321–341
19. Bing-You RG, Greenberg LW, Wiederman BL, Smith CS. A randomized multicentre trial to improve resident teaching with written feedback. Teach Learn Med. 1997;9:10–13
20. Patterson K, Grenny J, Maxfield D, McMillan R, Switzler A. Influencer: The Power to Change Anything. New York, NY: McGraw-Hill; 2008
21. Dudek NL, Marks MB, Wood TJ, Lee AC. Assessing the quality of supervisors’ completed clinical evaluation reports. Med Educ. 2008;42:816–822
22. Kirkpatrick DL. Evaluating Training Programs: The Four Levels. 2nd ed. San Francisco, Calif: Berrett-Koehler; 1998
23. Dudek N, Wood T. The completed clinical evaluation report rating—Validation for use with research assistants. Abstract presented at: Research in Medical Education (RIME) Conference; November 6–9, 2009; Boston, Mass
24. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. New York, NY: Psychology Press; 1988
25. Holden JD. Hawthorne effects and research into professional practice. J Eval Clin Pract. 2001;7:65–70
26. Marks MB, Wood TJ, Nuth J, Touchie C, O’Brien H, Dugan A. Assessing change in clinical teaching skills: Are we up for the challenge? Teach Learn Med. 2008;20:288–294
27. D’Eon MF. Evaluation of a teaching workshop for residents at the University of Saskatchewan: A pilot study. Acad Med. 2004;79:791–797
28. Pandachuck K, Harley D, Cook D. Effectiveness of a brief workshop designed to improve teaching performance at the University of Alberta. Acad Med. 2004;79:798–804