
Do In-Training Evaluation Reports Deserve Their Bad Reputations? A Study of the Reliability and Predictive Ability of ITER Scores and Narrative Comments

Ginsburg, Shiphra, MD, MEd; Eva, Kevin, PhD; Regehr, Glenn, PhD

doi: 10.1097/ACM.0b013e3182a36c3d


Evaluating residents’ clinical competence is challenging for educators. In-training evaluation report (ITER) scores are used by most residency programs to perform this task despite suggestions that they are not reliable in practice and suffer from validity issues.1 These evaluations are often done without proper training, are not based on enough direct observation, and suffer from recall bias, among other issues. Further, the literature on “failure to fail” suggests that clinical teachers are reluctant to provide honest yet critical feedback because of multiple cognitive, social, and political biases.2 As a result, calls have been made to reduce our reliance on these types of assessments.3

In contrast to the issues with the numerical data that ITERs generate, some research suggests that the comments written on these forms may be a better source of information. In one study, faculty participants who analyzed only the written comments on evaluation forms were much more likely to identify problem students (and at earlier time points) than when using the scores alone.4 Thus, there was a discordance between what was written and the grades assigned. More recent studies of residents and practicing physicians have also found a significant number of comments that were discordant with scores, generally indicating a lower performance level than the scores would suggest.5,6 Importantly, one study found that “equivocal” comments, neither wholly positive nor wholly negative, were negatively correlated with scores and in fact behaved just like negative comments.7 In other qualitative work, faculty’s written comments on residents’ evaluations were found to offer additional, interpretive information that did not always map easily onto standard competencies.8 Thus, it seems there is richness in the narrative comments that the scores do not capture, perhaps because of difficulty in translating the holistic nature of one’s perceptions or concerns into simple numerical ratings. Systematically reading and scoring comments is problematic, however, in that it is not always easy to decide whether a comment is positive, negative, or somewhere in between.

Because analyzing written language is so labor-intensive, several software programs have been developed that can automate this task. Before any of these programs can be evaluated for use on ITER comments, however, certain assumptions require testing, including whether narrative comments from ITERs can be reliably and meaningfully assessed. To evaluate these assumptions, we conducted a study to predict the performance of one cohort of third-year residents (PGY3) based on ITER scores and comments from their first and second postgraduate years (PGY1 and PGY2). Our hypotheses were as follows: (1) ITER scores are unreliable across rotations, as commonly reported; (2) as a result, ITER scores from PGY1 and PGY2 are poor predictors of performance in PGY3 as measured by both ITER scores and program directors’ (PDs’) rankings; and (3) in line with the research synthesized in the previous paragraph, faculty attendings’ analysis of narrative comments from the ITERs can be reliable and will be more predictive than ITER scores.


Method

Our data set was a compilation of all ITERs from a single cohort of residents collected across PGY 1 to 3 in a core internal medicine (IM) training program at the University of Toronto Faculty of Medicine, for the academic years 2009 to 2011. The data were anonymized and divided by academic year. In this program, residents are expected to receive a single ITER at the end of every one-month rotation.

The dependent variable of interest was performance during PGY3, which we assessed in two ways. The first outcome variable was the mean ITER score for each resident across all ITERs collected during their PGY3 year. The second outcome variable was created by inviting PDs from the major teaching sites within our program to rank-order their most recent group of PGY3s. Three of four site PDs are responsible for 12 to 15 residents each, and one has the responsibility for two hospitals combined with a total of 29 residents. To reduce recall bias as much as possible, we conducted this ranking task between July and August for all PGY3s who graduated in June of that year. PDs were shown a list of their most recent PGY3s and were asked to rank-order them from highest to lowest based on whatever relevant knowledge they had and to categorize them according to a previously developed framework: A = outstanding, excellent, exemplary; B = solid, safe, may need some fine tuning; C = borderline, bare minimum, remediable; and D = unsafe, unacceptable, multiple deficits.9 They were allowed to review any notes or prior ITERs they had for each trainee. Because each site had different numbers of trainees, we scaled the rankings to enable meaningful collation of rankings across sites. That is, the highest-ranked resident from each site received a score of 1, the lowest-ranked resident from each site received a score of 0, and those in between received scores corresponding with their percentile. Thus, we had two scores for each PGY3: the mean ITER score (ranging from 1 to 5) and a PD rank-order score within each of the four teaching sites (ranging from 0 to 1).
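The within-site scaling described above can be sketched as follows. The text says only that residents between the top and bottom received scores "corresponding with their percentile," so the linear spacing used here is an assumption, and the function name and resident IDs are hypothetical.

```python
def scale_ranks(ranked_ids):
    """Map a site's rank-ordered list (best first) onto scores from 1.0 to 0.0.

    Assumption: ranks between the top and bottom are spaced evenly.
    """
    n = len(ranked_ids)
    if n < 2:
        raise ValueError("need at least two residents to scale a ranking")
    # Position 0 (highest-ranked) maps to 1.0; position n - 1 maps to 0.0.
    return {rid: (n - 1 - pos) / (n - 1) for pos, rid in enumerate(ranked_ids)}


# Hypothetical site with five residents, listed from highest to lowest rank.
site_a = ["r1", "r2", "r3", "r4", "r5"]
scores = scale_ranks(site_a)
```

Because each site is scaled separately, a site with 12 residents and a site with 29 residents both span the same 0-to-1 range, which is what allows the rankings to be pooled.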

ITER data

Our ITERs consist of 19 items related to the seven CanMEDS10 competencies. There are 4 items on the Medical Expert role, 2 each for the Communicator, Collaborator, and Manager roles, and 3 each for the Scholar, Health Advocate, and Professional roles. There is also an overall rating. All items are scored on a five-point scale as follows: 1 = unsatisfactory, 2 = needs improvement, 3 = meets expectations, 4 = exceeds expectations, and 5 = outstanding. We calculated the mean of all 19 items plus the overall rating for each resident for each rotation to generate a score ranging from 1 to 5 for each ITER.

Narrative data

Each ITER form has a single box for general comments with instructions to “Provide a general impression of the trainee’s development during this rotation, including general competence, motivation and consultant skills. Please emphasize strengths and areas that require improvement.” We cut and pasted the comments from each ITER and combined them for each resident separately for PGY1 and PGY2. Thus, each resident had two documents of comments, one to two pages in length, one for each year, with no numerical data attached. For each year separately, these documents were randomly assigned to 12 sets of 15 to 16 documents, such that each resident’s comments appeared in 3 different sets, no 2 sets were alike, and no identifying information (regarding evaluator or resident) was included.
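One way to produce such an assignment is sketched below. This is not necessarily the procedure the study used: it assumes three shuffling rounds of four sets each (12 sets in total), and the function name and seed are hypothetical.

```python
import random


def assign_to_sets(doc_ids, rounds=3, sets_per_round=4, seed=0):
    """Split documents into rounds * sets_per_round sets.

    Each round shuffles all documents and deals them into sets_per_round
    disjoint sets, so every document lands in exactly `rounds` different sets.
    """
    doc_ids = list(doc_ids)
    rng = random.Random(seed)
    while True:
        all_sets = []
        for _ in range(rounds):
            shuffled = doc_ids[:]
            rng.shuffle(shuffled)
            # Deal round-robin so set sizes differ by at most one.
            for i in range(sets_per_round):
                all_sets.append(shuffled[i::sets_per_round])
        # Retry in the very unlikely event that two sets are identical.
        if len({frozenset(s) for s in all_sets}) == len(all_sets):
            return all_sets


# 63 residents' documents for one year, labeled 0..62 for illustration.
sets = assign_to_sets(range(63))
```

With 63 documents and four sets per round, each round yields three sets of 16 and one of 15, matching the "15 to 16 documents" per set described above.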

Narrative ranking procedure

Twenty-four faculty attendings with at least two years’ experience evaluating residents on inpatient units in IM were recruited by e-mail to participate. A research assistant conducted one-on-one sessions, during which each participant was given up to an hour to read through a distinct package of residents’ comments and to rank-order them from 1 to 15 or 16. The research assistant conducted a debriefing interview with each participant on completion of the exercise.


Analysis

To determine the internal structure of the scale, we conducted a factor analysis of all fully complete ITERs in the data set.

Reliability of the ITER scores for each resident was calculated across ITERs within each year using Shrout and Fleiss’s11 Case 1 intraclass correlation in an unbalanced design to allow for the inclusion of residents with different numbers of ratings (each resident had between 1 and 13 ITERs with an average of 9.63). To allow consistent comparisons across groups, the inter-ITER reliability is reported with k (the number of observations averaged across) set to 9 (the approximate average number of ITERs per year across residents in our data set).
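A minimal sketch of this kind of calculation is given below, using the standard one-way random-effects ANOVA estimator with the usual adjustment for unequal group sizes, followed by a Spearman-Brown step-up to k observations. The article does not give its computational details, so the estimator, the function names, and the example scores are all assumptions for illustration.

```python
def icc_case1(groups):
    """Single-observation ICC from a one-way random-effects ANOVA.

    `groups` is a list of lists: one list of ITER scores per resident.
    Unequal list lengths are allowed (unbalanced design).
    """
    groups = [g for g in groups if len(g) > 0]
    n_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Effective number of observations per resident for unbalanced data.
    k0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (n_groups - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)


def spearman_brown(icc_single, k):
    """Reliability of the mean of k observations, given single-observation ICC."""
    return k * icc_single / (1 + (k - 1) * icc_single)


# Hypothetical ITER scores for three residents with 3, 2, and 4 ITERs.
example_scores = [[4.0, 4.2, 3.8], [4.5, 4.6], [3.5, 3.9, 3.7, 3.6]]
single = icc_case1(example_scores)
nine_rotation = spearman_brown(single, 9)
```

Fixing k at 9 rather than using each resident's actual number of ITERs keeps the reported reliabilities comparable across years.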

Reliability of the comment ranking procedure across three raters was assessed by using Shrout and Fleiss Case 1 (with raters nested within residents). Thus, we calculated three-rater reliability (k = 3) for the ranking of each resident’s comments.

We also calculated the correlation between the mean PGY3 ITER scores and the PDs’ rankings.

To assess predictive ability of the ITER scores and comment rank scores, we conducted regression analyses with PGY3 performance as our dependent measure (separately for ITER scores and PD rankings). For each regression, we first used PGY1 and PGY2 ITER scores as independent variables, then added the comment rank scores for both PGY 1 and PGY2 to assess for incremental predictive ability (increase in R2).
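The logic of incremental R2 can be illustrated in a simplified case with one baseline predictor and one added predictor, where R2 follows directly from the three pairwise correlations. The study's models had two baseline predictors and two added ones, so this is only a sketch; the function names and correlation values are hypothetical.

```python
def r2_one_predictor(r_y1):
    """R-squared when the outcome is regressed on a single predictor."""
    return r_y1 ** 2


def r2_two_predictors(r_y1, r_y2, r_12):
    """R-squared for two predictors, from the pairwise correlations.

    r_y1, r_y2: correlation of each predictor with the outcome.
    r_12: correlation between the two predictors.
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Hypothetical values: both predictors relate to the outcome,
# and they are strongly correlated with each other.
r2_base = r2_one_predictor(0.6)
r2_full = r2_two_predictors(0.6, 0.5, 0.7)
increment = r2_full - r2_base
```

When the two predictors are highly correlated with each other, the second one has little unique variance left to explain, so the increment stays small even if its own correlation with the outcome is substantial.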

Ethics approval was obtained from the University of Toronto’s office of research ethics.


Results

Of a total of 75 possible residents, 63 (84%) had sufficient data to include in the analysis, but only 59 received rankings from the PGY3 site PDs. Thus, for all analyses presented, our sample size was 63 residents except for the regression predicting PGY3 PD rankings, where the sample was 59.

Analysis of the dimensions of the ITER scales

The factor analysis used 903 fully completed forms accrued across resident, rotation, and year. The unrotated factor solution revealed two factors, with the first accounting for 66.0% of the variance and every item loading more or less equally from a low of 0.69 for basic science knowledge to a high of 0.85 for the overall rating at the end of the scale. The second factor accounted for an additional 5.4% of the variance. A varimax rotation revealed two major dimensions, which can be roughly translated as “knowledge/clinical skills” and “interpersonal skills” (see Table 1). Despite having seven competency roles represented on our ITERs, we did not see obvious clustering into these roles. Interestingly, the overall or global rating on the ITER was the highest loading variable on the unrotated general factor and was nearly equally weighted between the two factors in the rotated solution (with a slight bias toward knowledge/skills), suggesting it was truly an overall rating of the resident.

Table 1: Results of Factor Analysis (Unrotated and Varimax) for All Items on 903 In-Training Evaluation Reports for First-, Second-, and Third-Year Residents, University of Toronto Faculty of Medicine, 2009–2011

Predictor variables

The reliability of a single ITER score was 0.11 for both the PGY1 and PGY2 data. However, the reliability for the average of nine rotations was moderately high at 0.53 for both years. For the ranking of the ITER comments, the reliability for three raters per resident was 0.83 for PGY1 and 0.79 for PGY2.
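The single-ITER and nine-rotation figures are related as the Spearman-Brown formula would predict. The article does not say this formula was used, so the following is offered only as a consistency check, with \rho_1 denoting the single-ITER reliability and \rho_k the reliability of the mean of k ITERs:

```latex
\rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1},
\qquad
\rho_9 = \frac{9 \times 0.11}{1 + 8 \times 0.11} = \frac{0.99}{1.88} \approx 0.53
```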

Outcome measures

The reliability of the PGY3 ITER scores was 0.14 for a single ITER and 0.59 for the average of nine rotations. Figure 1 shows the correlation (r = 0.63) between the average ITER scores and the PD rank-orderings (scaled from 0 to 1). The gray dots represent residents that PDs placed in category C (borderline performance), and the single black dot represents one resident that a PD placed in category D (multiple deficits, unsafe, unacceptable).

Figure 1: Correlation between the average in-training evaluation report (ITER) scores and the site program directors’ (PD) rank-order scores (scaled from 0 = worst to 1 = best) for all postgraduate year three (PGY3) residents, University of Toronto Faculty of Medicine, 2009 to 2011. Gray dots represent residents placed by their PDs in category C (borderline, bare minimum, remediable); the single black dot represents a resident placed in category D (multiple deficits, unsafe, unacceptable).

Predicting PGY3 assessments

Table 2 shows the correlations between the PGY1 and PGY2 average ITER scores and the comment rank-order scores compared with the two PGY3 outcomes (ITER scores and PD rank-ordering). There were strong correlations between ITER scores and comment scores within each of PGY1 and PGY2 (0.72 and 0.70), suggesting that much of the information gleaned from reading and ranking the comments was already captured in the numbers, with a shared variance of about 50%. Correlations between both predictor scores and the PGY3 ITERs were moderate (0.44 and 0.44 for PGY1 and 0.44 and 0.33 for PGY2). These correlations were higher for the PGY3 PD rank-order score (0.58 and 0.53 for PGY1 and 0.60 and 0.62 for PGY2).

Table 2: Correlations Between the Postgraduate Year (PGY) 1 and PGY2 In-Training Evaluation Report (ITER) Scores and Comment Rank-Order Scores, Compared With PGY3 ITER Scores and Program Directors’ (PDs’) Rank-Ordering, University of Toronto Faculty of Medicine, 2009–2011

To assess predictive ability we conducted two regression analyses, one for PGY3 ITER scores and one for PGY3 PD rank-order scores, as seen in Table 3. ITER scores for PGY1 and PGY2 explained approximately 25% of the variance in PGY3 ITER scores. Adding in the comment-ranking scores did not significantly improve prediction. When predicting the PGY3 PD rankings, the PGY1 and PGY2 ITER scores explained much more of the variance, approximately 46%, and this again did not improve significantly with the addition of the comment scores.

Table 3: Results of Regression Analyses to Explore the Ability of Several Variables to Predict Postgraduate Year (PGY) 3 In-Training Evaluation Report (ITER) Score and PGY3 Program Directors’ (PDs’) Rank-Order Score, University of Toronto Faculty of Medicine, 2009–2011


Discussion

In-training evaluations of residents by faculty are often criticized for poor reliability and validity, and educators are discouraged from using them for summative assessment.1,12,13 The narrative comments on these forms may hold promise by adding information over and above the scores, potentially allowing identification of residents in difficulty that the scores alone do not identify. This study was designed to systematically analyze the written comments on residents’ evaluation forms in an attempt to quantify their added value. On postexercise debriefing, faculty attendings reported finding the ranking task easy to complete. Even with faculty blinded to ITER score, this process yielded excellent three-rater reliability and rankings that moderately correlated with performance in PGY3. To our surprise, however, we also found ITER scores to be quite reliable and predictive of PGY3 performance based on PD rank-order, so much so that the comments did not add incremental validity to the predictive model.

Why were our ITERs “better” than what the literature would suggest? Our program is not unique or unusual. It is a large residency dispersed over five major teaching hospitals. Our attending physicians do not receive any sort of special or rigorous training on how to use the ITER forms, which follow a fairly typical format of competency dimensions (based on the CanMEDS roles), a global or overall rating, and a box for free-text comments. Our ITERs, in addition to appearing to consist of only two underlying dimensions, should also suffer from all the pitfalls and biases commonly reported: halo and ceiling effects, leniency, recall bias, etc.12 Certainly this was true of scores from a single ITER, but across a year’s worth of ITER evaluations, the reliability (and predictive validity) of our ITER scores was surprisingly good. That is, although the reliability of a single end of rotation ITER was poor, averaging across nine rotations yielded reasonable inter-ITER reliability.

A reexamination of the literature suggests that we are not the first to find that ITERs may be quite reliable if the data are aggregated. In 1992, Carline and colleagues14 reported a study involving 328 IM students over two years, whereby multiple evaluations were submitted for each student from various attendings and residents. Their analysis of 3,557 forms (an average of about 11 per student) found that 8 forms provided acceptable reliability (using a G-study) when looking at the overall clinical grade assigned. In 1998, Kreiter and colleagues,15 also using a G-study, found that reliability was acceptable if three or more raters evaluated every student on a given rotation. In a surgical residency, Littlefield and colleagues16 analyzed 474 evaluation forms from 35 residents and found that for one rater the reliability would be very low (0.13), but for 10 raters it was much higher, at 0.59. In all of these studies, the evaluation forms consisted of multiple items tapping into various elements of performance, all rated on scales (4–7 items) followed by a “global” or overall rating, similar to ours. So the evidence seems to suggest that as long as there are multiple raters, either within or across rotations, there can be acceptable reliability. That raises the question of why educators are so discouraged from using ITERs.1,12,17

One reason why ITERs are often discredited relates to the discourse that these ratings are merely “subjective” and therefore somehow inferior to other assessment methods that are purported to be “objective,” such as OSCEs.18 Another issue is that factor analysis studies routinely show that however many items there are on a form, they tend to load on two factors, as we also found, usually along the lines of medical knowledge and interpersonal skills.19 Indeed, the American Board of Internal Medicine’s form was found in one study to load mostly on one single factor.20 This may serve to undermine people’s confidence in what the forms are purported to measure (i.e., multiple competencies),21 especially as individual practitioners would rarely have access to aggregated scores collected across multiple observers or rotations.

Apart from the apparent reliability of our ITER scores, we also found significant reliability in rankings of the ITER comments. When using three raters per resident (and only 15 or 16 residents per rater), the reliability of the rank-ordering of comments combined across 63 residents was 0.83 in PGY1 and 0.79 in PGY2. One recent study used similar methodology to explore faculty attendings’ ability to rank-order narratives of residents’ performance and also found very high interrater reliability.9 One criticism of that methodology was that those narratives were fabricated and covered a much broader range of resident performance than what we usually see in practice. In the current study using real residents’ comments over each year of training (and where, incidentally, over 91% of residents receive a 4 or 5 out of 5 on their global rating—see Figure 1), we were still able to show excellent interrater reliability, suggesting that experienced faculty attendings have a shared understanding and conceptualization of residents’ performance in real-world settings. This echoes recent calls to rely on the wisdom and expertise of assessors.22,23

In the regression models we found that the PGY1 and PGY2 predictors (both numeric and comment scores) were more predictive of the PD rankings than of the PGY3 ITER itself, explaining a significantly higher proportion of the variance. One possible explanation is that the PDs were forced to rank-order their residents, thus separating out many who were likely performing at very similar levels. It is also important to note that although the correlations were high, they certainly were not perfect, and a lot of the variance in PGY3 performance is left unexplained by PGY1 and PGY2 scores and comments. The rest may be “noise,” but it is possible there is more signal yet to be found, such as effects related to rotation type, hospital site, time of year, etc. That comments and scores explained equal amounts of variance can be explained by the high correlation between these two metrics. These unexpectedly high correlations suggest that raters were documenting their concerns adequately through their commentary and appropriately reflecting their severity through use of the numeric scale when both comments and scores are considered in aggregate. It is also possible that there are more sophisticated ways of analyzing the language on the ITERs that might yield more information and be far less time-consuming than having faculty read and rank them. This is being evaluated in current research by our team.

Our study has some limitations. One is potential recall bias on the part of the PGY3 PDs, who could also refer to the residents’ ITERs if they wished, perhaps confounding the rank-orderings. Also, although the data came from multiple hospital sites, they were collected at a single institution, which limits generalizability. Further, we acknowledge that although we have focused here on correlations, they should be interpreted with caution. As shown in Figure 1, for example, the residents that PDs flagged as problematic were seen to cluster at the bottoms of the rank-orders and tended to have lower-than-average ITER scores, with none higher than 4.3. Using this number as a cutoff or a red flag would capture many more residents (approximately 40) who are doing well; that is, there would not be sufficient specificity for this to be useful.
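The specificity concern can be made concrete with a small sketch that treats "mean ITER score at or below a cutoff" as a screening test for residents flagged by PDs. The function name and all values below are hypothetical and are not the study's data.

```python
def screen_by_cutoff(scores, flagged, cutoff):
    """Return (sensitivity, specificity) when score <= cutoff counts as a positive screen."""
    pairs = list(zip(scores, flagged))
    true_pos = sum(1 for s, f in pairs if f and s <= cutoff)
    false_neg = sum(1 for s, f in pairs if f and s > cutoff)
    false_pos = sum(1 for s, f in pairs if not f and s <= cutoff)
    true_neg = sum(1 for s, f in pairs if not f and s > cutoff)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical mean ITER scores; the first three residents were flagged by a PD.
example_scores = [3.9, 4.1, 4.3, 4.0, 4.2, 4.4, 4.5, 4.6]
example_flagged = [True, True, True, False, False, False, False, False]
sens, spec = screen_by_cutoff(example_scores, example_flagged, cutoff=4.3)
```

A cutoff set at the highest score among flagged residents catches all of them by construction, but every unflagged resident at or below that score becomes a false positive, which is what drives specificity down.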

Despite these potential concerns, this study represents one of the first we know of to include the analysis of a large cohort of residents followed over a three-year period using both numeric and narrative data from the ITERs. We also used two different outcome measures (PGY3 ITER scores plus their PD rankings) which were similar, but clearly not the same, given the moderate correlation of 0.63. Although the results bear replication, this systematic analysis of ITER scores and narrative comments does point to the predictive value of ITER scores, despite their bad reputation, and offers interesting insights into the potential for the structured use of narrative comments. Although in this study the narrative comments did not offer additional predictive value, our results certainly indicated an impressive amount of “signal” in this data source, and future studies might well seek out mechanisms to automate the quantification of this signal in order to explore opportunities to use this information more effectively.


References

1. Chaudhry SI, Holmboe E, Beasley BW. The state of evaluation in internal medicine residency. J Gen Intern Med. 2008;23:1010–1015
2. Cleland JA, Knight LV, Rees CE, Tracey S, Bond CM. Is it me or is it them? Factors that influence the passing of underperforming students. Med Educ. 2008;42:800–809
3. Bullock G, Sherbino J, Hodder R, Watling C. Assessment: A White Paper Prepared by the Royal College of Physicians and Surgeons of Canada, Future of Medical Education in Canada. Ottawa, Ontario, Canada: Royal College of Physicians and Surgeons of Canada; 2011
4. Cohen G, Blumberg P, Ryan N, Sullivan P. Do final grades reflect written qualitative evaluations of student performance? Teach Learn Med. 1993;5:10–15
5. Richards SH, Campbell JL, Walshaw E, Dickens A, Greco M. A multi-method analysis of free-text comments from the UK General Medical Council Colleague Questionnaires. Med Educ. 2009;43:757–766
6. Durning SJ, Hanson J, Gilliland W, McManigle JM, Waechter D, Pangaro LN. Using qualitative data from a program director’s evaluation form as an outcome measurement for medical school. Mil Med. 2010;175:448–452
7. Frohna A, Stern D. The nature of qualitative comments in evaluating professionalism. Med Educ. 2005;39:763–768
8. Ginsburg S, Gold W, Cavalcanti RB, Kurabi B, McDonald-Blumer H. Competencies “plus”: The nature of written comments on internal medicine residents’ evaluation forms. Acad Med. 2011;86(10 suppl):S30–S34
9. Regehr G, Ginsburg S, Herold J, Hatala R, Eva K, Oulanova O. Using “standardized narratives” to explore new ways to represent faculty opinions of resident performance. Acad Med. 2012;87:419–427
10. Frank JR. The CanMEDS 2005 Physician Competency Framework. Ottawa, Ontario, Canada: Royal College of Physicians and Surgeons of Canada; 2005
11. Shrout PE, Fleiss JL. Intraclass correlations: Uses in assessing rater reliability. Psychol Bull. 1979;86:420–428
12. Gray JD. Global rating scales in residency education. Acad Med. 1996;71(1 suppl):S55–S63
13. Williams RG, Dunnington GL, Klamen DL. Forecasting residents’ performance—partly cloudy. Acad Med. 2005;80:415–422
14. Carline JD, Paauw DS, Thiede KW, Ramsey PG. Factors affecting the reliability of ratings of students’ clinical skills in a medicine clerkship. J Gen Intern Med. 1992;7:506–510
15. Kreiter CD, Ferguson K, Lee WC, Brennan RL, Densen P. A generalizability study of a new standardized rating form used to evaluate students’ clinical clerkship performances. Acad Med. 1998;73:1294–1298
16. Littlefield J, Paukert J, Schoolfield J. Quality assurance data for residents’ global performance ratings. Acad Med. 2001;76(10 suppl):S102–S104
17. Epstein RM. Assessment in medical education. N Engl J Med. 2007;356:387–396
18. Eva KW, Hodges BD. Scylla or Charybdis? Can we navigate between objectification and judgement in assessment? Med Educ. 2012;46:914–919
19. Silber CG, Nasca TJ, Paskin DL, Eiger G, Robeson M, Veloski JJ. Do global rating forms enable program directors to assess the ACGME competencies? Acad Med. 2004;79:549–556
20. Haber RJ, Avins AL. Do ratings on the American Board of Internal Medicine Resident Evaluation Form detect differences in clinical competence? J Gen Intern Med. 1994;9:140–145
21. Lurie SJ, Mooney CJ, Lyness JM. Measurement of the general competencies of the Accreditation Council for Graduate Medical Education: A systematic review. Acad Med. 2009;84:301–309
22. Crossley J, Johnson G, Booth J, Wade W. Good questions, good answers: Construct alignment improves the performance of workplace-based assessment scales. Med Educ. 2011;45:560–569
23. Ginsburg S. Respecting the expertise of clinician assessors: Construct alignment is one good answer. Med Educ. 2011;45:546–548
© 2013 by the Association of American Medical Colleges