Research Report

Improving Resident Performance Assessment Data: Numeric Precision and Narrative Specificity

Littlefield, John H. PhD; DaRosa, Debra A. PhD; Paukert, Judy PhD; Williams, Reed G. PhD; Klamen, Debra L. MD; Schoolfield, John D. MS


The Accreditation Council for Graduate Medical Education (ACGME) accredits nearly 7,800 residency education programs in the United States. With the implementation of the Outcome Project, the ACGME defined six general competency domains that provide a framework for decision making in the performance assessments of individual residents and the residency’s teaching program. The ACGME’s implementation plan requires that residency programs “Improve evaluation processes as needed to obtain accurate resident performance data in all six competency domains” by June 2006.1

The ACGME’s focus on residents’ competency presumes that individual performance can be assessed independently of the patient care environment surrounding a performance (i.e., performance quality remains at the same level across clinical situations). This presumption is consistent with published standards for educational and psychological testing,2 which require performance assessments to be reproducible across different clinical settings, patient problems, and evaluators. However, contextual factors, such as clinical setting and the patient’s problem, do affect the quality of a resident’s performance.

In their 2002 review on defining and assessing professional competence, Epstein and Hundert3 note that competence is context dependent. They describe competence as a statement of relationships among an ability, a task, and a clinical context in which the task occurs. For example, in diagnosing heart failure, a marginal resident may demonstrate satisfactory performance when the environment includes substantial technical support (e.g., in a Level 1 trauma center with capable staff), but that same resident may make a diagnostic error at 3:00 am with a ward night crew. Most specialties lack a comprehensive set of evidence-based process or patient-outcome measures to assess clinical performance.4 Moreover, performance related to competency domains, such as professionalism, is difficult to measure because it has no underlying physical dimension, and not all evaluators agree on what is acceptable and unacceptable behavior. In summary, the ACGME’s requirement to “obtain accurate resident performance data in all six competency domains” presents a daunting challenge to residency programs.

Global performance rating based on observing and judging a resident’s performance in actual work settings is the predominant method for resident performance assessment. The method is prevalent outside of medicine as well, but it is not without problems. Although more than 95% of large business organizations use this method, neither supervisors nor subordinates look forward to it.5 Researchers in health professions education have identified numerous problems with global performance rating, including low return rates; leniency errors; and score variation due to cognitive, social, and environmental factors.6 Thus, if global performance rating is to continue as the predominant method for resident performance assessment, it must be significantly improved.

To improve the quality of global performance rating in residency programs, two critical factors must be addressed. First, faculty must directly observe residents’ performances across situations and with multiple patients. The lack of direct observation by faculty has been described as the largest problem in the evaluation of clinical skills.7 Second, faculty must be willing to interpret and elaborate their observations in writing. Written documentation often does not occur, and numerous deficiencies in residents’ performances only become apparent when attending physicians discuss performance at annual evaluation meetings.8 The problem of insufficient written documentation was addressed by a study that used an organizational-change approach to help overcome problems of breadth and documentation of observations in a resident performance assessment system.9

One way to improve resident performance assessment is to focus on agents or elements, such as completed rating forms, that make up the system. All systems consist of interdependent parts that function as a unified whole. Complex adaptive systems (CAS) theory encourages identifying a small number of a system’s agents and their internal models of behavior to simplify the decision of where to start system improvement efforts. For example, a physicians’ group practice is a CAS consisting of agents such as the physicians, office staff, and patients.10 Researchers have analyzed a practice’s response to external forces (surrounding community) and to internal forces such as the staff’s unique styles and interests.11 Similarly, a residency program has the qualities of a CAS. Faculty members and residents are expected to adapt functionally to the environment and to provide patient care while teaching and learning.

In a CAS, information feedback loops (i.e., communication between and among agents) generate both organizational stability and opportunities for change. In a residency program, information feedback loops can be verbal (e.g., reports at department meetings) or written (e.g., synthesis of performance appraisal data prior to administrative reviews), and they can be informal (e.g., conversational) or formal (e.g., training new faculty). Resident performance assessment occurs in the midst of these interactions and is profoundly influenced by them. Using CAS as a theoretical framework, we sought to study the improvement of two resident performance assessment systems by focusing on the information feedback loops among the agents that make up the system.

To conduct this study, we defined three “aspects” (agents and their behavior) of a resident performance assessment system:

  • Individual evaluator. The individual evaluator’s role is assumed primarily by attending faculty. It includes communicating expectations to residents, observing performance, interpreting and judging performance, communicating performance information to the program director, coaching residents, and completing global rating forms.
  • Program director. The program director’s role is the administrative decision-making part of the performance assessment system and includes the Resident Education Committee (REC). It consists of developing and implementing a performance evaluation system, monitoring and interpreting assessments, facilitating administrative decision making, and formally informing residents about administrative decisions.
  • Organizational infrastructure. The organizational infrastructure is the environment supporting the individual evaluators and the program director. It includes various elements such as incentives and consequences for evaluators (e.g., the department chair’s support), evaluator training to use the rating form, and staff assigned to monitor the sending and completing of rating forms.

These three aspects of a resident performance assessment system are interdependent, differ in size, and are affected by their surrounding environment, as shown metaphorically by the bubbles in Figure 1. In the figure, information feedback loops are represented by the bidirectional arrows. Because of the numerous interactions among individuals, the whole resident performance assessment system is more complex than the linear sum of its aspects, and a small change in one aspect may cause large changes in overall system output. The three aspects we define here, however, provide a conceptual framework for diagnosing problems and developing organizational interventions to ameliorate the problems.

Figure 1: Bubble representation of a resident performance assessment system displaying size and interdependence of three aspects plus information exchange.

The purpose of this three-year study was to evaluate the impact of organizational interventions designed to improve the quality of resident performance assessment data by focusing on information feedback loops. We developed three research questions:

  1. What percentage of each resident’s rotations returned one or more completed forms?
  2. How numerically precise were scores on the forms?
  3. What percentage of each resident’s rotations returned one or more forms with one or more behaviorally specific narrative comments?



The experimental sites were two surgery residency programs. Institutional Review Board approval was obtained at both sites. Surgery was a good testing field for improving resident performance assessment because the faculty’s numerous interactions with residents in the operating rooms made problems associated with insufficient direct observation less likely. Site A had 22 categorical residents, 71 faculty who completed rating forms for this study, and three teaching hospitals. Site B had 45 categorical residents, 91 faculty who completed forms for this study, and three primary teaching hospitals. The data for the study were rating forms generated during the academic years 1999, 2000, and 2001. Each categorical resident was a unit of analysis. Preliminary first-year residents were excluded because they leave the program after Year 1.

Data collection

For each program, we assessed the quality of performance assessment data in a baseline year and two intervention years using a before-and-after trial design. There were three outcome variables.

Percentage of each resident’s rotations that returned one or more completed rating forms.

All rating forms available for a given resident were reviewed against the list of rotations the resident completed that year. If one or more forms existed for every rotation, the resident’s file was scored as 100% complete; return rates could therefore vary from zero to 100% for each resident. To answer the research question, return rates were computed for each resident in the program during the baseline year and the two intervention years. At each residency program, a Kruskal-Wallis test was calculated to compare the three sets of annual return rates. The Kruskal-Wallis test was used because the data were not normally distributed and it is the nonparametric equivalent of the F test generated by a one-way analysis of variance. Follow-up Dunn multiple comparison tests were calculated to compare return rates between pairs of years (e.g., baseline versus intervention Year 1). Dunn tests were used because they are the nonparametric equivalent of an analysis of variance post hoc test.
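The return-rate analysis can be illustrated with a minimal Python sketch. It is not the SAS procedure used in the study. The function names and the sample rates are hypothetical, the p value relies on the chi-square approximation with two degrees of freedom (so it applies only to three groups), and the Dunn statistic is shown without tie correction or multiplicity adjustment.

```python
from itertools import chain
from math import exp, sqrt


def return_rate(rotations, rotations_with_forms):
    """Percentage of a resident's rotations that returned one or more forms."""
    covered = set(rotations) & set(rotations_with_forms)
    return 100.0 * len(covered) / len(set(rotations))


def average_ranks(values):
    """Rank values from 1 to N; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_3(groups):
    """Tie-corrected H, its p value (chi-square, 2 df), and mean rank per group."""
    if len(groups) != 3:
        raise ValueError("the closed-form p value assumes exactly three groups")
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    mean_ranks, weighted, start = [], 0.0, 0
    for group in groups:
        group_ranks = ranks[start:start + len(group)]
        start += len(group)
        weighted += sum(group_ranks) ** 2 / len(group)
        mean_ranks.append(sum(group_ranks) / len(group))
    h = 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)
    tie_sizes = {}
    for value in pooled:
        tie_sizes[value] = tie_sizes.get(value, 0) + 1
    h /= 1 - sum(t ** 3 - t for t in tie_sizes.values()) / (n ** 3 - n)
    # For 2 degrees of freedom the chi-square survival function is exp(-H/2).
    return h, exp(-h / 2), mean_ranks


def dunn_z(mean_rank_a, n_a, mean_rank_b, n_b, n_total):
    """Dunn's z for one pair of groups, based on the pooled ranks."""
    se = sqrt(n_total * (n_total + 1) / 12.0 * (1.0 / n_a + 1.0 / n_b))
    return (mean_rank_a - mean_rank_b) / se


# Hypothetical per-resident return rates (%) for three years at one site.
baseline = [40.0, 50.0, 25.0, 60.0, 33.0]
year_1 = [80.0, 75.0, 90.0, 66.0, 100.0]
year_2 = [85.0, 70.0, 95.0, 88.0, 78.0]
h, p, mean_ranks = kruskal_wallis_3([baseline, year_1, year_2])
z = dunn_z(mean_ranks[0], len(baseline), mean_ranks[1], len(year_1), 15)
print(f"H = {h:.2f}, p = {p:.4f}, baseline vs. Year 1 z = {z:.2f}")
```

Because each resident contributes one percentage per year, the test compares the rank distributions of those percentages and does not require normality.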

Precision of numeric scores.

We calculated variance components for residents, raters within resident, and error annually at each site. We then computed a generalizability coefficient based on ten evaluators for each resident and a standard error (SE) to provide a 95% confidence interval around mean rating scores for individual residents. SE is the recommended score band statistic for use in generalizability analysis.12
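As a rough illustration of how variance components translate into these two precision statistics, the sketch below uses the standard one-facet formulas. It assumes that the rater-within-resident and residual components together form the error variance and that the confidence band is 1.96 standard errors on either side of the mean; the study does not spell out either detail. The function names and variance values are hypothetical.

```python
from math import sqrt


def g_coefficient(var_resident, var_rater_within, var_error, n_raters):
    """Generalizability coefficient for a mean over n_raters evaluators."""
    error_of_mean = (var_rater_within + var_error) / n_raters
    return var_resident / (var_resident + error_of_mean)


def ci_half_width(var_rater_within, var_error, n_raters, z=1.96):
    """Half-width of the 95% band around a resident's mean rating."""
    standard_error = sqrt((var_rater_within + var_error) / n_raters)
    return z * standard_error


# Hypothetical variance components on a seven-point scale.
var_resident, var_rater_within, var_error = 0.4, 0.5, 0.5
for n in (1, 10):
    g = g_coefficient(var_resident, var_rater_within, var_error, n)
    band = ci_half_width(var_rater_within, var_error, n)
    print(f"{n:2d} rater(s): G = {g:.2f}, 95% band = +/-{band:.2f}")
```

Averaging over more evaluators shrinks only the error term, which is why the coefficient rises and the band narrows as raters are added.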

Percentage of rotations that returned forms with behaviorally specific narrative comments.

All rating forms available for a given resident were reviewed and narrative comments were classified into three levels: no comment or meaningless comment, general comment, or behaviorally specific comment. For example, a behaviorally specific comment is, “Tom’s instrument handling is excellent as is his ability to anticipate the next step of the operation.” An analysis of classification consistency between two observers found 94% agreement. For each resident, the percentage of rotations that returned one or more forms with one or more behaviorally specific comments was computed for the baseline year and two intervention years. Thus the percentage of rotations could vary from zero to 100% for each resident. A Kruskal-Wallis test was calculated to compare three sets of annual percentages of rotations with behaviorally specific comments at each site. Follow-up Dunn multiple comparison tests assessed percentages of rotations with behaviorally specific comments between pairs of years.
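The two percentages described here reduce to simple counts. The following Python sketch is illustrative only: the level labels ("none", "general", "specific"), the rotation names, and the function names are hypothetical stand-ins for the coding scheme, not the instruments used in the study.

```python
def percent_agreement(codes_a, codes_b):
    """Percentage of comments that two observers placed in the same level."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both observers must code the same comments")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


def percent_rotations_with_specific(forms_by_rotation):
    """Percentage of rotations with one or more behaviorally specific comments.

    forms_by_rotation maps each rotation to the comment levels of its
    returned forms; a rotation with no returned forms maps to an empty list.
    """
    hits = sum(
        any(level == "specific" for level in levels)
        for levels in forms_by_rotation.values()
    )
    return 100.0 * hits / len(forms_by_rotation)


# Hypothetical codes assigned by two observers to the same four comments.
observer_1 = ["none", "general", "specific", "specific"]
observer_2 = ["none", "general", "specific", "general"]
print(percent_agreement(observer_1, observer_2))

# Hypothetical comment levels for one resident's four rotations.
resident = {
    "trauma": ["general", "specific"],
    "vascular": ["none"],
    "transplant": [],
    "pediatric": ["specific"],
}
print(percent_rotations_with_specific(resident))
```

Simple percent agreement does not correct for chance agreement; it is used here only because it is the statistic the text reports.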

All statistical analyses were performed using SAS® statistical software (Release 8.2, copyright 2001).


After collecting data during a baseline year, we attempted to improve the quality of resident performance assessment data through organizational interventions by educational specialists who were faculty members in each program (one specialist per program). The first step was to identify problems in the performance assessment system using a diagnostic checklist shown in Appendix 1. The checklist criteria reflected our own experience and 16 research-based recommendations to improve performance ratings.6 The checklist provided a framework for each residency program to self-assess its system and identify deficient areas for possible intervention. The self-assessments identified numerous deficiencies. For example, attendings faced no consequences for not completing rating forms, and no specific staff members were assigned to monitor the sending and receiving of forms. The organizational interventions to correct the deficiencies were based on CAS theory.

CAS theory asserts that each organization has a unique history, and that information feedback loops generate both organizational stability and change. Each educational specialist served as an information feedback loop from the performance assessment system to the department chair, the program director, and individual attending faculty. The most frequent feedback loop was daily conversation, but written reports were also used. Examples of the type of small procedural changes made by one or both of the programs are shown in Table 1.

Table 1: Examples of Small Procedural Changes Made by One or Both Surgical Residency Programs Involved in a Resident Performance Assessment System Improvement Intervention, 2000 and 2001

The changes were essentially efforts to improve verbal and written communication guided by the results of diagnosing problems in the performance assessment system. The impact of these changes was evaluated during the intervention Years 1 and 2 by analyzing performance assessment data to answer the three research questions.


Over the three years of this study, 32 individual residents were evaluated by 71 faculty at Site A, and 66 residents were evaluated by 91 faculty at Site B. The annual volume of returned forms increased substantially from the baseline year to intervention Year 1 at Site A (from 232 to 456) and at Site B (from 483 to 849). From intervention Year 1 to Year 2, the total volume decreased 17% at Site A (from 456 to 379) and increased 12% at Site B (from 849 to 969). The overall mean rating scores (seven-point scale) and standard deviations (SDs) changed slightly over the three years at Site A (5.07, SD = 1.27; 5.19, SD = 1.15; and 5.33, SD = 1.19, respectively) and at Site B (5.10, SD = 0.94; 5.08, SD = 1.04; and 5.13, SD = 1.05, respectively).

Table 2 presents data to answer each research question for the three years. The median percentage of each resident’s rotations that returned a completed form increased significantly across the three years at both sites. At Site A, follow-up Dunn tests found that median return rates for the baseline year differed significantly from intervention Year 1 (t = 3.31, p < .005), but that return rates for intervention Year 1 did not differ significantly from intervention Year 2 (t = 0.99, p > .30). At Site B, follow-up tests showed median return rates for the baseline year differed significantly from intervention Year 1 (t = 11.05, p < .001), but return rates for intervention Year 1 did not differ significantly from intervention Year 2 (t = 1.09, p > .25).

The generalizability coefficients based on ten evaluators increased from the .6 range during the baseline year to the .8 range during the intervention years. The 95% confidence interval around individual resident mean scores improved at Site A from baseline to intervention Year 1 (0.78 to 0.58) and then remained stable; at Site B, it remained constant throughout (0.55 to 0.56).

The median percentage of rotations that returned forms with behaviorally specific written narrative comments also increased significantly across the three years at both sites. At Site A, follow-up Dunn tests found the median percentage of rotations with behaviorally specific comments for the baseline year differed significantly from intervention Year 1 (t = 2.26, p < .030), and the median percentage for intervention Year 1 differed significantly from intervention Year 2 (t = 4.26, p < .001). At Site B, follow-up Dunn tests found the median percentage of rotations with behaviorally specific comments for the baseline year differed significantly from intervention Year 1 (t = 3.32, p < .005), but the median percentage for intervention Year 1 did not differ significantly from intervention Year 2 (t = 1.63, p > .10).

Table 2: Median Percentage of Returned Rating Forms, Precision of Scores, and Median Percent of Rotations with Behaviorally Specific Comments


The increase in the median percentage of rotations that returned forms, together with the concomitant increase in the percentage of rotations for which rating forms contained a behaviorally specific comment, is consistent with the findings of another study of an organizational intervention to improve the quality of resident performance assessment data.9 The generalizability coefficients of scores during the intervention years were clustered around .80 based on ten raters. Coefficients of .80 or higher are recommended for performance assessment data used to make high-stakes administrative decisions, and the use of ten evaluators is consistent with research-based recommendations.6 In contrast, scores produced by a single rater are not precise. In this study, the intervention year generalizability coefficients for one rater ranged from .25 to .36 and the 95% confidence intervals ranged from 1.75 to 1.92 on a seven-point scale. In general, assessments from one observer should not be used to make important administrative decisions. Exceptions to this guideline are breakdowns of resident professional integrity and circumstances that allow an extensive observation period over a number of procedures and patients.
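The relationship between the single-rater and ten-rater figures can be checked with a short calculation. The sketch below assumes that the Spearman-Brown step-up (equivalent to the generalizability coefficient in a one-facet design) links the two coefficients and that the confidence band shrinks with the square root of the number of raters. Both are standard psychometric results rather than formulas stated in this report, and the function names are hypothetical.

```python
from math import sqrt


def step_up(g_single, n_raters):
    """Coefficient for a mean of n_raters, given the single-rater value."""
    return n_raters * g_single / (1 + (n_raters - 1) * g_single)


def band_for_mean(band_single, n_raters):
    """Confidence band for a mean of n_raters, given the single-rater band."""
    return band_single / sqrt(n_raters)


# Single-rater ranges reported for the intervention years.
for g_single in (0.25, 0.36):
    print(f"G(1) = {g_single:.2f} -> G(10) = {step_up(g_single, 10):.2f}")
for band_single in (1.75, 1.92):
    print(f"band(1) = {band_single:.2f} -> band(10) = {band_for_mean(band_single, 10):.2f}")
```

Under these assumptions, the stepped-up coefficients bracket .80 and the ten-rater bands fall between roughly 0.55 and 0.61, in line with the intervention-year values reported for ten evaluators.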

The diagnostic checklist developed for this study is useful for identifying system-based problems. After problems have been identified, taking a holistic CAS perspective of a residency program can help guide interventions. CAS assumes that individuals will function in a manner that “makes sense” to them. Thus, system-based improvement depends on being able to influence individual behaviors (e.g., some degree of social control). Improvement will be gradual (evolutionary) and is more likely to occur if faculty and residents perceive the changes to be compatible with their existing values. Despite their considerable autonomy, individual faculty also share a value system about the importance of assessing an individual resident’s performance. Moreover, faculty are not likely to write candid appraisals of residents’ performances unless they believe the program director will take administrative action based on their input. Because of these factors, the educational specialists in this study focused on improving information feedback loops (communication) to improve the performance assessment system. The feedback loops were verbal (e.g., reports at department meetings) and written (e.g., synthesis of performance assessment data prior to administrative reviews), and both informal (e.g., conversations) and formal (e.g., training new faculty). The educational specialists performed some information feedback loops that could otherwise be performed by electronic systems, such as reminding evaluators about delinquent rating forms and synthesizing data prior to administrative reviews.

Resident performance assessment is based on observations, interpretations, and judgments that repeat at multiple levels. For example, residents and their peers observe, interpret, and judge the problems of patients whom they jointly manage and then learn and improve by reflecting on their practice.13 Similarly, attendings observe, interpret, and judge residents’ performance in the context of their own perceptions of the patients whom they manage with the residents. At the same time that attendings are reflecting on practice, they are also reflecting about the residents whom they supervise. Ideally, they will communicate these reflections through numeric scores and behaviorally specific narrative comments. At the next level of the performance assessment system, the REC’s members interpret and judge residents’ performances based on numeric and narrative data from rating forms and their personal observations of the resident. Although the validity of a residency program’s administrative decisions is linked to interpretations and judgments at two levels (individual attendings and the REC), the performance assessment process should begin with individual residents and peers reflecting on their performances (self- and peer-assessment).

We believe that the global level of a resident’s performance can be assessed through careful daily observation by faculty in a clinical setting. Physicians are accustomed to making high-inference interpretations of behavior because the practice of medicine is an interpretive activity.14 Medicine is grounded in subjective knowledge, such as a physician’s understanding of a particular patient. We believe that attending physicians have an analogous understanding of particular residents with whom they have worked closely. This belief is supported by the findings of an intensive 18-month participant–observer study of a surgery residency program,15 which identified two features of social control in the program: knowledge of everyday performance is limited to members of the work group, and attending physicians have remarkable power and decide whether residents are trustworthy. We believe high-inference interpretations of behavior and social control are important factors to analyze when attempting to improve a resident performance assessment system at all levels from residents’ self-assessments to administrative decision making.

We began this study with the assumption that surgery faculty have numerous opportunities to observe residents directly in the operating room. We focused on increasing the incentives for writing candid performance assessments by improving the information feedback loops from the system to the department chair, the program director, and individual attending faculty. We hope that a well-functioning performance assessment system will provide a positive model to residents about how they should conduct self- and peer-performance assessment throughout their careers.

Generalizations from this study are limited by at least three constraints. First, surgeons have numerous opportunities for direct observation of residents’ performance, so these results may not apply to medical disciplines in which faculty have fewer opportunities for direct observation. Second, because we used global performance ratings, this study’s results clearly apply to the ACGME’s competency domain of patient care. More research is needed to verify the utility of global ratings for assessing performance related to the ACGME’s other competency domains, such as interpersonal and communication skills or professionalism. Third, the link between the independent variables (improving information feedback loops) and the dependent variables (quality assurance measures) is indirect. Other factors, such as the ACGME’s pronouncements about improving resident-evaluation processes, may have caused the improvements we observed in quality assurance measures.

The practical implications of this study can be summarized by four assertions. First, the diagnostic checklist from this study provides a framework for residency programs to critique their own performance assessment system and identify deficient areas. Second, because residency programs are complex adaptive systems, the quality of global performance ratings data can be improved by modifying information feedback loops to the many individuals who make up the system. Third, when resident performance assessment systems are functioning well, the global performance rating scores have excellent levels of numeric precision, and 60% to 80% of rotations will return forms that contain behaviorally specific narrative comments. Finally, a high-quality resident performance assessment system provides a positive model of professional accountability for residents to emulate during their careers.

This project was funded in part by a National Board of Medical Examiners (NBME®) Edward J. Stemmler, MD Medical Education Research Fund grant. The project does not necessarily reflect NBME policy, and NBME support provides no official endorsement.


1Timeline—working guidelines 〈〉. Accessed 7 February 2005. ACGME Outcome Project, Chicago, IL, 2000.
2American Educational Research Association, American Psychological Association, National Council on Measurement in Education, Standards for Educational and Psychological Testing. Washington DC: AERA, 1999;26:-7.
3Epstein RM, Hundert EM. Defining and assessing professional competence. JAMA. 2002;287:226–35.
4Landon BE, Normand ST, Blumenthal D, Daley J. Physician clinical performance assessment. JAMA. 2003;290:1183–9.
5Murphy KR, Cleveland JN. Understanding Performance Appraisal: Social, Organizational and Goal-Based Perspectives. Thousand Oaks, CA: Sage Publications;1995:4–5.
6Williams RG, Klamen DL, McGaghie WC. Cognitive, social and environmental sources of bias in clinical performance ratings. Teach Learn Med. 2003;15:270–92.
7Holmboe ES. Faculty and the observation of trainee’s clinical skills: problems and opportunities. Acad Med. 2004;79:16–22.
8Schwind CJ, Williams RG, Boehler ML, Dunnington GL. Do individual attendings’ post-rotation performance ratings detect residents’ clinical performance deficiencies? Acad Med. 2004;79:453–7.
9Littlefield J, Terrell C. Improving the quality of resident performance appraisals. Acad Med. 1997;72(10 suppl):S45–7.
10Miller WL, McDaniel RR, Crabtree BF, Stange KC. Understanding change in primary care practice using complexity theory. J Fam Pract. 1998;46:369–76.
11Miller WL, Crabtree BF, McDaniel RR, Stange KC. Practice jazz: understanding variation in family practices using complexity science. J Fam Pract. 2001;50:872–8.
12Cronbach LJ, Linn RL, Brennan RL, Haertel EH. Generalizability analysis for performance assessments of student achievement or school effectiveness. Educ and Psych Meas. 1997;57:373–99.
13Schon DA. Educating the Reflective Practitioner: Towards a New Design for Teaching and Learning in the Professions. San Francisco: Jossey Bass, 1987.
14Hunter KM. Doctors’ Stories: The Narrative Structure of Medical Knowledge. Princeton, NJ: Princeton University Press, 1991:xix.
15Bosk CL. Forgive and Remember: Managing Medical Failure. Chicago: University of Chicago Press, 1979:112.
Appendix 1
Diagnostic Checklist: Resident Performance Assessment System
© 2005 Association of American Medical Colleges