In-training assessment, observing medical trainees in the setting of day-to-day practice, remains central to clinical skills assessment.1 Accurate assessment of trainees' clinical skills with patients requires that faculty be able to identify skills performed well and those performed inadequately. However, clinical performance ratings often suffer from poor accuracy, poor reliability, and rating errors,2–4 with raters themselves being the largest component of variance.5 Yet while many studies describe poor accuracy and reliability of ratings, few have explored the factors that might underlie this rater variability. Several studies exploring the impact of rater gender, ethnicity, and experience have found negligible or small, but variable, effects on rater stringency.4,6,7 Faculty's own approach to clinical and/or interpersonal skills, especially if deficient,8 may compromise their ability to detect errors in the cognitive and noncognitive domains, but little research has investigated the effects of faculty's own clinical skills on their ratings of trainees' clinical skills.
Therefore, the objective of this study was to identify whether faculty characteristics and clinical skills impact their ratings of residents in clinical scenarios. We hypothesized that faculty with better clinical skills and more clinical and teaching experience would be more stringent raters, but that other demographic variables would have minimal impact. We further hypothesized that relationships would be stronger in competency-specific domains (i.e., faculty with more complete history-taking approaches would rate history taking more stringently).
Program directors from local university-based (n = 7) and community-based, university-affiliated (n = 9) internal medicine residency programs were e-mailed and asked to identify their general internal medicine outpatient faculty resident preceptors who would potentially be interested in participating in a study about trainee assessment. One hundred fourteen faculty were subsequently e-mailed and invited; on the basis of an a priori power calculation indicating that 45 would need to participate, we stopped recruitment at 48 faculty, oversampling to allow for dropout. Each faculty member participated on one of nine study days between March and August 2009, with three to six faculty members participating each day.
Faculty demographics and clinical skills
Prior to their assigned study day, faculty completed a Web-based demographic questionnaire eliciting age, gender, academic rank, experience as an outpatient preceptor (years), percent of practice time spent in outpatient care (excluding precepting) and in precepting, frequency of evaluating trainees in the outpatient setting using the mini-CEX or other assessments, and prior faculty development in clinical skills training (i.e., physical exam or communication skills training), observing residents in a clinical setting, or giving feedback.
On the study day, faculty completed eight 15-minute standardized patient (SP) encounters (previously validated for use with medical students and residents) focusing on common outpatient clinical scenarios. Although case validation with trainees introduces study bias because clinicians with higher levels of expertise may be penalized by checklists rewarding thoroughness,9 the skills incorporated into the validated cases are those we would expect faculty to detect and judge during observational assessment of their own trainees.
SPs completed case-specific content checklists (10–24 items per case rated yes/no) covering key questions and maneuvers for history taking, physical exam, and counseling. SPs also completed a 12-item rating form focusing on important process elements (e.g., permitting the patient to state the reason for the visit without being interrupted, getting illness chronology, using open- to closed-ended question style, repeating or summarizing information, empathy, and avoiding jargon). Six of these items were rated on a three-point scale (0 = unsatisfactory, 2 = very well done), and six were rated yes/no. The exception was the case on delivering bad news, which used a single 12-item checklist covering the communication skills needed for delivering bad news. For each case, SPs completed an overall patient satisfaction rating using a seven-point Likert scale (0 = extremely dissatisfied, 6 = extremely satisfied).
Standardized resident/SP videos
After completing the SP exam, faculty individually watched four standardized resident/SP videos in random order and completed a mini-CEX for each video, rating seven competencies (medical interviewing, physical examination, humanistic qualities/professionalism, clinical judgment, counseling skills, organization/efficiency, overall clinical competence) on a nine-point scale (1–3 = unsatisfactory, 4–6 = satisfactory, 7–9 = superior).
Each case was scripted to depict a PGY2 resident with a certain level of performance for checklist content items (unsatisfactory, satisfactory, or superior) and global skills (process items) (unsatisfactory, satisfactory, or superior). Initial error scripting (J.K.) was based on actual resident performance norms. Five additional investigators with experience in residency education reviewed scripts to confirm they reflected predetermined performance levels. Four volunteer medical trainees were each trained on a single script, practiced with the SP, and were videotaped once performance accurately represented the intended performance level. The study was approved by the University of Pennsylvania School of Medicine institutional review board, and all participants provided informed consent.
Descriptive statistics summarized faculty demographics and score performance. Faculty history taking, physical exam, and counseling content scores were calculated by first computing the percent of checklist items successfully completed in each domain for each case, and then percents were averaged across the cases that measured that domain (seven, two, and eight cases, respectively). The three domain scores were averaged to derive an overall content score. Faculty's process skills performance scores were determined by first computing the percent of items obtained (out of 12 possible items); percents were averaged across seven cases (one case did not have a process-rating form). The single patient satisfaction rating was averaged across all cases to yield an overall patient satisfaction score. Faculty mini-CEX ratings of residents for each of the seven competencies were averaged across the four video cases, except for physical exam skills because only one case was used. Generalizability theory was used to estimate the reliability coefficient for all faculty clinical skills scores and mini-CEX competency ratings.
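The content-score aggregation described above can be illustrated with a minimal Python sketch. The function names and the checklist data are hypothetical and are not taken from the study; the sketch assumes yes/no items and an unweighted average across cases and across the three domains.

```python
from statistics import mean


def percent_completed(items):
    """Percent of yes/no checklist items completed (True) for one case in one domain."""
    return 100.0 * sum(items) / len(items)


def domain_score(case_checklists):
    """Average the per-case percentages across the cases that measured a domain."""
    return mean(percent_completed(items) for items in case_checklists)


def overall_content_score(domain_scores):
    """Unweighted mean of the history-taking, physical exam, and counseling scores."""
    return mean(domain_scores.values())


# Hypothetical checklist results for one faculty member (True = item completed).
# Each inner list is one SP case; the real study used more cases and more items.
history = [[True, True, False, True], [True, False, False, True]]
physical = [[True, True], [True, False]]
counseling = [[True, False, True, True], [True, True, True, True]]

scores = {
    "history_taking": domain_score(history),
    "physical_exam": domain_score(physical),
    "counseling": domain_score(counseling),
}
overall = overall_content_score(scores)
print(scores, overall)
```

Averaging percentages within each case first, rather than pooling all items, gives every case equal weight regardless of how many checklist items it contains.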
Faculty demographics, content, process, and patient satisfaction performance scores were each correlated with the mean ratings faculty gave standardized residents on each of the mini-CEX competencies. A power calculation determined that the appropriate sample size to detect a moderate size correlation (r = 0.30), using a type I error rate of .05 and a type II error rate of .20 (i.e., 80% power), was 45 faculty participants.
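As an illustration of the correlational analysis, the following Python sketch computes a correlation coefficient and its associated t statistic. The article does not state which coefficient was used, so Pearson's r is an assumption here, and the function names and sample data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for testing whether r differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical data: one faculty performance score (percent) and the mean
# mini-CEX rating each faculty member gave (nine-point scale).
faculty_scores = [55.0, 62.0, 70.0, 78.0, 85.0]
mean_ratings = [6.5, 6.0, 5.5, 5.0, 4.0]

r = pearson_r(faculty_scores, mean_ratings)
t = t_statistic(r, len(faculty_scores))
print(round(r, 3), round(t, 3))
```

A negative r in this setup means that faculty with higher performance scores gave lower ratings on average, which is the pattern the study interprets as greater rater stringency.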
Of the 48 faculty who agreed to participate in the study, 44 (92%) completed the study (two faculty dropped out secondary to a personal conflict, one due to family illness, and one due to lack of hospital coverage). The mean age was 44.2 years old (SD = 8.7); 25 (57%) were men. Four (9%) were instructors, 19 (43%) were assistant professors, 15 (34%) were associate professors, and 4 (9%) were professors; 2 faculty (5%) did not report their rank. Twenty (46%) worked in community-based and 24 (54%) worked in university-based practices. Mean faculty outpatient resident precepting experience was 12.4 years (SD = 7.5). The mean percent of nonprecepting outpatient practice time was 46.2% (SD = 25%). Nineteen (43%) faculty had participated in a faculty development workshop in the previous five years targeting clinical skills training, 20 (46%) in a workshop targeting observing residents in a clinical setting, and 23 (52%) in a workshop targeting giving feedback to residents. In terms of prior experience using the mini-CEX to evaluate residents in the past year, 5 (11%) did not complete mini-CEX evaluations of residents, 20 (46%) completed between 1 and 10, 11 (25%) completed 11 to 20, and 8 (18%) completed more than 20.
Table 1 presents descriptive statistics for each clinical skills score and mini-CEX competency rating. Faculty tended to score lower on history taking than on the other content domains. Overall content and process scores were positively correlated (r = 0.45, P < .01). Generalizability coefficients suggested moderate reliability for all scores. The mean faculty ratings of each standardized resident case were similar to how that case was scripted (data not shown). Neither the number of years faculty had precepted residents nor any other faculty demographic variable was significantly correlated with faculty content and process performance scores or with any of their mini-CEX competency ratings (P > .05).
Significant negative correlations were observed between faculty's history-taking content performance scores and their mini-CEX ratings in interviewing skills (r = −0.55, P < .01) and organization skills (r = −0.35, P < .05) (Table 1). This suggests that faculty with more complete history-taking content scores exhibited greater rater stringency in the interviewing and organization domains. Higher overall content performance scores were associated with rater stringency in interviewing (r = −0.34, P < .05). Correlations between faculty process performance scores and their mini-CEX ratings were also consistently negative, with significant correlations between process scores and interviewing (r = −0.41, P < .01), physical exam (r = −0.42, P < .01), and organization (r = −0.36, P < .05), suggesting that faculty with higher process performance scores exhibited greater rater stringency. Correlations between faculty's patient satisfaction scores and their mini-CEX ratings were also consistently negative, with significant correlations between patient satisfaction and interviewing (r = −0.30, P < .05), judgment (r = −0.31, P < .05), and organization (r = −0.34, P < .05), suggesting that faculty who received higher patient satisfaction ratings exhibited greater rater stringency.
Previous work focusing on clinical performance ratings suggests that evaluators use different criteria and standards to judge clinical performance, differentially weight these aspects of performance, and variably define what is acceptable.2,10 However, evaluator characteristics potentially influencing what is attended to during observation are not well understood. We found that faculty rating behaviors were not significantly correlated with demographic characteristics such as age, gender, or clinical or precepting experience; however, rating behaviors were moderately correlated with faculty's SP exam performance. Faculty clinical skills seemed more important than experience in rating behavior. Although the correlations must be interpreted with caution, these findings provide some of the first direct evidence that faculty's clinical skills behavior affects their rating behavior. We postulate that faculty use their own practice style as their frame of reference during observation. As such, those with better performance scores may be more attentive to the presence or absence of specific history, exam, and interpersonal skills which then impacts their ratings. Practicing physicians often have deficient clinical skills,8 and given the range of faculty performance scores in this study, our findings suggest the same may be true for some faculty who play a key role in supervising trainees. How this impacts trainee clinical skill development is unknown but is worthy of further investigation. Although checklists are a less effective method for evaluating practicing physicians who need to ask fewer questions in an encounter to arrive at a diagnosis,9 the checklist items were important to each case and were items that would be important for trainees to perform competently and for faculty to be able to detect and assess. Additionally, the process skills are important for all physicians because of these skills' effects on patients.
There are several limitations. Importantly, because the faculty SP cases were developed for trainees, higher-scoring faculty may have been those with more “student-like” (i.e., novice) skills. Nevertheless, faculty process scores and overall patient satisfaction ratings, potentially better indicators of faculty expertise,9 demonstrated similar relationships to their rating behaviors. Future studies using cases validated for use with faculty are needed to ascertain reproducibility of our findings. Participants were self-selected, mostly general internists and therefore not representative of all faculty evaluators, so findings will need to be verified with other groups. However, faculty were recruited from multiple university- and community-based residency training programs and had a range of clinical and precepting experience. Although some of our outcome scores lacked levels of reliability necessary for high-stakes decisions, the majority of measures had sufficient reliability to explore the relationship between faculty skills and rating behavior. It is unknown to what degree the SP encounters primed faculty in their ratings and whether findings would persist in the absence of this experience. Although it is unknown whether findings would be similar when faculty rate trainees they know (which may invoke an emotional response that can affect judgment)11 with actual patients, our work does help to identify patterns of faculty rating behavior. Finally, this study only focuses on the faculty ratings of trainees and not the specific details of the actual observations that produced the rating. Therefore, additional research is needed to compare the actual observations faculty make that, in turn, influence their ratings.
What are the potential implications of these findings? Faculty observation and high-quality feedback remain essential to residents' clinical skills development. However, feedback is highly dependent on the quality of the evaluation, and rater stringency contributes more to measurement error than case specificity.12 Our findings provide some of the first direct evidence that physicians' own clinical skills behavior explains some of the variance in ratings. Unlike age and gender, however, faculty clinical skills may be modifiable.13 If our findings are replicated, faculty development in performance assessment may need to target not only training in observation and feedback but also faculty's own clinical skills. Like all skills, direct observation is a competency that requires training and deliberate practice. We hope that continued exploration of previously neglected factors affecting faculty ratings of trainees' clinical skills might ultimately lead to more effective faculty development.
The authors wish to thank participating faculty; the Drexel University College of Medicine's standardized patient program; the University of Pennsylvania School of Medicine's Standardized Patient Program and the Penn Medicine Clinical Simulation Center; and the standardized patients and residents.
This study was funded by the American Board of Internal Medicine (ABIM), from which Dr. Kogan also receives salary support.
Eric Holmboe is employed by the ABIM and receives royalties for a textbook on assessment of clinical competence from Mosby-Elsevier.
The abstract of an earlier version of this article was presented at the Ottawa 2010 Conference, Miami, Florida, May 2010.
1. Govaerts MJ, van der Vleuten CP, Schuwirth LW, Muijtjens AM. Broadening perspectives on clinical performance assessment: rethinking the nature of in-training assessment. Adv Health Sci Educ Theory Pract. 2007;12:239–260.
2. Williams RG, Klamen DA, McGaghie WC. Cognitive, social, and environmental sources of bias in clinical performance ratings. Teach Learn Med. 2003;15:270–292.
3. Woehr DJ, Huffcutt AI. Rater training for performance appraisal: A quantitative review. J Occup Organ Psychol. 1994;67:189–205.
4. McManus IC, Thompson M, Mollon J. Assessment of examiner leniency and stringency (‘hawk-dove effect’) in the MRCP (UK) clinical examination (PACES) using multi-facet Rasch modeling. BMC Med Educ. 2006;6:42.
5. Downing SM. Threats to the validity of clinical teaching assessments: What about rater error? Med Educ. 2005;39:350–355.
6. Carline JD, Paauw DS, Thiede KW, Ramsey PG. Factors affecting the reliability of ratings of students' clinical skills in a medicine clerkship. J Gen Intern Med. 1992;7:506–510.
7. Weingarten MA, Polliack MR, Tabenkin H, Kahan E. Variations among examiners in family medicine residency board oral examinations. Med Educ. 2000;34:13–17.
8. Ramsey PG, Wenrich MD, Carline JD, Inui TS, Larson EB, LoGerfo JP. Use of peer ratings to evaluate physician performance. JAMA. 1993;269:1655–1660.
9. Hodges B, Regehr G, McNaughton N, Tiberius R, Hanson M. OSCE checklists do not capture increasing levels of expertise. Acad Med. 1999;74:1129–1134.
10. Noel GL, Herbers JEJ, Caplow MP, Cooper GS, Pangaro LN, Harvey J. How well do internal medicine faculty members evaluate the clinical skills of residents? Ann Intern Med. 1992;117:757–765.
11. Albanese MA. Challenges in using rater judgments in medical education. J Eval Med Educ. 2000;6:305–319.
12. Margolis MJ, Clauser BE, Cuddy MM, et al. Use of the mini-clinical evaluation exercise to rate examinee performance on a multiple-station clinical skills examination: A validity study. Acad Med. 2006;81(10 suppl):S56–S60.
13. Hauer KE, Ciccone A, Henzel TR, et al. Remediation of the deficiencies of physicians across the continuum from medical school to practice: A thematic review of the literature. Acad Med. 2009;84:1822–1832.