Economics, Education, and Policy: Research Report

An Instrument Designed for Faculty Supervision Evaluation by Anesthesia Residents and Its Psychometric Properties

Section Editor(s): Dexter, Franklin

de Oliveira Filho, Getúlio R. PhD; Dal Mago, Adilson José MD; Garcia, Jorge Hamilton Soares MD; Goldschmidt, Ranulfo MD

doi: 10.1213/ane.0b013e318182fbdd

Supervision is probably the most influential variable in residency as it assures high standards of care, patient safety, and high quality education.1 Supervision also provides residents the opportunity to incorporate the virtues of good supervisors acting as role models.2,3 Residents value supervision, as demonstrated in a study in which residents who felt themselves unsupervised during most of their patient care activities rated faculty lower than those not experiencing such feelings.4 Although its central role in the effective training of physicians is acknowledged, supervision has been the least researched aspect of medical education.5

According to the Accreditation Council for Graduate Medical Education, clinical supervision is a required faculty activity that must be confidentially evaluated by residents at least annually.6 However, there is a paucity of published instruments for resident evaluation of faculty supervision in anesthesiology.

This study aimed to develop a valid and reliable instrument for measuring faculty supervision by anesthesia residents. We hypothesized that constructs emerging from individuals involved in the supervisory process (residents and faculty) could provide valid content to the instrument.

Given the relevance and the sensitive nature of resident-generated faculty evaluation,1,6 the sources of measurement error must be clearly understood.7–9 Indeed, it has been demonstrated that faculty evaluations can be flawed by response or rater biases, which adversely affect the reliability of the measures and/or the measure-based relative ranking of faculty.8,10,11 Halo error, the most pervasive type of rater bias,12 is conceptualized as positive correlations in residents’ unique perceptions of a given instructor on multiple attributes.8 Halo error can be disclosed by identifying the sources of variance associated with a given score.13–15 A significant amount of variance attributed to the resident (rater) - instructor (ratee) interaction term characterizes halo error.8 In this context, it was hypothesized that differences among residents’ ratings of faculty supervision could arise from (a) actual differences in instructors’ supervisory abilities; (b) items of the instrument; (c) residents rating each instructor, a term that represents the influence of resident’s unique perceptions of instructors (halo effect); and/or (d) interactions among these factors. The second aim of the study was to test the psychometric properties of the instrument and to disclose the sources of error in faculty evaluations.


The IRB approved the study, and informed consent was obtained from participants. This study comprised two phases: (I) a qualitative study to identify the content for the measurement instrument,16 and (II) psychometric analyses under both classical test17 and generalizability theories.18 The terms “instructors” and “faculty” refer to attending anesthesiologists providing clinical supervision to residents at an academic, university-affiliated anesthesia training program.

Phase I: Qualitative Study and Item Construction

Instructors and residents from four independent academic institutions comprised the convenience sample of the study. Data were collected on an Internet-based electronic form (MS FrontPage, Microsoft, Redmond, WA) from August through November 2005. Participants listed instructors’ behaviors that characterized good and poor supervision. Responses were imported into a text-mining software (QDA Mining v. 1.3.1, Provalis Research, Montreal) and independently codified by the authors. Consensus on final codes was reached after a three-round Delphi process as described elsewhere.19 Data collection was stopped when the point of data saturation was attained (i.e., responses no longer resulted in new data).16 Jaccard’s coefficients, which measure code similarity, were used to merge nearby codes into dimensions. Inter-rater reliability coefficients were calculated. Compiling and organizing the sentences within each dimension resulted in their respective definitions, which were included as items in the questionnaire applied in the second phase of the study.
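The merging step can be illustrated with a minimal sketch of Jaccard's coefficient. It assumes that each code is represented by the set of responses to which it was applied; the code names, response IDs, and the 0.5 merging threshold are hypothetical and are not taken from the study.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| (0 = disjoint, 1 = identical)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical data: for each code, the IDs of the responses it was applied to.
code_occurrences = {
    "available_when_called": {1, 2, 3, 5, 8},
    "present_in_critical_phases": {2, 3, 5, 8, 9},
    "gives_feedback": {4, 6, 7},
}


def similar_pairs(occurrences, threshold=0.5):
    """Return code pairs whose Jaccard coefficient reaches the (assumed) threshold."""
    names = sorted(occurrences)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(occurrences[first], occurrences[second])
            if score >= threshold:
                pairs.append((first, second, round(score, 3)))
    return pairs


candidates = similar_pairs(code_occurrences)
```

Pairs returned by `similar_pairs` would be candidates for merging into one dimension; the final grouping would still rest on the coders' judgment.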

Phase II: Psychometric Analyses

An electronic, Internet-based, password-protected form was created (MS FrontPage, Microsoft, Redmond, WA), containing the nine-item questionnaire rated on 4-point Likert scales, each item representing one dimension of supervision (Table 1). We used a forced-choice response scale (a Likert scale) with an even number of choices, i.e., without a middle point (“neutral” or “undecided”) to polarize responses. Two criterion items were added to the form: (a) “The supervision you receive from this instructor generally 1) stands below, 2) fulfills or 3) exceeds your expectations,” as a measure of the overall perception of the quality of the supervision provided by each instructor relative to residents’ expectations; and (b) “If you were an instructor of anesthesia residents, you would provide 1) poor or 2) good supervision if you emulated him/her (the instructor),” as a measure of residents’ perceptions of the instructor as a role model.3 From March through December 2006, monthly e-mail messages were sent to the 19 CA1–CA3 residents at the authors’ institution asking them to voluntarily and anonymously rate the instructors from whom they had received supervision in the previous month, according to their logbooks. Chang and Hocevar20 have provided evidence that, for obtaining dependable measures of faculty ratings, each faculty member should be evaluated by a minimum of 10 raters. Data collection aimed to reach this sample size.

Table 1: Items of the Faculty Supervision Evaluation Form with the Respective Factor Loadings

Individual assessments were considered units of analysis, each comprising scores on the nine items of the questionnaire, an average score, and the responses to both criterion items. Cronbach’s α coefficient and principal components factor analysis tested the internal consistency and the factorial structure of the questionnaire, respectively.
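For readers unfamiliar with the internal-consistency statistic, the following is a minimal sketch of Cronbach's α computed from a matrix of item scores (one row per evaluation, one column per item). It uses the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores); the function name and the toy ratings are hypothetical, and this is not the software routine used in the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of evaluations, each a list of k item scores."""
    k = len(rows[0])
    # Sample variance of each item across evaluations.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the summed score of each evaluation.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings on a 4-point scale (three items shown for brevity).
toy_ratings = [
    [4, 4, 3],
    [3, 3, 3],
    [2, 3, 2],
    [4, 3, 4],
    [1, 2, 1],
]
alpha = cronbach_alpha(toy_ratings)
```

Values close to 1 indicate that the items vary together across evaluations, which is what a single summed or averaged score requires.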

A generalizability (G) study was performed to disclose variance components and test the reliability of measures (scores). A G-study is a method of estimating the amount of variance added to scores by individual components of the study and their interactions. A generalizability (G) coefficient is calculated as a measure of reliability for relative decisions (e.g., ranking); a dependability (phi) coefficient tests the dependability of the instrument for absolute decisions (e.g., the minimum number of residents necessary to reliably evaluate one instructor).18 As the G-study required the same number of instruments for every faculty member, an electronically generated random sample of eight instruments per faculty member was created (MS Excel 2000, Microsoft, Redmond, WA), containing data of 38 instructors evaluated by 18 residents in different combinations. One resident who provided only two evaluations and one instructor who received only five evaluations were excluded from analyses. For the G-study, instructors were the objects of measurement. Residents (raters) were considered a random facet, since the study sample was smaller than the total universe of residents with similar characteristics to those participating in the study. Items were considered a fixed facet, since the nine items comprised the whole questionnaire. As different combinations of residents (r) evaluated each instructor (p) on every item (i) of the instrument, residents were nested (:) within instructors and crossed (×) with items. This (r:p) × i(fixed) design matched our interest in determining the characteristics of the measurement of instructors’ supervision, as defined by the nine-item instrument and rated by any anesthesia resident similar to those participating in the study. A decision study (D-study) estimated the minimum number of residents needed to reliably evaluate one instructor at phi = 0.75–0.95, i.e., 75% to 95% reproducibility or dependability of measures.
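The logic of the D-study can be sketched as follows. This is a simplified illustration, assuming that scores are averaged over the fixed items so that only two variance components remain: one for instructors and one for residents nested within instructors. In such a nested design the resident main effect cannot be separated from the resident-instructor interaction, so the error terms for relative and absolute decisions coincide. The function names and the variance values are hypothetical, and the sketch is not expected to reproduce the coefficients reported in this study.

```python
def phi_coefficient(var_instructor, var_resident_within_instructor, n_residents):
    """Dependability of an instructor's mean score over n_residents raters."""
    # Error variance shrinks as more residents rate the same instructor.
    error = var_resident_within_instructor / n_residents
    return var_instructor / (var_instructor + error)


def min_residents(var_instructor, var_resident_within_instructor,
                  target_phi, max_n=1000):
    """Smallest number of residents whose mean rating reaches target_phi."""
    for n in range(1, max_n + 1):
        if phi_coefficient(var_instructor,
                           var_resident_within_instructor, n) >= target_phi:
            return n
    raise ValueError("target_phi not reached within max_n residents")


# Hypothetical variance components (not the estimates from Table 2).
VAR_P = 0.3    # instructors (object of measurement)
VAR_RP = 0.2   # residents nested within instructors (error)

needed = min_residents(VAR_P, VAR_RP, target_phi=0.8)
```

Because the error term is divided by the number of raters, each additional resident yields a smaller gain in dependability than the previous one, which is why high targets require disproportionately many raters.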

Another G-study assessed the effect of measurement occasions on the reliability of measures on a random sample of seven instructors (p) assessed on three occasions (o) by five different residents (r). All facets were considered random and crossed (p × r × o). Variance components were calculated by the minimum norm quadratic unbiased estimation method (SPSS v.12, SPSS Inc., Chicago, IL).

When assessing faculty on multiple attributes, some residents’ judgments may be biased, resulting in disagreement among raters due either to differential interpretations of the rating scale or to their unique perceptions of individual faculty members. Such biases may be rater-specific (e.g., leniency/severity) or dyad-specific (e.g., reverence or hostility toward some staff members). These biases increase the resident component (rater-specific bias), the resident-instructor interaction component of score variance (dyad-specific bias), or both. As a consequence, they may affect the relative measure-based ranking of instructors.8 To test this possibility, the responses of each possible pair of residents among the 19 (19 choose 2 = 171 pairs) were successively eliminated from the original database and instructors’ average scores were recalculated, as suggested by Love et al.,10 generating 171 recalculated scores per instructor. Instructors were ranked into 4 categories: 1) >75th percentile, 2) ≤75th and ≥50th percentiles, 3) <50th and ≥25th percentiles, or 4) <25th percentile of each set of scores. The stability of instructors’ relative ranking according to the recalculated scores as compared to the original ranking order was estimated by the percentage of agreement. Paired Student’s t and Wilcoxon tests assessed the differences between original and recalculated scores. Unpaired Student’s t-test, Mann-Whitney U, one-way analysis of variance, and Kruskal-Wallis analysis of variance compared item and average scores between levels of the criterion variables. Spearman’s R coefficient tested the correlations between item and average scores and levels of criterion variables. Significance was set at P < 0.05.
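The elimination procedure can be sketched in a few lines. The example below assumes that the data are available as (resident, instructor, average score) records; the function name and the toy data are hypothetical. With 19 residents, the number of pairs is 19 × 18 / 2 = 171, which matches the number of recalculated scores per instructor stated above.

```python
from itertools import combinations
from math import comb
from statistics import mean


def leave_two_out_scores(evaluations):
    """Recalculate each instructor's mean score with every pair of residents removed.

    evaluations: list of (resident_id, instructor_id, average_score) tuples.
    Returns {instructor_id: [mean score for each dropped pair of residents]}.
    """
    residents = sorted({r for r, _, _ in evaluations})
    instructors = sorted({p for _, p, _ in evaluations})
    recalculated = {p: [] for p in instructors}
    for dropped in combinations(residents, 2):
        for p in instructors:
            kept = [s for r, q, s in evaluations if q == p and r not in dropped]
            if kept:  # skip if no evaluation of this instructor remains
                recalculated[p].append(mean(kept))
    return recalculated


n_pairs = comb(19, 2)  # number of resident pairs among 19 residents

# Hypothetical toy data: 4 residents each rating 2 instructors.
toy = [(r, p, 2.0 + 0.25 * r + 0.5 * p) for r in range(4) for p in range(2)]
result = leave_two_out_scores(toy)
```

Large swings in an instructor's recalculated scores would suggest that the ranking depends on which residents happened to rate that instructor.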


Phase I: Qualitative Study

Fourteen residents and 12 instructors from the authors’ institution, and five residents and five instructors from three nonaffiliated institutions participated in the qualitative study. Responses resulted in 251 statements. Sixty-nine statements from faculty and 160 statements from residents referred to adequate supervision. Thirty-eight statements from faculty and 91 from residents referred to poor supervision. At the end of the Delphi process, 23 codes were created for good supervision, and 20 codes designated poor supervision. Based on Jaccard’s coefficients, nearby codes were merged into nine dimensions (Table 1). Intercoder reliability was 0.91.

Phase II: Psychometric Analyses

Residents submitted 970 instructor evaluations. After removing incompletely answered instruments, 954 questionnaires (98%) were available for analyses. This final data set contained evaluations of 39 instructors provided by 19 residents. Each resident evaluated the same instructor on 1 through 8 occasions (median = 1; 25th–75th percentiles = 1–2). The median number of evaluations submitted by each resident was 38 (25th–75th percentiles = 16–80). The median number of evaluations per instructor was 18 (25th–75th percentiles = 16–36). Both parametric and nonparametric tests yielded the same results. Unless otherwise indicated, data are summarized by the mean ± sd, and the respective P values refer to parametric tests.

Cronbach’s α coefficient was 0.93. Average inter-item correlation was 0.60 (range, 0.32–0.81), and average item-total correlation was 0.75 (range, 0.56–0.84). A single factor (eigenvalue, 5.93) comprised all items and explained 66% of the total variance (Table 1). This single-factor structure suggests that one average or summed score, including the nine items of the instrument, can be used as a measure of supervisory ability.
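As a consistency check on these figures, two standard relations can be worked through with the reported values. The first is the share of total variance explained by a principal component when nine standardized items are analyzed; the second is the standardized α implied by the average inter-item correlation. Whether the reported α was computed in standardized form is an assumption made here for illustration only.

```latex
% k = 9 items, first eigenvalue \lambda_1 = 5.93, mean inter-item correlation \bar{r} = 0.60
\[
\frac{\lambda_1}{k} = \frac{5.93}{9} \approx 0.66
\]
\[
\alpha_{\text{std}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}
  = \frac{9 \times 0.60}{1 + 8 \times 0.60}
  = \frac{5.4}{5.8} \approx 0.93
\]
```

Both results agree with the reported 66% of explained variance and the α coefficient of 0.93.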

Table 2 summarizes the generalizability studies. The fully crossed design p × r × i (p = instructors as ratees, r = residents as raters, i = items) that preceded the nested design18 showed that the main variance components were due to instructors, the instructor-resident interaction, and the resident-instructor-item interaction. The estimated variance components for items and for the item-resident interaction approached zero, indicating that residents’ ratings were not sensitive to item phrasing or content (face validity). High G- and phi-coefficients (0.93) indicated adequate reliability and dependability of the measures. With residents nested within instructors and crossed with fixed items, reliability and dependability slightly increased (G- and phi-coefficients = 0.96). The instructors’ facet and the resident-instructor interaction term accounted for the total variance of scores. Figure 1 shows the estimated dependability (phi) coefficients according to the number of residents evaluating faculty. Dependability of measures in the 75% to 95% range can be expected with 3 to 33 residents rating one faculty member, respectively.

Table 2: Summary Statistics of the Generalizability (G) Studies: Variance Component Estimates, G and D Coefficients
Figure 1.

In the G-study designed to assess the test-retest reliability of the instrument, the variance of average scores was partitioned among instructors (42%), raters (30%), and the instructor-resident interaction (28%). Measurement occasions and the interaction components instructor-by-occasion, resident-by-occasion, and instructor-by-resident-by-occasion did not contribute to the average score variance. The G-coefficient was 0.97, indicating that no gain in reliability of measures should occur if a resident evaluates a given faculty member more than once.

Instructors’ average scores ranged from 1.93 to 3.86. According to the ranking based on quartiles, 10 instructors (26%) had average scores >75th percentile (quartile limit = 3.6); 10 (26%) ≤75th percentile and ≥50th percentile (quartile limit = 3.38); 9 (23%) <50th percentile and ≥25th percentile (quartile limit = 3.1); and 10 instructors (26%) < 25th percentile. The percentage of agreement between the original rank of instructors and the ranking based on the 171 recalculated scores was 86%. The mean difference between the original and the recalculated scores was 0.009 ± 0.19 (95% CI = 0.004–0.013; P < 0.01).
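The quartile categories and the percentage of agreement can be sketched as below. The percentile method (the "inclusive" option of Python's `statistics.quantiles`) is an assumption, since the study does not state how percentiles were interpolated; the function names and example scores are hypothetical.

```python
from statistics import quantiles


def quartile_category(score, scores):
    """Category 1-4 of a score within a set of scores, using the study's cut rules."""
    q25, q50, q75 = quantiles(scores, n=4, method="inclusive")
    if score > q75:
        return 1  # >75th percentile
    if score >= q50:
        return 2  # <=75th and >=50th percentiles
    if score >= q25:
        return 3  # <50th and >=25th percentiles
    return 4      # <25th percentile


def percent_agreement(original, recalculated):
    """Percentage of instructors whose category is unchanged."""
    same = sum(a == b for a, b in zip(original, recalculated))
    return 100.0 * same / len(original)


# Hypothetical average scores for five instructors.
example_scores = [1, 2, 3, 4, 5]
categories = [quartile_category(s, example_scores) for s in example_scores]
```

Applying `quartile_category` to the original scores and to each set of recalculated scores, then averaging `percent_agreement` over all sets, would give one overall agreement figure of the kind reported here.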

The overall perception of the quality of the supervision provided by the instructors relative to residents’ expectations was rated as below expectations in 130 evaluations (13%), fulfilling expectations in 589 evaluations (62%), and exceeding expectations in 235 evaluations (25%). Residents rated instructors as negative role models in 135 evaluations (14%). Spearman’s rank R was 0.68 between criterion items (P < 0.01). Tables 3 and 4 show the average scores on each item and the average questionnaire scores obtained in instructors’ evaluations according to the response categories (scores) of the criterion items. Significant differences were found at between-categories comparisons. Item and average scores significantly correlated to criterion item scores (Table 5).

Table 3: Average Scores Obtained in Instructors' Evaluations According to the Criterion-Item Regarding Residents' Overall Perception of the Quality of the Supervision Provided by the Instructor Relative to Their Expectations
Table 4: Average Scores Obtained in Instructors' Evaluations According to the Criterion-Item Regarding Residents' Perceptions of the Instructor as a Supervisor Role Model
Table 5: Correlations Between Average and Item Scores and Scores on Criterion Variables


The main result of the present study was that the measurement instrument of faculty supervision we developed based on grounded theory provided highly dependable measures, with adequate face and content validities, internal consistency and unidimensional factor structure. However, measures were biased by the halo effect.

The qualitative analyses reached the saturation point before the completion of the data collection process with high inter-rater agreement on final categories, suggesting that the resulting elements of good supervisory ability reflected the specific intended domain of content (content validity).17 Moreover, except for “presence during the critical phases of the anesthetic procedure,” all dimensions of supervision identified in the qualitative study were also identified in other studies of clinical teaching effectiveness,3,10,21–23 which provides further evidence of the content validity.

When analyzed under the generalizability theory, the instrument provided highly reliable measures of instructors’ supervisory abilities for both relative and absolute decisions, as indicated by the respective G- and D-coefficients. Test-retest reliability was evidenced by the lack of effect of measurement occasions on score variance. The variance component associated with instructors’ supervisory abilities accounted for 56% of the variance of residents’ ratings, indicating that actual differences regarding supervisory abilities among instructors were responsible for the greatest component of score variance. However, a significant variance component (44%) was due to the resident-instructor interaction (dyadic variance). This variance component is attributable to each resident’s unique perceptions of instructors.8 This dyadic effect suggests that measures obtained might be affected by factors not measured by the instrument, such as leniency or severity in assessments of specific instructors, sympathy or antipathy for specific faculty members,8 reverence or hostility toward some instructors,10 fear of retaliation or concerns about eventual damage to the relationship with the instructor.24

Significant amounts of dyadic variance, sometimes exceeding the ratee variance component, have been consistently found in generalizability studies addressing students’ evaluations of faculty. Mazor et al.,15 using a rater-nested-within-ratees-crossed-with-items (r:p × i) design, found that 45% of score variance was due to inter-ratee differences, whereas the rater-ratee interaction term was responsible for 47% of the total variance. James et al.13 assessed the dependability of students’ evaluations of diverse clinical training sites using a rater-nested-within-sites-crossed-with-items (r:s × i) design. They found that, although the sites component accounted for 8% to 17% of the total variance, the raters-by-site interaction term was responsible for 25% to 54% of the total variance. Kreiter and Lakshman14 investigated student-generated faculty teaching evaluations with a rater-nested-within-lectures-crossed-with-items (r:l × i) generalizability study. The lecture (teacher) facet as the object of measurement accounted for 26% of the score variance, whereas 56% of the total score variance was attributed to the interaction between students and the lecture they evaluated.

Dyadic-specific variance indicates halo error.8 In his classical article, Thorndike9 characterized halo error (or effect) as a strong tendency of raters to “think of the person in general as rather good or rather inferior and to color the judgment of the separate qualities by this general feeling,” resulting in inflated correlations across the assessed dimensions given raters’ inability to treat an individual as a compound of separate qualities. Other investigators25 assessed the relationship between halo effect and raters’ awareness of the influence of global evaluations on ratings of attributes and concluded that such relationship occurs at the unconscious level. Both views might explain the observed resistance to estimation and correction of halo error, since bias contributed by any given rater varies from one ratee to another.8

Risucci et al.11 assessed the accuracy of surgical residents’ evaluations of faculty. They found that, although accurate and reliable (Cronbach’s α coefficient = 0.97), evaluations were biased by the halo effect because high inter-item correlations were found in residents’ evaluations of faculty. In the present study, significant inter-item and average score-criterion-items correlations suggest that residents may have rated similarly across items of the questionnaire, depending on their overall impression about the instructor they evaluate.

Improving item clarity reduces the probability of occurrence of halo effect.11 We found no effect on score variance attributable to the item facet or to the rater-item interaction term, suggesting that items in the questionnaire were clear enough and well understood by residents (adequate face validity).

Halo error may compromise the relative ranking of instructors depending on which residents evaluate them.8,10 Indeed, the successive elimination of two residents’ evaluations at a time caused some instability in the instructors’ recalculated scores, as the differences between original and recalculated scores, albeit small, were statistically significant and caused the 14% disagreement rate between original ranks and ranks based on recalculated scores found in the study.

Raters’ anonymity increases the reliability of faculty evaluations.24 As residents’ anonymity was assured, this study lacks information on certain demographic variables, such as gender, age, and especially year-in-residency, which might affect the reliability of residents’ ratings.11 Although the content of the instrument was based on data provided by instructors and residents at four independent institutions, psychometric analyses included residents at a single program. This limitation must be acknowledged when generalizing our results to distinct educational settings.

The instrument we developed produced reliable measures of faculty supervision. However, a significant amount of halo effect biased such measures. G-studies may help identify the type and magnitude of rater biases affecting resident-generated faculty supervision evaluations, and can be useful for interpreting their results,8 especially if personnel decisions (e.g., tenure, promotion) rely on such measures.


The authors thank Drs. Fabrício Cochlar Medeiros (CA3) and Paulo Rodrigo Frasseto (CA3) for their assistance during the qualitative phase of the study.


1.Association of American Medical Colleges (AAMC). AAMC policy guidance on graduate medical education. Washington DC: Association of American Medical Colleges, 2001
2.Leach DC. Supervision: nine helpful principles and a story. Chicago: ACGME Bulletin, 2005
3.Maker VK, Curtis KD, Donnelly MB. Are you a surgical role model? Curr Surg 2004;61:111–5
4.Oliveira Filho GR, Sturm EJ, Sartorato AE. Compliance with common program requirements in Brazil: its effects on resident’s perceptions about quality of life and the educational environment. Acad Med 2005;80:98–102
5.Kilminster S, Cottrell D, Grant J, Jolly B. AMEE Guide No. 27: Effective educational and clinical supervision. Med Teach 2007;29:2–19
6.Accreditation Council for Graduate Medical Education (ACGME). Accreditation Council for Graduate Medical Education glossary of terms. Chicago: Accreditation Council for Graduate Medical Education (ACGME), 2007
7.Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med 2004;19:971–7
8.Hoyt WT. Rater bias in psychological research: when is it a problem and what can we do about it? Psychol Methods 2000;5:64–86
9.Thorndike EL. A constant error in psychological ratings. J Appl Psychol 1920;4:25–9
10.Love MB, Revesz G, Degnan JW. Teaching effectiveness in the radiology residency: a method of evaluation by residents. Invest Radiol 1990;25:287–91
11.Risucci DA, Lutsky L, Rosati RJ, Tortolani AJ. Reliability and accuracy of resident evaluations of surgical faculty. Eval Health Prof 1992;15:313–24
12.Jacobs R. A closer look at halo error in performance ratings. Acad Manag J 1985;28:201–12
13.James PA, Kreiter CD, Shipengrover J, Crosson J, Heaton C, Kernan J. A generalizability study of a standardized rating form used to evaluate instructional quality in clinical ambulatory sites. Acad Med 2001;76:S3–S5
14.Kreiter CD, Lakshman V. Investigating the use of sampling for maximizing the efficiency of student-generated faculty teaching evaluations. Med Educ 2005;39:171–5
15.Mazor K, Clauser B, Cohen A, Alper E, Pugnaire M. The dependability of students’ ratings of preceptors. Acad Med 1999;74:S19–S21
16.Pope C, Ziebland S, Mays N. Qualitative research in health care. Analysing qualitative data. BMJ 2000;320:114–16
17.Trochim W. Research method knowledge basis. Cincinnati, OH: Atomic Dog Publishing, 2001
18.Shavelson RJ, Webb NM. Generalizability theory. A primer. Newbury Park: Sage Publications, 1991
19.Oliveira Filho GR, Schonhorst L. The development and application of an instrument for assessing resident competence during preanesthesia consultation. Anesth Analg 2004;99:62–9
20.Chang L, Hocevar D. Models of generalizability theory in analyzing existing faculty evaluation data. Appl Meas Educ 2000;13:255–75
21.Beckman TJ, Cook DA, Mandrekar JN. What is the validity evidence for assessments of clinical teaching? J Gen Intern Med 2005;20:1159–64
22.Irby DM. Clinical teacher effectiveness in medicine. J Med Educ 1978;53:808–15
23.Kilminster SM, Jolly BC. Effective supervision in clinical practice settings: a literature review. Med Educ 2000;34:827–40
24.Afonso NM, Cardozo LJ, Mascarenhas OA, Aranha AN, Shah C. Are anonymous evaluations a better assessment of faculty teaching performance? A comparative analysis of open and anonymous evaluation processes. Fam Med 2005;37:43–7
25.Nisbett RD, Wilson TD. The halo effect: evidence for unconscious alteration of judgments. J Pers Soc Psychol 1977;35:250–6
© 2008 International Anesthesia Research Society