Economics, Education, and Policy: Research Report

Written Comments Made by Anesthesia Residents When Providing Below Average Scores for the Supervision Provided by the Faculty Anesthesiologist

Dexter, Franklin MD, PhD*; Szeluga, Debra MD, PhD; Masursky, Danielle PhD; Hindman, Bradley J. MD

Supervision of anesthesiology residents is a major daily responsibility of faculty (academic) anesthesiologists. The word “supervision” is not used here as a US billing term. Rather, we use the term “supervision” to include all clinical oversight functions directed toward ensuring the quality of clinical care whenever the anesthesiologist is not the sole anesthesia care provider. Supervision of residents is required for both postgraduate medical educationa and billing compliance.b

For 2.5 years, our department has sent daily e-mail requests to anesthesia residents to evaluate the supervision provided by each faculty anesthesiologist with whom they worked the previous day (Table 1).1–10 We have shown previously the dependability, reliability, and validity of the de Oliveira Filho clinical supervision scale (Table 2). We have also shown its usefulness for evaluating individual anesthesiologists (Table 2). When either residents or nurse anesthetists are the direct providers of anesthesia care, the supervision scores serve as an independent measure of the contribution of the faculty anesthesiologist to the care of the patient (Table 2.17).10

Table 1. de Oliveira Filho et al.’s Instrument for Measuring Faculty Anesthesiologists’ Supervision of Anesthesia Residents During Clinical Operating Room Care

Table 2. Previous Findings Regarding Supervision of Anesthesia Residents and Nurse Anesthetists by Anesthesiologists

When supervision quality is monitored and feedback provided to the anesthesiologists, quality increases (Table 2.18).10 Thus, anesthesiology department managers probably should not only be monitoring and reporting the quality of their departments’ level of supervision but also establishing processes so that individual anesthesiologists can receive feedback regarding the quality of supervision they provide.10 However, what advice a department manager can best provide to a faculty anesthesiologist who receives below-average supervision scores is unknown.

From a previous qualitative analysis, exemplary operating room behavior includes “being courteous through respectful behavior.”11 The “excellent” anesthesiologist, “even if stressed … never shouts at other team members.”12 The ACGMEa states: “an accredited program in anesthesiology must provide … an atmosphere of mutual respect between instructor and residents so that residents will be stimulated and prepared to apply acquired knowledge.”a Disruptive physician behavior can interrupt learning13 and team performance.14 One of the questions (#8) included in de Oliveira Filho’s supervision scale addresses the extent to which “the faculty treated me respectfully” (Table 1). Thus, one of the 3 hypotheses we tested was that disrespectful behavior reported by residents in their written comments was associated with lesser faculty supervision scores.

When nurse anesthetists evaluated anesthesiologists’ supervision, low scores (Table 2.3) were associated with comments indicating that the anesthesiologist was not seen during the case (Table 2.16).9,10 Because residents are trainees, the incidence of such comments was much less among resident physicians than among nurse anesthetists.9 The “excellent” anesthesiologist “regularly goes into the operating” rooms “that he/she is responsible for, to check that everything is in order.”12 Thus, we hypothesized there would be an association between residents’ written comments of insufficient faculty presence and lower supervision scores.

Providing feedback of nurse anesthetists’ supervision to the anesthesiologists was associated with substantial increases in departmental and individual supervision scores (Table 2.18).10 This was due principally to questions associated with teaching (e.g., “stimulate my clinical reasoning, critical thinking, and theoretical learning”).10 Therefore, our third hypothesis was that residents’ comments related to poor quality of teaching would be both relatively common (compared with insufficient faculty presence) and would be associated with lower supervision scores.


The University of Iowa IRB declared that this investigation did not meet the regulatory definition of human subjects research.

All analyses were performed with deidentified data. The data used were the complete (finite) population from July 1, 2013, the first date when residents started to evaluate faculty supervision, through December 31, 2015.

The residents were asked daily to evaluate the faculty with whom they worked the previous day in an operating room setting, including obstetrics and/or nonoperating room anesthesia (e.g., radiation therapy). The residents provided numerical scores using radio buttons to answer the 9 supervision questions (Table 1; 1 “never,” 2 “rarely,” 3 “frequently,” or 4 “always”). The residents could also enter a written comment of unlimited length if they so desired (i.e., 0 comments or 1 comment per day).
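The scoring described above can be sketched in a few lines. This is a minimal illustration, assuming (as the arithmetic in footnote c indicates) that the daily supervision score is the mean of the 9 answers. The function names and the label strings are hypothetical, and the cutoffs are the ones used in this report (<3.80 below average, <3.00 low).

```python
from statistics import mean

N_QUESTIONS = 9        # items in the supervision scale (Table 1)
BELOW_AVERAGE = 3.80   # cutoff for "below average" used in this report
LOW = 3.00             # cutoff for "low" (less than "frequently")


def supervision_score(answers):
    """Return the mean of the 9 answers, each coded 1 (never) to 4 (always)."""
    if len(answers) != N_QUESTIONS:
        raise ValueError("expected 9 answers")
    if any(a not in (1, 2, 3, 4) for a in answers):
        raise ValueError("each answer must be 1, 2, 3, or 4")
    return mean(answers)


def classify(score):
    """Label a daily score with the cutoffs above (labels are illustrative)."""
    if score < LOW:
        return "low"
    if score < BELOW_AVERAGE:
        return "below average"
    return "not below average"


one_three = supervision_score([4] * 8 + [3])      # 35/9, about 3.89
two_threes = supervision_score([4] * 7 + [3, 3])  # 34/9, about 3.78
```

With these cutoffs, a single answer of 3 among nine leaves the score at or above the mean, whereas two answers of 3 make it below average.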

Among the 1387 comments, there were 215 accompanying a below-average daily supervision score (i.e., <3.80) (Table 2.3).c Using those 215 comments, words and phrases for being disrespectful were identified independently by all 4 authors (Table 3). The same process was used for the themes of insufficient faculty presence (Table 4) and of poor-quality teaching (Table 5). The words and phrases in Tables 3 to 5 were subsequently applied to the other 1172 comments. The words and phrases indicating insufficient faculty presence or poor faculty teaching did not appear in any of the 1172 comments with a daily supervision score >3.80. From Table 3, “demeaning” appeared once: “was not demeaning.” “Threatening” appeared 15 times, but always preceded by a negation: “non” or “not” (see Discussion).
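The step of applying the words and phrases to the remaining comments, while noticing a preceding negation, can be sketched as follows. This is an illustration only, not the procedure the authors ran. The phrase list contains just the two words quoted in the text (the full lists are in Tables 3 to 5), and the negator list and function name are assumptions.

```python
import re

# Only these two words are quoted in the text; the full lists are in Table 3.
DISRESPECTFUL_PHRASES = ["demeaning", "threatening"]

# Assumed negators; the text mentions "non" and "not".
NEGATORS = ("not", "non")


def find_theme_hits(comment, phrases):
    """Return (phrase, negated) for each whole-word occurrence in the comment.

    An occurrence is flagged as negated when the word immediately before it
    is one of NEGATORS, e.g. "not demeaning" or "non-threatening".
    """
    text = comment.lower()
    hits = []
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text):
            preceding_words = re.findall(r"[a-z']+", text[:match.start()])
            negated = bool(preceding_words) and preceding_words[-1] in NEGATORS
            hits.append((phrase, negated))
    return hits


print(find_theme_hits("Was not demeaning.", DISRESPECTFUL_PHRASES))
```

A match flagged as negated would then be excluded from the count for that theme, which is how “was not demeaning” and “non-threatening” would be kept out of the disrespectful category.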

Table 3. Words and Phrases Used for the Theme of “Disrespectful”

Table 4. Words and Phrases Used for the Theme of “Insufficient Faculty Presence”

Table 5. Words and Phrases Used for the Theme of “Poor-Quality Teaching”

Data are reported as mean ± SD. All P values were calculated using an exact method (StatXact-11; Cytel Software, Inc., Cambridge, MA). Counts were analyzed for single groups using exact binomial confidence intervals (CIs). Comparisons of counts between pairs of groups were made using exact conditional CIs for odds ratios. Odds ratios were chosen to correspond to McNemar’s pairwise tests. Comparisons of supervision scores between pairs of groups were performed using 2-sided Wilcoxon-Mann-Whitney tests. The CI for Cronbach α was calculated asymptotically.15
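For readers unfamiliar with exact binomial CIs, the sketch below computes the two-sided Clopper-Pearson interval by bisection on the binomial distribution function, using only the Python standard library. It is offered as one common exact method, not as the algorithm used by StatXact, which provides other exact intervals; its bounds therefore need not match the values reported in this article. The function names are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k >= n:
        return 1.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def _solve_decreasing(f, target, iterations=60):
    """Find p in (0, 1) with f(p) = target, for f decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Two-sided exact (Clopper-Pearson) CI for x successes out of n."""
    # Lower bound: P(X >= x | p) = alpha/2, i.e. cdf(x - 1) = 1 - alpha/2.
    lower = 0.0 if x == 0 else _solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= x | p) = alpha/2.
    upper = 1.0 if x == n else _solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


low, high = clopper_pearson(1342, 1353)
print(f"{low:.3f} to {high:.3f}")
```

The binomial distribution function is decreasing in p, which is why a simple bisection is sufficient for both bounds.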


Among the 14,722 evaluations that were requested, residents provided a response for 99.1% (N = 14,585).d Residents could answer that they worked with the faculty that day for insufficient time to evaluate supervision; residents used this option for 6.3% of requests (N = 921). Among the N = 13,664 numerical evaluations of supervision quality, the day’s intraoperative patient care time together was 4.87 ± 2.53 hours. Supervision scores were 3.80 ± 0.37, with 2.20% low scores (<3.00, less than “frequent”).

Among the N = 76 faculty who received at least 20 evaluations, each received 177 ± 126 evaluations from 42 ± 16 residents with supervision scores 3.80 ± 0.13.d,e The faculty received 17.8 ± 12.9 comments (11.4% ± 6.1% of evaluations) with 112.7 ± 35.8 characters (including blanks) and 19.0 ± 6.4 words.

There were 25 evaluations with a comment about the faculty being disrespectful (1.8% of the 1387 comments; 0.18% of the 13,664 evaluations). Question #8, evaluating whether “the faculty treated me respectfully,” was answered <4 (i.e., not “always”) for all (100%) of 25 evaluations with a comment of disrespectful behavior (95% CI for sensitivity 88%–100%). The supervision scores for those 25 evaluations were much less than for the other 1362 evaluations with comments (2.86 ± 0.65 vs 3.90 ± 0.27; P < 0.0001; area under the receiver operating characteristic [ROC] curve: 0.963). The 14 faculty who each had at least 1 comment about being disrespectful had mean scores5,e less than the other 62 faculty (3.72 ± 0.11 vs 3.82 ± 0.12; P = 0.0012). Each increase in the faculty’s number of comments of being disrespectful (from 0 to 5) was associated with a lesser meane score (Kendall τb = −0.31; P = 0.0002). A low supervision score (<3.00, see Table 2.10, 13, and 15) had an odds ratio of 85.4 (95% CI, 31.2–232.5) for a comment regarding disrespectful faculty behavior (P < 0.0001). The predictive value of a non-low supervision score (i.e., a score ≥3.00) for absence of a comment of disrespectful behavior was 99.2% (1342 of 1353; 95% CI, 98.6%–99.6%). That finding was especially useful because the vast majority of scores that were below average (<3.80) were not low (≥3.00) (94.4% ± 6.2%, N = 74 faculty; see Discussion).
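The area under the ROC curve reported above has a direct interpretation in terms of the Wilcoxon-Mann-Whitney comparison: it is the probability that a randomly chosen evaluation with a comment of disrespectful behavior has a lesser score than a randomly chosen evaluation without one, counting ties as one half. A minimal sketch of that calculation follows. The scores in the example are made up for illustration and are not study data, and the function name is hypothetical.

```python
def auc_from_groups(lower_group, higher_group):
    """Estimate P(score from lower_group < score from higher_group).

    Ties count as one half. This equals the Mann-Whitney U statistic
    divided by the number of pairs.
    """
    if not lower_group or not higher_group:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for a in lower_group:
        for b in higher_group:
            if a < b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(lower_group) * len(higher_group))


# Made-up scores, for illustration only.
with_comment = [2.0, 3.0]
without_comment = [3.0, 4.0]
print(auc_from_groups(with_comment, without_comment))
```

The pairwise loop is quadratic in the number of evaluations, which is acceptable for a sketch; a rank-based calculation would be preferable for large groups.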

There were 6 evaluations with a comment about insufficient faculty presence (0.43% of 1387 comments; 0.04% of 13,664 evaluations). The 6 faculty who each had 1 such comment had significantly lesser meane scores than the other 70 faculty (3.70 ± 0.09 vs 3.81 ± 0.12; P = 0.0071). The 6 evaluations with comments related to insufficient faculty presence had lesser scores than the other 1381 evaluations with comments (2.61 ± 0.30 vs 3.89 ± 0.30; P < 0.0001).

There were 34 evaluations with a comment about poor-quality teaching (2.4% of 1387 comments; 0.25% of 13,664 evaluations). The 19 faculty who each received at least 1 such comment had lesser mean scores than the other faculty (3.72 ± 0.11 vs 3.83 ± 0.12; P < 0.0001). Each increase in the faculty’s number of comments of poor-quality teaching (from 0 to 4) was associated with a lesser meane score (Kendall τb = −0.34, P = 0.0002). The 34 evaluations related to poor teaching had lesser scores than the other 1353 evaluations with a comment (3.18 ± 0.55 vs 3.90 ± 0.28; P < 0.0001).

Several observations indicate that focusing faculty attention on teaching as a basis for preventing low supervision scores would also effectively incorporate insufficient faculty presence. The supervision questions were internally consistent (Cronbach α = 0.948; 95% CI, 0.947–0.949). A faculty with at least 1 comment about poor-quality teaching had significant odds of also having at least 1 comment about insufficient presence (5.33; 95% CI, 1.61–21.5; P = 0.0044). A comment with negative sentiment had significant odds of being about poor-quality teaching rather than being about insufficient presence (odds ratio, 6.60; 95% CI, 2.74–17.8; P < 0.0001). The odds were much greater for questions #1 and/or #3 (i.e., teaching) to be answered <3 vs questions #2 (“promptly available”) and/or #5 (“present”) (19.5; 95% CI, 13.1–29.2; P < 0.0001; N = 13,664 evaluations).f
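For reference, Cronbach α for a scale of k items is k/(k − 1) multiplied by 1 minus the ratio of the sum of the item variances to the variance of the total score. The sketch below implements that standard formula with sample variances. The example rows are made up and are not study data, and the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach alpha for a list of evaluations, each a list of k item answers."""
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("need at least 2 items and equal-length rows")
    item_variances = [variance([r[j] for r in rows]) for j in range(k)]
    total_variance = variance([sum(r) for r in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Made-up answers to a 2-item scale from 4 evaluations, for illustration only.
example = [[4, 3], [3, 4], [2, 2], [4, 4]]
print(round(cronbach_alpha(example), 3))
```

Values near 1, such as the 0.948 reported above, indicate that the items vary together closely, which is why the individual questions are not treated as separate dimensions.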


Low faculty supervision scores (<3, i.e., less than “frequent”): (a) do not meet resident expectations (Table 2.10), (b) are associated with resident errors causing patient harm (Table 2.15), and (c) can be monitored reliably (Table 2.13). From this case series, we add that low faculty supervision scores are associated with insufficient faculty presence, disrespectful faculty behavior, and poor quality of teaching. However, a faculty who has insufficient presence cannot be providing good teaching. Furthermore, in a previous study, we found negligible to no correlation between supervision scores and faculty clinical assignments (e.g., numbers of residents and/or nurse anesthetists supervised simultaneously) (Tables 2.7–9).5 The relevance is that insufficient faculty presence accounted for a small proportion of below-average supervision scores and low-quality supervision. Furthermore, scores ≥3 had a predictive value for the absence of disrespectful behavior of ≅99%. Approximately 94% of the faculty supervision scores that were below average were still ≥3. Consequently, for the vast majority of the faculty-resident-days, it was the quality of teaching that distinguished between below- versus above-average supervision scores.c This result is consistent with our prior finding of a strong correlation between faculty supervision scores and assessments of teaching effectiveness (Table 2.6).3 Taken together, we recommend that when individual faculty anesthesiologists are counseled regarding their clinical supervision scores, the attribute to emphasize is the quality of their clinical teaching. Being respectful of the residents is part of being an effective supervisor and teacher.

We used exact words and phrases (Tables 3–5) from the comments rather than coding them into themes. We did so, in part, so that we could provide examples of comments to readers and yet ensure blinding of responses for scientific analysis and publication. Still, our study was needed for our internal department use because the comments themselves did not always provide clear guidance on how to advise faculty. There were on average only ≅18 comments per faculty over 2.5 years, and the comments were usually brief (≅19 words). Furthermore, the (few) specific comments generally could not be provided directly to the individual faculty because their usefulness depended on the context, which would reveal the identity of the evaluating resident.6 We recently described this issue of confidentiality in our Statistical Grand Rounds article on using Bernoulli cumulative sum control charts for analyzing the supervision scores.6 We also recently studied the issue of the procedure-specific context of evaluations in our Statistical Grand Rounds on quantification of the diversity of procedures at facilities.16 We limited our current investigation to comments of negative sentiment because, just as we found previously for the comments made by nurse anesthetists, the comments of positive sentiment were frequently platitudes.9 Many raters used the same phrase regularly (e.g., “thank you”).9

Our results were limited to being from 1 department, and, at our department, faculty supervise no more than 2 trainees doing cases simultaneously. Because of our findings’ lack of sensitivity to insufficient faculty presence, we expect that results would be the same at a department (country) with anesthesiologists supervising only 1 room (trainee). However, this is our speculation. We expect that our results would be different if the faculty were supervising more than 2 trainees doing cases simultaneously. The reason is that an increase from 2 to 3 simultaneous cases dramatically increases the percentage of critical events when the anesthesiologist assigned cannot be physically present because then such events occur in more than 1 case simultaneously.17,18 When activities outside operating rooms are added (e.g., postanesthesia care unit), the percentage incidence is even greater.19,20

De Oliveira Filho et al.’s1 supervision scale has been found in multiple departments and settings to be unidimensional.3,8,9 In the current study, the Cronbach α was 0.95. Thus, disrespectful behavior, insufficient faculty presence, and poor quality of teaching are not separable dimensions (themes), but closely interrelated. Consequently, when the supervision score was low (<3) and the comment was only about disrespectful behavior, the quality of teaching may still have been substantially below average. This is a major limitation. Our study is a large case series and, reasonably, should be interpreted as such, with all concomitant limitations. Future studies can evaluate whether even greater (Table 2.18)10 increases in average supervision scores can be achieved than we obtained previously10 by providing feedback that is targeted toward helping faculty provide higher quality teaching in the operating room (clinical) setting.g


Dr. Franklin Dexter is the Statistical Editor and the Section Editor for Economics, Education, and Policy for Anesthesia & Analgesia. This manuscript was handled by Dr. Steven Shafer, Editor-in-Chief, and Dr. Dexter was not involved in any way with the editorial process or decision.


Name: Franklin Dexter, MD, PhD.

Contribution: This author helped design the study, conduct the study, analyze the data, write the manuscript, and is the archival author.

Attestation: Franklin Dexter has seen the original study data and approved the final manuscript.

Name: Debra Szeluga, MD, PhD.

Contribution: This author helped write the manuscript.

Attestation: Debra Szeluga has seen the original study data and approved the final manuscript.

Name: Danielle Masursky, PhD.

Contribution: This author helped conduct the study.

Attestation: Danielle Masursky has seen the original study data and approved the final manuscript.

Name: Bradley J. Hindman, MD.

Contribution: This author helped conduct the study and write the manuscript.

Attestation: Bradley J. Hindman has seen the original study data and approved the final manuscript.


Jennifer Espy provided editorial assistance. Thomas Smith managed the department database and web interfaces.


aAccreditation Council for Graduate Medical Education (ACGME) Program Requirements for Graduate Medical Education in Anesthesiology, Sections II.A.4.f, II.B.7, and VI.D.3. Available at: Accessed April 7, 2016.

bCenters for Medicare and Medicaid Services, Pub 100-04 Medicare Claims Processing, Transmittal 1859, November 20, 2009. Section 139 Teaching Anesthesiologists. Available at: Accessed December 21, 2015.

cThe median score was 4.00. The means, whether weighting each evaluation equally or weighting each rater-faculty combination equally, both equaled 3.80, with SEM 0.01. This was the 72.4th percentile of the 13,664 evaluations. The least value of the supervision score greater than or equal to the mean is 3.89, obtained with 1 question out of 9 answered with a 3 (Table 1) (i.e., 3.89 = (8 × 4 + 1 × 3)/9). The greatest value of the supervision score less than or equal to the mean is 3.78, obtained with 2 of 9 questions answered with a 3.

dThere was heterogeneity in response rates among the N = 65 residents: response rate 98.7% ± 2.6%.

eThe mean faculty score is calculated for each anesthesiologist by weighting each rater (resident) equally, because raters with multiple evaluations have correlated scores.5 Thus, the mean for each rater-faculty combination is calculated and then the mean faculty score is the mean for that faculty among all residents with whom the faculty has worked. See Ref.5 for the math. The reported mean ± SD are the mean and SD among the mean faculty scores.
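The equal weighting of raters described in this footnote amounts to a mean of per-resident means. The sketch below illustrates that calculation. The tuple layout, identifiers, and function name are hypothetical, and the scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_faculty_scores(evaluations):
    """Return {faculty: mean of per-resident mean scores}.

    `evaluations` is an iterable of (faculty_id, resident_id, score) tuples.
    Each resident counts once per faculty, however many evaluations
    that resident submitted.
    """
    scores_by_pair = defaultdict(list)
    for faculty, resident, score in evaluations:
        scores_by_pair[(faculty, resident)].append(score)

    resident_means_by_faculty = defaultdict(list)
    for (faculty, _resident), scores in scores_by_pair.items():
        resident_means_by_faculty[faculty].append(mean(scores))

    return {faculty: mean(resident_means)
            for faculty, resident_means in resident_means_by_faculty.items()}


# Made-up data: resident R1 rated faculty F1 three times, resident R2 once.
evaluations = [
    ("F1", "R1", 4.0), ("F1", "R1", 4.0), ("F1", "R1", 4.0),
    ("F1", "R2", 3.0),
]
print(mean_faculty_scores(evaluations))
```

In this example the rater-weighted mean is 3.5, whereas pooling all 4 evaluations would give 3.75, because the resident with more evaluations would dominate the pooled mean.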

fThe odds also were greater for questions #1 and/or #3 (i.e., teaching) being answered <4 versus questions #2 and/or #5 (“promptly available,” “present”) (7.39; 95% confidence interval, 6.40–8.53; P < 0.0001). Among the 6 evaluations with comments of the theme of insufficient faculty presence, questions #1 and #3 (i.e., teaching) were answered <4 for 6 of 6 (100%).

gFor each of the 5 successive 6-month periods, the faculty mean ± SD scores (with the raters equally weighted for each faculty) have been 3.75 ± 0.14 (N = 65), 3.77 ± 0.13 (N = 62), 3.82 ± 0.13 (N = 58), 3.83 ± 0.10 (N = 59), and 3.86 ± 0.09 (N = 61). The medians have been 3.77, 3.79, 3.84, 3.84, and 3.88, respectively.


