Economics, Education, and Policy: Research Report

Reliability and Validity of Assessing Subspecialty Level of Faculty Anesthesiologists’ Supervision of Anesthesiology Residents

De Oliveira, Gildasio S. Jr MD, MSCi*; Dexter, Franklin MD, PhD; Bialek, Jane M. BS*; McCarthy, Robert J. DPharm*

Author Information
doi: 10.1213/ANE.0000000000000453

Supervision of anesthesiology residents is a major daily responsibility of faculty (academic) anesthesiologists. The word supervision is used here not as a U.S. billing term. Rather, we use the term supervision to include all clinical oversight functions directed toward assuring the quality of clinical care whenever the anesthesiologist is not the sole anesthesia care provider. Supervision of residents is required for both graduate medical educationa and billing compliance.b

Previously, de Oliveira Filho et al.1 developed a set of 9 questions for evaluating the quality of supervision of anesthesiology residents. Each question contributes to a single dimension of supervision. The instrument utilizes a 4-point Likert scale (never = 1, rarely = 2, frequently = 3, and always = 4). The supervision score equals the mean of the responses to the 9 individual questions.
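As a concrete illustration of the scoring, the calculation can be sketched in a few lines of Python (a hypothetical example, not the authors' software; the response vector is invented):

```python
# Supervision score: mean of the 9 item responses, each on the
# 4-point Likert scale (never = 1, rarely = 2, frequently = 3, always = 4).

def supervision_score(responses):
    """Return the mean of the 9 Likert-scale item responses."""
    if len(responses) != 9 or any(r not in (1, 2, 3, 4) for r in responses):
        raise ValueError("expected 9 responses, each coded 1-4")
    return sum(responses) / len(responses)

# Hypothetical resident rating one faculty anesthesiologist:
example = [4, 3, 3, 4, 3, 2, 3, 4, 3]
score = supervision_score(example)  # 29/9, i.e., at least "frequent" (>= 3)
```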

De Oliveira et al.2 used this supervision scale and found in a national (U.S.) survey that residents who reported mean department-wide supervision scores less than 3 (“frequent”) reported significantly more frequent occurrences of mistakes with negative consequences to patients and medication errors. In the accompanying Editorial, de Oliveira Filho and Dexter3 recommended that program directors “measure … residents’ perception of overall faculty supervision and set the expectation that the faculty overall provide a minimum of frequent supervision to residents.” Subsequently, a substantial majority of anesthesiology residents (94%, n = 44/47) were found to consider the minimum level of supervision that they expected from individual faculty anesthesiologists to be at least 3 (i.e., at least “frequent”).4 Hindman et al.5 showed that, to achieve this overall minimum level (≥3) assessed departmentally (globally), individual faculty scores need to be even greater. Individual faculty scores differ markedly among anesthesiologists when monitored daily, even when controlling for the anesthesiologists’ assignments (e.g., 1:1 with a resident or 1:2 with 2 residents).6,7

On the basis of these previous findings, a logical next step would be for the supervision instrument (developed by de Oliveira Filho) to be used to evaluate anesthesiology residency programs (e.g., during annual evaluations).3 The objective of this study was to test the reliability and validity of this instrument using a national sample of U.S. anesthesiology residents during various residency rotations.


METHODS

This cross-sectional national study was approved by the Northwestern University Institutional Review Board (STU00085572).

The mailing and e-mail lists of U.S. anesthesiology trainees were obtained from the American Society of Anesthesiologists directory. A database was constructed from the list using PostgreSQL ver 9.4.1. A random sample of 1500 residents was selected. The survey was created using SurveyMonkey software (SurveyMonkey Inc., Portland, OR). To assure confidentiality of the respondents, the survey was set up to delink the responses from the respondents’ e-mail addresses but retain the Internet protocol addresses of the respondents. An internal tracking system was used to allow only one response per survey invitation and to generate a list of nonresponders. Each resident who did not respond to the electronic questionnaire was mailed a copy of the survey with a return envelope addressed to the primary investigator. Mailed surveys did not contain specific subject identifiers. Mailed responses (n = 40) were entered by a single investigator (JMB) and verified for accuracy by a second investigator (RJM).

The questionnaire was divided into 5 parts and included 68 questions, in multiple-choice and free-text formats.

The first part of the survey comprised 15 questions designed to capture characteristics of the respondents, including age, number of hours worked per week, number of residents in their class, gender, and year of training (Table 1). There were also questions about future job plans, designed for the companion article from this survey.

Table 1: Percentiles of the Responses to Questions

The second part of the survey had 27 questions, including 25 questions from the Agency for Healthcare Research and Quality survey on safety culture.8,9,c The most robust psychometric domains were used, including teamwork within units, nonpunitive response to errors, and feedback/communication about error. Each of the safety culture domains contains 3 to 4 questions. The additional 2 questions assessed errors with negative consequences to patients and the perceived probability of pursuing a subspecialty career in the same field as the current clinical rotation.

The third part of the survey included the 9 questions from the faculty supervision instrument of de Oliveira Filho et al.1 (see Introduction).

The fourth part of the survey included 2 questions to evaluate job burnout. West et al.10 have described the internal consistency of the single questions “I feel burned out from my work” and “I have become more callous toward people since I took this job” for evaluating the emotional exhaustion and depersonalization dimensions of burnout, respectively. Single-question evaluation, with scores converted proportionately to the original Maslach Burnout Inventory scale, was uniformly consistent with models based on the full inventory, with no changes in any previously reported associations.

The fifth part of the survey included 15 questions to evaluate residents’ clinical research proficiency, including 12 questions from the Shortened Version of the Clinical Research Appraisal Inventory.11,d These questions were used for the second article from this survey, examining future job plans.

Statistical Analysis

Internal consistency was measured using Cronbach’s α. The confidence interval (CI) was calculated using 1000 nonparametric bootstrap samples, each with a sample size of 641, the number of respondents who answered all 9 supervision items (SYSTAT 13, SYSTAT Software Inc., Chicago, IL). The sample size of 1500 invited residents was designed around the second article examining future job plans. For the current article, a sample size based on Cronbach’s α (e.g., to estimate α within ±0.05) would have been substantially smaller. The analyses below used the mean when 8 or more of the 9 supervision items were answered, providing 656 responses.
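For readers who wish to reproduce this kind of estimate, the calculation can be sketched as follows. This is a minimal illustration on simulated, hypothetical Likert data rather than the study data, and `cronbach_alpha` and `bootstrap_ci` are names chosen here, not the SYSTAT routines:

```python
import random
import statistics

def cronbach_alpha(rows):
    """Cronbach's alpha for complete-case data: rows = respondents, columns = items."""
    k = len(rows[0])
    item_vars = [statistics.pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = statistics.pvariance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def bootstrap_ci(rows, reps=1000, level=0.95, seed=0):
    """Nonparametric percentile bootstrap CI: resample respondents with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    alphas = sorted(cronbach_alpha([rng.choice(rows) for _ in range(n)])
                    for _ in range(reps))
    tail = (1 - level) / 2
    return alphas[int(reps * tail)], alphas[int(reps * (1 - tail)) - 1]

# Hypothetical correlated 4-point Likert responses (9 items, 200 respondents):
rng = random.Random(1)
rows = [[min(4, max(1, b + rng.choice([-1, 0, 0, 1]))) for _ in range(9)]
        for b in (rng.choice([2, 3, 3, 4]) for _ in range(200))]
alpha = cronbach_alpha(rows)
lo, hi = bootstrap_ci(rows)
```

Resampling whole respondents (rows), rather than individual item responses, preserves the between-item correlation structure that α measures.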

Discriminant validity evaluates correlation with variables with which supervision would be expected not to correlate, such as age, year of anesthesia training, and the number of survey respondents from the current rotation and program. Spearman rank correlation (ρ) was used. All P values are 2-sided. Calculations were performed using StatXact 10 (Cytel Software, Cambridge, MA).

Convergent validity evaluates correlation with variables with which supervision would be expected to correlate: positively with safety culture and negatively with the individual resident’s burnout. Spearman rank correlations (ρ) were used. In addition, the Cochran-Armitage trend test was used to test for an association between the number of errors witnessed (Table 1) and the incidence of the current rotation’s reported supervision being less than “frequent” (i.e., mean of the 9 supervision questions <3). The P value is exact (StatXact 10).
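The trend test can be sketched as follows, using the common asymptotic form of the Cochran-Armitage statistic rather than the exact computation reported in the article; the counts are hypothetical:

```python
import math

def cochran_armitage(cases, totals, scores=None):
    """Asymptotic two-sided Cochran-Armitage test for trend in binomial
    proportions across ordered groups (an approximation to the exact test)."""
    k = len(cases)
    if scores is None:
        scores = list(range(k))
    n = sum(totals)
    p = sum(cases) / n
    t = sum(s * c for s, c in zip(scores, cases))
    e = p * sum(s * m for s, m in zip(scores, totals))
    var = p * (1 - p) * (sum(s * s * m for s, m in zip(scores, totals))
                         - sum(s * m for s, m in zip(scores, totals)) ** 2 / n)
    z = (t - e) / math.sqrt(var)
    return z, 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical counts of residents reporting supervision < "frequent",
# grouped by number of errors witnessed (0, 1, >= 2):
z, p = cochran_armitage(cases=[20, 30, 33], totals=[244, 214, 162])
```

A positive z indicates that the proportion of low supervision scores rises with the number of errors, the direction reported in the Results.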

Usefulness of the supervision scale for assessing departments would be limited if the correlation between supervision and safety culture were sensitive to the individual resident’s burnout. This was assessed using regression trees, because modeling the supervision score with regression trees combines the strength of the relationship between supervision and the independent variables with the distribution of those variables among respondents. Analyses were performed using 3 different criteria: least squares, trimmed mean, and least absolute deviations (SYSTAT 13).


RESULTS

The sample sizes for analyses ranged from n = 615 (41.0%) to n = 656 (43.7%) (Table 1).

Internal consistency of the supervision scale was excellent, with Cronbach’s α = 0.909 (95% CI, 0.896–0.922, n = 641). Estimates were 0.925 among the 155 residents on critical care, chronic pain, or regional/acute pain rotations and 0.903 among the others.

Discriminant validitye was found based on absence of rank correlation of supervision scores with characteristics of the respondent and program (Table 1): age (ρ = −0.019, P = 0.62), estimated hours worked per week (ρ = −0.057, P = 0.14), female (ρ = 0.032, P = 0.42), year of anesthesia training (ρ = −0.020, P = 0.61), weeks in the current rotation (ρ = 0.064, P = 0.10), sequence of survey response (ρ = 0.024, P = 0.54), size of residency class (ρ = −0.015, P = 0.69), and number of survey respondents from the current rotation and program (ρ = −0.037, P = 0.34).

Convergent validitye was found based on significant positive correlation with variables related to safety culture (all P < 0.0001): “Overall perceptions of patient safety” (ρ = 0.287), “Teamwork within units” (ρ = 0.393), “Nonpunitive response to errors” (ρ = 0.311), “Handoffs and transitions” (ρ = 0.256), “Feedback and communication about error” (ρ = 0.347), “Communication openness” (ρ = 0.381), and rotation’s “overall grade on patient safety” (ρ = 0.287).

Convergent validity was also found based on significant negative correlation with variables related to the individual resident’s burnout (all P < 0.0001): “I feel burned out from my work” (ρ = −0.253), “I have become more callous toward people since I took this job” (ρ = −0.258), and the number of “errors with potential negative consequences to patients [that you have] made and/or witnessed” (ρ = −0.183). Regarding those errors, among the 39.4%, 34.5%, and 26.2% of respondents reporting 0, 1, or ≥2 errors, the percentages of supervision scores less than “frequent” (i.e., <3) were 8.2%, 12.4%, and 19.9%, respectively (P = 0.0005).

Usefulness was shown by the supervision score being predicted by the same 1 variable for each of the 3 regression tree types: “Teamwork within [the rotation]” (e.g., “When one area in this rotation gets busy, others help out”).f

Secondary (Unplanned) Analyses

We explored the relationships among reported errors, supervision, and burnout. A residency program may potentially be able to influence the quality of its faculty supervision,5–7 but burnout is influenced by resident factors such as average alcoholic drinks per week and marital status.12 Among residents reporting overall supervision during the current rotation that was less than frequent (i.e., <3) versus frequent, the 10th, 25th, 50th, 75th, 90th, and 95th percentiles of errors were 1 vs 1, 1 vs 1, 2 vs 2, 3 vs 2, 4 vs 3, and 6 vs 4, respectively (Wilcoxon-Mann-Whitney P < 0.0001). Each of the 3 types of regression trees detected no effect of burnout on count of errors while controlling for supervision. Forward stepwise, backward stepwise, and complete linear regression also found no effect of resident burnout on numbers of reported errors while controlling for supervision (all P > 0.138).

The overall supervision score was mean 3.40 with an SD of 0.47 (i.e., varied substantively among respondents). All rotation types (e.g., intensive care unit) had at least 33 respondents. By fixed effects analysis, there were no significant differences in scores among rotations (F7,648 = 1.85, P = 0.075). By random effects analysis, there also were no significant differences among rotations (all 8, P > 0.12, without correction for multiple comparisons). There were 13 programs each with at least 10 respondents. By fixed effects analysis, there were no significant differences in scores among programs (F12,157 = 1.30, P = 0.22). By random effects analysis, there also were no significant differences among programs (all 13, P > 0.36).


DISCUSSION

Evaluation of the overall quality of supervision of residents by faculty anesthesiologists depends on the reliability and validity of the instrument. In this study, we documented very strong internal consistency, discriminant validity, and convergent validity, irrespective of type of rotation and time spent (number of weeks) in the rotation. On the basis of these findings, the supervision scale can be applied reliably and validly for overall assessment of anesthesia training programs.

Our findings may have clinically important implications because less effective supervision is associated with a greater prevalence of medical errors among anesthesia residents and trainees of other medical specialties.2,13,14 We found the same result in this survey. Yet, the global (department-wide) implementation of the supervision instrument by de Oliveira Filho has been hindered by the lack of established psychometric properties of the instrument across different types of anesthesia rotations.15 Our findings suggest that anesthesiology departments can use the supervision instrument, reliably and validly, to monitor the overall quality of faculty supervision of anesthesiology trainees irrespective of the specific rotation.

Another finding of the current study was the significant positive correlation between the rotation supervision scores and the subdomains of patient safety culture. A weaker patient safety culture is associated with less favorable patient outcomes.16–18 The Institute of Medicine has determined that one of the major steps to reduce medical errors is to improve patient safety culture.19 Hospitals are concerned with the need to measure safety culture, and in the United States, The Joint Commission requires the evaluation of patient safety culture for hospital accreditation.g The regression tree analysis shows that supervision scores are associated with residents’ perceptions of safety culture. For anesthesiology departments, focusing on supervision is especially practical because it is a single dimension (e.g., easy to monitor). Previously, Hindman et al.15 showed that supervision scores for individual faculty are highly correlated with residents’ choice of that anesthesiologist to care for their families (Kendall τb = +0.77, P < 0.0001). We interpret this to mean that residents recognize the individual anesthesiologist’s contribution to safety culture and its relevance to their families if care were needed.

There was a negative relationship between resident burnout and rotation supervision score. Our group demonstrated previously that both lower supervision scores and burnout are associated with self-reported medical errors by anesthesiology residents.2,12 A motivation for the current study was to measure all 3. The secondary (exploratory) analysis suggests that effective supervision may mitigate medical errors among residents at high risk for burnout.

Our study should be interpreted in the context of its limitations. The surveys were self-reported and might not represent actual behaviors (e.g., they were not completed in a controlled setting). Because response bias is always a concern in cross-sectional questionnaire studies, we recommend focusing on the associations we studied, not the prevalence of events. Finally, supervision scores generally are unrelated to faculty physical presence and typical assignments.6 If supervision is principally a function of individual faculty anesthesiologists’ activities7 (e.g., teamwork), then, in the absence of monitoring, supervision scores would not be expected to differ among programs. That is precisely what we found. We do not know how much an individual faculty anesthesiologist’s supervision scores can improve or influence medical errors.

In summary, we evaluated the psychometric properties of the supervision instrument developed by de Oliveira Filho across different anesthesiology rotations in a national survey of anesthesiology trainees. The instrument had excellent internal consistency, discriminant validity, and convergent validity. Our findings support the reliability and validity of residents’ use of the instrument for evaluating attending supervision across different anesthesia rotations. In the companion article, we show that the same residents who provide lower scores for the overall department also provide lower scores for individual faculty anesthesiologists.5–7


Name: Gildasio S. De Oliveira, Jr, MD, MSCi.

Contribution: This author helped design the study, conduct the study, and write the manuscript. This author is the archival author.

Attestation: Gildasio S. De Oliveira, Jr, has approved the final manuscript.

Name: Franklin Dexter, MD, PhD.

Contribution: This author helped design the study, analyze the data, and write the manuscript.

Attestation: Franklin Dexter has approved the final manuscript.

Name: Jane M. Bialek, BS.

Contribution: This author helped conduct the study.

Attestation: Jane M. Bialek has approved the final manuscript.

Name: Robert J. McCarthy, DPharm.

Contribution: This author helped design the study, analyze the data, and write the manuscript.

Attestation: Robert J. McCarthy has approved the final manuscript.


Franklin Dexter is the Statistical Editor and Section Editor for Economics, Education, and Policy for Anesthesia & Analgesia. This manuscript was handled by Dr. Steven L. Shafer, Editor-in-Chief, and Dr. Dexter was not involved in any way with the editorial process or decision.


a ACGME Program Requirements for Graduate Medical Education in Anesthesiology. See Section II.B.2.a. Available at: Accessed November 27, 2013.

b Department of Health and Human Services, Centers for Medicare and Medicaid Services. CMS Manual System. Pub 100–04 Medicare Claims Processing, Transmittal 1859, November 20, 2009. Subject: MIPPA Section 139 Teaching Anesthesiologists. Available at: Accessed November 28, 2013.

c Available at: Accessed June 15, 2014.

d Available at: Accessed June 15, 2014.

e “Convergent and discriminant validity are both considered subcategories or subtypes of construct validity. The important thing to recognize is that they work together—if you can demonstrate that you have evidence for both convergent and discriminant validity, then you’ve by definition demonstrated that you have evidence for construct validity. But, neither one alone is sufficient for establishing construct validity. [It is] easiest to think about convergent and discriminant validity as two inter-locking propositions. In simple words … what they are doing as follows: measures of constructs that theoretically should be related to each other are, in fact, observed to be related to each other (that is, you should be able to show a correspondence or convergence between similar constructs) and measures of constructs that theoretically should not be related to each other are, in fact, observed to not be related to each other (that is, you should be able to discriminate between dissimilar constructs). To estimate the degree to which any two measures are related to each other … use the correlation coefficient … Correlations between theoretically similar measures should be ‘high’ while correlations between theoretically dissimilar measures should be ‘low.’” Available at: Accessed July 16, 2014.

f The other questions for the dimension (factor) “Teamwork within” [the rotation] were: “Regarding your current clinical rotation …” “People support one another in this rotation,” “When a lot of work needs to be done quickly, we work together as a team to get the work done,” and “In this rotation, people treat each other with respect.”

g Available at: Accessed June 19, 2014.


REFERENCES

1. de Oliveira Filho GR, Dal Mago AJ, Garcia JH, Goldschmidt R. An instrument designed for faculty supervision evaluation by anesthesia residents and its psychometric properties. Anesth Analg. 2008;107:1316–22
2. De Oliveira GS Jr, Rahmani R, Fitzgerald PC, Chang R, McCarthy RJ. The association between frequency of self-reported medical errors and anesthesia trainee supervision: a survey of United States anesthesiology residents-in-training. Anesth Analg. 2013;116:892–97
3. de Oliveira Filho GR, Dexter F. Interpretation of the association between frequency of self-reported medical errors and faculty supervision of anesthesiology residents. Anesth Analg. 2013;116:752–3
4. Dexter F, Logvinov II, Brull SJ. Anesthesiology residents’ and nurse anesthetists’ perceptions of effective clinical faculty supervision by anesthesiologists. Anesth Analg. 2013;116:1352–5
5. Hindman BJ, Dexter F, Smith TC. Anesthesia residents’ global (departmental) evaluation of faculty anesthesiologists’ supervision can be less than their average evaluations of individual anesthesiologists. Anesth Analg. 2015;120:204–8
6. Dexter F, Ledolter J, Smith TC, Griffiths D, Hindman BJ. Influence of provider type (nurse anesthetist or resident physician), staff assignments, and other covariates on daily evaluations of anesthesiologists’ quality of supervision. Anesth Analg. 2014;119:670–8
7. Dexter F, Ledolter J, Hindman BJ. Bernoulli cumulative sum (CUSUM) control charts for monitoring of anesthesiologists’ performance in supervising anesthesia residents and nurse anesthetists. Anesth Analg. 2014;119:679–85
8. Huang DT, Clermont G, Sexton JB, Karlo CA, Miller RG, Weissfeld LA, Rowan KM, Angus DC. Perceptions of safety culture vary across the intensive care units of a single institution. Crit Care Med. 2007;35:165–76
9. Morello RT, Lowthian JA, Barker AL, McGinnes R, Dunt D, Brand C. Strategies for improving patient safety culture in hospitals: a systematic review. BMJ Qual Saf. 2013;22:11–8
10. West CP, Dyrbye LN, Satele DV, Sloan JA, Shanafelt TD. Concurrent validity of single-item measures of emotional exhaustion and depersonalization in burnout assessment. J Gen Intern Med. 2012;27:1445–52
11. Robinson GF, Switzer GE, Cohen ED, Primack BA, Kapoor WN, Seltzer DL, Bakken LL, Rubio DM. A shortened version of the Clinical Research Appraisal Inventory: CRAI-12. Acad Med. 2013;88:1340–5
12. De Oliveira GS Jr, Chang R, Fitzgerald PC, Almeida MD, Castro-Alves LS, Ahmad S, McCarthy RJ. The prevalence of burnout and depression and their association with adherence to safety and practice standards: a survey of United States anesthesiology trainees. Anesth Analg. 2013;117:182–93
13. McKee M, Black N. Does the current use of junior doctors in the United Kingdom affect the quality of medical care? Soc Sci Med. 1992;34:549–58
14. Farnan JM, Petty LA, Georgitis E, Martin S, Chiu E, Prochaska M, Arora VM. A systematic review: the effect of clinical supervision on patient and residency education outcomes. Acad Med. 2012;87:428–42
15. Hindman BJ, Dexter F, Kreiter CD, Wachtel RE. Determinants, associations, and psychometric properties of resident evaluations of faculty operating room supervision in a US anesthesia residency program. Anesth Analg. 2013;116:1342–51
16. Davenport DL, Henderson WG, Mosca CL, Khuri SF, Mentzer RM Jr.. Risk-adjusted morbidity in teaching hospitals correlates with reported levels of communication and collaboration on surgical teams but not with scale measures of teamwork climate, safety climate, or working conditions. J Am Coll Surg. 2007;205:778–84
17. Singer S, Lin S, Falwell A, Gaba D, Baker L. Relationship of safety climate and safety performance in hospitals. Health Serv Res. 2009;44:399–421
18. Birkmeyer NJ, Finks JF, Greenberg CK, McVeigh A, English WJ, Carlin A, Hawasli A, Share D, Birkmeyer JD. Safety culture and complications after bariatric surgery. Ann Surg. 2013;257:260–5
19. Morello RT, Lowthian JA, Barker AL, McGinnes R, Dunt D, Brand C. Strategies for improving patient safety culture in hospitals: a systematic review. BMJ Qual Saf. 2013;22:11–8
© 2015 International Anesthesia Research Society