Economics, Education, And Policy: Research Report

Reliability and Validity of the Anesthesiologist Supervision Instrument When Certified Registered Nurse Anesthetists Provide Scores

Dexter, Franklin MD, PhD*; Masursky, Danielle PhD; Hindman, Bradley J. MD

Author Information
doi: 10.1213/ANE.0000000000000510


Currently, at many U.S. health care facilities, supervision of anesthesiology residents and/or Certified Registered Nurse Anesthetists (CRNAs) is a major daily responsibility of anesthesiologists.1,2 The word “supervision” is used here not as a U.S. billing term. Rather, we use the term supervision to include all clinical oversight functions directed toward assuring the quality of clinical care whenever the anesthesiologist is not the sole anesthesia care provider (Table 1). Investigators have been learning how the quality of anesthesiologists’ supervision in operating rooms (ORs) can be evaluated, both individually3–7 and departmentally8–13 (Table 2).

Table 1: de Oliveira Filho et al.’s Instrument3,5,12 for Measuring Faculty Anesthesiologists’ Supervision of Anesthesiology Residents During Clinical Operating Room Care

Table 2: Previous Findings Regarding Supervision of Anesthesiology Residents and Certified Registered Nurse Anesthetists (CRNAs) by Anesthesiologists

In a national survey of anesthesiology residents, supervision quality was most closely associated with 1 variable: “Teamwork within [the rotation]” (e.g., “When one area in this rotation gets busy, others help out”) (Table 2.4). Thus, although in this article we refer to CRNA assessments of “supervision,” likely an equivalent term is assessments of anesthesia “teamwork.”

Based on previous studies (Table 2), our department implemented a process by which the supervision provided by each anesthesiologist working in OR(s) is evaluated each day by the CRNA(s) and/or resident(s) with whom they worked the previous day.4–7 These daily evaluations use the 9 questions developed by de Oliveira Filho et al. for residents to assess the supervision provided by faculty anesthesiologists (Table 1). Each question is answered using a 4-point Likert scale (1 = never, 2 = rarely, 3 = frequently, 4 = always).3 In our department, CRNAs account for 51.1% ± 0.9% (SE) of billed minutes of care (i.e., CRNA–physician team interactions constitute a substantive component of all clinical care).6

The goal of our study was to answer 2 questions. First, in routine daily practice, does the supervision instrument have sufficient reliability (internal consistency) for use by CRNAs? Second, CRNAs are not trainees and expect a lower level of faculty supervision than do residents (Table 2).5 Anesthesiologists therefore reasonably provide lower levels of supervision to CRNAs than to residents. The second question we address is whether the supervision instrument is valid at the extreme (both low and high) levels of supervision sometimes reported by CRNAs.


The University of Iowa Institutional Review Board determined that this project does not meet the regulatory definition of human subjects’ research, because it was a quality assurance and educational activity.

This project used an existing data set that included all 7273 daily faculty supervision scores and 1088 comments provided by CRNAs, as well as the 6246 daily faculty supervision scores and 681 comments provided by residents, for dates of service between July 1, 2013, and June 30, 2014. This data set is constantly available to BJH, using a secure intranet-based webpage, for his daily responsibilities as Vice-Chair of Faculty Development. For this project, the names of all individual anesthesiologists in the name field of the exported data were replaced with anesthesiologist-specific numbers. In addition, all comments were reviewed by BJH word for word. When an anesthesiologist was specifically named in a comment, the anesthesiologist’s name was replaced with the anesthesiologist-specific number. The name of the CRNA or resident providing each evaluation was redacted to conceal their identity. There were 77 anesthesiologists, 62 residents, and 49 CRNAs, making it essentially impossible to infer the identity of anyone from the comments alone. Finally, the date of clinical service when each rater (CRNA or resident) worked with the anesthesiologist was replaced with the corresponding 2-month period, creating 6 two-month periods.

Reliability of the supervision instrument was assessed using its internal consistency. Cronbach’s alpha was calculated for each of the 6 two-month periods. We used 6 two-month periods to obtain statistically independent samples, at the cost of a smaller sample size per period. This approach was taken because surgical cases are not assigned randomly to providers, and surgical schedules follow complex patterns. The study included 1 year’s worth of data (i.e., July 1, 2013, through June 30, 2014) because we knew from previous Monte-Carlo simulation studies that 6 eight-week periods would be suitable.14 Because we started data collection on July 1, 2013, and made comparisons with residents, we extended data collection by an additional 4 weeks to span 1 full academic year (i.e., 48 weeks + 4 weeks = 6 two-month periods).
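As a concrete illustration, Cronbach’s alpha for one period is computed from the per-question variances and the variance of the total score. The sketch below uses toy data (5 evaluations of 3 items), not the study’s data (approximately 1212 evaluations of 9 items per period):

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of evaluations, each a list of k item scores."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# toy data: 5 evaluations x 3 items, each item on the 1-4 Likert scale
evals = [[4, 4, 4], [3, 3, 4], [2, 2, 3], [1, 2, 1], [4, 3, 4]]
print(round(cronbach_alpha(evals), 3))  # -> 0.925
```

As in the study, alpha approaches 1 when the items move together across evaluations (i.e., when they measure a one-dimensional attribute).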

Analyses were performed using Excel 2010 array formulas (Microsoft, Redmond, WA). P values were calculated using Student’s one-group t tests with N = 6 two-month periods.14
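A one-group Student’s t test with N = 6 period-level values amounts to comparing the mean of the 6 summaries against a hypothesized value with df = 5. A minimal sketch, using hypothetical per-period differences (not the study’s data):

```python
from statistics import mean, stdev
from math import sqrt

def one_group_t(diffs, mu0=0.0):
    """t statistic for H0: mean(diffs) == mu0, with df = len(diffs) - 1."""
    n = len(diffs)
    return (mean(diffs) - mu0) / (stdev(diffs) / sqrt(n))

# hypothetical per-period differences (e.g., CRNA % minus resident % of low scores)
diffs = [3.1, 3.5, 2.9, 3.6, 3.3, 3.4]
t = one_group_t(diffs)
# two-sided critical value for df = 5 at alpha = 0.05 is 2.571
print(round(t, 2), t > 2.571)
```

This is the same calculation an Excel array formula would perform; the statistic is then referred to Student’s t distribution with 5 degrees of freedom for the P value.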

Content analysis was used to associate comments and supervision scores.15 If an anesthesiologist was responsible for “supervision” throughout a day but interacted with the rater only briefly, then, provided Cronbach’s alpha was substantial (above), the supervision score should be low. Thus, we used content analysis to assess convergent validity.15

The content analysis followed an inductive process.15 Each supervision score represented a rater-anesthesiologist pairing on a given day. The rater (CRNA or resident) was always provided the option of entering a comment. The anesthesiologist-rater-day (with 0 or 1 comments) was treated as the unit of data collection. The analysis was performed using each comment as written (i.e., no effort was made to infer intent from a comment).15

Using every comment from an anesthetizing location where CRNAs work in our department, we created a concordance, listing the frequency of use of each word included in the corpus of all evaluations.15 Prior to analysis, words were converted to lower case, and plurals were converted to their singular form, with common irregular plural forms taken into account. The 9 abbreviations that were each used several times were expanded to the corresponding whole words (e.g., “pt” for “patient”). The punctuation marks (e.g., dashes) were replaced with single spaces. This resulted in 2662 unique words.
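The normalization steps described above (lower-casing, replacing punctuation with spaces, singularizing plurals, expanding abbreviations) can be sketched as follows. The abbreviation and irregular-plural tables here are illustrative stand-ins, not the study’s actual lists:

```python
import re
from collections import Counter

ABBREV = {"pt": "patient"}                     # example; the study expanded 9 abbreviations
IRREGULAR = {"men": "man", "women": "woman"}   # example irregular plural forms

def normalize(word):
    word = ABBREV.get(word, word)
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"                 # e.g., "duties" -> "duty"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                       # naive singularization
    return word

def concordance(comments):
    """Frequency of each normalized word over the corpus of comments."""
    counts = Counter()
    for text in comments:
        words = re.sub(r"[^\w\s]", " ", text.lower()).split()
        counts.update(normalize(w) for w in words)
    return counts

c = concordance(["Dr. X saw the patients twice.", "Never saw the pt today"])
print(c["patient"], c["saw"])  # -> 2 2
```

Applied to the full corpus, such a concordance yields the word list (2662 unique words in the study) that feeds the screening step below.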

A screening process was used to choose which word(s) to explore in depth. Each comment was counted 0 or 1 for the occurrence of each word in a comment. For each of the 2662 words, Fisher’s exact test was used to relate presence (or absence) of the word in the comment versus a supervision score <2 (or <3). For this study, we focused on CRNA scores <2, because for the score to be <2, at least 1 of the 9 individual questions had to have been answered with a 1 (“never”) (Table 1). We also included an analysis of CRNA scores <3 because most CRNAs consider a supervision score of <3 as not meeting their expectations of faculty (Table 2.5).5 We treated scores <3 as a secondary outcome measure because “rarely” (i.e., score of 2) is more ambiguous than “never” (i.e., score of 1).
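For each word and score threshold, the screening reduces to a 2 × 2 table (word present/absent × score below/at-or-above the threshold). A self-contained two-sided Fisher exact test, summing hypergeometric probabilities no larger than the observed one, can be sketched as follows (the counts are hypothetical):

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact P for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = prob(a)
    x_min, x_max = max(0, row1 + col1 - n), min(row1, col1)
    probs = [prob(x) for x in range(x_min, x_max + 1)]
    return sum(q for q in probs if q <= p_obs + 1e-12)

# hypothetical counts: word in 8 of 10 low-score comments vs 2 of 90 others
print(fisher_exact_p(8, 2, 2, 88))
```

Repeating this for every word (2662 tests per threshold in the study) produces the P values that are then screened while controlling the false discovery rate.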

Controlling the false discovery rate at α = 0.05, no word that was an action verb (e.g., evaluate, help, or solve) was significantly associated with CRNA supervision scores <2 or <3. We next included the 5 negation or limiting modifiers used more than once in the comments: no, not, never, only, and without.a Each of the other 2657 words (i.e., 2662 – 5) was combined with 1 or more of the negation or limiting modifier words (e.g., “not see”). While controlling the false discovery rate, and using the negative modifiers, 3 action verbs were significant for both CRNA scores <2 and CRNA scores <3. These 3 were (not) “help,” (not) “see,” and (not) “saw.” “Help” was used in 90 unique comments provided by 12 CRNAs, but 80.0% of those comments were made by 1 CRNA who wrote often that the anesthesiologist was “available if I needed help.” Thus, subsequent consideration of convergent validity was performed using the verbs (not) “see” and (not) “saw.”b Neither of these words appears in de Oliveira Filho et al.’s 2008 paper describing the instrument (Table 1).
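The article does not specify which false discovery rate procedure was used; the Benjamini-Hochberg step-up procedure is a standard choice and is sketched here with hypothetical P values:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected while controlling the false
    discovery rate at level alpha (Benjamini-Hochberg step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # largest rank whose P value clears its threshold
    return sorted(order[:cutoff])

# hypothetical per-word P values from the Fisher screening step
pvals = [0.0001, 0.0004, 0.019, 0.03, 0.2, 0.7]
print(benjamini_hochberg(pvals))  # -> [0, 1, 2, 3]
```

All hypotheses up to and including the largest qualifying rank are rejected, which is why a marginal P value can survive when smaller P values precede it.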

“See” or “saw” were used in the comments in different themes. For example, our focus was comments that included the conceptual theme: “I did not see the anesthesiologist during the case(s) together” (i.e., little or no interpersonal interaction). However, there were other themes, such as “Dr. {blinded} did not go to the [holding area] to see the patient.” The scenarios with “see” or “saw” were presented to the 3 authors in random sequence using the original comments with all punctuation. The authors independently reviewed the 81 comments and evaluated whether each included the focus theme. The 3 authors concordantly assigned 77 of the 81 comments to the focus theme; these 77 comments are reported below.

The statistical association of (not) “see” or (not) “saw” with the focus theme (“I did not see the anesthesiologist …”) was evaluated using an odds ratio for each of the 6 two-month periods. The 95% confidence interval (CI) was calculated using the N = 6 log odds ratios and Student’s t-distribution. We considered the primary assessment of convergent validity to be association of the comments about not “seeing” the anesthesiologist and scores <2, as explained above.
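The period-level analysis can be sketched as follows: compute the log odds ratio for each of the 6 periods, then form the 95% CI from the mean and SE of the 6 log odds ratios using Student’s t with df = 5 (all counts below are hypothetical):

```python
from math import log, exp, sqrt
from statistics import mean, stdev

def log_or(a, b, c, d):
    """Log odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return log((a * d) / (b * c))

def or_ci_from_periods(tables, t_crit=2.571):
    """Pooled odds ratio and 95% CI from independent per-period 2x2 tables,
    using the N log odds ratios and Student's t (here N = 6, df = 5)."""
    logs = [log_or(*t) for t in tables]
    half = t_crit * stdev(logs) / sqrt(len(logs))
    m = mean(logs)
    return exp(m), exp(m - half), exp(m + half)

# hypothetical tables (theme present/absent x score < 2 yes/no), one per period
tables = [(10, 2, 40, 400), (9, 3, 45, 390), (11, 2, 38, 410),
          (8, 2, 50, 380), (12, 3, 42, 405), (10, 3, 44, 395)]
orr, lo, hi = or_ci_from_periods(tables)
print(round(orr, 1), round(lo, 1), round(hi, 1))
```

Working on the log scale and exponentiating the CI limits keeps the interval strictly positive and respects the skewed sampling distribution of the odds ratio.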

Among the 77 comments with the focus theme, there were 2 for which the focus theme seemed secondary (e.g., “I did not see Dr. {blinded} much during … [the day]; however, I believe that {blinded} was monitoring quite closely.”). These 2 comments had the greatest and fourth greatest of the 77 supervision scores. We deliberately included these 2 comments because doing so had the conservative effect of underestimating the odds ratio (i.e., overestimating the P value).

Results (percentages, Cronbach’s alphas, and sample sizes) are reported as the mean ± SE from among the 6 two-month periods.


Supervision scores <2 were provided for 7.2% ± 0.4% of assessments by 18.2 ± 0.9 CRNAs about 34.0 ± 1.5 different anesthesiologists. Scores <3 were provided for 36.6% ± 1.1% of assessments by 34.0 ± 0.6 CRNAs about 57.8 ± 1.2 different anesthesiologists. Thus, CRNA scores <2 or <3 were not unusual or attributable to just a few CRNAs or a few anesthesiologists. The frequencies of supervision scores <2 or <3 provided by CRNAs were greater than for supervision scores provided by anesthesiology residents, who provided supervision scores <2 for 3.9% ± 0.3% of assessments and <3 for 21.3% ± 0.8% (both P < 0.0001). The anesthesiologists who received low mean scores from CRNAs were the same as those who received low mean scores from residents (P < 0.0001).c Both results show concurrent validity.

No single question among the 9 questions in the supervision instrument explained CRNA scores <2 (or <3) because of substantial (expected) interquestion correlation. Cronbach’s alpha equaled 0.895 ± 0.003 among the 6 two-month periods. The Cronbach’s alpha was calculated using N = 1212 ± 17 evaluations per period. Among the CRNA evaluations including a written comment, the Cronbach’s alpha was 0.907 ± 0.003 (N = 181 ± 13 evaluations per period). This result shows that, just as for anesthesiology residents (Table 2.1), the 9 supervision questions reliably measure a one-dimensional attribute.

The CRNAs provided a comment for 38.8% ± 4.1% of scores <2, 19.6% ± 1.6% of scores <3, and 12.3% ± 1.0% of scores ≥3. There was a significant association between a supervision score <2 (or <3) and a comment being provided (<2 odds ratio = 4.18, 95% CI = 3.03–5.77, P = 0.0001; <3 odds ratio = 1.73, 95% CI = 1.46–2.06, P = 0.0004).

Comments with (not) “see” or (not) “saw” and the theme, “I did not see the anesthesiologist during the case(s) together,” increased the odds of a CRNA providing a supervision score <2 (odds ratio = 74.2, 95% CI = 21.0–262, P = 0.0003) and a supervision score <3 (odds ratio = 48.2, 95% CI = 39.5–58.8, P < 0.0001). This comparison was made versus all of the other supervision scores provided by CRNAs, with or without comments. Because presence of a comment, regardless of its content, was associated with supervision scores (see above), we repeated this analysis of (not) “see” or (not) “saw,” while limiting consideration to scores with comments (i.e., scores without comments were excluded). Here too, there was an association between these words and a score <2 (odds ratio = 19.4, 95% CI = 8.2–45.9, P = 0.0003) and a score <3 (odds ratio = 31.5, 95% CI = 24.0–41.4, P < 0.0001).d This result shows content (face) validity.

Anesthesiology residents are supervised with the requirement of substantial anesthesiologist presence for each case, both for education (Accreditation Council for Graduate Medical Education) and billing (Centers for Medicare and Medicaid Services).e,f However, because Iowa is an “opt-out” state, substantial anesthesiologist presence is not required for CRNA billing.g Among the 77 comments containing (not) “see” or (not) “saw,” many more were made by CRNAs than by residents (n = 75 [97.4%] vs n = 2 [2.6%], respectively, P < 0.0001), showing concurrent validity.

The supervision evaluation score has an upper boundary of 4.0 (“always”) (Table 1). Some of the 9 individual questions contain content that clearly is applicable directly to residents (e.g., teaching) (Table 1). If some of the 9 questions were not perceived by the CRNAs practicing in our department as relevant to their interprofessional interactions, we would expect that these questions should not routinely be scored 4 (“always”). One indication that this was not so is the above-mentioned Cronbach’s alpha of 0.907 ± 0.003. Another indication was that the overall supervision score provided was 4.0 (“always”) for 24.9% ± 0.3% of the CRNA evaluations. An overall score of 4.0 was significantly more common than even the next most common combination of scores (P < 0.0001; difference 16.1% ± 0.6%).h This result shows content (face) validity of the question set.


The de Oliveira Filho supervision instrument (Table 1) was designed and validated for use by anesthesiology residents when evaluating faculty supervision during clinical care. In this paper, we found that the instrument also is reliable and valid when used by CRNAs (Table 2). This knowledge has helped our department in using the instrument for daily CRNA evaluations of the anesthesiologists’ supervision.6,7 With the Bernoulli cumulative sum control charts recalculated with each evaluation, low levels of anesthesiologist supervision performance can be detected reliably within 2 months (Table 2.10).

Prior to the current study, it was not known how individual supervision questions (Table 1) would be scored when an anesthesiologist is often absent from the OR where the case(s) is being performed. Our findings show that when there is limited physical presence of the anesthesiologist, the anesthesiologist then is rated low (e.g., <2), precisely as would be desired of a valid instrument. Thus, our results provide further indication of the validity of the supervision instrument when scored by CRNAs. This result complements our previous finding that the anesthesiologists who receive lower than average supervision scores from CRNAs are the same anesthesiologists who receive lower than average supervision scores from residents (Table 2.3 and footnote c in the Results).

Among anesthesiology residents nationally, supervision scores were highly associated with 1 variable: “Teamwork within [the rotation]” (Table 2.4) (see Introduction). This dimension’s questions include: “When one area in this rotation gets busy, others help out” and “When a lot of work needs to be done quickly, we work together as a team to get the work done.” Neither of these can be accomplished without being present (i.e., “seen”). Our results are consistent with this finding: a supervising anesthesiologist may tell a CRNA what they want done and yet receive a low score on the quality of their supervision by failing to be present as a team member.

Under State of Iowa law and University of Iowa policy, when anesthesia care was provided by CRNAs, the degree of supervision provided by anesthesiologists could be based on the judgment of the anesthesiologist and the CRNA, not on the billing directives of the United States Centers for Medicare and Medicaid Services (see footnote g). This allowed us to study a broad continuum of supervision. Other anesthesia groups nationwide can benefit from the scientific knowledge that the supervision instrument is reliable and valid when completed by CRNAs. This is important given our previous finding that the CRNA:MD ratio was not correlated with the level of supervision provided (Table 2.7).6


Name: Franklin Dexter, MD, PhD.

Contribution: This author helped design the study, conduct the study, analyze the data, and write the manuscript.

Attestation: Franklin Dexter has approved the final manuscript.

Name: Danielle Masursky, PhD.

Contribution: This author helped conduct the study.

Attestation: Danielle Masursky has approved the final manuscript.

Name: Bradley J. Hindman, MD.

Contribution: This author helped design the study, conduct the study, and write the manuscript. This author is the archival author.

Attestation: Bradley J. Hindman has approved the final manuscript.


Dr. Franklin Dexter is the Statistical Editor and Section Editor for Economics, Education, and Policy for Anesthesia & Analgesia. This manuscript was handled by Dr. Steven L. Shafer, Editor-in-Chief, and Dr. Dexter was not involved in any way with the editorial process or decision.


a Available at: and Accessed August 1, 2014.

b Although we tried to be quantitative and systematic in choosing words for scientific evaluation, this process achieved nothing more than what was grossly obvious simply by sorting all scores in ascending sequence and reading the 23 comments for which the corresponding score was the lowest possible, 1.00 (“never”). “I did not see the anesthesiologist during the case(s) together” was the theme of 16 of the 23 comments. Reviewing all the comments associated with scores <2 was done independently by 2 of the authors (FD, BJH) before the quantitative screening process, and neither identified any other common theme.

c This finding was reported previously for the first 6-months of data (Table 2.3). In addition, this analysis was different from all other parts of the Results because the data were pooled among the 8-week periods. As we explained in detail previously,6 the mean was taken of each rater’s scores for the anesthesiologist and the anesthesiologist’s overall mean was calculated equally weighting each rater. This is the appropriate approach psychometrically (statistically).6,7 As stated in Table 2.3, previously using 6 months of data, the Kendall’s τb = 0.36 (P < 0.0001).6 The τb = 0.43 when limited to anesthesiologists who were evaluated both by at least 9 different CRNAs and 9 different residents, and τb = 0.51 when limited to anesthesiologists evaluated by at least 15 different CRNAs and 15 different residents. Using the current 1 year of data, the corresponding τb = 0.34, 0.46, and 0.50, respectively. The corresponding sample sizes for the 1 year of data were N = 74, 59, and 56 anesthesiologists, respectively.

d Among the 75 evaluations with a comment related to the theme “I did not see the anesthesiologist during the case(s) together,” 74.7% of the scores provided were <2 and 96.0% were <3. Among the 6110 evaluations without such a comment, 4.3% of the scores provided were <2 and 33.6% were <3. Among the 1013 evaluations without such a comment but with any other comment, 14.4% of the scores provided were <2 and 44.4% of the scores provided were <3. Thus, comments were more likely included when low scores were provided.

e ACGME Program Requirements for Graduate Medical Education in Anesthesiology. Available at: Accessed October 24, 2014.

f Department of Health and Human Services, Centers for Medicare and Medicaid Services. CMS Manual System. Pub 100–04 Medicare Claims Processing, Transmittal 1859, November 20, 2009. Subject: MIPPA Section 139 Teaching Anesthesiologists. Available at: Accessed November 28, 2013.

g Available at: Accessed September 12, 2014.

h Analyzing why there were many 4.00 scores among the CRNAs was not fruitful. Following precisely the same process as for scores <2.00 and <3.00, while controlling the false discovery rate, the only verb significantly (P < 0.00001) associated with a score of 4.00 was “teach.” None of the other tenses was statistically significant even without correction for the multiple comparisons (all P > 0.15). The theme was “Dr. {blinded} is a pleasure to work with … and loves to teach.” However, 12 of these 13 comments with “teach” were written by just 1 CRNA. The 13th comment was that the CRNA “really appreciated” that the anesthesiologist “took the time to teach the SRNA” (i.e., student registered nurse anesthetist) in the OR that day. Thus, although this observation about the comments associated with scores of 4.0 (“always”) provided credence to the validity of the supervision instrument at high scores, these comments (unlike the scores per se) essentially reflected the perspectives of a single individual.


1. Shumway SH, Del Risco J. A comparison of nurse anesthesia practice types. AANA J. 2000;68:452–62
2. Taylor CL. Attitudes toward physician-nurse collaboration in anesthesia. AANA J. 2009;77:343–8
3. de Oliveira Filho GR, Dal Mago AJ, Garcia JH, Goldschmidt R. An instrument designed for faculty supervision evaluation by anesthesia residents and its psychometric properties. Anesth Analg. 2008;107:1316–22
4. Hindman BJ, Dexter F, Kreiter CD, Wachtel RE. Determinants, associations, and psychometric properties of resident evaluations of faculty operating room supervision in a US anesthesia residency program. Anesth Analg. 2013;116:1342–51
5. Dexter F, Logvinov II, Brull SJ. Anesthesiology residents’ and nurse anesthetists’ perceptions of effective clinical faculty supervision by anesthesiologists. Anesth Analg. 2013;116:1352–5
6. Dexter F, Ledolter J, Smith TC, Griffiths D, Hindman BJ. Influence of provider type (nurse anesthetist or resident physician), staff assignments, and other covariates on daily evaluations of anesthesiologists’ quality of supervision. Anesth Analg. 2014;119:670–8
7. Dexter F, Ledolter J, Hindman BJ. Bernoulli cumulative sum (CUSUM) control charts for monitoring of anesthesiologists’ performance in supervising anesthesia residents and nurse anesthetists. Anesth Analg. 2014;119:679–85
8. Paoletti X, Marty J. Consequences of running more operating theatres than anaesthetists to staff them: a stochastic simulation study. Br J Anaesth. 2007;98:462–9
9. Epstein RH, Dexter F. Influence of supervision ratios by anesthesiologists on first-case starts and critical portions of anesthetics. Anesthesiology. 2012;116:683–91
10. De Oliveira GS Jr, Rahmani R, Fitzgerald PC, Chang R, McCarthy RJ. The association between frequency of self-reported medical errors and anesthesia trainee supervision: a survey of United States anesthesiology residents-in-training. Anesth Analg. 2013;116:892–7
11. de Oliveira Filho GR, Dexter F. Interpretation of the association between frequency of self-reported medical errors and faculty supervision of anesthesiology residents. Anesth Analg. 2013;116:752–3
12. De Oliveira GS Jr, Dexter F, Bialek JM, McCarthy RJ. Reliability and validity of assessing subspecialty level of faculty anesthesiologists’ supervision of anesthesiology residents. Anesth Analg. 2015;120:209–13
13. Hindman BJ, Dexter F, Smith TC. Anesthesia residents’ global (departmental) evaluation of faculty anesthesiologists’ supervision can be less than their average evaluations of individual anesthesiologists. Anesth Analg. 2015;120:204–8
14. Dexter F, Marcon E, Epstein RH, Ledolter J. Validation of statistical methods to compare cancellation rates on the day of surgery. Anesth Analg. 2005;101:465–73
15. Neuendorf KA. The Content Analysis Guidebook. Thousand Oaks, CA: Sage Publications; 2002:11, 13, 23, 116, 129
© 2015 International Anesthesia Research Society