Resident supervision is defined by the Veterans Health Administration as “an intervention provided by a supervising practitioner to a resident … [that] is evaluative, extends over time, and has the simultaneous purposes of enhancing the professional functioning of the resident while monitoring the quality of professional services delivered … [and that] is exercised through observation, consultation, directing the learning of the resident, and role modeling.”a Faculty supervision of resident patient care is a required element of both postgraduate medical educationb and billing compliance.c However, in this study, our consideration of “supervision” is distinct from United States billing nomenclature (“medical supervision”). We use the term supervision to include all faculty clinical oversight functions directed toward optimizing the quality of clinical care1 whenever the faculty anesthesiologist is not the sole anesthesia care provider. A recent systematic review concluded that enhanced faculty supervision of residents favorably affects: (1) procedural complications, (2) assessment of patient acuity, (3) diagnostic and treatment plans, and (4) resident adherence to quality of care guidelines.2 However, the authors of the review also concluded that the lack of objective standardized metrics of supervision limits past and current research regarding the effect of faculty supervision on patient outcomes.
In 2008, de Oliveira Filho et al.3 developed and validated a 9-item question set by which anesthesia residents evaluated the supervision provided by anesthesia faculty in the operating room. On the basis of responses from a group of 19 residents (clinical anesthesia [CA] -1’s, -2’s, and -3’s), the question set showed a high level of internal consistency within forms (single-factor structure), and mean scores attained high reliability and dependability when averaged over a sufficient number of independent observations of supervision by unique resident raters. In their analysis, de Oliveira Filho et al.3 considered the evaluations from all residents to be equally valid. In other words, the evaluation from a CA-1 resident was weighted equally with the evaluation from a CA-3 resident; the evaluation from a resident who had worked with a faculty member once was weighted equally with the evaluation from a resident who had worked with a faculty member many times; and the evaluation from a resident who had worked with a faculty member the previous day was weighted equally with the evaluation from a resident who had worked with a faculty member weeks or months before the evaluation. While it is possible that 1 or more of these factors could meaningfully affect resident assessments of anesthesia faculty supervision, de Oliveira Filho et al.3 were limited by their sample size and study design in their ability to make these assessments. Furthermore, although de Oliveira Filho et al.3 showed the question set and measurement method had excellent psychometric properties (reliability, dependability) when individual faculty received evaluations from 4 or more residents, this result was obtained in a single institution in Brazil. It is not known how well the de Oliveira Filho et al. supervision question set performs in the setting of an entirely different group of residents and faculty with differences in culture, language, and health care system.
Accordingly, this study was conducted with the following 3 aims. Our first aim was to determine whether anesthesia faculty operating room supervision scores were associated with any of the following: (1) resident clinical anesthesia experience (CA-1 versus -2 versus -3), (2) the number of specific resident interactions with the faculty member (either patient encounters or days working together), or (3) the interval between the last interaction of the resident with the faculty. Our second aim was to characterize associations between faculty operating room supervision scores and resident assessments of: (1) faculty supervision in clinical settings outside of the operating room (e.g., intensive care unit), (2) faculty clinical ability (based on whether the resident would choose the faculty anesthesiologist to care for a family member; family choice), and (3) faculty teaching effectiveness. Our third aim was to conduct a generalizability analysis to characterize the psychometric properties of the de Oliveira Filho et al. supervision question set in a US anesthesia residency program (University of Iowa, Iowa City, IA).
This study was approved by and conducted in accordance with guidelines set forth by the University of Iowa IRB for Human Subjects. We invited all 39 residents in the Department of Anesthesia of the University of Iowa who, as of May 2012, were in the first (n = 14), second (n = 13), or third (n = 12) year of clinical anesthesia training (i.e., CA-1, -2, and -3, respectively) to participate. Participation was not part of the residency program and was voluntary. Residents were each paid $200 for their participation. All residents chose to participate and provided written informed consent before initiation of the study. We did not perform an a priori power analysis because we studied an entire finite sample of available residents and faculty. However, we knew before the study started that our sample size of raters was twice as large as that used successfully by de Oliveira Filho et al.3
Using a secure resident-specific internet hyperlink and a standard electronic evaluation form (see below) generated by Qualtrics (Qualtrics Labs, Inc., Provo, UT), residents were asked to evaluate all anesthesia faculty who were currently clinically active (n = 56), with the exception of the study principal investigator (B.J.H.). Although residents evaluated anesthesia faculty, faculty were not study subjects and faculty consent was not obtained. In our department, anesthesia faculty practice in 3 clinical settings: (1) operating rooms (n = 49 faculty; 40 of whom practice only in operating rooms); (2) surgical intensive care unit (SICU; n = 10 faculty; 7 of whom also practice in operating rooms); and (3) pain clinic (n = 6 faculty; 2 of whom also practice in operating rooms).
Each evaluation form consisted of the set of 9 questions to assess faculty supervision developed and reported by de Oliveira Filho et al.3 For faculty who practiced in more than 1 clinical setting (e.g., operating room and SICU), residents were asked to complete a separate evaluation for each clinical setting. Although the de Oliveira Filho et al. question set was developed specifically to assess operating room supervision, the investigators decided to also request assessments of faculty supervision in the SICU and pain clinic as measures of concurrent validity. For SICU and pain clinic evaluations, questions 5, 6, and 7 of the de Oliveira Filho et al.3 question set were modified to provide examples consistent with clinical practice in those settings (Table 1). We recognized a priori that, because of our small sample size, it would not be possible to assess with precision the validity of the de Oliveira Filho et al. supervision question set in settings other than operating rooms.
In addition to the 9 supervision questions, each resident was asked whether s/he would choose the faculty member to care for his/her family in that specific clinical setting (question 10) and whether the faculty member was an excellent teacher in that specific clinical setting (question 11). Questions were answered on a 4-point Likert scale (Table 1). The descriptors linked to each of the 4 numeric values differed from those used by de Oliveira Filho et al.3 Specifically, a value of 1 was “consistently no” rather than “never;” 2 was “usually no” rather than “rarely;” 3 was “usually yes” rather than “frequently;” and 4 was “consistently yes” rather than “always” (see Discussion).
For faculty who practiced in the SICU or pain clinic, residents were asked to evaluate faculty in that setting using 2 different evaluation forms, the de Oliveira Filho et al. supervision question set (modified for that setting as described above) and the current University of Iowa teaching assessment form specific for that clinical setting. The investigators recognized a priori that our existing teaching evaluation forms for these 2 nonoperating room locations were not validated. Consequently, the comparison of faculty scores from the SICU and pain clinic (de Oliveira Filho et al. supervision question set versus the current University of Iowa teaching question set) is provided only as a footnote in the Results section.
Overall, each resident was asked to provide 81 faculty evaluations: 1 evaluation each for the 40 faculty with operating room practice only; 2 evaluations each for the 7 faculty whose practice was limited to the SICU (n = 3) or pain clinic (n = 4); and 3 evaluations each for the 9 faculty who practice in both the operating room and either the SICU (n = 7) or the pain clinic (n = 2). Evaluations were presented to each resident in 3 groups in random order (operating room, SICU, pain clinic), and the order in which residents were asked to evaluate the faculty within each of the 3 groups was also randomized. Residents could decline to provide an evaluation if they indicated that they had not had sufficient contact with the faculty member to provide an evaluation. Otherwise, all questions on each evaluation were to be answered before proceeding to the next evaluation. After an evaluation was submitted, residents could not go back to either view or change their answers.
Residents were required to provide all evaluations within an 84-hour interval between 12:00 PM on Friday, May 11, 2012 and 11:59 PM on Monday, May 14, 2012. The month of May was selected to increase resident overall clinical experience and resident–faculty interactions within each resident class (CA-1, -2, -3) and so that CA-3 residents would be within a few weeks of their graduation (June 30, 2012). The 84-hour Friday–Monday interval was selected to allow most residents to complete their assessments over the weekend with few to no concurrent resident–faculty clinical interactions. The 84-hour reporting interval also facilitated our analysis of billing-based resident–faculty interactions.
Departmental billing data from July 1, 2009 through May 14, 2012 were used to quantitate resident–faculty interactions. For this study, operating room care was defined to include all types of anesthesia care (general anesthesia, regional anesthesia, and monitored anesthesia care), both elective and emergent, provided in all anesthetizing locations within the University of Iowa Hospitals and Clinics for which a bill was prepared. The majority of this care occurred in either the main operating rooms or ambulatory surgery center, but other locations such as labor and delivery, and satellite locations such as radiology and other specialty clinics were included. Billing data were used to calculate 3 variables for each specific resident–faculty pair: (1) patient encounters; (2) days of interaction; (3) the interval between the last day of interaction and the resident’s evaluation. A resident–faculty pair was considered to have worked together when the billing record indicated they had cared for the same patient concurrently for 15 consecutive minutes or longer. A patient encounter was defined as when anesthesia care provided to a patient generated a distinct and separate bill on any single day. Accordingly, although rare, if the resident–faculty pair cared for the same patient: (1) for 2 or more distinct and separately billed anesthetics on the same day, or (2) for 2 or more separately billed anesthetics on 2 or more different days, each bill would be considered to be a unique patient encounter. A resident–faculty pair was considered to have worked together for the day if they had at least 1 patient encounter for at least 15 minutes. However, in the event that patient care started on one day and ended on another, this would count as only 1 day. Finally, billing data were used to calculate the interval between the resident’s assessment of the faculty member and the most recent day that they had worked together before the assessment.
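The counting rules above can be sketched in code. This is an illustrative sketch only: the record layout (`bill_id`, `date`, `overlap_minutes`) and the helper name `summarize_pair` are invented for the example and do not reflect the department's actual billing schema.

```python
from datetime import date

def summarize_pair(records, evaluation_date):
    """Summarize one resident-faculty pair: count patient encounters and days of
    interaction, and compute the interval from the last shared day to the
    resident's evaluation. Only overlaps of 15 minutes or longer count."""
    qualifying = [r for r in records if r["overlap_minutes"] >= 15]
    # each distinct and separately billed anesthetic is one patient encounter
    encounters = len({r["bill_id"] for r in qualifying})
    # care spanning midnight is assumed billed under its start date, so it counts as 1 day
    days = {r["date"] for r in qualifying}
    last_gap = (evaluation_date - max(days)).days if days else None
    return encounters, len(days), last_gap

records = [
    {"bill_id": "A1", "date": date(2012, 5, 7), "overlap_minutes": 90},
    {"bill_id": "A2", "date": date(2012, 5, 7), "overlap_minutes": 45},  # second bill, same day: separate encounter
    {"bill_id": "B1", "date": date(2012, 5, 10), "overlap_minutes": 10},  # under 15 minutes: excluded
]
encounters, n_days, gap = summarize_pair(records, date(2012, 5, 12))
```

With these toy records the pair has 2 patient encounters on 1 day of interaction, with the last interaction 5 days before the evaluation.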
Thirty-nine (n = 39) residents provided 1549 evaluations of faculty (n = 49) operating room supervision. By design, all questions (100%) on each provided evaluation were completed; there were no missing individual question scores because nonresponses were not allowed. Faculty supervision scores were calculated as the mean of all 9 supervision questions provided by all resident raters. For each of the 3 resident classes, Table 2 summarizes the number of operating room evaluations provided and the number of: (1) patient encounters, (2) days of interaction, and (3) days between the last interaction with faculty in the operating room and the resident’s evaluation.
All 49 (100%) operating room faculty received resident evaluations. The median number of resident evaluations per faculty was 34 (25th–75th percentiles [interquartile range], 29–37). Among the 1549 evaluations of faculty operating room supervision, there were 28 evaluations where there was no billing record of the resident–faculty pair having worked together in the operating room. Excluding 3 of 49 faculty (n = 48 total evaluations) for whom most or all of their operating room resident interactions took place outside of our billing system (i.e., at the Veterans Administration Hospital), there were 46 faculty who received 1501 (1549 − 48) evaluations. From this latter set (46 faculty, 1501 evaluations), there were 16 evaluations in which there was no billing record of an interaction between the resident and faculty. Thus, for the 46 faculty whose care was based primarily or exclusively at the University Hospital, 98.9% (1485/1501) of all resident evaluations had billing-based confirmation that the resident–faculty pair had worked together in the operating room setting on at least 1 occasion. Only evaluations with corresponding billing-based data (n = 1485) were included in analyses regarding associations between resident–faculty interactions and supervision scores. Seven of 49 faculty who supervised residents in the operating room also supervised residents in the SICU. For these 7 faculty, there were 151 paired evaluations of supervision in both the operating room and SICU provided by the same resident.
A generalizability (G) study was performed to estimate variance components, or sources of variance, in the measurement process.4 A G-study estimates the magnitude or influence of various aspects (or facets) of the measurement process on observed scores. These variance components were then used in the calculation of generalizability (G) and dependability (φ) indexes. The G-coefficients estimated the reproducibility (or reliability) of the relative ranking of faculty when rankings were based on mean score averaged over various numbers of independent and unique resident ratings. A dependability (φ) index estimates the dependability of the scores for absolute decisions (e.g., to estimate a mean score with a given level of confidence). G- and φ-coefficients ≥0.80 denote very high levels of reliability and dependability.5–7 For the G-study, individual faculty (person, p) were the objects of measurement. Resident raters (r) were considered to be a random nested facet because: (1) residents were pseudorandomly assigned to evaluate faculty from a pool of residents and (2) not all residents provided evaluations of all faculty. Questions (items, i) were considered to be a random and crossed facet because all resident raters assessed each faculty using the same set of items and these items were considered a sample from a universe of equally acceptable items that could have been developed using a similar item development process. In other words, although the 9 questions comprised the entire supervision question set, it is possible other questions might also be used to evaluate faculty supervision (see Discussion). The model reflected the fact that different combinations of resident raters (r) provided evaluations of each faculty member (p) on every question (item, i); residents were nested (:) within faculty (r:p) and crossed (×) with questions (items, i). Accordingly, the notation for our G-study design was (r:p) × i, with all facets random.
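The mapping from estimated variance components to G- and φ-coefficients for a random (r:p) × i design can be sketched as follows. The variance components below are invented for illustration; they are not the study's Table 3 estimates.

```python
def g_and_phi(var_p, var_i, var_pi, var_r_in_p, var_resid, n_raters, n_items):
    """G- and phi-coefficients for a random (r:p) x i design.
    var_p      : faculty (object of measurement) variance
    var_i      : item main-effect variance
    var_pi     : faculty x item interaction variance
    var_r_in_p : raters-nested-in-faculty variance (includes the confounded r effect)
    var_resid  : residual (includes the 3-way interaction)"""
    # relative error: only facets that can change faculty rank ordering
    rel_err = var_r_in_p / n_raters + var_pi / n_items + var_resid / (n_raters * n_items)
    # absolute error additionally includes the item main effect
    abs_err = rel_err + var_i / n_items
    g = var_p / (var_p + rel_err)
    phi = var_p / (var_p + abs_err)
    return g, phi

# illustrative components only: faculty 0.15, items 0.02, p x i 0.03,
# raters:faculty 0.20, residual 0.30; 9 items as in the question set
g, phi = g_and_phi(0.15, 0.02, 0.03, 0.20, 0.30, n_raters=15, n_items=9)
```

Because the rater terms are divided by the number of raters, increasing the number of residents per faculty shrinks the dominant error terms, which is why both coefficients are driven primarily by rater count rather than item count.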
Generalizability analyses were performed using GENOVA®, version 3.1 (2001).d The balanced generalizability study analysis was performed using the supervision scores of 42 faculty who had received operating room supervision evaluations from 25 or more resident raters with 25 raters randomly selected for each faculty.
All other statistical analyses were performed using StatXact-9 (Cytel Software Corporation, Cambridge, MA) or SAS version 9.3 (SAS Institute, Inc., Cary, NC). All P values are 2-sided.
Factor analysis based on all operating room evaluations (n = 1549) showed a high internal consistency among the 9 items on the supervision question set; Cronbach α coefficient was 0.885. This is consistent with the single-factor structure reported by de Oliveira Filho et al.3 Accordingly, in all subsequent results we refer to a single factor: supervision.
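Cronbach α, the internal-consistency statistic reported above, compares the sum of the individual item variances with the variance of the total scores. A minimal from-scratch sketch, using toy ratings rather than the study's data:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha: alpha = (k/(k-1)) * (1 - sum(item variances) / variance(totals)).
    scores: one row per evaluation; each row holds the item ratings."""
    k = len(scores[0])

    def var(xs):  # population variance, sufficient for illustration
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# toy data: 4 evaluations x 3 items with broadly consistent ratings
ratings = [[4, 4, 3], [3, 3, 3], [2, 2, 3], [4, 3, 4]]
alpha = cronbach_alpha(ratings)  # 0.75 for these toy rows
```

When items move together across evaluations (a single underlying factor, as observed here), the total-score variance grows relative to the summed item variances and α approaches 1.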
A 1-way analysis of variance demonstrated a significant difference among the mean ratings of faculty anesthesiologists’ operating room supervision provided by the 3 resident classes (P = 0.0201; F = 3.92). On post hoc testing, the only statistically significant difference was between the mean supervision scores of the CA-2 and CA-3 resident classes. However, the mean difference of 0.07 was not practically meaningful on the 1 to 4 scale.
We calculated the overall Kendall τb between faculty anesthesiologists’ mean operating room supervision scores (n = 1485) and: (1) resident–faculty patient encounters (τb = 0.01; 95% confidence interval [CI], −0.02 to +0.04; P = 0.71); (2) resident–faculty days of interaction (τb = −0.01; 95% CI, −0.05 to +0.02; P = 0.46); and (3) days since last resident–faculty interaction (τb = 0.01; 95% CI, −0.02 to 0.05; P = 0.49).e These narrow CIs show that resident assessments of faculty operating room supervision did not vary with either the amount of individual resident–faculty interaction (either number of patient encounters or days working together) or the interval between the last interaction and the assessment. Thus, mean faculty operating room supervision scores were not affected meaningfully by: (1) the level of resident clinical anesthesia experience (year of residency training); (2) the amount of interaction between resident and faculty (number of patients or days); or (3) how recently the resident and faculty worked together. On this basis, faculty supervision scores provided by all residents were pooled and analyzed collectively.
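Kendall τb, used throughout these correlation analyses, is the tie-corrected version of Kendall τ, which matters here because Likert-derived scores and interaction counts contain many ties. A from-scratch O(n²) sketch with toy data (in practice one would use a statistics package):

```python
def kendall_tau_b(x, y):
    """Kendall tau-b: (C - D) / sqrt((n0 - ties_x) * (n0 - ties_y)),
    where n0 is the total number of pairs and ties_x/ties_y count pairs
    tied on x and on y, respectively."""
    n = len(x)
    n_pairs = n * (n - 1) // 2
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    denom = ((n_pairs - tied_x) * (n_pairs - tied_y)) ** 0.5
    return (concordant - discordant) / denom

tau = kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3])  # toy vectors with one tie in each
```

A τb near 0 with a narrow confidence interval, as reported above, indicates that reordering faculty by interaction counts would look essentially like a random shuffle relative to their supervision scores.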
Using paired resident supervision scores from both the operating room and SICU (n = 151),f Kendall τb correlation coefficient was 0.71 (95% CI, 0.63 to 0.78; P < 0.0001), and Krippendorff α was 0.79 (95% CI, 0.69 to 0.86). Because the operating room and the SICU differ both in the nature of the clinical care and educational content, the strong positive correlation between supervision scores in both environments suggests the de Oliveira Filho et al.3 question set is addressing a factor common to both environments, namely, supervision.
The relationships between individual faculty operating room supervision scores (mean of all evaluations) and: (1) family choice score (mean of all evaluations) and (2) teaching excellence scores (mean of all evaluations) are shown in Figure 1, A and B, respectively. There was a strong positive association between mean operating room supervision scores and mean family choice scores (τb = 0.77; 95% CI, 0.70 to 0.84; P < 0.0001). Thus, resident assessments of the quality of faculty anesthesiologists’ supervision and the quality of their clinical care were closely interrelated. There was also a strong positive association between mean operating room supervision scores and mean operating room teaching scores (τb = 0.87; 95% CI, 0.82 to 0.92; P < 0.0001). Thus, it appears residents consider faculty anesthesiologists’ supervision and teaching also to be closely interrelated.g
The results of the generalizability analysis (42 faculty receiving operating room supervision evaluations from 25 or more residents) are summarized in Table 3. The faculty supervision score variance related to the 9 supervision questions (i, pi) was relatively small (5% and 8%, respectively). This indicates that both the number of questions used and which particular questions were used had relatively little effect on either the rank ordering of faculty or the absolute value of their mean supervision scores. In contrast, most of the error variance in faculty supervision scores was related to resident raters (29%, which includes the confounded/combined effects of r and r:p). Resident raters also contributed to some degree to the 3-way interaction among resident, faculty, and question (43%, which includes both the 3-way interaction and residual error). Hence, the magnitude of the error variance in an individual mean faculty supervision score was primarily determined by the number of residents providing assessments.
The mean score reliability measure, reflected by the G-coefficient, is sensitive to relative error variance and is used as an indicator of the reproducibility of faculty rankings based on mean supervision score. The dependability measure, the φ-coefficient, expresses the absolute reproducibility of the supervision score and reflects the degree to which the mean supervision score would change on replication of the measurement process. Therefore, for determination of whether an individual faculty member has attained a supervision score with a defined level of accuracy, or whether their true supervision score is above a threshold value, a φ-coefficient is more relevant. Consistent with: (1) the observed single-factor structure of the 9-item supervision question set and (2) the observed low level of faculty score variance from terms related to the questions, both modeling studies show that the reliability (G-coefficient) and dependability (φ-coefficient) of individual faculty supervision scores are little affected by the number of questions. As shown in Figure 2, there is little meaningful increase in either reliability or dependability from adding questions beyond the existing de Oliveira Filho et al. 9-question set. The 9-question set is clearly sufficient for assessing supervision.
Both modeling studies show the reliability (G-coefficient) and dependability (φ-coefficient) of faculty supervision scores are primarily determined by the number of residents per faculty providing responses. When the number of resident raters exceeds 15, both G- and φ-coefficients exceed 0.80. This means that whenever individual faculty anesthesiologists receive assessments from 15 or more different resident raters: (1) faculty can be reliably ranked and (2) the absolute values of faculty supervision scores are highly reliable. Although additional assessments beyond 15 residents further increase reliability and dependability, the gains are minimal.
In our study, anesthesia faculty operating room supervision scores were not meaningfully affected by: (1) the level of resident general clinical experience (year of residency), (2) the amount of interaction between resident and faculty (number of patients or days), (3) how recently the resident and faculty worked together, or (4) the clinical environment in which supervision took place (operating room versus SICU). These observations suggest that, as designed originally by de Oliveira Filho et al. using Delphi methodology,3 operating room supervision scores probably do not solely reflect traditional teaching scores but, rather, include a much broader set of attributes related to the supervisory capacities of faculty anesthesiologists. We conclude that supervision scores provided by each resident can be considered to be equally valid and can be given equal weight when calculating an individual faculty anesthesiologist’s mean supervision score.
Our study suggests potential future studies. There can be evaluation of whether adjusting the relative weight of individual resident supervision scores (based on the total number of evaluations made of the individual faculty, n = 1, 2, 3, etc.) meaningfully affects faculty supervision score reliability and dependability. There can be formal evaluation of the psychometric properties of the de Oliveira Filho et al. supervision question set in anesthesia clinical environments other than the operating room, such as intensive care units and pain clinics. Our sample size was too small to permit such an analysis. This question set might serve as an appropriate single instrument to assess anesthesia faculty supervision in all clinical environments.
If supervision were exactly the same thing as teaching, one would expect that as residents become more clinically experienced, the need for teaching might decrease. Because faculty anesthesiologists’ operating room supervision scores did not vary as a function of resident general clinical experience (CA-1’s versus -2’s versus -3’s), supervision would seem to differ from teaching. On the other hand, it is possible that, as faculty engage in greater levels of supervision, they may better recognize the educational needs of more senior residents and change the content of their teaching accordingly. If so, greater levels of supervision might result in continued, rather than decreased, teaching effectiveness as residents progress. To be sure, we observed strong associations between supervision scores and teaching scores in all clinical environments (operating room, SICU, pain clinic). It may not be possible (nor necessary) to distinguish between the qualities of resident clinical supervision and clinical teaching. This topic is considered further in our companion paper.8
As discussed previously, the weight of current evidence supports the concept that patient diagnosis and treatment are favorably affected by faculty supervision.2 In surgery, some9 but not all10 studies report lesser perioperative morbidity and mortality when attending surgeons are present. Two studies have addressed the clinical effects of anesthesia faculty supervision of residents. In the first, significantly fewer airway management complications occurred when a faculty member was present during emergent tracheal intubations outside of the operating room.11 Although the outcomes were defined and prospectively recorded, the quality of supervision was not quantified (only the presence or absence of faculty). Recently, De Oliveira et al. surveyed US anesthesia residents regarding their perceived level of supervision.12 Using the de Oliveira Filho et al. supervision question set,3 residents who reported mean (program-wide) supervision scores less than 3.0 (“frequent”) reported a significantly more frequent occurrence of mistakes with negative consequences to patients, as well as medication errors. Although the quality of supervision was quantified using a validated instrument,3 the outcome measures were subjective and assessed retrospectively. Nevertheless, these 2 studies suggest that adequate levels of anesthesiologist supervision may be essential for resident physicians’ patient safety and quality clinical care. In our study, faculty operating room supervision scores were highly associated with residents’ perceptions of faculty overall clinical ability as reflected in residents’ choices for the faculty member to care for their family. We suggest future studies should prospectively assess the association between individual anesthesia faculty operating room supervision scores and anesthesia-related adverse events.
Because each resident possesses their own individual set of strengths, needs, and communication styles, a range of resident perceptions regarding individual faculty performance in supervision is to be expected. For this reason, an adequate sample (i.e., an adequate number) of residents is needed to obtain a reliable general measure of individual faculty supervisory performance. Because resident perceptions of faculty supervision do not appear to greatly depend on the absolute number of interactions, or to change with resident general clinical experience, this suggests resident perceptions are largely determined by: (1) the characteristics of the individual resident irrespective of the faculty with whom they are working (individual resident leniency or stringency) and (2) the interpersonal interaction of the specific resident–faculty pair (dyad).13,14 Indeed, our generalizability study indicates that approximately 29% of supervision score variance is derived from these 2 factors. A resident’s overall impression of the faculty anesthesiologist, the net gestalt (often referred to as the “halo effect”), tends to bias resident assessment scores on all questions, with residents tending to provide either generally favorable (or unfavorable) scores on all questions of the assessment scale for a specific faculty, irrespective of the specific content of the question. Our observation of strong correlations between measures of supervision and measures of teaching in the SICU and pain clinic, each based on entirely different question sets, is consistent with this effect.
Because of differences among residents in their individual leniency, and their unique interaction characteristics with faculty, supervision scores are not reliable when they are based on assessments provided by only a few residents. This is so because, as the number of evaluating residents decreases, there is an increasing likelihood that the scores are derived from a nonrepresentative (biased) sample of residents. Accordingly, a key question is how many individual residents must provide evaluations for an individual faculty member so that the individual faculty’s mean score is reasonably representative of the score that would be obtained from a much larger (near infinite) sample of residents. Our G-study modeling indicates that mean faculty supervision scores will be both highly reliable and dependable (both G and φ ≥ 0.80) when an individual faculty member is evaluated by 15 or more different residents. Our estimated minimums for the sufficient number of resident evaluations differ from those reported by de Oliveira Filho et al.,3 who reported that evaluations from approximately 4 residents were sufficient to obtain a φ ≥ 0.80.3 Several factors likely contribute to these differences.
In our analysis, we considered the 9-item question set to be a random facet whereas de Oliveira Filho et al.3 considered the question set to be a fixed facet. Their rationale was that the question set encompassed the entire universe of questions that could be asked to address supervision, whereas we considered the question set to be a subset of all possible questions to address supervision. If the question set is considered to be a fixed facet, the variance components related to systematic and interaction effects involving questions are no longer considered error variance. With fewer sources of error variance, the number of resident raters required to obtain reliable and dependable scores decreases. When we repeated our analysis, treating the question set as a fixed facet, we observed that G- and φ-coefficients exceeded 0.80 when evaluations were provided by approximately 9 residents, a value closer to that reported by de Oliveira Filho et al.3 (Appendix). Whether the question set should be considered to be a fixed or random facet is debatable, and the correct answer (the minimum necessary number of resident raters for high reliability and dependability) likely exists somewhere between the 2 extremes. Although the random effects approach is the most conservative and suggests a greater minimum number of resident evaluations is required, it also allows one to generalize more broadly and more confidently. We suggest the random effects analytic approach is most appropriate if/when high stakes decisions are made based on evaluations of the quality of faculty anesthesiologists’ supervision.
Another factor that may contribute to differences between our findings and those of de Oliveira Filho et al.3 is that the question set was originally developed and tested in a uniform environment. Fourteen of 19 (74%) residents and 12 of 17 (71%) faculty who participated in the development of the question set were members of the Brazilian department in which the question set was subsequently tested.3 Some of the same residents and faculty who developed the question set may have participated in its subsequent testing. In our study, the question set was used by an entirely different group of residents evaluating an entirely different group of faculty, with different cultural norms and health care delivery systems. It would be expected that our department’s “fit” to the question set would not be nearly as good as that of the residents and faculty (at the Brazilian department) who originally developed and tested it. The consequence of this poorer fit would be increased score variance and, consequently, a greater minimum number of resident raters needed to obtain high reliability and dependability, as observed. Although our analysis indicates that a larger number of resident raters is needed to obtain high reliability (9–15 vs 4), this is still a fairly small number of residents and should be readily achievable by most US anesthesia residency programs.
A third factor that may have contributed to differences between our findings and those of de Oliveira Filho et al.3 was a methodological error that we, the investigators of the current study, made. The error is important and instructive because, if other centers adopt the use of the de Oliveira Filho et al.3 question set to assess faculty anesthesiologist supervision, they must avoid the error we made: specifically, changing the qualitative descriptors of the numeric rating scale. The qualitative descriptors used by de Oliveira Filho et al.3 were as follows: 1 = never; 2 = rarely; 3 = frequently; 4 = always. Because “never” corresponds to an absolute value of 0%, and “always” corresponds to an absolute value of 100%, these are highly restrictive scores that are difficult for most faculty to attain. Effectively, this forces most raters to most often choose between 2 = rarely and 3 = frequently, functionally decreasing the available scoring options. The qualitative descriptors that we used in the current study differed from those of de Oliveira Filho et al.3 and were as follows: 1 = consistently no; 2 = usually no; 3 = usually yes; 4 = consistently yes. “Consistently yes” is not as restrictive as “always,” and “consistently no” is not as restrictive as “never.” Consequently, in our study, resident raters had a greater range of applicable responses; scores of 1 or 4 would be expected to be assigned more often in our scoring system (see footnote e in the Results). With a greater range of available scores, individual faculty anesthesiologists’ scores would be expected to have greater variance, thereby increasing the total number of resident respondents needed to reach threshold values for reliability and dependability. This was precisely as observed.
Accordingly, if we had used the original de Oliveira Filho et al.3 qualitative descriptor scale for the 4 numeric values, it is likely that the minimum number of resident evaluations required for high reliability and dependability would have been less than we observed.
Because we used different qualitative descriptors for the supervision scores, the absolute values we obtained for faculty supervision scores cannot be compared with scores obtained from other centers using the original descriptors. It is likely that our mean scores are greater than those reported from other centers because our maximum score of 4 was less restrictive. Because there is now evidence that supervision scores less than the absolute value of 3.0 (“frequently”) are associated with increased medical errors,12 it is essential that centers use both: (1) the original de Oliveira Filho et al.3 9-question supervision set and (2) the original qualitative descriptors for the numeric answers. The scoring system in our companion study8 follows those of the original de Oliveira Filho et al. report.3
Our results indicate that resident assessments of faculty supervision are not affected by the interval between the last interaction and the provision of the assessment. Therefore, it is probably not necessary that residents provide assessments immediately after each faculty interaction (i.e., the next day). On the other hand, because of the importance of adequate supervision to the quality of patient care, supervision scores should be calculated and reported on a nearly continuous basis; for this reason, frequent sampling of residents should be used. Nevertheless, the most important factor in obtaining highly dependable supervision scores is obtaining evaluations from a sufficiently large number of individual resident raters. Because resident assessments of faculty anesthesiologists’ supervision do not appear to be affected by the total amount of interaction between the resident and faculty (1 day versus many days; 1 patient versus many patients), when a resident provides multiple evaluations of a faculty member over several occasions, we suggest, for the present, that the mean of each resident’s evaluation scores be used and that the scores from each individual resident be equally weighted. Because the faculty anesthesiologist logically knows the residents with whom he or she has been working and the number of different residents, a daily updated moving average would disclose individual residents’ supervision scores. To preserve resident confidentiality, we suggest that faculty be informed of their supervision scores only periodically (e.g., every few weeks), receiving a moving average of the mean scores provided by all evaluating residents when the number of evaluating residents is sufficient to achieve high dependability (φ > 0.80). On the basis of our findings, that minimum number of individual evaluating residents per faculty anesthesiologist appears to be between 9 and 15.
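The reporting scheme suggested above (average each resident’s multiple evaluations, weight residents equally, and withhold a faculty member’s score until enough distinct residents have contributed) can be sketched as follows. The function name, data layout, and default threshold of 15 residents are our own illustrative choices under the random-facet estimate, not part of the published instrument:

```python
from collections import defaultdict
from statistics import mean

def faculty_report(evaluations, min_residents=15):
    """evaluations: iterable of (faculty_id, resident_id, mean_item_score).

    Each resident's multiple evaluations of a faculty member are first
    averaged, then residents are equally weighted. A faculty member's
    score is reported only when at least min_residents distinct residents
    have contributed; otherwise None is returned to preserve resident
    confidentiality.
    """
    by_faculty = defaultdict(lambda: defaultdict(list))
    for faculty, resident, score in evaluations:
        by_faculty[faculty][resident].append(score)

    report = {}
    for faculty, by_resident in by_faculty.items():
        # One mean per resident, so residents are equally weighted
        # regardless of how many evaluations each provided.
        resident_means = [mean(scores) for scores in by_resident.values()]
        if len(resident_means) >= min_residents:
            report[faculty] = mean(resident_means)
        else:
            report[faculty] = None  # too few raters; withhold the score
    return report
```

Note the design choice this encodes: a resident who evaluated a faculty member 10 times contributes no more weight to the reported mean than a resident who evaluated that faculty member once, consistent with the equal-weighting suggestion above.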
Dr. Franklin Dexter is the Statistical Editor and Section Editor for Economics, Education, and Policy for the Journal. This manuscript was handled by Dr. Steven L. Shafer, Editor-in-Chief, and Dr. Dexter was not involved in any way with the editorial process or decision.
Name: Bradley J. Hindman, MD.
Contribution: This author helped design and conduct the study, analyze the data, and write the manuscript.
Attestation: Bradley J. Hindman has seen the original study data, reviewed the analysis of the data, approved the final manuscript, and is the author responsible for archiving the study files.
Name: Franklin Dexter, MD, PhD.
Contribution: This author helped design and conduct the study, analyze the data, and write the manuscript.
Attestation: Franklin Dexter has seen the original study data, reviewed the analysis of the data, and approved the final manuscript.
Name: Clarence D. Kreiter, PhD.
Contribution: This author helped analyze the data and write the manuscript.
Attestation: Clarence D. Kreiter has seen the original study data, reviewed the analysis of the data, and approved the final manuscript.
Name: Ruth E. Wachtel, PhD, MBA.
Contribution: This author helped design and conduct the study.
Attestation: Ruth E. Wachtel has seen the original study data and approved the final manuscript.
a Department of Veterans Affairs. Veterans Health Administration. Resident Supervision. VHA Handbook 1400.1, July 27, 2005. Available at: http://www.va.gov/oaa/resources_resident_supervision.asp. Accessed November 8, 2012.
b Common Program Requirements, Effective July 1, 2011. Accreditation Council for Graduate Medical Education (ACGME) Approved Standards. Available at: http://acgme-2010standards.org/pdf/Common_Program_Requirements_07012011.pdf. Accessed November 8, 2012.
c Department of Health and Human Services, Centers for Medicare and Medicaid Services. CMS Manual System. Pub 100–04 Medicare Claims Processing, Transmittal 1859, November 20, 2009. Subject: MIPPA Section 139 Teaching Anesthesiologists. Available at: http://www.cms.gov/Regulations-and-Guidance/Guidance/Transmittals/downloads/R1859CP.pdf. Accessed November 8, 2012.
d University of Iowa College of Education Center for Advanced Studies in Measurement and Assessment. Computer programs. Available at: http://www.education.uiowa.edu/centers/casma/computer-programs.aspx. Accessed November 8, 2012.
e We did not perform a linear mixed effects model because the residuals were asymmetric (e.g., Lilliefors P < 0.00001), as multiple faculty had at least several scores equal to 4.0, the maximum (see Discussion). Instead, for each of the 46 faculty with complete billing data (n = 1485 evaluations) and the 3 independent variables (resident–faculty patient encounters, interaction days, and interval since last interaction), we calculated the Kendall τb. This resulted in (46 × 3) 138 faculty-specific comparisons derived from a median of 34 resident evaluations per faculty (range 9–39 evaluations). Two-sided P values were calculated using Monte Carlo simulation to accuracy within 0.0001. For supervision scores and patient encounters, 3 of 46 faculty had significant associations (P < 0.05); 2 positive (maximum τb = 0.32) and 1 negative (τb = −0.38). For supervision scores and interaction days, 2 of 46 faculty had significant associations (P < 0.05); 1 positive (τb = 0.33) and 1 negative (τb = −0.38). Therefore, for these 2 variables, individual faculty associations between supervision scores and interactions were few and balanced in direction. For supervision scores and interaction interval, 3 of 46 faculty had significant associations (P < 0.05); all 3 positive (maximum τb = 0.35). Therefore, a greater interval between interaction and evaluation appeared to introduce a mild positive bias for a very few faculty.
f The group mean (±SD) supervision scores for the operating room and SICU were 3.77 ± 0.31 and 3.79 ± 0.31, respectively; mean difference = 0.02 ± 0.20, (P = 0.3326; Student t = 0.972).
g For the 10 SICU faculty, there were 280 paired evaluations of SICU supervision using the 9-item de Oliveira Filho et al. supervision question set and SICU clinical teaching using the current 13-item University of Iowa SICU teaching question set (available on request). There was a strong association between instruments in mean scores (τb = 0.73; 95% CI, 0.67 to 0.78; P < 0.0001). For the 6 pain clinic faculty, there were 112 paired evaluations of pain clinic supervision using the 9-item de Oliveira Filho et al. supervision question set and pain clinic clinical teaching using the current 21-item University of Iowa Pain Clinic teaching question set (available on request). There was a strong association between instruments in mean scores (τb = 0.73; 95% CI, 0.67 to 0.80; P < 0.0001).
1. Kennedy TJ, Lingard L, Baker GR, Kitchen L, Regehr G. Clinical oversight: conceptualizing the relationship between supervision and safety. J Gen Intern Med. 2007;22:1080–5
2. Farnan JM, Petty LA, Georgitis E, Martin S, Chiu E, Prochaska M, Arora VM. A systematic review: the effect of clinical supervision on patient and residency education outcomes. Acad Med. 2012;87:428–42
3. de Oliveira Filho GR, Dal Mago AJ, Garcia JH, Goldschmidt R. An instrument designed for faculty supervision evaluation by anesthesia residents and its psychometric properties. Anesth Analg. 2008;107:1316–22
4. Brennan RL. Generalizability Theory. New York, NY: Springer-Verlag; 2001
5. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988
6. Nunnally JC, Bernstein IH. Chapter 7: Making measures reliable. In: Psychometric Theory. 3rd ed. New York, NY: McGraw-Hill; 1994
7. Webb NM, Shavelson RJ, Haertel EH. Chapter 4: Reliability coefficients and generalizability theory. In: Rao CR, Sinharay S, eds. Handbook of Statistics. Vol 26 (Psychometrics). Amsterdam, The Netherlands: Elsevier; 2007
8. Dexter F, Logvinov II, Brull SJ. Anesthesiology residents’ and nurse anesthetists’ perceptions of effective clinical faculty supervision by anesthesiologists. Anesth Analg. 2013;116:1352–5
9. Fallon WF Jr, Wears RL, Tepas JJ 3rd. Resident supervision in the operating room: does this impact on outcome? J Trauma. 1993;35:556–61
10. Itani KM, DePalma RG, Schifftner T, Sanders KM, Chang BK, Henderson WG, Khuri SF. Surgical resident supervision in the operating room and outcomes of care in Veterans Affairs hospitals. Am J Surg. 2005;190:725–31
11. Schmidt UH, Kumwilaisak K, Bittner E, George E, Hess D. Effects of supervision by attending anesthesiologists on complications of emergency tracheal intubation. Anesthesiology. 2008;109:973–7
12. De Oliveira GS Jr, Rahmani R, Fitzgerald PC, Chang R, McCarthy RJ. The association between frequency of self-reported medical errors and anesthesia trainee supervision: a survey of United States anesthesiology residents-in-training. Anesth Analg. 2013;116:892–7
13. Kreiter CD, Ferguson K, Lee WC, Brennan RL, Densen P. A generalizability study of a new standardized rating form used to evaluate students’ clinical clerkship performances. Acad Med. 1998;73:1294–8
14. Williams RG, Klamen DA, McGaghie WC. Cognitive, social and environmental sources of bias in clinical performance ratings. Teach Learn Med. 2003;15:270–92