1. INTRODUCTION
Kestin^{1} and de Oliveira Filho^{2} suggested using Bernoulli cumulative sum (CUSUM) control charts to assess resident procedural performance. With this method, the CUSUM is increased with each failure and decreased when a procedure is performed successfully. When the CUSUM reaches the selected lower control limit indicative of success, then detection of expected performance (e.g., the “milestone”^{a} ) is said to have occurred. The CUSUM is “Bernoulli” because it sums successes and failures as binary events. The CUSUM is a type of “control chart” because it produces a graph showing how a process (e.g., change in procedural performance) changes over time.

Starkie and Drake^{3} recently reviewed the many subsequent anesthesia studies that have applied CUSUM. They identified substantial challenges in routine use of CUSUM for educational applications, because thresholds that define success (i.e., “conformance”) and failure (i.e., “non-conformance”) need to be chosen.^{3} For example, 7 of 11 first-year anesthesiology residents achieved the milestone of an epidural anesthesia success rate ≥80%.^{2} The finding would be different if the choice of the threshold had been 90%. Norris and McCahon^{4} raise similar concerns. Starkie and Drake concluded that CUSUM is “ideally … used” for a different application, specifically, “a quality control tool for a trained individual.” In this Statistical Grand Rounds, we take that approach. Our specific application is for monitoring low scores of anesthesiologists’ clinical supervision of anesthesia residents and/or Certified Registered Nurse Anesthetists (CRNAs).

We will show that there are limitations of the Bernoulli CUSUM that were raised neither in Starkie and Drake’s nor Norris and McCahon’s reviews^{3,4} nor in the earlier articles.^{1,2} Specifically, the assumptions of the Bernoulli CUSUM often are not satisfied. Although these limitations do not influence the usefulness of the CUSUM findings for our use as “a quality control tool for a trained individual,” they may for other applications (e.g., evaluating when a trainee is considered to satisfy required milestones).

2. BACKGROUND OF ASSESSING CLINICAL SUPERVISION BY ANESTHESIOLOGISTS
At many U.S. health care facilities, supervision of anesthesiology residents and/or CRNAs is a daily responsibility of anesthesiologists.^{5,6} The word “supervision” is used here not as a U.S. billing term. Rather, supervision is used as a single-dimensional construct incorporating attributes including the anesthesiologist’s participation in perianesthesia planning, availability for help/consultation, presence during critical phases of the anesthetic, and fostering safety measures.^{7,8}

Anesthesiologists’ supervision can be quantified with high reliability and dependability using an instrument with 9 questions that takes <90 seconds to complete.^{7} Raters evaluate how often each of the 9 attributes is demonstrated by the anesthesiologist (never 1, rarely 2, frequently 3, or always 4).^{7} The supervision score is the mean of the 9 responses.^{7} Residents who report mean supervision scores for their entire department (i.e., the mean of all anesthesiologists) that are less than frequent (<3.0) report making more “mistakes that have negative consequences for the patient” (P < 0.0001) and having made more “medication errors (dose or incorrect drug) in the last year” (P < 0.0001).^{9,10,b} Although residents’ and CRNAs’ mean supervision scores are correlated pairwise by anesthesiologist (P < 0.0001), residents assign greater supervision scores than do CRNAs (P < 0.0001).^{11} Most CRNAs (67%) perceive that supervision that meets their expectations is at least frequent (≥3.0) (P < 0.0001).^{12} Most residents (94%) perceive that supervision that meets expectations is at least frequent (i.e., a score ≥3.0, P < 0.0001).^{12} Residents’ evaluations of supervision correlate with residents’ evaluations using other data: supervision by the anesthesiologist when working in the surgical intensive care unit (Kendall τ_{b} = 0.71, P < 0.0001), choice of the anesthesiologist to care for their families (τ_{b} = 0.77, P < 0.0001), and anesthesiologists’ clinical teaching (τ_{b} = 0.87, P < 0.0001).^{8}

In our department, the supervision provided by each anesthesiologist working in operating rooms (ORs) is evaluated almost every day by the resident(s) and CRNA(s) with whom they worked.^{8,11} The behaviors and attributes that are used to assess the quality of an anesthesiologist’s supervision have significant commonality between residents and CRNAs (i.e., concurrent validity, P < 0.0001). However, the supervision scores are analyzed separately for residents and CRNAs,^{11} because the pairwise differences in supervision scores provided by residents and CRNAs are heterogeneous among anesthesiologists (P < 0.0001).^{11}

Supervision scores were negligibly influenced by covariates related to staff assignments.^{11} There was no association between each resident’s perception of supervision that meets expectations and years since the resident’s start of training (Kendall τ_{b} = 0.03, P = 0.77).^{12} Similarly, there was no association between each CRNA’s perceptions of supervision that meets expectations and years since the CRNA’s start of training (τ_{b} = 0.01, P = 0.90).^{12} There was no association between each resident’s single evaluations of anesthesiologists and the resident’s year of training (mean differences ≤0.07 units).^{8} There was no association between each resident’s single evaluations of an anesthesiologist and the number of occasions that the resident rater had worked with the anesthesiologist (by patients τ_{b} = 0.01, P = 0.71; by days τ_{b} = −0.01, P = 0.46).^{8} There was no association between a resident’s or CRNA’s mean score for an anesthesiologist and the number of days worked together (τ_{b} = −0.07 ± 0.02 and τ_{b} = 0.04 ± 0.02, respectively).^{11} There were negligible associations between supervision scores provided by residents and whether a resident had more units of work that day with the rated anesthesiologist (τ_{b} = 0.08 ± 0.01) and whether the anesthesiologist had more units of work on that same day with other providers (τ_{b} = −0.06 ± 0.01).^{11} There were no associations between supervision scores provided by CRNAs and whether the CRNA had more units of work that day with the anesthesiologist (τ_{b} = −0.03 ± 0.01) and whether the anesthesiologist had more units of work that same day with other providers (τ_{b} = −0.01 ± 0.01).^{11}

For the current article, we used 6 months of data from our prior study of covariates.^{11} The University of Iowa IRB determined that the work was not human subjects research.^{11} We limited consideration to the 55 anesthesiologists with more than 9 evaluations from residents and 58 anesthesiologists with more than 9 evaluations from CRNAs.^{8,11}

3. CHOICE OF THRESHOLD FOR LOW SCORES
When an anesthesiologist has multiple evaluations from the same rater, the scores are correlated. Therefore, the psychometrically validated approach^{8} is to take the mean of each rater’s scores. Then the mean score for the anesthesiologist is the mean of each rater’s means. We refer to this as the mean_{equal} because each rater is weighted equally.^{11} There is a monotonic relationship among anesthesiologists between the anesthesiologist’s mean_{equal} and the anesthesiologist’s percentage of scores less than frequent (i.e., <3.0).^{11}
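The equally weighted mean can be computed directly from the definition above. The following is a minimal sketch; the function name `mean_equal` and the example scores are ours for illustration, not from the study data.

```python
from collections import defaultdict

def mean_equal(evaluations):
    """Average each rater's scores first, then average those per-rater
    means, so that each rater is weighted equally."""
    by_rater = defaultdict(list)
    for rater_id, score in evaluations:
        by_rater[rater_id].append(score)
    rater_means = [sum(s) / len(s) for s in by_rater.values()]
    return sum(rater_means) / len(rater_means)

# Hypothetical example: rater "r1" evaluated twice, rater "r2" once.
scores = [("r1", 4.0), ("r1", 3.0), ("r2", 2.0)]
print(mean_equal(scores))  # 2.75 = mean of r1's 3.5 and r2's 2.0
```

Note that the simple mean of all 3 scores would be 3.0; weighting raters equally prevents a frequent rater from dominating the anesthesiologist’s mean.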

Although monitoring the incidence of low scores (e.g., <3.0) is practical, we do not know a practical way to monitor the mean_{equal}. Specifically, supervision scores do not follow normal distributions. For example, the Lilliefors test had P ≤ 0.003 for 54 of the 55 anesthesiologists with residents’ evaluations. In addition, transformation to a normal distribution seems impractical. Figure 1 shows dot plots for the 3 anesthesiologists with the smallest values of mean_{equal} based on residents’ evaluations. Even though these anesthesiologists had the smallest scores, the upper boundary of 4.0 influences the distributions of their scores.

Figure 1: Supervision evaluation scores by resident physicians of the 3 anesthesiologists with the smallest mean estimate. Because there were so many values at 4.00, values presented at 3.90 include observed values at 4.00. The dot plots show a clear influence of the upper boundary of 4.0, even though these anesthesiologists have the smallest means among the 55 anesthesiologists. As described in Section 3, for 54 of the 55 anesthesiologists, the distributions were not normally distributed.

Among the 2788 resident daily evaluations of anesthesiologists, the overall incidence of scores <3.0 was 4.12%, which seemed conveniently close to 5%. Thus, we set the desired control chart “in-control” incidence of low scores p _{0} = 0.05. The 14 anesthesiologists who had the least values of mean_{equal} (i.e., the lower quartile) had overall 10.3% of scores <3.0. Therefore, we set the greater “out-of-control” incidence to be detected as promptly as feasible at p _{1} = 0.10. We consider sensitivity to these choices below, in Section 5.

For the 3139 CRNA daily evaluations of anesthesiologists, the overall incidence of scores <2.0 was 7.04%. That incidence was reasonable, being close to that of resident scores <3.0. In addition, the threshold of <2.0 was easy to interpret. Since the supervision score is the mean of 9 responses and each response is {1, 2, 3, or 4}, a score <2.0 implies that at least 1 of the 9 attributes was rated “never” (i.e., 1) and that, for at least 5 of the 9 attributes, the rater’s evaluation was “never” (1) or “rarely” (2). A rating of “never” (1) is not consistent with supervision being provided. Thus, the choice of using a CRNA evaluation score (i.e., mean of 9 responses) <2.0 was considered to be exceedingly conservative. None of 153 CRNAs surveyed considered a score <2.0 to meet expectations for supervision (95% upper confidence limit 1.7% of CRNAs).^{12} Figure 2 shows a plot of 1 anesthesiologist’s CRNA evaluation scores, with the limit line of 2.0 in red.
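The interpretation of the <2.0 threshold follows from simple arithmetic on the 9 integer responses, and can be confirmed by brute force over all 4^9 = 262,144 possible response vectors. The function name below is ours, for illustration only.

```python
from itertools import product

def check_threshold_implications():
    """Brute force over all 4**9 = 262,144 response vectors: every
    vector with mean < 2.0 must contain at least one "never" (1) and
    at least 5 responses that are "never" (1) or "rarely" (2)."""
    for responses in product((1, 2, 3, 4), repeat=9):
        if sum(responses) / 9 < 2.0:
            if responses.count(1) < 1:
                return False
            if sum(r <= 2 for r in responses) < 5:
                return False
    return True

print(check_threshold_implications())  # True
```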

Figure 2: Supervision evaluation scores and Bernoulli cumulative sum (CUSUM) of anesthesiologist C described in Section 8. By runs test, this anesthesiologist had significant clustering of low evaluation scores among days (P = 0.013). Evaluations shown are those done by Certified Registered Nurse Anesthetists (CRNAs). The limit line of 2.0 is shown in red. The vertical axis on the right is shown in reverse sequence because the CUSUM is increased when there are more scores less than 2.0. The 3 occurrences of the Bernoulli CUSUM exceeding its threshold of 2.32 are shown using the blue squares (see Section 4). The supplemental digital content includes the Excel file (Supplemental Digital Content, http://links.lww.com/AA/A922) used to create this figure, including the formulas in footnotes d and e.

4. BERNOULLI CUSUM MODEL
The Bernoulli CUSUM uses a series of 0’s and 1’s. There is a separate series for each anesthesiologist. The data in date sequence are represented by X _{1} , X _{2} , …, where the k th score X _{k} = 0 if the evaluation score is greater than or equal to the threshold score or X _{k} = 1 if the evaluation score is less than the threshold (e.g., <2.0 for CRNA evaluation or <3.0 for resident evaluation). Because on any one day an anesthesiologist may supervise several CRNAs or residents, the concept of a date sequence is not fully satisfied. We address this below in Section 8, Assumption of Statistical Independence.

The geometric CUSUM is an alternative surveillance scheme to detect increases in the proportion of low (nonconforming) evaluation scores. The geometric CUSUM uses the number of evaluations between successive low (nonconforming) scores. For example, an anesthesiologist with multiple periods with <10 evaluations between low scores has a substantial incidence of low scores.
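The gap lengths that the geometric CUSUM monitors can be extracted from the same series of binary indicators. A minimal sketch, with a hypothetical series and a function name of our choosing:

```python
def gaps_between_low_scores(indicators):
    """Number of evaluations from one low (nonconforming) score to the
    next, counting the low score itself; the geometric CUSUM monitors
    these gap lengths."""
    gaps, count = [], 0
    for x in indicators:
        count += 1
        if x == 1:          # low score observed
            gaps.append(count)
            count = 0
    return gaps

# Hypothetical series: 1 = low score, 0 = acceptable score.
print(gaps_between_low_scores([0, 0, 1, 0, 1, 0, 0, 0]))  # [3, 2]
```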

The formulation of the Bernoulli CUSUM for monitoring the performance of an individual is different from the formulation typically described in anesthesia studies.^{1–3} Typically, the CUSUM control chart is formulated to be used once. This is analogous to the Wald sequential probability ratio test.^{13,14,c} There are type I and type II error rates (e.g., α = 0.05 and β = 0.10) for falsely detecting or not detecting, respectively, that an individual has a lower or greater than threshold incidence of failing at the procedure (or having low scores). Yet, to apply Starkie and Drake’s recommendation (see Introduction), there should not be consideration for failing to reject the null hypothesis that the true proportion of low scores for an anesthesiologist equals p_{0}. There also should not be interest in evaluating whether the true proportion is less than p_{0}. Rather, there are 2 objectives. One objective is to detect promptly the greater incidence p_{1} when relevant (i.e., low type II error rate). The other objective is to have, on average, many observations until falsely detecting an anesthesiologist as having multiple low scores if the anesthesiologist has a true incidence p ≈ p_{0} (i.e., low type I error rate). The geometric CUSUM is sometimes considered because it can have a lower type II error rate for a selected type I error rate versus the Bernoulli CUSUM.^{15,16}

The Bernoulli CUSUM is B_{k} = max(0, B_{k − 1} + X_{k} − 1/m), where the max(0, ·) truncates negative values at 0 and the 1/m term is the control chart “reference value.”^{13} If the kth evaluation score is not below the threshold value (i.e., X_{k} = 0), the CUSUM B_{k} is decreased by 1/m. If the kth score is below the threshold (i.e., X_{k} = 1), B_{k} is increased by 1 − 1/m. An anesthesiologist is considered to have an incidence of low scores greater than p_{0} if B_{k} ≥ h_{B}, the Bernoulli CUSUM “control limit.” The Bernoulli CUSUM becomes^{15,16} effectively equivalent to the geometric CUSUM by setting the starting value B_{0} = 1 − 1/m. The series is restarted to that value whenever B_{k} ≥ h_{B}.
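The update rule can be written compactly. The sketch below uses the geometric-equivalent head start and restart just described; the function name and the example series are ours.

```python
def bernoulli_cusum(indicators, m, h):
    """Bernoulli CUSUM B_k = max(0, B_{k-1} + X_k - 1/m), started at
    (and restarted after each signal to) B_0 = 1 - 1/m, the value that
    makes it effectively equivalent to the geometric CUSUM."""
    b = 1.0 - 1.0 / m
    path, signals = [], []
    for k, x in enumerate(indicators):
        b = max(0.0, b + x - 1.0 / m)
        path.append(b)
        if b >= h:              # control limit reached: signal, restart
            signals.append(k)
            b = 1.0 - 1.0 / m
    return path, signals

# Hypothetical series: low scores on the 1st and 3rd evaluations.
path, signals = bernoulli_cusum([1, 0, 1], m=13, h=2.32)
print(signals)  # [2]: the second low score pushes B_k to 35/13 > 2.32
```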

The value of m is calculated from the in-control and out-of-control proportions of low evaluation scores, p_{0} and p_{1}.^{13,d} For example, when p_{0} = 5% and p_{1} = 10%, the calculated^{d} m = 13. Thus, when there are m − 1 = 12 successive scores at or above the threshold, B_{k} decreases from its starting value of 1 − 1/m to 0.
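The calculation of m in footnote d can be expressed as follows. This sketch assumes p_{1} > p_{0} and takes r_{1} = ln([1 − p_{0}]/[1 − p_{1}]) so that the ratio r_{2}/r_{1} is positive; the function name is ours.

```python
from math import log

def reference_m(p0, p1):
    """Footnote d: m is the largest integer less than r2/r1, where
    r1 = ln((1 - p0)/(1 - p1)) and r2 is the log odds ratio of a low
    score when out of control versus in control. Assumes p1 > p0."""
    r1 = log((1 - p0) / (1 - p1))
    r2 = log((p1 / (1 - p1)) / (p0 / (1 - p0)))
    ratio = r2 / r1
    m = int(ratio)
    if m == ratio:   # "largest integer less than": step down on a tie
        m -= 1
    return m

print(reference_m(0.05, 0.10))  # 13, as in the text
```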

The value of h _{B} is calculated from the same 2 parameters, p _{0} and p _{1} , plus a third, the desired average number of observations to signal (ANOS) when in-control.^{13} ^{,} ^{e} For example, with ANOS = 250 observations, the calculated^{e} h _{B} = 2.32. We chose 250 observations based on there being approximately 250 regular workdays in a year. Individual anesthesiologists take vacations, attend meetings, and have non-OR responsibilities, which would reduce this number of days. However, they also work on some weekends and holidays, which would increase this number of days. Consequently, we expect a maximum of 250 days per anesthesiologist. Some days, there will be supervision of 1 provider (e.g., junior resident) and, other days, 3 providers (e.g., CRNAs), with the median being 2 providers daily. Half the departmental work is by anesthesia residents and half by CRNAs,^{11} resulting in a maximum expected 250 evaluations annually from each type of rater (resident or CRNA). In fact, over the 6 months, our maximum number of evaluations of an anesthesiologist was 121, just slightly less than 250 ÷ 2.
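The control limit h_{B} is obtained numerically (footnote e). As a check of the chosen design, a Monte Carlo sketch can verify that m = 13 and h_{B} = 2.32 give an in-control ANOS near the desired 250 and much prompter signaling when out of control. The simulation settings (replicates, seed, cap) are ours.

```python
import random

def simulate_anos(p, m, h, reps=2000, seed=1, cap=100_000):
    """Monte Carlo estimate of the average number of observations to
    signal (ANOS) for the Bernoulli CUSUM with head start 1 - 1/m."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        b, n = 1.0 - 1.0 / m, 0
        while b < h and n < cap:
            n += 1
            x = 1 if rng.random() < p else 0
            b = max(0.0, b + x - 1.0 / m)
        total += n
    return total / reps

# In control (p = 0.05): the estimate should be near the design ANOS of 250.
print(simulate_anos(0.05, m=13, h=2.32))
# Out of control (p = 0.10): signaling should be much faster.
print(simulate_anos(0.10, m=13, h=2.32))
```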

Figure 2 shows the Bernoulli CUSUM with the vertical axis on the right in reverse sequence because low scores are being detected. The 3 occurrences of the CUSUM exceeding 2.32 are shown using the blue squares. The supplemental digital content includes the Excel file (Supplemental Digital Content, http://links.lww.com/AA/A922) used to create that figure, including the formulas in footnotes d and e.

5. PERFORMANCE OF THE BERNOULLI CUSUM CONTROL CHART
Ideally, to assess empirically the performance of the control charts, we would have some anesthesiologists with “correct” classification. When CRNAs did the evaluations, half of the anesthesiologists (29 of 58) with more than 9 evaluations had ≤5% of their scores <2.0. In addition, the lower quartile (14 of 58) based on percentage of scores <2.0 all had ≥10.0% of scores <2.0. These observed percentages corresponded precisely to the above selections of p _{0} and p _{1} . Among the half with all scores ≥ p _{0} , none was detected over the 6 months by the Bernoulli CUSUM. This was not by definition, because none of the data was used to choose parameters. Among the 14 anesthesiologists in the lower quartile based on the threshold, all were detected by CUSUM.

We are interested fundamentally in the mean_{equal} . Among the half of anesthesiologists (29/58) with CRNAs’ mean_{equal} >3.13, only 1 of 29 was detected (flagged) during the 6 months by the Bernoulli CUSUM. Among the lower quartile of anesthesiologists (14/58) based on CRNAs’ mean_{equal} (<2.94), 13 of 14 were detected by the Bernoulli CUSUM. Among the half of anesthesiologists (27/55) with residents’ mean_{equal} ≥3.80, 0 of 27 was detected during the 6 months by Bernoulli CUSUM. Among the lower quartile of anesthesiologists (13/55) based on residents’ mean_{equal} (≤3.61), 12 of 13 were detected by Bernoulli CUSUM. Although these findings related to mean_{equal} are important to our specific application, because they do not assess the performance of the Bernoulli CUSUM per se, in contrast to the preceding paragraph, they are not used subsequently.

Although the observed detection incidences precisely matched p _{0} and p _{1} , because it was known after 6 months which anesthesiologists were in which quartile, one may ask why there is a reason to use the Bernoulli CUSUM at all. The (potential) advantage of using CUSUM is in its ability to provide early detection of low performance. Rather than waiting 182 days for detection of the 14 anesthesiologists in the lower quartile of scores, the detection by the Bernoulli CUSUM occurred much earlier at 50 (median) ± 14 (quartile deviation) days; range 2 to 99 days. This may be of utility when a new anesthesiologist joins the department and/or after expectations have been communicated. We expand on this topic below in Section 6.

Regarding our specific selection of parameter values, from Section 4 above, the choice of p_{0} = 0.05 was straightforward, and 0.05 is the typical value chosen for type I and type II error rates. However, we surely could have chosen a different out-of-control proportion than p_{1} = 0.10. Although, for the remainder of our article, we continue to use this 2.0 relative risk of low scores (i.e., p_{1}/p_{0} = 2.0), the choice had little effect on the Bernoulli CUSUM performance. We lowered p_{1} progressively in increments of 0.1%, down to 5.1%, and then progressively increased p_{1}. With each change, we recalculated h_{B}. For 5.1% ≤ p_{1} ≤ 17.9%, Bernoulli CUSUM performance remained the same, detecting 14 of 14 anesthesiologists in the lower quartile and 0 of 29 anesthesiologists in the upper half. Only 1 of the 29 was detected falsely for p_{1} ≤ 35.9%. Consequently, for our application, only 2 of the 3 parameters mattered substantively: p_{0} (i.e., the “in-control,” “acceptable” incidence of low scores) and the desired ANOS. We consider the ANOS below in the last paragraph of Section 8.

6. ASSUMPTION OF BINARY OUTCOME AND COMPARISON WITH SHEWHART INDIVIDUAL CONTROL CHARTS
The next 4 sections each address 1 of the 4 fundamental assumptions of the Bernoulli CUSUM. In this section, we consider that the Bernoulli CUSUM inherently assumes that the outcome is binary (e.g., residents’ scores <3.0 are low). However, the outcome is in fact a continuous variable and we must apply a threshold to obtain binary outcomes.

The Shewhart individual control chart (“Shewhart chart”) uses the actual value of each evaluation score.^{f} The principle is that when a process becomes out of control, there will be some very low individual evaluation scores. For example, from Section 3, there was an anesthesiologist with normally distributed scores (Lilliefors P = 0.17, N = 23 evaluations). The residents’ scores for this anesthesiologist had pooled mean 3.464 and sample standard deviation 0.446. The typical limit in a Shewhart chart is “3 sigma” below the mean, which is 2.13 for this example.^{f} On the 31st of the 182 consecutive (calendar) days, the anesthesiologist received a resident score of 2.11. Thus, if the mean and standard deviation had been known when the July 31 evaluation was submitted, the anesthesiologist’s supervision would have been detected for review by the Shewhart individual chart.
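As a quick check of the arithmetic, using the pooled mean and standard deviation reported above:

```python
# "3 sigma" lower control limit for the Shewhart individual chart,
# using the pooled mean and standard deviation given in the text.
score_mean, score_sd = 3.464, 0.446
lower_limit = score_mean - 3 * score_sd
print(round(lower_limit, 2))   # 2.13
print(2.11 < lower_limit)      # True: the observed score of 2.11 signals
```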

However, there is, in fact, no need to assume a normal distribution, and the mean and standard deviation are not known a priori. Rather, the single best threshold for the Shewhart chart can be chosen empirically. Doing so, and evaluating performance, we again took advantage of the fortuitous coincidence that, for the CRNA evaluations, the upper half of anesthesiologists all had <5% (i.e., <p _{0} ) of scores <2.0 and the lower quartile all had ≥10.0% (i.e., ≥p _{1} ) of scores. We tried a threshold score of <1.1 and observed which anesthesiologists were detected at least once. We tried next a threshold of 1.2 and so forth in increments of 0.1 on the 1 to 4 scale. From the preceding Section 5, the Bernoulli CUSUM correctly detected 14 of 14 anesthesiologists in the lower quartile and 0 of 29 in the upper half. In contrast, the Shewhart chart performed poorly. Using a threshold of 1.1 (i.e., the anesthesiologist would need to have received at least one score of 1.0), detection was 3 of 14 and 2 of 29 anesthesiologists, respectively. The smallest threshold (1.6) that was sufficiently large to detect 14 of 14 in the lower quartile falsely detected 16 of 29 in the upper half.

A logical modification of the Shewhart chart would be not to make detection when there is just 1 score <1.1 or <1.6 (etc.) but rather when there are 2 scores less than a threshold in relatively quick succession. In fact, such a modification is precisely the Bernoulli CUSUM, as shown by Figure 2. From Section 4, the CUSUM B_{k} starts at 1 − 1/m = 1 − 1/13. For CRNAs, each score <2.0 increases the CUSUM by 1 − 1/13, and each score ≥2.0 decreases it by 1/13. Because h_{B} = 2.32 and (1 − 1/13) + 2(1 − 1/13) − (5/13) = 31/13 ≈ 2.38 > 2.32, when 2 scores below the threshold occur within 7 evaluations (i.e., with at most 5 acceptable scores between them), detection occurs. For residents, it is identical, except that the threshold score is <3.0.
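This detection rule can be verified numerically by running the CUSUM update on the two boundary cases. The helper below is ours, with m = 13 and h_{B} = 2.32 taken from Section 4.

```python
M, H = 13, 2.32   # reference value denominator and control limit (Section 4)

def peak_cusum(indicators, m=M):
    """Largest Bernoulli CUSUM value reached, starting from the head
    start B_0 = 1 - 1/m, for a hypothetical 0/1 series."""
    b = peak = 1.0 - 1.0 / m
    for x in indicators:
        b = max(0.0, b + x - 1.0 / m)
        peak = max(peak, b)
    return peak

# Two low scores within 7 evaluations (5 acceptable between): signals.
print(peak_cusum([1, 0, 0, 0, 0, 0, 1]) >= H)     # True  (31/13 = 2.385)
# Two low scores 8 evaluations apart (6 between): just misses the limit.
print(peak_cusum([1, 0, 0, 0, 0, 0, 0, 1]) >= H)  # False (30/13 = 2.308)
```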

7. ASSUMPTION OF IMMEDIATE KNOWLEDGE OF AN INDIVIDUAL SCORE
The second of the 4 assumptions of the Bernoulli CUSUM that we consider is that the outcome from each evaluation is known before the next evaluation. This assumption routinely will not apply for anesthesia applications, such as ours. An anesthesiologist may work with 2 CRNAs on the same day (Friday). The sequencing of those 2 evaluations is arbitrary because the supervision provided generally will overlap in time. Furthermore, whereas one of the CRNAs may do her evaluation that same evening, the other may do his evaluation the following Monday afternoon. Whereas such a delay has no managerial consequence (Section 9), it could affect the Bernoulli CUSUM calculations performed daily (e.g., on our department’s server).

The way the data are sequenced within 1 day has no practical effect on results. Consider that h _{B} = 2.32. Low scores cause B _{k} to increase by 1. Even 4 non-low scores (i.e., 4/13 = 0.31) are less than 0.32. Thus, provided a day has 1 or 2 and occasionally 3 evaluations, how those scores are sequenced in the CUSUM calculations would be expected to have very little influence on detection.

Above, and in the next section, the sorting we followed was first by anesthesiologist in ascending order, second by date in ascending order, and third by supervision score in ascending order. For both residents and CRNAs, we repeated analysis, but sorting third by supervision score in descending order. We repeated again sorting third by the blinded identifier of the rater in ascending order. Finally, we sorted third by blinded identifier in descending order. The anesthesiologists detected were identical for all 8 combinations. Furthermore, the specific numbers of detections over the 182 days changed only for 0 to 2 of the 58 anesthesiologists among the 6 different combinations.

8. ASSUMPTION OF STATISTICAL INDEPENDENCE
The third of the 4 assumptions of the Bernoulli CUSUM is that of statistical independence of observations.^{13} In other words, if an anesthesiologist has a low score with one rater in one OR, this should (it is assumed) have no association with the probability of a low score with some other rater in the anesthesiologist’s other OR. Likewise, if an anesthesiologist has a low score today, this should have no association with the probability of a low score the next clinical day. This assumption affects the equations in footnotes d and e of Section 4. The problem is that managerial and psychological factors may influence clinicians’ performance, and, if so, we would expect such correlations to be observed. In practice, they are indeed present.^{17–19} We have previously published several articles on how to analyze OR and anesthesia group performance while compensating for such correlations, but these methods all depend on pooling data among individuals and days,^{17,20–24} which does not apply to the current article.^{g}

One way that we evaluated statistical independence was to consider (separately for CRNAs and residents) the days with precisely 2 evaluations. If there were no correlation among evaluations, and letting p represent the pooled estimate for a low score, then among the days with 2 evaluations, p^{2} would be the probability that both scores are low, (1 − p)^{2} the probability that neither score is low, and 2p(1 − p) the probability that exactly 1 of the 2 scores is low. Among the CRNAs’ 1182 evaluations on days with 2 evaluations by CRNAs, p = 5.92%. There were 4.34-fold more days with 2 low scores than expected at random (χ^{2}_{2} = 52.32, P < 0.0001). Among the residents’ 776 evaluations on days with 2 evaluations by residents, p = 5.03%. There were 3.06-fold more double low scores than expected at random (χ^{2}_{2} = 9.23, P = 0.010). Thus, by an omnibus test, there appears to be a lack of independence.
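The comparison of observed with expected counts on 2-evaluation days can be sketched as follows. The counts in the example are hypothetical, chosen only to illustrate the calculation; they are not the study’s data.

```python
def clustering_chi2(n_zero, n_one, n_two):
    """Chi-square comparison of observed counts of 2-evaluation days
    with 0, 1, or 2 low scores against the counts expected under
    independence, plus the fold excess of double-low days."""
    n_days = n_zero + n_one + n_two
    p = (n_one + 2 * n_two) / (2 * n_days)   # pooled incidence of low scores
    expected = [n_days * (1 - p) ** 2,
                n_days * 2 * p * (1 - p),
                n_days * p ** 2]
    observed = [n_zero, n_one, n_two]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    fold = n_two / expected[2]               # excess of double-low days
    return chi2, fold

# Hypothetical counts (NOT the study's data), for illustration only:
# 530 days with no low score, 52 with one, 9 with two.
chi2, fold = clustering_chi2(530, 52, 9)
print(round(fold, 2))  # roughly a 4-fold excess of double-low days
```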

We evaluated the randomness of each anesthesiologist’s time series of the binary nonconformance indicators, X_{k}, by using the Wald-Wolfowitz runs test.^{h} Because there were small numbers of low scores for many anesthesiologists, we calculated exact P values using StatXact-10 (Cytel, Cambridge, MA). There were 3 anesthesiologists (A, B, C) based on resident evaluations and 4 anesthesiologists (C, D, E, F) based on CRNA evaluations who had low scores that probably were not randomly distributed (uncorrected P ≤ 0.05). Figure 2 shows the CRNA evaluation scores for anesthesiologist C. Among the resident evaluations, 1 anesthesiologist (A) had 4 scores <3.0, and 2 of the 4 were on the same day. The second (B) and third (C) anesthesiologists each had 2 scores <3.0 over the 6 months, both on successive days.^{25,i} The 4 anesthesiologists who may have had nonrandom clustering among CRNA evaluations accounted for 39 of the total 221 CRNA scores <2.0. There were 6 low scores on the same days and 22 low scores on successive days. Thus, it would be a poor choice to revise the analysis to treat X_{k} = 1 if there was at least 1 score on the kth day less than the threshold, because that not only would change the interpretation of the low scores but also would not address what appears to be most of the correlation, which is among successive days.
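A large-sample version of the runs test can be sketched as follows. Note that the article used exact P values (StatXact) because counts of low scores were small, so this normal approximation is illustrative only; the example series is hypothetical.

```python
from math import erf, sqrt

def runs_test_z(indicators):
    """Wald-Wolfowitz runs test for randomness of a 0/1 sequence,
    using the normal approximation (assumes both values occur)."""
    n1 = sum(indicators)
    n0 = len(indicators) - n1
    n = n0 + n1
    # a "run" is a maximal stretch of identical consecutive values
    runs = 1 + sum(a != b for a, b in zip(indicators, indicators[1:]))
    mean = 1 + 2 * n0 * n1 / n
    var = 2 * n0 * n1 * (2 * n0 * n1 - n) / (n ** 2 * (n - 1))
    z = (runs - mean) / sqrt(var)
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided P value
    return runs, z, p

# Hypothetical clustered series: the low scores (1s) arrive in bunches,
# giving fewer runs than expected under randomness (hence z < 0).
runs, z, p = runs_test_z([1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0])
print(runs)  # 4
```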

We suspect that, managerially, we want to detect the clustering when it is substantive, and the Bernoulli CUSUM will do so. The clustering is a positive correlation among the X_{k} (i.e., one low score increases the chance of another). From simulation studies, when there are even small positive correlations between X_{k} and X_{k + 1}, the calculated values of m and h_{B} result in observed ANOS that are less than expected.^{26} For example, with the out-of-control incidence p_{1} being 2.5 times the in-control p_{0} (i.e., relative risk 2.5), the ANOS is 57% less than expected when the correlation is 0.10 and 76% less when the correlation is 0.20.^{26} In other words, the Bernoulli CUSUM is sensitive when there is clustering. Provided this seems “good” for ongoing monitoring, the violation of the assumption may be less of a concern.

9. ASSUMPTION OF EQUAL RISK AND RELEVANCE TO CONFIDENTIALITY
The fourth assumption of the Bernoulli CUSUM, as formulated in Section 4 and the previous papers,^{1–3} is that each score has an equal probability of being low. Yet, this often is untrue. When a trainee places an epidural catheter, the anesthesiologist assisting may influence success. Similarly, some raters evaluate anesthesiologists’ supervision more leniently than other raters do.^{8} Yet, our risk adjustment was limited to stratification by the primary covariate: resident versus CRNA rater.^{11}

We used only the quantitatively most important covariate because that approach matched the previous psychometric (generalizability) analysis. The variance component estimated for rater (resident) nested within person (anesthesiologist) was 29%.^{8} That variance included the confounded influence of both the systematic rater effect (i.e., leniency) and the interaction of rater (resident) and anesthesiologist.^{8} We do not know how to model the interaction between individual rater and anesthesiologist, because we have so many different raters, and the number of occasions was 2 or fewer for most (>90%) combinations of rater and anesthesiologist.^{11} For anesthesiologists who often work with the same raters, we recommend consideration of adjusting the Bernoulli CUSUM for each rater’s predilection to give low scores. Paynabar et al.^{27} describe how to include categorical covariates (e.g., raters) in risk-adjusted (Bernoulli) control charts. However, the issue may be moot due to confidentiality.

When detection occurs, a first step should be reviewing whether different raters contributed to the low scores. We do so manually, because CUSUM detection prompts notification of the relevant human resources professional in our department (Vice Chair for Faculty Development, BJH), not the rated anesthesiologist. From Section 4, the Bernoulli CUSUM is decreased with each evaluation and is increased only upon a low score. If an anesthesiologist works with one resident today (the kth day), and this evening the Bernoulli CUSUM “announces” B_{k} ≥ h_{B}, then it is likely that the resident’s evaluation indicated less than desirable supervision. Notification of the rated anesthesiologist would result in loss of confidentiality of the evaluation. The CUSUM serves as a process for detection. Once detection has occurred, the human resources professional can rely on the fundamental objective, that being the psychometrically valid (equally weighted) mean_{equal} with 9 to 15 different raters.

10. SUMMARY
Recently, Starkie and Drake^{3} evaluated use of Bernoulli CUSUM and concluded: “It is ideally placed to be used as a quality control tool for a trained individual.” We found the same but with 2 substantive caveats.

First, if the end point being monitored reflects a sensitive topic (e.g., evaluation of the quality of supervision), both confidentiality and psychometric reliability depend on there being multiple independent raters.^{7,8} Thus, although the Bernoulli CUSUM would result in prompt detection of lower than desired performance, to preserve confidentiality and psychometric reliability, the “alert” probably should not be provided directly to the “trained individual,” in this circumstance, the anesthesiologist. Rather, the Bernoulli CUSUM is for detection and notification at the managerial level (Section 9). So notified, the manager considers the individuals providing the scores, the mean of the individual raters’ means, and the context under which the scores were provided.

Second, the method assumes statistical independence of monitored events (see Section 8). This assumption can be violated. We recommend testing for such dependence (e.g., clustering of low scores) and understanding its ramifications for conclusions, specifically that the Bernoulli CUSUM will give many more false positives than expected. That may be desirable if the clustering itself is behaviorally important. However, anesthesiologists whose incidences of low scores are in control may still average more than 1 alert per year. If the method is used for evaluating trainees’ performance (e.g., milestones), this would be a limitation.
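One way to test for such dependence is the Wald–Wolfowitz runs test (footnote h): significantly fewer runs than expected under independence indicates clustering of low scores. A minimal sketch using the normal approximation follows; the function name and interface are ours:

```python
from math import sqrt, erf

def runs_test(x):
    """Wald-Wolfowitz runs test for a 0/1 sequence (normal approximation).

    Returns (z, two-sided P value). A large negative z (too few runs)
    suggests clustering of the 1s (e.g., low scores); a large positive z
    suggests alternation.
    """
    n1 = sum(x)            # number of 1s
    n2 = len(x) - n1       # number of 0s
    if n1 == 0 or n2 == 0:
        raise ValueError("both outcomes must be present")
    # A new run starts at each change of value
    runs = 1 + sum(a != b for a, b in zip(x, x[1:]))
    mu = 1 + 2.0 * n1 * n2 / (n1 + n2)
    var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)) / (
        (n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (runs - mu) / sqrt(var)
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return z, p
```

For example, ten 0s followed by ten 1s (2 runs, maximal clustering) gives a strongly negative z, whereas a perfectly alternating sequence gives a strongly positive z.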

RECUSE NOTE
Franklin Dexter is the Statistical Editor and Section Editor for Economics, Education, and Policy for Anesthesia & Analgesia. This manuscript was handled by Dr. Steven L. Shafer, Editor-in-Chief, and Dr. Dexter was not involved in any way with the editorial process or decision.

DISCLOSURES
Name: Franklin Dexter, MD, PhD.

Contribution: This author helped design the study, conduct the study, analyze the data, and write the manuscript.

Attestation: Franklin Dexter has approved the final manuscript.

Name: Johannes Ledolter, PhD.

Contribution: This author helped analyze the data and prepare the manuscript.

Attestation: Johannes Ledolter has approved the final manuscript.

Name: Bradley J. Hindman, MD.

Contribution: This author helped conduct the study and write the manuscript.

Attestation: Bradley J. Hindman has approved the final manuscript.

FOOTNOTES

a Available at: https://www.acgme.org/acgmeweb/Portals/0/PDFs/Milestones/AnesthesiologyMilestones.pdf. Accessed April 11, 2014.

b The P values were provided by Dr. Robert McCarthy,^{9} personal communication, February 24, 2014.

c Available at: http://en.wikipedia.org/wiki/Sequential_probability_ratio_test. Accessed February 23, 2014.

d Let r_{1} refer to the logarithm of the relative risk of not low scores (i.e., X_{k} = 0) when in control rather than out of control: r_{1} = ln([1 – p_{0}]/[1 – p_{1}]). Let r_{2} refer to the logarithm of the odds ratio of low scores when out of control rather than when in control: r_{2} = ln([p_{1}/(1 – p_{1})]/[p_{0}/(1 – p_{0})]). Then m is the largest integer less than r_{2}/r_{1}.

e Using the ANOS, and r_{1} and r_{2} from footnote d, solve iteratively for h_{B}^{*}: ANOS ≊ (exp[r_{2} h_{B}^{*}] – r_{2} h_{B}^{*} – 1)/|r_{2} p_{0} – r_{1}|. For example, with p_{0} = 5%, p_{1} = 10%, and ANOS = 250, h_{B}^{*} = 2.6312. Then h_{B} is obtained from h_{B}^{*} using the correction term ε, where ε ≊ 0.410 – 0.0842(ln[p_{0}]) – 0.0391(ln[p_{0}])^{3} – 0.00387(ln[p_{0}])^{4} – 0.000008(ln[p_{0}])^{7}. For example, ε = 1.4279 for p_{0} = 5%.
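The design constants above can be computed numerically. The following sketch (function name, interface, and bisection solver are ours) assumes r_{1} = ln([1 – p_{0}]/[1 – p_{1}]), the sign convention that reproduces the printed example h_{B}^{*} = 2.6312 for p_{0} = 5%, p_{1} = 10%, and ANOS = 250:

```python
from math import log, exp, floor

def bernoulli_cusum_design(p0, p1, anos):
    """Compute m and h_B* for a Bernoulli CUSUM (footnotes d and e).

    p0:   in-control incidence of low scores.
    p1:   out-of-control incidence to detect.
    anos: target in-control average number of observations to signal.
    """
    r1 = log((1.0 - p0) / (1.0 - p1))                       # footnote d
    r2 = log((p1 / (1.0 - p1)) / (p0 / (1.0 - p0)))         # log odds ratio
    m = floor(r2 / r1)          # largest integer below r2/r1 (non-integer case)
    # Solve ANOS ~= (exp(r2*h) - r2*h - 1) / |r2*p0 - r1| for h by bisection;
    # the left side of the rearranged equation is increasing in h.
    target = anos * abs(r2 * p0 - r1)
    lo, hi = 0.0, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if exp(r2 * mid) - r2 * mid - 1.0 < target:
            lo = mid
        else:
            hi = mid
    return m, (lo + hi) / 2.0
```

For p_{0} = 5% and p_{1} = 10%, r_{2}/r_{1} ≈ 13.82, so m = 13, and the solver returns h_{B}^{*} ≈ 2.6312, matching the footnote e example.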

f Available at: http://en.wikipedia.org/wiki/Shewhart_individuals_control_chart. Accessed February 25, 2014. The individual control chart differs from the Shewhart mean chart, http://en.wikipedia.org/wiki/Xbar_and_s_chart.

g We chose not to use Poisson CUSUM, because the numbers of evaluations per day were highly heterogeneous among anesthesiologists (e.g., cardiac anesthesiologist who often supervises 1 resident versus ambulatory surgery anesthesiologist who usually supervises 3 CRNAs).

h Available at: http://en.wikipedia.org/wiki/Wald-Wolfowitz_runs_test. Accessed February 25, 2014.

i Another potential approach is a multinomial CUSUM to combine residents and CRNAs. However, among the anesthesiologists with clustering, only 1 was for both residents and CRNAs. The dates with clustering for each had no overlap. Furthermore, the anesthesiologist was in the upper quartile for the incidence of low scores for both (i.e., was detected promptly anyway by the individual Bernoulli CUSUM).

REFERENCES
1. Kestin IG. A statistical approach to measuring the competence of anaesthetic trainees at practical procedures. Br J Anaesth. 1995;75:805–9

2. de Oliveira Filho GR. The construction of learning curves for basic skills in anesthetic procedures: an application for the cumulative sum method. Anesth Analg. 2002;95:411–6

3. Starkie T, Drake EJ. Assessment of procedural skills training and performance in anesthesia using cumulative sum analysis (CUSUM). Can J Anaesth. 2013;60:1228–39

4. Norris A, McCahon R. Cumulative sum (CUSUM) assessment and medical education: a square peg in a round hole. Anaesthesia. 2011;66:250–4

5. Shumway SH, Del Risco J. A comparison of nurse anesthesia practice types. AANA J. 2000;68:452–62

6. Taylor CL. Attitudes toward physician-nurse collaboration in anesthesia. AANA J. 2009;77:343–8

7. de Oliveira Filho GR, Dal Mago AJ, Garcia JH, Goldschmidt R. An instrument designed for faculty supervision evaluation by anesthesia residents and its psychometric properties. Anesth Analg. 2008;107:1316–22

8. Hindman BJ, Dexter F, Kreiter CD, Wachtel RE. Determinants, associations, and psychometric properties of resident assessments of anesthesiologist operating room supervision. Anesth Analg. 2013;116:1342–51

9. De Oliveira GS Jr, Rahmani R, Fitzgerald PC, Chang R, McCarthy RJ. The association between frequency of self-reported medical errors and anesthesia trainee supervision: a survey of United States anesthesiology residents-in-training. Anesth Analg. 2013;116:892–7

10. de Oliveira Filho GR, Dexter F. Interpretation of the association between frequency of self-reported medical errors and faculty supervision of anesthesiology residents. Anesth Analg. 2013;116:752–3

11. Dexter F, Ledolter J, Smith TC, Griffiths D, Hindman BJ. Influence of provider type (nurse anesthetist or resident physician), staff assignments, and other covariates on daily evaluations of anesthesiologists' quality of supervision. Anesth Analg. 2014;119:670–8

12. Dexter F, Logvinov II, Brull SJ. Anesthesiology residents’ and nurse anesthetists’ perceptions of effective clinical faculty supervision by anesthesiologists. Anesth Analg. 2013;116:1352–5

13. Reynolds MR Jr, Stoumbos ZG. A CUSUM chart for monitoring a proportion when inspecting continuously. J Qual Technol. 1999;31:87–108

14. Grigg O, Farewell V. An overview of risk-adjusted charts. J R Statist Soc A. 2004;167:523–39

15. Chang TC, Gan FF. Cumulative sum charts for high yield processes. Stat Sinica. 2001;11:791–805

16. Szarka JL III, Woodall WH. On the equivalence of the Bernoulli and geometric CUSUM charts. J Qual Technol. 2012;44:54–62

17. Dexter F, Epstein RH, Marcon E, Ledolter J. Estimating the incidence of prolonged turnover times and delays by time of day. Anesthesiology. 2005;102:1242–8

18. Epstein RH, Dexter F, Rothman B. Communication latencies of wireless devices suitable for time-critical messaging to anesthesia providers. Anesth Analg. 2013;116:911–8

19. Wang J, Dexter F, Yang K. A behavioral study of daily mean turnover times and first case of the day start tardiness. Anesth Analg. 2013;116:1333–41

20. Law AM, Kelton WD. Simulation Modeling and Analysis. 2nd ed. New York, NY: McGraw-Hill, Inc; 1991:551–3

21. Dexter F, Macario A, Qian F, Traub RD. Forecasting surgical groups’ total hours of elective cases for allocation of block time: application of time series analysis to operating room management. Anesthesiology. 1999;91:1501–8

22. Dexter F, Marcon E, Epstein RH, Ledolter J. Validation of statistical methods to compare cancellation rates on the day of surgery. Anesth Analg. 2005;101:465–73

23. Dexter F, Marcon E, Aker J, Epstein RH. Numbers of simultaneous turnovers calculated from anesthesia or operating room information management system data. Anesth Analg. 2009;109:900–5

24. Ledolter J, Dexter F, Epstein RH. Analysis of variance of communication latencies in anesthesia: comparing means of multiple log-normal distributions. Anesth Analg. 2011;113:888–96

25. Topalidou E, Psarakis S. Review of multinomial and multiattribute quality control charts. Qual Reliab Eng Int. 2009;25:773–804

26. Mousavi S, Reynolds MR Jr. A CUSUM chart for monitoring a proportion with autocorrelated binary observations. J Qual Technol. 2009;41:401–14

27. Paynabar K, Jin JJ, Yeh AB. Phase I risk-adjusted control charts for monitoring surgical performance by considering categorical covariates. J Qual Technol. 2012;44:39–53