Probabilistic estimates of case duration are important for several decisions on and soon before the day of surgery.1 These decisions include comparing the durations of cases between operating rooms requiring specialized equipment in order to prevent resource conflicts,1–3 and filling or preventing a hole in the operating room schedule.1,3,4 These decisions are economically important because, if done properly, they can reduce the hours of over-utilized operating room time.1
Case durations are estimated most accurately when classified (at a minimum) by the combination of scheduled procedure(s) to be performed and the primary surgeon scheduling the case.5–11 For decisions dependent on case duration estimates, parameter uncertainty for cases of uncommon combinations of procedures (e.g., historical durations from 0 to 2 such cases) accounts for the majority of the uncertainty in decisions.6 If historical data are used, reliably estimated durations are absent for many cases, resulting in no clearly best decision. Substantial benefit is achieved by using the scheduled duration as a prediction for the median duration of the next case of the same combination.3,7 The process variability around that prediction (i.e., the coefficient of variation) is estimated using data from similar procedures.3,6 This approach is necessary because at tertiary hospitals more than half of cases are of combinations scheduled fewer than 3 times during the preceding year.12 Pooling similar procedures together to increase sample sizes is counter-productive because doing so increases the process variability.3,6 More than 20% of cases at tertiary hospitals are of procedure(s) that have been scheduled previously only once or twice.12,13 Results are similar for most ambulatory surgery centers.14 Even in single-specialty surgical suites (e.g., pediatric endoscopy), the relatively few procedures that are of uncommon combinations still have a substantive effect on managerial decisions because what matters is the percentage among workdays of all operating rooms with at least one such case.15 Bayesian methods use information from scheduled durations, historical data for the combination, and historical data for all other combinations to moderate the effect of small sample sizes on uncertainty in estimates of the case durations.3
A Bayesian method for case duration prediction has 3 parameters that need to be estimated for each hospital.3,16 The coefficient of variation of case duration is calculated for each of the common (e.g., N ≥ 30) combinations (e.g., of scheduled procedure(s) and surgeon). Two parameters describe the heterogeneity among combinations in their coefficients of variation. The third parameter (τ) specifies the equivalence between a surgeon’s scheduled estimate for the duration of a case and the information that is contained in the median of a certain number of historical data. For example, when there are only 3 historical durations available and τ > 3, the scheduled duration provides more information to the Bayesian estimate than do the historical data. On the other hand, when many (e.g., N ≥ 30) historical data are available and τ = 3, the historical data provide more information than does the scheduled duration. The value of τ quantifies the number of cases for which the historical data are balanced equally by the scheduled duration.
The Bayesian method is accurate.3 In one of our studies, using τ = 8.68 cases, the 90% upper bounds were exceeded by 9.7% of the actual operating room times (i.e., close to the nominal rate of 10%). The 5% lower bounds exceeded 4.9% of the cases. Using τ = 1.49 cases, the 90% bounds were exceeded by 10.0% of cases and the 5% lower bounds exceeded 5.1% of cases. Using τ = 1.49 cases, an order of magnitude smaller, the results were the same within 0.2%.3 These results summarize what has been known: within a range of reasonable values, the specific choice of τ is unimportant.
In this study, we start by presenting a case report of a different hospital in which the previously described method for estimating τ resulted in a value that was not interpretable. We developed a new method and then applied it to study schedulers’ behaviors at the 3 hospitals. In the process, we learned better why Bayesian methods for case duration prediction are so accurate. We also learned why accuracy is insensitive to the choice of τ.
Three tertiary care U.S. hospitals were studied. Data from Hospitals A and B were taken from previously published studies.3,16 Use of the data for the case report (Hospital C) was approved by the Vanderbilt University IRB, which determined that the study was not human subjects research.
We use the following notation:
We showed previously that, subject to the assumptions put forward in our previous article,3 an unbiased estimator for the median of Yik equals:
The term inside the exponential is a weighted average of the scheduled duration and the historical mean data, both in the log scale.
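The weighted average described above can be sketched in a few lines. This is a minimal illustration, assuming that the weights are τ for the scheduled duration and the number of historical cases for the historical mean; that form is an assumption based on the description of τ given earlier, and any further terms in the published estimator are not reproduced. The function name `bayes_median_hours` and the example durations are hypothetical.

```python
import math

def bayes_median_hours(scheduled_hours, historical_hours, tau):
    """Log-scale weighted average of the scheduled duration and the mean of
    the historical durations, back-transformed to hours.

    Assumed weights: tau for the scheduled duration and n (the number of
    historical cases) for the historical mean.
    """
    n = len(historical_hours)
    if tau + n <= 0:
        raise ValueError("tau + n must be positive")
    log_scheduled = math.log(scheduled_hours)
    # With no historical cases the historical term carries zero weight.
    log_hist_mean = sum(math.log(h) for h in historical_hours) / n if n else 0.0
    weighted = (tau * log_scheduled + n * log_hist_mean) / (tau + n)
    return math.exp(weighted)

# Hypothetical case scheduled for 3.0 h whose historical cases all took 2.0 h.
few = bayes_median_hours(3.0, [2.0, 2.0, 2.0], tau=5.0)
many = bayes_median_hours(3.0, [2.0] * 30, tau=5.0)
print(round(few, 2), round(many, 2))
```

With 3 historical cases and τ = 5, the estimate stays closer to the scheduled 3.0 hours; with 30 historical cases it moves toward the historical 2.0 hours.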
Previously, we estimated τ using properties of the Student t and the log-normal distributions:
where α and β are the shape and scale parameters of an inverse γ distribution. Details are in Equations 15 and 16 of Ref. 3.
For example, among the cases performed at Hospital A, there were 302 combinations of scheduled procedure(s) and surgeon each with at least 30 cases (Table 1). For each of these combinations, the sample variance of the natural logarithms of the case durations in hours (i.e., Var(xik)) was calculated. The coefficient of variation of the actual durations is a function solely of this variance in the log scale. The inverses were taken of the sample variances and fit to a γ distribution (see Fig. 1 of Ref. 3). The Kolmogorov-Smirnov test of fit was good, P = 0.88. The estimated α = 2.32 and β = 0.142, respectively. Among the cases of the common combinations, the sample estimate of Var(xik – Sik) = 0.120. Substituting the values into Equation 2 and then rearranging terms, τ = 8.68 cases.
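The steps in this paragraph (log-scale variance per common combination, inversion, fit to a γ distribution) can be sketched as follows. The text does not state which fitting method was used, so the sketch substitutes simple moment matching and omits the Kolmogorov-Smirnov test. The function names and the input layout (a dictionary from combination to a list of durations in hours) are hypothetical. Here the fitted rate of the γ distribution for the inverse variances is taken to correspond to the scale β of the inverse γ distribution for the variances.

```python
import math
from statistics import mean, variance

def precisions_of_common_combinations(durations_by_combination, min_cases=30):
    """Inverse of the sample variance of ln(hours) for each common combination."""
    precisions = []
    for hours in durations_by_combination.values():
        if len(hours) >= min_cases:
            log_var = variance([math.log(h) for h in hours])
            if log_var > 0:
                precisions.append(1.0 / log_var)
    return precisions

def gamma_method_of_moments(precisions):
    """Moment-matching estimates (alpha, beta) for gamma-distributed precisions.

    alpha is the shape and beta the rate of the gamma distribution, so that
    the variances follow an inverse gamma distribution with shape alpha and
    scale beta. This is a stand-in for whatever fit the statistics package does.
    """
    m = mean(precisions)
    v = variance(precisions)
    return m * m / v, m / v
```

A typical use would pass the result of `precisions_of_common_combinations` for a hospital's common combinations to `gamma_method_of_moments`; maximum likelihood fitting in a statistics package would generally give somewhat different values.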
We used the same approach in a publication with data from Hospital B (Table 1).16 Using the 485 combinations with at least 30 cases, the fit of the inverses of the variances to a γ distribution had P = 0.039, α = 3.15, and β = 0.225. Using the 352 combinations each with at least 40 cases, P = 0.20, α = 3.20, and β = 0.218. Using the 278 combinations each with at least 50 cases, P = 0.25, α = 3.20, and β = 0.216. Among the 47,187, 42,610, and 39,337 cases, the corresponding estimates of the variances Var(xik – Sik) = 0.139, 0.135, and 0.133, respectively. Using Equation 2, τ = 3.07, τ = 2.69, and τ = 2.87, respectively.
The current study was motivated by an application of the approach at Hospital C. There were 803 combinations of scheduled procedure(s), surgeon, and anesthesia provider with at least 30 cases. The sample variances in the log scale were calculated, inverted, and fit to a γ distribution. The Kolmogorov-Smirnov test of fit was good, P = 0.46. The estimated shape and scale parameters were α = 2.12 and β = 0.138, respectively, similar to what was found for Hospitals A and B. The sample Var(xik – Sik) = 0.117. Substituting into Equation 2 and solving for τ, the estimated τ = -20.6 cases. This result was unexpected (i.e., we did not know how to interpret a negative number of cases) and prompted the current investigation.
To understand the negative τ, instead of using the observed value of Var(xik – Sik) in Equation 2, we used different values while repeatedly solving for:
We substituted the value of τ = 8.68 from Hospital A, together with Hospital C’s α = 2.12 and β = 0.138, into Equation 2 and solved for Var(xik – Sik) = 0.1374. Reducing the sample variance from 0.1374 to 0.1300 and using Equation 3, the resulting τ = 18 cases. This increase in τ makes sense, because a reduction in Var(xik – Sik) increases the “worth” of the scheduled duration as a forecast for duration. Reducing the variance progressively in 0.001 increments, for 0.124 the τ = 157 cases whereas for 0.123 the τ = -575 cases. Thus, the calculated negative τ = -20.6 for Hospital C does not indicate that the scheduled durations provide worse information than even having no historical duration data for the combination. Rather, the negative τ implies that a scheduled duration is worth thousands of historical cases. However, that is impossible. Although Equation 2 functions well for Hospitals A and B, presentation in the different format of Equation 3 suggests that this may simply have been fortunate, because of the instability that can occur when the right-hand side of the denominator of Equation 3 is close to 1, resulting in the denominator itself being close to 0.
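The instability can be reproduced numerically. The sketch below assumes that Equation 2 has the form Var(xik – Sik) = (1 + 1/τ) · β/(α – 1), where β/(α – 1) is the mean of the inverse γ distribution. That form is an assumption, inferred because it reproduces the values reported in the text to within rounding; it is not copied from the original equations. The function name `tau_from_variance` is hypothetical.

```python
def tau_from_variance(var_log_diff, alpha, beta):
    """Solve Var = (1 + 1/tau) * beta / (alpha - 1) for tau (assumed form)."""
    ratio = var_log_diff * (alpha - 1.0) / beta
    # When ratio is close to 1 the denominator is close to 0 and tau explodes;
    # when ratio drops below 1 the computed tau becomes negative.
    return 1.0 / (ratio - 1.0)

# (Var(x - S), alpha, beta) as reported in the text for each hospital.
hospitals = {
    "A": (0.120, 2.32, 0.142),
    "B": (0.139, 3.15, 0.225),
    "C": (0.117, 2.12, 0.138),
}
for name, (var, alpha, beta) in hospitals.items():
    print(name, round(tau_from_variance(var, alpha, beta), 2))

# Sweep of the variance using Hospital C's alpha and beta.
for var in (0.1374, 0.130, 0.124, 0.123):
    print(var, round(tau_from_variance(var, 2.12, 0.138), 1))
```

Because the reported inputs are rounded to 3 significant digits, the computed values differ slightly from the published τ, most visibly for Hospital C, where the denominator is nearest to 0.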
A New Approach to Estimating τ
Equation 1 gives an unbiased estimator for the median. The sum of the absolute deviations of a set of numbers from the median is less than or equal to the sum of the absolute deviation from any other fixed number.17 Therefore, we identified a new approach for the estimation of τ, that being minimization of the sum (mean) of the absolute deviations. The absolute deviation is the absolute value of the difference between each case’s observed duration in hours and the cases’ predicted duration from Equation 1.
Solving for τ is a nonlinear optimization problem with 1 variable, but more than 150,000 absolute differences (Table 1). We used Excel 2010’s complimentary version of Solver, called the Generalized Reduced Gradient Nonlinear engine (Microsoft, Redmond, WA and Frontline Systems, Incline Village, NV). The estimation was constrained to τ > 0, because Equation 1 has a singularity at τ = 0 for the cases with ni(k) = 0.
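The one-variable minimization can be illustrated without Solver. The sketch below replaces the Generalized Reduced Gradient engine with a plain grid search over log-spaced values of τ > 0, and it assumes that the prediction is a log-scale weighted average with weights τ and the number of historical cases. The synthetic cases are invented purely for illustration, and all function names are hypothetical.

```python
import math
import random

def predict_hours(log_sched, log_hist_mean, n_hist, tau):
    """Assumed form of the prediction: log-scale weighted average, in hours."""
    return math.exp((tau * log_sched + n_hist * log_hist_mean) / (tau + n_hist))

def mean_abs_dev(cases, tau):
    """Mean absolute difference, in hours, between actual and predicted durations."""
    return sum(abs(actual - predict_hours(ls, lh, n, tau))
               for ls, lh, n, actual in cases) / len(cases)

def fit_tau(cases, lo=0.01, hi=1000.0, steps=200):
    """Grid search over log-spaced tau > 0 (a simple stand-in for Solver)."""
    grid = [lo * (hi / lo) ** (i / steps) for i in range(steps + 1)]
    return min(grid, key=lambda t: mean_abs_dev(cases, t))

# Synthetic cases: (ln scheduled, mean ln historical, n historical, actual hours).
rng = random.Random(0)
cases = []
for _ in range(500):
    log_median = rng.gauss(0.7, 0.5)
    n_hist = rng.randint(0, 12)
    log_sched = log_median + rng.gauss(0.0, 0.3)
    log_hist_mean = log_median + rng.gauss(0.0, 0.35 / math.sqrt(n_hist)) if n_hist else 0.0
    actual_hours = math.exp(log_median + rng.gauss(0.0, 0.35))
    cases.append((log_sched, log_hist_mean, n_hist, actual_hours))

tau_hat = fit_tau(cases)
print(round(tau_hat, 2), round(mean_abs_dev(cases, tau_hat), 3))
```

Keeping τ strictly positive avoids a zero denominator for cases with no historical data. A grid search is slower than a gradient method but does not depend on initial conditions.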
Reliability and Validity of Estimated τ
Lacking a gold-standard to which to compare our estimates for τ, we assessed the reliability of the estimates, content (face) validity, concurrent validity, and construct validity.18 Reliability refers to whether the calculations give the same answer when performed differently. For example, the nonlinear optimization should be insensitive to the initial conditions. Content validity assesses qualitatively whether the estimation of τ covers the construct being measured. In our circumstance, we consider whether our approach of minimizing the mean of the absolute deviations also makes sense economically. Concurrent validity evaluates whether the estimated τ and other data measured at the same time are related as expected. We evaluated the schedulers’ behaviors. Construct validity would hold if multiple theoretical predictions about the process for estimating τ were tested and found to be supported. Typically, reliability and the 3 different types of validity would be presented in the Results sequentially. However, because some analyses rely on other analyses, we intersperse the 4 to prevent duplication of material.
The mean absolute deviation in units of hours is sensitive to each hospital’s case durations. To facilitate interpretation, each mean absolute deviation was divided by the mean operating room time among all cases at the hospital. The standard error of the ratio of the 2 means (SEM) was calculated using Fieller’s theorem.19–21 The results are reported as mean ratio ± SEM, both multiplied by 100 to give percentages.
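A ratio of two means and an approximate standard error can be computed as sketched below. The sketch uses the usual linearization (delta method) for a ratio of paired means. This approximation is related to Fieller's theorem but is not the theorem itself, so it should not be read as the calculation used in the study. The function name and the toy inputs are hypothetical.

```python
import math
from statistics import mean

def ratio_of_means_with_se(numer, denom):
    """Ratio mean(numer) / mean(denom) with a linearized standard error.

    numer[i] and denom[i] belong to the same case, e.g. the absolute
    deviation and the actual operating room time of case i.
    """
    n = len(numer)
    ratio = mean(numer) / mean(denom)
    # Residuals of numer about ratio * denom; they sum to zero by construction.
    resid = [a - ratio * b for a, b in zip(numer, denom)]
    resid_var = sum(r * r for r in resid) / (n - 1)
    se = math.sqrt(resid_var / n) / mean(denom)
    return ratio, se

abs_dev_hours = [0.5, 0.2, 1.0, 0.4]   # toy absolute deviations
actual_hours = [2.0, 1.5, 4.0, 2.5]    # toy operating room times
ratio, se = ratio_of_means_with_se(abs_dev_hours, actual_hours)
print(f"{100 * ratio:.2f}% ± {100 * se:.2f}%")
```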
The value of τ is reported along with an approximate 95% confidence interval. First, the 95% 1-sided asymptotic upper confidence limit for the mean absolute deviation (as a ratio) was calculated as in the preceding paragraph, using the point estimate of τ. Second, the confidence interval for τ was calculated as the lower and upper value of τ such that the mean absolute deviation (as a ratio) equaled that upper confidence limit.
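The second step, finding the lower and upper τ at which the objective equals the upper confidence limit, can be sketched with bisection on the log scale. The sketch assumes that the objective rises steadily on either side of the point estimate within the search bounds. The toy objective and all names below are hypothetical and are chosen only so that the crossings are known exactly.

```python
import math

def crossing(objective, target, inside, outside, iters=60):
    """Bisect on ln(tau) for the tau between `inside` and `outside` at which
    objective(tau) equals target.

    Assumes objective(inside) < target <= objective(outside).
    """
    a, b = math.log(inside), math.log(outside)
    for _ in range(iters):
        mid = 0.5 * (a + b)
        if objective(math.exp(mid)) < target:
            a = mid   # still below the limit: move away from the estimate
        else:
            b = mid   # at or above the limit: move back toward the estimate
    return math.exp(0.5 * (a + b))

def tau_interval(objective, tau_hat, upper_limit, lo=1e-3, hi=1e4):
    """Lower and upper tau at which the objective reaches the upper limit."""
    lower = crossing(objective, upper_limit, tau_hat, lo)
    upper = crossing(objective, upper_limit, tau_hat, hi)
    return lower, upper

# Toy objective (percent) with its minimum at tau = 5.
def toy_objective(tau):
    return 24.0 + math.log(tau / 5.0) ** 2

limit = 24.0 + math.log(2.0) ** 2
lower, upper = tau_interval(toy_objective, 5.0, limit)
print(round(lower, 3), round(upper, 3))
```

When the objective is nearly flat in τ, as the Results indicate for the mean absolute deviation, the two crossings lie far apart and the interval is wide.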
Somers’ D correlation was calculated between scheduled and actual durations.22 As background, Kendall’s τ is a class of rank correlation coefficients. The Kendall’s τa correlation between scheduled and actual durations makes no correction for more than one case having the same scheduled or actual duration. Typically when “Kendall’s τ” is reported, it is Kendall’s τb, which corrects for ties in either the scheduled or the actual duration. Somers’ D is like Kendall’s τb in that it corrects for ties in the actual duration, but differs in that no correction is made for ties in the scheduled duration. This absence of correction for scheduled duration ties is desired because, if all scheduled durations for a combination were the same (e.g., the historical mean duration were always used), then the scheduler provided no improvement. Correlations were calculated for each combination of scheduled procedure(s) and surgeon for which at least 30 cases were performed. The asymptotic standard error of Somers’ D within each combination was calculated using SYSTAT 13 (Systat Software, Chicago, IL). The (standard) DerSimonian and Laird random effects analysis was used to calculate the pooled estimate and its standard error among combinations.23 The P-value comparing the pooled Somers’ D to zero was calculated asymptotically, with sample sizes of 318 to 803 combinations.
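The tie handling and the pooling can be made concrete with a short sketch. It follows one reading of the paragraph above: pairs of cases tied on actual duration are excluded, whereas pairs tied only on scheduled duration remain in the denominator. The asymptotic standard error within each combination is not computed here, so the values passed to the pooling function are assumed inputs. Function names are hypothetical.

```python
import math

def somers_d(scheduled, actual):
    """(concordant - discordant) / (pairs not tied on actual duration)."""
    concordant = discordant = untied_actual = 0
    n = len(actual)
    for i in range(n):
        for j in range(i + 1, n):
            d_actual = actual[i] - actual[j]
            if d_actual == 0:
                continue  # corrected for: ties on actual duration are dropped
            untied_actual += 1
            product = (scheduled[i] - scheduled[j]) * d_actual
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
            # product == 0: tie on scheduled duration stays in the denominator
    return (concordant - discordant) / untied_actual if untied_actual else 0.0

def dersimonian_laird(estimates, std_errors):
    """Random-effects pooled estimate and its standard error."""
    v = [se * se for se in std_errors]
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sw - sum(wi * wi for wi in w) / sw
    # Between-combination variance, truncated at zero.
    between = max(0.0, (q - (len(estimates) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (vi + between) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star))

# A scheduler who always enters the same duration earns a correlation of 0.
print(somers_d([2.0, 2.0, 2.0], [1.5, 2.0, 2.5]))
```

Under Kendall's τb the constant-schedule example would be undefined, which is why the uncorrected denominator is convenient for judging whether schedulers add information.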
Using the newly derived method, the estimated τ = 5.60 cases for Hospital A (95% confidence interval, 1.95–18.00), τ = 1.23 cases for Hospital B (0.22–4.13), and τ = 5.51 cases for Hospital C (2.40–13.93).
Observational studies show that combining schedulers’ estimated case durations with historical data results in smaller absolute deviations than either approach alone.7–10 The 3 estimates for τ are sufficiently large that scheduled durations can influence the resulting Bayesian estimates of case duration. The concordance of prediction and result suggests construct validity for the process of estimating τ.
The estimates of τ were insensitive to the initial conditions of the nonlinear optimization. The same estimates of τ were obtained to within 0.001 using initial conditions of τ = 0.01, τ = 10, and τ = 1000. The insensitivity supports the reliability of the estimates for τ.
Correlations between scheduled and actual durations were calculated for each hospital while stratifying by procedure. The Somers’ D correlations = 0.132 ± 0.009, 0.051 ± 0.005, and 0.139 ± 0.003, respectively. Hospital B’s correlations were less than for Hospitals A and C (P < 0.0001), but larger than zero (P < 0.0001). The same held when pooled among all cases and procedures (Hospital A, 0.713 ± 0.001; Hospital B, 0.581 ± 0.002; Hospital C, 0.780 ± 0.001). The smaller correlations for Hospital B relative to Hospitals A and C can explain Hospital B’s smaller estimated τ, since smaller values of τ imply larger reliance on historical rather than scheduled durations, as described in the Methods. These results suggest concurrent validity of our estimates for τ.
The magnitudes of the Somers’ D correlations were small when stratified by procedure (i.e., 0.051–0.139) but substantial when pooled among all cases and procedures (i.e., 0.581–0.780). This pattern of correlations (i.e., substantial among but not within procedures) was obtained using scheduled durations, but matches the pattern of correlations when medians (or means) of historical durations are used. Consequently, although the mean absolute deviations should depend on τ, the dependence should be small. In other words, for the moderate sample sizes for which τ matters, the scheduled durations and historical data are essentially interchangeable for estimating the median duration of a future case. To test this hypothesis further, the mean absolute deviations were reported as percentages (see Methods). For Hospital A, mean errors were 25.86% for τ = 0.01 and 25.77% for τ = 1000. For Hospital B, mean errors were 24.85% for τ = 0.01 and 28.86% for τ = 1000. For Hospital C, mean errors were 25.86% for τ = 0.01 and 25.77% for τ = 1000. The insensitivity is consistent with construct validity of our new process for estimating τ.
The mean absolute deviations for optimal τ were comparable among the 3 studied hospitals, despite being from different U.S. states and from a range of over 15 years: 24.37% ± 0.10%, 24.56% ± 0.09%, and 24.33% ± 0.06%, respectively. These percentages are the same (24%) as reported previously for anesthetics performed in the interventional radiology suite.11 The consistency suggests construct validity of our process for estimating τ.
Slight corrections for bias in scheduled estimates had no significant effects on the estimates. At Hospital A, scheduled estimates overall overestimated actual durations by 2.71%. Incorporating proportional bias correction, the absolute deviation was increased by 0.12% ± 0.10% (τ = 5.26). At Hospital B, scheduled estimates overall underestimated actual durations by 1.03%. Incorporating bias correction, the absolute deviation was decreased by 0.01% ± 0.09% (τ = 1.25). At Hospital C, scheduled estimates overall underestimated actual durations by 2.08%. Incorporating bias correction, the absolute deviation was decreased by 0.07% ± 0.06% (τ = 5.34). None of these 3 changes was statistically significant. Insensitivity of results to bias supports the reliability of the estimates of τ and the process for the estimation.
We estimated τ by minimizing the absolute deviation for a mathematical reason (see Methods). Nevertheless, the objective is rational economically because costs of forecast errors are related to time differences (e.g., for add-on, moving cases, and choosing which operating room to receive relief1).24 For example, when an anesthesiologist not on call waits for relief during a late-running case, the minutes (hours) the anesthesiologist waits matter, not whether the anesthesiologist waits a period equal to 10% or 20% of the case’s scheduled duration. When we sought combinations of Bayesian estimates and anesthesia providers’ instant message estimates for how much longer their cases would last, we (appropriately) also used least absolute deviations regression.16 The use of a reasonable cost model supports the content (face) validity of estimated τ.
The confidence intervals for τ are wide. This would be expected if most cases either have so few historical durations (i.e., ni(k) < 3) that the Bayesian estimate is influenced principally by the scheduled duration,12 or so many historical durations (e.g., ni(k) > 10) that the Bayesian estimate is influenced principally by the historical durations. These conditions applied to 79%, 88%, and 80% of cases at Hospitals A, B, and C, respectively. The prediction being satisfied suggests construct validity for the new process of estimating τ.
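How much the choice of τ can matter at a given number of historical cases can be tabulated directly. The sketch assumes that the scheduled duration receives the fraction τ/(τ + n) of the weight in the log-scale average, which is an assumed form consistent with the description of τ in the Introduction. The candidate values of τ and the function names are hypothetical choices for illustration.

```python
def scheduled_weight(tau, n_hist):
    """Assumed fraction of the log-scale average carried by the scheduled duration."""
    return tau / (tau + n_hist)

def spread(n_hist, tau_low=1.0, tau_high=8.0):
    """Difference in that fraction between two candidate values of tau."""
    return scheduled_weight(tau_high, n_hist) - scheduled_weight(tau_low, n_hist)

for n_hist in (0, 1, 2, 5, 10, 30, 100):
    weights = [round(scheduled_weight(t, n_hist), 2) for t in (1.0, 5.0, 8.0)]
    print(n_hist, weights, round(spread(n_hist), 2))
```

With no historical cases the scheduled duration carries all of the weight whatever the value of τ, and with very many historical cases it carries little weight for any τ in this range. The choice of τ changes the weights mostly at intermediate counts.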
The preceding hypothesis has a corollary to be tested: τ is effectively being estimated using the minority of cases, with something like 3 ≤ ni(k) ≤ 10. Furthermore, because the mean absolute deviation is being used, cases with long scheduled durations would have a larger influence on estimated τ than brief cases. For each hospital, we estimated τ using only cases for which 3 ≤ ni(k) ≤ 10 and the scheduled duration exceeds the median of all cases’ scheduled durations at the hospital. Only 12.2%, 6.10%, and 13.1% of cases were used, respectively. As hypothesized, the differences in the values of τ were not statistically significant (Hospital A, 5.60 to 4.06; Hospital B, 1.23 to 1.20; and Hospital C, 5.51 to 4.29). The small changes support construct validity of τ.
Although we previously used a paragraph in an Appendix3 to show that the Bayesian method is insensitive to the choice of τ, our use of Equation 2 provided us with no mathematical way to study the relationship. In retrospect, our previous approach to the estimation of τ was unreliable (see Equation 3). In our current study, we fix the latter problem and, in the process, obtain the tools needed to show that the estimation is insensitive to the value of τ. Any choice of τ in a reasonable range (e.g., 1–8 cases) can be used. The implication is that facilities wishing to implement Bayesian estimates of case durations can simply use a reasonable value (e.g., τ = 5) without going through the steps to calculate it.
However, estimation of the other 2 parameters, α and β, is still necessary. Simultaneous estimation of these parameters of an inverse γ distribution requires just a few mouse clicks or equivalent using most statistics packages. They are estimated using only the common (N ≥ 30) combinations of scheduled procedure(s) and surgeon. In contrast, we show in the Results that τ is effectively being estimated using the cases with 3 ≤ ni(k) ≤ 10. Thus, our findings about τ are unrelated to the estimation of α and β.
Our series of assessments of validity of the approach provides insight into the Bayesian method and its performance. Such understanding matters because the Bayesian method for case duration is important practically for operating room management decisions soon before and on the day of surgery. First, scheduled estimates are needed for the one third of cases with hardly any historical data. Therefore, a facility cannot make good decisions soon before and on the day of surgery if surgeons’ estimates are disregarded by the scheduling system out of fear that those estimates will be biased.3,6–11,16,25 Second, for a procedure or combination of procedures that the surgeon has scheduled scores of times before, the variability around the median (e.g., from the scheduled duration) is estimated principally from the corresponding historical data. For uncommon combinations of procedures, the data used are from other surgeons’ corresponding combinations, which is less desirable, but the Bayesian method handles this automatically. The principal benefit of historical data is not for estimating the median (central tendency), but the coefficient of variation (i.e., the variability). The scheduling challenges on the working day before surgery and on the day of surgery are principally about quantifying, managing, and responding to variability.1 The same applies to governments, insurers, and anesthesia group and hospital billing offices providing cost information to patients before surgery.26 For the vast majority of procedures, the only way to quantify and thus manage variability in case durations is to apply Bayesian (or equivalent) statistical methods.3,6,16,26
Dr. Franklin Dexter is the Statistical Editor and Section Editor for Economics, Education, and Policy for the Journal. This article was handled by Dr. Steven L. Shafer, Editor-in-Chief, and Dr. Dexter was not involved in any way with the editorial process or decision.
Name: Franklin Dexter, MD, PhD.
Contribution: This author helped design the study, conduct the study, analyze the data, and prepare the manuscript.
Attestation: Franklin Dexter attests to the analysis reported in this manuscript.
Conflicts of Interest: The University of Iowa, Department of Anesthesia, Division of Management Consulting, performs some of the analyses described in this study for facilities. Franklin Dexter has tenure and receives no funds personally from such activities. Income from the Division’s consulting work is used to fund its research.
Name: Johannes Ledolter, PhD.
Contribution: This author helped analyze the data and prepare the manuscript.
Attestation: This author attests to the data analysis and approved the final manuscript.
Conflicts of Interest: The author has no conflicts of interest to declare.
Name: Vikram Tiwari, PhD.
Contribution: This author helped conduct the study.
Attestation: This author approved the final manuscript.
Conflicts of Interest: The author has no conflicts of interest to declare.
Name: Richard H. Epstein, MD, CPHIMS.
Contribution: This author helped conduct the study and write the manuscript.
Attestation: This author approved the final manuscript. This author is the archival author.
Conflicts of Interest: The author has no conflicts of interest to declare.