Currently, for many US practices, patients cannot easily determine before their surgery an estimate of the cost of anesthesia professional services and the associated duration of their surgery.^{1} ^{,} ^{2} There are tens of thousands of blog site entries, online discussions, etc., from patients trying to forecast (or understand) what they perceive to be unexpectedly large US anesthesia bills.^{a} The US Government Accountability Office reported last year:^{b}

“Meaningful price information is difficult for consumers to obtain prior to receiving care … Consumers generally learn of their health care costs after receiving care, such as when they receive a bill from their provider or an explanation of benefits from their insurer. In contrast, information on health care prices is considered transparent when this information is available to consumers before they receive health care services. Transparent health care price information may help consumers anticipate their health care costs and reduce the possibility of unexpected expenses … We anonymously contacted 19 hospitals to obtain information on the price of a full knee replacement surgery. Several hospital representatives … explained that the price for the procedure could vary based on a variety of factors, such as the time the patient will be in the operating room and the type of anesthetic the patient may receive, and some noted that they would need to know this information if they were to provide a more specific price estimate …”

Patients with a high deductible or substantive copayment for professional (physician) services balance cost versus travel time and convenience of scheduling, in part, when choosing a hospital at which to have surgery performed.^{3} ^{,} ^{4} In this article, we review how an anesthesia group and/or facility can accurately provide this information to patients (Table 1 ). For simplicity, we henceforth refer to “anesthesia” charges, costs, etc., in lieu of the phrase “anesthesia professional service charges.”

Table 1: Summary of Findings and Application

The first half of the article focuses on which methods can be applied to provide estimates for patients of the average and upper limit of costs (Table 1 ). The second half of the article focuses on the role of insurers and/or governments. The last section also has implications for analyses of outcomes among facilities. Because the relevant statistical methods have recently been published in Anesthesia & Analgesia , we omit repetition of the equations and instead refer readers to the 2 articles.^{1} ^{,} ^{2}

Previous articles on surgical case duration prediction do not apply directly to this topic because they addressed different decisions, and appropriate statistical methods of prediction cannot validly be separated from the corresponding decisions.^{5} ^{,} ^{6} For example, checking scheduled durations against historical data for plausibility and the usefulness of adjustment during case booking depend on typical durations of the operating room workday and on the intervals used for scheduling (e.g., 5 or 15 minutes).^{1} ^{,} ^{7–12} Appropriate adjustment of the scheduled case duration the working day before surgery depends on the flexibility of adjusting assigned teams.^{13–18} Preventing conflicts over equipment involves the pairwise comparisons of the durations of cases.^{5} ^{,} ^{19} The choice of the time of patient arrival needs to incorporate the risk of cancellation of a preceding case in the operating room or the case being moved to another room.^{20–23} Staff assignment and add-on case scheduling decisions depend on updates to estimated case durations on the day of surgery.^{1} ^{,} ^{5} ^{,} ^{24} ^{,} ^{25} Readers interested in these different topics related to case duration prediction can refer to the provided references.

ANESTHESIA GROUP CALCULATING MEAN DURATION
When a patient contacts an anesthesia group and has the Current Procedural Terminology (CPT^{®} ) code or description of the scheduled procedure, information that can be provided to the patient includes the mean (expected value) of the patient’s anesthesia payment, an upper percentile of the patient’s payment, and the numbers of prior cases used to calculate those 2 estimates.

Base Units and Time Units
The amount paid by insurance carriers for the anesthetic is a function of the charge (price) per American Society of Anesthesiologists’ Relative Value Guide unit, the number of base units, and the case duration. Generally, the amount paid equals (base units + duration × contracted units per hour) × (contracted rate per unit), often with the contracted units per hour equaling ¼.

Estimating the base units is straightforward because many surgical CPTs map to the same anesthesia code (i.e., identical base units).^{26} ,^{c} For example, consider a patient scheduled for CPT 49560, “Repair initial, reducible, ventral hernia.” Suppose that the performed procedure is the scheduled CPT 49560 plus CPT 49568, implantation of mesh for ventral hernia repair. Suppose that the mean case duration for CPT 49560 alone is 1.6 hours, while the mean for the 2 procedures combined is 2.6 hours. Regardless, all of the ventral (incisional) hernias map to the same anesthesia CPT code 00752, with 6 base units. The times (case durations) differ, not the base units.

The base units rarely will be different from that expected preoperatively for a patient. In contrast, the case duration is a random variable with substantial predictive error.^{27} Consequently, calculating the mean and an upper percentile of patient charge effectively is the same as calculating these estimates for case duration. The latter is the topic of our article.

Mean of Case Durations
When providing a patient with information on typical anesthesia charges, the mean generally is the appropriate measure of central tendency for case duration data. The reason is that resource use for a population of patients is proportional to the total time, and the mean is proportional to the total time.^{28} The mean includes unusually long (i.e., expensive) case durations. Such outlier case durations should not be eliminated from the data because they actually occur (i.e., each future patient is at risk for being an outlier).

Multiple estimates can be made for the mean, the most common of which is the (arithmetic) sample mean of the durations in units of hours. The sample mean is reported ubiquitously in biostatistics because many biological data follow normal distributions. However, surgical case durations do not follow normal distributions.^{29} Knowing how best to calculate an estimate of the mean depends on knowing the probability distribution of case durations and coefficients of variation of case durations (i.e., the standard deviation divided by the mean). See Ref. ^{2} for a recent Statistical Grand Rounds on this topic.^{2}

Typical Coefficients of Variation
The coefficients of variation for the durations of cerebral aneurysm clipping cases at each of 30 facilities were studied.^{30} The maximum among the 30 estimated coefficients of variation was 49%. For some facilities, the sample sizes were so small (e.g., 1 facility had only 3 cases) that their coefficients of variation were measured with large uncertainty. Among the 14 facilities with at least 30 cases, the maximum estimated coefficient of variation was 30%.

Dexter and Ledolter^{31} estimated the coefficients of variation of case durations for all combinations of scheduled CPT(s) and surgeon with at least 30 cases. For the current article, we repeated the calculations^{31} but classified cases by scheduled procedure alone (see footnote d in the next section, Case Durations Generally Follow Log-Normal Distributions, and the Discussion). There were 354 such scheduled procedures. For each, the coefficient of variation was calculated. The estimated coefficient of variation was <100% for 99.4% of the 354 procedures and was <50% for 90% of them.

Spangler et al.^{32} estimated the coefficient of variation of surgical times for each of 574 procedures with at least 27 cases. The coefficient of variation was <76% for 98.4% of the procedures.

Case Durations Generally Follow Log-Normal Distributions
Case durations generally follow log-normal distributions.^{16} ^{,} ^{29} ^{,} ^{31–35}

For example, Figure 1 shows a previously published histogram of the logarithm of case durations for a single surgeon’s strabismus surgery cases.^{31} Lilliefors test of the quality of fit of the 105 logarithms of case duration to a normal distribution has P = 0.69. This P value indicates a good fit, since it is much larger than the typical (e.g., 0.05) threshold for rejection of the hypothesis that the data follow a specified distribution.

Figure 1: Previously published histogram of natural logarithm of case durations for a single surgeon’s strabismus surgery cases.

^{31} The 105 cases are those cases that the surgeon scheduled as Current Procedural Terminology 67311, “Strabismus surgery, recession or resection procedure; one horizontal muscle.” This was

Figure 5 in our paper in which we originally showed how Bayesian methods can be used to predict case durations.

^{31} For example, 30 facilities’ times for cerebral aneurysm clipping were recorded.^{30} The Lilliefors tests for log-normal distributions have P > 0.05 for 29 of 30 of the facilities, and the other P = 0.045. Having 1 of 30 studies with P < 0.05 is expected by random chance, based on Dunn-Šidák correction (i.e., there would be statistical significance only for an observed P ≤ 0.0017, where 0.0017 = 1 − [1 − 0.05]^{1/30} ). Figure 2 shows a histogram of case durations from the facility with the largest sample size. The Lilliefors test P = 0.96, showing excellent fit of the logarithms of the case durations to a normal distribution.

Figure 2: Histogram of case durations for cerebral aneurysm clipping from the facility with the largest sample size. The Lilliefors test P = 0.96 for the 92 cases shows an excellent fit of the natural logarithms of the case durations to a normal distribution. In the Appendix, we consider that this is a so-called 2-parameter log-normal distribution, with the 2-parameters being the sample mean and the sample standard deviation of the logarithms.

Throughout the article, we subsequently use the cerebral aneurysm data to illustrate our recommendations.^{30} This example is not ideal from a clinical perspective, since patients with subarachnoid hemorrhage are unable and have no reason economically to call around to hospitals to compare costs of anesthesia. However, these are prospectively (i.e., accurately) collected durations from many facilities for a single surgical procedure and a dataset that has been studied extensively.^{30} ^{,} ^{36} ^{,} ^{37} ^{d}

Finney’s Uniformly Minimum Variance Unbiased Estimator of the Mean
Given that the log-normal provides an appropriate distribution for case durations, how should one estimate the mean of such a distribution from the historical data? We look for an estimator that satisfies 2 criteria. First, the estimator should be unbiased, meaning that it is neither consistently larger nor smaller than the mean. Second, it should minimize the expected mean squared error (i.e., the expected squared difference between the estimator and the true value). For normal distributions, this estimator is the sample mean. For log-normal distributions, there exist other unbiased estimators that have a smaller expected squared error than the sample mean. One such estimator, with the smallest mean squared error, was derived by Finney.^{38} ^{,} ^{39} ^{,} ^{40} , ^{e}

Finney’s estimator performs better than the sample mean.^{38} However, the improvement is negligible when the coefficient of variation is <50%.^{39} The same applies when the coefficient of variation is <100% for sample sizes at least 20.^{39} The section Typical Coefficients of Variation illustrates that these conditions are met by case durations. Thus, Finney’s optimal mean squared estimator of the mean is only negligibly better than the ordinary sample mean. For case duration data, we recommend routine use of the sample mean (Table 1 ).

Use of the Sample Median In Lieu of the Use of the Sample Mean
Although the sample mean of case durations is a suitable choice for estimation of the expected anesthesia cost for a patient, current practice routinely uses the sample median instead. For example, inferential analysis using rank based methods (e.g., Wilcoxon-Mann-Whitney test) relies on percentiles and medians. A second way that the median is used, effectively, is by calculating the exponential of the sample mean of the logarithms of durations. The median of a log-normal random variable equals the exponential of the mean of the logarithms (see Equations (1) and (3) of Ref. ^{2} ). This is referred to as the geometric mean of the durations in hours. Finally, a third way that the median is used is literally just to calculate the sample median because it excludes outliers. This latter approach makes duration information that is communicated to patients robust to any unusual events.

For log-normal distributions, the sample median is not an estimator for the mean; rather, it underestimates the mean. That is a disadvantage. Yet, the bias may seem in practice too small to be an economically important concern.

For example, we calculated the coefficients of variation for each of the 354 scheduled procedures with at least 30 cases from Ref. ^{31} (see Typical Coefficients of Variation, above). The median coefficient of variation was 32%. Among log-normal distributions with a coefficient of variation of 32%, the mean equals the 53rd percentile (see Equations (1) and (2) of Ref. ^{2} ). At face value, the 53rd percentile seems “close” to the 50th percentile.

However, there are several reasons to use the mean, rather than the median.First, when the median is used systematically (i.e., for scheduling many cases), the cumulative effect over many thousands of patients causes substantial underestimation of total costs (i.e., budget).^{9} ^{,} ^{41}

Second, patients undergo cases of durations in units of time, not in percentages. Using the 354 procedures, the differences between the mean and the median were calculated in units of time.^{31} The differences exceeded 15 minutes for 17% of scheduled procedures and exceeded 1 hour for 4% of procedures.

Third, among the 354 coefficients of variations calculated, the maximum was taken and its equivalent percentile determined, again using Equations (1) and (2) of Ref. ^{2} . The maximum percentile was the 71st, substantially further away from the 50th percentile than the 53rd percentile.

Fourth, and most importantly, for most procedures there are too few data from individual facilities to determine the coefficient of variation reliably for the specific procedure. The coefficients of variation were estimated for the procedures from Ref. ^{31} with at least 30 cases, excluding 43% of cases and 96.9% of procedures.^{31} Thus, for any 1 procedure (or patient), usually it is not known how close the mean would be to the median (i.e., one does not know in using the median whether it is close enough).

ANESTHESIA GROUP CALCULATING UPPER PREDICTION LIMIT
Definition of Upper Prediction Limits
The maximum cost (i.e., case duration) is not a good parameter to provide to a patient since the maximum is sensitive to the sample size. Taking an extreme example, among 2000 patients undergoing laparoscopic cholecystectomy, there may be 1 with a major vascular injury from trocar insertion.^{42}

A prediction limit should be used instead of the maximum. We consider the 90% prediction limit, which implies that there is a 90% chance that the duration of a patient’s case will be no longer than that value.^{21} ^{,} ^{43} The 90% prediction limit is not the 90% upper confidence limit for the mean. The former represents the duration of 1 (future) case, and the latter represents the behavior of the mean of several cases’ durations.

The next 3 sections, Upper Prediction Limits Calculated Using Nonparametric Method, Upper Prediction Limits Calculated Using Student t Distribution, and Upper Prediction Limits Calculated Using Bayesian Method, describe 3 ways to calculate these prediction limits.

Upper Prediction Limits Calculated Using Nonparametric Method
One method to calculate prediction limits is nonparametric.^{43} The prediction limit is estimated using the 90th percentile of the sample, which is the observation with rank 0.9 × (1 + sample size).^{44} ^{,} ^{45} That alone is the process. The calculations do not depend on any distributional assumptions (e.g., does not assume log-normal distribution). For example, at a historical sample size of 9 cases, the 90% upper prediction limit is the longest of the historical durations, because 9 = 0.9 × (1 + 9). At historical sample sizes of 19 and 29, the 90% upper prediction limit is the second longest and third longest of the durations, respectively. When 0.9 × (1 + sample size) is not an integer, interpolation can be used.^{46} Alternatively, the prediction limit reported can be a conservative estimate, calculated by using the observation of rank that is the smallest integer larger than 0.9 × (1 + sample size).^{45} For example, at a historical sample size of 10 cases, the conservative 90% upper prediction limit would be the longest of the 10 durations.

We first applied these methods more than 15 years ago, when the probability distributions of case durations were understood poorly.^{47} In the remainder of this section, we describe the method’s 2 limitations and describe alternative approaches.

First, the method is limited by the need for suitably large sample sizes. To calculate the 90% prediction limit using the distribution-free method, there needs to be at least 9 historical values. Limiting consideration to scheduled procedures having at least 9, 19, or 29 historical data causes 27%, 37%, or 43% of cases to be excluded from use of the method.^{31} The criterion of having 29 historical durations is essentially the same as the criterion of having 30 cases in a dataset as used in the sections Typical Coefficients of Variation and Use of the Sample Median In Lieu of the Use of the Sample Mean, above.

Second, each increase in the sample size from 9 to 19 or from 19 to 29 substantively reduces the coefficients of variation of the estimated prediction limits.^{43–47} This is because the longest or both the longest and 2nd longest of the durations no longer contributes to the prediction limit (e.g., like calculating a trimmed mean).

For example, pooling all of the durations of cerebral aneurysm clippings, the mean was 5.339 hours and the sample coefficient of variation was 32.268%. These parameters were used to generate 29,000 log-normally distributed random numbers. Using the first 9 cases, a 90% upper prediction limit was calculated. Using the next 9 cases, another prediction limit was calculated, and so forth. The same was done for consecutive groups of 19 and 29 cases. Using sample sizes of 9, 19, and 29, the coefficients of variation of the 90% upper prediction limits were 19.4%, 12.5%, and 11.6%, respectively. This is important because although for all 3 sample sizes overall 90% of patients had an upper prediction limit that exceeded their actual duration (i.e., the so-called coverage is the same), for the smaller sample sizes some patients were provided limits vastly larger than the true 90th percentile (7.5 hours) and some substantially less. For sample sizes of 9, 19, and 29, the 90th percentiles of the 90% upper prediction limits were 10.2, 9.1, and 8.7 hours, respectively. Each increase in sample size resulted in substantive reductions toward the 90th percentile of 7.5 hours, which ideally all patients would be told (see Upper Prediction Limits Calculated Using Student t Distribution).

The sample size limitation of the distribution-free method is not overcome by pooling scheduled procedures into broader categories, because categories of surgically similar procedures have different mean durations.^{27} Otherwise, there would not be different procedure codes.

For example, consider a patient seeking a cost estimate for excision of an epidermoid cyst under monitored anesthesia care. The estimated case duration depends on the size of the cyst. This is why there are different CPT codes (i.e., different surgical fee schedules) depending on size (e.g., CPT 11400 for “excised diameter 0.5 cm or less,” 11401 for “excised diameter 0.6 to 1.0 cm,” …, 11406 for “diameter over 4.0 cm”). An example of the pooling of procedures would be to consider all epidermoid cyst excisions under the same anesthesia CPT code. Pooling increases the sample size but also increases the variability among durations of the same (pooled) procedure.^{27} Using the specific CPT codes increases the accuracy of predictions of durations.^{48} This example highlights also that reporting the sample size for each specific CPT as a surrogate for facility and/or surgeon’s expertise can be misleading, because the numbers of cysts removed of each size is unlikely related to expertise, just case duration.

Upper Prediction Limits Calculated Using Student t Distribution
A second method to calculate prediction limits applies percentiles of the Student t distribution. The sample mean and standard deviation of the logarithms of the durations are calculated. The sample standard deviation is multiplied by (a) the square root of (1 + 1/[sample size]), which approaches 1 for large sample sizes, and (b) the 90th percentile of the Student t distribution, with degrees of freedom equal to 1 less than the sample size. The mean is added to the product, and then the exponential is taken of the sum. The method is accurate in the sense that with at least 9 historical case durations, the duration of the next case actually had a 91% chance of exceeding the 90% prediction limit.^{43}

However, when the sample size is small, the 90% prediction limits for some cases are much larger than the mean.^{31} For example, the simulated data from the preceding section, Upper Prediction Limits Calculated Using Nonparametric Method, was used. Consecutive 90% upper prediction limits were calculated. Using sample sizes of 9, 19, and 29, the coefficients of variation of the prediction limits were 14.8%, 10.0%, and 8.1%, respectively. Since the distribution itself was being used, each coefficient of variation was less than the corresponding coefficient of variation of the distribution-free method: 19.4%, 12.5%, and 11.6%. However, for sample sizes of 2, 3, and 5, the coefficients of variation of the estimated prediction limits were much larger, 45.4%, 29.2%, and 21.0%, respectively. For sample sizes of 2, 3, 5, 9, 19, and 29, the 90th percentiles of the 90% upper prediction limits were 16.3, 11.6, 10.2, 9.1, 8.5, and 8.3 hours, respectively. The 16.3 hours is much larger than the mean duration of a cerebral aneurysm clipping, 5.3 hours.

The preceding simulation was performed for just 1 procedure. When all procedures at facilities were studied, sample sizes effectively “small” were <5 historical cases at one hospital and 9 historical cases at another hospital.^{1} ^{,} ^{31} The estimates for the 90% prediction limits are insensitive to these specific values of 5 and 9 cases.^{31} What matters is that often there will be <5 historical cases for the procedure that a patient is scheduled to undergo.^{1} ^{,} ^{27} ^{,} ^{31} ^{,} ^{43} ^{,} ^{49} With the 3 years of data from Ref. ^{31} , there were <5 historical cases for 22% of cases and 87% of procedures. Such results are consistent among hospitals^{50} and are no less of an issue at outpatient surgery facilities.^{51}

Thus, from the section Definition of Upper Prediction Limits, above we need to calculate an upper (e.g., 90%) prediction limit. However, from the sections Upper Prediction Limits Calculated Using Nonparametric Method and Upper Prediction Limits Calculated Using Student t Distribution, limited sample sizes (<9 cases) make both the nonparametric method and the modification of Student t distribution ineffective for many procedures.

Upper Prediction Limits Calculated Using Bayesian Method
A third (and effective) way to calculate prediction limits is a Bayesian method based on the log-normal distribution.^{1} ^{,} ^{16} ^{,} ^{31} ^{,} ^{33} Bayesian methods combine data with prior knowledge. When the sample size is larger than a threshold (e.g., 9 cases, as above), then the prediction limit is based principally on historical data for the scheduled procedure.^{1} ^{,} ^{31} The calculated prediction limit is very similar to that which would be calculated using Student t distribution (see the preceding section Upper Prediction Limits Calculated Using Student t Distribution). When few or even no historical data are available for the scheduled procedure, an estimate of the median time is obtained from an expert (e.g., scheduler, anesthesiologist, or surgeon). The Bayesian method creates a calculated median that is a weighted combination of the historical data and the time estimate, principally based on the time estimate. The Bayesian 90% prediction limit uses the calculated median and a model of the coefficient of variation, the latter obtained from all other available data.^{f}

The calculations for the Bayesian method can be performed using spreadsheet formulas (e.g., using Equations (2–6) of Ref. ^{1} ). For example, an Excel file (workbook) without macros can be used. One worksheet contains the historical data and a few formulas. That worksheet can be hidden. A second worksheet specifies the procedure and displays the prediction limit. The same approach can be used with Web pages.

INSURER AND/OR GOVERNMENT PERFORMING THE CALCULATIONS
Random Effects Models
Anesthesia durations available from billing (administrative) data are accurate measures of actual case durations.^{52} An insurer and/or government (henceforth “insurer”) could use the submitted (billed) durations as data to calculate for each interested patient with a known insurance the patient’s likely (e.g., mean) anesthetic cost for the procedure and the corresponding 90% upper prediction limit for cost.

Conceptually, the approach would match that of the preceding section Upper Prediction Limits Calculated Using Bayesian Method. For facilities for which the insurer has more durations than a threshold (e.g., 9 cases, as above), principally the historical duration data from that facility would be used. For the other facilities for which the insurer has few data, the data from all the other facilities would be used as well. Those other data are statistically equivalent to the scheduled duration as used in the Bayesian approach by individual facilities. The statistical methods that achieve such “borrowing” of information are referred to as “random effects” models.^{53} They are the same methods as those used to pool durations among many procedures and surgeons at a facility (i.e., to estimate durations for rare procedures).^{1} ^{,} ^{16} ^{,} ^{31}

Mean Case Durations Vary Among Facilities
Mean case durations (i.e., costs for patients) vary among facilities.

For example, among US government (Medicare) patients, there were significant differences in anesthesia durations among hospitals in Pennsylvania, PA, for orthopedic and general surgery procedures.^{54}

For example, case durations were compared among 10 hospitals in 8 countries (e.g., Australia and Finland).^{55} The second longest mean case duration was 50% longer than the second briefest mean time for both laparoscopic cholecystectomy and lung lobectomy.

For example, Figure 3 shows the sample means and 90% prediction limits of durations for aneurysm clipping among the 14 facilities each with at least 30 cases (see the section Typical Coefficients of Variation, above). The prediction limits were calculated using Student t distribution (i.e., as in the section Upper Prediction Limits Calculated Using Student t Distribution). Not only do the sample means of the durations differ among facilities, but also the 90% prediction limits and the differences between 90% prediction limits and means (all P < 0.0001). The second longest mean duration was approximately 50% longer than the second briefest mean time, matching that of the preceding^{55} example. Applying Equations (4) and (5) of Ref. ^{2} , the 95% confidence interval for this ratio was 40% to 71%.

Figure 3: Means and 90% upper prediction limits of durations in hours for cerebral aneurysm clipping, among the 14 facilities each with at least 30 cases. The upper confidence limits were calculated using the Student t distribution method described in the section Upper Prediction Limits Calculated Using Student t Distribution. The figure shows large differences among facilities in mean durations, in 90% prediction limits, and in differences between 90% prediction limits and mean (all P < 0.0001). The pooled estimate (red circle) was calculated using the durations from all 30 facilities. Because the sample size of the pooled estimate was large and the pooling creates a mixture of distributions of the different facilities, the first (nonparametric) method to calculate the 90% prediction limit was used for that 90% upper prediction limit (see Upper Prediction Limits Calculated Using Nonparametric Method). Since that answer is the same (to within 0.01 hours) as the 90th percentile, the latter is the label applied to the vertical axis.

Suppose insurers aimed to use their data to provide to patients estimates of the mean and the 90% upper prediction limit from multiple facilities. For each combination of procedure and facility, the sample size known to the insurer will be less than that known by each facility, because there are multiple insurers. Thus, as described in the preceding section Random Effects Models, above, data from all facilities would be pooled together to improve the precision of estimates for each facility. In the next 3 subsections, we consider statistical issues needed to do this accurately for log-normally distributed data.

Probability Distribution Among Facilities of Means of Logarithms of Durations
Random effects models rely on knowing the probability distribution among facilities of the means of the logarithms of durations.^{53}

From the section Case Durations Generally Follow Log-Normal Distributions, above, case durations for a procedure at each facility generally follow a log-normal distribution. This implies that the logarithms of durations at each facility follow a normal distribution. The means of the logarithms will not be the same among the facilities. When durations are sampled at random from the population of all facilities, the resulting probability distribution is referred to as a “mixture distribution.” A mixture of normal distributions also follows a normal distribution with its mean equal to the weighted combination of individual facilities’ means, the weights being the relative numbers of cases at each facility. Comparing among facilities the sample means of the logarithms of durations for aneurysm clipping, the sample means among facilities followed a normal distribution (Figure 4 , Lilliefors test P = 0.95).

Figure 4: Normal distribution probability plot of the mean log case duration in hours for cerebral aneurysm clipping, the mean being among all cases at each of 30 facilities. The nearly linear relationship is consistent with a normal distribution, Lilliefors test P = 0.95. In Probability Distribution Among Facilities of Means of Logarithms of Durations section, we explain that this normal distribution is an expected consequence of the case durations at each facility following a 2-parameter log-normal distribution (see Case Durations Generally Follow Log-Normal Distributions). A mixture of normal distributions follows again a normal distribution with its mean equal to the weighted combination of individual facilities’ means, the weights being the relative numbers of cases at each facility.

Heterogeneity Among Facilities in Means of Logarithms of Durations
The premise of the “random effects” model is that for a facility at which the insurer has few data, the mean of the logarithms of durations from all the other facilities can be used to infer the true value for the facility with few data.^{53}

Suppose that the insurer has only a sample size of 3 for a facility. Yet, the objective of the insurer is to estimate the mean and 90% prediction limit of durations in units of hours. One requirement for the standard error of these 2 estimates to be usefully narrow is to have an accurate estimate for the mean of the logarithms of the durations. What the random effects model does statistically is to “shrink” the sample mean of the logarithms of durations from that facility toward the overall sample mean of the logarithms. The estimate of the mean of the logarithms of the durations at the facility is a weighted combination of the facility’s own sample mean of the logarithms and that of all facilities combined.

Shrinkage depends on the amount of homogeneity among the means. No shrinkage occurs if the means are markedly heterogeneous, since in such a circumstance the mean at one facility says nothing about the mean at another.

Figure 4 shows heterogeneity among facilities in the means of the logarithms of durations. The heterogeneity can be quantified using the I ^{2} statistic.^{56} This is the same I ^{2} as reported in meta-analyses for quantifying heterogeneity among studies.^{56} ^{,} ^{57} The I ^{2} statistic equals the percentage of the total variance in the means of the logarithms of durations of all cases attributable to the variance in the means among facilities.^{56} A value of 0% indicates no differences in means among facilities. A value of 100% indicates that there is effectively certainty (i.e., standard deviation of 0) in each facility’s estimate of its mean of the logarithms of durations relative to the variability in these means among facilities.

To estimate I ^{2} , the standard error of the mean for the logarithms of case durations is estimated for each facility.^{58} ^{,} ^{59} Using Equation (4) of Ref. ^{3} , a 90% 2-sided confidence interval for the mean of the logarithms of durations was calculated for each facility. An estimate of the standard error was obtained by dividing the width of the confidence interval by 2 × 1.64, where 1.64 is the 95th percentile of the standardized normal distribution. Each facility, represented by its sample mean and standard error, characterizes a study in the framework of a meta-analysis.^{59} ^{,} ^{60}

The standard Dersimonian and Laird method of moments estimate is used to perform the random effects meta-analysis.^{53} ^{,} ^{57} ^{,} ^{59–62} The calculated I ^{2} = 96% is huge, implying considerable heterogeneity. Simulation studies^{62} have been performed to evaluate the bias and mean squared error in the calculated I ^{2} . Since the distribution among facilities of the means of the logarithms of durations is (very close) to being normally distributed (Figure 4 ), this estimate of I ^{2} likely is negligibly biased and has small error.^{62} Nevertheless, being conservative, we followed the Appendix of Ref. ^{63} and repeated the calculation after deleting the facilities with the smallest and largest means of the logarithms of durations.^{63} These are the same 2 facilities as those with the smallest and largest means in units of hours (Figs. 3 and 5 ) and with the smallest P values comparing the means of the logarithms to the overall pooled estimate. The I ^{2} remains very large, 93%. Thus, there is so much heterogeneity among facilities that trying to provide facility-specific estimates of mean durations and 90% prediction limits of durations by pooling data is not practically useful.^{g}

Figure 5: Each hospital’s mean duration in hours for cerebral aneurysm clipping and upper or lower confidence limit for the mean. This is a caterpillar plot (see “Caterpillar” Plot with Sample Means and Confidence Limits of Mean Durations). Lack of overlap (blue) of the 99.83% confidence limit with the pooled mean (red) is the same as rejecting the hypothesis that the facility’s mean is the same as that of the pooled mean at P < 0.0017. The choice of 99.83% compensates for the multiple comparisons among facilities to maintain an overall error rate of 5.0% (see the section Controlling for Effect of Multiple Comparisons).

Heterogeneity Among Facilities in the Coefficient of Variation of Durations
Treating facilities as studies in a meta-analysis, as done in the preceding section Heterogeneity Among Facilities in Means of Logarithms of Durations, would be sufficient if our objective was to estimate the overall mean duration in hours among all studies. However, it is not. Rather, as described in the section Random Effects Models, the objective of our use of a random effects model is to pool data among facilities to improve estimates (e.g., mean and 90% upper prediction limit) for each facility. To do this effectively, there needs to be little to no heterogeneity among facilities in the coefficients of variations among the facilities. Since the 90% upper prediction limits depend highly on the coefficients of variations, incorrectly assuming homogeneity of the coefficients of variation would result in markedly inaccurate estimates.

Consider 2 hypothetical facilities at which a procedure is performed.

At one facility, the procedure is performed by 1 surgeon of average speed with her personal team of first assist registered nurse and surgical technologist.^{14} ^{,} ^{15} ^{,} ^{49} ^{,} ^{64} The anesthetic is the same for each patient.^{14} ^{,} ^{15} ^{,} ^{64} The surgeon and her scheduler predict which cases of a procedure will take longer than the median (or mean) and which will take less.^{16} ^{,} ^{25} ^{,} ^{31} ^{,} ^{65} When a case is expected to take longer than average, each team member plans ahead to avoid any unusual delays.

At the other facility, there are multiple surgeons performing the procedure, some briefer than average and some faster than average. Each surgeon sometimes works with junior residents and sometimes fellows.^{14} ^{,} ^{25} Each surgeon sometimes is assigned trainee surgical technologists and sometimes technologists who are experienced.^{14} ^{,} ^{25}

The means in hours would be the same between the 2 facilities, but the standard deviations in hours would differ (i.e., the coefficients of variation would differ). For the log-normal distribution, the coefficient of variation of durations in hours is determined by the standard deviation of the logarithms (see Equation (2) of Ref. ^{2} ). Thus, these 2 facilities with unequal coefficients of variation of durations in hours have different standard deviations of the logarithms of durations. Because the described operational heterogeneity between these 2 facilities is fully reasonable (i.e., has face validity), unequal standard deviations of the logarithms of durations should be expected.

For example, applying Levene test to check for homogeneity of the standard deviations of the logarithms of cerebral aneurysm clipping durations, P < 0.0001, and among the 14 facilities each with at least 30 cases, P = 0.0002.^{55} ^{h}

The magnitude of the heterogeneity of the standard deviations of the logarithms of durations among facilities is small relative to the heterogeneity of the means of the logarithms among facilities. Suppose that there was no heterogeneity of the standard deviations of the logarithms. Then, the rank correlation would equal 1.0 between the means and the 90% prediction limits of duration in hours. Figure 3 would show that each increase in the mean is associated with an increase in the 90% prediction limit. This condition is approximately satisfied (Spearman r = 0.98, P < 0.0001, lower 95% confidence limit 0.95). We are unaware of any example, consulting work, articles, etc., for which a patient would need to be concerned that different facilities with the same mean duration would have substantively different 90% prediction limits (i.e., risk of an unusually long case).

The importance of the unequal standard deviations of the logarithms of durations is that the analyst (statistician, etc.) must consider them when comparing the means . For log-normally distributed data, incorrectly assuming equality of the standard deviations of the logarithms when comparing facilities’ means (and 90% prediction limits) in hours results in large inferential errors.^{2} ^{,} ^{59} This differs from analyses of normally distributed data, because for the log-normal distribution the standard deviation of the logarithms contributes to the mean (see Equation (1) of Ref. ^{2} ). Highlighting this problem and its solution was the focus of a Statistical Grand Rounds in Anesthesia & Analgesia .^{2}

“Caterpillar” Plot with Sample Means and Confidence Limits of Mean Durations
Figure 5 shows each hospital’s mean duration and upper or lower confidence limit for the mean. The plot is sometimes referred to as a “caterpillar” plot because the confidence limits give the impression of insect legs on a smooth body. The confidence limits were calculated using Equation (4) of Ref. ^{2} , a Statistical Grand Rounds on analysis of log-normal distributions. For comparing costs of anesthetics among facilities, the goal would be to identify facilities with brief durations. Thus, it would be the mean and the upper confidence limit for the mean that matters. However, for presentation purposes, we also show facilities with mean durations that are longer than average, as that can be relevant for the different application of studying morbidity and mortality (see the section Information on Patient Morbidity and Mortality, below).

The overall sample size of durations in hours among all facilities is very large for the procedure. Therefore, a pooled mean of durations in hours was calculated using the overall sample mean.^{66} ^{,} ^{67} ,^{i} The pooled mean is shown as the entry in red to the far right of the Figure 5 .

Each of the confidence limits was compared with the overall pooled mean among facilities. Lack of overlap of the 95% or 99.83% confidence limit with the pooled mean is the same as rejecting the hypothesis that the facility’s mean is the same as that of the pooled mean at P < 0.05 or P < 0.0017, respectively.^{2} ^{,} ^{68} The pooled mean is shown by a red bar in Figure 5 . The facilities for which the 99.83% confidence limits do not overlap with the pooled mean are shown in blue.

Controlling for Effect of Multiple Comparisons
The choice of 99.83% compensates for the multiple comparisons among facilities to maintain an overall error rate of 5.0%. We rely on the Dunn-Šidák correction mentioned above in the section Case Durations Generally Follow Log-Normal Distributions. The 99.83% equals 100% × [1 − 0.05]^{1/30} , where the 30 is the number of facilities. There were 16 facilities with P ≤ 0.0017 (i.e., mean duration significantly briefer or longer than the overall pooled mean). The P value for the facility with the smallest P > 0.0017 was P = 0.0064. Since 0.0064 was about 3 times larger than 0.0017, that 17th facility’s mean was not considered to differ from the pooled mean. Furthermore, in practice, we have no experience with patients considering 30 facilities for surgery, rather inquiring about 2 or 3 options. Thus, usually the correction to the P values would not be as large as we made.^{69} ^{,} ^{70} ^{j}

Suppose that, instead, case durations were being compared among facilities for the different application of quality of care (e.g., focus was on those longer than average durations, see the section Information on Patient Morbidity and Mortality, below). Then, there would be scores or even hundreds of facilities.^{52} An alternative method of multiple comparisons that limits the so-called “False Discovery Rate” could be used.^{71} The method is written out in Equation (1) of Ref. ^{72} ^{i} and is no more complicated than sorting P values and doing some arithmetic.^{71} The Dunn-Šidák correction ensures that the probability of there being at least 1 facility detected falsely as having mean duration longer than average is no larger than 5.0%.^{71} The false discovery method ensures that among all facilities detected as having mean duration longer than average no more than 5.0% have been detected falsely.^{71}

Information on Patient Morbidity and Mortality
Although our article has focused on comparisons of case durations for estimating anesthesia costs, a potentially far more important issue is how to study the association between anesthetic durations and patient morbidity and mortality.^{73} ^{k} Increases in durations beyond 4 hours are associated for some procedures with increased morbidity.^{74–79} Whether assessed by logistic regression, survival analysis (i.e., with censoring), or propensity score analysis, these are methods of nonlinear regression. A recent Statistical Grand Rounds reviewed that when duration and outcome are associated, the procedure itself (e.g., video-assisted thoracoscopic lobectomy or thoracotomy lobectomy) needs to be tested for inclusion in the multivariable analysis.^{73} If the procedure itself influences duration, but has been excluded in lieu of category of procedure (e.g., lung resection), estimates of the odds ratios for all independent variables are biased.^{73} From the sections Random Effects Models and Mean Case Durations Vary Among Facilities, the same applies to each facility. These are so-called “multilevel” or “hierarchical” models and were the subject of another recent Statistical Grand Rounds in Anesthesia & Analgesia .^{80} The statistical assumptions to be satisfied are those from the sections Heterogeneity Among Facilities in Means of Logarithms of Durations and Heterogeneity Among Facilities in the Coefficient of Variation of Durations.

DISCUSSION
A compilation of our findings is in Table 1 .

Throughout the paper, we have considered the performed (actual) procedure to be the same as the scheduled procedure. Whereas this is a reasonable assumption for joint replacement procedures, it is not for some other specialties. When actual procedures differ from scheduled procedures, there is no increase in the absolute predictive error in case durations estimated by the Bayesian method (see the section Upper Prediction Limits Calculated Using Bayesian Method).^{27} This is counter-intuitive (i.e., why we did the study^{27} ), but caused by the differences in procedure being principally expected decision options that surgeons make intraoperatively.^{27} Provided the facility provides the mean and upper prediction limit, rather than the insurer and/or government, our conclusions would be unaffected, because the facility would (should) rely on scheduled procedures. If the specific surgeon performing the case were known, then also classify the data by surgeon (see footnote f in the section Upper Prediction Limits Calculated Using Bayesian Method).^{14–17} ^{,} ^{64} ^{,} ^{65} Since our finding was that the facility provides the mean and prediction limit (see the sections Heterogeneity Among Facilities in Means of Logarithms of Durations and Heterogeneity Among Facilities in the Coefficient of Variation of Durations), differences between scheduled and actual procedures are not a substantive concern with respect to our application of case durations. This is in contrast to the other types of decisions listed in the first paragraph of the second page of the article.

DISCLOSURES
Name: Franklin Dexter, MD, PhD.

Contribution: This author helped design the study, conduct the study, analyze the data, and prepare the manuscript.

Attestation: Franklin Dexter attests to the analysis reported in this manuscript.

Conflicts of Interest: The University of Iowa, Department of Anesthesia, Division of Management Consulting, performs some of the analyses described in this paper for facilities. Franklin Dexter has tenure and receives no funds personally from such activities. Income from the Division’s consulting work is used to fund its research.

Name: Richard H. Epstein, MD, CPHIMS.

Contribution: This author helped write the manuscript.

Attestation: Richard H. Epstein approved the final manuscript.

Conflicts of Interest: The author has no conflicts of interest to declare.

Name: Emine O. Bayman, PhD.

Contribution: This author helped design the study.

Attestation: Emine O. Bayman reviewed the original study data, approved the final manuscript, and is the archival author.

Conflicts of Interest: The author has no conflicts of interest to declare.

Name: Johannes Ledolter, PhD.

Contribution: This author helped analyze the data and prepare the manuscript.

Attestation: Johannes Ledolter attests to the data analysis and approved the final manuscript.

Conflicts of Interest: The author has no conflicts of interest to declare.

RECUSE NOTE
Franklin Dexter is the Statistical Editor and Section Editor for Economics, Education, and Policy for Anesthesia & Analgesia . This manuscript was handled by Steve Shafer, Editor-in-Chief, and Dr. Dexter was not involved in any way with the editorial process or decision.

ACKNOWLEDGMENTS
We appreciate Dr. Michael Todd and the other Intraoperative Hypothermia for Aneurysm Surgery Trial (IHAST) investigators for providing access to the anesthesia time data.

a Google search July 19, 2012 using (“anesthesia bill” OR “bill for anesthesia” OR “bill for the anesthesiologist”), and then manual exclusion of unrelated items such as passing laws (e.g., getting a bill passed by a state legislate). Cited Here

b US Government Accountability Office. Health care price transparency, Meaningful price information is difficult for consumers to obtain prior to receiving care, GAO-11–791. Available at: www.gao.gov/assets/590/585400.pdf . Accessed April 29, 2012. Cited Here

c In order for the facility and anesthesia group to provide cost estimates to patients, the group must know the surgical CPTs, not just the far broader anesthesia CPTs. This involves coordination of data between surgical facilities and/or practices. Such coordination already is required to provide information to patients on the group’s relative performance at preventing morbidity and mortality. Risk quantification in noncardiac surgical patients is predicted (very well) using the Agency for Healthcare Research and Quality’s Clinical Classification Software’s categorization of surgical CPTs.^{26} Cited Here

d Another feature that is realistic for this application is that data are pooled among surgeons (see Discussion). For readers who choose to repeat the Web search from footnote a in the first paragraph of the article, patients inquiring about price appear generally not to ask for price by surgeon, but “shop around” by institution. This also is our experience at the University of Iowa’s Department of Anesthesia. First, the price is asked, and then based on the estimate an inquiry is made of available appointment with a surgeon. Cited Here

e An intuitive half page summary of the derivation is in Section 2.2 of Zhou’s article.^{39} Finney’s estimator equals the product of x and y . The x is the exponential of the sample mean of the logarithms of durations (i.e., the geometric mean). The y is an infinite power series, from a Taylor series expansion, each term of which includes the product of a sequence of items collectively resulting in the moment generating function of a χ^{2} distribution. Cited Here

f When a scheduled duration is treated as providing 5.6 historical cases of information, the 90% prediction limits are exceeded by 9.8% of the actual case durations (i.e., close to the nominal rate of 10%).^{31} Among the one-third of cases with 0, 1, or 2 historical data of the same surgeon, 13.1% of cases are exceeded. Among the two-third of cases with at least 3 historical data, 8.2% are exceeded. The 9.8% = 1/3 × 13.1% + 2/3 × 8.2%. When the 90% prediction limits are calculated using only historical data as in the section Upper Prediction Limits Calculated Using Student t Distribution, the rate for the latter two-third of cases is 10.4%. Cited Here

g Using the Dersimonian and Laird estimate of τ corresponding to I ^{2} = 96%, only 2 of the 30 hospitals have P < 0.05, the hospitals with the briefest and longest means (Figs. 3 –5 ).^{53} P = 0.042 and P = 0.002, respectively. There are few outliers and unimpressive P values because I ^{2} was so large. The analysis is simply looking at the tails of a (slightly right skewed) normal distribution curve (Figure 4 ). Recalculating τ after trimming these 2 outliers,^{63} there was only 1 additional hospital identified as an outlier, the hospital with the second briefest duration (Figs. 3 –5 , P = 0.024). Cited Here

h We also applied the test to data from an international benchmarking study that we had performed previously for lung lobectomy and laparoscopic cholecystectomy. The P = 0.015 and P = 0.38, respectively. There were only 10 cases per facility and thus a low statistical power to detect unequal coefficients of variation even when present. Cited Here

i The pooled sample mean of the 997 durations was 5.3 hours. By Student t distribution, the 95% confidence interval was 5.2 to 5.4 hours. Including the kurtosis (i.e., using Chen’s method),^{66} ^{,} ^{67} the confidence interval was 5.2 to 5.5 hours. For other procedures, the insurer and/or government’s sample size will be much larger. Cited Here

j Consider a different application of a facility aiming to compare its mean duration to that of all other facilities pooled (e.g., to use results internally when negotiating receipt of a percentage of a bundled surgical payment to the facility including all professional fees). No correction for multiple comparisons may seem needed as only 1 such comparison is being performed. However, such comparisons may be made for many procedures simultaneously making the methods of this section appropriate. Furthermore, the rational internal business model for this application generally would be to use transfer pricing, as described previously,^{69} ^{,} ^{70} making such analyses moot. Cited Here

k Realistically, for our application to insurance, we modeled duration as a function of the procedure, not procedure and surgeon or procedure, surgeon, and type of anesthetic (see footnote d in the section Case Durations Generally Follow Log-Normal Distributions).^{14} ^{,} ^{15} ^{,} ^{49} ^{,} ^{50} ^{,} ^{64} This assumption also is realistic for studies comparing morbidity and mortality among facilities (or anesthesia groups) using anesthesia billing data (see below).^{52–54} ^{,} ^{63} Since facilities generally have >1 surgeon who performs each procedure, we have not considered the (previously studied) influence of the surgeon’s age (e.g., learning effect) on duration either.^{15} ^{,} ^{16} Cited Here

APPENDIX
Throughout the body of our article, we used the 2-parameter log-normal distribution. Three-parameter log-normal distributions are available. However, for many procedures and facilities, especially as known by an insurer and/or government, the sample sizes are simply too small to estimate 3 parameters precisely (see the section Upper Prediction Limits Calculated Using Nonparametric Method).

Procedures with minimum durations much >0 hours are not inherently more suitable for a 3-parameter log-normal distribution than are brief procedures. Deciding whether to consider and subtract a shift from the observations does not depend just on whether the minimum is much larger than zero. The cerebral aneurysm clipping durations both are long and a good illustration for 2-parameter log-normal distributions (see Figure 2 and the section Case Durations Generally Follow Log-Normal Distributions).

The third (shift) parameter can be estimated by formula or empirically.^{32} ^{,} ^{81} For inferential studies involving case durations, Equation (1) of Ref. ^{32} can be used, which is the same as Equation (2) of Ref. ^{81} . The values are a combination of the minimum, maximum, and the median. When the sample size exceeds 50, this simple method performs well for surgical data.^{32} However, for smaller sample sizes, it can be better to work empirically, trying every possible value in 1-minute increments and choosing the shift that results in the largest P value when fitting the logarithms of the shift-adjusted durations to a normal distribution.^{32} Spangler used the Shapiro-Wilk^{82} ^{,} ^{83} test, but we know of no reason why Lilliefors test^{84} ^{,} ^{85} would not be equally suitable. Stepaniak et al.^{15} ^{,} ^{33} studied the procedures at 2 hospitals for which the facilities had at least 10 cases. There were better fits from the 3-parameter log-normal than from the 2-parameter log-normal.^{15} ^{,} ^{33} However, by definition, the fits could not be worse, just better.

There may appear then to be a discrepancy in study findings. The 90% Bayesian prediction limits of the section Upper Prediction Limits Calculated Using Bayesian Method are based on 2-parameter log-normal distributions and are so accurate that they are exceeded by just 9.7% of actual case durations (i.e., they are conservative but by only 0.3%).^{31} Yet, the 3-parameter log-normal distribution appears empirically to be better for some procedures.^{33}

To get more insight into this issue, we performed a simulation using 30,000 random numbers following a 3-parameter log-normal distribution with location parameter (shift) 1.0 hours, mean 3.0 hours, and coefficient of variation 33%. These parameter choices were based on the observations from the section Use of the Sample Median In Lieu of the Use of the Sample Mean. After subtracting the shift, the corresponding mean of the natural logarithms equaled 0.5816 and their standard deviation equaled 0.4724. The Lilliefors test of fit to a 3-parameter log-normal distribution had P = 0.44. Fitting a 2-parameter log-normal distribution (i.e., not subtracting the shift), the Lilliefors test had P < 10^{−8} . The mean of the logarithms equaled 1.0513 and their standard deviation equaled 0.3026. Next, we compared the 90% prediction limits calculated using Student t distribution (see the section Upper Prediction Limits Calculated Using Student t Distribution). Using the 3-parameter log-normal, the limit equaled 4.28 hours, where 4.28 = 1 + exp(0.5816 + 1.2816 × 0.4724), where 1.2816 is the 90th percentile of the Student t distribution with large sample size. Using the 2-parameter log-normal, the limit equaled 4.22 hours, where 4.22 = exp(1.0513 + 1.2816 × 0.3026). The percentage error is just 1.6%, and (importantly) that is based on knowing the true shift, which does not happen in reality. When the shift was excluded, the sample mean and standard deviation of the logarithms adjusted and the resulting distribution was practically insensitive to the exclusion of the shift.^{86} Thus, for inferential analyses (and practically when the sample sizes are large), we recommend focusing less on getting the “right” shift and more on whether its value influences the conclusions (i.e., perform sensitivity analyses).

REFERENCES
1. Dexter F, Epstein RH, Lee JD, Ledolter J. Automatic updating of times remaining in surgical cases using bayesian analysis of historical case duration data and “instant messaging” updates from anesthesia providers. Anesth Analg. 2009;108:929–40

2. Ledolter J, Dexter F, Epstein RH. Analysis of variance of communication latencies in anesthesia: comparing means of multiple log-normal distributions. Anesth Analg. 2011;113:888–96

3. Dexter F, Wachtel RE, Sohn MW, Ledolter J, Dexter EU, Macario A. Quantifying effect of a hospital’s caseload for a surgical specialty on that of another hospital using multi-attribute market segments. Health Care Manag Sci. 2005;8:121–31

4. Dexter F, Birchansky L, Bernstein JM, Wachtel RE. Case scheduling preferences of one Surgeon’s cataract surgery patients. Anesth Analg. 2009;108:579–82

5. Dexter F, Epstein RH, Traub RD, Xiao Y. Making management decisions on the day of surgery based on operating room efficiency and patient waiting times. Anesthesiology. 2004;101:1444–53

6. Dexter F, Wachtel RE, Epstein RH. Event-based knowledge elicitation of operating room management decision-making using scenarios adapted from information systems data. BMC Med Inform Decis Mak. 2011;11:2

7. Dexter F, Macario A, Epstein RH, Ledolter J. Validity and usefulness of a method to monitor surgical services’ average bias in scheduled case durations. Can J Anaesth. 2005;52:935–9

8. McIntosh C, Dexter F, Epstein RH. The impact of service-specific staffing, case scheduling, turnovers, and first-case starts on anesthesia group and operating room productivity: a tutorial using data from an Australian hospital. Anesth Analg. 2006;103:1499–516

9. Dexter F, Macario A, Ledolter J. Identification of systematic underestimation (bias) of case durations during case scheduling would not markedly reduce overutilized operating room time. J Clin Anesth. 2007;19:198–203

10. Dexter F, Weih LS, Gustafson RK, Stegura LF, Oldenkamp MJ, Wachtel RE. Observational study of operating room times for knee and hip replacement surgery at nine U.S. community hospitals. Health Care Manag Sci. 2006;9:325–39

11. Berry M, Berry-Stölzle T, Schleppers A. Operating room management and operating room productivity: the case of Germany. Health Care Manag Sci. 2008;11:228–39

12. Sulecki L, Dexter F, Zura A, Saager L, Epstein RH. Lack of value of scheduling processes to move cases from a heavily used main campus to other facilities within a health care system. Anesth Analg. 2012;115:395–401

13. Cassera MA, Zheng B, Martinec DV, Dunst CM, Swanström LL. Surgical time independently affected by surgical team size. Am J Surg. 2009;198:216–22

14. Dexter F, Dexter EU, Masursky D, Nussmeier NA. Systematic review of general thoracic surgery articles to identify predictors of operating room case durations. Anesth Analg. 2008;106:1232–41

15. Stepaniak PS, Heij C, de Vries G. Modeling and prediction of surgical procedure times. Statistica Neerlandica. 2010;64:1–18

16. Eijkemans MJ, van Houdenhoven M, Nguyen T, Boersma E, Steyerberg EW, Kazemier G. Predicting the unpredictable: a new prediction model for operating room times using individual characteristics and the surgeon’s estimate. Anesthesiology. 2010;112:41–9

17. Stepaniak PS, Vrijland WW, de Quelerij M, de Vries G, Heij C. Working with a fixed operating room team on consecutive similar cases and the effect on case duration and turnover time. Arch Surg. 2010;145:1165–70

18. Gillespie BM, Chaboyer W, Fairweather N. Factors that influence the expected length of operation: results of a prospective study. BMJ Qual Saf. 2012;21:3–12

19. Dexter F, Traub RD. Sequencing cases in the operating room: predicting whether one surgical case will last longer than another. Anesth Analg. 2000;90:975–9

20. Weiss EN. Models for determining estimated start times and case orderings in hospital operating rooms. IIE Transactions. 1990;22:143–50

21. Wachtel RE, Dexter F. A simple method for deciding when patients should be ready on the day of surgery without procedure-specific data. Anesth Analg. 2007;105:127–40

22. Wachtel RE, Dexter F. Influence of the operating room schedule on tardiness from scheduled start times. Anesth Analg. 2009;108:1889–901

23. Smallman B, Dexter F. Optimizing the arrival, waiting, and NPO times of children on the day of pediatric endoscopy procedures. Anesth Analg. 2010;110:879–87

24. Dexter F, Macario A, O’Neill L. A strategy for deciding operating room assignments for second-shift anesthetists. Anesth Analg. 1999;89:920–4

25. Dexter EU, Dexter F, Masursky D, Kasprowicz KA. Prospective trial of thoracic and spine surgeons’ updating of their estimated case durations at the start of cases. Anesth Analg. 2010;110:1164–8

26. Dalton JE, Kurz A, Turan A, Mascha EJ, Sessler DI, Saager L. Development and validation of a risk quantification index for 30-day postoperative mortality and morbidity in noncardiac surgical patients. Anesthesiology. 2011;114:1336–44

27. Dexter F, Dexter EU, Ledolter J. Influence of procedure classification on process variability and parameter uncertainty of surgical case durations. Anesth Analg. 2010;110:1155–63

28. Barber JA, Thompson SG. Analysis of cost data in randomized trials: an application of the non-parametric bootstrap. Stat Med. 2000;19:3219–36

29. Strum DP, May JH, Vargas LG. Modeling the uncertainty of surgical procedure times: comparison of log-normal and normal models. Anesthesiology. 2000;92:1160–7

30. Todd MM, Hindman BJ, Clarke WR, Torner JCIntraoperative Hypothermia for Aneurysm Surgery Trial (IHAST) Investigators. . Mild intraoperative hypothermia during surgery for intracranial aneurysm. N Engl J Med. 2005;352:135–45

31. Dexter F, Ledolter J. Bayesian prediction bounds and comparisons of operating room times even for procedures with few or no historic data. Anesthesiology. 2005;103:1259–167

32. Spangler WE, Strum DP, Vargas LG, May JH. Estimating procedure times for surgeries by determining location parameters for the lognormal model. Health Care Manag Sci. 2004;7:97–104

33. Stepaniak PS, Heij C, Mannaerts GH, de Quelerij M, de Vries G. Modeling procedure and surgical times for current procedural terminology-anesthesia-surgeon combinations and evaluation in terms of case-duration prediction and operating room efficiency: a multicenter study. Anesth Analg. 2009;109:1232–45

34. He B, Dexter F, Macario A, Zenios S. The timing of staffing decisions in hospital operating room: incorporating workload heterogeneity into the newsvendor problem. Manuf Serv Op. 2012;14:99–114

35. Dexter F, Traub RD. Statistical method for predicting when patients should be ready on the day of surgery. Anesthesiology. 2000;93:1107–14

36. Bayman EO, Chaloner K, Cowles MK. Detecting qualitative interaction: a Bayesian approach. Stat Med. 2010;29:455–63

37. Mahaney KB, Todd MM, Bayman EO, Torner JCIHAST Investigators. . Acute postoperative neurological deterioration associated with surgery for ruptured intracranial aneurysm: incidence, predictors, and outcomes. J Neurosurg. 2012;116:1267–78

38. Finney DJ. On the distribution of a variate whose logarithm is normally distributed. J R Stat Soc Series B Stat Methodol. 1941;7:155–61

39. Zhou XH. Estimation of the log-normal mean. Stat Med. 1998;17:2251–64

40. Zellner A. Bayesian and non-Bayesian analysis of the log-normal distribution and log-normal regression. J Am Stat Assoc. 1971;66:327–30

41. Wachtel RE, Dexter F. Reducing tardiness from scheduled start times by making adjustments to the operating room schedule. Anesth Analg. 2009;108:1902–9

42. Kaushik R. Bleeding complications in laparoscopic cholecystectomy: Incidence, mechanisms, prevention and management. J Minim Access Surg. 2010;6:59–65

43. Zhou J, Dexter F. Method to assist in the scheduling of add-on surgical cases–upper prediction bounds for surgical case durations based on the log-normal distribution. Anesthesiology. 1998;89:1228–32

44. Clopper CJ, Pearson ES. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika. 1934;26:404–13

45. Hahn GJ, Meeker WQ Statistical intervals. A guide for practitioners. 1991 New York Wiley:82–4, 100–5

46. Hall P, Rieck A. Improving coverage accuracy of nonparametric prediction intervals. J R Stat Soc Ser B. 2001;63:717–25

47. Dexter F. Application of prediction levels to OR scheduling. AORN J. 1996;63:607–15

48. Li Y, Zhang S, Baugh RF, Huang JZ. Predicting surgical case durations using ill-conditioned CPT code matrix. IIE Transactions. 2010;42:121–35

49. Zhou J, Dexter F, Macario A, Lubarsky DA. Relying solely on historical surgical times to estimate accurately future surgical times is unlikely to reduce the average length of time cases finish late. J Clin Anesth. 1999;11:601–5

50. Dexter F, Traub RD, Fleisher LA, Rock P. What sample sizes are required for pooling surgical case durations among facilities to decrease the incidence of procedures with little historical data? Anesthesiology. 2002;96:1230–6

51. Dexter F, Macario A. What is the relative frequency of uncommon ambulatory surgery procedures performed in the United States with an anesthesia provider? Anesth Analg. 2000;90:1343–7

52. Silber JH, Rosenbaum PR, Zhang X, Even-Shoshan O. Estimating anesthesia and surgical procedure times from medicare anesthesia claims. Anesthesiology. 2007;106:346–55

53. Jones HE, Spiegelhalter DJ. The identification of “unusual” health-care providers from a hierarchical model. Am Stat. 2011;65:154–63

54. Silber JH, Rosenbaum PR, Zhang X, Even-Shoshan O. Influence of patient and hospital characteristics on anesthesia time in medicare patients undergoing general and orthopedic surgery. Anesthesiology. 2007;106:356–64

55. Dexter F, Davis M, Egger Halbeis CB, Halbeis CE, Marjamaa R, Marty J, McIntosh C, Nakata Y, Thenuwara KN, Sawa T, Vigoda M. Mean operating room times differ by 50% among hospitals in different countries for laparoscopic cholecystectomy and lung lobectomy. J Anesth. 2006;20:319–22

56. Higgins JP, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21:1539–58

57. Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JP, Clarke M, Devereaux PJ, Kleijnen J, Moher D. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. J Clin Epidemiol. 2009;62:e1–34

58. Dexter F, Bayman EO, Epstein RH. Statistical modeling of average and variability of time to extubation for meta-analysis comparing desflurane to sevoflurane. Anesth Analg. 2010;110:570–80

59. Wachtel RE, Dexter F, Epstein RH, Ledolter J. Meta-analysis of desflurane and propofol average times and variability in times to extubation and following commands. Can J Anaesth. 2011;58:714–24

60. Ledolter J, Dexter F. Analysis of interventions influencing or reducing patient waiting while stratifying by surgical procedure. Anesth Analg. 2011;112:950–7

61. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7:177–88

62. Jackson D, Bowden J, Baker R. How does the DerSimonian and Laird procedure for random effects meta-analysis compare with its more efficient but harder to compute counterparts? J Stat Plan and Inference. 2010;140:961–70

63. Spiegelhalter DJ. Handling over-dispersion of performance indicators. Qual Saf Health Care. 2005;14:347–51

64. Strum DP, Sampson AR, May JH, Vargas LG. Surgeon and type of anesthesia predict variability in surgical procedure times. Anesthesiology. 2000;92:1454–66

65. Wright IH, Kooperberg C, Bonar BA, Bashein G. Statistical modeling to predict elective surgery time. Comparison with a computer scheduling system and surgeon-provided estimates. Anesthesiology. 1996;85:1235–45

66. Chen L. 1995 for testing the mean of a skewed distribution was published in JASA 1995;90:767–72

67. Banik S, Kibria BMG. Comparison of some parametric and nonparametric type one sample confidence intervals for estimating the mean of a positively skewed distribution. Commun Stat Simulat. 2010;39:361–89

68. Krishnamoorthy K, Mathew T. Inferences on the means of lognormal distributions using generalized p-values and generalized confidence intervals. J Stat Plan Inference. 2003;115:103–121

69. Kuntz L, Vera A. Transfer pricing in hospitals and efficiency of physicians: the case of anesthesia services. Health Care Manage Rev. 2005;30:262–9

70. Schuster M, Standl T, Reissmann H, Kuntz L, Am Esch JS. Reduction of anesthesia process times after the introduction of an internal transfer pricing system for anesthesia services. Anesth Analg. 2005;101:187–94

71. Jones HE, Ohlssen DI, Spiegelhalter DJ. Use of the false discovery rate when comparing multiple health care providers. J Clin Epidemiol. 2008;61:232–40

72. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995;57:289–300

73. Dexter F, Dexter EU, Ledolter J. Importance of appropriately modeling procedure and duration in logistic regression studies of perioperative morbidity and mortality. Anesth Analg. 2011;113:1197–201

74. Allen BT, Anderson CB, Rubin BG, Thompson RW, Flye MW, Young-Beyer P, Frisella P, Sicard GA. The influence of anesthetic technique on perioperative complications after carotid endarterectomy. J Vasc Surg. 1994;19:834–42

75. Serletti JM, Higgins JP, Moran S, Orlando GS. Factors affecting outcome in free-tissue transfer in the elderly. Plast Reconstr Surg. 2000;106:66–70

76. Farwell DG, Reilly DF, Weymuller EA Jr, Greenberg DL, Staiger TO, Futran NA. Predictors of perioperative complications in head and neck patients. Arch Otolaryngol Head Neck Surg. 2002;128:505–11

77. Penel N, Mallet Y, Roussel-Delvallez M, Lefebvre JL, Yazdanpanah Y. Factors determining length of the postoperative hospital stay after major head and neck cancer surgery. Oral Oncol. 2008;44:555–62

78. Greenblatt DY, Rajamanickam V, Mell MW. Predictors of surgical site infection after open lower extremity revascularization. J Vasc Surg. 2011;54:433–9

79. Postoperative Visual Loss Study Group. . Risk factors associated with ischemic optic neuropathy after spinal fusion surgery. Anesthesiology. 2012;116:15–24

80. Glaser D, Hastings RH. An introduction to multilevel modeling for anesthesiologists. Anesth Analg. 2011;113:877–87

81. Muralidhar K, Zanakis SH. A simple minimum-bias percentile estimator of the location parameter for the gamma, weibull, and log-normal distributions. Decision Sciences. 1992;23:862–79

82. Shapiro SS, Wilk MB, Chen HJ. A comparative study of various tests for normality. J Am Stat Assoc. 1968;63:1343–72

83. Shapiro SS, Wilk MB. An analysis of variance test for normality. Biometrika. 1965;52:591–611

84. Lilliefors HW. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J Am Stat Assoc. 1967;62:399–402

85. Sprent P Applied Nonparametric Statistical Methods. 1989 New York Chapman and Hall:49–58

86. Johnson NL, Kotz S, Balakrishnan N Continuous Univariate Distributions. 1994;Vol 12nd ed. New York John Wiley:223