General Articles: Original Clinical Research Report

Monte Carlo Simulations Comparing Fisher Exact Test and Unequal Variances t Test for Analysis of Differences Between Groups in Brief Hospital Lengths of Stay

Dexter, Franklin MD, PhD*; Bayman, Emine O. PhD; Dexter, Elisabeth U. MD, FACS

doi: 10.1213/ANE.0000000000002428

Hospital length of stay (LOS) is a common and suitable secondary, economic end point for Perioperative Surgical Home and Enhanced Recovery type interventions.1 LOS can be benchmarked using publicly available national data.2 Each 1-day increase in hospital median LOS is associated with odds of readmission nationwide of 1.05 (P = .012) and odds of transfer to short-term care facility of 1.47 (P = .0008).3 LOS can be influenced by perioperative anesthesia care.4 The vast majority of anesthesiologists nationwide think that their clinical decisions influence LOS: 90% think “anesthesiologist intraoperative management should reduce overall LOS”; 81% think that “anesthesiologist coordination of all preoperative care should reduce overall LOS”; and 81% think “anesthesiologist coordination of all postoperative care should reduce overall LOS.”5

Surprisingly, there are limited data on appropriate inferential analyses for comparing brief hospital LOS between groups. Intensive care unit LOS is often short (eg, overnight for many surgical patients). Based on Moran and Solomon’s6 and Verburg et al’s7 analyses of intensive care unit LOS, treating LOS data as coming from a log-normal distribution can be appropriate. Student t test with unequal variances has performed well for distributions with similar shapes.8 On the other hand, merely analyzing the percentage of patients with LOS ≤1 day has the advantage of ensuring a negligible chance for large type I errors,9 because such a binary criterion results in binomially distributed data.10

In this article, we compare type I and II error rates for the analysis of the mean LOS versus the percentages of hospital LOS that are overnight. Both have strong economic interpretation. The mean LOS has an economic interpretation in terms of cost per day.1 However, brief reductions in LOS generally have proportionately less reduction in costs.1,11,12 The proportion of patients with LOS ≤1 day indicates the potential for surgery to be performed at for-profit ambulatory surgery centers and/or hospitals’ outpatient surgery departments. At such facilities, limited or regulated hospital beds are not constraints. Consequently, perioperative economics becomes that of operating room management (ie, surgeons and anesthesiologists are in control, not hospital executives and “nursing”).1 Although the statistical plan for clinical trials often is prespecified, rarely is the sample size chosen based on an economic end point. Thus, we consider the economic end point to be prespecified, but selected among meaningful options based on an already determined sample size.

We use pooled LOS data for thoracoscopic wedge resections and lung lobectomy, to match analgesia studies of thoracoscopic surgery.4


The 2013 Nationwide Readmissions Database includes nearly all relevant discharges from each of 21 participating states’ hospitals.13,14 The procedures studied in adults were International Classification of Diseases, Ninth Revision, Clinical Modification 32.20 “thoracoscopic excision of lesion or tissue of lung” (ie, thoracoscopic wedge resection) and 32.41 “thoracoscopic lobectomy of lung” (Supplemental Digital Content, Table A). There were 26 hospitals with at least 100 of these cases, totaling N = 5052 discharges (Figure; Supplemental Digital Content, Table A).

Figure. Number of discharges of each hospital length of stay in days. There are 5052 discharges among the 26 hospitals each with at least 100 discharges. The procedures included are thoracoscopic wedge resection and lobectomy of lung.

To simulate type I error rates for 2-group comparisons, we resampled with replacement from the LOS of the 5052 discharges (eg, to simulate 2 groups each with N = 100 we drew 200 cases with replacement from the 5052). As explained in the introduction, unequal variances t test (ie, Welch test or Welch-Satterthwaite test), and Fisher exact test, with the binary end point being LOS ≤1 day or not, were used. We used Fisher exact test rather than the more powerful9 χ2 test based on the expectation that the statistical analysis plan would be prespecified, but the expected number of observations meeting the binary end point would be known poorly a priori. We did not use the Poisson distribution for LOS, since the variance of LOS is severalfold greater than the mean (χ2 test P < 10−8).a
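The resampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the study: the function names are ours, the LOS pool `toy_population` is a made-up right-skewed stand-in for the 5052 discharges, and only the Fisher exact test on LOS ≤1 day is shown.

```python
import math
import random

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of each possible count in the top-left cell.
    probs = {k: math.comb(row1, k) * math.comb(row2, col1 - k) / denom
             for k in range(lo, hi + 1)}
    p_obs = probs[a]
    # Sum the probabilities of all tables no more likely than the observed one.
    return min(1.0, sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9)))

def simulate_type1_rate(population, n_per_group, n_sims, alpha=0.05, seed=0):
    """Fraction of null simulations in which Fisher exact test gives P < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Both groups are drawn from the same pool, so any rejection is a type I error.
        g1 = rng.choices(population, k=n_per_group)
        g2 = rng.choices(population, k=n_per_group)
        a = sum(los <= 1 for los in g1)
        c = sum(los <= 1 for los in g2)
        p = fisher_exact_two_sided(a, n_per_group - a, c, n_per_group - c)
        rejections += p < alpha
    return rejections / n_sims

# Hypothetical stand-in pool (NOT the study data): right-skewed LOS in days.
toy_population = ([0] * 2 + [1] * 15 + [2] * 25 + [3] * 22 + [4] * 13
                  + [5] * 8 + [7] * 8 + [10] * 5 + [15] * 2)
rate = simulate_type1_rate(toy_population, n_per_group=100, n_sims=2000)
print(f"Estimated type I error rate at alpha = 0.05: {rate:.4f}")
```

The unequal variances t test could be substituted inside the loop in the same way, for example with `scipy.stats.ttest_ind(g1, g2, equal_var=False)` if SciPy is available.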

Although we used N ≥ 100 per group, previous randomized or observational studies often use smaller sample sizes,4 and thus we also included resampling with replacement using sizes of N = 75, 50, and 25 per group. Simulations of N = 1000 per group were performed to examine the asymptotic performance of the 2 tests. The standard errors of the percentage type I error rates were calculated using the Clopper-Pearson method.
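For readers unfamiliar with the Clopper-Pearson method, the sketch below computes the exact binomial confidence limits for a simulated rejection rate by bisection on the binomial distribution function. It is an illustration only: the function names and the example counts (70 rejections in 2000 simulations) are hypothetical, and the direct summation is practical only for modest numbers of simulations.

```python
import math

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), by direct summation."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def clopper_pearson(x, n, conf=0.95):
    """Exact (Clopper-Pearson) confidence limits for x successes in n trials."""
    tail = (1 - conf) / 2

    def solve(f):
        # Bisection for the root of a function that decreases in p on [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit solves P(X >= x | p) = tail; upper limit solves P(X <= x | p) = tail.
    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p) - (1 - tail))
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) - tail)
    return lower, upper

# Hypothetical example: 70 rejections observed in 2000 null simulations.
low, high = clopper_pearson(70, 2000)
print(f"Observed rate {70 / 2000:.3%}, 95% limits {low:.3%} to {high:.3%}")
```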

To assess type II error rates, we made all (325) pairwise comparisons of the 26 hospitals using each hospital’s observed LOS data (eg, hospital A’s LOS were compared to hospital B’s). The calculated odds ratios were for Fisher exact test obtaining statistical significance (P < .05 or P < .01) while unequal variances t test did not. The 2-sided 95% confidence intervals and P values were exact (StatXact-11; Cytel, Inc, Cambridge, MA).

The Wilcoxon-Mann-Whitney test does not have a reliable interpretation as a test for differences of mean (or median) for skewed distributions such as LOS (ie, is not interpretable as an economic measure).15,16 However, readers interested in LOS as a clinical secondary end point may seek to infer whether a randomly sampled LOS from 1 group (eg, hospital) is likely to be less than a randomly selected LOS from the other group (ie, Wilcoxon-Mann-Whitney odds).16,17 For right-skewed distributions with many zeros (and 1’s for LOS), Wilcoxon rank sum test can have greater statistical power than Fisher exact test.9,18 Therefore, we also performed the analyses using the Wilcoxon rank sum test. The normal approximation for the U statistic included the correction for tied ranks.19
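The tie-corrected normal approximation can be sketched as follows. This is a minimal illustration, assuming midranks for tied values and no continuity correction; the function name and the two short LOS lists are hypothetical. The third returned value estimates the probability that a randomly sampled LOS from the first group is less than one from the second, counting ties as one-half.

```python
import math
from statistics import NormalDist

def wilcoxon_rank_sum(x, y):
    """Return (z, two-sided P, estimate of P(X < Y) + 0.5 * P(X = Y))."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(x + y)
    # Assign each distinct value its midrank and accumulate the tie term sum(t^3 - t).
    midrank, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        midrank[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        tie_term += t**3 - t
        i = j
    rank_sum_x = sum(midrank[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2      # pairs with x > y, ties counted as 1/2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    prob_x_less = 1 - u_x / (n1 * n2)
    if var_u == 0:                            # every observation identical
        return 0.0, 1.0, prob_x_less
    z = (u_x - mean_u) / math.sqrt(var_u)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p, prob_x_less

# Hypothetical LOS values (days) for two groups, with many ties at 1 and 2 days.
z, p, prob = wilcoxon_rank_sum([1, 1, 2, 2, 3, 5], [1, 2, 3, 4, 4, 9])
print(f"z = {z:.3f}, P = {p:.3f}, P(X < Y) + 0.5 P(X = Y) = {prob:.3f}")
```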

The log-normal distribution has been suitable for many continuous end points in anesthesia, such as case duration.20–29 We simulated the type I error rates for comparisons of mean LOS using Zhou et al’s30 maximum likelihood Z-score test. We used the same process as for the simulations of type I error rates for Fisher exact test and Welch method, with 1 million simulations of N = 25 per group. Previous studies found that when the data were not log-normally distributed, comparisons of log-normal means had poor type I error rates.20,31,32 We therefore assessed the normality of the log-transformed LOS data using Lilliefors tests, overall and by hospital.
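A Z-score comparison of two log-normal means can be sketched from first principles. The mean of a log-normal distribution is exp(μ + σ²/2), so equality of means is equivalent to equality of μ + σ²/2, which is estimated by the sample mean of the logarithms plus half their sample variance; the variance of that estimate is approximately s²/n + s⁴/(2(n − 1)). The code below is our reading of that approach and should be checked against Zhou et al30 before use. The function name and example data are hypothetical, and LOS of 0 days are set to 1 day so that the logarithm is defined.

```python
import math
from statistics import NormalDist, mean, variance

def lognormal_mean_z_test(x, y):
    """Z statistic and two-sided P for equality of two log-normal means."""
    def log_scale_estimate(values):
        logs = [math.log(max(v, 1)) for v in values]  # LOS of 0 treated as 1 day
        n, s2 = len(logs), variance(logs)             # sample variance (n - 1)
        estimate = mean(logs) + s2 / 2                # estimates mu + sigma^2 / 2
        var_estimate = s2 / n + s2**2 / (2 * (n - 1))
        return estimate, var_estimate

    est_x, var_x = log_scale_estimate(x)
    est_y, var_y = log_scale_estimate(y)
    z = (est_x - est_y) / math.sqrt(var_x + var_y)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical LOS values (days) for two groups.
z, p = lognormal_mean_z_test([1, 2, 2, 3, 5, 8], [0, 1, 1, 2, 3, 4])
print(f"z = {z:.3f}, two-sided P = {p:.3f}")
```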


The LOS data were right-skewed with mean ± standard deviation of 3.58 ± 3.09 days (Figure); two-thirds of the 5052 discharges were 0, 1, 2, or 3 days, and 76.6% (lower confidence limit, 75.6%) were 0–4 days. The LOS equaled 0 days (ie, logarithm undefined) for 1.43% of the observations. When those zero values were treated as LOS of 1 day and the Lilliefors test (of normality) was applied, the result was P < .0001 for the 2-parameter log-normal distribution. Thus, the LOS did not follow a log-normal distribution. The consequent type I error rates for differences in the log-normal means tested using the maximum likelihood Z-score test30 were larger than nominal: 11.97% ± 0.03% instead of 5.0%, and 2.76% ± 0.02% instead of 1.0%. When analyses were repeated by hospital, the Lilliefors tests gave P ≤ .0007 for all 26 hospitals.
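As a quick arithmetic check, the reported precision of these type I error rates is what one would expect from 1 million simulations. The sketch below assumes a simple binomial standard error, sqrt(p(1 − p)/n), rather than the Clopper-Pearson calculation named in the methods; the function name is ours.

```python
import math

def binomial_se_percent(rate_percent, n_sims):
    """Binomial standard error, in percentage points, of a simulated rate."""
    p = rate_percent / 100
    return 100 * math.sqrt(p * (1 - p) / n_sims)

for rate in (11.97, 2.76):
    se = binomial_se_percent(rate, 1_000_000)
    print(f"{rate}% -> standard error {se:.2f}%")
# prints:
# 11.97% -> standard error 0.03%
# 2.76% -> standard error 0.02%
```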

Fisher exact test was conservative (Table 1), as previously reviewed.9 Unequal variances t test reached the nominal type I error rate of 0.05 or 0.01 for N = 1000 per group, but was conservative for smaller N. Wilcoxon-Mann-Whitney type I error rates did not differ from the nominal rates, 0.05 or 0.01.

Table 1. Type I Error Rates (± Standard Error) for Different Sample Sizes Per Group

Table 2. Detection of Significant Differences in Length of Stay Among the 26 Hospitals (ie, All 325 Pairwise Comparisons) Using Each of the 3 Different Statistical Methods

There was heterogeneity in LOS among the 26 hospitals, as indicated by the interquartile range among hospitals for the percentage of LOS ≤1 day: 9.2%–24.0% of cases. Fisher exact test obtained P < .05 for 64.9% of all 325 pairwise comparisons between the 26 hospitals, and P < .01 for 52.3% of comparisons (Table 2). Unequal variances t test obtained P < .05 for 54.8% of the 325 comparisons and P < .01 for 44.3% of comparisons. Using McNemar test for pairwise comparisons, Fisher exact test obtained P < .05 for more comparisons than did unequal variances t test (P = .0015), and for more P < .01 comparisons too (P = .0049) (Table 2). The estimated odds ratio for obtaining P < .05 with Fisher exact test versus unequal variances t test was 1.94, and for obtaining P < .01 was 1.96 (Table 2). Fisher exact test and Wilcoxon-Mann-Whitney test had comparable statistical power in terms of differentiating LOS between hospitals (Table 2).
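The paired comparison of the two tests can be illustrated with an exact McNemar calculation, which depends only on the discordant pairs: comparisons significant by Fisher exact test alone (b) and by unequal variances t test alone (c). The counts b = 68 and c = 35 used below are not reported in the text; we back-calculated them as an assumption from the stated percentages of the 325 comparisons (a difference of about 33) and the odds ratio of 1.94. The function name is also ours.

```python
import math

def mcnemar_exact(b, c):
    """Exact two-sided McNemar P value and conditional odds ratio b / c."""
    n = b + c
    k = min(b, c)
    # Under the null hypothesis, each discordant pair falls either way with probability 1/2.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    p = min(1.0, 2 * tail)
    odds_ratio = b / c if c else math.inf
    return p, odds_ratio

n_pairs = math.comb(26, 2)  # 325 pairwise comparisons among 26 hospitals

# Back-calculated, hypothetical discordant counts for P < .05 (see lead-in).
p_value, odds_ratio = mcnemar_exact(68, 35)
print(f"{n_pairs} pairs, odds ratio {odds_ratio:.2f}, exact P = {p_value:.4f}")
```

With these assumed counts the odds ratio is 68/35, approximately 1.94, and the exact P value falls close to the reported P = .0015.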


We studied 2 approaches to compare brief LOS between groups for use when LOS is being included as a planned secondary, economic end point. Our findings show that if P < .05 for LOS after thoracoscopic lung resection surgery, then for both unequal variances t test and Fisher exact test, the finding of statistical significance is likely reliable. This was true even though the LOS were not normally distributed (Figure). Since multiple similar procedures are often combined in analgesia studies, and the studies designed with LOS as only a secondary end point, the LOS distributions commonly represent a mixture of procedures (Supplemental Digital Content, Table A). Fisher exact test was conservative (ie, type I error rates less than nominal)9 for the analysis of brief LOS, but had greater statistical power (ie, significant for more comparisons) to detect differences among hospitals than did unequal variances t test. This was likely due to the relatively large percentage of LOS ≤1 day (Figure). Substantive statistical power while treating the data as binary was found previously for another right-skewed distribution with few different sampled values: narcotic doses in the postanesthesia care unit.18

Although we used Fisher exact test, any of the exact statistical methods to compare 2 proportions could be used; their relative advantages have been extensively studied.9,33 Our study was not intended to represent a complete survey of the many options available. Even though LOS has been a common economic end point for decades,1 we failed to identify a single previous Monte Carlo simulation study comparing its analyses for randomized trials.

Our findings are useful because thoracic surgery is an area of substantial interest in acute postoperative analgesia, including the transition from acute to chronic pain. General thoracic surgery hospital costs are principally attributable to operating room time and hospital LOS.34 Randomized trials with LOS as a primary or secondary end point are warranted and expected to continue.1 On the other hand, our results may be limited to thoracic procedures. The distributions of LOS are likely to differ little for other procedures with brief LOS, because LOS, being limited to units of days, cannot practically differ substantially. Thus, when LOS is considered as a secondary end point, applying our results to other procedures would likely be reasonable. However, previous simulation studies have found heterogeneity of results for different statistical methods depending on the probability distributions.35 Consequently, we cannot recommend that our results be assumed to apply to other procedures when used as a primary end point, particularly when patients will be randomly assigned to groups. Studies of relatively discrete distributions have not found the unequal variances t test (Welch method) to be as conservative as for our LOS data.18,36,37 Our findings provide a methodology that others can repeat for different procedures. Since both of our tests that compared means between groups performed less well than treating the data as binary, we recommend paying particular attention to the probability distributions of LOS among included patients. Realistically, the distributions will often represent mixture distributions. The vast majority of surgical procedures are too uncommon to be studied individually.21,38–44 Furthermore, studies of individual surgical procedures often do not accurately reflect the heterogeneous populations to which the results are applied.

In conclusion, for studies with LOS to be used as a secondary end point of economic interest, we considered the planned statistical analysis of the percentage of patients suitable for ambulatory surgery (ie, binary outcome of hospital LOS 0 or 1 midnight). Our results show that there need not be a loss of statistical power to compare groups with the Fisher exact test versus Welch method or the Wilcoxon rank sum test.


Name: Franklin Dexter, MD, PhD.

Contribution: This author helped design the study, obtain the data, perform the analysis, and write the manuscript.

Name: Emine O. Bayman, PhD.

Contribution: This author helped design the study.

Name: Elisabeth U. Dexter, MD, FACS.

Contribution: This author helped interpret the data and write the manuscript.

This manuscript was handled by: Edward J. Mascha, PhD.

Acting EIC on final acceptance: Thomas R. Vetter, MD, MPH.


a The negative binomial distribution fit by χ2 test was P < 10−8. When analyses were repeated by hospital, the negative binomial distribution fit by χ2 test was P ≤ .0007 for 20 of the 26 hospitals and P < .05 for all 26 hospitals.


1. Dexter F, Wachtel RE. Strategies for net cost reductions with the expanded role and expertise of anesthesiologists in the perioperative surgical home. Anesth Analg. 2014;118:1062–1071.
2. Agency for Healthcare Research and Quality. Healthcare Cost and Utilization Project (HCUP). Available at: Accessed June 10, 2017.
3. Dexter F, Epstein RH, Dexter EU, Lubarsky DA, Sun EC. Hospitals with briefer than average lengths of stays for common surgical procedures do not have greater odds of either re-admission or use of short-term care facilities. Anaesth Intensive Care. 2017;45:210–219.
4. Bayman EO, Parekh KR, Keech J, Selte A, Brennan TJ. A prospective study of chronic pain after thoracic surgery. Anesthesiology. 2017;126:938–951.
5. Raphael DR, Cannesson M, Rinehart J, Kain ZN. Health care costs and the perioperative surgical home: a survey study. Anesth Analg. 2015;121:1344–1349.
6. Moran JL, Solomon PJ; ANZICS Centre for Outcome and Resource Evaluation (CORE) of the Australian and New Zealand Intensive Care Society (ANZICS). A review of statistical estimators for risk-adjusted length of stay: analysis of the Australian and New Zealand Intensive Care Adult Patient Data-Base, 2008-2009. BMC Med Res Methodol. 2012;12:68.
7. Verburg IW, de Keizer NF, de Jonge E, Peek N. Comparison of regression methods for modeling intensive care length of stay. PLoS One. 2014;9:e109684.
8. de Gil PR, Chen Y, Kim ES, et al. Parametric test for two population means: an empirical comparison of type I error control and statistical power. JSM Proceedings, Social Statistics Section, Montreal, Canada. 2013:2388–2402.
9. D’Agostino RB, Chase W, Belanger A. The appropriateness of some common procedures for testing the equality of two independent binomial populations. Am Stat. 1988;42:198–202.
10. Binomial proportion confidence interval. Available at: Accessed June 10, 2017.
11. Dexter F, Macario A, Dexter EU. Computer simulation of changes in nursing productivity from early tracheal extubation of coronary artery bypass graft patients. J Clin Anesth. 1998;10:593–598.
12. Taheri PA, Butz DA, Greenfield LJ. Length of stay has minimal impact on the cost of hospital admission. J Am Coll Surg. 2000;191:123–130.
13. Agency for Healthcare Research and Quality Healthcare Cost and Utilization Project. Introduction to the HCUP Nationwide Readmissions Database (NRD), 2013. Available at: Accessed June 10, 2017.
14. Agency for Healthcare Research and Quality Healthcare Cost and Utilization Project. HCUP supplemental variables for revisit analyses. Available at: Accessed June 10, 2017.
15. Fagerland MW, Sandvik L. The Wilcoxon-Mann-Whitney test under scrutiny. Stat Med. 2009;28:1487–1497.
16. Dexter F. Wilcoxon-Mann-Whitney test used for data that are not normally distributed. Anesth Analg. 2013;117:537–538.
17. Divine G, Norton HJ, Hunt R, Dienemann J. A review of analysis and sample size calculation considerations for Wilcoxon tests. Anesth Analg. 2013;117:699–710.
18. Dexter F. Analysis of statistical tests to compare doses of analgesics among groups. Anesthesiology. 1994;81:610–615.
19. Mann-Whitney U test normal approximation and tie correction. Available at: Accessed June 10, 2017.
20. Ledolter J, Dexter F, Epstein RH. Analysis of variance of communication latencies in anesthesia: comparing means of multiple log-normal distributions. Anesth Analg. 2011;113:888–896.
21. Dexter F, Epstein RH, Bayman EO, Ledolter J. Estimating surgical case durations and making comparisons among facilities: identifying facilities with lower anesthesia professional fees. Anesth Analg. 2013;116:1103–1115.
22. Strum DP, May JH, Vargas LG. Modeling the uncertainty of surgical procedure times: comparison of log-normal and normal models. Anesthesiology. 2000;92:1160–1167.
23. Eijkemans MJ, van Houdenhoven M, Nguyen T, Boersma E, Steyerberg EW, Kazemier G. Predicting the unpredictable: a new prediction model for operating room times using individual characteristics and the surgeon’s estimate. Anesthesiology. 2010;112:41–49.
24. Dexter F, Ledolter J. Bayesian prediction bounds and comparisons of operating room times even for procedures with few or no historic data. Anesthesiology. 2005;103:1259–1267.
25. Spangler WE, Strum DP, Vargas LG, May JH. Estimating procedure times for surgeries by determining location parameters for the lognormal model. Health Care Manag Sci. 2004;7:97–104.
26. Stepaniak PS, Heij C, Mannaerts GH, de Quelerij M, de Vries G. Modeling procedure and surgical times for current procedural terminology-anesthesia-surgeon combinations and evaluation in terms of case-duration prediction and operating room efficiency: a multicenter study. Anesth Analg. 2009;109:1232–1235.
27. He B, Dexter F, Macario A, Zenios S. The timing of staffing decisions in hospital operating room: incorporating workload heterogeneity into the newsvendor problem. Manuf Serv Op. 2012;14:99–114.
28. Dexter F, Traub RD. Statistical method for predicting when patients should be ready on the day of surgery. Anesthesiology. 2000;93:1107–1114.
29. Tiwari V, Dexter F, Rothman BS, Ehrenfeld JM, Epstein RH. Explanation for the near-constant mean time remaining in surgical cases exceeding their estimated duration, necessary for appropriate display on electronic white boards. Anesth Analg. 2013;117:487–493.
30. Zhou XH, Gao S, Hui SL. Methods for comparing the means of two independent log-normal samples. Biometrics. 1997;53:1129–1135.
31. Briggs A, Nixon R, Dixon S, Thompson S. Parametric modelling of cost data: some simulation evidence. Health Econ. 2005;14:421–428.
32. Weng SF, Myers L. Performance of confidence interval tests for the ratio of two lognormal means applied to Weibull and gamma distribution data. J Stat Comput Simul. 2012;82:1035–1054.
33. Newcombe RG. Interval estimation for the difference between independent proportions: comparison of eleven methods. Stat Med. 1998;17:873–890.
34. Khullar OV, Fernandez FG, Perez S, et al. Time is money: hospital costs associated with video-assisted thoracoscopic surgery lobectomies. Ann Thorac Surg. 2016;102:940–947.
35. Fagerland MW, Sandvik L. Performance of five two-sample location tests for skewed distributions with unequal variances. Contemp Clin Trials. 2009;30:490–496.
36. Fagerland MW, Sandvik L, Mowinckel P. Parametric methods outperformed non-parametric methods in comparisons of discrete numerical variables. BMC Med Res Methodol. 2011;11:44.
37. Wilcox RR, Cheryl V, Clark F, Carlson M. Comparing discrete distributions when the sample space is small. Universitas Psychologica. 2013;12:1583–1595.
38. Dexter F, Macario A. What is the relative frequency of uncommon ambulatory surgery procedures performed in the United States with an anesthesia provider? Anesth Analg. 2000;90:1343–1347.
39. Dexter F, Traub RD, Fleisher LA, Rock P. What sample sizes are required for pooling surgical case durations among facilities to decrease the incidence of procedures with little historical data? Anesthesiology. 2002;96:1230–1236.
40. Dexter F, Dexter EU, Ledolter J. Influence of procedure classification on process variability and parameter uncertainty of surgical case durations. Anesth Analg. 2010;110:1155–1163.
41. O’Neill L, Dexter F, Zhang N. The risks to patient privacy from publishing data from clinical anesthesia studies. Anesth Analg. 2016;122:2016–2026.
42. Luangkesorn KL, Eren-Dogu ZF. Markov chain Monte Carlo methods for estimating surgery duration. J Stat Comput Simul. 2016;86:262–278.
43. O’Neill L, Dexter F, Park SH, Epstein RH. Uncommon combinations of ICD10-PCS or ICD-9-CM operative procedure codes account for most inpatient surgery at half of Texas hospitals. J Clin Anesth. 2017;41:65–70.
44. O’Neill L, Dexter F, Park SH, Epstein RH. Discharges with surgical procedures performed less often than once per month per hospital account for two-thirds of hospital costs of inpatient surgery. J Clin Anesth. 2017;41:99–103.


Copyright © 2017 International Anesthesia Research Society