Editorials: Editorial

Wilcoxon-Mann-Whitney Test Used for Data That Are Not Normally Distributed

Dexter, Franklin MD, PhD

doi: 10.1213/ANE.0b013e31829ed28f

Many submitted statistical methods sections state something like: "normally distributed data were compared using the Student t test and other data were compared using the Wilcoxon-Mann-Whitney test." Nonparametric methods typically are based on sorting the data by magnitude and replacing the observed values with their ranks. However, applying rank-based methods to experimental data can be a poor choice because for some end points the analysis needs to be performed in the original units.1 Furthermore, the statement's implication that “nonparametric” means no assumptions is false.

For example, consider a hypothetical comparison #1 based on multiple previous articles published in Anesthesia & Analgesia.2–5 Two anesthesiologists have each worked at the same surgical suite for the past couple of years (i.e., 500 workdays). Each workday, each anesthesiologist supervised 2 or 3 operating rooms, where “supervise” is not referring to a regulatory billing term but the process of providing patient care. For each of the workdays, one of each of the anesthesiologists’ 2 or 3 first case starts was selected at random (i.e., N1 = N2 = 500). The minutes after 7:30 AM when the patient entered into the room were recorded. Such minutes are highly non-normally distributed.a The medians (25th–75th percentiles) were:

Comparison #1: Anesthesiologist 1A: 4 (2–8) minutes;

Anesthesiologist 1B: 5 (3–9) minutes.

Although the Wilcoxon-Mann-Whitney test 2-sided P = 0.0005, comparison #1 seems irrelevant because 1 minute is tiny. The use of ranks alone has lost perspective on the fact that the difference between anesthesiologists is only 1 minute. My impression is that this loss of perspective is commonplace when skewed data are observed, because the applied “solution” is to use a rank-based statistical method and the investigators do not know how to calculate confidence intervals corresponding to such P-values. Since the sample size is relatively large, the same limitation can apply using a parametric method if only the P-value is reported (e.g., P = 0.031). For costs, times, days, and other valuable resources, we care literally about the costs, the times, and the days, not ranks and/or P-values. We are not particularly concerned about whether there is a difference between anesthesiologists. What we care about principally is the (95%) confidence interval in the units of cost or time (e.g., the 0.1 to 2.1 minute difference between the respective means). Appropriate methods of analysis have been developed to analyze the means of skewed distributions without or with extra zeros, and these methods are used ubiquitously in actuarial statistics, finance, engineering statistics, reliability analysis, etc. We published a Statistical Grand Rounds on the topic last year6 and a variety of econometric methods can be used.2,3
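To make the point about reporting a confidence interval in minutes concrete, here is a minimal Python sketch. It is not the analysis used for this editorial. It assumes the zero-inflated log-normal parameters given in footnote a and uses a simple large-sample normal approximation with unequal variances. The function names and the seed are hypothetical. Methods designed for skewed data, such as those in the references cited above, would be preferable in practice. Because the data are simulated, the printed interval will not reproduce the 0.1 to 2.1 minutes quoted in the text.

```python
import math
import random
import statistics


def simulate_tardiness(n, p_zero, median, cv, rng):
    """Draw n values from a zero-inflated log-normal distribution.

    With probability p_zero the value is 0; otherwise it is log-normal
    with the given median and coefficient of variation (cv = SD / mean).
    """
    mu = math.log(median)                     # mean on the log scale
    sigma = math.sqrt(math.log(1 + cv ** 2))  # SD on the log scale, from the CV
    return [0.0 if rng.random() < p_zero else rng.lognormvariate(mu, sigma)
            for _ in range(n)]


def mean_difference_ci(x, y, z=1.96):
    """Large-sample (normal approximation) CI for mean(y) - mean(x)."""
    diff = statistics.fmean(y) - statistics.fmean(x)
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    return diff, diff - z * se, diff + z * se


rng = random.Random(2013)  # arbitrary seed, used only for repeatability
a = simulate_tardiness(500, 0.15, 5.0, 1.0, rng)  # Anesthesiologist 1A (footnote a)
b = simulate_tardiness(500, 0.03, 5.0, 1.0, rng)  # Anesthesiologist 1B (footnote a)
diff, lower, upper = mean_difference_ci(a, b)
print(f"difference in means: {diff:.1f} min (95% CI {lower:.1f} to {upper:.1f})")
```

The value of the sketch is the form of the result: an interval in minutes that a manager can judge for practical importance, rather than a P-value alone.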

The current issue of the journal contains another Statistical Grand Rounds. Divine et al.7 review several nonparametric tests (e.g., Wilcoxon-Mann-Whitney and Wilcoxon signed ranks). Consider the hypothetical tardiness values for first cases of the day of two different anesthesiologists, with medians (25th–75th percentiles) of their 500 observations:b

Comparison #2: Anesthesiologist 2A: 11 (4–21) minutes;

Anesthesiologist 2B: 11 (4–32) minutes.

Anesthesiologists 2A and 2B have the same median times, yet the Wilcoxon-Mann-Whitney test result is nearly identical to that of comparison #1, P = 0.0006. What is the interpretation? The distributions are highly skewed and the mean times differ substantially, 16 vs 38 minutes, respectively. Confidence intervals for the differences of the means exclude values close to zero.c Thus, for time data it is best to use statistical methods appropriate for skewed distributions and literally to compare the means.4,6,8 However, for nausea data as considered by Divine et al.,7 using the mean was not a solution, because the data (scores) literally were ranked. The Wilcoxon-Mann-Whitney test is appropriate there, but its results must be interpreted cautiously because, as the example shows, it does not literally compare medians.
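The way equal medians can coexist with very different means follows directly from the log-normal formulas. The sketch below is an illustration under the parameters stated in footnote b, and the function name is hypothetical. It computes the population median and mean of a zero-inflated log-normal distribution, using the fact that a log-normal with a given median and coefficient of variation CV has mean equal to median × √(1 + CV²).

```python
import math
from statistics import NormalDist


def zero_inflated_lognormal_summary(p_zero, median, cv):
    """Return (overall median, overall mean) of a zero-inflated log-normal.

    p_zero is the proportion of exact zeros; median and cv describe the
    log-normal distribution of the nonzero values.
    """
    if not 0 <= p_zero < 0.5:
        raise ValueError("this sketch assumes fewer than 50% zeros")
    sigma = math.sqrt(math.log(1 + cv ** 2))  # SD on the log scale
    # Mean of the nonzero part is median * exp(sigma^2 / 2) = median * sqrt(1 + cv^2).
    mean = (1 - p_zero) * median * math.sqrt(1 + cv ** 2)
    # The overall 50th percentile is the q-th quantile of the nonzero part.
    q = (0.5 - p_zero) / (1 - p_zero)
    overall_median = median * math.exp(sigma * NormalDist().inv_cdf(q))
    return overall_median, mean


med_a, mean_a = zero_inflated_lognormal_summary(0.20, 13.9, 1.0)  # 2A
med_b, mean_b = zero_inflated_lognormal_summary(0.00, 11.4, 3.0)  # 2B
print(f"2A: median {med_a:.1f}, mean {mean_a:.1f} minutes")
print(f"2B: median {med_b:.1f}, mean {mean_b:.1f} minutes")
```

Under these assumptions both population medians round to 11 minutes, while the population means are roughly 16 and 36 minutes. The values of 16 and 38 minutes reported in the text presumably reflect the particular sample of 500 observations, which can differ noticeably from the population mean for so heavy-tailed a distribution.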

An important focus of Divine et al.’s article7 was how to create a confidence interval for comparison #2 using rank-based methods. Use the WMWodds. Envision random selection (with replacement) of 1 tardiness value from Anesthesiologist 2A and 1 from Anesthesiologist 2B. Evaluate whether the tardiness of 2A is less than the tardiness of 2B. Repeat the process many times, so that every pairwise combination is tested. The WMWodds is related to the proportion p″ of pairwise selections for which the tardiness selected from 2A is briefer than the tardiness from 2B. I use the notation p″ to correspond to that of Divine et al.’s article. The WMWodds is an odds (i.e., p″/[1 − p″]). Using comparison #2,b Anesthesiologist 2B was longer for 56.2% of pairwise selections. The WMWodds = 0.562/(1 − 0.562) = 1.28. The odds that a randomly selected tardiness of Anesthesiologist 2B was larger than a randomly selected tardiness of Anesthesiologist 2A were 1.28 to 1. Divine et al.’s article7 describes how to calculate a corresponding 95% confidence intervald (e.g., WMWodds 1.11–1.49). However, as seen in this example, WMWodds specifies the odds that one of Anesthesiologist 2B’s cases will be more tardy than one of Anesthesiologist 2A’s cases, but not by how many minutes (i.e., knowing 11% larger odds of tardiness may be less informative than knowing of an increase of at least 8 minutesc). Comparison #1 has nearly the same P-value and WMWodds as comparison #2, even though the difference between anesthesiologists’ mean tardiness is smaller.
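The pairwise procedure described above can be sketched in a few lines of Python. This is an illustration, not the SAS program referred to in footnote d. The function name is hypothetical, and counting ties as one half is an assumption made here because exact zeros can produce tied pairs.

```python
def wmw_odds(x, y):
    """Return (p, odds) from all pairwise comparisons of x against y.

    p is the proportion of pairs (x_i, y_j) with x_i < y_j, counting ties
    as one half (an assumption of this sketch); odds = p / (1 - p).
    """
    wins = 0.0
    for xi in x:
        for yj in y:
            if xi < yj:
                wins += 1.0
            elif xi == yj:
                wins += 0.5
    p = wins / (len(x) * len(y))
    odds = float("inf") if p == 1.0 else p / (1.0 - p)
    return p, odds


# Toy data, not the tardiness values from the editorial.
p, odds = wmw_odds([1, 2, 3], [2, 3, 4])
print(f"p = {p:.3f}, WMWodds = {odds:.2f}")

# The arithmetic quoted in the text for comparison #2:
print(round(0.562 / (1 - 0.562), 2))  # 1.28
```

Note that the function returns nothing in minutes. The output is unchanged if every value in both groups is multiplied by 10, which is exactly the loss of perspective that the text describes.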

WMWodds is appropriate to calculate the odds for comparison #1. The Hodges-Lehmann confidence interval for the median difference could also be insightful. However, for comparison #2, only the former would be appropriate. The Hodges-Lehmann confidence interval makes assumptions other than that the data be at least ranked. The method is distribution free, but that does not imply it is assumption free. The method assumes the same shape (distribution) for both groups. In contrast, the WMWodds considers only the odds that the tardiness for Anesthesiologist 2A is less than the tardiness of Anesthesiologist 2B, and for that purpose asymmetry is not a concern when sample sizes are similar.
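For orientation, the Hodges-Lehmann point estimate for two independent samples is the median of all pairwise differences. The sketch below shows only that point estimate. The function name is hypothetical, and the corresponding confidence interval, which is built from ordered pairwise differences, is not shown.

```python
from statistics import median


def hodges_lehmann_shift(x, y):
    """Median of all pairwise differences y_j - x_i.

    Interpretable as a shift in location only if both groups have the
    same distributional shape, as noted in the text.
    """
    return median(yj - xi for xi in x for yj in y)


# Toy data in which y is exactly x shifted by 5 minutes.
x = [0, 2, 4, 9, 30]
y = [v + 5 for v in x]
print(hodges_lehmann_shift(x, y))  # 5
```

When one group is simply a shifted copy of the other, the estimate recovers the shift. When the shapes differ, as in comparison #2, the number is still computable but no longer has that interpretation.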

Divine et al.’s article7 comprehensively covers multiple topics that should be considered when using the Wilcoxon-Mann-Whitney, Wilcoxon signed-rank, and permutation tests. Authors of submissions to Anesthesia & Analgesia using these methods should study the article to understand the assumptions underlying nonparametric statistical analysis.


Dr. Franklin Dexter is the Statistical Editor and Section Editor for Economics, Education, and Policy for the Journal. This manuscript was handled by Dr. Steven L. Shafer, Editor-in-Chief, and Dr. Dexter was not involved in any way with the editorial process or decision.


Name: Franklin Dexter, MD, PhD.

Contribution: This author helped write the manuscript.

Attestation: Franklin Dexter has approved the final manuscript.

Conflicts of Interest: Franklin Dexter reports no conflicts of interest other than being an author of Ref. 5.


a The tardiness of first case of the day starts often can be fit to zero-inflated 2-parameter log-normal distributions. A value of 0 means that the patient enters the operating room exactly at 7:30 AM. Anesthesiologist 1A has 15% zeros, and the other 85% of values follow a log-normal distribution with median 5.0 minutes and coefficient of variation of 100%. Anesthesiologist 1B has 3% zero and the same nonzero (log-normal) values.

b Anesthesiologist 2A has 20% zeros and the other 80% of values follow a log-normal distribution with median 13.9 minutes and coefficient of variation of 100%. Anesthesiologist 2B has 0% zeros, with log-normal distribution of median 11.4 minutes and coefficient of variation of 300%.

c For example, even the 99% lower confidence limit excludes 8 minutes using Student t test with equal or unequal variances.

d Dr. Divine kindly sent a small SAS program for their7 footnote b (personal communication, May 30, 2013, Henry Ford Health System). The SAS program is available as a digital supplement to the article. As they recommended, I followed their approach of using logistic regression, with the binary dependent variable being 0 if Anesthesiologist 2A and 1 if Anesthesiologist 2B and the independent variable being minutes. I calculated the WMWodds directly from the ROC model area under the curve. I calculated the confidence interval from the SE of the area (i.e., the 95% Wald confidence limits).


1. Dexter F. Checklist for statistical topics in Anesthesia & Analgesia reviews. Anesth Analg. 2011;113:216–9
2. Dexter F, Epstein RH. Typical savings from each minute reduction in tardy first case of the day starts. Anesth Analg. 2009;108:1262–7
3. Wachtel RE, Dexter F. Influence of the operating room schedule on tardiness from scheduled start times. Anesth Analg. 2009;108:1889–901
4. Ernst C, Szczesny A, Soderstrom N, Siegmund F, Schleppers A. Success of commonly used operating room management tools in reducing tardiness of first case of the day starts: evidence from German hospitals. Anesth Analg. 2012;115:671–7
5. Wang J, Dexter F, Yang K. A behavioral study of daily mean turnover times and first case of the day start tardiness. Anesth Analg. 2013;116:1333–4
6. Ledolter J, Dexter F, Epstein RH. Analysis of variance of communication latencies in anesthesia: comparing means of multiple log-normal distributions. Anesth Analg. 2011;113:888–96
7. Divine G, Norton HJ, Hunt R, Dienemann J. A review of analysis and sample size calculation considerations for Wilcoxon tests. Anesth Analg. 2013;117:699–710
8. Zhou XH, Gao S, Hui SL. Methods for comparing the means of two independent log-normal samples. Biometrics. 1997;53:1129–35


© 2013 International Anesthesia Research Society