Use the Correct Statistical Test

Plummer, John L. PhD

Letter to the Editor: In Response

Department of Anaesthesia and Intensive Care, Flinders Medical Centre, Bedford Park, South Australia 5042.

To answer the last questions of Drs. Sneyd and Kestin first, the method of analysis used was recommended by statisticians of the sponsoring company, and all parties involved agreed before the commencement of the study that conclusions would be based on this analysis. Consequently, the study protocol specified that this method of analysis would be used. After completion of the study and checking of the data, the study sponsors provided the investigators with a list specifying which patients received treatment A and which received treatment B. Which treatment was active and which was placebo was not revealed at this stage. The analysis was then conducted by the investigators according to the protocol, and conclusions concerning differences between treatment A and treatment B were drawn before the code was fully broken.

We recognize that our data were not normally distributed; no real data are. However, it is well established that the normality assumption of analysis of variance (ANOVA) can be adequately met by many data sets under suitable conditions. Whether this is the case depends not just on the distributional shape of the data but also on the sample size. The test for treatment differences in the mixed-model ANOVA we used is mathematically equivalent to conducting a two-sample t-test on the total pain scores of each patient. As each patient had seven pain scores recorded and each score could range from 0 to 4, the total score for each patient is in the range of 0 to 28. The observed values ranged from 0 to 21. For each treatment group, we calculated the mean of these totals. The test statistic is the difference between group means divided by its standard error. The assumptions of continuity and normality apply not to the raw data, as suggested by Drs. Sneyd and Kestin, but to the difference between treatment means. Although the distribution of this difference will not be strictly continuous, we estimate that it has approximately 3000 possible values and so can be considered continuous for practical purposes. One of the fundamental theorems of statistics, the central limit theorem, assures us that as sample size increases, the distribution of the difference between means approaches a normal distribution ever more closely. The sample size required before the normality assumption is adequately met will depend on the distribution of the raw data: in our case, the total pain scores. For most distributions, sample sizes of 30 are adequate [1]; to give a specific example, ANOVA may be applied with confidence to visual analog pain scores with group sizes as small as 10 [2,3]. The available evidence therefore supports the validity of ANOVA in the present case, in which both group sizes were greater than 50.
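The argument above can be checked informally by simulation. The following is a minimal sketch, not the analysis used in the study: it draws seven pain scores (0 to 4) per patient from a skewed distribution, totals them, and counts how often a pooled two-sample t-test rejects at the 5% level when both groups come from the same distribution. The score probabilities, the group size of 50, the number of trials, and the independence of the seven scores within a patient are all assumptions made for illustration; the critical value 1.984 is the two-sided 5% point of the t distribution with 98 degrees of freedom.

```python
import random
import statistics

# Hypothetical, deliberately skewed probabilities for pain scores 0..4.
SCORE_WEIGHTS = [0.45, 0.25, 0.15, 0.10, 0.05]
# Two-sided 5% critical value of Student's t with 98 degrees of freedom.
T_CRIT_98 = 1.984


def total_pain_score(rng):
    """Sum of seven scores, each 0..4, giving a total between 0 and 28.

    Assumption: the seven scores are drawn independently, which real
    repeated measurements on one patient would not be.
    """
    return sum(rng.choices(range(5), weights=SCORE_WEIGHTS, k=7))


def pooled_t(a, b):
    """Difference between group means divided by its pooled standard error."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    std_err = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def type1_rate(n_per_group=50, n_trials=4000, seed=1):
    """Fraction of null simulations in which the t-test rejects at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        group_a = [total_pain_score(rng) for _ in range(n_per_group)]
        group_b = [total_pain_score(rng) for _ in range(n_per_group)]
        if abs(pooled_t(group_a, group_b)) > T_CRIT_98:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    print(f"Empirical Type 1 error rate: {type1_rate():.3f}")
```

If the normality assumption is adequately met for the difference between means, the printed rate should lie close to the nominal 0.05 despite the discrete, skewed raw scores.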

The supposed assumption referred to in the third paragraph of Drs. Sneyd and Kestin's letter arises from the field of psychology, not statistics. It was first suggested by Stevens [4], who raised it in relation to descriptive statistics, but it was later extrapolated by others to cover hypothesis tests also. However, the mathematical derivation of parametric statistical tests requires no assumptions concerning scale of measurement [5]. Empirical studies have revealed parametric tests to be robust over a wide range of scales of measurement [6]. In a classic example, Lord [7] showed that statistical tests are applied to numbers, not scales. The alleged limitation of certain statistical tests to certain scales of measurement has been described as "… a figment of the imagination of a number of psychologists" [5].

Drs. Sneyd and Kestin suggest an alternative analysis using Mann-Whitney U-tests or contingency tables (for which the usual analysis is the Mann-Whitney U-test [8]). As pain scores were measured at seven time points, this would involve seven tests. Not only does the use of multiple tests inflate the Type 1 error rate, but a likely outcome, in which only some of the tests yield a significant result, cannot be interpreted in a meaningful way. Such an outcome cannot be taken to mean that a treatment difference exists only at the time points associated with a significant result, as this would imply a treatment × time interaction, which Mann-Whitney U-tests cannot assess.
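The size of the inflation from seven tests can be illustrated with a short calculation. This is a sketch under a stated assumption: the first function treats the seven tests as independent, which tests at successive time points on the same patients are not, so its result is illustrative rather than exact. The Bonferroni bound in the second function holds whatever the dependence between tests. The function names and the per-test level of 0.05 are chosen here for illustration.

```python
def familywise_error_independent(alpha, n_tests):
    """Chance of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_bound(alpha, n_tests):
    """Upper bound on the familywise error rate under any dependence."""
    return min(1.0, n_tests * alpha)


if __name__ == "__main__":
    alpha, n_tests = 0.05, 7  # seven time points, each tested at the 5% level
    print(f"Independent tests: {familywise_error_independent(alpha, n_tests):.3f}")
    print(f"Bonferroni bound:  {bonferroni_bound(alpha, n_tests):.3f}")
```

With correlated tests the true familywise rate lies between the per-test level and the Bonferroni bound, so it exceeds the nominal 5% unless the seven tests are perfectly dependent.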

In conclusion, the assertions of Drs. Sneyd and Kestin are noteworthy for their total lack of supporting evidence. On the other hand, evidence published over a period of decades supports our view that the analysis applied to our data is appropriate. We are confident, therefore, that our conclusions concerning the efficacy of sustained-release ibuprofen as an adjunct to patient-controlled analgesia reflect a true treatment effect.

John L. Plummer, PhD

Department of Anaesthesia and Intensive Care; Flinders Medical Centre; Bedford Park, South Australia 5042


1. Boneau CA. The effects of violations of assumptions underlying the t-test. Psychol Bull 1960;57:49-64.
2. Plummer JL, Ilsley AH. Applicability of parametric tests in the statistical analysis of visual analogue pain scores. Analgesia 1996;2:19-29.
3. Dexter F, Chestnut DH. Analysis of statistical tests to compare visual analog scale measurements among groups. Anesthesiology 1995;82:896-902.
4. Stevens SS. On the theory of scales of measurement. Science 1946;103:677-80.
5. Gaito J. Measurement scales and statistics: resurgence of an old misconception. Psychol Bull 1980;87:564-7.
6. Baker BO, Hardyck CD, Petrinovich LF. Weak measurements vs strong statistics: an empirical critique of S.S. Stevens' proscriptions on statistics. Educ Psychol Meas 1966;26:291-309.
7. Lord FM. On the statistical treatment of football numbers. Am Psychol 1953;8:750-1.
8. Moses LE, Emerson JD, Hosseini H. Analysing data from ordered categories. N Engl J Med 1984;311:442-8.
© 1997 International Anesthesia Research Society