Randomized trials are an important method for determining the clinical effectiveness of psychosomatic interventions. The end point of such studies is often a continuous variable such as blood pressure, sleep latency, or depression score. Perhaps the technique most widely used to analyze such data in the psychosomatic literature is analysis of variance (ANOVA). In this article, I will argue that, despite its wide use, ANOVA should be generally avoided as a method for analyzing data from randomized trials. This is not because there is anything suspect about the mathematics underlying ANOVA; indeed, alternative statistical approaches such as linear regression are mathematically related to ANOVA and will generally lead to similar results. Rather, I advise against ANOVA because it is inherently easy to misuse and is often applied inappropriately. To help illustrate my argument, I will use a hypothetical trial to determine whether psychotherapy reduces anxiety more than standard psychosocial care in patients recently diagnosed with colorectal cancer. Many of the points I raise are applicable to the unsophisticated use of other statistical methods, including t tests and nonparametric methods such as the Mann-Whitney. ANOVA remains unique, however, in terms of prevalence and its inappropriate application to studies with multiple follow-up times or more than two groups.
STATISTICAL ANALYSIS OF RANDOMIZED TRIALS
The results of randomized trials are often viewed in black-and-white terms: “Did the treatment work, yes or no?” To answer this question, we need a p value for the group comparison, typically concluding that the treatment “works” if p is less than .05. Many statisticians and clinical trialists argue that, rather than a p value, the key result is the size of the difference between groups (the “effect size”). For example, many women with breast cancer are treated with adjuvant chemotherapy after surgery in the hope of preventing a recurrence. This takes place over many months and is associated with unpleasant side effects. Patients generally want to know not just whether chemotherapy helps, but how much it helps. A typical question a patient might ask would be: “If 100 women were given adjuvant chemotherapy, how many of them would have a recurrence compared with 100 women who had only surgery?” This is the equivalent of asking for the absolute risk difference, a well-known method of expressing the effect size. Other methods of reporting effect size include the relative risk and, for continuous end points, the difference between means. Patients ask such questions because they may not want to put themselves through the time, effort, and unpleasantness of treatment if the degree of benefit is small. Indeed, although it has been convincingly shown that adjuvant chemotherapy does reduce recurrence rates, surveys of patients have found that the risk reduction is smaller than some women would require to opt for treatment (1).
The CONSORT group, which issues recommendations on the reporting of randomized trials, has therefore stated that the results of a trial should be stated as “a summary of results for each group, and the estimated effect size and its precision (e.g., a 95% confidence interval).” They go on to state that “although p-values may be provided … results should not be reported solely as p-values” (2).
ANOVA produces p values by a method that does not require calculation of the difference between groups. Accordingly, the default setting for ANOVA results in most statistical software packages such as SPSS, SAS, and STATA is to give only F, p, and the degrees of freedom. Perhaps as a result, these are the values that are most typically reported in randomized trials of psychosomatic interventions. A typical example, selected at random from a paper in Health Psychology, is: “when the two interventions were compared, the [cognitive–behavioral] participants had significantly better scores on the … POMS Vigor subscale F(1,46) = 6.60, p = .014.” This gives us no idea by how much cognitive behavior therapy improves vigor and therefore whether it is worth receiving treatment.
Reporting of F and p values in isolation, without an estimate of effect size, is particularly problematic when differences between groups are not statistically significant. To illustrate, I give three possible “negative” results of the hypothetical psychotherapy trial in Table 1. Following the CONSORT recommendations, I give the results in each group, the effect size (in terms of the difference between means), and a 95% confidence interval, calculated from the standard errors (see Altman for an introduction to calculating a confidence interval for the difference between means (3)). The p value was obtained by ANOVA. For the sake of simplicity, I assume that anxiety is measured on a 0 to 10 scale, with higher scores indicating worse symptoms.
In scenario 1, treatment was considerably better than control, reducing anxiety scores by approximately 20%. In this case, the lack of statistical significance is an indication of a trial with insufficient power. In scenario 2, the treatment was similar to control, but the 95% confidence interval includes differences of clinical relevance; it could be that anxiety scores in the psychotherapy group are as much as 1.8 points (30%) lower. This would lead us to conclude that although there was no evidence of a treatment effect, it remains possible that treatment is of benefit. Conversely, in scenario 3, the 95% confidence interval includes only clinically trivial differences between groups; at best, psychotherapy could reduce anxiety scores by 0.6 points, or 10%. In this case, we might conclude that psychotherapy is unlikely to help. Reporting estimates and confidence intervals along with the p value allows us to draw appropriately varying conclusions from the three clinical trial results. Reporting only a p value from an ANOVA leads to the same conclusion from different findings: a failure to reject the null hypothesis.
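The kind of estimate recommended above can be sketched in a few lines. The example below is a minimal illustration, not the method used to produce Table 1: the anxiety scores are hypothetical, the function name `mean_difference_ci` is invented for this sketch, and the interval uses a normal quantile (about 1.96) rather than a t quantile, which is an approximation that becomes slightly too narrow in small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_difference_ci(treated, control, level=0.95):
    """Difference between means (treated - control) with an approximate CI."""
    diff = mean(treated) - mean(control)
    # Standard error of the difference, built from each group's standard error.
    se = sqrt(stdev(treated) ** 2 / len(treated)
              + stdev(control) ** 2 / len(control))
    # Normal quantile (assumption: large-sample approximation to the t quantile).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical anxiety scores on the 0 to 10 scale (not the data behind Table 1).
psychotherapy = [4.0, 5.5, 6.0, 3.5, 5.0, 6.5, 4.5, 5.0]
standard_care = [6.0, 5.5, 7.0, 6.5, 5.0, 7.5, 6.0, 6.5]

diff, lower, upper = mean_difference_ci(psychotherapy, standard_care)
print(f"difference = {diff:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Reported this way, the same output shows the size of the effect and how precisely it has been estimated, which an F and p value alone do not.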
TRIALS WITH MORE THAN TWO GROUPS
ANOVA is frequently justified on the grounds that a trial incorporates more than two groups. Clearly, ANOVA has an advantage compared with methods such as the t test or the Mann-Whitney U test, which are inherently unable to compare more than two groups of observations. However, the question asked by ANOVA, “Is there an overall difference between the three groups?,” is often scientifically uninteresting and may ignore the study design. Let us imagine that our initial trial of psychotherapy for anxiety in patients with cancer indicated a positive effect of treatment, and that the size of benefit was both statistically and clinically significant. Nonetheless, the trial is criticized: was the effect of the psychotherapy the result of the therapy itself or just the time and attention of the therapist? Because the psychotherapy treatment developed for these patients involved a relaxation technique, would it be possible to obtain similar benefit by having a nurse or occupational therapist run a group relaxation class, something that would be considerably less expensive?
The researchers decided to run a second trial, this time randomizing patients to receive psychotherapy, a group relaxation class, or standard care alone. In our first scenario, the mean anxiety scores in the three groups were 5.5, 6.2, and 6.1, respectively. This would seem to indicate that psychotherapy, but not relaxation alone, is of benefit, and therefore that the effects of psychotherapy are not merely the result of attention from a practitioner or a relaxation component. An ANOVA of some data with these means gave F(2,117) = 2.4, p = .096, leading us to conclude that there is no overall difference between groups. This appears to contradict the previous trial result, which showed an effect of psychotherapy. Moreover, the conclusion has little connection to the study design, which concerns the degree to which relaxation and attention contribute to the effects of psychotherapy. An alternative approach is to conduct a multivariable linear regression using predictor variables that reflect the questions asked by the researchers. One variable could be called “contact” and is coded 1 for both the relaxation and psychotherapy groups (who both get additional care) and 0 for controls. The second variable could be called “therapy” and is coded 1 for patients in the psychotherapy group and 0 otherwise. When this regression is run on the dataset for the first scenario, the coefficients for these variables are −0.1 (95% confidence interval [CI], −0.8 to 0.5; p = .6) and 0.7 (95% CI, 0.1 to 1.3; p = .035), leading us to the more intuitive conclusion that therapy is of benefit and that any effect of contact and relaxation is small. Note that the coefficients are equivalent, respectively, to the difference between groups for relaxation versus control and therapy versus relaxation, expressed here as reductions in anxiety score. They would be reported along with the means and standard deviations of the baseline and follow-up anxiety scores for each group, as was done in Table 1.
In a second scenario, the anxiety scores for psychotherapy, relaxation, and control are 5.3, 5.3, and 6.3. ANOVA gives F(2,117) = 6.6, p = .002; regression gives coefficients for “contact” and “therapy” as 1.0 (95% CI, 0.4 to 1.6; p = .002) and 0.0 (95% CI, −0.5 to 0.5; p = .8). Again, the conclusion from regression (that psychotherapy is effective, but that this is the result of relaxation and contact with a health professional) is more useful and relates more strongly to the study questions than the conclusion of the ANOVA, which is only that relaxation classes, psychotherapy, and usual care are not equivalent.
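The “contact” and “therapy” coding described above can be illustrated with a small least-squares fit. This is a sketch under stated assumptions: the scores are hypothetical (five patients per arm, not the trial dataset), the helper functions `solve` and `ols` are written here for illustration using only the Python standard library, and no confidence intervals or p values are computed.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical follow-up anxiety scores (assumption: illustrative values only).
groups = {
    "psychotherapy": [5.0, 5.5, 6.0, 4.5, 5.5],
    "relaxation": [6.0, 6.5, 5.5, 7.0, 6.0],
    "control": [6.5, 6.0, 5.5, 6.5, 6.0],
}
# (contact, therapy) indicator coding as described in the text.
coding = {"psychotherapy": (1, 1), "relaxation": (1, 0), "control": (0, 0)}

rows, y = [], []
for name, scores in groups.items():
    contact, therapy = coding[name]
    for score in scores:
        rows.append([1, contact, therapy])  # leading 1 is the intercept
        y.append(score)

intercept, b_contact, b_therapy = ols(rows, y)
print(f"contact: {b_contact:.2f}, therapy: {b_therapy:.2f}")
```

With this coding, the “contact” coefficient is the relaxation mean minus the control mean, and the “therapy” coefficient is the psychotherapy mean minus the relaxation mean. A negative coefficient therefore indicates lower anxiety; reporting the same quantity as a reduction simply reverses the sign.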
A multivariable regression is not the only alternative to ANOVA. One point of view is that we should not ask whether psychotherapy is superior to relaxation alone unless it is known that treatment, of whatever form, is better than control. Accordingly, we would combine the results from the psychotherapy and relaxation groups and compare with controls using a t test. If there was a significant difference, we would then compare psychotherapy and relaxation. Like regression, and in contrast to ANOVA, this method provides answers to specific questions of clinical relevance.
TRIALS WITH REPEATED MEASURES
Many studies measure an end point on several occasions. The most common and straightforward case is one in which an end point is measured at baseline. In the cancer study, for example, it would be sensible to measure anxiety before randomization. The results of the study could then be interpreted as the change in anxiety resulting from treatment. In theory, ANOVA can be an appropriate tool for analyzing such studies. In previous publications (4,5), I have repeated the call made by many previous statisticians (6) that randomized trials should be analyzed by linear regression with baseline score of the end point as an independent variable. This technique is commonly described as “analysis of covariance” (ANCOVA), and it is a standard ANOVA option in many statistical packages.
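For a two-arm trial with one baseline measurement, the regression described above has a simple closed form, sketched below. The data are hypothetical and the function name `ancova_group_effect` is an assumption made for this illustration. The calculation relies on the standard result that, in a regression of follow-up score on group and baseline, the baseline slope equals the pooled within-group slope; standard errors are omitted.

```python
from statistics import mean


def ancova_group_effect(baseline, follow_up, group):
    """Coefficient for group (1 vs 0) in the regression
    follow_up ~ intercept + group + baseline, plus the baseline slope."""
    def within(g):
        xs = [x for x, gi in zip(baseline, group) if gi == g]
        ys = [v for v, gi in zip(follow_up, group) if gi == g]
        mx, my = mean(xs), mean(ys)
        sxy = sum((x - mx) * (v - my) for x, v in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        return mx, my, sxy, sxx

    mx0, my0, sxy0, sxx0 = within(0)
    mx1, my1, sxy1, sxx1 = within(1)
    # Pooled within-group slope of follow-up score on baseline score.
    slope = (sxy0 + sxy1) / (sxx0 + sxx1)
    # Difference in follow-up means, corrected for any baseline imbalance.
    adjusted = (my1 - my0) - slope * (mx1 - mx0)
    return adjusted, slope


# Hypothetical anxiety scores; group 1 = psychotherapy, 0 = standard care.
baseline = [6.0, 7.0, 5.0, 8.0, 6.5, 7.5, 5.5, 6.0]
follow_up = [4.5, 5.5, 4.0, 6.5, 6.0, 7.0, 5.5, 5.5]
group = [1, 1, 1, 1, 0, 0, 0, 0]

effect, slope = ancova_group_effect(baseline, follow_up, group)
print(f"adjusted difference = {effect:.2f}, baseline slope = {slope:.2f}")
```

The adjusted difference is the quantity of interest: the estimated effect of treatment on follow-up score for patients with the same baseline score.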
Unfortunately, it is much more common in the literature to see what I consider to be an inappropriate use of ANOVA. When setting up an ANOVA, a researcher specifies the different effects of interest. In the case of our cancer study, in which we measure anxiety before and after a course of psychotherapy treatment or usual care, these effects are “time” (Do scores change between baseline and follow-up after treatment?) and “group” (Do scores depend on whether a patient is assigned to psychotherapy or control?). The problem is that these effects are uninteresting and irrelevant to the analysis of the randomized trial. We are not concerned with whether scores will change from baseline (it seems likely that they would) or whether overall anxiety scores, including pretreatment score, differ between groups (at baseline, they should be similar because of randomization). What we are interested in, and why we conducted the randomized trial, is whether the change over time is different between groups. This is technically known as the “time by treatment interaction,” an unwieldy term that, in my view, reduces the interpretability of clinical trial results. Take the following example, based on a trial reported in the Archives of General Psychiatry: “There was no significant group effect F(1,18) = 1.2, p = .3. However, there was a significant time effect F(1,18) = 48, p < .001 and a significant time by treatment interaction F(1,18) = 11, p = .003.” It is not immediately obvious to the nonstatistical reader that the treatment was effective; indeed, the statement “there was no significant group effect” might lead one to conclude exactly the opposite. Here the “group effect” includes the pretreatment mean and so is of little interest.
An additional complication is one in which the trial incorporates more than one measure after randomization. Imagine that our colorectal cancer study involved an additional end point at 6 months, after the completion of adjuvant chemotherapy. Again, I present the possible results in terms of three scenarios shown in Figure 1. The simplest approach would be to report the results of the 6-week and 6-month follow up separately, particularly because they appear to address separate questions: Can psychotherapy alleviate distress in the period immediately after diagnosis? Can psychotherapy lead to persisting improvements in a patient’s psychologic response to cancer? The approach would then be to undertake a linear regression of the 6-week score separately from the 6-month score using baseline score as a predictor variable for both, and estimate the coefficient for group (equivalent to the difference between groups). This approach would often be described as ANCOVA.
An alternative would be to throw all assessments together into a single, repeated-measures ANOVA. One problem would be the interpretation of the time-by-group interaction. In my view, what clinicians and patients are interested in are the posttreatment results, in which a time-by-group interaction means that the short- and long-term effects of psychotherapy differ. This is the case in scenario 1, in which the effects of psychotherapy do not persist, and in scenario 3, in which differences between groups become larger over time. However, in traditional ANOVA models, the time-by-group interaction includes the baseline. Hence, we might well see a time-by-treatment interaction for scenarios 2 and 3, but not 1. The appropriate analysis is an extension of linear regression known as “longitudinal mixed models,” “latent growth curve modeling,” or “generalized linear modeling.” An excellent description has been given in a prior paper in this series (7). Such models allow clear specification of the particular periods of time over which investigators want to examine whether the effects of treatment differ.
It should also be noted that in many cases in which repeated measures are taken, there is no need to examine time-by-treatment interactions. For example, in a randomized trial of two different surgical techniques, we might measure pain once or twice a day during the patient’s hospital stay. Such a trial might best be analyzed by first calculating the mean of each patient’s postoperative pain scores and then comparing these means between groups. A more complex analysis involving time-by-treatment interaction is not warranted because it is unlikely that the difference between groups will change importantly over time.
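The summary-measure approach described above amounts to two steps: reduce each patient's repeated scores to one number, then compare those numbers between groups. The sketch below uses hypothetical pain scores; the variable names and the helper `patient_means` are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical pain scores (0 to 10); one list of repeated measurements per
# patient. Patients may have different numbers of measurements.
technique_a = [[6, 5, 4, 3], [7, 6, 5], [5, 4, 4, 2]]
technique_b = [[7, 7, 6, 5], [8, 6, 6], [6, 6, 5, 4]]


def patient_means(patients):
    """Reduce each patient's repeated scores to a single summary value."""
    return [mean(scores) for scores in patients]


means_a = patient_means(technique_a)
means_b = patient_means(technique_b)

# Each patient now contributes one observation, so the groups can be compared
# with any two-sample method (for example, a difference in means with a CI).
difference = mean(means_a) - mean(means_b)
print(f"difference in mean postoperative pain = {difference:.2f}")
```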
There are few methodologic absolutes in statistics, and this article should not be seen as indicating that ANOVA should never be used in the analysis of randomized trials. Indeed, in many situations, ANOVA is mathematically identical to a t test (F = t²) or a regression. It is not hard to find examples in the literature in which authors calculate a difference between groups with a 95% confidence interval and then use an ANOVA to obtain a p value that is appropriately reported. Similarly, the alternative I recommend, regression, is not immune from misuse; regression provides researchers with the opportunity to adjust their analyses for all sorts of covariates before choosing the analysis that gives the most favorable results. It can also be argued that reliance on regression might lead researchers to overanalyze individual group differences in multiarm trials when the bigger picture, perhaps summarized by ANOVA, is that the treatments are ineffective.
However, ANOVA is especially prone to misuse: obtaining an F and p value from an ANOVA does not require calculation of a difference between groups, with the result that this estimate often goes unreported; for trials with more than two arms, ANOVA tests a hypothesis that is often uninteresting; for trials with repeated measures, ANOVA requires specification and analysis of effects that are extraneous to the principal study question of a randomized trial. In Table 2, I summarize this article by describing each problem associated with ANOVA and giving an alternative statistical approach. These alternatives should be considered in preference to ANOVA for the analysis of randomized trials.
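The equivalence between a two-group ANOVA and a t test mentioned earlier, namely that F equals the square of t, can be checked numerically. The sketch below uses hypothetical scores and two illustrative functions, `pooled_t` and `one_way_f`, written for this example; the t statistic is the equal-variance (pooled) version, which is the one the identity applies to.

```python
from math import sqrt
from statistics import mean


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances."""
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    return (ma - mb) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def one_way_f(*groups):
    """F statistic from a one-way ANOVA across any number of groups."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical anxiety scores for two groups.
treated = [4.0, 5.5, 6.0, 3.5, 5.0, 6.5]
control = [6.0, 5.5, 7.0, 6.5, 5.0, 7.5]

t = pooled_t(treated, control)
f = one_way_f(treated, control)
print(f"t squared = {t ** 2:.6f}, F = {f:.6f}")
```

The two printed values agree up to floating-point rounding, but only the t test calculation passes through the difference between means, which is why that estimate is readily available to report.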
REFERENCES
1. Jansen SJ, Kievit J, Nooij MA, de Haes JC, Overpelt IM, van Slooten H, Maartense E, Stiggelbout AM. Patients’ preferences for adjuvant chemotherapy in early-stage breast cancer: is treatment worthwhile? Br J Cancer 2001;84:1577–85.
2. Moher D, Schulz KF, Altman DG. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. Ann Intern Med 2001;134:657–62.
3. Altman DG. Practical Statistics for Medical Research. London: Chapman and Hall; 1991.
4. Vickers AJ, Altman DG. Statistics notes: analysing controlled trials with baseline and follow up measurements. BMJ 2001;323:1123–4.
5. Vickers AJ. The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Med Res Methodol 2001;1:6.
6. Senn S. Statistical Issues in Drug Development. Chichester: John Wiley; 1997.
7. Llabre MM, Spitzer S, Siegel S, Saab PG, Schneiderman N. Applying latent growth curve modeling to the investigation of individual differences in cardiovascular recovery from stress. Psychosom Med 2004;66:29–41.