Misuse of Baseline Comparison Tests and Subgroup Analyses in Surgical Trials

Bhandari, Mohit MD, MSc*; Devereaux, P J MD*; Li, Patricia BSc*; Mah, Doug BSc*; Lim, Ki BA*; Schünemann, Holger J MD*; Tornetta, Paul III MD

Clinical Orthopaedics and Related Research: June 2006 - Volume 447 - p 247-251
doi: 10.1097/01.blo.0000218736.23506.fe


Researchers have documented the misuse and overemphasis of two practices in medical randomized control trials: statistical tests that compare patients' baseline characteristics, and subgroup analyses.2,3,12 The goal of randomization is to produce groups of patients with similar prognoses at the start of the trial so that the results can be attributed to the interventions being evaluated.7 Medical researchers commonly compare baseline variables between treatment groups using statistical tests to show baseline comparability of trial groups. However, any baseline differences between treatment groups in a randomized control trial must be attributable to chance or faulty randomization. If randomization was done properly, the null hypothesis of such a test is true by design. Reporting and emphasizing the presence or absence of statistically significant baseline differences between treatment groups distracts clinicians from the more relevant question (ie, whether there are clinically important baseline differences between the treatment groups).4

Subgroup analyses are frequently post hoc. This risks false-positive results (Type I errors) in which ineffective (or even harmful) treatments are deemed beneficial.3,9,10,12,13 Alternatively, false-negative results may occur because negative subgroup analyses often are underpowered. Numerous investigators recommend cautious interpretation of data derived from an analysis of a subgroup of patients in a randomized control trial.3,10,12,13

It is unknown whether the problems with misuse and overemphasis of baseline statistical comparability tests and subgroup analyses in medical randomized control trials exist in the surgical literature. We did an observational study evaluating the current use of baseline comparability tests and subgroup analyses in surgical randomized control trials. We hypothesized that surgical trials had the same limitations in reporting baseline comparison tests and subgroup analyses as previously reported medical trials.12 Specifically, we hypothesized that: (1) randomized trials in surgery were infrequent; (2) baseline comparison tests were frequent; (3) subgroup analyses were frequent; and (4) differences identified in subgroups were inappropriately featured in the conclusions.


We did computerized (PubMed) and manual searches to identify published surgical randomized control trials in the Journal of Bone and Joint Surgery (American volume) and the Journal of Bone and Joint Surgery (British volume) from January 2000 to April 2003. We also included surgical trials published during the same period in the British Medical Journal (BMJ), Journal of the American Medical Association (JAMA), New England Journal of Medicine (N Engl J Med), and The Lancet. Eligible studies were randomized control trials in any field of surgery that evaluated surgical interventions (ie, procedures done in an operating room), were published in English, and involved humans. We did not exclude trials based on size or other trial characteristics. We did exclude pseudorandomized trials. The tables of contents were reviewed for surgical trials, and PubMed database searches were filtered by the journal name and keyword “surgery.” All eligible articles were cross-referenced with manual searches for accuracy.

All potentially eligible articles were reviewed independently by three investigators for study inclusion (KL, DM, PL). A fourth reviewer (MB) reviewed all cases to ensure eligibility criteria were met. All discrepancies were resolved by verbal consensus meetings. Our decision to include articles published in medical journals was based on our belief that these journals published high-quality randomized control trials.

Three reviewers (PL, DM, KL) independently abstracted the following variables from each included study: first author's profession, continent of research, subspecialty, number of centers, number of patients, number of treatment groups compared, and primary and secondary outcomes. We also determined details regarding baseline comparability of treatment groups in each eligible trial. These included the number of baseline variables reported, whether significance tests were done on the baseline variables, and any statistical imbalances in baseline variables. We also examined the reporting of subgroup analyses. Our definition of subgroup analysis included treatment outcome comparisons for patients subdivided by baseline characteristics. Three reviewers (PL, DM, KL) noted whether studies included a subgroup analysis, the total number of subgroup analyses, statistical methods (ie, descriptive only, subgroup p values, hazard ratios, or test of interaction), whether subgroup analyses were planned a priori (ie, review of the methods section for details about planned evaluation of subgroups), whether subgroup differences were reported, and whether the subgroup differences were noted in the summary or conclusion. All disagreements were resolved by consensus (MB, PL, DM, KL).

We present descriptive statistics for continuous and categorical variables. Continuous variables such as number of patients are presented as means and standard deviations (SD). Categorical variables are presented as proportions. Our sample size was based on a previous study that used 50 randomized trials.3 A sample of 50 randomized trials is large enough to provide precise results, but small enough to allow thorough analysis of each article.3


We identified 46 randomized trials in surgery from 762 randomized trials (6%) published in four medical journals (25 from Lancet, four from BMJ, five from JAMA, and 12 from N Engl J Med) from January 2000 to April 2003. We identified an additional 26 trials in surgical journals (16 from the Journal of Bone and Joint Surgery-American volume, 10 from the Journal of Bone and Joint Surgery-British volume). Therefore, our final sample comprised 72 trials. In 45 (62%) of the studies, the first author was a surgeon, 43 (60%) studies were done in Europe, and 29 (40%) studies were related to orthopaedic surgery (Table 1). Trials included one to 139 centers and 21 to 2309 patients. Fifty-seven (79%) studies were funded, of which 24 were industry funded.

Table 1: Study Characteristics

Significance tests on baseline characteristics between study groups (Table 2) were frequent, performed in 18 trials (25%). Studies compared treatment groups on an average of 10.3 ± 7.7 baseline variables. In total, 689 baseline variables were compared across the 72 trials. Of 166 significance tests, 17 (10%; 95% confidence interval [CI], 5-15%) were significant at p < 0.05. No differences in significance testing were found between orthopaedic and medical journals.
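
As a rough check of the figures above, the proportion of significant baseline tests and an approximate interval can be computed as in the sketch below. The normal-approximation (Wald) method and the function name are assumptions made for illustration; the report does not state how its interval was calculated, so the bounds may differ slightly from the published 5-15%. The sketch also shows how many significant results chance alone would be expected to produce at a 0.05 threshold.

```python
import math

def wald_proportion_ci(successes, total, z=1.96):
    """Return (proportion, lower, upper) using the normal approximation.

    Assumption: a Wald interval; the source does not name its method.
    """
    p = successes / total
    se = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Figures reported in the text: 17 of 166 baseline tests had p < 0.05.
p, low, high = wald_proportion_ci(17, 166)
print(f"observed: {p:.1%} (95% CI {low:.1%} to {high:.1%})")

# Under sound randomization, about 5% of tests are expected to be
# significant by chance alone at a 0.05 threshold.
expected_by_chance = 0.05 * 166
print(f"expected by chance: {expected_by_chance:.1f} of 166")
```

The comparison between the observed count and the count expected by chance is the point of the exercise: a significant baseline test in a randomized trial says little about whether the imbalance is clinically important.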

Table 2: Baseline Comparability

Fifty-four subgroup analyses were conducted in 27 trials (38%) (Table 3). Of these, only five (9%) of the subgroup analyses were preplanned as explicitly stated in the study report. The number of subgroup analyses ranged from one to 23 per trial, with a mean of two per trial. The number of baseline variables included in each analysis also varied substantially. Twelve trials included one baseline variable, and 22 trials included more than eight baseline variables (Table 3). The majority of investigators used tests of significance when comparing outcomes between subgroups (41 subgroup analyses, 76%). Only three (6%) subgroup analyses were performed using statistical tests for interaction. Ten subgroup analyses (19%) were descriptive only, with no report of how subgroup differences were tested.

Table 3: Subgroup Analysis

Investigators reported differences between subgroups in 31 (72%) analyses, all of which were featured in the summary or conclusion. Three studies showed no differences between overall treatment groups, but did report differences in the subgroups. We did not identify differences between orthopaedic and medical journals in the number and reporting of subgroup analyses.


We identified important problems in the reporting of surgical trials. These included: frequent significance testing of baseline variables between treatment groups; conducting multiple post hoc subgroup analyses; failure of most subgroup analyses to use a statistical test of interaction; and emphasizing subgroup differences in the summary or conclusions.

Our study has some limitations. The lack of adequate reporting of statistical analysis plans in the trial samples may limit the validity of our findings that suggest many studies failed to preplan the subgroup analyses. Our findings may not extend to randomized control trials published in journals other than those included in our review. Although we had verbal consensus meetings among three independent reviewers in the identification of articles and data abstraction, we did not record the individual reviewer responses for calculation of reviewer agreement (Kappa). Our study is strengthened by identification of surgical trials across various surgical subspecialties and data collection in triplicate with consensus among data collectors. Our sample of 72 randomized trials is larger than the numbers of trials in previous studies.3,9,12

The most striking findings were the number of subgroup analyses done in randomized control trials (range, 1-23 analyses per trial), the lack of preplanning, insufficient consideration of statistical details, and subgroup results being reported in the conclusions or summary. In a trial comparing operative versus nonoperative treatment for displaced fractures of the calcaneus, Buckley et al reported that operative treatment provided no improvement over nonoperative treatment of displaced intraarticular calcaneal fractures.5 However, after doing numerous subgroup analyses, they found that women, patients who were not receiving workers' compensation, younger males, patients with a greater Böhler angle, patients with a lighter workload, and patients with one simple displaced intraarticular calcaneal fracture had better outcomes after operative treatment than after nonoperative treatment.5 Rationale for the choice of these post hoc subgroups was not provided by the authors.

Fewer than 10% of subgroup analyses were preplanned. This proportion is lower than that reported in other studies. Parker and Naylor reviewed 67 randomized control trials involving pharmacotherapy of at least 1000 patients with unstable angina, myocardial infarction, left ventricular dysfunction, or heart failure from 1980 to 1997.10 Only 24 of the 67 trials (36%) had preplanned subgroup analyses. Moriera et al reviewed 32 trials published in four leading journals (N Engl J Med, JAMA, Lancet, and Am J Public Health) after July 1, 1998.9 They found that only 41% of studies that had subgroup analyses provided details of the analysis plan.9

Critics of subgroup analyses have argued that they can lead to false-positive results (Type I errors).3,10,12,13 Doing multiple significance tests on numerous subgroups risks spuriously significant results. This problem was well illustrated in a report of a cardiovascular trial (ISIS-2 study).6 Study investigators found zodiac birth sign to be significantly associated with a treatment effect, a clearly false-positive finding.
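
The inflation of the false-positive risk with multiple tests can be illustrated with a short calculation. This is a minimal sketch that assumes the tests are independent and each uses a 0.05 threshold; subgroup tests within one trial are usually correlated, so the numbers are only indicative. The function name is hypothetical.

```python
def familywise_error(n_tests, alpha=0.05):
    """Probability of at least one false positive among n independent tests.

    Assumption: tests are independent and every null hypothesis is true.
    """
    return 1 - (1 - alpha) ** n_tests

# 23 is the largest number of subgroup analyses found in a single trial
# in this review; the other values are shown for comparison.
for k in (1, 2, 5, 10, 23):
    print(f"{k:>2} tests: {familywise_error(k):.0%} chance of at least one false positive")
```

Under these assumptions, a trial that runs 23 subgroup tests is more likely than not to produce at least one nominally significant result even when no true subgroup effect exists.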

The most appropriate statistical tests for making inferences from subgroup analyses often are omitted. Pocock et al argued in favor of statistical tests for interaction between subgroups.12 An interaction test evaluates whether the subgroup treatment effects are different from each other.1 Some have argued that tests for interaction are underpowered.10,13 However, this is not a limitation, but a reality regarding subgroup analyses. This appropriately conservative approach inevitably guards against overemphasizing subgroup effects. We found that interaction tests were rarely used among trials that had subgroup analyses (three of 54; 5.6%). This was less than reported by Assmann et al in their review of medical trials (15 of 45; 33%).3

Although the authors of one randomized trial of calcaneal fractures mentioned doing statistical tests of interaction, the results of these tests were not reported with the subgroup analyses. Instead, the investigators reported subgroup p values and stated that women achieved better outcomes with operative treatment than men.5

The majority of reports in our review (41 of 54; 75.9%) used subgroup p values for comparisons between subgroups. However, this approach is fundamentally flawed because within-subgroup comparisons cannot determine between-subgroup differences. In addition, performing multiple post hoc subgroup analyses when overall treatment effects are nonsignificant risks false-positive findings by chance alone.

Interactions can be quantitative or qualitative (Fig 1). A quantitative interaction occurs when a treatment effect is beneficial or harmful in all subgroups but the magnitude of difference varies across subgroups. Peto argued that quantitative interactions are common but rarely meaningful.11 Qualitative interactions may be thought of as very large quantitative interactions. In general, the differentiation between quantitative and qualitative falls on a continuum.

Fig 1:
The relative risk and quantitative interaction are shown for Subgroups A and B. Subgroup A had a significant increase in relative risk of an outcome, whereas Subgroup B had a nonsignificant increase in risk of an outcome. A test of interaction (ie, comparison of the relative risks between Subgroups A and B) may or may not be significant. A qualitative interaction occurs when Subgroup A shows a significantly increased risk of an outcome and Subgroup B shows a significantly decreased risk of an outcome. The test of interaction was significant.

Careful review of the randomized trial of calcaneal fractures suggests a quantitative interaction between treatment and gender with respect to outcomes.5 Buckley et al reported no difference in outcomes with gender for nonoperative cases (relative risk = 0.99; 95% CI, 0.35-2.78; p = 0.88) and a significant difference in outcome with gender for operative cases (relative risk = 3.84; 95% CI, 1.1-16.9; p = 0.02).5 Comparison of the relative risks between groups results in a nonsignificant p value (p > 0.05). We found another example of a quantitative interaction in a study evaluating the role of chemotherapy before resection of esophageal cancer.8 This regimen reduced the risk of mortality in males (hazard ratio, 0.79; 95% CI, 0.67-0.94; p = 0.008) but did not do so in females (hazard ratio, 0.84; 95% CI, 0.62-1.03; p = 0.25). Readers might therefore infer differential outcomes by patient gender. However, the hazard ratios for males and females had widely overlapping confidence intervals and were not different.8
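
One way to compare two subgroup estimates directly is a z test on the difference of their logarithms, in the spirit of the approach cited for interaction tests.1 The sketch below is illustrative only and is not necessarily the calculation used in this report. It assumes the two subgroups are independent, that the published intervals are 95% intervals, and that standard errors can be recovered from the interval widths on the log scale. Because published intervals are rounded and not always symmetric on the log scale, the results are approximate. The function names are hypothetical.

```python
import math

def log_se_from_ci(lower, upper, z=1.96):
    """Approximate standard error of a log ratio from its 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def interaction_test(est_a, ci_a, est_b, ci_b):
    """z test for the difference between two independent log ratios.

    Returns (ratio_of_ratios, z, two_sided_p).
    """
    diff = math.log(est_a) - math.log(est_b)
    se = math.sqrt(log_se_from_ci(*ci_a) ** 2 + log_se_from_ci(*ci_b) ** 2)
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the standard normal
    return math.exp(diff), z, p

# Calcaneal fracture trial: operative vs nonoperative relative risks by gender.
calcaneal = interaction_test(3.84, (1.1, 16.9), 0.99, (0.35, 2.78))

# Esophageal cancer trial: hazard ratios for males vs females.
esophageal = interaction_test(0.79, (0.67, 0.94), 0.84, (0.62, 1.03))

for label, (ratio, z, p) in (("calcaneal", calcaneal), ("esophageal", esophageal)):
    print(f"{label}: ratio of ratios = {ratio:.2f}, z = {z:.2f}, p = {p:.2f}")
```

Under these assumptions both interaction p values come out above 0.05, in line with the statements above that the subgroup estimates were not different from each other even though one within-subgroup p value was significant and the other was not.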

Qualitative interactions occur when the point estimates and confidence intervals suggest that one treatment is beneficial for one subgroup and harmful for another subgroup. Yusuf et al13 and Peto11 reported that qualitative interactions are uncommon in clinical trials and that observed qualitative interactions should be viewed with skepticism.
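
The definition above can be turned into a simple screening rule: flag a possible qualitative interaction only when the two subgroup confidence intervals exclude the null value in opposite directions. This is a deliberately simplified sketch and not a substitute for a formal test of interaction. The function name, the labels, and the first pair of intervals are hypothetical; the second pair reuses the gender-specific intervals from the calcaneal fracture trial.

```python
def classify_interaction(ci_a, ci_b, null=1.0):
    """Label a pair of subgroup ratio CIs by the direction each one supports."""
    def direction(ci):
        lower, upper = ci
        if lower > null:
            return "increase"
        if upper < null:
            return "decrease"
        return "inconclusive"

    directions = {direction(ci_a), direction(ci_b)}
    if directions == {"increase", "decrease"}:
        return "possible qualitative interaction"
    return "no qualitative interaction apparent"

# Hypothetical intervals on opposite sides of 1.
print(classify_interaction((1.2, 2.5), (0.4, 0.8)))

# Calcaneal fracture trial: operative (1.1-16.9) vs nonoperative (0.35-2.78).
print(classify_interaction((1.1, 16.9), (0.35, 2.78)))
```

Even when the rule flags a pair of intervals, the cautions cited above still apply: an apparent reversal of effect across subgroups is more often a chance finding than a real one.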

Investigators should consider providing only baseline factors thought to be associated with the outcome of interest. Lengthy baseline comparison tables are unnecessary and often distract readers from the important variables. The CONSORT statement2 suggests that statistical comparisons of baseline variables should not be reported. Subgroup analyses should be performed and interpreted with caution. The validity of a subgroup analysis can be improved by defining a few important (and biologically plausible) subgroups in advance and doing statistical tests of interaction (Table 4).

Table 4: Guide to Interpreting Subgroup Analyses*

We identified important problems with the reporting of randomization, baseline comparisons, and subgroup analyses. The most concerning was the misuse of subgroup analyses. One in three studies did a subgroup analysis, of which most found subgroup differences using inappropriate statistical tests. The presentation of these subgroup findings in the conclusions only exaggerates the perceived significance of such exploratory analyses.


1. Altman D, Bland JM. Interaction revisited: the difference between two estimates. BMJ. 2003;326:219-220.
2. Altman DG, Schulz KF, Moher D, Egger M, Davidoff F, Elbourne D, Gotzsche PC, Lang T; CONSORT Group (Consolidated Standards of Reporting Trials). The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med. 2001;134:663-694.
3. Assmann S, Pocock S, Enos L, Kasten L. Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet. 2000;355:1064-1069.
4. Bhandari M, Guyatt GH, Lochner H, Sprague S, Tornetta P 3rd. Application of the Consolidated Standards of Reporting Trials (CONSORT) in the fracture care literature. J Bone Joint Surg Am. 2002;84A:485-489.
5. Buckley R, Tough S, McCormack R, Pate G, Leighton R, Petrie D, Galpin R. Operative versus non-operative treatment of displaced intra-articular calcaneal fractures. J Bone Joint Surg Am. 2002;84:1733-1744.
6. Collins R, Gray R, Godwin J, Peto R. Avoidance of large biases and large random errors in the assessment of moderate treatment effects: the need for systematic reviews. Stat Med. 1987;6:245-250.
7. Guyatt GH, Rennie D, eds. Users' Guides to the Medical Literature: A Manual for Evidence-based Clinical Practice. Chicago, IL: American Medical Association Press; 2001.
8. Medical Research Council Esophageal Cancer Working Group. Surgical resection with or without pre-operative chemotherapy in esophageal cancer: a randomized controlled trial. Lancet. 2002;359:1727-1733.
9. Moriera EJ, Stein Z, Susser E. Reporting on methods of subgroup analysis: a survey of four scientific journals. Braz J Med Biol Res. 2001;34:1441-1446.
10. Parker A, Naylor CD. Subgroups, treatment effects, and baseline risks: some lessons from cardiovascular trials. Am Heart J. 2000;139:952-961.
11. Peto R. Statistical Aspects of Cancer Trials. In: Halnan K, ed. Treatment of Cancer. London, UK: Chapman and Hall; 1982:867-871.
12. Pocock S, Assmann S, Enos L, Kasten L. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002;21:2917-2930.
13. Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA. 1991;266:93-98.
© 2006 Lippincott Williams & Wilkins, Inc.