The randomized controlled trial (RCT) is widely held to be the most valid and revealing study design. RCTs comparing surgical procedures or comparing nonsurgical care with surgery are difficult to design and conduct. Establishing selection criteria to maintain clinical equipoise, ensuring adequate patient enrollment, maintaining high rates of patient compliance, blinding patients or investigators to the treatment received, and analyzing the data appropriately are all significant challenges in any RCT but may be particularly difficult in surgical trials.
Randomized Controlled Trials
The publication of the lumbar disk herniation (LDH) portion of the Spine Patient Outcomes Research Trial (SPORT) in November 2006 was accompanied by a press release from the Journal of the American Medical Association and occasioned significant coverage in the professional and lay press.1-3 Despite a significant financial investment by the National Institute of Arthritis and Musculoskeletal and Skin Diseases and a multicenter design in which many leaders of orthopaedic spinal surgery were involved, many shortcomings were apparent before the trial began.4,5 These were discussed briefly in the report of the trial and in editorials and commentaries in the medical literature.6-9 Ultimately, valid conclusions from the trial were limited in number, and few new data emerged that significantly affected the care of patients with symptomatic LDH.
Other RCTs comparing lumbar diskectomy with nonsurgical care have been conducted and published. Our goal was to determine whether common aspects in the design and conduct of these trials resulted in threats to their validity, as they did in SPORT, and to try to determine how these shortcomings might be addressed through alternative study design or analysis. We identified five papers published between 1966 and 2007 that met the criteria of a prospective comparative study of lumbar diskectomy and nonsurgical treatment and that employed a random method of patient allocation.2,10-13 In general, the studies included patients who were candidates for either surgical or nonsurgical treatment of LDH and excluded those with neurologic signs indicating the need for urgent surgery and those with medical contraindications for surgery. The inclusion and exclusion criteria are summarized in Table 1.
The outcomes measured in these studies and the times at which they were assessed varied considerably. Weber11 reported outcomes on a four-tier descriptive scale that was commonly in use before the development of current outcomes instruments. The more recent studies used various combinations of general and disease-specific health-related quality of life instruments, visual analog scales (VASs), patient satisfaction ratings, and clinical examinations. Follow-up examinations were generally standard visits within 1 year after initiation of treatment. The exception to this was Weber's study,11 in which outcomes were reported at 1, 4, and 10 years after surgery (Table 2).
In all five studies, statistical analysis was performed using null hypothesis statistical testing and P values. Peul et al10 and Weinstein et al2 were the only authors to report the results of an a priori power analysis. Neither calculation included any allowance for patient crossover. A variety of statistical tests was used in the studies, including chi-square, Student t test, and regression analysis. P values <0.05 were considered statistically significant by all authors. Only Osterman et al13 specifically mentioned any adjustment for multiple comparisons.
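The consequence of omitting a crossover allowance from a power analysis can be sketched numerically. The example below is a minimal illustration, not the calculation used in any of the five trials: it assumes a two-sided comparison of two means with a common standard deviation, crossover that is random with respect to prognosis, and hypothetical values for the effect size (10 points), standard deviation (20 points), and crossover rates. The function name `n_per_group` and its parameters are introduced here for illustration only.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, cross_from_surgery=0.0, cross_from_nonsurgical=0.0,
                alpha=0.05, power=0.80):
    """Patients needed per arm to detect a mean difference `delta` by ITT analysis.

    Assumes random crossover, so that the difference expected under ITT
    shrinks to delta * (1 - both crossover rates combined).
    """
    if cross_from_surgery < 0.0 or cross_from_nonsurgical < 0.0:
        raise ValueError("crossover rates cannot be negative")
    retained = 1.0 - cross_from_surgery - cross_from_nonsurgical
    if retained <= 0.0:
        raise ValueError("combined crossover must be below 100%")
    diluted_delta = delta * retained
    # Standard normal quantiles for the two-sided alpha and the desired power.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(power)
    return ceil(2.0 * (z * sd / diluted_delta) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 10-point true difference, 20-point standard deviation.
    print("no crossover:", n_per_group(10.0, 20.0))
    print("40% and 45% crossover:", n_per_group(10.0, 20.0, 0.40, 0.45))
```

Under these assumptions, the required enrollment grows with the inverse square of the fraction of the treatment contrast that survives crossover, which is why even moderate crossover can leave a trial badly underpowered.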
Crossover rates in all studies were significant. At least one third of patients initially allocated to undergo nonsurgical treatment had surgery during the study period. In the study by Buttermann,12 more than 50% of the patients in the nonsurgical arm crossed over to undergo surgery. A smaller proportion of patients in the surgical arms of these studies crossed over, ranging from 0% to almost 40% (Table 3).
Three of the studies reported primarily the results of intent-to-treat (ITT) analysis. Weber11 published patient outcomes in tables that delineated both treatment allocated and treatment received, but the primary statistical analysis appears to have excluded the patients who crossed over from nonsurgical to surgical care. Buttermann12 presented the results of his study primarily in graphic form. The results for patients who crossed over to undergo surgery were plotted separately. The nonsurgical group changed in composition over time, however, as the crossover patients were excluded from the calculations for that group after undergoing surgery. This gives the appearance of a changing enrollment in the study because the baseline plots contain data for 127 patients (50 each in the nonsurgical and surgical groups and 27 crossovers), whereas the final assessment contains data for 100 patients (23 nonsurgical, 50 surgical, 27 crossovers).
In the study by Peul et al,10 2.8% of all patients who underwent surgery required a reoperation for recurrent sciatica. Two dural tears and one wound hematoma occurred, yielding a complication rate of 1.6%. Weinstein et al2 reported 10 dural tears, 1 vascular injury, and 2 other complications, for an overall intraoperative complication rate of 5%. Perioperative (within 8 weeks) complications arose in 13 patients (5%). Twenty-five patients underwent revision within 2 years after the index procedure, 8 for recurrent herniation.
Clinical outcomes in the studies were variable. Weber11 reported that at 1 year, there was a statistically significant difference (P < 0.0015) favoring surgery based on both the analysis excluding crossovers and on ITT analysis. Neither method of analysis demonstrated a statistically significant difference at 4- or 10-year follow-up. Buttermann12 found that patients who underwent diskectomy reported greater mean decreases in VAS back and leg pain scores at 1- to 3-month and 4- to 6-month follow-ups. Although the Oswestry Disability Index scores for both groups improved with treatment, there was a greater improvement in patients who underwent surgery at 1- to 3-month follow-up. The magnitude of these differences is likely greater than it would be for ITT analysis because the outcomes of patients who crossed over to surgery are excluded from the nonsurgical group. Patient-rated treatment success was higher in the diskectomy group (92% to 98%) compared with the group initially assigned to nonsurgical treatment (42% to 56%).
Weinstein et al2 reported essentially only the results of the ITT analysis. The authors found no statistically or clinically significant differences between the treatment groups on any measure at any time point. As-treated analysis was performed but was not presented in detail. The results of this alternative analysis were summarized as showing “strong, statistically significant advantages … for surgery at all follow-up times through 2 years.” Osterman et al13 also reported no significant differences between nonsurgical care and diskectomy up to the 2-year follow-up evaluation. As-treated analysis also did not reveal any statistically significant differences between interventions.
Peul et al10 reported that although patients who underwent early diskectomy had, on average, less severe symptoms and less disability early on (8 to 12 weeks), by 1 year, the differences disappeared. The area under the curve of the leg pain VAS for the patients assigned to surgery was significantly smaller (P < 0.001) than that for patients treated nonsurgically; there was no difference in area under the curve for Roland Morris Disability Questionnaire scores between treatment groups.
Five RCTs with a total enrollment of >1,000 patients that compared lumbar diskectomy with nonsurgical care for the treatment of symptomatic LDH have been performed and published in the past 25 years. One of the recent papers was part of a multimillion dollar, multicenter study in the United States that was funded by the National Institutes of Health. Despite these extensive efforts, significant questions remain about the optimal treatment for individual patients with LDH. None of these studies has provided physicians or patients with the ability to determine who will improve rapidly with nonsurgical treatment and who will not respond readily and may need to undergo, or consider, surgery. Simple, dichotomous (ie, “Which is better?”) RCTs are not an appropriate study design to produce the data needed to inform decisions under conditions of uncertainty. Frequentist statistical techniques based on null hypothesis testing and P values make valid subgroup analyses difficult, further limiting the amount of relevant data that can be obtained from even such large-scale RCTs as SPORT. The hegemony of the RCT among study designs can result in an uncritical acceptance of the findings among both professional and lay audiences. These five trials demonstrate the shortcomings of RCTs in studying LDH treatment and raise questions about their utility in providing data relevant to improving the care of patients with LDH.
Patients who volunteer to enroll in RCTs may be expected to differ from those who decline enrollment in both measured and unmeasured ways.14 Unless data are obtained from patients who decide not to enroll in an RCT, it may be difficult to predict the ways that nonenrollers differ from enrollers. Furthermore, unmeasured differences between enrollers and nonenrollers may unpredictably affect patient responses to treatment. The SPORT investigators compared an RCT patient cohort with patients who were followed in a nonrandomized, prospective comparative cohort study.15 Although many comparisons demonstrated no statistically significant differences between the randomized and nonrandomized cohorts, there were measurable differences in some of the disability measurements and in patient perceptions of their own health trends. There may have been other important differences, such as treatment preferences or expectations that could significantly affect response to either treatment positively or negatively. The existence of potentially significant differences between patients who enroll in RCTs and those who do not means that the results of RCTs may not apply to the general population.
RCTs comparing diskectomy with nonsurgical treatment are predicated on the assumptions that there is a homogenous patient population with LDH and that diskectomy and nonsurgical care are competitive, alternative treatments for these patients. To eliminate from a study those patients who rapidly improve without surgery, the inclusion criteria usually require that symptoms be present for 6 to 12 weeks before enrollment. The assumption that at enrollment in an RCT all patients are equally good candidates for surgical or nonsurgical care is specious. Patients with continued radiculopathy who have noted an improvement with nonsurgical care are likely to have a different response to continued nonsurgical management than are those who are no better or are worsening symptomatically despite maximal therapy. Additionally, patients with more severe symptoms at baseline tend to derive greater benefit and to have better outcomes regardless of treatment than do more mildly affected patients.16
Properly educated patients are aware of these distinctions and act rationally on them, as is illustrated by the high proportion of patients who crossed over from their assigned treatment to the alternative treatment in these RCTs. The variability of crossover between studies likely is related to the selection criteria of each individual study, patient population, patient education, and time between randomization and surgery, among other factors. The assertion by Weinstein,17 the lead SPORT investigator, that “no bias in the form of patient or physician opinion about which therapy might be better enters into the assessment of any given treatment's effectiveness” is simplistic. The high crossover rates in his and other RCTs of LDH treatment starkly demonstrate that patients have opinions about treatments and act rationally upon them. The asymmetry in the proportion of patients who crossed over from the surgical cohorts compared with the nonsurgical cohorts is expected because, after surgery, it is not possible for patients to cross over to nonsurgical treatment.
The random allocation of patients to treatment alternatives provides strong protection against selection bias. Randomization balances, on average, measured and unmeasured patient characteristics that may be associated with a positive or negative response to treatment. ITT analysis is the only method that preserves the benefits of randomization. Analyzing patients based on ITT assumes, however, that treatment assigned equals treatment received. With increasing crossover, this correlation decreases and the study becomes a valid investigation not of the treatment itself but of the program of recommending the treatment. Even the director of the National Arthritis and Musculoskeletal and Skin Diseases Advisory Council, who stated, “nonoperative therapies may offer benefits that are similar, if not equal, for patients who can not or elect not to have surgery,” missed this important distinction.18
The problem of crossover most seriously affected the two largest RCTs, those by Weinstein et al2 and Peul et al.10 This may derive from the necessity of meeting multiple investigators' individual criteria for clinical equipoise and therefore including a broader population than the smaller studies. When the baseline data provided by Weinstein et al2 are segregated by treatment received within each treatment-assignment group, it is evident that patients with more severe symptoms at baseline tended to select surgery, regardless of their initial treatment assignment. The mean baseline Medical Outcomes Study 36-Item Short Form Health Survey Bodily Pain and Physical Function subscale scores and the mean Oswestry Disability Index scores for patients undergoing surgery were strikingly similar in the two treatment-assignment groups. These scores were approximately 10 points worse than for patients in either assignment group who selected continued nonsurgical care. Peul et al10 did not provide similar data, so no assessment of the pattern of crossover can be made.
Crossover, which is essentially patient noncompliance, reduces the measured outcome difference between treatments because it dilutes the dissimilarity of assigned groups regarding the one factor that is supposed to distinguish them—the treatment received. With no crossover, the measured difference in outcomes between interventions should be maximal and, assuming an otherwise rigorous study, will be a valid estimate of the effect size. At the other extreme, there will be no measured difference between treatments if 50% of each treatment assignment group crosses over (randomly) and ITT analysis is performed. It is no surprise, therefore, that Weinstein et al2 found no difference between treatments, with almost 40% and 45% of the assignment groups choosing the other treatment.
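The dilution described in this paragraph can be worked through with simple arithmetic. The sketch below is illustrative only and rests on two simplifying assumptions that are not claims about the trials themselves: crossover is random with respect to prognosis, and surgery adds a constant benefit to every patient who receives it. The 10-point true effect is a hypothetical value, and the function name `itt_difference` is introduced here for illustration.

```python
def itt_difference(true_effect, cross_from_surgery, cross_from_nonsurgical):
    """Expected difference between assigned groups under ITT analysis.

    true_effect: hypothetical mean benefit of surgery over nonsurgical care.
    cross_from_surgery: fraction assigned to surgery who did not undergo it.
    cross_from_nonsurgical: fraction assigned to nonsurgical care who had surgery.
    """
    for rate in (cross_from_surgery, cross_from_nonsurgical):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("crossover rates must lie between 0 and 1")
    # Fraction of each assigned group that actually received surgery.
    surgery_in_surgical_arm = 1.0 - cross_from_surgery
    surgery_in_nonsurgical_arm = cross_from_nonsurgical
    # Only the gap between those fractions separates the two assigned groups.
    return true_effect * (surgery_in_surgical_arm - surgery_in_nonsurgical_arm)


if __name__ == "__main__":
    # No crossover, rates near those reported for SPORT, and 50% in each arm.
    for rates in [(0.0, 0.0), (0.40, 0.45), (0.50, 0.50)]:
        print(rates, round(itt_difference(10.0, *rates), 2))
```

With no crossover the full hypothetical effect is recovered; with crossover of 40% and 45%, only about 15% of it remains; and with 50% crossover in each arm, the expected difference is zero. If crossover is not random, as the baseline data discussed above suggest, the bias may be larger or smaller than this simple model indicates.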
Carragee7 wrote that the SPORT data “clearly show that the risk of serious problems (neurologic deterioration, cauda equina syndrome, or progression of spinal instability) [for patients] receiving nonoperative care is extremely small.” Strictly speaking, this contention is true, but it cannot be extrapolated to demonstrate that fears that prolonged waiting may result in harm are “unfounded.”1 It is conceivable that some of the patients who crossed over to have surgery from the nonsurgical arm did so because of worrisome symptoms and avoided harm by doing so. Reassuring such patients that data show that continued nonsurgical care is always safe is not supported by fact.
One partial remedy for crossovers in RCTs of LDH treatment is to perform an analysis of dichotomous outcomes by categorizing patients as successes or failures. This is the method used by the US Food and Drug Administration for recent investigational device exemption trials. In this scenario, a patient would have to achieve a certain increment of symptomatic or functional benefit to be considered a success. For surgical patients, success would mean no complications from surgery. Patients assigned to nonsurgical care, however, would be considered failures of that treatment if they crossed over and underwent surgery. Although this would not address some of the asymmetry issues of these RCTs, it would partially remedy one of the most glaring problems.
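The classification rule outlined above can be expressed as a short sketch. The improvement threshold, the field names, and the handling of patients assigned to surgery who never undergo it are assumptions made for illustration; the text does not specify them, and this is not a description of any regulatory algorithm.

```python
from dataclasses import dataclass

# Hypothetical minimum improvement (e.g., points on a disability scale).
MIN_IMPROVEMENT = 15.0


@dataclass
class Patient:
    assigned: str              # "surgical" or "nonsurgical"
    had_surgery: bool
    improvement: float         # baseline score minus follow-up score
    complication: bool = False


def is_success(patient, min_improvement=MIN_IMPROVEMENT):
    """Classify one patient as a success (True) or failure (False)."""
    # Crossing over to surgery counts as a failure of nonsurgical care.
    if patient.assigned == "nonsurgical" and patient.had_surgery:
        return False
    # A surgical complication counts as a failure of surgery.
    if patient.assigned == "surgical" and patient.complication:
        return False
    # Assumption: everyone else is judged on the improvement threshold alone.
    return patient.improvement >= min_improvement


def success_rate(patients, arm):
    """Proportion of successes among patients assigned to the given arm."""
    in_arm = [p for p in patients if p.assigned == arm]
    if not in_arm:
        raise ValueError("no patients assigned to arm: " + arm)
    return sum(is_success(p) for p in in_arm) / len(in_arm)
```

Because success is judged within the assigned group, the comparison retains the protection of randomization while no longer crediting nonsurgical care with the outcomes of patients who abandoned it.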
The use of null hypothesis statistical testing and reliance on P values in these RCTs, while typical of the great majority of medical studies, introduces several potential problems. First, power analysis must be performed before embarking on the study and sample sizes determined before enrollment begins. This requires the investigators to accurately predict effect size, attrition, and crossover. Second, because of the theoretical underpinnings of these tests, all hypotheses must be specified beforehand, and “second look” analyses are of questionable validity. Third, because of the underlying assumptions of the methods themselves, the results that are generated are cumbersome to explain and are frequently misinterpreted. For example, the P value is the probability of obtaining a result as extreme as or more extreme than the one observed, assuming that the null hypothesis is true.19 This translates into a useful piece of information only with some difficulty.
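The definition of the P value can be made concrete with an exact calculation. The sketch below uses hypothetical numbers (16 of 20 patients improving, tested against a null hypothesis that half would improve) and a one-sided binomial tail; it is an illustration of the definition, not a reanalysis of any trial. The function name `binomial_upper_tail` is introduced here for illustration.

```python
from math import comb


def binomial_upper_tail(successes, n, p_null):
    """One-sided exact P value: P(X >= successes) if the null hypothesis holds.

    X is the number of successes among n independent patients, each with
    success probability p_null under the null hypothesis.
    """
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    if not 0.0 <= p_null <= 1.0:
        raise ValueError("p_null must lie between 0 and 1")
    # Sum the probability of every outcome at least as extreme as the observed one.
    return sum(
        comb(n, k) * p_null ** k * (1.0 - p_null) ** (n - k)
        for k in range(successes, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical data: 16 of 20 patients improved; null hypothesis is 50%.
    print(binomial_upper_tail(16, 20, 0.5))
```

The result is a statement about the data given the null hypothesis (here, slightly less than 0.006). It is not the probability that the null hypothesis is true, which is the quantity most readers want and the source of frequent misinterpretation.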
An alternative analytical method based on Bayes' theorem would address several of these problems. Because the analysis is based on prior probabilities and likelihoods, there is no need to prespecify sample sizes or to limit the number of analyses performed, whether the data are analyzed as they are collected prospectively or after the study is complete. Results are reported as probabilities of outcomes, which are eminently understandable and inherently useful.20 The main obstacles to widespread use of Bayesian analysis are unfamiliarity with its methods, criticisms of its supposed subjectivity, and a lack of user-friendly software for the necessary calculations.21,22 The use of Bayesian analysis in RCTs and other study designs would not overcome serious flaws in the conception or conduct of clinical investigations but would provide more relevant, intuitive results.
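A minimal sketch shows how a Bayesian analysis reports a probability of an outcome directly. It assumes a dichotomous success or failure outcome, a uniform Beta(1, 1) prior for each treatment, and hypothetical counts (40 of 50 successes with surgery, 25 of 50 with nonsurgical care); none of these values come from the trials reviewed, and a real analysis would require justified priors and attention to crossover. The function names are introduced here for illustration.

```python
import random


def posterior(successes, failures, prior_alpha=1.0, prior_beta=1.0):
    """Beta posterior parameters for a success probability (conjugate update)."""
    return prior_alpha + successes, prior_beta + failures


def prob_first_better(post_a, post_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(success rate A > success rate B)."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        # Draw one plausible success rate for each treatment and compare them.
        if rng.betavariate(*post_a) > rng.betavariate(*post_b):
            wins += 1
    return wins / draws


if __name__ == "__main__":
    surgery = posterior(successes=40, failures=10)       # hypothetical counts
    nonsurgical = posterior(successes=25, failures=25)   # hypothetical counts
    print("P(surgery has the higher success rate):",
          prob_first_better(surgery, nonsurgical))
```

The output is a statement of the form "given the data and the prior, the probability that surgery has the higher success rate is x," which can be recalculated as data accrue and extended to individual-level predictors in a more complete model.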
A final consideration is the necessity and the appropriateness of RCTs for the investigation of certain clinical questions. For many individual patients who have undergone 6 to 12 weeks of nonsurgical care and still have pain or disability, it may not be clear whether surgery is appropriate or if more time may yield a satisfactory outcome. This decision is ultimately best left to the patient after the treating physician informs her or him about the options. The information that would be most helpful in this circumstance is the estimated probability of improvement for that patient within a certain time frame, with or without surgery. These data are not likely to be readily obtained from an RCT designed to test the hypothesis that there is no difference between surgical and nonsurgical treatment.
Even a large, multicenter RCT does not provide accurate estimates of the mean effect size of one treatment compared with another when a high proportion of patients crosses over. In this instance, the primary strength of the RCT is subverted, and the RCT is possibly no better than other study designs in providing unbiased estimates. Even without serious threats to validity, however, it is not clear that the results of observational studies and RCTs yield consistently different results.23,24 Studies that have purported to demonstrate greater bias in observational studies compared with RCTs have generally used the results of RCTs as the basis of comparison. There is no “gold standard” comparator that avoids this tautology; therefore, any difference between the results of cohort studies and RCTs will be attributed by some to bias.25 This ignores the possibility that the randomized trial design itself may produce biased estimates based on altered patient expectations or clinical vigilance as part of the trial.25-27
Five RCTs of lumbar diskectomy versus nonsurgical care for symptomatic LDH were conducted and published between 1983 and 2007. All had high crossover rates and used frequentist statistical techniques; the largest study had an enrollment refusal rate of almost 75%. These shortcomings seriously compromised the validity of these studies and heighten concerns about the possibility of conducting a high-quality RCT of treatment outcomes for LDH and, furthermore, the appropriateness of the randomized design itself for better understanding the likely outcomes of the surgical and nonsurgical treatment of LDH.
Evidence-based Medicine: References 2, 10-13, and 23 are level I/II (prospective, randomized) studies. References 15 and 16 are level III/IV (case reports and case-control cohort studies). The remaining references are level V (expert opinion) or are editorial commentary.
Citation numbers printed in bold type indicate references published within the past 5 years.
1. Kolata G: Study questions need to operate on disk injuries. The New York Times. November 22, 2006. Available at: http://www.nytimes.com/2006/11/22/health/22spine.html. Accessed July 16, 2008.
2. Weinstein JN, Tosteson TD, Lurie JD, et al: Surgical vs nonoperative treatment for lumbar disk herniation: The Spine Patient Outcomes Research Trial (SPORT). A randomized trial. JAMA
3. American Medical Association: Randomized study indicates that patients with herniated disk improved with or without surgery. Available at: http://pubs.ama-assn.org/media/2006j/1121.dtl. Accessed August 12, 2008.
4. Angevine PD, McCormick PC: Wrong science: The SPORT trial and its potential impact on neurosurgery. Clin Neurosurg
5. Dunsker SB, Awad IA, McCormick PC: Spine patient outcomes research trial. J Neurosurg
6. Angevine PD, McCormick PC: Inference and validity in the SPORT herniated lumbar disc randomized clinical trial. Spine J
7. Carragee E: Surgical treatment of lumbar disk disorders. JAMA
8. Flum DR: Interpreting surgical trials with subjective outcomes: Avoiding unSPORTsmanlike conduct. JAMA
9. McCormick PC: The Spine Patient Outcomes Research Trial results for lumbar disc herniation: A critical review. J Neurosurg Spine
10. Peul WC, van Houwelingen HC, van den Hout WB, et al: Surgery versus prolonged conservative treatment for sciatica. N Engl J Med
11. Weber H: Lumbar disc herniation: A controlled, prospective study with ten years of observation. Spine
12. Buttermann GR: Treatment of lumbar disc herniation: Epidural steroid injection compared with discectomy. A prospective, randomized study. J Bone Joint Surg Am
13. Osterman H, Seitsalo S, Karpinnen J, Malmivaara A: Effectiveness of microdiscectomy for lumbar disc herniation: A randomized controlled trial with 2 years of follow-up. Spine
14. Friedman LM, Furberg CD, DeMets DL: Fundamentals of Clinical Trials, ed 3. New York, NY: Springer, 1998.
15. Weinstein JN, Lurie JD, Tosteson TD, et al: Surgical vs nonoperative treatment for lumbar disk herniation: The Spine Patient Outcomes Research Trial (SPORT) observational cohort. JAMA
16. Atlas SJ, Deyo RA, Keller RB, et al: The Maine Lumbar Spine Study: II. 1-year outcomes of surgical and nonsurgical management of sciatica. Spine
17. Weinstein JN: The tortoise and the hare: Is there a place in spine surgery for randomized trials? Spine
18. National Institute of Arthritis and Musculoskeletal and Skin Diseases: Study shows patients with herniated disks improve over time: Even without surgery. Published December 2006. Available at: http://www.niams.nih.gov/News_and_Events/Spotlight_on_Research/2006/backpain_surgery.asp. Accessed July 16, 2008.
19. Goodman SN: Toward evidence-based medical statistics: I. The P value fallacy. Ann Intern Med
20. Goodman SN: Toward evidence-based medical statistics: II. The Bayes factor. Ann Intern Med
21. Fryback DG, Stout NK, Rosenberg MA: An elementary introduction to Bayesian computing using WinBUGS. Int J Technol Assess Health Care
22. Sheingold SH: Can Bayesian methods make data and analyses more relevant to decision makers? A perspective from Medicare. Int J Technol Assess Health Care
23. Benson K, Hartz AJ: A comparison of observational studies and randomized, controlled trials. N Engl J Med
24. Concato J, Shah N, Horwitz RI: Randomized, controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med
25. Kaptchuk TJ: The double-blind, randomized, placebo-controlled trial: Gold standard or golden calf? J Clin Epidemiol
26. McPherson K: The Cochrane Lecture: The best and the enemy of the good. Randomised controlled trials, uncertainty, and assessing the role of patient choice in medical decision making. J Epidemiol Community Health
27. McPherson K, Britton AR, Wennberg JE: Are randomized controlled trials controlled? Patient preferences and unblind trials. J R Soc Med