Most researchers are already well aware of recent findings from the Women's Health Initiative regarding the effects of estrogen plus progestin in healthy postmenopausal women. 1 The results of this randomized placebo-controlled trial came as a surprise, because many of us assumed that the findings would support estimates obtained from observational cohorts (as well as expert rationale) that suggested that hormone replacement therapy would reduce coronary heart disease and provide an overall benefit to recipients. 2,3 The finding from the trial is quite the opposite—after an average of 5.2 years of follow-up of 16,608 women, clinically meaningful increases in breast cancer, coronary heart disease, stroke and pulmonary embolism, attributable to hormone therapy, convincingly offset the benefits of a reduced incidence of colorectal cancer and hip fracture. The study of combined estrogen and progestin was ended early because of the strength of these findings, although the parallel WHI study of unopposed estrogen remains open. The current finding is important not only because the estimated effect for coronary heart disease is opposite that which was expected, but also because as many as 38% of postmenopausal women have been treated with hormone replacement therapy.
This circumstance appears to represent what many observationalists had hoped would be a rare or impossible event, namely a strong discordance between risk estimates obtained from epidemiologic study designs and those obtained from randomized clinical trials (RCTs). In recent years, there has been a growing sentiment among some researchers that observational studies yield estimates of treatment effects that are qualitatively and quantitatively similar to those from RCTs. For example, a pair of papers written from this perspective was published 2 years ago in The New England Journal of Medicine. 4,5 Many readers understood these papers to support the idea that estimates of treatment effects from observational studies are similar to those from RCTs, calling into question the traditional hierarchy of evidence that places RCTs at the top.
The WHI result requires that researchers and clinicians rethink their optimism about the lack of bias in observational inferences about treatment, an optimism not often found among seasoned clinical trialists. Observational studies are efficient and necessary, but they will sometimes prove unreliable. It seems to me that case-control designs, for example, may be more reliable for estimating the effects of “exposures” than the effects of “treatments.” Two factors with great leverage for creating bias in observational studies of treatments, namely selection of a suboptimal control group and confounding by indication, might be mitigated when studying exposures that are not purposely selected by skilled practitioners to yield a desired effect.
Other lessons related to process and study design emerge from the WHI experience. The Data Safety Monitoring Board (DSMB) recommended terminating these treatment arms early based on evidence (hazard ratios) less extreme and less precisely estimated than some would have liked. However, the examination of the interim data was carefully planned and executed, and no doubt there was considerable discussion and application of good clinical judgment and monitoring experience before making the decision to end the trial. Decisions such as this are usually not symmetric: investigators seldom intend to provide evidence of harm that is as strong as that for benefit. (This asymmetry implies that it is not wise to mask the DSMB to treatment assignment in a clinical trial.) Interim decisions can be made reliably because a thoughtful process for doing so has evolved over the years, and this process is a strength of the modern clinical experiment. I suspect that much of the small amount of discomfort regarding how the study was terminated actually arises from surprise at the results.
What do the results of the Women's Health Initiative RCT imply with regard to the reliability of either observational findings or those from RCTs? Some might say nothing, asserting that we have observed a rare event. However, I think it would be a mistake to dismiss the importance of the discordance. Most investigators, practitioners and patients will likely find the RCT evidence persuasive. The design and execution were rigorous. There will be changes in practice patterns and altered patient attitudes resulting from the findings, at least with regard to combined hormone replacement therapy. This behavior implies that the results of the RCT are more convincing than contradictory opinion or observational findings.
As I am a clinical trialist, it should not be surprising that I agree with this perspective. The strongest benefit of randomization is that it controls unknown or unobserved prognostic factors. No other research design can make this claim. The best explanation for the difference between the results of previous observational studies and the WHI is that the earlier studies were biased by unknown or unobserved confounders and that these effects were controlled in the RCT. Historically, women who received hormone replacement therapy were probably generally healthier than those who did not.
Whether or not unobserved prognostic factors influence a particular observational study is not the point. The fact is, they can and they do, and a research design that prevents or reduces this bias is superior to one that does not. RCTs are not guaranteed to be correct; indeed, we could later learn that the WHI trial result is inaccurate. However, we do have a stronger expectation for an RCT than for an observational study that it will yield an unbiased estimate of treatment effect. Thus, the hierarchy of evidence is not based on quality of results but on quality of design.
I have often prodded my colleagues in epidemiology, saying that they have an inferiority complex with regard to randomization. If true, it is unnecessary. After all, most of the experimental designs that are employed in the development of new therapeutics are nonrandomized (at least in oncology, my primary specialty). Nevertheless, extremely useful estimates of treatment effect are produced by such studies. They provide essential information for the overall development process, and occasionally they form the sole basis for therapeutic decisions. Limits on the applicability of rigorous experimental methods in general leave no choice but to estimate some treatment effects from observational studies. However, we must remain vigilant and recognize the limitations of research designs that do not control unobserved effects.
About the Author
STEVEN PIANTADOSI is Professor of Oncology and Director of Biostatistics in the Johns Hopkins Oncology Center. He is also Professor of Epidemiology and Biostatistics in the Bloomberg School of Public Health. For 20 years he has been a clinical trialist, working mostly in cancer-related fields. He is the author of the book Clinical Trials: A Methodologic Perspective.
1. WHI Investigators Writing Group. Risks and benefits of estrogen plus progestin in healthy postmenopausal women. JAMA 2002; 288: 321–333.
2. Stampfer M, Colditz G. Estrogen replacement therapy and coronary heart disease: a quantitative assessment of the epidemiologic evidence. Prev Med 1991; 20: 47–63.
3. Grady D, Rubin SM, Petitti DB. Hormone therapy to prevent disease and prolong life in postmenopausal women. Ann Intern Med 1992; 117: 1016–1037.
4. Benson K, Hartz AJ. A comparison of observational studies and randomized controlled trials. N Engl J Med 2000; 342: 1878–1886.
5. Concato J, Shah N, Horwitz RI. Randomized controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med 2000; 342: 1887–1892.