Health care decision making increasingly relies on scientific evidence; this is true in the United States^{1–7} and elsewhere. In the United Kingdom, insurers have explored paying for certain treatments only when administered in research settings.^{8} One objective of the insurers is to add to the body of evidence regarding the intervention in question. The National Institute for Health and Clinical Excellence in the United Kingdom has recommended essentially “paying for outcomes” by restricting payment for an expensive treatment for multiple myeloma to only patients who respond to the therapy.^{9}

In the United States, the Centers for Medicare and Medicaid Services (CMS) is experimenting with its own restricted-to-research policy called “Coverage with Evidence Development.”^{10} The policy indicates that the CMS will pay, within an agreed upon research protocol, for services deemed promising but needing additional evidence before a final, national coverage decision is made. The Agency for Healthcare Research and Quality and the CMS are evaluating different methodologies for synthesizing evidence, and have initiated efforts to understand the potential applicability and usefulness of Bayesian techniques in this regard.^{11} As a case study, the CMS requested that the Agency for Healthcare Research and Quality (via one of its evidence-based practice centers) fund a Bayesian analysis of clinical trials of implantable cardioverter defibrillators (ICDs).^{12} That project used some patient-level data, which can be helpful in addressing subset effects. However, the “raw data” are seldom, if ever, available to a particular modeling group (or a health insurer).

In the present study, we conducted Bayesian meta-analyses of the same trials as in the CMS ICD project, but using only published results. We use the ICD trials to achieve 3 goals: to show that Bayesian meta-analyses based on literature surveys can effectively inform coverage decisions; to show the richness of Bayesian modeling for endpoints such as mortality over time; and to show how the Bayesian approach can be used in a sequential manner over time to predict the results of future clinical trials and to help assess the utility of carrying out additional trials.

The gold standard of medical research is the randomized clinical trial (RCT); however, RCTs have limitations. Some of these limitations can be overcome or mitigated by combining RCTs in a meta-analysis, which provides the following advantages: (i) there is usually greater power for addressing the scientific question; (ii) there may be some ability to address treatment effect in subsets of patients; and, (iii) there is some ability to address heterogeneity across trials, which helps researchers understand the effectiveness of an intervention when used in clinical practice.

Regarding trial heterogeneity, different RCTs may evince different treatment effects, even if they have identical designs. The patient populations may vary, or the eligibility criteria may be exercised differently. Especially important are changes in patient populations over time. Indeed, modifications in screening and diagnostic procedures may lead to what is, in effect, a different disease. It is possible for a patient in one trial to have an underlying event rate that is different from that of a patient in another trial, even if they have the same apparent clinical and demographic characteristics and are treated in the same way. Thus, data from patients in different trials may not be exchangeable.^{13}

For these and other reasons, it is inappropriate to consider a collection of trials addressing the same medical question to be a single larger trial. An analysis in which "trial" is a covariate (a fixed effect) would be better. Better still is an analysis that allows for different treatment effects in different trials, with the collection of studies informing each individual study: a random effects, or Bayesian hierarchical, model.

Bayesian statistical methods are inherently synthetic and so are ideally suited for meta-analysis.^{14–17} In a Bayesian hierarchical model, each trial has its own value of a parameter. These values—and therefore, the trials themselves—are regarded as being selected from some larger population (random effects). The focus of a Bayesian hierarchical analysis may be (i) the particular trial under consideration; (ii) the mean treatment effect in the population of trials; or, (iii) the predictive distribution of the results of a future trial having a specified design.

There are many types of Bayesian hierarchical models. For example, the experimental treatment effect (relative to the control) may be assumed to be the same for patients on the same treatment in different trials even though they may have different prognoses. Or the treatment effect in different trials may be modeled as potentially different. In either case, a hierarchical model explicitly allows for trial heterogeneity, but it does not assume trial heterogeneity. In both cases there will be “borrowing of strength” across trials. Such borrowing is not fixed in advance. If the results in the various trials are similar, there will be greater borrowing and therefore estimates and predictions will have greater precision than if the empirical results are disparate.
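The mechanics of borrowing can be illustrated with a simple normal-normal model in which the population mean and spread are treated as known (in a full hierarchical analysis these are estimated, which is why similar trial results lead to a smaller estimated spread and hence more borrowing). The numbers below are hypothetical and the code is an illustrative sketch, not the software used in our analyses:

```python
import numpy as np

def shrinkage_estimates(y, se, mu, tau):
    """Posterior means of trial effects theta_i when theta_i ~ N(mu, tau^2)
    and the observed effect y_i ~ N(theta_i, se_i^2), with mu and tau known."""
    y, se = np.asarray(y, float), np.asarray(se, float)
    w = tau ** 2 / (tau ** 2 + se ** 2)   # weight on each trial's own estimate
    return w * y + (1 - w) * mu           # the rest is borrowed from the population

# Hypothetical log hazard ratios from 3 trials with equal standard errors:
se = [0.15, 0.15, 0.15]
similar = shrinkage_estimates([-0.25, -0.27, -0.23], se, mu=-0.25, tau=0.05)
disparate = shrinkage_estimates([-0.60, -0.25, 0.10], se, mu=-0.25, tau=0.05)
```

Each trial-specific estimate is pulled toward the population mean, and the spread of the shrunken estimates is far smaller than the spread of the raw estimates.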

The application of a Bayesian approach in trial design has markedly increased within the pharmaceutical and medical device industries.^{17} Essentially all large drug companies have taken a Bayesian approach in selected early phase trials, and some are using it for registration trials. On the medical device side, since 1997 the Center for Devices and Radiological Health of the Food and Drug Administration has approved more than 20 Premarket Approval Applications that used Bayesian methods.

One advantage of the Bayesian approach is that it enables researchers to update the state of knowledge as information in the trial accumulates. Such updating can occur at any time or even continuously. At a decision epoch, the current state of knowledge is used to assess the value of future information. Such an assessment is made possible in the Bayesian approach by calculating the probabilities of future observations based on the current information. The Bayesian approach is unique in this regard; probabilities of future observations can be calculated in the more familiar frequentist approach only by conditioning on the values of unknown parameters. Predictive probabilities allow for building adaptive clinical trial designs that require smaller patient populations than traditional designs yet provide at least as much statistical power.^{17} Predictive probabilities also enable building efficient drug and medical device development programs.

We conducted Bayesian meta-analyses of published trials and compared the results to those obtained through a frequentist analysis of the original data. We show the value of our approach to inform coverage decisions and to study comparative effectiveness.

#### METHODS

##### Data Source

We analyzed 12 trials published between 1996 and 2005 that had examined the effect of ICDs on all-cause mortality^{18–29} (see Table 1), and which had been previously considered by Ezekowitz et al.^{12} We specified hierarchical Bayesian survival rate models in the experimental and control arms. Conventional methods use overall efficacy estimates such as rates of mortality as units of analysis. Our approach is more finely tuned to the possibility of differential hazards over time.

We model survival with a piecewise exponential distribution, where each year of follow-up has a different hazard rate. This modeling requires the exposure (person-years) and the number of deaths in each year, for each treatment. For the ICD trials, we used Kaplan-Meier curves to calculate survival probabilities at the end of each year for each treatment arm within each trial and to estimate the exposure of subjects in each year interval. This richer data set enables researchers to consider a variety of statistical models based on varying assumptions.
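As an illustration of this reconstruction step, the sketch below approximates yearly deaths and person-years from published yearly Kaplan-Meier survival probabilities, assuming no censoring and roughly uniform death times within each year; the trial numbers are hypothetical:

```python
import numpy as np

def interval_data_from_km(n0, surv):
    """Approximate deaths and person-years in each yearly interval from
    Kaplan-Meier survival probabilities at year ends, assuming no censoring
    (a simplifying assumption; published at-risk counts refine this)."""
    surv = np.concatenate([[1.0], np.asarray(surv, float)])
    n = float(n0)                      # number alive at the start of the year
    deaths, exposure = [], []
    for j in range(1, len(surv)):
        cond = surv[j] / surv[j - 1]   # P(survive year j | alive at its start)
        d = n * (1.0 - cond)           # expected deaths during year j
        # survivors contribute a full year; deaths about half a year each
        exposure.append(n - d / 2.0)
        deaths.append(d)
        n -= d
    return np.array(deaths), np.array(exposure)

# Hypothetical arm: 500 subjects, yearly survival 90%, 82%, 75%
deaths, pyrs = interval_data_from_km(500, [0.90, 0.82, 0.75])
```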

##### Statistical Computing

These analyses were conducted with custom Fortran code compiled with the Intel Fortran compiler v.11.0.083 (Intel Corporation, Santa Clara, CA). Calculations used standard Markov chain Monte Carlo (MCMC) methodology with Metropolis-Hastings steps.
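A minimal random-walk Metropolis sampler, shown here in Python for a single parameter with a toy log posterior, conveys the idea (it is a sketch, not our Fortran implementation, which applied such steps to all model parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(logpost, theta0, n_iter=5000, step=0.1):
    """Random-walk Metropolis sampler for a scalar parameter."""
    theta, lp = theta0, logpost(theta0)
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()          # symmetric proposal
        lp_prop = logpost(prop)
        if np.log(rng.uniform()) < lp_prop - lp:    # accept w.p. min(1, ratio)
            theta, lp = prop, lp_prop
        draws[i] = theta                            # repeat theta if rejected
    return draws

# Toy target: a log posterior proportional to N(-0.25, 0.1^2) for a log HR
draws = metropolis(lambda t: -0.5 * ((t + 0.25) / 0.1) ** 2, theta0=0.0)
```

After discarding an initial burn-in, the draws approximate the posterior distribution and yield the means, intervals, and tail probabilities reported below.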

##### Base Model

Constant treatment effect. Our base model considers trial heterogeneity by allowing the risk (hazard) of death in the control group to differ from one trial to another. The effect of ICDs (hazard ratio [HR]) is constant over time and across trials. These assumptions will be relaxed in the model variants that follow.

We model the hazard rate over time, which is the rate of events per unit time (years). The ratio of the ICD hazard rate to the control hazard rate is the HR. A value less than 1 implies the ICD has a smaller rate of events; for example, an HR of 0.75 implies a 25% reduction in the hazard rate. The base model (and its variants) allows for different hazards depending on the year of follow-up. Estimating hazards over time provides insights that have public health impacts.

The model is specified by its hazard function, depending on follow-up time t:

λ_i(t, x) = λ_0(t) exp(φ_i + θx)

Here, λ_0(t) is the baseline hazard (piecewise constant within each year), i is the trial number, φ_i is the effect of trial i, and x = 1 indicates the ICD group while x = 0 indicates the control group.

The parameter θ is the natural logarithm of the HR, ICD relative to control. In this base model, the ICD effect is the same over time.

We assume a hierarchical distribution for the trial effects: φ_i ~ normal(μ, τ²) for i = 2, …, 12 (with φ_1 = 0). The prior distributions for the hyperparameters are μ ~ normal(0, 1) and τ² ~ inverse-gamma(0.5, 1.0). The prior distributions for the baseline hazard rates and the treatment effect are effectively noninformative: λ_j ~ gamma(0.001, 0.001) for j = 1, …, 5, and θ ~ normal(0, 100²).
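Because the piecewise-exponential likelihood is equivalent to a Poisson likelihood for the yearly death counts, the unnormalized log posterior of the base model can be written compactly. The sketch below (in Python, for illustration; not our Fortran implementation) assumes data arrays indexed by trial, year, and arm:

```python
import numpy as np

def log_post_base(lam, phi, theta, mu, tau2, deaths, expo):
    """Unnormalized log posterior of the base model. deaths[i, j, x] and
    expo[i, j, x] hold events and person-years for trial i, year j, arm x
    (x = 0 control, x = 1 ICD); a hypothetical layout for illustration."""
    I = deaths.shape[0]
    x = np.array([0.0, 1.0])
    # piecewise-exponential likelihood == Poisson likelihood for the counts,
    # with rate_{ijx} = lam_j * exp(phi_i + theta * x)
    rate = lam[None, :, None] * np.exp(phi[:, None, None] + theta * x)
    loglik = np.sum(deaths * np.log(rate) - expo * rate)
    # priors as specified: phi_i ~ N(mu, tau2), lam_j ~ gamma(0.001, 0.001),
    # theta ~ N(0, 100^2), mu ~ N(0, 1), tau2 ~ inverse-gamma(0.5, 1.0)
    lp = -np.sum((phi[1:] - mu) ** 2) / (2 * tau2) - (I - 1) / 2 * np.log(tau2)
    lp += np.sum((0.001 - 1) * np.log(lam) - 0.001 * lam)
    lp += -theta ** 2 / (2 * 100 ** 2) - mu ** 2 / 2
    lp += (-0.5 - 1) * np.log(tau2) - 1.0 / tau2
    return loglik + lp

# sanity check on simulated data whose true log HR is -0.25:
lam = np.full(5, 0.1)
phi = np.zeros(2)
expo = np.full((2, 5, 2), 100.0)
deaths = expo * lam[None, :, None] * np.exp(phi[:, None, None]
                                            - 0.25 * np.array([0.0, 1.0]))
lp_true = log_post_base(lam, phi, -0.25, 0.0, 1.0, deaths, expo)
lp_off = log_post_base(lam, phi, 0.5, 0.0, 1.0, deaths, expo)
```

A function of this form is all an MCMC sampler needs; the sanity check confirms the posterior is higher at the data-generating value of θ than at a distant value.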

##### Model Variant 1: Time-Dependent Treatment Effect

This model variant relaxes the assumption of a constant treatment effect over time and separately estimates the hazard within each year of follow-up.^{30} This variant addresses the possible temporal nature of the ICD effect: Is there an immediate reduction in hazard or does it become apparent only after some time? Is the benefit sustained? Does the benefit increase or decrease with time since the defibrillator was implanted? As in the base model, the treatment effect is assumed to be the same across trials. The hazard is

λ_i(t, x) = λ_{xj} exp(φ_i) for t in year j of follow-up

Here, the hazard λ_{xj} depends on the treatment x; the hazard functions in the two groups are modeled separately as piecewise exponentials in each year.

The HR in any interval j is λ_{1j}/λ_{0j}. The prior distributions are analogous to those of the base model and are independent across years of follow-up.
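If the trial effects are set aside for illustration, the gamma priors are conjugate to the Poisson death counts, so the yearly hazard rates can be sampled directly and the yearly HR follows by division; the event counts below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def yearly_hr_posterior(d1, e1, d0, e0, a=0.001, b=0.001, n=10000):
    """Posterior draws of the yearly HR lam_{1j} / lam_{0j} under
    independent gamma(a, b) priors and Poisson death counts. This ignores
    the trial effects, so it only illustrates the conjugate update:
    posterior of each rate is gamma(a + deaths, b + person-years)."""
    lam1 = rng.gamma(a + d1, 1.0 / (b + e1), size=n)   # ICD arm hazard
    lam0 = rng.gamma(a + d0, 1.0 / (b + e0), size=n)   # control arm hazard
    return lam1 / lam0

# Hypothetical year: 80 deaths in 1000 ICD person-years vs 100 in 1000
hr = yearly_hr_posterior(d1=80, e1=1000, d0=100, e0=1000)
prob_benefit = np.mean(hr < 1.0)   # posterior P(HR < 1) within that year
```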

##### Model Variant 2: Trial-Specific Treatment Effect

This model variant relaxes the base-model assumption of the same treatment effect across trials.^{30} The ICD effect is constant over time but can vary across trials. Thus, the model quantifies the extent of variability in the treatment effects across trials. Bayesian meta-analyses allow for consideration of specific causes of the variability; this variant allows for heterogeneity without specifying the cause.

In variant 2, each trial has its own efficacy parameter, θ_i, the logarithm of the HR for the ICD in trial i.

These trial-specific parameters are regarded as sampled from a larger population of treatment effects indexed by trial: θ_i ~ normal(β, ς²) for i = 1, …, 12. The prior distributions of the hyperparameters of this population are essentially noninformative: β ~ normal(0, 100) and ς² ~ inverse-gamma(−0.5, 100). The prior distributions of the baseline hazard rates are the same as those in the base model.

The posterior distributions for the parameters are available at any time between or during the trials. These distributions can be used to make inferences about the parameters, including the relative risks. We reported the posterior distribution mean and a 95% probability interval by giving the 2.5th and 97.5th percentiles. We also provided the probability that the HR is less than 1.0. This measure is similar to but fundamentally different from a *P* value used in frequentist analyses. The Bayesian measure is the probability that the ICD is superior to the control on the basis of the observed results, whereas the *P* value is the probability of observing results more extreme than those observed, assuming the ICD is identical to the control.
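Given MCMC draws of the log HR, the reported summaries are direct computations; the sketch below uses simulated draws in place of actual MCMC output:

```python
import numpy as np

def summarize_hr(theta_draws):
    """Posterior summary of the HR from MCMC draws of theta = log(HR):
    mean, 95% probability interval (2.5th and 97.5th percentiles),
    and the probability that the HR is less than 1."""
    hr = np.exp(np.asarray(theta_draws))
    lo, hi = np.percentile(hr, [2.5, 97.5])
    return {"mean": hr.mean(),
            "interval": (lo, hi),
            "prob_hr_lt_1": np.mean(hr < 1.0)}

# Simulated stand-in for posterior draws of the log HR:
rng = np.random.default_rng(2)
summary = summarize_hr(rng.normal(-0.25, 0.05, size=20000))
```

Note that prob_hr_lt_1 is the direct probability of benefit given the observed results, the quantity contrasted with the frequentist P value above.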

We addressed how the estimated ICD effect varies as information accumulates from one trial to the next. We evaluated the posterior distributions over chronological time to assess the extent to which the next trial will alter the current distribution. We also generated predictive distributions for the effect to be observed in a new trial (having a given size) based on the information accrued thus far. These predictions incorporate uncertainty from multiple sources: uncertainty of the true HR of the ICD, uncertainty for a new trial, and uncertainty of the usual sampling variability of the next trial given its underlying characteristics. We addressed the following: (1) the effect of population heterogeneity across trials, (2) the effect of ICDs on mortality and whether it varies or persists over time, and (3) at what point there is reliable/scientifically defensible confidence in the clinical evidence of the treatment's effectiveness.
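The layered simulation below sketches how such a predictive distribution is built; the posterior draws and the events-based approximation to sampling variability (variance of the observed log HR taken as roughly 4 divided by the total number of events) are hypothetical stand-ins, not our fitted output:

```python
import numpy as np

rng = np.random.default_rng(3)

def predict_new_trial(beta_draws, sigma_draws, n_events):
    """Predictive draws of the *observed* log HR in a future trial, layering
    three uncertainties: the joint posterior of the population mean (beta)
    and spread (sigma), a new trial effect theta_new ~ N(beta, sigma^2),
    and sampling noise with variance approximately 4 / n_events."""
    theta_new = rng.normal(beta_draws, sigma_draws)
    return rng.normal(theta_new, np.sqrt(4.0 / n_events))

# Hypothetical posterior draws standing in for MCMC output:
beta = rng.normal(-0.27, 0.08, size=20000)
sigma = np.abs(rng.normal(0.23, 0.10, size=20000))
obs_log_hr = predict_new_trial(beta, sigma, n_events=200)
# 95% prediction interval for the HR observed in the new trial:
interval = np.exp(np.percentile(obs_log_hr, [2.5, 97.5]))
```

Dropping the sigma layer gives the narrower base-model predictions; including it gives the wider variant-2 predictions discussed in the Results.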

#### RESULTS

In Figures 1A and B, we illustrate the posterior means from the base model and its variant 1 in terms of the annual hazards (deaths per person-year) and survival probabilities for the ICD and control groups. Both models suggest that the risk of death is highest in the first year, dropping slightly in subsequent years (the increase in year 5 is likely due to the aging of the population; hazards of death increase with age). A necessary implication of the base model is that the pattern is the same in both treatment arms, with the estimated risk 22.3% lower for the ICD group in each year (each HR is 0.777, with standard deviation 0.036). The base-model posterior probability that ICDs reduce mortality overall (HR < 1) is greater than 0.9999.

Model variant 1 allows different HRs in each year. This can be seen in Figure 1A where the HRs (crossed symbols) are not proportional—the differences between the crossed symbols vary while the open symbols have a constant proportional effect through time. The posterior mean annual HRs from this nonproportional hazards model are 0.81, 0.71, 0.72, 0.99, and 0.88, with respective probabilities of ICD benefit (within that year) of 0.9994, 0.9999, 0.9988, 0.5549, and 0.7431.

The similarities in the patterns of hazards and survival probabilities from the base model and variant 1 evinced in Figures 1A and B are driven by the trial results and not by model assumptions. One might hypothesize that the benefit of ICDs is restricted to the first year because the control patients who died are those who were more seriously ill and who might have benefited most from an ICD. Because they are disproportionately removed from the at-risk population, the control group would compare well in later years. Indeed, such an effect might be hypothesized from the Multicenter Automatic Defibrillator Implantation Trial I,^{18} in which the empirical HR in the first year was 0.15. However, that observation is not borne out by the totality of the results. Indeed, it was not seen with the publication of the next 2 ICD trials, the Antiarrhythmics Versus Implantable Defibrillators Trial,^{19} and the coronary artery bypass graft (CABG) Patch Trial.^{20} Rather, the overall results suggest that the ICD effect is durable for at least the first 3 years of follow-up—the ICD effect is less clear in years 4 and 5.

In variant 2 of the base model, the treatment effect can vary by trial. Figure 2 shows the posterior mean and 95% probability intervals of the HRs for the individual trials. The solid bars represent the base model assumption but ignore the other 11 trials. The open bars show the posterior distributions in the respective trials using model variant 2. The open bars are narrower than the solid bars, representing the borrowing of strength from the other trials—even though the other trials are allowed to have different treatment effects. Also, the point estimates of trial-specific treatment effects using model variant 2 (diamonds in open bars) are regressed toward the mean treatment effect across all of the trials. In particular, they are always closer to the overall mean than are the means from the model ignoring other trials. The base model bar in Figure 2 represents the posterior distribution of the HR in the base model for all 12 trials, as described earlier.

In variant 2 of the base model, the hierarchical distribution of HRs across studies has a mean of *β* with a standard deviation of *ς*. The posterior mean of *β* is 0.763 with a standard deviation 0.061. The standard deviation of the distribution of HRs in the population of trials (*ς*) measures the degree of variability or heterogeneity in the effects. The posterior mean of *ς* is 0.228 with standard deviation 0.097. The probability that the mean ICD effect (*β*) is positive is greater than 0.999. However, the population standard deviation *ς* is moderate in size, reflecting some heterogeneity of the ICD effect across the trials. A new observation from this population of trial-specific effects (*θ*)—that is, a new trial, not 1 of the 12 in Table 1—would be negative with probability about 5%. This means it is possible for a trial in the larger population of ICD trials to have a true HR greater than 1.

Table 2 and Figure 3 show the results from cumulative sequential meta-analyses at the time each of the 12 trials was published. Table 2 gives the posterior mean HR estimates for the base model over chronological time. It also gives relative risk by year of follow-up for model variant 1. The final row includes all 12 trials. Figure 3 shows the HRs and their 95% probability intervals using the base model (solid diamonds). It also shows (open diamonds) the individual trial posterior distributions (base model estimates restricted to the single trial, as in Fig. 2). After the third trial, the posterior mean HR changes very little as new trials emerge, drifting somewhat upward from 0.737 to 0.777. Each subsequent 95% posterior probability interval narrows, reflecting the increased precision. The posterior standard deviation is about halved, from 0.071 after 3 trials to 0.036 after 12 trials.


Figure 4 shows the predictive distribution of each trial, based on the previously conducted trials, for both the base model (open bars) and its variant 2 (solid lines). For example, the open bar for the Canadian Implantable Defibrillator Study^{21} represents the 95% prediction interval for the observed HR for the Canadian Implantable Defibrillator Study for the base model in light of the earlier trials: the Multicenter Automatic Defibrillator Implantation Trial,^{18} the Antiarrhythmics Versus Implantable Defibrillators Trial,^{19} and the CABG Patch Trial.^{20} The corresponding solid line represents the 95% prediction interval from model variant 2. These predictions incorporate uncertainty from multiple sources: the uncertainty in the true HR, the uncertainty of a new trial population, and the usual sampling variability of the next trial. The intervals for model variant 2 are wider because the treatment effects are allowed to be different in the various trials, an extra source of variability.

The prediction intervals from variant 2 covered every observed HR for the next trial. For the base model, only the observed relative risk for the CABG trial lies outside its prediction interval. There is a suggestion of heterogeneity of the treatment effect in the early part of the time period, but the base model predicts well after the third trial. Most of the uncertainty in the base model predictions after the third trial is due to the sampling variability in the trial in question. For example, the prediction interval for the Amiodarone Versus Implantable Cardioverter-Defibrillator trial^{25} in Figure 4 is very wide because that trial had a total of only 16 events.

#### DISCUSSION

The systematic review of clinical trial evidence is an important component of decision making processes in health care, and is specifically useful to inform coverage decisions by a health insurer. Such reviews must properly consider the gamut of uncertainties. The most well understood type of uncertainty is sampling variability in which an observed effect differs from a true population effect. Less well understood, but also important, is the uncertainty in the population effect. This may differ from one circumstance to another, from one trial to another. The Bayesian approach is ideally suited for—and indeed requires—addressing this form of uncertainty. Each trial's effect is a random variable with a probability distribution. This approach properly models the heterogeneity of the trial population. The results of the various trials inform the extent to which they represent a cohesive whole. If their results evince greater heterogeneity, then conclusions about the treatment effect are less strong.

Readers familiar with the systematic review by Ezekowitz et al^{12} will note that our overall findings are consistent with their frequentist meta-analysis of the same trials. Although the models we used can be implemented in a frequentist framework, we demonstrate 4 basic advantages of the Bayesian approach. First, the interpretations are more intuitive, including conclusions about the probability of a treatment effect. Second, Bayesian measures are naturally synthetic and sequential, enabling conclusions after each new piece of information becomes available. Third, Bayesian predictions can incorporate various levels of uncertainty while conditioning on all available data (but not on values of unknown parameters). Finally, modeling is inherent to the approach and computationally straightforward, which makes it natural to consider detailed models. For example, in this synthesis of the ICD clinical trials, we were able to address the question of treatment effect depending on time since randomization. Such modeling improves the strength of inferences about the overall treatment effect as well as informs about the type of treatment effect.

We demonstrated the usefulness of a Bayesian meta-analysis relative to a conventional frequentist approach for a decision-making process. We specified models that quantify the risk of death in the patient population under study, estimated the effect of an ICD over time, and examined and accounted for the degree of heterogeneity in effects across the 12 trials. Despite the moderate amount of heterogeneity we observed across the trials, we found stability of results after the first 3 trials had been conducted. This stability enabled reasonable predictions of the results of future trials. The chronological analyses bring out the effects that single trials had on the overall assessment of the ICD effect. In fact, the posterior distribution of the ICD HR did not change much after only 3 to 4 studies. The remaining 8 to 9 studies very likely were not needed to make an assessment of the efficacy of ICDs.

We also demonstrated that the Bayesian approach enables the assessment of the value of conducting an additional trial, which is an important feature that relates to decisions for conditional coverage such as those instituted by the CMS^{10} in the United States and by the National Institute for Health and Clinical Excellence^{9} in the United Kingdom. This feature tracks nicely with the present US public policy initiative to create a dynamic “learning health care system.”^{5} As envisioned by the Institute of Medicine, this health care delivery system will provide continuous feedback of positive, negative, and, if desired, economic evidence of medical practice. This “learning” concept is essentially Bayesian in nature, and is consistent with a payer's conditional coverage notion which, in effect, systematically updates knowledge as it accumulates to inform the coverage decision. Our study provides an example of how a Bayesian meta-analysis is well suited to accomplish such an objective. Furthermore, should a decision maker wish to factor in the cost of adding another trial, he or she could readily merge a value of information analysis, which is a useful companion to a Bayesian meta-analysis.

Conclusions such as those drawn from Figures 3 and 4 will be helpful to readers of journal articles as well as to policy makers. Clinical trialists might reasonably publish (i) their trial results, (ii) the updated state of knowledge regarding the medical question at issue based on their trial results (as in Fig. 3), and (iii) the predictive distribution of the results on another trial (as in Fig. 4), perhaps one of reasonable sample size and one of large sample size.

#### ACKNOWLEDGMENT

The authors thank Craig Hunter, United BioSource Corporation, for coordinating this project.