Nonrandomized studies in pharmacoepidemiology and comparative effectiveness research generally assess the relative benefits and risks of one medication versus another. However, for many conditions, three or more appropriate treatment choices may be available; in these cases, patients, clinicians, and policymakers require a rigorous assessment of the full breadth of treatment options. A study methodology is required that both reflects the underlying clinical decision-making process and minimizes the potential for confounding.
Propensity score matching of two treatment groups is a common technique for removing confounding bias in nonrandomized studies. The process of propensity score matching, unlike stratification or regression, yields a patient cohort that is balanced with respect to measured covariates1–6 and is also in “clinical equipoise”; that is, the patients should each be plausible candidates for each of the treatments under study. This population may be of particular interest in comparative effectiveness research. Propensity score matching may result in decreased precision as cohort sizes decrease, but in many circumstances, propensity score matching will also improve validity by excluding patients who are treated contrary to expectation (eg, those with a high predicted probability of treatment but who remain untreated).7 These excluded patients could be atypical in some way or may have extreme propensity scores as an artifact of important unmeasured confounding.7
As traditionally defined, propensity scores predict patients’ probabilities of getting one treatment versus a single alternative. Imbens proposed the generalized propensity score to account for multiple levels of treatment.8 In the case of three categorical treatments, the generalized propensity score is estimated with multinomial logistic regression. His work suggests (and others have demonstrated)9 methods for generalized propensity score adjustment based on regression, stratification, and weighting. Imbens8 notes that “matching approaches … appear less well suited to the multi-valued treatment case” (p. 709). We hypothesize that this may be, in some part, because of the computational challenges of matching three, four, or more groups; as the number of groups grows, the complexity of the match increases exponentially. A three-group match with 5000 patients per group would have 125 billion possible match configurations. Lu and Rosenbaum10 present a way to efficiently match with one treatment group and two control groups; a treated patient is matched to a single control chosen optimally from one of the two control groups. Although their work is related to what we present, their study design differs from the clinical scenario we sought to investigate.
If matching in a three-group study is desired, there is a core question of whether to simply rely on pairwise contrasts—in a three-way study, there are three—or to design a matching process that creates trios of patients all of whom, on average, will have similar covariate vectors. From a statistical perspective, the pairwise contrast should maximize precision; assuming no unmeasured confounding and no treatment effect heterogeneity, the pairwise contrast should also yield unbiased estimates. Practically, however, these assumptions will often not hold: despite investigators’ best efforts, there is generally some amount of unmeasured confounding and treatment effects may be similar but not entirely homogeneous. In these cases, a minimally biased estimate of the effect in the subpopulation of patients who could reasonably be treated with any of the study drugs may be most clinically meaningful. This estimate would reflect the design of a three-arm randomized trial where patients are in equipoise at the time of randomization.
We describe an efficient new algorithm for caliper-based nearest neighbor propensity score matching on three treatment groups, using the generalized propensity score. We examine the performance of the algorithm through a Monte Carlo simulation study and apply the method to a study of the safety of three classes of analgesics in the treatment of rheumatoid arthritis.
Approaches to Estimating Treatment Effects
A three-group study of Treatments 1, 2, and 3 yields three possible effects of interest: 1 versus 2, 2 versus 3, and 1 versus 3. We considered three approaches to estimation of the treatment effects. We began with a simple pairwise approach in which we assessed the three possible contrasts separately. For each contrast, we estimated a propensity score: using logistic regression, we predicted treatment as a function of the measured confounders and used the predicted values from the model as the propensity score.5,11 We then matched on propensity score using caliper-based 1:1 nearest neighbor matching.1,12–14 This approach will attempt, for example, to find each Treatment 1 patient’s nearest Treatment 2 analog, as measured by distance (change in) propensity score, with a maximum distance defined by a caliper (Figure 1).
We then considered a variant of this method, which we term the common-referent approach. We considered Treatment 1 to be the referent treatment. Using the Treatment 2 versus 1 and the Treatment 3 versus 1 propensity-matched populations from the prior step, we extracted patients treated with Treatments 2 or 3 who had a common match of a patient who was treated with Treatment 1.15,16 We created a single cohort of these patients and their Treatment 1 matches. Because of the additional restriction of the common Treatment 1 referent patient, this cohort was generally smaller than the pairwise source cohorts. With the link of the Treatment 2 or 3 patients through their shared Treatment 1 analog, we hypothesized that this cohort of patients would have a greater base of “common support” than the pairwise approach and that this common support may lead to a better characterization of the treatment effect in the case of treatment effect heterogeneity.
Finally, we created a three-way matched cohort by using an algorithm developed by two of the authors (A.A.S. and J.A.R.). The algorithm requires two generalized propensity scores for each patient: probabilities of being treated with each of Treatments 1 and 2, which together determine the probability of Treatment 3. We estimated these propensity scores using multinomial logistic regression.8,17 The algorithm we created matches without replacement and performs “within-trio” optimized matching in a two-dimensional space. The goal is to find the trios of patients—one receiving each of Treatments 1, 2, and 3—with the smallest within-trio distance. Because the number of required computations scales exponentially with number of groups, a major goal of the algorithm is to limit the search space to find guaranteed optimal trios without the need to consider all possible combinations. We accomplished this by using algebraic analogs of computational geometry. The source code, plus SAS macros and R example code are available online.14 The algorithm has the flexibility for the user to specify a distance function18; in this study, we used the perimeter of the triangle formed by linking all three patients as the distance (Figure 2). Note that the algorithm does not seek to “globally” minimize distance over the entire cohort, as this would introduce great complexity and may provide only a small increase in confounding adjustment.
We performed a Monte Carlo simulation study in which we sought to characterize the bias, variance, and mean squared error (MSE) of these three approaches in the optimal circumstances of no unmeasured confounding and no treatment effect heterogeneity, as well as the cases of small amounts of unmeasured confounding and heterogeneity.
In each run of the simulation, we created 20,000 patients. Following a design derived from a 1:1 matching simulation published by Austin,13,19 we assigned each patient’s exposure using a multinomial distribution with three cases of expected exposure prevalence:
- P (Treatment 1) = 1/3; P (Treatment 2) = 1/3; P (Treatment 3) = 1/3
- P (Treatment 1) = 49%; P (Treatment 2) = 49%; P (Treatment 3) = 2%
- P (Treatment 1) = 2.5%; P (Treatment 2) = 2.5%; P (Treatment 3) = 95%
We then simulated 10 continuous positive confounders. All confounders were normally distributed. The first confounder had an expected value of 0 for patients with Treatment 1, 1 × 0.05 for patients with Treatment 2, and 3 × 0.05 for patients with Treatment 3. The second confounder used a base expected value of 0.10, the third a value of 0.15, and so forth. Under this construction, the difference in mean values of the confounders was larger for Treatment 2 versus Treatment 3 when compared with Treatment 1 versus Treatment 2. We selected three cases for the standard error of the confounders to create differing levels of overlap of the resulting propensity score; we used standard errors of 0.75 (small overlap), 1.5, and 3.0 (large overlap). The total measured confounding effect for each patient i was calculated as the value of each confounder times an increasing beta, from 0.125 for the first confounder to 1.25 for the tenth:
We then created a normally distributed unmeasured confounder; to be of moderate strength with respect to the other confounders, we set its expected value to that of the fifth measured confounder. We varied its beta
from 0.00 to 1.00 in increments of 0.25 to achieve an unmeasured confounding effect of:
We then created varying levels of treatment effect heterogeneity. The interaction was between confounder C5 and treatment X and grew in strength with each level of X:
The continuous expected treatment effect for each patient E(Yi) was defined as:
We then created a preliminary continuous outcome effect YP, defined as:
Finally, we standardized the values of YPi to be mean 0 and simulated an error term
with the standard error set such that the size of the errors would be in proportion to the size of the outcome effect. We set the outcome for each patient to:
For the pairwise approach, we estimated three propensity scores. As recommended in earlier literature,20 we matched on the logit of the propensity score and used a matching caliper of 0.2 times the standard deviation of the logit of the propensity score, calculated as
, where the σ2 values were the variance of the logit of the propensity score observed among patients in each cohort’s two treatment groups. We then used these same matched cohorts to create a single cohort for the common-referent approach. We finally created a single three-way matched cohort using multinomial logistic regression to estimate each patient’s set of propensity scores. For this match, we set the caliper to be 0.6 times
, where the τ2 values were the average of the two observed variances for patients’ two propensity scores, with a τ2 value for each treatment group. We chose the figure 0.6 as three times the recommended value of 0.2 because a patient trio’s distance is calculated by traversing three legs rather than just one (compare the distances in Figure 1 vs. Figure 2).
We chose several key metrics to judge each approach’s success. First, we calculated the standardized means of each covariate in each group and recorded the mean sum of difference in standardized means among the groups as a measure of overall covariate distance.21 Lower values indicated cohorts with more similar values of measured confounders.
We then calculated the percentage bias in each of the three contrasts when compared with the known true treatment effects. We also noted the observed variance of the estimates across the 1000 simulations per scenario and the MSE. In cases without treatment effect heterogeneity, characterizing the true treatment effect was straightforward because the treatment effects were homogenous across the population. In cases where we introduced treatment effect heterogeneity, we defined a subpopulation in which to calculate the true treatment effects. We considered that a contrast among three treatment groups is most pertinent among patients with the opportunity to get any of the three treatments—in randomized trial terms, this would be the subpopulation of patients for whom there is equipoise. To find the treatment effect in this group, we sought to identify the group’s members by examining the multinomial propensity score estimated in the full, unmatched cohort.
In a two treatment study, if the distribution of observed propensity scores is plotted by treatment group (Figure 3), the areas where the treatments groups’ distributions overlap indicate this clinical equipoise because these are patients for whom either treatment may have been appropriate. In a three-group study, this overlap is harder to identify. With the three-group cohort’s two propensity score values, P(Treatment 1) and P(Treatment 2), we created 50 equally sized bins for each treatment’s range of observed probabilities and assigned each patient to a bin for each probability. We identified the overlapping patients as those whose assigned bins each contained at least two patients from each of the three treatment groups, indicating the existence of comparable patients.
We then applied the matching schemes to an empirical dataset drawn from a database of insurance claims for prescription drugs, medical services, and hospitalizations. We studied three treatments for pain among rheumatoid arthritis patients. The study had a series of 14 serious individual outcomes—including myocardial infarction, stroke, gastrointestinal bleed, falls, and fracture—as well as six composite outcomes. These patients were drawn from participants in Pennsylvania’s Pharmaceutical Contract for the Elderly program, a state-wide program that provides prescription drug benefits to low-income elderly residents, from the years 1999 to 2005. Data on drug usage are linked to data from Medicare Parts A and B and together provide a longitudinal view of patients’ drug use, medical procedures, hospitalizations, and associated diagnoses. In this cohort, we measured exposure to nonselective nonsteroidal anti-inflammatory drugs (NSAIDs), COX-2 selective inhibitors (coxibs), and opioids. This cohort and the outcomes have been described previously.15,16
Using a series of measured covariates, we estimated both pairwise propensity scores and a three treatment generalized propensity score following the methodology described above. We matched on propensity score using the three-way matching algorithm and also performed the three pairwise matches with a fixed caliper of 0.025. We also matched with a common-referent group drawn from the nonselective NSAID patients. We computed the balance achieved with each type of matching, as measured by the Mahalanobis distance. We also estimated treatment effects for the range of outcomes and compared observed treatment effects among the matching approaches. Observed differences in treatment effects may have been because of chance, elimination of bias, or as a consequence of treatment effect heterogeneity.
All analyses were carried out in R version 2.13. All programs and source code for the matching algorithms described are available as part of the Pharmacoepidemiology Toolbox,14 available at http://www.drugepi.org. Simulation results were analyzed on a Netezza data warehouse appliance (IBM Netezza, Marlborough, MA), and we used Tableau Professional (Tableau Software, Seattle, WA) for data visualization.
We ran 1000 simulations per combination of input parameters. Across all simulation runs (Table 1), including those with strong unmeasured confounding, we observed a bias in the unmatched cohort ranging from 247% to 764%, and an MSE (×100) of 6.16–58.46. With three-way 1:1:1 matching, the bias was reduced to 4.5–22.0%, with an MSE (×100) of 0.06–0.18. The pairwise approach reduced the bias to 3.5–24.2%, with an MSE (×100) of 0.02–0.21. The pairwise approach had the lowest bias in the Treatment 2 versus Treatment 1 contrast, for which the three-way approach had the highest bias. This pattern was reversed for the Treatment 3 versus Treatment 2 contrast. The common-referent approach had large biases (4.7–28.8%) and also substantially larger standard deviations of the observed bias. Because of its inconsistent performance, particularly in cases where there was a high prevalence of Exposure 3 and thus a small prevalence of the common-referent exposure, we do not consider the common-referent approach further. Full results including the common-referent approach are in eFigure 1 (http://links.lww.com/EDE/A656).
Considering some of the more specific cases (Figure 4 and eFigures 1 and 2, http://links.lww.com/EDE/A656), we observed near-zero bias for both the pairwise and three-way approaches in the simplest simulation scenario (equal numbers of patients in each exposure group, broad propensity score overlap, no unmeasured confounding, and no treatment effect heterogeneity). Despite sizeable differences in mean cohort size—a mean of 12,744 matched patients for the pairwise approach versus 9,569 for the three-way approach—both approaches had a low MSE × 100 of 0.2.
The methods’ performance began to diverge as treatment effect heterogeneity increased, the propensity score overlap decreased, or prevalence of exposures became unequal. In many cases, performance with respect to bias was substantially similar or the three-way technique offered lower bias. As one example, in the case of the strongest treatment effect heterogeneity, less propensity score overlap, and equal exposure prevalences, the mean bias in the three-way approach was 13.6%, whereas the pairwise approach’s bias was 18.1%. However, because of the lack of clinical equipoise reflected in the low propensity score overlap, the number of matched patients decreased substantially in the three-way case, so that MSE × 100 increased from 36.6 in the three-way case from 34.8 in the pairwise case. In other cases, the three-way MSE was lower than the pairwise MSE; in a small number of cases, the mean bias in the three-way case was slightly higher than the pairwise case.
As the strength of unmeasured confounding increased (eFigure 1, http://links.lww.com/EDE/A656), both methods displayed higher bias overall. On a relative basis, however, we saw substantially similar patterns as in the other cases: the three-way approach generally had lower bias than the pairwise approach, sometimes sharply so, except in the cases of strong treatment effect heterogeneity. However, in the presence of strong unmeasured confounding, the magnitude of the bias generally decreased as the strength of treatment effect heterogeneity increased.
Our method for evaluating the true treatment effect in the presence of treatment effect heterogeneity involved selecting a group of patients whose propensity scores indicated that they had the opportunity to get any of the three available treatments. The number of patients in this group was generally a large proportion of the available patient population, but this number decreased in the scenarios with the 2% probability of patients’ receiving Treatment 3 (eFigure 3, http://links.lww.com/EDE/A656).
In the unmatched cohort, there were substantial differences among initiators of nonselective NSAIDS (n = 4,874), coxibs (n = 6,172), and opioids (n = 12,601), as displayed in Table 2. When compared with nonselective NSAID users, opioid users were older (average age, 81.2 years vs. 79.7 years), sicker (average comorbidity score, 2.2 vs. 1.6), and more likely to have had specific preexisting conditions (9.6% had a prior MI vs. 5.2%). Coxib initiators were between these extremes; they had an average age of 80.9 years and an average comorbidity score of 1.7. The mean Mahalanobis distance—a measure of the standardized difference in measured variables, accounting for covariance—between the opioid and nonselective NSAID group was 34.04. It was 18.97 between the opioid and coxib group. The coxib and nonselective NSAID group had a distance of 8.69.
The pairwise matching approach lowered these distances substantially: after matching, the distances ranged from 0.26 to 0.61. The distances with the three-way approach were similar but somewhat higher, with a range of 0.33–0.96. The common-referent approach had yet again higher distances, ranging from 0.49 to 1.56.
Although we observed variation in the balance achieved, treatment effects were largely similar (eTable 1, http://links.lww.com/EDE/A656). The observed composite cardiovascular hazard ratio was 1.14 (95% confidence interval = 0.95–1.38) for the opioids versus coxibs contrast with the three-way matching approach; with the common-referent group approach, it was 1.06 (0.88–1.28). It was 1.18 (1.01–1.38) with the pairwise approach. Note that the cohorts may have included different patients—as seen in Table 2, the cohort sizes ranged from 4,582 to 6,104—and so variations in the point estimate may reflect randomness, residual confounding, or treatment effect heterogeneity.
We developed and tested an algorithm for simultaneously matching groups of three patients on propensity score, to create cohorts of exchangeable patients among whom the comparative effects of three exposure groups could be studied. In a simulation study, we compared our new method to an existing approach and observed that in many cases our three-way matching approach yielded lower or equal bias with little or no cost to MSE. We saw particular benefit to the new technique when there were large imbalances in the prevalence of the exposures under study. In an empirical study, we observed strong covariate balance after matching with approaches including our three-way matching approach. Based on the results of our study, we would recommend the three-way matching approach for estimating treatment effects with three exposure groups.
An important question when performing three-way matches is the group for whom the treatment effect should be estimated. We considered two main approaches. First, because a multiway study reduces to a number of pairwise contrasts, we considered a pairwise matching approach that maximized the use of the available data and was thus expected to provide the most precise estimates. In the case of no treatment effect heterogeneity, the estimated treatment effect should not have been affected by which patients are included and excluded by the match, and thus the pairwise approach may also be optimal with respect to bias. When there was treatment effect heterogeneity, the process of including and excluding certain patients affected the estimated effect, because the true treatment effect is affected by the “window” of patients under consideration. In this case, we view the choice between the pairwise approach and the three-way approach not as one of tools, but rather of the desired result. If the investigator wishes to know which treatment is best for a patient who could reasonably get any of the three treatments, then the three-way match may be a more appropriate technique as it implicitly limits the study cohort to those patients who meet this criterion. In trial terms, this is equivalent to requiring equipoise at baseline for all three treatment arms rather than running three trials with inclusion and exclusion criteria set per pair of treatments. If the investigator does not wish to impose this requirement or wishes first and foremost to maximize precision, then a pairwise approach might be preferable.
One challenge to quantifying bias when there is treatment effect heterogeneity was that there is no clear definition of what the true treatment effect should be, because the true average treatment effect would vary based on which patients were included in the average. We took a pragmatic approach. To quantify the true average treatment effect in the area of common support—that is, those patients in a state of clinical equipoise—we chose to let the propensity scores guide us. A strict application might require finding patients in each of the three treatment groups with exactly the same or very similar combination of scores and defining the population’s true average treatment effect among these patients. Because we thought this approach was overly restrictive—and because it very closely resembled our three-way matching process, thereby favoring the matching process’s results—we took a broader approach that required each of the patients’ two scores to have other patients with at least one of their scores in the same range. In our study, we did observe some increased bias by using three-way matching in cases of treatment effect heterogeneity; this increase may have reflected true increases in bias or may have reflected our definition of truth. Alternative methods of defining truth in the presence of treatment effect heterogeneity may have yielded different results.
One limitation of our approach is the maximum of three study groups. Although three may suffice for many cases, other comparative effectiveness questions may better be answered by four or more groups. Revised software, currently in preparation, will be required to handle the exponential increase in potential matches. Another limitation is the estimation of variance; although in our simulation study we were able to observe variance over the many simulation runs, in the usual practice of epidemiology, other methods such as simultaneous confidence intervals or bootstrapping may be required (eFigure 4, http://links.lww.com/EDE/A656).
The evolving use of evidence-based medicine requires that the underlying epidemiology and comparative effectiveness research methods keep pace with the clinical questions asked by physicians, policymakers, regulators, and patients. We have presented a method of comparing three treatment choices simultaneously and contrasted our new method with standard approaches. We believe that our three-way matching approach offers a clear and clinically meaningful way to answer questions of drugs’ comparative safety and effectiveness and would recommend its use over pairwise or common-referent approaches in many study scenarios.