Estimating the causal effect of an exposure (vs. some control) on an outcome using observational data often requires addressing the fact that exposed and control groups differ on pre-exposure characteristics that may be related to the outcome (confounders). Propensity score methods have long been used as a tool for adjusting for observed confounders in order to produce more valid causal effect estimates under the strong ignorability assumption. In this article, we compare two promising propensity score estimation methods (for time-invariant binary exposures) when assessing the average treatment effect on the treated: the generalized boosted models and covariate-balancing propensity scores, with the main objective to provide analysts with some rules-of-thumb when choosing between these two methods. We compare the methods across different dimensions including the presence of extraneous variables, the complexity of the relationship between exposure or outcome and covariates, and the residual variance in outcome and exposure. We found that when noncomplex relationships exist between outcome or exposure and covariates, the covariate-balancing method outperformed the boosted method, but under complex relationships, the boosted method performed better. We lay out criteria for when one method should be expected to outperform the other with no blanket statement on whether one method is always better than the other.

# The Right Tool for the Job

## Choosing Between Covariate-balancing and Generalized Boosted Model Propensity Scores

- Free
- SDC

## Abstract

Epidemiologic studies often focus on estimating the impact of an exposure of interest on a health outcome, such as the impact of radon exposure on lung cancer^{1} or of second hand smoking on heart diseases and mortality.^{2} Because the processes that determine individuals’ exposures may also directly impact their outcomes of interest, great care is often taken in the analysis of observational data to ensure that exposure effect estimates reflect a true causal association rather than merely the effects of confounding. Propensity score methods are widely used in epidemiology and other fields to improve causal effect estimates by balancing exposed and unexposed groups based on a rich set of pre-exposure characteristics. For time-invariant dichotomous exposures (or treatments, in the more common terminology), the critical step in propensity score analyses is the estimation of the propensity score itself, which is defined as the probability that each individual is exposed or receives the treatment versus not, as a function of observed pretreatment covariates.^{3}^{,}^{4} In recent years, methodological advances have produced propensity score estimation methods that substantially improve upon the standard logistic regression models that underpinned the original work in the field. Comparative studies^{5}^{,}^{6} have found that two promising methods are generalized boosted models (GBM)^{7} and covariate-balancing propensity scores (CBPS).^{8} However, the literature gives practitioners little information on which empirical situations call for which of these two methods.

Through extensive simulations—involving some 4 million simulated datasets—this article seeks to give analysts a measure of clarity when choosing between the two methods in the case of time-invariant exposure or treatment. As opposed to claiming that one method is better than the other (which we do not believe is true), this article lays out both qualitative and quantitative criteria for when one method may be expected to outperform the other. In the end, we believe that both methods should be in the toolbox of epidemiologists. Our goal is to offer context-dependent advice on which tool to reach for first and also provide performance results. This study was approved by the Human Subjects Protection Committee at the RAND Corporation.

## BACKGROUND

### Inverse-probability Weighting for Time-invariant Binary Treatment

Assuming a time-invariant binary treatment variable of interest

taking value 1 if a study participant is in the treatment and 0 otherwise, and assuming a set of confounders

, the propensity score

is defined as the probability of being assigned to the treatment conditional on

or formally

. Strong ignorability of treatment assignment^{4} states that (1) all confounders are captured in

and (2) there is a positive probability of receiving the treatment for all values of

(

for all

). Under this assumption, the propensity score is all that is needed to control for pretreatment differences between the treatment and the control groups, which may be operationalized through matching, weighting, or subclassification.^{9}

This study focuses on weighting comparisons because both boosted model and covariate-balancing methods were developed with optimizing weighted balance in mind. With weighting, once an estimate

of

is obtained, a function of the propensity score is used as a weight in the estimation of the difference in the outcome between the treatment and control groups, a difference that will be a consistent estimator of the treatment effect under the strong ignorability assumption. The most common estimands of the treatment effect are the average treatment effect on the population of interest and the average treatment effect on the treated. For average treatment effect, the weight equals

, while for average treatment on the treated, it equals

.^{10–12} In this study, we will conduct comparisons between weighted estimators using boosted models versus covariate-balancing method for estimating the average treatment on the treated.

### Propensity Score Estimation Using Covariate-balancing Propensity Scores

To estimate the propensity score, researchers typically posit a parametric model that is linear in the unknown parameters and is fit via maximum likelihood. The logistic model is most popular, where

One of the limitations of parametric modeling is that the model may be misspecified. The covariate-balancing method developed by Imai and Ratkovic,^{8} which models exposure while optimizing the covariate balance, confers robustness to mild model misspecification with regard to balancing confounders compared with direct maximum likelihood estimation. Further, even if the model is correctly specified, the covariate-balancing method can improve the covariate balance in observed datasets and potentially improve the accuracy of estimated treatment effects over traditional logistic regression.

In addition to maximizing the model likelihood, the covariate-balancing method incorporates a balance condition for the weighted means of the covariates in the parameter estimation procedure. Covariate balancing uses a generalized method of moments or an empirical likelihood estimation framework^{13}^{,}^{14} to find estimates that come closest to optimizing the likelihood function while meeting the balance condition simultaneously.

### Propensity Score Estimation Using Generalized Boosted Models

Generalized boosted model is a nonparametric, piecewise constant model for predicting the treatment

.^{7} This method builds the propensity score model iteratively, starting from a globally constant model, and gradually increasing model complexity. At each step, a simple regression tree^{15} is added to the current model gradually creating an increasingly complex piecewise constant function. In propensity score applications, generalized boosted model’s complexity is tuned by optimizing covariate balance between the inverse-probability-weighted treatment and control samples.^{16}^{,}^{17} The boosted model’s tree-based approach avoids making any assumptions of linearity in unknown parameters and can automatically accommodate interactive effects in the propensity score model. In addition to flexibility, by using a piecewise constant approximation to the propensity score, the boosted models method estimates “flatten out” over areas of the covariate space where few observations are available, often resulting in more stable propensity score estimates.

## STUDY DESIGN

To assess the performance of the standard implementations of covariate-balancing propensity scores and generalized boosted models for estimating propensity scores, we conducted a simulation study to test conjectures about the impact of five factors on the relative accuracy of treatment effects estimated using weights derived from propensity scores estimated by the boosted model or the covariate-balancing method (Table 1). In Table 1, extraneous variables refer to variables used in the modeling but unrelated to either the treatment (outcome-only predictors) or the outcome (instrumental variables) or both (distractors). When the covariates are weakly related to treatment, we expect both methods to perform well because the treatment and control groups are highly similar even without weighting. However, given the nonparametric aspect of the boosted models method, it may possibly suffer inefficiencies in these circumstances while trying to fit possibly nonlinear relationships that are better approximated via linear ones.^{5}

## DATA GENERATION

### Simulated Predictors

We used a modification of the simulation structure described by Setoguchi et al^{18} (2008) and replicated by others (Lee et al^{5} and Wyss et al^{6}). For each simulation, we generated 15 predictors as a mixture of continuous and binary variables (see more details in eAppendix; https://links.lww.com/EDE/B256, Table 1). The continuous predictors were standard normal, and the binary variables were dichotomized (at cut point 0) versions of standard normal random variables.

### Simulation of Treatment

We generated treatment assignments using one of seven versions for the propensity score model. All seven versions were of the form

, where version was a function of the confounders that established the complexity of the association between them and treatment assignment. The versions vary from simple linear combinations of covariates up to very complex relationships with nonlinearity and interaction between covariates (Table 2). The parameters

have values between −0.8 and 0.8. The variable

controls the strength of the covariates for predicting treatment through *τ* which takes one of six values *τ* = 0, 0.25, 0.5, 1, 1.5, 2. For each version, smaller values of *τ* correspond to a stronger relationship between treatment assignment and the covariates.

### Simulation of Outcome

We generated outcomes

, using five different models for

, where

ranging from linear additive to nonlinear and nonadditive (Table 3). The parameters

had values between −0.73 and 0.71 and the intercept

. The treatment effect

was assumed additive and equals to

. For each outcome specification,

and 20 different

were considered ranging from 0 to 5 (0.25 point increments) to control the amount of error variance.

Overall, the simulation scenarios were composed of seven versions crossed with six different levels of random variability to specify the propensity score model and five outcome models with 20 different error variances for a total of 4200 scenarios. For each scenario, we generated

simulated datasets and used the methods being compared with estimate propensity scores and treatment effects.

## ESTIMATION

The propensity estimation strategies were designed to assess the impact of the set of variables included in the propensity score estimation model. See a detail description of the strategies in eAppendix; https://links.lww.com/EDE/B256, Table 1. In practical situations, researchers make a decision on what variable to include in the estimation procedures, and these strategies mimic some of those decisions. The options of variables included considered were

- (1) Only confounders directly related to both the outcome and treatment;
- (2) Variables only directly associated with the treatment, including the instrument
- ;
- (3) Only variables associated directly to the outcome;
- (4) All variables directly or indirectly related to both outcome and treatment;
- (5) All variables related to the outcome or to the treatment; and
- (6) All available variables including distractors.

For each strategy, we used both the boosted model and the covariate-balancing methods to estimate the propensity score models, which were then used to generate the average treatment on the treated inverse-probability weights.

## COMPARISON OF GENERALIZED BOOSTED MODELS AND COVARIATE-BALANCING PROPENSITY SCORES METHODS

### Performance and Comparison Measures

To evaluate the performance of both methods on the different propensity score versions being considered, we compared the measures on how well they balanced the covariate distributions between the treated and nontreated groups and on the bias and accuracy of the resulting treatment effect estimates defined as follows:

The average standardized absolute mean difference (referred to as standardized mean difference henceforth) measures differences in group means of the covariates. To calculate the standardized mean difference, for each covariate, we first computed the absolute value of the difference between the average treatment on the treated weighted treatment and control group means, then standardized this difference by the standard deviation of the covariate in the treatment group, and finally averaged these values across all covariates.

The Kolmogorov-Smirnov statistic is a nonparametric measure of the difference in the entire univariate distribution of confounders between the treatment and the control group and is defined as the largest absolute difference in the weighted empirical cumulative distribution function of the groups. Because it is sensitive to any difference in the entire distribution of a covariate, rather than just the mean as in the standardized mean difference, it can be important for detecting imbalance when nonlinear transformations of confounders are in the confounding scheme. For the Kolmogorov-Smirnov statistic and the standardized mean difference, balance is calculated for each simulated dataset and averaged over datasets within a scenario.

The variance of the weights has the potential to increase the variability of estimated treatment effects and can indicate large influence for a small number of observations. For each simulated dataset, we estimated the sample standard deviation of the weights for the control group and report the average over the datasets in scenario.

Bias is the difference between the expected value of the estimated treatment effect and the true effect set at −0.4. We report the absolute value of the average over the simulated datasets in each scenario.

Standard error is the standard deviation of the estimated treatment effects across the simulated datasets in each scenario.

Root mean squared error (RMSE) is the square root of the expected value of the square of the difference between the estimated treatment effect and the true value. It captures both the bias and the precision of an estimator. It was estimated by taking the square root of the mean of the squared differences between estimated and true treatment effects from the simulated datasets in each scenario.

Relative bias, RMSE, and standard error: In order to compare the performance statistics (bias, RMSE, and standard error) between the boosted model and the covariate-balancing methods, we used the log-relative values of the statistic defined as

where for the statistic of interest,

was the estimated value for the covariate-balancing method and

was the estimated value for the boosted models method. Negative values of the log-relative statistics were indicative of better performance of covariate-balancing when compared with boosted models and positive values indicative of dominance by the boosted models.

### Measures of Model Complexity

To assess the impact of the complexity of the data generating structures on the relative performance of the boosted models and the covariate-balancing methods, we proposed a one-dimensional measure of complexity that summarizes the different simulation parameters and that can be estimated by practitioners on any data. We defined it as the measure of the proportion of variance that could have been explained in a model (*R* square) if polynomial models were considered, relative to the linear specification or logistic regression model. For the complexity of the relationship between the outcome and the predictors or in the propensity score model, we used the *R* squares from linear models and the pseudo *R* squares from logistic regression models,^{19}^{,}^{20} respectively. For a simulated data and a specific strategy, the following steps were used to measure complexity:

- (1) A model was fit with the strategy predictors as specified in the data generating section linearly entered in the regression and the model R square (called ) recorded.
- (2) Then a second model was fitted with the linear predictors as well as their quadratic and cubic power and all interactions between the continuous predictors included. The model R square (called ) was also recorded. This model will potentially capture most nonlinearity that may exist in the relationship.
- (3) The model complexity equals the difference in R squares

between the two estimates where values close to 0 reveal data models with relatively little complexity that can be well fitted with simple linear combinations and large values reflect very complex models. For the outcome and propensity score models, we denote such complexity indices by

and

, respectively, and note they will be specific to each dataset generated.

The impact of model complexity on bias, RMSE, and standard error was contrasted between the different values of

and

(eAppendix; https://links.lww.com/EDE/B256 for details).

## RESULTS

### Overall Performance

We first report in Table 4 the results of simulations with

and an additive linear outcome model (outcome 1), a setting without nuisance parameter and similar to simulations reported by Lee et al^{5} and Wyss et al.^{6} We report the summary of the results for the treatment generation versions A, E, and G; the other versions were in similar directions. For covariate balance, for both boosted models and covariate-balancing methods across all different treatment generation versions considered from very simple to very complex (in a logit scale), both methods obtain the desired covariate balance with observed standardized mean differences below the commonly used threshold of 0.1 for acceptable balance.^{21} The covariate-balancing method tended to produce better balance as measured by the standardized mean difference than the boosted models method. For the boosted models method, the average standardized mean difference ranged from 0.052 to 0.088, whereas for the covariate-balancing method, it ranged from 0.005 to 0.032. These results are similar to the ones reported by Lee et al^{5} and Wyss et al.^{6} In contrast, in all the simulations, except for very few cases, the Kolmogorov-Smirnov statistics were slightly smaller for boosted models than covariate balancing, suggesting that the covariate-balancing method tended to balance the means of the distributions of the covariates, whereas the boosted models method slightly better balanced the full distributions of the covariates. Also across the board, including instruments or distractors in the propensity score estimation (strategies 2, 5, or 6), resulted in larger Kolmogorov-Smirnov statistics for both estimation methods. The covariate-balancing method produced more variable weights in all but one condition (strategy 1 used with treatment generation G) with the biggest differences for models that included instruments and distractors.

For estimation of the treatment effect, the covariate-balancing method consistently resulted in smaller bias when simple propensity score models were considered but the trend reversed for complex models such as version G. The absolute bias in version A (additive linear propensity score structure) was between and 5 times higher for the boosted models method than the covariate-balancing method, and in version E, it was about between 2 and 3 times higher for the boosted models method. On the other hand, in version G, the bias in the boosted models method was 2.4 to 3.1 times lower with the exception of strategy 4 where it was only 1.5 times lower. The RMSE performance was similar between both methods. The ratios in RMSE were between 0.9 and 1.1. Even more similar standard errors were observed between the two estimation methods.

The results of the simulation with possible nonlinear relationships between outcomes and confounders are reported in Table 5 where we show the summary statistics averaged across all scenarios. Similar to the previous results, the standardized mean differences from the boosted models method were between 3 and 14 times higher than the ones from the covariate-balancing method. The Kolmogorov-Smirnov statistic was similar between both estimation methods under the propensity score model version A that has little to no complexity, but in more complex propensity score model setting G, the statistic was between 1.4 and 1.9 times lower in the boosted models method than the covariate-balancing method. In version E, the Kolmogorov-Smirnov statistic was similar in the three strategies (2, 5, and 6) where all the covariates used for the generation of the propensity score model were included in the estimation, but for all the other strategies, the statistic was 1.4–1.6 times lower with the boosted models method. As in Table 4, the covariate-balancing method yields more variable weights, especially in the presence of instruments and distractors. In terms of bias, consistently across all model strategies, the boosted models method produced higher bias for the simple propensity score version A structure (between 3.4 and 4.8 times higher). But for complex models such as version G, the bias in the boosted models method was consistently smaller on the order of 5.6 to 10.4 times. Even in version E, the bias in the boosted models method was lower. For the RMSE, on average, both estimation methods produced similar estimates in version A, but with more complicated propensity score versions, the RMSE in the boosted models method was consistently lower. In versions E and G, the RMSE in the boosted models method was lower on the order of 1.1 and 1.3 to 1.4 times, respectively. Once again very similar standard errors were observed between both estimation methods. A comparison of the linear (Table 4) and the nonlinear (Table 5) outcome model setting results also reveals that even if one method performs better in terms of standardized mean difference or Kolmogorov-Smirnov statistic balance, bias and RMSE can still depend the outcome model or the existence of instruments and distractors and may not always favor the method that yields better balance.

In summary, while the covariate-balancing method with only main effects for covariates tended to produce better covariate standardized mean difference balance, the boosted models method produced slightly better Kolmogorov-Smirnov statistic balance especially in complex propensity score settings. For simple outcome and propensity score structures, the covariate-balancing method outperformed the boosted models method in bias and both lead to similar RMSE and standard error, but in more complex propensity score settings, the boosted models method performed better than the covariate-balancing method in both bias and RMSE, whereas the standard error remains similar between the methods. For nearly all conditions, the covariate-balancing method yields more variable weights (5–49% greater variability; Tables 4 and 5 and eAppendix; https://links.lww.com/EDE/B256) but this does not necessarily translate to larger standard errors for the treatment effect: the standard errors are more similar across both methods than the variance of the weights and more often smaller for the covariate-balancing method than the boosted models method. Hence, variance in the weights does not necessarily translate to more variable treatment effect estimates.

### Method Diagnostic

To provide researchers with ways of assessing circumstances where one method might be more appropriate, we now report the performance statistics by the model complexity indicators defined above Results of the estimated bias and RMSE depending on the level of the complexity in the outcome and in the treatment assignment are presented in Figures 1 and 2 where for each outcome complexity

and each treatment complexity

, the average statistic across scenarios was computed and the log-relative statistic (

) presented. The figures for standard errors are reported in the eAppendix; https://links.lww.com/EDE/B256. The generalized additive model,^{22} a method that combines the advantages of nearest neighbor and kernel methods, was used to create a continuous map surface of the log-relative bias and RMSE by smoothing over the complexities

and

. Other than smoothing the edges, the smoothed heat plots provided the same information as the original values that had not been smoothed. In the figures, the smoothed log-relative statistics were truncated at most to be between −4 and 4. Values of the log-relative statistic (e.g., for bias) at 1, 2, 3, and 4 suggest that the bias in the covariate-balancing method is 2.7, 7.4, 20.1, and 54.6 times larger than the bias in the boosted models method, respectively. The values at −1, −2, −3, and −4 will reflect the opposite with the covariate-balancing method producing the smaller bias. For comparisons where the range of difference between the methods is smaller (e.g., RMSE), the range was truncated to values that capture the key differences; for RMSE, we used −2 to 2.

In Figure 1, we present the bias and RMSE plots for strategies 1 and 2 where predictors of both treatment and outcome or instruments were used. For strategy 1 where only predictors of both outcome and treatment were included in the analysis, we observed only a small range in the improvement

in the propensity score estimation. In this scenario, both estimation methods performed similarly in RMSE in most cases with the exception of cases with large complexity in the outcome and propensity score model where the boosted models method had lower RMSE. A somewhat different picture was revealed when it comes to bias where, for very low treatment complexity index, the covariate-balancing method performed best, but as the propensity score model started getting more complex, the boosted models method outperformed the covariate-balancing method. Similar results were observed for modeling strategy 2 (instruments included in the model) where for RMSE, both methods tended to be similar. For bias, the covariate-balancing method outperformed the boosted models method whenever there was almost no complexity in the propensity score model, and the boosted models method outperformed the covariate-balancing method for greater level of complexity. Interestingly, the standard errors were similar between the methods with only differences (on the order of 1.28 times larger or smaller) observed when the complexity of the models is high (see plots in the eAppendix; https://links.lww.com/EDE/B256).

In the more common situation of including all available covariates (a model where instruments and distractor variables ended up in the propensity score model), both estimation methods still had comparable RMSE, except when there was relatively large complexity in both outcome and propensity, then the boosted models method performed better (Figure 2A). In terms of bias, the covariate-balancing method only outperformed the boosted models method when both outcome and propensity score model complexities were very low; otherwise, the boosted models method produced treatment effect estimates with lower bias most of the time (Figure 2B).

In general, results were similar under the different propensity score estimation strategies as demonstrated in the plots of the averages across all strategies (Figure 2C and D). Taken together, these results suggest that when a researcher is confident about the confounders or variables that determine treatment assignment and the complexity in the propensity model is low, the covariate-balancing method will likely produce better results as it has RMSEs similar to the ones obtained when using the boosted models method while at the same time enjoying less bias. Conversely, the boosted models method will perform better when complexity in the propensity score model and/or outcome model is high.

## OTHER DESIGN-BASED ALTERNATIVES TO COVARIATE-BALANCING PROPENSITY SCORES

Several other techniques similar to the covariate-balancing propensity method that estimate propensity score weights with methods beyond logistic regression have been proposed.^{23–25} Many of them design weights that are not equal to the inverse of the propensity score but are chosen explicitly to optimize balance between the covariate distributions in the exposed and control groups. The method of entropy balancing proposed by Hainmueller,^{23} for example, provides such a way of estimating weights for each control observation so that the sample means (or moments) of the selected covariates are identical between the treatment and weighted control groups. We also compared the entropy balancing method to the boosted models and the covariate-balancing methods using the same simulation setups in the study design section to assess comparability with other methods (see eAppendix; https://links.lww.com/EDE/B256). The performance of entropy balancing mirrors the generalized boosted models method across simulations. When the model complexity in the propensity score is low, entropy balancing and the boosted models method are similar and produce better results than the boosted models method, but with high model complexity, the boosted models method outperforms both of them.

## DISCUSSION

In a field where the analysis of observational study data is common, propensity score methods have become a useful tool for epidemiologists. In practice, it has been reported that propensity scores that are estimated using logistic regression can have disadvantages.^{26} In particular, such propensity scores often result in lingering imbalances in covariate distributions between the treatment and control samples, leaving open the possibility that claimed associations between the treatment and the outcome can be explained by observed confounders.

Generalized boosted models and covariate-balancing propensity scores methods are two advanced propensity score estimation methods that have been found in previous research to circumvent some of the shortcomings of logistic regression. In this article, we analyze millions of simulated datasets to illuminate for practitioners when they can expect the covariate-balancing method to outperform the boosted models method and vice versa. In short, if the parametric assumptions of the logistic regression model are satisfied and the outcome model is linear in the covariates, the covariate-balancing method performs extremely well. On the other hand, when the true propensity score model is substantially nonlinear or contains important interactions in the log-odds scale or the outcome model is not linear in the observed confounders, the flexibility of the boosted models method tends to provide better results. The relative performances of the methods also depend on the kind of covariate balance that matters in the outcome model. As such, researchers might have to think about the outcome model and its complexity when deciding which method to use. In addition to these qualitative guidelines, we provide a data-analytic framework for choosing between the two methods. After selecting potential confounders and avoiding possible distractors and instruments, practitioners can compute our proposed measure of complexity

from their data and use results from this study as a guide for the choice between the generalized boosted models and the covariate-balancing propensity scores methods. We hope that this guidance will help epidemiologists to choose the best tool for the job when estimating propensity scores. Additionally, we believe our simulation structure will provide a useful framework for future work comparing alternative propensity score methods. Simulations using the method of entropy balancing revealed that it was essentially equivalent in our performance measures to the covariate-balancing method. Thus, the comparison shown here between the boosted models and the covariate-balancing methods is also representative of a comparison between entropy balancing and the boosted models method (see eAppendix; https://links.lww.com/EDE/B256). The field is rapidly developing robust tools for propensity score estimation, each with their own strengths and limitations.^{27–30} Future work should aim to expand on the assessment of optimal propensity score estimation diagnostics tools, how trimming weights might impact our results in light of the impact of weight trimming reported by Lee et al,^{31} and understand when these methods might be most useful relative to one another in similar ways to our work here with the generalized boosted models and the covariate-balancing propensity scores methods.