Treatment Prediction, Balance, and Propensity Score Adjustment

Moodie, Erica E. M.; Stephens, David A.

doi: 10.1097/EDE.0000000000000657

Department of Mathematics & Statistics, McGill University, Montreal, QC, Canada.

The data are publicly available through the R library “nhanes”. Code is available from the authors on request.

The authors report no conflicts of interest.

Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article.


To the Editor:

It has been argued1 that to make causal inferences from nonexperimental data, for which inferences may be compromised by confounding, an analysis should be designed to mimic a randomized trial, because of the covariate balance induced by randomization. The propensity score,2 when correctly specified and utilized, can eliminate imbalance in the distribution of covariates between treated and untreated subjects, and it offers several approaches to adjustment (e.g., stratification, matching, or inverse weighting). Defined for binary treatment $Z$ and vector of confounding variables $X$ as $e(X) = \Pr(Z = 1 \mid X)$, the propensity score is a balancing score, $Z \perp X \mid e(X)$, and is a scalar quantity, whatever the dimension of $X$. Achieving balance on covariates that are not confounders (particularly instruments, i.e., strong predictors of treatment) is unhelpful, and yet there is still considerable interest in (and some advocacy for) using procedures to estimate the treatment model that are more complex than the simple approach of fitting a binary regression model. We caution that such procedures must be used with care, as the goals of optimal treatment prediction and balancing are very different.

We report findings from an empirical study of the impact of current smoking on systolic blood pressure using data from the National Health and Nutrition Examination Survey, restricting attention to adults in the second wave of the survey. We compare the estimated propensity scores and balance statistics for propensity score estimation using procedures of increasing complexity. Potential confounders are gender, age, race, education, marital status, household income, and a poverty index. We compare logistic regression, generalized boosted models as suggested by Ridgeway and McCaffrey,3 and the ensemble approach of Super Learning,4 with the following library of learners in R: k-nearest neighbors, regularized generalized linear models, mean prediction, and random forests (all with default settings). We examined the standardized mean difference for each of the confounding variables (i) in the original sample, (ii) within quintiles of each fitted propensity score, (iii) following 1:1 matching with replacement,5 and (iv) following inverse probability of treatment weighting.
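As an illustration of the balance statistic used here, the following is a minimal Python sketch of a standardized mean difference that can be computed in the raw sample or, by passing weights, after inverse probability of treatment weighting. The function names, the use of population (divide-by-total-weight) variances, and the pooled denominator sqrt((v1 + v0) / 2) are assumptions made for this sketch; the letter does not state which variant of the statistic was used, and the toy data are invented.

```python
import math


def _weighted_mean_var(values, weights):
    """Weighted mean and weighted (population) variance of a sequence."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def standardized_mean_difference(x, z, weights=None):
    """SMD of covariate x between treated (z == 1) and untreated (z == 0).

    Assumed convention: difference in (weighted) means divided by the
    square root of the average of the two (weighted) group variances.
    """
    if weights is None:
        weights = [1.0] * len(x)
    treated = [(v, w) for v, t, w in zip(x, z, weights) if t == 1]
    untreated = [(v, w) for v, t, w in zip(x, z, weights) if t == 0]
    m1, v1 = _weighted_mean_var(*zip(*treated))
    m0, v0 = _weighted_mean_var(*zip(*untreated))
    return (m1 - m0) / math.sqrt((v1 + v0) / 2.0)


if __name__ == "__main__":
    # Invented toy data: ages of six subjects and their treatment status.
    age = [34, 51, 47, 29, 62, 40]
    smoker = [1, 1, 0, 1, 0, 0]
    print(standardized_mean_difference(age, smoker))
```

The same function covers the within-quintile comparison by restricting `x` and `z` to the subjects in one quintile before calling it.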

Local balance is not achieved within quintiles of the propensity score for any of the estimation approaches (Table). Thus a stratified analysis could not rely on fitting simple means within quintiles, but rather would need to rely on outcome regression modeling within quintiles; typically, however, the analyst wishes to avoid specifying an outcome regression model. Using logistic regression to compute the propensity score, we observe that excellent balance is achieved through inverse probability of treatment weighting, and that balance is quite good following matching.



The propensity score estimated via a generalized boosted model leads to greater predictive accuracy (0.80 as compared with 0.70 within-sample accuracy for logistic regression), greater separation in the propensity score distribution between smokers and nonsmokers (see eFigure 1), and a decrease in balance as measured by the standardized mean difference. Super Learning’s predictive accuracy is even greater (0.98), yielding such strong separation between the smokers and nonsmokers that only the third quintile contains both exposure groups. Within the third quintile, balance is generally worse than in the original sample.
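The loss of overlap described above can be checked directly by counting treated and untreated subjects within each propensity score quintile. The sketch below is one hedged way to do this in Python; the function names and the rank-based definition of quintiles (ties split arbitrarily by sort order) are assumptions, and the scores are invented to mimic a highly accurate treatment model.

```python
def quintile_labels(scores):
    """Assign each subject a quintile label 0..4 by rank of its score."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // n  # rank-based split into five groups
    return labels


def overlap_by_quintile(scores, z):
    """Return [untreated_count, treated_count] for each of the 5 quintiles."""
    counts = [[0, 0] for _ in range(5)]
    for label, t in zip(quintile_labels(scores), z):
        counts[label][t] += 1
    return counts


def quintiles_with_both_groups(scores, z):
    """Indices (0-based) of quintiles containing both exposure groups."""
    return [q for q, (n0, n1) in enumerate(overlap_by_quintile(scores, z))
            if n0 > 0 and n1 > 0]


if __name__ == "__main__":
    # Invented scores from a hypothetical, very accurate treatment model.
    scores = [0.01, 0.02, 0.03, 0.04, 0.49, 0.51, 0.96, 0.97, 0.98, 0.99]
    z = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    print(overlap_by_quintile(scores, z))
    print(quintiles_with_both_groups(scores, z))
```

A quintile with a zero count in either column supports no within-stratum comparison, so a stratified estimate cannot use it.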

Beyond the lack of overlap (positivity violations) resulting from high predictive accuracy, in an analysis using inverse probability weighting, weights will be close to 1 when the treatment model is very accurate. Thus, in this example, the average treatment effect estimated using weighting by a treatment model fit by Super Learning is most similar to the naive (unweighted) difference of averages between exposed and unexposed: the naive estimate (95% confidence interval [CI]) is −3.70 (−5.71, −1.78), whereas the inverse probability weighted estimates using a treatment model fit by logistic regression, generalized boosted model, and Super Learning are, respectively, −1.99 (−3.93, −0.16), −2.18 (−4.22, −0.17), and −3.49 (−5.39, −1.87).
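The mechanism can be seen in a small Python sketch: when fitted propensities sit near 1 for the treated and near 0 for the untreated, every weight 1/e or 1/(1 − e) is close to 1 and the weighted contrast collapses to the unweighted one. The normalized (ratio) form of the weighted estimator, the function names, and all numbers below are assumptions for illustration only; they are not the authors' code or data.

```python
def ipt_weights(z, propensity):
    """Inverse probability of treatment weights: 1/e if treated, 1/(1-e) if not."""
    return [1.0 / p if t == 1 else 1.0 / (1.0 - p)
            for t, p in zip(z, propensity)]


def _group_mean(y, z, w, arm):
    """Weighted mean of y among subjects with z == arm."""
    num = sum(wi * yi for yi, zi, wi in zip(y, z, w) if zi == arm)
    den = sum(wi for zi, wi in zip(z, w) if zi == arm)
    return num / den


def weighted_effect(y, z, w=None):
    """Difference in (weighted) outcome means, treated minus untreated.

    With w=None this is the naive unweighted difference of averages.
    """
    if w is None:
        w = [1.0] * len(y)
    return _group_mean(y, z, w, 1) - _group_mean(y, z, w, 0)


if __name__ == "__main__":
    # Invented outcomes (e.g., blood pressure) and treatment indicators.
    y = [120, 130, 125, 135, 118, 128]
    z = [1, 1, 1, 0, 0, 0]
    moderate = [0.8, 0.5, 0.6, 0.5, 0.2, 0.4]      # modest prediction
    sharp = [0.99, 0.99, 0.99, 0.01, 0.01, 0.01]   # near-perfect prediction

    print("naive:   ", weighted_effect(y, z))
    print("moderate:", weighted_effect(y, z, ipt_weights(z, moderate)))
    print("sharp:   ", weighted_effect(y, z, ipt_weights(z, sharp)))
```

In this toy setting the sharply predicted scores reweight nothing, so no adjustment for confounding takes place even though the treatment model predicts very well.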

We note that while small standardized mean differences are not sufficient to guarantee unbiased estimation,4 they are thought to be necessary. Furthermore, the assumption of conditional exchangeability, or balance between treated and untreated groups, is made with respect to the joint distribution of confounders and not marginally (confounder by confounder) as displayed in the Table and eFigure 2, and so there may be settings in which balance is achieved marginally but not jointly. Interestingly, even though the propensity score fit by logistic regression used only main effect terms, it appears to provide better balance at the level of first-order interactions than the two more complex estimation approaches (see eFigure 3). Of the methods considered, none aims to optimize balance. Super Learning offers the greatest potential because its in-built cross-validation could be adapted to maximize balance rather than predictive accuracy, although care would be required to determine balance “depth” (main effects, $k$th-order interactions, etc.). Until such modifications exist, we suggest caution: accurate prediction may not lead to “better” causal analyses.
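To make the distinction between marginal and joint balance concrete, the Python sketch below computes standardized mean differences on pairwise products of covariates, one simple way to look at balance at the level of first-order interactions. The helper names, the unweighted form of the statistic, and the toy data are assumptions for illustration; the data are constructed so that each covariate is balanced on its own while their product is not.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean, pvariance


def smd(x, z):
    """Unweighted standardized mean difference, treated minus untreated.

    Assumes at least one group has nonzero variance in x.
    """
    x1 = [v for v, t in zip(x, z) if t == 1]
    x0 = [v for v, t in zip(x, z) if t == 0]
    pooled_sd = sqrt((pvariance(x1) + pvariance(x0)) / 2)
    return (fmean(x1) - fmean(x0)) / pooled_sd


def interaction_balance(covariates, z):
    """SMD of each pairwise product of covariates.

    covariates: dict mapping a covariate name to its list of values.
    Returns a dict keyed by (name_a, name_b).
    """
    result = {}
    for a, b in combinations(covariates, 2):
        product = [u * v for u, v in zip(covariates[a], covariates[b])]
        result[(a, b)] = smd(product, z)
    return result


if __name__ == "__main__":
    # Invented binary covariates: balanced marginally, not jointly.
    z = [1, 1, 1, 1, 0, 0, 0, 0]
    covariates = {
        "a": [0, 1, 0, 1, 0, 1, 0, 1],
        "b": [0, 1, 0, 1, 1, 0, 1, 0],
    }
    for name, values in covariates.items():
        print(name, smd(values, z))
    print(interaction_balance(covariates, z))
```

Extending the check to higher-order products is mechanical, which is why the choice of balance depth matters: the number of terms to balance grows quickly with the order.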

Erica E. M. Moodie

David A. Stephens

Department of Mathematics & Statistics

McGill University

Montreal, QC, Canada



1. Didelez V. Commentary: should the analysis of observational data always be preceded by specifying a target experimental trial? Int J Epidemiol. 2016.
2. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
3. Ridgeway G, McCaffrey D. Comment: demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data (with discussion). Stat Sci. 2007;22:540–543.
4. Pirracchio R, Petersen ML, van der Laan M. Improving propensity score estimators’ robustness to model misspecification using super learner. Am J Epidemiol. 2015;181:108–119.
5. Sekhon JS. Multivariate and propensity score matching software with automated balance optimization: the Matching package for R. J Stat Softw. 2011;42:1–52.

Copyright © 2017 Wolters Kluwer Health, Inc. All rights reserved.