Variable Selection for Confounding Adjustment in High-dimensional Covariate Spaces When Analyzing Healthcare Databases

Schneeweiss, Sebastian; Eddings, Wesley; Glynn, Robert J.; Patorno, Elisabetta; Rassen, Jeremy; Franklin, Jessica M.

doi: 10.1097/EDE.0000000000000581

Background: Data-adaptive approaches to confounding adjustment may improve performance beyond expert knowledge when analyzing electronic healthcare databases and have additional practical advantages for analyzing multiple databases in rapid cycles. Improvements seemed possible if outcome predictors were reliably identified empirically and adjusted.

Methods: In five cohort studies from diverse healthcare databases, we implemented a base-case high-dimensional propensity score algorithm with propensity score decile-adjusted outcome models to estimate treatment effects among prescription drug initiators. The original variable selection procedure based on the estimated bias of each variable using unadjusted associations between confounders and exposure (RRCE) and disease outcome (RRCD) was augmented by alternative strategies. These included using increasingly adjusted RRCD estimates, including models considering >1,500 variables jointly (Lasso, Bayesian logistic regression); using prediction statistics or likelihood-ratio statistics for covariate prioritization; directly estimating the propensity score with >1,500 variables (Lasso, Bayesian regression); or directly fitting an outcome model using all covariates jointly (Lasso, Ridge).
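The base-case prioritization step described above can be illustrated with a minimal sketch. It assumes the Bross-type multiplicative bias formula commonly associated with the high-dimensional propensity score algorithm; the abstract itself does not spell out the formula. The function names, the toy data, and the rule of skipping covariates with an undefined risk ratio are hypothetical choices made only for this illustration.

```python
import math


def bross_bias(p_c1, p_c0, rr_cd):
    """Multiplicative bias from leaving one binary covariate unadjusted.

    p_c1, p_c0: covariate prevalence among exposed and unexposed patients.
    rr_cd: unadjusted covariate-outcome risk ratio (RRCD).
    Assumption: protective covariates (rr_cd < 1) are inverted first.
    """
    if rr_cd < 1:
        rr_cd = 1.0 / rr_cd
    return (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)


def _mean(values):
    """Mean of an iterable, or None when it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else None


def rank_covariates(exposure, outcome, covariates, top_k):
    """Rank binary covariates by |log(bias)| and keep the top_k names.

    exposure, outcome: lists of 0/1 values, one entry per patient.
    covariates: dict mapping a covariate name to its list of 0/1 values.
    """
    scores = {}
    for name, values in covariates.items():
        p_c1 = _mean(c for c, e in zip(values, exposure) if e == 1)
        p_c0 = _mean(c for c, e in zip(values, exposure) if e == 0)
        risk_c1 = _mean(d for c, d in zip(values, outcome) if c == 1)
        risk_c0 = _mean(d for c, d in zip(values, outcome) if c == 0)
        # Simplification: skip covariates whose risk ratio is undefined.
        if None in (p_c1, p_c0, risk_c1, risk_c0):
            continue
        if risk_c1 == 0 or risk_c0 == 0:
            continue
        rr_cd = risk_c1 / risk_c0
        scores[name] = abs(math.log(bross_bias(p_c1, p_c0, rr_cd)))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data for eight patients (purely illustrative, not from the study).
exposure = [1, 1, 1, 1, 0, 0, 0, 0]
outcome = [1, 1, 0, 1, 1, 0, 0, 0]
covariates = {
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],
    "confounder": [1, 1, 1, 0, 1, 0, 0, 0],
}
selected = rank_covariates(exposure, outcome, covariates, top_k=1)
```

In this toy example the covariate that is imbalanced between exposure groups and associated with the outcome is prioritized over the balanced one. The augmentations studied in the paper would, under this sketch, replace the crude `rr_cd` with a more heavily adjusted estimate before ranking.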

Results: In five example studies, most tested augmentations of the base-case hdPS did not meaningfully change estimates in light of wide confidence intervals. The exceptions were Bayesian regression and Lasso to estimate RRCD, which moved estimates minimally closer to the expectation in three of five examples. The direct outcome estimation with Lasso performed worst.

Conclusion: Overall, the basic heuristic of variable reduction in high-dimensional propensity score adjustment performed as well as alternative approaches in diverse settings. Minor improvements in variable selection may be possible using Bayesian outcome regression to prioritize variables for propensity score estimation when outcomes are rare. See video abstract at,

Supplemental Digital Content is available in the text.

From the Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA; and Aetion Inc., New York, NY.

Submitted 19 January 2015; accepted 19 October 2016.

This study was funded by the Patient-Centered Outcomes Research Institute, with additional support from research grants from the National Library of Medicine (R01-LM010213) and the National Heart, Lung, and Blood Institute (RC4-HL102023).

Dr. Schneeweiss is a consultant to WHISCON, LLC and to Aetion Inc., a software manufacturer in which he also owns shares. He is principal investigator of investigator-initiated grants to the Brigham and Women’s Hospital from Novartis, Genentech, and Boehringer Ingelheim unrelated to the topic of this study.

Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article.

Correspondence: Sebastian Schneeweiss, Division of Pharmacoepidemiology, Brigham & Women’s Hospital, 1 Brigham Circle, Suite 3030, Boston, MA 02120. E-mail:

Copyright © 2017 Wolters Kluwer Health, Inc. All rights reserved.