Can We Train Machine Learning Methods to Outperform the High-dimensional Propensity Score Algorithm?

Karim, Mohammad Ehsanul; Pang, Menglan; Platt, Robert W.
doi: 10.1097/EDE.0000000000000787
Original Article: PDF Only

Retrospective healthcare claims datasets are frequently criticized for lacking complete information on potential confounders. The high-dimensional propensity score algorithm reduces bias by using information related to patients' health status from claims datasets as surrogates or proxies for mismeasured and unobserved confounders. Using a previously published cohort study of post-myocardial infarction statin use (1998–2012), we compare the performance of the algorithm with that of several popular machine learning approaches for confounder selection in high-dimensional covariate spaces: random forest, least absolute shrinkage and selection operator, and elastic net. Our results suggest that, when the data analysis is done with epidemiologic principles in mind, machine learning methods perform as well as the high-dimensional propensity score algorithm. Using a plasmode framework that mimicked the empirical data, we also show that a hybrid of machine learning and high-dimensional propensity score algorithms generally performs slightly better than either alone in terms of mean squared error when a bias-based analysis is used.
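To make the "bias-based" prioritization concrete, the sketch below ranks binary proxy covariates by the absolute log of the Bross bias multiplier, the quantity commonly used for this step in descriptions of the high-dimensional propensity score algorithm. The abstract does not give the formula, so treating this as the ranking the authors used is an assumption; the function names (`bross_scores`, `select_top_k`) and the simulated data are hypothetical and appear here only for illustration.

```python
import numpy as np


def bross_scores(X, exposure, outcome, eps=1e-9):
    """Return |log(bias multiplier)| for each binary covariate (column of X).

    Assumed form (Bross multiplier, not confirmed by the abstract):
        bias = (P_C1 * (RR_CD - 1) + 1) / (P_C0 * (RR_CD - 1) + 1)
    where P_C1 / P_C0 are covariate prevalences among exposed / unexposed
    patients and RR_CD is the covariate-outcome relative risk
    (inverted when it is below 1).
    """
    X = np.asarray(X, dtype=float)
    exposed = np.asarray(exposure).astype(bool)
    y = np.asarray(outcome, dtype=float)

    # Covariate prevalence among exposed and unexposed patients.
    p_c1 = X[exposed].mean(axis=0)
    p_c0 = X[~exposed].mean(axis=0)

    # Outcome risk among patients with and without each covariate.
    n_with = X.sum(axis=0)
    n_without = X.shape[0] - n_with
    risk_with = (X.T @ y) / np.maximum(n_with, 1.0)
    risk_without = ((1.0 - X).T @ y) / np.maximum(n_without, 1.0)

    # Relative risk, with eps guarding against division by zero.
    rr = (risk_with + eps) / (risk_without + eps)
    rr = np.where(rr < 1.0, 1.0 / rr, rr)

    bias = (p_c1 * (rr - 1.0) + 1.0) / (p_c0 * (rr - 1.0) + 1.0)
    return np.abs(np.log(bias))


def select_top_k(X, exposure, outcome, k):
    """Indices of the k covariates with the largest bias-based score."""
    scores = bross_scores(X, exposure, outcome)
    return np.argsort(-scores)[:k]


# Simulated illustration (not the study data): covariate 0 is a true
# confounder, the remaining four columns are noise.
rng = np.random.default_rng(0)
n = 5000
X_demo = rng.binomial(1, 0.3, size=(n, 5))
exposure_demo = rng.binomial(1, 0.2 + 0.5 * X_demo[:, 0])
outcome_demo = rng.binomial(1, 0.1 + 0.4 * X_demo[:, 0])
selected = select_top_k(X_demo, exposure_demo, outcome_demo, k=2)
```

In a hybrid approach of the kind the abstract mentions, a ranking like this could serve as a pre-screen before a penalized regression or random forest refines the covariate set; how the authors actually combined the two is not stated here.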

Funding Information: This work was supported by a post-doctoral fellowship from the Canadian Network for Observational Drug Effect Studies (CNODES). CNODES, a collaborating centre of the Drug Safety and Effectiveness Network (DSEN), is funded by the Canadian Institutes of Health Research (CIHR). M.E.K. is a Scientist and Biostatistician at the Centre for Health Evaluation and Outcome Sciences (CHÉOS), Faculty of Medicine, UBC. M.P. holds a studentship from the Fonds de Recherche du Québec - Santé (FQR-S). R.W.P. holds the Albert Boehringer I Chair in Pharmacoepidemiology and is a member of the Research Institute of the McGill University Health Centre, which is supported by core funds from FQR-S.

Conflict of Interest: M.E.K. has received accommodation costs from the endMS Research and Training Network (2011, 2012) and the Statistical Society of Canada (2016) to present at conferences, and from the Pacific Institute for the Mathematical Sciences (2013) and the Canadian Statistical Sciences Institute (2016) to attend workshops. R.W.P. has received fees for consulting from AbbVie, Amgen, Eli Lilly, and Searchlight Pharma, for teaching from Novartis, and for scientific steering committee membership from Pfizer.

Availability of Data and Code for Replication: Software code hints for implementing the methods are provided in the supporting material (as an eAppendix). The retrospective population-based cohort dataset from the Clinical Practice Research Datalink (CPRD) is not publicly available for reasons of patient confidentiality.

Mohammad Ehsanul Karim, Assistant Professor, School of Population and Public Health, University of British Columbia, 2206 East Mall, Vancouver, BC V6T 1Z3; and Scientist / Biostatistician, Centre for Health Evaluation and Outcome Sciences (CHÉOS), St. Paul’s Hospital, 588-1081 Burrard St, Vancouver, BC V6Z 1Y6, Canada. Tel.: +1 604-682-2344 ext. 64251, Fax: +1 604-806-8005, ORCiD: 0000-0002-0346-2871

Copyright © 2018 Wolters Kluwer Health, Inc. All rights reserved.