Medical Care: June 2010 - Volume 48 - Issue 6
doi: 10.1097/MLR.0b013e3181de9e17
Comparative Effectiveness

Prediction Modeling Using EHR Data: Challenges, Strategies, and a Comparison of Machine Learning Approaches

Wu, Jionglin MS*†; Roy, Jason PhD‡; Stewart, Walter F. PhD, MPH†


Abstract

Background: Electronic health record (EHR) databases contain vast amounts of information about patients. Machine learning techniques such as Boosting and support vector machine (SVM) can potentially identify patients at high risk for serious conditions, such as heart disease, from EHR data. However, these techniques have not yet been widely tested.

Objective: To model detection of heart failure more than 6 months before the actual date of clinical diagnosis using machine learning techniques applied to EHR data. To compare the performance of logistic regression, SVM, and Boosting, along with various variable selection methods in heart failure prediction.

Research Design: Geisinger Clinic primary care patients with EHR data from 2001 to 2006 who were diagnosed with heart failure between 2003 and 2006 were identified. For this nested case-control study, controls were randomly selected and matched to cases on sex, age, and clinic.

Measures: The area under the receiver operating characteristic (ROC) curve (AUC) was computed for each method using 10-fold cross-validation. The number of variables selected by each method was also compared.
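
As a rough illustration of this measure, the following minimal sketch computes AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control, and averages it over cross-validation folds. The function names (`auc`, `kfold_indices`, `cross_validated_auc`) and the `fit`/`predict` callables are hypothetical placeholders, not the authors' code, and averaging per-fold AUCs is an assumption; the abstract does not say whether fold results were averaged or pooled.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (case, control) pairs in which the case
    has the higher score; tied pairs count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_auc(X, y, fit, predict, k=10, seed=0):
    """Mean held-out AUC over k folds (assumption: per-fold averaging).

    `fit(X_train, y_train)` returns a model and `predict(model, x)`
    returns a risk score; both are supplied by the caller.
    """
    fold_aucs = []
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores = [predict(model, X[i]) for i in test_idx]
        fold_aucs.append(auc([y[i] for i in test_idx], scores))
    return sum(fold_aucs) / len(fold_aucs)
```

This sketch splits folds at random without stratification, so `auc` raises an error if a held-out fold happens to contain only cases or only controls. With matched case-control data, a stratified split that keeps matched sets together might be preferable.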

Results: Logistic regression with model selection based on Bayesian information criterion provided the most parsimonious model, with about 10 variables selected on average, while maintaining a high AUC (0.77 in 10-fold cross-validation). Boosting with strict variable importance threshold provided similar performance.
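
For reference, the standard definition of the Bayesian information criterion is given below. The symbols are not taken from the abstract: $\hat{L}$ denotes the maximized likelihood of the fitted model, $k$ the number of estimated parameters, and $n$ the number of observations. The abstract does not describe how candidate models were searched.

```latex
\mathrm{BIC} = -2 \ln \hat{L} + k \ln n
```

Because each additional parameter costs $\ln n$, which exceeds the penalty of 2 used by the Akaike information criterion whenever $n > e^{2} \approx 7.4$, minimizing BIC tends to favor smaller models, which fits the parsimony reported here.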

Conclusions: Heart failure was predicted more than 6 months before clinical diagnosis, with AUC of about 0.76, using logistic regression and Boosting. These results were achieved even with strict model selection criteria. SVM had the poorest performance, possibly because of imbalanced data.

© 2010 Lippincott Williams & Wilkins, Inc.
