# Prediction Modeling Using EHR Data: Challenges, Strategies, and a Comparison of Machine Learning Approaches

Background: Electronic health record (EHR) databases contain vast amounts of information about patients. Machine learning techniques such as Boosting and support vector machine (SVM) can potentially identify patients at high risk for serious conditions, such as heart disease, from EHR data. However, these techniques have not yet been widely tested.

Objective: To model detection of heart failure more than 6 months before the actual date of clinical diagnosis using machine learning techniques applied to EHR data. To compare the performance of logistic regression, SVM, and Boosting, along with various variable selection methods in heart failure prediction.

Research Design: Geisinger Clinic primary care patients with data in the EHR data from 2001 to 2006 diagnosed with heart failure between 2003 and 2006 were identified. Controls were randomly selected matched on sex, age, and clinic for this nested case-control study.

Measures: Area under the curve (AUC) of receiver operator characteristic curve was computed for each method using 10-fold cross-validation. The number of variables selected by each method was compared.

Results: Logistic regression with model selection based on Bayesian information criterion provided the most parsimonious model, with about 10 variables selected on average, while maintaining a high AUC (0.77 in 10-fold cross-validation). Boosting with strict variable importance threshold provided similar performance.

Conclusions: Heart failure was predicted more than 6 months before clinical diagnosis, with AUC of about 0.76, using logistic regression and Boosting. These results were achieved even with strict model selection criteria. SVM had the poorest performance, possibly because of imbalanced data.

From the *Department of Statistics, Machine Learning Department, Carnegie Mellon University, Pittsburgh, †Geisinger Clinic, Center for Health Research, Danville, and ‡Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania, Philadelphia, PA.

Reprints: Jason Roy, PhD, Center for Biostatistics and Clinical Epidemiology, University of Pennsylvania, 229 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104. E-mail: jaroy@upenn.edu.

Although comparative effectiveness studies are typically thought of in the context of direct comparisons of clinical treatments and methods of process improvement, the same philosophy is relevant to other aspects of healthcare research, including the comparison of data sources, collection methods, and analyses.

Electronic health records (EHRs) are becoming increasingly common in ambulatory care practices and in inpatient care. This transition to digital records also represents an important transition in how patient data are organized and made accessible for uses never imagined with paper records. In particular, EHRs offer access to longitudinal data that may offer a unique clinical care asset that can be used to predict future outcomes or diagnoses, opening opportunities to personalize decision making for a given patient. In parallel, over the past decade, powerful, modern machine learning techniques have evolved and offer a potentially promising means to rapidly extract information from large, high-dimensional data sets. Machine learning, or sometimes known as data mining, is about extracting useful patterns in the data to solve questions of interests. It is closely related to statistics and engineering. The use of EHR data for predictive modeling poses a unique challenge. In contrast to traditional longitudinal studies (ie, complete data collected on a fixed schedule), EHR data are collected in an unscheduled fashion (ie, because the patient seeks care or the physician orders care). Both what information is collected and when it is collected is not determined by researchers, creating a situation in which a traditional framework would view most patients as having missing data.

Relatively little is known about the use of EHR data for predictive modeling and the application of machine learning methods to such data and the relative strengths and weaknesses of modeling options. In this article, we focus on prediction models for diagnosis of heart failure in primary care. Heart failure is a common and serious progressive illness, usually detected at a relatively advanced stage, leaving few options to slow progression. Earlier detection of heart failure could potentially lead to improved outcomes through aggressive intervention, such as treatment with angiotensin-converting enzyme (ACE)-inhibitors or Angiotensin II receptor blockers (ARBs). Therefore, it is important to determine whether machine learning techniques can effectively be used to predict heart failure from longitudinal EHR data. We describe the process of selecting the sample, defining the outcome, dealing with “missing” data, and validation. We compare 3 machine learning methods: logistic regression, support vector machine (SVM), and Boosting.

## METHODS

In this section, we first describe the study design and methodology. We then provide a brief introduction to the machine learning techniques that were used for the analysis.

### Study Design, Population, and Variables

#### Setting

Geisinger Clinic is a multispecialty group practice that includes 41 outpatient clinics in central and northeastern Pennsylvania. Data were obtained on patients who had a primary care physician at one of these sites. Geisinger Clinic has used EpicCare EHR since 2001 and has been serviced by a single laboratory company since 1993. Patient EHR data include: age, sex, height, weight, and other demographics, lifestyle (eg, smoking and alcohol), clinical measures (eg, blood pressure, pulmonary function, and bone mineral density), digital imaging (magnetic resonance imaging, computed tomography, and x-ray), all orders (ie, laboratories, prescriptions, imaging, and procedures) with a required indication (ie, ICD-9 code), clinical notes, and laboratory measures.

### Definition and Selection of Cases

One challenge with using EHR data for research is defining the outcomes. For this study, we defined heart failure diagnosis using the following criteria: (1) heart failure diagnosis appeared on the problem list at least once; (2) heart failure appeared in the EHR for 2 outpatient encounters, indicating consistency in clinical assessment; (3) at least 2 medications were prescribed with an associated ICD-9 diagnosis of heart failure; or (4) heart failure appeared on 1 or more outpatient encounter and at least 1 medication was prescribed with an associated ICD-9 diagnosis for heart failure. The diagnosis date was defined as the first appearance in the EHR of a heart failure diagnosis. These criteria have been previously validated as part of Geisinger Clinic's involvement in a Centers for Medicare & Medicaid Services (CMS) pay-for-performance pilot.^{1}

We limited the analysis to heart failure cases 50 to 79 years of age at diagnosis with diagnosis occurring between January 1, 2003 and December 31, 2006. We further excluded patients who had their first Geisinger Clinic encounter less than 2 years before the index date, which was 6 months before diagnosis, to ensure that patient data were available for the predictive model. A total of 536 cases were identified.

### Validation of Outcomes

We aimed to validate our EHR-based definition of congestive heart failure (CHF) against the Framingham-study definition, which is the criteria that have been widely used applied to defining cases of CHF from medical records.^{2} Applying such criteria are labor intensive, relies on dated terms and constructs for symptoms and signs (eg, “Circulation time of 25 seconds”) and is highly dependent on the reliability and completeness of documentation (eg, “Weight loss of 4.5 kg in 5 days in response to treatment”) by a given physician and among physicians. Given these shortcomings, we created criteria that were more reflective of outpatient medical practice in an EHR environment. In our definition, Heart Failure (HF) cases had to have a Geisinger Clinic (GC) primary care provider during the observation window, defined, at a minimum, as 12 months before the index date, to ensure EHR data were available for the predictive model. HF diagnosis was based on the presence of any one of the following criteria: (1) HF diagnosis appeared on the problem list at least once; (2) HF appeared in the EHR for 2 outpatient encounters, indicating consistency in clinical assessment; (3) at least 2 medications were prescribed with an associated ICD-9 diagnosis of HF; or (4) HF appeared on 1 or more outpatient encounter and at least 1 medication was prescribed with an associated ICD-9 diagnosis for HF. The diagnosis date was defined as the first appearance in the EHR of a HF diagnosis. We compared these criteria with the Framingham criteria, in a random sample of 50 individuals who met the operational criteria for CHF. We completed a chart review of each case, documenting the first appearance and date of appearance of all major or minor Framingham criteria for CHF. Of the 50 cases, 42 of the 50 cases met Framingham criteria for CHF (ie, either 2 major signs and symptoms or 1 major and 2 minor signs and symptoms). Two of the remaining 8 cases had documentation of 2 minor criteria. Of the 42 cases that met both criteria, the date of diagnosis was similar (ie, ±30 days) for 15, earlier for 17 cases when using operational criteria, and earlier for 10 cases when using Framingham criteria.

### Definition and Selection of Controls

Primary care patients were eligible as controls if they had no history of heart failure diagnosis before December 31, 2006, and had their first Geisinger Clinic office encounter within 12 months of the case's first office visit and had at least 1 office encounter 30 days before or anytime after the case's heart failure diagnosis date (ie, they were active primary care patients at the same time as their matched case). Approximately 10 eligible clinic-, sex-, and age-matched (in five-year age intervals) controls were selected for each heart failure case. In situations where 10 matches were not available, all available matches were selected. Nine or 10 controls were identified for 81% of the cases; 8% of cases had 1 to 2 controls.

### Variables Created

Table 1 summarizes the EHR data used in the model development. Many of the input variables are operationalized both as an indicator (eg, diagnosis of diabetes) and as a duration (eg, time since diabetes diagnosis). Beyond the variables in Table 1, secondary variables were also created, based on clinical evidence of potential importance to an emerging heart failure problem. These included pulse pressure (ie, the difference between systolic and diastolic blood pressure) and the proportions of widened (ie, >40 mm Hg) and narrowed (ie, <30 mm Hg) pulse pressure measurements out of all physician visits. It is plausible that physician visit frequency by patients will increase in relation to time of heart failure diagnosis. As such, we created variables to represent physician visit frequencies in consecutive 6-month windows before the diagnosis date of the heart failure case and for a comparable time period for the control. Lastly, for each abnormal laboratory measure, we created a variable for the time interval between the first abnormal laboratory measurement date and the 6-month prior diagnosis reference date for each laboratory measurement and each individual.

### “Missing” Data

Variables that could be repeatedly represented (eg, laboratories, medication orders, and clinical measures), varied by number per patient and were, effectively, obtained in an unscheduled manner (ie, compared with traditional methods where scheduled data collection is completed on all study participants). These variables could be thought of as missing for some subjects, as they are only observed if/when a decision was made to collect them. We consider the decision about whether or not to collect a certain variable and the timing of such a decision to potentially contain predictive information. For all such repeated measures, a binary variable was created to indicate whether a relevant measure was obtained and, by a separate variable, the value of measure for patients that had it ordered. The latter can be thought of as an interaction between whether the value was obtained and the value itself. This type of variable is “observed” for all subjects, and takes a value of 0 if the laboratory was never measured. For example, as candidate variables in the prediction model we would include an indicator that hemoglobin A1c was ordered and the value of A1c among people who had it ordered.

### Machine Learning Classification Techniques

We compared SVM, Boosting, and logistic regression models.^{3,4} Although logistic regression is a traditional statistical approach that might not typically be thought of as a machine learning technique, it is a member of discriminative models in machine learning. We briefly describe SVM and Boosting and assume the reader is familiar with logistic regression modeling. For each classification technique, we implement 2 variable selection methods—1 with a larger penalty for including additional variables. For logistic regression, we use forward stepwise variable selection based on Akaike information criterion (AIC) and Bayesian information criterion (BIC). BIC penalty tend to select more parsimonious models. For Boosting, we used variable importance scores with 2 different thresholds to select variables. For SVM, we chose L1-norm penalized variable selection with different parameters. We selected logistic regression model in our comparison because of its familiarity to the readers and its generally robust performance. We selected SVM and Boosting because of their availability in R software program and the state of the art performance status in many existing applications in machine learning.^{5}

### Support Vector Machine

SVM is used to transform the original data variable space into a higher dimensional space, denoted “feature space.” The advantage of this approach is that the search for linear classification decision boundary may not be trivial in the lower dimensional input space, but it is easier in the higher dimensional feature space. For example, in Figure 1, the original input space is X=(x_{1}, x_{2}). By defining the basis functions (ie, the features) as follows,

we are able to increase from 2 to 3 dimensions (the feature space is plotted in Fig. 1B). A linear boundary (Fig. 1) separates the data points perfectly in the feature space. However, the decision boundary is nonlinear when it is projected back to the original input space (Fig. 1A).

In a typical prediction problem setting and SVM model formulation, let (x_{i}, y_{i}), *i* = 1, …, N, be the training data pairs for N patients. Each *xi* is assumed to have *p* inputs, defined as the independent variables that include measures such as body mass index and blood pressure measurement (Table 1). The dependent or outcome variable (ie, heart failure status), *yiε*{1, −1}, is denoted as a “label” in machine learning models. The {1, −1} values are for mathematical convenience and are equivalent to {1, 0} in typical two-class classification problems. Before making the final prediction decision, the signed scaled distance (between −1 and 1) between each unseen data point *x* and the SVM decision boundary are computed by the following equation. The sign of the distance measure indicates which side of the decision boundary *x* is on.

where <*h*(*x*), *h*(*xi*)> represents the dot product between 2 vectors. Parameters *ai* are the new parameters introduced when the SVM estimation is formulated as an optimization problem using Lagrange multipliers. The *M* features are defined as following:

In the actual implementation, the *M* features are rarely stored in memory directly. Instead, all the computation is done through the kernel function *K*. For example, in Figure 1, the SVM solution is obtained by the computation through kernel function *K*.

The kernel function *K* produces the same expression *<h(x), h(xi)>* as following:

It greatly reduces the computational cost because only the lower dimensional features need to be stored. To estimate β, β_{0} in Eq. (1), we find the parameters that minimize the equation below:

where “+” represents the positive part of the function [1−*yif(xi)*]. In machine learning literature, [1−*yif(xi)]+* is also known as the hinge loss for SVM. The SVM decision rule is defined as follows:

### Boosting

Boosting is a popular example of machine learning ensemble methods. These iteratively combine the outputs of many “weak classifiers” to produce a strong “committee.” Weak classifiers refer to algorithms with error rates that are only slightly better than random guessing.^{3,4} For example, in the heart failure analysis, decision stumps were used as the weak classifier. At each model run, more weight is given to those misclassified heart failure cases identified from the previous iterations, whereas the weights are reduced for correctly classified heart failure cases. Thus, as iterations proceed, observations that are difficult classify correctly receive ever-increasing influence. We used a Boosting algorithm called AdaBoost^{3,4} specifically developed for binary classification problems. A distinct feature of AdaBoost is that training error continues to decrease as a function of iterations.

Again, the dependent variable is *yi* **ε**{1, −1}, where *Gm, m=1, …, M,* denotes a sequence of weak classifiers after iteration *m*. In the heart failure analysis, we used decision stumps^{5} as weak classifiers for all Boosting experimentation. In the following paragraph, we describe the Boosting technique formulation. We chose to describe the additive model formulation described here because it has been proven equivalent to the original computer science formulation, and it is more understandable to statisticians.

The equation below shows the additive model formulations of AdaBoost by minimizing the exponential loss function^{5}:

*Gm, m=1, …, M* are the basis functions for the basis function expansion *F(x),*

where *X* represents the vectors of input variables and *wim* is the weight given to data point *i* at iteration *m*. The weights can be written as:

The parameters β*m* represent the weights given to each weak learner to form the Adaboost ensemble classifier at the end. The final classifier is defined as:

### Machine Learning Variable Selection Techniques

The original 179 independent variable were narrowed in each modeling approach through variable selection protocols. For logistic regression, variable selection was based on minimizing AIC and BIC. The BIC tends to select more parsimonious models because of its more severe penalty, which is a multiple of log_{e} (*n*), where *n* is the number of data points. For example, for 2000 data points, log_{e}(2000)≈8 is much bigger than the penalty multiplier of 2 for AIC. For SVM, we used the L1-norm variable selection technique,^{5,6} where parameters are estimated with a penalty term based on the number of independent variables or features selected as defined by:

*s, λ* are the tuning parameters and *q* is the number of features selected (the original formulation). We used a fast Newton method formulation for parameter estimations,^{6} where *δ, v* are the new tuning parameters. For AdaBoost, we used the variable importance scores to select features.^{5} Variable importance after iteration ** m** is defined as the sum of all the improvements in squared error risk where variable

*l*is used to partition the region:

where *Tm* is the weak classifier after iteration *m* below, the sum is over the *J*−1 internal nodes of each stump and *îl* ^{2} is the improvement in squared error risk fit. At each node *t*, 1 of the input variables *v(t)* is used to partition the region associated with that node. The variable importance score is calculated as the average over all *M* of the weak classifiers:

A user chosen threshold is used based on the variable importance score to determine if a variable is selected.

### Statistical Software

For variable creations and data processing, we used SAS 9.2 (Cary, NC). We used generalized linear models step functions in the open source R program to perform the logistic regression fitting and variable selection. The SVM approach was implemented using the kernlab packages in R for penalized SVM. For all the SVM experimentation, we used the radial basis kernel with the default automatic parameter selection by the R package. For Boosting, we used the AdaBoost package in R.

### Challenge: Lack of Balance

Controls were over-sampled to increase power. However, because just 12% of the sample (ie, 536 cases and 3953 controls) were cases, we have an imbalanced data classification problem. Machine learning algorithms are typically formulated with the assumption of data balance. As such, the use of these methods for some epidemiological application is generally suboptimal. To alleviate the problem, weighted sampling methods have been used, where the majority class is under-sampled, whereas the original minority class is kept intact.^{8} Although more elaborate methods (ie, synthetic minority over-sampling^{8} and nearest-neighbor based active learning^{8}) have emerged in recent literature, we limited sampling methods to the under-sampling of the majority class (controls). Specifically, we used a sampling weight of 0.25 for Boosting and a sampling weight of 0.50 for SVM when selecting controls based on empirical experimentation.

### Model Validation

Because of the imbalanced data problem, prediction accuracy percentage is not an optimal criterion for evaluating model validity because simply classifying all subjects to the majority class yields a high measure (here, almost 90%). As such, we used the area under the curve (AUC) for each of the 3 methods under a 10-fold cross-validation setting, and method that provides a conservative estimate of the true model prediction risk.^{5} The 10-fold cross-validation is random in the sense that the 10-fold data split can be different for each simulation. However, when taking the median AUC values of all the folds, they are consistent across all the methods. We report the median, maximum, and minimum AUC values for each 10-fold cross-validation in Figure 2.

## RESULTS

Of the models tested, the AUCs were similar, on average, for logistic regression and Boosting (Fig. 2). The highest median AUC (0.77) was observed for logistic regression with BIC and Boosting when using a less strict cutoff. The AUCs for SVM were significantly lower than those observed for the other 2 modeling approaches.

Figure 3 shows the number of variables selected for each of the first 5 folds of the 10-fold cross-validation. These did not vary substantially by fold within each model tested. It provides evidence that the variable selection techniques we used are consistent. The first 3 models from the left of each fold (ie, Boosting-strict, Logistic-BIC, and SVM-strict) show that the application of strict criteria results in selection of substantially fewer variables. Logistic regression with BIC yielded the most parsimonious model with an average of 10 linear variables per model. Boosting with strict variable selection threshold (0.0003) was next, with an average of 15 variables per model. This latter model performed approximately as well as Boosting with less strict threshold but used far fewer variables (15 compared with 40).

Figure 4 shows the variables that were selected most frequently across all methods. “Test to date” refers to the time elapsed between the laboratory test date and the 6-month reference date. A variable could be selected approximately 60 times (6 methods, 10 data sets for cross-validation). The width of each bar in the figure is proportional to the number of times each variable was selected by a particular method (maximum of 10). Boosting (strict and less strict) and logistic regression with BIC tended to select similar variables. For example, past diuretics medication order, atrial fibrillation, and respiratory symptom were all selected 10 times by each of these methods. L1-norm SVM selected quite different variables. For instance, none of the 3 variables above was selected by SVM even once.

## DISCUSSION

We compared 3 different approaches to predict heart failure diagnosis. Of these, the direct application of Boosting performed similarly to the logistic regression model traditionally applied in biomedical research, and, based on AUCs determined for model validation purposes, both were superior to the direct application of SVM. This suggests that Boosting and Logistic regression may be the better options if the goal is to build a simple yet effective model.

There are a number of reasons why SVM may have been less effective in developing this prediction model. First, it is not an optimal approach for handling both continuous and categorical variables in the same model.^{5} Despite standardizing all the continuous input variables, the Euclidean distance SVM relies on is still probably not a good distance metric between continuous and categorical variables. For the same reason, SVM is not robust to outliers in either the input variable space or feature space when both continuous and categorical variables are present. Third, SVM is not naturally good in dealing with and does not handle irrelevant features well.^{10} Weston et al.^{10} conducted an empirical study showing a number of synthetic examples where natural direct application of SVM only achieves a consistent test accuracy of 0.5 as a function of increasing training data size. Finally, our data suggest that SVM may be more strongly affected by classification imbalance in the data than either Boosting or logistic regression.

In contrast, the use of decision stumps as the weak learners in the Boosting model made this method well equipped to handle mixed data: the decision split at each stump branch does not rely on any particular distance measure between any pair of data points. Similarly, it should be more robust to outliers in both input space and feature space,^{5} and there is no evidence to suggest decision stumps perform poorly in the presence of irrelevant features.

Using cross-validation and variable selection with Boosting required several decisions related to the number of iterations to be performed. For each fold in the 10-fold cross-validation, we calculated the average variable importance in the 10 random samples of training data with Boosting run for 30 iterations. There is no single rule for how many iterations 1 should run for calculating average variable importance across the 10-fold. However, the default number of iterations in the program is 50. Allocated training data comprised 90% of the overall data each time. To avoid over-fitting on the training data each time, we chose to run only 30 iterations. Once the variables were selected, the AUC was calculated for the validation data using model constructed with the default parameters in the computer program running for 200 iterations. We used this higher number of iterations on validation data because it comprised only 10% of the overall data, making over-fitting much less likely. We experimented with running 50 and 500 iterations, but these models performed poorly because of under-fitting and over-fitting, respectively.

In general, the variables that were most often selected are clinically sensible as a cause (eg, hypertension) of heart failure or as a consequence (eg, respiratory symptom indicator) of an emerging heart failure problem. Known risk factors for heart failure include body mass index, diabetes, atrial fibrillation, and common cardiovascular risk mediators (ie, hypertension, elevated triglycerides and low-density lipoprotein [LDL]). Each of the models showed a strong preference for these factors, including time since first abnormal value and treatment indicators for these conditions. Respiratory symptoms were selected as 1 of the more common predictors, which is noteworthy as this symptom is often noted in the record long before a heart failure diagnosis emerges. The lack of recognition that heart failure is emerging is due, in part, to the nonspecific nature of respiratory symptoms (ie, many possible causes). Prediction models of the kind tested herein may be valuable in detecting when such nonspecific symptoms are a strong indicator of an emerging heart failure problem.

We have shown that the application of machine learning techniques to longitudinal EHR data performs reasonably well in predicting future diagnosis of a disease that is ordinarily difficult to detect. Because the focus here was on introducing the different machine learning techniques and comparing their performance on a single data set, we restricted the analysis to main effects only. We expect that the AUC would increase if we included interaction terms in the list of candidate input variables.

Most large EHR systems have a lot in common, because structured data exist for medications, procedures, clinical measures, and laboratories. Furthermore, with growth in use of natural language processing, we are likely to see the emergence of validated standards in translating text data to fixed values. Thus, prediction models developed in 1 system have the potential to be implemented in other systems. Future research will include validating these models on other data sets.

## REFERENCES

*JAMA*. 2009;301:383–392.

*Ann Stat*. 1998;28:2000.

*Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence*. 1999:1401–1406.

*The Elements of Statistical Learning*. Stanford: Springer; 2001.

*Neural Inf Process Syst*. 2003:16.

*Comput Optim Appl*. 2004;28:185–202.

*J Artif Intell Res*. 2002;16:321–357.

^{st}Annual Conference on

*Neural Information Processing Systems (NIPS-2007)*, Vancouver, Canada, 2007.

**Keywords:**

comparative effectiveness; electronic health record; machine learning; prediction model; variable selection