
Original Research

Toward Prevention of Adverse Events Using Anticipatory Analytics

Norman, Joseph PhD,a; Akhavan, Amir PhD,a; Shen, Chen MSca; Aron, David MD, MScb; Leykum, Luci MD, MBA, MSca,c; Bar-Yam, Yaneer PhDa

Progress in Preventive Medicine: June 2020 - Volume 5 - Issue 2 - p e0029
doi: 10.1097/pp9.0000000000000029



The ubiquity of Electronic Health Records (EHRs), with hospital adoption rates growing to 83.8% in 2015 from 9.4% in 2008,1 combined with fast-advancing machine learning and analytics techniques, gives rise to the possibility of novel data analyses including medical concept extraction, patient trajectory modeling, and clinical decision support systems.2–4 In 2014, almost 60% of Americans suffered from at least 1 chronic condition. The prevalence of diagnosed type 1 and type 2 diabetes in 2016 and 2017 among U.S. adults was 0.5% and 8.5%, respectively.5 The risk of complications of diabetes can be substantially reduced by early diagnosis and treatment. In the past few years, researchers have achieved promising results in predicting diabetes mellitus using historical medical patient records.6,7 In addition, there is increasing evidence for the importance of dynamic measurements. For example, glycemic variability itself, separate from the level of glycemia, is a risk factor for complications of diabetes and poor outcomes.8–11

A variety of machine learning techniques have been employed for medical tasks and have yielded promising results.12

The EHR data for diabetic patients are usually information-rich records featuring patients’ vital signs, medical exams, laboratory test results, psychological test results, ethnicity, age, and family history. One useful method for identifying important information in large amounts of data is dimension reduction. Dimension reduction methods are unsupervised machine learning algorithms that identify the most distinguishing features in a high-dimensional space. Unsupervised learning can yield new information and insights, in contrast to the more standard supervised learning approach, which transfers known information from human experts to an artificial intelligence system.

In this article, we consider the possibility of using EHR data to better identify individuals who need more urgent attention because they are at risk for adverse events. We propose an analytic process that integrates many of the usual laboratory test results into combined measures to reflect health, disease, and risk.

An individual’s physical statistics and laboratory test results at a given time are represented as a point in a high-dimensional space, with the points forming a cloud. Using dimension reduction analysis, we represent these high-dimensional data in a much smaller number of dimensions that best capture the variation across individuals and appointments. Where a point lies in the resultant low-dimensional space is determined by a patient’s combined measure values at a given appointment.

We present a new method that combines dimension reduction and dynamic trajectory modeling to detect physiological variability based on historical EHR data. We suggest that outliers in these combined measures tend to exhibit greater physiologic variability. Their condition changes rapidly over time, an indicator that they may experience adverse events or death.



We used data from 5,133 medical appointments across 1,711 diabetic patients (each patient having 3 appointments approximately 6 months apart) including standard measures of blood pressure, sugar, and cholesterol, weight and body mass index, kidney function, and height and age (Appendix, Supplemental Digital Content 1). The test results are organized by unique patient ID so that we can investigate the trajectory of changes across appointments for each patient. Each dot in the scatter plot represents 1 observation, and there are 3 observations per patient in the data. This allows us to see which observations are similar (or different), and the temporal sequence reveals the most significant dynamical behavior.
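
The organization by patient ID described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, field names, and values are hypothetical and were chosen only to show how flat appointment records might be grouped into ordered per-patient trajectories.

```python
from collections import defaultdict

# Hypothetical flat records: (patient_id, appointment_index, measurements).
# Field names and values are illustrative and are not taken from the study data.
records = [
    ("p2", 1, {"a1c": 7.9, "sbp": 141.0}),
    ("p1", 0, {"a1c": 6.8, "sbp": 128.0}),
    ("p1", 2, {"a1c": 7.1, "sbp": 131.0}),
    ("p2", 0, {"a1c": 8.4, "sbp": 150.0}),
    ("p1", 1, {"a1c": 7.0, "sbp": 126.0}),
    ("p2", 2, {"a1c": 7.5, "sbp": 138.0}),
]


def group_by_patient(rows):
    """Return {patient_id: [measurements ordered by appointment index]}."""
    grouped = defaultdict(list)
    for patient_id, visit, values in rows:
        grouped[patient_id].append((visit, values))
    # Sort each patient's appointments so the temporal sequence is preserved.
    return {
        pid: [values for _, values in sorted(visits, key=lambda v: v[0])]
        for pid, visits in grouped.items()
    }


trajectories = group_by_patient(records)
```

Keeping the appointments in temporal order per patient is what later allows movement between appointments to be measured.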


We performed dimensionality reduction analysis, specifically principal component analysis (PCA), to identify the combinations of measures that best capture the variation in the population and how individual properties covary in conditions of health and disease across the population and over time. Combining data across the population makes it possible to compare one individual with others and to identify properties of the population as a whole. Further, we can track the change in combined measures as the physical conditions of individuals deteriorate or improve between appointments.

For the PCA, we selected the features that are most frequently considered in relation to diabetes and its complications: Hemoglobin A1c (A1c), LDL cholesterol, urine microalbumin, systolic and diastolic blood pressure, age, body mass index, height, and weight. Within the reduced dimensional space, we identified regions with characteristic properties of the individuals found there. We also studied the dynamics of individuals across the space and interpreted rapid movement from one edge to another as indicating instability of physiological observations, suggesting an absence of homeostasis and thus vulnerability to adverse events.
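
A PCA projection of this kind can be sketched with NumPy as shown here. This is a hedged illustration under stated assumptions: the article does not say whether measures were standardized before PCA, so the z-scoring step is an assumption, as are the column names, the column order, and the synthetic data that stand in for the appointment records.

```python
import numpy as np

# The nine measures named in the text; the identifiers and their order are assumptions.
FEATURES = ["a1c", "ldl", "microalbumin", "systolic_bp", "diastolic_bp",
            "age", "bmi", "height", "weight"]


def pca_project(X, n_components=2):
    """Standardize columns, then project rows onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    # Z-scoring is an assumption of this sketch; it requires nonzero column variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data: rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]
    scores = Z @ components.T  # one row of combined measures per appointment
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, components, explained[:n_components]


# Synthetic stand-in for the appointment matrix (rows: appointments, columns: measures).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(FEATURES)))
X[:, 8] += 2.0 * X[:, 6]  # make "weight" covary with "bmi" for illustration

scores, components, explained = pca_project(X)
```

Each row of `scores` gives one appointment's position in the 2 combined-measure dimensions, and each row of `components` gives the weights of the original measures in that combined measure.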


We visualized the population as a scatter plot in the 2 most informative reduced dimensions, that is, the 2 that capture the most variation across the population (Fig. 1). From the visualization, we found that individuals clustered into a single main group with outliers away from the central region that reflect specific conditions such as anomalously low blood pressure accompanied by low to normal cholesterol and sugar values (Fig. 1A), young adults with very high cholesterol levels (Fig. 1B), obese individuals with high blood pressure and cholesterol (Fig. 1C), nonobese individuals with very high blood pressure and typically normal values otherwise (Fig. 1D), and the oldest members of the population who have normal values overall (Fig. 1E).

Fig. 1.
Scatter plot showing individual appointments as points in the 2 combined measure dimensions that capture the most variation across the population. Areas with particular clinical measure signatures are circled. A, Anomalously low blood pressure, accompanied by low to normal LDL and A1c values. B, Young adults with very high LDL and/or Hemoglobin A1c values. C, Overweight with high blood pressure and raised LDL and A1c values. D, Very high blood pressure yet normal BMI and other values. E, Oldest members of the population, who have normal values overall. Of note is the relative sharpness of the boundary of E with the unpopulated region compared with the rest of the periphery. BMI indicates body mass index; LDL, low-density lipoprotein.

Coloring the points on the scatter plot according to the values of the original (noncombined) measures reveals how the reduced dimensions capture the structure of the population via the combined measures (Fig. 2). The way the different measures change together reveals patterns of disease properties.

Fig. 2.
Plots of original measures as a color overlay of the 2-dimensional combined measure plots. A, Blood sugar (A1c). B, Age. C, Body mass index. D, Diastolic blood pressure. E, Height. F, LDL. G, Systolic blood pressure. H, Weight. I, Microalbumin. A few of the original measures, such as age and blood pressure, have very strong dependencies on the combined coordinates; others, such as LDL, are correlated but do not vary as consistently with the combined dimensions.
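
One numerical counterpart to these color overlays is the correlation between each original measure and each combined coordinate. The sketch below is only an illustration of that idea and is not described in the article: the function name is hypothetical and the data are synthetic, constructed so that one measure tracks the first coordinate, one tracks the second with opposite sign, and one is unrelated.

```python
import numpy as np


def measure_component_correlation(X, scores):
    """Pearson correlation of each original measure (column of X) with each score column."""
    X = np.asarray(X, dtype=float)
    scores = np.asarray(scores, dtype=float)
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    Sz = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    # Mean product of z-scores equals the Pearson correlation coefficient.
    return Xz.T @ Sz / X.shape[0]


# Synthetic example: 500 appointments, 2 combined coordinates, 3 original measures.
rng = np.random.default_rng(1)
scores = rng.normal(size=(500, 2))
X = np.column_stack([
    scores[:, 0] + 0.1 * rng.normal(size=500),   # follows coordinate 1
    -scores[:, 1] + 0.1 * rng.normal(size=500),  # follows coordinate 2, inverted
    rng.normal(size=500),                        # unrelated to both
])

corr = measure_component_correlation(X, scores)  # shape: (measures, coordinates)
```

A measure with a correlation near 1 or -1 would show a smooth color gradient along that coordinate in an overlay plot, while one near 0 would appear speckled.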

We also analyzed the dynamics of patients across appointments in the combined measure dimensions (Figs. 3 and 4). As the data are from 3 appointments, each approximately 6 months apart, we can investigate the position of each of the dots in the principal components scatter plot and calculate the dynamics of the patients in this space. Patients with higher displacement in the principal components space are more variable in their physiological measurements over this time frame. Dynamic instability reflects a loss of homeostasis and potential for adverse events.
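
The article does not give an explicit formula for displacement. One plausible reading, consistent with the path length reported in Fig. 4, is the summed Euclidean distance between consecutive appointments in the 2 combined-measure dimensions. The sketch below assumes that definition; the coordinates are hypothetical.

```python
import math


def path_length(points):
    """Sum of Euclidean distances between consecutive appointments."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


# Hypothetical positions at 3 appointments in the 2 combined-measure dimensions.
stable_patient = [(0.0, 0.0), (0.3, 0.4), (0.3, 0.0)]
unstable_patient = [(3.0, 0.0), (0.0, 4.0), (3.0, 0.0)]

stable_length = path_length(stable_patient)
unstable_length = path_length(unstable_patient)
```

Note that the second patient returns to the starting position, so a net start-to-end displacement would be zero even though the movement is large; this is one reason to sum over the steps rather than compare only the first and last appointments.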

Fig. 3.
Dynamics of physiological states across 3 appointments separated by 6 months for 4 individuals. Appointments are highlighted in red (bright, medium, and dark red, in order). The top panels show trajectories of 2 individuals in the core of the data cloud. The bottom panels show 2 patients whose combined measures place them in the periphery of the population. Note the much larger and more rapid changes for those in the periphery, indicating an instability in self-regulatory mechanisms. In contrast, those in the core show much less variation.
Fig. 4.
Average path length associated with points in each area of the scatter plot. Short paths are found in the core and associated with physiological stability. Long paths are associated with physiological instability and found in the periphery. The area of Fig. 1C is particularly unstable. The area in the lower left, where the oldest individuals in the population are found, is the only peripheral region not associated with instability. It is noteworthy that these oldest individuals are more stable and form the sharpest boundary, consistent with a sharp age limit.
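
A map like the one in Fig. 4 could be approximated by binning appointments into grid cells and averaging the path lengths of the patients whose appointments fall in each cell. The article does not specify how areas were defined, so the square grid, the `cell_size` parameter, and the example coordinates below are all assumptions of this sketch.

```python
import math
from collections import defaultdict


def path_length(points):
    """Sum of Euclidean distances between consecutive appointments."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def mean_path_length_by_cell(trajectories, cell_size=1.0):
    """Average patient path length over the grid cells their appointments occupy."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for points in trajectories.values():
        length = path_length(points)
        for x, y in points:
            # Assign each appointment to a square cell of the 2-D combined-measure plane.
            cell = (math.floor(x / cell_size), math.floor(y / cell_size))
            totals[cell] += length
            counts[cell] += 1
    return {cell: totals[cell] / counts[cell] for cell in totals}


# Hypothetical trajectories: one patient in the core, one moving across the periphery.
trajectories = {
    "core": [(0.1, 0.1), (0.4, 0.5), (0.4, 0.1)],
    "edge": [(3.2, 0.2), (0.2, 4.2), (3.2, 0.2)],
}

cell_means = mean_path_length_by_cell(trajectories)
```

With real data, cells with high averages would correspond to the variability-associated regions and cells with low averages to the stable core.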

We found that those who are within the main cluster tend to have stable clinical measurements, not moving much from appointment to appointment. Those who are on the edges of the cluster tend to move rapidly and inconsistently from appointment to appointment. The areas associated with rapid movement suggest that individuals found there are physiologically unstable.

One specific edge region is associated with stability (Fig. 4). This region consists of the most elderly individuals, consistent with the hypothesis that unstable individuals are eliminated from the population leaving only the stable individuals in this region.


The availability of EHRs offers valuable opportunities for new insights. In this study, a relatively small dataset is studied using an unsupervised machine learning technique. We find that the dynamics of existing measures may be used to identify patients with greater clinical measure variability. Dynamic variability may reflect a loss of homeostasis, associated physiological instability, and potential for adverse events. The results indicate that even with such a small dataset, it is possible to identify patients with marginal conditions with the aid of a dimensionality reduction algorithm (PCA).

Constructing a reduced space of the most relevant dimensions and identifying individuals in variability-associated regions, taking measurements at multiple appointments, and observing an individual’s dynamics may also aid in identifying those at greater risk. We note that individuals with variable clinical measures may have an appointment as they pass through a stability-associated region. Their variability would become apparent from displacement across multiple appointments.

Whether this type of variability has the same implications as glycemic variability remains to be determined. Nevertheless, identifying patients with variability is the first step.


The authors have no financial interest to declare in relation to the content of this article. The Article Processing Charge was paid for by Progress in Preventive Medicine at the discretion of the Editor-in-Chief.


1. Charles D, Gabriel M, Searcy T. Adoption of Electronic Health Record systems among U.S. nonfederal acute care hospitals: 2008–2014. The Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 Directed the Office of the National Coordinator for Health; 2015. Available at:
2. Shickel B, Tighe PJ, Bihorac A, et al. Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Inform. 2018;22:1589–1604.
3. Santhanam T, Padmavathi MS. Application of K-means and genetic algorithms for dimension reduction by integrating SVM for diabetes diagnosis. Procedia Comput Sci. 2015;47:76–83.
4. Zheng T, Xie W, Xu L, et al. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–127.
5. Xu G, Liu B, Sun Y, et al. Prevalence of diagnosed type 1 and type 2 diabetes among US adults in 2016 and 2017: population based study. BMJ. 2018;362:k1497.
6. Barakat NH, Bradley AP, Barakat MN. Intelligible support vector machines for diagnosis of diabetes mellitus. IEEE Trans Inf Technol Biomed. 2010;14:1114–1120.
7. Mani S, Chen Y, Elasy T, et al. Type 2 diabetes risk forecasting from EMR data using machine learning. AMIA Annu Symp Proc. 2012;2012:606–615.
8. Critchley JA, Carey IM, Harris T, et al. Glycemic control and risk of infections among people with type 1 or type 2 diabetes in a large primary care cohort study. Diabetes Care. 2018;41:2127–2135.
9. Ceriello A, Monnier L, Owens D. Glycaemic variability in diabetes: clinical and therapeutic implications. Lancet Diabetes Endocrinol. 2019;7:221–230.
10. Ravona-Springer R, Heymann A, Schmeidler J, et al. Hemoglobin A1c variability predicts symptoms of depression in elderly individuals with type 2 diabetes. Diabetes Care. 2017;40:1187–1193.
11. Orsi E, Solini A, Bonora E, et al.; Renal Insufficiency and Cardiovascular Events (RIACE) Study Group. Haemoglobin A1c variability is a strong, independent predictor of all-cause mortality in patients with type 2 diabetes. Diabetes Obes Metab. 2018;20:1885–1893.
12. Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artif Intell Med. 2001;23:89–109.

Electronic Medical Records; Diabetes Mellitus; Dimensionality reduction; Anticipatory Analytics


Copyright © 2020 The Author(s). Published by Wolters Kluwer on behalf of the European Society of Preventive Medicine.