Tapping Into Underutilized Healthcare Data in Clinical Research : Annals of Surgery


Mori, Makoto MD∗,†; Schulz, Wade L. MD, PhD†,‡; Geirsson, Arnar MD; Krumholz, Harlan M. MD, SM†,§

doi: 10.1097/SLA.0000000000003329

The new digital era of medicine presents an immense opportunity to leverage data that would have been too difficult to acquire by manual means. Traditional registries, including the Society of Thoracic Surgeons (STS) database and the National Cardiovascular Data Registry (NCDR), acquire data by manually abstracting information from electronic medical records to fill a predefined list of data fields. In this process, the original high-dimensional data are often collapsed into a series of aggregate (maximum, mean, or skewness) or binary variables. This type of data structure is easy to handle and is foundational to conventional predictive models. However, potentially meaningful data and metadata are discarded in this reductionist process, consequently predefining the “study-able” phenomena at the time of data collection.

One source of data, replete with information yet underutilized in predictive models, is the operating room. For instance, a complex procedure such as coronary artery bypass grafting (CABG) can produce more than 200,000 data points generated by medical devices, including the cardiopulmonary bypass machine, continuous physiologic monitors, and mechanical ventilators, at per-second temporal resolution, in addition to data that are manually entered by the circulating nurse (Table 1). This count does not include the imaging data produced by transesophageal echocardiography, which may total several hundred megabytes when 3-dimensional reconstruction of the intracardiac anatomy is included. Such high-dimensional data are produced in a matter of 4 to 5 hours during a typical CABG operation. However, these data are not customarily used in postsurgical outcomes prediction or research.

TABLE 1. Data Generated During an Isolated CABG

The STS Adult Cardiac Surgery Database (ACSD) is 1 of the most extensive clinical databases capturing variables related to cardiac surgery.1 For isolated CABG, the current data version (v2.90) contains approximately 200 intraoperative variables. The more than 200,000 raw data points generated over the entire course of a CABG operation are therefore reduced to approximately 200 fields, defined by a panel of experts as the variables perceived as most pertinent to encode the course of the surgery. Such reduced, structured data formats have advantages: they are uniform across institutions with different electronic medical record systems, and they are easier to analyze. This type of data structure, in which the number of independent variables is relatively small and consistent across patients, is suited to conventional regression analyses. Commercial statistical analysis packages are typically designed around this particular data structure. However, the rich features within the raw data may provide new opportunities to learn more about a surgery and to better predict a patient's risk in the days that follow.

Raw data take an even more complex structure for elements recorded continuously along the time axis, such as the flow tracing from the cardiopulmonary bypass machine or airway pressure from the mechanical ventilator. Such high-dimensional, time-dependent data may contain important features of the operation. For example, air locking of the bypass circuit is a rare but important event, as the entire cardiopulmonary bypass circuit halts until the air lock resolves. Apart from such sentinel events, features such as the duration of low arterial flow or persistently high airway pressure may have clinical implications, but they are not routinely captured in clinical registries. General patterns in continuous vital signs, such as heart rate variability, may provide relevant details about the patient's clinical status, but are seldom studied, often because of the complexity of routinely acquiring and analyzing these data. Familiarity with conventional research tools, including data collection forms for clinical registries and software packages designed with low-dimensional data in mind, has also likely contributed to the underutilization of high-dimensional data.
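As a concrete illustration, a feature such as the duration of low arterial flow can be extracted from a per-second pump trace with a simple scan over the signal. The sketch below is illustrative only; the 3 L/min threshold, the 30-second minimum duration, and the simulated trace are assumptions, not validated clinical criteria.

```python
import numpy as np

def low_flow_episodes(flow_lpm, threshold=3.0, min_duration_s=30):
    """Find sustained episodes of low arterial pump flow in a per-second
    flow trace (L/min). Threshold and minimum duration are illustrative."""
    low = np.asarray(flow_lpm) < threshold
    episodes = []
    start = None
    for i, is_low in enumerate(low):
        if is_low and start is None:
            start = i                      # episode begins
        elif not is_low and start is not None:
            if i - start >= min_duration_s:
                episodes.append((start, i))  # [start, end) in seconds
            start = None
    if start is not None and len(low) - start >= min_duration_s:
        episodes.append((start, len(low)))   # episode runs to end of trace
    return episodes

# Simulated 10-minute trace: normal flow with one 60-second dip
flow = np.full(600, 4.5)
flow[200:260] = 2.0
print(low_flow_episodes(flow))  # [(200, 260)]
```

The same pass-over-the-signal pattern generalizes to other threshold-based features, such as persistently high airway pressure.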

Some research efforts are underway in this area, such as the Multicenter Perioperative Outcomes Group (MPOG), a multicenter consortium that houses granular perioperative data. MPOG has begun to demonstrate the feasibility of acquiring, storing, and deriving value from such granular perioperative data across a large number of hospital systems,2 and may provide a preliminary platform upon which future efforts can be built. Further work is needed to ensure completeness of the data across all participating sites and to centralize data collected beyond the operating room. Ultimately, such high-resolution data may bring value to clinical practice in the form of dynamic decision-support tools, in which predictions of adverse events are updated in real time as new data are generated.

For high-dimensional operative data to bring value to clinical practice and research, some changes will be needed, including the consistent storage of the generated data, commodification of feature extraction techniques, and a paradigm shift away from the up-front reductionist approach.

  • 1. Data management: Many data sampled at sub-second intervals undergo information loss when they are stored at longer intervals. For example, a cardiopulmonary bypass device may be capable of sampling pressures and flows at 1-second intervals but record only a 30-second average, with an even longer average transferred to the electronic health record. This reduction facilitates efficient transmission and storage. With continuing improvement in software and hardware capacity, however, efficient transmission and storage of all the raw data should be possible with investment in infrastructure.3,4
  • 2. Analysis and feature engineering: Effective use of high-dimensional data requires advanced analytical skills involving feature projection, engineering, and extraction. Industry attracted such skill sets early on, and these skills may not yet be readily accessible in academic research settings,5 although such approaches are gaining popularity in the context of machine learning in academia. Much as basic programming skills have become a commodity and a necessity, advanced analytic techniques for high-dimensional data are likely to follow the same trend. Importantly, the raw data are rich but noisy, often containing artifacts such as dampened arterial line tracings due to a kinked line. Systematic discrimination of signal from noise may be accomplished via advanced classification algorithms, including machine learning.
  • 3. Paradigm shift away from a reductionist approach: The current approach to clinical data collection for research, at least in the framework of constructing a clinical registry, is to reduce the dimensionality of the data at the time of collection. Postoperative renal failure, for example, may be reduced to a binary coding of “yes” or “no” relative to a threshold value. In reality, renal function changes dynamically, and the estimated glomerular filtration rate plotted along the time axis encodes features such as the rate of decline in renal function, the duration of poor renal function, and the rate of improvement. Utilization of these rich data becomes possible by abandoning the convention of upfront reduction at the time of data collection.
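The information loss described in point 1 is easy to demonstrate: a brief hypotensive dip that is obvious in a per-second trace can all but vanish once only 30-second averages are stored. The values below (a 60 mm Hg baseline with simulated noise and a 5-second drop to 20 mm Hg) are assumptions for illustration, not real device output.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = 60 + rng.normal(0, 1, 300)   # 5 minutes of per-second pressures (mm Hg)
raw[120:125] = 20.0                # brief 5-second hypotensive event

# Store the trace as 30-second averages, as a device interface might
stored = raw.reshape(-1, 30).mean(axis=1)

print(raw.min())     # the drop is obvious in the raw trace (20.0)
print(stored.min())  # but largely averaged away in the stored trace
```

The stored minimum lands in the low 50s mm Hg, so a downstream model sees only a mild dip where the raw signal recorded a profound one.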
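Point 3 can likewise be made concrete: rather than a single binary “renal failure” flag, a postoperative eGFR trajectory can be summarized by its nadir, its slopes of decline and recovery, and its time below a cutoff. The feature set, the 60 mL/min/1.73 m² cutoff, and the sample trajectory below are illustrative assumptions, not registry definitions.

```python
import numpy as np

def egfr_features(times_h, egfr, cutoff=60.0):
    """Summarize a postoperative eGFR trajectory (mL/min/1.73 m2) sampled
    at times_h (hours). Feature names and cutoff are illustrative."""
    t = np.asarray(times_h, dtype=float)
    g = np.asarray(egfr, dtype=float)
    i = int(np.argmin(g))                      # index of the nadir
    # Slope from baseline to nadir, and from nadir to final measurement
    decline_rate = (g[i] - g[0]) / (t[i] - t[0]) if i > 0 else 0.0
    recovery_rate = (g[-1] - g[i]) / (t[-1] - t[i]) if i < len(g) - 1 else 0.0
    # Hours spent below the cutoff (left-Riemann approximation)
    below = g < cutoff
    hours_below = float(np.sum(np.diff(t)[below[:-1]]))
    return {"nadir": float(g[i]),
            "decline_rate": float(decline_rate),
            "recovery_rate": float(recovery_rate),
            "hours_below_cutoff": hours_below}

# Daily eGFR over 4 postoperative days: decline, then partial recovery
times = [0, 24, 48, 72, 96]      # hours after surgery
egfr = [90, 55, 40, 50, 70]      # mL/min/1.73 m2
print(egfr_features(times, egfr))
```

A binary threshold would code this trajectory simply as “renal failure: yes”; the features above additionally capture how fast function fell, how long it stayed depressed, and how quickly it recovered.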

To achieve these goals, hospitals and other healthcare organizations must make an institutional commitment to proper investment. Because of the complexity of information technology systems and clinical devices, data warehousing and clinical engineering teams must collaborate with clinical stakeholders to allow seamless and complete acquisition and centralization of such data.3 Through such collaboration, our group implemented a preliminary data science platform at a large academic healthcare system to support real-time, big-data workloads.4 This process requires investment first in software implementation, and also in hardware and network infrastructure. The density of inflowing data must then be managed to facilitate data organization. Once the mechanism to centralize the data is in place, recruiting analytical talent would allow the actual utilization of the data in the form of developing and validating predictive and prescriptive analytics to improve postoperative care and outcomes. Data scientists accustomed to high-dimensional data and machine learning approaches are important, and hospitals affiliated with academic institutions may have an advantage in recruiting personnel from statistics and data science departments. Clinicians must collaborate closely with analysts to identify clinically meaningful values and metrics for analysis. To facilitate this communication, clinicians well versed in statistics and data science may also be important talent to acquire. Once a critical mass of institutions achieves such an organization and demonstrates the value and use of these data, large clinical registries may begin to incorporate such data as part of their routine data collection process.

In summary, the vast majority of data produced in healthcare are underutilized in biomedical research and, when used, are often reduced to a small subset of the available information. Investment in infrastructure for data management and analysis, access to analytic skills needed to handle high-dimensional data, and a paradigm shift away from upfront dimensionality reduction would allow for more effective utilization of these incredibly rich data.


1. The Society of Thoracic Surgeons. Adult Cardiac Surgery Database Data Collection. Available at: https://www.sts.org/registries-research-center/sts-national-database/adult-cardiac-surgery-database/data-collection. Accessed October 8, 2018.
2. Akkermans A, van Waes JAR, Thompson A, et al. An observational study of end-tidal carbon dioxide trends in general anesthesia. Can J Anaesth 2019; 66:149–160.
3. Rockstroh M, Franke S, Neumuth T. Requirements for the structured recording of surgical device data in the digital operating room. Int J Comput Assist Radiol Surg 2014; 9:49–57.
4. McPadden J, Durant TJ, Bunch DR, et al. A scalable data science platform for healthcare and precision medicine research. arXiv 2018; abs/1808.04849.
5. Rodriguez-Sanchez F, Marwick B, Lazowska E, et al. Academia's failure to retain data scientists. Science 2017; 355:357–358.

Keywords: healthcare data; machine learning; operating room; outcomes prediction

Copyright © 2019 Wolters Kluwer Health, Inc. All rights reserved.