Opening the Black Box: Understanding the Science Behind Big Data and Predictive Analytics

Hofer, Ira S. MD*; Halperin, Eran PhD†,‡; Cannesson, Maxime MD, PhD*

doi: 10.1213/ANE.0000000000003463
Technology, Computing, and Simulation

Big data, smart data, predictive analytics, and other similar terms are ubiquitous in the lay and scientific literature. However, despite the frequency of usage, these terms are often poorly understood, and evidence of their disruption to clinical care is hard to find. This article aims to address these issues by first defining and elucidating the term big data, exploring the ways in which modern medical data, both inside and outside the electronic medical record, meet the established definitions of big data. We then define the term smart data and discuss the transformations necessary to make big data into smart data. Finally, we examine the ways in which this transition from big to smart data will affect what we do in research, retrospective work, and ultimately patient care.

From the *Department of Anesthesiology and Perioperative Medicine

Departments of Computer Science and Biomathematics, UCLA Henry Samueli School of Engineering and Applied Science

Department of Anesthesiology, David Geffen School of Medicine at UCLA, Los Angeles, California.

Published ahead of print May 25, 2018.

Accepted for publication March 22, 2018.

Funding: None.

Conflicts of Interest: See Disclosures at the end of the article.

Reprints will not be available from the authors.

Address correspondence to Ira S. Hofer, MD, Department of Anesthesiology and Perioperative Medicine, David Geffen School of Medicine at UCLA, 757 Westwood Plaza, Los Angeles, CA 90095. Address e-mail to

Big data, machine learning, artificial intelligence, and other similar buzzwords seem to be the hot topics of the new millennium. It is virtually impossible to go a full week without seeing one of these terms on the front page of a major newspaper or periodical. Finland has even instituted a universal basic income to help citizens, as work previously performed by humans is being increasingly automated.1 Yet for those of us on the ground in health care, the coming of the Information Age has led to increased administrative workload,2 less time with our patients, and even more burnout.3,4 So what has happened—are the pundits all wrong? Or are changes really underway that will revolutionize health care the way that industries such as banking, manufacturing, and others have been disrupted?



Big data is traditionally defined by the 4 Vs: volume, variety, velocity, and veracity (Table 1).5 Before the past decade, data around medical care were primarily collected via a tedious manual process. While some aggregate data might be collected automatically (say, operating room volumes), most patient-centric information (medical history, use of medications, etc) was collected during the performance of targeted prospective randomized controlled trials. Thus, data were of small volume, of limited variety, slow to produce, and considered highly accurate.

Table 1.

The introduction of the electronic medical record (EMR), combined with the proliferation of wearables and other devices, has changed all of this. The EMR alone generates gigabytes of information a day, not only in structured data such as medication orders, lab results, and vital signs but also in image files (eg, radiography), waveforms, and unstructured text (eg, parts of physician notes). These data are not only varied and voluminous but also have varying degrees of veracity.6–9

Perhaps the biggest change is that health care data can no longer be found solely in the EMR—a modern understanding of health mandates realizing that the factors which influence a patient’s health permeate his or her life. These data include not only data from wearables such as fitness trackers or smart watches (which contain data on a patient’s daily activity10 and potentially even cardiac function11) but also social determinants of health such as a patient’s socioeconomic status, profession, and family dynamics. All of these factors can have profound effects on a patient’s overall health and performance in the perioperative period. These data can now be captured in discrete ways and integrated into databases. These databases that contain data from the EMR combined with other health data are often called the electronic health record (EHR).12

Medical data is now big.5 However, bigger is not always better. Most physicians have experienced the frustration of paging through screen after screen of EHR data to understand a complex patient’s medical history and current state.4,13 Moreover, research has demonstrated that even diligent people often miss things.14 That is why bigger data is not the key to disruption—smarter data is.



While the concept of big data generally focuses on the size of the data, smart data focuses on the fifth V—value.15 This is often based on concentrating on the veracity of data while integrating domain-specific knowledge (clinical knowledge in the case of health care) to translate the data from raw numbers to actionable information.

To understand the issue of veracity, it is useful to begin with the terms precision and accuracy (Table 2). Figure 1 is a commonly used analogy to differentiate between the 2. The graph on the upper left demonstrates high accuracy; on average, the marksman hits the target. Unfortunately, the odds of hitting the bull’s-eye on any 1 shot are actually quite low. In contrast, precision refers to the reproducibility of the result. In the top right of the figure, the marksman consistently hits the same spot, even if it is the wrong spot. In the world of big data, people generally focus on the accuracy of the data—in aggregate, how well the information performs. In discussing smart data, the goal is to add precision to the accuracy (moving to the bottom right) and thus increase the veracity of the data.
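The marksman analogy can be made concrete in a few lines of code; the shot coordinates below are invented for illustration, with bias standing in for (in)accuracy and spread for (im)precision:

```python
import statistics

# Hypothetical horizontal offsets (cm) of each shot from the bull's-eye,
# for the two marksmen in the Figure 1 analogy.
accurate_not_precise = [-4.0, 3.5, -3.0, 4.0, -0.5]  # scattered, centered on 0
precise_not_accurate = [2.9, 3.0, 3.1, 3.0, 2.95]    # tight cluster, off-center

def bias(shots):
    """Accuracy: how far the *average* shot lands from the target (0)."""
    return abs(statistics.mean(shots))

def spread(shots):
    """Precision: how reproducible the shots are (standard deviation)."""
    return statistics.stdev(shots)
```

The first marksman has near-zero bias but large spread; the second has small spread but large bias. Smart data aims for low values of both.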

Table 2.

Figure 1.

As the veracity of the data is improved, clinical knowledge is then integrated to increase the value of the data. For example, this might involve translating data about medication administration and nursing documentation in the postanesthesia care unit into information about whether a patient had postoperative nausea and vomiting (PONV) or using a combination of hemoglobin A1c values, glucose results, and insulin prescriptions to classify patients based on diabetes mellitus control. This is the transformation into smart data.

Figure 2.

A framework to understand the progression of value of smart data can be seen in Figure 2. Taking the foundation of big data, smart data can be first thought of as descriptive (who had PONV), then predictive (who will have PONV), then prescriptive (how can you prevent PONV), and finally cognitive (teaching or explaining PONV to providers or patients). Using this foundation, the remainder of this article will explore how the transition from big data to smart data will have implications for research and patient care.



In anesthesia, many studies have focused on using data from EMRs, from registries, or collected manually to explore the association between various events and outcomes (ie, hypotension and acute kidney injury [AKI]16,17) or to create models that predict perioperative risk.18 In both areas, the transition from big data to smart data will yield significant changes.


Data Integrity and Extraction

As discussed earlier, fundamental to the transition from big to smart data is improved accuracy and precision. Multiple studies have shown significant variation both among data sources (EMR, registry, and administrative) and within them, thereby highlighting the need for data models that produce more accurate data.7–9

In the simplest sense, this involves using statistical information about the underlying data to remove erroneous results. For example, while an EHR might store temperatures in both Celsius and Fahrenheit, a data-cleansing technique might look at the distribution of temperatures stored in the EHR and eliminate those that fall outside of the traditional norms and then use a conversion to standardize the measurement.
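A minimal sketch of such a cleansing step follows; the function name and the plausibility ranges are illustrative assumptions, not validated clinical rules:

```python
def clean_temperatures(raw_temps):
    """Standardize a mixed list of temperature readings to Celsius.

    Readings in a plausible Fahrenheit physiologic range (90-110) are
    converted; readings already plausible in Celsius (30-45) pass through;
    everything else is dropped as erroneous. The ranges are illustrative
    thresholds only.
    """
    cleaned = []
    for t in raw_temps:
        if 30.0 <= t <= 45.0:        # plausible Celsius reading: keep as-is
            cleaned.append(t)
        elif 90.0 <= t <= 110.0:     # plausible Fahrenheit reading: convert
            cleaned.append(round((t - 32.0) * 5.0 / 9.0, 1))
        # otherwise: outside both ranges -> discard as an erroneous entry
    return cleaned
```

Given a mixed list such as `[37.0, 98.6, 370.0, 101.3]`, the Celsius reading is kept, the Fahrenheit readings are converted, and the implausible 370.0 (likely a data-entry error) is discarded.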

As a next step, investigators can devise algorithms to better extract the most accurate and precise information from the EMR. For example, rather than simply classifying a patient with diabetes mellitus as any patient with an International Classification of Disease code for diabetes mellitus, a researcher might also include those with an Hb A1c >6.5% or a previous prescription for home insulin. Our group has previously demonstrated that a technique such as this is more accurate than manual chart review for determining postoperative ventilator duration after cardiac surgery.14,19,20
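A rule-based phenotype of this kind might be sketched as follows; the function, the field names, and the exact rules are hypothetical, with the ICD-10 E10/E11 diabetes code prefixes and the 6.5% threshold following the example in the text:

```python
def has_diabetes(patient):
    """Hypothetical rule-based diabetes phenotype combining three
    independent EHR signals: a diagnosis code, a lab threshold, or a
    home-medication record. `patient` is a dict with keys 'icd_codes',
    'hba1c_pct', and 'home_meds'.
    """
    has_code = any(code.startswith(("E10", "E11"))
                   for code in patient.get("icd_codes", []))
    high_a1c = any(value > 6.5 for value in patient.get("hba1c_pct", []))
    on_insulin = any("insulin" in med.lower()
                     for med in patient.get("home_meds", []))
    return has_code or high_a1c or on_insulin
```

Because any one of the three signals suffices, the phenotype catches patients whose diagnosis was never coded but whose labs or medications reveal the disease, which is the sense in which it outperforms a single-source definition.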

Finally, modern methods of natural language and image processing are beginning to allow us to access unstructured data—raw images and text. Just as cell phones and home personal assistants (eg, Amazon’s Alexa) are increasingly able to understand raw speech, in medicine, these machine-learning algorithms are beginning to allow researchers to read physician notes, extracting key phenotypic information, or to process raw radiographic image files to detect diseases such as cancer and pneumonia.21,22 As these technologies reach maturity, they can be combined with the structured data in the EHR to create more robust data sets with increasing precision and accuracy.


Combining Data Sets

A second major challenge to be solved in moving from big research to smart research is the linking of data across multiple sources. At present, data from a single patient are usually spread across multiple institutional EMRs and potentially populating many data registries. Thus, at present, a researcher at 1 hospital will not necessarily be able to detect a readmission at another. As Glance et al23 pointed out, as we move toward smart data, there will be increasing ways to link these records (without necessarily exposing the underlying protected health information), thereby producing more complete data. To take an example from earlier, it is currently cumbersome for researchers to determine whether a prescription for home insulin was not only written but also filled.
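One commonly described way to link records without exchanging raw identifiers is to compare salted cryptographic hashes of normalized demographics. The sketch below is a simplified illustration of that general idea, not the specific method referenced by Glance et al, and it omits the key management and typo tolerance a real deployment would need:

```python
import hashlib

SHARED_SALT = "site-agreed-secret"  # agreed out of band by both institutions

def link_token(first, last, dob):
    """Deterministic pseudonym: both sites hash the same normalized
    identifiers with a shared salt, then compare tokens instead of
    exchanging raw protected health information."""
    normalized = f"{first.strip().lower()}|{last.strip().lower()}|{dob}"
    return hashlib.sha256((SHARED_SALT + normalized).encode()).hexdigest()

# Each hospital keys its records by token; only tokens cross the wall.
hospital_a = {link_token("Jane", "Doe", "1960-01-02"): {"surgery": "CABG"}}
hospital_b = {link_token(" jane", "DOE ", "1960-01-02"): {"readmitted": True}}

# The intersection of token sets identifies shared patients without
# either site ever seeing the other's names or birth dates.
shared = hospital_a.keys() & hospital_b.keys()
```

Note that the normalization step (trimming and lowercasing) is doing real work here: without it, trivial formatting differences between the two EMRs would produce different hashes and the match would be missed.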

Eventually, this may progress to the ability to integrate cleaned consumer health data (such as that from wearables) into EMR data, yielding ever more accurate data sets. This can open whole new avenues of research into areas such as medication compliance and opioid abuse.


More Advanced Techniques

The previous text describes better descriptive data, but research will increasingly also become more predictive by integrating more advanced analytics techniques. Current research into problems such as postoperative myocardial infarction (MI), mortality, and AKI tends to make use of multivariate logistic regression models to explore the relative effects of various comorbidities.24 While these models can be illustrative, they have difficulty accounting for the dynamic interplay among comorbidities. For example, the effect of intraoperative hypotension on postoperative MI is likely different for an American Society of Anesthesiologists (ASA) physical status IV patient than a physical status I patient and even between physical status IV patients with severe congestive heart failure and those with cirrhosis. Logistic models can only account for these factors if they are built into the model by the researcher, while machine-learning models can discover these associations on their own, not only creating more accurate models but also helping us to better understand the relevant physiology. These methods have the promise of creating models that are highly accurate and precise in predicting postoperative outcomes and greatly improving our current risk stratification methods.
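The point about effect modification can be illustrated with a toy simulation; the event rates and the from-scratch fitting code below are invented for illustration, not drawn from any cited study. A main-effects-only logistic model is forced to assign hypotension a single coefficient, whereas adding the interaction term (the kind of structure flexible machine-learning models discover on their own) lets the model capture a hypotension effect that exists only in ASA physical status IV patients:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=1.0, iters=8000):
    """Plain batch gradient-descent logistic regression (no regularization)."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def log_loss(X, y, w):
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)

# Simulated cohort (illustrative rates): intraoperative hypotension raises
# postoperative MI risk substantially only in ASA IV patients.
cells = [(0, 0, 1), (0, 1, 1), (1, 0, 4), (1, 1, 12)]  # (asa4, hypo, events per 20)
X_main, X_inter, y = [], [], []
for asa4, hypo, events in cells:
    for i in range(20):
        y.append(1 if i < events else 0)
        X_main.append([1.0, asa4, hypo])                # main effects only
        X_inter.append([1.0, asa4, hypo, asa4 * hypo])  # + interaction term

w_main = fit_logistic(X_main, y)
w_inter = fit_logistic(X_inter, y)
loss_main = log_loss(X_main, y, w_main)
loss_inter = log_loss(X_inter, y, w_inter)
```

The interaction model fits the data strictly better (lower log-loss) and assigns a large positive interaction coefficient, which is exactly the "hypotension matters more in sicker patients" physiology a researcher would otherwise have to specify by hand.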



As our research increasingly becomes more predictive and eventually prescriptive, it will likely move from the bench to the bedside. In our opinion, this evolution will come first with better prediction, then become more personalized, and eventually become truly prescriptive.


Predictive Analytics

As we have discussed, current logistic models have been created to predict postoperative outcomes with a high degree of accuracy.25–27 Unfortunately, while these algorithms are accurate, when it comes to day-to-day patient care, they are often imprecise. The addition of improved descriptive data in conjunction with better statistical tools, as described earlier, will enable the creation of models that both accurately and precisely predict patient outcomes. One can begin to imagine a dashboard that contains dynamic risk scores for patients updated in real time with their accurate and precise risks of perioperative complications such as AKI, MI, or mortality.

Providers can then use these scores to triage patients, spotting early decompensation and intervening before complications occur.28 The integration of these models into clinical care will likely result in changes to the way that doctors care for patients and will require the integration of disciplines such as human factors research to help facilitate acceptance and ensure that they enhance rather than disrupt workflows. However, if performed successfully, they will also enable advances such as risk-based preoperative triage, telemedicine, prehabilitation, and other programs to improve perioperative outcomes.


Personalized Medicine

One downside of the algorithms that we generate today is that they are 1 size fits all. An algorithm developed on 10,000 ASA physical status III and IV patients in an academic medical center may or may not be accurate in the community setting—or for an ASA physical status II patient. Even more to the point, the algorithm can only describe the risk factors for an outcome at the institutions where it was developed; different hospitals (or different countries) with different care systems might have different results.

As future models increase in complexity, they will be able to account for these factors and include others. By greatly expanding the number of factors included in the model (and the complexity with which they are analyzed), differential performance across race, gender, or even economic class (ie, the social determinants of health) can be accounted for.

Additionally, these models will be able to teach themselves. Many of us have experienced setting up a new iPhone and repeating the same phrase multiple times; what is happening is that Siri is teaching herself to understand our voice—this is machine learning. Just as Siri can learn our voice, advanced models will be able to calibrate themselves to the data from the EHR. This will allow them to adjust for effects at different locations and in different patients—thereby making the predictions truly personalized.
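A minimal sketch of such self-calibration is intercept-only logistic recalibration, in which a model trained elsewhere learns a site-specific offset so that its average predicted risk matches the local event rate. The function and numbers below are illustrative assumptions, not a production method:

```python
import math

def recalibrate_intercept(base_logit_scores, outcomes, lr=1.0, epochs=200):
    """Learn a site-specific intercept shift (logistic recalibration
    'in the large'): gradient descent drives the average predicted
    probability toward the locally observed event rate, leaving the
    original model's coefficients untouched."""
    delta = 0.0
    n = len(outcomes)
    for _ in range(epochs):
        grad = 0.0
        for s, y in zip(base_logit_scores, outcomes):
            p = 1.0 / (1.0 + math.exp(-(s + delta)))
            grad += p - y
        delta -= lr * grad / n
    return delta
```

For instance, if an imported model predicts 50% risk (logit 0) for every local patient but only 20% of them actually have the event, the learned shift converges to roughly logit(0.2) ≈ -1.39, pulling the predictions down to the local reality without retraining the whole model.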


Prescriptive Analytics

As we move further ahead with learning systems, they will eventually move from being purely predictive to actually prescriptive—that is, helping to determine optimal treatment. Most anesthesiologists have cared for a patient who has become hypotensive and wondered about the best treatment. Perhaps a fluid bolus or a vasopressor, and if a pressor, which one? Current treatment often involves clinical judgment and at times trial and error (giving volume and looking for a response) or personal preference (phenylephrine versus ephedrine). A prescriptive model would be able to predict response to volume (perhaps based on the waveform and a time trend of blood pressures) and various vasopressors to help elucidate the best course based on traditional clinical data and genetic information or biomarkers.



Thus far, this review has addressed the first 3 aspects of smart data—descriptive, predictive, and prescriptive—but what about the cognitive?

The reality is that we are not much closer to having machines replicate the cognitive abilities of people than we were when Isaac Asimov wrote I, Robot in 1950.29 While computers may help us better guess which outcomes may occur for a specific patient, they still lack the capacity to translate those predictions into decisions that take into account patient preferences. A high-risk patient may be likely to have a significant complication, but there are downsides to not having surgery as well. The correct decision is not simply one of weighing the odds—but rather balancing the relative risks with the patient’s overall goals. Maybe the procedure is risky, but the patient would rather roll the dice and be healthy than continue in a state of illness. Making the right decision requires talking to a compassionate doctor who can help the patient understand the risks and benefits—a computer cannot currently replicate this.

The current information revolution will certainly have profound implications for health care, just as it has for other areas of our life. If managed properly, these changes will enable better patient care and likely change the daily lives of those who provide patient care. However, as data become smarter and these changes unfold, our role as physicians will not be to blindly implement the suggestions of the machine but rather to integrate an algorithm’s prediction with our own understanding of pathophysiology and the patient’s wishes. That will help us provide precise and personalized patient-centered medicine.



Name: Ira S. Hofer, MD.

Contribution: This author helped write and conceive the manuscript.

Conflicts of Interest: None.

Name: Eran Halperin, PhD.

Contribution: This author edited the manuscript.

Conflicts of Interest: None.

Name: Maxime Cannesson, MD, PhD.


Contribution: This author edited and helped conceive the manuscript.

Conflicts of Interest: M. Cannesson is co-owner of US patent serial no. 61/432,081 for a closed-loop fluid administration system based on dynamic predictors of fluid responsiveness, which has been licensed to Edwards Lifesciences. M. Cannesson is a consultant for Edwards Lifesciences (Irvine, CA), Medtronic (Boulder, CO), and Masimo Corp (Irvine, CA). M. Cannesson has received research support from Edwards Lifesciences through his department and from NIH R01 GM117622 (Machine Learning of Physiological Variables to Predict, Diagnose, and Treat Cardiorespiratory Instability) and NIH R01 NR013912 (Predicting Patient Instability Noninvasively for Nursing Care-Two [PPINNC-2]).

This manuscript was handled by: Nancy Borkowski, DBA, CPA, FACHE, FHFMA.



1. Cuthbertson A. Finland trials universal basic income on 2000 citizens. Available at: Accessed May 17, 2018.
2. Park SY, Lee SY, Chen Y. The effects of EMR deployment on doctors’ work practices: a qualitative study in the emergency department of a teaching hospital. Int J Med Inform. 2012;81:204–217.
3. Babbott S, Manwell LB, Brown R. Electronic medical records and physician stress in primary care: results from the MEMO Study. J Am Med Inform Assoc. 2014;21:e100–e106.
4. Shanafelt TD, Dyrbye LN, Sinsky C, et al. Relationship between clerical burden and characteristics of the electronic environment with physician burnout and professional satisfaction. Mayo Clin Proc. 2016;91:836–848.
5. Levin MA, Wanderer JP, Ehrenfeld JM. Data, big data, and metadata in anesthesiology. Anesth Analg. 2015;121:1661–1667.
6. Wagner MM, Hogan WR. The accuracy of medication data in an outpatient electronic medical record. J Am Med Inform Assoc. 1996;3:234–244.
7. McIsaac DI, Gershon A, Wijeysundera D, Bryson GL, Badner N, van Walraven C. Identifying obstructive sleep apnea in administrative data: a study of diagnostic accuracy. Anesthesiology. 2015;123:253–263.
8. Lawson EH, Louie R, Zingmond DS. A comparison of clinical registry versus administrative claims data for reporting of 30-day surgical complications. Ann Surg. 2012;256:973–981.
9. Seddon DJ, Williams EM. Data quality in population-based cancer registration: an assessment of the Merseyside and Cheshire Cancer Registry. Br J Cancer. 1997;76:667–674.
10. Wolk S, Meißner T, Linke S, et al. Use of activity tracking in major visceral surgery—the Enhanced Perioperative Mobilization (EPM) trial: study protocol for a randomized controlled trial. Trials. 2017;18:77.
11. Rudner J, McDougall C, Sailam V, Smith M, Sacchetti A. Interrogation of patient smartphone activity tracker to assist arrhythmia management. Ann Emerg Med. 2016;68:292–294.
12. Garrett P, Seidman J. EMR vs EHR–what is the difference? Available at: Accessed May 17, 2018.
13. Tai-Seale M, Olson CW, Li J. Electronic health record logs indicate that physicians split time evenly between seeing patients and desktop medicine. Health Aff (Millwood). 2017;36:655–662.
14. Gabel E, Hofer IS, Satou N. Creation and validation of an automated algorithm to determine postoperative ventilator requirements after cardiac surgery. Anesth Analg. 2017;124:1423–1430.
15. Coffey L. Smart Data - What is it & How it is Differe... | Public Site: MIT IDE. digitalcommunitymitedu. Available at: Accessed December 31, 2017.
16. Salmasi V, Maheshwari K, Yang D. Relationship between intraoperative hypotension, defined by either reduction from baseline or absolute thresholds, and acute kidney and myocardial injury after noncardiac surgery: a retrospective cohort analysis. Anesthesiology. 2017;126:47–65.
17. Oprea AD, Lombard FW, Liu WW. Baseline pulse pressure, acute kidney injury, and mortality after noncardiac surgery. Anesth Analg. 2016;123:1480–1489.
18. Chamoun GF, Li L, Chamoun NG, Saini V, Sessler DI. Validation and calibration of the risk stratification index. Anesthesiology. 2017;126:623–630.
19. Agarwal V, Podchiyska T, Banda JM. Learning statistical models of phenotypes using noisy labeled training data. J Am Med Inform Assoc. 2016;23:1166–1173.
20. Yu S, Liao KP, Shaw SY. Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. J Am Med Inform Assoc. 2015;22:993–1000.
21. Anavi Y, Kogan I, Gelbart E, Geva O, Greenspan H. A comparative study for chest radiograph image retrieval using binary texture and deep learning classification. Conf Proc IEEE Eng Med Biol Soc. 2015;2015:2940–2943.
22. Qiu Y, Yan S, Gundreddy RR. A new approach to develop computer-aided diagnosis scheme of breast mass classification using deep learning technology. J Xray Sci Technol. 2017;25:751–763.
23. Glance LG, Wanderer JP, Dick AW, Dutton RP. Building bridges across clinical registries. Anesth Analg. 2017;125:1–3.
24. Long TE, Helgason D, Helgadottir S. Acute kidney injury after abdominal surgery: incidence, risk factors, and outcome. Anesth Analg. 2016;122:1912–1920.
25. Le Manach Y, Collins G, Rodseth R. Preoperative score to predict postoperative mortality (POSPOM): derivation and validation. Anesthesiology. 2016;124:570–579.
26. Sigakis MJ, Bittner EA, Wanderer JP. Validation of a risk stratification index and risk quantification index for predicting patient outcomes: in-hospital mortality, 30-day mortality, 1-year mortality, and length-of-stay. Anesthesiology. 2013;6:525–540.
27. Sessler DI, Sigl JC, Manberg PJ, Kelley SD, Schubert A, Chamoun NG. Broadly applicable risk stratification system for predicting duration of hospitalization and mortality. Anesthesiology. 2010;113:1026–1037.
28. Sobol JB, Wunsch H. Triage of high-risk surgical patients for intensive care. Crit Care. 2011;15:217.
29. Asimov I. I, Robot. 1950. Greenwich, CT: Fawcett Publications.
Copyright © 2018 International Anesthesia Research Society