I must admit that I am a tad skeptical about the value of Big Data and will explain why below. For its most direct application, I of course recognize the promise that large data sets hold for medical care and its usage by different oncologists, say. If, for example, one could get detailed clinical information on the treatment of a thousand patients with the same tumor, it might be possible to demonstrate which treatment was more effective.
Big Data was the topic of an entire recent issue of Health Affairs (“Using Big Data to Transform Care,” July 2014), which included viewpoints on the promise of, and obstacles to, its potential use in practice. Modern technology has made it possible to collate and analyze mountains of data very rapidly. Questions can now be asked that could not have been answered as recently as two or three years ago. Furthermore, such data could be made available to practicing physicians in their offices. By the way, “Big Data” has no precise definition; the term is used as a generic description of very large data sets.
An ambitious effort implementing this approach is ASCO's CancerLinQ. With ASCO's access to its very large membership of oncologists and the new partnership with the German technology company SAP, the plan, as the society notes, is for ASCO and its wholly owned subsidiary, CancerLinQ LLC, to use the SAP HANA platform in the development of CancerLinQ, “a groundbreaking health information technology platform that will harness Big Data to deliver high-quality care to patients with cancer. It is one of the only major cancer data initiatives being developed and led by physicians with the primary purpose of improving patient care.”
SAP already has extensive experience in the management of medical data, which by itself makes success more likely.
However, there are many potholes on the road to the efficient and affordable use of Big Data in medicine, some of which are well described in articles in the Health Affairs issue.
For example, in “For Big Data, Big Questions Remain,” Dawn Fallik describes the difficulty of analyzing the Medicare payment records of 825,000 practitioners nationwide released to the public in response to a lawsuit by the Wall Street Journal. Journalists were elated because the release showed promise for transparent, useful information about which doctor to choose, the best price, and so forth.
She goes on: “Except none of these things happens magically, seamlessly, cleanly. Databases managed by one state, for example, rarely match up with those from another state—they may collect different information or use different diagnosis codes. And rarely can data, especially in great volume, tell a clear story on their own. They have to be understood, interpreted, and explained and could easily be misunderstood. For example, after the release of the Medicare data, some publications reported that ophthalmologists were the greatest beneficiaries of Medicare payments, but some failed to explain that ophthalmologists typically treat an older population that is more likely to bill Medicare and often administer expensive medications in their offices.”
Fallik cited an interesting example of the over-interpretation of Big Data: Google Flu Trends is a model for following how flu spreads in real time by using search-engine queries—e.g., the number of searches for “influenza” from distinct geographic areas. In 2009 Google published an article in Nature reporting that the method was 97 percent accurate in its predictions. But a follow-up paper in 2013 reported that Google projected almost twice as many flu cases as the CDC estimated actually occurred. Google has since changed its tune and states on its website that “it is possible that future estimates may deviate from actual flu activity.”
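The Flu Trends idea can be sketched in miniature: fit a simple relationship between historical query counts and reported cases, then project forward. The sketch below is purely illustrative, with made-up numbers and a naive least-squares scaling that is not Google's actual model; it shows how a surge in searching (driven by news coverage, say) can inflate the estimate without any matching rise in actual illness, the kind of overshoot the 2013 follow-up paper described.

```python
# Illustrative sketch (not Google's actual model): estimate flu cases
# from weekly search-query counts. All numbers are invented.

def fit_scale(queries, cases):
    """Least-squares slope through the origin: cases ~ k * queries."""
    numerator = sum(q * c for q, c in zip(queries, cases))
    denominator = sum(q * q for q in queries)
    return numerator / denominator

# Hypothetical training period in which query volume tracks illness well.
train_queries = [120, 150, 200, 260, 310]
train_cases = [2400, 3000, 4100, 5200, 6100]
k = fit_scale(train_queries, train_cases)

# Later, media coverage doubles search volume with no matching rise in
# actual illness, so the query-based estimate overshoots surveillance.
new_queries = 600          # inflated by news-driven searching
predicted = k * new_queries
actual = 6500              # what surveillance later reports
overshoot = predicted > actual
```

The failure is not in the arithmetic but in the assumption that search behavior is a stable proxy for disease, which is exactly the kind of hidden bias the Health Affairs authors warn about.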
A serious worry expressed by several of the writers in the Health Affairs issue is patient confidentiality. Access to publicly available health data, such as mortality data and the FDA's adverse events database, is already common, but Big Data sets contain many more details: where patients live, their purchasing habits, how they responded to a new medication for hypertension, and other information collected in doctors' offices.
Since large data sets are generally collected unsystematically, they can easily incorporate large biases that can be difficult to detect and correct for. Fallik quotes NYU computer scientist Ernest Davis as saying: “If important decisions are being made by applying opaque statistical techniques to big data sets, then there is a serious danger that these decisions are in fact being based on features that are essentially proxies for race, gender, and so on.”
In another article, Joachim Roski and colleagues describe their optimism about putting Big Data mining into practice using innovations such as cloud data storage and flexible IT infrastructures. However, they point out that most health IT systems still rely on data “warehouse” structures rather than the more flexible cloud storage: “Without the right IT infrastructure, analytic tools, visualization approaches, work flows, and interfaces, the insights provided by Big Data are likely to be limited,” Roski et al. write.
“Big Data's success in creating value in the health care sector may require changes in current policies to balance the societal benefits of big-data approaches and the protection of patients' confidentiality. Other policy implications of using big data are that many current practices and policies related to data use, access, sharing, privacy, and stewardship would need to be revised. ... Big data may have the potential to create $300 billion annually in value in the health care sector.”
I don't know about you, but the quotes above from Davis and Roski et al. give me the willies. What does $300 billion in value annually mean with respect to the quality of care? What will be the cost, if it is even possible, of changing the whole country's medical IT infrastructure to take advantage of Big Data? How does one protect patient privacy with terabytes of data rapidly changing hands? If we cannot protect the privacy of credit cards, social security numbers, and purchases at Target, how are we to protect the medical information of 300 million Americans?
Finally, the Health Affairs issue includes another relevant article, by Lynn M. Etheridge, a pioneer in the use of “rapid learning” evidence-based systems, which give doctors transparent, real-time access to data on how a certain treatment is working (or not). In a sense, CancerLinQ is an extension of his model. Etheridge's interest in Big Data is in the context of a national goal of health system improvement. He worries that “the concept of a rapid learning health system may be too ambitious for America's pluralistic health system.” Most western countries have unified health systems funded and overseen by the government, and thus avoid that major obstacle to the safe and efficient use of Big Data.
In the early days of computers, it was common to quote “GIGO”—garbage in, garbage out—to express the obvious fact that incorrect data leads to useless interpretation. I may have missed it, but in the Health Affairs review of Big Data I saw no reference to the quality of data. Anyone who has treated patients or participated in a clinical trial knows that there are many mistakes in recorded data that are not caught and corrected. How can this be controlled in Big Data at an affordable cost?
I certainly am no expert in the management and use of large data sets, and I hope that mining Big Data leads to ready access to useful information that is worth the cost. I can hope along with others, but I am still a tad skeptical.