Last year I wrote a column about the trend toward "Big Data" and what it might mean for oncologists (8/25/13 issue). Recently I found a particularly informative assessment published in the New York Times by Gary Marcus and Ernest Davis, both professors at NYU: "Eight (No, Nine!) Problems With Big Data." The article is worth reviewing because it is well balanced and clearly describes the limitations of relying on big data. The following are some of the shortcomings of big data that Marcus and Davis identified in their article:
Although big data is very good at detecting correlations that a smaller sample size might miss, it can't tell us which correlations are meaningful: The example in the article is that the murder rate and the market share of Internet Explorer both declined sharply, yet no one would accept that there was a causal relationship. Similarly, between 1998 and 2007 the number of new cases of autism correlated extremely well with sales of organic food; both rose sharply, but correlation alone cannot tell us whether there is a causal relationship.
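The autism/organic-food example illustrates a general statistical point: any two quantities that merely share an upward trend will show a strong correlation, whether or not they are related. A minimal Python sketch with made-up numbers (illustrative only, not data from the article):

```python
import random

# Two unrelated quantities that both trend upward over ten "years":
# the shared trend alone produces a near-perfect Pearson correlation.
random.seed(0)
years = range(10)
series_a = [t + random.uniform(-0.5, 0.5) for t in years]       # rising measure A
series_b = [2 * t + random.uniform(-1.0, 1.0) for t in years]   # unrelated rising measure B

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(series_a, series_b)
print(f"correlation of two unrelated rising series: r = {r:.2f}")
```

Despite having nothing to do with each other, the two series correlate almost perfectly, which is exactly why correlation alone cannot establish causation.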
Big data can work well as an adjunct to scientific inquiry, but can rarely replace the science: Determining the molecular structure of DNA benefits from big data, but an understanding of biochemistry and physics is necessary to make a reliable determination of the structure.
Many tools that are based on big data can easily be gamed: Big data programs for grading student essays often rely on measures like sentence length and word sophistication, which correlate well with the scores given by human graders. But once students figure out how the program works, they start writing long sentences filled with obscure words rather than learning how to formulate and write clear, coherent text. Even Google's search engine can theoretically be gamed by "Google bombing" and "spamdexing," techniques for elevating a website to a higher spot in the results list.
Even when the results of big data analysis aren't intentionally gamed, they often turn out to be less robust than they initially seem: Google Flu Trends (GFT) gained fame by using flu-related search queries to forecast the spread of flu as accurately as, and more quickly than, the CDC. However, as time passed, the accuracy of GFT began to falter, and in the past two years it has produced more bad predictions than good ones. A contributing factor to this failure may be that the Google search engine itself is constantly changed by the continuous input of millions of searches, so that patterns in data collected at one time do not necessarily apply to data collected at another. Collections of data that rely on web hits often merge data that were collected in different ways and for different purposes, with possible negative consequences.
There is a risk of too many correlations: If you run 100 tests for correlations between variables, you risk finding, purely by chance, about five correlations that appear statistically significant even though there is no meaningful connection between the variables. Without careful supervision, the sheer magnitude of big data can greatly amplify such errors.
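The arithmetic behind this point (100 tests at the conventional 5% significance level yield roughly five chance hits) can be simulated directly. This sketch uses made-up random data and an approximate critical value for the correlation coefficient; it is illustrative only:

```python
import random

# Test 100 pairs of completely unrelated variables at the usual 5%
# significance level and count how many look "significant" by chance.
random.seed(42)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

n_tests, n_points = 100, 30
# For 30 data points, |r| > ~0.36 corresponds to p < 0.05 (two-tailed).
threshold = 0.36
false_positives = sum(
    1
    for _ in range(n_tests)
    if abs(pearson([random.gauss(0, 1) for _ in range(n_points)],
                   [random.gauss(0, 1) for _ in range(n_points)])) > threshold
)
print(f"{false_positives} of {n_tests} unrelated pairs look 'significant'")
```

On average about five of the 100 unrelated pairs will cross the significance threshold, which is precisely the multiple-comparisons trap that uncurated big data can magnify.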
There is an example of particular importance to those of us in the health and medicine fields. An article in the April 7, 2014, New Yorker by Kathryn Schulz, "Final Forms," is a historical analysis of the origins of death certificates and their evolving usage up to the present day. She traces them back to what can be considered the first death certificate, the Bill of Mortality of early 16th-century England.
As scientific, political, and cosmological revolutions changed the world, this document evolved to become today's death certificate. The history is a good read that I recommend, but for my purpose I will jump to her analysis of current death certificate practices in developed countries, especially in the U.S.
Today's one-page death certificate comes with instruction booklets for coroners, physicians, and funeral directors totaling 250 pages on how to fill it out. The certificates are sometimes harder to complete than one would think: the writer must sort out the "immediate cause" of death on the first line, the condition that contributed to the immediate cause on the second line, and the cause of that second-line condition on the third and final line. Residents often fill out this form, just as confused (and probably more rushed and tired) as everyone else. The author states, "death certificates… like tax returns, do not always scrupulously reflect the truth." The understatement is palpable.
Deliberate obfuscation to protect the reputations of the deceased and the family is not rare: more socially acceptable causes of death are substituted for AIDS, suicide, alcoholism, tuberculosis, and, believe it or not, breast cancer. Careless and unconscious errors are also made on a regular basis. A 2010 survey of doctors in New York City revealed that only one-third believed that death certificates were accurate.
So here we have a mature, valued, and important source of big data that we can probably trust for the date and site of death, but not necessarily for the cause(s) of death, which is the main impetus for having death certificates in the first place: they are used widely for epidemiologic studies and for identifying fatal diseases whose frequency may guide congressional funding.
Just as with an individual's search on the Internet, more data is not always better, and the sheer volume of potential sites to search does not reflect the quality of each. We depend on a flawed customer star-rating system that can be manipulated to reassure us.
An Internet search is one thing; using big data to make medical decisions is quite another. Big data can be useful in medicine, but the lives of our patients may depend on our analysis of the quality of the data put into it. We are obliged to be extra cautious and aware of the limits of extrapolation from it when planning any action.