Musings of a Cancer Doctor

Wide-ranging views and perspective from George W. Sledge, Jr., MD

Saturday, October 22, 2011

Ngrams and Culturomics

Lately I’ve become addicted to Google’s Ngrams. You may not be familiar with Ngrams, so let me explain them to you, because they give you some sense of where the world is headed. Google has, for several years, been uploading the world’s books, with the stated goal of making all human knowledge available online. Not surprisingly, holders of copyrights have concerns with Google’s omnivorous approach to data, with lawsuits galore, but the result of Google’s efforts is that literally millions of books have been uploaded.


These books are OCR readable, hence digitizable, which allows researchers unparalleled access (two words that you can apply to lots of Google efforts) to the world’s cultural heritage. Historians love Google: you can avoid a great deal of time in dusty libraries by going straight to Google Books.


Enter Jean-Baptiste Michel and Erez Lieberman Aiden, two young informatics types from Harvard who had a clever idea: why not mine millions of books and use them to create a whole new academic discipline, which they call “culturomics”? Working with Google computer scientists and other collaborators, they created a program that would allow one to get real quantitative data on specific subjects, tracking trends over time. The database used contains about 4% of all the books that have ever been published, which turns out to be a lot of books; Google has gathered data from over 5 million books, with some 500 billion words.
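For the curious, here is a minimal sketch of the kind of counting involved. It is not the Harvard and Google program: the three-line corpus and the function name are invented for illustration, and the measure used (a word's share of all words printed in a given year) is my assumption about the simplest version of the idea.

```python
from collections import Counter

# Hypothetical toy corpus: (publication year, text) pairs invented for illustration.
TOY_CORPUS = [
    (1800, "religion and faith shaped the parish and religion guided the town"),
    (1900, "science and religion were debated in the lecture hall"),
    (2000, "science funding grew as science reshaped medicine"),
]


def relative_frequency(corpus, word):
    """Return {year: fraction of that year's words that equal `word`}."""
    hits = Counter()    # occurrences of `word` per year
    totals = Counter()  # all words per year
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals)}


science = relative_frequency(TOY_CORPUS, "science")
religion = relative_frequency(TOY_CORPUS, "religion")

# One row per year: the year, then each word's share of that year's text.
for year in science:
    print(year, round(science[year], 3), round(religion[year], 3))
```

Dividing by the total words for each year is what makes different years comparable, since far more books were printed in 2000 than in 1800.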


The work snagged Michel and Aiden a fine paper in Science earlier this year (to read it, see Science 331, 176-82, 2011). The Google folks thought it was sufficiently cool to offer it to anyone who wants it: check it out at


You can perform all sorts of fun analyses with the Ngrams program. I typed in “science” and “religion” and looked at trends from 1800 to the present. In the first decade or two of the 1800s religion was a far more popular subject than science, by a ratio of about 7:1, but its relative popularity declined steadily throughout the 19th and early 20th centuries, the two reaching statistical parity around 1922. 


Ever since, books using the word science have outnumbered those with the word religion, though the ratio of the one to the other stabilized around 1980: an armed truce?


This, of course, doesn’t tell you too much, or shouldn’t, other than to give us some sense of the overall cultural zeitgeist. And perhaps not even that, since the world of books is not the entirety of culture. For that we would need Google Ngrams that included periodicals, movies, television, radio, email, texting and (I imagine this will happen, given the iPhone’s Siri software and Google’s voice recognition capabilities) spoken communication.


But it is a start, and it is easy to see (and perhaps over-interpret) patterns of interest with Ngrams. Take the world of oncology, for instance.  Do we use the word cancer, or the more Latinate carcinoma?


The Ngram Viewer suggests that, as far back as one can measure, cancer has been a more popular term than carcinoma. The ratio of the two was always fairly stable, though, until the 1980s, when the curves began to diverge, and starting in the late 1980s “carcinoma” went into a steep decline. Physicians still use it to differentiate some cancers from others (carcinoma vs. sarcoma), but will it eventually disappear as a popular word?


My non-breast cancer colleagues always complain about what they consider the excessive attention breast cancer receives in the popular press. I entered breast cancer into Ngrams, along with leukemia, stomach, lung and prostate cancers. Gastric cancer has never been popular: I guess the public has no stomach for it, and didn’t even when it was an exceptionally common disease. 


Leukemia, in contrast, was clearly a source of fascination, taking off in the 1950s and peaking in interest around 1980, roughly in parallel with the discipline’s successes in childhood leukemia. Old news, though: it has been in steep decline ever since.


Breast cancer, barely a blip on the screen in the 1950s, gradually rose in the public consciousness in the 1960s, with a huge inflection point occurring around 1970 (Betty Ford’s diagnosis, perhaps?), followed by an ever-upward climb until reaching a plateau a decade ago. 


Lung and prostate cancer? Never as much interest as for breast (my colleagues’ complaints are objectively correct), though both have grown in the public’s esteem, prostate cancer taking off in the 1990s and lung cancer experiencing a more gradual ascent beginning in the 1950s. Breast cancer still outpolls each by about a 3:1 ratio. Tough. Deal with it, guys.


And finally, my least favorite phrase, paradigm shift. Paradigm shifts never occurred prior to the 1962 publication of Thomas Kuhn’s groundbreaking book The Structure of Scientific Revolutions. After that, however, you just couldn’t shut the darn things up: a classic exponential curve until the start of the new millennium. Everyone and his brother, apparently, had one in his back pocket. The last few years have seen some modest slowing of this plague, and not a moment too soon. All those paradigm shifts had kept my head spinning.


I am as always impressed with the progressive reach of the digital revolution, particularly when actively pushed by the Googles of the world. What happens when everything is digitized and publicly analyzable, when every walk down the street is caught on some camera, every public conversation or act (and the sphere of privacy progressively shrinks) enters the vast and unforgiving domain of cyberspace?


Will it be a better world, the works of the powerful and vicious made ignominious and unacceptable by the ongoing democratization of data? Or will it be a world where all that information is co-opted by those same powerful and vicious, a technologic version of Orwell’s boot endlessly crushing the human face? Or, perhaps even worse, a world where we lose all sense of shame as every act becomes fodder for public cynicism and amusement?


It could go either way, I suppose. I’m no social prophet. Girls Gone Wild, and the tenacious hold on power of the Chinese politburo’s aging thugs, certainly give one pause. On the other hand, the creative use of social media in the Arab Spring revolutions offers reason for hope. I am an oncologist, which means I am both a paranoid and a pathologic optimist. Cognitive dissonance is the natural order of things for cancer doctors.


The scientific arc of this progressive digitization is clear: huge clouds of data raining down on us, tens of genomes (host and tumor) giving way to hundreds and thousands, all within a decade’s time.  And, beyond that, millions of genomes a decade or two away at most.


The total genome sequences of all the people of the earth measure in at 4.75 exabytes (that’s 4.75 x 10^18 bytes). Medical research belongs to those who can analyze and synthesize the secrets buried in those exabytes. Cancer, and biology in general, have become massive math problems. And me unable to count to ten without using all my fingers.
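For those who do want to count along, here is a rough back-of-the-envelope check of that 4.75 exabyte figure. The population (about 7 billion) and genome length (about 3 billion bases) are my own round-number assumptions, not the source of the original estimate, so treat this as a sanity check rather than a derivation.

```python
TOTAL_BYTES = 4.75e18     # figure quoted in the post: 4.75 exabytes
WORLD_POPULATION = 7e9    # assumption: roughly 7 billion people in 2011
GENOME_BASES = 3e9        # assumption: about 3 billion bases per genome

# Share of the total storage attributable to one person's genome.
bytes_per_person = TOTAL_BYTES / WORLD_POPULATION

# Implied storage cost per base, in bits (8 bits per byte).
bits_per_base = bytes_per_person * 8 / GENOME_BASES

print(f"{bytes_per_person / 1e6:.0f} MB per person")
print(f"{bits_per_base:.2f} bits per base")
```

Under these assumptions the figure works out to a bit under 700 megabytes per person, or just under 2 bits per base, which is about what it takes to store one of four letters (A, C, G, T) at each position.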