Lee, Brian K.
The basic information technology for epidemiologic surveillance was once (and often still is) “shoe-leather”—a term that harkens back to the days of John Snow and his predecessors, when data collection was limited by how far an epidemiologist could walk. Since then, the technological tools for data collection have evolved. The gradual adoption of the telephone in the late 19th and early 20th centuries facilitated the use of phones for surveillance purposes. By the 1950s, public health officials were using telephone interviews to conduct outbreak investigations1,2 and, in doing so, helped to usher in a new era in survey methodology. Likewise, the advent of the computer age profoundly altered the landscape of population-health research. In the last 30 years, electronic medical records,3 health insurance claims data,4 and population-based registers5 have allowed investigators to conduct research on large samples, and usage of computer data repositories has become accepted practice.
More recently, epidemiologists have taken advantage of the Internet as a communications medium to facilitate research. The world is increasingly “wired”: over 1.8 billion persons worldwide use the Internet, and population percentages of Internet users are high for many developed regions.6 Accordingly, many aspects of research, including recruitment,7 data collection,8 and even certain interventions,9 have been implemented through the Internet.
But the value of the Internet for epidemiologic research is not simply as a faster method of reaching potential participants or conducting a survey, or as a replacement for the telephone when people are increasingly reluctant to respond to solicitations. User-driven Internet content—particularly the content produced under the Web 2.0 platform—offers research opportunities for epidemiology that have only begun to be explored.
The Internet is a global network of smaller computer networks and the World Wide Web (Web, for short) is the browser-accessible content that exists on the Internet. The Web was introduced in 1991 and, for many years, information transfer was generally 1-way. Users were limited to passive viewing and receipt of Web site content with little interaction among users. But technological advances in broadband access, Internet-connected digital devices (such as mobile phones, smart phones, and personal computers), and software applications helped foster a fundamental shift in Internet usage toward the production of user-driven content.
Popularized in 2004, the term Web 2.0 refers to the active generation of dynamically updated content made possible by social interaction and participation in online communities.10 Although revolutions in hardware and software have facilitated this shift, Web 2.0 refers to neither hardware nor software, nor to a new technical version of the Internet. Rather, Web 2.0 embodies an emergent culture in how persons engage with the Web. It is precisely this culture, meaning how individuals actively use the Web (and the data they contribute), that may be of value to epidemiologists.
Participation in Web 2.0 can occur inadvertently, such as when using a search engine whose output is ranked by popularity, or when clicking on the “most-read articles” links in online newspapers (or the Epidemiology Web site).11 The next level of engagement might include the active seeking-out of community content, such as reading restaurant reviews to decide on a dinner destination. The deepest level of involvement is in the production and delivery of content for others. Examples of Web 2.0 content include individual information platforms of blogs and microblogs such as Twitter; photo-sharing sites (Flickr); collaborative or crowd-sourced information efforts (Wikipedia, Amazon product reviews); social networks (Facebook, MySpace, LinkedIn); and dating Web sites (OKCupid, eHarmony). Although the forms and functions of each of these Web 2.0 services vary, the common thread is that each relies on the user community to supply the content that other users demand.
In the most prominent example to date of user-driven Internet content for population health research, Google researchers derived accurate estimates of US influenza prevalence from the frequency and geographic origin of influenza-related Google searches.12 This type of research is known as infodemiology (information + epidemiology)—the study of the distribution and determinants of information on the Internet with the intent of informing public health and public policy.13 The basic premise of infodemiology is that certain information patterns on the Internet may be caused by, or may cause, population-health patterns.
An estimated 61% of US adults search for health information on the Internet.14 Information patterns such as those provided by health-seeking behaviors can be capitalized upon for surveillance purposes. Outbreaks that have been correlated with search-query data include salmonella,15 listeriosis,16 gastroenteritis, and chicken pox.17 Presumptive advantages of infodemiology studies for outbreak surveillance include a much faster timescale compared with traditional surveillance, as well as improved cost-effectiveness. In Google's case, the search data accurately described influenza activity 2 weeks earlier than CDC efforts.12 However, the added value of infodemiology for surveillance of endemic disease is uncertain, and few studies exist. One study of cancer-related search behaviors suggested that the frequency of searches was correlated with the prevalence and mortality of specific cancers but was also influenced by factors such as news coverage and awareness months.18 A recent Epidemiology letter demonstrated plausible seasonality trends in searches regarding diabetes, blood pressure, heart attacks, and kidney stones.19
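The idea that search activity can run ahead of traditional surveillance can be illustrated with a simple lead-lag correlation. The sketch below is only an illustration under stated assumptions: the weekly series are made-up toy numbers, the function names are hypothetical, and a plain Pearson correlation stands in for whatever model a real study would fit. It is not the method used in the Google study.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def lead_correlations(searches, cases, max_lead):
    """Correlate searches in week t with reported cases in week t + lead."""
    result = {}
    for lead in range(max_lead + 1):
        x = searches[: len(searches) - lead]  # drop the last `lead` weeks
        y = cases[lead:]                      # drop the first `lead` weeks
        result[lead] = pearson(x, y)
    return result


# Toy data (assumption): weekly counts of flu-related searches and of
# reported cases. The case series is built as the search series shifted
# by two weeks and halved, so the signal is visible by construction.
searches = [10, 12, 20, 36, 50, 42, 30, 18, 12, 10, 8, 8]
cases = [4, 4, 5, 6, 10, 18, 25, 21, 15, 9, 6, 5]

correlations = lead_correlations(searches, cases, max_lead=4)
best_lead = max(correlations, key=correlations.get)
print("searches lead cases by", best_lead, "weeks")
```

With real data the peak would be far less clean, and a high correlation at some lead would not by itself rule out the influences noted above, such as news coverage.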
In addition to disease surveillance, infodemiology may be useful for the study of disasters through monitoring of social-network communications, such as microblog status updates regarding the location and magnitude of effects. The American Red Cross uses Twitter to help coordinate relief efforts,20 while the US Geological Survey is investigating the use of Twitter for real-time earthquake detection.21
The social nature of Web 2.0 is perhaps best exemplified by the large and rapidly growing social networking sites. On Facebook alone, more than 500 million persons (70% outside the United States) actively participate, and the average user has 130 “friends.”22 Breadth, depth, and public viewability of content vary by user, but online profiles can contain a wealth of data. Researchers can mine profiles for data regarding risk behaviors, personal difficulties, attitudes and beliefs, and relationships.23 Similarly, text analysis of blog posts may reveal information regarding cognitive abilities, personality, and psychologic profiles. Sociologically relevant group-level phenomena can also be studied. For example, a recent provocative paper suggested that the migration of educated white and Asian users from one social networking site (MySpace) to another (Facebook) was similar to white flight from a “digital ghetto.”24
Interactions within online social networks also may be fruitful for research regarding how social networks influence health. Networks formed in cyberspace are both popular for social purposes and relevant for sexually transmitted diseases.25 While Internet chat rooms and dating sites continue to thrive, real-time, location-based socialization has surged in popularity due to GPS-enabled devices such as the iPhone. Grindr, popular with gay males, is a smartphone application that shows potential sex partners and their real-time distance from the user's location. The Foursquare and Google Latitude applications also make use of mobile phone GPS data to publish users' real-time locations onto the Web, generally for social purposes (eg, enabling friends to meet in an impromptu fashion). By analyzing shared location data, researchers may be able to quantify mobility patterns with regard to environments, activities, and health outcomes of interest. For example, GPS tracking can be used to study human movements and their relations to risk of dengue virus,26 as well as environmental characteristics such as population density and street connectivity.27
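One basic building block for quantifying mobility from shared location data is the distance between successive GPS fixes. The sketch below uses the standard haversine great-circle formula; the function names and the sample coordinates are illustrative assumptions, not taken from any of the applications or studies mentioned above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def path_length_km(fixes):
    """Total distance along a time-ordered list of (lat, lon) fixes."""
    return sum(haversine_km(*a, *b) for a, b in zip(fixes, fixes[1:]))


# Hypothetical check-ins for one user over a day (made-up coordinates).
fixes = [(39.9566, -75.1899), (39.9526, -75.1652), (39.9496, -75.1503)]
print(round(path_length_km(fixes), 2), "km travelled")
```

Summaries such as daily distance travelled could then be related to exposures or outcomes of interest, subject to the accuracy and sampling frequency of the underlying fixes.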
A special type of social network is the community centered on a particular health concern. More than ever, people are leading a data-driven life. They share personal health data within communities to support others, exchange advice, and gain insight into their own health issues. PatientsLikeMe has over 40,000 patient profiles within a broad range of disease communities, including amyotrophic lateral sclerosis, HIV/AIDS, and organ transplants.28 MedHelp, which has 10 million monthly visitors, features health trackers for a variety of personal health issues such as pregnancy, exercise, sleep, chemotherapy, and hepatitis C. While the data in these health communities are self-reported, data derived from wearable sensors to monitor physical activity, sleep, calorie expenditure, location, and more are frequently shared and may be attractive alternatives to traditional self-reported data. Commercial devices of note include bodybugg, FitBit, and the Nike+ iPod system. At the other end of the health spectrum, interactive communities promoting unhealthy behaviors such as self-injury, suicide, and eating disorders29 have also proliferated in recent years and warrant monitoring to understand the impact of such dysfunctional encouragements on health.
CONCERNS AND LIMITATIONS
For the skeptical epidemiologist who may be unimpressed by the novelty of Web 2.0 uses for epidemiologic research, several obstacles are apparent. The first is the lack of generalizability. The population using Web 2.0 is likely to differ from target populations of interest due to self-selection. However, the Web 2.0 demographic is diversifying rapidly. Even though Facebook was originally targeted at college students, 18- to 24-year-olds account for less than 25% of the total user population, with the fastest growing age group being persons over 65 years of age (6.5 million of whom signed up in May 2010).30 Still, there is clearly a “digital divide.” Those with little or no access to information technology differ fundamentally in factors such as age, income, education, and health status.31 Web 2.0 studies may therefore exclude segments of the population at highest risk of poor health outcomes.
Measurement and analytical issues loom large. Ecologic studies such as Google's influenza study rely on crowd-aggregated data and are thus vulnerable to problems including ecological fallacy and lack of confounder control.32 Individual-level Web 2.0 data are generally self-reported and subject to potential biases. Misleading data and selective reporting (especially for sensitive issues) may occur on online profiles. An analysis of a dating Web site's users found indications of inaccurate claims regarding height, income, and sexual preference.33 Even objective data have limits: the Internet protocol address, which uniquely identifies most machines on the Internet and can be linked with geographic areas, resolves to within 25 miles of the true location only 83% of the time in the United States.34 The inherently interactive nature of Web 2.0 communities renders assumptions of statistical independence questionable, and sampling from social networks may result in “snowball samples” and consequent statistical complexities.35
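A rough sense of what non-independence costs can be had from the textbook design effect for clustered samples, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intraclass correlation. The sketch below is a simplification under stated assumptions: equal-sized clusters, a single known ICC, and hypothetical function names. It is not an estimator for respondent-driven or snowball samples, which require more specialized methods.

```python
def design_effect(cluster_size, icc):
    """Variance inflation for equal-sized clusters: 1 + (m - 1) * ICC."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Number of independent observations that n clustered ones are worth."""
    return n / design_effect(cluster_size, icc)


# Hypothetical example: 1000 profiles sampled as 100 friend groups of 10,
# with an assumed within-group correlation of 0.1 for the outcome.
print(round(effective_sample_size(1000, cluster_size=10, icc=0.1)))
```

Under these assumed values, 1000 networked profiles carry roughly the information of a little over 500 independent ones, which is why naive standard errors from social network samples can be too narrow.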
One by-product of the digital age is digital detritus that must be sifted to extract useful data. Informatics techniques such as Web scraping (an automated software technique for parsing Web pages to collect information), mashups (the integration of data from multiple disparate sources),36 natural language processing (computer interpretation of language), and machine learning (a general term for a diverse set of classification and prediction algorithms) may be useful at various points in the workflow translating from Web 2.0 content to research findings. These tools are used in many other fields (eg, machine learning for detection of suspicious credit card activity) but rarely in epidemiology.37 These techniques require technical proficiency, and epidemiologists can benefit from collaboration with computer scientists and informaticists. Other tools include HealthMap, which scrapes health news to depict global disease activity. Google Flu Trends tracks influenza-related Google search queries for 28 countries, while Google Insights for Search allows user-friendly analysis of search volume patterns across location and time.
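To make the first of these techniques concrete, the sketch below shows a minimal form of Web scraping followed by a crude keyword count, using only the Python standard library. The page content is an inline, made-up HTML string rather than a live Web site, and the class and function names are hypothetical. A real project would also need to respect site terms of use and would likely use more robust language processing than word matching.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def count_terms(html, terms):
    """Return how often each lowercase term appears in the page's visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(words)
    return {term: counts[term] for term in terms}


# Made-up page standing in for a scraped blog post or news item.
SAMPLE_PAGE = """<html><head><style>p {color: red}</style></head>
<body><h1>Flu season update</h1>
<p>Fever and cough reported; flu clinics open.</p>
<script>var fever = 1;</script></body></html>"""

print(count_terms(SAMPLE_PAGE, ["flu", "fever", "cough"]))
```

Counts like these, gathered across many pages and dates, are the kind of raw material that the natural language processing and machine-learning steps described above would then refine.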
Epidemiologists generally know what to measure and how to measure it. What Web 2.0 offers is a new anthropological dimension to data, with new possibilities for what epidemiologists can measure and new challenges for how to measure it. The prospects of Web 2.0 for epidemiologic research are limited by the available Web 2.0 content—content which, by its nature, grows daily in both depth and breadth. Given the vast content, the real constraints may simply be those imposed by the limits of epidemiologists' own ingenuity and creativity.
I gratefully acknowledge Daniel Westreich (UNC-Chapel Hill and Duke University) and Amy Auchincloss (Drexel University) for many helpful comments on a previous version of this manuscript.
ABOUT THE AUTHOR
BRIAN LEE is Assistant Professor of Epidemiology and Biostatistics at the Drexel University School of Public Health. A member of Generation Y, he is broadly interested in the interface of technology and health research. He is currently studying the use of machine-learning algorithms in epidemiology.
1. Ravenholt RT, Nixon M. The telephone in epidemiology of staphylococcal disease. Am J Nurs. 1961;61:60–64.
2. Laveck GD, Ravenholt RT. Staphylococcal disease; an obstetric, pediatric, and community problem. Am J Public Health Nations Health. 1956;46:1287–1296.
3. Platt R. Opportunity knocks: the electronic (public health) medical record. Epidemiology. 2009;20:662–663.
4. Motheral BR, Fairman KA. The use of claims databases for outcomes research: rationale, challenges, and strategies. Clin Ther. 1997;19:346–366.
5. Rosen M, Hakulinen T. Use of disease registers. In: Ahrens W, Pigeot I, eds. Handbook of Epidemiology. Berlin: Springer Berlin Heidelberg; 2005.
7. Ramo DE, Hall SM, Prochaska JJ. Reaching young adult smokers through the Internet: comparison of three recruitment mechanisms. Nicotine Tob Res. 2010;12:768–775.
8. Honeth L, Bexelius C, Eriksson M, et al. An Internet-based hearing test for simple audiometry in nonclinical settings: preliminary validation and proof of principle. Otol Neurotol. 2010;31:708–714.
9. Blas MM, Alva IE, Carcamo CP, et al. Effect of an online video-based intervention to increase HIV testing in men who have sex with men in Peru. PLoS One. 2010;5:e10448.
11. Osimo D. Web 2.0 in Government: Why and How? JRC Scientific and Technical Reports. Seville, Spain: European Commission; 2008.
12. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature. 2009;457:1012–1014.
13. Eysenbach G. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. J Med Internet Res. 2009;11:e11.
15. Brownstein JS, Freifeld CC, Madoff LC. Digital disease detection–harnessing the Web for public health surveillance. N Engl J Med. 2009;360:2153–2155.
16. Wilson K, Brownstein JS. Early detection of disease outbreaks using the Internet. CMAJ. 2009;180:829–831.
17. Pelat C, Turbelin C, Bar-Hen A, Flahault A, Valleron A. More diseases tracked by using Google Trends. Emerg Infect Dis. 2009;15:1327–1328.
18. Cooper CP, Mallon KP, Leadbetter S, Pollack LA, Peipins LA. Cancer Internet search activity on a major search engine, United States 2001–2003. J Med Internet Res. 2005;7:e36.
19. Breyer BN, Eisenberg ML. Use of Google in study of noninfectious medical conditions. Epidemiology. 2010;21:584–585.
23. Williams AL, Merten MJ. A review of online social networking profiles by adolescents: implications for future research and intervention. Adolescence. 2008;43:253–274.
24. Boyd D. White flight in networked publics? How race and class shaped American teen engagement with MySpace and Facebook. In: Nakamura L, Chow-White P, eds. Digital Race Anthology. New York: Routledge. In press.
25. Klausner JD, Wolf W, Fischer-Ponce L, Zolt I, Katz MH. Tracing a syphilis outbreak through cyberspace. JAMA. 2000;284:447–449.
26. Vazquez-Prokopec GM, Stoddard ST, Paz-Soldan V, et al. Usefulness of commercially available GPS data-loggers for tracking human movement and exposure to dengue virus. Int J Health Geogr. 2009;8:68.
27. Rodriguez DA, Brown AL, Troped PJ. Portable global positioning units to complement accelerometry-based physical activity monitors. Med Sci Sports Exerc. 2005;37(suppl 11):S572–S581.
28. Brownstein CA, Brownstein JS, Williams DS III, Wicks P, Heywood JA. The power of social networking in medicine. Nat Biotechnol. 2009;27:888–890.
29. Borzekowski DL, Schenk S, Wilson JL, Peebles R. e-Ana and e-Mia: a content analysis of pro-eating disorder web sites. Am J Public Health. 2010;100:1526–1534.
31. Cresci MK, Yarandi HN, Morrell RW. The Digital Divide and urban older adults. Comput Inform Nurs. 2010;28:88–94.
32. Morgenstern H. Ecologic studies in epidemiology: concepts, principles, and methods. Annu Rev Public Health. 1995;16:61–81.
35. Heckathorn DD. Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc Probl. 2002;49:11–34.
36. Scotch M, Yip KY, Cheung KH. Development of grid-like applications for public health using Web 2.0 mashup techniques. J Am Med Inform Assoc. 2008;15:783–786.
37. Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J Clin Epidemiol. 2010;63:826–833.
© 2010 Lippincott Williams & Wilkins, Inc.