In 2009, researchers at Google partnered with the Centers for Disease Control and Prevention to establish the first use of Internet search engine data to predict trends in infectious disease.1 By using highly correlated search engine terms for flu-like symptoms, Ginsberg et al.1 were able to predict flu outbreaks 1 to 2 weeks ahead of the Centers for Disease Control and Prevention’s US Influenza Sentinel Provider Surveillance Network. The results of this project made international headlines; subsequently, Google has made the program, “Google Trends,” available for public use. To date, Google Trends has successfully predicted trends in West Nile virus, norovirus, varicella, influenza, and HIV.1–4
In the United States, Internet use is nearly ubiquitous, with 78% of all adults and 95% of teenagers endorsing regular Internet use.5 In addition, search engine use tops the list of most popular online activities, with 92% of adults reporting use of Internet search engines like Google.6 In 2013, results from the Internet and American Life project showed that 72% of American Internet users had gone online specifically to seek health information; 77% of those users stated that they started their research with a search engine such as Google.6 The Internet is a portal for users who are seeking health information that is easily and anonymously accessible. Search engine data could provide an additional venue for surveillance efforts, allowing for earlier detection of trends in disease and increased ability to monitor overall impact and examine the geographic spread of disease. Most recently, in 2013, Jena et al.2 used search engine data to model HIV incidence by US state. As surveillance for notifiable sexually transmitted infections (STIs) evolves, the Internet may play a key role. This exploratory study sought to determine whether STI-related searches on Google Trends would be related to trends in rates of nationally notifiable STIs.
Google Trends analyzes Internet searches to tally how many searches are completed for the terms entered, relative to the total number of searches on Google over a selected period. Google limits redundant searches from unique devices, as identified by Internet protocol address. In addition, search terms need to exceed a threshold of search frequency to appear in the results.7 Search data are normalized by an automated procedure of dividing sets of the data by a common variable, thus allowing comparisons to be made without regard to total search volume. For example, if 2 regions show the same percentage for a search term, it does not mean that their absolute volumes are the same; rather, the data can be compared directly because they have been normalized by the total volume from each region.7 Search volume is presented on a scale of 0 to 100 to reflect interest over time, relative to the period selected.7 The values for the same search term may vary with the selected period; to compare these values, they must be scaled on the same period.7
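The normalization described above can be illustrated with a minimal sketch. It is not Google's actual procedure, which is not published in detail; the function name relative_interest, the rule of setting the peak share to 100, and all counts are assumptions made for illustration only.

```python
def relative_interest(term_counts, total_counts):
    """Return each period's search share rescaled so the peak share is 100."""
    if len(term_counts) != len(total_counts):
        raise ValueError("term_counts and total_counts must have equal length")
    # Share of all searches in each period (0.0 when a period has no searches).
    shares = [t / n if n else 0.0 for t, n in zip(term_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0:
        return [0] * len(shares)
    return [round(100 * s / peak) for s in shares]


# Hypothetical weekly counts; region B has 10 times the volume of region A.
region_a = relative_interest([50, 100, 75], [10_000, 10_000, 10_000])
region_b = relative_interest([500, 1000, 750], [100_000, 100_000, 100_000])

# Restricting the window to weeks 1 and 3 changes the peak, and so the scale.
sub_window = relative_interest([50, 75], [10_000, 10_000])

print(region_a, region_b, sub_window)
```

With these hypothetical inputs, both regions yield the same scaled series (50, 100, 75) although their absolute volumes differ tenfold, and the value for week 1 moves from 50 to 67 when the window is narrowed, which is why values must be scaled on the same period before they are compared.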
Search terms were selected based on an initial probe of the terms gonorrhea, syphilis, and chlamydia, using related terms with correlations above 0.80 to identify the most effective choice. Google Trends generates correlated search terms for any term entered. The lower bound of a strong correlation was chosen as a cutoff point to examine any extraneous terms that might be correlated with our selected terms. The following terms were selected for our final analysis: “gonorrhea symptoms,” “chlamydia symptoms,” and “syphilis symptoms.” After selecting the search terms, we used Google Trends to graph the popularity of the search terms over time from 2005 through 2011 and to map them by US state. Centers for Disease Control and Prevention STI surveillance data from 2011, the most recently available year of surveillance data, were compared with the graph for each term.8 Using the export data function, we downloaded response volume of search data as a continuous variable by state in comma-separated value format from Google Trends. Bivariate correlations were used to assess the strength of the relationship between search term frequency and rate of disease by US state for 2011. Google Trends and STI data were analyzed in Stata/SE 13 for Windows.9
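As an illustration of the bivariate correlation step, the following minimal sketch computes a Pearson correlation coefficient between state-level search interest and disease rates. It is not the Stata code used in the analysis. The function names, the treatment of missing values (states with a missing value are dropped pairwise), and all numbers are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return (r, n) using only pairs in which both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least 3 complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy), n


def t_statistic(r, n):
    """t statistic for H0: r = 0, with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical values for 6 states: scaled search interest (0 to 100) and
# reported cases per 100,000 population; None marks a missing search value.
search_interest = [100, 80, None, 60, 40, 20]
rate_per_100k = [180.0, 170.0, 150.0, 110.0, 90.0, 50.0]

r, n = pearson_r(search_interest, rate_per_100k)
print(f"r = {r:.2f} from {n} states, t = {t_statistic(r, n):.2f}")
```

The P value follows from comparing the t statistic with a t distribution on n - 2 degrees of freedom; a statistics package would normally report it directly.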
Each search term was highly correlated (>0.80) with many other like terms. For example, “gonorrhea symptoms” was correlated with gonorrhea in men, gonorrhea STD, facts gonorrhea, gonorrhea women, and gonorrhea treatment. The term “chlamydia symptoms” was the most commonly searched of the 3 STI terms across all years (Fig. 1). Relative search volume increased overall for all 3 terms. The biggest increases in chlamydia search terms occurred early in 2009 and again toward the end of 2010, whereas increases over time for gonorrhea and syphilis search terms were more gradual and closely paralleled each other.
The frequency of the search terms relative to all other searches was greatest in states where STI rates are highest, for example, in the southeastern states (Fig. 2A–C). The Pearson correlation coefficient was used to explore the association between STI rates and Google search term frequency by state for 2011. Owing to missing data at the state level for search terms, syphilis trends could not be tested. The gonorrhea search term was positively and statistically significantly associated with gonorrhea rates in 2011 (r = 0.72, P < 0.001); however, the chlamydia search term did not show an association with chlamydia rates in 2011 (r = 0.06, P = 0.65).
Although the chlamydia search term was unexpectedly not associated with chlamydia rates by state and we were unable to assess the correlation for syphilis, trends in the graphical display of the data seem to be similar to overall prevalence curves for each STI. The lack of association between the chlamydia search term and state rate of disease may be due to the short period of data analyzed. Screening for chlamydia is also much more common than screening for gonorrhea; if most chlamydia cases are detected while asymptomatic, this may explain the lack of correlation between searches for chlamydia symptoms and reported chlamydia rates. Although rates of syphilis remain steady overall, rates among women are declining and rates among men, particularly men who have sex with men, are increasing; this, along with greater awareness, may be related to the relative increase in syphilis search term frequency. We were surprised by the similarity in relative interest for gonorrhea and syphilis symptoms, despite the much higher rates of gonorrhea. This may reflect differences in information seeking by disease, symptomatic versus asymptomatic detection, or awareness or perceived severity of disease.
Because this is an ecological data analysis, findings cannot be applied at the individual level; for example, the trends are at a population level, and we cannot conclude that STI-infected individuals are the only ones generating STI-related searches. There is uncertainty about the cause of trends in search terms (e.g., actual increases in infection, news or media attention) and when the searches occur (e.g., before or after STI diagnosis). The current study also does not account for differences in access to computers or the Internet by region.
Data on actual usage at more precise time intervals, together with time series analysis, will enable statistical assessment of whether such data may be used to predict epidemics. Working with Google to enhance user flexibility in terms of export data files, options for data intervals, enhanced geographic capabilities, and increased transparency in the methods of forecasting will be essential for future analysis of search engine data. Although a direct export function for search trend data exists, users need more flexibility, including data at shorter intervals, such as weekly or monthly, as well as information about missing data. In addition, geographic limitations are present; the distribution of search volume is unclear beyond the top cities within states, and there is no ability to modify the display maps (e.g., mapping by category or cutoff). Forecasting ability is somewhat limited; the specific methods that Google uses to forecast trends are not reported on the Google Trends Web site.
Search engine trends are an innovative tool to integrate into real-time surveillance of STIs. Health departments, which have access to weekly rates, may be able to forecast trends to anticipate increases in STIs or epidemics. Public health professionals should work with Google to refine trending tools for infectious disease surveillance. Future directions should include collaborations with health departments, as well as increased specificity of search terms to match subpopulation data for surveillance trends.
1. Ginsberg J, Mohebbi MH, Patel RS, et al. Detecting influenza epidemics using search engine query data. Nature 2009; 457: 1012.
2. Jena AB, Karaca-Mandic P, Weaver L, et al. Predicting new diagnoses of HIV infection using Internet search engine data. Clin Infect Dis 2013; 56: 1352–1353.
3. Pelat C, Turbelin C, Bar-Hen A, et al. More diseases tracked by using Google Trends. Emerg Infect Dis 2009; 15: 1327–1328.
4. Wilson K, Brownstein JS. Early detection of disease outbreaks using the Internet. Can Med Assoc J 2009; 180: 829–831. doi: 10.1503/cmaj.1090215.
8. Centers for Disease Control and Prevention. Sexually Transmitted Disease Surveillance 2011. Atlanta, USA: Department of Health and Human Services.
9. StataCorp. Stata statistical software: Release 13. College Station, TX: StataCorp LP, 2013.