Using Search Engine Data as a Tool to Predict Syphilis

Young, Sean, D.a,b; Torrone, Elizabeth, A.c; Urata, Johnb; Aral, Sevgi, O.c

doi: 10.1097/EDE.0000000000000836
Infectious diseases

Background: Researchers have suggested that social media and online search data might be used to monitor and predict syphilis and other sexually transmitted diseases. Because people at risk for syphilis might seek sexual health and risk-related information on the internet, we investigated associations between internet state-level search query data (e.g., Google Trends) and reported weekly syphilis cases.

Methods: We obtained weekly counts of reported primary and secondary syphilis for 50 states from 2012 to 2014 from the US Centers for Disease Control and Prevention. We collected weekly internet search query data regarding 25 risk-related keywords from 2012 to 2014 for 50 states using Google Trends. We joined 155 weeks of Google Trends data with 1-week lag to weekly syphilis data for a total of 7750 data points. Using the least absolute shrinkage and selection operator, we trained three linear mixed models on the first 10 weeks of each year. We validated models for 2012 and 2013 for the following 52 weeks and the 2014 model for the following 42 weeks.
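The lagged join and the shrinkage step described in the Methods can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy values, and the penalty `lam` are hypothetical, and the sketch fits a plain lasso by coordinate descent without an intercept rather than the linear mixed models used in the study. It only shows how search features from week t-1 are paired with case counts from week t, and how an L1 penalty can drop an uninformative keyword.

```python
# Illustrative sketch only: toy data and hypothetical names, not the study's code.

def lag_join(search, cases, lag=1):
    """Pair each (state, week) case count with search features from `lag` weeks earlier."""
    rows = []
    for (state, week), count in sorted(cases.items()):
        features = search.get((state, week - lag))
        if features is not None:  # skip weeks with no lagged search data
            rows.append((state, week, features, count))
    return rows

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return beta

# Toy search indices for two hypothetical keywords, keyed by (state, week).
search = {("CA", 1): [1.0, 0.5], ("CA", 2): [2.0, -0.5],
          ("CA", 3): [3.0, 0.5], ("CA", 4): [4.0, -0.5]}
# Toy weekly case counts, keyed by (state, week).
cases = {("CA", 2): 2.0, ("CA", 3): 4.0, ("CA", 4): 6.0, ("CA", 5): 8.0}

rows = lag_join(search, cases, lag=1)
X = [features for _, _, features, _ in rows]
y = [count for _, _, _, count in rows]
beta = lasso_coordinate_descent(X, y, lam=0.5)

# Size of the joined data set reported in the abstract: 155 weeks x 50 states.
n_points = 155 * 50
```

In this toy example the second keyword carries no signal, so the penalty sets its coefficient to exactly zero while the first keyword keeps a slightly shrunken coefficient.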

Results: The models, consisting of different sets of keyword predictors for each year, accurately predicted 144 weeks of primary and secondary syphilis counts for each state, with an overall average R² of 0.9 and overall average root mean squared error of 4.9.
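For readers unfamiliar with the two accuracy measures, the following sketch shows one common way to compute them from observed and predicted weekly counts. The abstract does not state which definition of R² was used, so the coefficient-of-determination form here is an assumption, and the numbers are toy values rather than study data.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Toy weekly counts for one hypothetical state (not study data).
observed = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]

error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```

An RMSE is expressed in the same units as the outcome, so the reported value of 4.9 corresponds to a typical error of about five cases per state-week.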

Conclusions: We used Google Trends search data from the prior week to predict cases of syphilis in the following weeks for each state. Further research could explore how search data could be integrated into public health monitoring systems.

From the aUniversity of California Institute for Prediction Technology, University of California, Los Angeles, CA

bDepartment of Family Medicine, University of California, Los Angeles, CA

cDivision of STD Prevention, Centers for Disease Control and Prevention, Atlanta, GA.

Submitted March 7, 2017; accepted March 29, 2018.

Availability of Data and Code for Replication: For information about code, please contact

This work was supported by the National Institute of Mental Health (NIMH) grant 5R01MH106415 (S.D.Y.), the National Institute of Allergy and Infectious Diseases grants R56 and R01 5R01AI132030 (S.D.Y.), and the University of California Office of the President (UCOP).

The authors report no conflicts of interest.

The findings and conclusions in this manuscript are those of the authors and do not necessarily represent views of the Centers for Disease Control and Prevention.

Correspondence: Sean Young, Department of Family Medicine, University of California, 10880 Wilshire Blvd, Ste 1800 Los Angeles, CA 90024. E-mail:

Copyright © 2018 Wolters Kluwer Health, Inc. All rights reserved.