To the Editor:
Geocoding (the assignment of spatial coordinates to geographic locations) is common in spatial analyses.1 Depending on the quantity of data, geocoding can be expensive and time-consuming. We propose a batch geocoding utility that is free and does not rely on an intermediary conversion service or software. This utility adapts existing public code,2 runs in the R environment,3–5 and uses existing application programming interfaces (APIs), namely Google Maps and Google Places. Annotated source code is available at http://www.pages.drexel.edu/~bkl29/docs/geocode_public.R.
The Google Maps API geocodes street addresses using proprietary fuzzy matching algorithms. When the location is nonresidential, the Google Places front-end to Google Maps adds a layer of fuzzy matching specific to businesses and points of interest that are otherwise not resolved via Google Maps alone, which improves geocoding success rates. Both interfaces are dynamic services that are continuously updated, making them superior to static geocoding tools that can fail to recognize new or altered places or roads.
Both interfaces are available free of charge for noncommercial use. To use the Google Places API, the user needs to obtain an API key, which accepts up to 10,000 queries per day. In contrast with free online batch services (which may throttle or drastically limit queries and may use an unknown geocoding API) or offline pay services, this approach is an integrated solution that avoids sending addresses to an intermediary. It also allows fine-tuning of the query and coding parameters within R, so the researcher avoids time delays and import and export mistakes. Integrating this utility into R provides a generic geocoding interface that is not project- or application-specific and that can be added to existing code with minimal overhead.
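As a rough illustration of what a query to either service looks like from within R, the sketch below assembles request URLs using base R only. The helper name `build_query_url`, its arguments, the placeholder key, and the example street address are hypothetical and are not taken from the published utility. The endpoint paths and parameter names are assumed from Google's public documentation for the Geocoding and Places Text Search web services, which may change and which currently expect an API key, so they should be checked before use.

```r
# Hypothetical helper (not part of the published utility): build a request URL
# for the Google Geocoding ("maps") or Places Text Search ("places") web service.
build_query_url <- function(text, service = c("maps", "places"), key = NULL) {
  service <- match.arg(service)
  # Endpoint paths and parameter names are assumptions based on Google's
  # public documentation; verify them against the current documentation.
  endpoint <- switch(service,
    maps   = "https://maps.googleapis.com/maps/api/geocode/json",
    places = "https://maps.googleapis.com/maps/api/place/textsearch/json"
  )
  param <- switch(service, maps = "address", places = "query")
  # Percent-encode spaces, commas, and other reserved characters.
  query <- paste0(param, "=", utils::URLencode(text, reserved = TRUE))
  if (!is.null(key)) {
    query <- paste0(query, "&key=", utils::URLencode(key, reserved = TRUE))
  }
  paste0(endpoint, "?", query)
}

# Illustrative inputs with a placeholder key; no request is sent here.
maps_url   <- build_query_url("123 Main St, Philly", "maps", key = "YOUR_API_KEY")
places_url <- build_query_url("St. Christopher's Hospital for Children", "places",
                              key = "YOUR_API_KEY")
```

The resulting strings could then be passed to whichever HTTP client and JSON parser the analyst already uses. Keeping URL construction separate from the network call makes it easy to inspect exactly what would be sent before any quota is consumed.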
To demonstrate, we used this utility to geocode several locations in Philadelphia, including both street addresses and named places (Table). We geocoded two locations using both the Maps and Places APIs, and the others using only the Maps API. The street addresses had varying degrees of typographical accuracy and completeness, while the named places had no geographical information other than the place name. Even when the address was completely absent (“St. Christopher’s Hospital for Children”) or address colloquialisms were used (eg, “Philly”), the Google Places API accurately geocoded the record. Geocoding output quality can be assessed via fields for building-level accuracy (“accuracy,” eg, rooftop) and matching success (“partial match”). While both the Maps and Places APIs correctly resolved street addresses, Places was better than Maps for geocoding named places. For example, the Maps API erroneously geocoded a small community-based organization (YHEP, the Youth Health Empowerment Project) to a location in Switzerland, while Places returned the correct location in Philadelphia.
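The quality fields mentioned above can be screened programmatically. The following is a minimal sketch, assuming the response has already been parsed from JSON into a nested R list and that the utility's "accuracy" and "partial match" fields correspond to the Geocoding API's `location_type` and `partial_match` elements. The helper name `summarize_geocode` is hypothetical, and the mock response uses made-up coordinates purely for illustration.

```r
# Hypothetical helper: pull coordinates and quality flags from one parsed
# Geocoding API response (a nested list produced by a JSON parser).
summarize_geocode <- function(resp) {
  status <- if (is.null(resp$status)) NA_character_ else as.character(resp$status)
  empty <- data.frame(lat = NA_real_, lng = NA_real_,
                      location_type = NA_character_, partial_match = NA,
                      status = status, stringsAsFactors = FALSE)
  # Anything other than an "OK" status with at least one result is a failure.
  if (!identical(status, "OK") || length(resp$results) == 0) return(empty)
  top <- resp$results[[1]]  # use the first candidate returned
  data.frame(
    lat = top$geometry$location$lat,
    lng = top$geometry$location$lng,
    location_type = top$geometry$location_type,
    # partial_match is assumed to be absent when the match is complete.
    partial_match = isTRUE(top$partial_match),
    status = status,
    stringsAsFactors = FALSE
  )
}

# Mock response with made-up coordinates (not a real geocoding result).
mock <- list(
  status = "OK",
  results = list(list(
    geometry = list(location = list(lat = 40.0, lng = -75.0),
                    location_type = "ROOFTOP"),
    partial_match = TRUE
  ))
)
out <- summarize_geocode(mock)

# Flag records that are partial matches or less precise than rooftop level.
needs_review <- out$partial_match | out$location_type != "ROOFTOP"
```

Records flagged this way are natural candidates for the manual review of partial matches that any automated geocoding workflow requires.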
As with any automated geocoding solution, there are limitations. First, geocode output is only as good as the data going in. Investigators should provide as much address detail as possible. Verification is always needed to ensure valid geocodes. This can include manual spot checks if no systematic errors are believed to exist, review of partial matches, or reverse geocoding the data to compare addresses. Second, Google’s public-use agreement limits the user to 10,000 queries in a given day. This would be a problem only for the largest geocoding needs and even then could be circumvented by segmenting the data.
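One simple way to segment a large data set, sketched here under the assumption that each record costs one query and that each segment is submitted on a separate day, is to split the address vector into chunks no larger than the daily limit. The helper name `split_into_batches` and the placeholder records are hypothetical.

```r
# Hypothetical helper: split a vector of addresses into consecutive batches,
# each no larger than the assumed daily query limit.
split_into_batches <- function(addresses, daily_limit = 10000L) {
  stopifnot(daily_limit >= 1)
  # Records 1..limit fall in batch 1, the next `limit` records in batch 2, etc.
  batch_id <- ceiling(seq_along(addresses) / daily_limit)
  unname(split(addresses, batch_id))
}

# Placeholder records standing in for real addresses.
addresses <- sprintf("Record %d", 1:25000)
batches <- split_into_batches(addresses)
lengths(batches)  # 10000 10000 5000
```

Because the batches preserve the original order, the geocoded output from each day can be concatenated and matched back to the source records by position.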
We have provided an option for the researcher who needs occasional bulk geocoding and wishes to maintain full control over data. This adds to the growing list of tools for spatial analysis and can be particularly useful for researchers already working in the R environment.
Neal D. Goldstein
Amy H. Auchincloss
Brian K. Lee
Department of Epidemiology and
Drexel University School of Public Health
1. Auchincloss AH, Gebreab SY, Mair C, Diez Roux AV. A review of spatial methods in epidemiology, 2000-2010. Annu Rev Public Health. 2012;33:107–122
3. R Core Team. R: A language and environment for statistical computing. Available at: http://www.R-project.org/. Accessed 19 September 2013