Geographic information systems (GIS) technology is becoming increasingly important in public health research. 1–5 Recognizing the utility of GIS in health research, the US Department of Health and Human Services recently set a goal of increasing the proportion of government health data systems using address geocoding to promote nationwide use of GIS at all levels. 6 The first step in GIS projects is to geocode the data points of interest (ie, assign a location such as latitude/longitude, census tract, and so on). One common starting point for epidemiologic studies is the geocoding of subjects’ residential addresses. Once geocoded, locational attributes can be assigned to subjects (eg, socioeconomic status and proximity to environmental toxicants). While government agencies are responding to the call for geocoded data, issues related to geocoding quality and completeness remain only partially explored. 5,7
We are currently using GIS to evaluate environmental factors related to regional variations in breast cancer incidence in California. Because of our study’s focus, we are concerned about potential biases introduced by regional differences in successful geocoding. Of particular concern are addresses that cannot readily be geocoded, such as post office boxes (POBs). It is a common practice to assign latitude/longitude coordinates to POBs based on their delivery-weighted five-digit zip code centroid (the geographic center point of a polygon formed by the zip code boundaries). The effects of potential biases and misclassification introduced by such geocoding methods have not been fully scrutinized.
A preliminary geographic assessment of our subjects with POBs, compared with those with street addresses, suggested that they were distributed unevenly throughout the state and were more likely to be in rural locations. Hence, we conducted a small substudy to geocode the POB addresses in our breast cancer study and to assess the quality of this coding.
Our study protocols were approved by the Committee for the Protection of Human Subjects of the California Health and Human Services Agency. The California Cancer Registry (CCR) provided us with the records of all women in California diagnosed with invasive breast cancer between 1988 and 1997 (N = 181,110). The CCR extracts information from patients’ hospital records; patients’ addresses at diagnosis are geocoded for the CCR by Geographic Data Technology (GDT). 8 This work focuses on 4,537 (2.5%) women from our breast cancer study who had POB addresses. Of these, 4,299 had traditional post office boxes, 192 had rural/star route boxes, 39 had highway contract boxes, 6 had general delivery boxes, and 1 had a military box.
We explored several tracing methods to obtain residential street addresses for these women and pilot tested each to determine their cost and effectiveness. Many had limitations precluding their use for our purposes (Table 1). The most promising strategy was to obtain street address information from POB rental records of the US Postal Service. Federal Regulations 9 allow the US Postal Service to release boxholders’ street address information to other government agencies. 10 During October and November of 2002, we mailed 4,537 inquiries to postmasters throughout California, requesting the name and street address of the person who held each POB on the date indicated in the CCR records as the subject/boxholder’s diagnosis date. To protect our subjects’ confidentiality, we revealed neither their names nor the significance of the date.
We calculated the postmasters’ response rate (Fig. 1) and evaluated whether each postmaster-provided street address was a match to our subject (ie, likely to represent the subject’s street address at diagnosis). We considered each postmaster-provided address a match if the boxholder’s last name was the same as our subject’s last or maiden name or if the postmaster-provided address was a residential facility; addresses were considered not to be a match if the boxholder’s last name differed from the subject’s last name or if the boxholder was a business or nonresidential institution. We attempted to geocode and map the matched street addresses by standardizing them using ZP4 software 11 and then assigning geographic coordinates to each address using ArcView GIS 12 with GDT street data. Of the 1,547 matched addresses, we geocoded 1,063 (69%) using ArcView batch-matching and 415 (27%) using manual geocoding. We evaluated the potential effect of excluding nonmatched addresses from our geographic analyses by comparing the case characteristics of subject/boxholders whom we could and could not match to a street address.
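The matching rule described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study; the class and field names are hypothetical, and two details are assumptions the text does not specify: names are compared without regard to case or surrounding whitespace, and a business or nonresidential boxholder is treated as a nonmatch even when the last name agrees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PostmasterReply:
    """One postmaster response. All field names are hypothetical."""
    boxholder_last_name: str
    is_residential_facility: bool = False
    is_business_or_institution: bool = False


def is_match(reply: PostmasterReply,
             subject_last_name: str,
             subject_maiden_name: Optional[str] = None) -> bool:
    """Return True if the reply plausibly gives the subject's street address."""
    # Assumption: a business or nonresidential institution is never a match.
    if reply.is_business_or_institution:
        return False
    # A residential facility counts as a match regardless of the boxholder name.
    if reply.is_residential_facility:
        return True
    # Otherwise compare the boxholder's last name with the subject's last or
    # maiden name (assumption: case and surrounding whitespace are ignored).
    candidates = {
        name.strip().lower()
        for name in (subject_last_name, subject_maiden_name)
        if name
    }
    return reply.boxholder_last_name.strip().lower() in candidates
```

For example, `is_match(PostmasterReply("Jones"), "Smith", "Jones")` evaluates to `True` because the boxholder's last name equals the subject's maiden name.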
Major US geocoding vendors such as GDT routinely use the delivery-weighted five-digit zip code centroid (zip-centroid; Fig. 2) of a POB to assign a proxy residential location. Because residential addresses are often distributed unevenly throughout zip code polygons, zip-centroids have been developed to account for the actual distribution of residences throughout a zip code. 8 We geocoded all boxholders’ addresses and determined their US Census block-group assignment by two methods: first using the zip-centroid of the POB and second using the postmaster-provided street address. To assess potential case-attribute misclassification, we compared the degree of urbanization, 13 socioeconomic status (SES), and agricultural pesticide exposure measure 14 assigned to subjects based on their geographic locations derived from these two geocoding methods. Since GIS-based environmental health studies often assign exposure attributes based on a subject’s proximity to a certain location of interest (eg, incinerator, air monitoring site), we evaluated the potential for exposure misclassification by calculating the distance between the geocoded point of the zip-centroid of the POB and the boxholder’s street address.
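The text does not state how the distance between the two geocoded points was computed. One common choice, shown here only as an assumed sketch, is the great-circle (haversine) distance between two latitude/longitude pairs. The function name and the Earth-radius constant are illustrative choices, not details from the study.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1: float, lon1: float,
                    lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

Calling the function with the zip-centroid coordinates and the street-address coordinates of one boxholder gives the separation in miles. A projected coordinate system within a GIS package would be an equally reasonable alternative over distances of this size.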
Postmaster Response Rates
Although their overall response rate was good (90%), postmasters provided us with boxholder street addresses in only 1,963 (47%) of their 4,193 responses (Fig. 1). The more recent the address information we asked for, the more likely it was that postmasters could provide it (Fig. 3). Our inquiries regarding rural/star route and highway contract boxes were less likely to yield a street address (31% and 13%, respectively) than inquiries regarding traditional post office boxes (44%). Street addresses do not exist for general delivery and military boxes.
Assessment of Potential Biases
In approximately 2.5% of the CCR records in our breast cancer study, subject addresses were limited to a POB. (During the period of our study, patients with other types of cancer had similar percentages of POB addresses in their records.) Boxholders were more likely to be age 50 or older and non-Hispanic white (Table 2). Comparing boxholders whom we could and could not match, we found that race/ethnicity distributions were similar between the 2 groups; not-matched subjects were slightly more likely than matched to have later stage disease at diagnosis and to be in the youngest (≤50 years) and oldest (≥70 years) age groups.
We found substantial discordances between area-attribute assignments based on geocoded POB zip-centroids and street addresses provided by postmasters (Table 3). When street address, rather than POB zip-centroid, was used to pinpoint subjects’ geographic location, 81% were assigned to a different census block-group. We saw less dramatic but still substantial discordance in area-attribute assignments (urbanization, SES and agricultural pesticide exposure); most notably, 43% of subjects were assigned to a different SES quartile.
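A discordance percentage of this kind is the share of subjects whose assigned category differs between the two geocoding methods. The sketch below shows that arithmetic under assumed inputs: two equal-length sequences of category labels, one per method, in the same subject order. The function name is hypothetical and the code is not taken from the study.

```python
from typing import Hashable, Sequence


def discordance_rate(by_zip_centroid: Sequence[Hashable],
                     by_street_address: Sequence[Hashable]) -> float:
    """Fraction of subjects assigned different categories by the two methods."""
    if len(by_zip_centroid) != len(by_street_address):
        raise ValueError("both sequences must cover the same subjects")
    if not by_zip_centroid:
        raise ValueError("at least one subject is required")
    # Count the positions where the two assignments disagree.
    differing = sum(a != b for a, b in zip(by_zip_centroid, by_street_address))
    return differing / len(by_zip_centroid)
```

With made-up SES quartile labels `[1, 2, 3, 4]` and `[1, 3, 3, 2]`, two of four subjects change quartile, so the function returns 0.5.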
The distance between each subject’s street address and her POB’s zip-centroid was within 1 mile for 25% of the street addresses; most were within 5 miles of the zip-centroid (Fig. 4). However, 25% of our boxholders had a geocoded street address that fell more than 4.3 miles away from the zip-centroid.
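Summaries such as these can be derived from the list of per-subject distances. The sketch below is an assumed illustration, not the study's code: it uses `statistics.quantiles` from the Python standard library (Python 3.8 or later) with its default method, and the function name and returned keys are hypothetical. Other quantile conventions can give slightly different cut points.

```python
import statistics
from typing import Dict, Sequence


def summarize_distances(distances_miles: Sequence[float],
                        threshold_miles: float = 1.0) -> Dict[str, float]:
    """Quartiles of the distances and the share at or under a threshold."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(distances_miles, n=4)
    within = sum(d <= threshold_miles for d in distances_miles)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "share_within_threshold": within / len(distances_miles),
    }
```

Setting `threshold_miles` to the proximity radius used for an exposure assignment shows what share of zip-centroid locations fall within that radius of the street address.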
Our results demonstrate that it is possible to obtain street addresses for POB holders from the US Postal Service, but this approach is unlikely to yield substantial improvement in overall geocoding success, particularly for older addresses. Furthermore, the use of zip-centroids as a proxy for residential street address may result in substantial misclassification of geographically-based attributes.
Reliance on POB tracing through postal service records alone is not ideal. Using only the postmaster responses, our means of matching boxholders and study subjects was limited to a comparison of their names. Of the nonmatching street addresses, 16% were for boxholders with a different last name than our subject’s. We cannot discount the possibility that excluding these women from our analyses might introduce additional bias. Although our match rate might have improved had we revealed the subject/boxholders’ names, we did not do so, in order to protect our subjects’ confidentiality. Additionally, making tracing requests that could be satisfied through other resources places a burden on the postal service.
Using all the tracing methods at our disposal (see Table 1) would have been costly and labor intensive without increasing our POB/street address match rate dramatically. The overarching limitation we faced was our need for historical information. We asked postmasters for address information that was 3–12 years old. However, because of postal service policies on record keeping, we could not expect POB holder information from 2 years or more earlier to be available unless the boxholder had maintained her box continuously to the present. Any of the other tracing methods we explored would have yielded considerably better results if performed within 2 years of initial case ascertainment.
These results suggest a need for a reevaluation of data collection practices and a modernization of registry infrastructure that will allow GIS to be used to its full potential. Concerted efforts by cancer reporting abstractors and central registry staff to ascertain case street addresses (rather than POBs) would help improve the quality of address information in registry records. Although the California Cancer Registry’s geocoding is over 95% complete for residential addresses, it is important to recognize and evaluate the potential effect of missing data on GIS-based health research. In our breast cancer study, POB addresses accounted for the majority (75.5%) of ungeocodable addresses. Our analysis of boxholder case characteristics suggests that they are not necessarily representative of the whole case population and that excluding them could introduce selection bias. Assigning POBs to zip-centroids may also introduce substantial geographically-based exposure misclassification.
Although these differences may not be critical to all health studies that use disease registry data, it is important to consider such data limitations in the context of individual research questions. For example, some of our breast cancer study’s environmental exposure assignments are based on residential proximities of a half mile or less, whereas our analyses revealed that the median distance between POB zip-centroids and boxholder street addresses was 2.2 miles, with some discrepant by more than 100 miles. We remain uncertain about why these larger discrepancies might exist; possible explanations include subjects with high mobility or multiple residences and reporting errors on the part of postmasters, disease registry abstractors, or boxholders themselves. Clearly, reliance on POB zip-centroids for geocoding in our breast cancer study would result in substantial exposure misclassification. GIS-based health studies also face the challenge of address-matching errors arising from incomplete or erroneous source data and geocoding base maps. Base-map errors, such as positional inaccuracy and bad or theoretical address ranges that introduce interpolation error, can cause locational errors in address geocoding.
It is unknown to what extent our results can be generalized outside California. The data quality concerns we faced may be magnified for researchers studying rural populations, for whom disease registry records are likely to contain a higher proportion of POB addresses.
Disease registry data are a valuable epidemiologic resource. The emergence of GIS technology as an epidemiologic tool promises new scientific insights but depends on the availability of useable geographic data. While disease registries work toward improving the quality and completeness of their geocoded data, epidemiologists should understand the limitations of such data and consider, within the context of their research questions, the potential biases introduced by incomplete or inaccurate geocoding.
We thank the California Cancer Registry for providing the data used in this study. We also thank the legal staff of the US Postmaster General’s office for their assistance with the composition and formatting of the inquiry letters used in our substudy. Our gratitude is extended to local postmasters throughout California for their timely and helpful responses to our inquiries.
1. Richards T, Croner C, Rushton G, et al. Geographic information systems and public health: mapping the future. Public Health Rep. 1999; 114: 359–373.
2. MacDorman M, Gay G. State initiatives in geocoding vital statistics data. J Public Health Manag Pract. 1999; 5: 91–93.
3. Thrall G. The future of GIS in public health management and practice. J Public Health Manag Pract. 1999; 5: 75–82.
4. Bouton PB, Fraser M. Local health departments and GIS: the perspective of the National Association of County and City Health Officers. J Public Health Manag Pract. 1999; 5: 33–41.
5. Williams RC, Howie MM, Lee CV, et al. Geographic information systems in public health: proceedings of the third national conference [Agency for Toxic Substances and Disease Registry, GIS in Public Health web site]. 1998. Available at: http://www.atsdr.cdc.gov/gis/conference98/index.html. Accessed January 15, 2002.
6. US Department of Health and Human Services. Healthy People 2010 (Volumes I and II) Edition. Washington, DC: US Government Printing Office; 2000. (Publication No. S/N 017-001-00547-9).
7. Krieger N, Waterman P, Lemieux K, et al. On the wrong side of the tracts? Evaluating the accuracy of geocoding in public health research. Am J Public Health. 2001; 91: 1114–1116.
8. Dynamap/ZIP+4 centroids and correspondence files [data file]. Lebanon, NH: Geographic Data Technology (GDT); 2002.
9. Office of the Federal Register, National Archives and Records Administration. United States Code of Federal Regulations, Title 39, Postal Service; Chapter I, United States Postal Service; Part 265, Release of Information, Section 265.6(d)(4)(i). Available from US Government Printing Office (http://www.access.gpo.gov/nara/cfr/waisidx_01/39cfr265_01.html). Last updated July 1, 2001.
10. US Postal Service. Domestic Mail Manual, Part D910 - Post Office Box Service [USPS Postal Explorer web site]. May 2, 2002. Available at: http://pe.usps.gov/text/dmm/d910.htm. Accessed June 15, 2002.
11. Semaphore Corporation. ZIP+4 database and CASS-certified address cleaning software [Semaphore Corp., Free Internet Resources web site]. February 1, 2001. Available at: http://www.semaphorecorp.com/cgi/zp4.html. Accessed March 1, 2001.
12. ArcView. Version 3.2a. Redlands, CA: Environmental Systems Research Institute; 2000.
13. US Bureau of the Census. Census of Population and Housing, 1990: Summary Tape File 3 on CD-ROM Technical Documentation. Washington, DC: Bureau of the Census; 1992.
14. Gunier RB, Harnly ME, Reynolds P, et al. Agricultural pesticide use in California: pesticide prioritization, use densities, and population distributions for a childhood cancer study. Environ Health Perspect. 2001; 109: 1071–1078.