Gonorrhea and chlamydia are consistently the 2 most commonly reported diseases in the United States.^{1} Gonorrhea and chlamydia are sexually transmitted and share common epidemiologic characteristics. Gonococcal and chlamydial infections are often asymptomatic, particularly in women. In women, they can cause pelvic inflammatory disease (PID), ectopic pregnancy, tubal infertility, and chronic pelvic pain.^{2–5} In addition, gonorrhea and chlamydia may cause symptomatic urethritis, epididymitis, and acute proctitis in men.^{6–8}

Previous studies have documented and discussed the association between sexually transmitted diseases (STDs) and minority racial/ethnic groups.^{9–15} Aral^{10} and Adimora and Schoenbach^{9} suggest that racial/ethnic disparities in STD rates are in large part due to sexual mixing patterns. These studies have shown that the choice of sexual partners is largely based on race/ethnicity; individuals are more likely to choose partners of the same race/ethnicity. The resulting sexual networks also suggest a geographic dimension to the observed disparities, because people of similar race/ethnicity tend to reside in geographic units that are close together in space, as evidenced by several other studies on STDs and location.^{13,16–30} Thus, population-level STD rates, such as those measured at the county or city level, may show spatial clustering. Therefore, when analyzing the correlates of STDs, it may be important to account for spatial clustering in order to estimate the presence and extent of an association more reliably.

Methods for analyzing correlates of events over space and time have developed substantially over the last 2 decades. The change in approach has been warranted by the fact that the traditional application of ordinary least squares (OLS) to the analysis of events (such as crime and infectious diseases) assumes that there is no spatial clustering or dependence.^{31,32} As evidenced by the studies cited above, this assumption is at odds with the epidemiologic characteristics of STDs, because STD incidence in a specified spatial unit (such as a city or a county) is likely dependent on location. In other words, cities or counties that are close to each other are likely to have similar STD incidence rates, depending on the degree of interaction between the populations and on variations in socio-demographic factors that influence STD rates.

With recent improvements in geographic information systems (GIS) applications and the integration of statistical tools into geographic software, models have been suggested to account for spatial autocorrelation.^{31–36} The application of these techniques specifically to studies on STDs has been sparse.^{12,13,37,38} A second problem associated with spatial regression analyses is spatial heterogeneity, i.e., nonstationarity of the estimated parameters as a result of substantial differences in the variance of the dependent and independent variables across space.^{31,32} For instance, a spatial regression using all the counties in the United States would potentially have unstable parameters due to the wide differences in the variance of the variables as one moves from one region to another, or from one state to another. However, the potential problem of spatial heterogeneity can be substantially reduced by focusing on a smaller geographic area, such as a state.^{31–33,39} Thus, we focused on the state of Texas and used county-level data to illustrate the application of spatial regression methods in examining the association between county-level racial/ethnic composition and the 2 most commonly reported STDs in the United States. In doing so, we explored different spatial relationships in the specification of spatial regression models. We also estimated an OLS model for comparison.

#### MATERIALS AND METHODS

We obtained data on reported cases of chlamydia and gonorrhea (for all ages and genders) from the National Electronic Telecommunications System for Surveillance (NETSS) for 3 consecutive years (1999, 2000, and 2001) for all 254 counties in Texas. We computed temporally smoothed rates using the cases and their respective populations for the 3 years: the sum of the number of cases for 1999, 2000, and 2001 divided by the sum of the populations for 1999, 2000, and 2001 from the 2000 census.^{40} The result is used as a more reliable estimate of the annual incidence of disease for the year 2000, the year for which the most reliable population data are available.
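The temporal smoothing described above can be sketched in a few lines. This is a minimal illustration rather than the study's code: the function name `smoothed_rate`, the scaling to cases per 100,000 population, and the county figures are all assumptions made for the example.

```python
def smoothed_rate(cases, populations, per=100_000):
    """Pooled rate: total cases over total population across the years.

    `per` (cases per 100,000) is an assumed reporting scale; the source
    does not state the scale used.
    """
    if len(cases) != len(populations):
        raise ValueError("cases and populations must cover the same years")
    total_population = sum(populations)
    if total_population <= 0:
        raise ValueError("total population must be positive")
    return per * sum(cases) / total_population


# Hypothetical county: reported cases and population for 1999, 2000, 2001.
cases = [40, 55, 45]
populations = [20_000, 20_000, 20_000]
rate = smoothed_rate(cases, populations)
```

Pooling the numerators and denominators, rather than averaging three annual rates, weights each year by its population and dampens year-to-year swings in counties with few cases.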

Demographic and socio-economic data were obtained from the 2000 edition of the City and County Data Book^{40} (Table 1). Our choice of independent variables was based on availability and previous studies.^{9–14,37,38} The independent variables included percent aged 18 to 24 years, male-female ratio, population density, crime rate, percent owner-occupied, unemployment rate, median household income, and death rate. Persons aged 18 to 24 years have relatively higher STD rates as well as a disproportionate gender representation.^{41} Population density, crime rate, and percent owner-occupied have been shown to be significantly associated with county-level STDs.^{12,13,38} Unemployment rate and median household income are measures of county-level socio-economic status that have been found to be associated with STDs.^{10,42} Death rate is a county-level health status measure that has been shown to be associated with STDs.^{13} We excluded other independent variables that were highly correlated with any one of the variables mentioned above. For instance, we found that poverty level was significantly correlated with median household income. From the Economic Research Service, United States Department of Agriculture,^{43} we obtained data on the proportion of workers who traveled from nonmetropolitan counties to central counties of adjacent metropolitan counties, a suburban commute index. This was used as a proxy for the degree of interaction between suburban and urban counties. Summary statistics of the data are presented in Table 1.

The spatial relationship concept is used as a basis for identifying the form and extent of the existing spatial dependence (or spatial autocorrelation). We focused on the 2 basic criteria^{44} (Rook and Queen) because the third criterion (Bishop) had very little influence because of the potentially low interaction, as was confirmed by preliminary analyses. In general, Rook neighbors share a common side, Bishop neighbors share a common vertex, and Queen neighbors share a common side or vertex (see the Appendix for specific Queen and Rook examples). The county with the largest number of first-order Queen and Rook neighbors is Crockett, which has 9 contiguous neighbors: Crane, Upton, Pecos, Terrell, Val Verde, Sutton, Schleicher, Irion, and Reagan (Fig. 1). El Paso county has the fewest contiguous neighbors in Texas, with only Hudspeth county (Fig. 1). The overall average number of Rook neighbors in Texas is approximately 5, whereas the average for Queen is approximately 6. By definition, the average number of Rook neighbors can never exceed the average number of Queen neighbors for counties in a given state. However, the difference between the average number of Rook and Queen neighbors may be substantial, depending on the shape of the polygons representing the counties and their overall configuration within the state or region. This implies that the choice of criterion for identifying spatial relationships may result in substantially different values (see Appendix for more details).
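The three contiguity criteria are easiest to see on a regular lattice. The sketch below is illustrative only: it assumes square cells on a hypothetical 4 × 4 grid rather than real county polygons, and the function names are invented for the example.

```python
# Offsets to cells that share a side (rook) or only a vertex (bishop).
SIDE_OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
VERTEX_OFFSETS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def lattice_neighbors(rows, cols, criterion="queen"):
    """First-order neighbors of every cell on a rows x cols lattice."""
    if criterion == "rook":
        offsets = SIDE_OFFSETS
    elif criterion == "bishop":
        offsets = VERTEX_OFFSETS
    elif criterion == "queen":
        offsets = SIDE_OFFSETS + VERTEX_OFFSETS
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            neighbors[(r, c)] = [
                (r + dr, c + dc)
                for dr, dc in offsets
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
    return neighbors


def mean_neighbor_count(neighbors):
    """Average number of neighbors per cell."""
    return sum(len(v) for v in neighbors.values()) / len(neighbors)


rook = lattice_neighbors(4, 4, "rook")
queen = lattice_neighbors(4, 4, "queen")
rook_mean = mean_neighbor_count(rook)
queen_mean = mean_neighbor_count(queen)
```

Because every Rook neighbor is also a Queen neighbor, the Rook average cannot exceed the Queen average. On a perfectly regular grid the gap is large, since every interior cell gains four vertex-only neighbors.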

Spatial autocorrelation (or dependence) is formalized in terms of first-order (i.e., only adjacent counties) contiguity, where the dependent variable or error term for each county (spatial unit) is correlated with observations for the dependent variable or error term at contiguous locations. We present below 3 formulations of spatial models suggested by Anselin.^{33} Spatial correlation in the dependent variable is defined as a spatial lag, which is specified by the spatial autoregressive model (SAM):

$$R = \rho W R + X\beta + \epsilon \tag{1}$$

The spatial error model (SEM) is represented by:

$$R = X\beta + u \tag{2a}$$

$$u = \lambda W u + \epsilon \tag{2b}$$

The spatial Durbin model (SDM) takes the form:

$$R = \rho W R + X\beta_1 + W X\beta_2 + \epsilon \tag{3}$$

where R is a vector of temporally smoothed incidence rates (in log form), X is the matrix of county characteristics, W is the row-standardized spatial weight matrix, and the parameter ρ is the coefficient on the spatial lag of the dependent variable. WR captures the extent to which smoothed incidence rates are affected by rates in contiguous counties. The βs are the coefficients associated with the independent variables, and the ϵs are the error terms. u is the error term in (2a) (used to differentiate it symbolically from ϵ in the other equations), and Wu gives the spatial lag of u. λ in (2b) is the error correlation coefficient, equivalent to the serial correlation coefficient in time series models.

The SAM, represented by equation (1), accounts for spatial dependence by including the average of the rates of the neighbors (constructed by the matrix product WR) as an independent variable; the SEM, represented by (2a) and (2b) provides a way to account for the effect of omitted variables in neighboring locations by including the average of the errors generated by the neighbors and constructed using the matrix product Wu; the SDM, represented by equation (3) accounts for spatial dependence by including additional spatial lags of the independent variables constructed using the matrix product WX.
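The spatial lag WR is simply the average of the neighbors' values once W is row-standardized. The following sketch shows the construction with NumPy, assuming four hypothetical counties arranged in a line (0-1-2-3) and made-up rates. It is not the GeoDa procedure used in the study.

```python
import numpy as np


def row_standardize(adjacency):
    """Divide each row of a 0/1 contiguity matrix by its row sum.

    Rows with no neighbors are left as zeros rather than dividing by zero.
    """
    a = np.asarray(adjacency, dtype=float)
    row_sums = a.sum(axis=1, keepdims=True)
    return np.divide(a, row_sums, out=np.zeros_like(a), where=row_sums > 0)


# Hypothetical contiguity: county 0 touches 1, 1 touches 0 and 2, and so on.
adjacency = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
)
W = row_standardize(adjacency)

# Made-up smoothed rates, entered in log form as in the models above.
R = np.log(np.array([100.0, 200.0, 400.0, 800.0]))

# Spatial lag: each entry is the mean log rate of that county's neighbors.
WR = W @ R
```

The same product applied to a matrix of county characteristics gives WX, and applied to a residual vector gives Wu.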


We used a mixed log-log/linear form to reduce the problem of overdispersion in the raw data. The dependent variables were the logs of temporally-smoothed chlamydia and gonorrhea incidence rates. Independent variables included percent black, percent white (omitted), percent Hispanic, percent American Indian, percent American Asian, percent aged 18 to 24 years, log of male-female ratio, log of crime rate, log of population density, percent owner-occupied, unemployment rate, log of median household income, deaths per 1000 residents, and the suburban commute index.
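The mixed functional form determines how each coefficient is read. As a sketch, with illustrative symbols that do not appear in the original equations ($p_i$ for a regressor entered as a percentage, $d_i$ for a regressor entered in logs, and $\alpha$, $\beta_p$, $\beta_d$ for the intercept and their coefficients):

```latex
\begin{align}
\ln R_i &= \alpha + \beta_p\, p_i + \beta_d \ln d_i + \epsilon_i \\
\%\Delta R_i \,\big|_{\Delta p_i = 1} &= 100\left(e^{\beta_p} - 1\right) \approx 100\,\beta_p
  \quad \text{for small } \beta_p \\
\frac{\partial \ln R_i}{\partial \ln d_i} &= \beta_d
\end{align}
```

A one-point increase in a percentage regressor is therefore associated with an approximate $100\,\beta_p$ percent change in the rate, whereas the coefficient on a logged regressor is an elasticity.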

We fitted OLS and several spatial regression models (SAM, SEM, and SDM) under both the Queen and Rook contiguity specifications, and identified the best contiguity criterion using the Akaike information criterion (AIC)^{45} and Bayesian information criterion (BIC)^{46} as suggested by Anselin.^{33} In general, the lower the AIC and BIC, the better the model. Following Belsley et al,^{47} we computed the mean and maximum variance inflation factors (VIFs) for each model to test for multicollinearity. Patients infected with gonorrhea are often coinfected with chlamydia.^{48,49} Based on the close relationship between the 2 diseases, and the fact that the data were from the same population, we finally used a 2-equation seemingly unrelated regression estimation (SURE) suggested by Zellner^{50} to improve efficiency. According to Zellner,^{50} the SURE is a better estimation procedure when it is suspected that the error terms from multiple equations are correlated; the models are estimated concurrently while accounting for the correlated errors, leading to efficient estimates of the coefficients and standard errors.
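The two information criteria can be computed directly from a fitted model's log-likelihood. The sketch below uses the standard definitions; the candidate names, log-likelihoods, and parameter counts are invented for illustration and are not results from this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical candidates: (log-likelihood, number of estimated parameters).
candidates = {
    "model_a": (-210.0, 14),
    "model_b": (-200.0, 15),
    "model_c": (-195.0, 15),
}
n_obs = 254  # number of Texas counties, as stated in the text

scores = {
    name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
```

Both criteria penalize extra parameters, but BIC's penalty grows with the sample size, so the two can disagree when a richer model improves the fit only slightly.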

We used ArcGIS version 9.2 to obtain the polygons representing counties in the state of Texas. GeoDa version 0.9.5-i by Luc Anselin^{36} was used to create the spatial lag variables (WX, WR, and Wu, as shown in equations 1, 2b, and 3 above) and to conduct preliminary analyses to select the best models under the different spatial contiguity matrix specifications. We then transferred the resulting spatial data into STATA (StataCorp LP, College Station, TX) to confirm our results and to perform the SURE regression and diagnostics.

#### RESULTS

The Rook contiguity matrix specification was better for the gonorrhea regression, whereas the Queen specification was better for the chlamydia regression, based on the AIC, BIC, and Lagrange multiplier test (results not shown). Thus, for the spatial regression analyses, a Rook contiguity matrix was used to create the spatial lag variables for the gonorrhea equations, whereas a Queen contiguity matrix was used for the chlamydia equations. Results of the models (OLS, SAM, and SEM) for chlamydia and gonorrhea are presented in Tables 2 and 3, respectively. A measure of multicollinearity (the highest and mean VIFs) is provided for all models. We dropped the SDM because its VIF was greater than 15, above the threshold of 10 suggested by Belsley et al,^{47} indicating potential multicollinearity. Additionally, its AIC and BIC values were substantially larger than those of the SAM and SEM. For both STDs, the SEMs were superior to all the other models explored based on the AIC and BIC (Tables 2 and 3). Also, the SEM and SAM models were superior to the OLS models in both cases.
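The VIF screen can be reproduced with an auxiliary regression for each regressor. This is a minimal sketch on simulated data, not the STATA routine used in the study; the function name `vif` and the three simulated variables are assumptions for the example.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X.

    Each column is regressed on the remaining columns plus an intercept;
    VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        design = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        centered = y - y.mean()
        r_squared = 1.0 - (resid @ resid) / (centered @ centered)
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


# Simulated regressors: x3 is nearly a copy of x1, x2 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=254)
x2 = rng.normal(size=254)
x3 = x1 + 0.1 * rng.normal(size=254)
factors = vif(np.column_stack([x1, x2, x3]))
```

Adding spatial lags of every regressor, as the SDM does, tends to introduce columns that closely track the originals, which is one plausible reason for an inflated VIF in that model.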


Variables that were significantly associated (*P* <0.05) with county-level STD rates (in both disease SEMs) included: percent black, percent Hispanic, percent aged 18 to 24 years, log of male-female ratio, log of population density, log of crime rate, log of deaths per 1000 residents, and the log of median household income.

Results for the SURE analysis are presented in Table 4. A Breusch-Pagan^{51} test of independence indicated a strong contemporaneous correlation of errors across models (59.52, *P* <0.0001), suggesting that the 2 equations should be estimated concurrently. Thus, the use of SURE was appropriate, although the results did not deviate substantially from the single-equation results. The coefficients on the racial/ethnic variables imply that a unit change in percent black is associated with 1.6 (1.1 for Hispanic) and 3.3 (0.5 for Hispanic) percent change (on average) in chlamydia and gonorrhea rates, respectively, compared with percent white. Our test of coefficients indicated that the magnitude of association is significantly higher for percent black in both disease models. However, the association between percent black and gonorrhea is 2 times higher than its association with chlamydia (3.3 vs. 1.6). In contrast, the association between percent Hispanic and chlamydia is 2 times higher than its association with gonorrhea (1.1 vs. 0.5).
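The two steps reported here, testing for cross-equation error correlation and then estimating the equations jointly, can be sketched with NumPy. This is a textbook feasible generalized least squares version run on simulated data, offered under the assumption that it mirrors the general idea. It is not the STATA implementation used in the study, and all names and numbers are hypothetical.

```python
import math

import numpy as np


def ols_residuals(y, X):
    """Residuals from an equation-by-equation OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def breusch_pagan_lm(e1, e2):
    """LM test of independence for 2 equations: n * r^2 ~ chi-square(1)."""
    n = e1.size
    r = (e1 @ e2) / math.sqrt((e1 @ e1) * (e2 @ e2))
    stat = n * r**2
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square(1) upper tail
    return stat, p_value


def sur_fgls(y1, X1, y2, X2):
    """Two-equation SUR by feasible GLS with observations stacked by equation."""
    n, k1, k2 = y1.size, X1.shape[1], X2.shape[1]
    E = np.column_stack([ols_residuals(y1, X1), ols_residuals(y2, X2)])
    sigma = E.T @ E / n  # 2 x 2 cross-equation error covariance
    X = np.zeros((2 * n, k1 + k2))
    X[:n, :k1] = X1
    X[n:, k1:] = X2
    y = np.concatenate([y1, y2])
    omega_inv = np.kron(np.linalg.inv(sigma), np.eye(n))
    xt_oi = X.T @ omega_inv
    beta = np.linalg.solve(xt_oi @ X, xt_oi @ y)
    return beta[:k1], beta[k1:]


# Simulated data: the two error terms share a common component.
rng = np.random.default_rng(1)
n = 254
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
shared = rng.normal(size=n)
y1 = X1 @ np.array([1.0, 2.0]) + shared + rng.normal(size=n)
y2 = X2 @ np.array([-1.0, 0.5]) + shared + rng.normal(size=n)

stat, p_value = breusch_pagan_lm(ols_residuals(y1, X1), ols_residuals(y2, X2))
b1, b2 = sur_fgls(y1, X1, y2, X2)
```

When both equations contain exactly the same regressors, SUR reproduces the OLS coefficients, so any efficiency gain comes from regressors that differ between equations, such as spatial lag terms built from different contiguity matrices.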

#### DISCUSSION

We illustrated the application of spatial regression models to examine the association between county-level racial/ethnic composition and reported cases of 2 STDs (chlamydia and gonorrhea), using Texas data for the year 2000. Our results imply that a unit change in percent black is associated with 1.6% (1.1% for Hispanic) and 3.3% (0.5% for Hispanic) change (on average) in chlamydia and gonorrhea rates respectively, compared with percent white. Thus while the average percent change in chlamydia rate associated with a unit change in percent black was slightly higher than the average percent change associated with a unit change in percent Hispanic (1.6% vs. 1.1%), the average percent change in gonorrhea rate associated with a unit change in percent black was substantially higher (3.3% vs. 0.5%)—over 6 times higher. Additionally, for percent black, the magnitude of the association is about 2 times higher for gonorrhea than for chlamydia (3.3% vs. 1.6%). In contrast, the magnitude of the association for percent Hispanic was about 2 times higher for chlamydia than for gonorrhea. The results confirmed the relatively higher association between percent black and STDs found in studies that accounted for spatial autocorrelation.^{12,13,37,38} However, none of them included percent Hispanic in their regression analyses so we cannot make any comparisons concerning the relative magnitude of association.

The control variables used in the analysis had the expected signs, except for the socio-economic variables. Their signs seemed counter-intuitive because STD rates are expected to be negatively associated with socio-economic status.^{10,52,53} However, at the county level, median household income was relatively high in the urban counties, which also tended to have higher STD rates. This may explain the positive association between STDs and median household income. On the other hand, percent owner-occupied had a negative association because urban counties with relatively higher STD rates had a lower proportion of single-family homes, on average.

Although the spatial models (SAM and SEM) were superior (using standard criteria), it is important to point out that we did not find any substantial difference in the coefficients. Koumans et al^{13} did not provide information on alternative model comparisons with the spatial variable. Greenberg et al^{12} found substantial differences in the coefficients between the OLS and spatial model results using county-level gonorrhea data for 2002 in the United States. Delcher and Stover^{37} did not highlight the differences between the models they used. Semaan et al^{38} studied the state-level association between social capital and STDs (gonorrhea and syphilis). Semaan et al^{38} also found no difference between OLS and spatial regression results when they controlled for population-level variables. Consequently, they suggested that the bulk of the spatial effect may have been captured by the racial composition variable, which may also explain our results. Thus, spatial regression models may not always result in substantial changes in the coefficients of interest. However, they are superior methods and have become an indispensable part of regression analyses when, in theory, location plays an important part in the issue being studied, which is undeniably true for STDs.

We note limitations in the STD incidence data we used. The data were assembled from reported cases of infection, which depend on medical providers testing for primarily asymptomatic infections and on providers or laboratories reporting positive results. Additionally, different localities may focus screening efforts on specific subpopulations, thereby limiting the ability to generalize or extrapolate to the general population in any particular geographic unit. For instance, differences or similarities in how counties adhere to annual chlamydia screening for women in certain age groups recommended by individuals and national organizations^{54–58} may affect the county-level geographic distribution of the reported cases of chlamydia in this study. However, the extent to which this affects the results is difficult to assess. Also, the measure of percent Hispanic as reported by the 2000 census introduces some overlap in our measure of racial/ethnic composition, as the Hispanic ethnic category includes other races.

Consensus on appropriate methods for adjusting rates to reduce variance instability and to make comparing rates from different locations more reliable has not been reached.^{59} Temporal smoothing is one method but may not be the best for STDs. However, for the purpose of identifying differences in STDs for counties in this study, it was a fairly robust measure because it provided an average over a 3-year period. More work is needed in this area to develop validated methods to reduce the “small-number problem” with incidence rates.

By focusing on only the counties within Texas, our analysis ignored spatial effects from counties that are contiguous to border counties but located in neighboring states. This omission may have affected our estimates, but it is difficult to determine the extent of the effect. This potential limitation, together with the problem of spatial heterogeneity illustrates the need for more studies to develop methods that use data from a relatively wider geographic area (such as all counties in the United States), while controlling for spatial autocorrelation and heterogeneity. Such methods would be useful to help understand the overall extent of the association between county-level racial/ethnic composition and STD rates.

This study has shown that, for the state of Texas, the exact specification of the spatial relationship (Queen or Rook) was important in measuring the extent of the spatial autocorrelation in STD rates. The difference arose primarily because most counties in Texas were represented by “well-arranged” regular rectangular polygons. The difference may not exist with irregularly shaped or positioned polygons. Additionally, the exact reason for this difference may be understood from a contextual framework of local activities generating the signals from the existing data. In view of this, depending on the overall configuration of the polygons representing the spatial units of analysis, it is important to explore which specification gives the best results, because the exact type of spatial relationship used may be a source of statistically significant differences in the results obtained.^{32,36} The few studies that accounted for spatial autocorrelation in STD studies did not explore different spatial relationships and spatial regression models. Our study and previous studies have used larger geographic units (counties and states). Thus, further research is needed in this area, using smaller geographic units such as census blocks or cities, including more investigation into higher-order contiguity measures (i.e., spatial dependence that goes beyond the adjacent neighbors and accounts for the effects of the “neighbors of neighbors”).

Numerous previous studies have documented higher rates of reported STDs among certain minority racial/ethnic groups. Using county-level data on reported cases of chlamydia and gonorrhea for the state of Texas, we found that these disparities persisted at the county level even when controlling for STD rates in neighboring counties, although the association between county-level STD rates and racial/ethnic composition was dependent on the STD in question. In spite of the fact that there were no substantial differences in the magnitude of the estimated parameters, our illustrative analyses showed that the spatial regression models used were superior to the ordinary regression models and should be carefully explored in future studies.