## Introduction

Every day roughly 5700 people are newly infected with HIV according to the Joint United Nations Programme on HIV/AIDS (UNAIDS) 2016 estimates ^{[1]}. The location and the characteristics of the people who are newly infected with HIV provide critical information for country program managers to develop effective and efficient responses. Good understanding of epidemics at subnational levels allows managers to reallocate resources and respond more precisely and effectively to emerging epidemics.

The most informative data source for monitoring an epidemic is incidence data, often measured as the proportion of recently infected individuals among the uninfected population of interest. However, determining whether an individual is recently infected by HIV is more difficult than determining whether an individual is HIV positive. Therefore, the prevalence data from surveillance systems, for example, antenatal clinic (ANC) sites or cross-sectional surveys that include HIV testing among key populations, are the main data sources for estimating HIV epidemics. Figure 1 gives an example of the ANC prevalence data from the rural area of Angola with multiple surveillance sites [screenshot from the Spectrum software (Avenir Health, Glastonbury, CT) used by UNAIDS for constructing global, national, and subnational estimates]. At each year and each site, the sample size and the percentage of HIV^{+} individuals are recorded.

When epidemiological data cannot be directly observed, mathematical models are used to estimate the new infections as well as AIDS deaths and other quantities that are not easily available through surveillance systems in low-resource settings ^{[2]}. In recent years, UNAIDS and partners have been investigating models that provide a more detailed geographic understanding of the HIV epidemic. These models require extensive data at subnational levels. In many countries, the detailed data are available only for limited areas and years. In regions where data are sparse, estimating the epidemic based merely on the data gathered within those regions will lead to unreliable and highly uncertain results. Figure 2 shows an example of fitting the EPP model in the Spectrum/EPP-fitting engine to the ANC data (black dots) in the rural and the urban areas of Angola. The black curve is the posterior median of the estimated prevalence trend and the shaded area represents the uncertainty bounds around the median. We can see that in the rural area, we do not have any ANC data until 2007, which leads to high uncertainty about the early period trend such as when the epidemic started and when it peaked.

To improve the estimation in the data-sparse areas like rural Angola, one option is to ‘borrow’ data from other areas like urban Angola. It is reasonable to assume some similarities in the epidemic trends among areas within the same country. For example, in urban Angola with observations in the early 2000s, we estimate the epidemic started in the 1990s and peaked after 2000. If we can incorporate such information into the rural area estimation, it may reduce the uncertainty in the rural area's early epidemic and improve the overall trend estimation.

Bao *et al.* ^{[3]} introduce the idea of sharing information across subepidemics (regions or high-risk groups) of a country in a hierarchical model, and creating one auxiliary data set for each subepidemic with prevalence and sample size estimated from the hierarchical model. For EPP-fitting purposes, this auxiliary data then serve as an additional site that transfers information from other regions; for simplicity, this will be referred to as a pseudosite in the software itself. EPP model is then fitted to both the actually observed data and the auxiliary data to estimate prevalence, incidence, and mortality for each region. Bao *et al.* ^{[3]} apply their method to two countries as illustrative examples. The detailed implementation and connection with the UNAIDS-supported software (Spectrum/EPP) are not discussed.

In this study, we apply the Bao *et al.* ^{[3]} approach to four countries with different data availability to demonstrate its varying degree of improvement over the original EPP model. We discuss in detail the implementation of the method and its interface with Spectrum/EPP. We investigate the effects of the sample size assigned to the auxiliary prevalence data points. Finally, we recommend default settings for Spectrum/EPP users. This new model is being incorporated into Spectrum/EPP 2017.

## Method

In this section, we introduce the hierarchical model used for joint modeling of prevalence data from multiple areas, explain the mechanism of incorporating the hierarchical structure via auxiliary data, and describe the procedure of utilizing the hierarchical model in Spectrum/EPP software.

### Generalized linear mixed model

The available data are observed prevalence rates at ANC sites in different areas. For each country, there are multiple areas, defined either by geographic regions or by urban and rural. Within each area, there are multiple observation sites. Owing to the structure of the data, it is possible to use a hierarchical modeling framework to describe the relationships among different layers of the data. Bao *et al.* ^{[3]} propose a generalized linear mixed model (GLMM) for the prevalence data as follows:

where *Y* _{ait} is the observed HIV^{+} cases in area *a*, clinic *i,* at time *t*, *n* _{ait} and

are the corresponding clinic size and prevalence rate, *f(t)* is the country-level flexible time trend, such as splines, *b* _{a} is the area-level random intercept, *b* _{i(a)} is the clinic-level random intercept nested within a specific area, and *f* _{a} *(t)* is the area-specific random time trend which implies that each area's time trend borrows some information from other areas.

One property of the binomial distribution is that the mean of the response *Y* is *n* × ρ and the variance has to be *n* × ρ × (+ − ρ). In real data, the situation that the mean and variance of the binomial outcome do not satisfy this relationship is called overdispersion. To account for the additional dispersion parameter, one popular choice is a beta–binomial model, which allows the probability ρ to be a random variable from a beta distribution. This additional hierarchical structure results in an overdispersion parameter that measures the pairwise correlation between the observations within each clinic. When there is indeed some correlation among the samples within each site, the beta–binomial model describes the data pattern in a more accurate way. If this overdispersion parameter is very small, the beta–binomial distribution behaves similarly to a binomial distribution, which implies that the dependence between samples is negligible.

In this manuscript, we offer the beta–binomial model as an alternative to the binomial model proposed in Bao *et al.* ^{[3]}. We check the fitted overdispersion parameter of the beta–binomial model and diagnostic plots of the binomial model to decide which model to use. We then use the selected model to generate the prevalence trend estimates for each area.

### Incorporating generalized linear mixed model into estimation and projection package

As discussed in the introduction, to estimate the unobserved incidence rates and AIDS deaths, we need to utilize a mathematical model grounded in the transmission and progression dynamics of HIV. Spectrum/EPP, which is used by most countries to develop HIV estimates, is based on the following susceptible–infected model for the age 15–49 adult population:

In (2), at time *t*, *Z(t)* is the susceptible population, *Y(t)* is the infected population, *E(t)* is the number of people entering the population (who just turned 15), *r(t)* is the infection rate, ρ(τ) is the prevalence rate, μ(τ) is the non-HIV death rate, a_{50}(t) is the population exit rate (who just turned 50), *M(t)* is the rate of net migration into the population, and HIVdeath*(t)* is the number of deaths in the infected population, which is calculated using a CD4 progression model ^{[4]}. Equation (2) is a dynamic system that calculates the HIV infections in the population. Given an initial value, the system generates a series of prevalence rates, incidence rates, and HIV mortality rates. The output prevalence of system (2), ρ(τ), is then linked to the observed prevalence data and population survey data through a linear mixed model as follows:

where

is the inverse cumulative distribution function of the standard normal distribution, α is the bias of ANC data with respect to prevalence data from national population-based household surveys , b_{i} is the site-specific random effect, σ^{−} is assumed to have an inverse-gamma prior which gets integrated out in the likelihood evaluation, and v_{it} is a fixed quantity that depends on the clinic data and approximates the binomial variation. The shape of the prevalence curve is mainly determined by system (2) and ANC data, whereas the level of the prevalence needs to calibrate to the national population-based household surveys level as in model (3). More details can be found in ^{[5]} and ^{[6]}.

We adopt the framework of Bao *et al.* ^{[3]} to incorporate the GLMM results as auxiliary data into the EPP model. For each area, we create a pseudosite with the prevalence and sample size derived from the predictive distribution of the area prevalence estimated from GLMM. Specifically, let ματ and *v* _{at} be the posterior mean and variance of the prevalence in area a and year t. From the binomial mean and variance relationship, the GLMM estimated sample size of this pseudosite can be calculated by ματ(+−ματ)/ωατ. The pseudosite can be viewed as prior information of the area epidemic. The sample size of the pseudosite can be rescaled to reflect varying strengths of this prior information. Given an average rescaled sample size, the distribution of sample sizes across years is proportional to the GLMM estimated size. Then we add the pseudosite as auxiliary data to the original data and fit the EPP model for each area as before.

The resulting model maintains EPP model's epidemiological features and ability to estimate prevalence, incidence, and HIV mortality simultaneously. In the meantime, the shared information across areas within a country is incorporated into the dynamic system by the auxiliary data. The computational cost does not change much as we only need to fit the GLMM once for each country. The most time-consuming part, running the dynamic model, is still done area by area.

### Model evaluation

To evaluate the prediction accuracy, for each country, we define a 5-year period that ends at the last data year as the test period. Data in the test period are removed from the model-fitting process, and referred to as test data. The remaining data are used to estimate model parameters and make predictions, and they are called training data. For each subnational area, we apply the EPP model with the added pseudosite to the training data, predict the next 5 years of prevalence, and compare with the observed prevalence in the test set. We try different sample sizes (0, 10, 100, 1000, and GLMM estimates) of the auxiliary data. Sample size 0 corresponds to the original EPP approaches without using auxiliary data.

We introduce two measures to evaluate the prediction accuracy of different auxiliary data sample sizes. The first one is mean absolute error (MAE), defined as the absolute difference between the mean of predictive distribution and the observed value, averaged across all observations in the test period. The second measure is called continuous ranked probability score (CRPS) ^{[7]}, which takes both the prediction accuracy and the width of the prediction interval into account. CRPS is defined as the following:

where *y* is an observed prevalence rate in test set, *P* is its corresponding posterior predictive distribution, *Y* and *Y’* are independent samples from distribution *P*. A smaller CRPS is preferred, and the CRPS reduces to the absolute error when the predictive distribution is a point mass. We summarize CRPS as an average over all observations in the test set. All of the above measures are calculated on the original prevalence scale for ease of comparison and interpretation.

### The estimation and projection package interface for the hierarchical model

The approach of using a pseudosite – drawing information from the trends in data-rich areas to influence the shape of curve fits in sparse data areas – lends itself to fairly simple implementation in EPP. Fundamentally, three steps are required:

- Put the surveillance data from all regions in a form that can be used to run the hierarchical model.
- Run the hierarchical model in R Foundation for Statistical Computing (Vienna, Austria)/RStudio, Inc. (Boston, MA) and generate a set of pseudosites that can be used to inform fitting in the various subnational projections that have sparse data.
- Load those pseudosites, choose the projections in which they are to be used, and fit EPP.

These steps are implemented in EPP with a hierarchical model panel, shown in Fig. 3.

The user writes the surveillance data, runs the hierarchical model on that surveillance data in RStudio, and then clicks on ‘Import pseudo-sites’ to make the pseudosites available to any projection where the user chooses to use it for fitting. If a projection is to use the hierarchical model, the pseudosite's data will appear as a blue-colored site in the graph on the EPP project page, as shown in Fig. 4. The user only needs to run fit in EPP now to incorporate the hierarchical model effects into the fitting.

The amount of influence that the pseudosite will have on the fit is determined by the sample sizes. In EPP, those can be altered by clicking on the ‘scale data’ button on the hierarchical model panel. This will bring up the scaling dialogue shown in Fig. 5. By simply typing the desired average sample size into the ‘Scale’ column, the user can alter the sample sizes of the data points from the pseudosite. Larger sample sizes will increase the influence of the pseudosite on the fitting.

## Results

In this section, we present an empirical study of the proposed model. We first describe the data we use. Then we discuss the selection of GLMM. After that we evaluate the performance of Spectrum/EPP with and without using auxiliary data. Finally, we show the fitted curves of the data.

### Data description

We select the following four countries to demonstrate the empirical results of the proposed method: Liberia, Angola, Swaziland, and Ghana. The data are provided by UNAIDS. Those four countries are selected as representatives of different data richness and prevalence-level scenarios. Table 1 lists the data availability by years and number of sites for each of the four countries.

In Angola and Liberia, compared with the urban sites the rural areas have fewer sites and insufficient years of data to indicate a clear trend in prevalence. In Liberia, the data is anchored by a survey, whereas Angola is not. In Ghana, all areas have relatively rich data and population survey data. In Swaziland, there are four areas of sparser data rather than just rural and urban, each of which has only five or six sites plus a single year of surveys. Moreover, Swaziland's national prevalence level is above 20%, whereas the other three countries are under or around 5%.

### GLMM selection

We fit both the binomial GLMM and beta–binomial GLMM within each country. The model is estimated using The R-INLA project (Trondheim, Norway) package ^{[8]}. The estimated over-dispersion parameters of the beta–binomial models are all close to zero, ranging from 0.00192 to 0.00902. For the binomial model, we visually inspect the Quantile-Quantile plot of the standardized Pearson residuals versus the theoretical standardized Pearson residuals of the binomial distribution, and find they line up very well along each other. Both the magnitude of the overdispersion parameters in the beta–binomial models and the residual diagnostics of the binomial models suggest that there is not much evidence of overdispersion. Moreover, the binomial model runs 2–20 times faster than the beta–binomial model. Therefore, we use the binomial GLMMs to generate the area-specific prevalence estimates and create pseudosites.

### Model prediction accuracy

We summarize the test data evaluation results in Table 2, with the minimum MAE and CPRS for each area highlighted in bold. We have the following observations:

- Out of the 10 areas in the four countries, eight areas show improvements in out-of-sample prediction by adding auxiliary data regardless of the auxiliary sample size. The original EPP is preferred in two areas of Swaziland.
- In the areas where auxiliary data improve the prediction, the optimum sample size varies.
- The most improvement lies in the data-sparse areas borrowing information from data-rich areas, as in rural Angola, rural Liberia. The auxiliary data with sample size 1000 offers the minimum MAE and CRPS in both cases. The MAE is increased by 8.5% and 17.1% for rural Angola and rural Liberia, and the CRPS is increased by 5.1% and 14% for the two areas. The GLMM estimated sample size is the next best scenario and provides similar improvements.
- In data-rich areas, such as urban Angola, urban Liberia, and both areas in Ghana, we see marginal improvement using auxiliary data.
- In Swaziland, each area has about only two sites. The effects of adding a pseudosite is mixed. In two of the areas, adding a pseudosite with a small sample size (10) shows significant improvement. In the other two areas, no auxiliary data is preferred. One possible explanation is that because of the small number of existing sites (two per area), the results highly depend on whether the samples and trends in the eight sites are similar, as changing the number of sites from two to three can have a large impact on estimating the site effects.
- Finally, we observe that higher prediction error is associated with higher prevalence rates, as seen in the comparison between Swaziland and the other three countries.

Based on the above findings, we believe adding auxiliary data has benefits in most situations and recommend the GLMM-estimated sample size as the default setting.

### Data results

We compare the entire prevalence trajectories of the EPP fits between no auxiliary data and using auxiliary data with GLMM estimated size. For Ghana and Swaziland, the national prevalence trends look the same. Therefore, in Fig. 6, we only present the national prevalence estimates for Angola and Liberia. We notice that the national trends and uncertainties with (red) and without (black) auxiliary data are similar in urban areas. Using auxiliary data provides much narrower uncertainty bounds than not using auxiliary data in rural areas. Moreover, with auxiliary data, the projected trend stabilizes from 2008 to 2010 in rural Angola, and slightly declines in rural Liberia after 2010. Both trends are more consistent with the observed data when the auxiliary data trends are included.

## Discussion

In this study, we apply the methodology in Bao *et al.* ^{[3]} to four countries with different data availability. We discuss in detail the implementation of the method and its interface in Spectrum/EPP. We examine the effects of different auxiliary data sample sizes on the model's prediction accuracy using several measures. The empirical results suggest that, for areas with sparse data and existence of relatively richer data in other areas, adding auxiliary data could improve the EPP fit. We allow EPP users to specify the auxiliary data sample sizes, and recommend using the default sample size provided by GLMM. Though we use the binomial GLMM as the prior model in our empirical study, we propose the beta–binomial GLMM that accounts for sample correlations within a clinic as an alternative.

We find the most improvement lies in Angola and Liberia, where data are sparse in the rural areas and richer in the urban areas. As countries move to subnational estimations, this pattern of sparsity is particularly true in settings where some regions had surveillance introduced early on and have long-time trends, but numerous other regions have only been added to the surveillance system in recent years. The ability of the hierarchical model approaches described here to share data from the data-rich regions with longer time trends can lead to more realistic fits subnationally and, thus, improve aggregated national estimates.

In countries where all areas have limited data, especially small numbers of sites, we do not recommend using the auxiliary data. In those settings, the EPP trends can be highly affected by adding the auxiliary data and the outcomes are not guaranteed.

One potential further improvement for rich data situation is to add social, economic, and environmental factors as covariates in the hierarchical model. For countries without data-rich areas, we would consider applying the hierarchical model to multiple neighboring countries so that information can be borrowed across countries. Although careful consideration would be needed to take into account the potential similarities and differences in the epidemics among the countries before applying such a model.

Although beyond the scope of this study, the same approach can be extended to model multiple high-risk groups in a country. The combination of a subnational region and a particular high-risk group can have sparse data. In the GLMM, we have one more layer of data structure that makes the observation y_{agit}, where g is the high-risk group indicator. We can simply treat the groups the same way as we treat areas, and introduce group-specific intercept and time trends. We will then generate area and group specific pseudosites. Any general subepidemic can be estimated in a similar manner.

## Acknowledgements

The research was supported by the Joint United Nations Programme on HIV/AIDS and NIH-R56 AI120812-01A1. The authors are grateful to Peter Ghys, Tim Hallett, and Jeff Eaton for discussions of hierarchical models, to Ben Sheng and Yuan Tang for preliminary experiments, to Bangze Chen, Kaiyi Wu, and Haici Tan for testing the implementations in Spectrum/EPP, to Kelsey Case for coordinating meetings.

### Conflicts of interest

There are no conflicts of interest.