## Background

The primary data for estimating HIV epidemic trends in sub-Saharan Africa are sentinel surveillance of HIV prevalence among pregnant women attending antenatal clinics (ANC). Since the early 1990s, HIV prevalence was measured among women attending a selection of ANC sentinel sites every 1–3 years, furnishing a time series of HIV prevalence observations in each clinic. Additional clinics have been added over time. Estimates of HIV prevalence and incidence trends are created by statistically fitting the Estimation and Projection Package (EPP) model ^{[1,2]}, a simple ‘susceptible–infected’ HIV epidemic model, to ANC sentinel surveillance (ANC-SS) prevalence and prevalence from nationally representative household surveys in a Bayesian framework. A linear mixed-effects model is used to account for the potentially unbalanced repeated observations at the same sites when inferring a population prevalence trend from ANC-SS ^{[3]}. One assumption underpinning this estimation is the discrepancies between the model predictions, and the observed ANC-SS prevalence is explained by the random sampling error expected based on the sample sizes in each clinic ^{[3]}.

In practice, we observe greater variation in year-to-year prevalence in time series from individual clinics than would be expected from random sampling error alone, given the sample sizes in each clinic. For example, Brown ^{[5]} noted that the in-sample coverage of clinic-level 95% prediction intervals is on average around only 88% (instead of the theoretical 95%). A number of factors could potentially contribute to the underestimation of the uncertainty, such as uncertain sampling procedures or changes in sampling procedures; changes in inclusion criteria for women attending participating facilities; poor quality control in laboratories, contamination, or changes in diagnostic tests over time; or heterogeneous variation in local epidemic trends not captured by the site-level random-effects intercepts.

Although national survey prevalence is included in the likelihood for fitting the model, in settings with large numbers of ANC sites and frequent sentinel surveys, ANC prevalence data tend to overwhelm the higher-quality national survey data, resulting in epidemic trends that can be inconsistent with national surveys. At present, there is a transition from using sentinel surveillance of HIV prevalence among a fixed number of pregnant women at a selection of ANC sites to utilizing prevalence estimated via routine HIV testing of all pregnant women attending ANC. For these data, the sampling variation of the observed prevalence is negligible due to the large sample sizes, but the nonsampling error may be expected to be even larger than for sentinel surveys ^{[4]}.

In this study, we propose an additional variance parameter to allow nonsampling error in ANC prevalence observations, compare different methods for estimating the additional parameter, and examine the implications of this for estimates of HIV prevalence and incidence from the EPP model.

## Methods

### Estimation and Projection Package model

Estimation and Projection Package model is a simple susceptible-infected epidemic model that is used to infer internally consistent estimates of adult HIV prevalence and incidence trends from ANC-SS prevalence and household survey prevalence. The model stratifies the adult (age 15–49 years old) population according to susceptible individuals, CD4^{+} stage of infection, and the antiretroviral therapy (ART) population.

Further details of the EPP model are available from Brown *et al.* ^{[2]}. EPP implements two flexible models for the transmission rate *r*(*t*). The ‘r-trend’ model is a flexible seven-parameter model ^{[5]}, and the ‘r-spline’ model uses penalized B-splines with seven knots, a smoothing penalty, and initial seed incidence rate (nine parameters total) ^{[6]}. In the EPP software, the r-spline model implements an equilibrium prior assumption for *r*(*t*) after the end of the data ^{[7]}. In this analysis, we examine the r-spline model both with and without the equilibrium prior assumption.

### Incorporating nonsampling error

The likelihood function currently used for fitting the EPP model to ANC-SS data assumes that all uncertainty about ANC prevalence *γ* _{st} for a clinic *s* in year *t* is captured by the binomial sampling variation associated with the sample size *N* _{st}, that is, the number of HIV-positive women *Y* _{st} ∼ Binomial (*N* _{st},*γ* _{st}). Repeated observations at the same clinics over time are accounted for via a hierarchical linear model. The observed prevalence at clinic *s* in year *t* is modeled on probit scale ^{[3]}. Following notation of Alkema, Raftery, and Clark, define

, (where *Φ* ^{−1} is the inverse of the normal cumulative distribution function), and the model is

where *ρ* _{t} is the prevalence in year *t* predicted by the model, *α* _{ANC} is the systematic bias between prevalence among the ANC population compared to the general population, *b* _{s} is a site-level random effect, and *∊* _{st} is the residual error. The variance of *∊* _{st} is approximated by:

where

is substituted for the true prevalence *γ* _{st} in the delta method approximation to var(*W* _{st}). As the sample size *N* _{st} increases, the variance of prevalence at the probit scale declines to zero at the rate of 1/*N* _{st}.

To allow additional nonsampling error in observed ANC prevalence, we propose adding an additional variance term

to inflate the variance associated with clinic-level ANC prevalence observations. The residual variance becomes

The above approach maintains the relationship that uncertainty about ANC prevalence observations is related in magnitude to the expected binomial sample error, but allows additional variance in ANC prevalence observations as suggested by the data.

The additional variance could be global, with a single value fixed or estimated across countries and regions, or different values could be chosen for each setting. Consistent with the paradigm for application of the EPP model, we estimate

independently for each subnational region to which EPP is applied.

### Estimating

We consider the following three different approaches for estimating

:

- The simplest implementation applies a linear regression model to the residuals obtained from the original EPP model. It leads to a priori estimation of , which is then a fixed input parameter for the model fitting process,
- Johnson et al.[8] and Ševčíková et al.[9] propose substituting an unbiased sample variance estimator for in the likelihood calculation, or
- The full Bayesian approach estimates the joint posterior distribution of and other unknown parameters simultaneously.

For (1), we fit a linear regression to the probit-transformed observed ANC prevalences *W* _{st} with dummy indicator variables for each clinic *s* and time *t*, analogous to the hierarchical ANC likelihood, but with fixed effects for each clinic and year. Assume the residuals of transformed ANC prevalence are independent and identically distributed with an equal variance *σ* ^{2}, and let

be its maximum likelihood estimate (MLE). We estimate the excess variance by subtracting the expected sampling variation

:

The estimate may underestimate the true

since the survey year fixed effects are completely independent, potentially allowing greater year-to-year change in prevalence rather than being constrained by the epidemiologic model. If the estimate

, we set

.

Second, Johnson *et al.* ^{[8]} and Ševčíková *et al.* ^{[9]} propose substituting an MLE conditional on the model output to approximate an unknown variance in the likelihood calculation. For (2), let *ρ* _{t}(*θ*) be the predicted prevalence among pregnant women for a vector of input parameters *θ*. The clinic level random effect *b* _{s} is approximated by

where T_s is the number of observations for site s, and

the model predicted transformed prevalence for clinic *s* at time *t*. In the likelihood calculation we substitute

We subtract the number of sites *S* in the denominator to account for the lost degrees of freedom for calculating

at each site.

Finally, approach (3) is to explicitly estimate

as an additional parameter in Bayesian inference. We used an exponential prior distribution with rate *v* _{0} = 0.015^{−1}, informed by the regression analysis for approach (1). We test the sensitivity to the prior using more diffuse priors with rate *v* _{0} ∊ {0.1^{−1},1}.

### Analysis

Data were taken from national estimates country files provided by UNAIDS for the 2016 UNAIDS estimates. We used data from nine countries in southern and eastern Africa with two or more prevalence estimates from household-based surveys: Botswana, Kenya, Lesotho, Malawi, Tanzania, Uganda, South Africa, Zambia, and Zimbabwe. For Botswana, Lesotho, Tanzania, Uganda, and Zambia models were fit separately to data from urban and rural regions. Data from Malawi were stratified by region (northern/central/southern), Kenya by eight former provinces, and South Africa and Zimbabwe by province (9 and 10, respectively), for a total of 40 subnational regions as defined for the 2016 UNAIDS estimates.

Each of the three approaches for incorporating

are implemented for the EPP r-trend model ^{[5]}, the EPP r-spline model ^{[7]}, and the EPP r-spline model without the equilibrium prior. Outcomes are compared to results without additional variance for the ANC prevalence

.

Parameter estimation is conducted via Incremental Mixture Importance Sampling (IMIS) ^{[10]} with *B* _{0} = 100 000 initial samples and *B* = 10 000 samples per iteration. R code for reproducing the analyes is available from https://github.com/jeffeaton/anc-overdispersion.

Model outputs compared are HIV prevalence, HIV incidence, and the transmission rate *r*(*t*) for the age 15–49 population represented by EPP.

### Model selection and validation

We compare the model performance using a number of metrics: such as the coverage of the predictive interval, the log-posterior predictive density (LPPD) ^{[11]}, the computing cost, and coefficient of variation at different time points. We obtain the posterior predictive distribution for each ANC observation and calculate the coverage of 95% predictive intervals (in-sample fit). We also conduct an out-of-sample validation by fitting the model to 90% of the ANC observations as a training dataset while ensuring at least one observation is retained for each ANC site, and withholding the remaining 10% as a test dataset. Fifty training/test splits are created randomly for each dataset and the coverages of the 95% predictive intervals are calculated by using the observations from test datasets. Second, we evaluate the effect of including

for model fit to household survey HIV prevalence by calculating the LPPD, a measure of model accuracy in Bayesian framework ^{[11]}. The computational implications of incorporating

were assessed by comparing the median number and interquartile range (IQR) of iterations for convergence of the IMIS algorithm. Finally, we summarize the overall uncertainty about the epidemic by calculating the coefficient of variation for estimated HIV prevalence and HIV incidence at different time points in the epidemic.

## Results

### Derivation of prior

The first estimation approach does not require refitting the EPP models to the surveillance data, and provides a reasonable initial approximation to the value of the additional variance parameter. Following that approach, regression-based estimates of

for the 40 datasets considered range from 0 to 0.030, with mean of 0.007 and median 0.003. Based on this, we propose an exponential distribution with rate 0.015^{−1} as an informative prior distribution for the full Bayesian approach. This distribution has a median of 0.01 and 90% of the mass less than 0.035.

### Estimates for

Table 1 summarizes the estimates of

by different approaches for the 40 datasets, and the left panels of Fig. 1 shows the posterior density estimates for the three estimation approaches. Estimates for all sites are shown in supplementary figures (http://links.lww.com/QAD/B45). For each estimation approach, estimated values for

were similar across the three EPP model variants r-trend, r-spline with equilibrium prior, and r-spline without equilibrium prior. The first estimation approach based on regression output yielded smaller values (median 0.003, IQR 0.0–0.007) than the two alternatives, suggesting that the regression estimator underestimated the extra variance. The point estimates between the unbiased variance estimator and the full Bayesian estimator were highly correlated across the 40 datasets (correlation = 0.91).

By comparison, the mean value for

in each dataset, the transformed sampling variance around observed ANC prevalence, has a median of 0.008 (IQR 0.006–0.014). Estimating

with the r-spline model increased the error variance about ANC prevalence by 2.7 times (IQR 1.9–3.8), measured by

).

Analysis of variance (ANOVA) shows that estimates of

for regions within the same country were more similar than values of

between countries (*P* < 0.001), with country explaining 56% of the variation in the log of the mean value of

. The number of regions into which a country was stratified, urban versus rural location, the number of sentinel sites, and the median number of observations per site were not associated with the estimated value of

.

### Comparing model fit

To choose a recommended approach, we compared the approaches in terms of how well they fit the ANC data in in-sample comparisons, out-of-sample prediction, fit to household survey prevalence data, and the computational implications (Table 2). Fitting the r-trend model, with sampling variance only

the average coverage of 95% prediction intervals was 78%. With the fixed regression estimator, this increased to 86%, and substituting the unbiased variance estimator or explicitly estimating

in the Bayesian model achieved very close to the theoretical 95% coverage. In out-of-sample prediction, when

, the coverage was 69%. This increased to 90 and 89%, respectively, when using the unbiased variance estimator or jointly estimating

. Similar coverage results were observed for r-spline models with and without equilibrium prior.

Incorporating

improved the fit to household survey prevalence data, measured by the log posterior predictive density (LPPD). This is expected because the effect of additional variance for ANC data is to give relatively less weight to ANC data in the likelihood, and hence more weight to household survey prevalence.

Finally, we used the number of iterations of the IMIS algorithm, which scales with the number of likelihood evaluations required, as a proxy for the computational implications of the candidate approaches for incorporating

. The bottom rows of Table 2 show the median number of IMIS iterations required for convergence across 50 fits for the 40 regions in out-of-sample prediction simulations, and the width of the IQR as a measure of the variability in model fitting. The median number of iterations was twice as many for the r-spline model than the r-trend model (55 for r-spline, 56 without the equilibrium prior, 27 for r-trend). None of the approaches systematically affected the number of iterations required for convergence, including explicit estimation of an additional model parameter. They also did not increase the variation in number of iterations for model fitting.

### Implications for HIV estimates

Figure 1 illustrates the trend for HIV prevalence (second column) and incidence (third column) estimated by the r-spline model without additional variance (

; green lines) and when

is estimated as an additional parameter in Bayes (pink lines) for two EPP regions: Botswana Urban and Zimbabwe Manicaland. For Botswana Urban, the posterior mean of

was 0.007 [95% confidence interval (CI) 0.005–0.011], and estimates for incidence and prevalence were relatively similar with slightly greater epidemic uncertainty when allowing nonsampling error. In contrast, for Zimbabwe Manicaland, the estimates of

were larger (mean 0.036; 95% CI 0.019–0.060), and the inferred epidemic trend peaks and declines less rapidly and is more consistent with the prevalence trend observed in national household surveys. Uncertainty is much wider.

Allowing nonsampling error in ANC prevalence systematically increases the uncertainty about epidemic estimates (Table 3). The magnitude of the increase in uncertainty depends on the timing of the epidemic and the model choice. Using the r-trend model, the median increase in the coefficient of variation about prevalence was 1.49 times in 1990, 1.32 in 1995, 1.25 in 2000, and 1.04 times in 2005. For the r-spline models the increase was slightly less: 1.40 times in 1990 declining to 1.03 times in 2005. The decline over time is because during the late 2000 period, the level and uncertainty is largely determined by household survey prevalence, whereas uncertainty about the course of the epidemic from ANC-SS data has a larger effect on the estimates before the survey availability.

### Sensitivity to prior

Figure 2 presents the posterior mean for

using the informative prior with rate *v* _{0} = 0.015^{−1} (horizontal axis) compared to more diffuse priors with *v* _{0} = 0.1^{−1} (dots) and *v* _{0} = 1 (triangles). At small values of

the estimates are very similar, whereas at larger values of

there is some evidence of attenuation using the informative prior. Results in Fig. 2 are based the r-spline model, and are similar for the other model variants.

## Discussion

Probabilistic uncertainty for HIV epidemic estimates was a major innovation for the robustness of HIV estimates and projections from HIV surveillance data ^{[12]}. The imperfect statistical fit of the statistical model for ANC-SS has noted previously ^{[5]}, potentially giving too much weight to outliers and overemphasizing ANC-SS data relative to other data sources. In this study, we propose a method to account for additional variance and demonstrate that this improves the statistical fit to data from both ANC sentinel surveillance and from national household-based prevalence surveys, which generally have more rigorous sampling.

We recommended estimating the additional variance

as an additional parameter in the full Bayesian inference because this captures the full uncertainty about the error variance without incurring additional computational cost. Although the number of parameters has increased, the added parameter flattens the sharp peaks in the likelihood, which are likely caused by underestimates of the residual variance of ANC-SS data. Estimating

as an additional parameter performed similarly to substituting an unbiased estimator for the variance in the likelihood calculation, an approach that has been used elsewhere ^{[8,9]}. That values of

were more similar for regions within the same country suggest that greater nonsampling error may reflect changes over time in the methodology or quality of the national sentinel surveillance systems.

Accounting for nonsampling error systematically and appropriately increases the uncertainty about HIV prevalence and incidence during early periods of the epidemic. This additional variance allows for greater uncertainty than would be expected relative to plausible epidemic trends represented by the EPP model. Some error could also be attributable to ‘model error’ if the model is not flexible enough to capture the true underlying epidemic patterns. This highlights a key challenge for further methodological enhancements to improve model-based HIV epidemic estimates: the need for model structure to constrain epidemiologically plausible estimates during the early epidemic period when HIV data were very sparse, but the desire for a suitably flexible model that is sensitive to recent changes in epidemic trends which are of greatest policy interest.

The additional uncertainty does not overcome other fundamental assumptions that underpin the interpretation of ANC-SS data for estimating population-wide epidemic trends, including potential selection biases arising from initial sentinel sites having been selected as a ‘convenience sample’ ^{[13]}, and the assumption that the epidemic trend is homogenous in all ANC sites. These may prove to be substantially greater sources of uncertainty than that discussed here. It also does not account for the true uncertainty about other model parameters which are currently treated as fixed in epidemic inference, such as disease progression and survival, mortality rates on ART, and the relationship between fertility and HIV ^{[14,15]}. Reflecting the uncertainty about these is expected to further increase probabilistic uncertainty intervals generated by the EPP model around incidence trends inferred from prevalence and is an area for future development.

In some datasets, different approaches to estimating

or different EPP model variants resulted in dramatically different and sometimes implausible epidemic patterns (supplementary figures, http://links.lww.com/QAD/B45). This highlights the extent to which multiple epidemic trends could be considered broadly consistent with available data, suggesting even expanded uncertainty intervals may understate the true uncertainty about the epidemic. In real applications of the EPP software to create country-level HIV estimates, users have the opportunity to specify prior constraints on HIV prevalence to exclude implausible epidemic curves in formal Bayesian Melding, but ideally these should be specified *a priori* before doing any model fitting. Hierarchically sharing information between neighboring regions constitutes a systematic approach to enhance epidemiologically plausible estimates in regions where data are sparse ^{[16]}.

Fully quantifying all sources of uncertainty about data becomes increasingly important as more data sources are incorporated into estimates and each must be given appropriate weight relative to others. Examples of other data sources which could enhance future estimates of HIV epidemic trends include routine test of pregnant women attending ANC, adult mortality data, phylodynamic data, HIV incidence assays, and biomarkers for antiretroviral drug usage. The approach proposed here may be naturally applicable to the inclusion of routine prevalence data from Prevention of mother-to-child transmission testing, where the sampling error may be negligible owing to very large sample sizes, but nonsampling error may be a much greater source of uncertainty because of unknowns about the sampling frame, completeness, reporting errors, and other factors related to the ‘routine’ nature of the data ^{[4]}.

## Acknowledgements

We thank the participants in the UNAIDS Reference Group on Estimates, Modeling, and Projections (www.epidem.org) for invaluable feedback on this research. We thank Leigh Johnson for useful conversations. JWE thanks Rich Fitzjohn for cluster computing assistance.

J.W.E. and L.B. conceived the study. J.W.E. conducted model analysis and prepared the first draft of the manuscript. J.W.E. and L.B. edited the manuscript.

### Conflicts of interest

J.W.E. acknowledges funding from UNAIDS and the Bill and Melinda Gates Foundation. L.B. acknowledges funding from UNAIDS and NIH-R56 AI120812-01. J.W.E. and L.B. acknowledge funding from NIH-R03 AI125001-01.