Estimates of HIV prevalence are frequently used to monitor the development of the HIV epidemic,1,2 to study the determinants of epidemic spread, to identify groups at high risk of HIV infection,3–5 and to assess the need for HIV prevention and treatment.6,7 In recent years, population-based nationally representative surveys—in particular, the Demographic and Health Surveys8—have become the gold standard for estimating national HIV prevalence rates.2 However, such estimates are vulnerable to selection bias because information on HIV status is often not available for a proportion of the eligible population: some eligible individuals cannot be contacted; others are contacted but do not consent to an HIV test. If HIV prevalence in eligible persons who do not participate in HIV testing differs from the prevalence in eligible persons who do participate, prevalence estimates based on information from participants alone will be biased.
Nonparticipation rates in population-based HIV surveys are low in some settings and high in others.9,10 For example, in the Demographic and Health Surveys in sub-Saharan Africa, nonparticipation rates have ranged from 8% to 30%.11 Selection bias could thus be large in these surveys. The 2007 Zambia Demographic and Health Survey is a case in point. The HIV prevalence estimates published in the official survey report were 12% in men and 16% in women.12 However, these estimates were based only on the results from the 72% of all eligible men and 77% of all eligible women who took part in HIV testing. The upper bound on HIV prevalence in all eligible individuals is obtained by assuming that every individual who did not participate in HIV testing was HIV-positive (in which case HIV prevalence would be 37% in men and 35% in women); the lower bound is obtained by assuming that every individual who did not participate was HIV-negative (in which case HIV prevalence would be 9% in men and 12% in women).
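The bounds quoted above follow from simple arithmetic on the published prevalence and participation figures. The sketch below reproduces them; the function name `prevalence_bounds` is ours, and the calculation ignores survey weights, so it is an illustration rather than the procedure used in the survey report.

```python
def prevalence_bounds(prev_participants, participation_rate):
    """Return (lower, upper) bounds on prevalence among all eligible persons.

    lower: every nonparticipant is assumed to be HIV-negative.
    upper: every nonparticipant is assumed to be HIV-positive.
    Inputs are proportions between 0 and 1 (unweighted, for illustration).
    """
    # Share of all eligible persons who participated AND tested positive.
    observed_positive_share = prev_participants * participation_rate
    lower = observed_positive_share
    upper = observed_positive_share + (1.0 - participation_rate)
    return lower, upper


# Figures quoted in the text for the 2007 Zambia survey.
men = prevalence_bounds(prev_participants=0.12, participation_rate=0.72)
women = prevalence_bounds(prev_participants=0.16, participation_rate=0.77)

print("men:   %.0f%% to %.0f%%" % (100 * men[0], 100 * men[1]))
print("women: %.0f%% to %.0f%%" % (100 * women[0], 100 * women[1]))
```

The width of each interval equals the nonparticipation rate, which is why bounds of this kind are uninformative when nonparticipation is high.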
It is commonly assumed that HIV status is “missing at random,”13 which implies that, after conditioning on observed variables, survey nonparticipation is uncorrelated with unobserved variables, such as HIV status. If HIV status is indeed “missing at random,” prevalence estimates can be corrected for selection bias using imputation methods such as single imputation,14 multiple imputation,15 or the expectation-maximization algorithm.16 Several studies have used such imputation methods to adjust HIV prevalence estimates for selective HIV survey nonparticipation.11,17 For example, Mishra et al17 imputed HIV prevalence in eligible persons who did not participate in HIV testing, based on information on observed factors such as age, region of residence, and socioeconomic and behavioral variables. They found that in a number of countries the imputed prevalence was slightly higher than the prevalence observed in the HIV survey participants, but concluded that “the overall effect of nonresponse on observed national HIV estimates was small and insignificant in all countries.”17 WHO and UNAIDS18 have recommended imputation to adjust HIV prevalence estimates for selective survey nonparticipation.
However, the assumption that HIV data from nonresponders are “missing at random” may not be valid. In many situations, it is likely that unobserved factors—such as fear of learning one's own HIV-positive status19 or worry that others might learn one's status—determine nonparticipation in HIV surveys. If such unobserved factors are associated not only with nonparticipation but also with HIV status, imputation approaches will lead to biased HIV prevalence estimates. For example, if people know their HIV status and refuse to participate in an HIV survey if they are HIV-positive, then nonparticipation is clearly correlated with the unobserved HIV status, ie, with the outcome we wish to estimate. A previous study from Malawi found that among people who knew their HIV status because of a previous HIV test, those who were HIV-positive were more than 4 times as likely to refuse another HIV test than those who were HIV-negative, suggesting that survey nonparticipation is indeed associated with HIV status.20
Given that participation can depend on unobserved characteristics correlated with HIV status, we employ an alternative approach to imputation, using Heckman-type selection models that correct for selection bias when nonparticipation is determined both by observed and by unobserved factors.21,22 Heckman-type selection models are commonly used in economics and political science,23,24 but are rarely applied in epidemiology, possibly because their performance depends crucially on the availability of valid exclusion restrictions, ie, selection variables that determine survey participation but do not independently affect HIV status.
The only 2 previous studies using this technique on data from the Demographic and Health Surveys have used participants' characteristics, such as education,25 labor market participation, and rural versus urban residence26 as predictors of testing. However, these characteristics are unlikely to satisfy the exclusion restriction because they also affect the likelihood that a person is HIV-positive.27,28 Two other studies have used more convincing exclusion restrictions in Heckman-type selection models of HIV status. Janssens et al29 use interviewer identity as a factor that influences HIV testing in an HIV prevalence survey in Namibia. Reniers et al30 use interviewer identity as selection variables in a study of the effectiveness of different testing protocols in determining HIV status in Ethiopia. However, to our knowledge, ours is the first study to use convincing exclusion restrictions to derive national HIV prevalence estimates, corrected for selection on unobserved variables, from data that are routinely available in the Demographic and Health Surveys.
We use 2 types of operational variables as exclusion restrictions: interviewer identity, and visit to a household on the first day of fieldwork within a household cluster. We discuss why these variables are highly likely to be valid selection variables, and we apply our method to the HIV survey conducted as part of the 2007 Zambia Demographic and Health Survey.
Eligibility for HIV Testing
The 2007 Zambia Demographic and Health Survey comprised a household interview, interviews with individual household members, and HIV testing.12 A total of 7969 households were eligible for the household interview (for a description of the selection of eligible households, see the eAppendix, http://links.lww.com/EDE/A441). All women aged 15–49 years and all men aged 15–59 years “who were either permanent residents of the households in the 2007 Zambia Demographic and Health Survey sample or visitors present in the household on the night before the survey were eligible” for the individual interview and HIV testing (7146 men and 7408 women). Figure 1 shows a flow chart of the samples included in HIV prevalence estimation. Data were collected from April to October 2007.12 The data collection was carried out by 12 interviewing teams, each consisting of 1 supervisor, 3 female interviewers, and 3 male interviewers.12
In addition to a list of household members and visitors, the household interview elicited basic demographic information (such as sex, age, residence, and education) on all eligible household members and visitors,31 as well as information on drinking water, toilet facilities, cooking fuel, and household assets. In the individual interview, information was collected on a range of characteristics, such as marital status, age at first sex, number of sex partners, condom use, sexually transmitted diseases, smoking, alcohol consumption, and attitudes towards HIV/AIDS.12
The fieldworkers who interviewed the respondents also conducted HIV testing: “Interviewers explained the procedure, the confidentiality of the data, and the fact that the test results would not be made available to the respondent.”12 All respondents, whether they consented or not, were provided with information on local voluntary counseling and testing (VCT) services.12 See the eAppendix (http://links.lww.com/EDE/A441) for details of HIV testing laboratory procedures.
Selection and Imputation Models
Given that HIV survey nonparticipation may be correlated with unobserved personal characteristics associated with HIV status, we estimate a model in which being available for, and agreeing to, testing is correlated with HIV status. This selection model is based on the adaptation by Dubin and Rivers32 of Heckman's21 correction for sample selection to the case where the outcome of interest is dichotomous rather than continuous.
We start with a probit model for HIV status (the substantive equation)

$$h_i^* = x_i'\beta + \varepsilon_i \tag{1}$$

$$h_i = \begin{cases} 1 & \text{if } h_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $h_i^*$ is an unobserved latent variable that determines the likelihood of HIV infection for individual $i$, and $h_i^*$ depends on a vector of observed characteristics $x_i$ (with coefficient vector $\beta$) and random error $\varepsilon_i$. Actual HIV status $h_i$ is positive (1) if $h_i^*$ is above zero and negative (0) otherwise.
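Similarly, we use a probit model for HIV survey participation (the selection equation)

$$s_i^* = x_i'\gamma + z_i'\delta + u_i \tag{3}$$

$$s_i = \begin{cases} 1 & \text{if } s_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $s_i^*$ is an unobserved latent variable that determines the likelihood of survey participation for individual $i$, and $s_i^*$ depends on a vector of observed characteristics $x_i$, a vector of exclusion restrictions $z_i$ (with coefficient vectors $\gamma$ and $\delta$), and random error $u_i$. Individual $i$ participates in HIV testing ($s_i = 1$) if $s_i^*$ is above zero, and HIV status $h_i$ is observed only for participants.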
The key feature of the bivariate probit selection model is that we allow for a correlation between the unobserved error terms that affect HIV status and testing, $\operatorname{corr}(\varepsilon_i, u_i) = \rho$. If there is no correlation in the error terms, the expected HIV status of someone who is not tested depends only on observed characteristics $x_i$, and simple imputation methods based on observed characteristics will be valid. However, when HIV testing and HIV status are correlated, the information that someone is not tested changes our conditional distribution of $\varepsilon_i$ for that person and the likelihood that he or she is HIV-positive. We estimate the selection model using maximum likelihood.
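For readers who want the estimation step spelled out, the log-likelihood of a probit model with sample selection can be written as below, together with the implied probability of being HIV-positive for a person who was not tested. This is the standard textbook form of the model, offered as a sketch rather than a formula taken from the survey report or from software output. The notation is ours: $\beta$, $\gamma$, and $\delta$ are the coefficient vectors on $x_i$ and $z_i$, $s_i = 1$ indicates participation, $\Phi$ is the standard normal distribution function, and $\Phi_2(\cdot,\cdot;\rho)$ is the bivariate standard normal distribution function with correlation $\rho$.

```latex
\begin{align*}
\ln L ={}& \sum_{i:\, s_i = 1,\ h_i = 1}
           \ln \Phi_2\!\left(x_i'\beta,\; x_i'\gamma + z_i'\delta;\; \rho\right) \\
         &+ \sum_{i:\, s_i = 1,\ h_i = 0}
           \ln \Phi_2\!\left(-x_i'\beta,\; x_i'\gamma + z_i'\delta;\; -\rho\right) \\
         &+ \sum_{i:\, s_i = 0}
           \ln \Phi\!\left(-(x_i'\gamma + z_i'\delta)\right), \\[6pt]
\Pr(h_i = 1 \mid s_i = 0) ={}&
  \frac{\Phi_2\!\left(x_i'\beta,\; -(x_i'\gamma + z_i'\delta);\; -\rho\right)}
       {\Phi\!\left(-(x_i'\gamma + z_i'\delta)\right)}.
\end{align*}
```

When $\rho = 0$ the bivariate terms factor into products of univariate probabilities, and the conditional probability reduces to $\Phi(x_i'\beta)$, which is the prediction an imputation model based on observed characteristics alone would give.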
Note that in the selection equation (3) we include some observed variables $z_i$ that affect selection into testing, but that do not affect HIV status in equation (1). While this exclusion restriction is not absolutely required for identification of the model, it is necessary to prevent collinearity between selection and HIV status, as given by equations (1) and (3), and thus to increase the robustness of the regression results.33,34 It is important that the exclusion restriction is satisfied and that the selection variables in $z_i$ used to predict testing are not relevant predictors of HIV status.
We consider 2 reasons for HIV survey nonparticipation. The first is refusal to consent to an HIV test after completing the individual interview. For this group, we use the identity of the interviewer as the exclusion restriction. Some interviewers are better than others at eliciting consent to an HIV test. For example, respect for the elderly is high in Zambia and people may find it more difficult to refuse testing from older interviewers than from younger ones. In addition, interviewers' personality traits, such as agreeableness or extraversion, may affect respondents' likelihood of consenting to test.
The second reason for nonparticipation is the inability of the interviewer team to contact a person after he or she has been identified in the household interview as eligible for HIV testing. For this group, we use the identity of the interviewer of a household member of the eligible person as an exclusion restriction. It is likely that the personality of this interviewer determines how far the household member supports the interviewers' efforts to contact the eligible person. For example, the household member may or may not inform the interviewer when the eligible person is likely to be home, and may or may not recommend to the eligible person that he or she be at home at the planned time of a household revisit.
In addition, we use a dummy variable indicating whether or not the household interview took place on the first day of fieldwork within a cluster. Compared with a household interview at a later date, an interview on the first day of fieldwork leaves more opportunities for revisits and thus implies a higher probability of contacting a person during the period of fieldwork in the cluster.
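As a minimal sketch of how such a first-day indicator could be constructed, the example below assumes that the first day of fieldwork in a cluster can be taken as the earliest household interview date recorded for that cluster. The record layout, identifiers, and function name are hypothetical and are not the variable names used in the Demographic and Health Survey files.

```python
from datetime import date


def first_day_dummies(visits):
    """Flag households interviewed on the first fieldwork day of their cluster.

    visits: list of (household_id, cluster_id, interview_date) tuples.
    Returns {household_id: 1 if interviewed on the cluster's earliest
    recorded interview date, else 0}.
    """
    # Earliest interview date observed in each cluster (assumed to be the
    # first day of fieldwork there).
    first_day = {}
    for _, cluster, day in visits:
        if cluster not in first_day or day < first_day[cluster]:
            first_day[cluster] = day
    return {hh: int(day == first_day[cluster]) for hh, cluster, day in visits}


# Hypothetical records for two clusters.
visits = [
    ("hh1", "c1", date(2007, 4, 10)),
    ("hh2", "c1", date(2007, 4, 12)),
    ("hh3", "c2", date(2007, 5, 3)),
    ("hh4", "c2", date(2007, 5, 3)),
]
dummies = first_day_dummies(visits)
print(dummies)
```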
We estimate separate models for the 2 reasons for HIV survey nonparticipation—consent regressions (for nonparticipation due to refusal to consent to an HIV test) and contact regressions (for nonparticipation due to failure to contact an eligible person). We run separate selection models for men and women because it is likely that many of the independent variables in both the substantive and the selection equations vary by sex in their respective effects on survey participation or HIV status. With the exception of the selection variables, we use the same independent variables in the imputation and each of the 2 equations of the selection models. In all selection models, we use dummy variables for each interviewer who conducted at least 50 interviews. All interviewers who conducted fewer than 50 interviews were assigned the same dummy variable to avoid (quasi-) separation and ensure model convergence.
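The pooling rule for interviewers with few interviews can be sketched as follows. This is an illustration under our own naming (the function `interviewer_categories` and the label `"pooled"` are hypothetical); each resulting category would then enter the selection equation as one dummy variable.

```python
from collections import Counter


def interviewer_categories(interviewer_ids, min_interviews=50, pooled_label="pooled"):
    """Map each interview to the category used for its interviewer dummy.

    Interviewers with at least `min_interviews` interviews keep their own
    category; all others share a single pooled category, which avoids
    dummies that (quasi-)perfectly predict participation.
    """
    counts = Counter(interviewer_ids)
    return [
        iid if counts[iid] >= min_interviews else pooled_label
        for iid in interviewer_ids
    ]


# Hypothetical workload: A and B reach the threshold, C and D do not.
ids = ["A"] * 60 + ["B"] * 50 + ["C"] * 49 + ["D"] * 3
cats = interviewer_categories(ids)
print(Counter(cats))
```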
While the identity of the interviewer does not directly affect HIV status, interviewers are not assigned randomly and their identity could thus be correlated with HIV status. An interviewer usually operates in only 1 or 2 provinces of Zambia, and interviewers are matched to respondents depending on their sex and language. This matching may create a correlation between interviewer identity and HIV prevalence. We control for matching on sex by running separate regressions for men and women; we control for matching on province and interview language by adjusting for these variables in our regressions. We conduct all of our analyses in STATA, version 11 (StataCorp, College Station, TX).
We provide eTables 1A (for men) and 1B (for women) (http://links.lww.com/EDE/A441) to show the distributions of sociodemographic and behavioral characteristics for respondents who consented to HIV testing (by HIV status), for interview respondents who refused HIV testing, and for eligible household members who could not be contacted for an individual interview. There were 33 interviewers who conducted individual interviews with at least 50 men and 41 who conducted interviews with at least 50 women; in the household interviews, 46 interviewers elicited information on at least 50 men, and 53 elicited information on at least 50 women. Figure 2 shows the distribution of the male HIV test consent rates (ie, the percentage of contacted men who consented to an HIV test) by interviewer. The median consent rate was 80%.
Table 1 (for men) and eTable 2 (for women) show the results of the consent regressions; regression results not shown in these tables are reported in eTables 4A and 4B (men) and 5A and 5B (women). Table 2 (men) and eTable 3 (women) show the results of the contact regressions; regression results not shown in these tables are reported in eTables 6A and 6B (men), and 7A and 7B (women).
In the selection models, the following statistics are important for interpretation of model performance and results. First, the Wald test of joint significance of the dummy variables representing the exclusion restriction in the selection equation indicates whether these variables do indeed determine participation in HIV testing. The null hypothesis of no effect on participation is rejected in all 4 selection models (all P < 0.001), ie, our selection variables are highly significant determinants of HIV survey participation.
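The joint test described above can be written compactly. The expression below is the generic form of a Wald statistic, given as a sketch; the notation is ours, with $\delta$ collecting the coefficients on the exclusion-restriction dummies in the selection equation and $k$ denoting their number.

```latex
W = \hat{\delta}'\,\bigl[\widehat{\operatorname{Var}}(\hat{\delta})\bigr]^{-1}\hat{\delta}
\;\overset{a}{\sim}\; \chi^2_k
\quad \text{under } H_0\colon \delta = 0
```

A large value of $W$ relative to the $\chi^2_k$ distribution leads to rejection of the hypothesis that the selection variables have no effect on participation.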
Second, the parameter ρ measures the correlation between the unobserved error terms that affect HIV status and survey participation. The Wald test of independent equations assesses the null hypothesis that consent to an HIV test and the likelihood to be HIV-positive are independent of each other, ie, whether ρ is significantly different from zero. A significant negative ρ indicates that the persons who did not participate in HIV testing are more likely to be HIV-positive than participants, while a significant positive ρ indicates that they are less likely to be HIV-positive.
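The direction of the bias implied by the sign of ρ can be illustrated with a small simulation of two latent variables with correlated errors. All numerical settings below (intercepts of −1.0 and 0.7, sample size, seed) are illustrative assumptions chosen to give roughly 16% prevalence and 76% participation; they are not estimates from the survey, and the sketch omits covariates entirely.

```python
import math
import random


def simulate_prevalence(rho, n=50_000, seed=1):
    """Simulate latent HIV status and participation with corr(eps, u) = rho.

    Returns (true prevalence, prevalence among participants only).
    """
    rng = random.Random(seed)
    positives = participants = positive_participants = 0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        # Construct u with unit variance and correlation rho with eps.
        u = rho * eps + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        hiv_positive = (-1.0 + eps) > 0   # latent HIV index above zero
        participates = (0.7 + u) > 0      # latent participation index above zero
        positives += hiv_positive
        participants += participates
        positive_participants += hiv_positive and participates
    return positives / n, positive_participants / participants


for rho in (-0.75, 0.0, 0.75):
    true_prev, observed_prev = simulate_prevalence(rho)
    print(f"rho={rho:+.2f}  true={true_prev:.3f}  participants={observed_prev:.3f}")
```

With a negative ρ the prevalence among participants falls below the true prevalence, with ρ = 0 the two agree up to sampling noise, and with a positive ρ the participant-only figure overstates prevalence.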
In 1 of the 4 selection models (consent regression for men), ρ is significantly negative (−0.75 [95% CI = −0.94 to −0.18]), ie, HIV prevalence in those men who did not consent to HIV testing is expected to be higher than in those who consented. We describe a plausibility test for this finding in the eAppendix (http://links.lww.com/EDE/A441). While ρ is negative in all of the other 3 selection models, it is not significant in any of these other models. The lack of significance of ρ suggests that it is sufficient to account only for selection on observed variables in estimating HIV prevalence for women and for men who could not be contacted.
When the national HIV prevalence estimate in men is adjusted for nonparticipation on observed factors, using the imputation models, the adjusted estimate is not significantly different from the estimate for those who participated in the HIV survey (Table 3). In contrast, when the national HIV prevalence estimate in men is adjusted for nonparticipation on observed and unobserved factors, using the selection models, the adjusted estimate (21% [95% CI = 20% to 22%]) is substantially higher than either the imputation-adjusted estimate (12% [11% to 13%]) or the unadjusted estimate (12% [11% to 13%]). In women, the 3 national prevalence estimates differ only slightly (unadjusted, 16% [15% to 17%]; imputation-adjusted, 16% [15% to 17%]; selection model-adjusted, 18% [17% to 19%]) (Table 3).
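To give a sense of scale, one can back out the prevalence among nonparticipants that would be needed to reconcile an adjusted overall estimate with the prevalence observed in participants. The sketch below is a back-of-envelope calculation of our own, not a result reported in the study: it uses the rounded percentages quoted in the text, treats all nonparticipants as a single group, and ignores survey weights.

```python
def implied_nonparticipant_prevalence(adjusted_overall, prev_participants, participation_rate):
    """Solve overall = rate * participants + (1 - rate) * nonparticipants.

    All arguments are proportions between 0 and 1. Returns the prevalence
    among nonparticipants implied by the other three quantities.
    """
    return (adjusted_overall - participation_rate * prev_participants) / (
        1.0 - participation_rate
    )


# Rounded figures quoted in the text (illustrative use only).
men = implied_nonparticipant_prevalence(0.21, 0.12, 0.72)
women = implied_nonparticipant_prevalence(0.18, 0.16, 0.77)

print(f"men:   {men:.3f}")
print(f"women: {women:.3f}")
```

Because the inputs are rounded and unweighted, the outputs should be read only as rough orders of magnitude.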
Figure 3A (men) and B (women) show the 3 national HIV prevalence estimates (unadjusted, imputation-adjusted, selection model-adjusted) by 5-year age groups. In men, the selection-adjusted estimates in all age groups are substantially higher than either the unadjusted or the imputation-adjusted estimates. In women, the 3 estimates do not differ substantially from each other in any of the age groups.
Note also that in Heckman-type selection models, effect estimates of independent variables on HIV status are corrected for bias due to selection on both unobserved and observed variables. The effect of this correction can change the interpretation of predictors of HIV status. For example, in Table 1 the marginal effect of living in a town is relatively large (−0.077 [95% CI = −0.151 to −0.004]) in the selection model, while it is relatively small (−0.022 [−0.057 to 0.014]) in the imputation model. Substantial changes in effect-size estimates can also be observed for age groups, wealth categories, condom use, and past HIV-testing behavior.
We find statistically significant selection on unobserved factors in men who were interviewed in the 2007 Zambia Demographic and Health Survey but who refused an HIV test, and we show that this sample selection leads to substantial underestimation of national HIV prevalence in men. We do not find evidence for selection on unobserved factors in women, or for men who could not be contacted by the fieldworkers.
WHO and UNAIDS18 recommend using imputation to control for selective nonparticipation in population-based HIV surveys. Imputation is an approach to the estimation of HIV prevalence in survey nonparticipants that controls for selection on observed factors but ignores selection on unobserved ones. Previous studies have found that imputation-adjusted national HIV prevalence estimates based on data collected in the Demographic and Health Surveys do not differ substantially from nonadjusted estimates.11,17 Our study confirms this result for both men and women, but at the same time demonstrates that imputation-based adjustments are not always sufficient to control selection bias.
In general, the choice between selection and imputation models implies a trade-off between efficiency and robustness. If model assumptions are met, imputation models are more efficient than selection models. However, the assumptions underlying selection models are less restrictive than those underlying imputation models. If there is no correlation between selection and HIV status in a selection model, then the HIV status regression in the model will equate to an imputation regression with the same independent variables. If, however, there is such correlation, only the selection model will produce consistent results. Estimates of HIV status based on selection models are thus more robust than estimates based on models that assume that data are “missing at random.”
The bivariate probit selection models used in this study to account for selective nonparticipation on unobserved factors are only slightly more difficult to implement in routine analyses of Demographic and Health Survey data than imputation approaches. The models are well established in the social sciences23,24 and can be implemented easily using standard software packages, such as STATA.35 More importantly, the exclusion restrictions necessary to estimate these selection models are highly plausible and publicly available in the Demographic and Health Survey datasets. Our selection variables (interviewer identity and visit to a household on the first day of fieldwork within a household cluster) were highly significant determinants of HIV survey participation. Given the standard Demographic and Health Survey procedures, it seems impossible that the identity of an interviewer or the time when a household is first visited during the fieldwork period in one cluster could in any way have influenced the HIV status of eligible persons. Moreover, we guard against correlation between interviewer identity and respondents' HIV status by controlling for all the factors used in deploying interviewers (province and respondent's sex and language).
In some populations, such as the women in our study, nonparticipation may be uncorrelated with unobserved HIV status, whereas in others it may be negatively correlated with it. For example, it is plausible that persons who know that their past behavior puts them at very low risk of HIV infection are less likely to participate in HIV testing than people at higher risk,36 and it is also plausible that this knowledge is not perfectly captured by observed variables. In this case, HIV prevalence among survey nonparticipants may be lower than among participants after conditioning on observed factors.
In general, the correlation between nonparticipation and HIV status, after controlling for observed variables, will be a function of a number of factors. It will of course depend on which variables are observed; we would thus expect selective nonparticipation on unobserved factors to differ across types of surveys. Furthermore, the relationship between HIV status and testing behavior may be determined by parameters characterizing the HIV epidemic, such as HIV prevalence37 or the distribution of time since infection,38 and the availability of antiretroviral treatment.39 We would thus expect selective nonparticipation on unobserved factors to differ across populations and within 1 population over time.
Heckman-type selection models are a powerful approach to investigate and control for selection bias. However, their performance depends crucially on the presence of at least 1 exclusion restriction, ie, a variable that significantly determines survey participation but does not independently affect the outcome of interest. This requirement may limit the application of the approach. For example, while interviewer identity and the date of a fieldworker visit are routinely available in the Demographic and Health Surveys, other surveys may not have recorded information on such plausible exclusion restrictions. Even where plausible exclusion restrictions are available, the variables may be found not to be significantly associated with survey participation. For example, while it is plausible that an interviewer's personality, skills, and fieldwork experience affect the probability that a respondent consents to an HIV test, the data may not bear out an association between interviewer identity and participation rates. Thus, to use our approach, HIV surveys should routinely publish operational variables and precise descriptions of survey operations, to provide researchers with sufficient information to judge whether these variables are likely to be plausible exclusion restrictions. Researchers then need to test whether the plausible exclusion restrictions are significantly associated with survey participation before implementing Heckman-type selection models. The eAppendix (http://links.lww.com/EDE/A441) provides a detailed description of how to identify valid exclusion restrictions and a discussion of limitations of Heckman-type selection models.
Many studies have sought to explain higher observed HIV prevalence among women than men in sub-Saharan Africa.40 Our results for Zambia suggest that at least in some cases this difference may be an artifact of selective nonparticipation in HIV surveys and that actual prevalence rates may be higher for men than for women. Furthermore, our results demonstrate that the predictors of HIV status in multivariable regression can change substantially when selection on both observed and unobserved factors is taken into account. Thus, studies estimating HIV prevalence or relationships between independent variables and HIV status that use survey data with some substantial degree of missingness (such as the 30 HIV surveys conducted as part of the Demographic and Health Surveys up to March 2010)41 should routinely employ Heckman-type selection models to control for selection on both observed and unobserved factors.
1. UNAIDS. 2008 report on the global AIDS epidemic. Geneva: UNAIDS; 2008.
2. Boerma JT, Ghys PD, Walker N. Estimates of HIV-1 prevalence from national population-based surveys as a new gold standard. Lancet.
3. Mishra V, Assche SB, Greener R, et al. HIV infection does not disproportionately affect the poorer in sub-Saharan Africa. AIDS.
4. Michelo C, Sandoy IF, Fylkesnes K. Marked HIV prevalence declines in higher educated young people: evidence from population-based surveys (1995–2003) in Zambia. AIDS.
5. Tanser F, Bärnighausen T, Cooke GS, Newell ML. Localized spatial clustering of HIV infections in a widely disseminated rural South African epidemic. Int J Epidemiol.
6. WHO/UNAIDS/UNICEF. Towards universal access: scaling up priority HIV/AIDS interventions in the health sector. Geneva: WHO; 2009.
7. Ghys PD, Brown T, Grassly NC, et al. The UNAIDS Estimation and Projection Package: a software package to estimate and project national HIV epidemics. Sex Transm Infect.
9. Garcia-Calleja JM, Gouws E, Ghys PD. National population based HIV prevalence surveys in sub-Saharan Africa: results and implications for HIV and AIDS estimates. Sex Transm Infect.
10. Gouws E, Mishra V, Fowler TB. Comparison of adult HIV prevalence from national population-based surveys and antenatal clinic surveillance in countries with generalised epidemics: implications for calibrating surveillance data. Sex Transm Infect.
11. Mishra V, Vaessen M, Boerma JT, et al. HIV testing in national population-based surveys: experience from the Demographic and Health Surveys. Bull World Health Organ.
12. Central Statistical Office (CSO), Ministry of Health (MOH), Tropical Diseases Research Centre (TDRC), University of Zambia (UNZA), Macro International Inc. Zambia Demographic and Health Survey 2007. Calverton, MD: CSO, Macro International Inc.; 2009.
13. Rubin DB. Inference and missing data. Biometrika.
14. Brick JM, Kalton G. Handling missing data in survey research. Stat Methods Med Res.
15. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, Inc.; 1987.
16. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B.
17. Mishra V, Barrere B, Hong R, Khan S. Evaluation of bias in HIV seroprevalence estimates from national household surveys. Sex Transm Infect.
18. WHO/UNAIDS. Guidelines for measuring national HIV prevalence in population-based surveys. Geneva: WHO/UNAIDS; 2005.
19. Perez F, Zvandaziva C, Engelsmann B, Dabis F. Acceptability of routine HIV testing (“opt-out”) in antenatal services in two rural districts of Zimbabwe. J Acquir Immune Defic Syndr.
20. Reniers G, Eaton J. Refusal bias in HIV prevalence estimates from nationally representative seroprevalence surveys. AIDS.
21. Heckman J. Sample selection bias as a specification error. Econometrica.
22. Dubin JA, Rivers D. Selection bias in linear regression, logit and probit models. Sociol Methods Res.
23. Vella F. Estimating models with sample selection bias: a survey. J Hum Res.
24. Winship C. Models for sample selection bias. Annu Rev Sociol.
25. Bignami-Van Assche S, Salomon JA, Murray CJ. Evidence from national population-based estimates of bias in HIV prevalence. Paper presented at: Population Association of America Meeting; March 31–April 2, 2005; Philadelphia, PA.
26. Lachaud JP. HIV prevalence and poverty in Africa: micro- and macro-econometric evidences applied to Burkina Faso. J Health Econ.
27. Bärnighausen T, Hosegood V, Timaeus IM, Newell ML. The socioeconomic determinants of HIV incidence: evidence from a longitudinal, population-based study in rural South Africa. AIDS.
28. Gabrysch S, Edwards T, Glynn JR. The role of context: neighbourhood characteristics strongly influence HIV risk in young women in Ndola, Zambia. Trop Med Int Health.
29. Janssens W, van der Gaag J, de Wit TR. Refusal bias in the estimation of HIV prevalence. Amsterdam: Amsterdam Institute for International Development; 2009.
30. Reniers G, Araya T, Berhane Y, Davey G, Sanders EJ. Implications of the HIV testing protocol for refusal bias in seroprevalence surveys. BMC Public Health.
32. Dubin JA, Rivers D. Selection bias in linear regression, logit and probit models. In: Fox J, Long JS, eds. Modern Methods of Data Analysis. Newbury Park, CA: Sage Publications; 1990:410–443.
33. Puhani PA. The Heckman correction for sample selection and its critique. J Econ Surv.
34. Little R. A note about models for selectivity bias. Econometrica.
36. Boozer MA, Philipson TJ. The impact of public testing for human immunodeficiency virus. J Hum Res.
37. Auld MC. Choices, beliefs, and infectious disease dynamics. J Health Econ.
38. Bärnighausen T, Tanser F, Hallett T, Newell ML. Short communication: prioritizing communities for HIV prevention in sub-Saharan Africa. AIDS Res Hum Retroviruses.
39. Warwick Z. The influence of antiretroviral therapy on the uptake of HIV testing in Tutume, Botswana. Int J STD AIDS.
40. UNAIDS/WHO. AIDS epidemic update 2009. Geneva: UNAIDS; 2009.
© 2011 Lippincott Williams & Wilkins, Inc.