Epidemiology and Social

Systematic identification of correlates of HIV infection

an X-wide association study

Patel, Chirag J.a; Bhattacharya, Jayb,c; Ioannidis, John P.A.d; Bendavid, Eranb,e

doi: 10.1097/QAD.0000000000001767



Approaching the public health goals for global HIV-1 such as ‘90–90–90’ (90% of those with HIV-1 aware of their status, 90% in regular treatment, and 90% of those on treatment virally suppressed) requires effective identification of at-risk and infected individuals [1,2]. Despite large efforts to expand testing, treatment, and retention services, only 45% of HIV-infected individuals in Sub-Saharan Africa knew their HIV-1 status in 2013, and estimated antiretroviral therapy (ART) coverage in 2016 exceeded 60% of all those infected in only five countries [3–6]. A potential approach to improving HIV-1 testing and diagnosis is to better target individuals and populations for testing and care. Existing HIV-1 control programs, such as the US President's Emergency Plan for AIDS Relief, increasingly use data-driven approaches to align resources towards high-burden populations [7,8].

Current understanding of HIV-1 risk factors in Sub-Saharan Africa commonly comes from nationally representative surveys such as the Demographic and Health Surveys (DHS) that include HIV-1 testing and report prevalence stratified by prespecified groups such as age, education, place of residence, and number of sexual partners [9–11]. Such HIV-1 testing and epidemiologic stratification was carried out in two nationally representative DHS surveys conducted among samples of nearly 6000 (in 2007) and 15 000 (in 2013–2014) Zambian women and men [12,13]. However, selective identification of risk factors by testing one or only a few factors at a time may lead to incomplete understanding of or even misleading notions about possible risk factors [14,15]. Traditional risk factors (age, sex, education, place of residence, and number of sexual partners) explain less than 10% of the variation in HIV-1 infection, and represent only a small fraction of the information available in the surveys [12,13]. Although risk factors such as age and sex are intuitive and important, unintuitive or under-recognized correlates may identify novel high-risk groups and generate new hypotheses for further study and intervention design.

We present an approach for systematically assessing the relationship between HIV-1 status and many putative risk factors. We exploit the breadth of DHS surveys to conduct an X-wide association study (XWAS) of HIV-1 risk, where X stands for all social, behavioral, environmental, and economic factors that are reliably available in DHS. This approach systematically associates each available variable with HIV-1 status, as is done in current-day genomics investigations [e.g. genome-wide association studies (GWASs)]. We have previously used the approach to systematically study the association of environmental exposures, dietary indicators, clinical biomarkers, and micronutrient blood tests with outcomes such as type II diabetes, blood pressure, mortality, and income [16–19]. An advantage of this approach is that every variable is examined in the same systematic way, which avoids selective reporting bias while controlling the rate of false positives [20,21].



We used the 2007 and 2013–2014 DHS surveys from Zambia, where HIV-1 prevalence among women 15–49 years old was estimated at 16.1 and 15.1%, respectively [13]. (We analyzed both men and women, separately, and focus on women with additional results for men in the Appendix.) We linked HIV-1 status with all the indicators in the individual women's surveys. We split the data in each survey into training (‘discovery’) and replication data, analyzed the association of each variable with HIV-1 status in univariate and adjusted analyses, and examined the stability of the findings over time and in population subgroups. Figure 1 illustrates the analysis steps.

Fig. 1:
Schematic overview of XWAS process. (a) The Demographic and Health Study consists of 15 433 women (15% HIV-1 prevalence) in 2013–2014 and 5715 women (16.1% HIV-1 prevalence) in 2007. (b) We selected variables that had at least 90% response rate and were nonredundant, resulting in 688 total variables in both the 2007 and 2013–2014 surveys. (c) We split the data into two random subsets for discovery and replication (N = 7716 and 7717, respectively, for 2013–2014 and 2858 and 2857 for 2007). (d) We ran three model configurations: a univariate model (red); a multivariate model with adjustment variables selected a priori (yellow), including age, urban residence, wealth index, and ever had sex prior to survey; and a multivariate model consisting of variables identified in the univariate analysis (blue). Xi denotes the ith variable out of 688 in 2013–2014 and 727 variables in 2007, respectively (688 overlapping). (e) We attempted to replicate results within surveys from models in the independent replication dataset. (f) We identified variables replicated between the 2007 and 2013–2014 surveys. (g) We executed subgroup analysis for each variable identified in the univariate regression.

HIV-1 status

The HIV-1 testing procedure in the surveys involved identifying eligible household members, obtaining consent, collecting dried blood spots, and testing in a centralized lab. Two ELISA tests were used for screening and confirmation of HIV-positive tests, with western blot confirmation for discordant results. In both surveys, every test was definitively identified as positive or negative (i.e. there were no indeterminate tests). We then linked the HIV-1 test results with the individual survey data.

Selection of social, behavioral, environmental, and economic indicators

We used the following process to identify and create the variables for the XWAS (Table 1 includes the variable selection process metadata). Starting with the raw data after removal of placeholder variables (e.g. birthdates of children 6–20 for mothers with 5 children), we recast all variables with 30 or fewer levels as binary variables. Variables with more than 30 levels were treated as continuous. This decision rule aimed to preserve meaningfully continuous variables and discretize nonordinal variables. Then, we kept only those variables with at least 90% complete data to avoid what some consider unacceptable levels of missingness [22]. This led to dropping over 40% of the variables in each survey. We removed variables with no variation (e.g. an indicator variable for completion of the survey), and kept the first occurrence in any pair of collinear variables with correlation coefficient at least 0.99. The entire set of variables is in Supplementary Table 1.
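The variable preparation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `pearson` and `prepare_variables`, the dictionary-of-lists data layout, the use of `None` for missing values, and the use of the absolute correlation for the collinearity check are all hypothetical choices made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def prepare_variables(columns, max_levels=30, min_complete=0.90, max_corr=0.99):
    """columns: dict mapping variable name -> list of values (None = missing)."""
    # Step 1: recast variables with few levels as one binary indicator per level;
    # variables with more levels are kept as continuous.
    recast = {}
    for name, values in columns.items():
        levels = sorted({v for v in values if v is not None}, key=str)
        if len(levels) <= max_levels:
            for level in levels:
                recast[f"{name}={level}"] = [
                    None if v is None else int(v == level) for v in values
                ]
        else:
            recast[name] = [None if v is None else float(v) for v in values]

    kept = {}
    for name, values in recast.items():
        observed = [v for v in values if v is not None]
        # Step 2: require at least 90% complete data.
        if len(observed) / len(values) < min_complete:
            continue
        # Step 3: drop variables with no variation.
        if len(set(observed)) < 2:
            continue
        # Step 4: keep only the first member of any collinear pair.
        # Assumption: collinearity is judged on the absolute correlation.
        collinear = False
        for other in kept.values():
            pairs = [(a, b) for a, b in zip(values, other)
                     if a is not None and b is not None]
            xs = [a for a, _ in pairs]
            ys = [b for _, b in pairs]
            if pairs and abs(pearson(xs, ys)) >= max_corr:
                collinear = True
                break
        if not collinear:
            kept[name] = values
    return kept
```

Because the two indicators created from a two-level variable are perfectly negatively correlated, only the first is retained under the absolute-correlation assumption used here.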

Table 1:
Variable preparation process.

Association study procedures

We randomly divided each survey (5715 women in 2007 and 15 433 women in 2013–2014) into two equally sized (±1) datasets for discovery and replication. We conducted three XWAS analyses in the discovery dataset (Fig. 1d): a univariate analysis; an analysis adjusted for known HIV-1 correlates (ex ante analysis); and an analysis that, in addition to the ex ante factors, adjusted for the 10 variables that explained the greatest portion of the variation in the univariate analysis (ex post analysis).
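A random split into two halves that differ in size by at most one can be sketched as follows. The function name `split_discovery_replication` and the fixed seed are hypothetical; the source does not state how the random assignment was generated.

```python
import random


def split_discovery_replication(ids, seed=0):
    """Shuffle respondent identifiers and cut them into two near-equal halves."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# With 15 433 respondents the halves contain 7716 and 7717 identifiers.
discovery, replication = split_discovery_replication(range(15433))
```

Splitting on respondent identifiers rather than on rows keeps the discovery and replication sets disjoint by construction.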

In the first step we estimated univariate logistic regression models of the following form:

$$\operatorname{logit}\left[\Pr(\mathrm{HIV}_p = 1)\right] = \alpha_i + \beta_i X_{i,p}$$

where HIVp represents the HIV-1 status of person ‘p’, and Xi,p denotes the ith variable for person ‘p.’ This procedure is repeated for each of the variables in the 2007 and 2013–2014 surveys. The exponentiation of βi corresponds to the odds ratio for HIV-1 per unit change for each variable Xi. To control for multiple hypothesis testing, we calculated the Benjamini–Hochberg (BH) false discovery rate (FDR), the estimated proportion of discoveries made that were false [23]. The Benjamini–Hochberg method assumes independence between statistical tests and therefore counts correlated variables as independent for determining the discovery threshold [median absolute correlation between all pairs of replicated variables 0.06 (interquartile range (IQR) 0.02–0.12) in 2007 and 0.04 (IQR 0.02–0.09) in 2013–2014]. Throughout, we used the HIV-1 sampling weights and Huber–White robust standard errors [24].
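To make the multiple-testing step concrete, the sketch below computes Benjamini–Hochberg adjusted P values for a list of per-variable P values and converts a logistic coefficient into an odds ratio. The function names `benjamini_hochberg` and `odds_ratio` are hypothetical, and the example P values are invented purely for illustration; they are not results from the surveys.

```python
from math import exp


def odds_ratio(beta):
    """Odds ratio per unit change for a logistic regression coefficient."""
    return exp(beta)


def benjamini_hochberg(p_values):
    """Return BH-adjusted P values (FDR estimates) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative P values for four hypothetical variables.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
discoveries = [i for i, q in enumerate(fdr) if q < 0.05]
```

A variable is then carried forward as a discovery when its adjusted value falls below the chosen FDR threshold.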

In the ex ante analysis, we adjusted for five predetermined (ex ante) controls: urban or rural residence, DHS wealth index (a five-level scale from poorest to wealthiest, with poorest as reference), age, whether the respondent indicated she had previously been tested for HIV, and whether she indicated that she had never had sex at the time of the interview [25]. Specifically, the model was implemented as follows:

$$\operatorname{logit}\left[\Pr(\mathrm{HIV}_p = 1)\right] = \alpha_i + \beta_i X_{i,p} + \gamma_1\,\mathrm{urban}_p + \gamma_2\,\mathrm{wealth}_p + \gamma_3\,\mathrm{age}_p + \gamma_4\,\mathrm{tested}_p + \gamma_5\,\mathrm{sexualdebut}_p$$

where covariates for urban residence, wealth, age, past testing, and sexual debut are indexed by person (p), and Xi,p again denotes the ith exposure variable for person ‘p.’

In the ex post analysis, we adjusted for the 10 variables that had the highest explanatory power (using Nagelkerke R2) among those replicated in the univariate analysis. The purpose of this analysis was to improve the identification of strong correlates that identify HIV-1 status even after controlling for the most explanatory variables [26]. The 10 variables were selected separately in the 2007 and 2013–14 surveys.

We had two levels of replication, within-survey replication and between-survey replication (Fig. 1e and f). We deemed a within-survey ‘replicated finding’ for βi as one that had FDR less than 5% in the discovery dataset and the sign for βi was in the same direction in the replication dataset with a nominal P value less than 0.05 (Fig. 1e). The second level of replication is between-survey replication where we sought within-survey replication in both the 2007 and 2013–2014 surveys (Fig. 1f).
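The two replication rules can be written as a pair of small predicates. This is a sketch of the decision rule as stated in the text; the function and argument names are hypothetical.

```python
def within_survey_replicated(fdr_discovery, beta_discovery,
                             beta_replication, p_replication,
                             fdr_threshold=0.05, p_threshold=0.05):
    """FDR < 5% in discovery, same sign in replication, nominal P < 0.05 there."""
    same_sign = beta_discovery * beta_replication > 0
    return (fdr_discovery < fdr_threshold
            and same_sign
            and p_replication < p_threshold)


def between_survey_replicated(replicated_2007, replicated_2013_2014):
    """Variables that achieved within-survey replication in both surveys."""
    return sorted(set(replicated_2007) & set(replicated_2013_2014))
```

Requiring a consistent sign as well as nominal significance guards against a variable "replicating" with an effect in the opposite direction.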

Next, we assessed how well the variables found in each of the three modeling scenarios predicted HIV positivity. For example, if variables Xa, Xb, and Xc were tentatively replicated in the univariate modeling scenario in the 2007 survey, we fit a model:

$$\operatorname{logit}\left[\Pr(\mathrm{HIV}_p = 1)\right] = \alpha + \beta_a X_{a,p} + \beta_b X_{b,p} + \beta_c X_{c,p}$$

and assessed the Nagelkerke R2 and the area under the curve for the model.
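Both summary measures can be computed directly from observed outcomes and fitted probabilities. The sketch below is unweighted, which is a simplifying assumption (the analyses in the text use sampling weights), and the function names are hypothetical. Nagelkerke R2 rescales the Cox–Snell R2 by its maximum attainable value; the area under the curve is computed as the proportion of positive–negative pairs in which the positive case receives the higher fitted probability, with ties counted as one half.

```python
from math import exp, log


def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y (0/1) under fitted probabilities p."""
    return sum(log(pi) if yi == 1 else log(1.0 - pi) for yi, pi in zip(y, p))


def nagelkerke_r2(y, p):
    """Nagelkerke R2 relative to an intercept-only (prevalence) model."""
    n = len(y)
    prevalence = sum(y) / n
    ll_null = log_likelihood(y, [prevalence] * n)
    ll_model = log_likelihood(y, p)
    cox_snell = 1.0 - exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


def auc(y, p):
    """Area under the ROC curve via pairwise comparison (quadratic time)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

For survey-sized data a rank-based computation of the area under the curve would be preferable to the pairwise loop, which is used here only for clarity.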

We then assessed pairwise correlations among all of the replicated variables to assess the clustering of HIV-1 risk factors and variables. That is, we wanted to identify the clusters of variables that potentially measure a latent HIV-1 risk factor (e.g. if a host of household possession variables are all related to HIV-1 and are correlated among themselves, that may indicate that wealth, a likely latent variable they measure, is a risk factor for HIV). We visualized pairwise correlations in a heatmap [27,28].

Finally, we tested the association of all replicated univariate findings in nine subgroups (Fig. 1g): (1–3) three age bins (15 to <23; 23 to <33; and 33–49); (4 and 5) two wealth groups [wealth quintiles 1–3 (poorer) and wealth quintiles 4 and 5 (wealthier)]; (6 and 7) two residence groups (urban and rural); and (8 and 9) two groups based on whether or not the respondent indicated that they had ever received an HIV-1 test.
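The nine subgroups amount to four stratifications of each respondent, so every woman belongs to exactly four of them. A minimal sketch of that assignment follows; the function name `subgroups` and its argument names are hypothetical, while the labels follow those used in the Fig. 3 legend.

```python
def subgroups(age, wealth_quintile, urban, ever_tested):
    """Return the four subgroup labels (out of nine) that apply to one respondent."""
    if age < 23:
        age_group = "age 1"   # 15 to <23
    elif age < 33:
        age_group = "age 2"   # 23 to <33
    else:
        age_group = "age 3"   # 33-49
    wealth_group = "poor" if wealth_quintile <= 3 else "rich"
    residence = "urban" if urban else "rural"
    testing = "tested" if ever_tested else "not tested"
    return [age_group, wealth_group, residence, testing]
```

Each replicated variable would then be re-tested within the respondents carrying each label.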

To promote reproducibility of this work, the analytic code is available in a GitHub repository, and the figures can be accessed online; all analyses were performed using Stata 14 (Statacorp, College Station, Texas, USA) and R v3.2.2.


Our surveys included information on 5715 Zambian women with HIV-1 test results in 2007 and 15 433 in 2013–2014. In the univariate analysis, 102 (out of 727, 14%) variables were replicated and associated with HIV-1 in 2007, and 182 (out of 688, 26%) in 2013–2014. Figure 2 shows a plot of P values versus odds ratios of the association with HIV-1 of the variables tested in 2007 and 2013–2014. A total of 79 variables were associated with HIV-1 status in the univariate analysis, 30 variables in the ex ante analysis, and 8 variables in the ex post analysis in both 2007 and 2013–2014. Table 2 shows all the variables that were replicated in both surveys in at least one analysis. All replicated variables (in any analysis) appear in Supplementary Tables SA1.2–SA1.4. Supplementary Table SA1.5 shows the associations between the control variables and HIV in the ex post models.

Fig. 2:
Volcano plot from univariate analysis depicting odds ratio versus −log10(P value) of association. Teal points denote replicated findings in each dataset (2007 and 2013–2014).
Table 2:
Univariate and adjusted associations with HIV-1 status.

Several variables stand out for their association with HIV-1 and for raising potentially useful targets for future investigation. Three variables were associated with HIV-1 in all three analyses and both surveys: having exactly one birth in the past 5 years (increased risk), currently breastfeeding (decreased risk), and desiring to delay having children for more than 2 years (decreased risk). Eleven additional variables were associated with HIV-1 positivity in all but one of the ex post analyses (including several variables that were used as ex post controls): being formerly and not currently married, including divorce and widowhood (three variables, all with increased risk), variables related to being the head of the household (three variables, all with increased risk), the number of children (three variables, with fewer children associated with increased risk), currently using a condom for contraception (increased risk), and an indicator for ownership of a bicycle in the household (decreased risk).

Figure 3 shows the subgroup associations for the variables that were replicated in univariate analyses in the overall sample in both surveys and in at least 17 out of the 18 subgroups examined (nine subgroups, as noted above, in each of the two surveys). A total of eight variables met these criteria, including several that were associated with HIV-1 in all full-sample analyses (denoted with an asterisk, *): the indicators for widowhood and being formerly in a union*, being the head of the household*, having exactly two people in the household (relative to all other household sizes), age, reporting a genital ulcer in the past 12 months, owning a bicycle*, and currently breastfeeding*.

Fig. 3:
Strength of association for univariate overall and population subgroups. Strata include wealth index less than or equal to 3 and greater than 3 (‘poor’ and ‘rich,’ respectively), individuals who have had or have not had an HIV-1 test (‘tested’ and ‘not tested’), individuals living in rural (‘rural’) or urban (‘urban’) areas, and ages less than 23 (‘age 1’), between 23 and 33 (‘age 2’), and older than 33 (‘age 3’).

Two general categories of variables were associated with HIV-1 in the ex ante and ex post analyses but not in the univariate analyses (i.e. variables whose association with HIV-1 was ‘uncovered’ after adjustment, shown at the bottom of Table 2). These include variables related to the number of children that were different from those replicated in all or nearly all analyses (again, fewer children were associated with increased risk), and the anthropometric measurements BMI and Rohrer's index (higher index associated with lower HIV-1 risk).

Figure 4 shows the extent to which the variables that were replicated in the ex ante analysis are correlated and clustered among themselves. We observed that the correlation patterns in the 2007 and 2013–2014 surveys were strikingly similar. We hypothesized that this may be partly a reflection of the variable construction process (e.g. being married would be expected to be strongly negatively correlated with being divorced), and partly of the likely stable social and environmental patterns in Zambia over this time period (Supplementary Figures SA1 and SA2).

Fig. 4:
Correlation matrix of variables replicated in the ex ante analysis. Top panel shows results for the 2007 survey and the bottom panel for the 2013–2014 survey. Variable clusters representing similar constructs appear in both surveys, such as variables that characterize marital status.

In the three analyses and the two surveys, the variance explained in HIV-1 status ranged from 0.21 (in the ex post analysis of the 2007 survey) to 0.32 (in the univariate analysis of the 2013–2014 survey). The area under the curve in the six analyses ranged from 0.76 to 0.82 (specificity 0.75 at sensitivity 0.75; Supplementary Figure SA3).


We describe the findings from the first XWAS of HIV-1 risk in nationally representative sero-surveys. Out of all the variables tested (727 in 2007 and 688 in 2013–2014, of which 688 were overlapping), we identify several candidate variables that are consistently associated with HIV-1 across multiple analyses and may present opportunities for identifying previously under-recognized risks. These include positive associations with widowhood/divorce/being formerly in union, being the head of the household, having a small household size, and reporting a genital ulcer in the past 12 months; and negative associations with breastfeeding and bicycle ownership. The reasons for these consistent associations may have different implications. A causal relationship may have implications for targeting and design of prevention interventions. A noncausal relationship (one that is observed because of confounding or reverse causation) may still have benefits for testing programs that are interested in increasing testing among high-prevalence groups. The nature of the associations, therefore, deserves further discussion.

The positive association of widowhood with HIV-1 may reflect widows’ engagement in high-risk behaviors for basic income and sustenance; it may also be partly caused by HIV-1 positivity among the widows’ now-deceased husbands. Our study cannot tease apart the dominant causal pathway, and both may contribute to the association. The similar effect among divorced women is more consistent with risky behaviors following the loss of a spouse. Recent evidence also supports a causal role: a nationally representative survey of HIV-1 incidence in Rwanda from 2013 to 2014 found an elevated rate of new infections in widows [29]. If widowhood and divorce lead to increased HIV-1 risk, then targeting of prevention interventions such as preexposure prophylaxis may mitigate the associated risk. If HIV-1 prevalence is higher among these women because of preexisting risk, then this finding may still help in guiding HIV-1 identification for early treatment and care that may reduce their risk of infecting others.

The relationship between current breastfeeding and HIV-1 risk is also notable. Breastfeeding is not known to decrease the risk of HIV-1 acquisition [30,31]. This association may instead indicate a decreased propensity to breastfeed among HIV-positive women. Although public health guidelines for breastfeeding among HIV-infected women have shifted over the past decade, breastfeeding has been recommended by the World Health Organization since 2010 [32,33]. As we find decreased risk of HIV-1 among women who breastfeed in both 2007 and 2013–2014 (in all three analyses and 17 subgroups), this finding may indicate the challenges of changing breastfeeding behaviors and the importance of finding effective approaches to behavior change in this domain.

The variables that we highlight were replicated in multiple analyses, but this study also identifies factors whose less consistent association with HIV-1 may nevertheless warrant additional consideration. Several variables related to method of contraception were positively associated with HIV-1 status in the univariate analyses, including hormonal contraceptives, condoms, and female condoms. Wealth was associated with HIV-1 in the univariate analysis (higher risk among wealthier women), but not in the adjusted models. No variable identifying educational attainment was associated with HIV-1 in the adjusted models. These assessments improve on the extant assessments of epidemiologic risk that are commonly presented along with the DHS data (and commonly used by the Joint United Nations Program on HIV and AIDS and others) [4,13]. The DHS stratifies HIV-1 risk by age, residence, marital status, education, and wealth. XWAS improves on such stratifications by reducing potential bias from failure to consider other relevant covariates, and by using an FDR that accounts for multiple comparisons.

The extent to which our findings are generalizable to other contexts is unknown. Extending HIV-1 XWAS to additional surveys across sub-Saharan Africa and over time, however, is readily feasible and will enable greater understanding of the generalizability and stability over time of our findings. We note that the putative variables we identified in common in the 2007 and 2013–2014 surveys had similar association sizes in both surveys. These similar association sizes and correlations point to the stability of social, behavioral, and environmental factors over time in Zambia.

The limitations of this study deserve explicit mention. First, we only tested variables with at least 90% complete data. Although we retained approximately 700 variables for analysis, some important variables could have been excluded because of missingness. Second, the error rate among self-reported variables may also bias results. Errors are more likely for some variables than for others. Any nondifferential bias (e.g. inaccurate reporting that occurs equally among HIV-1-positive and HIV-1-negative individuals) will lead to loss of power and correlations that are closer to null; however, we emphasize that sample sizes in our investigation are large. Third, self-reported variables may exhibit differential bias if participants answer differentially based on HIV-1 status. Differential bias may distort effects in unpredictable ways. Fourth, we could not assess association with incident or recent infection to mitigate chances of reverse causality. Although some DHS surveys also measure CD4+ cell counts (that may proxy for duration of infection), the Zambia surveys did not, and we did not control for duration of infection (except through some indirect controlling by adjusting for age). It is plausible, for example, that a decrease in BMI is a consequence of HIV-1 rather than a cause. Challenges to causal identification are a generic issue in large-scale cross-sectional association studies, but such analyses nevertheless remain an important method to identify potential risk factors [34].

In conclusion, we report the findings from the first XWAS of HIV-1 risk from nationally representative surveys of social, economic, environmental, and behavioral factors in Zambia. We identify strong and consistent associations with widowhood, breastfeeding, and several other self-reported indicators that may be amenable to further investigations and interventions and that may be used to guide screening policies.


Funding: This work was supported in part by grants R01-AI127250 from the National Institute of Allergy and Infectious Diseases, R01-DA15612 from the National Institute on Drug Abuse, R00 ES023054 and R21 ES025052 from the National Institute of Environmental Health Sciences, and U54 HG007963 from NIH Common Fund. The sponsors had no role in the design, interpretation or conclusions of this study.

Author contributions: E.B. and C.J.P. conceived the work and carried out the analyses. J.B. and J.P.A.I. critically assessed the methods and findings, and contributed to the study conceptualization and preparation of the manuscript.

Conflicts of interest

There are no conflicts of interest.


1. 90–90–90 - An ambitious treatment target to help end the AIDS epidemic. Available at: [Accessed 10 December 2015]
2. Deeks SG, Lewin SR, Havlir DV. The end of AIDS: HIV infection as a chronic disease. Lancet 2013; 382:1525–1533.
3. Staveteig S, Bradley S, Nybro E, Wang S. Demographic patterns of HIV testing uptake in sub-Saharan Africa. DHS Comparative Reports No. 30. ICF International. 2013.
4. UNAIDS AIDSinfo: Epidemiological status. Available at: [Accessed 31 April 2016]
5. UNAIDS 2014 Gap Report. Available at: [Accessed 31 August 2015]
6. Demographic and Health Surveys. ICF International. Available at: [Accessed 3 August 2015]
7. Opening Statement From Ambassador Deborah L. Birx, M.D., at the UNAIDS 37th Programme Coordinating Board Meeting. Available at: [Accessed 10 August 2016]
8. PEPFAR's Dr Deborah Birx urges sharper focus to halt HIV globally. Available at: [Accessed 20 August 2016]
9. WHO HIV/AIDS strategic information: surveillance. Available at: [Accessed 31 August 2015]
10. De Cock KM, Rutherford GW, Akhwale W. Kenya AIDS Indicator Survey 2012. J Acquir Immune Defic Syndr 2014; 66 (suppl 1):S1–S2.
11. Anderson S-J, Cherutich P, Kilonzo N, Cremin I, Fecht D, Kimanga D, et al. Maximising the effect of combination HIV prevention through prioritisation of the people and places in greatest need: a modelling study. Lancet 2014; 384:249–256.
12. Zambia DHS, 2007 - final report. Available at: [Accessed 13 June 2016]
13. Zambia DHS, 2013–14 - Final Report. Available at: [Accessed 31 August 2015]
14. Patel CJ, Ioannidis JP. Studying the elusive environment in large scale. J Am Med Assoc 2014; 311:2173–2174.
15. Ioannidis J. Why most published research findings are false. PLoS Med 2005; 2:e124.
16. Tzoulaki I, Patel CJ, Okamura T, Chan Q, Brown IJ, Miura K, et al. A nutrient-wide association study on blood pressure. Circulation 2012; 126:2456–2464.
17. Patel CJ, Rehkopf DH, Leppert JT, Bortz WM, Cullen MR, Chertow GM, Ioannidis JP. Systematic evaluation of environmental and behavioural factors associated with all-cause mortality in the United States National Health and Nutrition Examination Survey. Int J Epidemiol 2013; 42:1795–1810.
18. Patel CJ, Cullen MR, Ioannidis JP, Butte AJ. Systematic evaluation of environmental factors: persistent pollutants and nutrients correlated with serum lipid levels. Int J Epidemiol 2012; 41:828–843.
19. Patel CJ, Bhattacharya J, Butte AJ. An environment-wide association study (EWAS) on type 2 diabetes mellitus. PloS One 2010; 5:e10746.
20. Ioannidis JP, Tarone R, McLaughlin JK. The false-positive to false-negative ratio in epidemiologic studies. Epidemiology 2011; 22:450–456.
21. Patel CJ, Ioannidis JP. Placing epidemiological results in the context of multiplicity and typical correlations of exposures. J Epidemiol Community Health 2014; 68:1096–1100.
22. Dong Y, Peng C-YJ. Principled missing data methods for researchers. Springerplus 2013; 2:222.
23. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 1995; 57:289–300.
24. White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 1980; 48:817–838.
25. Rutstein SO, Johnson K. The DHS wealth index. DHS Comparative Reports no. 6. Calverton: ORC Macro; 2004.
26. Yang J, Ferreira T, Morris AP, Medland SE. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet 2012; 44:369–375.
27. Patel CJ, Cullen MR, Ioannidis JPA, Rehkopf DH. Systematic assessment of the correlation of household income with infectious, biochemical, physiological factors in the United States. Am J Epidemiol 2014; 181:171–179.
28. Johnson RA, Wichern DW. Applied multivariate statistical analysis. Englewood Cliffs, NJ: Prentice Hall; 1992.
29. Remera E, Kanters S, Mulidabigwi A, et al. 2013-14 Rwanda HIV incidence household survey: understanding HIV epidemic in Rwanda. CROI, 2016, Boston, USA.
30. Serwadda D, Wawer MJ, Musgrave SD, Sewankambo NK, Kaplan JE, Gray RH. HIV risk factors in three geographic strata of rural Rakai District, Uganda. AIDS 1992; 6:983–990.
31. Cain D, Simbayi L, Kalichman S, Cherry C, Jooste S, Mfecane S. Risk factors for HIV-AIDS among youth in Cape Town, South Africa. 2015.
32. World Health Organization. HIV and infant feeding: update. 2006. Available at: [Accessed 17 November 2016]
33. World Health Organization. Guidelines on HIV and infant feeding: principles and recommendations for infant feeding in the context of HIV and a summary of evidence. 2016. Available at: [Accessed 17 November 2016]
34. Ioannidis J. Exposure wide epidemiology: revisiting Bradford Hill. Statistics in medicine 2015; 35:1749–1762.

Keywords: demographic and health surveys; environment-wide association study; epidemiology; HIV-1; sub-Saharan Africa; X-wide association study


Copyright © 2018 Wolters Kluwer Health, Inc.