Epidemiology and Prevention

Estimating HIV-1 Transmission Routes for Patients With Unknown Risk Histories by Viral Sequence Phylogenetic Analyses

Wei, Min MD, PhD*,†; Xing, Hui MS*; Feng, Yi MS*; Hsi, Jenny H. BS*; Liu, Pengtao MS*; Shao, Yiming MD, PhD*,†,‡

JAIDS Journal of Acquired Immune Deficiency Syndromes: October 1, 2015 - Volume 70 - Issue 2 - p 195-203
doi: 10.1097/QAI.0000000000000735



Successful prevention and control of the HIV/AIDS epidemic depends on knowledge of how the virus is transmitted in different populations. HIV is transmitted through one or more high-risk behaviors, such as unsanitary commercial blood donation, needle sharing among injecting drug users (IDUs), unprotected heterosexual or homosexual contact, and mother-to-child transmission.1 The course of the epidemic differs from country to country. Unfortunately, China has all of the above-mentioned transmission routes. In 1985, China's first HIV case, in a patient with hemophilia, was reported in Zhejiang province.2 Thereafter, only a few isolated cases were discovered until the first outbreak, 146 HIV-positive cases, was found among IDUs in Ruili, Yunnan province.1 In the 1990s, the epidemic spread along the drug trafficking routes.1 Meanwhile, in the mid-1990s, China experienced an outbreak of HIV infection among unsanitary commercial blood donors. This outbreak was centered in Henan province in central China and expanded to neighboring provinces, such as Hebei, Anhui, and Shanxi.1

Because of the stigma attached to HIV/AIDS, some HIV-infected patients will not say how they became infected, or even hide their HIV-positive status. This unknown-risk HIV-positive population is a barrier to public health policymaking for the prevention of HIV transmission. In the National Databank of the Chinese Center for Disease Control and Prevention (China CDC), we found a large number of unknown-risk HIV infections. It is a real challenge to determine their transmission routes and provide evidence for prevention policymaking. Molecular epidemiology and phylogenetic analysis may provide a tool for this purpose.3

In this study, using molecular epidemiology analysis and statistical modeling, we established an algorithm to estimate the transmission routes of the unknown-risk HIV-positive population and used it to predict the transmission routes of unknown-risk samples.


Data Collection, Sample Collection, Reverse Transcription Polymerase Chain Reaction, and Sequencing

Data on newly reported HIV cases were downloaded from the National Databank of China CDC and analyzed. Whole blood and plasma samples from representative patients newly diagnosed as HIV positive were collected nonrandomly from hospitals and provincial Centers for Disease Control and Prevention from 1996 to 2006. Viral RNA was extracted from plasma samples with a QIAamp viral RNA mini kit (Qiagen, Hilden, Germany). HIV-1 gag nucleotide sequences were obtained by reverse transcription polymerase chain reaction and sequencing as previously described.4,5 The sequences and demographic information, including patients' names, sex, and risk histories, were deposited into the sequence databank of National AIDS/STD Prevention and Control, China CDC.

In this study, we collected all 843 HIV-1 gag sequences from the sequence databank to analyze the high-risk behaviors, epidemiologic links, and sequence features of samples collected from 1996 to 2006. The characteristics of the samples by route of transmission, genotype, and geography are listed in Tables 1 and 2. The map of China in Figure 2A illustrates the geographic origin of the samples. Patients who reported a history of more than 1 type of high-risk behavior are listed separately, with all high-risk histories explicitly stated.

Table 1. Geographic Distribution (Provinces and Municipalities Directly Under the Central Government) (N = 843)
Table 2. HIV gag Genotypes and Transmission Routes From 1996 to 2006 (N = 843)

HIV Genotyping and Phylogenetic Analysis

HIV genotypes were determined from the gag sequences by phylogenetic analysis with MEGA software. Phylogenetic trees were constructed by the neighbor-joining method with 1000 bootstrap replicates. Reference sequences were downloaded from the HIV data bank and from the sequence data bank of National AIDS/STD Prevention and Control, China CDC.
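The bootstrap step can be illustrated with a short sketch. The paper does not state which distance model was used, so the simple p-distance below is an assumption, and the sequence names and fragments are invented for illustration only. Each bootstrap replicate resamples alignment columns with replacement; a tree would then be rebuilt from the distances of every replicate.

```python
import random


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences, ignoring gap positions."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: alignment columns resampled with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# Hypothetical aligned fragments (not real gag sequences).
toy = {
    "ref_B": "ATGGGTGCGA",
    "sample_1": "ATGGGTGCGA",
    "sample_2": "ATGAGTGCTA",
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicate = bootstrap_alignment(toy, rng)
distance = p_distance(toy["ref_B"], toy["sample_2"])
```

In a full analysis this resampling would be repeated 1000 times, and the bootstrap value of a cluster is the percentage of replicate trees in which that cluster reappears.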

Statistical Modeling

The decision tree procedure in SPSS software was used to estimate the transmission routes of HIV. We built the model by combining the subjects' age, sex, and geography with HIV gag genotypes and specific clusters from the phylogenetic analysis, with some necessary modifications. The significance level was set to 0.05 for splitting nodes and merging categories. Because transmission route is a nominal dependent variable, we selected the likelihood ratio method for more robust estimation. For multiple comparisons of each route, the significance values for the merging and splitting criteria were adjusted with the Bonferroni method. The model was validated with known-risk samples from the same period before it was used to predict the unknown-risk samples.
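To make the splitting criterion concrete, the sketch below computes a likelihood-ratio (G) statistic for a contingency table of predictor category by transmission route, together with a Bonferroni-adjusted significance threshold. This is an illustration of the two ideas named above, not SPSS's internal implementation; the table counts and the number of comparisons are invented.

```python
import math


def likelihood_ratio_g(table):
    """Return the G statistic (likelihood-ratio chi-square) and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to G
                expected = row_totals[i] * col_totals[j] / n
                g += 2.0 * observed * math.log(observed / expected)
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return g, dof


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_comparisons


# Hypothetical counts: rows are two predictor categories, columns are blood, IDU, sex.
toy_table = [[30, 10, 5], [8, 25, 12]]
g_stat, dof = likelihood_ratio_g(toy_table)
adjusted_alpha = bonferroni_alpha(0.05, 3)  # 3 comparisons is an assumed value
```

The G statistic is compared against a chi-square distribution with `dof` degrees of freedom; a split is accepted only if its p-value falls below the adjusted threshold.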


Profile of Transmission Routes of Newly Reported HIV Cases in China From 1985 to 2009

We downloaded data on newly reported HIV cases from the National Databank of China CDC. These data cover all mainland provinces, municipalities directly under the Central Government, and autonomous prefectures. Figure 1 shows the total newly reported cases and the profile of transmission routes from 1985 to 2009. In general, newly reported HIV cases increased steadily year by year for most transmission routes except blood transmission (Fig. 1A). Newly reported cases from blood transmission peaked at 1202 in 2004 and then gradually decreased until 2009. In contrast, the total number of newly reported cases in China in 2009 (30,340) was about 19 times the number in 2002 (1595). In the early years of the epidemic (1985–2001), IDU was the dominant route of transmission, accounting for 83.4% of all newly reported HIV cases. Thereafter, the proportion of IDU among all transmission routes shrank, although the total numbers continued to grow. Sexual transmission, by contrast, increased rapidly in both number and proportion. It replaced IDU as the major route of transmission in 2008 (Figs. 1A, B). The turning point was 2007, when the numbers of newly reported cases for these 2 routes were equivalent (Figs. 1A, B).

Figure 1. Newly reported HIV cases in China from 1985 to 2009, downloaded from the National Databank of China CDC. A, The top row of the table indicates the year in which the HIV cases were newly reported in the National Databank of China CDC, from 1985 to 2009. The numbers from 1985 to 2001 were merged because only a small number of cases were reported each year during that period. The left column shows the HIV transmission route: IDU, sex, IDU/sex, blood, mother to child, and unknown. The number in each cell is the count of newly reported HIV cases, and the number in parentheses is the percentage among all transmission routes in that year. B, The pie charts show the percentage of each transmission route in each year. Each color represents 1 transmission route.

Notably, a large proportion of patients in each year had an unknown transmission route, reaching 30%–35% in 2005–2006 (Figs. 1A, B). The aim of this study is to estimate the transmission routes of this unknown-risk population.

Genotype Determination

In this study, the 843 gag sequences are all of the gag sequences available to us from 1996 to 2006; they are a nonrandom sample of infections in China in that period. Genotype profiles of the various high-risk groups are listed in Tables 1 and 2; the samples comprise 519 known-risk samples and 324 unknown-risk samples. Subtype B, subtype C, circulating recombinant form 01_AE (CRF01_AE), CRF07_BC, and CRF08_BC were found among these samples (Tables 1 and 2). Subtype B dominated among HIV-positive individuals in the blood transmission group from 1996 to 2006 (90.8%) (Table 2). In IDUs, CRF07_BC was the main clade (41.6%), followed by CRF01_AE (35.2%), subtype C (13.2%), CRF08_BC (3.6%), subtype B (2.8%), etc (Table 2). Among sexually infected patients, all of the HIV-1 genotypes found in the other groups were seen; CRF01_AE was the predominant clade (58.4%), followed by subtype B (24.8%), CRF07_BC (8.8%), subtype C (5.3%), and CRF08_BC (2.7%). Although the samples were nonrandomly selected, the genotype profile of each risk group was similar to that found in another study from our laboratory, a comprehensive, randomly sampled National Molecular Epidemiologic Survey.6 Thus, the samples in this study represent the high-risk groups to some extent but cannot represent all of China.

When we compared the HIV-1 genotype profile of the sexually transmitted group with that of the unknown-risk group, we found that the profiles were similar in both the genotypes present and their proportions (Table 2). These data suggest that some of the unknown-risk patients could have been infected sexually.

Molecular Phylogenetic Analyses

Next, we used known-risk samples to estimate the transmission routes of the unknown-risk ones. First, we randomly selected 126 HIV gag nucleotide sequences with known risk histories from the 843 samples, including 56 subtype B and 70 CRF01_AE sequences, covering the major HIV epidemic regions in China from 1996 to 2006. In addition, all full-length HIV-1 genomes from China available in the Los Alamos HIV Database and the NCAIDS Sequence Database were included as reference sequences.7

A neighbor-joining phylogenetic tree was first constructed for all sequences to assign genotypes to broad categories, after which a separate phylogenetic analysis was performed for the sequences of each major genotype. Figure S1a (Supplemental Digital Content) shows the phylogenetic tree of known-risk subtype B sequences, and Figure S1b (Supplemental Digital Content) shows the CRF01_AE sequences. Reference sequences are also included; for example, reference B.CN.2009.ZK042.JF932497-2009.5 represents subtype B strain JF932497 isolated in China in 2009. Subtype B sequences separate mainly into 2 large clusters, the subtype B′ (Thailand B) cluster and the United States-European B cluster (see Figure S1a, Supplemental Digital Content). The subtype B′ cluster bifurcates into several subclusters. Some B′ subclusters are from former plasma donors (FPDs), spouses of FPDs, and IDUs. Spouses of FPDs, such as HEB0222 and HEB0223, were infected through sex (see Figure S1a, Supplemental Digital Content). Remarkably, in the late 1980s, HIV-1 from IDUs in Yunnan province entered the FPD population in central China.1 IDUs with subtype B′ were concentrated in the southwest of China, such as Yunnan and Guizhou provinces. FPDs with subtype B′ were concentrated in central China, including Henan, Hebei, Anhui, and Shanxi provinces. Some FPD-associated infections have spread to the northeast of China, such as Jilin and Heilongjiang provinces (see Figure S1a, Supplemental Digital Content; Fig. 2A).

Figure 2. A, Geographic map of China with the province names labeled. B, Criteria for HIV transmission estimation.

The United States-European B cluster of Chinese subtype B originated from men who have sex with men (MSM) populations in the United States and Europe. According to our data and published data, a unique MSM cluster with a high bootstrap value (94%) has formed (see Figure S1a, Supplemental Digital Content). The men in this cluster have sex with men or with women. Thus, the HIV-1 transmission route for this unique MSM cluster may be homosexual or heterosexual contact. No IDUs were found in this unique MSM cluster (see Figure S1a, Supplemental Digital Content).

In the CRF01_AE phylogenetic tree, a large cluster includes both IDUs and heterosexually infected patients (see Figure S1b, Supplemental Digital Content). These 2 groups intermingle and cannot be separated well, so we designate it the IDU/sex cluster. These HIV-1 sequences come from eastern China, such as Hainan, Fujian, and Jiangxi provinces (Fig. 2A; Figure S1b, Supplemental Digital Content). Some unique MSM clusters also appear in the CRF01_AE tree (see Figure S1b, Supplemental Digital Content). Similar analyses were conducted for subtype C, CRF07_BC, and CRF08_BC (see Figure S2, Figure S3, Supplemental Digital Content, and data not shown).

Criteria and Statistical Modeling of Estimation of Unknown-Risk Transmission Routes of HIV-Infected People

The data in Figure 1 and Tables 1 and 2 and published data6 show that, in China, HIV is transmitted mainly through blood, IDU, and sexual contact, including homosexual and heterosexual contact. Minor transmission routes include mother-to-child vertical transmission and puncture or occupational exposure. From 1985 to 2009, IDUs made up roughly 51.0% of the HIV-infected population on average, compared with 26.9% for the sexually transmitted group and 3.9% for the blood transmission group (Fig. 1A). In our sequence data, the corresponding proportions for IDU, sex, and blood transmission are 35.5%, 13.4%, and 12.6%, respectively (Table 2). Thus, our data are broadly similar to the national data in the ranking of these routes. It is easy to judge whether an individual was infected by mother-to-child transmission, and age is the key evidence. For HIV-positive children younger than 5 years with no blood transfusion, IDU, sexual, and blood transmission can be excluded. Unknown-risk HIV-positive adults and adolescents in China are most likely to have been infected through blood, IDU, or sex.

Theoretically, the transmission route of HIV-1 cannot be concluded from a designated subtype or cluster alone. The epidemic history of Chinese subtype B′ shows that subtype B′ in FPDs originated from a single founding subtype B′ strain in IDUs in Yunnan province in the late 1980s. Therefore, subtype B′ alone cannot tell us whether an infection came from FPDs or IDUs. However, because of the founder effect, only subtype B′ has been found in FPDs to date. Other genotypes, such as CRF07_BC and CRF01_AE, have never been reported in FPDs. Moreover, FPDs are concentrated in central and northeastern China, whereas IDUs are in the southwest. Thus, the sampling location provides important evidence for prediction. Based on the above analyses, we set the criteria for predicting unknown-risk transmission routes (Fig. 2B). Phylogenetic analysis is the major evidence, combined with other information, such as the subject's age, sex, and sampling location. First, if the confirmed HIV-1–positive patient is younger than 5 years and has no blood transfusion history, he or she is most likely to have been infected by mother-to-child transmission. Otherwise, if the patient is older than 5 years, reverse transcription polymerase chain reaction of the HIV-1 gag sequence is performed, followed by phylogenetic analysis and genotype determination. If the sample genotype is subtype B, subtype C, CRF07_BC, CRF08_BC, or CRF01_AE, the sample is assigned to its specific cluster in the phylogenetic tree, as illustrated in Figure 2B. If the sample does not fall in a specific cluster, or its genotype is a rare one not mentioned above, the sample remains “unknown” (Fig. 2B).
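The screening criteria described above can be sketched as a small rule function. This is a minimal illustration of the stated rules, not the authors' implementation: the function name, argument names, and return strings are hypothetical, and the handling of children under 5 years who do have a transfusion history (sent on to genotyping here) is an assumption, because the text does not specify it.

```python
# Genotypes for which the text says a cluster assignment is attempted.
COMMON_GENOTYPES = {"B", "C", "CRF07_BC", "CRF08_BC", "CRF01_AE"}


def triage(age_years, transfusion_history, genotype=None, cluster=None):
    """Apply the age rule first, then the genotype and cluster rules."""
    if age_years < 5 and not transfusion_history:
        return "mother-to-child (likely)"
    # Assumption: every other case proceeds to sequencing and genotyping.
    if genotype not in COMMON_GENOTYPES:
        return "unknown"  # rare or undetermined genotype
    if cluster is None:
        return "unknown"  # does not fall in a specific cluster
    return f"cluster: {cluster}"


# Hypothetical patients for illustration.
child = triage(3, False)
adult_in_cluster = triage(30, False, "CRF01_AE", "MSM")
adult_rare_genotype = triage(30, False, "CRF02_AG", "some cluster")
adult_no_cluster = triage(30, False, "B", None)
```

The cluster label returned here is only an intermediate result; in the paper it is then entered into the statistical model together with province, year, and age.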

Next, we entered the data into SPSS software to build the decision tree model. The dependent variable, the target we are trying to classify, is the transmission route of HIV. The input variables are “subtype cluster,” “province,” “year,” and “age,” which support the statistical inference. A quality measure is needed to choose the best attribute at each split. The following equation is used in our decision tree modeling. Information gain, InfoGain(S, A), represents the expected reduction in entropy (H) due to knowing attribute A. In its standard form:

$$\mathrm{InfoGain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v), \qquad H(S) = -\sum_{i} p(c_i) \log_2 p(c_i)$$

Here Values(A) is the set of values taken by attribute A, and S_v is the subset of S for which A takes the value v.

In the equation, “S” is a sample of training examples. Entropy (H) is one way of measuring the impurity of “S,” and “p(ci)” is the proportion of examples in “S” whose category is ci. Because we have a limited amount of data, we set the minimum number of cases to 5 for a child node and 10 for a parent node. The significance level for splitting nodes and merging categories was set to 0.05. Because we want fewer nodes in our model, the maximum number of iterations was set to 100 and the minimum change in expected cell frequencies to 0.001.
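Entropy and information gain can be computed directly from labeled examples. The sketch below assumes the textbook definitions of both quantities (entropy as the negative sum of p(ci) log2 p(ci), and information gain as the entropy of S minus the size-weighted entropy of the subsets produced by an attribute). The field names and the four training rows are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of category labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def info_gain(rows, attribute, target):
    """Expected reduction in entropy of `target` from splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(group) / len(rows) * entropy(group) for group in groups.values())
    return base - remainder


# Hypothetical training examples; "cluster" separates the routes perfectly here.
toy_rows = [
    {"cluster": "MSM", "province": "Beijing", "route": "sex"},
    {"cluster": "MSM", "province": "Henan", "route": "sex"},
    {"cluster": "B_prime", "province": "Beijing", "route": "blood"},
    {"cluster": "B_prime", "province": "Henan", "route": "blood"},
]

gain_cluster = info_gain(toy_rows, "cluster", "route")
gain_province = info_gain(toy_rows, "province", "route")
```

In this toy data, splitting on `cluster` removes all uncertainty about the route while splitting on `province` removes none, so a tree builder that maximizes information gain would split on `cluster` first.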

However, we met difficulties when classifying the transmission routes. Transmission route is the most important variable in the model, yet it relies solely on the patient's own account. Some patients were not sure of their transmission route and reported, for example, “IDU or sex” or “blood or sex” (Tables 1 and 2). Because samples with dual transmission routes would interfere with the model, we excluded them to improve its accuracy. In addition, the phylogenetic trees contain some unique MSM clusters, and we are confident that the samples in these MSM clusters were transmitted by sexual contact. Twenty-two unknown-risk samples fell in the unique MSM clusters, and we classified these 22 samples as sexual transmission. Thus, 461 known-risk samples plus these 22 unknown-risk samples gave 483 samples, which we used as the training examples “S” in the model (Tables 3 and 4).

Table 3. Estimation of Known-Risk HIV-1 Transmission Routes
Table 4. Estimation of Unknown-Risk HIV-1 Transmission Routes

To test the model, we applied it to the known-risk samples; the predictions are listed in Table 3. The correct rate, or sensitivity, was 90.8%, 94.8%, and 69.6% for blood, IDU, and sexual transmission, respectively, and the specificity was 87.3%, 87.5%, and 85.5%, respectively. Overall, the model gives a satisfactory 87.0% sensitivity and specificity. It predicts IDU and blood transmission well but has poor sensitivity for sexual transmission.
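Per-route sensitivity and specificity can be derived from a confusion matrix by treating each route in turn as the positive class. The sketch below shows that one-vs-rest calculation; whether the paper's figures were computed in exactly this way is an assumption, and the counts are invented and are not the values in Table 3.

```python
def one_vs_rest_metrics(confusion, labels):
    """confusion[i][j] is the count of samples with true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = confusion[k][k]
        false_neg = sum(confusion[k]) - true_pos
        false_pos = sum(row[k] for row in confusion) - true_pos
        true_neg = total - true_pos - false_neg - false_pos
        metrics[label] = {
            "sensitivity": true_pos / (true_pos + false_neg),
            "specificity": true_neg / (true_neg + false_pos),
        }
    return metrics


# Hypothetical confusion matrix (rows: true route, columns: predicted route).
routes = ["blood", "IDU", "sex"]
toy_confusion = [
    [8, 1, 1],
    [0, 9, 1],
    [2, 2, 6],
]
toy_metrics = one_vs_rest_metrics(toy_confusion, routes)
```

A route whose samples are scattered across several predicted classes, as sexual transmission is in this toy matrix, shows lower sensitivity even when specificity stays high.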

Estimation of Unknown-Risk HIV-1–Positive Samples

Next, we analyzed the unknown-risk HIV-positive samples. Of the 843 HIV gag samples, 324 had unknown risk, including 120 subtype B, 30 subtype C, 81 CRF01_AE, 78 CRF07_BC, 11 CRF08_BC, and 4 of unknown subtype (later determined to be new BC or BE recombinants). A neighbor-joining phylogenetic tree was constructed for all 324 samples (Fig. 3A); green represents subtype B; purple represents subtype C, CRF07_BC, and CRF08_BC; and orange represents CRF01_AE. We randomly selected ∼30 unknown-risk subtype B and CRF01_AE samples; their phylogenetic trees are shown in Figures 3B and C. In the subtype B phylogenetic tree, there is a unique MSM cluster with a high bootstrap value (92%). All of the unknown-risk samples in this cluster were predicted as sexual transmission (Fig. 3B). Another sample, GD06138-X, is very close to the Brazilian reference strain B.BR.1989.BZ167.AY173956–1989.5, which is unusual in China. We had no epidemiologic information on this patient. Based on the phylogenetic tree, GD06138 was estimated as sexual transmission. HEN0245, HEN0247, and HEN0203 fall in the Chinese B′ cluster (Fig. 3B). Notably, JIL0604 and JIL0602 form a small cluster with a bootstrap value of 77% (Fig. 3B). The closest reference strain is the blood transmission sample BJ06102, and no other references are available. Thus, we also assigned them to the Chinese B′ cluster. Each sample's subtype cluster, province, year, and age were entered into the SPSS model to estimate the transmission route.

Figure 3. Neighbor-joining phylogenetic analyses. A, Circular neighbor-joining phylogenetic tree of unknown-risk samples. Green represents subtype B; purple represents subtype C, CRF07_BC, and CRF08_BC; and orange represents CRF01_AE. B, Neighbor-joining phylogenetic tree of unknown-risk subtype B samples. C, Neighbor-joining phylogenetic tree of unknown-risk CRF01_AE samples. ▲, blood transmission; ●, sex transmission; Δ, sex or blood transmission; □, IDU; ◇, mother-to-child transmission; ○, sex or IDU; and x represents unknown samples. Reference names follow the pattern shown in “B.CN.2007.FJ070016.JF932483-2007.5.”

For the CRF01_AE samples, there are 2 specific MSM clusters, 1 IDU/sex cluster, and 1 heterosexual transmission cluster, as previously reported (Fig. 3C).8 The unknown-risk samples were therefore categorized into their specific clusters. Notably, 2 samples, WH0533 and WH0534, do not group with any cluster, suggesting possible recombination. An online BLAST search shows that these 2 samples are very close to BE recombinants. These 2 samples were also entered into the model for analysis. Subtype C, CRF07_BC, and CRF08_BC samples were analyzed in a similar way (see Figure S2, S3, Supplemental Digital Content).

Next, using the established decision tree model, we predicted the transmission routes of the 324 unknown-risk patients. The results are listed in Table 4. Of the 324 samples, 100 were predicted to be transmitted by blood, 114 by IDU, and 110 by sexual transmission.


In this study, we first described the newly reported HIV-1 cases in China from 1985 to 2009. The large number of unknown-risk HIV cases compelled us to estimate how these patients became infected. We therefore developed a model to estimate the transmission route based on phylogenetic and statistical analysis. Several points are key to this method. First, the reference sequences for phylogenetic analysis are very important. International reference sequences downloaded from the HIV data bank are not adequate. Local known-risk references from the same region and period that form unique clusters, such as an MSM cluster or an IDU cluster, are ideal; these unique clusters are important evidence for estimation. Second, these clusters are sometimes mixed. Third, other information in the statistical model, such as age, sex, and sampling location, is also important.

The limitation of this study is that we cannot conclude with certainty that the estimated transmission route is IDU, blood, or sex. Moreover, a risk of using phylogenetics to determine risk group is that, as viruses spread from each risk group into the others, the value of this analysis will erode. Thus, the model should be re-evaluated continuously or periodically. Another factor affecting the error rate or accuracy of any method of determining risk factors is the likely bias in the percentage of people in each risk category who misreport their risk factors. For example, some men who have sex with men will report that they are heterosexual or that they were infected through blood transmission. Because drug use is illegal in China, IDU may also be underreported. We were aware of this problem. In the phylogenetic analysis, we found samples whose reported risk factors did not match their cluster in the phylogenetic tree, but these were a small fraction: only 4 of the 843 samples. The risk factors of the other samples matched well with the phylogenetic tree, the geographic distribution of HIV-1, and other information. This suggests that most Chinese patients either tell the truth or decline to answer rather than misreport their risk factors.

The significance of this study lies in predicting the transmission routes of unknown-risk samples using phylogenetic analyses and a statistical model. The model shows 90.8%, 94.8%, and 69.6% sensitivity and 87.3%, 87.5%, and 85.5% specificity for blood, IDU, and sexual transmission, respectively. This is acceptable for predicting IDU and blood transmission, but the 69.6% sensitivity for sexual transmission is somewhat low. The most likely reason is that every HIV genotype, and even every cluster in the phylogenetic tree, can be transmitted by sex. Overall, however, the model's 87.0% sensitivity and specificity are satisfactory and useful, and far better than having no risk information. Of the 324 unknown-risk samples collected from 1996 to 2006, 100 were predicted to be transmitted by blood, 114 by IDU, and 110 by sexual transmission. We therefore reanalyzed the data on newly reported cases in China (Fig. 1A). In 2005 and 2006, unknown-risk cases made up 30.0% and 34.3% of newly reported cases; according to this study, roughly 34% of them (110/324, 33.9%), or even more, would have been sexually transmitted. Thus, sexual transmission probably replaced IDU as the major route of transmission 1–2 years earlier than 2008.
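The reallocation argument can be followed as a short worked calculation using only figures quoted in the text. It assumes that the unknown-risk cases in the national data split across routes in the same proportions as the 324 sequenced samples, which is a strong assumption given that those samples were collected nonrandomly. The variable names are illustrative.

```python
# Predicted routes for the 324 unknown-risk samples, as reported in the text.
predicted = {"blood": 100, "IDU": 114, "sex": 110}
n_unknown = sum(predicted.values())

# Share of unknown-risk samples predicted to be sexually transmitted.
sex_share = predicted["sex"] / n_unknown

# Fraction of newly reported national cases with unknown risk, from the text.
unknown_fraction = {2005: 0.300, 2006: 0.343}

# Share of all newly reported cases that would move from "unknown" to "sex".
added_sex_share = {year: frac * sex_share for year, frac in unknown_fraction.items()}
added_points = {year: round(100 * share, 1) for year, share in added_sex_share.items()}
```

Under this assumption, sexual transmission gains roughly 10 percentage points of all newly reported cases in 2005 and roughly 12 in 2006, which is the basis for dating the crossover with IDU earlier than 2008.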

This model was built on samples from 1996 to 2006, and further validation would be desirable. However, we have already used all of the available samples collected during the national HIV molecular epidemiology survey from 1996 to 2006. The model must be revalidated and re-evaluated with current or local samples before use. After revalidation, it can be applied in provinces and cities in China. Unknown-risk HIV samples in each province will be analyzed at the provincial level by the Provincial Center for Disease Control and Prevention (provincial CDC). The analyzed data will be used by the provincial health bureau and reported to the national CDC to obtain the overall national figures.

China CDC is preparing another national HIV molecular epidemiology survey for 2015 and 2016, and we will use these more representative samples to validate the model again by the end of this study. In many areas and regions, intensive collection of epidemiologic data is difficult. Because of stigma and discrimination, people still widely conceal transmission through extramarital sex, homosexual contact, or drug use. Therefore, an objective prediction model based on local HIV genetic transmission patterns will help to better characterize the transmission routes of the local HIV epidemic.

In this study, we developed a method to estimate the transmission routes of unknown-risk infections using phylogenetic analyses and a statistical model. The HIV-1 phylogenetic tree can provide important information on how individuals became infected, although it is sometimes not conclusive. This model can be used in China and in other countries with modifications according to the local epidemic situation.


1. Shao Y. AIDS epidemic at age 25 and control efforts in China. Retrovirology. 2006;3:87.
2. Zeng Y, Fan J, Zhang Q, et al. Detection of antibody to LAV/HTLV-III in sera from hemophiliacs in China. AIDS Res. 1986;2(suppl 1):S147–S149.
3. Brenner B, Wainberg MA, Roger M. Phylogenetic inferences on HIV-1 transmission: implications for the design of prevention and treatment interventions. AIDS. 2013;27:1045–1057.
4. Wei M, Guan Q, Liang H, et al. Simple subtyping assay for human immunodeficiency virus type 1 subtypes B, C, CRF01-AE, CRF07-BC, and CRF08-BC. J Clin Microbiol. 2004;42:4261–4267.
5. Ye JR, Xing H, Liu HL, et al. Subtype and sequence analysis of gag and env genes among HIV-1 strains circulating in Beijing residents during 2006 [in Chinese]. Zhonghua Liu Xing Bing Xue Za Zhi. 2007;28:586–588.
6. He X, Xing H, Ruan Y, et al. A comprehensive mapping of HIV-1 genotypes in various risk groups and regions across China based on a nationwide molecular epidemiologic survey. PLoS One. 2012;7:e47289.
7. Li Z, He X, Wang Z, et al. Tracing the origin and history of HIV-1 subtype B′ epidemic by near full-length genome analyses. AIDS. 2012;26:877–884.
8. Feng Y, He X, Hsi JH, et al. The rapidly expanding CRF01_AE epidemic in China is driven by multiple lineages of HIV-1 viruses introduced in the 1990s. AIDS. 2013;27:1793–1802.

Keywords: HIV; transmission; phylogenetic analysis; statistical modeling


Copyright © 2015 Wolters Kluwer Health, Inc. All rights reserved.