Since the first highly publicized case of intentional HIV transmission in the 1990s, in which six patients became HIV-1 positive after being treated by a dentist who was aware of being HIV-1 positive [1–3], such cases have been found in many countries around the world. In 2007, the Joint United Nations Programme on HIV/AIDS and the United Nations Development Programme raised concerns about decisions reached in these criminal cases, highlighting the importance of molecular evidence of transmission direction to strengthen judgments made on the identification of transmission sources. The use of molecular evidence for identifying transmission direction is also important for identifying the characteristics of HIV transmission networks, which affect the rate of disease transmission in the short term and the prevalence of the disease in the long term. Currently, phylogenetic analysis of HIV-1 sequences, based on assessing the similarity of viral sequences in transmission partners, is widely used to determine HIV transmission linkage, but reliable identification of transmission direction is still not possible [6,7].
Recently, some attempts have been made to identify transmission direction using phylogenetic analysis of paraphyletic relationships. During viral transmission, only a few viral isolates are transmitted from source to recipient, the so-called bottleneck effect. As a result, all recipient sequences are more closely related to a subset of the source sequences than the source sequences are to each other. This relationship between source and recipient sequences is termed a paraphyletic relationship, and provides molecular evidence of transmission direction. However, due to the rapid evolution of HIV, paraphyletic relationships are gradually lost over time. This is a particular problem because, in most cases, there are long time lags between transmission and sampling. For example, in almost all criminal cases, the length of time between transmission and DNA testing of the suspect is extensive due to delays in reporting and detecting crime. Therefore, phylogenetic analysis of paraphyletic relationships may not be effective or reliable in most real situations, and the development of alternative methods is necessary.
It is well known that viral coreceptor usage switching is a common step in the progression of AIDS. In the early stages of infection, viruses select CCR5 as their coreceptor for entering the host cell, but subsequently switch to coreceptor CXCR4. Generally speaking, there is an intermediate period when viruses can bind to either CCR5 or CXCR4 to facilitate their entry into host cells. By analysing coreceptor usage switching, it may be possible to identify markers that are affected by transmission but remain constant throughout the progress of disease in an individual. Such disease progression-independent markers could be used to develop methods for identifying transmission directions that are not affected by time lags in sampling.
In this work, we developed an approach for identifying transmission direction based on a subset of ‘common patterns’ in the HIV-1 gp120 protein derived from CCR5/CXCR4 coreceptor usage-labeled sequence datasets. To verify our approach we identified the transmission directions of 73 transmission pairs for which the transmission direction was already known. Our approach identified transmission direction with an accuracy of up to 94.5%. It performed even better on transmission pairs with longer sampling time lags, and was not influenced by viral subtype or transmission route.
Derivation of common patterns
We searched the Los Alamos HIV-1 databases (http://www.hiv.lanl.gov/; last modified 26 January 2011), and collected all sequences in the env C2-V5 region with lengths of about 180 amino acids (genomic region 7050-7590), which were labeled coreceptor usage. The following dataset was constructed.
Dataset 1: coreceptor usage dataset
In total, there were 1926 sequences from 528 patients (on average, 3.6 sequences per patient), of which 1485 sequences of viruses only used CCR5 as the coreceptor (termed R5 sequences) and 441 sequences of viruses used other coreceptors (termed non-R5 sequences). The non-R5 sequences included 167 sequences of viruses, which used only CXCR4 as the coreceptor (termed X4 sequences) and 274 sequences of viruses, which used either CCR5 or CXCR4 as the coreceptor (termed R5X4 sequences).
Definition of patterns
We defined a ‘pattern’ as a group of nonsequential but related amino acids: for a given subsequence window of length L, a pattern is a sequence of m residues, the first of which is fixed at the first (left-most) position of the window, with the remaining m − 1 residues distributed among the remaining L − 1 positions. The number of possible combinations of positions for the m residues (denoted as s) in the subsequence is therefore:
s = \binom{L-1}{m-1} = \frac{(L-1)!}{(m-1)!\,(L-m)!}
Here, we set the number of letters per pattern (m) to 4, and the length of the subsequence window (L) to 20 to search for patterns in the viral sequences.
There are 20 possible amino acid letters for each position, giving a total of s × 20^m possible patterns. The subsequence window was moved along each sequence step by step to obtain all patterns.
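The window search described above can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not confirm: a pattern is keyed by its residues together with their offsets inside the window, only full-length windows are scanned, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb

M = 4    # residues per pattern (m in the text)
L = 20   # window length (L in the text)

def position_combinations(window_len=L, m=M):
    """Ways to place m residues in a window when the first one is fixed at position 0."""
    return comb(window_len - 1, m - 1)

def extract_patterns(seq, window_len=L, m=M):
    """Return the set of unique patterns found by sliding the window along seq.

    Assumption: a pattern is keyed by (offsets, residues), where offsets are
    positions relative to the window start and the first offset is always 0.
    """
    patterns = set()
    for start in range(len(seq) - window_len + 1):
        window = seq[start:start + window_len]
        # choose the remaining m - 1 positions among the other window_len - 1
        for rest in combinations(range(1, window_len), m - 1):
            offsets = (0,) + rest
            residues = "".join(window[i] for i in offsets)
            patterns.add((offsets, residues))
    return patterns

s = position_combinations()
print(s)            # 969 position combinations for L = 20, m = 4
print(s * 20 ** M)  # 155040000 possible patterns
```

A sequence of about 180 residues yields roughly 160 full windows, so each sequence contributes at most about 160 × 969 patterns before duplicates are removed.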
Common patterns and most recent common ancestor patterns
‘Common patterns’ are those that appear in both R5 sequences and non-R5 (R5X4 or X4) sequences. To avoid the random generation of erroneous patterns, only those patterns appearing in more than 60 R5 sequences and at least one non-R5 sequence were defined as common patterns.
For comparison, we also defined another subset of patterns that appear in R5 sequences but not in R5X4 or X4 sequences, called ‘most recent common ancestor patterns’ (MRCA patterns). To avoid the random generation of erroneous patterns, only those MRCA patterns that appeared in more than 60 R5 sequences were chosen.
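The two definitions above amount to a frequency filter followed by a set intersection or difference. The sketch below is one possible reading, assuming each sequence has already been reduced to a set of patterns; the function name and the toy data are hypothetical.

```python
from collections import Counter

MIN_R5_SEQUENCES = 60  # a pattern must appear in more than this many R5 sequences

def classify_patterns(r5_pattern_sets, non_r5_pattern_sets, min_r5=MIN_R5_SEQUENCES):
    """Split frequent R5 patterns into common patterns and MRCA patterns.

    Each argument is a list holding one set of patterns per sequence.
    """
    r5_counts = Counter()
    for patterns in r5_pattern_sets:
        r5_counts.update(patterns)          # a sequence counts each pattern once
    non_r5_seen = set().union(*non_r5_pattern_sets)
    frequent = {p for p, n in r5_counts.items() if n > min_r5}
    common = frequent & non_r5_seen         # also in at least one non-R5 sequence
    mrca = frequent - non_r5_seen           # absent from every non-R5 sequence
    return common, mrca

# Toy example with a lowered threshold (illustrative data only)
r5 = [{"a", "b", "c"}, {"a", "b"}, {"a"}]
non_r5 = [{"a", "d"}]
common, mrca = classify_patterns(r5, non_r5, min_r5=1)
print(common, mrca)
```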
Conservation of common patterns during disease progression in an individual
In order to test the stability of the number of unique common patterns during disease progression in an individual, a set of sequences derived from samples taken from longitudinally observed patients was constructed as described below.
Dataset 2: longitudinal sampling dataset
A set of sequences from the env C2-V5 region (amino acids approximately 260–470) from nine serially sampled infected patients (currently available patients who were sampled at more than 10 time points) whose progression to AIDS occurred in the same year was obtained from GenBank (Accession numbers AF137629 to AF138163, AF138166 to AF138263, and AF138305 to AF138703). Samples were obtained at roughly 6-month intervals ranging from 0 to 11 years postseroconversion. Samples with fewer than five viral sequences were excluded. The numbers of unique MRCA patterns and unique common patterns appearing in viral sequence sets at different sampling time points were calculated for each of these nine patients.
Transmission direction identification
As a consequence of the bottleneck effect, only a subset of common patterns is transmitted from source to recipient. Therefore, the number of unique common patterns tends to decrease during transmission from source to recipient. The number of such patterns in a viral sequence set should, therefore, be a suitable estimator of transmission direction.
To test whether the number of unique common patterns is a suitable estimator of transmission direction, 73 transmission pairs (each pair containing a source and a recipient) of known transmission direction were identified, and their sequences were collected. We searched the Los Alamos HIV-1 databases and collected all sequences in the env C2-V5 region with lengths of about 180 amino acids (genomic region 7050–7590) whose cluster transmission type was labeled as ‘Mother→Child’, ‘Heterosexual’, or ‘Men sex with men’. Sequences collected for each patient were assembled according to time point into viral sequence sets, based on the original papers, sequence filenames or comments in database files. Samples with fewer than five viral sequences were excluded. We then referred back to the original papers to confirm the transmission direction of all partners [12–26]. A total of 73 pairs had clear transmission linkages and direction, of which 53 were mother-to-child pairs, 14 were heterosexual partners and six were homosexual partners. Detailed information on these 73 transmission pairs is shown in Supplemental Digital Content 1, http://links.lww.com/QAD/A214. In most cases, data for each patient fell into more than one viral sequence set. To test the performance of our approach comprehensively, we used two entirely different methods to select a single time point for each patient, yielding the following two datasets.
Dataset 3: ‘minimal time lag dataset’
In this dataset, time points selected for each patient were those that minimized the difference in sampling time between the sample derived from the source and the sample derived from the recipient, and were closest to the transmission time point.
Dataset 4: ‘maximal time lag dataset’
In this dataset, time points selected for each patient were those that maximized the difference in sampling time between the sample derived from the source and the sample derived from the recipient, and were furthest from the transmission time point.
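The two selection rules can be sketched as a search over all combinations of source and recipient sampling times. This is only an illustration under stated assumptions: times are expressed in a shared unit such as months, the sampling-time difference is treated as the primary criterion, and distance from the transmission time is used only to break ties (the text does not say how the two criteria are combined). The function name is hypothetical.

```python
from itertools import product

def select_time_points(source_times, recipient_times, transmission_time, maximal=False):
    """Pick one sampling time per partner for the minimal or maximal time lag dataset.

    Assumption: the gap between the two sampling times is the primary key and
    the summed distance from the transmission time is the tie-breaker.
    """
    def key(pair):
        ts, tr = pair
        gap = abs(ts - tr)
        distance = abs(ts - transmission_time) + abs(tr - transmission_time)
        return (gap, distance)

    pairs = product(source_times, recipient_times)
    return max(pairs, key=key) if maximal else min(pairs, key=key)

# Illustrative sampling times in months (not taken from the study)
source_times = [0, 12, 24]
recipient_times = [3, 14, 36]
print(select_time_points(source_times, recipient_times, transmission_time=1))
print(select_time_points(source_times, recipient_times, transmission_time=1, maximal=True))
```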
The number of unique common patterns appearing in a viral sequence set was used as the estimator of transmission direction. Transmission direction was determined respectively for the above two datasets. It should be noted that when a pattern appeared in a viral sequence set, it was only counted once, irrespective of whether it appeared only once or many times.
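As a minimal sketch of this estimator, assuming each viral sequence set is represented as a list of per-sequence pattern sets, the partner with more unique common patterns is taken as the source. The function names, the tie handling and the toy data are assumptions made for illustration.

```python
def count_unique_common_patterns(sequence_patterns, common_patterns):
    """Count common patterns present in a viral sequence set, each counted once."""
    present = set().union(*sequence_patterns)   # repeated occurrences collapse here
    return len(present & common_patterns)

def infer_source(set_a, set_b, common_patterns):
    """Return 'a' or 'b' for the partner predicted to be the source, or None on a tie."""
    n_a = count_unique_common_patterns(set_a, common_patterns)
    n_b = count_unique_common_patterns(set_b, common_patterns)
    if n_a == n_b:
        return None   # tie handling is not described in the text
    return "a" if n_a > n_b else "b"

# Toy example (illustrative data only)
common_patterns = {"p1", "p2", "p3"}
partner_a = [{"p1", "p2"}, {"p2", "p3", "x"}]
partner_b = [{"p1"}, {"p1", "y"}]
print(infer_source(partner_a, partner_b, common_patterns))
```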
Weight score and 10-fold cross validation
In order to improve the performance of our approach, each pattern was given a weighted score, W. The weighted score of the mth pattern was defined as the following:
Here, n0 is the total number of sources containing the mth pattern, n1 is the total number of recipients containing it, and N is the total number of transmission pairs. Thus, if the pattern appears in more viral sequence sets from sources than from recipients, its weighted score will be positive; otherwise its weighted score will be negative.
The sum of weighted scores for patterns appearing in a given viral sequence set, SCk, should be a better estimator for identifying transmission direction:
SC_k = \sum_{m=1}^{ss} W_m E_{mk}
where SCk is the score of the kth viral sequence set, Emk takes the value 1 or 0 according to whether or not the mth pattern appears in the kth viral sequence set, and ss is the total number of patterns.
In order to test the applicability of our approach to other datasets, 10-fold cross validation was performed. All 73 transmission pairs were randomly partitioned into 10 subsets. Nine of the 10 subsets were regarded as training data for obtaining the weight score of each pattern, and the other subset was used as testing data. The process was repeated 10 times, with each of the 10 subsets being used once as testing data. The results from these 10 rounds of testing were then combined to yield a single estimate.
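The weighting and cross-validation procedure might look like the sketch below. The weight formula is an assumption: the text gives only the sign behaviour and the symbols n0, n1 and N, so the code uses W = (n0 − n1) / N as one form consistent with that description. All function names and the toy data are hypothetical.

```python
import random

def train_weights(pairs):
    """Assumed weight W = (n0 - n1) / N for each pattern.

    pairs: list of (source_patterns, recipient_patterns), each a set.
    """
    n_pairs = len(pairs)
    weights = {}
    for source, recipient in pairs:
        for p in source:
            weights[p] = weights.get(p, 0.0) + 1.0 / n_pairs
        for p in recipient:
            weights[p] = weights.get(p, 0.0) - 1.0 / n_pairs
    return weights

def score(patterns, weights):
    """Sum of weights over the patterns present in one viral sequence set."""
    return sum(weights.get(p, 0.0) for p in patterns)

def cross_validate(pairs, k=10, seed=0):
    """k-fold cross validation; returns the fraction of correctly oriented pairs."""
    order = list(range(len(pairs)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        training = [pairs[i] for i in order if i not in held_out]
        weights = train_weights(training)
        for i in fold:
            source, recipient = pairs[i]
            if score(source, weights) > score(recipient, weights):
                correct += 1
    return correct / len(pairs)

# Toy data: sources carry patterns "b" and "c" that recipients lost (illustrative only)
toy_pairs = [({"a", "b", "c", f"s{i}"}, {"a", f"r{i}"}) for i in range(20)]
print(cross_validate(toy_pairs))
```

Weights are learned only from the training folds, so a pattern unseen in training contributes nothing to the score of a held-out pair.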
For comparison, the paraphyletic relationships of the 73 transmission pairs were determined. For each transmission pair, a phylogenetic tree based on DNA sequences (env C2-V5 region, genomic region 7050–7590) was reconstructed as follows. Basic Local Alignment Search Tool-selected GenBank controls were used as described in a previous study. Maximum likelihood trees were constructed using bootstrap analysis (100 replications) with PHYLIP (phylogeny inference package version 3.5c).
Common patterns are conserved as disease progresses
In total, 551 597 common patterns (shown in Supplemental Digital Content 2, http://links.lww.com/QAD/A214) were derived as conserved patterns in both R5 and non-R5 viral sequences from the coreceptor usage labeled dataset, whereas 8841 MRCA patterns were derived from R5 viral sequences.
A reliable estimator of transmission direction should not be influenced by time lags between transmission and sampling. We therefore investigated how the number of unique common patterns changed during disease progression in an individual. As shown in Fig. 1, for the nine patients tested, the number of unique common patterns barely changed during disease progression, compared with the large fluctuation in the number of unique MRCA patterns. The average numbers of unique common patterns in these nine patients were 146 034, 153 779, 148 297, 149 904, 135 242, 176 656, 148 192, 152 208, and 144 242; the average numbers of unique MRCA patterns were 597, 438, 594, 1175, 1356, 836, 675, 406, and 387, respectively. To further test the stability of the number of unique common patterns within individuals, the difference in the number of unique common patterns between patients was compared with that within individual patients over time. The differences between patients were tested using one-way analysis of variance (ANOVA) to compare the means of the number of unique common patterns over all of the time points. The difference in the number of unique common patterns between patients was statistically significant (F = 2.96, P < 0.01, one-way ANOVA) relative to the slight difference in the number of unique common patterns within an individual patient over time. Therefore, variation within individual patients in the number of unique common patterns should not significantly disrupt the determination of differences between patients.
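For readers unfamiliar with the test, the F statistic of a one-way ANOVA compares between-patient variance with within-patient variance. The sketch below computes it from scratch, with each group standing for one patient's pattern counts across time points. The numbers are invented for illustration and are not the study data.

```python
def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # variation of observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Invented counts for three hypothetical patients at three time points each
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

A large F indicates that differences between patients dominate the fluctuation within each patient, which is the property the estimator relies on.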
There are three peaks in the distribution of common patterns along the gp120 protein sequence (Fig. 2), indicating that these three regions may play essential roles in disease transmission. This agrees well with the fact that these three peaks are mostly superposable with three known conserved regions, C2, C3 and C4, in the gp120 sequence. Moreover, there are some common patterns that always exist, which may provide some clue for vaccine development and neutralizing antibody synthesis in further studies.
The number of unique common patterns is a reliable estimator of transmission direction
If the number of unique common patterns decreased from donor to recipient in a given transmission pair, transmission direction was regarded as correctly identified for that pair. Common patterns were used to analyze the minimal and maximal time lag datasets derived from 73 transmission pairs (Table 1). The performance of our approach gave accuracies of 82.2 and 79.5% for the minimal and maximal time lag datasets, respectively.
In order to improve the accuracy of our approach, each pattern was assigned a weighted score as described above (Methods), and the sum of weighted scores of the patterns present in the viral sequence set was calculated to estimate the transmission direction. As shown in Table 1, self-consistency testing and 10-fold cross validation gave the same results, indicating that it is possible to apply our improved approach to other datasets without affecting its accuracy. The highest accuracy (94.5%) was obtained when common patterns were applied to the minimal time lag dataset, and an almost equal accuracy (93.2%) was achieved with the maximal time lag dataset.
The time lag between transmission and sampling had little influence on the performance of our approach. To compare the accuracy of determining transmission direction for transmission pairs with a sampling time lag of less than 3 months against those with time lags ranging from 2 years to more than 18 years, we tested our approach on the minimal time lag dataset (Table 2). Transmission direction in all pairs with long time lags was determined correctly, whereas that in pairs with shorter time lags was determined with an accuracy of 92.5%. The difference in the accuracies for the different time lags is not statistically significant (P = 0.21, χ² test). These results indicate that our approach is not influenced by the time lag between infection and sampling.
In addition, viral subtype and transmission route had little effect on results obtained using our approach. Of the 73 transmission pairs, 19 and 54 pairs were infected with viruses of subtype B and other subtypes, respectively. Using our approach, transmission direction was estimated with almost equal accuracy for subtype B and the other subtypes (94.7 and 94.4%, respectively), and the difference in the accuracies for the different subtypes is not statistically significant (P = 0.96, χ² test). Similarly, in the 73 transmission pairs, there were three different transmission routes: transmission was from mother to child in 53 pairs, heterosexual in 14 pairs and homosexual in six pairs. Using our approach, the accuracy of estimation of transmission direction was 92.5% for mother-to-child transmission and 100% for both heterosexual and homosexual transmission. The differences in the accuracies for the different transmission routes are not statistically significant (P = 0.33, χ² test).
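The subtype comparison can be illustrated with a Pearson χ² test on a 2 × 2 table. The counts used here are an assumption: they are inferred from the reported percentages (18 of 19 subtype B pairs and 51 of 54 other pairs correct), and the test is computed without continuity correction, which may differ from the authors' exact procedure.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))   # upper tail of chi-square with 1 degree of freedom
    return chi2, p_value

# Rows: subtype B, other subtypes; columns: correct, incorrect.
# Counts are inferred from the reported accuracies (assumption).
chi2, p = chi_square_2x2(18, 1, 51, 3)
print(round(p, 2))   # 0.96
```

With expected counts this small, an exact test could be considered as an alternative to the χ² approximation.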
Comparison with phylogenetic analysis
For comparison, phylogenetic analysis was applied to the minimal time lag dataset using the DNA sequences of the same env C2-V5 region that were used in our method to determine transmission direction in the same 73 transmission pairs. If paraphyletic relationships from donors to recipients exist, it should be possible to determine transmission direction accurately. However, the accuracy obtained when phylogenetic analysis was applied was only 50.7%. These results indicate that phylogenetic analysis of paraphyletic relationships may not be effective or reliable in most real situations.
In this study, we have developed a novel approach for determining transmission direction between transmission pairs that has high accuracy irrespective of the time lag between infection and sampling. We identified a set of conserved patterns, called common patterns, in the env C2-V5 region of both R5 and non-R5 sequences whose number remains almost constant during disease progress in an individual, but decreases on disease transmission. Using the number of these unique common patterns as an estimator for determining the direction of transmission, our approach gives an accuracy of up to 94.5% and is not influenced by time lag between infection and sampling.
The approach taken here is based on two universal phenomena: the bottleneck effect and coreceptor switching. First, the bottleneck effect results in over 99% of the genetic diversity of the HIV-1 viral population within an individual being lost during transmission. All transmission routes, including mother-to-child transmission, heterosexual and homosexual transmission, are affected in this way. Due to the prevalence of the bottleneck effect, therefore, the number of unique common patterns should decrease with the loss of genetic variation in viral sequences during disease transmission. Second, coreceptor binding and switching are essential for AIDS establishment and disease progression. Coreceptor usage switching is generally considered as an essential indicator of disease progression, and common patterns are conserved during the switching process. The number of unique common patterns, therefore, should be stable as the disease progresses.
Previous approaches for determining transmission direction have only been applied to small datasets with a few criminal cases for the purpose of testing their performance. In this work, we constructed a large dataset containing viral sequence sets from 73 transmission pairs of known transmission direction to test the performance of our approach. Our approach achieved accuracies of up to 94.5%, and was not influenced by time lags between infection and sampling. In addition, different viral subtypes and transmission routes had little influence on the performance of our approach (Table 2). The dataset used in this study contains all of the currently available samples from transmission pairs with known transmission direction. However, the sample size is too small to investigate the accuracy of our method further for either a longer time lag (i.e., >5 years) or different geographic areas. The accumulation of experimental data may help to analyze the accuracy of our method further in practical applications in the future.
We compared our approach with current phylogenetic methods. Phylogenetic analysis of paraphyletic relationships gave an accuracy of only 50.7%. This may be explained by the disappearance of paraphyletic relationships over time due to the time lag in sampling, as identifying paraphyletic relationships relies on finding the most similar subset of sequences between the source and recipient, and similarity is lost over time as a result of viral mutations. In contrast, our approach looks for differences in the number of common patterns before and after transmission, and these differences increase over time as a result of viral mutation. Our results provide support for this position, as accuracy for transmission pairs with a sampling time lag of more than 2 years (100%) was higher than that for those with a sampling time lag of less than 3 months (92.5%). Moreover, compared with phylogenetic analysis, our approach has much lower computational complexity and is more objective, because none of the parameters used are selected based on experience.
As the virus evolves, the total number of unique common patterns may change over time, and such changes may affect the efficiency of the practical application of our method in the future. However, from current observation, the number of unique common patterns is steady over time. As shown in Fig. 1, the numbers of unique common patterns in the nine longitudinally sampled patients were steady over the period from 1983 to 1998. Moreover, we analysed all of the available sequences with known sampling year from the Los Alamos HIV-1 sequence database, and the number of unique common patterns was steady over the period from 1980 to 2010; the same results were obtained for different subtypes (data not shown). Although the stability of the number of unique common patterns makes the practical application of our method possible, we suggest that the change in the number of common patterns should continue to be monitored over time.
The approach developed here can be used to accurately determine the transmission direction of HIV between individuals, and its application should provide stronger evidence in criminal cases and lead to a clearer picture of the transmission network in a given geographic region. Geographic analysis is essential for reconstruction of the epidemiological history of viral populations, and has potential to lead to a better understanding of HIV transmission and the improvement of prevention programs.
The authors thank Ting-Rui Song from the Institute of Biophysics, Chinese Academy of Sciences, for participating in the construction of the phylogenetic trees and the verification of the algorithm.
X.P. conceived and designed the study, did the analyses, and were involved in the writing of the manuscript. J.Y. and M.G. were involved in the analysis of the model, in the collection of the dataset, and in the editing of the manuscript. All authors saw and approved the final version of the manuscript. All the work in this article is original and has never been published elsewhere.
This work was supported by grant 30870475 from the Natural Science Foundation of China, 2009CB918801 from the Ministry of Science and Technology of China, 2008ZX10001-003 from the Ministry of Health of China, and 104519-010 from the International Development Research Center, Ottawa, Canada. The funding source had no role in the design or conduct of the study, or in the collection, analysis, or interpretation of the data.
Conflicts of interest
We declare that we have no conflicts of interest.
1. Palca J. The case of the Florida dentist
2. Palca J. AIDS. CDC closes the case of the Florida dentist
3. LiVolsi VA. The mechanism of HIV infection in patients of the Florida dentist
4. International consultation on the criminalization of HIV transmission: 31 October–2 November 2007; Geneva, Switzerland. Joint United Nations Programme on HIV/AIDS (UNAIDS) Geneva, United Nations Development Programme (UNDP), New York, 2007. Reprod Health Matters
5. Lewis F, Hughes GJ, Rambaut A, Pozniak A, Leigh Brown AJ. Episodic sexual transmission of HIV revealed by molecular phylodynamics. PLoS Med
6. Metzker ML, Mindell DP, Liu XM, Ptak RG, Gibbs RA, Hillis DM. Molecular evidence of HIV-1 transmission in a criminal case. Proc Natl Acad Sci U S A
7. Bernard EJ, Azad Y, Vandamme AM, Weait M, Geretti AM. HIV forensics: pitfalls and acceptable standards in the use of phylogenetic analysis as evidence in criminal investigations of HIV transmission. HIV Med
8. Scaduto DI, Brown JM, Haaland WC, Zwickl DJ, Hillis DM, Metzker ML. Source identification in two criminal cases using phylogenetic analysis of HIV-1 DNA sequences. Proc Natl Acad Sci U S A
9. Lavie A, Schlichting I, Vetter IR, Konrad M, Reinstein J, Goody RS. The bottleneck in AZT activation. Nat Med
10. Regoes RR, Bonhoeffer S. The HIV coreceptor switch: a population dynamical perspective. Trends Microbiol
11. Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, Farzadegan H, et al. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol
12. Samleerat T, Braibant M, Jourdain G, Moreau A, Ngo-Giang-Huong N, Leechanachai P, et al. Characteristics of HIV type 1 (HIV-1) glycoprotein 120 env sequences in mother-infant pairs infected with HIV-1 subtype CRF01_AE. J Infect Dis
13. Frenkel LM, Mullins JI, Learn GH, Manns-Arcuino L, Herring BL, Kalish ML, et al. Genetic evaluation of suspected cases of transient HIV-1 infection of infants
14. Dickover R, Garratty E, Yusim K, Miller C, Korber B, Bryson Y. Role of maternal autologous neutralizing antibody in selective perinatal transmission of human immunodeficiency virus type 1 escape variants. J Virol
15. Verhofstede C, Demecheleer E, De Cabooter N, Gaillard P, Mwanyumba F, Claeys P, et al. Diversity of the human immunodeficiency virus type 1 (HIV-1) env sequence after vertical transmission in mother-child pairs infected with HIV-1 subtype A. J Virol
16. Wu X, Parast AB, Richardson BA, Nduati R, John-Stewart G, Mbori-Ngacha D, et al. Neutralization escape variants of human immunodeficiency virus type 1 are transmitted from mother to infant. J Virol
17. Tobin NH, Learn GH, Holte SE, Wang Y, Melvin AJ, McKernan JL, et al. Evidence that low-level viremias during effective highly active antiretroviral therapy result from two processes: expression of archival virus and replication of virus. J Virol
18. Contag CH, Ehrnst A, Duda J, Bohlin AB, Lindgren S, Learn GH, et al. Mother-to-infant transmission of human immunodeficiency virus type 1 involving five envelope sequence subtypes. J Virol
19. Roth WW, Zuberi JA, Stringer HG Jr, Davidson SK, Bond VC. Examination of HIV type 1 variants in mother-child pairs. AIDS Res Hum Retroviruses
20. Tovanabutra S, de Souza M, Sittisombut N, Sriplienchan S, Ketsararat V, Birx DL, et al. HIV-1 genetic diversity and compartmentalization in mother/infant pairs infected with CRF01_AE
21. Hoffmann FG, He X, West JT, Lemey P, Kankasa C, Wood C. Genetic variation in mother-child acute seroconverter pairs from Zambia
22. Sagar M, Laeyendecker O, Lee S, Gamiel J, Wawer MJ, Gray RH, et al. Selection of HIV variants with signature genotypic characteristics during heterosexual transmission. J Infect Dis
23. Nowak P, Schvarcz R, Ericzon BG, Flamholc L, Sonnerborg A. Follow-up of antiretroviral treatment in liver transplant recipients with primary and chronic HIV type 1 infection. AIDS Res Hum Retroviruses
24. Campbell MS, Gottlieb GS, Hawes SE, Nickle DC, Wong KG, Deng W, et al. HIV-1 superinfection in the antiretroviral therapy era: are seroconcordant sexual partners at risk? PLoS One
25. Liu Y, McNevin J, Cao J, Zhao H, Genowati I, Wong K, et al. Selection on the human immunodeficiency virus type 1 proteome following primary infection. J Virol
26. Zhu T, Wang N, Carr A, Nam DS, Moor-Jankowski R, Cooper DA, et al. Genetic characterization of human immunodeficiency virus type 1 in blood and genital secretions: evidence for viral compartmentalization and selection during sexual transmission. J Virol
27. Braga-Neto UM, Dougherty ER. Is cross-validation valid for small-sample microarray classification? Bioinformatics
28. Sanderson MJ, Wojciechowski MF. Improved bootstrap confidence limits in large-scale phylogenies, with an example from Neo-Astragalus (Leguminosae). Syst Biol
29. Edwards CT, Holmes EC, Wilson DJ, Viscidi RP, Abrams EJ, Phillips RE, et al. Population genetic estimation of the loss of genetic diversity during horizontal transmission of HIV-1. BMC Evol Biol
30. Gray RR, Tatem AJ, Lamers S, Hou W, Laeyendecker O, Serwadda D, et al. Spatial phylodynamics of HIV-1 epidemic emergence in east Africa