Phylogenetics is an attractive method for reconstructing HIV epidemics for three reasons: HIV evolves faster than it spreads and thus accumulates mutations as it spreads; phylogenetics reconstructs evolutionary history; and at the heart of any epidemic is transmission, which phylogenetics can therefore reconstruct. Here, we review developments over the past 18 months in phylogenetic inference of HIV transmission that explicitly take within-host diversity into account when reconstructing transmission events.
It is well known that HIV both diversifies and diverges in an infected person [1–4]. Diversification is the process that generates the genetic variability that exists at any given time point. It is from this diversity that immune escape and antiviral resistance are selected. The diversity can be thought of as a cloud of genetic variants. Divergence is the movement of this cloud away from the infecting variant(s). The rate of divergence is described by a molecular clock. The evolutionary process that dictates HIV evolution is a complex interaction between virus and host factors, as well as outside factors such as antiviral treatment. The outcome of these interactions is patient specific [5–7], disease stage dependent [8], and virus related [9,10]. From a modeling point of view, HIV evolution is described by a combination of neutral and selective processes [11].
Because HIV within-host diversity can be very large, joint HIV phylogenies from patients who are linked through transmission are not identical to their transmission history [12–14]. Note that while within-host diversity can be summarized by a single statistic such as mean pairwise genetic distance, it can also be described in more detail by a within-host phylogeny of the sampled virus variants, that is, a dendrogram that shows how the sampled HIV sequences are related to each other by descent (tree topology) and by the extent of nucleotide substitutions or time (branch lengths). Correspondingly, a joint phylogeny shows how sampled virus variants from more than one patient are evolutionarily linked to each other.
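As a concrete illustration of the summary statistic mentioned above, the following minimal Python sketch computes a mean pairwise genetic distance for aligned sequences from one host. It assumes an uncorrected p-distance (the proportion of differing sites) and skips sites with gaps or ambiguity codes. The function names and toy sequences are hypothetical and are not taken from any of the cited studies; a real analysis would typically use a distance corrected by a substitution model.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguity code are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in valid and y in valid:
            compared += 1
            if x != y:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def mean_pairwise_distance(seqs) -> float:
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignment of three hypothetical variants sampled from one host.
host_sample = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
diversity = mean_pairwise_distance(host_sample)
```

A single number like this discards the tree structure, which is why the within-host phylogeny carries more information for transmission reconstruction.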
The time point when a transmission occurred is not directly displayed in a joint HIV phylogeny; instead, the location in the joint phylogeny at which the recipient's HIV population joins the donor's HIV population simply relates to a random time point that indicates when a transmitted lineage coalesces with some lineage that can be reconstructed from the available sample in the donor. The difference between the transmission time point and the corresponding coalescence time point is known as the pretransmission interval [13,14], which quantifies the bias backwards in time. The magnitude of this bias is determined by the level of diversity in the donor at the time of transmission. Another effect of within-host diversity is known as incomplete lineage sorting [15,16], which results in disordered transmission events, for example, where it may seem as if a newborn child infected her mother [14,17]. The probability of such disordering also depends on the level of diversity and, additionally, on how much time has passed between the two transmission events. Although these effects of within-host diversity may seem discouraging for phylogenetic transmission reconstructions, counterintuitively, within-host diversity also provides an opportunity to better reconstruct the transmission history.
TAKING WITHIN-HOST DIVERSITY INTO ACCOUNT ENHANCES TRANSMISSION RECONSTRUCTION
At transmission, a relatively small number of virus particles is typically transmitted. Thus, with any diversity present in the donor, only a subset of the available genetic variants is transferred from one host (donor) to a new host (recipient), known as a genetic bottleneck. Consequently, the new HIV population in the recipient will typically have a smaller level of diversity, and importantly for our purpose of reconstructing the transmission event, the recipient's HIV population will sit inside the diversity of the donor's HIV phylogeny. Phylogenetically, the donor's HIV population is paraphyletic to the recipient's HIV population. Although this relationship has been described and observed in the past [14,18], recent research has systematically investigated what type of phylogenetic relationship to expect in different types of transmission.
Based on a coalescent model of within-host HIV evolution, systematic simulation experiments have shown that direct transmission (from host A to B) typically results in either a paraphyletic–monophyletic or a paraphyletic–polyphyletic donor-recipient joint phylogeny [19▪▪]. Figure 1 shows the prototypic joint phylogenies that can result from reconstructing the evolutionary history of the HIV populations in two epidemiologically linked hosts. Recipient monophyly results from transmission of a single genetic variant from the donor, and polyphyly when more than one variant is transmitted. When an intermediary link exists (A infects X, who is not sampled, and X later infects B), typically a paraphyletic–monophyletic donor-recipient joint phylogeny appears. Finally, when both A and B are infected by a common source, the joint A + B phylogeny is typically monophyletic–monophyletic. Theoretically, it was found that these expected relationships were robust under many parameter settings, when at least 20 HIV variants were sequenced from each host (A and B), and sampling occurred at about the same time in both hosts. It was also shown that with inadequate sampling of genetic variants, the phylogenetic relationships would be harder to correctly recover [19▪▪]. Similarly, with time, lineage death in the ongoing evolutionary process will eventually result in monophyletic–monophyletic phylogenies regardless of transmission history. Lineage death results from the fact that not all viruses produce viable offspring.
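The three topology classes described above can be told apart mechanically on a rooted two-host tree. The sketch below is a simplified illustration, not the procedure used in the cited studies: it assumes trees written as nested Python tuples, tip labels whose first character names the host (for example "A1"), and a classification rule based on counting the maximal clades that contain tips from only one host. It returns the short codes MM, PM, and PP, and it does not decide which host is the paraphyletic one in a PP tree, which would require a root-state reconstruction.

```python
def tips(tree):
    """Return all tip labels below a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in tips(child)]


def count_pure_clades(tree, host):
    """Count maximal subtrees whose tips all belong to `host`."""
    hosts_here = {label[0] for label in tips(tree)}
    if hosts_here == {host}:
        return 1
    if isinstance(tree, str):
        return 0
    return sum(count_pure_clades(child, host) for child in tree)


def classify_pair(tree, host_a="A", host_b="B"):
    """Classify a rooted two-host tree as MM, PM, or PP."""
    a = count_pure_clades(tree, host_a)
    b = count_pure_clades(tree, host_b)
    if a == 0 or b == 0:
        raise ValueError("both hosts must be present in the tree")
    if a == 1 and b == 1:
        return "MM"  # each host forms a single clade
    if a == 1 or b == 1:
        return "PM"  # one host forms one clade nested in the other
    return "PP"      # both hosts are split into several clades


# Hypothetical toy trees.
mm_tree = (("A1", "A2"), ("B1", "B2"))
pm_tree = ("A1", ("A2", ("B1", "B2")))
pp_tree = (("A1", ("B1", "B2")), ("A2", ("A3", "B3")))
```

With only a few tips per host, as in these toy trees, a PP relationship can easily be sampled as PM or MM, which is the sampling caveat raised in the text.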
Figure 2 shows how a joint virus phylogeny from three hosts and the underlying transmission history are combined in the corresponding virus population history. In this example, A first infects B, and later C. Later yet, samples are collected first from B, then from A, and last from C, yielding four, eight, and four sequences, respectively. Note that the number of sequences is too small for a real study (which should aim for at least 20 per host); we show a small number here merely to keep the illustration simple. The sequences result in a joint phylogeny that is a sample from the HIV populations in A, B, and C. The actual HIV populations in A, B, and C contain many more genetic variants than we have sampled. Using a coalescent framework, we can model the size and diversity of these populations based on the sample using a simple linear growth model, $N_i(t) = \alpha_i + \beta_i t$, where $N_i(t)$ is the effective population size of host $i$ at time $t$, $\alpha_i$ is the size of the infecting population, and $\beta_i$ is the slope of the population growth [12,19▪▪]. Note that the effective population size is related to the level of diversity in a host, and not to the census size that can be estimated from the viral load.
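To make the linear growth model concrete, the sketch below assumes it has the form N(t) = α + βt for a single host, with t measured from that host's infection. It also assumes the standard Kingman coalescent result that k lineages coalesce at a total rate of C(k, 2) / N(t) when time is measured in generations. The function names and parameter values are hypothetical and chosen only for illustration; they are not estimates from the cited work.

```python
from math import comb


def effective_population_size(t, alpha, beta):
    """Linear growth: N(t) = alpha + beta * t, with t >= 0 since infection."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return alpha + beta * t


def coalescence_rate(k, t, alpha, beta):
    """Total coalescence rate of k lineages at time t: C(k, 2) / N(t).

    Assumes the Kingman coalescent with time measured in generations.
    """
    return comb(k, 2) / effective_population_size(t, alpha, beta)


# Hypothetical donor: one infecting variant, growth of 5 per time unit.
alpha, beta = 1, 5
n_early = effective_population_size(10, alpha, beta)    # 51
n_late = effective_population_size(100, alpha, beta)    # 501

# Two lineages coalesce roughly ten times more slowly at t = 100
# than at t = 10 under these assumed values.
rate_early = coalescence_rate(2, 10, alpha, beta)
rate_late = coalescence_rate(2, 100, alpha, beta)
```

Under these assumptions, a transmission that happens late in the donor's infection draws from a larger effective population, so the transmitted lineage tends to coalesce further back in time, which lengthens the pretransmission interval.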
Due to the pretransmission interval, none of the transmission times are represented by any node in the joint phylogeny. The phylogeny does provide limits of when transmission could have occurred, however. When only one variant is transmitted, as in the A to B transmission, the most recent time is limited by the most recent common ancestor of all phylogenetic lineages that were sampled in B (MRCA B). Clearly, a larger sample may push this limit further back in time. The most distant time point when transmission could have occurred is estimated by MRCA A + B, and similarly a larger sample from A may push this limit towards the present. The possible time interval of when A infected B is shown as an alternating blue–red branch segment. When two, or more, unique variants are transmitted, as in the A to C transmission, the most recent time of transmission is limited by the most distant common ancestor that occurs among lineages found only in C. The most distant time point of the possible transmission interval is estimated by the first occurrence of all unique lineages that end up in C. The possible time interval when A infected C is shown as alternating blue–green branch segments leading to all green segments when the lineages must be in host C. Note also that the MRCA C has nothing to do with the time of transmission as this happens between random lineages in host A. Again, sampling more variants in A and C may reduce the possible time interval during which transmission could have occurred. Notice that a naïve interpretation, looking for when the transmitted lineages originate in A, would mislead about the order and timing of transmissions to B and C because C gets infected with lineages that happen to go further back in A even though C was infected after B.
The overall HIV population history shows how we are able to model and reconstruct the transmission history from a joint virus phylogeny that is based on a limited sample of HIV variants from each infected host (Fig. 2). A and B are related to each other by a paraphyletic–monophyletic phylogeny, A and C to each other by a paraphyletic–polyphyletic phylogeny, and B and C by a monophyletic–monophyletic phylogeny. These types of pairwise relationships were confirmed by massive simulations under different parameter values of α, β, and t[19▪▪]. The figure also shows that lineages that existed before sampling may have died out. The coalescent model can recapture the diversity that such lost variants may have contributed to at some point in the past, for example, estimating the diversity at time of transmission [20▪].
A recent study evaluating 955 transmission pair datasets with known epidemiological linkage confirmed the theoretical expectations of mainly observing paraphyletic–monophyletic and paraphyletic–polyphyletic phylogenies in direct transmissions (from A to B directly), and monophyletic–monophyletic phylogenies when A and B had a common source [21▪▪]. The study involved 272 previously published transmission chains, often sequenced in more than one genomic region and containing more than two hosts, decomposed into 955 genomic regions with a transmission pair. Overall, 52% of direct transmissions resulted in a detected paraphyletic–polyphyletic phylogeny, 37% in a paraphyletic–monophyletic phylogeny and 11% in a monophyletic–monophyletic phylogeny, whereas 76% of common source transmissions resulted in a monophyletic–monophyletic phylogeny. Significantly, paraphyletic–polyphyletic phylogenies dominated (66%) among mother-to-child transmissions, and paraphyletic–polyphyletic phylogenies were also more common in men who have sex with men (52%) than in heterosexual transmission (19%). Even though paraphyletic–polyphyletic phylogenies were observed quite frequently, transmission of more than one genetic variant is likely more common than suggested by the fraction of paraphyletic–polyphyletic phylogenies, and than previously expected, because of insufficient variant sampling and lineage death before samples were taken. The study also found that rooting the joint phylogeny is crucial, as paraphyletic–monophyletic phylogenies could be mis-rooted as monophyletic–monophyletic phylogenies or the transmission direction could be reversed. The best rooting was achieved by using several sequences from the matching HIV-1 subtype or circulating recombinant form as outgroup.
Lineage death over time, as well as the time of sampling relative to when transmissions occurred, may play unexpected tricks on the resulting joint phylogeny. In a recent study of a transmission that involved multiple variants, sampling of the donor much later than the recipient, together with very different Ne growth rates in the hosts, led to a paraphyletic–polyphyletic phylogeny that suggested that the recipient had infected the donor [20▪]. Thus, this study presents a cautionary tale: even though paraphyletic–polyphyletic phylogenies may indicate direct transmission, the direction of transmission (A to B or B to A) must be interpreted carefully. In this case, extensive simulations of the exact epidemiological scenario, explicitly taking sampling times into account, could reveal the true donor. When many phylogenetic lineages are transmitted, the chance of transmitting old lineages increases, hence increasing the probability that the recipient carries an older lineage than the donor at the time of sampling; indeed, in the study of the 955 real transmission pairs, it was found that paraphyletic–polyphyletic phylogenies proposed the wrong direction of transmission in 24% of the cases [21▪▪]. Similarly, a recent study of 33 index-partner pairs in the HPTN052 cohort also showed that simple root state reconstruction may mislead or be insufficient to determine the donor in linked transmissions [22]. This points out that caution must be exercised when evaluating individual cases, as case details may be very different from previous studies.
TAKING WITHIN-HOST DIVERSITY INTO ACCOUNT REVEALS UNDERLYING EPIDEMIC PROCESSES
Although it may be interesting to analyze individual transmission events between a donor and a recipient for a wide range of reasons, using HIV sequence data to identify how HIV spreads in a human population is particularly important because it gives access to epidemiological parameters that are otherwise difficult to estimate, such as transmission risks, underlying contact networks, and numbers of infections over time. Phylodynamic methods have shown much promise, but until recently they typically ignored within-host diversity and evolution. Recently, however, several efforts have been made to allow for within-host diversity of a transmitted pathogen, whether designed with HIV in mind, generically, or with other pathogens in mind [23–25].
De Maio et al.[26] developed SCOTTI, a generic method that takes within-host diversity into account when reconstructing transmission events in outbreak investigations using sequence data. SCOTTI is available as a BEAST 2 module [27], taking advantage of the rich BEAST environment. More recently, with deep next-generation sequence (NGS) data in mind, Skums et al.[28▪] developed QUENTIN, software that reconstructs transmission histories among multiple hosts while taking within-host evolution into account. Instead of a conventional phylogenetic approach, QUENTIN uses a graph-based approach that can estimate transmission direction from sequence data. Favoring scale-free transmission networks [29,30], it maximizes the probability of observing a transmission tree given the genetic network of observed virus sequences and an arc weight function. The transmission tree identifies transmission directions, the genetic network describes how the sequences are related to each other by single mutations, and the arc weights are equal to genetic distances between the virus populations studied.
Anticipating very large datasets from epidemics, as well as virus full genome NGS data, Wymant et al.[31▪▪] recently expanded the donor-recipient phylogenetic patterns described above (Fig. 1) to the epidemic level in the software PHYLOSCANNER. This software performs phylogenetic reconstructions in multiple windows across aligned HIV genomes from multiple patients. Based on the most parsimonious host label at nodes joining sequences from different patients, transmission directions among the patients are reconstructed. Aggregating the results from all windows, PHYLOSCANNER constructs possible transmission histories displayed as relationship graphs. This approach mitigates random reconstruction errors because greater credibility is given to those relationships that are observed more frequently. Because NGS data can have high error rates and may be sensitive to contamination, PHYLOSCANNER also identifies suspicious signals that typically differ from true phylogenetic events stemming from the within-host evolutionary process and transmission(s).
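The idea of labeling nodes by parsimony in each genomic window and then aggregating across windows can be sketched in a few lines. The code below is a toy illustration only and is not PHYLOSCANNER's implementation or interface: it assumes rooted binary trees written as nested tuples, tip labels whose first character names the host, a plain Fitch bottom-up pass to obtain the most parsimonious host label(s) at the root, and a simple majority vote with a hypothetical `min_support` threshold.

```python
from collections import Counter


def fitch_root_states(tree):
    """Fitch bottom-up pass on a binary tree: host labels optimal at the root."""
    if isinstance(tree, str):
        return {tree[0]}  # host = first character of the tip label
    child_sets = [fitch_root_states(child) for child in tree]
    shared = set.intersection(*child_sets)
    return shared if shared else set.union(*child_sets)


def window_call(tree, host_a="A", host_b="B"):
    """Suggest a direction for one window from the root host label."""
    states = fitch_root_states(tree)
    if states == {host_a}:
        return f"{host_a}->{host_b}"
    if states == {host_b}:
        return f"{host_b}->{host_a}"
    return "ambiguous"


def aggregate_windows(trees, min_support=0.5):
    """Majority vote over windows; returns (call, fraction of windows)."""
    if not trees:
        raise ValueError("need at least one window")
    calls = Counter(window_call(tree) for tree in trees)
    call, count = calls.most_common(1)[0]
    support = count / len(trees)
    return (call if support >= min_support else "unresolved"), support


# Three hypothetical windows for the same pair of hosts.
windows = [
    ("A1", ("A2", ("B1", "B2"))),
    ("A1", (("B1", "B2"), "A2")),
    (("A1", "A2"), ("B1", "B2")),
]
call, support = aggregate_windows(windows)
```

Here two of the three windows root in host A, so the vote favors A as the source, while the third window is uninformative. A root-state call of this kind can mislead in individual cases, so the aggregated call should be read as a hypothesis rather than as proof of direction.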
Within-host diversity was also recently taken into account when analyzing phylogenetic patterns that may result from different transmission network types. Giardina et al.[32] showed that HIV sequences sampled from epidemics that spread in archetypical network structures, characterized by different degree distributions and amounts of clustering, result in different phylogenetic patterns, thus making it possible to infer general epidemic transmission histories. This study showed that an HIV time-scaled phylogeny from many patients may be substantially different from the between-host transmission history, and that if within-host diversity is not taken into account, the phylogeny may be misinterpreted, leading to erroneous inference about the underlying epidemic contact network. Although within-host evolution may display a disordered phylogeny vis-a-vis the transmission events, importantly, the diversification process also adds discriminatory power to differentiate between different types of contact networks.
Because within-host diversity causes significant backwards bias on infection times and may disorder transmission events compared with the actual transmission history, phylodynamic estimates of numbers of infected hosts may become severely overestimated if within-host HIV diversity is ignored. Volz et al.[33] showed that a multiscale coalescent, which takes within-host diversity into account, can accurately estimate the number of infected hosts in growing HIV epidemics. In an analysis of a large outbreak among intravenous drug users in Latvia [30,34], the multiscale coalescent gave estimates of cumulative numbers of infected hosts close to the cumulative number of diagnoses in the epidemic. Surprisingly, working with only a single sequence per host (which currently is the standard in public health databases), the multiscale coalescent is capable of estimating the within-host diversity in infected patients in an epidemic.
Given adequate data, phylogenetic analysis of HIV sequences can often reconstruct transmission direction. Typically, one can infer from HIV sequence data that direct transmission has occurred when a paraphyletic–polyphyletic tree is observed, that transmission occurred either directly or indirectly from A to B when a paraphyletic–monophyletic tree is observed, and that an unsampled person infected both A and B when a monophyletic–monophyletic tree is observed. With either too few sequences per host or too long a time since transmission, these patterns become more uncertain. Additional caution is also called for when sampling times vary between hosts, and when diversification rates are very different between hosts. In-depth analyses, which at this time are computationally expensive, may reveal transmission direction when the epidemiological and phylogenetic patterns are complicated. Recent software tools have exploited the theoretical expectations resulting from HIV transmission and added significant realism to modern phylodynamic inference of HIV epidemics. Future development of epidemiological models that include within-host evolution in even greater detail may further improve epidemiological reconstruction and prediction.
Financial support and sponsorship
The study was supported by NIH NIAID grant R01 AI087520.
Conflicts of interest
There are no conflicts of interest.
REFERENCES AND RECOMMENDED READING
Papers of particular interest, published within the annual period of review, have been highlighted as:
- ▪ of special interest
- ▪▪ of outstanding interest
1. Shankarappa R, Margolick JB, Gange SJ, et al. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol 1999; 73:10489–10502.
2. Leitner T, Halapi E, Scarlatti G, et al. Analysis of heterogeneous viral populations by direct DNA sequencing. BioTechniques 1993; 15:120–126.
3. Wolfs TFW, Zwart G, Bakker M, Goudsmit J. HIV-1 genomic RNA diversification following sexual and parenteral virus transmission. Virology 1992; 189:103–110.
4. McNearney T, Hornickova Z, Markham R, et al. Relationship of human immunodeficiency virus type 1 sequence heterogeneity to stage of disease. Proc Natl Acad Sci U S A 1992; 89:10247–10251.
5. Halapi E, Leitner T, Jansson M, et al. Correlation between HIV sequence evolution, specific immune response and clinical outcome in vertically infected infants. AIDS 1997; 11:1709–1717.
6. Bagnarelli P, Mazzola F, Menzo S, et al. Host-specific modulation of the selective constraints driving human immunodeficiency virus type 1 env gene evolution. J Virol 1999; 73:3764–3777.
7. Salemi M. The intra-host evolutionary and population dynamics of human immunodeficiency virus type 1: a phylogenetic perspective. Infect Dis Rep 2013; 5:e3.
8. Lee HY, Perelson AS, Park SC, Leitner T. Dynamic correlation between intrahost HIV-1 quasispecies evolution and disease progression. PLoS Comput Biol 2008; 4:e1000240.
9. Hollingsworth TD, Anderson RM, Fraser C. HIV-1 transmission, by stage of infection. J Infect Dis 2008; 198:687–693.
10. Lythgoe KA, Fraser C. New insights into the evolutionary rate of HIV-1 at the within-host and epidemiological levels. Proc Biol Sci 2012; 279:3367–3375.
11. Leitner T. The puzzle of HIV neutral and selective evolution. Mol Biol Evol 2018; 35:1355–1358.
12. Romero-Severson E, Skar H, Bulla I, et al. Timing and order of transmission events is not directly reflected in a pathogen phylogeny. Mol Biol Evol 2014; 31:2472–2482.
13. Leitner T, Albert J. The molecular clock of HIV-1 unveiled through analysis of a known transmission history. Proc Natl Acad Sci U S A 1999; 96:10752–10757.
14. Leitner T, Fitch WM. The phylogenetics of known transmission histories. In: Crandall KA, editor. The evolution of HIV. Baltimore, MD: Johns Hopkins Univ. Press; 1999.
15. Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol 1988; 5:568–583.
16. Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genet 2006; 2:e68.
17. Leitner T, Escanilla D, Franzén C, et al. Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis. Proc Natl Acad Sci U S A 1996; 93:10864–10869.
18. Scaduto DI, Brown JM, Haaland WC, et al. Source identification in two criminal cases using phylogenetic analysis of HIV-1 DNA sequences. Proc Natl Acad Sci U S A 2010; 107:21242–21247.
19▪▪. Romero-Severson EO, Bulla I, Leitner T. Phylogenetically resolving epidemiologic linkage. Proc Natl Acad Sci U S A 2016; 113:2690–2695.
Establishes expected phylogenetic patterns in direct, indirect, and common source transmissions by simulations under a wide range of transmission situations.
20▪. Romero-Severson EO, Bulla I, Hengartner N, et al. Donor-recipient identification in para- and poly-phyletic trees under alternative HIV-1 transmission hypotheses using approximate Bayesian computation. Genetics 2017; 207:1089–1101.
Highlights that simple phylogenetic interpretation of apparent phylogenetic patterns may mislead about transmission direction. In-depth analyses can nevertheless recover transmission direction.
21▪▪. Leitner T, Romero-Severson E. Phylogenetic patterns recover known HIV epidemiological relationships and reveal common transmission of multiple variants. Nat Microbiol 2018; 3:983–988.
Evaluates phylogenetic patterns in real HIV sequence data from 955 transmission pair genomic regions. Theoretical expectations of typical phylogenetic patterns are confirmed and transmission of multiple variants appears more common than previously thought.
22. Rose R, Hall M, Redd AD, et al. Phylogenetic methods inconsistently predict direction of HIV transmission among heterosexual pairs in the HPTN052 cohort. J Infect Dis 2018; [Epub ahead of print].
23. Klinkenberg D, Backer JA, Didelot X, et al. Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks. PLoS Comput Biol 2017; 13:e1005495.
24. Didelot X, Fraser C, Gardy J, Colijn C. Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks. Mol Biol Evol 2017; 34:997–1007.
25. Ypma RJ, van Ballegooijen WM, Wallinga J. Relating phylogenetic trees to transmission trees of infectious disease outbreaks. Genetics 2013; 195:1055–1062.
26. De Maio N, Wu CH, Wilson DJ. SCOTTI: efficient reconstruction of transmission within outbreaks with the structured coalescent. PLoS Comput Biol 2016; 12:e1005130.
27. Bouckaert R, Heled J, Kühnert D, et al. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 2014; 10:e1003537.
28▪. Skums P, Zelikovsky A, Singh R, et al. QUENTIN: reconstruction of disease transmissions from viral quasispecies genomic data. Bioinformatics 2018; 34:163–170.
Development of a software that takes within-host diversity into account from deep next-generation sequence (NGS) data when reconstructing transmission histories.
29. Leigh Brown AJ, Lycett SJ, Weinert L, et al. Collaboration UHDR. Transmission network parameters estimated from HIV sequences for a nationwide epidemic. J Infect Dis 2011; 204:1463–1469.
30. Graw F, Leitner T, Ribeiro RM. Agent-based and phylogenetic analyses reveal how HIV-1 moves between risk groups: injecting drug users sustain the heterosexual epidemic in Latvia. Epidemics 2012; 4:104–116.
31▪▪. Wymant C, Hall M, Ratmann O, et al. PHYLOSCANNER: inferring transmission from within- and between-host pathogen genetic diversity. Mol Biol Evol 2017; [Epub ahead of print].
Development of a software for large-scale analyses of epidemics with many hosts with full genome NGS data that takes within-host diversity into account.
32. Giardina F, Romero-Severson EO, Albert J, et al. Inference of transmission network structure from HIV phylogenetic trees. PLoS Comput Biol 2017; 13:e1005316.
33. Volz EM, Romero-Severson E, Leitner T. Phylodynamic inference across epidemic scales. Mol Biol Evol 2017; 34:1276–1288.
34. Balode D, Ferdats A, Dievberna I, et al. Rapid epidemic spread of HIV type 1 subtype A1 among intravenous drug users in Latvia and slower spread of subtype B among other risk groups. AIDS Res Hum Retroviruses 2004; 20:245–249.