HIV Care Prioritization Using Phylogenetic Branch Length : JAIDS Journal of Acquired Immune Deficiency Syndromes

Secondary Logo

Journal Logo

Translational Research

HIV Care Prioritization Using Phylogenetic Branch Length

Moshiri, Niema PhDa; Smith, Davey M. MDb; Mirarab, Siavash PhDc

Author Information
JAIDS Journal of Acquired Immune Deficiency Syndromes 86(5):p 626-637, April 15, 2021. | DOI: 10.1097/QAI.0000000000002612



The structure of the HIV transmission networks can be dictated by just a few individuals. Public health intervention, such as ensuring people living with HIV adhere to antiretroviral therapy and remain virally suppressed, can help control the spread of the virus. However, such intervention requires using limited public health resource allocations. Determining which individuals are most at risk of transmitting HIV could allow public health officials to focus their limited resources on these individuals.


Molecular epidemiology can help prioritize people living with HIV by patterns of transmission inferred from their sampled viral sequences. Such prioritization has been previously suggested and performed by monitoring cluster growth. In this article, we introduce Prioritization using AnCesTral edge lengths (ProACT), a phylogenetic approach for prioritizing individuals living with HIV.


ProACT starts from a phylogeny inferred from sequence data and orders individuals according to their terminal branch length, breaking ties using ancestral branch lengths. We evaluated ProACT on a real data set of 926 HIV-1 subtype B pol data obtained in San Diego between 2005 and 2014 and a simulation data set modeling the same epidemic. Prioritization methods are compared by their ability to predict individuals who transmit most after the prioritization.


Across all simulation conditions and most real data sampling conditions, ProACT outperformed monitoring cluster growth for multiple metrics of prioritization efficacy.


The simple strategy used by ProACT improves the effectiveness of prioritization compared with state-of-the-art methods that rely on monitoring the growth of transmission clusters defined based on genetic distance.


The transmission of HIV resembles scale-free networks,1 likely because of the scale-free properties of sexual contacts and injection drug uses.2,3 As a result, public health intervention may be more effective when targeted at people living with HIV who are more likely to grow the transmission network. However, the best method to target individuals for specific interventions remains an open question.

Continued treatment is one way of reducing transmissions. Antiretroviral therapy (ART) effectively suppresses the HIV virus in most cases, stops the progression of the disease, and prevents onward transmission, provided the individual continuously adheres to the treatment.4 In many (although not all) health care systems, ART is made available routinely to newly diagnosed patients, but further care is needed. Not every diagnosed person initiates ART, and not all cases of ART initiation lead to a sustained suppression through time. Individuals who start ART but fail to sustain it or who are otherwise unsuppressed can still infect others. Thus, a possible intervention is to help diagnosed individuals to stay on ART and to remain continually suppressed.5 Such interventions require the allocation of clinical staff who would follow-up with patients to provide them further assistance in sustenance of ART. The health system can also provide increased viral load testing to adhering individuals to ensure suppression.

Another form of intervention targets undiagnosed and HIV-negative individuals. The health system can use partner tracing to identify the sexual partners of diagnosed individuals as best as possible, test these individuals, and offer them either treatment for positives or PrEP prevention for negatives.6 Although contact tracing may have ethical concerns, it is a common and appropriate tool for controlling infectious diseases (eg, HIV, Tb, and SARS-CoV-2) when delivered by trained public health staff.7 Contact tracing for HIV is resource-intensive and is often viewed as insufficiently used.8 Community-level reductions in viral loads are associated with decreased HIV incidence rates9 (although empirical data on effectiveness of contact tracing are scant10).

These interventions are costly and cannot be undertaken for every known individual. If diagnosed people at risk of not being suppressed can be predicted accurately, the public health system can prioritize follow-up efforts to these individuals. For tracing, care provided to positive contacts and PrEP given to high-risk negative contacts can decrease HIV spread within the community only if enough unsuppressed positives and high-risk contacts are found; thus, individuals with many such contacts should be prioritized. Moreover, if these high-priority individuals show associations with geographical or demographic groups (beyond what is known), the public health system can design strategies for further outreach, testing, and PrEP administration for the impacted groups. The goal of prioritization is to increase the number of prevented transmissions for a fixed budget by increasing the chance of finding the right targets for follow-ups, tracing, and testing. However, predicting tendency for future transmissions is difficult, especially if we aim to be more specific than broad demographic characteristics.

Molecular epidemiology suggests a solution: prioritizing based on patterns of transmission inferred from HIV sequence data.1,11–17 The inference of transmission networks using phylogenetic or distance-based methods has been the subject of much research.18–21 However, in this work, instead of being concerned with inferring exact patterns of transmissions, we ask the following question: given molecular data from a set of sequenced HIV-positive individuals (“samples” for short), who should be prioritized for further intervention?

Prioritizing care based on molecular epidemiology has been studied recently. Wertheim et al15 present a method for prioritizing samples based on performing transmission clustering (ie, grouping individuals with low viral genetic distance into transmission clusters) and ordering clusters by growth rate. On a large data set from New York, they show that the approach is able to predict individuals who have relatively larger numbers of transmission links in the near future. Moshiri et al22 have studied the same question in simulations and have shown that monitoring cluster growth can be used for predicting future transmissions substantially better than a random guess, whether clusters are defined using genetic distances or using phylogenetic methods. Balaban et al23 showed in simulations that using an approach similar to that of Wertheim et al15 but defining clusters using phylogenetic trees gives a small but consistent improvement over clusters defined using distances. McLaughlin et al24 developed a predictive model in which they combined viral lineage-level diversification rates inferred from phylogenetic trees with sociodemographic and clinical data, and after aggregating by patients' area of residence, they were able to predict the distribution of new HIV cases in British Columbia, Canada, between 2008 and 2018 with higher predictive power than a phylogenetically uninformed model.

In this article, we introduce a new method for ordering samples based on their phylogenetic relationships. Instead of relying on clustering individuals and then ordering clusters based on their growth, we seek to order individuals without clustering, without relying on parametric models, and without incorporating demographic information about patients. Instead, we seek to simply exploit patterns in the phylogeny and particularly, in branch lengths.


Prioritization using AnCesTral edge lengths (ProACT) takes as input the inferred phylogenetic tree of sampled HIV viruses, rooted using an out-group or clock-based methods (eg, MinVar-root25). ProACT simply orders samples in an ascending order of incident branch length of their associated virus, breaking ties based on the incident branch lengths of parent nodes, then those of grandparent nodes, etc., and ultimately by sampling times if available (see Fig. 1, Supplemental Digital Content, for a formal description). ProACT is very fast (scaling quasilinearly), and scalable methods exist for inferring26,27 and rooting25 very large trees.

ProACT is motivated and tested in a context similar to those health care systems that enjoy enough resources to offer ART to all (or most) diagnosed individuals. Thus, each sample can be assumed to be given ART at a time close to when their HIV is sequenced, but they may fail to be suppressed for the remainder of their life. These conditions describe the current common practice of care in many places and are increasingly adopted elsewhere.

To motivate ProACT, we start with the observation that, in a 10-year epidemic simulation (described in detail below), when a phylogeny is inferred from sequences obtained at a given time point in an epidemic, the more a node transmits, the shorter its incident branch length tends to be (Figs. 1D, E and Fig. 2, Supplemental Digital Content, Using the Kendall tau-b test,28 we found a statistically significant anticorrelation between the incident branch lengths of individuals sampled within the first 9 years of the epidemic and the number of individuals they infected during year 10, regardless of whether we use true (τB = −0.0431, p ≪ 10−10) or inferred (τB = −0.0354, p ≪ 10−10) phylogenies. Although not obvious, this observation can be explained by the constraints placed on the viral phylogeny by the transmission history (Figs. 1A–C).

A, Individual A transmits to individuals B and C at times t1 and t2. B, Viral samples are obtained at times tA, tB, and tC. The viral phylogeny is constrained by each transmission event's bottleneck. The most likely phylogeny matches the transmission history (left), but with deeper coalescences (less likely), it may not match (right). C, Moving from tB to tC, the branch length incident to individual A shortens on the addition of individual C in the likely event that the coalescence of the lineage from C with the lineage from A is more recent than its coalescence with the lineage from B (left), or it remains constant in the event of a less likely deeper coalescence (right). Regardless, the length of this branch never increases. In simulation, we can observe this trend: as time progresses, the incident branch lengths tend to decrease, both in true (see Fig. 2, Supplemental Digital Content, and inferred (D) phylogenies, and as the number of transmissions from a given individual increases, incident edge lengths tend to decrease. Patterns are similar with the true simulated tree (“True”) or the tree inferred from sequences (“Est.”) (E).

In many places, HIV samples are typically sequenced on beginning ART. Let us assume for simplicity that every individual in the given data set has at some point initiated ART; then, future transmissions by individuals in the data set can happen only if the source stops ART or is otherwise unsuppressed. If, in the future, individual u transmits to individual v, there are 2 possible scenarios regarding the placement of the leaf corresponding to v in the existing phylogeny: (1) v is placed on the edge incident to u, so the edge incident to u will shorten, or (2) v is placed further up the tree, so the edge incident to u will remain the same length. Although scenario 2 is possible, scenario 1 is more likely.29 Thus, as time goes by, a terminal branch can only shorten or stay fixed, and it will most often shorten because of transmissions from the individual represented by that branch. This pattern, easily observed in simulations (Fig. 1D), leads to shorter branches for samples who have transmitted recently. The first time a sample infect others, its terminal branch length likely decreases, and further transmissions further decrease the length (Fig. 1D). Thus, one expects nodes with smaller incident branch length to be more likely to have transmitted since their sampling time. Moreover, they are more likely not to be suppressed compared with individuals on long branches. Because only individuals who are not suppressed transmit, a higher probability of a lack of suppression makes an individual a good candidate for intervention to ensure suppression.


We used FAVITES to simulate a sexual contact network, transmission network, viral phylogeny, and viral sequence. We chose FAVITES parameters (see Table 2, Supplemental Digital Content, to emulate HIV transmission in San Diego (2005–2014). We mostly used the parameters from Moshiri et al22 with some modifications to better capture reality. We explored the impact of changing 3 of the main parameters (Table 1) to obtain 9 model conditions (each replicated 20 times).

TABLE 1. - HIV Simulation Parameters
Parameter Values
ART initiation rate (λ+, year−1) 1, 2, 4
ART termination rate (λ, year−1) 0.12 (0.25x), 0.24 (0.5x), 0.48 (1x), 0.96 (2x), 1.92 (4x)
Expected degree (Ed) 10, 20, 30
Values for the base model condition are shown in bold.

Sexual Contact Network

To simulate scale-free sexual contact networks, Moshiri et al22 used the Barabási–Alber model.30 HIV sexual networks also typically include densely connected communities,31 a property the Barabási–Albert model fails to directly model. To add communities, we independently simulated 20 Barabási–Albert sexual contact networks, each with 5000 individuals, and connected these communities randomly akin to the Erdős–Rényi model.32 In the base condition, the expected degree of connection within the community was chosen to be 10, and the expected degree of connection outside the community was set to 1. Estimates from the literature put the number of contacts at 3–4 during a single year.33 Because our simulated sexual contacts remain static over the 10 year simulation period, we explore mean degrees (Ed) between 10 and 30.

Transmission Model

Transmissions are modeled using a compartmental epidemiological model with 5 states (a simplified version of the Granich et al34 model): Susceptible (S), Acute HIV Untreated (AU), Acute HIV Treated (AT), Chronic HIV Untreated (CU), and Chronic HIV Treated (CT). Uninfected individuals (S) can only transition to state AU. Each of the infected states I = {AU, AT, CU, CT} has its own “infectiousness rate” λS. An uninfected individual u within nx sexual partners in state x ∈ I transitions from S to AU by a Poisson process with rate λu=xInxλS,x. Because ART significantly reduces the risk of transmission, rates are chosen such that λS,AU > λS,CU > λS,AT > λS,CT ≈ 0. We refer to λAU,AT = λCU,CT as the rate of ART initiation (λ+), and we refer to λAT,AU = λCT,CU as the rate of ART termination (λ). We explore increasing or decreasing λ by up to 4x, and we explore increasing λ+ by up to 4x (Table 1). At the beginning of the simulation, all initially infected (ie, “seed”) individuals are distributed among the 4 infected states according to their steady-state proportions (Moshiri et al22 put all of them in the state AU). The steady-state proportions were determined empirically (see Fig. 3, Supplemental Digital Content,

Viral and Sequence Evolution

The viral phylogeny evolves inside the transmission tree under a coalescent model with logistic within-host viral population growth and a bottleneck event at the time of transmission (ie, initial viral population size is 1).35 This process produces a separate viral phylogeny for each seed individual. These trees were merged by simulating a seed tree with one leaf per seed node under a nonhomogeneous Yule model36 with rate function λ(t) = exp(−t2) + 1 scaled to have a height of 25 years to match San Diego estimates (see Moshiri et al22 for details). Once the true tree was available, a mutation rate was sampled for each branch independently from a truncated normal random variable from 0 to infinity with a location parameter of 0.0008 and a scale parameter of 0.0005 to scale branch lengths from years to expected number of persite mutations.22 Sequences were evolved according to the GTR model.37 Unlike Moshiri et al22 who sampled all infected individuals in year 10 of simulations, we sample viral sequences the first time an individual begins ART (ie, state AT or CT). The goal is to model standards of care in the health care systems that afford widespread sequencing.

Tree Inference

Instead of providing ProACT with the simulated tree, using FastTree 2,26 a phylogenetic tree was inferred under the GTR+Γ model from the sequences obtained in the 6 first 9 years of the simulation, leaving the 10th year for the evaluation step. These trees were then MinVar-rooted.25 To measure the impact of imperfect sampling, we randomly subsampled the sequence data sets (at 25%, 50%, or 75% levels) before inferring phylogenies.

PIRC San Diego Data Set

To test ProACT on real data, we used an alignment of 926 HIV-1 subtype B pol sequences collected by the UC San Diego Primary Infection Resource Consortium (PIRC) between 1996 and 2018 in San Diego. PIRC is one of the largest longitudinal cohorts of samples in the United States. The PIRC design strives to include acute infections (as much as 40% of individuals are recruited during early infection). A phylogenetic tree was inferred from the alignment under the GTR+Γ model using FastTree 226 and was MinVar-rooted.25 For each decile, the tree was pruned to only contain samples obtained up to the end of that decile using TreeSwift.38

Evaluation Procedure

ProACT was run on each of the estimated trees of simulated and real data. The metrics we use for evaluation are different between simulated and real data.

Simulated Data


Because the true transmission histories are known in simulation, we simply average the number of infections caused by highly prioritized individuals in a 1-year time frame after the prioritization (ie, year 10) to obtain a raw outcome measure. Then, we measure the cumulative moving average (CMA) of the outcome measure by the top samples. The higher the CMA in prioritization, the higher the number of future transmissions from these top individuals, and thus, the higher the effectiveness of the prioritization. Let A = {1,…,n} denote the n individuals sampled during years 1–9 of our simulations. Let c(i) denote the number of individuals directly infected by individual i in year 10 of the simulations. For any subset of individuals, sA,letC(s)=1|s|isc(i) denote their average c(i). For an ordering (x1,…,xn) of A, the (unadjusted) CMA up to i is C({x1,…,xi}).

Adjusted Transmissions/Person

Sorting individuals by their transmissions in year 10 enables us to compute the optimal CMA curve, and the mean number of transmissions gives us the expected value of the CMA for random prioritization. Across experimental conditions, the maximum and random expectations vary. Thus, to enable proper comparison of effects of prioritization across conditions, we also report an adjusted CMA normalizing above the random prioritization and over the optimal prioritization. Let (o1,…,on) denote the optimal ordering (ie, in a descending order of c(i)), with ties broken arbitrarily. We define the adjusted CMA up to i as C({x1,,xi})C(A)C({o1,,oi})C(A). We use this equation to measure the effectiveness of a selection of the top i individuals from each ordering of all individuals (exploring i for 1%–10% of all samples; ie, |A|10). For this metric, 1 indicates the optimal ordering and 0 indicates an ordering that is no better than random (a negative value indicates an ordering that is worse than random).

Kendall tau-b

We also use the Kendall tau-b coefficient to measure the correlation between the optimal ordering and the ordering obtained using each method. The Kendall28 tau-b is a rank correlation coefficient adjusted for ties with values ranging between −1 and 1, with −1 signifying perfect inversion, 1 signifying perfect agreement, and 0 signifying the absence of association.

Real Data

The sequences were sorted in an ascending order of sample time and were divided into deciles. For each decile, 2 sets are formed: past (sequences up to the decile) and future (sequences after the decile). A tree was inferred from the sequences in pre as described earlier and was used as input to ProACT. We then evaluated how our outcome measure correlates with the position of each individual in the ordering. We quantify the correlation using the Kendall tau-b.

To define outcome measures, we cannot rely on infections by individuals. Instead, we use the approach used by Wertheim et al.15 We compute pairwise distances between each sequence in pre and each sequence in post under the Tamura-Nei 93 (TN93) model39 as in HIV-TRACE.19 A natural function to compute the risk of a given individual u in pre, similar to that proposed by Wertheim et al,15 is to simply count the number of individuals in post who are genetic links to u, ie, vpostI(d(u,v)1.5%). In other words, the score function is simply a step function with value 1 for all distances less than or equal to 1.5% and 0 for all other distances. However, the selection of 1.5% as the distance threshold, despite being common practice in many HIV transmission clustering analyses, is somewhat arbitrary, and a step function exactly at this threshold may be overly strict (eg, should a pairwise distance of 1.51% be ignored?).

To generalize this notion of scoring links, we used 3 analytical score functions: the step function, a sigmoid function with the choice of λ = 100 and λ = 5 (smoother versions of the step function), and an empirical smooth step function learned from the data by fitting a mixture model onto the distribution of pairwise TN93 distances (see Fig. 4, Supplemental Digital Content, For each function, for each decile, we performed a Kendall28 tau-b test to compare the prioritization approaches. To generate a null distribution for visualization, we randomly shuffled the individuals in pre; however, P values are computed using the tau-b test (not the empirical shuffling).


Simulation Results

Default Condition

ProACT dramatically increased the performance compared with random ordering according to all of our outcome measures (Fig. 2). Focusing on the transmissions per person measure, although the population mean was 0.05, the ProACT's CMA was close to 0.15 for the top 1% of prioritized samples and gradually reduced to 0.1 for the top 10% (Fig. 2A). The top 1000 individuals in the ProACT ordering (3% of the population) transmitted 0.12 times (median across our 20 replicates), which was 2.4× higher than the median population average (Fig. 2C; see also Fig. 5, Supplemental Digital Content, for numbers other than 1000). As desired, selecting fewer people from the top of ProACT prioritization resulted in more transmissions per person (Fig. 2A). Compared with optimal ordering, however, the adjusted score both increased and decreased as more individuals were selected (Fig. 2B). The adjusted metric shows that although ProACT substantially outperformed random ordering, it did not come close to the effectiveness that could be achieved using the (hypothetical) perfect ordering. The Kendall tau-b also showed a positive correlation between ProACT and optimal ordering; although the correlation coefficient is far from perfect (Fig. 2D), the correlations are statistically significant in all replicates (p < 10−9; see Fig. 6, Supplemental Digital Content,

Effectiveness of prioritization on simulated data sets. The simulations were 10 years in length, prioritization was performed 9 years into the simulation, and the effectiveness of prioritization was computed during the last year of the simulation using 4 metrics (A–D). “Cluster Growth” denotes prioritization by inferring transmission clusters using HIV-TRACE at year 9 of the simulation and sorting clusters in a descending order of growth rate since year 8. All curves were calculated using 20 simulation replicates. A, Cumulative moving average of the number of transmissions per person across the first decile of prioritized samples for the default simulation parameter set (see Fig. 8, Supplemental Digital Content, for all model conditions, which show similar patterns). The horizontal axis depicts the quantile of highest prioritized samples (eg, x = 0.01 denotes the top percentile), and the vertical axis depicts their average number of transmissions per person. Global average across all individuals (ie, expectation under random ordering) is shown in red. The curves labeled with percentages denote subsampled data sets. B, CMA of adjusted number of transmissions per person for the default model condition (see Fig. 9, Supplemental Digital Content, for all model conditions, which show similar patterns). For adjusted transmissions/person, 1 indicates the optimal ordering, and 0 indicates random ordering. All other settings are similar to part a. C, Average of the raw number of transmissions per person for the top 1000 individuals (see Fig. 5, Supplemental Digital Content, for other counts) in a prioritized list vs. simulation parameter set (1000 individuals correspond to 1%–6% of all individuals across conditions). The violin plots are across 20 replicates and contain box plots with medians shown as white dots. Red horizontal lines show population mean (ie, random prioritization). D, Kendall tau-b correlation between the optimal ordering of samples (ie, based on their number of transmissions in year 10) and the orderings by the 2 prioritization methods. See Figure 7, Supplemental Digital Content, for subsampled data. Distributions are across 20 replicates and are shown for each simulation condition.

Wertheim et al15 have presented a method for prioritizing samples by clustering individuals based on viral genetic distance, tracking the size of each cluster over time, and prioritizing clusters in a descending order of the growth rate. The approach can be extended to also order individuals (ie, individuals belonging to clusters with high growth rates are prioritized higher; see Methods for details). ProACT consistently outperformed prioritization using cluster growth (Fig. 2). For example, the top 1000 individuals according to cluster growth transmitted on average to 0.06 other people, which, although higher than the population average, was half the 0.12 transmissions per person according to ProACT. Kendall tau-b results similarly indicate that 7 ProACT has better correlation with the optimal ordering.

Impact of Simulation Parameters

We then tested the impact of 3 simulation parameters, namely the rate of stopping ART, the rate of starting ART, and the node degree in the sexual network (Figs. 2C, D and see Figs. 5–7, Supplemental Digital Content, As we increased the rate of stopping ART (λ) (ie, lower adherence), the gap between ProACT and cluster growth grew. For example, the mean number of transmissions per person among the top 1000 individuals chosen using ProACT and cluster growth was 0.169 and 0.076, respectively, (a 1.21x improvement) when λ = 4x (Fig. 2C). This 1.21x improvement briefly increased to 1.26x and subsequently decreased to 1.01x, 0.69x, and 0.63x improvement as we reduced the rate or ART termination to 2x, 1x, 0.5x, and 0.25x. Kendall tau-b correlations show similar patterns (Fig. 2D); although almost all replicates of λ = 4x have P < 10−20, for the 0.25x case, all replicates have P > 10−10 and one of the replicates has P > 10−3 (see Fig. 6, Supplemental Digital Content,

As we increased the rate of starting ART (λ+) (ie, with faster diagnoses), as expected, the raw number of new infections caused per capita decreased (Figs. 2C, see Fig. 5, Supplemental Digital Content, Although ProACT remained effective in finding high-priority individuals, its performance compared with optimal ordering slightly degraded with higher λ+ (Figs. 2D, see Fig. 7, Supplemental Digital Content, Also, the gap between ProACT and cluster growth decreased slightly. When observing the mean number of transmissions per person among the top 1000 individuals chosen by each method (Fig. 2C), ProACT gave a 1.01x, 1.03x, and 0.71x improvement over cluster growth for λ+ set to 1x, 2x, and 4x, respectively.

Changing the number of sexual contacts per person (Ed), which controls the speed of spread, had nonuniform effects (Figs. 2C, D). Increasing Ed from 10 to 20 did not substantially impact the performance of ProACT. However, for Ed = 30, we observed a small but noticeable reduction in the improvements of ProACT compared with the cluster growth (Figs. 2D, see Fig. 7, Supplemental Digital Content,

Impact of Incomplete Sampling

Subsampling the total data set to include 3/4, 1/2, or 1/4 of all samples had only a marginal impact on the performance of ProACT according to the CMA metric (Figs. 2A, B, see Figs. 8 and 9, Supplemental Digital Content, Only at 25% sampling level did we observe a small reduction in the performance of ProACT compared with the optimal ordering. For example, with λ+ = 2x, ProACT's performance remained quite similar across ≥1/2 sampling levels, but a reduction in performance was observed for the 1/4 sampling level for both ProACT and cluster growth (see Fig. 7, Supplemental Digital Content,

According to the Kendall tau-b, which measures the entire order rather than just the top individuals, there was a more noticeable degradation in performance because of sampling (see Fig. 7, Supplemental Digital Content, In particular, reduced sampling increased the variance across replicate simulations (note the wider distributions for reduced sampling in Fig. 7, Supplemental Digital Content, Moreover, statistical significance of the correlations degrades with lower sampling (see Fig. 6, Supplemental Digital Content, With 1/4 sampling, unlike full sampling, many model conditions include some replicates where the ProACT ordering is not significantly better than random according to the Kendall tau-b.

Second Order Effects

We next asked if prioritization is effective in detecting people whose contacts also transmit abundantly. To do so, we explored a new outcome measure: the total number of transmissions from all contacts of a sample. Prioritizing samples whose contacts are likely to transmit can give public health officials a chance to find undiagnosed individuals (likely to transmit) through partner tracing from diagnosed individuals and to prioritize PrEP for uninfected individuals in sexual contact with prioritized individuals.

Across all model parameters, ProACT ordering outperformed random ordering and cluster growth according to the number transmissions per neighbors (Fig. 3). For example, contacts of the top 1000 individuals according to ProACT transmitted to 2.23 individuals on average (median across replicates), which is more than twice the number of across all individuals in the network (1.08). Just as with the previous outcome measure, advantages of ProACT over random prioritization or cluster growth were most pronounced for lower λ+ and higher λ (Fig. 3C). The Kendall tau-b coefficients for the correlation between ProACT and the optimal ordering were high (Figs. 3D, see Fig. 10, Supplemental Digital Content,; in fact, they were higher for the transmissions from contacts compared with transmissions from the prioritized person (eg, the median coefficient was 0.084 for contacts and 0.033 for the individuals in the default condition). These coefficients were highly significant across all models and sampling levels (see Fig. 11, Supplemental Digital Content, Thus, ProACT was even more effective in finding individuals with active contact than it was for finding individuals who were not suppressed. These results were largely robust to reduced sampling, showing similar patterns of average performance but increased variance across replicates (see Figs. 10 and 11, Supplemental Digital Content,

Second order effects. A, CMA of the number of infections from contacts of the top individuals according to each ordering; other settings similar to Figure 2A. B, Similar to part (A), but adjusted for random and optimal ordering. C, The number of transmissions from neighbors for the top 1000 individuals in a prioritized list vs. simulation parameter set. D, The Kendall tau-b correlation between the optimal ordering of samples and their ordering by the 2 prioritization methods. See Figure 10, Supplemental Digital Content, for subsampled data.

Further interrogating the properties of an individual and their ordering, we observed a substantial correlation between the number of contacts of samples in the sexual network and their position in the ProACT ordering (Fig. 4). Thus, although ProACT only considers the phylogeny, it prioritized those individuals who had high degrees in the sexual contact network (hidden to ProACT). These correlations were strongest for networks with high degree and weakest when the rate of diagnosis was very high. Reducing sampling did not substantially affect mean correlation but increased variance over replicates (see Fig. 12, Supplemental Digital Content,

ProACT captures sexual contact network degrees. The Kendall tau-b correlation between the number of contacts of each individual and their ordering by the prioritization methods. See Figure 12, Supplemental Digital Content, for subsampled data.

Real San Diego Data set

On real data sets, unlike the simulated data, the desired outcome measure, the number of new transmissions per person, is not known. Instead, we have to use inferred relationships. HIV-TRACE (used in our cluster growth approach) defines a pair of samples as “genetically linked” if their sequences are very similar (TN93 distance below 1.5%). We similarly use the TN93 sequence similarity as an outcome measure, but in addition to using a fixed threshold, we also use smoother functions (see Fig. 13, Supplemental Digital Content, We measure the number of linked individuals using a step function (1 if TN93 distance is below 1.5% and 0 otherwise) and an empirical smooth step function determined by fitting a mixture of 3 Gaussians to the distribution of pairwise TN93 distances (Material and Methods). We also explore an analytical smooth step function (parameterized sigmoid). Note that, when the step function is used, our outcome measure (computed for future transmissions) is exactly the same as what the cluster growth method uses for prioritizing (albeit, using past data). Thus, it is reasonable to suspect that the step function may favor cluster growth. As we move to smoother functions of distance to count genetic links, our measure is expected to become less biased in favor of HIV-TRACE.

Using ProACT or cluster growth to prioritize individuals results in orderings of individuals with a positive Kendall tau-b correlations to the number of future genetic links regardless of the decile and the function used to count genetic links (Fig. 5). These correlations are statistically significant in almost all cases (Table 2). The correlation coefficient ranges between 0.4 (ProACT; 10% time) and 0.1 (cluster growth; 20% time) for empirical function and between 0.6 (cluster growth; 10% time) and 0.1 (ProACT; 80% time) for the step function.

Kendall tau-b test results for ProACT ordering on real data using 4 score functions: an empirical smooth step function, a strict step function around 1.5%, and the sigmoid score functions with λ = 100 and λ = 5. The full San Diego data set was split into 2 sets (pre and post) at each decile (shown on the horizontal axis). The individuals in pre were ordered using ProACT and by cluster growth, and they were given a “score” computed using a score function (see Methods). The Kendall tau-b correlation coefficient was computed for each ordering with respect to the optimal possible ordering (ie, sorting in a descending order of the score). The null distribution was visualized by randomly shuffling the individuals in pre, and test P values are shown in Table 2.
TABLE 2. - Kendall Tau-b Test for a Null Hypothesis That a Given Prioritization Yields a Total Outcome Measure No better Than Random
10% 20% 30% 40% 50% 60% 70% 80% 90%
Empirical smooth step function (FastTree)
 GD + CG †2 × 10−3 †2 × 10−2 5 × 10−6 2 × 10−4 5 × 10−5 6 × 10−7 2 × 10−9 2 × 10−8 2 × 10−11
 ProACT 5 × 10−8 1 × 10−4 6 × 10−6 2 × 10−7 2 × 10−8 2 × 10−11 1 × 10−11 1 × 10−11 1 × 10−17
Step function around 1.5%
 GD + CG 4 × 10−12 1 × 10−19 3 × 10−28 7 × 10−25 2 × 10−19 8 × 10−12 1 × 10−17 5 × 10−14 2 × 10−25
 ProACT 1 × 10−5 5 × 10−8 3 × 10−7 2 × 10−10 1 × 10−6 1 × 10−6 1 × 10−4 †7 × 10−3 4 × 10−7
We show P values for the real San Diego data set for the first through ninth deciles using 2 outcome measure functions. Tests that failed to reject the null hypothesis with (uncorrected) P value < 0.00138 (corresponding to α = 0.05 with Bonferroni multiple hypothesis testing correct with n = 36) are marked with †.

The comparison between ProACT and cluster growth depends on the choice of the function to count links. When counting the number of links using the step function, cluster growth consistently outperforms ProACT for all deciles of the data set. These results should be interpreted with the caveat that we count HIV-TRACE links both to prioritize and to evaluate accuracy. However, according to the empirical smooth step function learned from the TN93 distances, ProACT outperforms cluster growth in all except one time point, where they are tied.

To further test the impact of the link-counting function applied to TN93 distances on the relative accuracy of methods, we used a sigmoid function to replace the step function while keeping the inflection point at 1.5% (see Fig. 13, Supplemental Digital Content, We observed that as the outcome measure function becomes more smooth, ProACT's performance improves with respect to prioritization by cluster growth (Fig. 5, see Table 1, Supplemental Digital Content, Based on the more smooth sigmoid function (λ = 5), ProACT outperforms cluster growth in all but one case where they are tied. Thus, simply counting distances close to 1.5% as partial links leads to evaluations that favor ProACT.

As time increases, both methods experience seemingly downward trends in their tau coefficients, but the null distribution of tau coefficients also tightens (Fig. 5). Thus, both methods consistently do significantly better than expected by random chance, and there is no clear relationship between P values of individual tool and time (Table 2). However, both for the step function and the sigmoid functions, ProACT's relative performance with respect to cluster growth tends to improve over time.


We start by discussing observed results and then comment on practical implications of this article both for public health and for future research.

Discussion of Results

In our simulations, ProACT was the least effective in conditions with very low rate of ART termination, which correspond to very high adherence or high rates of ART initiation. As expected, the total number of new infections per sample is low when adherence is high, reducing the opportunity for improving the ordering. Thus, ProACT is the most beneficial where termination of ART or late diagnosis leads to diagnosed individuals who transmit frequently.

ProACT was quite robust to impacts of subsampling individuals, and only at 14 sampling did we start to lose accuracy. We remind the reader that a 14 sampling does not mean that 14 of all infected individuals are in the data set. Rather, it means that 14 of diagnosed individuals are available to us. Recall that, in our model, diagnosed individuals are immediately sequenced and put on ART (which they may or may not sustain). At any point in time, a large partition of individuals who are infected are not diagnosed and thus not sampled. In other words, the full sampling case should not be misunderstood as including undiagnosed people. Rather, a lack of full sampling corresponds to a case where some samples are known to some clinic but are not included in the study, perhaps because of a lack of sequencing or data sharing.

ProACT far outperformed random ordering. However, we note that, despite the strong performance, there is much room left for future improvement: the adjusted outcome measure is consistently below 8% when selecting up to 10% of top-priority samples. Thus, there is much room for improvement in identifying high-value individuals. It will be unrealistic to expect that any statistical method based solely on sequence data (and perhaps also commonly available metadata, such as sampling times) will be able to come close to the optimal ordering. Nevertheless, methods better than ProACT could likely be developed in the future. Moreover, in this article, we used ML methods to infer trees and used mutation rate branch lengths. We made these choices mostly for computational expediency. However, ProACT algorithm can be applied on the potentially more accurate Bayesian estimates of the phylogeny. Also, one can attempt to use ProACT after dating the tree. Whether either adjustment results in substantial improvements should be studied in the future.

Implications of Results

We proposed a useful approach for thinking about the effectiveness of molecular epidemiology. Instead of focusing on the accuracy of methods of reconstructing phylogenetic trees or transmission networks, a question fraught with difficulties, we asked a more practical question. Given molecular epidemiological data, can the methods, whether phylogenetic or clustering based, prioritize samples for increased attention by public health? Using molecular epidemiology for prioritization is, of course, not a new idea. For example, Wertheim et al15 presented a method to prioritize samples based on the growth rates of their transmission clusters. Vasylyeva et al40 performed a phylogeographic analysis to reconstruct HIV movement among different locations in Ukraine to infer region-level risk prioritization. Much earlier even, Mellors et al41 predicted HIV patient prognosis by quantifying HIV RNA in plasma and predicted prognosis can subsequently be used as a prioritization rank. However, we hope that our formal definition of the prioritization problem, in addition to our extensive simulations and developed metrics of evaluation, will stir further work in this area. It is likely that more advanced methods than our simple prioritization approach can improve performance beyond ProACT in the future.

ProACT prioritizes individuals, not clusters. Prioritizing individuals based on their perceived risk of future transmission promises to be perhaps more effective than targeting clusters. This targeted approach also poses ethical questions that have to be considered. We may not want the algorithm to be biased toward particular demographic attributes. ProACT does not use any metadata in its prioritization, reducing risks of such biases. Nevertheless, it is possible that factors such as the depth of the sampling of a demographic group can in fact change branch length patterns in the phylogeny and make ProACT less or more effective for certain demographic groups. These broader implications of individual prioritization and impacts of demographics on the performance of ProACT should be studied more carefully in future.

Our results indicated that ProACT ordering is a function of features of the sexual contact network. For example, we showed that ProACT orders correlate with the degree of nodes in the sexual network (eg, Fig. 4). These results are significant given the fact that ProACT is given no direct information about the sexual network. The fact that ProACT captures contact network features means that, even if prioritized samples are already suppressed, their sexual contacts, identified using tracing, can be good targets for interventive care.

The main practical question is what can be done with a prioritized list of known samples. We mentioned that using follow-ups public health officials can try to ensure sustenance of ART for prioritized individuals, and using partner tracing they can target PrEP and HIV testing to contacts of prioritized individuals. Follow-ups, PrEP, and targeted testing are all expensive and can benefit from prioritization. Given a fixed budget, ProACT can help decide who should get follow-ups, who should be traced, and who should be tested frequently. For example, assuming an epidemic is similar to our default simulation condition, if we afford to contact trace 1000 known individuals and effectively prevent transmission from all of their contacts (using ART or PrEP), we would prevent 2230 further transmissions using ProACT as opposed to only 1080 if the 1000 individuals are randomly chosen. Thus, ProACT is a way to target prevention strategies to increase their effectiveness.

One may ask whether ordering by branch lengths results in orderings that fail to change with time to reflect changes in the epidemic. To answer this question, on the San Diego PIRC data, we asked how fast the ProACT ordering changes as time progresses. To do so, we computed the Kendall tau-b correlations to the ProACT ordering obtained using only the first decile of the data set (see Fig. 14, Supplemental Digital Content, There was a strong but diminishing correlation with the initial ordering. The correlations started at 1 (by definition) and gradually decreased in the ninth decile to 0.522. The results show that as desired, ProACT orders do in fact change with time, albeit gradually. The gradual change implies that certain individuals remain high priority as time progresses. In practical use, ProACT ordering should be combined with clinical knowledge about the status of individual patients. For example, high-priority individuals according to ProACT can be given lower priority if they manage to constantly remain suppressed with multiple follow-ups. Future versions of ProACT should allow users to provide a history of interventions and change the result accordingly. More broadly, the ProACT should be considered one of the tools for prioritizing clinical care, and other data that it does not incorporate should also be considered.

Finally, a question faced by public health officials is whether the cost of targeting diagnosed individuals for follow-ups and partner tracing is worth the potential reduction in future cases. The answer to that question will inevitably depend on who is targeted. For example, in our default simulation case, targeting individuals randomly can directly prevent 0.05 transmissions per chosen person in the next 12 months, whereas targeting top 1000 individuals according to ProACT would directly target 0.12 transmissions. Thus, prioritization can in fact change the cost-benefit analyses. Moreover, given prioritization, one can use simulations to predict the outcome measure for the top individuals (similar to Fig. 9, Supplemental Digital Content, and use metrics such as quality-adjusted life-year to estimate how many top individuals should be targeted for the cost to justify the benefits.


The authors thank Susan B. Little for providing the San Diego HIV sequence data set used in this study (access obtained through a PIRC proposal). The authors also thank Joel O. Wertheim and Sanjay R. Mehta for fruitful discussions that helped motivate the development of ProACT.


1. Wertheim JO, Leigh Brown AJ, Hepler NL, et al. The global transmission network of HIV-1. J Infect Dis. 2014;209:304–313.
2. Little S, Pond SLK, Anderson CM, et al. Using HIV networks to inform real time prevention interventions. PLoS One. 2014;9:e98443.
3. Schneeberger ADRN, Mercer CH, Gregson SA, et al. Scale-free networks and sexually transmitted diseases: a description of observed patterns of sexual contacts in Britain and Zimbabwe. Sex Transm Dis. 2004;31:380–387.
4. Cohen MS, Chen YQ, McCauley M, et al. Prevention of HIV-1 infection with early antiretroviral therapy. N Engl J Med. 2011 365, 493–505. arXiv: arXiv:1011.1669v3.
5. Poon AF, Gustafson R, Daly P, et al. Near real-time monitoring of HIV transmission hotspots from routine HIV genotyping: an implementation case study. Lancet HIV. 2016;3:e231–e238.
6. Gotz HM, van Rooijen MS, Vriens P, et al. Initial evaluation of use of an online partner notification tool for STI, called “suggest a test”: a cross sectional pilot study. Sex Transm Infect. 2014;90:195–200.
7. Guttman N, Lev E. Ethical issues in COVID-19 communication to mitigate the pandemic: dilemmas and practical implications. Health Commun. 2021;36:116–123.
8. Katz DA, Hogben M, Dooley SW, et al. Increasing public health partner services for human immunodeficiency virus: results of a Second national survey. Sex Transm Dis. 2010;37:469–475.
9. Das M, Chu PL, Santos GM, et al. Decreases in community viral load are accompanied by reductions in new HIV infections in San Francisco. PLoS One. 2010;5:e11068.
10. Hogben M, McNally T, McPheeters M, et al. The effectiveness of HIV partner counseling and referral services in increasing identification of HIV-positive individuals. Am J Prev Med. 2007;33:S89–S100.
11. Bbosa N, Ssemwanga D, Nsubuga RN, et al. Phylogeography of HIV-1 suggests that Ugandan fishing communities are a sink for, not a source of, virus from general populations. Sci Rep. 2019;9:1051.
12. Villandré L, Labbe A, Brenner B, et al. Assessing the role of transmission chains in the spread of HIV-1 among men who have sex with men in Quebec, Canada. PLoS One. 2019;14:e0213366.
13. Oster AM, France AM, Panneer N, et al. Identifying clusters of recent and rapid HIV transmission through analysis of molecular surveillance data. J Acquir Immune Defic Syndr. 2018;79:543–550.
14. Ragonnet-Cronin M, Hu YW, Morris SR, et al. HIV transmission networks among transgender women in Los Angeles County, CA, USA: a phylogenetic analysis of surveillance data. Lancet HIV. 2019;6:e164–e172.
15. Wertheim JO, Murrell B, Mehta SR, et al. Growth of HIV-1 molecular transmission clusters in New York city. J Infect Dis. 2018;218:1943–1953.
16. Wertheim JO, Kosakovsky Pond SL, Little SJ, et al. Using HIV transmission networks to investigate community effects in HIV prevention trials. PLoS One. 2011;6:e27775.
17. Smith DM, May SJ, Tweeten S, et al. A public health model for the molecular surveillance of HIV transmission in San Diego, California. AIDS. 2009;23:225–232.
18. Leitner T, Romero-Severson E. Phylogenetic patterns recover known HIV epidemiological relationships and reveal common transmission of multiple variants. Nat Microbiol. 2018;3:983–988.
19. Kosakovsky Pond SL, Weaver S, Leigh Brown AJ, et al. HIV-TRACE (TRAnsmission cluster engine): a tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens. Mol Biol Evol. 2018;35:1812–1819.
20. Ragonnet-Cronin M, Hodcroft E, Hué S, et al. Automated analysis of phylogenetic clusters. BMC Bioinformatics. 2013;14:317.
21. Prosperi MC, Ciccozzi M, Fanti I, et al. A novel methodology for large-scale phylogeny partition. Nat Commun. 2011;2:321.
22. Moshiri N, Ragonnet-Cronin M, Wertheim JO, et al. FAVITES: simultaneous simulation of transmission networks, phylogenetic trees, and sequences. Bioinformatics. 2018;35:1852–1861. arXiv: 1805.08905.
23. Balaban M, Moshiri N, Mai U, et al. TreeCluster: clustering biological sequences using phylogenetic trees. PLoS One. 2019;14:e0221068.
24. McLaughlin A, Sereda P, Oliveira N, et al. Detection of HIV transmission hotspots in British Columbia, Canada: a novel framework for the prioritization and allocation of treatment and prevention resources. EBioMedicine. 2019;48:405–413.
25. Mai U, Sayyari E, Mirarab S. Minimum variance rooting of phylogenetic trees and implications for species tree reconstruction. PLoS One. 2017;12:e0182238.
26. Price MN, Dehal PS, Arkin AP. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5:e9490.
27. Nguyen LT, Schmidt HA, Von Haeseler A, et al. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015;32:268–274.
28. Kendall MG. A new measure of rank correlation. Biometrika. 1938;30:81–93.
29. Romero-Severson EO, Bulla I, Leitner T. Phylogenetically resolving epidemiologic linkage. Proc Natl Acad Sci U S A. 2016;113:2690–2695. arXiv: arXiv:1408.1149.
30. Barabási AL, Albert R. Emergence of scaling in random networks. Science. 1999;286:509–512. arXiv: 9910332 [cond-mat].
31. Rothenberg RB, Potterat JJ, Woodhouse DE, et al. Social network dynamics and HIV transmission. AIDS. 1998;12:1529–1536.
32. Erdős P, Rényi A. On random graphs I. Publ Math Debrecen. 1959;6:290–297.
33. Rosenberg ES, Sullivan PS, Dinenno EA, et al. Number of casual male sexual partners and associated factors among men who have sex with men: results from the National HIV Behavioral Surveillance system. BMC Public Health. 2011;11:189.
34. Granich RM, Gilks CF, Dye C, et al. Universal voluntary HIV testing with immediate antiretroviral therapy as a strategy for elimination of HIV transmission: a mathematical model. Lancet. 2009;373:48–57.
35. Ratmann O, Hodcroft EB, Pickles M, et al. Phylogenetic tools for generalized HIV-1 epidemics: findings from the PANGEA-HIV methods comparison. Mol Biol Evol. 2017;34:185–203.
36. Le Gat Y. Recurrent Event Modeling Based on the Yule Process, Volume 2. London, United Kingdom: ISTE Ltd; 2016:1–128.
37. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect Math Life Sci. 1986;17:57–86.
38. Moshiri N. TreeSwift: a massively scalable Python tree package. SoftwareX. 2020;11:100436.
39. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993;10:512–526.
40. Vasylyeva TI, Liulchuk M, Friedman SR, et al. Molecular epidemiology reveals the role of war in the spread of HIV in Ukraine. Proc Natl Acad Sci U S A. 2018;115:1051–1056.
41. Mellors JW, Rinaldo CR, Gupta P, et al. Prognosis in HIV-1 infection predicted by the quantity of virus in plasma. Science. 1996;272:1167–1170.

HIV; epidemiology; phylogenetics

Supplemental Digital Content

Copyright © 2020 Wolters Kluwer Health, Inc. All rights reserved.