

Supplement articles

Application of machine-learning techniques in classification of HIV medical care status for people living with HIV in South Carolina

Olatosi, Bankolea; Sun, Xiaowenb; Chen, Shujieb; Zhang, Jiajiab; Liang, Chena; Weissman, Sharonc; Li, Xiaomingd

Author Information
AIDS 35():p S19-S28, May 1, 2021. | DOI: 10.1097/QAD.0000000000002814



Efforts towards ending the HIV epidemic can benefit from the innovative use of data for intelligent decision-making from surveillance through treatment. The goal of HIV treatment for people living with HIV (PLWH) is to achieve viral suppression. PLWH who achieve and maintain viral suppression have effectively no risk of transmitting HIV to others and enjoy an improved quality of life [1,2]. To achieve this goal, the current approach to HIV prevention in the United States has shifted towards engaging PLWH across the HIV care continuum [3]. The HIV care continuum depicts discrete steps from initial HIV diagnosis, linkage to care via an HIV healthcare provider within 1 month of diagnosis, receipt of HIV medical care (antiretroviral therapy), engagement in care (active PLWH participation), retention in care, and viral suppression [4].

However, the HIV care continuum in the United States continues to face challenges. The Centers for Disease Control and Prevention (CDC) estimated that about 64% of all PLWH in the United States received HIV medical care but only 49% were retained in care, and among those retained the viral suppression rate was 62% [3,4]. Upwards of 50% of PLWH diagnosed with HIV infection do not remain in HIV care across the care continuum, including those who failed linkage to care, those who received care but subsequently fell out of care, and those who reengaged in care or transferred from other settings [4]. The ongoing iENGAGE trial showed that among those who linked to HIV care, 86% of individuals maintained viral suppression 48 weeks after diagnosis [5]. Disparities in linkage to and retention in care persist, with studies indicating that a significant proportion of African Americans were not in care or, if in care, did not achieve viral suppression [3,6,7]. These findings collectively suggest that high volumes of drop-offs during HIV care drive the low overall rates of viral suppression. Statistics from South Carolina support this suggestion: cumulatively, only 66% of South Carolina PLWH received any care across the treatment continuum; of these, 54% received continuous HIV care and 57% were virally suppressed [6–8]. Accordingly, in the 2020 update of the National HIV/AIDS Strategy for the United States, 'prevention through linkage and retention' was included as an imperative objective [9].

To improve the HIV care continuum, research has centered on the factors that influence retention in care as well as effective clinical, behavioral, and structural interventions that help PLWH adhere to care. Studies suggest that the factors influencing retention in care can be summarized in an adapted socioecological framework spanning the individual, relationship, community, healthcare system, and healthcare policy levels [4].
This framework organizes numerous findings of poor engagement across the care continuum. For example, poor linkage and retention have been documented in key populations in the United States, including youth and adolescents [10], heterosexual men [11,12], transgender women [13], African Americans [11–14], and Hispanics/Latinos [11,12,15]. A number of factors are associated with poor linkage and retention, including mental illness [16–19], stigma and fear [20], place of residence [21–23], substance use [15,18,24], and access to health insurance [25]. These findings have informed interventions for improving retention in care with various foci, including linkage care management, medical case management, intensive outreach, peer and paraprofessional patient navigation, and clinic-wide messaging and culture [4]. However, many of these studies focused on a few contributing factors and/or population samples without considering integrated evidence from clinical, administrative, and community-based datasets. This limitation has led to several missed opportunities for identifying those at risk of not engaging in care and those at risk of disengaging from care. The reasons for these missed opportunities may be many. First, surveillance data across the HIV care continuum often fail to capture a complete portrait of PLWH's health status because of the lack of interoperability among datasets provided by different health service agencies. Second, it is often difficult to systematically identify missed opportunities for HIV testing and treatment as well as vulnerabilities of HIV care services across the care continuum. Third, it is difficult to proactively intervene with PLWH who are at high risk of dropping out of HIV care using limited data and predictive models restricted to linear combinations of variables. This matters because studies show reasonable costs of, and cost savings from, retention in care for PLWH [2,26].
Recent literature suggests that the plan for ending the HIV epidemic in the United States could improve when Big Data science is used to advance data integration and treatment, address gaps in transmission risks/behavior and improve HIV testing [27].

To fill these knowledge gaps, our effort has led to an integrated data repository that accumulates individual-level data of all types (i.e. clinical, administrative, claims, and community-based health services data) for PLWH in South Carolina, as reported elsewhere [28]. A unique advantage of this data repository is that it enables the prediction of future care status for PLWH through rich, integrated, high-volume individual-level data. Although predictive modeling has seen some application in epidemiology and infectious diseases [29,30], no existing research has reported the development of computational methods for predicting retention in care using a comprehensive data repository.

In the present study, we sought to develop and identify the best machine-learning-based predictive model for PLWH's HIV medical care status by learning from historical data in the established data repository. The best-performing model was selected through a benchmark comparison of seven supervised machine-learning algorithms on their ability to accurately predict PLWH's retention in care. The best-performing model was further analyzed to identify the variables that are most effective for prediction and clinically meaningful for HIV medical care. The expected outcomes of this study hold potential to establish benchmark evidence for predictive analysis of PLWH's care continuum, which further extends to evidence-based interventions and HIV health services underpinned by integrated clinical, administrative, claims, and community-based health data.


Data repository

The complete architecture of the South Carolina PLWH Big Data repository is described elsewhere [28]. Below, we highlight information important to the present study. The data repository received data from the South Carolina HIV/AIDS enhanced electronic reporting system (e-HARS), a laboratory-based reporting system that has collected statewide CD4+ and viral load tests since January 2004; Ryan White HIV/AIDS Program Service Reports (RSR), which collect clinical data provided by Ryan White-funded entities; and the Health Sciences South Carolina (HSSC) clinical data warehouse, which integrates clinical records from six of the state's largest health systems (AnMed Health, McLeod Health, Medical University of South Carolina Hospital Authority, Palmetto Health, Self Regional Healthcare, and Spartanburg Regional Healthcare System). The South Carolina Office of Revenue and Fiscal Affairs (RFA) integrated PLWH data from these sources with an all-payer healthcare inpatient database, Medicaid claims data (including demographic, visit, and pharmacy files), the state employee health services plan, Department of Corrections data (crime rates, prison history, etc.), Department of Mental Health data, and so forth.

Study population

The data repository includes PLWH diagnosed since 2005 who were aged 13 years or older and whose residence at diagnosis was in South Carolina. We selected 2005 because it was the first year South Carolina state law mandated the reporting of all CD4+ and viral load tests to e-HARS. We identified data qualified for the machine-learning experiments based on several criteria. First, we defined care status for this experiment using CD4+ or viral load tests as a proxy measure. Second, we used linkage to care as an inclusion criterion for all PLWH in the database. As a result, a total of 8888 distinct PLWH records with 3 670 845 observations were identified. To prepare for modeling, we cleaned the data, addressed missing-value issues, and generated a total of 3 640 102 observations in the final study sample.

Machine-learning experiments

There were 40 input variables and 1 output (target) variable included in the machine-learning experiments. Table 1 shows a summary of selected variables. The target variable is PLWH care status, defined per the CDC-recommended definition of HIV medical care: 'documentation of at least two CD4+ cell counts or viral load tests performed at least three months apart during the year of evaluation' [23]. All data records were annotated for care status (in care vs. not in care) per this definition.
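This labeling rule can be operationalized directly on a person's laboratory test history. Below is a minimal illustrative sketch in Python (the study itself used SAS); the function name and the 90-day approximation of 'three months' are our assumptions.

```python
from datetime import date

def in_care(test_dates, year):
    """Label a person 'in care' for a given year: at least two CD4+ or
    viral load tests performed at least three months (approximated here
    as 90 days) apart during the year of evaluation."""
    dates = sorted(d for d in test_dates if d.year == year)
    # With a sorted list, the widest gap is between the first and last
    # tests of the year, so only that pair needs checking.
    return len(dates) >= 2 and (dates[-1] - dates[0]).days >= 90

# Hypothetical laboratory history for one person
tests = [date(2015, 2, 1), date(2015, 7, 15), date(2016, 3, 3)]
print(in_care(tests, 2015))  # True: two 2015 tests, 164 days apart
print(in_care(tests, 2016))  # False: only one test in 2016
```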

Table 1 - Selected machine-learning variable attributes for adult people living with HIV in South Carolina.

Variable name | Role | Level | Level (N)
Sociodemographic factors
 Race and ethnicity (white, African Americans, other, Hispanic) | INPUT | NOMINAL | 5
 Education | INPUT | NOMINAL | 7
 Marital status | INPUT | NOMINAL | 6
 County of residence at HIV diagnosis | REJECTED | UNARY | 0
 Type of primary caregiver | INPUT | NOMINAL | 8
 South Carolina resident (yes/no) | INPUT | BINARY | 2
Clinical factors
 Retention in care status (in care vs. not in care) | TARGET | BINARY | 2
 Age at HIV diagnosis | INPUT | INTERVAL | n/a
 AIDS category (yes/no) | INPUT | NOMINAL | 3
 HIV transmission risk categories | INPUT | NOMINAL | 6
 Group (care groups) | INPUT | NOMINAL | 4
 Year (calendar year of HIV diagnosis) | INPUT | NOMINAL | n/a
 Supply (days of drug supply) | REJECTED | INTERVAL | n/a
 Years of retention in care (count) | INPUT | BINARY | 2
 Retention in care sequence for follow-up year | INPUT | INTERVAL | n/a
Health system utilization
 Zero income eligibility (yes/no) | REJECTED | BINARY | 2
 Poverty (percentage) | INPUT | INTERVAL | n/a
 Medicare dual eligibility | REJECTED | BINARY | 2
 Ever on AIDS Drug Assistance Program | INPUT | BINARY | 2
 Number of prescription drug refills used | REJECTED | INTERVAL | n/a
 Emergency room flag (yes/no) | INPUT | BINARY | 2
 Payor Event 1 | INPUT | NOMINAL | 6
 Payor Event 2 | INPUT | NOMINAL | 5
 Payor Event 3 | INPUT | BINARY | 2
 Payor Event 4 | INPUT | NOMINAL | 4
 Payor Event 5 | INPUT | BINARY | 2
 Time (time from diagnosis date to linkage) | INPUT | INTERVAL | n/a
 Hospital length of stay (days) | INPUT | INTERVAL | n/a
 Hospital payor | INPUT | NOMINAL | 6
 Number of comorbidities | INPUT | NOMINAL | 7
 Number of diagnoses | INPUT | INTERVAL | n/a
 Therapeutic drug class | REJECTED | INTERVAL | n/a
 Number of prescription drugs | INPUT | INTERVAL | n/a
Corrections information
 Jail (yes/no) | INPUT | BINARY | 2
 Jail (length of stay in days) | INPUT | INTERVAL | n/a

All data cleaning/management and machine-learning experiments were performed using SAS 9.4 and SAS Viya 3.4. As missing values can create bias during predictive modeling, we imputed values for variables with 10% or less missing data, using the tree surrogate method [31,32] for categorical variables and the median for interval variables. We employed seven supervised machine-learning algorithms representing a variety of approaches: Bayesian Network, Neural Network, Support Vector Machine, Logistic Regression, least absolute shrinkage and selection operator (LASSO), Decision Tree, and Random Forest.
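As a rough illustration of this preprocessing step (not the SAS implementation), the sketch below median-imputes interval variables and, as a simplification of the tree surrogate method, mode-imputes categorical ones. The column names, toy data, and the relaxed threshold are hypothetical.

```python
import pandas as pd

def impute(df, max_missing=0.10):
    """Median-impute numeric (interval) columns and mode-impute
    categorical ones, but only when the fraction of missing values is
    at or below max_missing (the study imputed at <= 10% missing; for
    categorical variables it used a tree surrogate method instead)."""
    out = df.copy()
    for col in out.columns:
        frac = out[col].isna().mean()
        if frac == 0 or frac > max_missing:
            continue  # nothing to impute, or too sparse to impute
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

df = pd.DataFrame({"age": [30.0, None, 50.0, 40.0],
                   "payor": ["Medicaid", "Private", None, "Medicaid"]})
clean = impute(df, max_missing=0.5)  # toy data: allow 25% missingness
print(clean.isna().sum().sum())  # 0
```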

Data partitioning

The 10-fold cross-validation approach [33] was employed as the framework for the machine-learning experiments to measure unbiased accuracy. Figure 1 illustrates the diagram of machine-learning experiments used for this study. Specifically, we randomly generated 10 equal splits of the data. We chose to partition our data into three groups, namely training, validation, and testing, to avoid the overfitting that could occur using only training and validation data (Fig. 1). For each algorithm, six splits [N = 2 184 061 (60%)] were used for training, three splits [N = 1 092 031 (30%)] were used for validation, and one split [N = 364 010 (10%)] was randomly selected for testing. We used the F1 measure, precision, recall, and the area under the receiver-operating characteristic curve (AUC) to assess model performance. These measures were generated from a contingency table that specifies whether or not a model prediction is correct given the true values in the dataset (see Fig. 2). The AUC summarizes how true-positive and false-positive rates trade off against each other.
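These contingency-table measures can be computed in a few lines. A minimal Python sketch with made-up labels (the study itself computed them in SAS Viya):

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 derived from the 2x2 contingency table
    of predicted vs. true class labels (1 = in care, 0 = not in care)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)          # how trustworthy 'in care' calls are
    recall = tp / (tp + fn)             # how many in-care PLWH are captured
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# hypothetical true labels and model predictions
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
p, r, f = classification_metrics(y_true, y_pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8
```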

Fig. 1:
Diagram of machine-learning experiments.
Fig. 2:
Measures used for algorithm performance evaluation.


Bayesian Network is a supervised learning approach that fits a Bayesian Network model of the inputs to our nominal target (care status). A Bayesian Network is a directed acyclic graphical model whose structure encodes conditional dependencies between random variables. Its recent utility and promise in personalized medicine and healthcare are documented in several studies [34–39]. Neural Network is a supervised learning method we used to model the connections between the input variables, hidden layers, and an output layer (care status), loosely analogous to the structure of a human brain. Its utility is well documented in areas such as cancer, diabetes, and electronic health record-based studies [40–44]. For the Neural Network, prior variable selection is preferred, so we used the fast supervised selection method [45,46], which identified input variables (cluster) that jointly explained the largest variance in care status, and compared it with a Neural Network without selection. We used Decision Trees [classification and regression trees (CART)] as a nonparametric supervised learning method to create a tree model that relates relevant input variables to our outcome (care status) through a series of rules; each rule assigns an observation to a segment based on the value of one input [47,48]. Random Forest is an extension of Decision Trees that combines multiple trees trained on different samples, and its use is also reported in healthcare [49]. Support Vector Machine is a well known optimization-based supervised learning method for binary classification problems [50–53]; it identifies a decision boundary with the maximum possible margin between the data points and has also been applied in healthcare. Logistic Regression is a member of the family of discriminative models in machine learning [54] and was used here to predict care status as a function of the input variables.
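The benchmark comparison can be sketched with scikit-learn analogues of three of the seven algorithms. This is an illustration on synthetic data, not the SAS Viya models used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (access-restricted) PLWH dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # AUC on the held-out test split, as in the benchmark comparison
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {aucs[name]:.2f}")
```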

Feature analysis

To determine the relative importance of variables in predicting care status, we employed different measures for each algorithm. Consistent with our data-driven approach, we ran the machine-learning algorithms with the variable selection option (fast selection) and compared them with the same algorithms run on all variables without prior selection. We ranked the importance of input variables by the decrease in performance when each input variable was removed from the model. Top-ranked input variables were candidate principal factors for predicting PLWH's care status.
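Ranking variables by the performance decrease when a variable is removed is closely related to permutation importance, which shuffles a variable rather than refitting without it. A hedged sketch of that analogue on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean drop in accuracy when each feature is shuffled approximates
# its contribution to the model's predictive performance.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranked = sorted(enumerate(result.importances_mean),
                key=lambda kv: kv[1], reverse=True)
for idx, score in ranked[:3]:
    print(f"feature {idx}: mean accuracy drop = {score:.3f}")
```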


We present selected results of these algorithms and compare model performance in predicting care status for PLWH using the selected input variables. The results presented differ by type of algorithm. Figure 3a and b shows Decision Tree models for the data-driven input variable selection (cluster) and for all input variables. Notable tree splits differed in the highlighted branches based on the misclassification rate by variable. The two trees ranked inputs differently by relative importance: both included years of retention in care, but they differed at the main split between hospital payor and age at HIV diagnosis. Other split differences are annotated in the figures. For ease of understanding, we display a feed-forward single-layer Neural Network showing the weights associated with each input variable level for our target (care status) (see Fig. 4a). Input variables of importance included years of retention in care, age at HIV diagnosis, hospital payor, type of transmission risk, marital status, and education. Similarly, Fig. 4b displays the Bayesian Network diagram specifying the conditional dependencies between the input and target variables. The dependencies, highlighted in yellow, show intricate but important relationships between the input variables and the target (care status). This result indicates that the predictive relationship is often co-dependent on other relevant input variables. The ranking of variables by relative importance also differed across the other algorithms.

Fig. 3:
Visualization of trained Decision Trees.
Fig. 4:
Visualization of trained Neural Network.

Machine-learning performance

Consistent with the primary aims of this study, we compared and scored each algorithm for performance. Bayesian Network (F = 0.87, AUC = 0.94, precision = 0.87, recall = 0.86) outperformed the other algorithms, followed by Random Forest (F = 0.78, AUC = 0.81, precision = 0.72, recall = 0.85), Decision Tree (F = 0.76, AUC = 0.75, precision = 0.70, recall = 0.82), and Neural Network (cluster) (F = 0.75, AUC = 0.71, precision = 0.69, recall = 0.81). Table 2 shows the results for each algorithm. The AUCs across the tested algorithms are consistent with the F scores. Fig. 5 illustrates a diagrammatic comparison of AUC among algorithms for training, validation, and test data.

Table 2 - Prediction performance for all the algorithms.
Algorithms Precision Recall F score AUC
Bayesian Network 0.87 0.86 0.87 0.94
Random Forest 0.72 0.85 0.78 0.81
Decision Tree 0.70 0.82 0.76 0.75
Neural Network (cluster)a 0.69 0.81 0.75 0.71
Logistic Regression 0.69 0.82 0.75 0.71
Support Vector Machine 0.69 0.80 0.74 0.67
LASSO 0.56 1.00 0.72 0.50
Neural Networkb 0.56 1.00 0.72 0.50
aNeural Network with fast variable selection.
bNeural Network without fast variable selection.

Fig. 5:
Areas under receiver-operating characteristic curves.

Recall scores are consistently higher than precision across all algorithms except Bayesian Network, suggesting that these algorithms captured most of the PLWH who actually remained in care but were less reliable in ensuring that individuals predicted to be retained in care actually were. Notably, for LASSO and Neural Network, the recall scores reached 1.00 while the precision scores were low, which significantly impeded balanced performance as indicated by the F scores and AUC. This result suggests that a considerably large portion of the PLWH identified as in care by LASSO and Neural Network do not actually remain in care. Fig. 5 shows ROC curves for all the algorithms and data splits (training, validation, and testing). The ROC curve plots sensitivity (the true-positive rate) against 1 - specificity (the false-positive rate), both measures of classification derived from the confusion matrix and calculated at various cutoff values. An ROC curve that rapidly approaches the upper left corner of the graph, where the difference between sensitivity and 1 - specificity is greatest, indicates a more accurate model. Consistently, the Bayesian Network outperformed the other models. Fig. 6 illustrates the input variables by relative importance for the champion model, the Bayesian Network. This plot shows the 23 most important input variables as determined by relative importance. The relative importance is calculated using a one-level Decision Tree for each input to predict the model's predicted value, as a global surrogate model.
The most important input variable for this model is years of retention in care, with a relative importance score of 1, followed by a number of variables with relative importance scores ranging from 0.1 to 0.3, including CD4+ count, calendar year of HIV diagnosis, number of diagnoses, marital status, sex, number of prescription drugs, and zero income eligibility (yes/no).
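One plausible reading of the one-level-tree surrogate procedure (our interpretation, illustrated on synthetic data): fit a depth-1 tree per input to the champion model's own predictions, score each stump, and rescale so the top input gets a relative importance of 1.0.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6, n_informative=2,
                           random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
y_hat = model.predict(X)  # the champion model's predictions

# One-level tree (stump) per input, fitted to the model's predictions;
# each stump's accuracy measures how well one input alone reproduces them.
scores = []
for j in range(X.shape[1]):
    stump = DecisionTreeClassifier(max_depth=1).fit(X[:, [j]], y_hat)
    scores.append(stump.score(X[:, [j]], y_hat))

# Rescale so the top-ranked input has relative importance 1.0
rel = np.array(scores) / max(scores)
print(np.argmax(rel), rel.max())
```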

Fig. 6:
Relative Importance Champion Model -- Bayesian Network.


To help end the HIV epidemic, predictive healthcare decisions and interventions for PLWH must be driven by evidence and data. Studies suggest that decision-making in healthcare can benefit from predictive models and artificial intelligence [47,55,56]. In this study, we compared different machine-learning algorithms and demonstrated important differences in model performance when predicting PLWH care status. Important predictive factors are typically not linearly associated with care status, suggesting that more complex relationships (dependencies) may exist. Our study assessed both traditional and novel statistical methods of prediction that hold implications for the field, particularly with the Bayesian Network performing best for prediction. The slight differences between the ranked variables of importance hold implications for designing future interventions.

Principal findings

The present study is among the first to leverage machine learning and Big Data analytics to predict the HIV medical care status of PLWH using their previous health records, consisting of clinical, health administrative, and community-based health data. The study demonstrated that PLWH's future state of HIV care utilization is predictable using supervised machine learning and integrated health data. Additionally, we identified factors, including years of retention in care, CD4+ count, calendar year of HIV diagnosis, number of diagnoses, marital status, sex, number of prescription drugs, and zero income eligibility (yes/no), as important inputs that play a key role in indicating whether or not a PLWH is likely to be retained in care. These key predictive factors demonstrate that nuanced information from PLWH's clinical care, social determinants of health, and activities in social care jointly contributes to a precise predictive capability for PLWH's future HIV medical care status.

Supervised machine-learning algorithms are feasible for predicting retention in care. Algorithms such as Bayesian Network, Random Forest, Decision Tree, and Neural Network (cluster) showed superior predictive performance, whereas LASSO and the Neural Network without variable selection were not effective on this prediction task. One possible reason for the substandard performance of the Neural Network is that we only tested a basic three-layer model with default parameters. We believe there may be deeper temporal relationships that need to be detected to further improve our predictive models [57]. For algorithms that rely heavily on initial parameters and architecture (i.e. Neural Network), follow-up studies should develop tailored models to enable a systematic investigation. Overall, the comparatively lower precision relative to recall is the major source of error that affected balanced performance. Considering that we selected only a limited subset of the input variables available in the integrated PLWH data repository, the problem of low precision could be mitigated if follow-up studies included input variables that contribute to the accurate identification of the negative class (i.e. not in care).

Clinical and policy implications

Several of these predictive factors have been identified in the literature as important for retention in care [58]. However, the challenge of ending the HIV epidemic goes beyond linking PLWH to care, to retaining them in HIV medical care. Identifying and predicting PLWH at high risk of being not-in-care after linkage allows resources to be reallocated to keep this high-risk population in HIV medical care. This leads to cost savings from reduced transmissibility, improved services for PLWH, and cost savings from preventing individuals from dropping out of care. From evidence-driven healthcare and precision-medicine perspectives, our study shows that the application of powerful machine-learning techniques (in this case, Bayesian Network and Random Forest) makes enough of a difference to explore and justify their potential for predicting future HIV care status for any PLWH population using integrated data sources. It also holds promise for prediction at the individual level, which could aid in preventing dropouts from HIV medical care. To decipher the complex probability relationships observed in the Decision Trees, Neural Network, and Bayesian Network and further improve the prediction of future care status, deep learning is needed. This work provides evidence for the increased use of machine-learning techniques where large-scale integrated datasets exist, and the observed performance differences justify applying machine-learning methods to large-scale integrated data from multiple sources.

Limitations and future direction

This study has a few limitations. First, as a pilot study, we selected only a limited number of input variables for the machine-learning experiments, guided by domain experts. There may be important variables that could improve prediction but were not included. Second, other stages in the HIV care continuum, such as linkage to care and receipt of care, are equally important for HIV care adherence but were not tested in the machine-learning experiments. Third, there might be intricate relationships (e.g. temporal relations) among input variables across the years in care; the present study did not aim to investigate these relationships. Fourth, the identified predictive factors need further evaluation to confirm their clinical value. To address these limitations, future studies should expand the input variables and cover other stages in the HIV care continuum as the PLWH data repository continues to include and link data, develop advanced predictive models (e.g. deep learning) to handle the computational complexity of the expanded dataset, and conduct a comprehensive evaluation of the identified predictive factors.


Research reported in this publication was supported by the National Institute of Allergy And Infectious Diseases of the National Institutes of Health under Award Number R01AI127203. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We thank the South Carolina Department of Health and Environmental Control, Office of Revenue and Fiscal Affairs and various state agencies for the data provided. The content provided is solely the responsibility of the authors and does not represent the views of these organizations.

Conflicts of interest

There are no conflicts of interest.


1. Cohen MS, Chen YQ, McCauley M, Gamble T, Hosseinipour MC, Kumarasamy F, Godbole SV. HPTN 052 Study Team. Prevention of HIV-1 infection with early antiretroviral therapy. New Engl J Med 2011; 365:493–505.
2. Shrestha RK, Gardner L, Marks G, Craw J, Malitz F, Giordano TP, Mugavero M. Estimating the cost of increasing retention in care for HIV-infected patients: results of the CDC/HRSA retention in care trial. J Acquir Immune Defic Syndr 2015; 68:345–350.
3. Centers for Disease Control and Prevention. Understanding the HIV care continuum. 2019. Available at: [Accessed 17 June 2020]
4. Mugavero MJ, Amico KR, Horn T, Thompson MA. The state of engagement in HIV care in the United States: from cascade to continuum to control. Clin Infect Dis 2013; 57:1164–1171.
5. Modi R, Amico KR, Knudson A, Westfall AO, Keruly J, Crane HM, et al. Assessing effects of behavioral intervention on treatment outcomes among patients initiating HIV care: Rationale and design of iENGAGE intervention trial. Contemp Clin Trials 2018; 69:48–54.
6. Crepaz N, Dong X, Wang X, Hernandez AL, Hall HI. Racial and ethnic disparities in sustained viral suppression and transmission risk potential among persons receiving HIV care—United States. Morbid Mortal Wkly Rep 2014; 67:113.
7. SC Department of Health and Environmental Control (DHEC) An Epidemiologic Profile of HIV and AIDS in South Carolina 2019. Available at: [Accessed 17 June 2020]
8. Edun B, Iyer M, Albrecht H, Weissman S. The South Carolina HIV cascade of care. South Med J 2015; 108:670–674.
9. White House Office of National, AIDS Policy. National HIV/AIDS strategy for the United States: updated to 2020. Washington, DC: White House Office of National AIDS Policy; 2015.
10. Lall P, Lim SH, Khairuddin N, Kamarulzaman A. An urgent need for research on factors impacting adherence to and retention in care among HIV-positive youth and adolescents from key populations. J Int AIDS Soc 2015; 18:19393.
11. Tripathi A, Youmans E, Gibson JJ, Duffus WA. The impact of retention in early HIV medical care on viro-immunological parameters and survival: a statewide study. AIDS Res Hum Retroviruses 2011; 27:751–758.
12. Hall HI, Gray KM, Tang T, Li J, Shouse L, Mermin J. Retention in care of adults and adolescents living with HIV in 13 US areas. JAIDS J Acquir Immune Defic Syndr 2012; 60:77–82.
13. Poteat T, Hanna DB, Rebeiro PF, Klein M, Silverberg MJ, Eron JJ, et al. Characterizing the human immunodeficiency virus care continuum among transgender women and cisgender women and men in clinical care: a retrospective time-series analysis. Clin Infect Dis 2020; 70:1131–1138.
14. Dailey AF, Johnson AS, Wu B. HIV care outcomes among blacks with diagnosed HIV—United States. MMWR Morb Mortal Wkly Rep 2017; 66:97.
15. Dasgupta S, Oster AM, Li J, et al. Disparities in consistent retention in HIV care—11 states and the District of Columbia. Morb Mortal Wkly Rep 2016; 65:77–82.
16. Schranz AJ, Barrett J, Hurt CB, Malvestutto C, Miller WC. Challenges facing a rural opioid epidemic: treatment and prevention of HIV and hepatitis C. Curr HIV/AIDS Rep 2018; 15:245–254.
17. Thompson MA, Mugavero MJ, Amico KR, Cargill VA, Chang LW, Gross R, et al. Guidelines for improving entry into and retention in care and antiretroviral adherence for persons with HIV: evidence-based recommendations from an International Association of Physicians in AIDS Care panel. Ann Intern Med 2012; 156:817–833.
18. Dombrowski JC, Simoni JM, Katz DA, Golden MR. Barriers to HIV care and treatment among participants in a public health HIV care relinkage program. AIDS Patient Care STDS 2015; 29:279–287.
19. Coyle RP, Schneck CD, Morrow M, Coleman SS, Gardner EM, Zheng JH, et al. Engagement in mental healthcare is associated with higher cumulative drug exposure and adherence to antiretroviral therapy. AIDS Behav 2019; 23:3493–3502.
20. Giordano TP, Gifford AL, White AC Jr, Suarez-Almazor ME, Rabeneck L, Hartman C, et al. Retention in care: a challenge to survival with HIV infection. Clin Infect Dis 2007; 44:1493–1499.
21. Nelson JA, Kinder A, Johnson AS, Hall HI, Hu X, Sweet D, et al. Differences in selected HIV care continuum outcomes among people residing in rural, urban, and metropolitan areas—28 US jurisdictions. J Rural Heal 2018; 34:63–70.
22. Philbin MM, Feaster DJ, Gooden L, Duan R, Das M, Jacobs P, et al. The north-south divide: substance use risk, care engagement, and viral suppression among hospitalized human immunodeficiency virus--infected patients in 11 US cities. Clin Infect Dis 2019; 68:146–149.
23. Rebeiro PF, Gange SJ, Horberg MA, Abraham AG, Napravnik S, Samji H, et al. North American AIDS Cohort Collaboration on Research and Design (NA-ACCORD). Geographic variations in retention in care among HIV-infected adults in the United States. PLoS One 2016; 11:e0146119.
24. Hartzler B, Dombrowski JC, Williams JR, Crane HM, Eron JJ, Geng EH, et al. Influence of substance use disorders on 2-year HIV care retention in the United States. AIDS Behav 2018; 22:742–751.
25. Mugavero MJ, Lin H-Y, Willig JH, Westfall AO, Ulett KB, Routman JS, et al. Missed visits and mortality among patients establishing initial outpatient HIV treatment. Clin Infect Dis 2009; 48:248–256.
26. Jain KM, Maulsby C, Brantley M, Kim JJ, Zulliger R, Holtgrave DR. SIF Intervention Team. Cost and cost threshold analyses for 12 innovative US HIV linkage and retention in care programs. AIDS Care 2016; 28:1199–1204.
27. Rana AI, Mugavero MJ. How big data science can improve linkage and retention in care. Infect Dis Clin North Am 2019; 33:807–815.
28. Olatosi B, Zhang J, Weissman S, Hu J, Haider MR, Li X. Using Big Data analytics to improve HIV medical care utilisation in South Carolina: a study protocol. BMJ Open 2019; 9:e027688.
29. Wiens J, Shenoy ES. Machine learning for healthcare: on the verge of a major shift in healthcare epidemiology. Clin Infect Dis 2018; 66:149–153.
30. Johnson AS, Johnson SD, Hu S, Li J, Yu C, Wu B, et al. Monitoring selected national HIV prevention and care objectives by using HIV surveillance data: United States and 6 dependent areas, 2017. 2019.
31. Feelders A. Handling missing data in trees: surrogate splits or statistical imputation? In: European Conference on Principles of Data Mining and Knowledge Discovery. Berlin, Heidelberg: Springer; 1999. pp. 329–334.
32. Zhou XH, Eckert GJ, Tierney WM. Multiple imputation in public health research. Stat Med 2001; 20:1541–1549.
33. Fushiki T. Estimation of prediction error by using K-fold cross-validation. Stat Comput 2011; 21:137–146.
34. Ahmad LG, Eshlaghy AT, Poorebrahimi A, Ebrahimi M, Razavi AR. Using three machine learning techniques for predicting breast cancer recurrence. J Health Med Inform 2013; 4:3.
35. SAS Cary Documentation: Bayesian Network. Available at: [Accessed 17 June 2020]
36. SAS Fast supervised learning. Available at: [Accessed 17 June 2020]
37. Velikova MV, Terwisscha van Scheltinga JA, Lucas PJ, Spaanderman M. Exploiting causal functional relationships in Bayesian network modelling for personalised healthcare. 2014; 55 (Pt 1):59–73.
38. Bayat S, Cuggia M, Rossille D, Kessler M, Frimat L. Comparison of Bayesian network and decision tree methods for predicting access to the renal transplant waiting list. In: MIE 2009; 150:600–604.
39. Lappenschaar M, Hommersom A, Lucas PJ, Lagro J, Visscher S. Multilevel Bayesian networks for the analysis of hierarchical healthcare data. Artif Intell Med 2013; 57:171–183.
40. Sordo M. Introduction to neural networks in healthcare. Open Clinical knowledge management for medical care. 2002.
41. Ma F, Chitta R, Zhou J, You Q, Sun T, Gao J. Dipole: diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2017 August. pp. 1903–1911.
42. O’Neill TJ, Penm J, Penm J. A subset polynomial neural networks approach for breast cancer diagnosis. Int J Electron Healthc 2007; 3:293–302.
43. Karan O, Bayraktar C, Gümüşkaya H, Karlik B. Diagnosing diabetes using neural networks on small mobile devices. Expert Syst Appl 2012; 39:54–60.
44. Choi E, Bahadori MT, Kulas JA, Schuetz A, Stewart WF, Sun J. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. arXiv preprint 2016; arXiv:1608.05745.
45. Samanta S, Das S. A fast supervised method of feature ranking and selection for pattern classification. In: International Conference on Pattern Recognition and Machine Intelligence. Springer, Berlin, Heidelberg. 2009, pp. 80–85.
46. Razzaghi T, Roderick O, Safro I, Marko N. Fast imbalanced classification of healthcare data with missing values. In 2015 18th International Conference on Information Fusion (Fusion), IEEE, 2015 July. pp. 774–781.
47. Gordon L. Using classification and regression trees (CART) in SAS® Enterprise Miner™ for applications in public health. In: SAS Global Forum; 2013 April.
48. García MNM, Herráez JCB, Barba MS, Hernández FS. Random forest based ensemble classifiers for predicting healthcare-associated infections in intensive care units. In Distributed Computing and Artificial Intelligence, 13th International Conference. Cham: Springer. 2016. pp. 303–311.
49. Ali J, Khan R, Ahmad N, Maqsood I. Random forests and decision trees. Int J Comput Sci Issues (IJCSI) 2012; 9:272.
50. Razzaghi T, Roderick O, Safro I, Marko N. Multilevel weighted support vector machine for classification on healthcare data with missing values. PloS One 2016; 11:e0155119.
51. Naraei P, Abhari A, Sadeghian A. Application of multilayer perceptron neural networks and support vector machines in classification of healthcare data. In: 2016 Future Technologies Conference (FTC), IEEE; 2016 December. pp. 848–852.
52. Son Y-J, Kim HG, Kim EH, Choi S, Lee SK. Application of support vector machine for prediction of medication adherence in heart failure patients. Healthc Inform Res 2010; 16:253–259.
53. Lee SK, Kang BY, Kim HG, Son YJ. Predictors of medication adherence in elderly patients with chronic diseases using support vector machine models. Healthc Inform Res 2013; 19:33–41.
54. Wu J, Roy J, Stewart WF. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Med Care 2010; 48 (6 Suppl):S106–S113.
55. Haas LR, Takahashi PY, Shah ND, Stroebel RJ, Bernard ME, Finnie DM, Naessens JM. Risk-stratification methods for identifying patients for care coordination. Am J Manag Care 2013; 19:725–732.
56. Plis K, Bunescu R, Marling C, Shubrook J, Schwartz F. A machine learning approach to predicting blood glucose levels for diabetes management. In: Modern Artificial Intelligence for Health Analytics: Papers from the AAAI-14; 2014.
57. Choi E, Schuetz A, Stewart WF, Sun J. Using recurrent neural network models for early detection of heart failure onset. J Am Med Inform Assoc 2017; 24:361–370.
58. Bulsara SM, Wainberg ML, Newton-John TR. Predictors of adult retention in HIV care: a systematic review. AIDS Behav 2018; 22:752–764.

AIDS; big data; champion model; HIV; machine learning; retention in care

Copyright © 2021 Wolters Kluwer Health, Inc. All rights reserved.