Implementation Science

Predictive Analytics Using Machine Learning to Identify ART Clients at Health System Level at Greatest Risk of Treatment Interruption in Mozambique and Nigeria

Stockman, Jeni MAa; Friedman, Jonathan MAb; Sundberg, Johnna BAa; Harris, Emily MAc; Bailey, Lauren ScMc

JAIDS Journal of Acquired Immune Deficiency Syndromes: June 1, 2022 - Volume 90 - Issue 2 - p 154-160
doi: 10.1097/QAI.0000000000002947



Global HIV treatment programs face significant challenges in client retention, hampering efforts to achieve epidemic control through the provision of continuous life-long, client-centered antiretroviral therapy (ART).1,2 One approach to overcoming challenges such as HIV client retention is the use of artificial intelligence (AI) technologies, which have the potential to scour multiple sources of available health system data and derive deep insights into the inherent multidimensionality of problems. AI technologies can enable health providers to better serve clients in remote areas, help governments to predict and prevent outbreaks of infectious disease before they occur, and optimize healthcare delivery and outcomes in low-resource settings.3

AI technologies are being applied to nearly every sector in development, including agriculture, health, and natural disaster response systems.4 Relative to global health, these applications have been organized into a framework that defines opportunities for advanced analytics to address population- and individual-level global health challenges. Specific to HIV/AIDS, use cases to achieve the 95-95-95 HIV epidemic control targets have been defined, including to address challenges in client retention.5–7 The HIV/AIDS “Fall Out Forecaster” use case demonstrates how direct savings of 4%–5% on care and adherence support costs could be realized by predicting the likelihood of a patient dropping off or interrupting treatment within the first 12 months and, based on the patient's risk factors and individual behavior, optimizing support through focused interventions.8 However, real-world applications of such analyses using AI technologies often focus on predictive model development, with a limited but emerging evidence base for deploying and scaling the models as effective and sustainable AI solutions in HIV programs.9–11 Few studies examine the related use case of advanced analytics to inform health systems decision making to achieve epidemic control; one recent study applied machine learning (ML) to aggregate loss to follow-up (LTFU) behavior in people living with HIV [We are using fiscal year 2020 data and calculations according to the U.S. President's Emergency Plan for AIDS Relief (PEPFAR) Monitoring, Evaluation, and Reporting version 2.4; therefore, we refer to interruption in treatment as LTFU. We recognize that PEPFAR is shifting the language of LTFU and retention to interruption in treatment and continuation of treatment, respectively] into clusters to examine and describe people living with HIV having similar characteristics and patterns according to their risk profile.12

Building on the “Fall Out Forecaster” use case, the PEPFAR-supported, United States Agency for International Development (USAID)-managed Data for Implementation (Data.FI) project tested an ML approach to determine whether LTFU among ART clients could be predicted with accuracy at the level of a client's clinical or drug pickup appointment. The approach was designed to generate repeatable, real-time risk predictions, demonstrating how AI can identify with high precision which ART clients are at greatest risk of becoming LTFU among facilities in Mozambique and Nigeria supported by PEPFAR through USAID. As the “Fall Out Forecaster” use case demonstrates, the findings have clinical and global health program relevance because the accurate prediction of LTFU can help prioritize interventions among those most at risk of nonadherent behaviors, thereby promoting the most efficient use of resources.


The project adopted the commonly used definition of AI as the use of computers for automated decision making to perform tasks that normally require human intelligence. It also defined ML as a subset of AI that uses algorithms that give computers the ability to learn without being explicitly programmed.


We applied supervised ML techniques (eg, Random Forest) to de-identified patient-level health records, including electronic medical record (EMR) data, together with publicly available health facility data, geospatial data, and satellite imagery, to forecast individual client risk of LTFU among PEPFAR-supported, USAID-managed facilities in Mozambique and Nigeria. We worked with 9 PEPFAR-funded implementing partners (IPs): one in Mozambique and 8 in Nigeria.

We defined LTFU, the target variable, as a clinical appointment or pharmacy pickup for which a patient was more than 28 days late. In studies referenced above, LTFU was defined at the level of the patient, and models were trained to predict the probability a patient would experience LTFU within some duration. In our study, we predicted the probability a patient would experience LTFU for each appointment or pickup, allowing these probabilities to vary as patient circumstances evolved and as temporal aspects of each patient's touchpoint differed.
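The appointment-level target variable described above can be sketched as a simple labeling rule. This is a minimal illustration, not the project's actual code; the function and field names are ours.

```python
from datetime import date
from typing import Optional

# A visit is labeled LTFU when the client attends more than 28 days
# after the scheduled appointment or pickup date, or never returns.
LTFU_THRESHOLD_DAYS = 28

def label_ltfu(scheduled: date, attended: Optional[date]) -> int:
    """Return 1 if this appointment/pickup counts as LTFU, else 0."""
    if attended is None:
        return 1  # client never returned
    return int((attended - scheduled).days > LTFU_THRESHOLD_DAYS)

print(label_ltfu(date(2019, 1, 1), date(2019, 1, 25)))  # 24 days late -> 0
print(label_ltfu(date(2019, 1, 1), date(2019, 2, 10)))  # 40 days late -> 1
print(label_ltfu(date(2019, 1, 1), None))               # never attended -> 1
```

Because the label is computed per touchpoint rather than per patient, the same client can contribute both LTFU and non-LTFU observations as their circumstances change.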


We used Python v3.713 and the scikit-learn library (version 0.23.1).14 Patient risk profiles were developed using ML clustering techniques to group clients based on how similar they were on metrics that had proved important in determining LTFU risk.
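A hedged sketch of what such a clustering step can look like with scikit-learn, assuming k-means as the algorithm; the three features, their ranges, and the number of clusters are invented for illustration and are not the project's actual choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic client metrics standing in for features important to LTFU risk.
rng = np.random.default_rng(0)
clients = np.column_stack([
    rng.integers(1, 120, size=500),   # months on ART
    rng.integers(0, 5, size=500),     # prior LTFU events
    rng.integers(15, 70, size=500),   # age
]).astype(float)

# Group clients into 4 illustrative risk profiles by similarity.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(clients)
print(np.bincount(kmeans.labels_))    # number of clients in each profile
```

In practice, features would be scaled before clustering so that no single metric dominates the distance calculation.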

Data Sources

We augmented de-identified client data provided by the USAID-funded IPs with other publicly available data sources that describe the contextual environment surrounding facilities and ART clients. All facility and de-identified patient data were accessed with approval from USAID Missions, and their use was governed by data sharing agreements.

Mozambique EMR Data

The main data source was the MozART database, which was jointly established by USAID and the Mozambique Ministry of Health in 2011.15 This electronic patient tracking system provided a rich digital ecosystem primed for ML technology. It held data from facilities managed by the Efficiencies for Clinical HIV Outcomes (ECHO) project (a USAID-funded project in Mozambique) and contained more than 4 million historical clinical visits and ARV pickups for more than 360,000 de-identified clients. The data were stored in 19 data sets, with ARV pickups and clinical visits from January 1, 2010, to November 28, 2019, included in the final prediction data set.

Nigeria EMR Data

This database contained records for more than 4 million historical clinical visits and ARV pickups for more than 450,000 clients, stored across 378 comma-separated values (CSV) and Excel files spanning 2002–2020. For modeling purposes, data from 2010 onward were used because data from earlier years were sparse.

Geospatial Data

We received geographic information system (GIS) coordinates for Mozambican and most Nigerian health facilities, which allowed us to generate additional locational features using a variety of publicly available geospatial data sets. For Nigeria, 174 facilities in our dataset were missing geospatial coordinates. To add coordinates for these facilities and ensure consistency across the final dataset, we extracted data using the bounds of the Local Government Area (LGA) in which each facility operated. The LGAs represented the most granular geographic area that could be used in the absence of coordinates.

TABLE 1. - Geospatial Datasets
Source | Accessed via | Data sets
Facebook | Humanitarian Data Exchange | Mozambique: High Resolution Population Density Maps + Demographic Estimates
Food and Agriculture Organization | Harvard Dataverse | Gridded Livestock of the World (2010), Land Classification (2018)
Malaria Atlas Project | MAP data explorer | Travel Time to Cities, 2000/2017 Pf Incidence, 2000/2017 Improved Housing Prevalence
Institute for Health Metrics and Evaluation | Global Health Data Exchange | Africa Educational Attainment Geospatial Estimates (all data from 2000/2015), Low- and Middle-Income Country Neonatal, Infant, and U5 Mortality Geospatial Estimates (all data from 2000/2017), Africa HIV Prevalence Geospatial Estimates (all data from 2000/2017), Low- and Middle-Income Country Child Growth Failure Geospatial Estimates (all data from 2000–2017)
WorldPop | WorldPop | 2015 estimated pregnancies, 2015 estimated births, 2010 estimated poverty
National Oceanic and Atmospheric Administration (Visible Infrared Imaging Radiometer Suite) | WorldPop | Resampled VIIRS night-time lights 2016

We gathered data describing the populations and environments surrounding facilities. The depth and breadth of these data sets could create additional predictive strength if there were relationships between these features and propensity to become LTFU. To understand the environments in which populations live, we included data sets on public health status, education (including gender-based differences in attainment), estimated HIV rates, estimated poverty rates, and other environmental factors, such as average rainfall at different times of the year. A full list of the geospatial datasets accessed is presented in Table 1.

Data Preparation and Feature Engineering

After the EMR data were imported, each data set was thoroughly checked for completeness and consistency. Features with low completion rates relative to the size of the dataset were not included, except in the case of pregnancy data, which the Mozambique IP team requested and which could be inferred from other columns. (Although the MozART database has a column for whether a client was pregnant at the time of the appointment, this feature was missing for all ECHO data. We created it by using other information recorded during the appointment, such as a recent switch in ARV regimen or clinical notes.) Features with categorical and qualitative data were converted to numerical format using one-hot encoding, a technique that replaces categorical variables with binary dummy variables.15 This step is necessary because many ML algorithms (eg, XGBoost) require that all inputs be numeric.
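The one-hot encoding step can be sketched in a few lines with pandas; the column names here are illustrative, not the actual MozART or Nigeria EMR schema.

```python
import pandas as pd

# Toy visit records with one categorical column (ARV regimen).
visits = pd.DataFrame({
    "client_id": [1, 2, 3],
    "regimen": ["TLD", "TLE", "TLD"],   # categorical -> dummy columns
    "days_late": [0, 12, 30],
})

# Replace the categorical column with binary indicator columns.
encoded = pd.get_dummies(visits, columns=["regimen"])
print(list(encoded.columns))
# ['client_id', 'days_late', 'regimen_TLD', 'regimen_TLE']
```

After encoding, every column is numeric and can be fed to algorithms such as XGBoost that do not accept string-valued inputs.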

With respect to geospatial data, we defined 2- and 5-km catchment areas around each health facility, based on the resolution of each dataset. These catchment areas were then used to extract the geospatial data, taking either the nearest point estimate or the average depending on the spatial resolution. For example, the population in each geospatial cell was summed to estimate the total population living within 2 and 5 km of the facility. Data with a resolution coarser than 5 km were extracted from the cell where the facility was located. We included both the 2- and 5-km zones in our data set because meaningful differences between the populations and environment immediately surrounding a facility and those outside that facility's catchment area might exist and affect LTFU rates. In the absence of catchment areas derived from geospatial coordinates, LGAs were used.
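The population-summing example can be illustrated with a toy raster. A real pipeline would work from projected GIS coordinates with libraries such as rasterio or geopandas; here a 1-km-resolution NumPy grid with 100 people per cell stands in for the population dataset, and the function name is ours.

```python
import numpy as np

def catchment_sum(grid, facility_rc, radius_km, cell_km=1.0):
    """Sum raster cells whose centers lie within radius_km of the facility."""
    rows, cols = np.indices(grid.shape)
    fr, fc = facility_rc
    dist_km = np.hypot(rows - fr, cols - fc) * cell_km
    return grid[dist_km <= radius_km].sum()

pop = np.full((11, 11), 100.0)           # 100 people per 1-km cell
print(catchment_sum(pop, (5, 5), 2))      # 1300.0 people within 2 km
print(catchment_sum(pop, (5, 5), 5))      # 8100.0 people within 5 km
```

Computing both the 2-km and 5-km sums per facility yields the paired catchment features described above.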

Vectors of geospatial attributes were generated for each facility and then joined with the patient record data based on the health facility where each patient sought care. The combined vector of patient and geospatial attributes formed the input vector for modeling purposes.

Model Training and Testing

To test our results, we split clients into training and testing sets. In ML, training sets are used to learn patterns associated with outcomes, whereas testing datasets are used to assess how well the resulting model would generalize to new observations. The testing dataset contained 15% of clients and was used only to test the models that performed the best during training. The testing and training data had the same rate of LTFU. We split the data based on clients instead of appointments to ensure that the modeling process did not learn the unique behavioral patterns of a subset of clients while failing to generalize well to new clients—a type of performance flaw called overfitting.
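A client-level split of this kind can be sketched with scikit-learn's `GroupShuffleSplit`, which guarantees that no client's appointments land in both sets. The data here are synthetic, and this is an illustration of the splitting principle, not the project's actual code.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic data: 100 clients with 5 appointments each.
rng = np.random.default_rng(0)
client_ids = np.repeat(np.arange(100), 5)
X = rng.normal(size=(len(client_ids), 3))
y = rng.integers(0, 2, size=len(client_ids))

# Hold out 15% of *clients* (not appointments), as in the study design.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=client_ids))

overlap = set(client_ids[train_idx]) & set(client_ids[test_idx])
print(len(test_idx), len(overlap))   # 75 test rows, 0 clients in both sets
```

Splitting by appointment instead would let the model memorize individual clients' habits, inflating test performance.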

During the model training process, we used temporal 10-fold cross validation to split the training dataset into 10 training groups based on time.16 Using this method, models were created using data from the past and were tested on data collected later. We selected this validation method to increase the robustness of our results by prioritizing models during training that were already generalizing well to the future, preventing overreliance on transitory factors such as changes in prevailing regimens that would have affected LTFU for a limited time only. The models that performed best during training were then tested on the test set.
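scikit-learn's `TimeSeriesSplit` captures the spirit of this scheme: each fold fits on earlier rows and validates on a strictly later block. The data below are a synthetic stand-in assumed to be pre-sorted by appointment date.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(110).reshape(-1, 1)     # rows assumed sorted by date

# 10 temporal folds: the model is always judged on "future" data.
tscv = TimeSeriesSplit(n_splits=10)
folds = list(tscv.split(X))
for train_idx, val_idx in folds:
    # every validation index comes strictly after every training index
    assert train_idx.max() < val_idx.min()
print(len(folds), "temporal folds")
```

This is what prevents a model from leaning on transitory signals, such as a regimen change that affected LTFU only briefly, to look good in validation.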

In Mozambique, the model was trained using only data up to December 31, 2017. This ensured that we could test the model on a period it had never seen before. If the model were deployed, it would also need to generate predictions for new, unseen periods; thus, this step was important for estimating how the model would perform on new observations. In Nigeria, the model was trained using data before March 2019.

Model Types and Evaluation

We implemented a wide range of commonly used ML algorithms to create the strongest model possible: Random Forest, XGBoost, Extra Trees, and Neural Networks (the latter for Mozambique only). We also trained logistic regressions to serve as a performance benchmark for the more complex models. We ran more than 100 variations, varying relevant hyperparameters for each classifier and selecting the hyperparameters that produced the best performance on the validation set. We evaluated performance using the area under the precision–recall curve (AUC-PR), a commonly used metric in ML that reflects combinations of sensitivity and positive predictive value. The best-performing models and associated hyperparameters were selected for evaluation on the reserved test set. In addition to comparing the performance of the best model with the baseline, we compared it with a model that the International Center for AIDS Care and Treatment Programs (ICAP) developed earlier to predict LTFU in Mozambique.17
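A compressed sketch of this selection loop: score hyperparameter variations of one classifier by AUC-PR (scikit-learn's "average_precision" scorer) on temporal validation folds and keep the best. The synthetic data, the tiny grid, and the fold count are illustrative stand-ins for the 100-plus variations described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Imbalanced synthetic data (~25% positives) standing in for LTFU labels.
X, y = make_classification(n_samples=600, weights=[0.75], random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    n_iter=4,
    scoring="average_precision",        # area under the precision-recall curve
    cv=TimeSeriesSplit(n_splits=3),     # temporal folds, as during training
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same loop is repeated per algorithm; the winners then face the reserved test set once.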


Model Performance

All results were generated using the test set of data. These data were from clients the models had never seen before; therefore, we could understand how well the models would perform on new data outside the training environment. Because precision and recall can change according to the threshold set, it is recommended to evaluate models based on the AUC-PR and the Matthew correlation coefficient (MCC). A higher value for AUC-PR means that by improving performance in either precision or recall, the corresponding loss in the other metric is not as pronounced. Moreover, a higher MCC signals that the model is performing well at predicting the outcome of the underrepresented class.
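A quick numeric check of the reference points used in this section: for an uninformative (random) scorer, AUC-PR settles at roughly the positive-class prevalence, and MCC for a random guess sits near 0. The 23% prevalence below mirrors the Mozambique figure; the data are simulated.

```python
import numpy as np
from sklearn.metrics import average_precision_score, matthews_corrcoef

rng = np.random.default_rng(0)
y_true = rng.random(100_000) < 0.23             # ~23% positives

# AUC-PR of random scores converges to the prevalence of the positive class.
auc_pr = average_precision_score(y_true, rng.random(100_000))

# MCC needs hard labels; a random 0/1 guess scores near 0.
mcc = matthews_corrcoef(y_true, rng.random(100_000) < 0.5)
print(round(auc_pr, 2), round(mcc, 2))          # approximately 0.23 and 0.0
```

Against these floors, the reported 0.65 AUC-PR / 0.45 MCC (Mozambique) and 0.52 AUC-PR / 0.37 MCC (Nigeria) represent large gains over chance.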

For Mozambique, we trained 5 different types of models (logistic regression, Random Forest, XGBoost, Extra Trees, and a 2-layer Neural Network) and then selected the best-performing option for each algorithm. The best Random Forest model outperformed all other models on both the AUC-PR and the MCC, with feature importance computed using the mean decrease in impurity (Gini) method. It achieved 0.65 AUC-PR and 0.45 MCC compared with an underlying rate of appointment LTFU of 23% across the testing period (the reference point for AUC-PR is the precision of a random model, which is equivalent to the prevalence of the positive class, or 0.23 in the case of Mozambique; this is not to be confused with the reference point for an AUC-ROC, which is typically 0.5). We used this model in the Mozambique analysis.

For Nigeria, we trained 4 different types of models (logistic regression, Random Forest, XGBoost, and Extra Trees) and then selected the best-performing option for each algorithm on 2 data sets. One dataset included features describing changes in viral load over time, using data from IPs that provided laboratory data sets; the second did not include laboratory data. The XGBoost algorithm outperformed all other models on both the AUC-PR and the MCC, with feature importance computed using the "gain" method, and there was no substantial difference with and without laboratory data. Because the XGBoost model performed the best on our test set, we used it in the Nigerian analysis. It achieved 0.52 AUC-PR and 0.37 MCC compared with an underlying rate of appointment LTFU of 27% across the testing period. We used the model without laboratory data because it performed slightly better, which could be a result of having more data available, and because it allowed us to include every IP's data in the results.

Model accuracy was similar across sex and age groups. Although men had slightly higher overall LTFU rates, correlations between LTFU and other features seldom differed significantly between men and women.

Top Features

We categorized the top features by importance into the following buckets: past behavior, clinical history, date, demographics, and location. In Nigeria, each client's clinical history contributed more than 35% to overall feature importance. The top features in the clinical history group described how long the client had been in treatment, the client's functional status, and regimen history. Clinical history was followed by location and past behavior, which contributed about 25% and 20% to overall feature importance, respectively. Past behavior included the number and duration of previous LTFU events. Demographic and location features did not appear in the top 10 most important features (Fig. 1).

Top features of ML model for Nigeria.
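The bucket rollup reported in this section can be sketched as follows; the feature names and importance values here are invented for illustration, not the models' actual output.

```python
# Invented per-feature importances, tagged by bucket.
importances = {
    "months_on_art": 0.22,         # clinical history
    "regimen_changes": 0.10,       # clinical history
    "prior_ltfu_events": 0.15,     # past behavior
    "days_late_last_visit": 0.08,  # past behavior
    "facility_pop_5km": 0.05,      # location
    "age": 0.04,                   # demographics
}
buckets = {
    "clinical history": ["months_on_art", "regimen_changes"],
    "past behavior": ["prior_ltfu_events", "days_late_last_visit"],
    "location": ["facility_pop_5km"],
    "demographics": ["age"],
}

# Sum the per-feature importances within each bucket.
shares = {name: round(sum(importances[f] for f in feats), 2)
          for name, feats in buckets.items()}
print(shares)   # e.g., {'clinical history': 0.32, 'past behavior': 0.23, ...}
```

With a fitted tree ensemble, the `importances` dictionary would come from the model's `feature_importances_` attribute paired with the feature names.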

In Mozambique, the top 70 features represented 75% of the overall derived importance; the hundreds of remaining features contributed only about 25% of the overall importance scores. We categorized the top 70 features by importance into the same buckets: past behavior, clinical history, date, demographics, and location. Past behavior and clinical history each contributed about 30% to overall feature importance, whereas demographic features did not appear in the top 10 most important features. The most important demographic feature was age, followed by the number of people living with the client, the number of children the client had, and gender. Overall, demographic features contributed only 8% of the top 75% of importance. Last, features describing location and the date of the next appointment contributed 2% and 6%, respectively (Fig. 2).

Top features of ML model for Mozambique.

Comparisons With Baseline

To understand and quantify our machine-learned predictions in Mozambique, we compared our model with 2 existing approaches used to identify clients deemed at high risk of LTFU. Project ECHO prioritized infants, clients with high viral loads (VLs), and newly initiated clients. ICAP at Columbia University partnered with the USAID IP Dimagi to test the application of ML to LTFU in 2018.18

Table 2 compares the percentage of male and female clients flagged as at risk of LTFU, and the percentage of these clients who ultimately became LTFU, using ECHO's tracking heuristics and the Data.FI ML model. In 2019, 19% of female clients and 23% of male clients fit into ECHO's prioritization categories. However, the prioritizations missed more than 80% of all female clients and 77% of all male clients who became LTFU. By contrast, Data.FI's ML model flagged a smaller proportion of the population but identified more clients who would become LTFU. For both sexes, the ML model flagged about the top 15% of clients, and most of those flagged ended up becoming LTFU.

TABLE 2. - ECHO Prioritization Compared With Model Performance in 2019
Model | % of Females Flagged | % of Female LTFU Missed | % of Males Flagged | % of Male LTFU Missed
ECHO-prioritized clients | 19 | 81 | 23 | 77
Data.FI ML model | 14 | 49 | 17 | 45

Although the ICAP model used slightly different data, covered a different geographic area, and used a different feature set, the results of Data.FI's best-performing model were still comparable with ICAP model results. Table 3 shows the results for each ML model. ICAP published its results by comparing the “riskiest” 20% of clients according to its ML model to the overall client population. The ICAP model found that just less than one-half of the flagged clients ended up missing an appointment by 7 days, compared with 32% of clients in the general sample.

TABLE 3. - Comparison of the Performance of ICAP's and Top-Performing Models on the Test Set
No. of days by which an appointment is missed | ICAP: rate of missed appointments, all clients (%) | ICAP: rate for ML-learned riskiest 20% (%) | Data.FI: rate of missed appointments, all clients, October 2019 (%) | Data.FI: rate for ML-learned riskiest 20%, October 2019 (%)
7 | 31.8 | 48.26 | 32.7 | 80.3
14 | 26.2 | 39.98 | 25.9 | 78.2
28 | 19.8 | 34.77 | 20.3 | 74.0

By contrast, in the ECHO project, 20.3% of appointments scheduled in October 2019 ultimately led to an LTFU status. Data.FI's machine-learned predictions performed very well: the riskiest 20% of clients were far more likely to be late or become LTFU than the general client population.


The machine-learned model predicted with high precision which clients were at high risk for LTFU in PEPFAR-supported, USAID-managed facilities in Mozambique and Nigeria, with AUC-PR in both contexts well above the prevailing rate of LTFU. To the best of our knowledge, this analysis is the first to leverage client-level data together with publicly available data to predict LTFU among individual ART clients in real time.

The benefits of using ML were evident: compared with existing client tracking, ML algorithms identified more clients who are likely to become LTFU than the baseline and did so with more precision. Moreover, ML output will allow PEPFAR-supported, USAID-managed IPs in Mozambique and Nigeria and other programs to compare the likelihood of LTFU among different sets of clients and prioritize clients for follow-up and interventions accordingly.

There are some limitations to this work. First, the particular model parameters trained in this project are applicable only in the relevant Nigerian and Mozambican contexts. The same modeling approach could be used elsewhere, but the parameters would need to be retrained. Second, Nigeria and Mozambique have wide adoption of EMRs and centralize EMR data in a data warehouse, conditions that do not exist in many countries with a high HIV burden. Third, our focus on predictive performance meant we used ML models that make it difficult to describe the specific impact of an input, such as CD4 count, on the probability of LTFU. We do report on feature importance, which is standard practice in describing ML models, but these metrics do not provide the same kind of insight that might be generated through different study approaches.

The results did not suggest a need to abandon current tracking efforts. In fact, the ML algorithm may have implicitly learned which clients received additional follow-up that made them less likely to become LTFU, with the predictions reflecting this. Rather, the results support the hypothesis that ML can be a useful aid to help prioritize individual clients in real time.

Importantly, the model's stable performance between sexes means it can be used to improve care for the entire client population; it works equally well for male and female clients.

The reasons for LTFU depend on each ART client's situation and on complex interactions between the client and the environment in which the client receives care. These results demonstrate the power of applying ML to LTFU prevention efforts to further refine existing criteria for assigning priority for follow-up before or after a missed appointment. With the addition of ML, health programs can prioritize with increased precision those who are most likely to become LTFU, providing an opportunity for optimized allocation of resources and existing interventions to the most at-risk clients.

Predicting clients at high risk of LTFU based on population profiles, and demonstrating the benefits of a model-driven approach through a real-world evaluation in real time, can allow decision makers to anticipate risk with enough time to prepare for change and intervene. This could allow programs to pivot mid-project before failure or risk of failure, providing deep insights in real time when the program has the opportunity to improve. The precision this approach offers to understand each client and client population can be harnessed to drive the efficient use of resources and ensure that those resources reach people most at risk of interruption in treatment, generating deep insights to help save lives and improve health outcomes.


This article was produced for review by the U.S. President's Emergency Plan for AIDS Relief through the United States Agency for International Development. It was prepared by Data for Implementation (Data.FI). The information provided in this article is not official U.S. government information and does not necessarily reflect the views or positions of the U.S. President's Emergency Plan for AIDS Relief, the U.S. Agency for International Development, or the U.S. Government.

The authors thank Jen Chapman, project director, Data.FI, Palladium, for her review and contributions to the manuscript. The authors are also grateful to Elizabeth T. Robinson and Alison Ellis for their review of the manuscript. The authors are also grateful to USAID staff in Nigeria and Mozambique as well as USAID-funded implementing partners for their help in facilitating this work.


1. United States President’s Emergency Plan for AIDS Relief. PEPFAR 2020 Country Operational Plan Guidance for All PEPFAR Countries. Washington, DC: USAID; 2020. Available at:
2. Ashrafian H, Darzi A. Transforming health policy through machine learning. PLoS Med. 2018;15:e1002692.
3. United States Agency for International Development. Artificial Intelligence in Global Health: Defining a Collective Path Forward. Washington, DC: USAID; 2019; 5. Available at: Accessed February 8, 2021.
4. United States Agency for International Development. Reflecting the Past, Shaping the Future: Making AI Work for International Development. Washington, DC: USAID; 2018. Available at: Accessed February 8, 2021.
5. Fox MP, Rosen S. Patient retention in antiretroviral therapy programs up to three years on treatment in sub-Saharan Africa, 2007-2009: systematic review. Trop Med Int Health. 2010;15(suppl 1):1–15.
6. Ekouevi DK, Balestre E, Ba-Gomis FO, et al. Low retention of HIV-infected patients on antiretroviral therapy in 11 clinical centres in West Africa. Trop Med Int Health. 2010;15(suppl 1):34–42.
7. Wilkinson LS, Skordis-Worrall J, Ajose O, et al. Self-transfer and mortality amongst adults lost to follow-up in ART programmes in low- and middle-income countries: systematic review and meta-analysis. Trop Med Int Health. 2015;20:365–379.
8. United States Agency for International Development. Data and Advanced Analytics in HIV Service Delivery: Use Cases to Help Reach 95-95-95. Washington, DC: USAID; 2020; 4. Available at: Accessed February 8, 2021.
9. Maskew M, Sharpey-Schafer K, De Voux L, et al. Machine Learning to Predict Retention and Viral Suppression in South African HIV Treatment Cohorts. Preprint.
10. Hariharan R, Sundberg J, Gallino G, et al. An interpretable predictive model of vaccine utilization for Tanzania. Front Artif Intell. 2020;3:559617.
11. Puttkammer N, Simoni JM, Sandifer T, et al. An EMR-based alert with brief provider-led ART adherence counseling: promising results of the InfoPlus adherence pilot study among Haitian adults with HIV initiating ART. AIDS Behav. 2020;24:3320–3336.
12. Onovo AA, Kalaiwo A, Obanubi C, et al. Aggregating Loss to Follow-Up Behaviour in People Living with HIV on ART: A Cluster Analysis Using Unsupervised Machine Learning Algorithm in R; 2020. Available at SSRN. Accessed February 8, 2021.
13. Python [Programming Language]. Python Language Reference, Version 3.7. Python Software Foundation; 2018. Available at: Accessed May 1, 2020.
14. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
15. Brownlee J. Why One-Hot Encode in Machine Learning? Machine Learn Mastery; 2017. Available at: Accessed May 5, 2020.
16. Hyndman R. Cross-Validation for Time Series. Hyndsight; 2016. Available at: Accessed May 5, 2020.
17. U.S. Centers for Disease Control and Prevention. Patient-Level Database Informs HIV Response in Mozambique. Atlanta: Centers for Disease Control and Prevention; 2020. Available at: Accessed June 11, 2020.
18. ICAP. Machine Learning for Predicting Default from HIV Services in Mozambique: OpCon Mozambique Final Report. New York, NY: ICAP. Available at: Accessed June 11, 2020.

artificial intelligence; HIV/AIDS; loss to follow-up; machine learning; interruption in treatment

Copyright © 2022 The Author(s). Published by Wolters Kluwer Health, Inc.