Critical public health tasks to improve population-level health outcomes for persons with HIV include early diagnosis of HIV, rapid linkage to HIV care, and treatment with antiretroviral medications to achieve viral suppression.1,2 However, for public health departments, it remains challenging to achieve optimal levels of these goals in part due to the difficulty in accurately measuring this spectrum, otherwise known as the HIV care continuum.3 In the United States, interstate migration and differences in state and local public health reporting laws and interpretations among jurisdictions regarding data sharing and privacy challenge accurate measurements of the HIV care continuum, which in turn, affect the public health outreach and intervention that depend on these data.4
Data to Care5 is a public health strategy that aims to use HIV surveillance data to identify individuals with diagnosed HIV who are not in care, link them to or reengage them in care, and support the HIV care continuum.6 The Data to Care strategy relies on accurate data, in particular current residential address, vital status, and care status, which are collected in HIV surveillance systems at state/local health departments. A key characteristic of a well-functioning surveillance system and its data quality is its ability to link records on the same person across different jurisdictions to minimize duplicate records of reports/cases. In the United States, much of this information is collected through deduplication activities among jurisdictions. The Centers for Disease Control and Prevention (CDC) coordinates the Routine Interstate Duplicate Review (RIDR), a semiannual process to identify and resolve duplicate cases in the Enhanced HIV/AIDS Reporting System (eHARS) across public health jurisdictions; participation in this process is a condition for receiving CDC surveillance funds.7,8 The CDC identifies records suspected of being duplicate reports on the same individual using a CDC-developed matching algorithm. The CDC then provides jurisdictions with lists of suspected duplicate records for them to review, discuss, and agree on a resolution (“same as” or “different than”) during resource-intensive telephone case conferencing between jurisdictions. Currently, the RIDR operates with an estimated 12-month time lag between case reporting and duplicate resolution, as the process involves extensive manual follow-up for case pair resolution across jurisdictions.
In 2015, the health departments of the District of Columbia (DC), Maryland (MD), and Virginia (VA) with Georgetown University used a novel privacy-assuring data technology—the ATra Black Box System—to identify 21,472 eHARS potential duplicates from 161,343 case records across the 3 public health jurisdictions in a computational processing time of 21 minutes and 58 seconds.9 This previous study showed significant eHARS case record overlap across jurisdictions in the DC metropolitan area, reflecting persons' interactions with health systems reporting to different public health departments across these jurisdictional borders. It also gave jurisdictions the opportunity to improve accuracy of their data by identifying additional cases that were actually still in care but living out of jurisdiction and those who were deceased.
The study detailed here sought to examine the public health utility of using the ATra Black Box System in an expanded geographic area to determine its potential role in improving efficiency of case pair identification and determine the improvements in overall quality of HIV surveillance data across participating jurisdictions.
Our study objective was to use the ATra Black Box System approach for the District of Columbia (DC); Delaware (DE); Florida (FL); Maryland (MD); North Carolina (NC); New York State (NYS), including data from New York City (NYC); Virginia (VA); and West Virginia (WV) to (1) identify the overall number of duplicate case records in eHARS across jurisdictions; (2) identify the number of exact duplicate case records in eHARS across jurisdictions; and (3) compare this approach with traditional RIDR resolution by estimating the time efficiency realized and assessing congruence with the July 2017 RIDR process.
A Governing Body
This highly collaborative technical approach was contingent on first establishing a governing body that determined the hypothesis in question and met regularly to discuss and implement study activities. This body included regional representatives from jurisdictional sites and study partners, all of whom received legal clearance to participate in this study—a process that involved productive dialog between participating organizations. This governing body consisted of members from public health jurisdictions (DC, DE, FL, MD, NC, NYS, VA, and WV), and members from Georgetown University (GU), CDC, and Oak Ridge National Laboratory. Guided by the public health jurisdictions' need, this body reached consensus in selecting the analytical question to query through this data technology: What is the total number and nature of duplicate HIV case records across participating jurisdictions along the East Coast corridor?
Data Privacy and Ethics
Among the jurisdictions on the US East Coast that were offered the opportunity to participate, 8 (DC, DE, FL, MD, NC, NYS, VA, and WV) agreed to participate in this effort. The jurisdictions and GU, with support from Oak Ridge National Laboratory, drafted, agreed on, and signed Data Sharing Agreements and a Data Security and Confidentiality Procedures Manual following CDC's standard format for such documents.10 In NYS, the state and New York City (NYC) maintain separate HIV surveillance databases, but NYC reports cases to NYS on a weekly basis for duplicate resolution purposes. As a participating jurisdiction in this effort, NYS submitted both its own eHARS data and NYC's, making a total of 9 eHARS data sets included in the final match. Owing to the privacy-centered engineering design and technical approach of this study, which prohibits any person from seeing the data once in the ATra Black Box System and disallows long-term permanent storage of data, the GU Institutional Review Board (IRB) determined that the study was exempt from review.
The ATra Black Box System approach has been described previously.9 Briefly, the ATra Black Box System has a physically protected server with extremely high privacy assurance that was located at a secure data center in Virginia. Once closed, no one was able to inspect its contents, including the system administrators or the software developers. This server had no external connections to any device other than a power source. It saved data in temporary memory for data matching9,11 and was programmed with manual and automatic mechanisms for cleaning out memory in the event of unauthorized access. The ATra Black Box System was available only to participating jurisdictions through designated encrypted virtual private network links. Encryption techniques were in compliance with the Advanced Encryption Standard to protect the highly sensitive public health HIV data during transit between the jurisdiction and the ATra Black Box System. For this study, the system securely processed eHARS data uploaded directly from each jurisdiction without permanently storing the data. Each jurisdiction was assigned a single, unique, dedicated directory on the server. The jurisdictions uploaded tab-delimited data files to their assigned directories. The jurisdictions prepared input files using an SAS program that was written to combine demographic, geographic, HIV diagnostic, and laboratory information from each jurisdiction's eHARS database. Each jurisdiction downloaded output reports from their individually assigned subdirectory on match completion. Each jurisdiction received information pertinent to only their jurisdiction, including the results of match runs, a real-time log, an error report, a case-by-case match report with values of additional variables for the 3 highest match categories, eHARS-importable files, match totals, grand totals, and matches by zip code.
Information Technology System Coordination and System Testing
Information technology (IT) staff from all collaborating jurisdictions and GU collaborated to enable their health department staff to securely upload their eHARS data file for matching and reporting of results. IT and HIV surveillance staff from all jurisdictions became sponsored users of GU's system and received unique logins, passwords, and virtual private network access for the duration of this project. All jurisdictions first participated in an end-to-end test of the match process. The end-to-end test served several purposes, including confirming the communication channels between all jurisdictions and the ATra Black Box System server, testing operational efficiencies and technical processes including uploading of the data, monitoring the error logs in real time, downloading the reports, and testing the matching algorithm using a set of 9 synthetic data sets (1 for each jurisdiction plus 1 for NYC) provided by the CDC. After correcting a minor logic error and running a second test, the system successfully passed all aspects of the end-to-end test as indicated by the precise reproduction of a master list of expected results.
Matching Variables and Levels
This system used 10 matching variables: last name, first name, date of birth (DOB), sex assigned at birth (birth sex), social security number (SSN), race/ethnicity, first name Soundex, last name Soundex, partial DOB, and partial SSN. The values for these 6 matching variables were retrieved from eHARS using the SAS program: last name, first name, DOB, birth sex, SSN, and race. The Black Box calculated these 4 matching variables: first name Soundex, last name Soundex, partial DOB, and partial SSN. Five of the 10 matching variables were required to be present in the input data record for the record to be processed and matched: last name, first name, DOB, birth sex, and race. Three additional variables were required to be in the input data record, but were not used in the matching process: Stateno, Vital Status, and Transmission Category. The output displayed the number of matched individual HIV case records and the number of matched case pairs (ie, ≥2 case records representing 1 unique person with HIV) for each jurisdiction's report files. In this study, the same individual could belong to 1 or more case pair(s) if they matched across more than 2 jurisdictions.
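As an illustration of the input requirements and of one derived variable, the following is a minimal Python sketch, not the ATra Black Box System's actual code. It checks a record for the 8 required input fields and computes a standard American Soundex code. The column names (`last_name`, `stateno`, and so on) are hypothetical, and the source does not state which Soundex variant the system used, nor how partial DOB and partial SSN were defined, so those two derived variables are deliberately left out.

```python
# Hypothetical column names; the source lists the variables, not their field names.
REQUIRED_FOR_MATCHING = ("last_name", "first_name", "dob", "birth_sex", "race")
REQUIRED_NOT_MATCHED = ("stateno", "vital_status", "transmission_category")

# Standard American Soundex consonant groups (assumed variant).
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def missing_fields(record):
    """Return the required input fields that are absent or blank in a record."""
    required = REQUIRED_FOR_MATCHING + REQUIRED_NOT_MATCHED
    return [f for f in required if not str(record.get(f) or "").strip()]


def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return "".join(out)[:4].ljust(4, "0")


record = {
    "last_name": "Ashcraft", "first_name": "Robert", "dob": "1970-01-31",
    "birth_sex": "M", "race": "", "stateno": "X0001",
    "vital_status": "1", "transmission_category": "9",
}
print(missing_fields(record))
print(soundex(record["last_name"]), soundex(record["first_name"]))
```

In this sketch, a record with any missing required field would be reported as an error rather than matched, mirroring the requirement described above.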
There were 9 levels of matching confidence:
- Exact: last name, first name, DOB, SSN, birth sex, and race;
- Extremely high: last name, first name, DOB, and birth sex;
- Very high: SSN;
- High: last name, first name, DOB, and birth sex or race;
- Medium high: last name and first Soundex, DOB, and birth sex;
- Medium: (last name, DOB, birth sex, and race) or [last Soundex, first Soundex, DOB, and (birth sex or race)];
- Medium low: last Soundex, first Soundex, partial DOB, partial SSN, and (birth sex or race);
- Low: last Soundex and (partial DOB and partial SSN) and (birth sex or race);
- Very low: last Soundex and (partial DOB or partial SSN).
These match levels were previously validated to assess the specificity of the matching algorithm in the study by Ocampo et al,9 2016. Specificity differed by match level with case pair matches at the exact level being validated as 100% true matches. The remainder of this article focuses primarily on exact matches, since jurisdictions found the exact level acceptable for automatic eHARS import without further validation.
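The 9 levels can be read as a cascade in which a candidate pair is assigned the highest level whose criteria it meets. The Python sketch below illustrates that reading; it is not the validated algorithm itself. It assumes that the levels are tested from exact down to very low, that "birth sex or race" in the high level means agreement on either one, and that agreement on a full field implies agreement on its derived field (for example, last name implies last name Soundex). The field names are hypothetical.

```python
# Agreement on a full field implies agreement on its derived field (assumption).
IMPLIED = {
    "last_name": "last_soundex",
    "first_name": "first_soundex",
    "dob": "partial_dob",
    "ssn": "partial_ssn",
}


def match_level(agree):
    """Return the highest match level for a pair, given the fields that agree.

    `agree` is an iterable of field names on which the two records agree.
    Returns None when no level applies.
    """
    a = set(agree)
    a |= {derived for full, derived in IMPLIED.items() if full in a}

    def has(*fields):
        return all(f in a for f in fields)

    sex_or_race = "birth_sex" in a or "race" in a

    if has("last_name", "first_name", "dob", "ssn", "birth_sex", "race"):
        return "exact"
    if has("last_name", "first_name", "dob", "birth_sex"):
        return "extremely high"
    if has("ssn"):
        return "very high"
    if has("last_name", "first_name", "dob") and sex_or_race:
        return "high"
    if has("last_name", "first_soundex", "dob", "birth_sex"):
        return "medium high"
    if has("last_name", "dob", "birth_sex", "race") or (
        has("last_soundex", "first_soundex", "dob") and sex_or_race
    ):
        return "medium"
    if has("last_soundex", "first_soundex", "partial_dob", "partial_ssn") and sex_or_race:
        return "medium low"
    if has("last_soundex", "partial_dob", "partial_ssn") and sex_or_race:
        return "low"
    if has("last_soundex") and ("partial_dob" in a or "partial_ssn" in a):
        return "very low"
    return None


print(match_level({"last_name", "first_name", "dob", "race"}))
```

Under this reading, the high level captures pairs that agree on names, DOB, and race but not on birth sex, because pairs that agree on birth sex are already assigned to the extremely high level.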
In addition, jurisdictions could upload up to 93 optional “ride-along” variables per individual. These variables represented data that are typically exchanged during the manual case resolution that occurs in the traditional RIDR process, including HIV/AIDS case definition, state and county of residence at diagnosis of HIV/AIDS, current residential address information, laboratory test results associated with initial HIV disease diagnosis, and most recent HIV viral load and CD4+ T-lymphocyte count. Although not included in the matching algorithm, the ride-along variable data were included in the output for exact matches.
Comparing This Method to Traditional RIDR Process
Estimating Time Efficiency Realized
The governing body decided to use minutes per phone call per case pair resolution as an indirect measure of jurisdictional resources spent conducting aspects of the traditional RIDR resolution process because RIDR resolution is typically conducted by phone to resolve batches of several case pairs. The time was estimated based on the typical amount of time needed to organize, conduct, and document calls between jurisdictions to resolve specific case pairs. Jurisdictions estimated an average of 5 minutes per call per case pair with 2 persons (1 from each jurisdiction), for an average of 10 staff minutes per case pair overall. This estimate did not account for variation among local conditions.
Congruence With July 2017 RIDR Process
Before conducting the ATra Black Box System run, jurisdictions had previously received a CDC July 2017 RIDR list. The CDC July 2017 RIDR list comprised previously unresolved potential duplicates that were found in the eHARS system between January 1, 2017, and June 30, 2017. To assess the impact of the ATra Black Box System run on the RIDR resolution process, we checked whether case pair matches found by the ATra Black Box at the exact level were also present on the jurisdictions' CDC July 2017 RIDR list. In addition, we examined whether the ATra Black Box System found exact case pair matches that were not present on the CDC July 2017 RIDR list and had not been previously resolved in eHARS. We reviewed exact matches that did not appear on the CDC July 2017 RIDR list and classified those case pairs as “previously resolved” (through previous deduplication efforts) or “not previously resolved” in eHARS.
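The congruence check amounts to sorting each exact case pair into 1 of 3 groups. The following Python sketch illustrates that sorting under stated assumptions: a record is identified by a hypothetical (jurisdiction, identifier) tuple, a case pair is treated as unordered, and all identifiers shown are synthetic. It is an illustration of the comparison, not the procedure the jurisdictions actually ran.

```python
def pair_key(record_a, record_b):
    """Order-independent key, so (A, B) and (B, A) denote the same case pair."""
    return frozenset((record_a, record_b))


def categorize_exact_pairs(exact_pairs, ridr_pairs, resolved_pairs):
    """Split exact pairs by RIDR list membership and prior resolution status."""
    ridr = {pair_key(a, b) for a, b in ridr_pairs}
    resolved = {pair_key(a, b) for a, b in resolved_pairs}
    groups = {
        "on_ridr_list": [],
        "off_list_previously_resolved": [],
        "off_list_not_previously_resolved": [],
    }
    for a, b in exact_pairs:
        key = pair_key(a, b)
        if key in ridr:
            groups["on_ridr_list"].append((a, b))
        elif key in resolved:
            groups["off_list_previously_resolved"].append((a, b))
        else:
            groups["off_list_not_previously_resolved"].append((a, b))
    return groups


# Synthetic identifiers for illustration only.
exact = [
    (("DC", "A1"), ("MD", "B7")),
    (("FL", "C3"), ("NYC", "D9")),
    (("VA", "E2"), ("MD", "F4")),
]
ridr_list = [(("MD", "B7"), ("DC", "A1"))]  # same pair, listed in reverse order
already_resolved = [(("FL", "C3"), ("NYC", "D9"))]

groups = categorize_exact_pairs(exact, ridr_list, already_resolved)
for name, pairs in groups.items():
    print(name, len(pairs))
```

Pairs in the last group are the ones that offer new information to a jurisdiction, because they were neither queued for RIDR resolution nor already resolved locally.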
Overall Number of Duplicate Records Across Jurisdictions
Jurisdictions uploaded a total of 799,326 eHARS case records (DC = 40,448; DE = 8419; FL = 215,875; MD = 72,121; NC = 58,511; NYC = 242,431; NYS = 106,619; VA = 49,844; and WV = 5058), of which 7705 (1%) were not uploaded successfully and were reported as errors (data not shown). A total of 290,482 (36%) eHARS records across these 8 East Coast public health jurisdictions matched across all levels: very low (8.9%), low (0.0%), medium low (0.0%), medium (8.1%), medium high (1.2%), high (0.2%), very high (12.9%), extremely high (30.5%), and exact (38.2%) (Table 1). Overall, close to 70% of matches were exact or extremely high.
Exact Case Pairs Across Jurisdictions
A total of 110,920 individual case records fell into the exact matching level (Table 1). These cases represent a total of 55,460 case pairs matched at the exact level. As shown in Table 2, the largest overlaps in exact eHARS case pairs were between NYC and NYS (51%), DC and MD (10%), and FL and NYC (6%), followed closely by FL and NYS (4%), FL and NC (3%), DC and VA (3%), and MD and VA (3%).
Congruence With July 2017 RIDR Process
In July 2017, jurisdictions received their semiannual RIDR case pair lists from the CDC; these case pairs represented possible duplicates of new persons entered between January 1, 2017, and June 30, 2017. A total of 811 exact case pairs identified using this approach also appeared on the July 2017 RIDR lists for jurisdictions (Table 3).
Estimated Time Efficiency Realized
This study estimated that jurisdictions realized approximately 8110 minutes (or 135.2 labor hours) in time efficiency using this approach compared with aspects of the traditional RIDR resolution process. NYC and NYS conduct an automated intrastate deduplication process. Therefore, the time efficiency realized may be inflated by the large number of matches between NYC/NYS that would be resolved through other methods. The time efficiency realized, when not including the NYC/NYS matches, was approximately 4220 minutes (or 70.3 labor hours) (Table 4).
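The reported figures are consistent with multiplying the number of exact case pairs that also appeared on the July 2017 RIDR lists (811) by the estimated 10 staff minutes per case pair (5 minutes per call for each of 2 persons). The short Python sketch below reproduces that arithmetic; the formula is inferred from the reported numbers rather than stated explicitly in the methods.

```python
MINUTES_PER_CALL = 5   # estimated minutes per call per case pair
STAFF_PER_CALL = 2     # 1 person from each jurisdiction


def minutes_realized(case_pairs):
    """Staff minutes that phone-based resolution of these pairs would take."""
    return case_pairs * MINUTES_PER_CALL * STAFF_PER_CALL


def to_hours(minutes):
    """Convert minutes to labor hours, rounded to 1 decimal place."""
    return round(minutes / 60, 1)


total_minutes = minutes_realized(811)      # exact pairs also on the RIDR lists
print(total_minutes, to_hours(total_minutes))

# Reported total when NYC/NYS matches are excluded.
print(4220, to_hours(4220))
```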
Postprocessing of Results and the NYS Case Example
To describe the added value of using this approach, NYS examined its exact case pair matches with other jurisdictions that were not on the July 2017 RIDR list and classified them as “previously resolved” or “previously not resolved” in eHARS (Table 5). In the case of NYS, 2371 exact case pairs were identified as “previously not resolved.”
Here, we demonstrated successfully using the ATra Black Box System to assist deduplication activities across jurisdictions along the US East Coast corridor. This effort identified previously unidentified duplicates and likely helped realize time efficiency for resource-constrained public health jurisdictions. The highly collaborative public–private partnership between government, academic, and public health partners motivated jurisdictions to increase the frequency at which they directly communicate with each other about overlapping cases and has thus improved cooperative activities among public health jurisdictions. Moreover, this study addressed the critically important arena of working together to more effectively use surveillance data while enhancing the privacy safeguards for sensitive public health data.
Monthly data transfers from jurisdictions to the CDC provide the National HIV Surveillance System with necessary information to track and monitor HIV across the nation. But, by design, the CDC does not have access to personal identifiers such as first name, last name, or SSN, and thus, the variables are not available for deduplication purposes. The ATra Black Box System allows for automatic identification of matches, but without any person seeing or storing personally identifying information while in the matching process. This facilitates increased specificity in the identification and resolution of potential duplicates without compromising privacy.
This work can translate into improved Data to Care efforts reliant on surveillance data by providing more accurate, timely, and updated case data across jurisdictions. Activities related to Data to Care (ie, linking surveillance data more closely to health care outcomes) underlie the need for the improvement of data quality in HIV surveillance. Enhanced data quality allows jurisdictions to better focus their valuable public health resources on cases in need of follow-up with confidence and less so on individuals who have demonstrated continued engagement in care. Also, updated HIV surveillance data provide a better overview of HIV for public health planning purposes and have implications for funding public health efforts and health service delivery.
Public Health Implications
Identification of exact matches, especially those that were previously known to be duplicates, when accompanied by ride-along variable data, enabled jurisdictions to update their local eHARS case records with information from other jurisdictions for more complete and accurate information, complementing the conventional RIDR resolution approach. The ride-along variables also provide added value for future deduplication, such as SSN for positive identification, demographic characteristics, current address, and HIV transmission risk factors. For local public health jurisdictions, these data are critical for outbreak investigations, as well as epidemiologic analysis and reporting, which act to fine-tune policy and target prevention and control activities. An additional benefit of conducting this study was the identification of exact matches that were not previously resolved and not on the July 2017 RIDR list. Such previously unidentified case pairs exemplify cases that had not yet been distributed for resolution through the RIDR activity; finding them led to earlier identification of duplicates and hence improved accuracy of the surveillance data.
Our study suggests that jurisdictions with large seasonal migration or urban areas might stand to benefit the most from use of the ATra Black Box System, given the complex nature of movement of people through cities and the need to clarify their interaction with the public health system across jurisdictional borders to ensure the most effective follow-up. This study made it clear that NYS had significant HIV surveillance data overlap with Florida.
The overlap of eHARS records identified here indicates people's interactions with health systems across jurisdictional borders, which is challenging when the system is designed with states independently conducting surveillance in isolation from other jurisdictions, and the data at the national level do not have sufficient detail for deduplication. This article demonstrates a need to account for the mobility of people living with HIV in the United States, which leads people to engage with health care systems across different states (ie, nonresidence states). This has care implications, especially for public health departments allocating valuable resources to provide outreach, care, and support services to persons with HIV. To reach optimal levels of each milestone in the HIV care continuum and to provide a bridge between public health data and clinical care, there is a need to better understand and perhaps readjust our public health outreach to the dynamic nature of modern living. This study also presents significant benefits to jurisdictions' abilities to update their eHARS data (eg, updating cases that may have never been identified for duplicate resolution in eHARS through other efforts). Here, many cases that matched were not previously resolved in eHARS and also were not present on the most recent RIDR list, which presented an opportunity to improve the quality of local HIV surveillance data. Several reasons could explain why case pairs had not been identified by previous deduplication procedures: by design, identifying information is not available to the CDC for traditional national-level deduplication efforts; some were new cases that could appear on the upcoming RIDR list; some were cases from previous RIDR lists that were not resolved; and some case pairs had not been included in previous deduplication efforts. In each of these scenarios, case pair resolution enables jurisdictions to update and thus enhance their HIV surveillance data and better assist with national-level case pair deduplication.
Future Work and Study Weaknesses
Evaluating time efficiency realized could be expanded to a more comprehensive cost-to-benefit analysis in future efforts. The initial time spent on legal clearances, setting up data sharing agreements, establishing secure IT protocols between several organizations, and creating and tailoring the matching algorithm needs to be accounted for in future cost-to-benefit analyses. Although jurisdictions will naturally spend time on initial set-up activities in the earlier years, we envision that they will spend less time on the same activities in subsequent years because of growing familiarity, experience, and continued training with this approach. This work is expected to ultimately reduce the number of duplicated records existing across public health jurisdictions, but it needs to be conducted on a regular basis for maximum efficiency. Future uses of this approach should also incorporate a streamlined mechanism to maximize efficiency and avoid error messages due to data aspects such as lack of names or coded names. The location of the ATra Black Box System server may also affect future uses of this approach and should be discussed with potential users.
In addition to NYS, other jurisdictions have reviewed using eHARS data after an ATra Black Box System run,12 but to improve overall efficiency, there must also be development of best practices for postprocessing eHARS data across jurisdictions after matching case pairs using this approach. A single suite of software programs to be used by all participating jurisdictions is required to import matching data sets and ride-along variables back into eHARS, to realize the full potential for time and cost savings for this automated data deduplication process. As with other new approaches, there remains significant work, including finding the best methods for evaluating less than exact matches, resolving case pair conflicts, processing ride-along variables, and developing postprocessing software.
Future work should detail this highly collaborative process to implement a novel approach to resolving case pairs in eHARS, especially given that some state privacy laws prevented other invited public health jurisdictions from participating in this study. Other uses may also consider further refining the sensitivity of the algorithm to detect more matches without losing specificity. Finally, it would be informative to learn how many case pairs overlap among jurisdictions in other geographic areas and consider how this could improve our understanding of HIV in the United States.
This study identified 290,482 potentially duplicated case records from 799,326 uploaded case records in 9 separate eHARS data sets across 8 participating jurisdictions, of which 55,460 were exact case pairs. An estimated 135 labor hours in time efficiency was realized using this process to identify duplicate case records in eHARS compared with the traditional process for CDC RIDR resolution. This privacy-centered deduplication process of eHARS records across multiple public health jurisdictions has the potential for improving the timeliness, accuracy, and completeness of nationwide HIV surveillance data. This may help to reduce potential case pairs on future CDC RIDR lists, strengthen collaborative relationships among jurisdictions, provide a fruitful cooperative platform between academia and government entities, and lead to the more efficient use of public health resources. To enhance the added value of using this approach, future applications should consider standardized protocols for postprocessing duplicate eHARS data.
The authors extend their gratitude to all administrative and other personnel who have helped to facilitate the success of this study.