Improving HIV Surveillance Data by Using the ATra Black Box System to Assist Regional Deduplication Activities

Ocampo, Joanne Michelle F. MSa,b; Hamp, Auntré MPHa,c; Rhodes, Anne PhDd; Smart, J. C. PhDe; Pemmaraju, Raghu MSf; Poschman, Karalee MPHg,h; Hess, Kristen L. PhDh; Bhattacharjee, Reshma MBBSi; Flynn, Colin ScMi; Anderson, Bridget J. PhDj; Dowling, James E. MPHk; Maccormack, Fred MSk; Doshi, Rupali MSc,l; Lum, Garret MPHc; Maddox, Lorene MPHg; Moncur, Brenda MSj; Barnhart, John E. MPHm; Maxwell, Jason BSm; Aurand, Sahithi Boggavarapu MPHd; Hogan, Vicki MPHn; Wills, David BAn; Prowell, Stacy PhDo; Kassaye, Seble G. MSb; Karn, Helen E. PhDa; Laffoon, Benjamin T. BSh; Collmann, Jeff PhDa

JAIDS Journal of Acquired Immune Deficiency Syndromes: September 1, 2019 - Volume 82 - Issue - p S13–S19
doi: 10.1097/QAI.0000000000002090
Supplement Article

Background: Focused attention on Data to Care underlines the importance of high-quality HIV surveillance data. This study identified the number of total duplicate and exact duplicate HIV case records in 9 separate Enhanced HIV/AIDS Reporting System (eHARS) databases reported by 8 jurisdictions and compared this approach to traditional Routine Interstate Duplicate Review resolution.

Methods: This study used the ATra Black Box System and 6 eHARS variables for matching case records across jurisdictions: last name, first name, date of birth, sex assigned at birth (birth sex), social security number, and race/ethnicity, plus 4 system-calculated values (first name Soundex, last name Soundex, partial date of birth, and partial social security number).

Results: In approximately 11 hours, this study matched 290,482 cases from 799,326 uploaded records, including 55,460 exact case pairs. Top case pair overlaps were between NYC and NYS (51%), DC and MD (10%), and FL and NYC (6%), followed closely by FL and NYS (4%), FL and NC (3%), DC and VA (3%), and MD and VA (3%). Jurisdictions estimated that they realized a combined 135 labor hours in time efficiency by using this approach compared with manual methods previously used for interstate duplication resolution.

Discussion: This approach discovered exact matches that were not previously identified. It also decreased time spent resolving duplicated case records across jurisdictions while improving accuracy and completeness of HIV surveillance data in support of public health program policies. Future uses of this approach should consider standardized protocols for postprocessing eHARS data.

aGeorgetown University, Office of the Senior Vice President for Research, Washington, DC;

bDivision of Infectious Diseases, Department of Medicine, Georgetown University, Washington, DC;

cDistrict of Columbia Department of Health, Washington, DC;

dVirginia Department of Health, Richmond, VA;

eDepartment of Computer Science, Georgetown University, Washington, DC;

fGeorgetown University, University Information Systems, Washington, DC;

gFlorida Department of Health, Tallahassee, FL;

hDivision of HIV/AIDS Prevention, Centers for Disease Control and Prevention, Atlanta, GA;

iMaryland Department of Health, Baltimore, MD;

jNew York State Department of Health, Albany, NY;

kDelaware Division of Public Health, Newark, DE;

lDepartment of Epidemiology and Biostatistics, The George Washington University, Washington, DC;

mNorth Carolina Department of Health, Raleigh, NC;

nWest Virginia Department of Health and Human Resources, Bureau for Public Health Charleston, WV; and

oNational Security Sciences Directorate, Cyber Physical Systems Research Group, Oak Ridge National Laboratory, Oak Ridge, Tennessee.

Correspondence to: Auntré Hamp, LPC, Office of the Senior Vice President for Research, 2115 Wisconsin Avenue NW, Suite 603, Washington, DC 20007 (e-mail:

Supported by the Centers for Disease Control and Prevention (CDC) Contract #211-2016-M-92074. Analyses herein were made possible through the support of CDC PS18-1802: Integrated Human Immunodeficiency Virus Surveillance and Prevention Programs for Health Departments.

J.M.F.O. at the time of manuscript preparation and publication was also a part-time employee with the Norwegian Institute of Public Health. This work was only associated with her capacity as a Georgetown University employee and not with the Norwegian Government, and she declares no conflict of interest. A.H. at the time of manuscript preparation and publication was employed with Georgetown University, but during the study period was employed with DC Department of Health, and declares no conflict of interest. J.C. was retired from his faculty position at Georgetown University throughout project planning, execution, and analysis as well as preparation and publication of this manuscript and declares no conflict of interest. The remaining authors have no conflicts of interest to disclose.

J.M.F.O. contributed to study design, led manuscript coordination, and manuscript development/edits. A.H. contributed to study design, implementation, manuscript development, and is the corresponding author. A.R. contributed to study design and implementation, and manuscript development. J.C.S. contributed to study design and implementation, and manuscript development. R.P. contributed to study design and implementation, and manuscript development. K.P. contributed to study design and implementation, and manuscript development. K.L.H. contributed to study design and implementation, and manuscript development. R.B. contributed to study design and implementation, and manuscript development. C.F. contributed to study design and implementation, and manuscript development. B.J.A. contributed to study design and implementation, and manuscript development. J.E.D. contributed to study design and implementation, and manuscript development. F.M. contributed to study design and implementation, and manuscript development. R.D. contributed to study design and implementation, and manuscript development. G.L. contributed to study design and implementation, and manuscript development. L.M. contributed to study design and implementation, and manuscript development. B.M. contributed to study design and implementation, and manuscript development. J.E.B. contributed to study design and implementation, and manuscript development. J.M. contributed to study design and implementation, and manuscript development. S.B.A. contributed to study design and implementation, and manuscript development. V.H. contributed to study design and implementation. D.W. contributed to study design and implementation, and manuscript development. S.P. contributed to study design and implementation, and manuscript development. S.G.K. contributed to study design and implementation, and manuscript development. H.E.K. contributed to manuscript development. B.T.L. contributed to study design and implementation, and manuscript development. J.C. led study design and implementation as principal investigator and contributed to manuscript development.

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains, and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (

Received March 22, 2019

Accepted March 27, 2019

Critical public health tasks to improve population-level health outcomes for persons with HIV include early diagnosis of HIV, rapid linkage to HIV care, and treatment with antiretroviral medications to achieve viral suppression.1,2 However, for public health departments, it remains challenging to achieve optimal levels of these goals in part due to the difficulty in accurately measuring this spectrum, otherwise known as the HIV care continuum.3 In the United States, interstate migration and differences in state and local public health reporting laws and interpretations among jurisdictions regarding data sharing and privacy challenge accurate measurements of the HIV care continuum, which in turn, affect the public health outreach and intervention that depend on these data.4

Data to Care5 is a public health strategy that aims to use HIV surveillance data to identify individuals with diagnosed HIV who are not in care, link or reengage them to care, and support the HIV care continuum.6 The Data to Care strategy relies on accurate data, in particular current residential address, vital status, and care status, which are collected in HIV surveillance systems at state/local health departments. A key characteristic of a well-functioning surveillance system and its data quality is its ability to link records on the same person across different jurisdictions to minimize duplicate records of reports/cases. In the United States, much of this information is collected through deduplication activities among jurisdictions. The Centers for Disease Control and Prevention (CDC) coordinates the Routine Interstate Duplicate Review (RIDR), a biannual process to identify and resolve duplicate cases in the Enhanced HIV/AIDS Reporting System (eHARS) across public health jurisdictions; participation in this process is a condition for receiving CDC surveillance funds.7,8 The CDC identifies records suspected of being duplicate reports on the same individual using a CDC-developed matching algorithm. The CDC then provides jurisdictions with lists of suspected duplicate records for them to review, discuss, and agree on a resolution (“same as” or “different than”) during resource-intensive telephone case conferencing between jurisdictions. Currently, the RIDR operates with an estimated 12-month time lag between case reporting and duplicate resolution, as the process involves extensive manual follow-up for case pair resolution across jurisdictions.

In 2015, the health departments of the District of Columbia (DC), Maryland (MD), and Virginia (VA) with Georgetown University used a novel privacy-assuring data technology—the ATra Black Box System—to identify 21,472 eHARS potential duplicates from 161,343 case records across the 3 public health jurisdictions in a computational processing time of 21 minutes and 58 seconds.9 This previous study showed significant eHARS case record overlap across jurisdictions in the DC metropolitan area, reflecting persons' interactions with health systems reporting to different public health departments across these jurisdictional borders. It also gave jurisdictions the opportunity to improve accuracy of their data by identifying additional cases that were actually still in care but living out of jurisdiction and those who were deceased.

The study detailed here sought to examine the public health utility of using the ATra Black Box System in an expanded geographic area to determine its potential role in improving efficiency of case pair identification and determine the improvements in overall quality of HIV surveillance data across participating jurisdictions.

Our study objective was to use the ATra Black Box System approach for the District of Columbia (DC); Delaware (DE); Florida (FL); Maryland (MD); North Carolina (NC); and New York State (NYS), including data from New York City (NYC); Virginia (VA); and West Virginia (WV) to (1) identify the overall number of duplicate case records in eHARS across jurisdictions; (2) identify the number of exact duplicate case records in eHARS across jurisdictions; and (3) compare this approach to traditional RIDR resolution by estimating time efficiency realized and assess congruence with the July 2017 RIDR process.

A Governing Body

This highly collaborative technical approach was contingent on first establishing a governing body that determined the hypothesis in question and met regularly to discuss and implement study activities. This body included regional representatives from jurisdictional sites and study partners, all of whom received legal clearance to participate in this study—a process that involved productive dialog between participating organizations. This governing body consisted of members from public health jurisdictions (DC, DE, FL, MD, NC, NYS, VA, and WV), and members from Georgetown University (GU), CDC, and Oak Ridge National Laboratory. Guided by the public health jurisdictions' need, this body reached consensus in selecting the analytical question to query through this data technology: What is the total number and nature of duplicate HIV case records across participating jurisdictions along the East Coast corridor?

Data Privacy and Ethics

Among the jurisdictions on the US East Coast that were offered the opportunity to participate, 8 (DC, DE, FL, MD, NC, NYS, VA, and WV) agreed to participate in this effort. The jurisdictions and GU, with support from Oak Ridge National Laboratory, drafted, agreed on, and signed Data Sharing Agreements and a Data Security and Confidentiality Procedures Manual following CDC's standard format for such documents.10 In NYS, the state and New York City (NYC) maintain separate HIV surveillance databases, but NYC reports cases to NYS for duplicate resolution purposes on a weekly basis. As a participating jurisdiction in this effort, NYS submitted both its own eHARS data and NYC's, making a total of 9 eHARS data sets included in the final match. Owing to the privacy-centered engineering design and technical approach of this study, which prohibits any person from seeing the data once in the ATra Black Box System and disallows long-term permanent storage of data, the GU Institutional Review Board (IRB) determined that the study was exempt from review.

Data Technology

The ATra Black Box System approach has been described previously.9 Briefly, the ATra Black Box System has a physically protected server with extremely high privacy assurance that was located at a secure data center in Virginia. Once closed, no one was able to inspect its contents, including the system administrators or the software developers. This server had no external connections to any device other than a power source. It saved data in temporary memory for data matching9,11 and was programmed with manual and automatic mechanisms for cleaning out memory in the event of unauthorized access. The ATra Black Box System was available only to participating jurisdictions through designated encrypted virtual private network links. Encryption techniques were in compliance with the Advanced Encryption Standard to protect the highly sensitive public health HIV data during transit between the jurisdiction and the ATra Black Box System. For this study, the system securely processed eHARS data uploaded directly from each jurisdiction without permanently storing the data. Each jurisdiction was assigned a single, unique, dedicated directory on the server. The jurisdictions uploaded tab-delimited data files to their assigned directories. The jurisdictions prepared input files using an SAS program that was written to combine demographic, geographic, HIV diagnostic, and laboratory information from each jurisdiction's eHARS database. Each jurisdiction downloaded output reports from their individually assigned subdirectory on match completion. Each jurisdiction received information pertinent to only their jurisdiction, including the results of match runs, a real-time log, an error report, a case-by-case match report with values of additional variables for the 3 highest match categories, eHARS-importable files, match totals, grand totals, and matches by zip code.

Information Technology System Coordination and System Testing

Information technology (IT) staff from all collaborating jurisdictions and GU collaborated to enable their health department staff to securely upload their eHARS data file for matching and reporting of results. IT and HIV surveillance staff from all jurisdictions became sponsored users of GU's system and received unique logins, passwords, and virtual private network access for the duration of this project. All jurisdictions first participated in an end-to-end test of the match process. The end-to-end test served several purposes, including confirming the communication channels between all jurisdictions and the ATra Black Box System server, testing operational efficiencies and technical processes including uploading of the data, monitoring the error logs in real time, downloading the reports, and testing the matching algorithm using a set of 9 synthetic data sets (1 for each jurisdiction plus 1 for NYC) provided by the CDC. After correcting a minor logic error and running a second test, the system successfully passed all aspects of the end-to-end test as indicated by the precise reproduction of a master list of expected results.

Matching Variables and Levels

This system used 10 matching variables: last name, first name, date of birth (DOB), sex assigned at birth (birth sex), social security number (SSN), race/ethnicity, first name Soundex, last name Soundex, partial DOB, and partial SSN. The values for these 6 matching variables were retrieved from eHARS using the SAS program: last name, first name, DOB, birth sex, SSN, and race. The Black Box calculated these 4 matching variables: first name Soundex, last name Soundex, partial DOB, and partial SSN. Five of the 10 matching variables were required to be present in the input data record for the record to be processed and matched: last name, first name, DOB, birth sex, and race. Three additional variables were required to be in the input data record, but were not used in the matching process: Stateno, Vital Status, and Transmission Category. The output displayed the number of matched individual HIV case records and the number of matched case pairs (ie, ≥2 case records representing 1 unique person with HIV) for each jurisdiction's report files. In this study, the same individual could belong to 1 or more case pair(s) if they matched across more than 2 jurisdictions.
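
As an illustration of the input requirements described above, the following Python sketch checks that the 5 required matching variables are present and derives the 2 Soundex values. It is not the ATra Black Box System's code: the field names, the `prepare` helper, and the choice of standard American Soundex are assumptions made for illustration, and partial DOB and partial SSN are left out because the text does not define how they are calculated.

```python
from typing import Optional

# The 5 matching variables the text says must be present.
# The column names themselves are hypothetical.
REQUIRED = ("last_name", "first_name", "dob", "birth_sex", "race")

# Standard American Soundex digit groups (assumed variant).
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (first + "".join(digits) + "000")[:4]


def prepare(record: dict) -> Optional[dict]:
    """Add Soundex fields, or return None if a required field is missing."""
    if any(not (record.get(key) or "").strip() for key in REQUIRED):
        return None
    return {
        **record,
        "first_soundex": soundex(record["first_name"]),
        "last_soundex": soundex(record["last_name"]),
    }


# Synthetic record, not real surveillance data.
example = prepare({"last_name": "Ashcraft", "first_name": "Robert",
                   "dob": "1980-01-02", "birth_sex": "M", "race": "X"})
rejected = prepare({"last_name": "Ashcraft", "first_name": "Robert"})
```

Here `example` gains the two derived fields, while `rejected` is `None` because DOB, birth sex, and race are absent; in the study such records were reported as errors rather than matched.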

There were 9 levels of matching confidence:

  • Exact: last name, first name, DOB, SSN, birth sex, and race;
  • Extremely high: last name, first name, DOB, and birth sex;
  • Very high: SSN;
  • High: last name, first name, DOB, and (birth sex or race);
  • Medium high: last name and first Soundex, DOB, and birth sex;
  • Medium: (last name, DOB, birth sex, and race) or [last Soundex, first Soundex, DOB, and (birth sex or race)];
  • Medium low: last Soundex, first Soundex, partial DOB, partial SSN, and (birth sex or race);
  • Low: last Soundex and (partial DOB and partial SSN) and (birth sex or race);
  • Very low: last Soundex and (partial DOB or partial SSN).

These match levels were previously validated to assess the specificity of the matching algorithm in the study by Ocampo et al,9 2016. Specificity differed by match level with case pair matches at the exact level being validated as 100% true matches. The remainder of this article focuses primarily on exact matches, since jurisdictions found the exact level acceptable for automatic eHARS import without further validation.
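
The 9 match levels can be read as an ordered rule list in which a record pair receives the highest level whose conditions it satisfies. The Python sketch below follows that reading. It is an illustration only, not the system's algorithm: the field names, the `match_level` function, the rule that missing values never count as agreement, and the top-down evaluation order are all assumptions, and the derived values (Soundex codes, partial DOB, partial SSN) are taken as precomputed inputs because the text does not define the partial values.

```python
from typing import Optional


def match_level(a: dict, b: dict) -> Optional[str]:
    """Return the highest match level satisfied by records a and b, else None."""

    def eq(*keys: str) -> bool:
        # Assumption: a field agrees only if it is present and equal in both.
        return all(a.get(k) and a.get(k) == b.get(k) for k in keys)

    sex_or_race = eq("sex") or eq("race")
    rules = [
        ("exact", eq("last", "first", "dob", "ssn", "sex", "race")),
        ("extremely high", eq("last", "first", "dob", "sex")),
        ("very high", eq("ssn")),
        ("high", eq("last", "first", "dob") and sex_or_race),
        ("medium high", eq("last", "first_sx", "dob", "sex")),
        ("medium", eq("last", "dob", "sex", "race")
            or (eq("last_sx", "first_sx", "dob") and sex_or_race)),
        ("medium low", eq("last_sx", "first_sx", "pdob", "pssn") and sex_or_race),
        ("low", eq("last_sx", "pdob", "pssn") and sex_or_race),
        ("very low", eq("last_sx") and (eq("pdob") or eq("pssn"))),
    ]
    for level, satisfied in rules:
        if satisfied:
            return level
    return None


# Synthetic records; the pdob and pssn formats are placeholders.
rec_a = dict(last="DOE", first="JANE", dob="1980-01-02", ssn="000000001",
             sex="F", race="X", last_sx="D000", first_sx="J500",
             pdob="1980-01", pssn="0001")
rec_b = dict(rec_a)                                    # identical record
rec_c = {**rec_a, "ssn": "000000002", "pssn": "0002"}  # differs only in SSN
rec_d = dict(last="ROE", first="JOHN", dob="1975-05-05", ssn="000000001",
             sex="M", race="Y", last_sx="R000", first_sx="J500",
             pdob="1975-05", pssn="0001")              # shares only SSN

levels = [match_level(rec_a, r) for r in (rec_b, rec_c, rec_d)]
```

Under these assumptions `levels` is `["exact", "extremely high", "very high"]`, which shows how a pair falls through to the next rule when one identifier disagrees.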

In addition, jurisdictions could upload up to 93 optional “ride-along” variables per individual. These variables represented data that are typically exchanged during the manual case resolution that occurs in the traditional RIDR process, including HIV/AIDS case definition, state and county of residence at diagnosis of HIV/AIDS, current residential address information, laboratory test results associated with initial HIV disease diagnosis, and most recent HIV viral load and CD4+ T-lymphocyte count. Although not included in the matching algorithm, the ride-along variable data were included in the output for exact matches.

Comparing This Method to Traditional RIDR Process

Estimating Time Efficiency Realized

The governing body decided to use minutes per phone call per case pair resolution as an indirect measure of jurisdictional resources spent conducting aspects of the traditional RIDR resolution process because RIDR resolution is typically conducted by phone to resolve batches of several case pairs. The time was estimated based on the typical amount of time to organize, conduct, and document calls between jurisdictions to resolve specific case pairs. Jurisdictions estimated an average of 5 minutes per call per case pair with 2 persons (1 from each jurisdiction), for an average of 10 staff minutes per case pair overall. This estimate did not account for variation among local conditions.

Congruence With July 2017 RIDR Process

Before conducting the ATra Black Box System run, jurisdictions had received a CDC July 2017 RIDR list. The CDC July 2017 RIDR list comprised previously unresolved potential duplicates that were found in the eHARS system between January 1, 2017, and June 30, 2017. To assess the impact of the ATra Black Box System run on the RIDR resolution process, we checked whether case pair matches found by the ATra Black Box System at the exact level were also present on the jurisdictions' CDC July 2017 RIDR lists. In addition, we examined whether the ATra Black Box System found exact case pair matches that were not present on the CDC July 2017 RIDR list and had not been previously resolved in eHARS. We reviewed exact matches that did not appear on the CDC July 2017 RIDR list and distinguished case pairs that were “previously resolved” through previous deduplication efforts from those that were “not previously resolved” in eHARS.
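
The congruence check amounts to set comparisons between 3 collections of case pairs. The Python sketch below shows one way to express it. The pair identifiers, the `pair` and `congruence` helpers, and the idea of representing each case pair as an unordered pair of record identifiers are assumptions for illustration; the study's own postprocessing is not described at this level of detail.

```python
def pair(id_a: str, id_b: str) -> frozenset:
    """Represent a case pair so that (a, b) and (b, a) compare as equal."""
    return frozenset((id_a, id_b))


def congruence(exact_pairs: set, ridr_pairs: set, resolved_pairs: set) -> dict:
    """Count exact pairs on the RIDR list and classify those that are not."""
    on_list = exact_pairs & ridr_pairs
    off_list = exact_pairs - ridr_pairs
    return {
        "exact_on_ridr_list": len(on_list),
        "off_list_previously_resolved": len(off_list & resolved_pairs),
        "off_list_not_previously_resolved": len(off_list - resolved_pairs),
    }


# Synthetic identifiers, not real eHARS state numbers.
exact = {pair("DC-001", "MD-104"), pair("FL-220", "NYC-087"), pair("VA-015", "MD-330")}
ridr = {pair("MD-104", "DC-001"), pair("NC-400", "FL-900")}
resolved = {pair("FL-220", "NYC-087")}

summary = congruence(exact, ridr, resolved)
```

In this toy input, one exact pair is on the RIDR list, one off-list pair was already resolved, and one off-list pair is new to both sources.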

Overall Number of Duplicate Records Across Jurisdictions

Jurisdictions uploaded a total of 799,326 eHARS case records (DC = 40,448; DE = 8419; FL = 215,875; MD = 72,121; NC = 58,511; NYC = 242,431; NYS = 106,619; VA = 49,844; and WV = 5058), of which 7705 (1%) were not uploaded successfully and were reported as errors (data not shown). A total of 290,482 (36%) eHARS records across these 8 East Coast public health jurisdictions matched across all levels: very low (8.9%), low (0.0%), medium low (0.0%), medium (8.1%), medium high (1.2%), high (0.2%), very high (12.9%), extremely high (30.5%), and exact (38.2%) (Table 1). Overall, close to 70% of matches were exact or extremely high.
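
The totals and percentages reported above can be reproduced with simple arithmetic. The short Python check below uses only figures stated in this paragraph; the variable names are illustrative.

```python
# Uploaded eHARS case records per data set, as reported in the text.
uploaded = {
    "DC": 40_448, "DE": 8_419, "FL": 215_875, "MD": 72_121, "NC": 58_511,
    "NYC": 242_431, "NYS": 106_619, "VA": 49_844, "WV": 5_058,
}
total = sum(uploaded.values())

matched = 290_482   # records matched at any level
errors = 7_705      # records reported as errors

pct_matched = round(100 * matched / total, 1)
pct_errors = round(100 * errors / total, 1)

# Share of matches at each level, very low through exact, in percent.
level_shares = [8.9, 0.0, 0.0, 8.1, 1.2, 0.2, 12.9, 30.5, 38.2]
top_two = round(level_shares[-1] + level_shares[-2], 1)

print(total, pct_matched, pct_errors, top_two)
```

The per-jurisdiction counts sum to the stated 799,326 records, the matched share rounds to 36%, the error share rounds to 1%, and the exact and extremely high levels together account for 68.7% of matches, consistent with "close to 70%."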



Exact Case Pairs Across Jurisdictions

A total of 110,920 individual case records fell into the exact matching level (Table 1). These cases represent a total of 55,460 case pairs matched at the exact level. As shown in Table 2, the top 3 eHARS case pair overlaps were between NYC and NYS (51%), DC and MD (10%), and FL and NYC (6%), followed closely by FL and NYS (4%), FL and NC (3%), DC and VA (3%), and MD and VA (3%).



Congruence With July 2017 RIDR Process

In July 2017, jurisdictions received their semiannual RIDR case pair lists from the CDC; these case pairs represented possible duplicates of new persons entered between January 1, 2017, and June 30, 2017. A total of 811 exact case pairs identified using this approach also appeared on the July 2017 RIDR lists for jurisdictions (Table 3).



Estimated Time Efficiency Realized

This study estimated that jurisdictions realized approximately 8110 minutes (or 135.2 labor hours) in time efficiency using this approach compared with aspects of the traditional RIDR resolution process. NYC and NYS conduct an automated intrastate deduplication process. Therefore, the time efficiency realized may be inflated by the large number of matches between NYC/NYS that would be resolved through other methods. The time efficiency realized, when not including the NYC/NYS matches, was approximately 4220 minutes (or 70.3 labor hours) (Table 4).
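
The reported figures are consistent with applying the Methods estimate of 5 minutes per call and 2 staff per call to the 811 exact case pairs that also appeared on the July 2017 RIDR lists. The Python sketch below works through that arithmetic. The link between the 811 pairs and the 8110 minutes is inferred from the numbers rather than stated outright, and the 422-pair figure is back-calculated from the 4220 minutes, so both should be read as assumptions.

```python
MINUTES_PER_CALL = 5   # estimated call time per case pair (from Methods)
STAFF_PER_CALL = 2     # one person from each jurisdiction


def staff_minutes(case_pairs: int) -> int:
    """Staff minutes a manual phone resolution of the pairs would take."""
    return case_pairs * MINUTES_PER_CALL * STAFF_PER_CALL


all_pairs = 811                          # exact pairs also on the RIDR lists
minutes_all = staff_minutes(all_pairs)   # 8110
hours_all = round(minutes_all / 60, 1)   # 135.2

# Back-calculated: 4220 minutes at 10 staff minutes per pair.
pairs_excluding_nyc_nys = 4220 // (MINUTES_PER_CALL * STAFF_PER_CALL)
hours_excluding = round(staff_minutes(pairs_excluding_nyc_nys) / 60, 1)  # 70.3
```

Because the estimate is linear in the number of case pairs, any change to the assumed minutes per call scales both totals proportionally.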



Postprocessing of Results and the NYS Case Example

To describe the added value of using this approach, NYS examined its exact case pair matches with other jurisdictions that were not on the July 2017 RIDR list and classified them as “previously resolved” or “previously not resolved” in eHARS (Table 5). In the case of NYS, 2371 case pairs matched at the exact level were identified as “previously not resolved.”



Main Findings

Here, we successfully demonstrated use of the ATra Black Box System to assist deduplication activities across jurisdictions along the US East Coast corridor. This effort identified previously unidentified duplicates and likely helped realize time efficiency for resource-constrained public health jurisdictions. The highly collaborative public–private partnership between government, academic, and public health partners motivated jurisdictions to increase the frequency at which they directly communicate with each other about overlapping cases and has thus improved cooperative activities among public health jurisdictions. Moreover, this study addressed the critically important arena of working together to more effectively use surveillance data while enhancing the privacy safeguards for sensitive public health data.

Monthly data transfers from jurisdictions to the CDC provide the National HIV Surveillance System with necessary information to track and monitor HIV across the nation. But, by design, the CDC does not have access to personal identifiers such as first name, last name, or SSN, and thus, the variables are not available for deduplication purposes. The ATra Black Box System allows for automatic identification of matches, but without any person seeing or storing personally identifying information while in the matching process. This facilitates increased specificity in the identification and resolution of potential duplicates without compromising privacy.

This work can translate into improved Data to Care efforts reliant on surveillance data by providing more accurate, timely, and updated case data across jurisdictions. Activities related to Data to Care (ie, linking surveillance data more closely to health care outcomes) underlie the need for the improvement of data quality in HIV surveillance. Enhanced data quality allows jurisdictions to better focus their valuable public health resources on cases in need of follow-up with confidence and less so on individuals who have demonstrated continued engagement in care. Also, updated HIV surveillance data provide a better overview of HIV for public health planning purposes and have implications for funding public health efforts and health service delivery.

Public Health Implications

Identification of exact matches, especially those that were previously known to be duplicates, when accompanied by ride-along variable data, enabled jurisdictions to update their local eHARS case records with information from other jurisdictions for more complete and accurate information, complementing the conventional RIDR resolution approach. The ride-along variables also provide added value for future deduplication, such as SSN for positive identification, demographic characteristics, current address, and HIV transmission risk factors. For local public health jurisdictions, these data are critical for outbreak investigations, as well as epidemiologic analysis and reporting, which act to fine-tune policy and target prevention and control activities. An additional benefit of conducting this study was the identification of exact matches that were not previously resolved and not in the July 2017 RIDR list. Such previously unidentified case pairs exemplify cases that had not yet been distributed for resolution through the RIDR activity; finding them led to earlier identification of duplicates and hence improved accuracy of the surveillance data.

Our study suggests that jurisdictions with large seasonal migration or urban areas might stand to benefit the most from use of the ATra Black Box System, given the complex nature of movement of people through cities and the need to clarify their interaction with the public health system across jurisdictional borders to ensure the most effective follow-up. This study made it clear that NYS had significant HIV surveillance data overlap with Florida.

The overlap of eHARS records identified here indicates people's interactions with health systems across jurisdictional borders, which is challenging when the system is designed with states independently conducting surveillance in isolation from other jurisdictions, and the data at the national level do not have sufficient detail for deduplication. This article demonstrates a need to account for the mobility of people living with HIV in the United States, which leads people to engage with health care systems across different states (ie, nonresidence states). This has care implications, especially for public health departments allocating valuable resources to provide outreach, care, and support services to persons with HIV. To reach optimal levels of each milestone in the HIV care continuum and to provide a bridge between public health data and clinical care, there is a need to better understand and perhaps readjust our public health outreach to the dynamic nature of modern living. This study also presents significant benefits to jurisdictions' abilities to update their eHARS data (eg, updating cases that may have never been identified for duplicate resolution in eHARS through other efforts). Here, many cases that matched were not previously resolved in eHARS and also were not present on the most recent RIDR list, which presented an opportunity to improve the quality of local HIV surveillance data. Several reasons could explain why case pairs had not been identified by previous deduplication procedures: by design, identifying information is not available to the CDC for traditional national-level deduplication efforts; some were new cases that could appear on the upcoming RIDR list; some were cases from previous RIDR lists that were not resolved; and some were case pairs that had not been included in previous deduplication efforts. In each of these scenarios, case pair resolution enables jurisdictions to update and thus enhance their HIV surveillance data and better assist with national-level case pair deduplication.

Future Work and Study Weaknesses

Evaluating time efficiency realized could be expanded to a more comprehensive cost-to-benefit analysis in future efforts. The initial time spent on legal clearances, setting up data sharing agreements, establishing secure IT protocols between several organizations, and creating and tailoring the matching algorithm needs to be accounted for in future cost-to-benefit analyses. Although jurisdictions will naturally spend time on initial set-up activities in the earlier years, we envision that they will spend less time on the same activities in subsequent years because of growing familiarity, experience, and continued training with this approach. This work is expected to ultimately reduce the number of duplicated records existing across public health jurisdictions, but needs to be conducted on a regular basis for maximum efficiency. Future uses of this approach should also incorporate a streamlined mechanism to maximize efficiency and avoid error messages due to data aspects such as lack of names or coded names. The location of the ATra Black Box System server may also affect future uses of this approach and should be discussed with potential users.

In addition to NYS, other jurisdictions have reviewed the use of eHARS data after an ATra Black Box System run,12 but to improve overall efficiency, best practices must also be developed for postprocessing eHARS data across jurisdictions after matching case pairs using this approach. A single suite of software programs, used by all participating jurisdictions, is required to import matching data sets and ride-along variables back into eHARS and thereby realize the full potential for time and cost savings of this automated data deduplication process. As with other new approaches, significant work remains, including finding the best methods for evaluating less-than-exact matches, resolving case pair conflicts, processing ride-along variables, and developing postprocessing software.
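As one illustration of what evaluating less-than-exact matches could involve, the following Python sketch labels a candidate pair as exact, needing review, or not matching, using the 6 matching variables and 4 derived values listed in the Methods. It is not the ATra Black Box System algorithm, which is not specified here. The record layout, the definitions of partial date of birth (year and month) and partial social security number (last 4 digits), the review threshold, and the sample records are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Standard American Soundex letter groups.
_SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _SOUNDEX_CODES.get(letter, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # vowels reset the previous code
    return (code + "000")[:4]


@dataclass(frozen=True)
class CaseRecord:
    """Hypothetical record layout; not the eHARS schema."""
    last_name: str
    first_name: str
    date_of_birth: str  # assumed ISO format, YYYY-MM-DD
    birth_sex: str
    ssn: str            # assumed 9 digits, with or without dashes
    race_ethnicity: str


def _derived(record: CaseRecord) -> tuple:
    """Four derived values; the two 'partial' definitions are assumptions."""
    digits = "".join(ch for ch in record.ssn if ch.isdigit())
    return (
        soundex(record.last_name),
        soundex(record.first_name),
        record.date_of_birth[:7],  # partial date of birth: year and month
        digits[-4:],               # partial SSN: last 4 digits
    )


def classify_pair(a: CaseRecord, b: CaseRecord, review_threshold: int = 3) -> str:
    """Label a candidate pair 'exact', 'review', or 'no match'."""
    if a == b:  # all 6 variables agree
        return "exact"
    agreements = sum(
        1 for x, y in zip(_derived(a), _derived(b)) if x and x == y
    )
    return "review" if agreements >= review_threshold else "no match"


# Fictitious records for illustration only.
a = CaseRecord("Robert", "Ann", "1980-05-17", "F", "123-45-6789", "Hispanic")
b = CaseRecord("Rupert", "Anne", "1980-05-02", "F", "000-00-6789", "Hispanic")
print(classify_pair(a, a))  # exact
print(classify_pair(a, b))  # review
```

In this sketch, pairs labeled for review would still go to staff for manual resolution. The threshold controls how many such pairs are produced and would need to be tuned against each jurisdiction's capacity for manual review.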

Future work should document in detail the highly collaborative process required to implement this novel approach to resolving case pairs in eHARS, especially given that some state privacy laws prevented other invited public health jurisdictions from participating in this study. Future applications may also consider refining the algorithm to increase its sensitivity (detecting more true matches) without losing specificity. Finally, it would be informative to learn how many case pairs overlap among jurisdictions in other geographic areas and to consider how this could improve our understanding of HIV in the United States.
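The trade-off between sensitivity and specificity mentioned above could be quantified on a manually reviewed sample of candidate pairs. The following Python sketch shows the standard calculation. The counts are placeholders, not results from this study, and the function name is hypothetical.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from manually reviewed pair counts."""
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need at least one true duplicate and one true non-duplicate")
    sensitivity = true_pos / (true_pos + false_neg)  # share of true duplicates detected
    specificity = true_neg / (true_neg + false_pos)  # share of non-duplicates left unmatched
    return sensitivity, specificity


# Placeholder counts for illustration only.
sens, spec = sensitivity_specificity(true_pos=90, false_neg=10, true_neg=190, false_pos=10)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # sensitivity=0.90, specificity=0.95
```

Recomputing both values after each change to the matching rules would show whether additional matches are being gained at the cost of more false matches.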

Conclusions

This study identified 290,482 potentially duplicated case records among 799,326 uploaded case records in 9 separate eHARS data sets from 8 participating jurisdictions, including 55,460 exact case pairs. Compared with the traditional process for CDC RIDR resolution, jurisdictions estimated that using this process to identify duplicate case records in eHARS saved a combined 135 labor hours. This privacy-centered deduplication process of eHARS records across multiple public health jurisdictions has the potential to improve the timeliness, accuracy, and completeness of nationwide HIV surveillance data. It may help to reduce potential case pairs on future CDC RIDR lists, strengthen collaborative relationships among jurisdictions, provide a fruitful cooperative platform between academia and government entities, and lead to more efficient use of public health resources. To enhance the added value of this approach, future applications should consider standardized protocols for postprocessing duplicate eHARS data.

Acknowledgments

The authors extend their gratitude to all administrative and other personnel who have helped to facilitate the success of this study.

References

1. Sweeney P, Gardner LI, Buchacz K, et al. Shifting the paradigm: using HIV surveillance data as a foundation for improving HIV care and preventing HIV infection. Milbank Q. 2013;91:558–603.
2. National HIV/AIDS Strategy for the United States. 2020. Available at: Accessed December 13, 2017.
3. Gardner EM, McLees MP, Steiner JF, et al. The spectrum of engagement in HIV care and its relevance to test-and-treat strategies for prevention of HIV infection. Clin Infect Dis. 2011;52:793–800.
4. Gill MJ, Krentz HB. Unappreciated epidemiology: the churn effect in a regional HIV care programme. Int J STD AIDS. 2009;20:540–544.
5. Data to care, high impact prevention. Essential elements. About data to care. 2018. Available at: Accessed October 26, 2018.
6. Understanding the HIV care continuum. 2018. Available at: Accessed October 26, 2018.
7. Centers for Disease Control and Prevention and Council of State and Territorial Epidemiologists. Technical Guidance for HIV/AIDS Surveillance Programs, Volume I: Policies and Procedures. Atlanta, GA: Centers for Disease Control and Prevention; 2005. Available at: Accessed August 16, 2017.
8. Florida Health. Routine interstate duplicate review process. The CDC's routine interstate duplicate review (RIDR) process. Available at: Accessed October 26, 2018.
9. Ocampo JMF, Smart J, Allston A, et al. Improving HIV surveillance data for public health action in Washington, DC: a novel multiorganizational data-sharing method. JMIR Public Health Surveill. 2016;2:e3.
10. Data Security and Confidentiality Guidelines for HIV, Viral Hepatitis, Sexually Transmitted Disease, and Tuberculosis Programs: Standards to Facilitate Sharing and Use of Surveillance Data for Public Health Action. Available at: Accessed October 26, 2018.
11. Smart JC. Technology for privacy assurance. In: Collmann J, Matei S, eds. Ethical Reasoning in Big Data: An Exploratory Analysis. Cham, Switzerland: Springer International Publishing; 2016.
12. Hamp AD, Doshi RK, Lum GR, et al. Cross-jurisdictional data exchange impact on the estimation of the HIV population living in the District of Columbia: evaluation study. JMIR Public Health Surveill. 2018;4:e62.

Keywords: HIV surveillance; Data to Care; data quality; deduplication; case pair resolution; ATra Black Box System

Copyright © 2019 Wolters Kluwer Health, Inc. All rights reserved.