Electronic health records and other data routinely collected during the delivery of, or payment for, health care, together with disease- or exposure-specific registries, are important resources to address the growing need for evidence about the effectiveness, safety, and quality of medical care.1–6 Unfortunately, even very large individual healthcare databases and registries are not big or diverse enough to address many of the needs of clinicians, health care delivery systems, or the public health community. The Institute of Medicine and others have articulated the goal of using routinely collected health information for these “secondary” purposes.1,7–10 The Federal Coordinating Council for Comparative Effectiveness Research (FCCCER) noted the need for studies “with sufficient power to discern treatment effects and other impacts of interventions among patient subgroups.” In their priority recommendations, the FCCCER listed comparative effectiveness research data infrastructure as a primary investment that cuts across all other comparative effectiveness needs.5 The FCCCER also listed several considerations for investing in person-level databases for comparative effectiveness research, including the ability to link to external data sources, the research readiness of the databases, and the need to maintain security and privacy of personally identifiable health information.5
To address the need to evaluate the processes and outcomes of care of large populations, some propose creation of large, centralized, multipayer claims databases.11 For example, the Department of Health and Human Services issued a contract titled “Strategic Design for an All-Payor, All-Claims Database to Support Comparative Effectiveness Research.” In addition, several states have already implemented all-payer claims databases to address cost and quality concerns.12,13 This data centralization approach is alluring because, in theory, it mitigates the complications associated with conducting research across multiple data holders. In practice, a centralized approach raises several serious security, proprietary, operational, legal, and patient privacy concerns for data holders, patients, and funders.14–17 As one example, even if a centralized database omits explicit identifying information like name and address, it is effectively impossible to prevent reidentification of individual-level longitudinal data that contain enough detail to serve multiple purposes. In our experience, these limitations have severely constrained the effective, coordinated use of data held by multiple organizations.
An alternative to centralized, all-payer databases is one or more distributed research networks that permit comparative effectiveness and other evaluations across multiple databases without creation of a central data warehouse.14,15,18–24 Several such networks18,22–29 currently conduct comparative effectiveness and pharmacoepidemiologic research using a distributed data approach in which data holders maintain control over their protected data and its uses. These networks require data holders to transform their data of interest into a common data model that enforces uniform data element naming conventions, definitions, and data storage formats. The common data format allows data checking, manipulation, and analysis via identical computer programs that are shared by all data holders. Existing distributed networks typically distribute these computer programs via e-mail; data holders manually execute the programs and return the output, via secure e-mail or another secure mechanism, to a coordinating center for aggregation and, possibly, additional analysis. Many studies require no transfer of protected health information. Several single-study networks consisting of 20 to 50 million members that adhere to a distributed research network approach also have been developed.30,31
Existing and proposed distributed networks have tremendous potential utility for addressing the current gap in postmarketing evidence while benefiting from an approach that is more acceptable to data holders.14,15,21 In addition, a distributed approach keeps the data close to the people who know the data best and who can best consult on proper use of the data and investigate findings or anomalies.
Obstacles to effective implementation of both centralized and distributed approaches include differences in computing environments and information systems, the need for data standardization and checking, organization-by-organization variation in contracting policies and procedures, concerns related to the ethics of human subjects research and data privacy, and cross-institution variation in the rules and guidelines related to privacy and proprietary issues.32,33 Distributed networks can be built and tested in phases that allow the network to operate while being built, and network data resources can be updated and enhanced without disrupting overall network operations. Networks still require responsible and consistent stewardship of clinical records, and they exchange the administrative and computing infrastructure requirements of a centralized database for similar processes on the part of each data holder. Thus, the administrative operation of networks can be cumbersome.
Many of the administrative, technical, and analytic barriers to developing efficient and scalable distributed health data networks to support population-level analyses can be addressed through network features that support the needs of users and data holders. We describe here the design and pilot implementation of a distributed research network infrastructure intended to meet the broad needs of all parties for comparative effectiveness evaluation and other uses. We note the challenges and barriers identified, and provide a blueprint for development of a comprehensive distributed research network as a reusable national resource.
Background and Needs Assessment
The design, implementation, and evaluation plan of the pilot distributed network was based on findings from previous studies,14,15,21,34 coupled with our experience in operating a distributed network, the HMO Research Network Center for Education and Research on Therapeutics, and participating in other networks,23,27,28,31 including a current study of the safety of the H1N1 vaccine.30 Our prior work investigated the needs of data holders (eg, health plans) and potential network users (eg, federal agencies) with respect to making their data available for comparative effectiveness and other secondary uses.14,15 Data holders identified several requirements for voluntary participation in a distributed network. These included: (1) complete control of, access to, and uses of, their data, (2) strong security and privacy features, (3) limited impact on internal systems, (4) minimal data transfer, (5) auditable processes, (6) standardization of administrative and regulatory agreements, (7) transparent governance, and (8) ease of participation and use.
Users’ needs assessments, by contrast, did not depend on whether the underlying architecture was a distributed network or centralized database. Potential users identified other key elements of a network, including: menu-driven querying, easy access for feasibility assessments and for public health surveillance and monitoring, the ability to specify and create subsets of the complete data via menu-driven querying or complex programming code, and reuse of network tools to improve efficiency (eg, reuse of validated exposure and outcome algorithms).15 Nontechnical users wanted the ability to ask simple questions without assistance (eg, counts of people between the ages of 65 and 74 with a positron emission tomography scan in 2008). More sophisticated users wanted the ability to perform complex analyses (eg, compare risk adjusted survival curves for breast cancer patients treated with tamoxifen as adjuvant chemotherapy, to those who were not treated). Potential users also noted that it is often difficult to get rapid responses from existing systems, and this was even noted by users who were also data holders.
The design and rapid prototyping process focused on addressing the 8 specific data holder concerns noted above. We used a phased approach to development and implementation of the pilot distributed network. The first phase included a web-based portal system for menu-driven querying of summary-level datasets held by 5 data holders (Harvard Pilgrim Health Care, Group Health Cooperative, Geisinger Health Systems, Kaiser Permanente Colorado, and HealthPartners Research Foundation). Figure 1 and Table 1 illustrate and describe the system architecture and features; Figure 2 illustrates the current menu-driven query interface.
Development of the network software relied on a rapid prototyping approach that included multiple rounds of designing, building, testing, and revising of the interface and supporting portal, and the creation of a novel application program, the Query Execution Manager (QEM). Initial querying capability was built for drug utilization by generic name and drug class, diagnosis by 3-digit ICD-9-CM code, and procedure (Healthcare Common Procedure Coding System) queries. Query results were stratifiable by age group, sex, and year or year and quarter. These simple queries were chosen for demonstration purposes; more complex queries are possible within the network design.
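To make the query types above concrete, the following is a minimal sketch of a stratified drug utilization query run against a summary-level table. It is an illustration only, not the QEM's actual implementation: the table name `drug_summary`, its columns, the function `run_drug_query`, and all counts are hypothetical, and an in-memory SQLite database stands in for a data holder's local database.

```python
import sqlite3

# Strata named in the text; column names are hypothetical.
ALLOWED_STRATA = {"age_group", "sex", "year"}


def build_demo_db():
    """Create an in-memory summary table with made-up counts (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE drug_summary ("
        "generic_name TEXT, age_group TEXT, sex TEXT, year INTEGER, users INTEGER)"
    )
    rows = [
        ("methylphenidate", "0-17", "F", 2007, 120),
        ("methylphenidate", "0-17", "M", 2007, 310),
        ("methylphenidate", "0-17", "F", 2008, 135),
        ("methylphenidate", "0-17", "M", 2008, 330),
        ("methylphenidate", "18-44", "F", 2008, 40),
        ("atomoxetine", "0-17", "M", 2008, 55),
    ]
    conn.executemany("INSERT INTO drug_summary VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def run_drug_query(conn, generic_name, stratify_by=("age_group", "sex", "year")):
    """Return user counts for one generic drug, grouped by the requested strata."""
    if not stratify_by:
        raise ValueError("at least one stratum is required")
    unknown = set(stratify_by) - ALLOWED_STRATA
    if unknown:
        raise ValueError(f"unsupported strata: {sorted(unknown)}")
    # Column names come from a fixed whitelist; the drug name is a bound parameter.
    cols = ", ".join(stratify_by)
    sql = (
        f"SELECT {cols}, SUM(users) FROM drug_summary "
        f"WHERE generic_name = ? GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql, (generic_name,)).fetchall()
```

Because every data holder would store the same summary table layout, the same query text could run unchanged at each site, which is the property the menu-driven interface relies on.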
From the user perspective, query creation involved logging into the web portal, selecting a query type (eg, generic drug name), selecting query parameters and the data holders to query, and, once the query was submitted, reviewing the status of queries and the aggregated query results. Each data holder installed the QEM software and responded to multiple test queries. The system used a “pull” query distribution mechanism that notified (via e-mail) data holders of waiting queries. The data holder user then opened the QEM to review the query details (eg, submitter and reason for submitting) and decided whether to execute, reject, or hold the query for further review. If the data holder decided to run the query, the QEM downloaded the query text from the portal, executed it against a local database, and presented the results to the data holder for review. The data holder could then upload the results to the portal for aggregation with other results and review by the submitter.
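The "pull" workflow described above can be sketched as a small simulation. This is a hedged illustration rather than the QEM's code: the classes `Portal` and `Query`, the function `process_waiting`, and the status strings are all hypothetical names, the portal is an in-memory object rather than a web service, and the e-mail notification step is omitted.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    REJECT = "reject"
    HOLD = "hold"


@dataclass(frozen=True)
class Query:
    query_id: int
    submitter: str
    reason: str
    text: str


class Portal:
    """Hypothetical in-memory stand-in for the central web portal."""

    def __init__(self):
        self.waiting = {}   # site name -> list of queries awaiting a decision
        self.status = {}    # (query_id, site name) -> status string
        self.results = {}   # query_id -> {site name: uploaded result}

    def submit(self, query, sites):
        for site in sites:
            self.waiting.setdefault(site, []).append(query)
            self.status[(query.query_id, site)] = "waiting"

    def fetch_waiting(self, site):
        # The data holder initiates this call; the portal never reaches into a site.
        return list(self.waiting.get(site, []))

    def close(self, query, site, status, result=None):
        self.waiting[site].remove(query)
        self.status[(query.query_id, site)] = status
        if result is not None:
            self.results.setdefault(query.query_id, {})[site] = result


def process_waiting(portal, site, decide, execute, approve_upload):
    """One pass of a data holder's client over its waiting queries."""
    for query in portal.fetch_waiting(site):
        decision = decide(query)
        if decision is Decision.HOLD:
            portal.status[(query.query_id, site)] = "held"  # stays in the queue
        elif decision is Decision.REJECT:
            portal.close(query, site, "rejected")
        else:
            result = execute(query.text)       # runs against the local database only
            if approve_upload(query, result):  # data holder reviews results first
                portal.close(query, site, "completed", result)
            else:
                portal.close(query, site, "results withheld")
```

In this sketch the two review points in the text appear as the `decide` and `approve_upload` callbacks, so nothing leaves a site without an explicit decision by the data holder.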
Test scenarios were based on summary-level data and included assessment of temporal trends in the use of genetic testing, influenza-related medical and pharmacy use, attention-deficit hyperactivity disorder medication use by age, urticaria diagnoses by age and year, and rate of diabetes by age. We assessed the ability of the system to execute and perform each of the user and data holder tasks described above for each of the test queries submitted. Throughout the rapid prototyping, implementation, and testing process we also informally evaluated data holder acceptance of the system and their willingness to continue development and testing beyond the demonstration. Other network functions such as user-based access control (ie, users have different levels of permissions to submit queries), query formation restrictions (ie, limit on the number and type of query parameters available for selection), and query results viewing rules (ie, requirement of 2 data holder responses before a user can view and export aggregated results) also were developed, tested, and evaluated.
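The results viewing rule mentioned above (at least 2 data holder responses before aggregated results can be viewed) might look like the following sketch. The function name `aggregate_results`, the stratum keys, and the counts are hypothetical; the portal's actual aggregation logic is not described in this report.

```python
from collections import Counter

MIN_RESPONDERS = 2  # viewing rule from the text: at least 2 data holder responses


def aggregate_results(site_results, min_responders=MIN_RESPONDERS):
    """Sum stratified counts across sites, or return None if too few sites responded.

    site_results maps a site name to {stratum: count}; strata are arbitrary
    hashable keys such as (age_group, year).
    """
    if len(site_results) < min_responders:
        return None  # results stay hidden until enough data holders respond
    total = Counter()
    for counts in site_results.values():
        total.update(counts)  # Counter.update adds counts key by key
    return dict(total)
```

Returning only the summed counts means a user sees the network total rather than any single data holder's contribution, which is one plausible motivation for such a rule.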
Separately, we partnered with the National Center for Public Health Informatics to design and pilot test an alternative approach for securely transmitting queries and receiving results. The use case for this pilot test illustrated how an authorized user could securely authenticate to a central portal and securely distribute a computer program to each data holder through its local firewall. The program was executed and the results securely returned for aggregation. Details of this work are described elsewhere.35
Implementation and System Functioning
The system was successfully implemented and tested at each of the 5 participating data holders. Each data holder was able to install and operate the QEM, retrieve and execute queries, and upload results. Installation took approximately 15 minutes.
Each of the sample queries was executed without error. Figures 3A and B show how data holders interact with the system, specifically the functions that allow them to review queries before executing them, and review results before uploading them to the central portal. Figure 4 illustrates results from a set of sample queries regarding the use of attention-deficit hyperactivity disorder medications; this type of information can help identify trends that may warrant further evaluation or help understand the adoption and diffusion of new products. The sample queries were completed within a day of submission; this period included local review of incoming queries and of the results.
Evaluation of Acceptability to Data Holders
All data holders found the system acceptable and were willing to continue working on development and implementation. The data holders found the approach acceptable because the design directly addressed many of the data holder concerns listed above. Specifically, the design illustrated: (1) data holder data autonomy that allowed approval of all query executions and data uploads, (2) an easy to use query interface, (3) centralization of the network logic in a portal, and (4) simple installation and light-weight software that did not require any special expertise to install or use. Further, the “pull” mechanism for query distribution (ie, data holders are notified of waiting queries and retrieve them) was also an important favorable factor for data holders’ acceptance.
Overall, this demonstration validated key design features of a distributed health data network including: use of a central portal, a “pull” query distribution mechanism that obviated the need to allow queries to pass through firewalls, and local control that allows data holders to maintain physical control of their data and all uses while simultaneously increasing research access for authorized users. These features were paramount to data holders. The use of summary-level data for this demonstration obviated many data privacy concerns, facilitating acceptance by data holders. The technical platform we developed is fully capable of supporting the use of patient and encounter-level utilization information, including the ability to submit a SAS program as the query text.14
This design minimizes data holder information technology responsibilities, leaves protected health information under the control of the data holder, provides for a more straightforward security implementation, and focuses network management tasks at the central portal. It supports important capabilities such as secure communications and data protection, auditable processes, a simple query interface, and locally managed query authorization.
The menu-driven interface facilitates the use of the distributed network by users with limited technical expertise. Network features streamline workflow among participating sites and enable an authorized user to quickly assess the feasibility of various comparative effectiveness studies in larger populations than might normally be readily available. This demonstration project was relatively limited in scope to allow data holders to become comfortable with the technical design and the governance needed to manage and adjudicate simple query requests. More complex and complete demonstrations are currently underway that leverage the same network infrastructure to support full comparative effectiveness studies. Enhancements will allow more sophisticated authorization, security, and permission policies, a more flexible query interface, and access to additional data and query types. In addition, the infrastructure is designed to allow distributed multivariate analyses, either through the use of iterative methods36,37 or by merging site-specific analysis files that omit PHI, for instance through the use of high-dimensional propensity scores.38
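As a rough illustration of what an iterative distributed analysis can look like, the sketch below fits a logistic regression with one covariate by Newton-Raphson, where each site returns only summed gradient and information-matrix terms at every iteration and no person-level rows leave a site. It is not the method of the cited references, which add cryptographic protections; the function names and the toy data are hypothetical.

```python
import math


def site_summaries(data, beta):
    """Computed locally at one site: sums needed for one Newton-Raphson step.

    data is a list of (x, y) pairs with y in {0, 1}; beta is (intercept, slope).
    Only these five sums leave the site, never the rows themselves.
    """
    b0, b1 = beta
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = p * (1.0 - p)
        g0 += y - p            # gradient, intercept term
        g1 += (y - p) * x      # gradient, slope term
        h00 += w               # information matrix entries
        h01 += w * x
        h11 += w * x * x
    return (g0, g1, h00, h01, h11)


def fit_distributed(sites, max_iter=50, tol=1e-12):
    """Coordinating center: sum site summaries, update beta, repeat."""
    beta = (0.0, 0.0)
    for _ in range(max_iter):
        per_site = [site_summaries(data, beta) for data in sites]
        g0, g1, h00, h01, h11 = (sum(parts) for parts in zip(*per_site))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        beta = (beta[0] + step0, beta[1] + step1)
        if abs(step0) + abs(step1) < tol:
            break
    return beta


# Toy data for two hypothetical sites: (covariate, outcome).
SITE_A = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 1)]
SITE_B = [(0, 0), (1, 1), (2, 0), (3, 1), (5, 1)]
```

Because the log-likelihood is a sum over individuals, summing the per-site terms gives the same update as pooling the rows, so `fit_distributed([SITE_A, SITE_B])` should agree with a fit on the combined data up to floating-point rounding.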
Finally, advances in governance are as important to expanding this network model as any of the technical capabilities. Network governance requires policies and procedures to address issues such as data holder protections, conflict of interest, external communications, priority setting, by-laws, data security, accounting, network strategy, stakeholder issues, and HIPAA and human subjects protection. In addition, a coordinating center is needed to maintain network infrastructure, documentation, coordination, monitoring of data resources and contacts, documentation of lessons learned, data validity activities, and study implementation.
In theory, either a distributed or a centralized (eg, all-payer) approach can meet the need to use the growing amount of electronic health data to address important societal questions. However, a distributed network is preferred because it can perform essentially all the functions desired of a centralized database, while avoiding many disadvantages of centralized databases. In addition, distributed networks have these advantages, compared with centralized systems: (1) They allow data holders to maintain physical control over their data; without this control, in our experience, data holders are unlikely to voluntarily participate. (2) They ensure ongoing participation of individuals who are knowledgeable about the systems and practices that underlie each data holder's data. (3) They allow data holders to assess and authorize query requests, or categories of requests, on a user-by-user or case-by-case basis. (4) Distributed systems minimize the need to disclose protected health information, thus mitigating privacy concerns, many of which are regulated by the Privacy and Security Rules of the Health Insurance Portability and Accountability Act of 1996 (HIPAA). (5) Distributed systems minimize the need to disclose and lose control of proprietary data. (6) A distributed approach eliminates the need to create, secure, maintain, and manage access to a complex central data warehouse. (7) Finally, a distributed network also avoids the need to repeatedly transfer and pool data to maintain a current database, which is a costly undertaking each time updating is necessary.
A phased approach is suggested for implementing a large-scale distributed research network infrastructure for comparative effectiveness research and other purposes. Questions that leverage the most commonly used and best understood data types, target large populations, and execute standard statistical analyses using well-developed software packages will be the best candidates for demonstration studies. The Agency for Healthcare Research and Quality's recent grant awards include several examples of research projects that might benefit from the large study populations accessible through distributed data networks, including the study of outcomes resulting from depression treatments and asthma treatments in pregnancy, the study of ACE inhibitors in African-American males, and the study of various treatments for lumbar spine.39 Similarly, the Food and Drug Administration's planned Sentinel Initiative to monitor the safety of medical products could also use a distributed data network, either identical to one developed for comparative effectiveness or very similar to it. Investments that are small relative to the cost of developing the underlying electronic health information will reduce individual study costs and demonstrate the value of creating reusable, distributed research networks.
1. Baciu A, Stratton K, Burke SP, eds. The Future of Drug Safety: Promoting and Protecting the Health of the Public. Washington, DC: Institute of Medicine of the National Academies; 2006.
2. McClellan M. Drug safety reform at the FDA—pendulum swing or systematic improvement? N Engl J Med
3. Platt R, Wilson M, Chan KA, et al. The new Sentinel Network—improving the evidence of medical-product safety. N Engl J Med
4. Strom BL. The future of pharmacoepidemiology. In: Strom BL, ed. Pharmacoepidemiology. Chichester, United Kingdom: John Wiley & Sons; 2005.
6. Gliklich RE, Dreyer NA, eds. Registries for Evaluating Patient Outcomes: A User's Guide. (Prepared by Outcome DEcIDE Center [Outcome Sciences, Inc. dba Outcome] under Contract No. HHSA29020050035I TO1.). AHRQ Publication No. 07-EHC001-1. Rockville, MD: Agency for Healthcare Research and Quality. April 2007.
8. Robert Wood Johnson Foundation. National Effort to Measure and Report on Quality and Cost-Effectiveness of Health Care Unveiled. 2007. Available at: http://www.rwjf.org/pr/product.jsp?id=22371. Accessed February, 2010.
13. Miller P, Schneider CD. State and National Efforts to Establish All-Payer Claims Databases. Paper presented at: All-payer claims databases: A key to healthcare reform. Massachusetts Health Data Consortium Fall Workshop; 2009; Boston, MA.
14. Brown JS, Holmes J, Maro J, et al. Design specifications for network prototype and cooperative to conduct population-based studies and safety surveillance. Effective Health Care Research Report No. 13. (Prepared by the DEcIDE Centers at the HMO Research Network Center for Education and Research on Therapeutics and the University of Pennsylvania Under Contract No. HHSA29020050033I T05.) Agency for Healthcare Research and Quality: July 2009. Available at: www.effectivehealthcare.ahrq.gov/reports/final.cfm
15. Maro JC, Platt R, Holmes JH, et al. Design of a national distributed health data network. Ann Intern Med
16. Rosati K. Using electronic health information for pharmacovigilance: the promise and the pitfalls. J Health Life Sci Law. 2009;2:171, 173–239.
17. Rosati KB. HIPAA privacy: the compliance challenges ahead. J Health Law
18. Hornbrook MC, Hart G, Ellis JL, et al. Building a virtual cancer research organization. J Natl Cancer Inst Monogr
19. Lazarus R, Yih K, Platt R. Distributed data processing for public health surveillance. BMC Public Health
20. McMurry AJ, Gilbert CA, Reis BY, et al. A self-scaling, distributed information architecture for public health, research, and clinical care. J Am Med Inform Assoc
21. Moore KM, Duddy A, Braun MM, et al. Potential population-based electronic data sources for rapid pandemic influenza vaccine adverse event detection: a survey of health plans. Pharmacoepidemiol Drug Saf
22. Platt R, Davis R, Finkelstein J, et al. Multicenter epidemiologic and health services research on therapeutics in the HMO Research Network Center for Education and Research on Therapeutics. Pharmacoepidemiol Drug Saf
23. Wagner EH, Greene SM, Hart G, et al. Building a research consortium of large health systems: the Cancer Research Network. J Natl Cancer Inst Monogr
24. Chen RT, Glasser JW, Rhodes PH, et al. Vaccine Safety Datalink project: a new tool for improving vaccine safety monitoring in the United States. The Vaccine Safety Datalink Team. Pediatrics
26. Chan K; HMO Research Network. The HMO Research Network. In: Strom BL, ed. Pharmacoepidemiology. Chichester, United Kingdom: John Wiley & Sons; 2005.
27. Go AS, Magid DJ, Wells B, et al. The Cardiovascular Research Network: a new paradigm for cardiovascular quality and outcomes research. Circ Cardiovasc Qual Outcomes
28. Magid DJ, Gurwitz JH, Rumsfeld JS, et al. Creating a research data network for cardiovascular disease: the CVRN. Expert Rev Cardiovasc Ther
29. Davis RL, Kolczak M, Lewis E, et al. Active surveillance of vaccine safety: a system to detect early signs of adverse events. Epidemiology
31. Velentgas P, Bohn R, Brown JS, et al. A distributed research network model for post-marketing safety studies: the Meningococcal Vaccine Study. Pharmacoepidemiol Drug Saf
32. Greene SM, Geiger AM. A review finds that multicenter studies face substantial challenges but strategies exist to achieve Institutional Review Board approval. J Clin Epidemiol
33. Greene SM, Geiger AM, Harris EL, et al. Impact of IRB requirements on a multicenter survey of prophylactic mastectomy outcomes. Ann Epidemiol
34. Brown JS, Moore KM, Braun MM, et al. Active influenza vaccine safety surveillance: potential within a healthcare claims environment. Med Care
36. Karr A, Lin X, Sanil AP, et al. Secure regression on distributed databases. J Comput Graph Stat
37. Karr AF, Lin X, Reiter JP, et al. Secure regression on distributed databases. J Comput Graph Stat
38. Rassen JA, Avorn J, Schneeweiss S. Multivariate-adjusted pharmacoepidemiologic analyses of confidential information pooled from multiple health care utilization databases. Pharmacoepidemiol Drug Saf.
Keywords: distributed health data network; all payer databases; comparative effectiveness research network
© 2010 Lippincott Williams & Wilkins, Inc.