Sittig, Dean F. PhD*; Hazlehurst, Brian L. PhD†; Brown, Jeffrey PhD‡; Murphy, Shawn MD, PhD§; Rosenman, Marc MD∥; Tarczy-Hornoch, Peter MD¶; Wilcox, Adam B. PhD#
The American Recovery and Reinvestment Act of 2009 provided $1.1 billion for comparative effectiveness research (CER).1 The goal of CER is to generate new evidence on the potential effectiveness, benefits, and harms of different treatments, diagnostics, preventions, and care models under “real world” conditions. Widespread adoption of CER has potential to radically change health care. CER also places enormous demands on existing informatics research infrastructure,2 as it requires aggregation and analysis of disparate data held by different institutions, each with its own representation of relevant events and accountabilities for protecting data as a matter of patient confidentiality and business operations.
Currently, most data manipulations are performed using noncoordinated applications [eg, data collection forms, electronic health records (EHRs), research databases, condition-specific registries, and statistical analyses] with disjointed institutional control. In an effort to address these demands, there have been new designs and implementations of informatics platforms that provide access to electronic clinical data and the governance required for interinstitutional CER.3–6 Briefly, a “platform” is a suite of interconnected, coordinated applications, together with the operational environment that hosts those applications.
The goal of this manuscript is to compare and contrast 6 large-scale projects that are either developing or extending existing informatics platforms for CER. Rather than compare the informatics platforms at an abstract level, we focus on specific CER projects that provide implementations of informatics platforms and highlight design requirements and solutions.
The following sections provide an overview of the projects surveyed.
WASHINGTON HEIGHTS/INWOOD INFORMATICS INFRASTRUCTURE FOR COMPARATIVE EFFECTIVENESS RESEARCH (WICER)
WICER is creating infrastructure to facilitate patient-centered outcomes research in Washington Heights, NY. The project facilitates comprehensive understanding of populations by leveraging data from existing EHRs, and combining data from institutions representing various health care processes. For example, it includes data from hospitals, clinics, specialists, homecare agencies, and long-term care facilities. It also includes survey data from community residents with assessments on socioeconomic status, vital statistics, support networks, health and illness perceptions, quality of life, and health literacy. Data from multiple sources are merged in a data warehouse, where deeper analysis is performed by clinical and public health researchers. WICER investigators are using the infrastructure and methods on 3 clinical trials in hypertension care around diagnosis, adherence to therapy, and care management.
SCALABLE PARTNERING NETWORK FOR COMPARATIVE EFFECTIVENESS RESEARCH: ACROSS LIFESPAN, CONDITIONS, AND SETTINGS (SPAN)
The HMO Research Network (HMORN) is a consortium of 19 health plans with formal research capabilities.7 SPAN, a project within the HMORN, uses its Virtual Data Warehouse (VDW) to provide a standardized, federated data system across 11 partners, to address CER in ADHD and obesity.8 The VDW consists of commonly defined linked tables within each health plan that capture medical care utilization, clinical data, health plan enrollment information, demographics, detailed inpatient and outpatient encounter information, outpatient pharmacy dispensing data, laboratory test results, and vital signs.9 The VDW is augmented with state and local cancer registry information on date and cause of death for health plan members. Each plan maintains control of individual VDW data files and does not have access to files held by other HMORN sites. All HMORN participants must be capable of running, without modification, SAS programs distributed by other sites to execute against their local VDW. SPAN is pioneering use of a new platform, PopMedNet, that facilitates creation, operation, and governance of multisite, distributed health data networks.10
ENHANCING CLINICAL EFFECTIVENESS RESEARCH WITH NATURAL LANGUAGE PROCESSING OF EHR DATA—CER-HUB
The CER-HUB is an Internet-based platform for conducting CER. A central function of CER-HUB is facilitating (through online, interactive tools) development of a shared data processor library that can be downloaded by registered researchers to provide uniform, standardized coding of both free-text and structured clinical data. This shared library permits researchers to assess data on clinical effectiveness in multiple health care areas and gain access to information locked in free-text notes. Using CER-HUB, researchers collaboratively build software applications (MediClass applications11) that will process EHR data within their respective health care organizations, creating standardized datasets that can be pooled to address specific CER protocols. Participating researchers contribute Institutional Review Board-approved, limited datasets to a centralized coordinating center to be pooled with data similarly processed from other health care organizations to answer CER questions. The CER-HUB is being used to conduct 2 CER studies addressing effectiveness of medication for controlling asthma and of smoking cessation counseling services, across 6 geographically distributed and demographically diverse health systems. Researchers and data providers for these initial studies come from 3 Kaiser health plans (Northwest, Hawaii, and Georgia regions), 1 consortium of Federally Qualified Health Centers located primarily along the west coast (OCHIN Inc.), 1 Veterans Administration service region (Puget Sound VA in Washington), and an integrated network of hospitals and physicians in the greater Dallas/Fort Worth area (Baylor Health Care System).
THE PARTNERS RESEARCH PATIENT DATA REGISTRY (RPDR)
The RPDR is an enterprise data warehouse combined with a multifaceted user interface that enables clinical research and CER across Partners Healthcare in Boston, MA. The RPDR is used to recruit patients for clinical trials, and to perform active surveillance. It amasses data from billing, decision support, and EHRs in the Partners’ system. Data are available to researchers through a drag-and-drop web Query Tool12 allowing users to construct exploratory, ad hoc queries for hypothesis generation from structured data, and to get aggregate totals and graphs of age, race, sex, and vitals. A utility exists for finding matched controls for patients. Requests can be made for detailed data on patients identified through the query tool with proper IRB authorization through an automated wizard. The RPDR has proven useful for gathering clinical trial cohorts, and for CER. This strategy was later adopted as the core of “Informatics for Integrating Biology and the Bedside” (i2b2).13 The RPDR was first released in December 1999 and has been in production at multiple sites since March 2002.
THE INDIANA NETWORK FOR PATIENT CARE (INPC) COMPARATIVE EFFECTIVENESS RESEARCH TRIAL OF ALZHEIMER DISEASE DRUGS (COMET-AD)
INPC was begun in 1994 as an experiment in community-wide health information exchange (HIE) serving 5 major hospitals in Indianapolis, IN. Today, it includes data from hospitals and payers statewide.14–16 Entities participating in INPC submit patient registration records, laboratory test results, diagnoses, procedure codes, and other data for various types of health care encounters. Data are also obtained from health departments and a pharmacy benefit manager consortium. Data are standardized (eg, laboratory test results are mapped to LOINC17 with common units of measure) to the extent possible, before storage in a central repository. Data for a patient with visits to multiple INPC institutions can be linked using a patient matching algorithm. The COMET-AD project is using data from INPC to monitor health care processes and outcomes and to build systems to monitor patients for adverse drug events. The project also involves building infrastructure and workflows to support integration of biospecimen results with clinical data from the INPC.
THE SURGICAL CARE OUTCOMES ASSESSMENT PROGRAM COMPARATIVE EFFECTIVENESS RESEARCH TRANSLATION NETWORK (SCOAP-CERTN)
The goal of SCOAP-CERTN is to assess how well an existing statewide quality assurance and quality improvement registry (ie, the Surgical Care Outcomes Assessment Program) can be leveraged to perform CER. SCOAP-CERTN leverages relationships built collaboratively in SCOAP to improve surgical care and outcomes and aims to build infrastructure for streamlined, electronic data abstraction from EHRs, patient-reported outcomes, and health care payments across hospitals. Through a partnership with Microsoft Health Solutions Group (Redmond, WA), SCOAP-CERTN is identifying ways to maximize automatic capture of data from EHRs, to:
* Allow longitudinal clinical data capture across health care encounter types (ie, surgical, interventional).
* Reduce clinical workflow and staffing burdens for maintenance of the SCOAP registry at participating hospitals.
* Provide capacity and interoperability to incorporate outpatient care delivery into SCOAP.
In addition, SCOAP developers plan to add functions to capture patient-reported outcomes for research and quality improvement evaluation. The primary informatics goal is to assess how, and to what degree, the collection of SCOAP-CERTN measures can be automated across sites.
CONCEPTUAL MODEL FOR CER PLATFORM EVALUATION
Designing, developing, implementing, and using health information technology within health care delivery systems is a complex, sociotechnical challenge. To provide a theoretical basis for our comparison of 6 CER informatics platforms we adapted an 8-dimension, sociotechnical model of safe and effective health information technology use.18 This model prescribes attention to: (1) appropriate hardware/software, (2) a spectrum of clinical content ranging from case narrative, to standard vocabularies, to algorithms representing best practices, (3) human-computer interfaces enabling productive interactions with technology, (4) personnel who develop systems and how systems meet the needs of users in their social contexts, (5) workflow and communications (both between people and technology components) required to accomplish tasks using the technology, (6) organizational policies, procedures, culture, and environment that prescribe and govern how and where things happen and who is responsible, (7) external rules, regulations, and pressures which shape these organizational constraints, and (8) system measurement and monitoring which ensures adequate performance for primary intended use cases, that is, the conduct of CER.
These 8 constructs18 are used to investigate and evaluate aspects of CER platform design and implementation by ensuring that both the social as well as the technical aspects are considered. Failure to consider who will use the applications, how they will use them, and why they are necessary often leads to suboptimal technology design and utilization.
We developed a written survey and sent it to informatics experts representing 6 large CER projects focusing on the design, development, and use of multi-institutional informatics platforms. Projects were selected by convenience, yet they are representative of vastly different approaches researchers have taken to address numerous CER challenges.
We (D.F.S., B.L.H.) developed a 2-page, open-ended survey that highlighted project-specific similarities and differences. We created 2–8 questions within each of the 8 dimensions to ensure that all important aspects were captured.18 For example, within workflow/communication we asked, “How do data get into your warehouse?” and “What stages do the data go through?” Similarly, within the hardware/software dimension we asked, “What computing infrastructure is required to run your system?”
Data Collection and Analysis
Completed surveys were returned by e-mail and checked for completeness. D.F.S. and B.L.H. read through the 6–10-page responses from each of the coauthors looking for key concepts highlighting project similarities and differences. After review and discussion, it became clear that the following 4 dimensions of the 8-dimension model were the key differentiators: content or data (Table 2); workflow/communication regarding how data moved from sources to analysis (Table 3); people (investigators, data programmers, research analysts, managers) involved in the projects (Table 4); and organizational policies, procedures, and culture (Table 5). We extracted data items from the surveys to fill in the tables. In addition to survey items, 2 authors (D.F.S., B.L.H.) gathered information regarding project descriptions and funding from websites and journal articles (Table 1). Drafts of completed tables were sent to coauthors for review.
All projects implement 6 generic data processing steps necessary for distributed, multi-institutional CER projects:
* Identification of applicable data within health care transaction systems.
* Extraction to a local data warehouse for staging.
* Modeling data to enable common representations across multiple health systems.
* Aggregation of data according to this common data model.
* Analysis of data to address research questions.
* Dissemination of study results.
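The 6 steps above can be read as a pipeline in which each stage consumes the output of the previous one. The following Python sketch is illustrative only: the record fields, the 2 example sites, the mapping table, and all function names are assumptions made for this example and are not taken from any of the surveyed projects.

```python
from collections import Counter

# Hypothetical transaction records from two sites; field names are illustrative.
SITE_A = [
    {"pid": "a1", "kind": "dx", "code": "401.9"},
    {"pid": "a2", "kind": "note", "code": None},
    {"pid": "a3", "kind": "dx", "code": "250.00"},
]
SITE_B = [
    {"pid": "b1", "kind": "dx", "code": "HTN"},
    {"pid": "b2", "kind": "dx", "code": "HTN"},
]

# Assumed mapping from site-specific codes to a shared vocabulary.
COMMON = {"401.9": "hypertension", "HTN": "hypertension", "250.00": "diabetes"}


def identify(records):
    """Step 1: keep only records applicable to the study (coded diagnoses here)."""
    return [r for r in records if r["kind"] == "dx"]


def extract(records):
    """Step 2: copy applicable records into a local staging area."""
    return [dict(r) for r in records]


def model(records, site):
    """Step 3: express each staged record in the common data model."""
    return [
        {"site": site, "pid": r["pid"], "concept": COMMON[r["code"]]}
        for r in records
    ]


def aggregate(*site_tables):
    """Step 4: pool rows that already share the common model."""
    return [row for table in site_tables for row in table]


def analyze(rows):
    """Step 5: count pooled records per concept."""
    return Counter(row["concept"] for row in rows)


def disseminate(counts):
    """Step 6: format results for reporting."""
    return [f"{concept}: {n}" for concept, n in sorted(counts.items())]


def run():
    a = model(extract(identify(SITE_A)), "A")
    b = model(extract(identify(SITE_B)), "B")
    return disseminate(analyze(aggregate(a, b)))
```

In this sketch, modeling happens at each site before aggregation; a project could equally perform it at a coordinating center, which is one of the variations noted among the surveyed projects.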
All projects performed these activities, although there were variations in how (real-time aggregation of HL-7 transactions vs. nightly or as-needed extraction, transformation, and loading), where (local site vs. coordinating center), and with what tools (web-based query interfaces for researchers vs. tools to develop natural language processing modules).
Table 2 compares data sources, types, models, and handling of duplicate patients. All projects collected data from multiple sources (eg, hospitals, clinics, billing, long-term care) and included different data types (eg, numeric test results, ICD-9-encoded problems, and free-text progress notes). Only 3 projects used a “master patient index” that enabled them to combine data from patients who received treatment at different organizations. All projects used different, and sometimes multiple, data storage and manipulation formats ranging from SAS tables to XML-based documents to relational databases.
Table 3 provides a comparison of data flow and transformation, from local EHRs to aggregated analyses. The most important differences highlighted in Table 3 pertain to when patient-identifiable data leave local sites. In 2 projects, this occurs immediately after extraction from the local transaction-based clinical or administrative systems. In SPAN and CER-HUB, transfer of “raw” patient-identifiable data never occurs (ie, all data are processed at the local site by data analysis programs that are distributed from the central site, and only data conforming to protocol-specific Limited Data Sets are shared). Only 3 sites had any form of natural language processing capability; the other sites relied solely on numeric or coded data elements.
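The distributed pattern described for SPAN and CER-HUB, in which a centrally authored program runs at each site and only protocol-specific fields leave the site, can be sketched as follows. This is a simplified illustration in Python, not the projects' actual SAS or MediClass code; the class, field names, and the list of allowed fields are hypothetical, and the reduced rows are only a stand-in for a protocol-specific limited dataset.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """A participating site that keeps physical control of its raw data."""
    name: str
    records: list  # raw, patient-identifiable rows; these never leave the site

    def run(self, program):
        """Execute a centrally distributed program against local data."""
        return program(self.records)


def asthma_program(records):
    """Distributed program: select study patients, return only allowed fields.

    The 'allowed' tuple is an assumed stand-in for the fields a study
    protocol would permit in a shared dataset.
    """
    allowed = ("age_group", "on_controller")
    return [{k: r[k] for k in allowed} for r in records if r["asthma"]]


def pool(sites, program):
    """Coordinating center: send the same program to every site and pool results."""
    pooled = []
    for site in sites:
        for row in site.run(program):
            pooled.append({"site": site.name, **row})
    return pooled
```

The coordinating center sees only what each program returns, so the choice of returned fields, rather than the transport mechanism, determines what patient information is shared.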
Also of interest in Table 3 is the state of data analysis tools offered by projects. All projects are working on “user-friendly” tools to facilitate researchers’ direct access to data by ad hoc queries, while concurrently meeting multi-institutional requirements for protecting patient data and corporate business interests. To date, only the RPDR has a working version.
Table 4 describes key personnel. The most important difference is that some projects either have or are working on Internet-based interfaces that allow nontechnical investigators to perform a limited set of data queries and analyses on the combined dataset. For example, the SPAN project currently requires all queries be coded as SAS programs and sent to the local site where they are executed and the results returned after manual review; SPAN is beta-testing an Internet-based approach using the PopMedNet architecture to allow nontechnical users to issue queries.
Table 5 provides a comparison of project governance and internal organizational policies and procedures. All projects have an oversight committee; most committees consist of representatives from all sites involved in the project. Often this committee is responsible for governing all aspects of data ownership and sharing, project membership, and publication rights and responsibilities.
We compared 6 large CER projects and described how they use informatics platforms to provide data aggregation, analysis, and research management capabilities. Many of these platforms were originally designed and developed to address widely different health care, organizational, and research objectives; only after significant amounts of work had been completed were they transitioned to focus on CER. For example, the RPDR was originally designed to answer the question, “How many patients with a specific set of characteristics have we treated within our integrated delivery network?” On the other hand, INPC and WICER started as a means of improving the quality and efficiency of care in large metropolitan areas by creating centrally managed HIEs. Similarly, SCOAP-CERTN started as a registry to improve surgical outcomes and efficiency. SPAN (and to a lesser extent CER-HUB) build upon existing research networks composed of similarly organized and managed, large, integrated health plans.
CER Requires Comprehensive Data on Patients
Different data types are required to create complete, patient-centered views of patients’ medical histories. The surveyed projects demonstrate that creating a useful CER platform requires enormous amounts, and a large variety, of data. To access these data, CER investigators need to collect them from as many different sources within their participating organizations as possible. Therefore, we see researchers collecting data from inpatient and outpatient EHRs (including the text narrative of clinical encounters) and from billing and ancillary systems such as laboratory, pharmacy, and radiology. In addition, it is important to collect data that document that patients actually received the care that was ordered, so we see organizations collecting pharmacy dispensing and patient-reported data when available. This vast array of data, although large, is nearly always incomplete (ie, the data generate sparse representations in a large-dimensional space of patient care facts in the real world), and methods that use these data must be appropriate to the task of measuring health status and care events with available data.
CER Requires Data on Populations From Multiple Organizations
Researchers need to aggregate data from multiple organizations to have enough information to identify small differences, address bias, perform subgroup analyses, improve generalizability, allow evaluation of demographic and geographic variation, and identify rare events. Therefore, CER informatics platforms must be able to extract and collect data from many different organizations to compile as complete a view of conditions, treatments, and individuals as possible. Toward that end we see investigators working to include data from multiple organizations, pursuing nontraditional research data sources, such as long-term care facilities, home and public health agencies, and attempting to reliably ascertain patients’ socioeconomic status on a widespread basis.
A key requirement for data collection across health care provider organizations located in the same geographic region is the need to merge data from the same patient who has received health care services and had clinical data captured at multiple institutions. Such efforts require a community-wide master patient index that identifies patients on the basis of multiple demographic data (eg, first name, last name, date of birth, sex, social security number, or telephone number) and keeps track of all patient identifiers used by various participating organizations to create a single, master patient identifier.30 To date, only the CER projects that were built on top of existing HIE platforms designed for patient care have tackled this extraordinarily difficult problem,31,32 but in the future patient matching capabilities will be a critical success factor.
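The core bookkeeping of a master patient index, as described above, is a mapping from demographic data to a single master identifier, together with a record of every local identifier that resolves to it. The Python sketch below assumes exact matching on normalized last name, first name, date of birth, and sex; this is a deliberate simplification chosen for illustration, not the matching algorithm used by INPC or any other surveyed project, and the class and field names are hypothetical. Production systems must also handle misspellings, name changes, and missing fields.

```python
import itertools


class MasterPatientIndex:
    """Toy index linking local patient identifiers to one master identifier."""

    def __init__(self):
        self._by_key = {}     # normalized demographic key -> master id
        self._local_ids = {}  # master id -> set of (organization, local id)
        self._next = itertools.count(1)

    @staticmethod
    def _key(rec):
        """Normalize demographics so trivial formatting differences still match."""
        return (
            rec["last"].strip().lower(),
            rec["first"].strip().lower(),
            rec["dob"],
            rec["sex"].strip().upper(),
        )

    def register(self, org, local_id, rec):
        """Return the master id for this patient, creating one if needed."""
        key = self._key(rec)
        master = self._by_key.get(key)
        if master is None:
            master = f"M{next(self._next)}"
            self._by_key[key] = master
            self._local_ids[master] = set()
        self._local_ids[master].add((org, local_id))
        return master

    def local_ids(self, master):
        """List every (organization, local id) pair known for a master id."""
        return sorted(self._local_ids[master])
```

For example, registering the same person at a hospital and at a clinic under different local identifiers yields one master identifier, and `local_ids` then returns both local identifiers so that the records can be merged.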
CER Requires Data Extraction, Modeling, Aggregation and Analysis Methods and Tools
Researchers must be able to extract required data from various electronic data systems, map data types to standardized clinical representations, and analyze them. Design and development of these “mapping” applications is one of the biggest challenges in any multi-institutional research project, because it is often the case that different organizations refer to the same activity, condition, or even procedure by different names, and the same names can refer to different things across institutions. Further, even with accurate mapping it is difficult for researchers to fully appreciate local idiosyncratic data issues (eg, nonrandom incomplete data capture) without active engagement of local data experts.
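Because the same local name can mean different things at different institutions, a mapping table needs to be keyed by site as well as by local term. The Python sketch below illustrates that idea under stated assumptions: the site names, local terms, and concept labels are invented for this example and are not drawn from any standard vocabulary or from the surveyed projects. Unmapped terms are returned separately rather than guessed, so that local data experts can review them.

```python
# Hypothetical mapping table keyed by (site, local term).
MAPPINGS = {
    ("site_a", "GLU"): "glucose",
    ("site_b", "BS"): "glucose",         # different name, same concept
    ("site_a", "BS"): "breath_sounds",   # same name, different concept
}


def standardize(site, rows):
    """Attach a standard concept to each row; collect terms with no mapping."""
    mapped, unmapped = [], []
    for row in rows:
        concept = MAPPINGS.get((site, row["name"]))
        if concept is None:
            unmapped.append(row["name"])  # needs review by a local data expert
        else:
            mapped.append({**row, "concept": concept})
    return mapped, unmapped
```

Keeping the unmapped list explicit makes gaps in the mapping visible at each site instead of silently dropping or misclassifying records.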
Furthermore, conducting CER is a complex undertaking requiring people with widely different skills, often in different locations and subject to different organizational policies and practices. In an attempt to reduce potential for misunderstanding in collaboration processes, platform developers are working to create powerful, user-friendly tools for data extraction, manipulation, and analysis. These tools are being designed so CER project staffs, who often have little informatics training, can perform their tasks more efficiently. In addition, several projects are developing tools to help researchers make sense of highly variable and clinically rich free-text notes documenting patient care.
CER Must Conform to Local Organizations’ Internal Governance and IRBs’ Rules and Local and Federal Legislation
The social, legal, ethical, and political challenges involved in setting up and conducting large, multi-institutional CER projects must not be underestimated. Friedman et al33 stated that “organizations are understandably reluctant to move data beyond their own boundaries absent a clear and specific need to do so, and patients will be less likely to consent to allow this to happen.” Therefore, in addition to providing the technical infrastructure required to collect, standardize, normalize, and analyze disparate data, informatics platforms must conform to local organizations’ internal governance and IRBs’ rules and regulations as well as existing state and federal guidelines. One design to address use of protected health information is to retain physical control of raw data while providing for their aggregation as limited datasets to answer specific questions. Other ways in which projects have accommodated interinstitutional governance issues include standardizing data models across the project; limiting access to authorized personnel while facilitating remote access; restricting the types of queries that can be executed and masking patient-specific, identifiable data; and logging all data transactions and access activities. As rules, regulations, and guidelines evolve (eg, proposed Common Rule revision34) CER platforms and governance processes must evolve accordingly.
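Three of the controls listed above (restricting query types, masking data that could identify individuals, and logging access) can be combined in a single query gateway. The Python sketch below is a hypothetical illustration, not a description of any surveyed platform: the query name, the small-count threshold of 5, and the log format are all assumptions, and suppressing small counts is only one possible way of masking identifiable results.

```python
ALLOWED_QUERIES = {"count_by_sex"}  # assumed whitelist of permitted query types
MIN_CELL = 5                        # assumed threshold below which counts are masked
AUDIT_LOG = []                      # every request is recorded, permitted or not


def run_query(user, query, rows):
    """Run a whitelisted aggregate query, mask small counts, and log the access."""
    permitted = query in ALLOWED_QUERIES
    AUDIT_LOG.append({"user": user, "query": query, "permitted": permitted})
    if not permitted:
        raise PermissionError(f"query type not allowed: {query}")

    counts = {}
    for row in rows:
        counts[row["sex"]] = counts.get(row["sex"], 0) + 1

    # Replace small counts with a masked label instead of the exact number.
    return {k: (v if v >= MIN_CELL else f"<{MIN_CELL}") for k, v in counts.items()}
```

Logging before the permission check means that refused requests also appear in the audit trail, which supports later review by an oversight committee.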
SUMMARY AND CONCLUSIONS
CER stands to transform the current health care delivery system by identifying which therapies, procedures, preventive tests, and health care processes are most effective from the standpoints of cost, quality, and safety. State-of-the-art informatics platforms are necessary to carry out this type of research across organizations with disparate patient populations, health information systems, data types, and local governance structures.
We used an 8-dimension, sociotechnical model to develop a survey enabling us to compare and contrast informatics platforms that are under development or in use in 6 large CER efforts. On the basis of the data we collected, we identified 6 generic steps necessary in any distributed, multi-institutional CER project: data identification, extraction, modeling, aggregation, analysis, and dissemination.
We conclude that all of the informatics platforms for CER studied are on their way to creating the sociotechnical infrastructure required to enable researchers from multiple institutions to conduct high-quality, cost-effective CER. We expect that over the next several years, these projects will provide answers to many important CER questions that in the past were virtually inaccessible. In addition, we expect many more CER-focused informatics research platforms to be designed, developed, and tested as the fields of informatics and CER continue to evolve.
The authors thank Andrea Bradford, PhD, for editorial assistance.
2. VanLare JM, Conway PH, Rowe JW. Building academic health centers’ capacity to shape and respond to comparative effectiveness research policy. Acad Med. 2011;86:689–694
3. Stang PE, Ryan PB, Racoosin JA, et al. Advancing the science for active surveillance: rationale and design for the Observational Medical Outcomes Partnership. Ann Intern Med. 2010;153:600–606
4. Ohno-Machado L, Bafna V, Boxwala AA, et al. iDASH: integrating data for analysis, anonymization, and sharing. J Am Med Inform Assoc. 2012;19:196–201
5. Behrman RE, Benner JS, Brown JS, et al. Developing the Sentinel System—a national resource for evidence development. N Engl J Med. 2011;364:498–499
6. Payne P, Ervin D, Dhaval R, et al. TRIAD: The Translational Research Informatics and Data management grid. Appl Clin Inf. 2011;2:331–344
7. Greene SM, Hart G, Wagner EH. Measuring and improving performance in multicenter research consortia. J Natl Cancer Inst Monogr. 2005;35:26–32
8. Toh S, Platt R, Steiner JF, et al. Comparative-effectiveness research in distributed health data networks. Clin Pharmacol Ther. 2011;90:883–887
9. Hornbrook MC, Hart G, Ellis JL, et al. Building a virtual cancer research organization. J Natl Cancer Inst Monogr. 2005;35:12–25
10. Brown JS, Holmes JH, Shah K, et al. Distributed health data networks: a practical and preferred approach to multi-institutional evaluations of comparative effectiveness, safety, and quality of care. Med Care. 2010;48(suppl 1):S45–S51
11. Hazlehurst B, Frost HR, Sittig DF, et al. MediClass: a system for detecting and classifying encounter-based clinical events in any electronic medical record. J Am Med Inform Assoc. 2005;12:517–529
13. Murphy SN, Weber G, Mendis M, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17:124–130
14. Overhage JM, Tierney WM, McDonald CJ. Design and implementation of the Indianapolis Network for Patient Care and Research. Bull Med Libr Assoc. 1995;83:48–56
15. McDonald CJ, Overhage JM, Barnes M, et al. The Indiana network for patient care: a working local health information infrastructure. An example of a working infrastructure collaboration that links data from five health systems and hundreds of millions of entries. Health Aff (Millwood). 2005;24:1214–1220
16. Zhu VJ, Tu W, Rosenman MB, et al. Facilitating clinical research through the health information exchange: lipid control as an example. AMIA Annu Symp Proc. 2010;2010:947–951
17. McDonald CJ, Huff SM, Suico JG, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin Chem. 2003;49:624–633
18. Sittig DF, Singh H. A new socio-technical model for studying health information technology in complex adaptive healthcare systems. Qual Saf Healthc. 2010;19(suppl 3):i68–i74
19. Health & Human Services Research Awards: Use of Recovery Act and Patient Protection and Affordable Care Act Funds for Comparative Effectiveness Research. U.S. Government Accountability Office, Washington, D.C.; June 14, 2011. Available at: http://www.gao.gov/new.items/d11712r.pdf. Accessed May 4, 2012
20. Tatonetti NP, Denny JC, Murphy SN, et al. Detecting drug interactions from adverse-event reports: interaction between paroxetine and pravastatin increases blood glucose levels. Clin Pharmacol Ther. 2011;90:133–42. doi: 10.1038/clpt.2011.83. [Epub 2011 May 25]
21. Kurreeman F, Liao K, Chibnik L, et al. Genetic basis of autoantibody positive and negative rheumatoid arthritis risk in a multi-ethnic cohort derived from electronic health records. Am J Hum Genet. 2011;88:57–69
23. Friedman C, Shagina L, Lussier Y, et al. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc. 2004;11:392–402
24. Baorto D, Li L, Cimino JJ. Practical experience with the maintenance and auditing of a large medical ontology. J Biomed Inform. 2009;42:494–503
25. Zeng QT, Goryachev S, Weiss S, et al. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak. 2006;6:30
26. Perlis RH, Iosifescu DV, Castro VM, et al. Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol Med. 2012;42:41–50 [Epub 20 June 2011]
27. Friedlin J, McDonald CJ. Using a natural language processing system to extract and code family history data from admission reports. AMIA Annu Symp Proc. 2006:925
31. Adragna L. Implementing the enterprise master patient index. J AHIMA. 1998;69:46–48, 50, 52
32. McDonald CJ, Overhage JM, Tierney WM, et al. The Regenstrief medical record system: a quarter century experience. Int J Med Inform. 1999;54:225–253
33. Friedman CP, Wong AK, Blumenthal D. Achieving a nationwide learning health system. Sci Transl Med. 2010;2:57cm29
© 2012 Lippincott Williams & Wilkins, Inc.