Comparative effectiveness research (CER) seeks to answer questions about the impact of an intervention, treatment, or exposure on outcomes or effectiveness by conducting secondary analyses of data collected during the normal course of health care.1,2 It therefore frequently relies upon data from sources such as electronic health record (EHR) systems and administrative claims databases. Our definition of CER encompasses treatments, interventions, or exposures directed at patients and their associated outcomes, as well as interventions directed at health care providers and the effect of these interventions on patient outcomes. Comparative effectiveness studies offer great potential for valuable insights about reducing health care cost, improving health policy decisions, and advancing health care–related research.
Although randomized clinical trials remain the gold standard for assessing the impact of treatments or interventions, data sources that capture routine clinical practice can provide a wealth of information on treatment and intervention outcomes that might be difficult to ascertain from a randomized clinical trial because of trial design complexity, high costs, and other factors. Data from health care information systems collected during the course of normal care have the advantages of providing: (1) an accurate picture of the health care services actually provided in different care settings; (2) greater numbers; and (3) more diverse patient populations. These electronic data enable researchers to assess the impact of real-life clinical practice on patient outcomes. However, reaping the full benefits of vast quantities of accumulated data requires overcoming many challenges, one of which is to make data collected and stored in different systems and locations interoperable.3–7 Lack of an open and shared information infrastructure impedes analysis of data from disparate sources.8
Consider software created to summarize blood pressure information collected from 2 different clinical practices—the summarizing software must be able to recognize how to read data files into memory and subsequently recognize which parts of the different files contain blood pressure information. If one practice reports blood pressure as “high,” “normal,” or “low,” whereas the other reports blood pressure as a systolic/diastolic value, custom programming is required to summarize the combined data in a meaningful way. Enabling a larger scale analysis of heterogeneous data requires that ≥2 systems involved in collaborative analysis can exchange data or information with each other (syntactic interoperability) and that the different systems involved in data exchange can understand and use the exchanged data and information (semantic interoperability). Syntactic interoperability requires making formats for data and messages consistent across the different systems involved in data exchange. Semantic interoperability requires that the meaning of the data is unambiguous and correctly interpreted by both humans and computers that use the data.9 To achieve semantic interoperability, data from various sources are mapped to standardized terminology systems and annotated with additional information, called metadata, that is critical for correctly understanding the data’s meaning. Metadata include key contextual information, for example, where blood pressure readings were taken (emergency room or outpatient setting).
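The blood pressure example can be made concrete with a minimal Python sketch. The record layouts for the 2 practices, the function names, and the cutoffs used to collapse numeric readings into categories are all illustrative assumptions; they are not clinical guidance and do not describe any system discussed in this paper.

```python
# Hypothetical cutoffs (mmHg), used only to illustrate the mapping.
# They are assumptions, not clinical guidance.
HIGH_SYSTOLIC, HIGH_DIASTOLIC = 140, 90
LOW_SYSTOLIC, LOW_DIASTOLIC = 90, 60


def categorize(systolic: int, diastolic: int) -> str:
    """Collapse a numeric reading into the coarse vocabulary used by practice A."""
    if systolic >= HIGH_SYSTOLIC or diastolic >= HIGH_DIASTOLIC:
        return "high"
    if systolic < LOW_SYSTOLIC or diastolic < LOW_DIASTOLIC:
        return "low"
    return "normal"


def from_practice_a(record: dict) -> dict:
    """Practice A (assumed layout) reports a category only."""
    return {
        "patient_id": record["pid"],
        "bp_category": record["bp"].strip().lower(),
        "systolic": None,  # not recoverable from a category
        "diastolic": None,
        "setting": record.get("setting"),  # metadata: where the reading was taken
        "source_value": record["bp"],
    }


def from_practice_b(record: dict) -> dict:
    """Practice B (assumed layout) reports a 'systolic/diastolic' string."""
    systolic, diastolic = (int(part) for part in record["blood_pressure"].split("/"))
    return {
        "patient_id": record["patient"],
        "bp_category": categorize(systolic, diastolic),
        "systolic": systolic,
        "diastolic": diastolic,
        "setting": record.get("location"),
        "source_value": record["blood_pressure"],
    }


practice_a = [{"pid": "A1", "bp": "High", "setting": "outpatient"}]
practice_b = [{"patient": "B7", "blood_pressure": "118/76", "location": "emergency room"}]

# Both practices now share one format (syntactic) and one vocabulary (semantic).
combined = [from_practice_a(r) for r in practice_a] + [from_practice_b(r) for r in practice_b]
```

In this sketch the combined data can only be summarized at the coarser level, because the numeric values behind practice A's categories cannot be recovered.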
A data model specifies a system for representing data elements, metadata, and relationships between different data elements in a specified domain, for example, clinical CER. When used as the design of a database, this set of structural specifications is referred to as a “schema.” To address both the syntactic and the semantic interoperability issues that arise when attempting to utilize data that are idiosyncratically generated from various sources, multisite projects typically adopt a common data model that meets the specific purposes of the project.10–16 As common data model creation or selection is often driven by a specific purpose, no existing common data models are robust enough to encompass every domain or data representation need. However, data collected during the normal course of health care are sufficiently constrained that a common standard for research purposes can address a wide variety of research questions.
The approach of standardizing data interchange is consistent with the findings of other researchers: use of a common data model promotes analytic consistency across sites and ensures that the results from similar analyses at different sites are comparable and not affected by potentially different protocol interpretations.16
In this paper, we examine the challenges associated with representing and mapping data for analyses in CER studies that use data taken from multiple EHRs and associated data warehouses. We outline a rationale for adopting a common or reference data model and compare the strengths and weaknesses of existing data models that can be used as a common or reference data model. We assess the impact of having a common data model on the approach to data collection and exchange, and present lessons learned. Using a finite set of data elements related to CER drawn from an actively used research data warehouse, we also present an evaluation of the modeling challenges and data or information loss that can occur when using different existing data models.
A glossary of the technical terms and acronyms used in the paper is presented in Table 1.
COMMON CHARACTERISTICS OF CER STUDIES
CER is characterized by research and research infrastructure development to improve medical decisions and clinical outcomes by comparing various drugs, treatments, and other interventions.17 Common epidemiological designs include case-control, parallel cohort, and self-controlled case series comparing outcomes of alternative treatments. In order for causal inference to be sufficiently robust in nonrandomized studies, analytic models require collection of substantial additional data and metadata beyond the treatments and outcomes.18–20 Typically, methods relying upon factors that are correlated with treatment selection but not with outcomes require data not found in minimal data models, for example, insurance coverage, policy context, and location details. Comprehensive data models that accommodate rich metadata augment the validity of observational analyses.
CER, and more recently, patient-centered outcomes research, often include analyses of outcomes that are not strictly clinical, such as outcomes related to utilization, expenditures, and quality of life.21 This may require representation in data models of observations of patient-reported outcomes that are not typically collected in the course of routine clinical care, although required by the Centers for Medicare and Medicaid Services in certain situations, for example, care received in nursing facilities.22 These common characteristics of CER studies drive what is needed from a data model at a minimum: representation of patient demographics, drugs, procedures, outcomes or observations, providers, health care facilities, insurance features, and payments.
RATIONALE FOR ADOPTING A COMMON/REFERENCE DATA MODEL
Adopting a common or reference data model lays the groundwork for achieving syntactic and semantic interoperability so that comparable CER analyses can be performed across research study sites. This section examines some benefits and potential limitations of adopting a common data model.
Benefits of Using a Common Data Model
Using a common data model has an important practical value of providing a “checklist” of required and optional data elements. Required data elements are typically identifiers for entities represented in the data (eg, patient, encounter), and content that is necessary for linking data across multiple entities.
Within a single site or study, a standardized terminology guides data managers and investigators as they make decisions about representing local data using standardized concepts that have been established in the broader literature. Across multiple sites, representing locally generated data with standardized concepts promotes parity in independent analyses and allows data from different sources to be combined into a single coherent dataset.
Potential Limitations of Using a Common Data Model
Many EHRs allow providers to include full-text notes regarding a patient’s treatment and outcomes. Full text generally can only be mapped to a common terminology by using Natural Language Processing (NLP) technology or through manual coding. However, with NLP, there is a possibility that even if identical algorithms are employed at different sites, context may generate different results. Local customization of NLP algorithm parameters may require generating training data for NLP that is much more costly than using “out of the box” methods that have been trained on external datasets.23 Manual coding is inefficient and impractical at the scale needed for CER studies.
When a terminology with detailed descriptions (ie, a fine-grained terminology) is mapped into a terminology that has fewer details or is coarse-grained (eg, a mapping from SNOMED to ICD9 or from ICD10 to ICD9), information is lost. It is thus beneficial, when possible, to retain the source terminology value even after mapping to a common data model. Maintenance of source records also facilitates revisions of terminology mappings if the common model evolves.
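A minimal sketch of this information loss follows. The codes are placeholders invented for illustration (they are not real SNOMED or ICD codes), and the field names `concept_code` and `source_value` are assumptions rather than fields of any particular data model.

```python
# Hypothetical many-to-one mapping: three fine-grained source concepts
# collapse into one coarse-grained target concept. None of these are
# real terminology codes.
FINE_TO_COARSE = {
    "FINE:101": "COARSE:10",
    "FINE:102": "COARSE:10",
    "FINE:103": "COARSE:10",
    "FINE:201": "COARSE:20",
}


def map_record(record: dict) -> dict:
    """Map a source code to the target terminology, keeping the source value."""
    source_code = record["code"]
    return {
        "patient_id": record["patient_id"],
        "concept_code": FINE_TO_COARSE.get(source_code),  # None if unmapped
        "source_value": source_code,  # retained so the mapping can be revised
    }


records = [
    {"patient_id": 1, "code": "FINE:101"},
    {"patient_id": 2, "code": "FINE:103"},
]
mapped = [map_record(r) for r in records]

# By mapped concept alone, the two patients are indistinguishable.
distinct_targets = {m["concept_code"] for m in mapped}
# The retained source values still separate them.
distinct_sources = {m["source_value"] for m in mapped}
```

Because `source_value` is kept alongside the mapped concept, the records could be remapped later without returning to the originating system.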
Analysis of the Strengths and Weaknesses of Existing Data Modeling Standards
We examined the strengths and weaknesses of existing data models with regard to meeting a requirement that the model support a broad range of current and future CER projects. We examined the Clinical Data Interchange Standards Consortium (CDISC) Analysis Data Model (ADaM); the Biomedical Research Integrated Domain Group (BRIDG) model; the Observational Medical Outcomes Partnership (OMOP) Common Data Model versions 2 and 4; and the US Food and Drug Administration’s (FDA’s) Mini-Sentinel Common Data Model (MSCDM) versions 1.1 and 2.1. A brief overview of each modeling standard follows.
CDISC has developed several standards for exchange, storage, and submission of clinical trials data to the FDA. Their ADaM14 is a standard for describing the structure, content, and metadata associated with analysis datasets from clinical trials. ADaM has 4 categories of metadata covering the analysis dataset itself as a whole, analysis variables, analysis parameter values, and analysis results. The metadata explain how an ADaM dataset was created from source data.
The BRIDG model aims to represent the semantics of the data from clinical trials and preclinical protocol-driven research.11 BRIDG is a collaborative effort involving CDISC, the HL7 Regulated Clinical Research Information Management Technical Committee, the National Cancer Institute, and the FDA. The BRIDG model’s static semantics are presented through class diagrams and instance diagrams that describe the concepts, attributes, and relationships that occur in clinical and preclinical research. Its dynamic semantics are presented primarily through state transition diagrams that model the behavior of those concepts and how the relationships evolve.
OMOP was set up by the Foundation for the National Institutes of Health to aid in monitoring the use of drugs for safety and effectiveness. OMOP has created common data models that define the primary data elements needed across observational studies. Specifications from OMOP include a data dictionary required for standardizing aggregated data so that such studies can be compared.13,24 OMOP’s goal is to facilitate observational studies that involve using data from different databases, including administrative claims data and EHRs. For our modeling comparison, we examined OMOP Common Data Model versions 2.0 and 4.0 (disclosure: some input from this paper’s authors went into the design of version 4.0).
The Mini-Sentinel is a pilot program sponsored by the FDA. Its goal is to develop comprehensive approaches that facilitate the use of data routinely collected and stored in EHRs for surveillance of marketed medical products’ safety. More than 30 academic and private health care institutes are participating in this program as data partners who provide data for the surveillance activity. MSCDM specifies required data content and structure as well as a standardized vocabulary mapping. To achieve semantic interoperability among the disparate datasets, the data partners transform their local datasets into MSCDM-conforming formats.10,25–28 We analyzed versions 1.1 and 2.1 of the MSCDM.
We compared the different modeling standards based on whether they were easily extensible; adequately captured patient demographic and clinical data; were easy for clinical researchers and data analysts to understand; modeled financial payment and payer data; used standardized vocabularies; modeled insurance plan benefit design and benefit plan data; had widespread real-world usage; had well-defined analytic methods and a user-base for these methods; and had the ability to model nondrug, nonprocedure interventions. As discussed above, these considerations were made in the context of the project’s present and long-term goals.
Information Loss and Limitations of Using a Common Data Model
To understand and demonstrate potential limitations and challenges associated with representing local data in a common data model, we mapped the data schema of UC San Diego’s Clinical Data Warehouse for Research (CDWR) (Fig. 1) to 4 widely adopted data models: BRIDG, ADaM, OMOP CDM version 4.0, and Mini-Sentinel version 2.1. We were unable to gain access to the detailed database schema for ADaM from CDISC and had to make inferences about primary keys and foreign key relationships across tables. As a consequence, in this paper, we focus on results for the BRIDG, OMOP, and Mini-Sentinel models. We chose these models because their data elements are designed to represent the data from EHRs, clinical data warehouses, or clinical trial management systems. As the BRIDG model does not include a standardized vocabulary, we used the CDWR’s local vocabulary for BRIDG in our mapping exercise. We examined how completely and accurately these models represent local data using 2 CER study scenarios (see Supplemental Digital Content for full details, http://links.lww.com/MLR/A507). Our goal was to answer the following questions.
Does every data element have a place in the reference models?
What kinds of extensions or modifications are required to represent the testing scenarios?
Does the model transformation result in overgeneralization?
Does the model transformation result in missing nuance or attribution (eg, reported vs. observed vs. measured), which is critical for data interpretation?
Are there any missing semantic links?
Scenario 1: The first scenario assessed was for a medication therapy management study that involved ascertaining age, sex, race, ethnicity, insurance status, marital status, and other relevant variables for all patients with type 2 diabetes seen at several clinics belonging to 1 study site between January 1, 2011 and June 30, 2011. The CER inquiry of interest was as follows: are outcomes different for patients with type 2 diabetes whose medication use is managed by both a pharmacist and a physician as compared to patients with type 2 diabetes whose medication use is managed only by a physician?
Scenario 2: The second scenario assessed was for a local clinical study that involved ascertaining primary care physician name, last visit date, highest hemoglobin A1C level, and other relevant variables for patients between the ages of 25 and 70 who had been seen at a study site’s family medicine clinic within the last 18 months.
The CDWR consists of 8 major tables that cover entities such as Encounter, Problem, Observation, Patient, Provider, Medication, Procedure, and Location (Fig. 1). Clinical data are extracted, transformed, and loaded (in a process called ETL for short) from UCSD’s Epic Clarity EHR system to the CDWR every day.
Model Mapping and Data Querying
One of the authors (S.F.) performed cross-mapping of the CDWR schema to BRIDG, OMOP CDM version 4.0, and Mini-Sentinel. The other authors reviewed the cross-mapping results to ensure accuracy. Queries for the 2 scenarios above were run and the results compared.
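As an illustration of the kind of query involved, the sketch below expresses scenario 2 against a small in-memory SQLite database. The table and column names, the sample rows, and the fixed reference date are hypothetical simplifications chosen for this example; they do not reproduce the CDWR schema or any of the common data models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables (not the actual CDWR schema).
conn.executescript("""
    CREATE TABLE provider (provider_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, birth_date TEXT,
                          pcp_provider_id INTEGER);
    CREATE TABLE encounter (patient_id INTEGER, visit_date TEXT, clinic TEXT);
    CREATE TABLE observation (patient_id INTEGER, obs_name TEXT, value REAL);
""")
conn.execute("INSERT INTO provider VALUES (10, 'Dr. Example')")
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)", [
    (1, "1960-03-15", 10),  # in age range, recent visit
    (2, "1995-09-01", 10),  # too young
    (3, "1970-07-20", 10),  # last visit outside the 18-month window
])
conn.executemany("INSERT INTO encounter VALUES (?, ?, ?)", [
    (1, "2011-05-02", "family medicine"),
    (1, "2012-01-10", "family medicine"),
    (2, "2012-02-01", "family medicine"),
    (3, "2010-01-05", "family medicine"),
])
conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", [
    (1, "HbA1c", 7.2),
    (1, "HbA1c", 8.1),
    (3, "HbA1c", 6.4),
])

# Primary care physician, last visit date, and highest hemoglobin A1C for
# patients aged 25 to 70 seen at the family medicine clinic within 18 months
# of the reference date (:as_of).
SCENARIO_2 = """
    SELECT p.patient_id, pr.name, MAX(e.visit_date) AS last_visit,
           (SELECT MAX(o.value) FROM observation o
             WHERE o.patient_id = p.patient_id
               AND o.obs_name = 'HbA1c') AS max_hba1c
      FROM patient p
      JOIN provider pr ON pr.provider_id = p.pcp_provider_id
      JOIN encounter e ON e.patient_id = p.patient_id
     WHERE e.clinic = 'family medicine'
       AND e.visit_date >= date(:as_of, '-18 months')
       AND ((strftime('%Y', :as_of) - strftime('%Y', p.birth_date))
            - (strftime('%m-%d', :as_of) < strftime('%m-%d', p.birth_date)))
           BETWEEN 25 AND 70
     GROUP BY p.patient_id, pr.name
     ORDER BY p.patient_id
"""
rows = conn.execute(SCENARIO_2, {"as_of": "2012-06-30"}).fetchall()
```

Running the same logical query against each target model requires rewriting the table and column references, which is where differences in coverage and granularity between models become visible.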
Common Data Model Impact on Data Collection, Mediation, and Exchange Approaches
A common data model serves as the reference framework for organizing the data generated from multiple, disparate, and independent study sites, in order to integrate the data for further analyses. A common data model allows 1 query to be specified to obtain data from multiple sites. A typical approach to integrating the data is to aggregate the data from all sites into a single database that uses the schema of the common data model. Data are updated in this database from the sites at regular intervals, for example, every night or once a month. The query is then applied to this database and the results are presented to the user. This single database approach has 2 disadvantages: the data may not be current, and sites lose control of their data. A less conventional approach that provides more flexibility is to use an information mediator to query each site. The mediator is a software engine that translates queries composed in the “language” of the common data model to queries in the “language” of individual data sources, and then transforms the results obtained from various sites into an integrated result in the common data model. Figure 2 presents an example of the architecture that might be associated with data sharing using a mediator.
The transformation from the original clinical data source to a common data model is captured by mediator rules. Mediator rules are declarative logical rules that specify how information from the original data source relates to the information in the global (common data) model. The common data model impact on data collection, mediation and exchange approaches section outlines some challenges associated with using a mediator.
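The mediator idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the mediator software evaluated in this paper: the 2 site schemas, the rule format in `RULES`, and the functions `translate` and `mediate` are all hypothetical.

```python
import sqlite3

# Hypothetical mediator rules: for each site, the local table and the local
# SQL expression that yields each common-model attribute.
RULES = {
    "site_a": {
        "table": "pt",
        "person_id": "'A-' || id",
        "sex": "gender",
        "birth_year": "yob",
    },
    "site_b": {
        "table": "patients",
        "person_id": "'B-' || mrn",
        "sex": "CASE sex_code WHEN 1 THEN 'M' WHEN 2 THEN 'F' END",
        "birth_year": "CAST(strftime('%Y', dob) AS INTEGER)",
    },
}


def translate(site, attributes, born_before):
    """Rewrite a common-model query into the site's local SQL."""
    rule = RULES[site]
    columns = ", ".join(f"{rule[a]} AS {a}" for a in attributes)
    sql = f"SELECT {columns} FROM {rule['table']} WHERE {rule['birth_year']} < ?"
    return sql, (born_before,)


def mediate(connections, attributes, born_before):
    """Run the translated query at every site and merge the results."""
    merged = []
    for site, conn in connections.items():
        sql, params = translate(site, attributes, born_before)
        merged.extend(conn.execute(sql, params).fetchall())
    return sorted(merged)


# Two sites holding the same kind of information in different local schemas.
site_a = sqlite3.connect(":memory:")
site_a.execute("CREATE TABLE pt (id INTEGER, gender TEXT, yob INTEGER)")
site_a.executemany("INSERT INTO pt VALUES (?, ?, ?)", [(1, "F", 1950), (2, "M", 1985)])

site_b = sqlite3.connect(":memory:")
site_b.execute("CREATE TABLE patients (mrn INTEGER, sex_code INTEGER, dob TEXT)")
site_b.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                   [(7, 2, "1948-02-11"), (8, 1, "1990-10-03")])

results = mediate({"site_a": site_a, "site_b": site_b},
                  ["person_id", "sex", "birth_year"], born_before=1960)
```

The data stay at each site and are queried in place; only the rules need to know how local fields relate to the common model.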
Analysis of the Strengths and Weaknesses of Existing Data Modeling Standards
With regard to the process of extending models to meet more CER study needs, we found that for some of the models, formal membership in the sponsor organization is required (including the payment of membership dues in the case of CDISC) in order to be involved in the process of suggesting extensions. The OMOP models were an exception in this regard.
Most of the models adequately captured patient demographic and clinical data, although access to the restricted vocabularies that facilitate translation does require use agreements. The Mini-Sentinel CDM version 1.1 did not model laboratory findings, because the main purpose of this model was to represent events from claims data but not specific clinical data captured in EHRs. However, version 2.1 expands its scope to include laboratory findings. The BRIDG model encompasses the widest range of biomedical fields. However, we found the BRIDG model to be less intuitive for clinical researchers and data analysts to understand than the other modeling standards. This is not surprising, given its role as a domain analysis model for representing a large variety of research study scenarios in detail. OMOP and Mini-Sentinel specified the use of standardized vocabularies, which aid in achieving semantic interoperability, but BRIDG and ADaM did not. With BRIDG, encoding the data with a standardized vocabulary is required. However, BRIDG does not specify the vocabulary systems to use. Instead, it allows users to use standardized vocabulary systems that are compiled through the Enterprise Vocabulary Services of the National Cancer Institute.
OMOP’s CDMs, ADaM, and the Mini-Sentinel CDMs had well-defined analytic methods with an existing or growing user-base, but BRIDG did not meet this criterion. Most of the data modeling standards examined were designed with a particular use in mind and would require some adaptation for the purposes of our research. OMOP CDM version 4.0, although not perfect, met most of the study’s needs with regard to short-term and long-term usage goals. Our findings are summarized in Table 2.
Information Loss and Limitations of Using a Common Data Model
We created and executed queries for the 2 CER study scenarios from the Information loss and limitations of using a common data model section.
OMOP CDM Version 4.0
A majority of the data fields from CDWR were successfully mapped. However, some local extensions to the CDM were required to capture detailed data items that are frequently used for local research projects. Local research use cases often require data at varying levels of granularity: for example, data about provision of care may be needed at a more fine-grained level that specifies details such as “provider name” or at a less fine-grained level that specifies only “specialty care site.” Fine-grained data such as “provider name” is unlikely to be shared across independent institutions seeking to use the common data model to facilitate data exchange. This issue was deemed solvable at a local site level by enhancing the local site’s dictionary tables. As the data concepts and attributes in this model are appropriately fine-grained, there was no overgeneralization from mapping fine-grained concepts and attributes into coarser-grained ones.
Although we did not find any alterations in semantics when representing the scenarios with the OMOP CDM, the schema has complexities that may result in barriers to adoption. OMOP (and the Mini-Sentinel data models) support both claims and EHR data, but were initially developed with focus on claims data. This results in features that may appear nonintuitive, particularly to users unfamiliar with claims data. For example, claims data identify outpatient office visits as CPT procedure codes, so metadata regarding the provider with whom a visit occurred are treated analogously to a provider by whom a procedure was conducted.
Nine tables from the BRIDG model successfully covered the current CDWR (see Supplemental Digital Content 1, http://links.lww.com/MLR/A507). The BRIDG model clearly represents the semantics of the data. We did not find any alterations in semantics when representing the scenarios with the BRIDG model, although, as with the OMOP CDM, office visits were represented as procedures.
Unlike OMOP CDM version 4.0, the BRIDG model does not include payer information. We did not observe any overgeneralization in representing the CDWR content with BRIDG. In general, however, BRIDG provides more detailed ways of representing data than the CDWR. For example, (medical) procedure information can be represented using multiple fields such as “methods” and “anatomic site.” To simplify matters, highly specialized BRIDG tables and attribute fields were combined into more general tables and attribute fields.
Unlike other models, Mini-Sentinel incorporates standardized concept codes into the tables as separate fields. A few examples of the standardized codes employed by Mini-Sentinel are Logical Observations Identifiers Names and Codes (LOINC) for laboratory and/or clinical observations, and National Drug Codes (NDC) for medication names. Embedding standardized codes into the tables simplifies the data model. However, it also means that users who adopt different coding systems need to map their data to the coding system included in the Mini-Sentinel model in order to use this model to represent their data. For example, because the data in the CDWR at the testing site are not encoded with LOINC and NDC, we needed to first map our data to LOINC and NDC before transforming our CDWR model to Mini-Sentinel. Mini-Sentinel does not provide separate tables for providers and service locations. Only the identification codes of providers and service locations are represented in this model. Mini-Sentinel is a relatively compact model with 8 tables that are intended to capture core minimum data. Unlike OMOP, which provides robust ways to represent various clinical observations, Mini-Sentinel specifies a limited number of clinical observations in the laboratory and vitals tables. Therefore, many clinical observations that are not specified as a separate field in these tables cannot be represented. For example, body mass index is a data item required by scenario 2 that cannot be retrieved when the data are represented with Mini-Sentinel.
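The difference between a table with a fixed set of observation fields and a table that stores one row per observation can be sketched as follows. Both layouts are hypothetical simplifications written for this example; the field names are assumptions and do not reproduce the actual Mini-Sentinel or OMOP table definitions.

```python
# Fields of a hypothetical fixed-column vitals table.
WIDE_VITAL_FIELDS = ("height_cm", "weight_kg", "systolic", "diastolic")

source_observations = [
    {"person_id": 1, "name": "weight_kg", "value": 82.0},
    {"person_id": 1, "name": "systolic", "value": 128},
    {"person_id": 1, "name": "bmi", "value": 27.4},
]


def to_wide(observations):
    """One row per person; observations without a matching column are dropped."""
    rows, dropped = {}, []
    for obs in observations:
        if obs["name"] in WIDE_VITAL_FIELDS:
            row = rows.setdefault(obs["person_id"], dict.fromkeys(WIDE_VITAL_FIELDS))
            row[obs["name"]] = obs["value"]
        else:
            dropped.append(obs)
    return rows, dropped


def to_long(observations):
    """One row per observation; any observation type has a place."""
    return [(o["person_id"], o["name"], o["value"]) for o in observations]


wide_rows, dropped = to_wide(source_observations)
long_rows = to_long(source_observations)
```

In the fixed-column layout the body mass index value has no field to land in, whereas the row-per-observation layout accepts it at the cost of a less compact structure.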
Common Data Model Impact on Data Collection, Mediation, and Exchange Approaches
There are 2 main challenges we confronted in attempting to use a data mediator.
Building the mediator transformation model: this is an involved task requiring input from data source designers and administrators (ie, those who understand the original clinical data source and its underlying databases quite well). This transformation process is the most expensive part of the mapping process, and is required whether the queries are translated with a mediator, or the source data are transformed into a common model.
Performance: the mediator performs inefficiently in situations that involve combining data from >1 large table (ie, tables containing several million records). This is because the mediator is a relatively novel technology, and it does not yet have many of the optimization methods that are available to a query written in a language that is native to a database. We expect that this issue will be overcome with further research.
We have outlined the importance of syntactic and semantic interoperability in analyzing clinical data from heterogeneous sources for CER. To achieve this, selecting a common data model that can accurately and completely represent these data is key. Our data modeling comparisons found that although many of the models examined captured a majority of the data elements that are useful for CER studies (patients, drugs, procedures, outcomes/observations, providers, health care facilities, benefit plans, payments, etc.), modeling of insurance plan benefit design and financial plans was most detailed in OMOP CDM version 4.0. We note that standardized vocabularies or data dictionaries were present in the OMOP and Mini-Sentinel data models but would need to be defined by the end-user for BRIDG and ADaM. This is an important issue with regard to achieving semantic interoperability.
Our study on information loss identified a need to extend OMOP’s CDM to address local modeling requirements (ie, those not shared for analysis across the network) and nonintuitive ways of modeling office visits in the OMOP and BRIDG data models.
With regard to data mediation, we did not find a mediator robust enough to handle the complex data mapping process from the different clinical information systems present across our study sites. However, more robust mediator software systems are in development, making the query translation option a more viable approach to facilitate data exchange.
The authors would like to thank Paulina Paul (UCSD) for her assistance with data modeling and mapping.