In the United States, the adoption of electronic health record systems is expected to increase rapidly, spurred by financial incentives and penalties mandated in the American Recovery and Reinvestment Act.1–3 At the same time, interest has grown in evaluating the effectiveness of clinical care practices using electronic data recorded during routine clinical care.4,5 Under the title of Comparative Effectiveness Research (CER), these studies focus on health outcomes, clinical effectiveness, risks, and benefits of medical care in real-world practice settings using clinical, administrative, and billing data.
As part of the American Recovery and Reinvestment Act’s $1.1B funding allocation, the Agency for Healthcare Research and Quality funded multiple projects focused on building scalable, distributed research networks to support CER.6 The Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) received funding through this program. SAFTINet’s goal is to build a distributed research network to support CER with a focus on safety-net stakeholders, which include persons lacking health insurance and those with Medicaid and State Children’s Health Insurance Programs. SAFTINet will combine detailed clinical and financial data from electronic health records, state Medicaid claims, and administrative data sources into a secondary analytic-only database, separate from the databases in clinical applications, to answer questions regarding the comparative effectiveness of treatments, diagnostics, protocols, and other delivery system factors. Like many beginning CER projects, SAFTINet needed to either develop or adopt a data model for storage, processing, and analysis of a broad range of clinical and financial data. Using SAFTINet as a case study, we describe the considerations that we applied in identifying an acceptable data model.
WHAT IS A DATA MODEL AND WHY IS IT IMPORTANT
Data modeling is the process of determining how data are to be stored in a database.7–9 A data model specifies features and relationships, such as:
- Data types (eg, date, integer, character, time)
- Constraints (Are missing values allowed? Must each value be unique?)
- Relationships between rows of data (Can a row in 1 table be related to none, 1, or many rows in another table? Can hierarchies that define sets of concepts be represented?)
- Metadata definitions, procedures, and assumptions that describe the intended meaning and use of each data element, how data are to be collected, allowed values or ranges, and dependencies between data elements.
The structure and metadata definitions contained in a data model heavily influence what research data can be stored, how data values should be interpreted, and how easily desired data subsets can be queried and extracted from a research database.
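The structural features listed above can be made concrete with a small sketch. The example below uses Python's built-in sqlite3 module; the table and column names (patient, visit, mrn, and so on) are illustrative assumptions and are not taken from any model discussed in this paper. It shows a data type, a required and unique value, a restricted value set, and a one-to-many relationship, and then checks which rows the model accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Hypothetical tables, used only to illustrate the listed features.
conn.executescript("""
CREATE TABLE patient (
    pat_id     INTEGER PRIMARY KEY,                       -- data type and primary key
    mrn        TEXT NOT NULL UNIQUE,                      -- required, and each value unique
    birth_date TEXT NOT NULL,                             -- ISO-8601 date stored as text
    sex        TEXT CHECK (sex IN ('Male', 'Female'))     -- allowed values (missing permitted)
);
CREATE TABLE visit (
    visit_id   INTEGER PRIMARY KEY,
    pat_id     INTEGER NOT NULL REFERENCES patient(pat_id),  -- one patient, many visits
    visit_date TEXT NOT NULL
);
""")

def try_insert(sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

PATIENT_SQL = "INSERT INTO patient VALUES (?, ?, ?, ?)"
VISIT_SQL = "INSERT INTO visit VALUES (?, ?, ?)"

ok_patient = try_insert(PATIENT_SQL, (1, "A100", "1970-05-01", "Female"))
dup_mrn = try_insert(PATIENT_SQL, (2, "A100", "1980-01-01", "Male"))   # duplicate mrn
missing_dob = try_insert(PATIENT_SQL, (3, "A101", None, "Male"))       # birth_date required
ok_visit = try_insert(VISIT_SQL, (10, 1, "2011-03-04"))
orphan_visit = try_insert(VISIT_SQL, (11, 99, "2011-03-04"))           # no such patient

print(ok_patient, dup_mrn, missing_dob, ok_visit, orphan_visit)
```

Metadata definitions (intended meaning, collection procedures) are not expressible in this form and would still need to be documented alongside the schema.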
Despite a large computer and information sciences-oriented technical literature on data modeling,8,10–13 the choices, options, and impact of data modeling decisions to support clinical research are neither well studied nor published.14,15
The structural components of a data model typically are conveyed schematically in drawings that use symbols and notations to denote the features of and relationships among data items. Figure 1 illustrates 2 very simple data models drawn in a widely used format called the Entity-Relationship Diagram (ERD). The diagrams show each item’s data type (integer, characters, dates), notes if the data element is required to always have a value (NN=never null, ie, it cannot be null/missing), and if it is a primary (PK) or foreign key (FK). For example, the patient table in model 1a has a data item named “PAT_ID” that is an integer, must always have a value (NN), and is a PK. ERDs capture the database structure but not metadata (descriptions of the data meanings) that usually is captured in related documents that are linked to an ERD.
Despite their simplicity, the ERDs in Figure 1 visually illustrate markedly different assumptions in how 2 models organize and relate data items. Model 1a is visit-centric, requiring all diagnoses, procedures, and medication prescriptions to be associated with a visit. None of these entities contain a date field because the date is obtained from the required associated visit. On the basis of the symbols at the ends of the connecting lines (lines, circles, and crow’s feet), the model 1a “says” that a visit does not always need to have an associated procedure or a prescription but must have at least one ICD-9-CM diagnosis. In contrast, model 1b is patient-centric, associating visits, procedures, and prescriptions with a patient rather than a visit. In this model, procedures and prescriptions contain an independent date field. Data in this field can be unrelated to any visit, allowing for procedures and prescriptions to be recorded when there was no corresponding visit. The first model also has a single date field for visits whereas the second model has 2 date fields for visits. Model 1b allows for both hospital admission and discharge dates to be captured whereas model 1a would require a decision about which date to store in the single visit date field. Diagnoses in model 1a can only be ICD-9-CM codes whereas model 1b can store any type of diagnosis codes (ICD-9-CM, ICD-10-CM, SNOMED) using different values in Code_Type to distinguish between different coding systems. CPT codes can be recorded in model 1b but not in model 1a.
Multidisciplinary teams including informatics professionals, clinical investigators, and biostatisticians use ERDs to evaluate a proposed data model’s ability to meet the data storage and querying needs of a research project. A significant challenge for CER projects is designing a database structure that is sufficiently flexible to support a wide variety of data types collected from electronic health records, billing systems, pharmacy dispensing, and pharmacy benefits/claims processing systems. The process of combining related data from disparate sources that have different data structures, variable formats, definitions, specificity, and quality, is called data integration.16–19 Data integration requires careful attention to how the data from different systems will be represented in the research data model and how differences in data definitions, procedures, and sources will be resolved to ensure compatibility and comparability of the resulting data values. As data integration decisions and data model structures and definitions are so closely intertwined, these decisions are best addressed when investigators and analysts (those who will be using the database to answer questions) work closely with database designers (those who model and create the database) to explore the impacts of design decisions on database maintenance and extensibility, data quality, accessibility, and analytic capability.
The 2 simple data models in Figure 1 embody different assumptions. Model 1a assumes all actions that matter to the study occur during a visit. Model 1b is more general because it represents actions both during and between visits. However, for model 1a, the query: “What is the average number of prescriptions written per ambulatory visit?” is trivial because of the direct connection between visits and prescriptions. Answering the same question with model 1b is more complex, requiring a comparison of dates between visits and prescriptions to find prescriptions written on the same date as a visit.
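The difference in query complexity can be sketched as follows. This is a minimal illustration using Python's sqlite3 module; the table and column names are assumptions loosely patterned on the descriptions of models 1a and 1b, not a reproduction of Figure 1, and visit type is omitted so that every visit is treated as ambulatory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Model 1a (visit-centric): a prescription belongs to a visit and carries no date.
CREATE TABLE visit_a (visit_id INTEGER PRIMARY KEY, pat_id INTEGER NOT NULL,
                      visit_date TEXT NOT NULL);
CREATE TABLE rx_a    (rx_id INTEGER PRIMARY KEY,
                      visit_id INTEGER NOT NULL REFERENCES visit_a(visit_id),
                      drug TEXT NOT NULL);

-- Model 1b (patient-centric): a prescription belongs to a patient and has its own date.
CREATE TABLE visit_b (visit_id INTEGER PRIMARY KEY, pat_id INTEGER NOT NULL,
                      start_date TEXT NOT NULL, end_date TEXT);
CREATE TABLE rx_b    (rx_id INTEGER PRIMARY KEY, pat_id INTEGER NOT NULL,
                      rx_date TEXT NOT NULL, drug TEXT NOT NULL);

-- Same clinical events in both models; rx 3 in model 1b was written between visits.
INSERT INTO visit_a VALUES (1, 1, '2011-01-10'), (2, 1, '2011-02-15');
INSERT INTO rx_a    VALUES (1, 1, 'albuterol'), (2, 1, 'fluticasone');
INSERT INTO visit_b VALUES (1, 1, '2011-01-10', '2011-01-10'),
                           (2, 1, '2011-02-15', '2011-02-15');
INSERT INTO rx_b    VALUES (1, 1, '2011-01-10', 'albuterol'),
                           (2, 1, '2011-01-10', 'fluticasone'),
                           (3, 1, '2011-03-01', 'albuterol');
""")

# Model 1a: follow the direct visit-to-prescription link.
avg_a = conn.execute("""
    SELECT CAST(COUNT(r.rx_id) AS REAL) / COUNT(DISTINCT v.visit_id)
    FROM visit_a v LEFT JOIN rx_a r ON r.visit_id = v.visit_id
""").fetchone()[0]

# Model 1b: no direct link, so match on patient and date.
avg_b = conn.execute("""
    SELECT CAST(COUNT(r.rx_id) AS REAL) / COUNT(DISTINCT v.visit_id)
    FROM visit_b v LEFT JOIN rx_b r
      ON r.pat_id = v.pat_id AND r.rx_date = v.start_date
""").fetchone()[0]

# Model 1b can also report prescriptions written on a day with no visit.
unlinked_b = conn.execute("""
    SELECT COUNT(*) FROM rx_b r
    WHERE NOT EXISTS (SELECT 1 FROM visit_b v
                      WHERE v.pat_id = r.pat_id AND v.start_date = r.rx_date)
""").fetchone()[0]

print(avg_a, avg_b, unlinked_b)
```

In this sketch the date match in model 1b is an inference rather than a recorded fact: two visits on the same day, or a prescription written on a visit day but outside the visit, would be attributed incorrectly.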
Balancing flexibility versus complexity is a common tension in data modeling. A simple data model may not record important details; a more complex model may be more difficult to query. Given inherent tradeoffs, the next question to explore is: “How does one judge the acceptability or quality of a data model for a particular use?”
EVALUATING DATA MODEL QUALITY
Numerous approaches to assessing data models appear in the Information Systems literature.20–26 In an examination of data model alternatives for the Food and Drug Administration Sentinel Initiative,15 Brown and colleagues list 5 key related questions:
- What does the system need to do?
- What data are needed to meet system needs?
- Where will the data be stored?
- How will the data be analyzed?
- Is a common data model needed, and if so, what will the model look like?
Although the 5 questions selected by Brown may be the most important questions in their specific setting, they may not be the most important in other settings. Moody and Shanks created a comprehensive framework to ensure that all potential features of a data model are considered.27–29 They proposed 8 dimensions to be considered in evaluating a data model.28 We have restated the 8 dimensions in a CER context (Table 2). The Moody and Shanks framework emphasizes an integrated analysis: each setting will consider some factors to be critical, other factors to be “nice to have,” and the remainder to be not relevant.
IDENTIFYING AND EVALUATING DATA MODELS FOR SAFTINet
Data modelers work from use cases, which are small vignettes illustrating the tasks an Information System needs to support. A very high-level CER use case is: “Identify a cohort of adult patients with asthma who meet a set of criteria.” Table 1 contains a draft cohort definition from the SAFTINet project. Concepts contained in this cohort definition highlight data items that must be present in the SAFTINet data model such as data on patients, diagnoses, medication administrations, filled prescriptions, and emergency department, ambulatory, urgent care, and inpatient visits. The definition also contains explicit or implied constraints that must be represented in the SAFTINet data model: patients must have a date of birth to calculate age; visits must have a date to calculate time intervals; and diagnoses and medication administrations must have a date or be associated with a visit that has a date.
Other use cases revealed additional “must-support” capabilities for the SAFTINet data model, including the ability to:
- extract patient-level data to create analytic datasets;
- calculate ages to the year for adults, and to smaller units of measurement for children depending on their age;
- calculate prescribed drug intervals (often called drug exposures);
- link a patient’s data across disparate data sources;
- use standardized terminologies to take advantage of conceptual hierarchies and relationships;
- identify a patient as being part of a defined cohort to allow prospective data collection;
- support deidentified data in compliance with HIPAA regulations.
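Two of the capabilities listed above, age calculation at different granularities and drug exposure intervals, can be sketched in a few lines of Python. The function names, the days-supply field, and the rule for merging adjacent prescriptions are assumptions made for illustration; they are not SAFTINet's actual definitions.

```python
from datetime import date, timedelta

def age_in_years(birth, on):
    """Completed years of age on a given date (typical granularity for adults)."""
    years = on.year - birth.year
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1  # birthday not yet reached this year
    return years

def age_in_months(birth, on):
    """Completed months of age on a given date (finer granularity for children)."""
    months = (on.year - birth.year) * 12 + (on.month - birth.month)
    if on.day < birth.day:
        months -= 1  # current month not yet completed
    return months

def exposure_intervals(prescriptions, max_gap_days=0):
    """Merge (start_date, days_supply) pairs into continuous exposure intervals.

    Two prescriptions are merged when the second starts no more than
    max_gap_days after the day following the end of the first.
    """
    intervals = []
    for start, days_supply in sorted(prescriptions):
        end = start + timedelta(days=days_supply - 1)
        if intervals and start <= intervals[-1][1] + timedelta(days=max_gap_days + 1):
            intervals[-1] = (intervals[-1][0], max(intervals[-1][1], end))
        else:
            intervals.append((start, end))
    return intervals

adult_age = age_in_years(date(1970, 5, 1), date(2011, 4, 30))
infant_age = age_in_months(date(2010, 11, 15), date(2011, 3, 14))
exposures = exposure_intervals([
    (date(2011, 1, 1), 30),
    (date(2011, 1, 31), 30),   # starts the day after the first supply runs out
    (date(2011, 4, 1), 30),    # starts after a gap, so a new interval begins
])
print(adult_age, infant_age, exposures)
```

Both calculations depend on the model constraints noted earlier: neither works unless birth dates and prescription dates are required values.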
Applications, Platforms, and Data Models
Although a data model is at the core of all large-scale projects, it is not the only consideration in selecting the data management environment for a robust CER infrastructure project. The applications or platforms used by investigators or analysts to access data in a database are also critically important to the overall value of the data infrastructure to end users. The terms “application” and “platform” are often used interchangeably, but in our usage an application completely defines the functions that can be performed by an end user, whereas a platform is more flexible, allowing users with sufficient programming skills to add new functionality to the basic system. Both applications and platforms include software components that interact with the data model. The most visible component is the user interface, which defines which parts of a data model can be “seen” and “manipulated” by the user. A very restrictive user interface that prevents access to a wide range of data model capabilities can make a flexible data model appear very limited to the user. Conversely, a comprehensive set of user applications can make a less-than-optimal data model appear more desirable because of the ease of use the applications provide. The selection of a data model, and the resulting database for CER, may or may not depend on the available applications or platforms that support the use of the database. For example, users may need to integrate a data model into an existing platform that provides graphical user interfaces, data visualizations, or statistical software applications.
Build-From-Scratch Versus Adopt-and-Modify
Debates regarding build versus adopt at the initial database design phase are common. Specific decisions are influenced by many factors, chiefly resources and the suitability of available models. The key benefit of build-from-scratch is the opportunity to design the data model with high specificity, essentially creating, within resource limitations, a perfect fit of tool to need. An alternative is to work with an existing data model that could fit user needs with feasible modifications. This approach is particularly beneficial if an active community of users is using the data model and is contributing new software modules or features that enhance the model’s value, a benefit from shared investments. A community of users could build-from-scratch a data model that meets multiple needs although models that attempt to meet multiple needs simultaneously add both complexity and compromises. In a similar manner, adopting an existing model also requires adjusting the model to the additional needs of the current project. Both approaches require compromises: the build strategy may be time-limited and resource-limited while the adopt strategy may be constrained by inflexibility from previous design decisions.
In SAFTINet, we were interested in adapting an existing data model rather than defining our own. We felt it prudent to build upon prior investments and existing efforts. We determined that beginning with a “field-tested” data model, and contributing to and expanding upon it, was a better investment of SAFTINet resources, and that our contributions would provide a return benefit to the original data model’s community.
However, there are a number of risks associated with attempting to leverage an existing data model that was designed and optimized for 1 purpose to support a different set of use cases:
- The candidate data model may store essential data in a manner insufficient for CER use. For example, in Figure 1, both data models require a medication prescription record to be present before drug fulfillment information can be stored. If the CER study depends upon medication fulfillment data that is not tied to prescription data, then both data models in Figure 1 are insufficient to meet this requirement.
- The existing data model may store data in a manner that is difficult to query for CER. Cohort identification requires concepts such as “average daily exposure period >30 days” or “time between successive admissions.” These concepts are not likely to be represented in many data models and would have to be computed either “on the fly” during the query or as a preprocessing step, adding complexity, and therefore time and resources, to deriving the cohort.
- The existing data model may have entire data domains absent because they were not relevant to the original project needs. For example, a model created to support drug safety may not contain tables for detailed billing or reimbursement data because these data were not relevant to the original project.
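As an illustration of the second risk, a derived concept such as “time between successive admissions” typically has to be computed outside the stored data. The sketch below shows one possible preprocessing step in Python; the function name and the (patient, admission date) input shape are assumptions, and the interval is measured from admission to admission, although a study might instead measure from discharge to the next admission.

```python
from datetime import date
from itertools import groupby

def days_between_admissions(admissions):
    """Map each patient to the day counts between successive admissions.

    admissions: iterable of (pat_id, admit_date) pairs in any order.
    """
    gaps = {}
    # Sorting orders rows by patient and then by date, as groupby requires.
    for pat_id, rows in groupby(sorted(admissions), key=lambda row: row[0]):
        dates = [admit_date for _, admit_date in rows]
        gaps[pat_id] = [(later - earlier).days
                        for earlier, later in zip(dates, dates[1:])]
    return gaps

admission_gaps = days_between_admissions([
    (1, date(2011, 1, 5)),
    (1, date(2011, 3, 1)),
    (1, date(2011, 1, 20)),
    (2, date(2011, 2, 2)),   # single admission, so no gap to report
])
print(admission_gaps)
```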
Evaluating an Existing Data Model’s Complexity and Usability
The general approach for determining if an existing data model can meet the needs of a new project begins with use cases as described previously. From these use cases, a data analyst examines the proposed data model and determines whether the existing tables and columns in the data model can satisfy the proposed analytic needs. If all of the required data can be stored, the data analyst attempts to estimate the complexity of the queries necessary to extract data from the data model. One key indicator of query complexity is the number of tables that need to be linked to answer a query. For example, in Figure 1, determining all of the medications that a patient has filled over a period of time would require linking 4 tables in model 1a (Patient→Visit→Prescriptions→Rx Fulfillment) but only 3 tables in model 1b (Patient→Prescriptions→Rx Fulfillment). A more challenging measure of data model fit is the difficulty of adding a new data element (eg, a new medication) or data class (eg, radiology results).
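The table-count comparison can be sketched as two queries that return the same answer by different paths. The example uses Python's sqlite3 module; the table and column names are assumptions patterned on the description of Figure 1, and counting JOIN keywords is only a crude proxy for the number of linked tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (pat_id INTEGER PRIMARY KEY);
-- Model 1a: prescriptions hang off visits.
CREATE TABLE visit_a (visit_id INTEGER PRIMARY KEY, pat_id INTEGER NOT NULL,
                      visit_date TEXT NOT NULL);
CREATE TABLE rx_a    (rx_id INTEGER PRIMARY KEY, visit_id INTEGER NOT NULL, drug TEXT NOT NULL);
CREATE TABLE fill_a  (fill_id INTEGER PRIMARY KEY, rx_id INTEGER NOT NULL, fill_date TEXT NOT NULL);
-- Model 1b: prescriptions hang off patients.
CREATE TABLE rx_b    (rx_id INTEGER PRIMARY KEY, pat_id INTEGER NOT NULL,
                      rx_date TEXT NOT NULL, drug TEXT NOT NULL);
CREATE TABLE fill_b  (fill_id INTEGER PRIMARY KEY, rx_id INTEGER NOT NULL, fill_date TEXT NOT NULL);

INSERT INTO patient VALUES (1);
INSERT INTO visit_a VALUES (1, 1, '2011-01-10');
INSERT INTO rx_a    VALUES (1, 1, 'albuterol');
INSERT INTO fill_a  VALUES (1, 1, '2011-01-11'), (2, 1, '2011-02-12');
INSERT INTO rx_b    VALUES (1, 1, '2011-01-10', 'albuterol');
INSERT INTO fill_b  VALUES (1, 1, '2011-01-11'), (2, 1, '2011-02-12');
""")

FILLS_1A = """
SELECT f.fill_date, r.drug
FROM patient p
JOIN visit_a v ON v.pat_id = p.pat_id
JOIN rx_a r    ON r.visit_id = v.visit_id
JOIN fill_a f  ON f.rx_id = r.rx_id
WHERE p.pat_id = ? AND f.fill_date BETWEEN ? AND ?
ORDER BY f.fill_date
"""

FILLS_1B = """
SELECT f.fill_date, r.drug
FROM patient p
JOIN rx_b r   ON r.pat_id = p.pat_id
JOIN fill_b f ON f.rx_id = r.rx_id
WHERE p.pat_id = ? AND f.fill_date BETWEEN ? AND ?
ORDER BY f.fill_date
"""

def tables_linked(sql):
    """Crude complexity proxy: one table in FROM plus one per JOIN."""
    return sql.count("JOIN") + 1

params = (1, "2011-01-01", "2011-12-31")
fills_a = conn.execute(FILLS_1A, params).fetchall()
fills_b = conn.execute(FILLS_1B, params).fetchall()
print(tables_linked(FILLS_1A), tables_linked(FILLS_1B), fills_a == fills_b)
```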
Data Models Versus Data Quality
A high-quality data model cannot completely ensure high-quality data. A data model that supports the identified use cases can make data collection, recording, and analysis significantly easier and can ensure that entered data meet certain constraints such as data type, allowed values, and mandatory/optional values. But a data model alone cannot ensure that data values make logical or “real-world” sense. For example, a data model can constrain patient sex values to be either “Male” or “Female” and diagnoses to be valid ICD-9-CM or ICD-10-CM codes, but this data model will still permit storage of Sex=“Male” and Diagnosis=“V22 Normal Pregnancy.” Data quality checks, such as the rule that male patients cannot be pregnant, require domain knowledge that exists outside of the data model; they are data quality processes applied to data that are stored in a data model.
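The distinction can be shown with a short sketch in Python's sqlite3 module. The schema is a hypothetical simplification: it constrains sex to two values but, for brevity, does not validate diagnosis codes against a code table. The model accepts the implausible combination, and a separate rule, written with clinical knowledge the schema does not contain, is needed to find it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE patient (
    pat_id INTEGER PRIMARY KEY,
    sex    TEXT NOT NULL CHECK (sex IN ('Male', 'Female'))
);
CREATE TABLE diagnosis (
    dx_id     INTEGER PRIMARY KEY,
    pat_id    INTEGER NOT NULL REFERENCES patient(pat_id),
    code_type TEXT NOT NULL,
    code      TEXT NOT NULL
);
INSERT INTO patient VALUES (1, 'Male'), (2, 'Female');
-- Every row below satisfies the model's constraints, including the first one.
INSERT INTO diagnosis VALUES (1, 1, 'ICD-9-CM', 'V22'),
                             (2, 2, 'ICD-9-CM', 'V22'),
                             (3, 1, 'ICD-9-CM', '493.90');
""")

# Data quality rule applied after storage: flag pregnancy codes (V22.x) on male patients.
suspect_rows = conn.execute("""
    SELECT d.pat_id, d.code
    FROM diagnosis d
    JOIN patient p ON p.pat_id = d.pat_id
    WHERE p.sex = 'Male' AND d.code_type = 'ICD-9-CM' AND d.code LIKE 'V22%'
""").fetchall()
print(suspect_rows)
```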
DATA MODELS CONSIDERED BY SAFTINet
Table 3 lists the data models considered by SAFTINet during our requirements analysis phase. An absolute requirement for all candidate models was that the model be freely available in the public domain without licensing restrictions. An equally important consideration was the existence of an active user community that the SAFTINet team could leverage for advice, guidance, and collaboration.
The 8 quality dimensions in the Moody and Shanks framework in Table 2 are described at a high level of abstraction. The SAFTINet team expanded the general framework descriptions into more detailed project-specific criteria for 6 of the Moody and Shanks dimensions (Table 4). The dimensions and specific criteria were developed iteratively as the SAFTINet team explored more detailed use cases and requirements for the SAFTINet project.
Although all of the data models listed in Table 3 meet many of the SAFTINet criteria in Table 4, the team selected the Observational Medical Outcomes Partnership (OMOP) data model. Specific technical design features of the OMOP data model allow a broad range of clinical observations to be added without any structural changes to the model (no new tables or columns) and allow missing data to be handled without creating empty cells. The model has also had extensive field testing with very large administrative and clinical datasets that support complex analytic methods. OMOP exploits a rich set of terminology tables to simplify complex queries involving conceptual groups (eg, “antibiotics”). In addition, the OMOP public web site contained extensive metadata documentation and examples to clarify the intended use of each table and field. The timing was also right: the OMOP team was actively seeking new collaborators to extend the current OMOP data model to support new areas of outcomes research beyond its initial focus on drug surveillance. The SAFTINet CER network is currently being constructed with OMOP Common Data Model Version 3.0 (see http://omop.fnih.org/CDMV3 for the full specifications).
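The general design idea of adding new kinds of observations as rows rather than as columns can be sketched as follows. This is a generic illustration in Python's sqlite3 module and is not the OMOP schema; the table names, column names, and the record helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical generic design: what was observed is a row in concept,
-- so a new kind of observation needs new rows but no new tables or columns.
CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE observation (
    obs_id     INTEGER PRIMARY KEY,
    pat_id     INTEGER NOT NULL,
    concept_id INTEGER NOT NULL REFERENCES concept(concept_id),
    obs_date   TEXT NOT NULL,
    value_num  REAL,
    value_text TEXT
);
""")

def columns(table):
    """Column names of a table, used here to show the structure is unchanged."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

def record(pat_id, concept_name, obs_date, value):
    """Store one observation, registering the concept if it is new."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO concept (name) VALUES (?)", (concept_name,))
        (concept_id,) = conn.execute(
            "SELECT concept_id FROM concept WHERE name = ?", (concept_name,)
        ).fetchone()
        value_column = "value_num" if isinstance(value, (int, float)) else "value_text"
        conn.execute(
            f"INSERT INTO observation (pat_id, concept_id, obs_date, {value_column}) "
            "VALUES (?, ?, ?, ?)",
            (pat_id, concept_id, obs_date, value),
        )

columns_before = columns("observation")
record(1, "systolic blood pressure", "2011-03-04", 128)
record(1, "peak expiratory flow", "2011-03-04", 410)   # a new kind of observation
record(2, "smoking status", "2011-03-05", "never")     # patient 2 has no blood pressure row
columns_after = columns("observation")

n_concepts = conn.execute("SELECT COUNT(*) FROM concept").fetchone()[0]
n_observations = conn.execute("SELECT COUNT(*) FROM observation").fetchone()[0]
print(columns_before == columns_after, n_concepts, n_observations)
```

A value that was never measured is represented by the absence of a row, not by an empty cell in a wide table. The cost of this flexibility is that queries must filter on the concept to retrieve a specific measurement.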
Ensuring that a database can store and retrieve data required to create analytic datasets begins with an understanding of the anticipated data uses. Development of detailed use cases that describe the research hypotheses, cohort selection criteria, and analytic plans that the database must support requires the active engagement of clinical investigators. Because clinical research and related activities are both hypothesis driven and exploratory, it is often difficult for investigators to delineate all intended uses of a database. However, the urge to “keep all options open” will result in a data model that is overly complex to use and maintain, manifesting in a proliferation of tables and linkages that must be navigated when storing and retrieving data from the database.
Clinical investigators typically focus on the structure and content of analytic datasets. Less attention is given to the structure and content of supporting databases where data from disparate data sources are combined, integrated, and harmonized. Small studies with limited data drawn from a single data source do not have these issues. But as a study expands in size, scope, locations, data sources, and investigator community, the lack of attention to the data model can bring a previously successful pilot study to a grinding halt. With CER’s focus on a wide range of data from different sources across broad patient populations, the diversity of data types that need to be accommodated in research databases has grown significantly. For example, although SAFTINet does not have a requirement to represent primary “-omics” or sequence data, the OMOP model is robust enough to accommodate such data in future.
Investigators may attempt to create a data model from scratch. For a very large project that will be a reusable resource for multiple investigations, this approach should be considered very carefully. Existing data models that have been field tested and have an active user community represent hundreds of hours of analysis and use that have revealed strengths and weaknesses.
Even with access to experienced data modeling skills, the SAFTINet team made an early decision to not develop a new data model, recognizing that adopting an existing data model would require adaptation. We selected a model that could accommodate anticipated changes without major rework. The presence of a large active user community and a development staff willing to help extend the model in a manner congruent with the original model were also strong determinants. Other models listed in Table 3 have similar characteristics and would be equally valid selections depending on requirement prioritization (Table 4). By selecting an existing data model, SAFTINet will contribute back to the user community with additional use cases and added functionality. In this respect, SAFTINet will be gaining from and contributing to the work of other CER investigators and data modelers.
1. Blumenthal D. Launching HITECH. N Engl J Med. 2010;362:382–385
2. Waldren S, Kibbe DC, Mitchell J. “Will the feds really buy me an EHR?” and other commonly asked questions about the HITECH Act. Fam Pract Manag. 2009;16:19–23
3. McLeod A. Health IT for economic and clinical health: HITECH Medicare incentive payment estimator. Healthc Financ Manage. 2009;63:110–111
4. Friedman CP, Wong AK, Blumenthal D. Achieving a nationwide learning health system. Sci Transl Med. 2010;2:57cm29–31
5. Grossmann C Roundtable on Value & Science-Driven Health Care. Clinical Data as the Basic Staple of Health Learning: Creating and Protecting A Public Good: Workshop Summary. 2010 Washington, DC National Academies Press
7. Bekke JHt Semantic Data Modeling. 1992 New York Prentice Hall
8. Simsion GC Data Modeling: Theory and Practice. 2007 Bradley Beach, NJ Technics Publications
9. Hoberman S Data Modeling Made Simple: A Practical Guide for Business and IT Professionals. 2009 2nd ed Bradley Beach, NJ Technics Publications
10. Borkin SA Data Models: A Semantic Approach for Database Systems. 1980 Cambridge, Mass MIT Press
11. Chmura A, Heumann JM Logical Data Modeling: What it is and How to do it. 2005 New York, NY Springer
12. Tsichritzis DC, Lochovsky FH Data Models. 1982 Englewood Cliffs, NJ Prentice-Hall
13. Silverston L The Data Model Resource Book. 2001 Revised ed New York John Wiley
14. Riben M, Wade G, Edgerton M, et al. Aligning tissue banking data models for caBIG interoperability. AMIA Annu Symp Proc. 2008:1109
16. Marrs KA, Kahn MG. Extending a clinical repository to include multiple sites. Proc Annu Symp Comput Appl Med Care. 1995:387–391
17. Marrs KA, Steib SA, Abrams CA, et al. Unifying heterogeneous distributed clinical data in a relational database. Proc Annu Symp Comput Appl Med Care. 1993:644–648
18. Krohn R. Advice on HIE for the ARRA-minded. A big boost for digital transformation. J Healthc Inf Manag. 2009;23:7–8
19. Halamka JD. Making the most of federal health information technology regulations. Health Aff. (Millwood). 2010;29:596–600
20. von Halle B. Data: asset or liability? Database Program Des. 1991;4:7–9
21. Batini C, Ceri S, Navathe S. Conceptual database design: an entity-relationship approach. Redwood City, CA. 1992 Benjamin/Cummings Pub. Co
22. Levitin A, Redman T. Quality dimensions of a conceptual view. Inform Process Manag. 1995;31:81–88
23. Krogstie J, Lindland O, Sindre G, et al. Towards a deeper understanding of quality in requirements engineering. Adv Inf Syst Eng. 1995;932:82–95 Springer Berlin/Heidelberg
24. Lindland OI, Sindre G, Solvberg A. Understanding quality in conceptual modeling. IEEE Software. 1994;11:42–49
25. Kesh S. Evaluating the quality of entity relationship models. Infor Software Technol. 1995;37:681–689
26. Simsion GC, Witt GC Data Modeling Essentials. 2005 3rd ed Amsterdam; Boston Morgan Kaufmann Publishers
27. Moody DL, Shanks GG What makes a good data model? Evaluating the quality of entity relationship models. Proceedings of the 13th International Conference on the Entity-Relationship Approach
28. Moody DL, Shanks GG. Improving the quality of data models: Empirical validation of a quality management framework. Inf Syst. 2003;28:619–650
29. Moody DL. Theoretical and practical issues in evaluating the quality of conceptual models: current state and future directions. Data Knowl Eng. 2005;55:243–276
Keywords: data models; databases; Comparative Effectiveness Research
© 2012 Lippincott Williams & Wilkins, Inc.