The ability to combine and analyze heterogeneous and multidimensional datasets is central to the design and conduct of comparative effectiveness research (CER) studies.1 The generation of such datasets usually involves aggregating diverse but related data types, including but not limited to: (1) clinical observations and outcomes; (2) operational and fiscal indicators; (3) biomolecular markers; and (4) patient-reported variables. The ability to reproducibly and systematically identify, integrate, and deliver such datasets involves biomedical informatics technologies and approaches such as information retrieval, text mining, temporal reasoning, and semantic normalization operations. Unfortunately, the knowledge and expertise needed to understand, select, and apply such methods is often distributed across a variety of professional domains, literature corpora, and prevailing best-practice collections. Although relevant to many types of research, these challenges are further amplified in the context of CER, given the need to identify and utilize often broader classes of data than have been the focus of traditional research paradigms.
Given these gaps in knowledge and practice, this report has the following objectives: (1) to review the major challenges associated with the collection and aggregation of heterogeneous and multidimensional datasets; (2) to propose a conceptual model for the design and conduct of CER; and (3) to introduce a set of open research questions and policy priorities for the adoption of relevant information practices.
To address specific challenges in managing heterogeneous datasets for CER, one must first have an understanding of the informatics theories, methods, and frameworks that are available for implementation in such an application domain. These include methods to handle data discovery, extraction, and integration. In the following section we will outline some of these methods and discuss how they can be translated for use in CER.
Data Discovery and Extraction
The science of managing biomedical data has advanced significantly in recent decades. Trends or patterns in large datasets can increasingly be exposed using data-mining technologies and methods that have been adopted for wide-ranging uses in biomedicine.2,3 For CER, such technologies and approaches are particularly useful for hypothesis generation, study design and feasibility assessment, and cohort identification activities like participant recruitment.
Data from electronic health records (EHR), laboratory data, and billing data are generally collected and stored in transactional databases intended for patient care and administrative use, not research. Data warehousing provides a method for health care organizations to combine clinical, laboratory, and administrative data in a manner that can then be queried for secondary purposes, such as research.4 A data warehouse should be subject oriented, time dependent, nonvolatile, and well integrated.5 Many organizations have adopted data warehouses allowing for sophisticated data-driven research. Some have made it possible for researchers to interact directly with de-identified data, enabling researchers to leverage these data for efficient hypothesis generation, study design and feasibility assessments, and certain research studies.6
Another approach to data integration and aggregation is the utilization of reusable computational pipelines where users can gather information from multiple institutions and databases, and share constituent data between and among sites. For example, techniques such as service-oriented architectures (SOA) have been used to create community accessible data pipelines. One SOA platform in the research domain is the National Cancer Institute’s Cancer Biomedical Informatics Grid7 program, which uses a SOA infrastructure named caGrid to provide a common, collaborative, extensible platform for cancer researchers to allow for data interchange and analysis.8
More recently, the computational and informatics community has begun to explore the use of highly scalable and elastic “cloud” computing platforms for data sharing and analysis, using architectural approaches commonly referred to as “software as a service” or “platform as a service.” In both of these scenarios, data exchange, storage, and analysis are achieved through the use of remotely managed and standardized computing resources, accessible through common Internet technologies and facilitated through high-speed networking capabilities.9 Even as such technological approaches to data discovery are maturing, relevant data must be stored in a manner that can be readily queried and analyzed.
Clinical documentation may contain the data needed for research, but those data are often stored in an unstructured format. Although trained research personnel have traditionally been used to extract data from clinical documents for research, automated extraction is highly desired for many anticipated CER activities. Improvements in natural language processing (NLP) of narrative documents are ongoing and hold great promise, but current NLP approaches often still require clinician input and have yet to gain widespread adoption.10–14
Regardless of the effectiveness of such approaches, it remains the case that the quality of research is dependent on the quality of the data that is captured in the EHR or other databases. Kahn et al15 recently proposed a conceptual model for data quality assessment which suggests specific methods for ensuring quality data from EHR sources. These methods can be implemented on the individual level or automated to work across a database to assess the quality of data elements.
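As one illustration of an automated check that works across a database, the sketch below computes the completeness of a single data element per data source. It is not drawn from the Kahn framework itself; the field names (`site`, `severity_score`) and the records are hypothetical and serve only to show the shape of such a check.

```python
from collections import defaultdict


def completeness_by_group(records, element, group_key):
    """Return {group: fraction of records with a non-missing value for element}."""
    present = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        total[group] += 1
        value = rec.get(element)
        # Treat absent keys, None, and empty strings as missing.
        if value is not None and value != "":
            present[group] += 1
    return {group: present[group] / total[group] for group in total}


# Hypothetical records from two data sources.
records = [
    {"site": "site_a", "severity_score": 2},
    {"site": "site_a", "severity_score": 3},
    {"site": "site_b", "severity_score": None},
    {"site": "site_b"},
]

rates = completeness_by_group(records, "severity_score", "site")
for site, rate in sorted(rates.items()):
    print(f"{site}: {rate:.0%} complete")
```

A report of this kind, run before a study is designed, can show whether a key variable is recorded for every comparison group.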
Improving systematic data capture by clinicians and directly by patients for data reuse, without disrupting workflows, remains an important national goal.16 In addition, researchers are beginning to use increasingly ubiquitous mobile devices, telemedicine, and personal monitoring technologies to collect patient-derived data.17 A detailed review of these methods is beyond the scope of this paper; however, it is reasonable to expect that these technologies will continue to improve and enhance structured data for CER in the near future.
Problems can arise when trying to combine data with different storage structures or ontologies, or data from different and incompatible systems. Semantic normalization is the process of normalizing terminology across different datasets in order to be able to combine datasets while maintaining the appropriate meaning of the data elements. This process has been used to integrate structured and unstructured data18 as well as to combine large genomic datasets with phenotypic data.19 The key to integration of diverse data sources is to have a clear understanding of the type and structure of data in each database. The use of relational databases with data models specifically designed for EHR data as well as the use of multisite informatics platforms can facilitate data integration for researchers.20,21
The timing of clinical events also needs to be taken into account when combining datasets. Although laboratory and encounter level data come with a time stamp, text-based data handle time more subjectively. Determining the order of symptoms in a timeline is a major challenge when trying to integrate lab and clinical data. Temporal reasoning techniques have been developed to help deal with these challenges.22
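A minimal sketch of one common idea in temporal reasoning follows: each event is given an interval of possible dates rather than a single time stamp, and two events are ordered only when their intervals do not overlap. The event labels, dates, and the `Event` and `relation` names are hypothetical. This is a simplification for illustration, not a specific method from the cited review.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    label: str
    earliest: date  # earliest date the event could have occurred
    latest: date    # latest date the event could have occurred


def relation(a, b):
    """Return how event a relates to event b in time, if that can be determined."""
    if a.latest < b.earliest:
        return "before"
    if b.latest < a.earliest:
        return "after"
    return "indeterminate"  # intervals overlap, so the order is unknown


# A time-stamped laboratory result has a zero-width interval.
lab = Event("elevated glucose (lab)", date(2012, 3, 10), date(2012, 3, 10))
# Note-derived events carry wider intervals inferred from phrases in the text.
pain = Event("foot pain (note)", date(2012, 2, 20), date(2012, 3, 5))
numbness = Event("numbness (note)", date(2012, 3, 1), date(2012, 3, 15))

print(relation(pain, lab))
print(relation(numbness, lab))
```

Keeping the "indeterminate" result explicit lets the analyst decide how to treat events whose order cannot be established, rather than imposing an arbitrary sequence.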
In the sections that follow, we start by exploring 2 example cases. These cases illustrate common themes that are encountered in CER, including issues of: (1) data discovery, (2) data warehousing, (3) temporal reasoning, and (4) integration of datasets. Each case also illustrates unique challenges based on the specific type of data collected.
METHODS AND APPROACH
For this effort, we pursued a multistep process (Fig. 1). We engaged 2 distinct groups of subject-matter experts (SMEs) from relevant research teams at Ohio State University Wexner Medical Center (OSUWMC) and Nationwide Children’s Hospital. Three authors (P.J.E., C.H., P.R.O.P.) conducted a series of semistructured interviews and analyzed notes from those sessions. On the basis of that analysis, we generated summaries of the cases and identified key challenges as well as emergent themes. We then used these examples and findings to anchor a discussion of future directions for addressing the informatics issues related to design and conduct of CER.
In the following subsections, we summarize each of the 2 example research cases, discuss issues related to data discovery and data integration, and address any unique challenges that were encountered.
Case 1: Standard Versus Specialized Wound Care for Patients With Diabetes
The objective of this study was to determine whether patients with diabetic foot ulcers have a lower amputation rate when treated by a wound care specialist versus standard care from providers without specific wound care training. The primary objective was to compare amputation rates between these 2 groups, stratified by a validated ulcer severity score, the Wagner score.
Data Discovery and Extraction
The data needed for the study were collected from 2 main sources. For inpatients receiving standard care, the data were taken from the OSUWMC Information Warehouse, an enterprise data warehouse that is used for both operational and research purposes. For the patients receiving specialized care, the data originated in outpatient clinics. The National Healing Corporation manages these data, and then sends the data to the OSUWMC Information Warehouse.
As the data for this study were extracted from 2 different datasets for the 2 comparison groups, 1 outpatient and 1 inpatient, it was a challenge to consistently define variables between the 2 groups. For example, procedures were classified using International Classification of Diseases (ICD-9) procedure codes for the standard care group and Current Procedural Terminology codes for the outpatient specialty care group. The research team created an algorithm to map the different coding systems. In addition, because of differences between the data sources, there were few ways to comparably define comorbidities for the 2 groups. The researchers decided to use laboratory values to define comorbidities because these were some of the few data elements that were consistently recorded between the 2 groups. This practice can introduce error into a study. Laboratory values may be only temporarily abnormal, and abnormal values are not always reflective of comorbid conditions, especially in the absence of clinical interpretation.
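The research team's mapping algorithm is not described in detail here. The following is only a sketch of the general crosswalk approach, in which codes from each system are mapped to a shared study-level category and unmapped codes are kept for manual review. The category names, the `CROSSWALK` table, and the specific code pairings are illustrative assumptions that would need to be verified against the official code sets before any real use.

```python
# Illustrative crosswalk: (coding system, code) -> study-level category.
# The pairings below are assumptions for demonstration, not a validated mapping.
CROSSWALK = {
    ("ICD9_PROC", "84.11"): "toe_amputation",
    ("ICD9_PROC", "84.15"): "below_knee_amputation",
    ("CPT", "28820"): "toe_amputation",
    ("CPT", "27880"): "below_knee_amputation",
}


def normalize_procedures(rows):
    """Split (patient_id, system, code) rows into mapped and unmapped lists."""
    mapped, unmapped = [], []
    for patient_id, system, code in rows:
        category = CROSSWALK.get((system, code.strip()))
        if category is None:
            # Keep unmapped codes visible instead of silently dropping them.
            unmapped.append((patient_id, system, code))
        else:
            mapped.append((patient_id, category))
    return mapped, unmapped


# Hypothetical rows from the inpatient and outpatient sources.
rows = [
    ("P1", "ICD9_PROC", "84.11"),
    ("P2", "CPT", "28820 "),
    ("P3", "CPT", "99999"),
]
mapped, unmapped = normalize_procedures(rows)
print(mapped)
print(unmapped)
```

Returning the unmapped rows separately makes the coverage of the crosswalk auditable, which matters when the two comparison groups are coded in different systems.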
Beyond the more commonly recognized issues, several unique challenges were encountered. First was the issue of data completeness and consistency. For instance, once the data were collected it was found that Wagner scores were only available on specialty patients. The result was that standard care patients had to have Wagner scores retrospectively calculated, likely introducing bias when compared with the prospectively calculated specialty care patients.
A major challenge involved the primary outcome of the study. The initial primary outcome of the study was wound healing. However, determining the instant when a wound is healed over the course of several encounters was not possible with the current documentation. Although some notes included photo documentation, many did not. In the end, amputation was chosen as a surrogate outcome because it could be easily extracted from available data. Data availability leading to changes in the study’s primary outcome is just one consequence of the current state of informatics for CER.
Case 2: Comparing Primary Care Interventions for Teenagers at Risk for Depression
The second study examined tablet-based behavioral health screening, conducted in the waiting room, of teenagers at risk for depression and drug abuse. The study compared immediate physician notification of screening results versus delayed notification. It examined the efficacy of screening, the success of referrals, and provision of mental health care. Ultimately, the goal was to evaluate whether these interventions would increase the number of at-risk youths seeking appropriate mental health care.
Data Discovery and Extraction
The data used for this study came from multiple sources. These sources included: (1) paper-based patient visit surveys, (2) billing data, (3) nurse telephone logs, (4) transcribed telephone calls, (5) school records, (6) insurance records, and (7) behavioral health clinic documentation. Combining these different sources of data posed significant challenges in assessing outcomes. For instance, youth reports of depression or drug treatment sessions significantly underestimated those based on billing data. Similarly, parents underreported services received at primary care or school offices. Because of the diversity of sources and the often-conflicting data, the research team needed to make determinations of data provenance.
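One way to make such determinations explicit and repeatable is a precedence rule that selects a value from the highest-ranked available source while recording where the value came from and whether the sources disagreed. The sketch below illustrates this idea only. The source names and the precedence order are assumptions made for the example and do not represent the protocol used by the research team.

```python
# Assumed precedence, highest priority first (illustrative only).
SOURCE_PRECEDENCE = ["billing", "clinic_documentation", "parent_report", "youth_report"]


def resolve(observations, precedence=SOURCE_PRECEDENCE):
    """Pick one value per variable while retaining its provenance."""
    rank = {source: i for i, source in enumerate(precedence)}
    usable = [
        obs for obs in observations
        if obs["source"] in rank and obs["value"] is not None
    ]
    if not usable:
        return None
    best = min(usable, key=lambda obs: rank[obs["source"]])
    # Flag disagreement so that conflicting cases can be reviewed.
    conflict = len({obs["value"] for obs in usable}) > 1
    return {"value": best["value"], "source": best["source"], "conflict": conflict}


# Hypothetical counts of treatment sessions for one participant.
observations = [
    {"source": "youth_report", "value": 1},
    {"source": "billing", "value": 4},
]
resolved = resolve(observations)
print(resolved)
```

Because the chosen source and the conflict flag travel with each value, the rule can be changed later and its effect on the results re-examined.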
Although there were many electronic data sources for this study, creating a single database was complicated by several factors. No single dataset contained a common identifier; thus, name and birth date combinations were the only common linkage and this was inadequate in several cases. Some children used different names depending on parent residence. Further, the diverse types of data from disparate systems including schools, medical, and law enforcement organizations made full integration impossible.
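The limits of linkage on name and birth date can be seen in a minimal deterministic matching sketch. The record layout (`id`, `first`, `last`, `dob`) and the sample records are hypothetical. The approach shown, exact matching on a normalized key with everything else sent to manual review, is one simple option and is not the procedure used in the study.

```python
import unicodedata
from collections import defaultdict


def link_key(first, last, birth_date):
    """Build a comparison key from normalized names and the birth date."""
    def clean(text):
        # Decompose accented characters, then keep lowercase letters only.
        text = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in text.lower() if ch.isalpha())
    return (clean(first), clean(last), birth_date)


def link(records_a, records_b):
    """Return ({id_a: id_b} for unique matches, [id_a needing manual review])."""
    index = defaultdict(list)
    for rec in records_b:
        index[link_key(rec["first"], rec["last"], rec["dob"])].append(rec["id"])
    matched, review = {}, []
    for rec in records_a:
        candidates = index.get(link_key(rec["first"], rec["last"], rec["dob"]), [])
        if len(candidates) == 1:
            matched[rec["id"]] = candidates[0]
        else:
            review.append(rec["id"])  # no match, or more than one candidate
    return matched, review


# Hypothetical records; the second child appears under a different surname.
records_a = [
    {"id": "A1", "first": "Mary-Ann", "last": "O'Brien", "dob": "1998-04-02"},
    {"id": "A2", "first": "Sam", "last": "Lee", "dob": "1997-01-15"},
]
records_b = [
    {"id": "B7", "first": "maryann", "last": "OBrien", "dob": "1998-04-02"},
    {"id": "B8", "first": "Sam", "last": "Smith", "dob": "1997-01-15"},
]
matched, review = link(records_a, records_b)
print(matched, review)
```

Normalization recovers matches that differ only in punctuation or case, but a child recorded under two surnames still falls into the review list, which mirrors the difficulty described above.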
Obtaining data from multiple sources proved to be a challenge. Insurance companies, schools, law enforcement, and child welfare offices were uncomfortable with requests for child-specific data, especially data that focused on older records, and often delayed their responses or did not respond at all. A number of insurance companies and some schools simply refused to include data from their sites.
Even when data were available, unique pediatric-specific issues arose. Because the study included patients over a range of ages and developmental milestones, standardizing risk factors and developmental milestones over the study population was complicated. Because of the rapid development of teenagers and young children, temporal adjustments had to be made when examining laboratory data, weight and height, and other milestones.
RESULTS: EMERGENT THEMES
Based upon the lessons learned from the available literature and example cases described above, we find that a conceptual model of common factors influencing the design and conduct of CER emerges. This conceptual model includes 4 major factors. Two are intrinsic to the data (data quality and data preparation) and 2 are extrinsic (sociotechnical factors and organizational factors). The major elements are illustrated in Figure 2 and described in detail below.
Thematic Category 1: Data Quality
There are many data quality issues that arise when reusing data for research as outlined in the Kahn conceptual model of data quality.15 The following factors are those that are most commonly encountered in CER.
EHRs were not designed to gather data for research but rather to enable health care processes. Moreover, health care providers are typically neither trained to record data for research purposes nor are they encouraged to do so.23 These factors often result in missing data. In case 1, the major effect of this was missing Wagner scores. Inpatient physicians were not aware of the importance of this scoring system when evaluating patients with diabetic foot ulcers. In case 2 the researchers sought to include school and juvenile detention data. While such community-derived data are often essential to certain research efforts, regulatory challenges and unstructured formatting of such data made collection and analysis especially difficult.
Lack of Shared Semantics
This is a common problem that is illustrated in both our cases. In case 1, the researchers came up with their own protocol for translating ICD-9 codes into Current Procedural Terminology codes so that the 2 groups of patients could be accurately compared. In case 2 the group was unable to reliably link patients from one database to another because of a lack of a unique identifier.
Determination of Data Provenance
A common problem with using electronic records for CER is defining variables that have more than one possible operational definition. For example, does a patient have chronic renal disease because he has a billing code (ICD-9) recorded for renal disease, because he has seen a renal specialist, because he has an elevated creatinine on one or more lab draws, or because “renal disease” is written in free text in his progress note? In case 1, the researchers chose to use laboratory values to determine comorbidities. This is attractive in that it seems to be an objective measure; however, laboratory values can change over time, can be only temporarily abnormal, and deciding how to classify patients with missing laboratory values can be difficult. In case 2, data provenance was an issue when determining which document represented the “truth” regarding whether a teenager saw a mental health professional. The research team decided to use an agreed upon protocol in order to standardize their data collection; however, more formal data provenance structures and reusable pipelines can help standardize conflicting data sources.
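The competing operational definitions in the renal disease example can be made explicit by recording each line of evidence separately, so that the study definition becomes a documented choice and not an implicit one. The sketch below assumes a hypothetical patient record layout. The creatinine threshold and the minimum gap between abnormal results are placeholder parameters for illustration, not clinical guidance.

```python
from datetime import date


def persistent_abnormal(labs, threshold, min_gap_days):
    """True if abnormal results span at least min_gap_days (not a single spike)."""
    abnormal_dates = sorted(day for day, value in labs if value > threshold)
    if len(abnormal_dates) < 2:
        return False
    return (abnormal_dates[-1] - abnormal_dates[0]).days >= min_gap_days


def renal_disease_evidence(patient, threshold, min_gap_days):
    """Report each candidate definition separately instead of a single flag."""
    return {
        "billing_code": bool(patient.get("renal_billing_codes")),
        "specialist_visit": bool(patient.get("nephrology_visits")),
        "persistent_lab": persistent_abnormal(
            patient.get("creatinine", []), threshold, min_gap_days
        ),
        # Naive substring match; it does not handle negation such as
        # "no renal disease", which real text processing would need to.
        "free_text": "renal disease" in patient.get("note_text", "").lower(),
    }


# Hypothetical patient with one transiently elevated creatinine value.
patient = {
    "renal_billing_codes": ["585.9"],  # illustrative ICD-9 diagnosis code
    "creatinine": [(date(2012, 1, 5), 1.9), (date(2012, 1, 7), 1.1)],
    "note_text": "History of renal disease.",
}
evidence = renal_disease_evidence(patient, threshold=1.5, min_gap_days=90)
print(evidence)
```

Reporting the four criteria side by side also shows how often they disagree in a given dataset, which is useful information when choosing among them.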
Thematic Category 2: Data Preparation: Temporal Relationships and Data Synthesis
When data are reused, it is often necessary to process the data in a way that maximizes their applicability to the study. One of the major processing steps involves dealing with temporal relationships. In case 1, the team was unable to use wound healing as a primary outcome because determining an exact time of wound healing was not possible. Amputation, a more objective though less satisfying outcome, was chosen instead.
In case 2, temporal issues were encountered due to the need to account for aging of their patient population over time. Developmental milestones or laboratory values that might be normal for a patient at 1 point in the study could be pathological at another.
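The need to interpret a measurement relative to the patient's age at the time it was taken can be sketched as a lookup against age-banded reference ranges. The age bands and limits in `REFERENCE_RANGES` are invented placeholders and not real clinical reference values. The function names are likewise hypothetical.

```python
from datetime import date

# Placeholder bands: (min_age_inclusive, max_age_exclusive, low, high).
# These numbers are invented for illustration and are not clinical values.
REFERENCE_RANGES = [
    (0, 13, 10.0, 20.0),
    (13, 18, 15.0, 30.0),
]


def age_in_years(birth_date, on_date):
    """Completed years of age on a given date."""
    years = on_date.year - birth_date.year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def classify(value, birth_date, measured_on, ranges=REFERENCE_RANGES):
    """Classify a value against the range for the age at measurement."""
    age = age_in_years(birth_date, measured_on)
    for min_age, max_age, low, high in ranges:
        if min_age <= age < max_age:
            if value < low:
                return "low"
            if value > high:
                return "high"
            return "normal"
    return "no_reference_range"


birth = date(1999, 6, 1)
# The same value is classified differently as the participant ages.
print(classify(12.0, birth, date(2011, 5, 1)))
print(classify(12.0, birth, date(2013, 7, 1)))
```

Computing the age at each measurement date, and not once at enrollment, is what allows the classification to change over the course of a longitudinal study.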
Thematic Category 3: Sociotechnical Factors
These factors were mostly encountered in case 2, where regulatory frameworks were obstructive and prevented some participants from contributing data to the study. Because the most difficult data to obtain for this study came from juvenile law enforcement, the children who could have benefitted and contributed the most were not included in the study. Although standards must be in place to protect patients’ privacy, in order to be able to leverage these data for CER, institutional and regulatory entities need to be receptive to calls for quality improvement in data management and to easing regulatory frameworks that impede data sharing.
Thematic Category 4: Organizational Factors
Organizational factors include availability of trained informatics personnel involved in database management as well as institutional support. In case 2, the data were being gathered and processed by untrained personnel. This can introduce errors as well as inconsistencies in the data. A well-functioning data warehouse or database management group is essential for CER.
As the above case examples illustrate, there remain many knowledge management and informatics challenges to the systematic, routine, and successful leveraging of heterogeneous clinical data for CER. Those conducting or supporting the conduct of CER would benefit from a framework upon which to approach the often challenging informatics issues commonly encountered when engaging in such research involving clinical data reuse.
Even with trained personnel and a supportive institution, the current state of clinical data is often insufficient or inadequate to address the information needs of researchers. There are numerous open informatics research and policy-related questions that must be addressed in order to fully realize the potential of what we anticipate and expect CER endeavors to yield in terms of better, safer, and more cost-effective health care delivery.24
Open Informatics Research Questions
It is clear that informatics research in this domain must include those issues related to data discovery and integration. Questions remain as to the best way to store and structure data in order to be able to use it for research purposes, especially in data reuse scenarios. Vital to this question is how best to capture data from key members of the health care team in order to have research quality data available. This may involve improving the ease of data entry for clinicians and/or improving methods for collecting high-quality data directly from patients. Developing new methods for integrating data from other nontraditional health care sources also holds promise and must be pursued, particularly for the conduct of certain types of specialty studies.
Similarly, improved and standardized protocols need to be established to determine hierarchical and semantic relationships between different data sources and types. Standards for dealing with missing data also need to be developed and incorporated into practice. In addition, fundamental methodological innovation is needed in the collection and integrative analysis of heterogeneous and multidimensional data, especially as they relate to knowledge engineering, temporal reasoning, data synthesis (including NLP), knowledge discovery in databases, and the visualization of complex and information-dense constructs.
Open Policy Issues
Data for CER are currently stored in various locations and with various semantic structures. As our case examples illustrate, significant effort is often needed to extract and combine data from multiple sources. Too often these efforts are made more difficult by prevailing policies rather than by technical limitations. Even as technical issues must be addressed to facilitate such endeavors, policy and regulatory issues are critical to realizing the benefits of such data, and many can be readily addressed today.
For one, policies that promote and enforce adherence to and use of standards across institutions are essential and must be addressed. Within individual health care organizations, investments to improve analytical and business intelligence capacities should be viewed for their ability to simultaneously enable both operational and research activities alongside one another, for the benefit of both. Along similar lines, interpretations of existing policies and regulations regarding data reuse for research that often impede research productivity should be rethought to safely and securely enable such work.25
On a broader scale, efforts to make data sharing more standardized and reproducible across institutions would benefit greatly from adjustments and reconciliation of regulatory frameworks that now often impede such sharing if it is for anything other than clinical care purposes. At the individual level, policies and incentive structures that promote and foster the involvement of clinicians and other members of the health care team in the systematic collection of data and engagement of participants in research are also sorely needed.23
Finally, in order to accomplish these goals, we will need to have professionals who are trained in informatics and clinical research. Currently there is a shortage of such highly trained professionals and a paucity of mechanisms to support or enable critically needed training programs in this domain.
This paper has several limitations. First, our description of the current knowledge base is not intended to be a systematic review of the literature but rather a targeted review aimed at informing current issues associated with CER. We did not focus on the foundational work in the computational and information sciences, as this is a paper aimed principally at those conducting and supporting CER. Also, while we believe that most of these problems are common in CER, some of the problems discussed may be reflective of issues unique to our institution. In addition, although our conceptual model presents 4 individual thematic categories, in real use there is significant interplay between these factors, adding to the complexity of CER. Finally, it should be noted that the limited number of exemplary cases and the few subject-matter experts involved mean that there may be other factors we did not capture through this approach.
The conduct of CER can benefit greatly from leveraging the increasingly electronic sources of clinical data available. However, there remain many informatics challenges to overcome if we are to realize the opportunities facing the health care and biomedical research communities. Those engaged in such endeavors should consider the issues illustrated above as they plan and pursue CER studies. This proposed framework should help to inform the considerable work that remains to be done at the intersection of biomedical informatics, data management, and CER.
1. Institute of Medicine (U.S.), Committee on Comparative Effectiveness Research Prioritization. Initial National Priorities for Comparative Effectiveness Research. Washington, DC: Institute of Medicine of the National Academies; 2009. xxiii, 227 pp.
2. Woeltje KF, Lautenbach E. Informatics and epidemiology in infection control. Infect Dis Clin North Am. 2011;25:261–270.
3. Marks L, Power E. Using technology to address recruitment issues in the clinical trial process. Trends Biotechnol. 2002;20:105–109.
4. Hu H, Correll M, Kvecher L, et al. DW4TR: a data warehouse for translational research. J Biomed Inform. 2011;44:1004–1019.
5. Embi PJ, Kaufman SE, Payne PR. Biomedical informatics and outcomes research: enabling knowledge-driven health care. Circulation. 2009;120:2393–2399.
6. Erdal BS, Liu J, Ding J, et al. A database de-identification framework to enable direct queries on medical data for secondary use. Methods Inform Med. 2012;51:229–241.
7. National Cancer Institute. Digital capabilities to accelerate research. Available at: https://cabig.nci.nih.gov. Accessed December 26, 2012.
8. Saltz J, Oster S, Hastings S, et al. caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics. 2006;22:1910–1916.
9. Rosenthal A, Mork P, Li MH, et al. Cloud computing: a new business paradigm for biomedical information sharing. J Biomed Inform. 2010;43:342–353.
10. Meystre S, Haug PJ. Natural language processing to extract medical problems from electronic clinical documents: performance evaluation. J Biomed Inform. 2006;39:589–599.
11. Chen ES, Hripcsak G, Friedman C. Disseminating natural language processed clinical narratives. AMIA Annu Symp Proc. 2006:126–130.
12. Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc. 2011;18:544–551.
13. Webb KH. Natural language processing and electronic medical records. J Am Med Assoc. 2011;306:2325; author reply 2325–2326.
14. Liu F, Weng C, Yu H. Natural language processing, electronic health records, and clinical research. In: Richesson RL, Andrews JE, eds. Clinical Research Informatics. London: Springer; 2012:293–310.
15. Kahn MG, Raebel MA, Glanz JM, et al. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Med Care. 2012;50(suppl):S21–S29.
16. Cusack CM, Hripcsak G, Bloomrosen M, et al. The future state of clinical data capture and documentation: a report from AMIA’s 2011 Policy Meeting. J Am Med Inform Assoc. 2013;20:134–140.
17. Hampton T. Recent advances in mobile technology benefit global health, research, and care. J Am Med Assoc. 2012;307:2013–2014.
18. Mirhaji P, Zhu M, Vagnoni M, et al. Ontology driven integration platform for clinical and translational research. BMC Bioinform. 2009;10(suppl 2):S2.
19. Goh CS, Gianoulis TA, Liu Y, et al. Integration of curated databases to identify genotype-phenotype associations. BMC Genomics. 2006;7:257.
20. Sittig DF, Hazlehurst BL, Brown J, et al. A survey of informatics platforms that enable distributed comparative effectiveness research using multi-institutional heterogenous clinical data. Med Care. 2012;50(suppl):S49–S59.
21. Kahn MG, Batson D, Schilling LM. Data model considerations for clinical effectiveness researchers. Med Care. 2012;50(suppl):S60–S67.
22. Zhou L, Hripcsak G. Temporal reasoning with medical data—a review with emphasis on medical natural language processing. J Biomed Inform. 2007;40:183–202.
23. Embi PJ, Tsevat J. Commentary: the relative research unit: providing incentives for clinician participation in research activities. Acad Med. 2012;87:11–14.
24. Payne PR, Embi PJ, Niland J. Foundational biomedical informatics research in the clinical and translational science era: a call to action. J Am Med Inform Assoc. 2010;17:615–616.
25. Ness RB. Influence of the HIPAA Privacy Rule on health research. J Am Med Assoc. 2007;298:2164–2170.