Data Quality Assessment for Comparative Effectiveness Research in Distributed Data Networks

Brown, Jeffrey S. PhD*; Kahn, Michael MD, PhD†,‡; Toh, Sengwee ScD*

doi: 10.1097/MLR.0b013e31829b1e2c
Analytic Methods

Background: Electronic health information routinely collected during health care delivery and reimbursement can help address the need for evidence about the real-world effectiveness, safety, and quality of medical care. Often, distributed networks that combine information from multiple sources are needed to generate this real-world evidence.

Objective: We provide a set of field-tested best practices and a set of recommendations for data quality checking for comparative effectiveness research (CER) in distributed data networks.

Methods: We explore the requirements for data quality checking and describe data quality approaches undertaken by several existing multi-site networks.

Results: There are no established standards regarding how to evaluate the quality of electronic health data for CER within distributed networks. Data checks of increasing complexity are often used, ranging from consistency with syntactic rules to evaluation of semantics and consistency within and across sites. Temporal trends within and across sites are widely used, as are checks of each data refresh or update. Rates of specific events and exposures by age group, sex, and month are also common.

Discussion: Secondary use of electronic health data for CER holds promise but is complex, especially in distributed data networks that incorporate periodic data refreshes. The viability of a learning health system depends on a robust understanding of the quality, validity, and optimal secondary uses of routinely collected electronic health data within distributed health data networks. Robust data quality checking can strengthen confidence in findings based on distributed data networks.

*Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA

†Division of Pediatric Epidemiology, Department of Pediatrics, University of Colorado

‡Colorado Clinical and Translational Sciences Institute and CCTSI Biomedical Informatics, University of Colorado Denver, Denver, CO

Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal's Website,

S.T., J.S.B., and M.K. were supported by Agency for Healthcare Research and Quality grant 1R01HS019912-01 (Scalable PArtnering Network for Comparative Effectiveness Research). M.K. was supported by Agency for Healthcare Research and Quality grant R01HS019908 (Scalable Architecture for Federated Translational Inquires Network), and NIH/NCATS Colorado CTSI Grant Number TR000154 (Colorado Clinical and Translational Sciences Institute).

The authors declare no conflict of interest.

Reprints: Jeffrey S. Brown, PhD, Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, 133 Brookline Ave 6th Floor, Boston, MA 02215. E-mail:


Electronic health records and other information routinely collected during health care delivery and reimbursement can help address the critical need for evidence about the real-world effectiveness, safety, and quality of medical care.1–7 Consequently, state and federal agencies, private payers, and others are seeking to use these data to generate timely and actionable evidence and realize the benefits of a learning health system.8–10

Even the largest individual data resources are insufficient to investigate interventions with limited use, evaluate rare conditions, identify rare outcomes, and enable timely decision making.1 Combining data from diverse clinical settings also strengthens the generalizability of findings. Therefore, combining multiple data sources is often necessary. Using multiple data resources for comparative effectiveness research (CER) requires efficient mechanisms to access the data while respecting the regulatory, legal, proprietary, and privacy implications of use. In principle, a distributed network or a centralized database could support multi-site CER. The US Food and Drug Administration (FDA)4,11,12 and the Office of the National Coordinator for Health Information Technology,13 among others, support using distributed data networks that allow data partners to maintain physical control over their data, while permitting authorized users to query the data.11,12,14–19

In distributed networks data partners maintain control of their data and its uses. Distributed networks typically rely on a common data model that enables queries to be executed identically by all data partners.20–24 Distributed networks address many of the security, proprietary, legal, and privacy concerns common to multi-site research.19,25,26 An added benefit is that the data are held by those who are best able to consult on proper use and interpretation. Both single-use27,28 and multi-use networks are possible. Multi-use networks such as the HMO Research Network (HMORN),29,30 the CDC-sponsored Vaccine Safety Datalink,31–33 and the FDA-sponsored Mini-Sentinel12 build and maintain data and administrative infrastructure to support multiple studies. Multi-purpose networks often refresh their data as new information becomes available.

Unfortunately, idiosyncratic data quality issues in multi-site environments can arise from variation in data capture, differences and changes in medical coding terminologies and coding practices, local business rules regarding data adjudication, clinical workflows, and delivery system differences such as formularies, provider contracts, and payment contracts.34,35 Without a strategy to identify and address cross-site and temporal data quality issues, the evidence generated by CER using such networks is subject to validity concerns. Therefore, a framework for data quality checking is needed to help researchers identify and resolve these issues. In this paper, we (1) review the pragmatic data quality framework developed by Kahn et al,36 (2) assess several existing data quality checking approaches, (3) provide several worked examples of data quality checking, and (4) recommend multi-site data checking approaches.



Kahn et al36 proposed a conceptual framework for assessing the quality of electronic health data that includes 5 key data concepts, defined below (adapted from Kahn et al36):

* Attribute domain constraints: focus on data value anomalies for individual variables, including distributions, units, and missingness. These checks identify values and distributions inconsistent with expectations (eg, a high proportion of individuals over 120 y old).

* Relational integrity rules: compare elements from one data table to related elements in another data table (eg, every person identifier in the pharmacy table must have a record in the demographic table, but not necessarily in the enrollment table).

* Historical data rules: temporal relationships and trend visualizations to identify data gaps, unusual patterns, and dependencies across multiple data values and variables (eg, utilization trends can identify shifts in data capture).

* State-dependent objects rules: extends temporal data assessment to include logical consistency (eg, a series of prenatal ultrasounds should precede a pregnancy outcome).

* Attribute dependency rules: examine conditional dependencies based on knowledge of a clinical scenario (eg, women should not have a diagnosis of prostate cancer).

These data checking domains are commonly found in multi-site data quality checking approaches. However, data quality checking in distributed data networks has not been well described in the literature, and is generally more complex due to privacy and proprietary concerns and variation in data capture.



There are no guidelines to determine whether electronic health data are “valid” or of high “quality” nor any consensus on how to define “valid” or “quality.” Hall et al37 set out guidelines for good pharmacoepidemiologic practice for database selection and use, and included several recommendations for single-site and multi-site studies.37 They provide suggestions for data checking, including assessment of the completeness and accuracy of key study variables, checks of external validity (eg, is the rate of some metric consistent with external estimates), logic and plausibility checks (eg, assessment of age ranges, missingness, and clinical plausibility), and trending assessments.37

Several networks have published their data checking procedures, with varying levels of detail. The HMORN conducts annual data quality checks of their distributed database—the HMORN Virtual Data Warehouse (VDW). They assess compliance with the common data model, evaluate summary statistics for continuous measures, proportions for categorical variables, missingness frequencies, and trends.38 Output is reviewed by data area workgroups that consist of HMORN analysts and investigators. HMORN sites must get a “passing” grade before their data are considered acceptable for inclusion in the VDW. The specific metrics needed to earn a “passing” grade are determined by the workgroup.

The Observational Medical Outcomes Partnership (OMOP) has several tools to assess data quality.39,40 The OMOP approach is similar to the HMORN with respect to review of categorical and continuous variables across an entire database and without focus on a specific study topic. OMOP has a standardized approach to representing the results,39 and flagging potential problems.40

The FDA’s Mini-Sentinel project describes 4 levels of data checking.41,42 Level 1 checks review consistency with the Mini-Sentinel Common Data Model data dictionary, focusing on variable names, lengths, formats, and values. For example, acceptable values for the variable SEX are “M,” “F,” “A,” and “U,” so any other value will generate an error flag. Level 2 checks focus on completeness and integrity between variables within a table, or variables between tables. For example, every person identifier in the pharmacy dispensing table must have a record in the demographic file. Level 1 and level 2 checks apply attribute domain constraints and relational integrity rules to generate binary flags. Level 3 checks, similar to historical data rules, focus on patterns, trends, and cross-variable relationships, but not necessarily relationships that can be characterized as true or false. For example, a graph of monthly pharmacy dispensings or a table of the annual distribution of inpatient and outpatient encounters can identify unusual patterns or cross-site variability. Potential issues are flagged by visual inspection of trends and review of expected within-site and cross-site consistency. Finally, level 4 checks, similar to attribute dependency rules, focus on specific clinical scenarios for which there is a reasonable a priori expectation for the findings. Examples include trends in dispensing rates of specific medications by age and sex, vaccinations by age and sex, and the frequency of specific procedures by age and sex.



The approaches described above share a common focus and method. On the basis of the work reviewed above and our experience operating several multi-site networks, we present a series of practical approaches to help identify data issues and ensure data validity in multi-site multi-purpose distributed networks. Our recommendations have been extensively “field tested” across thousands of data extractions and hundreds of studies. Although we continue to find unusual data quality issues, the proposed methods comprise the core data quality checking procedures used in multiple active national networks. We also present considerations related to data security, privacy, and proprietary issues. Data quality checking for specific studies can use many of the same approaches, but additional study-specific checks are recommended.


Review Adherence to the Common Data Model

Many multi-site distributed networks use a common data model to facilitate cross-site analysis. Once the common data model is determined and sites conduct an extract-transform-load (ETL) procedure to populate the model, the ETL process must be checked against the data model. These checks compare the extracted data against the data model data dictionary to ensure syntactic consistency and adherence to the data model. There are at least 3 basic consistency checks: the first evaluates simple syntactic consistency, the second focuses on the table structure, and the third targets expected relationships between tables.

Syntactic correctness refers to whether the transformed variable names, values, length, and format meet the data model specifications. For example, a patient identifier variable could have 4 checks—it must be nonmissing, be a character format, be left-justified, and be called PATID. The variable “sex” might have 5 checks—must be a character format, have a length of 1, be nonmissing, have values of “F,” “M,” or “U,” and be called SEX.
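The syntactic checks in the preceding paragraph can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the dictionary-per-row representation, and the error messages are assumptions made for the example and are not taken from any network's actual tooling.

```python
def check_patid(value):
    """Return rule violations for a patient identifier value."""
    if value is None or value == "":
        return ["missing"]
    if not isinstance(value, str):
        return ["not character format"]
    errors = []
    if value != value.lstrip():
        errors.append("not left-justified")
    return errors


def check_sex(value, allowed=("F", "M", "U")):
    """Return rule violations for a sex value; allowed values follow the example in the text."""
    if value is None or value == "":
        return ["missing"]
    if not isinstance(value, str):
        return ["not character format"]
    errors = []
    if len(value) != 1:
        errors.append("length is not 1")
    if value not in allowed:
        errors.append("value not in allowed set")
    return errors


def check_record(record):
    """Check one row (a dict) for the expected variable names and valid values."""
    errors = {}
    for name, checker in (("PATID", check_patid), ("SEX", check_sex)):
        if name not in record:
            errors[name] = ["variable name not found"]
        else:
            found = checker(record[name])
            if found:
                errors[name] = found
    return errors
```

A row that passes every rule returns an empty dictionary, so a site could count the nonempty results to produce the kind of binary error flags described for level 1 checks.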

Consistency with coding terminologies should be evaluated. For example, Healthcare Common Procedure Coding System codes should be 5 digits, National Drug Codes (NDCs) should not contain any letters, and International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes should always begin with a number, an “E,” or a “V.” For variables that use standardized coding terminologies, all values contained in a dataset can be checked against a reference library of all valid values for that terminology. Finally, data models may specify that certain variables include or exclude decimals, dashes, and modifiers, and data checks should verify that those rules were followed.
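The terminology format rules above lend themselves to simple pattern checks. The sketch below is illustrative and rests on stated assumptions: the rule names and the `invalid_codes` function are hypothetical, the HCPCS rule is read as "exactly 5 characters" because some HCPCS codes begin with a letter, and the reference set passed in stands in for a full library of valid codes. The codes in the usage note are example strings only.

```python
import re

# Format rules paraphrased from the text; patterns are assumptions for illustration.
RULES = {
    "HCPCS": re.compile(r"[A-Za-z0-9]{5}"),   # exactly 5 characters
    "NDC": re.compile(r"[^A-Za-z]+"),         # no letters anywhere
    "ICD9CM": re.compile(r"[0-9EV].*"),       # begins with a number, "E", or "V"
}


def invalid_codes(codes, code_type, reference=None):
    """Return (code, reason) pairs for codes that fail the format rule
    or, when a reference set is supplied, are absent from it."""
    pattern = RULES[code_type]
    bad = []
    for code in codes:
        if not pattern.fullmatch(code):
            bad.append((code, "format"))
        elif reference is not None and code not in reference:
            bad.append((code, "not in reference library"))
    return bad
```

For example, `invalid_codes(["250.00", "X12"], "ICD9CM")` flags only `"X12"`, because it does not begin with a number, "E," or "V."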

Data models are often organized into specific tables that can be linked by person or encounter identifiers. Each table typically has a definition for each row. For example, a demographic table can be defined as 1 row per unique person and include information such as date of birth, sex, and race—values that typically do not change over time. However, if the demographic table includes a time-varying characteristic such as marital status or zip code, the table could include multiple rows per person. Regardless, the row definition must be verified against expectations.
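Verifying a row definition amounts to confirming that the fields that define a row are unique. A minimal sketch follows, assuming rows are held as dictionaries; the field names PATID and ZIP and the function name are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key_fields=("PATID",)):
    """Return key values that appear on more than one row, with their counts.
    An empty result means the table satisfies the stated row definition."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}
```

If the demographic table is defined as 1 row per person, the key is the person identifier alone; if it carries a time-varying characteristic, the key can be widened (for example, to person and zip code) to match the row definition in the data model.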

Finally, expected cross-table relationships should be checked for consistency with the data model. These checks rely on identifying relationships that should always be true based on the data model and verifying that the data meet those expectations. For example, in some networks every patient in the pharmacy file should have a corresponding row in the demographic file, and every patient in the utilization file should be in the enrollment file. Violation of these basic data model rules should generate errors for review. Table 1 illustrates the output of a cross-site, cross-table data check in the Mini-Sentinel program. Sites 5, 14, and 17 have low match rates between the enrollment and demographic files. This finding is a result of those sites providing care for nonmembers. This can occur in integrated delivery systems that operate medical facilities, but should not be observed for insurance companies that do not operate medical facilities. This example highlights that interpretation of differences may require information about data sources that exists outside of the dataset and that observed differences may not be indicative of poor data quality.
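A cross-table check of this kind reduces to a match rate between the identifiers in two tables. The following sketch assumes identifiers are available as simple lists; the function name is hypothetical, and any threshold for a "low" match rate would be set by the network rather than by this code.

```python
def match_rate(child_ids, parent_ids):
    """Proportion of distinct identifiers in the child table (eg, pharmacy)
    that also appear in the parent table (eg, demographic), plus the
    identifiers that failed to match."""
    child = set(child_ids)
    if not child:
        return None, set()
    unmatched = child - set(parent_ids)
    return 1 - len(unmatched) / len(child), unmatched
```

Returning the unmatched identifiers alongside the rate lets a site investigate locally whether the mismatch reflects an error or an expected feature of the data source, such as care provided to nonmembers.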


Review Each Data Domain

Data models are commonly divided into domains such as enrollment, demographics, medication dispensing, prescribing, medical utilization, laboratory results, and vital signs. Examples include the HMORN VDW, OMOP, the Mini-Sentinel Common Data Model, the Vaccine Safety Datalink, and Electronic medical record Support for Public health.43 Models that combine information across domains into a single table (eg, star schemas)44 can be stratified by domain to help simplify comparisons.

Independent of the data model, we suggest extending data domain checks by evaluating (1) frequency and proportions for categorical variables; (2) distributions and extreme values for continuous variables; (3) missingness; (4) “out-of-range” values (as defined by the data curator); (5) expected relationships between variables within the domain; (6) normalized rates (eg, per person, per member, or per person per month); and (7) temporal trends (weekly, monthly, quarterly, or annual). Trends in counts or proportions should be presented overall and by site and can be stratified by age group, sex, or other relevant characteristic. The 7 basic checks listed above provide a view of the entire database over time and should be implemented across all data domains. These checks should be modified to investigate a specific cohort of patients or topic. We present a series of examples by domain for guidance.
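The first 4 basic checks (frequencies and proportions, distributions and extreme values, missingness, and out-of-range values) can be sketched as two small summary functions. The function names, the `<missing>` label, and the range limits in the usage note are assumptions for illustration; in practice the acceptable range is defined by the data curator.

```python
from collections import Counter


def summarize_categorical(values):
    """Count and proportion for each value; None and "" are counted as missing."""
    n = len(values)
    if n == 0:
        return {}
    counts = Counter("<missing>" if v in (None, "") else v for v in values)
    return {value: (count, count / n) for value, count in counts.items()}


def summarize_continuous(values, low, high):
    """Missingness, extremes, and out-of-range count for a numeric variable."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "out_of_range": sum(1 for v in present if v < low or v > high),
    }
```

For instance, `summarize_continuous(ages, low=0, high=120)` would report how many recorded ages fall outside a 0 to 120 range. The same summaries can be produced by site, age group, sex, or calendar period to support the stratified and temporal comparisons described above.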

Enrollment


Key temporal components such as the start and stop dates of enrollment stratified by the type of coverage and demographic characteristics should be assessed. Suggested checks include:

* Enrollment start dates should precede enrollment end dates.

* Maximum enrollment length should not exceed the maximum observation length.

* Distribution of enrollment length (mean, median, interquartile range, 1st, 5th, 95th, and 99th percentiles).

* Enrollment periods per member.

* Proportion of enrollment periods by benefit type (drug coverage, medical coverage, etc.).

* Monthly membership, overall and by coverage type.

Figure 1 illustrates monthly enrollment changes at 2 sites in 2010 versus 2011, showing a large drop in January 2010 for 1 site and a large drop for both sites in January 2011. These changes can be verified with the sites to ensure that the patterns are expected, and the pattern can help inform researchers using multi-year longitudinal cohorts.
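Several of the suggested enrollment checks can be computed directly from enrollment spans. The sketch below is a simplified illustration: the tuple layout (person identifier, start date, end date), the function name, and the choice to count both the start and end day in a span's length are assumptions, and only the median of the length distribution is shown.

```python
from datetime import date
from statistics import median


def enrollment_checks(spans, obs_start, obs_end):
    """spans: list of (patid, start_date, end_date) tuples."""
    max_obs_days = (obs_end - obs_start).days + 1
    # Lengths are computed only for spans whose start does not follow the end.
    lengths = [(end - start).days + 1 for _, start, end in spans if start <= end]
    periods = {}
    for patid, _, _ in spans:
        periods[patid] = periods.get(patid, 0) + 1
    return {
        "start_after_end": sum(1 for _, start, end in spans if start > end),
        "longer_than_observation": sum(1 for n in lengths if n > max_obs_days),
        "median_length_days": median(lengths) if lengths else None,
        "max_periods_per_member": max(periods.values()) if periods else 0,
    }
```

Additional percentiles, stratification by coverage type, and monthly membership counts would follow the same pattern.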

Demographics


Categorical and continuous variables should be checked, including missingness and “out-of-range” values. Sites can review age and sex distributions to verify consistency with expectations.


Medication Use

Medication information typically includes the medication dispensed/prescribed, the days of supply, number of units (eg, pills, canisters), and the date dispensed/prescribed. If standard coding terminologies such as RxNorm or NDCs are used, the recorded values should match valid values in the terminology dictionary. Suggested medication checks include:

* Frequency of records by days supplied and amount dispensed/prescribed, including identification of unexpected values such as <0, 0, 0 to <1, and ≥100.

* Records with missing values.

* Records that do not match the terminology dictionary.

* Records per month.

* Users (ie, individuals with at least 1 dispensing/prescription) per month.

* Dispensings/prescriptions per user and per health plan member per month.

* Days supplied per dispensing.

Several other dispensing data checks are illustrated in Supplemental Digital Content 1 (Table 1) and Supplemental Digital Content 2, 3, and 4 (Figures 1–3).
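The first suggested medication check, the frequency of records by days supplied with unexpected values identified, can be sketched by binning each record. The bin boundaries follow the unexpected values named in the list above; the function names and the bin labels are assumptions for illustration.

```python
from collections import Counter


def days_supply_bin(days):
    """Assign a days-supplied value to a bin that isolates unexpected values."""
    if days is None:
        return "missing"
    if days < 0:
        return "<0"
    if days == 0:
        return "0"
    if days < 1:
        return "0 to <1"
    if days >= 100:
        return ">=100"
    return "1 to <100"


def tabulate_days_supply(values):
    """Frequency of records in each days-supplied bin."""
    return dict(Counter(days_supply_bin(v) for v in values))
```

The same binning can be applied to the amount dispensed, and the resulting table compared across sites and data refreshes.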


Medical Utilization

Medical utilization data often include care setting (eg, inpatient, ambulatory, emergency department), diagnoses recorded, procedures performed, facility identifier, provider identifier, and service dates. Suggested checks include:

* Encounters per patient (defined as a person with at least 1 encounter).

* Encounters per member (includes individuals without any encounter).

* Encounters by care setting.

* Diagnoses and procedures per encounter.

* Procedures by procedure code type (eg, Healthcare Common Procedure Coding System, ICD-9).

* Diagnoses by diagnosis code type (eg, ICD-9-CM, SNOMED-CT).

* Encounters per patient and per member per month.

Figure 2 shows encounters by month for 2 consecutive ETLs for 1 site. The saw-tooth pattern is unexpected; it could be the result of a data capture issue or a change in the ETL logic that was introduced in the most recent data extraction. Discussion with the data partner identified a data extraction error that was corrected. Temporal comparisons of ETLs within a site can help identify missing data (Supplemental Digital Content 5, Figure 4, illustrates 2 sequential data extracts, the second of which had a month of missing data).
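A comparison of consecutive ETLs can also be partly automated by comparing monthly counts for the months the two extracts share. The sketch below is illustrative only: the function name, the month-keyed dictionaries, and the 10% tolerance are assumptions, and flagged months are prompts for review with the data partner rather than confirmed errors.

```python
def compare_refreshes(previous, current, tolerance=0.10):
    """previous, current: dicts mapping a month label to a record count.
    Flags months from the previous extract that are absent from the current
    one or whose count changed by more than the tolerance."""
    flags = {}
    for month, old in sorted(previous.items()):
        new = current.get(month)
        if new is None:
            flags[month] = "missing in current refresh"
        elif old > 0 and abs(new - old) / old > tolerance:
            flags[month] = f"changed {100 * (new - old) / old:+.0f}%"
    return flags
```

Months that appear only in the newer extract are not flagged, because a refresh is expected to add newer data.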


Clinical Data

Data such as laboratory results and vital signs can be tested like the other domains, with relevant stratifications. Investigation of laboratory test results is complex because results can be idiosyncratic and data capture can vary substantially across sites. Basic clinical expectations should be reviewed for validity, such as men weighing more than women and blood pressure increasing with age. Examples of temporal trends include the number of tests per month, tests per member per month, and average values per year. Figure 3 provides 2 laboratory data check examples showing individuals with 1 or more laboratory test results per 1000 members. The charts show 2 different patterns, substantial cross-site variation, and within-site trend changes. This figure illustrates differences in data capture across sites that would generate additional investigation by researchers using the data for specific purposes, and may influence development of research protocols or selection of sites.
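A normalized rate such as individuals with 1 or more laboratory results per 1000 members is a simple ratio computed within each stratum. A minimal sketch, assuming counts are already aggregated by site and period; the function name and the (site, period) keys are hypothetical.

```python
def rate_per_1000(numerators, denominators):
    """numerators: stratum -> individuals with at least 1 result.
    denominators: stratum -> members in that stratum.
    Returns the rate per 1000 members, or None when a stratum has no members."""
    rates = {}
    for stratum, members in denominators.items():
        if members > 0:
            rates[stratum] = 1000 * numerators.get(stratum, 0) / members
        else:
            rates[stratum] = None
    return rates
```

Plotting these rates by site over time is what reveals the cross-site variation and within-site trend changes discussed above.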


Assess Expected Clinical Relationships

Expected clinical relationships should be assessed. For example, the rates of hip fractures among 60–65-year-old women and ankle fractures among 18–22-year-old men are metrics that can be compared across sites with the expectation of cross-site consistency. The intent is to identify a clinical condition or event with an expected pattern or expected consistency across sites, and test the hypothesis of similarity. We also recommend evaluating relationships that should never occur, such as pregnancy recorded for men or prostate cancer recorded for women.45–49 Because most electronic data contain these types of errors, a dataset with no “never” relationships is itself suspect, and it is up to the researcher to investigate with the data partner the process used to implement the cleaning.
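Counting relationships that should never occur can be sketched as a lookup of each diagnosis against a small rule table. The rule table, condition labels, and function name below are assumptions for illustration; a real check would map conditions to diagnosis codes.

```python
def count_never_events(diagnoses, sex_by_patient, rules):
    """diagnoses: iterable of (patid, condition) pairs.
    sex_by_patient: patid -> recorded sex.
    rules: condition -> sex for which the condition should never be recorded.
    Returns the number of distinct patients violating each rule."""
    counts = {condition: 0 for condition in rules}
    seen = set()
    for patid, condition in diagnoses:
        if condition in rules and sex_by_patient.get(patid) == rules[condition]:
            if (patid, condition) not in seen:
                seen.add((patid, condition))
                counts[condition] += 1
    return counts
```

With rules such as `{"pregnancy": "M", "prostate cancer": "F"}`, the output is a count per rule for each site. As noted above, a count of zero everywhere is itself a prompt to ask how the data were cleaned.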


Additional Considerations

Data checking in a distributed multi-site environment requires the transfer of information from the sites to a coordinating center that acts as data curator. Transfer of information to any external entity raises security, privacy, and proprietary concerns. In addition, data partners interpret privacy regulations differently, but most prohibit transfer of patient-level information or tables with low cell counts without approval of their Institutional Review Board and/or privacy officer. Data checking output can avoid patient privacy issues by transferring only stratified count information. Proprietary concerns can introduce additional barriers. Examples of potentially sensitive output include dispensings by NDC or generic name (which can allow identification of preferred pharmacy vendors or the formulary status of products), per member per month counts, and counts of members by diagnosis code, procedure code, or other clinical codes, which can expose implied quality measures. Finally, some partners may object to the sheer volume of information often required for comprehensive data checking, requiring a balancing of partner needs with those of the multi-site network data curator.
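One way a site might prepare stratified counts for transfer is to mask low cell counts before the output leaves the site. The sketch below is an assumption-laden illustration: the function name and the default minimum of 6 are hypothetical, the appropriate threshold is set by each data partner's policy, and whether zero counts may be released is likewise a local decision (here they are left unmasked).

```python
def mask_small_cells(counts, minimum=6):
    """counts: stratum -> count. Replaces counts between 1 and minimum - 1
    with None so that low cell counts are not transferred."""
    return {
        stratum: (n if n == 0 or n >= minimum else None)
        for stratum, n in counts.items()
    }
```

Masking of this kind addresses only the low-cell-count concern; proprietary concerns about what the counts reveal still require agreement between the data partner and the data curator.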

Study-specific data checks also should be performed. Study-specific checks should investigate the exposure, outcome, and covariates of interest in detail. Assessment of metrics such as days supplied per dispensing, dispensings per user, and total days of exposure per user can help identify cross-site variability. These checks can identify data quality issues not observed by the network data checking process.

Discussion


Assessing data validity in multi-site distributed data networks is complex. There are no clear rules for what constitutes a “valid” data resource or even which metrics to use to assess data validity. Data validity approaches often rest on the experience of the investigator and work of data curators. We recommend a range of data checks for CER studies to help identify potential data issues and provide assurance to CER stakeholders that routine and appropriate data quality checking was conducted.

The data checks recommended are generic by design and do not address all possible data domains or scenarios. The domains included above represent the most common data domains for CER, but because the conceptual frame is the same for all domains, the data checking steps listed here can be applied to other domains such as patient reported outcomes, registry data, and other elements in electronic health records. In addition, multi-site networks (and specific studies) often incorporate data updates or “refreshes” that add newer data to the network. In such instances, all data checking must be repeated to ensure that the updates did not introduce new data quality problems, as illustrated in our examples. Our experience is that data updates using a previously validated ETL offer no protection against new data quality problems resulting from unknown or underappreciated changes in the local data resources. In fact, it is not uncommon for data partners in distributed networks to learn of local data changes from the network data checking process.

Data quality checking is typically conducted “behind the scenes,” with results excluded from public reports. We suggest changing this paradigm by including information about the data quality approach and results as part of the standard CER reporting template. At a minimum, investigators should report key data checking metrics for the primary exposure, outcome, and covariate measures. This would help ensure that investigators implement extensive data checks and provide stakeholders with enough information to assess the likelihood that the data sources are appropriate for the study design.

Well-conducted CER requires the appropriate combination of data resources, study design, and statistical analysis. Poor choice of any of the 3 can result in invalid results. Well-designed data checking activities can identify data issues and determine the suitability of the data source for the study design and statistical method. To fully realize the potential of electronic health data for CER, our understanding of data issues, including methods for identifying them, must advance in parallel with advances in study design and statistical methods. Only when all 3 legs of the CER stool are solid will CER take its rightful place in evidence generation.

References


1. Report to the President and Congress. 2009. Available at: Accessed December 30, 2012.
2. McClellan M. Drug safety reform at the FDA: pendulum swing or systematic improvement? N Engl J Med. 2007;356:1700–1702.
3. Alina Baciu KS, Burke SP. The Future of Drug Safety: Promoting and Protecting the Health of the Public. 2006. Washington, DC: Institute of Medicine of the National Academies.
4. Behrman RE, Benner JS, Brown JS, et al. Developing the Sentinel System: a national resource for evidence development. N Engl J Med. 2011;364:498–499.
5. Platt R, Wilson M, Chan KA, et al. The New Sentinel Network: improving the evidence of medical-product safety. N Engl J Med. 2009;361:645–647.
6. Strom B, Kimmel SE, Hennessy S. Pharmacoepidemiology. 2012. West Sussex, UK: Wiley.
7. Buntin MB, Jain SH, Blumenthal D. Health information technology: laying the infrastructure for national health reform. Health Aff (Millwood). 2010;29:1214–1219.
8. Olsen L, Aisner D, McGinnis JM. The Learning Healthcare System: Workshop Summary (IOM Roundtable on Evidence-Based Medicine). 2007. Washington, DC: The National Academies Press.
9. Digital Infrastructure for the Learning Health System: The Foundation for Continuous Improvement in Health and Health Care: Workshop Series Summary. 2011. Washington, DC: The National Academies Press.
10. Young PL, Olsen L, McGinnis JM. Value in Health Care: Accounting for Cost, Quality, Safety, Outcomes, and Innovation: Workshop Summary. 2010. Washington, DC: The National Academies Press.
11. Brown JS, Lane K, Moore K, et al. Defining and evaluating possible database models to implement the FDA Sentinel Initiative. US Food and Drug Administration: FDA-2009-N-0192-0005. 2009. Available at:!documentDetail;D=FDA-2009-N-0192-0005. Accessed December 30, 2012.
12. Platt R, Carnahan RM, Brown JS, et al. The US Food and Drug Administration’s Mini-Sentinel program: status and direction. Pharmacoepidemiol Drug Saf. 2012;21(suppl 1):1–8.
13. Query health initiative. 2012. Available at: Accessed June 2012. Accessed December 30, 2012.
14. Brown JS, Holmes J, Maro J, et al. Report 1: Design Specifications for Network Prototype and Research Cooperative. Developing a Distributed Research Network and Cooperative to Conduct Population-based Studies and Safety Surveillance. Rockville, MD: Agency for Healthcare Research and Quality: HHSA29020050033I. Available at: Accessed January 30, 2009.
15. Maro JC, Platt R, Holmes JH, et al. Design of a national distributed health data network. Ann Intern Med. 2009;151:341–344.
16. Moore KM, Duddy A, Braun MM, et al. Potential population-based electronic data sources for rapid pandemic influenza vaccine adverse event detection: a survey of health plans. Pharmacoepidemiol Drug Saf. 2008;17:1137–1141.
17. Toh S, Platt R, Steiner JF, et al. Comparative-effectiveness research in distributed health data networks. Clin Pharmacol Ther. 2011;90:883–887.
18. Yih WK, Kulldorff M, Fireman BH, et al. Active surveillance for adverse events: the experience of the Vaccine Safety Datalink project. Pediatrics. 2011;127(suppl 1):S54–S64.
19. Geiger H. Decentralizing the Analysis of Health Data. 2012. Washington, DC: Center for Democracy and Technology. Available at:
20. Platt R, Davis R, Finkelstein J, et al. Multicenter epidemiologic and health services research on therapeutics in the HMO Research Network Center for Education and Research on Therapeutics. Pharmacoepidemiol Drug Saf. 2001;10:373–377.
21. Pace WD, Cifuentes M, Valuck RJ, et al. An electronic practice-based network for observational comparative effectiveness research. Ann Intern Med. 2009;151:338–340.
22. Wagner EH, Greene SM, Hart G, et al. Building a research consortium of large health systems: the Cancer Research Network. J Natl Cancer Inst Monogr. 2005;35:3–11.
23. Go AS, Magid DJ, Wells B, et al. The Cardiovascular Research Network: a new paradigm for cardiovascular quality and outcomes research. Circ Cardiovasc Qual Outcomes. 2008;1:138–147.
24. Robb MA, Racoosin JA, Sherman RE, et al. The US Food and Drug Administration’s Sentinel Initiative: expanding the horizons of medical product safety. Pharmacoepidemiol Drug Saf. 2012;21(suppl 1):9–11.
25. McGraw D, Rosati K, Evans B. A policy framework for public health uses of electronic health data. Pharmacoepidemiol Drug Saf. 2012;21(suppl 1):18–22.
26. Rosati K. Using electronic health information for pharmacovigilance: the promise and the pitfalls. J Health Life Sci Law. 2009;2:173–239.
27. Salmon DA, Akhtar A, Mergler MJ, et al. Immunization-safety monitoring systems for the 2009 H1N1 monovalent influenza vaccination program. Pediatrics. 2011;127(suppl 1):S78–S86.
28. Velentgas P, Bohn RL, Brown JS, et al. A distributed research network model for post-marketing safety studies: the Meningococcal Vaccine Study. Pharmacoepidemiol Drug Saf. 2008;17:1226–1234.
29. Andrade S, Raebel M, Boudreau D, et al. Health maintenance organizations/health plans. In: Strom B, Kimmel SE, Hennessy S, eds. Pharmacoepidemiology (Chapter 12). 2012. West Sussex, UK: Wiley; 163–188.
30. Hornbrook MC, Hart G, Ellis JL, et al. Building a virtual cancer research organization. J Natl Cancer Inst Monogr. 2005;35:12–25.
31. Available at: Accessed December 30, 2012.
32. Chen RT, Glasser JW, Rhodes PH, et al. Vaccine Safety Datalink project: a new tool for improving vaccine safety monitoring in the United States. The Vaccine Safety Datalink Team. Pediatrics. 1997;99:765–773.
33. Davis RL, Kolczak M, Lewis E, et al. Active surveillance of vaccine safety: a system to detect early signs of adverse events. Epidemiology. 2005;16:336–341.
34. Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 2005;58:323–337.
35. Suissa S, Garbe E. Primer: administrative health databases in observational studies of drug effects: advantages and disadvantages. Nat Clin Pract Rheumatol. 2007;3:725–732.
36. Kahn MG, Raebel MA, Glanz JM, et al. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Med Care. 2012;50(suppl):S21–S29.
37. Hall GC, Sauer B, Bourke A, et al. Guidelines for good database selection and use in pharmacoepidemiology research. Pharmacoepidemiol Drug Saf. 2012;21:1–10.
38. Bauck A, Bachman D, Riedlinger K, et al. Developing a consistent structure for VDW QA checks. 2011. Available at: Accessed December 30, 2012.
39. OSCAR: Observational Source Characteristics Analysis Report (OSCAR) design specification and feasibility assessment. 2011. Available at: Accessed December 30, 2012.
40. Generalized review of OSCAR unified checking. 2011. Available at: Accessed December 30, 2012.
41. Curtis LH, Weiner MG, Boudreau DM, et al. Design considerations, architecture, and use of the Mini-Sentinel distributed data system. Pharmacoepidemiol Drug Saf. 2012;21(suppl 1):23–31.
42. Curtis LHWM, Beaulieu NU, Rosofsky R, et al. Mini-Sentinel year 1 common data model: data core activities. 2012. Available at: Accessed December 30, 2012.
43. Klompas M, Lazarus R, Daniel J, et al. Electronic medical record Support for Public health (ESP): Automated Detection and Reporting of Statutory Notifiable Diseases to Public Health Authorities. Adv Dis Sur. 2007;3:3.
44. Murphy SN, Weber G, Mendis M, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17:124–130.
45. Hennessy S, Leonard CE, Palumbo CM, et al. Quality of Medicaid and Medicare data obtained through Centers for Medicare and Medicaid Services (CMS). Med Care. 2007;45:1216–1220.
46. Hennessy S, Bilker WB, Weber A, et al. Descriptive analyses of the integrity of a US Medicaid claims database. Pharmacoepidemiol Drug Saf. 2003;12:103–111.
47. Brown PJ, Warmington V. Info-tsunami: surviving the storm with data quality probes. Inform Prim Care. 2003;11:229–233; discussion 234–227.
48. Brown PJ, Warmington V. Data quality probes-exploiting and improving the quality of electronic patient record data and patient care. Int J Med Inform. 2002;68:91–98.
49. Brown PJ, Harwood J, Brantigan P. Data quality probes: a synergistic method for quality monitoring of electronic medical record data accuracy and healthcare provision. Stud Health Technol Inform. 2001;84:1116–1119.

comparative effectiveness research; distributed research network; data quality


© 2013 by Lippincott Williams & Wilkins.