Data quality – The foundation of real-world studies : Perspectives in Clinical Research

Real World Education

Data quality – The foundation of real-world studies

Bhatt, Arun

Perspectives in Clinical Research 14(2):p 92-94, Apr–Jun 2023. | DOI: 10.4103/picr.picr_12_23


Real-world data (RWD) are data on patient health status and/or the delivery of health care routinely collected from multiple sources outside typical clinical research settings.[1] RWD are useful in conducting real-world studies (RWS) on disease epidemiology, effectiveness and safety of therapeutic interventions, and health economics. As RWD are primarily collected for nonregulatory purposes and nonresearch aims, they are likely to be disorganized, heterogeneous, and have measurement errors.[2] Sub-optimal and inconsistent data quality is a challenging issue which should be addressed whilst planning and conducting RWS.[2] When RWD are used for conducting academic clinical research or regulatory submission, ensuring data quality is critical. This brief review discusses the quality aspects of data needed for RWS.


RWD should be meaningful, valid, and transparent to answer a regulatory question for a clinical setting.[3] Data quality requires an assessment of the attributes of data needed to answer the question of interest accurately, reliably, and repeatedly.[3] It also requires a focus on data accrual and transformation, which are important to ensure data integrity, that is, the completeness, consistency, and accuracy of data. The quality of RWD should be robust enough to accurately capture critical covariates and endpoints or outcomes.[3]


The accuracy attribute includes several dimensions: (a) correctness of collection, transmission, and processing of data;[1] (b) assessment of the validity, reliability, and robustness of a data field;[3] and (c) closeness of agreement between the measured value and the true value of what is intended to be measured.[4]

Evaluating data accuracy requires (1) examining the logical plausibility of the clinical data and laboratory test results, (2) assessing the validity of the data elements and any algorithms used to convert the data, (3) checking the consistency of relevant data across the patient population, and (4) confirming the conformity of the data to established internal standards or external data models.[3] Special attention should be paid to factors such as physician practice patterns, user interfaces, and autofilling, which can affect the accuracy of data elements in electronic health records (EHRs). Quality audits of data accuracy for modern technology platforms, for example, personal digital health applications, would be challenging, as the data standards and validation processes are still evolving.
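A logical plausibility check of the kind described above can be sketched in a few lines of Python. This is only an illustration: the field names and the numeric limits in `PLAUSIBLE_RANGES` are hypothetical assumptions, not values from the cited guidance, and real limits would need clinical review.

```python
# Hypothetical plausibility limits (assumptions for illustration only).
PLAUSIBLE_RANGES = {
    "systolic_bp_mmhg": (50, 300),
    "heart_rate_bpm": (20, 250),
    "hemoglobin_g_dl": (2.0, 25.0),
}


def find_implausible(record: dict) -> list:
    """Return the names of fields whose values fall outside the plausible range."""
    flagged = []
    for name, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(name)
        if value is None:
            # A missing value is a completeness issue, not a plausibility issue.
            continue
        if not (low <= value <= high):
            flagged.append(name)
    return flagged


# A systolic pressure of 1200 mmHg is most likely a data entry error.
print(find_implausible({"systolic_bp_mmhg": 1200, "heart_rate_bpm": 72}))
```

A flagged value is not necessarily wrong; it is a candidate for review against the source record.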


Completeness is defined as the presence of the necessary data to address the study question, design, and analysis.[1] It is a measure of recorded data present within a defined data field and/or data set.[3] Incompleteness in core data for the selection of study populations, exposures, key covariates, outcomes of interest, and other important parameters can introduce bias. Completeness could be a problem when patients fail to consistently wear or charge a device or forget to record their data. Hence, it is essential to measure the extent of missing data, that is, data needed for the study analysis that were not observed, collected, or accessible, and to develop methods to compensate for incomplete data.[3]
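One simple way to quantify completeness within a defined data field is the fraction of records in which the field is populated. The sketch below assumes records are plain dictionaries and treats `None` and empty strings as missing; the function name and field names are hypothetical.

```python
def field_completeness(records: list, fields: list) -> dict:
    """Fraction of records with a non-missing value, computed per field."""
    total = len(records)
    result = {}
    for name in fields:
        # Treat None and empty strings as missing; 0 counts as a recorded value.
        present = sum(1 for r in records if r.get(name) not in (None, ""))
        result[name] = present / total if total else 0.0
    return result


records = [
    {"age": 54, "hba1c": 7.1},
    {"age": 61, "hba1c": None},
    {"age": None, "hba1c": ""},
    {"age": 47},
]
print(field_completeness(records, ["age", "hba1c"]))
```

Per-field rates like these only describe the extent of missingness; how to compensate for it (for example, by imputation or sensitivity analysis) is a separate analytical decision.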


Consistency means relevant uniformity in data across clinical sites, facilities, departments, units within a facility, providers, or other assessors.[1] Consistency also indicates the stability of a data value within a dataset or across linked datasets.[3] Consistency could be a concern even within a single source such as EHRs because of significant variations in data coding across outpatient clinics, hospitals, and laboratories, or in the data recording approach, for example, structured versus free text.[3] Checking for consistency with the source for each patient and data set is important to ensure data accuracy.
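Variation in coding across sites can be surfaced by comparing the codes each site actually uses against an agreed value set. The following is a minimal sketch under the assumption that every record carries a `site` key; the function names, site labels, and the allowed value set are hypothetical.

```python
from collections import defaultdict


def codes_by_site(records: list, field: str) -> dict:
    """Collect the distinct codes each site uses for a given field."""
    sites = defaultdict(set)
    for r in records:
        sites[r["site"]].add(r[field])
    return dict(sites)


def inconsistent_sites(records: list, field: str, allowed: set) -> list:
    """Return sites that use at least one code outside the agreed value set."""
    return sorted(
        site
        for site, codes in codes_by_site(records, field).items()
        if not codes <= allowed  # subset test: every code must be allowed
    )


records = [
    {"site": "A", "sex": "M"},
    {"site": "A", "sex": "F"},
    {"site": "B", "sex": "Male"},
    {"site": "B", "sex": "F"},
]
print(inconsistent_sites(records, "sex", allowed={"M", "F"}))
```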


Data provenance means the origin of the data, sometimes including a chronological record of data custodians and transformations.[3] It refers to an audit trail that accounts for the origin of a piece of data in a database, document, or repository along with an explanation of how and why it got to the present place.[1] Documentation supporting data provenance would be useful during an audit.
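As an illustration of such an audit trail, the sketch below keeps the origin of a value together with a chronological list of who changed it, how, and why. The class and attribute names (`TrackedValue`, `ProvenanceEvent`, and so on) are hypothetical; a production system would normally rely on the audit features of its database or data platform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    """One immutable entry in the audit trail."""
    custodian: str
    action: str
    reason: str
    old_value: object
    new_value: object
    timestamp: datetime


@dataclass
class TrackedValue:
    """A data value that remembers its source and every transformation."""
    value: object
    source: str
    history: list = field(default_factory=list)

    def transform(self, new_value, custodian: str, action: str, reason: str) -> None:
        # Record the change before applying it, so the trail is never skipped.
        self.history.append(
            ProvenanceEvent(custodian, action, reason, self.value, new_value,
                            datetime.now(timezone.utc))
        )
        self.value = new_value


weight = TrackedValue(value="180 lb", source="clinic_ehr")
weight.transform(81.6, "data_manager", "unit conversion", "harmonize weight to kg")
for event in weight.history:
    print(event.custodian, event.action, event.old_value, "->", event.new_value)
```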


Traceability is the ability to record changes to location, ownership, and values.[3] It permits an understanding of the relationships between the analysis results (tables, listings, and figures in the study report), analysis datasets, tabulation datasets, and source data.

Fit-for-purpose data

This is an assessment of whether a meaningful, valid, and transparent data set can answer the question of interest given data quality, data relevancy, and the current body of evidence.[3]


Validation is the process of establishing that a method is sound or that data are correctly measured.[1] The validation process is of special importance to newer digital technologies, for example, smartphones and patient-reported outcome (PRO) instruments.


RWD quality can be improved by conforming to published recommendations concerning registries, for example, those of the Agency for Health Care Quality.[5]

For registries, assurance of data quality requires structures, processes, policies, and procedures to be set up to ascertain the quality of the data in the registry and to guard against common errors in interpretation, coding, data entry, and transformation accuracy.[6]

Processes for assuring data quality include:

  • Providing training to data collectors and data abstracters
  • Ensuring completeness of data
  • Maintaining data consistency across sites and over time
  • Using automatic data quality monitoring and alerting
  • Completing onsite audits for a sample of sites
  • Conducting for-cause audits.
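Automatic monitoring and alerting can be as simple as a scheduled check that flags sites whose completeness for a key field drops below an agreed threshold. The sketch below is an assumption-laden illustration: the function name, the 0.9 default threshold, and the site identifiers are hypothetical, and a flagged site would be a natural candidate for a for-cause audit.

```python
def sites_needing_alert(site_records: dict, field: str,
                        min_completeness: float = 0.9) -> list:
    """Return (site, completeness) pairs for sites below the threshold."""
    alerts = []
    for site, records in site_records.items():
        if not records:
            continue  # no data yet; nothing to evaluate
        present = sum(1 for r in records if r.get(field) is not None)
        rate = present / len(records)
        if rate < min_completeness:
            alerts.append((site, round(rate, 2)))
    return sorted(alerts)


site_records = {
    "site_01": [{"outcome": 1}] * 9 + [{"outcome": None}],      # 90% complete
    "site_02": [{"outcome": 1}] * 5 + [{"outcome": None}] * 5,  # 50% complete
}
print(sites_needing_alert(site_records, "outcome"))
```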

For administrative or insurance claims or hospital data, there may not be any established data quality control processes. However, the academic researcher or the industry sponsor can adopt the quality assurance approach outlined for registries to ensure data quality and sound procedures during the data source design and development stages.[6]

Financial support and sponsorship


Conflicts of interest

There are no conflicts of interest.


1. US FDA Guidance for Industry Real-World Data: Assessing Registries to Support Regulatory Decision-Making for Drug and Biological Products. 2021. Available from: [Last accessed on 2022 Jul 11]
2. Liu F, Demosthenes P. Real-world data: A brief review of the methods, applications, challenges and opportunities. BMC Med Res Methodol 2022;22:287
3. Daniel G, Silcox C, Bryan J, McClellan M, Romine M, Frank K. Characterizing RWD Quality and Relevancy for Regulatory Purposes, Duke Margolis Center for Health Policy 2018. Available from: [Last accessed on 2022 Jul 11]
4. US FDA Guidance for Industry Real-World Data: Assessing Electronic Health Records and Medical Claims Data to Support Regulatory Decision Making for Drug and Biological Products. 2021. Available from: [Last accessed on 2022 Jul 11]
5. US FDA Guidance for Industry Use of Real-World Evidence to Support Regulatory Decision-Making for Medical Devices. Available from: [Last accessed on 2022 Jul 11]
6. Agency for Health Care Quality Registries for Evaluating Patient Outcomes: A User's Guide. Available from: 22. [Last accessed on 2022 May 31]

Data; quality; real-world

Copyright: © 2023 Perspectives in Clinical Research