Electronic medical records generated during routine primary care in the United Kingdom are often used for health care research. The capture of cancer cases in primary care electronic medical records, such as the General Practitioner Online Database, the primary care part of the Clinical Practice Research Datalink (known as CPRD), has been shown to be incomplete and to vary by cancer type.1–3
To increase validity and completeness, use of additional data sources may be warranted, such as hospital records (Hospital Episode Statistics) or cancer registry data (National Cancer Data Repository). However, these data sources are available only for a subset of patients in the CPRD and, because of data lag, are not available for the most recent patient follow-up (the lag is about 1 year for Hospital Episode Statistics and 2 years for the National Cancer Data Repository).
As a part of an international postapproval cancer safety program evaluating a new drug to treat overactive bladder, we validated cancer endpoints in the General Practitioner Online Database and linked data to Hospital Episode Statistics and the National Cancer Data Repository.4 The results from this validation effort are presented here.
The CPRD, covering about 7% of the U.K. population, contains electronic medical records created by general practitioners during their clinical practice. General practitioners provide referrals to specialists, receive results from specialists and hospital discharge notes, and prescribe treatment for acute and chronic conditions.5 The General Practitioner Online Database includes issued prescriptions and Read codes for diagnoses, signs, symptoms, referrals, test requests, and test results, as well as free-text comments, which are unstructured fields for information supplementing coded entries. Information is recorded to the extent that it is important for health care. About 75% of English practices contributing to the CPRD have consented to have their patients’ information linked to other health care data sets, like Hospital Episode Statistics or the National Cancer Data Repository.5 In Hospital Episode Statistics and the National Cancer Data Repository, diagnoses are recorded using the International Statistical Classification of Diseases and Related Health Problems, 10th Revision. All data for this study were deidentified.
In the parent cancer safety study, patients with a prescription for darifenacin, fesoterodine, oxybutynin, solifenacin, tolterodine, or trospium were included if they had at least 12 months of continuous enrollment before the prescription in an “up-to-standard” practice (a practice considered by the CPRD to deliver data of adequate quality for research), provided that the same agent was not prescribed during the previous 12 months and that the patient was 18 years or older at the time of the prescription. We excluded patients with previous cancer (except nonmelanoma skin cancer) because the focus of this study was first incident cancers. Patients with HIV infection were excluded because these patients may have received health care through specialty clinics or separate health plans, and their health service utilization might not be captured fully in the CPRD.
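The inclusion and exclusion criteria above can be read as a single eligibility check. The following is a minimal sketch, not the study's actual programming: the `Candidate` record, its field names, and the approximation of "12 months" as 365 days are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: "12 months" is approximated as 365 days.
LOOKBACK = timedelta(days=365)


@dataclass
class Candidate:
    """Hypothetical record for one candidate qualifying prescription."""
    birth_date: date
    enrollment_start: date               # start of continuous enrollment in an up-to-standard practice
    prescription_date: date              # date of the candidate qualifying prescription
    previous_same_agent: Optional[date]  # most recent earlier prescription of the same agent, if any
    prior_cancer: bool                   # cancer before the prescription, ignoring nonmelanoma skin cancer
    prior_hiv: bool                      # HIV infection before the prescription


def age_on(birth: date, when: date) -> int:
    """Age in completed years on a given date."""
    had_birthday = (when.month, when.day) >= (birth.month, birth.day)
    return when.year - birth.year - (0 if had_birthday else 1)


def is_eligible(c: Candidate) -> bool:
    """Apply the inclusion and exclusion criteria described in the text."""
    enrolled_long_enough = c.prescription_date - c.enrollment_start >= LOOKBACK
    new_user = (
        c.previous_same_agent is None
        or c.prescription_date - c.previous_same_agent > LOOKBACK
    )
    adult = age_on(c.birth_date, c.prescription_date) >= 18
    return (
        enrolled_long_enough
        and new_user
        and adult
        and not c.prior_cancer
        and not c.prior_hiv
    )
```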
For the validation study, we selected from the population included in the safety study a stratified random sample, retaining all patients with a qualifying prescription for the three least commonly prescribed drugs (darifenacin, fesoterodine, and trospium) and 33% of patients with a qualifying prescription for the most common drugs (oxybutynin, solifenacin, and tolterodine). This was done to ensure that all study drugs would be well represented in the validation cohort.
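The stratified sampling step can be sketched as below. This is an illustration under stated assumptions, not the study's code: each patient is represented by a hypothetical `(patient_id, drug)` pair with a single qualifying drug, the 33% sample is drawn as a fixed-size simple random sample, and the seed is arbitrary.

```python
import random

RARE_DRUGS = {"darifenacin", "fesoterodine", "trospium"}
COMMON_DRUGS = {"oxybutynin", "solifenacin", "tolterodine"}
COMMON_FRACTION = 0.33  # sampling fraction for the most commonly prescribed drugs


def draw_validation_cohort(patients, seed=0):
    """Return the stratified sample from a list of (patient_id, drug) pairs.

    All users of the least commonly prescribed drugs are retained;
    a simple random sample of 33% is drawn from users of the common drugs.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    rare = [p for p in patients if p[1] in RARE_DRUGS]
    common = [p for p in patients if p[1] in COMMON_DRUGS]
    n_common = round(COMMON_FRACTION * len(common))
    return rare + rng.sample(common, n_common)
```

Drawing a fixed-size sample keeps the sampling fraction exact; an independent 33% chance per patient would be an equally plausible reading of the text.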
The study period was January 1, 2004, to December 31, 2012. Because the end of data collection in the General Practitioner Online Database was later than in Hospital Episode Statistics and the National Cancer Data Repository, linked person-time in patients enrolled in practices with linkage to Hospital Episode Statistics and the National Cancer Data Repository was followed by nonlinked person-time. The period of complete overlap between data sources was January 1, 2004, to December 31, 2010 (Figure 1). Follow-up started with the qualifying prescription and ended at the earliest of the following: end of the study period, disenrollment, diagnosis of HIV infection or cancer (except nonmelanoma skin cancer), or death. We conducted validation efforts on the validation cohort.
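The censoring rule is a minimum over the candidate end dates. A minimal sketch follows; the function and parameter names are hypothetical, and the dates are those given in the text.

```python
from datetime import date

STUDY_END = date(2012, 12, 31)    # end of the study period
OVERLAP_END = date(2010, 12, 31)  # end of complete overlap between data sources


def follow_up_end(disenrollment=None, hiv=None, cancer=None, death=None):
    """Earliest of study end, disenrollment, HIV infection, cancer, or death.

    Events that did not occur are passed as None and ignored.
    """
    events = [d for d in (disenrollment, hiv, cancer, death) if d is not None]
    return min([STUDY_END] + events)


def linked_person_time_end(end_of_follow_up):
    """For a patient in a linked practice, the date on which linked
    person-time stops; any later follow-up is nonlinked."""
    return min(end_of_follow_up, OVERLAP_END)
```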
The cancer endpoints were 10 common cancers: bladder, female breast, colorectal, corpus uteri, kidney and renal pelvis, lung and bronchus, non-Hodgkin lymphoma, pancreas, prostate, and skin melanoma.
Case Identification and Validation
Validation processes available for each patient depended on whether the individual’s data in the General Practitioner Online Database were linked to Hospital Episode Statistics and the National Cancer Data Repository.
Validation in the General Practitioner Online Database (Nonlinked and Linked) Practices
Provisional cancer cases were identified using an electronic algorithm that searched for Read diagnosis codes in the General Practitioner Online Database, for practices without or with linkage to Hospital Episode Statistics and the National Cancer Data Repository. As morphology and treatment codes are often not specific to cancer type, we did not include these types of codes in the electronic algorithm; we used them for case confirmation. Codes for benign neoplasms and in situ cancers were not included in the electronic algorithm.
We created electronic medical profiles with patients’ diagnoses, procedures, relevant additional clinical information, and prescriptions. The medical profiles of all provisional cases were reviewed by a team of clinical reviewers blinded to the study drugs, including a specialist in medical oncology/hematology. Profiles were reviewed with free-text comments around the event date (n = 405) or without them (n = 1,081); free-text comments were requested when the diagnosis was not clear from the cancer-related codes. Provisional cases identified by the electronic algorithm were confirmed when patient medical profiles presented supportive clinical evidence of a cancer diagnosis, including morphology and treatment codes, codes indicating the general practitioner reviewed the patient’s cancer care, or supportive free-text comments. Details on the content of patient profiles, criteria to request free-text comments, reviewers’ training, and review process are presented in the supplemental information.
If definitive information was found indicating that a provisional case did not have a cancer diagnosis, the patient was considered a noncase. When the medical profile had evidence that a provisional case had cancer diagnosed before cohort entry, the patient was considered a noncase and excluded from the study. Provisional cases not confirmed and not identified as noncases remained provisional. Reviewers also assessed cancer type and diagnosis date.
Discrepancies or uncertainties were reviewed by the team and resolved by the clinical specialist in medical oncology/hematology (J.A.K.). The diagnosis date was the earliest date of a cancer diagnosis in any of the sources.
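The case-status rules described above can be sketched as a small decision function. This is an illustration only: the three boolean inputs stand in for the reviewers' clinical judgment, the names are hypothetical, and letting evidence against a case take precedence over supportive evidence is an assumption, since the text does not state how such conflicts were ordered.

```python
from datetime import date
from enum import Enum


class Status(Enum):
    CONFIRMED = "confirmed"
    NONCASE = "noncase"
    PROVISIONAL = "provisional"


def classify(supportive_evidence, definitive_no_cancer, cancer_before_entry):
    """Assign a case status from the outcome of profile review.

    Assumption: evidence against a case is checked first.
    """
    if definitive_no_cancer or cancer_before_entry:
        return Status.NONCASE
    if supportive_evidence:
        return Status.CONFIRMED
    return Status.PROVISIONAL  # neither confirmed nor ruled out


def diagnosis_date(dates_by_source):
    """Earliest cancer diagnosis date across sources.

    `dates_by_source` maps a source name to a date, or to None when
    the source has no cancer record for the patient.
    """
    known = [d for d in dates_by_source.values() if d is not None]
    return min(known) if known else None
```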
Additional Validation in Linked Practices
For linked practices, validation started with the identification of provisional cases using the electronic algorithm previously described, followed by physician review of medical profiles. During the period of overlap between data sources (2004–2010), we used Hospital Episode Statistics and the National Cancer Data Repository to confirm previously identified cases (patients with cancer records in the General Practitioner Online Database and cancer records in one or both of these linked data sources) and to identify additional cases (patients in the General Practitioner Online Database without cancer records in the General Practitioner Online Database but with cancer records in Hospital Episode Statistics and/or the National Cancer Data Repository). Since Hospital Episode Statistics data are independently audited and cancer registries perform their own independent case validation using standardized procedures, including review of pathology information,6 all cases identified in Hospital Episode Statistics or the National Cancer Data Repository were considered confirmed.
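In set terms, the linkage step partitions patients by where their cancer records appear. The sketch below assumes each source is represented as a set of patient identifiers with a cancer record during the overlap period; the function name and the returned keys are hypothetical.

```python
def combine_sources(gp_cases, hes_cases, ncdr_cases):
    """Partition cancer cases from linked practices by source.

    Each argument is a set of patient identifiers with a cancer record
    in that source during the period of overlap.
    """
    linked = hes_cases | ncdr_cases
    return {
        # primary care cases also recorded in one or both linked sources
        "confirmed_by_linkage": gp_cases & linked,
        # cases with no primary care cancer record, found only through linkage
        "additional": linked - gp_cases,
        # primary care cases with no linked record; these rely on profile review
        "gp_only": gp_cases - linked,
    }
```

For example, with primary care cases `{1, 2, 3}`, hospital cases `{2, 4}`, and registry cases `{3, 4, 5}`, patients 2 and 3 are confirmed by linkage, patients 4 and 5 are additional cases, and patient 1 depends on medical profile review alone.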
Using only General Practitioner Online Database data for the entire study period, we reported the number of cancer cases identified by the electronic algorithm, plus absolute and relative frequencies of case confirmation from electronic medical profile review, overall and by linkage availability. For linked practices, within the period of complete overlap of data sources, we reported the frequency of cancer cases identifiable and not identifiable in the General Practitioner Online Database, by patient characteristics, for selected cancer types. We described the source of each confirmed cancer diagnosis (General Practitioner Online Database, Hospital Episode Statistics, and/or the National Cancer Data Repository) using proportional Venn diagrams for the combined study cancers and for individual cancer types. The area of each segment in these diagrams is proportional to the number of patients it includes.
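The counts behind a three-source proportional Venn diagram, and the completeness of each source, can be derived as in the sketch below. It assumes each source is a set of patient identifiers for confirmed cases; the labels "GP", "HES", and "NCDR" and the function names are hypothetical, and drawing the diagram itself is out of scope here.

```python
from collections import Counter


def venn_counts(gp, hes, ncdr):
    """Count confirmed cases in each segment of a three-source Venn diagram.

    Keys are tuples naming the sources that hold a record for the patient,
    for example ("GP", "HES") for cases absent from the registry only.
    """
    sources = (("GP", gp), ("HES", hes), ("NCDR", ncdr))
    counts = Counter()
    for patient_id in gp | hes | ncdr:
        segment = tuple(name for name, cases in sources if patient_id in cases)
        counts[segment] += 1
    return counts


def completeness(source_cases, gp, hes, ncdr):
    """Share of all confirmed cases that are identifiable in one source."""
    all_cases = gp | hes | ncdr
    return len(source_cases & all_cases) / len(all_cases)
```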
Analyses were conducted using SAS 9.3 (Cary, NC: SAS Institute, Inc.; 2011) and Stata 13.1 (College Station, TX: StataCorp LP; 2014). The study protocol was registered in the European Union electronic Register of Post-Authorisation Studies before the study was conducted (Register number EUPAS5529; http://www.encepp.eu/encepp/viewResource.htm?id=11107) and was approved by the CPRD’s Independent Scientific Advisory Committee (protocol 13_142A).
The validation cohort included 50,840 study drug users. After excluding patients with cancer or HIV before cohort entry, the electronic search identified 1,486 provisional cancer cases in the General Practitioner Online Database, 56% from linked and 44% from nonlinked practices.
Validation of Provisional Cases Using Only the General Practitioner Online Database, Entire Study Period
Of the 1,486 provisional cancer cases identified through an electronic algorithm in the General Practitioner Online Database, 95% were confirmed in the review of patients’ medical profiles (Table). Of the 825 provisional cases from linked practices, 96% were confirmed; at least 90% of provisional cases were confirmed for any individual cancer type. Of the 661 provisional cases in nonlinked practices, 93% were confirmed in the review of patient medical profiles. For most individual cancer types (i.e., bladder, breast, colorectal, corpus uteri, non-Hodgkin lymphoma, pancreas, and prostate), at least 90% of provisional cases were confirmed; for lung and kidney cancer and skin melanoma, 77–88% of provisional cases were confirmed.
Source of Cases in Linked Practices Using All Data Sources, Period of Overlap
Overall, 720 cancer cases were confirmed in the General Practitioner Online Database, Hospital Episode Statistics, and/or the National Cancer Data Repository. Of these, 68% were identifiable in the General Practitioner Online Database, 81% in Hospital Episode Statistics, and 84% in the National Cancer Data Repository (Figure 2). The completeness of case recording in the General Practitioner Online Database was greater for breast cancer and prostate cancer than for other study cancers (Figure 3).
In the General Practitioner Online Database, more complete identification of study cancer cases was seen in younger individuals (eTable; http://links.lww.com/EDE/B297 in Supplemental Digital Content), in nonsmokers, and in cancers diagnosed in 2004–2008. Based on other characteristics, no substantial variation was apparent for the combined study cancers or for three cancers for which the General Practitioner Online Database is less complete: pancreas, lung, and kidney.
A very high proportion of provisional cases of cancer identified in the General Practitioner Online Database by screening for Read diagnosis codes were confirmed through clinical review of patient profiles or linkage to the National Cancer Data Repository or Hospital Episode Statistics, but, of these three data sources, no single source contained records of all confirmed study cancer cases. Completeness of cancer recording in the General Practitioner Online Database is higher for breast and prostate cancers—diseases for which general practitioners often prescribe ongoing drug therapy—than for other cancers that are usually treated by specialists. We observed more complete case ascertainment in younger individuals, but we did not identify patient groups for which the General Practitioner Online Database contains all cancer cases.
Multiple studies have examined the completeness of cancer recording in data sources available for research in the United Kingdom. A discussion of methods and findings in our and other studies is presented in the supplemental information.
Cancer ascertainment from practices whose data allow linkage to Hospital Episode Statistics and the National Cancer Data Repository is more complete than from nonlinked practices. Whether this would affect relative risks in safety studies depends on whether completeness of case ascertainment is differential for patients with versus without the exposure of interest. Even without such bias, identifying a lower proportion of the cases of a given cancer would be expected to yield less precise effect estimates.
A limitation of this study is that we identified cancer cases in the General Practitioner Online Database with an algorithm that used Read diagnosis codes exclusively (i.e., without morphology or treatment codes). While the coding system includes codes for morphology (e.g., Read code BB5..11, “[M] Adenocarcinoma”) and treatment (e.g., Read code 8BAD.00, “Chemotherapy”), only diagnosis codes consistently permit identification of the type of cancer (needed in this study). Instead, in medical profile review, morphology and treatment codes were used, along with codes related to review of cancer care, to confirm the presence of cancer. Strengths of this validation study include the meticulous process for patient profile review and confirmation of case status, including calibration of the assessment process before starting the patient profile review to decrease interrater variability.
In conclusion, cancer case identification in the General Practitioner Online Database is sensitive to features of the case ascertainment algorithm such as the use of free-text comments and the type of codes included (e.g., diagnosis, morphology, treatment). Nearly all cancers with diagnosis codes in the General Practitioner Online Database (similarly for linked and nonlinked practices) were confirmed. While completeness of the General Practitioner Online Database was high for breast and prostate cancer, a substantial proportion of other cancers will be missed if Hospital Episode Statistics and the National Cancer Data Repository are not used.
We thank Jennifer Bartsch for her help with programming, Christine Bui and Alicia Gilsenan for their help managing the project, Adele Monroe for her editorial help, and Jason Mathes for his help preparing figures (all from RTI International); Kwame Appenteng and Milbhor D’Silva for their input at all stages of the study (both from Astellas); and Clinical Practice Research Datalink research staff for their support to the program.
1. Dregan A, Moller H, Murray-Thomas T, et al. Validity of cancer diagnosis in a primary care database compared with linked cancer registrations in England. Population-based cohort study. Cancer Epidemiol. 2012;36:425–429.
2. Boggon R, van Staa TP, Chapman M, et al. Cancer recording and mortality in the General Practice Research Database and linked cancer registries. Pharmacoepidemiol Drug Saf. 2013;22:168–175.
3. Rañopa M, Douglas I, van Staa T, et al. The identification of incident cancers in UK primary care databases: a systematic review. Pharmacoepidemiol Drug Saf. 2015;24:11–18.
4. Kaye JA, Margulis AV, Fortuny J, et al. Cancer incidence after initiation of antimuscarinic medications for overactive bladder in the United Kingdom: evidence for protopathic bias. Pharmacotherapy. 2017;37:673–683. doi: 10.1002/phar.1932.
5. Herrett E, Gallagher AM, Bhaskaran K, et al. Data resource profile: Clinical Practice Research Datalink (CPRD). Int J Epidemiol. 2015;44:827–836.
6. Jensen OM, Parkin DM, MacLennan R, Muir CS, Skeet RG. Cancer Registration: Principles and Methods. IARC Publication No. 95. Lyon, France: World Health Organization, International Agency for Research on Cancer (IARC), and International Association of Cancer Registries; 1991.
Copyright © 2018 Wolters Kluwer Health, Inc. All rights reserved.