Rapid identification of chronic kidney disease in electronic health record database using computable phenotype combining a common data model : Chinese Medical Journal


Wang, Huai-Yu1,2; Du, Jian1; Yang, Yu1; Lin, Hongbo3; Bao, Beiyan4; Ding, Guohui1,5; Yang, Chao6,7; Kong, Guilan1,7; Zhang, Luxia1,6,7

Editor(s): Ni, Jing

Chinese Medical Journal, February 28, 2023. | DOI: 10.1097/CM9.0000000000002168

To the Editor: Chronic kidney disease (CKD) is a global public health burden. The global prevalence of CKD exceeded 10%, whereas awareness of the disease was around 10%.[1] In the era of big data, improving the identification of CKD with informatics tools is important. The computable phenotype has proven to be an efficient tool for identifying patients from electronic health record (EHR) data. It is an automated algorithm that identifies the target population through objective criteria expressed as logic statements. Effective implementation of a computable phenotype depends on valid mapping of raw data to a standard set of data and definitions. Previous studies developed computable phenotypes for CKD identification in English by using the Logical Observation Identifiers Names and Codes (LOINC) and the International Classification of Diseases (ICD) codes.[2,3] Because these codes are not used everywhere and because of the language barrier, implementing these computable phenotypes in non-English settings and/or in the absence of an identical coding system is difficult.

The common data model (CDM) has been reported as a solution for data standardization and the localization of computable phenotypes.[4] The core of a CDM is the extraction, transformation, and loading (ETL) process: key elements are extracted, transformed into a standard terminology, and loaded into a standard schema. Various CDMs with different original aims, such as the Observational Medical Outcomes Partnership CDM, the Sentinel CDM, and the Patient-Centered Outcomes Research Network CDM, have been widely used and have successfully facilitated the standardization of EHR data. Sentinel previously posted coding trend analyses on kidney disease, but only ICD-9 and ICD-10 codes were included. A CDM for CKD characterization was still lacking.

Confirming CKD takes at least 3 months. This requirement hinders timely diagnosis and increases missed diagnoses of CKD in clinical practice, especially for patients who seek health care at different institutes.[5] An EHR database collects healthcare data continuously across institutes and updates them in real time. Monitoring and identifying patients with CKD by using an informatics tool based on such a database is promising. Taken together, it is reasonable to speculate that a computable phenotype combined with a CDM might facilitate CKD-related data extraction and CKD identification using EHR data.

Yinzhou is a district of Ningbo, Zhejiang Province, China, with a population of 1.6 million. The Regional Health Information System (RHIS) in Yinzhou collected the EHRs of residents and updated the database in real time. In this database, a unique identity code (PERSONKEY) was generated from personal ID, sex, date of birth, and name; it was used to recognize the same person, link health profiles across sub-databases, and generate complete EHRs. The EHRs of 976,409 adults with medical records were extracted as the raw data for the following analyses [Supplementary Figure 1, https://links.lww.com/CM9/B73]. This study was approved by the ethics committee of Peking University First Hospital.
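The text does not describe how PERSONKEY was actually derived from the four fields. As a minimal sketch only, assuming a salted-free SHA-256 hash over normalized fields (the function name, the normalization steps, and the sample values are all hypothetical), a deterministic linkage key could look like this:

```python
import hashlib
import unicodedata


def make_person_key(personal_id: str, sex: str, birth_date: str, name: str) -> str:
    """Hypothetical linkage key; NOT the RHIS implementation.

    Normalizes each field (Unicode NFKC, trimmed, case-folded) so that
    trivial formatting differences map to the same key, then hashes them.
    """
    fields = [personal_id, sex, birth_date, name]
    normalized = [unicodedata.normalize("NFKC", f).strip().casefold() for f in fields]
    # The ASCII unit separator keeps adjacent fields from running together.
    payload = "\x1f".join(normalized)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Fabricated example values, for illustration only.
key = make_person_key("000000000000000000", "F", "1980-01-01", "Zhang San")
```

Records from different sub-databases that yield the same key can then be joined into one profile; a production system would also need to handle missing or conflicting fields, which this sketch ignores.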

The CDM for CKD characterization was designed in accordance with the principles described in The Book of OHDSI: Observational Health Data Sciences and Informatics. In accordance with the Kidney Disease: Improving Global Outcomes (KDIGO) clinical guidelines for CKD (2012), the key elements for CKD identification were defined as age, sex, kidney function, and urine abnormality.[6] Hence, the data domains of the CDM for CKD identification were designed as demographics, laboratory tests, and diagnosis. The standard terminology of the data domains was defined in accordance with the KDIGO-CKD clinical guidelines and ICD-10 codes, in English and in Chinese. Forms containing demographics (age, sex), laboratory tests (kidney function, albuminuria, proteinuria, hematuria), and diagnosis (ICD-10 codes and texts) in the EHR database were integrated by PERSONKEY. Altogether, 10,981,723 medical records of 976,409 individuals in the EHR database were prepared for the extraction of original vocabularies [Supplementary Figure 1, https://links.lww.com/CM9/B73]. The mapping rules between the original vocabularies and the standard terminology were established through manual annotation and format conversion. Two nephrologists independently conducted the annotation, and one informaticist performed the mapping [Figure 1].
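The mapping step can be pictured as a lookup table produced by the annotators plus a few format conversions. The sketch below is an illustration under stated assumptions: the local labels, the standard terms, and both function names are hypothetical and are not taken from the study's Supplementary Table 2. The only external fact used is the standard creatinine conversion, 1 mg/dL = 88.4 μmol/L.

```python
from typing import Optional

# Hypothetical annotation result: local test label -> standard term.
MAPPING_RULES = {
    "血肌酐": "serum creatinine",
    "Scr": "serum creatinine",
    "尿白蛋白/肌酐": "urine albumin-to-creatinine ratio",
    "UACR": "urine albumin-to-creatinine ratio",
}
_RULES = {label.casefold(): term for label, term in MAPPING_RULES.items()}


def to_standard_term(original: str) -> Optional[str]:
    """Return the standard term for a local label, or None if unmapped."""
    return _RULES.get(original.strip().casefold())


def creatinine_to_umol_per_l(value: float, unit: str) -> float:
    """Convert serum creatinine to umol/L (1 mg/dL = 88.4 umol/L)."""
    # Treat the micro sign and the Greek letter mu as a plain "u".
    u = unit.strip().replace("\u00b5", "u").replace("\u03bc", "u").lower()
    if u == "umol/l":
        return value
    if u == "mg/dl":
        return value * 88.4
    raise ValueError(f"unsupported creatinine unit: {unit!r}")
```

Labels that return None would go back to the annotators rather than being guessed, which mirrors the manual annotation described above.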

Figure 1:
Process of the development of CDM for CKD characterization and computable phenotype for CKD identification. CDM: Common data model; CKD: Chronic kidney disease; eGFR: Estimated glomerular filtration rate; EHR: Electronic health record; ICD: International Classification of Diseases.

The algorithm of the computable phenotype for CKD identification was designed in accordance with the KDIGO clinical guidelines for CKD[6] [Figure 1]. On the basis of the standard terminology of the CDM, patients showing at least one of the following manifestations lasting for >3 months were defined as having CKD: (1) reduced kidney function: estimated glomerular filtration rate (eGFR) <60 mL·min−1·1.73 m−2; (2) albuminuria: urine albumin-to-creatinine ratio ≥30 mg/g or urine albumin concentration ≥20 mg/L; (3) proteinuria: urine protein-to-creatinine ratio ≥150 mg/g, or 24 h proteinuria ≥150 mg/24 h, or urinalysis protein ≥+1; (4) hematuria without non-CKD-related causes, including urologic neoplasms, urinary tract infection, and injury (criteria for hematuria: urine red blood cells ≥3 cells/HPF [or >28 cells/μL] or urine occult blood ≥+2); (5) CKD-related diagnosis, including primary, secondary, or congenital kidney disease, renal vascular disease, maintenance dialysis, and recipient/donor of kidney transplantation [Supplementary Table 1, https://links.lww.com/CM9/B73]. Patients who were re-tested over a period of 3 months and were confirmed to have none of the abovementioned manifestations were defined as normal cases. Patients who presented these manifestations for ≤3 months or did not receive any re-test were defined as cases to be addressed and will be processed in the next iteration of CKD identification [Figure 1].
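The three-way outcome of the rules above (CKD, normal, to be addressed) can be sketched in a few lines. This is a simplified illustration, not the study's SQL implementation: it covers only three numeric criteria, approximates ">3 months" as more than 90 days, and treats a manifestation as persistent when its first and last abnormal results are that far apart. The names `Observation`, `is_abnormal`, `classify`, and `PERSISTENCE_DAYS` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List

PERSISTENCE_DAYS = 90  # assumption: ">3 months" approximated as > 90 days


@dataclass
class Observation:
    day: date
    test: str  # standard term: "eGFR", "UACR", or "UPCR"
    value: float


def is_abnormal(obs: Observation) -> bool:
    """Numeric thresholds quoted in the text; other criteria are omitted."""
    if obs.test == "eGFR":
        return obs.value < 60   # mL/min/1.73 m^2
    if obs.test == "UACR":
        return obs.value >= 30  # mg/g
    if obs.test == "UPCR":
        return obs.value >= 150  # mg/g
    raise ValueError(f"unsupported test: {obs.test!r}")


def classify(observations: List[Observation]) -> str:
    if not observations:
        return "to be addressed"
    by_test: Dict[str, List[Observation]] = {}
    for obs in observations:
        by_test.setdefault(obs.test, []).append(obs)
    # CKD: the same manifestation is abnormal on dates > 90 days apart.
    for test_obs in by_test.values():
        days = [o.day for o in test_obs if is_abnormal(o)]
        if days and (max(days) - min(days)).days > PERSISTENCE_DAYS:
            return "CKD"
    # Normal: re-tested over more than 90 days with no abnormal result.
    all_days = [o.day for o in observations]
    span = (max(all_days) - min(all_days)).days
    if span > PERSISTENCE_DAYS and not any(is_abnormal(o) for o in observations):
        return "normal"
    return "to be addressed"
```

For example, two eGFR results of 55 and 52 taken five months apart would be classified as CKD, whereas a single eGFR of 55 with no re-test stays in the "to be addressed" group until later data arrive.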

On the basis of the number of individuals with EHRs and the diversity of EHR infrastructures and data sources, seven institutes were selected from the 42 healthcare institutes in Yinzhou to implement the computable phenotype based on the CDM. In total, three tertiary general hospitals, two specialty hospitals (a maternity and children's hospital and an orthopedic hospital), one secondary general hospital, and one community health center were selected.

The performance of the computable phenotype was validated through manual review. Cases identified as having or not having CKD were randomly selected, and their original records of demographics, diagnoses, and laboratory tests were manually reviewed by two nephrologists. For those without CKD, all diagnoses and CKD-related laboratory tests in the database were extracted and manually reviewed. For those with CKD, all diagnoses and laboratory tests from the date of presentation of CKD to the endpoint of the database were extracted and manually reviewed. A panel discussion was held when the two reviewers disagreed. Review by nephrologists was defined as the gold standard for CKD identification.

The data processing and computation in the RHIS were based on the Hadoop framework. Spark was the computing engine, and Hive served as the data warehouse and provided structured query language (SQL) support (The Apache Software Foundation, Wakefield, MA, USA). The ETL process of the CDM and the implementation of the computable phenotype were conducted using SQL statements. The demographic and clinical characteristics of CKD-identified patients were analyzed. The stages of CKD-identified patients were evaluated in terms of eGFR level and presented as G1–G5. Continuous and categorical variables were presented as mean ± standard deviation and frequency, respectively. The performance of the computable phenotype was evaluated in terms of sensitivity, specificity, and accuracy and analyzed using MedCalc 15.8 (MedCalc Software Ltd., Ostend, Belgium).
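Staging by eGFR can be illustrated with the KDIGO 2012 GFR categories. The sketch below assumes that the letter's G3 combines the guideline's G3a (45–59) and G3b (30–44), since results are reported only as G1–G5; the function name is hypothetical, and the study itself used SQL rather than Python.

```python
def gfr_category(egfr: float) -> str:
    """Map eGFR (mL/min/1.73 m^2) to a KDIGO GFR category, G3a/G3b merged."""
    if egfr < 0:
        raise ValueError("eGFR cannot be negative")
    if egfr >= 90:
        return "G1"
    if egfr >= 60:
        return "G2"
    if egfr >= 30:
        return "G3"  # KDIGO splits this into G3a (45-59) and G3b (30-44)
    if egfr >= 15:
        return "G4"
    return "G5"
```

Under the guideline, G1 and G2 count as CKD only when another marker of kidney damage is present, so this function assigns a stage to patients already identified as having CKD and is not itself a diagnostic rule.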

The standard terminology for CKD characterization is shown in Figure 1. The bilingual terminology is presented in Supplementary Table 2, https://links.lww.com/CM9/B73. A total of 617 original vocabularies for laboratory tests were found and standardized by processing 10,981,723 medical records of 976,409 individuals from 42 medical institutes. The formats of dates, categorical data, and test units were converted. By manual annotation, 111 types of diagnosis (corresponding to 171 types of ICD-10 codes in the English and Chinese versions), including primary, secondary, and congenital kidney disease, renal vascular disease, and uremia-related diagnosis, were grouped as CKD-related diagnoses [Supplementary Table 1, https://links.lww.com/CM9/B73].

By scanning 21,474,008 records of laboratory tests and diagnoses of 557,719 individuals in seven medical institutes, the computable phenotype identified 64,036 (11.5%) patients with CKD. In China, patients commonly seek health care across different institutes. Thus, the EHRs of more than half of the residents in the whole database were extracted from the seven representative institutes. Among the identified patients, 55,682 (87.0%) received serum creatinine tests. The majority of patients were in early stages (G1: 33,315 cases [59.8%]; G2: 12,980 cases [23.3%]). Patients in G1 were the youngest (53.7 ± 14.0 years), whereas patients in G4 were the oldest (82.3 ± 14.6 years). The highest proportions of hematuria and albuminuria/proteinuria were observed in G1 (17,187 cases [51.6%]) and G5 (417 cases [51.3%]), respectively. The frequency of patients labeled with a CKD-related ICD-10 code increased from G1 (16,795 cases [50.4%]) to G5 (737 cases [90.7%]) [Supplementary Table 3, https://links.lww.com/CM9/B73].

In total, the EHRs of 50 CKD-identified cases and 50 cases without CKD were randomly sampled and reviewed by two nephrologists. All 50 CKD-identified cases were confirmed as having the disease, and three of the cases without CKD were defined as misclassified because they did not meet the criterion of re-testing over 3 months. The sensitivity, specificity, and accuracy of the computable phenotype for CKD identification were 94.3%, 100.0%, and 97.0%, respectively [Supplementary Table 4, https://links.lww.com/CM9/B73].
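The reported figures can be reproduced from the review counts if the three misclassified cases are read as false negatives, which is the reading that matches the stated percentages: 50 true positives, 0 false positives, 3 false negatives, and 47 true negatives. The helper below is an illustrative sketch with a hypothetical name; the study used MedCalc for this calculation.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among diseased
        "specificity": tn / (tn + fp),  # true negatives among non-diseased
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts inferred from the validation sample described in the text.
metrics = diagnostic_metrics(tp=50, fp=0, fn=3, tn=47)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these counts, sensitivity is 50/53, specificity is 47/47, and accuracy is 97/100, in line with the 94.3%, 100.0%, and 97.0% given above.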

Compared with previous models, the present computable phenotype specifically considered the use of existing non-uniform data and the capacity for localization across databases with different settings. Nadkarni et al[3] developed a computable phenotype to identify patients with CKD in a population with diabetes and/or hypertension based on the eMERGE network. Their algorithm mainly relied on ICD-9 codes. Hence, the performance of their computable phenotype was influenced by the rate of missing diagnosis records and/or by awareness of the disease. Norton et al[2] developed the National Kidney Disease Education Program (NKDEP) e-phenotype for CKD identification using laboratory tests, which were extracted through LOINC. The NKDEP e-phenotype effectively avoided the influence of the diagnosis rate, but its dependence on LOINC limited its localization in databases without LOINC. The algorithm of the present computable phenotype combined CKD-related diagnostic records and laboratory tests to improve data utilization and the identification rate. The terminology of the CDM used standard descriptions rather than a coding system, so as to preserve the potential for further expansion to foreign databases that lack an identical coding system. In the present implementation, the EHR data from healthcare institutes of different levels were scanned successfully, and the prevalence of CKD and the characteristics of CKD-identified patients were consistent with a previous nationally representative study.[7] This consistency demonstrated the effectiveness of embedding a CDM into the computable phenotype.

The present study established a reproducible paradigm for the design and construction of a CDM and a computable phenotype in other fields and databases. First, slightly expanding the criteria for disease identification beyond the standard definition of the disease is acceptable in order to balance data utilization and the identification rate. Second, embedding a CDM into the computable phenotype can improve the efficiency of its implementation across different databases. Third, a CDM containing non-monotonic terminology will increase the potential for localization. Finally, the correspondence between the English and Chinese terminologies can serve as the interface that links data in Chinese with existing resources and techniques in English. This strategy may also be feasible for promoting data extraction and information exchange in other languages.

The present study is the first to establish a computable phenotype for CKD identification based on a CDM with a bilingual terminology for CKD characterization. It develops an efficient tool for CKD identification based on a real-world EHR database and provides a potential interface, the CDM, for generalizing the computable phenotype across English and Chinese database settings.


This study was supported by grants from the National Natural Science Foundation of China (Nos. 82100741, 82003529, 91846101, 81771938, 81900665, 82090021), Beijing Municipal Science and Technology Commission (Grant No. 7212201), the University of Michigan Health System-Peking University Health Science Center Joint Institute for Translational and Clinical Research (Nos. BMU2020JI011, BMU2019JI005, BMU2018JI012), Beijing Nova Programme Interdisciplinary Cooperation Project (No. Z191100001119008), National Key R&D Program of the Ministry of Science and Technology of China (No. 2019YFC2005000), the National Key Research and Development Program of China (No. 2018AAA0102100), PKU-Baidu Fund (Nos. 2020BD005, 2019BD017), and CAMS Innovation Fund for Medical Sciences (No. 2019-I2M-5-046).

Conflicts of interest



1. Wang HY, Ding GH, Lin H, Sun X, Yang C, Peng S, et al. Influence of doctors’ perception on the diagnostic status of chronic kidney disease - results from 976,409 individuals with electronic health records in China. Clin Kidney J 2021;14:2428–2436. doi: 10.1093/ckj/sfab089.
2. Norton JM, Ali K, Jurkovitz CT, Kiryluk K, Park M, Kawamoto K, et al. Development and validation of a pragmatic electronic phenotype for CKD. Clin J Am Soc Nephrol 2019;14:1306–1314. doi: 10.2215/cjn.00360119.
3. Nadkarni GN, Gottesman O, Linneman JG, Chase H, Berg RL, Farouk S, et al. Development and validation of an electronic phenotyping algorithm for chronic kidney disease. AMIA Annu Symp Proc 2014;2014:907–916.
4. Shang N, Liu C, Rasmussen LV, Ta CN, Caroll RJ, Benoit B, et al. Making work visible for electronic phenotype implementation: lessons learned from the eMERGE network. J Biomed Inform 2019;99:103293. doi: 10.1016/j.jbi.2019.103293.
5. Wang H, Yang L, Wang F, Zhang L. Strategies and cost-effectiveness evaluation of persistent albuminuria screening among high-risk population of chronic kidney disease. BMC Nephrol 2017;18:135. doi: 10.1186/s12882-017-0538-1.
6. Kidney Disease: Improving Global Outcomes (KDIGO) CKD Work Group. Kidney Disease: Improving Global Outcomes- CKD Evaluation and Management. 2012. Available from: https://kdigo.org/guidelines/ckd-evaluation-and-management/. [Accessed in 2012].
7. Zhang L, Wang F, Wang L, Wang W, Liu B, Liu J, et al. Prevalence of chronic kidney disease in China: a cross-sectional survey. Lancet 2012;379:815–822. doi: 10.1016/s0140-6736(12)60033-6.

Copyright © 2023 The Chinese Medical Association, produced by Wolters Kluwer, Inc. under the CC-BY-NC-ND license.