Inflammatory Bowel Diseases

June 2013 - Volume 19 - Issue 7
doi: 10.1097/MIB.0b013e31828133fd
Original Clinical Article

Improving Case Definition of Crohn's Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing: A Novel Informatics Approach

Ananthakrishnan, Ashwin N. MD, MPH1,2; Cai, Tianxi PhD3; Savova, Guergana PhD4; Cheng, Su-Chun ScD2; Chen, Pei4; Perez, Raul Guzman BS5; Gainer, Vivian S. MS5; Murphy, Shawn N. MD, PhD5,6; Szolovits, Peter PhD7; Xia, Zongqi MD, PhD2,8; Shaw, Stanley MD, PhD2,9; Churchill, Susanne PhD10; Karlson, Elizabeth W. MD2,11; Kohane, Isaac MD, PhD2,4,10; Plenge, Robert M. MD, PhD2,11; Liao, Katherine P. MD, MPH2,11

Author Information

1Gastrointestinal Unit, Massachusetts General Hospital, Boston, Massachusetts;

2Harvard Medical School, Boston, Massachusetts;

3Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts;

4Children's Hospital Boston, Boston, Massachusetts;

5Research Computing, Partners HealthCare, Charlestown, Massachusetts;

6Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts;

7Massachusetts Institute of Technology, Cambridge, Massachusetts;

8Department of Neurology, Brigham and Women's Hospital, Boston, Massachusetts;

9Division of Cardiology, Massachusetts General Hospital, Boston, Massachusetts;

10i2b2 National Center for Biomedical Computing, Brigham and Women's Hospital, Boston, Massachusetts; and

11Division of Rheumatology, Brigham and Women’s Hospital, Boston, Massachusetts.

Reprints: Ashwin N. Ananthakrishnan, MD, MPH, Crohn’s and Colitis Center, Massachusetts General Hospital, 165 Cambridge Street, 9th Floor, Boston, MA 02114 (e-mail:

Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal's Web site (

The study was supported by NIH U54-LM008748. A. N. Ananthakrishnan is supported by funding from the American Gastroenterological Association and NIH K23 DK097142. K. P. Liao is supported by NIH K08 AR060257 and the Katherine Swan Ginsburg Fund. R. M. Plenge is supported by grants from the US National Institutes of Health (NIH) (R01-AR056768, U01-GM092691, and R01-AR059648) and holds a Career Award for Medical Scientists from the Burroughs Wellcome Fund.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

The authors have no conflicts of interest to disclose.

The work was presented in part at the Digestive Disease Week 2012, May 19–22, San Diego, CA.

Received September 06, 2012

Accepted September 07, 2012

Background: Previous studies identifying patients with inflammatory bowel disease using administrative codes have yielded inconsistent results. Our objective was to develop a robust electronic medical record–based model for classification of inflammatory bowel disease leveraging the combination of codified data and information from clinical text notes using natural language processing.

Methods: Using the electronic medical records of 2 large academic centers, we created data marts for Crohn’s disease (CD) and ulcerative colitis (UC) comprising patients with ≥1 International Classification of Diseases, 9th edition, code for each disease. We used codified (i.e., International Classification of Diseases, 9th edition codes, electronic prescriptions) and narrative data from clinical notes to develop our classification model. Model development and validation were performed in a training set of 600 randomly selected patients for each disease with medical record review as the gold standard. Logistic regression with the adaptive LASSO penalty was used to select informative variables.

Results: We confirmed 399 CD cases (67%) in the CD training set and 378 UC cases (63%) in the UC training set. For both, a combined model including narrative and codified data had better accuracy (area under the curve for CD 0.95; UC 0.94) than models using only disease International Classification of Diseases, 9th edition codes (area under the curve 0.89 for CD; 0.86 for UC). Addition of natural language processing narrative terms to our final model resulted in classification of 6% to 12% more subjects with the same accuracy.

Conclusions: Inclusion of narrative concepts identified using natural language processing improves the accuracy of electronic medical records case definition for CD and UC while simultaneously identifying more subjects compared with models using codified data alone.

Electronic medical records (EMR) are increasingly being used in clinical practice and research, allowing efficient development of cohorts, ascertainment of outcomes, and opportunities for translational research when linked to biospecimen repositories.1–8 However, optimal use of EMR data requires accurate definition of diseases and outcomes, a major challenge for researchers. Thus far, a majority of models to define disease have relied solely on administrative billing codes. For example, a diagnosis of inflammatory bowel disease (IBD) using administrative data sets was defined by the presence of a single billing code for Crohn’s disease (CD) or ulcerative colitis (UC), multiple billing codes, or a combination of billing codes, procedures, and medications.9–12 However, the accuracy of these algorithms varies widely, between 75% and 97%, and is limited by variations in coding practices, by the fact that billing codes are often assigned by administrative nonclinical staff not directly involved in patient care, and by incomplete medical histories. Importantly, there are additional data in the EMR that do not have billing codes (e.g., endoscopic, pathologic, or radiologic findings) and that provide important additional information for the accuracy of disease definition. However, these data are often embedded within narrative text reports that typically require laborious manual medical record review for extraction.

Natural language processing (NLP) is a range of computational techniques for analyzing and representing naturally occurring written or oral texts for the purpose of achieving human-like language processing for a range of tasks or applications.13 One such application is within the EMR where NLP has been used to define medication use, adverse events, complications, or response to treatment.6,7,14–20 NLP has also been applied to aid in development of disease cohorts including previous work from our group7 that demonstrated that this informatics-based approach was accurate, improved sensitivity, and was portable to other institutions with distinct EMR systems.1

The goals of this study were to (1) develop and validate an algorithm for definition of CD and UC within a multiinstitutional EMR; (2) compare the performance of case-definition models using codified data alone to those incorporating narrative free text extracted using NLP, in particular focusing on the added contribution of NLP to improving sensitivity and accuracy; and (3) demonstrate disease associations that can uniquely be identified using NLP data.

Data Source

We studied EMR data from Massachusetts General Hospital and Brigham and Women’s Hospital, both tertiary referral hospitals serving over 3 million patients in the Boston metropolitan area. The EMR (Partners Longitudinal Medical Record) has been in existence at Massachusetts General Hospital since October 1, 1994 and at Brigham and Women’s Hospital since October 3, 1996. We first created 2 data sets of all potential patients with IBD: the “CD mart,” which comprised all patients with at least 1 International Classification of Diseases, 9th edition (ICD-9) code for CD (555.x, n = 14,288), and the “UC mart,” which comprised all patients with at least 1 ICD-9 diagnosis code for UC (556.x, n = 14,335) (Fig. 1). The ICD-9 codes for outpatient encounters, inpatient stays, and procedures are embedded within the data set.
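The mart construction described above can be sketched in a few lines. This is only an illustration, not the authors' pipeline: the `(patient_id, icd9_code)` row layout and the function name `build_marts` are assumptions made for the example.

```python
def build_marts(diagnosis_rows):
    """Group patient IDs into CD and UC marts by ICD-9 code prefix.

    diagnosis_rows: iterable of (patient_id, icd9_code) pairs.
    This row layout is hypothetical; the source does not describe its schema.
    """
    marts = {"CD": set(), "UC": set()}
    for patient_id, code in diagnosis_rows:
        code = code.strip()
        if code == "555" or code.startswith("555."):
            marts["CD"].add(patient_id)  # at least 1 code for Crohn's disease
        elif code == "556" or code.startswith("556."):
            marts["UC"].add(patient_id)  # at least 1 code for ulcerative colitis
    return marts


rows = [
    ("p1", "555.9"),
    ("p1", "556.9"),
    ("p2", "556.1"),
    ("p3", "564.1"),
    ("p4", "555.0"),
]
marts = build_marts(rows)
```

A patient with codes for both diseases (here `p1`) lands in both marts, which is consistent with the later finding that some patients in the CD training set turned out to have UC and vice versa.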

Figure 1

Selection of Codified Variables

For each subject in the CD and UC marts, we identified the total number of ICD-9 codes for CD or UC. In addition, we identified the number of such codes that were assigned to an inpatient hospitalization, gastroenterologist visit, or associated with an endoscopic procedure. We also identified the number of codes for competing diagnoses with a similar clinical presentation (irritable bowel syndrome, ischemic colitis, diverticulitis), CD-related or UC-related complications (intestinal fistulae, strictures, perianal fistulae, or abscesses), or surgeries (small or large intestinal resection, perirectal surgery) (see Table, Supplemental Digital Content 1). Finally, we included whether a patient was prescribed or listed as being on a CD-related or UC-related medication by their physician using the EMR electronic prescription program at any point in their follow-up. These medications include 5-aminosalicylates (mesalamine, sulfasalazine, balsalazide), corticosteroids (prednisone, hydrocortisone, budesonide), immunomodulators (azathioprine, 6-mercaptopurine, methotrexate), and antitumor necrosis factor α therapies (infliximab, adalimumab).
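As a rough illustration of how such per-patient codified variables might be tallied, the sketch below counts disease codes overall and by care setting, counts competing-diagnosis codes, and records a medication flag. The encounter dictionary layout, the setting labels, and the function name are assumptions for this example; the set of competing codes is passed in by the caller rather than taken from the source.

```python
SETTINGS = ("inpatient", "gi_visit", "endoscopy")  # hypothetical setting labels


def codified_features(encounters, disease_prefix, competing_codes, on_medication):
    """Tally codified variables for one patient.

    encounters: list of dicts with keys "icd9" and "setting" (assumed layout).
    disease_prefix: e.g. "555." for CD or "556." for UC.
    competing_codes: set of ICD-9 codes treated as competing diagnoses.
    on_medication: True if any disease-related prescription is on record.
    """
    features = {
        "n_disease_codes": 0,
        "n_competing_codes": 0,
        "on_medication": int(on_medication),
    }
    for setting in SETTINGS:
        features["n_disease_" + setting] = 0

    for enc in encounters:
        code = enc["icd9"]
        if code.startswith(disease_prefix):
            features["n_disease_codes"] += 1
            if enc.get("setting") in SETTINGS:
                features["n_disease_" + enc["setting"]] += 1
        elif code in competing_codes:
            features["n_competing_codes"] += 1
    return features


encounters = [
    {"icd9": "555.9", "setting": "gi_visit"},
    {"icd9": "555.1", "setting": "inpatient"},
    {"icd9": "564.1", "setting": "outpatient"},  # irritable bowel syndrome
    {"icd9": "555.9", "setting": "outpatient"},
]
features = codified_features(encounters, "555.", {"564.1"}, True)
```

Complication and surgery codes could be counted the same way with additional code sets.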

Narrative Terms and NLP Analysis

We used 6 different types of notes as sources for our narrative terms: outpatient notes, discharge summaries, operative notes, radiology, endoscopy, and pathology reports. Routine inpatient progress notes are not available in electronic format and were not included. In addition to PDF format, endoscopy reports are available in text format and could be processed for NLP analysis. We processed the notes using the clinical Text Analysis and Knowledge Extraction System (cTAKES),21 which processes clinical text notes and identifies when a term is mentioned in the text, along with qualifying attributes (i.e., negated, nonnegated, current, history of, family history of). We created an expert-defined list of terms we considered relevant to identifying subjects with IBD. The terms were then mapped to the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT), a hierarchically organized clinical health care terminology index with over 300,000 concepts, to allow for variations in language use, or to RxNorm, a normalized naming system for generic and branded drugs.

The selection of the relevant terms followed a structure similar to our codified data. First, we counted the number of times the terms “Crohn’s disease/Crohn disease” or “ulcerative colitis” were mentioned in the narrative notes. Other terms extracted included those that were relatively specific for CD (ileitis), UC (proctosigmoiditis), common to both diagnoses, disease-related complications (“perianal abscess”), and surgeries (ileocecal resection). We categorized each as CD specific, UC specific, or common across both diseases. We also included the number of times the terms were mentioned in the clinical texts for each subject, for each of the potential competing diagnoses, supportive endoscopy, pathology, and radiology findings. For example, the colonoscopic findings that could support a diagnosis of IBD included “aphthous ulcer,” “friable mucosa,” or “loss of vascularity,” whereas pathology findings included “chronic active colitis” and “ileitis” and radiology findings included “bowel wall thickening” and “wall enhancement.”
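To make the idea of counting nonnegated term mentions concrete, here is a deliberately naive sketch. It is not cTAKES and does not use its API: it splits a note into sentences with a regular expression and treats a term as negated when a small, assumed list of cue words appears earlier in the same sentence. The cue list and the function name are illustrative assumptions only.

```python
import re

# Naive negation cues (an assumption for this sketch); a system such as
# cTAKES relies on far richer rules and terminology mapping.
NEGATION = re.compile(r"\b(?:no|not|without|denies|negative for)\b")


def count_mentions(note_text, terms):
    """Count affirmed and negated mentions of each term in one note."""
    counts = {term: {"affirmed": 0, "negated": 0} for term in terms}
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", note_text.lower())
    for sentence in sentences:
        for term in terms:
            for match in re.finditer(re.escape(term.lower()), sentence):
                preceding = sentence[:match.start()]
                key = "negated" if NEGATION.search(preceding) else "affirmed"
                counts[term][key] += 1
    return counts


note = (
    "Colonoscopy showed friable mucosa consistent with ulcerative colitis. "
    "No evidence of Crohn's disease. "
    "Ulcerative colitis flare treated with mesalamine."
)
counts = count_mentions(note, ["ulcerative colitis", "crohn's disease"])
```

In this toy note, "ulcerative colitis" is counted twice as affirmed, while the single mention of "Crohn's disease" is counted as negated and would not contribute to a CD mention count.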

To examine the accuracy of NLP in identifying the terms, 100 random sentences were selected for each of the main concepts of interest (“Crohn’s disease,” “ulcerative colitis,” medications). cTAKES was considered to have identified a term accurately if the extracted sentence contained a mention of the disease or medication. For medications, a mention was counted as accurate whether the medication was currently being taken, had been taken in the past, was being considered for initiation, or was being temporarily held. Negative mentions of the disease or medication terms (e.g., “no evidence of Crohn’s disease”) were considered accurate only if cTAKES correctly identified that the term was negated. The precision of term identification by NLP was defined as the number of sentences in which the NLP output was confirmed by physician review divided by the total number of sentences identified by NLP. Overall precision of NLP was high: CD 100%; UC 98%; anti-TNF agents 98%; corticosteroids 97%; and immunomodulators 96%.

Development of the Classification Algorithm

A training set of 600 patients was selected at random from the CD mart and another 600 were selected at random from the UC mart (Fig. 1). A board-certified gastroenterologist (ANA) reviewed the EMR of all patients and classified them as having CD, UC, or not having IBD. CD or UC was diagnosed based on the presence of typical symptoms, chronicity of presentation, and supportive endoscopic, histologic, or radiologic findings.22–24 However, where primary data pertaining to the diagnosis were not available within our EMR, we considered CD or UC as being present based on consistent mention within the medical record and use of an appropriate CD-related or UC-related medication without the presence of an alternate indication for that treatment. For patients who may have had presurgical UC and developed Crohn’s of the J-pouch after their surgery, or for those with IBD-unspecified, the IBD diagnosis type presumed during the majority of their clinical encounters was assigned. A penalized logistic regression with the adaptive LASSO procedure25 was used to select the informative variables for our final predictive model. The tuning parameter for the penalized regression was selected based on the Bayesian Information Criterion.26 We constructed 4 separate models to predict a diagnosis of CD or UC in our EMR cohort: (1) model using number of CD or UC ICD-9 codes alone (ICD-9 model); (2) model comprising all codified variables including disease complications (codified model); (3) model including narrative terms identified through NLP only (NLP model); and (4) a combined model including both codified and NLP variables (combined model). The regression model assigned each patient a probability of truly having a diagnosis of CD or UC on a continuous scale.
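For readers unfamiliar with the procedure, the adaptive LASSO for logistic regression and the BIC used to tune it can be written in their standard textbook form as below. The symbols are introduced here for illustration: y_i is the chart-review label for patient i, x_i the vector of candidate codified and NLP variables, \tilde{\beta} an initial coefficient estimate, and \gamma > 0 a weight exponent. The source does not report the initial estimator, the value of \gamma, or how degrees of freedom were counted, so those details are assumptions of this sketch.

```latex
% Adaptive LASSO logistic regression: weighted L1 penalty on the coefficients
\hat{\beta}_{\lambda}
  = \arg\min_{\beta_0,\,\beta}\;
    -\sum_{i=1}^{n}\Big[\, y_i\,\eta_i - \log\!\big(1 + e^{\eta_i}\big) \Big]
    + \lambda \sum_{j=1}^{p} w_j\,\lvert \beta_j \rvert,
\qquad
\eta_i = \beta_0 + x_i^{\top}\beta,
\qquad
w_j = \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}}

% Tuning parameter chosen by minimizing the Bayesian Information Criterion,
% with df_lambda commonly taken as the number of nonzero coefficients
\mathrm{BIC}(\lambda)
  = -2\,\ell\big(\hat{\beta}_{\lambda}\big) + \log(n)\,\mathrm{df}_{\lambda}

% Predicted probability of disease for patient i
\hat{p}_i = \frac{1}{1 + e^{-\hat{\eta}_i}}
```

Because each weight w_j grows as the initial estimate of \beta_j shrinks, weak predictors are penalized more heavily and tend to be dropped, which is how the procedure selects informative variables.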

The accuracy of the models at various specificity levels was calculated nonparametrically27 and the overall prediction performance of each model evaluated based on the area under the receiver operating characteristic curve (AUC). To correct for overfitting bias, the 0.632 bootstrap28 was used to estimate these accuracy measures. The standard error estimates were obtained through the bootstrap procedure with 1000 replicates. For all models, we selected a probability threshold corresponding to a specificity of 97% and classified patients with probability exceeding the threshold value as truly having the disease within the data mart. The accuracy of our classification rule was validated by reviewing the medical records of 100 additional patients each predicted by the final combined model to have CD or UC. Finally, we compared the performance of our final combined model to other published algorithms for defining CD or UC in an EMR cohort.
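The nonparametric AUC and the choice of a probability cut-off at a target specificity can be illustrated with a small, dependency-free sketch. The function names and the toy scores are assumptions for this example, and the sketch reports apparent (uncorrected) estimates only; it does not implement the 0.632 bootstrap correction described above.

```python
import math


def auc(scores, labels):
    """Nonparametric AUC: share of (case, non-case) pairs in which the case
    has the higher score, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_at_specificity(scores, labels, target=0.97):
    """Smallest observed non-case score such that at least `target` of
    non-cases score at or below it. Patients are classified as cases when
    their score is strictly greater than the returned threshold."""
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    k = math.ceil(target * len(neg) - 1e-9)  # non-cases required at or below cut
    return neg[k - 1]


def sensitivity(scores, labels, threshold):
    """Share of true cases whose score exceeds the threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s > threshold for s in pos) / len(pos)


# Toy data: predicted probabilities and chart-review labels (1 = disease)
scores = [0.95, 0.90, 0.80, 0.60, 0.70, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model_auc = auc(scores, labels)
cut = threshold_at_specificity(scores, labels, target=0.75)
sens = sensitivity(scores, labels, cut)
```

With only four non-cases in the toy data, a 75% target is used so the cut-off falls on an observed score; with a full training set the same function can be called with `target=0.97`.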

Histologic Disease Activity and Risk of Surgery

Findings such as histologic evidence of active disease are not available in routine nonresearch clinical data sets or administrative data and require laborious manual review of EMR for extraction. To further explore the utility of NLP in research, we examined the association between presence of histologic activity identified through narrative text extraction by NLP and risk of surgery in CD and UC. Relevant IBD-related surgeries were identified through ICD-9 codes as in the previous studies.29,30 Patients with ≥1 ICD-9 for an IBD-related surgery were classified as having the outcome of surgery. In this exploratory analysis, for each patient, we summed the number of NLP-identified mentions of “cryptitis,” “crypt abscesses,” “chronic inflammation,” “chronic active colitis,” or “enteritis” to estimate cumulative burden of histologic disease activity and divided patients into 4 strata based on the distribution of data—0 mentions, 1 to 2 mentions (tertile 1), 2 to 6 mentions (tertile 2), and >6 mentions (tertile 3). Logistic regression models adjusting for age, duration of follow-up, and intensity of health care utilization (number of facts) were used to examine the association between tertiles of cumulative burden of histologic disease activity and undergoing surgery during follow-up. Number of facts refers to number of distinct encounters with the medical system and is a marker of health care utilization. For example, an office visit, a laboratory test, and a colonoscopy each contribute 1 fact. The study was approved by the Institutional Review Board of Partners Healthcare.
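The burden score described above can be sketched as a sum of NLP mention counts followed by binning. The function name and the input layout (a dictionary of term to mention count per patient) are assumptions. Note that the ranges as printed overlap at 2 mentions ("1 to 2" and "2 to 6"); this sketch assumes cut-points of 1 to 2, 3 to 6, and more than 6, which may not match the authors' exact boundaries.

```python
HISTOLOGY_TERMS = (
    "cryptitis",
    "crypt abscesses",
    "chronic inflammation",
    "chronic active colitis",
    "enteritis",
)


def histologic_stratum(mention_counts):
    """Map a patient's NLP mention counts to a burden stratum.

    Returns 0 for no mentions and 1, 2, or 3 for increasing tertiles.
    Cut-points (1-2, 3-6, >6) are an assumption of this sketch.
    """
    total = sum(mention_counts.get(term, 0) for term in HISTOLOGY_TERMS)
    if total == 0:
        return 0
    if total <= 2:
        return 1
    if total <= 6:
        return 2
    return 3


strata = [
    histologic_stratum({}),
    histologic_stratum({"cryptitis": 1, "enteritis": 1}),
    histologic_stratum({"cryptitis": 3, "chronic active colitis": 2}),
    histologic_stratum({"cryptitis": 7, "ileitis": 4}),  # "ileitis" is ignored
]
```

The resulting stratum would then enter the logistic regression as a categorical exposure alongside age, duration of follow-up, and number of facts.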

Training Set Characteristics

The CD training set consisted of 600 patients with ≥1 ICD-9 code for CD (Fig. 1): 399 patients (66.5%) were confirmed to have CD, 66 had UC (11.0%), and the remaining 135 did not have IBD (Table 1). The mean number of ICD-9 codes for CD was greater in those with confirmed CD (34.7 ± 2.8) compared with those with UC (6.1 ± 1.5, P < 0.001) or without IBD (1.7 ± 0.1, P < 0.001). Patients with confirmed CD also had a greater number of narrative mentions of CD compared with UC or non-IBD patients.

Table 1

Among the UC training set of 600 patients with ≥1 ICD-9 code for UC, 378 (63%) were confirmed on chart review to have UC, 72 to have CD (12%), and 150 did not have IBD (25%). Those with confirmed UC had a greater number of total ICD-9 codes for UC (23.0 ± 1.6) than those with CD (8.1 ± 2.8) or non-IBD controls (1.8 ± 0.2) (P < 0.001) (Table 2) and a greater number of narrative mentions of UC. Thus, the positive predictive value (PPV) of a single ICD-9 code for CD or UC in the training sets was only 66.5% and 63%, respectively.

Table 2

Tables 3 and 4 present the frequency of various codified terms and corresponding NLP narrative mentions within the training sets, grouped by diagnoses of CD, UC, or non-IBD assigned by chart review. We found that NLP-identified narrative terms provided more information regarding current or past use of medications than codified mentions (Fig. 2A, B). Fewer than one-fifth of the CD cohort had codified mentions of anti-TNF therapy; however, this proportion increased to 42% among those with narrative mentions of these agents. NLP was also useful in identifying supportive endoscopic and histologic features. A significantly greater proportion of those with CD or UC in both training sets had narrative mentions supportive of active inflammation on colonoscopy or histology than those classified as not having IBD.

Table 3
Table 4
Figure 2

Derivation of the Classification Algorithm

Figure 3A, B presents the variables that were selected for inclusion in our final models to define CD and UC, respectively, in order of magnitude of the regression coefficients. The strongest variables for the prediction of CD were the number of ICD-9 codes for CD and the number of NLP mentions of CD. Other informative variables include IBD-related complications and medications. The presence of competing diagnosis codes including that for UC and NLP mentions for irritable bowel syndrome were negative predictors of CD.

Figure 3

The number of NLP mentions for colon resection, presence of supportive findings on pathology, and UC were most predictive of UC diagnosis, whereas ICD-9 codes or NLP-identified mentions of perianal disease (CD-related complication) and competing diagnoses were negative predictors of UC (Fig. 3B).

Performance and Validation of the Algorithm

The combined model incorporating both narrative and codified data had greater accuracy for identification of CD (AUC 0.95, 95% CI 0.93–0.97) than a model that contained only ICD-9 billing codes for CD (AUC 0.89, 95% CI 0.87–0.92) (see Fig., Supplemental Digital Content 2). Similarly for UC, the combined model had better accuracy (AUC 0.94, 95% CI 0.93–0.96) than the ICD-9 model alone (AUC 0.86, 95% CI 0.83–0.89) or a model containing ICD-9 codes and disease complications (codified model) (see Fig., Supplemental Digital Content 3).

The combined CD model classified 5502 patients with CD when applied to the CD mart, whereas the combined UC model classified 5519 patients with UC when applied to the UC mart. On review of medical records from an additional random set of 100 patients each predicted to have CD or UC by the combined model, 97 patients each were correctly identified as having this diagnosis through chart review resulting in a PPV of 97% for each algorithm.

Figure 4 demonstrates the proportion of patients in our IBD data mart who would be classified as having CD or UC at the same 97% specificity level from each of the 4 models: ICD-9, codified, NLP, and combined model. Addition of NLP to a model containing codified data alone improved the sensitivity, and therefore, the proportion of EMR patients classified as having CD or UC. The improved sensitivity resulted in an additional 851 patients with CD and 1887 patients with UC who could be classified as truly having CD or UC when compared with the ICD-9 model and 325 to 584 patients over the next best-performing model. Compared with the previously published algorithms, our present algorithm demonstrated significantly improved specificity and PPV as well without a decrease in sensitivity (Table 5).

Figure 4
Table 5

Association between Histologic Activity and Surgery

Among the cohort of patients with UC, patients in the second and third tertiles of histologic activity had significantly greater odds of surgery with odds ratios of 3.20 (95% CI 2.43–4.21) and 6.37 (95% CI 5.02–8.07), respectively, compared with those with no mentions of histologic activity. Similarly, increasing tertiles of histologic activity for CD were also associated with CD-related surgery with adjusted odds ratio of 1.41 (95% CI 1.05–1.89), 2.07 (95% CI 1.64–2.62), and 2.83 (95% CI 2.31–3.47), respectively.

We demonstrate that a CD or UC classification model incorporating clinical data extracted using NLP from narrative text has improved accuracy for the identification of patients with CD or UC in the EMR over a model using CD or UC billing codes alone. The addition of NLP-derived variables increased accuracy of CD identification by 6% compared with an algorithm containing billing codes alone (ICD-9 model) and resulted in a 7% to 15% increase in PPV compared with previously published algorithms.9–12 Importantly, addition of NLP resulted in classifying a significantly greater proportion of patients in our EMR cohort as truly having the disease without loss of specificity and accuracy, increasing the size of our IBD cohorts by approximately 6% to 12%. Finally, we also demonstrate that NLP has the ability to contribute valuable clinical information not available in codified data; we observed an association between increasing NLP-derived mentions of histologic disease and surgical outcome in patients with CD and UC.

Accurate definition of diseases and outcomes is an important prerequisite in both administrative data and EMR-based research. Previous disease algorithms relying on ICD-9 codes have yielded PPV of 75% to 97%9–12 in the published literature. In our study, the presence of a single ICD-9 code for CD or UC yielded a PPV of only 65% to 70%. Even without narrative data, addition of codes for disease complications (codified model) improved the PPV over the ICD-9 model, suggesting that such approaches should be considered in administrative database research. In particular, given the significant (approximately 10%) misclassification of CD as UC or vice versa using only CD or UC ICD-9 codes, inclusion of disease-specific complications (perianal disease, intestinal fistulae) may help improve the specificity of such algorithms.

The myriad modifications required in different EMR databases to achieve comparable accuracy in defining the disease of interest highlight the challenges in using billing codes alone to classify disease. This approach is also vulnerable to errors. First, there is the possibility of inaccurate coding, as assignment of diagnosis codes is often performed by nonmedical providers not involved in direct patient care. Second, several diseases (e.g., primary sclerosing cholangitis) may lack distinct diagnosis codes.31 In addition, reliance exclusively on billing codes ignores the wealth of information available as narrative free text within the medical record. We demonstrate that adding data extracted through NLP to models containing only billing data improved the accuracy substantially.

NLP is a range of computational techniques for analyzing the most ubiquitous human product, namely, language.13 There are several benefits to incorporating NLP to analyze narrative text in EMR research. First, this allows for identifying not only disease terms but also supportive symptoms, laboratory tests, and investigations. Thus, accurately identifying the results of a specific investigation, rather than merely having a code for the test having been performed, allows for significantly greater confidence in assigning a subject as truly having the disease. Indeed, a recent approach using real-time NLP in the Veterans Affairs medical records revealed the ability to detect postoperative complications with superior sensitivity compared with patient safety indicators.3,6 Second, it contributes to increasing the confidence in disease diagnosis by ascertaining the presence of mentions for competing diagnoses that may mimic the disease in question. For example, colonoscopy revealing “erythema and ulceration in the splenic flexure watershed consistent with ischemic colitis” reduces the confidence in a diagnosis of UC though a common billing code may have been used. Third, NLP allows for identifying disease outcomes such as disease activity that are not available through billing codes. Fourth, addition of NLP to our case definition model resulted in a substantial increase in the size of our disease cohorts without compromising specificity. The utility of an EMR cohort for translational research relies on the ability to develop a sufficiently large cohort for genotype–phenotype studies while not compromising on the specificity and accuracy of identifying true cases. Importantly, the 6% to 12% increase in cohort size in our study while maintaining a high level of accuracy for classification could significantly improve statistical power in genotype–phenotype correlation studies using biological samples linked to EMR data.

As EMR data are increasingly being used for research, and in particular translational research aimed at examining genotype–phenotype relationships, developing disease cohorts of adequate size to allow for power for genetic analyses is important. At the same time, it is important to preserve specificity of disease definition to ensure accuracy of genetic analysis. We demonstrate that addition of NLP is a valuable tool by allowing for classification of a greater number of patients as having CD or UC without increasing the false-positive rate. Finally, we demonstrate that NLP can also be an invaluable source for mining the clinical narrative3 by defining an association between histologic disease activity, identified through narrative searches for terms indicating active bowel inflammation, and requirement for surgery in both CD and UC, confirming previous findings from small studies.32

There are limitations to our study. First, it was restricted to the EMR from a single healthcare system, Partners Healthcare, which uses a common electronic record. Further studies are required to examine the portability of our IBD algorithm to different EMR systems. Notably, our group has recently demonstrated that an algorithm to define patients with rheumatoid arthritis using a combination of codified and NLP data developed in the Partners EMR7 was portable to other institutions using distinct EMR systems.1 As an increasing number of institutions adopt EMRs, an approach using the wealth of free text narrative information available within the EMR offers significant opportunities for efficient, cost-effective research and collaboration. Second, as our health system is comprised of referral hospitals and is not a “closed system,” a portion of our patients may receive part of their care at other hospitals. Our use of narrative free text mentions in addition to the billing codes allows a greater ability to ascertain such outcomes from the text within the medical notes. However, we acknowledge that this may still leave us with missing information.

Our findings have several implications. To our knowledge, this is one of the first studies to use NLP in addition to billing codes to improve on the predictive value of case definition models in the EMR for IBD. Validation of such an approach has the potential to allow for efficient development of multicenter cohorts to examine disease outcomes. In particular, development of such multicenter cohorts will allow for the study of uncommon phenotypes2 and complications such as primary sclerosing cholangitis that require large numbers of subjects. Second, several health care systems including ours have developed tools to allow for linkage of discarded or consented blood specimens to such EMR data.2,33,34 This offers the exciting ability to define genotype–phenotype relationships for various outcomes.8 Our group has already demonstrated the feasibility of such an approach for a rheumatoid arthritis cohort,4 and we have begun collection and linkage of such biospecimens in our IBD cohort (457 unique plasma and buffy coat samples over 5 months). Once accrual of an adequate number of samples has occurred, our IBD cohort can be used to answer key and unique clinical questions that require narrative free text analysis and cannot be addressed using administrative data sets, such as genetic prediction of treatment response or treatment-related adverse events.

In conclusion, we demonstrate that incorporation of narrative free text data within the disease definition algorithm of an EMR cohort allows for superior accuracy and a higher PPV than algorithms using billing codes alone or previously published studies. This novel methodology offers considerable promise toward multiinstitution cohort development and efficient and cost-effective clinical and translational research.

1. Carroll RJ, Thompson WK, Eyler AE, et al. Portability of an algorithm to identify rheumatoid arthritis in electronic health records. J Am Med Inform Assoc. 2012;19:e162–e169.

2. Denny JC, Crawford DC, Ritchie MD, et al. Variants near FOXE1 are associated with hypothyroidism and other thyroid conditions: using electronic medical records for genome- and phenome-wide studies. Am J Hum Genet. 2011;89:529–542.

3. Jha AK. The promise of electronic records: around the corner or down the road? JAMA. 2011;306:880–881.

4. Kurreeman F, Liao K, Chibnik L, et al. Genetic basis of autoantibody positive and negative rheumatoid arthritis risk in a multi-ethnic cohort derived from electronic health records. Am J Hum Genet. 2011;88:57–69.

5. Love TJ, Cai T, Karlson EW. Validation of psoriatic arthritis diagnoses in electronic medical records using natural language processing. Semin Arthritis Rheum. 2011;40:413–420.

6. Murff HJ, FitzHenry F, Matheny ME, et al. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA. 2011;306:848–855.

7. Liao KP, Cai T, Gainer V, et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care Res (Hoboken). 2010;62:1120–1127.

8. Kohane IS. Using electronic health records to drive discovery in disease genomics. Nat Rev Genet. 2011;12:417–428.

9. Benchimol EI, Guttmann A, Griffiths AM, et al. Increasing incidence of paediatric inflammatory bowel disease in Ontario, Canada: evidence from health administrative data. Gut. 2009;58:1490–1497.

10. Bernstein CN, Blanchard JF, Rawsthorne P, et al. Epidemiology of Crohn's disease and ulcerative colitis in a central Canadian province: a population-based study. Am J Epidemiol. 1999;149:916–924.

11. Herrinton LJ, Liu L, Lafata JE, et al. Estimation of the period prevalence of inflammatory bowel disease among nine health plans using computerized diagnoses and outpatient pharmacy dispensings. Inflamm Bowel Dis. 2007;13:451–461.

12. Liu L, Allison JE, Herrinton LJ. Validity of computerized diagnoses, procedures, and drugs for inflammatory bowel disease in a northern California managed care organization. Pharmacoepidemiol Drug Saf. 2009;18:1086–1093.

13. Liddy ED, Turner AM, Bradley J. Modeling interventions to improve access to public health information. AMIA Annu Symp Proc. 2003:909.

14. Hripcsak G, Kuperman GJ, Friedman C. Extracting findings from narrative reports: software transferability and sources of physician disagreement. Methods Inf Med. 1998;37:1–7.

15. Melton GB, Hripcsak G. Automated detection of adverse events using natural language processing of discharge summaries. J Am Med Inform Assoc. 2005;12:448–457.

16. Meystre S, Haug PJ. Evaluation of medical problem extraction from electronic clinical documents using MetaMap Transfer (MMTx). Stud Health Technol Inform. 2005;116:823–828.

17. Savova GK, Olson JE, Murphy SP, et al.. Automated discovery of drug treatment patterns for endocrine therapy of breast cancer within an electronic medical record. J Am Med Inform Assoc. 2012;19:e83–e89.

18. Sohn S, Kocher JP, Chute CG, et al.. Drug side effect extraction from clinical narratives of psychiatry and psychology patients. J Am Med Inform Assoc. 2011;18 (suppl 1):i144–i149.

19. Xu H, Stenner SP, Doan S, et al.. MedEx: a medication information extraction system for clinical narratives. J Am Med Inform Assoc. 2010;17:19–24.

20. Zeng QT, Goryachev S, Weiss S, et al.. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak. 2006;6:30.

21. Savova GK, Masanz JJ, Ogren PV, et al.. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17:507–513.

22. Fonager K, Sorensen HT, Rasmussen SN, et al.. Assessment of the diagnoses of Crohn's disease and ulcerative colitis in a Danish hospital information system. Scand J Gastroenterol. 1996;31:154–159.

23. Loftus EV Jr, Silverstein MD, Sandborn WJ, et al.. Ulcerative colitis in Olmsted County, Minnesota, 1940–1993: incidence, prevalence, and survival. Gut. 2000;46:336–343.

24. Loftus EV Jr, Silverstein MD, Sandborn WJ, et al.. Crohn's disease in Olmsted County, Minnesota, 1940–1993: incidence, prevalence, and survival. Gastroenterology. 1998;114:1161–1168.

25. Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101:1418–1429.

26. Friedman J, Hastie T, Tibshirani R. The Elements of Statistical Learning. New York: Springer; 2001.

27. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford, United Kingdom: Oxford University Press; 2004.

28. Efron B, Tibshirani R. An Introduction to the Bootstrap. London: Chapman & Hall/CRC; 1993.

29. Ananthakrishnan AN, McGinley EL, Binion DG, et al.. A nationwide analysis of changes in severity and outcomes of inflammatory bowel disease hospitalizations. J Gastrointest Surg. 2011;15:267–276.

30. Bernstein CN, Nabalamba A. Hospitalization, surgery, and readmission rates of IBD in Canada: a population-based study. Am J Gastroenterol. 2006;101:110–118.

31. Molodecky NA, Myers RP, Barkema HW, et al.. Validity of administrative data for the diagnosis of primary sclerosing cholangitis: a population-based study. Liver Int. 2011;31:712–720.

32. Riley SA, Mani V, Goodman MJ, et al.. Microscopic activity in ulcerative colitis: what does it mean? Gut. 1991;32:174–178.

33. Kho AN, Pacheco JA, Peissig PL, et al.. Electronic medical records for genetic research: results of the eMERGE consortium. Sci Transl Med. 2011;3:79re1.

34. Schildcrout JS, Basford MA, Pulley JM, et al.. An analytical approach to characterize morbidity profile dissimilarity between distinct cohorts using electronic medical records. J Biomed Inform. 2010;43:914–923.


Keywords: Crohn’s disease; ulcerative colitis; disease cohort; natural language processing; informatics

© Crohn's & Colitis Foundation of America, Inc.

