Improving Case Definition of Crohn's Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing: A Novel Informatics Approach

Ananthakrishnan, Ashwin N. MD, MPH1,2; Cai, Tianxi PhD3; Savova, Guergana PhD4; Cheng, Su-Chun ScD2; Chen, Pei4; Perez, Raul Guzman BS5; Gainer, Vivian S. MS5; Murphy, Shawn N. MD, PhD5,6; Szolovits, Peter PhD7; Xia, Zongqi MD, PhD2,8; Shaw, Stanley MD, PhD2,9; Churchill, Susanne PhD10; Karlson, Elizabeth W. MD2,11; Kohane, Isaac MD, PhD2,4,10; Plenge, Robert M. MD, PhD2,11; Liao, Katherine P. MD, MPH2,11

doi: 10.1097/MIB.0b013e31828133fd
Original Clinical Article

Background: Previous studies identifying patients with inflammatory bowel disease using administrative codes have yielded inconsistent results. Our objective was to develop a robust electronic medical record–based model for classification of inflammatory bowel disease leveraging the combination of codified data and information from clinical text notes using natural language processing.

Methods: Using the electronic medical records of 2 large academic centers, we created data marts for Crohn’s disease (CD) and ulcerative colitis (UC) comprising patients with ≥1 International Classification of Diseases, 9th edition, code for each disease. We used codified (i.e., International Classification of Diseases, 9th edition codes, electronic prescriptions) and narrative data from clinical notes to develop our classification model. Model development and validation were performed in a training set of 600 randomly selected patients for each disease with medical record review as the gold standard. Logistic regression with the adaptive LASSO penalty was used to select informative variables.
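
The variable-selection step named above can be illustrated with a minimal sketch. This is not the authors' implementation: the solver (proximal gradient descent), the penalty strength `lam`, the exponent `gamma`, the small ridge term in the initial fit, and the simulated data are all assumptions chosen for illustration. The four simulated columns are hypothetical stand-ins for predictors such as code counts or narrative term mentions; only the first two carry signal.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrink toward zero, clip at zero.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_logistic(X, y, l1_weights, lam=0.0, ridge=0.0, step=0.5, iters=400):
    """Proximal gradient descent for penalized logistic regression.

    Minimizes mean log-loss + (ridge / 2) * sum(b_j ** 2)
    + lam * sum(l1_weights[j] * |b_j|). The intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(iters):
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            linear = intercept + sum(c * v for c, v in zip(coefs, row))
            err = sigmoid(linear) - label
            grad0 += err / n
            for j in range(p):
                grad[j] += err * row[j] / n
        intercept -= step * grad0
        for j in range(p):
            updated = coefs[j] - step * (grad[j] + ridge * coefs[j])
            coefs[j] = soft_threshold(updated, step * lam * l1_weights[j])
    return intercept, coefs

def adaptive_lasso_logistic(X, y, lam=0.05, gamma=1.0):
    p = len(X[0])
    # Step 1 (assumed choice): lightly ridge-penalized initial estimate.
    _, initial = fit_logistic(X, y, [0.0] * p, lam=0.0, ridge=0.01)
    # Step 2: adaptive weights, so weak initial coefficients are penalized more.
    weights = [1.0 / (abs(b) ** gamma + 1e-8) for b in initial]
    # Step 3: weighted L1 fit; coefficients driven to exactly 0.0 are dropped.
    return fit_logistic(X, y, weights, lam=lam)

def simulate(n=400, seed=0):
    # Synthetic data only: columns 0 and 1 are informative, 2 and 3 are noise.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(4)]
        prob = sigmoid(2.0 * row[0] - 2.0 * row[1])
        X.append(row)
        y.append(1 if rng.random() < prob else 0)
    return X, y

if __name__ == "__main__":
    X, y = simulate()
    intercept, coefs = adaptive_lasso_logistic(X, y)
    selected = [j for j, b in enumerate(coefs) if b != 0.0]
    print("selected columns:", selected)
    print("coefficients:", coefs)
```

In this sketch a variable counts as "selected" when its final coefficient is nonzero; the adaptive weights are what let strong predictors escape heavy shrinkage while weak ones are removed.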

Results: We confirmed 399 CD cases (67%) in the CD training set and 378 UC cases (63%) in the UC training set. For both, a combined model including narrative and codified data had better accuracy (area under the curve for CD 0.95; UC 0.94) than models using only disease International Classification of Diseases, 9th edition codes (area under the curve 0.89 for CD; 0.86 for UC). Addition of natural language processing narrative terms to our final model resulted in classification of 6% to 12% more subjects with the same accuracy.
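
For readers unfamiliar with the accuracy metric reported above, the following is a minimal sketch of how an area under the curve can be computed from model scores and chart-review labels. The function name `auc` and the toy inputs are illustrative assumptions; the study's own data and software are not reproduced here.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    Equals the probability that a randomly chosen case (label 1) receives a
    higher score than a randomly chosen non-case (label 0); ties count 0.5.
    """
    cases = [s for s, l in zip(scores, labels) if l == 1]
    non_cases = [s for s, l in zip(scores, labels) if l == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for d in non_cases:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))

if __name__ == "__main__":
    # Toy example: one of the four case/non-case pairs is ranked incorrectly.
    print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # prints 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from non-cases.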

Conclusions: Inclusion of narrative concepts identified using natural language processing improves the accuracy of electronic medical records case definition for CD and UC while simultaneously identifying more subjects compared with models using codified data alone.

Article first published online 4 April 2013

1Gastrointestinal Unit, Massachusetts General Hospital, Boston, Massachusetts;

2Harvard Medical School, Boston, Massachusetts;

3Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts;

4Children's Hospital Boston, Boston, Massachusetts;

5Research Computing, Partners HealthCare, Charlestown, Massachusetts;

6Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts;

7Massachusetts Institute of Technology, Cambridge, Massachusetts;

8Department of Neurology, Brigham and Women's Hospital, Boston, Massachusetts;

9Division of Cardiology, Massachusetts General Hospital, Boston, Massachusetts;

10i2b2 National Center for Biomedical Computing, Brigham and Women's Hospital, Boston, Massachusetts; and

11Division of Rheumatology, Brigham and Women’s Hospital, Boston, Massachusetts.

Reprints: Ashwin N. Ananthakrishnan, MD, MPH, Crohn’s and Colitis Center, Massachusetts General Hospital, 165 Cambridge Street, 9th Floor, Boston, MA 02114 (e-mail:

Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal's Web site (

The study was supported by NIH U54-LM008748. A. N. Ananthakrishnan is supported by funding from the American Gastroenterological Association and NIH K23 DK097142. K. P. Liao is supported by NIH K08 AR060257 and the Katherine Swan Ginsburg Fund. R. M. Plenge is supported by grants from the US National Institutes of Health (NIH) (R01-AR056768, U01-GM092691, and R01-AR059648) and holds a Career Award for Medical Scientists from the Burroughs Wellcome Fund.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

The authors have no conflicts of interest to disclose.

The work was presented in part at the Digestive Disease Week 2012, May 19–22, San Diego, CA.

Received September 06, 2012

Accepted September 07, 2012

© Crohn's & Colitis Foundation of America, Inc.