CIN: Computers, Informatics, Nursing:
doi: 10.1097/NCN.0b013e3181a91b58
Continuing Education

Exploring the Ability of Natural Language Processing to Extract Data From Nursing Narratives



Author Information

Author Affiliations: School of Nursing (Drs Hyun and Bakken); and Department of Biomedical Informatics (Drs Johnson and Bakken), Columbia University, New York.

This study was supported by 1R01LM07593 (S. Johnson, principal investigator) from the National Library of Medicine and P20NR007799 (S. Bakken, principal investigator) from the National Institute of Nursing Research.

Disclaimer: Authors declare no conflict of interest.

Corresponding author: Sookyung Hyun, RN, DNSc, 630 W 168th St, Mailbox 6, New York, NY 10032 (sh2138@columbia.edu).

Abstract


Natural language processing (NLP) offers an approach for capturing data from narratives and creating structured reports for further computer processing. We explored the ability of an NLP system, Medical Language Extraction and Encoding (MedLEE), to extract data from nursing narratives. MedLEE extracted 490 concepts from narrative text in a sample of 553 oncology nursing progress notes. The most frequently monitored and recorded signs and symptoms were related to chemotherapy care, such as adverse reactions, shortness of breath, nausea, pain, and bleeding. In terms of nursing interventions, chemotherapy, blood culture, medication, and blood transfusion were commonly recorded in free text. NLP may provide a feasible approach to extract data related to patient safety/quality measures and nursing outcomes by capturing nursing concepts that are not recorded through structured data entry. For better NLP performance in the domain of nursing, additional nursing terms and abbreviations must be added to MedLEE's lexicon.

Electronic health record (EHR) systems approach the issue of data capture from clinicians in one or both of two general ways. The first consists of using specific user interfaces for capturing data in a structured and possibly coded format. The second consists of collecting and storing clinical data in computer systems in free text format.

To use clinical data for additional purposes such as quality assurance, outcomes research, or public health reporting, the data must be structured and coded.1 However, using only structured, coded approaches for data entry may result in the loss of significant clinical information typically contained in narratives (free text data). For instance, the medication list may be captured only in structured format, but many of the important nuances of the medication regimen, the history of past regimens, patient reactions to the regimen, and adherence behavior, are typically captured in narrative formats.2 Studies of medical errors have often used retrospective chart review to measure adverse event rates. Although collecting data in this manner produces important clinical information, it is more costly, and manual chart review detects only documented adverse events.3 A nursing documentation system that has a combination of structured and unstructured (ie, free text) data entry assisted by natural language processing (NLP) may better support the acquisition and use of nursing data by placing data in a meaningful context based on the original text, allowing users the freedom of narrative documentation and ensuring that the free text data are available for reuse.

NLP offers an approach for capturing data from narratives and creating structured reports for further computer processing.4 Previous research has demonstrated that NLP is an applicable method for capturing clinical information from free text data.5 However, few studies have been conducted to determine whether NLP can be used to extract data from nursing narratives to support data reuse for purposes such as decision support, measurement of patient safety, and quality of care.

Nursing narratives contain descriptive information about the patient, specific interventions that have been completed, and patient responses to the interventions (eg, patient adherence or adverse drug event). For example, in one study, about 14% of adverse drug events were detected from electronic nursing free text data.6 Therefore, it is desirable to investigate NLP performance on nursing narratives. Medical Language Extraction and Encoding (MedLEE), an NLP system in use at New York Presbyterian Hospital, has been extensively evaluated with different types of clinical free text data, such as x-ray reports, discharge summaries, and sign-out notes.4 However, it has not been evaluated with nursing data.

The purpose of this study was to explore the ability of NLP to capture nursing concepts so that we could determine opportunities for supporting nursing documentation and data reuse. The specific research questions were the following:

* What types of data can be extracted by NLP from oncology nursing narratives?

* What terms of relevance to patient safety and quality of care measures can be extracted from oncology nursing narratives using NLP?

Background


Previous research has demonstrated that NLP is an effective method of accurately identifying and reusing data from clinical notes in several domains. Fiszman and Haug7 demonstrated that NLP systems supported real-time decision support for community-acquired pneumonia by extracting specific radiology findings. In another study, a lexically based NLP system showed promise as a method for detecting adverse events in outpatient visit notes.5

Originally, MedLEE was designed for decision support applications in the domain of chest x-ray reports. When it was compared with a reference standard obtained manually by an expert, MedLEE showed high accuracy, sensitivity, and specificity in extracting specific clinical information from discharge summaries and x-ray reports.8 Findings suspicious for breast cancer from mammogram reports9 and suspected tuberculosis patients from chest x-ray reports were effectively identified by MedLEE.10 In addition, MedLEE demonstrated better precision than did an ad hoc approach and acceptable recall for its intended use in ophthalmology visit notes.1

Nursing narratives are different from those of physicians. Nursing documentation is more like a picture that describes a patient's status illustratively, whereas physicians' documentation is more like a headline because of focus on problem-oriented summarization and abstraction.11 Nursing narratives describe aspects of the patient's condition that are not addressed in the flowsheet or other structured data, such as change in status, nursing interventions, and patient responses.12 Consequently, the ability of NLP to extract data from nursing narratives may differ from performance on reports and physicians' notes.

Little research on NLP has been published in the nursing domain. One study identified potential challenges associated with using NLP for HIV/AIDS clinic notes.2 A further study reported that HIV/AIDS ambulatory progress notes had a more diverse vocabulary and that their language structures differed from those of radiology reports.13 In that study, notable semantics were categorized to determine automated strategies for dealing with these issues. Hsieh et al14 demonstrated the potential for using NLP to automatically extract the linguistic meaning of the terms that patients use in their electronic mail messages. Bakken et al examined the applicability of NLP for nursing narratives by comparing the semantic categories of the International Organization for Standardization (ISO) reference terminology models for nursing diagnoses and nursing actions with the semantic categories of MedLEE.15 They indicated that the current semantic structure of MedLEE was not sufficient for processing verb-rich nursing narratives and proposed further research on NLP semantic categories for better processing of nursing narratives.16

Methods


Sample and Setting

The study setting was an oncology unit at an academic institution in New York City. The sample comprised nursing narratives (free text data) from a corpus of de-identified nursing progress notes written by oncology nurses using the Eclipsys Sunrise Critical Care documentation system between April 1 and July 31, 2006. The study was approved by the Institutional Review Board.

Procedures

The study procedures included preprocessing, NLP, and analysis (Figure 1). The analysis focused on two areas: (1) extraction of data in general and (2) extraction of data for quality and safety purposes.

Figure 1
Preprocessing

Perl (Perl Foundation, Grand Ledge, MI), a computer programming language that facilitates manipulation of a large volume of text data, was used to prepare nursing free text for NLP. The preprocessing involved (1) the addition of a colon (:) to identify the end of each section name and (2) the addition of a period (.) to identify the end of each sentence. This routine preprocessing was a required technical step for MedLEE processing.
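The two preprocessing steps can be sketched in a few lines. The study used Perl; the sketch below uses Python purely for illustration. The section names, the one-sentence-per-line input, and the function name `preprocess_note` are assumptions made for the example, not details reported in the study.

```python
import re

# Hypothetical section names; the actual note template is not described in the study.
SECTION_NAMES = {"Assessment", "Plan", "Interventions"}

def preprocess_note(lines):
    """Add ':' after section names and '.' after sentences lacking end punctuation."""
    out = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # drop blank lines
        if line.rstrip(":") in SECTION_NAMES:
            out.append(line.rstrip(":") + ":")  # (1) mark the end of a section name
        elif re.search(r"[.!?]$", line):
            out.append(line)                    # sentence is already terminated
        else:
            out.append(line + ".")              # (2) mark the end of a sentence
    return out
```

Real notes would likely need sentence segmentation before step (2); the sketch assumes that has already been done.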

Previous research demonstrated that words undefined by the NLP tool, such as abbreviations, symbols, or specialized vocabulary, might affect NLP performance in extracting terms because the words do not yet exist in its internal dictionary.2 For this reason, we ran the first NLP session specifically to collect the words in our data that MedLEE left undefined. The undefined words identified from the NLP output were mostly abbreviations (eg, D/C [discontinue], premed [premedicated]) and nonmedical words (eg, endorse). These words were undefined because they were not in MedLEE's current lexicon. We replaced the abbreviations with their full forms to reduce the number of undefined terms in MedLEE's parsing of the nursing narratives. After these replacements, we ran the second NLP session (see "Extraction of Data" section) and analyzed the output to examine which nursing concepts were extracted from the nursing narratives. Although our primary interest was the nursing concepts extracted by MedLEE, we were also interested in the abbreviations because they were commonly used in nursing notes. Thus, we collected the unknown abbreviations (Table 1).
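As an illustration of the replacement step, the following Python sketch expands abbreviations using a lookup table and whole-word matching. The expansions are taken from abbreviations reported in this article; the table contents, the function name `expand_abbreviations`, and the decision to skip ambiguous forms are assumptions for the example rather than the procedure actually used.

```python
import re

# Expansions drawn from abbreviations reported in this article.
# Ambiguous forms such as D/C (discharge vs discontinue) are deliberately
# left out, because a plain lookup cannot choose between their meanings.
EXPANSIONS = {
    "VSS": "vital signs stable",
    "NS": "normal saline",
    "plts": "platelets",
    "plt": "platelets",
    "cont": "continue",
}

def expand_abbreviations(text, expansions=EXPANSIONS):
    """Replace each whole-word abbreviation with its full form (case-sensitive)."""
    keys = sorted(expansions, key=len, reverse=True)  # try longer forms first
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: expansions[m.group(1)], text)
```

The lookarounds keep the pattern from firing inside longer words, so "contents" is left untouched while a standalone "cont" is expanded.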

Table 1
Extraction of Data

Concepts were extracted by MedLEE from the nursing narratives. The extracted terms were labeled by MedLEE with various tags, such as problem, device, procedure, bodymeas (body measure), labtest, status, med (medication), normalfinding (normal finding), timeper (time period), bodyfunc (body function), and finding. The extracted data were summarized using descriptive statistics.

Use for Quality/Safety

To determine what terms of relevance to patient safety and quality of care measures can be extracted by MedLEE from the nursing free text data, two measures (chemotherapy side effects and pain management) were selected by investigators by referring to the American Nurses Association's patient safety initiatives, National Quality Forum's quality measures, and a patient safety report.17

Gold standards were needed to compare against the terms extracted through NLP. Clinical practice guidelines (CPGs) from the National Guideline Clearinghouse's Web site18 were selected as a source of terms of relevance to the measures. Two guidelines, Assessment and Management of Chronic Pain and Assessment and Management of Acute Pain, published by the Institute for Clinical Systems Improvement,19,20 were used for pain management. For chemotherapy side effects, Chemotherapy and Biotherapy Guidelines and Recommendations for Practice, published by the Oncology Nursing Society,21 was used as the source of terms of relevance to the measures. These processes are summarized in Figure 2.

Figure 2

The CPGs consisted of several typical categories, such as Scope, Methodology, Recommendations, Evidence Supporting the Recommendations, Benefits/Harms of Implementing the Guideline Recommendations, and Contraindications. The Recommendations section was used as the source for terms of relevance. Six categories of acute pain management (footnote a) and 11 categories of chronic pain management (footnote b) were listed under "Recommendations." Each category had a one- or two-level hierarchy listed in either bulleted or paragraph form; these items comprised the terms of relevance to pain management in this study. For instance, acetaminophen, cold packs, and corticosteroids were components of Somatic Pain Treatment.

For the chemotherapy CPG, two sections under "Recommendations," Immediate Complications of Cytotoxic Therapy (eg, extravasation, hypersensitivity) and Side Effects of Cancer Therapy (eg, myelosuppression, gastrointestinal and mucosal side effects), were used to obtain terms of relevance to chemotherapy side effects for this study. For instance, neutropenia, anemia, and thrombocytopenia were obtained from myelosuppression (a complication) and nausea, vomiting, diarrhea, mucositis, and anorexia were obtained from gastrointestinal and mucosal side effects.

To examine whether the terms of relevance to pain management and chemotherapy side effects could be identified from the free text data through NLP, the MedLEE-tagged terms were compared with the gold standards. When a term was a drug name, the drug information database from MedlinePlus22 was used to look up synonyms of that drug for accurate matching. The results were summarized using descriptive statistics.
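The comparison can be illustrated with a small Python sketch. It assumes exact matching after case and whitespace normalization, a hand-built synonym table standing in for the manual MedlinePlus lookup, and a percentage computed over the gold-standard terms; all function names and the table contents are hypothetical.

```python
def normalize(term):
    """Lowercase a term and collapse internal whitespace."""
    return " ".join(term.lower().split())

# Hypothetical synonym table standing in for a manual MedlinePlus lookup.
DRUG_SYNONYMS = {"tylenol": "acetaminophen"}

def canonical(term, synonyms=DRUG_SYNONYMS):
    """Map a term to its canonical form (generic drug name where known)."""
    t = normalize(term)
    return synonyms.get(t, t)

def match_terms(gold_terms, extracted_terms):
    """Return the gold-standard terms that have an exact, synonym-aware match."""
    extracted = {canonical(t) for t in extracted_terms}
    gold = {canonical(t) for t in gold_terms}
    return sorted(gold & extracted)

def percent_matched(gold_terms, extracted_terms):
    """Percentage of gold-standard terms matched by at least one extracted term."""
    gold = {canonical(t) for t in gold_terms}
    if not gold:
        return 0.0
    return 100.0 * len(match_terms(gold_terms, extracted_terms)) / len(gold)
```

Because only exact matches count, a broader gold-standard concept and a narrower extracted concept are treated as a nonmatch in this sketch.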

Results


Extraction of Data

A total of 553 nursing progress notes associated with 22 patients were processed by MedLEE. Table 2 shows an example of sample text and its associated output from MedLEE. Words underlined in the output from NLP represent the terms captured with a tag.

Table 2

We classified the abbreviations collected from the first NLP session into three categories: (1) abbreviations on a predefined hospital abbreviation list but not currently in the MedLEE lexicon; (2) abbreviations not on the hospital abbreviation list but common in the nursing free text; and (3) abbreviations that commonly appeared in the nursing free text and could be interpreted with more than one meaning.

Examples of the first category were VSS (vital signs stable), NS (normal saline), BM (bowel movement), PICC (peripherally inserted central catheter), RN (registered nurse), FFP (fresh frozen plasma), OOB (out of bed), RUE (right upper extremity), D&C (dilatation and curettage), N&V (nausea and vomiting), and C&S (culture and sensitivity). Some examples of the second category were cont (continue), app (approximately), premed/premeds (premedicated), C+S (culture and sensitivity), and plt/plts (platelets). Examples of the last category were DL (direct laryngoscopy; double lumen), Cx (cervix; culture, meaning blood culture), n (nausea; night), IVF (in vitro fertilization; intravenous fluid), and D/C (discharge; discontinue).
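The three-way classification can be expressed as a simple decision rule. The Python sketch below assumes its input is limited to abbreviations that MedLEE left undefined, and it checks ambiguity first; the article does not state how overlapping cases were resolved, so that precedence, the function name, and the data structures are assumptions.

```python
def classify_abbreviation(abbr, hospital_list, meanings):
    """Return 1, 2, or 3 following the three categories described above.

    hospital_list: set of abbreviations on the hospital's predefined list
    meanings: dict mapping an abbreviation to the meanings observed for it
    """
    if len(meanings.get(abbr, [])) > 1:
        return 3  # can be interpreted with more than one meaning
    if abbr in hospital_list:
        return 1  # on the hospital list, missing from the MedLEE lexicon
    return 2      # not on the hospital list, but common in the notes

# Toy inputs built from examples in this article.
hospital_list = {"VSS", "NS", "BM"}
meanings = {
    "VSS": ["vital signs stable"],
    "cont": ["continue"],
    "IVF": ["in vitro fertilization", "intravenous fluid"],
}
categories = {a: classify_abbreviation(a, hospital_list, meanings) for a in meanings}
```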

After the second NLP session, 3929 terms were captured by MedLEE. Some terms appeared in more than one narrative; therefore, the total number of nonredundant terms was 490. Selected examples of terms with associated MedLEE tags are shown in Table 3.
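The distinction between the 3929 captured terms and the 490 nonredundant terms is the difference between counting occurrences and counting distinct values. A minimal Python sketch, assuming the NLP output has already been reduced to (term, tag) pairs (a hypothetical intermediate format, not MedLEE's actual output):

```python
from collections import Counter

def summarize_terms(occurrences):
    """Summarize extracted terms given one (term, tag) pair per occurrence."""
    occurrences = list(occurrences)
    term_counts = Counter(term for term, _ in occurrences)
    # Count each distinct (term, tag) pair once when tallying by tag.
    tag_counts = Counter(tag for _, tag in set(occurrences))
    return {
        "total": len(occurrences),          # every captured occurrence
        "nonredundant": len(term_counts),   # distinct terms only
        "unique_terms_per_tag": dict(tag_counts),
    }
```

The per-tag tally is the kind of descriptive summary that the tags listed under "Extraction of Data" make possible.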

Table 3

We made a list of undefined words so that the list could be added to the MedLEE lexicon for better NLP on nursing narratives in the future. Table 4 displays undefined words with frequencies equal to or greater than 10.

Table 4
Use for Quality/Safety

Fifty-five terms relevant to pain management (eg, tylenol, heat, massage) and 35 terms relevant to chemotherapy side effects (eg, nausea, constipation) were selected from the CPGs (Tables 5 and 6).

Table 5
Table 6

The terms extracted through NLP were compared with the CPG-based gold standards. In several cases, the concepts from the CPGs were broader than the concepts extracted through NLP, for instance, Cardiac Toxicity (a term from the CPGs) and Palpitations (a term extracted through NLP). These cases were regarded as a difference in the granularity of the terms and were considered a nonmatch. About 18% of the terms of relevance to pain management and 43% of the terms of relevance to chemotherapy side effects were matched by terms extracted using MedLEE (Tables 7 and 8).

Table 7
Table 8
Discussion


We could identify terms that were frequently documented in our sample. They were reaction, pain, nausea, adverse reaction, distress, emesis, shortness of breath, chill, discomfort, bleeding, swelling, febrile, headache, cough, and vomiting. Because the data were from the oncology nursing progress notes, these terms appear to be most frequently monitored and recorded by the oncology nurses in free text.

Terms that frequently appeared in nursing free text but currently not in the MedLEE lexicon included ordered, aware, premedicated, informed, sensitivity, flowsheet, and incident. For MedLEE to perform better on nursing free text data, these undefined words need to be added to the MedLEE lexicon.

The output of the first NLP session revealed many abbreviations frequently used in nursing narratives in oncology nursing progress notes. Some abbreviations truncated at the end of a word were commonly found in our sample, such as premed/premeds (premedicated) and cont (continue). Other abbreviations frequently found in the data but not on the hospital abbreviation list included C+S (culture and sensitivity) and plt/plts (platelets). Other abbreviations had more than one meaning, such as Cx (cervix; blood culture). Some abbreviations were on the hospital abbreviation list, indicating that they are legitimate abbreviations, but were absent from the MedLEE lexicon, such as VSS (vital signs stable). Although using abbreviations in patient records is a convenient and efficient way for nurses to document because many medical terms are very long, the use of abbreviations may be a detriment to patient safety. One of the Joint Commission on Accreditation of Healthcare Organizations (JCAHO) National Patient Safety Goals is to improve the effectiveness of communication among caregivers.23 To meet this goal, the hospital wishes to designate a list of abbreviations and acronyms that are not to be used because their use increases the risk of medical error. In addition, the use of any unauthorized abbreviations is not allowed by the JCAHO requirements. In this context, NLP may support the identification of unauthorized clinical abbreviations from free text in electronic records.

Nurses directly affect the lives of patients,24 and measuring the performance of care that patients receive is indispensable to evaluating the effectiveness of nursing care. In this study, we explored NLP as a new method for capturing quality/safety measures from nursing free text. Matching between terms of relevance from the gold standards and MedLEE output is a function of both MedLEE performance and the free text in the corpus of oncology nursing narratives. Terms that are not in the corpus cannot be found by MedLEE; therefore, the percent matching (ie, 18% matching for pain management, 43% matching for chemotherapy side effects) should not be viewed as a measure of MedLEE performance. Although further study is needed to validate the findings, the results from this exploratory study demonstrate that NLP has the potential to capture data related to safety/quality measures in oncology nursing notes. It may assist nurses in evaluating patient progress and determining which interventions are effective and which are ineffective. In addition, NLP may provide a method to assess nursing outcomes that are not captured through structured data entry.

Natural language processing may contribute significantly to the availability of nursing records in EHR systems for reuse in clinical decision support and patient outcomes research and, ultimately, the quality of care. Strategies are needed to handle special features of nursing records and vocabulary that are unique to the domain of nursing.

The generalizability of the findings of this study is limited because the free texts were oncology nursing narratives and only from one institution. Further research is needed to examine the use of NLP on nursing narratives from various nursing specialties and across institutions, which may vary in their documentation procedures.

Conclusion


We explored the ability of NLP to extract data from nursing narratives. NLP may provide a feasible approach to extract data related to patient safety/quality measures and nursing outcomes by capturing nursing concepts that are not documented through structured data entry. For better NLP performance in the domain of nursing, additional nursing terms and abbreviations must be added to MedLEE's lexicon.

Acknowledgments


We would like to thank Drs Carol Friedman and Leanne Currie for their assistance and support.

References


1. Barrows RC, Busuioc M, Friedman C. Limited parsing of notational text visit notes: ad-hoc vs. NLP approaches. Proc AMIA Symp. 2000:51-55.

2. Hyun S, Bakken S, Friedman C, Johnson S. Natural language processing challenges in HIV/AIDS clinic notes. AMIA Annu Symp Proc. 2003:872.

3. Jha A, Kuperman G, Teich J, et al. Identifying adverse drug events: development of a computer-based monitor and comparison with chart review and stimulated voluntary report. J Am Med Inform Assoc. 1998;5(3):305-314.

4. Friedman C. A broad coverage natural language processing system. Proc AMIA Symp. 2000:270-274.

5. Honigman BA. A computerized method for identifying incidents associated with adverse drug events in outpatients. Int J Med Inf. 2001;61(1):21-32.

6. Hurdle J, Weir C, Roth B, Hoffman J, Nebeker J. Critical gaps in the world's largest electronic medical record: ad hoc nursing narratives and invisible adverse drug events. AMIA Annu Symp Proc. 2003:309-312.

7. Fiszman M, Haug P. Using medical language processing to support real-time evaluation of pneumonia guideline. Proc AMIA Symp. 2000:235-239.

8. Friedman C, Knirsch CA, Shagina L, Hripcsak G. Automating a severity score guideline for community-acquired pneumonia employing medical language processing of discharge summaries. Proc AMIA Symp. 1999:256-260.

9. Jain NL, Friedman C. Identification of findings suspicious for breast cancer based on natural language processing of mammogram reports. Proc AMIA Symp. 1997:829-833.

10. Jain NL, Knirsch CA, Friedman C, Hripcsak G. Identification of suspected tuberculosis patients based on natural language processing of chest radiograph reports. J Am Med Inform Assoc. 1996;3(suppl):542-546.

11. Scalise D. Clinical communication and patient safety. Hosp Health Netw. 2006;80(8):50.

12. Scoates G, Fishman M, McAdam B. Health care focus documentation-more efficient charting. Nurs Manage. 1996;27(8):30-32.

13. Hyun S, Bakken S, Friedman C, Johnson S. Natural language processing challenges in HIV/AIDS clinic notes. In: Fieschi M, Coiera E, Li J, eds. Medinfo 2004. San Francisco, CA: Medinfo; 2004:1655.

14. Hsieh Y, Hardardottir G, Brennan P. Linguistic analysis: terms and phrases used by patients in e-mail messages to nurses. Medinfo. 2004:511-515.

15. Bakken S, Hyun S, Friedman C, Johnson S. A comparison of semantic categories of the ISO reference terminology models for nursing and the MedLEE natural language processing system. Medinfo. 2004;11:472-476.

16. Bakken S, Hyun S, Friedman C, Johnson S. ISO reference terminology models for nursing: applicability for natural language processing of nursing narratives. Int J Med Inform. 2005;74(7-8):615-622.

17. Committee on Data Standards for Patient Safety. Patient Safety: Achieving a New Standard for Care 2004. Washington, DC: National Academies Press; 2004.

18. The National Guideline Clearinghouse. 2006. http://www.guideline.gov/. Accessed September 25, 2006.

19. Institute for Clinical Systems Improvement. Assessment and management of chronic pain. 2005 [cited 2006]. http://www.guideline.gov/summary/summary.aspx?doc_id=8363&nbr=4684. Accessed September 25, 2006.

20. Institute for Clinical Systems Improvement. Assessment and management of acute pain. 2006 [cited 2006]. http://www.guideline.gov/summary/summary.aspx?doc_id=9009&nbr=004884&string=Assessment+and+%22management+of+acute+pain%22. Accessed September 25, 2006.

21. Oncology Nursing Society. Chemotherapy and biotherapy guidelines and recommendations for practice. 2nd ed. 2005 [cited 2006]. http://www.guideline.gov/summary/summary.aspx?doc_id=8337&nbr=004665&string=chemotherapy+AND+side+AND+effects. Accessed September 25, 2006.

22. US National Library of Medicine. MedlinePlus. 2006. http://www.nlm.nih.gov/medlineplus/druginformation.html. Accessed September 25, 2006.

23. Joint Commission. 2007 National Patient Safety Goals. 2006. http://www.jointcommission.org/PatientSafety/NationalPatientSafetyGoals/. Accessed September 25, 2006.

24. National Quality Forum. Nursing Care Quality at NQF 2006. http://qualityforum.org/nursing/. Accessed September 29, 2006.

a. (1) Somatic Pain Treatment, (2) Visceral Pain Treatment, (3) Neuropathic Pain Treatment, (4) Prevention/Intervention, (5) Pharmacological Therapy, and (6) Patient Controlled Analgesia (PCA).

b. (1) General Management; (2) Physical Rehabilitation and Psychosocial Management; (3) Psychosocial Management; (4) Pharmacologic Management; (5) Intervention Management; (6) Complementary Management; (7) Level I Treatment: Neuropathic Pain; (8) Level I Treatment: Muscle Pain; (9) Level I Treatment: Inflammatory Pain; (10) Level I Treatment: Mechanical/Compressive Pain; and (11) Level II Treatment: Interdisciplinary Team Referral, Plus a Pain Medicine Specialist or Pain Medicine Specialty Clinic.



Keywords: Natural language processing; Nursing free text; Oncology nursing narratives; Patient safety measure; Quality measure

© 2009 Lippincott Williams & Wilkins, Inc.


