Machine Learning Methods for Identifying Critical Data Elements in Nursing Documentation

Bose, Eliezer; Maganti, Sasank; Bowles, Kathryn H.; Brueshoff, Bonnie L.; Monsen, Karen A.

doi: 10.1097/NNR.0000000000000315


In public health nursing and maternal–child home visiting, a lengthy therapeutic relationship between a public health nurse (PHN) and a high-risk parent has proven to be an effective approach for improving life course trajectories for this population (Barnard, 1998; Eckenrode et al., 2010; Monsen et al., 2006). The length of public health nursing home care varies: it may be as short as a few visits to provide information and connections to resources in relatively simple situations or as long as several years to provide relationship-based therapeutic and educational interventions addressing highly complex needs (Monsen, Farri, McNaughton, & Savik, 2011). Although costly, this intervention strategy has been shown to save public dollars over time (Eckenrode et al., 2010). However, with decreasing resources and less time spent per client, it would be useful for PHNs to identify critical data elements during their home visits and thereby decrease documentation burden (Keenan, Yakel, Tschannen, & Mandeville, 2008). Machine learning techniques can aid in the retrospective investigation of critical data elements to determine which are important to the case and which are not.

Data reduction is the process of obtaining a reduced representation of the data set that is much smaller in volume but produces almost the same analytical results (Han, Kamber, & Pei, 2011). The main objective of this article is to introduce readers to two different machine learning techniques of feature reduction and illustrate their application using a large healthcare data set with the Omaha System.



This study used deidentified existing data and was approved by the university institutional review board as nonhuman subjects research. We randomly selected 756 clients from a large pool of maternal–child health clients who received public health nursing services between 2000 and 2009. Of the 756 clients, the majority were women (88%), White (62%), and not married (82%), with an average age of 19.2 years (SD = 10.4 years).

Data Source

The Omaha System is an interface terminology available in the electronic health record for documenting patient care that is recognized by the American Nurses Association and is a member of the Alliance for Nursing Informatics (Martin, Monsen, & Bowles, 2011). The Omaha System has three components: the Problem Classification Scheme, the Intervention Scheme, and the Problem Rating Scale for Outcomes. The Problem Classification Scheme is a taxonomy of 42 health concepts called problems that are organized under four domains: Environmental (4 problems), Psychosocial (12 problems), Physiological (18 problems), and Health-related behaviors (8 problems). Each problem has a set of unique signs/symptoms used in clinical assessments to further specify the problem. Signs/symptoms variables are binary (present/not present). The Intervention Scheme consists of four levels: problem, category, target, and care description. The first level (problem) consists of all of the problems in the Problem Classification Scheme. The second level (category) consists of four action terms: teaching, guidance, and counseling; treatments and procedures; case management; and surveillance. The third level (target) consists of 75 defined terms that provide additional information about the focus of the intervention. The fourth level (care description) is not taxonomic and was not used in this analysis. The Problem Rating Scale for Outcomes consists of three Likert-type ordinal scales for rating problem-specific client knowledge, behavior, and status (KBS) (1 = most negative to 5 = most positive). Omaha System data sets are particularly suited to intervention effectiveness research because use of the Omaha System in routine documentation generates relational problem-specific assessment, intervention, and outcomes data.

Independent/Predictor and Dependent Variables

We considered the Omaha System client assessment and service delivery variables independent variables. Client assessment variables were problem-specific signs/symptoms and problem-specific ratings for knowledge (no knowledge, superior knowledge), behavior (not appropriate, consistently appropriate), and status (extreme signs/symptoms, no signs/symptoms). These ratings were averaged across problems to yield one KBS score for each client before and after receiving public health nursing services. Service delivery variables included problem-specific intervention categories and targets. All of these variables were considered independent variables or features per patient.

For the dependent variable, a maternal risk index (MRI) variable developed previously as a metric for risk classification using the Omaha System was computed for each client (Monsen et al., 2011; Monsen, Peterson, et al., 2017). The MRI summarizes risk using weighted problem totals adjusted by baseline knowledge scores. The algorithm for transforming existing variables into the MRI score is as follows: MRI = (number of problems for which the client received interventions + 1 if Income problem + 1 if Substance use problem + 1 if Mental health problem) / baseline knowledge score (Monsen et al., 2011). We calculated the MRI for the very first encounter with the client to establish a baseline MRI. Risk index scores were partitioned at the median of the distribution to form two groups (low- and high-risk clients; n = 378 in both groups).
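The MRI formula and the median split can be sketched in a few lines of Python. This is an illustration only: the function names, the example clients, and the rule that scores exactly at the median fall in the low-risk group are assumptions, not details taken from the study.

```python
from statistics import median

def maternal_risk_index(n_problems, has_income, has_substance_use,
                        has_mental_health, baseline_knowledge):
    """(problem count + 1 per weighted problem present) / baseline knowledge."""
    if baseline_knowledge <= 0:
        raise ValueError("baseline knowledge score must be positive")
    weighted_total = (n_problems + int(has_income)
                      + int(has_substance_use) + int(has_mental_health))
    return weighted_total / baseline_knowledge

def split_at_median(scores):
    """Label a score 'high' if it is above the median, otherwise 'low' (assumed tie rule)."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]

# Hypothetical clients: (problems, income, substance use, mental health, knowledge)
clients = [(5, True, False, True, 2.0), (3, False, False, False, 3.0),
           (8, True, True, True, 2.75), (2, False, False, True, 4.0)]
mri_scores = [maternal_risk_index(*c) for c in clients]
risk_groups = split_at_median(mri_scores)
```

Because the knowledge score is in the denominator, a client with many problems and low baseline knowledge receives the highest index.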

Data Analysis

We used R Version 3.3.2 for all analyses.

Feature Selection Techniques

In machine learning, feature selection, also known as variable subset selection, is the process of selecting a subset of relevant features (variables). If the goal is to reduce a data set’s high dimensionality but still preserve variables, then a common solution to this conundrum is feature selection. The central premise when using a feature selection technique is that the data set contains features that are either redundant or irrelevant and as such can be removed without incurring much loss of information (Guyon & Elisseeff, 2003; Igbe, Darwish, & Saadawi, 2016). Some examples of machine learning feature selection techniques are LASSO (least absolute shrinkage and selection operator) and elastic net regularized generalized linear models (glmnet in R) and minimum redundancy–maximum relevance (mRMR).

Ridge and LASSO Regression

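A general regression model with p predictors x_1, x_2, …, x_p and response variable y predicts the response as

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p .$$

A model fitting procedure, such as regression, produces the vector of coefficients $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)$. For instance, the ordinary least squares (OLS) estimates are obtained by minimizing the sum of squared differences between the actual and estimated values of the dependent (target) variable.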

Every distribution has a measure of spread, such as the variance. When a model is fitted, the total variance of the dependent variable can be split into the variance explained by the independent variables and the unexplained variance, that is, the variance of the residuals. The proportion of the total variance that is explained is commonly referred to as R². Root mean square error is the square root of the variance of the residuals. The higher the R², the more of the variance the model explains. Because OLS often performs poorly in both prediction and interpretation, penalization techniques have been proposed (Zou & Hastie, 2005). Penalized estimation reduces the variance of the coefficient estimates at the cost of introducing some bias; bias then accounts for a larger share of the mean squared error, and variance for a smaller share. A brief explanation of penalization follows; it relies on two penalty forms, the L1-norm and the L2-norm.
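As a small numeric illustration of these two quantities, the sketch below computes R² and root mean square error for a handful of made-up observed and predicted values. The function names and the data are assumptions for illustration; the root mean square error is computed as the square root of the mean squared residual, which equals the residual variance when the residuals average to zero.

```python
from math import sqrt

def r_squared(y, y_hat):
    """Proportion of the total variance in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1.0 - ss_resid / ss_total

def rmse(y, y_hat):
    """Square root of the mean squared residual."""
    return sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y))

# Made-up values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
fit_r2 = r_squared(observed, predicted)
fit_rmse = rmse(observed, predicted)
```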

As an error function, the L1-norm, often referred to as the least absolute deviations procedure, minimizes the sum of absolute differences between the target values (Yi) and the estimated values. As an error function, the L2-norm, also called least squares, minimizes the sum of the squared differences between the target values (Yi) and the estimated values. Most penalized regression techniques use either the L1-norm or the L2-norm as the penalty. With an L1-norm penalty, the quantity minimized is the error plus the sum of the absolute values of the coefficients; with an L2-norm penalty, it is the error plus the sum of the squared values of the coefficients. Ridge regression creates a regression model penalized with the L2-norm, which shrinks the coefficient values so that coefficients with a minor contribution to the target variable get close to zero. LASSO, on the other hand, creates a regression model penalized with the L1-norm, which shrinks coefficient values and allows some with a minor effect on the target variable to become exactly zero (Kuhn & Johnson, 2013). Ridge regression (Hoerl & Kennard, 1988) and LASSO (Tibshirani, 1996) were proposed to improve on OLS. However, because of inherent problems in ridge regression, such as not producing a parsimonious model, and limitations of LASSO (Zou & Hastie, 2005), regularization techniques such as elastic net have been suggested.
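Written out in a common textbook form, the three criteria differ only in the penalty term. The notation is assumed here for illustration: y_i is the target value for client i, x_{ij} the value of predictor j, β_0 the intercept, and λ ≥ 0 a weight on the penalty.

$$\text{OLS:}\quad \min_{\beta_0,\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$

$$\text{Ridge:}\quad \min_{\beta_0,\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\beta_j^{2}$$

$$\text{LASSO:}\quad \min_{\beta_0,\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

Setting λ = 0 recovers OLS in both penalized forms; the intercept β_0 is conventionally left unpenalized.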

Elastic Net

Elastic net creates a regression model that is penalized with both the L1-norm and the L2-norm. This has the effect of shrinking coefficients (as in ridge regression) and setting some coefficients to zero (as in LASSO; Zou & Hastie, 2005). The final active set of features consists of those with nonzero coefficients (βs). Elastic net is implemented in the glmnet package in R.
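For a continuous outcome, the elastic net criterion in the parameterization used by glmnet (Friedman, Hastie, & Tibshirani, 2010) can be written as follows, where N is the number of observations, λ ≥ 0 sets the overall penalty strength, and α between 0 and 1 mixes the two penalties:

$$\min_{\beta_0,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\Big[\frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2}+\alpha\sum_{j=1}^{p}\lvert\beta_j\rvert\Big]$$

With α = 1 the penalty reduces to the LASSO penalty, and with α = 0 to the ridge penalty. For a binary outcome such as a high or low risk class, the squared-error term is replaced by the corresponding logistic (binomial) loss, while the penalty term keeps the same form.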

LASSO and Elastic Net Regularized Generalized Linear Models (glmnet in R)

glmnet fits a generalized linear model via penalization using the elastic net. Two important parameters in the glmnet model are the elastic net penalty (represented as alpha [α]) and the tuning parameter (represented as lambda [λ]). The elastic net penalty (α) bridges the gap between LASSO (α = 1, the default) and ridge (α = 0; Hastie & Qian, 2014). The tuning parameter λ controls the overall strength of the penalty, and fitting the model repeatedly with different values of λ yields a regularization path. glmnet thus fits models over a whole sequence of λ values. When λ is very small, the LASSO solution should be very close to the OLS solution, and all of the coefficients are in the model. As λ grows, fewer variables are kept in the model (because more and more coefficients become zero valued). Lastly, a cross-validation procedure enables glmnet to pick an optimal value for λ. Discrete estimates of the coefficients (βs) are made along the way. Readers interested in the in-depth mathematics of glmnet should refer to Friedman, Hastie, and Tibshirani (2010) and to the glmnet vignette by the original authors (Hastie & Qian, 2014). The final active set of features consists of those with nonzero coefficients (βs). Once a model is developed, automatic variable selection procedures allow users to obtain a list of the features selected by the model. The concept of elastic net in automatic variable selection is similar to that of retaining “all the big fish” by stretching the fishing net (Knights, Costello, & Knight, 2011). We used the caret package in R to execute glmnet.
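The roles of α and λ can be illustrated with a toy coordinate descent solver in plain Python. This is a minimal sketch, not the glmnet implementation: it handles only a continuous outcome with centered data and no intercept, and the function names and the four-row data set are made up for illustration.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net(X, y, lam, alpha, n_sweeps=100):
    """Coordinate descent for (1/2n)*RSS + lam*((1-alpha)/2*sum(b^2) + alpha*sum|b|).

    Assumes y and the columns of X are already centered (no intercept).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Average product of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n + lam * (1 - alpha)
            beta[j] = soft_threshold(rho, lam * alpha) / scale
    return beta

# Toy centered data: y depends strongly on column 0, weakly on column 1, not on column 2.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.1, 1.9, -1.9, -2.1]

lasso_fit = elastic_net(X, y, lam=0.5, alpha=1.0)   # L1 penalty only
ridge_fit = elastic_net(X, y, lam=0.5, alpha=0.0)   # L2 penalty only
# Number of nonzero coefficients along an increasing sequence of lambda values.
active_counts = [sum(b != 0.0 for b in elastic_net(X, y, lam, 1.0))
                 for lam in (0.01, 0.05, 0.5, 3.0)]
```

In this toy example the LASSO fit sets the weak coefficient to exactly zero, the ridge fit only shrinks it, and the number of nonzero coefficients falls as λ increases.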

Minimum Redundancy–Maximum Relevance (mRMR)

In feature selection, it is important to choose features that are relevant for prediction but more so to have a set of features that are not redundant in order to increase robustness (Auffarth, López, & Cerquides, 2010; Hira & Gillies, 2015). Feature selection approaches have been categorized into filter-based methods (Yu & Liu, 2003), wrapper-based methods (Kohavi & John, 1997), and embedded methods (Guyon, Gunn, Nikravesh, & Zadeh, 2006). Filter-based feature selection methods apply a statistical measure to assign a scoring to each feature. Features are ranked by their score and selected either to be kept or removed from the data set. Wrapper-based methods treat the selection of features as a search problem, where several combinations of features are evaluated and compared. A predictive model is used to evaluate a combination of features, and a final score is assigned based on model accuracy (AlNuaimi, Masud, & Mohammed, 2015). Embedded methods learn which features best contribute to the accuracy of the model while the model is being created (AlNuaimi et al., 2015). The most common type of embedded feature selection methods are penalized methods such as LASSO, ridge regression, and elastic net. A special group of filter-based feature selection approaches tends to simultaneously select highly predictive but uncorrelated features. These approaches tend to select a subset of features having the most correlation with a class (relevance) and the least correlation between themselves (redundancy). In these algorithms, features are ranked according to the mRMR criteria (Radovic, Ghalwash, Filipovic, & Obradovic, 2017). It is likely that features selected according to maximum relevance could have rich redundancy, that is, the dependency among these features could be large (Ding & Peng, 2005). When two features highly depend on each other, the representative class discriminative power would not change much if one of them were removed. 
Therefore, the minimal redundancy condition can be added to select mutually exclusive features. The criterion combining the above two constraints is called mRMR (Ding & Peng, 2005). For continuous features, the F-statistic can be used to calculate correlation with the class (relevance), and the Pearson correlation coefficient can be used to calculate correlation between features (redundancy; Radovic et al., 2017). Thereafter, features are selected one by one by applying a greedy search to maximize the objective function, which integrates the relevance and redundancy information of each variable into a single scoring mechanism (Radovic et al., 2017). Readers interested in learning more about mRMR are directed to Auffarth et al. (2010), De Jay et al. (2013), Radovic et al. (2017), and Ding and Peng (2005). mRMR is a very popular tool in biostatistics and genetic research for sorting through data sets containing thousands to millions of features (Ding & Peng, 2005). As with glmnet in R, features are ranked according to their importance score.
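The greedy search can be sketched as follows. To keep the example short, it makes two simplifying assumptions that differ from the description above: relevance is measured with the absolute Pearson correlation between a feature and a 0/1 class label rather than the F-statistic, and the objective is relevance minus mean redundancy (a difference form). The function names, feature names, and data are hypothetical.

```python
from statistics import mean

def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5

def mrmr(features, target, k):
    """Greedily pick k feature names, scoring relevance minus mean redundancy."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected, remaining = [], sorted(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = mean(abs(pearson(features[name], features[s]))
                              for s in selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical data: six clients, class label 0 = low risk, 1 = high risk.
risk_class = [0, 0, 0, 1, 1, 1]
features = {
    "feature_a": [1, 2, 3, 7, 8, 9],
    "feature_a_near_copy": [1, 2, 3, 7, 8, 10],   # almost duplicates feature_a
    "feature_b": [3, 2, 1, 5, 4, 3],
}
top_two = mrmr(features, risk_class, 2)
```

Ranking by relevance alone would place the near copy second; the redundancy term moves the less correlated feature_b ahead of it.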


Findings of the two methods as applied to the Omaha System documentation data are presented and compared below. The application of each method is provided for clarity.

Application of mRMR for Feature Selection

We split the entire data set into training and testing data sets using a 50–50 split. The training and testing data sets each had 378 patients and 206 features, with each patient appearing in exactly one of the two data sets. Using the class of MRI (high or low) as the dependent variable and all the other variables as predictors, we constructed a classic mRMR model. We ranked the features retained by mRMR in decreasing order of their importance scores. Fifty features had scores greater than zero and thus were considered important features; the top 20 are shown in Figure 1. Training and testing subsets containing all the patients but only these 50 features were retained for further accuracy testing. We constructed a generalized linear model on the training subset and tested it on the held-out testing data set, with performance measured by accuracy, that is, the percentage of clients in the testing data set whose MRI class was correctly classified by the trained model. We achieved the highest accuracy of 86.2% with 50 features, with α = .55 and λ = .04. Subsequent testing with 40 features and 30 features achieved lower accuracy values of 85.9% and 84.6%, respectively.
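The split and the accuracy measure can be sketched as below. The function names, the random seed, and the use of integer client identifiers are assumptions for illustration; the study's own split was performed in R.

```python
import random

def split_half(ids, seed=0):
    """Shuffle the identifiers and return two disjoint halves (train, test)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def accuracy(actual, predicted):
    """Percentage of cases whose predicted class matches the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

# 756 hypothetical client identifiers split 50-50.
train_ids, test_ids = split_half(range(756), seed=42)
example_accuracy = accuracy(["high", "low", "high", "low"],
                            ["high", "low", "low", "low"])
```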

Figure 1. Importance of each variable using minimum redundancy–maximum relevance (mRMR). The name of each individual feature is preceded by either knowledge (K), signs and symptoms (S), or target (T).

The top 20 features (Figure 1) using mRMR had 10 knowledge (Residence, Mental health, Postpartum, Abuse, Oral health, Substance abuse, Family planning, Caretaking/parenting, Health care supervision, and Neglect), seven signs/symptoms (purposeless activities, inaccurate inconsistent use of family planning methods, homelessness, inadequate crowded living space, dissatisfied with family planning methods, steep unsafe stairs, and apprehension undefined fear), and three targets (growth and development, anatomy physiology, and sickness injury care). Overall, the most important feature was residence; when classified individually within the groupings, residence was the most important knowledge rating, purposeless activities was the most important sign/symptom, and growth and development was the most important target.

Application of glmnet for Feature Selection

We imported the entire data set of 756 patients and 206 features per patient, with the outcome variable being the class of MRI. We used the same training and testing data sets as before. Using glmnet with MRI as the dependent variable and all the other variables as predictors, we constructed a model using three repeats of cross-validation on the training data set (378 patients, 205 predictors, and one outcome variable [MRI]). An advantage of using glmnet in R is that, among the coefficients (βs) estimated, those of more important features are larger in magnitude than those of less important ones, so features can easily be ranked by coefficient magnitude. Variable importance evaluation functions, such as varImp in the caret package in R, use the glmnet model information, with all measures of importance scaled to have a maximum value of 100 (Kuhn, 2012). This technique allows users to rank variables in order of importance. Figure 2 shows the top 20 variables listed in order of their importance. Sixty-three features had an importance value greater than zero, and further testing was performed using only those features. We iteratively reduced the number of features, based on varImp, from 205 to 63, 50, 40, and 30. A generalized linear model constructed with 63 features achieved the highest accuracy of 95.5%, with α = .55 and λ = .004. Further reductions to 50, 40, and 30 features yielded lower accuracy values of 94.4%, 92.6%, and 92.3%, respectively.
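The ranking step can be illustrated with a simplified stand-in for a variable importance function. The sketch below scales absolute coefficient values so that the largest equals 100 and keeps only features with importance above zero. It is an assumption-laden illustration, not caret's varImp code, and the feature names and coefficient values are made up.

```python
def scaled_importance(coefficients):
    """Map each feature to 100 * |coefficient| / largest |coefficient|."""
    magnitudes = {name: abs(b) for name, b in coefficients.items()}
    largest = max(magnitudes.values())
    if largest == 0:
        return {name: 0.0 for name in magnitudes}
    return {name: 100.0 * m / largest for name, m in magnitudes.items()}

def rank_features(coefficients):
    """Names of features with importance above zero, most important first."""
    importance = scaled_importance(coefficients)
    kept = [name for name, value in importance.items() if value > 0]
    return sorted(kept, key=lambda name: importance[name], reverse=True)

# Hypothetical penalized coefficients; a zero means the feature was dropped.
coefs = {"K_example_a": -2.0, "S_example_b": 0.5,
         "T_example_c": 0.0, "K_example_d": 1.0}
importance = scaled_importance(coefs)
ranked = rank_features(coefs)
```

Ranking by coefficient magnitude is meaningful only when the predictors are on comparable scales, which is why the sketch takes already-fitted coefficients as given.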

Figure 2. Importance of each variable using elastic net regularized generalized linear model (glmnet in R). The name of each individual feature is preceded by either target (T), signs and symptoms (S), or knowledge (K).

The top 20 features (Figure 2) using glmnet had 10 knowledge ratings (Growth and development, Oral health, Caretaking parenting, Postpartum, Mental health, Substance use, Residence, Abuse, Role change, and Income), six signs/symptoms (fearful hypervigilant behavior, unsafe appliances equipment, difficulty in obtaining family planning methods, inadequate delayed medical care, other5, and other4), and four targets (food, behavior modification, mental emotional signs and symptoms, and medication administration). Overall, the most important feature was food. When classified individually within the groupings, growth and development was the most important knowledge rating, fearful hypervigilant behavior was the most important sign/symptom, and food was the most important target (Figure 3).

Figure 3. Top 20 features selected by glmnet classified by knowledge ratings, signs/symptoms, and targets. The lower the rank, the more important the feature.


We applied two different machine learning methods for reducing high dimensionality of the Omaha System data set. Our goal was to identify critical data elements necessary for a particular outcome variable, which, in our study, was the MRI.

Given that knowledge ratings adjust the total problem count of the MRI metric, it may be expected that knowledge ratings would be among the top 20 features of both the mRMR and glmnet model. Interestingly, the three weighted problems in the MRI that were included in the original metric based on the literature (Income, Mental health, and Substance use) were less important than other problems in determining risk, including Residence, Growth and development, Oral health, Caretaking/parenting, Abuse, and Postpartum (Monsen et al., 2011). This may suggest that lower knowledge of basic health information for maternal–child health problems is associated with increased risk. Further research is needed to evaluate differential knowledge by problem as a risk predictor.

The finding that specific signs/symptoms were important predictors points to the importance of documenting signs/symptoms during clinical assessments to further specify the problem assessment (KBS ratings). Signs/symptoms for the Residence, Mental health, Family planning, and Health care supervision problems were important in these models, showing that a comprehensive holistic assessment yields critical information regarding baseline risk. This finding aligns with the literature regarding the importance of social and behavioral determinants of health in influencing health risk (Monsen, Brandt, et al., 2017). This finding is further reinforced by the intervention target food, most likely to be used to describe interventions focused on obtaining food resources—among the most basic of human needs. However, the fact that these specific signs/symptoms and intervention targets were found to be important suggests that it may be possible to reduce the assessment to fewer signs/symptoms, and interventions documentation may also be reduced in interest of streamlining documentation. However, any data-driven documentation reduction recommendation must be informed by clinicians who rely on documentation as part of quality care, as well as compliance with administrative requirements. These findings should be evaluated with additional data sets for the maternal–child population to better understand the importance of the particular signs/symptoms and interventions in the model.

For the glmnet model, the finding that two signs/symptoms of “other” were important in the model has critical implications for documentation because it is impossible to interpret the meaning of these two variables. Use of terms such as “other” or similar nondefined or customized terminologies limits our ability to discover meaningful new knowledge using these and other data mining methods (Bowles et al., 2013).

Feature selection techniques such as mRMR and glmnet are techniques designed to enable users to reduce high dimensionality within a data set. Feature selection techniques enable selection of important features based on scoring metrics. We demonstrated application of these techniques on the Omaha System in order to identify important features. Feature selection techniques can improve prediction performance and can inform more efficient documentation by identifying critical data elements. These techniques could be applied to many big data applications, and nurse researchers handling large data sets can benefit from implementation of such techniques with their data sets.


The usual limitations of observational data sets apply to this study. Furthermore, we include clinical implications in order to illustrate the relevance of feature selection for clinical practice. Further research is needed to determine whether these findings apply across maternal–child health populations and is the focus of our future research endeavors.


We described and tested two different machine learning techniques of feature selection and applied them on a data set generated by PHNs using the Omaha System during routine documentation. Further studies to refine the process of feature selection may aid in informing PHNs’ and administrators’ documentation decisions in ensuring care efficiency and effective documentation. Feature selection techniques show promise toward reducing public health nursing documentation burden by reducing the number of critical data elements needed during home visits. These machine learning methods have far-reaching applications both with the Omaha System and, in general, any application that requires the reduction of features in big data.



AlNuaimi N., Masud M. M., & Mohammed F. (2015). ICU patient deterioration prediction: A data-mining approach. arXiv preprint arXiv:1511.06910. doi:10.5121/csit.2015.51517
Auffarth B., López M., & Cerquides J. (2010). Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images. In Industrial Conference on Data Mining (pp. 248–262). Berlin, Heidelberg: Springer.
Barnard K. (1998). Developing, implementing, and documenting interventions with parents and young children. Zero to Three, 18, 23–29.
Bowles K. H., Potashnik S., Ratcliffe S. J., Rosenberg M., Shih N. W., Topaz M., Naylor M. D. (2013). Conducting research using the electronic health record across multi-hospital systems: Semantic harmonization implications for administrators. Journal of Nursing Administration, 43, 355–360. doi:10.1097/NNA.0b013e3182942c3c
De Jay N., Papillon-Cavanagh S., Olsen C., El-Hachem N., Bontempi G., & Haibe-Kains B. (2013). mRMRe: An R package for parallelized mRMR ensemble feature selection. Bioinformatics, 29, 2365–2368. doi:10.1093/bioinformatics/btt383
Ding C., & Peng H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3, 185–205. doi:10.1142/S0219720005001004
Eckenrode J., Campa M., Luckey D. W., Henderson C. R. Jr., Cole R., Kitzman H., … Olds D. (2010). Long-term effects of prenatal and infancy nurse home visitation on the life course of youths: 19-year follow-up of a randomized trial. Archives of Pediatrics & Adolescent Medicine, 164, 9–15. doi:10.1001/archpediatrics.2009.240
Friedman J., Hastie T., & Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.
Guyon I., & Elisseeff A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Guyon I., Gunn S., Nikravesh M., & Zadeh I. (Eds.). (2006). Embedded methods. In Feature extraction: Foundations and applications (pp. 137–162). Berlin, Heidelberg: Springer.
Han J., Kamber M., & Pei J. (2011). Data reduction. In Data mining: Concepts and techniques (3rd ed. pp. 99–110). Waltham, MA: Morgan Kaufmann.
Hastie T., & Qian J. (2014). Glmnet vignette. Retrieved from
Hira Z. M., & Gillies D. F. (2015). A review of feature selection and feature extraction methods applied on microarray data. Advances in Bioinformatics, 2015, 198363. doi:10.1155/2015/198363
Hoerl A., & Kennard R. (1988). Ridge regression. In Encyclopedia of Statistical Sciences (Vol. 8, pp. 129–136). New York, NY: Wiley.
Igbe O., Darwish I., & Saadawi T. (2016). Distributed network intrusion detection systems: An artificial immune system approach. In Connected Health: Applications, Systems and Engineering Technologies (CHASE), 2016 IEEE First International Conference on (pp. 101–106). City University of New York, NY: IEEE. doi:10.1109/CHASE.2016.36
Keenan G. M., Yakel E., Tschannen D., & Mandeville M. (2008). Documentation and the nurse care planning process. In Hughes R. G. (Ed.), Patient safety and quality: An evidence-based handbook for nurses (Chap. 49). Agency for Healthcare Research and Quality: Rockville, MD.
Knights D., Costello E. K., & Knight R. (2011). Supervised classification of human microbiota. FEMS Microbiology Reviews, 35, 343–359. doi:10.1111/j.1574-6976.2010.00251.x
Kohavi R., & John G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324. doi:10.1016/S0004-3702(97)00043-X
Kuhn M. (2012). Variable importance using the caret package. Retrieved from
Kuhn M., & Johnson K. (2013). An introduction to feature selection. In Applied predictive modeling (pp. 487–519). New York, NY: Springer.
Martin K. S., Monsen K. A., & Bowles K. H. (2011). The Omaha system and meaningful use: Applications for practice, education, and research. CIN: Computers, Informatics, Nursing, 29, 52–58. doi:10.1097/NCN.0b013e3181f9ddc6
Monsen K. A., Brandt J. K., Brueshoff B. L., Chi C. L., Mathiason M. A., Swenson S. M., & Thorson D. R. (2017). Social determinants and health disparities associated with outcomes of women of childbearing age who receive public health nurse home visiting services. Journal of Obstetric, Gynecologic & Neonatal Nursing, 46, 292–303. doi:10.1016/j.jogn.2016.10.004
Monsen K. A., Farri O., McNaughton D., & Savik K. (2011). Problem stabilization: A metric for problem improvement in home visiting clients. Applied Clinical Informatics, 2, 437–446. doi:10.4338/ACI-2011-06-RA-0038
Monsen K. A., Fitzsimmons L. L., Lescenski B. A., Lytton A. B., Schwichtenberg L. D., & Martin K. S. (2006). A public health nursing informatics data-and-practice quality project. CIN: Computers, Informatics, Nursing, 24, 152–158.
Monsen K. A., Peterson J. J., Mathiason M. A., Kim E., Votava B., & Pieczkiewicz D. S. (2017). Discovering public health nurse-specific family home visiting intervention patterns using visualization techniques. Western Journal of Nursing Research, 39, 127–146. doi:10.1177/0193945916679663
Radovic M., Ghalwash M., Filipovic N., & Obradovic Z. (2017). Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinformatics, 18, 9. doi:10.1186/s12859-016-1423-9
Tibshirani R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58, 267–288.
Yu L., & Liu H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. Proceedings of the 20th International Conference on Machine Learning (ICML-03) (pp. 856–863). Retrieved from
Zou H., & Hastie T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301–320. doi:10.1111/j.1467-9868.2005.00503.x

Keywords: machine learning; nursing informatics; Omaha System; public health nursing

Copyright © 2019 Wolters Kluwer Health, Inc. All rights reserved.