Mortality risk is the traditional measure of severity of illness for children in ICUs. Static mortality risk measures, including the Pediatric Risk of Mortality score and the Pediatric Index of Mortality, were developed for quality assessment and, therefore, focus on the early portion of an ICU stay (1,2). They were neither designed for nor widely adopted for use in individual patients, at least in part because rapid changes in severity and therapies and the frequent acquisition of data made risk assessments obsolete soon after they were obtained. The Pediatric Logistic Organ Failure (PELOD) score can be collected serially, but it is updated only daily, and its relatively simplistic framework limits its potential to enhance clinician assessment (3).
Early identification of ICU patients at increasing or high risk of in-hospital mortality might improve clinical and operational decision-making and potentially improve outcomes. Dynamic mortality risk assessment, which objectively tracks changes in mortality risk, could support the clinical decision-making of healthcare providers, especially those with less experience or expertise. Machine learning methods applicable to the continuous flow of ICU data could assess the improvement or deterioration of individual patients, and a recent “proof of concept” report showed the potential of this approach (4). Recently, we used machine learning to develop the Criticality Index, a new severity index for pediatric inpatients based on physiology, therapies, and intensity of care (5–7). This framework follows the prominent threads of critical care severity research, including early qualitative assessments, identification and quantification of therapies, and physiologic profiles (8–12). The Criticality Index was computed every 6 hours and was calibrated to the probability of ICU care; as it increased for ICU patients, the intensity and complexity of care increased. Therefore, we hypothesized that we could use the Criticality Index to estimate the probability of hospital death in ICU patients (Criticality Index-Mortality [CI-M]), using 6-hour time periods to measure dynamic changes in mortality risk. Before use in individuals, predictive models developed from populations that purport to measure changing clinical status should maximize predictive performance, especially calibration, at each of the times used to assess change: if the models are not well calibrated, change cannot be reliably assessed. Additionally, other measures of construct validity should be used to assess “real-life” scenarios.
The overall aim of this study was the assessment of a machine learning method for serially updating mortality risk for children in ICUs. In this analysis, we estimated mortality risk using the Criticality Index and used serial models over time to maximize performance for individual admissions. We hypothesized that this method would perform well in terms of discrimination, calibration, and other performance metrics and would reflect serial risk changes for survivors and deaths. A priori, we anticipated the following trajectories of mortality risk: the risks of high-risk deaths would stay high or increase over time; the risks of high-risk survivors would decrease; the risks of low-risk deaths would increase; and the risks of low-risk survivors would stay low. We also anticipated that deaths would, in general, have a more volatile clinical course than survivors when assessed with serial mortality risks.
The dataset was derived from Health Facts (Cerner Corporation, Kansas City, MO), a voluntary, comprehensive, de-identified clinical database of admissions from U.S. hospitals with a Cerner data-use agreement. Data are obtained from the electronic health record (EHR), are date- and time-stamped, and include admission and demographic information, laboratory results, medication information, diagnostic and procedure codes, vital signs, respiratory data, and hospital outcome. Not all data are available for all admissions. Health Facts is representative of the United States (13) and has been used in previous care assessments, including the Acute Physiology and Chronic Health Evaluation score (14), the Criticality Index (5–7), and pediatric medication use (15,16).
Details on data cleaning and definitions, medication classification, and diagnostic classification have been published (5,6,17). For this analysis, we emphasized the multisystem nature of ICU disease by categorizing all systems of dysfunction based on the discharge International Classification of Diseases, 9th Revision and International Classification of Diseases, 10th Revision classifications. Inclusion criteria were age less than 22 years (18) and ICU care from 2009 to 2018. Exclusion criteria were hospital length of stay (LOS) greater than 100 days, ICU LOS greater than 30 days, or neonatal ICU care. There were 88 hospitals with an average of 311 admissions/hospital; the dataset included hospitals contributing both small and large samples to expand the generalizability of the methodology. The study was approved by the Children’s National institutional review board (protocol number 9282). The information and methodology conform to the “Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis” guidelines and are included in Supplemental Digital Data 1 (https://links.lww.com/PCC/B978).
The Criticality Index independent variables consist of six vital signs, 30 routinely measured laboratory variables, and medications classified into 143 categories, as well as the number of measurements per time period (5). The variables, definitions, and statistics have been reported in detail (5–7). Detailed data on the independent variables and imputed values are shown in Supplemental Digital Data 1 (https://links.lww.com/PCC/B978). The primary diagnosis was not used for modeling because it was determined at discharge. In the initial studies, positive pressure ventilation was used only to classify high-intensity ICU care. For this study, we used positive pressure ventilation as an independent variable and categorized an admission as “Yes” for ventilation for all time periods following its implementation. Criteria for positive pressure were continuous positive airway pressure, positive end-expiratory pressure, and/or peak inspiratory pressure. Consistent with other machine learning models, we imputed laboratory results and vital signs using the last known result (4,19–21). These imputed values have been reported (5,6) and were identified by setting the measurement count equal to zero (see below).
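The last-observation-carried-forward scheme described above, with a zero measurement count flagging imputed values, can be sketched as follows. This is a minimal illustration in Python/pandas; the column names are hypothetical and this is not the study’s actual pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per admission per 6-hr period.
df = pd.DataFrame({
    "admission_id": [1, 1, 1, 2, 2],
    "period":       [1, 2, 3, 1, 2],
    "heart_rate":   [120.0, np.nan, 98.0, np.nan, 140.0],
})

# Count of new measurements in the period; 0 flags a carried-forward value.
df["heart_rate_count"] = df["heart_rate"].notna().astype(int)

# Carry the last known result forward within each admission (LOCF).
df["heart_rate"] = df.groupby("admission_id")["heart_rate"].ffill()
```

Values still missing in an admission’s first period (no prior result to carry forward) remain missing and would be handled by the previously reported imputed values.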
The outcome for all CI-M models (see below) was hospital outcome as survival or death. Time to death was not included in the model since a majority of deaths in many PICUs are associated with brain death or withdrawal and limitations of care (22).
Machine Learning Methodology and Statistical Analysis
The hospital course was discretized into consecutive 6-hour time periods. Models were independently developed for each time period (n = 29) from the second (6 hr) to the 30th (180 hr). Modeling was truncated at 180 hours to ensure an appropriate sample for goodness-of-fit (GOF) testing because, beyond that point, discharges and deaths reduced the test sample to fewer than 750 admissions and 30 deaths. For each time period, 87% of admissions were randomly assigned to model development and 13% to testing. We used individual machine learning models for each time interval to maximize predictive performance because the eventual intent is to apply these methods to individual admissions.
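As a rough sketch of this discretization and per-period splitting (illustrative Python; the exact period-numbering convention and the study’s randomization mechanics are assumptions here):

```python
import numpy as np

def assign_period(hours_since_admission: float) -> int:
    """Map an event time to a consecutive 6-hr time period
    (labeling assumption: period 1 covers 0 to <6 hr)."""
    return int(hours_since_admission // 6) + 1

# An independent 87% development / 13% test split is drawn
# separately for each time period.
rng = np.random.default_rng(seed=42)
admission_ids = np.arange(1_000)
in_dev = rng.random(admission_ids.size) < 0.87
dev_ids, test_ids = admission_ids[in_dev], admission_ids[~in_dev]
```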
We computed each admission’s Criticality Index for each time period using the previously published machine learning methodology, detailed in Supplemental Digital Data 1 (https://links.lww.com/PCC/B978) (5–7). Previously, we demonstrated that as the Criticality Index increases, the intensity and complexity of care increase. For each time period, we added a final step that calibrated the Criticality Index and positive pressure ventilation variables to the hospital outcome of survival or death using generalized thin plate splines for binary outcomes (23,24). The performance of the individual CI-M models and the composite of all models was assessed using: 1) discrimination (area under the receiver operating characteristic curve [AUROC]) and calibration and 2) specificity, precision, F1 score, Matthews correlation coefficient (MCC), and negative predictive value at sensitivities of 0.85, 0.90, and 0.95. Calibration was assessed using two approaches. First, we computed the Hosmer-Lemeshow GOF test for each time period using risk intervals with at least 250 admissions. Second, we assessed calibration plots of the observed and expected proportions of deaths with linear regression using at least 10 risk intervals with equal numbers of admissions. Performance metrics for the calibration plots included the regression line slope, intercept, and coefficient of determination (R2). We also compared the observed to predicted number of outcomes for each of the calibration plot risk intervals and reported the percentage of intervals with no statistical evidence of difference (p > 0.05) (25–28). For perfect calibration, the intercept would be 0, the slope would be 1, and the R2 would be 1. We expected that approximately 5% of the observed versus expected outcome intervals in the calibration plots would be statistically different.
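The calibration plot metrics can be computed as sketched below: predictions are binned into equal-count risk intervals, and a regression line is fit to the observed versus expected proportions of deaths. This is an illustrative Python implementation, not the study’s code.

```python
import numpy as np

def calibration_plot_metrics(pred_risk, died, n_bins=10):
    """Slope, intercept, and R^2 of observed vs. expected death
    proportions across risk intervals with equal numbers of admissions."""
    pred_risk = np.asarray(pred_risk, dtype=float)
    died = np.asarray(died, dtype=float)
    order = np.argsort(pred_risk)
    bins = np.array_split(order, n_bins)          # equal-count intervals
    expected = np.array([pred_risk[b].mean() for b in bins])
    observed = np.array([died[b].mean() for b in bins])
    slope, intercept = np.polyfit(expected, observed, 1)
    residuals = observed - (slope * expected + intercept)
    r2 = 1.0 - residuals.var() / observed.var()
    return slope, intercept, r2
```

For perfect calibration the slope is 1, the intercept 0, and R2 is 1; the same binned death counts can also feed a Hosmer-Lemeshow-style chi-square statistic.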
Since real-time application would be dependent, in part, on the worst performance at any time period, we examined the worst performing models for each of the calibration methods using the AUROCs, the GOF tests, the calibration plots, and observed versus expected proportion of deaths, and compared these data to the first and the last time period models.
Construct validity was assessed using population trajectories and mortality risk changes in consecutive time periods (clinical volatility) for individuals. First, we plotted the clinical trajectories (6) of the following groups, based on the mortality risk determined from the first time period: deaths in the highest risk decile of deaths (high-risk deaths), survivors in the highest risk decile of survivors (high-risk survivors), deaths in the lower risk deciles of deaths (low-risk deaths), and survivors in the lower risk deciles of survivors (low-risk survivors). A priori, we expected that the mortality risk of high-risk deaths would remain high or increase over time, the risk of high-risk survivors would decrease, the risk of low-risk deaths would increase, and the risk of low-risk survivors would remain low. Second, we computed the change in mortality risk for individuals over consecutive time intervals and evaluated all changes as well as the maximum positive (clinical deterioration) and maximum negative (clinical improvement) change per admission. A priori, we expected that deaths would have larger changes, representing more clinical volatility (instability). These comparisons used the Mann-Whitney U test.
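The volatility measures, consecutive-period risk changes and each admission’s maximum deterioration and improvement, can be sketched as follows (illustrative Python with a hypothetical function name, not the study’s code):

```python
import numpy as np

def volatility_summary(risks):
    """For one admission's serial mortality risks, return all
    consecutive-period changes, the maximum deterioration (largest
    increase), and the maximum improvement (largest decrease,
    reported as a positive magnitude)."""
    risks = np.asarray(risks, dtype=float)
    deltas = np.diff(risks)                       # risk change per step
    max_deterioration = max(deltas.max(), 0.0) if deltas.size else 0.0
    max_improvement = max(-deltas.min(), 0.0) if deltas.size else 0.0
    return deltas, max_deterioration, max_improvement
```

Distributions of these quantities for deaths versus survivors could then be compared with scipy.stats.mannwhitneyu.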
A total of 27,354 admissions were included (Table 1). The mortality rate was 1.8% and the median ICU LOS was 53 hours (25–75th percentiles: 24–117 hr). Respiratory, cardiovascular, and neurologic dysfunction occurred in 34.8% (n = 9,515), 31.2% (n = 8,541), and 25.7% (n = 7,024) of admissions, respectively. The test sample started with 3,453 admissions and 97 deaths and decreased, due to discharges and deaths, by 6–7% per time period during the first 48 hours in the ICU and by 3–6% per time period thereafter. A total of 47.5% (n = 46) of the deaths occurred in the first 48 hours.
TABLE 1. Population Characteristics of Children in ICUs
Female, n (%)
Race, n (%)
  Other, unknown
Age (mo), median (25–75th percentile)
Hospital LOS (hr), median (25–75th percentile)
ICU LOS (hr), median (25–75th percentile)
Hospital mortality, n (%)
Positive pressure ventilation^a, n (%)
Systems of dysfunction^b, n (%)
  Respiratory system
  Nervous system
  Cardiovascular system
  Infectious and parasitic diseases
  Gastrointestinal system
  Musculoskeletal system
  Endocrine, nutritional, metabolic diseases
  Injury and poisonings
  Mental disorders
LOS = length of stay.
^a Criteria for positive pressure were continuous positive airway pressure, positive end-expiratory pressure, and/or peak inspiratory pressure.
^b Categorization based on all discharge International Classification of Diseases, 9th Revision and International Classification of Diseases, 10th Revision data.
The AUROC assessing discrimination ranged from 0.797 to 0.894 for the individual CI-M models (Fig. 1A) and was 0.852 (95% CI, 0.843–0.861) for all time periods combined (Fig. 1B). Calibration assessed by GOF testing revealed that all models except one, at 72 hours of ICU care, had p values greater than 0.05 (see below). The calibration plots for all models for all time periods (Fig. 2A) had intercepts ranging from –0.002 (60 hr) to 0.009 (126 hr), slopes ranging from 0.867 (108 hr) to 1.415 (60 hr), and R2 values ranging from 0.862 (72 hr) to 0.989 (96 hr). For all models combined (Fig. 2B), the GOF significance level was 0.195, the intercept was 0.010, the slope was 0.903, and the R2 was 0.862. Comparison of the observed versus expected proportions of deaths in all calibration plot risk intervals found that 290 of 294 risk intervals (98.6%) were not statistically different. The overall and individual performance metrics at sensitivities of 0.85, 0.90, and 0.95 are shown in Supplemental Digital Data 2 (https://links.lww.com/PCC/B979). Overall, at a sensitivity of 0.90, specificity = 0.630 (95% CI, 0.625–0.634), precision = 0.069 (95% CI, 0.065–0.072), negative predictive value = 0.995 (95% CI, 0.995–0.996), MCC = 0.184 (95% CI, 0.177–0.191), and F1 = 0.127 (95% CI, 0.123–0.132). Individual models were similar.
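How metrics at a fixed sensitivity can be derived is sketched below: the risk cutoff is lowered until the target sensitivity is reached, and the remaining metrics are computed from the resulting confusion matrix. This is an illustrative Python implementation under those assumptions, not the study’s code.

```python
import numpy as np

def metrics_at_sensitivity(risk, died, target_sens=0.90):
    """Choose the risk cutoff achieving at least the target sensitivity,
    then report specificity, precision (PPV), NPV, F1, and MCC."""
    risk = np.asarray(risk, dtype=float)
    died = np.asarray(died, dtype=bool)
    # Candidate cutoffs: the observed risks of deaths, descending.
    cutoffs = np.sort(risk[died])[::-1]
    # Lowering the cutoff raises sensitivity; take the highest cutoff
    # that captures the required fraction of deaths.
    k = int(np.ceil(target_sens * died.sum()))
    cutoff = cutoffs[k - 1]
    pred_pos = risk >= cutoff
    tp = np.sum(pred_pos & died)
    fp = np.sum(pred_pos & ~died)
    fn = np.sum(~pred_pos & died)
    tn = np.sum(~pred_pos & ~died)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"cutoff": cutoff,
            "specificity": tn / (tn + fp),
            "precision": tp / (tp + fp),
            "npv": tn / (tn + fn),
            "f1": 2 * tp / (2 * tp + fp + fn),
            "mcc": mcc}
```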
The two worst performing time periods were 72 hours, the only time period with a GOF p value less than 0.05, and 60 hours, when the calibration plot regression line slope was 1.415 and the R2 was 0.862. The AUROCs, calibration plots, and observed versus expected comparisons for the risk intervals of these time periods, with comparison data from the first and last time periods, are shown in Supplemental Digital Data 2 (https://links.lww.com/PCC/B979). Neither the 72-hour nor the 60-hour model was associated with other indicators of poor performance.
Construct validity for potential clinical application was assessed using population trajectories and clinical volatility. The mortality risk trajectories (Fig. 3) demonstrated that the a priori expectations were correct. The high-risk death cohort had the highest mortality risks, which remained high throughout the ICU course. The high-risk survivor cohort and the low-risk death cohort had similar mortality risks for the first 48 hours, but the survivor cohort’s risks improved and reached the level of the low-risk survivor group at approximately 5 days, while the low-risk death cohort’s mortality risks rose slightly over the ICU course. The low-risk survivors’ mortality risks remained very low. Deaths demonstrated more clinical volatility than survivors (Fig. 4A). The average increase in mortality risk in consecutive time periods was 0.021 for deaths and 0.006 for survivors (p < 0.001), and the average decrease was 0.022 for deaths and 0.008 for survivors (p < 0.001). Figure 4, B and C, shows the maximum deterioration and maximum improvement for survivors and deaths, illustrating the higher volatility of deaths compared with survivors. The average maximum deterioration was 0.050 for deaths and 0.015 for survivors (p < 0.001), and the average maximum improvement was 0.063 for deaths and 0.022 for survivors (p < 0.001).
This study demonstrated the applicability of the CI-M for assessing serial changes in mortality risk for individuals. The Criticality Index was initially calibrated to the probability of ICU care and has been applied to determining future care needs for hospitalized children (7). This study expands its use by recalibrating it to mortality risk. We used individual machine learning models for 6-hour time intervals from 6 to 180 hours of ICU care to maximize predictive performance. Overall, model performance metrics were very good. The composite AUROC was 0.852 and, perhaps more important for potential individual applications, calibration was excellent: 28 of 29 Hosmer-Lemeshow GOF tests had p values greater than 0.05, observed versus expected outcomes were not statistically different in 290 of 294 (98.6%) risk intervals of the individual models, and the calibration plot metrics were very good both overall and for the individual models. We evaluated in detail the two time periods with the worst calibration metrics and found that, while single metrics were notable, assessment with multiple metrics indicated that calibration was not consistently reduced. Thus, the methodology performed consistently well over the first 7.5 days of ICU care. Clinical validity, including clinical trajectories and clinical volatility, supported the potential for use in individuals by capturing the anticipated patterns of illness in survivors and deaths. Eventually, however, the methodology must be implemented and assessed in real-world use, including comparison with other severity assessment methods.
Experienced intensivists are excellent at assessing patients using clinical snapshots (29). The flow and amount of ICU data for patient assessment are substantial; for example, over 200 variables have been estimated as useful for the care of ventilated patients (30). The ability to successfully integrate this large amount of changing information on a continuous basis lies beyond the capabilities of even the most knowledgeable and perceptive caregivers (29), and less experienced and less skilled providers will integrate it less well. Therefore, the addition of continuous or frequently updated risk assessments for children in ICUs could result in the detection of clinical deterioration or improvement that might otherwise have gone unappreciated, providing an opportunity for earlier interventions and the potential for improved outcomes.
The CI-M has a strong conceptual framework based in physiology, therapeutics, and therapeutic intensity. It is currently calibrated to 6-hour time intervals but could be calibrated for continuous data. Single time period machine learning models for ICU mortality prediction have also performed very well (21,31,32). Our results are consistent with the recent single-site “proof of concept” analysis demonstrating the potential for a machine learning approach to follow changes in clinical status (4). However, there are notable differences between the studies. First, our analysis is based on a multicentered dataset, demonstrating that the methodology can be applied widely. Second, our method uses a transparent set of variables (5,6); since these methodologies should be expected to supplement physician judgment, we excluded variables that might have incorporated clinicians’ prognoses. Third, the neural networks were different; since both approaches have positive attributes, future studies will be needed to determine the best approach. Fourth, we took special care to assess the potential for use in individuals, including both GOF testing and calibration plots at 6-hour intervals, comparison of observed versus expected outcomes in 294 risk intervals, and assessments of mortality risk changes over time. Notably, we detected substantial differences in clinical volatility between deaths and survivors that could be useful at the bedside.
At this time, it is not clear if machine learning models developed on a multi-institutional dataset should be applied to individual sites or if individual sites should apply the experience and machine learning approach from multi-institutional research to their sites. It is likely that models developed in individual units using a conceptual framework such as the Criticality Index or that of Aczon et al (4) supplemented with site-specific data and local decision-making cutpoints will have improved performance and clinical applicability (33). Optimizing performance is important if the application is intended for individual patients.
There are limitations to this analysis. First, we used a retrospective EHR dataset. While we have used this dataset for multiple pediatric analyses (5–7,15,16), prospective data collection could add additional data elements. Second, a more extensive exploration of machine learning methods might have uncovered better performing models; our methodology was designed primarily to evaluate our overall aims by assessing the potential for eventual clinical use, not to optimize the models for that use. Third, we did not evaluate the relative importance of individual data elements. Previously, we analyzed the global factors associated with prediction in our models and found that a relatively limited dataset composed primarily of physiologic data and medication classes may be sufficient (34). Finally, prior to use as a patient-level assessment method, this methodology, or any other, will need real-world validation, including background or silent use, correlation of changes in clinical status with changes in the mortality risk computed by the models, and analyses of usefulness vis-à-vis providers of different experience and expertise.
Changing mortality risks for PICU patients can be measured with machine learning models based on the Criticality Index. Discrimination and calibration for all CI-M models were very good, and clinical validity was demonstrated using clinical trajectories and clinical volatility. The CI-M framework and modeling method are potentially applicable to monitoring patient improvement and deterioration in real time.
1. Pollack MM, Holubkov R, Funai T, et al.; Eunice Kennedy Shriver National Institute of Child Health and Human Development Collaborative Pediatric Critical Care Research Network: The pediatric risk of mortality score: Update 2015. Pediatr Crit Care Med. 2016; 17:2–9
2. Straney L, Clements A, Parslow RC, et al.; ANZICS Paediatric Study Group and the Paediatric Intensive Care Audit Network: Paediatric index of mortality 3: An updated model for predicting mortality in pediatric intensive care*. Pediatr Crit Care Med. 2013; 14:673–681
3. Leteurtre S, Duhamel A, Salleron J, et al.; Groupe Francophone de Réanimation et d’Urgences Pédiatriques (GFRUP): PELOD-2: An update of the PEdiatric logistic organ dysfunction score. Crit Care Med. 2013; 41:1761–1773
4. Aczon MD, Ledbetter DR, Laksana E, et al.: Continuous prediction of mortality in the PICU: A recurrent neural network model in a single-center dataset. Pediatr Crit Care Med. 2021; 22:519–529
5. Rivera EAT, Patel AK, Chamberlain JM, et al.: Criticality: A new concept of severity of illness for hospitalized children. Pediatr Crit Care Med. 2021; 22:e33–e43
6. Rivera EAT, Patel AK, Zeng-Treitler Q, et al.: Severity trajectories of pediatric inpatients using the criticality index. Pediatr Crit Care Med. 2021; 22:e19–e32
7. Trujillo Rivera EA, Chamberlain JM, Patel AK, et al.: Predicting future care requirements using machine learning for pediatric intensive and routine care inpatients. Crit Care Explor. 2021; 3:e0505
8. Cullen DJ, Civetta JM, Briggs BA, et al.: Therapeutic intervention scoring system: A method for quantitative comparison of patient care. Crit Care Med. 1974; 2:57–60
9. Keene AR, Cullen DJ: Therapeutic intervention scoring system: Update 1983. Crit Care Med. 1983; 11:1–3
10. Yeh TS, Pollack MM, Holbrook PR, et al.: Assessment of pediatric intensive care–application of the Therapeutic Intervention Scoring System. Crit Care Med. 1982; 10:497–500
11. Proulx F, Gauthier M, Nadeau D, et al.: Timing and predictors of death in pediatric patients with multiple organ system failure. Crit Care Med. 1994; 22:1025–1031
12. Pollack MM, Ruttimann UE, Getson PR: Accurate prediction of the outcome of pediatric intensive care. A new quantitative method. N Engl J Med. 1987; 316:134–139
13. DeShazo JP, Hoffman MA: A comparison of a multistate inpatient EHR database to the HCUP Nationwide Inpatient Sample. BMC Health Serv Res. 2015; 15:384
14. Bryant C, Johnson A, Henson K, et al.: APACHE outcomes across venues: Predicting inpatient mortality using electronic medical record data. Crit Care Med. 2018; 46:8
15. Heneghan JA, Trujillo Rivera EA, Zeng-Treitler Q, et al.: Medications for children receiving intensive care: A national sample. Pediatr Crit Care Med. 2020; 21:e679–e685
16. Patel AK, Trujillo-Rivera E, Faruqe F, et al.: Sedation, analgesia, and neuromuscular blockade: An assessment of practices from 2009 to 2016 in a national sample of 66,443 pediatric patients cared for in the ICU. Pediatr Crit Care Med. 2020; 21:e599–e609
17. Fung KW, Kapusnik-Uner J, Cunningham J, et al.: Comparison of three commercial knowledge bases for detection of drug-drug interactions in clinical decision support. J Am Med Inform Assoc. 2017; 24:806–812
18. Hardin AP, Hackell JM; Committee on Practice and Ambulatory Medicine: Age limit of pediatrics. Pediatrics. 2017; 140:e20172151
19. Ma J, Lee DKK, Perkins ME, et al.: Using the shapes of clinical data trajectories to predict mortality in ICUs. Crit Care Explor. 2019; 1:e0010
20. Mohamadlou H, Panchavati S, Calvert J, et al.: Multicenter validation of a machine-learning algorithm for 48-h all-cause mortality prediction. Health Informatics J. 2020; 26:1912–1925
21. Ho LV, Aczon M, Ledbetter D, et al.: Interpreting a recurrent neural network’s predictions of ICU mortality risk. J Biomed Inform. 2021; 114:103672
22. Meert KL, Keele L, Morrison W, et al.; Eunice Kennedy Shriver National Institute of Child Health and Human Development Collaborative Pediatric Critical Care Research Network: End-of-life practices among tertiary care PICUs in the United States: A multicenter study. Pediatr Crit Care Med. 2015; 16:e231–e238
23. Gu C: Smoothing Spline ANOVA Models. Springer Series in Statistics. Cham, Springer Nature Switzerland, 2002
24. Gu C: Smoothing spline ANOVA models: R package gss. J Stat Softw. 2014; 58:1–25
25. Barnard G: A new test for 2x2 tables. Nature. 1945; 156:177
26. Calhoun P: Exact: Unconditional Exact Test. R Package Version 2.1. 2020. Available at: https://cran.r-project.org/web/packages/Exact/. Accessed November 6, 2020
27. Martin Andres A, Silva Mato A: Choosing the optimal unconditioned test for comparing two independent proportions. Comput Stat Data Anal. 1994; 17:555–574
28. Mehrotra DV, Chan IS, Berger RL: A cautionary note on exact unconditional inference for a difference between two independent binomial proportions. Biometrics. 2003; 59:441–450
29. Gutierrez G: Artificial intelligence in the intensive care unit. Crit Care. 2020; 24:101
30. Morris AH: Human cognitive limitations. Broad, consistent, clinical application of physiological principles will require decision support. Ann Am Thorac Soc. 2018; 15(Suppl 1):S53–S56
31. Kim SY, Kim S, Cho J, et al.: A deep learning model for real-time mortality prediction in critically ill children. Crit Care. 2019; 23:279
32. Lee B, Kim K, Hwang H, et al.: Development of a machine learning model for predicting pediatric mortality in the early stages of intensive care unit admission. Sci Rep. 2021; 11:1263
33. Brajer N, Cozzi B, Gao M, et al.: Prospective and external evaluation of a machine learning model to predict in-hospital mortality of adults at time of admission. JAMA Netw Open. 2020; 3:e1920733
34. Ahmad MA, Trujillo-Rivera EA, Pollack MM, et al.: Machine learning approaches for patient state prediction in pediatric ICUs. IEEE International Conference on Healthcare Informatics, Victoria, British Columbia, Canada, August 9-12, 2021