Acute kidney injury (AKI) is common in critically ill patients, with reported incidence rates ranging from 5% to 60% (1, 2). Renal replacement therapy (RRT) is necessary in approximately 6% of critically ill patients (1) and, in the setting of multisystem organ failure, is associated with exceptionally high morbidity and mortality.
Models for predicting mortality in critically ill patients, such as the Acute Physiology and Chronic Health Evaluation (APACHE) (3, 4) and the Simplified Acute Physiology Score (SAPS) (5, 6), have shown only poor-to-moderate predictive performance in patients with AKI (7, 8). Although AKI-specific scoring systems have been developed to predict in-hospital or 60-day mortality (7, 9), their use has not been established in AKI patients requiring RRT.
Many prognostic models have been proposed in the literature, but only a few have found their way into clinical practice. This may be explained by methodological problems in the development of these models: typically, too many predictors are tested relative to the number of outcome events in the dataset, leading to overfitted models with limited generalizability and overoptimistic estimates of model performance (8). External validation is a crucial step for assessing the performance of prediction models in new datasets, but it is rarely performed (10).
A prognostic model predicting 60-day case fatality in critically ill patients requiring RRT in the intensive care unit (ICU) was recently developed using data from the Veterans Affairs/National Institutes of Health (VA/NIH) Acute Renal Failure Trial Network (ATN) study (11). The objective of this study was to externally validate the ATN prediction model in an independent cohort. Moreover, in contrast to the ATN study, we also included critically ill patients with unknown baseline serum creatinine (sCr), that is, patients with AKI at ICU admission or previous chronic kidney disease (CKD); see the inclusion and exclusion criteria.
The MIMIC-III database
The Multi-Parameter Intelligent Monitoring in Intensive Care III, version 1.3 (MIMIC-III v1.3) project, maintained by the Laboratory for Computational Physiology at the Massachusetts Institute of Technology (MIT), contains data on patients hospitalized in an ICU at the Beth Israel Deaconess Medical Center from 2001 to 2012. The database is freely available: any researcher who accepts the data use agreement and has completed “protecting human subjects” training can apply for permission to access the data. This study was approved by the Institutional Review Boards of the Massachusetts Institute of Technology and the Beth Israel Deaconess Medical Center and was granted a waiver of informed consent.
A total of 58,976 ICU admissions of 38,605 distinct adult patients (>15 years old) were recorded in the MIMIC-III database. Inclusion criteria were an ICU length of stay greater than 24 h and RRT started during the ICU stay. Exclusion criteria were advanced renal failure, defined as sCr at ICU admission greater than 4 mg/dL; maintenance RRT or previous renal transplantation; RRT started before ICU admission; and two or more missing continuous variables.
The prediction model was derived from the patients included in the VA/NIH ATN study (11) (n = 1,124), a multicenter randomized clinical trial of intensive versus less intensive renal support in critically ill patients with AKI. The ATN prediction model included the following predictors: age, chronic hypoxemia, cardiovascular disease, malignancy, immunosuppressive therapy, ischemic AKI, and postsurgery status, as well as the following parameters within 24 h of RRT start: heart rate, mean arterial pressure, urine volume, mechanical ventilation, fraction of inspired oxygen, arterial pH, arterial oxygen partial pressure, sCr, serum bicarbonate, serum phosphate, serum albumin, total bilirubin, international normalized ratio (INR), and platelet count. The model was designed to predict case fatality at 60 days and was not internally validated.
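Mechanically, a nonintegerized logistic risk model of this kind converts a weighted sum of the predictors into a probability via the inverse logit. The sketch below illustrates only that general prediction step; the intercept and coefficient values are placeholders, not the published ATN coefficients.

```python
from math import exp

def predicted_risk(intercept, coefs, values):
    """Predicted probability from a logistic risk model:
    inverse logit of the linear predictor. The coefficients
    passed in are illustrative placeholders, not the ATN values."""
    lp = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1.0 / (1.0 + exp(-lp))

# Hypothetical example: two predictors with made-up weights.
risk = predicted_risk(-1.2, [0.03, 0.5], [65, 1])
```

A positive coefficient raises the predicted 60-day case-fatality probability as the corresponding predictor increases; the integer score approximates the same linear predictor with rounded point values.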
Validation cohort: data collection
We collected all predictor variables described in the derivation cohort using structured query language (SQL) queries, supplemented by individual searches of the discharge summaries when the information was not available through SQL (i.e., immunosuppressive therapy, maintenance RRT, coronary heart disease). For missing data, we followed the same principles as the derivation cohort (11): when elements of the medical history are unknown to surrogates and there is no record of a condition being present, the condition is usually absent. We therefore regarded these missing data elements as informative and imputed them as “No” with a probability of 0.9 and as “Yes” with a probability of 0.1. Missing data for continuous variables were imputed using regression-based maximum likelihood methods.
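The informative-missingness rule for binary history items can be sketched as a simple stochastic imputation; this is an illustrative reconstruction of the stated 0.9/0.1 rule, not the authors' actual code.

```python
import random

def impute_binary_history(value, p_yes=0.1, rng=random):
    """Impute a missing binary history item (sketch of the rule in
    the text): missing values are drawn as 'Yes' with probability
    p_yes = 0.1 and 'No' with probability 0.9; recorded values
    pass through unchanged."""
    if value is not None:
        return value
    return "Yes" if rng.random() < p_yes else "No"
```

Drawing the imputed value at random, rather than always coding "No," preserves some of the uncertainty about conditions that surrogates could not report.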
Discrimination and calibration
The external validity of the ATN prediction model was assessed in terms of discrimination and calibration. Discrimination refers to how well the model distinguishes between patients who die within 60 days and those who survive; it was assessed by calculating the area under the receiver-operating characteristic (ROC) curve (AUC). We also computed refitted c-statistics, because comparing the c-statistic observed at external validation with the refitted c-statistic reflects the influence of incorrect regression coefficients (12).
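For a binary outcome, the AUC equals the c-statistic: the probability that a randomly chosen patient who died received a higher predicted risk than a randomly chosen survivor. A minimal stdlib sketch of that equivalence (not the authors' SPSS/R workflow):

```python
def c_statistic(y_true, y_score):
    """C-statistic (AUC) as a rank statistic: the fraction of
    death/survivor pairs in which the death received the higher
    predicted risk, counting ties as 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]  # died
    neg = [s for y, s in zip(y_true, y_score) if y == 0]  # survived
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to no discrimination, 1.0 to perfect separation of deaths from survivors.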
Calibration refers to the agreement between the predicted and observed probabilities. Calibration was assessed using the Hosmer–Lemeshow (H–L) test P value and calibration plots, summarized by the calibration slope and intercept. However, the H–L test has been criticized for being largely dependent on sample size, and thus noninformative in large datasets, and for dividing the patients into deciles without accounting for the individual patient (13). Furthermore, the classic calibration curves, often drawn based on the H–L test, are not really curves and should not be used as such (10 dots, which are independent of each other, should not be connected by a line) (14). To overcome the limitations of the H–L test, we used a new statistical test for calibration, the GiViTI calibration belt (14, 15). In addition to providing a calibration curve that illustrates the relationship between predicted risk and observed outcome over different levels of risk, this technique also provides the confidence belt of the curve, that is, an estimate of the degree of uncertainty regarding the true location of the curve. In the GiViTI calibration belt, the relationship between the predicted and observed outcomes is calculated by fitting a polynomial logistic function between the logit transformation of the predicted probability and the outcome. The calibration belt calculates the 80% confidence interval (CI) (light gray area) and the 95% CI (dark gray area) surrounding the calibration curve. A statistically significant deviation from the bisector (the diagonal line of perfect calibration) occurs when the 95% CI does not cross it.
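The decile-based comparison behind the H–L statistic can be sketched in a few lines: rank patients by predicted risk, bin them, and accumulate a chi-square term per bin. This is an illustrative stdlib reconstruction, not the exact implementation used in the study.

```python
def hosmer_lemeshow_chi2(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow chi-square statistic (sketch): rank patients
    by predicted risk, split them into `groups` bins, and compare
    observed with expected deaths in each bin. The p-value would
    come from a chi-square distribution with groups - 2 degrees
    of freedom (not computed here)."""
    pairs = sorted(zip(y_prob, y_true))  # ascending predicted risk
    n, chi2 = len(pairs), 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        m = len(chunk)
        obs = sum(y for _, y in chunk)          # observed deaths
        exp_deaths = sum(p for p, _ in chunk)   # expected deaths
        pbar = exp_deaths / m
        denom = m * pbar * (1.0 - pbar)
        if denom > 0:
            chi2 += (obs - exp_deaths) ** 2 / denom
    return chi2
```

The sketch makes the sample-size criticism concrete: each bin's term scales with the bin count, so with large n even small predicted-versus-observed gaps inflate the statistic, which is why the calibration belt was used alongside it.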
Patient baseline characteristics are presented as means and SDs or as frequencies (percentages). The association of the predictors with 60-day case fatality was assessed using multivariable logistic regression including all predictors in the risk model. Associations are expressed as odds ratios (ORs) with 95% CIs. Two versions of the model were validated: one using the integer risk score and one using the coefficients of the risk model (the nonintegerized risk model). The main analysis was performed in the entire cohort. A sensitivity analysis included only patients with a known baseline sCr of at most 2 mg/dL in men and at most 1.5 mg/dL in women (the thresholds used in the derivation cohort).
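The reported ORs and 95% CIs follow from the logistic-regression coefficients by exponentiation; a minimal sketch with made-up example values (the real coefficients appear in Table 2):

```python
from math import exp

def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logistic-regression coefficient and its standard
    error into an odds ratio with a Wald 95% CI (z = 1.96).
    Inputs below are illustrative, not values from the study."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)

# Hypothetical coefficient of 0.66 with SE 0.15:
or_, lo, hi = odds_ratio_ci(0.66, 0.15)
```

Because the CI is symmetric on the log-odds scale, it is asymmetric around the OR itself, which is why published intervals such as 1.43–2.63 are not centered on the point estimate.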
The statistical analyses were performed using SPSS (Statistical Package for Social Sciences, version 20; IBM Corporation, Armonk, NY) and R software (R Foundation for Statistical Computing, Vienna, Austria). The calibration belt was plotted using the GiViTI calibration belt library.
Of the 38,605 adult patients admitted to the ICU, we identified 1,798 who underwent RRT during their first ICU stay. Of these, 396 were excluded because RRT was started before ICU admission, and another 349 were excluded because they had two or more missing continuous variables, leaving 1,053 patients in the final analysis. Of these, 563 (53.5%) had a previous diagnosis of CKD by ICD-9 code. The mean age was 61.2 ± 15.2 years, and 583 (55.4%) were men. Sixty-day case fatality occurred in 318 (30.2%) of the included patients. The complete distribution of the demographic data of the validation cohort is shown in Table 1.
In this validation cohort, the strongest predictor of 60-day case fatality was the mean arterial pressure (OR 1.94, 95% CI, 1.43–2.63 for each range of MAP according to the integer score). Several predictor variables included in the derivation cohort were not associated with 60-day case fatality in this validation cohort, even when we explored the nonintegerized variables (see Table 2).
In the entire cohort, the 60-day case fatality integer prediction model showed moderate discrimination capacity (c-statistic 0.70; 95% CI, 0.66–0.73). As for calibration, although the H–L chi-square statistic disclosed significant differences between the predicted and observed 60-day case fatality (chi-square 18.65, P = 0.009) and the calibration curve had a slope of 0.74 and an intercept of 0.08 (Fig. 1A), the calibration belt did not display any significant deviation from the bisector line by the GiViTI tests (Fig. 1B).
Because the discrimination capacity of the nonintegerized risk model differed significantly from that of the integer risk score in the development cohort (11), we also evaluated the nonintegerized model. We found no significant difference in discrimination capacity (c-statistic 0.71; 95% CI, 0.67–0.74). The H–L chi-square statistic again showed significant differences between the predicted and observed 60-day case fatality (chi-square 19.89, P = 0.011; see also Fig. 1C); however, the calibration curve was very similar to that of the integer risk score when evaluated by the calibration belt (Fig. 1D).
Patients with baseline sCr at most 2 mg/dL in men or at most 1.5 mg/dL in women
When analyzing only patients with a baseline sCr of at most 2 mg/dL in men or at most 1.5 mg/dL in women (n = 313), the integer score showed good discrimination capacity (c-statistic 0.76; 95% CI, 0.71–0.81) and good calibration for 60-day case fatality prediction according to the H–L test (chi-square 10.39, P = 0.17) and its derived calibration curve. There was also no significant deviation from the bisector line by the GiViTI tests in the calibration belt (Fig. 2, A and B).
Regarding the nonintegerized risk model in patients with normal or slightly altered sCr, the discriminatory capacity was acceptable (c-statistic 0.72; 95% CI, 0.66–0.78), and calibration was good (H–L test chi-square 7.95, P = 0.44, with similar derived curves; Fig. 2, C and D). In the calibration belt, the CI at the upper probabilities of case fatality was narrower than that of the integer risk score.
To evaluate the influence of imprecise regression coefficients on the difference between the c-statistics observed at external validation and in the derivation cohort, we refitted the risk model in the validation sample. In the entire cohort, the refitted c-statistic was 0.85 (95% CI, 0.81–0.88); when only patients with a baseline sCr of at most 2 mg/dL in men or at most 1.5 mg/dL in women were considered, it was 0.86 (95% CI, 0.81–0.90).
In this study, we performed an external validation of a recently proposed prognostic model for 60-day case fatality in critically ill patients with AKI requiring RRT (11). To the best of our knowledge, this is also the first time a prognostic model for AKI has been validated using the GiViTI calibration belt, a new method that calculates the 80% and 95% CIs surrounding the calibration curve (14) rather than simply evaluating 10 dots representing all patients, as in the H–L test.
This external validation showed that the ATN risk model had acceptable discrimination capacity in patients starting RRT in the ICU, regardless of whether they had previous CKD or had already been admitted to the ICU with AKI, conditions closer to daily medical practice. Although acceptable in patients with altered sCr at ICU admission, the discrimination capacity was even better when only patients with an admission sCr of at most 2 mg/dL in men or at most 1.5 mg/dL in women were included, as in the derivation cohort. As for calibration, the H–L P value was <0.05 in several of the analyses performed; however, the calibration belt showed no reference line segment outside the confidence interval, suggesting, in line with others, that the H–L test is oversensitive when large samples are evaluated (13).
Although no significant deviations from the bisector line were observed for any of the models, the integer and nonintegerized risk models in the entire cohort and the integer risk model in patients with normal or slightly altered sCr displayed less-than-perfect calibration belts for 60-day mortality prediction, with wide CIs. This reflects the fit of a polynomial of degree greater than 1 between the predicted and observed outcomes (16).
In comparison with the derivation cohort, we identified a significant reduction in the c-statistic. The best value in this validation study was 0.76, obtained with the integer risk score in patients with no or minor alterations in sCr levels, compared with a c-statistic of 0.85 in the derivation cohort. According to Nieboer et al. (12), the discriminative ability of a prediction model at external validation can be influenced both by the correctness of the regression coefficients and by case-mix heterogeneity in the validation sample. Although our cohort differed substantially from the derivation cohort (an observational cohort rather than a randomized trial, with a different 60-day case fatality rate), the similarity between the derivation-cohort c-statistic and our refitted c-statistic indicates that the reduced model fit is largely independent of case-mix differences between the development and validation samples.
Although a significant number of patients were excluded because RRT was started before ICU admission (n = 396), a strength of the present study is that we investigated the generalizability of the ATN model in an unselected cohort with a different setting (observational cohort vs. randomized controlled trial) and case-mix (less severely affected patients). Moreover, our patients had less severe acute illness than those in the derivation cohort, as demonstrated by a significantly reduced Sequential Organ Failure Assessment (SOFA) score and, consequently, reduced mortality. This difference is mainly due to the fact that one of the inclusion criteria of the ATN study (11) was a nonrenal SOFA score of at least 2. Our cohort also included postcardiac surgical patients and patients from a coronary unit (each approximately 10% of the sample). In principle, a model is generalizable to populations comparable to the development data. However, generalizability is not by definition limited to such populations; model estimates may also be valid in broader ones, and external validation verifies whether the model can be used in different settings. Thus, the differences between the derivation and validation cohorts are more an advantage than a limitation of our study.
Although prognostic models have important limitations that hinder their ability to predict outcomes in individual patients, the validation of such models is important because of their ability to accurately estimate the risk of death in populations of patients (17). Risk adjustment and mortality prediction are largely used to benchmark ICU performance, adjust for patient differences in nonrandomized studies, assess secular trends in mortality, and compare severity of illness among participants across randomized trials.
Our study has several limitations. First and most important is the absence of baseline renal function before ICU admission: as in daily practice, it was not possible to clearly determine whether some patients had previous CKD or had developed AKI before ICU admission. In the ATN study, patients were enrolled only if they had a baseline sCr of at most 2 mg/dL in men or at most 1.5 mg/dL in women; to approximate these enrollment criteria, we performed a sensitivity analysis including only patients with such sCr levels at any time during the ICU stay before RRT was started, as mentioned above. Another limitation is that ours is a single-center study, and the prospective multicenter external validity of the ATN model remains to be confirmed. Finally, we excluded 745 patients (RRT initiated before ICU admission or substantial missing data); however, 1,053 patients remained in the final analysis, a number close to that of the derivation cohort (11). We chose to exclude these patients to avoid bias from the different outcomes of patients on maintenance hemodialysis before the ICU stay and to avoid imprecision in the imputation of missing variables.
In conclusion, this external validation study suggests that the ATN prognostic model for AKI patients requiring RRT can be useful in a broad, unselected cohort of critically ill patients. Although discrimination was only moderate when patients with major sCr alterations at ICU admission were included, refitting the model improved it, illustrating the need for continuous external validation and updating of prognostic models before implementation in clinical practice.
1. Uchino S, Kellum JA, Bellomo R, Doig GS, Morimatsu H, Morgera S, Schetz M, Tan I, Bouman C, Macedo E, et al. Acute renal failure in critically ill patients: a multinational, multicenter study. JAMA 294: 813–818, 2005.
2. Hoste EA, Schurgers M. Epidemiology of acute kidney injury: how big is the problem? Crit Care Med
3. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med 13: 818–829, 1985.
4. Zimmerman JE, Wagner DP, Draper EA, Wright L, Alzola C, Knaus WA. Evaluation of acute physiology and chronic health evaluation III predictions of hospital mortality in an independent database. Crit Care Med 26: 1317–1326, 1998.
5. Moreno RP, Metnitz PGH, Almeida E, Jordan B, Bauer P, Campos RA, Iapichino G, Edbrooke D, Capuzzo M, Le Gall J-R, et al. SAPS 3—from evaluation of the patient to evaluation of the intensive care unit. Part 2: development of a prognostic model for hospital mortality at ICU admission. Intensive Care Med 31: 1345–1355, 2005.
6. Le Gall JR, Lemeshow S, Saulnier F. A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study. JAMA 270: 2957–2963, 1993.
7. Douma CE, Redekop WK, van der Meulen JH, van Olden RW, Haeck J, Struijk DG, Krediet RT. Predicting mortality in intensive care patients with acute renal failure treated with dialysis. J Am Soc Nephrol 8: 111–117, 1997.
8. Uchino S, Bellomo R, Morimatsu H, Morgera S, Schetz M, Tan I, Bouman C, Macedo E, Gibney N, Tolwani A, et al. External validation of severity scoring systems for acute renal failure using a multinational database. Crit Care Med 33: 1961–1967, 2005.
9. Chertow GM, Soroko SH, Paganini EP, Cho KC, Himmelfarb J, Ikizler TA, Mehta RL. Mortality after acute renal failure: models for prognostic stratification and risk adjustment. Kidney Int 70: 1120–1126, 2006.
10. Siontis GCM, Tzoulaki I, Castaldi PJ, Ioannidis JPA. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol 68: 25–34, 2015.
11. Demirjian S, Chertow GM, Zhang JH, O’Connor TZ, Vitale J, Paganini EP, Palevsky PM; VA/NIH Acute Renal Failure Trial Network. Model to predict mortality in critically ill adults with acute kidney injury. Clin J Am Soc Nephrol 6: 2114–2120, 2011.
12. Nieboer D, van der Ploeg T, Steyerberg EW. Assessing discriminative performance at external validation of clinical prediction models. PLoS One
13. Kramer AA, Zimmerman JE. Assessing the calibration of mortality benchmarks in critical care: the Hosmer-Lemeshow test revisited. Crit Care Med 35: 2052–2056, 2007.
14. Finazzi S, Poole D, Luciani D, Cogo PE, Bertolini G. Calibration belt for quality-of-care assessment based on dichotomous outcomes. PLoS One
15. Serrano N. Calibration strategies to validate predictive models: is new always better? Intensive Care Med
16. Nattino G, Finazzi S, Bertolini G. A new calibration test and a reappraisal of the calibration belt for the assessment of prediction models based on dichotomous outcomes. Stat Med 33: 2390–2407, 2014.
17. Ehlenbach WJ, Cooke CR. Making ICU prognostication patient centered: is there a role for dynamic information? Crit Care Med 41: 1136–1138, 2013.