Prospective multicenter external validation of postoperative mortality prediction tools in patients undergoing emergency laparotomy

Kokkinakis, Stamatios MD, MSc; Kritsotakis, Evangelos I. PhD, CStat; Paterakis, Konstantinos MD, MSc; Karali, Garyfallia-Apostolia MD; Malikides, Vironas MD; Kyprianou, Anna MD; Papalexandraki, Melina MD; Anastasiadis, Charalampos S. MD, MSc; Zoras, Odysseas MD, PhD, FACS; Drakos, Nikolas MD; Kehagias, Ioannis MD, PhD; Kehagias, Dimitrios MD; Gouvas, Nikolaos MD, PhD; Kokkinos, Georgios MD; Pozotou, Ioanna MD; Papatheodorou, Panagiotis MD; Frantzeskou, Kyriakos MD; Schizas, Dimitrios MD, PhD; Syllaios, Athanasios MD, MSc; Palios, Ifaistion M. MD; Nastos, Konstantinos MD, PhD; Perdikaris, Markos MD; Michalopoulos, Nikolaos V. MD, MSc, PhD; Margaris, Ioannis MD; Lolis, Evangelos MD; Dimopoulou, Georgia MD; Panagiotou, Dimitrios MD; Nikolaou, Vasiliki MD; Glantzounis, Georgios K. MD, PhD; Pappas-Gogos, George MD; Tepelenis, Kostas MD; Zacharioudakis, Georgios MD, PhD; Tsaramanidis, Savvas MD; Patsarikas, Ioannis MD; Stylianidis, Georgios MD; Giannos, Georgios MD; Karanikas, Michail MD, MSc, PhD; Kofina, Konstantinia MD; Markou, Markos MD; Chrysos, Emmanuel MD, PhD, FACS; Lasithiotakis, Konstantinos MD, PhD, FEBS, FRCS

Author Information
Journal of Trauma and Acute Care Surgery 94(6):p 847-856, June 2023. | DOI: 10.1097/TA.0000000000003904

Emergency laparotomy (EL) is a common procedure performed worldwide for a wide variety of abdominal pathologies. Despite documented advances in the modern era,1 mortality following EL remains substantial worldwide, affecting up to one of every five patients in the first 30 postoperative days in high-quality health care systems.2–5 Efforts to standardize perioperative care of EL patients through implementation of predetermined pathways have led to reduction in postoperative mortality.6 Standardization in contemporary practice requires calculation and consideration of the risks associated with EL before entering the operating room.7 Preoperative risk stratification may result in rational utilization of escalated levels of care postoperatively, higher level of consultant involvement in high-risk patients, improvement of communication between surgical disciplines, and optimal shared decision making.8,9 For patients undergoing EL, factors such as age, comorbidity, and waiting time from admission to operation have been associated with worse outcomes10 and have been subsequently combined in multivariable risk prediction models. Their use is embraced by modern guidelines, but no specific recommendation on the best model has been made.11

There are few external validation studies directly comparing risk prediction models in EL.12,13 In a recent review of the applicability of risk stratification tools to emergency general surgery, the authors identified the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) tool as best fitting their definition of the ideal scoring tool.14 Several other risk prediction tools have been variably proposed and widely cited, including the National Emergency Laparotomy Audit (NELA) tool, the Portsmouth Physiological and Operative Severity Score for the Enumeration of Mortality and Morbidity (P-POSSUM), and the Predictive Optimal Trees in Emergency Surgery Risk (POTTER). These tools have demonstrated excellent predictive performance in the populations in which they were developed (mainly in the United Kingdom and United States), but their broader transportability to diverse external settings has not been adequately validated. Previous reports have revealed significant differences in the management of EL in the Greek population compared with the United Kingdom.15,16

The present study performed comparative external validation of four common risk prediction tools (ACS-NSQIP, NELA, P-POSSUM, and POTTER), in a multicenter prospective cohort design, to identify the best tool for predicting 30-day mortality in Greek patients undergoing EL.


Ethics and Reporting

The study was approved by the institutional review board and the bioethics committee of each participating institution and is reported according to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement (Supplemental Digital Content, Supplementary Table S5). The study was registered at ClinicalTrials.gov (identifier NCT04615520).

Study Design and Participants

A multicenter cohort was prospectively assembled by enrolling consecutive patients undergoing EL in 10 hospitals in Greece and 1 hospital in Cyprus (1 secondary-care, 2 tertiary-care, and 8 university-affiliated hospitals), from January 2020 to May 2021. All participating centers submitted prospectively collected anonymized data on patients undergoing EL. Each patient was followed up until the 30th postoperative day. Patient inclusion and exclusion criteria were similar to those used in the NELA audit (Supplemental Digital Content, Supplementary Table S1). Briefly, all patients who had EL were eligible for this study, excluding those undergoing appendectomy, cholecystectomy, negative diagnostic laparotomy or laparoscopy, biopsy, nongastrointestinal surgery, and elective gastrointestinal surgery. Only adults (18 years or older) were enrolled.

Outcomes and Predictors

The primary endpoint was 30-day postoperative mortality. Demographic data, preoperative variables required for each prediction model (NELA, P-POSSUM, ACS-NSQIP, and POTTER), type of operation, and postoperative outcomes were prospectively recorded for each patient. The data were then entered into the respective online calculators to obtain a predicted risk of 30-day postoperative death for each patient from each model. Before surgery, attending surgeons answered the following question for each patient: “In your clinical judgment, what is the risk of death within 30 days?” with four ordered response options, namely, <5%, 5% to 10%, 10% to 20%, and >20%. The ACS-NSQIP online calculator allows clinicians to adjust for underestimation by increasing the estimated risks based on their subjective impression of the patient. We independently simulated this process by opting for adjustment on the ACS-NSQIP calculator whenever the 30-day mortality risk prediction was lower than the surgeon's preoperative assessment, as shown in Supplemental Digital Content, Supplementary Table S2 (we refer to this as surgeon-adjusted ACS-NSQIP). We calculated the ACS-NSQIP predicted risk both with and without incorporating the surgeon's subjective assessment. For patients discharged before the 30-day mark, study personnel documented the relevant outcomes through scheduled office visits and follow-up calls.
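The surgeon-adjustment trigger can be sketched as follows. This is an illustrative Python sketch, not the study's procedure or code: the category midpoints follow the footnote of Table 2, and the 30% midpoint assumed for the open-ended >20% band is our hypothetical choice.

```python
# Illustrative sketch of the surgeon-adjustment rule (not the study's code).
# Midpoints per the footnote of Table 2; the 30.0 value for the open-ended
# ">20%" band is a hypothetical assumption for illustration only.
SURGEON_MIDPOINTS = {"<5%": 2.5, "5-10%": 7.5, "10-20%": 15.0, ">20%": 30.0}

def needs_adjustment(model_risk_pct: float, surgeon_band: str) -> bool:
    """Return True when the ACS-NSQIP prediction falls below the surgeon's
    preoperative assessment, triggering the calculator's risk adjustment."""
    return model_risk_pct < SURGEON_MIDPOINTS[surgeon_band]
```

For example, a 6% model prediction paired with a surgeon assessment of 10% to 20% would be flagged for upward adjustment, whereas a 35% prediction with a >20% assessment would not.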

Sample Size

For external validation of prognostic models, a minimum of 100 outcome events is recommended to ensure adequate power to detect changes in predictive performance metrics in external datasets.18,19 We therefore recruited patients for this study for 17 months until about 100 postoperative deaths occurred in our cohort.

Missing Data

Data were readily available to calculate risk predictions for at least 98% of the patients using the NELA, P-POSSUM, and ACS-NSQIP models, but only for 486 patients (77%) using POTTER. More than half of the missing data for POTTER originated from two hospitals, where POTTER predictions could not be calculated for 48% and 98% of patients, respectively. This was because variables necessary for the POTTER tool, such as preoperative albumin, were not routinely available preoperatively in those hospitals. Because the prognostic models are intended for use in the clinic, we opted for complete case analysis for each model. Moreover, there was no evidence of an association between missing POTTER predictions and mortality (15.3% in patients with missing POTTER vs. 20% in those with nonmissing POTTER, p = 0.172), suggesting that complete case analysis is unlikely to bias our results.

Statistical Methods

The analysis aimed to estimate and compare metrics of predictive performance and utility for decision making of the selected prediction models of 30-day mortality, when applied to our independent cohort of patients undergoing EL. The subjective preoperative assessment of patient's prognosis by their surgeon was considered as a minimum benchmark that any useful prediction tool should outperform. In addition, we examined the possibility of improving predictive performance by recalibrating the models and assessed the heterogeneity of predictive accuracy between hospitals.

The Brier score (mean squared error between observed and predicted probabilities) was used as a comparative measure of overall model performance, which we scaled by its maximum value under a noninformative null model so that it ranges up to a theoretical maximum of 100%. The scaled Brier score represents the proportion of the null model's prediction error that is explained by the model under validation. Higher values of the scaled Brier score indicate better prediction accuracy (negative values indicate a potentially harmful model).20 Bootstrap resampling (500 replications) was applied to compute 95% confidence intervals (CIs) for the scaled Brier scores. The discriminatory ability of each model (to rank patients according to risk) was quantified by the concordance c statistic, calculated as the area under the receiver operating characteristic curve with exact binomial 95% CI. The DeLong test for correlated data was used to compare each model's area under the curve (AUC) with the minimum benchmark.21
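These two headline metrics follow directly from their definitions. The sketch below is illustrative (Python rather than the STATA used in the study) and assumes `y` is a 0/1 outcome vector and `p` the predicted probabilities:

```python
import numpy as np

def scaled_brier(y, p):
    """Scaled Brier score: the share of a null model's squared prediction
    error that is explained by the model; 1.0 is perfect, 0 matches the
    null model, negative values indicate a potentially harmful model."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    brier = np.mean((y - p) ** 2)
    brier_null = np.mean((y - y.mean()) ** 2)  # null model predicts the event rate
    return 1.0 - brier / brier_null

def auc(y, p):
    """Concordance (c) statistic: probability that a randomly chosen event
    case is ranked above a randomly chosen non-event case (ties count 0.5)."""
    y, p = np.asarray(y), np.asarray(p, float)
    pos, neg = p[y == 1], p[y == 0]
    diffs = pos[:, None] - neg[None, :]        # all event/non-event pairs
    return (diffs > 0).mean() + 0.5 * (diffs == 0).mean()
```

Bootstrap CIs, as used in the study, would be obtained by resampling patients with replacement and recomputing `scaled_brier` on each replicate.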

We emphasized the calibration of the models (agreement between the predicted and observed numbers of outcome events), because adequate calibration ensures that an accurate absolute risk is communicated to patients and physicians.22,23 The Hosmer-Lemeshow goodness-of-fit test was used to broadly assess calibration in deciles of predicted risks (a small p value indicates poor calibration). The ratio of expected to observed outcome events (E:O ratio) was calculated, which ideally should be 1 (E:O <1 indicates underestimation, and E:O >1 indicates overestimation of the total number of deaths). In addition, a calibration regression line was fitted (plotting observed against predicted risks). The intercept and slope of the calibration regression line and their Wald-type 95% CIs were estimated using logistic regression.22,23 The calibration intercept, also known as calibration-in-the-large (CITL), compares the average predicted risk with the overall event rate and, ideally, should be 0 (CITL <0 indicates that the predictions are systematically too high, whereas CITL >0 indicates that the predictions are systematically too low). The calibration slope evaluates the spread of the predicted risks and has a target value of 1. When CITL is close to 0, a slope close to 1 indicates that good calibration is also maintained across the range of individuals.23,24 In addition, lowess-smoothed calibration curves were constructed to allow for visual inspection of calibration.25

Decision-curve analysis (DCA) was used to provide insight into the range of decision thresholds to label a patient as “high risk for postoperative death” that would have highest net benefit (NB) for decision making when using each risk prediction model for this purpose.26 Net benefit is defined as the difference between the proportion of true positives (labeled as high risk preoperatively and then going on to die within 30 days of EL) and the proportion of false positives (labeled as high risk but not going on to die within 30 days) weighted by the odds of the selected threshold for the high-risk label. At any given threshold, the model with higher NB is the preferred model.26 We examined risk thresholds between 5% and 50%.
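The NB definition above translates directly into code. A minimal sketch (an illustrative helper, not the study's implementation), where patients are labeled high risk when their predicted probability meets the threshold:

```python
import numpy as np

def net_benefit(y, p, threshold):
    """Net benefit of labeling patients 'high risk' at a given threshold:
    the true-positive proportion minus the false-positive proportion
    weighted by the odds of the threshold."""
    y, p = np.asarray(y), np.asarray(p, float)
    n = len(y)
    high = p >= threshold                     # labeled high risk preoperatively
    tp = np.sum(high & (y == 1)) / n          # high risk and died within 30 days
    fp = np.sum(high & (y == 0)) / n          # high risk but survived
    return tp - fp * threshold / (1 - threshold)
```

A decision curve is then just `net_benefit` evaluated over a grid of thresholds (5% to 50% in this study), with the highest curve at a given threshold identifying the preferred model.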

Because the case mix differed between our cohort and the cohorts in which the risk prediction tools were originally developed, we examined whether adjusting (recalibrating) each model's intercept and slope would result in better calibrated predictions.22,23 Finally, we examined the variability in performance across hospitals (heterogeneity) using random-effects meta-analysis of hospital-specific scaled Brier scores. Heterogeneity was quantified using 95% prediction intervals, which indicate the dispersion in performance metrics that can be expected when applying the model in a new center.27

All statistical analyses were performed using STATA v.17 (StataCorp, College Station, TX).

Development Versus Validation

The clinical setting and eligibility criteria in this study were similar to those of the NELA model, to allow for a reasonable comparison of outcomes.28 The NELA was initiated in 2014 in the United Kingdom as a tool specific to emergency laparotomy, with 30-day postoperative mortality as the main outcome.28 The ACS-NSQIP was developed in a broader setting that included both elective and emergency procedures from a variety of surgical subspecialties in the United States between 2009 and 2012.29 Only 59% of the development cohort for ACS-NSQIP were general surgery patients. Outcomes of interest in ACS-NSQIP were 30-day mortality, common postoperative complications, and procedure-specific complications such as anastomotic leak.29 The POTTER model was developed on a subset of the ACS-NSQIP database, restricted to patients who underwent emergency surgery between 2007 and 2013, to predict outcomes similar to those targeted by the ACS-NSQIP model.30 The P-POSSUM was a modification to correct for overprediction of mortality by the initial POSSUM equation and was developed on patients who underwent both elective and emergency surgery between 1993 and 1995, excluding day-surgery and pediatric cases.31


Patient Characteristics

Over the 17-month recruitment period, 633 patients underwent EL in the 11 participating hospitals and were enrolled in the study. Two patients were lost to follow-up and were excluded from analysis. The remaining 631 were included in the final analysis. Case ascertainment rate was high, and it has been described in detail in the report of the Hellenic Emergency Laparotomy Study.15 The patients had a mean age of 66 years (range, 19–99 years), 54% were male, and 43% were classified as American Society of Anesthesiologists status III/IV. The most common indication for EL was gastrointestinal obstruction (39%), followed by perforation (36%) and ischemia (15%). Demographic and clinical characteristics are detailed in Table 1.

TABLE 1 - Demographics and Clinical Characteristics in 631 Patients Undergoing EL in 11 Greek Hospitals
Total Survived Died
Characteristic (n = 631) (n = 528) (n = 103) p
Age, y 66.2 ± 16.7 64.3 ± 16.8 75.8 ± 12.4 <0.001
Male sex 340 (53.9) 287 (54.4) 53 (51.5) 0.59
Body mass index, kg/m2 26.5 ± 5.3 26.5 ± 5.1 26.6 ± 6.3 0.97
ASA class <0.001
 I 150 (23.8) 145 (27.5) 5 (4.9)
 II 200 (31.7) 188 (35.6) 12 (11.7)
 III 171 (27.1) 141 (26.7) 30 (29.1)
 IV 103 (16.3) 52 (9.8) 51 (49.5)
 V 6 (1.0) 1 (0.2) 5 (4.9)
 Missing 1 (0.2) 1 (0.2) 0 (0.0)
Preoperative functional status <0.001
 Independent 443 (70.2) 402 (76.1) 41 (39.8)
 Partially dependent 154 (24.4) 105 (19.9) 49 (47.6)
 Fully dependent 32 (5.1) 20 (3.8) 12 (11.7)
 Missing 2 (0.3) 1 (0.2) 1 (1.0)
Anticipated severity of malignancy 0.19
 None 461 (73.1) 394 (74.6) 67 (65.0)
 Primary 78 (12.4) 63 (11.9) 15 (14.6)
 Nodal metastasis 19 (3.0) 14 (2.7) 5 (4.9)
 Distant metastasis 72 (11.4) 56 (10.6) 16 (15.5)
 Missing 1 (0.2) 1 (0.2) 0 (0.0)
Diabetes mellitus 103 (16.3) 72 (13.6) 31 (30.4) <0.001
Cardiac comorbidity 264 (42.0) 204 (38.7) 60 (58.8) <0.001
Chronic steroid use 54 (8.7) 38 (7.3) 16 (15.7) 0.006
Ascites 81 (12.9) 54 (10.2) 27 (26.2) <0.001
Borderline cardiomegaly chest x-ray 38 (6.0) 33 (6.3) 5 (4.9) 0.59
Respiratory history 0.005
 No dyspnoea 569 (90.2) 485 (91.9) 84 (81.6)
 Dyspnoea on exertion or limiting 36 (5.7) 24 (4.5) 12 (11.7)
 Dyspnoea at rest or long-term oxygen 18 (2.9) 13 (2.5) 5 (4.9)
 Missing 8 (1.3) 6 (1.1) 2 (1.9)
Smoking 185 (29.3) 165 (31.3) 20 (19.4) 0.016
Hemodialysis or CVVH 10 (1.6) 6 (1.1) 4 (3.9) 0.042
Preoperative acute renal failure 77 (12.2) 50 (9.5) 27 (26.5) <0.001
Sepsis within 48 h before surgery <0.001
 None 308 (48.8) 284 (53.8) 24 (23.3)
 Two SIRS criteria 218 (34.5) 186 (35.2) 32 (31.1)
 Severe sepsis 80 (12.7) 48 (9.1) 32 (31.1)
 Septic shock 25 (4.0) 10 (1.9) 15 (14.6)
Preoperative diagnosis 0.032
 Perforation 225 (35.7) 187 (35.4) 38 (36.9)
 Obstruction 247 (39.1) 218 (41.3) 29 (28.2)
 Ischemia 94 (14.9) 74 (14.0) 20 (19.4)
 Other 65 (10.3) 49 (9.3) 16 (15.5)
Operation type 0.080
 Adhesiolysis 75 (11.9) 72 (13.6) 3 (2.9)
 Small bowel resection 130 (20.6) 106 (20.1) 24 (23.3)
 Colectomy right 58 (9.2) 48 (9.1) 10 (9.7)
 Hartmann's procedure 73 (11.6) 59 (11.2) 14 (13.6)
 Strangulated hernia with bowel resection 38 (6.0) 35 (6.6) 3 (2.9)
 Peptic ulcer repair 75 (11.9) 63 (11.9) 12 (11.7)
 Colectomy other 50 (7.9) 38 (7.2) 12 (11.7)
 Stoma formation 41 (6.5) 33 (6.3) 8 (7.8)
 Other 91 (14.4) 74 (14.0) 17 (16.5)
Data are presented as mean ± SD for continuous measures, and n (%) for categorical measures.
ASA, American Society of Anesthesiologists; CVVH, continuous venovenous hemofiltration; SIRS, systemic inflammatory response syndrome.

Observed and Predicted Mortality Rates

There were 103 deaths within 30 days of EL, an overall 30-day mortality rate of 16.3% (95% CI, 13.5–19.4%). The surgeons provided subjective preoperative risk assessments for all but 21 patients (3.3%). Compared with the actual mortality rate, the average predicted risks were lower for the POTTER (8.9%), NELA (10.5%), and ACS-NSQIP (12.2%) models; higher for P-POSSUM (19.9%); and similar to observed mortality (within CI limits) for the surgeon's subjective assessment (14.9%) and the surgeon-adjusted ACS-NSQIP model (18.1%). All models assigned a significantly higher mean predicted risk to the group of patients who eventually died within 30 days of EL than to those who survived (Table 2).

TABLE 2 - Distributions of Mortality Risk Predictions in Patients Undergoing EL in 11 Greek Hospitals
Total Survived Died
30-d Mortality Predictions (n = 631) (n = 528) (n = 103) p
Surgeon's preoperative assessment, n (%) <0.001
 <5% 172 (27.3) 166 (31.4) 6 (5.8)
 5–10% 220 (34.9) 205 (38.8) 15 (14.6)
 11–20% 137 (21.7) 94 (17.8) 43 (41.7)
 >20% 81 (12.8) 45 (8.5) 36 (35.0)
 Unable to assess 21 (3.3) 18 (3.4) 3 (2.9)
Predicted risk, mean ± SD, %
 Surgeon* 14.9 ± 18.3 12.0 ± 15.6 29.5 ± 23.3 <0.001
 NELA 10.5 ± 15.2 7.6 ± 12.2 25.8 ± 19.3 <0.001
 P-POSSUM 19.9 ± 25.1 16.0 ± 22.3 40.3 ± 28.4 <0.001
 POTTER 8.9 ± 11.4 6.7 ± 9.3 21.0 ± 14.4 <0.001
 ACS-NSQIP 12.2 ± 17.6 8.4 ± 13.6 31.6 ± 22.3 <0.001
 ACS-NSQIP surgeon adjusted 18.1 ± 16.8 14.8 ± 14.0 35.5 ± 19.0 <0.001
*Point estimates of risk prediction provided by clinicians were taken as the midpoint of the predicted risk intervals (i.e., 2.5% for the interval <5%, 7.5% for the interval 5–10%, and so on).

Case Mix

Supplemental Digital Content, Supplementary Table S3, compares the distribution of context-important clinical characteristics and outcomes between the present study cohort and the cohorts of patients on which the development of the NELA and ACS-NSQIP models was based. The current cohort appears to represent a different case mix of patients with higher mortality compared with the original model development cohorts of NELA and ACS-NSQIP.

Predictive Performance

Predictive performance metrics are shown in Table 3. The overall scaled Brier score was highest for ACS-NSQIP (22.4%; 95% CI, 14.5–30.3%) and surgeon-adjusted ACS-NSQIP (20.6%; 95% CI, 13.4–27.2%) and lowest for the surgeon's assessment (10.6%; 95% CI, 1.3–18.7%) and P-POSSUM (1.5%; 95% CI, 0.0–13.1%). Discrimination was excellent for all models (Supplemental Digital Content, Supplementary Fig. S1), with AUC point estimates ranging from 0.79 (surgeon and P-POSSUM) to 0.85 (NELA). DeLong tests showed that NELA and ACS-NSQIP had significantly higher AUCs than the minimum benchmark of the surgeon's preoperative assessment, whereas no statistically significant difference from the minimum benchmark was observed for P-POSSUM (p = 0.868) and POTTER (p = 0.081). The Hosmer-Lemeshow test indicated poor agreement of observed and predicted risks in decile groups for all models (all p < 0.001), except the surgeon-adjusted ACS-NSQIP model (p = 0.742; Supplemental Digital Content, Supplementary Fig. S2). As seen in Table 3, the CITL statistic indicated that POTTER, NELA, and ACS-NSQIP produced (in this order of magnitude) predictions that were systematically too low (CITL CI limits above zero), whereas P-POSSUM systematically overestimated mortality (CITL CI limits below zero). In contrast, no significant deviation from the ideal CITL was seen for the surgeon-adjusted ACS-NSQIP model, which was the only model with an acceptable calibration slope (slope CI limits spanning 1). Figure 1 shows smoothed calibration plots for each model, visually confirming the superior calibration of the ACS-NSQIP model, especially its surgeon-adjusted version.

TABLE 3 - Predictive Performance Measures of Prognostic Models for 30-Day Postoperative Death
Prognostic Model n Overall Fit Discrimination Calibration Clinical Utility
Brier Scaled % (95% CI) AUC (95% CI) DeLong p Value E:O Ratio CITL (95% CI) Slope (95% CI) HL-GOF p Value NB, 5% NB, 10% NB, 20%
Surgeon 610 10.6 (1.3–18.7) 0.79 (0.75–0.82) Ref. 0.91 0.16 (−0.09 to 0.41) 0.74 (0.57–0.91) <0.001 0.12 0.10 0.04
NELA 623 17.2 (10.2–23.9) 0.85 (0.82–0.88) 0.005 0.65 0.68 (0.43–0.93) 0.86 (0.68–1.05) <0.001 0.13 0.11 0.07
P-POSSUM 622 1.5 (−13.1 to 13.1) 0.79 (0.75–0.82) 0.868 1.23 −0.41 (−0.68 to −0.13) 0.48 (0.37–0.60) <0.001 0.13 0.10 0.06
POTTER 486 15.4 (9.3–21.8) 0.84 (0.81–0.87) 0.081 0.55 0.75 (0.47–1.03) 0.98 (0.73–1.23) <0.001 0.12 0.09 0.05
ACS-NSQIP 618 22.4 (14.5–30.3) 0.84 (0.81–0.87) 0.030 0.75 0.49 (0.23–0.75) 0.75 (0.59–0.91) <0.001 0.12 0.10 0.07
ACS-NSQIP adjusted 618 20.6 (13.4–27.2) 0.83 (0.79–0.86) 0.057 1.12 −0.16 (−0.39 to 0.08) 1.12 (0.85–1.40) 0.742 0.13 0.10 0.07
Net benefit is calculated at decision thresholds 5%, 10%, and 20%.
HL-GOF, Hosmer-Lemeshow goodness-of-fit test; n, number of patients in the analysis; Slope, calibration slope.

Figure 1:
Calibration of prognostic models when predicting 30-day postoperative death. The blue line is a smoothed locally weighted regression (lowess) line that shows the agreement between predicted probabilities and observed proportions of 30-day mortality. The dashed diagonal line indicates perfect calibration. The circled points represent mean risks in decile groups of predicted probabilities, with vertical lines representing 95% CIs. The spike plot on the x axis summarizes the density of patients in the range of predicted risks of 30-day death. Slope, calibration slope; GOF, goodness-of-fit.

Decision Curve Analysis

Figure 2 shows that all models had positive NB for decision thresholds up to about 40% mortality risk, but best overall utility on wider ranges of thresholds was maintained for the ACS-NSQIP and surgeon-adjusted ACS-NSQIP models.

Figure 2:
Decision curves showing the NB in clinical decision making of using each prognostic model of 30-day postoperative mortality.

Model Recalibration

After intercept and slope adjustments, scaled Brier scores and calibration metrics improved substantially for all models, with ACS-NSQIP retaining lead performance (Supplemental Digital Content, Supplementary Table S4). As seen in Supplemental Digital Content, Supplementary Figure S3, the flexible calibration curves of all updated models were much closer to the diagonal reference line of perfect calibration than those of the original unadjusted models. There was evident underestimation of mortality risks in very high-risk patients from the recalibrated NELA and P-POSSUM models. Decision-curve analysis of the recalibrated models (Supplemental Digital Content, Supplementary Fig. S4) confirmed that ACS-NSQIP had the best overall clinical utility for selecting high-risk patients.


Random-effects meta-analysis revealed substantial and statistically significant heterogeneity of hospital-specific Brier scores for NELA, P-POSSUM, and POTTER, whereas heterogeneity was low and nonsignificant for ACS-NSQIP and the surgeon-adjusted ACS-NSQIP (Supplemental Digital Content, Supplementary Figs. S5–S10). As seen from the 95% prediction intervals in Figure 3, only ACS-NSQIP and its surgeon-adjusted version would be expected to maintain their scaled Brier score within acceptable limits in future single-center studies.
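A DerSimonian-Laird random-effects pooling with a Higgins-style 95% prediction interval can be sketched as follows. This is illustrative only; hospital-specific score estimates and their variances are assumed inputs, and the study's analysis was run in STATA.

```python
import numpy as np
from scipy import stats

def random_effects_meta(estimates, variances):
    """DerSimonian-Laird random-effects pooling of hospital-specific scores,
    returning the pooled estimate and a 95% prediction interval that
    describes the expected score in a new center."""
    y, v = np.asarray(estimates, float), np.asarray(variances, float)
    k = len(y)
    w = 1.0 / v                                   # fixed-effect weights
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)            # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)            # between-hospital variance
    w_re = 1.0 / (v + tau2)                       # random-effects weights
    mu = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    t = stats.t.ppf(0.975, k - 2)                 # t quantile, k-2 df
    half = t * np.sqrt(tau2 + se ** 2)
    return mu, (mu - half, mu + half)
```

The prediction interval is wider than the CI because it adds the between-hospital variance `tau2` to the pooled estimate's sampling variance, which is exactly the dispersion summarized in Figure 3.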

Figure 3:
Summary forest plot with overall scaled Brier score for each prognostic model based on the results of random-effects meta-analysis of hospital-specific data. The CI quantifies the precision in estimating the average Brier score in this study, whereas the prediction interval quantifies the dispersion of the Brier score value in future single-center studies by accounting for between-hospitals heterogeneity.


We presented the results of a comprehensive external validation of four commonly cited prognostic models applied to a large multicenter cohort of Greek patients who underwent EL over a 17-month period at 11 hospitals. The discordant case mix between this cohort and the cohorts in which the models were originally developed implies that this study assesses the broader transportability of the models in a different setting rather than mere reproducibility in patients similar to those in the original model development cohorts. The results of this assessment favor the use of the ACS-NSQIP model, including its surgeon-adjusted version, when assessing the prognosis of EL patients. The ACS-NSQIP outperformed all other models on several metrics of predictive performance, demonstrated clinical utility over a wider range of risk thresholds for decision making, and exhibited minimal heterogeneity across hospitals.

This is the first study to perform comparative external validation of different risk prediction tools for EL patients in a prospective design in Greece. In contrast to our findings, previously performed comparative external validations for emergency abdominal surgery in other countries have mostly favored the NELA risk prediction tool.12,13,32,33 In a cohort of 758 EL patients in New Zealand, NELA presented superior discrimination and calibration compared with ACS-NSQIP, P-POSSUM, and APACHE II.13 Lai et al.32 compared NELA with P-POSSUM in an Asian population and concluded that NELA predicts 30-day mortality more accurately. A multicenter Australian study concluded that the NELA was highly sensitive and comparable with the P-POSSUM and ACS-NSQIP models in EL patients.33 A recent analysis of 650 EL patients from the NELA database favored the discriminative power of the NELA over that of the P-POSSUM.12 The aforementioned studies were all retrospective, and 30-day mortality was reported in no more than 60 patients (range, 47–60 patients), implying that their external validation statistics may have been relatively imprecise. Only two of those studies involved comparisons with the ACS-NSQIP model.13,33 A recent meta-analysis of the accuracy of ACS-NSQIP in emergency abdominal surgery also pointed out that the existing literature consists exclusively of retrospective studies that are mostly underpowered.34

The predictive ability of ACS-NSQIP in emergency surgical patients has been reported to be inferior to that in elective cases, with underestimation of the mortality risk of patients undergoing emergency surgery.35 Use of subjective surgeon assessment and subsequent utilization of surgeon-adjusted risk scores has been reported in geriatric patients undergoing lumbar surgery.36 For patients judged to have “somewhat higher than estimated” risks according to the Surgeon Adjustment Score, the ACS-NSQIP prediction of postoperative mortality was accurate, while, in patients with “significantly higher than estimated” risks, the model accurately predicted the risks of surgical site infection and reoperation.36 The importance of combining subjective assessment with an objective risk prediction tool was emphasized in a recent study validating prediction models for surgical patients, which reported that the combination of the best predictive model with the surgeon's subjective assessment was superior to any predictive model alone.37 Similarly, we found that calibrating the ACS-NSQIP prediction of 30-day mortality on the basis of preoperative subjective assessment improved predictive performance in our cohort of patients undergoing EL.

External validations should use standardized performance metrics and adhere to guidelines for reporting model performance to allow for comprehensive and informative comparisons of prognostic models in different patient populations. Reporting of multivariable prediction models has been shown to be insufficient before the implementation of the TRIPOD statement.38 A recent systematic review emphasized the lack of prospective cohorts in external validation studies and revealed that reporting of key performance measures was largely incomplete with median completeness of the TRIPOD checklist at only 61%.39 Methodological issues, such as poor assessment of calibration, have also been pointed out in a systematic review of risk assessment tools for EL.40 Therefore, we strictly followed the TRIPOD guidelines and used multiple metrics of predictive performance to thoroughly compare the prediction models in this study.

Good discrimination and calibration metrics do not necessarily guarantee that use of a model will aid decision making.26 We therefore performed DCA to identify the model offering the highest NB for labeling a patient as “high risk” in clinical practice.26 The results of the DCA should be interpreted with caution. Superiority of the ACS-NSQIP implies that it should be used in everyday practice as part of shared decision making in our setting, without committing to a specific threshold.41 In a cohort of patients undergoing hepatopancreaticobiliary surgery, DCA performed on ACS-NSQIP and POTTER favored their use for guiding interventions on important outcomes, such as readmission and venous thromboembolism.42 In that cohort, ACS-NSQIP had superior discrimination, but POTTER demonstrated NB over a wider range of risk thresholds for venous thromboembolism, and the authors pointed out the importance of not relying solely on metrics such as the concordance statistic (or AUC).42 Assessing the performance of prediction models solely on the basis of concordance statistics may lead to misleading conclusions, because a model's discriminatory performance is bound to be lower in a homogeneous sample with restricted case mix, and discrimination provides no information about the calibration of predicted risks against observed events.43,44 The results of the DCA in this study indicate that using the ACS-NSQIP model and its surgeon-adjusted version may guide the surgeon in modifying interventions, and that their NB is maintained over a wider range of thresholds for defining a high-risk patient compared with the other models assessed here.

Our study has a number of limitations. First, it is important to acknowledge that this study represents a self-selected group of hospitals, mostly university or tertiary-care hospitals, and our patient cohort might not be a true population-based or nationally representative sample of EL patients. Second, while missing data were minimal for the NELA, P-POSSUM, and ACS-NSQIP models, risk calculations from POTTER were not possible for most patients in two participating centers, where data necessary for applying this tool are not part of the routine preoperative workup of EL patients. Multiple imputation has been suggested as the preferred approach for handling missing predictor data,45 but we opted for complete case analysis so that our results come from a pragmatic cohort of patients for whom risks can be readily estimated from preoperative data. Third, our analysis showed that recalibrating the models improved predictive accuracy for all models in our cohort; however, full model revision (reestimation) from our data set was not possible, as the selected risk prediction tools are proprietary and their equations are undisclosed. Finally, not all available risk prediction tools could possibly be validated in a single study, and we chose to examine four well-known and commonly cited models for which comparative validation studies are lacking for EL patients.

The findings of this study are promising for the use of the ACS-NSQIP model in the Greek health care system, demonstrating its broad transportability to a setting different from that of the United States, where the model was developed. More comparative validations of different risk prediction tools should be performed at national levels to determine which model might best fit each health care setting. Our results imply that combining the subjective assessment of the attending surgeon with a properly validated tool yields the most accurate prediction; future validation studies could therefore focus on such combinations. Complete model reestimation with adjustment of key variables and revalidation in a new sample of patients may generate adjusted versions of existing models that best fit specific patient populations.


The ACS-NSQIP tool was most accurate for mortality predictions after EL in a broad external validation cohort, outperforming the surgeon's preoperative risk assessment and the NELA, P-POSSUM, and POTTER tools in several comparative prediction metrics and demonstrating utility for facilitating preoperative risk management in the Greek health care system. The surgeon's subjective risk assessment may help optimize ACS-NSQIP predictions.


S.K., E.I.K., and K.L. contributed in the study conception and design, data analysis and interpretation, and manuscript preparation. S.K., K.L., K.P., G.-A.K., V.M., A.K., M.P., C.S.A., N.D., D.K., G.K., I.P., P.P., K.F., A.S., I.M.P., M.P., I.M., G.D., D.P., V.N., K.T., S.T., I.P., G.S., G.G., K.K., and M.M. contributed in the acquisition of data. E.I.K. contributed in the analysis and interpretation of data (lead). S.K., O.Z., I.K., N.G., D.S., K.N., N.V.M., E.L., G.K.G., G.P.-G., G.Z., M.K., E.C., and K.L. contributed in the critical review/revision. K.L. is the principal investigator and takes primary responsibility for the manuscript. All authors read and approved the final version of the manuscript.


The authors declare no conflicts of interest.


1. McLean RC, Brown LR, Baldock TE, O’Loughlin P, McCallum IJ. Evaluating outcomes following emergency laparotomy in the north of England and the impact of the National Emergency Laparotomy Audit — a retrospective cohort study. Int J Surg. 2020;77:154–162.
2. Fagan G, Barazanchi A, Coulter G, Leeman M, Hill AG, Eglinton TW. New Zealand and Australia emergency laparotomy mortality rates compare favourably to international outcomes: a systematic review. ANZ J Surg. 2021;91(12):2583–2591.
3. Jansson Timan T, Hagberg G, Sernert N, Karlsson O, Prytz M. Mortality following emergency laparotomy: a Swedish cohort study. BMC Surg. 2021;21(1):322.
4. Tolstrup M-B, Watt SK, Gögenur I. Morbidity and mortality rates after emergency abdominal surgery: an analysis of 4346 patients scheduled for emergency laparotomy or laparoscopy. Langenbecks Arch Surg. 2017;402(4):615–623.
5. Tan BHL, Mytton J, Al-Khyatt W, Aquina CT, Evison F, Fleming F, et al. A comparison of mortality following emergency laparotomy between populations from New York state and England. Ann Surg. 2017;266(2):280–286.
6. Huddart S, Peden CJ, Swart M, McCormick B, Dickinson M, Mohammed MA, et al. Use of a pathway quality improvement care bundle to reduce mortality after emergency laparotomy. Br J Surg. 2014;102(1):57–66.
7. Sivarajah V, Walsh U, Malietzis G, Kontovounisios C, Pandey V, Pellino G. The importance of discussing mortality risk prior to emergency laparotomy. Updates Surg. 2020;72(3):859–865.
8. Mak M, Hakeem A, Chitre V. Pre-NELA vs NELA — has anything changed, or is it just an audit exercise? Ann R Coll Surg Engl. 2016;98(8):554–559.
9. Harris EP, MacDonald DB, Boland L, Boet S, Lalu MM, McIsaac DI. Personalized perioperative medicine: a scoping review of personalized assessment and communication of risk before surgery. Can J Anesth. 2019;66(9):1026–1037.
10. Smith MTD, Bruce JL, Clarke DL. Using machine learning to establish predictors of mortality in patients undergoing laparotomy for emergency general surgical conditions. World J Surg. 2022;46(2):339–346.
11. Peden CJ, Aggarwal G, Aitken RJ, Anderson ID, Bang Foss N, Cooper Z, et al. Guidelines for perioperative care for emergency laparotomy Enhanced Recovery After Surgery (ERAS) Society recommendations: part 1—preoperative: diagnosis, rapid assessment and optimization. World J Surg. 2021;45(5):1272–1290.
12. Thahir A, Pinto-Lopes R, Madenlidou S, Daby L, Halahakoon C. Mortality risk scoring in emergency general surgery: are we using the best tool? J Perioper Pract. 2021;31(4):153–158.
13. Barazanchi A, Bhat S, Palmer-Neels K, Macfater WS, Xia W, Zeng I, et al. Evaluating and improving current risk prediction tools in emergency laparotomy. J Trauma Acute Care Surg. 2020;89(2):382–387.
14. Havens JM, Columbus AB, Seshadri AJ, Brown CVR, Tominaga GT, Mowery NT, et al. Risk stratification tools in emergency general surgery. Trauma Surg Acute Care Open. 2018;3(1):e000160.
15. Lasithiotakis K, Kritsotakis EI, Kokkinakis S, Petra G, Paterakis K, Karali GA, et al. The Hellenic Emergency Laparotomy Study (HELAS): a prospective multicentre study on the outcomes of emergency laparotomy in Greece. World J Surg. 2023;47(1):130–139.
16. Zacharis G, Seretis C. Letter to the editor: the Hellenic Emergency Laparotomy Study (HELAS): a prospective multicentre study on the outcomes of emergency laparotomy in Greece. World J Surg. 2023;47(2):554–555.
17. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med. 2015;162(1):55–63.
18. Pavlou M, Qu C, Omar RZ, Seaman SR, Steyerberg EW, White IR, et al. Estimation of required sample size for external validation of risk models for binary outcomes. Stat Methods Med Res. 2021;30(10):2187–2206.
19. Collins GS, Ogundimu EO, Altman DG. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med. 2016;35(2):214–226.
20. Kattan MW, Gerds TA. The index of prediction accuracy: an intuitive measure useful for evaluating risk prediction models. Diagn Progn Res. 2018;2:7.
21. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–845.
22. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW. Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230.
23. Debray TPA, Vergouwe Y, Koffijberg H, Nieboer D, Steyerberg EW, Moons KGM. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol. 2015;68(3):279–289.
24. Stevens RJ, Poppe KK. Validation of clinical prediction models: what does the “calibration slope” really measure? J Clin Epidemiol. 2020;118:93–99.
25. Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med. 2014;33(3):517–535.
26. Vickers AJ, Holland F. Decision curve analysis to evaluate the clinical benefit of prediction models. Spine J. 2021;21(10):1643–1648.
27. Snell KIE, Ensor J, Debray TPA, Moons KGM, Riley RD. Meta-analysis of prediction model performance across multiple studies: which scale helps ensure between-study normality for the C-statistic and calibration measures? Stat Methods Med Res. 2018;27(11):3505–3522.
28. Eugene N, Oliver CM, Bassett MG, Poulton TE, Kuryba A, Johnston C, et al. Development and internal validation of a novel risk adjustment model for adult patients undergoing emergency laparotomy surgery: the National Emergency Laparotomy Audit risk model. Br J Anaesth. 2018;121(4):739–748.
29. Bilimoria KY, Liu Y, Paruch JL, Zhou L, Kmiecik TE, Ko CY, et al. Development and evaluation of the universal ACS NSQIP surgical risk calculator: a decision aid and informed consent tool for patients and surgeons. J Am Coll Surg. 2013;217(5):833–42.e1–3.
30. Bertsimas D, Dunn J, Velmahos GC, Kaafarani HMA. Surgical risk is not linear: derivation and validation of a novel, user-friendly, and machine-learning-based Predictive OpTimal Trees in Emergency Surgery Risk (POTTER) calculator. Ann Surg. 2018;268(4):574–583.
31. Prytherch DR, Whiteley MS, Higgins B, Weaver PC, Prout WG, Powell SJ. POSSUM and Portsmouth POSSUM for predicting mortality. Br J Surg. 1998;85(9):1217–1220.
32. Lai CPT, Goo TT, Ong MW, Prakash PS, Lim WW, Drakeford PA. A comparison of the P-POSSUM and NELA risk score for patients undergoing emergency laparotomy in Singapore. World J Surg. 2021;45(8):2439–2446.
33. Hunter Emergency Laparotomy Collaborator Group. High-risk emergency laparotomy in Australia: comparing NELA, P-POSSUM, and ACS-NSQIP calculators. J Surg Res. 2020;246:300–304.
34. Parkin CJ, Moritz P, Kirkland O, Glover A. What is the accuracy of the ACS-NSQIP surgical risk calculator in emergency abdominal surgery? A meta-analysis. J Surg Res. 2021;268:300–307.
35. Hyder JA, Reznor G, Wakeam E, Nguyen LL, Lipsitz SR, Havens JM. Risk prediction accuracy differs for emergency versus elective cases in the ACS-NSQIP. Ann Surg. 2016;264(6):959–965.
36. Wang X, Hu Y, Zhao B, Su Y. Predictive validity of the ACS-NSQIP surgical risk calculator in geriatric patients undergoing lumbar surgery. Medicine (Baltimore). 2017;96(43):e8416.
37. Wong DJN, Harris S, Sahni A, Bedford JR, Cortes L, Shawyer R, et al. Developing and validating subjective and objective risk-assessment measures for predicting mortality after major surgery: an international prospective cohort study. PLoS Med. 2020;17(10):e1003253.
38. Heus P, Damen JAAG, Pajouheshnia R, Scholten RJPM, Reitsma JB, Collins GS, et al. Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement. BMC Med. 2018;16(1):120.
39. Groot OQ, Bindels BJJ, Ogink PT, Kapoor ND, Twining PK, Collins AK, et al. Availability and reporting quality of external validations of machine-learning prediction models with orthopedic surgical outcomes: a systematic review. Acta Orthop. 2021;92(4):385–393.
40. Oliver CM, Walker E, Giannaris S, Grocott MPW, Moonesinghe SR. Risk assessment tools validated for patients undergoing emergency laparotomy: a systematic review. Br J Anaesth. 2015;115(6):849–860.
41. Vickers AJ, van Calster B, Steyerberg EW. A simple, step-by-step guide to interpreting decision curve analysis. Diagn Progn Res. 2019;3:18.
42. Dadashzadeh ER, Bou-Samra P, Huckaby LV, Nebbia G, Handzel RM, Varley PR, et al. Leveraging decision curve analysis to improve clinical application of surgical risk calculators. J Surg Res. 2021;261:58–66.
43. Merkow RP, Hall BL, Cohen ME, Dimick JB, Wang E, Chow WB, et al. Relevance of the C-statistic when evaluating risk-adjustment models in surgery. J Am Coll Surg. 2012;214(5):822–830.
44. Cohen ME, Liu Y, Ko CY, Hall BL. An examination of American College of Surgeons NSQIP surgical risk calculator accuracy. J Am Coll Surg. 2017;224(5):787–795.e1.
45. Hoogland J, van Barreveld M, Debray TPA, Reitsma JB, Verstraelen TE, Dijkgraaf MGW, et al. Handling missing predictor values when validating and applying a prediction model to new patients. Stat Med. 2020;39(25):3591–3607.

Laparotomy; prediction rule; mortality; risk; validation

Supplemental Digital Content

Copyright © 2023 Wolters Kluwer Health, Inc. All rights reserved.