The American College of Surgeons NSQIP risk calculator (RC) uses regression to predict fourteen 30-day surgical outcomes. Although this approach yields risk estimates that are accurate in terms of discrimination and calibration, machine learning (ML) might improve them. To investigate this possibility, the accuracy of regression-based risk estimates was compared with that of estimates from an extreme gradient boosting (XGB) ML algorithm.
A cohort of 5,020,713 NSQIP patient records was randomly divided into 80% for model construction and 20% for validation. Risk predictions using regression and XGB-ML were made for 13 RC binary 30-day surgical complications and one continuous outcome (length of stay [LOS]). For the binary outcomes, discrimination was evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), and calibration was evaluated using Hosmer–Lemeshow statistics. For the continuous LOS outcome, mean squared error and a calibration curve analog were evaluated.
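As an illustration of the binary-outcome metrics described above, the following sketch computes AUROC, AUPRC, and a decile-based Hosmer–Lemeshow chi-square on synthetic data (not the NSQIP cohort); it assumes scikit-learn is available, and the `hosmer_lemeshow` helper is a hypothetical implementation for demonstration only.

```python
# Sketch of the evaluation metrics: discrimination (AUROC, AUPRC) and
# calibration (Hosmer-Lemeshow) on synthetic predicted probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 10_000
y_true = rng.binomial(1, 0.1, size=n)               # rare binary complication (~10%)
# Synthetic risk scores: positives shifted upward so the model discriminates.
y_prob = np.clip(y_true * 0.3 + rng.uniform(0.0, 0.7, size=n), 0.0, 1.0)

auroc = roc_auc_score(y_true, y_prob)               # discrimination (ROC)
auprc = average_precision_score(y_true, y_prob)     # discrimination (precision-recall)

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square over risk deciles: compares observed vs
    expected event counts within bins of predicted risk (calibration)."""
    order = np.argsort(p)
    chi2 = 0.0
    for idx in np.array_split(order, groups):       # deciles of predicted risk
        obs = y[idx].sum()                          # observed events in bin
        exp = p[idx].sum()                          # expected events in bin
        pbar = exp / len(idx)                       # mean predicted risk in bin
        chi2 += (obs - exp) ** 2 / (len(idx) * pbar * (1.0 - pbar))
    return chi2

hl = hosmer_lemeshow(y_true, y_prob)
print(f"AUROC={auroc:.3f}  AUPRC={auprc:.3f}  HL chi2={hl:.1f}")
```

Larger Hosmer–Lemeshow values indicate greater miscalibration, which is how the regression and XGB-ML estimates are contrasted in the results below.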
For every binary outcome, discrimination (AUROC and AUPRC) was slightly greater for XGB-ML than for regression (mean AUROC across the outcomes was 0.8299 vs 0.8251, and mean AUPRC was 0.1558 vs 0.1476, for XGB-ML and regression, respectively). For each outcome, miscalibration was greater (larger Hosmer–Lemeshow values) with regression; miscalibration was statistically significant for all 13 regression-based estimates, but for only 4 of 13 XGB-ML-based estimates. For LOS, mean squared error was lower for XGB-ML.
XGB-ML provided more accurate risk estimates than regression in terms of both discrimination and calibration. The differences in calibration were of substantial magnitude and support transitioning the RC to XGB-ML.