The performance of prediction models can be assessed using a variety of methods and metrics. Traditional measures for binary and survival outcomes include the Brier score to indicate overall model performance, the concordance (or c) statistic for discriminative ability (equivalent, for binary outcomes, to the area under the receiver operating characteristic [ROC] curve), and goodness-of-fit statistics for calibration.
Several new measures have recently been proposed that can be seen as refinements of discrimination measures, including variants of the c statistic for survival, reclassification tables, net reclassification improvement (NRI), and integrated discrimination improvement (IDI). Moreover, decision-analytic measures have been proposed, including decision curves to plot the net benefit achieved by making decisions based on model predictions.
We aimed to define the role of these relatively novel approaches in the evaluation of the performance of prediction models. For illustration, we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer (n = 544 for model development, n = 273 for external validation).
We suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the predictive model is to be used for clinical decisions. Other measures of performance may be warranted in specific applications, such as reclassification metrics to gain insight into the value of adding a novel predictor to an established model.
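As a rough illustration of three of the measures named above, the following minimal Python sketch computes the Brier score, the c statistic, and the net benefit at a single decision threshold for a binary outcome. The function names and the four-observation toy data are hypothetical and are not taken from the testicular cancer case study. The net benefit formula used here (true positives per patient minus false positives per patient, weighted by the odds of the threshold) is the commonly used definition and is assumed rather than quoted from this article.

```python
def brier_score(y, p):
    """Mean squared difference between binary outcomes y and predicted probabilities p."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def c_statistic(y, p):
    """Share of event/non-event pairs in which the event has the higher
    predicted probability; tied predictions count as one half."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))


def net_benefit(y, p, threshold):
    """Net benefit at one threshold (assumed definition):
    TP/n - FP/n * threshold / (1 - threshold)."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical toy data: observed outcomes and model predictions.
y_toy = [1, 0, 1, 0]
p_toy = [0.8, 0.3, 0.6, 0.7]

print("Brier score:", brier_score(y_toy, p_toy))
print("c statistic:", c_statistic(y_toy, p_toy))
print("Net benefit at 0.5:", net_benefit(y_toy, p_toy, 0.5))
```

Evaluating `net_benefit` over a range of thresholds and plotting the results would give a simple decision curve. Calibration and the reclassification measures are not covered by this sketch.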
From the (a) Department of Public Health, Erasmus MC, Rotterdam, The Netherlands; (b) Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY; (c) Brigham and Women's Hospital, Harvard Medical School, Boston, MA; (d) Institute of Public Health, University of Copenhagen, Copenhagen, Denmark; (e) Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH; and (f) Department of Mathematics and Statistics, Boston University, Boston, MA.
Submitted 9 February 2009; accepted 24 June 2009.
This paper was based on discussions at an international symposium “Measuring the accuracy of prediction models” (Cleveland, OH, Sept 29, 2008, http://www.bio.ri.ccf.org/html/symposium.html), which was supported by the Cleveland Clinic Department of Quantitative Health Sciences and the Page Foundation.
Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com).
Editors' note: Related articles appear on pages 139 and 142.
Correspondence: Ewout W. Steyerberg, Department of Public Health, Erasmus MC, PO Box 2040, 3000 CA Rotterdam, The Netherlands. E-mail: firstname.lastname@example.org.