
Methods: Original Article

Assessing the Performance of Prediction Models

A Framework for Traditional and Novel Measures

Steyerberg, Ewout W.; Vickers, Andrew J.; Cook, Nancy R.; Gerds, Thomas; Gonen, Mithat; Obuchowski, Nancy; Pencina, Michael J.; Kattan, Michael W.

Epidemiology 21(1):128–138, January 2010. DOI: 10.1097/EDE.0b013e3181c30fb2


From a research perspective, diagnosis and prognosis constitute a similar challenge: the clinician has some information and wants to know how this relates to the true patient state, whether this can be known currently (diagnosis) or only at some point in the future (prognosis). This information can take various forms, including a diagnostic test, a marker value, or a statistical model including several predictor variables. For most medical applications, the outcome of interest is binary and the information can be expressed as probabilistic predictions.1 Predictions are hence absolute risks, which go beyond assessments of relative risks, such as regression coefficients, odds ratios, or hazard ratios.2

There are various ways to assess the performance of a statistical prediction model. The customary statistical approach is to quantify how close predictions are to the actual outcome, using measures such as explained variation (eg, R2 statistics) and the Brier score.3 Performance can further be quantified in terms of calibration (do close to x of 100 patients with a risk prediction of x% have the outcome?), using, for example, the Hosmer-Lemeshow “goodness-of-fit” test.4 Furthermore, discrimination is essential (do patients with the outcome have higher risk predictions than those without?). Discrimination can be quantified with measures such as sensitivity, specificity, and the area under the receiver operating characteristic curve (or concordance statistic, c).1,5

Recently, several new measures have been proposed to assess the performance of a prediction model. These include variants of the c statistic for survival,6,7 reclassification tables,8 net reclassification improvement (NRI), and integrated discrimination improvement (IDI),9 which are refinements of discrimination measures. The concept of risk reclassification has caused substantial discussion in the methodologic and clinical literature.10–14 Moreover, decision-analytic measures have been proposed, including "decision curves" that plot the net benefit achieved by making decisions based on model predictions.15 These measures have not yet been widely used in practice, which may partly be due to their novelty among applied researchers.16 In this article, we aim to clarify the role of these relatively novel approaches in the evaluation of the performance of prediction models.

We first briefly discuss prediction models in medicine. Next, we review the properties of a number of traditional and relatively novel measures for the assessment of the performance of an existing prediction model, or extensions to a model. For illustration, we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer.

PREDICTION MODELS IN MEDICINE

Developing Valid Prediction Models

We consider prediction models that provide predictions for a dichotomous outcome, because these are most relevant in medical applications. The outcome can be either an underlying diagnosis (eg, presence of benign or malignant histology in a residual mass after cancer treatment), an outcome occurring within a relatively short time after making the prediction (eg, 30-day mortality), or a long-term outcome (eg, 10-year incidence of coronary artery disease, with censored follow-up of some patients).

At model development, we aim for at least internally valid predictions, ie, predictions that are valid for subjects from the underlying population.17 Preferably, the predictions are also generalizable to “plausibly related” populations.18 Various epidemiologic and statistical issues need to be considered in a modeling strategy for empirical data.1,19,20 When a model is developed, it is obvious that we want some quantification of its performance, such that we can judge whether the model is adequate for its purpose, or better than an existing model.

Model Extension With a Marker

We recognize that a key interest in contemporary medical research is whether a marker (eg, molecular, genetic, imaging) adds to the performance of an existing model. Often, new markers are selected from a large set based on strength of association in a particular study. This poses a high risk of overoptimistic expectations of the marker's performance.21,22 Moreover, we are interested in only the incremental value of a marker, on top of predictors that are readily accessible. Validation in fully independent, external data is the best way to compare the performance of a model with and without a new marker.21,23

Usefulness of Prediction Models

Prediction models can be useful for several purposes, such as to decide inclusion criteria or covariate adjustment in a randomized controlled trial.24–26 In observational studies, a prediction model may be used for confounder adjustment or case-mix adjustment in comparing an outcome between centers.27 We concentrate here on the usefulness of a prediction model for medical practice, including public health (eg, screening for disease) and patient care (diagnosing patients, giving prognostic estimates, decision support).

An important role of prediction models is to inform patients about their prognosis, for example, after a cancer diagnosis has been made.28 A natural requirement for a model in this situation is that predictions are well calibrated (or “reliable”).29,30

A specific situation may be that limited resources need to be targeted to those with the highest expected benefit, such as those at highest risk. This situation calls for a model that accurately distinguishes those at high risk from those at low risk.

Decision support is another important area, including decisions on the need for further diagnostic testing (tests may be burdensome or costly to a patient), or therapy (eg, surgery with risks of morbidity and mortality).31 Such decisions are typically binary and require decision thresholds that are clinically relevant.

TRADITIONAL PERFORMANCE MEASURES

We briefly consider some of the more commonly used performance measures in medicine, without intending to be comprehensive (Table 1).

TABLE 1:
Characteristics of Some Traditional and Novel Performance Measures

Overall Performance Measures

From a statistical modeler's perspective, the distance between the predicted outcome and actual outcome is central to quantifying overall model performance.32 The distance is Y–Ŷ for continuous outcomes. For binary outcomes, with Y defined 0–1, Ŷ is equal to the predicted probability p, and for survival outcomes, it is the predicted event probability at a given time (or as a function of time). These distances between observed and predicted outcomes are related to the concept of “goodness-of-fit” of a model, with better models having smaller distances between predicted and observed outcomes. The main difference between goodness-of-fit and predictive performance is that the former is usually evaluated in the same data while the latter requires either new data or cross-validation.

Explained variation (R2) is the most common performance measure for continuous outcomes. For generalized linear models, Nagelkerke's R2 is often used.1,33 This is a logarithmic scoring rule: for binary outcomes Y, we score a model with the logarithm of the predictions p, as Y × log(p) + (1 − Y) × log(1 − p). Nagelkerke's R2 can also be calculated for survival outcomes, based on the difference in −2 log likelihood between a model without and a model with one or more predictors.
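As a concrete illustration of this likelihood-based calculation, the following is a minimal sketch in base R of Nagelkerke's R2 for a logistic regression model. The data frame dat, the outcome y, and the predictors x1 and x2 are illustrative assumptions, not the code provided in the eAppendix.

# Sketch: Nagelkerke's R2 from the log likelihoods of a null and a fitted model
# (assumed data frame `dat` with binary outcome y and predictors x1, x2).
fit0 <- glm(y ~ 1, family = binomial, data = dat)        # null model
fit1 <- glm(y ~ x1 + x2, family = binomial, data = dat)  # model with predictors
n   <- nrow(dat)
LR  <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))     # likelihood ratio statistic
r2_coxsnell   <- 1 - exp(-LR / n)                        # Cox-Snell R2
r2_nagelkerke <- r2_coxsnell / (1 - exp(2 * as.numeric(logLik(fit0)) / n))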

The Brier score is a quadratic scoring rule, in which the squared differences between the actual binary outcomes Y and the predictions p are calculated: (Y − p)².34 We can also write this in a way similar to the logarithmic score: Y × (1 − p)² + (1 − Y) × p². The Brier score for a model can range from 0 for a perfect model to 0.25 for a noninformative model with a 50% incidence of the outcome. When the outcome incidence is lower, the maximum score for a noninformative model is lower; for example, for an incidence of 10%: 0.1 × (1 − 0.1)² + (1 − 0.1) × 0.1² = 0.090. Similar to Nagelkerke's approach to the LR statistic, we can scale the Brier score by its maximum under a noninformative model: Brier_scaled = 1 − Brier/Brier_max, where Brier_max = mean(p) × (1 − mean(p)), so that it ranges from 0% to 100%. This scaled Brier score is very similar to Pearson's R2 statistic.35
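The Brier score and its scaled version follow directly from these formulas. A minimal sketch in base R, assuming vectors y (0/1 outcomes) and p (predicted probabilities); the names are illustrative:

# Sketch: Brier score and scaled Brier score
# (assumed: y = vector of 0/1 outcomes, p = vector of predicted probabilities).
brier     <- mean((y - p)^2)
brier_max <- mean(p) * (1 - mean(p))      # Brier score of a noninformative model
brier_scaled <- 1 - brier / brier_max     # 0 = noninformative, 1 = perfect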

Calculation of the Brier score for survival outcomes is possible with a weight function, which considers the conditional probability of being uncensored over time.3,36,37 We can then calculate the Brier score at fixed time points and create a time-dependent curve. It is useful to use a benchmark curve based on the Brier score for the overall Kaplan-Meier estimator, which does not consider any predictive information.3 It turns out that overall performance measures comprise 2 important characteristics of a prediction model, discrimination and calibration, each of which can be assessed separately.

Discrimination

Accurate predictions discriminate between those with and those without the outcome. Several measures can be used to indicate how well we classify patients in a binary prediction problem. The concordance (c) statistic is the most commonly used performance measure to indicate the discriminative ability of generalized linear regression models. For a binary outcome, c is identical to the area under the receiver operating characteristic (ROC) curve, which plots the sensitivity (true positive rate) against 1 – specificity (false positive rate) for consecutive cut-offs for the probability of an outcome.

The c statistic is a rank-order statistic for predictions against true outcomes, related to Somers' D statistic.1 As a rank-order statistic, it is insensitive to systematic errors in calibration such as differences in average outcome. A popular extension of the c statistic with censored data can be obtained by ignoring the pairs that cannot be ordered.1 It turns out that this results in a statistic that depends on the censoring pattern. Gonen and Heller have proposed a method to estimate a variant of the c statistic that is independent of censoring, but holds only in the context of a Cox proportional hazards model.7 Furthermore, time-dependent c statistics have been proposed.6,38

In addition to the c statistic, the discrimination slope can be used as a simple measure for how well subjects with and without the outcome are separated.39 The discrimination slope is calculated as the absolute difference in average predictions for those with and without the outcome. Visualization is readily possible with a box plot or a histogram; a better discriminating model will show less overlap between those with and those without the outcome. Extensions of the discrimination slope have not yet been made to the survival context.
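For a binary outcome, both measures reduce to simple summaries of the predicted probabilities in the two outcome groups. A minimal sketch in base R, with vectors y and p as assumed in the sketch above:

# Sketch: c statistic via the rank (Mann-Whitney) formulation, and discrimination slope.
n1 <- sum(y == 1); n0 <- sum(y == 0)
c_stat <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)   # area under the ROC curve
disc_slope <- mean(p[y == 1]) - mean(p[y == 0])                    # difference in mean predictions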

Calibration

Calibration refers to the agreement between observed outcomes and predictions.29 For example, if we predict a 20% risk of residual tumor for a testicular cancer patient, the observed frequency of tumor should be approximately 20 of 100 patients with such a prediction. A graphical assessment of calibration is possible, with predictions on the x-axis and the outcome on the y-axis. Perfect predictions should lie on the 45-degree line. For linear regression, the calibration plot is a simple scatter plot. For binary outcomes, the plot contains only 0 and 1 values on the y-axis. Smoothing techniques can be used to estimate the observed probability of the outcome (p(y = 1)) in relation to the predicted probabilities, eg, using the loess algorithm.1 The specific type of smoothing may, however, affect the graphical impression, especially in smaller data sets. We can also group subjects with similar predicted probabilities and compare the mean predicted probability with the mean observed outcome. For example, we can plot the observed outcome by decile of predictions, which makes the plot a graphical illustration of the Hosmer-Lemeshow goodness-of-fit test. A better discriminating model has more spread between such deciles than a poorly discriminating model. Note that such grouping, though common, is arbitrary and imprecise.
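A minimal sketch in base R of such a calibration plot, grouping by deciles of predicted risk and adding a smoother; the vectors y and p are as assumed above, and the plotting choices are illustrative:

# Sketch: calibration plot with observed proportions by decile of predicted risk.
dec <- cut(p, unique(quantile(p, probs = seq(0, 1, 0.1))), include.lowest = TRUE)
plot(tapply(p, dec, mean), tapply(y, dec, mean),
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)                 # 45-degree line of perfect calibration
lines(lowess(p, y, iter = 0))         # smoothed observed frequency vs predictions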

The calibration plot can be characterized by an intercept a, which indicates the extent to which predictions are systematically too low or too high ("calibration-in-the-large"), and a calibration slope b, which should ideally be 1.40 Such a recalibration framework was proposed earlier by Cox.41 At model development, a = 0 and b = 1 for regression models. At validation, calibration-in-the-large problems are common, as is a value of b smaller than 1, reflecting overfitting of the model.1 A value of b smaller than 1 can also be interpreted as reflecting a need for shrinkage of the regression coefficients in a prediction model.42,43
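These two recalibration parameters can be estimated by refitting logistic models on the linear predictor of the original model. A minimal sketch in base R, assuming y holds the observed 0/1 outcomes and p the model's predicted probabilities in the validation data:

# Sketch: calibration-in-the-large and calibration slope in validation data
# (assumes 0 < p < 1 so that the log odds are finite).
lp <- qlogis(p)                                                   # linear predictor (log odds)
cal_slope     <- coef(glm(y ~ lp, family = binomial))[2]          # ideally b = 1
cal_intercept <- coef(glm(y ~ offset(lp), family = binomial))[1]  # ideally a = 0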

NOVEL PERFORMANCE MEASURES

We now discuss some relatively novel performance measures, again without attempting to be comprehensive.

Novel Measures Related to Reclassification

Cook8 proposed constructing a "reclassification table" to show how many subjects are reclassified when a marker is added to a model. For example, a model with traditional risk factors for cardiovascular disease was extended with the predictors "parental history of myocardial infarction" and "CRP." The increase in the c statistic was minimal (from 0.805 to 0.808). However, when Cook classified the predicted risks into 4 categories (0–5, 5–10, 10–20, and >20% 10-year cardiovascular disease risk), about 30% of individuals changed category when the extended model was compared with the traditional one. Change in risk categories, however, is insufficient to evaluate improvement in risk stratification; the changes must be appropriate. One way to evaluate this is to compare the observed incidence of events in the cells of the reclassification table with the predicted probability from the original model. Cook proposed a reclassification test as a variant of the Hosmer-Lemeshow statistic within the reclassified categories, leading to a χ2 statistic.44

Pencina et al9 extended the reclassification idea by conditioning on the outcome: reclassification of subjects with and without the outcome should be considered separately. Any upward movement in categories for subjects with the outcome implies improved classification, and any downward movement indicates worse classification; the interpretation is opposite for subjects without the outcome. The improvement in reclassification was quantified as the sum of two differences: the proportion of individuals moving up minus the proportion moving down among those with the outcome, plus the proportion moving down minus the proportion moving up among those without the outcome. This sum was labeled the net reclassification improvement (NRI). A measure that integrates net reclassification over all possible cut-offs for the probability of the outcome was also proposed: the integrated discrimination improvement (IDI).9 The IDI is equivalent to the difference in discrimination slopes of 2 models, to the difference in Pearson R2 measures,45 and to the difference in scaled Brier scores.
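A minimal sketch in base R of the categorical NRI and the IDI; the vectors p_old and p_new (predictions without and with the marker), the outcome vector y, and the category cut-offs are illustrative assumptions:

# Sketch: net reclassification improvement (NRI) over risk categories, and IDI.
cuts <- c(0, 0.05, 0.10, 0.20, 1)                   # example risk categories
cat_old <- as.numeric(cut(p_old, cuts, include.lowest = TRUE))
cat_new <- as.numeric(cut(p_new, cuts, include.lowest = TRUE))
up   <- cat_new > cat_old
down <- cat_new < cat_old
nri <- (mean(up[y == 1]) - mean(down[y == 1])) +    # events: moving up is better
       (mean(down[y == 0]) - mean(up[y == 0]))      # nonevents: moving down is better
# IDI = difference in discrimination slopes between the two models
idi <- (mean(p_new[y == 1]) - mean(p_new[y == 0])) -
       (mean(p_old[y == 1]) - mean(p_old[y == 0]))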

Novel Measures Related to Clinical Usefulness

Some performance measures imply that false-negative and false-positive classifications are equally harmful. For example, the calculation of error rates is usually made by classifying subjects as positive when their predicted probability of the outcome exceeds 50%, and as negative otherwise. This implies an equal weighting of false-positive and false-negative classifications.

In the calculation of the NRI, the improvement in sensitivity and the improvement in specificity are summed. This implies relatively more weight for positive outcomes when a positive outcome is less common than a negative outcome, and less weight when it is more common. The weight is equal to the nonevents odds: (1 − mean(p))/mean(p), where mean(p) is the average probability of a positive outcome. Accordingly, although the weighting is not equal, it is not explicitly based on clinical consequences. Defining the best diagnostic test as the one closest to the top left-hand corner of the ROC curve (that is, the test with the highest sum of sensitivity and specificity, the Youden46 index: Se + Sp − 1) similarly implies weighting by the nonevents odds.

Vickers and Elkin15 proposed decision-curve analysis as a simple approach to quantify the clinical usefulness of a prediction model (or an extension to a model). For a formal decision analysis, harms and benefits need to be quantified, leading to an optimal decision threshold.47 It can be difficult to define this threshold.15 Difficulties may lie at the population level, ie, there are insufficient data on harms and benefits. Moreover, the relative weight of harms and benefits may differ from patient to patient, necessitating individual thresholds. Hence, we may consider a range of thresholds for the probability of the outcome, similar to ROC curves, which consider the full range of cut-offs rather than a single cut-off for a sensitivity/specificity pair.

The key aspect of decision-curve analysis is that a single probability threshold can be used both to categorize patients as positive or negative and to weight false-positive and false-negative classifications.48 If the harm of unnecessary treatment (a false-positive decision) is relatively limited, as for antibiotics for an infection, the cut-off should be low. In contrast, if overtreatment is quite harmful, such as extensive surgery, we should use a higher cut-off before a treatment decision is made. The harm-to-benefit ratio hence defines the relative weight w of false-positive decisions against true-positive decisions. For example, a cut-off of 10% implies that false-positive decisions are valued at 1/9th of a true-positive decision, and w = 0.11. The performance of a prediction model can then be summarized as a net benefit: NB = (TP − w × FP)/N, where TP is the number of true-positive decisions, FP the number of false-positive decisions, N the total number of patients, and w a weight equal to the odds of the cut-off (p_t/(1 − p_t)), or the ratio of harm to benefit.48 Documentation and software for decision-curve analysis are publicly available (www.decisioncurveanalysis.org).
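A minimal sketch in base R of the net benefit at a chosen threshold and of a simple decision curve comparing the model with a treat-all strategy; the vectors y and p are as assumed above, and the threshold grid is illustrative:

# Sketch: net benefit NB = (TP - w * FP) / N, with w = pt / (1 - pt).
net_benefit <- function(p, y, pt) {
  tp <- sum(p >= pt & y == 1)                 # true-positive decisions
  fp <- sum(p >= pt & y == 0)                 # false-positive decisions
  (tp - fp * pt / (1 - pt)) / length(y)
}
# Decision curve: net benefit of the model and of "treat all" over a range of thresholds.
thresholds <- seq(0.05, 0.60, by = 0.01)
nb_model <- sapply(thresholds, function(pt) net_benefit(p, y, pt))
nb_all   <- sapply(thresholds, function(pt)
  (sum(y == 1) - sum(y == 0) * pt / (1 - pt)) / length(y))
plot(thresholds, nb_model, type = "l", xlab = "Threshold probability", ylab = "Net benefit")
lines(thresholds, nb_all, lty = 2)            # treat-all reference; treat-none has NB = 0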

Validation Graphs as Summary Tools

We can extend the calibration graph to a validation graph.20 The distribution of predictions in those with and without the outcome is plotted at the bottom of the graph, capturing information on discrimination, similar to what is shown in a box plot. Moreover, it is important to have 95% confidence intervals (CIs) around deciles (or other quantiles) of predicted risk to indicate uncertainty in the assessment of validity. From the validation graph we can learn the discriminative ability of a model (eg, study the spread in observed outcomes by deciles of predicted risks), the calibration (closeness of observed outcomes to the 45-degree line), and the clinical usefulness (how many predictions are above or below clinically relevant decision thresholds).

APPLICATION TO TESTICULAR CANCER CASE STUDY

Patients

Men with metastatic nonseminomatous testicular cancer can often be cured by cisplatin-based chemotherapy. After chemotherapy, surgical resection is generally carried out to remove remnants of the initial metastases that may still be present. In the absence of tumor, resection has no therapeutic benefits, while it is associated with hospital admission and with risks of permanent morbidity and mortality. Logistic regression models were developed to predict the presence of residual tumor, combining well-known predictors such as the histology of the primary tumor, prechemotherapy levels of tumor markers, and (reduction in) residual mass size.49

We first consider a dataset of 544 patients to develop a prediction model that includes 5 predictors (Table 2). We then extend this model with the prechemotherapy level of the tumor marker lactate dehydrogenase (LDH). This illustrates ways to assess the incremental value of a marker. LDH values were standardized by dividing by the local upper limit of normal and then log-transformed, after examination of nonlinearity with restricted cubic spline functions.50 In a later study, we externally validated the 5-predictor model in 273 patients from a tertiary referral center, where LDH was not recorded.51 This comparison illustrates ways to assess the usefulness of a model in a new setting.

TABLE 2:
Logistic Regression Models in Testicular Cancer Dataset (n = 544), Without and With the Tumor Marker LDH

A clinically relevant cut-off point for the risk of tumor was based on a decision analysis, in which estimates from literature and from experts in the field were used to formally weigh the harms of missing tumor against the benefits of resection in those with tumor.52 This analysis indicated that a risk threshold of 20% would be clinically reasonable.

Incremental Value of a Marker

Adding LDH value to the 5-predictor model increased the model χ2 from 187 to 212 (LR statistic 25, P < 0.001) in the development data set. LDH hence had additional predictive value. Overall performance also improved: Nagelkerke's R2 increased from 39% to 43%, and the Brier score decreased from 0.17 to 0.16 (Table 3). The discriminative ability showed a small increase (c rose from 0.82 to 0.84, Fig. 1). Similarly, the discrimination slope increased from 0.30 to 0.34 (Fig. 2), producing an IDI of 4%.

TABLE 3:
Performance of Testicular Cancer Models With or Without the Tumor Marker LDH in the Development Dataset (n = 544) and the Validation Dataset (n = 273)
FIGURE 1:
Receiver operating characteristic (ROC) curves (A) for the predicted probabilities without (solid line) and with (dashed line) the tumor marker LDH in the development data set (n = 544), and (B) for the predicted probabilities from the model without the tumor marker LDH, developed in the development data set and applied to the validation data set (n = 273). Threshold probabilities are indicated.
FIGURE 2:
Box plots of predicted probabilities for patients without (left box of each pair) and with (right box) residual tumor. A, Development data, model without LDH; B, Development data, model with LDH; C, Validation data, model without LDH. The discrimination slope is calculated as the difference between the mean predicted probabilities of those with and those without residual tumor (solid dots indicate means). The difference between discrimination slopes is equivalent to the integrated discrimination improvement (B vs A: IDI = 0.04).

Using a cut-off of 20% for the risk of tumor led to classification of 465 patients as high risk for residual tumor with the original model, and 469 as high risk with the extended model (Table 4). The extended model reclassified 19 of the 465 high-risk patients as low risk (4%), and 23 of the 79 low-risk patients as high risk (29%). The total reclassification was hence 7.7% (42/544). Based on the observed proportions, those who were reclassified were placed into more appropriate categories. The P value for Cook's reclassification test, comparing predictions from the original model with observed outcomes in the 4 cells of Table 4, was 0.030. A more detailed assessment of the reclassification is obtained by a scatter plot with symbols by outcome (tumor or necrosis, Fig. 3). Note that some patients with necrosis have higher predicted risks in the model without LDH than in the model with LDH (circles in the right lower corner of the graph). The improvement in reclassification was 1.7% ((8 − 3)/299) for those with tumor and 0.4% ((16 − 15)/245) for those with necrosis. Thus, the NRI was 2.1% (95% CI = −2.9% to +7.0%), a much lower percentage than the 7.7% for all reclassified patients. The IDI was already estimated from Figure 2 as 4%.

TABLE 4:
Reclassification for the Predicted Probabilities Without and With the Tumor Marker LDH in the Development Dataset
FIGURE 3:
Scatter plot of predicted probabilities without and with the tumor marker LDH (+, tumor; o, necrosis). Some patients with necrosis have higher predicted risks of tumor according to the model without LDH than according to the model with LDH (circles in right lower corner of the graph). For example, we note a patient with necrosis and an original prediction of nearly 60%, who is reclassified as less than 20% risk.

A cut-off of 20% implies a relative weight of 1:4 for false-positive against true-positive decisions. For the model without LDH, the net benefit was (TP − w × FP)/N = (284 − 0.25 × (465 − 284))/544 = 0.439. If we were to perform resection in all patients, the net benefit would be similar: (299 − 0.25 × (544 − 299))/544 = 0.437. The model with LDH has a better net benefit: (289 − 0.25 × (469 − 289))/544 = 0.449. Hence, at this particular cut-off, the model with LDH would be expected to lead to one more mass with tumor being resected per 100 patients, at the same number of unnecessary resections of necrosis. The decision curve shows that the gain in net benefit from using the model would be larger at higher threshold values (Fig. 4), ie, for patients accepting higher risks of residual tumor.
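The net benefit arithmetic above can be reproduced directly from the counts reported in the text; a quick check in R, using the case study counts at the 20% threshold (w = 0.2/0.8 = 0.25):

# Reproducing the reported net benefits from the counts in the text.
(284 - 0.25 * (465 - 284)) / 544   # model without LDH: 0.439
(299 - 0.25 * (544 - 299)) / 544   # resection in all patients: 0.437
(289 - 0.25 * (469 - 289)) / 544   # model with LDH: 0.449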

FIGURE 4:
Decision curves (A) for the predicted probabilities without (solid line) and with (dashed line) the tumor marker LDH in the development data set (n = 544) and (B) for the predicted probabilities without the tumor marker LDH from the development data set in the validation data set (n = 273).

External Validation

Overall model performance in the new cohort of 273 patients (197 with residual tumor) was less than at development, according to R2 (25% instead of 39%) and scaled Brier scores (20% instead of 30%). Also, the c statistic and discrimination slope were poorer. Calibration was on average correct (calibration-in-the-large coefficient close to zero), but the effects of predictors were on average smaller in the new setting (calibration slope 0.74). The Hosmer-Lemeshow test was of borderline statistical significance. The net benefit was close to zero, which was explained by the fact that very few patients had predicted risks below 20% and that calibration was imperfect around this threshold (Figs. 2, 5).

FIGURE 5:
Validation plots of prediction models for residual masses in patients with testicular cancer. A, Development data, model without LDH; B, Development data, model with LDH; C, Validation data, model without LDH. The arrow indicates the decision threshold of 20% risk of residual tumor.

Software

All analyses were done in R version 2.8.1 (R Foundation for Statistical Computing, Vienna, Austria), using the Design library. The syntax is provided in the eAppendix (https://links.lww.com/EDE/A355).

DISCUSSION

This article provides a framework for a number of traditional and relatively novel measures to assess the performance of an existing prediction model, or extensions to a model. Some measures relate to the evaluation of the quality of predictions, including overall performance measures such as explained variation and the Brier score, and measures for discrimination and calibration. Other measures quantify the quality of decisions, including decision-analytic measures such as the net benefit and decision curves, and measures related to reclassification tables (NRI, IDI).

Having a model that discriminates well will commonly be most relevant for research purposes, such as covariate adjustment in a randomized clinical trial. But a model with good discrimination (eg, c = 0.8) may be useless if the decision threshold for clinical decisions is outside the range of predictions provided by the model. Furthermore, a poorly discriminating model (eg, c = 0.6) may be clinically useful if the clinical decision is close to a "toss-up."53 This implies that the threshold is in the middle of the distribution of predicted risks, as is the case for models in fertility medicine, for example.54 For clinical practice, providing insight beyond the c statistic has been a motivation for some recent measures, especially in the context of extension of a prediction model with additional predictive information from a biomarker or other sources.8,9,45 Many measures provide numerical summaries that may be difficult to interpret (see, eg, Table 3).

Evaluation of calibration is important if model predictions are used to inform patients or physicians in making decisions. The widely used Hosmer-Lemeshow test has a number of drawbacks, including limited power and poor interpretability.1,55 The recalibration parameters as proposed by Cox (intercept and calibration slope) are more informative.41 Validation plots with the distribution of risks for those with and without the outcome provide a useful graphical depiction, in line with previous proposals.45

The net benefit, with visualization in a decision curve, is a simple summary measure to quantify clinical usefulness when decisions are supported by a prediction model.15 We recognize, however, that a single summary measure cannot give full insight into all relevant aspects of model performance. If a threshold is clinically well accepted, such as the 10% and 20% 10-year risk thresholds for cardiovascular events, reclassification tables and their associated measures may be particularly useful. For example, Table 4 clearly illustrates that a model incorporating lactate dehydrogenase puts a few more subjects with tumor in the high risk category (289/299 = 97% instead of 284/299 = 95%) and one fewer subject without tumor in the high risk category (180/245 = 73% instead of 181/245 = 74%). This illustrates the principle that key information for comparing the performance of 2 models is contained in the margins of the reclassification tables.12

A key issue in the evaluation of the quality of decisions is that false-positive and false-negative decisions will usually have quite different weights in medicine. Using equal weights for false-positive and false-negative decisions has been called "absurd" in many medical applications.56 Several previously proposed measures of clinical usefulness are consistent with decision-analytic considerations.31,48,57–60

We recognize that binary decisions can be fully evaluated in an ROC plot. The plot is of limited value, however, unless the predicted probabilities at the operating points are indicated. Optimal thresholds are defined by the tangent line to the curve, which depends on the incidence of the outcome and the relative weight of false-positive and false-negative decisions.58 If a prediction model is perfectly calibrated, the optimal threshold in the curve corresponds to the threshold probability in the net benefit analysis. The tangent is a 45-degree line if the outcome incidence is 50% and false-positive and false-negative decisions are weighted equally. We consider the net benefit and related decision curves preferable to graphical ROC curve assessment in the context of prediction models, although these approaches are obviously related.59

Most performance measures can also be calculated for survival outcomes, which pose the challenge of dealing with censored observations. Naive calculation of ROC curves for censored observations can be misleading, since some of the censored observations would have had events if follow-up had been longer. Also, the weight of false-positive and false-negative decisions may change with the follow-up time considered. Another issue is the need to consider competing risks in survival analyses of nonfatal outcomes, such as failure of heart valves,61 or mortality due to various causes.62 Disregarding competing risks often leads to overestimation of absolute risk.63

Any performance measure should be estimated with correction for optimism, as can be achieved with cross-validation or bootstrap resampling, for example. Determining generalizability to other, plausibly related settings requires an external validation data set of sufficient size.18 Some statistical updating of the model parameters may then be necessary.64 After repeated validation under various circumstances, an analysis of the impact of using the model for decision support should follow. This requires formulation of the model as a simple decision rule.65
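As an illustration of such optimism correction, a minimal sketch in base R of the standard bootstrap recipe for the c statistic; the data frame dat, the model formula, and the number of replications are illustrative assumptions, not the procedure used in the case study:

# Sketch: bootstrap estimate of optimism in the apparent c statistic.
cstat <- function(p, y) {
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
apparent <- cstat(fitted(fit), dat$y)                       # apparent performance
optimism <- replicate(200, {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]          # bootstrap sample
  fb   <- glm(y ~ x1 + x2, family = binomial, data = boot)
  cstat(fitted(fb), boot$y) -                               # performance in bootstrap sample
    cstat(predict(fb, newdata = dat, type = "response"), dat$y)  # minus performance in original data
})
c_corrected <- apparent - mean(optimism)                    # optimism-corrected c statistic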

In sum, we suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the model is to be used for making clinical decisions. Many more measures are available than discussed in this article, and those other measures may have value in specific circumstances. The novel measures for reclassification and clinical usefulness can provide valuable additional insight regarding the value of prediction models and extensions to models, which goes beyond traditional measures of calibration and discrimination.

ACKNOWLEDGMENTS

We thank Margaret Pepe and Jessie Gu (University of Washington, Seattle, WA) for their critical review and helpful comments, as well as 2 anonymous reviewers.

REFERENCES

1.Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer; 2001.
2.Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol. 2004;159:882–890.
3.Gerds TA, Cai T, Schumacher M. The performance of risk prediction models. Biom J. 2008;50:457–479.
4.Hosmer DW, Hosmer T, Le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Stat Med. 1997;16:965–980.
5.Obuchowski NA. Receiver operating characteristic curves and their use in radiology. Radiology. 2003;229:3–8.
6.Heagerty PJ, Zheng Y. Survival model predictive accuracy and ROC curves. Biometrics. 2005;61:92–105.
7.Gonen M, Heller G. Concordance probability and discriminatory power in proportional hazards regression. Biometrika. 2005;92:965–970.
8.Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928–935.
9.Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27:157–172.
10.Pepe MS, Janes H, Gu JW. Letter by Pepe et al. Regarding article, “Use and misuse of the receiver operating characteristic curve in risk prediction.” Circulation. 2007;116:e132; author reply e134.
11.Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Comments on ‘Integrated discrimination and net reclassification improvements-Practical advice.’ Stat Med. 2008;27:207–212.
12.Janes H, Pepe MS, Gu W. Assessing the value of risk predictions by using risk stratification tables. Ann Intern Med. 2008;149:751–760.
13.McGeechan K, Macaskill P, Irwig L, Liew G, Wong TY. Assessing new biomarkers and predictive models for use in clinical practice: a clinician's guide. Arch Intern Med. 2008;168:2304–2310.
14.Cook NR, Ridker PM. Advances in measuring the effect of individual predictors of cardiovascular risk: the role of reclassification measures. Ann Intern Med. 2009;150:795–802.
15.Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26:565–574.
16.Steyerberg EW, Vickers AJ. Decision curve analysis: a discussion. Med Decis Making. 2008;28:146–149.
17.Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med. 2000;19:453–473.
18.Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med. 1999;130:515–524.
19.Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001;54:774–781.
20.Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York: Springer; 2009.
21.Simon R. A checklist for evaluating reports of expression profiling for treatment selection. Clin Adv Hematol Oncol. 2006;4:219–224.
22.Ioannidis JP. Why most discovered true associations are inflated. Epidemiology. 2008;19:640–648.
23.Schumacher M, Binder H, Gerds T. Assessment of survival prediction models based on microarray data. Bioinformatics. 2007;23:1768–1774.
24.Vickers AJ, Kramer BS, Baker SG. Selecting patients for randomized trials: a systematic approach based on risk group. Trials. 2006;7:30.
25.Hernandez AV, Steyerberg EW, Habbema JD. Covariate adjustment in randomized controlled trials with dichotomous outcomes increases statistical power and reduces sample size requirements. J Clin Epidemiol. 2004;57:454–460.
26.Hernandez AV, Eijkemans MJ, Steyerberg EW. Randomized controlled trials with time-to-event outcomes: how much does prespecified covariate adjustment increase power? Ann Epidemiol. 2006;16:41–48.
27.Iezzoni LI. Risk Adjustment for Measuring Health Care Outcomes. 3rd ed. Chicago: Health Administration Press; 2003.
28.Kattan MW. Judging new markers by their ability to improve predictive accuracy. J Natl Cancer Inst. 2003;95:634–635.
29.Hilden J, Habbema JD, Bjerregaard B. The measurement of performance in probabilistic diagnosis. Part II: Trustworthiness of the exact values of the diagnostic probabilities. Methods Inf Med. 1978;17:227–237.
30.Hand DJ. Statistical methods in diagnosis. Stat Methods Med Res. 1992;1:49–67.
31.Habbema JD, Hilden J. The measurement of performance in probabilistic diagnosis: Part IV. Utility considerations in therapeutics and prognostics. Methods Inf Med. 1981;20:80–96.
32.Vittinghoff E. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models (Statistics for Biology and Health). New York: Springer; 2005.
33.Nagelkerke NJ. A note on a general definition of the coefficient of determination. Biometrika. 1991;78:691–692.
34.Brier GW. Verification of forecasts expressed in terms of probability. Mon Wea Rev. 1950;78:1–3.
35.Hu B, Palta M, Shao J. Properties of R2 statistics for logistic regression. Stat Med. 2006;25:1383–1395.
36.Schumacher M, Graf E, Gerds T. How to assess prognostic models for survival data: a case study in oncology. Methods Inf Med. 2003;42:564–571.
37.Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48:1029–1040.
38.Chambless LE, Diao G. Estimation of time-dependent area under the ROC curve for long-term risk prediction. Stat Med. 2006;25:3474–3486.
39.Yates JF. External correspondence: decomposition of the mean probability score. Organ Behav Hum Perform. 1982;30:132–156.
40.Miller ME, Langefeld CD, Tierney WM, Hui SL, McDonald CJ. Validation of probabilistic predictions. Med Decis Making. 1993;13:49–58.
41.Cox DR. Two further applications of a model for binary regression. Biometrika. 1958;45:562–565.
42.Copas JB. Regression, prediction and shrinkage. J R Stat Soc Ser B. 1983;45:311–354.
43.van Houwelingen JC, Le Cessie S. Predictive value of statistical models. Stat Med. 1990;9:1303–1325.
44.Cook NR. Statistical evaluation of prognostic versus diagnostic models: beyond the ROC curve. Clin Chem. 2008;54:17–23.
45.Pepe MS, Feng Z, Huang Y, et al. Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol. 2008;167:362–368.
46.Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3:32–35.
47.Pauker SG, Kassirer JP. The threshold approach to clinical decision making. N Engl J Med. 1980;302:1109–1117.
48.Peirce CS. The numerical measure of success of predictions. Science. 1884;4:453–454.
49.Steyerberg EW, Keizer HJ, Fossa SD, et al. Prediction of residual retroperitoneal mass histology after chemotherapy for metastatic nonseminomatous germ cell tumor: multivariate analysis of individual patient data from six study groups. J Clin Oncol. 1995;13:1177–1187.
50.Steyerberg EW, Vergouwe Y, Keizer HJ, Habbema JD. Residual mass histology in testicular cancer: development and validation of a clinical prediction rule. Stat Med. 2001;20:3847–3859.
51.Vergouwe Y, Steyerberg EW, Foster RS, Habbema JD, Donohue JP. Validation of a prediction model and its predictors for the histology of residual masses in nonseminomatous testicular cancer. J Urol. 2001;165:84–88.
52.Steyerberg EW, Marshall PB, Keizer HJ, Habbema JD. Resection of small, residual retroperitoneal masses after chemotherapy for nonseminomatous testicular cancer: a decision analysis. Cancer. 1999;85:1331–1341.
53.Pauker SG, Kassirer JP. The toss-up. N Engl J Med. 1981;305:1467–1469.
54.Hunault CC, Habbema JD, Eijkemans MJ, Collins JA, Evers JL, te Velde ER. Two new prediction rules for spontaneous pregnancy leading to live birth among subfertile couples, based on the synthesis of three previous models. Hum Reprod. 2004;19:2019–2026.
55.Peek N, Arts DG, Bosman RJ, van der Voort PH, de Keizer NF. External validation of prognostic models for critically ill patients required substantial sample sizes. J Clin Epidemiol. 2007;60:491–501.
56.Greenland S. The need for reorientation toward cost-effective prediction: comments on ‘Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al. (DOI: 10.1002/sim.2929). Stat Med. 2008;27:199–206.
57.Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Validity of prognostic models: when is a model clinically useful? Semin Urol Oncol. 2002;20:96–107.
58.McNeil BJ, Keller E, Adelstein SJ. Primer on certain elements of medical decision making. N Engl J Med. 1975;293:211–215.
59.Hilden J. The area under the ROC curve and its competitors. Med Decis Making. 1991;11:95–101.
60.Gail MH, Pfeiffer RM. On criteria for evaluating models of absolute risk. Biostatistics. 2005;6:227–239.
61.Grunkemeier GL, Jin R, Eijkemans MJ, Takkenberg JJ. Actual and actuarial probabilities of competing risks: apples and lemons. Ann Thorac Surg. 2007;83:1586–1592.
62.Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94:496–509.
63.Gail M. A review and critique of some models used in competing risk analysis. Biometrics. 1975;31:209–222.
64.Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med. 2004;23:2567–2586.
65.Reilly BM, Evans AT. Translating clinical research into clinical practice: impact of using prediction rules to make decisions. Ann Intern Med. 2006;144:201–209.

Supplemental Digital Content

© 2010 Lippincott Williams & Wilkins, Inc.