From the (a) Department of Public Health, Erasmus MC, Rotterdam, The Netherlands. E-mail: E.Steyerberg@erasmusmc.nl; (b) Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH; and (c) Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY.
Editors' note: Related articles appear on pages 128 and 139.
We thank Helena Chmura Kraemer for her interest1 in our paper.2 We agree that “the value of mathematical models in medical research does not lie in their proliferation, their elegance or complexity, but in their clinical usefulness.”1 Indeed, we expressly introduce some recently developed methods that can help to determine whether use of a prediction model in practice would improve clinical outcome.
Patients, clinicians, and researchers need interpretable measures that indicate the extent to which a model supports better decision-making. Our paper2 provides an overview of measures to assess performance of predictions from a model, including calibration and discrimination measures, and measures that quantify how much a model improves decision-making. Kraemer1 adds the weighted kappa, k(r), to the latter measures. This is essentially a rescaled version of the net benefit, which was originally proposed by Peirce3 and was emphasized in our paper. We largely agree here as well, because both methods weight the classifications of prediction models in terms of their consequences.
Is k(r) better than net benefit? We doubt this. Net benefit is indeed dependent on the incidence of the outcome (“P”), but this is entirely appropriate if we want to operationalize clinical usefulness.4 As a simple example, a model has greater potential to be clinically useful if the outcome occurs in 50% of patients than in only 0.01%. The receiver operating characteristic (ROC) curve with a “diagnosis line” is not particularly helpful either. We argue that displaying ROC curves is a waste of precious journal space, unless predictions are shown on the curve, as in Figure 1 of our original paper.2
We are puzzled regarding Kraemer's concern that we “obscured the fact”1 that we use separate development and validation samples. This is clearly indicated throughout the paper. External validation is an important principle for evaluating prediction models. A model may be more or less useful at validation, depending on the incidence of the outcome, the distribution of predicted values, and the validity of the predicted probabilities.
Our main point of disagreement is what a “prediction” is. The principle of making predictions for binary outcomes in terms of probabilities has a long history, both in medicine and other fields such as weather forecasting. But expressing the results of such predictions purely in binary terms is predominantly associated with “prophecies” that do or do not come true: a scientific weather forecast gives a “60% chance of rain”; a seer states “a blue-eyed child will ascend to the throne in a dark time.”
A patient with a probability of 29% for an outcome cannot be given a meaningful answer to the question, “Well, do I (will I) have the outcome or not?” Kraemer's advice to dichotomize to a binary prediction clearly results in a loss of information: 2 patients with different predicted probabilities of disease of, say, 2% and 24%, end up with the same information of “not diseased” if the threshold is above 24%. Predicted probabilities play an important role, such as in situations where monitoring of patients is an option, or when the medical decision depends on personal issues of the patient. A more technical problem with finding binary predictions as proposed by Kraemer is that the estimates of a threshold parameter are known to be systematically more variable than the usual estimates of regression parameters. In contrast, a probability appropriately expresses the uncertainty with respect to the presence or future occurrence of the outcome. This is the caution that we should exercise when applying knowledge obtained at the group level to the individual patient.
The question “Do I get treatment for it?” can readily be answered with a predicted probability. In our example, a prediction below a threshold probability, such as a less than 20% risk of residual tumor, leads to a different action (observation) than a prediction of relatively high risk (surgery).
Working with probabilities has several advantages over consideration of a binary prediction. In particular, the weight w for the calculation of k(r) and net benefit follows directly from the threshold probability (w = p/[1 − p]), and this weight is unlikely to be identical for all decision-makers.5,6 Different patients and different physicians usually have different utility or loss functions that cause their decisions to effectively use different risk thresholds. In our example,2 a typical decision threshold is 20%, and at that threshold our models have very limited clinical usefulness. However, if surgical risks are increased for certain patients, a higher threshold should be applied, and the model may be more useful.
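The relation between the threshold probability and the weight can be illustrated with a short Python sketch. It assumes the usual decision-curve definition of net benefit (true positives per patient minus w times false positives per patient). The function names are hypothetical, the toy data are invented for illustration, and treating patients whose predicted risk is at or above the threshold is an assumption of the sketch.

```python
def threshold_weight(p_t):
    """Weight w = p/(1 - p) implied by a threshold probability p_t (0 < p_t < 1)."""
    return p_t / (1.0 - p_t)


def net_benefit(probs, outcomes, p_t):
    """Net benefit of treating patients with predicted risk >= p_t.

    probs    -- predicted probabilities, one per patient
    outcomes -- observed outcomes coded 1 (event) or 0 (no event)
    """
    n = len(outcomes)
    w = threshold_weight(p_t)
    # Count treated patients with and without the outcome.
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= p_t and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= p_t and y == 0)
    return tp / n - w * fp / n


# Invented toy data: four patients, two with the outcome.
toy_probs = [0.1, 0.3, 0.6, 0.9]
toy_outcomes = [0, 0, 1, 1]

# At a 20% threshold, w = 0.25: one false positive is weighted
# as a quarter of one true positive.
nb_20 = net_benefit(toy_probs, toy_outcomes, 0.20)  # about 0.4375
```

A decision-maker with a higher threshold simply passes a different `p_t`; the predicted probabilities themselves do not change.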
It is also important to note that we consider the situation in which a prediction model is applied prospectively. There is no opportunity to choose an optimal decision threshold other than from the predicted probabilities. If these are miscalibrated, decision-making with the model may be worse than without the model. Several approaches are possible to recalibrate a model to a specific setting.7,8 Recalibration may avoid poor decision-making for future patients, but can be done only after empirical data from a specific setting have been gathered.
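One simple form of recalibration can be sketched as follows. This is only an intercept update on the logit scale ("calibration-in-the-large"), a simplification of the approaches cited above, which can also adjust a calibration slope. The function names and the bisection search are assumptions of this sketch, and the toy data are invented. It requires predictions strictly between 0 and 1 and a local sample containing both events and non-events.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate_intercept(probs, outcomes, lo=-10.0, hi=10.0, tol=1e-10):
    """Shift all predictions by a constant a on the logit scale.

    The shift a is chosen so that the sum of the updated probabilities
    equals the observed number of events in the local data. Returns
    (a, updated_probs).
    """
    target = sum(outcomes)
    lps = [logit(p) for p in probs]
    # The sum of expit(a + lp) increases with a, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if sum(expit(mid + lp) for lp in lps) < target:
            lo = mid
        else:
            hi = mid
    a = (lo + hi) / 2.0
    return a, [expit(a + lp) for lp in lps]


# Invented toy data: the model predicts 50% for everyone,
# but only 1 of 4 local patients has the outcome.
shift, updated = recalibrate_intercept([0.5, 0.5, 0.5, 0.5], [0, 0, 0, 1])
```

In this toy setting the updated predictions move from 50% toward the locally observed incidence of 25%, which is possible only because local outcome data are available.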
How do Figures 2–5 of our paper2 inform clinicians and their patients? Figure 2 shows box plots by outcome, and it expresses the same kind of information as a ROC curve, ie, the discriminative value. It warns against enthusiasm for the models' performances, because the distributions of predicted risks overlap considerably between those with and without the outcome. Figure 3 is a scatter plot of predictions from models with and without a marker, and is specifically useful for the issue of reclassification: how much does a marker add?9 This is an important graphical tool that again warns against too much enthusiasm, because far from all patients are classified better by a model with the marker than by a model without. Figure 4 shows decision curves that directly express clinical usefulness over a range of potential decision thresholds.5 Figure 4 clearly demonstrates that, in the validation sample, the net benefit of the model is no higher than that of “treat all” for threshold probabilities less than approximately 50%. Clinicians and patients have a choice between using our model to determine who is sent to surgery and just going to surgery as a matter of routine. Unless a patient thinks that unnecessary surgery is about as bad as an untreated tumor (which would be unusual), clinical outcomes would not be improved by use of this model. Figure 5 is a validation graph,10 based on Harrell's calibration graph,11 in which aspects of calibration, discrimination, and clinical usefulness can all be assessed. Figures 1–5, hence, give impressions of the usefulness of a model; in particular, a poorly performing model would readily be identified.
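The comparison that a decision curve makes can be sketched in Python as follows. The sketch assumes the usual definitions: “treat all” has net benefit P − w(1 − P), where P is the observed incidence and w = p/(1 − p), and “treat none” has net benefit 0. The function name and the toy data are invented for illustration and do not reproduce the validation sample of our paper.

```python
def decision_curve(probs, outcomes, thresholds):
    """Return (threshold, NB model, NB treat-all, NB treat-none) tuples.

    Patients with predicted risk >= threshold count as treated
    (an assumption of this sketch).
    """
    n = len(outcomes)
    incidence = sum(outcomes) / n
    rows = []
    for p_t in thresholds:
        w = p_t / (1.0 - p_t)
        tp = sum(1 for p, y in zip(probs, outcomes) if p >= p_t and y == 1)
        fp = sum(1 for p, y in zip(probs, outcomes) if p >= p_t and y == 0)
        nb_model = tp / n - w * fp / n
        # Treating everyone: all events are true positives,
        # all non-events are false positives.
        nb_all = incidence - w * (1.0 - incidence)
        rows.append((p_t, nb_model, nb_all, 0.0))
    return rows


# Invented toy data: four patients, two with the outcome.
curve = decision_curve([0.1, 0.3, 0.6, 0.9], [0, 0, 1, 1], [0.2, 0.5])
for p_t, nb_model, nb_all, nb_none in curve:
    print(f"threshold {p_t:.2f}: model {nb_model:.4f}, "
          f"treat all {nb_all:.4f}, treat none {nb_none:.4f}")
```

A model is clinically useful at a given threshold only if its net benefit exceeds both reference strategies there; when it does not exceed “treat all,” routine treatment does at least as well as using the model.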
In the area of performance assessment, over 100 “Q” measures were already proposed by Habbema et al almost 30 years ago.12 These were grouped similarly to our framework into measures for calibration, discrimination, and clinical usefulness. We do not purport that one measure or one type of graph makes all others superfluous; this holds for the proposals that we included in our paper2 and the proposals in Kraemer's commentary.1 Future challenges lie not only in the development of alternative performance measures (including further measures for survival data) but also in their interpretability for clinicians, who are the end-users of prediction models that epidemiologists develop.
1. Kraemer HC. The usefulness of mathematical models in assessing medical tests [commentary]. Epidemiology
2. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology
3. Peirce CS. The numerical measure of success of predictions. Science
4. Hilden J. Prevalence-free utility-respecting summary indices of diagnostic power do not exist. Stat Med
5. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making
6. Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Validity of prognostic models: when is a model clinically useful? Semin Urol Oncol
7. Cox DR. Two further applications of a model for binary regression. Biometrika
8. van Houwelingen HC. Validation, calibration, revision and combination of prognostic survival models. Stat Med
9. McGeechan K, Macaskill P, Irwig L, Liew G, Wong TY. Assessing new biomarkers and predictive models for use in clinical practice: a clinician's guide. Arch Intern Med
10. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York: Springer; 2009.
11. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer; 2001.
12. Habbema JD, Hilden J, Bjerregaard B. The measurement of performance in probabilistic diagnosis. V. General recommendations. Methods Inf Med