Box and Draper are widely quoted: “Essentially, all (mathematical) models are wrong, but some are useful.”^{1}^{(p. 424)} As a biostatistician, I read the models described by Steyerberg et al^{2} with admiration. The authors are doing what their title implies: evaluating mathematical models. However, as an informed and concerned medical consumer, I also focused on their introduction, which states their goal of evaluating prognostic and diagnostic medical tests, and on their illustration: testing for residual tumors in medical decision-making for men with testicular cancer. The presentation then raised the concern that the content might further confuse medical decision-making, that is, a concern about how useful the results are.

The outcome of interest (Y), we agree, is binary: the answer can be only yes or no (Y = 1 or 0) as to whether a patient has a certain disorder (diagnostic test), or is likely to suffer onset of a disorder or death in a specified future time period (prognostic test). In medical decision-making, some action (treatment or prevention) is proposed on the basis of the prediction of the outcome that, if wrong, can have serious medical consequences. Thus, a false-positive test for a man with testicular cancer would lead to unnecessary surgery and possible iatrogenic damage, while a false-negative test allows metastases of life-threatening residual tumors.

We disagree on what the “prediction” is. When one proposes to predict age, the “prediction” is an age; to predict weight, the “prediction” is a weight, ie, the prediction is meant to reproduce the outcome to be predicted. Accordingly, with a binary outcome (Y), the “prediction” is a binary prediction (BP = 1 or 0) to match the outcome, not a probability (p).^{2} If BP = 1, then the action appropriate to Y = 1 is pursued; if BP = 0, the alternative action. To report to a patient that his/her probability is 0.29 will only elicit the question “Well, do I (will I) have it or not? Do I get treatment for it (try to prevent it) or not?” In short, a demand for a binary prediction would result.

Clearly a binary prediction may be based on a list of available information about the patient, a vector **X**, of the form: BP = 1 if **X** is in the region Ω, and BP = 0 otherwise. One can choose any of a variety of mathematical approaches to develop Ω. One might choose to use a linear function of **X**, say β′**X**, and define Ω as that region where β′**X** > threshold, for a selected set of weightings β and threshold. To develop such a binary prediction, one could use a linear model such as logistic regression or a Cox model to select β; one can choose to include or to exclude interactions; one can choose different link functions, or different transformations of the individual components of **X**. Alternatively, one might use a nonlinear model, recursive partitioning methods or neural networks, etc, to develop Ω.
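As a minimal sketch of the linear-threshold rule just described, the snippet below sets BP = 1 when β′**X** exceeds a threshold. The weights, threshold, and patient vectors are hypothetical values chosen only for illustration; they are not taken from Steyerberg et al.

```python
def binary_prediction(x, beta, threshold):
    """Return BP = 1 if the linear score beta'x exceeds the threshold, else 0."""
    score = sum(b * xi for b, xi in zip(beta, x))
    return 1 if score > threshold else 0

# Hypothetical weights, threshold, and patient vectors (illustration only).
beta = [0.8, -0.5, 1.2]
threshold = 1.0
patients = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0]]

# One binary prediction per patient; the region Omega is {x : beta'x > threshold}.
predictions = [binary_prediction(x, beta, threshold) for x in patients]
```

Any other way of defining Ω (a tree, a neural network) would slot into the same place: what is evaluated afterward is the resulting BP, not the model that produced it.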

Whatever the mathematical model or process used to define binary prediction, the accuracy, and thus the clinical value, of the prediction lies, not in the mathematical model, but in the correspondence between the binary outcome Y and the resulting BP in the population of interest. In short, the model that leads to the binary prediction does not matter as much as does the binary prediction it generates. Thus I would preclude all performance measures based on an ordinal predictor. What is left?

The Figure presents a classic receiver operating characteristic (ROC) plane, a graph of the sensitivity versus the complement of the specificity. A brief review:

* The Sensitivity (Se) and the Specificity (Sp) of BP versus Y are defined as: Se = Prob(BP = 1|Y = 1), 1 − Sp = Prob(BP = 1|Y = 0).

* Any BP that does no better than random decision making lies on the main diagonal line: random ROC. Any BP that does better than random selection lies above the random ROC.

* Ideally the BP would have no errors and would lie at the ideal point at (0,1).

* Any BP such that Prob(BP = 1) = Prob(Y = 1) lies on the diagnosis line (the line connecting the ideal point and the point (P,P) on the random ROC, where P = Prob[Y = 1]). A BP with Prob(BP = 1) = Q lies on a line parallel to the diagnosis line intersecting the random ROC at the point (Q,Q) (the geometric version of Bayes' Theorem).
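The review above can be checked numerically. The following sketch, using hypothetical 2 × 2 counts, locates a BP in the ROC plane and confirms that it lies on the line through (Q,Q) parallel to the diagnosis line, which is the identity Q = P Se + (1 − P)(1 − Sp) in geometric form.

```python
def roc_quantities(tp, fp, fn, tn):
    """Return P, Q, Se, Sp from the four cells of a 2 x 2 table of BP versus Y."""
    n = tp + fp + fn + tn
    P = (tp + fn) / n          # Prob(Y = 1)
    Q = (tp + fp) / n          # Prob(BP = 1)
    se = tp / (tp + fn)        # Prob(BP = 1 | Y = 1)
    sp = tn / (tn + fp)        # Prob(BP = 0 | Y = 0)
    return P, Q, se, sp

# Hypothetical counts (illustration only).
P, Q, se, sp = roc_quantities(tp=40, fp=10, fn=20, tn=30)
x, y = 1 - sp, se              # the BP's coordinates in the ROC plane

# The diagnosis line joins (0, 1) and (P, P), so its slope is -(1 - P) / P.
slope = -(1 - P) / P
y_on_parallel_line = Q + slope * (x - Q)   # line through (Q, Q) with that slope

bayes_Q = P * se + (1 - P) * (1 - sp)      # total probability of BP = 1
```

With these counts the BP sits above the random ROC (y > x), and both `y_on_parallel_line` and `bayes_Q` reproduce the observed point and Q up to floating-point error.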

A single BP is located at one point in the ROC plane, at a distance from the diagnosis line determined by P − Q. Its location between the random ROC and the ideal point determines its accuracy and clinical usefulness, which can be quantified by some measure of 2 × 2 association. Every common measure of 2 × 2 association (risk differences, risk ratios, phi, etc., but not the odds ratio) is equivalent to a weighted kappa coefficient, k(r), where r = w/(1 + w) (w as Steyerberg et al defined it) is a weight indicating the balance between the clinical importance of false negatives versus false positives.^{3,4} To choose the best of several possible binary predictors for any specific r (or w), one need only choose the BP with maximal k(r) for the selected r. The magnitude of k(r) indicates the quality of the optimal BP: k(r) = 0 for any BP on the random ROC, k(r) = 1 at the ideal point, and otherwise k(r) measures how far the BP lies between random and ideal in the direction appropriate to the choice of r. When P = Q, all k(r) are equal; when P = Q = 1/2, all k(r) = (OR^{1/2} − 1)/(OR^{1/2} + 1), where OR is the odds ratio relating BP to Y. The odds ratio itself is a poor choice as an effect size because, no matter how large OR is, when Q is small enough, BP is arbitrarily close to the random ROC.^{4–7}
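The weakness of the odds ratio can be illustrated with a small sketch. The values of P, Se, and Sp below are hypothetical, chosen so that Q is small; the vertical distance Se − (1 − Sp) is used here simply as a convenient indicator of how far the BP lies above the random ROC.

```python
def summarize(P, se, sp):
    """Return Q, the odds ratio, and the height of the BP above the random ROC."""
    Q = P * se + (1 - P) * (1 - sp)             # Prob(BP = 1)
    odds_ratio = (se / (1 - se)) * (sp / (1 - sp))
    height = se - (1 - sp)                       # 0 on the random ROC, 1 at the ideal point
    return Q, odds_ratio, height

# Hypothetical test that is almost never positive (illustration only).
Q, odds_ratio, height = summarize(P=0.5, se=0.02, sp=0.9999)
```

Here the odds ratio exceeds 200, yet only about 1% of patients test positive and the BP sits less than 0.02 above the random ROC.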

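If binary prediction is defined by dichotomizing an ordinal measure, say p, one can draw the ROC curve relating that ordinal measure to the outcome, since each possible threshold t defines a BP (BP = 1 if p > t, and BP = 0 otherwise) with:

Se(t) = Prob(p > t|Y = 1), 1 − Sp(t) = Prob(p > t|Y = 0).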

Connecting these points gives the ROC curve for ordinal p against the binary outcome Y (see Fig. 1 of Steyerberg et al^{2}). It is possible that the area under that ROC curve (AUC) is near its null value of 0.5, but that some threshold defines a useful BP. It is also possible that the value of AUC >0.5 seems acceptable, but that no threshold defines a clinically useful BP. The same is true of other measures relating ordinal p to binary Y. In short, there is some relationship between performance measures relating ordinal p to Y and those of the optimal BP obtained by dichotomizing p, but one does not determine the other.
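A short sketch of this construction follows, assuming each threshold t gives BP = 1 when p > t. The predicted probabilities and outcomes are hypothetical, and the AUC is computed by the trapezoidal rule over the connected points.

```python
def roc_points(p, y):
    """Return one (1 - Sp, Se) point per threshold t, with BP = 1 when p > t."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    points = [(1.0, 1.0)]  # threshold below every p: every patient has BP = 1
    for t in sorted(set(p)):
        tp = sum(1 for pi, yi in zip(p, y) if pi > t and yi == 1)
        fp = sum(1 for pi, yi in zip(p, y) if pi > t and yi == 0)
        points.append((fp / n_neg, tp / n_pos))
    return sorted(points)

def auc(points):
    """Trapezoidal area under the curve connecting the sorted ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical predicted probabilities and outcomes (illustration only).
p = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0, 0, 1, 0, 1, 0, 1, 1]

points = roc_points(p, y)   # each point is one candidate BP
area = auc(points)
```

Each element of `points` is a distinct BP with its own Se and Sp; the AUC summarizes the whole curve and does not by itself say which, if any, of those points is clinically useful.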

[Figure 1: the ROC plane, sensitivity versus 1 − specificity.]

Steyerberg et al briefly referred to this approach. Because they ignored the diagnosis lines in their ROC curves, they missed the crucial fact that all information in the 2 × 2 table relating any BP to Y can be read directly from the ROC graphic. Moreover, in ignoring the diagnosis lines, the authors obscured the fact that their development was done by sampling one population (*P* = 0.55) and their “validation” by sampling another (*P* = 0.72). Because population NB = P Se − w (1 − P)(1 − Sp), the equipotency lines^{3} for NB in the ROC plane have slope w(1 − P)/P. Thus it is indeed true that the optimal test identified by maximizing NB with weight w is the same as that identified by maximizing k(r) with weight r = w/(1 + w). However, since the random value of NB (at Se = 1 − Sp = Q) is NB_{R} = Q(P − w[1 − P]) and the ideal value is P, NB is an uncalibrated measure and cannot be used to assess the quality of any BP. If one linearly recalibrated NB to be 0 for random decision making and 1 at the ideal point, and let r = w/(1 + w), recalibrated NB would equal k(r):

k(r) = (NB − NB_{R})/(P − NB_{R}).
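The linear recalibration can be written out step by step. The derivation below is a sketch that takes the random value to be NB evaluated at Se = 1 − Sp = Q and uses the identity Q = P Se + (1 − P)(1 − Sp); it stops at the recalibrated form and does not assume any particular closed form for k(r).

```latex
\begin{align*}
NB &= P\,Se - w(1-P)(1-Sp), \\
NB_R &= PQ - w(1-P)\,Q = Q\,[P - w(1-P)], \\
NB - NB_R &= P(Se - Q) + w(1-P)\,[Q - (1-Sp)] \\
          &= (1+w)\,P(1-P)\,(Se + Sp - 1), \\
P - NB_R &= P(1-Q) + w(1-P)\,Q, \\
\frac{NB - NB_R}{P - NB_R}
  &= \frac{(1+w)\,P(1-P)\,(Se + Sp - 1)}{P(1-Q) + w(1-P)\,Q}.
\end{align*}
```

The ratio is 0 whenever Se + Sp = 1 (the random ROC) and 1 at the ideal point, where Se = Sp = 1 and Q = P. When P = Q the denominator becomes (1 + w)P(1 − P), so the ratio reduces to Se + Sp − 1 whatever the weight, in line with the earlier remark that all k(r) are equal when P = Q.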

As best can be computed from the data presented, k(0.8) (w = 0.25) for the test without LDH is 0.37 and with LDH is 0.42. With possible shrinkage, these tests are in the “only fair” range of accuracy.^{8} While 55% in this sample had residual tumor, 86% would be recommended for surgery. With LDH, of those recommended for surgery, 38% would have it unnecessarily; of those for whom surgery was not recommended, 13% would be left at risk because they had residual tumors. How would any of the information in Figures 2–5^{2} further inform clinicians and their patients?
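These figures can be roughly cross-checked. The sketch below back-calculates Se and 1 − Sp for the test with LDH from the rounded percentages quoted above, then linearly recalibrates NB at w = 0.25, taking the random value as NB evaluated at Se = 1 − Sp = Q (an assumption of this sketch). Because the inputs are rounded, the result can only approximate the reported value.

```python
# Rounded figures quoted in the text for the test with LDH.
P = 0.55              # proportion with residual tumor
Q = 0.86              # proportion recommended for surgery
ppv = 1 - 0.38        # of those recommended, 38% have surgery unnecessarily
missed = 0.13         # of those not recommended, 13% have residual tumor
w = 0.25              # weight as used by Steyerberg et al

# Back-calculated operating point in the ROC plane.
se = Q * ppv / P                  # Prob(BP = 1 | Y = 1)
fpr = Q * (1 - ppv) / (1 - P)     # 1 - Sp = Prob(BP = 1 | Y = 0)

# Consistency check: the quoted percentages should reproduce P.
implied_P = Q * ppv + (1 - Q) * missed

# Net benefit, its value under random decision making, and the recalibration.
nb = P * se - w * (1 - P) * fpr
nb_random = Q * (P - w * (1 - P))
recalibrated_nb = (nb - nb_random) / (P - nb_random)
```

The implied prevalence agrees with 55% to within rounding, and the recalibrated NB comes out at about 0.43, close to the 0.42 reported above.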

What is needed is not more options, more mathematical models, or more performance measures, but better guidance as to how to develop and document diagnostic or prognostic tests for clinical purposes, integrating current methods. If, in certain circumstances, new models would enhance identification of a better binary predictor (better in terms of accuracy, cost, convenience, or risk), such new approaches should, of course, be included for consideration. However, the value of mathematical models in medical research does not lie in their proliferation, their elegance, or their complexity, but in their clinical usefulness in the context in which they are to be used. As Box and Draper suggest: “Remember that all models are wrong: the practical question is how wrong do they have to be to not be useful.”^{1}^{(p. 74)}