Articles by Pencina et al from 2008 onward have introduced the Net Reclassification Indices (NRI; notably NRI>0) and their cousin, the Integrated Discrimination Improvement (IDI), along with several variants. As reviewed recently by Kerr et al,1 this family of statistics has major shortcomings. These formulations have nonetheless gained immense popularity, as well as authoritative support (Hlatky et al2). I believe I offend no one by a frank appraisal. The summary by Kerr et al1 illustrates the dangers of inventing and popularizing new statistical measures that are based entirely on “nice looks,” without appropriate theoretical underpinning.
The NRI/IDI statistics address the incremental prognostic impact of a new binary risk “oracle” that adds a new marker to an existing one. (“Oracle” in this context covers any kind of black-box prediction rule.) NRI and IDI are designed to assess the risk refinement (enhanced risk differentiation) provided by the new marker. Intuitively, this risk refinement is what the measures reflect; but what, exactly, do these measures measure?
An epidemiological measure should estimate an easily understood, operationally relevant, design-independent (and preferably also model-independent) population characteristic. NRI and IDI already fail this first test. Kerr et al1 have emphasized that, in policy-selection contexts such as this, the evaluation should rest on the utility benefit gained (or cost saved); again, as Kerr et al note, NRI and IDI do not have this form. The measures do, however, look nice and convincing to medical investigators.
Statisticians themselves are not immune to the allure of good looks. New tools are sometimes introduced on grounds of mathematical elegance and a slick fit to model anatomy. Consider the universal acceptance of hazard ratios in survival analysis: is that primarily a product of medical concerns about adding years to human life, or is it dictated by the right-censored nature of human experience? The latter, obviously. A tool has taken precedence over the goal. The kappa (κ) coefficient of agreement is another example: an elegant construct, but what does it mean? People report it with confidence limits, but nobody knows what to make of them; the (1 − α)100% warranty is void. Considering that the goal of classification is the optimal assignment of cases to classes (eg, treatment groups), a cogent operational measure of non-agreement would be the clinical benefit obtainable by using two assessors instead of one: the closer their agreement, the less the second opinion adds. Try this, and peer reviewers will cry for κ.
Once there is agreement that a proposed statistic estimates a quantity of genuine interest, one may tick off the “confidence limits potentially meaningful?” box and proceed to the question of bias. When the statistic flows from a nonlinear model, difficulties arise. Ongoing Monte Carlo studies by the Seattle group (cf. Kerr et al1) and by our group (Thomas Gerds) suggest that the NRI behaves poorly in logistic models: noninformative predictors (useless biomarkers) often come out with positive scores.3
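The mechanism behind this optimistic bias can be illustrated with a minimal Monte Carlo sketch. This is my own illustration, not the cited simulations: a logistic model is refitted with a pure-noise “biomarker” and the continuous NRI(>0) is evaluated on the same training data, where overfitting makes the useless marker look beneficial. All settings (sample size, coefficients, number of replicates) are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, iters=50):
    """Plain Newton-Raphson logistic regression (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        H = X1.T @ (X1 * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(H, X1.T @ (y - p))
    return beta

def predict(beta, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

def nri_gt0(p_old, p_new, y):
    """Continuous NRI(>0): net upward movement among events
    plus net downward movement among non-events."""
    ev, ne = y == 1, y == 0
    up, down = p_new > p_old, p_new < p_old
    return (up[ev].mean() - down[ev].mean()) + (down[ne].mean() - up[ne].mean())

nris = []
for _ in range(200):
    n = 200
    x = rng.normal(size=n)            # genuine risk factor
    junk = rng.normal(size=n)         # pure-noise "new biomarker"
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(x - 0.5))))
    X_old, X_new = x[:, None], np.column_stack([x, junk])
    p_old = predict(fit_logistic(X_old, y), X_old)
    p_new = predict(fit_logistic(X_new, y), X_new)   # refit on the same data
    nris.append(nri_gt0(p_old, p_new, y))

print(f"mean apparent NRI(>0) of a useless marker: {np.mean(nris):+.3f}")
```

The mean comes out positive: the marker carries no information, yet the training-sample NRI credits it with an improvement.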
The next premarketing requirement for a new statistic is a description of its general statistical behavior, standard error, and efficiency. Kerr et al4 deserve credit for discovering that the naive standard error formula for IDI is misleading because it ignores the interpatient dependence that arises when model parameters and IDI are estimated from the same data. Importantly, they also observed that IDI tends to have a surprisingly non-Gaussian sampling distribution—a feature that leaves any standard error formula lame. Who would suspect?
Unknown to many scientists, evaluation of risk is beset with an additional problem that can be avoided only by insisting on a particular class of evaluation tools. The problem is this: We want the best candidate to win. We do not want an athlete to be able to improve his position by shooting himself in the foot. Under no circumstances should a risk assessor, by some clever systematic distortion of the risk assessments, be able to improve his apparent performance. The evaluation tools that ensure this are called proper scoring rules. Their key property is this: when a patient’s true risk is π, then quoting any risk figure, p, different from π will reduce the expected score; and the greater the discrepancy, the greater the reduction in expected score. He who honestly believes that π = 0.265 will expect to be penalized for quoting any other p. There is no way he can beat the truth—or hope to beat his own conviction. Nonproper scoring rules, by contrast, may fool or be fooled by their users: they may react inappropriately to miscalibrated risks—and are vulnerable to strategically distorted risk input. NRI and IDI are seriously nonproper, in that they reward overconfidence. In fact, they make it profitable to force toward 1 those risks that are already above average and to force below-average risks toward 0. Using this stratagem, a person who knows the approximate average risk in the clientele and has access to the “old” risks of the patient sample can construct a list of fake “new” risks and be sure to beat the old prediction system; no new data are needed.5,6
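The stratagem can be demonstrated numerically. In this sketch of my own construction (simulated true risks; the continuous NRI(>0) formulation; the Brier score standing in as one familiar proper scoring rule), the “new” risks are fakes obtained by pushing above-average risks toward 1 and below-average risks toward 0, exactly as described above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
pi = rng.uniform(0.05, 0.60, size=n)      # each patient's true risk
y = rng.binomial(1, pi)

p_old = pi                                 # honest risks
m = pi.mean()
# the stratagem: push above-average risks toward 1, the rest toward 0
p_new = np.where(pi > m, (pi + 1.0) / 2.0, pi / 2.0)

def nri_gt0(p_old, p_new, y):
    """Continuous NRI(>0)."""
    ev, ne = y == 1, y == 0
    return ((np.mean(p_new[ev] > p_old[ev]) - np.mean(p_new[ev] < p_old[ev]))
            + (np.mean(p_new[ne] < p_old[ne]) - np.mean(p_new[ne] > p_old[ne])))

def brier(p, y):                           # a proper scoring rule
    return np.mean((p - y) ** 2)

nri = nri_gt0(p_old, p_new, y)
b_old, b_new = brier(p_old, y), brier(p_new, y)
print(f"NRI(>0) rewards the distortion: {nri:+.3f}")
print(f"Brier score punishes it: {b_old:.4f} (honest) vs {b_new:.4f} (distorted)")
```

No new data entered: the distorted risks earn a handsomely positive NRI, while the proper score correctly registers them as worse than the honest ones.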
Fortunately, it is easy to devise appropriate proper scoring schemes. It suffices to do as recommended above: define a utility-like (or monetary) goal parameter that you aim to maximize, and then calculate the (empirical) goal scores for the “new” and the “old” oracle separately. The difference is then an estimated “incremental prognostic impact.” This decision-analytic approach has the convenient side effect of being automatically proper. The literature on proper scoring rules is vast, but the mathematics underlying these paragraphs can be found in the Oslo handout.7 The decision-theoretic shortcomings of NRI/IDI were first stressed by Sander Greenland.8
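As a concrete instance of such a goal parameter, the following sketch uses the decision-curve style net benefit (treating a patient counts a true positive as +1 and a false positive as −t/(1−t) at risk threshold t). The choice of goal parameter, the two hypothetical oracles, and all numbers are mine, for illustration only:

```python
import numpy as np

expit = lambda z: 1.0 / (1.0 + np.exp(-z))

def net_benefit(p, y, t):
    """Empirical net benefit of treating at risk threshold t:
    true positives count +1, false positives count -t/(1-t)."""
    treat = p >= t
    n = len(y)
    tp = np.sum(treat & (y == 1)) / n
    fp = np.sum(treat & (y == 0)) / n
    return tp - fp * t / (1.0 - t)

rng = np.random.default_rng(2)
n = 50_000
x_old = rng.normal(size=n)                 # established marker
x_new = rng.choice([-1.0, 1.0], size=n)    # the added marker
pi = expit(-1.0 + x_old + 0.7 * x_new)     # true risk given both markers
y = rng.binomial(1, pi)

p_new = pi                                 # "new" oracle: uses both markers
# "old" oracle: the new marker averaged out (its two values equally likely)
p_old = 0.5 * (expit(-1.0 + x_old + 0.7) + expit(-1.0 + x_old - 0.7))

t = 0.30                                   # treat if risk >= 30%
nb_old = net_benefit(p_old, y, t)
nb_new = net_benefit(p_new, y, t)
print(f"incremental prognostic impact at t={t}: {nb_new - nb_old:+.4f}")
```

The difference of the two empirical goal scores is the incremental prognostic impact, and because the goal parameter is a genuine expected utility, the scheme cannot be gamed by distorting the quoted risks.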
Where does this leave the area under the receiver operating characteristic (ROC) curve, alias c-statistic? If the goal is chosen as “probability of correct pair-wise triage,” then the area under the ROC curve is exactly the empirical frequency of interest and hence proper. If patients are imagined to arrive in pairs, and only one can be given life-saving treatment at a time, then this parameter is the conditional frequency of correct decision given that the decision matters (ie, given that exactly one of the two is otherwise destined to die). Unconditionally, the proportion is therefore the area under the ROC curve times 2ρ(1 – ρ), where ρ = mean risk. (Technically, this is not a strictly proper scoring scheme because there are transformations of the true risks that leave the expected score unchanged. With area under the ROC curve, it is only the ranking of the risks that matters.) In any event, even though the area under the ROC curve is a defective yardstick for comparison of diagnostic tests,9 it does not lead astray in the present context. Experience confirms that increments in the area under the ROC curve—while often disappointingly small—closely match the increments seen with other proper scoring rules.
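Both claims are easy to verify numerically. In this sketch of my own (simulated risk scores), the rank-based c-statistic coincides with the empirical frequency of correct pairwise triage, and multiplying by 2ρ(1 − ρ) converts it to the unconditional share of correctly handled pairs:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(x - 0.5)))    # risk scores
y = rng.binomial(1, p)

# c-statistic via the Mann-Whitney rank formula
order = np.argsort(p)
ranks = np.empty(n)
ranks[order] = np.arange(1, n + 1)
n1, n0 = int(y.sum()), int((y == 0).sum())
auc = (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

# brute force: among all (event, non-event) pairs, how often does the
# event carry the higher score (correct pairwise triage)? Ties count half.
pe, pn = p[y == 1], p[y == 0]
triage = ((pe[:, None] > pn[None, :]).mean()
          + 0.5 * (pe[:, None] == pn[None, :]).mean())

rho = y.mean()                           # mean risk
print(f"c-statistic {auc:.4f} = correct-triage frequency {triage:.4f}")
print(f"unconditional share of correctly triaged pairs: {auc * 2 * rho * (1 - rho):.4f}")
```

The two quantities agree exactly; only the ranking of the risks enters, which is why the scheme is proper but not strictly proper.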
Finally, allow me to step back and examine the risk concept itself, as it is used in the very timely critical review by Kerr and colleagues1 and in much of the related literature. First, it takes a patient’s risk of death, say, almost as a natural constant rather than something that is shaped or honed by human activity. Properly considered, the “risks” we are dealing with are conditional on standard care, lifestyle, and preventive measures.10 Second, in the words of Kerr et al,1 we aim to help clinicians “match the intensity of treatment to the level of risk.” High versus low risk is not really the issue. What matters is to identify the patient segment that will most benefit from preventive investments. Unlike what some measures of risk may lead us to believe, the crucial patients may well be those with risks in the midrange.
ABOUT THE AUTHOR
JØRGEN HILDEN, born in 1937, moved from medicine through genetics and medical data processing into biostatistics. Since 1974, his main research interests have centered on medical decision making, including the assessment of diagnostic tests, probabilistic decision aids, and meta-analysis. He retired from an associate professorship at the Department of Biostatistics, University of Copenhagen Health Faculty, in 2007, but is still active in editorial work and as a methodological watchdog.
1. Kerr KF, Wang Z, Janes H, McClelland RL, Psaty BM, Pepe MS. Net Reclassification Indices for evaluating risk prediction instruments: a critical review. Epidemiology. 2014;25:114–121
2. Hlatky MA, Greenland P, Arnett DK, et al; American Heart Association Expert Panel on Subclinical Atherosclerotic Diseases and Emerging Risk Factors and the Stroke Council. Criteria for evaluation of novel markers of cardiovascular risk: a scientific statement from the American Heart Association. Circulation. 2009;119:2408–2416
3. Pepe M, Fang J, Feng Z, Gerds T, Hilden J. The Net Reclassification Index (NRI): a misleading measure of prediction improvement with miscalibrated or overfit models. UW Biostatistics Working Paper Series. 2013; Working Paper 392. Available at: biostats.bepress.com/uwbiostat/paper392. Accessed October 16, 2013
4. Kerr KF, McClelland RL, Brown ER, Lumley T. Evaluating the incremental value of new biomarkers with integrated discrimination improvement. Am J Epidemiol. 2011;174:364–374
5. Hilden J. Preliminary note on: ‘Criteria for Evaluation of Novel Risk Markers of Cardiovascular Risk. A Scientific Statement from the American Heart Association’ (Nov 2009). Available at: http://staff.pubhealth.ku.dk/~jh/. Accessed October 3, 2013
6. Hilden J, Gerds TA. A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index. Stat Med. 2013. DOI: 10.1002/sim.5804
8. Greenland S. The need for reorientation toward cost-effective prediction: comments on ‘Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al. Stat Med. 2008;27:199–206. DOI: 10.1002/sim.2929
9. Hilden J. The area under the ROC curve and its competitors. Med Decis Making. 1991;11:95–101
10. Hilden J, Habbema DF. Prognosis in medicine: an analysis of its meaning and roles. Theor Med. 1987;8:349–365