NRI>0 Does Not Contrast the Performance of the New Risk Model with the Performance of the Old Risk Model
Most measures of incremental value are constructed by summarizing the performance of the old risk model, summarizing the performance of the new risk model, and comparing the two summaries (eg, ΔAUC and ΔNB). NRI>0 is fundamentally different. This index is not a difference of two performance measures for the two risk models but rather a comparison of the old and new risk values for each person. However, within-person changes in risk do not necessarily translate into improved performance on a population level. For example, Figure 2 (bottom row) shows examples with many changes in individual predicted risks (NRI>0 = 0.622), but the distribution of predicted risks in the population remains almost exactly the same.
When assessing a new biomarker, the ultimate question is whether clinicians should continue using the old risk model or switch to the new, expanded risk model. To answer this question, we need to assess and compare the performances of each of the risk models. NRI>0 measures the difference between the old and the new risk models within individual patients, but without providing information about the performances of the models.
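To make concrete what NRI>0 counts, the following minimal Python sketch uses the usual category-free definition: among events, the proportion whose predicted risk goes up minus the proportion whose risk goes down; among nonevents, the reverse. The function name and the toy data are illustrative assumptions and are not taken from the cardiovascular disease analysis.

```python
from typing import Sequence, Tuple


def category_free_nri(
    old_risk: Sequence[float],
    new_risk: Sequence[float],
    event: Sequence[int],
) -> Tuple[float, float, float]:
    """Return (event NRI>0, nonevent NRI>0, NRI>0) from paired predicted risks."""
    up_e = down_e = n_e = 0
    up_ne = down_ne = n_ne = 0
    for r_old, r_new, d in zip(old_risk, new_risk, event):
        if d:  # person had the event
            n_e += 1
            up_e += r_new > r_old
            down_e += r_new < r_old
        else:  # person did not have the event
            n_ne += 1
            up_ne += r_new > r_old
            down_ne += r_new < r_old
    nri_e = (up_e - down_e) / n_e        # any upward move counts, however small
    nri_ne = (down_ne - up_ne) / n_ne    # any downward move counts, however small
    return nri_e, nri_ne, nri_e + nri_ne


# Toy data (hypothetical): four events followed by two nonevents.
old = [0.10, 0.20, 0.30, 0.40, 0.05, 0.15]
new = [0.12, 0.18, 0.35, 0.40, 0.04, 0.20]
event = [1, 1, 1, 1, 0, 0]
nri_e, nri_ne, nri = category_free_nri(old, new, event)
print(nri_e, nri_ne, nri)
```

Note that the computation uses only the sign of each within-person change; neither model's sensitivity, specificity, or calibration enters the calculation.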
NRI>0 Incorporates Irrelevant Information
NRI>0, like ΔAUC, does not rely on risk thresholds. Greenland14 points out that “cutpoint free” indices incorporate irrelevant information, diminishing their potential for clinical relevance. For example, area under the curve summarizes the entire receiver operating characteristic curve, including parts of the curve describing sensitivity for unacceptably poor specificity. There are two ways in which NRI>0 incorporates irrelevant information. First, NRI>0 does not account for the size of changes in a predicted risk. Infinitesimally small changes “count” even though they are clinically irrelevant. Second, NRI>0 does not account for an individual patient’s position on the risk distribution. An event at the high end of the risk distribution who moves to an even higher risk reflects positively on NRI>0. Such movement likely has little effect on treatment decisions. A new marker is beneficial if it improves treatment decisions, which often means the marker can discriminate between events and nonevents in the middle of the risk distribution.
For the CVD data, 21% of events have a new 5-year risk within 1% of the old risk. Among nonevents, the proportion is 53%. Therefore, a sizeable proportion of the changes summarized by the event component of NRI>0, and especially by the nonevent component, are small (and likely inconsequential) changes.
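A quick way to check how many of the changes counted by NRI>0 are small is to tabulate, separately for events and nonevents, the proportion of persons whose new risk lies close to the old risk. The sketch below assumes "within 1%" means an absolute difference of at most 0.01 on the probability scale; that reading, the function name, and the toy data are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def fraction_small_changes(
    old_risk: Sequence[float],
    new_risk: Sequence[float],
    event: Sequence[int],
    tol: float = 0.01,
) -> Tuple[float, float]:
    """Proportion of events and of nonevents with |new - old| <= tol."""
    small_e = n_e = small_ne = n_ne = 0
    for r_old, r_new, d in zip(old_risk, new_risk, event):
        is_small = abs(r_new - r_old) <= tol
        if d:
            n_e += 1
            small_e += is_small
        else:
            n_ne += 1
            small_ne += is_small
    return small_e / n_e, small_ne / n_ne


# Toy data (hypothetical): three events followed by three nonevents.
old = [0.10, 0.20, 0.30, 0.05, 0.15, 0.25]
new = [0.105, 0.25, 0.302, 0.051, 0.149, 0.30]
event = [1, 1, 1, 0, 0, 0]
frac_events, frac_nonevents = fraction_small_changes(old, new, event)
print(frac_events, frac_nonevents)
```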
NRI>0 Can Make Uninformative New Markers Appear Predictive
Hilden and Gerds15 and Pepe and colleagues16 report a problematic feature of NRI>0. Suppose that an old risk model (risk(X)) and a new risk model (risk(X, Y)) are fit to a training data set. Suppose further that the new marker Y is completely uninformative. To avoid “optimistic bias” caused by using the same data to fit and evaluate model performance, a standard strategy is to use an independent data set to assess the models’ performances. However, NRI>0 tends to be positive for uninformative Y even when NRI>0 is computed on a large, independent validation data set.16 This problem is likely to arise in settings where the risk models are not well calibrated—a common phenomenon in practice. In contrast to NRI>0, more standard measures such as ΔAUC do not experience this problem. These results show that NRI>0 can mislead researchers to believe that an uninformative marker improves prediction.
For Three or More Risk Categories NRI Weights Reclassifications Indiscriminately
The purpose of risk categorization is to guide appropriate treatment decisions. For cardiovascular disease, suppose low risk indicates no intervention, medium risk indicates lifestyle changes and high risk indicates both lifestyle changes and pharmaceutical intervention. When categories correspond to treatment decisions, the nature of reclassification matters, not just the direction. For example, changing an event from high risk to low risk is a more serious error than changing from high risk to medium risk.
When there are three or more risk categories, one should consider all the ways a new biomarker can move persons among risk categories. For three risk categories, there are three ways to move “up”: low risk to medium risk; medium to high; and low to high. The three-category NRIe gives each of these equal weight; in particular, moving up two risk categories counts the same as moving up one. Section B of the eAppendix (http://links.lww.com/EDE/A732) describes how an appropriate weighting could be incorporated into a statistic. Weighting the different types of reclassification is extremely challenging, but that challenge does not justify using equal weights. As an alternative to assigning weights and providing a single numerical summary, one can instead examine the different types of reclassification in a reclassification table as shown below.
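The effect of equal weighting can be seen in a small sketch. The function below computes a three-category event NRI from predicted risks for events, with an optional weight applied to each move. The default weight of 1 for every move reproduces the equal weighting described above. The alternative that weights a move by the number of categories crossed is purely hypothetical; it is not the weighting of the eAppendix, and clinically justified weights would have to come from the benefits and costs of the treatment decisions involved. Assigning a risk exactly at a threshold to the higher category is also an assumption.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def risk_category(risk: float, thresholds: Sequence[float]) -> int:
    """0 = low, 1 = medium, 2 = high (a risk equal to a threshold goes up)."""
    return bisect_right(thresholds, risk)


def three_category_event_nri(
    old_risk: Sequence[float],
    new_risk: Sequence[float],
    thresholds: Sequence[float] = (0.03, 0.10),
    weight: Callable[[int], float] = lambda steps: 1.0,
) -> float:
    """Event NRI: weighted upward moves minus weighted downward moves, per event."""
    total = 0.0
    n = 0
    for r_old, r_new in zip(old_risk, new_risk):
        steps = risk_category(r_new, thresholds) - risk_category(r_old, thresholds)
        if steps > 0:
            total += weight(steps)
        elif steps < 0:
            total -= weight(-steps)
        n += 1
    return total / n


# Toy risks for four events (hypothetical values).
old = [0.02, 0.02, 0.05, 0.12]   # low, low, medium, high
new = [0.05, 0.15, 0.05, 0.05]   # medium, high, medium, medium
equal = three_category_event_nri(old, new)
by_steps = three_category_event_nri(old, new, weight=lambda steps: float(steps))
print(equal, by_steps)
```

In this toy example the low-to-high move contributes no more than the low-to-medium move under equal weights, so the two summaries differ.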
Polonsky et al8 considered three-category net reclassification indices with thresholds at 0.03 and 0.1 defining low, medium, and high 5-year risk (NRI0.03, 0.1 = 0.25). The value is driven by events, even though most of the population are nonevents. NRI0.03, 0.1 = 0.25 is a very coarse summary and almost impossible to interpret. Table 1 shows that the new risk model tends to place nonevents in the low- and high-risk categories, placing fewer nonevents in the medium-risk category than the old risk model. If the harm of moving a nonevent from medium to high risk is greater than the benefit of moving a nonevent from medium to low risk, then the harm of the new risk model outweighs the benefits among nonevents. The single numerical summary, NRI0.03, 0.1 = 0.25, does not reflect this.
Table 2 shows the reclassifications of nonevents and, separately, events between the old and new risk models in the cardiovascular disease study data. Such tables are interesting and potentially instructive. However, it is easiest and most informative to simply look at how a risk model assigns nonevents and events to risk categories. This information appears on the margins of Table 2 and more succinctly in Table 1. Net reclassification indices do not capture this important information.
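A reclassification table and its margins are straightforward to build from the predicted risks. The sketch below cross-tabulates old against new risk category for one group (events or nonevents); the row sums give the old model's risk distribution and the column sums give the new model's. The function names, the treatment of risks exactly at a threshold, and the toy data are assumptions for illustration and do not reproduce Tables 1 or 2.

```python
from bisect import bisect_right
from collections import Counter
from typing import List, Sequence, Tuple


def reclassification_table(
    old_risk: Sequence[float],
    new_risk: Sequence[float],
    thresholds: Sequence[float] = (0.03, 0.10),
) -> Tuple[List[List[int]], List[int], List[int]]:
    """Return (table, old_margin, new_margin); table[i][j] counts old i -> new j."""
    k = len(thresholds) + 1
    counts = Counter(
        (bisect_right(thresholds, r_old), bisect_right(thresholds, r_new))
        for r_old, r_new in zip(old_risk, new_risk)
    )
    table = [[counts[(i, j)] for j in range(k)] for i in range(k)]
    old_margin = [sum(row) for row in table]          # old model's categories
    new_margin = [sum(col) for col in zip(*table)]    # new model's categories
    return table, old_margin, new_margin


# Toy risks for four persons in one group (hypothetical values).
old = [0.02, 0.02, 0.05, 0.12]
new = [0.05, 0.15, 0.05, 0.05]
table, old_margin, new_margin = reclassification_table(old, new)
for row in table:
    print(row)
print("old margin:", old_margin, "new margin:", new_margin)
```

Running the function once for events and once for nonevents gives the two panels of a reclassification table; comparing `old_margin` with `new_margin` gives the simpler summary of how each model assigns that group to risk categories.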
Two-category NRIs: New Names for Existing Measures
When there are two risk categories, low and high, NRIe is the change in the proportion of events assigned to the high-risk category, that is, the change in the true-positive rate (ΔTPR). NRIne is the change in the proportion of nonevents designated low risk. In other words, NRIne = −ΔFPR, where ΔFPR is the change in the false-positive rate. For two risk categories, the population-weighted NRI is the change in the misclassification rate.
Furthermore, the weighted NRI is the same as the change in net benefit between the old and new risk models (eAppendix, http://links.lww.com/EDE/A732, section A or Van Calster et al17). In other words, wNRI = ΔNB.
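These identities can be checked numerically. The sketch below computes the two-category event and nonevent indices from reclassification counts and compares them with ΔTPR and ΔFPR computed directly. It also computes ΔNB from the standard net benefit formula, NB = ρ·TPR − [t/(1 − t)]·(1 − ρ)·FPR, where ρ is the event proportion; this is a weighted combination of NRIe and NRIne. The exact weights that define the weighted NRI are given in the eAppendix and Van Calster et al17 and are not reproduced here. The convention that a risk equal to t counts as high risk, the function name, and the toy data are assumptions.

```python
from typing import Dict, Sequence


def two_category_summary(
    old_risk: Sequence[float],
    new_risk: Sequence[float],
    event: Sequence[int],
    t: float,
) -> Dict[str, float]:
    high_old = [r >= t for r in old_risk]
    high_new = [r >= t for r in new_risk]
    n = len(event)
    n_e = sum(1 for d in event if d)
    n_ne = n - n_e
    rho = n_e / n  # event proportion in the sample

    def high_rate(high: Sequence[bool], want_event: bool) -> float:
        hits = sum(1 for h, d in zip(high, event) if h and bool(d) == want_event)
        return hits / (n_e if want_event else n_ne)

    d_tpr = high_rate(high_new, True) - high_rate(high_old, True)
    d_fpr = high_rate(high_new, False) - high_rate(high_old, False)

    # The same quantities from reclassification counts (movers up and down).
    rows = list(zip(high_old, high_new, event))
    up_e = sum(1 for ho, hn, d in rows if d and hn and not ho)
    down_e = sum(1 for ho, hn, d in rows if d and ho and not hn)
    up_ne = sum(1 for ho, hn, d in rows if not d and hn and not ho)
    down_ne = sum(1 for ho, hn, d in rows if not d and ho and not hn)

    return {
        "rho": rho,
        "nri_e": (up_e - down_e) / n_e,
        "nri_ne": (down_ne - up_ne) / n_ne,
        "d_tpr": d_tpr,
        "d_fpr": d_fpr,
        "d_misclass": -rho * d_tpr + (1 - rho) * d_fpr,
        "d_nb": rho * d_tpr - (t / (1 - t)) * (1 - rho) * d_fpr,
    }


# Toy data (hypothetical): four events followed by four nonevents.
event = [1, 1, 1, 1, 0, 0, 0, 0]
old = [0.05, 0.12, 0.08, 0.20, 0.02, 0.11, 0.04, 0.09]
new = [0.15, 0.12, 0.09, 0.25, 0.02, 0.08, 0.06, 0.03]
s = two_category_summary(old, new, event, t=0.10)
print(s)
```

In this sketch the population-weighted combination ρ·NRIe + (1 − ρ)·NRIne equals the decrease in the misclassification rate, that is, the negative of `d_misclass`.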
DATA ANALYSIS WITH NRI
Common practice is as follows. Investigators have a data set that includes established risk factors (X) for a condition of interest and a potentially useful new marker (Y). They fit two regression models: an “old” model that uses only X, and a “new” model that uses both X and Y. The risk models are typically logistic regression models or Cox models if data are censored. The prediction increment of Y is then assessed, typically using the same data that were used to fit the models.
NRI Should Not be Used for Testing
A researcher may consider testing the null hypothesis H0: NRI = 0. Pencina et al4 provide a z-statistic for NRI-based testing. However, the test based on this z-statistic has never been validated. The next section and eAppendix (http://links.lww.com/EDE/A732) sections D and E discuss problems with the variance formula on which this z-statistic is based.18
Interestingly, for the category-free index, NRI>0, hypothesis testing is unnecessary. Pepe et al19 show that rejecting the null hypothesis that the novel marker is not a risk factor implies rejecting the null hypothesis H0: NRI>0 = 0. In other words, once a test on the coefficient of the new marker is performed, it is redundant to perform a test based on NRI>0.
For the two-category event and nonevent indices at risk threshold t, one cannot reject either H0: NRIe = 0 or H0: NRIne = 0 on the basis of Y being a risk factor. Good tests are not yet established for these null hypotheses.
We favor inference about the nature and size of the prediction increment rather than testing a null hypothesis of no improvement. Such inference is challenging. At the early stages of model development, it might be unclear how a risk model will be used, yet understanding how a risk model will be used is important for appropriately evaluating incremental value. Setting aside these larger considerations, the next section considers methods for constructing confidence intervals for net reclassification indices.
NRI Confidence Intervals
We conducted a simulation study to evaluate methods for constructing NRI confidence intervals. Based on the section above, we considered only category-free and two-category event and nonevent net reclassification indices. Results indicate that the most reliable confidence intervals use a bootstrap estimate of the variance of the statistic. Such confidence intervals outperformed both confidence intervals constructed using the variance estimator proposed by Pencina et al4 and other types of bootstrap confidence intervals. Sections C and D of the eAppendix (http://links.lww.com/EDE/A732) describe the simulation study and its results in detail.
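As a rough illustration of a confidence interval built from a bootstrap variance estimate, the sketch below resamples persons with replacement, recomputes the statistic on each resample, takes the standard deviation of the replicates as the standard error, and forms a normal-approximation interval around the original estimate. It makes several simplifying assumptions that may differ from the procedure in the eAppendix: predicted risks are treated as fixed (the risk models are not refit within each bootstrap sample), only events are resampled for the event index, and the statistic shown is the category-free event NRI. The function names, defaults, and simulated toy data are hypothetical.

```python
import random
import statistics
from statistics import NormalDist
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[float, float]  # (old risk, new risk) for one person


def event_nri(pairs: Sequence[Pair]) -> float:
    """Category-free event NRI from (old, new) risk pairs for events only."""
    up = sum(1 for old, new in pairs if new > old)
    down = sum(1 for old, new in pairs if new < old)
    return (up - down) / len(pairs)


def bootstrap_variance_ci(
    stat: Callable[[Sequence[Pair]], float],
    rows: List[Pair],
    n_boot: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (estimate, lower, upper) using a bootstrap standard error."""
    rng = random.Random(seed)
    estimate = stat(rows)
    # Resample persons with replacement and recompute the statistic each time.
    replicates = [stat(rng.choices(rows, k=len(rows))) for _ in range(n_boot)]
    se = statistics.stdev(replicates)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate, estimate - z * se, estimate + z * se


# Simulated toy data for 200 events (hypothetical, for illustration only).
data_rng = random.Random(1)
pairs: List[Pair] = []
for _ in range(200):
    old = data_rng.uniform(0.01, 0.30)
    new = min(max(old + data_rng.gauss(0.01, 0.03), 0.0), 1.0)
    pairs.append((old, new))

estimate, lower, upper = bootstrap_variance_ci(event_nri, pairs)
print(f"event NRI>0 = {estimate:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

Other bootstrap intervals, such as percentile intervals, use the same replicates differently; the design choice here is to use the replicates only to estimate the variance.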
NRI INFERENCE IN THE MULTI-ETHNIC STUDY OF ATHEROSCLEROSIS DATA
In the cardiovascular disease study data, we used 5-year risk thresholds 0.03 and 0.1 following Polonsky et al.8 Table 3 compares confidence intervals for category-free and various two-category event and nonevent net reclassification indices. Confidence intervals computed with bootstrapping are usually, but not always, wider than confidence intervals computed using the variance estimator proposed by Pencina et al.4 For the two-category indices with threshold 0.03 for 5-year risk, the changes in the true- and false-positive rates are modest, with an estimated 6% reduction in the false-positive rate and 3% increase in the true-positive rate. For the 0.1 risk threshold, adding the coronary artery calcium score to risk prediction increases the true-positive rate substantially (19%) and also increases the false-positive rate by 3%.
Although the reclassification table (Table 2) and summary statistics (Table 3) are interesting, we find the risk distributions (Table 1) most useful. Table 1 shows that adding the coronary artery calcium score to prediction increases the proportion of events labeled as high risk. Unfortunately, it also increases the proportion of nonevents labeled as high risk. Because nonevents vastly outnumber events, Table 1 identifies an important problem with adding the calcium score to the risk model.
The recent literature on measures of incremental value has developed as follows. Dissatisfaction with ΔAUC led to proposals for measures based on risk categories and reclassification.20 The category-based NRI soon followed to address issues with those new measures.4 A preference to avoid arbitrary or weakly justified risk thresholds led to the proposal for NRI>0.5 Unfortunately, NRI>0 has many of the same problems as ΔAUC. Neither measure is clinically meaningful; both measures are broad summaries of changes in risk models; and both measures incorporate irrelevant information. In these respects, things have come full circle. It is difficult to understand whether a value of NRI>0 is large or small, and this is due only partly to lack of experience with the index. Furthermore, without proper attention to model fit, NRI>0 can mislead researchers to believe that an uninformative marker improves prediction.15,16 We are skeptical that NRI>0 will help investigators develop biomarkers or improve risk models, and we are concerned about the potential for NRI>0 to mislead.
The NRI statistics that are most useful are renamed versions of existing measures. Specifically, event and nonevent two-category net reclassification indices are the changes in the true- and false-positive rates; and the weighted two-category NRI is the change in net benefit. In both cases, we prefer the established, descriptive terminology.
We recommend the bootstrap method for estimating the variance of NRI estimates and constructing confidence intervals. However, methodology that works well for markers with small prediction increment is needed.21
The issues described above for NRI>0 also apply to net reclassification indices for three or more risk categories. However, the overriding issue for three or more risk categories is that the net reclassification indices do not discriminate between different types of reclassification—all upward movements in risk categories count the same, as do all downward movements. We thus recommend against net reclassification indices for three or more categories. As in the two-category case, if the benefits and costs of different types of classification can be specified, these can be used as weights in a weighted NRI, which would be the same as the change in net benefit. This is a challenging approach and, to the best of our knowledge, has not yet been done in practice. A practical alternative is to examine how the old and new risk models place events and nonevents into the risk categories (Table 1). A reclassification table (Table 2) may also be informative because it presents the classification achieved with the new marker within strata defined by the baseline risk model. Depending on the application, select two-category summary statistics may be appropriate, particularly for risk thresholds that indicate expensive or invasive treatment.
NRI>0 should not be used in hypothesis testing. Better tests are available and validated for the regression setting. However, we emphasize the limited value of hypothesis testing in assessing biomarkers. We recommend that investigators focus on describing the operating characteristics of risk models. Ideally, then, the prediction increment of a new marker is described in terms of how it improves risk model operating characteristics.
1. Wilson PW, D’Agostino RB, Levy D, Belanger AM, Silbershatz H, Kannel WB. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97:1837–1847
2. Gail MH, Brinton LA, Byar DP, et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J Natl Cancer Inst. 1989;81:1879–1886
3. 27th Bethesda Conference. Matching the intensity of risk factor management with the hazard for coronary disease events. J Am Coll Cardiol. 1996;27:957–1047
4. Pencina MJ, D’Agostino RB Sr, D’Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27:157–172; discussion 207
5. Pencina MJ, D’Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011;30:11–21
6. Greenland P, O’Malley PG. When is a new prediction marker useful? A consideration of lipoprotein-associated phospholipase A2 and C-reactive protein for stroke risk. Arch Intern Med. 2005;165:2454–2456
7. Peirce CS. The numerical measure of the success of predictions. Science. 1884;4:453–454
8. Polonsky TS, McClelland RL, Jorgensen NW, et al. Coronary artery calcium score and risk classification for coronary heart disease prediction. JAMA. 2010;303:1610–1616
9. Leening MJ, Steyerberg EW. Fibrosis and mortality in patients with dilated cardiomyopathy. JAMA. 2013;309:2547–2548
10. Pickering JW, Endre ZH. New metrics for assessing diagnostic potential of candidate biomarkers. Clin J Am Soc Nephrol. 2012;7:1355–1364
11. Pepe MS. Problems with risk reclassification methods for evaluating prediction models. Am J Epidemiol. 2011;173:1327–1335
12. Pencina MJ, D’Agostino RB, Pencina KM, Janssens AC, Greenland P. Interpreting incremental value of markers added to risk prediction models. Am J Epidemiol. 2012;176:473–481
13. Kerr KF, Bansal A, Pepe MS. Further insight into the incremental value of new markers: the interpretation of performance measures and the importance of clinical context. Am J Epidemiol. 2012;176:482–487
14. Greenland S. The need for reorientation toward cost-effective prediction: comments on ‘Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al., Statistics in Medicine (DOI: 10.1002/sim.2929). Stat Med. 2008;27:199–206
15. Hilden J, Gerds TA. A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index [published online ahead of print 2 April 2013]. Stat Med. doi: 10.1002/sim.5804.
16. Pepe M, Fang J, Feng Z, Gerds T, Hilden J. The Net Reclassification Index (NRI): a Misleading Measure of Prediction Improvement with Miscalibrated or Overfit Models. UW Department of Biostatistics Working Paper Series. 2013; Paper 392
17. Van Calster B, Vickers AJ, Pencina MJ, Baker SG, Timmerman D, Steyerberg EW. Evaluation of markers and risk prediction models: overview of relationships between NRI and decision-analytic measures. Med Decis Making. 2013;33:490–501
18. Pepe MS, Feng Z, Gu JW. Comments on ‘Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al., Statistics in Medicine (DOI: 10.1002/sim.2929). Stat Med. 2008;27:173–181
19. Pepe MS, Kerr KF, Longton G, Wang Z. Testing for improvement in prediction model performance. Stat Med. 2013;32:1467–1482
20. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928–935
21. Uno H, Tian L, Cai T, Kohane IS, Wei LJ. A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data. Stat Med. 2013;32:2430–2442
© 2014 by Lippincott Williams & Wilkins, Inc