Risk prediction is an important component of medical care and public health. Examples of models currently used for risk prediction are the Framingham model^{1} in cardiovascular disease and the Gail model^{2} in breast cancer. Accurate risk prediction enables clinicians to match the intensity of treatment to the level of risk.^{3} For many conditions, clinicians have a limited ability to accurately identify high-risk patients, and research efforts continue to be devoted to improving risk prediction models. In cardiovascular disease, many epidemiologic publications have evaluated whether new predictors can improve on the risk predictions from the Framingham model,^{1} which includes the established risk factors age, sex, systolic blood pressure, lipids, and smoking. The goal of such investigations was to evaluate new biomarkers for the predictive capacity they offer above and beyond established predictors. The improvement in risk prediction is called the incremental value or prediction increment of the biomarker.

In 2008, Pencina and colleagues^{4} introduced a new measure of incremental value called the net reclassification index (NRI). They expanded the definition of this index in 2011.^{5} Variants have recently become popular in some areas of medical research, especially cardiovascular epidemiology. There are approximately 500 papers that contain “net reclassification index” and cite the original paper.^{4}

Although net reclassification indices have become popular, there are common mistakes in interpretation. Furthermore, because there are now multiple net reclassification indices to choose from, investigators may be unsure which, if any, to use. In addition, statistical methods pertaining to these indices are not yet well-developed. The goals of this review were to clarify the interpretation of net reclassification indices, to relate net reclassification indices to more traditional measures, to provide guidance on choice of net reclassification indices, to highlight problems with current methods for calculating confidence intervals and *P* values for net reclassification indices, and to recommend methods for confidence intervals.

## NET RECLASSIFICATION INDICES AND OTHER MEASURES OF THE PREDICTION INCREMENT

We provide basic definitions and introduce data on cardiovascular disease risk that we will use for illustration. In the next section, we describe issues with the interpretation and application of both categorical and category-free net reclassification indices. Following that, we describe statistical issues in applying net reclassification indices. We then apply these findings to data from the Multi-Ethnic Study of Atherosclerosis and conclude with a summary and recommendations.

The context here is risk prediction, with the specific goal of improving risk prediction by adding a new predictor to an existing set of predictors. A traditional way to evaluate the prediction increment of a new biomarker is to consider the improvement in the area under the receiver operating characteristic curve for the expanded risk model compared with the baseline risk model (ΔAUC). However, promising new markers have failed to produce large increases in the area under the curve.^{4} There have been explicit calls for ways to evaluate new markers other than ΔAUC.^{6} Responding to these calls, Pencina and colleagues^{4} proposed new metrics, “integrated discrimination improvement” and “net reclassification improvement” (or “index”) for quantifying the prediction increment of a new marker. The net reclassification indices have become widely used and are the topic of this review.

The NRI, as originally proposed, seeks to quantify whether a new marker provides clinically relevant improvements in prediction. In the definition of “net reclassification indices,” the risk prediction model with established predictors is called the “old” model. The model that adds the new marker is the “new” model. “Events” are cases: persons who have or will have the disease or outcome in the absence of intervention. “Nonevents” are controls. The formula defining the NRI is^{4}

$$\mathrm{NRI} = P(\text{up} \mid \text{event}) - P(\text{down} \mid \text{event}) + P(\text{down} \mid \text{nonevent}) - P(\text{up} \mid \text{nonevent}). \tag{1}$$

“Up” means that the new risk model places a person into a higher risk category than the old model. Similarly, “down” means the new model places a person into a lower risk category. We use a superscript to indicate the risk thresholds that define the categories. For example, NRI^{0.2} means a two-category index with cutoff at 0.20 defining low and high risk. NRI^{0.1,0.2} is a three-category index with cutoffs at 0.10 and 0.20 defining low-, medium-, and high-risk categories. Any set of risk thresholds can be used to define an NRI.

The definition of the NRI in Equation 1, based originally on discrete predefined risk categories, generalizes to any upward or downward movement in predicted risks.^{5} The “category-free net reclassification index” (also called “continuous net reclassification index”) interprets definition (1) this way. We use NRI^{>0} to denote the category-free index.

The idea behind the NRI is that a valuable new biomarker will tend to increase predicted risks or risk categories for events and decrease predicted risks or risk categories for nonevents. *P* (up|event) and *P* (down|nonevent) form the positive components of the NRI in definition (1). On the contrary, events that move down and nonevents that move up are mistakes introduced by the new marker—these are the negative components of definition (1).

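An NRI is the sum of the “event NRI” and the “nonevent NRI”:

$$\mathrm{NRI}_{e} = P(\text{up} \mid \text{event}) - P(\text{down} \mid \text{event}), \qquad \mathrm{NRI}_{ne} = P(\text{down} \mid \text{nonevent}) - P(\text{up} \mid \text{nonevent}).$$

For example,

$$\mathrm{NRI}^{>0} = \mathrm{NRI}_{e}^{>0} + \mathrm{NRI}_{ne}^{>0}$$

and

$$\mathrm{NRI}^{0.1,0.2} = \mathrm{NRI}_{e}^{0.1,0.2} + \mathrm{NRI}_{ne}^{0.1,0.2}.$$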
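The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' software: the function name `nri_components`, the toy data, and the convention that a risk equal to a cutoff falls in the higher category are all assumptions made here.

```python
import numpy as np

def nri_components(old_risk, new_risk, event, thresholds=None):
    """Return (NRI_e, NRI_ne, NRI) following definition (1).

    thresholds=None gives the category-free index NRI^{>0};
    otherwise risks are first binned at the given cutoffs.
    """
    old = np.asarray(old_risk, dtype=float)
    new = np.asarray(new_risk, dtype=float)
    ev = np.asarray(event, dtype=bool)
    if thresholds is not None:
        cuts = np.sort(np.asarray(thresholds, dtype=float))
        # Category index = number of cutoffs at or below the risk
        # (assumption: a risk equal to a cutoff is in the higher category).
        old = np.searchsorted(cuts, old, side="right")
        new = np.searchsorted(cuts, new, side="right")
    up, down = new > old, new < old
    nri_e = up[ev].mean() - down[ev].mean()      # P(up|event) - P(down|event)
    nri_ne = down[~ev].mean() - up[~ev].mean()   # P(down|nonevent) - P(up|nonevent)
    return nri_e, nri_ne, nri_e + nri_ne

# Hypothetical toy data: three events followed by three nonevents.
old = [0.05, 0.15, 0.25, 0.08, 0.12, 0.30]
new = [0.12, 0.22, 0.18, 0.04, 0.15, 0.09]
event = [1, 1, 1, 0, 0, 0]

cat = nri_components(old, new, event, thresholds=[0.1, 0.2])  # NRI^{0.1,0.2}
free = nri_components(old, new, event)                        # NRI^{>0}
```

Returning the event and nonevent components alongside their sum keeps the two pieces available for separate reporting.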

For the two-category setting, Pencina et al^{5} generalized the NRI to consider the savings *s*_{1} from identifying an event as high risk and *s*_{2} from identifying a nonevent as low risk. *s*_{1} is meant to capture the adverse events that are avoided by labeling a person destined to have an event as high risk. *s*_{2} should capture all the savings (adverse events, money) from allowing a nonevent to avoid unnecessary treatment. The “weighted net reclassification index” (wNRI) is the average savings per person.

Two established measures of the prediction increment are ΔAUC (mentioned above) and ΔNB, the change in net benefit associated with the use of the new marker.^{7} For example, if the risk model is used to classify persons as “high risk” or “low risk” and high risk entails an intervention, the net benefit is

$$\mathrm{NB} = B \cdot P(\text{high risk and event}) - C \cdot P(\text{high risk and nonevent}),$$

where *B* is the average benefit of the intervention among those who otherwise would have an event and *C* is the cost of intervention (including side effects) to nonevents. For old and new risk models, the change in net benefit (ΔNB) is a measure of the prediction increment of the new marker.
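As a rough sketch of how ΔNB might be computed, the code below assumes the net benefit takes the common form *B* × *P*(high risk and event) − *C* × *P*(high risk and nonevent), with “high risk” meaning a predicted risk at or above the threshold. The function names, the toy data, and the values chosen for *B* and *C* are hypothetical.

```python
import numpy as np

def net_benefit(risk, event, threshold, benefit, cost):
    """Net benefit of treating everyone whose risk is >= threshold.

    Assumed form: B * P(high risk and event) - C * P(high risk and nonevent).
    """
    risk = np.asarray(risk, dtype=float)
    ev = np.asarray(event, dtype=bool)
    high = risk >= threshold
    p_true_pos = np.mean(high & ev)     # P(high risk and event)
    p_false_pos = np.mean(high & ~ev)   # P(high risk and nonevent)
    return benefit * p_true_pos - cost * p_false_pos

def delta_net_benefit(old_risk, new_risk, event, threshold, benefit, cost):
    """Change in net benefit when the new model replaces the old one."""
    return (net_benefit(new_risk, event, threshold, benefit, cost)
            - net_benefit(old_risk, event, threshold, benefit, cost))

# Hypothetical toy data and illustrative values of B and C.
old = [0.05, 0.15, 0.25, 0.08, 0.12, 0.30]
new = [0.12, 0.22, 0.18, 0.04, 0.15, 0.09]
event = [1, 1, 1, 0, 0, 0]
dnb = delta_net_benefit(old, new, event, threshold=0.1, benefit=1.0, cost=0.2)
```

In this toy example the new model flags one more event and one fewer nonevent as high risk, so ΔNB is positive.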

## EXAMPLE: CORONARY ARTERY CALCIFICATION AND PREDICTING CORONARY EVENTS

Polonsky et al^{8} examined the prediction increment of the coronary artery calcium score for predicting coronary heart disease (CHD) among 5878 participants in the Multi-Ethnic Study of Atherosclerosis. Median follow-up was 5.8 years, and 209 CHD events were observed. The cohort was 54% female, and the mean age was 62 years with a standard deviation of 10 years. The “old” risk model included the risk factors from the Framingham risk model plus race; the “new” model added the arterial calcium score. We use these data to illustrate metrics and methods. We estimate risks using Cox models; for simplicity, we otherwise ignore censoring in the data, following Polonsky et al.^{8} We refer readers to the original article^{8} for more details.

## INTERPRETING NET RECLASSIFICATION INDICES

### NRI Is Not a Proportion

A common mistake is to interpret the NRI as a proportion.^{9} For example, it is incorrect to interpret the index as “the proportion of patients reclassified to a more appropriate risk category,”^{10} as this is *P* (up and event) + *P* (down and nonevent). The NRI combines four proportions but is not itself a proportion.^{9} The maximum value of the NRI is 2.

NRI_{e} and NRI_{ne} are easier to interpret than the NRI because they are differences in proportions. NRI_{e} is the net proportion of events assigned a higher risk or risk category. NRI_{ne} is the net proportion of nonevents assigned a lower risk or risk category. The word “net” here is crucial for correct interpretation.

### Issues with Combining Event and Nonevent Net Reclassification Indices

Given the interpretations of NRI_{e} and NRI_{ne}, it is not clear why one would want to take a simple sum (or unweighted average) to produce the NRI. One logical alternative is to weight by the prevalence of events. This weighting extends the interpretations of NRI_{e} and NRI_{ne} to the whole population. We define the “population-weighted net reclassification index” as ρNRI_{e} + (1 − *ρ*) NRI_{ne}, where *ρ* is the prevalence of the condition or outcome. The population-weighted NRI can be interpreted as the net change in the proportion of subjects assigned a more appropriate risk or risk category under the new model.

Data from the CHD study illustrate another problem with the unweighted sum of NRI_{e} and NRI_{ne}. Using 5-year risks, NRI^{0.1} = 0.164. Looking at the components, we see that the event index is positive, NRI_{e}^{0.1} ≈ 0.19, but the nonevent index is negative, NRI_{ne}^{0.1} ≈ −0.03. Among nonevents, the arterial calcium score introduces many more errors than corrections at the 10% risk threshold. Because there are many more nonevents than events (a common situation), the new risk model introduces far more errors than corrections overall. The positive value for NRI^{0.1} masks the population-level results. Estimating the prevalence of CHD in this population as 3.6%, the population-weighted NRI^{0.1} is −0.020. That is, the net proportion of subjects assigned to a more appropriate risk category using the 0.1 threshold is −0.02.

The population-weighted NRI illustrates one problem with this index. However, we do not advocate use of the population-weighted index because there is no compelling advantage in collapsing NRI_{e} and NRI_{ne} into a single number. NRI_{e} and NRI_{ne} provide information on how the new risk model (potentially) improves prediction for events and, separately, for nonevents. The two types of improvements have different implications. Important information is lost when these two summaries are combined.^{11}
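The sign reversal described above is easy to reproduce. The sketch below uses a hypothetical helper, `population_weighted_nri`, with rounded inputs: the 3.6% prevalence estimate, and the roughly 19% and 3% increases in the true- and false-positive rates reported for the 0.1 threshold. Because the inputs are rounded, the result only approximates the reported −0.020.

```python
def population_weighted_nri(nri_e, nri_ne, prevalence):
    """Weight the event and nonevent indices by their population shares."""
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie between 0 and 1")
    return prevalence * nri_e + (1.0 - prevalence) * nri_ne

# Rounded, illustrative inputs for the 0.1 threshold.
nri_e, nri_ne, rho = 0.19, -0.03, 0.036

unweighted = nri_e + nri_ne                               # simple sum: positive
weighted = population_weighted_nri(nri_e, nri_ne, rho)    # weighted: negative
```

With these inputs the simple sum is positive while the prevalence-weighted version is negative, because nonevents make up more than 96% of the population.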

### Large and Small Values for NRI^{>0} Are Undefined

Ideally, a measure of incremental value can be interpreted in terms of the clinical or public health benefit of incorporating the new marker. Pencina et al^{12} remark that “further research is needed to determine meaningful or sufficient degree of improvement in NRI^{>0}.” NRI^{>0} has no interpretation that translates to the clinical benefit of the new marker.^{13} If it did, then the magnitude of the index would be directly applicable to the clinical setting, and a marker’s sufficiency for improving prediction would be apparent. Other metrics, including ΔAUC, share this problem of lacking a clinically meaningful interpretation. However, an additional problem with NRI^{>0} is that its scale is unfamiliar.

Pencina et al^{12} provided a mathematical example of a new marker described as having “strong effect size.” The eAppendix (http://links.lww.com/EDE/A732) section C describes the structure of the data considered by Pencina et al.^{12} Here and throughout this review, *X* represents the established predictor or set of predictors, and *Y* is the candidate new predictor. In the example,^{12} the new marker *Y* yields NRI^{>0} = 0.622. Is 0.622 large? Consider Figures 1 and 2. In all four examples in the figures, *Y* has the same distribution, and the odds ratio for *Y* given the baseline marker *X* is constant. The four examples differ only in the strength of the old risk model, that is, the predictive capacity of *X*. At one extreme, the old risk model is uninformative, with AUC = 0.5. At the other extreme, the old risk model is highly predictive with AUC = 0.99. The figures suggest that the prediction increment for *Y* diminishes as the strength of the old model increases, even though NRI^{>0} = 0.622 in all four cases. Clearly, there are important aspects of prediction not captured by NRI^{>0}.^{12}

### NRI^{>0} Does Not Contrast the Performance of the New Risk Model with the Performance of the Old Risk Model

Most measures of incremental value are constructed by summarizing the performance of the old risk model, summarizing the performance of the new risk model, and comparing the two summaries (eg, ΔAUC and ΔNB). NRI^{>0} is fundamentally different. This index is not a difference of two performance measures for the two risk models but rather a comparison of the old and new risk values for each person. However, within-person changes in risk do not necessarily translate into improved performance on a population level. For example, Figure 2 (bottom row) shows examples with many changes in individual predicted risks (NRI^{>0} = 0.622), but the distribution of predicted risks in the population remains almost exactly the same.

When assessing a new biomarker, the ultimate question is whether clinicians should continue using the old risk model or switch to the new, expanded risk model. To answer this question, we need to assess and compare the performances of each of the risk models. NRI^{>0} measures the difference between the old and the new risk models within individual patients, but without providing information about the performances of the models.

### NRI^{>0} Incorporates Irrelevant Information

NRI^{>0}, like ΔAUC, does not rely on risk thresholds. Greenland^{14} points out that “cutpoint free” indices incorporate irrelevant information, diminishing their potential for clinical relevance. For example, area under the curve summarizes the entire receiver operating characteristic curve, including parts of the curve describing sensitivity for unacceptably poor specificity. There are two ways in which NRI^{>0} incorporates irrelevant information. First, NRI^{>0} does not account for the size of changes in a predicted risk. Infinitesimally small changes “count” even though they are clinically irrelevant. Second, NRI^{>0} does not account for an individual patient’s position on the risk distribution. An event at the high end of the risk distribution who moves to an even higher risk reflects positively on NRI^{>0}. Such movement likely has little effect on treatment decisions. A new marker is beneficial if it improves treatment decisions, which often means the marker can discriminate between events and nonevents in the middle of the risk distribution.

For the CVD data, consider NRI_{e}^{>0} and NRI_{ne}^{>0} (Table 3): 21% of events have a new 5-year risk within 1% of the old risk. Among nonevents, the proportion is 53%. Therefore, a sizeable proportion of changes summarized by NRI_{e}^{>0} and especially by NRI_{ne}^{>0} are small (and likely inconsequential) changes.
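A quick diagnostic along these lines is to tabulate how many risk changes are small. The sketch below assumes “within 1%” means an absolute difference below 0.01 on the risk scale; the function name and toy data are hypothetical.

```python
import numpy as np

def share_small_changes(old_risk, new_risk, event, tol=0.01):
    """Proportion of events and of nonevents whose risk moves by less than tol."""
    old = np.asarray(old_risk, dtype=float)
    new = np.asarray(new_risk, dtype=float)
    ev = np.asarray(event, dtype=bool)
    small = np.abs(new - old) < tol
    return small[ev].mean(), small[~ev].mean()

# Hypothetical toy data: three events followed by three nonevents.
old = np.array([0.050, 0.200, 0.400, 0.030, 0.100, 0.250])
new = np.array([0.055, 0.300, 0.405, 0.029, 0.050, 0.251])
ev = np.array([True, True, True, False, False, False])

small_e, small_ne = share_small_changes(old, new, ev)

# Category-free event index: every event moves up, so it takes its
# maximum value even though two of the three moves are tiny.
nri_e_free = np.mean(new[ev] > old[ev]) - np.mean(new[ev] < old[ev])
```

Here NRI_{e}^{>0} equals 1 although two thirds of the events shift by only half a percentage point, which is the kind of inconsequential movement the index cannot distinguish from a large one.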

### NRI^{>0} Can Make Uninformative New Markers Appear Predictive

Hilden and Gerds^{15} and Pepe and colleagues^{16} report a problematic feature of NRI^{>0}. Suppose that an old risk model (risk(*X*)) and a new risk model (risk(*X*, *Y*)) are fit to a training data set. Suppose further that the new marker *Y* is completely uninformative. To avoid “optimistic bias” caused by using the same data to fit and evaluate model performance, a standard strategy is to use an independent data set to assess the models’ performances. However, NRI^{>0} tends to be positive for uninformative *Y* even when NRI^{>0} is computed on a large, independent validation data set.^{16} This problem is likely to arise in settings where the risk models are not well calibrated—a common phenomenon in practice. In contrast to NRI^{>0}, more standard measures such as ΔAUC do not experience this problem. These results show that NRI^{>0} can mislead researchers to believe that an uninformative marker improves prediction.

### For Three or More Risk Categories NRI Weights Reclassifications Indiscriminately

The purpose of risk categorization is to guide appropriate treatment decisions. For cardiovascular disease, suppose low risk indicates no intervention, medium risk indicates lifestyle changes, and high risk indicates both lifestyle changes and pharmaceutical intervention. When categories correspond to treatment decisions, the nature of reclassification matters, not just the direction. For example, changing an event from high risk to low risk is a more serious error than changing from high risk to medium risk.

When there are three or more risk categories, one should consider all the ways a new biomarker can move persons among risk categories. For three risk categories, there are three ways to move “up”: low risk to medium risk; medium to high; and low to high. The three-category NRI_{e} gives each of these equal weight; in particular, moving up two risk categories counts the same as moving up one. Section B of the eAppendix (http://links.lww.com/EDE/A732) describes how an appropriate weighting could be incorporated into a statistic. Weighting the different types of reclassification is extremely challenging, but that challenge does not justify using equal weights. As an alternative to assigning weights and providing a single numerical summary, one can instead examine the different types of reclassification in a reclassification table as shown below.
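A reclassification table of this kind can be built by cross-tabulating old and new risk categories separately for events and nonevents. The sketch below is one possible construction; the function name, the toy risks, and the convention that a risk equal to a cutoff belongs to the higher category are assumptions. The thresholds 0.03 and 0.1 are the ones used in the example data.

```python
import numpy as np

def reclassification_table(old_risk, new_risk, thresholds):
    """Cross-tabulate old (rows) against new (columns) risk categories."""
    cuts = np.sort(np.asarray(thresholds, dtype=float))
    old_cat = np.searchsorted(cuts, old_risk, side="right")
    new_cat = np.searchsorted(cuts, new_risk, side="right")
    k = len(cuts) + 1
    table = np.zeros((k, k), dtype=int)
    np.add.at(table, (old_cat, new_cat), 1)   # count each (old, new) pair
    return table

thresholds = [0.03, 0.1]  # low / medium / high

# Hypothetical predicted risks for three events and three nonevents.
events_table = reclassification_table(
    [0.02, 0.05, 0.02], [0.05, 0.12, 0.11], thresholds)
nonevents_table = reclassification_table(
    [0.05, 0.12, 0.08], [0.02, 0.15, 0.04], thresholds)

# Row sums give the old model's risk distribution, column sums the new one's.
old_margin_events = events_table.sum(axis=1)
new_margin_events = events_table.sum(axis=0)
```

In the toy events table, all three events move up, but one of them jumps two categories (low to high). The three-category NRI_{e} counts that jump the same as a one-category move, whereas the table keeps the distinction visible.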

Polonsky et al^{8} considered three-category net reclassification indices with thresholds at 0.03 and 0.1 defining low, medium, and high 5-year risk (NRI^{0.03, 0.1} = 0.25). The value is driven by events (NRI_{e}^{0.03, 0.1} contributes far more than NRI_{ne}^{0.03, 0.1}), even though most of the population are nonevents. NRI^{0.03, 0.1} = 0.25 is a very coarse summary and almost impossible to interpret. Table 1 shows that the new risk model tends to place nonevents in the low- and high-risk categories, placing fewer nonevents in the medium risk category than the old risk model. If the harm of moving a nonevent from medium to high risk is greater than the benefit of moving a nonevent from medium risk to low risk, then the harm of the new risk model outweighs the benefits among nonevents. The single numerical summary, NRI_{ne}^{0.03, 0.1}, does not reflect this.

Table 2 shows the reclassifications of nonevents and, separately, events between the old and new risk models in the cardiovascular disease study data. Such tables are interesting and potentially instructive. However, it is easiest and most informative to simply look at how a risk model assigns nonevents and events to risk categories. This information appears on the margins of Table 2 and more succinctly in Table 1. Net reclassification indices do not capture this important information.

### Two-category NRIs: New Names for Existing Measures

When there are two risk categories, low and high, NRI_{e} is the change in the proportion of events assigned to the high-risk category, that is, the change in the true-positive rate (ΔTPR). NRI_{ne} is the change in the proportion of nonevents designated low risk. In other words, NRI_{ne} = −ΔFPR, where ΔFPR is the change in the false-positive rate. For two risk categories, the population-weighted NRI is the reduction in the misclassification rate.
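The link to the misclassification rate can be worked through in a few lines. The derivation below is a sketch that writes *ρ* for the prevalence (taken to be the same under both models, since they are applied to the same population) and introduces the symbol MR for the misclassification rate purely for illustration.

```latex
\begin{align*}
\mathrm{MR} &= \rho\,(1 - \mathrm{TPR}) + (1 - \rho)\,\mathrm{FPR},\\
\Delta\mathrm{MR} &= -\rho\,\Delta\mathrm{TPR} + (1 - \rho)\,\Delta\mathrm{FPR},\\
\rho\,\mathrm{NRI}_{e} + (1 - \rho)\,\mathrm{NRI}_{ne}
  &= \rho\,\Delta\mathrm{TPR} - (1 - \rho)\,\Delta\mathrm{FPR}
   = -\Delta\mathrm{MR}.
\end{align*}
```

A positive population-weighted NRI therefore corresponds to a lower misclassification rate under the new model, and a negative value to a higher one.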

Furthermore, the weighted NRI is the same as the change in net benefit between the old and new risk models (eAppendix, http://links.lww.com/EDE/A732, section A or Van Calster et al^{17}). In other words, wNRI = ΔNB.

## DATA ANALYSIS WITH NRI

Common practice is as follows. Investigators have a data set that includes established risk factors (*X*) for a condition of interest and a potentially useful new marker (*Y*). They fit two regression models: an “old” model that uses only *X*, and a “new” model that uses both *X* and *Y*. The risk models are typically logistic regression models or Cox models if data are censored. The prediction increment of *Y* is then assessed, typically using the same data that were used to fit the models.

### NRI Should Not be Used for Testing

A researcher may consider testing the null hypothesis H_{0}: NRI = 0. Pencina et al^{4} provide a z-statistic for NRI-based testing. However, the test based on this z-statistic has never been validated. The next section and eAppendix (http://links.lww.com/EDE/A732) sections D and E discuss problems with the variance formula on which this z-statistic is based.^{18}

Interestingly for the category-free index, NRI^{>0}, hypothesis testing is unnecessary. Pepe et al^{19} show that rejecting the null hypothesis H_{0}: NRI^{>0} = 0 is implied by rejecting the null hypothesis about the novel marker being a risk factor. In other words, once a test on the coefficient of the new marker is performed, it is redundant to perform a test based on NRI^{>0}.

For the two-category NRI_{e}^{t} or NRI_{ne}^{t}, where *t* is the risk threshold, one cannot reject H_{0}: NRI_{e}^{t} = 0 and H_{0}: NRI_{ne}^{t} = 0 on the basis of *Y* being a risk factor. Good tests are not yet established for these null hypotheses.

We favor inference about the nature and size of the prediction increment rather than testing a null hypothesis of no improvement. Such inference is challenging. At the early stages of model development, it might be unclear how a risk model will be used, yet understanding how a risk model will be used is important for appropriately evaluating incremental value. Setting aside these larger considerations, the next section considers methods for constructing confidence intervals for net reclassification indices.

### NRI Confidence Intervals

We conducted a simulation study to evaluate methods for constructing NRI confidence intervals. Based on the section above, we considered only category-free and two-category event and nonevent net reclassification indices. Results indicate that the most reliable confidence intervals use a bootstrap estimate of the variance of the statistic. Such confidence intervals outperformed confidence intervals constructed using the variance estimator proposed by Pencina et al^{4} and other types of bootstrap confidence intervals. Sections C and D of the eAppendix (http://links.lww.com/EDE/A732) describe the simulation study and its results in detail.
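One way to form such an interval is to take the point estimate plus or minus 1.96 bootstrap standard errors. The sketch below does this for the two-category event and nonevent indices. It is a simplified illustration, not the simulation code from the eAppendix: it resamples persons while holding the fitted risks fixed, whereas a fuller bootstrap would refit both risk models in every resample. All function names and the simulated data are hypothetical.

```python
import numpy as np

def event_nonevent_nri(old, new, ev, threshold):
    """Two-category NRI_e (change in TPR) and NRI_ne (minus change in FPR)."""
    old_high, new_high = old >= threshold, new >= threshold
    nri_e = new_high[ev].mean() - old_high[ev].mean()
    nri_ne = old_high[~ev].mean() - new_high[~ev].mean()
    return np.array([nri_e, nri_ne])

def bootstrap_ci(old_risk, new_risk, event, threshold, n_boot=1000, seed=0):
    """Normal-approximation 95% CI using a bootstrap variance estimate."""
    old = np.asarray(old_risk, dtype=float)
    new = np.asarray(new_risk, dtype=float)
    ev = np.asarray(event, dtype=bool)
    rng = np.random.default_rng(seed)
    n = len(ev)
    est = event_nonevent_nri(old, new, ev, threshold)
    reps = np.empty((n_boot, 2))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample persons with replacement
        # Assumes every resample contains both events and nonevents.
        reps[b] = event_nonevent_nri(old[idx], new[idx], ev[idx], threshold)
    se = reps.std(axis=0, ddof=1)          # bootstrap standard errors
    z = 1.959963984540054                  # 97.5th percentile of N(0, 1)
    return est, est - z * se, est + z * se

# Simulated, purely illustrative data.
rng = np.random.default_rng(1)
n = 500
event = rng.random(n) < 0.3
old = np.clip(0.08 + 0.06 * event + rng.normal(0, 0.05, n), 0, 1)
new = np.clip(old + 0.03 * event + rng.normal(0, 0.02, n), 0, 1)

est, lower, upper = bootstrap_ci(old, new, event, threshold=0.1, n_boot=500)
```

Each returned array holds two entries, the first for NRI_{e} and the second for NRI_{ne}, so the two components receive separate intervals.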

## NRI INFERENCE IN THE MULTI-ETHNIC STUDY OF ATHEROSCLEROSIS DATA

In the cardiovascular disease study data, we used 5-year risk thresholds 0.03 and 0.1 following Polonsky et al.^{8} Table 3 compares confidence intervals for category-free and various two-category event and nonevent net reclassification indices. Confidence intervals computed with bootstrapping are usually, but not always, wider than confidence intervals computed using the variance formula of Pencina et al.^{4} For the two-category indices with threshold 0.03 for 5-year risk, the changes in the true- and false-positive rates are modest, with an estimated 6% reduction in the false-positive rate and 3% increase in the true-positive rate. For the 0.1 risk threshold, adding the coronary artery calcium score to risk prediction increases the true-positive rate substantially (19%) and also increases the false-positive rate by 3%.

Although the reclassification table (Table 2) and summary statistics (Table 3) are interesting, we find the risk distributions (Table 1) most useful. Table 1 shows that adding the arterial calcium score to prediction increases the proportion of events labeled as high risk. Unfortunately, it also increases the proportion of nonevents labeled as high risk. Because nonevents vastly outnumber events, Table 1 identifies an important problem with adding the calcium score to the risk model.

## DISCUSSION

The recent literature on measures of incremental value has developed as follows. Dissatisfaction with ΔAUC led to proposals for measures based on risk categories and reclassification.^{20} The category-based NRI soon followed to address issues with those new measures.^{4} A preference to avoid arbitrary or weakly justified risk thresholds led to the proposal for NRI^{>0}.^{5} Unfortunately, NRI^{>0} has many of the same problems as ΔAUC. Neither measure is clinically meaningful; both measures are broad summaries of changes in risk models; and both measures incorporate irrelevant information. In these respects, things have come full circle. It is difficult to understand whether a value of NRI^{>0} is large or small, and this is due only partly to lack of experience with the index. Furthermore, without proper attention to model fit, NRI^{>0} can mislead researchers to believe that an uninformative marker improves prediction.^{15,16} We are skeptical that NRI^{>0} will help investigators develop biomarkers or improve risk models, and we are concerned about the potential for NRI^{>0} to mislead.

The NRI statistics that are most useful are renamed versions of existing measures. Specifically, event and nonevent two-category net reclassification indices are the changes in the true- and false-positive rates; and the weighted two-category NRI is the change in net benefit. In both cases, we prefer the established, descriptive terminology.

We recommend the bootstrap method for estimating the variance of NRI estimates and constructing confidence intervals. However, methodology that works well for markers with small prediction increment is needed.^{21}

The issues described above for NRI^{>0} also apply to net reclassification indices for three or more risk categories. However, the overriding issue for three or more risk categories is that the net reclassification indices do not discriminate between different types of reclassification—all upward movements in risk categories count the same, as do all downward movements. We thus recommend against net reclassification indices for three or more categories. As in the two-category case, if the benefits and costs of different types of classification can be specified, these can be used as weights in a weighted NRI, which would be the same as the change in net benefit. This is a challenging approach and, to the best of our knowledge, has not yet been done in practice. A practical alternative is to examine how the old and new risk models place events and nonevents into the risk categories (Table 1). A reclassification table (Table 2) may also be informative because it presents the classification achieved with the new marker within strata defined by the baseline risk model. Depending on the application, select two-category summary statistics may be appropriate, particularly for risk thresholds that indicate expensive or invasive treatment.

NRI^{>0} should not be used in hypothesis testing. Better tests are available and validated for the regression setting. However, we emphasize the limited value of hypothesis testing in assessing biomarkers. We recommend that investigators focus on describing the operating characteristics of risk models. Ideally, then, the prediction increment of a new marker is described in terms of how it improves risk model operating characteristics.