# Net Reclassification Improvement

When adding new markers to existing prediction models, it is necessary to evaluate the models to determine whether the additional markers are useful. The net reclassification improvement (NRI) has gained popularity in this role because of its simplicity, ease of estimation, and understandability. Although the NRI provides a single-number summary describing the improvement new markers bring to a model, it also has several potential disadvantages. Any improved classification by the new model is weighted equally, regardless of the direction of reclassification. In prediction models that already identify the high- and low-risk groups well, a positive NRI may not mean better classification of those with medium risk, where it could make the most difference. Also, overfit or otherwise misspecified training models produce overly positive NRI results. Because of the unaccounted-for uncertainty in the model coefficient estimation, investigators should rely on bootstrapped confidence intervals rather than on tests of significance. Keeping in mind the limitations and drawbacks, the NRI can be helpful when used correctly.

From the ^{*}Department of Anesthesiology, University of Michigan, Ann Arbor, Michigan; and ^{†}Department of Biostatistics, University of Michigan, Ann Arbor, Michigan.

Accepted for publication November 16, 2015.

Funding: Departmental and institutional.

The authors declare no conflicts of interest.

Reprints will not be available from the authors.

Address correspondence to Elizabeth S. Jewell, MS, Department of Anesthesiology, University of Michigan, 1500 E. Medical Center Dr., Ann Arbor, MI 48109. Address e-mail to esjewell@med.umich.edu.

Clinicians constantly estimate patients’ risks for problems when caring for them. To accomplish this, information is used from various sources to create risk or prediction models. This information might include patient characteristics, conclusions from previous studies, and data from clinical monitors. In the case of myocardial injury, ST-segment morphology typically would be included. Given the significance of this measurement, many monitors automatically calculate how much the ST-segment deviates from the isoelectric line. In this issue, we report our effort to determine whether analyzing the variability of these intraoperative ST-segment values might be useful for predicting myocardial injury.^{1} We hypothesized that adding this information to patient characteristics and comorbidities would increase our ability to predict postoperative troponin elevations. To address this hypothesis, we used net reclassification improvement (NRI).

Pencina et al.^{2} first introduced the NRI as a way to quantify the gain in predictive accuracy achieved by adding new variables to a list of predictors. This method was created as an alternative to comparing areas under receiver operating characteristic curves (AUCs). The NRI amalgamates information found in reclassification tables into a single value. It contains information about both the number of individuals whose classification changed from incorrect to correct with the new predictor and the number of individuals whose classification changed from correct to incorrect. In our example, NRI can be used to answer the question of whether the addition of ST-segment values produces clinically important changes in anesthesiologists’ decisions.

The most general form of this 2-category NRI weights the reclassifications equally and does not take into account the possible differences in the costs of misclassifying an event versus misclassifying a nonevent. Therefore, the NRI provides a simple way to test whether new predictors will be clinically useful. Given the simplicity and understandability of this approach, the NRI has become a highly used statistical measure.

## CALCULATING THE ESTIMATED NRI

The 2-category NRI quantifies how the performance of a model changes when additional predictors are added. It estimates the improvement by considering the events and nonevents separately. Two tables, 1 for the events and 1 for the nonevents, compare how many predicted binary outcomes change with the use of the additional predictors. In our study, 2 logistic regressions were performed: 1 using the typical, original predictors and the other using these predictors as well as the new variable or variables of interest, ST-segment variability.^{1} By the use of a single cutoff value to dichotomize the predictions of the outcome from the logistic regressions, each case had a predicted binary outcome for the original model and the new model. Predicted binary outcomes that differed between the 2 models were considered reclassified, with the number of reclassified events and nonevents being shown in the tables (Appendix 1). This can be extended to a setting with several ordered categories, where the number of events that increase across categories and the numbers of nonevents that decrease across categories under the expanded model are computed or, to a continuous case, where the number of events whose predicted probabilities increase and the number of nonevents whose predicted probabilities decrease are computed.
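The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `two_category_nri`, its arguments, and the toy data are all assumptions made here, and the predicted probabilities are taken as given rather than produced by fitted logistic regressions.

```python
def two_category_nri(y, p_old, p_new, cutoff_old, cutoff_new=None):
    """Return (nri_event, nri_nonevent, nri) for binary outcomes y (1 = event).

    p_old / p_new are predicted probabilities from the original and expanded
    models; each is dichotomized at its own cutoff (same cutoff by default).
    """
    if cutoff_new is None:
        cutoff_new = cutoff_old
    up_e = down_e = up_n = down_n = 0
    n_event = n_nonevent = 0
    for yi, po, pn in zip(y, p_old, p_new):
        old_pos = po >= cutoff_old
        new_pos = pn >= cutoff_new
        moved_up = new_pos and not old_pos      # predicted No -> predicted Yes
        moved_down = old_pos and not new_pos    # predicted Yes -> predicted No
        if yi == 1:
            n_event += 1
            up_e += moved_up
            down_e += moved_down
        else:
            n_nonevent += 1
            up_n += moved_up
            down_n += moved_down
    nri_event = (up_e - down_e) / n_event            # events should move up
    nri_nonevent = (down_n - up_n) / n_nonevent      # nonevents should move down
    return nri_event, nri_nonevent, nri_event + nri_nonevent


# Toy data (hypothetical): 4 events followed by 6 nonevents.
y     = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_old = [0.6, 0.4, 0.3, 0.7, 0.6, 0.2, 0.1, 0.55, 0.3, 0.4]
p_new = [0.7, 0.6, 0.2, 0.8, 0.4, 0.1, 0.2, 0.6, 0.35, 0.3]
nri_event, nri_nonevent, nri = two_category_nri(y, p_old, p_new, cutoff_old=0.5)
```

In the toy data, one event moves up and one nonevent moves down, so both components are positive. Allowing a separate `cutoff_new` mirrors the situation in which the cutoff is chosen separately for each model.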

## INTERPRETING THE ESTIMATED NRI

The 2-category NRI is easier to interpret than the NRI with more categories or the continuous NRI, and it adds value to quantifying the improvement of a model, particularly in situations in which the 2 categories are chosen on the basis of existing clinical risk categories.^{3},^{4} The event and the nonevent NRI also have a relatively straightforward interpretation in the 2-category setting: the event NRI is the change in the true-positive rate and the nonevent NRI is the decrease in the false-positive rate (equivalently, the change in specificity).^{5} (Note, however, that the individual measures do not provide information on the prediction accuracy overall, only the events and nonevents separately.^{6}) The 2-category NRI also can be viewed as a method to assess the clinical decisions by allowing the clinician to state “if the risk is >X%, I will do this.” One can then take X% as a critical decision threshold^{7} and determine, using the NRI at X%, how much improvement ST-segment changes bring to decision making.

Each component of the NRI (NRI_{event} and NRI_{nonevent}) ranges from −1 to 1, so their sum can in principle range from −2 to 2; negative values indicate that the new model reclassifies worse and positive values that it reclassifies better. Lacking a predefined, clinically meaningful cutoff to define predicted events versus nonevents for the 2-category NRI in our ST-segment study, we chose an optimal cutoff to maximize sensitivity and specificity, separately in the models without and with ST-segment variability. In our study, including the ST-segment variability in the new model produced an improvement in predictive accuracy, NRI = 0.03454, with NRI_{event} = 0.01763 and NRI_{nonevent} = 0.01691. The NRI is >0, suggesting improvement in the model including the ST-segment variability. The binary diagnostics, particularly the increase of 0.0255 in AUC, agree with this improvement (Table 1). When a continuous predictor of interest is categorized, the NRI at a fixed cutpoint may be smaller (or even 0) at some cutoff values and larger at others, because the changes in predicted probability will be grouped.
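One common way to choose a cutoff that "maximizes sensitivity and specificity" is to maximize their sum (the Youden index). Whether this was the exact criterion used in the study is an assumption; the sketch below, with the hypothetical name `youden_cutoff` and made-up data, only illustrates the idea for a single model.

```python
def youden_cutoff(y, p):
    """Return (cutoff, J), where the cutoff is the observed predicted
    probability that maximizes J = sensitivity + specificity - 1.

    A case is predicted positive when its probability is >= cutoff.
    Ties are resolved in favor of the smallest cutoff (an arbitrary choice).
    """
    n_event = sum(1 for yi in y if yi == 1)
    n_nonevent = len(y) - n_event
    best_cutoff, best_j = None, float("-inf")
    for c in sorted(set(p)):
        true_pos = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= c)
        true_neg = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi < c)
        j = true_pos / n_event + true_neg / n_nonevent - 1
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j


# Hypothetical outcomes and predicted probabilities from one model.
y = [0, 0, 1, 0, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9, 0.8]
cutoff, j = youden_cutoff(y, p)
```

Running the same search separately on the original and expanded models gives one cutoff per model. Because the cutoff is itself estimated from the data, its uncertainty should be carried into any interval estimate of the NRI.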

It is less clear how NRI without a fixed cutpoint will behave. One example in the literature uses hemoglobin A1c to predict cardiovascular disease, with 3 cutpoints of predicted probability (5%, 10%, and 20%) estimated from the Cox proportional hazard model. Using 4 categories of hemoglobin A1c, Cook and Ridker^{8},^{9} found an impressive improvement with NRI = 0.11; however, when hemoglobin A1c was treated as a continuous variable, its inclusion in a predictive model of cardiovascular disease worsened accuracy with an NRI = −0.187. As we note herein, getting the functional form of the predictor right is important to obtain unbiased estimates of the NRI, with overestimation occurring if the form is either overfit or underfit. Hence, careful model fitting should be undertaken to assess whether nonlinear terms should be included for continuous predictors, whether by classification into categories (a step function), inclusion of higher-order polynomials, or use of splines to approximate nonlinearities.

## LIMITATIONS

### Equal Weights for Sensitivity and Specificity

The NRI presents numerous concerns and limitations. First, because the event and nonevent NRI estimates are calculated then summed, equal weight is given to events (changing from an incorrect No to a correct Yes) and to nonevents (changing from an incorrect Yes to a correct No). Clinically, however, we may not be interested in the overall accuracy of the prediction model but rather whether it is sensitive or specific enough. That is, we may weight sensitivity or specificity differently. For example, in a particular practice, our screening questions and heuristics would correctly identify 10 patients with malignant hyperthermia susceptibility (MHS), incorrectly identify 15 patients without MHS as having MHS, correctly identify 15 non-MHS patients, and incorrectly identify 10 MHS patients as not having MHS (Fig. 1). Suppose that adding a new question or test changes the identification of MHS in 3 patients: in the new model, 1 of the 10 patients correctly identified as having MHS is reclassified as non-MHS, whereas 2 of the 15 patients incorrectly identified as having MHS are now correctly reclassified as non-MHS. The new model, which correctly identifies 1 more patient overall, has NRI = 0.017, suggesting that it is a better model. Missing a patient with malignant hyperthermia (MH), however, may result in a fatality, whereas the cost of incorrectly classifying a non-MH patient as MHS is low: changing the type of anesthesia to avoid MH-triggering agents. For MH, we want our prediction model to be very sensitive and are willing to sacrifice specificity. In this case, rather than using the NRI, using the 2 components separately may give the anesthesiologist a better idea of the improvement (or, in this example, the worsening) wrought by the new model.
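The arithmetic behind the MH example can be checked directly. The sketch below uses only the counts stated in the text (20 MHS and 30 non-MHS patients); the variable names are ours, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Counts from the MH example: 10 + 10 patients with MHS, 15 + 15 without.
n_event, n_nonevent = 20, 30

# Reclassifications under the new model.
events_up, events_down = 0, 1          # 1 true MHS patient is now missed
nonevents_up, nonevents_down = 0, 2    # 2 false alarms are now correctly cleared

nri_event = Fraction(events_up - events_down, n_event)              # -1/20
nri_nonevent = Fraction(nonevents_down - nonevents_up, n_nonevent)  # 1/15
nri = nri_event + nri_nonevent                                      # 1/60

print(f"NRI_event = {float(nri_event):.3f}")
print(f"NRI_nonevent = {float(nri_nonevent):.3f}")
print(f"NRI = {float(nri):.3f}")
```

The overall NRI is positive (1/60, about 0.017) even though the event component is negative, which is exactly the loss of sensitivity that matters clinically in this setting.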

In computing 2-category NRI, the changes in sensitivity and in specificity are summed. This implies relatively more weight to the less-common outcome.^{10} The weight is equal to the odds of the nonevents: (1 − mean (*P*)) / mean (*P*), where mean (*P*) is the average probability of the outcome, making NRI sensitive to the prevalence of the disease.^{10} In our MH example, if there were many more patients without the disease, for example, 2000, then even if the new model correctly predicts up to 99 more patients without MH as not having MH, the NRI would be <0, suggesting that the new model is now worse. Although weighting is not equal, it is also not based on the clinical consequences of whether it is better to treat 1 person who does not have the disease or to not treat one who does have the disease. This remains a clinical judgment and not a statistical one.
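A short calculation makes the prevalence dependence concrete. The helper names `mh_nri` and `event_weight` are hypothetical; the counts are those of the MH example, with the number of non-MHS patients varied.

```python
from fractions import Fraction


def mh_nri(n_nonevent, nonevents_corrected, n_event=20, events_lost=1):
    """NRI when `events_lost` events move down and `nonevents_corrected`
    nonevents move down, with no other reclassification."""
    return Fraction(-events_lost, n_event) + Fraction(nonevents_corrected, n_nonevent)


def event_weight(n_event, n_nonevent):
    """How many correctly reclassified nonevents offset one misclassified
    event: the odds of a nonevent, (1 - mean(P)) / mean(P)."""
    return Fraction(n_nonevent, n_event)


original = mh_nri(n_nonevent=30, nonevents_corrected=2)      # 1/60 > 0
rare_99 = mh_nri(n_nonevent=2000, nonevents_corrected=99)    # -1/2000 < 0
rare_100 = mh_nri(n_nonevent=2000, nonevents_corrected=100)  # exactly 0
```

With 30 nonevents, one missed event is offset by 1.5 corrected nonevents; with 2000 nonevents, it takes 100. The trade-off is set by prevalence alone, not by the clinical cost of each error.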

Pencina et al.^{3} proposed a weighted NRI based on the expected costs and savings of reclassification with the new model. This strategy takes into account the savings from treating an eventual event as well as the savings from not treating a nonevent.

### Model Performance Across the Range of Risks

Another concern is that the current model is good enough at the extremes, but it is hazy in the middle—a gray zone in which treatment decisions are unclear.^{8} A 57-year-old man coming for open aortic aneurysmectomy, with previous drug-eluting cardiac stents, now with increasing precordial chest pressure radiating down his left arm since stopping his aspirin and clopidogrel 1 month ago, would be at high risk of perioperative myocardial infarction. A 22-year-old female track star coming for an appendectomy would be at very low risk. The man would receive further cardiac evaluation; the woman would not. Somewhere in the middle, however, is a group of patients with vague symptoms or indeterminate risk factors, where the utility of the current predictive model for further cardiac evaluation is questionable. How well would a new model improve our ability to decide whether to obtain further cardiac evaluation on patients in the gray zone?

The NRI would show the net improvement over both extremes and the gray zone. But from where does the improvement come? The extremes or the gray zone middle? A new model that changes the predictive probability of a perioperative myocardial infarction from 94% to 95% for the man and 0.0001% to 0.00001% for the woman would not change clinical decision making. A change for a person in the gray zone, however, from 5% to 20% or 20% to 5%, might. More common clinically insignificant changes at the extremes might drown out less common clinically significant changes in the gray zone.

### Overly Positive Improvements

Finally, Pepe et al.^{11} and Hilden and Gerds^{12} showed that poorly calibrated models can actually show improvements in NRI over a well-calibrated model. It is well known that, when predictions are tested on the data used to generate them, the predictive accuracy of the model (here the NRI) will be greater than when it is applied to new data. This can be avoided by using training data, that is, fitting the models of interest to a subset of the data and using the results of this model fit to determine predictions in the remaining data, from which the NRI will be calculated.

In a series of computer simulations, as well as analytical proofs, Pepe et al.^{11} showed that NRI produces overly positive results, even when training samples are used to generate risk models, if these samples are fit to a misspecified model.^{12} This feature is not shared with other prediction measures such as AUC or Brier scores. This misspecification can occur by including variables that are not actually predictive, misspecifying functional forms, or failing to include interactions. Although Pepe et al.^{11} found this feature of NRI to be very problematic, they did note that overestimates of the improvement estimated by NRI can be minimized by avoiding overfitting (e.g., excluding extraneous variables that have little or no predictive power) and externally validating the new model on a new data set. If such a data set is not available, cross-validation is an alternative approach.^{3} A subset of the data is set aside (commonly 1 of 10 for 10-fold cross-validation), the model is fit on the remaining data, and the resulting predicted values are then applied to the held-out data to compute NRI. This process is repeated many times with differing random subsamples, and a summary measure (typically mean or median) is then reported as the cross-validated NRI estimate. This approach can be embedded in a bootstrapping procedure to obtain cross-validated interval estimates.
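The cross-validation procedure just described can be sketched as follows. Everything named here is an assumption for illustration: `cross_validated_nri` takes two user-supplied fitting functions, and the two toy "models" (overall event rate, and event rate within each marker level) merely stand in for the logistic regressions without and with the new marker. Folds that happen to contain no events or no nonevents are skipped, which is one possible convention rather than a recommendation from the literature.

```python
import random
from statistics import mean


def nri_2cat(y, p_old, p_new, cutoff):
    """2-category NRI at a common cutoff; None if events or nonevents are absent."""
    net = {1: 0, 0: 0}     # net upward reclassifications among events / nonevents
    count = {1: 0, 0: 0}
    for yi, po, pn in zip(y, p_old, p_new):
        net[yi] += int(pn >= cutoff) - int(po >= cutoff)
        count[yi] += 1
    if count[1] == 0 or count[0] == 0:
        return None
    return net[1] / count[1] - net[0] / count[0]


def cross_validated_nri(rows, fit_old, fit_new, cutoff, k=10, repeats=1, seed=0):
    """rows: list of (x, y) with y in {0, 1}.
    fit_old / fit_new: callables that take training rows and return predict(x)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        idx = list(range(len(rows)))
        rng.shuffle(idx)
        for f in range(k):
            held_out = idx[f::k]                 # every k-th shuffled index
            held_set = set(held_out)
            train = [rows[i] for i in idx if i not in held_set]
            predict_old, predict_new = fit_old(train), fit_new(train)
            y = [rows[i][1] for i in held_out]
            p_old = [predict_old(rows[i][0]) for i in held_out]
            p_new = [predict_new(rows[i][0]) for i in held_out]
            est = nri_2cat(y, p_old, p_new, cutoff)
            if est is not None:                  # assumption: skip one-class folds
                estimates.append(est)
    return mean(estimates)


# Hypothetical stand-ins for the two regression models.
def fit_overall_rate(train):
    rate = sum(y for _, y in train) / len(train)
    return lambda x: rate


def fit_rate_by_marker(train):
    overall = sum(y for _, y in train) / len(train)
    groups = {}
    for x, y in train:
        groups.setdefault(x, []).append(y)
    rates = {x: sum(ys) / len(ys) for x, ys in groups.items()}
    return lambda x: rates.get(x, overall)   # fall back if a level was unseen


rows = [(1, 1)] * 50 + [(0, 0)] * 50   # toy data: marker x perfectly tracks outcome y
cv_nri = cross_validated_nri(rows, fit_overall_rate, fit_rate_by_marker, cutoff=0.4)
```

The key design point is that both models are refit inside each fold, so the held-out predictions never come from a model that saw those cases. Replacing `mean` with `statistics.median` gives the median summary mentioned above.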

## EVALUATING NRI

Pencina et al.^{3} computed SEs for NRI_{event}, NRI_{nonevent}, and NRI as follows:

Pencina et al.^{2} provided a *z*-score to determine the statistical significance of the NRI measure and an approximate 95% confidence interval given by NRI ± 1.96 × SE(NRI). Although this is simple to calculate, Pepe et al.^{4},^{13} pointed out that the variance estimate given by Pencina et al.^{2} (the square of the denominator of the *z*-score) does not account for uncertainty in the estimation of the model coefficients. To accommodate this when calculating confidence intervals, Pepe et al.^{13} suggested the use of bootstrapping, and a simulation study by Kerr et al.^{5} supported this suggestion. Because of this unaccounted-for uncertainty, bootstrapped confidence intervals are preferred to the formula-based variance estimate for hypothesis testing^{3},^{5},^{6},^{13} (Appendix 2).
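For orientation, the *z*-score of Pencina et al. is usually quoted in the form below. This is a sketch in notation assumed here ($\hat{p}$ for the estimated proportions moving up or down, $n$ for the group sizes), and it gives the variance under the null hypothesis of no reclassification improvement:

$$
z = \frac{\widehat{\mathrm{NRI}}}{\sqrt{\dfrac{\hat{p}_{\text{up,event}} + \hat{p}_{\text{down,event}}}{n_{\text{event}}} + \dfrac{\hat{p}_{\text{up,nonevent}} + \hat{p}_{\text{down,nonevent}}}{n_{\text{nonevent}}}}}
$$

The denominator depends only on the observed reclassification proportions. It treats the fitted risk predictions as fixed, which is why it cannot reflect the uncertainty in the model coefficients.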

The Kerr appendix defined new markers added to an original model as weak, medium, or strong.^{5} Figure 2 shows the predictions in events (*N* = 624) and nonevents (*N* = 3843) for both the original and new model, including very weak new markers. A useful new marker would shift the prediction of the events to the right (greater probability of having the outcome) and the nonevents to the left (lesser probability of having the outcome).^{14}

The ST-segment NRI confidence intervals from 5000 bootstrapped resamples were in agreement with the Kerr simulations. Specifically, the bootstrap-SE confidence interval (the normal approximation using the SE estimated from resampled, refitted models; see Appendix 2) and the percentile confidence interval were nearly identical. The unadjusted and bias-corrected confidence intervals were not near the bootstrap-SE and percentile confidence intervals, and they had lower limits noticeably farther from 0 (Table 2 and Fig. 3). In particular, the failure of the unadjusted interval to account for the uncertainty in the model coefficients and in the optimal cutpoint choice for the prediction models yields narrower intervals that, in repeated samples, were shown to tend to undercover the true NRI.

In the models with strong new markers, Kerr et al.^{5} found that the percentile confidence interval was not as similar to the bootstrap-SE (normal approximation) confidence interval and was more conservative than the other 3 methods of confidence interval calculation. Because of this, and despite the bootstrap-SE recommendation, it is likely that investigators might reasonably use the percentile confidence interval because of its straightforward calculation. Whether the marker is weak or strong, however, and no matter the number of events or nonevents, the bootstrap-SE confidence interval remains reasonable and recommended.

## SUMMARY

NRI is not a panacea. It has a relatively useful interpretation as the difference between the change in the true- and false-positive rates resulting from the inclusion of a new predictor in a model when there is a preselected predicted probability cutpoint defining predicted cases and controls (2-category NRI). Therefore, it may be most easily interpreted in settings in which this cutpoint is agreed on and well defined. The fact that it does not discount model fits to noise also suggests that it should only be used when there is strong evidence that the new predictors are being incorporated appropriately. Like all statistical tools, there are times when it will be helpful and times that it may be misapplied. It is important to know the uses and limitations of NRI.

## APPENDIX 1

### Net Reclassification Improvement Formulas and Example

A 2-by-2 table (Appendix Fig. 1) showed the predicted troponin elevation (or not) in the original model by predicted troponin elevation (or not) in the ST-segment model. Two tables were made: 1 contained only events (elevated postoperative troponin levels, *N* = 624) and the other contained only nonevents (normal postoperative troponin levels, *N* = 3843).

Appendix Figure 1. A table constructed to show counts of predicted troponin elevation (or not) in the original and ST-segment models (top). To estimate net reclassification improvement (NRI), this table is made twice: once for events (middle) and once for nonevents (bottom).

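The NRI addresses the movement of classifications from changing models. In the ST-segment example, the classification is said to move up when the original model predicts no troponin elevation as the outcome and the new model with the ST-segments included predicts a troponin elevation (cell B) and said to move down when the original model predicts a troponin elevation and the ST-segment model predicts no elevation (cell C). The classification remains the same if both models predict the same outcome (cells A and D). From this, the 4 probabilities that formulate the NRI were estimated: the probability of moving up given there was a troponin elevation, the probability of moving down given there was a troponin elevation, the probability of moving up given that there was not a troponin elevation, and the probability of moving down given that there was not a troponin elevation. These probabilities were estimated from the number of events and nonevents in each category:

$$
\hat{P}(\text{up} \mid \text{event}) = \frac{\text{number of events moving up}}{\text{number of events}}, \qquad
\hat{P}(\text{down} \mid \text{event}) = \frac{\text{number of events moving down}}{\text{number of events}}
$$

$$
\hat{P}(\text{up} \mid \text{nonevent}) = \frac{\text{number of nonevents moving up}}{\text{number of nonevents}}, \qquad
\hat{P}(\text{down} \mid \text{nonevent}) = \frac{\text{number of nonevents moving down}}{\text{number of nonevents}}
$$

An estimated NRI can be calculated for the events and nonevents separately, indicating the improved reclassification in each group, keeping in mind that events moving up are correctly reclassified while nonevents moving down are correctly reclassified^{3}:

$$
\widehat{\mathrm{NRI}}_{\text{event}} = \hat{P}(\text{up} \mid \text{event}) - \hat{P}(\text{down} \mid \text{event}), \qquad
\widehat{\mathrm{NRI}}_{\text{nonevent}} = \hat{P}(\text{down} \mid \text{nonevent}) - \hat{P}(\text{up} \mid \text{nonevent})
$$

From here, the overall NRI, using either of the 2 following equivalent equations, is the sum of the event NRI and the nonevent NRI, providing a single value describing the reclassification in the new model:

$$
\widehat{\mathrm{NRI}} = \widehat{\mathrm{NRI}}_{\text{event}} + \widehat{\mathrm{NRI}}_{\text{nonevent}}
$$

$$
\widehat{\mathrm{NRI}} = \left[\hat{P}(\text{up} \mid \text{event}) - \hat{P}(\text{down} \mid \text{event})\right] - \left[\hat{P}(\text{up} \mid \text{nonevent}) - \hat{P}(\text{down} \mid \text{nonevent})\right]
$$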

## APPENDIX 2

### Bootstrapping Confidence Intervals

Bootstrapping is a resampling technique in which the statistician creates a new sample population by randomly selecting patients, with replacement, from the original data. Because of replacement, some patients may be included more than once in the new sample, and some may not be included at all in the new sample. The statistic of interest (here the net reclassification improvement [NRI]) is then estimated on the new sample. This process is then repeated hundreds or thousands of times to achieve a measure of the accuracy of the statistic. A bootstrapped confidence interval will converge to having the correct coverage in large sample sizes provided the estimate is consistent, is a smooth (differentiable) function of the data, and is not estimating a parameter on the boundary of the parameter space; additionally, the resampled elements need to be independent. These conditions are met in the estimation of NRI. Variations of the bootstrap estimator are available to improve the asymptotic convergence of the bootstrap variance estimates (e.g., bias-corrected bootstrap).^{15} In the course of resampling the data, one has the option of refitting the logistic regression model before recomputing the NRI measure, thus allowing uncertainty in the regression estimation to be incorporated into the construction of the NRI confidence interval.
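The resampling loop, including the refitting step, can be sketched as below. The names are assumptions: `bootstrap_replicates` is a generic helper, and `toy_nri_with_refit` is a hypothetical stand-in that "refits" two very simple models (overall event rate versus event rate within each marker level) in place of the logistic regressions used in the study.

```python
import random


def bootstrap_replicates(rows, estimate_with_refit, n_boot=1000, seed=0):
    """Draw n_boot resamples of len(rows) cases with replacement and return
    the statistic computed on each; the statistic refits its own models."""
    rng = random.Random(seed)
    n = len(rows)
    replicates = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimate_with_refit(sample))
    return replicates


def toy_nri_with_refit(sample, cutoff=0.5):
    """Hypothetical stand-in: refit both 'models' on `sample`, then compute
    the 2-category NRI on the same sample."""
    overall = sum(y for _, y in sample) / len(sample)         # 'old' model
    levels = {}
    for x, y in sample:
        levels.setdefault(x, []).append(y)
    rate = {x: sum(ys) / len(ys) for x, ys in levels.items()}  # 'new' model
    net = {1: 0, 0: 0}     # net upward moves among events (1) and nonevents (0)
    count = {1: 0, 0: 0}
    for x, y in sample:
        net[y] += int(rate[x] >= cutoff) - int(overall >= cutoff)
        count[y] += 1
    if count[1] == 0 or count[0] == 0:
        raise ValueError("resample contains no events or no nonevents")
    return net[1] / count[1] - net[0] / count[0]


# Toy data (hypothetical): marker level x and binary outcome y.
rows = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 40
nri_hat = toy_nri_with_refit(rows)
replicates = bootstrap_replicates(rows, toy_nri_with_refit, n_boot=200, seed=1)
```

Because the models are refit inside each resample, the spread of `replicates` reflects both sampling variability in the reclassification counts and uncertainty in the fitted models. A data-driven cutoff could be re-estimated inside the same function for the same reason.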

Kerr et al. considered 7 methods of constructing the confidence interval: (1) the normal approximation interval proposed in the study by Pencina et al.^{2} that ignored the uncertainty in the regression coefficient estimation (formula), (2) resampling without refitting the model to compute the SE and plug it into the normal approximation above (unadjusted), (3) resampling and refitting the model in order to estimate the SE as the square root of the variance of the resampled estimates of NRI and plug it into the normal approximation above (bootstrap-SE), (4) constructing a percentile confidence interval using the values from the resampled and refitted distribution at the 0.025 and 0.975 quantiles (percentile), and (5–7) replacing (3) with variations of bootstrap point and SE estimates designed to have somewhat better properties in small samples.^{6} Based on simulation studies with weak (NRI = 0.135), medium (NRI = 0.270), and strong (NRI = 0.577) new markers, Kerr et al. found that the formula and unadjusted intervals had rather poor coverage properties, being very anticonservative (with 95% confidence intervals covering the true value <75% of the time in some settings). In the end, Kerr simulations with a strong new marker generally had approximately correct coverage with the bootstrap-SE confidence interval, with values closest to the 0.95 target coverage value: 0.958 and 0.957 in the event and nonevent category-free NRI and 0.978 and 0.951 for the event and nonevent 2-category NRI. Additionally, the Kerr simulations had moderately conservative coverage for the weak new marker NRI measure with the bootstrap-SE confidence interval: 0.988 and 0.965 for the event and nonevent category-free NRI and 0.996 for both event and nonevent 2-category NRI. This strengthens the case for using the bootstrap-SE interval, as the other alternative measures were anticonservative for some variation of the measure considered, with the exception of the percentile estimator, which was more conservative than the bootstrap-SE estimator.
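Methods (3) and (4) differ only in how the resampled, refitted NRI estimates are turned into an interval. A minimal sketch follows, assuming the replicates have already been computed; the function names are ours, and the evenly spaced replicates are artificial values chosen so the result is easy to verify by hand.

```python
from statistics import quantiles, stdev


def bootstrap_se_interval(nri_hat, replicates, z=1.96):
    """Method (3): normal approximation using the SD of the resampled,
    refitted NRI estimates as the SE."""
    se = stdev(replicates)
    return nri_hat - z * se, nri_hat + z * se


def percentile_interval(replicates):
    """Method (4): 0.025 and 0.975 quantiles of the resampled estimates."""
    cuts = quantiles(replicates, n=40, method="inclusive")  # steps of 2.5%
    return cuts[0], cuts[-1]


# Artificial replicates: 0.00, 0.01, ..., 1.00 (for illustration only).
replicates = [i / 100 for i in range(101)]
se_lo, se_hi = bootstrap_se_interval(0.5, replicates)
pc_lo, pc_hi = percentile_interval(replicates)
```

The bootstrap-SE interval is symmetric around the point estimate by construction, whereas the percentile interval follows the shape of the resampled distribution, which is one reason the two can diverge when that distribution is skewed.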

As an alternative to calculating a single confidence interval around the overall estimated NRI, Mühlenbruch et al.^{6} proposed a confidence ellipse that would consider the event NRI and nonevent NRI separately. The 2 individual components are easier to interpret than the overall NRI. The confidence ellipse method defines an ellipse around (NRI_{event}, NRI_{nonevent}) to provide confidence bounds for the 2 separate measures in combination. The main disadvantages of this approach are that it does not consider the overall improvement of the model and that it loses some of the familiar confidence interval interpretation.

## DISCLOSURES

**Name:** Elizabeth S. Jewell, MS.

**Contribution:** This author helped design the analysis, conduct the analysis, and wrote the manuscript.

**Attestation:** Elizabeth S. Jewell approved the final manuscript.

**Name:** Michael D. Maile, MD, MS.

**Contribution:** This author helped design the analysis and wrote the manuscript.

**Attestation:** Michael D. Maile approved the final manuscript.

**Name:** Milo Engoren, MD.

**Contribution:** This author helped design the analysis and wrote the manuscript.

**Attestation:** Milo Engoren approved the final manuscript.

**Name:** Michael Elliott, PhD.

**Contribution:** This author helped conduct the analysis and wrote the manuscript.

**Attestation:** Michael Elliott approved the final manuscript.

**This manuscript was handled by:** Franklin Dexter, MD, PhD.

## REFERENCES

*et al*., Statistics in Medicine (DOI: 10.1002/sim.2929). Stat Med. 2008;27:191–5

*et al*., Statistics in Medicine (DOI: 10.1002/sim.2929). Stat Med. 2008;27:173–81