Historically, risk prediction in medicine was limited to simple models using perhaps just a single predictor such as age or family history. With the advent of genomics, proteomics, and metabolomics, we are now in an age of high-throughput biology. Corresponding to the increase in the kinds and amount of data available on patients, there is a surge of interest in predictive models. With the large numbers of potentially predictive markers that are available, risk modeling is inevitably multivariate. Therefore, one must consider the role of covariates in predictive models. However, in contrast to therapeutic and etiologic studies, concepts of covariate adjustment are not well established when the goal is evaluating classification or prediction performance.^{1}

This article addresses questions related to assessing the improvement in prediction performance gained by using a new biomarker to make predictions in addition to existing predictors. We call this the incremental value of the biomarker. We will use receiver operating characteristic (ROC) curves to assess prediction performance, and we will use AUC to denote “area under the ROC curve.”

As a motivating example, consider the problem of evaluating the capacity of newly discovered genetic markers to improve prediction for breast cancer. Making accurate predictions is clinically important because, for example, women at low risk could be spared the expense, discomfort, stress, and risk of false positives associated with screening mammography. In contrast, women at high risk of breast cancer are candidates for prophylactic tamoxifen therapy. However, taking tamoxifen increases risks of endometrial cancer, stroke, and pulmonary embolism, and therefore only women who are clearly at high risk of breast cancer should be advised to take tamoxifen for prevention.

One issue in studying any predictor of breast cancer is how to consider age. A woman's age is very predictive of breast cancer risk. Gail^{2} notes the fallacy in ignoring age when considering new markers for breast cancer: “Some investigators compare case patients and control subjects over large age ranges. Because age is a strong predictor of breast cancer risk and is included in all risk models and because case patients tend to be older than control subjects, doing so increases the AUC value.”

Gail^{2} investigated whether 7 common, recently identified single-nucleotide polymorphisms (SNPs) could improve breast cancer prediction over existing models. Age-matched data allowed Gail to adjust for the predictive ability of age by examining the predictive ability of the SNPs in cases and controls of approximately the same age. In other words, the strategy was to evaluate the new predictors by stratifying on the existing predictor (in this context, age). Gail's measure of predictive ability was the area under the age-stratified ROC curves. This amounts to adjusting the ROC curve for age.^{1} In contrast, Wacholder et al^{3} took a fundamentally different approach; they examined the classification performance of risk models that incorporated novel markers of 10 genetic variants, as well as traditional risk factors, including age. The researchers calculated the ROC curve for the joint risk model and then compared it with the ROC curve for the risk model without the addition of genetic variants.

The different analytic strategies of Gail^{2} and Wacholder et al^{3} raise many interesting questions. How do the AUC values for the new markers evaluated in groups that are homogeneous with respect to the existing predictors^{2} relate to the change in the AUC from adding the new markers to an existing set of risk predictors?^{3} What do we learn about the value of the new marker for the overall population, which includes women of different ages? We explore these methodological questions in this paper. We then contrast traditional concepts of covariate adjustment in predictive modeling with covariate adjustment in assessing the predictive performance of a biomarker. The former entails joint prediction using an existing marker *X* and a new marker *Y,* whereas the latter evaluates the performance of *Y* in groups homogeneous with respect to *X.* We describe the very limited conditions under which the ROC curve for the joint risk model is the same as the covariate-adjusted ROC curve. We then discuss the concept of interaction in the context of evaluating the predictive performance of a marker. We demonstrate that examining interaction in terms of odds ratios is not relevant to whether there is an interaction for predictive performance. Next, we discuss the implications of the ideas presented for predictive modeling. We explore these ideas further in a dataset of predictors for prostate cancer and a dataset of predictors of renal artery stenosis.

## COVARIATE ADJUSTMENT VERSUS JOINT MODELING

Consider a simplified epidemiologic study. There is a binary variable *D* indicating disease, a variable *X* known to be associated with disease, and an additional variable of interest *Y.* In our example, *D* is occurrence of breast cancer within 5 years, *X* is age, and *Y* is SNPs. We might model the risk of disease using logistic regression:

logit *P*(*D* = 1|*X, Y*) = β_{0} + β_{1}*X* + β_{2}*Y* (1)

In traditional epidemiology, there are 2 complementary ways in which a model such as Equation (1) is interpreted. First, we can say β_{2} summarizes the association between *Y* and the log odds of disease for subjects with the same value of *X.* We call this the covariate adjustment interpretation. Covariate adjustment corresponds to the concept of stratifying subjects according to a variable, in this case stratifying by *X.* A second interpretation is that of joint prediction. Model (1) contains both *X* and *Y*, and is therefore a joint model for the log odds of disease, using *X* and *Y* as predictors.

Therefore, we argue that in epidemiology the concepts of adjusting for covariates and of joint modeling are at least partially conflated because the same model can be used for both. However, for discriminating 2 classes of patients, cases (*D* = 1) and controls (*D* = 0), we next show that stratification and joint modeling are distinct concepts.

ROC curves are useful and popular tools for summarizing the ability of a marker or a risk score to discriminate cases and controls. For a single continuous predictor *Y*, *ROC*_{Y}() describes the ability of *Y* to discriminate between cases and controls by plotting the true positive rate *P*(*Y* > *y*|*D* = 1) against the false positive rate *P*(*Y* > *y*|*D* = 0) as the threshold *y* varies. *ROC*_{X, Y}() refers to the ROC curve for a predictive model that uses both *X* and *Y.* For joint prediction, it is known that the optimal way to combine *X* and *Y* for discrimination is to predict disease based on *risk*(*X, Y*) ≡ *P*(*D* = 1|*X, Y*).^{4–6} That is, the combination defined by *risk*(*X, Y*) has the best ROC curve among all possible combinations of *X* and *Y.* Therefore, we write *ROC*_{X, Y}() for the ROC curve for the risk function, *ROC*_{risk(X, Y)}(). In contrast, the curve *ROC*_{Y|X}() is the ROC curve for *Y* stratified on *X.* It describes the ability of *Y* to discriminate between cases and controls in subpopulations that are homogeneous with respect to *X.* When *ROC*_{Y|X}() does not depend on *X,* it is called the covariate-adjusted ROC curve. Gail's analysis addresses the covariate-adjusted ROC curve, *ROC*_{Y|X}(), whereas the study performed by Wacholder et al addresses the joint ROC curve, *ROC*_{X, Y}().^{2,3}
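As a minimal sketch of these definitions, the following Python code computes an empirical ROC curve and its AUC for a single marker. The function names `empirical_roc` and `auc` and the toy data are illustrative assumptions, not part of the analyses discussed here.

```python
import numpy as np

def empirical_roc(marker, disease):
    """Return (fpr, tpr) for the rule 'marker > threshold' over all thresholds."""
    marker = np.asarray(marker, dtype=float)
    disease = np.asarray(disease, dtype=bool)
    cases, controls = marker[disease], marker[~disease]
    # Sweep thresholds from high to low so both rates rise from 0 to 1.
    thresholds = np.concatenate(([np.inf], np.unique(marker)[::-1], [-np.inf]))
    tpr = np.array([(cases > t).mean() for t in thresholds])
    fpr = np.array([(controls > t).mean() for t in thresholds])
    return fpr, tpr

def auc(marker, disease):
    """AUC as P(case marker > control marker), counting ties as 1/2."""
    marker = np.asarray(marker, dtype=float)
    disease = np.asarray(disease, dtype=bool)
    diff = marker[disease][:, None] - marker[~disease][None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Toy data (hypothetical): two controls followed by two cases.
marker = np.array([1.0, 3.0, 2.0, 4.0])
disease = np.array([0, 0, 1, 1])
fpr, tpr = empirical_roc(marker, disease)
print(list(zip(fpr, tpr)), auc(marker, disease))
```

The pairwise form of the AUC is used here because it makes the interpretation explicit: it is the probability that a randomly chosen case has a higher marker value than a randomly chosen control.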

Another way to understand the difference between *ROC*_{X, Y}() and *ROC*_{Y|X}() is to consider how they might be estimated using a tool such as model (1). To estimate *ROC*_{X, Y}(), one should take a sample of patients for which *X, Y,* and *D* are known, use model (1) to estimate risks of disease, and then estimate the ROC curve that summarizes the overlap in these estimated risks between diseased and nondiseased persons. In contrast, to estimate *ROC*_{Y|X}(), one must condition on *X.* After fitting model (1), one would take a sample of patients with the same value of *X* and use model (1) to get estimated risks based on the subjects' *Y* values and their shared value of *X.* One could then make an empirical ROC curve based on these estimated risks. Averaging these stratum-specific curves over the values of *X* gives the covariate-adjusted ROC curve.
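The two estimation routes can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in any cited study: the simulated data, the coefficients in `true_logit`, the hand-written Newton-Raphson fit, and the choice to weight strata by the distribution of *X* among cases when averaging are all assumptions made for the example.

```python
import numpy as np

def auc(score, d):
    """AUC via sorted controls; ties between a case and a control count 1/2."""
    cases, controls = score[d == 1], np.sort(score[d == 0])
    below = np.searchsorted(controls, cases, side="left")
    at_or_below = np.searchsorted(controls, cases, side="right")
    return (below + 0.5 * (at_or_below - below)).sum() / (cases.size * controls.size)

def fit_logistic(design, d, n_iter=25):
    """Maximum-likelihood logistic regression by Newton-Raphson."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        grad = design.T @ (d - p)
        hess = design.T @ (design * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def joint_and_adjusted_auc(seed=0, n=4000):
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n)            # binary existing predictor X
    y = rng.normal(loc=x, scale=1.0, size=n)  # new marker Y
    true_logit = -2.0 + 1.5 * x + 1.0 * y     # hypothetical coefficients
    d = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

    design = np.column_stack([np.ones(n), x, y])
    beta = fit_logistic(design, d)
    risk = 1.0 / (1.0 + np.exp(-design @ beta))

    # Joint route: one ROC curve for the estimated risks in the whole sample.
    auc_joint = auc(risk, d)
    # Stratified route: one ROC curve per value of X, then a weighted average.
    auc_adjusted = sum(
        np.mean(x[d == 1] == v) * auc(risk[x == v], d[x == v]) for v in (0, 1)
    )
    return beta, auc_joint, auc_adjusted

beta, auc_joint, auc_adjusted = joint_and_adjusted_auc()
print(beta, auc_joint, auc_adjusted)
```

Within a stratum the fitted risk is a monotone function of *Y* whenever the coefficient of *Y* is positive, so the stratum-specific ROC curve for the estimated risks coincides with the ROC curve for *Y* itself in that stratum.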

Despite the fact that the concepts of covariate adjustment and joint modeling are intertwined when studying associations between disease and predictors, *ROC*_{X, Y} = *ROC*_{Y|X} only in some very specific cases. We present Example 1 to provide intuition before stating a general result.

### Example 1

We present an example where there are 2 predictors of disease, *X* and *Y,* and show that *ROC*_{X, Y}() ≠ *ROC*_{Y|X}(). In this example, *X* is a binary predictor. For concreteness, let *X* represent 2 categories of age, say, with *X* = 0 for younger subjects, and *X* = 1 for older subjects. Let the distribution of the new marker *Y* be as follows:

The top 2 panels of Figure 1 illustrate the distribution of *Y.* Let *p*_{0} = *P*(*D* = 1|*X* = 0) and *p*_{1} = *P*(*D* = 1|*X* = 1) be the prevalences of disease in the younger and older subpopulations, and let *q* = *P*(*X* = 1) be the proportion of subjects in the population that are in the older age category.

*ROC*_{Y|X} can be computed simply by conditioning on *X*. It is obvious that *ROC*_{Y|X = 0} = *ROC*_{Y|X = 1}, which we write as *AROC,* because in both cases the ROC curve comes from the overlap between 2 unit-variance normal distributions with a difference of 2 in their means.

*ROC*_{X, Y} is computed from *risk*(*X, Y*) ≡ *P*(*D* = 1|*X, Y*). For this simple model, Bayes' theorem gives formulas for these risks as a function of *X* and *Y*:

It can be shown algebraically that *ROC*_{X, Y}() = *ROC*_{Y|X}() if and only if *p*_{0} = *p*_{1}. This also follows from the general result proved in the following subsection. In other words, the ROC curve for the joint prediction is the same as the ROC curve for *Y* adjusted for *X* if and only if *X* is not a risk factor. If *X* is a risk factor (ie, *p*_{0} ≠ *p*_{1}), *ROC*_{X, Y} and *ROC*_{Y|X} are different curves.

Figures 2 and 3 show adjusted and unadjusted ROC curves for different values of *p*_{0}, *p*_{1}, and *q.* Figure 2 shows an example where *p*_{0} = *p*_{1}; the covariate-adjusted curve *ROC*_{Y|X} and the joint ROC curve, *ROC*_{X, Y}, are the same. In contrast, Figure 3 shows an example where *p*_{0} ≠ *p*_{1}. In this case, *ROC*_{X, Y} > *ROC*_{Y|X}. An intuitive explanation for the difference is that, in the first example, knowing *X* tells us how to interpret *Y,* but does not provide any independent information about disease status. This is a case where we might say “*X* calibrates *Y.*” In the second example, *X* provides independent information about disease status in addition to telling us how to interpret *Y.*
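The contrast between the two figures can be explored with a small simulation. The sketch below makes several assumptions that are not taken from the text: within each stratum controls and cases are unit-variance normal with means 2 apart, the location of the *X* = 1 distributions is set by a hypothetical `shift`, and the values of *p*_{0}, *p*_{1}, and *q* are illustrative rather than those used in the figures.

```python
import numpy as np

def auc(score, d):
    """AUC via sorted controls; ties between a case and a control count 1/2."""
    cases, controls = score[d == 1], np.sort(score[d == 0])
    below = np.searchsorted(controls, cases, side="left")
    at_or_below = np.searchsorted(controls, cases, side="right")
    return (below + 0.5 * (at_or_below - below)).sum() / (cases.size * controls.size)

def simulate_example(p0, p1, q=0.5, shift=2.0, n=20000, seed=0):
    """Return (joint AUC, covariate-adjusted AUC) for a two-stratum binormal model."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, q, size=n)
    prevalence = np.where(x == 1, p1, p0)
    d = rng.binomial(1, prevalence)
    # Assumed marker model: controls N(shift * x, 1), cases N(shift * x + 2, 1).
    y = rng.normal(loc=shift * x + 2.0 * d, scale=1.0)
    # Bayes' theorem on the logit scale: prior log odds + log likelihood ratio.
    centred = y - shift * x
    risk_logit = np.log(prevalence / (1.0 - prevalence)) + 2.0 * centred - 2.0
    auc_joint = auc(risk_logit, d)
    # Case-weighted average of the stratum-specific AUCs of Y.
    auc_adjusted = sum(
        np.mean(x[d == 1] == v) * auc(y[x == v], d[x == v]) for v in (0, 1)
    )
    return auc_joint, auc_adjusted

print(simulate_example(0.3, 0.3))  # X only calibrates Y
print(simulate_example(0.1, 0.5))  # X is also a risk factor
```

With equal prevalences the two AUC values should agree up to sampling error, while with unequal prevalences the joint AUC should exceed the adjusted one, in line with the "if and only if" statement above.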

### A General Result About *ROC*_{X, Y} and *ROC*_{Y|X}

The next result shows that in order for the joint and covariate-adjusted ROC curves to be equal, *X* cannot be informative of disease status marginally. Moreover, the role of *X* in the joint risk model can at most be to calibrate *Y.* In particular, we let *W* = *F*_{D̄,X}(*Y*), where *F*_{D̄,x} is the cumulative distribution of *Y* in the population of controls with *X* = *x.* Huang and Pepe^{7} use the term “covariate-specific percentile value” for 100 × *W,* which refers to the fact that *Y* is transformed to a percentile according to the distribution of *Y* in the reference population with *D* = 0 and *X* = *x.* Recall that we use the notion of a covariate-adjusted ROC curve when the conditional (or stratified) ROC curves, *ROC*_{Y|X}(), are the same across *X* values (or strata). In this setting, we have the following result.
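A minimal sketch of the covariate-specific percentile transformation, using the empirical distribution of *Y* among controls in each stratum, is given below. The function name `covariate_specific_percentile` and the simulated data (a marker whose location depends on *X* while disease prevalence does not) are assumptions chosen for illustration.

```python
import numpy as np

def auc(score, d):
    """AUC via sorted controls; ties between a case and a control count 1/2."""
    cases, controls = score[d == 1], np.sort(score[d == 0])
    below = np.searchsorted(controls, cases, side="left")
    at_or_below = np.searchsorted(controls, cases, side="right")
    return (below + 0.5 * (at_or_below - below)).sum() / (cases.size * controls.size)

def covariate_specific_percentile(y, x, d):
    """W: fraction of controls with the same X whose marker value is <= Y."""
    w = np.empty_like(y, dtype=float)
    for v in np.unique(x):
        in_stratum = x == v
        controls = np.sort(y[in_stratum & (d == 0)])
        w[in_stratum] = (
            np.searchsorted(controls, y[in_stratum], side="right") / controls.size
        )
    return w

def demo(seed=0, n=20000):
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, 0.5, size=n)
    d = rng.binomial(1, 0.3, size=n)                    # same prevalence in both strata
    y = rng.normal(loc=3.0 * x + 2.0 * d, scale=1.0)    # X shifts Y (hypothetical)
    w = covariate_specific_percentile(y, x, d)
    return auc(y, d), auc(w, d)

raw_auc, percentile_auc = demo()
print(raw_auc, percentile_auc)
```

In this setup the pooled ROC curve for *W* recovers the common within-stratum discrimination, whereas the pooled ROC curve for the raw marker is degraded because values of *Y* are not comparable across strata.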

#### Result 1

Let *ROC*_{Y|X} be the same for all *X.* Then we have the following equivalences.

#### Proof of Result 1

We prove the result for continuous *Y* with common support for *Y* in case and control populations. Janes and Pepe^{8} showed that the covariate-adjusted ROC curve is the same as the ROC curve for *W,* written *ROC*_{W}(). Also, because *F*_{D̄,X}() is a strictly increasing function, *P*(*D* = 1|*X, Y*) = *P*(*D* = 1|*X, W*), and therefore the ROC curve for (*X, Y*) is the same as that for (*X, W*). We therefore rewrite (2) as

By the lemma in the Appendix, (5) implies that

and therefore (3) holds. In the reverse direction, it is obvious that (3) implies (5), which is equivalent to (2).

Bayes' theorem yields the identity

where *f* denotes the probability density of *W.* The distribution of *W* conditional on *X* is uniform (0,1) in controls. So, *f*(*W*|*D* = 0, *X*) does not depend on *X.* Neither does *f*(*W*|*D* = 1, *X*) depend on *X* because, according to Janes and Pepe,^{8} the cumulative distribution of 1 − *W* given *D* = 1 and *X* is the covariate-adjusted ROC curve, which we have assumed does not depend on *X.* Therefore, neither *f*(*W*|*D* = 1, *X*) nor *f*(*W*|*D* = 0, *X*) depends on *X.* It follows that if *P*(*D* = 1|*X, W*) does not depend on *X,* then neither does *P*(*D* = 1|*X*), and vice versa. In other words, (3) holds if and only if (4) holds.
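The final step of the argument can be written out on the logit scale. The display below is a sketch assembled from the statements in the proof; the logit form of Bayes' theorem is one standard way to write the identity and is an assumption about presentation, not a quotation.

```latex
\begin{align*}
W &= F_{\bar{D},X}(Y), \qquad
ROC_{Y|X}(\cdot) = ROC_{W}(\cdot), \qquad
ROC_{X,Y}(\cdot) = ROC_{X,W}(\cdot), \\[4pt]
\operatorname{logit} P(D = 1 \mid X, W)
  &= \operatorname{logit} P(D = 1 \mid X)
   + \log \frac{f(W \mid D = 1, X)}{f(W \mid D = 0, X)} .
\end{align*}
```

Because neither density in the ratio depends on *X* when there is a single stratified ROC curve, the left-hand side is free of *X* exactly when logit *P*(*D* = 1|*X*) is, which is the stated equivalence.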

We emphasize that if *X* is not useful for prediction marginally, it may still have a role in a joint risk model. In particular, *X* will be useful if *X* calibrates *Y.* Example 1 demonstrates this phenomenon. The equivalence between (3) and (4) under the assumption that there is a single stratified ROC curve states this formally and appears to be a new, general, and interesting result.

#### Corollary 1

Let *ROC*_{Y|X} be the same for all *X* and suppose *P*(*D* = 1|*X*) ≠ *P*(*D* = 1) for some *X.* Then *ROC*_{X, Y}(·) ≥ *ROC*_{Y|X}(·).

#### Proof of Corollary 1

As mentioned in the proof of Result 1, *ROC*_{Y|X}(·) = *ROC*_{W}(·), where *W* = *F*_{D̄,X}(*Y*).^{8} In contrast, *ROC*_{X, Y}(·) = *ROC*_{risk(X, Y)}(·) is known to be the optimal combination of *X* and *Y* for predicting disease: for a given false-positive rate *f*, *ROC*_{X, Y}(*f*) dominates any other combination of *X* and *Y* for predicting *D.*^{4–6} This implies *ROC*_{X, Y}(·) ≥ *ROC*_{W}(·) because *W* refers to a particular way of combining *X* and *Y.*

## CONCEPTS OF INTERACTIONS AMONG PREDICTORS

In any context in which a statistical model is used with multiple predictors, the possibility of interactions among predictors can arise. What precisely one means when one says that 2 variables “interact” depends on the context, and so does the most appropriate definition of “interaction.”^{9} For many researchers who work in epidemiology and often use logistic regression models, the phrase “*X* and *Y* interact” means that the odds ratio for *Y* depends on *X* (*OR*_{Y|X} varies with *X*). However, this is not the most appropriate definition of interaction when discriminating between cases and controls. Rather, a more relevant notion of interaction is to say that *X* and *Y* interact if *ROC*_{Y|X} varies with *X.* Examples 2 and 3 in this section demonstrate that these notions of interaction are not the same.

### Example 2

*OR*_{Y|X} Depends on *X*; *ROC*_{Y|X} Does Not

Consider the following variation on the data model in Example 1. When *X* = 0, the distribution of *Y* is exactly the same as Example 1. The bottom panel of Figure 1 shows the distribution of *Y* when *X* = 1.

*ROC*_{Y|X} is the same as in Example 1 and, in particular, does not depend on *X*: *ROC*_{Y|X = 0} = *ROC*_{Y|X = 1}. Note that for *X* = 1, the larger separation in the means of the distribution of *Y* for cases and controls is exactly compensated for by the larger variability.^{10(p. 82)}

However, the odds ratios do depend on *X* in this example. To see this, we use Bayes' theorem to calculate

logit *P*(*D* = 1|*Y*, *X* = 0) = 2*Y* − 2 + logit(*p*_{0}), so *OR*_{Y|X = 0} = exp(2), whereas

logit *P*(*D* = 1|*Y*, *X* = 1) = *Y* − 2 + logit(*p*_{1}), so *OR*_{Y|X = 1} = exp(1).
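These two logits can be checked numerically. The sketch below assumes equal-variance normal marker distributions within each stratum and uses parameters inferred from the stated logits and the common ROC curve, not quoted from the text: controls *N*(0, 1) and cases *N*(2, 1) when *X* = 0, and controls *N*(0, 2²) and cases *N*(4, 2²) when *X* = 1. The names `STRATA`, `log_likelihood_ratio`, and `binormal_auc` are illustrative.

```python
import math

# Assumed (mean in controls, mean in cases, common sd) for each stratum of X.
STRATA = {0: (0.0, 2.0, 1.0), 1: (0.0, 4.0, 2.0)}

def log_likelihood_ratio(y, mean_controls, mean_cases, sd):
    """log f(y | case) - log f(y | control) for equal-variance normal densities."""
    return ((y - mean_controls) ** 2 - (y - mean_cases) ** 2) / (2.0 * sd ** 2)

def slope_and_intercept(x):
    """The log likelihood ratio is linear in y; return its slope and intercept."""
    m0, m1, sd = STRATA[x]
    intercept = log_likelihood_ratio(0.0, m0, m1, sd)
    slope = log_likelihood_ratio(1.0, m0, m1, sd) - intercept
    return slope, intercept

def binormal_auc(x):
    """AUC for two equal-variance normals: Phi((m1 - m0) / (sd * sqrt(2)))."""
    m0, m1, sd = STRATA[x]
    z = (m1 - m0) / (sd * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for x in (0, 1):
    slope, intercept = slope_and_intercept(x)
    # The odds ratio per unit of Y is exp(slope); the prior log odds logit(p_x)
    # adds to the intercept but does not affect the slope.
    print(x, slope, intercept, math.exp(slope), binormal_auc(x))
```

Under these assumed parameters the slope of the logit in *Y* halves when moving from *X* = 0 to *X* = 1, while the stratum-specific AUC is unchanged, which is the distinction the example draws.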

Observe in this example that, according to Result 1, if *X* is not marginally predictive then the role of *X* in the risk model is only to calibrate *Y.* However, if *X* is marginally predictive of *D*, then the joint model will involve additional effects of *X* on risk and the ROC curve for the joint risk model will be higher than that of the common stratified ROC curve.

### Example 3

*ROC*_{Y|X} Depends on *X*; *OR*_{Y|X} Does Not

Let *X* have a Bernoulli distribution with *P*(*X* = 1) = *P*(*X* = 0) = 1/2, and let *Y* ∼ *N*(*X,* 1). Let the risk of disease follow a logistic model:

We simulated *X* and *Y* values, used model (7) to calculate risks of disease, and simulated disease status based on these risks. Figure 4 shows *ROC*_{X, Y} and *ROC*_{Y|X}. Note that *ROC*_{Y|X = 0} ≠ *ROC*_{Y|X = 1}. Model (7) makes it obvious that *OR*_{Y|X} does not depend on *X.*
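The same phenomenon can be illustrated without simulation by numerical integration. The sketch below assumes an additive logistic risk model with hypothetical coefficients (`beta0`, `beta_x`, `beta_y`); they are not the coefficients of model (7). It uses *Y* ∼ *N*(*X*, 1) as in the example and computes the AUC of *Y* within each stratum on a grid.

```python
import numpy as np

def stratum_auc(x, beta0, beta_x, beta_y, half_width=8.0, n_grid=4001):
    """AUC of Y among subjects with X = x when Y ~ N(x, 1) and risk is logistic."""
    y = np.linspace(x - half_width, x + half_width, n_grid)
    dy = y[1] - y[0]
    density = np.exp(-0.5 * (y - x) ** 2) / np.sqrt(2.0 * np.pi)
    risk = 1.0 / (1.0 + np.exp(-(beta0 + beta_x * x + beta_y * y)))
    # Densities of Y in cases and in controls, by Bayes' theorem.
    f_case = density * risk
    f_case /= f_case.sum() * dy
    f_control = density * (1.0 - risk)
    f_control /= f_control.sum() * dy
    # AUC = integral of F_control(y) * f_case(y) dy (midpoint-corrected sum).
    cdf_control = (np.cumsum(f_control) - 0.5 * f_control) * dy
    return float(np.sum(cdf_control * f_case) * dy)

# Hypothetical coefficients: the odds ratio for Y is exp(1) in both strata.
beta0, beta_x, beta_y = -6.0, 5.0, 1.0
auc_x0 = stratum_auc(0, beta0, beta_x, beta_y)
auc_x1 = stratum_auc(1, beta0, beta_x, beta_y)
print(auc_x0, auc_x1)
```

Although the coefficient of *Y* is the same in both strata, the stratum-specific AUC values differ because the baseline risk differs, and the logistic function bends the case and control distributions of *Y* differently at different baseline risks.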

## IMPLICATIONS FOR PREDICTIVE MODELS

Result 1 says that *ROC*_{X, Y}() and *ROC*_{Y|X}() are distinct curves except under special circumstances. Therefore, one should use the curve appropriate to the task at hand.

One type of application is when the new marker *Y* is envisioned to be used in conjunction with *X* in the entire population for which prediction is performed. In such a setting, *ROC*_{X, Y}() is the appropriate curve to consider and should be compared with *ROC*_{X}(). *ROC*_{Y|X}() should not be used for this purpose. A limited exception to this conclusion is that *ROC*_{Y|X}() can be used to test the null hypothesis that the incremental value of *Y* is 0. This is because *ROC*_{Y|X}() differs from the 45-degree line if and only if *ROC*_{X, Y}() > *ROC*_{X}().^{11} However, hypothesis testing is of questionable value because the real challenge is to identify markers that improve prediction by a clinically useful amount.

In other situations, *ROC*_{Y|X}() may be the curve of interest. Suppose *X* is considered to define clinically distinct subgroups of the population, or *X* can clearly define a small proportion of the population as very high (or low) risk. Researchers may envision that the new marker *Y* will be used differently in different subpopulations, or will be used only in certain subpopulations defined by *X.* Consider the example of breast cancer, and suppose *X* indicates whether a subject has a mutation in the gene *BRCA1* or *BRCA2*. In this case, *X* identifies a small proportion of women at much higher risk of breast cancer, and one may wish to consider the predictive ability of a marker *Y* separately in the 2 groups defined by *X.*

In the previous section we distinguished 2 notions of interaction: *OR*_{Y|X} depends on *X* versus *ROC*_{Y|X} depends on *X.* What are the implications of this distinction for predictive modeling? Suppose risks are estimated with a regression model and one finds evidence to support an interaction term in the model. Returning to Example 2, the true risk model can be written:

logit *P*(*D* = 1|*X, Y*) = [logit(*p*_{0}) − 2] + [logit(*p*_{1}) − logit(*p*_{0})]*X* + 2*Y* − *X* · *Y*

In other words, on the logit scale the risks are a linear combination of *X, Y,* and *X* · *Y.* We emphasize that it is appropriate (and potentially important) to include the interaction term in modeling the risks. The point is simply that, just because there is an interaction term in the regression model, this does not mean that *Y* has different predictive capacity in the subpopulations defined by *X.* Furthermore, an example in the next section shows that a large interaction in terms of odds ratios can have no impact on discriminating between cases and controls.

## APPLICATION TO PROSTATE CANCER AND RENAL ARTERY STENOSIS

In this section, we examine real data to illustrate some of the ideas discussed in this paper. The first dataset is from a prospective study of 557 men scheduled for prostate biopsy reported by Deras and colleagues.^{12} Thirty-five percent of men had a positive biopsy. The second dataset is from a study of 426 subjects, first reported by Janssens and colleagues,^{13} wherein 23% had the outcome, stenosis of the renal artery. Both datasets contain multiple predictors, but to illustrate the ideas we will limit ourselves to 2 predictors at a time. Our intent is to illustrate key concepts and so we will not be concerned with statistical significance.

In the prostate cancer dataset, we consider the binary variable indicating whether a man has a history of biopsy (*HxBx*) and the continuous variable *lPCA3,* which is the expression of a particular gene, PCA3, on the log scale. *HxBx* is predictive on its own, with a diagnosis of cancer in 44% of those without a history of biopsy and 27% of those with a previous prostate biopsy. Figure 5 suggests that the predictive ability of *lPCA3* is very similar in men with and without a history of biopsy for prostate cancer; this observation is confirmed by the ROC curves for *lPCA3* stratified on *HxBx* (left panel of Fig. 6). Corollary 1 states that *ROC*_{HxBx,lPCA3} should dominate *ROC*_{lPCA3|HxBx} because *HxBx* is marginally predictive. The right panel of Figure 6 shows that this is approximately the case because the joint ROC curve for *HxBx* and *lPCA3* dominates the *AROC,* except at small false-positive rates, where the densities show *lPCA3* is a better predictor for men without history of biopsy. The joint risks of prostate cancer using *HxBx* and *lPCA3* were estimated using an additive logistic regression model. The fact that *ROC*_{HxBx,lPCA3} is not uniformly greater than *ROC*_{lPCA3|HxBx} does not contradict the theoretical result, but rather reflects the fact that the fitted model is an approximation of the true risk function.

In the renal artery stenosis data, suppose that sex and log serum creatinine (*lSCr*) are the candidate predictors. The sex variable on its own is essentially useless as a predictor because the prevalence of renal artery stenosis in men and women is almost identical (24% in women and 22% in men). However, Figure 7 suggests that sex will be a useful predictor in combination with *lSCr* because a subject's sex helps one interpret the *lSCr* measurement. Indeed, if we use *lSCr* by itself, we can discriminate cases and controls with an AUC of 0.71. We modeled risk of renal artery stenosis with logistic regression using an additive model and both *lSCr* and sex as predictors. Using the joint model, the AUC increases to 0.75. This illustrates an idea from the section on covariate adjustment versus joint modeling: a variable with no predictive capacity on its own can still be useful in a joint prediction model.

Another interesting example is to consider *lSCr* along with the binary predictor indicating whether a patient has vascular disease (*V*). If we consider *lSCr* as a predictor in patients with and without vascular disease, the predictive capacity is clearly different, with an AUC of 0.73 in patients with vascular disease and an AUC of 0.61 in patients without vascular disease (Fig. 8). In this sense, there is an interaction between *lSCr* and *V* for discriminating cases from controls. If we consider a logistic regression model, the interaction term is substantial:

The odds ratio for *lSCr* in patients without vascular disease is 1.50 (*OR*_{lSCr|V = 0} = 1.50); in patients with vascular disease the odds ratio is 2.62 (*OR*_{lSCr|V = 1} = 2.62). Here, there are interactions both in the risk model and in the performance of the marker. Interestingly, although one might suspect that including the interaction term when modeling risks will improve prediction performance, in fact, it has very little impact. Figure 8 shows that the AUC for a joint model without interaction is 0.752, whereas including an interaction increases it only to 0.755.

## DISCUSSION

We have discussed covariate adjustment, joint modeling, and interaction in the context of evaluating biomarkers for prediction and classification. First, we clarified the difference between incorporating a new predictor in a risk model that already includes established predictors, and eliminating the effect of existing predictors by adjusting for them. In particular, ROC curves for a risk model that incorporates a new predictor with existing predictors are almost never the same as ROC curves for the new predictor adjusted for existing predictors. These are equal only when the covariate has no marginal association with disease. This contrasts with the notion that covariate adjustment and joint modeling can be handled within the framework of a single risk model. Second, we contrasted notions of interaction in a classic epidemiologic context and in the context of assessing predictive performance. In epidemiology, *Y* and *X* are usually said to interact if there is evidence that *OR*_{Y|X} varies with *X.* In prediction performance assessment, a more relevant notion of interaction is whether *ROC*_{Y|X} varies with *X.* We demonstrated that these notions of interaction are distinct.

Note that ROC regression methods can be used to assess the evidence that the predictive capacity of a marker *Y* varies with a covariate *X.*^{14} These methods also provide a way to evaluate the assumption of a single adjusted ROC curve. For example, Janes, Longton, and Pepe^{15} model

where *f* is the false-positive rate and *ROC*_{Y|X}(*f*) is the corresponding true-positive rate. They test whether α_{2} = 0. Software to implement ROC regression methods is readily available.^{15} (See the paper by Cai and Pepe^{16} for more general semi-parametric modeling techniques.)

Questions of study design warrant particular attention because study design determines what the data are useful for. An especially important issue in study design is matching. Typically, when cases and controls are matched on an existing predictor *X,* then the incremental value of *Y* cannot be assessed because we cannot derive *P*(*D* = 1|*X, Y*) and consequently cannot estimate *ROC*_{X, Y}(·).^{17} Therefore, matched data present additional challenges and, as always, investigators should give serious consideration before choosing a matched design.

Although we have focused on ROC curves as a convenient framework for discussion, we do not mean to imply that ROC curves are the only useful summary of a risk model, or even the most important summary. Different metrics and summaries have different merits, and the most appropriate metrics depend on the context.^{18–21}