Markers are sought to detect conditions or predict the future onset of conditions. Examples include childhood screening tests, tests for genetic abnormalities, and markers for cardiovascular disease, such as serum lipids and inflammatory indicators. Biomarkers for cancer detection include prostate-specific antigen and CA-125. Some of these same markers are used as markers of treatment response and disease progression. The emergence of new technologies, such as gene and protein expression arrays, promises the development of more sophisticated markers in the near future.1,2
The issue here is how to evaluate the performance of a marker. The importance of rigorously evaluating a marker's performance before it is adopted in routine medical practice is of particular concern to regulatory agencies and has been highlighted recently in the popular press.3 The ultimate validation of a marker requires large population studies and consideration of disease-specific costs and benefits associated with incorrect and correct classification by the marker.4 Preliminary to such studies are smaller studies that simply assess the marker's ability to discriminate subjects with the condition from those without.5 The statistical evaluation of a marker's discriminatory capacity is the specific topic of this article.
How should one measure the discriminatory capacity of a marker? An appropriate measure should not depend on the measurement units of the marker; otherwise it cannot be used to compare markers measured in different units. For example, the odds ratio (or relative risk) per unit increase in the marker, although commonly used, is not a self-contained summary statistic of discrimination and cannot be compared across different markers.6
We propose an approach that first involves standardizing the marker values relative to a normative population (those without the condition). This standardization puts different markers on a common scale, thereby facilitating comparisons among markers. In addition we show that the distribution of the standardized marker among subjects with the condition is closely related to the receiver operating characteristic (ROC) curve, a statistical tool that has long been used for evaluating diagnostic tests.7–9 The ROC curve is appropriate for evaluating the discriminatory capacity of any marker.10 Its interpretation as relating to the distribution of standardized marker values is appealing. In particular, it may be of interest to those researchers already comfortable with statistical concepts of standardization and frequency distributions, but who are not familiar with ROC analysis.
As illustration, we apply statistical techniques to 2 simple datasets. The data are online at http://www.fhcrc.org/labs/pepe/book/. In the first of these datasets, 2 serum biomarkers for pancreatic cancer, CA-125 and CA-19-9, were measured for 90 patients with pancreatic cancer and 51 without (Fig. 1).11 Questions of interest are how to quantify the capacities of the 2 markers to distinguish between the patients with and without cancer and then how to compare the 2 markers.
The second dataset pertains to a marker of hearing impairment at the 1416-Hz frequency for 57 hearing-impaired ears and 147 unimpaired ears. The marker is the signal-to-noise ratio from a hearing test (the distortion product otoacoustic emissions test). The test was performed using 9 different sound stimulus intensity levels, 3 of which are included in this dataset. Thus, for each ear we have a signal-to-noise ratio value for each of the intensity levels (Fig. 2). Details of the original study and data selection can be found in Stover et al12 and in Pepe,13 respectively. Both of these studies used case–control designs. Cross-sectional cohort studies can be analyzed in the same way.
We adopt the convention that higher values of the marker are more indicative of the presence of the condition. (We can always redefine the marker if necessary to ensure this pattern—by using negation, for example. See the audiology data in Fig. 2.) The basic idea is to use the distribution of marker values in the unaffected population (those without the condition) as a reference distribution for standardizing marker values in the affected population, ie, those with the condition. The standardization for an affected subject with marker value Y is simply to calculate the frequency of unaffected subjects with marker values greater than Y. Thus, if marker values for 20% of unaffected subjects exceed Y, the standardized marker value is 0.20. We call the standardized value its placement value.14–16
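The placement-value calculation described above is simple enough to sketch in code. The marker values below are hypothetical, chosen only to illustrate the arithmetic; they are not taken from either dataset in this article.

```python
# Placement value of an affected subject's marker value Y: the proportion
# of the unaffected (reference) sample whose marker values exceed Y.
def placement_value(y, reference):
    """Proportion of reference marker values strictly greater than y."""
    return sum(r > y for r in reference) / len(reference)

# Hypothetical reference sample of unaffected subjects.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]

# Only 1 of the 5 reference values exceeds 4.5, so the placement value is 0.2.
pv = placement_value(4.5, reference)
```

A marker value more extreme than the entire reference sample gets a placement value of 0, matching the text's point that small placement values flag affected subjects as extreme.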
The concept of calculating a placement value is closely related to that of calculating a percentile value relative to a healthy reference population, as is the common practice for reporting anthropometric measurements in children.17 Here, rather than reporting the percentile, which is the proportion of the reference population less than Y, we report the proportion greater than Y. Although the concepts are equivalent, we will see that calculation of placement values rather than percentiles facilitates connections with ROC methodology.
Placement values are proportions taking values between 0 and 1. Because higher marker values are more indicative of the condition, having the condition is associated with having smaller placement values. The smallness of the placement value indicates how extreme a subject's marker value is relative to the reference population. Moreover, a marker for which most affected subjects have very small placement values is a good marker because it identifies most affected subjects as being extreme relative to the reference population.
A key attribute of placement values is that they do not have measurement units associated with them. Different markers are converted to a common scale by the placement value standardization. This conversion facilitates comparisons among markers. Thus, if a man with a given disease has a placement value of 0.50 for marker 1 and a placement value of 0.01 for marker 2, then marker 2 is the better disease indicator for him. He is identified as extreme with regard to marker 2, whereas he appears to be well within the reference (nondiseased) population with regard to marker 1. To determine which of 2 markers is better at discriminating the population of affected subjects from the unaffected population, one must consider the population distributions of placement values in affected subjects for each of the markers. The marker with a higher frequency of small placement values is preferred.
The ROC curve is a statistical device for illustrating the classification accuracy achievable with a diagnostic test, or marker.9,10,16 For each possible threshold value, c, one can define a positive classification rule based on the marker, with Y ≥ c indicating that the condition is present. The associated true-positive rate (TPR(c)) and false-positive rate (FPR(c)) are

TPR(c) = P(Y ≥ c | condition present) and FPR(c) = P(Y ≥ c | condition absent),

respectively. The ROC curve plots TPR(c) (the test sensitivity) versus FPR(c) (1 − specificity) for all values of c. It shows the range of (FPR, TPR) that is achievable. Because good classification accuracy pertains to low FPRs and high TPRs, a good marker has an ROC curve with points in the upper left corner of the (0,1) × (0,1) square. The area under the ROC curve is the most popular ROC summary statistic. An area under the ROC curve of 1.0 corresponds to a perfect marker.
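The empirical TPR and FPR for a thresholding rule, and the points traced by sweeping the threshold over the observed marker values, can be sketched as follows. The case and control values are hypothetical, not the study data.

```python
# Empirical TPR(c) and FPR(c) for the rule "positive if Y >= c".
def tpr_fpr(c, affected, unaffected):
    tpr = sum(y >= c for y in affected) / len(affected)
    fpr = sum(y >= c for y in unaffected) / len(unaffected)
    return tpr, fpr

affected   = [3.1, 4.0, 5.2, 6.8]   # hypothetical subjects with the condition
unaffected = [1.0, 2.2, 2.9, 4.1]   # hypothetical reference subjects

# Sweeping c over all observed values traces the empirical ROC curve
# as a set of (TPR, FPR) points.
thresholds = sorted(set(affected + unaffected), reverse=True)
roc = [tpr_fpr(c, affected, unaffected) for c in thresholds]
```

At c = 3.0, for example, all 4 affected subjects but only 1 of 4 unaffected subjects test positive, giving TPR = 1.0 and FPR = 0.25.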
Pancreatic Cancer Biomarkers
Using the 51 subjects without pancreatic cancer as the reference group, we standardized each of the markers for the 90 subjects with pancreatic cancer by calculating placement values. The frequency distributions are displayed in Figure 3A. The CA-19-9 placement values tended to be smaller than the CA-125 values, indicating that pancreatic cancer patients were more extreme relative to the noncancer reference with regard to CA-19-9 than to CA-125.
The average (±standard deviation) of the placement values is 0.14 ± 0.26 for CA-19-9 and 0.29 ± 0.25 for CA-125. A simple paired t test could be applied to compare the averages. However, this approach is not quite appropriate because a finite sample of only 51 patients without cancer was used to standardize the markers. A different sample of patients without cancer would have produced a somewhat different standardization. The sampling variability in the reference group used to calculate the placement values for the 90 diseased subjects must be accounted for in calculating a P value that compares mean CA-19-9 and CA-125 placement values. The bootstrapping technique18 (described in the Appendix available with the electronic version of this article) accounts for this variability and yields P < 0.01.
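The bootstrap comparison described above can be sketched as follows. The key point is that subjects in both the reference group and the diseased group are resampled, so the sampling variability of the standardization itself enters the comparison of the two markers. Group sizes and marker values below are hypothetical, and this toy version returns the bootstrap distribution of the difference in mean placement values rather than carrying the calculation through to a P value.

```python
import random

def placement_value(y, reference):
    """Proportion of reference marker values strictly greater than y."""
    return sum(r > y for r in reference) / len(reference)

def mean_pv_diff(cases1, cases2, ref1, ref2):
    """Difference in mean placement values between marker 1 and marker 2."""
    m1 = sum(placement_value(y, ref1) for y in cases1) / len(cases1)
    m2 = sum(placement_value(y, ref2) for y in cases2) / len(cases2)
    return m1 - m2

def bootstrap_diffs(cases1, cases2, ref1, ref2, n_boot=1000, seed=0):
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample subjects, not marker values: because the two markers are
        # measured on the same subjects (paired), one set of indices is drawn
        # per group and used for both markers.
        ci = [rng.randrange(len(cases1)) for _ in cases1]
        ri = [rng.randrange(len(ref1)) for _ in ref1]
        diffs.append(mean_pv_diff([cases1[i] for i in ci],
                                  [cases2[i] for i in ci],
                                  [ref1[i] for i in ri],
                                  [ref2[i] for i in ri]))
    return diffs

# Hypothetical paired measurements of two markers.
cases_m1 = [5.0, 6.1, 7.2, 4.4]
cases_m2 = [3.0, 3.5, 5.9, 2.8]
ref_m1   = [1.0, 2.0, 3.0, 4.0]
ref_m2   = [1.5, 2.5, 3.5, 4.5]
diffs = bootstrap_diffs(cases_m1, cases_m2, ref_m1, ref_m2, n_boot=200)
```

A P value could then be obtained from the bootstrap distribution of the difference, for example by comparing the observed difference to a normal approximation based on the bootstrap standard error.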
The scatterplot (Fig. 3B) shows that, although CA-19-9 was the better marker overall, there were a substantial number of cancer patients for whom CA-125 was better in the sense that they appear normal by the CA-19-9 marker but abnormal (ie, have smaller placement values) by CA-125. For example, 5 patients with CA-19-9 placement values exceeding 20% had CA-125 placement values less than 10%.
Audiology Data
Distributions of standardized (negative) signal-to-noise ratio values are shown in Figure 4 for hearing-impaired subjects. The test appears to be more discriminatory when the sound stimulus is at a lower intensity, since the placement values are smaller at the 55-decibel (dB) intensity level than at the 60- or 65-dB levels. The mean ± SD placement values at the 3 intensity levels are 0.029 ± 0.057, 0.053 ± 0.106, and 0.071 ± 0.127, respectively. The P value for comparing the averages at 55 and 65 dB is <0.01 using the bootstrap technique. The 55-dB stimulus also appears to work better than the 65-dB stimulus for most individuals, as can be seen from the scatterplot (Fig. 4B). Test results for most hearing-impaired subjects appeared more abnormal with the lower intensity stimulus, as evidenced by smaller placement values.
Relationship with ROC Analysis
Figures 3C and 4C show the cumulative distributions of standardized markers for cancer patients and for hearing-impaired subjects. The cumulative distribution corresponding to the placement value (p) on the x-axis is the proportion of values that are ≤p. Interestingly, these cumulative distribution curves are identical to ROC curves for the markers. The general argument is as follows: Let c be the threshold value that corresponds to the false-positive rate p, FPR(c) = p. Consider the point p on the cumulative distribution curve cdf(p). Observe that a subject's placement value is ≤p if and only if his marker value Y ≥ c. Therefore the proportion of affected subjects with placement values ≤p, namely cdf(p), is equal to the proportion with marker values ≥c, ie, TPR(c). Thus, each point (p, cdf(p)) on the cumulative distribution curve is a point (FPR(c), TPR(c)) on the ROC curve and vice versa. A mathematical argument is given in Pepe and Cai.15
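The identity argued above can be checked numerically on toy data: the cumulative distribution of placement values, evaluated at p = FPR(c), reproduces TPR(c). The marker values here are hypothetical.

```python
def placement_value(y, reference):
    """Proportion of reference marker values strictly greater than y."""
    return sum(r > y for r in reference) / len(reference)

affected   = [3.1, 4.0, 5.2, 6.8]   # hypothetical subjects with the condition
unaffected = [1.0, 2.2, 2.9, 4.1]   # hypothetical reference subjects

# Placement values for the affected subjects, and their cumulative distribution.
pvs = [placement_value(y, unaffected) for y in affected]
def cdf(p):
    """Proportion of affected placement values <= p."""
    return sum(v <= p for v in pvs) / len(pvs)

# Pick a threshold c; then p = FPR(c), and cdf(p) should equal TPR(c).
c = 2.9
fpr = sum(y >= c for y in unaffected) / len(unaffected)
tpr = sum(y >= c for y in affected) / len(affected)
```

Here a subject's placement value is at most FPR(c) exactly when the subject's marker value is at least c, which is why cdf(FPR(c)) and TPR(c) agree.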
Therefore, 2 interpretations exist for the curves shown in Figures 3C and 4C. Interpreted as cumulative distribution functions, the curves show the proportion of affected subjects with standardized marker values that are at least as extreme as p. Interpreted as ROC curves, the curves display the trade-offs between sensitivity and specificity that are possible when we apply thresholding classification rules to the marker in the population. Both interpretations are meaningful and useful. The accuracy of CA-19-9 for classifying subjects with or without pancreatic cancer is clearly superior to that of CA-125. For example, the thresholding rule with specificity of 80% (FPR = 0.20) yields a sensitivity of 78% for CA-19-9 but only 49% for CA-125. Said another way, 78% of cancer patients have standardized CA-19-9 below 0.2 while only 49% have standardized CA-125 less than 0.2. Similarly we see from Figure 4C that classification accuracy is better when the lower sound intensity is used.
ROC Summary Statistics
The areas under the ROC curves in Figure 3C are 0.86 for CA-19-9 and 0.71 for CA-125. In Figure 4C, the areas are 0.97 at 55 dB, 0.95 at 60 dB, and 0.93 at 65 dB. Observe that these are exactly the same as 1 minus the mean placement values calculated earlier. The result holds in general: averaging the standardized marker values for affected subjects yields 1 minus the area under the ROC curve.
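This relationship can likewise be verified on toy data, using the Mann-Whitney form of the empirical AUC, which is the proportion of (case, control) pairs ordered correctly by the marker. The values are hypothetical.

```python
def placement_value(y, reference):
    """Proportion of reference marker values strictly greater than y."""
    return sum(r > y for r in reference) / len(reference)

affected   = [3.1, 4.0, 5.2, 6.8]   # hypothetical subjects with the condition
unaffected = [1.0, 2.2, 2.9, 4.1]   # hypothetical reference subjects

# Mean placement value over affected subjects.
mean_pv = sum(placement_value(y, unaffected) for y in affected) / len(affected)

# Empirical AUC: fraction of case-control pairs with the case value higher.
pairs = [(a, u) for a in affected for u in unaffected]
auc = sum(a > u for a, u in pairs) / len(pairs)

# With no ties in the data, mean placement value equals 1 - AUC exactly.
assert abs(mean_pv - (1 - auc)) < 1e-12
```

Each placement value counts the reference values exceeding a case's value, so summing placement values over cases counts exactly the incorrectly ordered pairs, which is 1 minus the Mann-Whitney AUC.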
This relationship is intuitive for the perfect marker, since all placement values for affected subjects are equal to 0 and the area under the ROC curve = 1 for the perfect marker. Mathematical arguments for the general result are available.14,15
The implication of this result is that statistical comparisons between markers using areas under ROC curves are the same as statistical comparisons between markers using placement value averages for diseased subjects. Therefore the P values cited earlier that pertain to average placement values are also valid for comparing the areas under the curves in Figures 3C and 4C.
We propose a standardization procedure to facilitate the evaluation of markers. The use of a reference distribution is a familiar concept. In laboratory medicine, for example, values outside of a normal healthy reference range often are used to flag patients as having a medical condition. Standardization with respect to an age- and sex-matched reference population is used for anthropometric measurements. Standardization not only provides better clinical interpretations but makes possible valid comparisons across different populations. Our standardization can be used to compare a marker's discriminatory capacity across different populations. One could compare placement values in diseased men and diseased women, for example, to determine whether the marker performs better in men or women. An additional compelling attribute of the standardization we propose is that it provides valid comparisons of different markers across the same population, as demonstrated with our 2 datasets.
We also noted the close connection between analyzing standardized markers of affected subjects and ROC analysis. With our approach, one can analyze standardized markers in familiar ways (as we did for pancreatic cancer and hearing impairment markers) without explicitly considering the operating characteristics of thresholding decision rules. Nevertheless, we have shown that such considerations are implicitly at play and the approach is fundamentally the same as ROC analysis.
Our approach offers avenues for addressing questions that should be asked about marker performance, but typically are not. In particular, regression analysis applied to placement values can be used to determine whether covariates affect the capacity of a marker to distinguish cases from controls.15,19 Covariates may relate to characteristics of the subjects tested or to the test itself.16 To illustrate with the audiology data, the following linear regression model was fit to the placement values for hearing-impaired subjects:

Z = α0 + α1 × Intensity,

with Z the normal deviate corresponding to the placement value and the covariate, Intensity, the sound stimulus intensity. The estimate α̂1 = 0.023 (95% confidence interval = −0.019 to 0.076; standard error = 0.024) indicates a trend for higher intensity levels to be associated with larger placement values among hearing-impaired subjects, ie, reduced marker performance. Figure 4C shows the corresponding cumulative distributions of placement values. The more frequent occurrence of small placement values at the lower intensity levels is obvious from these curves. The better performance at lower intensity is also evident with the ROC curve interpretation. More complex models that include multiple independent variables simultaneously can easily be fit as well. As noted earlier, bootstrapping can be applied to arrive at appropriate standard errors and P values. Alternatively, recent work15,19 provides theory for making statistical inference about regression models using placement values.
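A minimal sketch of such a regression follows, assuming the model Z = α0 + α1 × Intensity + error, with Z the standard normal deviate of each placement value. Ordinary least squares is used purely for illustration; the cited work gives inference that properly accounts for the estimated standardization. The placement values and intensity levels below are hypothetical, not the study data.

```python
from statistics import NormalDist

def fit_line(x, y):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical placement values for hearing-impaired ears at three
# sound stimulus intensity levels (dB).
intensity = [55, 55, 60, 60, 65, 65]
pv        = [0.02, 0.04, 0.05, 0.08, 0.07, 0.12]

# Probit transform: Z is the standard normal deviate of each placement value.
# (In real data, placement values of exactly 0 or 1 must be handled first.)
z = [NormalDist().inv_cdf(p) for p in pv]

alpha0, alpha1 = fit_line(intensity, z)
```

A positive α1, as in the article's fit, means higher intensities shift placement values upward, ie, weaker discrimination.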
1. The Chipping Forecast. Nat Genet.
2. Liotta LA, Ferrari M, Petricoin E. Clinical proteomics: written in blood. Nature
3. Pollack A. New cancer test stirs hope and concern. New York Times. 3 Feb. 2004, late ed., final: F1.
4. Etzioni R, Urban N, Ramsey S, et al. The case for early detection. Nat Rev Cancer
5. Pepe MS, Etzioni R, Feng Z, et al. Phases of biomarker development for early detection of cancer. J Natl Cancer Inst
6. Pepe MS, Janes H, Longton G, et al. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol
7. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diag Imaging
8. Begg CB. Advances in statistical methodology for diagnostic medicine in the 1980s. Stat Med
9. Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. New York: Wiley; 2002.
10. Baker SG. The central role of receiver operating characteristic (ROC) curves in evaluating tests for the early detection of cancer. J Natl Cancer Inst
11. Wieand S, Gail MH, James BR, et al. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika
12. Stover L, Gorga MP, Neely ST, et al. Toward optimising the clinical utility of distortion product otoacoustic emission measurements. J Acoust Soc Am
13. Pepe MS. Three approaches to regression analysis of receiver operating characteristic curves for continuous test results. Biometrics
14. Hanley JA, Hajian-Tilaki KO. Sampling variability of nonparametric estimates of the areas under receiver operating characteristic curves: an update. Acad Radiol
15. Pepe MS, Cai T. The analysis of placement values for evaluating discriminatory measures. Biometrics
16. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford: Oxford University Press; 2003.
17. Frisancho AR. Anthropometric Standards for the Assessment of Growth and Nutritional Status. Ann Arbor: University of Michigan Press; 1990.
18. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman and Hall; 1993.
19. Cai T. Semi-parametric ROC regression analysis with placement values. Biostatistics