To explore causal relationships between exposure and outcome, epidemiologists must rely on accurate measurements of both. Misclassification of either exposure or outcome will obscure causality, ie, by an inability to distinguish exposed from unexposed or “diseased” from “nondiseased.” Whenever new and purportedly better (but nonvalidated) measurements of exposure or outcome become available, epidemiologists are faced with a fundamental question of “Which test is better?” Tools for answering this question differ depending on whether a true gold standard is available. In this issue of Epidemiology, articles by Pepe1 and Hadgu2 address some of the issues surrounding comparisons of diagnostic tests with1 and without2 a gold standard.
The receiver operating characteristic (ROC) curve is a popular summary measure of the discriminatory ability of a clinical marker that can be used when there is a gold standard. The ROC curve plots sensitivity against 1 − specificity (the true-positive rate against the false-positive rate) for all thresholds that could have been used to define “test positive.” An ROC curve is assessed by measuring the area under the curve (AUC), which ranges from 0.5 (no discriminatory ability) to 1.0 (perfect discriminatory ability). Two diagnostic tests can be compared by calculating the difference between the areas under their 2 ROC curves.
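As a minimal sketch of this construction (in Python; the marker values are invented for illustration and come from neither article), the ROC points can be obtained by sweeping the threshold over every observed value, and the AUC by the trapezoid rule:

```python
def roc_points(cases, controls):
    """Return (false-positive rate, sensitivity) pairs, one per threshold.

    A subject is called "test positive" when the marker value is >= threshold.
    """
    thresholds = sorted(set(cases) | set(controls), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every observed value
    for t in thresholds:
        sensitivity = sum(y >= t for y in cases) / len(cases)
        false_pos_rate = sum(x >= t for x in controls) / len(controls)
        points.append((false_pos_rate, sensitivity))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical marker values (not data from the articles discussed here)
cases = [2.1, 3.4, 3.9, 4.8, 5.5]      # subjects with the outcome
controls = [1.0, 1.7, 2.5, 3.0, 3.6]   # subjects without the outcome

auc = auc_trapezoid(roc_points(cases, controls))
print(round(auc, 2))  # 21 of the 25 case-control pairs are correctly ordered
```

With these invented values the AUC is about 0.84, which is also the fraction of case-control pairs in which the case has the higher marker value.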
Pepe1 offers an alternative (nonparametric) presentation of ROC analysis that may make such analysis more accessible to epidemiologic researchers. Pepe relates the ROC curve to standardized values of the clinical marker, termed “placement values.” The placement value for an affected individual with marker value Y is the proportion of unaffected subjects (the reference population) with values higher than Y. To extend this idea to the whole population, the mean placement value or average of the percentiles is computed by averaging the placement values of the affected subjects, with the mean placement value reflecting the distribution of values in the control population. A low mean placement value indicates that the clinical marker distinguishes patients with an outcome from those without an outcome in that population. Using the fact that the ROC curve is the probability distribution of the placement values, Pepe demonstrates that the mean placement value is equal to 1-AUC. A nice feature of using placement values is that they are amenable to regression analysis to identify and control for factors that modify the performance of the clinical measurement.
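The relation between the mean placement value and the AUC can be checked numerically. The following Python sketch uses invented marker values and assumes a continuous marker with no ties between affected and unaffected subjects; the function names are ours, not Pepe's:

```python
def placement_value(y, controls):
    """Proportion of unaffected subjects with marker values higher than y."""
    return sum(x > y for x in controls) / len(controls)


def mean_placement_value(cases, controls):
    """Average placement value over the affected subjects."""
    return sum(placement_value(y, controls) for y in cases) / len(cases)


def auc_pairwise(cases, controls):
    """AUC as the proportion of case-control pairs in which the case is higher
    (ties, if any, count one half)."""
    score = sum((y > x) + 0.5 * (y == x) for y in cases for x in controls)
    return score / (len(cases) * len(controls))


# Hypothetical marker values (not data from the articles discussed here)
cases = [2.1, 3.4, 3.9, 4.8, 5.5]
controls = [1.0, 1.7, 2.5, 3.0, 3.6]

mpv = mean_placement_value(cases, controls)
auc = auc_pairwise(cases, controls)
print(round(mpv, 2), round(1 - auc, 2))  # the two quantities agree
```

Here the individual placement values are 0.6, 0.2, 0, 0 and 0, so the mean placement value is about 0.16 and the AUC about 0.84.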
Both of these related approaches use all thresholds for the definition of “test positive.” In practice, users often confine their attention to regions of the ROC curve that correspond to the most clinically relevant values of test sensitivity or specificity. For example, a test that performs well for low specificity but poorly for high specificity may not be desirable. In this case, the partial area under the curve is a more appropriate measure of test performance than the AUC.
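A partial area can be sketched as follows, assuming a piecewise-linear ROC curve and a region of interest defined by a maximum acceptable false-positive rate (ie, a minimum specificity). The two ROC curves are invented to show that a test with the larger full AUC can have the smaller partial area in the high-specificity region:

```python
def partial_auc(points, max_fpr):
    """Area under a piecewise-linear ROC curve for false-positive rates
    between 0 and max_fpr. `points` are (fpr, sensitivity) pairs sorted by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment at max_fpr by linear interpolation
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ROC curves (illustrative only)
test_a = [(0.0, 0.0), (0.1, 0.6), (1.0, 1.0)]              # good at high specificity
test_b = [(0.0, 0.0), (0.2, 0.2), (0.3, 1.0), (1.0, 1.0)]  # good only at low specificity

full_a, full_b = partial_auc(test_a, 1.0), partial_auc(test_b, 1.0)
part_a, part_b = partial_auc(test_a, 0.1), partial_auc(test_b, 0.1)
print(round(full_a, 3), round(full_b, 3))  # full AUC favors test B
print(round(part_a, 3), round(part_b, 3))  # partial AUC (specificity >= 0.9) favors test A
```

In this invented example the full AUCs are about 0.75 and 0.78, whereas the partial areas for specificity of at least 0.9 are about 0.030 and 0.005.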
Pepe1 discusses settings in which a gold standard is available. For many outcomes, however, such a standard does not exist, or if it does, it is too costly to be broadly applicable. Hadgu2 addresses this setting and the role of discrepant analysis. Discrepant analysis is a widely used technique to assess the diagnostic accuracy of a new diagnostic test with a binary outcome (positive/negative) in the absence of a gold standard. In discrepant analysis, results from the new test and the standard test are compared. When both tests agree, the results are considered “the truth.” When the new test and the standard test give different answers, the disease status is determined by the results from a third, independent assay. This step is called discrepant resolution. As Hadgu2 points out, this selective reanalysis of discordant outcomes can lead to biased estimates of clinical performance. Using as an example Chlamydia trachomatis detection by a nucleic acid amplification test (NAAT) compared with the referent standard of cell culture,3 Hadgu shows how a reanalysis of only the samples considered false positive (NAAT positive and cell-culture negative) leads to an overestimate of the sensitivity of NAAT.
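The direction of this bias can be illustrated with expected cell proportions. The Python sketch below is our own simplified illustration, not Hadgu's calculation: it assumes the new and reference tests are conditionally independent given true infection status, that the resolver assay is perfect, and that only new-positive/reference-negative samples are resolved. All numerical values are hypothetical.

```python
def discrepant_sensitivity(prev, se_new, sp_new, se_ref, sp_ref):
    """Expected sensitivity estimate for the new test when only
    new-positive/reference-negative samples are resolved (perfect resolver)."""
    # Truly infected samples
    d_pp = prev * se_new * se_ref            # new+, ref+ : accepted as positive
    d_pn = prev * se_new * (1 - se_ref)      # new+, ref- : resolved as positive
    d_np = prev * (1 - se_new) * se_ref      # new-, ref+ : reference result kept
    # Truly uninfected samples that the reference calls positive
    u_pp = (1 - prev) * (1 - sp_new) * (1 - sp_ref)   # new+, ref+
    u_np = (1 - prev) * sp_new * (1 - sp_ref)         # new-, ref+
    # Infected samples negative on BOTH tests are never retested, so the
    # new test's missed cases among them never enter the denominator.
    new_pos_declared_pos = d_pp + d_pn + u_pp
    declared_pos = new_pos_declared_pos + d_np + u_np
    return new_pos_declared_pos / declared_pos


# Hypothetical values: 10% prevalence, NAAT-like new test, culture-like reference
true_se_new = 0.90
estimate = discrepant_sensitivity(prev=0.10, se_new=true_se_new, sp_new=0.98,
                                  se_ref=0.70, sp_ref=1.00)
print(round(estimate, 3), "versus true sensitivity", true_se_new)
```

Under these assumptions the estimate is about 0.93 against a true value of 0.90, even though the resolver itself makes no errors.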
After discussing issues pertinent to definition of Chlamydia infection versus detectable bacterial DNA or RNA, Hadgu reviews statistical methods for evaluating test performance without a gold standard. (Hui and Zhou4 have published a comprehensive overview of statistical evaluation of diagnostic tests without a gold standard.) Hadgu notes that in the very rare situation when a new test with known sensitivity and specificity is available, one can correct bias in discrepant analysis-based estimates of sensitivity and specificity. He then reviews latent class models to estimate clinical performance of a test. If one were to modify discrepant analysis such that the third test without known characteristics would be applied to samples from the concordant cells and to discordant test results for reclassification, one would avoid a bias favoring one test or another. In this setting, if the 3 tests were conditionally independent given the true unobserved disease status, latent class analysis could be employed to estimate sensitivity and specificity of all 3 tests. However, the procedure is not robust to the likely violation of the assumption of conditional independence of the tests.
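A two-class latent class model for 3 binary tests can be fitted with a simple EM algorithm. The Python sketch below is a generic illustration under the conditional independence assumption, not the specific procedure reviewed by Hadgu; the starting values, iteration count and "true" parameters used to generate the expected counts are all assumptions chosen for the example.

```python
from itertools import product


def class_probs(pattern, prev, sens, spec):
    """Joint probability of a result pattern with each true status, assuming
    the tests are conditionally independent given that status."""
    p_dis, p_non = prev, 1.0 - prev
    for j, result in enumerate(pattern):
        p_dis *= sens[j] if result else 1.0 - sens[j]
        p_non *= 1.0 - spec[j] if result else spec[j]
    return p_dis, p_non


def fit_latent_class(counts, n_iter=2000):
    """EM estimates of prevalence, sensitivities and specificities.
    `counts` maps each pattern (r1, r2, r3) of 0/1 results to its frequency."""
    prev, sens, spec = 0.3, [0.8] * 3, [0.8] * 3  # arbitrary starting values
    total = sum(counts.values())
    for _ in range(n_iter):
        # E-step: posterior probability of disease for each result pattern
        post = {}
        for pattern in counts:
            p_dis, p_non = class_probs(pattern, prev, sens, spec)
            post[pattern] = p_dis / (p_dis + p_non)
        # M-step: weighted proportions
        n_dis = sum(n * post[p] for p, n in counts.items())
        n_non = total - n_dis
        prev = n_dis / total
        sens = [sum(n * post[p] for p, n in counts.items() if p[j]) / n_dis
                for j in range(3)]
        spec = [sum(n * (1 - post[p]) for p, n in counts.items() if not p[j]) / n_non
                for j in range(3)]
    return prev, sens, spec


# Hypothetical "true" values used only to generate expected counts
true_prev, true_sens, true_spec = 0.2, [0.90, 0.80, 0.85], [0.95, 0.90, 0.92]
counts = {p: 10000 * sum(class_probs(p, true_prev, true_sens, true_spec))
          for p in product((0, 1), repeat=3)}

prev, sens, spec = fit_latent_class(counts)
print(round(prev, 3), [round(s, 3) for s in sens], [round(s, 3) for s in spec])
```

Because the counts here are generated under conditional independence, the fit can recover the generating values; when the tests are in fact dependent given disease status, the same code still returns estimates, but they need not be close to the truth.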
To illustrate this point, we compared test results for DNA testing of human papillomavirus (HPV) from a study of HPV and cervical neoplasia in Costa Rica.5–7 It is now generally accepted that cervical infections by about 15 carcinogenic HPV types are responsible for virtually all cervical cancer.8 We recently compared a new polymerase chain reaction (PCR) method (Test 2) to our previous method (Test 1).7 In Table 1, the crude comparison demonstrates that Test 2 was more likely to test positive for any of 13 carcinogenic types than Test 1 (P < 0.0005, McNemar's χ²). However, it is unclear whether Test 2 provides a true increase in sensitivity of detection of carcinogenic HPV, or simply an increase in false positives, perhaps due to the contamination that often plagues PCR-based methods.
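McNemar's test uses only the discordant pairs. A minimal Python sketch, without continuity correction, follows; the discordant counts are hypothetical and are not the Table 1 data:

```python
import math


def mcnemar(only_new_positive, only_old_positive):
    """McNemar chi-square statistic (1 degree of freedom, no continuity
    correction) and its P value, from the two discordant cell counts."""
    b, c = only_new_positive, only_old_positive
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical discordant counts (not the counts from Table 1)
chi2, p_value = mcnemar(only_new_positive=120, only_old_positive=60)
print(chi2, p_value)
```

A small P value indicates only that one test is positive more often than the other; it does not say whether the extra positives are true or false.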
A third test may help answer this question—even if the third test is itself imperfect (Table 2). First, assume that all samples with concordant results for Test 1 and Test 2 are true positives (Group A, Table 1) and true negatives (Group D, Table 1). This is an extreme assumption, but serves to show the general principle. When Test 3 is applied to the “true positives” in Group A, 9% are negative. This is an estimate of the false-negative rate for Test 3. When Test 3 is applied to the “true negatives” in Group D, 9% are positive, which estimates the false-positive rate for Test 3.
Recall that the underlying question is whether the improvement with Test 2 is due to improved sensitivity or to an increase in false positives (ie, is Group B composed primarily of true or false positives?). Applying Test 3 to this group, we see that the rate of positives is 40%, intermediate between the Test 3 results in Group A and Group D. This intermediate position suggests that the “improved sensitivity” of Test 2 is a mixture of a real improvement in sensitivity and an increase in false-positive results (assuming that the 3 tests are conditionally independent). The catch is, how valid is the assumption of conditional independence? In this example, given that all the tests detect HPV DNA, conditional independence clearly cannot be justified.
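Under the same strong assumptions (concordant results taken as truth and conditional independence of the tests), the mixture can be made explicit: if a fraction f of Group B is truly positive, the expected Test 3 positive rate in Group B is f × 0.91 + (1 − f) × 0.09. The Python sketch below solves this for f; it is an informal back-of-envelope calculation of ours, and it inherits every weakness of the assumptions behind it.

```python
def true_positive_fraction(group_pos_rate, sens3, false_pos_rate3):
    """Fraction of a group that is truly positive, given the third test's
    positive rate in that group and its assumed error rates."""
    return (group_pos_rate - false_pos_rate3) / (sens3 - false_pos_rate3)


# Rates quoted in the text: Test 3 is positive in 91% of Group A,
# 9% of Group D, and 40% of Group B.
fraction = true_positive_fraction(group_pos_rate=0.40, sens3=0.91,
                                  false_pos_rate3=0.09)
print(round(fraction, 2))  # roughly 0.38
```

On this reckoning, somewhat more than one third of the Group B samples would be true positives, with the remainder false positives.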
As discussed by Hadgu,2 recent advances have been made to relax the conditional independence assumption. However, Albert and Dodd9 show that when the conditional dependence structure between tests is incorrectly specified, estimates of sensitivity, specificity and prevalence are biased. Thus, although one might conclude from an informal or formal comparison with Test 3 that the newer method is more sensitive than the previously used method, caution is warranted.
Another statistical approach to improve assessment of the accuracy of diagnostic tests is to incorporate covariates. Formal, albeit nontrivial, approaches for incorporating covariates have been developed,10 but these methods perhaps have not been fully used by epidemiologists.
In summary, the papers by Pepe1 and Hadgu2 address important and different aspects of the evaluation of diagnostic tests with and without a gold standard. The more vexing problems arise in the absence of a gold standard. While the criticism of discrepant analysis is well taken, the assumptions that are needed for alternatives to discrepant analysis are also problematic. Devising better laboratory procedures is always preferable to making assumptions about dependence of tests or about prior distributions for Bayesian approaches. However, extension of statistical methods, such as incorporation of additional information in the form of covariates (or with Bayesian approaches, in the form of prior distributions), may help resolve disagreements between tests and estimate performance characteristics in the absence of a gold standard.
In practice, gold standards are rarer than one might think. For example, colposcopy-directed biopsy of the cervix to determine the degree of severity of cervical neoplasia has long been viewed as the gold standard of disease detection, but it is only about 60% sensitive for detection of cervical precancer.11 Also, tests are sometimes assigned the status of gold standard, or “most sensitive assay,” without verification. Such assumptions are common for NAATs, such as PCR, based on the principles underlying the assay and the high sensitivity of PCR under highly controlled laboratory conditions. However, true performance of PCR-based methods on clinical specimens can fall short of the theoretical predictions, as has been observed for PCR-based detection of HPV DNA.12
Finally, it is worth emphasizing that test performance should be evaluated in the context of the specific application of the test. In epidemiologic studies, full ascertainment of exposure is critical. In our example of HPV DNA testing, the increased analytic sensitivity of Test 2 (if true) would make the test highly desirable for studying the natural history of infection. However, the increased analytic sensitivity of the newer assay did not translate into a clinically important increase in detection of cervical precancer and cancer. Instead, the newer test had more positives among women without precancer or cancer7 than the old test, thereby reducing clinical specificity. Thus, criteria for a clinical test (which must weigh the relative importance of false negative versus false positive as determined from a risk-to-benefit model) may differ substantially from criteria for epidemiologic studies. For rare conditions, a clinical test is required to be highly sensitive and specific (ie, to have a low mean placement value) to be clinically useful. An epidemiologic study, by contrast, may not always need to distinguish clinically relevant measures (eg, cervical precancer that is HPV DNA positive) from the clinically irrelevant (eg, HPV infection without disease).
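Why a clinical test for a rare condition must be highly specific can be seen from the positive predictive value, computed here by Bayes' theorem. This Python sketch is our own illustration; the prevalence, sensitivity and specificity values are hypothetical.

```python
def positive_predictive_value(prev, sens, spec):
    """Probability of disease given a positive test result."""
    true_pos = prev * sens
    false_pos = (1 - prev) * (1 - spec)
    return true_pos / (true_pos + false_pos)


# Hypothetical rare condition (1% prevalence), test sensitivity 95%
ppv_spec_95 = positive_predictive_value(prev=0.01, sens=0.95, spec=0.95)
ppv_spec_99 = positive_predictive_value(prev=0.01, sens=0.95, spec=0.99)
print(round(ppv_spec_95, 2), round(ppv_spec_99, 2))
```

With these values, raising specificity from 95% to 99% lifts the positive predictive value from about 0.16 to about 0.49, so even a modest loss of specificity means that most positive results in a clinical setting are false.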
ABOUT THE AUTHORS
RUTH PFEIFFER is a tenure-track investigator in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute. Her research focuses on statistical methods for laboratory methods and problems arising in genetic epidemiology. PHILIP CASTLE is a tenure-track investigator in the Hormonal and Reproductive Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute. His research focuses on the natural history of human papillomavirus and the epidemiology and prevention of cervical cancer.
We thank Robbie D. Burk (Albert Einstein College of Medicine) and the clinical team and women participating in the Proyecto Epidemiologico Guanacaste (Costa Rica) for use of the data presented in this article.
1. Pepe MS, Longton G. Standardizing markers to evaluate and compare their performances. Epidemiology
2. Hadgu A, Dendukuri N, Hilden J. Evaluation of nucleic acid amplification tests in the absence of a perfect gold-standard test: a review of the statistical and epidemiologic issues. Epidemiology
3. van Doornum GJ, Buimer M, Prins M, et al. Detection of chlamydia trachomatis infection in urine samples from men and women by ligase chain reaction. J Clin Microbiol
4. Hui SL, Zhou XH. Evaluation of diagnostic tests without gold standards. Stat Methods Med Res
5. Herrero R, Schiffman MH, Bratti C, et al. Design and methods of a population-based natural history study of cervical neoplasia in a rural province of Costa Rica: the Guanacaste Project. Rev Panam Salud Publica
6. Castle PE, Schiffman M, Gravitt PE, et al. Comparisons of HPV DNA Detection by MY09/11 PCR Methods. J Med Virol
7. Herrero R, Castle PE, Schiffman M, et al. Epidemiology of type-specific HPV infection in Guanacaste. J Infect Dis
8. Munoz N, Bosch FX, de Sanjose S, et al. Epidemiologic classification of human papillomavirus types associated with cervical cancer. N Engl J Med
9. Albert PS, Dodd LE. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics
10. Nagelkerke NJ, Fidler V, Buwalda M. Instrumental variables in the evaluation of diagnostic test procedures when the true disease state is unknown. Stat Med
11. Guido R, Schiffman M, Solomon D, et al. Postcolposcopy management strategies for women referred with low-grade squamous intraepithelial lesions or human papillomavirus DNA-positive atypical squamous cells of undetermined significance: a two-year prospective study. Am J Obstet Gynecol
12. Schiffman M, Wheeler CM, Dasgupta A, et al. A comparison of a prototype PCR assay and hybrid capture 2 for detection of carcinogenic human papillomavirus DNA in women with equivocal or mildly abnormal Pap smears. Am J Clin Pathol. 2005; in press.