
Commentary: Reference-test Bias in Diagnostic-test Evaluation: A Problem for Epidemiologists, Too

Miller, William C.

doi: 10.1097/EDE.0b013e31823b5b5b
Infectious Disease

From the Division of Infectious Diseases, Department of Medicine, School of Medicine, University of North Carolina, Chapel Hill, NC; and the Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC.

Supported by NIH grants 5R01AI067913-05 and 5UL1RR025747-04. The author reported no other financial interests related to this research.

Correspondence: William C. Miller, CB7030, Division of Infectious Diseases, UNC-Chapel Hill, Chapel Hill, NC 27599-7030. E-mail:

Epidemiologists depend on accurate assessment of disease states for almost all aspects of their work, whether for research or practice. Most epidemiologists are aware that diagnostic tests are fallible. They have developed and applied sophisticated methods to address the measurement error of outcomes. Epidemiologists are comfortable with sensitivity and specificity, the parameters used to express the accuracy of diagnostic and screening tests. However, few “traditional epidemiologists” have contributed to methods for evaluating diagnostic tests, despite the centrality of diagnostic tests to their work. Instead, diagnostic test evaluation has been the purview of “clinical epidemiologists” and a few biostatisticians.

Diagnostic-test evaluation is subject to numerous potential biases.1 Fundamentally, the evaluation of a new diagnostic test requires a comparison with a reference (“gold”) standard, which is usually assumed to discriminate disease and nondisease states perfectly. Unfortunately, few (if any) reference standards are perfect. The resulting reference-test bias is one of the most important, pervasive, and challenging forms of bias.2,3

The impact of reference-test bias can be described with a simple example. Consider a reference test with sensitivity of 0.85 and specificity of 0.90 and a new and improved test with a true (but unknown) sensitivity of 0.90 and specificity of 0.95. For simplicity, we assume conditional independence of these tests. In a study sample with a prevalence of 0.1, the measured sensitivity and specificity of the new test against the reference test would be 0.46 and 0.93, respectively. In a study sample with a prevalence of 0.5 (ie, cases and noncases selected independently), the estimates of sensitivity and specificity for the new test improve markedly to 0.81 and 0.83, respectively. The dependence of the sensitivity and specificity estimates on prevalence is comparable with the variation of positive and negative predictive values with prevalence. False positives by the reference test increase as prevalence decreases; reference-test false negatives increase as prevalence increases. The imperfect classification of disease by the reference standard leads to apparent misclassification by the new test and to biased sensitivity and specificity estimates.
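The arithmetic behind this example can be reproduced directly from the joint cell probabilities of the two tests. A minimal sketch (Python; the function name is mine, and conditional independence is assumed as in the text):

```python
def apparent_accuracy(se_new, sp_new, se_ref, sp_ref, prev):
    """Apparent sensitivity/specificity of a new test when it is scored
    against an imperfect reference, assuming conditional independence."""
    # Probability that both tests are positive, and that both are negative
    both_pos = prev * se_new * se_ref + (1 - prev) * (1 - sp_new) * (1 - sp_ref)
    both_neg = prev * (1 - se_new) * (1 - se_ref) + (1 - prev) * sp_new * sp_ref
    # Marginal probability of a positive reference result
    ref_pos = prev * se_ref + (1 - prev) * (1 - sp_ref)
    return both_pos / ref_pos, both_neg / (1 - ref_pos)

# Example from the text: reference 0.85/0.90, new test truly 0.90/0.95
se, sp = apparent_accuracy(0.90, 0.95, 0.85, 0.90, prev=0.1)
print(round(se, 2), round(sp, 2))  # 0.46 0.93
se, sp = apparent_accuracy(0.90, 0.95, 0.85, 0.90, prev=0.5)
print(round(se, 2), round(sp, 2))  # 0.81 0.83
```

The same function makes the prevalence dependence easy to explore across the full range of study-sample prevalences.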

Reference-test bias is particularly problematic when the new test is better than the reference standard, as in the aforementioned example. The belief that a new test is inherently better has led to intuitive “solutions” to account for reference-test bias. Unfortunately, as with other areas of epidemiology, intuition is often a poor statistician. In the 1990s, microbiologists adopted an intuitively appealing procedure referred to as discrepant analysis for the evaluation of new nucleic-acid-amplification tests (such as polymerase chain reaction) for the diagnosis of chlamydial infection and other infectious diseases.4–6 Microbiologists recognized that these new tests represented a major advance over culture techniques, which were known to have limited sensitivity.7 To address the concern of reference-test bias when using culture as the reference standard, the microbiologists chose to conduct additional testing on specimens with discordant results (ie, positive by new test, but negative by reference standard). This “resolution of discrepancy” was often performed with tests that are mechanistically similar to the new test under evaluation. After resolving the discordant specimens, sensitivity and specificity were recalculated. Although intuitively appealing, this procedure was inherently biased, with substantial overestimation of test performance even under ideal conditions.8–10
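The structural bias of discrepant analysis can be illustrated with a small probability calculation. In the sketch below (Python), the parameter values and the conditionally independent "resolver" test are illustrative assumptions, not figures from the original reports. Even under this ideal condition, resolving only the new-positive/reference-negative cell inflates the sensitivity estimate above its true value, because errors in the other cells are never revisited:

```python
from itertools import product

def discrepant_analysis(prev, new, ref, res):
    """Revised sensitivity/specificity of a new test after 'resolving' the
    new-positive/reference-negative specimens with a third (resolver) test.
    new, ref, res are (sensitivity, specificity) pairs; all three tests
    are assumed conditionally independent given true disease status."""
    num_se = den_se = num_sp = den_sp = 0.0
    for d, t_new, t_ref, t_res in product((1, 0), repeat=4):
        p = prev if d else 1 - prev
        for result, (se, sp) in zip((t_new, t_ref, t_res), (new, ref, res)):
            p *= (se if result else 1 - se) if d else ((1 - sp) if result else sp)
        # Revised "gold standard": reference-positive, or a discordant
        # (new+/ref-) specimen reclassified as positive by the resolver
        revised_pos = t_ref or (t_new and t_res)
        if revised_pos:
            den_se += p
            num_se += p * t_new
        else:
            den_sp += p
            num_sp += p * (1 - t_new)
    return num_se / den_se, num_sp / den_sp

# Illustrative values: an insensitive but perfectly specific culture-like
# reference, a new test truly 0.90/0.95, and a resolver of equal accuracy
se, sp = discrepant_analysis(0.1, new=(0.90, 0.95), ref=(0.75, 1.00), res=(0.90, 0.95))
print(f"{se:.3f} {sp:.3f}")  # 0.923 0.950 -- true values are 0.90 and 0.95
```

With a resolver whose errors correlate with the new test's errors, as when the resolver is mechanistically similar, the overestimation is larger still.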

The history of discrepant analysis reveals the fundamental challenges of communication between laboratory scientists and statistical methodologists. After the US Food and Drug Administration (FDA) allowed discrepant analysis for clearance of a few tests, Hadgu8,9 and others10,11 demonstrated the method's inherent bias. These reports led to vigorous debate between the microbiologists and methodologists.7,10,12–14 At times, the misunderstanding between the “scientists” and the “statisticians” was remarkable. The laboratory scientists recognized that the limits of detection of the new nucleic-acid-amplification tests under evaluation were markedly better than previous culture-based tests.7 Thus, discrepant analysis was used to account for what they considered a biologic fact. Assumptions and bias were secondary considerations, as was evident at a diagnostic-test-evaluation workshop in 1999 at the Centers for Disease Control and Prevention (CDC). After carefully explaining the statistical assumptions of various approaches for reference-test bias, a prominent laboratory scientist raised his hand and proclaimed, “I work in a laboratory. I make no assumptions. I simply let the data speak for itself.” This fundamentally different perspective was extremely difficult to overcome. Fortunately, the FDA acknowledged the limitations of discrepant analysis and restricted its use.

In this issue of EPIDEMIOLOGY, Hadgu and colleagues15 address a different intuitive yet biased diagnostic-test-evaluation procedure, the patient-infected-status algorithm (PISA). During the past decade, PISA has replaced discrepant analysis as a common procedure for evaluating diagnostic and screening tests for many infectious diseases.16–18 Unfortunately, this procedure also yields biased results.

PISA uses a combination of tests to create a composite reference standard.16–18 Sometimes 2 tests are used with specimens taken from multiple anatomic sites. Hadgu et al15 demonstrate that this approach can be associated with substantial bias under the simplest conditions (assuming conditional independence between tests) and under more complex conditions (assuming conditional dependence). Given this bias, PISA should not be an acceptable procedure for FDA clearance.
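A stripped-down illustration of why composite standards mislead (this is a deliberate simplification, not the exact algorithm analyzed by Hadgu et al15): suppose a patient is declared infected whenever either of two tests is positive, and one of the two is the very test under evaluation. Because every positive result of the evaluated test is then counted as a true positive by construction, its apparent specificity is forced to perfection regardless of its true specificity, while its apparent sensitivity is distorted by the other test's errors. All parameter values below are illustrative assumptions:

```python
from itertools import product

def vs_any_positive_composite(prev, new, other):
    """Apparent accuracy of a test scored against an 'any-positive' composite
    standard that includes the evaluated test itself. new and other are
    (sensitivity, specificity) pairs; conditional independence is assumed."""
    num_se = den_se = num_sp = den_sp = 0.0
    for d, t_new, t_other in product((1, 0), repeat=3):
        p = prev if d else 1 - prev
        for result, (se, sp) in zip((t_new, t_other), (new, other)):
            p *= (se if result else 1 - se) if d else ((1 - sp) if result else sp)
        composite = t_new or t_other   # "infected" if either test is positive
        if composite:
            den_se += p
            num_se += p * t_new
        else:
            den_sp += p
            num_sp += p * (1 - t_new)
    return num_se / den_se, num_sp / den_sp

# Both tests truly 0.90 sensitive / 0.95 specific, prevalence 0.1
se, sp = vs_any_positive_composite(0.1, new=(0.90, 0.95), other=(0.90, 0.95))
print(f"{se:.3f} {sp:.3f}")  # 0.723 1.000 -- specificity forced to perfection
```

Real composite algorithms are more elaborate, but the core problem (the reference inherits, and often incorporates, the errors of its constituent tests) is the same.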

The bias of simple, intuitive approaches to diagnostic-test evaluation shows the need for more sophisticated statistical approaches. Several alternative approaches have been developed.19–23 Latent-class analysis is a probabilistic approach that treats the true disease state as an unknown, underlying latent variable. In its simplest form, conditional independence is assumed,21 but more advanced applications can incorporate conditional dependence.22 Bayesian approaches, including Bayesian latent-class models, have also been developed.23 Generally, these statistically intensive approaches are important steps forward, but, unfortunately, none is “ideal.” For example, latent-class approaches have been criticized because the latent variable (which is unknown and unspecified) may not reflect the clinical entity under study.3 These approaches may also be limited by feasibility, such as the number of tests required for analysis.20
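In its simplest form (three conditionally independent tests, in the spirit of Hui and Walter21), latent-class estimation can be carried out with a short expectation-maximization (EM) iteration. The sketch below (Python) is illustrative only: the parameter values are invented, and the "data" are the expected result-pattern frequencies under those values, so the fit should recover the generating prevalence and accuracies without a gold standard:

```python
from itertools import product

def lca_em(freq, n_iter=2000):
    """EM for a 3-test latent-class model under conditional independence.
    freq maps each result pattern (y1, y2, y3) to its relative frequency."""
    prev, se, sp = 0.3, [0.9] * 3, [0.9] * 3   # starting values (assumed)
    for _ in range(n_iter):
        # E-step: posterior probability of disease given each result pattern
        post = {}
        for y in freq:
            pd, pn = prev, 1 - prev
            for j in range(3):
                pd *= se[j] if y[j] else 1 - se[j]
                pn *= (1 - sp[j]) if y[j] else sp[j]
            post[y] = pd / (pd + pn)
        # M-step: update prevalence, sensitivities, and specificities
        prev = sum(freq[y] * post[y] for y in freq)
        se = [sum(freq[y] * post[y] * y[j] for y in freq) / prev
              for j in range(3)]
        sp = [sum(freq[y] * (1 - post[y]) * (1 - y[j]) for y in freq) / (1 - prev)
              for j in range(3)]
    return prev, se, sp

# Expected pattern frequencies under invented "true" parameters
TRUE_PREV, TRUE_SE, TRUE_SP = 0.2, [0.90, 0.80, 0.85], [0.95, 0.90, 0.90]
freq = {}
for y in product((0, 1), repeat=3):
    pd, pn = TRUE_PREV, 1 - TRUE_PREV
    for j in range(3):
        pd *= TRUE_SE[j] if y[j] else 1 - TRUE_SE[j]
        pn *= (1 - TRUE_SP[j]) if y[j] else TRUE_SP[j]
    freq[y] = pd + pn
prev, se, sp = lca_em(freq)
print(f"{prev:.3f} {se[0]:.3f}")  # recovers ~0.200 and ~0.900
```

With real (finite-sample) counts the estimates carry sampling error, and the criticisms noted above still apply: the latent variable need not correspond to the clinical entity of interest, and conditional dependence requires richer models.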

These past 2 decades of controversy and progress in diagnostic-test evaluation have gone largely unnoticed by most mainstream epidemiologists. A quick PubMed search reveals zero relevant diagnostic-test evaluation articles since 1990 in the American Journal of Epidemiology, and just 5 articles in EPIDEMIOLOGY. In contrast, the Journal of Clinical Epidemiology has 15 relevant articles since 2006. Should mainstream epidemiologists be concerned about diagnostic-test-evaluation methods? I would have to say yes. Given that the outcomes used in epidemiologic studies are determined by diagnostic tests, epidemiologists must have a deep understanding of the validity of these data. Epidemiologists have a long history of evaluating bias and developing appropriate methods to account for bias. If epidemiology's methodologists would give attention to diagnostic-test evaluation, I am convinced that new and important insights would result. Better evaluation of diagnostic tests would improve the quality of epidemiologic studies and simultaneously lead to improved clinical outcomes and public health.



WILLIAM C. MILLER is Special Editor for Infectious Diseases at EPIDEMIOLOGY and an Associate Professor of Medicine and Epidemiology at the University of North Carolina, where he directs the Program in Health Care Epidemiology. He has written on the evaluation of new diagnostic tests and on addressing bias in diagnostic-test evaluation. He has also been a consultant for the Centers for Disease Control and Prevention on diagnostic-test evaluation.



1. Begg CB. Biases in the assessment of diagnostic tests. Stat Med. 1987;6:411–423.
2. Boyko EJ, Alderman BW, Baron AE. Reference test errors bias the evaluation of diagnostic tests for ischemic heart disease. J Gen Intern Med. 1988;3:476–481.
3. Reitsma JB, Rutjes AW, Khan KS, Coomarasamy A, Bossuyt PM. A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard. J Clin Epidemiol. 2009;62:797–806.
4. Schachter J, Stamm WE, Quinn TC, Andrews WW, Burczak JD, Lee HH. Ligase chain reaction to detect Chlamydia trachomatis infection of the cervix. J Clin Microbiol. 1994;32:2540–2543.
5. Wiesenfeld HC, Uhrin M, Dixon BW, Sweet RL. Diagnosis of male Chlamydia trachomatis urethritis by polymerase chain reaction. Sex Transm Dis. 1994;21:268–271.
6. Vuorinen P, Miettinen A, Vuento R, Hallstrom O. Direct detection of Mycobacterium tuberculosis complex in respiratory specimens by Gen-Probe Amplified Mycobacterium tuberculosis Direct Test and Roche Amplicor Mycobacterium Tuberculosis Test. J Clin Microbiol. 1995;33:1856–1859.
7. Schachter J. Two different worlds we live in. Clin Infect Dis. 1998;27:1181–1185.
8. Hadgu A. The discrepancy in discrepant analysis. Lancet. 1996;348:592–593.
9. Hadgu A. Bias in the evaluation of DNA-amplification tests for detecting Chlamydia trachomatis. Stat Med. 1997;16:1391–1399.
10. Miller WC. Bias in discrepant analysis: when two wrongs don't make a right. J Clin Epidemiol. 1998;51:219–231.
11. Miller WC. Can we do better than discrepant analysis for new diagnostic test evaluation? Clin Infect Dis. 1998;27:1186–1193.
12. Hilden J. Discrepant analysis—or behaviour? Lancet. 1997;350:902.
13. Schachter J, Stamm WE, Quinn TC. Discrepant analysis and screening for Chlamydia trachomatis. Lancet. 1996;348:1308–1309.
14. Schachter J, Stamm WE, Quinn TC. Discrepant analysis and screening for Chlamydia trachomatis. Lancet. 1998;351:217–218.
15. Hadgu A, Dendukuri N, Wang L. Evaluation of screening tests for detecting infection: bias associated with the patient-infected-status algorithm. Epidemiology. 2012;23:72–82.
16. Martin DH, Nsuami M, Schachter J, et al. Use of multiple nucleic acid amplification tests to define the infected-patient “gold standard” in clinical trials of new diagnostic tests for Chlamydia trachomatis infections. J Clin Microbiol. 2004;42:4749–4758.
17. Chernesky MA, Martin DH, Hook EW, et al. Ability of new APTIMA CT and APTIMA GC assays to detect Chlamydia trachomatis and Neisseria gonorrhoeae in male urine and urethral swabs. J Clin Microbiol. 2005;43:127–131.
18. Schachter J, Chernesky MA, Willis DE, et al. Vaginal swabs are the specimens of choice when screening for Chlamydia trachomatis and Neisseria gonorrhoeae: results from a multicenter evaluation of the APTIMA assays for both infections. Sex Transm Dis. 2005;32:725–728.
19. Hadgu A, Qu Y. A biomedical application of latent class models with random effects. J R Stat Soc Ser C Appl Stat. 1998;47:603–616.
20. Qu Y, Tan M, Kutner MH. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics. 1996;52:797–810.
21. Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics. 1980;36:167–171.
22. Dendukuri N, Hadgu A, Wang L. Modeling conditional dependence between diagnostic tests: a multiple latent variable model. Stat Med. 2009;28:441–461.
23. Enoe C, Georgiadis MP, Johnson WO. Estimation of sensitivity and specificity of diagnostic tests and disease prevalence when the true disease state is unknown. Prev Vet Med. 2000;45:61–81.
Copyright © 2012 Wolters Kluwer Health, Inc. All rights reserved.