OBJECTIVE: To estimate the accuracy of colposcopy to identify cervical precancer in screening and diagnostic settings.
METHODS: As part of a larger clinical trial to evaluate the diagnostic accuracy of optical spectroscopy, we recruited 1,850 patients into a diagnostic or a screening group depending on their history of abnormal findings on Papanicolaou tests. Colposcopic examinations were performed and biopsy specimens obtained from abnormal and normal colposcopic sites for all patients. The criterion standard of test accuracy was the histologic report of biopsies. We calculated sensitivities, specificities, likelihood ratios, receiver operating characteristic curves, and areas under the receiver operating characteristic curves.
RESULTS: The prevalence of high-grade squamous intraepithelial lesions (HSIL) or cancer was 29.0% for the diagnostic group and 2.2% for the screening group. Using a disease threshold of HSIL, colposcopy had a sensitivity of 0.983 and a specificity of 0.451 in the diagnostic group when the test threshold was low-grade squamous intraepithelial lesions (LSIL), and a sensitivity of 0.714 and a specificity of 0.813 when the test threshold was HSIL. Using the same HSIL disease threshold, in the screening group, colposcopy had a sensitivity of 0.286 and a specificity of 0.877 when the test threshold was LSIL, and a sensitivity of 0.191 and a specificity of 0.961 when the threshold was HSIL. The colposcopy area under the receiver operating characteristic curve was 0.821 (95% confidence interval 0.79–0.85) in the diagnostic setting compared with 0.587 (95% confidence interval 0.56–0.62) in the screening setting. Changing the disease threshold to LSIL demonstrated similar patterns in the tradeoff of sensitivity and specificity and measure of accuracy.
CONCLUSION: Colposcopy performs well in the diagnostic setting and poorly in the screening setting. Colposcopy should not be used to screen for cervical intraepithelial neoplasia.
LEVEL OF EVIDENCE: II
Colposcopy performs well in the diagnostic setting; however, its poor performance in screening indicates that colposcopy should not be used to screen for cervical intraepithelial neoplasia.
From the 1Department of Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, Texas; 2Department of Statistics, Rice University, Houston, Texas; 3Division of Population Science, Fox Chase Cancer Center, Philadelphia, Pennsylvania; 4Department of Gynecologic Oncology, The University of Texas M. D. Anderson Cancer Center, Houston, Texas; and 5Division of Gynecologic Oncology, Department of Obstetrics and Gynaecology, University of British Columbia, Vancouver, British Columbia, Canada.
Supported by grant number CA82710 from the National Cancer Institute.
Presented as a poster at the 43rd Annual Meeting of the American Society of Clinical Oncology, Chicago, Illinois, June 1–5, 2007.
Corresponding author: Scott B. Cantor, PhD, The University of Texas M. D. Anderson Cancer Center, Section of Health Services Research, Department of Biostatistics, 1515 Holcombe Boulevard, Unit 447, Houston, TX 77030-4009; e-mail: email@example.com.
Financial Disclosure: The authors have no potential conflicts of interest to disclose.
Although first described as a method for early cancer detection more than 75 years ago,1 colposcopy was slow to be used in cancer detection in most of the developed world. Enthusiasm for and widespread introduction of cervical cytology as a simply performed screening test led to a marked reduction in both the incidence and mortality of invasive cervical cancer in those countries with comprehensive screening programs.
Colposcopy has been the subject of numerous reports in the past 30 years, with virtually all of these articles describing its value as a diagnostic enhancer for assessing screening cytologic abnormalities.2–7 The abilities of colposcopy as a diagnostic technique are well understood and appreciated, but scientific data on its potential as a screening method are limited. We undertook the present study to estimate the accuracy of colposcopy in both the screening and diagnostic settings.
MATERIALS AND METHODS
Participants in this study were enrolled in a larger study evaluating optical spectroscopy, an emerging technology for the screening and diagnosis of cervical squamous intraepithelial lesions (SIL). Based on their history of Papanicolaou test results, participants were allocated to a diagnostic group or a screening group. Participants in the diagnostic group had a history of abnormal Papanicolaou test results. Approximately 60% of these participants were patients in a colposcopy clinic and had recent abnormal Papanicolaou test results, and the rest of the participants were recruited from the community. Participants in the screening group had no history of abnormal Papanicolaou test results; most were recruited from the community. All women recruited from the community were recruited through television and radio news stories, advertisements, billboards, and word of mouth. The study was performed in three clinical settings: a community hospital and a comprehensive cancer center in the United States, and a comprehensive cancer center in Canada. All participants completed informed consent authorizations, and the study was approved by the institutional review boards at The University of Texas M. D. Anderson Cancer Center, The University of Texas Health Science Center, the Lyndon Baines Johnson Hospital Health District, British Columbia Cancer Agency, and the University of British Columbia.
As part of the research protocol, all participants received several tests associated with cervical cancer screening and diagnosis. Each woman underwent a complete medical history and received a physical examination and a pelvic examination. The pelvic examination included a Papanicolaou test using Ayre's spatula and a cytobrush, bacterial cultures for Chlamydia and gonorrhea testing, viral specimens for human papillomavirus testing, and a colposcopic examination of the vulva, vagina, and cervix.
Colposcopists included four gynecologic oncologists and eight nurse practitioners, all of whom had several years of experience in colposcopy procedures. A colposcopist first applied 6% acetic acid and let it remain for approximately 2 minutes. The 6% acetic acid was reapplied to the cervix using cotton balls repeatedly every few seconds over the next 1–2 minutes to detect “fast fader” or “slow uptake” lesions.8 The colposcopist then inspected the cervix and identified the squamous columnar junction and the transformation zone. The International Federation for Cervical Pathology and Colposcopy nomenclature9 was used to grade colposcopic lesions. Colposcopic impression was classified as normal and benign lesions (inflammatory and metaplasia), low-grade squamous intraepithelial lesions (LSIL), high-grade squamous intraepithelial lesions (HSIL), or cancer.
The colposcopist then took one or two colposcopically directed biopsies of the area with the worst colposcopic impression according to standard of practice, and one or two biopsies of squamous and columnar epithelium from an area of normal appearance. If the overall colposcopic impression was normal, biopsies were obtained from one or two normal sites and included both types of cervical epithelium. All biopsies were submitted to pathologists for sectioning and reading.
Pathologists were blinded to the colposcopic impression, test results, and medical history of the patient, and all biopsies were read twice. The first reading was done on site by the local participating pathologist, and the second, by another pathologist on the study team. If the two pathologists disagreed on the pathology diagnosis, a third slide reading was performed to determine a final diagnosis. Agreement between pathologists was considered substantial; details have been published elsewhere.10 Histologic diagnosis was categorized according to the Bethesda classification as normal (including inflammatory lesions and atypical squamous cells of undetermined significance [ASC-US]), LSIL, HSIL, or cancer.
Additional clinical data related to the research questions posed in the study were obtained, including cytologic specimens for Feulgen staining, endocervical curettage, human papillomavirus (HPV) testing by polymerase chain reaction, fluorescence and reflectance emission spectra, and hormone levels.
From October 1998 to November 2005, we recruited 1,000 participants into the screening group and 850 into the diagnostic group. Almost all (96%) participants had complete data and were included in the final analysis. The primary reasons for excluding a participant from the final analysis were refusal of colposcopy or biopsy and missing colposcopy or biopsy results. The percentage of missing results was approximately the same for the diagnostic and screening groups. Figure 1 presents a flow chart of study participants.
We used the histology result as the criterion standard of diagnosis and evaluated the sensitivity and specificity of colposcopy at the LSIL and HSIL disease thresholds. These thresholds were chosen based on clinical decision-making standards of care. Women diagnosed with HSIL are definitely treated, typically with a loop electrosurgical excision procedure, and in our practice, women diagnosed with LSIL are asked to return for follow-up in 6 months.
For each participant, we identified the worst colposcopy and worst biopsy results among the sites examined. The worst colposcopy result was the “test” that was compared with the participant's worst biopsy result, which is the criterion standard for diagnosis. Thus, the woman was the unit of analysis, that is, we evaluated colposcopy at the level of the participant rather than the individual colposcopy sites. By using the woman as the unit of analysis, we paralleled the process of decision making in medical practice, in which the worst biopsy result determines the treatment decision.
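The per-participant aggregation described above can be sketched in a few lines of Python. This is an illustration only, not the software used in the study; the function names, the participant identifiers, and the example site data are hypothetical, and the ordinal scale simply follows the four categories used for colposcopic impression and histology.

```python
# Ordinal severity scale shared by colposcopic impression and histology.
GRADE_ORDER = {"normal": 0, "LSIL": 1, "HSIL": 2, "cancer": 3}


def worst_grade(grades):
    """Return the most severe grade among a participant's sites."""
    return max(grades, key=GRADE_ORDER.__getitem__)


def per_participant(sites):
    """Collapse site-level (colposcopy, biopsy) pairs to one pair per woman.

    `sites` maps a participant id to a list of (colposcopy, biopsy) tuples,
    one tuple per biopsied site.
    """
    return {
        pid: (
            worst_grade([colpo for colpo, _ in pairs]),
            worst_grade([biopsy for _, biopsy in pairs]),
        )
        for pid, pairs in sites.items()
    }


# Hypothetical data: two participants, two biopsied sites each.
example_sites = {
    "P001": [("HSIL", "LSIL"), ("normal", "normal")],
    "P002": [("normal", "normal"), ("normal", "HSIL")],
}
summary = per_participant(example_sites)
```

In this made-up example, participant P002 has only colposcopically normal sites but an HSIL biopsy, so she counts as a missed case at the participant level even though no individual abnormal site was seen.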
We evaluated colposcopic performance for each study group (screening and diagnosis) separately. For each disease threshold and for each test cutpoint, we estimated the sensitivity and specificity and the 95% confidence intervals for those statistics. In addition, we computed the positive and negative likelihood ratios for each cutpoint for each disease threshold. The likelihood ratio is the likelihood of a given test result in a patient with a disorder compared with the likelihood of the same result in a patient without the disorder.11
A positive likelihood ratio greater than 10 indicates a large and typically conclusive increase in the likelihood of disease. Similarly, a positive likelihood ratio between 5 and 10, 2 and 5, or 1 and 2 indicates a moderate, small, or minimal increase in the likelihood of disease. A negative likelihood ratio less than 0.1 indicates a large and typically conclusive decrease in the likelihood of disease. Similarly, a negative likelihood ratio between 0.1 and 0.2, 0.2 and 0.5, or 0.5 and 1.0 indicates a moderate, small, or minimal decrease in the likelihood of disease.
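As a minimal sketch of these definitions, the Python below computes both likelihood ratios from a sensitivity and specificity and maps them onto the interpretive bands listed above. The function names are hypothetical, and the handling of values that fall exactly on a band boundary (for example, a ratio of exactly 5) is an assumption, because the text does not specify it.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) for a dichotomized test."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


def describe_lr_pos(lr_pos):
    """Size of the increase in the likelihood of disease."""
    if lr_pos > 10:
        return "large"
    if lr_pos >= 5:  # boundary handling is an assumption
        return "moderate"
    if lr_pos >= 2:
        return "small"
    if lr_pos > 1:
        return "minimal"
    return "none"


def describe_lr_neg(lr_neg):
    """Size of the decrease in the likelihood of disease."""
    if lr_neg < 0.1:
        return "large"
    if lr_neg <= 0.2:  # boundary handling is an assumption
        return "moderate"
    if lr_neg <= 0.5:
        return "small"
    if lr_neg < 1.0:
        return "minimal"
    return "none"


# Diagnostic group, HSIL disease threshold, LSIL test threshold (values from the abstract).
lr_pos, lr_neg = likelihood_ratios(0.983, 0.451)
print(round(lr_pos, 3), describe_lr_pos(lr_pos))
print(round(lr_neg, 3), describe_lr_neg(lr_neg))
```

Because the inputs here are rounded to three decimals, the computed ratios can differ in the last digit from values calculated from the underlying counts.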
We constructed two sets of receiver operating characteristic (ROC) curves, depending on whether LSIL or HSIL was the disease threshold. The ROC curve is a plot of the sensitivity (true positive fraction) of a diagnostic test against one minus the specificity (equal to the false positive fraction) of the test, as the threshold for indicating a positive test is varied. This plot is often used in choosing between competing tests.
We used SPSS 12.0 for Windows 2003 (SPSS, Inc., Chicago, IL) to examine frequencies by study group and analyzed differences within study groups. The variables we studied were age, race, education, marital status, number of pregnancies, menopausal status, smoking status, and HPV infection. We used the χ2 test for categorical and ordinal variables and the t test for continuous variables. We used Stata 9 statistical software (StataCorp, LP, College Station, TX) to perform the ROC curve analysis, including the computation of all operating characteristics. The areas under the ROC curves were compared using the χ2 test with a nonparametric approach.12
RESULTS
We analyzed 1,768 women, 971 (55%) from the screening group and 797 (45%) from the diagnostic group. The participants in the screening group were significantly older than those in the diagnostic group (mean ages of 44 years and 36 years, respectively; P<.001). In terms of sociodemographic characteristics, the groups showed statistically significant differences regarding marital status and race (P<.001), but not education level (P=.06). Not surprisingly, because of the age distribution within each group, more women in the screening group than the diagnostic group stated that they had gone through menopause or were experiencing menopausal symptoms (P<.001). Regarding HPV infection, 41% of the women in the diagnostic group tested positive for high-risk strains of HPV, compared with only 9% of those in the screening group. Table 1 details the demographic and clinical characteristics of the screening and diagnostic groups.
Table 2 shows the results of colposcopy for the 797 participants in the diagnostic group, for whom the prevalence of HSIL or cancer was 29.0%, and the prevalence of LSIL or worse was 54.1%.
When disease was defined as worst biopsy per patient showing “HSIL or worse” and the test threshold was defined to be worst colposcopic diagnosis per patient showing LSIL, colposcopy appropriately identified (sensitivity) 0.983 (95% confidence interval [CI] 0.956–0.995) of participants in the diagnostic group with cervical precancer. In Table 2, this equates to (62+165)/231. The false-negative rate in the diagnostic group was thus 1.7% (ie, 4/231). For diagnostic participants without cervical precancer, the true-negative rate (specificity) was 0.451 (95% CI 0.409–0.493). At this disease threshold, the positive likelihood ratio was 1.788 and the negative likelihood ratio was 0.038 (Table 3).
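The sensitivity arithmetic above, (62+165)/231, and a confidence interval around it can be reproduced with a short sketch. The interval method is an assumption: the text does not state how the 95% confidence intervals were computed, and the code below uses the exact binomial (Clopper-Pearson) interval, solved by bisection with only the Python standard library. The function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(f, target, iterations=60):
    """Find p in [0, 1] with f(p) == target, for f decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > target:
            lo = mid  # f is still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_ci(successes, n, alpha=0.05):
    """Clopper-Pearson interval for a binomial proportion."""
    if successes == 0:
        lower = 0.0
    else:
        # P(X >= successes | p) = alpha/2  <=>  cdf(successes - 1) = 1 - alpha/2
        lower = solve_decreasing(
            lambda p: binom_cdf(successes - 1, n, p), 1 - alpha / 2
        )
    if successes == n:
        upper = 1.0
    else:
        # P(X <= successes | p) = alpha/2
        upper = solve_decreasing(lambda p: binom_cdf(successes, n, p), alpha / 2)
    return lower, upper


true_positives = 62 + 165  # colposcopy LSIL or worse among biopsy HSIL or worse
diseased = 231
sensitivity = true_positives / diseased
lower, upper = exact_ci(true_positives, diseased)
print(round(sensitivity, 3), round(lower, 3), round(upper, 3))
```

Under this assumption the bounds can be compared directly with the reported interval of 0.956–0.995.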
When disease was defined as “HSIL or worse” and the test threshold was made stricter and defined to be HSIL, sensitivity decreased to 0.714 (95% CI 0.651–0.772) and the specificity increased to 0.813 (95% CI 0.778–0.844). The positive likelihood ratio was 3.814 and the negative likelihood ratio was 0.352.
When the disease definition was expanded to worst biopsy per patient showing “LSIL or worse” and the test threshold was defined to be worst colposcopic diagnosis showing LSIL, colposcopy appropriately identified (sensitivity) 0.879 (95% CI 0.845–0.909) of participants in the diagnostic group with this expanded definition of disease. Using these disease and test thresholds, the false-negative rate among diagnostic participants was thus 12.1% (ie, 52/431). For diagnostic participants without cervical precancer, the true-negative rate (specificity) was 0.566 (95% CI 0.513–0.617). At this disease threshold, the positive likelihood ratio was 2.024 and the negative likelihood ratio was 0.213.
When disease was defined as “LSIL or worse” and the test threshold was made stricter and defined to be HSIL, sensitivity decreased to 0.515 (95% CI 0.467–0.563) and the specificity increased to 0.866 (95% CI 0.827–0.899). The positive likelihood ratio was 3.847 and the negative likelihood ratio was 0.560.
The operating characteristics of colposcopy were worse for the screening group. Table 4 shows the results of colposcopy for the 971 screening participants. The prevalence of HSIL or cancer in the screening group was 2.2%; the prevalence of LSIL or worse was 14.0%. With a test threshold of HSIL, colposcopy appropriately identified (sensitivity) 0.191 (95% CI 0.055–0.419) of screening patients with “HSIL or worse.” The true negative rate (specificity) for the screening group at the HSIL disease threshold was 0.961 (95% CI 0.947–0.972). The positive likelihood ratio was 4.891, and the negative likelihood ratio was 0.842 (Table 3 shows the overall summary of results).
When disease was defined as worst biopsy per patient showing “HSIL or worse” and the test threshold was made less restrictive and defined to be worst colposcopic diagnosis showing LSIL, sensitivity increased to 0.286 (95% CI 0.113–0.522), and the specificity decreased to 0.877 (95% CI 0.854–0.897). At these disease and test thresholds (HSIL and LSIL, respectively), the positive likelihood ratio was 2.320, and the negative likelihood ratio was 0.815.
When the disease definition was expanded to “LSIL or worse” and the test threshold of cervical precancer was defined to be HSIL, colposcopy appropriately identified 0.103 (95% CI 0.057–0.167) of the participants in the screening group with disease. With these disease and test thresholds, in the screening group, the true negative rate (specificity) was 0.968 (95% CI 0.953–0.979). The positive likelihood ratio was 3.184 and the negative likelihood ratio was 0.937.
When the test threshold was defined as worst colposcopic diagnosis showing “LSIL or worse” and the disease threshold was worst biopsy per patient showing LSIL, colposcopy appropriately identified 0.257 (95% CI 0.186–0.339) of the screening participants with disease. For women in the screening group without cervical precancer, the true negative rate (specificity) was 0.895 (95% CI 0.872–0.915). The positive likelihood ratio was 2.442, and the negative likelihood ratio was 0.830.
We compared these results by analyzing the receiver operating characteristic curve (referred to as “curve” hereafter), as shown in Figures 2 and 3. Using a disease threshold of HSIL, Figure 2A shows the curve for participants in the diagnostic group and Figure 2B shows the curve for screening participants. The area under the curve for diagnostic participants is 0.821 (95% CI 0.79–0.85), and the area under the curve for screening participants is 0.587 (95% CI 0.48–0.69). The areas under the ROC curves for the diagnostic and screening participants were found to be statistically significantly different (P<.001).
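To show how these areas relate to the operating points reported earlier, the sketch below joins the (1 − specificity, sensitivity) points for the LSIL and HSIL test thresholds with straight lines and applies the trapezoidal rule. It is an approximation under stated assumptions, not the Stata analysis used in the study: only the two reported cutpoints are used (the “cancer” cutpoint is ignored), the inputs are rounded to three decimals, and the function name is hypothetical.

```python
def trapezoid_auc(operating_points):
    """Area under a piecewise-linear ROC curve.

    `operating_points` is an iterable of (sensitivity, specificity) pairs,
    one per test threshold. The corners (0, 0) and (1, 1) are added here.
    """
    points = sorted((1.0 - spec, sens) for sens, spec in operating_points)
    points = [(0.0, 0.0)] + points + [(1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between adjacent points
    return area


# HSIL disease threshold; test thresholds LSIL and HSIL (values reported above).
diagnostic_points = [(0.983, 0.451), (0.714, 0.813)]
screening_points = [(0.286, 0.877), (0.191, 0.961)]

auc_diagnostic = trapezoid_auc(diagnostic_points)
auc_screening = trapezoid_auc(screening_points)
print(round(auc_diagnostic, 3), round(auc_screening, 3))
```

With these rounded inputs, the two-cutpoint approximation lands within about 0.001 of the reported areas of 0.821 and 0.587.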
Using a disease threshold of LSIL, Figure 3A shows the curve for diagnostic participants and Figure 3B shows the curve for screening participants. The area under the curve for diagnostic participants is 0.776 (95% CI 0.74–0.81), and the area under the curve for screening participants is 0.577 (95% CI 0.54–0.62). Similar to the HSIL disease threshold above, the areas under the ROC curves using the LSIL disease threshold for the diagnostic and screening participants were also found to be statistically significantly different (P<.001).
DISCUSSION
This study confirms the value of colposcopy as a diagnostic aid (ie, after an abnormal Papanicolaou test) while calling into question its potential role as a primary screening tool. Based on a per patient analysis that used the worst colposcopic impression as the test result and the worst biopsy as the criterion standard, we found that colposcopy performance was quite acceptable in the diagnostic group. However, the discriminative ability of colposcopy was significantly inferior in the screening group. These findings held whether the disease threshold of cervical precancer was LSIL or HSIL.
Previous meta-analyses of the accuracy of colposcopy in the diagnostic setting13,14 reported results similar to those in the current study. Mitchell13 reported excellent sensitivity greater than 0.90 and specificity approximately 0.50 for colposcopy when used by experienced clinicians. However, those meta-analyses of previously published data did not consider studies in which colposcopy was used in the screening (ie, primary care) setting.
A recent clinical trial examined the accuracy of colposcopy performed in the diagnostic setting.7 The goal of that study was to determine the incremental benefit of HPV testing compared with colposcopy alone after an abnormal Papanicolaou test result. In that study, investigators found that HPV testing increased the accuracy of colposcopy in the diagnosis of cervical intraepithelial neoplasia. Researchers reported a sensitivity of 98.5% and specificity of 35.6% for colposcopy to detect HSIL in the diagnostic setting. However, the study reported only on patients with Papanicolaou test results of ASC-US or worse and did not routinely obtain biopsies from colposcopically normal sites.
Our study is unique for several reasons. First, our sample size is quite large. Second, we considered diagnostic and screening populations. Third, we took at least two biopsies from each participant, not just those who had abnormal Papanicolaou test results. This reduced the possibility of work-up bias, including its most common form, verification bias.15 Work-up bias occurs when a participant's chance of being referred to the criterion standard (biopsy) differs if the participant tests negative in the referral test (colposcopy). Work-up bias inflates sensitivity and falsely reduces specificity.16 We rectified this problem by ensuring that each participant received the criterion standard. Work-up bias may well explain the differences between our results and those of Monsonego et al.7
Previous studies that have proposed colposcopy as a tool for screening for cervical precancer have not reported on the technology's sensitivity and specificity (Marcial-Toledo S, Cortes-Guzman J, Chavez L, Guzman-Patraca C, Terrazas-Espitia S, Sanchez-Ruiz J, et al. Screen-and-treat colposcopy as public health strategy for cervical cancer early detection in high-risk population: the experience of the Centro de Estudios y Prevencion del Cancer (CEPREC) in indigenous population of Southern Mexico [abstract]. J Clin Oncol 2006;24:abstract no. 5012).17–18 The previous studies provided data only on test accuracy for cervical precancer screening methods including visual inspection with 5% acetic acid. In addition, they compared the performances of these technologies with that of Papanicolaou test screening rather than of a more accurate criterion standard, that is, colposcopically directed biopsy.
We must point out that the results presented in this article do not reflect the way that colposcopy is practiced today. In our clinical trial, biopsies were taken from both abnormal and normal sites. The normal sites were sampled, and the results were aggregated at the level of the participant. For example, even if a clinical site was determined to be normal based on the colposcopy reading, a biopsy could be performed at that site that could result in an abnormal histology. Because we aggregated the colposcopy and histology data on a per participant basis, we may have classified a woman's results as abnormal based on histology even if the woman had only colposcopically normal sites.
The construction of the data using the worst colposcopy result is similar to how clinicians would interpret the patient level of test outcome using a see-and-treat strategy. In the see-and-treat strategy, the worst colposcopy site is considered the diagnosis for that patient. If this “worst colposcopic diagnosis” is HSIL or worse, then the patient would be treated without need for biopsy. Our analysis, using the worst colposcopy result and worst histology result on a per participant basis determines the accuracy of colposcopy within the context of a see-and-treat strategy.
Our analytic approach will lead to increased sensitivity and specificity because the worst possible results for both test (colposcopy) and criterion standard (biopsy) are used. This pessimistic approach (ie, by using the worst outcome) will lead to greater agreement between test and criterion standard than if we had analyzed the data evaluating each colposcopic diagnosis and biopsy result pair separately without aggregating the data on a per patient basis.
We reason that colposcopy did not perform as well in the screening group as in the diagnostic group because the prevalence of disease was very small in the screening group, leading to a large false-positive rate. This would lead to a large number of biopsies that would most likely attempt to identify a nonexistent case of cervical precancer. The relatively older age of women with cervical cancer, together with the upward migration of the squamocolumnar junction with age and the occurrence of lesions wholly or partially in the endocervical canal, are all factors that might persuade against the use of colposcopy as a primary screening device.
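The effect of low prevalence on the yield of positive results can be illustrated with Bayes' theorem. Predictive values are not reported in this article; the sketch below is an illustration only, using the rounded sensitivity, specificity, and prevalence figures given in the Results for the HSIL disease threshold and the LSIL test threshold. The function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (positive predictive value, negative predictive value)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# HSIL disease threshold, LSIL test threshold (rounded values from the Results).
ppv_diagnostic, npv_diagnostic = predictive_values(0.983, 0.451, 0.290)
ppv_screening, npv_screening = predictive_values(0.286, 0.877, 0.022)
print(round(ppv_diagnostic, 3), round(ppv_screening, 3))
```

Under these inputs, roughly 42% of colposcopy-positive women in the diagnostic group would have HSIL or worse on biopsy, compared with about 5% in the screening group, so most directed biopsies in a screening population would be taken from women without high-grade disease.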
In this study, we demonstrated the good discriminative ability of colposcopy in the diagnostic setting and the inferior discriminative ability of colposcopy in the screening setting. In addition, we noted that using HSIL as the disease threshold of cervical precancer may make a difference in the level of accuracy of colposcopy. In the current mode of usual care for cervical precancer in developed countries, high specificity for screening and high sensitivity for diagnosis seem to achieve the optimal norm. These characteristics will need to be replicated if colposcopy is to become a useful technology for cervical precancer screening in the developing world.
REFERENCES
1. Richart RM, Sciarra JJ. Treatment of cervical dysplasia by outpatient electrocauterization. Am J Obstet Gynecol 1968;101:200–5.
2. Benedet JL, Anderson GH, Boyes DA. Colposcopic accuracy in the diagnosis of microinvasive and occult invasive carcinoma of the cervix. Obstet Gynecol 1985;65:557–62.
3. Kierkegaard O, Byrjalsen C, Frandsen KH, Hansen KC, Frydenberg M. Diagnostic accuracy of cytology and colposcopy in cervical squamous intraepithelial lesions. Acta Obstet Gynecol Scand 1994;73:648–51.
4. Ang MS, Kaufman RH, Adam E, Riddle G, Irwin JF, Reeves KO, et al. Colposcopically directed biopsy and loop excision of the transformation zone. Comparison of histologic findings. J Reprod Med 1995;40:167–70.
5. ASCUS-LSIL Triage Study (ALTS) Group. Results of a randomized trial on the management of cytology interpretations of atypical squamous cells of undetermined significance. Am J Obstet Gynecol 2003;188:1383–92.
6. Benedet JL, Matisic JP, Bertrand MA. The quality of community colposcopic practice. Obstet Gynecol 2004;103:92–100.
7. Monsonego J, Pintos J, Semaille C, Beumont M, Dachez R, Zerat L, et al. Human papillomavirus testing improves the accuracy of colposcopy in detection of cervical intraepithelial neoplasia. Int J Gynecol Cancer 2006;16:591–8.
8. Sellors J, Sankaranarayanan R. Colposcopy and treatment of cervical intraepithelial neoplasia: a beginners manual. Lyon, France: IARC Press; 2003.
9. Walker P, Dexeus S, De Palo G, Barrasso R, Campion M, Girardi F, et al. International terminology of colposcopy: an updated report from the International Federation for Cervical Pathology and Colposcopy. Obstet Gynecol 2003;101:175–7.
10. Malpica A, Matisic JP, Niekirk DV, Crum CP, Staerkel GA, Yamal JM, et al. Kappa statistics to measure interrater and intrarater agreement for 1790 cervical biopsy specimens among twelve pathologists: qualitative histopathologic analysis and methodologic issues. Gynecol Oncol 2005;99:S38–52.
11. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. 2nd ed. Boston (MA): Little, Brown and Company; 1991.
12. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44:837–45.
13. Mitchell MF. Accuracy of colposcopy. Consult Obstet Gynecol 1994;6:70–3.
14. Mitchell MF, Cantor SB, Brookner C, Utzinger U, Schottenfeld D, Richards-Kortum R. Screening for squamous intraepithelial lesions with fluorescence spectroscopy. Obstet Gynecol 1999;94:889–96.
15. Zhou X-H, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York (NY): Wiley-Interscience; 2002.
16. Visual inspection with acetic acid for cervical-cancer screening: test qualities in a primary-care setting. University of Zimbabwe/JHPIEGO Cervical Cancer Project. Lancet 1999;353:869–73.
17. Belinson J, Qiao YL, Pretorius R, Zhang WH, Elson P, Li L, et al. Shanxi Province Cervical Cancer Screening Study: a cross-sectional comparative trial of multiple techniques to detect cervical neoplasia [published erratum appears in Gynecol Oncol 2002;84:355]. Gynecol Oncol 2001;83:439–44.
18. Syrjanen K, Naud P, Derchain S, Roteli-Martins C, Longatto-Filho A, Tatti S, et al. Comparing PAP smear cytology, aided visual inspection, screening colposcopy, cervicography and HPV testing as optional screening tools in Latin America. Study design and baseline data of the LAMS study. Anticancer Res 2005;25:3469–80.
© 2008 by The American College of Obstetricians and Gynecologists. Published by Wolters Kluwer Health, Inc. All rights reserved.