Reproducibility of Endocervical Curettage Diagnoses

Zahn, Christopher M. MD; Rao, Luigi K. F. MD; Olsen, Cara DrPH; Whitworth, Scott A. MD; Washington, Antoine MD; Crothers, Barbara A. DO

doi: 10.1097/AOG.0b013e318223552d
Original Research
Annual Awards

OBJECTIVE: To estimate overall interobserver variability of histopathology diagnoses on endocervical curettage (ECC) specimens.

METHODS: Five study pathologists, blinded to the original diagnosis, reviewed archived ECC specimens initially interpreted as normal, low-grade dysplasia, and high-grade dysplasia. We assessed overall interobserver agreement and agreement between individual pathologists using the κ statistic and analyzed the effect of reducing diagnostic choices to two categories (one method using “normal” and “dysplasia” and another method using “normal or low-grade” and “high-grade or worse”).

RESULTS: A total of 90 specimens were reviewed. The overall observer agreement was moderate (κ=0.52). For specific diagnoses, cases interpreted as normal or high-grade dysplasia demonstrated greater agreement than those interpreted as low-grade dysplasia. Individual pathologists' comparison κ values ranged from 0.31 to 0.80. Changing diagnostic options to a two-tiered system resulted in significant improvement in κ values for only 1 of 36 pathologist comparisons. Using the gynecologic pathologist consensus interpretation, study pathologists downgraded 44% of cases originally interpreted as high-grade.

CONCLUSION: Interobserver agreement in the interpretation of ECC specimens is at best moderate, even between those with additional experience and training in gynecologic pathology. Furthermore, reducing diagnostic options to two categories did not improve agreement. It is concerning that important clinical decisions may be made based on an ECC diagnosis that is moderately or poorly reproducible.


Interobserver agreement in the interpretation of endocervical curettage specimens is at best moderate, even among those with additional training and experience in gynecologic pathology.

From the Department of Pathology, Walter Reed Army Medical Center, Washington, DC; the Department of Pathology, National Naval Medical Center, Bethesda, Maryland; and the Departments of Preventive Medicine and Biometrics and Obstetrics and Gynecology, Uniformed Services University of the Health Sciences, Bethesda, Maryland.

Corresponding author: Christopher M. Zahn, MD, Department of Obstetrics and Gynecology, USUHS, 4301 Jones Bridge Road, Bethesda, MD 20814-4799; e-mail:

Financial Disclosure The authors did not report any potential conflicts of interest.

Practice guidelines for the management of women with cytologic and histologic abnormalities of the cervix include performance of an endocervical curettage (ECC) in specific clinical settings.1,2 The ECC diagnosis may affect subsequent management, including the potential for a diagnostic excisional procedure such as loop electrosurgical excision or cold-knife cone biopsy. These procedures are not without risk; both loop electrosurgical excision and cold-knife cone have been associated with potentially significant obstetric complications.3–14

One of the concerns relative to cervical cytologic and histologic interpretation in the evaluation of women with abnormal cytologic screening is diagnostic reproducibility. Although reproducibility for high-grade squamous lesions is generally good, interpretations of low-grade lesions (encompassing human papillomavirus-related changes and cervical intraepithelial neoplasia 1) in cervical biopsies are moderately to poorly reproducible.15–23 Interobserver variation also depends on factors such as study setting or pathologist training and experience.15,16,24 However, reproducibility data regarding ECC diagnoses are lacking. We performed an extensive literature search using various combinations of terms, including “endocervical curettage,” “endocervix curettage,” “reliability,” “reproducibility,” “interobserver,” “intraobserver,” and “agreement,” that revealed no studies specifically addressing reproducibility of an ECC interpretation. Thus, there is a concern that performance of a potentially morbid excisional procedure may be based on a poorly reproducible diagnosis.

Our primary objective for this study was to estimate overall interobserver variability of tissue diagnoses on ECC specimens. Additionally, we sought to estimate whether other factors, such as the severity of histologic abnormality or the experience or area of interest of reviewing pathologists, affected reproducibility of ECC interpretations.


This study was approved by the National Naval Medical Center Responsible Conduct for Research Directorate (Institutional Review Board). Archived hematoxylin and eosin-stained tissue slides of previously obtained ECC specimens, maintained in the Departments of Pathology at National Naval Medical Center and Walter Reed Army Medical Center, formed the basis for this study. We designed a data report that searched our laboratory information system for all ECC specimens from 2005 to 2008 with the following diagnostic results: normal, low-grade dysplasia (consisting of human papillomavirus-related changes or mild dysplasia), high-grade dysplasia (consisting of moderate or severe dysplasia or squamous carcinoma in situ), and glandular “neoplasia” (consisting of glandular atypia, endocervical glandular hyperplasia, or adenocarcinoma in situ). The search did not involve any patient identifiers or demographic data. Clinically active staff pathologists in military medical centers with a wide range of experience and areas of interest provided the original diagnoses. An administrator in the Department of Pathology selected case slides from archival files representing each of these categories, with instructions to obtain approximately 30 cases from each category. This individual covered each original slide label and then relabeled the slides with a study number, such that all research pathologists were blinded to the original interpretation. Each research pathologist reviewed all of the slides and was instructed to categorize his or her interpretation according to the same classification system used to initially identify slides for review (normal, low-grade, high-grade, or glandular atypia or neoplasia). If the research pathologist was unable to render a diagnostic interpretation because of an inadequate specimen, then the specimen was considered unsatisfactory for diagnosis.

Five research pathologists participated in the study, including one postgraduate year 4 pathology resident, one generalist board-certified staff pathologist at a major military medical center with approximately 2 years of experience, and three pathologists with focused training or special interest in gynecologic pathology. Of the three pathologists with interest in gynecologic pathology, one was a board-certified staff faculty pathologist with subspecialty training in cytopathology and 22 years of experience, one was a board-certified staff faculty pathologist with subspecialty training in cytopathology and 12 years of experience, and one was dually board-certified in obstetrics and gynecology and anatomic pathology with fellowship training in gynecologic pathology and 17 years of experience. We selected the pathologists for the study (including the original pathologist's diagnosis) to represent the range of experience and expertise in actual pathology practice, thereby providing an appropriate group to investigate interobserver variability.

We calculated κ statistics to ascertain interobserver agreement rates according to the previously described diagnostic categories. We assessed overall interobserver agreement as well as interobserver agreement comparing different pathologists. In addition to individual pathologist interpretations, we evaluated a “gynecologic pathologist” consensus diagnosis, which was based on cases for which at least two of the three gynecologic pathologists rendered the same diagnostic interpretation. We then analyzed interobserver agreement by comparing the originally rendered diagnosis and the interpretations of the pathology resident and generalist pathologist with the gynecologic pathologist consensus interpretation.
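The consensus rule described above (at least two of the three gynecologic pathologists agreeing) can be illustrated with a minimal Python sketch. The function name `consensus` and the example readings are hypothetical and are not taken from the study data.

```python
from collections import Counter

def consensus(readings, min_agree=2):
    """Return the label shared by at least `min_agree` readers, else None.

    `readings` holds one interpretation per gynecologic pathologist.
    None marks a case with no consensus (all readers differ).
    """
    label, count = Counter(readings).most_common(1)[0]
    return label if count >= min_agree else None

# Hypothetical readings for two specimens (illustrative only).
print(consensus(["normal", "low-grade", "normal"]))      # normal
print(consensus(["normal", "low-grade", "high-grade"]))  # None
```

Under this rule, a specimen for which all three readers differ has no consensus value and would drop out of any consensus comparison.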

We recognized that simplifying a classification system may improve interobserver agreement. We therefore used two different methods of categorization of diagnostic interpretations into dichotomous options and recalculated κ values for the same individual pathologist comparisons as well as comparisons with the gynecologic pathologist consensus as previously described. One method of analysis compared specimens categorized as either normal or all dysplasia, in which all dysplasia included low-grade and high-grade interpretations. The other method of analysis grouped all normal and low-grade interpretations into one category, and high-grade or worse interpretations into another category. Additionally, using these categories of grouped interpretations, we sought to estimate if one pathologist was more likely to render one of the diagnostic interpretations when compared with the other. Furthermore, we compared the change in κ statistics to estimate if combining diagnostic interpretations into categories resulted in improvement in agreement for individual pathologists when comparing them to one another.
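As an illustration of the two dichotomous groupings, the sketch below collapses three-category readings under each scheme and computes an unweighted two-rater (Cohen) κ. The mapping names, the helper functions, and the six toy readings are assumptions made for this example; they are not the study's data or software.

```python
from collections import Counter

# Hypothetical mappings for the two collapsing methods described above.
NORMAL_VS_DYSPLASIA = {
    "normal": "normal", "low-grade": "dysplasia", "high-grade": "dysplasia",
}
LOW_VS_HIGH = {
    "normal": "normal/low-grade", "low-grade": "normal/low-grade",
    "high-grade": "high-grade or worse",
}

def collapse(readings, scheme):
    """Map each three-category reading onto a two-category scheme."""
    return [scheme[r] for r in readings]

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters over the same cases.

    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy readings from two hypothetical raters (not study data).
rater_a = ["normal", "normal", "low-grade", "low-grade", "high-grade", "high-grade"]
rater_b = ["normal", "low-grade", "low-grade", "high-grade", "high-grade", "high-grade"]

print(f"{cohen_kappa(rater_a, rater_b):.2f}")  # 0.50
for scheme in (NORMAL_VS_DYSPLASIA, LOW_VS_HIGH):
    k = cohen_kappa(collapse(rater_a, scheme), collapse(rater_b, scheme))
    print(f"{k:.2f}")  # 0.57, then 0.67
```

In this toy example both groupings raise κ, but the size of the change depends on where the disagreements fall relative to the chosen cut-off.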

We calculated both weighted and unweighted κ statistics with 95% confidence intervals using the linear agreement method to describe the overall and individual research pathologist comparisons for interobserver agreement.25 The weighted κ statistic was included because it is more sensitive to discrepancies greater than one classification.25,26 A κ of more than 0.75 represents excellent agreement, a κ between 0.40 and 0.75 represents moderate agreement, and a κ less than 0.40 represents poor agreement. We used the McNemar test to determine if one pathologist was more likely to render a particular diagnosis compared with the others. To assess the effect of combining categories into dichotomous options on interrater agreement, the initially calculated κ statistics were compared with κ statistics based on a dichotomous classification using the bootstrap procedure.27 Significance was set at P=.05. This analysis was based on squamous interpretations only because of the expected low frequency of glandular atypia or neoplasia on ECC specimens. For cases interpreted as glandular atypia or neoplasia by the research pathologists, only descriptive statistics are reported. Statistical analysis was performed using SPSS for Windows 16.0.1; for the bootstrap procedure, SAS was used.
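A minimal Python sketch of a linearly weighted κ and of the agreement bands stated above follows. It assumes that the linear agreement method corresponds to weights of 1 − |i − j|/(k − 1) across the ordered categories, and it treats the boundary values 0.40 and 0.75 as moderate; both are assumptions of this example. The function names and toy readings are hypothetical, and the study itself used SPSS and SAS.

```python
ORDER = ["normal", "low-grade", "high-grade"]  # ordered diagnostic categories

def linear_weighted_kappa(a, b, order=ORDER):
    """Two-rater kappa with linear agreement weights 1 - |i - j| / (k - 1)."""
    k, n = len(order), len(a)
    index = {label: i for i, label in enumerate(order)}

    def weight(i, j):
        return 1 - abs(i - j) / (k - 1)

    # Marginal proportions for each rater.
    p_a = [sum(1 for x in a if index[x] == i) / n for i in range(k)]
    p_b = [sum(1 for y in b if index[y] == i) / n for i in range(k)]
    observed = sum(weight(index[x], index[y]) for x, y in zip(a, b)) / n
    expected = sum(weight(i, j) * p_a[i] * p_b[j]
                   for i in range(k) for j in range(k))
    return (observed - expected) / (1 - expected)

def agreement_band(kappa):
    """Classify kappa using the thresholds given in the text."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:  # assumption: boundary values count as moderate
        return "moderate"
    return "poor"

# Toy readings from two hypothetical raters (not study data).
rater_a = ["normal", "normal", "low-grade", "low-grade", "high-grade", "high-grade"]
rater_b = ["normal", "low-grade", "low-grade", "high-grade", "high-grade", "high-grade"]

kw = linear_weighted_kappa(rater_a, rater_b)
print(f"{kw:.3f}", agreement_band(kw))  # 0.625 moderate
```

Because both disagreements in the toy data are only one category apart, they receive partial credit, and the weighted κ (0.625) exceeds the unweighted κ (0.50) for the same readings.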

We calculated the sample size corresponding to a two-category, two-rater kappa statistic and adjusted the total sample size to allow for a third diagnostic category. Assuming equal numbers in each diagnostic category, a large sample test (with a 5% two-sided significance level) of the null hypothesis that intraclass κ is 0 will have an 80% power to detect an alternative κ of 0.40 when the number of slides to be evaluated is 50 (25 in each category). Our target of 30 slides in each of the three categories, for a total sample size of 90, therefore is sufficient to detect values of κ in the moderate range and higher.


A total of 90 cases were included in the study. Although we designed the study to select cases evenly from the three categories as described, there were proportionally more high-grade cases identified for which slides were available. Our database search did not identify any cases originally interpreted as glandular atypia or neoplasia or inadequate for interpretation; however, research pathologists were still allowed to assign these interpretations.

The overall frequencies of the diagnostic interpretations, including the gynecologic pathologist consensus diagnoses, are provided in Table 1. There were 11 cases for which at least one of the gynecologic pathologists interpreted the specimen as either glandular neoplasia (including adenocarcinoma in situ) or inadequate for interpretation. For the individual gynecologic pathologists, specimens considered inadequate or unsatisfactory ranged from one case to six cases, and those considered glandular atypia or neoplasia ranged from two to four cases. Only two cases were interpreted as glandular atypia or neoplasia, and one was considered inadequate by gynecologic pathologist consensus. All three gynecologic pathologists differed in their interpretation of three specimens; these cases were excluded from the consensus diagnosis comparisons.

The κ values for the individual diagnostic categories as well as the overall κ values for all pathologists are depicted in Table 2. As shown, the overall κ values are 0.48 (glandular neoplasia and inadequate cases included) and 0.52 (glandular neoplasia and inadequate cases excluded), consistent with moderate agreement. There was greater agreement between individual pathologists for specimens considered normal or high-grade dysplasia; however, agreement for low-grade interpretations was less. Agreement was poor for both glandular neoplasia and inadequate cases. Because the differences were minimal between the κ values in which glandular neoplasia and inadequate cases were included or excluded, and because the agreement was poor when these cases were considered, glandular neoplasia and inadequate cases were excluded for subsequent analyses regarding pathologist agreement.

Table 3 demonstrates individual pathologist comparisons for agreement on the ECC specimens for all diagnostic categories; normal, low-grade dysplasia, and high-grade dysplasia were considered separately for these analyses. Specimens with an interpretation of glandular atypia or neoplasia and inadequate were excluded. The κ values range from poor (0.31 unweighted, 0.36 weighted) to excellent (0.80 weighted), with most κ values in the moderate range.

Table 4 stratifies the data by grouping pathologists' ECC interpretations into dichotomous categories. One column (normal compared with dysplasia) combines low-grade and high-grade dysplasia into one category. We used this comparison to ascertain interobserver agreement for normal compared with abnormal specimens. The second column (low-grade compared with high-grade) combines normal and low-grade dysplasia into one category and compares these values with high-grade or worse interpretations. We used this comparison to determine interobserver agreement for low-grade and insignificant findings compared with high-grade lesions. As with the previous analysis, glandular atypia or neoplasia and cases considered inadequate were excluded. The overall κ for all raters for normal and dysplasia was 0.58 (95% confidence interval 0.52–0.63). In the individual pathologist comparisons for the normal and dysplasia analysis (Table 4), agreement ranged from poor (κ=0.29) to excellent (κ=0.83), with most κ values in the moderate range. The overall κ for all raters for the normal and low-grade and high-grade or worse analysis was 0.59 (95% confidence interval 0.53–0.64). In the individual pathologist comparisons for the normal and low-grade and high-grade or worse analysis (Table 4), agreement was mostly moderate (κ values ranging from 0.45 to 0.83), although agreement for two of the gynecologic pathologists would be considered excellent.

For the same dichotomized categories described for Table 4, Table 5 shows the comparisons between pathologists regarding the probability that one pathologist would be more likely than the other in the pairing to render a particular diagnosis, namely an interpretation of normal or an interpretation of high-grade. For the same pathologist, the percentages of specimens interpreted as normal or high-grade may differ slightly from one comparison to the next because the percentages were based on the individual pathologist pair being considered, and the total number of cases interpreted by both pathologists may have varied between pairings. In most instances, for normal interpretations, one pathologist of the pair was more likely to interpret a specimen as normal. Similarly, for an interpretation of high-grade, one pathologist in several of the comparisons was more likely to interpret a specimen as high-grade compared with the other, particularly when comparing the research pathologists to the pathologist rendering the initial diagnosis. For these cases, the research pathologists were less likely to interpret a specimen as high-grade compared with the original pathologist. There was no significant difference identified between the gynecologic pathologists when considering high-grade interpretations.
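The paired comparison described above can be sketched in Python as a McNemar test on the discordant cases for one target interpretation. The text does not state whether an exact or a chi-square form was used; the exact binomial form here is an assumption of this example, and the function name and toy readings are hypothetical.

```python
from math import comb

def mcnemar_exact(a, b, target):
    """Exact two-sided McNemar test for how often two raters call `target`.

    Only discordant cases count: those where exactly one rater
    rendered the target interpretation.
    """
    only_a = sum(1 for x, y in zip(a, b) if x == target and y != target)
    only_b = sum(1 for x, y in zip(a, b) if x != target and y == target)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0
    # Binomial tail under the null that either discordance is equally likely.
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)

# Toy data (not study data): rater A calls high-grade in 8 cases that
# rater B calls low-grade; the raters agree on the remaining 2 cases.
rater_a = ["high-grade"] * 8 + ["normal"] * 2
rater_b = ["low-grade"] * 8 + ["normal"] * 2

print(*mcnemar_exact(rater_a, rater_b, "high-grade"))  # 8 0 0.0078125
```

A small P value here indicates only that one rater of the pair renders the target interpretation more often than the other; it does not indicate which rater is correct.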

To estimate the effect of reducing the number of diagnostic categories available, we further analyzed the data by comparing the κ values for three diagnostic options considered separately (the weighted κ values in Table 3) with the κ values when only dichotomous choices were available (the κ values depicted in Table 4). In this analysis (individual data not shown), reducing three diagnostic options to only two choices resulted in significant improvement in κ values for only 1 of the 36 possible pathologist comparisons (initial diagnosis compared with resident research pathologist comparison for the normal and low-grade and high-grade or worse options).
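One way to compare a three-category κ with a dichotomous κ for the same pair of raters is a percentile bootstrap over cases, sketched below in Python. This is an illustration under stated assumptions, not the SAS procedure used in the study: it uses unweighted κ for both classifications for brevity (the study compared the weighted three-category values), and the function names, the mapping, the number of resamples, and the toy readings are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical mapping for the normal/low-grade vs high-grade grouping.
LOW_VS_HIGH = {"normal": "low", "low-grade": "low", "high-grade": "high"}

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa; None when chance agreement equals 1."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1:
        return None
    return (observed - expected) / (1 - expected)

def bootstrap_kappa_change(a, b, scheme, n_boot=2000, seed=0):
    """95% percentile interval for (dichotomous kappa - three-category kappa)."""
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        # Resample cases with replacement, keeping each rater pair intact.
        idx = [rng.randrange(n) for _ in range(n)]
        ra, rb = [a[i] for i in idx], [b[i] for i in idx]
        k3 = cohen_kappa(ra, rb)
        k2 = cohen_kappa([scheme[x] for x in ra], [scheme[y] for y in rb])
        if k3 is not None and k2 is not None:  # skip degenerate resamples
            diffs.append(k2 - k3)
    diffs.sort()
    lower = diffs[int(0.025 * len(diffs))]
    upper = diffs[int(0.975 * len(diffs)) - 1]
    return lower, upper

# Toy readings for 30 cases from two hypothetical raters (not study data).
rater_a = ["normal"] * 10 + ["low-grade"] * 10 + ["high-grade"] * 10
rater_b = (["normal"] * 8 + ["low-grade"] * 2 + ["low-grade"] * 6
           + ["high-grade"] * 4 + ["high-grade"] * 9 + ["normal"] * 1)

print(bootstrap_kappa_change(rater_a, rater_b, LOW_VS_HIGH))
```

Under this sketch, an interval that excludes 0 would suggest that collapsing the categories changed agreement for that pair of raters; an interval that includes 0 would not.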

We also investigated the number of cases that were either upgraded or downgraded when comparing the gynecologic pathologist consensus diagnoses to the original diagnoses. Only one case (1.8%) was upgraded from an initial normal interpretation to low-grade dysplasia. Of 84 cases initially diagnosed as high-grade dysplasia, 37 (44%) were downgraded according to gynecologic pathologist consensus interpretation. Of the downgraded cases, 16 low-grade interpretations were considered normal, 10 cases were downgraded from high-grade dysplasia to normal, and 11 cases originally diagnosed as high-grade were downgraded to low-grade dysplasia.


We found that interobserver agreement in the overall interpretation of ECC specimens is mostly moderate; however, interobserver agreement when considering the interpretation of low-grade dysplasia was poor. The weakest interpretative agreement occurred between the initial diagnosis (rendered by the original pathologist) and the interpretation by the research pathologists, particularly the gynecologic pathologists. Agreement between the research gynecologic pathologists for specific interpretations (normal, low-grade, and high-grade dysplasia) was also mostly moderate, with κ values ranging from 0.60 to 0.75. We initially expected greater and more consistent agreement, considering the additional training and subspecialty experience of these individuals. However, other studies of cervical biopsy interpretation using experienced study pathologists (including gynecologic pathologists) demonstrated κ values ranging from 0.54 to 0.87.16,28–30 Therefore, agreement among experienced gynecologic pathologists in the current study of ECC specimens mirrors agreement among research pathologists in studies of cervical biopsy agreement. Reducing the number of diagnostic options available will generally improve agreement between observers; this has been documented in studies of agreement involving cervical biopsy interpretation and endometrial lesions.30–34 In our study in which two different binary systems were considered, we found significant improvement for only 1 of 36 possible comparisons. Thus, in our study of ECC specimens, regardless of criteria used as a “cut-off,” reducing diagnostic options did not significantly improve interobserver agreement. Nonetheless, the concept that a binary or some other system of interpretation may improve diagnostic reproducibility for ECC specimens deserves further study, particularly because of the potentially significant clinical implications of an ECC diagnosis.

Our investigation of the likelihood that a pathologist will render a particular interpretation when compared with another pathologist yielded interesting results. For comparisons involving the original diagnostic interpretation and those of the pathology resident, these pathologists were more likely to diagnose high-grade abnormalities on the ECC specimens, particularly when compared with the gynecologic pathology consensus interpretations. Although the gynecologic pathologists were generally more conservative when interpreting a specimen as high-grade, there were discordant results between the gynecologic pathologists. These differences again demonstrate that even among pathologists with additional training or experience, there are inconsistencies in the interpretation of ECC specimens.

Interpretive variation becomes meaningful only if it has clinical relevance. Although the utility of ECC is debated, guidelines exist that describe indications for performance of an ECC and the recommended follow-up of an abnormal result, which may include performance of an excisional procedure. It would be beneficial to have a highly reproducible test (in this case, ECC) on which to base clinical recommendations, particularly if these recommendations could result in a potentially morbid procedure, additional costly follow-up, and increased patient anxiety. However, we have demonstrated that agreement between several types of observers is at best moderate, even for those pathologists with greater experience and training. Because a high-grade ECC diagnosis would be potentially more likely to contribute to an indication for the performance of an excisional procedure, it is concerning that there may be significant variability in the interpretation of specimens as high-grade. Highlighting this concern is the consideration of those cases in which the original diagnosis may have been upgraded or downgraded. Although only one case would have been upgraded, nearly one-half of the cases would have been downgraded, including 21 cases (25%) that would have been downgraded from an original interpretation of high-grade. Although we did not ascertain the clinical details in these particular cases, including colposcopic impression or other biopsy findings that may have been accepted indications for an excisional procedure, it is certainly possible that the high-grade diagnosis rendered for the ECC would have been an important determinant in the consideration for performance of an excisional procedure. Therefore, it is theoretically possible that an excisional procedure may have been avoided in a number of these cases.

We had too few cases of glandular atypia or neoplasia or biopsy samples considered inadequate for separate analysis. The reproducibility of interpretations of specimens considered inadequate or demonstrating glandular abnormalities also needs further investigation.

A potential weakness of this study is the lack of clinical follow-up of the women from whom these specimens were obtained. We therefore cannot address the accuracy of the interpretations of these ECC specimens; it is not possible to determine which of the pathologists in the various comparisons was “correct” in his or her interpretation. Our objective was to analyze reproducibility and interobserver variation, not accuracy. There are numerous studies addressing follow-up findings and the benefits, or lack thereof, of ECC, leading to the controversy surrounding the utility of ECC. Our intent was to analyze potential variability in diagnostic interpretation, and not to determine if performance of an ECC was beneficial in management of women with abnormal cervical cytology or histology.

It is also important to note that the greatest discrepancy in interpretations, involving the comparisons between the original diagnosis and all of the study pathologists, may be partly because of lack of clinical information available to the study pathologists. It is possible that additional information available to the pathologist rendering the original interpretation, such as colposcopic impression, clinical history, and interpretation of other biopsy specimens or cervical cytology performed at or near the time of the ECC, may have influenced his or her diagnosis. To control for this variable, it would be reasonable to perform an interobserver reproducibility study with clinical information equally available to all study participants. Furthermore, the study pathologists were not necessarily burdened by the clinical effect of their interpretive decisions, which could theoretically lead to either false-negative interpretations to avoid a potentially morbid intervention or false-positive interpretations to avoid missing a potentially significant lesion, particularly in the current medical–legal climate. Nonetheless, comparisons between research pathologists, none of whom had clinical information, still revealed interpretive variation. Therefore, although clinical information may influence one's diagnostic impression, it may not be a major factor in discordant opinion in most cases of ECC interpretation.

To our knowledge, this is the first investigation of interobserver variation in ECC specimens demonstrating discordant diagnostic interpretations even among pathologists with significant training and experience in gynecologic pathology. Because potentially important clinical decisions may be made depending on the result of the ECC interpretation, it is concerning that these decisions may be made based on a diagnosis that is moderately or poorly reproducible. Further study of alternative diagnostic classification schemes or additional diagnostic aids to improve interobserver agreement in the interpretation of ECC specimens is warranted.


1. Wright TC Jr, Massad LS, Dunton CJ, Spitzer M, Wilkinson EJ, Solomon D, et al. 2006 consensus guidelines for the management of women with abnormal cervical cancer screening tests. Am J Obstet Gynecol 2007;197:346–55.
2. Wright TC Jr, Massad LS, Dunton CJ, Spitzer M, Wilkinson EJ, Solomon D, et al. 2006 consensus guidelines for the management of women with cervical intraepithelial neoplasia or adenocarcinoma in situ. Am J Obstet Gynecol 2007;197:340–5.
3. Crane JM, Delaney T, Hutchens D. Transvaginal ultrasonography in the prediction of preterm birth after treatment for cervical intraepithelial neoplasia. Obstet Gynecol 2006;107:37–44.
4. Jones JM, Sweetnam P, Hibbard BM. The outcome of pregnancy after cone biopsy of the cervix: a case-control study. Br J Obstet Gynaecol 1979;86:913–6.
5. Ludvikkson K, Sandstrom B. Outcome of pregnancy after cone biopsy–a case-control study. Eur J Obstet Gynecol Reprod Biol 1982;14:135–42.
6. Kristensen J, Langhoff-Roos J, Kristensen FB. Increased risk of preterm birth in women with cervical conization. Obstet Gynecol 1993;81:1005–8.
7. Lee NH. The effect of cone biopsy on subsequent pregnancy outcome. Gynecol Oncol 1978;6:1–6.
8. Leiman G, Harrison NA, Rubin A. Pregnancy following conization of the cervix: complications related to cone size. Am J Obstet Gynecol 1980;136:14–8.
9. Raio L, Ghezzi F, Di Naro E, Gomez R, Luscher KP. Duration of pregnancy after carbon dioxide laser conization of the cervix: influence of cone height. Obstet Gynecol 1997;90:978–82.
10. Luesley DM, McCrum A, Terry PB, Wade-Evans T, Nicholson HO, Mylotte MJ, et al. Complications of cone biopsy related to the dimensions of the cone and the influence of prior colposcopic assessment. Br J Obstet Gynaecol 1985;92:158–64.
11. Samson SL, Bentley JR, Fahey TJ, McKay DJ, Gill GH. The effect of loop electrosurgical excision procedure on future pregnancy outcome. Obstet Gynecol 2005;105:325–32.
12. Kyrgiou M, Koliopoulos G, Martin-Hirsch P, Arbyn M, Prendiville W, Paraskevaidis E. Obstetric outcomes after conservative treatment for intraepithelial or early invasive cervical lesions: systematic review and meta-analysis. Lancet 2006;367:489–98.
13. Sadler L, Saftlas A, Wang W, Exeter M, Whittaker J, McCowan L. Treatment for cervical intraepithelial neoplasia and risk of preterm delivery. JAMA 2004;291:2100–6.
14. Crane JM. Pregnancy outcome after loop electrosurgical excision procedure: a systematic review. Obstet Gynecol 2003;102:1058–62.
15. Stoler MH, Schiffman M, Atypical Squamous Cells of Undetermined Significance-Low-Grade Squamous Intraepithelial Lesion Triage Study (ALTS) Group. Interobserver reproducibility of cervical cytologic and histologic interpretations. Realistic estimates from the ASCUS-LSIL Triage Study. JAMA 2001;285:1500–5.
16. Parker MF, Zahn CM, Vogel KM, Olsen CH, Miyazawa K, O'Connor DM. Discrepancy in the interpretation of cervical histology by gynecologic pathologists. Obstet Gynecol 2002;100:277–80.
17. Ismail SM, Colclough AB, Dinnen JS, Eakins D, Evans DM, Gradwell E, et al. Reporting cervical intra-epithelial neoplasia (CIN): intra- and interpathologist variation and factors associated with disagreement. Histopathology 1990;16:371–6.
18. Robertson AJ, Anderson JM, Beck JS, Burnett RA, Howatson SR, Lee FD, et al. Observer variability in histopathological reporting of cervical biopsy specimens. J Clin Pathol 1989;42:231–8.
19. McCluggage WG, Walsh MY, Thornton CM, Hamilton PW, Date A, Caughley LM, et al. Inter- and intra-observer variation in the histopathological reporting of cervical squamous intraepithelial lesions using a modified Bethesda grading system. Br J Obstet Gynaecol 1998;105:206–10.
20. Kato I, Santamaria M, De Ruiz PA, Aristizabal N, Bosch FX, De San Jose S, et al. Inter-observer variation in cytological and histological diagnoses of cervical neoplasia and its epidemiologic implication. J Clin Epidemiol 1995;48:1167–74.
21. Creagh T, Bridger JE, Kupek E, Fish DE, Martin-Bates E, Wilkins MJ. Pathologist variation in reporting cervical borderline epithelial abnormalities and cervical intraepithelial neoplasia. J Clin Pathol 1995;48:59–60.
22. Genest DR, Stein L, Cibas E, Sheets E, Zitz JC, Crum CP. A binary (Bethesda) system for classifying cervical cancer precursors: criteria, reproducibility, and viral correlates. Hum Pathol 1993;24:730–6.
23. de Vet HC, Knipschild PG, Schouten HJ, Koudstaal J, Kwee WS, Willebrand D, et al. Sources of interobserver variation in histopathological grading of cervical dysplasia. J Clin Epidemiol 1992;45:785–90.
24. O'Sullivan JP, Ismail SM, Barnes WS, Deery AR, Gradwell E, Harvey JA, et al. Interobserver variation in the diagnosis and grading of dyskaryosis in cervical smears: Specialist cytopathologists compared to non-specialists. J Clin Pathol 1994;47:515–8.
25. Cicchetti DV, Allison T. A new procedure for assessing reliability of scoring EEG sleep recordings. Am J EEG Technol 1971;11:101–9.
26. Fleiss JL. Statistical methods for rates and proportions. Third Edition. New York (NY): Wiley and Sons; 2003.
27. Efron B, Tibshirani RJ. An introduction to the bootstrap. Monographs on statistics and applied probability number 57. London (UK): Chapman and Hall/CRC; 1993.
28. Cai B, Ronnett BM, Stoler M, Ferenczy A, Kurman RJ, Sadow D, et al. Longitudinal evaluation of interobserver and intraobserver agreement of cervical intraepithelial neoplasia diagnosis among an experienced panel of gynecologic pathologists. Am J Surg Pathol 2007;31:1854–60.
29. Malpica A, Matistic JP, Niekirk DV, Crum CP, Staerkel GA, Yamal JM, et al. Kappa statistics to measure interrater and intrarater agreement for 1790 cervical biopsy specimens among twelve pathologists: qualitative histopathologic analysis and methodologic issues. Gynecol Oncol 2005;99:S38–52.
30. Dalla Palma P, Giorgi Rossi P, Collina G, Buccoliero AM, Ghiringhello B, Gilioli E, et al. The reproducibility of CIN diagnoses among different pathologists: data from histology reviews from a multicenter randomized study. Am J Clin Pathol 2009;132:125–32.
31. Sherman ME, Ronnett BM, Ioffe OB, Richesson DA, Rush BB, Glass AG, et al. Reproducibility of biopsy diagnoses of endometrial hyperplasia: evidence supporting a simplified classification. Int J Gynecol Pathol 2008;27:318–25.
32. Sagae S, Saito T, Satoh M, Ikeda T, Kimura S, Mori M, et al. The reproducibility of a binary tumor grading system for uterine endometrial endometrioid carcinoma, compared with FIGO system and nuclear grading. Oncology 2004;67:344–50.
33. Klaes R, Benner A, Friedrich T, Ridder R, Herrington S, Jenkins D, et al. p16INK4a immunohistochemistry improves interobserver agreement in the diagnosis of cervical intraepithelial neoplasia. Am J Surg Pathol 2002;26:1389–99.
34. Gurrola-Díaz CM, Suárez-Rincón ÁE, Vázquez-Camacho G, Buonocunto-Vázquez G, Rosales-Quintana S, Wentzensen N, et al. P16INK4a immunohistochemistry improves the reproducibility of the histological diagnosis of cervical intraepithelial neoplasia in cone biopsies. Gynecol Oncol 2008;111:120–4.
© 2011 by The American College of Obstetricians and Gynecologists.