We further compared the change in review time between assistance modalities for each individual image across pathologists. Two individual images with macrometastases showed large time differences between assistance modes: one image was reviewed faster with assistance and one image was slower with assistance. The single macrometastasis that was reviewed noticeably faster with assistance exhibited tumor foci with bland morphology and is the same image with false-negative interpretations discussed in the Classification accuracy section (Fig. 4B). The single macrometastasis that consistently took longer on average to review with assistance was notable for involving tumor with significant fibrotic change, interpreted by several pathologists as possible treatment effect. In addition, the algorithm did not completely outline the contiguous tumor region for this metastasis, and some regions containing tumor cells were classified as moderate confidence by the algorithm (Supplemental Fig. 5, Supplemental Digital Content 1, http://links.lww.com/PAS/A677).
Time differences between the 2 sessions of the crossover study design were also analyzed using mixed-effects models. In addition to the significant effects of assistance for negative cases and micrometastases as described above, the second reader session was independently associated with a shorter average time per image than the first session for micrometastases (12 s, P=0.04), negative cases (23 s, P<0.01), and ITCs (22 s, P=0.03).
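As an illustration of how a crossover design lets an assistance effect and a session effect be estimated separately, the following minimal sketch applies the classical 2x2 crossover estimator to per-reader mean times. It is not the mixed-effects model used in this study, and it assumes no carryover between sessions. The reader times, the group labels `AB` and `BA`, and the function name `crossover_effects` are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical per-reader mean review times in seconds: (session 1, session 2).
# "AB" readers used assistance in session 1; "BA" readers used it in session 2.
times = {
    "AB": [(80, 90), (100, 110), (70, 80)],
    "BA": [(110, 80), (95, 65), (105, 75)],
}

def crossover_effects(times):
    """Return (assistance_effect, session_effect) in seconds for a 2x2 crossover.

    Assumes time = reader baseline + assistance_effect (if assisted)
                 + session_effect (if session 2), with no carryover.
    """
    # Within-reader differences cancel each reader's own baseline speed.
    d_ab = mean(s1 - s2 for s1, s2 in times["AB"])  # assistance - session
    d_ba = mean(s1 - s2 for s1, s2 in times["BA"])  # -assistance - session
    assistance_effect = (d_ab - d_ba) / 2
    session_effect = -(d_ab + d_ba) / 2
    return assistance_effect, session_effect

assistance_effect, session_effect = crossover_effects(times)
print(assistance_effect, session_effect)
```

Negative values indicate shorter review times. Because each reader contributes to both modes, differences in baseline speed between readers drop out of both estimates.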
Evaluating Subjective Difficulty of Image Review
To test the hypothesis that computer assistance can affect the subjective difficulty of image review and to gain additional insight into the potential impact of assisted reads, pathologists were instructed to report an “obviousness score” (Materials and methods) following their interpretation for each image. This approach, based in part on a scoring system described by Gallas et al,19 was aimed at addressing how easily pathologists were able to reach their final image classification.
The average obviousness score for macrometastases (91.6) was the highest of the categories, and the average obviousness score for ITC cases (52.9) was the lowest of the individual categories. Regarding assisted versus unassisted reads, digital assistance was associated with an increase in the obviousness score for micrometastases without any significant differences for the other categories (Table 3).
Recent studies have described the ability of deep learning algorithms to perform on par with expert pathologists for isolated diagnostic tasks.10 Underlying these exciting advances, however, is the important notion that these algorithms do not replace the breadth and contextual knowledge of human pathologists and that even the best algorithms would need to integrate into existing clinical workflows in order to improve patient care. In this proof-of-concept study, we investigated the impact of a computer assistance tool for the interpretation of digitized H&E slides, and show that a digital tool developed to assist with the identification of lymph node metastases can indeed augment the efficiency and accuracy of pathologists.
With regard to accuracy, algorithm assistance improved the sensitivity of detection of micrometastases from 83% to 91% and resulted in higher overall diagnostic accuracy than that of either unassisted pathologist interpretation or the computer algorithm alone. Although deep learning algorithms have been credited with comparable or superior diagnostic performance to pathologists, our results suggest that combining the unique strengths of computers and human experts may provide an even more promising opportunity. Pathologists understand the clinical setting and diagnostic workflow, allowing them to contextualize the therapeutic implications of false positives and false negatives in order to optimize their diagnostic operating point, sometimes even on a case-by-case basis. In contrast, algorithms that exceed the sensitivity of pathologists often do so at the cost of increased false positives. By using an algorithm to surface the most pertinent clinical information and allowing pathologists to review the findings, thoughtfully designed assistant tools have the potential to maximize both sensitivity and specificity while also allowing for the identification of any findings that are not interpretable by the algorithm alone.
The baseline sensitivity (without assistance) for micrometastases in our study (83%) was considerably higher than that reported recently for a similar task (38% on average, range: 15% to 63%).10 Although pathologists in our study were instructed to review images at a pace similar to that of their routine clinical review, one possible factor contributing to the sensitivity discrepancy is the time spent per image. Pathologists in the prior study reviewed images with “flexible” time constraints that averaged to ∼1 minute per slide, compared with 1.5 minutes in our study. Because the prior study did not report the time spent by pathologists on each category of images, we were unable to directly compare the time spent on cases with small or no tumor foci. As supporting evidence that time of review can affect sensitivity, the prior study also reported a sensitivity of 89% for review of micrometastases without time constraints. Another possibility is that the utilization of IHC in different practices or in different geographic regions results in a different diagnostic threshold for calling micrometastases using H&E alone. The pathologists in our study self-reported IHC utilization for roughly 10% of clinical cases, which may represent a different baseline experience than the pathologists in the prior study. Different levels of experience with digital pathology may also contribute to sensitivity differences, although there were no clear differences in this regard between the two studies. The sensitivity in actual clinical practice for micrometastases on H&E alone may be hard to evaluate in the era of IHC use, but original studies suggest that 12% to 29% of cases called negative using single-level H&E were in fact positive on review of IHC.20 This suggests a substantial false-negative rate for the relatively infrequent micrometastases using H&E alone, perhaps even higher than the rate observed in our study. 
Although the 6 pathologists in our study represent a range of experience and clinical practice (none were breast specialists), they might represent above-average performance. If a reader study were performed in a particularly time-limited clinical setting with a broad range of pathologists, the accuracy benefits of an assistive tool might be more pronounced.
The most significant benefits of algorithm assistance observed in this study were for efficiency, with a time savings of 19% for negative cases and 52% for micrometastases. These observations are perhaps unsurprising for micrometastases, where we expect an accurate algorithm to help pathologists locate small tumor foci. However, the increased efficiency for negative cases is particularly notable given that the majority of cases in clinical practice are negative. Extrapolating our data to clinical cases consisting of roughly 75% negative cases and 5% to 10% micrometastases,21 these results would suggest a potential overall time savings of ∼20%. We hypothesize that this time benefit for negative cases requires both an accurate algorithm and the establishment of trust in the algorithm’s performance. This trust was likely developed through the use of the algorithm assistance during the study sessions. Taken together with our pilot study observations that a side-by-side heatmap display (Supplemental Digital Content 1, http://links.lww.com/PAS/A677) did not provide a similar time benefit, we reason that thoughtful user interface design is critical in order to avoid distracting users with extraneous information. Still, larger studies will be important to further validate the observed impact of digital assistance on efficiency and accuracy, especially for negative cases and in actual clinical workflows.
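The extrapolation to an overall time savings can be reproduced approximately with a case-mix-weighted average. The sketch below rests on simplifying assumptions that are not stated in the text: the remaining 15% to 20% of cases (for example, macrometastases and ITCs) are taken to show no time savings, and each category is weighted by its share of cases rather than by its share of total review time. The function name `weighted_time_savings` and the category labels are hypothetical.

```python
def weighted_time_savings(case_mix, savings):
    """Case-mix-weighted fractional time savings.

    case_mix: fraction of cases in each category (should sum to 1).
    savings:  fractional time savings per category; categories that are
              missing from this mapping are assumed to save no time.
    """
    return sum(share * savings.get(category, 0.0)
               for category, share in case_mix.items())

# Per-category savings reported in this study.
savings = {"negative": 0.19, "micrometastasis": 0.52}

# Assumed clinical case mix with 5% and 10% micrometastases, respectively.
low = weighted_time_savings(
    {"negative": 0.75, "micrometastasis": 0.05, "other": 0.20}, savings)
high = weighted_time_savings(
    {"negative": 0.75, "micrometastasis": 0.10, "other": 0.15}, savings)

print(f"Estimated overall time savings: {low:.3f} to {high:.3f}")
```

Under these assumptions the estimate falls between roughly 17% and 19%, in line with the approximate figure of 20% given above. Weighting by the actual review time per category would shift the result somewhat.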
In addition to the time benefits associated with the assistance tool, we also observed independent, statistically significant time differences between the first and second sessions of this crossover study. One likely possibility is that the study participants became more familiar with all aspects of the viewer interface and the specific task such that their review took less time. This is also consistent with the efficiency gains reported with increased digital pathology experience.2 The faster review in the second session was not associated with any significant differences in accuracy between sessions, either overall or as a function of assistance modality.
The “obviousness score” described in this study was intended to provide insights into the subjective perception of the task with and without assistance and to build further on the objective measurements of accuracy and efficiency. Despite considerable inter-reader variability, these scores suggest that algorithm assistance increased the perceived ease of image review, specifically for micrometastases. While inter-reader variability in the use of quantitative scores is a known challenge in multiple-reader multiple-case studies,22 future studies could provide more extensive training to improve inter-reader consistency for this type of score. Subjective metrics such as this, although often challenging to calibrate and incorporate, may be an important aspect of evaluating impact and value as digital tools in pathology and medicine continue to be developed and implemented.
The observed time savings for negative cases raises the important consideration of over-reliance on algorithm assistance, with the theoretical possibility of less thorough review for some cases.23,24 In our study, pathologists were informed of the differences between the high confidence and moderate confidence regions and were told to expect the possibility of false positives and false negatives in the assistance. Because the algorithm identified at least one moderate confidence region in every metastasis-positive image in this data set, we could not formally evaluate for false negative pathologist interpretations due to over-reliance. Although it may be possible to tune the information presented by an algorithm to approach 100% sensitivity, the possibility of over-reliance on negative predictions warrants further investigation across larger data sets for which histologic diversity may result in unexpected algorithm false negatives. With regard to pathologist over-reliance on positive predictions, we did observe one image where pathologists ignored moderate-confidence regions that were indeed tumor (verified by IHC) and called the case negative. In this case, pathologists reviewing the image without assistance also failed to detect the metastasis, and thus it was an instance where the assistance tool “outperformed” pathologists, but the assistance had neither a clear positive nor negative effect. The only algorithmic false positive “high confidence” outline corresponded to a region of fat necrosis (Fig. 4C). This region was universally identified as such by the pathologists and ignored, and no false positive final diagnoses occurred among the 420 images reviewed in assisted mode. Seven additional tumor-negative images contained moderate confidence outlines but were also correctly recognized as nontumor by the readers, highlighting the important role of trained pathologists in this assisted read scenario.
These findings suggest that despite limited experience with the assistant tool, pathologists were able to calibrate their reliance on the tool in order to use it effectively, and without evidence of over-reliance in this set of images.
In addition to an assisted primary review as evaluated in this study, digital assistance could also be integrated into clinical workflows in other ways: as a “screening” tool to triage definitively negative and/or positive cases, or as a “second read” for difficult cases following primary pathologist review. The screening approach is similar to the FDA-approved use of computer assistance for cervical cytology specimens,25,26 while the second read approach could mitigate over-reliance or assistive bias during primary review. For lymph node staging, a tool that reduces the review burden for negative cases or triages challenging cases for IHC before initial pathologist review may reduce reporting delays and provide cost savings through efficiency and accuracy gains. Although the potential benefits are exciting, thoughtful consideration of the limitations along with clear instructions and definitions for intended use will be critical to the success and safety of assistive tools in pathology.
Several of the notable limitations to this study stem from the artificial constraints introduced by having pathologists review images in isolation, without the context of the actual clinical workflow. For example, pathologists would typically have access to the other slides comprising the case (including the primary tumor), additional levels, or IHC. In addition, unexpected findings such as lymphomas, infections, or other lymph node pathologies may be present in clinical cases, but were not evaluated in this study and represent a key limitation to stand-alone computer algorithms in diagnostic pathology. Future work can more thoroughly explore algorithms to identify other lymph node pathologies as well as nodal metastases for other cancer types. The algorithm used in this study detects metastatic tumor in lymph nodes, without the ability to interpret the positioning of these foci relative to the nearest lymph node. Correspondingly, extensions of this algorithm could also indicate specific diagnostic features of these detected tumor foci, such as extracapsular extension or lymphovascular invasion by identifying the location of each focus relative to the associated lymph node. Assistive tools for lymph node review may also facilitate easier measurement of tumor deposits, allowing more efficient and accurate reporting for prognosis and clinical management, and potentially even contribute to refined guidelines for nodal staging across different cancer types.
Ultimately, studies to demonstrate the clinical utility and value of assistive tools in digital pathology will require prospective clinical evaluation and thorough evaluation of the clinical task in question. For nodal staging, inclusion of the complete set of lymph node slides for a given case may be a useful next step,27 potentially followed by the evaluation of the complete case. Given the findings reported here, understanding the actual clinical significance of increased lymph node micrometastasis detection is also an important consideration. While several studies have demonstrated prognostic significance for lymph node micrometastases relative to node-negative disease,13,28,29 the reported outcome differences can be small, institutional management practices may vary, and the ultimate impact on clinical decision making depends on the final staging and other patient-specific variables. Studies to evaluate the impact of assisted reads on the use of IHC, refined categories for prognosis, or presorting of cases based on algorithm predictions may demonstrate still additional value for “intelligent” tools in digital pathology.
In summary, this study directly demonstrates some of the potential benefits of assisted reads in pathology, including specific gains in efficiency and accuracy. As such, this study is a useful first step in understanding how assistive tools in pathology can be best designed and utilized, both to improve clinical care and to allow more time and mental energy for tasks that require invaluable human experience and expertise.
The authors thank Greg Corrado and Philip Nelson for their advice and guidance in enabling this work, Craig Mermel for helpful comments on the manuscript, David Miller and Sunny Jansen for their input in study design and data analysis, and James Wren for logistical support. The authors thank members of the Google AI Pathology team for software infrastructure support, logistical support, and slide digitization services. Gratitude also goes to Sara Gabriele, T Saensuksopa, and Melissa Strader for insights into the user interface design. Deepest appreciation goes to pathologists Kathy Brady, Chris Kim, and 7 other pathologists for determining the reference standard for the images or reviewing images as a reader in the pilot or formal study.
1. Mukhopadhyay S, Feldman MD, Abels E, et al. Whole slide imaging versus microscopy for primary diagnosis in surgical pathology: a multicenter blinded randomized noninferiority study of 1992 cases (pivotal study). Am J Surg Pathol. 2018;42:39–52.
2. Mills AM, Gradecki SE, Horton BJ, et al. Diagnostic efficiency in digital pathology: a comparison of optical versus digital assessment in 510 surgical pathology cases. Am J Surg Pathol. 2018;42:53–59.
3. Gavrielides MA, Gallas BD, Lenz P, et al. Observer variability in the interpretation of HER2/neu immunohistochemical expression with unaided and computer-aided digital microscopy. Arch Pathol Lab Med. 2011;135:233–242.
4. Wolff AC, Hammond MEH, Hicks DG, et al. Recommendations for human epidermal growth factor receptor 2 testing in breast cancer: American Society of Clinical Oncology/College of American Pathologists clinical practice guideline update. J Clin Oncol. 2013;31:3997–4013.
5. Acs B, Rimm DL. Not just digital pathology, intelligent digital pathology. JAMA Oncol. 2018;4:403–404.
6. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–444.
7. Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis. 2015;115:211–252.
8. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Advances in Neural Information Processing Systems 25. San Francisco, CA: Curran Associates Inc; 2012:1097–1105.
9. Liu Y, Gadepalli K, Norouzi M, et al. Detecting cancer metastases on gigapixel pathology images. 2017. Available at: http://arxiv.org/abs/1703.02442. Accessed March 9, 2017.
10. Ehteshami Bejnordi B, Veta M, Johannes van Diest P, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318:2199–2210.
11. Litjens G, Sánchez CI, Timofeeva N, et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci Rep. 2016;6:26286.
13. Weaver DL, Ashikaga T, Krag DN, et al. Effect of occult metastases on survival in node-negative breast cancer. N Engl J Med. 2011;364:412–421.
14. Giuliano AE, Hunt KK, Ballman KV, et al. Axillary dissection vs no axillary dissection in women with invasive breast cancer and sentinel node metastasis: a randomized clinical trial. JAMA. 2011;305:569–575.
15. Vestjens JH, de Boer M, van Diest PJ, et al. Prognostic impact of isolated tumor cells in breast cancer axillary nodes: single tumor cell(s) versus tumor cell cluster(s) and microanatomic location. Breast Cancer Res Treat. 2012;131:645–651.
16. Fine JL. 21(st) century workflow: a proposal. J Pathol Inform. 2014;5:44.
17. Rabinovitch A. The College of American Pathologists laboratory accreditation program. Accredit Qual Assur. 2002;7:473–476.
18. Lester SC, Bose S, Chen Y-Y, et al. Protocol for the examination of specimens from patients with invasive carcinoma of the breast. Arch Pathol Lab Med. 2009;133:1515–1538.
19. Gallas BD, Chan H-P, D’Orsi CJ, et al. Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Acad Radiol. 2012;19:463–477.
20. Apple SK. Sentinel lymph node in breast cancer: review article from a pathologist’s point of view. J Pathol Transl Med. 2016;50:83–95.
21. Rutledge H, Davis J, Chiu R, et al. Sentinel node micrometastasis in breast carcinoma may not be an indication for complete axillary dissection. Mod Pathol. 2005;18:762–768.
22. Dendumrongsup T, Plumb AA, Halligan S, et al. Multi-reader multi-case studies using the area under the receiver operator characteristic curve as a measure of diagnostic accuracy: systematic review with a focus on quality of data reporting. PLoS One. 2014;9:e116018.
23. Taylor P, Potts HW. Computer aids and human second reading as interventions in screening mammography: Two systematic reviews to compare effects on cancer detection and recall rate. Eur J Cancer. 2008;44:798–807.
24. Cabitza F, Rasoini R, Gensini GF. Unintended consequences of machine learning in medicine. JAMA. 2017;318:517–518.
25. Wilbur DC, Black-Schaffer WS, Luff RD, et al. The Becton Dickinson FocalPoint GS Imaging System: clinical trials demonstrate significantly improved sensitivity for the detection of important cervical lesions. Am J Clin Pathol. 2009;132:767–775.
26. Biscotti CV, Dawson AE, Dziura B, et al. Assisted primary screening using the automated ThinPrep Imaging System. Am J Clin Pathol. 2005;123:281–287.
28. de Boer M, van Deurzen CHM, van Dijck JAAM, et al. Micrometastases or isolated tumor cells and the outcome of breast cancer. N Engl J Med. 2009;361:653–663.
29. de Boer M, van Dijck JAAM, Bult P, et al. Breast cancer prognosis and occult lymph node metastases, isolated tumor cells, and micrometastases. J Natl Cancer Inst. 2010;102:410–425.
artificial intelligence; machine learning; digital pathology; breast cancer; computer aided detection
Copyright © 2018 Wolters Kluwer Health, Inc. All rights reserved.