Watch and wait (WW) for locally advanced rectal cancer (LARC) is a strategy to expectantly observe patients after neoadjuvant therapy. Pioneered in the early 2000s, it is based on the 13% to 24% rate of pathological complete response after neoadjuvant therapy.1–3 In lieu of surgical resection, WW uses clinical assessment to evaluate for complete tumor regression, termed complete clinical response (cCR). Similar oncologic outcomes have been reported in patients who underwent WW management compared with patients who underwent surgical resection and had a pathological complete response.4–6
Endoscopy is one of the primary modalities for determining clinical response. The current criteria for a cCR include a white or pale scar with telangiectasia and without signs of residual disease, such as ulceration or nodularity.7,8 However, these criteria are qualitative and can be inaccurate, with a 20% to 30% risk of local regrowth.9–11 Conversely, some have critiqued the WW cCR criteria as too strict, resulting in the exclusion of patients who may benefit from expectant follow-up.12 To add to the complexity, the assessment of response relies on surgeon evaluation, which is subject to interobserver variability.13
Deep learning is one technology previously shown to be successful in classifying disease conditions and outcomes in biomedical images.14–17 In one type of deep learning, a convolutional neural network (CNN) automatically learns the appropriate set of features directly from the data, extracting image features imperceptible to the human eye to produce a final classification.18–20 CNNs have previously been applied to endoscopic images of various disease processes in the GI tract, including colon polyps and gastric cancer.16,18,21 To our knowledge, no studies have used a CNN model to assess endoscopic images of patients diagnosed with LARC treated with total neoadjuvant therapy (TNT) or with chemoradiotherapy and systemic chemotherapy.
This study aimed to create a novel CNN model for identifying the presence or absence of tumor on endoscopic images of patients with LARC treated with TNT and to investigate the accuracy of the method.
PATIENTS AND METHODS
Selection of Patients and Grouping of Images
Patients with LARC who underwent TNT with tumor response assessed endoscopically at Memorial Sloan Kettering Cancer Center between December 2012 and July 2017 were selected. The protocol at our institution includes induction chemotherapy [FOLFOX (folinic acid, fluorouracil, oxaliplatin) or CAPEOX (capecitabine and oxaliplatin)] followed by long-course chemoradiotherapy with concurrent infusional fluorouracil or oral capecitabine.22 A few patients received long-course chemoradiotherapy with concurrent infusional fluorouracil or oral capecitabine followed by consolidation chemotherapy.
Tumor response was evaluated by white-light flexible endoscopy using the Olympus scope model CF-Q160S. Patients were assessed before the start of TNT, at least once during TNT, after TNT (defined as the first endoscopic assessment after treatment with an average of 6 weeks after TNT in our cohort), and in subsequent follow-up visits. If multiple images were captured simultaneously, then we selected up to 3 images that provided the best representation of the area of interest and had the best image quality. The available images were collected retrospectively and correlated with clinical information. This study was approved by the institutional review board of Memorial Sloan Kettering Cancer Center.
The available images were grouped into a tumor or a no tumor category based on the following criteria. Images captured before the initiation of TNT were placed in the tumor group. Images captured during TNT were placed in the tumor group if the treating surgeon assessed a residual tumor and in the no tumor group if no residual lesion was seen. Images captured after TNT and during follow-up were sorted using the following criterion standards: (1) images from patients endoscopically assessed to have a residual tumor who underwent surgical resection with surgical pathology confirming tumor presence were placed in the tumor group; and (2) images from patients with a sustained cCR, defined as no evidence of tumor regrowth in the 2 years after the end of TNT, were placed in the no tumor group. Sustained cCR was used as the criterion standard because previous literature shows that the majority of local regrowths occur within this interval.10 Thus, images from these patients were assumed to be tumor free.
The unit of analysis for this study was the image. This approach was chosen to maximize the images available and to facilitate the development of a CNN classifier capable of differentiating a tumor from no tumor. Furthermore, this created a larger variety of images in each group, such as those with tumor regression (during TNT and after TNT particularly) and with blood or fecal matter, and thus allowed for greater CNN exposure.
The images were made available for incorporation into the CNN in 2 sets: training and testing. The model for classifying the endoscopic images was learned on the training set and independently evaluated, after the completion of training, on the testing set. To ensure unbiased classification, the testing images were made available to the CNN only after the model was finalized.
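Because the unit of analysis was the image but set membership was determined at the patient level, all of a patient's images must fall into the same set. The paper does not publish code; a minimal sketch of such a grouped split using scikit-learn's `GroupShuffleSplit` (the image identifiers, labels, and patient IDs below are hypothetical):

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical image-level records; grouping by patient ID keeps all of a
# patient's images in the same set, mirroring the patient-level split.
images = [f"img_{i}" for i in range(10)]
labels = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]  # 1 = tumor, 0 = no tumor (illustrative)
patients = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4", "p5", "p5"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(images, labels, groups=patients))

train_patients = {patients[i] for i in train_idx}
test_patients = {patients[i] for i in test_idx}
assert train_patients.isdisjoint(test_patients)  # no patient spans both sets
```

Splitting by patient rather than by image prevents near-duplicate images of the same lesion from leaking between the training and testing sets.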
CNN Model Development and Training
A very deep, 19-layer CNN called the Visual Geometry Group network (VGG-19), previously developed for the classification of natural images, was modified for our study (Fig. 1).23 It was applied to analyze endoscopic image patches (256 × 256 pixels) extracted automatically from a region of interest corresponding to the brightest portion of the image. The CNN extracted several thousand image features using convolutional filters that "learn" the features useful for classifying the endoscopic patches into a tumor or no tumor category. Extracted features included textural patterns and abrupt changes in the bowel mucosa, along with higher-level representations akin to nodules. The extracted features were combined into a nonlinear representation using fully connected layers and a softmax activation function that produced a probabilistic classification of tumor or no tumor.24
To successfully train a network with this many parameters, we used a transfer learning technique, as used in other medical image-based classification tasks.15,25 With this technique, the CNN was initialized with weights obtained by training on many natural images (1.28 million) from the ImageNet 2012 classification data set.26 Adaptation to the endoscopic images was performed by training only the modified network layers, using 3-fold cross-validation. Further details can be found in the Supplemental Material at https://links.lww.com/DCR/B960.
CNN Model Testing
All testing set images, regardless of the presence of artifact or low quality, were input into the previously trained CNN model for classification as tumor or no tumor. This was done intentionally to avoid bias and to provide a more realistic image set. Additionally, because the CNN was naive to these images, they allowed for independent testing of the accuracy of the network.
The CNN model was trained using 3-fold cross-validation, a machine-learning technique that promotes generalizable performance.27,28 The performance of the CNN was evaluated independently on the training and testing sets and measured using the area under the receiver operating characteristic curve (AUROC). Sensitivity, specificity, positive predictive value, and negative predictive value were also assessed. The sensitivity, or true positive rate, corresponded to the accuracy of detecting no tumor, and the specificity, or true negative rate, corresponded to the accuracy of detecting a tumor.
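Under the paper's convention (no tumor as the positive class), the reported metrics can be computed from predictions and scores in the usual way; a small self-contained example with hypothetical labels and scores:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical labels and model scores: 1 = no tumor (the positive class,
# per the paper's convention), 0 = tumor.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.7, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate: detecting no tumor
specificity = tn / (tn + fp)  # true negative rate: detecting tumor
ppv = tp / (tp + fp)          # of "no tumor" calls, fraction truly no tumor
npv = tn / (tn + fn)          # of "tumor" calls, fraction truly tumor
auroc = roc_auc_score(y_true, y_score)  # threshold-free discrimination
```

For these toy values, sensitivity, specificity, PPV, and NPV are each 0.75 and the AUROC is 0.9375; the study's reported figures were obtained the same way on the real image sets.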
RESULTS

Images from 109 patients were used: 82 in the training set and 27 in the testing set. Table 1 shows the patient characteristics in each imaging set. Both sets had similar baseline characteristics, with most tumors located in the mid rectum and classified as clinical T3, node positive. Additionally, 84.1% of patients in the training set and 96.3% of patients in the testing set received induction chemotherapy.
TABLE 1. Patient characteristics for images included in the training and testing sets (training set, N = 82; testing set, N = 27). Reported characteristics: age at diagnosis (mean, y); sex (% men/% women); distance from anal verge (mean, cm); clinical T classification, n (%); clinical N classification, n (%); and neoadjuvant therapy, n (%) (chemotherapy then chemoradiation vs chemoradiation then chemotherapy).
aUnable to evaluate the clinical T classification due to intussusception seen on MRI.
The training set consisted of 1099 images after 40 were excluded for poor quality; 468 images were categorized into the no tumor group and 631 into the tumor group. Figure 2 shows the proportion of images from each time point. Across all 3 folds, the network achieved an AUROC of 0.83 on the training set (Fig. 3), with a sensitivity of 0.78, specificity of 0.73, positive predictive value of 0.70, and negative predictive value of 0.80.
The testing set contained 293 images: 151 in the no tumor group and 142 in the tumor group. When the testing set was applied to the network, a similar AUROC of 0.83 was achieved (Fig. 3), with a sensitivity of 0.66, specificity of 0.87, positive predictive value of 0.70, and negative predictive value of 0.85. No difference in AUROC was found between the training and testing sets when evaluated using the 2-sided DeLong test with 2000 bootstrap iterations (p = 0.8).28
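The study used a DeLong-based comparison with 2000 bootstrap iterations. As a hedged illustration only (not the authors' code, and a percentile bootstrap rather than the DeLong variance estimator), the uncertainty around an AUROC can be quantified by resampling the image set; `y1` and `s1` below are hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI for an AUROC: resample images with
    replacement and recompute the AUROC each time."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # resample lacks a class; skip
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, [2.5, 97.5])

# Hypothetical labels/scores (1 = no tumor); overlapping intervals between
# two sets would be consistent with "no difference" in AUROC.
y1 = [1, 1, 1, 0, 0, 0, 1, 0]
s1 = [0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.7, 0.3]
lo, hi = bootstrap_auc_ci(y1, s1)
```

The DeLong test itself additionally accounts for the covariance structure of the paired ranks; dedicated implementations exist in R's pROC package.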
Evaluation of Misclassified Images in the Testing Set
Machine-learning methods operate under the assumption that the distribution of the testing set matches that of the training set. Because our testing set contained images with artifacts or of lower quality, classification performance on these images was expected to be worse. We believed it necessary to include them to provide more realistic conditions, and they were subsequently used to identify sources of poor classification. Examples are shown in Figure 4. As expected, poor accuracy was observed in cases with excessive blurring, presence of fecal matter or blood, and even normal bowel mucosal folds. When the extreme cases were removed by visual inspection and the set was retested, the AUROC was slightly higher at 0.86, with a sensitivity of 0.87 and a specificity of 0.72.
CNN Performance Variability With Longitudinal Imaging
We next evaluated whether the classification accuracy of the CNN varied between patients or between time points to understand whether accuracy clustered among patients. The testing set images for each patient were grouped by time point (before TNT, during TNT, after TNT, and subsequent follow-ups), and the frequency of correct (0 to 1) and incorrect (0 to –1) detections was computed (Fig. 5). Incorrect detections (an incorrect-detection frequency exceeding 0.5, or 50%) were more common for the after-TNT images (7/27 patients) and follow-up images (8/27 patients) than for the before-TNT images (0/27 patients) and during-TNT images (2/27 patients). Misidentification was likely due to fecal matter, residual mucosal features, artifacts, or normal bowel mucosal folds being mistaken for tumor features. In patients who had residual tumor after TNT, clustering of correct detections across time points was observed. Figure 6 highlights images from a patient with a sustained cCR in which the CNN successfully identified the images over time; however, the CNN's confidence in the prediction decreased as the tumor regressed, emphasizing the increased challenge of correctly classifying tumor regression.
DISCUSSION

In this pilot study, we developed a CNN model to classify endoscopic images from patients with LARC who underwent TNT. The CNN evaluated endoscopic images at various time points and classified them based on the presence or absence of tumor in a longitudinal imaging scenario. The CNN achieved high accuracy in detecting tumor presence and reasonable accuracy in detecting tumor regression or no tumor in the testing set. With the goal of developing a CNN model to use as a tool for surgeon endoscopic assessment of response in patients being considered for WW, the CNN showed feasibility in longitudinally detecting the presence of tumor. However, the detection of no tumor was more challenging, with accuracy improving when poor-quality images were removed. Our findings offer a novel first step toward using deep learning in the assessment of tumor response for patients with LARC being considered for or undergoing WW management.
WW has shown promising oncological outcomes.11 However, the qualitative clinical assessment of response needs improvement before widespread adoption. One challenge is correctly identifying a cCR. A 2018 systematic review showed that, with the current definition, the pooled 3-year cumulative rate of local regrowth was 21.6%, with most regrowths salvaged in patients with a cCR.11 Conversely, other studies have noted that this strict definition misses patients who may benefit from organ preservation. In 2014, Smith et al29 reported that 74% of patients with T0 on surgical pathology did not meet the clinical criteria for a complete response. More recently, a near-complete response category has been described, and patients with features such as minor ulceration and mucosal irregularities are being considered by some for WW.12,30 One technology that, to our knowledge, has not been explored to improve the classification of response for patients with LARC after neoadjuvant therapy is deep learning.
The current literature has shown a benefit for deep learning in colon polyp identification, differentiation of adenomatous from nonadenomatous polyps, and cancerous polyp recognition.31 With CNNs, computer-aided detection systems using endoscopic video or still images have been created to improve lesion detection. A recent randomized controlled trial comparing the use of this technology during real-time colonoscopies showed a higher rate of adenoma or cancer detection in the computer-aided diagnosis group than in the control group (relative risk, 1.30; 95% CI, 1.14–1.45).32 The detection systems are also highly accurate for cancerous colon lesion detection, with a recent meta-analysis showing a pooled accuracy of 89.9%.33
In the present study, our novel CNN model had a high specificity for identifying tumor images, and clustering of correct detections was observed across time points in patients with residual tumor after TNT. This is oncologically significant because missing patients with tumors could be detrimental. The CNN was less successful at identifying images without tumor; however, its sensitivity improved when the poor-quality images were removed from the testing set and the image set more closely resembled the training set. This shows that the CNN classifier was trained to differentiate no tumor images from tumor images. The classification became more challenging when artifacts, such as a previous tattoo, fecal matter, or even normal bowel mucosal folds, were present. Furthermore, some of these images were harder to identify because they showed only subtle characteristic changes (i.e., tumor regression) while the surrounding features remained the same, leading to misidentification. Although the classification was purposefully performed to best simulate the way a surgeon would evaluate images over time, this made the identification of no tumor images more complex for the CNN classifier.
Several studies have sought to characterize current surgeon accuracy in assessing cCR with endoscopic images. van der Sande et al12 found an AUROC between 0.80 and 0.84 when 3 readers evaluated endoscopic images for complete response. Another study combining digital rectal examination and endoscopy reported an AUROC of 0.88.34 A recent survey by Felder et al13 of multiple surgeons assessing response in endoscopic images showed superior accuracy in identifying a complete response: accuracy was 82% for distinguishing a complete response compared with 68% for an incomplete response.
Our accuracy was similar to that in these studies, although it was not compared directly with surgeon assessment. Furthermore, of the images identified as no tumor, 30% had a tumor present (with a positive predictive value of 70%); for a first attempt at creating a CNN model, this is akin to the current rate of local regrowth.9–11 However, the CNN model will need further refinement and testing to evaluate its capability to aid in the assessment of clinical response and its use as a clinical tool. With improvement, a CNN would give surgeons an additional tool to improve the consistency and accuracy of the current response assessment. One such benefit is the ability of a CNN model to learn features imperceptible to the human eye, which could aid in differentiating no tumor from tumor when subtle features of a near-complete response, such as mucosal irregularities and ulceration, are present. Another benefit is that our CNN model had a high rate of correct tumor image detection; as previously shown, surgeons have higher accuracy for identifying a complete response, so the CNN may act as a complement to the current assessment. However, the extent to which the CNN model would improve surgeon accuracy and reproducibility is beyond the current scope.
Our study had several limitations. As a pilot study, it was retrospective, was conducted at a single center, and had a relatively small sample of images. However, even with a limited data set, our study showed good accuracy; it now needs to be validated on larger data sets. Further exposure of the CNN to more cases would likely improve its accuracy and help overcome the real-life challenge of occasionally suboptimal images. Furthermore, increasing the diversity of no tumor images, such as adding more images corresponding to a near-complete response, would improve the sensitivity and positive predictive value. For this initial study, we endeavored to train the CNN with images of tumor presence and absence that were as clear as possible and did not include patients who had local regrowth or a pathological complete response; further work would need to expand the breadth of cases. Finally, we acknowledge that improving accuracy is not the only challenge in assessing clinical response; a CNN model would also need to decrease interobserver variability. Future studies would need to evaluate whether our CNN can improve the accuracy of surgeon assessment enough to overcome this variability.
We created a CNN model to detect the presence or absence of tumors on endoscopic images of patients with LARC who underwent TNT. The current algorithm achieved good accuracy with an AUROC of 0.83 on both the training and the testing sets. To improve the accuracy and increase generalizability, a larger testing cohort is needed as well as further testing on more complex patient cases.
REFERENCES

1. Martin ST, Heneghan HM, Winter DC. Systematic review and meta-analysis of outcomes following pathological complete response to neoadjuvant chemoradiotherapy for rectal cancer. Br J Surg. 2012;99:918–928.
2. Maas M, Nelemans PJ, Valentini V, et al. Long-term outcome in patients with a pathological complete response after chemoradiation for rectal cancer: a pooled analysis of individual patient data. Lancet Oncol. 2010;11:835–844.
3. Hartley A, Ho KF, McConkey C, Geh JI. Pathological complete response following pre-operative chemoradiotherapy in rectal cancer: analysis of phase II/III trials. Br J Radiol. 2005;78:934–938.
4. Maas M, Beets-Tan RG, Lambregts DM, et al. Wait-and-see policy for clinical complete responders after chemoradiation for rectal cancer. J Clin Oncol. 2011;29:4633–4640.
5. Smith JD, Ruby JA, Goodman KA, et al. Nonoperative management of rectal cancer with complete clinical response after neoadjuvant therapy. Ann Surg. 2012;256:965–972.
6. Dossa F, Chesney TR, Acuna SA, Baxter NN. A watch-and-wait approach for locally advanced rectal cancer after a clinical complete response following neoadjuvant chemoradiation: a systematic review and meta-analysis. Lancet Gastroenterol Hepatol. 2017;2:501–513.
7. Habr-Gama A, Perez RO, Wynn G, Marks J, Kessler H, Gama-Rodrigues J. Complete clinical response after neoadjuvant chemoradiation therapy for distal rectal cancer: characterization of clinical and endoscopic findings for standardization. Dis Colon Rectum. 2010;53:1692–1698.
8. Smith JJ, Chow OS, Gollub MJ, et al.; Rectal Cancer Consortium. Organ preservation in rectal adenocarcinoma: a phase II randomized controlled trial evaluating 3-year disease-free survival in patients with locally advanced rectal cancer treated with chemoradiation plus induction or consolidation chemotherapy, and total mesorectal excision or nonoperative management. BMC Cancer. 2015;15:767.
9. Renehan AG, Malcomson L, Emsley R, et al. Watch-and-wait approach versus surgical resection after chemoradiotherapy for patients with rectal cancer (the OnCoRe project): a propensity-score matched cohort analysis. Lancet Oncol. 2016;17:174–183.
10. van der Valk MJM, Hilling DE, Bastiaannet E, et al.; IWWD Consortium. Long-term outcomes of clinical complete responders after neoadjuvant treatment for rectal cancer in the International Watch & Wait Database (IWWD): an international multicentre registry study. Lancet. 2018;391:2537–2545.
11. Dattani M, Heald RJ, Goussous G, et al. Oncological and survival outcomes in watch and wait patients with a clinical complete response after neoadjuvant chemoradiotherapy for rectal cancer: a systematic review and pooled analysis. Ann Surg. 2018;268:955–967.
12. van der Sande ME, Maas M, Melenhorst J, Breukink SO, van Leerdam ME, Beets GL. Predictive value of endoscopic features for a complete response after chemoradiotherapy for rectal cancer. Ann Surg. 2021;274:e541–e547.
13. Felder SI, Patil S, Kennedy E, Garcia-Aguilar J. Endoscopic feature and response reproducibility in tumor assessment after neoadjuvant therapy for rectal adenocarcinoma. Ann Surg Oncol. 2021;28:5205–5223.
14. Campanella G, Hanna MG, Geneslaw L, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med. 2019;25:1301–1309.
15. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115–118.
16. Hirasawa T, Aoyama K, Tanimoto T, et al. Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images. Gastric Cancer. 2018;21:653–660.
17. Kather JN, Pearson AT, Halama N, et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat Med. 2019;25:1054–1056.
18. Min JK, Kwak MS, Cha JM. Overview of deep learning in gastrointestinal endoscopy. Gut Liver. 2019;13:388–393.
19. Hosny A, Parmar C, Coroller TP, et al. Deep learning for lung cancer prognostication: A retrospective multi-cohort radiomics study. PLoS Med. 2018;15:e1002711.
20. Jiang J, Hu YC, Liu CJ, et al. Multiple resolution residually connected feature streams for automatic lung tumor segmentation from CT images. IEEE Trans Med Imaging. 2019;38:134–144.
21. Yamada M, Saito Y, Imaoka H, et al. Development of a real-time endoscopic image diagnosis support system using deep learning technology in colonoscopy. Sci Rep. 2019;9:14465.
22. Cercek A, Roxburgh CSD, Strombom P, et al. Adoption of total neoadjuvant therapy for locally advanced rectal cancer. JAMA Oncol. 2018;4:e180071.
23. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 3rd International Conference on Learning Representations (ICLR 2015); May 7–9, 2015; San Diego, CA. http://arxiv.org/abs/1409.1556. Accessed June 22, 2021.
24. Nwankpa CE, Ijomah W, Gachagan A, Marshall S. Activation functions: comparison of trends in practice and research for deep learning. In: 2nd International Conference on Computational Sciences and Technology; December 17–19, 2020; Jamshoro, Pakistan:124–133. arXiv:1811.03378.
25. Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010;22:1345–1359.
26. Deng J, Dong W, Socher R, Li LJ, Li K, Li FF. ImageNet: A Large-Scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 20-25, 2009:248–255.
27. Fehr D, Veeraraghavan H, Wibmer A, et al. Automatic classification of prostate cancer Gleason scores from multiparametric magnetic resonance images. Proc Natl Acad Sci U S A. 2015;112:E6265–E6273.
28. Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 2; 1995; Montreal, Quebec, Canada.
29. Smith FM, Wiland H, Mace A, Pai RK, Kalady MF. Clinical criteria underestimate complete pathological response in rectal cancer treated with neoadjuvant chemoradiotherapy. Dis Colon Rectum. 2014;57:311–315.
30. van der Sande ME, Beets GL, Hupkens BJ, et al. Response assessment after (chemo)radiotherapy for rectal cancer: why are we missing complete responses with MRI and endoscopy? Eur J Surg Oncol. 2019;45:1011–1017.
31. Pannala R, Krishnan K, Melson J, et al. Artificial intelligence in gastrointestinal endoscopy. VideoGIE. 2020;5:598–613.
32. Repici A, Badalamenti M, Maselli R, et al. Efficacy of real-time computer-aided detection of colorectal neoplasia in a randomized trial. Gastroenterology. 2020;159:512–520.e7.
33. Mohan BP, Khan SR, Kassab LL, Ponnada S, Dulai PS, Kochhar GS. Accuracy of convolutional neural network-based artificial intelligence in diagnosis of gastrointestinal lesions based on endoscopic images: a systematic review and meta-analysis. Endosc Int Open. 2020;8:E1584–E1594.
34. Maas M, Lambregts DM, Nelemans PJ, et al. Assessment of clinical complete response after chemoradiation for rectal cancer with digital rectal examination, endoscopy, and MRI: selection for organ-saving treatment. Ann Surg Oncol. 2015;22:3873–3880.