
Abdominopelvic Imaging: Gastrointestinal

Volumetric Pancreas Segmentation on Computed Tomography: Accuracy and Efficiency of a Convolutional Neural Network Versus Manual Segmentation in 3D Slicer in the Context of Interreader Variability of Expert Radiologists

Khasawneh, Hala MBBS; Patra, Anurima MD; Rajamohan, Naveen MD; Suman, Garima MD; Klug, Jason PhD; Majumder, Shounak MD; Chari, Suresh T. MD; Korfiatis, Panagiotis PhD; Goenka, Ajit Harishkumar MD

Journal of Computer Assisted Tomography 46(6):841–847, November/December 2022. DOI: 10.1097/RCT.0000000000001374

Abstract

There has been increasing interest in the application of artificial intelligence (AI) to augment quantitative imaging of the pancreas. Many of these potential AI-augmented applications are centered on the use of quantitative metrics extracted from volumetrically segmented pancreas to serve as image-based biomarkers for both exocrine and endocrine pancreatic diseases.1–4 Accurate pancreas segmentation is a fundamental step for the development of robust pancreas AI models and for their use in potential clinical translation of insights obtained from such image-based biomarker research.5

Pancreas segmentation is more challenging than segmentation of other abdominal organs such as the liver, kidneys, or spleen. This is principally due to the relatively small size of the pancreas, its variable oblong shape subject to elastic deformation, its feathery borders, and its nonlinear orientation. In addition, its deep abdominal location insinuated between bowel loops, its close apposition with major vessels, and the variable amount of peripancreatic fat pose further challenges.6 Thus, volumetric pancreas segmentation is typically a manual process requiring radiologists or domain-trained technologists.7–10 However, manual segmentation is time-intensive and is associated with interreader and intrareader variability. Such manual processes become a major hindrance to the creation of large imaging data sets, which are necessary for the development of clinically translatable deep learning applications.11 Consequently, the clinical deployment of pancreas AI applications has been negligible. Thus, an accurate, efficient, and scalable method for volumetric pancreas segmentation is necessary to advance image-based pancreas AI applications to the clinical domain. To that end, we have previously reported a convolutional neural network (CNN) model for fully automated volumetric pancreas segmentation on standard-of-care computed tomography (CT), which was developed and validated on large data sets.12 The model had the highest reported performance for volumetric pancreas segmentation, with a mean (SD) Dice-Sørensen coefficient (DSC) of 0.91 (0.03).

For the clinical translation of such AI models, the essential prerequisites include ongoing refinement, validation of consistent performance vis-à-vis multiple expert radiologists, and a positive impact on operational efficiency while considering the need for any refinement of AI output segmentations. Therefore, the aim of the current study was to validate the CNN performance against expert manual segmentations in the context of intrareader and interreader reliability and to measure the impact of a CNN-enhanced workflow on the time needed for volumetric pancreas segmentation.

MATERIALS AND METHODS

Data Set Selection and Curation

This study was an Institutional Review Board–approved and Health Insurance Portability and Accountability Act–compliant retrospective study, with the requirement for informed consent waived. Through a retrospective review of the Radiology Information System, patients with abdomen CT scans performed between 2019 and 2020 and a morphologically normal pancreas were identified based on the statement of a negative or unremarkable pancreas in the CT report. We randomly selected 294 contrast-enhanced abdomen CT scans obtained in the portal venous phase (159 females, 135 males; mean [range] patient age, 50 [17–102] years) with slice thickness of ≤3 mm (mean [range] slice thickness, 1.6 [0.3–3.0] mm; distribution, 1.25–3.0 mm [n = 132], 0.6–1.0 mm [n = 93], and <0.5 mm [n = 69]). The scans were obtained on CT systems from 4 different vendors (195 Siemens, 69 Toshiba, 17 Canon, and 13 GE). This data set was subsequently curated by 2 radiologists to confirm optimal image quality, portal venous phase, and a morphologically normal pancreas. The current study did not include any of the CTs that were part of the development or testing of the previously published CNN model for volumetric pancreas segmentation.12

Volumetric Pancreas Segmentation

All CT studies were downloaded and deidentified by anonymization of Digital Imaging and Communications in Medicine (DICOM) tags using the Clinical Trial Processor.13 Metadata elements such as vendor and CT slice thickness were extracted from the DICOM headers. These anonymized CT data sets were then converted into the Neuroimaging Informatics Technology Initiative (NIfTI) format. Three radiologists (R1, 4 years of post–radiology residency experience; R2 and R3, each with 2 years of post–radiology residency experience) performed independent manual volumetric segmentation of the pancreas on each CT using 3D Slicer (version 4.11.20210226).14 The "Fill between slices" tool was used to connect the segmentations by interpolating between the annotated slices. Tissue with Hounsfield unit values less than 0 was then removed using the "Logical operators" tool in 3D Slicer to exclude peripancreatic fat from the segmentation mask (Fig. 1). The time taken for manual segmentation of a given case was individually recorded by the study radiologist using the "Alarms & Clock" stopwatch on Microsoft Windows. The stopwatch was started once the volume file for a given case was loaded in 3D Slicer and the pancreas was localized, and it was stopped once the segmented mask was saved. All 3 radiologists were pretrained on this workflow to avoid inconsistencies in its implementation.
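
For illustration, the Hounsfield unit cleanup step could also be scripted outside the 3D Slicer interface. The following is a minimal sketch using SimpleITK, with placeholder file names, assuming a CT volume in Hounsfield units and a matching binary pancreas mask; the study itself performed this step with the "Logical operators" tool in the 3D Slicer GUI.

```python
# Scripted approximation of the Hounsfield-unit cleanup: voxels with HU < 0
# (eg, peripancreatic fat) are removed from the pancreas mask.
# File names are placeholders; the study used the 3D Slicer GUI for this step.
import SimpleITK as sitk

ct = sitk.ReadImage("case_001_ct.nii.gz")                           # CT volume in HU
mask = sitk.ReadImage("case_001_pancreas.nii.gz", sitk.sitkUInt8)   # binary pancreas mask

# Binary map of voxels with attenuation >= 0 HU (generous upper bound).
soft_tissue = sitk.BinaryThreshold(ct, lowerThreshold=0, upperThreshold=10000,
                                   insideValue=1, outsideValue=0)

# Keep only mask voxels that are also soft tissue (HU >= 0).
cleaned = sitk.And(mask, sitk.Cast(soft_tissue, sitk.sitkUInt8))

sitk.WriteImage(cleaned, "case_001_pancreas_cleaned.nii.gz")
```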

FIGURE 1:
Study design. Volumetric segmentation of the pancreas was performed by the CNN model ("CNN segmentation"). Three radiologists (R1, R2, and R3) independently performed manual segmentation and reviewed the CNN segmentations ("c-CNN segmentation") using 3D Slicer. Time was recorded for each case for both sets of segmentations. Ground truth segmentations were obtained using the STAPLE algorithm. Finally, CNN segmentations below the 25th percentile of DSC were reviewed. PPV, predicted pancreas volume.

Subsequently, volumetric pancreas segmentations were obtained for all CTs using our previously published CNN model.12 Briefly, this 2-stage CNN pancreas segmentation model uses a variation of the U-Net architecture: the first stage localizes the pancreas with a minimum enclosing bounding box, and the second stage performs volumetric pancreas segmentation. Since publication of the prior report, we have continued to augment and retrain the CNN model using volumetrically segmented pancreas from portal venous phase CTs spanning a wide range of image acquisition parameters. The 3 radiologists (R1, R2, and R3) then independently reviewed the CNN segmentations and, if needed, made corrections. These radiologist-reviewed CNN segmentations are referred to as "corrected CNN (c-CNN)" segmentations. The time taken by each radiologist to correct the CNN segmentations was also noted. Time was recorded as 0 minutes for CNN segmentations that were approved by the radiologists without any corrections.
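
For readers unfamiliar with cascaded segmentation, the sketch below shows the general shape of such a 2-stage inference pipeline (coarse localization, cropping to a padded bounding box, then fine segmentation within the crop). The localizer and segmenter objects, their predict interface, and the margin value are hypothetical placeholders rather than the authors' implementation.

```python
# Generic 2-stage (localize-then-segment) inference sketch; the model objects
# and their predict() interface are hypothetical placeholders.
import numpy as np

def two_stage_segment(ct_volume, localizer, segmenter, margin=10):
    """ct_volume: 3D numpy array in HU; localizer/segmenter: trained U-Net-like models."""
    # Stage 1: coarse probability map over the full volume, thresholded to a mask.
    coarse_mask = localizer.predict(ct_volume) > 0.5

    # Minimum enclosing bounding box of the coarse prediction, padded by a margin.
    idx = np.argwhere(coarse_mask)
    lo = np.maximum(idx.min(axis=0) - margin, 0)
    hi = np.minimum(idx.max(axis=0) + margin + 1, coarse_mask.shape)
    z0, y0, x0 = lo
    z1, y1, x1 = hi

    # Stage 2: fine segmentation inside the cropped region, pasted back in place.
    crop = ct_volume[z0:z1, y0:y1, x0:x1]
    fine = segmenter.predict(crop) > 0.5
    full_mask = np.zeros(ct_volume.shape, dtype=np.uint8)
    full_mask[z0:z1, y0:y1, x0:x1] = fine.astype(np.uint8)
    return full_mask
```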

To evaluate and compare the segmentations done by the radiologists and by the CNN, ground truth (GT) segmentations were obtained using the simultaneous truth and performance level estimation (STAPLE) algorithm from the SimpleITK library (v2.0).15–17 This weighted-voting iterative algorithm calculates a probabilistic estimate of the true segmentation from a combination of segmentations from multiple readers. In our study, the STAPLE algorithm was used to generate the GT segmentations from the 3 radiologists' manual segmentations.
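
For illustration, a minimal sketch of this step, assuming the STAPLE filter interface in SimpleITK, placeholder file names for the 3 manual segmentations, and an illustrative 0.5 probability threshold to binarize the output:

```python
# Derive a composite GT from 3 manual segmentations with SimpleITK's STAPLE filter.
# File names and the 0.5 probability threshold are illustrative assumptions.
import SimpleITK as sitk

reader_files = ["R1_manual.nii.gz", "R2_manual.nii.gz", "R3_manual.nii.gz"]
segs = [sitk.ReadImage(f, sitk.sitkUInt8) for f in reader_files]

# STAPLE returns a per-voxel probability of belonging to the foreground label (1).
staple = sitk.STAPLEImageFilter()
staple.SetForegroundValue(1.0)
prob = staple.Execute(segs)

# Binarize the probability map to obtain the composite ground truth mask.
gt = sitk.Cast(prob > 0.5, sitk.sitkUInt8)
sitk.WriteImage(gt, "STAPLE_GT.nii.gz")
```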

Pancreas voxel volumes were extracted from the CNN segmentations and from each of the 2 sets of radiologists' segmentations (ie, manual and c-CNN). All segmentations were subsequently compared with STAPLE-derived GT using metrics such as DSC, Jaccard coefficient (JC), and predicted pancreas volume. Interreader reliability was assessed between the 3 radiologists and compared between manual and c-CNN segmentations. Intrareader reliability was assessed by comparing the manual to c-CNN segmentation for a given radiologist.
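
For illustration, these per-case comparisons can be computed with SimpleITK as in the sketch below; the file names are placeholders, and this is not the authors' exact implementation.

```python
# Per-case DSC, JC, and predicted pancreas volume (cm3) against the STAPLE-derived GT.
# File names are placeholders.
import SimpleITK as sitk

def compare_to_gt(seg_path, gt_path):
    seg = sitk.ReadImage(seg_path, sitk.sitkUInt8)
    gt = sitk.ReadImage(gt_path, sitk.sitkUInt8)

    # Overlap metrics between the candidate segmentation and the GT.
    overlap = sitk.LabelOverlapMeasuresImageFilter()
    overlap.Execute(gt, seg)
    dsc = overlap.GetDiceCoefficient()
    jc = overlap.GetJaccardCoefficient()

    # Predicted pancreas volume: foreground voxel count x voxel volume (mm3 -> cm3).
    stats = sitk.StatisticsImageFilter()
    stats.Execute(seg)
    voxel_mm3 = 1.0
    for spacing in seg.GetSpacing():
        voxel_mm3 *= spacing
    volume_cm3 = stats.GetSum() * voxel_mm3 / 1000.0
    return dsc, jc, volume_cm3

print(compare_to_gt("R1_manual.nii.gz", "STAPLE_GT.nii.gz"))
```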

Finally, we analyzed cases at the lower range of the CNN's performance, defined as CNN segmentations with a DSC below the 25th percentile when compared with the STAPLE-derived GT.12 The CT images and the CNN segmentations of these cases were reviewed to identify potential reasons for the relatively lower model performance. The CNN segmentations for these CTs were then stratified into categories of undersegmentation (parts of the pancreas not included), oversegmentation (adjacent peripancreatic structures incorrectly included), or both. Errors were further subcategorized as minor (<15% difference in predicted pancreas volume), intermediate (15%–50% difference), or major (>50% difference).
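
As an illustration, this severity subclassification can be expressed as a simple rule. The sketch below assumes the percentage difference is taken relative to the GT volume (the reference for the percentage is not stated explicitly in the text), and the function name is hypothetical.

```python
# Error-severity subcategories based on the relative difference between the
# predicted pancreas volume (PPV) and the GT volume; cutoffs follow the text,
# and the GT-relative denominator is an assumption.
def classify_volume_error(ppv_cm3, gt_cm3):
    pct_diff = abs(ppv_cm3 - gt_cm3) / gt_cm3 * 100.0
    if pct_diff < 15:
        return "minor"
    elif pct_diff <= 50:
        return "intermediate"
    return "major"

print(classify_volume_error(62.0, 76.6))  # eg, an undersegmented case -> "intermediate"
```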

Statistical Analyses

Statistical analyses were performed with Python software (version 3.7.8; Python Software Foundation, Wilmington, DE) or R software (version 4.1.3; r-project.org). Equivalence was determined using two 1-sided tests (TOST); the test statistic and P value were returned for both the lower-threshold and the upper-threshold tests (statsmodels package 0.14.0). A Brown-Forsythe test was used to test for equal variance (scipy package v1.5.4).18 The equivalence margin, with half-width delta (δ), defines the clinically acceptable range of differences between pancreas volumes on 2 sets of measurements.19 A P value <0.05 on the equivalence test denotes rejection of the null hypothesis that the difference between the 2 sets of measurements lies outside the equivalence margin (ie, is smaller than −δ or larger than +δ). The total sample size needed was calculated to confirm that the tests were adequately powered. The SD across all radiologists' manual segmentations compared with STAPLE-derived GT was used as the δ for the subsequent equivalence tests.
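
For illustration, a minimal sketch of the TOST equivalence test and the Brown-Forsythe test in Python, assuming per-case volume arrays (synthetic placeholders here) and δ = 5 cm3 as reported in the Results:

```python
# TOST equivalence test and Brown-Forsythe equal-variance test.
# Volume arrays are synthetic placeholders, not study data.
import numpy as np
from statsmodels.stats.weightstats import ttost_paired
from scipy import stats

rng = np.random.default_rng(0)
gt_vol = rng.normal(76.6, 20.2, 294)           # STAPLE-derived GT volumes, cm3 (synthetic)
cnn_vol = gt_vol - rng.normal(4.5, 7.0, 294)   # CNN volumes, cm3 (synthetic)

delta = 5.0  # equivalence margin half-width, cm3

# Two one-sided tests: H0 is that the mean paired difference lies outside (-delta, +delta).
p_overall, lower_test, upper_test = ttost_paired(cnn_vol, gt_vol, -delta, delta)
print(f"TOST P = {p_overall:.4f}")

# Brown-Forsythe test for equal variances (Levene test centered on the median).
w_stat, p_bf = stats.levene(cnn_vol, gt_vol, center="median")
print(f"Brown-Forsythe P = {p_bf:.4f}")
```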

Intraclass correlation coefficient (ICC; R psych package) and concordance correlation coefficient (CCC) were used for evaluating interreader and intrareader reliability, respectively, and for comparison of the predicted pancreas volumes from CNN segmentations with STAPLE-derived GT volumes.20 The Shapiro-Wilk test (scipy package v1.5.4) was used to check the normality of the pancreas volumes from the 3 radiologists' segmentations. The Friedman test with pairwise Conover post hoc tests was used to determine whether there were significant differences between the manually segmented pancreas volumes. For comparisons of pancreas volumes, the mean volume error between any 2 sets of segmentations was evaluated with Bland-Altman analyses.
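
For illustration, a minimal sketch of the parts of these analyses that can be run directly in Python (Shapiro-Wilk, Friedman, Lin's concordance correlation coefficient, and Bland-Altman limits of agreement); the ICC and Conover post hoc tests, which were run in R, are not reproduced here, and the volume arrays are synthetic placeholders:

```python
# Reliability and agreement analyses in Python; volume arrays are synthetic placeholders.
import numpy as np
from scipy import stats

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two paired measurements."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.cov(x, y, bias=True)[0, 1]
    return 2.0 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

def bland_altman(x, y):
    """Mean paired difference and 95% limits of agreement (mean +/- 1.96 SD)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return d.mean(), d.mean() - 1.96 * d.std(ddof=1), d.mean() + 1.96 * d.std(ddof=1)

# Synthetic per-case volumes (cm3) standing in for the 3 readers' measurements.
rng = np.random.default_rng(0)
r1 = rng.normal(75.0, 20.1, 294)
r2 = rng.normal(78.6, 20.0, 294)
r3 = rng.normal(77.0, 20.0, 294)

print(stats.shapiro(r1))                    # Shapiro-Wilk normality check (one reader)
print(stats.friedmanchisquare(r1, r2, r3))  # Friedman test across the 3 readers
print(lins_ccc(r1, r2))                     # concordance between two sets of volumes
print(bland_altman(r1, r2))                 # mean difference and limits of agreement
```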

RESULTS

Volumetric Pancreas Segmentation

The mean (SD) pancreas volumes from the radiologists' segmentations were as follows: R1, 75.0 (20.1) cm3; R2, 78.6 (20.0) cm3; and R3, 77.0 (20.0) cm3 (Table 1). The SD across all radiologists' manual segmentations compared with STAPLE-derived GT was 5 cm3, which was the δ used for the equivalence tests. The pancreas volumes from the 3 radiologists' segmentations were equivalent to the STAPLE-derived GT pancreas volumes (76.6 [20.2] cm3) (all P values, <0.05) (Supplementary Fig. 1, https://links.lww.com/RCT/A149, Supplementary Fig. 2, https://links.lww.com/RCT/A150). Compared with R1, both R2 and R3 tended to oversegment the pancreas, as evident from their higher mean pancreas volumes. The mean pancreas volume difference between R2-R1, R3-R1, and R2-R3 was 3.6 cm3 (P = 0.02), 2.0 cm3 (P = 0.22), and 1.6 cm3 (P = 0.34), respectively (Fig. 2). Nevertheless, the interreader reliability for manual segmentations was high, with a mean (SD) DSC of 0.87 (0.04) between R2-R1, 0.90 (0.05) between R3-R1, and 0.87 (0.04) between R2-R3.

TABLE 1 - Metrics for Different Sets of Segmentations When Compared Against STAPLE-Derived GT

Reader/Method   DSC, Mean (SD)   JC, Mean (SD)   PPV, Mean (SD), cm3   PPV Difference, Mean (SD)
R1 manual       0.95 (0.03)      0.90 (0.05)     75.0 (20.1)           1.55 (3.78)
R2 manual       0.92 (0.04)      0.85 (0.06)     78.6 (20.0)           −2.02 (6.17)
R3 manual       0.95 (0.03)      0.91 (0.05)     77.0 (20.0)           −0.47 (3.96)
CNN             0.88 (0.05)      0.79 (0.07)     72.0 (19.3)           4.51 (7.07)
R1 c-CNN        0.89 (0.04)      0.81 (0.06)     74.5 (20.4)           2.07 (7.11)
R2 c-CNN        0.89 (0.03)      0.81 (0.05)     74.2 (19.9)           2.4 (5.16)
R3 c-CNN        0.89 (0.04)      0.81 (0.06)     73.1 (19.4)           3.42 (5.39)

PPV indicates predicted pancreas volume.

FIGURE 2:
Interreader reliability assessment. A, B, Bland-Altman analysis and correlation plot of PPV of R2 versus R1 manual segmentations. Mean PPV difference of 3.6 cm3 (−11.54 to 18.67 cm3) and a 0.91 CCC. C, D, Bland-Altman analysis and correlation plot of PPV of R3 versus R1 manual segmentations. Mean PPV difference of 2.0 cm3 (−10.7 to 14.74 cm3) and a 0.95 CCC. E, F, Bland-Altman analysis and correlation plot of PPV of R2 versus R3 manual segmentations. Mean PPV difference of 1.6 cm3 (−18.93 to 15.83 cm3) and a 0.90 CCC. CCC, concordance correlation coefficient; PPV, predicted pancreas volume.

The mean (SD) pancreas volumes of CNN segmentation and c-CNN segmentations were 72.0 (19.3) cm3 and 73.9 (19.9) cm3, respectively (Table 1). Although both c-CNN and CNN pancreas volumes were noninferior, only c-CNN pancreas volumes were equivalent to STAPLE-derived GT (Fig. 3) (CNN vs STAPLE-derived GT: noninferiority P value, <0.0001; equivalence P value, 0.13; c-CNN vs STAPLE-derived GT: noninferiority and equivalence P values, <0.0001) (Supplementary Fig. 3, https://links.lww.com/RCT/A151, Supplementary Fig. 4, https://links.lww.com/RCT/A152).

FIGURE 3:
Convolutional neural network, c-CNN, and STAPLE-derived GT equivalence analysis. A, Convolutional neural network PPVs were noninferior but not equivalent to STAPLE-derived GT (noninferiority P value, <0.0001; equivalence P value, 0.13). B, Mean c-CNN segmentation (c-CNN) PPVs were equivalent to STAPLE-derived GT (equivalence P value, <0.0001). C, Convolutional neural network PPVs were equivalent to mean c-CNN segmentation (equivalence P value, <0.0001). Dotted lines indicate the equivalence margin (δ = 5 cm3). Mean ± 95% CI. CI, confidence interval; PPV, predicted pancreas volume.

The CNN model showed high performance for pancreas segmentation, with a mean (SD) DSC of 0.88 (0.05) and a mean (SD) JC of 0.79 (0.07). Many of the CNN segmentations were accepted by the radiologists without any modification (R1, 117 of 294 [40%]; R2, 89 of 294 [30%]; R3, 151 of 294 [51%]). Moreover, the 3 radiologists had a nearly identical mean (SD) DSC of 0.89 (0.04) and JC of 0.81 (0.06) for their respective c-CNN segmentations when compared with the STAPLE-derived GT. Overall, cases with smaller c-CNN volumes were associated with lower DSC and JC scores (Supplementary Fig. 5, https://links.lww.com/RCT/A153). There was high intrareader reliability (concordance correlation coefficient, 0.91–0.93) for each of the 3 radiologists, as assessed by comparing the pancreas volumes from their c-CNN segmentations with their respective manual segmentations (Fig. 4). R1 had a minimal pancreas volume difference of −0.5 cm3, whereas R2 and R3 had differences of −4.4 cm3 and −3.9 cm3, respectively (Fig. 5). Interreader reliability among the 3 radiologists was significantly higher for c-CNN (ICC mean, 0.96; 95% confidence interval, 0.95–0.97) than for manual segmentation (ICC mean, 0.93; 95% confidence interval, 0.91–0.94; P value <0.0001).

FIGURE 4:
Accuracy of CNN model. A, Axial CT image of a morphologically normal pancreas. B, Color-coded areas represent R1 manual (red), c-CNN (dark yellow), and CNN segmentations (green). Note the high degree of overlap of the segmentations due to high accuracy of the CNN model. Figure 4 can be viewed online in color at www.jcat.org.
FIGURE 5:
Intrareader reliability assessment. A, B, Bland-Altman analysis and correlation plots of R1 PPV between manual and c-CNN segmentation. Mean PPV difference of −0.52 cm3 (−16.1 to 15.0 cm3) and a 0.92 CCC. C, D, Bland-Altman analysis and correlation plots of R2 PPV between manual and c-CNN segmentation. Mean PPV difference of −4.41 cm3 (−18.6 to 9.8 cm3) and a 0.91 CCC. E, F, Bland-Altman analysis and correlation plots of R3 PPV between manual and c-CNN segmentation. Mean PPV difference of −3.9 cm3 (−17.0 to 9.2 cm3) and a 0.93 CCC. CCC, concordance correlation coefficient; PPV, predicted pancreas volume.

There were 73 CNN segmentations with a DSC below the 25th percentile compared with the STAPLE-derived GT (DSC, <0.86). Segmentation errors in these cases were classified as oversegmentation (n = 19 [26%]), undersegmentation (n = 47 [64%]), or both (n = 7 [11%]; eg, exclusion of the distal pancreatic tail with inclusion of a part of the duodenum). These errors were subclassified as minor (n = 34 [46.6%]) or intermediate (n = 39 [53.4%]) according to the difference in predicted pancreas volume compared with the STAPLE-derived GT. There were no major segmentation errors. Common areas of undersegmentation were the pancreatic head or the distal tail (Fig. 6). Oversegmentation, on the other hand, tended to include the duodenum near the pancreatic head, the jejunum close to the junction of the pancreatic body and tail, or, less frequently, the stomach wall. In 1 case, the left adrenal gland was included in the segmentation.

FIGURE 6:
Undersegmentation error. A, Axial CT scan of a case at the lower range of model performance. B, Color-coded areas represent CNN segmentation (green) and corrected segmentation (c-CNN) (yellow). Convolutional neural network segmentation excluded a part of distal tail in the volume. Figure 6 can be viewed online in color at www.jcat.org.

Time Efficiency

The mean (SD) inference time for the CNN (using a central processing unit) was 1.2 (0.3) minutes. The mean (range) time needed for manual versus c-CNN segmentation was as follows: R1, 3.03 (1.07–5.8) minutes versus 0.67 (0.0–6.25) minutes; R2, 2.53 (1.12–4.93) minutes versus 0.33 (0.0–2.02) minutes; and R3, 4.53 (2.03–15.3) minutes versus 0.8 (0.0–10.28) minutes (Fig. 7). Thus, the CNN-enhanced workflow yielded substantial time savings over manual segmentation, averaging 77.9%, 87.0%, and 82.3% for R1, R2, and R3, respectively (P < 0.0001).

FIGURE 7:
Time efficiency evaluation. Box plots compare the time taken by each radiologist for manual versus c-CNN segmentation. Substantial time savings were noted by all 3 radiologists.

DISCUSSION

In this validation study, the CNN model had a high DSC of 0.88 for volumetric pancreas segmentation when compared with STAPLE-derived composite GT from a large data set of standard-of-care portal venous phase CTs. This high DSC compares favorably with the interreader reliability between the radiologists in the current study (DSC, 0.86–0.90) and with the model's performance in our prior report.12 Of note, unlike in our prior study, the manual pancreas segmentation process involved the use of the "Logical operators" tool in 3D Slicer, which minimized the inclusion of peripancreatic fat in the segmented volume. Since publication of the prior report, we have continued to augment the CNN model using CTs with the pancreas segmented through this tool. Furthermore, many of the model's volumetric pancreas segmentations required no further modification on subjective review by the radiologists, which highlights the potential of the model to provide fully automated high-fidelity segmentations. The average time taken by the radiologists to review and, if needed, modify the CNN segmentations was 0.6 minutes, whereas the average time per study for manual segmentation was 3.37 minutes. Thus, substantial efficiency was gained by all 3 radiologists with the CNN-enhanced workflow regardless of their post-residency experience or experience in using 3D Slicer. Our analysis of the CNN segmentations below the 25th percentile of the DSC range identified only minor-to-intermediate errors. Undersegmentation was the most common error in CNN segmentations, with the distal pancreatic tail or the inferior-most uncinate process being the most common sites of exclusion by the CNN. These areas can be difficult to distinguish from the adjacent isodense anatomy even for expert radiologists. Thus, we conclude that this CNN model can consistently provide highly accurate, efficient, and automated volumetric segmentation of the morphologically normal pancreas on standard-of-care CT.

In general, imaging AI studies rely upon GT segmentations provided by a single reader.4,6,21 However, data set annotation by multiple readers is ideal because it facilitates accounting for intrareader and interreader variability, which often depends upon the experience of the readers.11,22 For instance, in the current study, the radiologists' prior experience with 3D Slicer influenced both the time required for and the pattern of manual segmentation. The radiologist with no prior experience in using 3D Slicer (R3) took the longest time (mean, 4.53 minutes) compared with the more experienced radiologists (R1 and R2; 3.03 and 2.53 minutes, respectively). On the other hand, the radiologist who took the least time for both manual and c-CNN segmentations (R2) also had the lowest intrareader reliability. Regarding the pattern of segmentation, the most experienced radiologist (R1) tended to undersegment the pancreas, whereas the 2 other radiologists (R2 and R3) tended to oversegment the pancreas at its interface with either the duodenum or the peripancreatic vessels. These findings highlight the intrareader and interreader variability that is inevitable in manual annotation of imaging data sets.

To account for such variability while defining the GT, we chose to use the STAPLE algorithm to arrive at the composite GT.16 STAPLE is a weighted-voting algorithm that initially combines the individual manual segmentations into a test segmentation through pixel-level voting. It then compares the individual manual segmentations against this test segmentation. A second test segmentation is created by weighting the votes of the individual radiologists according to their accuracy on the initial test segmentation. This iterative process of accuracy estimation and refinement of the test segmentation is repeated until the test segmentation converges, that is, stops changing. STAPLE is regarded as the criterion-standard algorithm for definition of GT from multiple radiologists. One of its many advantages is that it enables performance evaluation of any individual rater, including the candidate AI model, because the final GT is not dominated by any single rater's contribution. In our study, all 3 sets of segmentations, that is, manual, CNN, and c-CNN, had pancreas volumes comparable with the STAPLE-derived GT volume (76.6 [20.2] cm3).

We chose to develop and validate the pancreas segmentation model on standard-of-care portal venous phase CT scans. Computed tomography is one of the primary imaging modalities for evaluation of the abdomen as well as for investigation of pancreatic pathologies.6,10 Its advantages include the ability to provide volumetric data at isotropic resolution within a few seconds of breath hold. The utility of CT for quantitative imaging of the pancreas is on the rise on account of recently introduced acquisition and postprocessing technologies.23 Second, the portal venous phase is the most common phase of acquisition in clinical practice, and its image contrast is adequate for distinguishing the pancreas from adjacent vessels.10 Thus, the application of AI to CT offers the prospect of accurate estimation of pancreas volume, which is particularly important when serial monitoring is the goal, such as in the context of immunotherapy or diabetes.24–29 For instance, there is emerging evidence that serial change in pancreas volume can be a biomarker of oncologic outcomes in patients on immunotherapy.26–29

A validated and efficient pancreas segmentation workflow is a critical component of many pancreas AI applications. For instance, accurate pancreas segmentation is often the first step in the investigation of radiomics and deep learning for early detection of pancreatic cancer on prediagnostic CT scans, because the pancreas tends to be morphologically normal on prediagnostic CTs in the majority of patients with pancreatic cancer.12,30–32 There is also increasing interest in the application of AI to cross-sectional imaging studies of endocrine diseases such as diabetes.24,25 Recent studies highlight the potential of morphometric and radiomic features extracted from the pancreas on CT for the screening of early diabetes and the prediction of subsequent occurrence of diabetes.1,25,33 However, these studies have used either manual 2D regions of interest around the pancreas or slice-by-slice manual segmentation of the pancreas in small cohorts.1,34 The challenge of creating large data sets with volumetrically segmented pancreas is one of the critical barriers to advancing such investigations beyond the proof-of-concept stage. A validated AI-driven pancreas segmentation workflow, such as the one proposed in the current study, can provide a discovery tool to accelerate research in quantitative pancreas imaging.

Our study had limitations. First, the time taken for segmentation was self-recorded by each radiologist; some variation could have been introduced by tasks such as manually starting and stopping the timer. Second, each radiologist used their regular workstation for this study, so variation could also have been introduced by differences in the speed and performance of the respective software and hardware. However, the magnitude of time savings observed with the CNN-enhanced workflow versus manual segmentation was comparable among all radiologists. We also believe that the relative benefit of the CNN-enhanced workflow will be even greater in the clinical picture archiving and communication system environment because these systems are not equipped with volumetric segmentation aids such as those in 3D Slicer, which makes manual segmentation even more cumbersome. Finally, our study focused on segmentation of the morphologically normal pancreas by expert radiologists; consequently, the CNN model has not yet been trained to segment a diseased pancreas with anatomy distorted by entities such as cysts or inflammation.

In conclusion, our CNN model can consistently provide high performance for volumetric pancreas segmentation on standard-of-care portal venous phase CTs vis-à-vis expert manual segmentations. The CNN-enhanced workflow delivered substantial time savings compared with manual segmentation, even after accounting for radiologist review of the initial automated segmentations. Thus, CNN-powered volumetric pancreas segmentation offers an accurate and scalable discovery tool for potential clinical translation of insights from quantitative pancreas imaging for endocrine and exocrine pancreatic diseases.

REFERENCES

1. Lu CQ, Wang YC, Meng XP, et al. Diabetes risk assessment with imaging: a radiomics study of abdominal CT. Eur Radiol. 2019;5:2233–2242.
2. Chu LC, Park S, Kawamoto S, et al. Utility of CT Radiomics features in differentiation of pancreatic ductal adenocarcinoma from normal pancreatic tissue. AJR Am J Roentgenol. 2019;213:349–357.
3. Park S, Chu LC, Hruban RH, et al. Differentiating autoimmune pancreatitis from pancreatic ductal adenocarcinoma with CT radiomics features. Diagn Interv Imaging. 2020;9:555–564.
4. Janssens LP, Weston AD, Singh D, et al. Determining age and sex-specific distribution of pancreatic whole-gland CT attenuation using artificial intelligence aided image segmentation: associations with body composition and pancreatic cancer risk. Pancreatology. 2021;21:1524–1530.
5. Lafata KJ, Wang Y, Konkel B, et al. Radiomics: a primer on high-throughput image phenotyping. Abdom Radiol (NY). 2021.
6. Bagheri MH, Roth H, Kovacs W, et al. Technical and clinical factors affecting success rate of a deep learning method for pancreas segmentation on CT. Acad Radiol. 2020;27:689–695.
7. Suman G, Panda A, Korfiatis P, et al. Development of a volumetric pancreas segmentation CT dataset for AI applications through trained technologists: a study during the COVID 19 containment phase. Abdom Radiol (NY). 2020;45:4302–4310.
8. Kumar H, DeSouza SV, Petrov MS. Automated pancreas segmentation from computed tomography and magnetic resonance images: a systematic review. Comput Methods Programs Biomed. 2019;178:319–328.
9. Park S, Chu LC, Fishman EK, et al. Annotated normal CT data of the abdomen for deep learning: challenges and strategies for implementation. Diagn Interv Imaging. 2020;101:35–44.
10. Anderson SW, Soto JA. Pancreatic duct evaluation: accuracy of portal venous phase 64 MDCT. Abdom Imaging. 2009;34:55–63.
11. Suman G, Patra A, Korfiatis P, et al. Quality gaps in public pancreas imaging datasets: implications & challenges for AI applications. Pancreatology. 2021;21:1001–1008.
12. Panda A, Korfiatis P, Suman G, et al. Two-stage deep learning model for fully automated pancreas segmentation on computed tomography: comparison with intra-reader and inter-reader reliability at full and reduced radiation dose on an external dataset. Med Phys. 2021;48:2468–2481.
13. "MIRC CTP". Available at: http://mircwiki.rsna.org/index.php?title=CTP-The_RSNA_Clinical_Trial_Processor. Accessed June 25, 2021.
14. Fedorov A, Beichel R, Kalpathy-Cramer J, et al. 3D Slicer as an image computing platform for the quantitative imaging network. Magn Reson Imaging. 2012;30:1323–1341.
15. Warfield SK, Zou KH, Wells WM. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans Med Imaging. 2004;23:903–921.
16. Weston A. How to use the STAPLE algorithm to combine multiple image segmentations. Towards Data Science. 2021. Available at: https://towardsdatascience.com/how-to-use-the-staple-algorithm-to-combine-multiple-image-segmentations-ce91ebeb451e/. Accessed February 5, 2022.
17. Yushkevich PA, Piven J, Hazlett HC, et al. User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability. NeuroImage. 2006;31:1116–1128.
18. "scipy.stats.ttest_ind". Available at: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html. Accessed March 30, 2021.
19. Ahn S, Park SH, Lee KH. How to demonstrate similarity by using noninferiority and equivalence statistical testing in radiology research. Radiology. 2013;267:328–338.
20. Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008;61(Pt 1):29–48.
21. Farag A, Le Lu, Roth HR, et al. A bottom-up approach for pancreas segmentation using cascaded superpixels and (deep) image patch labeling. IEEE Trans Image Process. 2017;26:386–399.
22. Willemink MJ, Koszek WA, Hardell C, et al. Preparing medical imaging data for machine learning. Radiology. 2020;295:4–15.
23. Almeida RR, Lo GC, Patino M, et al. Advances in pancreatic CT imaging. AJR Am J Roentgenol. 2018;211:52–66.
24. McCleary BM, Trout AT, Fei L, et al. Healthy pancreatic parenchymal volume and its relationship to exocrine function. Pediatr Radiol. 2020;50:684–688.
25. Al-Mrabeh A, Hollingsworth KG, Shaw JAM, et al. 2-year remission of type 2 diabetes and pancreas morphology: a post-hoc analysis of the DiRECT open-label, cluster-randomised trial. Lancet Diabetes Endocrinol. 2020;8:939–948.
26. Oshiro Y, Nishida K, Shimazaki J, et al. Investigation of morphological and functional changes in the liver and pancreas during bevacizumab treatment. Scand J Gastroenterol. 2020;55:712–717.
27. Shinagare AB, Steele E, Braschi-Amirfarzan M, et al. Sunitinib-associated pancreatic atrophy in patients with gastrointestinal stromal tumor: a toxicity with prognostic implications detected at imaging. Radiology. 2016;281:140–149.
28. Eshet Y, Baruch EN, Shapira-Frommer R, et al. Clinical significance of pancreatic atrophy induced by immune-checkpoint inhibitors: a case-control study. Cancer Immunol Res. 2018;6:1453–1458.
29. Abu-Sbeih H, Tang T, Lu Y, et al. Clinical characteristics and outcomes of immune checkpoint inhibitor-induced pancreatic injury. J Immunother Cancer. 2019;7:31.
30. Toshima F, Watanabe R, Inoue D, et al. CT abnormalities of the pancreas associated with the subsequent diagnosis of clinical stage I pancreatic ductal adenocarcinoma more than 1 year later: a case-control study. AJR Am J Roentgenol. 2021;217:1353–1364.
31. Young MR, Abrams N, Ghosh S, et al. Prediagnostic image data, artificial intelligence, and pancreatic cancer: a tell-tale sign to early detection. Pancreas. 2020;49:882–886.
32. Singh DP, Sheedy S, Goenka AH, et al. Computerized tomography scan in pre-diagnostic pancreatic ductal adenocarcinoma: stages of progression and potential benefits of early intervention: a retrospective study. Pancreatology. 2020;20:1495–1501.
33. Lim S, Bae JH, Chun EJ, et al. Differences in pancreatic volume, fat content, and fat density measured by multidetector-row computed tomography according to the duration of diabetes. Acta Diabetol. 2014;51:739–748.
34. Jang S, Kim JH, Choi SY, et al. Application of computerized 3D-CT texture analysis of pancreas for the assessment of patients with diabetes. PLoS One. 2020;15:e0227492.
Keywords:

deep learning; artificial intelligence; pancreas; pancreatic neoplasms; workflow


Copyright © 2022 Wolters Kluwer Health, Inc. All rights reserved.