Glioblastoma (GB) is the most frequent primary brain cancer.1 Its diffuse, highly invasive growth and its intratumor heterogeneity make it the most lethal cancer of the central nervous system.2 Despite aggressive combination therapy with surgery followed by radiation plus concomitant and adjuvant chemotherapy, the median survival time is still only 15 to 17 months.3 Magnetic resonance imaging (MRI) is commonly used to noninvasively evaluate the location, size, spread, edema, and biological status of the tumor.4 Magnetic resonance imaging is part of the standard clinical workup for GB management, used for planning and follow-up of surgery, chemotherapy, and radiation.
Detection of the tumor and determination of the location and extension of the different tumor compartments are important for surgery planning. It has been shown in large studies that resection of 98% or more of the tumor volume is associated with longer survival time.5 Furthermore, Hammoud et al6 showed that MRI features such as signs of little or no necrosis and lower tumor enhancement are associated with longer survival time. Determination of the tumor extent is also highly relevant to radiomics,7 an emerging imaging-based method for extracting quantitative features from standard-of-care imaging to establish predictive models.8 Furthermore, better identification of the different biological areas of the tumor improves the precision of tissue targeting in biopsy.
In radiological reading, tumor and compartment segmentation is not done routinely but would offer important additional clinical value in the aforementioned approaches. Manual expert segmentation is still considered the gold standard, but it is time-consuming and highly variable depending on the level of expertise.9 Consequently, there remains an unmet need for fully automatic, user-independent detection and segmentation tools with the potential to become an integral part of the clinical reading workflow.
Over the last decade, significant progress has been made in computer-assisted and machine learning–based algorithms for the identification and segmentation of brain lesions. Several main categories of approaches can be identified: semiautomatic and fully automatic approaches of various complexity, from simple thresholding or region growing algorithms to comprehensive model-based, supervised, and unsupervised machine learning algorithms.10,11 To compare the different brain tumor segmentation algorithms objectively, the Multimodal Brain Tumor Image Segmentation Benchmark (BRATS) challenge has been organized since 2012.12 In the latest BRATS benchmark for pretreatment segmentation of brain tumors, the evaluated algorithms demonstrated results comparable with interrater variability.13
Recently, deep learning–based methods have shown promising and clinically relevant results.14–16 Deep learning techniques, with their high model capacity and ability to learn highly discriminative features, often outperform hand-designed feature sets. In particular, 2-dimensional and 3-dimensional (3D) convolutional neural networks (CNNs) show promising results on clinical imaging data.17,18 However, evaluations of these algorithms on routine clinical data are still lacking. We chose DeepMedic, a 3D CNN–based algorithm, for evaluation.18 This algorithm is fully automatic and achieved high scores on the BRATS data. It operates on the most routinely performed MRI series: T1-weighted (T1w), contrast-enhanced T1-weighted (CE T1w), T2-weighted (T2w), and FLAIR.
The aims of this study were as follows:
- To evaluate a state-of-the-art, fully automatic brain tumor segmentation compared with manual annotations by expert readers.
- To evaluate variations of the segmentation results with highly heterogeneous clinical MRI examinations performed in multiple institutions, using different acquisition protocols and scanners from different vendors.
MATERIALS AND METHODS
In this retrospective study, we included consecutive patients with newly diagnosed, supratentorial GBs, eligible for surgical resection, referred to and treated in our institution between 2010 and 2014. Patients with infratentorial and secondary GBs were excluded.
A total of 64 patients with biopsy-proven GB and complete preoperative MR examinations (T1w, T2w, FLAIR, and CE T1w) were reviewed.
Patients' characteristics are summarized in Table 1. The study was approved by the Local Research Ethics Commission with waived informed consent.
Magnetic Resonance Imaging
Magnetic resonance images were acquired on 8 different scanner types at a total of 15 institutions. Thirty-four of the 64 MR examinations were conducted at our institution; the remaining 30 were conducted at the 14 referring institutions. The majority of images were acquired at 1.5-T field strength. Details of the acquisitions are shown in Table 1. T1w, T2w, FLAIR, and CE T1w series were acquired according to the standard clinical acquisition protocols of the different scanners. Slice thickness ranged from 1 to 8 mm, and in-plane resolution from 0.3 to 1 mm (Table 1).
For evaluation of the deep learning model, ground truth tumor compartments were delineated manually. This was performed using a semiautomatic approach with subsequent manual editing (IntelliSpace Discovery; Philips Healthcare, Best, the Netherlands), both performed by a radiologist and a senior neuroradiologist in a consensus reading. In addition, the segmented volumes of interest (VOIs) were compared with the volumes segmented independently by neurosurgeons using the iPlan software (Brainlab GmbH, Feldkirchen, Germany). In case of discrepancies, the segmentation was reviewed until consensus was reached.
The procedure followed the BRATS challenge.13 Volumes of interest were created for (a) the whole tumor, (b) contrast-enhancing tumor, and (c) tumor necrosis. The whole tumor VOI was segmented on the T2w and FLAIR sequences, including the contrast-enhancing and tumor necrosis compartment. Contrast-enhancing tumor and necrosis were delineated on the CE T1w series.
Workflow for Automatic Segmentation Pipeline
The overall workflow for automatic tumor segmentation was built as shown in Figure 1. Four series (T1w, T2w, FLAIR, and CE T1w) were first preprocessed, then automatic tumor segmentation was performed using the trained deep learning model, followed by postprocessing of the output VOIs. The whole workflow was performed completely automatically.
Magnetic resonance images were preprocessed with established tools (SPM8: Statistical Parametric Mapping software package version 8; Wellcome Trust Centre for Neuroimaging, London, United Kingdom; IntelliSpace Discovery; Philips Healthcare, Best, the Netherlands) before being fed into the automatic segmentation. The preprocessing pipeline is shown in Figure 2. All 4 series were bias field corrected. In a second step, the corrected T1w, T2w, and FLAIR images were coregistered to the reference space defined by the CE T1w series. Consequently, the ground truth VOIs for contrast-enhancing tumor and necrosis were, by definition, in the same reference space. In the third step, the manual segmentation of the whole tumor VOI was aligned to the reference space using the 6-parameter transformation obtained from the FLAIR image. Then, a brain mask was computed (SPM8 "New Segmentation"19) and applied to obtain skull-stripped images. Finally, images were normalized to zero mean and a standard deviation of 1 and resampled to an isotropic resolution of 1 × 1 × 1 mm³.18 The image processing pipeline was executed fully automatically without user interaction. Processing results were visually checked for quality control.
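The final two preprocessing steps (intensity normalization and isotropic resampling) can be sketched as follows. This is an illustrative Python/NumPy reimplementation, not the authors' code; the function name and the assumption that skull-stripped background voxels are zero are ours.

```python
import numpy as np
from scipy.ndimage import zoom

def normalize_and_resample(volume, voxel_size_mm, target_mm=(1.0, 1.0, 1.0)):
    """Normalize an MR volume to zero mean / unit standard deviation and
    resample it to isotropic resolution (the last two preprocessing steps)."""
    # Z-score normalization; we assume the skull-stripped background is 0
    # and compute the statistics over nonzero (brain) voxels only.
    brain = volume != 0
    mean, std = volume[brain].mean(), volume[brain].std()
    normalized = np.where(brain, (volume - mean) / std, 0.0)
    # Resampling: zoom factor along each axis = voxel size / target size.
    factors = [v / t for v, t in zip(voxel_size_mm, target_mm)]
    return zoom(normalized, factors, order=1)  # trilinear interpolation

# Example: a toy volume with 2-mm slices resampled to 1 mm isotropic.
rng = np.random.default_rng(0)
vol = rng.random((40, 64, 64)) + 1.0  # all-nonzero toy "brain"
out = normalize_and_resample(vol, voxel_size_mm=(2.0, 1.0, 1.0))
```

After resampling, the slice dimension doubles (40 slices of 2 mm become 80 of 1 mm), while the intensity distribution remains approximately zero mean and unit variance.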
For automatic segmentation of tumor compartments, a multiparametric deep learning model was applied to the preprocessed data.
Deep Learning Model
The deep learning model is based on the recently published DeepMedic architecture, which provided top scoring results on the BRATS data set.18 The DeepMedic architecture was installed on a graphics processing unit server at our institution.
The model was trained on an independent data set available through the BRATS 2015 challenge. The training data consisted of 220 cases of GB with expert manual segmentations of the tumor compartments.13 The 220 cases were split into 190 for training and 30 for validation during the training procedure. The data was preprocessed as described previously.
The DeepMedic architecture consists of a deep 3D CNN followed by a 3D fully connected network to remove false positives. The 3D CNN includes 2 pathways that process different image resolutions to capture characteristics of the tumor appearance at 2 different spatial ranges. Inputs to the 2 pathways are centered at the same image location, but for the second input, the image is down-sampled to a third of its original size. The model consists of an 11-layer architecture with kernels of size 3³. The last layers of the 2 pathways have receptive fields of 17³ voxels. For inference, image segments of 45³ voxels are fed into the model. Finally, a fully connected conditional random field is applied, which has a smoothing effect.18
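The dual-pathway input scheme can be illustrated with a short sketch: both pathways see the same image location, but the second one covers a 3 times wider field of view downsampled to the same array size. This is not the authors' implementation; the patch size of 25³ is an arbitrary choice for the example.

```python
import numpy as np
from scipy.ndimage import zoom

def dual_pathway_inputs(volume, center, patch=25, factor=3):
    """Extract two DeepMedic-style pathway inputs centered at the same
    location: a full-resolution crop, and a crop with a `factor`-times
    wider field of view downsampled to the same array size."""
    z, y, x = center
    h = patch // 2
    # Full-resolution pathway input (patch^3 voxels).
    normal = volume[z - h:z + h + 1, y - h:y + h + 1, x - h:x + h + 1]
    # Context pathway: crop (patch * factor)^3 voxels, then downsample by
    # `factor`, ie, to a third of its original size for factor = 3.
    H = (patch * factor) // 2
    wide = volume[z - H:z + H + 1, y - H:y + H + 1, x - H:x + H + 1]
    context = zoom(wide, 1.0 / factor, order=1)
    return normal, context

vol = np.random.default_rng(1).random((100, 100, 100))
normal, context = dual_pathway_inputs(vol, center=(50, 50, 50))
```

Both returned arrays have the same shape, so they can be fed into two parallel convolutional pathways whose outputs are later concatenated.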
The deep learning model resulted in automatic segmentation of 4 tumor compartments (edema, contrast-enhancing tumor, necrosis, nonenhancing tumor). The whole tumor region was obtained as the union of all other segmented regions, as defined in the BRATS benchmark.13
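The whole tumor definition above amounts to a union of the compartment masks. A minimal NumPy illustration (the integer labels follow the BRATS convention of 0 for background and nonzero values for the compartments; the toy 2D slice is ours):

```python
import numpy as np

# One label per voxel: 0 = background, nonzero = a tumor compartment
# (eg, 1 = necrosis, 2 = edema, 3 = nonenhancing, 4 = enhancing tumor).
labels = np.array([[0, 2, 2],
                   [1, 4, 3],
                   [0, 0, 1]])

# Whole tumor = union of all segmented tumor compartments.
whole_tumor = labels > 0
print(int(whole_tumor.sum()))  # 6 tumor voxels in this toy slice
```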
To evaluate the automatic segmentation, the resulting VOIs were compared with the manual ground truth annotations. For the whole tumor, contrast-enhancing tumor, and necrosis, the VOIs were compared with respect to volume and voxel-wise accuracy. Accuracy was computed as the overlap of the ground truth segmentation (VOIgt) and the model segmentation (VOImodel) using the Dice similarity coefficient (DSC)20:

DSC = 2 |VOIgt ∩ VOImodel| / (|VOIgt| + |VOImodel|)
In addition, the sensitivity (true-positive rate) and positive predictive value (PPV) of the automatic segmentation were assessed at the voxel level using the following expressions13:

Sensitivity = |VOIgt ∩ VOImodel| / |VOIgt|,  PPV = |VOIgt ∩ VOImodel| / |VOImodel|
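These voxel-wise metrics can be computed directly from two binary masks; a minimal NumPy sketch (the toy masks are ours, for illustration):

```python
import numpy as np

def overlap_metrics(voi_gt, voi_model):
    """Voxel-wise Dice similarity coefficient, sensitivity, and positive
    predictive value for two binary segmentation masks."""
    gt = voi_gt.astype(bool)
    pred = voi_model.astype(bool)
    tp = np.logical_and(gt, pred).sum()          # true-positive voxels
    dsc = 2.0 * tp / (gt.sum() + pred.sum())     # overlap measure
    sensitivity = tp / gt.sum()                  # share of ground truth found
    ppv = tp / pred.sum()                        # share of prediction correct
    return dsc, sensitivity, ppv

# Toy example: two 6 x 6 x 6 cubes shifted against each other by one slice.
gt = np.zeros((10, 10, 10)); gt[2:8, 2:8, 2:8] = 1
pred = np.zeros((10, 10, 10)); pred[3:9, 2:8, 2:8] = 1
dsc, sens, ppv = overlap_metrics(gt, pred)  # each equals 5/6 here
```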
We investigated possible dependencies of the algorithm's accuracy on the size and shape of the tumor and on image resolution. To this end, DSCs were correlated (Pearson correlation) with volume, surface-to-volume ratio, and the lowest image resolution for each subject. The surface-to-volume ratio was calculated using the pyradiomics package21 on resampled (1 × 1 × 1 mm³) ground truth VOIs.
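The dependency analysis reduces to a per-subject correlation; a short sketch using SciPy (the DSC and volume values below are synthetic, for illustration only, and are not the study's data):

```python
import numpy as np
from scipy.stats import pearsonr

# Correlate per-subject segmentation accuracy (DSC) with a tumor property,
# here volume. Synthetic values, for illustration only.
dsc = np.array([0.82, 0.86, 0.79, 0.88, 0.84, 0.81])
volume_ml = np.array([34.0, 51.0, 22.0, 60.0, 45.0, 30.0])

r, p = pearsonr(dsc, volume_ml)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```

The same call would be repeated for surface-to-volume ratio and lowest image resolution.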
RESULTS

Fully automated image processing and tumor segmentation was completed for all 64 patients. After visual quality control, 2 patients were excluded due to GB location in the brain stem or incomplete coverage of the tumor area by the T2w series.
Manual segmentation of the whole tumor and contrast-enhancing tumor compartments was completed for all patients. Necrosis compartments were observed in 58 of the 62 patients.
Automatic Detection, Localization, and Segmentation of the Tumor
The deep learning model automatically detected, localized, and segmented the whole tumor and contrast-enhancing tumor in all 62 patients.
Necrosis was automatically detected, localized, and segmented in 53 of the 58 cases with ground truth necrosis. For the 5 cases without automatic detection (false-negatives), the mean necrosis volume on manual segmentation was 3.3 ± 1.9 mL. Further analysis of the necrosis VOI was restricted to the 53 cases where necrosis was correctly detected. Absence of the necrosis was correctly detected in 3 of 4 cases (in 1 case, the algorithm segmented a necrotic core of 1.5 mL).
Example cases with automatic segmentations are shown in Figure 3.
Volumes and results are reported in Table 2. Significant correlations with Pearson r > 0.8 (P < 0.0001) were found between the volumes of automatic and manual segmentations (Table 2, Fig. 4). Absolute volumes of automatic and manual segmentations were comparable, that is, no bias was observed in automatic segmentation.
High voxel-wise overlap was obtained for the whole and contrast-enhancing tumor volumes (DSC of 0.86 and 0.78, respectively). For the 53 patients with detected necrosis, an overlap with a DSC of 0.62 was observed. The high PPV of 0.89 indicates a low number of false-positive voxels; together with the lower DSC, this implies that false-negative voxels dominate the error, that is, automatic segmentation tends to underestimate necrosis. This is consistent with the smaller volumes of the automatic necrosis segmentations.
Correlation With Image and VOI Properties
Image and tumor properties may influence quality of automatic segmentation. To evaluate the effect, we correlated DSC for different VOIs with image resolution, VOI surface-to-volume ratio, and VOI volume (Table 3). The correlation scatter plots are shown in Figure 5. For all VOIs, no strong correlation was found between DSC and image resolution, VOI surface-to-volume ratio, and VOI volume.
The processing time of the deep learning model including postprocessing was less than 5 minutes per subject using an NVIDIA Tesla P100 graphics processing unit.
DISCUSSION

This study evaluated fully automatic detection and segmentation of brain tumors based on a deep learning algorithm and compared the results with manual annotations by expert readers. We furthermore investigated whether segmentation results vary across clinical MRI examinations from multiple institutions using different acquisition protocols and scanners from different vendors.
The whole tumor and CE tumor VOIs have been correctly detected and localized in all cases. The necrosis VOI was correctly detected and localized in 91% of the cases.
For the automatic segmentation, the algorithm we chose achieved top scoring results on the BRATS test data set as reported by Kamnitsas et al18 (see Table 4, row 2; DSC range, 0.63–0.85). The automatic segmentation results we achieved with this algorithm on our clinical routine data appeared to be slightly better (Table 4, row 1; DSC range, 0.78–0.86). One limitation of the deep learning algorithm is the requirement that all 4 MR input series (CE T1w, FLAIR, T1w, T2w) be present. If one of the series is not available, the proposed model cannot be applied.
It is of particular clinical importance that the automatic segmentation results are in the same range as the interrater variability (Table 4, row 4; DSC range, 0.74–0.85) reported by Menze et al.13
Furthermore, in the first pass of our manual segmentation procedure, we observed differences between neuroradiologists' and neurosurgeons' annotation in 45% of the cases for CE tumor VOIs and in 8% of the cases for the whole tumor VOIs, based on a volume discrepancy threshold of 30%.
Applying the same 30% volume threshold to the differences between the deep learning algorithm and the ground truth, we obtained discrepancies in 25% of the cases for CE tumor VOIs and in 8% of the cases for the whole tumor VOIs. This further supports our finding that the variability between automatic and ground truth segmentations is in the range of the variability of manual segmentations by expert readers.
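The discrepancy criterion used here can be made explicit, under our assumption that "30% volume discrepancy" means the absolute volume difference exceeding 30% of the reference (ground truth) volume:

```python
def discrepant(vol_auto_ml, vol_ref_ml, threshold=0.30):
    """Flag a case as discrepant when the absolute volume difference
    exceeds `threshold` (30%) of the reference volume."""
    return abs(vol_auto_ml - vol_ref_ml) / vol_ref_ml > threshold

print(discrepant(26.0, 40.0))  # True: the volumes differ by 35%
print(discrepant(36.0, 40.0))  # False: only a 10% difference
```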
To put this in perspective, we compare the results from our data with the selected deep learning algorithm to another high scoring algorithm on the BRATS data by Pereira et al.22 Our results are slightly better for the CE tumor segmentation (Pereira et al: DSC, 0.75; DeepMedic on our clinical data: DSC, 0.78).
The second part of the study addressed the variation of the segmentation results across the acquired clinical MRI examinations. In contrast to other studies,23,24 we did not use a standardized protocol across institutions. The data analyzed in our study were obtained using a wide range of MRI acquisition protocols (15 institutions, 8 MR scanner models, and, for example, slice thickness ranging from 1 to 8 mm). Within this range of scanner models, Philips 1.5-T scanners predominate in our data. However, because these scanners (Intera, Achieva, and Gyroscan models) are located at 6 different institutions and heterogeneous imaging protocols were applied on them, the data have sufficient heterogeneity to serve as a proof point for the wider applicability of the deep learning algorithm. Furthermore, the deep learning algorithm was trained on a completely independent, heterogeneous data set (the BRATS data), based on GE, Siemens, and Philips scanners.
Independent from the variations in the imaging protocols, we have observed high DSC for the automatic compared with the ground truth segmentation. We have found no correlation between the DSC and slice thickness. Furthermore, no relevant correlation was found between quality of automatic segmentation and VOI properties (surface-to-volume ratio and volume).
These observations are clinically relevant because they show that the reported results of the selected deep learning algorithm, trained on the BRATS data, are reproducible on heterogeneous data sets acquired in clinical routine. Furthermore, the detection and segmentation results were not affected by variations in the imaging protocols and variations in the tumor shape and size.
It should be noted that we did not analyze images without the presence of GB in this study. Thus, the sensitivity and specificity of the detection part of the algorithm have not been evaluated; this should be the subject of future work.
We see the potential of the deep learning algorithms, as evaluated in this work, in automatically analyzing images from primary GB. The analysis would take place in the background and before the images are read by the radiologist. At the time of the reading, the radiologist would be able to review the segmentation results, which could aid in the decision-making process. For this kind of workflow integration, it is necessary to provide an automatic and precise segmentation of the different tumor areas. In the primary GB setting, important clinical questions are as follows25:
- The selection of the area for maximal safe resection of the tumor to improve overall survival while reducing the patient's functionality as little as possible.
- Identifying the different compartments of the tumor, for example, as relevant biopsy targets.
- Identifying relevant prognostic markers.
The segmentation of tumor compartments (whole tumor, CE tumor, and necrosis) evaluated in this work is important to address these questions properly.
The whole tumor VOI includes edema, the CE tumor, and the necrosis. The DSC on our clinical data and the BRATS test data is around 0.85 with a high PPV of 0.9, showing that only a low number of voxels are misclassified by the algorithm as normal tissue. Furthermore, the DSC is equal to the interrater DSC. In the clinical reading, the whole tumor VOI would be used to, for example, determine the extent of the edema. In the research setting, the edema part of the whole tumor VOI could be used to further evaluate potential tumor invasion in this area via, for example, diffusion tensor analysis.26 Compared with the time-consuming and user-dependent manual segmentation, the results of the automatic segmentations suggest a high potential for integration in the radiology reading workflow.
The CE tumor compartment is the area of the tumor with contrast agent accumulation, hyperintense on a postcontrast T1w image. Biologically, this area reflects the part of the tumor with leaky and poorly constructed vessels. The DSC in our clinical data (0.78) was higher than on the BRATS test data (DSC, 0.63). Similar to the whole tumor VOI, the PPV of 0.83 shows that a relatively low number of voxels are misclassified as non-CE tumor.
In the clinical reading, the CE tumor compartment is important for determining the resection border and identifying relevant biopsy targets. In the research setting, this VOI is important, for example, in multiparametric analysis of potential pseudoprogression after radiation therapy as part of longitudinal tracking.
Because the CE tumor compartments are more fragmented and complex in shape, they take longer to segment manually than the whole tumor VOI, and the results are therefore even more user dependent. The automatically generated, user-independent CE tumor VOI segmentation with a DSC of 0.78 further supports the potential for integration in the radiology workflow.
To our knowledge, no data are currently available on interrater variability or on the comparison of automatic and manual segmentations for necrosis. For automatic segmentation of necrosis, a DSC of 0.62 was achieved, which is a reasonable result given the small and heterogeneous volumes of the necrotic compartment (eg, Fig. 3) in this study cohort (17.0 ± 14.8 mL). Furthermore, as for the other compartments, the algorithm achieved a high PPV (0.89 ± 0.20) for necrosis, showing that only a low number of voxels are misclassified by the algorithm as nonnecrosis. Manual segmentation of tumor necrosis would be the most challenging part because these areas are sometimes very small, heterogeneous, and scattered. Furthermore, the accuracy required for segmentation of the necrosis needs more discussion and clinical evaluation.
The results of the study show that the proposed algorithm for automatic detection of the primary GB tumor and the segmentation of the different tumor compartments has the potential to reproducibly and fully automatically support the clinical reading and preoperative planning.
The results are reproducible compared with former studies on different data and show a stable performance on a wide variety of clinical scanners and protocols. This reduces the risk described by some authors that spatial-temporal changes due to new MR machines or protocols will affect the performance of the algorithm.25
To reduce possible bias in our ground truth generation, we combined a semiautomatic approach with manual, consensus-based, repeated annotation by experts from radiology and neurosurgery. Next steps in research include using the automatic segmentation in the postsurgery setting to automatically detect and quantify residual tumor volume, which influences patient prognosis. A further step would be to apply the approach to automatic VOI generation in longitudinal tumor tracking to enable multiparametric analysis in the case of, for example, pseudoresponse under targeted therapy. For these clinical questions, using the VOIs as input for a radiomics analysis,27,28 or, in combination with genetic markers, a radiogenomics analysis,29,30 could be of further research interest.
To drive clinical acceptance of automatic segmentation in routine reading, further validation of the clinical applicability of the algorithm is needed. Given the stability on heterogeneous data, a next step could be to pool data from different centers for a multicenter trial to further validate the stability and reproducibility of the algorithm's results.
1. Thakkar JP, Dolecek TA, Horbinski C, et al. Epidemiologic and molecular prognostic review of glioblastoma. Cancer Epidemiol Biomarkers Prev.
2. Omuro A. Glioblastoma and other malignant gliomas. JAMA.
3. Stupp R, Mason WP, Van Den Bent MJ, et al. Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N Engl J Med.
4. Osborn A, Salzman K, Jhaveri M, et al. Diagnostic imaging: brain E-book. 2015.
5. Lacroix M, Abi-Said D, Fourney DR, et al. A multivariate analysis of 416 patients with glioblastoma multiforme: prognosis, extent of resection, and survival. J Neurosurg.
6. Hammoud MA, Sawaya R, Shi W, et al. Prognostic significance of preoperative MRI scans in glioblastoma multiforme. J Neurooncol.
7. Narang S, Lehrer M, Yang D, et al. Radiomics in glioblastoma: current status, challenges and potential opportunities. Transl Cancer Res.
8. Yip SS, Aerts HJ. Applications and limitations of radiomics. Phys Med Biol
9. Deeley MA, Chen A, Datteri R, et al. Comparison of manual and automatic segmentation methods for brain structures in the presence of space-occupying lesions: a multi-expert study. Phys Med Biol
10. Gordillo N, Montseny E, Sobrevilla P. State of the art survey on MRI brain tumor segmentation. Magn Reson Imaging.
11. Bauer S, Wiest R, Nolte LP, et al. A survey of MRI-based medical image analysis for brain tumor studies. Phys Med Biol.
13. Menze BH, Jakab A, Bauer S, et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging.
14. Isin A, Direkoglu C, Sah M. Review of MRI-based brain tumor image segmentation using deep learning methods. Procedia Comput Sci.
15. Cicero M, Bilbily A, Colak E, et al. Training and validating a deep convolutional neural network for computer-aided detection and classification of abnormalities on frontal chest radiographs. Invest Radiol
16. Becker AS, Marcon M, Ghafoor S, et al. Deep learning in mammography: diagnostic accuracy of a multipurpose image analysis software in the detection of breast cancer. Invest Radiol.
17. Litjens G, Kooi T, Bejnordi BE, et al. A survey on deep learning in medical image analysis. Med Image Anal.
18. Kamnitsas K, Ledig C, Newcombe VFJ, et al. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med Image Anal
19. Ashburner J, Friston KJ. Unified segmentation. Neuroimage
20. Crum WR, Camara O, Hill DL. Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Trans Med Imaging
21. van Griethuysen JJM, Fedorov A, Parmar C, et al. Computational radiomics system to decode the radiographic phenotype. Cancer Res
22. Pereira S, Pinto A, Alves V, et al. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Trans Med Imaging.
23. Huber T, Alber G, Bette S, et al. Progressive disease in glioblastoma: benefits and limitations of semi-automated volumetry. PLoS One.
24. Porz N, Bauer S, Pica A, et al. Multi-modal glioblastoma segmentation: man versus machine. PLoS One.
25. Fuster-Garcia E, García-Gómez JM, De Angelis E, et al. Use case II: imaging biomarkers and new trends for integrated glioblastoma management. In: Imaging Biomarkers. Cham: Springer International Publishing; 2017:181–194.
26. Price SJ, Jena R, Burnet NG, et al. Improved delineation of glioma margins and regions of infiltration with the use of diffusion tensor imaging: an image-guided biopsy study. AJNR Am J Neuroradiol
27. Ingrisch M, Schneider MJ, Nörenberg D, et al. Radiomic analysis reveals prognostic information in T1-weighted baseline magnetic resonance imaging in patients with glioblastoma. Invest Radiol.
28. Hainc N, Stippich C, Stieltjes B, et al. Experimental texture analysis in glioblastoma: a methodological study. Invest Radiol.
29. Gutman DA, Dunn WD, Grossmann P, et al. Somatic mutations associated with MRI-derived volumetric features in glioblastoma.
30. Panth KM, Leijenaar RT, Carvalho S, et al. Is there a causal relationship between genetic changes and radiomics-based image features? An in vivo preclinical experiment with doxycycline inducible GADD34 tumor cells. Radiother Oncol
Keywords: glioblastoma; GB; MRI; tumor segmentation; machine learning; deep learning