Pneumonia is a commonly encountered clinical entity, with a prevalence of 10% to 65% among hospitalized patients. It is not only responsible for 257,000 emergency room visits per year but is also the cause of nearly 50,000 deaths in the United States annually.1–3 Chest radiographs are often part of the initial diagnostic workup of pneumonia and are used to monitor its progression or resolution.2 Pneumonia is one of many indications for the roughly 2 billion chest radiographs performed annually worldwide.4 Because of these large study volumes, computer-automated diagnostic tools are increasingly being developed to assist in diagnostic interpretation.5
Convolutional neural networks (CNNs) are a recent form of machine learning (ML) that has reinvigorated interest in the development of algorithms for chest radiography. In contrast to historical ML approaches, CNNs can learn structural features of an image or volume without being explicitly programmed. This makes it considerably easier to build CNNs capable of performing a variety of tasks, including image-wide classification, object detection, and segmentation.6
Much of the recent research in chest radiography utilizes a large public database of 112,000 frontal chest radiographs.7–9 This database includes associated image-wide classification labels, derived from ontology-based analysis of text reports, covering 14 common radiographic findings/pathologies. Because of the nature of these labels, sometimes referred to as “weak” labels, many groups have explored the use of “classification” networks to perform image-based diagnosis. While these approaches show promise in their ability to classify radiographic findings, one challenge has been uncertainty about how these methods arrive at their final categorization.8,10,11 One potential strategy has been to use “saliency maps,” or maps of neural network activation, to reveal areas of the image important in arriving at the “diagnosis.” Because these classification CNNs are not explicitly directed to the pathology of interest, these saliency maps are often unreliable and may highlight parts of the image unrelated to the diagnosis. Without visualizing where an algorithm focuses on a radiograph, it may be difficult to resolve inconsistencies or disagreements between machine and radiologist interpretations of a chest radiograph.
We therefore sought to explore a new strategy for radiographic diagnosis, “semantic segmentation,” which inherently provides algorithm transparency. With this strategy, the CNN makes pixel-level decisions to produce a probability map for the presence or absence of pneumonia. This strategy is analogous to what has been used for segmentation and quantitative measurement of structures such as the heart,12 brain lesions,13 pulmonary nodules,14 and liver.15 We hypothesized that this semantic segmentation approach may be just as effective as the “classification” strategy, while also providing a probability map displaying the pixel-wise likelihood of pneumonia.
MATERIALS AND METHODS
Data Sources and Patient Demographics
We utilized a database of publicly available frontal chest radiographs with bounding boxes representing pneumonia annotated by radiologists, released as part of the 2018 RSNA pneumonia challenge.16 The radiographs in this data set are a subset of a larger NIH database of 112,120 frontal chest radiographs,7 in which each radiograph was assigned findings/diagnoses from 14 categories based on radiologist text reports: atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural thickening, and hernia. The RSNA pneumonia challenge included 25,684 chest radiographs spanning the range of pathologies in this data set, distributed as DICOM images. No additional exclusion criteria were applied for the current study.
Each radiograph in the public data set was distributed with a spatial resolution of 1024×1024 pixels and an 8-bit pixel depth. Patient and radiograph characteristics were as follows: 56.8% of patients were male; ages ranged from 1 to 93 years; 45.6% of radiographs were anteroposterior (AP) and 54.4% posteroanterior (PA) projections. Twenty-two percent of radiographs were labeled positive for pneumonia, 33.2% were normal, and the remaining 44.8% were abnormal but did not have pneumonia. These characteristics are presented in Table 1.
TABLE 1 - Data Used to Train and Evaluate the Pneumonia CNN (reported characteristics: mean age [range] in years; pneumonia, n [%]; abnormal, not pneumonia, n [%]; and normal, n [%])
Convolutional neural networks were trained on publicly available frontal chest radiographs and radiologist-defined bounding boxes demarcating areas of lung parenchyma associated with pneumonia. A total of 22,000 radiographs were used for model training, and the remainder were reserved to evaluate performance.
Data and Preprocessing
Radiographs and radiologist-defined bounding box annotations of pneumonia from this public data set were preprocessed into probability maps. This additional step was devised to synthesize training data for the “semantic segmentation” strategy. First, to reduce computational complexity, chest radiographs were spatially down-sampled to 256×256 pixels. For radiographs annotated with bounding boxes, the box coordinates were used to create elliptical probability masks of identical width, height, and location. This conversion from a rectangular bounding box to an elliptical map was utilized to reduce highlighting of extrathoracic structures that were otherwise contained within the original rectangular bounding boxes. For radiographs without pneumonia bounding boxes, a null binary probability map was created.
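As an illustration, the box-to-ellipse conversion could be implemented as follows. This is a minimal NumPy sketch under our own assumptions: boxes are taken as (x, y, width, height) tuples in down-sampled pixel coordinates, and both function names are ours rather than from the original code.

```python
import numpy as np

def ellipse_mask_from_box(x, y, w, h, shape=(256, 256)):
    """Binary mask of the ellipse inscribed in a pneumonia bounding box."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    cy, cx = y + h / 2.0, x + w / 2.0  # ellipse center at the box center
    ry, rx = h / 2.0, w / 2.0          # semi-axes from the box dimensions
    return ((rows - cy) / ry) ** 2 + ((cols - cx) / rx) ** 2 <= 1.0

def make_probability_map(boxes, shape=(256, 256)):
    """Union of elliptical masks; an empty box list yields a null map."""
    target = np.zeros(shape, dtype=np.float32)
    for x, y, w, h in boxes:
        ellipse = ellipse_mask_from_box(x, y, w, h, shape)
        target = np.maximum(target, ellipse.astype(np.float32))
    return target
```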
The data set was randomly divided into 2 groups: 22,000 (85.6%) radiographs were used for training, and the remaining 3684 (14.4%) radiographs were reserved for validation. Validation radiographs were kept independent of the training data and were used solely to benchmark and evaluate neural network performance.
Model Structure and Training
A U-net17 was trained using the synthetic probability maps to predict the pixel-wise likelihood of pneumonia on each frontal chest radiograph (Fig. 1). Predictions are represented by a pneumonia probability map with dimensions identical to the input image and output pixel values between 0 and 1. Neural network weights were initialized randomly between 0 and 1 using a uniform distribution. No pretraining, transfer learning, or image augmentation was utilized.
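The depth and filter counts of the network are not specified in the text, but a minimal Keras sketch of a U-net of this kind, with skip connections and a sigmoid output yielding per-pixel probabilities, might look like the following (layer widths here are illustrative assumptions, not the authors' configuration):

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """Two 3x3 convolutions, the basic U-net building block."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unet(input_shape=(256, 256, 1)):
    inputs = layers.Input(input_shape)
    c1 = conv_block(inputs, 32)                # encoder level 1
    p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 64)                    # encoder level 2
    p2 = layers.MaxPooling2D()(c2)
    b = conv_block(p2, 128)                    # bottleneck
    u2 = layers.concatenate([layers.UpSampling2D()(b), c2])   # skip connection
    d2 = conv_block(u2, 64)
    u1 = layers.concatenate([layers.UpSampling2D()(d2), c1])  # skip connection
    d1 = conv_block(u1, 32)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(d1)   # pixel-wise probability
    return Model(inputs, outputs)
```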
Training was performed using a batch size of 8 and a weighted binary cross-entropy loss function.17 Pixels with pneumonia were assigned 10-fold weighting, and positive cases of pneumonia were assigned 30-fold weighting. This loss function was optimized using the “ADAM” back-propagation method with an initial learning rate of 0.0001. One epoch was defined as the interval over which all 22,000 training radiographs had been used to update the network. We used a dynamic learning rate, decreasing it by a factor of 5 when the validation loss failed to decrease for 3 consecutive epochs; training was terminated once the validation loss failed to decrease for 6 consecutive epochs. Using this training strategy, the CNN trained for a total of 43 epochs.
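The loss and training schedule described above could be configured along these lines; this is a sketch, not the original code. The per-image reduction of the loss, the placeholder arrays (train_images, train_maps, train_labels, val_images, val_maps), and the use of sample_weight for the 30-fold case weighting are our assumptions.

```python
import numpy as np
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

def weighted_bce(y_true, y_pred):
    """Binary cross-entropy with pneumonia pixels up-weighted 10-fold."""
    y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
    loss = -(10.0 * y_true * K.log(y_pred) + (1.0 - y_true) * K.log(1.0 - y_pred))
    return K.mean(loss, axis=[1, 2, 3])  # one loss value per image

model = build_unet()
model.compile(optimizer=Adam(lr=1e-4), loss=weighted_bce)

# 30-fold weighting of positive cases, applied per image.
case_weights = np.where(train_labels == 1, 30.0, 1.0)

callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=3),  # divide LR by 5
    EarlyStopping(monitor="val_loss", patience=6),                  # stop after 6 stagnant epochs
]
model.fit(train_images, train_maps, sample_weight=case_weights,
          batch_size=8, epochs=100,
          validation_data=(val_images, val_maps), callbacks=callbacks)
```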
The U-Net CNN was implemented and trained in Python (version 3.5; Python Software Foundation, Wilmington, DE) using Keras 2.2 and TensorFlow 1.8 on a GPU workstation running Ubuntu 16.04, equipped with a Titan X graphics card (NVIDIA, Mountain View, CA). Model training and algorithm development were performed by the primary author, a radiology research resident.
Probability Map Postprocessing
In order to arrive at patient-level binary classifications of “pneumonia” and “no pneumonia,” probability maps were postprocessed using a fully automated strategy. First, we automatically isolated discrete regions of pneumonia from the probability map using an Otsu-thresholding technique18 and calculated each region’s mean probability and a rectangular bounding box encompassing the predicted pathology. The overall radiographic classification was determined by comparing the region with the highest predicted mean pneumonia probability with a minimum operating probability threshold.
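A possible implementation of this post-processing with scikit-image is sketched below; the function name, the handling of empty maps, and the exact region statistics are our assumptions rather than the authors' code.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

def classify_from_probability_map(prob_map, operating_threshold):
    """Return (is_pneumonia, bounding_box) from a pixel-wise probability map."""
    if prob_map.max() <= prob_map.min():          # flat map: no candidate regions
        return False, None
    binary = prob_map > threshold_otsu(prob_map)  # isolate discrete regions
    best_prob, best_box = 0.0, None
    for region in regionprops(label(binary)):
        mean_prob = prob_map[tuple(region.coords.T)].mean()  # regional mean probability
        if mean_prob > best_prob:
            best_prob, best_box = mean_prob, region.bbox  # (min_row, min_col, max_row, max_col)
    return best_prob >= operating_threshold, best_box
```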
Model Evaluation and Statistical Analysis
First, to evaluate the accuracy of localization, we examined all 860 of the 3684 radiographs in the validation cohort with ground truth labels of pneumonia. For these cases, Dice and Intersection over Union (IoU) scores were computed to quantify the overlap between the radiologist annotation and the predicted probability map. Dice and IoU are standard metrics describing the degree of overlap between 2 discrete objects in an image, with values ranging from 0 (none) to 1 (perfect). These metrics were computed for predictions against the elliptical annotations, as well as against the data set’s original rectangular bounding box annotations for external comparison. For elliptical annotations, we grouped cases according to Dice score into high overlap (0.5 < Dice < 1.0), low overlap (0 < Dice < 0.5), and no overlap (Dice = 0).
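For reference, Dice and IoU can be computed from binarized prediction and annotation masks as follows; this is the standard formulation rather than the authors' code.

```python
import numpy as np

def dice_and_iou(pred_mask, truth_mask):
    """Standard overlap metrics between two binary masks."""
    pred = pred_mask.astype(bool)
    truth = truth_mask.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    dice = 2.0 * intersection / (pred.sum() + truth.sum())  # assumes a nonempty truth mask
    iou = intersection / union
    return dice, iou
```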
Second, to evaluate whole-radiograph classification performance, we examined the entire validation cohort of 3684 radiographs by computing receiver operating characteristic (ROC) curves and areas under the curve (AUC). ROC curves were created by varying the regional mean probability threshold for classifying a radiograph as “positive” for pneumonia. An optimal operating point was also chosen to maximize sensitivity and specificity equally, known as the Youden J-index.19 Subanalyses were performed to assess performance using (a) the entire validation cohort; (b) pneumonia versus normal, excluding films labeled as abnormal but not pneumonia in the public data set; and (c) pneumonia versus abnormal findings, excluding normal chest radiographs.
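The ROC analysis and Youden J-index operating point can be reproduced with scikit-learn; here y_true and scores are assumed to hold one label and one score (the highest regional mean probability) per radiograph.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# y_true: 1 if the radiograph carries a ground truth label of pneumonia.
# scores: highest regional mean probability from the post-processing step.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

j = tpr - fpr                                  # Youden J statistic at each cutoff
optimal_threshold = thresholds[np.argmax(j)]   # weights sensitivity and specificity equally
```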
Finally, to further examine the performance of the neural network, we rendered probability maps superimposed on the input chest radiographs to assess this approach on individual representative cases. Probability maps were alpha-blended, with a maximum of 80% opacity for probability values of 100% and full transparency for probability values <5%. Cases were divided into the characteristic groups highlighted in Table 2, on the basis of agreement between prediction and radiologist annotation. Concordance was defined as agreement between the CNN and the ground truth label; discordance was defined as disagreement between the CNN and the ground truth label.
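The alpha blending described here might be rendered with matplotlib as follows; the colormap choice is an assumption on our part.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm

def overlay(radiograph, prob_map, max_alpha=0.8, min_prob=0.05):
    """Superimpose a probability map with opacity scaled to probability."""
    rgba = cm.jet(prob_map)                          # color-encode the probabilities
    alpha = np.clip(prob_map, 0.0, 1.0) * max_alpha  # 80% opacity at p = 1.0
    alpha[prob_map < min_prob] = 0.0                 # fully transparent below 5%
    rgba[..., 3] = alpha
    plt.imshow(radiograph, cmap="gray")
    plt.imshow(rgba)
    plt.axis("off")
    plt.show()
```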
TABLE 2 - Classification Definitions Between Predictions and Data Set Labels
| |“Ground Truth” Label: Pneumonia |“Ground Truth” Label: Not Pneumonia
|CNN prediction: pneumonia |Concordant positive |Discordant positive
|CNN prediction: not pneumonia |Discordant negative |Concordant negative
Each classification prediction is evaluated against its “ground truth” label. Cases that agree are considered “concordant,” and cases that disagree are considered “discordant.” This deviates from the classic nomenclature of true/false positive/negative because of some uncertainty in the pneumonia diagnoses of the training and test data sets.
RESULTS
The performance of the CNN was evaluated on 3684 chest radiographs held out for analysis, including 860 cases with ground truth labels of pneumonia. Performance characteristics of the algorithm are shown in Figure 2. The overall mean±SD Dice and IoU scores for predicted regions compared with the elliptical radiologist annotations were 0.603±0.204 and 0.461±0.205, respectively. Against the rectangular radiologist annotations, the overall mean±SD Dice and IoU scores were 0.553±0.259 and 0.417±0.229, respectively. For 60.0% of these radiographs, there was high overlap (Dice > 0.5); for 22.3%, low overlap (0 < Dice < 0.5); and for 0.5%, no overlap (Dice = 0). The remaining 17.2% of cases did not reach the Youden J-index threshold (calculated to be 0.08) and thus were not classified as pneumonia. Representative case examples are highlighted in Figure 3.
Classification performance of the neural network was assessed using the entire validation cohort of 3684 radiographs. Overall AUC for the CNN was 0.854. At the optimal operating point (Youden J-index threshold), this corresponded to an accuracy of 81.6%, sensitivity of 82.8%, specificity of 72.6%, positive predictive value of 47.9%, and negative predictive value of 93.3%. When abnormal nonpneumonia chest radiographs were excluded, performance increased to an AUC of 0.944. When normal chest radiographs were excluded, the AUC declined to 0.788. ROC curves for each of these 3 analyses are illustrated in Figure 4.
Representative examples of the performance of the neural network are highlighted in Figures 5–8. For concordant positive cases, the CNN successfully localized both diffuse pneumonia with bilateral involvement and focal pneumonia, on adult as well as pediatric films. For concordant negative cases, the CNN performed well on normal radiographs and was not misled by normal variants such as an elevated right hemidiaphragm. For discordant positive cases, in which the CNN predicted pneumonia but the ground truth was not labeled as pneumonia, the CNN highlighted pulmonary opacities that might be considered equivocal for pneumonia. For discordant negative cases, in which there was a ground truth label of pneumonia but the CNN did not identify any abnormality, findings were equivocal or subtle for pneumonia, including a perihilar opacity and a lingular opacity.
DISCUSSION
In this study, we have shown that a semantic segmentation deep learning strategy can achieve radiographic diagnosis of pneumonia with an AUC of 0.854, compared with historical classification strategies that achieved AUCs of 0.78 to 0.91,8,9 albeit with a different validation cohort. More importantly, this strategy appears to successfully highlight suspicious foci of pneumonia, which may be a more practical application of neural networks than previous approaches, providing a natural level of algorithm transparency that can be readily integrated into a radiologist’s workflow. As radiographs are often interpreted in conjunction with clinical history, prior films, lateral projections, or even prior computed tomography (CT) scans, the final interpretation of a radiograph is often more complex than can be accomplished on the basis of a single frontal film. Color-encoded pixel-wise likelihood maps likely have intrinsic value of their own: they allow a physician to rapidly refute or agree with the observations of the neural network and to weigh those observations within the full clinical context of the patient and other, more definitive information.
The literature using deep learning to localize pneumonia is limited. Other CNN approaches can be used to predict pneumonia bounding boxes, such as Faster R-CNN or Mask R-CNN. Recent work using a Mask R-CNN approach on this same data set reported a mean IoU of 0.18 to 0.21,20 which, while not an equivalent calculation, is likely slightly better than our rectangular-based IoU metric when accounting for discordant predictions, but roughly equivalent when using the elliptical-based IoU metric. Previous groups using whole-image classification approaches have attempted to illustrate pneumonia localization by applying saliency maps to reveal the portions of the image emphasized in the final classification.9 However, it is unclear how reliable these saliency maps are. Future work may be required to compare these alternative techniques with the strategy proposed here.
The purpose of cropping bounding box annotations into ellipses was to prevent extrathoracic structures from being interpreted as pneumonia, thereby improving the consistency of what the model learns as pneumonia. Our results support this in 2 complementary ways: (1) all positive predictions localized to the thoracic cavity, and (2) 60% of concordant pneumonia predictions had elliptically based Dice scores over 0.5. Taken together, this suggests that the model reasonably differentiates lung opacities from normal lung parenchyma. The downside, as stated above, is that cropping radiologists’ annotations into ellipses makes direct comparison with prior work difficult. Nonetheless, we favor this approach for clinical use because we can be confident that predictions will always localize within the thoracic cavity and to opacities.
One advantage of a classification approach over a localization or segmentation approach is that radiologist-defined localizations are not required for algorithm development; “weak” binary labels (is or is not pneumonia, pneumothorax, etc.) are enough to make binary predictions. This property has made the classification strategy attractive to ML scientists, as it does not require the involvement of a radiologist. The tradeoff, however, is that classification approaches typically require much larger data sets and more complex neural network architectures. Interestingly, we found that a smaller number of cases, only 20% of the NIH 112K data set, was enough to train a U-net to achieve performance similar to that of classification-only approaches, which required 98K to 108K examinations.8,9 We presume this is because training used explicit localizations.
AUC and Dice scores assume that the ground truth labels of pneumonia and radiologist annotations are correct and exact. It is possible that they were not, and the accuracy of our current algorithm may be underestimated. In this data set, radiologists annotated radiographs on the basis of previous natural language processing-derived labels from the broader NIH chest radiograph data set. The accuracy of those labels is not certain.9 In addition, only a couple of radiologists annotated each radiograph. Even expert thoracic radiologists may not have perfect agreement on the boundaries or certainty of radiographic findings, which is discussed in the curation of the data set used for this study.16 Without supporting clinical data or a confirmatory CT scan, it is difficult to assess the degree of diagnostic certainty in the data set. Future work may be required to assess overall performance against a more definitive ground truth, including CT or objective clinical features, such as leukocytosis and relevant clinical history.
Several observations stand out when reviewing the probability maps generated by the neural network. Using the strategy proposed here, we found that all areas of medium to high probability for pneumonia were confined to the thoracic cavity and tended to be observed on lung opacities. While this was intuitively obvious to human observers, one common pitfall of classification neural networks is that they may often use visual cues that are unrelated to the disease process.21 It is likely that this is the natural result of utilizing radiologists’ annotations of the location of pneumonia in the training process, rather than loosely providing labels of which examinations came from patients who had pneumonia.
Ultimately, algorithms like these may be integrated into the clinical workflow of radiologists, emergency physicians, and internists. However, it is likely that they will not be immediately perfect in their initial implementation, and will require further training and optimization, which may be facilitated with expert feedback. As CNNs can be further modified through a process of transfer learning, it is possible to adapt and “teach” CNNs to improve performance for specific patterns where they may struggle, analogous to teaching a resident after he/she misses a retrocardiac pneumonia. A number of different strategies could have been used to further improve the performance of our algorithm. It is possible that using a pretrained network, hyperparameter optimization, or applying image augmentation could have marginally improved performance. However, we believe it is also possible to improve performance by providing relevant data to learn on. Improving performance on diaphragmatic, retrocardiac, and lingular pneumonias may be remedied by increasing the number or weight of these cases in training, whether retraining from scratch or utilizing a “transfer learning” strategy.15 Alternatively, training CNNs to specifically recognize and identify other similar appearing pathologies such as pulmonary edema may also improve performance.
The approach we propose here, augmenting radiographs with a probability map, has the potential to integrate readily into the clinical workflow of an interpreting radiologist, who can synthesize information from multiple sources, including the clinical history, lateral films, or other imaging modalities such as CT. Future work may further assess the multiple potential advantages of this image augmentation approach, which may be readily applied to other disease processes. For example, small pneumothoraces can be important to detect but are difficult for historical classification approaches.22 The same technique may also be effective at localizing tubes, lines, or other devices, or acute fractures. It is also possible that this approach could have an impact on the quality of interpretation by physicians closer to the point of care, such as emergency or ICU physicians, or diagnostic radiology trainees with more limited experience with this front-line imaging modality. Future work may be valuable to assess the overall clinical impact of this technology.
In conclusion, in this study, we have shown that a “semantic segmentation” deep learning approach may be a useful adjunct to facilitate the radiographic diagnosis of pneumonia. The pneumonia probability map produced by this approach may interface more naturally with radiologist interpretation than purely classification-based strategies.
REFERENCES
1. Xu J, Murphy SL, Kochanek KD, Bastian BA. Deaths: final data for 2013. Natl Vital Stat Rep. 2016;64:1–119.
2. Bartlett JG, Mundy LM. Community-acquired pneumonia. N Engl J Med. 1995;333:1618–1624.
3. Ibrahim EH, Tracy L, Hill C, et al. The occurrence of ventilator-associated pneumonia in a community hospital: risk factors and clinical outcomes. Chest. 2001;120:555–561.
4. Raoof S, Feigin D, Sung A, et al. Interpretation of plain chest roentgenogram. Chest. 2012;141:545–558.
5. Hinton G. Deep learning-a technology with the potential to transform health care. JAMA. 2018;320:1101–1102.
6. Retson TA, Besser AH, Sall S, et al. Machine learning and deep neural networks in thoracic and cardiovascular imaging. J Thorac Imaging. 2019;34:192–201.
7. Wang X, Peng Y, Lu L, et al. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017:2097–2106.
8. Zech JR, Badgeley MA, Liu M, et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 2018;15:1–17.
9. Rajpurkar P, Irvin J, Ball RL, et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med. 2018;15:e1002686.
10. Erickson BJ, Korfiatis P, Kline TL, et al. Deep learning in radiology: does one size fit all? J Am Coll Radiol. 2018;15:521–526.
11. Recasens A, Kellnhofer P, Stent S, et al. Learning to zoom: a saliency-based sampling layer for neural networks. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y, eds. Computer Vision – ECCV 2018. Lecture Notes in Computer Science, vol 11213. Cham: Springer; 2018.
12. Avendi M, Kheradvar A, Jafarkhani H. Fully automatic segmentation of heart chambers in cardiac MRI using deep learning. J Cardiovasc Magn Reson. 2016;18(S1):2–4.
13. Miller RW, Zhuge Y, Arora BC, et al. Brain tumor segmentation using holistically nested neural networks in MRI images. Med Phys. 2017;44:5234–5243.
14. Wang S, Zhou M, Liu Z, et al. Central focused convolutional neural networks: developing a data-driven model for lung nodule segmentation. Med Image Anal. 2017;40:172–183.
15. Wang K, Mamidipalli A, Retson T, et al. Automated CT and MRI liver segmentation and biometry using a generalized convolutional neural network. Radiol Artif Intell. 2019;1:180022.
16. Shih G, Wu CC, Halabi SS, et al. Augmenting the National Institutes of Health Chest Radiograph dataset with expert annotations of possible pneumonia. Radiol Artif Intell. 2019;1:e180041.
17. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Cham: Springer; 2015:234–241.
18. Otsu N. A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern. 1979;9:62–66.
19. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3:32–35.
20. Jaiswal AK, Tiwari P, Kumar S, et al. Identifying pneumonia in chest X-rays: a deep learning approach. Meas J Int Meas Confed. 2019;145:511–518.
21. Kallianos K, Mongan J, Antani S, et al. How far have we come? Artificial intelligence for chest radiograph interpretation. Clin Radiol. 2019;74:338–345.
22. Yang RZ, Lee MJ, Hu H. Acute-phase serum amyloid A: an inflammatory adipokine and potential link between obesity and its metabolic complications. PLoS Med. 2006;3:e287.