
Learning Hierarchical Attention for Weakly-supervised Chest X-Ray Abnormality Localization and Diagnosis

Xi Ouyang, Srikrishna Karanam, Member, IEEE, Ziyan Wu, Member, IEEE, Terrence Chen, Senior Member, IEEE, Jiayu Huo, Xiang Sean Zhou, Qian Wang, and Jie-Zhi Cheng

This work was supported in part by the National Key Research and Development Program of China (2018YFC0116400), Grant of Shanghai Strategic Emerging Industries from Shanghai Municipal Development and Reform Commission (20191211), and STCSM (19QC1400600). (Corresponding authors: Ziyan Wu, Qian Wang, and Jie-Zhi Cheng.)

Xi Ouyang, Jiayu Huo, and Qian Wang are with the Institute for Medical Imaging Technology, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China. Xi Ouyang and Jiayu Huo were interns at Shanghai United Imaging Intelligence Co. during this work. (e-mail: {xi.ouyang, jiayu.huo, wang.qian}@sjtu.edu.cn).

Srikrishna Karanam, Ziyan Wu, and Terrence Chen are with United Imaging Intelligence, Cambridge, MA, United States. (e-mail: {srikrishna.karanam, ziyan.wu, terrence.chen}@united-imaging.com).

Xiang Sean Zhou and Jie-Zhi Cheng are with Shanghai United Imaging Intelligence Co., Ltd., Shanghai, China. (e-mail: [email protected], [email protected]).
Abstract

We consider the problem of abnormality localization for clinical applications. While deep learning has driven much recent progress in medical imaging, many clinical challenges are not fully addressed, limiting its broader usage. While recent methods report high diagnostic accuracies, physicians have concerns trusting these algorithm results for diagnostic decision-making purposes because of a general lack of algorithm decision reasoning and interpretability. One potential way to address this problem is to further train these models to localize abnormalities in addition to just classifying them. However, doing this accurately will require a large amount of disease localization annotations by clinical experts, a task that is prohibitively expensive to accomplish for most applications. In this work, we take a step towards addressing these issues by means of a new attention-driven weakly supervised algorithm comprising a hierarchical attention mining framework that unifies activation- and gradient-based visual attention in a holistic manner. Our key algorithmic innovations include the design of explicit ordinal attention constraints, enabling principled model training in a weakly-supervised fashion, while also facilitating the generation of visual-attention-driven model explanations by means of localization cues. On two large-scale chest X-ray datasets (NIH ChestX-ray14 and CheXpert), we demonstrate significant localization performance improvements over the current state of the art while also achieving competitive classification performance. Our code is available on https://github.com/oyxhust/HAM.

Index Terms

Weakly Supervised, Abnormality Localization, Explainability, Hierarchical Attention.


1 Introduction


The chest X-ray (CXR) is one of the most commonly performed medical imaging examinations in clinical practice. Diagnosis with CXR greatly depends on the radiologist’s experience [1] since anatomical structures may overlap due to the 2D projection effect. Another challenge with CXR is the high diversity of possible abnormalities and diseases. Consequently, many detection [2], diagnosis [3, 4, 5], and triage [6, 7] methods have been proposed to support computer-aided CXR diagnosis.

The public release of the large-scale NIH Chest X-ray14 [8] and CheXpert [9] datasets, both including more than 100,000 images, has further fostered research in this field, with many prominent techniques [3, 4, 5, 10, 11] formulating CXR image diagnosis as a multi-label classification problem. Despite their high classification accuracies, physicians find it difficult to interpret these “black-box” models. Furthermore, as these models are trained with image-level annotations, they are unable to capture the large intra-class diversity in shapes, appearances, and sizes of different abnormalities. On the other hand, one may expect to mitigate these issues given sufficient abnormality localization annotations for training; however obtaining them is prohibitively expensive, particularly in medical applications. Consequently, low accuracies in abnormality localization as well as limited model interpretability have become key bottlenecks in wide adoption of these algorithms in clinical practice.

Existing methods in the literature propose ways to address these issues. For instance, Li et al. [12] employed a weakly supervised approach (using a small number of box annotations) for abnormality localization. Despite good progress in a few cases, the performance for most abnormalities (e.g., “Nodule”, “Mass”, “Atelectasis”) remains quite low. In addition to a general lack of decision reasoning (or model explanations), the performance improvements with this line of work have come at the cost of reduced image-level abnormality diagnosis performance [12]. This is because of a severe imbalance between the small number (of the order of a few hundred) of box-level localization annotations and the much larger number (of the order of several hundred thousand) of images with only image-level class labels. Thus, in this work, our key considerations are: (a) can we provide an efficient means for explaining model decisions? and (b) how do we improve localization performance while also ensuring little or no reduction in classification performance?

Recent progress in convolutional neural network (CNN) attention modeling and learning [13, 14] has led to increased adoption of visual attention for model interpretability and explainability. Such extensions [15, 16], while generating attention priors and demonstrating applicability for classification tasks, are not directly applicable in our context due to the hierarchical nature of localizing a particular abnormality within a larger anomalous region in the image. To address this problem, we design an attention-driven learning framework that addresses two key drawbacks of existing methods: (a) precisely localizing abnormalities, especially for subtle classes like “Nodule”, and (b) addressing the imbalance between box-level and image-level annotations in a holistic manner. Our intuition is that CNN visual attention, by being a weakly-supervised source of localization cues, provides a strong prior for learning generalizable models even with small quantities of box-level annotations. Such a framework not only helps improve localization performance but also, by means of attention, provides a way to visually explain model decisions, both of which are important for clinical deployment of deep learning models.

Figure 1: Hierarchical attention mining framework. It contains three levels of attention mechanism: foreground attention, positive attention, and abnormality attention. The foreground attention comes from the activation-based foreground attention block (FAB). The positive attention and abnormality attention are the gradient-based attentions generated from the two online-CAM modules of the two-way (positive/negative) classification task and the D-way (D abnormality types) classification task, respectively.

Our proposed method learns hierarchical attention at three levels: foreground, positive, and abnormality attention (Fig. 1). To model foreground attention, we present an activation-based foreground attention block (FAB) that captures foreground dependencies to give an initial, coarse estimation of the foreground region of interest. Our FAB considers both channel- and position-wise attention, and is realized with a cascade design to guide the learning process to discover informative features that are useful for the later search/recognition of abnormalities. Next, a two-way (positive/negative) classification scheme uses gradient-based diagnostic-level attention to specifically narrow down regions that may, again fairly coarsely, enclose the abnormalities (positive attention). Finally, abnormality attention from the D-way (D abnormality types) classification task is responsible for computing the abnormality locations. To ensure learning of attention regions hierarchically, we enforce two explicit attention ordinality constraints. Specifically, we propose a novel attention-bound learning objective that enforces the output of abnormality attention to be located completely within the coarse positive attention region. Furthermore, we propose a novel attention-union objective to enforce positive attention to lie within the union region of abnormality attention maps. Finally, by design, our framework enables principled incorporation of the limited box-level annotations by regularizing the abnormality attention maps to directly conform to the ground-truth distribution.

To summarize, the key contributions of our work are:

  • We present a new visual attention-driven weakly-supervised learning framework that simultaneously addresses abnormality localization and classification with very limited box-level annotations.

  • To address the hierarchical nature of abnormality localization, our proposed visual attention mechanism is explicitly hierarchical, comprising three levels (foreground, positive, and abnormality) that enable progressive weakly-supervised discovery of the specific abnormality location of interest.

  • We demonstrate improved abnormality localization performance, establishing state-of-the-art results on the NIH ChestX-ray14 dataset.

  • We invited an experienced radiologist to provide box annotations for the CheXpert [9] dataset to help evaluate our method’s localization performance. In total, 2345 images in the CheXpert dataset have been annotated with 6099 bounding boxes for 9 abnormality types.

2 Related Work

2.1 CXR Image Analysis

Deep learning has enabled much recent progress in the field of medical image analysis [17]. For CXR image analysis, the release of the NIH dataset [8], which provided an official patient-level data split, has motivated many recent studies [3, 4, 5, 18, 10, 11] for the diagnosis of 14 abnormalities. Most of these techniques utilized either the DenseNet [18, 5, 4, 3] or ResNet [4, 11] architecture as the backbone with several add-ons like squeeze-and-excitation [5], global-local feature branch combination [4], two parallel branches [10], integration of multi-resolution cues [18], knowledge fusion from other datasets [3], and so on. The current state-of-the-art performance among methods that use the official data split [3, 5, 18, 10, 11] is around 0.81 measured in terms of average AUC. Chen et al. [19] explored the hierarchical structure of the 14 abnormality labels from the PLCO dataset [20].

Figure 2: Framework of our method. “1×1 Conv” denotes a 1×1 convolutional kernel. “FC” denotes the fully-connected layer. “Conv1”, “Conv2”, and “Conv3” are convolutional layers with 3×3 kernels. “Conv1” reduces the number of channels of the feature maps from the backbone CNN for computation efficiency of FAB. “Conv3” is used to concatenate the features of the backbone CNN and FAB, generating the encoded feature maps with 512 channels. The positive and abnormality attention results are generated with these encoded feature maps and the weights of the two fully-connected layers from the two-level classification tasks: the positive/negative classification task and the D-way abnormality classification task.

As noted earlier, abnormality regions highlighted by models trained with image-level annotations may not always actually relate to the true abnormalities and may possibly also be non-pathological or outside the cardio-thoracic parts. To address these problems, some methods [12, 21] used a small number of available box-level annotations to facilitate more plausible abnormality localization. Specifically, multiple-instance learning (MIL) was employed in [12, 21, 22] by treating the various spatial sliced blocks as instances. While good localization was demonstrated in [12], the image diagnosis AUC was still around 0.75 with the official split (see Table 4 in the appendix of Li et al.’s [12] arXiv v6 version). Based on the MIL framework, Liu et al. [21] developed the contrast-induced attention (CIA) network for localization improvement. CIA required pairs of positive and negative input images (images with and without abnormalities, respectively) to obtain contrast attention maps as the abnormality localization priors for the MIL framework. For the purpose of good quality contrast attention, CIA further required an alignment network to ensure that positive and negative images are in the same canonical form. While these techniques [12, 21] demonstrated better localization when compared to the NIH baseline [8], the performance on abnormalities like “Nodule” and “Mass” is still not satisfactory. Meanwhile, the performance of CXR diagnosis with the official split was not elaborated in [21]. Our method, on the other hand, does not require either paired positive/negative images or the block slicing step as in these methods. Furthermore, our method improves abnormality localization performance without requiring an additional alignment step.

2.2 Attention Mechanism

Self-attention mechanism is an effective feature learning technique shown to be helpful in various image analysis tasks, e.g., video classification [23], image classification [24], and semantic segmentation [25]. This can be seen as a type of activation-based attention [26] since a variety of activation functions, e.g., Sigmoid, Softmax, etc., are used to compute attentive spatial parts or channels from feature maps. This is generally realized using two types of core modules: channel-wise attention [24] and spatial-wise attention [23], with both these types typically employed in a parallel fashion to address image analysis problems like semantic segmentation [25].

Gradient-based attention is another line of work in this direction that has been shown to be helpful for weakly supervised learning. The class activation map (CAM) [13] and gradient-weighted class activation map (Grad-CAM) [14, 27] are some examples of techniques that can be categorized under gradient-based attention, with CAM technically a special case of Grad-CAM for a specific type of CNN architecture (i.e., performing pooling over convolutional maps immediately prior to prediction).

Some extensions [28, 15, 16] further used these techniques as online trainable modules to improve the performance of models on downstream tasks, e.g., image classification or segmentation. There have been applications in the medical field as well, with Lian et al. [29, 30] utilizing gradient-based attention to boost the diagnostic performance of models for Alzheimer’s disease (AD) in a two-stage framework. Specifically, the disease attention map derived in the first stage is employed to guide the training of the disease classification networks in the second stage. Since only AD-related diseases were considered in these methods [29, 30], the aspect of label hierarchy, e.g., the image-level positive/negative label vs. the abnormality-level labels, was not fully explored.

Our proposed method is, to the best of our knowledge, the first hierarchical visual attention framework that unifies both activation- and gradient-based attention in a holistic manner. To discover informative features that are useful for the later search/recognition of abnormalities, we propose the activation-based foreground attention block that cascades channel- and position-wise attention. Specifically, the channel-wise attention module is used first to re-calibrate useful channels for the re-computation of feature maps. The recomputed feature maps may carry more informative features to support the learning of spatial attention features. With such a cascade design, effective attention features can be better learned. The resulting foreground attention map serves as a spatial prior for abnormal regions and thus helps the abnormality localization task. Furthermore, we develop two online CAM modules to produce gradient-based attention for positive and abnormality attention, given hierarchically organized image labels. Finally, we also propose novel ways to bound the learned attention by means of ordinal learning objectives that explicitly model the latent correlation between attention maps.

3 Methodology

Our framework is illustrated in Fig. 2. It learns a three-level hierarchical representation of attention. The first level corresponds to a coarse delineation of the foreground region of interest, produced by our proposed foreground attention block (FAB). The FAB is realized by means of a differentiable activation-based attention mechanism that is infused into the model training process to highlight the features of the foreground region of interest. The second level corresponds to a coarse demarcation of the positive region of interest, realized by means of gradient-based attention with a two-way (positive/negative) classification scheme (called positive attention). Finally, the third level corresponds to delineating the particular abnormality of interest, realized using gradient-based attention with a D-way (D abnormality types) classification scheme (called abnormality attention). To enforce explicit learning of such a hierarchical attention representation, we propose new ordinal attention constraints, implemented by means of learning objectives we call attention bound and attention union losses. Our learning mechanism is flexible enough to use a small number of box annotations, which helps improve localization performance without a reduction in classification performance. In the subsequent sections, we discuss each component of our proposed framework in detail.

3.1 Foreground Attention Block (FAB)

The foreground attention block (see Fig. 3) implements perceptual-level attention with a self-attention mechanism [23, 24, 31] to learn and highlight foreground features. The FAB is comprised of both channel and position attention modules, which we propose to use in a sequential manner. Unlike Fu et al. [25], who combined these two attention types into a parallel data flow, we cascade the channel- and position-wise attention to learn a foreground attention map. Specifically, the channel-wise attention is performed first to calibrate useful channels for the position-wise attention component. Fu et al. [25], on the other hand, exercised the self-attention mechanism in a parallel manner to recalibrate the channel and spatial cues simultaneously. Such a parallel self-attention mechanism with two non-local-based matrix operations may consume more GPU memory. Our cascaded FAB requires only one matrix operation and can attain the desired results more efficiently.

Figure 3: Foreground Attention Block (FAB). “BN” denotes batch normalization layer. We use the cascaded structure of channel- and position-wise attention to produce the foreground attention (shown in Fig. 2), which is element-wise added to each channel of the input C×H×W feature maps.

Our channel-wise attention module is architecturally similar to the squeeze-and-excitation (SE) block [24], where we replace the average pooling operation with spatial attention pooling. Specifically, as shown in Fig. 3, we employ spatial attention [23, 31] to assign similar weights to pixels having similar scores using softmax, which are then used to perform weighted spatial average pooling of the input feature maps, producing a channel-weighted C×1 vector. The channel-weighted (by the C×1 vector above) C×H×W feature maps are then processed by the position-wise attention module, producing an attention-weighted 1×H×W map. This map serves as the foreground attention, giving an initial, coarse estimation of the foreground region of interest. It is then element-wise added to each channel of the input C×H×W feature maps, producing the final features that are input to the subsequent parts of our model.
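The following is a minimal PyTorch sketch of such a cascaded block. The reduction ratio, the 1×1 scoring convolution used for spatial attention pooling, and the 3×3 position-wise convolution are our own illustrative choices rather than the exact configuration of Fig. 3:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FABSketch(nn.Module):
    """Cascaded channel-wise then position-wise attention (a sketch of FAB)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # scores for spatial attention pooling (replaces SE average pooling)
        self.score_conv = nn.Conv2d(channels, 1, kernel_size=1)
        # SE-style excitation producing a C x 1 channel weighting
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        # position-wise attention producing a 1 x H x W foreground map
        self.pos_conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):                                   # x: B x C x H x W
        b, c, h, w = x.shape
        # channel-wise attention: softmax-weighted spatial pooling
        weights = F.softmax(self.score_conv(x).view(b, 1, h * w), dim=-1)
        pooled = torch.bmm(weights, x.view(b, c, h * w).transpose(1, 2))  # B x 1 x C
        channel_w = self.fc(pooled.view(b, c)).view(b, c, 1, 1)
        x_cw = x * channel_w                                # re-calibrated channels
        # position-wise attention: the 1 x H x W foreground attention map
        fg_att = torch.sigmoid(self.pos_conv(x_cw))
        # element-wise addition of the foreground map to every input channel
        return x + fg_att.expand_as(x)
```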

3.2 Diagnostic Attention

After identifying the foreground regions of interest, our model computes diagnostic attention at the remaining two levels of the hierarchy, which is realized with different but coupled classification objectives. First, we perform a two-way classification and generate the positive attention map. This step essentially attempts to tell apart CXR images with and without any abnormalities, thereby helping learn features that are important for positive/negative prediction. Next, the same feature maps are used in conjunction with a D-way classifier, where the goal is to identify the particular abnormality type among the D possibilities, to generate the abnormality attention maps. To exhaustively learn all features important for these classification tasks, and to produce the corresponding attention maps, we use a gradient-based attention mechanism, e.g., online CAM [13, 28]. The key idea of CAM is to generate the attention maps for different classes by weighting the convolutional feature maps with the weights from the fully-connected layer. Let f denote the feature maps before the log-sum-exp (LSE) pooling [32] operation and w denote the weight matrix of the fully-connected layer. To make our attention generation procedure trainable, we use w as the kernel of a 1×1 convolution layer such that:

M = \mathrm{ReLU}\left(\mathrm{conv}(f, w)\right),    (1)

where M has the shape D×T×S, and D is the number of classes in the D-way (D abnormality types) classification task. D is set to 1 for the two-way (positive/negative) classification task. Given the attention map M, we normalize the values to the range between 0 and 1 and apply a sigmoid for soft masking [15], which is defined as:

T(M) = \frac{1}{1 + \exp\left(-\alpha(M - \beta)\right)},    (2)

where the values of α and β are set to 100 and 0.4, respectively.
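A small sketch of this online CAM computation (Eq. 1) and the soft masking of Eq. 2 is given below; the tensor layouts and the per-map min-max normalization step are our assumptions:

```python
import torch
import torch.nn.functional as F

def online_cam(feature_maps, fc_weight):
    """Eq. 1: M = ReLU(conv(f, w)), with w used as a 1x1 convolution kernel."""
    # feature_maps: B x C x H x W, fc_weight: D x C  ->  M: B x D x H x W
    kernel = fc_weight.view(fc_weight.size(0), fc_weight.size(1), 1, 1)
    return F.relu(F.conv2d(feature_maps, kernel))

def soft_mask(M, alpha=100.0, beta=0.4, eps=1e-8):
    """Eq. 2: normalize each map to [0, 1], then apply the sigmoid soft mask."""
    b, d = M.shape[:2]
    flat = M.view(b, d, -1)
    m_min = flat.min(dim=-1, keepdim=True).values
    m_max = flat.max(dim=-1, keepdim=True).values
    norm = ((flat - m_min) / (m_max - m_min + eps)).view_as(M)
    return torch.sigmoid(alpha * (norm - beta))
```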

Based on Equations 1 and 2, we use the encoded feature maps from the backbone CNN and the weights of the fully-connected layer of the two-way (positive/negative) classification task to generate the positive attention map M^p. The abnormality attention maps M^{a_k} (k = 1, 2, …, D indicates the specific abnormality) are similarly computed from the encoded feature maps, which are weighted with the weights of the fully-connected layer in the D-way classification task (see Fig. 2). The synergy between the two classification tasks is achieved by means of our new ordinal constraints on the attention maps. Our intuition is that each abnormality attention map (obtained from the D-way classifier) should be completely contained within the positive attention map obtained from the two-way (positive/negative) classification task. Furthermore, the positive attention map itself should be constrained to cover all the possible regions that may be attended to by the individual abnormality attention maps. To this end, we formulate our new attention bound and attention union objective functions, detailed below. Finally, we can exploit the small number of box-level annotations to further improve the localization performance in the weakly supervised setting.

3.2.1 The attention bound loss

As noted above, we generate two diagnostic attention maps in the attention hierarchy: the positive attention map M^p with shape 1×H×W and the D-way abnormality attention maps M^{a_k} with shape D×H×W (where D is the number of abnormality classes and k = 1, 2, …, D indicates the specific abnormality). Given the hierarchical relationship between these two attention maps discussed above, we seek a learning objective that enables the model to learn to produce attention that respects this hierarchy. Specifically, given a CXR image, all abnormality attention maps M^{a_k} shall be contained within the region bounded by the positive attention map M^p, as shown in Fig. 4. Accordingly, our proposed attention bound objective, L_{bound}, attempts to spatially constrain the region covered by each M^{a_k} with respect to M^p, and is realized for an input CXR image as:

L_{bound} = \frac{1}{N}\sum\nolimits_{y_{k}=1}\left(1 - \frac{\sum\nolimits_{ij}\min(M_{ij}^{p}, M_{ij}^{a_{k}})\cdot T(M_{ij}^{a_{k}})}{\sum\nolimits_{ij} M_{ij}^{a_{k}}}\right),    (3)

where the image’s ground-truth label y_k ∈ {0, 1} (0/1 indicates absence/presence of abnormality k), N is the number of positive classes in the image label set {y_k}, i and j index the (i,j)-th pixel in the corresponding attention map, and T(M_{ij}^{a_k}) is the soft masking operation defined in Equation 2, which masks out the impact of background noise in the attention maps. In summary, L_{bound} ensures that the abnormality attentions M^{a_k} lie within the positive attention M^p.
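A possible implementation of L_{bound} is sketched below. The batch-level averaging over positive labels is a simplification of the per-image normalization by N, and the soft-masked maps T(M^{a_k}) are assumed to be precomputed with Eq. 2:

```python
import torch

def attention_bound_loss(M_pos, M_abn, T_abn, labels, eps=1e-8):
    """Sketch of Eq. 3. M_pos: B x 1 x H x W positive attention;
    M_abn: B x D x H x W abnormality attention; T_abn: its soft-masked
    version T(M) from Eq. 2; labels: B x D multi-hot ground truth."""
    labels = labels.float()
    overlap = (torch.min(M_pos, M_abn) * T_abn).sum(dim=(2, 3))   # B x D
    area = M_abn.sum(dim=(2, 3)) + eps
    per_class = 1.0 - overlap / area
    # average over the positive labels (a batch-level simplification of 1/N)
    n_pos = labels.sum().clamp(min=1.0)
    return (per_class * labels).sum() / n_pos
```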

Figure 4: (a) Relation of the positive attention map M^p and the abnormality attention maps M^a. (b) Real case illustration. Different abnormality attention maps M^{a_k} are represented in different colors.

3.2.2 The attention union loss

While our proposed L_{bound} helps enforce explicit constraints on the spatial bounds of the abnormality attention M^{a_k}, it does not provide any direct supervision on the spatial extent of the positive attention M^p. To ensure the positive attention encompasses all the possible abnormality attention regions, and no more than that, we enforce an explicit union constraint between the spatial extent of the positive attention map M^p and the union of all the abnormality attention maps M^{a_k}. Specifically, as shown in Fig. 4, we seek the spatial extent of the positive attention map M^p to be no more than the spatial extent of the union of all the abnormality attention maps M^{a_k}. To this end, we first compute the union of all abnormality attention maps, denoted M^u, as:

M_{ij}^{u} = \max(M_{ij}^{a_{1}}\cdot y_{1}, M_{ij}^{a_{2}}\cdot y_{2}, \dots, M_{ij}^{a_{D}}\cdot y_{D}),    (4)

where, similar to L_{bound}, we only consider the positive abnormalities. Then, our proposed attention union loss, L_{union}, is realized by calculating the region overlap between M^p and M^u as:

L_{union} = 1 - \frac{\sum\nolimits_{ij}\min(M_{ij}^{p}, M_{ij}^{u})\cdot T(M_{ij}^{p})}{\sum\nolimits_{ij} M_{ij}^{p}},    (5)

where T(M_{ij}^{p}) refers to the same soft-masking operation of Equation 2.
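The union loss can be sketched analogously; as before, T(M^p) is assumed to be precomputed with Eq. 2, and the label tensor is a multi-hot vector per image:

```python
import torch

def attention_union_loss(M_pos, T_pos, M_abn, labels, eps=1e-8):
    """Sketch of Eqs. 4 and 5. T_pos is the soft-masked positive attention."""
    labels = labels.float()
    # Eq. 4: pixel-wise max over the attention maps of the positive classes
    masked = M_abn * labels.view(labels.size(0), -1, 1, 1)
    M_union = masked.max(dim=1, keepdim=True).values              # B x 1 x H x W
    # Eq. 5: overlap of the positive attention with the union map
    overlap = (torch.min(M_pos, M_union) * T_pos).sum(dim=(2, 3))
    area = M_pos.sum(dim=(2, 3)) + eps
    return (1.0 - overlap / area).mean()
```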

3.3 Weakly-supervised Learning from Extra Box-level Annotations

Although we have proposed the attention bound and union losses to better locate the abnormality regions, localization remains challenging with only image-level labels. Given a small number of extra annotations (e.g., bounding-box annotations for abnormality localization), we can provide weak yet direct supervision to our model. To this end, we apply an attention-adaptive mean square error (AMSE) loss, L_{amse}. Given our abnormality attention map T(M^{a_k}) (bilinearly interpolated to the input size after soft masking) and the corresponding ground-truth abnormality annotation G^k, our L_{amse} is formulated as:

L_{amse} = \frac{1}{N}\sum\nolimits_{y_{k}=1}\left(\frac{\sum\nolimits_{ij}\left(T(M_{ij}^{a_{k}}) - G_{ij}^{k}\right)^{2}}{\sum\nolimits_{ij} T(M_{ij}^{a_{k}}) + \sum\nolimits_{ij} G_{ij}^{k}}\right),    (6)

where N is the number of positive abnormalities in the image label set {y_k}. Note that the proposed L_{amse} is a slightly modified version of the traditional MSE loss, using the sum of the regions of the location map M^{a_k} and G^k as an adaptive normalization factor. We show later that our L_{amse} improves localization performance even with a small number of box annotations. In the datasets used in this work, only box-level annotations are available. To obtain the ground-truth mask for the k-th abnormality class, given the box annotation, we generate a pixel-level binary mask G^k in which all pixels outside the annotated bounding box are set to 0.
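A sketch of L_{amse} under the same conventions (soft-masked, upsampled attention maps and binary box masks as inputs) could look as follows:

```python
import torch

def amse_loss(T_abn, G, labels, eps=1e-8):
    """Sketch of Eq. 6. T_abn: soft-masked, upsampled abnormality attention
    (B x D x H x W); G: binary box masks (B x D x H x W); labels: B x D."""
    labels = labels.float()
    sq_err = ((T_abn - G) ** 2).sum(dim=(2, 3))                    # B x D
    norm = T_abn.sum(dim=(2, 3)) + G.sum(dim=(2, 3)) + eps         # adaptive factor
    n_pos = labels.sum().clamp(min=1.0)
    return ((sq_err / norm) * labels).sum() / n_pos
```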

Figure 5: Illustration of the self-refinement method. We take the intersection of the box masks and the attention results as the new refined ground truth for subsequent training.

In practice, although box annotations are helpful for the localization of abnormalities, the annotated boxes tend to be too large and enclose too many background cues. To address this issue, we further introduce a self-refinement (SR) method for the box annotations with an extra pass of training. Specifically, we train an additional instance of our network with box annotations. This additional network is then used to generate attention maps for the training images with box annotations, for the potential refinement of the boxes. If the IoU value of the attention region and the corresponding box is lower than 0.3, we keep the original box. Otherwise, we obtain the refined mask annotation as the overlap of the box mask and the corresponding attention map. We show some examples in Fig. 5. It is worth noting that the additional SR network and the main network are distinct. The additional SR network is used for box refinement, whereas the main network is employed for the tasks of abnormality localization and diagnosis. The self-refinement method is not applied to the testing data in our experiments.
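The refinement of a single box mask can be sketched as below; the IoU threshold of 0.3 follows the text, while the binarization threshold for the attention map is an assumption:

```python
import numpy as np

def refine_box_mask(box_mask, attention, iou_thresh=0.3, att_thresh=0.5):
    """Keep the original box mask, or intersect it with the attention region."""
    # box_mask: binary H x W mask from the annotated box; attention: H x W in [0, 1]
    att_mask = (attention >= att_thresh).astype(np.uint8)
    inter = np.logical_and(box_mask, att_mask).sum()
    union = np.logical_or(box_mask, att_mask).sum()
    iou = inter / max(union, 1)
    if iou < iou_thresh:
        return box_mask                   # attention disagrees too much: keep the box
    return np.logical_and(box_mask, att_mask).astype(np.uint8)    # refined mask
```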

3.4 Overall Objective Loss Function

Our overall learning objective is comprised of the abnormality and positive/negative classification, the attention bound and union, and the attention-adaptive MSE losses, and is expressed as:

L_{total} = L_{ab} + z\cdot L_{amse} + \lambda_{1} L_{pn} + \lambda_{2} L_{bound} + \lambda_{3} L_{union},    (7)

where L_{pn} and L_{ab} are binary cross entropy (BCE) terms for the positive/negative and abnormality classifications, and z, λ_1, λ_2, and λ_3 are weighting factors (in our experiments we determine them with a search on 20% of the training and validation set). Note that when box annotations are not used, we set z = 0.
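Putting the pieces together, the overall objective of Eq. 7 can be assembled as in the following sketch, with the weighting factors taken from Section 4.1.1 and z zeroed out when no box annotations are available:

```python
import torch.nn.functional as F

def total_loss(logits_ab, logits_pn, targets_ab, targets_pn,
               L_bound, L_union, L_amse, has_boxes,
               z=0.5, lam1=0.01, lam2=0.001, lam3=0.001):
    """Sketch of Eq. 7; the weight values follow Section 4.1.1."""
    L_ab = F.binary_cross_entropy_with_logits(logits_ab, targets_ab)
    L_pn = F.binary_cross_entropy_with_logits(logits_pn, targets_pn)
    z_eff = z if has_boxes else 0.0       # z = 0 when no box annotations are used
    return L_ab + z_eff * L_amse + lam1 * L_pn + lam2 * L_bound + lam3 * L_union
```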

4 Experiments

4.1 Datasets and Settings

4.1.1 Experimental settings

We conduct experiments on the NIH Chest X-ray14 dataset [8] and the CheXpert dataset [9], two large-scale CXR datasets. In our experiments, we use dilated ResNet50 [33] as the backbone. The size of the abstracted feature maps used to generate the various attention maps is 1/8 of the input image size. We use the Adam [34] optimizer with momentum set to 0.9, a weight decay of 0.0001, and a learning rate of 0.00002 that is reduced by a factor of 10 after every 10 epochs. We train our model with a batch size of 24 and empirically set r = 6 for the LSE pooling, as defined in [32]. During training, we perform random translation and horizontal flipping for augmentation. We resize the original 3-channel images to 512×512 as the input. The loss weight factors z, λ_1, λ_2, and λ_3 are set to 0.5, 0.01, 0.001, and 0.001, respectively.
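For reference, a minimal sketch of LSE pooling with r = 6 is shown below; it follows the standard log-sum-exp pooling formulation of [32], averaged over spatial locations:

```python
import math
import torch

def lse_pool(feature_maps, r=6.0):
    """Log-sum-exp pooling over spatial locations, as in [32]."""
    # feature_maps: B x C x H x W  ->  pooled class scores: B x C
    b, c, h, w = feature_maps.shape
    flat = feature_maps.view(b, c, -1) * r
    return (torch.logsumexp(flat, dim=-1) - math.log(h * w)) / r
```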

Table 1: Comparison of AUC scores with existing state-of-the-art methods on the official split of NIH Chest X-ray14. We report the AUC with 95% confidence interval (CI) of our method.
Abnormality | Wang et al. [8] | Li et al. [12] | DNetLoc [3] | CAN [10] | Ours
Atelectasis 0.70 0.73 0.77 0.78 0.77 (0.77, 0.78)
Cardiomegaly 0.81 0.84 0.88 0.89 0.87 (0.86, 0.88)
Effusion 0.76 0.79 0.83 0.83 0.83 (0.83, 0.84)
Infiltration 0.66 0.67 0.71 0.70 0.71 (0.71, 0.72)
Mass 0.69 0.78 0.82 0.84 0.83 (0.82, 0.84)
Nodule 0.67 0.70 0.76 0.77 0.79 (0.78, 0.81)
Pneumonia 0.66 0.65 0.73 0.72 0.72 (0.70, 0.75)
Pneumothorax 0.80 0.81 0.85 0.86 0.88 (0.87, 0.88)
Consolidation 0.70 0.71 0.75 0.75 0.74 (0.73, 0.75)
Edema 0.81 0.81 0.84 0.85 0.84 (0.83, 0.85)
Emphysema 0.83 0.88 0.90 0.91 0.94 (0.93, 0.95)
Fibrosis 0.79 0.77 0.82 0.83 0.83 (0.81, 0.85)
Pleural Thickening 0.68 0.73 0.76 0.79 0.79 (0.78, 0.80)
Hernia 0.87 0.69 0.90 0.93 0.91 (0.87, 0.94)
Mean AUC 0.745 0.755 0.807 0.817 0.819 (0.815, 0.823)
Table 2: Comparison of AUC scores under the 5-fold CV scheme on NIH Chest X-ray14. We show the standard deviations for Li et al. [12] and our method.
Abnormality | Li et al. [12] | Liu et al. [21] | Ours
Atelectasis 0.80 ± 0.00 0.79 0.82 ± 0.01
Cardiomegaly 0.87 ± 0.01 0.87 0.90 ± 0.02
Effusion 0.87 ± 0.00 0.88 0.88 ± 0.01
Infiltration 0.70 ± 0.01 0.69 0.72 ± 0.01
Mass 0.83 ± 0.01 0.81 0.85 ± 0.02
Nodule 0.75 ± 0.01 0.73 0.79 ± 0.01
Pneumonia 0.67 ± 0.01 0.75 0.73 ± 0.01
Pneumothorax 0.87 ± 0.01 0.89 0.90 ± 0.01
Consolidation 0.80 ± 0.01 0.79 0.80 ± 0.01
Edema 0.88 ± 0.01 0.91 0.90 ± 0.01
Emphysema 0.91 ± 0.01 0.93 0.94 ± 0.01
Fibrosis 0.78 ± 0.02 0.80 0.81 ± 0.01
Pleural Thickening 0.79 ± 0.01 0.80 0.79 ± 0.01
Hernia 0.77 ± 0.03 0.92 0.86 ± 0.03
Mean AUC 0.806 0.826 0.835 ± 0.007
Table 3: Comparison of localization results for models trained using 100% unannotated images of NIH Chest X-ray14.
T(IoU) Model Atelectasis Cardiomegaly Effusion Infiltration Mass Nodule Pneumonia Pneumothorax Mean
0.1 Li et al. [12] 0.59 0.81 0.73 0.85 0.69 0.29 0.23 0.38 0.57
Liu et al. [21] 0.39 0.90 0.65 0.85 0.69 0.38 0.30 0.39 0.60
Ours 0.78 0.97 0.82 0.85 0.78 0.56 0.76 0.48 0.75
0.3 Liu et al. [21] 0.34 0.71 0.39 0.65 0.48 0.09 0.16 0.20 0.38
Ours 0.34 0.40 0.27 0.55 0.51 0.14 0.42 0.22 0.36

4.1.2 Evaluation metrics

We use the area under the ROC curve (AUC) to measure classification performance, and the intersection over union ratio (IoU) and the intersection over the detected region (IoR) to quantify localization results. To calculate IoU and IoR, we use bounding boxes of the localized attentive regions. Following prior work [8, 12], we report the ratio of the number of cases with correct localization to the total number of cases in each class. A localization result is considered correct if either IoU > T(IoU) or IoR > T(IoR) is satisfied, where T(*) denotes the corresponding threshold.
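For concreteness, the two criteria can be computed from a predicted box and a ground-truth box as in the sketch below (boxes are assumed to be (x1, y1, x2, y2) pixel coordinates):

```python
def box_iou_ior(pred_box, gt_box, eps=1e-8):
    """IoU and IoR between a predicted box and a ground-truth box."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    iou = inter / (area_pred + area_gt - inter + eps)
    ior = inter / (area_pred + eps)       # intersection over the detected region
    return iou, ior

# A localization is counted as correct if iou > T(IoU) or ior > T(IoR).
```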

4.2 NIH Chest-Xray14 Dataset

The NIH Chest X-ray14 dataset provides 112,120 X-ray images with abnormality labels from 30,805 patients. Images are labeled with 14 abnormality classes, and 984 bounding boxes of 8 abnormalities are provided for 880 images, labeled by board-certified radiologists.

4.2.1 Abnormality classification

We conduct two experiments to evaluate the performance on the abnormality classification task and compare to the state-of-the-art (SOTA) methods. The first experiment is based on the official split [8] and prepared at the patient-level. All images with box annotations are in the testing set and not used during model training. Results are shown in Table 1, where our method outperforms state-of-the-art methods in terms of AUC. In particular, our method outperforms DNetLoc [3] and CAN [10] methods that employ deeper models (DenseNet121) as their backbones.

Figure 6: Comparison of qualitative results between Liu et al. [21] and ours. All results are from models trained with 100% of the images without any box annotations. For the second column, Liu et al. [21] zoom in on the images for closer observation.

In the second experiment, we use the 5-fold cross-validation (CV) scheme following [12, 21]. For convenience of comparison, we use the same notation as [12, 21] by referring to unannotated CXR images as those with only image-level labels and annotated CXR images as those with both image- and box-level labels. In each fold, we use 70% of the annotated and 70% of the unannotated images for training, and 10% of the annotated and unannotated images for validation. The remaining 20% of the annotated and unannotated images are used for testing. We summarize the AUC scores of the compared methods w.r.t. the 14 abnormalities of the testing set in Table 2, where our method obtains the best mean AUC score of 0.835.

Table 4: Results of models trained with 50% unannotated and 80% annotated images of NIH Chest X-ray14 at various T(IoU) thresholds.
T(IoU) Model Atelectasis Cardiomegaly Effusion Infiltration Mass Nodule Pneumonia Pneumothorax Mean
0.1 Wang et al. [8] 0.69 0.94 0.66 0.71 0.40 0.14 0.63 0.38 0.57
Li et al. [12] 0.71 0.98 0.87 0.92 0.71 0.40 0.60 0.63 0.73
Ours 0.71 1.00 0.89 0.88 0.76 0.65 0.91 0.78 0.82
0.2 Wang et al. [8] 0.47 0.68 0.45 0.48 0.26 0.05 0.35 0.23 0.37
Li et al. [12] 0.53 0.97 0.76 0.83 0.59 0.29 0.50 0.51 0.62
Ours 0.54 1.00 0.75 0.79 0.67 0.53 0.86 0.60 0.72
0.3 Wang et al. [8] 0.24 0.46 0.30 0.28 0.15 0.04 0.16 0.13 0.22
Li et al. [12] 0.36 0.94 0.56 0.66 0.45 0.17 0.39 0.44 0.49
Liu et al. [21] 0.53 0.88 0.57 0.73 0.48 0.10 0.49 0.40 0.53
Ours 0.40 1.00 0.52 0.68 0.58 0.46 0.69 0.43 0.60
0.4 Wang et al. [8] 0.09 0.28 0.20 0.12 0.07 0.01 0.08 0.07 0.12
Li et al. [12] 0.25 0.88 0.37 0.50 0.33 0.11 0.26 0.29 0.42
Ours 0.26 1.00 0.29 0.56 0.40 0.35 0.50 0.32 0.46
0.5 Wang et al. [8] 0.05 0.18 0.11 0.07 0.01 0.01 0.03 0.03 0.06
Li et al. [12] 0.14 0.84 0.22 0.30 0.22 0.07 0.17 0.19 0.27
Liu et al. [21] 0.32 0.78 0.40 0.61 0.33 0.05 0.37 0.23 0.39
Ours 0.15 0.99 0.14 0.33 0.27 0.22 0.35 0.22 0.33
0.6 Wang et al. [8] 0.02 0.08 0.05 0.02 0.00 0.01 0.02 0.03 0.03
Li et al. [12] 0.07 0.73 0.15 0.18 0.16 0.03 0.10 0.12 0.19
Ours 0.08 0.97 0.05 0.18 0.14 0.15 0.27 0.11 0.24
0.7 Wang et al. [8] 0.01 0.03 0.02 0.00 0.00 0.00 0.01 0.02 0.01
Li et al. [12] 0.04 0.52 0.07 0.09 0.11 0.01 0.05 0.05 0.12
Liu et al. [21] 0.18 0.70 0.28 0.41 0.27 0.04 0.25 0.18 0.29
Ours 0.02 0.77 0.01 0.12 0.08 0.10 0.06 0.03 0.15
Table 5: Results of models trained with 50% unannotated and 80% annotated images of NIH Chest X-ray14 at various T(IoR) thresholds. Liu et al. [21] do not list any T(IoR) results in their paper. “SR” indicates the proposed self-refinement method to reduce the noise of box annotations.
T(IoR) Model Atelectasis Cardiomegaly Effusion Infiltration Mass Nodule Pneumonia Pneumothorax Mean
0.1 Wang et al. [8] 0.62 1.00 0.80 0.91 0.59 0.15 0.86 0.52 0.68
Li et al. [12] 0.77 0.99 0.91 0.95 0.75 0.40 0.69 0.68 0.77
Ours 0.74 1.00 0.92 0.88 0.81 0.67 0.94 0.80 0.85
Ours + SR 0.69 1.00 0.93 0.91 0.84 0.71 0.94 0.79 0.85
0.25 Wang et al. [8] 0.39 0.99 0.63 0.80 0.46 0.05 0.71 0.34 0.55
Li et al. [12] 0.57 0.99 0.79 0.88 0.57 0.25 0.62 0.61 0.66
Ours 0.61 1.00 0.76 0.82 0.74 0.54 0.89 0.62 0.75
Ours + SR 0.60 1.00 0.84 0.87 0.76 0.59 0.88 0.72 0.78
0.5 Wang et al. [8] 0.19 0.95 0.42 0.65 0.31 0.00 0.48 0.27 0.41
Li et al. [12] 0.35 0.98 0.52 0.62 0.40 0.11 0.49 0.43 0.49
Ours 0.40 1.00 0.50 0.59 0.60 0.35 0.63 0.43 0.56
Ours + SR 0.46 1.00 0.61 0.72 0.61 0.38 0.72 0.57 0.63
0.75 Wang et al. [8] 0.09 0.82 0.23 0.44 0.16 0.00 0.29 0.17 0.28
Li et al. [12] 0.20 0.87 0.34 0.46 0.29 0.07 0.43 0.30 0.37
Ours 0.21 0.88 0.28 0.38 0.36 0.25 0.43 0.27 0.38
Ours + SR 0.25 0.97 0.41 0.49 0.51 0.25 0.45 0.37 0.46
0.9 Wang et al. [8] 0.07 0.65 0.14 0.36 0.09 0.00 0.23 0.12 0.21
Li et al. [12] 0.15 0.59 0.23 0.32 0.22 0.06 0.34 0.22 0.27
Ours 0.10 0.58 0.11 0.20 0.18 0.16 0.23 0.13 0.21
Ours + SR 0.12 0.75 0.25 0.27 0.28 0.20 0.28 0.28 0.30

4.2.2 Abnormality localization

Two experiments are conducted to assess localization performance. In the first experiment, we evaluate localization performance using only image-level annotations, i.e., abnormality labels. Following [12, 21], we train our model with 100% of the images (111,240) without any box annotations and test on the 880 images with box annotations. We summarize our results in Table 3, where we compare to SOTA methods at T(IoU) = 0.1 and 0.3. Our method significantly improves localization performance for all abnormalities at T(IoU) = 0.1, with a mean score of 0.75. In particular, we achieve more than 150% improvement in localization for “Pneumonia” compared to the results of the most recent SOTA method [21]. We also observe remarkable localization improvements for the “Atelectasis”, “Mass”, and “Nodule” abnormalities. At T(IoU) = 0.3, our method also yields competitive localization performance and outperforms [21] for “Mass”, “Nodule”, and “Pneumonia”. We further show a qualitative comparison of the attention maps with the SOTA method [21] in Fig. 6, which corroborates the precision of our attention maps.

We also notice that our proposed method does not perform well for “Cardiomegaly”, “Effusion”, and “Infiltration” at T(IoU) = 0.3 compared with Liu et al. [21]. The underlying reasons may be twofold: 1) the threshold setting for the attention maps; 2) the coverage range of the bounding boxes. For the first reason, we set a very high threshold (0.999) on the attention maps to obtain the final binary localization masks for the computation of evaluation scores in Table 3. This threshold is very rigorous and may generate relatively small object masks, which may not be favorable for the localization of large classes like “Cardiomegaly”. For comparison, we further explore a smaller attention threshold of 0.1. The resulting correct ratio at T(IoU) = 0.3 with the attention threshold of 0.1 for “Cardiomegaly” is 0.72, which is slightly higher than the result of Liu et al. [21]. The second reason is the issue of box definition, which may lead to lower performance of our method at T(IoU) = 0.3, particularly for “Effusion” and “Infiltration”. We invited an experienced radiologist to carefully review the original annotated boxes, and the review suggests that the variability of the annotation coverage for some classes (e.g., “Effusion”, “Infiltration”, and “Pneumonia”) is very large (see figures and detailed analysis in the supplementary material). Larger boxes with more non-related regions may favor localization results with relatively larger masks, yielding higher scores. Specifically, the results of Liu et al. [21] for the “Infiltration” and “Effusion” cases in Fig. 6 are fuzzy and cover more non-related regions (even the abdominal region), while our localization results are closer to the abnormal regions of “Effusion” and “Infiltration”. Therefore, the scores of Liu et al. [21] for “Effusion” and “Infiltration” may be higher.

In the second experiment, we use the 5-fold CV scheme following [12, 21]. In each fold, we train our model with 50% of the unannotated images and 80% of the annotated images, and test with the remaining 20% of the annotated images. As shown in Table 4, our method outperforms SOTA methods in most cases. In particular, for the difficult classes of “Mass” and “Nodule”, our localization performance is remarkably better. Similar to the previous results, our method does not perform well for some abnormalities (e.g., “Effusion” and “Infiltration”). As observed earlier, this may be due to overly large box annotations for these abnormalities. For “Pneumonia”, we notice that our method achieves much better performance than the baselines at T(IoU) < 0.5 but much worse at T(IoU) >= 0.5. Although most boxes provided by NIH for “Pneumonia” are reasonable, some box annotations also enclose many non-related regions. Meanwhile, it is worth noting that there are some missing boxes in NIH Chest X-ray14. The figures and more details can be found in the supplementary material. Thus, we suggest that results with IoU scores in the range of 0.3 to 0.7 are acceptable (see Fig. 9 and Fig. 10 in the supplementary material). It can be observed that nearly 63% (0.69 − 0.06) of our results have IoU scores for “Pneumonia” in the range of 0.3 to 0.7, compared to 24% (0.49 − 0.25) of the results in Liu et al. [21].

Figure 7: Comparison of qualitative results using the self-refinement method. The images in the first row are input CXR images. The second row and third row show abnormality attention maps from the model trained with box annotations or the refined masks, respectively.
Table 6: Ablation study of all components in our method (RN50 = ResNet50, PA = Positive Attention, ABU = Attention Bound and Union Loss). All models are trained with the official split of NIH Chest X-ray14.
Model Mean AUC T(IoU)=0.1
Atelectasis Cardiomegaly Effusion Infiltration Mass Nodule Pneumonia Pneumothorax Mean
RN50-32-GAP 0.816 0.36 0.65 0.60 0.53 0.38 0.04 0.01 0.38 0.37
RN50-8-GAP 0.815 0.33  0.61 0.61 0.35 0.48 0.19 0.13 0.43 0.39
RN50-8-LSE 0.816 0.38 0.58 0.42 0.24 0.55 0.48 0.32 0.55 0.44
RN50-8-LSE+FAB-C
0.817 0.58 0.53 0.56 0.55 0.59 0.59 0.41 0.30 0.51
RN50-8-LSE+FAB-P
0.815 0.61 0.55 0.62 0.67 0.60 0.54 0.50 0.50 0.57
RN50-8-LSE+FAB
0.818 0.64 0.92 0.69 0.69 0.64 0.54 0.44 0.35 0.61
RN50-8-LSE+FAB+PA
0.815 0.61 0.91 0.70 0.76 0.67 0.51 0.58 0.31 0.63
RN50-8-LSE+PA+ABU
0.817 0.37 0.88 0.50 0.82 0.60 0.15 0.68 0.43 0.55
RN50-8-LSE+FAB+PA+Bound
0.816 0.68 0.78 0.69 0.78 0.68 0.52 0.61 0.46 0.65
RN50-8-LSE+FAB+PA+ABU
0.819 0.70 0.87 0.75 0.89 0.69 0.59 0.68 0.40 0.70
Table 7: Effectiveness of the AMSE loss. All models are trained using 50% unannotated and 80% annotated images of NIH Chest X-ray14.
T(IoU) Model Atelectasis Cardiomegaly Effusion Infiltration Mass Nodule Pneumonia Pneumothorax Mean
0.1 Ours 0.71 1.00 0.89 0.88 0.76 0.65 0.91 0.78 0.82
Ours-MSE 0.59 1.00 0.83 0.92 0.75 0.54 0.78 0.65 0.76
0.3 Ours 0.40 1.00 0.52 0.68 0.58 0.46 0.69 0.43 0.60
Ours-MSE 0.33 1.00 0.41 0.58 0.52 0.34 0.58 0.39 0.52

To better illustrate the localization performance, we also show the T(IoR) results in Table 5. This metric calculates the intersection over the detected region, reducing the effect of non-related regions in the box annotations during evaluation. As shown in Table 5, our method gives the best performance for most T(IoR) settings. We also show the performance of the self-refinement method, which reduces the noise of the foreground in the box annotations during training. The results show that the self-refinement method makes the attention results more concentrated within the box annotations. We show qualitative results of the self-refinement method in Fig. 7.

In summary, our proposed hierarchical attention framework results in more precise and pathologically plausible localization of abnormalities and achieves state-of-the-art results at several T(IoU) and T(IoR) settings in our experiments. It is worth noting that our method localizes relatively small abnormalities like “Nodule” much better. The detection of pulmonary “Nodule” in CXR images is also a very challenging task, with low sensitivity in the range of 0.69 to 0.82 for radiologists [35].

4.2.3 Ablation Study

Table 6 shows the results of the ablation studies with the AUC metric and the localization results at T(IoU) = 0.1, using the official split for training and testing. Abnormality attention is applied in all the models to generate the abnormality localization prediction, which is the same as CAM [13] when our attention bound and union losses are not used.

First, ResNet50 without the positive/negative classification branch is implemented with input images down-sampled by a factor of 32. The second backbone is the dilated ResNet50 with images down-sampled by a factor of 8. Global average pooling (GAP) is utilized for these two backbones. The corresponding results are shown in the first two rows, “RN50-32-GAP” and “RN50-8-GAP”, respectively, of Table 6. We can see that a higher resolution, i.e., a smaller down-sampling factor, is helpful for the localization of classes like “Mass” and “Nodule”. Second, we propose to use LSE pooling instead of GAP. To illustrate its effectiveness, we compare the network “RN50-8-LSE” to “RN50-8-GAP”. As shown in Table 6, the average localization score is higher with LSE.

Third, we perform an ablation study for different versions of FAB. Specifically, we compare three attention configurations: FAB-C (channel-wise only), FAB-P (position-wise only), and FAB (both channel-wise and position-wise). The combination of both attention modules achieves the best average localization performance (0.61).

Fourth, we add the positive/negative classification branch to generate the positive attention (“RN50-8-LSE+FAB+PA”), with which the average localization score (0.63) is slightly higher than that of “RN50-8-LSE+FAB”. However, with the addition of our attention bound and union losses in the last row, the average localization performance increases to 0.70. The results suggest that the proposed attention losses are important for the hierarchical attention representation in guiding the learning of abnormality attention from positive attention. Results of our method without FAB (“RN50-8-LSE+PA+ABU”) suggest that the FAB module is effective for refining the feature encoding from the backbone network. Comparing the results of “RN50-8-LSE+FAB” and the last row, we can see that the localization results for “Infiltration” and “Pneumonia” are significantly boosted by our hierarchical attention mining method. In particular, the “Pneumonia” class has the smallest number of samples (876 images) in the training set (86,524 images) of the official split. It is evident that our HAM method can alleviate the data-imbalance problem even in such an extreme situation.

At the same time, we show the results of our method with only attention bound constraint (“RN50-8-LSE+FAB+PA+Bound”). It performs worse in both classification and localization tasks than the model with attention bound and union losses. Because there is no constraint for positive attention without the incorporation of attention union loss, the incorrect prediction of positive attention may misdirect the abnormality attention. In such a case, the final classification and localization performances are compromised. Accordingly, by comparing the ablation of attention bound and attention union losses, the effectiveness of the synergy of two attention losses is corroborated.

To better illustrate the statistical significance of our method, we also calculate the p-value between the baseline “RN50-8-LSE” and our model “RN50-8-LSE+FAB+PA+ABU” in Table 6. The p-value for the classification predictions of these two models is 4.31 × 10^{-5}, implying that the proposed method yields significant improvements over “RN50-8-LSE”.

Finally, the effectiveness of the proposed AMSE loss is shown by comparing the performance to the original MSE loss. We conduct the same 5-fold CV experiment as the second experiment in section 4.2.2. As shown in Table 7, the AMSE loss is especially effective for the smaller abnormalities (e.g., “Nodule”).

Table 8: Comparison of AUC scores with different models (RN50 = ResNet50, RN152 = ResNet152) on the CheXpert validation set.
Model | Atelectasis | Cardiomegaly | Consolidation | Edema | Pleural Effusion | Mean AUC
U-Ignore [9] 0.818 0.828 0.938 0.934 0.928 0.8892
U-Zeros [9] 0.811 0.840 0.932 0.929 0.931 0.8886
U-Ones [9] 0.858 0.832 0.899 0.941 0.934 0.8927
U-Ones+CT+LSR [36] 0.825 0.855 0.937 0.930 0.923 0.8940
Ours-RN50 0.897 0.838 0.893 0.932 0.938 0.8996
Ours-RN152 0.920 0.886 0.907 0.937 0.933 0.9166
Table 9: Results at various T(IoU) from different models on the validation set of CheXpert. “Ours” denotes the models trained without any box annotations. “Oursextra” denotes the models trained with extra box annotations from 457 images.
T(IoU) | Model | Atelectasis | Cardiomegaly | Consolidation | Edema | Enlarged Cardiomediastinum | Pneumonia | Pneumothorax | Pleural Effusion | Fracture | Mean
0.1 RN50 0.57 0.93 0.70 0.91 0.86 0.85 0.46 0.81 0.12 0.69
DRN50 0.58 0.66 0.69 0.73 0.50 0.88 0.26 0.78 0.19 0.59
Ours 0.71 0.91 0.81 0.92 0.98 0.92 0.54 0.80 0.11 0.74
Oursextra 0.79 0.99 0.85 0.99 1.00 0.95 0.74 0.95 0.22 0.83
Oursextra + SR 0.71 0.99 0.82 0.99 0.99 0.93 0.71 0.94 0.26 0.82
0.3 RN50 0.16 0.49 0.31 0.53 0.34 0.43 0.25 0.43 0.02 0.33
DRN50 0.13 0.02 0.25 0.07 0.02 0.35 0.06 0.33 0.02 0.14
Ours 0.21 0.35 0.43 0.52 0.37 0.62 0.16 0.35 0.02 0.34
Oursextra 0.36 0.99 0.48 0.85 0.98 0.63 0.40 0.70 0.04 0.60
Oursextra + SR 0.26 0.99 0.38 0.71 0.98 0.56 0.32 0.51 0.05 0.53
0.5 RN50 0.02 0.10 0.08 0.08 0.02 0.11 0.09 0.11 0.00 0.07
DRN50 0.02 0.00 0.02 0.00 0.00 0.06 0.00 0.05 0.00 0.02
Ours 0.04 0.03 0.11 0.07 0.05 0.13 0.05 0.06 0.00 0.06
Oursextra 0.07 0.93 0.16 0.44 0.81 0.17 0.10 0.20 0.00 0.32
Oursextra + SR 0.03 0.88 0.06 0.27 0.79 0.11 0.08 0.10 0.00 0.26
Table 10: Results at various T(IoR) from different models on the validation set of CheXpert.
T(IoR) | Model | Atelectasis | Cardiomegaly | Consolidation | Edema | Enlarged Cardiomediastinum | Pneumonia | Pneumothorax | Pleural Effusion | Fracture | Mean
0.1 RN50 0.59 0.98 0.75 0.95 0.98 0.90 0.55 0.83 0.15 0.74
DRN50 0.62 0.95 0.78 0.90 0.92 0.94 0.56 0.84 0.22 0.75
Ours 0.73 0.98 0.87 0.97 0.99 0.96 0.64 0.84 0.12 0.79
Oursextra 0.83 0.99 0.88 1.00 1.00 0.97 0.85 0.97 0.26 0.86
Oursextra + SR 0.82 0.99 0.90 1.00 1.00 0.97 0.85 0.97 0.32 0.87
0.5 RN50 0.06 0.93 0.27 0.73 0.90 0.47 0.36 0.29 0.00 0.45
DRN50 0.07 0.89 0.48 0.67 0.73 0.58 0.39 0.40 0.02 0.47
Ours 0.12 0.94 0.49 0.71 0.91 0.65 0.37 0.35 0.02 0.51
Oursextra 0.40 0.99 0.56 0.88 1.00 0.74 0.54 0.73 0.03 0.66
Oursextra + SR 0.50 0.99 0.65 0.92 1.00 0.85 0.63 0.85 0.08 0.72
0.75 RN50 0.01 0.83 0.06 0.48 0.79 0.26 0.20 0.08 0.00 0.30
DRN50 0.01 0.83 0.27 0.48 0.60 0.36 0.31 0.17 0.01 0.34
Ours 0.02 0.82 0.30 0.46 0.70 0.48 0.23 0.15 0.00 0.35
Oursextra 0.17 0.89 0.35 0.65 0.95 0.59 0.35 0.43 0.01 0.54
Oursextra + SR 0.30 0.94 0.46 0.78 0.98 0.61 0.48 0.61 0.04 0.58

4.3 CheXpert Dataset

CheXpert is another prominent CXR dataset containing 224,316 chest radiographs of 65,240 patients. However, the images in CheXpert are annotated with labels only at the image level. To further illustrate the localization performance on this dataset, we invited a senior radiologist with 10+ years of experience to label bounding boxes for 9 abnormalities, i.e., “Atelectasis”, “Cardiomegaly”, “Consolidation”, “Edema”, “Enlarged Cardiomediastinum”, “Pneumonia”, “Pneumothorax”, “Pleural Effusion”, and “Fracture”. In the end, 2345 images were annotated with 6099 bounding boxes for the 9 abnormalities. It is worth noting that the number of our box annotations is significantly larger than the number of annotated boxes in the NIH dataset. These new box annotations on the CheXpert dataset will be released soon.

4.3.1 Abnormality classification

The authors of CheXpert [9] propose an evaluation protocol over 5 categories: “Atelectasis”, “Cardiomegaly”, “Consolidation”, “Edema”, and “Pleural Effusion”, which were selected based on their clinical importance and prevalence in the validation set. In this experiment, we use the official training set to train all the models and report the AUC scores of these 5 abnormalities on the official validation set. CheXpert captures the uncertainties inherent in radiograph interpretation with an effective labeling strategy (0 for negative, −1 for uncertain, and 1 for positive). There are a few significant differences in performance between the uncertainty-handling approaches. U-Ignore [9] ignores the uncertainty labels during training, while U-Zeros [9] and U-Ones [9] treat them as 0 or 1, respectively. Since the U-Ones model achieves the best AUC performance on this dataset, we treat all uncertainty labels as 1 in our experiments. At the same time, we show the results of our method with two backbone networks (ResNet50 and ResNet152). AUC scores are shown in Table 8. Our method outperforms the three models in the official paper [9], indicating that our method can maintain competitive classification performance.

In Table 8, we also compare with DenseNet-121 trained with the conditional training and label smoothing regularization (U-Ones+CT+LSR) strategies of [36]. Pham et al. [36] obtained a mean AUC score of 0.930 on the official CheXpert test set with an ensemble approach that combined six deep models: DenseNet-121, DenseNet-169, DenseNet-201, Inception-ResNet-v2, Xception, and NASNetLarge. In [36], DenseNet-121 with the U-Ones+CT+LSR strategies is the single model that achieves the best performance on the official CheXpert validation set. In comparison, our method outperforms DenseNet-121 with the U-Ones+CT+LSR strategies on the same validation set, even with the smaller ResNet50 backbone. The mean AUC scores of our method on the official test set with the ResNet50 and ResNet152 backbones are 0.888 and 0.895, respectively, without any ensemble strategy.

Figure 8: Qualitative comparison between different models. We compare the attention results of four cases produced by the two baseline models and our methods.

4.3.2 Abnormality localization

In this experiment, we use 221,674 images from the official training set to train the models, including 457 images with 1,435 bounding boxes. The remaining 1,888 images with 4,664 bounding boxes form the validation set used to evaluate the localization performance for the 9 abnormalities. Tables 9 and 10 show the localization results on this validation set. Here, we compare with two baseline models, ResNet50 and Dilated ResNet50. Dilated ResNet50 reduces the feature map size to 1/8 of the input image and uses LSE pooling. We use CAM [13] to generate the attention maps as the localization results for the two baseline methods. As the tables show, our method outperforms both baselines. Moreover, with the addition of very few extra annotations (457 images), our method (“Oursextra”) achieves significant improvements in localization scores. We also show the results of the self-refinement method (“Oursextra + SR”), which further improves the T(IoR) results.
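For clarity, the following is a minimal sketch of how a CAM-style attention map can be produced for the two baselines, assuming access to the final convolutional feature map and the classifier's fully-connected weights; the tensor shapes and variable names are hypothetical and do not reproduce our exact implementation.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx, out_size=(512, 512)):
    """CAM-style attention map from the last conv features of a classifier.

    features:  (C, H, W) feature map from the final convolutional block.
    fc_weight: (num_classes, C) weight matrix of the classification layer.
    class_idx: index of the abnormality whose map we want.
    """
    # Weighted sum of feature channels by the class-specific FC weights.
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    cam = F.relu(cam)                                         # keep positive evidence
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    # Upsample to the input resolution before thresholding into boxes/masks.
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode="bilinear", align_corners=False)[0, 0]
    return cam

# Usage with hypothetical shapes: 2048 channels on a 16x16 grid, 9 classes.
feat = torch.randn(2048, 16, 16)
w = torch.randn(9, 2048)
heatmap = class_activation_map(feat, w, class_idx=3)
```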

Qualitative results are shown in Fig. 8, where we observe that our method generates more accurate localization results. The baseline ResNet50 model usually produces rough and large attention maps, while our method produces more precise ones. The gridding effect caused by dilated convolution [37] can be observed in the results of the Dilated ResNet50 model in Fig. 8; our method eliminates this effect while still using dilated convolution to increase the resolution of the attention maps. With the addition of very few extra annotations, the attention results of our method are further improved. The attention results obtained with the self-refinement method are concentrated within the box annotations and look anatomically plausible, approaching the quality of an image segmentation.

5 Discussion and Conclusion

We presented a novel hierarchical attention framework comprising activation- and gradient-based attention mechanisms to address CXR image diagnosis and the corresponding abnormality localization problem. We evaluated our method on two public datasets and compared it with recent state-of-the-art methods. Extensive experimental results show that our method achieves state-of-the-art results on both image-level classification and abnormality localization tasks. Our method can be easily generalized to other weakly-supervised problems with limited box- or pixel-level annotations. Furthermore, our localization results can be used by radiologists to verify diagnostic conclusions, providing direct relevance in a clinical environment. These visual cues can also be used in an active learning framework where the radiologist guides the model towards improved predictions, thereby helping infuse human domain knowledge into continually-learning algorithms.

We next briefly discuss the limitations of our proposed method. First, the extension to 3D medical images may not be directly feasible. Because 3D medical images contain richer anatomical and pathological cues, generalizing the 2D design of our attention modules to 3D may not be trivial. Implementing 3D self-attention modules and 3D CAM/Grad-CAM also requires careful consideration of GPU memory usage. Meanwhile, since some 3D modalities, e.g., Computed Tomography (CT), may have lower resolution along the z-direction, the design of 3D self-attention and 3D Grad-CAM/CAM needs to account for anisotropic resolution. Further, the definition of a label hierarchy for 3D images may also be very different from that of 2D images.

Second, most abnormalities involved in this work are related to soft tissues. As can be seen in Tables 9 and 10, the performance of all methods on the bone-tissue abnormality of fracture is not very promising. Bone-tissue abnormalities like fractures are relatively subtle and can appear as thin, elongated shapes. Accordingly, detecting these kinds of abnormalities may require incorporating anatomical-structure constraints to reduce the search space. Since only two levels of label hierarchy are exploited in this study, without explicit consideration of broader anatomical labels such as rib, scapula, and spine, our current model may not be sufficient to localize the difficult abnormality of fracture.

Third, there also exist finer-grained hierarchies of abnormality labels in NIH ChestX-ray14 and CheXpert, which may be informative for improving performance. For example, pneumonia is the most common cause of lung consolidation and has related complications such as abscesses, pleural effusion, and infiltration. Such fine-grained hierarchies could help in learning shareable features or, to some degree, mitigate the label/sample imbalance problem. As part of future work, we will further explore broader and deeper hierarchies of abnormality and anatomy labels with specifically designed ordinal constraints in the hierarchical attention framework.

Finally, we note that there are some concerns regarding the box annotations for certain abnormalities (e.g., “Effusion”, “Infiltration”, and “Pneumonia”) in NIH ChestX-ray14. We invited an experienced radiologist to carefully review the original annotated boxes for these abnormalities and provide a detailed analysis in the supplementary material. It can be observed that some boxes tend to cover many non-related regions, and some boxes are missing. In such cases, even though our results deliver higher quality in terms of clinical findings, they cannot match the boxes of NIH ChestX-ray14 well. This leads to an inaccurate comparison between methods at high T(IoU) thresholds. To better resolve this issue, we hope to invite several senior radiologists to perform mask-level annotation for the NIH and CheXpert datasets in the future. These precise mask-level annotations can then be used to conduct a more meaningful localization comparison at high T(IoU) thresholds.

References

  • [1] B. S. Kelly, L. A. Rainford, S. P. Darcy, E. C. Kavanagh, and R. J. Toomey, “The development of expertise in radiology: in chest radiograph interpretation, “expert” search pattern may predate “expert” levels of diagnostic accuracy for pneumothorax identification,” Radiology, vol. 280, no. 1, pp. 252–260, 2016.
  • [2] P. Lakhani and B. Sundaram, “Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks,” Radiology, vol. 284, no. 2, pp. 574–582, 2017.
  • [3] S. Guendel, S. Grbic, B. Georgescu, S. Liu, A. Maier, and D. Comaniciu, “Learning to recognize abnormalities in chest x-rays with location-aware dense networks,” in Iberoamerican Congress on Pattern Recognition.   Springer, 2018, pp. 757–765.
  • [4] Q. Guan, Y. Huang, Z. Zhong, Z. Zheng, L. Zheng, and Y. Yang, “Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification,” arXiv preprint arXiv:1801.09927, 2018.
  • [5] C. Yan, J. Yao, R. Li, Z. Xu, and J. Huang, “Weakly supervised deep learning for thoracic disease classification and localization on chest x-rays,” in Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics.   ACM, 2018, pp. 103–110.
  • [6] M. Annarumma, S. J. Withey, R. J. Bakewell, E. Pesce, V. Goh, and G. Montana, “Automated triaging of adult chest radiographs with deep artificial neural networks,” Radiology, vol. 291, no. 1, pp. 196–202, 2019.
  • [7] X. Ouyang, Z. Xue, Y. Zhan, X. S. Zhou, Q. Wang, Y. Zhou, Q. Wang, and J.-Z. Cheng, “Weakly supervised segmentation framework with uncertainty: A study on pneumothorax segmentation in chest x-ray,” in MICCAI, 2019.
  • [8] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, “Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in CVPR, 2017.
  • [9] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya et al., “Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” arXiv preprint arXiv:1901.07031, 2019.
  • [10] C. Ma, H. Wang, and S. C. Hoi, “Multi-label thoracic disease image classification with cross-attention networks,” in MICCAI, 2019.
  • [11] I. M. Baltruschat, H. Nickisch, M. Grass, T. Knopp, and A. Saalbach, “Comparison of deep learning approaches for multi-label chest x-ray classification,” Scientific reports, vol. 9, no. 1, p. 6381, 2019.
  • [12] Z. Li, C. Wang, M. Han, Y. Xue, W. Wei, L.-J. Li, and L. Fei-Fei, “Thoracic disease identification and localization with limited supervision,” in CVPR, 2018.
  • [13] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in CVPR, 2016.
  • [14] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017.
  • [15] K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu, “Tell me where to look: Guided attention inference network,” in CVPR, 2018.
  • [16] L. Wang, Z. Wu, S. Karanam, K.-C. Peng, R. V. Singh, B. Liu, and D. N. Metaxas, “Sharpen focus: Learning with attention separability and consistency,” in ICCV, 2019.
  • [17] D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,” Annual review of biomedical engineering, vol. 19, pp. 221–248, 2017.
  • [18] L. Yao, J. Prosky, E. Poblenz, B. Covington, and K. Lyman, “Weakly supervised medical diagnosis and localization from multiple resolutions,” arXiv preprint arXiv:1803.07703, 2018.
  • [19] H. Chen, S. Miao, D. Xu, G. D. Hager, and A. P. Harrison, “Deep hierarchical multi-label classification of chest x-ray images,” in International Conference on Medical Imaging with Deep Learning, 2019, pp. 109–120.
  • [20] J. K. Gohagan, P. C. Prorok, R. B. Hayes, and B.-S. Kramer, “The prostate, lung, colorectal and ovarian (plco) cancer screening trial of the national cancer institute: History, organization, and status,” Controlled Clinical Trials, 2000.
  • [21] J. Liu, G. Zhao, Y. Fei, M. Zhang, Y. Wang, and Y. Yu, “Align, attend and locate: Chest x-ray diagnosis via contrast induced attention network with limited supervision,” in ICCV, 2019.
  • [22] Y. Wang, L. Lu, C.-T. Cheng, D. Jin, A. P. Harrison, J. Xiao, C.-H. Liao, and S. Miao, “Weakly supervised universal fracture detection in pelvic x-rays,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 459–467.
  • [23] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in CVPR, 2018.
  • [24] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in CVPR, 2018.
  • [25] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in CVPR, 2019.
  • [26] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” in ICLR, 2017. [Online]. Available: https://arxiv.org/abs/1612.03928
  • [27] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” in WACV, 2018.
  • [28] H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, “Attention branch network: Learning of attention mechanism for visual explanation,” in CVPR, 2019.
  • [29] C. Lian, M. Liu, Y. Pan, and D. Shen, “Attention-guided hybrid network for dementia diagnosis with structural mr images,” IEEE Transactions on Cybernetics, pp. 1–12, 2020.
  • [30] C. Lian, M. Liu, L. Wang, and D. Shen, “End-to-end dementia status prediction from brain mri using multi-task weakly-supervised attention network,” in MICCAI 2019, 2019, pp. 158–167.
  • [31] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “Gcnet: Non-local networks meet squeeze-excitation networks and beyond,” arXiv preprint arXiv:1904.11492, 2019.
  • [32] C. Sun, M. Paluri, R. Collobert, R. Nevatia, and L. Bourdev, “Pronet: Learning to propose object-specific boxes for cascaded neural networks,” in CVPR, 2016.
  • [33] F. Yu, V. Koltun, and T. Funkhouser, “Dilated residual networks,” in CVPR, 2017.
  • [34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [35] J. G. Nam, S. Park, E. J. Hwang, J. H. Lee, K.-N. Jin, K. Y. Lim, T. H. Vu, J. H. Sohn, S. Hwang, J. M. Goo, and C. M. Park, “Development and validation of deep learning–based automatic detection algorithm for malignant pulmonary nodules on chest radiographs,” Radiology, vol. 290, no. 1, pp. 218–228, 2019, PMID: 30251934. [Online]. Available: https://doi.org/10.1148/radiol.2018180237
  • [36] H. H. Pham, T. T. Le, D. T. Ngo, D. Q. Tran, and H. Q. Nguyen, “Interpreting chest x-rays via cnns that exploit hierarchical disease dependencies and uncertainty labels,” arXiv preprint arXiv:2005.12734, 2020.
  • [37] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in 2018 IEEE winter conference on applications of computer vision (WACV).   IEEE, 2018, pp. 1451–1460.

6 Supplementary Material

6.1 Analysis of Box Annotations in NIH Chest X-ray14

Referring to Tables III, IV, and V in the paper, our results are generally better than those of the compared methods of Li et al. [12] and Liu et al. [21]. In particular, at the settings of T(IoU) = 0.1 in Table III, T(IoU) < 0.5 in Table IV, and T(IoR) ≤ 0.5 in Table V, our method remarkably improves the localization performance for most abnormality classes, especially for “Nodule”. Accordingly, the efficacy of our method is corroborated. However, for some diffuse and fuzzy abnormalities (e.g., “Effusion”, “Infiltration”, “Pneumonia”), our method does not produce strong scores at high T(IoU) thresholds. We invited an experienced radiologist to carefully review the originally annotated boxes in NIH ChestX-ray14, who suggests that the variability of the annotation coverage for these classes is very large (see Fig. S1 for illustration).

Figure S1: Cases of “Effusion”, “Infiltration”, and “Pneumonia” with rough ground-truth boxes in NIH ChestX-ray14. We show the box annotations of NIH ChestX-ray14, our localization masks, and the corresponding IoU scores. For these cases where “Effusion”, “Infiltration”, and “Pneumonia” appear in both lungs, the box annotations are drawn across both lungs, even including the mediastinum, which may be problematic for evaluation. Our localization results are closer to the abnormal regions but cannot match well with boxes that enclose many non-related regions.
Figure S2: Annotation problems of two “Pneumonia” cases in NIH ChestX-ray14. We present the boxes of NIH ChestX-ray14 in brown and the boxes from our invited senior radiologist in cyan. The localization results from our method are presented as brown masks. We also show the IoU scores between our localization results and the boxes of NIH ChestX-ray14.

As can be seen in Fig. S1, the annotated boxes may sometimes enclose many non-related regions. If the abnormalities are present in both lungs, the boxes may be drawn so roughly that they even include the mediastinum. Including the mediastinum may be problematic, as lung infiltration or pneumonia does not occur there. Such relatively large and rough boxes do not favor our results when the T(IoU) setting is higher. Since our results localize the abnormalities more precisely, they are less likely to cover the mediastinum and therefore may not match the rough boxes at high IoU scores. It can be observed that the IoU scores of all the cases in Fig. S1 range roughly from 0.1 to 0.5. This may explain the lower scores at high T(IoU) thresholds in Tables III and IV.
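For reference, the IoU scores shown in Fig. S1 can be computed as in the following minimal sketch, which rasterizes the ground-truth box and compares it with a binary localization mask; the array shapes and box coordinates are hypothetical.

```python
import numpy as np

def mask_box_iou(pred_mask, gt_box):
    """IoU between a binary localization mask and a ground-truth box.

    pred_mask: (H, W) binary array from the thresholded attention map.
    gt_box:    (x1, y1, x2, y2) box in pixel coordinates.
    """
    gt_mask = np.zeros_like(pred_mask, dtype=bool)
    x1, y1, x2, y2 = gt_box
    gt_mask[y1:y2, x1:x2] = True
    pred = pred_mask.astype(bool)
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / union if union > 0 else 0.0

# Toy example (hypothetical): a 40x40 mask region overlapping a 40x40 ground-truth box.
mask = np.zeros((100, 100), dtype=bool)
mask[20:60, 20:60] = True
print(mask_box_iou(mask, (30, 30, 70, 70)))  # 900 / 2300 ~ 0.39
```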

Additionally, for “Pneumonia”, our invited senior radiologist also provided some relabeled bounding boxes, which we show in Fig. S2. Although most boxes provided by NIH ChestX-ray14 for “Pneumonia” are reasonable, it is worth noting that some boxes are missing. As can be seen in the figure, our method can also identify “Pneumonia” regions whose boxes were missed by the NIH annotators. In such cases, our results may not attain high IoU scores but indeed deliver higher quality in terms of clinical findings. For these reasons, we suggest that results with IoU scores in the range of 0.3 to 0.7 are acceptable (see “Pneumonia” in Fig. S1 and Fig. S2). The performance of our method for the “Pneumonia” class is reported at the settings of T(IoU) = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7] in Table IV. As can be seen, more than 65% of our results have IoU scores larger than 0.3; approximately 34% (0.69 - 0.35) and 29% (0.35 - 0.06) of our results have IoU scores between 0.3 and 0.5 and between 0.5 and 0.7, respectively, and 6% of our results have IoU scores above 0.7. With this distribution, the major portion of our results, nearly 63% (34% + 29%), has IoU scores in the range of 0.3 to 0.7, compared to 24% (0.49 - 0.25) of the results of Liu et al. [21] in the same IoU range.
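The interval fractions quoted above follow directly from differencing the cumulative accuracies reported in Table IV, as in this small sketch (the values are copied from the text; the dictionary layout is only for illustration):

```python
# Fraction of cases whose IoU falls in [lo, hi), derived from cumulative
# accuracies Acc@T(IoU): frac(lo, hi) = Acc@lo - Acc@hi.
acc_ours = {0.3: 0.69, 0.5: 0.35, 0.7: 0.06}   # "Pneumonia", ours (Table IV)
acc_liu = {0.3: 0.49, 0.7: 0.25}               # "Pneumonia", Liu et al. [21]

print(round(acc_ours[0.3] - acc_ours[0.5], 2))  # 0.34 -> IoU in [0.3, 0.5)
print(round(acc_ours[0.5] - acc_ours[0.7], 2))  # 0.29 -> IoU in [0.5, 0.7)
print(round(acc_ours[0.3] - acc_ours[0.7], 2))  # 0.63 -> IoU in [0.3, 0.7)
print(round(acc_liu[0.3] - acc_liu[0.7], 2))    # 0.24 -> IoU in [0.3, 0.7)
```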