2 School of Electronics Engineering and Computer Science, Peking University
3 School of Automation, Northwestern Polytechnical University
4 Department of Computer Science, University of Hong Kong
Email: [email protected], [email protected], {zhangshu@deepwise, gaokai@deepwise, liuxiaoqing@deepwise}.com, [email protected], [email protected]
A Structure-Aware Relation Network for Thoracic Diseases Detection and Segmentation
Abstract
Instance-level detection and segmentation of thoracic diseases or abnormalities are crucial for automatic diagnosis in chest X-ray images. Leveraging the constant anatomical structure of the chest and disease relations extracted from domain knowledge, we propose a structure-aware relation network (SAR-Net) extending Mask R-CNN. The SAR-Net consists of three relation modules: 1. the anatomical structure relation module encoding spatial relations between diseases and anatomical parts; 2. the contextual relation module aggregating clues based on query-key pairs of disease RoIs and lung fields; 3. the disease relation module propagating co-occurrence and causal relations into disease proposals. Towards a practical system, we also provide ChestX-Det, a chest X-ray dataset with instance-level annotations (boxes and masks). ChestX-Det is a subset of the public dataset NIH ChestX-ray14. It contains 3,500 images of 13 common disease categories labeled by three board-certified radiologists. We evaluate our SAR-Net on it and on another dataset, DR-Private. Experimental results show that SAR-Net enhances the strong Mask R-CNN baseline with significant improvements. ChestX-Det is released at https://github.com/Deepwise-AILab/ChestX-Det-Dataset.
Keywords:
Thoracic diseases detection and segmentation, SAR-Net, ChestX-Det

1 Introduction
Chest X-ray scanning is a routine examination for thoracic diseases in hospitals. With domain expertise, radiologists can identify and localize abnormalities or diseases for further diagnosis. To reduce the burden on radiologists, increasing effort has been put into computer-aided diagnosis in recent years. With the success of deep Convolutional Neural Networks (CNNs) on natural images, applications such as classification, detection and segmentation in medical images have also benefited greatly. In this paper, we aim to detect and segment thoracic abnormalities at the instance level based on Mask R-CNN [9]. Combining domain knowledge extracted from chest X-ray studies, we extend Mask R-CNN, a successful instance-segmentation framework, with our relation modules. In the following, we use the terms "abnormality" and "disease" interchangeably, in the same way "object" is used for natural images.
Our relation modules are motivated by three types of relations: 1. Spatial relations between diseases and thoracic anatomical structures. Diseases often have location priors, so encoding spatial relations and constraints can help obtain more accurate locations. 2. Contextual relations between abnormalities and observations in the lung fields. Contextual clues are always useful for radiologists. One typical example is contralateral examination, e.g., over-exposed X-rays have a similar appearance to lung consolidation; by checking the contralateral appearance, computers can mimic radiologists to exclude this type of false alarm. 3. Categorical dependency relations among diseases. It is common knowledge that one disease can cause another. Also, a complex disease might be caused by a combination of factors, resulting in various abnormalities. Therefore, multiple abnormalities can co-exist in one X-ray image.

To this end, we propose a structure-aware relation network consisting of three modules: a spatial relation module, a contextual relation module and a disease relation module. We first extract the anatomical structures via a pre-trained semantic part segmentation model. Instead of segmentation masks, we adopt only the bounding boxes of anatomical parts for further computation. For the spatial relation module, we encode the spatial relations between disease proposals and anatomical parts. For the contextual relation module, we extract lung-field features as the context, then adopt the query-key attention mechanism to learn attention weights over the lung fields; contextual features are then aggregated with these weights. For the disease relation module, we first build a relation graph among diseases based on the co-occurrence frequency of each pair of diseases, and then propagate information among diseases via the adjacency matrix. Figure 1 illustrates the three types of relations. The outputs of all three modules are encoded as feature vectors for each disease RoI proposal. We concatenate them with the original RoI features after RoI pooling. Both the box head and the mask head are trained from scratch to obtain more accurate detection and segmentation results.

Above all, the contributions of our work are threefold: 1. To push forward research on fully supervised instance-level detection and segmentation on chest X-rays, we provide a new benchmark called ChestX-Det, including instance-level annotations of 13 categories of diseases/abnormalities for 3,500 images from the public dataset NIH ChestX-ray14. 2. We provide a strong Mask R-CNN baseline for comparison in further research. 3. We propose SAR-Net, which models three types of relations. It can be embedded into general detection frameworks and enhances the baseline with significant improvements.
2 Related Works
2.1 Automatic Chest X-ray Analysis
Most existing works on chest X-rays focus on disease classification and weakly supervised localization [30, 21, 8, 20, 35, 34], due to the fact that most current public datasets contain only class labels or very limited amounts of box annotations. CheXpert [12], proposed by Irvin et al., contains 224,316 frontal- and lateral-view chest X-ray images. In its training set, the labels (1/0/uncertain for the presence of 14 classes of observations) are extracted from radiology reports by a carefully designed labeler; in the validation and test sets, the labels are annotated by radiologists. Earlier, NIH ChestX-ray14 [30], proposed by Wang et al., contains 112,120 front-view images of 14 disease categories, among which 880 images of 8 categories have box annotations. Recently, Nguyen et al. proposed VinDr-CXR [19], which contains 18,000 images manually annotated with 22 classes of rectangles surrounding abnormalities and 6 global labels of suspected diseases. There also exist datasets that focus on a single disease, such as the Pneumonia detection dataset (https://www.kaggle.com/c/rsna-pneumonia-detection-challenge), the Tuberculosis detection dataset [18] and the Pneumothorax segmentation dataset (https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation). CAM [37], Grad-CAM [26], and attention models [20, 35] are the most widely used tools to highlight rough locations of diseases. To use box annotations, Li et al. [14] propose an end-to-end fully convolutional neural network to classify and localize abnormalities; the problem is addressed in a weakly-supervised-segmentation manner at a coarse level. Based on [14], Liu et al. [16] propose a contrast-induced attention network to predict disease locations, with an alignment module that first aligns positive and negative images. Our work differs in that we are the first to detect chest X-ray diseases with fully supervised annotations. Chen et al. [2] present a deep hierarchical multi-label classification approach for chest X-ray diagnosis. Inspired by the idea of taxonomy in [2], we group the 13 categories into three parent classes (Table 1) and evaluate the effect of each relation module based on the class hierarchy. More details are presented in Subsection 4.6.
2.2 Object Detection/Instance Segmentation
Modern object detection frameworks follow two lines of approaches. The first line is two-stage detectors such as Fast R-CNN [7], Faster R-CNN [23] and FPN [15]. By using a top-down pathway and lateral connections, FPN enriches the semantic information of shallow layers while maintaining their high resolution; utilizing multi-level features to improve performance is crucial in detection tasks. The second line is one-stage detectors such as YOLO [22], SSD [17] and recent anchor-free detectors [13, 27, 39, 4, 38]. We choose FPN as our backbone framework since disease regions in chest X-rays range from 20×20 pixels (e.g., nodule, calcification) to 1000×1000 pixels (e.g., emphysema), and FPN is known for extracting RoI features from shallow to deep layers, capturing characteristics of small to large objects. Mask R-CNN [9] is a successful instance segmentation framework extending FPN; its mask head operating on RoIs can output class-specific masks. Since disease regions have various shapes, we use Mask R-CNN to visualize them for more accurate diagnosis. However, one difference in chest X-ray images is the constant anatomical structure, which is not available in natural images. Furthermore, general object detectors like FPN and Mask R-CNN fail to capture object-wise and class-wise relations, let alone object-structure relations. There exist works exploring relations for general object detection. In [11], appearance and geometry relations among objects are modeled, inspired by the attention module [28] in NLP. In [33], object features are enhanced by attending to different semantic concepts and propagating information through a common-sense knowledge graph. In [32], a sparse graph is built based on semantic and spatial relations among objects. Our work is closely related to these, but we focus on the relations between diseases and anatomical structures.

3 Method
In this section, we present our SAR-Net (Structure-Aware Relation Network). SAR-Net (Figure 3) consists of three relation modules: the anatomical structure relation module, the contextual relation module and the disease relation module. For region proposals $\{r_i\}$ with features $\{f_i\}$, our aim is to enhance each region proposal feature $f_i$ by concatenating the features $f^{s}_i$, $f^{c}_i$ and $f^{d}_i$ computed from the three modules.
As shown in Figure 3, all five parts are integrated into the whole framework and trained end-to-end. More specifically, the additional features from the three relation modules are computed online: during the iterative training of the RPN with FPN, the locations (region boxes) and visual appearance (region features, lung-field features) keep changing, and hence the spatial, contextual and categorical relations ($f^{s}$, $f^{c}$ and $f^{d}$) also change online. The architectures of the three relation modules are detailed in the following subsections, and the fusion step is sketched below.
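To make the fusion step concrete, the following is a minimal PyTorch sketch of how the three relation features could be concatenated onto each RoI feature before the box and mask heads; the class name, feature dimensions and single fusion layer are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class RelationEnhancedHead(nn.Module):
    """Fuse RoI features with the three relation features (sketch)."""

    def __init__(self, roi_dim=1024, spa_dim=256, ctx_dim=256, dis_dim=256):
        super().__init__()
        # The fused feature is mapped back to roi_dim before the
        # box/mask heads; the one-layer fusion is an assumption.
        self.fuse = nn.Linear(roi_dim + spa_dim + ctx_dim + dis_dim, roi_dim)

    def forward(self, f_roi, f_spa, f_ctx, f_dis):
        # f_roi: (N, roi_dim) RoI features after RoI pooling and 2 fc layers
        # f_spa, f_ctx, f_dis: (N, *) outputs of the three relation modules
        f = torch.cat([f_roi, f_spa, f_ctx, f_dis], dim=1)
        return self.fuse(f)
```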
3.1 Anatomical Structure Relation Module
Some diseases or abnormalities are highly correlated with specific parts or organs of the body. For instance, cardiomegaly is a medical condition in which the heart is enlarged, and atelectasis is a complete or partial collapse of the entire lung or of an area (lobe) of the lung.
Based on the above observation, we propose an anatomical structure relation module to encode spatial relations between diseases and anatomical parts. To accomplish this, we first adopt a pre-trained segmentation model (https://github.com/Deepwise-AILab/ChestX-Det-Dataset/tree/main/pre-trained_PSPNet) to obtain anatomical parts. We choose five key parts: left lung, right lung, left scapula, right scapula and heart, as shown in Figure 4. The segmentation model is trained on 1,000 chest images from external data labeled with the 5 parts, and is used to generate anatomical parts for each image in our datasets. Then, for each disease RoI, we use coordinate differences to quantify its spatial relation with each anatomical part:

$$s_{ip} = \Big(\tfrac{x_i^1 - x_p^1}{W},\ \tfrac{y_i^1 - y_p^1}{H},\ \tfrac{x_i^1 - x_p^2}{W},\ \tfrac{y_i^1 - y_p^2}{H},\ \tfrac{x_i^2 - x_p^1}{W},\ \tfrac{y_i^2 - y_p^1}{H},\ \tfrac{x_i^2 - x_p^2}{W},\ \tfrac{y_i^2 - y_p^2}{H}\Big) \quad (1)$$
where $(x^1, y^1)$ and $(x^2, y^2)$ are the up-left and bottom-right vertex coordinates of disease RoI $i$ and anatomical part $p$ respectively, and $W$ and $H$ are the width and height of the bounding box covering the left and right lungs. Relating all five parts, the concatenated vector $s_i$ is 40-dimensional. Inspired by [28], we embed $s_i$ into a high-dimensional space as
$$f^{s}_i = \mathrm{PE}(s_i), \quad \mathrm{PE}_{2k}(s) = \sin\big(s / \lambda^{2k/d}\big), \quad \mathrm{PE}_{2k+1}(s) = \cos\big(s / \lambda^{2k/d}\big) \quad (2)$$
where sine and cosine functions of different wavelengths, controlled by the base wavelength $\lambda$, are computed for each entry of $s_i$. We can change the dimension of $f^{s}_i$ by setting a different $d$. Finally, the spatial relations between all $N$ disease proposals and the anatomical parts are recorded in the position embedding $F^{s} \in \mathbb{R}^{N \times 40d}$.
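As a concrete illustration of the two steps above, the following PyTorch sketch computes the 40-dimensional coordinate-difference vector and its sinusoidal embedding; the embedding dimension and base wavelength are assumptions, chosen in the spirit of [28, 11].

```python
import torch

def coord_diff(roi, parts, W, H):
    """Eq. (1): normalized corner differences between one RoI and 5 part boxes.

    roi:   (4,) tensor [x1, y1, x2, y2] of the disease RoI
    parts: (5, 4) boxes of left/right lung, left/right scapula, heart
    W, H:  width/height of the box covering both lungs
    Returns a 40-d vector (8 values per part).
    """
    scale = torch.tensor([float(W), float(H)])
    diffs = []
    for p in parts:
        for i in (0, 1):        # up-left / bottom-right corner of the RoI
            for j in (0, 1):    # up-left / bottom-right corner of the part
                diffs.append((roi[2 * i:2 * i + 2] - p[2 * j:2 * j + 2]) / scale)
    return torch.cat(diffs)     # shape (40,)

def sinusoidal_embed(s, d=64, wavelength=1000.0):
    """Eq. (2): per-entry sin/cos embedding; d and the wavelength are assumed."""
    k = torch.arange(d // 2, dtype=torch.float32)
    freq = wavelength ** (2 * k / d)                       # (d/2,)
    angles = s.unsqueeze(-1) / freq                        # (40, d/2)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (40, d)
    return emb.flatten()                                   # f_s: shape (40 * d,)
```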
3.2 Contextual Relation Module
Contextual cues are very useful references for radiologists. For instance, by checking the symmetric area on the contralateral side, a radiologist can decide whether an abnormality is a nodule or simply a papilla. In this work, we focus on diseases within the lung fields, so we bound the contextual area to the anatomical parts of the left and right lungs. Here we customize the attention mechanism that has proven successful in natural language processing and natural image recognition: for the query of a particular disease RoI, a set of key contents in the lung fields is aggregated according to attention learned on disease-context relations. The relations involve both spatial and feature compatibility.
Specifically, we use $b^{L}$ and $b^{R}$ to denote the bounding boxes of the left lung and right lung respectively. We apply the same RoI feature extraction on $b^{L}$ and $b^{R}$ to obtain their grid-shaped features $X^{L}$ and $X^{R}$. For each disease RoI feature $f_i$ (after 2 fc layers) and each grid feature $x_k$, we compute their compatibility as
$$c_{ik} = \langle W_q f_i,\; W_k x_k \rangle \quad (3)$$
where $W_q$ and $W_k$ are both weight matrices, and $W_q f_i$ and $W_k x_k$ share the same dimension.
Similarly to Eq. (1) and Eq. (2), we can obtain the spatial relation $g_{ik}$ between abnormality $i$ and grid $k$:
$$g_{ik} = W_g\, \mathrm{PE}(s_{ik}), \quad s_{ik} = \Big(\tfrac{x_i^1 - x_k}{W},\ \tfrac{y_i^1 - y_k}{H},\ \tfrac{x_i^2 - x_k}{W},\ \tfrac{y_i^2 - y_k}{H}\Big) \quad (4)$$
where $(x_i^1, y_i^1)$ and $(x_i^2, y_i^2)$ are the coordinates of the up-left and bottom-right vertices of the abnormality RoI, $(x_k, y_k)$ is the center coordinate of grid $k$, and $W_g$ is a learnable weight matrix.
Then, the appearance relation and the spatial relation are summed across all grids and fed into a soft-max function to get the attention weights
$$\alpha_{ik} = \frac{\exp(c_{ik} + g_{ik})}{\sum_{k'} \exp(c_{ik'} + g_{ik'})} \quad (5)$$
Finally, grid-wise contextual features from the lung fields are aggregated according to the attention learned above:
$$f^{c}_i = W_v \sum_{k} \alpha_{ik}\, x_k \quad (6)$$
where $W_v$ denotes the convolution kernel weights. The enhanced features of all region proposals are $[F, F^{c}]$, where $[\cdot\,,\cdot]$ denotes the concatenation operation.
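The following PyTorch sketch puts Eqs. (3)-(6) together for a batch of disease RoIs; the feature dimensions, the single attention head, and the use of linear layers in place of 1×1 convolutions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualRelation(nn.Module):
    """Eqs. (3)-(6): attend each disease RoI over lung-field grid features."""

    def __init__(self, roi_dim=1024, grid_dim=256, d_k=256, pe_dim=256):
        super().__init__()
        self.Wq = nn.Linear(roi_dim, d_k, bias=False)   # query transform W_q
        self.Wk = nn.Linear(grid_dim, d_k, bias=False)  # key transform W_k
        self.Wg = nn.Linear(pe_dim, 1, bias=False)      # geometry weight W_g
        self.Wv = nn.Linear(grid_dim, d_k, bias=False)  # value (1x1 conv equiv.)

    def forward(self, f_roi, x_grid, g_pe):
        # f_roi:  (N, roi_dim)    RoI features after 2 fc layers
        # x_grid: (K, grid_dim)   grid features from both lung fields
        # g_pe:   (N, K, pe_dim)  sinusoidal embedding of RoI-grid geometry
        c = self.Wq(f_roi) @ self.Wk(x_grid).t()        # Eq. (3): (N, K)
        g = self.Wg(g_pe).squeeze(-1)                   # Eq. (4): (N, K)
        attn = F.softmax(c + g, dim=1)                  # Eq. (5): (N, K)
        return attn @ self.Wv(x_grid)                   # Eq. (6): f_c, (N, d_k)
```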
3.3 Disease Relation Module
Diseases or abnormalities in chest X-rays are highly correlated. For instance, pulmonary tuberculosis is a complex disease which might involve nodules, fibrosis and consolidation simultaneously. There also exist causal relations among abnormalities; for instance, rib fracture is likely to cause pneumothorax. To this end, we first build a relation graph containing co-occurrence and causal relations among diseases. Then messages from the semantic concepts of diseases are propagated via the relation graph. Finally, RoI features aggregate the messages to form the representation $f^{d}$. These three steps are detailed in the following.
Relation Graph Construction. To build the relation graph, we count the co-occurrence frequency of each pair of categories in the dataset and compute their conditional probability as
$$P(A \mid B) = \frac{N_{A \cap B}}{N_{B}} \quad (7)$$
where $N_{B}$ denotes the number of samples in which disease B exists, and $N_{A \cap B}$ denotes the number of samples in which diseases A and B co-exist. Then we build a directed graph $G = (V, E)$ as the disease relation graph, where the nodes $V$ are the disease categories and each edge $E_{ij}$ encodes the relation from category $i$ to category $j$:
$$E_{ij} = \begin{cases} P(A_j \mid A_i), & P(A_j \mid A_i) \ge \tau \\ 0, & \text{otherwise} \end{cases} \quad (8)$$
where edges with conditional probability below a threshold $\tau$ are pruned to keep the graph sparse.
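A minimal sketch of the graph construction follows, assuming binary per-sample disease labels; the value of the pruning threshold tau is an assumption, since the text only requires the resulting graph to be sparse.

```python
import numpy as np

def build_relation_graph(labels, tau=0.1):
    """Eqs. (7)-(8): directed relation graph from co-occurrence statistics.

    labels: (S, C) binary matrix; labels[s, c] = 1 if disease c is in sample s.
    tau:    pruning threshold (an assumption).
    Returns E with E[i, j] = P(disease j | disease i) for retained edges.
    """
    labels = labels.astype(np.float64)
    n_b = labels.sum(axis=0)                     # N_B: samples with each disease
    n_ab = labels.T @ labels                     # N_{A and B}: co-occurrence counts
    cond = n_ab / np.maximum(n_b[:, None], 1.0)  # row i holds P(. | disease i)
    np.fill_diagonal(cond, 0.0)                  # no self-loops
    return np.where(cond >= tau, cond, 0.0)      # prune weak edges
```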
Message Passing via the Relation Graph. The relation graph is built from statistics collected over thousands of X-rays and reflects prior knowledge. However, a specific sample only reflects the condition of one patient in a limited period. Therefore, we propose to use a global attention to focus message passing on the diseases potentially present in the image.
First, we feed the whole X-ray image into the same convolutional layers of SAR-Net to get the image feature $X$. Then a global average-pooling operation is applied to squeeze $X$ into a vector $\bar{x}$. Finally, we use a fully-connected layer to get the category scores $s \in \mathbb{R}^{C}$, where $C$ is the number of categories. The global binary cross-entropy loss for each category is defined as
$$L_c = -\big[\, y_c \log \hat{p}_c + (1 - y_c) \log(1 - \hat{p}_c) \,\big] \quad (9)$$
where $c$ is the category index, $y_c$ denotes the target label of the category and $\hat{p}_c = \sigma(s_c)$ denotes the predicted probability.
Inspired by some few/zero-shot works [25, 29, 6], we use the parameter weights of the classifier branch in the box head to represent the semantic embedding of each category. Formally, the semantic embedding is defined as $Q = [q_1, \dots, q_C] \in \mathbb{R}^{C \times d_w}$, where $C$ and $d_w$ are the number of categories and the weight dimension respectively. By transmitting causal or co-occurrence relations through the sparse relation graph, gated by the global probabilities $\hat{p}$, we obtain the $j$-th category embedding as
$$q'_j = q_j + \sum_{i=1}^{C} \hat{p}_i\, E_{ij}\, q_i \quad (10)$$
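The following sketch illustrates one plausible reading of the global gating and propagation in Eqs. (9)-(10); the residual form of the update rule and the feature dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiseaseMessagePassing(nn.Module):
    """Eqs. (9)-(10): global disease scores gate propagation over the graph."""

    def __init__(self, feat_dim, num_classes, E):
        super().__init__()
        self.global_fc = nn.Linear(feat_dim, num_classes)
        self.register_buffer("E", torch.as_tensor(E, dtype=torch.float32))

    def forward(self, image_feat, class_weights, targets=None):
        # image_feat:    (1, feat_dim, h, w) backbone feature of the whole image
        # class_weights: (C, d_w) box-head classifier weights (embedding Q)
        pooled = image_feat.mean(dim=(2, 3))      # global average pooling
        scores = self.global_fc(pooled)           # (1, C) category scores
        loss = None
        if targets is not None:                   # Eq. (9), per category
            loss = F.binary_cross_entropy_with_logits(scores, targets)
        p = torch.sigmoid(scores).squeeze(0)      # (C,) global probabilities
        # Eq. (10): q'_j = q_j + sum_i p_i * E_ij * q_i  (residual update assumed)
        q = class_weights + self.E.t() @ (p.unsqueeze(1) * class_weights)
        return q, loss
```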

Mapping Disease Relations to Regions. Since our goal is to enhance the original region features, we need to map the globally embedded semantics from categories to regions. In our work, we choose the classification probability of each RoI as the mapping bridge. The semantic embedding of the $i$-th region is
$$\hat{f}^{d}_i = \sum_{c=1}^{C} p_{ic}\, q'_c \quad (11)$$
where $p_{ic}$ denotes the probability of the $i$-th region belonging to category $c$.
Finally, $\hat{f}^{d}_i$ is linearly transformed to reduce its dimension:
$$f^{d}_i = W_d\, \hat{f}^{d}_i \quad (12)$$
where $W_d$ is the weight of a fully-connected layer. The enhanced features of all regions are $[F, F^{d}]$. We now have the embedded features of the spatial relation $F^{s}$, the contextual relation $F^{c}$ and the category relation $F^{d}$; the final enhanced feature of each disease RoI is $f'_i = [f_i, f^{s}_i, f^{c}_i, f^{d}_i]$.
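A short sketch of the mapping in Eqs. (11)-(12); the output dimension is an assumption.

```python
import torch
import torch.nn as nn

class DiseaseToRegion(nn.Module):
    """Eqs. (11)-(12): map propagated class embeddings back to RoIs (sketch)."""

    def __init__(self, d_w, out_dim=256):
        super().__init__()
        self.reduce = nn.Linear(d_w, out_dim, bias=False)  # W_d in Eq. (12)

    def forward(self, roi_probs, q):
        # roi_probs: (N, C) per-RoI classification probabilities p_ic
        # q:         (C, d_w) propagated semantic embeddings from Eq. (10)
        f = roi_probs @ q              # Eq. (11): (N, d_w)
        return self.reduce(f)          # Eq. (12): f_d, (N, out_dim)
```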
4 Experiments
4.1 Dataset and Evaluation
Datasets: ChestX-Det is a subset of NIH ChestX-ray14 with instance-level annotations. NIH ChestX-ray14 contains 112,120 front-view images, among which only 880 have box annotations. To make fully supervised learning feasible, we select 3,575 images and invite three board-certified radiologists to annotate them with 13 common categories of diseases or abnormalities. For annotation, we split the three radiologists into two roles. Committee: every chest X-ray is annotated by the first two radiologists, who are mutually blind to each other's annotations. Judge: the third and most experienced radiologist, with 15+ years of experience, selects from the annotations provided by the first two radiologists and can add annotations where needed. Figure 5 shows examples of the 13 annotated categories and a "normal" sample without abnormalities. We use 3,025 images for training and 553 images for testing; 10% of the training set is used for validation.
DR-Private is a private dataset of 6,629 chest X-ray images collected from multiple Chinese hospitals. The annotation process is the same as for ChestX-Det. For DR-Private, we use 5,800 images for training and 829 images for testing; 10% of the training set is used for validation. Table 1 shows the instance count of each disease in both datasets. For both instance segmentation and anatomical part segmentation, all data are selected to cover as wide an appearance range as possible: we include normal and abnormal samples with varying angles, whiteness and scanning conditions. For white-out lungs in part segmentation, we ask the radiologists to annotate the original contours as far as they can.
Part Segmentation Performance: We evaluate the part segmentation model on 229 test samples; the mean IoU (Intersection-over-Union) is 86.85%. We only use the bounding boxes of the segmentations for further computation, and segmentation inaccuracies do not seriously affect these bounding boxes. The segmentation results of PSP-Net thus serve SAR-Net well.
Evaluation: We borrow the bounding-box AP metric (APbb) [5] from general object detection. Considering that disease/abnormality regions lack the clear boundaries of general objects, we adopt AP at a loose IoU threshold as the main evaluation metric, with APs at stricter IoU thresholds reported for reference. We believe that strict-threshold AP better reflects localization performance and loose-threshold AP better reflects classification performance, while AP averaged over thresholds is a comprehensive index of both. For more practical usage, we also present instance-level recall at a fixed number of false positives (FP) per image for more direct reference. In addition, we use mask AP (APmask) to evaluate the performance of instance segmentation, with the corresponding threshold-specific APs provided for reference.
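As a worked example of the recall-at-fixed-FP metric, the sketch below computes instance-level recall once predictions have been matched to ground truth at the chosen IoU threshold; the exact matching protocol is an assumption.

```python
def recall_at_fp(preds, num_gt, num_images, fp_per_image=0.1):
    """Instance-level recall at a fixed FP rate per image (sketch).

    preds:  list of (confidence, is_true_positive) over all predicted boxes,
            each already matched to ground truth at the chosen IoU threshold.
    num_gt: total number of ground-truth instances for the category.
    """
    preds = sorted(preds, key=lambda x: -x[0])   # descending confidence
    fp_budget = fp_per_image * num_images
    tp = fp = 0
    for _, is_tp in preds:
        if is_tp:
            tp += 1
        else:
            fp += 1
            if fp > fp_budget:                   # FP budget exhausted
                break
    return tp / max(num_gt, 1)
```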
4.2 Experiment Setup
We implement SAR-Net on six detector configurations: ResNet-50-C4 [10] in Faster R-CNN [23], Mask R-CNN [9] and Cascade R-CNN [1]; ResNet-50-FPN [15] in Faster R-CNN and Mask R-CNN; and ResNet-50-FPN+DCN [3] in Mask R-CNN. The core idea of DCN is to capture position-based semantic context to improve the capacity of feature representation: it adds 2D offsets to the regular grid sampling locations, defining a new grid over which input values are selected.
All experiments are implemented in PyTorch on 4 TITAN-V GPUs. ResNet-50 is pretrained on ImageNet [24]. For all training, we apply stochastic gradient descent (SGD) with a weight decay of 0.0001 and a momentum of 0.9 to optimize all models. The first conv layers of FPN and C4 are frozen. We train for 50 epochs with an image batch size of 2 per GPU. The learning rate starts at 0.01 and is reduced by a factor of 10 after epochs 20 and 40.
During training, we adopt random flipping and multi-scale sampling of the shorter image side for all images. At the testing stage, the shorter side of the image is fixed at 1200. The total number of proposed regions after NMS is 512. All other hyper-parameters and loss functions follow standard Mask R-CNN. In Table 2, we list the parameter dimensions of the three modules with FPN+DCN. The extracted features $F^{s}$, $F^{c}$ and $F^{d}$ from the three relation modules are all 1D vectors. All weight parameters are initialized from a Gaussian prior.
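The optimization schedule above can be expressed as the following PyTorch sketch; the model here is a stand-in, as SAR-Net construction and the data pipeline are omitted.

```python
import torch
import torch.nn as nn

# Stand-in model; the real SAR-Net construction is omitted.
model = nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0001)
# the learning rate drops by 10x after epochs 20 and 40, over 50 epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 40], gamma=0.1)

for epoch in range(50):
    # ... one training epoch over the detection dataloader ...
    scheduler.step()
```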
4.3 Comparison with the baseline model
Table 3 compares the detection results of our SAR-Net against the baselines of various models on both datasets. SAR-Net consistently outperforms the baseline under every configuration of model, evaluation metric and dataset. The improvements across all models demonstrate that SAR-Net can be embedded into general two-stage object detection frameworks.
Model | Method | Dataset | AP | AP | AP | Dataset | AP | AP | AP
Faster R-CNN (C4) | Baseline | ChestX-Det | 50.5 | 36.5 | 10.4 | DR-Private | 42.8 | 33.6 | 11.9
Faster R-CNN (C4) | SAR-Net | | | | | | | |
Faster R-CNN (FPN) | Baseline | | 52.0 | 41.7 | 12.3 | | 48.5 | 38.7 | 14.3
Faster R-CNN (FPN) | SAR-Net | | | | | | | |
Cascade R-CNN (C4) | Baseline | | 48.6 | 39.0 | 15.1 | | 41.9 | 32.5 | 14.0
Cascade R-CNN (C4) | SAR-Net | | | | | | | |
Mask R-CNN (C4) | Baseline | | 48.6 | 37.9 | 10.9 | | 42.6 | 32.3 | 13.3
Mask R-CNN (C4) | SAR-Net | | | | | | | |
Mask R-CNN (FPN) | Baseline | | 52.8 | 42.6 | 15.2 | | 48.5 | 39.7 | 15.7
Mask R-CNN (FPN) | SAR-Net | | | | | | | |
Mask R-CNN (FPN+DCN) | Baseline | | 54.8 | 45.0 | 16.1 | | 48.7 | 40.1 | 16.1
Mask R-CNN (FPN+DCN) | SAR-Net | | | | | | | |
Stable gains are achieved at AP. We believe that SAR-Net strikes an effective balance between improving classification and improving localization, which accords with its design principle. Even with the strong Mask R-CNN (FPN+DCN) model, SAR-Net boosts the baseline by 2.7% at AP on ChestX-Det, demonstrating that our modules are robust. The improvements on DR-Private further demonstrate the effectiveness of SAR-Net and show that it generalizes to different data distributions. It is noteworthy that the results on DR-Private are slightly lower than those on ChestX-Det. The main reason is the different data distributions of the two datasets: ChestX-Det, drawn from ChestX-ray14, has more bedside samples, whereas DR-Private has more regular ones, and abnormalities in bedside samples are often more severe and more salient.
We also report the performance of instance segmentation on both datasets. As shown in Table 4, our relation modules can also achieve considerable performance gain in the instance segmentation task.
Dataset | Model | Method | APmask | AP | AP
ChestX-Det | Mask R-CNN (C4) | Baseline | 13.8 | 32.6 | 9.9
 | | SAR-Net | 15.3 | 35.4 | 12.9
 | Mask R-CNN (FPN) | Baseline | 16.2 | 36.6 | 13.4
 | | SAR-Net | 17.0 | 38.6 | 14.1
 | Mask R-CNN (FPN+DCN) | Baseline | 16.2 | 38.3 | 12.3
 | | SAR-Net | 17.9 | 40.6 | 14.9
DR-Private | Mask R-CNN (C4) | Baseline | 13.8 | 30.2 | 12.4
 | | SAR-Net | 15.0 | 32.8 | 13.8
 | Mask R-CNN (FPN) | Baseline | 16.9 | 37.3 | 13.8
 | | SAR-Net | 17.5 | 38.0 | 15.0
 | Mask R-CNN (FPN+DCN) | Baseline | 17.5 | 37.1 | 15.4
 | | SAR-Net | 17.7 | 37.9 | 15.4
Dataset | Method | Atelectasis | Calcification | Cardiomegaly | Consolidation | Diffusive Nodule | Effusion | Emphysema
ChestX-Det | Baseline | 51.4 | 76.3 | 61.2 | 37.8 | 51.9 | 65.4 |
 | SAR-Net | 39.5 | | | | | |
 | Method | Fibrosis | Fracture | Mass | Nodule | Pleural Thickening | Pneumothorax |
 | Baseline | 30.9 | 34.0 | 31.5 | 29.2 | 45.0 | |
 | SAR-Net | 39.8 | 28.2 | | | | |
Dataset | Method | Atelectasis | Calcification | Cardiomegaly | Consolidation | Diffusive Nodule | Effusion | Emphysema
DR-Private | Baseline | 11.6 | 85.4 | 43.5 | 46.1 | 39.2 | |
 | SAR-Net | 23.4 | 55.2 | | | | |
 | Method | Fibrosis | Fracture | Mass | Nodule | Pleural Thickening | Pneumothorax |
 | Baseline | 41.4 | 52.8 | 29.6 | 16.0 | 54.7 | 40.1 |
 | SAR-Net | 19.9 | | | | | |
Dataset | Method | Atelectasis | Calcification | Cardiomegaly | Consolidation | Diffusive Nodule | Effusion | Emphysema
ChestX-Det | Baseline | 0.612 | 0.857 | 0.492 | 0.558 | 0.712 | |
 | SAR-Net | 0.647 | 0.576 | | | | |
 | Method | Fibrosis | Fracture | Mass | Nodule | Pleural Thickening | Pneumothorax |
 | Baseline | 0.496 | 0.559 | 0.294 | 0.543 | 0.571 | |
 | SAR-Net | 0.492 | 0.429 | | | | |
Dataset | Method | Atelectasis | Calcification | Cardiomegaly | Consolidation | Diffusive Nodule | Effusion | Emphysema
DR-Private | Baseline | 0.341 | 0.394 | 0.921 | 0.608 | 0.633 | 0.653 |
 | SAR-Net | 0.575 | | | | | |
 | Method | Fibrosis | Fracture | Mass | Nodule | Pleural Thickening | Pneumothorax |
 | Baseline | 0.422 | 0.379 | 0.329 | 0.731 | 0.543 | |
 | SAR-Net | 0.363 | 0.605 | | | | |
Table 5 compares per-category results at AP. From our observation, pleural thickening, fracture, diffusive nodule, cardiomegaly and emphysema are the categories with consistent and large improvements on both datasets. For the representative categories of fracture and diffusive nodule, our method achieves APs of 38.9% and 45.6% on ChestX-Det, leading the baseline by 8.0% and 7.4% respectively. Fibrosis is the only category whose accuracy decreases on both datasets. In our datasets, emphysema and fibrosis often co-exist and have similar spatial distributions, both spreading across the whole lung fields. Compared with fibrosis, the features of emphysema are easier to distinguish, so when fibrosis and emphysema overlap, the proposal boxes tend to be predicted as emphysema; this phenomenon becomes more obvious when the relation modules are added. The reason is that SAR-Net has high sensitivity for diseases with more distinctive appearances, such as cardiomegaly and emphysema. Although the detection performance for fibrosis decreases slightly, the performance for emphysema improves greatly on both datasets. For more practical usage, we also provide the recall (sensitivity) at a fixed number of false positives per image for each category in Table 6. In particular, we set the instance-level FP rate to 0.1 per image, at the same IoU threshold as AP. Table 6 shows that, for most categories, SAR-Net obtains higher sensitivity than the baseline.
4.4 Performance on stronger backbones.
We perform experiments with stronger backbones ResNet-101 and ResNeXt-101 [31]. Results are shown in Table 7.
Method | Backbone | AP | AP | AP
Baseline | ResNet-50 | 54.8 | 45.0 | 16.1
SAR-Net | ResNet-50 | 57.0 | 47.7 | 19.0
Baseline | ResNet-101 | 54.4 | 45.6 | 16.5
SAR-Net | ResNet-101 | 57.9 | 48.5 | 19.7
Baseline | ResNeXt-101 (32×8d) | 55.9 | 46.5 | 16.6
SAR-Net | ResNeXt-101 (32×8d) | 58.1 | 48.9 | 20.0
Our proposed SAR-Net consistently improves performance on all the experimented backbones. Specifically, on the ResNet-101 backbone, we obtain gains of 3.5, 2.9 and 3.2 points on the three AP metrics respectively. This demonstrates that our modules are robust and generalize across backbone architectures.
4.5 Effectiveness of each relation module.
To evaluate the effectiveness of each module, we conduct ablation studies on ChestX-Det from different perspectives (Subsections 4.5-4.7). In all experiments, we use the same training, validation and test sets, with ResNet-50 as the backbone. We remove the spatial relation module (SRM), the disease relation module (DRM) and the contextual relation module (CRM) from SAR-Net in turn. Ablation results are shown in Table 8.
SRM | DRM | CRM | AP | AP | AP (on ChestX-Det)
✓ | ✓ | ✓ | | |
 | ✓ | ✓ | | |
✓ | | ✓ | | |
✓ | ✓ | | | |
The effect of the spatial relation. Compared with the full SAR-Net, removing the spatial relation module decreases performance by 0.0%, 0.4% and 1.5% on the three AP metrics respectively. The maximum decline occurs on the localization-oriented AP, validating the effectiveness of the spatial relation module for localization. Moreover, the zero decline on the classification-oriented AP indicates that the spatial relation module has little effect on classification, which accords with its design.
The effect of the disease relation. Compared with the full SAR-Net, removing the disease relation module decreases performance by 1.7%, 0.7% and 0.6% on the three AP metrics respectively. This indicates that the disease relation module enhances the feature representation more for classification than for localization. With the disease relation module, some disease labels are corrected through information propagation among diseases.
The effect of the contextual relation. Compared with the spatial relation + disease relation configuration, adding the contextual relation module boosts performance by 1.6%, 1.1% and 1.3% on the three AP metrics respectively. The contextual relation module encodes both spatial and appearance information; this encoding mechanism is effective for both classification and localization, as verified by the experimental results. Moreover, the proposed attention mechanism obtains further gains.
4.6 Performance of each relation module on different diseases.
In this subsection, we evaluate the performance of each relation module on different diseases. Specifically, inspired by [2], we group all diseases into three super-classes, namely LUNG, PLEURA and MEDIASTINUM. Detailed information on the super-classes and their sub-classes can be found in Table 1. Performance is evaluated per super-class, and results are shown in Table 9.
Super-Class | LUNG | | | PLEURA | | | MEDIASTINUM | | | Params | FLOPs
Model | AP | AP | AP | AP | AP | AP | AP | AP | AP | |
Baseline | 52.0 | 44.0 | 15.5 | 55.9 | 37.5 | 5.9 | 76.3 | 76.3 | 52.9 | - | -
+ DRM | 52.8 | 44.7 | 16.9 | 57.5 | 38.3 | 5.1 | 78.7 | 76.8 | 50.4 | +0.57M | +0.18G
+ SRM | 51.8 | 44.6 | 17.5 | 55.3 | 39.7 | 6.2 | 81.5 | 79.2 | 54.3 | +0.09M | +0.26G
+ CRM | 53.6 | 44.8 | 18.0 | 55.3 | 38.2 | 4.4 | 81.8 | 81.4 | 50.8 | +3.29M | +2.32G
SAR-Net | 53.6 | 46.3 | 18.8 | 58.1 | 40.5 | 9.8 | 83.6 | 81.4 | 48.8 | +3.8M | +2.5G
The contextual relation module obtains the maximum gains on all three AP metrics for LUNG diseases, while adding only the spatial or the disease relation module does not bring great improvement. This is easy to understand: LUNG diseases are distributed in the lung fields, but most of them do not appear at fixed locations. For instance, atelectasis may appear in either the upper or the lower lobe, depending on the cause. The contextual relation module, however, models contextual relations between diseases and observations in the lung fields, involving both spatial and appearance compatibility, and is therefore more effective for diagnosing lung diseases.
The diseases of PLEURA have strong co-occurrence relations. For instance, effusion and primary spontaneous pneumothorax (PSP) often cause pleural thickening. Moreover, these three diseases have specific spatial distributions and are located around the pleura. It is therefore anticipated that the disease module and the spatial module should contribute more to the performance improvement. The experimental results validate this hypothesis: the disease module obtains gains on the classification-oriented APs and the spatial module obtains gains on the localization-oriented APs. This suggests that the disease module boosts classification performance while the spatial module is better at improving localization accuracy, as emphasized before.
Compared with other diseases, cardiomegaly (MEDIASTINUM) has a distinctive appearance and a fixed location, which makes it an ideal target for our proposed relation modules. All three relation modules obtain considerable gains on the looser AP metrics, and the spatial relation module also obtains gains on the strictest AP, a metric that requires precise localization.
Different relation modules emphasize different aspects (classification or localization) for different diseases. In general, when all relation modules are added, our model achieves the best performance.
Model complexity. We also report the complexity of each module in Table 9. Our modules obtain higher performance with only a slight increase in parameters and floating-point operations (FLOPs). Note that some parameters are shared across modules, such as the coordinate parameters of the anatomical structure; hence the extra parameters of the full SAR-Net are fewer than the sum of the extra parameters of the three modules, and the same holds for FLOPs.
4.7 Qualitative results.
Figure 6 shows qualitative results for all the ablation methods. As shown in the first row, the disease relation module has little positive effect on localization, whereas the spatial relation module and the contextual relation module effectively improve the localization of the detected diseases. This is because the latter two modules are better at exploiting spatial information.

Pneumothorax and effusion have strong co-occurrence relations and specific spatial distributions, and thus benefit from the disease and spatial relation modules. As shown in the second row, the disease relation module can correct wrong disease labels through information propagation among diseases; the spatial relation module can achieve the same effect with the help of spatial information propagation. As shown in the quantitative results, the contextual relation module has high sensitivity in diagnosing LUNG abnormalities but low sensitivity for PLEURA abnormalities; in this sample, the fracture is detected but the pneumothorax is overlooked.
Pleural thickening develops when scar tissue thickens the delicate membrane lining the lungs (the pleura) and does not appear within the lung fields. As shown in the third row, the relation modules that contain spatial information (SRM, CRM) can effectively eliminate this kind of false positive. In addition, since the contextual relation module is not sensitive to PLEURA diseases, it cannot improve the localization of pleural thickening as the spatial relation module does.

Moreover, we also show some failure cases in Figure 7. As mentioned before, effusion has strong co-occurrence relations with pneumothorax and often appears at the costophrenic angle. As shown in the first case, when the relation modules are used, these factors cause the consolidation box to be predicted as an effusion box. As mentioned in Subsection 4.3, when fibrosis and emphysema boxes overlap, the proposal boxes tend to be predicted as emphysema; as shown in the second case, when the relation modules are used, the proposal box is predicted incorrectly and the fibrosis is overlooked. In actual post-processing, this prediction box would be deleted by NMS; we keep it in Figure 7 for illustration purposes. In the third case, there exists a lymphoid mass around the neck. In our datasets, masses usually appear in the lung fields, which causes the lymphoid mass to be overlooked when the relation modules are present. Note that such failure cases occur only occasionally.
Extensive experiment results indicate that our proposed method is effective and has a great potential for clinical application. In Figure 2, we also show some instance segmentation results of SAR-Net on ChestX-Det for close inspection. Only the contours are shown to retain the appearance of the detected diseases.
5 Conclusion
In conclusion, we present a structure-aware relation network (SAR-Net) for chest X-ray disease detection and instance segmentation. SAR-Net consists of three modules modeling three types of relations: 1. spatial relations between diseases and anatomical structures; 2. contextual relations between diseases and lung fields; 3. categorical relations among diseases. The proposed modules can be embedded into general object detection frameworks and bring significant improvements. We also present ChestX-Det, a subset of NIH ChestX-ray14 with instance-level annotations of 13 categories of diseases. We believe the new dataset is a valuable benchmark for evaluating disease detection in chest X-rays.
References
- [1] Cai, Z., Vasconcelos, N.: Cascade r-cnn: Delving into high quality object detection. In CVPR (2018)
- [2] Chen, H., Miao, S., Xu, D., Hager, G.D., Harrison, A.P.: Deep hierarchical multi-label classification applied to chest x-ray abnormality taxonomies. Medical Image Analysis 66 (2020)
- [3] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In CVPR (2017)
- [4] Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: Centernet: Keypoint triplets for object detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2019)
- [5] Everingham, M., Gool, L.V., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. In IJCV 88, 303–338 (2010)
- [6] Gidaris, S., Komodakis., N.: Dynamic few-shot visual learning without forgetting. In CVPR (2018)
- [7] Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 1440–1448 (2015)
- [8] Guan, Q., Huang, Y., Zhong, Z., Zheng, Z., Zheng, L., Yang, Y.: Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification. arXiv preprint arXiv:1801.09927 (2018)
- [9] He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. In ICCV (2017)
- [10] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In CVPR (2016)
- [11] Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In CVPR (2018)
- [12] Irvin, J., Rajpurkar, P., Ko, M., et al.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. arXiv preprint arXiv:1901.07031 (2019)
- [13] Law, H., Deng, J.: Cornernet: Detecting objects as paired keypoints. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 734–750 (2018)
- [14] Li, Z., Wang, C., Han, M., Xue, Y., Wei, W., Li, L.J., Fei-Fei, L.: Thoracic disease identification and localization with limited supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8290–8299 (2018)
- [15] Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie., S.: Feature pyramid networks for object detection. In CVPR (2017)
- [16] Liu, J., Zhao, G., Fei, Y., Zhang, M., Wang, Y., Yu, Y.: Align, attend and locate: Chest x-ray diagnosis via contrast induced attention network with limited supervision. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 10632–10641 (2019)
- [17] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: European conference on computer vision. pp. 21–37. Springer (2016)
- [18] Liu, Y., Wu, Y., Ban, Y., Wang, H., Cheng, M.: Rethinking computer-aided tuberculosis diagnosis. In CVPR (2020)
- [19] Nguyen, H.Q., Lam, K., Le, L.T., Pham, H.H., Tran, D.Q., Nguyen, D.B.: Vindr-cxr: An open dataset of chest x-rays with radiologist's annotations. arXiv preprint arXiv:2012.15029 (2020)
- [20] Pesce, E., Ypsilantis, P.P., Withey, S., Bakewell, R., Goh, V., Montana, G.: Learning to detect chest radiographs containing lung nodules using visual attention networks. arXiv preprint arXiv:1712.00996 (2017)
- [21] Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C., Shpanskaya, K., et al.: Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017)
- [22] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 779–788 (2016)
- [23] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS (2015)
- [24] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
- [25] Salakhutdinov, R., Torralba, A., Tenenbaum, J.: Learning to share visual appearance for multiclass object detection. In CVPR (2011)
- [26] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)
- [27] Tian, Z., Shen, C., Chen, H., He, T.: Fcos: Fully convolutional one-stage object detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2019)
- [28] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
- [29] Wang, X., Ye, Y., Gupta., A.: Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR (2018)
- [30] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2097–2106 (2017)
- [31] Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In CVPR (2017)
- [32] Xu, H., Jiang, C.H., Liang, X.D., Li, Z.G.: Spatial-aware graph relation network for large-scale object detection. In CVPR (2019)
- [33] Xu, H., Jiang, C.H., Liang, X.D., Lin, L., Li, Z.G.: Reasoning-rcnn: Unifying adaptive global reasoning into large-scale object detection. In CVPR (2019)
- [34] Yan, C., Yao, J., Li, R., Xu, Z., Huang, J.: Weakly supervised deep learning for thoracic disease classification and localization on chest x-rays. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. pp. 103–110. ACM (2018)
- [35] Ypsilantis, P.P., Montana, G.: Learning what to look in chest x-rays with a recurrent visual attention model. arXiv preprint arXiv:1701.06452 (2017)
- [36] Zhao, H.S., Shi, J., Qi, X.J., Wang, X.G., Jia, J.Y.: Pyramid scene parsing network. In CVPR (2017)
- [37] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2921–2929 (2016)
- [38] Zhou, X., Zhuo, J., Krähenbühl, P.: Bottom-up object detection by grouping extreme and center points. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (2019)
- [39] Zhu, C., He, Y., Savvides, M.: Feature selective anchor-free module for single-shot object detection. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (2019)