¹¹institutetext: Department of Computer Science, Peking University, Beijing, China ²²institutetext: Deepwise AI Lab, Beijing, China
²²email: [email protected] ³³institutetext: Advanced Institute of Information Technology, Peking University, Hangzhou, China ⁴⁴institutetext: Center on Frontiers of Computing Studies, Peking University, Beijing, China ⁵⁵institutetext: The University of Hong Kong, Pokfulam, Hong Kong

Revisiting 3D Context Modeling with Supervised Pre-training for Universal Lesion Detection in CT Slices

Shu Zhang 11 Jincheng Xu This work was done when Jincheng Xu was an intern at Deepwise AI Lab.11 Yu-Chun Chen 22 Jiechao Ma 22 Zihao Li 22 Yizhou Wang 113344 Yizhou Yu

{}^{\left(\textrm{{\char 0\relax}}\right)}

2255

Abstract

Universal lesion detection from computed tomography (CT) slices is important for comprehensive disease screening. Since each lesion can locate in multiple adjacent slices, 3D context modeling is of great significance for developing automated lesion detection algorithms. In this work, we propose a Modified Pseudo-3D Feature Pyramid Network (MP3D FPN) that leverages depthwise separable convolutional filters and a group transform module (GTM) to efficiently extract 3D context enhanced 2D features for universal lesion detection in CT slices. To facilitate faster convergence, a novel 3D network pre-training method is derived using solely large-scale 2D object detection dataset in the natural image domain. We demonstrate that with the novel pre-training method, the proposed MP3D FPN achieves state-of-the-art detection performance on the DeepLesion dataset (3.48% absolute improvement in the sensitivity of [email protected]), significantly surpassing the baseline method by up to 6.06% (in [email protected]) which adopts 2D convolution for 3D context modeling. Moreover, the proposed 3D pre-trained weights can potentially be used to boost the performance of other 3D medical image analysis tasks.

Keywords:

Lesion Detection 3D Context Modeling 3D Network Pre-training.

1 Introduction

With its high-resolution image and low cost, CT scan is critical in clinical decision and holds the key for making precise medical-care accessible to everyone around the world. Recently, deep learning methods have been introduced to detect lesions in CT slices [1, 2, 3, 4, 5]. Since it is difficult to distinguish lesions within a single axial slice, exploiting sufficient 3D context for accurate detection in volumetric CT data has emerged as a significant research focus.

Various architectures have been proposed for proper modeling of 3D context from neighboring CT slices. Yan et al. [1] adopts a late fusion strategy which stacked 2D features of neighboring slices to build 3D context enhanced features. Although the pseudo-3D contextual information has provided prominent performance gain [1, 2, 3, 4, 5], its late fusion strategy leads to notable losses of context information from early stages of the network. A direct way to address these issues is to employ 3D convolutions which introduce inter-slice connections hierarchically to learn 3D representations end to end. 3D convolutional filters can well preserve the 3D structure and texture information, but intensive memory and computation demands hinder its wide application in the universal lesion detection problem. What’s worse, although 3D network pre-training has raised significant research attention [6, 7, 8, 9], the lack of good pre-trained 3D models makes it even harder to achieve good performance with 3D based detectors.

In this paper, we focus on the problem of universal lesion detection in CT slices, where multiple adjacent CT slices are taken into consideration to localize 2D lesions for the target slice. We aim to develop a generic and efficient 3D backbone for 2D lesion detection with enhanced context modeling ability from multiple CT slices and devise a supervised pre-training method to boost its performance. Specifically, pseudo-3D convolutional filters [8] which use depth-wise separable convolution are adopted to reduce the memory and computation overhead. The backbone in our method is a Modified Pseudo-3D ResNet (MP3D ResNet), which extracts context enhanced 3D features from multiple neighboring CT slices (9 in our case) and then converted the 3D features into 2D ones with a group transform module (GTM) for further 2D lesion detection in the target slice. Then, we feed backbone features extracted from MP3D ResNet into the neck of Feature Pyramid Network (FPN) to form the MP3D FPN for effective multi-scale detection. Finally, to facilitate efficient training of the MP3D FPN, we designed a novel supervised pre-training method, which exploits supervised signals from large-scale 2D natural image object detection dataset to pre-train the proposed MP3D detector. In summary, the main contributions of our paper are three folds:

1.

We have proposed a generic framework to employ 3D network for 2D lesion detection in CT slices. The proposed MP3D FPN is computational and memory efficient, and it achieves state-of-the-art performance on the DeepLesion dataset.

2.

We have derived a novel and effective way to adopt 2D natural images to pre-train 3D network with supervised labels, whose pre-trained weights can potentially benefit other 3D medical image analysis tasks (e.g. segmentation).

3.

We have conducted comprehensive experiments to explore the effects of pre-trained weights for deep medical image analysis. The results suggest that pre-trained weights can not only lead to faster convergence in all sized datasets, but also help to achieve better results in smaller-scale ones.

2 Methodology

Fig.1 gives an overview of the proposed lesion detection framework. The proposed MP3D FPN comprises an MP3D ResNet as the backbone, a 2D FPN [12] as the neck and a 2D RPN/RCNN head. The MP3D ResNet takes multiple consecutive CT slices (e.g. 9) as input and generates 3D feature maps which bear the ability of 3D context modeling. Then a conversion block (GTM) further transforms the 3D feature maps into 2D ones for further 2D detection. Detailed architecture designs of the proposed MP3D backbone and the novel supervised pre-training scheme will be elaborated in the following sections.

Refer to caption — Figure 1: Overview of the proposed MP3D FPN. MP3D ResNet extracts context enhanced 3D features and converts them to 2D ones with a group transform module (GTM). These context enhanced 2D features are then fed into the FPN neck and the RPN/RCNN head for further 2D lesion detection. The MP3D FPN is pre-trained on Microsoft COCO object detection dataset [15].

2.1 3D Context Modeling with an MP3D ResNet Backbone

In this work, we explore to employ 3D convolutions for effective 3D context modeling in the problem of lesion detection from consecutive CT slices (e.g. 9 slices). To advance the time and memory efficiency of normal 3D ResNet, we adopt the Pseudo-3D Residual Network (P3D ResNet) [8] as the prototype of our backbone network. The pseudo-3D convolution simulates $3\times 3\times 3$ convolution with $1\times 3\times 3$ filter on axial-view slices plus $3\times 1\times 1$ filter to build inter-slice connections on adjacent CT slices.

Lesion detection in CT slices aims to predict 2D bounding boxes in a certain slice, thus it requires 2D feature maps corresponding to the target slice for further prediction. Therefore, we need to convert the 3D feature maps to 2D ones for further prediction, meanwhile preserving the precise information of the target CT slice for accurate localization and classification. The designed Modified Pseudo-3D Residual Network (MP3D ResNet) highlights two aspects of modifications to fulfill such demands: 1) Instead of conducting isotropic pooling as in the original P3D ResNet, we neglect pooling operation in the inter-slice dimension. 2) A group transform module is introduced to generate the desired 2D feature maps from the context enhanced 3D features.

Neglecting pooling operation in the inter-slice dimension can help to preserve precise information of the target slice. In the meantime, since the number of input slices (e.g. 9) is rather small, we can get enough receptive field in the inter-slice dimension without downsampling. Regarding 2D feature map conversions, Fang et al. [13] proposed to extract $C$ 2D feature maps ( $1\times 1\times H\times W$ ) corresponding to the center slice and concatenate them to form the converted 2D feature map of size ( $C\times H\times W$ ). However, this method can not fully exploit the 3D context information resided in other adjacent slices.

We, on the other hand, propose a group transform module (GTM) instead to includes all slice’s features to compensate for the information loss. Specifically, we view 3D features ( $C\times D\times H\times W$ ) into 2D ( $CD\times H\times W$ ) and apply a group convolutional layer with the group size of $C$ (every $D$ channel is a group) to fuse all neighboring features to yield the final 2D feature maps ( $C\times H\times W$ ).

2.2 Supervised 3D Pre-training with COCO Dataset

Table 1: Sensitivities (%) at various FPs per image on the test set of DeepLesion. ^∗ indicates re-implementation of 3DCE using ResNet-50 FPN with the same configuration as our MP3D FPN.

Methods	$\mathbf{0.5}$	$\mathbf{1}$	$\mathbf{2}$	$\mathbf{4}$	$\mathbf{[email protected]}$
3DCE, 27 slices[1]	52.86	64.80	74.84	84.38	-
MSB, 3 slices[2]	67.00	76.80	83.70	89.00	-
RetinaNet, 3 slices[3]	72.15	80.07	86.40	90.77	-
MVP-Net, 9 slices[4]	73.83	81.82	87.60	91.30	-
MULAN, 9 slices[5]	76.12	83.69	88.76	92.30	-
FPN+3DCE, 3 slices^∗	68.52	77.59	83.91	88.33	64.41
FPN+3DCE, 9 slices^∗	74.06	82.00	87.58	91.56	70.28
FPN+3DCE, 27 slices^∗	74.67	82.89	88.17	91.62	70.82
MR3D FPN, 9 slices	79.09	84.84	89.18	92.06	76.57
MP3D FPN, 9 slices	79.60	85.29	89.61	92.45	76.87
Imp over MULAN, 9 slices	$\uparrow$ 3.48	$\uparrow$ 1.60	$\uparrow$ 0.85	$\uparrow$ 0.15	-

Supervised pre-training from natural images has proven to be an effective way for 2D medical image transfer learning[1, 2, 3, 4, 5, 10]. This indicates that using supervised pre-training models from another domain can actually benefit the medical image analysis application. What’s more, compared to self-supervised signals, we believe that supervised labels which carry the semantic information could enable the model to learn semantically invariant and discriminative features more effectively. Therefore, in this section, we aim to develop a method to exploit supervised labels from large-scale 2D natural image object detection dataset (e.g. coco [15]) to pre-train our MP3D FPN.

Previous works [1] have shown that by grouping 3 consecutive CT slices (which is natively 3D data) as a 3-channel RGB image, we can boost the detection performance with Image-Net pre-trained weights, indicating the feasibility of simulating RGB natural image with natively 3D CT slices. This inspires us to reversely decompose the 3 channels of natural RGB images into 3 consecutive CT slices, and train an MP3D FPN with such simulated 3D data. Fig 2 illustrates a comparison of the two correlative strategies. For implementation details, we train the MP3D FPN on COCO dataset for 72 epochs and the final weights are used to initialize MP3D ResNet. To drive the network to learn useful 3D contextual features from inter-slice connections, it is essential to keep the resolution in the inter-slice dimension unchanged for all stages of the backbone.The MP3D detector trained with a slice number of 3 can be used to initialize lesion detectors which takes variable number of slices as network input.

3 Experiments

3.1 Experimental Setup

Dataset and Metric: The NIH DeepLesion is a large-scale dataset for lesion detection, which contains 32,735 lesions on 32,120 axial CT slices captured from 4,427 patients. DeepLesion is splitted into training (70%), validation (15%), and test (15%) sets. We evaluate our MP3D FPN and all the compared methods on the test set by reporting the mean average precision ([email protected]) and average sensitivities at different false positives (FPs) per image.

Implementation Details: As in [3], the Hounsfield units (HU) are clipped into the range of $[-1024,1050]$ . We interpolate in the z-axis to normalize the intervals of all CT slices to 2.5mm. Anchor scales are set to $\{16,32,64,128,256\}$ in FPN. Apart from horizontal and vertical flip, we resize the image to different scales of $\{448,512,576\}$ for data augmentation. MP3D-63 with group normalization[14] is used as the backbone in all our experiments, which has similar depth with the ResNet3D-50 model. The MP3D-63 model is derived from the conventional P3D-63[8] model with the proposed modifications. Unless otherwise specified, the MP3D FPN takes 9 consecutive slices as input. We train all the models for 24 epochs at the base learning rate of 0.02, and reduce it by a factor of 10 after the 16-th and 22-th epoch (corresponding to the 2x learning schedule[11] on COCO dataset). We conduct experiments on the NVIDIA TITAN V GPU with 12GB of memory, and mixed-precision training strategy is used in all our experiments to save memory.

3.2 Comparison with State-of-the-arts

Table 1 presents the comparisons with the previous state-of-the-art (SOTA) methods. Our model surpasses all the SOTA methods on sensitivities at different FPs and [email protected], which includes 3DCE[1], MSB[2], RetinaNet[3], MVP-Net[4] and MULAN[5].

Without using any auxiliary supervision, MP3D FPN outperforms MULAN, the previous SOTA which additionally employs multi task learning and a deeper backbone (DenseNet-121) to improve the detection accuracy, by up to 3.48% on the sensitivity of [email protected]. We re-implement 3DCE with ResNet-50 FPN using the same configuration as our MP3D FPN for fair comparison. Our proposed MP3D achieve a performance gain of 6.05% on [email protected] compared with this 2D convolution based context encoding method, demonstrating the superior 3D context modeling ability of our MP3D backbones. As shown in Table 1, MP3D FPN (248.93 GFLOPS, 45.16 $M$ Params) and MR3D FPN (Modified ResNet 3D, 415.81 GFLOPS, 64.03 $M$ Params) based detector achieve comparable results, but the MP3D based detector consumes much less time and memory. This strongly proves the efficacy and the thrift of our MP3D model.

3.3 Ablation Study

We perform a number of ablations to probe into our MP3D FPN. The results are shown as follows:

Table 2: Detection performance and computational cost with variable numbers of input slice. GFLOPS is used to characterize the computational cost.

Methods	$\mathbf{0.5}$	$\mathbf{1}$	$\mathbf{2}$	$\mathbf{4}$	$\mathbf{[email protected]}$	$\mathbf{GFLOPS}$
MP3D, 5 slices	76.86	83.44	88.13	91.54	75.01	156.84
MP3D, 7 slices	78.22	84.45	88.90	91.50	76.69	202.88
MP3D, 9 slices	79.60	85.29	89.61	92.45	76.87	248.93
MP3D, 11 slices	80.05	85.77	89.55	92.45	77.64	294.97

Input Slices: Table 2 shows the performance of the MP3D detector when applying 5,7,9 and 11 slices as input. The detector achieves higher detection accuracy as more slices are used, meanwhile consuming more time and memory. MP3D with 7 slices as input get the best trade-off between effectiveness and efficiency.

Table 3: Comparison of different conversion modules and different pooling strategies for pre-training.

Methods	$\mathbf{0.5}$	$\mathbf{1}$	$\mathbf{2}$	$\mathbf{4}$	$\mathbf{[email protected]}$
MP3D w/ CTM	79.18	84.90	88.96	91.90	76.30
MP3D w/ GTM	79.60	85.29	89.61	92.45	76.87
MP3D w/ isotropic pooling	78.24	84.41	88.82	91.98	75.06
MP3D w/ proposed pooling	79.60	85.29	89.61	92.45	76.87

Conversion Type: Table 3 demonstrates the comparisons of proposed GTM with the center-cropping transform module (CTM), which is proposed by Fang et al. [13]. The proposed GTM brings better results as it can efficiently aggregate information from all adjacent slices for further detection.

3.4 Effectiveness of the 3D Pre-trained Model

We conducted three groups of experiments to explore effectiveness of the pre-training method.

Comparison to Isotropic Pooling: In this work, to achieve 3D context modeling ability in the z-axis, we neglect pooling operation in the inter-slice dimension when pre-training the MP3D model on the Microsoft COCO dataset. We compared our proposed method to isotropic pooling for validation.

The pre-trained model takes three slices as input. When training with isotropic pooling, the z-axis degenerates to a single slice after the first two pooling layers, preventing further 3D convolution layers from learning useful 3D contextual information. As shown in Table3, pre-trained weights learned from isotropic pooling gives worse results than the proposed method. This also proves that using decomposed natural image as input can actually helps the 3D model to gain context-encoding ability. Thus the learned weights can potentially be used to boost the performance of other 3D medical image analysis tasks.

Table 4: Comparison of model performance with and without pre-training with different learning schedules. 1x, 2x and 6x indicates max training epochs of 12, 24 and 72 separately.

Methods	$\mathbf{0.5}$	$\mathbf{1}$	$\mathbf{2}$	$\mathbf{4}$	$\mathbf{[email protected]}$
MP3D 1x w/o pretrain	70.12	78.00	83.95	88.23	67.60
MP3D 2x w/o pretrain	76.11	82.65	87.70	91.17	74.00
MP3D 6x w/o pretrain	79.60	85.29	89.26	92.19	76.75
MP3D 1x w/ pretrain	78.02	84.33	88.84	91.74	75.78
MP3D 2x w/ pretrain	79.58	85.29	89.61	92.45	76.87

Table 5: Training with variable dataset sizes (

100\%

20\%

). For simplicity, we present the results of [email protected].

Methods	$\mathbf{100\%}$	$\mathbf{80\%}$	$\mathbf{60\%}$	$\mathbf{40\%}$	$\mathbf{20\%}$
3DCE 9 slices 2x	70.28	69.22	67.08	63.61	57.02
3DCE 27 slices 2x	70.82	69.96	68.08	65.36	58.82
MP3D w/o pre-train 2x	74.00	71.58	68.79	63.40	50.67
MP3D w/o pre-train 6x	76.75	75.43	72.87	68.14	58.98
MP3D w/ pre-train 2x	76.87	75.66	73.33	71.07	65.55

Comparison to Training From Scratch: He et al.[11] demonstrated that with sufficient training data (around 35k from its experiment) and longer training schedule (6x), models trained from scratch could achieve comparable results to models training with pre-trained weights. Therefore, we examined the effectiveness of our proposed pre-training method by comparing MP3D with pre-training to model trained from scratch with longer schedule.

As shown in Table4, when both trained for 1x learning schedule (12 epochs), MP3D with pre-trained weights significantly outperforms the one without pre-training, demonstrating faster convergence speed. And it turns out that with 2x learning schedule (24 epochs), model trained with the proposed pre-training weights can achieve comparable results with MP3D model trained from scratch with 6x learning schedule (72 epochs). These results validate the effectiveness of our proposed pre-training scheme.

Performance on Variable Dataset Sizes: In medical image analysis tasks, annotated data is often scarce. Therefore, it is appealing to gain a better understanding of the effects of pre-trained weights when dataset size is small. In this subsection, we compare the model performance of 2x, 6x training from scratch and 2x with pre-training on variable dataset sizes by randomly choosing 20%, 40%, 60% and 80% of the whole training data. Pre-training based models achieve better performance with less training time on all the cases, and the smaller the size of the dataset, the larger the gap. A dramatic drop of performance starts when training with only 40% of the whole data. And when training with only 20% of the dataset, which is around 4,500 images, the model trained with our proposed pre-trained weights achieves an absolute performance gain of 6.57% on [email protected], accounting for an 11% relative gain.

4 Conclusions

In this paper, we propose a generic model architecture to exploit 3D network for 2D lesion detection in CT slices. The proposed MP3D FPN can reduce computation and memory cost while providing enhanced 3D context modeling ability. A simple yet effective way for 3D network pre-training is also derived to facilitate efficient training. Without sophisticated structures and multi-supervision signals, it significantly improves the detection performance on the DeepLesion dataset, surpassing all the SOTAs. We have proved the benefits of pre-trained weights for variable dataset size, and we expect that the MP3D ResNet along with its pre-trained weights can serve as a benchmark backbone for 3D medical image analysis, making contributions towards accessible precise medication.

4.0.1 Acknowledgements.

This work is funded by National Key Research and Development Program of China (No. 2019YFC0118101), MOST-2018AAA0102004 and NSFC-61625201. We would like to thank Yemin Shi for valuable discussions.

References

[1] Yan, K., Bagheri, M., Summers, R.M.: 3d context enhanced region-based convolutional neural network for end-to-end lesion detection. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 511–519. Springer (2018).https://doi.org/10.1007/978-3-030-00928-1_58
[2] Shao, Q., Gong, L., Ma, K., Liu, H., Zheng, Y.: Attentive ct lesion detection using deep pyramid inference with multi-scale booster. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 301–309. Springer (2019).https://doi.org/10.1007/978-3-030-32226-7_34
[3] Zlocha, M., Dou, Q., Glocker, B.: Improving retinanet for ct lesion detection with dense masks from weak recist labels. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 402–410. Springer (2019).https://doi.org/10.1007/978-3-030-32226-7_45
[4] Li, Z., Zhang, S., Zhang, J., Huang, K., Wang, Y., Yu, Y.: Mvp-net: Multi-view fpn with position-aware attention for deep universal lesion detection. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 13–21. Springer (2019).https://doi.org/10.1007/978-3-030-32226-7_2
[5] Yan, K., Tang, Y., Peng, Y., Sandfort, V., Bagheri, M., Lu, Z., Summers, R.M.: Mulan: Multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 194–202. Springer (2019).https://doi.org/10.1007/978-3-030-32226-7_22
[6] Zhou, Z., Sodha, V., Siddiquee, M.M.R., Feng, R., Tajbakhsh, N., Gotway, M.B., Liang, J.: Models genesis: Generic autodidactic models for 3d medical image analysis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 384–393. Springer (2019).https://doi.org/10.1007/978-3-030-32251-9_42
[7] Chen, S., Ma, K., Zheng, Y.: Med3d: Transfer learning for 3d medical image analysis. arXiv preprint arXiv:1904.00625 (2019)
[8] Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3d residual networks. In: proceedings of the IEEE International Conference on Computer Vision. pp. 5533–5541 (2017)
[9] Yang, J., Huang, X., Ni, B., Xu, J., Yang, C., Xu, G.: Reinventing 2d convolutions for 3d medical images. arXiv preprint arXiv:1911.10477 (2019)
[10] Lakhani, P., Sundaram, B.: Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology. 284(2), 574–582 (2017)
[11] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE international conference on computer vision. pp. 4918–4927 (2019)
[12] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2117–2125 (2017)
[13] Fang, C., Li, G., Pan, C., Li, Y., Yu, Y.: Globally guided progressive fusion network for 3d pancreas segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 210–218. Springer (2019).https://doi.org/10.1007/978-3-030-32245-8_24
[14] Wu, Y., He, K.: Group normalization. In: Proceedings of the European conference on computer vision (ECCV). pp. 3–19 (2018)
[15] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)