PLU-Net: Extraction of multi-scale feature fusion
Abstract
Deep learning algorithms have achieved remarkable results in medical image segmentation in recent years. However, these networks cannot handle image boundaries and details well despite their enormous parameter counts, resulting in poor segmentation results. To address this issue, we develop atrous spatial pyramid pooling (ASPP) and combine it with the Squeeze-and-Excitation block (SE block), presenting the PS module, which employs a broad, multi-scale receptive field at the bottom of the network to obtain more detailed semantic information. We also propose the Local Guided block (LG block) and combine it with the SE block to form the LS block, which obtains richer local features in the feature map so that more edge information is retained at each down-sampling step, thereby improving boundary segmentation. We propose PLU-Net by integrating our PS module and LS block into U-Net. We evaluate PLU-Net on three benchmark datasets, and the results show that it outperforms on medical semantic segmentation tasks with fewer parameters and FLOPs.
Keywords:
Semantic segmentation, U-Net, deep learning, medical image

1 Introduction
The significance of image analysis is rising in parallel with the successful application of imaging in clinical medicine. Image segmentation is a key image analysis technology that plays an essential role in medical imaging. Deep learning methods, mainly based on deep convolutional neural networks (DCNNs), have solved various semantic segmentation difficulties in medical images in recent years. Although subsequent improved methods based on U-Net perform better, some issues have emerged, such as growing parameter counts and FLOPs, and segmentation that is still not good enough at image boundaries and details. In this article, we propose PLU-Net, a simple and effective network model that uses U-Net as a baseline to address these problems. First, the LS block is employed to obtain rich local information and improve boundary segmentation performance. Second, the PS module, with its broad, multi-scale receptive field, is added at the bottom of the network to collect richer detail information. The combination of the two modules captures image boundary and detail information well. Finally, the network depth is reduced by one layer, greatly increasing the model's inference speed.
2 Related work
Other CNN models appeared in the years after the ILSVRC Russakovsky et al. (2015) competition began in 2012, including AlexNet Krizhevsky et al. (2012), VGG Simonyan and Zisserman (2014), GoogLeNet Szegedy et al. (2015), ResNet He et al. (2016), and SENet Hu et al. (2018). These models are mostly used for image-level classification tasks, while many fields require more detailed, pixel-level prediction. This is especially true in medical imaging, where precision and speed matter more than in most other fields.
Deep convolutional networks: In 2015, the Fully Convolutional Network (FCN) Long et al. (2015) replaced the fully connected layer with convolution layers to output spatial maps, allowing the model to handle images of varying sizes and considerably boosting segmentation accuracy over traditional methods. However, it still has significant flaws, such as poor recognition in particular cases and the omission of global context information. U-Net was born at this time. It uses a completely symmetrical model structure and a feature fusion technique different from FCN's: concatenation. Meanwhile, it reduces the model size and delivers excellent results with little training data, which is essential for medical segmentation. Many semantic segmentation models employ U-Net as the basis for improvement because of its superior performance. U-Net++ Zhou et al. (2020) improves accuracy by adding deep supervision to each layer's sub-network, better capturing feature information lost in down-sampling and up-sampling operations.
Multi-scale feature extraction: PSPNet Zhao et al. (2016) uses a pyramid pooling module to aggregate context information from different regions, improving the extraction of feature information. DeepLab Chen et al. (2014) uses ASPP to aggregate convolution kernels of different scales, improving multi-scale feature extraction. Res-UNet Xiao et al. (2018) and Dense-UNet Huang et al. (2017) incorporate ResNet and DenseNet concepts into U-Net, respectively; ResNet's residual block and DenseNet's dense block effectively reduce feature information loss during transmission.
Attention modules: Attention U-Net Oktay et al. (2018) inserts an attention gate into U-Net at each up-sampling step, which eliminates feature redundancy caused by the repeated use of low-level features across multiple convolutions. R2U-Net Alom et al. (2018) combines RNN and ResNet structures within a U-Net, allowing the network to gather more feature information after each convolution. Additionally, because the transformer has a global receptive field and can acquire feature information from all pixels in an image, numerous recent works Fan et al. (2022) Sha et al. (2021) Chen et al. (2021) Petit et al. (2021) Lin et al. (2021) have merged transformers and U-Net in various ways. These models' performance has improved to some extent, but a slew of new issues has also been introduced. On the one hand, a transformer structure dramatically increases model size, stifling inference speed and raising hardware requirements. On the other hand, it frequently depends on pre-training, so a solid pre-trained model is critical to performance. To summarize, a model's accuracy can be enhanced by learning additional feature information or by lowering feature information loss during feature map computation. In addition, the longer running time induced by growth in model size must be taken into account during model design.

Figure 1: (a) LG block; (b) SE block; (c) LS block.
3 Methodology
3.1 LS block
The Conv block in the original U-Net consists of two 3x3 convolutions, two batch normalizations, and two nonlinear activations (ReLU). We observe that this structure loses local information, so we propose the Local Guided block (LG block, shown in Fig. 1), which is split into two branches consisting of 3x3 dilated convolutions with dilation rates of 1 and 3. The results of the two dilated convolutions are concatenated to enhance feature propagation. A 1x1 convolution then performs cross-channel information fusion and adds nonlinearity while keeping the feature map size unchanged. To achieve channel-wise information adaptation, we insert an SE block after the LG block to form the LS block (shown in Fig. 1), similar to the PS module. Compared with the original convolution block, the LG block reduces computation while obtaining richer feature information over a large receptive field, thanks to the double-branch structure and dilated convolution. Adding the SE block further adapts the channel feature information.
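To make the construction concrete, the following is a minimal PyTorch sketch of the LG and LS blocks as described above; the per-branch width, the BatchNorm/ReLU placement, and the SE reduction ratio are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze via global pooling, excitation via a
    bottleneck MLP, then channel-wise re-weighting of the input."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

class LGBlock(nn.Module):
    """Local Guided block: two parallel 3x3 convolutions with dilation rates
    1 and 3, concatenation, then a 1x1 convolution for cross-channel fusion.
    Padding equals the dilation rate, so the spatial size is unchanged."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        def branch(d):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
        self.b1 = branch(1)  # fine local detail
        self.b3 = branch(3)  # larger receptive field
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.fuse(torch.cat([self.b1(x), self.b3(x)], dim=1))

class LSBlock(nn.Module):
    """LS block = LG block followed by an SE block for channel adaptation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.lg = LGBlock(in_ch, out_ch)
        self.se = SEBlock(out_ch)

    def forward(self, x):
        return self.se(self.lg(x))
```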
3.2 PS module
ASPP was first proposed in DeepLabv1 Chen et al. (2014) and then improved in DeepLabv2 Chen et al. (2017a) and DeepLabv3 Chen et al. (2017b), as shown in Fig. 2. We take DeepLabv3's ASPP, which comprises one 1x1 convolution, three dilated convolutions, and one global average pooling operation, as our foundation. According to experimental verification, the global average pooling branch produces duplicate information during up-sampling and degrades prediction performance. We therefore remove the global average pooling and replace the ordinary atrous convolution with depthwise separable convolution Sifre and Mallat (2014), reducing the number of parameters roughly fivefold compared with the original ASPP. The ASPP's multi-scale structure allows it to gather more feature information over larger receptive fields; however, more feature information may contain redundant data, lowering performance. To alleviate the impact of redundant information, we employ the SE block from SENet Hu et al. (2018) to increase the weight of important channel information while decreasing the weight of worthless channel information. The SE block learns the importance of each channel's features through two processes, squeeze and excitation, then strengthens relevant channels while weakening idle ones to achieve adaptive feature channel calibration. The resulting PS module (shown in Fig. 2) combines the advantages of the ASPP module and the SE block: after gathering additional feature information over a large receptive field, it suppresses redundant information and strengthens important channel features.

Figure 2: (a) ASPP; (b) PS module.
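Under the same caveats, a sketch of the PS module follows, reusing the SEBlock from the previous sketch; the dilation rates (1, 6, 12, 18) are taken from Section 4.2, while the kernel sizes and normalization placement are assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution (with dilation) followed by a 1x1 pointwise
    convolution, cutting parameters relative to an ordinary atrous conv."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                            dilation=dilation, groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class PSModule(nn.Module):
    """PS module: ASPP with the global average pooling branch removed and
    ordinary atrous convs replaced by depthwise separable ones, followed by
    1x1 fusion and an SE block (SEBlock from the previous sketch)."""
    def __init__(self, in_ch, out_ch, dilations=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            DepthwiseSeparableConv(in_ch, out_ch, d) for d in dilations
        )
        self.project = nn.Sequential(  # fuse concatenated multi-scale features
            nn.Conv2d(len(dilations) * out_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.se = SEBlock(out_ch)      # suppress redundant channels

    def forward(self, x):
        x = torch.cat([b(x) for b in self.branches], dim=1)
        return self.se(self.project(x))
```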
3.3 Network Architecture
Our PLU-Net, shown in Fig. 3, improves on the original U-Net architecture by substituting the LS block for the convolution block in the down-sampling and up-sampling paths. The LS block's double-branch structure minimizes information loss at each layer, while the larger receptive field ensures feature reuse and effective propagation. In addition, the number of down-sampling and up-sampling stages is reduced from four to three, and a PS module is added at the end of the down-sampling path. Removing the last U-Net stage, whose channel width is 1024, considerably reduces computation and makes the model more lightweight. At the same time, the four branches and larger dilation rates of the PS module provide a bigger receptive field and thus richer feature information, so the subsequent up-sampling can proceed efficiently. With these enhancements combined, our network achieves better performance with fewer parameters and FLOPs.
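A sketch of how the pieces could be assembled into PLU-Net is given below, building on the LSBlock and PSModule sketches above; the channel widths (64, 128, 256, 512) and the use of transposed convolutions for up-sampling are assumptions, since the text fixes only the number of stages.

```python
import torch
import torch.nn as nn

class PLUNet(nn.Module):
    """Sketch of PLU-Net: a U-Net with three down/up-sampling stages, LS blocks
    in place of plain conv blocks, and a PS module at the bottom. Input height
    and width must be divisible by 8."""
    def __init__(self, in_ch=3, n_classes=1):
        super().__init__()
        self.enc1 = LSBlock(in_ch, 64)
        self.enc2 = LSBlock(64, 128)
        self.enc3 = LSBlock(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.bottom = PSModule(256, 512)             # wide multi-scale bottleneck
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = LSBlock(512, 256)                # 256 skip + 256 upsampled
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = LSBlock(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = LSBlock(128, 64)
        self.head = nn.Conv2d(64, n_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)
        s2 = self.enc2(self.pool(s1))
        s3 = self.enc3(self.pool(s2))
        b = self.bottom(self.pool(s3))
        d3 = self.dec3(torch.cat([self.up3(b), s3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return torch.sigmoid(self.head(d1))          # probabilities for BCELoss
```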


Figure 3: Architecture of PLU-Net.

Figure 4: Segmentation results on (a) DSB2018, (b) CVC, and (c) ISIC2018.
4 Experiments and Results
4.1 Datasets
Because both the PS module and the LS block are modular, they can readily replace convolution operations in various network architectures. To demonstrate the robustness of our approach, we designed three models in addition to the original U-Net (conv+null): PU-Net (conv+PS module), LU-Net (LS block+null), and PLU-Net (LS block+PS module). Here A+B means A for the down-sampling and up-sampling paths and B for the bottom module, and the same below. We evaluated the models on three biomedical image segmentation datasets in this study.
4.1.1 Polyp Segmentation
CVC-ClinicDB Bernal et al. (2015) (CVC for short) is collected from colonoscopy videos and contains 612 polyp images. We use the original image size of 384x288, and the dataset is randomly split into a train set (), a validation set (), and a test set (). We also scale the original images proportionally (resizing them from to ).
4.1.2 Nuclei Segmentation
In most cancer grading schemes, nuclei segmentation has far-reaching significance because nuclear morphology is one of the important components. The dataset is derived from the Kaggle 2018 Data Science Bowl (https://www.kaggle.com/c/data-science-bowl-2018/data) (DSB2018 for short). It contains 670 nucleus images and is randomly split into a train set (), a validation set (), and a test set (). We also resize the original images to .
4.1.3 Skin Lesion Segmentation
Computer-aided automatic diagnosis of skin cancer is an inevitable trend, and skin lesion segmentation, as its first step, is urgent. The dataset is from the MICCAI 2018 Workshop - ISIC2018: Skin Lesion Analysis Towards Melanoma Detection Codella et al. (2019) Tschandl et al. (2018) (ISIC2018 for short). It contains 2594 images and is randomly split into a train set (), a validation set (), and a test set (). For better model training and result display, we resize all the original images to .
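As an illustration of the random splitting procedure shared by the three datasets, a short PyTorch sketch follows; the exact split ratios are not reproduced in this text, so the 80/10/10 values below are purely assumed placeholders, as is the seed.

```python
import torch
from torch.utils.data import random_split

def split_dataset(dataset, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly split a dataset into train/val/test subsets.
    The 80/10/10 ratios and the seed are assumed placeholders."""
    n = len(dataset)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    n_test = n - n_train - n_val  # remainder goes to the test set
    gen = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val, n_test], generator=gen)
```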
4.2 Experimental Setup
We use the three datasets to compare the U-Net, PU-Net, LU-Net, U-Net++, MultiResUnet Ibtehaz and Rahman (2020), DoubleUNet Jha et al. (2020), and PLU-Net architectures. We chose U-Net because of its widespread use and relevance in medical image segmentation, and because it serves as the foundation for numerous network architectures. In the LS block, the kernel size is set to 3x3 with dilation values of 1 and 3, respectively, each followed by batch normalization and ReLU. The PS module employs depthwise separable convolution; the input is fed into four atrous convolutions with dilation values of 1, 6, 12, and 18, respectively, and their results are concatenated and fused by a 1x1 convolution to produce the output. We used a batch size of 16 for the DSB2018 dataset, 4 for the ISIC2018 dataset, and 2 for the CVC dataset. The optimizer is Adam Kingma and Ba (2014), with two momentum terms of and and a learning rate of 0.0003. Training runs for 100 epochs with the Binary Cross-Entropy loss (BCELoss). All experiments run on four NVIDIA TITAN V GPUs with 12GB of RAM each, using PyTorch Paszke et al. (2019).
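The training setup above can be sketched as follows; since the two Adam momentum terms are not reproduced in this text, the PyTorch defaults (0.9, 0.999) are used as an assumption, and train_loader is a hypothetical DataLoader yielding image/mask pairs.

```python
import torch
import torch.nn as nn

model = PLUNet(in_ch=3, n_classes=1).cuda()  # PLUNet from the earlier sketch
# Learning rate 3e-4 as stated; the betas are the PyTorch defaults (an assumption).
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999))
criterion = nn.BCELoss()  # model already outputs sigmoid probabilities

for epoch in range(100):
    for images, masks in train_loader:  # train_loader: hypothetical DataLoader
        images, masks = images.cuda(), masks.float().cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
```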
Table 1: Comparison on the DSB2018, CVC, and ISIC2018 datasets.

Dataset | Methods | PC | SE | F1 | JS | Params (M) | FLOPs (G)
---|---|---|---|---|---|---|---
DSB2018 | U-Net Ronneberger et al. (2015) | 0.8965 | 0.9064 | 0.9014 | 0.8205 | 34.53 | 9.21
 | U-Net++ Zhou et al. (2020) | 0.8892 | 0.9184 | 0.9036 | 0.8237 | 36.62 | 19.41
 | MultiResUnet Ibtehaz and Rahman (2020) | 0.9432 | 0.8401 | 0.8887 | 0.7977 | 7.24 | 2.11
 | DoubleUNet Jha et al. (2020) | 0.8808 | 0.9298 | 0.9046 | 0.8249 | 18.84 | 6.21
 | LU-Net | 0.9067 | 0.9015 | 0.9040 | 0.8258 | 29.29 | 6.79
 | PU-Net | 0.8912 | 0.9157 | 0.9032 | 0.8234 | 38.19 | 9.32
 | PLU-Net | 0.9025 | 0.9099 | 0.9062 | 0.8279 | 6.22 | 4.99
CVC | U-Net Ronneberger et al. (2015) | 0.8001 | 0.9087 | 0.8509 | 0.7385 | 34.53 | 110.49
 | U-Net++ Zhou et al. (2020) | 0.7973 | 0.9632 | 0.8724 | 0.7706 | 36.62 | 232.92
 | MultiResUnet Ibtehaz and Rahman (2020) | 0.7929 | 0.9495 | 0.8641 | 0.7562 | 7.24 | 25.30
 | DoubleUNet Jha et al. (2020) | 0.8637 | 0.9222 | 0.8920 | 0.8249 | 18.84 | 74.52
 | LU-Net | 0.8591 | 0.9351 | 0.8954 | 0.8102 | 29.29 | 81.42
 | PU-Net | 0.8727 | 0.9096 | 0.8807 | 0.7979 | 38.19 | 111.85
 | PLU-Net | 0.9139 | 0.8832 | 0.8983 | 0.8125 | 6.22 | 59.90
ISIC2018 | U-Net Ronneberger et al. (2015) | 0.8449 | 0.9038 | 0.8734 | 0.7665 | 34.53 | 50.13
 | U-Net++ Zhou et al. (2020) | 0.8342 | 0.9156 | 0.8730 | 0.7688 | 36.62 | 105.68
 | MultiResUnet Ibtehaz and Rahman (2020) | 0.8223 | 0.9340 | 0.8746 | 0.7732 | 7.24 | 11.48
 | DoubleUNet Jha et al. (2020) | 0.8567 | 0.9007 | 0.8781 | 0.7779 | 18.84 | 33.18
 | LU-Net | 0.8678 | 0.8933 | 0.8804 | 0.7802 | 29.29 | 36.94
 | PU-Net | 0.8556 | 0.8981 | 0.8763 | 0.7771 | 38.19 | 50.75
 | PLU-Net | 0.8774 | 0.9152 | 0.8959 | 0.8061 | 6.22 | 27.18
4.3 Result and Discussion
To better present the experimental results, we consider several performance metrics: Precision (PC, Eq. 1), Sensitivity (SE, Eq. 2), F1-score (F1, also known as the Dice coefficient, DC, Eq. 3), and Jaccard similarity (JS, Eq. 4). The variables involved in these formulas are True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN), Ground Truth (GT), and Segmentation Result (SR).
$$\mathrm{PC} = \frac{TP}{TP + FP} \qquad (1)$$

$$\mathrm{SE} = \frac{TP}{TP + FN} \qquad (2)$$

$$\mathrm{F1} = \frac{2\,|GT \cap SR|}{|GT| + |SR|} = \frac{2\,TP}{2\,TP + FP + FN} \qquad (3)$$

$$\mathrm{JS} = \frac{|GT \cap SR|}{|GT \cup SR|} = \frac{TP}{TP + FP + FN} \qquad (4)$$
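For reference, a small function computing the four metrics from a predicted probability map and a binary ground-truth mask, following Eqs. 1-4; the 0.5 binarization threshold and the eps smoothing term are assumptions.

```python
import torch

def segmentation_metrics(sr, gt, thresh=0.5, eps=1e-7):
    """Compute PC, SE, F1, and JS (Eqs. 1-4) from a predicted probability
    map SR and a binary ground-truth mask GT of the same shape."""
    sr = (sr > thresh).float()
    gt = (gt > 0.5).float()
    tp = (sr * gt).sum()
    fp = (sr * (1 - gt)).sum()
    fn = ((1 - sr) * gt).sum()
    pc = tp / (tp + fp + eps)               # Eq. 1: precision
    se = tp / (tp + fn + eps)               # Eq. 2: sensitivity
    f1 = 2 * tp / (2 * tp + fp + fn + eps)  # Eq. 3: F1 / Dice
    js = tp / (tp + fp + fn + eps)          # Eq. 4: Jaccard
    return pc.item(), se.item(), f1.item(), js.item()
```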
Table 1 reports the results of our proposed models and several state-of-the-art U-Net variants, including U-Net++, MultiResUnet, and DoubleUNet, across the three biomedical image segmentation tasks. As the table shows, our proposed LU-Net, PU-Net, and PLU-Net are all superior to U-Net in both F1 and JS. On CVC, PLU-Net outperforms U-Net by about 5 points in F1 and more than 7 points in JS. The LU-Net and PU-Net results likewise exceed U-Net in JS and F1, with PLU-Net outperforming all other models, demonstrating the superiority of the LG block and the capability of the PS module. Furthermore, the segmentation results of the three tasks in Fig. 4 show the model's advantages: in nuclei segmentation, our model performs better at the edges; in polyp segmentation, the model presented in this paper greatly outperforms other models; and in skin lesion segmentation, unlike other models with smooth boundary processing, our model produces more refined boundaries.
5 Conclusion
In this paper, we propose the LS block for learning local feature information over a large receptive field and the PS module for learning deeper information over a wider receptive field. Furthermore, based on the Local Guided block and PS module, we design PLU-Net, a lightweight network with fewer parameters and FLOPs that handles boundaries and details well in medical images. Experiments on polyp, nuclei, and skin lesion segmentation demonstrate the advantages of the proposed PLU-Net in generating high-quality segmentation results.
Conflict of interest
The authors report no conflict of interest.
Data availability statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
- Alom et al. (2018) Alom MZ, Hasan M, Yakopcic C, Taha TM, Asari VK (2018) Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. arXiv preprint arXiv:180206955
- Bernal et al. (2015) Bernal J, Sánchez FJ, Fernández-Esparrach G, Gil D, de Miguel CR, Vilariño F (2015) Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized medical imaging and graphics : the official journal of the Computerized Medical Imaging Society 43:99–111
- Chen et al. (2021) Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y (2021) Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:210204306
- Chen et al. (2014) Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:14127062
- Chen et al. (2017a) Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4):834–848
- Chen et al. (2017b) Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:170605587
- Codella et al. (2019) Codella N, Rotemberg V, Tschandl P, Celebi ME, Dusza S, Gutman D, Helba B, Kalloo A, Liopyris K, Marchetti M, et al. (2019) Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:190203368
- Fan et al. (2022) Fan CM, Liu TJ, Liu KH (2022) Sunet: Swin transformer unet for image denoising. arXiv preprint arXiv:220214009
- He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
- Hu et al. (2018) Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
- Huang et al. (2017) Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708
- Ibtehaz and Rahman (2020) Ibtehaz N, Rahman MS (2020) Multiresunet: Rethinking the u-net architecture for multimodal biomedical image segmentation. Neural Networks 121:74–87
- Jha et al. (2020) Jha D, Riegler MA, Johansen D, Halvorsen P, Johansen HD (2020) Doubleu-net: A deep convolutional neural network for medical image segmentation. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), pp 558–564
- Kingma and Ba (2014) Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:14126980
- Krizhevsky et al. (2012) Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25:1097–1105
- Lin et al. (2021) Lin A, Chen B, Xu J, Zhang Z, Lu G (2021) Ds-transunet: Dual swin transformer u-net for medical image segmentation. arXiv preprint arXiv:210606716
- Long et al. (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
- Oktay et al. (2018) Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, et al. (2018) Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:180403999
- Paszke et al. (2019) Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) Pytorch: An imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds) Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp 8024–8035, URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
- Petit et al. (2021) Petit O, Thome N, Rambour C, Themyr L, Collins T, Soler L (2021) U-net transformer: Self and cross attention for medical image segmentation. In: International Workshop on Machine Learning in Medical Imaging, Springer, pp 267–276
- Ronneberger et al. (2015) Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention, Springer, pp 234–241
- Russakovsky et al. (2015) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3):211–252, DOI 10.1007/s11263-015-0816-y
- Sha et al. (2021) Sha Y, Zhang Y, Ji X, Hu L (2021) Transformer-unet: Raw image processing with unet. arXiv preprint arXiv:210908417
- Sifre and Mallat (2014) Sifre L, Mallat S (2014) Rigid-motion scattering for texture classification. arXiv preprint arXiv:14031687
- Simonyan and Zisserman (2014) Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:14091556
- Szegedy et al. (2015) Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
- Tschandl et al. (2018) Tschandl P, Rosendahl C, Kittler H (2018) The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data 5(1):1–9
- Xiao et al. (2018) Xiao X, Lian S, Luo Z, Li S (2018) Weighted res-unet for high-quality retina vessel segmentation. In: 2018 9th international conference on information technology in medicine and education (ITME), IEEE, pp 327–331
- Zhao et al. (2016) Zhao H, Shi J, Qi X, Wang X, Jia J (2016) Pyramid scene parsing network. arXiv preprint arXiv:161201105
- Zhou et al. (2020) Zhou Z, Siddiquee M, Tajbakhsh N, Liang J (2020) Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Transactions on Medical Imaging 39(6):1856–1867