Exploiting full Resolution Feature Context for Liver Tumor and Vessel Segmentation via Integrate Framework: Application to Liver Tumor and Vessel 3D Reconstruction under embedded microprocessor

Xiangyu Meng Xudong Zhang Gan Wang Ying Zhang Xin Shi Huanhuan Dai Zixuan Wang and Xun Wang This research was supported by the National Natural Science Foundation of China [No.61972416, 61873280 and No.61873281]; the Natural Science Foundation of Shandong Province [No.ZR2019MF012]. (Xiangyu Meng and Xudong Zhang contributed equally to this work)(Corresponding author: Zixuan Wang and Xun Wang). Xiangyu Meng, Xudong Zhang, Gan Wang, Ying Zhang, Xin Shi and Huanhuan Dai are with the Department of Computer Science and technology, China University of Petroleum, Qingdao, 266580, Shangdong, China(e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Zixuan Wang is with the Minimally Invasive Interventional Therapy Center, Qingdao Municipal Hospital, Qingdao, 266011, Shangdong, China(e-mail: [email protected]).Xun Wang is with the Department of Computer Science and technology, China University of Petroleum, Qingdao, 266580, Shangdong, China; High Performance Computer Research Center, University of Chinese Academy of Sciences, Beijing, 100190, China(e-mail: [email protected]).

Abstract

Liver cancer is one of the most common malignant diseases in the world. Segmentation and labeling of liver tumors and blood vessels in CT images can provide convenience for doctors in liver tumor diagnosis and surgical intervention. In the past decades, many state-of-the-art medical image segmentation algorithms appeared during this period. With the development of embedded devices, embedded deployment for medical segmentation and automatic reconstruction brings prospects for future automated surgical tasks. Yet, most of the existing segmentation methods mostly care about the spatial feature context and have a perception defect in the semantic relevance of medical images, which significantly affects the segmentation accuracy of liver tumors and blood vessels. Deploying large and complex models into embedded devices requires a reasonable trade-off between model accuracy, reasoning speed and model capacity. Given these problems, we introduce a multi-scale feature fusion network called TransFusionNet based on Transformer. This network achieved very competitive performance for liver vessel and liver tumor segmentation tasks, meanwhile it can improve the recognition of morphologic margins of liver tumors by exploiting the global information of CT images. Experiments show that in vessel segmentation task TransFusionNet achieved mean Dice coefficients of 0.899 and in liver tumor segmentation task TransFusionNet achieved mean Dice coefficients of 0.961. Compared with the state-of-the-art framework, our model achieves the best segmentation result. In addition, we deployed the model into an embedded micro-structure and constructed an integrated model for liver tumor vascular segmentation and reconstruction. This proprietary structure will be the exclusive component of the future medical field.

{IEEEkeywords}

Liver tumor, Liver vessel, Medical image segmentation, Transformer, 3D reconstruction, embedded microprocessor, computer-aided diagnosis

1 Introduction

\IEEEPARstart

Liver cancer is the sixth most common primary cancer worldwide and the fourth leading cause of cancer death[1]. Therefore, there is an urgent need for effective prevention programs and treatments to reduce the harm caused by liver cancer. In the early stage of liver cancer, potential risks of coming serious liver cancer can be eliminated by surgical removal of the tumor or local treatment. In recent years, computer-assisted liver surgery (e.g., ablation and embolization) has been increasingly used for the treatment of primary and secondary liver tumor patients who are not eligible for common surgeries[2]. Computed Tomography(CT), as part of computer-assisted liver surgery, is a commonly implemented for clinical diagnostic approach to improve the visualization on liver, vessels and tumors[3]. Because CT is capable of clearly showing the number, boundary, density and other patterns of the disease focus. Experts will segment liver vessels and tumors from CT images before surgery in helping 3D visualization, path planning, and guidance for interventional surgery of liver[4]. However, there are some challenging obstacles in computer-assisted liver interventions. The most critical one of all is that segmentation of liver vessels and tumors from CT images is manually completed by specialists, which is rather time-consuming, labor-intensive and no quality guaranteed. This can lead to the inability to precisely pinpoint the vessels that supplies nutrition for the hepatic tumor, thus affecting hepatic embolization procedure, ablation and so on. Eventually local tumors will relapse[5] . As a result, there is an urgent need for an intelligent auxiliary diagnostic key embedded component in the medical field. The structure can be flexibly deployed in any CT instrument. Meanwhile inferential results of liver tumor and artery can be quickly generated with guaranteed precision, which assist physicians to complete rapid diagnosis and carry out next liver surgery plan.

In previous studies, many methods have emerged for segmenting liver vessels separately or tumors as well, but none of which considers segmenting vessels and tumors at the same time. This is due to the complicated background, heterogeneous shape and surrounding vessels irregularity of the tumor making it difficult to segment the hepatic vessels that supply nutrition for the tumor[6]. Traditional methods try to segment livers or tumors by active contour methods, tracking methods, and feature learning methods. Active Contour Model (ACM) is a method to detect object boundaries based on curve evolution theory and level set approach. Cheng et al. [7] implemented ACM with precise shape dimension constraints based on CT scan models for contour point detection of vessel cross-sections to plot vessel boundaries. Chung et al. [8] proposed an active contour method to segment portal vein and hepatic vein based on the regional intensity distribution of the image and the probability map of vessel occurrence. However, the active contour model tends to fall into the local optimum problem when extracting complex regions in the vector field, and cannot handle gray scale inhomogeneous images well. The tracing method starts by manual initialization or image preprocessing to initiate a single or specified number of seed points in the vessel, and then finds subsequent points based on the image derived data as a way to trace the vessel[3]. Tracking methods mainly include model-based algorithms [9, 10, 11],least cost path-based algorithms [12, 13]. However, if the initial seed points of these methods are not correctly positioned, the final segmentation results can be seriously affected.

In order to segment vessels or tumors from CT images, feature learning methods need to perform feature extraction from images and labels based on real segmentation to train machine learning models such as random forest (RF) [14, 15] and support vector machine (SVM) [16, 17] for automatically segmenting vessels or tumors from CT images. However, the robustness and generalization ability of machine learning models are limited. In recent years, many deep learning models ,like convolutional neural networks [18, 19],have gradually shown promosing performances in the field of medical image segmentation. Currently, segmentation models based on fully convolutional networks [20] and UNet [21] architectures are the most effective ones. Huang et al. [22] combined 3D-UNet with data enhancement techniques, a variant of dice coefficient, to reduce the effect of high imbalance in some extent between hepatic vessels and background classes. Zhou et al. [23] proposed UNet++, a model that combines a deeply supervised encoder and decoder and links the sub-networks of both through a series of hops as a way to reduce the semantic gap between the encoder and decoder feature mappings. Recently, Transformer [24][25][26] has made great achievements in the field of deep learning, and TransUNet proposed by Chen et al. applies transformer as an encoder to extract global contextual features and combines it with convolutional neural network for decoding. For segmentation of liver vessels and tumors, a high degree of accuracy must be achieved to enable clinical applications. In view of above mentioned methods including other UNet-based [27, 28, 29] methods , performances still can be improved in terms of accuracy and efficiency despite of some attempts in architecture.

Because embedded microprocessors are low-power, inexpensive, and easy to deploy, researchers considered migrating semantic segmentation models to embedded microprocessors or edge computing devices in order to complete the inference task in specific scenarios. Wei et al.[30] introduced a fast and efficient lightweight network called Turbo Unified Network (ThunderNet). This model implements fast and efficient inference on the Jetson platform. Huang et al. demonstrate EDSSA-an Encoder-Decoder semantic segmentation networks accelerator architecture which can be implemented with flexible parameter configurations and hardware resources on the FPGA platforms that support Open Computing Language (OpenCL) development[31]. In the process of model transplantation, the tradeoff between speed, volume and accuracy of model inference is the focus of various researchers. With the development of medical image segmentation methods, related studies show significant precision in different lesion segmentation of multimodal medical data. Lightweight model deployment and transplantation of high-precision medical segmentation models into embedded micro-devices will greatly promote the development of automated surgery and automated diagnosis.

In our work, we construct a general semantic segmentation model TransFusionNet according to the different semantic features of liver tumors and vessels in CT images. It is a multiscale information fusion network capable of learning sematic and spatial information features, including a Transformer-based semantically feature extraction module, a Multi-layer local feature extraction module, and a multiscale fusion decoder. The model can accurately detect and segment fine-scale arterial vasculature, while effectively identifying and segmenting liver tumors and vascular features by fusing the global context of CT images. At the same time, based on the Edge Extraction Module, the network can effectively extract the edge features of the target objects. Since images were obtained from different scanners or different imaging protocols during clinical diagnosis. Instead, training of deep learning models often requires a large amount of labeled data that is supposed to accurately represent the original data[32]. To address the above issues, we propose a transfer learning training strategy. This learning strategy allows the network to mix the features of the three datasets, and the final trained model can significantly alleviate overfitting and improve segmentation accuracy. Finally, we deployed the model to the Jetson TX2 embedded microprocessor using compressed distillation to allow the model to segment CT medical images in real-time and rebuild tumors and blood vessels. Experiment shows that this inference system can complete fast and automated reconstructions with the error’s permission, which can greatly reduce the workload of doctors in diagnosis. In the future, this structure can be used as a key component in intelligent surgical diagnosis and has high application prospects. At the same time, the portability of our proposed framework can be demonstrated by transplanting high-precision medical segmentation models.

2 Methods

In order to better complete the segmentation task of liver tumors and blood vessels, we design a novel segmentation architecture called TransFusionNet. We introduced a new feature extraction module fused with Transformer and CNN. Based on this module, the network can effectively extract image spatial features and semantically related features. At the same time, we proposed an Edge Extraction Module which can significantly capture the edge feature of the image to cooperate with the training of the segmentation network. The model designed based on our ideas can effectively learn rich feature information, and can effectively ensure the segmentation accuracy of the edges of difficult-to-segment objects, which are critical in vascular and liver tumor segmentation tasks. The framework of our model is shown in Figure 1.

Refer to caption — Figure 1: Overview of the TransFusionNet model. (a) Transformer-based feature extractor. (b) Multi-layer local feature extraction module (c) Fusion decoder for multiscale feature. (d) Edge Extraction Module.

2.1 Transformer-based semantically feature extraction module

We introduce an encoder that can learn the global feature representation, which consists of a feature embedding module based on a feature extraction backbone and a feature extraction module that senses the semantically related information representation of the image based on the transformer [24]. This module adopts a brand-new feature extraction idea, by semantically representing the features of the picture and learning the global representation of semantic features.

The input image $i\in\mathbb{R}^{H\times W\times C}$ is first fed into the feature extraction backbone network. The network can extract the spatial information features of CT images and output the feature map $x\in\mathbb{R}^{H^{{}^{\prime}}\times W^{{}^{\prime}}\times C^{{}^{\prime}}}$ . We divide the feature map x learned by the backbone into a series of patches $x_{p}^{i}\in\mathbb{R}^{P^{2}\times C},i=1,…,N$ , where the size of each patch is $P\times P$ ,and the number of patches denote by $N=\frac{H^{{}^{\prime}}\times W^{{}^{\prime}}}{P^{2}}$ . For each patch, we use a convolution operation with a kernel size of $P\times P$ to obtain the information $E_{info}^{i}$ of i-th patch to form an information matrix $\{E_{info}^{1},E_{info}^{2},...,E_{info}^{N}\}$ . In order to better learn location information using Transformer, Dosovitskiy et al. [33] perform a learnable location embedding for each patch to obtain the location matrix $\{E_{pos}^{1},E_{pos}^{1},...,E_{pos}^{N}\}$ of the N patches. The feature of the i-th patch can be formulated by the following equation:

E^{i}=E_{info}^{i}+E_{pos}^{i}.

(1)

We adopt this position encoding method, so that the feature extraction module can effectively learn the position information of the features. We next input the above obtained feature matrix $E=\{E^{1},E^{2},…,E^{N}\}$ of x into multi Transformer layers to learn semantically representation of the feature map. In comparison to traditional convolution operation, transformer adopts a multi-head self-attention mechanism, and its core formulation is shown in equation 2:

y=\sum_{i=1}^{h}\sum_{j=1}^{w}\sum_{k=1}^{n}(softmax(Q_{ijk}^{T}\times K_{ijk})\times V_{ijk}),

(2)

where $h$ and $w$ denote by width and height of the feature matrix $E$ after feature extraction and location embedding. And $n$ is the number of self-attention mechanism heads. $Q_{ijk},K_{ijk},V_{ijk}$ denote the query, key and value obtained by three linear transformations of the input $E_{ij}$ in each self-attended head, respectively. $y\in\mathbb{R}^{h\texttimes w\texttimes C}$ denotes the output after one multi-headed self-attention. We stacked 12 transformer layers, and the output of the last layer can theoretically learn to incorporate a rich context feature representation of the CT image under a wider range of perceptual fields. We then feed the output of the Transformer layers into a three-layer convolution operation. The final output feature map consist global high-level abstract information, effectively solving the problem of missing information caused by perceptual field defects in traditional deep CNN networks. We call it semantic feature map.

2.2 Multi-layer local feature extraction module

Transformer-based extraction module is a very powerful for semantically information feature, because the Transformer feature extraction module has advantages in learning semantically related features. In many ways, however, Transformer is not an effective replacement for traditional convolutional operations. For extraction of more subtle feature in some images such as edge feature of interest regions and tiny vessel feature, CNN is nothing but the perfect solution. We designed a local residual network encoder based on multi-layer SEBottleNet stacking, as shown in Figure 2. The encoder consists of a six feature extraction module. A max pooling operation is performed to extract the high-level feature representation after feed feature map to each feature extraction block. The input CT image $x\in\mathbb{R}^{H\times W}$ is first fed forward to a CNN module for high-level feature extraction, and the feature map $u\in R^{H\times W\times C}$ is obtained. Then, the feature map $u$ is feed into a deep residual feature extractor stacked by five layers of SEBottleNet, each of which is used for learning the context features under the local perception field. BottleNet residual network[34] retains all the advantages of residual network and significantly reduces computation interval and computational burden. We introduced the Squeeze and Excitation(SE)[35] in the BottleNet to enhance the interdependence between feature map channels. The structure of the SEBottleNet is shown as in Figure 2. The mean value $e_{c}\in\mathbb{R}^{1\times 1\times C}$ of the feature embedding for each channel in the feature map $U\in\mathbb{R}^{H\times W\times C}$ can be obtained from the Squeeze section, as shown in the following equation:

e_{c}=\frac{1}{w\times h}\sum_{i=1}^{w}\sum_{j=1}^{h}u_{c}(i,j).

(3)

Where the $u_{c}(i,j)\in\mathbb{R}^{1\times 1\times C}$ is the pixel in feature map $U$ . The Excitation section can learn the feature weights $e_{c}$ for each channel by $s_{c}$ :

s_{c}=\delta(\mathcal{G}(e_{c},\mathcal{W})).

(4)

Finally, the vector product $\tilde{O}$ of s and u is obtained by the Scale operation, and this is the final output of the SE module:

\tilde{O_{c}}=s_{c}\times u_{c},

(5)

where $\tilde{O}_{c}$ is the feature map of a feature channel.

The SEBottleNet residual network splits the traditional convolutional operation into multiple modules to ensure that each module has a different feature extraction task. We introduced the Squeeze and Excitation module in the middle of the module to better learn the importance of the feature map channel dimensions, so that SEBottleNet has a stronger learning focus in the feature extraction process. Through the continuous stacking of SEBottleNet and maxpool, the encoder can continuously extract the local feature representation of the input CT image HIGH-level. Meanwhile, since each SEBottleNet is set with residual connections, it enables the encoder to effectively mitigate the degradation problem caused by network deepening.

2.3 Edge Extraction Module

Since the hepatic arterial vessels are very small and the margins of the liver tumor arblurd, further refining the segmentation of the vessels and liver is a challenging task. In order to allow the model to learn more detailed spatial feature features, we introduce the Edge Extraction Module, which is specially designed to learn the edge features of blood vessels and tumor regions of interest and fuse the edge features to the segmentation network. The structure of this module is shown in Figure 1 (d). The Edge Extraction Module takes the feature maps of feature extraction layers and the CT edge map(Figure 4(b)) extracted by the Canny algorithm as input, and predict the edge result $e\in\mathbb{R}^{H\times W}$ . This module predicts edge information and fuses the predicted feature maps into the segmentation network. To accomplish this task, we process segmentation annotations to obtain edge annotations $e_{r}$ (Figure 4(d)), which can be used as a supervision condition for this module.

In this module, we used Gated Excitation Convolution (GEC) layer. GEC is the most important unit in Edge extraction module and it can filter out some irrelevant information to help Edge extraction module focus on extracting image edge features. GEC is applied between the Edge extraction module and the feature extraction module. It uses gating mechanisms to deactivate its own activations that are not deemed relevant by the higher-level information contained in the extraction module[36]. At the same time, we introduce an excitation module in the gating activation layer to learn the importance between different feature maps.

We define $t_{i},c_{i}\in\mathbb{R}^{\frac{H}{2i}\times\frac{W}{2i}\times C}$ as the feature maps of the Transformer module and the local feature extraction module, and $i$ denote the number of locations. Before using the GEC module, $t_{i}$ and $c_{i}$ were fed into a convolutional layer $C_{1\times 1}$ to obtain the image-dimensional feature maps $t_{i}^{{}^{\prime}},c_{i}^{{}^{\prime}}\in\mathbb{R}^{H\times W}$ . Let $e_{i}\in\mathbb{R}^{H\times W\times C}$ denote the feature map synthesized by the Edge Extraction Module. Given the feature maps $t_{i}^{{}^{\prime}},c_{i}^{{}^{\prime}}$ and $e_{i}$ , an excitation convolutional layer is applied to generate the sigmoid activation $\alpha_{i}\in\mathbb{R}^{H\times W}$ :

\alpha_{i}=\sigma(C_{1\times 1}(F_{se}(cat(t_{i}^{{}^{\prime}},c_{i}^{{}^{\prime}},e_{i})))),

(6)

where $F_{se}$ denotes squeeze and excitation option as shown in Equation 3-5. Finally, $\alpha_{i}$ and $e_{i}$ is fed into Gated Convolution layer[37][38][36] and generate the $e_{i}^{{}^{\prime}}$ . The Gated Convolution layer is compute as

e_{i}^{{}^{\prime}}=C_{1\times 1}((e_{i}^{{}^{\prime}}\times\alpha_{i})+e_{i}^{{}^{\prime}}).

(7)

Theoretically, GEC can be simply regard as a collection of attention for the spatial dimension and channel dimension of the feature map. Through GEC operation, the attention maps $\alpha_{i}$ selectively preserve the edge semantic features. We cancel the GEC operation on the shallow feature maps of the feature extractor, since the images feed to the convolution layer mainly learns the general low-level features, at the same time, the output feature map retains rich edge information. As the network deepens, the feature map will retain high-level features. Using GEC operation can effectively weight the useful edge information of high-level features in theory.

The Canny operator can effectively filter out the irrelevant features of the image to obtain the canny image as shown in Figure 4(b). We think it is applicable for medical image segmentation. Therefore, we firstly concatenate the canny image and the last GEC module output $e_{n}$ . Then we feed them together with the output feature maps of the two feature extractors to the Fusion Module. At the same time, the edge extraction module uses edge loss as the loss function and edge label as the supervision to optimize the prediction edge map.

2.4 Multi-scale feature fusing module

In the previous sections, we introduced two feature extraction structures that learn spatial features and semantically related features, respectively. At the same time, we introduce the edge feature extraction module to enhance the ability of the whole network to extract the target edge features. In this section, we introduce the multi-scale feature fusion decoding module to sample the semantic features learned by the three modules. This module takes the feature maps extracted by the three modules as the input and outputs the predicted category distribution map $\hat{y}\in\mathbb{R}^{K\times H\times W}$ , where K represent the semantic classes.

We introduce a fusion module, which mainly fuses the feature maps of the three feature extraction modules. Figure 3 shows the structure of this module. We designed the module with reference to the spatial pyramid pooling(SPP). Firstly, the module uses $C_{1\times 1}$ and $C_{3\times 3}$ convolution to extract features from the concatenation result of the semantic feature map and spatial feature map respectively. Next, we feed it into the pooling layer and fused the edge feature map. Through the above operations, we obtained the feature maps of three different receptive fields. Finally, we sample and concatenate these three feature maps to output the fused feature map. Theoretically, the feature map output by these module can retain rich spatial features, semantic related features and edge features.

In the process of continuous feature extraction layer by layer in the coded network, the low-level information of the feature map is continuously filtered and the high-level information is extracted. UNet uses skip connections to conduct the feature maps of the encoding module of each stage to the decoding module of the corresponding stage, and the network can fully learn the feature maps of different levels of the image. We adopted the skip connection operation from UNet and introduced skip connections to different feature encoders to allow the whole network to better learn the feature information of different encoders at different levels. The skip connection introduced in the local feature extraction module is similar to the traditional UNet module, which combines the short-range skip connection (residual connection) and the long-range skip connection of SEBottleNet. As for the Transformer based feature extraction module, we first introduce skip connections in the encoding process of the backbone network to connect the intermediate feature maps in the forward propagation process of the backbone embedding network, which improves the low-level feature loss in the feature embedding process of the backbone network. Next, we add skip connections to the feature maps with global feature representations after Transformer feature encoding fusion to fuse the global low-level features. Eventually, after continuously fusing low-level feature maps of different scales, the decoder can learn the semantic information of images from coarse to fine.

2.5 multi task training strategy

We propose the edge feature extraction module to cooperate with the segmentation task of the model, so we train the model to complete semantic segmentation and edge information segmentation at the same time. We introduce the joint optimization of edge loss and segmentation loss respectively. At the same time, in order to better ensure the consistency of multi task learning optimization, we set up a regularization methods to balance the two losses.

We use Dice and Cross Entropy(CE) as the loss function of segmentation task to predict semantic segmentation $y$ :

\mathcal{L}_{seg}^{\theta,\psi}=\lambda_{1}\mathcal{L}_{Dice}(y,\hat{y})+\lambda_{2}\mathcal{L}_{CE}(y,\hat{y}),

(8)

where $y\in\mathbb{R}^{H\times W}$ denote the real semantic label map of liver tumor and vessel. In the equation 8, $\lambda_{1}$ and $\lambda_{2}$ represent hyper parameters. As for edge prediction, we use Binary Cross Entropy(BCE) loss. In this experiment, the model mainly focuses on tumor and vascular segmentation. We extract their common edges to obtain edge label $\hat{e}\in\mathbb{R}^{H\times W}$ (Figure 4(d)) and take $\hat{e}$ as the loss supervision. Therefore, the edge loss can be expressed as:

\mathcal{L}_{edge}^{\theta,\psi}=\lambda_{3}\mathcal{L}_{BCE}(e,\hat{e}),

(9)

where, $e$ represent the edge predict map of the edge extraction module. It is worth noting that in the optimization process, the parameters of feature extraction modules and edge extraction modules will be optimized based on loss. Next, we will input the feature map output by the edge extraction module into the fusion module to predict the segmentation results. Therefore, the prior knowledge learned by the edge extraction module was retained in $y$ . At the same time, the segmentation loss will pay more attention to the edge features in the optimization process.

We introduce regularization methods to make the model cooperate better in the training process. As mentioned above, $y\in\mathbb{R}^{K\times H\times W}$ represents the predicted segmentation map and $e\in\mathbb{R}^{H\times W}$ represents the predicted edge graph. Therefore, we introduce shape regularization, which can be expressed as:

\mathcal{L}_{sreg}^{\theta,\psi}=\lambda_{4}||Sigmoid(\oplus y)\times e-e||,

(10)

Where $\oplus$ represents the pixel-wise addition of $y$ without the background label map which can be implemented using kernel fixed convolution operator. This operation outputs a label map containing the predict region of the tumor and blood vessels. In particular, at the beginning of model training, because the edge extraction module cannot accurately predict the edge, loss does not play any role. We therefore introduced a dynamic adjustment strategy, which is set $\lambda_{4}$ to 0 before 100 epoch and $\lambda_{4}\geq 0$ after 100 epoch

Finally, the loss function of the model is:

\mathcal{L}_{total}=\mathcal{L}_{seg}^{\theta,\psi}+\mathcal{L}_{edge}^{\theta,\psi}+\mathcal{L}_{sreg}^{\theta,\psi}.

(11)

We set the epoch to 300, the initial learning rate to 0.001 (using the cosine annealing learning rate decay method), and the batch size to 8. The model is trained using an SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4.

2.6 Applying transfer learning to TransFusionNet

The TransFusionNet can significantly learn full-resolution context feature information, and its segmentation effect in the public dataset of blood vessels and liver tumors is significant. However, due to the scarcity of the enhanced CT images of liver cancer after the screening, we only obtained CT images of 18 patients. Too little data will inevitably affect the performance of the model and deepen the over-fitting problem. For this purpose we introduce a transfer learning strategy, which does not require exactly representative training data and is able to take advantage of the similarity between datasets to capture specific prior knowledge during the training phase of the model in order to construct new segmentation models.

We first pre-trained the models using the publicly available datasets LITS and 3Dircadb to obtain a liver tumor segmentation model and a liver vascular segmentation model, respectively. Then, we use our liver tumor data and liver vascular data to retrain the two models obtained by pre-training, and obtain two liver tumor segmentation models and liver vascular segmentation models based on the training sample distribution of our dataset. When we need to perform segmentation of liver tumor and blood vessels of CT images, we only need to input one CT image, and the two models will segment the tumor and blood vessel parts of CT images respectively. The mask output from the two model segments is automatically fused into a single mask that contains the tumor and blood vessels with relative positions as the final output.

3 Experiment and discussion

3.1 Experimental setup

3.1.1 Dataset

The LITS (Liver and Liver Tumor Segmentation, https://competitions.codalab.org/competitions/17094) dataset contains 130 cases of tumors, metastases, and cysts, and these CT scans have large spatial resolution and field of view (FOV) differences[6]. 3Dircadb (3D Image Reconstruction for Comparison of Algorithm Database, https://www.ircad.fr/research/3d-ircadb-01/) is a public dataset that can be used to train and test liver vessel segmentation methods, including 20 patients in different image resolutions, vessel structure, intensity distribution and liver vessel comparison CT enhancement[22]. At the same time, we collected 18 typical patients’ portal and arterial phase CT enhanced images of the liver for manual annotation, and finally constructed a liver tumor blood vessel(LTBV) dataset. The data we used was ethically reviewed, but we cannot make this dataset publicly available because of patient privacy. We annotated the hepatic arterial vessels in the arterial phase images of the same patient and annotated the liver tumors in the portal phase images. Due to the different characteristics of CT images in the two phases, we need to train two models for automatic segmentation of arterial vessels and tumors.

The LITS and 3Dircadb datasets cover a wide range of CT images with different resolution differences and field of view (FOV) differences. We use these two datasets for model pre-training. We use our private dataset for the fine-tuning training of the model for hepatic artery and tumor segmentation tasks. The above three datasets are divided into training set and test set according to the ratio of 8:2.

3.1.2 Evaluation Metrics

In order to better evaluate our model from multiple perspectives, we have selected 5 evaluation indicators including: iou, DSC coefficient, voe, recall, precision. IoU(Intersection over Union) is the calculation of the intersection of the real annotation and the segmentation result. The calculation method was

IoU=\frac{R_{pre}\cap R_{real}}{R_{pre}\cup R_{real}},

(12)

where $R_{pre}$ represents the segmentation result predicted by the model, and $R_{real}$ represents the actual segmentation result. The DSC(Dice Similariy Coefficient) represents the ratio of the area where the segmented image and the real image intersect to the total area. The calculation method was

DSC=\frac{2\times(R_{pre}\cap R_{real})}{R_{pre}+R_{real}},

(13)

where $R_{pre}$ represents the segmentation result predicted by the model, and $R_{real}$ represents the actual segmentation result. VOE(Volumetric Overlap Error) represents the difference between the area of the segmented image and the real image, and usually represents the error rate of segmentation. The specific calculation method was

VOE=\frac{2\times(R_{pre}-R_{real})}{R_{pre}+R_{real}}.

(14)

Precision is the proportion of pixels that are actually not in the region of interest correctly judged as not in the region of interest. It measures the ability to correctly judge the pixels that are not in the region of interest in the segmentation experiment. Its calculation method was

Precision=\frac{I-R_{pre}\cup R_{real}}{I-R_{real}}.

(15)

Where I is the original input image. Recall is the proportion of pixels that are correctly judged as pixels in the region of interest. It measures the ability to correctly segment the region of interest.Its calculation method was

Recall=\frac{R_{pre}\cap R_{real}}{R_{real}}.

(16)

3.2 Performance comparison with state-of-the-art methods

We choose 5 advanced segmentation models to compare with our method, the 5 models are SegNet[18], UNet[21], UNet++[23], UNet3+[39], TransUNet[40]. At the same time, we divide the proposed method into TransFusionNet(TFN) and TransFusionNet with edge module(TFNEdge), and evaluate them respectively. We first compare the segmentation effects of five models on blood vessels and tumors based on two public datasets:LITS(Tumor)and 3Dircadb(Vessel). Next, we use the LTBV dataset to fine-tune the five models and compare the segmentation effects of the five models.

3.2.1 Comparison experiment of liver tumor and blood vessel segmentation effects based on public datasets:LITS and 3Dircadb

The performance of TransFusionNet and the other four methods on two public datasets is shown in Table 1. The experimental results show that the IoU of TransFusionNet on the 3Dircadb dataset can reach 0.854, and the DSC can reach 0.918, which is 0.8% and 1.1% higher than the IoU and DSC of the baseline method UNet. The IoU is 2.3% and 0.7% higher than UNet++ and TransUNet, respectively. However the IoU can reach 0.863 when using TransFusionNet with edge extraction module. On the LITS dataset, that is, when performing liver tumor segmentation, the IoU and DSC of TransFusionNet can reach 0.840 and 0.910. As can be seen from Table 1, the VOE of TransFusionNet on the two datasets, that is, the error rate is also far lower For other models.

Table 1: Performance comparison on LITS and 3Dircadb datasets

Dataset	Methods	IoU	DSC	VOE	Precision	Recall
3Dircadb	SegNet[18]	0.839	0.907	-0.067	0.938	0.879
	UNet[21]	0.846	0.913	-0.079	0.951	0.880
	UNet++[23]	0.831	0.904	-0.062	0.934	0.879
	UNet3+[39]	0.853	0.917	-0.059	0.945	0.894
	TransUNet[40]	0.847	0.913	-0.066	0.944	0.885
	Ours	0.854	0.918	-0.041	0.938	0.901
	Ours with edge module	0.863	0.921	-0.051	0.947	0.901
LITS	SegNet[18]	0.805	0.887	-0.035	0.904	0.875
	UNet[21]	0.832	0.905	-0.024	0.917	0.897
	UNet++[23]	0.828	0.902	-0.020	0.912	0.896
	UNet3+[39]	0.821	0.895	-0.002	0.899	0.898
	TransUNet[40]	0.834	0.905	-0.040	0.923	0.889
	Ours	0.840	0.910	-0.018	0.919	0.904
	Ours with edge module	0.840	0.910	-0.018	0.919	0.904

3.2.2 Comparison experiment of liver tumor and blood vessel segmentation effects based on private dataset:LTBV

As the experiments described in section 3.2.1, we used LITS and 3Dircadb public datasets to train to obtain segmentation models of liver tumors and blood vessels. Compared with the other four state-of-the-art methods, our method has the best automatic segmentation effect. We use five methods trained on two datasets to perform transfer learning fine-tuning training on the LTBV dataset. The performance of each model in the LTBV dataset as shown in Table 2. From Table 2, we can see that the IoU of TransFusionNet on the blood vessel dataset can reach 0.822, and the DSC can reach 0.899. This is 1.9% and 1.8% higher than the IoU and DSC of the baseline method SegNet, and is higher than the IoU and Voe of TransUNet. They are 0.4% and 0.5% higher respectively. On the tumor dataset, the IoU and DSC of TransFusionNet are as high as 0.927 and 0.961, which shows our method can still achieve the best results after LTBV transfer learning.

Table 2: Performance comparison on LTBV datasets

Dataset	Methods	IoU	DSC	VOE	Precision	Recall
Vessel	SegNet[18]	0.803	0.881	-0.056	0.907	0.858
	UNet[21]	0.812	0.893	-0.013	0.902	0.890
	UNet++[23]	0.809	0.892	-0.058	0.919	0.868
	UNet3+[39]	0.821	0.897	0.022	0.892	0.909
	TransUNet[40]	0.818	0.897	-0.049	0.920	0.876
	Ours	0.822	0.899	-0.054	0.925	0.877
	Ours with edge module	0.854	0.901	-0.040	0.933	0.895
Tumor	SegNet[18]	0.905	0.948	0.002	0.922	0.931
	UNet[21]	0.915	0.954	-0.018	0.963	0.946
	UNet++[23]	0.912	0.952	0.003	0.952	0.954
	UNet3+[39]	0.827	0.899	-0.037	0.918	0.885
	TransUNet[40]	0.920	0.955	-0.023	0.967	0.945
	Ours	0.927	0.961	-0.011	0.966	0.955
	Ours with edge module	0.917	0.954	-0.022	0.975	0.945

3.3 Ablation Study for TransFusionNet model

3.3.1 Ablation study of Transformer-based feature extraction module and SEBottleNet local encoder

In this section, we use the Transformer module, the Multi-layer local feature extraction module, and TransFusionNet for our experiments, with the aim of testing the effect of the above two modules on the segmentation accuracy of TransFusionNet. From Figure 5(a) we can see that the Transformer module performs better than the SEBottleNet module in general on the vascular dataset of 3Dircadb. We believe that the Transformer encoder can learn the CT image global contextual feature representation, especially it encodes the image location information, which certainly helps to enhance the segmentation of the image as a whole. On the LITS tumor dataset, as shown in Figure 5(b), the segmentation accuracy of the SEBottleNet module is higher, which is attributed to the fact that its internal CNN and local residuals are more interested in some finer features in the image, such as tumor edge features. From Figure 5, we can see that this paper achieves an effective improvement in the segmentation accuracy of liver vessels and tumors by combining the Transformer module and SEBottleNet module.

3.3.2 Function of skip connection used in Decooder

From results in Tables 1 and 2, we can find that the TFNEdge module shows the best effect in vascular segmentation task, but the outcome in tumor segmentation task is not as good as TFN. Anyway, the tumor segmentation effect of TFNEdge module also exceeds the state-of-the-art model. This shows that the edge extraction module can play a good role in the task of small target segmentation. However, when segmenting large targets, due to the function of the edge loss function, the network will pay more attention to edge optimization and affect the global control of the whole target.

3.3.3 Function of skip connection used in Decooder

In the Encoder-Decoder structure, the encoder learns to extract the high-frequency image representation of the feature map, and the decoder continuously learns feature recovery based only on the high-frequency feature coding output from the encoder. The role of low-frequency feature information is ignored in the process of encoding and decoding, yet low-frequency features often have their non-negligible role. The role of skip connection is to allow the network to better learn low-frequency features during the encoding and decoding process. In this experiment, to demonstrate the importance of different modular jump links of our designed network on the segmentation effect, we use TFN Architecture and remove the skip connections of the Transformer module, the skip connections of the local feature extraction module and all skip connections, and train these three models using the same parameter settings. The performance gap with the original network is compared. The experimental results are shown in Table 3. According to the results in the Table 3, we can find that the model retaining the global local skip connections has a significant improvement compared to the model with the jump links removed. This result proves the importance of skip connections for TransFusionNet and also shows that the low frequency features of the image have a significant impact on the segmentation results.

Table 3: Performance comparison on LITS and 3Dircadb datasets

Dataset	Modules without skip connections	IoU	DSC	VOE	Precision	Recall
3Dircadb	all encoders	0.654	0.780	-0.163	0.862	0.732
	Transformer-based encoder	0.767	0.858	0.059	1.066	0.886
	CNN-based encoder	0.785	0.870	0.019	0.880	0.868
	Ours	0.854	0.918	-0.041	0.938	0.901
LITS	all encoders	0.805	0.887	-0.035	0.904	0.875
	Transformer-based encoder	0.832	0.905	-0.024	0.917	0.897
	CNN-based encoder	0.828	0.902	-0.020	0.912	0.896
	Ours	0.840	0.910	-0.018	0.919	0.904

3.4 Visualizations

From the above quantitative experimental methods, our model has the best performance in the segmentation of liver blood vessels and liver tumors. Next, we use TransFusionNet and other comparable models on a test case on the LTBV dataset to segment liver tumors and blood vessels and then visualize them. The first row of Figure 6 is the model’s segmentation of Vessel, the second row is the model’s segmentation of Tumor, and the third row is the result of fusing the first two rows of segmentation results in the same coordinate system.

From the perspective of visual analysis, SegNet and UNet are not accurate in segmenting the details of blood vessels. Although UNet++ can identify some details of blood vessels, the error rate is too high. TransFusionNet almost perfectly segmented the details of blood vessels, which is more accurate than TransUNet. This is attributed to SEBottleNet’s extraction of local receptive field information and the importance of different channels. In tumor segmentation, UNet++ has a great segmentation result for the edge of the tumor, while SegNet and UNet perform poorly in this respect. All comparison models have some wrong tumor segmentation, and TransFusionNet not only avoids these wrong segmentation but also can segment the edge and contour of the tumor accurately. We believe that after TransFusionNet extracts global and local information, the multi-scale feature fusion decoder almost perfectly restores the feature of the image, so that the segmentation accuracy is significantly improved, and the error rate is low.

In summary, the above comparison models are not accurate in segmenting tumors and blood vessels. They are easy to misclassify some areas that are not tumors, and they are not sensitive to the recognition of some fine blood vessels areas, resulting in incomplete blood vessel segmentation results. TransFusionNet can accurately segment the liver tumor regardless of its integrity or vascular continuity.

3.5 Case Study: 3D Reconstruction of Liver Tumor Vessels Using TransfusionNet under embedded devices

Medical image segmentation has a wide range of applications and research values in medical research and practice fields such as medical research, clinical diagnosis, pathological analysis, computer-assisted surgery, and three-dimensional simulations. In this experiment, we use knowledge distillation to generate a more lightweight model from the TransFusionNet and deploy it into JetsonTX2 embedded device. We fed the CT images to the JetsonTX2 embedded system to directly predict the results of 3D reconstruction of liver vessels and tumors.

We define segmentation model $\mathcal{F}:\mathbb{R}^{N\times H\times W}\rightarrow\mathbb{R}^{N\times C\times H\times W}$ . this model has been fully trained. Next, the arterial phase CT-enhanced image $x\in\mathbb{R}^{N\times H\times W}$ feed into $\mathcal{F}$ to predict the liver and vessel label map $o\in\mathbb{R}^{N\times C\times H\times W}$ where $C=3$ in this segmentation task. We need to construct 3D reconstruction result according to the segmentation label map $o$ . Thus, to obtain segmentation result $y\in\mathbb{R}^{N\times H\times W}$ from label map $o$ , we use the following equation:

y=\mathcal{G}*argmax(o).

(17)

Where $\mathcal{G}$ denotes Gaussian filter. In order to better support the embedded platform, we refer to the method in [41] to distill the knowledge of the Transformer module. At the same time, we do the post-processing quantization operation for the middle layer of the trained model. After the above optimizations, we successfully deployed the model to the embedded processor

Finally, we save the reconstruction results in $.nrrd$ format and use 3Dslicer for visual display. The Figure 7 shows a comparison between the reconstruction results under embedded microprocessor and manual annotations of a typical patient. Except for some noise and loss of vessel details, the reconstruction results were very close to the actual annotation results. However, a detailed manual annotation requires a lot of time and effort, which significantly reflect the efficiency and accuracy of our proposed algorithm.

4 Conclusion

In this work, we propose a segmentation model which can effectively extract the full-scale feature information of CT images. The IOU reached the peak performance of 0.864 in the vessel segmentation of the public dataset 3dircadb and 0.840 in the liver tumor segmentation of the public dataset LITS. At the same time, we transferred the trained model to our annotated dataset, and the IOU in tumor and vascular segmentation reached 0.927 and 0.822 respectively. Compared with the state-of-the-art segmentation methods, TransFusionNet has an accuracy improvement of 1% - 2%. Although this experiment is only for the segmentation of liver tumors and blood vessels, our model can also be applied to the segmentation of other tissues.

We further compress and distill the model and deploy it to the embedded system. Finally, we developed an automatic tumor vessel segmentation and reconstruction device for CT images. And the device realizes the automatic reconstruction of tumors and blood vessels without losing too much accuracy. This shows that our method has great application prospects in intelligent surgery in the future.

Although we have completed the reconstruction of liver tumors and blood vessels on Jetson, the segmentation and reconstruction of intrahepatic blood vessels using our method still needs to be further improved. Due to the numerous and small branches of intrahepatic vessels, it is difficult for deep learning algorithm to perceive the characteristics of intrahepatic vessels. At the same time, the clarity of CT image and the computing power will also hinder the fine reconstruction of internal hepatic artery. Facing so many challenges, in the next work, we will further design a novel quantitative method to optimize the inference accuracy of the model in the embedded structure.

References

[1] X. Li, P. Ramadori, D. Pfister, M. Seehawer, L. Zender, and M. Heikenwalder, “The immunological and metabolic landscape in primary and metastatic liver cancer,” Nature Reviews Cancer, vol. 21, no. 9, pp. 1–17, 2021.
[2] D. A. Gervais, S. N. Goldberg, D. B. Brown, M. C. Soulen, S. F. Millward, and D. K. Rajan, “Society of interventional radiology position statement on percutaneous radiofrequency ablation for the treatment of liver tumors,” Journal of Vascular and Interventional Radiology, vol. 20, no. 7, pp. S342–S347, 2009.
[3] M. Ciecholewski and M. Kassjański, “Computational methods for liver vessel segmentation in medical imaging: A review,” Sensors, vol. 21, no. 6, p. 2027, 2021.
[4] Q. Yan et al., “An attention-guided deep neural network with multi-scale feature fusion for liver vessel segmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 7, pp. 2629–2642, 2020.
[5] H.-W. Huang, “Influence of blood vessel on the thermal lesion formation during radiofrequency ablation for liver tumors,” Medical physics, vol. 40, no. 7, p. 073303, 2013.
[6] H. Jiang, T. Shi, Z. Bai, and L. Huang, “Ahcnet: An application of attention mechanism and hybrid connection for liver tumor segmentation in ct volumes,” IEEE Access, vol. 7, pp. 24 898–24 909, 2019.
[7] Y. Cheng, X. Hu, J. Wang, Y. Wang, and S. Tamura, “Accurate vessel segmentation with constrained b-snake,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2440–2455, 2015.
[8] M. Chung, J. Lee, J. W. Chung, and Y.-G. Shin, “Accurate liver vessel segmentation via active contour model with dense vessel candidates,” Computer methods and programs in biomedicine, vol. 166, pp. 61–75, 2018.
[9] C. Bauer, T. Pock, E. Sorantin, H. Bischof, and R. Beichel, “Segmentation of interwoven 3d tubular tree structures utilizing shape priors and graph cuts,” Medical image analysis, vol. 14, no. 2, pp. 172–184, 2010.
[10] S. Esneault, C. Lafon, and J.-L. Dillenseger, “Liver vessels segmentation using a hybrid geometrical moments/graph cuts method,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 2, pp. 276–283, 2009.
[11] M. A. Lebre, A. Vacavant, M. Grand-Brochier, H. Rositi, and B. Magnin, “Automatic segmentation methods for liver and hepatic vessels from ct and mri volumes, applied to the couinaud scheme,” Computers in biology and medicine, vol. 110, pp. 42–51, 2019.
[12] J. N. Kaftan, H. Tek, and T. Aach, “A two-stage approach for fully automatic segmentation of venous vascular structures in liver ct images,” in Medical imaging 2009: image processing, vol. 7259. International Society for Optics and Photonics, 2009, p. 725911.
[13] Y.-z. Zeng, Y.-q. Zhao, P. Tang, M. Liao, Y.-x. Liang, S.-h. Liao, and B.-j. Zou, “Liver vessel segmentation and identification based on oriented flux symmetry and graph cuts,” Computer methods and programs in biomedicine, vol. 150, pp. 31–39, 2017.
[14] D. Mahapatra, “Analyzing training information from random forests for improved image segmentation,” IEEE Transactions on Image Processing, vol. 23, no. 4, pp. 1504–1512, 2014.
[15] A. Smith, “Image segmentation scale parameter optimization and land cover classification using the random forest algorithm,” Journal of Spatial Science, vol. 55, no. 1, pp. 69–79, 2010.
[16] X. Wanga, T. Wang, and J. Bua, “Color image segmentation using pixel wise structural-support-vectormachine (s-svm) classification,” Pattern Recognit, vol. 44, no. 4, pp. 777–787, 2011.
[17] Y. Zhiwen, H. Wong, and W. Guihua, “A modified support vector machine and its application to image segmentation [j],” Image and Vision Computing, vol. 29, no. 1, pp. 29–40, 2011.
[18] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[19] X. Meng, X. Li, and X. Wang, “A computationally virtual histological staining method to ovarian cancer tissue by deep generative adversarial networks,” Computational and Mathematical Methods in Medicine, vol. 2021, p. 4244157, 2021.
[20] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA, 2015, pp. 3431–3440.
[21] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Munich, Germany: Springer, 2015, pp. 234–241.
[22] Q. Huang, J. Sun, H. Ding, X. Wang, and G. Wang, “Robust liver vessel extraction using 3d u-net with variant dice loss function,” Computers in biology and medicine, vol. 101, pp. 153–162, 2018.
[23] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, 2018, pp. 3–11.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 5998–6008.
[25] P. Shanchen, Z. Ying, S. Tao, Z. Xudong, W. Xun, and R.-P. Alfonso, “Amde: a novel attention-mechanism-based multidimensional feature encoder for drug–drug interaction prediction,” Briefings in Bioinformatics, no. 1, p. 1.
[26] S. Tao, Z. Xudong, D. Mao, R.-P. Alfonso, W. Shudong, and W. Gan, “Deepfusion: A deep learning based multi-scale feature fusion method for predicting drug-target interactions,” Methods, vol. 90, pp. 11–12, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1046202322000378
[27] C. Li, Y. Tan, W. Chen, X. Luo, Y. He, Y. Gao, and F. Li, “Anu-net: Attention-based nested u-net to exploit full resolution features for medical image segmentation,” Computers & Graphics, vol. 90, pp. 11–20, 2020.
[28] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, and H. D. Johansen, “Doubleu-net: A deep convolutional neural network for medical image segmentation,” in 2020 IEEE 33rd International symposium on computer-based medical systems (CBMS), Rochester, Minnesota, USA, 2020, pp. 558–564.
[29] T. Song, F. Meng, A. Rodriguez-Paton, P. Li, P. Zheng, and X. Wang, “U-next: a novel convolution neural network with an aggregation u-net architecture for gallstone segmentation in ct images,” IEEE Access, vol. 7, pp. 166 823–166 832, 2019.
[30] W. Xiang, H. Mao, and V. Athitsos, “Thundernet: A turbo unified network for real-time semantic segmentation,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 2019.
[31] H. Huang, Y. Wu, M. Yu, X. Shi, and X. Liu, “Edssa: An encoder-decoder semantic segmentation networks accelerator on opencl-based fpga platform,” Sensors, vol. 20, no. 14, p. 3969, 2020.
[32] A. Van Opbroek, M. A. Ikram, M. W. Vernooij, and M. De Bruijne, “Transfer learning improves supervised image segmentation across imaging protocols,” IEEE transactions on medical imaging, vol. 34, no. 5, pp. 1018–1030, 2014.
[33] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[34] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Amsterdam, Netherlands: Springer, 2016, pp. 630–645.
[35] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City,USA, 2018, pp. 7132–7141.
[36] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, “Gated-scnn: Gated shape cnns for semantic segmentation,” ICCV, 2019.
[37] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Free-form image inpainting with gated convolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[38] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” 2017.
[39] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-W. Chen, and J. Wu, “Unet 3+: A full-scale connected unet for medical image segmentation,” 2020.
[40] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.
[41] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers and distillation through attention,” 2021.