
PSTNet: Enhanced Polyp Segmentation with Multi-scale Alignment and Frequency Domain Integration

Wenhao Xu    Rongtao Xu    Changwei Wang    Xiuli Li    Shibiao Xu    Li Guo Wenhao Xu, Shibiao Xu and Li Guo are with School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China. Rongtao Xu is with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Changwei Wang is with the Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, 250013, China; Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China; the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Xiuli Li is with AI Lab, Deepwise Healthcare, Beijing 100080, China. Shibiao Xu is the corresponding author ([email protected]).
Abstract

Accurate segmentation of colorectal polyps in colonoscopy images is crucial for effective diagnosis and management of colorectal cancer (CRC). However, current deep learning-based methods primarily rely on fusing RGB information across multiple scales, leading to limitations in accurately identifying polyps due to restricted RGB domain information and challenges in feature misalignment during multi-scale aggregation. To address these limitations, we propose the Polyp Segmentation Network with Shunted Transformer (PSTNet), a novel approach that integrates both RGB and frequency domain cues present in the images. PSTNet comprises three key modules: the Frequency Characterization Attention Module (FCAM) for extracting frequency cues and capturing polyp characteristics, the Feature Supplementary Alignment Module (FSAM) for aligning semantic information and reducing misalignment noise, and the Cross Perception localization Module (CPM) for synergizing frequency cues with high-level semantics to achieve efficient polyp segmentation. Extensive experiments on challenging datasets demonstrate PSTNet’s significant improvement in polyp segmentation accuracy across various metrics, consistently outperforming state-of-the-art methods. The integration of frequency domain cues and the novel architectural design of PSTNet contribute to advancing computer-assisted polyp segmentation, facilitating more accurate diagnosis and management of CRC.

Index Terms

Polyp segmentation, shunted transformer, multi-scale fusion

1 Introduction


Colorectal cancer (CRC) often begins with the development of epithelial polyps within the colon or rectum, which are considered precursors to this malignant condition. While these polyps are initially non-cancerous, there is a risk that some may transform into pre-cancerous lesions, ultimately progressing to colorectal cancer[1]. The timely detection and removal of these polyps through colonoscopy is thus of paramount importance in the prevention and management of CRC. Recognized as the gold standard, colonoscopy enables the identification and excision of polyps before they can advance to a more dangerous stage.

Figure 1: Our proposed PSTNet model has been comprehensively evaluated and compared with MSNet[2] on a diverse set of challenging polyp images. These images include scenarios where the polyps are diminutive and easily overlooked (a), as well as situations where the segmentation boundaries are prone to errors due to blurred demarcations (b) and (c). The results of our experimental analyses demonstrate that PSTNet offers stronger polyp localization and higher segmentation accuracy.

However, the accurate identification and segmentation of polyps during colonoscopy remain challenging due to the diverse sizes, shapes, and textures of polyps, coupled with their low contrast against surrounding tissues, leading to a camouflaging effect[3, 4]. This can result in missed or misdiagnosed polyps, severely compromising patient health outcomes.

In the early stages, polyp segmentation methods heavily relied on handcrafted features[5]. However, these methods suffered from limited accuracy and generalization ability due to the restricted expressive power of handcrafted features and the inherent similarity between polyps and surrounding tissues. To enhance the accuracy and efficiency of polyp segmentation, researchers have developed a variety of deep learning (DL) architectures, employing different techniques to address this complex task. Encoder-decoder-based models, such as U-Net[6], UNet++[7], and ResUNet++[8], along with attention-based approaches, including PraNet[9], Polyp-PVT[10], and SegT[11], have significantly advanced the field of polyp segmentation by effectively capturing relevant features. When compared with conventional methods, deep learning techniques have demonstrated significant advancements in segmentation. However, two challenging issues still persist: (1) The inherent low contrast between polyps and their surrounding tissue renders them adept at camouflaging themselves, as discussed in the study by Fan et al.[4]. This characteristic poses significant challenges in accurately localizing polyps, as illustrated in Figure 1(a). Moreover, polyp images often exhibit unclear boundaries, both between adjacent polyps and in the transition from polyps to normal tissue[9]. These unclear boundaries give rise to segmentation inaccuracies, as evident in Figure 1(b) and Figure 1(c). Addressing these challenges is crucial for improving the effectiveness and reliability of polyp segmentation methods. (2) Existing methodologies for polyp segmentation predominantly rely on feature data sourced exclusively from the RGB domain. However, relying solely on RGB features may not provide sufficient discriminative information to accurately distinguish between polyps and surrounding tissue, especially in cases of low contrast and unclear boundaries. Furthermore, discrepancies in pixel positions of features across different scales during the fusion process can significantly influence the accuracy of segmentation results. To address these limitations, it is crucial to explore alternative feature sources that can provide complementary information and address issues related to feature alignment and scale consistency, thereby enhancing the efficacy of polyp segmentation.

To address the previously mentioned issues, we have proposed a novel polyp segmentation network, which addresses the challenges of low contrast and indistinct boundaries in polyp segmentation by incorporating frequency domain cues. Our main contributions are as follows:

  • We propose Polyp Segmentation Network with Shunted Transformer (PSTNet), a novel deep learning architecture tailored for accurate polyp segmentation in colonoscopy images. The architecture incorporates frequency domain information, enhancing the model’s capability to distinguish polyps from surrounding tissues.

  • We introduce the Feature Supplementary Alignment Module (FSAM), which employs feature alignment techniques to mitigate noise and achieve precise delineation of polyp boundaries. FSAM leverages multi-scale subtraction to extract distinctions between features at individual scales, thereby improving feature quality and addressing the issue of unclear boundaries.

  • We establish the Frequency Characteristic Attention Module (FCAM), which infuses frequency cues from low-level features into the feature representation. This fusion enriches the available feature information, enabling it to be combined with global features for accurate localization of polyp regions.

  • We design the Cross Perception localization Module (CPM) to interconnect the features generated by PSTNet’s constituent components, leading to precise segmentation outcomes. By integrating the enhanced features from FCAM and FSAM, CPM ensures the effective utilization of comprehensive feature information, allowing PSTNet to attain state-of-the-art performance.

  • Through extensive experimentation on five challenging datasets, we demonstrate that PSTNet significantly outperforms most models, establishing a new benchmark for polyp segmentation. The superior performance validates the effectiveness of our proposed modules and network architecture in addressing key challenges, highlighting PSTNet’s potential to improve the accuracy and reliability of polyp segmentation in clinical practice.

2 Related work

2.1 Hand-Crafted Methods

Traditional automated polyp detection systems primarily rely on manual feature extraction techniques. These techniques encompass various methods, such as shape-based[12], texture-based[13], valley-depth-based[3], and combined approaches[14]. However, due to the strong intra-class variability and weak inter-class differences between polyp regions and highly similar regions, the representational ability of the extracted features by these handcrafted methods is rather limited. Consequently, these methods carry substantial risks of missed detections and false positives, necessitating the exploration of alternative, more effective approaches to achieve accurate polyp detection.

2.2 Deep Learning-Based Methods

2.2.1 CNN-Based Polyp Segmentation

Deep learning has revolutionized medical image analysis[15, 16, 17, 18, 19, 20, 21, 22], particularly in the domain of polyp detection. The MICCAI 2015 Colonoscopy Video Challenge showcased the superiority of CNN-based methods over traditional hand-crafted approaches, with the top-performing CNN achieving higher precision and recall values[23].

Encoder-decoder architectures, such as U-Net[6], UNet++[7], and ResUNet++[8], have gained popularity in medical image analysis due to their exceptional performance. DDA-Net[24], a two-decoder attention network built upon ResUNet++, demonstrated its effectiveness in polyp segmentation on the Kvasir-SEG dataset, achieving high dice coefficient and mIoU values. Various advancements have been made in CNN-based architectures for polyp image segmentation, including the use of parallel LSTM with DeepLab-v3[25], multi-task segmentation models[26], reverse attention modules[9], and dual-tree wavelet pool CNNs[27]. ADSNet[28] adapted to diverse semantic nuances in ambiguous polyp segmentation areas using a complementary trilateral decoder and continuous attention module. Further developments include real-time polyp segmentation with ColonSNet[29], the application of generative adversarial networks[30], and the use of contextual pixel relations and attention mechanisms in DCRNet[31]. MSNet[2] and EUNet[32] addressed polyp size variability and image noise, while PolySeg Plus[33] utilized active learning to ameliorate data limitations and false positives. DUCK-Net[34] employed residual downsampling and custom convolutional blocks to extract multi-resolution features.

Despite these advancements, CNN-based methods still face challenges in capturing long-range dependencies and efficiently processing high-resolution images, which are crucial for accurate polyp segmentation. While promising results have been achieved, there remains room for improvement in addressing the variability in polyp size, shape, and appearance, as well as dealing with image noise and artifacts.

2.2.2 Transformer-Based Polyp Segmentation

The transformer architecture, initially designed for machine translation, has been adapted for vision tasks, achieving remarkable performance[35, 36, 37, 38, 39, 40]. The vision transformer (ViT)[41] partitions an image into patches, encodes them, and feeds them sequentially to the transformer encoder, performing image classification using a multi-layer perceptron. Compared to traditional CNNs, ViT models offer advantages such as handling larger input sizes, capturing global dependencies between pixels, and faster convergence with larger batch sizes.

ViT has also been applied to segmentation tasks[42, 43, 44, 45, 46]. To address intensive prediction tasks, pyramidal structures have been incorporated into transformers, as seen in models like PVT and shunted transformer, which utilize hierarchical transformers with multiple stages. Recently, the transformer architecture has been applied to polyp segmentation, with models like Polyp-PVT[10] combining multiscale features from PVTv2 and a CNN-based decoder for accurate polyp segmentation. Other works like Duplex[31], TGANet[47], PraNet[9], and SegT[11] also utilize attention for polyp segmentation tasks, exploring techniques like inverse attention, adversarial training, and edge guidance.

While transformer-based methods have shown promising results in polyp segmentation, they are still relatively new in this field and face challenges such as high computational costs, the need for large amounts of training data, and interpretability concerns. To address the key challenges of polyp segmentation, such as scale inconsistency, low contrast, and unclear boundaries, we propose the Polyp Segmentation Shunted Transformer Network (PSTNet). PSTNet introduces innovative modules to effectively align multi-scale features, integrate frequency information from polyp images, and improve polyp localization and segmentation accuracy.

Figure 2: The framework of our PSTNet, which includes (a) the shunted transformer (ST)[48] as the encoder network, (b) the Feature Supplementary Alignment Module (FSAM) for fusing global semantic features, which contains three Feature Alignment (FA) units, (c) the Frequency Characteristic Attention Module (FCAM) for extracting low-level semantic features with frequency domain cues, and (d) the Cross Perception localization Module (CPM) for linking frequency domain cues with global semantic features for the final output.

3 Proposed Method

The PSTNet is a polyp segmentation architecture that utilizes a multi-scale feature fusion framework with a shunted transformer. The goal of this design is to enhance feature representation by aligning features and extracting more detailed information through the combination of frequency-domain cues. The process of the proposed method is shown in Algorithm 1. In the following sections, we discuss the methods used in the network in more detail.

3.1 Overall Architecture

The proposed method (shown in Fig. 2) comprises four primary modules: the shunted transformer encoder (ST Encoder), the feature supplementary alignment module (FSAM), the frequency characteristic attention module (FCAM), and the cross perception localization module (CPM). The shunted transformer extracts feature maps at four different scales ($\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$) from the input image, spanning from low to high levels. FCAM extracts shallow-level features from the low-level feature map $\mathbf{X}_1$, enabling the capture of polyps of various sizes and shapes through frequency domain analysis; the resulting intermediate output is denoted as $\mathbf{P}_1$. Furthermore, the feature maps at the four scales undergo convolutional operations to match the desired number of channels. Subsequently, FSAM progressively aligns and fuses features from all levels, yielding the intermediate result $\mathbf{P}_2$. CPM integrates features from both FCAM and FSAM, enabling the efficient fusion of low-level semantics (containing frequency domain cues) with global semantics; this fusion yields the prediction $\mathbf{P}_3$ as the final output. Additionally, the sum of $\mathbf{P}_1$ and $\mathbf{P}_2$ is computed, weighted, and added to $\mathbf{P}_3$ to generate the ultimate prediction. In the training phase, we optimize the model using a primary loss function $\mathcal{L}_3$ as well as auxiliary loss functions $\mathcal{L}_1$ and $\mathcal{L}_2$. The primary loss is computed by comparing the final segmentation result $\mathbf{P}_3$ with the ground truth (GT), serving as the optimization target for accurate polyp segmentation, while the auxiliary losses supervise the intermediate outputs $\mathbf{P}_1$ and $\mathbf{P}_2$ generated by FCAM and FSAM, respectively.

Algorithm 1 Proposed PSTNet for Polyp Segmentation
Input: image $I$
1:  Extract feature maps $\{\mathbf{X}_1,\mathbf{X}_2,\mathbf{X}_3,\mathbf{X}_4\}$ from $I$ using the ST Encoder
2:  Obtain $\mathbf{P}_1$ from $\mathbf{X}_1$ using FCAM
3:  Match channel dimensions of $\{\mathbf{X}_1,\mathbf{X}_2,\mathbf{X}_3,\mathbf{X}_4\}$ using convolutions
4:  Obtain $\mathbf{P}_2$ by progressively aligning and fusing $\{\mathbf{X}_1,\mathbf{X}_2,\mathbf{X}_3,\mathbf{X}_4\}$ using FSAM
5:  Integrate $\mathbf{P}_1$ and $\mathbf{P}_2$ using CPM to obtain $\mathbf{P}_3$
6:  Compute the weighted sum of $\mathbf{P}_1$ and $\mathbf{P}_2$ and add it to $\mathbf{P}_3$ to obtain the final prediction
7:  Optimize the model using the primary loss $\mathcal{L}_3$ on $\mathbf{P}_3$ and the auxiliary losses $\mathcal{L}_1$, $\mathcal{L}_2$ on $\mathbf{P}_1$, $\mathbf{P}_2$
Output: polyp segmentation prediction
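
To make the data flow of Algorithm 1 concrete, a minimal PyTorch-style sketch is given below. The module objects (encoder, fcam, fsam, cpm) and the auxiliary weight alpha are placeholders standing in for the components described above; this is an illustrative skeleton under those assumptions, not the released implementation.

```python
import torch.nn as nn

class PSTNetSketch(nn.Module):
    """Illustrative skeleton of the PSTNet pipeline in Algorithm 1."""

    def __init__(self, encoder, fcam, fsam, cpm, alpha=0.5):
        super().__init__()
        self.encoder = encoder  # shunted transformer backbone returning 4 scales
        self.fcam = fcam        # frequency characteristic attention module
        self.fsam = fsam        # feature supplementary alignment module
        self.cpm = cpm          # cross perception localization module
        self.alpha = alpha      # weight of the auxiliary sum (assumed value)

    def forward(self, image):
        x1, x2, x3, x4 = self.encoder(image)   # Step 1: multi-scale features
        p1 = self.fcam(x1)                     # Step 2: frequency-aware low-level map
        p2 = self.fsam(x1, x2, x3, x4)         # Steps 3-4: aligned multi-scale fusion
        p3 = self.cpm(p1, p2)                  # Step 5: cross-perception prediction
        pred = p3 + self.alpha * (p1 + p2)     # Step 6: weighted auxiliary sum
        return pred, (p1, p2, p3)              # Step 7: intermediates kept for the losses
```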

3.2 Shunted Transformer Encoder for PSTNet

In the context of polyp images, noise can be a significant issue due to various uncontrolled factors in the image acquisition process. To address this, we adopt a shunted transformer as the backbone network, leveraging its higher performance and better robustness to input interference compared to alternatives such as ViT, which allows more robust features to be extracted from polyp images. Polyps exhibit diverse characteristics such as different sizes, shapes, and appearances, so an effective multi-scale representation is essential for feature extraction. The shunted transformer, as described by Ren et al.[48], employs shunted self-attention (SSA), which incorporates attention at mixed scales within each attention layer: different attention heads in the same layer model objects of different scales simultaneously. This provides computational efficiency while preserving the ability to capture fine-grained details, which is crucial for detecting small polyps that might otherwise be overlooked. To accommodate the polyp segmentation task, we generate four multi-scale feature maps ($\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$) at different stages of the shunted transformer and design the subsequent modules based on them.
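
The idea of mixing attention scales within one layer can be illustrated with the toy snippet below: fine heads attend over full-resolution keys and values, while coarse heads attend over spatially pooled ones. This is only a simplified sketch of the shunted self-attention concept, not Ren et al.'s implementation; the half-and-half head split and the pooling rate are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuntedAttentionSketch(nn.Module):
    """Toy mixed-scale attention: fine heads see full-resolution K/V, coarse heads see pooled K/V."""

    def __init__(self, dim, heads=4, pool=2):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dh, self.pool = heads, dim // heads, pool
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        b, n, c = x.shape                                   # token sequence, n = h * w
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # pooled keys/values used by the coarse heads
        kp = F.avg_pool2d(k.reshape(b, h, w, c).permute(0, 3, 1, 2), self.pool)
        vp = F.avg_pool2d(v.reshape(b, h, w, c).permute(0, 3, 1, 2), self.pool)
        kp = kp.flatten(2).transpose(1, 2)                  # (b, n', c)
        vp = vp.flatten(2).transpose(1, 2)
        outputs = []
        for i in range(self.heads):
            ki, vi = (k, v) if i < self.heads // 2 else (kp, vp)
            qi = q[..., i * self.dh:(i + 1) * self.dh]      # per-head slices
            ki = ki[..., i * self.dh:(i + 1) * self.dh]
            vi = vi[..., i * self.dh:(i + 1) * self.dh]
            att = torch.softmax(qi @ ki.transpose(1, 2) / self.dh ** 0.5, dim=-1)
            outputs.append(att @ vi)
        return self.proj(torch.cat(outputs, dim=-1))
```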

3.3 Frequency Characteristic Attention Module

While low-level RGB features provide detailed visual information about polyps, such as texture, color, and boundaries, polyps often remain concealed within normal tissues, posing a challenge for visual detection solely based on human perceptual capabilities[49]. To surpass the limitations of human biological vision, it is essential to incorporate additional cues beyond the RGB domain. The FcaNet approach[50] operated in the frequency domain by extending the concept of global average pooling (GAP) to a 2D discrete cosine transform representation, enabling the effective utilization of additional frequency components for more comprehensive data analysis. Therefore, we introduce frequency domain information as an additional cue to better distinguish polyps from the background and reduce the probability of false and missed detections.

Figure 3: The details of the frequency characteristic attention module (FCAM). First, the input $X_{in}$ is split, repeated, and fused horizontally and vertically so that each spatial location obtains a feature response from a global context with the same horizontal and vertical coordinates. Second, a 2D discrete cosine transform ($DCT_{2D}$) is applied to obtain spectral information, and finally the resulting full attention affinity is used to re-weight each channel map.

We design the frequency characteristic attention module, as shown in Figure 3, to capture the details of polyps from the combination of the RGB domain and the frequency domain of the low-level feature $\mathbf{X}_1$, which allows the position of polyps to be located accurately. Specifically, the FCAM comprises the full attention enhancement operation ($Att_f$) and the 2D discrete cosine transform ($DCT_{2D}$); the $Att_f$ operation can be defined as:

Att_{f}(\mathbf{X})=\alpha\cdot(A\left(\mathbf{Q},\mathbf{K}\right)\cdot\mathbf{V})+\mathbf{X}   (1)

where $\alpha$ is the scale parameter, $\mathbf{X}\in\mathbb{R}^{C\times H\times W}$ is the input low-level feature, and $A(\cdot)$ is the affinity operation, which is defined as follows:

A_{i,j}=\frac{\exp\left(\mathbf{Q}_{i}\cdot\mathbf{K}_{j}\right)}{\sum_{i=1}^{C}\exp\left(\mathbf{Q}_{i}\cdot\mathbf{K}_{j}\right)}   (2)

where $A_{i,j}\in A$ denotes the degree of correlation between the $i$-th and $j$-th channels at a specific spatial position.
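
A compact sketch of the $Att_f$ operation in Eqs. (1)-(2) is shown below. Treating Q, K, and V as 1×1-convolution projections of the input is an assumption made here for illustration; in the module described next, the query is derived from the frequency pathway.

```python
import torch
import torch.nn as nn

class ChannelAffinityAttention(nn.Module):
    """Sketch of Att_f: channel-wise affinity attention with a learnable residual scale."""

    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)   # assumed projections
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.alpha = nn.Parameter(torch.zeros(1))   # scale parameter alpha

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                    # (B, C, HW)
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        affinity = torch.softmax(q @ k.transpose(1, 2), dim=1)  # (B, C, C), Eq. (2)
        out = (affinity @ v).view(b, c, h, w)                   # A(Q, K) · V
        return self.alpha * out + x                              # Eq. (1)
```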

In the 2D Discrete Cosine Transform (DCT), basis functions are defined as follows:

B_{h,w}^{i,j}=\cos\left(\frac{\pi h}{H}\left(i+\frac{1}{2}\right)\right)\cos\left(\frac{\pi w}{W}\left(j+\frac{1}{2}\right)\right)   (3)

The 2D DCT computation is expressed as:

DCT_{2D}=\sum_{i=0}^{H-1}\sum_{j=0}^{W-1}\mathbf{x}_{i,j}^{2d}B_{h,w}^{i,j},\quad \text{s.t. } h\in\{0,1,\cdots,H-1\},\; w\in\{0,1,\cdots,W-1\}   (4)

Here, $DCT_{2D}$ is the outcome of the 2D DCT computation, and $\mathbf{x}^{2d}\in\mathbb{R}^{H\times W}$ is the input signal.

We further process $\mathbf{X}$ to obtain $\mathbf{Q}_P$:

\mathbf{Q}_{P}=[\operatorname{Re}_{h}\left(\operatorname{CV}_{w}\right),\operatorname{Re}_{w}\left(\operatorname{CV}_{h}\right)]   (5)

Here, $CV_w$ and $CV_h$ represent pooling along the width ($W$) and height ($H$) dimensions, respectively, followed by a linear layer. $Re_h$ and $Re_w$ indicate replication along the height and width directions, and $[\cdot]$ signifies the concatenation of tensors. Next, we split $\mathbf{Q}_P$ into multiple parts along the channel dimension, denoted as $[\mathbf{Q}_P^0,\mathbf{Q}_P^1,\cdots,\mathbf{Q}_P^{n-1}]$, where $\mathbf{Q}_P^i\in\mathbb{R}^{C'\times H\times W}$, $i\in\{0,1,\cdots,n-1\}$, $C'=C/n$, and $C$ is divisible by $n$. Each part corresponds to a 2D DCT frequency component, resulting in:

\mathbf{Freq}^{i}=DCT_{2D}^{u_{i},v_{i}}(\mathbf{Q}_{P}^{i}),\quad \text{s.t. } i\in\{0,1,\cdots,n-1\}   (6)

where $[u_i,v_i]$ are the 2D frequency indices corresponding to $\mathbf{Q}_P^i$, and $\mathbf{Freq}^i\in\mathbb{R}^{C'}$ is the compressed vector.

Finally, we concatenate these frequency components to obtain the multi-spectral vector:

\mathbf{Q}=\text{Cat}([\mathbf{Freq}^{0},\mathbf{Freq}^{1},\cdots,\mathbf{Freq}^{n-1}])   (7)

Here, $\mathbf{Q}$ is the resulting multi-spectral vector. In this way, the FCAM introduces multi-spectral channel information while modeling attention both spatially and across channels.
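
The frequency pathway of Eqs. (3)-(7) can be sketched as follows: the feature map is split into n channel groups, each group is compressed by one 2D-DCT basis, and the results are concatenated into the multi-spectral vector Q. The helper names and the example frequency indices in freq_idx are illustrative assumptions, not the authors' chosen components.

```python
import math
import torch

def dct2d_basis(u, v, h, w):
    """DCT basis B_{u,v}^{i,j} of Eq. (3), evaluated on an h x w spatial grid."""
    ys = torch.arange(h).float()
    xs = torch.arange(w).float()
    by = torch.cos(math.pi * u * (ys + 0.5) / h)
    bx = torch.cos(math.pi * v * (xs + 0.5) / w)
    return by[:, None] * bx[None, :]                  # (h, w)

def multispectral_vector(qp, freq_idx):
    """Sketch of Eqs. (4)-(7): per-group DCT compression followed by concatenation."""
    b, c, h, w = qp.shape
    parts = torch.chunk(qp, len(freq_idx), dim=1)     # [Q_P^0, ..., Q_P^{n-1}]
    freqs = []
    for part, (u, v) in zip(parts, freq_idx):
        basis = dct2d_basis(u, v, h, w).to(part)      # (h, w)
        freqs.append((part * basis).sum(dim=(2, 3)))  # Freq^i in R^{C/n}
    return torch.cat(freqs, dim=1)                    # multi-spectral vector Q

# e.g. q = multispectral_vector(torch.randn(2, 64, 44, 44), [(0, 0), (0, 1), (1, 0), (1, 1)])
```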

3.4 Feature Supplementary Alignment Module

Many contemporary polyp segmentation networks face a critical challenge arising from the mismatch problem induced by frequent downsampling operations and the unprocessed integration of contextual information during feature aggregation. The noise generated by pixel shifts during feature fusion across various scales exacerbates this issue. To address this concern, we introduce a novel solution: a feature alignment method specifically designed for cascaded feature fusion.

As shown in Figure 2 (b), the inputs are $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$, representing four scales of features. To ensure the robustness of the features, the neighboring features are cascaded and fused using the multi-scale subtraction unit (SU), while the FA unit is employed for feature alignment of the neighboring features. The resulting cascade fusions are then combined using element-wise addition to obtain $\mathbf{P}_2$. In our implementation, $f(\cdot)$ is specified as a convolutional unit, which comprises a $3\times 3$ convolutional layer with padding set to 1, along with batch normalization and the SiLU activation function (CBS). The detailed process of the two cascaded fusions is as follows.

Figure 4: The details of the Feature Alignment (FA) units. First, the high-level features $\mathbf{C}$ are upsampled and concatenated with the neighbouring low-level features $\mathbf{P}$. The two predicted offset feature maps are then obtained by deformable convolution in BSD, and the features are aligned separately for both scales, followed by a summation operation.

3.4.1 Supplementary Fusion

The SU captures complementary information from adjacent feature layers and emphasizes their differences, providing the decoder with differential feature information. This can be expressed as:

SU=\left|f(\mathbf{X}_{i})\ominus f(\mathbf{X}_{i-1})\right|   (8)

where $\ominus$ denotes element-wise subtraction and $\left|\cdot\right|$ denotes the absolute value. We upsample $\mathbf{X}_4$ to the same size as $\mathbf{X}_3$ and then pass both through $f(\cdot)$; the results are noted as $\mathbf{X}_4'$ and $\mathbf{X}_3'$. The SU unit then fuses $\mathbf{X}_4'$ and $\mathbf{X}_3'$ to obtain $\mathbf{S}_1=\left|f(\mathbf{X}_4')\ominus f(\mathbf{X}_3')\right|$. Next, $\mathbf{S}_1$ is upsampled to the same size as $\mathbf{X}_2$ and recorded as $\mathbf{S}_1'$, while $\mathbf{X}_2$ is passed through $f(\cdot)$ and recorded as $\mathbf{X}_2'$; the SU fuses $\mathbf{S}_1'$ and $\mathbf{X}_2'$ to obtain $\mathbf{S}_2=\left|f(\mathbf{S}_1')\ominus f(\mathbf{X}_2')\right|$. Similarly, the final supplementary fusion result is obtained as $\mathbf{S}_3=\left|f(\mathbf{S}_2')\ominus f(\mathbf{X}_1')\right|$.
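
A rough sketch of this cascaded supplementary fusion is given below, assuming the four inputs already share a common channel count and using one convolutional unit per $f(\cdot)$ application; the FA alignment step of Section 3.4.2 is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbs(channels):
    """f(.): 3x3 convolution (padding 1) + BatchNorm + SiLU."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.BatchNorm2d(channels),
        nn.SiLU(inplace=True),
    )

class SupplementaryFusionSketch(nn.Module):
    """Cascaded subtraction units producing S1, S2, S3 (Eq. (8), Sec. 3.4.1) -- a simplified sketch."""

    def __init__(self, channels):
        super().__init__()
        self.f = nn.ModuleList([cbs(channels) for _ in range(6)])

    def su(self, a, b, fa, fb):
        return torch.abs(fa(a) - fb(b))               # |f(a) ⊖ f(b)|

    def forward(self, x1, x2, x3, x4):
        x4u = F.interpolate(x4, size=x3.shape[2:], mode="bilinear", align_corners=False)
        s1 = self.su(x4u, x3, self.f[0], self.f[1])
        s1u = F.interpolate(s1, size=x2.shape[2:], mode="bilinear", align_corners=False)
        s2 = self.su(s1u, x2, self.f[2], self.f[3])
        s2u = F.interpolate(s2, size=x1.shape[2:], mode="bilinear", align_corners=False)
        s3 = self.su(s2u, x1, self.f[4], self.f[5])
        return s3
```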

3.4.2 Feature Align

In the multi-scale feature fusion process, a natural spatial offset exists between the pixel positions of the upper feature $\mathbf{F}_i$ and its lower feature $\mathbf{F}_{i-1}$. This offset cannot be eliminated through either concatenation or element-wise operations[51]. In Figure 4, $\mathbf{C}_i$ is initially upsampled to obtain $\tilde{\mathbf{C}}_i$, which is then concatenated with $\mathbf{P}$. These concatenated features are subsequently processed through DCNv2[52] with a kernel size of $3\times 3$, resulting in the generation of two offset maps, namely $\Delta_C$ and $\Delta_P$.

These offset maps play a pivotal role in the calibration of the low-resolution features $\mathbf{C}_i$ and the high-resolution features $\mathbf{P}$[53], respectively. Once the offset maps are obtained, our feature alignment aggregation can be defined as follows:

A_{i}=u\left(\mathrm{Upsample}\left(\mathbf{C}_{i}\right),\boldsymbol{\Delta}_{C}\right)+u\left(\mathbf{P},\boldsymbol{\Delta}_{P}\right)   (9)

where $\mathrm{Upsample}$ denotes a bilinear interpolation function, while $u(\cdot,\cdot)$ represents the alignment function. It can be assumed that the spatial coordinates of each position on the feature map $F$ to be aligned are given as $(1,1),(1,2),\ldots,(H,W)$, with the offset map represented by $\Delta\in\mathbb{R}^{2\times H\times W}$. Following SFSegNet[54], $U_{hw}$ denotes the output of the alignment function $u(F,\boldsymbol{\Delta})$, which is defined as follows:

U_{hw}=\sum_{h'=1}^{H}\sum_{w'=1}^{W}F_{h'w'}\cdot\max\left(0,1-\left|h+\Delta_{1hw}-h'\right|\right)\cdot\max\left(0,1-\left|w+\Delta_{2hw}-w'\right|\right)   (10)

which samples the feature at position $(h+\Delta_{1hw},w+\Delta_{2hw})$ of $F$ using the bilinear interpolation kernel, where $\Delta_{1hw}$ and $\Delta_{2hw}$ indicate the learned 2D transformation offsets for position $(h,w)$.
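
Eq. (10) is precisely a bilinear sampling of $F$ at offset positions, which can be sketched with grid_sample as below. The offsets are assumed to be given in pixels (in the paper they come from the DCNv2-based prediction branch); the normalization to [-1, 1] is simply what grid_sample expects.

```python
import torch
import torch.nn.functional as F

def align(feat, offset):
    """Sketch of u(F, Δ): bilinearly sample `feat` at (h + Δ1, w + Δ2).

    feat:   (B, C, H, W) feature map to align
    offset: (B, 2, H, W) learned offsets in pixel units
    """
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    grid_y = (ys + offset[:, 0]).float()              # h + Δ1
    grid_x = (xs + offset[:, 1]).float()              # w + Δ2
    # normalize to [-1, 1] as required by grid_sample
    grid = torch.stack((2 * grid_x / max(w - 1, 1) - 1,
                        2 * grid_y / max(h - 1, 1) - 1), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```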

3.5 Cross Perception localization Module

To combine the spectral cues acquired from FCAM with the semantic information derived from FSAM, we introduce a cross-perception localization module, as illustrated in Figure 2. This module takes two input feature maps: $\mathbf{R}_2$, containing global semantic information, and $\mathbf{R}_1$, with frequency domain information and rich details. These input feature maps are processed by the Cross-Perception (CP) unit, depicted in Figure 5. The CP unit performs an element-wise subtraction on the features from the FCAM ($\mathbf{R}_1^{\varsigma}$) and FSAM ($\mathbf{R}_2^{\varsigma}$) branches, resulting in a feature map denoted as $\mathbf{R}^{\varsigma}$. Subsequently, the Feature Alignment (FA) unit (described in Section 3.4) aligns these features. The aligned feature map is then passed to the Frequency Characteristic Attention (FCA) unit, with the 2D DCT operation excluded. Finally, the feature map $\mathbf{R}^{\varsigma}$ is added to the output of the FCA unit to produce the resulting feature map $\mathbf{Z}$. The entire process can be summarized as:

\mathbf{Z}=Att_{f/DCT}(f_{a}(\mathbf{R}_{1},\mathbf{R}_{2}))+\mathbf{R}^{\varsigma}   (11)

where $f_a$ is the feature alignment operation and $Att_{f/DCT}$ is the frequency-domain-aware attention operation with the 2D DCT transform removed. $\mathbf{Z}$ is passed through a sigmoid to obtain the final segmentation result $\mathbf{P}_3$.
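
A minimal sketch of the CP unit of Eq. (11) is shown below; fa and att_no_dct stand in for the FA unit and the DCT-free FCA block described earlier, and their signatures are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CrossPerceptionSketch(nn.Module):
    """Sketch of Eq. (11): align the two branches, attend, and add the subtraction residual."""

    def __init__(self, fa, att_no_dct):
        super().__init__()
        self.fa = fa            # FA unit (Sec. 3.4.2), assumed to take (R1, R2)
        self.att = att_no_dct   # FCA block with its 2D DCT stage removed

    def forward(self, r1, r2):
        residual = torch.abs(r1 - r2)          # R^ς, element-wise subtraction
        aligned = self.fa(r1, r2)              # f_a(R1, R2)
        z = self.att(aligned) + residual       # Eq. (11)
        return torch.sigmoid(z)                # P3, the final segmentation map
```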

Figure 5: The details of the CP unit. The low-level features $R_1$ containing frequency domain information are aligned with the global features $R_2$ via the FA unit, then enhanced via the FCA unit, and finally added to the result of the subtraction $|\mathbf{R}_1-\mathbf{R}_2|$.

3.6 Loss Function

In medical image segmentation, class imbalance is a common challenge: the number of background pixels vastly exceeds the number of pixels belonging to the target objects. This imbalance can hinder the model’s ability to learn effective representations, particularly for object boundaries. To address this issue, we adopt a combination loss function. Given the Dice loss’s good performance in handling class imbalance [55], we chose it as the base loss function. However, directly employing the Dice loss may lead to instability during training. To mitigate this, we introduce the Focal loss[56], which concentrates the training process on pixels that are difficult to classify correctly, further enhancing the segmentation quality of object boundaries. Additionally, we incorporate the weighted Binary Cross-Entropy (wBCE) loss to further improve segmentation accuracy, especially at object boundaries, where misclassification errors have a greater impact. This combination loss function is designed to optimize model performance and address the complexities associated with challenging pixels and small object segmentation. Given the ground truth $\mathbf{Y}$ and the predictions $\mathbf{P}_i$, $i\in\{1,2,3\}$, the total loss function $\mathcal{L}_{\text{total}}$ is given by:

\mathcal{L}_{\text{total}}=\gamma\cdot\mathcal{L}_{1}\left(\mathbf{Y},\mathbf{P}_{1}\right)+\lambda\cdot\mathcal{L}_{2}\left(\mathbf{Y},\mathbf{P}_{2}\right)+\mathcal{L}_{3}\left(\mathbf{Y},\mathbf{P}_{3}\right)   (12)

where the parameters were set experimentally to $\gamma=0.1$ and $\lambda=1$. Each loss term is calculated by $\mathcal{L}^{\psi}$, which is defined as:

\mathcal{L}^{\psi}=\mathcal{L}_{\text{bce}}^{w}+\mathcal{L}_{\text{dice}}+\mathcal{L}_{\text{focal}}   (13)

Here, $\mathcal{L}_{\text{bce}}^{w}$ represents a weighted binary cross-entropy (BCE) loss function. In contrast to the standard BCE loss, which treats all pixels equally, $\mathcal{L}_{\text{bce}}^{w}$ takes into account the importance of each pixel by assigning higher weights to challenging pixels. This weighting scheme enables the network to prioritize difficult regions, leading to an overall improvement in performance. Additionally, $\mathcal{L}_{\text{dice}}$ refers to the Dice loss, while $\mathcal{L}_{\text{focal}}$ is the focal loss[56]. The Dice loss effectively learns the class distribution, mitigating imbalanced voxel problems, whereas the focal loss compels the model to improve its learning for poorly classified pixels[57].
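
A sketch of the combined objective in Eqs. (12)-(13) is given below. The focal-loss gamma and the construction of the per-pixel weights for the wBCE term are assumptions made for illustration; the paper does not fix them here.

```python
import torch
import torch.nn.functional as F

def dice_loss(prob, target, eps=1.0):
    """Soft Dice loss on probability maps."""
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def focal_loss(logits, target, gamma=2.0):
    """Binary focal loss; gamma = 2 is an assumed setting."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    pt = torch.exp(-bce)
    return ((1 - pt) ** gamma * bce).mean()

def weighted_bce(logits, target, weight):
    """BCE with per-pixel weights emphasizing hard pixels (weight construction assumed)."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (weight * bce).sum() / weight.sum()

def loss_psi(logits, target, weight):
    """Eq. (13): wBCE + Dice + Focal for a single prediction."""
    prob = torch.sigmoid(logits)
    return weighted_bce(logits, target, weight) + dice_loss(prob, target) + focal_loss(logits, target)

def total_loss(p1, p2, p3, target, weight, gamma=0.1, lam=1.0):
    """Eq. (12): weighted sum of the auxiliary and primary losses."""
    return (gamma * loss_psi(p1, target, weight)
            + lam * loss_psi(p2, target, weight)
            + loss_psi(p3, target, weight))
```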

4 Experiment

4.1 Datasets and Compared SOTA Methods

4.1.1 Datasets

Our approach underwent evaluation using five challenging colonoscopic polyp image datasets: the Kvasir-SEG[58], CVC-ClinicDB[3], ETIS[59], CVC-ColonDB[60], and EndoScene[1] datasets. The EndoScene dataset is a combination of CVC-612 and CVC-300. In our experiments, we solely utilized the CVC-300 test set, as a portion of the CVC-612 dataset might have been employed for training purposes.

Table 1: Statistical comparison of the learning ability on CVC-ClinicDB and Kvasir-SEG. ↑ and ↓ denote respectively that the larger and smaller scores are better.

The best and second best scores are shown in red and blue respectively.

Kvasir-SEG
Model               mDic(%)↑  mIoU(%)↑  Fβω(%)↑  Sα(%)↑  mEϵ(%)↑  maxEϵ(%)↑  MAE(%)↓
U-Net++ (TMI’19)    82.1      74.3      80.8     86.2    88.6     90.9       4.8
PraNet (MICCAI’20)  89.8      84.0      88.5     91.5    94.4     94.8       3.0
DCRNet (arXiv’21)   88.6      82.5      86.8     91.1    93.3     94.1       3.5
SANet (MICCAI’21)   90.4      84.7      89.2     91.5    94.9     95.3       2.8
MSNet (MICCAI’21)   90.7      86.2      89.3     92.2    95.2     94.4       2.8
ADSNet (BMVC’24)    92.0      87.1      91.6     ——      ——       ——         2.3
SegT (arXiv’23)     92.7      88.0      ——       ——      ——       ——         2.3
PSTNet (Ours)       93.5      89.5      92.9     93.7    96.7     97.3       1.7

CVC-ClinicDB
Model               mDic(%)↑  mIoU(%)↑  Fβω(%)↑  Sα(%)↑  mEϵ(%)↑  maxEϵ(%)↑  MAE(%)↓
U-Net++ (TMI’19)    79.4      72.9      78.5     87.3    89.1     93.1       2.2
PraNet (MICCAI’20)  89.9      84.9      89.6     93.6    96.3     97.9       0.9
DCRNet (arXiv’21)   89.6      84.4      89.0     93.3    96.4     97.8       1.0
SANet (MICCAI’21)   91.6      85.9      90.9     93.9    97.1     97.6       1.2
MSNet (MICCAI’21)   92.1      87.9      91.4     94.1    97.6     97.2       0.8
ADSNet (BMVC’24)    93.8      89.0      94.0     ——      ——       ——         0.6
SegT (arXiv’23)     94.0      89.7      ——       ——      ——       ——         0.6
PSTNet (Ours)       94.5      90.1      94.5     95.3    98.7     99.0       0.7

4.1.2 Compared Methods

We compared the performance of recent state-of-the-art (SOTA) models for polyp image segmentation on seven different metrics, including UNet++[7], PraNet[9], DCRNet[31], SANet[61], MSNet[2], ADSNet[28], and SegT[11]. For a fair comparison, we used their open-source code and default settings to evaluate the models on the same training and test sets, and generated the result maps.

4.2 Experimental Setting and Evaluation Metrics

4.2.1 Experimental Setting

During the training phase, we incorporated a multi-scale training method[48] to handle the size variations among individual polyp images. The AdamW optimizer, a common choice for transformer networks, was employed to optimize the model parameters. Across all experiments, the model was trained for approximately 135 epochs, with a fixed learning rate of 1e-4 and a decay rate of 0.1 every 45 epochs. For more specific parameter settings, please refer to Table 2. To ensure the stability and effectiveness of the results, we conducted five independent training and testing runs.

Table 2: The setting of parameters during training.
Optimizer Decay rate Epochs
AdamW 0.1 135
Learning rate (lr) Weight decay Clip
1e-4 1e-4 0.5
Input size Decay epoch Batch size
352×352 45 20

Our model was implemented using the PyTorch framework, and all training and testing experiments were conducted on a server equipped with four NVIDIA GeForce RTX 3090 GPUs. During the evaluation phase, images were scaled to a fixed size of 352×352 without applying any post-processing optimization techniques.
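
The settings in Table 2 translate into roughly the following training loop. Here model, train_loader, and loss_fn are placeholders, and interpreting the "Clip" entry of Table 2 as a gradient-norm clip is an assumption.

```python
import torch

def train(model, train_loader, loss_fn, device="cuda"):
    """Minimal training-loop sketch mirroring Table 2 (AdamW, lr 1e-4, weight decay 1e-4,
    step decay 0.1 every 45 epochs, 135 epochs, batch size 20, 352x352 inputs)."""
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=45, gamma=0.1)
    for epoch in range(135):
        for images, masks in train_loader:          # images already resized to 352x352
            images, masks = images.to(device), masks.to(device)
            preds = model(images)
            loss = loss_fn(preds, masks)            # e.g. the combined loss of Sec. 3.6
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # assumed meaning of Clip = 0.5
            optimizer.step()
        scheduler.step()
```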

4.2.2 Evaluation Metrics

We employed six standard evaluation metrics commonly used in image segmentation tasks: mean Dice (mDic), IoU (Intersection over Union), S-measure ($S_{\alpha}$), weighted F-measure ($F_{\beta}^{\omega}$), E-measure ($E_{\epsilon}$), and mean absolute error (MAE) to comprehensively evaluate model performance. For a rigorous and fair comparison, all models were trained, validated, and tested using identical data splits. Additionally, we utilized backbones pre-trained on ImageNet and incorporated the authors’ provided source code when available. This consistent methodology in metric selection and model evaluation enhances the reliability and accuracy of our comparative analysis.
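
For reference, the simplest of these metrics can be computed per image as sketched below; the S-, E-, and weighted F-measures follow their original definitions and are omitted here.

```python
import torch

def dice_iou_mae(pred, gt, thr=0.5, eps=1e-8):
    """Per-image Dice, IoU, and MAE between a predicted probability map and a binary mask.

    pred: (H, W) probabilities in [0, 1]; gt: (H, W) binary ground truth.
    The 0.5 binarization threshold is an assumed convention.
    """
    binary = (pred >= thr).float()
    inter = (binary * gt).sum()
    dice = (2 * inter + eps) / (binary.sum() + gt.sum() + eps)
    iou = (inter + eps) / (binary.sum() + gt.sum() - inter + eps)
    mae = torch.abs(pred - gt).mean()
    return dice.item(), iou.item(), mae.item()
```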

4.3 Performance Comparison

4.3.1 Quantitative Results

Table 3: Statistical comparison of the generalization ability of CVC-ColonDB, ETIS, and CVC-300. ↑ and ↓ denote respectively that the larger and smaller scores are better.

The best and second best scores are shown in red and blue respectively.

CVC-ColonDB
Model               mDic(%)↑  mIoU(%)↑  Fβω(%)↑  Sα(%)↑  mEϵ(%)↑  maxEϵ(%)↑  MAE(%)↓
U-Net++ (TMI’19)    48.3      41.0      46.7     69.1    68.0     76.0       6.4
PraNet (MICCAI’20)  71.2      64.0      69.9     82.0    84.7     87.2       4.3
DCRNet (arXiv’21)   70.4      63.1      68.4     82.1    84.0     84.8       5.2
SANet (MICCAI’21)   75.3      67.0      72.6     83.7    86.9     87.8       4.3
MSNet (MICCAI’21)   75.5      67.8      73.7     83.6    87.0     88.3       4.1
ADSNet (BMVC’24)    81.5      73.0      86.0     ——      ——       ——         2.9
SegT (arXiv’23)     81.4      73.2      ——       ——      ——       ——         2.6
PSTNet (Ours)       82.7      74.8      81.3     87.7    92.5     92.8       2.5

CVC-300
Model               mDic(%)↑  mIoU(%)↑  Fβω(%)↑  Sα(%)↑  mEϵ(%)↑  maxEϵ(%)↑  MAE(%)↓
U-Net++ (TMI’19)    70.7      62.4      68.7     83.9    83.4     89.8       1.8
PraNet (MICCAI’20)  87.1      79.7      84.3     92.5    95.0     97.2       1.0
DCRNet (arXiv’21)   85.6      78.8      83.0     92.1    94.3     96.0       1.0
SANet (MICCAI’21)   88.8      81.5      85.9     92.8    96.2     97.2       0.8
MSNet (MICCAI’21)   86.9      80.7      84.9     92.5    95.8     94.3       1.0
ADSNet (BMVC’24)    ——        ——        ——       ——      ——       ——         ——
SegT (arXiv’23)     89.5      82.8      ——       ——      ——       ——         0.8
PSTNet (Ours)       91.0      84.7      89.5     94.1    97.6     98.3       0.5

ETIS
Model               mDic(%)↑  mIoU(%)↑  Fβω(%)↑  Sα(%)↑  mEϵ(%)↑  maxEϵ(%)↑  MAE(%)↓
U-Net++ (TMI’19)    40.1      34.4      39.0     68.3    62.9     77.6       3.5
PraNet (MICCAI’20)  62.8      56.7      60.0     79.4    80.8     84.1       6.7
DCRNet (arXiv’21)   55.6      49.6      50.6     73.6    74.2     77.3       9.6
SANet (MICCAI’21)   75.0      65.4      68.5     84.9    88.1     89.7       1.5
MSNet (MICCAI’21)   71.9      66.4      67.8     84.0    87.5     83.0       2.0
ADSNet (BMVC’24)    79.8      71.5      79.2     ——      ——       ——         1.2
SegT (arXiv’23)     81.0      73.2      ——       ——      ——       ——         1.3
PSTNet (Ours)       80.0      72.6      76.1     87.5    90.1     91.3       1.3

Learning Capability: We selected two datasets, CVC-ClinicDB and Kvasir-SEG, as benchmarks. CVC-ClinicDB contains 612 images extracted from 31 colonoscopy videos, while Kvasir-SEG consists of 1000 polyp images collected from the polyp category of the Kvasir dataset. Following the practice of PraNet, we used 548 images from CVC-ClinicDB and 900 images from Kvasir-SEG as the training set, with the remaining 64 and 100 images serving as the test set for each dataset, respectively. As shown in Table 1, our model achieved the best results on both datasets, surpassing the state-of-the-art methods. On Kvasir-SEG, our model outperformed ADSNet and SegT by 1.5% and 0.8% in terms of mDice, and 2.4% and 1.5% in terms of mIoU, respectively. Similarly, on CVC-ClinicDB, our model matched the performance of TGANet in mDice and outperformed ADSNet and SegT by 0.7% and 0.5%, respectively, while achieving the highest mIoU of 90.1%, demonstrating its strong learning capability. This can be attributed to our proposed multi-scale feature fusion framework, which enhances the model’s ability to capture comprehensive feature information by combining information from both the RGB and frequency domains. Furthermore, the FCAM incorporates frequency cues from low-level features into global features, enriching the available feature information and aiding in the precise localization of polyp regions. These innovations enable our model to better learn the feature representations of polyps, outperforming other methods.

Generalization Capabilities: To comprehensively evaluate the generalization capability of the model, we employed three datasets that the model had never encountered before: CVC-ColonDB (380 images), CVC-300 (60 images), and ETIS (196 images). This differs from the validation protocol used for CVC-ClinicDB and Kvasir-SEG, as the model had no exposure to these datasets during training. As shown in Table 3, our proposed method achieved the best or highly competitive results. On the CVC-ColonDB dataset, our model outperformed ADSNet and SegT by 1.2% and 1.3% in terms of mDice, and 1.8% and 1.6% in terms of mIoU, respectively. Similarly, on the CVC-300 dataset, our model surpassed SANet and SegT by 2.2% and 1.5% in mDice, and 3.2% and 1.9% in mIoU, respectively. On the most challenging ETIS dataset, where most images contain small polyps, our method outperformed ADSNet by 0.2% in mDice and 1.1% in mIoU and remained competitive with SegT, demonstrating strong generalization capability. This can be primarily attributed to our proposed FSAM, which employs feature alignment techniques to mitigate noise and utilizes multi-scale subtraction to extract the differences between features at various scales, enabling more precise delineation of polyp boundaries. This effectively addresses the issue of indistinct polyp boundaries, allowing our model to better adapt to unseen datasets. Moreover, the design of the Cross Perception localization Module (CPM) also plays a crucial role: it connects the features generated by all components of PSTNet and integrates the enhanced features from FCAM and FSAM, ensuring the effective utilization of comprehensive feature information and achieving superior generalization performance compared to other methods.

Figure 6: Visualization of the results of our model compared to other models. Green represents correct polyps, red are incorrect detections and yellow are missed polyps.
Figure 7: Visualization of the results of our model compared to other models.
Figure 8: The ablation study results are visualized as heatmaps. It is evident that the removal of any module leads to a substantial alteration in weighting, consequently resulting in the omission or incorrect detection of important elements.

4.3.2 Visual Comparisons

In Figs. 6 and 7, we present the visualization results obtained from our PSTNet and eight other models, showcasing their performance on challenging examples. Our approach outperforms other methods in several critical aspects: (a) Enhanced Detection of Small Polyps: Our method excels in capturing a more comprehensive range of small polyps. By effectively learning richer multi-scale information, it successfully identifies polyps of varying sizes, thus reducing the occurrence of missed detections. (b) Improved Noise Suppression: Our model demonstrates superior noise suppression capabilities, effectively excluding polyps camouflaged by normal tissue. Given the frequent resemblance of polyps to their background, our approach leverages frequency domain information to facilitate the accurate identification of polyps amidst normal tissue. (c) Enhanced Edge Prediction: The segmentation results generated by our model exhibit remarkable internal consistency and a closer alignment with ground truth data. This reflects our model’s ability to predict edges more effectively, contributing to the overall accuracy and reliability of the segmentation results.

4.4 Ablation Study

An extensive ablation study was conducted, and all findings unequivocally affirmed the efficacy of each individual model component. The training, testing, and hyperparameter configurations precisely mirrored those elucidated in Section 4.2.1. For illustration, Table 4 and Table 5 show the ablation results for network structure components and loss functions, respectively.

Table 4: The table presents quantitative results from an ablation study, focusing on mDic and mIoU evaluation metrics. Various columns in the table indicate the influence of different configurations, with the best results highlighted in bold.
Dataset Metric(%) Bas. w/o FCAM w/o FSAM w/o CPM Final
Kvasir mDic 91.2 92.5 91.9 92.3 93.5
mIoU 86.0 87.8 86.5 87.3 89.5
ClinicDB mDic 91.0 93.3 92.5 94.0 94.5
mIoU 85.1 86.7 86.1 86.9 90.1
CVC-300 mDic 87.3 90.1 88.7 90.5 91.0
mIoU 80.0 81.3 80.6 82.2 84.7
ColonDB mDic 79.8 80.8 82.1 81.3 82.7
mIoU 71.5 71.9 73.2 72.2 74.8
ETIS mDic 76.5 77.8 79.3 78.7 80.0
mIoU 68.1 70.1 72.2 70.7 72.6

4.4.1 Network Components

Our baseline (Bas.) is the shunted transformer[48], and we assess the effectiveness of the modules by removing or replacing components from the complete PSTNet and comparing the variants with the standard version. The standard version is denoted as “PSTNet (ST+FCAM+FSAM+CPM)”, where “FCAM”, “FSAM” and “CPM” indicate the usage of the FCAM, FSAM and CPM, respectively.

Effectiveness of FCAM. We investigate the contribution of the FCAM module. We trained a version of ”PSTNet (w/o FCAM)”. Table 4 shows that the method without the FCAM module performs worse on the five datasets compared to the standard PSTNet. Notably, the mDic on the ClinicDB dataset drops by 1.2%, and the mIoU drops by 3.4%. Meanwhile, the absence of FCAM is undeniably associated with a substantial introduction of noise (Fig. 8).

Effectiveness of FSAM. To analyze the effectiveness of FSAM, a version of ”PSTNet (w/o FSAM)” is trained. Table 4 shows that the mIoU metric drops on all datasets after removing the FSAM module, with the most significant drop observed on the CVC-300 dataset (from 84.7% to 80.6%). Meanwhile, the absence of FSAM is undeniably associated with a substantial introduction of noise (Fig. 8).

Effectiveness of CPM. To investigate the contribution of the CPM module to the model, we removed CPM from PSTNet and used an element-wise summing operation as a substitute. This trained version is called ”PSTNet (w/o CPM)”. Table 4 shows that the standard version of PSTNet has better metrics than PSTNet (w/o CPM) on each dataset, with the most significant difference observed on the ClinicDB dataset (mIoU: 90.1% vs. 86.9%). Fig. 8 shows the benefits of CPM more intuitively. It is observed that the absence of CPM results in more pronounced errors in detail, and in some cases, even leads to missed detections.

Table 5: The ablation study, focusing on the loss function, reports performance metrics as mDic(%) and mIoU(%) on the evaluation dataset. The best results are indicated in bold.
Dataset Metric(%) w/o (dice+focal) w/o (wBCE) Final
Kvasir mDic 92.1 92.2 93.5
mIoU 86.3 86.9 89.5
ClinicDB mDic 92.5 93.8 94.5
mIoU 87.1 89.3 90.1
CVC-300 mDic 89.8 90.6 91.0
mIoU 82.2 83.7 84.7
ColonDB mDic 81.0 81.6 82.7
mIoU 72.5 73.9 74.8
ETIS mDic 77.2 77.5 80.0
mIoU 70.2 72.2 72.6

4.4.2 Total Loss Function

To assess the impact of each component of the loss function on model performance, we performed an ablation study. This study involved evaluating our model using three distinct loss function setups: 1) excluding the combined dice and focal loss (”w/o (dice+focal)”), 2) omitting the weighted binary cross-entropy loss (”w/o (wBCE)”), and 3) employing our ultimate combined loss function (”Final”). The results, presented in Table 5, indicate that our final loss function configuration persistently attains the highest mean Dice coefficient (mDic) and mean Intersection over Union (mIoU) scores across all datasets. This observation underscores the efficacy of our combined loss strategy in enhancing the robustness and accuracy of the segmentation outcomes.

5 Conclusion

In this work, we propose PSTNet, a novel approach tackling the challenges of low contrast, indistinct boundaries, and scale inconsistency in polyp segmentation through three key modules: FCAM for extracting frequency domain cues, FSAM for aligning and enhancing multi-scale features, and CPM for effectively combining them. Extensive experiments demonstrate PSTNet’s superior performance over state-of-the-art models, highlighting the potential of incorporating frequency domain cues.

While focused on polyp segmentation, PSTNet’s core concepts and modules can potentially extend to other medical imaging tasks with similar challenges, such as retinal imaging, lung nodule detection, and brain tumor segmentation. However, this requires careful consideration of modality-specific characteristics, network adjustments, feature extraction adaptations, and large-scale annotated datasets.

Future work will involve collaborating with domain experts, collecting relevant datasets, and validating our approach’s effectiveness in new settings. We aim to explore advanced feature alignment techniques, incorporate domain-specific prior knowledge to enhance segmentation accuracy, and investigate integrating frequency domain cues with other state-of-the-art methods for broader medical image analysis applications. Overcoming challenges in data availability, computational efficiency, and model interpretability will be crucial for successful clinical translation.

6 Acknowledgements

This work was funded by the National Natural Science Foundation of China (Nos. 62271074 and 82371962), the Beijing Natural Science Foundation (L232133), the CAMS Innovation Fund for Medical Sciences (2022-I2M-C&T-B-019), and the National High Level Hospital Clinical Research Funding (2022-PUMCH-033).

References

  • [1] D. Vázquez, J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, A. M. López, A. Romero, M. Drozdzal, and A. Courville, “A benchmark for endoluminal scene segmentation of colonoscopy images,” Journal of healthcare engineering, vol. 2017, no. 1, p. 4037190, 2017.
  • [2] X. Zhao, L. Zhang, and H. Lu, “Automatic polyp segmentation via multi-scale subtraction network,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2021, pp. 120–130.
  • [3] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, “Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computerized medical imaging and graphics, vol. 43, pp. 99–111, 2015.
  • [4] D.-P. Fan, G.-P. Ji, M.-M. Cheng, and L. Shao, “Concealed object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [5] J. Bernal, J. Sánchez, and F. Vilarino, “Towards automatic polyp detection with a polyp appearance model,” Pattern Recognition, vol. 45, no. 9, pp. 3166–3182, 2012.
  • [6] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [7] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep learning in medical image analysis and multimodal learning for clinical decision support.   Springer, 2018, pp. 3–11.
  • [8] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. De Lange, P. Halvorsen, and H. D. Johansen, “Resunet++: An advanced architecture for medical image segmentation,” in 2019 IEEE International Symposium on Multimedia (ISM).   IEEE, 2019, pp. 225–2255.
  • [9] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Pranet: Parallel reverse attention network for polyp segmentation,” in International conference on medical image computing and computer-assisted intervention.   Springer, 2020, pp. 263–273.
  • [10] B. Dong, W. Wang, D.-P. Fan, J. Li, H. Fu, and L. Shao, “Polyp-pvt: Polyp segmentation with pyramid vision transformers,” CAAI Artificial Intelligence Research, vol. 2, p. 9150015, 2023.
  • [11] F. Chen, H. Ma, and W. Zhang, “Segt: Separated edge-guidance transformer network for polyp segmentation.” Mathematical biosciences and engineering : MBE, vol. 20 10, pp. 17 803–17 821, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259202588
  • [12] Y. Iwahori, H. Hagi, H. Usami, R. J. Woodham, A. Wang, M. K. Bhuyan, and K. Kasugai, “Automatic polyp detection from endoscope image using likelihood map based on edge information.” in ICPRAM, 2017, pp. 402–409.
  • [13] S. Ameling, S. Wirth, D. Paulus, G. Lacey, and F. Vilarino, “Texture-based polyp detection in colonoscopy,” in Bildverarbeitung für die Medizin 2009.   Springer, 2009, pp. 346–350.
  • [14] J. J. Fu, Y.-W. Yu, H.-M. Lin, J.-W. Chai, and C. C.-C. Chen, “Feature extraction and pattern classification of colorectal polyps in colonoscopic imaging,” Computerized medical imaging and graphics, vol. 38, no. 4, pp. 267–275, 2014.
  • [15] M. Zhao, W. Yan, R. Xu, D. Zhi, R. Jiang, T. Jiang, V. D. Calhoun, and J. Sui, “An attention-based hybrid deep learning framework integrating temporal coherence and dynamics for discriminating schizophrenia,” in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI).   IEEE, 2021, pp. 118–121.
  • [16] M.-J. Gui, X.-H. Zhou, X.-L. Xie, S.-Q. Liu, H. Li, T.-Y. Xiang, J.-L. Wang, and Z.-G. Hou, “Design and experiments of a novel halbach-cylinder-based magnetic skin: A preliminary study,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–11, 2022.
  • [17] M.-J. Gui, X.-H. Zhou, X.-L. Xie, S.-Q. Liu, Z.-Q. Feng, and Z.-G. Hou, “Soft magnetic skin’s deformation analysis for tactile perception,” IEEE Transactions on Industrial Electronics, 2023.
  • [18] S. Dong, G. Luo, G. Sun, K. Wang, and H. Zhang, “A left ventricular segmentation method on 3d echocardiography using deep learning and snake,” in 2016 Computing in Cardiology Conference (CinC).   IEEE, 2016, pp. 473–476.
  • [19] G. Luo, G. Sun, K. Wang, S. Dong, and H. Zhang, “A novel left ventricular volumes prediction method based on deep learning network in cardiac mri,” in 2016 Computing in cardiology conference (CinC).   IEEE, 2016, pp. 89–92.
  • [20] R. Xu, Y. Li, C. Wang, S. Xu, W. Meng, and X. Zhang, “Instance segmentation of biological images using graph convolutional network,” Engineering Applications of Artificial Intelligence, vol. 110, p. 104739, 2022.
  • [21] C. Wang, R. Xu, S. Xu, W. Meng, and X. Zhang, “Da-net: Dual branch transformer and adaptive strip upsampling for retinal vessels segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2022, pp. 528–538.
  • [22] C. Wang, R. Xu, Y. Zhang, S. Xu, and X. Zhang, “Retinal vessel segmentation via context guide attention net with joint hard sample mining strategy,” in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI).   IEEE, 2021, pp. 1319–1323.
  • [23] J. Bernal, N. Tajkbaksh, F. J. Sanchez, B. J. Matuszewski, H. Chen, L. Yu, Q. Angermann, O. Romain, B. Rustad, I. Balasingham et al., “Comparative validation of polyp detection methods in video colonoscopy: results from the miccai 2015 endoscopic vision challenge,” IEEE transactions on medical imaging, vol. 36, no. 6, pp. 1231–1249, 2017.
  • [24] N. K. Tomar, D. Jha, S. Ali, H. D. Johansen, D. Johansen, M. A. Riegler, and P. Halvorsen, “Ddanet: Dual decoder attention network for automatic polyp segmentation,” in International Conference on Pattern Recognition.   Springer, 2021, pp. 307–314.
  • [25] W.-T. Xiao, L.-J. Chang, and W.-M. Liu, “Semantic segmentation of colorectal polyps with deeplab and lstm networks,” in 2018 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW).   IEEE, 2018, pp. 1–2.
  • [26] B. Murugesan, K. Sarveswaran, S. M. Shankaranarayana, K. Ram, J. Joseph, and M. Sivaprakasam, “Psi-net: Shape and boundary aware joint multi-task deep network for medical image segmentation,” in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).   IEEE, 2019, pp. 7223–7226.
  • [27] D. Banik, K. Roy, D. Bhattacharjee, M. Nasipuri, and O. Krejcar, “Polyp-net: A multimodel fusion network for polyp segmentation,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–12, 2020.
  • [28] Q. V. Nguyen, V. T. Huynh, and S.-H. Kim, “Adaptation of distinct semantics for uncertain areas in polyp segmentation,” in British Machine Vision Conference, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:267000461
  • [29] D. Jha, S. Ali, N. K. Tomar, H. D. Johansen, D. Johansen, J. Rittscher, M. A. Riegler, and P. Halvorsen, “Real-time polyp detection, localization and segmentation in colonoscopy using deep learning,” IEEE Access, vol. 9, pp. 40496–40510, 2021.
  • [30] A. Ahmed and M. Ali, “Generative adversarial networks for automatic polyp segmentation,” arXiv preprint arXiv:2012.06771, 2020.
  • [31] Z. Yin, K. Liang, Z. Ma, and J. Guo, “Duplex contextual relation network for polyp segmentation,” in 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). IEEE, 2022, pp. 1–5.
  • [32] K. Patel, A. M. Bur, and G. Wang, “Enhanced U-Net: A feature enhancement network for polyp segmentation,” in 2021 18th Conference on Robots and Vision (CRV). IEEE, 2021, pp. 181–188.
  • [33] A. I. Saad, F. A. Maghraby, and O. Badawy, “PolySeg Plus: Polyp segmentation using deep learning with cost effective active learning,” International Journal of Computational Intelligence Systems, vol. 16, no. 1, p. 148, 2023.
  • [34] R.-G. Dumitru, D. Peteleaza, and C. Craciun, “Using DUCK-Net for polyp image segmentation,” Scientific Reports, vol. 13, no. 1, p. 9803, 2023.
  • [35] R. Xu, J. Zhang, J. Sun, C. Wang, Y. Wu, S. Xu, W. Meng, and X. Zhang, “MRFTrans: Multimodal representation fusion transformer for monocular 3D semantic scene completion,” Information Fusion, vol. 111, p. 102493, 2024.
  • [36] J. Zhang, K. Wang, R. Xu, G. Zhou, Y. Hong, X. Fang, Q. Wu, Z. Zhang, and W. He, “NaVid: Video-based VLM plans the next step for vision-and-language navigation,” arXiv preprint arXiv:2402.15852, 2024.
  • [37] R. Xu, C. Wang, S. Xu, W. Meng, and X. Zhang, “DC-Net: Dual context network for 2D medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer, 2021, pp. 503–513.
  • [38] D. Zhang, H. Li, W. Cong, R. Xu, J. Dong, and X. Chen, “Task relation distillation and prototypical pseudo label for incremental named entity recognition,” in Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023, pp. 3319–3329.
  • [39] C. Wang, R. Xu, S. Xu, W. Meng, and X. Zhang, “Treating pseudo-labels generation as image matting for weakly supervised semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 755–765.
  • [40] C. Wang, R. Xu, S. Xu, W. Meng, J. Xiao, and X. Zhang, “Accurate lung nodule segmentation with detailed representation transfer and soft mask supervision,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, October 2023.
  • [41] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy
  • [42] R. Xu, C. Wang, J. Zhang, S. Xu, W. Meng, and X. Zhang, “RSSFormer: Foreground saliency enhancement for remote sensing land-cover segmentation,” IEEE Transactions on Image Processing, vol. 32, pp. 1052–1064, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:256319085
  • [43] R. Xu, C. Wang, J. Sun, S. Xu, W. Meng, and X. Zhang, “Self correspondence distillation for end-to-end weakly-supervised semantic segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 3045–3053.
  • [44] R. Xu, C. Wang, S. Xu, W. Meng, and X. Zhang, “Dual-stream representation fusion learning for accurate medical image segmentation,” Engineering Applications of Artificial Intelligence, vol. 123, p. 106402, 2023.
  • [45] S. Xu, S. Zheng, W. Xu, R. Xu, C. Wang, J. Zhang, X. Teng, A. Li, and L. Guo, “HCF-Net: Hierarchical context fusion network for infrared small object detection,” CoRR, vol. abs/2403.10778, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2403.10778
  • [46] W. Xu, R. Xu, C. Wang, S. Xu, L. Guo, M. Zhang, and X. Zhang, “Spectral prompt tuning: Unveiling unseen classes for zero-shot semantic segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 6369–6377.
  • [47] N. K. Tomar, D. Jha, U. Bagci, and S. Ali, “TGANet: Text-guided attention for improved polyp segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 151–160.
  • [48] S. Ren, D. Zhou, S. He, J. Feng, and X. Wang, “Shunted self-attention via multi-scale token aggregation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10853–10862.
  • [49] Y. Zhong, B. Li, L. Tang, S. Kuang, S. Wu, and S. Ding, “Detecting camouflaged object in frequency domain,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4504–4513.
  • [50] Z. Qin, P. Zhang, F. Wu, and X. Li, “FcaNet: Frequency channel attention networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 783–792.
  • [51] Z. Huang, Y. Wei, X. Wang, W. Liu, T. S. Huang, and H. Shi, “AlignSeg: Feature-aligned segmentation networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 550–557, 2021.
  • [52] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable ConvNets v2: More deformable, better results,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9308–9316.
  • [53] S. Huang, Z. Lu, R. Cheng, and C. He, “FaPN: Feature-aligned pyramid network for dense image prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 864–873.
  • [54] J. Jiang, R. Wang, S. Lin, and F. Wang, “SFSegNet: Parse freehand sketches using deep fully convolutional networks,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
  • [55] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2017, pp. 240–248.
  • [56] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
  • [57] W. Zhu, Y. Huang, L. Zeng, X. Chen, Y. Liu, Z. Qian, N. Du, W. Fan, and X. Xie, “AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy,” Medical Physics, vol. 46, no. 2, pp. 576–589, 2019.
  • [58] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. d. Lange, D. Johansen, and H. D. Johansen, “Kvasir-SEG: A segmented polyp dataset,” in International Conference on Multimedia Modeling. Springer, 2020, pp. 451–462.
  • [59] J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, “Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer,” International Journal of Computer Assisted Radiology and Surgery, vol. 9, no. 2, pp. 283–293, 2014.
  • [60] N. Tajbakhsh, S. R. Gurudu, and J. Liang, “Automated polyp detection in colonoscopy videos using shape and context information,” IEEE Transactions on Medical Imaging, vol. 35, no. 2, pp. 630–644, 2015.
  • [61] J. Wei, Y. Hu, R. Zhang, Z. Li, S. K. Zhou, and S. Cui, “Shallow attention network for polyp segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2021, pp. 699–708.