Few-Shot Medical Image Segmentation with High-Fidelity Prototypes
Abstract
Few-shot Semantic Segmentation (FSS) aims to adapt a pretrained model to new classes with as few as a single labelled training sample per class. Although prototype-based approaches have achieved substantial success, existing models are limited to imaging scenarios with considerably distinct objects and backgrounds that are not highly complex, e.g., natural images. This makes such models suboptimal for medical imaging, where neither condition holds. To address this problem, we propose a novel Detail Self-refined Prototype Network (DSPNet) that constructs high-fidelity prototypes representing the object foreground and the background more comprehensively. Specifically, to construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing them in a channel-wise manner. Considering that the background often has no apparent semantic relation in the spatial dimensions, we integrate channel-specific structural information under sparse channel-aware regulation. Extensive experiments on three challenging medical image benchmarks show the superiority of DSPNet over previous state-of-the-art methods. The code and data are available at https://github.com/tntek/DSPNet.
Keywords: Few-shot semantic segmentation, Medical image, High-fidelity prototype, Detail self-refining
1 Introduction
Medical image segmentation plays a critical role in clinical practice and medical research, such as disease diagnosis [49], treatment planning [37] and follow-up [32]. In the medical field, well-annotated samples are scarce due to privacy protection and the clinical expertise required for labelling. Within this context, Few-shot Semantic Segmentation (FSS) methods [31] demonstrate their advantages in this domain: they exploit one or a few labelled support images to segment objects of the same class in query images.

The key to FSS is building a resemblance between the support and query images. Existing FSS methods follow three lines. The first constructs support-image-based guidance to boost query image segmentation, e.g., the two-branch architecture with interaction [35, 34]. The second identifies shared features by building a resemblance between the support and query images, e.g., attention modules [14] and graph networks [11]. The third comprises prototypical approaches [38, 30], which mine prototypes from support images to build a resemblance with the query images. Among them, the third is the currently prevalent scheme due to its generality and robustness to noise. However, because prototype extraction relies on pooling operations, e.g., Masked Average Pooling (MAP) or Average Pooling (AP), this scheme suffers from an inherent limitation: pooling is prone to losing local details, so the conventional prototypes lead to low-discriminative feature maps that confuse the foreground and background.
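To make the limitation concrete, a minimal PyTorch sketch is given below (illustrative only, not any specific method's implementation; the similarity scaling factor is an assumed value). It shows how a conventional masked-average-pooling prototype is extracted and matched against query features by cosine similarity: all foreground positions collapse into one vector, which is exactly where local detail is lost.

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(feat, mask):
    """feat: (B, C, H, W) support features; mask: (B, 1, H, W) binary foreground mask."""
    mask = F.interpolate(mask, size=feat.shape[-2:], mode="bilinear", align_corners=False)
    # Spatial pooling collapses all foreground pixels into a single C-dim prototype,
    # discarding local detail.
    return (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-5)   # (B, C)

def cosine_similarity_map(query_feat, proto, scale=20.0):
    """query_feat: (B, C, H, W); proto: (B, C). Returns a (B, H, W) similarity map."""
    q = F.normalize(query_feat, dim=1)
    p = F.normalize(proto, dim=1)[..., None, None]     # (B, C, 1, 1)
    return scale * (q * p).sum(dim=1)                  # scaled cosine similarity
```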
Existing methods address this limitation by incrementally mining new prototypes for diverse detail representations, i.e., the detail discovery scheme marked by the yellow zone in Fig. 1. For instance, the single class prototype for the foreground was enriched by several part-aware prototypes [22] or compensation prototypes [46]. For the background, Average Pooling was applied on a regular grid to generate diverse local prototypes [30]. This strategy works well in imaging scenarios with (i) considerably distinct objects and (ii) backgrounds that are not highly complex, e.g., natural images. However, medical images with highly heterogeneous textures¹ do not satisfy these conditions; namely, the incremental strategy cannot provide complete detail representations for medical images.
¹Heterogeneous textures here refer to the considerable number of complicated, distinct structures/tissues with compact boundaries between them.
To overcome this problem, in this paper we propose a new Detail Self-refined Prototype Network (DSPNet). As demonstrated in Fig. 1 (see the green zone), in contrast to constructing new prototypes, our scheme enhances the detail representation of off-the-shelf prototypes by detail self-refining, leading to high-fidelity prototypes.
In the proposed network, detail self-refining involves two novel attention-like modules, called Foreground Semantic Prototype Attention (FSPA) and Background Channel-structural Multi-head Attention (BCMA). In FSPA, to account for the clear semantics of the foreground, we mine class-level semantic prototypes as the detail prototypes using superpixel clustering. They are then fused into a single class prototype in a channel-wise one-dimensional convolution fashion, assembling the global semantics while maintaining the local semantics. In BCMA, since the complicated background in medical images often lacks clear semantics, we take the prototypes generated by Average Pooling on a regular grid as the background detail prototypes instead of mining detail information from the spatial dimension. Then, channel-specific structural information is explored by combining learnable global information with an adjustment highlighting sparsely related channels. Finally, the elements of each detail prototype are refreshed channel-wise and independently by the corresponding channel-specific structural information.
The contributions of this work are threefold. We propose:
(1) A novel prototypical FSS approach, DSPNet, that enhances the prototypes' self-representation of complicated details, entirely different from the previous incremental paradigm of constructing new detail prototypes.
(2) A self-refining method for the class prototype, FSPA, that integrates the cluster prototypes, i.e., the mined semantic details, into a single enhanced prototype in an attention-like fashion, indicating the potential of fusing cluster-based local details for complete foreground representation.
(3) A self-refining method for the background prototypes, BCMA, that incorporates channel-specific structural information via multi-head channel attention with sparse channel-aware regulation, providing a conceptually different view of background detail modelling.
2 Related Work
2.1 Medical Image Segmentation
Currently, deep neural network approaches dominate the medical image segmentation field. Early work shared models with natural image semantic segmentation. Fully Convolutional Networks (FCNs) [24] first equipped vanilla Convolutional Neural Networks (CNNs) with a segmentation head by introducing up-sampling and skip layers. Following that, encoder-decoder-based methods [3, 28] were developed. Unlike the coarse reconstruction in FCNs, the symmetrical reconstruction of the decoder can capture much richer detailed semantics. With the application of deep learning in the medical field, medical image-specific models emerged correspondingly, among which U-Net [28] is widely recognized for its superior performance. Besides the symmetrical encoder-decoder architecture, U-Net incorporates skip connections to propagate contextual information to higher-resolution hierarchies. Inspired by it, several variants of U-Net were designed, including 3D U-Net [5], Atten-U-Net [29], Edge-U-Net [2], V-Net [27] and Y-Net [26].
The segmentation models mentioned above only work in a fully supervised fashion, relying on abundant expert-annotated data. Thus, they cannot be applied to the few-shot setting, where we need to segment an object of an "unseen" class given only a few labelled images of that class.
2.2 Few-shot Semantic Segmentation (FSS)
The key to solving FSS is building class-wise similarity between the query and support images. From this view, existing methods can be divided into three categories. The first category constructs support-image-based guidance [35, 34, 44, 39]. For instance, [35] developed a two-branch approach where a conditioning branch controls the logistic regression layer of the segmentation branch, and [34] introduced squeeze-and-excitation blocks into the conditioning branch to encourage dense information interaction between the two branches. The second category designs novel network modules, e.g., attention modules [14, 41] and graph networks [11, 45], for discriminative representations, by which the features shared by query and support images are identified. The third mainstream is the prototypical network [42, 30, 7, 9], in which prototypes bridge the similarity computation in a meta-learning fashion; here, the prototypes are semantic features extracted from the support images. Recently, PANet [42] achieved impressive performance on natural image segmentation by performing bi-directional alignment between the query and support images. SSL-PANet [30] transferred the PANet architecture to medical image segmentation, where self-supervision with superpixels and local representations enables segmentation without manual annotation. Following that, anomaly detection-inspired methods enhanced the performance by introducing self-supervision with supervoxels [12] or by mimicking the learning mechanism of an expert clinician [36].
Our DSPNet belongs to the third category above, with two significant differences. Compared with methods developed for class-abundant natural images, e.g., PANet, DSPNet is a medical image-specific model working with limited labelled support data. On the other hand, DSPNet addresses the limitation of local information loss from a new perspective of detail self-refining, which is fundamentally different from the existing prototypical methods.
2.3 Attention Method in Few-Shot Semantic Segmentation
For few-shot semantic segmentation tasks, the attention mechanism [40] is a popular technique for building the relationship between the support and query images. Existing approaches can be divided into two categories: (i) graph-based [47, 41, 11] and (ii) non-graph-based [14, 48, 25]. Methods in the first category employ a graph model to activate more pixels, such that the correspondence between support and query images is enhanced. For example, [11] fused the graph attention and the last-layer feature map to generate an enhanced feature map, solving the problem of foreground pixel loss in the attention map. The core idea of the second category is to build the correspondence based on feature interaction between the support and query data, e.g., multi-scale contextual features [14], affinity constraints [48], and mixed attention maps [25].
Unlike the spatial attention-based methods above, both FSPA and BCMA in DSPNet are channel attention-like methods. In FSPA, the channel-wise fusion ensures a deeper semantic fusion from local to global, whilst in BCMA, the detail self-refining relies on channel-structural information.
3 Problem Statement
In few-shot segmentation, the dataset includes two parts, a training subset $\mathcal{D}_{tr}$ and a test subset $\mathcal{D}_{te}$, both of which consist of image-mask pairs. Furthermore, $\mathcal{D}_{tr}$ and $\mathcal{D}_{te}$ do not share categories; namely, $\mathcal{C}_{tr} \cap \mathcal{C}_{te} = \emptyset$, where $\mathcal{C}_{tr}$ and $\mathcal{C}_{te}$ denote their class sets. The goal of few-shot semantic segmentation is to train a segmentation model on $\mathcal{D}_{tr}$ that can segment unseen semantic classes $\mathcal{C}_{te}$ in images from $\mathcal{D}_{te}$, given only a few annotated examples of $\mathcal{C}_{te}$, without re-training.

To reach this goal, we formulate the problem in a meta-learning fashion, as in the initial few-shot semantic segmentation work. Specifically, $\mathcal{D}_{tr}$ and $\mathcal{D}_{te}$ are sliced into several randomly sampled episodes, where $N_{tr}$ and $N_{te}$ are the numbers of episodes for training and testing, respectively. Each episode consists of $K$ annotated support images and a collection of query images containing $N$ categories; that is, we consider an $N$-way $K$-shot segmentation sub-problem. Specifically, the support set $\mathcal{S} = \{(x_s^k, m_s^k)\}_{k=1}^{K}$ contains $K$ image-mask pairs of a grey-scale image $x_s^k$ and its corresponding binary mask $m_s^k$ for class $c$. The query set $\mathcal{Q}$ contains image-mask pairs from the same class as the support set. During training on $\mathcal{D}_{tr}$, over each episode we learn a segmentation model that predicts a binary mask $\hat{m}_q$ of an unseen class when given the query image $x_q$ and the support set $\mathcal{S}$. After a series of episodes, we obtain the final segmentation model, which is evaluated on $\mathcal{D}_{te}$ in the same $N$-way $K$-shot manner. Following the common practice in [30, 1, 36], this paper sets $N = 1$ and $K = 1$.
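As an illustration of this episodic protocol, the short sketch below (function and variable names are ours, not the paper's) assembles one 1-way 1-shot episode from pre-sliced data: a single support image-mask pair plus one query pair of the same class.

```python
import random
from typing import Dict, List, Tuple

import numpy as np

Pair = Tuple[np.ndarray, np.ndarray]  # (image, binary mask)

def sample_episode(slices_by_class: Dict[str, List[Pair]],
                   classes: List[str]) -> Dict[str, object]:
    """Build one 1-way 1-shot episode: one support (image, mask) and one query (image, mask)."""
    cls = random.choice(classes)                            # the single "way" of this episode
    support, query = random.sample(slices_by_class[cls], 2)
    return {"class": cls, "support": support, "query": query}
```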
4 Methodology
In this work, we propose a detail-representation-enhanced network (DSPNet) for prototypical FSS, building on the self-supervision framework [30]. As shown in Fig. 2(a), DSPNet consists of three modules from left to right: (i) the CNN-based feature extractor $f_\theta$; (ii) the detail self-refining block $g$; and (iii) the segmentation block based on cosine similarity. Suppose the support and query images are denoted by $x_s$ and $x_q$, respectively. Segmentation begins with feature extraction, $F_s = f_\theta(x_s)$ and $F_q = f_\theta(x_q)$. Then, the high-fidelity foreground prototype $p_f$ and background prototypes $\mathcal{P}_b$ are produced by the detail self-refining block, denoted by $(p_f, \mathcal{P}_b) = g(F_s, F_q, m_s)$, where $m_s$ is the support mask label. Finally, we obtain the query segmentation prediction $\hat{m}_q$ by computing the cosine similarity between $F_q$ and the obtained prototypes in a convolution fashion.
In the segmentation process above, the optimal prototype generation performed by the detail self-refining block $g$ distinguishes DSPNet from previous work. As shown in Fig. 2(b), after RAN calibrates $F_s$ and $F_q$ into semantics-fused feature maps $F_{sq}$, FSPA and BCMA extract cluster-based prototypes and Average Pooling-based prototypes from $F_{sq}$, respectively, and take them as raw detail prototypes. Then, the high-fidelity class prototype $p_f$ and background prototypes $\mathcal{P}_b$ are obtained by the channel-wise fusion in FSPA and the sparse channel-aware multi-head channel attention in BCMA, respectively. In the rest of this section, we elaborate on the three key components.
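For orientation, the skeleton below sketches this pipeline end-to-end. It is not the released implementation: the module interfaces, the similarity temperature, and the max-over-background-prototypes rule used to form the two-class prediction are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSPNetSketch(nn.Module):
    def __init__(self, backbone, ran, fspa, bcma, scale=20.0):
        super().__init__()
        self.backbone, self.ran, self.fspa, self.bcma = backbone, ran, fspa, bcma
        self.scale = scale                                    # cosine-similarity temperature (assumed)

    def forward(self, x_s, m_s, x_q):
        F_s, F_q = self.backbone(x_s), self.backbone(x_q)     # support / query features
        F_sq = self.ran(F_s, F_q)                             # semantics-fused support features
        p_f = self.fspa(F_sq, m_s)                            # high-fidelity foreground prototype, (C,)
        P_b = self.bcma(F_sq, m_s)                            # high-fidelity background prototypes, (Nb, C)
        protos = torch.cat([p_f[None], P_b], dim=0)           # (1 + Nb, C)
        q = F.normalize(F_q, dim=1)
        p = F.normalize(protos, dim=1)
        sim = self.scale * torch.einsum("bchw,nc->bnhw", q, p)
        # Background score = max over background prototypes; stacked with the foreground score.
        logits = torch.stack([sim[:, 1:].max(dim=1).values, sim[:, 0]], dim=1)
        return logits.softmax(dim=1)                          # (B, 2, H, W): background, foreground
```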

4.1 Resemblance Attention Network
In the FSS field, the Resemblance Attention Network (RAN) [43] is a classic module for integrating the support and query features [10, 15, 6]. In the proposed DSPNet, RAN filters out irrelevant textures and objects between $F_s$ and $F_q$. Fig. 3 presents its network architecture. Given the support and query feature maps $F_s, F_q \in \mathbb{R}^{C \times H \times W}$, they are first reshaped into feature matrices $v_s, v_q \in \mathbb{R}^{HW \times C}$, respectively. After that, in a Query-Key-Value attention manner with a residual connection, $v_s$ and $v_q$ are fused into $F_{sq} \in \mathbb{R}^{C \times H \times W}$. The process can be formulated by Eq. (1).
(1)  $F_{sq} = F_s + \mathrm{reshape}\big(\sigma(v_q v_s^{\top}) \otimes v_s\big)$
where $\sigma(\cdot)$ stands for the softmax operation, $\otimes$ means matrix multiplication, and $\sigma(v_q v_s^{\top})$ is the similarity-based probability matrix weighting $v_s$.
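A minimal PyTorch sketch of such a resemblance-attention fusion, under the reconstruction of Eq. (1) above (identity projections are assumed for Query, Key and Value):

```python
import torch
import torch.nn as nn

class RANSketch(nn.Module):
    """Fuse support and query features with a similarity-weighted attention and a residual connection."""
    def forward(self, F_s, F_q):
        B, C, H, W = F_s.shape
        v_s = F_s.flatten(2).transpose(1, 2)                       # (B, HW, C)
        v_q = F_q.flatten(2).transpose(1, 2)                       # (B, HW, C)
        attn = torch.softmax(v_q @ v_s.transpose(1, 2), dim=-1)    # (B, HW, HW) probability matrix
        fused = attn @ v_s                                         # support features re-weighted by the query
        return F_s + fused.transpose(1, 2).reshape(B, C, H, W)     # residual connection
```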
4.2 Foreground Semantic Prototype Attention
To obtain a high-fidelity class prototype for the semantic foreground, we explore the local semantics in the foreground and fuse them to form global semantics without losing the local semantics. We accomplish this using cluster-based detail prototypes and channel-wise attention with local-semantics guidance.
Overview. As shown in Fig. 4(a), to capture more local semantics, we first employ the superpixel-guided clustering method [18] to mine cluster prototypes, denoted by $\mathcal{P}_c = \{p_c^i\}_{i=1}^{N_c}$, where $p_c^i \in \mathbb{R}^{D}$ and $D$ is the prototype dimension. An intuitive fusion, e.g., vanilla weighting without prior knowledge [18], can obtain the global semantics but confuses the detail semantics. Therefore, we propose an attention-like cluster prototype fusion to address this issue, implementing detail self-refining and foreground tailoring sequentially.
Attention-like cluster prototype fusion. As shown in the middle of Fig. 4 (marked by the grey box), this attention is implemented in a Query-Key-Value fashion. Taking $Q = F_{sq}$ and $K = V = \mathcal{P}_c$, we can summarize this module by the following equation.
(2)  $\hat{F} = \Psi\big(\sigma(\Phi(Q, K)),\, V\big)$
where $\sigma(\cdot)$ is the softmax computation; the operators $\Phi(\cdot,\cdot)$ and $\Psi(\cdot,\cdot)$ respectively denote the cosine similarity measurement and the channel-wise prototype fusion, whose details are presented in the following.
Since the sizes of $F_{sq}$ and $\mathcal{P}_c$ differ, the computation of $\Phi(Q, K)$ does not follow the definition of cosine similarity directly but is performed prototype-wise. Specifically, each prototype in $\mathcal{P}_c$, i.e., $p_c^i$, is used to compute similarity with the support feature maps $F_{sq}$ in a one-dimensional convolution manner, in which the convolution calculation is replaced by cosine similarity computation. Thus, the $N_c$ prototypes lead to $N_c$ similarity maps, collectively written as $S = \{S_i\}_{i=1}^{N_c}$ with $S_i \in \mathbb{R}^{H \times W}$, where $H \times W$ is the map size. For any map in $S$, denoted by $S_i$, its computation can be expressed as
(3)  $S_i = \varphi\big(F_{sq},\, p_c^i\big), \quad i = 1, \ldots, N_c$
where the function $\varphi(\cdot,\cdot)$ stands for the similarity computation working in the one-dimensional convolution fashion. The value of $S_i$ at position $(u, v)$ is the cosine similarity between $p_c^i$ and $F_{sq}$ at position $(u, v)$. Namely,
(4)  $S_i(u, v) = \dfrac{p_c^i \cdot F_{sq}(u, v)}{\lVert p_c^i \rVert\, \lVert F_{sq}(u, v) \rVert}$
where $F_{sq}(u, v) \in \mathbb{R}^{D}$ represents the feature vector of the feature maps at position $(u, v)$ along the channel dimension.
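Eqs. (3)-(4) can be written compactly in code (an illustrative sketch, not the official implementation): after L2-normalization, a 1x1 convolution with the cluster prototypes as filters yields exactly the per-position cosine similarities.

```python
import torch
import torch.nn.functional as F

def cluster_similarity_maps(F_sq, P_c):
    """F_sq: (D, H, W) fused support features; P_c: (Nc, D) cluster prototypes.
    Returns S: (Nc, H, W) cosine-similarity maps, cf. Eqs. (3)-(4)."""
    feat = F.normalize(F_sq, dim=0)                    # unit-norm feature vector at each position
    protos = F.normalize(P_c, dim=1)                   # unit-norm prototypes
    # A 1x1 convolution with the prototypes as filters equals the per-position dot
    # product, i.e., the cosine similarity after normalization.
    return F.conv2d(feat[None], protos[:, :, None, None]).squeeze(0)
```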

To incorporate the knowledge represented by the similarity maps $S$ into the cluster prototypes $\mathcal{P}_c$, we also adopt a one-dimensional convolution to implement the computation of $\Psi(\cdot,\cdot)$, as illustrated in Fig. 4(b) and (c). Specifically, the computation begins with the channel-wise generation of convolution filters. The cluster prototypes are arranged as shown in Fig. 4(b). We slice $\mathcal{P}_c$ along the channel dimension and obtain $D$ convolution vectors $\{w_i\}_{i=1}^{D}$, where $w_i \in \mathbb{R}^{N_c}$ contains the cluster prototypes' semantic components on the $i$-th channel. After that, as shown in Fig. 4(c), we conduct one-dimensional convolution to obtain the fused maps $\hat{F} = \{\hat{F}_i\}_{i=1}^{D}$. The computation for the $i$-th map can be expressed as:
(5)  $\hat{F}_i = \sigma(S) * w_i$
where $\sigma(\cdot)$ is the softmax operation applied across the $N_c$ similarity maps at each position, $\sigma(S)$ stands for the probability maps, and $w_i$ works as the convolution filter. In the end, to suppress the noise introduced in the fusion step, we tailor the fused maps $\hat{F}$ to the high-fidelity foreground prototype $p_f$ by Masked Average Pooling.
(6)  $p_f = \dfrac{\sum_{u,v} \hat{F}(u, v)\, m_s(u, v)}{\sum_{u,v} m_s(u, v)}$
where $m_s$ is the given mask of the support image, resized to the same spatial size as $\hat{F}$.

Remark: In Eq. (5), $S$ essentially consists of semantic response maps with respect to the cluster prototypes, such that the probability map $\sigma(S)$ is closely related to the detail semantics. That is, the fusion performed by $\Psi(\cdot,\cdot)$ is guided by the detail semantics represented in $S$. Equivalently, the fusion process preserves the detail semantics, as expected.
Besides, two designs differentiate FSPA from previous work. First, FSPA reduces the mined cluster prototypes to a single fused one instead of using these prototypes separately [8, 21]. Second, the proposed channel-wise attention produces global semantics that preserve the local semantics, unlike spatial weighting [18].
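Under the reconstructed notation, the channel-wise fusion of Eq. (5) and the tailoring of Eq. (6) can be sketched as follows (again an illustrative sketch rather than the authors' code):

```python
import torch

def fspa_fuse(S, P_c, mask):
    """S: (Nc, H, W) similarity maps; P_c: (Nc, D) cluster prototypes;
    mask: (H, W) binary support foreground mask. Returns p_f: (D,)."""
    prob = torch.softmax(S, dim=0)                     # probability maps over the Nc prototypes
    # Eq. (5): for channel i, convolve the probability maps with filter w_i = P_c[:, i],
    # i.e., a per-position weighted sum of the prototypes' i-th channel components.
    fused = torch.einsum("nhw,nd->dhw", prob, P_c)     # (D, H, W) fused maps
    # Eq. (6): Masked Average Pooling over the foreground zone.
    m = mask.float()
    return (fused * m).sum(dim=(1, 2)) / (m.sum() + 1e-5)
```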
4.3 Background Channel-structural Multi-head Attention
Unlike the foreground, for which the cluster prototypes serve as the local details, the background in medical images is usually semantic-less over a large scope. Therefore, we do not mine details from the spatial dimension but instead treat the structural information in the channel dimension as the local details. Within this context, we design a controllable channel attention mechanism to jointly model the channel-specific structural information and incorporate it into the raw background prototypes.
Overview. As illustrated in Fig. 5(a), BCMA begins by generating the raw detail prototypes. By Average Pooling and reshaping, $F_{sq}$ is converted to a set of raw background prototypes $\mathcal{B} = \{p^j\}_{j=1}^{N_b}$ with $p^j \in \mathbb{R}^{D}$. Following that, the controllable multi-head channel attention module refreshes $\mathcal{B}$ into refined prototypes $\tilde{\mathcal{B}}$. Finally, $\tilde{\mathcal{B}}$ is reshaped to feature maps and further tailored to the high-fidelity background prototypes $\mathcal{P}_b$ by the background zone in the pooled support mask $\bar{m}_s$.
Controllable multi-head channel attention. The proposed channel attention mechanism encodes the channel-structural information into the raw background prototypes in an element-wise manner: for a raw prototype, its elements are independently refined by the structural information of different channels. We achieve this with the $D$-way architecture illustrated in Fig. 5(b). Suppose that for any raw prototype in $\mathcal{B}$, denoted by $p$, the converted high-fidelity prototype is $\tilde{p}$. In the Q-K-V fashion, we set $Q = V = p$ and $K = \{b_i\}_{i=1}^{D}$, where $b_i \in \mathbb{R}^{N_b}$ is obtained by channel-wise slicing of $\mathcal{B}$, i.e., $b_i$ collects the $i$-th channel components of all raw prototypes. In the proposed module, $Q$ and $V$ are copied $D$ times and input into the $D$ heads, respectively, while $K$ is input channel-wise; that is, the $i$-th head takes the $i$-th slice $b_i$ as its input. The multi-head refining can be formulated as
(7)  $\tilde{p} = \mathrm{Concat}\big(h_1(p, b_1), h_2(p, b_2), \ldots, h_D(p, b_D)\big)$
where $\mathrm{Concat}(\cdot)$ concatenates the input set into a vector according to their indices, and $h_i(\cdot)$ is the $i$-th channel attention head generating $\tilde{p}_i$ (the $i$-th element of $\tilde{p}$), which we elaborate as follows.
In Eq. (7), the objective of the attention head $h_i$ is to encode the $i$-th channel-specific structural information, denoted by $r_i \in \mathbb{R}^{D}$, into the raw prototype. Under the attention framework, the encoding can be implemented by a weighting operation $\tilde{p}_i = r_i^{\top} p$, whilst the generation of $r_i$ is the core problem we need to address. For this issue, as depicted in Fig. 5(c), we provide a controllable design consisting of (i) a global exploration module and (ii) a sparse channel-aware regulating module. The former predicts the global channel-structural relation of the $i$-th channel, whilst the latter serves as a controller by injecting the $i$-th channel-specific adjustment $a_i$. The working mechanism can be formulated as
(8)  $r_i = g_i \odot a_i$
where the parameter $g_i \in \mathbb{R}^{D}$ is learnable and the operator $\odot$ means element-wise multiplication.
In Eq. (8), the generation of the adjustment $a_i$ involves two blocks in the sparse channel-aware regulating module (see the two dark grey boxes in Fig. 5(c)). First, the channel similarity computation, formulated by Eq. (9), captures the dynamics of the relationship between the $i$-th channel and the other channels.
(9)  $s_i = \sigma\big(\big[\mathrm{sim}(b_i, b_1), \mathrm{sim}(b_i, b_2), \ldots, \mathrm{sim}(b_i, b_D)\big]\big)$
where $s_i \in \mathbb{R}^{D}$ is the channel similarity vector whose $j$-th element is $s_{ij}$, the function $\mathrm{sim}(\cdot,\cdot)$ measures the cosine similarity of $b_i$ over the set $\{b_j\}_{j=1}^{D}$, and $\sigma$ is the softmax operation. Subsequently, the incorporation unit generates the adjustment coefficients by highlighting the sparsely related channels indexed by the masked frozen vector $v_m$. This process can be formulated as
(10)  $a_i = \mathbf{1} + \lambda\, (v_m \odot s_i)$
where the trade-off parameter $\lambda$ stands for the control strength.
As mentioned above, the proposed module involves two important parameters, i.e., $g_i$ (Eq. (8)) and $v_m$ (Eq. (10)). In our design, both of them are initialized from a pre-set sparse vector $v_0$ that represents prior knowledge about the channel-structural pattern. Specifically, at the beginning of model training, we set $g_i = v_0$ and $v_m = \mathrm{Bool}(v_0)$, where the function $\mathrm{Bool}(\cdot)$ outputs a Boolean vector whose 1-entries correspond to the non-zero places in the input vector.
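Under the reconstruction above, one BCMA head might be sketched as follows. This is a speculative illustration rather than the official implementation: the exact Q-K-V wiring and the final weighting step are assumptions, and the default value of `lam` merely mirrors the Tab. 6 setting for ABD-MRI under Setting-2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCMAHeadSketch(nn.Module):
    """One channel-attention head: refines the i-th element of a raw background prototype."""
    def __init__(self, v0: torch.Tensor, lam: float = 0.20):
        super().__init__()
        self.g = nn.Parameter(v0.clone())                  # learnable global channel relation, init = v0
        self.register_buffer("v_m", (v0 != 0).float())     # frozen sparse mask, v_m = Bool(v0)
        self.lam = lam                                     # control strength (lambda in Eq. (10))

    def forward(self, p, b_i, B_slices):
        """p: (D,) raw prototype; b_i: (Nb,) i-th channel slice; B_slices: (D, Nb) all channel slices."""
        s_i = torch.softmax(F.cosine_similarity(b_i[None], B_slices, dim=1), dim=0)  # Eq. (9)
        a_i = 1.0 + self.lam * (self.v_m * s_i)            # Eq. (10): sparse channel-aware adjustment
        r_i = self.g * a_i                                 # Eq. (8): channel-specific structural information
        return (r_i * p).sum()                             # weighting operation: refined i-th element
```

Keeping `v_m` frozen while `g` stays learnable mirrors the split described above between the fixed sparse prior and the learnt global channel relation.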
Remark: In our controllable attention mechanism, the core idea is to impose sparse channel-aware regulation to adjust the learnt global channel relation, leading to channel-specific structural information. Here, the sparse constraint is motivated by the ubiquitous sparse nature of neural connections, whose rationale has been verified by much prior work [20, 4].
Also, from a methodological point of view, our structure can be understood as a form of structural-learning-based attention [33, 23, 13], but in the channel dimension. For instance, the shifted window partitioning in Swin Transformer [23] introduces a spatial relation constraint to self-attention. Similarly, our sparse channel-aware regulation introduces a channel-structural constraint, i.e., the sparse relation ($v_m$), to the learnt global channel-structural information ($g_i$).
4.4 Loss Function
We use a cross-entropy loss to supervise the model training process:
(11)  $\mathcal{L}_{seg} = -\dfrac{1}{HW} \sum_{u,v} \sum_{j \in \{f, b\}} m_q^{j}(u, v)\, \log \hat{m}_q^{j}(u, v)$
where $\hat{m}_q$ is the prediction of the query mask label $m_q$; the superscripts $f$ and $b$ denote foreground and background, respectively. Also, following [42, 30, 36], we perform an inverse learning step in which the query image serves as the support set to predict the labels of the support images. Thus, we encourage a prototypical alignment formulated by
(12)  $\mathcal{L}_{align} = -\dfrac{1}{HW} \sum_{u,v} \sum_{j \in \{f, b\}} m_s^{j}(u, v)\, \log \hat{m}_s^{j}(u, v)$
Overall, for each training episode, the final objective of DSPNet is defined as follows:
(13)  $\mathcal{L} = \mathcal{L}_{seg} + \mathcal{L}_{align}$
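A short sketch of Eqs. (11)-(13), assuming the predictions are softmax probabilities over {background, foreground} (variable names are ours):

```python
import torch
import torch.nn.functional as F

def dspnet_loss(query_pred, query_mask, support_pred, support_mask, eps=1e-7):
    """query_pred/support_pred: (B, 2, H, W) probabilities over {background, foreground};
    query_mask/support_mask: (B, H, W) integer labels in {0, 1}."""
    l_seg = F.nll_loss(torch.log(query_pred + eps), query_mask)          # Eq. (11)
    l_align = F.nll_loss(torch.log(support_pred + eps), support_mask)    # Eq. (12): prototype alignment
    return l_seg + l_align                                               # Eq. (13)
```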
5 Experiments
This part first introduces the experimental settings, followed by the segmentation results on three challenging benchmarks. An extensive model discussion is provided at the end.
Table 1: Comparison in Dice score on ABD-MRI and ABD-CT under Setting-1 and Setting-2.
Settings | Method | ABD-MRI | | | | | ABD-CT | | | |
| | Liver | R.kidney | L.kidney | Spleen | Mean | Liver | R.kidney | L.kidney | Spleen | Mean |
Setting-1 | SE-Net [34] | 29.02 | 47.96 | 45.78 | 47.30 | 42.51 | 35.42 | 12.51 | 24.42 | 43.66 | 29.00 |
PANet [42] | 47.37 | 30.41 | 34.96 | 27.73 | 35.11 | 60.86 | 50.42 | 56.52 | 55.72 | 57.88 | |
SSL-ALPNet [30] | 70.49 | 79.86 | 81.25 | 64.49 | 74.02 | 67.29 | 72.62 | 76.35 | 70.11 | 71.59 | |
Q-Net [36] | 73.54 | 84.41 | 68.36 | 76.69 | 75.75 | 68.65 | 55.63 | 69.39 | 56.82 | 62.63 | |
CAT-Net [19] | 73.01 | 79.54 | 73.11 | 69.31 | 73.74 | 66.24 | 47.83 | 69.09 | 66.98 | 62.54 | |
DSPNet (our) | 75.06 | 85.37 | 81.88 | 70.93 | 78.31 | 69.32 | 74.54 | 78.01 | 69.31 | 72.79 | |
Setting-2 | SE-Net [34] | 27.43 | 61.32 | 62.11 | 51.80 | 50.66 | 0.27 | 14.34 | 32.83 | 0.23 | 11.91 |
PANet [42] | 69.37 | 66.94 | 63.17 | 61.25 | 65.68 | 61.71 | 34.69 | 37.58 | 43.73 | 44.42 | |
SSL-ALPNet [30] | 69.46 | 62.34 | 75.49 | 69.02 | 69.08 | 66.21 | 64.68 | 58.66 | 66.69 | 64.06 | |
Q-Net [36] | 82.97 | 51.81 | 70.39 | 57.74 | 65.73 | 64.44 | 41.75 | 66.21 | 37.87 | 52.57 | |
CAT-Net [19] | 74.09 | 63.51 | 70.56 | 67.02 | 68.79 | 52.53 | 46.87 | 65.01 | 46.73 | 52.79 | |
DSPNet (our) | 78.56 | 82.01 | 76.47 | 68.27 | 76.33 | 69.16 | 63.55 | 68.46 | 66.48 | 66.17 |


5.1 Data Sets
To demonstrate the effectiveness of DSPNet, we conduct evaluation on three challenging datasets with different segmentation scenarios. Their details are presented as follows.
Abdominal CT dataset [17], termed ABD-CT, was acquired from the Multi-Atlas Abdomen Labeling challenge at the Medical Image Computing and Computer Assisted Intervention Society (MICCAI) in 2015. This dataset contains 30 3D abdominal CT scans. Of note, this is a clinical dataset containing patients with various pathologies and variations in intensity distributions between scans.
Abdominal MRI dataset [16], termed ABD-MRI, was obtained from the Combined Healthy Abdominal Organ Segmentation (CHAOS) challenge held at the IEEE International Symposium on Biomedical Imaging (ISBI) in 2019. This dataset consists of 20 3D MRI scans with a total of four different labels representing different abdominal organs.
Cardiac MRI dataset [50], termed CMR, was obtained from the Automatic Cardiac Chamber and Myocardium Segmentation Challenge held at the Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) in 2019. It contains 35 clinical 3D cardiac MRI scans.
In our experiments, to ensure a fair comparison, we adopt the same image preprocessing as SSL-ALPNet [30]. Specifically, we sample the 3D images into 2D slices, resize each slice to 256×256 pixels, and repeat each slice three times along the channel dimension to fit the network input. We employ 5-fold cross-validation as our evaluation protocol, where each dataset is evenly divided into 5 parts.
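A minimal sketch of this preprocessing (the function name and the interpolation mode are illustrative assumptions):

```python
import numpy as np
import torch
import torch.nn.functional as F

def preprocess_volume(volume: np.ndarray) -> torch.Tensor:
    """volume: (Z, H, W) 3D scan. Returns a (Z, 3, 256, 256) tensor of slice-wise network inputs."""
    slices = torch.from_numpy(volume).float().unsqueeze(1)               # (Z, 1, H, W)
    slices = F.interpolate(slices, size=(256, 256), mode="bilinear",
                           align_corners=False)                          # resize each slice
    return slices.repeat(1, 3, 1, 1)                                     # repeat to 3 channels
```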
5.2 Evaluation Protocol
To evaluate the segmentation performance, we use the conventional Dice score. The Dice score ranges from 0 to 100, where 0 represents a complete mismatch between the prediction and the ground truth, while 100 signifies a perfect match. The Dice score is computed as
(14)  $\mathrm{Dice}(A, B) = \dfrac{2\,|A \cap B|}{|A| + |B|} \times 100$
where $A$ represents the predicted mask and $B$ represents the ground truth.
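For reference, a straightforward implementation of Eq. (14) on binary masks:

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: binary masks of the same shape. Returns the Dice score in [0, 100]."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 100.0  # both empty: treat as a perfect match
    return 200.0 * np.logical_and(pred, gt).sum() / denom
```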
Table 2: Comparison in Dice score on CMR under Setting-1.
Settings | Method | RV | LV-MYO | LV-BP | Mean
Setting-1 | SE-Net [34] | 12.86 | 58.04 | 25.18 | 32.03 |
PANet [42] | 57.13 | 72.77 | 44.76 | 58.20 | |
SSL-ALPNet [30] | 77.59 | 63.29 | 85.36 | 75.41 | |
Q-Net [36] | 67.99 | 52.09 | 86.21 | 68.76 | |
CAT-Net [19] | 69.37 | 48.81 | 81.33 | 66.51 | |
DSPNet (our) | 79.73 | 64.91 | 87.75 | 77.46 |
Table 3: Ablation study of RAN, FSPA and BCMA on ABD-MRI under Setting-1 and Setting-2.
# | Method | RAN | FSPA | BCMA | Setting-1 | | | | | Setting-2 | | | |
| | | | | Liver | R.kidney | L.kidney | Spleen | Mean | Liver | R.kidney | L.kidney | Spleen | Mean |
1 | SSL-ALPNet | – | – | – | 70.49 | 79.86 | 81.25 | 64.49 | 74.02 | 69.46 | 62.34 | 75.49 | 69.02 | 69.08 |
2 | DSPNet w/o RAN | ✗ | ✓ | ✓ | 72.92 | 81.98 | 85.55 | 65.66 | 76.52 | 70.54 | 72.97 | 80.94 | 63.53 | 71.99 |
3 | DSPNet w/o FSPA | ✓ | ✗ | ✓ | 71.97 | 82.14 | 80.95 | 67.46 | 75.63 | 74.86 | 73.56 | 72.97 | 66.99 | 72.09 |
4 | DSPNet w/o BCMA | ✓ | ✓ | ✗ | 70.24 | 84.64 | 80.81 | 68.04 | 75.93 | 66.09 | 73.47 | 77.19 | 66.22 | 70.74 |
5 | DSPNet w/ RAN | ✓ | ✗ | ✗ | 75.89 | 80.89 | 77.07 | 66.73 | 75.15 | 74.86 | 68.69 | 70.89 | 60.47 | 68.73 |
6 | DSPNet | ✓ | ✓ | ✓ | 75.06 | 85.36 | 81.88 | 70.93 | 78.31 | 78.56 | 82.01 | 76.47 | 68.27 | 76.33 |

5.3 Few-Shot Settings
To evaluate the model's performance, we follow the experimental settings in [30, 12] and consider two cases. Setting-1 is the initial setting proposed in [34], where test classes may appear in the background of training images; we train and test on all classes in the dataset without any partitioning. Setting-2, proposed in [30], is a stricter version of Setting-1 in which test classes do not appear in any training images. For instance, when segmenting the Liver during training, the support and query images do not contain the Spleen if the Spleen is the segmentation target for testing. We directly remove the images containing test classes during the training phase to ensure that the test classes are truly "unseen" by the model.
5.4 Implementation Details
We implemented our model in the PyTorch framework with a pre-trained fully convolutional ResNet-101 as the feature extractor; the ResNet-101 model was pre-trained on the MS-COCO dataset. Given that superpixel pseudo-labels contain rich clustering information, which helps to alleviate the absence of annotations, we generate the superpixel pseudo-labels offline as the support image masks before model training, following [30, 12, 36].
DSPNet has one hyper-parameter, the local adjustment intensity $\lambda$ in Eq. (10). Another important factor is the sparse pattern of $v_0$, which follows the neighbour-channel constraint: for each attention head, the non-zero elements of its $v_0$ are restricted to a small window of channels neighbouring that head's channel. The value of $\lambda$ is selected per dataset and setting; for instance, $\lambda = 0.20$ is adopted on ABD-MRI under Setting-2 (cf. Tab. 6).
We used the stochastic gradient descent (SGD) algorithm with a batch size of 1 for 100k iterations to minimize the objective in Eq. (13). The self-supervised training took around 4.5 hours on a single Nvidia TITAN V GPU, and the memory consumption was about 8.1 GB.
5.5 Competitors
To evaluate our approach, we compare it with five state-of-the-art few-shot segmentation methods: SE-Net [34], PANet [42], SSL-ALPNet [30], Q-Net [36], and CAT-Net [19]. Among them, SE-Net belongs to the category of constructing support-image-based guidance, whilst the remaining competitors all follow the prototypical-network line. For a fair comparison, we obtain the results of all prototype-based methods, i.e., PANet, SSL-ALPNet, Q-Net and CAT-Net, by re-running their official code on the same evaluation bed as DSPNet. The results of SE-Net are cited from its publication.
5.6 Quantitative and Qualitative Results
As in previous methods, we evaluate on ABD-MRI and ABD-CT under both Setting-1 and Setting-2, whilst CMR is evaluated under Setting-1. Tab. 1 reports the Dice scores on ABD-MRI and ABD-CT. The results show that DSPNet outperforms the previous methods in both settings. On the ABD-MRI dataset, compared with the second-best method Q-Net in mean score, DSPNet achieves an improvement of 2.6 under Setting-1. Meanwhile, for the strict Setting-2, which tests the model on "unknown" classes, DSPNet demonstrates impressive performance with a 7.8 increase, especially with a Dice score of approximately 82 for the right kidney. The reason is discussed in the ablation study of Section 5.7. On the ABD-CT dataset, in mean score, DSPNet also surpasses the second-best method SSL-ALPNet by 1.2 in Setting-1 and 2.1 in Setting-2, respectively. For an intuitive comparison, we present visual segmentation results in Fig. 6. As shown in the figure, DSPNet segments large objects much better (see Liver), while predicting finer boundaries for small objects (see Spleen).
Tab. 2 shows the comparison results on CMR, whose organs are adjacent to one another. In this scenario, DSPNet exhibits better segmentation performance on all three classes, obtaining a 2.0 improvement in mean score over the previous best method SSL-ALPNet. The right side of Fig. 6 depicts three example results. It can be seen that DSPNet generates complicated boundaries (see LV-MYO, RV), implying that more details are captured by DSPNet compared with the previous methods. For objects with relatively regular shapes, e.g., LV-BP, DSPNet achieves fuller segmentation near the boundary.
Table 4: Analysis of FSPA on ABD-MRI under Setting-2.
Method | Liver | R.kidney | L.kidney | Spleen | Mean
SSL-ALPNet | 69.46 | 62.34 | 75.49 | 69.02 | 69.08 |
DSPNet-F-separating | 72.13 | 79.64 | 73.21 | 65.31 | 72.56 |
DSPNet-F-weighting | 69.14 | 65.97 | 64.77 | 59.89 | 64.94 |
DSPNet | 78.56 | 82.01 | 76.47 | 68.27 | 76.33 |
Table 5: Analysis of the sparse channel-aware regulation in BCMA on ABD-MRI under Setting-2 (Learnable: $g$ is learnable; Adjust: the adjustment $a_i$ is applied; SparseIni: initialized from the sparse vector $v_0$).
# | Method | Learnable | Adjust | SparseIni | Liver | R.kidney | L.kidney | Spleen | Mean
1 | SSL-ALPNet | – | – | – | 69.46 | 62.34 | 75.49 | 69.02 | 69.08 |
2 | DSPNet w/o NCR | – | – | – | 68.85 | 78.01 | 73.43 | 62.23 | 70.63 |
3 | DSPNet w/ $g$-fix | ✗ | ✓ | ✓ | 71.99 | 73.55 | 77.48 | 69.06 | 73.02
4 | DSPNet w/ $g$-no-adjust | ✓ | ✗ | ✓ | 75.41 | 75.39 | 72.23 | 75.16 | 74.55
5 | DSPNet w/ $g$-random | ✓ | ✓ | ✗ | 45.77 | 38.99 | 39.42 | 40.69 | 41.22
6 | DSPNet | ✓ | ✓ | ✓ | 78.56 | 82.01 | 76.47 | 68.27 | 76.33 |
5.7 Ablation Study
As illustrated in the middle of Fig. 2, DSPNet involves three components, i.e., RAN, FSPA and BCMA. In this part, we carry out an ablation study to isolate their effects. Unless otherwise specified, all experimental results are obtained on the ABD-MRI dataset under the strict Setting-2.
5.7.1 Effect on the final performance
By removing each of the three components from our framework, we obtain the following variant methods:
1. DSPNet w/o RAN. We remove the RAN block and directly take the support feature as the fused feature, i.e., $F_{sq} = F_s$.
2. DSPNet w/o FSPA. When the FSPA block is removed, we generate the foreground prototype with the conventional MAP operation, the same as in previous work [30].
3. DSPNet w/o BCMA. After removing the BCMA block, the background prototypes are generated in two steps: (i) we convert the fused support feature $F_{sq}$ to feature maps by AP, and then (ii) directly tailor them according to the background zone in the support mask, which is also pooled by Average Pooling.
From the results in Tab. 3, we see that removing any one of the three components decreases the mean result to some extent compared with DSPNet, whilst all variants remain better than SSL-ALPNet. These results confirm that the three proposed designs all play positive roles in the proposed scheme. Meanwhile, the full version, DSPNet, significantly outperforms the three variant methods, indicating that the three designs jointly contribute to the final performance.
To better understand the effect of the three designs, we present some typical segmentation results under Setting-2 in Fig. 8. When any one of them is unavailable, the segmentation deteriorates noticeably. For example, when RAN is unavailable, large-object segmentation exhibits obvious holes (see Liver). When the background-specific BCMA is removed, i.e., for DSPNet w/o BCMA, some background zones are wrongly segmented (see Left kidney, Spleen).
Combining the results of Setting-1 with Setting-2, we have one detailed finding. DSPNet w/o FSPA and DSPNet w/o BCMA yield similar results with an especially tiny gap under Setting-1, implying their balanced effect. In contrast, under Setting-2, DSPNet w/o BCMA beats SSL-ALPNet by an increase of only 1.9 but shows a 3.1 decrease compared with DSPNet w/o FSPA. This comparison shows that, in the truly "unseen" scenario, the background-oriented BCMA is more important than the foreground-oriented FSPA. The result is understandable: performing detail self-refining on the background prototypes is the most reasonable strategy when the training images cannot provide valuable references for the unseen testing classes. This also explains why, under Setting-2, DSPNet enjoys a large performance margin over the previous methods (see Tab. 3), which pay little attention to ameliorating the background prototypes.
5.7.2 Effect of RAN on FSPA and BCMA
As shown in Fig. 2, FSPA and BCMA build on RAN. Here, we propose another variant of DSPNet, named DSPNet w/ RAN, to determine its effect. In this comparison method, both FSPA and BCMA are removed: the foreground class prototype and the background detail prototypes are generated by the traditional MAP and AP, respectively. As listed in Tab. 3 (see the fifth row), compared with SSL-ALPNet, DSPNet w/ RAN improves by only 1.1 under Setting-1 and shows a tiny gap of 0.3 under Setting-2. This result indicates that RAN cannot work alone and must work jointly with FSPA and BCMA.
5.8 Model Analysis
5.8.1 Analysis of FSPA.
This part discusses the two key features of FSPA: (i) fusing the mined cluster prototypes into a single one to incorporate both local and global semantics, and (ii) the channel-wise fusion strategy instead of weighting the prototypes. To evaluate their effects, we propose two variants of DSPNet:
1. DSPNet-F-separating: the feature map in Fig. 4(a) is generated by directly computing the cosine distance between each cluster prototype and the semantics-fused feature $F_{sq}$, without fusion.
2. DSPNet-F-weighting: we average the cluster prototypes and employ the averaged prototype to compute the cosine distance with $F_{sq}$.
As listed in Tab. 4, DSPNet-F-separating is 3.77 lower than DSPNet in mean score yet outperforms SSL-ALPNet by 3.48. This comparison indicates that mining cluster prototypes can boost the segmentation but suffers from the loss of global semantics, which is in line with our expectations. Besides, DSPNet-F-weighting is beaten by DSPNet with a large decrease of 11.39 and is even lower than SSL-ALPNet. These results show that the weighting scheme confuses the semantics, whereas our channel-wise fusion provides a viable way to incorporate semantics from local to global.
5.8.2 Analysis of BCMA
As shown in Fig. 5(b), the sparse channel-aware regulation is the core difference from the conventional channel attention mechanism. The evaluation in this part first focuses on the effect of this regulation. To this end, we propose a comparison method, named DSPNet w/o NCR, in which the input prototype is directly refreshed by the channel similarity vector. From the second row of Tab. 5, we can see that DSPNet w/o NCR is 5.7 lower than DSPNet and very close to the result of removing BCMA altogether, i.e., DSPNet w/o BCMA (see Tab. 3). This comparison indicates that the effect of BCMA derives almost entirely from our design of neighbour channel-aware regulation (NCR), confirming the rationale of introducing channel-structural information.
As mentioned in Section 4.3, the sparse channel-aware regulation contains three significant designs: (i) $g_i$ is learnable, (ii) the incorporation unit integrates the adjustment $a_i$ to adjust $g_i$, and (iii) $g_i$ is initialized by the sparse vector $v_0$ representing the neighbour-channel constraint. To demonstrate their effectiveness, we conduct a comparison experiment with three variants of DSPNet:
1. DSPNet w/ $g$-fix: we keep $g_i = v_0$ fixed during training.
2. DSPNet w/ $g$-no-adjust: setting $a_i = \mathbf{1}$ removes the adjustment, whilst $g_i$ is still learnable and initialized by $v_0$.
3. DSPNet w/ $g$-random: $g_i$ is not initialized by $v_0$; instead, conventional random initialization is used.

From the comparison results in Tab. 5, we have three main observations. First, DSPNet's minimal version, DSPNet w/ $g$-fix, surpasses SSL-ALPNet by 5.97 in mean accuracy. This indicates that our neighbourhood-aware idea is effective even when it works alone. Meanwhile, DSPNet surpasses DSPNet w/ $g$-fix by 3.3, indicating the importance of the global fusion design enabled by making $g_i$ learnable. Second, DSPNet outperforms DSPNet w/ $g$-no-adjust by 1.78 in mean score, confirming the rationale of introducing the local adjustment. Third, compared with DSPNet, the performance of DSPNet w/ $g$-random decreases sharply by 35.11. This result shows that the initialization design is crucial to optimizing $g_i$, once again supporting the importance of the neighbourhood prior: $v_0$ provides a good starting point for the optimization.
5.8.3 Conventional prototypes vs. high-fidelity prototypes
Compared to conventional prototypes, the core advantage of our prototypes is that they represent the details more deeply. To verify this, we perform a quantitative experiment based on a typical image from the ABD dataset. As shown on the left side of Fig. 9, we mark three zones containing objects in this image, denoted by C1 (left kidney), C2 (right kidney) and C3 (gallbladder), where C2 is the foreground. We then compute their similarity scores under Setting-2 by averaging the final similarity map at the locations of C1, C2 and C3. The right side of Fig. 9 shows the comparison between DSPNet and SSL-ALPNet. Compared with SSL-ALPNet, DSPNet improves by 0.46 at C2; conversely, DSPNet declines by 3.26 and 2.3 at C1 and C3, respectively. In terms of the maximum relative decline of the background zones with respect to the foreground C2, SSL-ALPNet reaches 26.8%, whilst DSPNet amplifies it to 44.3%. These results indicate that our high-fidelity prototypes encourage more discriminative representations than the conventional prototypes.
Table 6: Sensitivity to the local adjustment intensity $\lambda$ on ABD-MRI under Setting-2.
$\lambda$ | Liver | R.kidney | L.kidney | Spleen | Mean
0.18 | 73.32 | 79.67 | 78.41 | 68.95 | 74.09 |
0.19 | 67.58 | 75.79 | 73.41 | 69.02 | 75.45 |
0.20 | 78.56 | 82.01 | 76.47 | 68.27 | 76.33 |
0.21 | 73.98 | 81.39 | 72.64 | 65.51 | 73.38 |
0.22 | 73.69 | 81.13 | 77.74 | 65.13 | 74.42 |
0.23 | 72.01 | 74.35 | 76.36 | 67.03 | 72.44 |
5.8.4 Parameter sensitivity
This part examines the sensitivity of the performance to the local adjustment intensity $\lambda$ in Eq. (10), based on Setting-2 of the ABD-MRI dataset. As presented in Tab. 6, there are no evident drops in accuracy as the parameter changes, indicating that DSPNet is insensitive to $\lambda$.
6 Conclusion
In this paper, we present a novel FSS approach, dubbed DSPNet, which targets the local information loss problem that arises in medical images when adopting the prototypical paradigm. To our knowledge, this is an initial effort from the perspective of enhancing the detail representation ability of off-the-shelf prototypes by detail self-refining. Specifically, we introduce two pivotal designs: the FSPA and BCMA modules for generating the foreground class prototype and the background detail prototypes, respectively. The former implements detail self-refining by fusing the detail prototypes clustered from the foreground; the latter models this self-refining as incorporating channel-specific structural information, employing multi-head channel attention with sparse channel-aware regulation. DSPNet's effectiveness is validated by state-of-the-art experimental results across three challenging datasets.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is partly funded by the German Research Foundation (DFG) and National Natural Science Foundation of China (NSFC) in project Crossmodal Learning under contract Sonderforschungsbereich Transregio 169, the Hamburg Landesforschungsförderungsprojekt Cross, NSFC (61773083); NSFC (62206168, 62276048, 52375035).
References
- Achanta et al. [2012] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S., 2012. Slic superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34, 2274–2282.
- Allah et al. [2023] Allah, A.M.G., Sarhan, A.M., Elshennawy, N.M., 2023. Edge U-Net: Brain tumor segmentation using mri based on deep u-net model with boundary information. Expert Syst. Appl. 213, 118833.
- Badrinarayanan et al. [2017] Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495.
- Child et al. [2019] Child, R., Gray, S., Radford, A., Sutskever, I., 2019. Generating long sequences with sparse transformers. arXiv:1904.10509 URL: https://arxiv.org/abs/1904.10509.
- Çiçek et al. [2016] Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O., 2016. 3D U-Net: learning dense volumetric segmentation from sparse annotation, in: Proceedings of the Med. Image Comput. and Computer-Assis. Interv. (MICCAI), Springer. pp. 424–432.
- Ding et al. [2023a] Ding, H., Sun, C., Tang, H., Cai, D., Yan, Y., 2023a. Few-shot medical image segmentation with cycle-resemblance attention, in: Proceedings of the IEEE Wint. Conf. on Appl. of Comput. Vis. (WACV), pp. 2488–2497.
- Ding et al. [2023b] Ding, H., Zhang, H., Jiang, X., 2023b. Self-regularized prototypical network for few-shot semantic segmentation. Pattern Recognition 133, 109018.
- Fan et al. [2022] Fan, Q., Pei, W., Tai, Y.W., Tang, C.K., 2022. Self-support few-shot semantic segmentation, in: Proceedings of the Eur. Conf. Comput. Vis. (ECCV), Springer. pp. 701–719.
- Feng et al. [2021] Feng, R., Zheng, X., Gao, T., Chen, J., Wang, W., Chen, D.Z., Wu, J., 2021. Interactive few-shot learning: Limited supervision, better medical image segmentation. IEEE Trans. Med. Imag. 40, 2575–2588.
- Fu et al. [2019] Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H., 2019. Dual attention network for scene segmentation, in: Proceedings of the IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 3146–3154.
- Gao et al. [2022] Gao, H., Xiao, J., Yin, Y., Liu, T., Shi, J., 2022. A mutually supervised graph attention network for few-shot segmentation: the perspective of fully utilizing limited samples. IEEE Trans. Neural. Netw. Learn. Syst. doi:10.1109/TNNLS.2022.3155486.
- Hansen et al. [2022] Hansen, S., Gautam, S., Jenssen, R., Kampffmeyer, M., 2022. Anomaly detection-inspired few-shot medical image segmentation through self-supervision with supervoxels. Med. Image. Anal. 78, 102385.
- Hassani et al. [2023] Hassani, A., Walton, S., Li, J., Li, S., Shi, H., 2023. Neighborhood attention transformer, in: Proceedings of the IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 6185–6194.
- Hu et al. [2019] Hu, T., Yang, P., Zhang, C., Yu, G., Mu, Y., Snoek, C.G., 2019. Attention-based multi-context guiding for few-shot semantic segmentation, in: Proceedings of the AAAI Conf. on Artif. Intell. (AAAI), pp. 8441–8448.
- Huang et al. [2019] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W., 2019. Ccnet: Criss-cross attention for semantic segmentation, in: Proceedings of the Int. Conf. Comput. Vis. (ICCV), pp. 603–612.
- Kavur et al. [2021] Kavur, A.E., Gezer, N.S., Barış, M., Aslan, S., Conze, P.H., Groza, V., Pham, D.D., Chatterjee, S., Ernst, P., Özkan, S., et al., 2021. Chaos challenge-combined (ct-mr) healthy abdominal organ segmentation. Med. Image Anal. 69, 101950.
- Landman et al. [2015] Landman, B., Xu, Z., Igelsias, J.E., et al., 2015. Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge, in: Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, pp. 5–12.
- Li et al. [2021] Li, G., Jampani, V., Sevilla-Lara, L., Sun, D., Kim, J., Kim, J., 2021. Adaptive prototype learning and allocation for few-shot segmentation, in: Proceedings of the IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 8334–8343.
- Lin et al. [2023] Lin, Y., Chen, Y., Cheng, K.T., Chen, H., 2023. Few shot medical image segmentation with cross attention transformer, pp. 233–243.
- Liu et al. [2015] Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M., 2015. Sparse convolutional neural networks, in: Proceedings of the IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 806–814.
- Liu et al. [2022] Liu, Y., Liu, N., Yao, X., Han, J., 2022. Intermediate prototype mining transformer for few-shot semantic segmentation. Adv. Neural Inform. Process. Syst. (NIPS) 35, 38020–38031.
- Liu et al. [2020] Liu, Y., Zhang, X., Zhang, S., He, X., 2020. Part-aware prototype network for few-shot semantic segmentation, in: Proceedings of the Eur. Conf. Comput. Vis. (ECCV), Springer. pp. 142–158.
- Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the Int. Conf. Comput. Vis. (ICCV), pp. 10012–10022.
- Long et al. [2015] Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 3431–3440.
- Mao et al. [2022] Mao, B., Wang, L., Xiang, S., Pan, C., 2022. Task-aware adaptive attention learning for few-shot semantic segmentation. Neurocomputing 494, 104–115.
- Mehta et al. [2018] Mehta, S., Mercan, E., Bartlett, J., Weaver, D., Elmore, J.G., Shapiro, L., 2018. Y-net: joint segmentation and classification for diagnosis of breast biopsy images, in: Proceedings of the Med. Image Comput. and Computer-Assis. Interv. (MICCAI), Springer. pp. 893–901.
- Milletari et al. [2016] Milletari, F., Navab, N., Ahmadi, S.A., 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: Proceedings of the Int. Conf. 3D Vis. (3DV), IEEE. pp. 565–571.
- Noh et al. [2015] Noh, H., Hong, S., Han, B., 2015. Learning deconvolution network for semantic segmentation, in: Proceedings of the Int. Conf. Comput. Vis. (ICCV), pp. 1520–1528.
- Oktay et al. [2018] Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., et al., 2018. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 .
- Ouyang et al. [2020] Ouyang, C., Biffi, C., Chen, C., Kart, T., Qiu, H., Rueckert, D., 2020. Self-supervision with superpixels: Training few-shot medical image segmentation without annotation, in: Proceedings of the Eur. Conf. Comput. Vis. (ECCV), Springer. pp. 762–780.
- Ouyang et al. [2022] Ouyang, C., Biffi, C., Chen, C., Kart, T., Qiu, H., Rueckert, D., 2022. Self-supervised learning for few-shot medical image segmentation. IEEE Trans. Med. Imag. 41, 1837–1848.
- Pham et al. [2000] Pham, D.L., Xu, C., Prince, J.L., 2000. Current methods in medical image segmentation. Annu Rev Biomed Eng 2, 315–337.
- Ramachandran et al. [2019] Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J., 2019. Stand-alone self-attention in vision models. Adv. Neural Inform. Process. Syst. (NIPS) 32.
- Roy et al. [2020] Roy, A.G., Siddiqui, S., Pölsterl, S., Navab, N., Wachinger, C., 2020. ‘squeeze & excite’ guided few-shot segmentation of volumetric images. Med. Image. Anal. 59, 101587.
- Shaban et al. [2017] Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B., 2017. One-shot learning for semantic segmentation.
- Shen et al. [2023] Shen, Q., Li, Y., Jin, J., Liu, B., 2023. Q-net: Query-informed few-shot medical image segmentation. arXiv:2208.11451 URL: https://arxiv.org/abs/2208.11451.
- Sherer et al. [2021] Sherer, M.V., Lin, D., Elguindi, S., Duke, S., Tan, L.T., Cacicedo, J., Dahele, M., Gillespie, E.F., 2021. Metrics to evaluate the performance of auto-segmentation for radiation treatment planning: A critical review. Radiotherapy and Oncology 160, 185–191.
- Snell et al. [2017] Snell, J., Swersky, K., Zemel, R., 2017. Prototypical networks for few-shot learning. Adv. Neural Inform. Process. Syst. (NIPS) 30.
- Sun et al. [2022] Sun, L., Li, C., Ding, X., Huang, Y., Chen, Z., Wang, G., Yu, Y., Paisley, J., 2022. Few-shot medical image segmentation using a global correlation network with discriminative embedding. Comput. Biol. Med. 140, 105067.
- Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Adv. Neural Inform. Process. Syst. (NIPS) 30.
- Wang et al. [2020] Wang, H., Zhang, X., Hu, Y., Yang, Y., Cao, X., Zhen, X., 2020. Few-shot semantic segmentation with democratic attention networks, in: Proceedings of the Eur. Conf. Comput. Vis. (ECCV), Springer. pp. 730–746.
- Wang et al. [2019] Wang, K., Liew, J.H., Zou, Y., Zhou, D., Feng, J., 2019. PANet: Few-shot image semantic segmentation with prototype alignment, in: Proceedings of the Int. Conf. Comput. Vis. (ICCV), pp. 9197–9206.
- Wang et al. [2018] Wang, X., Girshick, R., Gupta, A., He, K., 2018. Non-local neural networks, in: Proceedings of the IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 7794–7803.
- Wu et al. [2022] Wu, H., Xiao, F., Liang, C., 2022. Dual contrastive learning with anatomical auxiliary supervision for few-shot medical image segmentation, in: Proceedings of the Eur. Conf. Comput. Vis. (ECCV), Springer. pp. 417–434.
- Xie et al. [2021] Xie, G.S., Liu, J., Xiong, H., Shao, L., 2021. Scale-aware graph neural network for few-shot semantic segmentation, in: Proceedings of the IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 5475–5484.
- Zhang et al. [2021] Zhang, B., Xiao, J., Qin, T., 2021. Self-guided and cross-guided learning for few-shot segmentation, in: Proceedings of the IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 8312–8321.
- Zhang et al. [2019] Zhang, C., Lin, G., Liu, F., Guo, J., Wu, Q., Yao, R., 2019. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation, in: Proceedings of the Int. Conf. Comput. Vis. (ICCV), pp. 9587–9595.
- Zhang et al. [2022] Zhang, S., Wu, T., Wu, S., Guo, G., 2022. Catrans: context and affinity transformer for few-shot segmentation, pp. 104–115.
- Zhu et al. [2022] Zhu, Q., Wang, H., Xu, B., Zhang, Z., Shao, W., Zhang, D., 2022. Multimodal triplet attention network for brain disease diagnosis. IEEE Trans. Image Process. 41, 3884–3894.
- Zhuang [2018] Zhuang, X., 2018. Multivariate mixture model for myocardial segmentation combining multi-source images. IEEE Trans. Pattern Anal. Mach. Intell. 41, 2933–2946.