Robust Prototypical Few-Shot Organ Segmentation with Regularized Neural-ODEs

Prashant Pandey, Mustafa Chasmai, Tanuj Sur, Brejesh Lall

Abstract

Despite the tremendous progress made by deep learning models in image semantic segmentation, they typically require large annotated examples, and increasing attention is being diverted to problem settings like Few-Shot Learning (FSL) where only a small amount of annotation is needed for generalisation to novel classes. This is especially seen in medical domains where dense pixel-level annotations are expensive to obtain. In this paper, we propose Regularized Prototypical Neural Ordinary Differential Equation (R-PNODE), a method that leverages intrinsic properties of Neural-ODEs, assisted and enhanced by additional cluster and consistency losses to perform Few-Shot Segmentation (FSS) of organs. R-PNODE constrains support and query features from the same classes to lie closer in the representation space thereby improving the performance over the existing Convolutional Neural Network (CNN) based FSS methods. We further demonstrate that while many existing Deep CNN-based methods tend to be extremely vulnerable to adversarial attacks, R-PNODE exhibits increased adversarial robustness for a wide array of these attacks. We experiment with three publicly available multi-organ segmentation datasets in both in-domain and cross-domain FSS settings to demonstrate the efficacy of our method. In addition, we perform experiments with seven commonly used adversarial attacks in various settings to demonstrate R-PNODE’s robustness. R-PNODE outperforms the baselines for FSS by significant margins and also shows superior performance for a wide array of attacks varying in intensity and design. ^†^†Copyright (c) 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected]. Prashant Pandey, Mustafa Chasmai and Brejesh Lall are with the Department of Electrical Engineering and Department of Computer Science, Indian Institute of Technology Delhi, New Delhi 110016, India. Tanuj Sur is with Chennai Mathematical Institute, India. The first two authors contributed equally to this work. Email: [email protected], [email protected], [email protected], [email protected].

Index Terms:

Few-shot Segmentation, Neural-ODEs, Adversarial Robustness, Medical Image Segmentation.

I Introduction

Deep learning-based methods are widely successful [1, 2] in image segmentation having dense pixel-level annotations. The fully-supervised methods perform extremely well with a large number of annotated examples. It may be quite straightforward for existing approaches to employ techniques like transfer learning and fine-tuning when well-organized large-scale datasets with natural images are easily available. This is not the case for medical image segmentation. In medical domains [3], it is not always feasible to collect dense pixel-level annotations, specifically when the medical images vary in image features or characteristics depending upon the location from where the images are collected, camera characteristics, modality (CT, MRI, X-Ray etc.) and the task (type of organ to segment) at hand.

While for some organs, the 3D scans may be available in abundance for analysis and learning, a few body organs need sophisticated costly medical devices to capture them thereby rendering the fully-supervised approaches mostly ineffective due to over-fitting with abundantly labelled examples and poor generalization on scarce classes/organs. Also, fully supervised methods would need to directly rely on a significant amount of annotations of these scarce or novel target organs to be able to perform well. Even transfer learning or fine-tuning methods won’t be effective due to the unavailability of generic large-scale medical datasets and the presence of significant domain shifts between base/seen classes and novel/unseen classes.

Refer to caption — Figure 1: Visualisation of the evolution of the features learnt (y-axis) across the layers in a CNN and across time (x-axis) for a Neural-ODE. Features $\mathrm{Z}_{1}$ and $\mathrm{Z}_{2}$ belong to the same class as $\mathrm{Z}$ . CNNs do not have explicit constraints on the nature of these features, and the output features of a slightly different/perturbed input or adversarial feature ( $\mathrm{Z}_{\delta}$ ) may be very different from the expected output ( $h_{\phi}(\mathrm{Z})$ ) of original feature $\mathrm{Z}$ . On the other hand, Neural-ODEs (bottom left) have the constraint that any two of these curves (feature transformations across the layers) may not intersect. Because features of the same class are being brought closer during training, the features of other samples (like $\mathrm{Z}_{\delta}$ ) are constrained to belong to the same class, resulting in robustness. In R-PNODE (top right), these constraints are further tightened by populating the sample space with additional Gaussian samples $Z^{1}_{G}$ and $Z^{2}_{G}$ .

Few-Shot Segmentation (FSS) comes to the rescue by leveraging approaches like metric learning [4, 5, 6, 7] (e.g. Prototypical Learning) to segment novel or rare classes/organs with examples varying from one to five per class. One of the major challenges arising from the implementation of screening and diagnosis programs is the enormous amount of CT images that must be analysed by medical practitioners. Automated systems are intended to make the interpretation of CT images faster and more accurate, thereby improving the cost-effectiveness of the screening program. Advanced computer-aided diagnosis systems (implemented with ML/Deep learning) have the potential to expedite this process and motivate the research community to effectively design such cost-effective automated systems like FSS that have the ability to learn well even with few annotated examples. An FSS model has the ability to segment novel classes by learning support and query image relationships from a completely disjoint set of base classes with an ample number of annotated examples that are freely available. During testing on novel classes, the model uses the learnt metric to segment unseen regions present in the query images with the help of a few labelled support images.

Past few years, many FSS methods on both natural [4, 5] and medical [6, 8, 9, 10, 11, 12] images are proposed that employ CNN based feature extractors. These methods are devoid of the capability to explicitly force prototypes of support images to lie closer to query image features that may lead to poor performance. This is further aggravated in medical domains where the novel/test query images can be slightly different from the images in the base classes due to variations in data modality, texture, tissue appearance and orientation, camera characteristics, colour intensity and size and shape of the target organs. These additional test query image characteristics can be regarded as perturbations of the query features and the learnt metric in prototypical FSS fails to capture these subtle variations. Further, due to a lack of well-annotated data, these models are vulnerable to several kinds of white and black-box adversarial attacks [13, 14, 15].

Medical diagnosis is often a safety-critical task, where a small mistake can cost a human life. In [16], the authors discuss how the activities of certain startups indicate that deep learning-based medical imaging systems are potentially applicable for clinical diagnosis in the near future. If indeed complete automation is to be achieved here, the entire pipeline should be able to either handle or at least detect any possible malicious attacks. Although still relatively under-explored, there are quite a few studies working on this problem. Some studies work on improving adversarial robustness [17, 18], some propose new benchmarking or testing strategies catered for the medical domain [19] and some work on detecting these adversarial attacks [16]. A survey done recently [20] discussed how a lack of a sufficient amount of high-quality images and labelled data in the medical domain is one of the main reasons for the weaknesses of adversarial defence mechanisms. In a few-shot setting, where the amount of labelled data is extremely low, these weaknesses are bound to be prominent.

ML practitioners employ FSS to learn patterns using well-annotated base classes, and finally to transfer the knowledge to scarcely annotated novel classes. This knowledge transfer is severely impacted in the presence of adversarial attacks when the support and query samples from novel classes are injected with adversarial perturbations and models fail to recognize organs, important clinical landmarks etc., present in the image thereby questioning the credibility of these FSS methods for medical image segmentation.

Common Adversarial Training mechanisms [13, 15, 21] require adversarially perturbed examples shown to the model during training. [22] introduced standard adversarial training (SAT) procedure for semantic segmentation. These methods do not guarantee defence when the type of attack differs from the adversarially perturbed examples [23, 24], and it is impractical to expose the model to different kinds of adversarial examples during training. Also, a common method that handles attacks both on support and query examples of novel classes, is non-existent. To the best of our knowledge, the adversarial attacks on few-shot segmentation (FSS) with Deep Neural models and their defence mechanisms have not yet been explored and the need for such robust models is inevitable.

To this end, we propose Regularized Prototypical Neural Ordinary Differential Equation (R-PNODE), a novel prototypical few-shot segmentation method based on Neural-ODEs [25] that provides robustness against slight variations in query image features and defence against different kinds of adversarial attacks in different settings. Owing to the fact that the integral curves of Neural-ODEs are non-intersecting, perturbations in the input lead to small changes in the output as opposed to existing FSS models with CNN-based feature extractors where the output is unpredictable. For a stable dynamical system, if an example x is slightly perturbed with $\epsilon$ such that the initial state of the example is $\textbf{x}_{0}=\textbf{x}+\epsilon$ , then at time $t\rightarrow\infty,~{}\textbf{x}_{t}\rightarrow\textbf{x}$ . This property is desirable for robustness against input perturbations, and to help promote stability in R-PNODE, we employ additive and multiplicative perturbations with Gaussian noise [26, 27, 28, 29, 30] that help to learn robust class-wise prototypes and query features. Additionally, we apply cluster loss that enforces prototypes perturbed with Gaussian noise and their non-perturbed counterparts to lie closer in the representation space, enabling R-PNODE to be robust against adversarially attacked support examples in novel classes. Similarly, we apply consistency loss between predictions of perturbed query features and the ground-truth labels of their corresponding non-perturbed (or clean) query inputs that render R-PNODE robust against various perturbations (adversarial/non-adversarial) of query sets.

Thus, R-PNODE forces query features to be constrained by the integral curves of support features belonging to the same class as that of the query and hence they lie closer in the representation space as shown in Fig. 1. This explicit constraining of features along with the proposed losses is absent in Neural-ODEs. In this paper, we make the following contributions:

1.

We propose a novel prototypical FSS method, R-PNODE, that leverages Neural-ODEs regularized with cluster and consistency losses to outperform existing few-shot organ segmentation methods by large margins.
2.

We extend Standard Adversarial Training to the FSS domain and handle attacks on both support and query.
3.

We demonstrate how our method is adversarially robust, with the ability to handle adversarial attacks like FGSM [13], PGD [15], BIM[14], CW [31], Auto-Attack[32], DAG [33] and SMIA [34] differing in intensity and design when applied to support or query image. We show R-PNODE’s efficacy by repeating these experiments on three publicly available multi-organ segmentation datasets BCV [35], CT-ORG [36] and DECATHLON [37] for both in-domain and cross-domain settings on novel classes.

II Related Works

II-A Neural-ODEs:

Deep learning models such as ResNets [38] learn a sequence of transformations by mapping input x to output y by composing a sequence of transformations to a hidden state. In a ResNet block, computation of a hidden layer representation can be expressed using the following transformation:

\mathrm{\textbf{h}(t+1)}=\mathrm{\textbf{h}(t)}+f_{\theta}\mathrm{(\textbf{h}(t),t)}

(1)

where $\mathrm{t\in\{0,\ldots,T\}}$ and $\mathrm{\textbf{h}}:[0,\infty]\rightarrow\mathbb{R}^{n}$ . As the number of layers is increased and smaller steps are taken, in the limit, the continuous dynamics of the hidden layers are parameterized using an ordinary differential equation (ODE) [25] specified by a neural network,

\frac{d\textbf{h}\mathrm{(t)}}{d\mathrm{t}}=f_{\theta}\mathrm{(\textbf{h}(t),t)}

(2)

where $f:\mathbb{R}^{n}\times[0,\infty]\rightarrow\mathbb{R}^{n}$ denotes the non-linear trainable layers parameterized by weights $\mathbf{\theta}$ . These layers define the relation between the input $\mathrm{h(0)}$ and output $\mathrm{h(T)}$ , at time $\mathrm{T}>0$ , by providing a solution to the ODE initial value problem at terminal time T. Neural-ODEs are the continuous equivalent of ResNets where the hidden layers can be regarded as discrete-time difference equations. The time here is a dummy variable loosely corresponding to the equivalent of layer number in ResNets, though it may not have a physical interpretation. The curve h(t), obtained by solving Eq. 2, is termed as an integral curve and denotes the $n$ -dimensional features at any intermediate time t. Different initial values h(0) lead to different particular solutions which can be thought of as different integral curves obtained as different inputs are passed through the Neural-ODE.

II-B Intrinsic Robustness of Neural-ODEs:

Multiple works have tried to explain the intrinsic robustness of Neural-ODEs. [25] argues that the non-intersecting property of the Neural-ODE’s integral curves allows the model to be intrinsically robust. Because the integral curves are governed by differential equations of the form shown in Eq. 2, it follows that if two integral curves intersect, their slopes at the point of intersection must be identical. Following the curves back to the y-axis (Fig. 2), it can be proved that any two intersecting integral curves have to be identical, or equivalently, distinct integral curves cannot intersect. As can be seen in Fig. 2, this leads to restrictions on the possible integral curves and this consequently improves robustness as illustrated in Fig. 1. We elaborate on this property, its proof and its consequences on robustness with a supplementary video presentation.

As an alternate theory to account for this intrinsic robustness, [39] shows that adaptive stepsize ODE solvers commonly used in Neural-ODEs tend to have a gradient-masking effect, and this allows the model to be robust to gradient-based attacks. Multiple studies [25, 40, 41] have applied Neural-ODEs to defend against adversarial attacks. [42] proposes time-invariant steady Neural-ODE that is more stable than conventional convolutional neural networks (CNNs) in the classification setting. [41] design a Neural-ODE such that the equilibrium points of the ODE solution have Lyapunov-stability, thereby suppressing input perturbations. [40] explores the equivalents of different regularising noise injection techniques like dropout and gaussian smoothing in Neural-ODEs, and argues how these techniques lead to better, more stable equilibrium points. [43] describe a framework that guarantees exponential convergence and improved adversarial robustness by using a novel Lyapunov loss, which they approximately optimise using a practical Monte Carlo approach.

II-C Few-shot Segmentation (FSS):

Few-shot segmentation (FSS) [44, 4, 45, 12, 8, 10, 11] aims to perform pixel-level classification for novel classes in a query image when trained on only a few labelled support images. A commonly adopted approach for FSS is based on prototypical networks [5, 7, 4, 46] that employ prototypes to represent typical information for foreground objects present in the support images. The prototype is subsequently compared with the query images pixel-wise via cosine similarity, leading to the segmentation mask. In addition to the prototype-based setting, [44] incorporates ‘squeeze & excite’ blocks that avoid the need for pre-trained models for medical image segmentation. [45] uses a relation network and introduced the FSS-1000 dataset that is significantly smaller as compared to contemporary large-scale datasets for FSS. The data-scarce medical domain has a unique requirement for specialized FSS algorithms [47, 46]. One common problem setting in this domain is that of organ segmentation, with openly available organ segmentation datasets [35, 36, 37] and specialized methods [48, 6] addressing their unique challenges. [48] used a bidirectional gated recurrent unit (GRU) to capture relationships of features across slices. [6] uses an interclass segmentation network for organ segmentation on a multi-institution dataset from prostate cancer patients. Many other methods [49, 50, 51] addressing different unique challenges can be found in literature, but we focus more on the commonly used prototypical networks. MetaNODE [52] was one of the first works to use Neural-ODEs in the Few-shot domain. Our work is quite different to MetaNODE as the latter addresses prototype bias reduction. Also, the core technique varies in these two cases. MetaNODE uses ODEs for meta-optimization on mean prototypes while we use ODEs for generating a representative prototype for each class.

II-D Adversarial robustness:

Adversarial attacks for natural image classification have been extensively explored, and interest is turning towards the effects of adversarial attacks in the medical domain [18, 53]. FGSM [13] and PGD [15] generate adversarial examples based on the CNN gradients. Besides image classification, several attack methods have also been proposed for semantic segmentation [54, 33, 34, 55]. [33] introduced Dense Adversary Generation (DAG) that optimizes a loss function over a set of pixels for generating adversarial perturbations. Recently, [34] proposed an adversarial attack (SMIA) for images in the medical domain that employs a loss stabilization term to exhaustively search the perturbation space. While adversarial attacks expose the vulnerability of deep neural networks, adversarial training [15, 13, 17] is effective in enhancing the target model by training it with adversarial samples. However, none of the existing methods has explored SAT procedure for few-shot semantic segmentation. In addition to the explicit use of adversarial samples during training, many methods have also explored the use of simple and fast Gaussian perturbations [26, 28, 27, 29, 30] as effective regularisation mechanisms for boosting adversarial robustness. Augmenting the training set with examples perturbed using Gaussian noise to increase adversarial robustness has been suggested in [56]. [26] uses Renyi divergence to illustrate the connection of robustness to random noise. [30] proposes a defence method against query-based black-box attacks, Random Noise Defense (RND), which adds proper Gaussian noise to each query. RND can be combined with existing defence methods like adversarial training to further boost the adversarial robustness.

Other than external procedures to make models robust, recent studies also propose models that are intrinsically robust. [57] uses a Feature Pyramid Decoder with denoising and image restoration that can be applied to any general CNN, improving robustness intrinsically. [58] uses blocks that denoise intermediate features, effectively restoring perturbed features back to the clean ones. The intrinsic adversarial robustness of Neural-ODEs [25, 41, 40, 43, 39] is being increasingly explored in recent years, making it an exciting field of research.

III Proposed Method

The objective is to build a highly accurate few-shot organ segmentation model that is also robust to adversarial attacks. After formalising the problem setting, our methodology focuses primarily on two aspects. First, we describe our proposed method R-PNODE and discuss why it should perform better than existing baselines. Next, we describe and extend SAT for FSS to serve as a strong baseline for comparing adversarial robustness, while also highlighting subtle differences that R-PNODE has from these traditional methods and how this help alleviates some of their drawbacks.

III-A Problem setting

FSS setting includes train $\mathcal{D}_{\mathrm{train}}$ and test $\mathcal{D}_{\mathrm{test}}$ datasets having non-overlapping class sets. Each dataset consists of a set of episodes with each episode containing a $N$ -way $K$ -shot task $\mathcal{T}_{i}$ = ( $\mathcal{S}_{i},\mathcal{Q}_{i}$ ) where $\mathcal{S}_{i}$ and $\mathcal{Q}_{i}$ is support and query sets for the $i^{th}$ episode having a class set $C_{i}$ . Formally, $\mathcal{D_{\mathrm{train}}}=\{(\mathcal{S}_{i},\mathcal{Q}_{i})\}^{E_{\mathrm{train}}}_{i=1}$ and $\mathcal{D_{\mathrm{test}}}=\{(\mathcal{S}_{i},\mathcal{Q}_{i})\}^{E_{\mathrm{test}}}_{i=1}$ where $E_{\mathrm{train}}$ and $E_{\mathrm{test}}$ denote the number of episodes during training and testing. The support set $\mathcal{S}_{i}$ has $K$ image $(\mathcal{I}_{\mathcal{S}})$ , mask $(L_{\mathcal{S}})$ pairs per class with a total of $N$ semantic classes i.e. $\mathcal{S}_{i}=\{(\mathcal{I}_{\mathcal{S}}^{k},L_{\mathcal{S}}^{k}(\eta))\}$ where $L^{k}_{\mathcal{S}}(\eta)$ is the ground-truth mask for $k$ -th shot corresponding to class $~{}\eta\in C_{i}$ , $|C_{i}|=N$ and $k=1,2,\ldots,K$ . The query set $\mathcal{Q}_{i}$ has $N_{\mathcal{Q}}$ $\mathcal{\mathrm{image}~{}(I_{Q})},{\mathrm{mask}~{}(L_{\mathcal{Q}})}$ pairs. The FSS model $\mathcal{F}(.)$ is trained on $\mathcal{D}_{\mathrm{train}}$ across the episodes with support sets and query images as inputs, and predicts the segmentation mask $M_{\mathcal{Q}}$ = $\mathcal{F}(\mathcal{S}_{i},\mathcal{I}_{\mathcal{Q}})$ in the $i$ -th episode for query image $\mathcal{I}_{\mathcal{Q}}$ . During testing, the trained model $\mathcal{F}(.)$ is used to predict masks for unseen novel classes with the corresponding support set samples and query images as inputs from $\mathcal{D}_{\mathrm{test}}$ . Further, the trained FSS model is adversarially attacked to record the drop in performance. An adversarial version of a clean sample can be generated by exploiting gradient information from the model $\mathcal{F}(.)$ employing [13]. Specific to the case of FSS, the prediction of query masks not only depends on the query image but also the information from the support set. This enables the attacks to be designed in such a way that either attacked query or support can deteriorate the query prediction. These perturbations are specifically chosen so that the loss between ground truth and the predicted masks of the query increases.

III-B Regularized Prototypical Neural-ODE (R-PNODE)

The proposed R-PNODE is based on existing prototypical few-shot segmentation models [7, 4]. Given an episode $i$ with task $\mathcal{T}_{i}$ = ( $\mathcal{S}_{i},\mathcal{Q}_{i}$ ), the feature extractor $f_{\theta}$ generates intermediate feature representations ${Z}^{k}_{\mathcal{S}}$ and ${Z}_{\mathcal{Q}}$ for the support and query images $\mathcal{I}^{k}_{\mathcal{S}}$ , $\mathcal{I}_{\mathcal{Q}}$ . The outputs from the feature extractor $f_{\theta}$ are considered as initial states for the Neural-ODE block at time $t=0$ , denoted as ${Z}^{k}_{\mathcal{S}}(0)$ and ${Z}_{\mathcal{Q}}(0)$ respectively. From each support and query image, an additional image is generated by applying multiplicative H $\times$ W dimensional i.i.d. Gaussian noise, denoted mathematically by:

\begin{split}\mathcal{I}^{kG}_{\mathcal{S}}=\mathcal{I}^{k}_{\mathcal{S}}+\mathcal{I}^{k}_{\mathcal{S}}\odot M\mathrm{\ \ where\ \ }M\sim N_{H\times W}(0,\sigma^{2})\end{split}

(3)

\begin{split}\mathcal{I}^{G}_{\mathcal{Q}}=\mathcal{I}_{\mathcal{Q}}+\mathcal{I}_{\mathcal{Q}}\odot M\mathrm{\ \ where\ \ }M\sim\mathcal{N}_{H\times W}(0,\sigma^{2})\end{split}

(4)

where $\odot$ is the Hadamard operator, for element-wise multiplication. The standard deviation $\sigma$ is a hyperparameter. Similar to the clean images, their corresponding Gaussian samples are also passed through the feature extractor to obtain intermediate features ${Z}^{kG}_{\mathcal{S}}(0)$ and ${Z}^{G}_{\mathcal{Q}}(0)$ . The Neural-ODE block consists of hidden layers $h_{\phi}$ parameterized by $\phi$ and its dynamics are governed by $h_{\phi}$ which control how the intermediate state changes at any given time $t$ . The output representation at fixed terminal time $T(T>0)$ for query features $Z_{\mathcal{Q}}$ is given by:

\begin{split}Z_{\mathcal{Q}}(T)=Z_{\mathcal{Q}}(0)+\int^{T}_{0}h_{\phi}(Z_{\mathcal{Q}}(t),t)dt\\ \mathrm{where}\quad\frac{d{Z_{\mathcal{Q}}}(t)}{d\mathrm{t}}=h_{\phi}({Z}_{\mathcal{Q}}(t),t)\end{split}

(5)

Similarly, the output representation at fixed terminal time $T(T>0)$ for support features $Z^{k}_{\mathcal{S}}$ are generated. The support feature maps $Z^{k}_{\mathcal{S}}(T)$ from the Neural-ODE block of spatial dimensions ( $H^{\prime}\times W^{\prime}$ ) are upsampled to the same spatial dimensions of their corresponding masks ${L}_{\mathcal{S}}$ of dimension $(H\times W$ ). Inspired by late fusion [4] where the ground-truth labels are masked over feature maps, we employ Masked Average Pooling (MAP) between $Z^{k}_{\mathcal{S}}(T)$ and ${L}^{k}_{\mathcal{S}}(\eta)$ to form a $d$ -dimensional prototype $p(\eta)$ for each foreground class $\eta\in C_{i}$ as shown:

p(\eta)=\dfrac{1}{K}\sum_{k}\dfrac{\sum_{x,y}\{Z^{k}_{\mathcal{S}}(T)\}^{(x,y)}\cdot\mathbbm{1}{[\{L^{k}_{\mathcal{S}}(\eta)\}^{(x,y)}=\eta]}}{\sum_{x,y}\mathbbm{1}{[\{L^{k}_{\mathcal{S}}(\eta)\}^{(x,y)}=\eta]}}

(6)

where $(x,y)$ are the spatial locations in the feature map and $\mathbbm{1}(.)$ is an indicator function. The background is also treated as a separate class and the prototype for it is calculated by computing the feature mean of all the spatial locations excluding the ones that belong to the foreground classes. The probability map over semantic classes $\eta$ is computed by measuring the cosine similarity (cossim) between each of the spatial locations in $Z_{\mathcal{Q}}(T)$ with each prototype $p(\eta)$ as given by:

{{M}}_{\mathcal{Q}}^{(x,y)}(\eta)=\dfrac{\mathrm{exp}(\mathrm{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}cossim}}(\{Z_{\mathcal{Q}}(T)\}^{(x,y)},p(\eta)))}{\sum_{\eta^{\prime}\in\mathcal{C}_{i}}\mathrm{exp}{(\mathrm{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}cossim}}(\{Z_{\mathcal{Q}}(T)\}^{(x,y)},p(\eta^{\prime}))}}

(7)

The denominator involves a summation over all foreground classes $\eta^{\prime}\in C_{i}$ . The predicted mask $M_{\mathcal{Q}}$ is generated by taking the argmax of $M_{\mathcal{Q}}(\eta)$ across semantic classes. We use Binary Cross Entropy loss $\mathcal{L}_{CE}$ between $M_{\mathcal{Q}}$ and the ground-truth mask for training.

R-PNODE is further optimized using two additional losses. Eq. 5 and Eq. 7 hold for the corresponding features of the gaussian perturbed images as well to get $Z_{S}^{kG}(T)$ and ${M}^{G}$ for the support and query respectively. The cluster loss ( $\mathcal{L}_{CL}$ ) maximises the cosine similarity between each feature $Z$ extracted from clean support samples and their corresponding Gaussian samples as shown below:

\begin{split}\mathcal{L}_{CL}=1-\frac{1}{K}\sum^{K}_{k=1}d_{k}~{}~{},\mathrm{where}~{}d_{k}=\mathrm{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}cossim}}(Z_{S}^{k}(T),Z_{S}^{kG}(T))\end{split}

(8)

The cluster loss helps to bring the support features and their corresponding gaussian perturbations closer in the representation space which generates robust prototypes that are more representative of the class. This helps to learn better relationships between support and query examples which results in better mask predictions. This technique of augmenting the support images with different gaussian perturbations is not explored by the previous methods. It improves both generalizability and robustness as observed from the results in Table VIII. Further, to make the query features robust, we employ consistency loss ( $\mathcal{L}_{CON}$ ) to minimize the difference between Gaussian perturbed query images ( ${M}^{G}_{\mathcal{Q}}$ ) and their corresponding ground-truth labels ( $L_{\mathcal{Q}}$ ). For the sake of uniformity, we extend the same loss $\mathcal{L}_{CE}$ as the objective function.

\mathcal{L}_{CON}=-\dfrac{1}{N}\sum_{x,y}\sum_{\eta^{\prime}\in\mathcal{C}}\mathbbm{1}[\{L_{\mathcal{Q}}(\eta^{\prime})\}^{(x,y)}=\eta^{\prime}]\log{M}^{G(x,y)}_{\mathcal{Q}}(\eta^{\prime})

(9)

$\mathcal{L}_{CL}$ helps to predict more realistic masks as predicted masks are dependent on the quality of the prototypes and aids $\mathcal{L}_{CON}$ in matching the predicted masks from gaussian perturbed query images with the ground truth masks which further improves robustness and generalizability.

For the overall procedure, refer to Algorithm 1 and Fig. 3. R-PNODE’s superior performance for few-shot segmentation is attributed primarily to two factors. First is the fact that the integral curves learnt by a Neural-ODE are always non-intersecting. We expect a good model to be such that the query and support features of the same class are close to each other. The non-intersecting properties of Neual-ODEs essentially lead to explicit constraints on the features learnt. Algorithm 1 Regularized Prototypical Neural-ODEs for FSS 0: Clean training data $\mathcal{D_{\mathrm{train}}}=\{(\mathcal{S}_{i},\mathcal{Q}_{i})\}^{E_{train}}_{i=1}$ , segmentation network $\mathcal{F}(.)$ . 1: for $\mathrm{i}\in\{1,\ldots,E_{train}\}$ do 2: Sample episode $\mathcal{E}^{\mathrm{orig}}=\{(\mathcal{S}_{\mathrm{i}},\mathcal{Q}_{\mathrm{i}})\}$ from $\mathcal{D_{\mathrm{train}}}$ . 3: Get gaussian perturbed samples $\mathcal{S}_{\mathrm{i}}^{G}$ and $\mathcal{Q}_{\mathrm{i}}^{G}$ . 4: Get prototypes $p({\eta})$ and features $Z_{S}^{k}(T)$ from $\mathcal{S}_{\mathrm{i}}$ . Use prototypes to get predictions $M_{\mathcal{Q}_{\mathrm{i}}}$ and $M_{\mathcal{Q}_{\mathrm{i}}}^{G}$ . 5: Apply $\mathcal{L}_{CE}$ on mask predictions $M_{\mathcal{Q}_{\mathrm{i}}}$ and $\mathcal{L}_{CON}$ on $M_{\mathcal{Q}_{\mathrm{i}}}^{G}$ using Eq. 9. 6: Extract features $Z_{S}^{kG}(T)$ for each perturbed support sample and compare with clean $Z_{S}^{k}(T)$ to get $\mathcal{L}_{CL}$ by Eq. 8. 7: Backpropagate losses $\mathcal{L}_{CE}+\alpha.\mathcal{L}_{CON}+\beta.\mathcal{L}_{CL}$ and update parameters of $\mathcal{F}(.)$ . 8: end for

The query features are constrained by the surrounding integral curves of those support prototype features that belong to the same class. Thus, the query features tend to be closer to the prototype features in the representation space. This leads to the accurate classification of query features which ultimately leads to the accurate segmentation of the query images. Fig. 1 tries to demonstrate this very process.

The second factor contributing to the performance of R-PNODE is the addition of cluster and consistency losses. As shown in Fig. 1, the gaussian samples serve as additional reference points in the feature space, and by bringing the final features of these closer to the original, the final features of all samples within this larger space are essentially constrained to be very close to each other.

Extending the reasoning for the superior FSS performance of R-PNODE, the adversarial robustness is again attributed to the intrinsic property of non-intersecting curves of Neural-ODEs along with the proposed losses. As the integral curves of the support features imposed constraints on the features of the query, in a similar way, the clean features also impose constraints on the features of adversarially perturbed images. Thus, features for perturbed support and query tend to lie closer to the features of the clean ones giving rise to adversarial robustness.

An important difference from the usual adversarial perturbation methods is how the support is perturbed here. Although the ground-truth masks for support images are also available, it makes more sense in the FSS task to take the loss of query prediction using the corresponding support samples. Thus, in Eq. 10, the loss is between the query mask and query label, but the gradients are calculated w.r.t the support image. We take batches of either the query or the support perturbed, but not both together. Generating gaussian perturbations is not computationally expensive when compared to generating adversarial perturbations using gradient-based methods like PGD, SMIA or even FGSM. Thus, R-PNODE has a much smaller computational overhead than SAT. Note that the purpose of the Gaussian samples in R-PNODE is not the same as the adversarial samples in SAT. While SAT uses the perturbations to learn better features on the expected test adversarial distribution, R-PNODE uses gaussian perturbations to populate the training input feature space with more samples thereby having more constrained integral curves. Additionally, note that R-PNODE is attack-agnostic, and thus, its robustness should generalize well to multiple attack strategies. Thus, R-PNODE is able to overcome many of the limitations prevalent in adversarial training methods.

IV Implementation Details

The proposed R-PNODE consists of a CNN-based feature extractor followed by a Neural-ODE block consisting of 3 convolutional layers as its hidden dynamics. The architecture of R-PNODE (Fig. 4) consists of a total of 14.7M trainable parameters, while the baseline models PANet and FSS-1000 have 14.7M and 18.6M parameters, respectively. Taking motivation from [4, 48], we use a hidden dimension of d = 512 (output from VGG). The higher order Runge-Kutta solver [25] is used as the black-box approximate solver for the Neural-ODE block. All noises are added at the input level from a $H\times$ W dimensional normal distribution with mean 0 and standard deviation $\sigma=0.1$ . Although we perform our experiments on 1-way {1,3}-shot setting, the same approach may be extended to the general n-way k-shot setting as well. To understand the effect of adversarial training on prototypical networks, we employ SAT with PANet [4] and name it AT-PANet (Adversarially Trained PANet). AT-PANet is trained with FGSM attack of intensity $\epsilon=0.025$ . To test all the trained models, we perturb the support and query images by setting $\epsilon=0.02,0.01~{}\mathrm{and}~{}0.04$ for FGSM, PGD and SMIA, respectively. For the iterative adversarial attacks SMIA and PGD, we take 10 iterations each. For BIM, CW and DAG, we choose $\epsilon=0.02$ . We take motivation from [19] for the implementation of Auto-Attack and set 0.1 as the dice threshold and 0.01 as the $\epsilon$ . These hyperparameters for the attacks were chosen so as to limit the perturbations to human perceptibility. The same set of attack hyperparameters is used for both support and query attacks. The loss weighting parameters $\alpha$ and $\beta$ as mentioned in Algorithm 1 are set to 0.001 ad 0.01 respectively, using grid search. All the baselines are trained using the exact same setting as in R-PNODE. The train, validation and test sets are exactly identical for all the baselines and R-PNODE. The attack’s intensities and crop settings are consistent across the baselines. We have used a comparable feature extractor to R-PNODE for all the baselines. We use one A100 GPU to conduct our experiments. For statistical significance, we run each experiment twice and report mean and standard deviations. Our implementations, trained models and processed data can be found here.

V Experiments and Results

V-A Dataset Description

We experiment on three publicly available multi-organ segmentation datasets, BCV [35], CT-ORG [36], and Decathlon [37] to evaluate the generalisability of our method. For the in-domain setting, we split BCV into the train (seen) and test (novel) classes. Of the total BCV dataset available, we use half of the dataset for training (6506 slices), $\frac{1}{6}\mathrm{th}$ for validation (1779 slices) and $\frac{1}{3}\mathrm{rd}$ for testing (3873 slices). It is to be noted that the dataset splits are done according to the subjects. All slices of any subject will belong to either of the train, validation or test subset. So, the model will not be tested on a slice of any subject it has seen during training. For the cross-domain FSS, we train on the BCV dataset (with seen classes) and use novel classes from CT-ORG and Decathlon to test. To have a more uniform size of the test set, we sample 500 random slices per organ from the much larger CT-ORG dataset. We sample 500 and 325 slices from the Liver and Spleen test organs respectively, in the Decathlon dataset. For the 3D volumes in all three datasets, we extract slices with non-empty masks and divide them into the fixed train, test, and validation splits. [48] cropped the slices based on the ROI for each organ using available masks. We follow a more challenging setting, where no such cropping is applied (Fig. 5), thereby depriving any unfair advantage due to leakage of localisation information. Of the available organs, we report results on the Liver and Spleen (as novel classes) due to their medical significance and availability across datasets.

V-B Choice of Baselines

For baseline comparisons on few-shot organ segmentation, we experiment with traditional FSS methods, PANet [4], FSS-1000 [45] and SENet [44]. We compare with the recently proposed BiGRU [48], achieving state-of-the-art results on the selected datasets (in the ROI-cropped setting). Additionally, we compare with a recently proposed Neural-ODE based method SONet [39]^*^**SONet was requiring exorbitant amounts of time for each experiment, so we perform only one run for it.. To provide a fair comparison, we use skew-symmetric and input-output stable Neural-ODE blocks of SONet and extend the model to have VGG-like architecture. Finally, to demonstrate adversarial robustness, we compare with AT-PANet, which is PANet [4] trained with the modified adversarial training procedure described in section V-B.

Adversarial Training

To effectively demonstrate the adversarial robustness of R-PNODE, we experiment with Standard Adversarial Training (SAT) as a competitive baseline. [59] recently proposed Adversarial Querying to handle adversarial attacks on the query in few-shot learning. Here, in every task, the query was adversarially perturbed and then standard meta-learning based training was performed with the perturbed query. We first extend this method to segmentation with prototypical FSS models. Next, we modify it to handle both query and support attacks while also focusing on preserving clean accuracy. To accomplish this, we extend $\mathcal{D_{\mathrm{train}}}$ for each batch during training with two additional batches using the update rule from [13] as follows:

generate adversarial example $\mathcal{I^{\mathrm{adv}}_{S}}$ for support image $\mathcal{I}_{\mathcal{S}}$ :

\mathcal{I^{\mathrm{adv}}_{S}}=\mathcal{I_{S}}+\epsilon.\mathrm{sign}(\nabla_{\mathcal{I_{S}}}\mathcal{L(F}(\mathcal{S}_{i},\mathcal{I_{Q}}),L_{\mathcal{Q}}(\eta)))

(10)

generate adversarial example $\mathcal{I^{\mathrm{adv}}_{Q}}$ for query image $\mathcal{I}_{\mathcal{Q}}$ :

\mathcal{I^{\mathrm{adv}}_{Q}}=\mathcal{I_{Q}}+\epsilon.\mathrm{sign}(\nabla_{\mathcal{I_{Q}}}\mathcal{L(F}(\mathcal{S}_{i},\mathcal{I_{Q}}),L_{\mathcal{Q}}(\eta)))

(11)

$\epsilon$ bounds the $l_{\infty}$ (L-infinity) norm of the perturbation. ‘sign’ is the Signum function. $\nabla_{\mathcal{I}}$ is the gradient of the loss function $\mathcal{L}$ with respect to the input image $\mathcal{I}$ . While the original batch focuses on training the model for clean samples, the batch from Eq. 10 will improve robustness against support attacks and the batch from Eq. 11 will improve robustness against query attacks.

TABLE I: Few-Shot organ segmentation results on 1-shot setting for BCV in-domain and cross-domain (BCV

\rightarrow

CT-ORG and BCV

\rightarrow

Decathlon) settings for Liver and Spleen organs (novel classes). The dice scores are rounded off to two decimals.

Method	Venue	BCV $\rightarrow$ BCV		BCV $\rightarrow$ CT-ORG	BCV $\rightarrow$ Decathlon
Method	Venue	Liver	Spleen	Liver	Liver	Spleen
PANet[4]	CVPR’19	0.61 ± 0.01	0.38 ± 0.03	0.52 ± 0.01	0.53 ± 0.01	0.43 ± 0.04
FSS-1000[45]	CVPR’20	0.37 ± 0.04	0.41 ± 0.02	0.29 ± 0.04	0.37 ± 0.06	0.39 ± 0.01
SENet[44]	MedIA’20	0.61 ± 0.01	0.57 ± 0.01	0.47 ± 0.01	0.50 ± 0.01	0.53 ± 0.01
BiGRU[48]	AAAI’21	0.51 ± 0.01	0.56 ± 0.01	0.43 ± 0.05	0.46 ± 0.05	0.46 ± 0.05
SONet[39]	MSML’22	0.56 ± 0.00	0.32 ± 0.00	0.28 ± 0.00	0.51 ± 0.00	0.31 ± 0.00
AT-PANet	-	0.65 ± 0.01	0.46 ± 0.01	0.56 ± 0.01	0.57 ± 0.01	0.45 ± 0.01
R-PNODE	-	0.79 ± 0.01	0.64 ± 0.02	0.67 ± 0.01	0.79 ± 0.01	0.57 ± 0.01

V-C Choice of Attacks

We experiment on 6 different attacks, some traditional and some state-of-the-art. We first use FGSM[13], one of the most commonly used traditional single-step attack methods. While BIM [14] and PGD[15] are some basic iterative methods that extend single-step attacks, CW [31] is a more sophisticated one. Auto-Attack[32] is an adaptive method involving an ensemble of four diverse attacks to reliably evaluate robustness. While these attacks were originally proposed for classification, we extend them to segmentation in our experiments. DAG [33] was one of the first methods for adversarial attacks specific to segmentation. SMIA[34] was another segmentation attack proposed quite recently, that focuses on the medical domain, and is shown to perform better than many other methods. With this broad spectrum of attacks covering the most important categories, we are able to effectively demonstrate the efficacy of our proposed approach.

V-D Results and Discussion

V-D1 Few-Shot Organ Segmentation

We report the results for 1-shot organ segmentation in Table I and those for 3-shot in Fig. 6. R-PNODE outperforms all compared baselines by significant margins across all organs both for in-domain and cross-domain settings. On average, R-PNODE outperforms all baselines by at least 29%, 20% and 33% for BCV in-domain, BCV $\rightarrow$ CT-ORG cross-domain and BCV $\rightarrow$ Decathlon cross-domain, respectively. R-PNODE outperforms the baselines in the 3-shot setting as well, as evident from Fig. 6. It is able to outperform PANet, AT-PANet and SONet, even though they have an almost identical number of parameters. R-PNODE learns a better representation space of support and query samples owing to the continuous dynamics of the Neural-ODE, resulting in superior performance.

Since the Liver organ had a significant number of annotations in the BCV dataset, we tried an experiment training Liver in a fully supervised manner. With the same backbone as R-PNODE and a U-Net style architecture, we observed that even with full supervision, the dice score was 0.84, which is not too high compared to our FSS result of 0.79 (in-domain setting in Table I) considering the amount of supervision used. Given that we want to automate the segmentation process, this tradeoff is justifiable. Hence, there is a significant scope of methods like ours in the wide spectrum between fully automated and fully manual. We also compare with feature-based methods like Level Set [60] and CRF [61] as shown in Table I. It can be seen that although Level Set [60] performs well for the Liver organ, both perform quite poorly for the smaller and harder-to-segment Spleen organ.

V-D2 Adversarial Robustness

As can be seen in Table II and Table III, R-PNODE outperforms all of our baselines by at least 15%, 25%, 17%, 16%, 16%, 45% and 10% on FGSM, PGD, SMIA, BIM, CW, DAG and Auto-Attack query attacks respectively for the BCV in-domain Spleen setting. We report results on the cross-domain setting in Table IV and Table V. As is evident from these tables, similar to distribution shifts between clean and perturbed samples, R-PNODE is also robust to cross-domain distribution shifts.

TABLE II: Query attack results (higher is better) for BCV

\rightarrow

BCV 1-shot in-domain for Liver organ (novel class).
The dice scores are rounded off to two decimals.

Method	FGSM	PGD	SMIA	BIM	CW	DAG	Auto-Attack
PANet[4]	0.29 ± 0.01	0.21 ± 0.01	0.20 ± 0.01	0.20 ± 0.01	0.29 ± 0.04	0.19 ± 0.03	0.08 ± 0.01
FSS-1000[45]	0.10 ± 0.03	0.04 ± 0.02	0.18 ± 0.01	0.05 ± 0.04	0.33 ± 0.06	0.06 ± 0.04	0.01 ± 0.01
SENet[44]	0.30 ± 0.06	0.22 ± 0.02	0.12 ± 0.02	0.19 ± 0.02	0.33 ± 0.01	0.32 ± 0.01	0.07 ± 0.01
BiGRU[48]	0.16 ± 0.12	0.05 ± 0.05	0.35 ± 0.01	0.16 ± 0.01	0.08 ± 0.02	0.01 ± 0.01	0.02 ± 0.02
SONet[39]	0.19 ± 0.00	0.11 ± 0.00	0.06 ± 0.00	0.09 ± 0.00	0.00 ± 0.00	0.06 ± 0.00	0.03 ± 0.01
AT-PANet	0.35 ± 0.03	0.27 ± 0.02	0.36 ± 0.01	0.28 ± 0.04	0.11 ± 0.06	0.31 ± 0.06	0.08 ± 0.02
R-PNODE	0.41 ± 0.02	0.27 ± 0.01	0.53 ± 0.03	0.29 ± 0.06	0.55 ± 0.07	0.34 ± 0.04	0.10 ± 0.03

TABLE III: Query attack results (higher is better) for BCV

\rightarrow

BCV 1-shot in-domain for Spleen organ (novel class).
The dice scores are rounded off to two decimals.

Method	FGSM	PGD	SMIA	BIM	CW	DAG	Auto-Attack
PANet[4]	0.16 ± 0.01	0.11 ± 0.01	0.07 ± 0.01	0.12 ± 0.01	0.03 ± 0.01	0.02 ± 0.01	0.06 ± 0.01
FSS-1000[45]	0.19 ± 0.01	0.08 ± 0.01	0.17 ± 0.01	0.05 ± 0.02	0.19 ± 0.10	0.04 ± 0.04	0.04 ± 0.01
SENet[44]	0.04 ± 0.01	0.21 ± 0.04	0.01 ± 0.01	0.25 ± 0.05	0.25 ± 0.03	0.12 ± 0.04	0.01 ± 0.01
BiGRU[48]	0.19 ± 0.01	0.03 ± 0.01	0.15 ± 0.01	0.19 ± 0.01	0.01 ± 0.01	0.01 ± 0.01	0.01 ± 0.01
SONet[39]	0.21 ± 0.00	0.08 ± 0.00	0.14 ± 0.00	0.09 ± 0.00	0.00 ± 0.00	0.01 ± 0.00	0.05 ± 0.01
AT-PANet	0.32 ± 0.08	0.19 ± 0.03	0.11 ± 0.01	0.16 ± 0.02	0.02 ± 0.02	0.06 ± 0.04	0.10 ± 0.07
R-PNODE	0.37 ± 0.02	0.28 ± 0.03	0.20 ± 0.02	0.29 ± 0.05	0.29 ± 0.02	0.22 ± 0.01	0.11 ± 0.02

TABLE IV: Query attack results (higher is better) for BCV

\rightarrow

CT-ORG 1-shot for Liver. The dice scores are rounded off to two decimals.

Method	FGSM	PGD	SMIA
PANet[4]	0.16 ± 0.03	0.14 ± 0.01	0.20 ± 0.01
FSS-1000[45]	0.05 ± 0.01	0.01 ± 0.01	0.14 ± 0.02
SENet[44]	0.14 ± 0.04	0.11 ± 0.02	0.15 ± 0.05
AT-PANet	0.23 ± 0.08	0.14 ± 0.01	0.31 ± 0.01
R-PNODE	0.32 ± 0.02	0.22 ± 0.02	0.42 ± 0.02

TABLE V: Query attack results (higher is better) for BCV

\rightarrow

Decathlon 1-shot for Liver. Dice scores are rounded off to two decimals.

Method	FGSM	PGD	SMIA
PANet[4]	0.18 ± 0.03	0.13 ± 0.01	0.22 ± 0.01
FSS-1000[45]	0.09 ± 0.05	0.03 ± 0.02	0.12 ± 0.02
SENet[44]	0.18 ± 0.06	0.13 ± 0.02	0.14 ± 0.07
AT-PANet	0.35 ± 0.04	0.19 ± 0.04	0.29 ± 0.01
R-PNODE	0.40 ± 0.01	0.25 ± 0.04	0.45 ± 0.01

We also evaluate R-PNODE on support attacks and compare its performance with other prototypical networks. The results of these experiments are reported in Table VI and Table VII. We observed much lower deterioration of performance with non-prototypical networks with the support attacks and believe that the support attack strategy used by us is not suitable for them. R-PNODE outperforms other prototypical baselines in most settings and improves by a considerable margin in many. Thus, it is adversarially robust to perturbations in the support images too. The improvements for different attacks are much more skewed for support attacks when compared to query attacks. For example, while Liver BCV in-domain improves by over 100% for PGD attacks, it shows only 9% improvement on SMIA. In other settings as well, the performance improvement on SMIA is considerably small. The performance on other attacks is so good that they even outperform the clean performance of some baselines. For example, the dice scores on Liver in-domain setting of R-PNODE for FGSM and PGD (Table VI) are higher than that for PANet and AT-PANet for clean data in the same setting (Table I). Another interesting observation is that while AT-PANet outperforms PANet for all the in-domain experiments, this trend is not followed for the cross-domain ones. This hints towards possible drawbacks of the SAT defence mechanism. In addition to these experiments for the 1-shot setting, we also test the performance in the 3-shot setting. The results in Fig. 8 demonstrate that R-PNODE performs better than our baselines in this setting as well.

TABLE VI: Support attack results (higher is better) with prototypical networks for BCV

\rightarrow

BCV 1-shot in-domain for Liver and Spleen as novel classes.

Organ	Method	FGSM	PGD	SMIA
Liver	PANet[4]	0.17 ± 0.06	0.39 ± 0.01	0.11 ± 0.01
	AT-PANet	0.19 ± 0.07	0.30 ± 0.15	0.11 ± 0.01
	R-PNODE	0.67 ± 0.01	0.68 ± 0.08	0.12 ± 0.01
Spleen	PANet[4]	0.17 ± 0.02	0.10 ± 0.02	0.03 ± 0.01
	AT-PANet	0.42 ± 0.03	0.32 ± 0.06	0.03 ± 0.01
	R-PNODE	0.38 ± 0.01	0.40 ± 0.02	0.04 ± 0.02

TABLE VII: Support attack results (higher is better) with prototypical networks for BCV

\rightarrow

CT-ORG and BCV

\rightarrow

Decathlon 1-shot settings for Liver.

Dataset	Method	FGSM	PGD	SMIA
CT-ORG	PANet[4]	0.20 ± 0.05	0.43 ± 0.02	0.14 ± 0.01
	AT-PANet	0.12 ± 0.02	0.32 ± 0.08	0.14 ± 0.01
	R-PNODE	0.51 ± 0.01	0.52 ± 0.02	0.15 ± 0.01
Decathlon	PANet[4]	0.15 ± 0.01	0.35 ± 0.02	0.13 ± 0.01
	AT-PANet	0.14 ± 0.04	0.31 ± 0.06	0.13 ± 0.01
	R-PNODE	0.66 ± 0.01	0.66 ± 0.02	0.13 ± 0.01

We visualise the features ( $Z_{Q}(T)$ and $Z_{S}(T)$ ) learnt by R-PNODE and the baseline AT-PANet using t-SNE plots in Fig. 7 for query attacks. Here, the d-dimensional support prototypes are directly visualised, but for the query, the features of each pixel of an image are shown separately. It can be seen that while both models have query pixels that are significantly perturbed, the number of such perturbations is much smaller in the case of R-PNODE. Also, here most query pixels stayed closer to the same prototype cluster after perturbation than they were before. Consequently, the perturbation would not cause their misclassification.

We visualise the predictions by each of these models for the different attacks in Fig. 9. It can be seen that the predictions by R-PNODE are closest to the ground-truth labels. For the clean samples, SENet, AT-PANet and R-PNODE are visually very similar, while FSS-1000 performs quite poorly and PANet predicts an extra object. For FGSM, AT-PANet performs much better than PANet, most likely because of encountering similar data during training. For PGD, only AT-PANet and R-PNODE are able to predict a mask that even resembles the ground truth. While FSS-1000 predicts poorly for clean, FGSM and PGD, it surprisingly performs well for SMIA along with R-PNODE, which also reflects in the dice scores in Table III.

VI Ablation Studies

We perform further experiments to understand the relative contributions of the various components in R-PNODE. We use the Spleen BCV in-domain setting to perform these experiments and report the results in Table VIII. If the cluster and consistency losses are removed, the performance with different attacks drops by roughly 5-10%. R-PNODE is inherently robust due to the intrinsic robustness provided by Neural-ODEs. The two losses then further improve robustness by regularizing it, leading to the improvements reflected here. Adding each loss gives some improvement in particular attacks and both losses together give the best overall performance. They give an improvement of 6 absolute points on clean samples (from 58 to 64) on a scale of 0 to 100. Similarly, we get an improvement of 5 absolute points for FGSM (32 to 37), 4 absolute points for PGD (24 to 28) and 1 absolute point for SMIA (19 to 20). We also added these losses to PANet[4] and observed 1.6 and 7 absolute points improvement for Clean and PGD test queries, respectively over the baseline with the cluster loss. With consistency loss, we got 3 and 8 absolute points improvement for FGSM and PGD test queries, respectively over the baseline. We perform an experiment where we removed the Neural-ODE block, gaussian perturbations and losses from R-PNODE that resulted in an FSS model with only a vanilla CNN block which we refer to as Vanilla CNN method in Table VIII. As can be seen, there is a significant drop from R-PNODE’s original performance and thus, adding the Neural-ODE and losses helped improve the method’s performance.

TABLE VIII: Ablation studies of different components of R-PNODE for BCV

\rightarrow

BCV in-domain for Spleen organ.

Method	Clean	FGSM	PGD	SMIA
Vanilla CNN	0.33	0.17	0.13	0.00
R-PNODE	0.64	0.37	0.28	0.20
without $\mathcal{L}_{CON}$ and $\mathcal{L}_{CL}$ (R-PNODE - $\mathcal{L}_{CON}$ - $\mathcal{L}_{CL}$ )	0.58	0.32	0.24	0.19
without $\mathcal{L}_{CL}$	0.58	0.30	0.26	0.19
without $\mathcal{L}_{CON}$	0.59	0.34	0.25	0.19
additive gaussian noise	0.56	0.28	0.25	0.17
multiplicative gaussian noise $\sigma$ = 0.05	0.61	0.31	0.21	0.17
multiplicative gaussian noise $\sigma$ = 0.3	0.60	0.30	0.27	0.15
multiplicative gaussian noise $\sigma$ = 0.5	0.59	0.29	0.30	0.14

While generating the gaussian samples, if additive gaussian noise is used instead of multiplicative, the performance drops severely. The subtle difference between additive and multiplicative noise lies in the fact that in the additive case, images of different scales are perturbed by the same absolute magnitude, while in the multiplicative case, the perturbation has a relative magnitude. This may lead to lower performance since an image with a relatively smaller pixel intensity is perturbed by the same amount as another with a larger pixel intensity. Noise that may be ideal for the latter may be too much for the former. On the other hand, using a larger $\sigma$ ( $\sigma\geq 0.3$ ) for multiplicative noise also leads to a slight decrease in performance. Thus, it may be inferred that an ideal choice of the type and strength of noise is important for the model’s performance. While the overall theoretical utility may not change much, the subtle differences in its nature can become important.

VII Conclusion

The existing FSS methods may misclassify query image pixels, even if the query image features are slightly perturbed. There is a lack of an explicit mechanism in these methods to constrain support and query image features of the same class to lie close in the representation space. Further, defence against adversarial attacks on FSS models is of utmost importance as these models are data-scarce. Although adversarial training may alleviate the risk associated with these attacks, the training procedure’s computational overhead and poor generalizability render it less favourable, and sometimes impractical, as a robust defence strategy. We overcome limitations in existing FSS methods by employing Neural-ODEs to propose regularised and adversarially robust prototypical FSS method R-PNODE that stabilizes the model against various support and query perturbations. While the Neural-ODE block provides intrinsic robustness to the model, regularisation using cost-effective Gaussian samples further improves this robustness. With extensive experimentation, R-PNODE is shown to have better generalization abilities and adversarial robustness. To the best of our knowledge, we are the first to study the effects of different adversarial attacks on FSS models and provide a robust defence strategy.

References

[1] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, “A review of semantic segmentation using deep neural networks,” International journal of multimedia information retrieval, vol. 7, no. 2, pp. 87–93, 2018.
[2] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in 2018 IEEE winter conference on applications of computer vision (WACV). Ieee, 2018, pp. 1451–1460.
[3] D. Gut, Z. Tabor, M. Szymkowski, M. Rozynek, I. Kucybała, and W. Wojciechowski, “Benchmarking of deep architectures for segmentation of medical images,” IEEE Transactions on Medical Imaging, 2022.
[4] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng, “Panet: Few-shot image semantic segmentation with prototype alignment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9197–9206.
[5] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” Advances in neural information processing systems, vol. 30, 2017.
[6] Y. Li, Y. Fu, Q. Yang, Z. Min, W. Yan, H. Huisman, D. Barratt, V. A. Prisacariu, and Y. Hu, “Few-shot image segmentation for cross-institution male pelvic organs using registration-assisted prototypical learning,” arXiv e-prints, pp. arXiv–2201, 2022.
[7] N. Dong and E. P. Xing, “Few-shot semantic segmentation with prototype learning.” in BMVC, vol. 3, no. 4, 2018.
[8] C. Ouyang, C. Biffi, C. Chen, T. Kart, H. Qiu, and D. Rueckert, “Self-supervised learning for few-shot medical image segmentation,” IEEE Transactions on Medical Imaging, 2022.
[9] D. Al Chanti, V. G. Duque, M. Crouzier, A. Nordez, L. Lacourpaille, and D. Mateus, “Ifss-net: Interactive few-shot siamese network for faster muscle segmentation and propagation in volumetric ultrasound,” IEEE Transactions on Medical Imaging, vol. 40, no. 10, pp. 2615–2628, 2021.
[10] W. Wang, Q. Xia, Z. Hu, Z. Yan, Z. Li, Y. Wu, N. Huang, Y. Gao, D. Metaxas, and S. Zhang, “Few-shot learning by a cascaded framework with shape-constrained pseudo label assessment for whole heart segmentation,” IEEE Transactions on Medical Imaging, vol. 40, no. 10, pp. 2629–2641, 2021.
[11] H. Cui, D. Wei, K. Ma, S. Gu, and Y. Zheng, “A unified framework for generalized low-shot medical image segmentation with scarce data,” IEEE Transactions on Medical Imaging, vol. 40, no. 10, pp. 2656–2671, 2020.
[12] R. Feng, X. Zheng, T. Gao, J. Chen, W. Wang, D. Z. Chen, and J. Wu, “Interactive few-shot learning: Limited supervision, better medical image segmentation,” IEEE Transactions on Medical Imaging, vol. 40, no. 10, pp. 2575–2588, 2021.
[13] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
[14] A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” in Artificial intelligence safety and security. Chapman and Hall/CRC, 2018, pp. 99–112.
[15] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
[16] X. Li and D. Zhu, “Robust detection of adversarial attacks on medical images,” in 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, 2020, pp. 1154–1158.
[17] S. Liu, A. A. A. Setio, F. C. Ghesu, E. Gibson, S. Grbic, B. Georgescu, and D. Comaniciu, “No surprises: Training robust lung nodule detection for low-dose ct scans by augmenting with adversarial attacks,” IEEE Transactions on Medical Imaging, vol. 40, no. 1, pp. 335–345, 2020.
[18] M. Xu, T. Zhang, and D. Zhang, “Medrdf: a robust and retrain-less diagnostic framework for medical pretrained models against adversarial attack,” IEEE Transactions on Medical Imaging, 2022.
[19] L. Daza, J. C. Pérez, and P. Arbeláez, “Towards robust general medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2021, pp. 3–13.
[20] S. Kaviani, K. J. Han, and I. Sohn, “Adversarial attacks and defenses on ai in medical imaging informatics: A survey,” Expert Systems with Applications, p. 116815, 2022.
[21] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan, “Theoretically principled trade-off between robustness and accuracy,” in International conference on machine learning. PMLR, 2019, pp. 7472–7482.
[22] X. Xu, H. Zhao, and J. Jia, “Dynamic divide-and-conquer adversarial training for robust semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7486–7495.
[23] H. Zhang, H. Chen, Z. Song, D. Boning, I. S. Dhillon, and C.-J. Hsieh, “The limitations of adversarial training and the blind-spot attack,” arXiv preprint arXiv:1901.04684, 2019.
[24] S. Park and J. So, “On the effectiveness of adversarial training in defending against adversarial example attacks for image classification,” Applied Sciences, vol. 10, no. 22, p. 8079, 2020.
[25] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, “Neural ordinary differential equations,” Advances in Neural Information Processing Systems, 2018.
[26] B. Li, C. Chen, W. Wang, and L. Carin, “Certified adversarial robustness with additive noise,” Advances in neural information processing systems, vol. 32, 2019.
[27] N. Ford, J. Gilmer, N. Carlini, and D. Cubuk, “Adversarial examples are a natural consequence of test error in noise,” arXiv preprint arXiv:1901.10513, 2019.
[28] Z. He, A. S. Rakin, and D. Fan, “Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 588–597.
[29] J. Byun, H. Go, and C. Kim, “On the effectiveness of small input noise for defending against query-based black-box attacks,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 3051–3060.
[30] Z. Qin, Y. Fan, H. Zha, and B. Wu, “Random noise defense against query-based black-box attacks,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[31] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 ieee symposium on security and privacy (sp). Ieee, 2017, pp. 39–57.
[32] F. Croce and M. Hein, “Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks,” in International conference on machine learning. PMLR, 2020, pp. 2206–2216.
[33] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille, “Adversarial examples for semantic segmentation and object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1369–1378.
[34] G. Qi, L. Gong, Y. Song, K. Ma, and Y. Zheng, “Stabilized medical image attacks,” arXiv preprint arXiv:2103.05232, 2021.
[35] B. Landman, Z. Xu, J. Igelsias, M. Styner, T. Langerak, and A. Klein, “Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge,” in Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, vol. 5, 2015, p. 12.
[36] B. Rister, D. Yi, K. Shivakumar, T. Nobashi, and D. L. Rubin, “Ct-org, a new dataset for multiple organ segmentation in computed tomography,” Scientific Data, vol. 7, no. 1, pp. 1–9, 2020.
[37] A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. Van Ginneken, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. Menze et al., “A large annotated medical image dataset for the development and evaluation of segmentation algorithms,” arXiv preprint arXiv:1902.09063, 2019.
[38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[39] Y. Huang, Y. Yu, H. Zhang, Y. Ma, and Y. Yao, “Adversarial robustness of stabilized neural ode might be from obfuscated gradients,” Proceedings of Machine Learning Research vol, vol. 145, pp. 1–19, 2021.
[40] X. Liu, T. Xiao, S. Si, Q. Cao, S. Kumar, and C.-J. Hsieh, “How does noise help robustness? explanation and exploration under the neural sde framework,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 282–290.
[41] Q. Kang, Y. Song, Q. Ding, and W. P. Tay, “Stable neural ode with lyapunov-stable equilibrium points for defending against adversarial attacks,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[42] H. Yan, J. Du, V. Y. Tan, and J. Feng, “On robustness of neural ordinary differential equations,” arXiv preprint arXiv:1910.05513, 2019.
[43] I. D. Jimenez Rodriguez, A. D. Ames, and Y. Yue, “Lyanet: A lyapunov framework for training neural odes,” arXiv, 2022.
[44] A. G. Roy, S. Siddiqui, S. Pölsterl, N. Navab, and C. Wachinger, “‘squeeze & excite’guided few-shot segmentation of volumetric images,” Medical image analysis, vol. 59, p. 101587, 2020.
[45] X. Li, T. Wei, Y. P. Chen, Y.-W. Tai, and C.-K. Tang, “Fss-1000: A 1000-class dataset for few-shot segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2869–2878.
[46] H. Tang, X. Liu, S. Sun, X. Yan, and X. Xie, “Recurrent mask refinement for few-shot medical image segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3918–3928.
[47] L. Sun, C. Li, X. Ding, Y. Huang, Z. Chen, G. Wang, Y. Yu, and J. Paisley, “Few-shot medical image segmentation using a global correlation network with discriminative embedding,” Computers in biology and medicine, vol. 140, p. 105067, 2022.
[48] S. Kim, S. An, P. Chikontwe, and S. H. Park, “Bidirectional rnn-based few shot learning for 3d medical image segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 1808–1816.
[49] C. Ouyang, C. Biffi, C. Chen, T. Kart, H. Qiu, and D. Rueckert, “Self-supervision with superpixels: Training few-shot medical image segmentation without annotation,” in European Conference on Computer Vision. Springer, 2020, pp. 762–780.
[50] G. Zhang, G. Kang, Y. Yang, and Y. Wei, “Few-shot segmentation via cycle-consistent transformer,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[51] A. K. Mondal, J. Dolz, and C. Desrosiers, “Few-shot 3d multi-modal medical image segmentation using generative adversarial learning,” arXiv preprint arXiv:1810.12241, 2018.
[52] B. Zhang, X. Li, S. Feng, Y. Ye, and R. Ye, “Metanode: Prototype optimization as a neural ode for few-shot learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 9014–9021.
[53] B. Stimpel, C. Syben, F. Schirrmacher, P. Hoelter, A. Dörfler, and A. Maier, “Multi-modal deep guided filtering for comprehensible medical image processing,” IEEE transactions on medical imaging, vol. 39, no. 5, pp. 1703–1711, 2019.
[54] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1765–1773.
[55] U. Ozbulak, A. Van Messem, and W. D. Neve, “Impact of adversarial examples on deep learning models for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 300–308.
[56] V. Zantedeschi, M.-I. Nicolae, and A. Rawat, “Efficient defenses against adversarial attacks,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 2017, pp. 39–49.
[57] G. Li, S. Ding, J. Luo, and C. Liu, “Enhancing intrinsic adversarial robustness via feature pyramid decoder,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 800–808.
[58] C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He, “Feature denoising for improving adversarial robustness,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 501–509.
[59] M. Goldblum, L. Fowl, and T. Goldstein, “Adversarially robust few-shot learning: A meta-learning approach,” Advances in Neural Information Processing Systems, vol. 33, pp. 17 886–17 895, 2020.
[60] X. Jiang, R. Zhang, and S. Nie, “Image segmentation based on level set method,” Physics Procedia, vol. 33, pp. 840–845, 2012, 2012 International Conference on Medical Physics and Biomedical Engineering (ICMPBE2012). [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1875389212014563
[61] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001.