The Devil is in Low-Level Features for Cross-Domain Few-Shot Segmentation
Abstract
Cross-Domain Few-Shot Segmentation (CDFSS) is proposed to transfer the pixel-level segmentation capabilities learned from large-scale source-domain datasets to downstream target-domain datasets, with only a few annotated images per class. In this paper, we focus on a well-observed but under-explored phenomenon in CDFSS: for target domains, particularly those distant from the source domain, segmentation performance peaks at the very early epochs and declines sharply as source-domain training proceeds. We delve into this phenomenon for an interpretation: low-level features are vulnerable to domain shifts, leading to sharper loss landscapes during source-domain training, which is the devil of CDFSS. Based on this phenomenon and interpretation, we further propose a method that includes two plug-and-play modules: one to flatten the loss landscapes for low-level features during source-domain training as a novel sharpness-aware minimization method, and the other to directly supplement target-domain information to the model during target-domain testing by low-level-based calibration. Extensive experiments on four target datasets validate our rationale and demonstrate that our method surpasses the state-of-the-art CDFSS method significantly, by 3.71% and 5.34% average MIoU in the 1-shot and 5-shot scenarios, respectively.
1 Introduction
Current deep neural networks [24, 30, 43] have achieved remarkable success in semantic segmentation, but their performance heavily relies on large-scale annotated datasets [1, 23]. However, the process of annotating data, particularly for dense pixel-wise tasks like semantic segmentation, is labor-intensive and time-consuming. To address this issue, Few-Shot Segmentation (FSS) has been introduced, aiming to generate pixel-level predictions for unseen categories with only a few labeled samples. Typically, the model is pretrained on a large-scale dataset of base classes and then transferred to unseen novel classes. However, novel classes might not lie in the same domain as base classes [11, 22, 7, 5, 36]; a more realistic setting considers domain gaps between the base (source-domain) and novel (target-domain) classes, which gives rise to the Cross-Domain Few-Shot Segmentation (CDFSS) task [20].

Although extensive works [20, 17, 33, 16] have been developed, a well-observed phenomenon remains unhandled: for target domains, especially those distant from the source domain, the best performance is always achieved at the very early epochs or even the first epoch. As shown in Fig. 1b, the performance decreases sharply as source-domain training proceeds, where the 20th epoch's MIoU is even lower than that of the first epoch. Although early stopping can mitigate this problem, the goal of source-domain training is to provide a single generalizable model for all target domains [20], so it is inappropriate to tune a separate early-stopping point for each specific domain. In this paper, we aim to delve into this well-observed but under-explored phenomenon and handle it based on our interpretation of it, as shown in Fig. 1c.

To delve into this phenomenon, we first visualize the predictions of the 1st and 20th epochs in Fig. 2. We observe that the 20th epoch's model makes many fundamental errors in target domains, i.e., CDFSS models fail to identify the foreground and instead focus on entirely wrong regions. This suggests that the CDFSS models may not have acquired meaningful information for target domains during source-domain training, resulting in almost random predictions during testing. We then visualize feature maps at different network stages and observe that this problem occurs even in the shallow layers: deeper layers produce wrong feature maps given wrong inputs from the shallow layers. Quantitatively, previous works [13, 46] suggest that the challenge posed by domain shift can be analyzed through the flatness of the loss landscape. We use this as an entry point to explore the connection between shallow layers and early stops. We find that low-level features, although simpler and previously believed to be more transferable [44], are vulnerable to domain shifts, leading to sharper loss landscapes during source-domain training. That is, the devil lies in the low-level features.
Based on this phenomenon and interpretation, we further propose a method for the CDFSS task. During source-domain training, we propose a novel sharpness-aware minimization method to flatten the loss landscapes for low-level features, which is achieved by a shape-preserving perturbation module with random convolutions. During target-domain testing, since low-level features can hardly capture target-domain information due to their vulnerability to domain shifts, we directly supplement such information to the model with a low-level-based calibration module, which effectively prevents the model from making fundamental errors caused by collapsed low-level features. With these two modules, we effectively prevent the performance drop after the early epochs and achieve higher performance (Fig. 1c).
In summary, our contributions are as follows:
We focus on a well-observed but under-explored phenomenon in CDFSS: for target domains, particularly those distant from the source domain, the best segmentation performance is always achieved at the very early epochs, followed by a sharp decline as source-domain training progresses.
We delve into this phenomenon for an interpretation: low-level features are vulnerable to domain shifts, leading to sharper loss landscapes during the source-domain training, which is the devil of CDFSS.
Building on this interpretation, we propose a method that includes two plug-and-play modules: one acts as a novel sharpness-aware minimization method to flatten the loss landscapes of low-level features during source-domain training, while the other directly supplements target-domain information to the model during target-domain testing.
Extensive experiments on four target datasets validate the rationale of our interpretation and method, showing its superiority over current state-of-the-art methods.
2 Interpretation
2.1 Preliminaries
Cross-Domain Few-Shot Segmentation (CDFSS) aims to transfer the segmentation capabilities learned from the source domain to target domains, with only a few annotated images per class. The source domain $\mathcal{D}_s$ and target domain $\mathcal{D}_t$ are defined by different input distributions and disjoint label spaces, i.e., $\mathcal{P}_s \neq \mathcal{P}_t$ and $\mathcal{Y}_s \cap \mathcal{Y}_t = \emptyset$, where $\mathcal{P}$ refers to the input distribution and $\mathcal{Y}$ refers to the label space.
In this work, we adopt the meta-learning episodic paradigm following [20]. Specifically, both $\mathcal{D}_s$ and $\mathcal{D}_t$ consist of several episodes. Each episode is constructed by a support set $\mathcal{S} = \{(I_s^i, M_s^i)\}_{i=1}^{K}$ with $K$ training samples and a query set $\mathcal{Q} = \{(I_q, M_q)\}$, where $I$ denotes the image and $M$ denotes the pixel labels. Within each episode, the model leverages $\mathcal{S}$ and $I_q$ to predict the query mask $M_q$.
The model $f_\theta$, composed of a feature extractor $E$ and a comparison module $C$, is trained on the source-domain support and query sets by minimizing the binary cross-entropy loss:

$$\mathcal{L} = \mathrm{BCE}\big(f_\theta(\mathcal{S}, I_q),\; M_q\big), \quad (I_q, M_q) \in \mathcal{Q}, \tag{1}$$

where $f_\theta(\mathcal{S}, I_q)$ outputs the segmentation score map of $I_q$.
Then the model is evaluated on the target-domain query set, where the prediction for a target-domain query image $I_q^t$ is:

$$\hat{M}_q^t = \arg\max\; f_\theta(\mathcal{S}^t, I_q^t). \tag{2}$$

Finally, the evaluation is conducted by comparing $\hat{M}_q^t$ with the real label $M_q^t$.
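To make the episodic protocol concrete, the following PyTorch-style sketch shows one source-domain training episode (Eq. 1) and target-domain inference (Eq. 2). The interface `f_theta(support_imgs, support_masks, query_img)` returning a two-channel (background/foreground) score map is an illustrative assumption, not the authors' actual implementation.

```python
# Minimal sketch of the episodic paradigm; all names are illustrative assumptions.
import torch
import torch.nn.functional as F

def train_one_episode(f_theta, optimizer, support_imgs, support_masks, query_img, query_mask):
    """One source-domain episode: predict the query mask from the support set and
    minimize the cross-entropy over the 2-channel score map, one common way to
    implement the binary objective in Eq. 1."""
    score_map = f_theta(support_imgs, support_masks, query_img)   # (1, 2, H, W)
    loss = F.cross_entropy(score_map, query_mask.long())          # query_mask: (1, H, W) in {0, 1}
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict_query_mask(f_theta, support_imgs, support_masks, query_img):
    """Target-domain inference (Eq. 2): the prediction is the arg-max over the
    background/foreground channels; no parameters are updated."""
    score_map = f_theta(support_imgs, support_masks, query_img)
    return score_map.argmax(dim=1)                                # (1, H, W)
```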
2.2 Delve into the Early Stop in Cross-Domain Few-Shot Segmentation
2.2.1 Intuitive observation: The devil is in the low-level features
Given our suspicion that the model fails to learn useful information from the source domain for target domains, we start by visualizing the feature maps. Using ResNet-50 [15] as the backbone following [12], we visualize the feature maps of stages 1 to 4 for both the source domain and the target domains, as shown in Fig. 3. We observe that on source-domain samples, the model learns boundary information effectively in stage 1, enabling it to accurately focus on the foreground in stage 4. However, on target domains, especially those distant from the source domain, even though the foreground is clear and prominent, the model still fails to learn anything meaningful in stages 1 and 2. Consequently, in stage 4, the model focuses on completely incorrect areas. This indicates that the model's poor performance may originate from its shallow layers.
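The stage-wise visualizations can be reproduced with forward hooks on the residual stages of a torchvision ResNet-50, summarizing each feature map by its channel-wise mean; the snippet below is an illustrative recipe under these assumptions, not the authors' exact visualization code.

```python
# Illustrative recipe for extracting per-stage activation maps (cf. Fig. 3).
import torch
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()
features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

for stage in ["layer1", "layer2", "layer3", "layer4"]:   # stages 1-4
    getattr(model, stage).register_forward_hook(make_hook(stage))

with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))               # replace with a normalized source/target image

# One spatial activation map per stage, ready for plotting as a heat map.
stage_maps = {name: feat.mean(dim=1) for name, feat in features.items()}
```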


In Fig. 4, we further visualize feature maps of each stage at the 1st and 20th epochs, respectively. We find that the shallow layers show more distinguishable activations at epoch 1 than at epoch 20, which aligns with our observation of higher accuracy at epoch 1. Consequently, we conclude that the poor performance of CDFSS models may be attributed to shallow layers that fail to capture useful low-level information, a trend that intensifies during source-domain training.
2.2.2 Quantitative verification: Low-level features lead to a sharper loss landscape

Previous work [13, 46] has demonstrated that the sharpness of the loss landscape can serve as a tool to analyze cross-domain issues. As illustrated in Fig. 5a, a smoother loss landscape enables the model to better handle domain shifts (please refer to the appendix for more details about the sharpness measure). Therefore, we measure the sharpness of the source-domain-trained models across different epochs and stages.
We first measure the sharpness of the loss landscape across different epochs. Following the method in [46], we apply pixel perturbations to the training data. Examples of these perturbations are shown in Fig. 5b, while the magnitude of the performance drop is depicted in Fig. 5c. A larger performance drop under the same perturbation indicates a sharper loss landscape. As observed, the performance of models across all epochs drops when exposed to perturbations, but the model from epoch 1 experiences a smaller drop than that of epoch 10, which in turn performs better than epoch 20. This suggests that as training progresses, the loss landscape becomes progressively sharper, making the model increasingly vulnerable to domain shifts and resulting in decreased performance on target domains. This is in line with our previous observations.
Based on this, we further investigate the relationship between model layers and the loss landscape. We apply the same magnitude of perturbations to the feature maps at stages 1-4 and measure the performance drops. The results in Fig. 5d show that earlier stages exhibit larger performance drops under the same perturbation, which implies that shallow layers lead to sharper loss landscapes and are more vulnerable to domain shifts. This experiment provides quantitative validation for the intuitive observation from the low-level feature visualization in the previous section.
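The stage-wise sharpness probe can be sketched as follows: noise of a fixed magnitude is injected into the output of one stage at a time, and the resulting mIoU drop relative to the clean model is recorded (a larger drop indicates a sharper loss landscape). The `evaluate_miou` helper, the Gaussian noise form, and the torchvision stage names are assumptions for illustration.

```python
# Illustrative sketch of measuring sharpness via feature-map perturbations.
import torch

def perturb_stage(backbone, stage_name, epsilon):
    """Add zero-mean Gaussian noise of scale `epsilon` to the chosen stage's
    output; returns the hook handle so the perturbation can be removed."""
    def add_noise(module, inputs, output):
        return output + epsilon * torch.randn_like(output)
    return getattr(backbone, stage_name).register_forward_hook(add_noise)

@torch.no_grad()
def sharpness_per_stage(model, eval_loader, evaluate_miou, epsilon=0.1):
    clean_miou = evaluate_miou(model, eval_loader)
    drops = {}
    for stage in ["layer1", "layer2", "layer3", "layer4"]:
        handle = perturb_stage(model.backbone, stage, epsilon)
        drops[stage] = clean_miou - evaluate_miou(model, eval_loader)
        handle.remove()   # restore the clean model before probing the next stage
    return drops
```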
2.2.3 Further analysis: Low-level features absorb domain-specific information
| Method | FSS-1000 | Deepglobe | ISIC | Chest X-ray | Average |
|---|---|---|---|---|---|
| train stage 1, 2, 3, 4 | 78.86 | 39.44 | 35.76 | 72.49 | 56.64 |
| fix stage 1, train stage 2, 3, 4 | 78.88 | 39.90 | 37.00 | 72.12 | 56.98 |
| fix stage 1, 2, train stage 3, 4 | 78.91 | 40.00 | 35.49 | 74.44 | 57.21 |

To investigate the impact of shallow layers on model performance, we compare the results of fixing stage 1 and stage 2 using ImageNet [8] pre-trained weights with the results of training all stages, as shown in Tab. 1. We find that fixing stage 1 and stage 2 leads to the best performance.
In Fig. 6, we further visualize the mIoU curves over epochs for two training settings: fixing stage 1 only, and fixing stages 1 and 2. The comparison reveals that when stage 2 is also fixed, the model reaches its optimal mIoU at a later epoch, and the mIoU decreases more slowly as training progresses.
This suggests that shallow layers are vulnerable to domain shifts; therefore, even a slight absorption of domain-specific information can harm target-domain generalization.
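For reference, the "fix stage 1, 2" setting in Tab. 1 amounts to freezing the stem and the first two residual stages of ResNet-50 at their ImageNet weights and optimizing only the remaining parameters; a minimal sketch with torchvision module names and an assumed optimizer setup is given below.

```python
# Illustrative sketch of the "fix stage 1, 2" setting in Tab. 1.
import torch
from torchvision.models import resnet50

backbone = resnet50(weights="IMAGENET1K_V1")
frozen_prefixes = ["conv1", "bn1", "layer1", "layer2"]    # stem + stages 1-2

for name, param in backbone.named_parameters():
    if any(name.startswith(prefix) for prefix in frozen_prefixes):
        param.requires_grad = False                       # keep ImageNet low-level features fixed

trainable_params = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=1e-3, momentum=0.9)
```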
2.3 Conclusion and Discussion
Based on the above experiments, we draw the following interpretation. The shallow layers of the model tend to learn domain-specific information, causing the low-level features to become increasingly domain-specific during training and gradually overfit to the source domain. This overfitting sharpens the model's loss landscape and makes it more vulnerable to domain shifts, thereby reducing cross-domain performance and introducing significant mistakes in CDFSS. This problem is only partially addressed by the early-stopping strategy and by freezing shallow layers.
However, merely stopping the training of shallow layers cannot fully unlock the power of source-domain training. Therefore, below we propose a novel method, LoEC (Low-level Enhancement and Calibration), to handle the problem in shallow layers.
3 Method

3.1 Method Overview
Our core idea is that low-level features are vulnerable to domain shifts, which may lead to a sharp decrease in generalization with only slight absorption of source-domain information. Based on this, we design two plug-and-play modules that operate in two phases: 1) during the source-domain training phase, the Low-level Enhancement Module (LEM) enhances robustness against domain shifts by synthesizing diverse domains and incorporating them into low-level features; 2) during the target-domain testing phase, the Low-level Calibration Module (LCM) refines the initial segmentation results by utilizing the low-level features from the target domain. The framework is illustrated in Fig. 7.
3.2 Train-time Low-level Enhancement Module
As analyzed above, the overfitting of low-level features on the source domain during training leads to a sharp loss landscape, making the model vulnerable to domain shifts. Following previous Sharpness-Aware Minimization (SAM) methods [13, 46], one effective strategy to smooth the loss landscape is to introduce perturbations. Thus, we transform the domain of the low-level support feature during training, which acts as a form of perturbation to reduce overfitting.
Previous work [39] has shown that random convolution is a shape-preserving strategy to distort local textures, which we view to be suitable for the segmentation task. We take advantage of this property to generate diverse domains while retaining the content. Specifically, we are given a support feature $F_s \in \mathbb{R}^{H \times W \times C}$ from the early layers of the encoder and a convolution layer with filters $\Theta \in \mathbb{R}^{k_h \times k_w \times C \times C}$, where $H$, $W$ and $C$ represent the height, width and channels of the input feature, and $k_h$ and $k_w$ are the height and width of the random convolution filter. We sample the filter weights from $\mathcal{N}(0, \sigma^2)$, where $\sigma$ acts as a hyperparameter to control the perturbation magnitude. Then we obtain the perturbed support feature $F_s'$ with a transformed domain:

$$F_s' = F_s * \Theta. \tag{3}$$
Although random convolution can generally preserve the shape and semantics of the feature, some edge details inevitably get blurred. Existing research [28] has shown that the Fast Fourier Transform (FFT) decomposes a signal into an amplitude spectrum and a phase spectrum, with the amplitude capturing the overall domain and texture information and the phase representing shape, edge, and spatial structure. Thus, we apply the FFT to both $F_s$ and $F_s'$, decomposing them into phase spectra $\mathcal{P}(F_s)$, $\mathcal{P}(F_s')$ and amplitude spectra $\mathcal{A}(F_s)$, $\mathcal{A}(F_s')$:

$$\mathcal{A}(F_s),\ \mathcal{P}(F_s) = \mathrm{FFT}(F_s), \tag{4}$$

$$\mathcal{A}(F_s'),\ \mathcal{P}(F_s') = \mathrm{FFT}(F_s'). \tag{5}$$

Then we recombine the phase $\mathcal{P}(F_s)$ from the original feature and the amplitude $\mathcal{A}(F_s')$ from the perturbed feature, using the Inverse Fast Fourier Transform (IFFT) to generate the transformed support feature:

$$\tilde{F}_s = \mathrm{IFFT}\big(\mathcal{A}(F_s'),\ \mathcal{P}(F_s)\big). \tag{6}$$
In this way, we transform the domain while better preserving boundary shapes, achieving domain-oriented perturbations on low-level features for sharpness-aware minimization. The transformed feature $\tilde{F}_s$ is then fed into the subsequent layers of the encoder.
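A minimal PyTorch sketch of LEM under these definitions (Eqs. 3-6) is given below; the kernel size, padding, and tensor layout (B, C, H, W) are illustrative choices rather than the authors' exact configuration.

```python
# Illustrative sketch of the Low-level Enhancement Module (Eqs. 3-6).
import torch
import torch.nn.functional as F

def low_level_enhancement(feat, sigma=0.1, kernel_size=3):
    B, C, H, W = feat.shape

    # Eq. 3: shape-preserving perturbation with a random convolution whose
    # weights are freshly sampled from N(0, sigma^2) at every call.
    weight = sigma * torch.randn(C, C, kernel_size, kernel_size, device=feat.device)
    perturbed = F.conv2d(feat, weight, padding=kernel_size // 2)

    # Eqs. 4-5: decompose the original and perturbed features into amplitude
    # and phase spectra with the FFT over the spatial dimensions.
    orig_fft = torch.fft.fft2(feat)
    pert_fft = torch.fft.fft2(perturbed)
    amplitude_pert = torch.abs(pert_fft)     # perturbed amplitude: new "domain"/texture
    phase_orig = torch.angle(orig_fft)       # original phase: shape, edges, structure

    # Eq. 6: recombine the perturbed amplitude with the original phase and
    # return to the spatial domain with the inverse FFT.
    recombined = torch.polar(amplitude_pert, phase_orig)
    return torch.fft.ifft2(recombined).real
```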
3.3 Test-time Low-level Calibration Module
After being processed by the encoder, the support and query features are fed into the comparison module, yielding a coarse score map $S$. Since low-level features may collapse on target domains due to their vulnerability to domain shifts, we directly use the low-level query feature $F_q$ to supplement such collapsed low-level target-domain information and calibrate the score map during testing on the target domain.
First, we compute a confidence map by subtracting the background similarity from the foreground similarity in the score map:

$$\mathrm{Conf}(i, j) = S_{fg}(i, j) - S_{bg}(i, j). \tag{7}$$

Here, $S_{fg}(i, j)$ and $S_{bg}(i, j)$ denote the foreground and background similarity for the pixel at $(i, j)$, respectively. The confidence map quantifies the likelihood of foreground at each pixel.
Then we partition the confidence map into patches $\{P_m\}$ and select the Top-K patches with the highest average confidence. These patches are considered the most reliable foreground regions:

$$\bar{c}_m = \frac{1}{|P_m|} \sum_{(i, j) \in P_m} \mathrm{Conf}(i, j), \tag{8}$$

$$\mathcal{T} = \operatorname*{Top\text{-}K}_{m}\ \bar{c}_m, \tag{9}$$

where $|P_m|$ is the number of pixels in patch $P_m$ and $(i, j)$ are the pixel locations within the patch.
For each selected Top-K foreground patch $P_k \in \mathcal{T}$, we locate the corresponding patch in the interpolated query feature map as $F_q^{P_k}$, and compute the cosine similarity between it and all other patches $F_q^{P_m}$ in the query feature map:

$$\mathrm{Sim}_k(m) = \frac{F_q^{P_k} \cdot F_q^{P_m}}{\lVert F_q^{P_k} \rVert\, \lVert F_q^{P_m} \rVert}. \tag{10}$$
Finally, the foreground similarity in the score map is updated using the Top-K similarity maps:

$$\tilde{S}_{fg} = S_{fg} + \alpha \cdot \frac{1}{K} \sum_{k=1}^{K} \mathrm{Sim}_k + \beta. \tag{11}$$

Here, $\alpha$ is a scaling factor and $\beta$ is a bias term. This adjustment improves foreground detection by directly supplementing collapsed low-level target-domain information.
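A minimal sketch of LCM under the formulation above (Eqs. 7-11) is given below, assuming a two-channel (background/foreground) score map and a low-level query feature interpolated to the same spatial size; the patch size and the per-pixel form of the similarity are illustrative simplifications of the patch-wise formulation.

```python
# Illustrative sketch of the test-time Low-level Calibration Module (Eqs. 7-11).
import torch
import torch.nn.functional as F

def low_level_calibration(score_map, query_feat, patch=16, k=3, alpha=0.6, beta=0.7):
    bg, fg = score_map[0], score_map[1]                          # score_map: (2, H, W)
    conf = fg - bg                                               # Eq. 7: confidence map

    # Eqs. 8-9: average confidence per patch, then keep the K most confident patches.
    conf_patches = F.avg_pool2d(conf[None, None], patch)         # (1, 1, H/p, W/p)
    topk_idx = conf_patches.flatten().topk(k).indices

    # Eq. 10: cosine similarity between each selected patch prototype and every
    # position of the low-level query feature (query_feat: (C, H, W)).
    feat_patches = F.avg_pool2d(query_feat[None], patch)         # (1, C, H/p, W/p)
    feat_patches = feat_patches.flatten(2).squeeze(0).t()        # (num_patches, C)
    sim_maps = []
    for idx in topk_idx:
        proto = feat_patches[idx]                                # (C,)
        sim_maps.append(F.cosine_similarity(query_feat, proto[:, None, None], dim=0))  # (H, W)
    sim_map = torch.stack(sim_maps).mean(dim=0)

    # Eq. 11: calibrate the foreground similarity with the low-level evidence.
    calibrated = score_map.clone()
    calibrated[1] = fg + alpha * sim_map + beta
    return calibrated
```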
4 Experiments
4.1 Datasets
Following the CDFSS benchmark of PATNet [20], we train on PASCAL VOC 2012 [11] with SBD [14] augmentation as the source domain, and evaluate on four target-domain datasets: FSS-1000 [22], Deepglobe [7], ISIC2018 [5, 36], and Chest X-ray [3, 18].
4.2 Implementation Details
We employ ResNet-50 [15] with ImageNet [8] pretrained weights as our backbone. Following the baseline SSP [12], we discard the last backbone stage and the last ReLU for better generalization. Furthermore, we integrate our approach into the transformer architecture by using ViT-B/16 [9] as the backbone, following FPTrans [41]. Consistent with FPTrans [41], we resize both support and query images to the same resolution used therein. During training on the source domain, we use SGD to optimize our model, with a momentum of 0.9 and an initial learning rate of 1e-3. We apply our LEM to different shallow layers of the backbone; more details can be found in Sec. 4.4.4. The standard deviation $\sigma$ of the random convolution filter used in LEM is set to 0.1. During testing on the target domain, we apply our LCM to refine the segmentation mask. The hyperparameters $K$, $\alpha$ and $\beta$ are set to 3, 0.6 and 0.7, respectively.
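For convenience, the stated hyperparameters can be collected into a single configuration sketch; field names are illustrative, and values not given in this section (e.g., image resolution and the random-convolution kernel size) are omitted.

```python
# Hedged configuration sketch of the stated settings; names are illustrative.
config = {
    "backbone": "resnet50",      # or "vit_b_16" following FPTrans
    "pretrained": "imagenet",
    "optimizer": "sgd",
    "momentum": 0.9,
    "initial_lr": 1e-3,
    "lem_sigma": 0.1,            # std of the random convolution weights
    "lcm_top_k": 3,              # K: number of selected high-confidence patches
    "lcm_alpha": 0.6,            # scaling factor in Eq. 11
    "lcm_beta": 0.7,             # bias term in Eq. 11
}
```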
4.3 Comparison with State-of-the-art Methods
| Method | Venue | Backbone | FSS-1000 1-shot | FSS-1000 5-shot | Deepglobe 1-shot | Deepglobe 5-shot | ISIC 1-shot | ISIC 5-shot | Chest X-ray 1-shot | Chest X-ray 5-shot | Average 1-shot | Average 5-shot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Few-Shot Segmentation Methods* | | | | | | | | | | | | |
| RePRI [2] | CVPR-21 | Res-50 | 70.96 | 74.23 | 25.03 | 27.41 | 23.27 | 26.23 | 65.08 | 65.48 | 46.09 | 48.34 |
| HSNet [26] | ICCV-21 | Res-50 | 77.53 | 80.99 | 29.65 | 35.08 | 31.20 | 35.10 | 51.88 | 54.36 | 47.57 | 51.38 |
| SSP [12] | ECCV-22 | Res-50 | 78.91 | 80.59 | 40.00 | 48.68 | 35.49 | 45.86 | 74.44 | 74.26 | 57.21 | 62.35 |
| FPTrans [41] | NIPS-22 | ViT-base | 80.74 | 83.65 | 38.36 | 49.30 | 48.65 | 60.37 | 80.92 | 82.91 | 62.17 | 69.06 |
| PerSAM [42] | ICLR-24 | ViT-base | 60.92 | 66.53 | 36.08 | 40.65 | 23.27 | 25.33 | 29.95 | 30.05 | 37.56 | 40.64 |
| *Cross-Domain Few-Shot Segmentation Methods* | | | | | | | | | | | | |
| PATNet [20] | ECCV-22 | Res-50 | 78.59 | 81.23 | 37.89 | 42.97 | 41.16 | 53.58 | 66.61 | 70.20 | 56.06 | 61.99 |
| APM [35] | NIPS-24 | Res-50 | 79.29 | 81.83 | 40.86 | 44.92 | 41.71 | 51.16 | 78.25 | 82.81 | 60.03 | 65.18 |
| ABCDFSS [17] | CVPR-24 | Res-50 | 74.60 | 76.20 | 42.60 | 49.00 | 45.70 | 53.30 | 79.80 | 81.40 | 60.67 | 64.97 |
| DRA [33] | CVPR-24 | Res-50 | 79.05 | 80.40 | 41.29 | 50.12 | 40.77 | 48.87 | 82.35 | 82.31 | 60.86 | 65.42 |
| APSeg [16] | CVPR-24 | ViT-base | 79.71 | 81.90 | 35.94 | 39.98 | 45.43 | 53.98 | 84.10 | 84.50 | 61.30 | 65.09 |
| LoEC | Ours | Res-50 | 78.51 | 80.60 | 44.10 | 49.67 | 38.21 | 47.04 | 81.02 | 82.73 | 60.46 | 65.01 |
| LoEC | Ours | ViT-base | 81.05 | 83.69 | 42.12 | 51.48 | 52.91 | 62.43 | 83.94 | 84.12 | 65.01 | 70.43 |

In Tab. 2, we compare our method with CNN-based and ViT-based approaches, including traditional FSS methods and existing CDFSS methods. Our results show a notable improvement for both 1-shot and 5-shot tasks. Specifically, our method outperforms the state-of-the-art APSeg [16] by 3.71% and 5.34% average MIoU in the 1-shot and 5-shot settings, respectively, confirming the effectiveness of our strategy. Additionally, we apply our method to existing CDFSS baselines and observe improved performance; please refer to the appendix for more details. We present qualitative results of our method on 1-way 1-shot segmentation in Fig. 8.
4.4 Ablation Study
4.4.1 Effectiveness of Each Component
| LEM | LCM | ResNet | ViT |
|---|---|---|---|
| | | 57.21 | 62.17 |
| ✓ | | 58.35 | 63.06 |
| | ✓ | 59.78 | 64.39 |
| ✓ | ✓ | 60.46 | 65.01 |
We evaluate each proposed component on both CNN and ViT baselines to assess the effectiveness of our designs, including LEM and LCM. The results presented in Tab. 3 show that in the 1-shot setting, introducing LEM improved average MIoU by 1.14% for the CNN baseline and 0.89% for the ViT baseline, while adding LCM further increased it by 2.11% and 1.95%, respectively. These results clearly demonstrate the effectiveness of each component in enhancing performance.
4.4.2 Verification of LEM
LEM smooths loss landscapes. (We present the validation on CNN-based methods here; similar trends hold for ViT-based methods. Please refer to the appendix for details.)

To assess the impact of LEM, we measure the loss-landscape sharpness before and after its application. We introduce low-frequency noise on the training data to simulate domain shifts in the representation space (i.e., pixels and features). A larger performance drop suggests a sharper landscape. As illustrated in Fig. 9, random convolution smooths the loss landscape compared with the baseline, while the Fourier transformation further enhances this effect. These results confirm that LEM effectively smooths the loss landscape, which improves robustness against domain shifts.
LEM ensures performance improves with training.

We compare mIoU trends over training epochs with and without LEM in Fig. 10 (left). In the baseline model, the mIoU on the target domain declines as the model increasingly absorbs source-domain information during training. Conversely, with LEM, the mIoU not only outperforms the baseline at each epoch but also keeps increasing as training advances, indicating that the model focuses more on domain-invariant information.
LEM prevents the shallow layers from absorbing excessive source-domain information. As previously analyzed, fixing stage 1 and stage 2 of the baseline resulted in the best performance, which suggests that shallow layers tend to learn information that negatively impacts target-domain generalization. Fig. 10 (right) shows that after incorporating our LEM, training stage 2 produces results comparable to those achieved by fixing stage 2. This indicates that LEM effectively prevents the shallow layers from absorbing excessive source-domain information during training.
LEM encourages the model to focus more on domain-agnostic information.

To assess how the domain distance between the source and target domains changes before and after applying LEM, we use CKA (Centered Kernel Alignment) similarity for measurement, following the approach in [6, 47, 48, 49, 45]. Specifically, we extract features from source-domain and target-domain images using a backbone network, then compute the CKA similarity of the final layer's features by aligning the channel dimension. A higher CKA similarity indicates a smaller domain distance, suggesting the model retains less domain-specific information. As shown in Fig. 11 (left), the average CKA similarity across the four target domains increases after applying random convolution, and is further improved with the addition of the Fourier transform. This demonstrates that by perturbing the source domain, LEM encourages the model to capture domain-agnostic information during training.
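The linear CKA measure used here can be computed as below; `feat_src` and `feat_tgt` are assumed to be feature matrices with one row per sampled image (and equal sample counts for the two domains), and this is the standard linear-CKA formula rather than the authors' exact measurement script.

```python
# Illustrative sketch of linear CKA between source- and target-domain features.
import torch

def linear_cka(feat_src, feat_tgt):
    # Center each (N, D) feature matrix along the sample dimension.
    x = feat_src - feat_src.mean(dim=0, keepdim=True)
    y = feat_tgt - feat_tgt.mean(dim=0, keepdim=True)
    # CKA = ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    cross = (x.t() @ y).norm(p="fro") ** 2
    self_x = (x.t() @ x).norm(p="fro")
    self_y = (y.t() @ y).norm(p="fro")
    return (cross / (self_x * self_y)).item()
```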
LEM enables the model to capture more useful low-level information in shallow layers. As shown in Fig. 11 (right), LEM helps the model learn boundary information more effectively in stage 1, allowing it to focus more accurately on the foreground in stage 4. This leads to more precise activation regions in the later epochs of source-domain training.
4.4.3 Verification of LCM
LCM smooths loss landscapes.

To assess the impact of LCM in reducing sharpness and enhancing robustness against domain shifts, we superimpose styles from ISIC medical images and Deepglobe remote-sensing images onto source-domain images to simulate domain shifts, with the level of style alteration serving as the perturbation level. As shown in Fig. 12, the reduced performance drop validates that LCM smooths the loss landscape, thus improving robustness against domain shifts.
4.4.4 Sensitivity Study of Hyper-parameters

As shown in Fig. 13, we investigate the effects of applying LEM and LCM at various positions in the shallow layers. Please refer to the appendix for more details.
5 Related Work
5.1 Few-Shot Segmentation
Few-Shot Segmentation (FSS) aims to segment novel classes in query images using only a few annotated support samples. Existing FSS methods can generally be categorized into two types: prototype-based methods and matching-based methods. Motivated by PrototypicalNet [32], prototype-based methods [19, 21, 34, 38] generate prototypes from support images and perform segmentation by computing the similarity between query features and these support prototypes. On the other hand, matching-based methods [25, 26, 29, 40] focus on analyzing pixel-to-pixel dense correspondences between support and query features to avoid the loss of spatial structure inherent in prototype-based methods. However, these methods are limited to segmenting novel classes within a single domain and do not generalize effectively to unseen domains.
5.2 Domain Generalization
Domain Generalization (DG) aims to train models that can generalize to diverse, unseen target domains, particularly when target-domain data is unavailable during training, which aligns with the objective of CDFSS. Existing methods for domain generalization can be categorized into two types: methods that focus on learning domain-invariant feature representations across multiple source domains [4, 37], and methods that generate diverse samples through data or feature augmentation [10, 31]. However, most existing DG methods focus on generalizing the whole model, which is not suitable for direct use in CDFSS tasks. In contrast, our plug-and-play LEM module is compact, lightweight, and effective. In the appendix, we compare our LEM module with domain generalization methods to demonstrate the effectiveness of our approach as a novel sharpness-aware minimization method in addressing the CDFSS problem.
5.3 Cross-domain Few-Shot Segmentation
Unlike the traditional FSS setting, CDFSS requires models to generalize to unseen target domains without accessing target data during training, making the task more realistic and complex. PATNet [20] introduces a transformation module that converts domain-specific features into domain-agnostic features, which can be finetuned with target-domain data during the testing phase. ABCDFSS [17] proposes a test-time task adaptation method, utilizing consistency-based contrastive learning to prevent overfitting and improve class discriminability. DR-Adapter [33] introduces a small adapter for rectifying diverse target-domain styles to the source domain. Compared with them, our approach proposes two simple plug-and-play modules that require no training and can be directly applied to boost performance in CDFSS. Besides, a few works [12, 27] are based on the self-support concept. Our LCM module follows a similar principle but offers two significant advantages: (1) while other methods use query prototypes to match query features, we directly use the low-level query features to adjust the query mask, preventing further loss of information; (2) LCM can be conveniently added during the target-domain testing phase, rather than throughout the entire training process.
6 Conclusion
In this paper, we focus on a well-observed but under-explored phenomenon in CDFSS: for target domains, the best segmentation performance is always achieved at the very early epochs, followed by a sharp decline as source-domain training progresses. We delve into this phenomenon for an interpretation, and find that low-level features are vulnerable to domain shifts, leading to sharper loss landscapes during source-domain training, which is the devil of CDFSS. Based on this interpretation, we further propose a method that includes two plug-and-play modules: one to flatten the loss landscapes for low-level features during source-domain training, and the other to directly supplement target-domain information to the model during target-domain testing. Extensive experiments on four CDFSS benchmarks validate our rationale and effectiveness.
Acknowledgments
This work is supported by the National Key Research and Development Program of China under grant 2024YFC3307900; the National Natural Science Foundation of China under grants 62206102, 62436003, 62376103 and 62302184; the Major Science and Technology Project of Hubei Province under grant 2024BAA008; the Hubei Science and Technology Talent Service Project under grant 2024DJC078; and Ant Group through the CCF-Ant Research Fund. The computation was completed on the HPC Platform of Huazhong University of Science and Technology.
References
- Benenson et al. [2019] Rodrigo Benenson, Stefan Popov, and Vittorio Ferrari. Large-scale interactive object segmentation with human annotators. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11700–11709, 2019.
- Boudiaf et al. [2021] Malik Boudiaf, Hoel Kervadec, Ziko Imtiaz Masud, Pablo Piantanida, Ismail Ben Ayed, and Jose Dolz. Few-shot segmentation without meta-learning: A good transductive inference is all you need? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13979–13988, 2021.
- Candemir et al. [2013] Sema Candemir, Stefan Jaeger, Kannappan Palaniappan, Jonathan P Musco, Rahul K Singh, Zhiyun Xue, Alexandros Karargyris, Sameer Antani, George Thoma, and Clement J McDonald. Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Transactions on Medical Imaging, 33(2):577–590, 2013.
- Carlucci et al. [2019] Fabio M Carlucci, Antonio D’Innocente, Silvia Bucci, Barbara Caputo, and Tatiana Tommasi. Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2229–2238, 2019.
- Codella et al. [2019] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019.
- Davari et al. [2022] MohammadReza Davari, Stefan Horoi, Amine Natik, Guillaume Lajoie, Guy Wolf, and Eugene Belilovsky. Reliability of cka as a similarity measure in deep learning. arXiv preprint arXiv:2210.16156, 2022.
- Demir et al. [2018] Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 172–181, 2018.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255. Ieee, 2009.
- Dosovitskiy [2020] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Du et al. [2020] Yingjun Du, Jun Xu, Huan Xiong, Qiang Qiu, Xiantong Zhen, Cees GM Snoek, and Ling Shao. Learning to learn with variational information bottleneck for domain generalization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 200–216. Springer, 2020.
- Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88:303–338, 2010.
- Fan et al. [2022] Qi Fan, Wenjie Pei, Yu-Wing Tai, and Chi-Keung Tang. Self-support few-shot semantic segmentation. In European Conference on Computer Vision, pages 701–719, 2022.
- Foret et al. [2020] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
- Hariharan et al. [2011] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In Proceedings of the IEEE International Conference on Computer Vision, pages 991–998. IEEE, 2011.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- He et al. [2024] Weizhao He, Yang Zhang, Wei Zhuo, Linlin Shen, Jiaqi Yang, Songhe Deng, and Liang Sun. Apseg: Auto-prompt network for cross-domain few-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23762–23772, 2024.
- Herzog [2024] Jonas Herzog. Adapt before comparison: A new perspective on cross-domain few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23605–23615, 2024.
- Jaeger et al. [2013] Stefan Jaeger, Alexandros Karargyris, Sema Candemir, Les Folio, Jenifer Siegelman, Fiona Callaghan, Zhiyun Xue, Kannappan Palaniappan, Rahul K Singh, Sameer Antani, et al. Automatic tuberculosis screening using chest radiographs. IEEE Transactions on Medical Imaging, 33(2):233–245, 2013.
- Lang et al. [2022] Chunbo Lang, Gong Cheng, Binfei Tu, and Junwei Han. Learning what not to segment: A new perspective on few-shot segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8057–8067, 2022.
- Lei et al. [2022] Shuo Lei, Xuchao Zhang, Jianfeng He, Fanglan Chen, Bowen Du, and Chang-Tien Lu. Cross-domain few-shot semantic segmentation. In European Conference on Computer Vision, pages 73–90. Springer, 2022.
- Li et al. [2021] Gen Li, Varun Jampani, Laura Sevilla-Lara, Deqing Sun, Jonghyun Kim, and Joongkyu Kim. Adaptive prototype learning and allocation for few-shot segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8334–8343, 2021.
- Li et al. [2020] Xiang Li, Tianhan Wei, Yau Pun Chen, Yu-Wing Tai, and Chi-Keung Tang. Fss-1000: A 1000-class dataset for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2869–2878, 2020.
- Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- Long et al. [2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- Lu et al. [2021] Zhihe Lu, Sen He, Xiatian Zhu, Li Zhang, Yi-Zhe Song, and Tao Xiang. Simpler is better: Few-shot semantic segmentation with classifier weight transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8741–8750, 2021.
- Min et al. [2021] Juhong Min, Dahyun Kang, and Minsu Cho. Hypercorrelation squeeze for few-shot segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6941–6952, 2021.
- Nie et al. [2024] Jiahao Nie, Yun Xing, Gongjie Zhang, Pei Yan, Aoran Xiao, Yap-Peng Tan, Alex C Kot, and Shijian Lu. Cross-domain few-shot segmentation via iterative support-query correspondence mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3380–3390, 2024.
- Oppenheim and Lim [1981] Alan V Oppenheim and Jae S Lim. The importance of phase in signals. Proceedings of the IEEE, 69(5):529–541, 1981.
- Peng et al. [2023] Bohao Peng, Zhuotao Tian, Xiaoyang Wu, Chengyao Wang, Shu Liu, Jingyong Su, and Jiaya Jia. Hierarchical dense correlation distillation for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23641–23651, 2023.
- Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
- Shankar et al. [2018] Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi, and Sunita Sarawagi. Generalizing across domains via cross-gradient training. arXiv preprint arXiv:1804.10745, 2018.
- Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
- Su et al. [2024] Jiapeng Su, Qi Fan, Wenjie Pei, Guangming Lu, and Fanglin Chen. Domain-rectifying adapter for cross-domain few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24036–24045, 2024.
- Tian et al. [2020] Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrichment network for few-shot segmentation. IEEE transactions on pattern analysis and machine intelligence, 44(2):1050–1065, 2020.
- Tong et al. [2024] Jintao Tong, Yixiong Zou, Yuhua Li, and Ruixuan Li. Lightweight frequency masker for cross-domain few-shot semantic segmentation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- Tschandl et al. [2018] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1):1–9, 2018.
- Wang et al. [2019a] Haohan Wang, Zexue He, Zachary C Lipton, and Eric P Xing. Learning robust representations by projecting superficial statistics out. arXiv preprint arXiv:1903.06256, 2019a.
- Wang et al. [2019b] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. Panet: Few-shot image semantic segmentation with prototype alignment. In proceedings of the IEEE/CVF international conference on computer vision, pages 9197–9206, 2019b.
- Xu et al. [2021] Zhenlin Xu, Deyi Liu, Junlin Yang, Colin Raffel, and Marc Niethammer. Robust and generalizable visual representation learning via random convolutions. In International Conference on Learning Representations, 2021.
- Zhang et al. [2021] Gengwei Zhang, Guoliang Kang, Yi Yang, and Yunchao Wei. Few-shot segmentation via cycle-consistent transformer. Advances in Neural Information Processing Systems, 34:21984–21996, 2021.
- Zhang et al. [2022] Jian-Wei Zhang, Yifan Sun, Yi Yang, and Wei Chen. Feature-proxy transformer for few-shot segmentation. Advances in neural information processing systems, 35:6575–6588, 2022.
- Zhang et al. [2024] Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Yu Qiao, Peng Gao, and Hongsheng Li. Personalize segment anything model with one shot. In The Twelfth International Conference on Learning Representations, 2024.
- Zhao et al. [2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
- Zou et al. [2021] Yixiong Zou, Shanghang Zhang, Jianpeng Yu, Yonghong Tian, and José MF Moura. Revisiting mid-level patterns for cross-domain few-shot recognition. In Proceedings of the 29th ACM International Conference on Multimedia, pages 741–749, 2021.
- Zou et al. [2022] Yixiong Zou, Shanghang Zhang, Yuhua Li, and Ruixuan Li. Margin-based few-shot class-incremental learning with class-level overfitting mitigation. Advances in neural information processing systems, 35:27267–27279, 2022.
- Zou et al. [2024a] Yixiong Zou, Yicong Liu, Yiman Hu, Yuhua Li, and Ruixuan Li. Flatten long-range loss landscapes for cross-domain few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23575–23584, 2024a.
- Zou et al. [2024b] Yixiong Zou, Ran Ma, Yuhua Li, and Ruixuan Li. Attention temperature matters in vit-based cross-domain few-shot learning. Advances in Neural Information Processing Systems, 37:116332–116354, 2024b.
- Zou et al. [2024c] Yixiong Zou, Shuai Yi, Yuhua Li, and Ruixuan Li. A closer look at the cls token for cross-domain few-shot learning. Advances in Neural Information Processing Systems, 37:85523–85545, 2024c.
- Zou et al. [2024d] Yixiong Zou, Shanghang Zhang, Haichen Zhou, Yuhua Li, and Ruixuan Li. Compositional few-shot class-incremental learning. arXiv preprint arXiv:2405.17022, 2024d.