Class Similarity Transition: Decoupling Class Similarities and Imbalance from Generalized Few-shot Segmentation
Abstract
In Generalized Few-shot Segmentation (GFSS), a model is trained with a large set of base-class samples and then adapted on limited samples of novel classes. This paper focuses on the relevance between base and novel classes and improves GFSS in two aspects: 1) mining the similarity between base and novel classes to promote the learning of novel classes, and 2) mitigating the class imbalance caused by the volume difference between the support set and the training set. Specifically, we first propose a similarity transition matrix to guide the learning of novel classes with base-class knowledge. Then, we introduce the Label-Distribution-Aware Margin (LDAM) loss and transductive inference into the GFSS task to address class imbalance as well as overfitting of the support set. In addition, by extending the probability transition matrix, the proposed method mitigates the catastrophic forgetting of base classes when learning novel classes. With a simple training phase, our proposed method can be applied to any segmentation network trained on base classes. We validate our method on an adapted version of OpenEarthMap. Compared to existing GFSS baselines, our method surpasses them all by 3% to 7% and ranked second in the OpenEarthMap Land Cover Mapping Few-Shot Challenge at the time of completing this paper.
1 Introduction

In recent years, deep learning has revolutionized the field of computer vision, enabling significant advancements in various vision tasks such as image classification [11, 9], object detection [7], and semantic segmentation [18]. Semantic segmentation, in particular, plays a crucial role in understanding the visual world by assigning a class label to each pixel in an image [20]. This fine-grained pixel-level prediction facilitates scene understanding and serves as a fundamental component in various applications, including autonomous driving, medical image analysis, and augmented reality. Despite the impressive achievements of deep learning models in semantic segmentation, these approaches still have limitations. One significant drawback is the heavy reliance on large annotated datasets for training: manually labeling pixel-level annotations for training data is labor-intensive, time-consuming, and costly [13], hindering the practical application of deep learning.
Recent efforts have introduced the concept of Few-shot Semantic Segmentation (FSS) [22, 1, 24]. The goal of FSS is to enable the model to quickly learn to segment novel classes from limited samples, thus mitigating the demand for data annotation. In the FSS paradigm, the model is first trained on a well-annotated dataset comprising only base-class examples (on the order of thousands). Then, during the testing phase, a support set consisting of a few examples (on the order of tens) of novel classes is provided for few-shot learning algorithms. Finally, the performance of FSS algorithms is evaluated on a query set. However, the FSS task only focuses on the performance on novel classes while ignoring base classes, leading to a performance decrease on the learned base classes; that is, FSS methods gain fast-learning ability by sacrificing the performance of previously learned knowledge [8]. In reality, we would like a neural network to have the same learning ability as humans: quickly learning novel classes from limited samples while preserving what it has learned before.
In recent years, Generalized Few-shot Semantic Segmentation (GFSS) [23, 8] has been proposed to address the shortcomings of the FSS setup. GFSS evaluates the model on both base and novel classes after few-shot learning. Moreover, GFSS relaxes the constraints that FSS imposes on the data, allowing novel classes to appear in the training set as background and only providing annotations of novel classes in the support set during few-shot learning. In this way, GFSS emulates human learning behavior: when learning for the first time, humans focus only on classifying the classes of interest while treating others as background; later, they learn some novel classes out of that background while again treating the remaining classes as background. This effectively alleviates annotation cost, since only the novel classes needed for the second round of learning must be labeled.
Although these GFSS works [23, 12, 8] have shown promising progress, two key elements have been overlooked. The first neglected element is class similarity. Previous work classifies novel classes by learning prototypes [12] or a linear classifier [8] for novel classes during the few-shot learning phase, but overlooks the possibility that novel classes may be relevant to base classes. As shown in Fig. 1, relevant classes can share similar features, such as road type 1 in the base classes and road type 2 in the novel classes. Likewise, the features of sea learned in the base-class learning phase can help enhance the classification of related novel classes, thereby achieving better results. The second ignored element is class imbalance. As shown in Fig. 1, in real-world applications, training examples always exhibit a long-tailed distribution, i.e., a class imbalance issue exists in the datasets. Such imbalanced datasets challenge learning algorithms by making the model easily biased toward head classes [32]. Essentially, given a large-scale training set and disproportionately small support sets, the GFSS model faces an even more imbalanced data distribution, yet previous work [8, 23] ignored this essential issue in the GFSS task.
To address the aforementioned issues, this paper adopts the divide-and-conquer strategy and modifies the GFSS method from three points of view in the few-shot learning phase. Specifically, for utilizing base-class-to-novel-class similarity, we build a similarity transition matrix to guide the learning from base classes to novel classes. The transition matrix is then extended by a base-to-base projection, and an implicit diagonal matrix is learned to encourage maintaining knowledge of the base class. To address the class imbalance between base and novel classes, the Label-Distribution-Aware Margin (LDAM) loss [3] with class distribution prior is introduced, which amplifies the effect of sparse novel classes.
The contributions of our work are summarized as:
- We identify two key problems in the GFSS task: class similarity and class imbalance between the novel and base classes, and accordingly propose a similarity transition matrix and introduce a logit-adjustment loss.
- We show that our method exceeds state-of-the-art GFSS methods and validate the effectiveness of the proposed components through a series of ablation and visualization analyses. Our method ranks second in the development phase of the OpenEarthMap challenge without bells and whistles.
- Our experiments further illustrate the importance of preventing overfitting on the support set and retaining knowledge of base classes. The issues we emphasize point to necessary directions for advancing GFSS in future research.
2 Related Works
Generalized Few-shot Segmentation.
Some works have recently extended existing FSS methods to the GFSS setting. A) Prototypical networks. Instead of learning features as traditional convolutional networks do, prototypical networks learn a network that outputs prototypes of the classes; given a query image, a pixel is then classified by cosine similarity or clustering based on the class prototypes and the pixel feature [19]. As a pioneering work, Tian et al. [23] first proposed a prototypical network for GFSS, named CAPL, with two modules that dynamically adapt both base- and novel-class prototypes. Furthermore, [14] improves the quality of the prototypes learned by the feature extractor by orthogonalizing them, while [15] addresses the similarities between base and novel classes with a Graph Neural Network and a class-contrastive loss. Although some of these works stressed class similarity, none of them addresses the imbalance of training examples between the training set and the support set. B) Unified classifier with transductive inference. Due to the limited number of novel-class examples in the support set, directly learning a classifier for the novel classes on the support set inevitably leads to overfitting and poor performance on novel classes. [2] introduced transductive inference into FSS, i.e., a Kullback-Leibler (KL) divergence term in the loss function between the class proportions of the model's predictions on the query image and those of the estimated ground truth, penalizing overfitting on support images; this makes training a unified end-to-end classification network viable. Later, DIaM [8] extended the idea of transductive learning to GFSS and achieved competitive results. However, DIaM overlooks the utilization of class similarity. Compared with existing GFSS methods, our proposed method addresses both class similarity and class imbalance, achieving higher performance.
Class Similarities.
Utilizing the knowledge from source classes to aid the classification of target classes, i.e., utilizing the knowledge of base classes to help the model learn the novel classes, is known as transfer learning [36]. A) Parameter control directly constrains the parameters of the novel classes, e.g., by sharing the parameters of the base classes [34] or forcing the novel-class parameters to be similar to base-class parameters [35], to take advantage of learned knowledge. However, parameter control does not leverage class similarity effectively. B) Adversarial learning [5, 6] transfers knowledge from the base model to the new model by learning a model whose outputs cannot be discriminated by a discriminator. However, the extra requirement of training a discriminator is unrealistic in the GFSS scenario. Using the similarity between classes to transfer knowledge remains a rarely explored direction, and our method transfers the knowledge learned from base classes to novel classes in a novel way.
Class Imbalance.
Class-imbalanced learning, also referred to as long-tailed learning, aims to learn a classifier that performs well on head, body, and tail classes alike (in the GFSS setting, base classes and novel classes). Various methods tackle class imbalance from different perspectives. A) Expert networks learn a shared feature extractor and class-specific classifiers, called experts, to classify head, body, and tail classes [28]; determining which expert to use for a given prediction has also been studied [26]. However, with more parameters to train, optimizing expert networks is very data-hungry, especially when only limited novel-class examples are available, as in GFSS. B) Re-weighting. Observing that the norms of classification weights for tail classes are disproportionately smaller than those of head classes, which decreases the probabilities assigned to tail classes after the softmax, some researchers [10, 30] enhance the performance on tail classes by normalizing the weights of linear classifiers per class. However, a recent study [17] points out that the correlation between weight norms and classes depends on the design of the classifier and the choice of optimizer, so the re-weighting strategy is tied to specific classifiers and lacks robustness. C) Logit adjustment rectifies the logits of the classifier using the prior class distribution, imposing a greater penalty for misclassifying tail classes [3, 17]. Further methods [21, 31] have been proposed to estimate the prior when it is unavailable. We adopt logit adjustment to mitigate class imbalance in GFSS because it adds nearly no optimization overhead, is robust to the classifier design, and requires no extra training examples.

3 Methods
3.1 Notations
Preliminaries. Denote $H$ and $W$ as the height and width of an image $x$, and $(i, j)$ as the pixel coordinates. In all generality, we consider a segmentation model that combines a feature extractor $f_\phi$, which outputs a feature vector map $z \in \mathbb{R}^{d \times H \times W}$, where $d$ denotes the channel dimension of the feature vectors, and a classifier with weights $\theta \in \mathbb{R}^{d \times K}$, which takes the image features and outputs segmentation maps, where $K$ is the number of classes to predict. For simplicity, we omit the bias of the classifier.
Generalized few-shot segmentation. In GFSS, two sets of classes are considered: the base classes $\mathcal{C}^b$ and the novel classes $\mathcal{C}^n$, with $\mathcal{C}^b \cap \mathcal{C}^n = \varnothing$. In the training phase, the model learns from a dataset $\mathcal{D}^b$ that contains annotations of the base classes and treats the novel classes as background. At test time, the model is evaluated through a series of tasks. In each task, the model first goes through a few-shot learning phase on a given support set $\mathcal{S}$, containing a few images along with their corresponding binary segmentation masks for novel classes sampled from $\mathcal{C}^n$. After the model quickly learns from the support set $\mathcal{S}$, it is required to give predictions on an unlabeled image, termed the query image, over both base and novel classes to test the GFSS learning algorithm, i.e., for pixel $(i, j)$ we have
$$\hat{y}_{ij} = \operatorname*{arg\,max}_{k \in \{bg\} \cup \mathcal{C}^b \cup \mathcal{C}^n} p(k \mid x),$$
in which $bg$ stands for background, $(i, j)$ represents pixel indexing, and we omit the pixel index on the right-hand side for simplicity.
3.2 Similarity transition
Our idea of class similarity originates from the observation that, given an image containing both base and novel classes, a model trained only on base classes tends to wrongly classify a pixel of a novel class as the most similar base class. Formally, after learning on $\mathcal{D}^b$, given an image $x$ containing both novel and base classes, a pixel $(i, j)$ of a novel class will be wrongly classified as a base class by the model, i.e.
$$\hat{y}_{ij} = \operatorname*{arg\,max}_{k \in \{bg\} \cup \mathcal{C}^b} \operatorname{softmax}\big(\theta^{b\top} \cdot f_{\phi^b}(x)_{ij}\big)_k \in \mathcal{C}^b, \tag{1}$$
where $\cdot$ denotes the matrix product and $\theta^b$, $\phi^b$ represent the parameters of the linear classifier and the feature extractor after training on $\mathcal{D}^b$. If we obtain the conditional probability $p(c^n \mid c^b)$ of a novel class $c^n$ given a predicted base class $c^b$, we would have
$$p(y_{ij} = c^n \mid x) = \sum_{c^b \in \{bg\} \cup \mathcal{C}^b} p(c^n \mid c^b)\, p(\hat{y}_{ij} = c^b \mid x). \tag{2}$$
The probability $p(c^n \mid c^b)$ means: given that a pixel is wrongly classified as base class $c^b$ by the base model, the probability that the pixel actually belongs to novel class $c^n$. Thus, $p(c^n \mid c^b)$ describes the similarity from base class $c^b$ to novel class $c^n$. For simplicity, we denote
$$t_{nm} = p(c_n \mid c_m), \qquad c_n \in \mathcal{C}^n,\; c_m \in \{bg\} \cup \mathcal{C}^b, \tag{3}$$
and then for a pixel we have a corresponding matrix $T = (t_{nm}) \in \mathbb{R}^{|\mathcal{C}^n| \times (1 + |\mathcal{C}^b|)}$ (the additional column accounts for the background).
Note that, for simplicity, we omit the pixel index $(i, j)$. We call this matrix the Similarity Transition Matrix from base classes to novel classes (abbreviated as the Transition Matrix), since it defines the probability of transiting wrongly classified base classes to novel classes. Besides, the elements of $T$ are non-negative and each column sums to 1, since they represent probabilities.
Note that the conditional probability $p(\hat{y}_{ij} = c^b \mid x)$ is the (possibly wrong) classification probability given by the classifier after training on $\mathcal{D}^b$. Thus, with $T$ and the base classifier $\theta^b$ obtained, our network consists of two branches, as shown in Fig. 2:
- Classification Branch. The classification branch is a classifier that takes in the feature vector and outputs classification logits. The classifier can be implemented in any form, e.g., cosine-similarity-based classifiers, Euclidean-distance-based clustering, etc. In our implementation, we simply choose a linear classifier, as a vanilla classification network does. Thus we have the classification logits
$$l^{cls}_{ij} = \operatorname{cat}\big(\theta^{b\top} z_{ij},\, \theta^{n\top} z_{ij}\big), \tag{4}$$
in which cat represents the concatenation operation and $\theta^b$, $\theta^n$ represent the parameters of the linear classifiers of the base and novel classes learned during few-shot learning.
- Transition Branch. The transition branch takes in the feature vector and outputs the similarity transition logits. Given the similarity transition matrix $T$ and the classifier weights $\theta^{b}$ obtained after training on $\mathcal{D}^b$, we obtain the similarity transition logits (abbreviated as transition logits)
$$l^{trans}_{ij} = T \cdot \operatorname{softmax}\big(\theta^{b\top} z_{ij}\big). \tag{5}$$
To obtain the similarity transition matrix, we adopt a Multilayer Perceptron (MLP) to estimate it. Also, to leverage the feature extraction capability of the backbone, our implementation takes the feature vector map $z$ as input rather than the raw image. Finally, the output vector of the MLP is reshaped to the shape of the similarity transition matrix. A minimal sketch of the two branches is given after this list.
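To make the computation concrete, the following PyTorch-style sketch combines the classification logits of Eq. (4) with the transition logits of Eq. (5). It is only an illustrative sketch under our own assumptions: the tensor names are hypothetical, the transition matrix is shared across pixels here (the paper defines it per pixel), and the two branches are fused by simple addition before the softmax.

```python
import torch
import torch.nn.functional as F

def two_branch_logits(z, theta_cls, theta_b_frozen, T):
    """Sketch of the classification branch (Eq. 4) and transition branch (Eq. 5).

    z:              (d, H, W) feature map from the backbone
    theta_cls:      (d, 1 + Cb + Cn) linear classifier learned during few-shot learning
    theta_b_frozen: (d, 1 + Cb) base classifier frozen after training on the base set
    T:              (1 + Cb + Cn, 1 + Cb) extended similarity transition matrix,
                    assumed shared across pixels and column-stochastic
    """
    d, H, W = z.shape
    feat = z.reshape(d, -1)                                   # (d, H*W)

    # Classification branch: plain linear logits over bg + base + novel classes.
    cls_logits = theta_cls.t() @ feat                         # (1+Cb+Cn, H*W)

    # Transition branch: base-class probabilities routed through T.
    base_prob = F.softmax(theta_b_frozen.t() @ feat, dim=0)   # (1+Cb, H*W)
    trans_logits = T @ base_prob                              # (1+Cb+Cn, H*W)

    # Assumption: the two branches are fused by simple addition before the softmax.
    return (cls_logits + trans_logits).reshape(-1, H, W)
```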
Reducing the difficulty of optimization.
Further on, increasing the number of parameters improves the model's capacity but also raises the risk of overfitting. The proposed MLP contains a large number of parameters to optimize, which heavily burdens the optimization algorithm. To reduce the difficulty of optimizing the MLP, we decompose it into two smaller MLPs, $M_1$ and $M_2$, which output the rows and columns of the transition matrix, respectively, i.e.,
$$T = M_1(z) \otimes M_2(z) + B,$$
where $B$ is a learnable bias that prevents $T$ from having rank 1, $M_1$ and $M_2$ have their own parameters, and $\otimes$ represents the outer product of two vectors. By decomposing, we greatly reduce the number of parameters of the MLP in our implementation.
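A minimal PyTorch-style sketch of this decomposed estimator is shown below, assuming a hidden width of 256, per-pixel row and column factors, and a column-wise softmax to make each column of $T$ sum to 1; the module and variable names are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionEstimator(nn.Module):
    """Low-rank, per-pixel estimate of the transition matrix T (hypothetical names)."""

    def __init__(self, feat_dim, n_rows, n_cols, hidden=256):
        super().__init__()
        # Two small MLPs predict a row factor and a column factor instead of
        # one large MLP predicting all n_rows * n_cols entries at once.
        self.row_mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_rows))
        self.col_mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_cols))
        # Learnable bias B keeps the resulting matrix from having rank 1.
        self.bias = nn.Parameter(torch.zeros(n_rows, n_cols))

    def forward(self, z):
        d, H, W = z.shape
        feat = z.reshape(d, -1).t()                             # (H*W, d)
        rows = self.row_mlp(feat)                               # (H*W, n_rows)
        cols = self.col_mlp(feat)                               # (H*W, n_cols)
        # Per-pixel outer product plus the shared bias.
        T = torch.einsum('pr,pc->prc', rows, cols) + self.bias  # (H*W, n_rows, n_cols)
        # Column-wise softmax so each column is a probability vector (our assumption).
        return F.softmax(T, dim=1)
```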
To our knowledge, we are the first to utilize a similarity transition matrix in GFSS. A similar work also uses a transition matrix [25], but several key differences support our novelty: a) [25] was originally proposed to tackle multi-category classification, while we focus on GFSS; the problems studied are essentially different. b) The transition probabilities proposed in [25] evaluate the probability that a fine class belongs to a coarse class, while we mainly utilize transition probabilities as a measure of the similarity between base classes and novel classes.
3.3 Exploring class imbalance in-depth
Natural datasets exhibit long-tailed distributions. Simply adopting the cross-entropy of Eq. (6), in which $l_k$ is the $k$-th logit of the model's output and $y$ is the ground-truth class, to ensure high confidence in predictions will eventually bias the model towards the dominant classes, i.e., the base classes.
$$\mathcal{L}_{CE} = -\log \frac{\exp(l_y)}{\sum_{k} \exp(l_k)}. \tag{6}$$
This is because, with the traditional cross-entropy, the great number of base-class samples causes the model to incur a higher penalty when misclassifying base classes as novel classes, leading to larger gradients [32] to mitigate such errors. On the contrary, misclassifying novel classes as base classes does not result in a great penalty. Consequently, the optimized model is more inclined to predict all samples as base classes, as shown in our experiments.
A proven effective way to mitigate the disproportionate penalties between base and novel classes is to rectify the logits before the cross-entropy [17, 3], i.e., amplifying the loss on novel classes. Given the class distribution $\pi_k$ for the $k$-th class, we use this class distribution to rectify the logits. Following LDAM [3], we adopt the rectified cross-entropy:
$$\mathcal{L}_{LDAM} = -\log \frac{\exp(l_y - \Delta_y)}{\exp(l_y - \Delta_y) + \sum_{k \neq y} \exp(l_k)}, \qquad \Delta_k = \frac{C}{\pi_k^{1/4}}, \tag{7}$$
where $\Delta_k$ is the class-dependent margin and $C$ is a hyper-parameter to tune. By doing so, we rectify the logits according to the frequency of each class; as a result, the loss on novel classes is amplified. The remaining problem is how to obtain the class distribution. In our implementation, we estimate the class distribution from the training-set and support-set annotations.
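For concreteness, a minimal sketch of this rectified cross-entropy is given below. It assumes flattened per-pixel logits, estimated class frequencies, and a margin scale $C$; the function name and default values are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def ldam_loss(logits, target, class_freq, C=0.5):
    """LDAM-style rectified cross-entropy (sketch; names and defaults are illustrative).

    logits:     (N, K) per-pixel logits with pixels flattened into N samples
    target:     (N,)  ground-truth class indices
    class_freq: (K,)  estimated class frequencies (sum to 1)
    C:          margin scale, a hyper-parameter
    """
    # Rarer classes receive a larger margin, which amplifies their loss.
    margins = C / class_freq.clamp_min(1e-8) ** 0.25           # (K,)

    # Subtract the margin only from the logit of the ground-truth class.
    idx = torch.arange(logits.size(0))
    adjusted = logits.clone()
    adjusted[idx, target] = adjusted[idx, target] - margins[target]
    return F.cross_entropy(adjusted, target)
```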
3.4 Preserving base knowledge

The GFSS setting prevents the support images from providing explicit supervision of the base classes; thus, during few-shot learning, as the parameters are continually updated, the lack of base-class supervision eventually makes the model forget the base-class knowledge obtained in the training phase. As shown in Fig. 3, after removing knowledge distillation, DIaM [8] wrongly classifies base classes as novel classes, e.g., Agric land type 1 as Agric land type 2, Sea as River, and Building type 1 as background.
The previous method DIaM [8] adopts knowledge distillation to overcome this forgetting:
$$\mathcal{L}_{KD} = \mathcal{D}_{KL}\big(p^{old} \,\|\, \pi(p)\big), \tag{8}$$
where $p^{old}$ is the prediction of the base classifier obtained after training on $\mathcal{D}^b$, $p$ is the prediction of the classifier for both base and novel classes during few-shot learning, and $\pi(\cdot)$ is a projection function that maps the novel-class probabilities onto the background, since during the training phase novel classes are treated as background, i.e., $\pi(p)_{bg} = p_{bg} + \sum_{c \in \mathcal{C}^n} p_c$ and $\pi(p)_c = p_c$ for $c \in \mathcal{C}^b$.
In our proposed method, we preserve the knowledge of base classes in a different way, by extending the Transition Matrix from $T \in \mathbb{R}^{|\mathcal{C}^n| \times (1 + |\mathcal{C}^b|)}$ to $\tilde{T} \in \mathbb{R}^{(1 + |\mathcal{C}^b| + |\mathcal{C}^n|) \times (1 + |\mathcal{C}^b|)}$, as shown in Eq. (9), where $T^{b \to n}$ is the aforementioned $T$. Each element of $T^{b \to b}$ represents the transition probability from a base class to a base class. In the extreme situation in which the diagonal elements of $T^{b \to b}$ are close to 1 while the others are close to 0, the learned base-class knowledge is preserved after the product with the base-class probabilities. A detailed exemplification is shown in Sec. 4.3.
$$\tilde{T} = \begin{bmatrix} T^{b \to b} \\ T^{b \to n} \end{bmatrix}. \tag{9}$$
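The preservation effect can be illustrated with a small numerical sketch: when the base-to-base block is close to the identity, the transition logits keep the largest mass on the originally predicted base class. The class counts and probability values below are purely hypothetical.

```python
import torch

# Hypothetical toy setting: 1 background + 2 base classes, 2 novel classes.
base_prob = torch.tensor([0.05, 0.90, 0.05])       # base model, confident in base class 1

T_bb = torch.tensor([[0.98, 0.01, 0.01],           # base-to-base block, near identity
                     [0.01, 0.97, 0.01],
                     [0.01, 0.01, 0.97]])
T_bn = torch.tensor([[0.00, 0.01, 0.01],           # base-to-novel block (similarities)
                     [0.00, 0.00, 0.00]])

T_ext = torch.cat([T_bb, T_bn], dim=0)             # extended transition matrix, Eq. (9)
trans_logits = T_ext @ base_prob                   # (1 + Cb + Cn,)

# The largest entry stays on the originally predicted base class, so base
# knowledge is preserved, while novel classes only receive the probability
# mass routed to them by the similarity block.
print(trans_logits)
```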
| Method | Backbone | Base | Road type 2 | River | Boat & ship | Agric land type 2 | Novel | Average mIoU | Weighted mIoU |
|---|---|---|---|---|---|---|---|---|---|
| CAPL | ResNet-101 | 27.64 | 0.75 | 17.15 | 0.07 | 1.25 | 4.81 | 16.02 | 13.99 |
| BAM | ResNet-101 | 36.33 | 0.00 | 5.31 | 0.45 | 8.30 | 3.52 | 19.93 | 16.64 |
| DIaM | ResNet-101 | 37.41 | 0.12 | 5.96 | 1.83 | 8.61 | 4.13 | 20.77 | 17.44 |
| Ours-RN | ResNet-101 | 37.37 | 0.04 | 18.93 | 0.05 | 21.74 | 10.19 | 23.78 | 21.06 |
| Ours-CN | ConvNext-L | 55.46 | 0.44 | 39.13 | 0.00 | 47.28 | 21.71 | 38.58 | 35.21 |
3.5 Preventing overfitting on the support set
Learning a classifier from $\mathcal{S}$ for many epochs leads the classifier to overfit the support set, as shown in Fig. 4(a). To prevent this overfitting, [2] introduces a transductive regularization term, which we adapt into our framework as
$$\mathcal{L}_{KL} = \mathcal{D}_{KL}\big(\hat{\pi} \,\|\, \pi^q\big), \tag{10}$$
with $\hat{\pi}$ being the proportion of each class in the model's prediction on the query image and $\pi^q$ being that of the ground truth. This term penalizes the model when the class proportions of its prediction on the query image mismatch those of the ground truth, so as to prevent overfitting on the support set.
$\pi^q$ is unknown during few-shot learning, so we estimate it from the model's prediction, i.e.
$$\pi^q_k \approx \frac{1}{HW} \sum_{(i, j)} p\big(y_{ij} = k \mid x^q\big). \tag{11}$$
In summary, the final objective is
$$\mathcal{L} = \mathcal{L}_{LDAM} + \lambda\, \mathcal{L}_{KL}, \tag{12}$$
where $\lambda$ is a hyper-parameter to tune.
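The sketch below illustrates the proportion-matching regularizer of Eq. (10) and the combined objective of Eq. (12); how the target proportions $\pi^q$ are obtained and frozen, as well as the function names and the default weight, are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def class_proportions(logits):
    """Average the per-pixel softmax over the image: (K, H, W) -> (K,)."""
    return F.softmax(logits, dim=0).mean(dim=(1, 2))

def transductive_kl(query_logits, target_prop):
    """KL divergence between the predicted class proportions on the query image
    and the (estimated, detached) target proportions, as in Eq. (10)."""
    pred_prop = class_proportions(query_logits)
    return torch.sum(pred_prop * (pred_prop.clamp_min(1e-8).log()
                                  - target_prop.clamp_min(1e-8).log()))

def total_loss(support_ldam, query_logits, target_prop, lam=1.0):
    """Assumed combined objective of Eq. (12): support-set LDAM loss plus the
    transductive regularizer on the query image, weighted by lambda."""
    return support_ldam + lam * transductive_kl(query_logits, target_prop)
```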
4 Experiments
4.1 Experimental setting
Datasets.
To test the effectiveness of our method, we evaluate it on an adapted version of OpenEarthMap [27]. The adapted version consists of 408 samples of the original OpenEarthMap benchmark dataset and has 15 classes. We select 8 classes as base classes, including background, and 4 as novel classes. The support set consists of 20 image-label pairs, with 5 examples for each of the 4 novel classes. The labels of the support-set images do not contain any base classes, and within each 5-example set the labels contain only one novel class. Details of the labels are shown in Tab. 2.
Evaluation protocol.
We report the mean intersection-over-union (mIoU) over the classes. In our tables, Base and Novel refer to the mIoU over base classes and novel classes, respectively, and Average refers to the mIoU over all classes. Following [8], directly comparing the Average mIoU is unfair to novel classes, so we also adopt the Weighted mIoU, which weights Base mIoU and Novel mIoU by 0.4:0.6.
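Concretely, this corresponds to
$$\text{Weighted mIoU} = 0.4 \times \text{Base mIoU} + 0.6 \times \text{Novel mIoU};$$
for example, for DIaM in Tab. 1, $0.4 \times 37.41 + 0.6 \times 4.13 \approx 17.44$.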
Baselines.
We adapt CAPL [23] and DIaM [8] to our dataset and compare the results. Although BAM [12] was proposed for FSS, we adapt it to GFSS following [8] and report its results on the GFSS task. For all implementations, we train the classifier for 100 epochs. Necessary down-sampling and up-sampling are implemented.
4.2 Main Results




Comparison to state-of-the-art GFSS. Tab. 1 reports the comparison of our method with current SOTA methods in GFSS. For a fair comparison, we use ResNet-101 as the backbone for all compared methods. It can be seen that: A) our method with the ResNet-101 backbone, i.e., Ours-RN, outperforms the existing SOTA, and the main improvement comes from the novel classes; B) the great majority of the novel-class improvement comes from River and Agric land type 2, because the similar base class Sea helps the model classify River, and similarly for Agric land type 2.
Furthermore, to test our method's effectiveness with different backbones, we evaluate several backbones, as shown in Tab. 3. The results indicate that our method is equally valid for different backbones: even with a strong backbone, our method still improves performance. We report the best result with ConvNext-L [16] as the backbone, which achieves 34.57 weighted mIoU. In the OpenEarthMap Land Cover Mapping Few-Shot Challenge, this result ranked No. 2 at the time we finished the paper.
4.3 Visualization of transition matrix
In our method, we do not rely on knowledge distillation to preserve the knowledge learned from the base classes. Instead, we achieve the preservation by extending the transition matrix with the base-to-base part. To verify this claim and show the effectiveness of the base-to-base part, we visualize the transition matrix in Fig. 5.
As Fig. 5 indicates, the diagonal elements of the base-to-base part $T^{b \to b}$ are dominant in their corresponding columns, which suggests that, after the product with the classification probabilities given by the base classifier, the logit at the corresponding position is much greater than the others. After the softmax, the base classes therefore retain greater probabilities. Thus, our proposed method succeeds in preserving the knowledge of base classes.
Besides, the first column of the transition matrix represents the similarity transition probabilities from the background class of the training set to base and novel classes during few-shot learning. It can be seen that the transition probabilities from background to novel classes are high while those from background to base classes are low. This is because, during the training phase, potential novel classes are treated as background whereas base classes are not. Thus, with high transition probabilities from background to novel classes, the transition branch further helps the model distinguish the classes that were previously considered background in the training set.
4.4 Quantitative study of overfitting
As claimed above, optimizing the classifier on the support set may result in overfitting. We conduct a quantitative study to verify this claim, as shown in Fig. 4. A) As shown in Fig. 4(a), as the optimization continues, the classifier suffers from overfitting: the mIoU on the support set keeps increasing while that on the query images starts decreasing after reaching its highest score at about epoch 450. B) Adding the transductive term $\mathcal{L}_{KL}$ significantly delays the onset of overfitting from about epoch 450 to about epoch 620; moreover, the classifier achieves higher performance on query images, exceeding 5, whereas without it the performance stays below 4. C) With our full method, the classifier keeps improving on the query images and has not yet reached its peak, suggesting that our method still has ample untapped potential, enabling higher performance to be achieved.

4.5 Ablation study
To validate our proposed method, we do various ablation studies from different perspectives.
Ablation on components.
To verify the effectiveness of our components, we conduct ablation studies on all of them. The results show that: a) without $\mathcal{L}_{KL}$, the model achieves better results on base classes but shows a degradation on novel classes, corresponding to the previous experiment that revealed overfitting on the support set; b) without the transition logits, the model shows a degradation on novel classes, especially on Agric land type 2, showing that our similarity transition is valid and effective; c) without $\mathcal{L}_{LDAM}$, the model shows a great degradation on novel classes due to the disproportionate number of base-class samples relative to novel-class samples; d) with all the components, our method achieves a significant performance improvement on novel classes with an acceptable degradation on base classes. In summary, with all components equipped, our method improves performance prominently on novel classes with only a minor decrease on base classes.
| | w/o $\mathcal{L}_{KL}$ | w/o transition | Ours |
|---|---|---|---|
| Tree | 60.26 | 60.26 | 60.22 |
| Rangeland | 59.33 | 59.50 | 59.32 |
| Bareland | 35.99 | 35.04 | 35.98 |
| Agric land type 1 | 75.47 | 75.51 | 75.47 |
| Road type 1 | 55.12 | 54.55 | 55.06 |
| Sea, lake, & pond | 40.26 | 40.28 | 40.29 |
| Building type 1 | 61.80 | 61.78 | 61.86 |
| Road type 2 | 0.44 | 0.35 | 0.44 |
| River | 39.98 | 30.14 | 39.13 |
| Boat & ship | 0.00 | 0.00 | 0.00 |
| Agric land type 2 | 41.16 | 29.70 | 47.28 |
| Base mIoU | 55.46 | 55.28 | 55.46 |
| Novel mIoU | 20.17 | 15.05 | 21.71 |
| Average mIoU | 37.82 | 35.16 | 38.58 |
| Weighted mIoU | 34.29 | 31.14 | 35.21 |
Ablation on the backbone compared with DIaM [8].
To show that our proposed method does not overfit a specific backbone, we conduct an ablation study on different backbones, demonstrating the generalization of our method. In detail, we compare our method with DIaM [8] using different backbones, i.e., ResNet-101 with PSPNet [33], ViT-B/16 [4] with UperNet [29], and ConvNext-L with UperNet. A) Except for the Base mIoU of ResNet-101 with PSPNet, our method beats DIaM on all metrics with all backbones, revealing that our method works better on GFSS than DIaM. B) Across backbones, the performance gain of our method is steady, i.e., 3.62, 2.39, and 3.60 weighted-mIoU points, respectively. This suggests that our method, as a plug-in module that equips any segmentation model with few-shot learning ability, continually gains improvements, indicating that with a stronger backbone our method can achieve even better performance. Unfortunately, due to limited time, we did not evaluate more backbones to demonstrate our method's effectiveness more broadly.
| Backbone | ResNet-101 | ViT-B/16 | ConvNext-L |
|---|---|---|---|
| Seg. Head | PSPNet | UPerNet | UPerNet |
| Base mIoU (DIaM / Ours) | 37.41 / 37.37 | 46.48 / 48.08 | 54.78 / 55.46 |
| Novel mIoU (DIaM / Ours) | 4.13 / 10.19 | 12.33 / 15.24 | 16.16 / 21.71 |
| Average mIoU (DIaM / Ours) | 20.77 / 23.78 | 29.41 / 31.66 | 35.47 / 38.58 |
| Weighted mIoU (DIaM / Ours) | 17.44 / 21.06 | 25.99 / 28.38 | 31.61 / 35.21 |
5 Conclusions and Discussions
In this work, we identify two issues that exist in the GFSS task and verify our claims with experiments. Our key insight is to utilize the similarity between base and novel classes and to handle the class imbalance between the training set and the support set. Beyond that, we conduct extensive experiments to show the effectiveness of our method, demonstrating that the proposed components address the issues we identified. In the future, we will further investigate how to encode the similarity more effectively: since we do not obtain any supervision of the class similarities, we train the model in an end-to-end manner, hoping the model learns the similarities implicitly; with extra supervision, however, well-designed architectures are expected to improve performance further. Furthermore, other effective means of preventing overfitting on the support set remain to be studied.
References
- Boudiaf et al. [2021a] Malik Boudiaf, Hoel Kervadec, Ziko Imtiaz Masud, Pablo Piantanida, Ismail Ben Ayed, and Jose Dolz. Few-shot segmentation without meta-learning: A good transductive inference is all you need?, 2021a.
- Boudiaf et al. [2021b] Malik Boudiaf, Hoel Kervadec, Ziko Imtiaz Masud, Pablo Piantanida, Ismail Ben Ayed, and Jose Dolz. Few-shot segmentation without meta-learning: A good transductive inference is all you need?, 2021b.
- Cao et al. [2019] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss, 2019.
- Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Ganin and Lempitsky [2015] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation, 2015.
- Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks, 2016.
- Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation, 2014.
- Hajimiri et al. [2023] Sina Hajimiri, Malik Boudiaf, Ismail Ben Ayed, and Jose Dolz. A strong baseline for generalized few-shot semantic segmentation, 2023.
- He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
- Kim and Kim [2020] Byungju Kim and Junmo Kim. Adjusting decision boundary for class imbalanced learning, 2020.
- Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2012.
- Lang et al. [2022] Chunbo Lang, Gong Cheng, Binfei Tu, and Junwei Han. Learning what not to segment: A new perspective on few-shot segmentation, 2022.
- Lin et al. [2015] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
- Liu et al. [2023a] Sun-Ao Liu, Yiheng Zhang, Zhaofan Qiu, Hongtao Xie, Yongdong Zhang, and Ting Yao. Learning orthogonal prototypes for generalized few-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11319–11328, 2023a.
- Liu et al. [2023b] Weide Liu, Zhonghua Wu, Yang Zhao, Yuming Fang, Chuan-Sheng Foo, Jun Cheng, and Guosheng Lin. Harmonizing base and novel classes: A class-contrastive approach for generalized few-shot segmentation, 2023b.
- Liu et al. [2022] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022.
- Menon et al. [2021] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment, 2021.
- Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015.
- Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning, 2017.
- Thoma [2016] Martin Thoma. A survey of semantic segmentation, 2016.
- Tian et al. [2020a] Junjiao Tian, Yen-Cheng Liu, Nathan Glaser, Yen-Chang Hsu, and Zsolt Kira. Posterior re-calibration for imbalanced datasets, 2020a.
- Tian et al. [2020b] Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrichment network for few-shot segmentation, 2020b.
- Tian et al. [2022] Zhuotao Tian, Xin Lai, Li Jiang, Shu Liu, Michelle Shu, Hengshuang Zhao, and Jiaya Jia. Generalized few-shot semantic segmentation, 2022.
- Wang et al. [2020] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. Panet: Few-shot image semantic segmentation with prototype alignment, 2020.
- Wang et al. [2023] Renzhen Wang, De cai, Kaiwen Xiao, Xixi Jia, Xiao Han, and Deyu Meng. Label hierarchy transition: Delving into class hierarchies to enhance deep classifiers, 2023.
- Wang et al. [2022] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella X. Yu. Long-tailed recognition by routing diverse distribution-aware experts, 2022.
- Xia et al. [2022] Junshi Xia, Naoto Yokoya, Bruno Adriano, and Clifford Broni-Bediako. Openearthmap: A benchmark dataset for global high-resolution land cover mapping, 2022.
- Xiang et al. [2020] Liuyu Xiang, Guiguang Ding, and Jungong Han. Learning from multiple experts: Self-paced knowledge distillation for long-tailed classification, 2020.
- Xiao et al. [2018] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018.
- Ye et al. [2022] Han-Jia Ye, Hong-You Chen, De-Chuan Zhan, and Wei-Lun Chao. Identifying and compensating for feature deviation in imbalanced deep learning, 2022.
- Zhang et al. [2021] Songyang Zhang, Zeming Li, Shipeng Yan, Xuming He, and Jian Sun. Distribution alignment: A unified framework for long-tail visual recognition, 2021.
- Zhang et al. [2023] Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, and Jiashi Feng. Deep long-tailed learning: A survey, 2023.
- Zhao et al. [2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
- Zhuang et al. [2011] Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong Xiong, and Zhongzhi Shi. Exploiting associations between word clusters and document classes for cross-domain text categorization. Statistical Analysis and Data Mining, 4:100 – 114, 2011.
- Zhuang et al. [2014] Fuzhen Zhuang, Ping Luo, Changying Du, Qing He, Zhongzhi Shi, and Hui Xiong. Triplex transfer learning: Exploiting both shared and distinct concepts for text classification. IEEE Transactions on Cybernetics, 44(7):1191–1203, 2014.
- Zhuang et al. [2020] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning, 2020.