
School of Computer Science and Technology, Soochow University, Suzhou, 215006, China
Email: {qqiao, 20224227022}@stu.suda.edu.cn
Email: [email protected]
Email: [email protected]

A Simple Task-aware Contrastive Local Descriptor Selection Strategy for Few-shot Learning between Inter-class and Intra-class

Qian Qiao    Yu Xie    Shaoyao Huang    Fanzhang Li (Corresponding Author). †Equal contribution.
Abstract

Few-shot image classification aims to classify novel classes with few labeled samples. Recent research indicates that deep local descriptors have better representational capabilities than image-level features. These studies recognize the impact of background noise on classification performance: they typically filter query descriptors using all local descriptors in the support classes or perform bidirectional selection between local descriptors in the support and query sets. However, they ignore the fact that background features may be useful for the classification performance of specific tasks. This paper proposes a novel task-aware contrastive local descriptor selection network (TCDSNet). First, we calculate a contrastive discriminative score for each local descriptor in the support classes and select discriminative local descriptors to form a support descriptor subset. Then, we leverage the support descriptor subsets to adaptively select discriminative query descriptors for specific tasks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on both general and fine-grained datasets.

Keywords:
Few-shot learning · Task-aware · Local descriptor · Image classification

1 Introduction

The purpose of few-shot learning is to enable models to adapt quickly to new tasks with only a small number of training samples in scenarios where data is scarce. Generally, these methods can be divided into three groups: optimization-based methods [5, 1, 11], metric-based methods [23, 24, 10], and data augmentation-based methods [2, 21, 29, 9, 6].

This work builds on few-shot learning methods based on local descriptors, which fall within the realm of metric learning. Features based on local descriptors exhibit superior representational capabilities compared to image-level features. Among previous works, [14] proposed DN4, which directly utilizes all query descriptors: for each query local descriptor, it selects k support descriptors via k-nearest neighbors (k-NN) and approximates the relationship between query samples and support classes using cosine similarity. Building on DN4, [16] introduced DMN4, which argues that not all query descriptors are task-relevant and that many carry significant background noise; DMN4 establishes mutual nearest neighbor (MNN) relationships to explicitly select the query descriptors most relevant to each task, thereby reducing the impact of background noise on classification. Also based on DN4, [4] and [30] proposed ATL-Net and TADNet, respectively; both measure the relationship between each query local descriptor and all support classes and adaptively select discriminative query descriptors for classification. [19] introduced TALDS-Net, which likewise recognizes background noise in query descriptors: it first adaptively selects an optimal subset of support class local descriptors and then adaptively chooses query descriptors from that subset for classification. All of these methods aim to eliminate background noise, either by filtering query descriptors with all support class local descriptors or by bidirectionally selecting between support and query descriptors. However, they overlook the fact that background features can be useful for specific tasks. From a human cognitive perspective, consider an image of a dog and an image of a dolphin: not only do the target features differ significantly, but so do the background features (a dolphin's background is unlikely to be grassy, whereas a dog's background might include grass). In such cases, background features contribute to classification. Conversely, for two images that both belong to the dolphin category, the background differences are less pronounced, and the background can be treated as noise. For instance, when facing an unfamiliar image, an ocean background helps narrow the prediction to objects commonly found in the ocean, which aids in identifying the target category. Thus, background features within the same category may positively impact classification performance, and background features across different categories may also enhance it. Determining which local descriptors are discriminative is therefore challenging for descriptor-based methods, and background information must be retained or discarded judiciously.

In response to this challenge, a straightforward solution is to select local descriptors from the support classes to form a support descriptor subset, and then use this subset to select query descriptors. Experimental results demonstrate the effectiveness of this simple approach.

This straightforward solution is realized by our proposed Task-Aware Contrastive Discriminative Local Descriptor Selection Network (TCDSNet). Specifically, we first select local descriptors from the support classes. For each support descriptor, we compute the sum of its similarities with the remaining support descriptors of the same category as the intra-class similarity score, and the sum of its similarities with support descriptors from the other categories as the inter-class similarity score. A high intra-class similarity score indicates that the support descriptor strongly represents its class, while a low inter-class similarity score suggests that the support descriptor has high discriminative power with respect to the other classes. We compute a discriminative score by dividing the intra-class similarity score by the inter-class similarity score, which we term the contrastive discriminative score. We then select the top $K$ support descriptors in descending order of this score. Finally, we utilize the selected support descriptors to choose query descriptors: we employ a simple learnable module to adaptively predict a threshold and, using the learned threshold and a score map, select the most discriminative query descriptors for the final classification. This approach enhances the model's classification and generalization capabilities.

In summary, our main contributions are threefold:

  • We propose a novel method that calculates contrastive discriminative scores ($\mathcal{CDS}$) for local descriptors in the support classes. This enhances the model's adaptability to different tasks and strengthens the performance of local descriptors in few-shot learning tasks.

  • We propose a novel Task-Aware Contrastive Discriminative Local Descriptor Selection Network (TCDSNet) that not only selects a subset of support descriptors based on discriminative scores but also incorporates a learnable module for adaptively choosing the discriminative query descriptors.

  • Extensive experimental results demonstrate that TCDSNet outperforms state-of-the-art methods on multiple general and fine-grained datasets.

2 METHOD

Fig. 1 shows an overview of the proposed method.


Figure 1: The overall architecture of the proposed method under a 5-way 5-shot setting. The model primarily consists of three components: a feature extraction module $f_{\theta}$ for extracting features, a module for selecting $K$ discriminative LDs, and $\mathcal{F}_{q}$ for adaptively selecting query LDs.

2.1 Problem Definition

In this paper, we follow the same setting as previous methods [14, 4, 30, 19]. We are given a support set $S$, a query set $Q$, and an auxiliary set $A$, where the label space of $A$ is disjoint from that of $S$ and is used to learn transferable knowledge. The support set $S$ contains $n$ classes, each with $k$ labeled samples, while the samples in the query set $Q$ are unlabeled and share the same label space as $S$. Given the support set and a query image, the task is to classify the query image into one of the $n$ support classes; this constitutes the $n$-way $k$-shot few-shot classification problem. Under this setting, we adopt the episodic training mechanism, a meta-training strategy [25]: we randomly sample from the auxiliary set $A$ to construct $n$-way $k$-shot tasks, each consisting of a support set $A_{S}$ and a query set $A_{Q}$. During the training phase, we construct tens of thousands of such tasks to learn transferable knowledge.
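As an illustration, episodic sampling can be sketched as follows (a minimal Python sketch; the data layout, function name, and number of query samples per class are assumptions, not part of the paper):

```python
# Minimal sketch of episodic sampling from the auxiliary set A.
# Assumed data layout: dict mapping class label -> list of samples.
import random

def sample_episode(auxiliary_set, n_way=5, k_shot=5, q_queries=15):
    """Build one n-way k-shot task (A_S, A_Q) from the auxiliary set."""
    classes = random.sample(list(auxiliary_set.keys()), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        samples = random.sample(auxiliary_set[cls], k_shot + q_queries)
        # Episode-local labels 0..n_way-1 replace the original class ids.
        support += [(x, episode_label) for x in samples[:k_shot]]
        query += [(x, episode_label) for x in samples[k_shot:]]
    return support, query

# During meta-training, tens of thousands of such episodes are drawn, e.g.:
# episodes = [sample_episode(aux_data) for _ in range(10000)]
```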

2.2 Image Representation Based on Local Descriptors

We obtain a three-dimensional feature representation $f_{\theta}(X)\in\mathbb{R}^{h\times w\times d}$ for an image $X$ through the embedding module $f_{\theta}(\cdot)$, which is regarded as a set of $d$-dimensional local descriptors (LDs):

f_{\theta}(X) = [l_{1}, l_{2}, \cdots, l_{m}] \in \mathbb{R}^{m\times d}    (1)

Where $l_{i}$ denotes the $i$-th deep local descriptor (LD). Similar to other descriptor-based methods [14, 4, 16, 30, 19], we regard $f_{\theta}(X)$ as a set of $m$ $d$-dimensional descriptors, where $m=h\times w$.

In each episode, each support class has $k$ images. We denote the descriptor set of category $c$ as $\mathcal{L}_{c}^{S}$, where there are $n$ classes in total, and the descriptor representation of each query image as $l^{q}$. When using a shallower embedding module (e.g., Conv-4), each support category is represented in its original form; when using a deeper embedding module (e.g., ResNet-12), each support category is represented by the empirical mean of its support descriptors.
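For concreteness, the reshaping of a feature map into a set of LDs, and one possible reading of the "empirical mean" used for deeper backbones (averaging the $k$ support images' descriptors position-wise), can be sketched as follows (PyTorch assumed; names are illustrative):

```python
# Minimal sketch: flatten an embedding f_theta(X) of shape (d, h, w) into
# m = h*w local descriptors of dimension d; for deeper backbones, represent a
# support class by the empirical mean of its k support images' descriptors.
import torch

def to_local_descriptors(feature_map: torch.Tensor) -> torch.Tensor:
    """(B, d, h, w) -> (B, m, d) with m = h*w."""
    b, d, h, w = feature_map.shape
    return feature_map.reshape(b, d, h * w).permute(0, 2, 1)

# Hypothetical ResNet-12-style output (5x5x640 for an 84x84 image), k = 5 shots:
features = torch.randn(5, 640, 5, 5)
lds = to_local_descriptors(features)       # (5, 25, 640)
class_mean_lds = lds.mean(dim=0)           # (25, 640): empirical mean over the k images
```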

Table 1: The classification accuracies on the miniImageNet and tieredImageNet datasets in the 5-way 1-shot and 5-shot settings, using Conv-4 and ResNet-12 as backbones, with 95% confidence intervals. All results of the comparative methods are taken from the existing literature ('-': not reported). The methods below the horizontal line are LD-based methods.
Method Conv-4 ResNet-12
 miniImageNet tieredImageNet miniImageNet tieredImageNet
 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
MatchingNet[25] 43.56±0.84 55.31±0.73 - - 63.08±0.20 75.99±0.15 68.50±0.92 80.60±0.71
ProtoNet[23] 51.20±0.26 68.94±0.78 53.45±0.15 72.32±0.57 62.33±0.12 80.88±0.41 68.40±0.14 84.06±0.26
RelationNet[24] 50.44±0.82 65.32±0.70 54.48±0.93 71.31±0.78 60.97 75.12 64.71 78.41
FRN[28] 54.87 71.56 55.54 74.68 66.45±0.19 82.83±0.13 72.06±0.22 86.89±0.14
Meta-OLE[27] 56.82±0.84 73.87±0.67 58.82±0.88 75.85±0.87 67.04±0.72 82.23±0.67 68.82±0.71 85.51±0.59
Approximate GAP[12] 53.52±0.88 70.75±0.67 57.47±0.99 71.66±0.76 - - - -
GAP[12] 54.86±0.85 71.55±0.61 58.56±0.93 72.82±0.77 - - - -
DeepEMD[31] 51.72±0.20 65.10±0.39 51.22±0.14 65.81±0.68 65.91±0.82 82.41±0.56 71.16±0.87 86.03±0.58
DN4[14] 51.24±0.74 71.02±0.64 52.89±0.23 73.36±0.73 65.35 81.10 69.60 83.41
DMN4[16] 55.77 74.22 56.99 74.13 66.58 83.52 72.10 85.72
ATL-Net[4] 54.30±0.76 73.22±0.63 - - - - - -
TADNet[30] 56.14±0.20 74.68±0.15 57.88±0.21 75.98±0.17 67.26±0.20 84.23±0.13 71.29±0.22 86.46±0.15
TCDSNet(ours) 57.14±0.22 75.89±0.35 58.67±0.61 76.06±0.33 68.53±0.19 85.12±0.42 72.43±0.72 87.35±0.55

2.3 Contrastive Discriminative Scores for Support Local Descriptors Selection

As mentioned above, $X_{S}$ denotes an image in a support class; it is fed into the embedding module $f_{\theta}$ to obtain local descriptors $\mathcal{L}^{S}=f_{\theta}(X_{S})\in\mathbb{R}^{m\times d}$, where $m=h\times w$. Here, $l^{s}$ denotes one support local descriptor in $\mathcal{L}^{S}$, $\hat{\mathcal{L}}^{S}$ denotes the set of remaining support descriptors of the same class excluding the current $l^{s}$, and $\bar{\mathcal{L}}^{S}$ denotes the set of local descriptors from the remaining support classes. Thus, we obtain $m$ $d$-dimensional local descriptors (LDs) for each support image; under the $n$-way $k$-shot setting, there are $nkm$ $d$-dimensional support LDs in total. Previous methods [19] only considered the average similarity between each LD and the remaining LDs within the same class as the discriminative score. However, our goal is to maintain discriminative relationships not only within the same class but also across the other classes. For each $l^{s}$, we calculate its average similarity with all other LDs within the same support class, referred to as the intra-class similarity, and its average similarity with LDs from the remaining support classes, referred to as the inter-class similarity. We seek support LDs with high intra-class similarity and low inter-class similarity: high intra-class similarity indicates that the support LD strongly represents its own class, while low inter-class similarity indicates that it represents other classes poorly. Support LDs with these characteristics are discriminative and may incorporate discriminative background features that enhance classification. The intra-class and inter-class similarities are calculated as follows:

\text{SIM}_{intra} = \frac{1}{m-1}\sum_{\hat{l}^{s}\in\hat{\mathcal{L}}^{S}}\cos(l^{s},\hat{l}^{s}), \qquad \text{SIM}_{inter} = \frac{1}{(n-1)m}\sum_{\bar{l}^{s}\in\bar{\mathcal{L}}^{S}}\cos(l^{s},\bar{l}^{s})    (2)

Where $\hat{\mathcal{L}}^{S}$ represents the set of remaining support descriptors in $\mathcal{L}^{S}$ excluding the current $l^{s}$ (in the 1-shot case, the remaining local descriptors of the current image), $\bar{\mathcal{L}}^{S}$ denotes the set of local descriptors from the remaining support classes, $\text{SIM}_{intra}$ denotes the intra-class similarity score, and $\text{SIM}_{inter}$ denotes the inter-class similarity score. Furthermore, we normalize these two similarity scores and then compute the corresponding discriminative scores:

\mathcal{D}_{intra} = \text{softmax}(\text{SIM}_{intra}), \qquad \mathcal{D}_{inter} = \text{softmax}(\text{SIM}_{inter})    (3)

Where $\mathcal{D}_{intra}$ denotes the discriminative score of the local descriptor $l^{s}$ within its own class, and $\mathcal{D}_{inter}$ denotes its discriminative score across the other classes.

Based on the above results, we compute the two discriminative scores $\mathcal{D}_{intra}$ and $\mathcal{D}_{inter}$ for each support descriptor and contrast them to obtain the Contrastive Discriminative Score ($\mathcal{CDS}$):

\mathcal{CDS} = \sigma\left(\frac{\mathcal{D}_{intra}}{\mathcal{D}_{inter}}\right)    (4)

We can observe that $\mathcal{CDS}$ aligns well with our initial idea: a high score indicates that the current local descriptor exhibits high similarity with other local descriptors within the same class and low similarity with local descriptors from other classes. Here $\sigma$ is the sigmoid function. Furthermore, as illustrated in Fig. 2, we sort the descriptors of each class by $\mathcal{CDS}$ in descending order and select the top $K$ support descriptors, forming a discriminative support descriptor set:

\mathcal{L}_{c}^{\mathcal{CDS}} = \text{Top-}K_{l_{i}}(\mathcal{CDS})    (5)

The value of $K$ will be discussed in the ablation studies (Section 3.4). The support descriptors selected in this way form the set $\mathcal{L}^{\mathcal{CDS}}$.

Figure 2: Selecting $K$ LDs to form a discriminative support descriptor set by computing the $\mathcal{CDS}$ of each LD in each support class.
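A minimal sketch of Eqs. (2)-(5) is given below (PyTorch assumed). The arrangement of the support LDs as an $(n, k\!\cdot\!m, d)$ tensor, the epsilon guard, and taking the softmax over each class's descriptors are assumptions made for illustration; in the $k$-shot case the class pool contains $k\!\cdot\!m$ descriptors, so the intra-class average here is taken over $k\!\cdot\!m-1$ of them.

```python
# Minimal sketch of the contrastive discriminative score (Eqs. 2-5).
# Names are illustrative, not taken from the authors' code.
import torch
import torch.nn.functional as F

def select_support_lds(support_lds: torch.Tensor, top_ratio: float = 0.02):
    """support_lds: (n, km, d) -> list of n tensors, each (K, d)."""
    n, km, d = support_lds.shape
    flat = F.normalize(support_lds, dim=-1).reshape(n * km, d)
    sim = (flat @ flat.t()).reshape(n, km, n, km)     # pairwise cosine similarities

    k_top = max(1, int(top_ratio * km))
    subsets = []
    for c in range(n):
        intra = sim[c, :, c, :]                       # (km, km) within-class similarities
        # Exclude self-similarity, average over the remaining same-class LDs (Eq. 2).
        sim_intra = (intra.sum(dim=1) - intra.diagonal()) / (km - 1)
        inter = torch.cat([sim[c, :, j, :] for j in range(n) if j != c], dim=1)
        sim_inter = inter.mean(dim=1)                 # average over other-class LDs (Eq. 2)
        d_intra = F.softmax(sim_intra, dim=0)         # Eq. 3
        d_inter = F.softmax(sim_inter, dim=0)
        cds = torch.sigmoid(d_intra / (d_inter + 1e-8))   # Eq. 4
        idx = cds.topk(k_top).indices                 # Eq. 5: top-K most discriminative LDs
        subsets.append(support_lds[c, idx])
    return subsets
```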

2.4 Query Local Descriptors Selection

Given a query image $X_{q}$ embedded as $\mathcal{L}^{Q}=f_{\theta}(X_{q})\in\mathbb{R}^{m\times d}$, let $l^{q}_{i}$ denote a query descriptor in $\mathcal{L}^{Q}$. Previous works [30, 19] employed $k$-NN to select $k$ support descriptors from each support class. However, we observe that once the discriminative support descriptor set $\mathcal{L}^{\mathcal{CDS}}$ has been computed, it is not necessary to use $k$-NN to select $k$ support LDs from it. Instead, we directly compute the sum of similarities between each query descriptor $l_{i}^{q}$ and the discriminative support descriptor set $\mathcal{L}_{c}^{\mathcal{CDS}}$ of each support class $c$:

\text{SIM}_{c}^{l_{i}^{q}} = \sum_{l_{c}\in\mathcal{L}_{c}^{\mathcal{CDS}}}\cos(l_{i}^{q}, l_{c})    (6)

Where $c\in\{1,2,\ldots,n\}$ denotes a support class, and $l_{c}$ is one discriminative support LD from the discriminative support descriptor set of class $c$. The discriminative score of each query descriptor $l_{i}^{q}$ is then calculated as:

\mathcal{D}^{l_{i}^{q}} = \max_{c}\left(\frac{\text{SIM}_{c}^{l_{i}^{q}}}{\sum_{c'=1}^{n}\text{SIM}_{c'}^{l_{i}^{q}}}\right)    (7)
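The per-query-descriptor quantities of Eqs. (6)-(7) can be sketched as follows (PyTorch assumed; function and variable names are illustrative assumptions):

```python
# Minimal sketch of Eqs. (6)-(7): summed cosine similarity between each query LD
# and each class's discriminative support subset, and the resulting per-descriptor
# discriminative score.
import torch
import torch.nn.functional as F

def query_scores(query_lds, support_subsets):
    """query_lds: (m, d); support_subsets: list of n tensors (K_c, d)."""
    q = F.normalize(query_lds, dim=-1)
    sims = []
    for subset in support_subsets:                    # one subset per support class c
        s = F.normalize(subset, dim=-1)
        sims.append((q @ s.t()).sum(dim=1))           # SIM_c^{l_i^q}, shape (m,)   (Eq. 6)
    sim_cq = torch.stack(sims, dim=1)                 # (m, n)
    d_q = (sim_cq / sim_cq.sum(dim=1, keepdim=True)).max(dim=1).values   # (m,)    (Eq. 7)
    return sim_cq, d_q
```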

Previous works [14, 16, 4, 30, 19] selected query descriptors either with a fixed threshold $\mathcal{V}$ or by taking the top-$k$ query descriptors with the highest similarity. Both strategies generalize poorly, as they may overlook some discriminative LDs. Thus, inspired by [4, 30, 19, 8], we employ a network $\mathcal{F}_{q}$, an MLP consisting of two fully connected layers, to adaptively predict a threshold $\mathcal{V}^{l_{i}^{q}}$ for each query descriptor, and we use the predicted thresholds to learn a query descriptor weight map $\mathcal{M}_{q}$. We feed the discriminative support descriptor set $\mathcal{L}_{c}^{\mathcal{CDS}}$ and the query descriptor $l_{i}^{q}$ into $\mathcal{F}_{q}$ to predict the threshold $\mathcal{V}^{l_{i}^{q}}$:

\mathcal{V}^{l_{i}^{q}} = \sigma(\mathcal{F}_{q}(l_{i}^{q}, \mathcal{L}_{c}^{\mathcal{CDS}}))    (8)

Where $i\in\{1,2,\ldots,m\}$ indexes the query LDs, and $c\in\{1,2,\ldots,n\}$ denotes a support class. The query descriptor weight map $\mathcal{M}_{q}$ is calculated as:

\mathcal{M}_{q} = \frac{1}{1+\exp(-\lambda(\mathcal{D}^{l_{i}^{q}}-\mathcal{V}^{l_{i}^{q}}))}    (9)

When $\lambda$ is sufficiently large, the value of $\mathcal{M}_{q}$ approaches 1 where $\mathcal{D}^{l_{i}^{q}}>\mathcal{V}^{l_{i}^{q}}$ and approaches 0 otherwise.
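A minimal sketch of the learnable threshold $\mathcal{F}_{q}$ and the soft selection map $\mathcal{M}_{q}$ (Eqs. (8)-(9)) is given below. How the query LD and the support subsets are combined before the two fully connected layers is not fully specified in the text; concatenating each query LD with a mean-pooled summary of the class subsets is an assumption made here.

```python
# Minimal sketch of the threshold network F_q and the soft selection map M_q (Eqs. 8-9).
import torch
import torch.nn as nn

class ThresholdMLP(nn.Module):
    def __init__(self, d: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, query_lds, support_subsets):
        """query_lds: (m, d); support_subsets: list of n tensors (K_c, d) -> thresholds (m,)."""
        summary = torch.stack([s.mean(dim=0) for s in support_subsets]).mean(dim=0)  # (d,)
        x = torch.cat([query_lds, summary.expand_as(query_lds)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)        # V^{l_i^q} in (0, 1)   (Eq. 8)

def selection_map(d_q, thresholds, lam: float = 20.0):
    """Soft mask M_q of Eq. (9): ~1 where D^{l_i^q} > V^{l_i^q}, ~0 otherwise."""
    return torch.sigmoid(lam * (d_q - thresholds))
```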

Therefore, we can utilize $\mathcal{M}_{q}$ to select query LDs. The similarity score between each query image $X_{q}$ and each support class $c$ is calculated as:

\text{Score}(X_{q},c) = \sum_{l^{q}_{i}\in\mathcal{L}_{c}^{Q}}\mathcal{V}^{l^{q}_{i}}\mathcal{M}_{q}    (10)

The cross-entropy loss is used to meta-train the network:

p_{\phi}(y=c\mid X_{q}) = \frac{\exp(\text{Score}(X_{q},c))}{\sum_{c'=1}^{n}\exp(\text{Score}(X_{q},c'))}    (11)
\mathcal{J}(\phi) = -\frac{1}{|A_{Q}|}\sum_{X_{q}\in A_{Q}}\sum_{c=1}^{n} y \log p_{\phi}(y=c\mid X_{q})    (12)
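Putting the pieces together, one possible realization of the image-to-class score and the episode loss (Eqs. (10)-(12)) is sketched below, reusing the helpers from the sketches above. Note that Eq. (10) as printed sums $\mathcal{V}^{l_{i}^{q}}\mathcal{M}_{q}$ only; to make the score class-dependent, this sketch additionally weights by the class similarities of Eq. (6), in the spirit of ATL-Net-style scoring, which is an assumption rather than the authors' exact formulation.

```python
# Minimal end-to-end sketch of the classification score and episode loss (Eqs. 10-12),
# reusing query_scores, ThresholdMLP, and selection_map from the sketches above.
import torch
import torch.nn.functional as F

def classify_query(query_lds, support_subsets, threshold_mlp, lam=20.0):
    sim_cq, d_q = query_scores(query_lds, support_subsets)     # Eqs. (6)-(7)
    thresholds = threshold_mlp(query_lds, support_subsets)     # Eq. (8)
    m_q = selection_map(d_q, thresholds, lam)                  # Eq. (9)
    # Eq. (10): weighted sum over the selected query LDs, per support class.
    return (sim_cq * (thresholds * m_q).unsqueeze(1)).sum(dim=0)   # (n,) class scores

def episode_loss(query_batch, labels, support_subsets, threshold_mlp):
    """Cross-entropy over the n episode classes, averaged over A_Q (Eqs. 11-12)."""
    logits = torch.stack([classify_query(q, support_subsets, threshold_mlp)
                          for q in query_batch])
    return F.cross_entropy(logits, labels)
```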

3 EXPERIMENTS

In this section, we validate the effectiveness of our proposed method on several few-shot benchmark datasets and compare it with other state-of-the-art LD-based methods. We also compare our method with few-shot methods that use backbones with different parameter counts. Furthermore, we conduct ablation experiments to further analyze and validate the effectiveness of our proposed method.

3.1 Datasets

miniImageNet[25] is a subset of ImageNet [3]. It is divided into a training set with 64 classes, a validation set with 16 classes, and a test set with 20 classes. Each class consists of 600 image samples, each of size 84×84 pixels.

tieredImageNet[20] is another subset of ImageNet. It comprises 608 classes, with each class containing 1281 images. These 608 classes are divided into 351 for training, 97 for validation, and 160 for testing.

CUB-200[26] is a fine-grained dataset that consists of 11788 bird images covering 200 different bird species. We partition it into 100 classes for training, 50 classes for validation, and 50 classes for testing. For fine-grained datasets, we resize the images to the same size as miniImageNet, i.e., 84×84 pixels.

Table 2: The classification accuracies on the CUB-200 dataset in the 5-way 1-shot and 5-shot settings using Conv-4 and ResNet-12 as backbones. The confidence intervals of our method are all below 0.20.
Method Conv-4 ResNet-12
1-shot 5-shot 1-shot 5-shot
ProtoNet[23] 63.73 81.50 66.09 82.50
DSN[22] 66.01 85.41 80.80 91.19
FRN[28] 73.48 88.43 83.16 92.59
Meta-OLE[27] 71.32 86.11 - -
Approximate GAP[12] 43.77 62.92 - -
GAP[12] 44.74 64.88 - -
DeepEMD[31] - - 77.14 88.98
DN4[14] 73.42 90.38 - -
DMN4[16] 78.36 92.16 - -
TADNet[30] 82.47 93.36 87.62 94.80
TCDSNet(ours) 82.73 95.04 88.71 95.82
Table 3: Ablation study on the miniImageNet and CUB-200 datasets for the influence of the Top-$K$ ratio in support LD selection.
Conv-4 ResNet-12
K miniImageNet CUB-200 K miniImageNet CUB-200
1% 74.94 90.11 3% 83.92 89.21
2% 75.89 90.23 5% 85.12 89.25
5% 74.23 92.37 10% 84.39 92.33
10% 72.02 95.04 25% 84.41 95.82
30% 71.11 94.57 30% 83.83 94.29

3.2 Implementation Details

Model architecture. We use Conv-4 and ResNet-12 as the feature extraction network $f_{\theta}$, similar to previous work [14, 4, 30, 19]. Conv-4 consists of 4 convolutional blocks, each containing a convolutional layer, a batch normalization layer, and a Leaky ReLU layer. ResNet-12 is composed of 4 residual blocks, each consisting of 3 convolutional layers with 3×3 kernels, 3 batch normalization layers, 3 Leaky ReLU layers, and a 2×2 max-pooling layer. For 84×84 images, Conv-4 and ResNet-12 generate feature maps of size 19×19×64 and 5×5×640, respectively. These feature maps are then mapped through a transformation layer $f_{\phi}$, which consists of a 1×1 convolutional layer, a batch normalization layer, and a LeakyReLU layer. Finally, $\mathcal{F}_{q}$ is implemented with two fully connected layers.
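For reference, one Conv-4 layout consistent with the stated 19×19×64 output for 84×84 inputs is sketched below (PyTorch assumed); the exact padding/pooling arrangement is an assumption, not the authors' released code.

```python
# One possible Conv-4 layout (an assumption) that reproduces the stated
# 19x19x64 feature map for 84x84 inputs.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, padding=1, pool=False):
    layers = [nn.Conv2d(in_ch, out_ch, 3, padding=padding),
              nn.BatchNorm2d(out_ch),
              nn.LeakyReLU(0.2, inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

conv4 = nn.Sequential(
    conv_block(3, 64, pool=True),    # 84 -> 42
    conv_block(64, 64, pool=True),   # 42 -> 21
    conv_block(64, 64),              # 21 -> 21
    conv_block(64, 64, padding=0),   # 21 -> 19
)

print(conv4(torch.randn(1, 3, 84, 84)).shape)   # torch.Size([1, 64, 19, 19])
```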

Training and evaluation details. During the meta-training phase, we follow the settings in [16, 30, 19]. For Conv-4, we set the learning rate to 1e-3 and decay it by 0.1 every 10 epochs, training for a total of 30 epochs with the Adam optimizer. For ResNet-12, we first pre-train it and then conduct meta-training for 40 epochs using SGD with momentum, an initial learning rate of 5e-4, and a decay of 0.1 every 10 epochs. During testing, as in [16, 30, 19], we randomly construct 10000 episodes from the test set to calculate the classification accuracy. This process is repeated five times, and we report the average accuracy along with the 95% confidence interval.
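The Conv-4 meta-training schedule described above (Adam, learning rate 1e-3, decayed by 0.1 every 10 epochs, 30 epochs) corresponds to a setup along these lines (PyTorch assumed; the model variable is a placeholder):

```python
# Sketch of the Conv-4 meta-training schedule: Adam, lr 1e-3,
# decayed by 0.1 every 10 epochs, 30 epochs in total.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3)   # placeholder for f_theta, f_phi, and F_q parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... sample episodes, compute the loss of Eq. (12), loss.backward(), optimizer.step() ...
    scheduler.step()
```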

3.3 Comparisons with State-of-the-art Methods

We choose 7 generic few-shot learning state-of-the-art baselines [25, 23, 24, 28, 27, 12], as well as 5 SOTA baselines based on LDs [31, 14, 16, 4, 30]. For fine-grained datasets, we also selected 9 state-of-the-art baselines [23, 14, 22, 12, 27, 31, 28, 16, 30].

Results on miniImageNet dataset. As shown in Table 1, the performance of our method in the 5-way 1-shot and 5-shot settings exceeds that of all current LD-based methods [31, 14, 16, 4, 30]. Compared to the baseline DN4, our method exhibits significant improvement: in the 5-way 1-shot and 5-shot settings with Conv-4 as the backbone, it achieves improvements of 5.90% and 4.87%, respectively, and it also surpasses the state-of-the-art (SOTA) by 1% and 1.21%, respectively. With ResNet-12 as the backbone, improvements of 3.18% and 4.02% are achieved, surpassing the SOTA by 1.27% and 0.89%, respectively.

Results on tieredImageNet dataset. As shown in Table 1, our method outperforms the current state-of-the-art LD-based methods as well. In the 5-way 1-shot and 5-shot settings with Conv-4 as the backbone, our method improves by 0.79% and 0.08%, respectively, over the state-of-the-art LD-based method; with ResNet-12 as the backbone, it improves by 1.14% and 0.89%, respectively.

Results on fine-grained CUB-200 dataset. As shown in Table 2, our method also achieves state-of-the-art performance on the fine-grained dataset. In the 5-way 1-shot and 5-shot settings with Conv-4 as the backbone, our method improves by 0.26% and 1.68%, respectively, over the state-of-the-art LD-based method; with ResNet-12 as the backbone, it improves by 1.02% and 1.19%, respectively.

Table 4: The classification accuracies on the miniImageNet dataset in the 5-way 1-shot and 5-shot settings for backbones with different parameters.
Method Backbone ≈Params miniImageNet
 1-shot 5-shot
CTM[13] ResNet-18 11.7 M 64.12±0.82 80.51±0.13
Neg-Cosine[15] ResNet-18 11.7 M 62.33±0.82 80.94±0.59
UniSiam+dist[17] ResNet-18 11.7 M 64.10±0.36 82.26±0.25
Meta-OLE[12] WRN-28-10 36.5 M 75.22±0.30 86.12±0.28
MetaQDA[32] WRN-28-10 36.5 M 67.83±0.64 84.28±0.69
OM[18] WRN-28-10 36.5 M 66.78±0.30 85.29±0.41
FewTURE[7] ViT-Small 22 M 68.02±0.88 84.51±0.53
FewTURE[7] Swin-Tiny 29 M 72.40±0.78 86.38±0.49
TCDSNet(ours) ResNet-12 12.4 M 68.53±0.19 85.12±0.42

3.4 Ablation Studies

Influence of Top-$K$ in support LD selection. In Section 2.3, we selected $K$ ($K$ expressed as a percentage) LDs based on $\mathcal{CDS}$ for each support class to form a discriminative LD set. As shown in Table 3, we conducted experiments on the miniImageNet and CUB-200 datasets under the 5-way 5-shot setting. With Conv-4 as the backbone, we set $K$ to 1%, 2%, 5%, 10%, and 30%; with ResNet-12 as the backbone, we set $K$ to 3%, 5%, 10%, 25%, and 30%. We found that with Conv-4 as the backbone, performance is best at $K=2\%$ on miniImageNet and $K=10\%$ on CUB-200, while with ResNet-12 as the backbone, performance is best at $K=5\%$ on miniImageNet and $K=25\%$ on CUB-200. These results indicate that, compared to general datasets, fine-grained datasets require more discriminative LDs. Similarly, under the 5-way 1-shot setting, the best performance with Conv-4 as the backbone is achieved at $K=5\%$ and $K=10\%$, and with ResNet-12 at $K=5\%$ and $K=10\%$.

Comparison with methods using backbones with different parameters. As shown in Table 4, we select three baselines [13, 17, 15] using ResNet-18 as the backbone, three baselines [12, 32, 18] using WRN-28-10 as the backbone, and baselines [7] using ViT-Small and Swin-Tiny as backbones. These methods are not LD-based. Compared to the baselines using ResNet-18, our method outperforms the best-performing one by 4.41% and 2.86% in the 1-shot and 5-shot settings, respectively. Compared to the baselines using WRN-28-10 [12, 32, 18], our method achieves a 0.70% improvement in the 1-shot setting and is only 0.17% lower than OM [18] in the 5-shot setting, despite WRN-28-10 having three times the parameters of ResNet-12. Compared to FewTURE [7] with ViT-Small as the backbone, our method achieves improvements of 0.51% and 0.61%, and it is only 1.26% lower than FewTURE with Swin-Tiny as the backbone in the 5-shot setting, although Swin-Tiny has 2.3 times the parameters of ResNet-12. Additionally, FewTURE's ViT-Small and Swin-Tiny models were trained on 4 and 8 Nvidia A100 40GB GPUs, respectively, making their GPU requirements relatively high.

4 CONCLUSION

We propose a novel Task-Aware Contrastive Discriminative Local Descriptor Selection Network (TCDSNet), which utilizes a novel contrastive discriminative measure to filter discriminative local descriptors from the support class. Subsequently, it further selects discriminative query local descriptors from the filtered discriminative support descriptors, ensuring the selection of task-relevant query local descriptors. Extensive experiments validate the superiority and effectiveness of our proposed method. We anticipate that TCDSNet provides a new perspective for research in few-shot learning based on local descriptors.

Acknowledgment

This work was supported in part by the National Key R&D Program of China (2018YFA0701700; 2018YFA0701701) and by the National Natural Science Foundation of China under Grant Nos. 61672364, 62176172, and 62002253.

References

  • [1] Antoniou, A., Edwards, H., Storkey, A.: How to train your maml. arXiv preprint arXiv:1810.09502 (2018)
  • [2] Antoniou, A., Storkey, A., Edwards, H.: Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340 (2017)
  • [3] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
  • [4] Dong, C., Li, W., Huo, J., Gu, Z., Gao, Y.: Learning task-aware local representations for few-shot learning. In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence. pp. 716–722 (2021)
  • [5] Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International conference on machine learning. pp. 1126–1135. PMLR (2017)
  • [6] He, F., Li, G., Zhang, M., Yan, L., Si, L., Li, F.: Freestyle: Free lunch for text-guided style transfer using diffusion models (2024)
  • [7] Hiller, M., Ma, R., Harandi, M., Drummond, T.: Rethinking generalization in few-shot classification. Advances in Neural Information Processing Systems 35, 3582–3595 (2022)
  • [8] Huang, S., Cao, Z., Qin, L., Gao, J., Zhang, J.: Contrastive learning with high-quality and low-quality augmented data for query-focused summarization. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 11536–11540. IEEE (2024)
  • [9] Huang, S., Qin, L., Cao, Z.: Diffusion language model with query-document relevance for query-focused summarization. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 11020–11030 (2023)
  • [10] Jiang, M., Li, F.: Lie group continual meta learning algorithm. Applied Intelligence 52(10), 10965–10978 (2022)
  • [11] Jiang, M., Li, F., Liu, L.: Continual meta-learning algorithm. Applied Intelligence pp. 1–16 (2022)
  • [12] Kang, S., Hwang, D., Eo, M., Kim, T., Rhee, W.: Meta-learning with a geometry-adaptive preconditioner. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16080–16090 (2023)
  • [13] Li, H., Eigen, D., Dodge, S., Zeiler, M., Wang, X.: Finding task-relevant features for few-shot learning by category traversal. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1–10 (2019)
  • [14] Li, W., Wang, L., Xu, J., Huo, J., Gao, Y., Luo, J.: Revisiting local descriptor based image-to-class measure for few-shot learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7260–7268 (2019)
  • [15] Liu, B., Cao, Y., Lin, Y., Li, Q., Zhang, Z., Long, M., Hu, H.: Negative margin matters: Understanding margin in few-shot classification. In: ECCV (2020)
  • [16] Liu, Y., Zheng, T., Song, J., Cai, D., He, X.: Dmn4: Few-shot learning via discriminative mutual nearest neighbor neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 1828–1836 (2022)
  • [17] Lu, Y., Wen, L., Liu, J., Liu, Y., Tian, X.: Self-supervision can be a good few-shot learner. In: European Conference on Computer Vision. pp. 740–758. Springer (2022)
  • [18] Qi, G., Yu, H., Lu, Z., Li, S.: Transductive few-shot classification on the oblique manifold. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8412–8422 (2021)
  • [19] Qiao, Q., Xie, Y., Zeng, Z., Li, F.: Talds-net: Task-aware adaptive local descriptors selection for few-shot image classification. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3750–3754. IEEE (2024)
  • [20] Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J.B., Larochelle, H., Zemel, R.S.: Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676 (2018)
  • [21] Schwartz, E., Karlinsky, L., Shtok, J., Harary, S., Marder, M., Kumar, A., Feris, R., Giryes, R., Bronstein, A.: Delta-encoder: an effective sample synthesis method for few-shot object recognition. Advances in neural information processing systems 31 (2018)
  • [22] Simon, C., Koniusz, P., Nock, R., Harandi, M.: Adaptive subspaces for few-shot learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4136–4145 (2020)
  • [23] Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. Advances in neural information processing systems 30 (2017)
  • [24] Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1199–1208 (2018)
  • [25] Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. Advances in neural information processing systems 29 (2016)
  • [26] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd birds-200-2011 dataset (2011)
  • [27] Wang, Z., Lu, Y., Qiu, Q.: Meta-ole: Meta-learned orthogonal low-rank embedding. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5305–5314 (2023)
  • [28] Wertheimer, D., Tang, L., Hariharan, B.: Few-shot classification with feature map reconstruction networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8012–8021 (2021)
  • [29] Xian, Y., Sharma, S., Schiele, B., Akata, Z.: f-vaegan-d2: A feature generating framework for any-shot learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10275–10284 (2019)
  • [30] Yan, L., Li, F., Zheng, X., Zhang, L.: Few-shot learning via task-aware discriminant local descriptors network. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. pp. 2887–2894 (2023)
  • [31] Zhang, C., Cai, Y., Lin, G., Shen, C.: Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12203–12213 (2020)
  • [32] Zhang, X., Meng, D., Gouk, H., Hospedales, T.M.: Shallow bayesian meta learning for real-world few-shot recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 651–660 (2021)