
School of Computer Science and Technology, Soochow University, Suzhou, 215006, China
Email: {qqiao, 20224227022}@stu.suda.edu.cn
Email: [email protected]
Email: [email protected]

A Simple Task-aware Contrastive Local Descriptor Selection Strategy for Few-shot Learning between Inter-class and Intra-class

Qian Qiao    Yu Xie    Shaoyao Huang    Fanzhang Li (Corresponding Author). †Equal contribution.
Abstract

Few-shot image classification aims to classify novel classes with few labeled samples. Recent research indicates that deep local descriptors have better representational capabilities than image-level features. These studies recognize the impact of background noise on classification performance: they typically filter query descriptors using all local descriptors in the support classes or perform bidirectional selection between local descriptors in the support and query sets. However, they ignore the fact that background features may be useful for the classification performance of specific tasks. This paper proposes a novel task-aware contrastive local descriptor selection network (TCDSNet). First, we calculate a contrastive discriminative score for each local descriptor in the support classes and select discriminative local descriptors to form a support descriptor subset. Then, we leverage the support descriptor subsets to adaptively select discriminative query descriptors for specific tasks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on both general and fine-grained datasets.

Keywords:
Few-shot learning · Task-aware · Local descriptor · Image classification

1 Introduction

The purpose of few-shot learning is to enable models to adapt quickly to new tasks with only a small number of training samples in scenarios where data is scarce. Generally, these methods can be divided into three groups: optimization-based methods [5, 1, 11], metric-based methods [23, 24, 10], and data augmentation-based methods [2, 21, 29, 9, 6].

This work builds on few-shot learning methods based on local descriptors, which fall within the realm of metric learning. Features based on local descriptors exhibit superior representational capabilities compared to image-level features. Among previous works, [14] proposed DN4, which directly utilizes all query descriptors: for each query local descriptor, it selects k support descriptors via k-nearest neighbors (k-NN) and approximates the relationship between query samples and support classes using cosine similarity. Building on DN4, [16] introduced DMN4, which argues that not all query descriptors are task-relevant and that many carry significant background noise; DMN4 establishes mutual nearest neighbor (MNN) relationships to explicitly select the query descriptors most relevant to each task, thereby reducing the impact of background noise on classification. Also based on DN4, [4] and [30] proposed ATL-Net and TADNet, respectively; both measure the relationship between each query local descriptor and all support classes and adaptively select discriminative query descriptors for classification. [19] introduced TALDS-Net, which likewise recognizes background noise in query descriptors: it first adaptively selects an optimal subset of support class local descriptors and then adaptively chooses query descriptors from that subset for classification. All of these methods aim to eliminate background noise, either by filtering query descriptors with all support class local descriptors or by bidirectionally selecting between support and query descriptors. However, they overlook the fact that background features can be useful for specific tasks. From a human cognitive perspective, consider an image of a dog and an image of a dolphin: not only do the target features differ significantly, but so do the background features (a dolphin's background is unlikely to be grassy, whereas a dog's background might include grass). In such cases, background features contribute to classification. Conversely, for two images that both belong to the dolphin category, the background differences are less pronounced, and the background can be treated as noise. For instance, when facing an unfamiliar image, an ocean background helps narrow the prediction to objects commonly found in the ocean, which aids in identifying the target category. Thus, background features within the same category may positively impact classification performance, and background features across different categories may also enhance it. Determining which local descriptors are discriminative is therefore challenging for descriptor-based methods, and background information must be retained or discarded judiciously.

In response to this challenge, a straightforward solution is to select local descriptors from the support classes to form a support descriptor subset, and then use this subset to select query descriptors. Experimental results demonstrate the effectiveness of this simple approach.

This straightforward solution is realized by our proposed Task-Aware Contrastive Discriminative Local Descriptor Selection Network (TCDSNet). Specifically, we first select local descriptors from the support classes. For each support descriptor, we compute the sum of its similarities with the remaining support descriptors of the same category as the intra-class similarity score, and the sum of its similarities with support descriptors from the other categories as the inter-class similarity score. A high intra-class similarity score indicates that the support descriptor strongly represents its class, while a low inter-class similarity score suggests that the support descriptor has high discriminative power with respect to the other classes. We compute a discriminative score by dividing the intra-class similarity score by the inter-class similarity score, which we term the contrastive discriminative score. We then select the top $K$ support descriptors in descending order of this score. Finally, we utilize the selected support descriptors to choose query descriptors: we employ a simple learnable module to adaptively predict a threshold and, using the learned threshold and a score map, select the most discriminative query descriptors for the final classification. This approach enhances the model's classification and generalization capabilities.

In summary, our main contributions are threefold:

  • We propose a novel method that calculates contrastive discriminative scores ($\mathcal{CDS}$) for local descriptors in the support classes. This enhances the model's adaptability to different tasks and strengthens the performance of local descriptors in few-shot learning tasks.

  • We propose a novel Task-Aware Contrastive Discriminative Local Descriptor Selection Network (TCDSNet) that not only selects a subset of support descriptors based on discriminative scores but also incorporates a learnable module for adaptively choosing the discriminative query descriptors.

  • Extensive experimental results demonstrate that TCDSNet outperforms state-of-the-art methods on multiple general and fine-grained datasets.

2 METHOD

Fig. 1 shows an overview of the proposed method.


Figure 1: The overall architecture of the proposed method under a 5-way 5-shot setting. The model primarily consists of three components: a feature extraction module $f_{\theta}$ for extracting features, a module for selecting $K$ discriminative LDs, and $\mathcal{F}_{q}$ for adaptively selecting query LDs.

2.1 Problem Definition

In this paper, we follow the same setting as previous methods [14, 4, 30, 19]. We are given a support set $S$, a query set $Q$, and an auxiliary set $A$, where the label space of $A$ is disjoint from that of $S$ and is used to learn transferable knowledge. The support set $S$ contains $n$ classes, each with $k$ labeled samples, while the samples in the query set $Q$ are unlabeled and share the same label space as $S$. Given the support set and a query image, the task is to classify the query image into one of the $n$ support classes; this constitutes the $n$-way $k$-shot few-shot classification problem. Under this setting, we adopt the episodic training mechanism, a meta-training strategy [25]: we randomly sample from the auxiliary set $A$ to construct $n$-way $k$-shot tasks, each consisting of a support set $A_{S}$ and a query set $A_{Q}$. During the training phase, we construct tens of thousands of such tasks to learn transferable knowledge.
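As an illustration, episodic sampling can be sketched as follows (a minimal Python sketch; the data layout, function name, and number of query samples per class are assumptions, not part of the paper):

```python
# Minimal sketch of episodic sampling from the auxiliary set A.
# Assumed data layout: dict mapping class label -> list of samples.
import random

def sample_episode(auxiliary_set, n_way=5, k_shot=5, q_queries=15):
    """Build one n-way k-shot task (A_S, A_Q) from the auxiliary set."""
    classes = random.sample(list(auxiliary_set.keys()), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        samples = random.sample(auxiliary_set[cls], k_shot + q_queries)
        # Episode-local labels 0..n_way-1 replace the original class ids.
        support += [(x, episode_label) for x in samples[:k_shot]]
        query += [(x, episode_label) for x in samples[k_shot:]]
    return support, query

# During meta-training, tens of thousands of such episodes are drawn, e.g.:
# episodes = [sample_episode(aux_data) for _ in range(10000)]
```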

2.2 Image Representation Based on Local Descriptors

We obtain a three-dimensional feature representation $f_{\theta}(X)\in\mathbb{R}^{h\times w\times d}$ for an image $X$ through the embedding module $f_{\theta}(\cdot)$, which is regarded as a set of $d$-dimensional local descriptors (LDs):

f_{\theta}(X) = [l_{1}, l_{2}, \cdots, l_{m}] \in \mathbb{R}^{m\times d}    (1)

Where $l_{i}$ denotes the $i$-th deep local descriptor (LD). Similar to other descriptor-based methods [14, 4, 16, 30, 19], we regard $f_{\theta}(X)$ as a set of $m$ $d$-dimensional descriptors, where $m=h\times w$.

In each episode, each support class has $k$ images. We denote the descriptor set of category $c$ as $\mathcal{L}_{c}^{S}$, where there are $n$ classes in total, and the descriptor representation of each query image as $l^{q}$. When using a shallower embedding module (e.g., Conv-4), each support category is represented in its original form; when using a deeper embedding module (e.g., ResNet-12), each support category is represented by the empirical mean of its support descriptors.
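For concreteness, the reshaping of a feature map into a set of LDs, and one possible reading of the "empirical mean" used for deeper backbones (averaging the $k$ support images' descriptors position-wise), can be sketched as follows (PyTorch assumed; names are illustrative):

```python
# Minimal sketch: flatten an embedding f_theta(X) of shape (d, h, w) into
# m = h*w local descriptors of dimension d; for deeper backbones, represent a
# support class by the empirical mean of its k support images' descriptors.
import torch

def to_local_descriptors(feature_map: torch.Tensor) -> torch.Tensor:
    """(B, d, h, w) -> (B, m, d) with m = h*w."""
    b, d, h, w = feature_map.shape
    return feature_map.reshape(b, d, h * w).permute(0, 2, 1)

# Hypothetical ResNet-12-style output (5x5x640 for an 84x84 image), k = 5 shots:
features = torch.randn(5, 640, 5, 5)
lds = to_local_descriptors(features)       # (5, 25, 640)
class_mean_lds = lds.mean(dim=0)           # (25, 640): empirical mean over the k images
```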

Table 1: The classification accuracies on the miniImageNet and tieredImageNet datasets in the 5-way 1-shot and 5-shot settings, using Conv-4 and ResNet-12 as backbones, with 95% confidence intervals. All results of the comparative methods are taken from the existing literature ('-': not reported). The methods below the horizontal line are LD-based methods.
Method Conv-4 ResNet-12
 miniImageNet tieredImageNet miniImageNet tieredImageNet
 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
MatchingNet[25] 43.56±0.84 55.31±0.73 - - 63.08±0.20 75.99±0.15 68.50±0.92 80.60±0.71
ProtoNet[23] 51.20±0.26 68.94±0.78 53.45±0.15 72.32±0.57 62.33±0.12 80.88±0.41 68.40±0.14 84.06±0.26
RelationNet[24] 50.44±0.82 65.32±0.70 54.48±0.93 71.31±0.78 60.97 75.12 64.71 78.41
FRN[28] 54.87 71.56 55.54 74.68 66.45±0.19 82.83±0.13 72.06±0.22 86.89±0.14
Meta-OLE[27] 56.82±0.84 73.87±0.67 58.82±0.88 75.85±0.87 67.04±0.72 82.23±0.67 68.82±0.71 85.51±0.59
Approximate GAP[12] 53.52±0.88 70.75±0.67 57.47±0.99 71.66±0.76 - - - -
GAP[12] 54.86±0.85 71.55±0.61 58.56±0.93 72.82±0.77 - - - -
DeepEMD[31] 51.72±0.20 65.10±0.39 51.22±0.14 65.81±0.68 65.91±0.82 82.41±0.56 71.16±0.87 86.03±0.58
DN4[14] 51.24±0.74 71.02±0.64 52.89±0.23 73.36±0.73 65.35 81.10 69.60 83.41
DMN4[16] 55.77 74.22 56.99 74.13 66.58 83.52 72.10 85.72
ATL-Net[4] 54.30±0.76 73.22±0.63 - - - - - -
TADNet[30] 56.14±0.20 74.68±0.15 57.88±0.21 75.98±0.17 67.26±0.20 84.23±0.13 71.29±0.22 86.46±0.15
TCDSNet(ours) 57.14±0.22 75.89±0.35 58.67±0.61 76.06±0.33 68.53±0.19 85.12±0.42 72.43±0.72 87.35±0.55

2.3 Contrastive Discriminative Scores for Support Local Descriptors Selection

As mentioned above, $X_{S}$ denotes an image in a support class; it is fed into the embedding module $f_{\theta}$ to obtain local descriptors $\mathcal{L}^{S}=f_{\theta}(X_{S})\in\mathbb{R}^{m\times d}$, where $m=h\times w$. Here, $l^{s}$ denotes one support local descriptor in $\mathcal{L}^{S}$, $\hat{\mathcal{L}}^{S}$ denotes the set of remaining support descriptors of the same class excluding the current $l^{s}$, and $\bar{\mathcal{L}}^{S}$ denotes the set of local descriptors from the remaining support classes. Thus, we obtain $m$ $d$-dimensional local descriptors (LDs) for each support image; under the $n$-way $k$-shot setting, there are $nkm$ $d$-dimensional support LDs in total. Previous methods [19] only considered the average similarity between each LD and the remaining LDs within the same class as the discriminative score. However, our goal is to maintain discriminative relationships not only within the same class but also across the other classes. For each $l^{s}$, we calculate its average similarity with all other LDs within the same support class, referred to as the intra-class similarity, and its average similarity with LDs from the remaining support classes, referred to as the inter-class similarity. We seek support LDs with high intra-class similarity and low inter-class similarity: high intra-class similarity indicates that the support LD strongly represents its own class, while low inter-class similarity indicates that it represents other classes poorly. Support LDs with these characteristics are discriminative and may incorporate discriminative background features that enhance classification. The intra-class and inter-class similarities are calculated as follows:

\text{SIM}_{intra} = \frac{1}{m-1}\sum_{\hat{l}^{s}\in\hat{\mathcal{L}}^{S}}\cos(l^{s},\hat{l}^{s}), \qquad \text{SIM}_{inter} = \frac{1}{(n-1)m}\sum_{\bar{l}^{s}\in\bar{\mathcal{L}}^{S}}\cos(l^{s},\bar{l}^{s})    (2)

Where $\hat{\mathcal{L}}^{S}$ represents the set of remaining support descriptors in $\mathcal{L}^{S}$ excluding the current $l^{s}$ (in the 1-shot case, the remaining local descriptors of the current image), $\bar{\mathcal{L}}^{S}$ denotes the set of local descriptors from the remaining support classes, $\text{SIM}_{intra}$ denotes the intra-class similarity score, and $\text{SIM}_{inter}$ denotes the inter-class similarity score. Furthermore, we normalize these two similarity scores and then compute the corresponding discriminative scores:

\mathcal{D}_{intra} = \text{softmax}(\text{SIM}_{intra}), \qquad \mathcal{D}_{inter} = \text{softmax}(\text{SIM}_{inter})    (3)

Where $\mathcal{D}_{intra}$ denotes the discriminative score of the local descriptor $l^{s}$ within its own class, and $\mathcal{D}_{inter}$ denotes its discriminative score across the other classes.

Based on the above results, we compute the two discriminative scores $\mathcal{D}_{intra}$ and $\mathcal{D}_{inter}$ for each support descriptor and contrast them to obtain the Contrastive Discriminative Score ($\mathcal{CDS}$):

\mathcal{CDS} = \sigma\left(\frac{\mathcal{D}_{intra}}{\mathcal{D}_{inter}}\right)    (4)

We can observe that $\mathcal{CDS}$ aligns well with our initial idea: a high score indicates that the current local descriptor exhibits high similarity with other local descriptors within the same class and low similarity with local descriptors from other classes. Here $\sigma$ is the sigmoid function. Furthermore, as illustrated in Fig. 2, we sort the descriptors of each class by $\mathcal{CDS}$ in descending order and select the top $K$ support descriptors, forming a discriminative support descriptor set:

\mathcal{L}_{c}^{\mathcal{CDS}} = \text{Top-}K_{l_{i}}(\mathcal{CDS})    (5)

The value of $K$ will be discussed in the ablation studies (Section 3.4). The support descriptors selected in this way form the set $\mathcal{L}^{\mathcal{CDS}}$.

Figure 2: Selecting $K$ LDs to form a discriminative support descriptor set by computing the $\mathcal{CDS}$ of each LD in each support class.
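A minimal sketch of Eqs. (2)-(5) is given below (PyTorch assumed). The arrangement of the support LDs as an $(n, k\!\cdot\!m, d)$ tensor, the epsilon guard, and taking the softmax over each class's descriptors are assumptions made for illustration; in the $k$-shot case the class pool contains $k\!\cdot\!m$ descriptors, so the intra-class average here is taken over $k\!\cdot\!m-1$ of them.

```python
# Minimal sketch of the contrastive discriminative score (Eqs. 2-5).
# Names are illustrative, not taken from the authors' code.
import torch
import torch.nn.functional as F

def select_support_lds(support_lds: torch.Tensor, top_ratio: float = 0.02):
    """support_lds: (n, km, d) -> list of n tensors, each (K, d)."""
    n, km, d = support_lds.shape
    flat = F.normalize(support_lds, dim=-1).reshape(n * km, d)
    sim = (flat @ flat.t()).reshape(n, km, n, km)     # pairwise cosine similarities

    k_top = max(1, int(top_ratio * km))
    subsets = []
    for c in range(n):
        intra = sim[c, :, c, :]                       # (km, km) within-class similarities
        # Exclude self-similarity, average over the remaining same-class LDs (Eq. 2).
        sim_intra = (intra.sum(dim=1) - intra.diagonal()) / (km - 1)
        inter = torch.cat([sim[c, :, j, :] for j in range(n) if j != c], dim=1)
        sim_inter = inter.mean(dim=1)                 # average over other-class LDs (Eq. 2)
        d_intra = F.softmax(sim_intra, dim=0)         # Eq. 3
        d_inter = F.softmax(sim_inter, dim=0)
        cds = torch.sigmoid(d_intra / (d_inter + 1e-8))   # Eq. 4
        idx = cds.topk(k_top).indices                 # Eq. 5: top-K most discriminative LDs
        subsets.append(support_lds[c, idx])
    return subsets
```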

2.4 Query Local Descriptors Selection

Given a query image $X_{q}$ embedded as $\mathcal{L}^{Q}=f_{\theta}(X_{q})\in\mathbb{R}^{m\times d}$, let $l^{q}_{i}$ denote a query descriptor in $\mathcal{L}^{Q}$. Previous works [30, 19] employed $k$-NN to select $k$ support descriptors from each support class. However, we observe that once the discriminative support descriptor set $\mathcal{L}^{\mathcal{CDS}}$ has been computed, it is not necessary to use $k$-NN to select $k$ support LDs from it. Instead, we directly compute the sum of similarities between each query descriptor $l_{i}^{q}$ and the discriminative support descriptor set $\mathcal{L}_{c}^{\mathcal{CDS}}$ of each support class $c$:

\text{SIM}_{c}^{l_{i}^{q}} = \sum_{l_{c}\in\mathcal{L}_{c}^{\mathcal{CDS}}}\cos(l_{i}^{q}, l_{c})    (6)

Where $c\in\{1,2,\ldots,n\}$ denotes a support class, and $l_{c}$ is one discriminative support LD from the discriminative support descriptor set of class $c$. The discriminative score of each query descriptor $l_{i}^{q}$ is then calculated as:

\mathcal{D}^{l_{i}^{q}} = \max_{c}\left(\frac{\text{SIM}_{c}^{l_{i}^{q}}}{\sum_{c'=1}^{n}\text{SIM}_{c'}^{l_{i}^{q}}}\right)    (7)
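The per-query-descriptor quantities of Eqs. (6)-(7) can be sketched as follows (PyTorch assumed; function and variable names are illustrative assumptions):

```python
# Minimal sketch of Eqs. (6)-(7): summed cosine similarity between each query LD
# and each class's discriminative support subset, and the resulting per-descriptor
# discriminative score.
import torch
import torch.nn.functional as F

def query_scores(query_lds, support_subsets):
    """query_lds: (m, d); support_subsets: list of n tensors (K_c, d)."""
    q = F.normalize(query_lds, dim=-1)
    sims = []
    for subset in support_subsets:                    # one subset per support class c
        s = F.normalize(subset, dim=-1)
        sims.append((q @ s.t()).sum(dim=1))           # SIM_c^{l_i^q}, shape (m,)   (Eq. 6)
    sim_cq = torch.stack(sims, dim=1)                 # (m, n)
    d_q = (sim_cq / sim_cq.sum(dim=1, keepdim=True)).max(dim=1).values   # (m,)    (Eq. 7)
    return sim_cq, d_q
```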

Previous works [14, 16, 4, 30, 19] selected query descriptors either with a fixed threshold $\mathcal{V}$ or by taking the top-$k$ query descriptors with the highest similarity. Both strategies generalize poorly, as they may overlook some discriminative LDs. Thus, inspired by [4, 30, 19, 8], we employ a network $\mathcal{F}_{q}$, an MLP consisting of two fully connected layers, to adaptively predict a threshold $\mathcal{V}^{l_{i}^{q}}$ for each query descriptor, and we use the predicted thresholds to learn a query descriptor weight map $\mathcal{M}_{q}$. We feed the discriminative support descriptor set $\mathcal{L}_{c}^{\mathcal{CDS}}$ and the query descriptor $l_{i}^{q}$ into $\mathcal{F}_{q}$ to predict the threshold $\mathcal{V}^{l_{i}^{q}}$:

\mathcal{V}^{l_{i}^{q}} = \sigma(\mathcal{F}_{q}(l_{i}^{q}, \mathcal{L}_{c}^{\mathcal{CDS}}))    (8)

Where $i\in\{1,2,\ldots,m\}$ indexes the query LDs, and $c\in\{1,2,\ldots,n\}$ denotes a support class. The query descriptor weight map $\mathcal{M}_{q}$ is calculated as:

\mathcal{M}_{q} = \frac{1}{1+\exp(-\lambda(\mathcal{D}^{l_{i}^{q}}-\mathcal{V}^{l_{i}^{q}}))}    (9)

When $\lambda$ is sufficiently large, the value of $\mathcal{M}_{q}$ approaches 1 where $\mathcal{D}^{l_{i}^{q}}>\mathcal{V}^{l_{i}^{q}}$ and approaches 0 otherwise.
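A minimal sketch of the learnable threshold $\mathcal{F}_{q}$ and the soft selection map $\mathcal{M}_{q}$ (Eqs. (8)-(9)) is given below. How the query LD and the support subsets are combined before the two fully connected layers is not fully specified in the text; concatenating each query LD with a mean-pooled summary of the class subsets is an assumption made here.

```python
# Minimal sketch of the threshold network F_q and the soft selection map M_q (Eqs. 8-9).
import torch
import torch.nn as nn

class ThresholdMLP(nn.Module):
    def __init__(self, d: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, query_lds, support_subsets):
        """query_lds: (m, d); support_subsets: list of n tensors (K_c, d) -> thresholds (m,)."""
        summary = torch.stack([s.mean(dim=0) for s in support_subsets]).mean(dim=0)  # (d,)
        x = torch.cat([query_lds, summary.expand_as(query_lds)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)        # V^{l_i^q} in (0, 1)   (Eq. 8)

def selection_map(d_q, thresholds, lam: float = 20.0):
    """Soft mask M_q of Eq. (9): ~1 where D^{l_i^q} > V^{l_i^q}, ~0 otherwise."""
    return torch.sigmoid(lam * (d_q - thresholds))
```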

Therefore, we can utilize $\mathcal{M}_{q}$ to select query LDs. The similarity score between each query image $X_{q}$ and each support class $c$ is calculated as:

\text{Score}(X_{q},c) = \sum_{l^{q}_{i}\in\mathcal{L}_{c}^{Q}}\mathcal{V}^{l^{q}_{i}}\mathcal{M}_{q}    (10)

The cross-entropy loss is used to meta-train the network:

p_{\phi}(y=c\mid X_{q}) = \frac{\exp(\text{Score}(X_{q},c))}{\sum_{c'=1}^{n}\exp(\text{Score}(X_{q},c'))}    (11)
\mathcal{J}(\phi) = -\frac{1}{|A_{Q}|}\sum_{X_{q}\in A_{Q}}\sum_{c=1}^{n} y \log p_{\phi}(y=c\mid X_{q})    (12)
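Putting the pieces together, one possible realization of the image-to-class score and the episode loss (Eqs. (10)-(12)) is sketched below, reusing the helpers from the sketches above. Note that Eq. (10) as printed sums $\mathcal{V}^{l_{i}^{q}}\mathcal{M}_{q}$ only; to make the score class-dependent, this sketch additionally weights by the class similarities of Eq. (6), in the spirit of ATL-Net-style scoring, which is an assumption rather than the authors' exact formulation.

```python
# Minimal end-to-end sketch of the classification score and episode loss (Eqs. 10-12),
# reusing query_scores, ThresholdMLP, and selection_map from the sketches above.
import torch
import torch.nn.functional as F

def classify_query(query_lds, support_subsets, threshold_mlp, lam=20.0):
    sim_cq, d_q = query_scores(query_lds, support_subsets)     # Eqs. (6)-(7)
    thresholds = threshold_mlp(query_lds, support_subsets)     # Eq. (8)
    m_q = selection_map(d_q, thresholds, lam)                  # Eq. (9)
    # Eq. (10): weighted sum over the selected query LDs, per support class.
    return (sim_cq * (thresholds * m_q).unsqueeze(1)).sum(dim=0)   # (n,) class scores

def episode_loss(query_batch, labels, support_subsets, threshold_mlp):
    """Cross-entropy over the n episode classes, averaged over A_Q (Eqs. 11-12)."""
    logits = torch.stack([classify_query(q, support_subsets, threshold_mlp)
                          for q in query_batch])
    return F.cross_entropy(logits, labels)
```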

3 EXPERIMENTS

In this section, we validate the effectiveness of our proposed method on several few-shot benchmark datasets and compare it with other state-of-the-art LD-based methods. We also compare our method with few-shot methods that use backbones with different parameter counts. Furthermore, we conduct ablation experiments to further analyze and validate the effectiveness of our proposed method.

3.1 Datasets

miniImageNet[25] is a subset of ImageNet [3]. It is divided into a training set with 64 classes, a validation set with 16 classes, and a test set with 20 classes. Each class consists of 600 image samples, each of size 84×84 pixels.

tieredImageNet[20] is another subset of ImageNet. It comprises 608 classes, with each class containing 1281 images. These 608 classes are divided into 351 for training, 97 for validation, and 160 for testing.

CUB-200[26] is a fine-grained dataset that consists of 11788 bird images covering 200 different bird species. We partition it into 100 classes for training, 50 classes for validation, and 50 classes for testing. For fine-grained datasets, we resize the images to the same size as miniImageNet, i.e., 84×84 pixels.

Table 2: The classification accuracies on the CUB-200 dataset in the 5-way 1-shot and 5-shot settings using Conv-4 and ResNet-12 as backbones. The confidence intervals of our method are all below 0.20.
Method Conv-4 ResNet-12
1-shot 5-shot 1-shot 5-shot
ProtoNet[23] 63.73 81.50 66.09 82.50
DSN[22] 66.01 85.41 80.80 91.19
FRN[28] 73.48 88.43 83.16 92.59
Meta-OLE[27] 71.32 86.11 - -
Approximate GAP[12] 43.77 62.92 - -
GAP[12] 44.74 64.88 - -
DeepEMD[31] - - 77.14 88.98
DN4[14] 73.42 90.38 - -
DMN4[16] 78.36 92.16 - -
TADNet[30] 82.47 93.36 87.62 94.80
TCDSNet(ours) 82.73 95.04 88.71 95.82
Table 3: Ablation study on the miniImageNet and CUB-200 datasets for the influence of the Top-$K$ ratio in support LD selection.
Conv-4 ResNet-12
K miniImageNet CUB-200 K miniImageNet CUB-200
1% 74.94 90.11 3% 83.92 89.21
2% 75.89 90.23 5% 85.12 89.25
5% 74.23 92.37 10% 84.39 92.33
10% 72.02 95.04 25% 84.41 95.82
30% 71.11 94.57 30% 83.83 94.29

3.2 Implementation Details

Model architecture. We use Conv-4 and ResNet-12 as the feature extraction network $f_{\theta}$, similar to previous work [14, 4, 30, 19]. Conv-4 consists of 4 convolutional blocks, each containing a convolutional layer, a batch normalization layer, and a Leaky ReLU layer. ResNet-12 is composed of 4 residual blocks, each consisting of 3 convolutional layers with 3×3 kernels, 3 batch normalization layers, 3 Leaky ReLU layers, and a 2×2 max-pooling layer. For 84×84 images, Conv-4 and ResNet-12 generate feature maps of size 19×19×64 and 5×5×640, respectively. These feature maps are then mapped through a transformation layer $f_{\phi}$, which consists of a 1×1 convolutional layer, a batch normalization layer, and a LeakyReLU layer. Finally, $\mathcal{F}_{q}$ is implemented with two fully connected layers.
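For reference, one Conv-4 layout consistent with the stated 19×19×64 output for 84×84 inputs is sketched below (PyTorch assumed); the exact padding/pooling arrangement is an assumption, not the authors' released code.

```python
# One possible Conv-4 layout (an assumption) that reproduces the stated
# 19x19x64 feature map for 84x84 inputs.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, padding=1, pool=False):
    layers = [nn.Conv2d(in_ch, out_ch, 3, padding=padding),
              nn.BatchNorm2d(out_ch),
              nn.LeakyReLU(0.2, inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

conv4 = nn.Sequential(
    conv_block(3, 64, pool=True),    # 84 -> 42
    conv_block(64, 64, pool=True),   # 42 -> 21
    conv_block(64, 64),              # 21 -> 21
    conv_block(64, 64, padding=0),   # 21 -> 19
)

print(conv4(torch.randn(1, 3, 84, 84)).shape)   # torch.Size([1, 64, 19, 19])
```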

Training and evaluation details. During the meta-training phase, we follow the settings in [16, 30, 19]. For Conv-4, we set the learning rate to 1e-3 and decay it by 0.1 every 10 epochs, training for a total of 30 epochs with the Adam optimizer. For ResNet-12, we first pre-train it and then conduct meta-training for 40 epochs using SGD with momentum, an initial learning rate of 5e-4, and a decay of 0.1 every 10 epochs. During testing, as in [16, 30, 19], we randomly construct 10000 episodes from the test set to calculate the classification accuracy. This process is repeated five times, and we report the average accuracy along with the 95% confidence interval.
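The Conv-4 meta-training schedule described above (Adam, learning rate 1e-3, decayed by 0.1 every 10 epochs, 30 epochs) corresponds to a setup along these lines (PyTorch assumed; the model variable is a placeholder):

```python
# Sketch of the Conv-4 meta-training schedule: Adam, lr 1e-3,
# decayed by 0.1 every 10 epochs, 30 epochs in total.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3)   # placeholder for f_theta, f_phi, and F_q parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... sample episodes, compute the loss of Eq. (12), loss.backward(), optimizer.step() ...
    scheduler.step()
```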

3.3 Comparisons with State-of-the-art Methods

We choose 7 generic few-shot learning state-of-the-art baselines [25, 23, 24, 28, 27, 12], as well as 5 SOTA baselines based on LDs [31, 14, 16, 4, 30]. For fine-grained datasets, we also selected 9 state-of-the-art baselines [23, 14, 22, 12, 27, 31, 28, 16, 30].

Results on miniImageNet dataset. As shown in Table 1, the performance of our method in the 5-way 1-shot and 5-shot settings exceeds that of all current LD-based methods [31, 14, 16, 4, 30]. Compared to the baseline DN4, our method exhibits significant improvement: in the 5-way 1-shot and 5-shot settings with Conv-4 as the backbone, it achieves improvements of 5.90% and 4.87%, respectively, and it also surpasses the state-of-the-art (SOTA) by 1% and 1.21%, respectively. With ResNet-12 as the backbone, improvements of 3.18% and 4.02% are achieved, surpassing the SOTA by 1.27% and 0.89%, respectively.

Results on tieredImageNet dataset. As shown in Table 1, our method outperforms the current state-of-the-art LD-based methods as well. In the 5-way 1-shot and 5-shot settings with Conv-4 as the backbone, our method improves by 0.79% and 0.08%, respectively, over the state-of-the-art LD-based method; with ResNet-12 as the backbone, it improves by 1.14% and 0.89%, respectively.

Results on fine-grained CUB-200 dataset. As shown in Table 2, our method also achieves state-of-the-art performance on the fine-grained dataset. In the 5-way 1-shot and 5-shot settings with Conv-4 as the backbone, our method improves by 0.26% and 1.68%, respectively, over the state-of-the-art LD-based method; with ResNet-12 as the backbone, it improves by 1.02% and 1.19%, respectively.

Table 4: The classification accuracies on the miniImageNet dataset in the 5-way 1-shot and 5-shot settings for backbones with different parameters.
Method Backbone ≈Params miniImageNet
 1-shot 5-shot
CTM[13] ResNet-18 11.7 M 64.12±0.82 80.51±0.13
Neg-Cosine[15] ResNet-18 11.7 M 62.33±0.82 80.94±0.59
UniSiam+dist[17] ResNet-18 11.7 M 64.10±0.36 82.26±0.25
Meta-OLE[12] WRN-28-10 36.5 M 75.22±0.30 86.12±0.28
MetaQDA[32] WRN-28-10 36.5 M 67.83±0.64 84.28±0.69
OM[18] WRN-28-10 36.5 M 66.78±0.30 85.29±0.41
FewTURE[7] ViT-Small 22 M 68.02±0.88 84.51±0.53
FewTURE[7] Swin-Tiny 29 M 72.40±0.78 86.38±0.49
TCDSNet(ours) ResNet-12 12.4 M 68.53±0.19 85.12±0.42

3.4 Ablation Studies

Influence of Top-$K$ in support LD selection. In Section 2.3, we selected $K$ ($K$ expressed as a percentage) LDs based on $\mathcal{CDS}$ for each support class to form a discriminative LD set. As shown in Table 3, we conducted experiments on the miniImageNet and CUB-200 datasets under the 5-way 5-shot setting. With Conv-4 as the backbone, we set $K$ to 1%, 2%, 5%, 10%, and 30%; with ResNet-12 as the backbone, we set $K$ to 3%, 5%, 10%, 25%, and 30%. We found that with Conv-4 as the backbone, performance is best at $K=2\%$ on miniImageNet and $K=10\%$ on CUB-200, while with ResNet-12 as the backbone, performance is best at $K=5\%$ on miniImageNet and $K=25\%$ on CUB-200. These results indicate that, compared to general datasets, fine-grained datasets require more discriminative LDs. Similarly, under the 5-way 1-shot setting, the best performance with Conv-4 as the backbone is achieved at $K=5\%$ and $K=10\%$, and with ResNet-12 at $K=5\%$ and $K=10\%$.

Comparison with methods using backbones with different parameters. As shown in Table 4, we select three baselines [13, 17, 15] using ResNet-18 as the backbone, three baselines [12, 32, 18] using WRN-28-10 as the backbone, and baselines [7] using ViT-Small and Swin-Tiny as backbones. These methods are not LD-based. Compared to the baselines using ResNet-18, our method outperforms the best-performing one by 4.41% and 2.86% in the 1-shot and 5-shot settings, respectively. Compared to the baselines using WRN-28-10 [12, 32, 18], our method achieves a 0.70% improvement in the 1-shot setting and is only 0.17% lower than OM [18] in the 5-shot setting, despite WRN-28-10 having three times the parameters of ResNet-12. Compared to FewTURE [7] with ViT-Small as the backbone, our method achieves improvements of 0.51% and 0.61%, and it is only 1.26% lower than FewTURE with Swin-Tiny as the backbone in the 5-shot setting, although Swin-Tiny has 2.3 times the parameters of ResNet-12. Additionally, FewTURE's ViT-Small and Swin-Tiny models were trained on 4 and 8 Nvidia A100 40GB GPUs, respectively, making their GPU requirements relatively high.

4 CONCLUSION

We propose a novel Task-Aware Contrastive Discriminative Local Descriptor Selection Network (TCDSNet), which utilizes a novel contrastive discriminative measure to filter discriminative local descriptors from the support class. Subsequently, it further selects discriminative query local descriptors from the filtered discriminative support descriptors, ensuring the selection of task-relevant query local descriptors. Extensive experiments validate the superiority and effectiveness of our proposed method. We anticipate that TCDSNet provides a new perspective for research in few-shot learning based on local descriptors.

Acknowledgment

This work was supported in part by the National Key R&D Program of China (2018YFA0701700; 2018YFA0701701) and by the National Natural Science Foundation of China under Grant Nos. 61672364, 62176172, and 62002253.

References

  • [1] Antoniou, A., Edwards, H., Storkey, A.: How to train your maml. arXiv preprint arXiv:1810.09502 (2018)
  • [2] Antoniou, A., Storkey, A., Edwards, H.: Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340 (2017)
  • [3] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
  • [4] Dong, C., Li, W., Huo, J., Gu, Z., Gao, Y.: Learning task-aware local representations for few-shot learning. In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence. pp. 716–722 (2021)
  • [5] Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International conference on machine learning. pp. 1126–1135. PMLR (2017)
  • [6] He, F., Li, G., Zhang, M., Yan, L., Si, L., Li, F.: Freestyle: Free lunch for text-guided style transfer using diffusion models (2024)
  • [7] Hiller, M., Ma, R., Harandi, M., Drummond, T.: Rethinking generalization in few-shot classification. Advances in Neural Information Processing Systems 35, 3582–3595 (2022)
  • [8] Huang, S., Cao, Z., Qin, L., Gao, J., Zhang, J.: Contrastive learning with high-quality and low-quality augmented data for query-focused summarization. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 11536–11540. IEEE (2024)
  • [9] Huang, S., Qin, L., Cao, Z.: Diffusion language model with query-document relevance for query-focused summarization. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 11020–11030 (2023)
  • [10] Jiang, M., Li, F.: Lie group continual meta learning algorithm. Applied Intelligence 52(10), 10965–10978 (2022)
  • [11] Jiang, M., Li, F., Liu, L.: Continual meta-learning algorithm. Applied Intelligence pp. 1–16 (2022)
  • [12] Kang, S., Hwang, D., Eo, M., Kim, T., Rhee, W.: Meta-learning with a geometry-adaptive preconditioner. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16080–16090 (2023)
  • [13] Li, H., Eigen, D., Dodge, S., Zeiler, M., Wang, X.: Finding task-relevant features for few-shot learning by category traversal. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1–10 (2019)
  • [14] Li, W., Wang, L., Xu, J., Huo, J., Gao, Y., Luo, J.: Revisiting local descriptor based image-to-class measure for few-shot learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7260–7268 (2019)
  • [15] Liu, B., Cao, Y., Lin, Y., Li, Q., Zhang, Z., Long, M., Hu, H.: Negative margin matters: Understanding margin in few-shot classification. In: ECCV (2020)
  • [16] Liu, Y., Zheng, T., Song, J., Cai, D., He, X.: Dmn4: Few-shot learning via discriminative mutual nearest neighbor neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 1828–1836 (2022)
  • [17] Lu, Y., Wen, L., Liu, J., Liu, Y., Tian, X.: Self-supervision can be a good few-shot learner. In: European Conference on Computer Vision. pp. 740–758. Springer (2022)
  • [18] Qi, G., Yu, H., Lu, Z., Li, S.: Transductive few-shot classification on the oblique manifold. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8412–8422 (2021)
  • [19] Qiao, Q., Xie, Y., Zeng, Z., Li, F.: Talds-net: Task-aware adaptive local descriptors selection for few-shot image classification. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3750–3754. IEEE (2024)
  • [20] Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J.B., Larochelle, H., Zemel, R.S.: Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676 (2018)
  • [21] Schwartz, E., Karlinsky, L., Shtok, J., Harary, S., Marder, M., Kumar, A., Feris, R., Giryes, R., Bronstein, A.: Delta-encoder: an effective sample synthesis method for few-shot object recognition. Advances in neural information processing systems 31 (2018)
  • [22] Simon, C., Koniusz, P., Nock, R., Harandi, M.: Adaptive subspaces for few-shot learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4136–4145 (2020)
  • [23] Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. Advances in neural information processing systems 30 (2017)
  • [24] Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1199–1208 (2018)
  • [25] Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. Advances in neural information processing systems 29 (2016)
  • [26] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd birds-200-2011 dataset (2011)
  • [27] Wang, Z., Lu, Y., Qiu, Q.: Meta-ole: Meta-learned orthogonal low-rank embedding. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5305–5314 (2023)
  • [28] Wertheimer, D., Tang, L., Hariharan, B.: Few-shot classification with feature map reconstruction networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8012–8021 (2021)
  • [29] Xian, Y., Sharma, S., Schiele, B., Akata, Z.: f-vaegan-d2: A feature generating framework for any-shot learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10275–10284 (2019)
  • [30] Yan, L., Li, F., Zheng, X., Zhang, L.: Few-shot learning via task-aware discriminant local descriptors network. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. pp. 2887–2894 (2023)
  • [31] Zhang, C., Cai, Y., Lin, G., Shen, C.: Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12203–12213 (2020)
  • [32] Zhang, X., Meng, D., Gouk, H., Hospedales, T.M.: Shallow bayesian meta learning for real-world few-shot recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 651–660 (2021)