Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning

Yuzhong Chen University of Electronic Science and Technology of China Zhenxiang Xiao University of Electronic Science and Technology of China Lin Zhao University of Georgia Lu Zhang University of Texas at Arlington Haixing Dai University of Georgia David Weizhong Liu University of Georgia Zihao Wu University of Georgia Changhe Li Northwestern Polytechnical University Tuo Zhang Northwestern Polytechnical University Changying Li University of Georgia Dajiang Zhu University of Texas at Arlington Tianming Liu University of Georgia Jiang Xi Corresponding author: [email protected] University of Electronic Science and Technology of China

Abstract

Learning with little data is challenging but often inevitable in various application scenarios where the labeled data is limited and costly. Recently, few-shot learning (FSL) gained increasing attention because of its generalizability of prior knowledge to new tasks that contain only a few samples. However, for data-intensive models such as vision transformer (ViT), current fine-tuning based FSL approaches are inefficient in knowledge generalization and thus degenerate the downstream task performances. In this paper, we propose a novel mask-guided vision transformer (MG-ViT) to achieve an effective and efficient FSL on ViT model. The key idea is to apply a mask on image patches to screen out the task-irrelevant ones and to guide the ViT to focus on task-relevant and discriminative patches during FSL. Particularly, MG-ViT only introduces an additional mask operation and a residual connection, enabling the inheritance of parameters from pre-trained ViT without any other cost. To optimally select representative few-shot samples, we also include an active learning based sample selection method to further improve the generalizability of MG-ViT based FSL. We evaluate the proposed MG-ViT on both Agri-ImageNet classification task and ACFR apple detection task with gradient-weighted class activation mapping (Grad-CAM) as the mask. The experimental results show that the MG-ViT model significantly improves the performance when compared with general fine-tuning based ViT models, providing novel insights and a concrete approach towards generalizing data-intensive and large-scale deep learning models for FSL.

1 Introduction

Deep neural networks (DNNs) have achieved great success in many computer vision tasks with a large amount of labeled data. However, learning with little data is often inevitable in a variety of application scenarios in which the labeled data is limited and costly. Recently, few-shot learning (FSL) has attained increasing interest to reconcile the demand and scarcity of large-scale labeled data in DNNs because of its generalizability of prior knowledge to new tasks given only a few samples. Existing literature studies have devoted extensive efforts to improving FSL from three aspects [1]: data augmentation [2, 3], model design [4], and algorithm development [5], and obtained promising results. By simply freezing the backbone and merely fine-tuning the last few layers, previous fine-tuning based FSL approaches have achieved promising performances [6, 7, 8].

However, for data-intensive models with even more parameters, e.g., Vision Transformer (ViT) [9], current fine-tuning based FSL approaches are inefficient in knowledge generalization. For example, in FSL scenario, ViT has better performance than ResNet only when being pretrained on large dataset[9]. This is probably because the data-intensive models, such as ViT, do not inherently encode the inductive biases that are useful for smaller datasets and thus require a large amount of labeled data to figure out the underlying modality-specific rules [9, 10], resulting in unsatisfying performance when the labeled data is limited. Therefore, figuring out how to efficiently generalize the domain knowledge of data-intensive models for FSL is still an open question and worth more effort to explore.

Refer to caption — Figure 1: Mask operation to select most task-relevant and discriminative image patches for MG-ViT model.

In this paper, we propose a novel mask-guided vision transformer (MG-ViT) to effectively adapt the domain knowledge of pre-trained ViT to FSL tasks. The key idea of MG-ViT is to apply a mask on input image patches before the first transformer encoder layer to screen out the task-irrelevant patches and to guide ViT to focus on task-relevant and discriminative ones during FSL (1) as background information is harmful for few-shot learning [11]. The most task-relevant and discriminative patches are identified from the salience map by calculating the gradient-weighted class activation mapping (Grad-CAM) [12] of the most similar images in the source dataset with the target dataset. The proposed mask operation can take better advantage of the prior knowledge from the source domain and reduce the deviation between the source and target domains. Moreover, since the visible patches that are not masked retain task-relevant information but lose global information, especially the position information in the image, we add a residual connection between the first and last encoder layers of ViT to retain the global information. MG-ViT follows a two-stage training scheme for FSL [7]. The first stage is to train on a base dataset only with the vanilla ViT, and the second stage is to fine-tune the model on both novel and base datasets with the proposed MG-ViT. Particularly, MG-ViT only introduces the above-mentioned mask operation and residual connection to ViT, and is able to inherit parameters from pre-trained ViT without any other costly pre-training.

To further improve the generalizability of MG-ViT based FSL, we additionally include an active learning based sample selection method [13, 14]. The rationale is that since general FSL aims to predict a large number of unlabeled samples by learning from a limited number of labeled data [1], active learning may improve the model learning by labeling representative samples and further improve the model prediction ability on other unlabeled samples [13, 15]. Therefore, we introduce a cluster-based sample selection method to optimally select representative few-shot samples.

We evaluate the proposed MG-ViT on both Agri-ImageNet image classification task and ACFR apple detection task, and conduct extensive comparisons with the general fine-tuning based ViT models. The experimental results demonstrate that the proposed MG-ViT model significantly improves the task performance when compared with general fine-tuning based ViT models.

In general, the main novelties and contributions of our work are:

•

We introduce an elegant mask operation on image patches to guide the ViT to focus on task-relevant and discriminative patches during FSL, add a residual connection to retain the global features of visible image patches, and maintain the other structures of ViT, thus allowing improvement of model effectiveness with inheritance of parameters from pre-trained ViT without any other cost.
•

We include an active learning based sample selection method to optimally select representative few-shot samples, which allows further improvement of FSL on the proposed MG-ViT.
•

We propose an effective framework of FSL on data-intensive models (ViT in this study) and achieve excellent performances on both image classification and object detection tasks, thus providing novel insights in generalizing data-intensive and large-scale deep learning models for FSL.

2 Related work

Vision Transformer

ViT [9] is transferred from the structure of transformer in natural language processing (NLP) tasks [16]. Recently, several refined ViT models such as DeiT [17], CeiT [18], local ViT [19] and NesT [20] are introduced with useful strategies including knowledge distillation, depth-wise convolution and tree-like structure, and achieve improved model performance in the vision task.

However, the data-intensive ViT is difficult to quickly adapt to the target domain with a small amount of labeled data. By using a distillation approach [14], smoothing the loss landscapes at convergence [21], or incorporating CNNs like CCT [22] and NesT [20], ViT can reduce the demand for large sample data to some extent. Moreover, MAE [23] is proposed to mask most of the image patches and apply an unsupervised image reconstruction method to pre-train transformer to improve its generalizability for downstream tasks. However, these studies are still far from the requirement of fast adaption to the target domain for FSL.

Few-shot Learning

Few-shot learning (FSL) is proposed for learning with only a few samples [1]. Meta-learning, which is also known as learning-to-learn, is a crucial approach for FSL [24]. In recent studies for FSL, Zhang et al. [5] proposed a novel absolute-relative learning paradigm to fully use the binary label and soft similarity information. Ma et al. [4] designed an inverted pyramid network (IPN) with global and local stages to learn the support-query relation and precise query-to-class similarity embedding. Yang et al. [2] calibrated the distribution of few sample class by transferring statistics from the classes with sufficient examples subject to a Gaussian distribution of each feature representation. For transformer based FSL, recent studies [25, 26, 27] merely fix the CNN-based feature extractor trained on base class and apply the attention mechanisms to exploit the correlation between query and support sets to perform classification on FSL, which does not take full advantage of the powerful learning representation of ViT.

Active Learning

Previous studies usually randomly select certain images from the dataset as training samples and have the model learn from examples [13]. However, by actively selecting a fixed number of training data, active learning is provably more powerful and with better generalizability [13, 15]. For example, Chitta et al. [14] designed an ensemble active learning method and achieved better performance on the test data with less training data. Therefore, active learning may improve FSL by labeling representative few-shot samples [1]. For example, Yan et al. [3] reported a better result on FSL using a Graph Convolution Network (GCN) based active learning data selection policy.

3 Methods

The overall framework is illustrated in Section 2. We first illustrate the definitions of our problem and ViT architecture in Section 3.1. We next demonstrate the detailed structure of MG-ViT in Section 3.2 and the generation of image patch mask in Section 3.3. We then introduce the identification of neighborhood image in Section 3.4 and active learning based few-shot sample selection in Section 3.5. Lastly, we provide the overall training scheme in Section 3.6.

3.1 Problem Definition and Vision Transformer

Problem Definition:

Given a base dataset $D_{b}=\{(x,y)\}$ where $y\in Y_{b}$ and a novel dataset $D_{n}=\{(x,y)\}$ where $y\in Y_{n}$ , $Y_{b}\cup Y_{n}=Y$ , $Y_{b}\cap Y_{n}=\emptyset$ where $Y$ denotes the whole set of class. The base dataset is with a large amount of labeled data while the novel dataset is with only a few labeled data. Our aim is to train a model with both base and novel datasets while achieving satisfying generalizability on the novel dataset for FSL. The common practice to evaluate the fast adaptation ability and generalizability of FSL model is to build a $N$ -way- $K$ -shot task, where $N$ is the number of classes and $K$ is the number of labeled data in the novel dataset. The model is trained with only $N\times K$ labeled data from the novel dataset. The performance of FSL model is evaluated on the test split of the novel dataset.

Vision Transformer:

ViT [9] receives an patch embedding sequence $x_{\scriptscriptstyle PATCH}\in\mathbb{R}^{N\times D}$ , where $\displaystyle N=\frac{H*W}{P^{2}}$ is the number of patches, $D$ is the output dimension, and $(H,W)$ and $(P,P)$ are the resolutions of the image and patch, respectively. For different downstream tasks, $x_{\scriptscriptstyle PATCH}$ concatenates different tokens, e.g., $x_{\scriptscriptstyle CLS}\in\mathbb{R}^{1\times D}$ class token [9] for the image classification task and $x_{\scriptscriptstyle DET}\in\mathbb{R}^{100\times D}$ detect token [28] for the object detection task. To retain the position information of patches in the whole image, position embedding $\mathbf{P}\in\mathbb{R}^{(N+1+100)\times D}$ is also added to the concatenated inputs. Therefore, the input sequence $z_{0}$ of ViT with both class token and detect token is described as:

z_{0}=[x_{\scriptscriptstyle PATCH};x_{\scriptscriptstyle DET};x_{\scriptscriptstyle CLS}]+\mathbf{P}

(1)

The encoder layer of Transformers consists of one multi-head self-attention (MSA) block and one multi-layer perceptron (MLP) block. LayerNorm (LN) and residual connections are applied before and after every block, respectively. Therefore, the output embedding of $l$ -th layer is:

	$\displaystyle z_{l}^{\prime}$	$\displaystyle=MSA(LN(z_{l-1}))+z_{l-1},\,l=1,\cdots,L$		(2)
	$\displaystyle z_{l}$	$\displaystyle=MLP(LN(z_{l}^{\prime}))+z_{l}^{\prime},\,l=1,\cdots,L$		(3)

For different downstream tasks, the task-relevant tokens are fed into the task-special MLP header for final prediction. For image classification task:

y_{class}=MLP(x_{\scriptscriptstyle CLS})

(4)

For object detection task:

	$\displaystyle y_{class}\!$	$\displaystyle=\![MLP_{c}(x_{\scriptscriptstyle DET}^{1});\cdots;MLP_{c}(x_{\scriptscriptstyle DET}^{100})]$		(5)
	$\displaystyle y_{bbox}\!$	$\displaystyle=\![MLP_{b}(x_{\scriptscriptstyle DET}^{1});\cdots;MLP_{b}(x_{\scriptscriptstyle DET}^{100})]$		(6)

3.2 Mask-guided Vision Transformer

We introduce the detailed structure of proposed MG-ViT as follows. One of the core issues for FSL is to effectively adapt the prior knowledge learned from the source domain (base dataset) to the target domain (novel dataset). Inspired by He et al. [23], we add an image patch mask (detailed in Section 3.3) to the patch embeddings before the first encoder layer of transformer to screen out the task-irrelevant image patches and to guide ViT focusing on task-relevant and discriminative ones. Different from masking random patches in MAE [23], our design masks the task-irrelevant ones to focus on the task-relevant prior knowledge for better generalizability. Note that we only apply the mask operation on the base dataset but not on the novel dataset for two reasons. First, we want to make fully use of the few-shot sample information in the novel dataset for better FSL. Second, it is difficult to identify the important features within only a few samples which may lead to a noisy salience map. Therefore, the input of the first encoder layer $z_{0}^{masked}$ is:

z_{0}^{masked}=[x_{\scriptscriptstyle PATCH}\odot Mask;x_{\scriptscriptstyle DET};x_{\scriptscriptstyle CLS}]+[P_{\scriptscriptstyle PATCH}\odot Mask;P_{\scriptscriptstyle DET};P_{\scriptscriptstyle CLS}]

(7)

where $P_{\scriptscriptstyle PATCH}$ , $P_{\scriptscriptstyle DET}$ and $P_{\scriptscriptstyle CLS}$ are the patch position embedding, detect position embedding and class position embedding in P, respectively. $\odot$ is the element-wise production.

Moreover, we introduce a residual connection between the first and last encoder layers of ViT to retain the global features, especially the position information, of visible patches. Inspired by He et al. [29, 23], we add the embeddings of all image patches before the first encoder layer to the input of the last encoder layer as in Figure 2. The input of the last encoder layer $\hat{z}_{L-1}$ is:

	$\displaystyle\hat{z}_{L-1}=[\hat{x}_{\scriptscriptstyle PATCH}^{L-1};x_{\scriptscriptstyle DET}^{L-1};x_{\scriptscriptstyle CLS}^{L-1}]$		(8)
	$\displaystyle\hat{x}_{\scriptscriptstyle PATCH}^{{L-1}^{i}}=\begin{cases}x_{\scriptscriptstyle PATCH}^{{L-1}^{i}}+x_{\scriptscriptstyle PATCH}^{0^{i}}&,\boldsymbol{if}\;Mask_{i}=1\\ x_{\scriptscriptstyle PATCH}^{0^{i}}&,otherwise\end{cases}$

where $\hat{x}_{\scriptscriptstyle PATCH}^{L-1}$ = $\{x_{\scriptscriptstyle PATCH}^{{L-1}^{i}}|i=1,\cdots,N\}$ , $x_{\scriptscriptstyle DET}^{L-1}$ and $x_{\scriptscriptstyle CLS}^{L-1}$ are the image patch token, detect token and class token of the last layer, respectively. Compared with the vanilla ViT model, MG-ViT only introduces an additional mask operation and a residual connection, thus enabling the model to focus on the most task-relevant image patches and to inherit parameters from pre-trained ViT without any other retraining. Therefore, MG-ViT can achieve fast domain adaption for FSL on ViT.

3.3 Generation of Image Patch Mask

In order to make better advantage of the prior knowledge and to reduce the deviation between the source domain (base dataset) and target domain (novel dataset), we generate the image patch mask based on the most task-relevant and discriminative patches from the salience map calculated by Grad-CAM [12]. Specifically, we first adopt Grad-CAM [12] to calculate the salience map of the identified neighborhood samples from the base dataset (detailed in Section 3.4). We then select top k salient patches with largest absolute sum values of the gradient of image patch features after patch embedding in the salience map as the most task-relevant and discriminative ones. We finally perform a binary operation on the salience map to label those top k salient patches as 1 and the remaining ones as 0. The generation of image patch mask is written as:

	$\displaystyle g_{i}=Sum\left\|\frac{\partial L(f(x),y)}{\partial x_{\scriptscriptstyle PATCH}^{i}}\right\|,\,i=1,2,\cdots,N$		(9)
	$\displaystyle Mask_{i}=\begin{cases}1&,\boldsymbol{if}\;g_{i}\;\boldsymbol{in}\;topk\;G\\ 0&,otherwise\\ \end{cases}$		(10)

where $G\in\mathbb{R^{N}}=\{g_{1},g_{2},...,g_{N}\}$ is the salience map of patches $x_{\scriptscriptstyle PATCH}=\{x_{\scriptscriptstyle PATCH}^{1},...,x_{\scriptscriptstyle PATCH}^{N}\}$ , topk is the set of selected salient patches and $Mask=\{Mask^{1},\cdots,Mask^{N}\}$ is the binary mask. Besides the discrete mask generated from the Grad-CAM based salience map, we also generate a continued one based on the center coordinates and designated length and width of the discrete mask in order to provide more contour information of target localization for object detection task (Figure 2).

3.4 Identification of Neighborhood Samples From the Base Dataset

Identifying neighborhood samples [30] from the base dataset which are similar to the few-shot samples in the novel dataset for joint fine-tuning can improve the performance for FSL [31, 32, 2]. Inspired by Paul et al. [33] using the Gradient Normed and Error $l_{2}$ -Norm scores to identify the important samples which are hard for model training with large loss, we measure the similarity as the negative loss value of the sample in the base dataset, with the model trained on the selected few-shot samples from the novel dataset (detailed in Section 3.5). The similarity of sample $(x_{i},y_{i})\in D_{b}$ with $D_{n}$ is written as:

	$\displaystyle Sim(x_{i},D_{n})=-L(f_{\hat{W}}(x_{i}),y_{i})$		(11)
	$\displaystyle\hat{W}={\mathop{\arg\min}\limits_{W}{L(f_{W}(x),y)}},\,\forall(x,y)\in D_{n}$		(12)

where $L(*)$ is the loss function, $f(*)$ represents the model’s output of input $x$ , $W$ is the set of parameters of the model, and $\hat{W}$ is the set of parameters of the model with least loss.

In this way, we effectively combine both the model characteristics and labeling information of the data without performing complex similarity calculations between the two datasets. The proposed method of loss-based image similarity is similar to the anomaly score in anomaly detection [34] which measures the distance of data point from the center of a sphere. The lower loss allows the model to focus more on learning the representation of the novel dataset which contributes to the performance for FSL.

3.5 Active Learning based Few-Shot Sample Selection

To further improve the generalizability of MG-ViT based FSL, we introduce a cluster-based sample selection method to optimally select representative few-shot samples. Considering that the elaborate active learning methods may lead to over-fitting problems, we design a simple and effective method inspired by [35, 31] as illustrated in Figure Figure 3. We first use a CNN-based model as a feature extractor to extract image features, then perform an unsupervised clustering to identify $k$ clusters as $k$ -shot, and finally select the image with the highest node degree in each cluster as the few-shot sample. In this study, we use the pre-trained ResNet-101 [29] as the backbone to extract features, $k$ -means to cluster image features and Euclidean distance as the weight of adjacency.

3.6 Overall Training Scheme

Algorithm 1 Fine-tuning scheme with MG-ViT.

Input: base dataset

D_{b}

, novel dataset

D_{n}

and pretrained model.

Initialize: model retraining on

D_{b}

for epoch in initial fine-tuning step do

train\_one\_epoch(model,(D_{b},Mask=None))

end for

while training do

for

(x,y)

D_{b}

Sim_{x_{i}}=-loss(f(x_{i}),y_{i})

end for

D_{b}^{sub}=topk(Sim_{x_{i}})

Mask_{b}=Grad-CAM(model,D_{b}^{sub})

as Eq 9

train\_one\_epoch(model,\!{(\!D_{n},\!None\!)\!\cup\!{(\!D_{b}^{sub},\!Mask_{b}\!))}}

end while

The overall training scheme of MG-ViT consists of two stages. The first stage is to train on the base dataset only with the vanilla ViT and the second stage is to fine-tune the model on both the base and novel datasets with the proposed MG-ViT. We demonstrate the fine-tuning scheme at the second stage in Algorithm 1. Note that before we fine-tune MG-ViT with the novel dataset, we first initialize fine-tuning of the model on the base dataset for a few epochs since the change of computational flow of the model at the fine-tuning stage may lead to deviations in the input of the last encoder part.

4 Experiments

We evaluate our MG-ViT based on ViT-S [9] and conduct experiments on both Agri-ImageNet image classification and Apple object detection tasks. We compare our methods with the general fine-tuning based FSL methods [7]. We report the average precision (AP) score for object detection task and the average accuracy (ACC) for image classification task on the test split of the novel dataset.

4.1 Image Classification in Agri-ImageNet

Agri-ImageNet

We carry out the image classification task based on a new Agri-ImageNet dataset. The Agri-ImageNet dataset contains three parent classes including fruit, weed and vegetable. We randomly select the fruit (with 9 sub-classes) and weed (with 8 sub-classes) parent classes as the base dataset and the vegetable (with 4 sub-classes) one as the novel dataset. We perform 4-way few-shot image classification task since there are 4 sub-classes in the vegetable parent class. The base dataset is randomly split into training/test with 75%/25%. The remaining data in the novel dataset except for the actively selected few-shot samples is the test split of novel dataset. For the training dataset, Rand-Augment [36], Random Erasing [37], and RandomResizeCrop to 224 × 224 are applied for data augmentation. For the test dataset, images are only resized and center cropped to 224 × 224.

Setting

The ImageNet-1k pre-trained model is firstly trained on the base dataset with the vanilla ViT. We adopt an AdamW optimizer with 200 epochs using a cosine decay learning rate scheduler and 10 epochs of linear warm-up. Batch size is set to 64, initial learning rate is set to 0.0001, and weight decay is set to 0.0001. Then, we fine-tune the model on the few-shot samples in the novel dataset together with the neighborhood samples from the base dataset with MG-ViT. We keep the same setting of regular training except for the initial learning rate to 0.001 and epochs to 30. The cross-entropy loss is adopted and the label smoothing is set to 0.1. The topk is set to 7×7 to select the salient patches for mask generation.

Table 1: The averaged ACC for different number of shots in image classification task based on Agri-ImageNet.

Backbone	Method	1-shot	5-shot	10-shot
ViT-S	Ft-full	74.9	90.2	94.6
ViT-S	Ft-part	60.1	90.1	92.6
MGViT-S	Ft-part	65.9	95.0	95.5
MGViT-S	Ft-full	85.6	98.1	98.5

Result

We compare MG-ViT with fine-tuning [7] based methods on ViT-S. As reported in Table 1, we see that fine-tuning the whole ViT (Ft-full) instead of only the MLP part (Ft-part) obtains better classification performance. Compared to the fine-tuning based methods (ViT-S), MG-ViT improves 3.9% from 94.6% to 98.5% in 10-shot, 7.9% from 90.2% to 98.1% in 5-shot, and 10.7% from 74.9% to 85.6% in 1-shot, demonstrating the superiority of the proposed MG-ViT in FSL. The patch salience maps are further visualized in Figure 4 to illustrate that MG-ViT can locate task-relevant patches more accurately compared to the fine-tuning based ViT..

4.2 Object Detection in ACFR Dataset

ACFR Apple

We apply the object detection task on the apple class of ACFR Orchard Fruit Dataset [38]. The dataset collected in 2016 and 2017 is used as the base and novel dataset, respectively. The base dataset is randomly split into training/validation/test with 64%/16%/20%. The remaining data in the novel dataset except for the few-shot samples is the test split of novel dataset. For the training dataset, data augmentation includes resizing and cropping the input images so that the shortest side is between 432 and 720 pixels and the longest one is at most 960 pixels. For the test dataset, the images are only resized to 720 × 960 and normalized.

Setting

We follow Fang et al. [28] and use the provided ImageNet-1k pre-trained ViT-S as the backbone. We continue pre-training the model on the Agri-ImageNet and keep the same setting as the training on the base dataset for better performance. Since the pre-trained model is trained on images with a resolution of 224 × 224 while the ones in the apple dataset are with higher resolution, we adopt a bicubic interpolation [17] for position embedding to fit the apple images. AdamW optimizer with a cosine decay learning rate scheduler is employed during the training on the base dataset. Batch size, epochs, initial rate, weight decay are set to 1,200, 0.0001 and 0.0001, respectively. We use the same hyperparameters except for initial rate and epochs set to 0.00001 and 400, respectively during the fine-tuning on the novel dataset. We use the same loss function as in Zhu et al. [39]. The topk is set to 24×32 to select the salient patches for mask generation.

Table 2: The AP scores for different number of shots in object detection task based on ACFR Apple dataset.

Backbone	Method	5-shot	10-shot	30-shot
ViT-S	Ft-full	47.9	56.2	74.9
ViT-S	Ft-part	26.4	28.7	33.2
MG-ViT-S	Ft-full	50.6	65.3	76.0

Result

We compare the proposed MG-ViT with the baseline method, a pure transformer structural object model of YOLOS-S [28] with fine-tuning [7]. As reported in Table 2, we see that similarly in Table 1, fine-tuning the full network (Ft-full) performs better than only fine-tuning the backbone (Ft-part) of the model, which is contrary to the conclusion of CNN-based few-shot object detection. By performing the mask operation, our MG-ViT outperforms the general fine-tuning based method (ViT-S) by 2.7% from 47.9% to 50.6% in 5-shot, 9.1% from 56.2% to 65.3% in 10-shot, and 1.1% from 74.9% to 76.0% in 30-shot. The patch salience maps of MG-ViT and YOLOS are further visualized and compared in Figure 4 to illustrate that MG-ViT can locate the object (i.e., apple) more accurately compared to YOLOS.

5 Ablation Study

Table 3: The averaged ACC in image classification task and AP in object detection task for different number of shots based on different combinations of model design.

Task	Active Learning	Neighborhood	Masked	1-shot	5-shot	10-shot
Clasiffication		$\surd$	$\surd$	84.6	96.5	96.6
	$\surd$			77.0	90.4	96.7
	$\surd$	$\surd$		82.1	97.8	97.9
	$\surd$	$\surd$	$\surd$	85.6	98.1	98.5
Task	Active Learning	Neighborhood	Masked	5-shot	10-shot	30-shot
Detection		$\surd$	$\surd$	47.2	56.3	75.1
	$\surd$			48.2	63.2	72.9
	$\surd$	$\surd$		48.6	64.1	75.3
	$\surd$	$\surd$	$\surd$	50.6	65.3	76.0

5.1 Effect of Active Learning based Few-Shot Sample Selection

We compare the effect of active learning based few-shot sample selection with randomly selected few-shot samples in MG-ViT. As reported in Figure 3, for image classification task, active learning based few-shot sample selection improves the accuracy by 1.0% from 84.6% to 85.6% in 1-shot, 1.6% from 96.5% to 98.1% in 5-shot, and 1.9% from 96.6% to 98.5% in 10-shot. Similarly, for object detection task, it improves the accuracy by 3.4% from 47.2% to 50.6% in 5-shot, 9.0% from 56.3% to 65.3% in 10-shot, and 0.9% from 75.1% to 76.0% in 30-shot.

5.2 Effect of Neighborhood Samples Identification

We compare the effect of adopting identified neighborhood samples with randomly selected images from the base dataset in MG-ViT. As reported in Figure 3, for image classification task, adopting identified neighborhood samples for joint fine-tuning with the novel dataset improves the accuracy by 5.1% from 77.0% to 82.1% in 1-shot, 7.4% from 90.4% to 97.8% in 5-shot, and 1.2% from 96.7% to 97.9% in 10-shot. Similarly, for object detection task, it improves the accuracy by 0.4% from 48.2% to 48.6% in 5-shot, 0.9% from 63.2% to 64.1% in 10-shot, and 2.4% from 72.9% to 75.3% in 30-shot.

Table 4: The performances of discrete and continued masks on 5-shot image classification and 10-shot object detection with MG-ViT.

Mask Shape	Discrete	Continued
5-shot image classification	98.7	98.1
10-shot object detection	63.1	65.3

5.3 Effect of Discrete or Continued Mask

We evaluate the effect of using a discrete or continued mask in MG-ViT. As reported in Table 4, the discrete mask performs better for image classification while the continued one performs better for object detection. This difference may derive from the fact that the continued mask could provide more contour information of target localization which benefits for object detection, while the discrete mask may provide more global semantic information for image classification task. We leave exploration of different types of masks for different downstream tasks in FSL to future work.

6 Conclusion

We propose a novel mask-guided vision transformer (MG-ViT) to guide ViT more effectively learn from the task-relevant prior knowledge for few-shot learning. By simply adding an image patch mask operation and a residual connection to the vanilla ViT, MG-ViT significantly outperforms the general fine-tuning based methods for FSL. Our results are in agreement with ViT that learning the relevant patterns directly from data is sufficient, even beneficial. To further improve FSL, we also introduce an effective active learning based few-shot sample selection method. Our two-stage fine-tuning based framework could be widely applied on different downstream tasks such as image classification and object detection in this study. In general, MG-ViT provides a concrete approach towards generalizing data-intensive and large-scale deep learning models for FSL. In the future, we can systematically explore the effectiveness of different types (e.g., size, shape, data modality, etc.) of image patch masks for different downstream tasks. Further combing the proposed mask operation with other models and modalities in other tasks such as NLP is another exciting direction of future work.

References

[1] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR), 53(3):1–34, 2020.
[2] Shuo Yang, Lu Liu, and Min Xu. Free lunch for few-shot learning: Distribution calibration. arXiv preprint arXiv:2101.06395, 2021.
[3] Shipeng Yan, Songyang Zhang, and Xuming He. Budget-aware few-shot learning via graph convolutional network. arXiv preprint arXiv:2201.02304, 2022.
[4] Yuqing Ma, Wei Liu, Shihao Bai, Qingyu Zhang, Aishan Liu, Weimin Chen, and Xianglong Liu. Few-shot visual learning with contextual memory and fine-grained calibration. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 811–817, 2021.
[5] Hongguang Zhang, Piotr Koniusz, Songlei Jian, Hongdong Li, and Philip HS Torr. Rethinking class relations: Absolute-relative supervised and unsupervised few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9432–9441, 2021.
[6] Akihiro Nakamura and Tatsuya Harada. Revisiting fine-tuning for few-shot learning. arXiv preprint arXiv:1910.00216, 2019.
[7] Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. arXiv preprint arXiv:2003.06957, 2020.
[8] John Cai and Sheng Mei Shen. Cross-domain few-shot learning with meta fine-tuning. arXiv preprint arXiv:2005.10544, 2020.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[10] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169, 2021.
[11] Xu Luo, Longhui Wei, Liangjian Wen, Jinrong Yang, Lingxi Xie, Zenglin Xu, and Qi Tian. Rectifying the shortcut learning of background for few-shot learning. Advances in Neural Information Processing Systems, 34, 2021.
[12] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[13] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine learning, 15(2):201–221, 1994.
[14] Kashyap Chitta, José M Álvarez, Elmar Haussmann, and Clément Farabet. Training data subset search with ensemble active learning. IEEE Transactions on Intelligent Transportation Systems, 2021.
[15] Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Learning active learning from data. arXiv preprint arXiv:1703.03365, 2017.
[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
[17] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[18] Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. Incorporating convolution designs into visual transformers. arXiv preprint arXiv:2103.11816, 2021.
[19] Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
[20] Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, and Tomas Pfister. Aggregating nested transformers. arXiv preprint arXiv:2105.12723, 2021.
[21] Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pretraining or strong data augmentations. arXiv preprint arXiv:2106.01548, 2021.
[22] Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, and Humphrey Shi. Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704, 2021.
[23] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
[24] Joaquin Vanschoren. Meta-learning: A survey. arXiv preprint arXiv:1810.03548, 2018.
[25] Lu Liu, William Hamilton, Guodong Long, Jing Jiang, and Hugo Larochelle. A universal representation transformer layer for few-shot image classification. arXiv preprint arXiv:2006.11702, 2020.
[26] Tao Gan, Weichao Li, Yuanzhe Lu, and Yanmin He. Transformer-based few-shot learning for image classification. In International Conference on Artificial Intelligence for Communications and Networks, pages 68–74. Springer, 2021.
[27] Haoxing Chen, Huaxiong Li, Yaohui Li, and Chunlin Chen. Sparse spatial transformers for few-shot learning. arXiv preprint arXiv:2109.12932, 2021.
[28] Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, and Wenyu Liu. You only look at one sequence: Rethinking transformer in vision through object detection. arXiv preprint arXiv:2106.00666, 2021.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[30] Qiang He, Zongxia Xie, Qinghua Hu, and Congxin Wu. Neighborhood based sample and feature selection for svm classification learning. Neurocomputing, 74(10):1585–1594, 2011.
[31] Weifeng Ge and Yizhou Yu. Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1086–1095, 2017.
[32] Othman Sbai, Camille Couprie, and Mathieu Aubry. Impact of base dataset design on few-shot image classification. In European Conference on Computer Vision, pages 597–613. Springer, 2020.
[33] Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34, 2021.
[34] Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407, 2019.
[35] Xiao Cai, Feiping Nie, Heng Huang, and Farhad Kamangar. Heterogeneous image feature integration via multi-modal spectral clustering. In CVPR 2011, pages 1977–1984. IEEE, 2011.
[36] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
[37] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13001–13008, 2020.
[38] Suchet Bargoti and James P Underwood. Image segmentation for fruit detection and yield estimation in apple orchards. Journal of Field Robotics, 34(6):1039–1060, 2017.
[39] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.