
Few-Shot Learning with a Strong Teacher

Han-Jia Ye, Lu Ming, De-Chuan Zhan, Wei-Lun Chao H.-J. Ye, L. Ming, and D.-C. Zhan are with State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China; Wei-Lun Chao is with the Ohio State University, USA.
E-mail: {yehj, mingl, zhandc}@lamda.nju.edu.cn, [email protected]
Abstract

Few-shot learning (FSL) aims to generate a classifier using limited labeled examples. Many existing works take the meta-learning approach, constructing a few-shot learner (a meta-model) that can learn from few-shot examples to generate a classifier. Typically, the few-shot learner is constructed or meta-trained by sampling multiple few-shot tasks in turn and optimizing the few-shot learner’s performance in generating classifiers for those tasks. The performance is measured by how well the resulting classifiers classify the test (i.e., query) examples of those tasks. In this paper, we point out two potential weaknesses of this approach. First, the sampled query examples may not provide sufficient supervision for meta-training the few-shot learner. Second, the effectiveness of meta-learning diminishes sharply with the increasing number of shots (i.e., the number of training examples per class). To resolve these issues, we propose a novel meta-training objective for the few-shot learner, which is to encourage the few-shot learner to generate classifiers that perform like strong classifiers. Concretely, we associate each sampled few-shot task with a strong classifier, which is trained with ample labeled examples. The strong classifiers can be seen as the target classifiers that we hope the few-shot learner will generate given few-shot examples, and we use the strong classifiers to supervise the few-shot learner. We present an efficient way to construct the strong classifier, making our proposed objective a plug-and-play term that can be easily added to existing meta-learning based FSL methods. We validate our approach, LastShot (Learning with A Strong Teacher for few-SHOT learning), in combinations with many representative meta-learning methods. On several benchmark datasets including miniImageNet and tieredImageNet, our approach leads to a notable improvement across a variety of tasks. More importantly, with our approach, meta-learning based FSL methods can consistently outperform non-meta-learning based methods at different numbers of shots, even in many-shot settings, greatly strengthening their applicability.

Index Terms:
Few-Shot Learning, Meta-Learning, Knowledge Distillation

1 Introduction

The intriguing ability of humans to learn new knowledge from few examples and then generalize it to novel environments [1, 2] has led to a community-wide enthusiasm in artificial intelligence towards few-shot learning (FSL), especially in the applications of visual recognition [3, 4], machine translation [5, 6], reinforcement learning [7, 8], etc. (The term “shot” refers to the number of training examples per class.)

Specifically for visual recognition, FSL is generally studied in a two-stage setting [9, 10, 11]. At first, the few-shot learner — a meta-model that aims to generate a classifier given few-shot examples — is provided with a large number of training examples collected from a set of base classes. After learning from these data, the few-shot learner then receives a small number of training examples for a set of novel classes that it needs to generate a classifier to recognize. The quality of the few-shot learner is measured by the generated classifier’s accuracy on recognizing these novel classes. By default, the two sets of classes are disjoint but related (e.g., both about animal species). Thus, to perform well in FSL, the few-shot learner must efficiently re-use what it learned (e.g., features) from the base classes for the novel classes.

[Figure 1: panels (a) miniImageNet and (b) tieredImageNet.]

Figure 1: Comparison of our LastShot to existing methods. We conduct 5-way (i.e., 5-class) classification experiments on miniImageNet [10] and tieredImageNet [12] and report the accuracy under different numbers of shots. We consider two baselines: ProtoNet [13] is a representative meta-learning based approach; PT-EMB is a nearest centroid classifier whose features are learned by directly training a multi-class classifier using all the base class data. ProtoNet outperforms PT-EMB in the 1-shot setting but falls behind in the others. Our proposed LastShot improves ProtoNet in all settings. See section 4 for details.

Meta-learning, or learning to learn, is arguably the most popular approach to tackle FSL [7, 14, 15, 13, 16]. The core idea is to simulate the few-shot scenario of learning with novel classes using the base class data. That is, instead of using the base class data in a conventional way just like training a standard multi-class classifier [17], meta-learning approaches sample multiple few-shot tasks from the base class data and use them to meta-train the few-shot learner to excel in the scenario it will encounter in receiving novel class data, even though the base class data are indeed many-shot. In other words, what the few-shot learner learns from the base class data is the ability to generate a classifier using few-shot data.

To implement meta-learning for FSL, most existing works follow [10] to formulate each simulated few-shot task by a support set and a query set whose data are sampled from the same subset of base classes. The support set is the few-shot training data that is used by the few-shot learner to generate the classifier; the query set is used to evaluate the resulting classifier’s quality. In essence, the classification loss on the query set is the main supervision that is used to meta-train the few-shot learner for how to learn from the support set.

In this paper, we argue that the classification loss on the query set is not an efficient form of supervision, especially when the query set size is limited. Concretely, in meta-training the few-shot learner, one usually constructs a mini-batch by sampling one or several few-shot tasks. Thus, the limited query set size could result from the memory constraint or a large support set that consumes most of the space of a mini-batch. (Conventionally, we sample one or several few-shot tasks, each including a support set and a query set, whose sizes must fit into the mini-batch size constrained by memory. Thus, when the support set gets larger, either due to an increasing number of classes or shots, the query set shrinks. In extreme cases, each class may have just one query example.) Figure 1 gives an illustration, in which we experimented with different numbers of shots: the larger the shot number for the support set, the smaller the query set size. We show that, even in the standard 5-way 5-shot FSL setting, a meta-learning method (i.e., ProtoNet [13]) can be outperformed by a simple nearest centroid classifier (i.e., PT-EMB) [9, 18] whose features are trained with all the base class data in a conventional, non-meta-learning way. Moreover, the gap between these two methods becomes more pronounced in larger-shot settings where the query set size shrinks. (One may overcome this by processing a few-shot task using multiple mini-batches, aggregating the gradients, and updating the model at once, but this would increase the meta-training time and may involve additional hyper-parameter tuning.) While one may argue that meta-learning should just be used in few-shot settings, the fact that it is effective only in a narrow shot range greatly limits its applicability.

The above issue motivates us to re-visit another implementation of meta-learning, which is to directly teach the few-shot learner to generate a strong classifier like one trained using many-shot data [19, 20, 16, 21]. In this way, the query set in a simulated few-shot task is replaced by a “target” classifier pre-trained using the (entire) base class data. (Here, a target classifier means the classifier that we hope a few-shot learner can generate given few-shot examples. One choice, according to [19, 20, 16, 21], is the classifier trained on ample labeled data.) The loss for meta-training the few-shot learner is instantiated by a distance measure between the generated classifier and the target classifier in the parameter space. This implementation bypasses the limit of query set sizes. However, its applicability is limited by another factor: the lack of a suitable distance metric in the high-dimensional parameter space. Therefore, prior works pre-train a feature extractor in a non-meta-learning way and constrain the few-shot learner to generate only the last fully-connected layer.

In this paper, we propose a novel meta-learning approach for FSL that combines the advantages of these two implementations. At a high level, we follow the notion of constructing target classifiers to teach the few-shot learner, which bypasses the limit of query set sizes. However, instead of meta-training the few-shot learner to generate a classifier that is exactly like the target classifier in the parameter space, we meta-train the few-shot learner to generate a classifier that predicts like the target classifier on test (i.e., query) examples. That is, in our approach, a simulated few-shot task contains both the target classifier and the query set besides the support set. Given the support set, our approach encourages the few-shot learner to generate a classifier whose predictions (e.g., predicted logits or class probabilities) on the query examples are similar to those of the target classifier. By using the difference in predictions rather than in the parameter space to supervise the few-shot learner, our approach enables the few-shot learner to learn to generate the entire classifier (i.e., the feature extractor plus the final fully-connected layer). Moreover, measuring the difference in predictions disentangles the model architectures used by the target classifier, the few-shot learner, and the generated classifier, making our approach a plug-and-play meta-learning objective that is compatible with most existing meta-learning based FSL methods. Even if the query examples are limited, the few-shot learner can receive extra, more informative supervision beyond the one-hot query example labels: how a stronger target classifier makes predictions on them.

We investigate multiple ways of constructing the target classifiers. One intuitive approach is to train a multi-class classifier using the entire base class data [17]. However, since a simulated few-shot task usually contains only a subset of base classes (e.g., 5 classes from the 64 base classes [10]), the prediction of this target classifier must be properly filtered and calibrated. We therefore propose to leverage this pre-trained classifier’s feature extractor to (a) construct a target nearest centroid classifier [9] or (b) dynamically train a target logistic regression, both using all data from the subset of base classes. Besides, we further investigate multiple ways of querying the target classifiers’ predictions during meta-training. We found that imposing variations on the query data (e.g., resizing the images to be larger or adding noise) improves the performance, which is reminiscent of the recent studies in [22]. During meta-testing on novel classes, we do not impose any variations to ensure a fair comparison to existing methods. We name our approach LastShot: Learning with A Strong Teacher for few-SHOT learning.

We validate LastShot in combination with representative FSL methods [13, 23, 24, 25] on multiple benchmarks, including miniImageNet [10], tieredImageNet [26], CIFAR-FS [27], FC-100 [28], and CUB [29]. LastShot consistently improves these methods and achieves state-of-the-art accuracy. We further study LastShot under varying numbers of shots in the support set. LastShot shows highly robust performance, especially when the shot number grows larger and existing meta-learning based methods are outperformed by simple nearest centroid approaches [9] (see Figure 1). The code is available at https://github.com/Han-Jia/LastShot.

The contributions of this work are as follows:

  • We propose a novel meta-training objective that encourages the few-shot learner to generate classifiers that perform like the corresponding target strong classifiers. Our proposed objective is a plug-and-play term that can be easily added to existing meta-learning based FSL methods.

  • We present an efficient way to construct the target strong classifier. While we meta-train the few-shot learner with few-shot tasks, we use the strong classifiers trained in a non-meta-learning way with ample examples to supervise the meta-training process.

  • We conduct extensive experiments to validate our approach, which includes multiple benchmark datasets and multiple settings. Our LastShot can consistently outperform conventional meta-learning and non-meta-learning based methods across a range of shot numbers.

1.1 Comparisons to related work

On the surface, our approach is reminiscent of knowledge distillation [30]. However, there are several notable differences from other FSL methods that are based on knowledge distillation [31, 32, 33, 34]. For instance, Tian et al. [34] studied how distillation with self-ensemble improves the pre-trained features for the downstream FSL task. In contrast, we focus on how to fundamentally improve meta-learning based FSL methods by leveraging distillation in the meta-training process. That is, our approach is complementary to their efforts, as we apply distillation at a different training stage (meta-training vs. feature pre-training; nowadays, nearly all meta-learning based FSL methods are initialized with pre-trained features learned in a conventional way from the base class data [19, 23, 35, 34], and our approach is built upon them). Wang et al. [33] studied how to generate extra support set examples such that the classifier generated by the few-shot learner could perform similarly to the target classifier. In contrast, we do not introduce the extra data generation task, making our method more easily applicable. We also note that our method is different from self-distillation [36, 37, 38, 39]: our teacher is a classifier trained with ample data, while our student is a few-shot learner. Please see section 2 for more details and discussions.

2 Related Work

2.1 Few-shot learning (FSL)

FSL generally focuses on the $C$-way classification scenario, in which the few-shot learner receives few labeled examples for $C$ new classes and needs to generate a classifier to recognize them ($C$ is usually set between 5 and 30). FSL can be studied roughly from three aspects: data, models, and algorithms [40]. We describe non-meta-learning based FSL methods in this subsection, and describe meta-learning based methods in the next subsection.

From the aspect of data, FSL can be resolved if one has sufficient data for the $C$ novel classes. Data augmentation approaches hallucinate additional examples of the novel classes based on their few examples [41, 42, 43]. The hallucinated examples are used together with the original ones to train a better classifier for the novel classes. With more training examples, the model is less prone to over-fitting.

From the aspect of models, FSL can be resolved if one can extract features that faithfully capture the semantic and discriminative information of each example to support nearest-neighbor classifiers. Feature or metric learning approaches learn a feature extractor or metric that encourages examples from the same class to be close and examples from different classes to be far apart [18, 44, 45]. The resulting features or metrics are used in a linear or nearest-neighbor classifier that is trained on the examples of the novel classes [46, 47, 9]. Another direction to resolve FSL from the aspect of models is to construct generative models, which are learned to generate visual content conditioned on stochastic binary variables that indicate object identities. Recognizing novel classes involves adding new variables to the model and estimating the parameters associated with them based on few labeled data [48, 49, 50].

From the aspect of algorithms, one could alter the optimization strategy to search for a better model given limited data. Hypothesis transfer reuses a well-trained model from a related dataset. Through fine-tuning it, we can leverage the knowledge learned from the related dataset to avoid over-fitting to few-shot data [11, 51]. [52, 53, 54] proposed special regularization strategies used during fine-tuning.

2.2 Meta-learning for FSL

Here we mainly discuss how meta-learning helps FSL. Meta-learning constructs simulated few-shot tasks, called episodes [10, 15], from the base class data to mimic future few-shot tasks, and trains a few-shot learner to maximize the classification accuracy on those episodes. It is expected that the few-shot learner generalizes to few-shot tasks with novel classes. Various strategies have been investigated to implement the few-shot learner. For example, based on a meta-trained feature extractor, the few-shot learner classifies the query instances with (adapted) embeddings [10, 13, 3, 28, 55, 56, 57]. Based on a meta-trained model initialization [7, 58, 59, 60, 61, 62], the few-shot learner can adapt it into a good classifier for each few-shot task with a small number of gradient descent steps. The optimization strategy for a few-shot learner to fit the few-shot examples can also be meta-trained and shared across episodes [63, 64, 65], e.g., the learning rate [66, 15]. An image generator could also be meta-trained to augment the support set [67]. We note that meta-learning can explicitly take characteristics of few-shot tasks into account. For instance, the number of novel classes $C$ and the number of labeled examples per class (i.e., shots) can be specified in the episodes.

Meta-learning based FSL has been applied in various fields, such as unsupervised learning [68], human motion prediction [20], domain adaptation [69], image generation [70], and object detection [71, 72]. Some novel directions have also been explored recently. For example, [73, 74] leveraged the graph propagation to improve meta-learning; [75, 76] constructed a generalized FSL objective towards adaptive calibration strategies between base and novel classes; [77] investigated specialized data augmentations; [78] studied how to identify invariant features in meta-learning. Auxiliary self-supervised tasks are also incorporated to improve the generalization ability of the resulting few-shot learners or classifiers [79, 80, 81].

Most of the meta-learning approaches rely on the classification loss on the query sets to drive the meta-training of the few-shot learner. [16, 21, 82] were the first to propose the notion of constructing target classifiers to guide few-shot learners. Specifically, they measured the Euclidean distance between the target classifier and the classifier generated by the meta-learner as the loss. However, due to the large number of parameters, they only managed to meta-train the few-shot learner to generate the linear classifiers (i.e., the final fully-connected layers).

Some recent methods emphasize the importance of pre-trained features [44, 64, 23, 19, 34, 35]. In particular, the parameters of the backbone model are initialized by optimizing a classification objective over the base class data, which results in more discriminative features. Fine-tuning over the backbone leads to strong FSL baselines [11, 51]. The influence of the number of shots (e.g., between meta-training and meta-testing) is discussed in [83, 84, 28].

We empirically find that the effectiveness of meta-learning diminishes sharply with increasing shots. Specifically, compared to simple classifiers built upon the pre-trained embeddings, episodic meta-training brings little further improvement. We propose a plug-and-play module, LastShot, for meta-learning based FSL to encourage the few-shot learner to generate classifiers that perform like strong classifiers. This helps the resulting few-shot learner outperform other methods across different shot numbers.

2.3 Knowledge distillation

As mentioned in subsection 1.1, our approach is reminiscent of knowledge distillation [30] as we leverage target classifiers trained on ample examples to supervise the meta-training of the few-shot learner. In this subsection, we therefore review knowledge distillation and contrast our usage of it to other existing methods for few-shot learning [31, 32, 33, 34].

Knowledge distillation (KD) [30] is a training strategy that leverages the dark knowledge from one (teacher) model to facilitate the learning progress of another (student) model. One popular way to extract the dark knowledge is by the instance-wise prediction, e.g., the posterior probability on a data instance [30, 85, 86], which provides the relative comparison among classes, in contrast to the conventional one-hot label. By matching the predictions of a “student” model to those of a “teacher” model, the student model obtains richer learning signals and becomes more generalizable. The dark knowledge for KD can be in the form of soft labels [30], hidden layer activation [87, 88], and comparison relationships [89]. Knowledge distillation has also been exploited for model interpretability [90] and incremental learning [91, 92], etc.

Besides [34, 33] mentioned in subsection 1.1, KD was also employed in two recent works of FSL [31, 32], but again based on very different motivations and manners from our LastShot. [32] applied KD mainly to construct a portable few-shot learner; [31] applied self-supervised learning [93, 94] to train robust features and use them to guide the few-shot learner. Both of them only studied a single FSL method, e.g., [7]. In contrast, our approach is more general, compatible with many meta-learning based FSL methods and requiring no self-supervised learning steps. Wang et al. [33] investigated constructing a data generator to create extra support set examples; the generator is learned via KD, from a many-shot target classifier. In LastShot, we take advantage of a plug-and-play loss term to distill knowledge from the target classifier more easily.

Self-distillation is a special case of KD which distills knowledge from the model itself, i.e., the teacher can be the previous generation of the model along the training progress [36, 95, 37]. Tian et al. [34] applied self-distillation in the model pre-training stage to obtain more discriminative features. Our LastShot works in an orthogonal direction to directly improve the meta-learning based FSL methods. Different from self-distillation, the target classifier in LastShot is not the previous few-shot learner or feature extractor, but a stronger classifier trained with ample base class data.

3 Our Approach: LastShot

In this section, we first introduce the notation of few-shot learning (FSL) for visual recognition and highlight the overall framework of meta-learning for FSL. We then introduce our approach LastShot.

3.1 Notation

In FSL, a few-shot learner is first provided with a training set $\mathcal{D}_{\text{base}}=\{({\bm{x}}_{1},y_{1}),\dots,({\bm{x}}_{N},y_{N})\}$ that contains $N$ labeled images from $B$ base classes; i.e., $y_{n}\in\{1,\cdots,B\}$. Here, each class is supposed to have ample training examples. The few-shot learner is then provided with a support set $\mathcal{D}_{\text{support}}^{\prime}$ of labeled images from $C$ novel classes $\{B+1,\cdots,B+C\}$, in which each novel class has $K$ examples. The goal of the few-shot learner is to generate a classifier that can accurately classify among the $C$ novel classes. This FSL setting is referred to as the $C$-way $K$-shot setting.

3.2 Meta-learning for few-shot learning

The core concept of meta-learning is to use the base class data $\mathcal{D}_{\text{base}}$ to simulate the FSL scenario for meta-training the few-shot learner, before it receives the few-shot data from the novel classes. Namely, meta-learning samples multiple FSL tasks $\mathcal{T}$ from $\mathcal{D}_{\text{base}}$, each of which contains a $C$-way $K$-shot support set $\mathcal{D}_{\text{support}}$ whose class labels are from $\{1,\cdots,B\}$. Meta-learning then inputs each support set to the few-shot learner $\mathcal{A}$, whose goal is to output/generate a classifier that can classify the $C$ input classes well. To evaluate the quality of $\mathcal{A}$, meta-learning approaches usually sample a corresponding query set $\mathcal{D}_{\text{query}}$ for each $\mathcal{D}_{\text{support}}$, whose examples are from the same $C$ classes. The quality of $\mathcal{A}$ is measured by the outputted classifier’s accuracy on $\mathcal{D}_{\text{query}}$.

Putting all these notations together, we can now formulate the meta-learning problem as follows

$$\min_{\mathcal{A}}\sum_{\mathcal{T}\sim\mathcal{D}_{\text{base}}}\;\sum_{({\bm{x}},y)\in\mathcal{D}_{\text{query}}}\ell(\mathcal{A}(\mathcal{D}_{\text{support}})({\bm{x}}),y), \qquad (1)$$

where each $\mathcal{T}$ is a tuple $(\mathcal{D}_{\text{support}},\mathcal{D}_{\text{query}})$; $\mathcal{A}(\mathcal{D}_{\text{support}})$ is the $C$-way classifier outputted by $\mathcal{A}$, after taking the support set $\mathcal{D}_{\text{support}}$ as input; and $\ell$ is a loss function on the prediction, e.g., the cross-entropy loss. As shown in Equation 1, the query set $\mathcal{D}_{\text{query}}$ is the major supervision that drives the meta-training of the few-shot learner $\mathcal{A}$.

After $\mathcal{A}$ is properly meta-trained, we then apply it to the true $\mathcal{D}_{\text{support}}^{\prime}$ of the novel classes to construct the classifier.
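To make the episodic objective in Equation 1 concrete, the following is a minimal PyTorch-style sketch of one meta-training step. Here, `sample_task` and `learner` are hypothetical stand-ins for an episode sampler over $\mathcal{D}_{\text{base}}$ and for the few-shot learner $\mathcal{A}$; they are not part of any particular method.

```python
import torch.nn.functional as F

def meta_train_step(learner, optimizer, sample_task):
    # `sample_task()` is assumed to return one episode: a labeled support set
    # and a labeled query set drawn from the same C base classes.
    (xs, ys), (xq, yq) = sample_task()
    classifier = learner(xs, ys)          # A(D_support): the generated C-way classifier
    logits = classifier(xq)               # predictions on the query examples
    loss = F.cross_entropy(logits, yq)    # ell(A(D_support)(x), y) in Equation (1)
    optimizer.zero_grad()                 # `optimizer` is assumed to cover learner.parameters()
    loss.backward()
    optimizer.step()
    return loss.item()
```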

3.2.1 Exemplar few-shot learners

We review several representative few-shot learners. In model-agnostic meta-learning (MAML) [7], the few-shot learner $\mathcal{A}$ is modeled by a neural network classifier $f_{\bm{\theta}}$ with weights $\bm{\theta}$. Denoting by $\mathcal{L}(\bm{\theta})$ the differentiable loss of $f_{\bm{\theta}}$ computed on the support set $\mathcal{D}_{\text{support}}$, $\mathcal{A}(\mathcal{D}_{\text{support}})$ is formulated, in its simplest form, as an updated neural network $f_{\hat{\bm{\theta}}}$ with weights $\hat{\bm{\theta}}$:

$$\hat{\bm{\theta}}=\bm{\theta}-\alpha\times\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta}), \qquad (2)$$

where $\alpha$ is a step size. That is, MAML performs a single (or a few) gradient descent update(s) according to $\mathcal{L}(\bm{\theta})$ to obtain the outputted classifier. The meta-learning problem in Equation 1 thus aims to find a good initialization $\bm{\theta}$ such that the neural network $f_{\hat{\bm{\theta}}}$ can perform well on $\mathcal{D}_{\text{query}}$ after only a single (or a few) update(s). Many meta-learning approaches are built on MAML [58, 96, 66, 64, 97, 98, 24, 25].
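To illustrate Equation 2, here is a minimal PyTorch sketch of one differentiable inner-loop update; `support_loss_fn` is a hypothetical helper that computes $\mathcal{L}(\bm{\theta})$ on the support set, and the returned "fast weights" would be consumed by a functional forward pass on the query set.

```python
import torch

def maml_inner_update(model, support_loss_fn, support_batch, alpha=0.01):
    # Equation (2): theta_hat = theta - alpha * grad_theta L(theta).
    # alpha=0.01 is an illustrative placeholder step size.
    loss = support_loss_fn(model, support_batch)
    grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    # create_graph=True keeps the update differentiable so the outer (meta)
    # objective of Equation (1) can back-propagate through it.
    fast_weights = {name: p - alpha * g
                    for (name, p), g in zip(model.named_parameters(), grads)}
    return fast_weights  # weights theta_hat for the adapted classifier
```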

Another popular way to model $\mathcal{A}$ is via a feature extractor $f_{\bm{\theta}}$. For example, in Prototypical Network (ProtoNet) [13], $\mathcal{A}(\mathcal{D}_{\text{support}})({\bm{x}})$ is formulated as

$$\hat{y}=\operatorname{arg\,min}_{c}\left\|f_{\bm{\theta}}({\bm{x}})-\sum_{{\bm{x}}^{\prime}\in\mathcal{D}_{\text{support},c}}\frac{f_{\bm{\theta}}({\bm{x}}^{\prime})}{|\mathcal{D}_{\text{support},c}|}\right\|_{2}^{2}, \qquad (3)$$

where $\mathcal{D}_{\text{support},c}$ is the subset of $\mathcal{D}_{\text{support}}$ that contains examples of class $c$. That is, ProtoNet performs nearest centroid classification, and Equation 3 aims to learn a good feature extractor. Similar approaches are [82, 3, 10, 99, 23].
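As a concrete illustration of Equation 3, the sketch below computes nearest-centroid logits in PyTorch; the embedding network `f_theta` and the tensor shapes are assumptions made for illustration.

```python
import torch

def protonet_logits(f_theta, support_x, support_y, query_x, n_way):
    # Equation (3): nearest centroid classification in the embedding space.
    # support_y is assumed to hold labels in {0, ..., n_way - 1}.
    z_s = f_theta(support_x)                                   # [N_support, d]
    z_q = f_theta(query_x)                                     # [N_query, d]
    protos = torch.stack([z_s[support_y == c].mean(dim=0)      # class centroids
                          for c in range(n_way)])              # [n_way, d]
    sq_dists = torch.cdist(z_q, protos) ** 2                   # squared Euclidean distances
    return -sq_dists      # arg max of these logits equals the arg min in Equation (3)
```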

3.2.2 Meta-learning with the target classifier

In Equation 1, the query set $\mathcal{D}_{\text{query}}$ is the main supervision for meta-training the few-shot learner $\mathcal{A}$. That is, $\mathcal{D}_{\text{query}}$ is used to measure the loss of the classifier $\mathcal{A}(\mathcal{D}_{\text{support}})$ outputted by $\mathcal{A}$. Such a loss is then used to derive gradients to update $\mathcal{A}$.

Instead of using the query set $\mathcal{D}_{\text{query}}$ to measure the quality of $\mathcal{A}(\mathcal{D}_{\text{support}})$ in meta-training the few-shot learner $\mathcal{A}$, [16, 21] constructed the “target” classifier $h^{\star}$, which is the $C$-way classifier trained with ample labeled examples of the same classes as in $\mathcal{D}_{\text{support}}$. The target classifier $h^{\star}$ can be seen as the classifier that we hope the few-shot learner $\mathcal{A}$ will output given $\mathcal{D}_{\text{support}}$: after all, what we ask the few-shot learner to do is to generate a classifier as if it were trained with ample examples. To this end, one can measure the loss of $\mathcal{A}(\mathcal{D}_{\text{support}})$ by its difference from the target classifier $h^{\star}$ in the parameter space, and use the loss to meta-train the few-shot learner $\mathcal{A}$. We note that computing the difference in the parameter space does not require the query set $\mathcal{D}_{\text{query}}$. For instance, in [16], a simulated FSL task $\mathcal{T}\sim\mathcal{D}_{\text{base}}$ is formulated as a tuple $(\mathcal{D}_{\text{support}},h^{\star})$, in which the target classifier $h^{\star}$ is obtained by training with all the data in $\mathcal{D}_{\text{base}}$ that are from the same classes as in $\mathcal{D}_{\text{support}}$. Accordingly, Equation 1 for meta-training $\mathcal{A}$ can be reformulated as follows

$$\min_{\mathcal{A}}\sum_{\mathcal{T}\sim\mathcal{D}_{\text{base}}}\;\ell_{\text{model}}(\mathcal{A}(\mathcal{D}_{\text{support}}),h^{\star}), \qquad (4)$$

where $\ell_{\text{model}}$ is a distance measure (e.g., Euclidean distance) between classifiers in the parameter space.

One advantage of Equation 4 over Equation 1, or of a few-shot task $(\mathcal{D}_{\text{support}},h^{\star})$ over a few-shot task $(\mathcal{D}_{\text{support}},\mathcal{D}_{\text{query}})$, is the effective amount of supervised signals associated with each support set $\mathcal{D}_{\text{support}}$. Unlike $(\mathcal{D}_{\text{support}},\mathcal{D}_{\text{query}})$, which is limited by the query set size $|\mathcal{D}_{\text{query}}|$, $h^{\star}$ can be trained with all the data in $\mathcal{D}_{\text{base}}$ (from the same classes as in $\mathcal{D}_{\text{support}}$) and essentially provides more information for meta-training.

[Figure 2]
Figure 2: An illustration of LastShot for few-shot classification. Left: the few-shot task $\mathcal{T}$ in LastShot, which contains a support set $\mathcal{D}_{\text{support}}$, a query set $\mathcal{D}_{\text{query}}$, and a target classifier $h^{\star}$. Middle: the learning objective of LastShot, in which the generated classifier $\mathcal{A}(\mathcal{D}_{\text{support}})$ compares its prediction on a query example ${\bm{x}}$ with the target classifier’s prediction $h^{\star}({\bm{x}})$ to receive supervision for updates. Right: our proposed way to construct the $C$-way target classifier $h^{\star}$ for teaching $\mathcal{A}$. Based on the pre-extracted features of the base class data, LastShot can build a strong nearest centroid classifier (NC) or a $C$-way logistic regression (LR) efficiently.

Denoting by $\bm{\phi}$ and $\bm{\phi}^{\star}$ the parameters of $h=\mathcal{A}(\mathcal{D}_{\text{support}})$ and $h^{\star}$, respectively, [16, 21, 20] computed the squared Euclidean distance $\ell_{\text{model}}(h,h^{\star})=\|\bm{\phi}-\bm{\phi}^{\star}\|_{2}^{2}$ to drive meta-training. Doing so, however, has two drawbacks. First, the Euclidean distance may not be suitable in the high-dimensional parameter space. Indeed, [16, 21] have limited their approaches to only meta-train the last fully-connected layer of a neural network. Second, the Euclidean distance requires the two classifiers to have exactly the same architecture, meaning that we have to re-construct $h^{\star}$ for different meta-learning algorithms and models.

3.3 LastShot: Learning with a strong teacher

Let us now introduce our approach. We define each simulated FSL task $\mathcal{T}=(\mathcal{D}_{\text{support}},\mathcal{D}_{\text{query}},h^{\star})\sim\mathcal{D}_{\text{base}}$ as a tuple that contains both the query set and the target classifier besides the support set. The target classifier $h^{\star}$ is constructed similarly as described in subsubsection 3.2.2, i.e., using all the data in $\mathcal{D}_{\text{base}}$ that are from the same classes as in $\mathcal{D}_{\text{support}}$. We will provide more details in subsection 3.4. We propose a new meta-learning objective to meta-train the few-shot learner $\mathcal{A}$ with this new definition of $\mathcal{T}$,

$$\min_{\mathcal{A}}\sum_{\mathcal{T}\sim\mathcal{D}_{\text{base}}}\;\sum_{({\bm{x}},y)\in\mathcal{D}_{\text{query}}}\ell_{\textsc{LastShot}}(\mathcal{A}(\mathcal{D}_{\text{support}})({\bm{x}}),h^{\star}({\bm{x}}))+\lambda\times\ell(\mathcal{A}(\mathcal{D}_{\text{support}})({\bm{x}}),y). \qquad (5)$$

This objective is as general as Equation 1 but is able to benefit from the target classifiers $h^{\star}$. Specifically, Equation 5 involves two loss terms on the query data. One is exactly the same as in Equation 1, comparing $\mathcal{A}(\mathcal{D}_{\text{support}})({\bm{x}})$ to $y$. The other term compares $\mathcal{A}(\mathcal{D}_{\text{support}})({\bm{x}})$ to $h^{\star}({\bm{x}})$ via a loss function $\ell_{\textsc{LastShot}}$, which can be interpreted as a relaxed version of Equation 4: it matches the predictions of the two classifiers, not their parameters. This relaxed version effectively overcomes the two drawbacks of Equation 4. On the one hand, we are now able to employ a stronger target classifier (e.g., an entire neural network model) to teach the few-shot learner, since the comparison between them is in the lower-dimensional space governed by the number of ways $C$, not by the number of parameters of $h^{\star}$. On the other hand, comparing the predictions allows us to disentangle the model architectures of the few-shot learner, its generated classifier, and $h^{\star}$. In other words, the same $h^{\star}$ associated with each $\mathcal{D}_{\text{support}}$ can now be applied to train different kinds of few-shot learners, making Equation 5 a widely applicable objective for meta-learning. Indeed, wherever Equation 1 can be applied, Equation 5 should be equally applicable.

We realize the loss $\ell_{\textsc{LastShot}}$ in Equation 5 by the Kullback–Leibler (KL) divergence, inspired by [38]. That is, instead of comparing the hard one-hot predictions, we compare the soft predictions that encode the target classifier’s confidence as well as the learned semantic relationship among classes. More specifically, we take the logits $\bm{\psi}({\bm{x}})\in\mathbb{R}^{C}$ and $\bm{\psi}^{\star}({\bm{x}})\in\mathbb{R}^{C}$ produced by $h=\mathcal{A}(\mathcal{D}_{\text{support}})$ and $h^{\star}$ on a query example ${\bm{x}}$, smooth the latter with a temperature $\tau>0$ [100], and normalize them with the softmax function before computing the KL divergence,

$$\ell_{\textsc{LastShot}}(\mathcal{A}(\mathcal{D}_{\text{support}})({\bm{x}}),h^{\star}({\bm{x}}))=\mathbf{KL}\left(\mathbf{softmax}\!\left(\frac{\bm{\psi}^{\star}({\bm{x}})}{\tau}\right)\,\Big\|\,\mathbf{softmax}\left(\bm{\psi}({\bm{x}})\right)\right). \qquad (6)$$

We note that the KL divergence is not the only way to realize $\ell_{\textsc{LastShot}}$. One can apply other measures that characterize the difference of predictions between two classifiers.
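For concreteness, Equation 6 can be implemented in a few lines; the sketch below assumes logit tensors of shape [batch, C], and the temperature value is an illustrative placeholder rather than a tuned one.

```python
import torch.nn.functional as F

def lastshot_loss(student_logits, teacher_logits, tau=4.0):
    # Equation (6): KL( softmax(psi_star / tau) || softmax(psi) ).
    # Only the teacher logits are smoothed by the temperature, as in the text.
    teacher_prob = F.softmax(teacher_logits / tau, dim=-1)
    student_log_prob = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_log_prob, teacher_prob, reduction='batchmean')
```

In practice, this term is added to the ordinary cross-entropy on the query labels and weighted by the coefficient $\lambda$, as in Equation 5.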

3.4 Construction of the target classifier

So far, we have assumed that the target classifier $h^{\star}$ is well-specified and can be easily prepared for each sampled FSL task $\mathcal{T}$. In this subsection, we investigate several ways to construct it. We note that a target classifier $h^{\star}$ for each $\mathcal{T}$ should be a $C$-way classifier; the $C$ classes are the same as those contained in $\mathcal{D}_{\text{support}}$ and $\mathcal{D}_{\text{query}}$. That is, $h^{\star}$ differs across the simulated FSL tasks.

3.4.1 Masking the $B$-way base classifier

Given the $B$-class many-shot training set $\mathcal{D}_{\text{base}}$, one intuitive way to construct $h^{\star}$ is to directly train a $B$-class classifier. However, since most FSL settings have $C$ (i.e., the number of ways in an FSL task $\mathcal{T}$) much smaller than $B$, we must mask out the logits of $h^{\star}$ that are not from the $C$ classes in $\mathcal{T}$, followed by calibration. In our experiments, we found that such a mismatch in class numbers does degrade the performance of LastShot (please see subsection 4.6). We thus develop two simple, efficient, yet effective alternatives to construct $h^{\star}$.

3.4.2 Nearest centroid classifiers (NC)

Let us denote by $f^{\ddagger}$ the feature extractor of the learned $B$-class classifier and by $\mathcal{C}_{\mathcal{T}}$ the class label space of $\mathcal{T}$. Our first idea is to build a $C$-way nearest centroid classifier [9, 18]. Denoting by $\mathcal{D}_{\text{base},c}$ the subset of $\mathcal{D}_{\text{base}}$ of class $c$, we can pre-compute the class mean $\bm{\mu}_{c}$ for each class $c$:

$$\bm{\mu}_{c}=\sum_{{\bm{x}}^{\prime}\in\mathcal{D}_{\text{base},c}}\frac{f^{\ddagger}({\bm{x}}^{\prime})}{|\mathcal{D}_{\text{base},c}|}. \qquad (7)$$

Then, given a $C$-way few-shot task $\mathcal{T}$, we can retrieve $\bm{\mu}_{c}$ for each class $c\in\mathcal{C}_{\mathcal{T}}$ and compute for ${\bm{x}}\in\mathcal{D}_{\text{query}}$ the logit of class $c$:

$$\psi^{\star}_{c}({\bm{x}})=-\left\|f^{\ddagger}({\bm{x}})-\bm{\mu}_{c}\right\|_{2}^{2}. \qquad (8)$$

That is, the logit vector $\bm{\psi}^{\star}({\bm{x}})\in\mathbb{R}^{C}$ of the target classifier for ${\bm{x}}$ is the negative squared distance between $f^{\ddagger}({\bm{x}})$ and the $C$ class means, which is then used in Equation 6. We note that all the features $f^{\ddagger}({\bm{x}}),\forall{\bm{x}}\in\mathcal{D}_{\text{base}}$, can be pre-extracted.
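A minimal sketch of the NC teacher of Equations 7 and 8 is given below, assuming the base-class features and per-class means have already been pre-extracted and cached; `class_means` and `task_classes` are hypothetical variable names for these cached quantities.

```python
import torch

def nc_target_logits(f_query, class_means, task_classes):
    # Equations (7)-(8): negative squared distance to pre-computed class means.
    # `class_means` is assumed to be a [B, d] tensor of per-class means of the
    # cached base-class features; `task_classes` lists the C classes of the episode.
    mus = class_means[task_classes]              # [C, d], retrieved for the current task
    sq_dists = torch.cdist(f_query, mus) ** 2    # [N_query, C]
    return -sq_dists                             # psi_star(x) used in Equation (6)
```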

3.4.3 Logistic regression (LR)

Our second way is to train a $C$-class logistic regression using the features pre-extracted by $f^{\ddagger}$. Specifically, given a $C$-way few-shot task $\mathcal{T}$, we gather the pre-extracted features $f^{\ddagger}({\bm{x}})$ for data in the subset $\cup_{c\in\mathcal{C}_{\mathcal{T}}}\mathcal{D}_{\text{base},c}$ of $\mathcal{D}_{\text{base}}$ and train a $C$-class logistic regression. The logit vector $\bm{\psi}^{\star}({\bm{x}})$ of the target classifier for ${\bm{x}}$ is the output of the LR model. Training the linear LR is efficient given the extracted features.
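One possible realization with scikit-learn is sketched below, assuming the base-class features are cached as NumPy arrays. The per-class instance cap mirrors the "at most 50 instances per class" choice mentioned in subsubsection 4.3.3, and the LogisticRegression hyper-parameters are illustrative defaults rather than tuned values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lr_target_classifier(base_features, base_labels, task_classes, max_per_class=50):
    # Fit the C-way logistic-regression teacher on pre-extracted base-class features.
    xs, ys = [], []
    for new_label, c in enumerate(task_classes):
        idx = np.where(base_labels == c)[0][:max_per_class]
        xs.append(base_features[idx])
        ys.append(np.full(len(idx), new_label))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.concatenate(xs), np.concatenate(ys))
    return clf   # clf.decision_function(query_features) yields psi_star(x)
```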

3.5 Querying the target classifier in meta-training

Besides different ways of constructing $h^{\star}$, we also investigate how to query $h^{\star}$; i.e., how to obtain $\bm{\psi}^{\star}({\bm{x}})$ for Equation 6. The basic way is to input the query example ${\bm{x}}$ directly. Motivated by recent studies [22, 101], where strengthening the teacher or weakening the student (e.g., adding noise) could further improve the distillation or self-training performance, we explore (a) varying the size of the input image for ${\bm{x}}$ or (b) applying autoaugment [102] to ${\bm{x}}$. We note that both operations are applied only in the meta-training phase.

3.5.1 Varying the input image size

One way to strengthen the teacher target classifier $h^{\star}$, i.e., to increase its prediction accuracy, is to operate on images with a larger size. Specifically, when we apply $f^{\ddagger}$ to extract features to construct $h^{\star}$, we enlarge the input images (e.g., by 140%). We do so also when we query $h^{\star}$; i.e., we enlarge ${\bm{x}}$ before applying $h^{\star}({\bm{x}})$. Namely, the target classifier is constructed with and applied to images of a larger size, resulting in better logits $\bm{\psi}^{\star}({\bm{x}})$ to be matched in Equation 6.

We note that, we enlarge the image size only for constructing and querying the teacher classifier during meta-training. We do not change the image size for the few-shot learner and its generated classifier. Therefore, our method does not raise any concern of unfair comparisons to existing methods.
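The sketch below illustrates this asymmetry: only the teacher receives the enlarged query images, while the few-shot learner and its generated classifier still see the original resolution. The `teacher` callable and the 1.4 factor are illustrative assumptions, the latter echoing the "enlarge by 140%" example above.

```python
import torch.nn.functional as F

def query_teacher_enlarged(teacher, query_images, scale=1.4):
    # Only the teacher sees the resized input (Section 3.5.1).
    h, w = query_images.shape[-2:]
    enlarged = F.interpolate(query_images, size=(int(h * scale), int(w * scale)),
                             mode='bilinear', align_corners=False)
    return teacher(enlarged)   # psi_star(x) computed on the enlarged images
```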

3.5.2 Applying autoaugment [102]

We further consider weakening the student model (i.e., the few-shot learner and its generated classifier) to encourage it to learn more from the teacher. Specifically, we apply autoaugment [102] to each query example ${\bm{x}}$, so that both the teacher $h^{\star}$ and the student $h=\mathcal{A}(\mathcal{D}_{\text{support}})$ make their predictions on a noisy query example. Since the teacher is trained on a larger data set $\mathcal{D}_{\text{base}}$, it suffers less than the student from the noise, creating a larger training signal in Equation 6.

We find that both treatments can improve LastShot on some of the datasets, and strengthening the teacher is more stable. Thus, we mainly consider it in the experiments.

3.6 Evaluation of the learned few-shot learner $\mathcal{A}$

So far in section 3, we have focused on how to meta-train the few-shot learner $\mathcal{A}$ that can output a classifier $h=\mathcal{A}(\mathcal{D}_{\text{support}})$ given the support set $\mathcal{D}_{\text{support}}$ of a few-shot task $\mathcal{T}$. The few-shot task $\mathcal{T}$ during meta-training is sampled/simulated from $\mathcal{D}_{\text{base}}$, and one can flexibly define it as $\mathcal{T}=(\mathcal{D}_{\text{support}},\mathcal{D}_{\text{query}})$, $\mathcal{T}=(\mathcal{D}_{\text{support}},h^{\star})$, or $\mathcal{T}=(\mathcal{D}_{\text{support}},\mathcal{D}_{\text{query}},h^{\star})$ (cf. subsection 3.2, subsubsection 3.2.2, and subsection 3.3, respectively).

During meta-testing, in which the trained few-shot learner $\mathcal{A}$ is to be evaluated on novel classes (cf. subsection 3.1), we follow the standard evaluation protocol (cf. subsection 4.2): each meta-testing task $\mathcal{T}^{\prime}$ is a tuple $(\mathcal{D}_{\text{support}}^{\prime},\mathcal{D}_{\text{query}}^{\prime})$. The accuracy on a task is defined as

$$\frac{1}{|\mathcal{D}_{\text{query}}^{\prime}|}\sum_{({\bm{x}},y)\in\mathcal{D}_{\text{query}}^{\prime}}\mathbf{1}\left[\mathcal{A}(\mathcal{D}_{\text{support}}^{\prime})({\bm{x}})==y\right], \qquad (9)$$

where $\mathbf{1}[\cdot]$ is an indicator function whose value is 1 if the input is true and 0 otherwise. In other words, during meta-testing, we do not need a target classifier for each meta-testing task. We construct the target classifiers only to facilitate meta-training, and this is possible because the base class data $\mathcal{D}_{\text{base}}$ contains sufficient examples.

4 Experiments on Few-Shot Classification

Table I: Statistics of the five benchmark datasets. We list the number of classes and number of images in each split.
meta-train meta-val meta-test
dataset Class Image Class Image Class Image
miniImageNet [10] 64 38,400 16 9,600 20 12,000
tieredImageNet [26] 351 448,695 97 124,261 160 206,209
CUB [29] 100 5,891 50 2,932 50 2,965
CIFAR-FS [27] 64 38,400 16 9,600 20 12,000
FC-100 [28] 60 36,000 20 12,000 20 12,000

4.1 Datasets

We evaluate our approach LastShot using multiple benchmark datasets, including miniImageNet [10], tieredImageNet [26], CUB [29], CIFAR-FS [27], and FC-100 [28]. There are 100 classes in miniImageNet with 600 images per class. Following the split of [15], the meta-training/validation/testing sets contain 64/16/20 (non-overlapping) classes, respectively. In tieredImageNet [26], the class numbers in the three sets are 351/97/160, respectively, again without overlapping. For CUB, we follow the split of [23] and set 100 classes for meta-training. Two 50-class splits from the remaining classes are used for meta-validation and meta-testing. CIFAR-FS [27] and FC-100 [28] are two variants of the CIFAR-100 dataset [103] for few-shot learning but with different partitions. The classes in CIFAR-FS are randomly split into 64/16/20 for meta-training/validation/testing, while FC-100 takes the split of 60/20/20 classes following the super-classes. There are 600 images in each class in CIFAR-100. All images are resized to 84 by 84 for the first three datasets and are resized to 32 by 32 for the latter two CIFAR variants in advance. The detailed statistics are listed in Table I.

4.2 Evaluation protocols

We follow the evaluation in the literature [64, 23, 19, 35], where 10,000 $C$-way $K$-shot tasks are sampled from the meta-testing set. In each task, the query set contains 15 instances per class. Mean accuracy (in %) as well as the 95% confidence interval are reported. For every meta-training task, we also sample 15 instances per class to construct the query set, unless stated otherwise.

4.3 Implementation details

LastShot is a plug-and-play approach that is compatible with most existing meta-learning FSL methods. We apply LastShot to four representative meta-learning methods: ProtoNet [13], FEAT [23], ProtoMAML [24], and MetaOptNet [25]. The first two are embedding-based methods that learn good feature extractors; the latter two are MAML-like methods that learn in a two-stage inner/outer-loop manner.

4.3.1 Model architectures for few-shot classification

We follow [25] to use a ResNet-12 [17] for the feature extractor $f_{\bm{\theta}}$ (cf. subsubsection 3.2.1), which has a wider width and uses the DropBlock module [104] to avoid over-fitting.

4.3.2 Initialization of the feature extractor

Following [9, 23], we initialize the feature extractor $f_{\bm{\theta}}$ used in our few-shot learner and our target classifier using the base class data $\mathcal{D}_{\text{base}}$. Concretely, we learn $f_{\bm{\theta}}$, together with a $B$-way linear classifier $W$ on top (we omit the bias term for brevity), to classify the base classes. For instance, on miniImageNet, $B$ is 64, meaning that we pre-train a 64-class classifier. More specifically, we train the classifier by minimizing the following objective over examples $({\bm{x}}_{i},y_{i})$ in $\mathcal{D}_{\text{base}}$:

$$\min_{W,f_{\bm{\theta}}}\;\sum_{({\bm{x}}_{i},y_{i})\in\mathcal{D}_{\text{base}}}\;\ell(W^{\top}f_{\bm{\theta}}({\bm{x}}_{i}),\;y_{i}). \qquad (10)$$

Here, $\ell(\cdot,\cdot)$ measures the discrepancy between the predicted label and the ground-truth label, and we use the cross-entropy loss. We follow [23] for optimization. For instance, on miniImageNet, Equation 10 is optimized by stochastic gradient descent (SGD) with momentum 0.9, weight decay 0.0005, batch size 128, and initial learning rate 0.1. After each epoch, we validate the quality of $f_{\bm{\theta}}$ on the meta-validation set by sampling 200 1-shot $C$-way tasks, where $C$ equals the number of classes in the meta-validation set (e.g., 16 for miniImageNet). For each task, we use the ProtoNet decision rule in Equation 3 to evaluate the classification performance. Similar procedures are employed for other datasets.
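A minimal sketch of the pre-training loop of Equation 10 under the miniImageNet settings above is shown below; `base_loader`, `linear_W` (assumed to be an nn.Linear from the feature dimension to $B$), and the epoch count are illustrative assumptions, and the per-epoch meta-validation step is omitted.

```python
import torch
import torch.nn.functional as F

def pretrain_backbone(f_theta, linear_W, base_loader, epochs=100):
    # Equation (10): standard B-way classification over D_base.
    # SGD settings follow the miniImageNet values in the text; epochs=100 is a placeholder.
    params = list(f_theta.parameters()) + list(linear_W.parameters())
    optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)
    for _ in range(epochs):
        for x, y in base_loader:
            logits = linear_W(f_theta(x))        # W^T f_theta(x)
            loss = F.cross_entropy(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```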

With the pre-trained $f_{\bm{\theta}}$, we can initialize the few-shot learner in the meta-training phase. For fair comparisons, we use the notation $\dagger$ to denote our re-implemented results of the baseline methods using this pre-training procedure, which usually leads to higher accuracy than the reported results. We note that, FEAT [23] already uses the above procedure, so we compare with the published results directly.

4.3.3 Meta-Learning with LastShot

In our implementation, we do not use the meta-validation set to learn the few-shot learner; i.e., the meta-validation set is only used to select the best hyper-parameters, such as the balance coefficient $\lambda$ in Equation 5 or the temperature $\tau$. For LastShot with LR, we randomly select at most 50 instances per class to construct a linear logistic regression for each few-shot task. To strengthen the teacher, we enlarge images during training and querying the target classifier $h^{\star}$. For miniImageNet, tieredImageNet, and CUB, we enlarge the images to 116 by 116. For CIFAR-FS and FC-100, we enlarge the images to 50 by 50. We note that, we do not enlarge the images for the few-shot learner and its outputted classifier. Thus, our results can be fairly compared to the literature.

4.3.4 Non-meta-learning based FSL methods

The initialized $f_{\bm{\theta}}$ (cf. Equation 10) can be used for few-shot classification directly. For instance, by extracting features from the support set data of the novel classes using $f_{\bm{\theta}}$, we can apply Equation 3 to classify a query example according to the nearest centroid classification rule. We call this method PT-EMB, which stands for pre-trained embedding. We also compare to SimpleShot [9], which employs several simple pre-processing steps to further improve the accuracy.

Table II: Few-shot classification accuracy and 95% confidence interval on miniImageNet with the ResNet-12 backbone. Results are evaluated over 10,000 tasks. † denotes our re-implemented results. Note that the cited results of FEAT [23] are implemented with the same pre-training procedure already.
Setting 1-Shot 5-Shot
ModelRegression [16] 61.94 ± 0.20 76.24 ± 0.14
MetaOptNet [25] 62.64 ± 0.20 78.63 ± 0.14
SimpleShot [9] 65.36 ± 0.20 81.39 ± 0.14
TRAML+ProtoNet [105] 60.31 ± 0.48 77.94 ± 0.57
RFS-Simple [34] 62.02 ± 0.63 79.64 ± 0.44
RFS-Distill [34] 64.82 ± 0.60 82.14 ± 0.43
DSN-MR [106] 64.60 ± 0.72 79.51 ± 0.50
MTL+E3BM [107] 63.80 ± 0.40 80.10 ± 0.30
DeepEMD [35] 65.91 ± 0.82 82.41 ± 0.56
TRAML+AM3 [105] 67.10 ± 0.54 79.54 ± 0.60
ProtoNet† [13] 62.39 ± 0.20 79.74 ± 0.14
+ LastShot (NC) 64.16 ± 0.20 81.23 ± 0.14
+ LastShot (LR) 64.80 ± 0.20 81.65 ± 0.14
ProtoMAML† [24] 62.04 ± 0.21 79.62 ± 0.14
+ LastShot (NC) 63.07 ± 0.20 81.04 ± 0.14
+ LastShot (LR) 63.05 ± 0.20 81.11 ± 0.14
MetaOptNet† [25] 63.21 ± 0.20 79.94 ± 0.14
+ LastShot (NC) 65.07 ± 0.20 80.34 ± 0.14
+ LastShot (LR) 65.08 ± 0.20 80.40 ± 0.14
FEAT [23] 66.78 ± 0.20 82.05 ± 0.14
+ LastShot (NC) 67.33 ± 0.20 82.39 ± 0.14
+ LastShot (LR) 67.35 ± 0.20 82.58 ± 0.14
Table III: Few-shot classification accuracy and 95% confidence interval on tieredImageNet with the ResNet-12 backbone. Results are evaluated over 10,000 tasks. † denotes our re-implemented results. Note that the cited results of FEAT [23] are implemented with the same pre-training procedure already.
Setting 1-Shot 5-Shot
MetaOptNet [25] 65.99 ± 0.72 81.56 ± 0.53
SimpleShot [9] 70.51 ± 0.23 84.58 ± 0.16
RFS-Simple [34] 69.74 ± 0.72 84.41 ± 0.55
DSN [106] 66.22 ± 0.75 82.79 ± 0.48
DSN-MR [106] 67.39 ± 0.82 82.85 ± 0.56
RFS-Distill [34] 71.52 ± 0.69 86.03 ± 0.49
MTL+E3BM [107] 71.20 ± 0.40 85.30 ± 0.30
DeepEMD [35] 71.16 ± 0.87 86.03 ± 0.58
ProtoNet† [13] 68.23 ± 0.23 84.03 ± 0.16
+ LastShot (NC) 68.91 ± 0.23 85.15 ± 0.16
+ LastShot (LR) 69.37 ± 0.23 85.36 ± 0.16
ProtoMAML† [24] 67.10 ± 0.23 81.18 ± 0.16
+ LastShot (NC) 68.42 ± 0.23 83.63 ± 0.16
+ LastShot (LR) 68.80 ± 0.23 83.72 ± 0.16
MetaOptNet† [25] 66.49 ± 0.23 82.36 ± 0.16
+ LastShot (NC) 68.23 ± 0.23 83.22 ± 0.16
+ LastShot (LR) 68.29 ± 0.23 83.45 ± 0.16
FEAT [23] 71.13 ± 0.23 84.79 ± 0.16
+ LastShot (NC) 72.03 ± 0.23 85.66 ± 0.16
+ LastShot (LR) 72.43 ± 0.23 85.82 ± 0.16
Table IV: Few-shot classification accuracy and 95% confidence interval on CUB with the ResNet-12 backbone. Results are evaluated over 10,000 tasks. † denotes our re-implemented results. Note that the cited results of FEAT [23] are implemented with the same pre-training procedure already.
Setting 1-Shot 5-Shot
TriNet [108] 69.61 ± 0.46 84.10 ± 0.35
DeepEMD [35] 75.65 ± 0.83 88.69 ± 0.50
Negative-Cosine [109] 72.66 ± 0.85 89.40 ± 0.43
ProtoNet† [13] 72.92 ± 0.21 87.84 ± 0.12
+ LastShot (NC) 75.83 ± 0.21 89.83 ± 0.12
+ LastShot (LR) 75.80 ± 0.21 90.22 ± 0.12
ProtoMAML† [24] 71.07 ± 0.21 87.27 ± 0.12
+ LastShot (NC) 74.57 ± 0.21 88.67 ± 0.12
+ LastShot (LR) 74.18 ± 0.21 88.23 ± 0.12
MetaOptNet† [25] 75.31 ± 0.21 89.24 ± 0.12
+ LastShot (NC) 78.54 ± 0.21 90.46 ± 0.12
+ LastShot (LR) 78.79 ± 0.21 90.46 ± 0.12
FEAT [23] 77.97 ± 0.21 90.03 ± 0.12
+ LastShot (NC) 80.07 ± 0.21 91.43 ± 0.12
+ LastShot (LR) 80.20 ± 0.21 91.49 ± 0.12
Table V: Few-shot classification accuracy and 95% confidence interval on two splits of CIFAR-100, i.e., CIFAR-FS and FC-100, based on the ResNet-12 backbone. We use “N/A” if the confidence interval is not reported in the cited papers. † denotes our re-implemented results. Note that the cited results of FEAT [23] are implemented with the same pre-training procedure already. §: DeepEMD [35] makes predictions for FC-100 by meta-training and meta-testing on larger images of size 84×84 instead of 32×32.
Dataset CIFAR-FS FC-100
Setting 1-Shot 5-Way 5-Shot 5-Way 1-Shot 5-Way 5-Shot 5-Way
ProtoNet [13] 72.20 ± 0.70 83.50 ± 0.50 37.50 ± 0.60 52.50 ± 0.60
ShotFree [83] 69.20 ± N/A 84.70 ± N/A - -
MetaOptNet [25] 72.00 ± 0.70 84.20 ± 0.50 41.10 ± 0.60 55.50 ± 0.60
RFS-Simple [34] 71.50 ± 0.80 86.00 ± 0.50 42.60 ± 0.70 58.10 ± 0.60
RFS-Distill [34] 73.90 ± 0.80 86.90 ± 0.50 44.60 ± 0.70 60.90 ± 0.60
DeepEMD§ [35] - - 46.47 ± 0.78 63.22 ± 0.71
ProtoNet† [13] 72.42 ± 0.21 85.26 ± 0.12 42.01 ± 0.18 57.48 ± 0.18
+ LastShot (NC) 74.77 ± 0.21 86.46 ± 0.12 42.65 ± 0.18 58.43 ± 0.18
+ LastShot (LR) 74.69 ± 0.21 86.51 ± 0.12 42.83 ± 0.18 58.16 ± 0.18
ProtoMAML† [24] 69.95 ± 0.21 84.61 ± 0.12 41.39 ± 0.18 56.76 ± 0.18
+ LastShot (NC) 71.90 ± 0.21 85.60 ± 0.12 41.94 ± 0.18 58.26 ± 0.18
+ LastShot (LR) 71.99 ± 0.21 85.61 ± 0.12 42.23 ± 0.18 57.56 ± 0.18
MetaOptNet† [25] 72.96 ± 0.22 85.50 ± 0.15 42.53 ± 0.18 57.54 ± 0.18
+ LastShot (NC) 75.66 ± 0.22 86.50 ± 0.15 43.46 ± 0.18 58.30 ± 0.18
+ LastShot (LR) 76.08 ± 0.22 86.70 ± 0.15 43.29 ± 0.18 58.14 ± 0.18
FEAT [23] 75.44 ± 0.21 86.56 ± 0.12 42.56 ± 0.18 57.48 ± 0.18
+ LastShot (NC) 76.67 ± 0.21 87.78 ± 0.12 43.91 ± 0.18 58.41 ± 0.18
+ LastShot (LR) 76.76 ± 0.21 87.49 ± 0.12 44.08 ± 0.18 59.14 ± 0.18

4.4 Benchmark comparisons

We first evaluate LastShot with 5-Way 1-Shot and 5-Way 5-Shot tasks on miniImageNet (Table II), tieredImageNet (Table III), CUB (Table IV), CIFAR-FS (Table V), and FC-100 (Table V). We highlight the results when a baseline method is combined with LastShot via a cyan background in the tables, and make the best performance in each setting in bold. We denote the two implementations of the target classifiers, i.e., the nearest centroid classifier and the logistic regression, by NC and LR, respectively.

We find that LastShot can consistently improve each baseline method on all the datasets by 1∼3%. For example, with LastShot, the 1-Shot classification accuracy of FEAT improves from 77.97% to 80.20% in Table IV. By further investigating the four baseline methods, we find that MetaOptNet and FEAT are stronger than ProtoNet and ProtoMAML. In general, LastShot can lead to a larger boost for the relatively weaker baselines, but it can still improve MetaOptNet and FEAT. The target classifiers based on the nearest centroid and LR have similar performance. Overall, LastShot variants achieve the state-of-the-art performance on miniImageNet and CUB, while obtaining competitive results on tieredImageNet. These verify that a strong teacher can provide better supervision in meta-learning.

Last but not least, we compare LastShot to Model Regression [16] and RFS [34]: the former uses the target classifier only to guide the few-shot learner in the last fully-connected layer, while the latter uses distillation to strengthen the pre-trained features. LastShot outperforms both on miniImageNet, CUB, and CIFAR-FS, and performs on par with them on the other datasets. This promising performance further demonstrates the effectiveness of our approach.

Table VI: 5-Way K-shot classification accuracy and 95% confidence interval on miniImageNet, where K \in \{1,5,10,20,30,50\}. We apply LastShot on ProtoNet and FEAT, and achieve stable improvements when the shot number becomes larger. For “PT-EMB”, we apply the ProtoNet classification rule with the pre-trained feature extractor.
Setups \rightarrow 1 5 10 20 30 50
PT-EMB 59.27 ±\pm 0.20 80.55 ±\pm 0.14 84.37 ±\pm 0.12 86.40 ±\pm 0.11 87.15 ±\pm 0.10 87.74 ±\pm 0.10
SimpleShot [9] 65.36 ±\pm 0.20 81.39 ±\pm 0.14 84.89 ±\pm 0.11 86.91 ±\pm 0.10 87.53 ±\pm 0.10 88.08 ±\pm 0.10
ProtoNet [13] 63.73 ±\pm 0.21 79.40 ±\pm 0.14 82.83 ±\pm 0.12 84.61 ±\pm 0.11 85.07 ±\pm 0.11 85.57 ±\pm 0.10
+ LastShot (NC) 64.76 ±\pm 0.20 81.60 ±\pm 0.14 85.03 ±\pm 0.12 86.94 ±\pm 0.11 87.56 ±\pm 0.10 88.23 ±\pm 0.10
+ LastShot (LR) 64.85 ±\pm 0.20 81.81 ±\pm 0.14 85.27 ±\pm 0.12 87.19 ±\pm 0.11 87.89 ±\pm 0.10 88.45 ±\pm 0.10
FEAT [23] 66.78 ±\pm 0.20 82.05 ±\pm 0.14 85.15 ±\pm 0.12 87.09 ±\pm 0.11 87.82 ±\pm 0.10 87.83 ±\pm 0.10
+ LastShot (NC) 67.33 ±\pm 0.20 82.39 ±\pm 0.14 85.64 ±\pm 0.12 87.52 ±\pm 0.11 88.26 ±\pm 0.10 88.76 ±\pm 0.10
+ LastShot (LR) 67.35 ±\pm 0.20 82.58 ±\pm 0.14 85.99 ±\pm 0.12 87.80 ±\pm 0.11 88.63 ±\pm 0.10 89.03 ±\pm 0.10
Table VII: 5-Way K-shot classification accuracy and 95% confidence interval on tieredImageNet, where K \in \{1,5,10,20,30,50\}. We apply LastShot on ProtoNet and FEAT, and achieve stable improvements when the shot number becomes larger. For “PT-EMB”, we apply the ProtoNet classification rule with the pre-trained feature extractor.
Setups \rightarrow 1 5 10 20 30 50
PT-EMB 65.14 ±\pm 0.23 84.65 ±\pm 0.16 87.62 ±\pm 0.14 89.24 ±\pm 0.12 89.84 ±\pm 0.12 90.26 ±\pm 0.11
SimpleShot [9] 70.51 ±\pm 0.23 84.58 ±\pm 0.16 87.64 ±\pm 0.14 89.32 ±\pm 0.12 89.77 ±\pm 0.12 90.30 ±\pm 0.11
ProtoNet [13] 68.23 ±\pm 0.24 84.03 ±\pm 0.16 86.28 ±\pm 0.14 87.75 ±\pm 0.13 88.33 ±\pm 0.12 88.67 ±\pm 0.11
+ LastShot (NC) 69.02 ±\pm 0.23 85.45 ±\pm 0.16 88.32 ±\pm 0.14 89.88 ±\pm 0.12 89.97 ±\pm 0.11 90.38 ±\pm 0.11
+ LastShot (LR) 68.95 ±\pm 0.23 85.44 ±\pm 0.16 88.04 ±\pm 0.14 89.39 ±\pm 0.12 89.94 ±\pm 0.11 90.22 ±\pm 0.11
FEAT [23] 71.13 ±\pm 0.23 84.79 ±\pm 0.16 87.42 ±\pm 0.14 89.13 ±\pm 0.12 89.98 ±\pm 0.11 90.11 ±\pm 0.11
+ LastShot (NC) 72.03 ±\pm 0.23 85.66 ±\pm 0.16 88.52 ±\pm 0.14 90.06 ±\pm 0.12 90.52 ±\pm 0.11 90.94 ±\pm 0.11
+ LastShot (LR) 72.43 ±\pm 0.23 85.82 ±\pm 0.16 88.52 ±\pm 0.14 90.06 ±\pm 0.12 90.52 ±\pm 0.11 91.06 ±\pm 0.11

4.5 Larger-shot comparisons

Besides the benchmark results, we further investigate the scenarios where there are more shots in the support set for the novel classes (i.e., during meta-testing). For instance, K=50 is much larger than the usual few-shot learning scenario (where K=1 or K=5), but is still smaller than usual many-shot datasets like ImageNet [110] for training a complex deep neural network. We consider K \in \{1,5,10,20,30,50\} for meta-testing. For meta-training on the base class data, we also consider K \in \{1,5,10,20,30,50\}. However, instead of setting the same K for meta-training and meta-testing, we pick the best K in meta-training for each K of meta-testing, according to the accuracy on the meta-validation set. We note that, for a larger K in meta-training, the query set size may be shrunk due to the memory constraint; for example, for K=50, the number of query examples per class becomes 2. For meta-testing, since no gradients need to be computed for the few-shot learner \mathcal{A}, we can still keep the query set size intact, i.e., 15 examples per class.
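To make this selection step concrete, here is a minimal Python sketch of the protocol; meta_train(K) (returning a learner meta-trained with K-shot tasks) and evaluate(learner, K) (returning meta-validation accuracy on K-shot tasks) are hypothetical helpers, not part of our released code.

```python
# Hedged sketch of the larger-shot protocol: for each meta-testing shot number,
# pick the meta-training shot number that maximizes meta-validation accuracy.
CANDIDATE_SHOTS = [1, 5, 10, 20, 30, 50]

def select_learner_for_test_shot(meta_train, evaluate, test_shot):
    best_acc, best_learner, best_train_shot = -1.0, None, None
    for train_shot in CANDIDATE_SHOTS:
        learner = meta_train(train_shot)    # query set may shrink for large train_shot
        acc = evaluate(learner, test_shot)  # accuracy on the meta-validation split
        if acc > best_acc:
            best_acc, best_learner, best_train_shot = acc, learner, train_shot
    return best_learner, best_train_shot
```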

[Figure 3 panels: (a) miniImageNet, (b) tieredImageNet]
Figure 3: The comparison of our LastShot to existing methods. We conduct 5-way (i.e., 5-class) classification experiments on miniImageNet [10] and tieredImageNet [12] and report the accuracy under different numbers of shots. We consider two baselines: FEAT [23] is a representative meta-learning based approach; SimpleShot [9] is a nearest centroid classifier whose features are learned by directly training a multi-class classifier using all the base class data. FEAT outperforms SimpleShot with smaller shot numbers but falls behind with larger shots. Our proposed LastShot improves FEAT in all settings.

The results are shown in Table VI and Table VII: LastShot consistently improves the corresponding meta-learning method at all shot settings. From the further analysis in Figure 1, we see that the strength of conventional meta-learning decreases as K increases. For instance, the improvement of the meta-learned method ProtoNet over the plainly-trained method “PT-EMB” diminishes when the number of shots becomes large; PT-EMB even outperforms ProtoNet when K>5. (Note that both methods use exactly the same prediction rule.) In Figure 3, we see a similar trend when comparing the best-performing meta-learning method FEAT and the best-performing non-meta-learning method SimpleShot: SimpleShot outperforms FEAT when K=50.

While one may argue that this is not surprising given that meta-learning is designed for small K, the existence of a turning point at which one must switch from meta-learning methods to non-meta-learning ones is troublesome in practice, especially since this turning point is quite unstable (it is around 5 for ProtoNet but 50 for FEAT). It is therefore desirable to design learning mechanisms that seamlessly combine the advantages of both families of methods.

We claim that LastShot is one such mechanism. On the one hand, LastShot is itself a meta-learning method, drawing few-shot tasks to train the few-shot learner. On the other hand, LastShot explicitly uses non-meta-learning based methods with a larger K to construct the target classifier that teaches the few-shot learner during meta-training. The results on miniImageNet and tieredImageNet (in Table VI and Table VII, respectively) further demonstrate this: LastShot combined with FEAT consistently improves FEAT and outperforms SimpleShot at every value of K, making it the best-performing method across a wide spectrum of K values.

4.6 Ablation studies

4.6.1 Advantage of LastShot with a small query set

We evaluate the influence of the query set size during meta-training. With a small query set, the few-shot learner only receives limited supervision. We investigate ProtoNet with LastShot. As shown in Table VIII, the classification accuracy of ProtoNet improves when more query instances are involved, especially from 1 to 5. With LastShot, the few-shot learner performs much more favorably: even with 1 query instance per class, it already outperforms ProtoNet trained with 50 query instances but without LastShot.

Table VIII: The 1-Shot 5-Way classification accuracy and 95% confidence interval with 10,000 trials on miniImageNet. The tasks used for meta-training have different numbers of query instances per class, i.e., \{1, 5, 50\}.
# Query 1 5 50
ProtoNet 60.94 ±\pm 0.20 62.20 ±\pm 0.20 63.44 ±\pm 0.20
+ LastShot (NC) 63.78 ±\pm 0.20 64.16 ±\pm 0.20 64.50 ±\pm 0.20
+ LastShot (LR) 63.50 ±\pm 0.20 64.80 ±\pm 0.20 64.82 ±\pm 0.20
Table IX: The 1/5-Shot 5-Way classification accuracy (plus 95% confidence interval) with 10,000 trials on miniImageNet. Based on FEAT, we compare the local teacher in our LastShot (LR) with the teacher based on the pre-trained B-class classifier.
1-Shot 5-Shot
FEAT 66.78 ±\pm 0.20 82.05 ±\pm 0.14
FEAT w/ B-class target 66.89 ±\pm 0.20 82.10 ±\pm 0.14
FEAT + LastShot (LR) 67.35 ±\pm 0.20 82.58 ±\pm 0.14
Table X: The 1/5-Shot 5-Way classification accuracy (plus 95% confidence interval) with 10,000 trials on miniImageNet and tieredImageNet. Based on FEAT, we compare different implementations to query the target classifier.
miniImageNet 1-Shot 5-Shot
FEAT 66.78 ±\pm 0.20 82.05 ±\pm 0.14
+ Autoaug 65.22 ±\pm 0.20 80.13 ±\pm 0.14
+ Enlarge 65.31 ±\pm 0.20 79.99 ±\pm 0.14
+ LastShot (NC), Vanilla 66.97 ±\pm 0.20 82.15 ±\pm 0.14
+ LastShot (NC), Autoaug 67.06 ±\pm 0.20 82.16 ±\pm 0.14
+ LastShot (NC), Enlarge 67.33 ±\pm 0.20 82.39 ±\pm 0.14
tieredImageNet 1-Shot 5-Shot
FEAT 71.13 ±\pm 0.23 84.79 ±\pm 0.16
+ Autoaug 68.11 ±\pm 0.23 82.48 ±\pm 0.16
+ Enlarge 69.48 ±\pm 0.23 83.87 ±\pm 0.16
+ LastShot (NC), Vanilla 71.41 ±\pm 0.23 84.93 ±\pm 0.16
+ LastShot (NC), Autoaug 71.62 ±\pm 0.23 85.21 ±\pm 0.16
+ LastShot (NC), Enlarge 72.03 ±\pm 0.23 85.66 ±\pm 0.16

4.6.2 Comparison on the target classifiers

One intuitive way to construct the target classifier h^{\star} is to train a B-class classifier over all base classes and then select the C logits corresponding to the classes of the few-shot task at hand. As shown in Table IX, using part of the B-class base classifier as the teacher does not bring further improvement, which may result from the mismatch in class numbers and the poorly calibrated logit values.
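For reference, the sketch below shows how such a B-class teacher would be queried by slicing the pre-trained classifier's logits down to the C classes of the current task; base_classifier and task_class_ids are hypothetical names, not identifiers from our code.

```python
import torch

def b_class_teacher_logits(base_classifier, x_query, task_class_ids):
    """Query a pre-trained B-class classifier and keep only the C logits that
    correspond to the classes sampled in the current few-shot task."""
    with torch.no_grad():
        full_logits = base_classifier(x_query)   # shape [num_query, B]
    return full_logits[:, task_class_ids]        # shape [num_query, C]
```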

4.6.3 Strengthening the teacher or weakening the student?

There are three ways to query h^{\star}: directly querying it (Vanilla), weakening the student (e.g., adding noise to the query set with AutoAugment (Autoaug) [102]), or strengthening the teacher with enlarged inputs (Enlarge). We show the influence of these choices in Table X. In the vanilla case, we query h^{\star} directly. By adding noise to the query set, we enlarge the learning signal between the teacher and the student; by enlarging the input images for the teacher, the teacher can provide higher-quality supervision. Both modifications improve over the vanilla variant. We empirically find that strengthening the teacher leads to the most stable improvement.
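The sketch below illustrates the three querying variants under our interpretation that, in the Autoaug case, the student sees augmented queries while the teacher sees the originals; the autoaugment callable and the 2x enlargement factor are illustrative assumptions rather than our exact settings.

```python
import torch.nn.functional as F

def make_teacher_student_inputs(x_query, mode, autoaugment=None):
    """Prepare the query inputs fed to the target classifier (teacher) and to
    the few-shot learner (student) under the three variants in Table X."""
    if mode == "vanilla":    # query h* directly on the original query images
        return x_query, x_query
    if mode == "autoaug":    # weaken the student with noisy (augmented) queries
        return x_query, autoaugment(x_query)
    if mode == "enlarge":    # strengthen the teacher with enlarged query images
        x_large = F.interpolate(x_query, scale_factor=2,
                                mode="bilinear", align_corners=False)
        return x_large, x_query
    raise ValueError(f"unknown mode: {mode}")
```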

To verify whether adding noise by itself can improve meta-learning, we also apply Autoaug to FEAT. As shown in Table X, Autoaug without LastShot degrades the performance of FEAT significantly. Similarly, enlarging the input images without LastShot does not improve FEAT.

Table XI: The influence of λ in LastShot. We show the {1,5,10,20,30,50}-Shot 5-Way classification accuracy with 10,000 trials on miniImageNet.
λ 0 0.001 0.01 0.1 1 10
1-Shot 67.08 66.82 67.13 66.74 66.91 67.33
5-Shot 82.31 82.01 82.39 82.03 82.03 81.68
10-shot 85.30 85.29 85.60 85.29 84.75 84.79
20-shot 87.22 87.44 87.44 87.16 86.55 86.51
30-shot 87.82 88.26 88.07 87.75 87.05 86.88
50-shot 88.21 88.76 88.58 88.24 87.69 87.34

4.6.4 Influence of the coefficient λ

We investigate the influence of λ, the coefficient for the vanilla meta-learning objective in Equation 5, when we apply LastShot (NC) with FEAT. The larger the value of λ, the smaller the influence of the target classifier's supervision during meta-training. We show the {1,5,10,20,30,50}-Shot 5-Way classification accuracy on miniImageNet in Table XI. We find that when there are more shots in a task, meta-learning with more supervision from the target classifier (i.e., a smaller λ) works better.
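To make the role of λ explicit, here is a hedged sketch of the combined objective, assuming the overall loss weights the vanilla cross-entropy term by λ and uses a temperature-scaled KL divergence as the distillation term; the exact form of Equation 5 and of the LastShot loss may differ.

```python
import torch.nn.functional as F

def combined_meta_loss(student_logits, query_labels, teacher_logits, lam, tau=1.0):
    """lam weights the vanilla meta-learning (cross-entropy) objective; the
    remaining term distills the target classifier's predictions, so a larger
    lam means weaker supervision from the teacher."""
    vanilla = F.cross_entropy(student_logits, query_labels)
    distill = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                       F.softmax(teacher_logits / tau, dim=1),
                       reduction="batchmean")
    return lam * vanilla + distill
```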

4.6.5 Will a direct ensemble help?

Since we construct a stronger teacher (not based on meta-learning) to supervise the few-shot learner, one may ask whether the stronger teacher itself can outperform the final few-shot learner trained with LastShot. We note that the strong teacher cannot be applied to the novel meta-testing tasks directly, since we do not have ample examples to construct it for those tasks. We therefore investigate an alternative question: can we achieve better few-shot accuracy by directly combining a non-meta-learning based and a meta-learning based FSL method? We investigate this idea by taking the ensemble of their predictions: for a meta-testing task, we apply both SimpleShot [9] and ProtoNet [13] for classification and average their predictions per example. We also apply the ensemble method to FEAT [23]. The results are reported in Table XII. The ensemble approach yields only marginal improvement over the vanilla meta-learning methods, whereas LastShot, with a single final model, consistently improves the meta-learning methods in various cases.
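As a reference, the ensemble baseline can be sketched as a per-example average of the two methods' class probabilities; probs_meta and probs_simpleshot are assumed to be [num_query, C] probability tensors produced by the respective methods.

```python
def ensemble_predict(probs_meta, probs_simpleshot):
    """Average the per-example class probabilities of a meta-learning based
    method (e.g., ProtoNet or FEAT) and SimpleShot, then predict the argmax."""
    probs = 0.5 * (probs_meta + probs_simpleshot)   # [num_query, C]
    return probs.argmax(dim=1)                      # predicted class indices
```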

Table XII: The 1/5-Shot 5-Way classification accuracy (plus 95% confidence interval) with 10,000 trials on miniImageNet. Based on ProtoNet and FEAT, we try to improve their performance by ensembling their predictions with a non-meta-learning based FSL method, i.e., SimpleShot [9]. The ensemble results are denoted by “w/ Ensemble” in the table.
1-Shot 5-Shot
ProtoNet 62.39 ±\pm 0.20 79.74 ±\pm 0.14
w/ Ensemble 62.78 ±\pm 0.20 81.10 ±\pm 0.14
+ LastShot (NC) 64.16 ±\pm 0.20 81.23 ±\pm 0.14
FEAT 66.78 ±\pm 0.20 82.05 ±\pm 0.14
w/ Ensemble 66.85 ±\pm 0.20 81.76 ±\pm 0.14
+ LastShot (NC) 67.33 ±\pm 0.20 82.39 ±\pm 0.14
Table XIII: The 1/5-Shot 5-Way classification accuracy (plus 95% confidence interval) with 10,000 trials on miniImageNet. Based on graph-based FSL methods, we try to improve their performance with LastShot.
1-Shot 5-Shot
GCN 64.54 ±\pm 0.20 79.20 ±\pm 0.14
+ LastShot (NC) 65.17 ±\pm 0.20 81.36 ±\pm 0.14
+ LastShot (LR) 65.25 ±\pm 0.20 81.45 ±\pm 0.14

4.6.6 Will LastShot help graph-based FSL methods?

We apply LastShot to a GCN-based few-shot learning algorithm. Following [112], we use [111] as the basic GCN-based algorithm (denoted as “GCN”). With ResNet-12 as the feature extractor, the 1/5-Shot 5-Way classification accuracy (plus 95% confidence interval) with 10,000 trials on miniImageNet is reported in Table XIII. LastShot further improves the GCN-based algorithm, e.g., the 1-Shot accuracy rises to 65.17% with LastShot (NC) and 65.25% with LastShot (LR). These results indicate that LastShot is a general add-on for meta-learning and can improve GCN-based algorithms as well.

Table XIV: The 1/5-Shot 5-Way classification accuracy (plus 95% confidence interval) with 10,000 trials on miniImageNet. Based on ProtoNet and FEAT, we try to improve their performance with LastShot on ConvNet and WRN-28-10 backbones.
ConvNet 1-Shot 5-Shot
ProtoNet 51.71 ±\pm 0.20 70.31 ±\pm 0.16
+LastShot (NC) 52.84 ±\pm 0.20 70.61 ±\pm 0.16
+LastShot (LR) 53.35 ±\pm 0.20 71.01 ±\pm 0.16
FEAT 54.62 ±\pm 0.20 71.95 ±\pm 0.16
+LastShot (NC) 56.19 ±\pm 0.20 72.62 ±\pm 0.16
+LastShot (LR) 56.19 ±\pm 0.20 72.35 ±\pm 0.16
WRN 1-Shot 5-Shot
ProtoNet 60.03 ±\pm 0.24 79.84 ±\pm 0.16
+LastShot (NC) 60.17 ±\pm 0.24 80.37 ±\pm 0.16
+LastShot (LR) 60.83 ±\pm 0.24 80.47 ±\pm 0.16
FEAT 65.09 ±\pm 0.24 81.74 ±\pm 0.16
+LastShot (NC) 65.89 ±\pm 0.24 82.35 ±\pm 0.16
+LastShot (LR) 65.86 ±\pm 0.24 82.27 ±\pm 0.16

4.6.7 Will LastShot be compatible with other backbones?

As shown in several recent works [113, 9, 23], the backbone architecture of the feature extractor can largely affect the resulting classification or FSL accuracy. In this subsection, we investigate whether the superiority of LastShot extends to more backbones. We consider two widely used backbones: a four-layer convolution network (ConvNet) and Wide ResNet-28-10 (WRN) [114]. The four-layer ConvNet contains four repeated blocks, each including a convolutional layer, a Batch Normalization layer [115], a ReLU, and a max pooling layer [10, 13]; a global average pooling layer at the end reduces the embedding dimension to 64. We follow [64, 23] and use the WRN-28-10 structure, which sets the depth to 28 and the width to 10. We study two meta-learning based algorithms, ProtoNet and FEAT. The 1/5-Shot 5-Way classification accuracy (plus 95% confidence interval) with 10,000 trials on miniImageNet is reported in Table XIV. LastShot improves ProtoNet and FEAT on both backbones.
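For reference, a minimal PyTorch sketch of the four-layer ConvNet backbone is given below; the 64-channel width and 3x3 kernels follow common practice in the cited works, and minor details (e.g., padding) may differ from our actual implementation.

```python
import torch.nn as nn

def conv_block(in_channels, out_channels):
    """One block: convolution, Batch Normalization, ReLU, and 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class ConvNet4(nn.Module):
    """Four-layer ConvNet backbone with global average pooling, producing
    64-dimensional embeddings."""
    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_block(in_channels, hidden),
            conv_block(hidden, hidden),
            conv_block(hidden, hidden),
            conv_block(hidden, hidden),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        h = self.encoder(x)             # [N, 64, H', W']
        return self.pool(h).flatten(1)  # [N, 64]
```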

5 Experiments on Few-Shot Regression

Besides the experiments on few-shot classification in section 4, we further follow [7] to study a one-dimensional regression problem to illustrate the effectiveness and applicability of our approach LastShot.

5.1 Datasets

We generate synthetic, one-dimensional few-shot regression tasks with inputs x sampled uniformly from [-5,5]. Every few-shot task is characterized by a true sine function parameterized by (a,v,b):

y = a\times\sin(vx+b) + \epsilon.   (11)

In this sine curve, the amplitude a is sampled uniformly from [0,2], the frequency v is sampled uniformly from [2,4], the bias b is sampled uniformly from [0,2\pi], and the additive noise \epsilon is sampled from 0.3\times\mathcal{N}(0,1). Given (a,v,b), we can then sample the support and query sets by first sampling x uniformly from [-5,5] and then obtaining the corresponding y using Equation 11. We note that the sine curves defined in Equation 11 have higher frequencies than those studied in [7]; thus, more support set examples are required to learn a regression function.
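A minimal NumPy sketch of the task sampler is given below; the function and variable names are illustrative.

```python
import numpy as np

def sample_sine_task(num_support, num_query, rng=np.random):
    """Sample one few-shot regression task: a sine curve y = a*sin(v*x + b) + eps
    with a ~ U[0,2], v ~ U[2,4], b ~ U[0,2*pi], eps ~ 0.3*N(0,1), x ~ U[-5,5]."""
    a = rng.uniform(0.0, 2.0)
    v = rng.uniform(2.0, 4.0)
    b = rng.uniform(0.0, 2.0 * np.pi)

    def draw(n):
        x = rng.uniform(-5.0, 5.0, size=n)
        y = a * np.sin(v * x + b) + 0.3 * rng.normal(size=n)
        return x, y

    support, query = draw(num_support), draw(num_query)
    return (a, v, b), support, query
```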

5.2 Evaluation and training protocols

The goal is to build a regression function using the support set, such that it predicts the real-valued labels well on the corresponding query set. We randomly sample 1,000 meta-testing tasks (i.e., 1,000 sine curves) for evaluation. From each, we randomly sample K pairs of (x,y) according to Equation 11 to form the few-shot support set \mathcal{D}^{\prime}_{\text{support}}, and another non-overlapping 100 pairs as the query set \mathcal{D}^{\prime}_{\text{query}}. (In our experiments, we consider K \in \{5,50\}.) The performance is measured by the average Mean Square Error (MSE) and 95% confidence intervals over the 1,000 tasks.
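Concretely, the reported numbers can be obtained as the mean task-level MSE with a normal-approximation 95% confidence interval, as in the sketch below (the exact interval computation is an assumption).

```python
import numpy as np

def mse_with_ci(task_mses, z=1.96):
    """Given the per-task MSEs over the 1,000 meta-testing tasks, return the
    mean MSE and the half-width of its 95% confidence interval."""
    task_mses = np.asarray(task_mses, dtype=np.float64)
    mean = task_mses.mean()
    half_width = z * task_mses.std(ddof=1) / np.sqrt(len(task_mses))
    return mean, half_width
```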

We investigate training a meta-learning based few-shot learner \mathcal{A} to output the regressor. Specifically, we model the regression function by a three-layer fully-connected neural network f_{\bm{\theta}}(x) as the feature extractor. The hidden and output dimensions are all set to 100. We can learn such a feature extractor using nearly all the existing meta-learning methods. For example, with MAML [7], we add a linear regressor w on top. When embedding-based methods like ProtoNet [13] are used, it is equivalent to learning a non-parametric regression

\hat{y} = \sum_{(x^{\prime},y^{\prime})\in\mathcal{D}_{\text{support}}} \mathbf{Softmax}\left(-\|f_{\bm{\theta}}(x)-f_{\bm{\theta}}(x^{\prime})\|_{2}^{2}\right)\times y^{\prime},   (12)

where \mathbf{Softmax}(\cdot) is taken over the K examples in \mathcal{D}_{\text{support}}.
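A minimal PyTorch sketch of this prediction rule (Equation 12) follows; f_theta can be any embedding network mapping inputs of shape [N, 1] to d-dimensional vectors.

```python
import torch
import torch.nn.functional as F

def nonparametric_regression(f_theta, x_query, x_support, y_support):
    """Equation 12: softmax over negative squared embedding distances to the
    support examples, used as weights to average the support labels."""
    z_q = f_theta(x_query)                       # [Nq, d]
    z_s = f_theta(x_support)                     # [K,  d]
    sq_dist = torch.cdist(z_q, z_s, p=2).pow(2)  # [Nq, K]
    weights = F.softmax(-sq_dist, dim=1)         # softmax over the K supports
    return weights @ y_support                   # [Nq] predicted labels
```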

We train \mathcal{A} with at most 1,280,000 few-shot tasks, each associated with a specific (a,v,b), a support set \mathcal{D}_{\text{support}}, and a query set \mathcal{D}_{\text{query}}. We pack every 32 tasks into a mini-batch and train \mathcal{A} for 40,000 iterations. We apply stochastic gradient descent (SGD) with a momentum of 0.9 as the optimizer. We set the initial learning rate to 0.001 and multiply it by 0.5 every 160,000 tasks. We perform early stopping using another 1,000 validation tasks.

[Figure 4 panels, arranged in three rows of four: (a)/(e)/(i) 5-Shot, ProtoNet; (b)/(f)/(j) 50-Shot, ProtoNet; (c)/(g)/(k) 5-Shot, FEAT; (d)/(h)/(l) 50-Shot, FEAT]

Figure 4: Visualizing the 5-shot and 50-shot regression tasks on three curves (one per row). The ground-truth curves (from top to bottom) are y=\sin(2x), y=\sin(3x+\pi), and y=2\sin(2x+\pi), respectively. Black points are the training examples in the support set. “Oracle” (cyan) denotes the ground-truth curve in the range [-5,5] (we draw training examples without the additional noise \epsilon for visualization clarity). “Vanilla” (blue) denotes the predicted curve by ProtoNet or FEAT without LastShot. Our LastShot (red) is implemented on top of each “Vanilla” method and produces curves much closer to the oracle ones.

5.3 LastShot for regression

To apply LastShot to few-shot regression, we must (a) construct the target regressor h^{\star} for each few-shot regression task, and (b) define a suitable \ell_{\textsc{LastShot}} for supervision.

For the former, to efficiently construct h^{\star} without sampling extra data from each few-shot task, we discretize the ranges of a\in[0,2], v\in[2,4], and b\in[0,2\pi] with a step size of 0.1 to construct a set of “anchor” tasks. For each of them, we can sample many-shot examples (e.g., 1,000 in our experiments) to construct a many-shot regressor. Here we train a neural network regressor using the regularized least square loss (i.e., the loss used in ridge regression) for each anchor task, whose hyper-parameter is tuned by cross-validation. Then during meta-training, given a few-shot task and its associated (a,v,b), we retrieve the anchor task at the same grid cell to be the target model h^{\star}; the rationale is that sine curves with similar (a,v,b) are shaped similarly.
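The retrieval step can be sketched as a simple grid lookup; anchor_regressors is assumed to be a dictionary mapping discretized (a, v, b) triples to their pre-trained many-shot regressors, and the flooring below is only one possible way to define the grid cells.

```python
import numpy as np

GRID_STEP = 0.1

def grid_key(a, v, b, step=GRID_STEP):
    """Map a task's (a, v, b) to the key of its anchor task on the grid."""
    snap = lambda t: round(float(np.floor(t / step) * step), 2)
    return (snap(a), snap(v), snap(b))

def retrieve_target_regressor(anchor_regressors, a, v, b):
    """Look up the many-shot 'anchor' regressor trained on the grid cell that
    contains the current few-shot task's (a, v, b)."""
    return anchor_regressors[grid_key(a, v, b)]
```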

With the target regressor h^{\star}, we realize \ell_{\textsc{LastShot}} (cf. Equation 5) by a weighted square loss [116],

\sum_{(x,y)\in\mathcal{D}_{\text{query}}} \ell_{\textsc{LastShot}}\big(\mathcal{A}(\mathcal{D}_{\text{support}})(x),\, h^{\star}(x)\big) = \sum_{(x,y)\in\mathcal{D}_{\text{query}}} \mathbf{Softmax}\left(-(h^{\star}(x)-y)^{2}\right) \times \left(h^{\star}(x)-\mathcal{A}(\mathcal{D}_{\text{support}})(x)\right)^{2},   (13)

in which \mathbf{Softmax} is taken over the examples in \mathcal{D}_{\text{query}}. The \mathbf{Softmax} measures how well the target regressor h^{\star} performs on each query example; we measure this because the retrieved anchor regressor does not exactly match the few-shot task at hand. The better h^{\star} performs (i.e., the smaller the error between h^{\star}(x) and y), the larger the score after \mathbf{Softmax}. We then use these \mathbf{Softmax} scores to compute a weighted average of the errors between the few-shot regressor \mathcal{A}(\mathcal{D}_{\text{support}})(x) and the target regressor h^{\star}(x) over the query examples. Namely, the few-shot learner pays more attention to those query examples on which h^{\star} truly performs well.
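A PyTorch sketch of this weighted loss (Equation 13) is shown below, assuming the few-shot regressor's predictions, the target regressor's predictions, and the query labels are given as 1-D tensors over the query set.

```python
import torch.nn.functional as F

def lastshot_regression_loss(student_pred, teacher_pred, y_query):
    """Equation 13: weight the squared error between the few-shot regressor and
    the target regressor by a softmax (over the query set) of the target
    regressor's own accuracy, so queries where the teacher is reliable matter more."""
    weights = F.softmax(-(teacher_pred - y_query).pow(2), dim=0)   # over query examples
    return (weights * (teacher_pred - student_pred).pow(2)).sum()
```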

5.4 Results

We compare LastShot with ProtoNet [13], MAML [7], MetaOptNet [25], and FEAT [23] in Table XV. Specifically for MetaOptNet, we implement its top-layer classifier as ridge regression and tune the hyper-parameter. We make the following observations. First, predicting a sine curve is difficult given few-shot examples (one-dimensional sine functions have an infinite VC dimension), and all methods perform better when there are more shots. Second, FEAT and MetaOptNet, being more advanced meta-learning methods, outperform MAML and ProtoNet. Third and most importantly, LastShot notably improves each of the baseline methods.

We further visualize the few-shot learners' performance on three few-shot tasks whose true (oracle) curves are y=\sin(2x), y=\sin(3x+\pi), and y=2\sin(2x+\pi) in Figure 4. Specifically, we consider both the 5-shot and 50-shot cases, and plot the oracle curve (cyan), the predicted curve without LastShot (blue), and the predicted curve with LastShot (red). We see that the predicted curves with LastShot match the oracle ones better. It is worth noting that each of these three curves spans more than one period in the input range, so fitting the curve from merely five examples is difficult: there can be multiple feasible solutions. We surmise that this is why ProtoNet produces a nearly horizontal curve. By learning with a stronger teacher via LastShot, ProtoNet generates more faithful curves in both the 5-shot and 50-shot scenarios.

Table XV: Mean Square Error (MSE) and 95% confidence interval over 1,000 {5,50}-shot regression tasks. The lower the better.
MSE 5-Shot 50-Shot
MAML [7] 0.693±\pm0.034 0.303±\pm0.016
ProtoNet [13] 0.681±\pm0.031 0.299±\pm0.013
+ LastShot 0.630±\pm0.030 0.294±\pm0.011
MetaOptNet [25] 0.622±\pm0.029 0.162±\pm0.004
+ LastShot 0.616±\pm0.022 0.159±\pm0.004
FEAT [23] 0.491±\pm0.026 0.189±\pm0.006
+ LastShot 0.479±\pm0.023 0.167±\pm0.004

6 Conclusion

We present a novel meta-learning paradigm that introduces rich supervision for few-shot learning. By coupling each few-shot task in the meta-training phase with a target classifier trained on ample examples, the few-shot learner receives additional training signals from the target classifier besides the query set instances. We propose ways to efficiently construct and leverage the target classifier, leading to a plug-and-play method, LastShot, which is compatible with various meta-learning based few-shot learning methods. On multiple datasets, LastShot achieves consistent improvements over baseline meta-learning methods. Importantly, LastShot also maintains robust performance across a wide range of shot numbers, making meta-learning work well not only in few-shot but also in many-shot settings.

Acknowledgments

This research is supported by National Key R&D Program of China (2020AAA0109401), Nanjing University-Huawei Joint Research Program, NSFC (62006112, 61921006, 62176117), Collaborative Innovation Center of Novel Software Technology and Industrialization, NSF of Jiangsu Province (BK20200313), NSF IIS-2107077, NSF OAC-2118240, NSF OAC-2112606, and the OSU GI Development funds. We are thankful for the generous support of computational resources by Ohio Supercomputer Center and AWS Cloud Credits for Research.

References

  • [1] B. Lake, T. D. Ullman, J. Tenenbaum, and S. Gershman, “Building machines that learn and think like people,” Behavioral and Brain Sciences, vol. 40, 2016.
  • [2] B. Lake, T. Linzen, and M. Baroni, “Human few-shot learning of compositional instructions,” in CogSci, 2019, pp. 611–617.
  • [3] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in CVPR, 2018, pp. 1199–1208.
  • [4] S. Motiian, Q. Jones, S. M. Iranmanesh, and G. Doretto, “Few-shot adversarial domain adaptation,” in NIPS, 2017, pp. 6673–6683.
  • [5] J. Gu, Y. Wang, Y. Chen, V. O. K. Li, and K. Cho, “Meta-learning for low-resource neural machine translation,” in EMNLP, 2018, pp. 3622–3631.
  • [6] B. M. Lake, “Compositional generalization through meta sequence-to-sequence learning,” in NeurIPS, 2019, pp. 9788–9798.
  • [7] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, 2017, pp. 1126–1135.
  • [8] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine, “One-shot visual imitation learning via meta-learning,” in CoRL, 2017, pp. 357–368.
  • [9] Y. Wang, W.-L. Chao, K. Q. Weinberger, and L. van der Maaten, “Simpleshot: Revisiting nearest-neighbor classification for few-shot learning,” CoRR, vol. abs/1911.04623, 2019.
  • [10] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in NIPS, 2016, pp. 3630–3638.
  • [11] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, “A closer look at few-shot classification,” in ICLR, 2019.
  • [12] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” in ICML, 2018, pp. 4331–4340.
  • [13] J. Snell, K. Swersky, and R. S. Zemel, “Prototypical networks for few-shot learning,” in NIPS, 2017, pp. 4080–4090.
  • [14] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, “Learning to learn by gradient descent by gradient descent,” in NIPS, 2016, pp. 3981–3989.
  • [15] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in ICLR, 2017.
  • [16] Y.-X. Wang and M. Hebert, “Learning to learn: Model regression networks for easy small sample learning,” in ECCV, 2016, pp. 616–634.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [18] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Distance-based image classification: Generalizing to new classes at near-zero cost,” TPAMI, vol. 35, no. 11, pp. 2624–2637, 2013.
  • [19] W.-L. Chao, H.-J. Ye, D. Zhan, M. Campbell, and K. Q. Weinberger, “Revisiting meta-learning as supervised learning,” CoRR, vol. abs/2002.00573, 2020.
  • [20] L.-Y. Gui, Y.-X. Wang, D. Ramanan, and J. M. F. Moura, “Few-shot human motion prediction via meta-learning,” in ECCV, 2018, pp. 441–459.
  • [21] Y.-X. Wang, D. Ramanan, and M. Hebert, “Learning to model the tail,” in NIPS, 2017, pp. 7032–7042.
  • [22] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves imagenet classification,” in CVPR, 2020, pp. 10 687–10 698.
  • [23] H.-J. Ye, H. Hu, D.-C. Zhan, and F. Sha, “Few-shot learning via embedding adaptation with set-to-set functions,” in CVPR, 2020, pp. 8805–8814.
  • [24] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol et al., “Meta-dataset: A dataset of datasets for learning to learn from few examples,” in ICLR, 2019.
  • [25] K. Lee, S. Maji, A. Ravichandran, and S. Soatto, “Meta-learning with differentiable convex optimization,” in CVPR, 2019, pp. 10 657–10 665.
  • [26] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel, “Meta-learning for semi-supervised few-shot classification,” in ICLR, 2018.
  • [27] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi, “Meta-learning with differentiable closed-form solvers,” in ICLR, 2019.
  • [28] B. N. Oreshkin, P. R. López, and A. Lacoste, “TADAM: task dependent adaptive metric for improved few-shot learning,” in NeurIPS, 2018, pp. 719–729.
  • [29] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset,” California Institute of Technology, Tech. Rep. CNS-TR-2011-001, 2011.
  • [30] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” CoRR, vol. abs/1503.02531, 2015.
  • [31] J. Rajasegaran, S. Khan, M. Hayat, F. S. Khan, and M. Shah, “Self-supervised knowledge distillation for few-shot learning,” CoRR, vol. abs/2006.09785, 2020.
  • [32] M. Zhang, D. Wang, and S. Gai, “Knowledge distillation for model-agnostic meta-learning,” in ECAI, 2020, pp. 1355–1362.
  • [33] Y.-X. Wang, A. Bardes, R. Salakhutdinov, and M. Hebert, “Progressive knowledge distillation for generative modeling,” in NeurIPS Workshop on Learning with Rich Experience, 2019.
  • [34] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola, “Rethinking few-shot image classification: A good embedding is all you need?” in ECCV, 2020, pp. 266–282.
  • [35] C. Zhang, Y. Cai, G. Lin, and C. Shen, “Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers,” in CVPR, 2020, pp. 12 200–12 210.
  • [36] H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi, “Label refinery: Improving imagenet classification through label progression,” CoRR, vol. abs/1805.02641, 2018.
  • [37] C. Yang, L. Xie, C. Su, and A. L. Yuille, “Snapshot distillation: Teacher-student optimization in one generation,” in CVPR, 2019, pp. 2859–2868.
  • [38] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, “Big self-supervised models are strong semi-supervised learners,” in NeurIPS, 2020, pp. 22 243–22 255.
  • [39] G. Xu, Z. Liu, X. Li, and C. C. Loy, “Knowledge distillation meets self-supervision,” in ECCV, 2020, pp. 588–604.
  • [40] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM Computing Surveys, vol. 53, no. 3, pp. 63:1–63:34, 2020.
  • [41] A. Antoniou, A. Storkey, and H. Edwards, “Data augmentation generative adversarial networks,” in ICLR Workshop, 2018.
  • [42] B. Hariharan and R. B. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” in ICCV, 2017, pp. 3037–3046.
  • [43] K. Li, Y. Zhang, K. Li, and Y. Fu, “Adversarial feature hallucination networks for few-shot learning,” in CVPR, 2020, pp. 13 467–13 476.
  • [44] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in ICML Deep Learning Workshop, vol. 2, 2015.
  • [45] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823.
  • [46] E. Triantafillou, R. S. Zemel, and R. Urtasun, “Few-shot learning through an information retrieval lens,” in NIPS, 2017, pp. 2255–2265.
  • [47] T. R. Scott, K. Ridgeway, and M. C. Mozer, “Adapted deep embeddings: A synthesis of methods for k-shot inductive transfer learning,” in NeurIPS, 2018, pp. 76–85.
  • [48] L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,” TPAMI, vol. 28, no. 4, pp. 594–611, 2006.
  • [49] B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum, “One-shot learning by inverting a compositional causal process,” in NIPS, 2013, pp. 2526–2534.
  • [50] H. Edwards and A. Storkey, “Towards a neural statistician,” in ICLR, 2017.
  • [51] G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto, “A baseline for few-shot image classification,” in ICLR, 2020.
  • [52] Z. Li and D. Hoiem, “Learning without forgetting,” TPAMI, vol. 40, no. 12, pp. 2935–2947, 2018.
  • [53] H.-J. Ye, D.-C. Zhan, Y. Jiang, and Z.-H. Zhou, “Rectify heterogeneous models with semantic mapping,” in ICML, 2018, pp. 1904–1913.
  • [54] ——, “Heterogeneous few-shot model rectification with semantic mapping,” TPAMI, vol. 43, no. 11, pp. 3878–3891, 2021.
  • [55] A. Li, T. Luo, T. Xiang, W. Huang, and L. Wang, “Few-shot learning with global class representations,” in ICCV, 2019, pp. 9714–9723.
  • [56] M. Chen, X. Wang, H. Luo, Y. Geng, and W. Liu, “Learning to focus: cascaded feature matching network for few-shot image recognition,” Science China Information Science, vol. 64, no. 9, 2021.
  • [57] N. Pang, X. Zhao, W. Wang, W. Xiao, and D. Guo, “Few-shot text classification by leveraging bi-directional attention and cross-class knowledge,” Science China Information Science, vol. 64, no. 3, 2021.
  • [58] Y. Lee and S. Choi, “Gradient-based meta-learning with learned layerwise metric and subspace,” in ICML, 2018, pp. 2933–2942.
  • [59] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” CoRR, vol. abs/1803.02999, 2018.
  • [60] H. Yao, Y. Wei, J. Huang, and Z. Li, “Hierarchically structured meta-learning,” in ICML, 2019, pp. 7045–7054.
  • [61] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine, “Meta-learning with implicit gradients,” in NeurIPS, 2019, pp. 113–124.
  • [62] H.-J. Ye and W.-L. Chao, “How to train your MAML to excel in few-shot classification,” CoRR, vol. abs/2106.16245, 2021.
  • [63] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi, “Learning feed-forward one-shot learners,” in NIPS, 2016, pp. 523–531.
  • [64] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell, “Meta-learning with latent embedding optimization,” in ICLR, 2019.
  • [65] M. Ren, R. Liao, E. Fetaya, and R. S. Zemel, “Incremental few-shot learning with attention attractor networks,” in NeurIPS, 2019, pp. 5276–5286.
  • [66] Z. Li, F. Zhou, F. Chen, and H. Li, “Meta-sgd: Learning to learn quickly for few shot learning,” CoRR, vol. abs/1707.09835, 2017.
  • [67] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan, “Low-shot learning from imaginary data,” in CVPR, 2018, pp. 7278–7286.
  • [68] K. Hsu, S. Levine, and C. Finn, “Unsupervised learning via meta-learning,” in ICLR, 2019.
  • [69] X. Yue, Z. Zheng, S. Zhang, Y. Gao, T. Darrell, K. Keutzer, and A. L. Sangiovanni-Vincentelli, “Prototypical cross-domain self-supervised learning for few-shot unsupervised domain adaptation,” in CVPR, 2021, pp. 13 834–13 844.
  • [70] U. Ojha, Y. Li, J. Lu, A. A. Efros, Y. J. Lee, E. Shechtman, and R. Zhang, “Few-shot image generation via cross-domain correspondence,” in CVPR, 2021, pp. 10 743–10 752.
  • [71] J.-M. Pérez-Rúa, X. Zhu, T. M. Hospedales, and T. Xiang, “Incremental few-shot object detection,” in CVPR, 2020, pp. 13 843–13 852.
  • [72] C. Zhu, F. Chen, U. Ahmed, Z. Shen, and M. Savvides, “Semantic relation reasoning for shot-stable few-shot object detection,” in CVPR, 2021, pp. 8782–8791.
  • [73] L. Liu, T. Zhou, G. Long, J. Jiang, and C. Zhang, “Learning to propagate for graph meta-learning,” in NeurIPS, 2019, pp. 1037–1048.
  • [74] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang, “Learning to propagate labels: Transductive propagation network for few-shot learning,” in ICLR, 2019.
  • [75] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without forgetting,” in CVPR, 2018, pp. 4367–4375.
  • [76] H.-J. Ye, H. Hu, and D.-C. Zhan, “Learning adaptive classifiers synthesis for generalized few-shot learning,” IJCV, vol. 129, no. 6, pp. 1930–1953, 2021.
  • [77] H. Yao, L.-K. Huang, L. Zhang, Y. Wei, L. Tian, J. Zou, J. Huang, and Z. Li, “Improving generalization in meta-learning via task augmentation,” in ICML, 2021, pp. 11 887–11 897.
  • [78] E. Creager, J.-H. Jacobsen, and R. S. Zemel, “Environment inference for invariant learning,” in ICML, 2021, pp. 2189–2200.
  • [79] S. Khodadadeh, L. Bölöni, and M. Shah, “Unsupervised meta-learning for few-shot image classification,” in NeurIPS, 2019, pp. 10 132–10 142.
  • [80] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord, “Boosting few-shot visual learning with self-supervision,” in ICCV, 2019, pp. 8059–8068.
  • [81] X. Li, Q. Sun, Y. Liu, Q. Zhou, S. Zheng, T.-S. Chua, and B. Schiele, “Learning to self-train for semi-supervised few-shot classification,” in NeurIPS, 2019, pp. 10 276–10 286.
  • [82] S. Qiao, C. Liu, W. Shen, and A. L. Yuille, “Few-shot image recognition by predicting parameters from activations,” in CVPR, 2018, pp. 7229–7238.
  • [83] A. Ravichandran, R. Bhotika, and S. Soatto, “Few-shot learning with embedded class models and shot-free meta training,” in ICCV, 2019, pp. 331–339.
  • [84] T. Cao, M. T. Law, and S. Fidler, “A theoretical analysis of the number of shots in few-shot learning,” in ICLR, 2020.
  • [85] B. B. Sau and V. N. Balasubramanian, “Deep model compression: Distilling knowledge from noisy teachers,” CoRR, vol. abs/1610.09650, 2016.
  • [86] S.-I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh, “Improved knowledge distillation via teacher assistant,” in AAAI, 2020, pp. 5191–5198.
  • [87] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” in ICLR, 2015.
  • [88] A. Koratana, D. Kang, P. Bailis, and M. Zaharia, “LIT: learned intermediate representation training for model compression,” in ICML, 2019, pp. 3509–3518.
  • [89] H.-J. Ye, S. Lu, and D.-C. Zhan, “Distilling cross-task knowledge via relationship matching,” in CVPR, 2020, pp. 12 393–12 402.
  • [90] Z.-H. Zhou and Y. Jiang, “Nec4.5: Neural ensemble based C4.5,” TKDE, vol. 16, no. 6, pp. 770–773, 2004.
  • [91] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, “Learning to generalize: Meta-learning for domain generalization,” in AAAI, 2018, pp. 3490–3497.
  • [92] X. Tao, X. Chang, X. Hong, X. Wei, and Y. Gong, “Topology-preserving class-incremental learning,” in ECCV, 2020, pp. 254–270.
  • [93] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” TPAMI, pp. 1–1, 2020.
  • [94] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” CoRR, vol. abs/2006.08218, 2020.
  • [95] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, “Born-again neural networks,” in ICML, 2018, pp. 1602–1611.
  • [96] C. Finn and S. Levine, “Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm,” in ICLR, 2018.
  • [97] R. Boney and A. Ilin, “Semi-supervised few-shot learning with maml,” in ICLR Workshop, 2018.
  • [98] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths, “Recasting gradient-based meta-learning as hierarchical bayes,” in ICLR, 2018.
  • [99] H. Li, D. Eigen, S. Dodge, M. Zeiler, and X. Wang, “Finding task-relevant features for few-shot learning by category traversal,” in CVPR, 2019, pp. 1–10.
  • [100] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik, “Unifying distillation and privileged information,” in ICLR, 2016.
  • [101] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He, “Data distillation: Towards omni-supervised learning,” in CVPR, 2018, pp. 4119–4128.
  • [102] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation strategies from data,” in CVPR, 2019, pp. 113–123.
  • [103] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
  • [104] G. Ghiasi, T.-Y. Lin, and Q. V. Le, “Dropblock: A regularization method for convolutional networks,” in NeurIPS, 2018, pp. 10 750–10 760.
  • [105] A. Li, W. Huang, X. Lan, J. Feng, Z. Li, and L. Wang, “Boosting few-shot learning with adaptive margin loss,” in CVPR, 2020, pp. 12 573–12 581.
  • [106] C. Simon, P. Koniusz, R. Nock, and M. Harandi, “Adaptive subspaces for few-shot learning,” in CVPR, 2020, pp. 4135–4144.
  • [107] Y. Liu, B. Schiele, and Q. Sun, “An ensemble of epoch-wise empirical bayes for few-shot learning,” in ECCV, 2020, pp. 404–421.
  • [108] Z. Chen, Y. Fu, Y. Zhang, Y.-G. Jiang, X. Xue, and L. Sigal, “Multi-level semantic feature augmentation for one-shot learning,” TIP, vol. 28, no. 9, pp. 4594–4605, 2019.
  • [109] B. Liu, Y. Cao, Y. Lin, Q. Li, Z. Zhang, M. Long, and H. Hu, “Negative margin matters: Understanding margin in few-shot classification,” in ECCV, 2020, pp. 438–455.
  • [110] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
  • [111] V. G. Satorras and J. B. Estrach, “Few-shot learning with graph neural networks,” in ICLR, 2018.
  • [112] H.-Y. Tseng, H.-Y. Lee, J.-B. Huang, and M.-H. Yang, “Cross-domain few-shot classification via learned feature-wise transformation,” in ICLR, 2020.
  • [113] X. Dong, L. Liu, K. Musial, and B. Gabrys, “Nats-bench: Benchmarking NAS algorithms for architecture topology and size,” CoRR, vol. abs/2009.00437, 2020.
  • [114] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in BMVC, 2016.
  • [115] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015, pp. 448–456.
  • [116] M. R. U. Saputra, P. P. B. de Gusmao, Y. Almalioglu, A. Markham, and N. Trigoni, “Distilling knowledge from a deep pose regressor network,” in ICCV, 2019, pp. 263–272.