Instance-based Max-margin for Practical Few-shot Recognition
Abstract
In order to better mimic the human few-shot learning (FSL) ability and to bring FSL closer to real-world applications, this paper proposes a practical FSL (pFSL) setting. pFSL is based on unsupervised pretrained models (analogous to human prior knowledge) and recognizes many novel classes simultaneously. Compared to traditional FSL, pFSL is simpler in its formulation, easier to evaluate, more challenging and more practical. To cope with the rarity of training examples, this paper proposes IbM2, an instance-based max-margin method that not only suits the new pFSL setting but also works well in traditional FSL scenarios. Based on the Gaussian Annulus Theorem, IbM2 converts random noise applied to the instances into a mechanism to achieve maximum margin in the many-way pFSL (or traditional FSL) recognition task. Experiments with various self-supervised pretraining methods and diverse many- or few-way FSL tasks show that IbM2 almost always leads to improvements compared to its respective baseline methods, and in most cases the improvements are significant. With both the new pFSL setting and the novel IbM2 method, this paper shows that practical few-shot learning is both viable and promising.
1 Introduction
We humans have the ability to learn new concepts from a few exemplars, thanks to both our ability to learn and our previously accumulated knowledge. In stark contrast, learning machines (especially deep learning models) mostly require plentiful training examples to learn just a few (e.g., 5) concepts. Both the vision and learning communities are fully aware of this shortcoming, and few-shot learning (FSL), which has been studied for a long time [40], is our community's effort to counter this weakness.
FSL relies on prior knowledge, too. Currently, the common setting is to set aside a collection of object categories (known as the base set), with many training examples in every base-set category. A model is first pretrained on the base set using either self-supervised learning [33, 26, 42, 22, 27], meta-learning [25, 32, 14, 34, 38], or traditional supervised approaches [28, 12, 9]. The model naturally encodes prior knowledge from the base set. Then, a few novel concepts or object categories (most commonly 5) are to be recognized based on the pretrained model and a few training examples from these new concepts (e.g., 1 or 5 per category). It is worth noting that the base and novel training sets are semantically closely related (e.g., both contain different kinds of birds). This setting is also rather complicated.
There is a clear mismatch between human ability and this FSL setting. We humans can learn many instead of few novel concepts; and human prior knowledge includes common-sense and domain knowledge accumulated from many diverse domains, rather than a limited set of concepts closely related to a specific task.
Although this FSL setting has served our community well and has pushed the technical frontier for years, with the immense advances in deep learning and the emerging practical needs of real-world FSL applications, it is time for a better, simpler, and more practical FSL setting.
Hence we advocate a practical few-shot learning (or pFSL) setting: based on an unsupervised model pretrained from a large number of concepts (e.g., ImageNet [30]), simultaneously learn many new concepts (e.g., 200) with few examples (e.g., 1 or 5 per category).
An ImageNet pretrained model is more analogous to human knowledge than those learned from a base set in traditional FSL, and learning many new concepts makes it more practical in applications. Without the base set, the learning phase of pFSL is simple, too. More importantly, evaluation of algorithms is difficult, complicated, and very time consuming in the current FSL setting [16], but as will be discussed later, pFSL does not suffer from this drawback.
Although an ImageNet pretrained model undoubtedly contains more prior knowledge than a base-set trained one, simultaneously learning many new concepts inevitably makes pFSL much more difficult than traditional FSL. We believe that a more challenging task will help technology advancement, too—the traditional FSL setting is already saturated to a certain extent (c.f. results in [42, 22]).
Hence, we propose a novel approach for pFSL: Instance-based Max-margin (IbM2), which is effective in traditional FSL, too. The max-margin idea has been shown to be effective in improving generalization. Classic methods like SVM [7] build max-margin decision boundaries with only a very small portion of the training examples (i.e., the support vectors). This idea is problematic when we have both very few training examples and a very high dimensionality. Intuitively, the support vectors chosen by SVM are highly likely to be wrong or misleading, because a small training set leads to excessively unstable estimation [16].

In the proposed IbM2, we achieve max-margin at the instance level, instead of at the class level as in SVM. As illustrated in Fig. 1, we create a hypersphere for every training example, with the example being the center of the hypersphere. Then, we can sample as many virtual samples as we want around this hypersphere, and assign the label of the center (original training example) to all its associated virtual examples. By finding a properly maximized radius for the hypersphere and requiring a classifier to correctly classify the virtual examples, we in effect achieve max-margin. Another benefit of IbM2 is that we do not need any special handling for multi-class recognition, while vanilla SVM is for binary classification only. In short, the contributions of this paper are twofold:
-
We propose a practical few-shot learning setting (pFSL): many-way (e.g., 200-way) recognition that uses an unsupervised pretrained model and has no base set.
-
We propose IbM2, an instance-based max-margin algorithm, which naturally suits few-shot, high-dimensional, and multi-class recognition.
As will be shown by extensive experiments, IbM2 consistently improves both traditional FSL and the proposed pFSL. The benefits of pFSL against traditional FSL will be verified by both analyses and experiments, too.
2 Related Work
Few-shot learning aims at recognizing novel categories with the help of some form of prior knowledge. Given the base set as prior knowledge, few-shot learning methods can be roughly divided into two types: those based on meta-learning or transfer-learning.
Meta-learning based FSL. Meta-learning [21] learns a model through a large quantity of episodes consisting of different training and evaluation sets. Given a new task during evaluation, the model predicts by performing the same procedure it used during training. In FSL, the training episodes are randomly sampled from the base set. One line of work [14, 25, 15, 31] focuses on learning general initial weights for quickly adapting the model to unseen categories with a few steps of optimization. For example, [14] explicitly trains the classifier in a model-agnostic way. [25] formulates meta-learning with linear predictors as a differentiable convex optimization problem. [15] learns the task distribution from a probabilistic perspective. Another line of work [32, 38, 34, 43, 45, 1, 11] boosts meta-learning by exploring different distance metrics. For example, [32] aggregates training features to generate prototypes of different classes as the classification head. [45] leverages the Earth Mover's Distance to calculate a structural distance for classification. [1] extracts sets of features from each image to build a set-based matching scheme. [11] proposes a meta-baseline, which performs pretraining and meta-training sequentially.
Transfer-learning based FSL. This line of attack [9, 28, 12, 41, 36] leverages the idea of standard transfer learning: first pretrain a model on the base classes, then fine-tune the model weights with limited training samples from the novel classes. [9] normalizes the features and classification weights and uses their cosine distances as the logits. [28] pretrains with manifold mixup [37] to improve robustness. [12] introduces a baseline for effective transductive fine-tuning. [41] calibrates the training distribution of a few-shot task using statistics from the base set. [36] explores learning a general representation of the base classes in supervised or self-supervised ways with self-distillation.
Self-supervised learning. We stress that in pFSL the pretrained model is unsupervised; the reason is explained in Sec. 3. Self-supervised learning (SSL) is the mainstream approach to train a deep network in an unsupervised manner. Popular SSL methods, whether applied to traditional ResNet [19] models (e.g., BYOL [17], MoCo [18] and SwAV [5]) or to Vision Transformers (ViT) [13] (e.g., MSN [2], DINO [6] and MoCov3 [10]), all produce pretrained models from large-scale datasets. In this paper, we utilize such models as our prior knowledge for FSL.
3 pFSL: Practical Few-Shot Learning
In traditional FSL, the training dataset is split into 1) the base set with $N^b$ labeled images, denoted as $D^b = \{(x_i, y_i)\}_{i=1}^{N^b}$, where $y_i \in C^b$ is the label of instance $x_i$ and $C^b$ is the label space of $D^b$; and 2) the novel set with $N^n$ labeled images, denoted as $D^n = \{(x_j, y_j)\}_{j=1}^{N^n}$, where $y_j \in C^n$ is the label of $x_j$ and $C^n$ is the label space of $D^n$. Note that although $C^b \cap C^n = \emptyset$, the concepts in $C^b$ and $C^n$ are semantically similar to each other. Then, the task is to learn a recognizer generalizing well on unknown samples from $C^n$. By letting $N^n \ll N^b$, the training samples in $D^n$ are few-shot (i.e., few training examples per class). In traditional FSL, $|C^b|$ is relatively large but $|C^n|$ is small. For example, in most cases $|C^n| = 5$, which is called 5-way FSL. The evaluation of traditional FSL is a rather complicated task.
3.1 The pFSL Formulation
To make FSL simpler, more practical, and closer to human’s few-shot learning capability, we propose a new practical few-shot learning (pFSL) paradigm, which is characterized by the following properties:
-
Removing the base set. The prior knowledge obtained through pretraining on the base set is limited, as the base set contains only a limited number of concepts. In some applications, collecting data that share similar concepts with the novel set may be as difficult as collecting more samples for the novel set itself.
-
Pretrained model based on big data. A model pretrained on a large training set containing a wide variety of concepts is analogous to the common-sense and domain knowledge our brains encode, which is useful for few-shot learning in diverse domains.
-
Many-way FSL. In an $N$-way $K$-shot FSL task, we seek $N$ to be large and $K$ to be small, e.g., using all the CUB [39] categories ($N=200$ categories, with $K$ being 1 or 5). This makes pFSL much closer to real-world applications than traditional FSL.
-
Unsupervised pretraining. In pFSL, we require the pretraining to be unsupervised. Since a large-scale dataset (e.g., ImageNet) may contain the concepts involved in few-shot learning (e.g., the birds in CUB), avoiding the use of labels during pretraining is necessary to make the evaluation of FSL algorithms fairer.
These properties distinguish pFSL from traditional FSL and some recent variants of it [22, 28, 9, 12, 23].
3.2 Simplicity and Better Evaluation
The pFSL framework is clearly simpler than traditional FSL, and it leads not only to benefits in the training phase, but more importantly to better evaluation of FSL algorithms.
Traditional $N$-way $K$-shot FSL typically uses $N=5$ and $K=1$ or $5$, which inevitably leads to an unreliable estimate of the test accuracy. Hence, the common practice is to sample a large number of episodes ($E$), find the accuracy in each episode, and report the average accuracy along with its 95% confidence interval. Note that with $E$ episodes, the confidence interval shrinks roughly as $1.96\,\sigma/\sqrt{E}$, while the standard deviation $\sigma$ of the accuracy within the episodes does not. For example, a typical result is: average accuracy 64.93%, 95% confidence interval 0.18%, but the standard deviation of accuracy across episodes is 9.18% [16]. In short, one single evaluation of traditional FSL requires training and testing $E$ times, and the evaluation results are still unreliable. Because of the complicated setup, both learning and evaluation of traditional FSL algorithms are not only costly, but often carried out in slightly different ways, which renders fair comparison even more difficult.
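These numbers also reveal how many episodes such an evaluation requires. A back-of-the-envelope check (our own calculation from the quoted numbers, assuming the usual normal approximation $\mathrm{CI}_{95\%} = 1.96\,\sigma/\sqrt{E}$) gives

$$E = \left(\frac{1.96\,\sigma}{\mathrm{CI}_{95\%}}\right)^2 \approx \left(\frac{1.96 \times 9.18}{0.18}\right)^2 \approx 10{,}000\,,$$

i.e., on the order of ten thousand episodes for a single evaluation, yet the episode-to-episode spread remains large.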
As a stark contrast, pFSL is much simpler, and as the experimental results in Table 1 will show, the estimate of the average accuracy is reliable: the standard deviation is small (mostly only a few tenths of a percentage point). This reliable estimation comes from the fact that pFSL is many-way (i.e., $N$ is large). Hence, instead of running hundreds or thousands of episodes, a few episodes (e.g., 3) are enough to evaluate pFSL, which not only means savings in computation, but also more reliable evaluation of algorithms.
4 IbM2: Instance-based Max-margin
In pFSL, we have a pretrained model (unsupervised trained on a large set) and a novel set. There is no base set any more, hence we will simply denote the novel set as $D = \{(x_i, y_i)\}_{i=1}^{NK}$ (that is, dropping the superscript). The label set of $D$ is $C$. In an $N$-way $K$-shot pFSL task, we have $|C| = N$ and there will be $K$ training examples in each of the $N$ novel classes, hence $|D| = NK$. Note that we expect $N$ to be large (i.e., many-way FSL) and $K$ to be small (e.g., 1 or 5).
4.1 Generating Virtual Samples
Since $K$ (the number of training images in each class) is still small but the number of classes ($N$) becomes much larger (e.g., from 5 in traditional FSL to 200 in pFSL), we expect the challenge in learning a pFSL model to be greater than that in traditional FSL. This is verified by our experiments (c.f. the results in Table 1 vs. those in Table 3).
Virtual examples, or examples that are sampled based on the original training examples, have been repeatedly proven useful when the number of training examples is small, e.g., in traditional few-shot learning [41] or long-tailed recognition (where the tail classes have few training images) [20].
The common idea in generating virtual samples is to make them follow the underlying distributions of the classes. However, in few-shot learning this requirement is very difficult to satisfy even with the help of techniques such as distribution calibration (DC) in [41], because there are only 1 or 5 examples per class.
In contrast, we do not try to sample from any underlying distribution, but instead try to achieve a max-margin decision boundary, as illustrated in Fig. 1. Specifically, for every training instance $x_i$, we i.i.d. generate $M$ virtual samples $v_{i,j}$ ($1 \le j \le M$) which are distributed around the shell of a hypersphere centered at $x_i$'s feature and whose radius is controlled by a parameter $\sigma$. Note that the same $\sigma$ is shared by all training examples.
Technically, let $f_i$ be the feature vector produced by the pretrained model $\phi$, i.e., $f_i = \phi(x_i)$, obtained from a forward calculation. Then, a noise vector $\epsilon_{i,j}$ is randomly sampled from the standard multi-dimensional normal distribution $\mathcal{N}(\mathbf{0}, I_d)$, where $d$ is the length of $f_i$ and $I_d$ is a $d \times d$ identity matrix. By adding a scaled version of the noise vector to the $i$-th training example's feature, we obtain $v_{i,j}$ (the $j$-th virtual example for $x_i$):
$$v_{i,j} = f_i + \sigma\,\epsilon_{i,j}\,. \qquad (1)$$
Its label is $y_i$ (the label of $x_i$), and $\sigma$ is a positive number.
Interesting properties exist in our virtual examples:
-
Hyperspheres of different training examples can overlap, so long as they belong to the same class (c.f. Fig. 1).
-
We require the virtual examples (rather than the original training examples) to be correctly classified; that is, the original training examples are discarded and only the virtual ones are used for training.
-
The virtual examples lie around the shell, not in the interior of the hypersphere. According to the Gaussian Annulus Theorem [4], almost all the probability mass of a $d$-dimensional spherical Gaussian with unit variance is concentrated in a thin annulus of width $O(1)$ around radius $\sqrt{d}$ when $d$ is large. Since $d$ is indeed large here (hundreds or thousands of dimensions for the backbones we use), the virtual examples reside around the shell, whose radius is roughly $\sigma\sqrt{d}$. Hence, if we maximize the radius parameter $\sigma$ but require that the virtual examples are correctly classified, as Fig. 1 shows, we in essence achieve margin maximization in an instance-based manner.
Hence, the proposed method is named Instance-based Max-margin, abbreviated as IbM2.
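The concentration property underlying this argument is easy to verify numerically. The following sketch (purely illustrative, not the paper's code) samples standard Gaussian noise at several dimensionalities and confirms that the norms cluster around $\sqrt{d}$ with an $O(1)$ spread:

```python
# Numerical check of the Gaussian Annulus Theorem: for epsilon ~ N(0, I_d),
# ||epsilon|| concentrates around sqrt(d) with an O(1) spread when d is large.
import numpy as np

rng = np.random.default_rng(0)
for d in (16, 384, 2048):                      # 384 / 2048: typical ViT-S / ResNet50 feature sizes
    eps = rng.standard_normal((10000, d))      # 10,000 noise vectors
    norms = np.linalg.norm(eps, axis=1)
    print(f"d={d:5d}  sqrt(d)={np.sqrt(d):6.1f}  mean norm={norms.mean():6.1f}  std={norms.std():.2f}")
# Consequently, the scaled noise in Eq. 1 places virtual examples on a thin shell
# of radius roughly sigma * sqrt(d) around each training feature.
```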
4.2 Ellipsoidal Noise Generation
One issue with the above isotropic noise sampling is that it ignores the structural properties of the training examples. For instance, if one feature dimension varies over a much wider range than another, isotropic sampling is clearly unsuitable.
To overcome this drawback, we calculate a range vector $\boldsymbol{r} \in \mathbb{R}^d$ using all the original training examples regardless of their class labels, where the $k$-th entry $r_k$ is the sample standard deviation of the $k$-th feature dimension. Then, the IbM2 virtual examples are generated as (replacing Eq. 1)
$$v_{i,j} = f_i + \sigma\,\boldsymbol{r} \odot \epsilon_{i,j}\,, \qquad (2)$$
where $\odot$ denotes element-wise multiplication. The final virtual example generation process is illustrated in Fig. 2. Note that estimating $\boldsymbol{r}$ is dramatically easier than estimating the underlying distribution of every novel class.
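For concreteness, the generation of Eq. 2 can be sketched in a few lines of NumPy (a minimal illustration under our notation, assuming features are already extracted; it is not the authors' released code). Setting the range vector to all ones recovers the isotropic sampling of Eq. 1.

```python
import numpy as np

def generate_virtual_examples(features, labels, sigma, M, rng=None):
    """features: (NK, d) array of pre-extracted training features; labels: (NK,).
    Returns M virtual examples per training instance, each inheriting its center's label."""
    rng = rng if rng is not None else np.random.default_rng()
    NK, d = features.shape
    r = features.std(axis=0, ddof=1)                    # range vector: per-dimension sample std (class-agnostic)
    noise = rng.standard_normal((NK, M, d))             # epsilon ~ N(0, I_d), i.i.d. per virtual example
    virtual = features[:, None, :] + sigma * r * noise  # Eq. 2 (r broadcasts over instances and samples)
    return virtual.reshape(NK * M, d), np.repeat(labels, M)
```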

It is worth noting that this ellipsoidal noise generation is unsupervised, which distinguishes it from distribution calibration (DC) [41]. DC samples virtual examples based on an estimated covariance structure, which is estimated from the few novel-class training examples in a sample-wise manner and calibrated using the base set. As aforementioned, we argue that it makes sense to get rid of the dependency on a base set, in which case DC is not applicable.
More importantly, the reliability of DC's estimation of a dense covariance matrix is understandably much lower than our estimation of the range indicator vector $\boldsymbol{r}$: we estimate $d$ values using all $NK$ training examples, while DC estimates on the order of $d^2$ values using only 1 example plus the base set. The base-set calibration helps DC, but its estimation is clearly still unreliable, as shown by experiments in [41, 46].
4.3 Binary Search to Find the Largest Possible $\sigma$
Obviously we want $\sigma$ to be as large as possible to achieve max-margin, under the condition that the virtual examples are classified correctly. Hence, we design a simple binary search to find the optimal value of $\sigma$.
For any given $\sigma$ value, we can generate virtual examples $v_{i,j}$ ($1 \le i \le NK$, $1 \le j \le M$) using Eq. 2. We denote the virtual training set for this particular value of $\sigma$ as $D_\sigma$. Then, we train a linear classifier to classify $D_\sigma$ and obtain its training accuracy. If the training accuracy is too high (larger than a threshold $\tau$), $\sigma$ is enlarged; otherwise $\sigma$ is shrunk. The binary search maintains a search range; when the search range is tight enough (less than 0.05 in Algorithm 1), the search terminates, and we have found the optimal value $\sigma^\star$.
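A sketch of this binary search is given below (our own illustration: the paper trains an Adam-optimized linear head, for which scikit-learn's logistic regression stands in here, and the initial search range is an assumption; `generate_virtual_examples` is the function sketched earlier in Sec. 4.2):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def search_sigma(features, labels, M=200, tau=0.9, lo=0.0, hi=1.0, tol=0.05, seed=0):
    """Binary-search the largest sigma whose virtual set can still be classified
    with training accuracy above tau.  The initial range [lo, hi] is illustrative."""
    rng = np.random.default_rng(seed)
    while hi - lo > tol:                              # stop when the search range is tight enough
        sigma = (lo + hi) / 2.0
        v_x, v_y = generate_virtual_examples(features, labels, sigma, M, rng)
        clf = LogisticRegression(max_iter=500).fit(v_x, v_y)
        if clf.score(v_x, v_y) > tau:                 # training accuracy still high: enlarge sigma
            lo = sigma
        else:                                         # virtual set too hard: shrink sigma
            hi = sigma
    return lo                                         # the (approximately) largest admissible sigma
```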
We have a few notes on finding $\sigma^\star$:
-
Unlike DC [41], we do not need a validation set.
-
The threshold $\tau$ does not need to be 1 (100%). In SVM learning, the slack-variable trick allows some support vectors to not attain the maximum margin. Similarly, we just need $\tau$ to be close to 1 (0.9 in our experiments).
Finally, the IbM2 pipeline is simple: first, use Algorithm 1 to find $\sigma^\star$; second, generate the virtual dataset $D_{\sigma^\star}$; and third, learn a linear classifier on $D_{\sigma^\star}$.
Note that the pretrained model is frozen and will not be updated.
5 Experimental Results
We evaluate IbM2 in both setups: the proposed many-way pFSL scenario, and the traditional few-way FSL setting, where we compare against state-of-the-art FSL methods to further validate its effectiveness.
5.1 Implementation Details
For both the searching and training stages of IbM2, all features extracted by the backbone were $\ell_2$-normalized first. When learning the classification head (both during the binary search and after $\sigma^\star$ was determined), we used Adam as the optimizer and the label-smoothed cross-entropy loss [35] as the learning objective. The pretrained models were trained using various self-supervised learning methods on the ImageNet-1K [30] training set.
During the binary search, for all the experiments we report, we always set $M$ to 200, and $\tau$ to 0.9 in pFSL and 0.999 in the traditional FSL setting, except when carrying out ablation experiments on these hyperparameters.
Full implementation details are included in the appendix.
5.2 Experiments in the pFSL Setting
Datasets and Evaluation Setup. We explored two common many-way classification datasets, ImageNet-1K [30] and CUB-200-2011 (CUB) [39]. ImageNet-1K contains about 1.28 million training images from 1,000 classes and 50,000 images for evaluation. CUB is a fine-grained recognition dataset composed of 11,788 images belonging to 200 classes of birds, 5,994 for training and 5,794 for evaluation.
For both datasets, we randomly sampled the training sets for pFSL by selecting $K$ images from every class (i.e., $K$-shot) and using all the classes (1000 and 200, respectively) to form the novel set. $K$ is chosen from $\{1, 2, 3, 4, 5, 8, 16\}$. The evaluation was carried out on the full validation (for ImageNet) or test (for CUB) set.
Utilizing various self-supervised backbone networks pretrained on ImageNet, we compare two sets of results: one in which the linear classifier was learned using the original training set, the other using IbM2's virtual examples. We report the top-1 accuracy and its standard deviation computed over 3 randomly sampled novel sets.
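The evaluation protocol can be summarized by the following sketch (hedged: the exact sampling code is not given in the paper; `train_fn` stands for either the baseline or the IbM2 training procedure):

```python
import numpy as np

def sample_k_shot(labels, K, rng):
    """Indices of K randomly chosen training images per class."""
    return np.concatenate([rng.choice(np.flatnonzero(labels == c), size=K, replace=False)
                           for c in np.unique(labels)])

def evaluate_pfsl(train_x, train_y, test_x, test_y, K, train_fn, seeds=(0, 1, 2)):
    accs = []
    for s in seeds:                                   # 3 randomly sampled novel sets
        idx = sample_k_shot(train_y, K, np.random.default_rng(s))
        clf = train_fn(train_x[idx], train_y[idx])
        accs.append((clf.predict(test_x) == test_y).mean())
    return float(np.mean(accs)), float(np.std(accs))  # top-1 accuracy: mean and std
```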
| Dataset | Pretraining Method | Backbone | IbM2 | 1-shot | 2-shot | 3-shot | 4-shot | 5-shot | 8-shot | 16-shot |
|---|---|---|---|---|---|---|---|---|---|---|
| ImageNet-1K | DINO [6] | ViT-S/16 | ✗ | 39.2 ± 0.3 | 49.2 ± 0.2 | 54.1 ± 0.4 | 56.9 ± 0.2 | 58.6 ± 0.1 | 62.0 ± 0.2 | 65.5 ± 0.2 |
| | | | ✓ | 39.3 ± 0.3 | 49.5 ± 0.3 | 54.7 ± 0.4 | 57.7 ± 0.1 | 59.3 ± 0.2 | 62.4 ± 0.2 | 65.9 ± 0.1 |
| | MoCov3 [10] | ViT-S/16 | ✗ | 33.1 ± 0.6 | 42.4 ± 0.3 | 47.2 ± 0.4 | 49.8 ± 0.4 | 51.6 ± 0.3 | 55.2 ± 0.2 | 59.2 ± 0.3 |
| | | | ✓ | 34.4 ± 0.6 | 43.8 ± 0.3 | 48.7 ± 0.4 | 51.5 ± 0.3 | 53.0 ± 0.2 | 56.2 ± 0.1 | 60.3 ± 0.2 |
| | MSN [2] | ViT-S/16 | ✗ | 47.9 ± 0.1 | 56.2 ± 0.4 | 59.8 ± 0.3 | 61.7 ± 0.1 | 62.8 ± 0.2 | 65.3 ± 0.3 | 68.1 ± 0.1 |
| | | | ✓ | 47.9 ± 0.1 | 56.6 ± 0.5 | 60.5 ± 0.2 | 62.5 ± 0.1 | 63.6 ± 0.2 | 66.0 ± 0.3 | 68.5 ± 0.1 |
| | | ViT-B/4 | ✗ | 53.9 ± 0.2 | 64.5 ± 0.4 | 68.9 ± 0.3 | 70.9 ± 0.2 | 72.3 ± 0.3 | 74.1 ± 0.1 | 75.7 ± 0.1 |
| | | | ✓ | 54.3 ± 0.1 | 65.2 ± 0.6 | 69.6 ± 0.2 | 71.5 ± 0.2 | 72.7 ± 0.3 | 74.7 ± 0.0 | 76.4 ± 0.1 |
| | | ViT-L/7 | ✗ | 57.8 ± 0.3 | 66.6 ± 0.5 | 69.9 ± 0.5 | 71.7 ± 0.4 | 72.3 ± 0.2 | 73.8 ± 0.1 | 75.3 ± 0.1 |
| | | | ✓ | 58.2 ± 0.4 | 66.9 ± 0.5 | 70.2 ± 0.6 | 71.9 ± 0.5 | 72.7 ± 0.2 | 74.3 ± 0.1 | 76.1 ± 0.0 |
| | SimCLR [8] | ResNet50 | ✗ | 24.2 ± 0.3 | 33.3 ± 0.1 | 38.7 ± 0.4 | 41.6 ± 0.2 | 43.5 ± 0.0 | 47.2 ± 0.1 | 51.9 ± 0.1 |
| | | | ✓ | 24.2 ± 0.3 | 33.4 ± 0.2 | 39.0 ± 0.4 | 42.0 ± 0.3 | 44.2 ± 0.1 | 48.0 ± 0.0 | 52.7 ± 0.1 |
| | BYOL [17] | ResNet50 | ✗ | 27.7 ± 0.2 | 37.1 ± 0.1 | 42.5 ± 0.4 | 45.9 ± 0.3 | 48.0 ± 0.1 | 51.8 ± 0.1 | 57.1 ± 0.1 |
| | | | ✓ | 27.6 ± 0.2 | 37.5 ± 0.1 | 43.3 ± 0.4 | 46.9 ± 0.2 | 49.1 ± 0.1 | 53.3 ± 0.2 | 58.0 ± 0.1 |
| CUB | DINO [6] | ViT-S/16 | ✗ | 35.9 ± 1.4 | 49.6 ± 0.3 | 57.4 ± 1.0 | 61.4 ± 0.8 | 65.9 ± 0.9 | 71.8 ± 0.6 | 78.2 ± 0.1 |
| | | | ✓ | 36.2 ± 1.4 | 50.2 ± 0.6 | 57.9 ± 1.0 | 62.4 ± 0.5 | 66.7 ± 0.9 | 72.6 ± 0.8 | 79.0 ± 0.1 |
| | MoCov3 [10] | ViT-S/16 | ✗ | 18.5 ± 0.7 | 27.2 ± 0.2 | 35.5 ± 1.0 | 40.0 ± 0.6 | 45.3 ± 0.9 | 54.1 ± 0.4 | 65.3 ± 0.3 |
| | | | ✓ | 19.5 ± 0.4 | 27.6 ± 0.2 | 35.6 ± 0.7 | 40.2 ± 0.4 | 46.2 ± 1.1 | 55.9 ± 0.4 | 66.8 ± 0.5 |
| | MSN [2] | ViT-L/7 | ✗ | 37.5 ± 1.1 | 49.4 ± 0.4 | 58.8 ± 0.8 | 62.7 ± 0.9 | 67.2 ± 0.3 | 73.8 ± 0.8 | 80.4 ± 0.2 |
| | | | ✓ | 37.7 ± 1.2 | 50.1 ± 0.5 | 59.0 ± 0.8 | 63.6 ± 0.5 | 67.9 ± 0.7 | 74.5 ± 0.5 | 81.2 ± 0.1 |
| | BYOL [17] | ResNet50 | ✗ | 15.1 ± 0.8 | 21.3 ± 0.6 | 28.4 ± 0.2 | 32.3 ± 0.2 | 35.0 ± 1.1 | 44.1 ± 0.1 | 55.1 ± 0.5 |
| | | | ✓ | 16.4 ± 0.9 | 22.6 ± 0.6 | 30.1 ± 0.6 | 34.7 ± 0.5 | 37.1 ± 1.3 | 45.0 ± 0.4 | 57.6 ± 0.7 |
Main Results. We evaluated two types of backbones: Vision Transformers (ViT) [13] and ResNet50 [19]. The pretraining methods include DINO [6], MoCov3 [10], MSN [2], SimCLR [8] and BYOL [17]. Table 1 shows that in almost all cases, IbM2 benefited the few-shot learning process and improved the top-1 accuracy.
When the number of shots ($K$) is very small (e.g., 1 or 2), the improvement of IbM2 over the baseline is modest (below 1%) on average. However, as the number of training shots grows larger, the level of accuracy improvement increases gradually, reaching roughly 1% and even exceeding 2% in some cases.
A useful observation from these results is that in pFSL both the baseline and IbM2 have small standard deviations, that is, the evaluation is more robust. Hence, we no longer need 500 episodes in pFSL—3 is enough.
| Pretraining Method | Backbone | IbM2 | Top-1 |
|---|---|---|---|
| DINO [6] | ViT-S/8 | ✗ | 70.1 |
| | | ✓ | 70.6 |
| | ViT-S/16 | ✗ | 64.7 |
| | | ✓ | 65.2 |
| MoCov3 [10] | ViT-S/16 | ✗ | 58.1 |
| | | ✓ | 59.1 |
| MSN [2] | ViT-S/16 | ✗ | 67.3 |
| | | ✓ | 68.0 |
| | ViT-B/4 | ✗ | 75.4 |
| | | ✓ | 76.1 |
| | ViT-L/7 | ✗ | 74.8 |
| | | ✓ | 75.6 |
| SimCLR [8] | ResNet50 | ✗ | 50.5 |
| | | ✓ | 51.3 |
| BYOL [17] | ResNet50 | ✗ | 55.7 |
| | | ✓ | 56.9 |
Semi-supervised learning. In the self-supervised learning literature, it is common practice to learn a backbone on the ImageNet-1K training images in an unsupervised manner, and then use a small portion (e.g., 1%) of the training data, now with labels, to train a classifier. This semi-supervised learning task can also be viewed as an instance of our pFSL setting. Hence, we also report the 1% ImageNet-1K semi-supervised learning results (on average 12 labeled training images per class) in Table 2. IbM2 consistently improved various self-supervised models and backbone architectures in all cases, with 0.5% to 1.2% top-1 accuracy increases.
| Dataset | Pretraining & Meta-training | Backbone | IbM2 | 1-shot mean | σ | worst | worst-10 | worst-100 | 5-shot mean | σ | worst | worst-10 | worst-100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mini-ImageNet | PMF [22] | ViT-S/16 | ✗ | 93.4 | 6.1 | 61.3 | 72.4 | 83.4 | 98.4 | 1.8 | 87.7 | 91.4 | 95.6 |
| | | | ✓ | 94.4 | 5.6 | 63.2 | 74.7 | 85.1 | 98.6 | 1.8 | 86.1 | 91.3 | 95.9 |
| | | ViT-B/16 | ✗ | 94.9 | 5.4 | 66.9 | 75.5 | 86.0 | 98.8 | 1.6 | 89.1 | 92.9 | 96.3 |
| | | | ✓ | 95.6 | 5.1 | 69.6 | 77.6 | 87.1 | 98.9 | 1.6 | 89.3 | 93.0 | 96.3 |
| | | ResNet50 | ✗ | 94.9 | 5.3 | 67.2 | 75.2 | 86.3 | 98.5 | 1.7 | 89.3 | 91.9 | 95.8 |
| | | | ✓ | 95.3 | 5.2 | 66.1 | 74.6 | 86.9 | 98.8 | 1.6 | 89.3 | 92.7 | 96.3 |
| | [28] | WRN-28-10 | ✗ | 65.4 | 10.0 | 32.0 | 39.3 | 50.7 | 82.3 | 7.1 | 52.0 | 62.5 | 71.8 |
| | | | ✓ | 65.8 | 10.0 | 32.5 | 40.5 | 51.1 | 82.9 | 7.0 | 53.3 | 63.3 | 72.3 |
| | Meta-Baseline [11] | ResNet12 | ✗ | 62.9 | 10.4 | 26.7 | 34.3 | 47.4 | 79.0 | 7.5 | 54.7 | 58.0 | 67.9 |
| | | | ✓ | 63.0 | 10.2 | 32.3 | 35.6 | 47.9 | 79.5 | 7.3 | 54.9 | 58.5 | 68.7 |
| CIFAR-FS | PMF [22] | ViT-S/16 | ✗ | 87.9 | 8.3 | 51.7 | 61.7 | 74.8 | 95.5 | 4.2 | 74.7 | 81.8 | 88.5 |
| | | | ✓ | 89.0 | 8.1 | 59.2 | 64.1 | 75.9 | 95.7 | 4.2 | 73.3 | 81.7 | 88.7 |
| | | ViT-B/16 | ✗ | 89.8 | 8.1 | 55.2 | 64.8 | 76.8 | 96.0 | 4.1 | 73.1 | 82.2 | 89.2 |
| | | | ✓ | 90.3 | 8.0 | 58.9 | 65.5 | 77.4 | 96.0 | 4.1 | 74.1 | 81.8 | 89.2 |
| | | ResNet50 | ✗ | 81.7 | 10.4 | 35.2 | 51.6 | 65.8 | 91.3 | 5.5 | 71.2 | 75.1 | 82.7 |
| | | | ✓ | 82.3 | 10.3 | 36.3 | 51.1 | 66.5 | 91.6 | 5.3 | 73.3 | 76.0 | 83.2 |
| | [28] | WRN-28-10 | ✗ | 75.2 | 10.8 | 44.0 | 46.7 | 59.2 | 87.7 | 7.1 | 54.7 | 67.9 | 76.8 |
| | | | ✓ | 75.5 | 10.8 | 42.4 | 45.5 | 59.2 | 87.7 | 7.0 | 54.9 | 68.1 | 77.0 |
| | Meta-Baseline [11] | ResNet12 | ✗ | 72.7 | 11.6 | 32.0 | 42.9 | 56.1 | 84.8 | 7.5 | 54.7 | 65.1 | 73.6 |
| | | | ✓ | 72.3 | 11.5 | 32.8 | 42.3 | 55.7 | 85.1 | 7.5 | 56.3 | 65.4 | 74.1 |
5.3 Experiments in the Traditional FSL Setting
Datasets and Evaluation Setup. We evaluated IbM2 in the traditional FSL setup, too. We conducted experiments on two standard benchmark datasets, mini-ImageNet [38] and CIFAR-FS [3]. mini-ImageNet consists of 100 categories selected from ImageNet-1K, which are further split into 64 base, 16 validation and 20 novel categories following [29]. CIFAR-FS is created by randomly shuffling the 100 categories of CIFAR-100 [24] into 64 base, 16 validation and 20 novel categories. Note that in order to perform traditional few-shot learning, only the novel split of these two datasets is required to sample many 5-way 1- or 5-shot episodes. We report the metrics advocated in [16]: the average, the worst-case, the average of the worst 10, and the average of the worst 100 episode accuracies, together with the standard deviation σ over 500 episodes, for a comprehensive evaluation. To make the results more reliable, we average the value of each metric over 5 runs with different random seeds. We adopted PMF [22], the method of [28] and Meta-Baseline [11] as the pretraining or meta-training approaches to obtain the backbone weights from the base set.
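For clarity, the metrics of [16] used in Table 3 can be computed as follows (a hedged reconstruction from their definitions, not the original evaluation code):

```python
import numpy as np

def episode_metrics(episode_accs):
    """episode_accs: accuracies of the 500 sampled episodes (in %)."""
    accs = np.sort(np.asarray(episode_accs))          # ascending: worst episodes first
    return {"mean": accs.mean(),                      # average accuracy
            "sigma": accs.std(),                      # episode-wise standard deviation
            "worst": accs[0],                         # worst-case episode
            "worst10": accs[:10].mean(),              # average of the worst 10 episodes
            "worst100": accs[:100].mean()}            # average of the worst 100 episodes
```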
Main Results. As shown in Table 3, by plugging IbM2 into classifier learning, the average accuracy improved in all cases except a single one. Moreover, the standard deviation was reduced or remained unchanged by IbM2 in all cases. As [16] advocated, a smaller σ means the learning process is more stable.
[16] also advocates that the worst-case episode accuracy is more important than the average accuracy over all episodes. Table 3 shows that IbM2 almost always leads to higher worst-case or near-worst-case (worst-10, worst-100) accuracy. Furthermore, the average gain in mean accuracy is 0.4%, which is far less than the average gain in worst-case accuracy (1.3%). As for the worst-10 and worst-100 accuracies, their gains are very similar, both around 0.5%. These observations demonstrate that IbM2 boosts the recognition accuracy of few-shot learning in almost every scenario, and especially in the worst case.
5.4 Ablation Studies
We performed ablation studies for IbM2 in the pFSL setting on ImageNet-1K.
Instance-based vs. class-based max-margin. IbM2 seeks max-margin in an instance-based manner. In this study, we compare it with the well-known support vector machine (SVM), which achieves max-margin in a class-based manner. We experimented with a linear-kernel SVM using the LIBSVM [7] software package. By iterating SVM's regularization hyperparameter $C$ over a set of candidate values, we took the best accuracy among them to compare with our IbM2 method.
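A sketch of this class-based baseline is shown below (illustrative only: the paper uses LIBSVM, while scikit-learn's `LinearSVC`, which handles multi-class via one-vs-rest, stands in here, and the candidate $C$ values are assumptions):

```python
from sklearn.svm import LinearSVC

def best_linear_svm_accuracy(train_x, train_y, test_x, test_y,
                             C_grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    best = 0.0
    for C in C_grid:                                   # sweep the regularization hyperparameter
        clf = LinearSVC(C=C, max_iter=10000).fit(train_x, train_y)
        best = max(best, clf.score(test_x, test_y))    # keep the best accuracy over C
    return best
```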
As shown in Table 4, the proposed IbM2 method outperforms SVM with the best $C$ by a significant margin in all cases. Specifically, the recognition accuracy improves by 2% on average, and by more than 3% in the extremely scarce 1-shot case. These results demonstrate that the instance-based margin generated by IbM2 helps learn a more robust classification boundary, especially when the empirical training distribution is drastically shifted from the true one.
| Method | 1-shot | 2-shot | 3-shot | 4-shot | 5-shot | 8-shot | 16-shot |
|---|---|---|---|---|---|---|---|
| SVM | 54.8 | 64.8 | 68.2 | 70.0 | 70.8 | 72.8 | 74.7 |
| IbM2 | 58.2 | 66.9 | 70.2 | 71.9 | 72.7 | 74.3 | 76.1 |
| $M$ | 1-shot | 2-shot | 3-shot | 4-shot | 5-shot | 8-shot | 16-shot |
|---|---|---|---|---|---|---|---|
| 1 | 32.9 | 42.7 | 47.8 | 50.8 | 52.5 | 55.9 | 60.1 |
| 10 | 33.7 | 43.6 | 48.5 | 51.4 | 52.9 | 56.2 | 60.3 |
| 50 | 33.8 | 43.7 | 48.6 | 51.5 | 53.0 | 56.2 | 60.3 |
| 200 | 34.4 | 43.8 | 48.7 | 51.5 | 53.0 | 56.2 | 60.3 |
| 400 | 34.5 | 43.8 | 48.6 | 51.5 | 53.0 | 56.2 | 60.3 |
Sensitivity of $M$. As described in Sec. 4, each original training example is turned into $M$ virtual examples to form the training set $D_{\sigma^\star}$. In this part, we study the effect of the hyperparameter $M$. As shown in Table 5, when the training shots are highly limited, increasing $M$ significantly improves the recognition accuracy. As the number of training shots gets larger, the accuracy difference between a large $M$ (400) and a small one (1) gradually decreases from 1.6% to 0.2%.
When training samples are extremely scarce, it is difficult to model the correct Gaussian distribution with a few samples in a high-dimensional space. Therefore, increasing $M$ is necessary for a good estimate. When more training shots are available, the need for a large number of sampling times is reduced, hence the accuracy difference between different $M$ values becomes smaller, too. Based on these results, we set $M=200$ in all our experiments.
| Sampling Schema | 1-shot | 2-shot | 3-shot | 4-shot | 5-shot | 8-shot | 16-shot |
|---|---|---|---|---|---|---|---|
| None (baseline) | 33.1 | 42.4 | 47.2 | 49.8 | 51.6 | 55.2 | 59.2 |
| Spherical | 33.0 | 42.1 | 47.1 | 50.2 | 51.8 | 55.3 | 59.8 |
| Ellipsoidal | 34.4 | 43.8 | 48.7 | 51.5 | 53.0 | 56.2 | 60.3 |
Ellipsoidal vs. isotropic sampling. We have described two noise sampling strategies to generate virtual examples. The ellipsoidal one is preferred over the isotropic one, and is used in IbM2. In this part, we evaluate the effectiveness of this sampling schema by experimentally comparing these two sampling strategies. The isotropic sampling strategy is denoted as Spherical in Table 6.
As shown in Table 6, the accuracy of the ellipsoidal sampling schema is consistently better than both its spherical counterpart and the baseline. Moreover, in low-shot cases ($K \le 3$), sampling with a spherical Gaussian slightly degraded the recognition accuracy. The reason may be that the standard deviations of different channels computed from the training features vary a lot, so treating all channels as independent and identically distributed can make the sampled points shift significantly away from the underlying distribution. However, as the number of training samples increases, both isotropic and ellipsoidal sampling outperformed the baseline.
Based on these observations, we chose to adopt the ellipsoidal noise sampling (Eq. 2) in IbM2.
On what classes can IbM2 help? Finally, we study which classes benefit from IbM2—are most classes improved, or only a small portion of them? After the training of the baseline and of IbM2 finished, we calculated the class-wise validation accuracy for every class on ImageNet-1K, as
$$a_c = \frac{m_c}{n_c}\,, \qquad (3)$$
where $a_c$ is the per-class accuracy (recall) of the $c$-th class, $n_c$ is the number of test samples from class $c$ ($n_c = 50$ in ImageNet-1K) and $m_c$ is the number of correctly classified samples of class $c$. For every class, we compute the difference of per-class accuracy (that of IbM2 minus that of the baseline). A positive difference is an accuracy gain and a negative one is in fact an accuracy loss.
We then sort the baseline’s per-class accuracy in the ascending order. Following this order, we rearrange the per-class accuracy gains. The 1000 accuracy gains are divided into 10 histogram bins in the rearranged order, and inside each bin the 100 per-class accuracy gains are averaged to obtain the accuracy gain of that bin. Fig. 3 plots the average accuracy gain in each bin.
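The per-class analysis described above can be reproduced by a short routine like the following (a minimal sketch with illustrative variable names):

```python
import numpy as np

def binned_accuracy_gains(y_true, pred_base, pred_ibm2, num_bins=10):
    classes = np.unique(y_true)
    acc_base = np.array([(pred_base[y_true == c] == c).mean() for c in classes])   # Eq. 3 per class
    acc_ibm2 = np.array([(pred_ibm2[y_true == c] == c).mean() for c in classes])
    order = np.argsort(acc_base)                      # sort classes by baseline per-class accuracy
    gains = (acc_ibm2 - acc_base)[order]              # rearrange the gains in the same order
    bins = np.array_split(gains, num_bins)            # 10 bins of ~100 classes each on ImageNet-1K
    return np.array([b.mean() for b in bins])         # average accuracy gain per bin (Fig. 3)
```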
We find that IbM2 improves the average per-class accuracy in almost every histogram bin (i.e., the bars lie above the "0.0" horizontal line on the $y$-axis). More detailed numbers reveal that the recognition accuracy of more than 65% of the 1,000 categories is improved in all shots. Furthermore, Fig. 3 shows a trend: classes with higher baseline accuracy generally gain more from IbM2, up to a point. When the task is too difficult for the baseline, IbM2 can help but its usefulness is limited; when the baseline is already accurate enough (categorical subset 9 and later), the room for further improvement is small again. IbM2 is the most useful in the middle range.

6 Conclusions and Limitations
In this paper, we advocated pFSL, a new and practical few-shot learning setting, and proposed IbM2, an instance-based max-margin method to improve few-shot learning. With recent technological advancements, it is time to upgrade the traditional FSL setting. We need a simpler, easier-to-evaluate, more challenging and more practical FSL setting, and the proposed pFSL task satisfies these requirements.
To face the challenges related to the scarcity of training examples, we proposed IbM2. Instead of maximizing a class-level margin, IbM2 performs instance-based margin maximization. It achieved consistent and significant improvements in both the pFSL and the traditional FSL settings. IbM2 is simple and reliable: it introduces only two hyperparameters ($M$ and $\tau$), and their default values worked well in experiments across different architectures and tasks.
As the experiments indicated, IbM2 works best in the middle range of task difficulty. In a $K$-shot task where $K$ is extremely small, there is still a lot of room for improvement. Furthermore, IbM2 freezes the backbone and only learns the linear classifier. In the future, we plan to further improve the very-small-shot cases (e.g., $K=1$) and to better tune the backbone in few-shot learning.
References
- [1] Arman Afrasiyabi, Hugo Larochelle, Jean-François Lalonde, and Christian Gagné. Matching feature sets for few-shot image classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9004–9014, 2022.
- [2] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked Siamese Networks for label-efficient learning. In European Conference on Computer Vision, volume 13691 of LNCS, page 456–473. Springer, 2022.
- [3] Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, pages 1–15, 2019.
- [4] Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of Data Science. Cambridge University Press, 2020.
- [5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems, pages 9912–9924, 2020.
- [6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised Vision Transformers. In IEEE/CVF International Conference on Computer Vision, pages 9630–9640, 2021.
- [7] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27, 2011.
- [8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607, 2020.
- [9] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In International Conference on Learning Representations, pages 1–16, 2019.
- [10] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised Vision Transformers. In IEEE/CVF International Conference on Computer Vision, pages 9620–9629, 2021.
- [11] Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, and Xiaolong Wang. Meta-Baseline: Exploring simple meta-learning for few-shot learning. In IEEE/CVF International Conference on Computer Vision, pages 9042–9051, 2021.
- [12] Guneet S Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. In International Conference on Learning Representations, pages 1–20, 2020.
- [13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, pages 1–21, 2021.
- [14] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, page 1126–1135, 2017.
- [15] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, page 9537–9548, 2018.
- [16] Minghao Fu, Yun-Hao Cao, and Jianxin Wu. Worst case matters for few-shot recognition. In European Conference on Computer Vision, volume 13680 of LNCS, page 99–115. Springer, 2022.
- [17] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap Your Own Latent - a new approach to self-supervised learning. In Advances in Neural Information Processing Systems, pages 21271–21284, 2020.
- [18] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for unsupervised visual representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9726–9735, 2020.
- [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [20] Yin-Yin He, Jianxin Wu, and Xiu-Shen Wei. Distilling virtual examples for long-tailed recognition. In IEEE/CVF International Conference on Computer Vision, pages 235–244, 2021.
- [21] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5149–5169, 2022.
- [22] Shell Xu Hu, Da Li, Jan Stühmer, Minyoung Kim, and Timothy M. Hospedales. Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9058–9067, 2022.
- [23] Seong-Woong Kim and Dong-Wan Choi. Better generalized few-shot learning even without base data. In AAAI Conference on Artificial Intelligence, 2023.
- [24] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, 2009.
- [25] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10649–10657, 2019.
- [26] Yuning Lu, Liangjian Wen, Jianzhuang Liu, Yajing Liu, and Xinmei Tian. Self-supervision can be a good few-shot learner. In European Conference on Computer Vision, volume 13679 of LNCS, pages 740–758. Springer, 2022.
- [27] Xu Luo, Longhui Wei, Liangjian Wen, Jinrong Yang, Lingxi Xie, Zenglin Xu, and Qi Tian. Rectifying the shortcut learning of background for few-shot learning. In Advances in Neural Information Processing Systems, pages 13073–13085, 2021.
- [28] Puneet Mangla, Mayank Singh, Abhishek Sinha, Nupur Kumari, Vineeth N Balasubramanian, and Balaji Krishnamurthy. Charting the right manifold: Manifold Mixup for few-shot learning. In IEEE Winter Conference on Applications of Computer Vision, pages 2207–2216, 2020.
- [29] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, pages 1–11, 2017.
- [30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- [31] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, pages 1–17, 2019.
- [32] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical Networks for few-shot learning. In Advances in Neural Information Processing Systems, page 4080–4090, 2017.
- [33] Jong-Chyi Su, Subhransu Maji, and Bharath Hariharan. When does self-supervision improve few-shot learning? In European Conference on Computer Vision, volume 12352 of LNCS, pages 645–666. Springer, 2020.
- [34] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H.S. Torr, and Timothy M. Hospedales. Learning to compare: Relation Network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
- [35] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
- [36] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In European Conference on Computer Vision, volume 12359 of LNCS, page 266–282. Springer, 2020.
- [37] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold Mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447, 2019.
- [38] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching Networks for one shot learning. In Advances in Neural Information Processing Systems, page 3637–3645, 2016.
- [39] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
- [40] Yaqing Wang, Quanming Yao, James T. Kwok, and Lionel M. Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys, 53(3):1–34, 2020.
- [41] Shuo Yang, Lu Liu, and Min Xu. Free lunch for few-shot learning: Distribution calibration. In International Conference on Learning Representations, pages 1–13, 2021.
- [42] Zhanyuan Yang, Jinghua Wang, and Yingying Zhu. Few-shot classification with contrastive learning. In European Conference on Computer Vision, volume 13680 of LNCS, page 293–309. Springer, 2022.
- [43] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8805–8814, 2020.
- [44] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, pages 1–12, 2016.
- [45] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. DeepEMD: Few-shot image classification with differentiable Earth Mover’s distance and structured classifiers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12200–12210, 2020.
- [46] Zhiyuan Zhao, Qingjie Liu, and Yunhong Wang. Exploring effective knowledge transfer for few-shot object detection. In ACM International Conference on Multimedia, page 6831–6839, 2022.
Appendix
In this appendix, we provide the full implementation details of experiments in the proposed pFSL or traditional FSL settings.
First, we describe details common to all experiments. The features extracted by the backbone were $\ell_2$-normalized. Whenever a linear classifier was trained, we always used Adam as the optimizer, with the learning rate gradually decreased following a cosine schedule, and the label-smoothed cross-entropy loss [35] as the learning objective.
Appendix A Setup for pFSL
The pretrained models were trained using various self-supervised learning methods [6, 10, 2, 8, 17] on the ImageNet-1K [30] training set. We obtained the pretrained models from their respective official implementations. To conveniently conduct experiments, the features extracted by the backbone were stored in advance: images were resized to 256 pixels along the shorter side using bicubic resampling, then center-cropped to 224 × 224 pixels.
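The feature-extraction step described above roughly corresponds to the following sketch (hedged: standard torchvision transforms; the normalization statistics and loader details are assumptions that should follow each pretraining method's official release):

```python
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),  # shorter side -> 256
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # assumed ImageNet stats
])

@torch.no_grad()
def extract_features(backbone, loader, device="cuda"):
    backbone.eval().to(device)
    feats = torch.cat([backbone(x.to(device)).cpu() for x, _ in loader])
    return torch.nn.functional.normalize(feats, dim=1)       # L2-normalize, as described above
```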
For the proposed IbM2, we always sampled 200 virtual examples for every original training instance (i.e., $M=200$). We initially set the accuracy threshold $\tau$ in Algorithm 1 of the main paper to 0.9. By preliminarily training a linear classifier on the original training set before performing Algorithm 1, we obtained an empirical upper bound of the training accuracy, and set $\tau$ to the minimum of its initially predefined value (0.9 here) and this empirical upper bound, to ensure its validity. During the search for the best value of $\sigma$, as shown in Algorithm 1, we initialized a linear classifier once, and repeatedly trained it with learning rate 1.0 for 20 epochs until the search range became tight enough.
Next, we trained the final linear classifier for IbM2 (based on $D_{\sigma^\star}$, the set of virtual examples) or for the baseline method (based on the original training set $D$). The evaluation accuracy is sensitive to the learning rate selected during few-shot training, as shown in the literature [22]. Therefore, in the main paper we compared IbM2 with the baseline by reporting the best accuracy among 10 experimental settings, whose sole difference was the initial learning rate, enumerated from the set {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0}: we trained a series of 10 linear classifiers, one per initial learning rate, and reported the best top-1 accuracy among them. The learning rate was scaled as init_lr × batch_size / 256. In this stage, for ViT [13] architectures, the linear classifier was trained with batch size 256 for 100 epochs; for ResNet50 [19], it was trained with batch size 512 for 60 epochs.
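Putting the pieces together, the linear-classifier training stage looks roughly like the sketch below (PyTorch; the batch size, epoch count and learning-rate scaling follow the text above, while the label-smoothing value and other details are assumptions):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_linear_head(features, labels, num_classes, init_lr, batch_size=256, epochs=100):
    head = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=init_lr * batch_size / 256)   # lr scaling rule
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)      # cosine decay
    loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)                         # smoothing value assumed
    loader = DataLoader(TensorDataset(features, labels), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(head(x), y).backward()
            opt.step()
        sched.step()
    return head
```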
Appendix B Best vs. Default Learning Rate
In practice, however, we are not allowed to use the test set to find the best initial learning rate. Furthermore, in few-shot learning even cross-validation is infeasible, so we cannot determine the best initial learning rate that way. Hence, we first determined a default initial learning rate for both methods based on the above experiments, which is 0.005 for the baseline and 1.0 for IbM2. Then, we also compare their results using this default value as the initial learning rate.
| Pretraining Method | Backbone | IbM2 | Best LR | Default LR |
|---|---|---|---|---|
| DINO [6] | ViT-S/8 | ✗ | 70.1 | 68.9 |
| | | ✓ | 70.6 | 70.5 |
| | ViT-S/16 | ✗ | 64.7 | 63.2 |
| | | ✓ | 65.2 | 65.1 |
| MoCov3 [10] | ViT-S/16 | ✗ | 58.1 | 58.1 |
| | | ✓ | 59.1 | 58.9 |
| MSN [2] | ViT-S/16 | ✗ | 67.3 | 66.5 |
| | | ✓ | 68.0 | 68.0 |
| | ViT-B/4 | ✗ | 75.4 | 75.4 |
| | | ✓ | 76.1 | 76.1 |
| | ViT-L/7 | ✗ | 74.8 | 74.8 |
| | | ✓ | 75.6 | 75.6 |
| SimCLR [8] | ResNet50 | ✗ | 50.5 | 50.2 |
| | | ✓ | 51.3 | 50.8 |
| BYOL [17] | ResNet50 | ✗ | 55.7 | 55.0 |
| | | ✓ | 56.9 | 56.6 |
As shown in Table 7, IbM2 outperforms the baseline in all cases. More importantly, even when the baseline uses the best initial learning rate (chosen on the test set) while IbM2 only uses the default learning rate (without touching the test set), IbM2 still consistently outperforms the baseline in all cases.
Hence, IbM2 has a clear and significant advantage over the baseline method.
Appendix C Setup for Traditional FSL
In this setup, the pretrained models were generated by different approaches [22, 11, 28] on the base set. For PMF [22], following their guidance, we resized the input images to 224 × 224 pixels, while keeping small resolutions for the other two baselines [11, 28] (80 × 80 pixels for mini-ImageNet [38] and 32 × 32 pixels for CIFAR-FS [3]).
The initial value of $\tau$ in the proposed IbM2 was set to 0.999. During the search for $\sigma$, the linear classifier was initialized once and trained for 50 epochs with learning rate 1.0 at each search step. To obtain the final linear classifier, we trained a new linear classifier for 200 epochs for each 5-way few-shot task, on the training set of the baseline (the original training set) and of IbM2 (the virtually sampled training set generated after the optimal $\sigma^\star$ was found), respectively.
As suggested by PMF [22], for each scenario we sampled a small number (20) of extra few-shot episodes sharing very similar semantics with the evaluated ones from the novel split, in order to select the learning rate from {0.00001, 0.0001, 0.001, 0.01, 0.1, 1} for the final training stage of IbM2 or the baseline. The selected optimal learning rate was then used to train all 500 episodes of that scenario.
For PMF [22], we discarded its original fine-tuning pipeline that updates the weights of the whole backbone. We only learned the linear classifier on top of the frozen features, as in the pFSL setting, to make the comparison fairer.