
Few-shot Learning with LSSVM Base Learner and Transductive Modules

Haoqing Wang, Zhi-Hong Deng
Abstract

The performance of meta-learning approaches for few-shot learning generally depends on three aspects: features suitable for comparison, a classifier (base learner) suitable for low-data scenarios, and valuable information from the samples to classify. In this work, we make improvements in the last two aspects: 1) although many effective base learners exist, there is a trade-off between generalization performance and computational overhead, so we introduce the multi-class least squares support vector machine as our base learner, which obtains better generalization than existing ones with less computational overhead; 2) further, in order to utilize the information from the query samples, we propose two simple and effective transductive modules which modify the support set using the query samples, i.e., adjusting the support samples based on the attention mechanism and adding the prototypes of the pseudo-labeled query set to the support set as pseudo support samples. These two modules significantly improve the few-shot classification accuracy, especially in the difficult 1-shot setting. Our model, denoted as FSLSTM (Few-Shot learning with LSsvm base learner and Transductive Modules), achieves state-of-the-art performance on the miniImageNet and CIFAR-FS few-shot learning benchmarks.

Figure 1: Overview of our model. A few-shot classification task is a tuple of a support set and a query set, and we need to correctly classify the samples in the query set. In order to obtain a better classifier, we introduce LSSVM as our base learner (see Section 3.2); we then use the query samples to adjust the support samples based on the attention mechanism (see Section 3.3), corresponding to the Inverse Attention Module; and we add the prototypes of the pseudo-labeled query set to the support set as new support samples (see Section 3.3), corresponding to the Pseudo Support Samples. This last operation can be iterated multiple times and is only used during meta-testing.

1 Introduction

Humans can learn from a few examples, yet this remains a challenge for modern machine learning systems. Although many deep learning based image recognition approaches have achieved impressive performance (Simonyan and Zisserman 2014; He et al. 2016), they are data-hungry, requiring hundreds of training samples from each class. Few-shot classification (Fei-Fei, Fergus, and Perona 2006; Lake, Salakhutdinov, and Tenenbaum 2015) is more challenging: it aims to classify unseen instances into a set of new classes based on only a few support samples from each class.

Few-shot classification has received significant attention from the machine learning community recently, and the research has been dominated by meta-learning based approaches. The performance of many approaches generally depends on three aspects: features suitable for comparison, a base learner suitable for low-data scenarios, and valuable information from the samples to classify. First, features satisfying the clustering assumption are easier to classify, i.e., there is a clustering structure and the samples in the same cluster belong to the same class, which inspires us to learn task-specific representations (Li et al. 2019; Ye et al. 2020) or improve the generalization ability of the backbone (Gidaris et al. 2019; Seo, Jung, and Lee 2020). Second, different classifiers, e.g., the nearest-neighbor classifier, ridge regression and SVM, differ in their classification capability in low-data scenarios and in their computational overhead, which inspires us to introduce different classifiers as the base learner (Snell, Swersky, and Zemel 2017; Bertinetto et al. 2018; Lee et al. 2019). Finally, considering the limited supervision in the support samples of a few-shot task, the knowledge in the query samples is very valuable, which inspires us to perform transductive learning (Yang et al. 2020; Hu et al. 2020). In this work, we focus on the last two aspects and make improvements. Our model is illustrated in Figure 1.

While many choices of base learner exist, there is a trade-off between generalization and computational overhead. The SVM base learner (Lee et al. 2019) achieves better generalization but with an increase in computational overhead, as its objective is a quadratic programming problem solved with an iterative algorithm. In this work, we propose to use the multi-class least squares support vector machine (LSSVM) as the base learner, which improves generalization with less computational overhead. Further, in order to utilize the information from the query samples to enhance classification, we propose two transductive modules whose main motivation is to use the query samples to modify the support set, i.e., adjusting the support samples and adding pseudo support samples. Although many transductive meta-learning approaches have been proposed (Nichol, Achiam, and Schulman 2018; Ye et al. 2020; Yang et al. 2020; Hu et al. 2020), their models are not universal and cannot be directly applied to base learners like LSSVM.

Our first transductive module is based on the attention mechanism (Vaswani et al. 2017) and is denoted IAM (Inverse Attention Module). The motivation is that if we know the classification method (e.g., SVM or LSSVM) and the samples to classify, and are allowed to adjust the support (training) samples, then we can move the support samples according to the characteristics of the classifier and the query samples to obtain a better classifier, which is completely determined by the support samples. The way of moving needs to be meta-learned, and here we resort to the attention mechanism.

Our second transductive module is used during meta-testing and is denoted PSM (Pseudo Support Module); it calculates the prototypes of the pseudo-labeled query set and uses them as new support samples. As shown in (Snell, Swersky, and Zemel 2017), the class prototype is a good representation of a class. Moreover, we can iterate this process multiple times and continuously grow the support set.

We denote our model as FSLSTM, the abbreviation for Few-Shot learning with LSsvm base learner and Transductive Modules. Specifically, our contributions can be summarized as follows.

  • We introduce the multi-class least squares support vector machine as our base learner, which achieves better generalization than existing ones with less computational overhead.

  • We then propose two transductive modules which significantly improve the few-shot classification accuracy, especially for the difficult 1-shot setting.

  • Experiments show our model, FSLSTM, can achieve state-of-the-art performance on miniImageNet and CIFAR-FS.

2 Related Work

Meta-learning approaches for few-shot learning aim to learn some inductive bias that generalizes across a distribution of tasks (Vilalta and Drissi 2002) and can be broadly categorized into three groups: 1) black-box adaptation approaches (Santoro et al. 2016; Mishra et al. 2017) train a neural network to generate optimal model parameters; 2) optimization-based approaches (Andrychowicz et al. 2016; Finn, Abbeel, and Levine 2017; Lee et al. 2019; Hu et al. 2020) learn how to rapidly adapt a model to a given few-shot recognition task via a small number of gradient descent iterations, or teach the deep network to use standard machine learning tools (e.g., ridge regression, SVM) as its internal optimizer; 3) metric-learning based approaches (Vinyals et al. 2016; Snell, Swersky, and Zemel 2017; Chen et al. 2020) learn a distance metric between a query sample and the set of support samples of a few-shot task.

Many effective base learners have been proposed. Snell, Swersky, and Zemel (2017) proposed a simple but powerful nearest-neighbor classifier which represents each class by the mean embedding of its samples and classifies query samples by their distance to the nearest class mean. Bertinetto et al. (2018) used differentiable closed-form solvers (ridge regression and logistic regression) as base learners. Lee et al. (2019) used discriminatively trained linear classifiers (SVM and ridge regression) as base learners. In this work, we use the multi-class least squares support vector machine as the base learner, which further improves accuracy and reduces computational overhead.

In order to utilize the information from the query samples, some transductive meta-learning approaches have been proposed. Liu et al. (2018) reused the label propagation algorithm (Zhu, Ghahramani, and Lafferty 2003) for transductive inference within each task, boosting performance. Qiao et al. (2019) explored pairwise constraints and a regularization prior within each task under the transductive setup to tailor an episode-wise metric for each task. Hu et al. (2020) proposed a gradient-based method which uses the query samples to compute synthetic gradients for internal gradient updates. Yang et al. (2020) conveyed both distribution-level and instance-level relations in each few-shot learning task using a dual complete graph network. Compared with these works, our transductive modules are much simpler and more generic. The attention mechanism we use in the Inverse Attention Module is similar to (Vaswani et al. 2017), but the Key and the Query are different: it can be seen as modifying the representations of the support samples with the query samples as context.

3 Proposed Model

In this section, we first describe the meta-learning framework for few-shot learning following prior work (Vinyals et al. 2016), and then introduce the basic components of our model FSLSTM: the LSSVM base learner and the transductive modules, as shown in Figure 1.

3.1 Problem Formulation

In a few-shot classification task, we are given some training data which contains $N$ distinct, unseen classes with $K$ samples each. For a test sample, we need to classify it correctly into one of the $N$ classes. In most prior works, each task is organized as an episode which consists of a support set $\mathcal{S}$ and a query set $\mathcal{Q}$, and an $N$-way $K$-shot task can be defined as a tuple $\{\mathcal{S},\mathcal{Q}\}$, where

$$\mathcal{S}=\{s^{(c)}\}_{c=1}^{N},\quad|s^{(c)}|=K$$
$$\mathcal{Q}=\{q^{(c)}\}_{c=1}^{N},\quad|q^{(c)}|=Q$$

where $c$ is the class index and $Q$ is the number of query samples in each class. A 3-way 2-shot task is shown in Figure 1. Before classification, a backbone $f_{\theta}(\cdot)$, a CNN or ResNet (He et al. 2016), is needed to map the original inputs to feature representations; it is expected to extract similar representations for the (support or query) samples in the same class. Then a base learner $\mathcal{A}$ outputs the optimal classifier $\psi$ for the task based on the support set $\mathcal{S}$, i.e., $\psi=\mathcal{A}(\mathcal{S};\theta)$. In the transductive setting, the base learner $\mathcal{A}$ also takes the query samples as input.

Meta-learning approaches aim to learn some inductive bias that generalizes across a distribution of tasks, so they can be considered to learn over a collection of tasks, i.e., $\mathcal{T}_{train}=\{(\mathcal{S}_{i},\mathcal{Q}_{i})\}_{i=1}^{I}$, called the meta-training set. The model is learned by minimizing the generalization error across tasks given a base learner $\mathcal{A}$, and the learning objective is:

$$\min_{\theta}\ \mathbb{E}_{\mathcal{T}_{train}}\left[\mathcal{L}^{meta}(\mathcal{Q},\psi)\right]+\mathcal{R}(\theta),\quad\psi=\mathcal{A}(\mathcal{S};\theta)\qquad(1)$$

where $\mathcal{L}^{meta}$ is a loss function, e.g., the cross-entropy loss, and $\mathcal{R}(\theta)$ is a regularization term.

After the model is learned, its generalization is evaluated on a set of held-out tasks, called the meta-testing set, $\mathcal{T}_{test}=\{(\mathcal{S}_{j},\mathcal{Q}_{j})\}_{j=1}^{J}$, and computed as

$$\mathbb{E}_{\mathcal{T}_{test}}\left[\mathcal{L}^{meta}(\mathcal{Q},\psi)\right],\quad\psi=\mathcal{A}(\mathcal{S};\theta)\qquad(2)$$

Following prior work (Ravi and Larochelle 2016; Finn, Abbeel, and Levine 2017), the stages corresponding to Equations (1) and (2) are called meta-training and meta-testing respectively. In addition, during meta-training, a held-out meta-validation set $\mathcal{T}_{val}$ is kept to choose the best model parameters. The categories in $\mathcal{T}_{train}$, $\mathcal{T}_{test}$ and $\mathcal{T}_{val}$ are disjoint, which ensures the tasks in $\mathcal{T}_{test}$ and $\mathcal{T}_{val}$ are unseen by the learned model.
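To make the episodic protocol concrete, the following sketch shows one pass of meta-training under objective (1). It is a minimal illustration under our own assumptions: a `backbone` module for $f_\theta$, a differentiable `base_learner` implementing $\mathcal{A}$ that returns a classifier, tasks yielded as (support, query) pairs, and $\mathcal{R}(\theta)$ realized as weight decay inside the optimizer.

```python
import torch
import torch.nn.functional as F

def meta_train_epoch(backbone, base_learner, tasks, optimizer):
    """One epoch of episodic meta-training for objective (1) (illustrative)."""
    for (s_imgs, s_labels), (q_imgs, q_labels) in tasks:
        s_feats = backbone(s_imgs)                # f_theta on the support set
        q_feats = backbone(q_imgs)
        psi = base_learner(s_feats, s_labels)     # psi = A(S; theta), a classifier
        logits = psi(q_feats)                     # classify the query set
        loss = F.cross_entropy(logits, q_labels)  # L^meta on the query set
        optimizer.zero_grad()
        loss.backward()                           # backbone is updated through the
        optimizer.step()                          # query-set loss across tasks
```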

| Model | Backbone | miniImageNet 5way-1shot | miniImageNet 5way-5shot | CIFAR-FS 5way-1shot | CIFAR-FS 5way-5shot |
|---|---|---|---|---|---|
| Reptile+BN (Nichol, Achiam, and Schulman 2018) | Conv4 | 49.97 ± 0.0 | 65.99 ± 0.0 | - | - |
| Relation Net (Sung et al. 2018) | Conv4 | 50.40 ± 0.8 | 65.30 ± 0.7 | 55.00 ± 1.0 | 69.30 ± 0.8 |
| TPN (Liu et al. 2018) | Conv4 | 55.51 ± 0.0 | 69.86 ± 0.0 | - | - |
| TEAM (Qiao et al. 2019) | Conv4 | 56.57 ± 0.0 | 72.04 ± 0.0 | - | - |
| FEAT (Ye et al. 2020) | Conv4 | 57.04 ± 0.2 | 72.89 ± 0.2 | - | - |
| FEAT‡ (Ye et al. 2020) | Conv4 | 58.98 ± 0.2 | 74.72 ± 0.2 | - | - |
| SIB (Hu et al. 2020) | Conv4 | 58.00 ± 0.6 | 70.70 ± 0.4 | 68.70 ± 0.6 | 77.10 ± 0.4 |
| SIB‡ (Hu et al. 2020) | Conv4 | 65.04 ± 0.8 | 77.20 ± 0.5 | 74.10 ± 0.9 | 82.78 ± 0.5 |
| LSSVM | Conv4 | 58.13 ± 0.6 | 75.09 ± 0.4 | 69.80 ± 0.7 | 81.43 ± 0.5 |
| LSSVM+BN | Conv4 | 57.99 ± 0.6 | 75.32 ± 0.4 | 69.99 ± 0.7 | 82.48 ± 0.5 |
| LSSVM+IAM | Conv4 | 59.29 ± 0.6 | 76.70 ± 0.4 | 71.99 ± 0.7 | 85.36 ± 0.5 |
| LSSVM+IAM+PSM (FSLSTM) | Conv4 | 62.98 ± 0.6 | 77.72 ± 0.4 | 78.18 ± 0.7 | 86.41 ± 0.5 |
| FEAT (Ye et al. 2020) | ResNet12 | 66.78 ± 0.2 | 82.05 ± 0.2 | - | - |
| FEAT‡ (Ye et al. 2020) | ResNet12 | 69.96 ± 0.2 | 84.32 ± 0.2 | - | - |
| SIB (Hu et al. 2020) | ResNet12 | 70.40 ± 0.8 | 80.16 ± 0.5 | 77.04 ± 0.8 | 84.25 ± 0.6 |
| SIB‡ (Hu et al. 2020) | ResNet12 | 74.80 ± 0.8 | 83.65 ± 0.5 | 82.17 ± 0.7 | 88.05 ± 0.5 |
| DPGN (Yang et al. 2020) | ResNet12 | 67.77 ± 0.3 | 84.60 ± 0.4 | 77.90 ± 0.5 | 90.20 ± 0.4 |
| DPGN‡ (Yang et al. 2020) | ResNet12 | 69.54 ± 0.5 | 85.72 ± 0.4 | 80.14 ± 0.5 | 91.83 ± 0.3 |
| LSSVM | ResNet12 | 68.46 ± 0.6 | 85.14 ± 0.4 | 78.60 ± 0.6 | 91.17 ± 0.4 |
| LSSVM+BN | ResNet12 | 69.48 ± 0.6 | 85.46 ± 0.4 | 81.09 ± 0.6 | 91.40 ± 0.4 |
| LSSVM+IAM | ResNet12 | 70.96 ± 0.6 | 85.99 ± 0.4 | 81.66 ± 0.6 | 92.01 ± 0.4 |
| LSSVM+IAM+PSM (FSLSTM) | ResNet12 | 75.54 ± 0.6 | 86.75 ± 0.4 | 86.88 ± 0.6 | 92.82 ± 0.4 |

Table 1: Comparison to previous transductive meta-learning approaches on miniImageNet and CIFAR-FS. 'LSSVM' denotes using the multi-class LSSVM as the base learner. 'BN' denotes sharing information among samples via batch normalization. '‡' denotes using the weights pretrained on the Places365-Standard dataset as initialization. The baseline results are reported by (Liu et al. 2018; Qiao et al. 2019; Ye et al. 2020; Hu et al. 2020) and our reimplementation.

3.2 LSSVM Base Learner

The choice of base learner $\mathcal{A}$ is crucial for few-shot classification. Many base learners have been proposed (Snell, Swersky, and Zemel 2017; Bertinetto et al. 2018; Gidaris and Komodakis 2018; Lee et al. 2019); among them, MetaOptNet (Lee et al. 2019) achieves impressive performance using a discriminatively trained linear classifier (e.g., SVM). However, it needs to solve a quadratic programming problem with an iterative algorithm, which increases the computational overhead. In addition, due to the limit on the number of iterations and the quadratic programming solver, the obtained classifier is only approximately optimal. Instead, we use the multi-class least squares support vector machine as our base learner, which only needs to solve the system of linear equations obtained from the Karush-Kuhn-Tucker (KKT) conditions and yields the exact optimal solution.

There are three basic coding approaches for multi-class problems using binary classifiers: one-vs-all, one-vs-one and error-correcting output codes (ECOC) (García-Pedrajas and Ortiz-Boyer 2011). Given a training set $\{(x_{i},y_{i})\}_{i=1}^{n}$ with $C$ distinct categories, coding each category with a vector of length $L$ whose elements are chosen from $\{0,\pm 1\}$, we obtain a coding matrix $M\in\mathbb{R}^{C\times L}$ and $L$ training sets $\{(x_{i},y_{i}^{l})\}_{i=1}^{n}$, $l=1,\cdots,L$, one for each of the $L$ binary classifiers. The quadratic programming formulation is

$$\min_{w,b}\quad\sum_{l=1}^{L}\left[\frac{1}{2}(w_{l}^{T}w_{l}+b_{l}^{2})+\frac{\gamma}{2}\sum_{i=1}^{n}(e_{i}^{l})^{2}\right]\qquad(3)$$
$$\mathrm{s.t.}\quad y_{i}^{l}\left[w_{l}^{T}\varphi(x_{i})+b_{l}\right]=1-e_{i}^{l},\quad\forall i,l\qquad(4)$$

where $\gamma$ is the regularization parameter and $\varphi(\cdot)$ is a mapping function. Using the Karush-Kuhn-Tucker (KKT) conditions, we only need to solve the system of linear equations:

$$\begin{bmatrix}-2I&Y^{T}\\ Y&\Omega\end{bmatrix}\begin{bmatrix}\mathbf{b}\\ \bm{\alpha}\end{bmatrix}=\begin{bmatrix}\mathbf{0}\\ \mathbf{1}\end{bmatrix}\qquad(5)$$

where $I$ is the identity matrix, $\bm{\alpha}=[\alpha_{1}^{T},\cdots,\alpha_{L}^{T}]^{T}$ is the dual variable, and

$$\Omega=\begin{bmatrix}\Omega_{1}&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&\Omega_{L}\end{bmatrix}\qquad Y=\begin{bmatrix}y^{1}&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&y^{L}\end{bmatrix}\qquad(6)$$

with $y^{l}_{i}\in\{0,\pm 1\}$ and $\Omega_{l}^{ij}=y_{i}^{l}y_{j}^{l}\Phi(x_{i},x_{j})+\frac{1}{\gamma}\delta_{ij}$, $\forall i,j,l$, where $\Phi(\cdot,\cdot)$ is the kernel function and $\delta_{ij}$ is the Kronecker delta. Given a test sample $x$, its classification results from the $L$ binary classifiers constitute its class code, and we take the class whose code has the smallest Hamming distance to it as the predicted class, i.e.,

$$y=\arg\min_{r}\sum_{l=1}^{L}\frac{1-\mathrm{sgn}(M_{rl}\cdot \mathrm{sgn}(c_{l}(x)))}{2}\qquad(7)$$
$$\ \approx\arg\max_{r}\sum_{l=1}^{L}M_{rl}\cdot c_{l}(x),\quad r=1,\cdots,C\qquad(8)$$

where $c_{l}(x)=\sum_{i=1}^{n}\alpha_{l}^{i}y^{l}_{i}\Phi(x_{i},x)+b_{l}$ and $\mathrm{sgn}(\cdot)$ is the sign function.

Note that there are only $L(n+1)$ variables in Equation (5), which is small in a few-shot task ($n=NK$), and there is no need to iterate over multiple steps. Using this base learner, our system is end-to-end trainable, which means we need to calculate $\frac{\partial\alpha}{\partial x}$ and $\frac{\partial b}{\partial x}$, where $x$ corresponds to the output of the backbone. Applying the implicit function theorem (Dontchev and Rockafellar 2009; Krantz and Parks 2012) to Equation (5), these derivatives can be computed easily.
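As an illustration of the above, here is a minimal PyTorch sketch (our reconstruction, not the authors' released code) of the one-vs-all, linear-kernel LSSVM base learner. Since the system (5) is block-diagonal across the $L$ binary problems, each $(n+1)$-variable KKT system can be solved independently with a single exact linear solve, and prediction follows Equation (8); function names and the `gamma` default are our assumptions.

```python
import torch

def lssvm_fit(support_feats, support_labels, n_way, gamma=0.1):
    """Solve the per-class KKT systems of Eq. (5) exactly (illustrative sketch).

    support_feats:  (n, d) embedded support samples, n = N*K
    support_labels: (n,)   integer labels in [0, n_way)
    Returns dual variables alpha (L, n), biases b (L,) and the coding y (L, n).
    """
    n = support_feats.shape[0]
    # One-vs-all coding: y_i^l = +1 if sample i belongs to class l, else -1.
    y_code = torch.full((n_way, n), -1.0)
    y_code[support_labels, torch.arange(n)] = 1.0
    # Linear kernel: Phi(x_i, x_j) = x_i^T x_j.
    K = support_feats @ support_feats.t()

    alphas, biases = [], []
    for l in range(n_way):
        y = y_code[l]
        # Omega_l^{ij} = y_i^l y_j^l Phi(x_i, x_j) + delta_ij / gamma  (Eq. 6)
        omega = torch.outer(y, y) * K + torch.eye(n) / gamma
        # Per-binary-problem block of the KKT system (5):
        # [[-2, y^T], [y, Omega_l]] [b_l; alpha_l] = [0; 1]
        A = torch.zeros(n + 1, n + 1)
        A[0, 0] = -2.0
        A[0, 1:] = y
        A[1:, 0] = y
        A[1:, 1:] = omega
        rhs = torch.cat([torch.zeros(1), torch.ones(n)])
        sol = torch.linalg.solve(A, rhs)        # exact solution, no iterations
        biases.append(sol[0])
        alphas.append(sol[1:])
    return torch.stack(alphas), torch.stack(biases), y_code

def lssvm_predict(query_feats, support_feats, alphas, biases, y_code):
    """Scores c_l(x) and decoding via Eq. (8); for one-vs-all coding the
    argmax over sum_l M_{rl} c_l(x) reduces to the argmax over c_r(x)."""
    K_qs = query_feats @ support_feats.t()            # (m, n)
    scores = K_qs @ (alphas * y_code).t() + biases    # c_l(x) for each query
    return scores.argmax(dim=1)
```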

| Base Learner | miniImageNet 5way-1shot | miniImageNet 5way-5shot | CIFAR-FS 5way-1shot | CIFAR-FS 5way-5shot |
|---|---|---|---|---|
| NN | 59.25 ± 0.6 / 286 | 75.60 ± 0.5 / 358 | 72.20 ± 0.7 / 115 | 83.50 ± 0.5 / 146 |
| RR | 61.41 ± 0.6 / 524 | 77.88 ± 0.5 / 600 | 72.60 ± 0.7 / 335 | 84.30 ± 0.5 / 371 |
| SVM | 62.64 ± 0.6 / 732 | 78.63 ± 0.5 / 1062 | 72.00 ± 0.7 / 629 | 84.20 ± 0.5 / 884 |
| LSSVM | 63.22 ± 0.6 / 303 | 79.02 ± 0.5 / 387 | 73.40 ± 0.7 / 132 | 85.49 ± 0.5 / 155 |

Table 2: Average few-shot classification accuracy (%) with 95% confidence intervals, together with the time (s) required to solve 10,000 randomly sampled tasks from miniImageNet and CIFAR-FS on a single NVIDIA RTX 2080Ti GPU, shown as Acc / Time. 'NN' stands for the nearest-neighbor classifier (Snell, Swersky, and Zemel 2017) and 'RR' for ridge regression (Lee et al. 2019). The accuracy of the baselines is reported from (Lee et al. 2019).

3.3 Transductive Modules

Next, we introduce how to use the query samples to modify the support set so as to obtain better classifier parameters.

Inverse Attention Module

As shown above, Equation (5) has a unique solution, so given the support set, our base learner outputs the unique optimal classifier, and the decision boundary changes with the support set. Intuitively, the support samples can therefore be adjusted using the query samples to obtain better classifier parameters.

Specifically, we use the attention mechanism (Vaswani et al. 2017) to introduce knowledge from the query samples. Let $f_{\theta}(\mathcal{S})$ and $f_{\theta}(\mathcal{Q})$ be the feature vectors of the support set $\mathcal{S}$ and the query set $\mathcal{Q}$ respectively, and define (query $\mathbf{Q}$, key $\mathbf{K}$, value $\mathbf{V}$) as

$$\mathbf{Q}=g_{\phi}^{q}(f_{\theta}(\mathcal{S}))\in\mathbb{R}^{NK\times d_{k}}\qquad(9)$$
$$\mathbf{K}=g_{\phi}^{k}(f_{\theta}(\mathcal{Q}))\in\mathbb{R}^{NQ\times d_{k}}\qquad(10)$$
$$\mathbf{V}=g_{\phi}^{v}(f_{\theta}(\mathcal{Q}))\in\mathbb{R}^{NQ\times d_{v}}\qquad(11)$$

where $g_{\phi}^{q}(\cdot)$, $g_{\phi}^{k}(\cdot)$ and $g_{\phi}^{v}(\cdot)$ are three different mapping functions. Once we have $(\mathbf{Q},\mathbf{K},\mathbf{V})$, we use them to compute the scaled dot-product attention:

$$A(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{k}}}\right)\mathbf{V}\qquad(12)$$

Using the labels from the support set, we know the class corresponding to each row of the matrix $A$. To increase the intra-class similarity, we replace each row of $A$ with the prototype of its corresponding class, which is calculated from the matrix $A$ itself. Finally, we calculate the adjusted support set as

$$\mathcal{S}:=\mathrm{LN}(\mathcal{S}+\mathrm{Dropout}(h_{\phi}(A(\mathbf{Q},\mathbf{K},\mathbf{V}))))\qquad(13)$$

where $h_{\phi}$ is another mapping function and $\mathrm{LN}$ denotes layer normalization (Ba, Kiros, and Hinton 2016).

Generally, the mapping functions $g_{\phi}^{q}(\cdot)$, $g_{\phi}^{k}(\cdot)$ and $g_{\phi}^{v}(\cdot)$ are linear functions defined by a weight matrix $W$. To limit model complexity while increasing model capability, we use a bottleneck with two fully connected layers, i.e., a dimension-reduction layer with parameters $W_{1}$ and reduction ratio $r$, a ReLU, and a dimension-increasing layer with parameters $W_{2}$, as the mapping functions. This module is called the Inverse Attention Module since it uses the support set as the Query and the query set as the Key and the Value. It uses a weighted combination of the Value as the offset of the support samples, where the weights are calculated based on the relationship between the Key and the Query.
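A compact PyTorch sketch of this module follows. It is an illustration under our own assumptions (the class name, hidden sizes, dropout rate and the exact bottleneck wiring are ours); the bottleneck plays the role of the maps $g_{\phi}^{q}$, $g_{\phi}^{k}$, $g_{\phi}^{v}$ and $h_{\phi}$.

```python
import torch
import torch.nn as nn

class InverseAttentionModule(nn.Module):
    """Illustrative sketch of IAM (Eqs. 9-13): support features act as the
    attention Query; query features supply the Key and the Value."""
    def __init__(self, d, d_k, r=16, dropout=0.1):
        super().__init__()
        def bottleneck(d_in, d_out):
            # dimension-reduction layer (ratio r), ReLU, dimension-increasing layer
            return nn.Sequential(nn.Linear(d_in, d_in // r), nn.ReLU(),
                                 nn.Linear(d_in // r, d_out))
        self.g_q = bottleneck(d, d_k)   # Eq. (9)
        self.g_k = bottleneck(d, d_k)   # Eq. (10)
        self.g_v = bottleneck(d, d)     # Eq. (11)
        self.h = bottleneck(d, d)
        self.norm = nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)
        self.d_k = d_k

    def forward(self, support, query, support_labels, n_way):
        # support: (N*K, d) features; query: (N*Q, d) features
        Q = self.g_q(support)
        K = self.g_k(query)
        V = self.g_v(query)
        A = torch.softmax(Q @ K.t() / self.d_k ** 0.5, dim=-1) @ V   # Eq. (12)
        # Replace each row of A with the prototype of its class (from A itself).
        protos = torch.stack([A[support_labels == c].mean(0) for c in range(n_way)])
        A = protos[support_labels]
        # Residual update of the support features, Eq. (13).
        return self.norm(support + self.drop(self.h(A)))
```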

Pseudo Support Module

In a few-shot classification task, the support samples and the query samples satisfy the clustering assumption. So when the original classifier can already classify samples well, the prototypes of the pseudo-labeled query set can serve as effective support samples; since they have a closer relationship with the query samples, we can use them to enhance classification.

Let the query set be $\mathcal{Q}=\{x_{i}\}_{i=1}^{NQ}$. Using the optimal classifier from our base learner, we obtain a pseudo-labeled query set $\widetilde{\mathcal{Q}}=\{(x_{i},\widetilde{y_{i}})\}_{i=1}^{NQ}$, whose prototypes can be computed as

$$p_{k}=\frac{1}{|Q_{k}|}\sum_{x_{i}\in Q_{k}}f_{\theta}(x_{i}),\quad k=1,\cdots,N\qquad(14)$$

where $Q_{k}=\{x\,|\,(x,\widetilde{y})\in\widetilde{\mathcal{Q}},\ \widetilde{y}=k\}$. Note that this operation can be iterated multiple times and is only used during meta-testing, to avoid increasing the computational overhead during meta-training.
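The sketch below illustrates this iteration. Names are ours: `classify_fn` stands for the full classifier (e.g., the LSSVM base learner on the adjusted support set), and all inputs are assumed to be features already embedded by $f_\theta$.

```python
import torch

def pseudo_support_iterate(support, labels, query, classify_fn, n_way, n_iters=10):
    """Sketch of the Pseudo Support Module (Eq. 14); used only at meta-testing."""
    for _ in range(n_iters):
        pseudo = classify_fn(support, labels, query)      # pseudo labels for Q
        protos, proto_labels = [], []
        for k in range(n_way):
            members = query[pseudo == k]                  # Q_k
            if len(members) > 0:                          # guard empty classes
                protos.append(members.mean(dim=0))        # p_k, Eq. (14)
                proto_labels.append(k)
        # Append the prototypes to the support set as pseudo support samples,
        # so the support set keeps growing across iterations.
        support = torch.cat([support, torch.stack(protos)])
        labels = torch.cat([labels,
                            torch.tensor(proto_labels, dtype=labels.dtype)])
    return classify_fn(support, labels, query)
```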

4 Experiments

In this section, we first briefly describe our experimental setting. Next, we compare our model, FSLSTM, with the existing state-of-the-art transductive meta-learning approaches. Then we compare our LSSVM base learner with the existing ones. After that, we show the robustness of our Pseudo Support Module. Finally, we analyze the proposed two transductive modules in detail.

| Base Learner | Backbone | miniImageNet 5way-1shot | miniImageNet 5way-5shot | CIFAR-FS 5way-1shot | CIFAR-FS 5way-5shot |
|---|---|---|---|---|---|
| NN | ResNet12 | 59.25 ± 0.6 | 75.60 ± 0.5 | 72.20 ± 0.7 | 83.50 ± 0.5 |
| NN+CAN | ResNet12 | 62.63 ± 0.7 | 76.99 ± 0.5 | 77.72 ± 0.8 | 84.07 ± 0.5 |
| NN+PSM | ResNet12 | 65.94 ± 0.7 | 77.03 ± 0.5 | 78.91 ± 0.8 | 84.34 ± 0.5 |
| RR | ResNet12 | 61.41 ± 0.6 | 77.88 ± 0.5 | 72.60 ± 0.7 | 84.30 ± 0.5 |
| RR+CAN | ResNet12 | 59.77 ± 0.7 | 74.86 ± 0.5 | 72.33 ± 0.8 | 83.11 ± 0.6 |
| RR+PSM | ResNet12 | 66.51 ± 0.7 | 78.96 ± 0.5 | 78.68 ± 0.8 | 85.47 ± 0.5 |
| SVM | ResNet12 | 62.64 ± 0.6 | 78.63 ± 0.5 | 72.00 ± 0.7 | 84.20 ± 0.5 |
| SVM+CAN | ResNet12 | 62.14 ± 0.7 | 78.33 ± 0.5 | 71.37 ± 0.8 | 83.76 ± 0.5 |
| SVM+PSM | ResNet12 | 66.14 ± 0.7 | 79.34 ± 0.5 | 77.79 ± 0.8 | 85.22 ± 0.5 |
| LSSVM | ResNet12 | 63.22 ± 0.6 | 79.02 ± 0.5 | 73.40 ± 0.7 | 85.49 ± 0.5 |
| LSSVM+CAN | ResNet12 | 62.43 ± 0.7 | 77.66 ± 0.5 | 72.72 ± 0.7 | 84.96 ± 0.5 |
| LSSVM+PSM | ResNet12 | 66.26 ± 0.7 | 79.46 ± 0.5 | 78.29 ± 0.7 | 86.29 ± 0.5 |

Table 3: Average few-shot classification accuracy with 95% confidence intervals on miniImageNet and CIFAR-FS with ResNet12 as the backbone. 'CAN' means using the transductive method in (Hou et al. 2019).

4.1 Experimental Setting

We evaluate our model on two standard few-shot learning benchmarks: miniImageNet (Vinyals et al. 2016) and CIFAR-FS (Bertinetto et al. 2018).

For a fair and comprehensive comparison with previous approaches, we employ two popular networks as our backbone: 1) Conv4 (Kim et al. 2019; Ye et al. 2020; Yang et al. 2020) consists of four Conv-BN-ReLU blocks, the last two of which contain a dropout layer (Srivastava et al. 2014); 2) ResNet12 (Lee et al. 2019; Ye et al. 2020; Yang et al. 2020) consists of four residual blocks, each of which consists of three Conv-BN-ReLU blocks. Instead of optimizing from scratch, we apply an additional pretraining strategy as in (Fei et al. 2020), which pretrains the backbone on the Places365-Standard dataset (Zhou et al. 2017) for standard 365-way classification. The purpose of this pretraining strategy is to warm up the backbone and thus assist training of the transductive modules. During pretraining, we crop the images to 84×84 for miniImageNet and 32×32 for CIFAR-FS. We use SGD with Nesterov momentum of 0.9 and weight decay of 0.0005. The model is meta-trained for 60 epochs, and each epoch consists of 1000 batches with 8 episodes per batch. Since the Pseudo Support Module can be applied repeatedly, we choose the number of iterations $k=10$ during meta-testing for all experiments.

We evaluate our model in the 5-way 1-shot/5-shot settings, randomly sample 1000 episodes for evaluation, and report the average accuracy (%) as well as the 95% confidence interval.
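The reported interval is presumably the standard normal-approximation interval over per-episode accuracies; a small sketch of that computation (the exact estimator is our assumption):

```python
import numpy as np

def mean_and_ci95(episode_accs):
    """Average accuracy and 95% confidence half-width over sampled episodes."""
    acc = np.asarray(episode_accs)
    half_width = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))  # normal approximation
    return acc.mean(), half_width                            # report as mean ± half_width
```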

Please see the supplementary material for more details.

4.2 Comparison with State-of-the-arts

We compare our model with several state-of-the-art transductive meta-learning approaches; the results are shown in Table 1. For a fair comparison, we apply our pretraining strategy to several baselines. Specifically, for FEAT (Ye et al. 2020), which uses a different pretraining strategy, we directly replace its original pretrained weights with the weights obtained by our pretraining strategy. For SIB (Hu et al. 2020), which freezes the backbone during meta-training, directly using the weights pretrained on the Places365-Standard dataset yields very poor results because they carry no information about miniImageNet or CIFAR-FS; we therefore use the Places365-pretrained weights as initialization and then pretrain the backbone again on miniImageNet and CIFAR-FS using the strategy in SIB. For DPGN (Yang et al. 2020), which does not use any pretraining strategy, we use the weights obtained by our pretraining strategy as the initialization of the backbone and reduce its learning rate. All results are obtained using the public implementations published by the authors.

As shown in Table 1, the new pretraining strategy improves the few-shot classification accuracy to varying degrees, which shows the potential of transfer learning to improve few-shot learning. Furthermore, our first transductive module, IAM, achieves improvements ranging from 0.84% to 3.93% over the LSSVM base learner and outperforms batch normalization. Our second transductive module, PSM, further improves the few-shot classification accuracy, especially for difficult 1-shot tasks, with improvements ranging from 3.69% to 6.19%. These results clearly validate the effectiveness of our transductive modules, and our model, FSLSTM (LSSVM+IAM+PSM), achieves state-of-the-art performance on both datasets.

4.3 Base Learner Comparison

Next, we compare the LSSVM base learner with existing ones, i.e., the nearest-neighbor classifier (Snell, Swersky, and Zemel 2017), ridge regression (Bertinetto et al. 2018; Lee et al. 2019) and SVM (Lee et al. 2019), in terms of accuracy and inference speed. For fair comparisons, we run them on miniImageNet and CIFAR-FS using the same backbone without pretraining. Specifically, we use ResNet12 as the backbone to evaluate accuracy and Conv4 to evaluate inference speed; using a small convolutional network better reflects the difference in inference speed among base learners. We compare the amount of time required to solve 10,000 randomly sampled tasks on a single NVIDIA RTX 2080Ti GPU.

As shown in Table 2, the LSSVM base learner achieves higher accuracy than existing ones with faster inference. Specifically, the LSSVM base learner outperforms MetaOptNet-SVM (Lee et al. 2019) on both few-shot learning benchmarks and is 2 to 6 times faster with Conv4 as the backbone. On the other hand, the LSSVM base learner is as fast as the Prototypical Network (Snell, Swersky, and Zemel 2017) but achieves significantly better results.

4.4 Robustness of Pseudo Support Module

A work closely related to our Pseudo Support Module (Hou et al. 2019) augments the support set with confidently classified query images during meta-testing, i.e., it chooses the query samples with the highest predicted scores as pseudo support samples. Similar to our module, their method can also be iterated multiple times. It is worth noting that their method is specifically designed for metric-learning based approaches, i.e., the matching network (Vinyals et al. 2016), prototypical network (Snell, Swersky, and Zemel 2017) and relation network (Sung et al. 2018). It is therefore interesting to explore the effect of this method on other base learners (e.g., ridge regression, SVM and LSSVM), where it can be directly compared with our module.

For a fair comparison, we use the transductive operation in CAN (Hou et al. 2019) and our Pseudo Support Module for various base learners, i.e., nearest-neighbor classifier (Snell, Swersky, and Zemel 2017), ridge-regression (Lee et al. 2019), SVM (Lee et al. 2019) and the LSSVM base learner. We use ResNet12 as the backbone and do not apply the pre-training initialization. We implement the transductive operation in CAN (Hou et al. 2019) for two iterations with 35 candidates for the first iteration and 70 for the second iteration, as suggested by the authors.

The results are shown in Table 3. From these results, we can make the following observations: 1) our Pseudo Support Module is effective for various base learners, not only the metric-learning based ones but also the optimization-based ones; 2) the transductive method in CAN (Hou et al. 2019) significantly improves the metric-learning based approaches but reduces the accuracy of the optimization-based ones; the reason may be that in metric-learning based approaches the predicted scores are directly related to the confidence of the candidate samples, whereas optimization-based approaches have more complex classification mechanisms in which predicted scores and confidence are no longer directly related, so the candidate samples become noise; 3) our transductive operation consistently outperforms the method in (Hou et al. 2019), even for the metric-learning based approaches, which suggests our method is more universal and effective, and that class prototypes are more statistically valid and robust than individual samples.

Figure 2: Change of the support samples in 5-way 1-shot tasks under four different settings. Panel accuracies before and after applying the Inverse Attention Module: (a) 65.3% → 70.6%; (b) 76.0% → 80.0%; (c) 72.0% → 77.3%; (d) 94.6% → 93.3%. There are 85 points in each panel, consisting of 5 support samples, 5 adjusted support samples and 75 query samples. The arrows show the change of the support samples before and after applying the Inverse Attention Module.
Figure 3: Variation of the accuracy increment with the iteration number $k$ under four different settings: (a) Conv4 on miniImageNet, (b) ResNet12 on miniImageNet, (c) Conv4 on CIFAR-FS and (d) ResNet12 on CIFAR-FS.

4.5 Analysis on Transductive Modules

Finally, we analyze our transductive modules in detail. We perform a qualitative study on the Inverse Attention Module, visualizing the change of the support samples. Then we perform a quantitative study on the Pseudo Support Module to explore the impact of its iteration number on accuracy.

Change of the support samples

We randomly sample four 5-way 1-shot tasks, corresponding to four different settings respectively, i.e., ResNet12 on miniImageNet, Conv4 on miniImageNet, ResNet12 on CIFAR-FS and Conv4 on CIFAR-FS. For each task, we use the support samples, the adjusted support samples and the query samples to learn a principal component analysis (PCA) model which projects the feature representations into 2-D space. Then we apply this learned PCA model to all samples and the result is shown in Figure 2.
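A sketch of this visualization procedure (our reconstruction, using scikit-learn) follows; the function name is hypothetical and inputs are assumed to be feature arrays from one task.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_task_2d(support, adjusted_support, query):
    """Fit PCA on all features of one task, then project each group to 2-D
    for the Figure 2 scatter plots (85 points for a 5-way 1-shot task)."""
    pca = PCA(n_components=2).fit(
        np.concatenate([support, adjusted_support, query]))
    return (pca.transform(support),
            pca.transform(adjusted_support),
            pca.transform(query))
```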

Interestingly, the Inverse Attention Module pushes the support samples away from their clusters, which reflects the difference between the LSSVM base learner and the nearest-neighbor method and indicates that the Inverse Attention Module has adapted to the LSSVM base learner. Besides, the feature representations from the backbone have a clustering structure, and samples of the same class are close to each other.

Iteration number for Pseudo Support Module

As discussed above, the Pseudo Support Module can be iterated multiple times, with new pseudo support samples added each time. We therefore explore the impact of the iteration number $k$ on accuracy. Specifically, we examine the variation of the accuracy increment with the iteration number $k$ under the four settings above. We calculate the increment of the average accuracy, and the result is shown in Figure 3.

We can make the following observations: 1) as the iteration number increases, the Pseudo Support Module continuously improves the classification accuracy; 2) the increment on 5-way 1-shot tasks is significantly higher than on 5-way 5-shot tasks, perhaps because the information in the support set of a 1-shot task is so scarce that the same information increment yields a more significant improvement; 3) this process cannot increase the accuracy indefinitely but has a performance limit.

5 Conclusion

In this work, we analyze three directions in which few-shot learning can be further improved, i.e., features suitable for comparison, a base learner suitable for low-data scenarios, and valuable information from the samples to classify, and we make improvements in the last two directions. We first introduce the multi-class least squares support vector machine as the base learner, which achieves better generalization and faster inference than existing ones. Then we propose two transductive modules that modify the support set using the query samples, i.e., adjusting the support samples and adding pseudo support samples. Experiments show that our transductive modules significantly improve few-shot learning, especially in the difficult 1-shot setting. Note that existing methods for making features more suitable for comparison, e.g., self-supervised auxiliary training or data augmentation (regional dropout), are compatible with our method, and combining with them can further improve performance. For future work, we can explore these three directions further.

Supplementary Material

Dataset

We evaluate our model on two standard few-shot learning benchmarks: miniImageNet (Vinyals et al. 2016) and CIFAR-FS (Bertinetto et al. 2018). miniImageNet is a subset of ImageNet (Russakovsky et al. 2015) and includes a total of 100 classes with 600 images per class, each of size 84×84. Following the setup provided by (Ravi and Larochelle 2016), we use 64 classes as the meta-training set, and 16 and 20 classes as the meta-validation and meta-testing sets respectively. CIFAR-FS consists of all 100 classes from CIFAR-100 (Krizhevsky, Nair, and Hinton 2010), and each class contains 600 images of size 32×32. All classes are split into 64, 16 and 20 for meta-training, meta-validation and meta-testing, as in (Lee et al. 2019; Yang et al. 2020).

Training scheme

During meta-training, we perform data augmentation, such as horizontal flip, random crop and color (brightness, contrast and saturation) jitter, as in (Gidaris and Komodakis 2018; Qiao et al. 2018; Lee et al. 2019; Yang et al. 2020). We use SGD with Nesterov momentum of 0.9 and weight decay of 0.0005. The model is meta-trained for 60 epochs, and each epoch consists of 1000 batches with 8 episodes per batch. Without pretraining initialization, all learning rates are initially set to 0.1; with pretraining initialization, the initial learning rate of the backbone is set to 0.005 for Conv4 and 0.0005 for ResNet12. The learning rate of the Inverse Attention Module is initialized to 0.005 with ResNet12 as the backbone and 0.01 with Conv4 as the backbone, so as to coordinate with the different backbones. We drop the learning rate by factors of 0.06, 0.2 and 0.2 at epochs 20, 40 and 50 respectively.
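Because the three drops use different factors, a single-gamma MultiStepLR does not fit; a LambdaLR sketch of our reading of this schedule (function name is ours) follows.

```python
import torch

def make_optimizer_and_scheduler(params, base_lr):
    """SGD with Nesterov momentum plus the staged learning-rate drops above."""
    opt = torch.optim.SGD(params, lr=base_lr, momentum=0.9,
                          weight_decay=0.0005, nesterov=True)
    def lr_lambda(epoch):
        factor = 1.0
        for milestone, gamma in [(20, 0.06), (40, 0.2), (50, 0.2)]:
            if epoch >= milestone:      # each milestone multiplies in its factor
                factor *= gamma
        return factor
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```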

As suggested in (Liu et al. 2018; Lee et al. 2019), we apply "Higher Shot" for meta-training, which means keeping the meta-training shot higher than the meta-testing shot. Specifically, we set the training shot to 5 for CIFAR-FS, 5 for miniImageNet 1-shot and 15 for miniImageNet 5-shot. Each class contains 6 query samples during meta-training and 15 during meta-testing. The optimal model is chosen on 5-way 5-shot tasks from the meta-validation set.

The regularization parameter $\gamma$ of LSSVM is set to 0.1 for meta-training. We use one-vs-all multi-class coding and a linear kernel for all experiments. The reduction ratio $r$ of the Inverse Attention Module is set to 8 for CIFAR-FS and 16 for miniImageNet.

References

  • Andrychowicz et al. (2016) Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; Shillingford, B.; and De Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, 3981–3989.
  • Ba, Kiros, and Hinton (2016) Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Bertinetto et al. (2018) Bertinetto, L.; Henriques, J. F.; Torr, P. H.; and Vedaldi, A. 2018. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136.
  • Chen et al. (2020) Chen, J.; Zhan, L.; Wu, X.; and Chung, F. 2020. Variational Metric Scaling for Metric-Based Meta-Learning. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 3478–3485. AAAI Press. URL https://aaai.org/ojs/index.php/AAAI/article/view/5752.
  • Dontchev and Rockafellar (2009) Dontchev, A. L.; and Rockafellar, R. T. 2009. Implicit functions and solution mappings. Springer Monographs in Mathematics. Springer 208.
  • Fei et al. (2020) Fei, N.; Lu, Z.; Gao, Y.; Tian, J.; Xiang, T.; and Wen, J.-R. 2020. Meta-Learning across Meta-Tasks for Few-Shot Learning. arXiv preprint arXiv:2002.04274.
  • Fei-Fei, Fergus, and Perona (2006) Fei-Fei, L.; Fergus, R.; and Perona, P. 2006. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28(4): 594–611.
  • Finn, Abbeel, and Levine (2017) Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1126–1135. JMLR. org.
  • García-Pedrajas and Ortiz-Boyer (2011) García-Pedrajas, N.; and Ortiz-Boyer, D. 2011. An empirical study of binary classifier fusion methods for multiclass classification. Information Fusion 12(2): 111–130.
  • Gidaris et al. (2019) Gidaris, S.; Bursuc, A.; Komodakis, N.; Pérez, P.; and Cord, M. 2019. Boosting few-shot visual learning with self-supervision. In Proceedings of the IEEE International Conference on Computer Vision, 8059–8068.
  • Gidaris and Komodakis (2018) Gidaris, S.; and Komodakis, N. 2018. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4367–4375.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Hou et al. (2019) Hou, R.; Chang, H.; Bingpeng, M.; Shan, S.; and Chen, X. 2019. Cross attention network for few-shot classification. In Advances in Neural Information Processing Systems, 4003–4014.
  • Hu et al. (2020) Hu, S. X.; Moreno, P. G.; Xiao, Y.; Shen, X.; Obozinski, G.; Lawrence, N. D.; and Damianou, A. 2020. Empirical Bayes Transductive Meta-Learning with Synthetic Gradients. arXiv preprint arXiv:2004.12696.
  • Kim et al. (2019) Kim, J.; Kim, T.; Kim, S.; and Yoo, C. D. 2019. Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 11–20.
  • Krantz and Parks (2012) Krantz, S. G.; and Parks, H. R. 2012. The implicit function theorem: history, theory, and applications. Springer Science & Business Media.
  • Krizhevsky, Nair, and Hinton (2010) Krizhevsky, A.; Nair, V.; and Hinton, G. 2010. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Lake, Salakhutdinov, and Tenenbaum (2015) Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266): 1332–1338.
  • Lee et al. (2019) Lee, K.; Maji, S.; Ravichandran, A.; and Soatto, S. 2019. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 10657–10665.
  • Li et al. (2019) Li, H.; Eigen, D.; Dodge, S.; Zeiler, M.; and Wang, X. 2019. Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–10.
  • Liu et al. (2018) Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, E.; Hwang, S. J.; and Yang, Y. 2018. Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv preprint arXiv:1805.10002.
  • Mishra et al. (2017) Mishra, N.; Rohaninejad, M.; Chen, X.; and Abbeel, P. 2017. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141.
  • Nichol, Achiam, and Schulman (2018) Nichol, A.; Achiam, J.; and Schulman, J. 2018. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.
  • Qiao et al. (2019) Qiao, L.; Shi, Y.; Li, J.; Wang, Y.; Huang, T.; and Tian, Y. 2019. Transductive episodic-wise adaptive metric for few-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, 3603–3612.
  • Qiao et al. (2018) Qiao, S.; Liu, C.; Shen, W.; and Yuille, A. L. 2018. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7229–7238.
  • Ravi and Larochelle (2016) Ravi, S.; and Larochelle, H. 2016. Optimization as a model for few-shot learning. In International Conference on Learning Representations.
  • Russakovsky et al. (2015) Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115(3): 211–252.
  • Santoro et al. (2016) Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. Meta-learning with memory-augmented neural networks. In International conference on machine learning, 1842–1850.
  • Seo, Jung, and Lee (2020) Seo, J.-W.; Jung, H.-G.; and Lee, S.-W. 2020. Self-Augmentation: Generalizing Deep Networks to Unseen Classes for Few-Shot Learning. arXiv preprint arXiv:2004.00251.
  • Simonyan and Zisserman (2014) Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Snell, Swersky, and Zemel (2017) Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. In Advances in neural information processing systems, 4077–4087.
  • Srivastava et al. (2014) Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1): 1929–1958.
  • Sung et al. (2018) Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1199–1208.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998–6008.
  • Vilalta and Drissi (2002) Vilalta, R.; and Drissi, Y. 2002. A perspective view and survey of meta-learning. Artificial intelligence review 18(2): 77–95.
  • Vinyals et al. (2016) Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; et al. 2016. Matching networks for one shot learning. In Advances in neural information processing systems, 3630–3638.
  • Yang et al. (2020) Yang, L.; Li, L.; Zhang, Z.; Zhou, X.; Zhou, E.; and Liu, Y. 2020. DPGN: Distribution Propagation Graph Network for Few-shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13390–13399.
  • Ye et al. (2020) Ye, H.-J.; Hu, H.; Zhan, D.-C.; and Sha, F. 2020. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8808–8817.
  • Zhou et al. (2017) Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence 40(6): 1452–1464.
  • Zhu, Ghahramani, and Lafferty (2003) Zhu, X.; Ghahramani, Z.; and Lafferty, J. D. 2003. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), 912–919.