
Few-shot Classification via Ensemble Learning with Multi-Order Statistics

Sai Yang¹, Fan Liu²*, Delong Chen², Jun Zhou³
¹School of Electrical Engineering, Nantong University, Nantong, China
²College of Computer and Information, Hohai University, Nanjing, China
³School of Information and Communication Technology, Griffith University, Queensland, Australia
Email: [email protected]
Abstract

Transfer learning has been widely adopted for few-shot classification. Recent studies reveal that obtaining image representations that generalize well to novel classes is the key to improving few-shot classification accuracy. To address this need, we prove theoretically that leveraging ensemble learning on the base classes can correspondingly reduce the true error on the novel classes. Following this principle, a novel method named Ensemble Learning with Multi-Order Statistics (ELMOS) is proposed in this paper. In this method, after the backbone network, we use multiple branches to create the individual learners in the ensemble, with the goal of reducing the storage cost. We then introduce different-order statistics pooling in each branch to increase the diversity of the individual learners. The learners are optimized with supervised losses during the pre-training phase. After pre-training, features from different branches are concatenated for classifier evaluation. Extensive experiments demonstrate that each branch can complement the others and our method can produce state-of-the-art performance on multiple few-shot classification benchmark datasets.

1 Introduction

Figure 1: (a) Traditional methods often use different backbone networks as individuals, which significantly increases the computation and storage costs. (b) Our method takes the same backbone and equips different branches with multi-order statistics as learning individuals. They are parameter-free and trained jointly, and do not require extra model size or computation time.

Few-shot Classification (FSC) is a promising direction in alleviating labeling costs and bridging the gap between human intelligence and machine models. It aims to accurately differentiate novel classes with only a few labeled training samples. Due to limited supervision from novel classes, an extra base set with abundant labeled samples is often used to improve classification performance. According to the adopted training paradigms, FSC methods can be roughly divided into meta-learning-based Finn et al. (2017); Snell et al. (2017) and transfer-learning-based Chen et al. (2019); Liu et al. (2020); Afrasiyabi et al. (2020). The first type takes the form of episodic training, in which subsets of data are sampled from the base set to imitate the meta-test setting. Since sampling does not cover all combinations, this paradigm cannot fully utilize the information provided by the base set. In contrast, transfer learning takes the base set as a whole, avoiding this drawback of meta-learning and achieving better performance. Many effective regularization techniques have been exploited in transfer learning, for example, manifold mixup Mangla et al. (2020), self-distillation Tian et al. (2020), and self-supervised learning Zhang et al. (2020b), leading to significant improvements in the generalization of image representations and in FSC performance.

Ensemble learning combines multiple learners to solve the same problem and exhibits better generalization performance than any individual learner Yang et al. (2013). When combining ensemble learning with deep Convolutional Neural Networks (CNN), the new paradigm usually requires large-scale training data for classification tasks Horváth et al. (2021); Agarwal et al. (2021), making it challenging to adopt for FSC. Recently, two notable studies Dvornik et al. (2019); Bendou et al. (2022) employed an ensemble of deep neural networks for FSC tasks under either a meta-learning or a transfer-learning setting. They demonstrated that ensemble learning is also applicable to FSC. Yet, these works are still preliminary and lack a theoretical analysis to explain the underlying reason behind the promising performance. To address this challenge, we provide an FSC ensemble learning theorem for the transfer-learning regime. Its core result is a tighter expected error bound on the novel classes: given the base-class to novel-class domain divergence, the expected error on the novel classes can be reduced by implementing ensemble learning on the base classes.

The generalization ability of ensemble learning is strongly dependent on generating diverse individuals. As shown in Figure 1 (a), traditional methods often use different backbone networks as individuals, which significantly increases the computation and storage costs. Our work finds that different-order statistics of the CNN features are complementary to each other, and integrating them can better model the whole feature distribution. Based on this observation, we develop a parameter-free ensemble method, which takes the same backbone and equips different branches with multi-order statistics as learning individuals. We name this method Ensemble Learning with Multi-Order Statistics (ELMOS), as shown in Figure 1 (b). The main contributions of this paper are summarized as follows:

  • To our knowledge, this is the first theoretical analysis to guide ensemble learning in FSC. The derived theorem proves that a tighter expected error bound on the novel classes is attainable.

  • We propose an ensemble learning method by adding multiple branches at the end of the backbone networks, which can significantly reduce the computation time of the training stage for FSC.

  • This is the first time that multi-order statistics are introduced to generate the different individuals in ensemble learning.

  • We conduct extensive experiments to validate the effectiveness of our method on multiple FSC benchmarks.

2 Related Work

In this section, we review work related to the proposed method.

2.1 Few-shot Classification

According to how the base set is used, FSC methods can be roughly categorized into two groups, meta-learning-based Zhang et al. (2020a) and transfer-learning-based Chen et al. (2019); Liu et al. (2020). Meta-learning creates a set of episodes to simulate the real FSC test scenarios and simultaneously accumulates meta-knowledge for fast adaptation. Typical meta-knowledge includes optimization factors such as initialization parameters Finn et al. (2017) and task-agnostic comparison components such as feature embeddings and metrics Snell et al. (2017); Wertheimer et al. (2021). Recent literature on transfer learning Tian et al. (2020); Chen et al. (2019) questioned the efficiency of episodic training in meta-learning, and instead used all base samples to learn an off-the-shelf feature extractor, rebuilding a classifier for the novel classes. Feature representations play an important role in this regime Tian et al. (2020). To this end, regularization techniques such as negative-margin softmax loss and manifold mixup Liu et al. (2020); Mangla et al. (2020) have been adopted to enhance the generalization ability of the cross-entropy loss. Moreover, self-supervised Zhang et al. (2020b); Rajasegaran et al. (2020) and self-distillation Ma et al. (2019); Zhou et al. (2021) methods have also shown promising performance in transfer learning. For example, supervised learning tasks can be assisted by self-supervised proxy tasks such as rotation prediction and instance discrimination Zhang et al. (2020b), or by adding an auxiliary task of generating features during pre-training Xu et al. (2021b). When knowledge distillation is adopted, a high-quality backbone network can be evolved through multiple generations by a born-again strategy Rajasegaran et al. (2020). All these methods suggest the importance of obtaining representations that generalize, and we will leverage ensemble learning to achieve this goal.

2.2 Ensemble Learning

Ensemble learning builds several different individual learners from the same training data and then combines them to improve the generalization ability of the learning system over any single learner. This learning scheme has shown promising performance on traditional classification tasks with deep learning on large-scale labeled datasets. Recently, ensemble learning has also been applied to FSC. For example, Dvornik et al. (2019) combined an ensemble of prototypical networks through deep mutual learning under a meta-learning setting. Bendou et al. (2022) reduced the capacity of each backbone in the ensemble and pre-trained them one by one with the same routine. However, the former increases the model size at inference, while the latter requires extra time to pre-train many learning individuals. Efficient designs for the learning individuals in FSC ensemble learning are therefore still lacking. Moreover, these works did not provide any theoretical analysis of the underlying mechanism of ensemble learning in FSC. In this paper, we investigate why ensemble learning works well in FSC under the transfer-learning setting. Based on the analysis, we propose an efficient learning method that uses a shared backbone network with multiple branches to generate the learning individuals.

2.3 Pooling

Convolutional neural network models progressively learn high-level features through multiple convolution layers. A pooling layer is often added at the end of the network to output the final feature representation. Global Average Pooling (GAP) is the most popular option; however, it cannot fully exploit the merits of convolutional features because it only calculates the $1^{st}$-order feature statistics. Global Covariance Pooling (GCP), such as DeepO$^{2}$P, explores the $2^{nd}$-order statistics by normalizing the covariance matrix of the convolutional features, and has achieved impressive performance gains over classical GAP in various computer vision tasks. Further research shows that using richer statistics may lead to further improvement. For example, Kernel Pooling Cui et al. (2017) generates high-order feature representations in a compact form. However, from the view of the characteristic function of random variables, a single order of statistics can only describe partial characteristics of a feature vector: only for the Gaussian distribution do the first- and second-order statistics completely represent its statistical characteristics. Higher-order statistics are therefore still needed for non-Gaussian distributions, which are ubiquitous in real-world applications. This motivates us to calculate multi-order statistics to retain more information about the features.

3 The Proposed Method

Here we present the proposed method. We start with a formal definition of FSC, and then present a theorem on FSC ensemble learning. This theorem leads to the development of an ensemble learning approach with multi-order statistics.

3.1 Theory Foundation

Under the standard setting of few-shot classification, three sets of data with disjoint labels are available: the base set $S_b$, the validation set $S_{val}$ and the novel set $S_n$. In the context of transfer learning, $S_b$ is used to pre-train a model that classifies the novel classes in $S_n$ well, with the hyper-parameters tuned on $S_{val}$. Let $S_b=\{(x_i,y_i)\}_{i=1}^{N_b}$ denote the source domain with $N_b$ labelled samples and $S_n$ denote the target domain with $K$ labelled samples in each episode, where $N_b \gg K$. Let the label functions of $S_b$ and $S_n$ be $f_b$ and $f_n$, respectively. During pre-training, a learner $h$ from the hypothesis space $\mathcal{H}$ is obtained to approximate the optimal mapping function $h^*$ based on all $N_b$ training samples in $S_b$. When ensemble learning is introduced into the pre-training, several learners denoted $\{h_o\}_{o=1}^{O}$ are obtained. With the ensemble technique of weighted averaging, the final learner $\overline{h}$ is produced as:

\overline{h}=\sum_{o=1}^{O}\alpha_{o}h_{o}, (1)

where $\alpha_o$ is the weight parameter. There is a domain shift between the base and novel classes Tseng et al. (2020), and we use the $L_1$ distance Kifer et al. (2004) to measure the domain divergence between $S_b$ and $S_n$:

\mathcal{D}(S_{b},S_{n})=\int\left|\eta_{b}(x)-\eta_{n}(x)\right|\left|\overline{h}(x)-f_{n}(x)\right|dx, (2)

where $\eta_b(x)$ and $\eta_n(x)$ are the density functions of $S_b$ and $S_n$, respectively.

Theorem 1 (FSC Ensemble Learning)

Let $\mathcal{H}$ be a hypothesis space. For any $h\in\{h_o\}_{o=1}^{O}\subset\mathcal{H}$ learned from $S_b$ and $\overline{h}=\sum_{o=1}^{O}\alpha_o h_o\in\mathcal{H}$, the expected errors on $S_n$ with $\overline{h}$ and $h$ satisfy the following relationship:

e_{n}(\overline{h})\leq e_{b}(\overline{h})+\underbrace{\mathcal{D}(S_{b},S_{n})}_{\text{($S_{b}$-$S_{n}$) divergence}}+\lambda \leq e_{b}(h)+\underbrace{\mathcal{D}(S_{b},S_{n})}_{\text{($S_{b}$-$S_{n}$) divergence}}+\lambda,

where $\lambda=E_{x\in S_b}\left|f_n(x)-f_b(x)\right|$ is a constant, $e_n(\overline{h})$ is the expected error on $S_n$ with $\overline{h}$, $e_b(h)$ is the expected error on $S_b$ with $h$, and $e_b(\overline{h})$ is the expected error on $S_b$ with $\overline{h}$.

The proof is provided in the Supplementary Material.

Remark 1

The core idea of Theorem 1 is a tighter expected error bound on the novel classes for a mapping function learned in the form of ensemble learning during pre-training. Theorem 1 states that the true error on the novel classes can be reduced by implementing ensemble learning on the base classes, given the domain divergence between the novel and base classes. This explains the effectiveness of ensemble learning in few-shot classification: multiple learners are assembled to enhance generalization on the base set, resulting in better performance on the novel classes.

3.2 FSC via Ensemble Learning with Multi-order Statistics

3.2.1 Overview

Our method employs the transfer-learning paradigm in a two-phase manner. In the first phase, a good feature extractor is pre-trained on the base set. In the second phase, FSC evaluation is performed on the novel set with the pre-trained feature extractor. Following Theorem 1, we introduce ensemble learning into the first phase to improve FSC performance. The key to this phase is to effectively train multiple diverse individuals. Different from previous works Dvornik et al. (2019); Bendou et al. (2022) that use many different networks as individuals, we add multiple branches after the backbone network to create individuals, which reduces training costs. Each branch calculates different-order statistics for pooling to highlight the discrepancy between the individuals, and is optimized with supervised losses. After pre-training, features from different branches are concatenated for FSC evaluation. We name this method Ensemble Learning with Multi-Order Statistics (ELMOS). An overview of ELMOS is shown in Figure 2, and a flow description is given in Algorithm 1.
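To make the overall design concrete, here is a minimal PyTorch sketch of the ELMOS architecture; the backbone stub, feature dimensions, and class counts are illustrative assumptions rather than the released implementation, and the third-order branch omits the normalization of Equation (6) below.

```python
import torch
import torch.nn as nn

class ELMOSSketch(nn.Module):
    """Schematic ELMOS model: one shared backbone, three parameter-free
    statistic branches, and one (classifier, projector) pair per branch.
    Shapes and the backbone stub are illustrative assumptions."""

    def __init__(self, backbone, d=640, n_cls=64, proj_dim=128):
        super().__init__()
        self.backbone = backbone                     # B_theta: image -> (B, d, H, W)
        feat_dims = [d, d * d, d * d]                # z1, z2, z3 after flattening
        # 8 * n_cls outputs: labels are expanded by the 8 scale/rotation transforms
        self.classifiers = nn.ModuleList(nn.Linear(f, 8 * n_cls) for f in feat_dims)
        self.projectors = nn.ModuleList(nn.Linear(f, proj_dim) for f in feat_dims)

    def branch_features(self, x):
        t = self.backbone(x)                           # (B, d, H, W)
        b, d, h, w = t.shape
        t = t.flatten(2).transpose(1, 2)               # (B, HW, d): rows are observations
        c1 = t.mean(dim=1)                             # 1st-order statistic, (B, d)
        tc = t - c1.unsqueeze(1)                       # centred observations
        c2 = tc.transpose(1, 2) @ tc / (h * w)         # 2nd-order statistic, (B, d, d)
        c3 = (tc ** 2).transpose(1, 2) @ tc / (h * w)  # raw 3rd-order term (unnormalized here)
        return [c1, c2.flatten(1), c3.flatten(1)]      # flatten the matrix statistics

    def forward(self, x):
        zs = self.branch_features(x)
        logits = [clf(z) for clf, z in zip(self.classifiers, zs)]
        projs = [proj(z) for proj, z in zip(self.projectors, zs)]
        return zs, logits, projs

# usage (a toy convolution stands in for ResNet12):
# model = ELMOSSketch(nn.Conv2d(3, 640, 3), d=640, n_cls=64)
```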

Figure 2: An overview of our framework. The images from $S_b$ are augmented by the image processing module and fed into the backbone for feature extraction. The CNN features from the backbone are then reshaped into a matrix, which is used to calculate the multi-order statistics that equip the different branches. Ensemble learning is implemented by a linear combination of the multiple branches during the pre-training phase.

3.2.2 Pre-training via Multi-order Statistics

The proposed model architecture mainly consists of four components: an image processing module, the backbone network, a multi-order statistics module, and a supervised classifier module. The image processing module, denoted $M(\cdot)$, performs multi-scale rotation transformations to augment the original base set and its label space. The backbone network, denoted $B_\theta(\cdot)$ and parameterized by $\theta$, converts each image into a tensor of size $H\times W\times d$. The multi-order statistics module, denoted $S(\cdot)$, maps the tensor from the backbone into multiple feature representations to generate the individual learners for ensemble learning. The supervised classifier module is composed of softmax classifiers $L_W(\cdot)$ and projectors $L_U(\cdot)$ with parameter matrices $W$ and $U$, respectively, which are used to build the supervised losses for pre-training.

Let $L$ samples be randomly sampled from $S_b$ with $C_b$ classes, where an image and its corresponding label are denoted $(x_i,y_i)$, $y_i\in\{1,2,\ldots,C_b\}$. $M(\cdot)$ scales the images with an aspect ratio of 2:3 and rotates the images by $\{0^{\circ},90^{\circ},180^{\circ},270^{\circ}\}$ under both the new and the original scales, resulting in an eight-fold expansion of the training samples. Feeding $x_i$ into $B_\theta$ produces a tensor feature $T_i=B_\theta(x_i)\in\mathscr{R}^{H\times W\times d}$. Next, we reshape the tensor $T_i$ into a matrix $T_i\in\mathscr{R}^{HW\times d}$ and view each row vector $t_j\in\mathscr{R}^{d}$ as an observation of the random variable $t\in\mathscr{R}^{d}$. When $d=1$, the first characteristic function of the variable $t$ under the Laplace operator is given by:

\phi(s)=\int_{-\infty}^{+\infty}f(t)e^{st}dt=\int_{-\infty}^{+\infty}e^{st}dF(t), (3)

where $f(t)$ and $F(t)$ are the density function and distribution function of $t$, respectively. Let $\psi(s)=\ln\phi(s)$ be the second characteristic function of the random variable $t$.

Theorem 2 (The Inversion Formula for Distributions)

Let $t$ be a random variable with distribution function $F(t)$ and characteristic function $\phi(s)$. For $a,b\in C(F)$ and $a<b$,

F(b)-F(a)=\lim_{c\to\infty}\frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-sa}-e^{-sb}}{s}\phi(s)\,ds.
Corollary 1 (Uniqueness)

If the characteristic functions $\phi_1(s)$ and $\phi_2(s)$ of two distributions $F_1(t)$ and $F_2(t)$ are identical, then $F_1(t)$ and $F_2(t)$ are identical.

See the proofs of Theorem 2 and Corollary 1 in Shiryaev (2016). From Theorem 2 and Corollary 1, we can see that there is a one-to-one correspondence between the characteristic function and the probability density function, so the characteristic function completely describes a random variable.

The $o^{th}$-order cumulant of the random variable $t$ is defined as the $o^{th}$ derivative of the function $\psi(s)$ at the origin:

c_{o}=\frac{d^{o}\psi(s)}{ds^{o}}\bigg|_{s=0}. (4)

Then the Taylor series expansion of the function $\psi(s)$ at the origin with respect to $s$ yields:

\psi(s)=c_{1}s+\frac{1}{2}c_{2}s^{2}+\cdots+\frac{1}{o!}c_{o}s^{o}+R_{s}(s^{o}), (5)

where $R_s(s^o)$ is the remainder term. It can be seen that the $o^{th}$-order cumulant of $t$ appears, scaled by $1/o!$, as the coefficient of the term $s^o$ in Equation (5).

Proposition 1

Consider a Gaussian distribution $f(t)$ with mean $\mu$ and variance $\Sigma^2$ for the random variable $t$; its second characteristic function is:

\psi(s)=\mu s+\frac{1}{2}\Sigma^{2}s^{2}.

Consequently, the cumulants of the random variable $t$ are:

c_{1}=\mu,\quad c_{2}=\Sigma^{2},\quad c_{o}=0\quad(o=3,4,\ldots).

The proof is provided in the Supplementary Material.

Remark 2

Proposition 1 implies that only for Gaussian signals are the cumulants identically zero when the order is greater than 2. Please note this conclusion naturally extends to the multivariate scenario when $d>1$. For random variables with a Gaussian distribution, the first- and second-order statistics completely represent their statistical characteristics. However, non-Gaussian signals are more common in real-world applications, and in this case higher-order statistics also contain much useful information. Therefore, we propose a multi-order statistics module consisting of multiple branches, each equipped with a different order statistic of the tensor feature $T_i$.
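As a quick numerical illustration of Proposition 1 and this remark, the NumPy check below (with arbitrary sample sizes and distributions of our choosing) estimates the third-order cumulant empirically: it vanishes for Gaussian samples but not for a skewed, non-Gaussian distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def third_cumulant(t):
    """Empirical 3rd-order cumulant E[(t - mu)^3] of a 1-d sample (the d = 1 case)."""
    return np.mean((t - t.mean()) ** 3)

gauss = rng.normal(loc=2.0, scale=1.5, size=1_000_000)  # Gaussian: c3 should be ~0
expo = rng.exponential(scale=1.0, size=1_000_000)       # skewed: c3 = 2 for Exp(1)

print(f"Gaussian    c1={gauss.mean():.3f} c2={gauss.var():.3f} c3={third_cumulant(gauss):.3f}")
print(f"Exponential c1={expo.mean():.3f} c2={expo.var():.3f} c3={third_cumulant(expo):.3f}")
# The Gaussian c3 is near zero (Proposition 1); the exponential's is not,
# so higher-order statistics carry extra information for non-Gaussian data.
```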

In particular, we employ three branches in the multi-order statistics module, which respectively calculate the first three orders of cumulants of the variable $t$ from the observations in $T_i$. The $1^{st}$-order, $2^{nd}$-order and $3^{rd}$-order cumulants of $t$ are expressed as:

\begin{split}&c_{i1}=\frac{1}{H\times W}\sum_{j=1}^{H\times W}t_{j},\quad c_{i1}\in\mathscr{R}^{d},\\&c_{i2}=\frac{1}{H\times W}\sum_{j=1}^{H\times W}(t_{j}-c_{i1})(t_{j}-c_{i1})^{T},\quad c_{i2}\in\mathscr{R}^{d\times d},\\&c_{i3}=\frac{1}{H\times W}\sum_{j=1}^{H\times W}\frac{(t_{j}-c_{i1})^{2}(t_{j}-c_{i1})^{T}}{c_{i2}^{2}c_{i2}^{T}},\quad c_{i3}\in\mathscr{R}^{d\times d}.\end{split} (6)
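For concreteness, a minimal NumPy sketch of the pooling in Equation (6) follows, including the flattening described next. The notation $c_{i2}^{2}c_{i2}^{T}$ in the third-order term is ambiguous in print; the element-wise reading and the stabilizing constant below are our own assumptions, not the paper's code.

```python
import numpy as np

def multi_order_pooling(T, eps=1e-6):
    """Pool a reshaped feature matrix T of shape (H*W, d) into the three
    branch features of Equation (6). The element-wise normalization of the
    3rd-order term and the eps stabilizer are assumptions."""
    hw, d = T.shape
    c1 = T.mean(axis=0)                             # (d,)   1st-order cumulant
    Tc = T - c1                                     # centred observations t_j - c_i1
    c2 = Tc.T @ Tc / hw                             # (d, d) 2nd-order cumulant
    m3 = (Tc ** 2).T @ Tc / hw                      # (d, d) third-moment matrix
    c3 = m3 / (np.abs((c2 ** 2) * c2.T) + eps)      # element-wise reading of c2^2 c2^T
    # flatten the d x d statistics into d^2-dimensional vectors z_i2, z_i3
    return c1, c2.reshape(-1), c3.reshape(-1)

z1, z2, z3 = multi_order_pooling(np.random.randn(25, 16))   # e.g. H = W = 5, d = 16
print(z1.shape, z2.shape, z3.shape)                         # (16,) (256,) (256,)
```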

As $c_{i2}$ and $c_{i3}$ are $d\times d$ matrices, we flatten them into $d^2$-dimensional vectors and finally obtain the feature representations $z_{i1}$, $z_{i2}$ and $z_{i3}$. We use these three features as the individuals in ensemble learning; each respectively passes through its corresponding softmax classifier $L_W(\cdot)$ and projector $L_U(\cdot)$. The $o$-th ($o=1,2,3$) outputs are:

\begin{split}&P_{ij}^{o}=L_{W_o}(z_{io})=\frac{\exp(z_{io}^{T}w_{oj})}{\sum_{j=1}^{8C_{b}}\exp(z_{io}^{T}w_{oj})},\\&u_{io}=\left\|L_{U_o}(z_{io})\right\|=\left\|z_{io}^{T}U_{o}\right\|,\end{split} (7)

where $L_{W_o}(\cdot)$ is the $o$-th softmax classifier with parameter matrix $W_o$, and $w_{oj}$ is the $j$-th component of $W_o$. $L_{U_o}(\cdot)$ is the $o$-th projector with parameter matrix $U_o$. $P_{ij}^{o}$ is the $j$-th component of the output probability from the $o$-th softmax classifier, and $u_{io}$ is the output vector from the $o$-th projector. We simultaneously employ a Classification-Based (CB) cross-entropy loss and a Similarity-Based (SB) supervised contrastive loss for each individual Scott et al. (2021). These two losses are formulated as:

L_{CB}^{o}(\theta,W_{o})=-\sum_{i=1}^{8L}\sum_{j=1}^{8C_{b}}y_{ij}\log P_{ij}^{o}, (8)
L_{SB}^{o}(\theta,U_{o})=-\sum_{i=1}^{8L}\log\sum_{q\in Q(u_{io})}\frac{\exp(u_{io}\cdot u_{qo}/\tau)}{\sum_{a=1}^{8L}\exp(u_{ao}\cdot u_{qo}/\tau)},

where $y_{ij}$ is the $j$-th component of the label $y_i$, and $\tau$ is a scalar temperature parameter. $Q(u_{io})$ is the positive sample set, in which each sample has the same label as $u_{io}$, and $u_{qo}$ is the $q$-th sample in $Q(u_{io})$. Then the learning objective for the $o$-th individual is:

L_{o}(\theta,W_{o},U_{o})=L_{CB}^{o}(\theta,W_{o})+L_{SB}^{o}(\theta,U_{o}). (9)

The overall loss function with ensemble learning is:

L_{overall}=\sum_{o=1}^{O}\alpha_{o}L_{o}(\theta,W_{o},U_{o}), (10)

where $\alpha_o$ is a weight controlling the contribution of each individual to the ensemble. Pre-training optimizes the above loss function by gradient descent.
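A PyTorch sketch of this objective is given below. The contrastive term uses a standard supervised-contrastive form as a stand-in for $L_{SB}$ in Equation (8), and the branch weights follow values explored in the parameter analysis; both are assumptions rather than the exact released code.

```python
import torch
import torch.nn.functional as F

def supcon_loss(u, labels, tau=0.1):
    """Similarity-based (SB) loss: a standard supervised-contrastive form,
    used here as a stand-in for L_SB in Equation (8). u: (B, p) projections."""
    u = F.normalize(u, dim=1)
    sim = u @ u.T / tau                                   # pairwise similarities
    mask = (labels[:, None] == labels[None, :]).float()   # positives share a label
    mask.fill_diagonal_(0)                                # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(mask * log_prob).sum(1).div(mask.sum(1).clamp(min=1)).mean()

def elmos_loss(logits, projs, labels, alphas=(1.0, 0.3, 1.0)):
    """Overall objective of Equation (10): a weighted sum over the three
    branches of CB (cross-entropy) plus SB (contrastive) losses."""
    total = 0.0
    for o, (lg, u) in enumerate(zip(logits, projs)):
        l_o = F.cross_entropy(lg, labels) + supcon_loss(u, labels)  # Eq. (9)
        total = total + alphas[o] * l_o
    return total
```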

Input: Base set $S_b$, support set $S_p$, query set $S_q$; augmentation module $M(\cdot)$, backbone network $B_\theta(\cdot)$, multi-order statistics module $S(\cdot)$, softmax classifiers $L_{W_o}$, projectors $L_{U_o}$ and logistic regression $g_\xi(\cdot)$; temperature parameter $\tau$, weights $\alpha_o$ ($o=1,2,3$).
Output: Final prediction of the query samples
Stage 1: Pre-training with ensemble learning
for number of training epochs do
       Sample a mini-batch of images $\{x_i,y_i\}$;
      Feed $x_i$ into $M(\cdot)$ and $B_\theta(\cdot)$ to obtain the feature map $T_i\in\mathscr{R}^{H\times W\times d}$;
      Pass $T_i$ through $S(\cdot)$ to output features $z_{io}$ ($o=1,2,3$);
      Pass $z_{io}$ through $L_{W_o}$ and $L_{U_o}$ to get the output probabilities and projection features;
      Calculate the optimization loss for each individual via Equation (9);
      Calculate the overall loss for pre-training via Equation (10);
      Update the parameters $\theta$, $W_o$, $U_o$ using SGD;
end for
Stage 2: Few-shot evaluation
for iteration = 1, 2, …, MaxIteration do
       Feed $x_s\in S_p$ into $B_\theta(\cdot)$ and $S(\cdot)$ to output features $z_{so}$ ($o=1,2,3$);
      Concatenate $z_{so}$ into the feature $z_s$ to train the classifier $g_\xi(\cdot)$;
end for
Classify the query samples according to Equation (13).
Algorithm 1: Ensemble Learning with Multi-Order Statistics (ELMOS) for FSC

3.2.3 Few-shot Evaluation

The few-shot evaluation phase constructs a set of $N$-way $K$-shot FSC tasks, each with a support set and a query set. The support set randomly selects $K$ samples from each of $N$ classes sampled from $S_n$, denoted $S_p=\{x_s,y_s\}_{s=1}^{NK}$, where $(x_s,y_s)$ is the $s$-th image and its corresponding label. The query set consists of the remaining images of these $N$ classes, denoted $S_q=\{x_q\}_{q=1}^{Q}$. After pre-training, we discard the softmax classifiers $L_W(\cdot)$ and projectors $L_U(\cdot)$ and freeze the backbone network $B_\theta(\cdot)$ and the multi-order statistics module $S(\cdot)$. The support set $S_p$ is fed into $B_\theta(\cdot)$ and $S(\cdot)$ to produce the output features:

z_{so}=B_{\theta}\circ S(x_{s})\quad(o=1,2,3), (11)

where $\circ$ is the stack operator. The features $z_{s1},z_{s2},z_{s3}$ are concatenated into a final representation of $x_s$:

z_{s}=con(z_{s1},z_{s2},z_{s3}), (12)

where $con(\cdot)$ is the concatenation operator. A logistic regression classifier $g_\xi(\cdot)$ parameterized by $\xi$ is then trained with $z_s$ and its corresponding label $y_s$. The query image $x_q$ is finally classified as:

\hat{y}_{q}=g_{\xi}(z_{q}), (13)

where $\hat{y}_q$ is the predicted label of $x_q$.
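A minimal sketch of this evaluation protocol is shown below, reusing the `multi_order_pooling` function from the earlier sketch; `backbone_fn` and the scikit-learn classifier settings are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(backbone_fn, images):
    """Concatenate the three branch features (Equations (11)-(12)).
    backbone_fn maps one image to its (H*W, d) feature matrix; an assumption."""
    feats = []
    for img in images:
        z1, z2, z3 = multi_order_pooling(backbone_fn(img))  # from the earlier sketch
        feats.append(np.concatenate([z1, z2, z3]))          # z_s = con(z1, z2, z3)
    return np.stack(feats)

def evaluate_episode(backbone_fn, support_x, support_y, query_x):
    """Train logistic regression g_xi on the support set, then label the queries (Eq. 13)."""
    clf = LogisticRegression(max_iter=1000)          # hyper-parameters are placeholders
    clf.fit(embed(backbone_fn, support_x), support_y)
    return clf.predict(embed(backbone_fn, query_x))  # predicted labels y_hat_q
```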

Table 1: Test accuracy (%) of each branch and their ensemble under 5-way 1-shot and 5-shot tasks on three datasets.
Method Backbone miniImageNet CIFAR-FS CUB
1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
B_1 ResNet12 69.06±0.44 83.61±0.29 77.09±0.46 88.46±0.34 81.46±0.39 92.55±0.18
B_2 ResNet12 66.42±0.42 85.76±0.26 71.53±0.48 88.83±0.27 77.79±0.39 94.44±0.17
B_3 ResNet12 67.68±0.43 82.81±0.29 72.83±0.46 86.34±0.34 83.89±0.38 91.20±0.17
ELMOS ResNet12 70.30±0.45 86.17±0.26 78.18±0.41 89.87±0.31 85.21±0.38 95.02±0.16
Figure 3: Test accuracy (%) of the classification-based (CB) loss, similarity-based (SB) loss and their combination (CB&SB) under 5-way 1-shot (a) and 5-way 5-shot (b) tasks on three datasets.
Table 2: Comparison of results against state-of-the-art methods on the CUB dataset. The top three results are marked in red, blue and green.
Method CUB
1-shot 5-shot
Meta-learning
Relational Sung et al. (2018) 55.00±1.00 69.30±0.80
DeepEMD Zhang et al. (2020a) 75.65±0.83 88.69±0.50
BML Zhou et al. (2021) 76.21±0.63 90.45±0.36
RENet Kang et al. (2021) 79.49±0.44 91.11±0.24
FPN Wertheimer et al. (2021) 83.55±0.19 92.92±0.10
IEPT Zhang et al. (2020b) 69.97±0.49 84.33±0.33
APP2S Ma et al. (2022) 77.64±0.19 90.43±0.18
MFS Afrasiyabi et al. (2022) 79.60±0.80 90.48±0.44
DeepBDC Xie et al. (2022) 84.01±0.42 94.02±0.24
HGNN Yu et al. (2022) 78.58±0.20 90.02±0.12
INSTA Rongkai Ma (2022) 75.26±0.31 88.12±0.54
Transfer-learning
Baseline++ Chen et al. (2019) 60.53±0.83 79.34±0.61
Neg-Cosine Liu et al. (2020) 72.66±0.85 89.40±0.43
S2M2 Mangla et al. (2020) 80.68±0.81 90.85±0.44
DC-LR Yang et al. (2021) 79.56±0.87 90.67±0.35
CCF Xu et al. (2021b) 81.85±0.42 91.58±0.32
ELMOS (ours) 85.21±0.38 95.02±0.16

4 Experiments

4.1 Datasets

miniImageNet contains 100 classes with 600 images per class, which are divided into 64, 16 and 20 classes for the base, validation and novel sets, respectively. tieredImageNet consists of 779,165 images belonging to 608 classes, which are further grouped into 34 higher-level categories with 10 to 30 classes per category. These categories are partitioned into 20 categories (351 classes), 6 categories (97 classes) and 8 categories (160 classes) for the base, validation and novel sets, respectively. CIFAR-FS is derived from CIFAR100 and consists of 100 classes with 600 images per class. The classes are split into 64, 16 and 20 for the base, validation and novel sets. Caltech-UCSD Birds-200-2011 (CUB) has a total of 11,788 images over 200 bird species. The species are divided into 100, 50, and 50 for the base, validation and novel sets, respectively.

4.2 Implementation Details

In the experiments, we primarily used the ResNet12 architecture with 4 residual blocks. Each block had 3 convolutional layers with 3×3 kernels. The numbers of kernels for the 4 blocks were 64, 160, 320, and 640, respectively. A max-pooling layer was added at the end of the first three blocks. The last block was branched with three pooling layers, which respectively model different statistical representations of the images. We opted for the SGD optimizer with a momentum of 0.9 and a weight decay of 5e-4. The learning rate was initialized to 0.025. We trained the network for 130 epochs with a batch size of 32 in all the experiments. For miniImageNet, tieredImageNet and CIFAR-FS, the learning rate was reduced by a factor of 0.2 at the 70th and 100th epochs. For CUB, the learning rate was reduced by a factor of 0.2 every 15 epochs after the 75th epoch. We randomly sampled 2,000 episodes from $S_n$ with 15 query samples per class for both 5-way 1-shot and 5-shot evaluations, reporting the mean classification accuracy with the 95% confidence interval.
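In PyTorch terms, this optimization schedule corresponds roughly to the following sketch; the `model` placeholder and the exact CUB milestones are assumptions.

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder for the ELMOS pre-training network
optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                            momentum=0.9, weight_decay=5e-4)

# miniImageNet / tieredImageNet / CIFAR-FS: decay by 0.2 at epochs 70 and 100
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[70, 100], gamma=0.2)
# CUB instead decays by 0.2 every 15 epochs after epoch 75, e.g. milestones=[75, 90, 105, 120]

for epoch in range(130):   # 130 epochs with batch size 32 in the paper
    ...                    # one training epoch over the augmented base set
    scheduler.step()
```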

Table 3: Comparison of results against state-of-the-art methods on the miniImageNet, tieredImageNet, and CIFAR-FS datasets. '-' means the results were not provided by the authors. The top three results are marked in red, blue and green, respectively.
Method Backbone Venue miniImageNet tieredImageNet CIFAR-FS
1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
Meta-learning
DeepEMD Zhang et al. (2020a) ResNet12 CVPR’20 65.91±0.82 82.41±0.56 71.16±0.87 86.03±0.58 - -
CC+rot Gidaris et al. (2019) ResNet12 CVPR’20 62.93±0.45 79.87±0.33 70.53±0.51 84.98±0.36 76.09±0.30 87.83±0.21
BML Zhou et al. (2021) ResNet12 ICCV’21 67.04±0.63 83.63±0.29 68.99±0.50 85.49±0.34 73.45±0.47 88.04±0.33
RENet Kang et al. (2021) ResNet12 ICCV’21 67.60±0.44 82.58±0.30 71.61±0.51 85.28±0.35 74.51±0.46 86.60±0.32
MeTAL Baik et al. (2021) ResNet12 CVPR’21 66.61±0.28 81.43±0.25 70.29±0.40 86.17±0.35 - -
DAN Xu et al. (2021a) ResNet12 CVPR’21 67.76±0.46 82.71±0.31 71.89±0.52 85.96±0.35 - -
IEPT Zhang et al. (2020b) ResNet12 ICLR’21 67.05±0.44 82.90±0.30 72.24±0.50 86.73±0.34 - -
APP2S Ma et al. (2022) ResNet12 AAAI’22 66.25±0.20 83.42±0.15 72.00±0.22 86.23±0.15 73.12±0.22 85.69±0.16
DeepBDC Xie et al. (2022) ResNet12 CVPR’22 67.34±0.43 84.46±0.28 72.34±0.49 87.31±0.32 - -
MFS Afrasiyabi et al. (2022) ResNet12 CVPR’22 68.32±0.62 82.71±0.46 73.63±0.88 87.59±0.57 - -
TPMN Wu et al. (2021) ResNet12 CVPR’22 67.64±0.63 83.44±0.43 72.24±0.70 86.55±0.63 - -
HGNN Yu et al. (2022) ResNet12 AAAI’22 67.02±0.20 83.00±0.13 72.05±0.23 86.49±0.15 - -
DSFN Zhang and Huang (2022) ResNet12 ECCV’22 61.27±0.71 80.13±0.17 65.46±0.70 82.41±0.53 - -
MTR Bouniot et al. (2022) ResNet12 ECCV’22 62.69±0.20 80.95±0.14 68.44±0.23 84.20±0.16 - -
Transfer-learning
Baseline++ Chen et al. (2019) ResNet12 ICLR’19 48.24±0.75 66.43±0.63 - - - -
Neg-Cosine Liu et al. (2020) WRN28 ECCV’20 61.72±0.81 81.79±0.55 - - - -
RFS Tian et al. (2020) WRN28 ECCV’20 64.82±0.60 82.14±0.43 71.52±0.69 86.03±0.49 - -
CBM Wang et al. (2020) ResNet12 MM’20 64.77±0.46 80.50±0.33 71.27±0.50 85.81±0.34 - -
SKD Rajasegaran et al. (2020) ResNet12 Arxiv’21 67.04±0.85 83.54±0.54 72.03±0.91 86.50±0.58 76.9±0.9 88.9±0.6
IE Sung et al. (2021) ResNet12 CVPR’21 67.28±0.80 84.78±0.33 72.21±0.90 87.08±0.58 77.87±0.85 89.74±0.57
PAL Ma et al. (2019) ResNet12 ICCV’21 69.37±0.64 84.40±0.44 72.25±0.72 86.95±0.47 77.1±0.7 88.0±0.5
CCF Xu et al. (2021b) ResNet12 CVPR’22 68.88±0.43 84.59±0.30 - - - -
ELMOS (ours) ResNet12 - 70.30±0.45 86.17±0.26 73.84±0.49 87.98±0.31 78.18±0.41 89.87±0.31
Table 4: Comparison of results with the most related method under 5-way 1-shot and 5-shot tasks on CIFAR-FS and CUB.
Method CIFAR-FS CUB
1-shot 5-shot 1-shot 5-shot
EASY 75.24±0.20 88.38±0.14 77.97±0.20 91.59±0.10
ELMOS 78.18±0.41 89.87±0.31 85.21±0.38 95.02±0.16

4.3 Ablation Studies

The effectiveness of our method is attributed to the ensemble of different branches equipped with multi-order statistics. In this section, we conduct ablation studies to analyze the effect of the $1^{st}$-order, $2^{nd}$-order and $3^{rd}$-order statistical pooling and their combination on the miniImageNet, CIFAR-FS and CUB datasets. These methods are denoted B_1, B_2, B_3, and ELMOS, respectively. Their accuracies under 5-way 1-shot and 5-shot tasks on the three datasets are shown in Table 1. From the results, we can see that: (1) On all three datasets, the test accuracy of B_1 and B_3 is higher than B_2 under the 1-shot task, but the test accuracy of B_2 is higher than B_1 and B_3 under the 5-shot task. This shows that different order statistics provide different information about the images. (2) The test accuracy of ELMOS is higher than B_1, B_2 and B_3 under both 1-shot and 5-shot tasks, which illustrates that different order statistics complement each other. Combining them brings more useful information for classification, resulting in higher classification performance.

For each individual in the ensemble, the optimization is accomplished jointly by the Classification-Based (CB) loss and the Similarity-Based (SB) loss Scott et al. (2021). Hence, we conducted ablation experiments to analyze the contribution of each loss on three benchmark datasets: miniImageNet, CIFAR-FS and CUB. We pre-trained the model with the CB loss alone, the SB loss alone, and their combination, resulting in three methods denoted CB, SB and CB&SB. The test accuracies are shown in Figure 3. The results show that the accuracy of CB&SB is higher than that of CB and SB, which implies that both the classification-based and similarity-based losses play important roles in our method.

4.4 Comparison with the Most Related Method

Our method is most related to EASY Bendou et al. (2022), which is also an FSC ensemble learning method in the context of transfer learning. The comparison between them on the CIFAR-FS and CUB datasets is shown in Table 4. From the results, we can see that our method beats EASY by a large margin under both 1-shot and 5-shot tasks. Please note that our method is also more efficient than EASY: EASY needs to pre-train multiple individual networks, which takes much more pre-training time than our method.

4.5 Comparison with State-of-the-Art Methods

We compare the performance of our method with several state-of-the-art methods, which are either meta-learning based or transfer-learning based. The comparison is shown in Table 2 and Table 3. From Table 2, we can see that the performance of our method ranks at the top under both 1-shot and 5-shot tasks on CUB. Specifically, our method exceeds the second-best model DeepBDC by 1.2% and 1.0% in the 1-shot and 5-shot settings, respectively. From Table 3, we can see that our method beats state-of-the-art methods under both 5-way 1-shot and 5-way 5-shot tasks on miniImageNet, tieredImageNet, and CIFAR-FS. Specifically, on miniImageNet, PAL and IE are the second best in the 1-shot and 5-shot settings, respectively; our method beats them by 0.93% and 1.39%. On tieredImageNet, our method outperforms the second-best MFS by 0.21% and 0.39% in the 1-shot and 5-shot settings, respectively. On CIFAR-FS, our method achieves 0.31% and 0.13% improvements over IE for 1-shot and 5-shot, respectively. In brief, our method consistently outperforms the state-of-the-art FSC methods under both 1-shot and 5-shot tasks on multiple datasets. These promising results are achieved because of the well-generalizing representation obtained by ensemble learning with multi-order statistics on the base set.

5 Conclusion

This paper analyzes the underlying working mechanism of ensemble learning in few-shot classification. A theorem is provided to show that the true error on the novel classes can be reduced with ensemble learning on the base set, given the domain divergence between the base and the novel classes. Multi-order statistics on image features are further introduced to produce the learning individuals, yielding an effective ensemble learning design. Comprehensive experiments on multiple benchmarks illustrate that different-order statistics generate diverse learning individuals due to their complementarity. The promising FSC performance obtained with ensemble learning on the base set validates the proposed theorem.

References

  • Afrasiyabi et al. [2020] Arman Afrasiyabi, Jean-François Lalonde, and Christian Gagné. Associative alignment for few-shot image classification. In Proceedings of European Conference on Computer Vision, pages 18–35, Glasgow, UK, November 2020. Springer.
  • Afrasiyabi et al. [2022] Arman Afrasiyabi, Hugo Larochelle, Jean-François Lalonde, and Christian Gagné. Matching feature sets for few-shot image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9014–9024, New Orleans, USA, June 2022. IEEE.
  • Agarwal et al. [2021] Rishabh Agarwal, Levi Melnick, Nicholas Frosst, Xuezhou Zhang, Rich Caruana, and Geoffrey E Hinton. Neural additive models: Interpretable machine learning with neural nets. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pages 4699–4711, December 2021. Neural Information Processing Systems Foundation.
  • Baik et al. [2021] Sungyong Baik, Janghoon Choi, Heewon Kim, Dohee Cho, Jaesik Min, and Kyoung Mu Lee. Meta-learning with task-adaptive loss function for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9465–9474, Nashville, USA, June 2021. IEEE.
  • Bendou et al. [2022] Yassir Bendou, Yuqing Hu, Raphael Lafargue, Giulia Lioi, Stéphane Pateux, and Vincent Gripon. Easy—ensemble augmented-shot-y-shaped learning: State-of-the-art few-shot classification with simple components. Journal of Imaging, 8(7):179, 2022.
  • Bertinetto et al. [2019] Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. In Proceedings of 7th International Conference on Learning Representations, New Orleans, USA, May 2019. International Conference on Learning Representations.
  • Bouniot et al. [2022] Quentin Bouniot, Ievgen Redko, Romaric Audigier, Angélique Loesch, and Amaury Habrard. Improving few-shot learning through multi-task representation learning theory. In Proceedings of European Conference on Computer Vision, pages 435–452, Tel Aviv, Israel, October 2022. Springer.
  • Chen et al. [2019] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In Proceedings of 7th International Conference on Learning Representations, New Orleans, USA, May 2019. International Conference on Learning Representations.
  • Cui et al. [2017] Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, and Serge Belongie. Kernel pooling for convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2921–2930, Venice, Italy, February 2017. IEEE.
  • Dvornik et al. [2019] Nikita Dvornik, Cordelia Schmid, and Julien Mairal. Diversity with cooperation: Ensemble methods for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3723–3731, Seoul, Korea, February 2019. IEEE.
  • Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of International Conference on Machine Learning, pages 1126–1135, Sydney, Australia, July 2017.
  • Gidaris et al. [2019] Spyros Gidaris, Andrei Bursuc, Patrick Pérez, and Matthieu Cord. Boosting few-shot visual learning with self-supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8059–8068, Seoul, Korea, February 2019. IEEE.
  • Horváth et al. [2021] Miklós Z Horváth, Mark Niklas Müller, Marc Fischer, and Martin Vechev. Boosting randomized smoothing with variance reduced classifiers. arXiv preprint arXiv:2106.06946, 2021.
  • Kang et al. [2021] Dahyun Kang, Heeseung Kwon, Juhong Min, and Minsu Cho. Relational embedding for few-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8822–8833, Nashville, USA, June 2021. IEEE.
  • Kifer et al. [2004] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Proceedings of the 31st International Conference on Very Large Databases, pages 180–191, Toronto, Canada, September 2004. Morgan Kaufmann.
  • Liu et al. [2020] Bin Liu, Yue Cao, Yutong Lin, Mingsheng Long, and Han Hu. Negative margin matters: Understanding margin in few-shot classification. In Proceedings of European Conference on Computer Vision, pages 438–455, Glasgow, UK, November 2020. Springer.
  • Ma et al. [2019] Jiawei Ma, Hanchen Xie, Guangxing Han, Shih-Fu Chang, Aram Galstyan, and Wael Abd-Almageed. Partner-assisted learning for few-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10573–10582, Seoul, Korea, February 2019. IEEE.
  • Ma et al. [2022] Rongkai Ma, Pengfei Fang, Tom Drummond, and Mehrtash Harandi. Adaptive poincaré point to set distance for few-shot classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1926–1934, Austin, Texas, August 2022. AAAI.
  • Mangla et al. [2020] Puneet Mangla, Nupur Kumari, Abhishek Sinha, Mayank Singh, and Vineeth N Balasubramanian. Charting the right manifold: Manifold mixup for few-shot learning. In Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2218–2227, Snowmass, USA, March 2020. IEEE.
  • Rajasegaran et al. [2020] Jathushan Rajasegaran, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Self-supervised knowledge distillation for few-shot learning. arXiv preprint arXiv:2006.09785, 2020.
  • Rongkai Ma [2022] Rongkai Ma, Pengfei Fang, Gil Avraham, and Yan Zuo. Learning instance and task-aware dynamic kernels for few-shot learning. arXiv preprint arXiv:2112.03494, 2022.
  • Scott et al. [2021] Tyler R Scott, Andrew C Gallagher, and Michael C Mozer. von mises-fisher loss: An exploration of embedding geometries for supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10612–10622, Nashville, USA, June 2021. IEEE.
  • Shiryaev [2016] Albert N Shiryaev. Probability-1, volume 95. Springer, 2016.
  • Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proceedings of 31st Annual Conference on Neural Information Processing Systems, pages 4078–4088, Long Beach, USA, December 2017. Neural Information Processing Systems Foundation.
  • Sung et al. [2018] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1199–1208, Salt Lake City, USA, June 2018. IEEE.
  • Sung et al. [2021] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Exploring complementary strengths of invariant and equivariant representations for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10836–10846, Nashville, USA, June 2021. IEEE.
  • Tian et al. [2020] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In Proceedings of European Conference on Computer Vision, pages 266–282, Glasgow, UK, November 2020. Springer.
  • Tseng et al. [2020] Hung-Yu Tseng, Hsin-Ying Lee, Jia-Bin Huang, and Ming-Hsuan Yang. Cross-domain few-shot classification via learned feature-wise transformation. arXiv preprint arXiv:2001.08735, 2020.
  • Wang et al. [2019] Yan Wang, Wei-Lun Chao, Kilian Q Weinberger, and Laurens van der Maaten. Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623, 2019.
  • Wang et al. [2020] Zeyuan Wang, Yifan Zhao, Jia Li, and Yonghong Tian. Cooperative bi-path metric for few-shot learning. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1524–1532, Seattle, USA, October 2020. ACM.
  • Wertheimer et al. [2021] Davis Wertheimer, Luming Tang, and Bharath Hariharan. Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8012–8021, Nashville, USA, 2021.
  • Wu et al. [2021] Jiamin Wu, Tianzhu Zhang, Yongdong Zhang, and Feng Wu. Task-aware part mining network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8433–8442, Nashville, USA, June 2021. IEEE.
  • Xie et al. [2022] Jiangtao Xie, Fei Long, Jiaming Lv, Qilong Wang, and Peihua Li. Joint distribution matters: deep brownian distance covariance for few-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7962–7971, New Orleans, USA, June 2022. IEEE.
  • Xu et al. [2021a] Chengming Xu, Yanwei Fu, Chen Liu, Chengjie Wang, Jilin Li, Feiyue Huang, Li Zhang, and Xiangyang Xue. Learning dynamic alignment via meta-filter for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5182–5191, Nashville, USA, June 2021. IEEE.
  • Xu et al. [2021b] Jing Xu, Xinglin Pan, Xu Luo, Wenjie Pei, and Zenglin Xu. Exploring category-correlated feature for few-shot image classification. arXiv preprint arXiv:2112.07224, 2021.
  • Yang et al. [2013] Jing Yang, Xiaoqin Zeng, Shuiming Zhong, and Shengli Wu. Effective neural network ensemble approach for improving generalization performance. IEEE transactions on neural networks and learning systems, 24(6):878–887, 2013.
  • Yang et al. [2021] Shuo Yang, Lu Liu, and Min Xu. Free lunch for few-shot learning: distribution calibration. In Proceedings of 9th International Conference on Learning Representations, New Orleans, USA, May 2021. International Conference on Learning Representations.
  • Yu et al. [2022] Tianyuan Yu, Sen He, Yi-Zhe Song, and Tao Xiang. Hybrid graph neural networks for few-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3179–3187, Austin, Texas, 2022.
  • Zhang and Huang [2022] Tao Zhang and Wu Huang. Kernel relative-prototype spectral filtering for few-shot learning. In Proceedings of European Conference on Computer Vision, pages 541–557, Tel Aviv, Israel, October 2022.
  • Zhang et al. [2020a] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. Deepemd: few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12203–12213, Seattle, USA, June 2020. IEEE.
  • Zhang et al. [2020b] Manli Zhang, Jianhong Zhang, and Songfang Huang. IEPT: Instance-level and episode-level pretext tasks for few-shot learning. In Proceedings of 7th International Conference on Learning Representations, Addis Ababa, Ethiopia, May 2020. International Conference on Learning Representations.
  • Zhou et al. [2021] Ziqi Zhou, Xi Qiu, Jiangtao Xie, Jianan Wu, and Chi Zhang. Binocular mutual learning for improving few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8402–8411, Seoul, Korea, February 2021. IEEE.

6 Supplementary Material

6.1 Proof of Theorem 1

Proof 1

The expected errors with $\overline{h}$ on $S_b$ and $S_n$ are:

e_{b}(\overline{h})=e_{S_{b}}(\overline{h},f_{b})=E_{x\in S_{b}}[\left|\overline{h}(x)-f_{b}(x)\right|], (14)
e_{n}(\overline{h})=e_{S_{n}}(\overline{h},f_{n})=E_{x\in S_{n}}[\left|\overline{h}(x)-f_{n}(x)\right|].

Then we have the following derivation:

\begin{split}
e_{n}(\overline{h})&=e_{n}(\overline{h})+e_{b}(\overline{h})-e_{b}(\overline{h})+e_{S_{b}}(\overline{h},f_{n})-e_{S_{b}}(\overline{h},f_{n})\\
&\leq e_{b}(\overline{h})+\left|e_{S_{b}}(\overline{h},f_{n})-e_{S_{b}}(\overline{h},f_{b})\right|+\left|e_{S_{n}}(\overline{h},f_{n})-e_{S_{b}}(\overline{h},f_{n})\right|\\
&\leq e_{b}(\overline{h})+E_{x\in S_{b}}\left|f_{n}(x)-f_{b}(x)\right|+\left|e_{S_{n}}(\overline{h},f_{n})-e_{S_{b}}(\overline{h},f_{n})\right|\\
&\leq e_{b}(\overline{h})+E_{x\in S_{b}}\left|f_{n}(x)-f_{b}(x)\right|+\int\left|\eta_{b}(x)-\eta_{n}(x)\right|\left|\overline{h}(x)-f_{n}(x)\right|dx\\
&=e_{b}(\overline{h})+E_{x\in S_{b}}\left|f_{n}(x)-f_{b}(x)\right|+\mathcal{D}(S_{b},S_{n}).
\end{split} (15)

The expected error on $S_b$ with any learner $h_o$ of the ensemble is calculated as:

e_{b}(h_{o})=\int(h_{o}(x)-h^{*}(x))^{2}\eta_{b}(x)dx, (16)

where $\eta_b(x)$ is the density function of $S_b$. The average error on $S_b$ over the learners of the ensemble is:

e_{b}(h)=\sum_{o=1}^{O}\alpha_{o}\int(h_{o}(x)-h^{*}(x))^{2}\eta_{b}(x)dx. (17)

Recall that $\overline{h}=\sum_{o=1}^{O}\alpha_{o}h_{o}$; then the expected error on $S_b$ with $\overline{h}$ is:

\begin{split}
e_{b}(\overline{h})&=\int(\overline{h}(x)-h^{*}(x))^{2}\eta_{b}(x)dx\\
&=\int\Big(\sum_{o=1}^{O}\alpha_{o}h_{o}(x)-h^{*}(x)\Big)^{2}\eta_{b}(x)dx\\
&\leq\sum_{o=1}^{O}\alpha_{o}\int(h_{o}(x)-h^{*}(x))^{2}\eta_{b}(x)dx=e_{b}(h).
\end{split} (18)

6.2 Proof of Proposition 1

Proof 2

The Gaussian distribution of the random variable $t$ is expressed as:

f(t)=\frac{1}{\sqrt{2\pi}\Sigma}e^{-\frac{(t-\mu)^{2}}{2\Sigma^{2}}}. (19)

According to the definition in Equation (3), the first characteristic function of the random variable $t$ is calculated as:

\begin{split}
\phi(s)&=\int_{-\infty}^{+\infty}\frac{1}{\sqrt{2\pi}\Sigma}e^{-\frac{(t-\mu)^{2}}{2\Sigma^{2}}}e^{st}dt\\
&\overset{t^{\prime}=t-\mu}{=}\int_{-\infty}^{+\infty}\frac{1}{\sqrt{2\pi}\Sigma}e^{-\frac{t^{\prime 2}}{2\Sigma^{2}}}e^{s(t^{\prime}+\mu)}dt^{\prime}\\
&=e^{\mu s}\frac{1}{\sqrt{2\pi}\Sigma}\int_{-\infty}^{+\infty}e^{-\frac{t^{\prime 2}}{2\Sigma^{2}}+st^{\prime}}dt^{\prime}.
\end{split} (20)

The common Gaussian integral formula is:

\int_{-\infty}^{+\infty}e^{-Ax^{2}\pm 2Bx-C}dx=\sqrt{\frac{\pi}{A}}\,e^{-\frac{AC-B^{2}}{A}}. (21)

On the right side of Equation (21), let $A=\frac{1}{2\Sigma^{2}}$, $B=s/2$, $C=0$; then Equation (20) becomes:

\phi(s)=e^{\mu s}e^{\frac{1}{2}\Sigma^{2}s^{2}}. (22)

The second characteristic function $\psi(s)$ is then:

\psi(s)=\ln\phi(s)=\ln\big(e^{\mu s}e^{\frac{1}{2}\Sigma^{2}s^{2}}\big)=\mu s+\frac{1}{2}\Sigma^{2}s^{2}. (23)

Comparing Equation (23) with Equation (5), the coefficients of the terms $s^{o}$ yield:

c_{1}=\mu,\quad c_{2}=\Sigma^{2},\quad c_{o}=0\quad(o=3,4,\ldots). (24)

6.3 More Experiments

Figure 4: Test accuracy (%) under different values of the parameter $\alpha_2$ in the 5-way 1-shot (a) and 5-way 5-shot (b) settings on three FSC datasets.
Figure 5: Test accuracy (%) under different values of the parameter $\alpha_3$ in the 5-way 1-shot (a) and 5-way 5-shot (b) settings on three FSC datasets.
Figure 6: Image reconstruction of features represented by $1^{st}$-order, $2^{nd}$-order, and $3^{rd}$-order statistics, respectively.
Figure 7: t-SNE visualization of features on unseen samples for the Baseline (a) and our method (b).
Table 5: Comparison of different methods under the cross-domain scenario.
Method miniImageNet → CUB
1-shot 5-shot
Prototypical Snell et al. [2017] 36.61±0.53 55.23±0.83
Relational Sung et al. [2018] 44.07±0.77 59.46±0.71
MetaOptNet Bertinetto et al. [2019]† 44.79±0.75 64.98±0.68
IEPT Zhang et al. [2020b] 52.68±0.56 72.98±0.40
FPN Wertheimer et al. [2021] 51.60±0.21 72.97±0.18
BML Zhou et al. [2021] - 72.42±0.54
Baseline++ Chen et al. [2019] - 62.04±0.76
SimpleShot Wang et al. [2019]† 48.56 65.63
S2M2 Mangla et al. [2020] 48.24±0.84 70.44±0.75
Neg-Cosine Liu et al. [2020] - 67.03±0.76
GNN+FT Tseng et al. [2020] 47.47±0.75 66.98±0.68
ELMOS (ours) 53.73±0.47 74.37±0.37

6.3.1 Parameter Analysis

The effect of each branch is controlled by the parameters $\alpha_1$, $\alpha_2$ and $\alpha_3$ in Equation (10). Since the first branch, modeling the $1^{st}$-order statistic, is the main branch, we set its corresponding parameter to 1. We first fixed the value of $\alpha_3$ to 1 and varied $\alpha_2$ over [0, 1] with an interval of 0.1. The test accuracy under different values is shown in Figure 4. When $\alpha_2$ is 1, the highest performance on miniImageNet is achieved under both 1-shot and 5-shot tasks. When $\alpha_2$ is 0.3, we get the highest performance on CUB and CIFAR-FS under both 1-shot and 5-shot tasks. Next, we fixed $\alpha_2$ to 1 on miniImageNet and 0.3 on CIFAR-FS and CUB, and varied $\alpha_3$ over [0, 1] with an interval of 0.1. The test accuracy under different values is shown in Figure 5. When $\alpha_3$ is 1, we get the highest performance on all three datasets under both 1-shot and 5-shot tasks.
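This sequential search can be written as a simple validation-set grid search, sketched below; `evaluate_on_validation` is a hypothetical helper standing in for episode evaluation on $S_{val}$.

```python
import numpy as np

def grid_search_alphas(evaluate_on_validation):
    """Sequential grid search used in the parameter analysis: alpha_1 is fixed
    to 1, alpha_2 is searched first with alpha_3 = 1, then alpha_3 is searched.
    evaluate_on_validation(a1, a2, a3) -> accuracy is a hypothetical helper."""
    grid = np.round(np.arange(0.0, 1.01, 0.1), 1)   # [0, 1] with a 0.1 interval
    best_a2 = max(grid, key=lambda a2: evaluate_on_validation(1.0, a2, 1.0))
    best_a3 = max(grid, key=lambda a3: evaluate_on_validation(1.0, best_a2, a3))
    return 1.0, best_a2, best_a3
```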

6.3.2 Image Reconstruction of Features

The effectiveness of our method is mainly attributed to the diversity of the $1^{st}$-order, $2^{nd}$-order, and $3^{rd}$-order statistic features. We used the deep image prior technique to invert the different-order statistic features into RGB images after pre-training. The reconstruction results are shown in Figure 6. From the results, we notice that as the order of the statistic feature becomes higher, the reconstructed images become smoother. This illustrates that the $2^{nd}$-order and $3^{rd}$-order statistic features are more robust to singular variations such as noise points than the $1^{st}$-order statistic feature. By comparison, the $1^{st}$-order statistic feature is better at capturing image details than the $2^{nd}$-order and $3^{rd}$-order statistic features. This analysis shows that the $1^{st}$-order, $2^{nd}$-order, and $3^{rd}$-order statistic features are complementary.

6.3.3 t-SNE Visualization of Features

To show the performance of our method, we visualize the features of novel-class samples in comparison with the Baseline, which pre-trains the backbone network with only global average pooling. We randomly selected 5 classes and 200 samples per class from CIFAR-FS and visualized the features of the samples using t-SNE. The visualization results are shown in Figure 7. From the results, we can see that the five classes separate from each other much better in our feature space than in the Baseline's, which illustrates that our method extracts better features for unseen novel classes.
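A sketch of this visualization procedure is given below, assuming a `features` array holding the 5 × 200 sampled embeddings and integer `labels`; the t-SNE settings are common scikit-learn defaults, not necessarily those used in the paper.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    """Project novel-class features (e.g. 5 classes x 200 samples from
    CIFAR-FS) to 2-D with t-SNE and colour the points by class label."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
    plt.title("t-SNE of novel-class features")
    plt.show()
```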

6.3.4 Comparison of Cross-domain Performance

As stated in Theorem 1, there exists a domain shift between the base and novel classes. In the preceding experiments, the base and novel classes are from the same domain, which yields a smaller domain divergence than when they come from different domains. Now we enlarge the domain divergence to evaluate our method on cross-domain FSC. Following the protocol in Chen et al. [2019], the model was trained on miniImageNet and then evaluated on the novel classes of CUB. The comparison is shown in Table 5. From the results, we can see that our method is better than all the compared methods under both 1-shot and 5-shot tasks. Specifically, our method outperforms the best compared method, IEPT, by 1.05% and 1.39%, respectively. Our method does not explicitly address the domain divergence, yet it still achieves good cross-domain performance by using ensemble learning to decrease the generalization error on the base classes, which is also an important term in the bound on the true error on the novel classes.