
Learning to Affiliate: Mutual Centralized Learning for Few-shot Classification

Yang Liu1,2, Weifeng Zhang1, Chao Xiang1, Tu Zheng1,2, Deng Cai1,2, Xiaofei He1,2
1Zhejiang University  2Fabu Inc., Hangzhou, China
{lyng_95, zhangwf, chaoxiang, zhengtu}@zju.edu.cn  [email protected][email protected]
Abstract

Few-shot learning (FSL) aims to learn a classifier that can be easily adapted to accommodate new tasks, given only a few examples. To handle the limited data of few-shot regimes, recent methods tend to collectively use a set of local features to densely represent an image instead of using a mixed global feature. They generally explore a unidirectional paradigm, e.g., finding the nearest support feature for every query feature and aggregating local matches for a joint classification. In this paper, we propose a novel Mutual Centralized Learning (MCL) to fully affiliate these two disjoint dense feature sets in a bidirectional paradigm. We first associate each local feature with a particle that can bidirectionally random walk in a discrete feature space. To estimate the class probability, we propose the dense features’ accessibility, which measures the expected number of visits to the dense features of that class in a Markov process. We relate our method to learning a centrality on an affiliation network and demonstrate its capability to be plugged into existing methods by highlighting centralized local features. Experiments show that our method achieves the new state-of-the-art.

[Figure 1 panels: images omitted]
(a) $\mathbf{Pr}_{\mathrm{DN4}}(c\mid\mathbf{q},\mathbf{S})\propto\exp\big(\sum_{q\in\mathbf{q}}\cos(q,\mathrm{NN}_{\mathbf{s}^{c}}(q))\big)$
(b) $\mathbf{Pr}_{\mathrm{DeepEMD}}(c\mid\mathbf{q},\mathbf{S})\propto\exp\big(-\sum_{q\in\mathbf{q}}\|q-\mathrm{EMD}_{\mathbf{s}^{c}}(q)\|^{2}\big)$
(c) $\mathbf{Pr}_{\mathrm{FRN}}(c\mid\mathbf{q},\mathbf{S})\propto\exp\big(-\sum_{q\in\mathbf{q}}\|q-\mathbf{s}^{c}\mathbf{w}\|^{2}\big)$
(d) $\mathbf{Pr}_{\mathrm{MCL}}(c\mid\mathbf{q},\mathbf{S})\propto\sum_{z\in\mathbf{q}\cup\mathbf{S}}\mathbb{E}\big[\sum_{t=1}^{\infty}\mathds{1}[X_{t}\in\mathbf{s}^{c}]\,\big|\,X_{0}=z\big]$
Figure 1: Illustrations of four dense features based methods in a 3-way 1-shot scenario, where each colored pane indicates an image and particles within each pane indicate three feature vectors of that image. The arrow lines indicate direct connections between two feature vectors at inference. For example, (a) DN4 [16] accumulates cosine similarities between the nearest features; (b) DeepEMD [35] finds the optimal matching flow with the minimum cost; (c) FRN [33] reconstructs query features with a linear combination of support features. (d) Our MCL lets query and support features bidirectionally random walk to the opposite dense feature set and measures the expected number of visits in a Markov process $\{X_{t}\}$. Different from them, MCL considers mutual affiliations between the dense features instead of following the unidirectional paradigm, whose evident character is an accumulation over all query features followed by a softmax probability.

1 Introduction

Few-shot classification aims to learn a classifier that can be readily adapted to novel classes given just a small number of labeled instances. To address this problem, a line of previous literature adopts metric-based methods [30, 26, 28] that learn a global image representation in an appropriate feature space and use a distance metric to predict their labels.

Recent approaches [35, 16, 18] have demonstrated that significant intra-class variations inevitably drive image-level embeddings from the same category far apart in a given metric space under low-data regimes. In contrast, densely representative local features provide transferable information across categories and have shown promising performance in the few-shot scenario. Among the methods illustrated in Figure 1, DN4 [16] finds the nearest-neighbor support feature for each query feature and accumulates all the local matches in a Naive-Bayes way to represent an image-to-class similarity; DeepEMD [35] uses the earth mover's distance to compare complex structured representations composed of local features; FRN [33] reconstructs each query dense feature with a linear combination of support dense features in a latent space and uses the reconstruction distance to measure the image-to-class relevance. They all follow a unidirectional query-to-support paradigm, whose evident character is an accumulation over all query features followed by a softmax probability.

In this paper, we incorporate an extra support-to-query connection as a complement to thoroughly affiliate the two disjoint sets of dense features. The potential of this bidirectional paradigm stems from the intuition that, besides using query features to find related support features, it is also plausible to estimate the task-relevance of query features according to the support features. Specifically, we associate dense features with particles that can bidirectionally random walk to the opposite dense feature set in a discrete feature space. The prediction probability for each class is then estimated by the dense features' accessibility, i.e., the expected number of visits to the dense features of that class in a time-homogeneous Markov process.

The contributions are as follows: (1) We propose to learn mutual affiliations between the query and support dense features instead of following the unidirectional query-to-support paradigm in FSL. (2) We introduce the dense features' accessibility to FSL and demonstrate that traditional transductive methods can be easily adapted to the inductive setting if we treat the dense features from a single query image as a set of unlabeled data. (3) We propose a novel bidirectional random walk based method for FSL and draw its connection to the single-mode eigenvector centrality of an affiliation network. (4) The underlying centrality investigated in this work can be plugged into existing global feature based methods like ProtoNet and RelationNet by highlighting task-centralized local features instead of using global average pooling.

2 Related Work

Dense Feature based FSL. These methods focus on learning image-to-image similarities by encoding each input into a set of dense feature vectors (or a feature map). Among them, DC [18] predicts a label for each local feature and averages their probabilities. DeepEMD [35] adopts the earth mover's distance to compute an image-to-class distance. DN4 [16] uses the top-k nearest neighbors in a Naive-Bayes way to represent image-level similarities. ADM [15] and SAML [12] use extra network layers for dense classification. FRN [33] reconstructs the dense query features from the support features of a given class to predict their relevance. Lifchitz et al. [17] propagate labels from the support sets to the query features. MCL also belongs to the family of dense feature based frameworks. A major difference is that we consider the mutual affiliations between the query and support features instead of the unidirectional ones in previous work.

Graphical models in FSL. Another branch of methods learns graphs for FSL. Among them, [11] builds a graph neural network (GNN) from a collection of images, and [9] uses degree centrality as part of the graph features to adjust the weights of graph vertices in transductive FSL. Our proposed feature accessibility of bidirectional random walks is closely related to the eigenvector centrality. Differently, we use the single-mode centrality of bipartite data as a novel criterion for inductive classification.

From Transductive to Inductive. TPN [20] first introduces the well-known label propagation [36] to transductive FSL, which transfers information from labeled data to unlabeled data and shows substantial improvements over inductive methods. Lifchitz et al. [17] first demonstrate that label propagation is feasible for inductive FSL by treating the query dense features as numerous unlabeled data. Different from unidirectional label propagation, our MCL further treats the sets of query and support features as bipartite data where self-reinforcement (i.e., query-to-query and support-to-support random walks) is avoided.

3 Method

3.1 Formulation of Dense Features based FSL

In an $N$-way $K$-shot episode, the goal is to predict a class label for a single query image given $N$ support classes, each of which contains $K$ support images for reference.

We first encode each input image into $r$ $d$-dimensional dense features $f_{\theta}(\cdot)\in\mathbb{R}^{d\times r}$. We use $\mathbf{q}=\{q_{1},\dots,q_{r}\}$ to denote the dense features from a query image. Inspired by prototype learning [26], for each category $c$ we average the feature vectors at the same spatial locations of its $K$ support images into the support dense features $\mathbf{s}^{c}=\{s_{1}^{c},\dots,s_{r}^{c}\}$.

We use bold font (e.g., $\mathbf{q}$ and $\mathbf{s}$) to denote a set of feature vectors and normal font (e.g., $q$ and $s$) to denote a single feature vector. We use $\mathbf{S}=\bigcup_{c\in C}\mathbf{s}^{c}$ to represent the union set of dense features from all support classes. The cardinalities of the sets $\mathbf{q}$, $\mathbf{s}^{c}$ and $\mathbf{S}$ are $r$, $r$ and $Nr$, respectively.

The target of dense features based FSL is to find the label of a query image, given its dense query features $\mathbf{q}$ and the union set of dense support features $\mathbf{S}$ from all categories.
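To make the notation concrete, the following is a minimal PyTorch sketch of this setup (shapes only); the backbone output and the episode sizes are illustrative stand-ins rather than the actual implementation.

```python
# Minimal sketch of the dense-feature setup in Sec. 3.1 (illustrative shapes only).
import torch

N, K, r, d = 5, 5, 25, 640                   # N-way K-shot, r local features of dim d

query_feat = torch.randn(d, r)               # f_theta(query) in R^{d x r}, q = {q_1,...,q_r}
support_feat = torch.randn(N, K, d, r)       # K shots per class

# prototype-style averaging over the K shots at each spatial location gives s^c
support_proto = support_feat.mean(dim=1)     # (N, d, r)

# union set S of all support dense features, |S| = N*r, |q| = r
S = support_proto.permute(0, 2, 1).reshape(N * r, d)
q = query_feat.t()                           # (r, d)
```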

3.2 Bidirectional Random Walks on Dense Features

We start by using the cosine similarity $\phi(v_{1},v_{2})=\langle\frac{v_{1}}{\|v_{1}\|},\frac{v_{2}}{\|v_{2}\|}\rangle$ to measure the closeness between two feature vectors, where $\langle\cdot,\cdot\rangle$ is the Frobenius inner product.

Given the two dense feature sets $\mathbf{q}$ and $\mathbf{S}$, we compute the inter-similarity matrix $\mathbf{\Phi}\in\mathbb{R}^{r\times Nr}$ between these two disjoint sets, where the entry $[\mathbf{\Phi}]_{qs}=\phi(q,s)$ is the similarity between vectors $q\in\mathbf{q}$ and $s\in\mathbf{S}$.

We build query-to-support affiliations from every query feature $q$ to all support features $s\in\mathbf{S}$ via the random walk probability $p_{sq}=\frac{e^{\gamma\phi(s,q)}}{\sum_{s^{\prime}\in\mathbf{S}}e^{\gamma\phi(s^{\prime},q)}}$. The probability matrix from all query features to the support features can be written as

$$\mathbf{P}_{\mathbf{Sq}}=\exp(\gamma\mathbf{\Phi}^{\mathrm{T}})\,\mathbf{D}^{-1}$$ (1)

where $\exp(\cdot)$ is the element-wise exponential function, $\gamma$ is a scaling temperature for a softmax-like probability, and $\mathbf{D}$ is a diagonal normalization matrix whose $(j,j)$-th value is the sum of the $j$-th column of $\exp(\gamma\mathbf{\Phi}^{\mathrm{T}})$. The subscript serves both as the matrix size and as an indication of the feature vectors it is associated with. For example, $\mathbf{P}_{\mathbf{Sq}}$ is of size $|\mathbf{S}|\times|\mathbf{q}|$ and contains the random walk probabilities from query features $\mathbf{q}$ to support features $\mathbf{S}$.

Likewise, we connect every support feature $s$ to all $q\in\mathbf{q}$ by a similar probability matrix:

$$\mathbf{P}_{\mathbf{qS}}=\exp(\beta\mathbf{\Phi})\,\mathbf{W}^{-1}$$ (2)

where $\mathbf{W}$ is the analogous diagonal matrix whose $(j,j)$-th value is the sum of the $j$-th column of $\exp(\beta\mathbf{\Phi})$.

It should be noted that we use two different scaling parameters since $\mathbf{\Phi}$ is a non-square matrix due to the different cardinalities of the dense feature sets $\mathbf{q}$ and $\mathbf{S}$. Although $\exp(\gamma\mathbf{\Phi}^{\mathrm{T}})$ and $\exp(\beta\mathbf{\Phi})$ would simply be each other's transpose if we ignored their different scaling parameters, the probability matrices $\mathbf{P}_{\mathbf{qS}}$ and $\mathbf{P}_{\mathbf{Sq}}$ are column-normalized by different matrices $\mathbf{W}$ and $\mathbf{D}$ respectively and are thus directional.

To learn the mutual affiliations between the two dense feature sets, we first associate each feature vector with a particle $z$, which forms a discrete feature space $\mathbf{z}=\mathbf{q}\cup\mathbf{S}$. Next, we let the particles bidirectionally random walk in that space according to $\mathbf{P}_{\mathbf{Sq}}$ and $\mathbf{P}_{\mathbf{qS}}$ (i.e., particles at the query feature set are only allowed to move to the support feature set, and vice versa). We assume the resulting Markov process $\{X_{t}\}$ is time-homogeneous, and the random walk probability from $z_{j}$ to $z_{i}$ for all $z_{i}$, $z_{j}$ in $\mathbf{z}$ can be formulated as:

$$\mathbf{Pr}(X_{t+1}=z_{i}\mid X_{t}=z_{j})=\begin{cases}\dfrac{e^{\gamma\phi(z_{i},z_{j})}}{\sum_{s\in\mathbf{S}}e^{\gamma\phi(s,z_{j})}}&z_{i}\in\mathbf{S},\ z_{j}\in\mathbf{q}\\[2mm]\dfrac{e^{\beta\phi(z_{i},z_{j})}}{\sum_{q\in\mathbf{q}}e^{\beta\phi(q,z_{j})}}&z_{i}\in\mathbf{q},\ z_{j}\in\mathbf{S}\\[2mm]0&\text{otherwise.}\end{cases}$$ (3)

Written in matrix form, this yields the Markov transition matrix $\mathbf{P}$:

$$\mathbf{P}=\begin{pmatrix}\mathbf{0}&\mathbf{P}_{\mathbf{Sq}}\\ \mathbf{P}_{\mathbf{qS}}&\mathbf{0}\end{pmatrix}$$ (4)

which consists of the column-normalized $\mathbf{P}_{\mathbf{Sq}}$ and $\mathbf{P}_{\mathbf{qS}}$.
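As a sketch of Eqns.(1)-(4), the two column-normalized probability matrices and the anti-diagonal transition matrix $\mathbf{P}$ could be assembled as below; the function name and the default $\gamma$, $\beta$ values are illustrative choices, not the released code.

```python
# Sketch of Eqns (1)-(4): cosine similarity matrix Phi, the column-normalized
# random-walk matrices P_Sq / P_qS, and the anti-diagonal transition matrix P.
import torch
import torch.nn.functional as F

def transition_matrix(q, S, gamma=20.0, beta=10.0):
    """q: (r, d) query dense features; S: (Nr, d) union of support dense features."""
    phi = F.normalize(q, dim=1) @ F.normalize(S, dim=1).t()   # (r, Nr) cosine similarities
    P_Sq = torch.softmax(gamma * phi.t(), dim=0)   # (Nr, r): q -> S, columns sum to 1 over S
    P_qS = torch.softmax(beta * phi, dim=0)        # (r, Nr): S -> q, columns sum to 1 over q
    Nr, r = P_Sq.shape
    P = torch.zeros(Nr + r, Nr + r)
    P[:Nr, Nr:] = P_Sq                             # top-right block of Eqn (4)
    P[Nr:, :Nr] = P_qS                             # bottom-left block of Eqn (4)
    return P_Sq, P_qS, P
```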

It can be proved (in the supplementary) that, after infinitely many steps of bidirectional random walks, the Markov chain with anti-diagonal $\mathbf{P}$ reaches the periodic stationary distribution:

$$\lim_{t\rightarrow\infty}\mathbf{P}^{2t}=\begin{pmatrix}\bm{\pi}(\mathbf{S})\bm{e}_{Nr}^{\mathrm{T}}&\mathbf{0}\\ \mathbf{0}&\bm{\pi}(\mathbf{q})\bm{e}_{r}^{\mathrm{T}}\end{pmatrix},\qquad\lim_{t\rightarrow\infty}\mathbf{P}^{2t-1}=\begin{pmatrix}\mathbf{0}&\bm{\pi}(\mathbf{S})\bm{e}_{r}^{\mathrm{T}}\\ \bm{\pi}(\mathbf{q})\bm{e}_{Nr}^{\mathrm{T}}&\mathbf{0}\end{pmatrix}$$ (5)

where $\bm{\pi}(\mathbf{S})\in\mathbb{R}^{Nr}$ and $\bm{\pi}(\mathbf{q})\in\mathbb{R}^{r}$ are the stationary distributions satisfying $\bm{\pi}(\mathbf{S})=\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})$ and $\bm{\pi}(\mathbf{q})=\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\bm{\pi}(\mathbf{q})$ respectively. $\bm{e}_{|\cdot|}$ is a vector of ones whose length is indicated by its subscript.

The motivation for using these stationary distributions to encode mutual affiliations is straightforward: support features will be frequently visited over long runs of the bidirectional random walk if they share local characteristics with the query features. In other words, the support class with the most mutual affiliations to the query image is predicted due to their mutual closeness.

3.3 Classification by Dense Features’ Accessibility

To formulate this in FSL, we first assume the particles are uniformly distributed over the discrete finite space $\mathbf{z}$. Then, we use the expected number of visits from all particles $z\in\mathbf{z}$ to the support features $\mathbf{s}^{c}\subset\mathbf{S}$ of class $c$ over long runs of the bidirectional random walk $\{X_{t}\}$ to measure the amount of local characteristics class $c$ shares with the query image:

$$\begin{aligned}\mathbf{Pr}(\tilde{y}=c)&\propto\lim_{t\rightarrow\infty}\sum_{z\in\mathbf{z}}\mathbb{E}\left[\sum_{k=1}^{t}\mathds{1}[X_{k}\in\mathbf{s}^{c}]\,\bigg|\,X_{0}=z\right]\\&=\lim_{t\rightarrow\infty}\frac{1}{t}\sum_{k=1}^{t}\left[\frac{1}{Nr+r}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{s^{\prime}\in\mathbf{S}}\left[\mathbf{P}^{2k}\right]_{ss^{\prime}}+\sum_{q\in\mathbf{q}}\left[\mathbf{P}^{2k-1}\right]_{sq}\right)\right]\\&=\frac{1}{Nr+r}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{s^{\prime}\in\mathbf{S}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t}\right]_{ss^{\prime}}+\sum_{q\in\mathbf{q}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t-1}\right]_{sq}\right)\\&=\sum_{s\in\mathbf{s}^{c}}\left[\bm{\pi}(\mathbf{S})\right]_{s}\end{aligned}$$ (6)

where $\mathds{1}[\cdot]$ is an indicator function that equals 1 if its argument is true and 0 otherwise, $[\cdot]_{ij}$ indicates the entry of the matrix from particle $j$ to particle $i$, and $[\cdot]_{i}$ indicates the entry of the vector associated with feature $i$.

We give the derivation for even Markov chain lengths $t$ in the first equality of Eqn.(6) and prove that odd $t$ reaches the same result in the supplementary material. The second equality follows from the absorbing property of the periodic Markov chain, where the powers of the matrix saturate to Eqn.(5) as $t$ increases. Since $\bm{\pi}(\mathbf{S})$ is the stationary distribution of the column-stochastic $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$ under the probability constraint $\bm{e}^{\mathrm{T}}_{Nr}\bm{\pi}(\mathbf{S})=1$, $\mathbf{Pr}(\tilde{y})$ is a valid distribution for classification.
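Conceptually, Eqn.(6) can be checked with a plain power iteration on the column-stochastic matrix $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$; the sketch below is only for illustration (the end-to-end solver is the Katz approximation of Sec. 3.5), and the iteration count is an arbitrary choice.

```python
# Sketch of Eqn (6): estimate pi(S) by power iteration and sum the accessibility
# over the r dense features of each class (assumes S is ordered class by class).
import torch

def accessibility_probs(P_Sq, P_qS, N, r, n_iter=100):
    M = P_Sq @ P_qS                           # (Nr, Nr), column-stochastic
    pi = torch.full((N * r,), 1.0 / (N * r))  # start from a uniform distribution over S
    for _ in range(n_iter):
        pi = M @ pi                           # pi stays a probability vector
    return pi.view(N, r).sum(dim=1)           # Pr(y=c) = sum_{s in s^c} [pi(S)]_s
```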

3.4 Reinterpretation as a Graph Centrality

If we interpret the random walk probability matrix $\mathbf{P}$ as the adjacency matrix $\{a_{v_{1},v_{2}}\}$ of a directed bipartite graph $G:=(V=\{\mathbf{q},\mathbf{S}\},E)$, we find that the dense features' accessibility in the time-homogeneous bidirectional random walk is equivalent to learning a single-mode eigenvector centrality on the graph $G$. The bipartite graph is also called an affiliation network in social network analysis [3, 4, 5]; it models two types of entities, "actors" and "society", related by the affiliation of the former in the latter. The concept of centrality in social network analysis is generally used to investigate the acquaintanceships among people that often stem from one or more shared affiliations.

To see this in graph theory, we start with a brief overview of the eigenvector centrality, which assigns a score $x$ to each vertex $v$, for both $q\in\mathbf{q}$ and $s\in\mathbf{S}$, in the affiliation network with adjacency $\{a_{v_{1},v_{2}}\}$:

$$\begin{aligned}x_{q}&=\frac{1}{\lambda}\sum_{v\in\mathbb{V}(q)}a_{q,v}x_{v}=\frac{1}{\lambda}\sum_{v\in\mathbf{S}}a_{q,v}x_{v}\\ x_{s}&=\frac{1}{\lambda}\sum_{v\in\mathbb{V}(s)}a_{s,v}x_{v}=\frac{1}{\lambda}\sum_{v\in\mathbf{q}}a_{s,v}x_{v}\end{aligned}$$ (7)

where $\mathbb{V}(\cdot)$ is the set of neighbors of the given vertex and $\lambda$ is a constant.

With a small rearrangement, Eqn.(7) can be rewritten in vector notation as an eigenvector equation $\mathbf{Px}=\lambda\mathbf{x}$. The additional requirement that all entries of the eigenvector be non-negative implies (by the Perron–Frobenius theorem [21]) that only the greatest eigenvalue results in the desired centrality measure. For the column-stochastic adjacency matrix $\mathbf{P}$ in our method, the largest eigenvalue $\lambda$ is 1.

Single-mode centrality [5] is a special form of graph centrality that measures the extent to which nodes in one vertex set are central relative only to the other nodes in the same vertex set of a bipartite graph. For example, the single-mode centrality over the different $s$ in $\mathbf{S}$ is defined by $\hat{\mathbf{x}}_{\mathbf{S}}=\mathbf{x}_{\mathbf{S}}/\sum_{s\in\mathbf{S}}x_{s}$. Lemma 1 (proved in the supplementary material) shows that the dense features' accessibility $\bm{\pi}(\mathbf{S})$ of bidirectional random walks in Eqn.(6) is equivalent to the single-mode eigenvector centrality of the support set $\mathbf{S}$ on the bipartite data.

Lemma 1.

Assume $G$ is the affiliation network of the bipartite data $\mathbf{q}$, $\mathbf{S}$ with the adjacency matrix given by the anti-diagonal Markov transition matrix $\mathbf{P}$ in Eqn.(4). The single-mode eigenvector centrality $\hat{\mathbf{x}}_{\mathbf{S}}=\mathbf{x}_{\mathbf{S}}/\sum_{s\in\mathbf{S}}x_{s}$ of the vertex set $\mathbf{S}$ is equivalent to the dense features' accessibility $\bm{\pi}(\mathbf{S})$ on $\mathbf{S}$ in a time-homogeneous Markov process.

Based on this reinterpretation, it is straightforward to introduce an attenuation (damping) factor $\alpha$ into the Markov bidirectional random walks, motivated by the Katz centrality [13], where connections made with distant neighbors are penalized by $\alpha$. The dense features' accessibility with attenuation factor $\alpha$ for few-shot classification is defined by

$$\begin{aligned}\mathbf{Pr}_{\mathrm{Katz}}(\tilde{y}=c)&\propto\sum_{z\in\mathbf{z}}\mathbb{E}\left[\sum_{t=1}^{\infty}\alpha^{t}\mathds{1}[X_{t}\in\mathbf{s}^{c}]\,\bigg|\,X_{0}=z\right]\\&=\frac{1}{\epsilon}\sum_{t=1}^{\infty}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{s^{\prime}\in\mathbf{S}}\left[\alpha^{2t}\mathbf{P}^{2t}\right]_{ss^{\prime}}+\sum_{q\in\mathbf{q}}\left[\alpha^{2t-1}\mathbf{P}^{2t-1}\right]_{sq}\right)\end{aligned}$$ (8)

where $\epsilon=(Nr+r)\sum_{t=1}^{\infty}\alpha^{t}=(Nr+r)\,\alpha/(1-\alpha)$ is a constant that yields a valid distribution.

Although we only use the single-mode centrality on $\mathbf{S}$ for the end-to-end classification, it is also beneficial to learn its conjugate centrality $\bm{\pi}(\mathbf{q})$ on $\mathbf{q}$ in the affiliation network (by the analogous $\bm{\pi}(\mathbf{q})=\mathbf{x}_{\mathbf{q}}/\sum_{q\in\mathbf{q}}x_{q}$ in Lemma 1). We will show that the single-mode centralities on both dense feature sets can serve as a plug-and-play for finding centralized local characteristics in existing methods (hence the name Mutual Centralized Learning, MCL).

3.5 End-to-End Training by Katz Approximation

The algorithm of Mutual Centralized Learning (MCL) in Eqn.(6) involves computing the Markov stationary distribution $\bm{\pi}(\mathbf{S})$ satisfying $\bm{\pi}(\mathbf{S})=\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})$. Theoretically, $\bm{\pi}(\mathbf{S})$ is the eigenvector of $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$ with eigenvalue 1 under the probability constraint $\bm{e}_{Nr}^{\mathrm{T}}\bm{\pi}(\mathbf{S})=1$.

The above constraints lead to a solution for $\bm{\pi}(\mathbf{S})$ by solving the overdetermined linear system

$$\begin{pmatrix}\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}-\bm{I}\\ \bm{e}^{\mathrm{T}}_{Nr}\end{pmatrix}\bm{\pi}(\mathbf{S})=\begin{pmatrix}\mathbf{0}\\ 1\end{pmatrix}$$ (9)

where $\bm{I}$ is an identity matrix and $\mathbf{0}$ is a vector of zeros. Although this system can be solved by various methods such as the pseudo-inverse or QR decomposition with back substitution, we empirically found them time-consuming, as these operators are either numerically unstable or poorly parallelized in modern deep-learning packages.

To handle this, we present an alternative solution based on Lemma 1. We first calculate the Katz centrality by its closed-form solution [13]:

$$\mathbf{x}^{\mathrm{Katz}}=\left((\bm{I}-\alpha\mathbf{P})^{-1}-\bm{I}\right)\bm{e}_{Nr+r}$$ (10)

and then obtain the single-mode Katz centrality in Eqn.(8) by the definition of single-mode centrality:

$$\mathbf{Pr}_{\mathrm{Katz}}(\tilde{y}=c)=\frac{\sum_{s\in\mathbf{s}^{c}}\mathbf{x}^{\mathrm{Katz}}_{s}}{\sum_{s^{\prime}\in\mathbf{S}}\mathbf{x}^{\mathrm{Katz}}_{s^{\prime}}}$$ (11)

Since the Katz centrality degenerates to the eigenvector centrality as $\alpha$ approaches 1 [13] (i.e., no attenuation), we can approximate the eigenvector centrality with a large $\alpha=0.999$:

$$\mathbf{x}^{\mathrm{Eigen}}=\lim_{\alpha\rightarrow 1}\left((\bm{I}-\alpha\mathbf{P})^{-1}-\bm{I}\right)\bm{e}_{Nr+r}\approx\left((\bm{I}-0.999\,\mathbf{P})^{-1}-\bm{I}\right)\bm{e}_{Nr+r}$$ (12)

and the analogous single-mode eigenvector centrality in Eqn.(6) can then be approximated by:

$$\mathbf{Pr}_{\mathrm{MCL}}(\tilde{y}=c)=\frac{\sum_{s\in\mathbf{s}^{c}}\mathbf{x}^{\mathrm{Eigen}}_{s}}{\sum_{s^{\prime}\in\mathbf{S}}\mathbf{x}^{\mathrm{Eigen}}_{s^{\prime}}}$$ (13)

We use the negative log-likelihood loss to update the parameters of the different feature extractors; the overall MCL pipeline for few-shot classification is shown in Figure 2. We provide the detailed algorithm/pseudo-code in the supplementary material for reference.
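For reference, a condensed sketch of this inference step (Eqns.(10)-(13)) is given below, assuming $\mathbf{P}$ has been assembled as in Eqn.(4); using torch.linalg.solve instead of an explicit inverse is an implementation choice of this sketch, not necessarily the released code.

```python
# Sketch of the Katz approximation (Eqns (10)-(13)): closed-form centrality
# x = ((I - alpha*P)^{-1} - I) e, then class log-probabilities from the S part.
import torch

def mcl_log_probs(P, N, r, alpha=0.999):
    n = P.shape[0]                                    # n = N*r + r
    e = torch.ones(n, 1)
    x = torch.linalg.solve(torch.eye(n) - alpha * P, e) - e   # Katz/eigenvector centrality
    x_S = x[:N * r, 0]                                # centrality of the support features
    probs = x_S.view(N, r).sum(dim=1) / x_S.sum()     # single-mode centrality, Eqn (13)
    return probs.clamp_min(1e-8).log()                # feed to a negative log-likelihood loss
```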

Figure 2: Our framework for 3-way 1-shot classification. Given the query and support images, we first extract their dense features $\mathbf{q}$ and $\mathbf{S}$, which are two sets of feature vectors. Then we calculate their similarity matrix $\mathbf{\Phi}$. After that, we build the probability matrix $\mathbf{P}$ based on the scaled and column-normalized similarity matrix. We obtain the stationary distributions $\bm{\pi}(\mathbf{S})$ and $\bm{\pi}(\mathbf{q})$ by the Katz approximation in Eqn.(12) and Lemma 1. Finally, we use them both for end-to-end FSL and as a centrality plugin in global feature based FSL. We explore three dense feature extractors (i.e., VanillaFCN, PyramidFCN, PyramidGrid) inspired by [35] to extract dense feature representations.

3.6 Plugin for Global Feature based FSL.

Since the underlying centrality learned by MCL reveals the varying importance of local features on an affiliation network, it can be plugged into global feature based methods by replacing their native global average pooling (GAP) with centrality weighted pooling as follows:

a. Extract dense features for each input image.

b. Compute the eigenvector centrality $\mathbf{x}^{\mathrm{Eigen}}$ by Eqn.(12).

c. Compute $\bm{\pi}(\mathbf{S})$ and $\bm{\pi}(\mathbf{q})$ according to the definition of single-mode centrality in Lemma 1.

d. Use $\bm{\pi}(\mathbf{q})$ as weights on the query dense features and accumulate them into a single weighted feature vector.

e. Use the class-wise normalized centrality from $\bm{\pi}(\mathbf{S})$ to accumulate the dense features of each support class.

Once we get the centrality weighted features for the query image and support classes, it is straightforward to perform traditional global feature based methods as before; a sketch of this pooling follows.
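The snippet below is one possible sketch of this centrality weighted pooling, assuming the centrality vector $\mathbf{x}$ of the concatenated $[\mathbf{S};\mathbf{q}]$ nodes has already been computed (e.g., by the Katz approximation); the tensor layouts are assumptions of this sketch.

```python
# Sketch of steps a-e: replace GAP with centrality-weighted pooling of the
# query and per-class support dense features.
import torch

def centrality_pooling(q, S, x, N, r):
    """q: (r, d), S: (Nr, d), x: (Nr + r,) centrality of the [S; q] nodes."""
    pi_S = x[:N * r] / x[:N * r].sum()       # single-mode centrality on S
    pi_q = x[N * r:] / x[N * r:].sum()       # conjugate centrality on q
    query_vec = pi_q @ q                     # (d,) centrality-weighted query feature
    w = pi_S.view(N, r)
    w = w / w.sum(dim=1, keepdim=True)       # class-wise normalized weights
    class_vecs = torch.einsum('nr,nrd->nd', w, S.view(N, r, -1))   # (N, d)
    return query_vec, class_vecs
```

The pooled query vector and class vectors can then be fed to, e.g., ProtoNet's distance-based classifier or RelationNet's relation module exactly like their global-average-pooled counterparts.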

4 Experiments

We conduct experiments on two widely-used FSL datasets and three fine-grained datasets: (1) miniImageNet [30] contains 600 images per class over 100 classes. We follow the split used by [23] that takes 64, 16 and 20 classes for train/val/test respectively; (2) tieredImageNet [22] is much larger than miniImageNet, with 608 classes; 351, 97 and 160 classes are used for train/val/test respectively. (3) CUB [31] consists of 11,788 images from 200 bird classes; 100/50/50 classes are used for train/val/test and each image is first cropped to a human-annotated bounding box. (4) meta-iNat [32] is a benchmark of animal species in the wild. We follow the same class split proposed by [32] that uses 908/227 classes for training/evaluation respectively. (5) tiered-meta-iNat [32] is a more difficult version of meta-iNat where a large domain gap is introduced: the 354 test classes are populated by insects and arachnids, while the remaining 781 classes form the training set.

Backbone networks. We conduct experiments with both the widely-used four-layer convolutional Conv-4 [30] and the deeper ResNet-12 [27] backbones. As is common in the state-of-the-art FSL literature, we adopt a pre-training stage for ResNet-12 before episodic meta-training, while the simple Conv-4 is meta-trained directly from scratch.

Dense feature extractor $f_{\theta}(\cdot)$. We explore three dense feature extractors in our experiments, as shown in Figure 2: (1) VanillaFCN simply treats the feature map output of a fully convolutional network as the dense features. (2) PyramidFCN uses an extra adaptive average pooling layer to obtain 34 dense features for each image. (3) PyramidGrid crops the image evenly into a grid of size $2\times 2+3\times 3$ and encodes each grid cell into a feature vector individually. The feature vectors from all the cells constitute the set of dense features.
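The snippet below sketches how such extractors could be realized on top of a backbone; the pooling sizes for PyramidFCN and the per-cell encoder for PyramidGrid are assumptions based on the textual description, not the exact released configuration.

```python
# Sketch of the three dense feature extractors (sizes/encoders are assumptions).
import torch
import torch.nn.functional as F

def vanilla_fcn(feat_map):
    """feat_map: (d, H, W) backbone output -> (H*W, d) dense features."""
    return feat_map.flatten(1).t()

def pyramid_fcn(feat_map, pool_sizes=(3,)):
    """Vanilla dense features plus adaptively average-pooled copies of the map."""
    feats = [vanilla_fcn(feat_map)]
    for s in pool_sizes:
        feats.append(vanilla_fcn(F.adaptive_avg_pool2d(feat_map, s)))
    return torch.cat(feats, dim=0)

def pyramid_grid(image, encoder, grids=(2, 3)):
    """Crop the image into a 2x2 + 3x3 grid; `encoder` maps one crop to one d-dim vector."""
    _, H, W = image.shape
    cells = []
    for g in grids:
        for i in range(g):
            for j in range(g):
                crop = image[:, i * H // g:(i + 1) * H // g,
                                j * W // g:(j + 1) * W // g]
                cells.append(encoder(crop.unsqueeze(0)).flatten())
    return torch.stack(cells, dim=0)        # (2*2 + 3*3, d) dense features
```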

Method | Conv-4: miniImageNet (1-shot, 5-shot), tieredImageNet (1-shot, 5-shot) | ResNet-12: miniImageNet (1-shot, 5-shot), tieredImageNet (1-shot, 5-shot)
Global features DSN [25] 51.78 68.99 53.22 71.06 62.64 78.83 67.39 82.85
MetaOptNet [14] 52.87 68.76 54.71 71.76 62.64 78.63 65.99 81.56
Negative Margin [19] 52.84 70.41 - - 63.85 81.57 - -
FEAT [34] 55.15 71.61 - - 66.78 82.05 70.80 84.79
Meta-Baseline [7] - - - - 63.17 79.26 68.62 83.29
RelationNet [28] 52.12 66.90 54.33 69.95 60.97 75.12 64.71 78.41
RelationNet+MCL 54.50 70.63 57.73 74.46 61.70 75.53 65.93 80.27
ProtoNet [26] 52.32 69.74 53.19 72.28 62.67 77.88 68.48 83.46
ProtoNet+MCL 54.31 69.84 56.67 74.36 64.40 78.60 70.62 83.84
Dense features (VanillaFCN) DN4 [16] 54.66 71.26 56.86 72.16 65.35 81.10 69.60 83.41
DeepEMD [35] 52.15 65.52 50.89 66.12 65.91 82.41 71.16 83.95
FRN [33] 54.87 71.56 55.54 74.68 66.45 82.83 71.16 86.01
Label Propagation 52.24 67.68 54.56 70.08 65.00 80.07 71.12 83.89
MCL (ours) 55.38 70.02 57.63 74.25 67.36 83.63 71.76 86.01
MCL-Katz (ours) 55.55 71.74 57.78 74.77 67.51 83.99 72.01 86.02
Dense features (PyramidFCN) DN4 [16] 54.54 70.94 57.05 72.90 63.54 79.04 71.10 84.22
DeepEMD [35] 50.67 64.94 51.26 65.64 66.27 82.41 70.76 84.20
FRN [33] 54.40 70.75 57.30 75.58 65.94 81.97 70.56 85.44
Label Propagation 53.38 68.69 55.21 71.22 65.71 75.78 71.00 78.01
MCL (ours) 55.13 70.77 57.93 74.36 67.45 84.36 72.01 86.31
MCL-Katz (ours) 55.77 71.24 58.20 74.73 67.85 84.47 72.13 86.32
Dense features (PyramidGrid) DN4 [16] 57.17 70.91 56.71 70.92 67.86 80.08 71.29 82.60
DeepEMD [35] 55.68 70.75 55.88 70.06 67.83 81.32 73.13 84.18
FRN [33] 55.80 71.52 55.68 72.87 67.00 82.20 71.42 85.58
Label Propagation 55.12 68.43 56.05 72.39 67.18 81.07 73.18 85.19
MCL (ours) 57.50 73.03 57.57 73.81 69.31 85.11 73.62 86.29
MCL-Katz (ours) 57.88 74.03 57.63 73.96 69.25 84.71 73.38 86.21
Table 1: Few-shot classification accuracy (%) on miniImageNet and tieredImageNet. The 95% confidence intervals are all below 0.2 over 10,000 evaluation episodes. Results in italic font indicate the performance of MCL as a plugin. Results in bold font are the best results for each dense feature extractor. We re-implement our plugin baseline methods (i.e., ProtoNet and RelationNet) as well as the competitive dense features based methods (i.e., DN4, DeepEMD and FRN) in our unified framework where their performances are not available under specific settings. †: re-implemented results on the Conv-4 backbone; ‡: re-implemented results on both backbones.

4.1 General Few-shot Classification Results

Table 1 details the comparisons of MCL with global feature based methods [25, 14, 34, 7, 19] as well as dense feature based methods [16, 35, 33] on mini-/tieredImageNet.

Besides these methods, we also exploit label propagation [36] in dense feature based inductive FSL like in [17] to demonstrate that traditional transductive methods could be easily adapted to inductive settings if we treat dense feature vectors as a set of unlabeled data. As shown, our MCL is highly competitive with recent state-of-the-art results. In particular, our MCL-Katz achieves 67.51% (1-shot) and 83.99% (5-shot) on miniImageNet with the simplest VanillaFCN.

Comparisons with dense feature based methods. The last three panes of Table 1 show that the proposed bidirectional MCL outperforms DN4's query-to-support nearest neighbors, DeepEMD's optimal matching and FRN's latent feature reconstruction on various tasks. Although the optimal matching flow is unidirectional in DeepEMD, it adopts a cross-reference attention that encodes bidirectional relevance to a certain extent. If we treat dense features as nodes in a graph and use their similarities as edges, a major difference is that DeepEMD treats the nodes equally and derives different weights on the edges, whereas we use the fixed edges to derive the centrality of the nodes directly for classification.

Comparisons with global feature based methods. The centrality investigated in this work can also be plugged into global feature based methods, since local features are not equally important before global average pooling. As shown in the first pane of Table 1, our proposed centrality weighted pooling provides up to 3.4% and 4.5% performance gains for ProtoNet and RelationNet, respectively. Table 4 shows that the improvements are also consistent when it is applied individually to either the query or the support dense features.

Method | CUB: Conv-4 (1-shot, 5-shot), ResNet-12 (1-shot, 5-shot) | meta-iNat: Conv-4 (1-shot, 5-shot), ResNet-12 (1-shot, 5-shot) | tiered-meta-iNat: Conv-4 (1-shot, 5-shot), ResNet-12 (1-shot, 5-shot)
ProtoNet [26] 71.64 81.50 82.98 91.38 53.78 73.80 76.47 90.20 35.47 54.85 47.61 71.06
ProtoNet + MCL 76.87 86.95 84.59 91.71 57.52 74.33 78.75 90.53 39.09 59.16 51.22 72.50
DSN [25] 66.01 85.41 80.80 91.19 58.08 77.38 78.80 89.77 36.82 60.11 46.61 72.79
CTX [10] 69.64 87.31 78.47 90.90 60.03 78.80 69.04 88.39 36.83 60.84 42.91 69.88
DN4 [16] 78.31 88.43 85.44 92.51 62.32 79.76 79.58 91.37 43.82 64.17 48.99 72.29
DeepEMD [35] 75.34 85.68 83.35 91.60 54.48 68.36 76.05 86.82 36.05 48.55 48.14 66.27
FRN [33] 73.48 88.43 83.16 92.59 62.42 80.45 73.52 91.83 43.91 63.36 48.86 76.62
MCL 77.80 88.71 83.64 92.18 64.66 81.31 80.17 91.59 44.08 64.61 51.35 76.87
MCL-Katz 79.61 90.56 85.63 93.18 63.92 81.09 79.34 91.84 44.00 64.24 49.68 76.05
Table 2: Few-shot classification (%) results on fine-grained datasets. The 95% confidence intervals are below 0.19 within 10,000 episodes.
$f_{\theta}(\cdot)$ | miniImageNet (1-shot, 5-shot) | tieredImageNet (1-shot, 5-shot)
unidirectional VanillaFCN 54.88 69.98 55.31 68.87
MCL 55.38 70.02 57.63 74.25
MCL-Katz 55.55 71.74 57.78 74.77
unidirectional PyramidFCN 54.92 69.17 55.69 69.26
MCL 55.13 70.77 57.93 74.36
MCL-Katz 55.77 71.24 58.20 74.73
unidirectional PyramidGrid 56.61 71.40 55.12 69.19
MCL 57.50 73.03 57.57 73.81
MCL-Katz 57.88 74.03 57.63 73.96
Table 3: Comparisons between the unidirectional random walks and bidirectional MCL on mini-/tieredImageNet with Conv-4.

4.2 Fine-grained Few-shot Classification Results

We follow the same setting (shown in supplementary) as in FRN [33] to train all the models from scratch on those fine-grained datasets. We re-implement DN4 [16], DeepEMD [35] to thoroughly compare with dense features based methods. Apart from the basic comparisons (both backbones on CUB and Conv-4 backbone on meta-iNat/tiered-meta-iNat) in [33], we conduct extra ResNet-12 experiments for all the comparing methods on meta-iNat and tiered-meta-iNat in our unified framework for a thorough comparison.

Results in Table 2 demonstrate that both our end-to-end MCL (VanillaFCN) and the centrality plugin are broadly effective in various fine-grained few-shot classification tasks.

5 Analysis

5.1 Ablation Study

Different choices of $\gamma$, $\beta$. Since $\mathbf{\Phi}$ is non-square due to the different cardinalities of the sets $\mathbf{S}$ and $\mathbf{q}$, we use two different scaling parameters in Eqn.(1) and Eqn.(2). Through the column-wise normalization, $\gamma$ and $\beta$ can be interpreted as the reciprocals of temperatures in the softmax random walk probability. Large $\gamma$, $\beta$ yield a sharp random walk probability that leads to a concentrated centrality in the affiliation network. However, extremely large $\gamma$, $\beta$ (e.g., a one-hot probability as they approach infinity) would inevitably bias the episodic training due to potential gradient explosion.

Since the cardinality of $\mathbf{S}$ is larger than that of $\mathbf{q}$, we empirically use a larger $\gamma$ than $\beta$, as we need a sharper probability for random walks from the query to the larger set of support features. In the experiments, we carefully select $\gamma$ and $\beta$ (generally $\gamma=20$ and $\beta=10$ for 1-shot models and $\gamma=40$ and $\beta=20$ for 5-shot models) from several parameter combinations by their validation performance.

+MCL (on $\mathbf{q}$ / $\mathbf{s}$) | miniImageNet (1-shot, 5-shot) | tieredImageNet (1-shot, 5-shot)
ProtoNet [26] 52.32 69.74 53.19 72.28
53.74 69.61 55.36 73.17
53.97 69.74 56.60 74.18
54.31 69.84 56.67 74.36
RelationNet [28] 52.12 66.90 54.33 69.95
53.25 66.79 54.46 70.71
54.37 70.57 57.68 74.24
54.50 70.63 57.73 74.46
Table 4: Ablation of MCL as a plug-and-play applied individually to the query features $\mathbf{q}$ and the support features $\mathbf{s}$. The experiments are conducted with Conv-4 and VanillaFCN.
[Grad-CAM visualizations omitted. Rows: ProtoNet [26] and ProtoNet+MCL; columns per pane: query, ground truth, confounding support classes. Pane (a): MCL as a plugin for feature centralization; pane (b): MCL as an end-to-end classification method.]
Table 5: Grad-CAM visualizations of query and support images in 5-way 1-shot tasks. The left pane (a) illustrates MCL as a plugin on the ProtoNet. The right pane (b) illustrates MCL as an end-to-end classification method. Images (from miniImageNet) at the second column of each pane are from the ground truth class while the four on the far right are the confounding support classes.

Influence of the Katz attenuation factor $\alpha$. When $\alpha$ is small, the contribution of paths longer than one rapidly declines and the centrality is determined by short paths (mostly in-degrees). When $\alpha$ is large, long paths are devalued smoothly and the centrality is influenced more by the endogenous topology. As $\alpha$ approaches 0, the classification is determined only by the unidirectional query-to-support random walk, where the contributions from paths longer than one vanish. The comparisons in Table 3 demonstrate that performance is improved by introducing mutual affiliations in a bidirectional manner.

In the experiments, we simply use $\alpha=0.5$ for the MCL-Katz methods; the performance can be further improved by selecting different $\alpha$, as shown in the supplementary.

5.2 Computational Speed

Intuitively, bidirectional methods would be slower than unidirectional ones because of their back-and-forth computations. However, our MCL is guaranteed to reach the periodic equilibrium state shown in Eqn.(5), so we can solve for it directly with the Katz approximation.

At first glance, the most expensive step in Eqn.(10) is inverting $\bm{I}-\alpha\mathbf{P}$, which costs $O(N^{3}r^{3})$, where $N$ is the number of support classes and $r$ is the number of dense features per image. Fortunately, $\bm{I}-\alpha\mathbf{P}$ is a very special matrix: a diagonal matrix minus an anti-diagonal block matrix. In this case, we can reformulate the inverse by block-wise matrix inversion:

$$(\bm{I}-\alpha\mathbf{P})^{-1}=\begin{pmatrix}\bm{I}&-\alpha\mathbf{P}_{\mathbf{Sq}}\\ -\alpha\mathbf{P}_{\mathbf{qS}}&\bm{I}\end{pmatrix}^{-1}=\begin{pmatrix}\bm{I}+\alpha^{2}\mathbf{P}_{\mathbf{Sq}}\mathbf{\Delta}^{-1}\mathbf{P}_{\mathbf{qS}}&\alpha\mathbf{P}_{\mathbf{Sq}}\mathbf{\Delta}^{-1}\\ \alpha\mathbf{\Delta}^{-1}\mathbf{P}_{\mathbf{qS}}&\mathbf{\Delta}^{-1}\end{pmatrix}$$ (14)

where $\mathbf{\Delta}$ is defined by $\bm{I}-\alpha^{2}\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}$. The most expensive step then becomes an inversion of $\mathbf{\Delta}\in\mathbb{R}^{r\times r}$ and the time complexity is reduced to $O(r^{3})$.
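A small numerical sketch of this block-wise inversion, with random column-stochastic stand-ins for $\mathbf{P}_{\mathbf{Sq}}$ and $\mathbf{P}_{\mathbf{qS}}$ and $\alpha=0.5$ as used for MCL-Katz, is given below; it is a consistency check rather than the timed implementation of Table 6.

```python
# Sketch of Eqn (14): invert I - alpha*P by only inverting the r x r matrix
# Delta = I - alpha^2 * P_qS @ P_Sq, and compare against the dense inverse.
import torch

N, r, alpha = 5, 25, 0.5
P_Sq = torch.softmax(torch.randn(N * r, r), dim=0)     # column-stochastic stand-ins
P_qS = torch.softmax(torch.randn(r, N * r), dim=0)

Delta_inv = torch.linalg.inv(torch.eye(r) - alpha ** 2 * P_qS @ P_Sq)
inv_blocked = torch.cat([
    torch.cat([torch.eye(N * r) + alpha ** 2 * P_Sq @ Delta_inv @ P_qS,
               alpha * P_Sq @ Delta_inv], dim=1),
    torch.cat([alpha * Delta_inv @ P_qS, Delta_inv], dim=1),
], dim=0)

P = torch.zeros(N * r + r, N * r + r)
P[:N * r, N * r:] = P_Sq
P[N * r:, :N * r] = P_qS
inv_dense = torch.linalg.inv(torch.eye(N * r + r) - alpha * P)
print(torch.allclose(inv_blocked, inv_dense, atol=1e-5))   # True up to float tolerance
```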

As shown in Table 6, we compare the inference speed of MCL and other unidirectional dense feature based methods at different feature map resolutions on 5-way 1-shot tasks with ResNet-12, where each class contains 15 query images per episode. Our bidirectional MCL does not introduce much computational overhead compared to the unidirectional methods, and the Katz approximation does accelerate computing the stationary distribution.

5.3 Qualitative Feature Centralization

To better understand our centrality based method, we visualize the Grad-CAM [24] of the last convolutional layer of ResNet-12 for both the plugin and the end-to-end methods. Table 5(a) shows that MCL helps ProtoNet concentrate on the most relevant regions of interest. Table 5(b) demonstrates that the most centralized support features in MCL are not only from the ground truth class but also mutually affiliated with task-relevant objects in the query images. We consider this to qualitatively validate our underlying idea that support features are frequently visited over long runs of the bidirectional random walk if they share local characteristics with the query image.

Model | VanillaFCN resolution $r$: 5×5, 8×8, 10×10, 12×12
DN4 [16] 11.16 13.41 16.82 23.45
DeepEMD [35] 262.0 814.6 2781.9 8035.4
FRN [33] (Ridge regression) 14.55 20.42 25.81 30.65
FRN [33] (Woodbury identity) 36.74 42.33 47.31 55.84
MCL (torch.pinverse) 55.91 119.0 292.2 464.8
MCL (QR decomposition) 600.0 658.6 719.4 802.6
MCL (Katz approximation) 12.59 14.59 18.92 29.10
Table 6: Time per episode (ms) for different feature map resolutions $r$ on 5-way 1-shot tasks; the feature extraction time is excluded for all compared methods.

6 Conclusions

We present a novel dense feature based framework, Mutual Centralized Learning (MCL), to highlight the mutual affiliations of bipartite dense features in FSL. We introduce a novel dense features' accessibility criterion and demonstrate that classic transductive methods like label propagation can be easily adapted to the inductive setting if we treat the dense features from a single image as a set of unlabeled data. We propose a bidirectional random walk to learn mutual affiliations in FSL and prove that its features' accessibility in a long time-homogeneous Markov process is equivalent to the single-mode eigenvector centrality of an affiliation network. We show that such centrality can serve not only as a novel end-to-end classification criterion but also as a plugin in existing methods. Experimental results demonstrate that MCL achieves the state-of-the-art on various datasets.

References

  • [1] Arman Afrasiyabi, Jean-François Lalonde, and Christian Gagné. Associative alignment for few-shot image classification. In European Conference on Computer Vision, pages 18–35. Springer, 2020.
  • [2] Luca Bertinetto, Joao F Henriques, Philip Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2018.
  • [3] Phillip Bonacich. Factoring and weighting approaches to status scores and clique identification. Journal of mathematical sociology, 2(1):113–120, 1972.
  • [4] Phillip Bonacich. Simultaneous group and individual centralities. Social networks, 13(2):155–168, 1991.
  • [5] Stephen P Borgatti and Martin G Everett. Network analysis of 2-mode data. Social networks, 19(3):243–269, 1997.
  • [6] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
  • [7] Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390, 2020.
  • [8] Guneet Singh Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. In International Conference on Learning Representations, 2019.
  • [9] Kaize Ding, Jianling Wang, Jundong Li, Kai Shu, Chenghao Liu, and Huan Liu. Graph prototypical networks for few-shot learning on attributed networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 295–304, 2020.
  • [10] Carl Doersch, Ankush Gupta, and Andrew Zisserman. Crosstransformers: spatially-aware few-shot transfer. arXiv preprint arXiv:2007.11498, 2020.
  • [11] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
  • [12] Fusheng Hao, Fengxiang He, Jun Cheng, Lei Wang, Jianzhong Cao, and Dacheng Tao. Collect and select: Semantic alignment metric learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8460–8469, 2019.
  • [13] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.
  • [14] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
  • [15] Wenbin Li, Lei Wang, Jing Huo, Yinghuan Shi, Yang Gao, and Jiebo Luo. Asymmetric distribution measure for few-shot learning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 2957–2963, 2021.
  • [16] Wenbin Li, Lei Wang, Jinglin Xu, Jing Huo, Yang Gao, and Jiebo Luo. Revisiting local descriptor based image-to-class measure for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7260–7268, 2019.
  • [17] Yann Lifchitz, Yannis Avrithis, and Sylvaine Picard. Local propagation for few-shot learning. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 10457–10464. IEEE, 2021.
  • [18] Yann Lifchitz, Yannis Avrithis, Sylvaine Picard, and Andrei Bursuc. Dense classification and implanting for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9258–9267, 2019.
  • [19] Bin Liu, Yue Cao, Yutong Lin, Qi Li, Zheng Zhang, Mingsheng Long, and Han Hu. Negative margin matters: Understanding margin in few-shot classification. In European Conference on Computer Vision, pages 438–455. Springer, 2020.
  • [20] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. In International Conference on Learning Representations, 2019.
  • [21] Oskar Perron. Zur theorie der matrices. Mathematische Annalen, 64(2):248–263, 1907.
  • [22] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018.
  • [23] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
  • [24] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [25] Christian Simon, Piotr Koniusz, Richard Nock, and Mehrtash Harandi. Adaptive subspaces for few-shot learning. In CVPR, pages 4136–4145, 2020.
  • [26] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087, 2017.
  • [27] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 403–412, 2019.
  • [28] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, pages 1199–1208, 2018.
  • [29] Hung-Yu Tseng, Hsin-Ying Lee, Jia-Bin Huang, and Ming-Hsuan Yang. Cross-domain few-shot classification via learned feature-wise transformation. In International Conference on Learning Representations, 2020.
  • [30] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3637–3645, 2016.
  • [31] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-ucsd birds 200. 2010.
  • [32] Davis Wertheimer and Bharath Hariharan. Few-shot learning with localization in realistic settings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6558–6567, 2019.
  • [33] Davis Wertheimer, Luming Tang, and Bharath Hariharan. Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8012–8021, 2021.
  • [34] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In CVPR, pages 8808–8817, 2020.
  • [35] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. Deepemd: Differentiable earth mover’s distance for few-shot learning. arXiv e-prints, pages arXiv–2003, 2020.
  • [36] Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, volume 16, pages 321–328, 2003.

Appendix A Proof of Eqn.(5)

Proof of periodicity:

Consider the eigenvalues $\lambda$ of $\mathbf{P}$ through the determinant equation

$$\det(\lambda\bm{I}-\mathbf{P})=\det\begin{pmatrix}\lambda\bm{I}&-\mathbf{P}_{\mathbf{Sq}}\\ -\mathbf{P}_{\mathbf{qS}}&\lambda\bm{I}\end{pmatrix}=\det(\lambda^{2}\bm{I}-\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}),$$ (A.1)

it can be found that the eigenvalues of $\mathbf{P}$ are the square roots of the eigenvalues of $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$.

Since both $\mathbf{P}_{\mathbf{Sq}}$ and $\mathbf{P}_{\mathbf{qS}}$ are column-normalized matrices, their product is still column-stochastic, which can be shown by:

$$\bm{e}_{Nr}^{\mathrm{T}}\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}=\bm{e}_{r}^{\mathrm{T}}\mathbf{P}_{\mathbf{qS}}=\bm{e}_{Nr}^{\mathrm{T}}$$ (A.2)

where $\bm{e}_{|\cdot|}$ is a vector of ones whose length is indicated by its subscript, and $Nr$ and $r$ are the cardinalities of $\mathbf{S}$ and $\mathbf{q}$, respectively.

We know (by the definition of a stochastic matrix) that $\lambda=1$ is the largest eigenvalue of $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$, and its uniqueness is guaranteed since there is no zero entry in either $\mathbf{P}_{\mathbf{Sq}}$ or $\mathbf{P}_{\mathbf{qS}}$. According to Eqn.(A.1), we get another eigenvalue $\lambda=-1$ for the stochastic matrix $\mathbf{P}$. From the Perron–Frobenius theorem, the period of $\mathbf{P}$ equals the number of eigenvalues whose absolute value equals the spectral radius of $\mathbf{P}$, which proves that its stationary distribution is of period 2.

Proof for even periods:

We give the limit of the matrix $\mathbf{P}^{2t}$ as $t$ goes to infinity as follows:

$$\lim_{t\rightarrow\infty}\mathbf{P}^{2t}=\lim_{t\rightarrow\infty}\begin{pmatrix}\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}&\mathbf{0}\\ \mathbf{0}&\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\end{pmatrix}^{t}=\begin{pmatrix}\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\right]^{t}&\mathbf{0}\\ \mathbf{0}&\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\right]^{t}\end{pmatrix}$$ (A.3)

Since we have shown in Eqn.(A.2) that $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$ is also column-stochastic, we use $\bm{\pi}(\mathbf{S})$ to denote its stationary distribution vector, with $\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\right]^{t}=\bm{\pi}(\mathbf{S})\bm{e}_{Nr}^{\mathrm{T}}$.

By analogy, the infinite power of $\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}$ also reaches a similar stationary distribution $\bm{\pi}(\mathbf{q})$, with $\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\right]^{t}=\bm{\pi}(\mathbf{q})\bm{e}_{r}^{\mathrm{T}}$.

Substituting the two stationary vectors into Eqn.(A.3), we prove the stationary distributions of $\mathbf{P}$ for the even periods in Eqn.(5).

Proof for odd periods:

From the definition of matrix product, we first have

$$\lim_{t\rightarrow\infty}\mathbf{P}^{2t-1}=\lim_{t\rightarrow\infty}\mathbf{P}^{2t+1}=\mathbf{P}\lim_{t\rightarrow\infty}\mathbf{P}^{2t}=\begin{pmatrix}\mathbf{0}&\mathbf{P}_{\mathbf{Sq}}\bm{\pi}(\mathbf{q})\bm{e}_{r}^{\mathrm{T}}\\ \mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})\bm{e}_{Nr}^{\mathrm{T}}&\mathbf{0}\end{pmatrix}$$ (A.4)

Next, according to the definitions of $\bm{\pi}(\mathbf{q})$ and $\bm{\pi}(\mathbf{S})$, we get

$$\bm{\pi}(\mathbf{q})\bm{e}_{r}^{\mathrm{T}}=\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\right]^{t}=\mathbf{P}_{\mathbf{qS}}\left(\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\right]^{t}\right)\mathbf{P}_{\mathbf{Sq}}=\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})\bm{e}_{Nr}^{\mathrm{T}}\mathbf{P}_{\mathbf{Sq}}.$$ (A.5)

Right-multiplying both sides of Eqn.(A.5) by $\bm{e}_{r}$, we have

$$\bm{\pi}(\mathbf{q})\bm{e}_{r}^{\mathrm{T}}\bm{e}_{r}=\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})\bm{e}_{Nr}^{\mathrm{T}}\mathbf{P}_{\mathbf{Sq}}\bm{e}_{r}$$ (A.6)

Since $\bm{e}_{r}^{\mathrm{T}}\bm{e}_{r}=r$ and $\bm{e}_{Nr}^{\mathrm{T}}\mathbf{P}_{\mathbf{Sq}}\bm{e}_{r}=\sum_{i}\sum_{j}\left[\mathbf{P}_{\mathbf{Sq}}\right]_{ij}=r$, Eqn.(A.6) can be simplified by dividing both sides by the same scalar $r$:

$$\bm{\pi}(\mathbf{q})=\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S}).$$ (A.7)

By analogy, the symmetric equation $\bm{\pi}(\mathbf{S})=\mathbf{P}_{\mathbf{Sq}}\bm{\pi}(\mathbf{q})$ can also be easily proved in the same way as from Eqn.(A.5) to Eqn.(A.7).

Substituting $\bm{\pi}(\mathbf{S})=\mathbf{P}_{\mathbf{Sq}}\bm{\pi}(\mathbf{q})$ and $\bm{\pi}(\mathbf{q})=\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})$ into Eqn.(A.4), we prove the stationary distributions of $\mathbf{P}$ for the odd periods in Eqn.(5).
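As an informal numerical check (not part of the proof), the periodic limits of Eqn.(5) can be observed directly with random column-stochastic blocks:

```python
# Numerical check of Eqn (5): high even/odd powers of the anti-diagonal P
# collapse to rank-one blocks built from pi(S) and pi(q).
import torch

torch.manual_seed(0)
N, r = 3, 4
P_Sq = torch.softmax(torch.randn(N * r, r, dtype=torch.double), dim=0)
P_qS = torch.softmax(torch.randn(r, N * r, dtype=torch.double), dim=0)
P = torch.zeros(N * r + r, N * r + r, dtype=torch.double)
P[:N * r, N * r:] = P_Sq
P[N * r:, :N * r] = P_qS

P_even = torch.linalg.matrix_power(P, 400)     # ~ lim P^{2t}
P_odd = torch.linalg.matrix_power(P, 401)      # ~ lim P^{2t-1}

pi_S = P_even[:N * r, 0]                       # every column of the S-block equals pi(S)
pi_q = P_even[N * r:, N * r]                   # every column of the q-block equals pi(q)
print(torch.allclose(P_odd[:N * r, N * r:], pi_S.unsqueeze(1).expand(-1, r)))
print(torch.allclose(pi_S, P_Sq @ P_qS @ pi_S), torch.allclose(pi_q, P_qS @ P_Sq @ pi_q))
```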

Appendix B Proof of Lemma 1

We have shown in Appendix A that there exists an eigenvalue $\lambda=1$ for the column-stochastic matrix $\mathbf{P}$ with $\mathbf{Px}=\mathbf{x}$. If we interpret the transition matrix as the adjacency matrix of the directed bipartite graph, the eigenvector centrality of that graph is $\mathbf{x}$.

We split the eigenvector $\mathbf{x}$ into $\mathbf{x}_{\mathbf{S}}$ and $\mathbf{x}_{\mathbf{q}}$ for the bipartite vertex sets $\mathbf{S}$ and $\mathbf{q}$ respectively, and the single-mode eigenvector centralities of each vertex set can therefore be formulated as:

$$\hat{\mathbf{x}}_{\mathbf{S}}=\frac{\mathbf{x}_{\mathbf{S}}}{\sum_{s\in\mathbf{S}}x_{s}},\qquad\hat{\mathbf{x}}_{\mathbf{q}}=\frac{\mathbf{x}_{\mathbf{q}}}{\sum_{q\in\mathbf{q}}x_{q}}$$ (B.1)

Left-multiplying both sides of $\mathbf{Px}=\mathbf{x}$ by $\mathbf{P}$, we have $\mathbf{P}^{2}\mathbf{x}=\mathbf{P}(\mathbf{Px})=\mathbf{Px}=\mathbf{x}$. In matrix notation,

$$\underbrace{\begin{pmatrix}\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}&\mathbf{0}\\ \mathbf{0}&\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\end{pmatrix}}_{\mathbf{P}^{2}}\underbrace{\begin{pmatrix}\mathbf{x}_{\mathbf{S}}\\ \mathbf{x}_{\mathbf{q}}\end{pmatrix}}_{\mathbf{x}}=\underbrace{\begin{pmatrix}\mathbf{x}_{\mathbf{S}}\\ \mathbf{x}_{\mathbf{q}}\end{pmatrix}}_{\mathbf{x}}.$$ (B.2)

Considering the product of the first block row of 𝐏2\mathbf{P}^{2} with 𝐱\mathbf{x} in Eqn.(B.2), we have 𝐏𝐒𝐪𝐏𝐪𝐒𝐱𝐒=𝐱𝐒\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\mathbf{x}_{\mathbf{S}}=\mathbf{x}_{\mathbf{S}}. Since 𝝅(𝐒)\bm{\pi}(\mathbf{S}) is the eigenvector of 𝐏𝐒𝐪𝐏𝐪𝐒\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}} with eigenvalue 1 under the probability constraint 𝒆NrT𝝅(𝐒)=1\bm{e}_{Nr}^{\mathrm{T}}\bm{\pi}(\mathbf{S})=1, 𝝅(𝐒)\bm{\pi}(\mathbf{S}) is exactly the single-mode eigenvector centrality 𝐱^𝐒\hat{\mathbf{x}}_{\mathbf{S}} in Eqn.(B.1).

By analogy, considering the product of the second block row of 𝐏2\mathbf{P}^{2} with 𝐱\mathbf{x} in Eqn.(B.2), we can prove that 𝝅(𝐪)\bm{\pi}(\mathbf{q}) is equivalent to the conjugate single-mode eigenvector centrality 𝐱^𝐪\hat{\mathbf{x}}_{\mathbf{q}} of the bipartite graph.
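
Lemma 1 can likewise be verified numerically. The sketch below (again with hypothetical toy sizes and random column-stochastic blocks, not the actual MCL transitions) builds the full bipartite transition matrix, extracts the eigenvector of eigenvalue 1, and compares its normalized blocks with the stationary distributions of the two-step chains.

import torch

torch.manual_seed(0)
N, r = 3, 4                                # hypothetical toy sizes
Nr = N * r
P_Sq = torch.softmax(torch.randn(Nr, r), dim=0)   # q -> S block
P_qS = torch.softmax(torch.randn(r, Nr), dim=0)   # S -> q block

# Full bipartite transition matrix with the block layout used in Eqn.(B.2).
P = torch.zeros(Nr + r, Nr + r)
P[:Nr, Nr:] = P_Sq
P[Nr:, :Nr] = P_qS

# Eigenvector of P associated with the eigenvalue closest to 1.
evals, evecs = torch.linalg.eig(P)
x = evecs[:, torch.argmin((evals - 1).abs())].real
x_S, x_q = x[:Nr], x[Nr:]

# Stationary distributions of the two-step chains via power iteration.
pi_S = torch.full((Nr,), 1.0 / Nr)
pi_q = torch.full((r,), 1.0 / r)
for _ in range(1000):
    pi_S, pi_q = P_Sq @ P_qS @ pi_S, P_qS @ P_Sq @ pi_q

# Lemma 1: the normalized blocks of x recover pi_S and pi_q, as in Eqn.(B.1).
assert torch.allclose(x_S / x_S.sum(), pi_S, atol=1e-4)
assert torch.allclose(x_q / x_q.sum(), pi_q, atol=1e-4)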

Appendix C Proof of Eqn.(6)

We restate Eqn.(6) to be proved:

𝐏𝐫(y~=c)\displaystyle\mathbf{Pr}(\tilde{y}=c) =limtz𝐳𝔼[k=1t𝟙[Xk𝐬c]|X0=z]z𝐳𝔼[k=1t𝟙[Xk𝐒]|X0=z]\displaystyle=\lim_{t\rightarrow\infty}\frac{\displaystyle{\sum_{z\in\mathbf{z}}}\hskip 2.0pt\mathbb{E}\left[\sum_{k=1}^{t}\mathds{1}[X_{k}\in\mathbf{s}^{c}]\bigg{|}X_{0}=z\right]}{\displaystyle{\sum_{z\in\mathbf{z}}}\hskip 2.0pt\mathbb{E}\left[\sum_{k=1}^{t}\mathds{1}[X_{k}\in\mathbf{S}]\bigg{|}X_{0}=z\right]}
=s𝐬c[𝝅(𝐒)]s\displaystyle=\sum_{s\in\mathbf{s}^{c}}\left[\bm{\pi}(\mathbf{S})\right]_{s}

We first define 𝐏𝐫(y~=c)limt𝐏𝐫(t)\displaystyle{\mathbf{Pr}(\tilde{y}=c)\triangleq\lim_{t\rightarrow\infty}\mathbf{Pr}(t)}. Since we have proved in Appendix A that the Markov process with transition matrix 𝐏\mathbf{P} has period 2, proving Eqn.(6) is equivalent to proving:

limt𝐏𝐫(2t)=limt𝐏𝐫(2t1)=s𝐬c[𝝅(𝐒)]s\lim_{t\rightarrow\infty}\mathbf{Pr}(2t)=\lim_{t\rightarrow\infty}\mathbf{Pr}(2t-1)=\sum_{s\in\mathbf{s}^{c}}\left[\bm{\pi}(\mathbf{S})\right]_{s} (C.1)

Proof for even period:

From the definition, we have

𝐏𝐫(2t)=1Nr+rz𝐳k=12ts𝐬c[𝐏k]sz1Nr+r(rt+Nrt)=1tk=1t[1Nr+rs𝐬c(z𝐒[𝐏2k]sz+z𝐪[𝐏2k1]sz)]\displaystyle\begin{split}&\mathbf{Pr}(2t)=\frac{\displaystyle{\frac{1}{Nr+r}\sum_{z\in\mathbf{z}}\sum_{k=1}^{2t}}\sum_{s\in\mathbf{s}^{c}}\left[\mathbf{P}^{k}\right]_{sz}}{\displaystyle{\frac{1}{Nr+r}\left(rt+Nrt\right)}}\\ &=\frac{1}{t}\sum_{k=1}^{t}\left[\frac{1}{Nr+r}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{z\in\mathbf{S}}\left[\mathbf{P}^{2k}\right]_{sz}+\sum_{z\in\mathbf{q}}\left[\mathbf{P}^{2k-1}\right]_{sz}\right)\right]\end{split} (C.2)

where rtrt is the number of visits made by particles starting in 𝐪\mathbf{q} to support features in 𝐒\mathbf{S} after 2t2t steps of the bidirectional random walk, and NrtNrt is the number of visits made by particles starting in 𝐒\mathbf{S} to support features in 𝐒\mathbf{S}. The second equality follows from the block-diagonal/anti-diagonal structure of 𝐏2k\mathbf{P}^{2k}/𝐏2k1\mathbf{P}^{2k-1}, respectively, where the 𝟎\mathbf{0} sub-matrices are dropped from the summation.

Taking the limit of Eqn.(C.2) as tt approaches infinity, we have

limt𝐏𝐫(2t)\displaystyle\lim_{t\rightarrow\infty}\mathbf{Pr}(2t) (C.3)
=1Nr+rs𝐬c(z𝐒[limt𝐏2t]sz+z𝐪[limt𝐏2t1]sz)\displaystyle=\frac{1}{Nr+r}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{z\in\mathbf{S}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t}\right]_{sz}+\sum_{z\in\mathbf{q}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t-1}\right]_{sz}\right)
=s𝐬c[𝝅(𝐒)]s\displaystyle=\sum_{s\in\mathbf{s}^{c}}\left[\bm{\pi}(\mathbf{S})\right]_{s}

where the first equality holds because the Cesàro average of a convergent sequence converges to the same limit (the even- and odd-power subsequences of the 2-periodic Markov chain both converge), and the second equality follows from substituting Eqn.(5).

Proof for odd period:

From the definition, we have

𝐏𝐫(2t1)=1Nr+rz𝐳k=12t1s𝐬c[𝐏k]sz1Nr+r(rt+Nr(t1))=1ωs𝐬c[z𝐪𝐏sz+k=2t(z𝐒[𝐏2k2]sz+z𝐪[𝐏2k1]sz)]\displaystyle\begin{split}&\mathbf{Pr}(2t-1)=\frac{\displaystyle{\frac{1}{Nr+r}\sum_{z\in\mathbf{z}}\sum_{k=1}^{2t-1}}\sum_{s\in\mathbf{s}^{c}}\left[\mathbf{P}^{k}\right]_{sz}}{{\displaystyle{\frac{1}{Nr+r}\left(rt+Nr(t-1)\right)}}}\\ &=\frac{1}{\omega}\sum_{s\in\mathbf{s}^{c}}\left[\sum_{z\in\mathbf{q}}\mathbf{P}_{sz}+\sum_{k=2}^{t}\left(\sum_{z\in\mathbf{S}}\left[\mathbf{P}^{2k-2}\right]_{sz}+\sum_{z\in\mathbf{q}}\left[\mathbf{P}^{2k-1}\right]_{sz}\right)\right]\end{split} (C.4)

where ω\omega equals (Nr+r)(tNrNr+r)(Nr+r)(t-\frac{Nr}{Nr+r}). Taking the limit of Eqn.(C.4) as tt approaches infinity, we have

limt𝐏𝐫(2t1)=1Nr+rs𝐬c(z𝐒[limt𝐏2t2]sz+z𝐪[limt𝐏2t1]sz)=s𝐬c[𝝅(𝐒)]s\displaystyle\begin{split}&\lim_{t\rightarrow\infty}\mathbf{Pr}(2t-1)\\ &=\frac{1}{Nr+r}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{z\in\mathbf{S}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t-2}\right]_{sz}+\sum_{z\in\mathbf{q}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t-1}\right]_{sz}\right)\\ &=\sum_{s\in\mathbf{s}^{c}}\left[\bm{\pi}(\mathbf{S})\right]_{s}\end{split} (C.5)

where the single first-step term is dropped since limt1tNrNr+rz𝐪𝐏sz=0\displaystyle{\lim_{t\rightarrow\infty}\frac{1}{t-\frac{Nr}{Nr+r}}\sum_{z\in\mathbf{q}}\mathbf{P}_{sz}}=0.
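
The final equality in Eqn.(C.3) and Eqn.(C.5) can also be checked numerically. The sketch below uses hypothetical toy sizes and random column-stochastic blocks in place of the actual MCL transitions, approximates the limits of the even and odd matrix powers by a large finite power, and compares the resulting per-class probabilities with the per-class sums of the stationary distribution.

import torch

torch.manual_seed(0)
N, r = 3, 4                                  # hypothetical toy sizes
Nr = N * r
dtype = torch.double
P_Sq = torch.softmax(torch.randn(Nr, r, dtype=dtype), dim=0)
P_qS = torch.softmax(torch.randn(r, Nr, dtype=dtype), dim=0)

# Full bipartite transition matrix: rows/cols [0, Nr) are S, [Nr, Nr+r) are q.
P = torch.zeros(Nr + r, Nr + r, dtype=dtype)
P[:Nr, Nr:] = P_Sq
P[Nr:, :Nr] = P_qS

# Large even/odd powers approximate the limits of P^{2t} and P^{2t-1}.
P_even = torch.linalg.matrix_power(P, 2000)
P_odd = torch.linalg.matrix_power(P, 2001)

# Right-hand side of the first equality in Eqn.(C.3), grouped into N classes
# of r consecutive support features each.
limit = (P_even[:Nr, :Nr].sum(dim=1) + P_odd[:Nr, Nr:].sum(dim=1)) / (Nr + r)
pr = limit.view(N, r).sum(dim=1)

# Stationary distribution of the two-step support chain, grouped per class.
pi_S = torch.full((Nr,), 1.0 / Nr, dtype=dtype)
for _ in range(1000):
    pi_S = P_Sq @ P_qS @ pi_S
assert torch.allclose(pr, pi_S.view(N, r).sum(dim=1), atol=1e-6)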

import torch

# support: tensor of shape [N, d, r]
#   N-way FSL, each class owns r d-dimensional dense features.
# query: tensor of shape [q, d, r]
#   q query examples, each of them owns r dense features.
#
# gamma, beta: scaled similarity parameters (1-shot defaults from Appendix H)
# alpha: Katz attenuation factor; alpha_2: the square of alpha
# @: the matrix multiplication operator in Pytorch
gamma, beta = 20.0, 10.0
alpha = 0.5
alpha_2 = alpha ** 2


def inner_cosine(query, support):
    N, d, r = support.shape
    q = len(query)
    # L2-normalize every d-dimensional dense feature (along the d axis) so the
    # inner products below become cosine similarities.
    query = query / query.norm(2, dim=-2, keepdim=True)
    support = support / support.norm(2, dim=-2, keepdim=True)
    support = support.unsqueeze(0).expand(q, -1, -1, -1)  # [q, N, d, r]
    query = query.unsqueeze(1).expand(-1, N, -1, -1)      # [q, N, d, r]
    S = query.transpose(-2, -1)@support                   # [q, N, r, r]
    S = S.permute(0, 2, 1, 3).contiguous().view(q, r, N * r)
    return S


def MCL_Katz_approx(query, support):
    N, d, r = support.shape
    q = len(query)
    S = inner_cosine(query, support)  # [q, r, Nr]
    St = S.transpose(-2, -1)          # [q, Nr, r]
    # column-wise softmax probability
    P_sq = torch.softmax(gamma * St, dim=-2)  # transitions q -> S
    P_qs = torch.softmax(beta * S, dim=-2)    # transitions S -> q
    # From the derivations in Eqn.(F.2)
    inv = torch.inverse(
        torch.eye(r)[None].repeat(q, 1, 1) - alpha_2 * P_qs@P_sq
    )  # [q, r, r]
    katz = (alpha_2 * P_sq@inv@P_qs).sum(-1) + (alpha * P_sq@inv).sum(-1)
    katz = katz / katz.sum(-1, keepdim=True)
    predicts = katz.view(q, N, r).sum(-1)     # class probabilities [q, N]
    return predicts
Code 1: Pytorch pseudo-code for 1-shot MCL (Katz approximation) in a single episode.

Appendix D Codes

We use the block-wise inversion in Eqn.(14), which is more computationally efficient than directly inverting Eqn.(10) when the number of support classes NN is large. The corresponding Pytorch pseudo-code is given in Code 1, where the whole calculation is performed in parallel via batched matrix multiplication and inversion. Our unified FSL training and testing framework is publicly available at https://github.com/LouieYang/MCL.
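
As a usage note, Code 1 can be called on an episode directly. The minimal sketch below assumes the definitions from Code 1 are in scope and uses hypothetical sizes (5-way 1-shot, 640 channels, 25 dense features from a 5x5 map) with random tensors standing in for extracted features.

import torch

support = torch.randn(5, 640, 25)      # [N, d, r]
query = torch.randn(5 * 15, 640, 25)   # [q, d, r], 15 queries per class
probs = MCL_Katz_approx(query, support)
print(probs.shape)                     # torch.Size([75, 5]), one row per query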

Appendix E Implementation details

Preprocessing: During training on CUB, miniImageNet and tieredImageNet, images are randomly cropped to 92×9292\times 92 and then resized to 84×8484\times 84. For meta-iNat and tiered-meta-iNat, images are randomly padded and then cropped to 84×8484\times 84. Unlike previous methods, we only apply random horizontal flipping during training.

During inference, images are center cropped to 92×9292\times 92 and then resized to 84×8484\times 84 for CUB, miniImageNet and tieredImageNet. For meta-iNat and tiered-meta-iNat, images are already 84×8484\times 84 and are fed into the models directly.
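
A minimal torchvision sketch of the crop-and-resize pipeline described above (assuming the raw images are at least 92 pixels per side; the meta-iNat padding variant and any normalization are omitted):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(92),          # random 92x92 crop
    transforms.Resize(84),              # resize to 84x84
    transforms.RandomHorizontalFlip(),  # the only extra augmentation we use
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.CenterCrop(92),          # center 92x92 crop
    transforms.Resize(84),
    transforms.ToTensor(),
])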

Network backbones: We use two backbones in our experiments: Conv-4 and ResNet-12. Conv-4 contains four convolutional blocks, each of which consists of a 3×33\times 3 convolution, a BatchNorm, a LeakyReLU(0.2) and a 2×22\times 2 max-pooling. ResNet-12 consists of four residual blocks, each with three convolutional layers, LeakyReLU(0.1) and 2×22\times 2 max-pooling on the main stem. Given an input image of size 84×8484\times 84, Conv-4 outputs a feature map of size 5×5×645\times 5\times 64 while ResNet-12 outputs one of size 5×5×6405\times 5\times 640.
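
For reference, a minimal PyTorch sketch of the Conv-4 backbone described above (the 64-channel width throughout is the standard choice and an assumption here):

import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
        nn.MaxPool2d(2),
    )

# Four stacked blocks: an 84x84 RGB input yields a 5x5x64 dense feature map.
conv4 = nn.Sequential(
    conv_block(3, 64), conv_block(64, 64),
    conv_block(64, 64), conv_block(64, 64),
)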

Re-implemented baselines: We re-implement ProtoNet [26] and RelationNet [28] in our unified framework as two global feature based baselines for our centrality plugin. We borrow most of the code from their official implementations and introduce slight modifications, inspired by their subsequent works [33, 34], to improve their performances. For ProtoNet, we introduce a fixed temperature scaling of 1/641/64 before the softmax function. For RelationNet, we change the original MSELoss to CrossEntropyLoss.
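
A minimal sketch of the ProtoNet modification mentioned above, i.e., squared-Euclidean logits scaled by a fixed temperature of 1/64 before the softmax (the helper name and shapes are hypothetical):

import torch

def protonet_logits(query_emb, prototypes, temperature=1.0 / 64):
    # query_emb: [q, d] global query features; prototypes: [N, d] class means.
    sq_dists = torch.cdist(query_emb, prototypes) ** 2   # [q, N]
    return -temperature * sq_dists                       # fed to CrossEntropyLoss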

Model Backbone 1-shot 5-shot
Baseline [6] ResNet-10 - 65.57 ±\pm 0.70
Baseline++ [6] ResNet-18 - 62.04 ±\pm 0.76
MetaOptNet [14] ResNet-12 44.79 ±\pm 0.75 64.98 ±\pm 0.68
MatchingNet+FT [29] ResNet-10 36.61 ±\pm 0.53 55.23 ±\pm 0.83
RelationNet+FT [29] ResNet-10 44.07 ±\pm 0.77 59.46 ±\pm 0.71
GNN+FT [29] ResNet-10 47.47 ±\pm 0.75 66.98 ±\pm 0.68
Neg-margin [19] ResNet-18 - 69.30 ±\pm 0.73
Centroid et al. [1] ResNet-18 46.85 ±\pm 0.75 70.37 ±\pm 1.02
FRN [33] ResNet-12 51.60 ±\pm 0.21 72.97 ±\pm 0.18
ProtoNet [26] ResNet-12 40.05 ±\pm 0.18 55.29 ±\pm 0.19
ProtoNet+MCL ResNet-12 42.02 ±\pm 0.19 64.76 ±\pm 0.20
MCL ResNet-12 55.48 ±\pm 0.22 75.93 ±\pm 0.18
MCL-Katz ResNet-12 53.22 ±\pm 0.22 77.39 ±\pm 0.18
Table 7: Few-shot classifications (%) in the cross-domain setting: miniImageNet\rightarrowCUB. \dagger: our re-implemented results in our unified framework that share the same dataloader and training strategies.

Pre-training: We use the pre-training + meta-training procedure for ResNet-12 backbones on miniImageNet and tieredImageNet, like most methods in the literature [35, 16, 33]. We follow the same pre-training technique as the dense features based FRN [33] to learn spatially distinctive dense features, e.g., running 350 epochs with batch size 128 on miniImageNet, using SGD with an initial learning rate of 0.1 decayed by a factor of 10 at epochs 200 and 300.
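
A minimal sketch of the described pre-training schedule (the backbone and classifier are placeholders; momentum and weight decay are omitted since they are not specified here):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 84 * 84, 64))  # placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200, 300], gamma=0.1)

for epoch in range(350):
    # ... one epoch of standard supervised training with batch size 128 ...
    scheduler.step()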

Meta-training on mini-/tieredImageNet: We follow the earliest dense feature based DN4 [16] and randomly sample 20,000/200 episodes per epoch for the Conv-4/ResNet-12 backbone, respectively. Since we did not use pre-training for Conv-4, the number of episodes per epoch for Conv-4 is far larger than for ResNet-12 to ensure convergence. In each episode, besides the KK support images in each class, 15 query images are also selected from each class.

For Conv-4, we adopt Adam optimizer with initial learning rate of 1e-3 to train for 30 epochs (on both datasets) and reduce it by a factor of 10 every 10 epochs.

For ResNet-12, we adopt SGD with initial learning rate of 5e-4 (40 epochs on miniImageNet and 60 epochs on tieredImageNet) and cut it by half every 10 epochs.

Unlike the latest dense feature based FRN, which adopts a larger way during meta-training (e.g., 25-way for training 1-shot models and 20-way for 5-shot models), we did not use that setting on the widely used mini-/tieredImageNet, as we considered it unfair when comparing with other methods.

Meta-training on fine-grained datasets: We follow the latest fine-grained few-shot classification setting from FRN [33], as most of the compared performances in Table 2 are taken from it. Although practitioners agree on the train/validation/test split ratio (i.e., 100/50/50) on the CUB dataset, there is no official class split. In our experiments, we use the same train/val/test class split as in [33] for a fair comparison.

For CUB, we train all our Conv-4 models for 1200 epochs using SGD with Nesterov momentum of 0.9 and an initial learning rate of 0.1. The learning rate decreases by a factor of 10 at epochs 400 and 800. The ResNet-12 backbone is trained for 600 epochs, with the learning rate scaled down by 10 at epochs 300, 400 and 500. The Conv-4 backbone is trained with the standard 20-way for 5-shot models and with 30-way for 1-shot models as in [33], while the ResNet-12 backbone is trained with 10-way for both shot settings.

For meta-iNat and tiered-meta-iNat, we train our Conv-4 and ResNet-12 models for 100 epochs using Adam with an initial learning rate of 1e-3, decayed by a factor of 0.5 every 20 epochs. Both Conv-4 and ResNet-12 are trained with 10-way for both 1-shot and 5-shot models.

Methods fθ()f_{\theta}(\cdot) 1-shot 5-shot
Baseline [6] Global feature 53.99 ±\pm 0.20 78.78 ±\pm 0.15
Baseline++ [6] 55.03 ±\pm 0.20 78.79 ±\pm 0.15
ProtoNet [26] 53.99 ±\pm 0.20 79.70 ±\pm 0.15
ProtoNet+MCL 58.62 ±\pm 0.20 80.99 ±\pm 0.14
DN4 [16] VanillaFCN 58.95 ±\pm 0.19 78.01 ±\pm 0.15
FRN [33] 59.71 ±\pm 0.19 74.05 ±\pm 0.15
MCL 60.94 ±\pm 0.20 80.20 ±\pm 0.14
MCL-Katz 61.55 ±\pm 0.20 81.09 ±\pm 0.14
Table 8: Few-shot classifications (%) without episodic meta-training. All of the comparing methods are evaluated in our unified framework with the same pre-trained ResNet-12 backbone.

Appendix F Cross-Domain Few-shot Classification

We also evaluate on the challenging cross-domain setting proposed by [6], where models trained on miniImageNet base classes are directly evaluated on test classes from CUB. We use the same test class split as in [33] for fair comparisons, which is much harder than the test class split in [6].

As shown in Table S7, our MCL outperforms previous state-of-the-art methods by large margins of 3.9% on the 1-shot task and 5.4% on the 5-shot task, respectively.

Appendix G Evaluation without Meta-training

Given that an increasing number of methods simply use standard supervised learning to pre-train the feature extractor and then apply their methods directly for evaluation without meta-training [6, 8], we also evaluate our methods under this setting with the same pre-trained feature extractor used in Table 1. As shown in Table S8, global feature based methods are likely to misclassify images in the extreme 1-shot scenario, where significant intra-class variations inevitably drive image-level embeddings of the same category far apart. In contrast, dense feature based methods retain richer local information and show promising performances in that scenario.

Our end-to-end Katz centrality based MCL outperforms previous methods by margins of 1.8% and 1.3% on the 1-shot and 5-shot tasks, respectively. It is interesting to note that our MCL plugin helps centralize the task-relevant local features in ProtoNet [26], improving it by large margins of 4.6% on the 1-shot task and 1.2% on the 5-shot task without bells and whistles.

5-way 1-shot 5-way 5-shot
α0\alpha\approx 0 (unidirectional) 66.60 ±\pm 0.20 81.76 ±\pm 0.13
α=0.1\alpha=0.1 67.13 ±\pm 0.20 83.20 ±\pm 0.13
α=0.3\alpha=0.3 67.28 ±\pm 0.20 83.89 ±\pm 0.13
α=0.5\alpha=0.5 (MCL-Katz) 67.51 ±\pm 0.20 83.99 ±\pm 0.13
α=0.7\alpha=0.7 67.27 ±\pm 0.20 84.05 ±\pm 0.13
α=0.9\alpha=0.9 67.41 ±\pm 0.20 83.96 ±\pm 0.13
α=0.999\alpha=0.999 (MCL) 67.36 ±\pm 0.20 83.63 ±\pm 0.13
Table 9: Ablation studies on the Katz attenuation factor α\alpha on miniImageNet with VanillaFCN ResNet-12.
β=5.0\beta=5.0 β=10.0\beta=10.0 β=20.0\beta=20.0 β=50.0\beta=50.0
γ=5.0\gamma=5.0 66.12 66.15 65.87 65.39
γ=10.0\gamma=10.0 67.21 67.20 66.78 66.23
γ=20.0\gamma=20.0 67.53 67.36 66.70 66.00
γ=50.0\gamma=50.0 67.05 66.77 65.84 65.08
Table 10: Ablation studies of MCL (α=0.999\alpha=0.999) on the scaling parameters γ\gamma and β\beta in Eqn.(1) and Eqn.(2), respectively. Experiments are conducted on 5-way 1-shot miniImageNet with VanillaFCN.

Appendix H Ablation on parameters α\alpha, γ\gamma and β\beta

Ablation on the Katz attenuation factor α\alpha. Table S9 shows results with different α\alpha in the bidirectional random walks. As discussed in Sec.5, the centrality with a large α\alpha is more influenced by the endogenous topology, while that with a small α\alpha is more influenced by the in-degree paths. It can be observed in Table S9 that the optimal α\alpha differs across FSL tasks (i.e., different shots on different datasets).

In the experiments, we use α\alpha=0.9990.999 to approximate single-mode eigenvector centrality in Eqn.(6) and simply use α\alpha=0.50.5 to represent general Katz centrality in Eqn.(8).

Ablation on the scaling parameters γ\gamma and β\beta. Table S10 shows MCL results with different scaling parameters γ\gamma and β\beta. As discussed in Sec.5, a larger scaling parameter (i.e., a smaller temperature in the softmax-like random walk probability) leads to a more concentrated eigenvector centrality. However, a larger scaling parameter also inevitably biases the meta-training. Thus, there is a trade-off in selecting the optimal parameters for each task.

In the experiments, we find it empirically effective to select γ\gamma and β\beta according to the pre-trained models (similar to Table S10). In most cases, we use γ\gamma=2020, β\beta=1010 for 1-shot tasks and γ\gamma=4040, β\beta=2020 for 5-shot tasks.
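
A tiny sketch of the temperature effect discussed above: a larger scaling parameter sharpens the column-wise random-walk probabilities (the similarity values below are hypothetical):

import torch

sims = torch.tensor([0.9, 0.7, 0.5, 0.3])   # hypothetical cosine similarities
for gamma in (5.0, 20.0, 50.0):
    print(gamma, torch.softmax(gamma * sims, dim=0))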

Appendix I Additional Plugin Experiments

We have shown that our proposed centrality weighted pooling yields a consistent performance gain (especially in the extreme 1-shot scenario) over global average pooling on ProtoNet [26] and RelationNet [28] by concentrating on more task-relevant local features. Besides Tables 1, 2, 4 and S8, we give additional results in Table S11 to show that MCL can be easily plugged into methods that require no extra parameters beyond the feature extractor.

The experiments are conducted under the following rules: all compared methods are evaluated with the same pre-trained ResNet-12 backbone as in Sec. G without meta-training; we only use centrality weighted pooling to aggregate the local features of the query image, since different methods operate differently on the support features; and we fix the parameters γ\gamma=2020 and β\beta=1010 for the MCL plugins of all compared methods.

As shown in Table S11, our proposed centrality plugin consistently improves the performances of all the global feature based methods without bells and whistles.
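
For illustration, a minimal sketch of centrality weighted pooling, assuming the per-feature centrality weights (e.g., the query-side stationary distribution) have already been computed; global average pooling is recovered with uniform weights:

import torch

def centrality_weighted_pool(dense_feat, weights):
    # dense_feat: [d, r] dense features of one image; weights: [r], sum to 1.
    return dense_feat @ weights                     # [d] pooled global feature

d, r = 640, 25
feat = torch.randn(d, r)
uniform = torch.full((r,), 1.0 / r)
assert torch.allclose(centrality_weighted_pool(feat, uniform),
                      feat.mean(dim=-1), atol=1e-5)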

Method miniImageNet tieredImageNet
1-shot 5-shot 1-shot 5-shot
Baseline[6] 53.99 78.78 67.75 85.23
+MCL 56.21+2.22 80.44+1.66 68.53+0.78 85.75+0.52
Baseline++[6] 55.03 78.79 61.86 84.31
+MCL 56.93+1.90 79.93+1.14 63.00+1.14 84.61+0.30
ProtoNet[26] 56.50 79.68 64.27 84.01
+MCL 58.62+2.12 80.99+1.30 66.48+2.21 84.83+0.82
MatchingNet[30] 58.41 79.52 68.68 85.16
+MCL 59.85+1.44 80.79+1.27 69.45+0.97 85.49+0.33
R2D2[2] 59.82 78.97 70.22 85.30
+MCL 60.86+1.04 80.65+1.68 70.64 +0.42 85.84+0.54
MetaOptNet[14] 59.59 79.77 69.55 85.25
+MCL 60.66+1.07 81.27+1.50 70.11+0.55 85.76+0.51
DSN[25] 58.70 79.01 69.13 85.14
+MCL 60.08+1.38 80.66+1.65 69.59+0.46 85.61+0.47
Neg-cosine[19] 58.82 79.63 69.44 84.94
+MCL 60.20+1.38 81.04+1.41 69.85+0.41 85.31+0.37
Table 11: Few-shot classifications (%) before and after applying centrality weighted pooling. Experiment settings follow Table S8.

Appendix J Additional Visualization

Additional Grad-CAM visualizations, analogous to Table 5 of the main paper, are presented as follows: Table S12 illustrates the end-to-end MCL-Katz as in Table 5(b), and Table S13 illustrates the MCL plugin on ProtoNet [26] as in Table 5(a).

[Grad-CAM visualization grid; see the Table S12 caption below.]
Table 12: Additional Grad-CAM visualizations of MCL-katz on 5-way 1-shot FSL tasks (miniImageNet) with ResNet-12. Formatting follows Table 5(b): query images are placed in the first column for each task; ground truth support images are placed in the second column; the four images on the far right column of each task are from the confounding support classes.
[Grad-CAM visualization grid; see the Table S13 caption below.]
Table 13: Additional ProtoNet+MCL Grad-CAM visualizations on 5-way 1-shot FSL tasks (miniImageNet) with ResNet-12. Formatting follows Table 5(a): query images are placed in the first column for each task; ground truth support images are placed in the second column; the four images on the far right column of each task are from the confounding support classes. The top row in each task is from ProtoNet while the bottom row is from ProtoNet+MCL.