
Learning to Affiliate: Mutual Centralized Learning for Few-shot Classification

Yang Liu1,2, Weifeng Zhang1, Chao Xiang1, Tu Zheng1,2, Deng Cai1,2, Xiaofei He1,2
1Zhejiang University  2Fabu Inc., Hangzhou, China
{lyng_95, zhangwf, chaoxiang, zhengtu}@zju.edu.cn  [email protected][email protected]
Abstract

Few-shot learning (FSL) aims to learn a classifier that can be easily adapted to accommodate new tasks, given only a few examples. To handle the limited data of few-shot regimes, recent methods tend to collectively use a set of local features to densely represent an image instead of using a mixed global feature. They generally explore a unidirectional paradigm, e.g., finding the nearest support feature for every query feature and aggregating local matches for a joint classification. In this paper, we propose a novel Mutual Centralized Learning (MCL) to fully affiliate these two disjoint dense feature sets in a bidirectional paradigm. We first associate each local feature with a particle that can bidirectionally random walk in a discrete feature space. To estimate the class probability, we propose the dense features’ accessibility, which measures the expected number of visits to the dense features of that class in a Markov process. We relate our method to learning a centrality on an affiliation network and demonstrate its capability to be plugged into existing methods by highlighting centralized local features. Experiments show that our method achieves the new state-of-the-art.

[Figure 1 panels: images omitted]
(a) $\mathbf{Pr}_{\mathrm{DN4}}(c\mid\mathbf{q},\mathbf{S})\propto\exp\big(\sum_{q\in\mathbf{q}}\cos(q,\mathrm{NN}_{\mathbf{s}^{c}}(q))\big)$
(b) $\mathbf{Pr}_{\mathrm{DeepEMD}}(c\mid\mathbf{q},\mathbf{S})\propto\exp\big(-\sum_{q\in\mathbf{q}}\|q-\mathrm{EMD}_{\mathbf{s}^{c}}(q)\|^{2}\big)$
(c) $\mathbf{Pr}_{\mathrm{FRN}}(c\mid\mathbf{q},\mathbf{S})\propto\exp\big(-\sum_{q\in\mathbf{q}}\|q-\mathbf{s}^{c}\mathbf{w}\|^{2}\big)$
(d) $\mathbf{Pr}_{\mathrm{MCL}}(c\mid\mathbf{q},\mathbf{S})\propto\sum_{z\in\mathbf{q}\cup\mathbf{S}}\mathbb{E}\big[\sum_{t=1}^{\infty}\mathds{1}[X_{t}\in\mathbf{s}^{c}]\,\big|\,X_{0}=z\big]$
Figure 1: Illustrations of four dense features based methods in a 3-way 1-shot scenario, where each colored pane indicates an image and particles within each pane indicate three feature vectors of that image. The arrow lines indicate direct connections between two feature vectors at inference. For example, (a) DN4 [16] accumulates cosine similarities between the nearest features; (b) DeepEMD [35] finds the optimal matching flow with the minimum cost; (c) FRN [33] reconstructs query features with a linear combination of support features. (d) Our MCL lets query and support features bidirectionally random walk to the opposite dense feature set and measures the expected number of visits in a Markov process $\{X_{t}\}$. Different from them, MCL considers mutual affiliations between the dense features instead of following the unidirectional paradigm, whose evident character is an accumulation over all query features followed by a softmax probability.

1 Introduction

Few-shot classification aims to learn a classifier that can be readily adapted to novel classes given just a small number of labeled instances. To address this problem, a line of previous literature adopts metric-based methods [30, 26, 28] that learn a global image representation in an appropriate feature space and use a distance metric to predict their labels.

Recent approaches [35, 16, 18] have demonstrated that significant intra-class variations inevitably drive image-level embeddings from the same category far apart in a given metric space under low-data regimes. In contrast, densely representative local features provide transferable information across categories and have shown promising performance in the few-shot scenario. Among the methods illustrated in Figure 1, DN4 [16] finds the nearest-neighbor support feature for each query feature and accumulates all the local matches in a Naive-Bayes way to represent an image-to-class similarity; DeepEMD [35] uses the earth mover's distance to compare complex structured representations composed of local features; FRN [33] reconstructs each query dense feature with a linear combination of support dense features in a latent space and uses the reconstruction distance to measure the image-to-class relevance. They all follow a unidirectional query-to-support paradigm, whose evident character is an accumulation over all query features followed by a softmax probability.

In this paper, we incorporate an extra support-to-query connection as a complement to thoroughly affiliate the two disjoint sets of dense features. The potential of this bidirectional paradigm stems from the intuition that, besides using query features to find related support features, it is also plausible to estimate the task-relevance of query features according to the support features. Specifically, we associate dense features with particles that can bidirectionally random walk to the opposite dense feature set in a discrete feature space. The prediction probability for each class is then estimated by the dense features' accessibility, i.e., the expected number of visits to the dense features of that class in a time-homogeneous Markov process.

The contributions are as follows: (1) We propose to learn mutual affiliations between the query and support dense features instead of following the unidirectional query-to-support paradigm in FSL. (2) We introduce the dense features' accessibility to FSL and demonstrate that traditional transductive methods can be easily adapted to the inductive setting if we treat the dense features from a single query image as a set of unlabeled data. (3) We propose a novel bidirectional random walk based method for FSL and draw its connection to the single-mode eigenvector centrality of an affiliation network. (4) The underlying centrality investigated in this work can be plugged into existing global feature based methods like ProtoNet and RelationNet by highlighting task-centralized local features instead of using global average pooling.

2 Related Work

Dense Feature based FSL. These methods focus on learning image-to-image similarities by encoding each input into a set of dense feature vectors (or a feature map). Among them, DC [18] predicts a label for each local feature and averages their probabilities. DeepEMD [35] adopts the earth mover's distance to compute an image-to-class distance. DN4 [16] uses the top-k nearest neighbors in a Naive-Bayes way to represent image-level similarities. ADM [15] and SAML [12] use extra network layers for dense classification. FRN [33] reconstructs the dense query features from the support features of a given class to predict their relevance. Lifchitz et al. [17] propagate labels from the support sets to the query features. MCL also belongs to the family of dense feature based frameworks. A major difference is that we consider the mutual affiliations between the query and support features instead of the unidirectional ones in previous work.

Graphical models in FSL. Another branch of methods learns graphs for FSL. Among them, [11] builds a graph neural network (GNN) from a collection of images, and [9] uses degree centrality as part of the graph features to adjust the weights of graph vertices in transductive FSL. Our proposed feature accessibility of bidirectional random walks is closely related to the eigenvector centrality. Differently, we use the single-mode centrality of bipartite data as a novel criterion for inductive classification.

From Transductive to Inductive. TPN [20] first introduces the well-known label propagation [36] to transductive FSL, which transfers information from labeled data to unlabeled data and shows substantial improvements over inductive methods. Lifchitz et al. [17] first demonstrate that label propagation is feasible for inductive FSL by treating the query dense features as numerous unlabeled data. Different from unidirectional label propagation, our MCL further treats the sets of query and support features as bipartite data where self-reinforcement (i.e., query-to-query and support-to-support random walks) is avoided.

3 Method

3.1 Formulation of Dense Features based FSL

In an $N$-way $K$-shot episode, the goal is to predict a class label for a single query image given $N$ support classes, each of which contains $K$ support images for reference.

We first encode each input image into $r$ $d$-dimensional dense features $f_{\theta}(\cdot)\in\mathbb{R}^{d\times r}$. We use $\mathbf{q}=\{q_{1},\dots,q_{r}\}$ to denote the dense features from a query image. Inspired by prototype learning [26], for each category $c$ we average the feature vectors at the same spatial locations of its $K$ support images into the support dense features $\mathbf{s}^{c}=\{s_{1}^{c},\dots,s_{r}^{c}\}$.

We use bold font (e.g., $\mathbf{q}$ and $\mathbf{s}$) to denote a set of feature vectors and normal font (e.g., $q$ and $s$) to denote a single feature vector. We use $\mathbf{S}=\bigcup_{c\in C}\mathbf{s}^{c}$ to represent the union set of dense features from all support classes. The cardinalities of the sets $\mathbf{q}$, $\mathbf{s}^{c}$ and $\mathbf{S}$ are $r$, $r$ and $Nr$, respectively.

The target of dense features based FSL is to find the label of a query image, given its dense query features $\mathbf{q}$ and the union set of dense support features $\mathbf{S}$ from all categories.
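To make the notation concrete, the following is a minimal PyTorch sketch of this setup (shapes only); the backbone output and the episode sizes are illustrative stand-ins rather than the actual implementation.

```python
# Minimal sketch of the dense-feature setup in Sec. 3.1 (illustrative shapes only).
import torch

N, K, r, d = 5, 5, 25, 640                   # N-way K-shot, r local features of dim d

query_feat = torch.randn(d, r)               # f_theta(query) in R^{d x r}, q = {q_1,...,q_r}
support_feat = torch.randn(N, K, d, r)       # K shots per class

# prototype-style averaging over the K shots at each spatial location gives s^c
support_proto = support_feat.mean(dim=1)     # (N, d, r)

# union set S of all support dense features, |S| = N*r, |q| = r
S = support_proto.permute(0, 2, 1).reshape(N * r, d)
q = query_feat.t()                           # (r, d)
```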

3.2 Bidirectional Random Walks on Dense Features

We start by using the cosine similarity $\phi(v_{1},v_{2})=\langle\frac{v_{1}}{\|v_{1}\|},\frac{v_{2}}{\|v_{2}\|}\rangle$ to measure the closeness between two feature vectors, where $\langle\cdot,\cdot\rangle$ is the Frobenius inner product.

Given the two dense feature sets $\mathbf{q}$ and $\mathbf{S}$, we compute the inter-similarity matrix $\mathbf{\Phi}\in\mathbb{R}^{r\times Nr}$ between these two disjoint sets, where the entry $[\mathbf{\Phi}]_{qs}=\phi(q,s)$ is the similarity between vectors $q\in\mathbf{q}$ and $s\in\mathbf{S}$.

We build query-to-support affiliations from every query feature $q$ to all support features $s\in\mathbf{S}$ via the random walk probability $p_{sq}=\frac{e^{\gamma\phi(s,q)}}{\sum_{s^{\prime}\in\mathbf{S}}e^{\gamma\phi(s^{\prime},q)}}$. The probability matrix from all query features to the support features can be written as

$$\mathbf{P}_{\mathbf{Sq}}=\exp(\gamma\mathbf{\Phi}^{\mathrm{T}})\,\mathbf{D}^{-1}$$ (1)

where $\exp(\cdot)$ is the element-wise exponential function, $\gamma$ is a scaling temperature for a softmax-like probability, and $\mathbf{D}$ is a diagonal normalization matrix whose $(j,j)$-th value is the sum of the $j$-th column of $\exp(\gamma\mathbf{\Phi}^{\mathrm{T}})$. The subscript serves both as the matrix size and as an indication of the feature vectors it is associated with. For example, $\mathbf{P}_{\mathbf{Sq}}$ is of size $|\mathbf{S}|\times|\mathbf{q}|$ and contains the random walk probabilities from query features $\mathbf{q}$ to support features $\mathbf{S}$.

Likewise, we connect every support feature $s$ to all $q\in\mathbf{q}$ by a similar probability matrix:

$$\mathbf{P}_{\mathbf{qS}}=\exp(\beta\mathbf{\Phi})\,\mathbf{W}^{-1}$$ (2)

where $\mathbf{W}$ is the analogous diagonal matrix whose $(j,j)$-th value is the sum of the $j$-th column of $\exp(\beta\mathbf{\Phi})$.

It should be noted that we use two different scaling parameters since $\mathbf{\Phi}$ is a non-square matrix due to the different cardinalities of the dense feature sets $\mathbf{q}$ and $\mathbf{S}$. Although $\exp(\gamma\mathbf{\Phi}^{\mathrm{T}})$ and $\exp(\beta\mathbf{\Phi})$ would simply be each other's transpose if we ignored their different scaling parameters, the probability matrices $\mathbf{P}_{\mathbf{qS}}$ and $\mathbf{P}_{\mathbf{Sq}}$ are column-normalized by different matrices $\mathbf{W}$ and $\mathbf{D}$ respectively and are thus directional.

To learn the mutual affiliations between the two dense feature sets, we first associate each feature vector with a particle $z$, which forms a discrete feature space $\mathbf{z}=\mathbf{q}\cup\mathbf{S}$. Next, we let the particles bidirectionally random walk in that space according to $\mathbf{P}_{\mathbf{Sq}}$ and $\mathbf{P}_{\mathbf{qS}}$ (i.e., particles at the query feature set are only allowed to move to the support feature set, and vice versa). We assume the resulting Markov process $\{X_{t}\}$ is time-homogeneous, and the random walk probability from $z_{j}$ to $z_{i}$ for all $z_{i}$, $z_{j}$ in $\mathbf{z}$ can be formulated as:

$$\mathbf{Pr}(X_{t+1}=z_{i}\mid X_{t}=z_{j})=\begin{cases}\dfrac{e^{\gamma\phi(z_{i},z_{j})}}{\sum_{s\in\mathbf{S}}e^{\gamma\phi(s,z_{j})}}&z_{i}\in\mathbf{S},\ z_{j}\in\mathbf{q}\\[2mm]\dfrac{e^{\beta\phi(z_{i},z_{j})}}{\sum_{q\in\mathbf{q}}e^{\beta\phi(q,z_{j})}}&z_{i}\in\mathbf{q},\ z_{j}\in\mathbf{S}\\[2mm]0&\text{otherwise.}\end{cases}$$ (3)

Written in matrix form, this yields the Markov transition matrix $\mathbf{P}$:

$$\mathbf{P}=\begin{pmatrix}\mathbf{0}&\mathbf{P}_{\mathbf{Sq}}\\ \mathbf{P}_{\mathbf{qS}}&\mathbf{0}\end{pmatrix}$$ (4)

which consists of the column-normalized $\mathbf{P}_{\mathbf{Sq}}$ and $\mathbf{P}_{\mathbf{qS}}$.
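As a sketch of Eqns.(1)-(4), the two column-normalized probability matrices and the anti-diagonal transition matrix $\mathbf{P}$ could be assembled as below; the function name and the default $\gamma$, $\beta$ values are illustrative choices, not the released code.

```python
# Sketch of Eqns (1)-(4): cosine similarity matrix Phi, the column-normalized
# random-walk matrices P_Sq / P_qS, and the anti-diagonal transition matrix P.
import torch
import torch.nn.functional as F

def transition_matrix(q, S, gamma=20.0, beta=10.0):
    """q: (r, d) query dense features; S: (Nr, d) union of support dense features."""
    phi = F.normalize(q, dim=1) @ F.normalize(S, dim=1).t()   # (r, Nr) cosine similarities
    P_Sq = torch.softmax(gamma * phi.t(), dim=0)   # (Nr, r): q -> S, columns sum to 1 over S
    P_qS = torch.softmax(beta * phi, dim=0)        # (r, Nr): S -> q, columns sum to 1 over q
    Nr, r = P_Sq.shape
    P = torch.zeros(Nr + r, Nr + r)
    P[:Nr, Nr:] = P_Sq                             # top-right block of Eqn (4)
    P[Nr:, :Nr] = P_qS                             # bottom-left block of Eqn (4)
    return P_Sq, P_qS, P
```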

It can be proved (in the supplementary) that, after infinitely many steps of bidirectional random walks, the Markov chain with anti-diagonal $\mathbf{P}$ reaches the periodic stationary distribution:

$$\lim_{t\rightarrow\infty}\mathbf{P}^{2t}=\begin{pmatrix}\bm{\pi}(\mathbf{S})\bm{e}_{Nr}^{\mathrm{T}}&\mathbf{0}\\ \mathbf{0}&\bm{\pi}(\mathbf{q})\bm{e}_{r}^{\mathrm{T}}\end{pmatrix},\qquad\lim_{t\rightarrow\infty}\mathbf{P}^{2t-1}=\begin{pmatrix}\mathbf{0}&\bm{\pi}(\mathbf{S})\bm{e}_{r}^{\mathrm{T}}\\ \bm{\pi}(\mathbf{q})\bm{e}_{Nr}^{\mathrm{T}}&\mathbf{0}\end{pmatrix}$$ (5)

where $\bm{\pi}(\mathbf{S})\in\mathbb{R}^{Nr}$ and $\bm{\pi}(\mathbf{q})\in\mathbb{R}^{r}$ are the stationary distributions satisfying $\bm{\pi}(\mathbf{S})=\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})$ and $\bm{\pi}(\mathbf{q})=\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\bm{\pi}(\mathbf{q})$ respectively. $\bm{e}_{|\cdot|}$ is a vector of ones whose length is indicated by its subscript.

The motivation for using these stationary distributions to encode mutual affiliations is straightforward: support features will be frequently visited over long runs of the bidirectional random walk if they share local characteristics with the query features. In other words, the support class with the most mutual affiliations to the query image is predicted due to their mutual closeness.

3.3 Classification by Dense Features’ Accessibility

To formulate this in FSL, we first assume the particles are uniformly distributed over the discrete finite space $\mathbf{z}$. Then, we use the expected number of visits from all particles $z\in\mathbf{z}$ to the support features $\mathbf{s}^{c}\subset\mathbf{S}$ of class $c$ over long runs of the bidirectional random walk $\{X_{t}\}$ to measure the amount of local characteristics class $c$ shares with the query image:

$$\begin{aligned}\mathbf{Pr}(\tilde{y}=c)&\propto\lim_{t\rightarrow\infty}\sum_{z\in\mathbf{z}}\mathbb{E}\left[\sum_{k=1}^{t}\mathds{1}[X_{k}\in\mathbf{s}^{c}]\,\bigg|\,X_{0}=z\right]\\&=\lim_{t\rightarrow\infty}\frac{1}{t}\sum_{k=1}^{t}\left[\frac{1}{Nr+r}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{s^{\prime}\in\mathbf{S}}\left[\mathbf{P}^{2k}\right]_{ss^{\prime}}+\sum_{q\in\mathbf{q}}\left[\mathbf{P}^{2k-1}\right]_{sq}\right)\right]\\&=\frac{1}{Nr+r}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{s^{\prime}\in\mathbf{S}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t}\right]_{ss^{\prime}}+\sum_{q\in\mathbf{q}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t-1}\right]_{sq}\right)\\&=\sum_{s\in\mathbf{s}^{c}}\left[\bm{\pi}(\mathbf{S})\right]_{s}\end{aligned}$$ (6)

where $\mathds{1}[\cdot]$ is an indicator function that equals 1 if its argument is true and 0 otherwise, $[\cdot]_{ij}$ indicates the entry of the matrix from particle $j$ to particle $i$, and $[\cdot]_{i}$ indicates the entry of the vector associated with feature $i$.

We give the derivation for even Markov chain lengths $t$ in the first equality of Eqn.(6) and prove that odd $t$ reaches the same result in the supplementary material. The second equality follows from the absorbing property of the periodic Markov chain, where the powers of the matrix saturate to Eqn.(5) as $t$ increases. Since $\bm{\pi}(\mathbf{S})$ is the stationary distribution of the column-stochastic $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$ under the probability constraint $\bm{e}^{\mathrm{T}}_{Nr}\bm{\pi}(\mathbf{S})=1$, $\mathbf{Pr}(\tilde{y})$ is a valid distribution for classification.
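Conceptually, Eqn.(6) can be checked with a plain power iteration on the column-stochastic matrix $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$; the sketch below is only for illustration (the end-to-end solver is the Katz approximation of Sec. 3.5), and the iteration count is an arbitrary choice.

```python
# Sketch of Eqn (6): estimate pi(S) by power iteration and sum the accessibility
# over the r dense features of each class (assumes S is ordered class by class).
import torch

def accessibility_probs(P_Sq, P_qS, N, r, n_iter=100):
    M = P_Sq @ P_qS                           # (Nr, Nr), column-stochastic
    pi = torch.full((N * r,), 1.0 / (N * r))  # start from a uniform distribution over S
    for _ in range(n_iter):
        pi = M @ pi                           # pi stays a probability vector
    return pi.view(N, r).sum(dim=1)           # Pr(y=c) = sum_{s in s^c} [pi(S)]_s
```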

3.4 Reinterpretation as a Graph Centrality

If we interpret the random walk probability matrix $\mathbf{P}$ as the adjacency matrix $\{a_{v_{1},v_{2}}\}$ of a directed bipartite graph $G:=(V=\{\mathbf{q},\mathbf{S}\},E)$, we find that the dense features' accessibility in the time-homogeneous bidirectional random walk is equivalent to learning a single-mode eigenvector centrality on the graph $G$. The bipartite graph is also called an affiliation network in social network analysis [3, 4, 5]; it models two types of entities, "actors" and "society", related by the affiliation of the former in the latter. The concept of centrality in social network analysis is generally used to investigate the acquaintanceships among people that often stem from one or more shared affiliations.

To see this in graph theory, we start with a brief overview of the eigenvector centrality, which assigns a score $x$ to each vertex $v$, for both $q\in\mathbf{q}$ and $s\in\mathbf{S}$, in the affiliation network with adjacency $\{a_{v_{1},v_{2}}\}$:

$$\begin{aligned}x_{q}&=\frac{1}{\lambda}\sum_{v\in\mathbb{V}(q)}a_{q,v}x_{v}=\frac{1}{\lambda}\sum_{v\in\mathbf{S}}a_{q,v}x_{v}\\ x_{s}&=\frac{1}{\lambda}\sum_{v\in\mathbb{V}(s)}a_{s,v}x_{v}=\frac{1}{\lambda}\sum_{v\in\mathbf{q}}a_{s,v}x_{v}\end{aligned}$$ (7)

where $\mathbb{V}(\cdot)$ is the set of neighbors of the given vertex and $\lambda$ is a constant.

With a small rearrangement, Eqn.(7) can be rewritten in vector notation as an eigenvector equation $\mathbf{Px}=\lambda\mathbf{x}$. The additional requirement that all entries of the eigenvector be non-negative implies (by the Perron–Frobenius theorem [21]) that only the greatest eigenvalue results in the desired centrality measure. For the column-stochastic adjacency matrix $\mathbf{P}$ in our method, the largest eigenvalue $\lambda$ is 1.

Single-mode centrality [5] is a special form of graph centrality that measures the extent to which nodes in one vertex set are central relative only to the other nodes in the same vertex set of a bipartite graph. For example, the single-mode centrality over the different $s$ in $\mathbf{S}$ is defined by $\hat{\mathbf{x}}_{\mathbf{S}}=\mathbf{x}_{\mathbf{S}}/\sum_{s\in\mathbf{S}}x_{s}$. Lemma 1 (proved in the supplementary material) shows that the dense features' accessibility $\bm{\pi}(\mathbf{S})$ of bidirectional random walks in Eqn.(6) is equivalent to the single-mode eigenvector centrality of the support set $\mathbf{S}$ on the bipartite data.

Lemma 1.

Assume $G$ is the affiliation network of the bipartite data $\mathbf{q}$, $\mathbf{S}$ with the adjacency matrix given by the anti-diagonal Markov transition matrix $\mathbf{P}$ in Eqn.(4). The single-mode eigenvector centrality $\hat{\mathbf{x}}_{\mathbf{S}}=\mathbf{x}_{\mathbf{S}}/\sum_{s\in\mathbf{S}}x_{s}$ of the vertex set $\mathbf{S}$ is equivalent to the dense features' accessibility $\bm{\pi}(\mathbf{S})$ on $\mathbf{S}$ in a time-homogeneous Markov process.

Based on this reinterpretation, it is straightforward to introduce an attenuation (damping) factor $\alpha$ into the Markov bidirectional random walks, motivated by the Katz centrality [13], where connections made with distant neighbors are penalized by $\alpha$. The dense features' accessibility with attenuation factor $\alpha$ for few-shot classification is defined by

$$\begin{aligned}\mathbf{Pr}_{\mathrm{Katz}}(\tilde{y}=c)&\propto\sum_{z\in\mathbf{z}}\mathbb{E}\left[\sum_{t=1}^{\infty}\alpha^{t}\mathds{1}[X_{t}\in\mathbf{s}^{c}]\,\bigg|\,X_{0}=z\right]\\&=\frac{1}{\epsilon}\sum_{t=1}^{\infty}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{s^{\prime}\in\mathbf{S}}\left[\alpha^{2t}\mathbf{P}^{2t}\right]_{ss^{\prime}}+\sum_{q\in\mathbf{q}}\left[\alpha^{2t-1}\mathbf{P}^{2t-1}\right]_{sq}\right)\end{aligned}$$ (8)

where $\epsilon=(Nr+r)\sum_{t=1}^{\infty}\alpha^{t}=(Nr+r)\,\alpha/(1-\alpha)$ is a constant that yields a valid distribution.

Although we only use the single-mode centrality on $\mathbf{S}$ for the end-to-end classification, it is also beneficial to learn its conjugate centrality $\bm{\pi}(\mathbf{q})$ on $\mathbf{q}$ in the affiliation network (by the analogous $\bm{\pi}(\mathbf{q})=\mathbf{x}_{\mathbf{q}}/\sum_{q\in\mathbf{q}}x_{q}$ in Lemma 1). We will show that the single-mode centralities on both dense feature sets can serve as a plug-and-play for finding centralized local characteristics in existing methods (hence the name Mutual Centralized Learning, MCL).

3.5 End-to-End Training by Katz Approximation

The algorithm of Mutual Centralized Learning (MCL) in Eqn.(6) involves computing the Markov stationary distribution $\bm{\pi}(\mathbf{S})$ satisfying $\bm{\pi}(\mathbf{S})=\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})$. Theoretically, $\bm{\pi}(\mathbf{S})$ is the eigenvector of $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$ with eigenvalue 1 under the probability constraint $\bm{e}_{Nr}^{\mathrm{T}}\bm{\pi}(\mathbf{S})=1$.

The above constraints lead to a solution for $\bm{\pi}(\mathbf{S})$ by solving the overdetermined linear system

$$\begin{pmatrix}\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}-\bm{I}\\ \bm{e}^{\mathrm{T}}_{Nr}\end{pmatrix}\bm{\pi}(\mathbf{S})=\begin{pmatrix}\mathbf{0}\\ 1\end{pmatrix}$$ (9)

where $\bm{I}$ is an identity matrix and $\mathbf{0}$ is a vector of zeros. Although this system can be solved by various methods such as the pseudo-inverse or QR decomposition with back substitution, we empirically found them time-consuming, as these operators are either numerically unstable or poorly parallelized in modern deep-learning packages.

To handle this, we present an alternative solution based on Lemma 1. We first calculate the Katz centrality by its closed-form solution [13]:

$$\mathbf{x}^{\mathrm{Katz}}=\left((\bm{I}-\alpha\mathbf{P})^{-1}-\bm{I}\right)\bm{e}_{Nr+r}$$ (10)

and then obtain the single-mode Katz centrality in Eqn.(8) by the definition of single-mode centrality:

$$\mathbf{Pr}_{\mathrm{Katz}}(\tilde{y}=c)=\frac{\sum_{s\in\mathbf{s}^{c}}\mathbf{x}^{\mathrm{Katz}}_{s}}{\sum_{s^{\prime}\in\mathbf{S}}\mathbf{x}^{\mathrm{Katz}}_{s^{\prime}}}$$ (11)

Since the Katz centrality degenerates to the eigenvector centrality as $\alpha$ approaches 1 [13] (i.e., no attenuation), we can approximate the eigenvector centrality with a large $\alpha=0.999$:

$$\mathbf{x}^{\mathrm{Eigen}}=\lim_{\alpha\rightarrow 1}\left((\bm{I}-\alpha\mathbf{P})^{-1}-\bm{I}\right)\bm{e}_{Nr+r}\approx\left((\bm{I}-0.999\,\mathbf{P})^{-1}-\bm{I}\right)\bm{e}_{Nr+r}$$ (12)

and the analogous single-mode eigenvector centrality in Eqn.(6) can then be approximated by:

$$\mathbf{Pr}_{\mathrm{MCL}}(\tilde{y}=c)=\frac{\sum_{s\in\mathbf{s}^{c}}\mathbf{x}^{\mathrm{Eigen}}_{s}}{\sum_{s^{\prime}\in\mathbf{S}}\mathbf{x}^{\mathrm{Eigen}}_{s^{\prime}}}$$ (13)

We use the negative log-likelihood loss to update the parameters of the different feature extractors; the overall MCL pipeline for few-shot classification is shown in Figure 2. We provide the detailed algorithm/pseudo-code in the supplementary material for reference.
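For reference, a condensed sketch of this inference step (Eqns.(10)-(13)) is given below, assuming $\mathbf{P}$ has been assembled as in Eqn.(4); using torch.linalg.solve instead of an explicit inverse is an implementation choice of this sketch, not necessarily the released code.

```python
# Sketch of the Katz approximation (Eqns (10)-(13)): closed-form centrality
# x = ((I - alpha*P)^{-1} - I) e, then class log-probabilities from the S part.
import torch

def mcl_log_probs(P, N, r, alpha=0.999):
    n = P.shape[0]                                    # n = N*r + r
    e = torch.ones(n, 1)
    x = torch.linalg.solve(torch.eye(n) - alpha * P, e) - e   # Katz/eigenvector centrality
    x_S = x[:N * r, 0]                                # centrality of the support features
    probs = x_S.view(N, r).sum(dim=1) / x_S.sum()     # single-mode centrality, Eqn (13)
    return probs.clamp_min(1e-8).log()                # feed to a negative log-likelihood loss
```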

Figure 2: Our framework for 3-way 1-shot classification. Given the query and support images, we first extract their dense features $\mathbf{q}$ and $\mathbf{S}$, which are two sets of feature vectors. Then we calculate their similarity matrix $\mathbf{\Phi}$. After that, we build the probability matrix $\mathbf{P}$ based on the scaled and column-normalized similarity matrix. We obtain the stationary distributions $\bm{\pi}(\mathbf{S})$ and $\bm{\pi}(\mathbf{q})$ by the Katz approximation in Eqn.(12) and Lemma 1. Finally, we use them both for end-to-end FSL and as a centrality plugin in global feature based FSL. We explore three dense feature extractors (i.e., VanillaFCN, PyramidFCN, PyramidGrid) inspired by [35] to extract dense feature representations.

3.6 Plugin for Global Feature based FSL.

Since the underlying centrality learned by MCL reveals the varying importance of local features on an affiliation network, it can be plugged into global feature based methods by replacing their native global average pooling (GAP) with centrality weighted pooling as follows:

a. Extract dense features for each input image.

b. Compute the eigenvector centrality $\mathbf{x}^{\mathrm{Eigen}}$ by Eqn.(12).

c. Compute $\bm{\pi}(\mathbf{S})$ and $\bm{\pi}(\mathbf{q})$ according to the definition of single-mode centrality in Lemma 1.

d. Use $\bm{\pi}(\mathbf{q})$ as weights on the query dense features and accumulate them into a single weighted feature vector.

e. Use the class-wise normalized centrality from $\bm{\pi}(\mathbf{S})$ to accumulate the dense features of each support class.

Once we get the centrality weighted features for the query image and support classes, it is straightforward to perform traditional global feature based methods as before; a sketch of this pooling follows.
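The snippet below is one possible sketch of this centrality weighted pooling, assuming the centrality vector $\mathbf{x}$ of the concatenated $[\mathbf{S};\mathbf{q}]$ nodes has already been computed (e.g., by the Katz approximation); the tensor layouts are assumptions of this sketch.

```python
# Sketch of steps a-e: replace GAP with centrality-weighted pooling of the
# query and per-class support dense features.
import torch

def centrality_pooling(q, S, x, N, r):
    """q: (r, d), S: (Nr, d), x: (Nr + r,) centrality of the [S; q] nodes."""
    pi_S = x[:N * r] / x[:N * r].sum()       # single-mode centrality on S
    pi_q = x[N * r:] / x[N * r:].sum()       # conjugate centrality on q
    query_vec = pi_q @ q                     # (d,) centrality-weighted query feature
    w = pi_S.view(N, r)
    w = w / w.sum(dim=1, keepdim=True)       # class-wise normalized weights
    class_vecs = torch.einsum('nr,nrd->nd', w, S.view(N, r, -1))   # (N, d)
    return query_vec, class_vecs
```

The pooled query vector and class vectors can then be fed to, e.g., ProtoNet's distance-based classifier or RelationNet's relation module exactly like their global-average-pooled counterparts.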

4 Experiments

We conduct experiments on two widely-used FSL datasets and three fine-grained datasets: (1) miniImageNet [30] contains 600 images per class over 100 classes. We follow the split used by [23] that takes 64, 16 and 20 classes for train/val/test respectively; (2) tieredImageNet [22] is much larger than miniImageNet, with 608 classes; 351, 97 and 160 classes are used for train/val/test respectively. (3) CUB [31] consists of 11,788 images from 200 bird classes; 100/50/50 classes are used for train/val/test and each image is first cropped to a human-annotated bounding box. (4) meta-iNat [32] is a benchmark of animal species in the wild. We follow the same class split proposed by [32] that uses 908/227 classes for training/evaluation respectively. (5) tiered-meta-iNat [32] is a more difficult version of meta-iNat where a large domain gap is introduced: the 354 test classes are populated by insects and arachnids, while the remaining 781 classes form the training set.

Backbone networks. We conduct experiments with both the widely-used four-layer convolutional Conv-4 [30] and the deeper ResNet-12 [27] backbones. As is common in the state-of-the-art FSL literature, we adopt a pre-training stage for ResNet-12 before episodic meta-training, while the simple Conv-4 is meta-trained directly from scratch.

Dense feature extractor $f_{\theta}(\cdot)$. We explore three dense feature extractors in our experiments, as shown in Figure 2: (1) VanillaFCN simply treats the feature map output of a fully convolutional network as the dense features. (2) PyramidFCN uses an extra adaptive average pooling layer to obtain 34 dense features for each image. (3) PyramidGrid crops the image evenly into a grid of size $2\times 2+3\times 3$ and encodes each grid cell into a feature vector individually. The feature vectors from all the cells constitute the set of dense features.
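The snippet below sketches how such extractors could be realized on top of a backbone; the pooling sizes for PyramidFCN and the per-cell encoder for PyramidGrid are assumptions based on the textual description, not the exact released configuration.

```python
# Sketch of the three dense feature extractors (sizes/encoders are assumptions).
import torch
import torch.nn.functional as F

def vanilla_fcn(feat_map):
    """feat_map: (d, H, W) backbone output -> (H*W, d) dense features."""
    return feat_map.flatten(1).t()

def pyramid_fcn(feat_map, pool_sizes=(3,)):
    """Vanilla dense features plus adaptively average-pooled copies of the map."""
    feats = [vanilla_fcn(feat_map)]
    for s in pool_sizes:
        feats.append(vanilla_fcn(F.adaptive_avg_pool2d(feat_map, s)))
    return torch.cat(feats, dim=0)

def pyramid_grid(image, encoder, grids=(2, 3)):
    """Crop the image into a 2x2 + 3x3 grid; `encoder` maps one crop to one d-dim vector."""
    _, H, W = image.shape
    cells = []
    for g in grids:
        for i in range(g):
            for j in range(g):
                crop = image[:, i * H // g:(i + 1) * H // g,
                                j * W // g:(j + 1) * W // g]
                cells.append(encoder(crop.unsqueeze(0)).flatten())
    return torch.stack(cells, dim=0)        # (2*2 + 3*3, d) dense features
```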

Method | Conv-4: miniImageNet (1-shot, 5-shot), tieredImageNet (1-shot, 5-shot) | ResNet-12: miniImageNet (1-shot, 5-shot), tieredImageNet (1-shot, 5-shot)
Global features DSN [25] 51.78 68.99 53.22 71.06 62.64 78.83 67.39 82.85
MetaOptNet [14] 52.87 68.76 54.71 71.76 62.64 78.63 65.99 81.56
Negative Margin [19] 52.84 70.41 - - 63.85 81.57 - -
FEAT [34] 55.15 71.61 - - 66.78 82.05 70.80 84.79
Meta-Baseline [7] - - - - 63.17 79.26 68.62 83.29
RelationNet [28] 52.12 66.90 54.33 69.95 60.97 75.12 64.71 78.41
RelationNet+MCL 54.50 70.63 57.73 74.46 61.70 75.53 65.93 80.27
ProtoNet [26] 52.32 69.74 53.19 72.28 62.67 77.88 68.48 83.46
ProtoNet+MCL 54.31 69.84 56.67 74.36 64.40 78.60 70.62 83.84
Dense features (VanillaFCN) DN4 [16] 54.66 71.26 56.86 72.16 65.35 81.10 69.60 83.41
DeepEMD [35] 52.15 65.52 50.89 66.12 65.91 82.41 71.16 83.95
FRN [33] 54.87 71.56 55.54 74.68 66.45 82.83 71.16 86.01
Label Propagation 52.24 67.68 54.56 70.08 65.00 80.07 71.12 83.89
MCL (ours) 55.38 70.02 57.63 74.25 67.36 83.63 71.76 86.01
MCL-Katz (ours) 55.55 71.74 57.78 74.77 67.51 83.99 72.01 86.02
Dense features (PyramidFCN) DN4 [16] 54.54 70.94 57.05 72.90 63.54 79.04 71.10 84.22
DeepEMD [35] 50.67 64.94 51.26 65.64 66.27 82.41 70.76 84.20
FRN [33] 54.40 70.75 57.30 75.58 65.94 81.97 70.56 85.44
Label Propagation 53.38 68.69 55.21 71.22 65.71 75.78 71.00 78.01
MCL (ours) 55.13 70.77 57.93 74.36 67.45 84.36 72.01 86.31
MCL-Katz (ours) 55.77 71.24 58.20 74.73 67.85 84.47 72.13 86.32
Dense features (PyramidGrid) DN4 [16] 57.17 70.91 56.71 70.92 67.86 80.08 71.29 82.60
DeepEMD [35] 55.68 70.75 55.88 70.06 67.83 81.32 73.13 84.18
FRN [33] 55.80 71.52 55.68 72.87 67.00 82.20 71.42 85.58
Label Propagation 55.12 68.43 56.05 72.39 67.18 81.07 73.18 85.19
MCL (ours) 57.50 73.03 57.57 73.81 69.31 85.11 73.62 86.29
MCL-Katz (ours) 57.88 74.03 57.63 73.96 69.25 84.71 73.38 86.21
Table 1: Few-shot classification accuracy (%) on miniImageNet and tieredImageNet. The 95% confidence intervals are all below 0.2 over 10,000 evaluation episodes. Results in italic font indicate the performance of MCL as a plugin. Results in bold font are the best results for each dense feature extractor. We re-implement our plugin baseline methods (i.e., ProtoNet and RelationNet) as well as the competitive dense features based methods (i.e., DN4, DeepEMD and FRN) in our unified framework where their performances are not available under specific settings. †: re-implemented results on the Conv-4 backbone; ‡: re-implemented results on both backbones.

4.1 General Few-shot Classification Results

Table 1 details the comparisons of MCL with global feature based methods [25, 14, 34, 7, 19] as well as dense feature based methods [16, 35, 33] on mini-/tieredImageNet.

Besides these methods, we also exploit label propagation [36] in dense feature based inductive FSL like in [17] to demonstrate that traditional transductive methods could be easily adapted to inductive settings if we treat dense feature vectors as a set of unlabeled data. As shown, our MCL is highly competitive with recent state-of-the-art results. In particular, our MCL-Katz achieves 67.51% (1-shot) and 83.99% (5-shot) on miniImageNet with the simplest VanillaFCN.

Comparisons with dense feature based methods. The last three panes of Table 1 show that the proposed bidirectional MCL outperforms DN4's query-to-support nearest neighbors, DeepEMD's optimal matching and FRN's latent feature reconstruction on various tasks. Although the optimal matching flow is unidirectional in DeepEMD, it adopts a cross-reference attention that encodes bidirectional relevance to a certain extent. If we treat dense features as nodes in a graph and use their similarities as edges, a major difference is that DeepEMD treats the nodes equally and derives different weights on the edges, whereas we use the fixed edges to derive the centrality of the nodes directly for classification.

Comparisons with global feature based methods. The centrality investigated in this work can also be plugged into global feature based methods, since local features are not equally important before global average pooling. As shown in the first pane of Table 1, our proposed centrality weighted pooling provides up to 3.4% and 4.5% performance gains for ProtoNet and RelationNet, respectively. Table 4 shows that the improvements are also consistent when it is applied individually to either the query or the support dense features.

Method | CUB: Conv-4 (1-shot, 5-shot), ResNet-12 (1-shot, 5-shot) | meta-iNat: Conv-4 (1-shot, 5-shot), ResNet-12 (1-shot, 5-shot) | tiered-meta-iNat: Conv-4 (1-shot, 5-shot), ResNet-12 (1-shot, 5-shot)
ProtoNet [26] 71.64 81.50 82.98 91.38 53.78 73.80 76.47 90.20 35.47 54.85 47.61 71.06
ProtoNet + MCL 76.87 86.95 84.59 91.71 57.52 74.33 78.75 90.53 39.09 59.16 51.22 72.50
DSN [25] 66.01 85.41 80.80 91.19 58.08 77.38 78.80 89.77 36.82 60.11 46.61 72.79
CTX [10] 69.64 87.31 78.47 90.90 60.03 78.80 69.04 88.39 36.83 60.84 42.91 69.88
DN4 [16] 78.31 88.43 85.44 92.51 62.32 79.76 79.58 91.37 43.82 64.17 48.99 72.29
DeepEMD [35] 75.34 85.68 83.35 91.60 54.48 68.36 76.05 86.82 36.05 48.55 48.14 66.27
FRN [33] 73.48 88.43 83.16 92.59 62.42 80.45 73.52 91.83 43.91 63.36 48.86 76.62
MCL 77.80 88.71 83.64 92.18 64.66 81.31 80.17 91.59 44.08 64.61 51.35 76.87
MCL-Katz 79.61 90.56 85.63 93.18 63.92 81.09 79.34 91.84 44.00 64.24 49.68 76.05
Table 2: Few-shot classification (%) results on fine-grained datasets. The 95% confidence intervals are below 0.19 within 10,000 episodes.
$f_{\theta}(\cdot)$ | miniImageNet (1-shot, 5-shot) | tieredImageNet (1-shot, 5-shot)
unidirectional VanillaFCN 54.88 69.98 55.31 68.87
MCL 55.38 70.02 57.63 74.25
MCL-Katz 55.55 71.74 57.78 74.77
unidirectional PyramidFCN 54.92 69.17 55.69 69.26
MCL 55.13 70.77 57.93 74.36
MCL-Katz 55.77 71.24 58.20 74.73
unidirectional PyramidGrid 56.61 71.40 55.12 69.19
MCL 57.50 73.03 57.57 73.81
MCL-Katz 57.88 74.03 57.63 73.96
Table 3: Comparisons between the unidirectional random walks and bidirectional MCL on mini-/tieredImageNet with Conv-4.

4.2 Fine-grained Few-shot Classification Results

We follow the same setting (shown in supplementary) as in FRN [33] to train all the models from scratch on those fine-grained datasets. We re-implement DN4 [16], DeepEMD [35] to thoroughly compare with dense features based methods. Apart from the basic comparisons (both backbones on CUB and Conv-4 backbone on meta-iNat/tiered-meta-iNat) in [33], we conduct extra ResNet-12 experiments for all the comparing methods on meta-iNat and tiered-meta-iNat in our unified framework for a thorough comparison.

Results in Table 2 demonstrate that both our end-to-end MCL (VanillaFCN) and the centrality plugin are broadly effective in various fine-grained few-shot classification tasks.

5 Analysis

5.1 Ablation Study

Different choices of $\gamma$, $\beta$. Since $\mathbf{\Phi}$ is non-square due to the different cardinalities of the sets $\mathbf{S}$ and $\mathbf{q}$, we use two different scaling parameters in Eqn.(1) and Eqn.(2). Through the column-wise normalization, $\gamma$ and $\beta$ can be interpreted as the reciprocals of temperatures in the softmax random walk probability. Large $\gamma$, $\beta$ yield a sharp random walk probability that leads to a concentrated centrality in the affiliation network. However, extremely large $\gamma$, $\beta$ (e.g., a one-hot probability as they approach infinity) would inevitably bias the episodic training due to potential gradient explosion.

Since the cardinality of $\mathbf{S}$ is larger than that of $\mathbf{q}$, we empirically use a larger $\gamma$ than $\beta$, as we need a sharper probability for random walks from the query to the larger set of support features. In the experiments, we carefully select $\gamma$ and $\beta$ (generally $\gamma=20$ and $\beta=10$ for 1-shot models and $\gamma=40$ and $\beta=20$ for 5-shot models) from several parameter combinations by their validation performance.

+MCL (on $\mathbf{q}$ / $\mathbf{s}$) | miniImageNet (1-shot, 5-shot) | tieredImageNet (1-shot, 5-shot)
ProtoNet [26] 52.32 69.74 53.19 72.28
53.74 69.61 55.36 73.17
53.97 69.74 56.60 74.18
54.31 69.84 56.67 74.36
RelationNet [28] 52.12 66.90 54.33 69.95
53.25 66.79 54.46 70.71
54.37 70.57 57.68 74.24
54.50 70.63 57.73 74.46
Table 4: Ablation of MCL as a plug-and-play applied individually to the query features $\mathbf{q}$ and the support features $\mathbf{s}$. The experiments are conducted with Conv-4 and VanillaFCN.
[Grad-CAM visualizations omitted. Rows: ProtoNet [26] and ProtoNet+MCL; columns per pane: query, ground truth, confounding support classes. Pane (a): MCL as a plugin for feature centralization; pane (b): MCL as an end-to-end classification method.]
Table 5: Grad-CAM visualizations of query and support images in 5-way 1-shot tasks. The left pane (a) illustrates MCL as a plugin on the ProtoNet. The right pane (b) illustrates MCL as an end-to-end classification method. Images (from miniImageNet) at the second column of each pane are from the ground truth class while the four on the far right are the confounding support classes.

Influence of the Katz attenuation factor $\alpha$. When $\alpha$ is small, the contribution of paths longer than one rapidly declines and the centrality is determined by short paths (mostly in-degrees). When $\alpha$ is large, long paths are devalued smoothly and the centrality is influenced more by the endogenous topology. As $\alpha$ approaches 0, the classification is determined only by the unidirectional query-to-support random walk, where the contributions from paths longer than one vanish. The comparisons in Table 3 demonstrate that performance is improved by introducing mutual affiliations in a bidirectional manner.

In the experiments, we simply use $\alpha=0.5$ for the MCL-Katz methods; the performance can be further improved by selecting different $\alpha$, as shown in the supplementary.

5.2 Computational Speed

Intuitively, bidirectional methods would be slower than unidirectional ones because of their back-and-forth computations. However, our MCL is guaranteed to reach the periodic equilibrium state shown in Eqn.(5), so we can solve for it directly with the Katz approximation.

At first glance, the most expensive step in Eqn.(10) is inverting $\bm{I}-\alpha\mathbf{P}$, which costs $O(N^{3}r^{3})$, where $N$ is the number of support classes and $r$ is the number of dense features per image. Fortunately, $\bm{I}-\alpha\mathbf{P}$ is a very special matrix: a diagonal matrix minus an anti-diagonal block matrix. In this case, we can reformulate the inverse by block-wise matrix inversion:

$$(\bm{I}-\alpha\mathbf{P})^{-1}=\begin{pmatrix}\bm{I}&-\alpha\mathbf{P}_{\mathbf{Sq}}\\ -\alpha\mathbf{P}_{\mathbf{qS}}&\bm{I}\end{pmatrix}^{-1}=\begin{pmatrix}\bm{I}+\alpha^{2}\mathbf{P}_{\mathbf{Sq}}\mathbf{\Delta}^{-1}\mathbf{P}_{\mathbf{qS}}&\alpha\mathbf{P}_{\mathbf{Sq}}\mathbf{\Delta}^{-1}\\ \alpha\mathbf{\Delta}^{-1}\mathbf{P}_{\mathbf{qS}}&\mathbf{\Delta}^{-1}\end{pmatrix}$$ (14)

where $\mathbf{\Delta}$ is defined by $\bm{I}-\alpha^{2}\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}$. The most expensive step then becomes an inversion of $\mathbf{\Delta}\in\mathbb{R}^{r\times r}$ and the time complexity is reduced to $O(r^{3})$.
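A small numerical sketch of this block-wise inversion, with random column-stochastic stand-ins for $\mathbf{P}_{\mathbf{Sq}}$ and $\mathbf{P}_{\mathbf{qS}}$ and $\alpha=0.5$ as used for MCL-Katz, is given below; it is a consistency check rather than the timed implementation of Table 6.

```python
# Sketch of Eqn (14): invert I - alpha*P by only inverting the r x r matrix
# Delta = I - alpha^2 * P_qS @ P_Sq, and compare against the dense inverse.
import torch

N, r, alpha = 5, 25, 0.5
P_Sq = torch.softmax(torch.randn(N * r, r), dim=0)     # column-stochastic stand-ins
P_qS = torch.softmax(torch.randn(r, N * r), dim=0)

Delta_inv = torch.linalg.inv(torch.eye(r) - alpha ** 2 * P_qS @ P_Sq)
inv_blocked = torch.cat([
    torch.cat([torch.eye(N * r) + alpha ** 2 * P_Sq @ Delta_inv @ P_qS,
               alpha * P_Sq @ Delta_inv], dim=1),
    torch.cat([alpha * Delta_inv @ P_qS, Delta_inv], dim=1),
], dim=0)

P = torch.zeros(N * r + r, N * r + r)
P[:N * r, N * r:] = P_Sq
P[N * r:, :N * r] = P_qS
inv_dense = torch.linalg.inv(torch.eye(N * r + r) - alpha * P)
print(torch.allclose(inv_blocked, inv_dense, atol=1e-5))   # True up to float tolerance
```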

As shown in Table 6, we compare the inference speed of MCL and other unidirectional dense feature based methods at different feature map resolutions on 5-way 1-shot tasks with ResNet-12, where each class contains 15 query images per episode. Our bidirectional MCL does not introduce much computational overhead compared to the unidirectional methods, and the Katz approximation does accelerate computing the stationary distribution.

5.3 Qualitative Feature Centralization

To better understand our centrality based method, we visualize the Grad-CAM [24] of the last convolutional layer of ResNet-12 for both the plugin and the end-to-end methods. Table 5(a) shows that MCL helps ProtoNet concentrate on the most relevant regions of interest. Table 5(b) demonstrates that the most centralized support features in MCL are not only from the ground truth class but also mutually affiliated with task-relevant objects in the query images. We consider this to qualitatively validate our underlying idea that support features are frequently visited over long runs of the bidirectional random walk if they share local characteristics with the query image.

Model | VanillaFCN resolution $r$: 5×5, 8×8, 10×10, 12×12
DN4 [16] 11.16 13.41 16.82 23.45
DeepEMD [35] 262.0 814.6 2781.9 8035.4
FRN [33] (Ridge regression) 14.55 20.42 25.81 30.65
FRN [33] (Woodbury identity) 36.74 42.33 47.31 55.84
MCL (torch.pinverse) 55.91 119.0 292.2 464.8
MCL (QR decomposition) 600.0 658.6 719.4 802.6
MCL (Katz approximation) 12.59 14.59 18.92 29.10
Table 6: Time per episode (ms) for different feature map resolutions $r$ on 5-way 1-shot tasks; the feature extraction time is excluded for all compared methods.

6 Conclusions

We present a novel dense feature based framework, Mutual Centralized Learning (MCL), to highlight the mutual affiliations of bipartite dense features in FSL. We introduce a novel dense features' accessibility criterion and demonstrate that classic transductive methods like label propagation can be easily adapted to the inductive setting if we treat the dense features from a single image as a set of unlabeled data. We propose a bidirectional random walk to learn mutual affiliations in FSL and prove that its features' accessibility in a long time-homogeneous Markov process is equivalent to the single-mode eigenvector centrality of an affiliation network. We show that such centrality can serve not only as a novel end-to-end classification criterion but also as a plugin in existing methods. Experimental results demonstrate that MCL achieves the state-of-the-art on various datasets.

References

  • [1] Arman Afrasiyabi, Jean-François Lalonde, and Christian Gagné. Associative alignment for few-shot image classification. In European Conference on Computer Vision, pages 18–35. Springer, 2020.
  • [2] Luca Bertinetto, Joao F Henriques, Philip Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2018.
  • [3] Phillip Bonacich. Factoring and weighting approaches to status scores and clique identification. Journal of mathematical sociology, 2(1):113–120, 1972.
  • [4] Phillip Bonacich. Simultaneous group and individual centralities. Social networks, 13(2):155–168, 1991.
  • [5] Stephen P Borgatti and Martin G Everett. Network analysis of 2-mode data. Social networks, 19(3):243–269, 1997.
  • [6] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
  • [7] Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390, 2020.
  • [8] Guneet Singh Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. In International Conference on Learning Representations, 2019.
  • [9] Kaize Ding, Jianling Wang, Jundong Li, Kai Shu, Chenghao Liu, and Huan Liu. Graph prototypical networks for few-shot learning on attributed networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 295–304, 2020.
  • [10] Carl Doersch, Ankush Gupta, and Andrew Zisserman. Crosstransformers: spatially-aware few-shot transfer. arXiv preprint arXiv:2007.11498, 2020.
  • [11] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
  • [12] Fusheng Hao, Fengxiang He, Jun Cheng, Lei Wang, Jianzhong Cao, and Dacheng Tao. Collect and select: Semantic alignment metric learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8460–8469, 2019.
  • [13] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.
  • [14] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
  • [15] Wenbin Li, Lei Wang, Jing Huo, Yinghuan Shi, Yang Gao, and Jiebo Luo. Asymmetric distribution measure for few-shot learning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 2957–2963, 2021.
  • [16] Wenbin Li, Lei Wang, Jinglin Xu, Jing Huo, Yang Gao, and Jiebo Luo. Revisiting local descriptor based image-to-class measure for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7260–7268, 2019.
  • [17] Yann Lifchitz, Yannis Avrithis, and Sylvaine Picard. Local propagation for few-shot learning. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 10457–10464. IEEE, 2021.
  • [18] Yann Lifchitz, Yannis Avrithis, Sylvaine Picard, and Andrei Bursuc. Dense classification and implanting for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9258–9267, 2019.
  • [19] Bin Liu, Yue Cao, Yutong Lin, Qi Li, Zheng Zhang, Mingsheng Long, and Han Hu. Negative margin matters: Understanding margin in few-shot classification. In European Conference on Computer Vision, pages 438–455. Springer, 2020.
  • [20] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. In International Conference on Learning Representations, 2019.
  • [21] Oskar Perron. Zur theorie der matrices. Mathematische Annalen, 64(2):248–263, 1907.
  • [22] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018.
  • [23] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
  • [24] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [25] Christian Simon, Piotr Koniusz, Richard Nock, and Mehrtash Harandi. Adaptive subspaces for few-shot learning. In CVPR, pages 4136–4145, 2020.
  • [26] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087, 2017.
  • [27] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 403–412, 2019.
  • [28] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, pages 1199–1208, 2018.
  • [29] Hung-Yu Tseng, Hsin-Ying Lee, Jia-Bin Huang, and Ming-Hsuan Yang. Cross-domain few-shot classification via learned feature-wise transformation. In International Conference on Learning Representations, 2020.
  • [30] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3637–3645, 2016.
  • [31] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-ucsd birds 200. 2010.
  • [32] Davis Wertheimer and Bharath Hariharan. Few-shot learning with localization in realistic settings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6558–6567, 2019.
  • [33] Davis Wertheimer, Luming Tang, and Bharath Hariharan. Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8012–8021, 2021.
  • [34] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In CVPR, pages 8808–8817, 2020.
  • [35] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. Deepemd: Differentiable earth mover’s distance for few-shot learning. arXiv e-prints, pages arXiv–2003, 2020.
  • [36] Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, volume 16, pages 321–328, 2003.

Appendix A Proof of Eqn.(5)

Proof of periodicity:

Consider the eigenvalues $\lambda$ of $\mathbf{P}$ through the determinant equation

$$\det(\lambda\bm{I}-\mathbf{P})=\det\begin{pmatrix}\lambda\bm{I}&-\mathbf{P}_{\mathbf{Sq}}\\ -\mathbf{P}_{\mathbf{qS}}&\lambda\bm{I}\end{pmatrix}=\det(\lambda^{2}\bm{I}-\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}),$$ (A.1)

it can be found that the eigenvalues of $\mathbf{P}$ are the square roots of the eigenvalues of $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$.

Since both $\mathbf{P}_{\mathbf{Sq}}$ and $\mathbf{P}_{\mathbf{qS}}$ are column-normalized matrices, their product is still column-stochastic, which can be shown by:

$$\bm{e}_{Nr}^{\mathrm{T}}\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}=\bm{e}_{r}^{\mathrm{T}}\mathbf{P}_{\mathbf{qS}}=\bm{e}_{Nr}^{\mathrm{T}}$$ (A.2)

where $\bm{e}_{|\cdot|}$ is a vector of ones whose length is indicated by its subscript, and $Nr$ and $r$ are the cardinalities of $\mathbf{S}$ and $\mathbf{q}$, respectively.

We know (by the definition of a stochastic matrix) that $\lambda=1$ is the largest eigenvalue of $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$, and its uniqueness is guaranteed since there is no zero entry in either $\mathbf{P}_{\mathbf{Sq}}$ or $\mathbf{P}_{\mathbf{qS}}$. According to Eqn.(A.1), we get another eigenvalue $\lambda=-1$ for the stochastic matrix $\mathbf{P}$. From the Perron–Frobenius theorem, the period of $\mathbf{P}$ equals the number of eigenvalues whose absolute value equals the spectral radius of $\mathbf{P}$, which proves that its stationary distribution is of period 2.

Proof for even periods:

We give the limit of the matrix $\mathbf{P}^{2t}$ as $t$ goes to infinity as follows:

$$\lim_{t\rightarrow\infty}\mathbf{P}^{2t}=\lim_{t\rightarrow\infty}\begin{pmatrix}\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}&\mathbf{0}\\ \mathbf{0}&\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\end{pmatrix}^{t}=\begin{pmatrix}\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\right]^{t}&\mathbf{0}\\ \mathbf{0}&\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\right]^{t}\end{pmatrix}$$ (A.3)

Since we have shown in Eqn.(A.2) that $\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}$ is also column-stochastic, we use $\bm{\pi}(\mathbf{S})$ to denote its stationary distribution vector, with $\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\right]^{t}=\bm{\pi}(\mathbf{S})\bm{e}_{Nr}^{\mathrm{T}}$.

By analogy, the infinite power of $\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}$ also reaches a similar stationary distribution $\bm{\pi}(\mathbf{q})$, with $\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\right]^{t}=\bm{\pi}(\mathbf{q})\bm{e}_{r}^{\mathrm{T}}$.

Substituting the two stationary vectors into Eqn.(A.3), we prove the stationary distributions of $\mathbf{P}$ for the even periods in Eqn.(5).

Proof for odd periods:

From the definition of matrix product, we first have

$$\lim_{t\rightarrow\infty}\mathbf{P}^{2t-1}=\lim_{t\rightarrow\infty}\mathbf{P}^{2t+1}=\mathbf{P}\lim_{t\rightarrow\infty}\mathbf{P}^{2t}=\begin{pmatrix}\mathbf{0}&\mathbf{P}_{\mathbf{Sq}}\bm{\pi}(\mathbf{q})\bm{e}_{r}^{\mathrm{T}}\\ \mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})\bm{e}_{Nr}^{\mathrm{T}}&\mathbf{0}\end{pmatrix}$$ (A.4)

Next, according to the definitions of $\bm{\pi}(\mathbf{q})$ and $\bm{\pi}(\mathbf{S})$, we get

$$\bm{\pi}(\mathbf{q})\bm{e}_{r}^{\mathrm{T}}=\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\right]^{t}=\mathbf{P}_{\mathbf{qS}}\left(\lim_{t\rightarrow\infty}\left[\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\right]^{t}\right)\mathbf{P}_{\mathbf{Sq}}=\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})\bm{e}_{Nr}^{\mathrm{T}}\mathbf{P}_{\mathbf{Sq}}.$$ (A.5)

Right-multiplying both sides of Eqn.(A.5) by $\bm{e}_{r}$, we have

$$\bm{\pi}(\mathbf{q})\bm{e}_{r}^{\mathrm{T}}\bm{e}_{r}=\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})\bm{e}_{Nr}^{\mathrm{T}}\mathbf{P}_{\mathbf{Sq}}\bm{e}_{r}$$ (A.6)

Since $\bm{e}_{r}^{\mathrm{T}}\bm{e}_{r}=r$ and $\bm{e}_{Nr}^{\mathrm{T}}\mathbf{P}_{\mathbf{Sq}}\bm{e}_{r}=\sum_{i}\sum_{j}\left[\mathbf{P}_{\mathbf{Sq}}\right]_{ij}=r$, Eqn.(A.6) can be simplified by dividing both sides by the same scalar $r$:

$$\bm{\pi}(\mathbf{q})=\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S}).$$ (A.7)

By analogy, the symmetric equation $\bm{\pi}(\mathbf{S})=\mathbf{P}_{\mathbf{Sq}}\bm{\pi}(\mathbf{q})$ can also be easily proved in the same way as from Eqn.(A.5) to Eqn.(A.7).

Substituting $\bm{\pi}(\mathbf{S})=\mathbf{P}_{\mathbf{Sq}}\bm{\pi}(\mathbf{q})$ and $\bm{\pi}(\mathbf{q})=\mathbf{P}_{\mathbf{qS}}\bm{\pi}(\mathbf{S})$ into Eqn.(A.4), we prove the stationary distributions of $\mathbf{P}$ for the odd periods in Eqn.(5).
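As an informal numerical check (not part of the proof), the periodic limits of Eqn.(5) can be observed directly with random column-stochastic blocks:

```python
# Numerical check of Eqn (5): high even/odd powers of the anti-diagonal P
# collapse to rank-one blocks built from pi(S) and pi(q).
import torch

torch.manual_seed(0)
N, r = 3, 4
P_Sq = torch.softmax(torch.randn(N * r, r, dtype=torch.double), dim=0)
P_qS = torch.softmax(torch.randn(r, N * r, dtype=torch.double), dim=0)
P = torch.zeros(N * r + r, N * r + r, dtype=torch.double)
P[:N * r, N * r:] = P_Sq
P[N * r:, :N * r] = P_qS

P_even = torch.linalg.matrix_power(P, 400)     # ~ lim P^{2t}
P_odd = torch.linalg.matrix_power(P, 401)      # ~ lim P^{2t-1}

pi_S = P_even[:N * r, 0]                       # every column of the S-block equals pi(S)
pi_q = P_even[N * r:, N * r]                   # every column of the q-block equals pi(q)
print(torch.allclose(P_odd[:N * r, N * r:], pi_S.unsqueeze(1).expand(-1, r)))
print(torch.allclose(pi_S, P_Sq @ P_qS @ pi_S), torch.allclose(pi_q, P_qS @ P_Sq @ pi_q))
```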

Appendix B Proof of Lemma 1

We have shown in Appendix A that there exists an eigenvalue $\lambda=1$ for the column-stochastic matrix $\mathbf{P}$ with $\mathbf{Px}=\mathbf{x}$. If we interpret the transition matrix as the adjacency matrix of the directed bipartite graph, the eigenvector centrality of that graph is $\mathbf{x}$.

We split the eigenvector $\mathbf{x}$ into $\mathbf{x}_{\mathbf{S}}$ and $\mathbf{x}_{\mathbf{q}}$ for the bipartite vertex sets $\mathbf{S}$ and $\mathbf{q}$ respectively, and the single-mode eigenvector centralities of each vertex set can therefore be formulated as:

$$\hat{\mathbf{x}}_{\mathbf{S}}=\frac{\mathbf{x}_{\mathbf{S}}}{\sum_{s\in\mathbf{S}}x_{s}},\qquad\hat{\mathbf{x}}_{\mathbf{q}}=\frac{\mathbf{x}_{\mathbf{q}}}{\sum_{q\in\mathbf{q}}x_{q}}$$ (B.1)

Left-multiplying both sides of $\mathbf{Px}=\mathbf{x}$ by $\mathbf{P}$, we have $\mathbf{P}^{2}\mathbf{x}=\mathbf{P}(\mathbf{Px})=\mathbf{Px}=\mathbf{x}$. In matrix notation,

$$\underbrace{\begin{pmatrix}\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}&\mathbf{0}\\ \mathbf{0}&\mathbf{P}_{\mathbf{qS}}\mathbf{P}_{\mathbf{Sq}}\end{pmatrix}}_{\mathbf{P}^{2}}\underbrace{\begin{pmatrix}\mathbf{x}_{\mathbf{S}}\\ \mathbf{x}_{\mathbf{q}}\end{pmatrix}}_{\mathbf{x}}=\underbrace{\begin{pmatrix}\mathbf{x}_{\mathbf{S}}\\ \mathbf{x}_{\mathbf{q}}\end{pmatrix}}_{\mathbf{x}}.$$ (B.2)

Considering the product of the first block row of 𝐏2\mathbf{P}^{2} with 𝐱\mathbf{x} in Eqn.(B.2), we have 𝐏𝐒𝐪𝐏𝐪𝐒𝐱𝐒=𝐱𝐒\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}}\mathbf{x}_{\mathbf{S}}=\mathbf{x}_{\mathbf{S}}. Since 𝝅(𝐒)\bm{\pi}(\mathbf{S}) is the eigenvector of 𝐏𝐒𝐪𝐏𝐪𝐒\mathbf{P}_{\mathbf{Sq}}\mathbf{P}_{\mathbf{qS}} with eigenvalue 1 under the probability constraint 𝒆NrT𝝅(𝐒)=1\bm{e}_{Nr}^{\mathrm{T}}\bm{\pi}(\mathbf{S})=1, 𝝅(𝐒)\bm{\pi}(\mathbf{S}) is exactly the single-mode eigenvector centrality 𝐱^𝐒\hat{\mathbf{x}}_{\mathbf{S}} in Eqn.(B.1).

By analogy, considering the product of the second block row of 𝐏2\mathbf{P}^{2} with 𝐱\mathbf{x} in Eqn.(B.2), we can prove that 𝝅(𝐪)\bm{\pi}(\mathbf{q}) is equivalent to the conjugate single-mode eigenvector centrality 𝐱^𝐪\hat{\mathbf{x}}_{\mathbf{q}} of the bipartite graph.
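
Lemma 1 can likewise be verified numerically. The sketch below (again with hypothetical toy sizes and random column-stochastic blocks, not the actual MCL transitions) builds the full bipartite transition matrix, extracts the eigenvector of eigenvalue 1, and compares its normalized blocks with the stationary distributions of the two-step chains.

import torch

torch.manual_seed(0)
N, r = 3, 4                                # hypothetical toy sizes
Nr = N * r
P_Sq = torch.softmax(torch.randn(Nr, r), dim=0)   # q -> S block
P_qS = torch.softmax(torch.randn(r, Nr), dim=0)   # S -> q block

# Full bipartite transition matrix with the block layout used in Eqn.(B.2).
P = torch.zeros(Nr + r, Nr + r)
P[:Nr, Nr:] = P_Sq
P[Nr:, :Nr] = P_qS

# Eigenvector of P associated with the eigenvalue closest to 1.
evals, evecs = torch.linalg.eig(P)
x = evecs[:, torch.argmin((evals - 1).abs())].real
x_S, x_q = x[:Nr], x[Nr:]

# Stationary distributions of the two-step chains via power iteration.
pi_S = torch.full((Nr,), 1.0 / Nr)
pi_q = torch.full((r,), 1.0 / r)
for _ in range(1000):
    pi_S, pi_q = P_Sq @ P_qS @ pi_S, P_qS @ P_Sq @ pi_q

# Lemma 1: the normalized blocks of x recover pi_S and pi_q, as in Eqn.(B.1).
assert torch.allclose(x_S / x_S.sum(), pi_S, atol=1e-4)
assert torch.allclose(x_q / x_q.sum(), pi_q, atol=1e-4)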

Appendix C Proof of Eqn.(6)

We restate Eqn.(6) to be proved:

𝐏𝐫(y~=c)\displaystyle\mathbf{Pr}(\tilde{y}=c) =limtz𝐳𝔼[k=1t𝟙[Xk𝐬c]|X0=z]z𝐳𝔼[k=1t𝟙[Xk𝐒]|X0=z]\displaystyle=\lim_{t\rightarrow\infty}\frac{\displaystyle{\sum_{z\in\mathbf{z}}}\hskip 2.0pt\mathbb{E}\left[\sum_{k=1}^{t}\mathds{1}[X_{k}\in\mathbf{s}^{c}]\bigg{|}X_{0}=z\right]}{\displaystyle{\sum_{z\in\mathbf{z}}}\hskip 2.0pt\mathbb{E}\left[\sum_{k=1}^{t}\mathds{1}[X_{k}\in\mathbf{S}]\bigg{|}X_{0}=z\right]}
=s𝐬c[𝝅(𝐒)]s\displaystyle=\sum_{s\in\mathbf{s}^{c}}\left[\bm{\pi}(\mathbf{S})\right]_{s}

We first define 𝐏𝐫(y~=c)limt𝐏𝐫(t)\displaystyle{\mathbf{Pr}(\tilde{y}=c)\triangleq\lim_{t\rightarrow\infty}\mathbf{Pr}(t)}. Since we have proved in Appendix A that the Markov process with transition matrix 𝐏\mathbf{P} has period 2, proving Eqn.(6) is equivalent to proving:

limt𝐏𝐫(2t)=limt𝐏𝐫(2t1)=s𝐬c[𝝅(𝐒)]s\lim_{t\rightarrow\infty}\mathbf{Pr}(2t)=\lim_{t\rightarrow\infty}\mathbf{Pr}(2t-1)=\sum_{s\in\mathbf{s}^{c}}\left[\bm{\pi}(\mathbf{S})\right]_{s} (C.1)

Proof for even period:

From the definition, we have

𝐏𝐫(2t)=1Nr+rz𝐳k=12ts𝐬c[𝐏k]sz1Nr+r(rt+Nrt)=1tk=1t[1Nr+rs𝐬c(z𝐒[𝐏2k]sz+z𝐪[𝐏2k1]sz)]\displaystyle\begin{split}&\mathbf{Pr}(2t)=\frac{\displaystyle{\frac{1}{Nr+r}\sum_{z\in\mathbf{z}}\sum_{k=1}^{2t}}\sum_{s\in\mathbf{s}^{c}}\left[\mathbf{P}^{k}\right]_{sz}}{\displaystyle{\frac{1}{Nr+r}\left(rt+Nrt\right)}}\\ &=\frac{1}{t}\sum_{k=1}^{t}\left[\frac{1}{Nr+r}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{z\in\mathbf{S}}\left[\mathbf{P}^{2k}\right]_{sz}+\sum_{z\in\mathbf{q}}\left[\mathbf{P}^{2k-1}\right]_{sz}\right)\right]\end{split} (C.2)

where rtrt is the number of visits made by particles starting in 𝐪\mathbf{q} to support features in 𝐒\mathbf{S} after 2t2t steps of the bidirectional random walk, and NrtNrt is the number of visits made by particles starting in 𝐒\mathbf{S} to support features in 𝐒\mathbf{S}. The second equality follows from the block-diagonal/anti-diagonal structure of 𝐏2k\mathbf{P}^{2k}/𝐏2k1\mathbf{P}^{2k-1}, respectively, where the 𝟎\mathbf{0} sub-matrices are dropped from the summation.

Taking the limit of Eqn.(C.2) as tt approaches infinity, we have

limt𝐏𝐫(2t)\displaystyle\lim_{t\rightarrow\infty}\mathbf{Pr}(2t) (C.3)
=1Nr+rs𝐬c(z𝐒[limt𝐏2t]sz+z𝐪[limt𝐏2t1]sz)\displaystyle=\frac{1}{Nr+r}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{z\in\mathbf{S}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t}\right]_{sz}+\sum_{z\in\mathbf{q}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t-1}\right]_{sz}\right)
=s𝐬c[𝝅(𝐒)]s\displaystyle=\sum_{s\in\mathbf{s}^{c}}\left[\bm{\pi}(\mathbf{S})\right]_{s}

where the first equality holds because the Cesàro average of a convergent sequence converges to the same limit (the even- and odd-power subsequences of the 2-periodic Markov chain both converge), and the second equality follows from substituting Eqn.(5).

Proof for odd period:

From the definition, we have

𝐏𝐫(2t1)=1Nr+rz𝐳k=12t1s𝐬c[𝐏k]sz1Nr+r(rt+Nr(t1))=1ωs𝐬c[z𝐪𝐏sz+k=2t(z𝐒[𝐏2k2]sz+z𝐪[𝐏2k1]sz)]\displaystyle\begin{split}&\mathbf{Pr}(2t-1)=\frac{\displaystyle{\frac{1}{Nr+r}\sum_{z\in\mathbf{z}}\sum_{k=1}^{2t-1}}\sum_{s\in\mathbf{s}^{c}}\left[\mathbf{P}^{k}\right]_{sz}}{{\displaystyle{\frac{1}{Nr+r}\left(rt+Nr(t-1)\right)}}}\\ &=\frac{1}{\omega}\sum_{s\in\mathbf{s}^{c}}\left[\sum_{z\in\mathbf{q}}\mathbf{P}_{sz}+\sum_{k=2}^{t}\left(\sum_{z\in\mathbf{S}}\left[\mathbf{P}^{2k-2}\right]_{sz}+\sum_{z\in\mathbf{q}}\left[\mathbf{P}^{2k-1}\right]_{sz}\right)\right]\end{split} (C.4)

where ω\omega equals (Nr+r)(tNrNr+r)(Nr+r)(t-\frac{Nr}{Nr+r}). Taking the limit of Eqn.(C.4) as tt approaches infinity, we have

limt𝐏𝐫(2t1)=1Nr+rs𝐬c(z𝐒[limt𝐏2t2]sz+z𝐪[limt𝐏2t1]sz)=s𝐬c[𝝅(𝐒)]s\displaystyle\begin{split}&\lim_{t\rightarrow\infty}\mathbf{Pr}(2t-1)\\ &=\frac{1}{Nr+r}\sum_{s\in\mathbf{s}^{c}}\left(\sum_{z\in\mathbf{S}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t-2}\right]_{sz}+\sum_{z\in\mathbf{q}}\left[\lim_{t\rightarrow\infty}\mathbf{P}^{2t-1}\right]_{sz}\right)\\ &=\sum_{s\in\mathbf{s}^{c}}\left[\bm{\pi}(\mathbf{S})\right]_{s}\end{split} (C.5)

where the single first-step term is dropped since limt1tNrNr+rz𝐪𝐏sz=0\displaystyle{\lim_{t\rightarrow\infty}\frac{1}{t-\frac{Nr}{Nr+r}}\sum_{z\in\mathbf{q}}\mathbf{P}_{sz}}=0.
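
The final equality in Eqn.(C.3) and Eqn.(C.5) can also be checked numerically. The sketch below uses hypothetical toy sizes and random column-stochastic blocks in place of the actual MCL transitions, approximates the limits of the even and odd matrix powers by a large finite power, and compares the resulting per-class probabilities with the per-class sums of the stationary distribution.

import torch

torch.manual_seed(0)
N, r = 3, 4                                  # hypothetical toy sizes
Nr = N * r
dtype = torch.double
P_Sq = torch.softmax(torch.randn(Nr, r, dtype=dtype), dim=0)
P_qS = torch.softmax(torch.randn(r, Nr, dtype=dtype), dim=0)

# Full bipartite transition matrix: rows/cols [0, Nr) are S, [Nr, Nr+r) are q.
P = torch.zeros(Nr + r, Nr + r, dtype=dtype)
P[:Nr, Nr:] = P_Sq
P[Nr:, :Nr] = P_qS

# Large even/odd powers approximate the limits of P^{2t} and P^{2t-1}.
P_even = torch.linalg.matrix_power(P, 2000)
P_odd = torch.linalg.matrix_power(P, 2001)

# Right-hand side of the first equality in Eqn.(C.3), grouped into N classes
# of r consecutive support features each.
limit = (P_even[:Nr, :Nr].sum(dim=1) + P_odd[:Nr, Nr:].sum(dim=1)) / (Nr + r)
pr = limit.view(N, r).sum(dim=1)

# Stationary distribution of the two-step support chain, grouped per class.
pi_S = torch.full((Nr,), 1.0 / Nr, dtype=dtype)
for _ in range(1000):
    pi_S = P_Sq @ P_qS @ pi_S
assert torch.allclose(pr, pi_S.view(N, r).sum(dim=1), atol=1e-6)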

import torch

# support: tensor of shape [N, d, r]
#   N-way FSL, each class owns r d-dimensional dense features.
# query: tensor of shape [q, d, r]
#   q query examples, each of them owns r dense features.
#
# gamma, beta: scaled similarity parameters (1-shot defaults from Appendix H)
# alpha: Katz attenuation factor; alpha_2: the square of alpha
# @: the matrix multiplication operator in Pytorch
gamma, beta = 20.0, 10.0
alpha = 0.5
alpha_2 = alpha ** 2


def inner_cosine(query, support):
    N, d, r = support.shape
    q = len(query)
    # L2-normalize every d-dimensional dense feature (along the d axis) so the
    # inner products below become cosine similarities.
    query = query / query.norm(2, dim=-2, keepdim=True)
    support = support / support.norm(2, dim=-2, keepdim=True)
    support = support.unsqueeze(0).expand(q, -1, -1, -1)  # [q, N, d, r]
    query = query.unsqueeze(1).expand(-1, N, -1, -1)      # [q, N, d, r]
    S = query.transpose(-2, -1)@support                   # [q, N, r, r]
    S = S.permute(0, 2, 1, 3).contiguous().view(q, r, N * r)
    return S


def MCL_Katz_approx(query, support):
    N, d, r = support.shape
    q = len(query)
    S = inner_cosine(query, support)  # [q, r, Nr]
    St = S.transpose(-2, -1)          # [q, Nr, r]
    # column-wise softmax probability
    P_sq = torch.softmax(gamma * St, dim=-2)  # transitions q -> S
    P_qs = torch.softmax(beta * S, dim=-2)    # transitions S -> q
    # From the derivations in Eqn.(F.2)
    inv = torch.inverse(
        torch.eye(r)[None].repeat(q, 1, 1) - alpha_2 * P_qs@P_sq
    )  # [q, r, r]
    katz = (alpha_2 * P_sq@inv@P_qs).sum(-1) + (alpha * P_sq@inv).sum(-1)
    katz = katz / katz.sum(-1, keepdim=True)
    predicts = katz.view(q, N, r).sum(-1)     # class probabilities [q, N]
    return predicts
Code 1: Pytorch pseudo-code for 1-shot MCL (Katz approximation) in a single episode.

Appendix D Codes

We use the block-wise inversion in Eqn.(14), which is more computationally efficient than directly inverting Eqn.(10) when the number of support classes NN is large. The corresponding Pytorch pseudo-code is given in Code 1, where the whole calculation is performed in parallel via batched matrix multiplication and inversion. Our unified FSL training and testing framework is publicly available at https://github.com/LouieYang/MCL.
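
As a usage note, Code 1 can be called on an episode directly. The minimal sketch below assumes the definitions from Code 1 are in scope and uses hypothetical sizes (5-way 1-shot, 640 channels, 25 dense features from a 5x5 map) with random tensors standing in for extracted features.

import torch

support = torch.randn(5, 640, 25)      # [N, d, r]
query = torch.randn(5 * 15, 640, 25)   # [q, d, r], 15 queries per class
probs = MCL_Katz_approx(query, support)
print(probs.shape)                     # torch.Size([75, 5]), one row per query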

Appendix E Implementation details

Preprocessing: During training on CUB, miniImageNet and tieredImageNet, images are randomly cropped to 92×9292\times 92 and then resized to 84×8484\times 84. For meta-iNat and tiered-meta-iNat, images are randomly padded and then cropped to 84×8484\times 84. Unlike previous methods, we only apply random horizontal flipping during training.

During inference, images are center cropped to 92×9292\times 92 and then resized to 84×8484\times 84 for CUB, miniImageNet and tieredImageNet. For meta-iNat and tiered-meta-iNat, images are already 84×8484\times 84 and are fed into the models directly.
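
A minimal torchvision sketch of the crop-and-resize pipeline described above (assuming the raw images are at least 92 pixels per side; the meta-iNat padding variant and any normalization are omitted):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(92),          # random 92x92 crop
    transforms.Resize(84),              # resize to 84x84
    transforms.RandomHorizontalFlip(),  # the only extra augmentation we use
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.CenterCrop(92),          # center 92x92 crop
    transforms.Resize(84),
    transforms.ToTensor(),
])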

Network backbones: We use two backbones in our experiments: Conv-4 and ResNet-12. Conv-4 contains four convolutional blocks, each of which consists of a 3×33\times 3 convolution, a BatchNorm, a LeakyReLU(0.2) and a 2×22\times 2 max-pooling. ResNet-12 consists of four residual blocks, each with three convolutional layers, LeakyReLU(0.1) and 2×22\times 2 max-pooling on the main stem. Given an input image of size 84×8484\times 84, Conv-4 outputs a feature map of size 5×5×645\times 5\times 64 while ResNet-12 outputs one of size 5×5×6405\times 5\times 640.
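
For reference, a minimal PyTorch sketch of the Conv-4 backbone described above (the 64-channel width throughout is the standard choice and an assumption here):

import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
        nn.MaxPool2d(2),
    )

# Four stacked blocks: an 84x84 RGB input yields a 5x5x64 dense feature map.
conv4 = nn.Sequential(
    conv_block(3, 64), conv_block(64, 64),
    conv_block(64, 64), conv_block(64, 64),
)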

Re-implemented baselines: We re-implement ProtoNet [26] and RelationNet [28] in our unified framework as two global feature based baselines for our centrality plugin. We borrow most of the code from their official implementations and introduce slight modifications, inspired by their subsequent works [33, 34], to improve their performances. For ProtoNet, we introduce a fixed temperature scaling of 1/641/64 before the softmax function. For RelationNet, we change the original MSELoss to CrossEntropyLoss.
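
A minimal sketch of the ProtoNet modification mentioned above, i.e., squared-Euclidean logits scaled by a fixed temperature of 1/64 before the softmax (the helper name and shapes are hypothetical):

import torch

def protonet_logits(query_emb, prototypes, temperature=1.0 / 64):
    # query_emb: [q, d] global query features; prototypes: [N, d] class means.
    sq_dists = torch.cdist(query_emb, prototypes) ** 2   # [q, N]
    return -temperature * sq_dists                       # fed to CrossEntropyLoss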

Model Backbone 1-shot 5-shot
Baseline [6] ResNet-10 - 65.57 ±\pm 0.70
Baseline++ [6] ResNet-18 - 62.04 ±\pm 0.76
MetaOptNet [14] ResNet-12 44.79 ±\pm 0.75 64.98 ±\pm 0.68
MatchingNet+FT [29] ResNet-10 36.61 ±\pm 0.53 55.23 ±\pm 0.83
RelationNet+FT [29] ResNet-10 44.07 ±\pm 0.77 59.46 ±\pm 0.71
GNN+FT [29] ResNet-10 47.47 ±\pm 0.75 66.98 ±\pm 0.68
Neg-margin [19] ResNet-18 - 69.30 ±\pm 0.73
Centroid et al. [1] ResNet-18 46.85 ±\pm 0.75 70.37 ±\pm 1.02
FRN [33] ResNet-12 51.60 ±\pm 0.21 72.97 ±\pm 0.18
ProtoNet [26] ResNet-12 40.05 ±\pm 0.18 55.29 ±\pm 0.19
ProtoNet+MCL ResNet-12 42.02 ±\pm 0.19 64.76 ±\pm 0.20
MCL ResNet-12 55.48 ±\pm 0.22 75.93 ±\pm 0.18
MCL-Katz ResNet-12 53.22 ±\pm 0.22 77.39 ±\pm 0.18
Table 7: Few-shot classifications (%) in the cross-domain setting: miniImageNet\rightarrowCUB. \dagger: our re-implemented results in our unified framework that share the same dataloader and training strategies.

Pre-training: We use the pre-training + meta-training procedure for ResNet-12 backbones on miniImageNet and tieredImageNet, like most methods in the literature [35, 16, 33]. We follow the same pre-training technique as the dense features based FRN [33] to learn spatially distinctive dense features, e.g., running 350 epochs with batch size 128 on miniImageNet, using SGD with an initial learning rate of 0.1 decayed by a factor of 10 at epochs 200 and 300.
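
A minimal sketch of the described pre-training schedule (the backbone and classifier are placeholders; momentum and weight decay are omitted since they are not specified here):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 84 * 84, 64))  # placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200, 300], gamma=0.1)

for epoch in range(350):
    # ... one epoch of standard supervised training with batch size 128 ...
    scheduler.step()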

Meta-training on mini-/tieredImageNet: We follow the earliest dense feature based DN4 [16] and randomly sample 20,000/200 episodes per epoch for the Conv-4/ResNet-12 backbone, respectively. Since we did not use pre-training for Conv-4, the number of episodes per epoch for Conv-4 is far larger than for ResNet-12 to ensure convergence. In each episode, besides the KK support images in each class, 15 query images are also selected from each class.

For Conv-4, we adopt Adam optimizer with initial learning rate of 1e-3 to train for 30 epochs (on both datasets) and reduce it by a factor of 10 every 10 epochs.

For ResNet-12, we adopt SGD with initial learning rate of 5e-4 (40 epochs on miniImageNet and 60 epochs on tieredImageNet) and cut it by half every 10 epochs.

Unlike the latest dense feature based FRN, which adopts a larger way during meta-training (e.g., 25-way for training 1-shot models and 20-way for 5-shot models), we did not use that setting on the widely used mini-/tieredImageNet, as we considered it unfair when comparing with other methods.

Meta-training on fine-grained datasets: We follow the latest fine-grained few-shot classification setting from FRN [33], as most of the compared performances in Table 2 are taken from it. Although practitioners agree on the train/validation/test split ratio (i.e., 100/50/50) on the CUB dataset, there is no official class split. In our experiments, we use the same train/val/test class split as in [33] for a fair comparison.

For CUB, we train all our Conv-4 models for 1200 epochs using SGD with Nesterov momentum of 0.9 and an initial learning rate of 0.1. The learning rate decreases by a factor of 10 at epochs 400 and 800. The ResNet-12 backbone is trained for 600 epochs, with the learning rate scaled down by 10 at epochs 300, 400 and 500. The Conv-4 backbone is trained with the standard 20-way for 5-shot models and with 30-way for 1-shot models as in [33], while the ResNet-12 backbone is trained with 10-way for both shot settings.

For meta-iNat and tiered-meta-iNat, we train our Conv-4 and ResNet-12 models for 100 epochs using Adam with an initial learning rate of 1e-3, decayed by a factor of 0.5 every 20 epochs. Both Conv-4 and ResNet-12 are trained with 10-way for both 1-shot and 5-shot models.

Methods fθ()f_{\theta}(\cdot) 1-shot 5-shot
Baseline [6] Global feature 53.99 ±\pm 0.20 78.78 ±\pm 0.15
Baseline++ [6] 55.03 ±\pm 0.20 78.79 ±\pm 0.15
ProtoNet [26] 53.99 ±\pm 0.20 79.70 ±\pm 0.15
ProtoNet+MCL 58.62 ±\pm 0.20 80.99 ±\pm 0.14
DN4 [16] VanillaFCN 58.95 ±\pm 0.19 78.01 ±\pm 0.15
FRN [33] 59.71 ±\pm 0.19 74.05 ±\pm 0.15
MCL 60.94 ±\pm 0.20 80.20 ±\pm 0.14
MCL-Katz 61.55 ±\pm 0.20 81.09 ±\pm 0.14
Table 8: Few-shot classifications (%) without episodic meta-training. All of the comparing methods are evaluated in our unified framework with the same pre-trained ResNet-12 backbone.

Appendix F Cross-Domain Few-shot Classification

We also evaluate on the challenging cross-domain setting proposed by [6], where models trained on miniImageNet base classes are directly evaluated on test classes from CUB. We use the same test class split as in [33] for fair comparisons, which is much harder than the test class split in [6].

As shown in Table S7, our MCL outperforms previous state-of-the-art methods by large margins of 3.9% on the 1-shot task and 5.4% on the 5-shot task, respectively.

Appendix G Evaluation without Meta-training

Given that an increasing number of methods simply use standard supervised learning to pre-train the feature extractor and then apply their methods directly for evaluation without meta-training [6, 8], we also evaluate our methods under this setting with the same pre-trained feature extractor used in Table 1. As shown in Table S8, global feature based methods are likely to misclassify images in the extreme 1-shot scenario, where significant intra-class variations inevitably drive image-level embeddings of the same category far apart. In contrast, dense feature based methods retain richer local information and show promising performances in that scenario.

Our end-to-end Katz centrality based MCL outperforms previous methods by margins of 1.8% and 1.3% on the 1-shot and 5-shot tasks, respectively. It is interesting to note that our MCL plugin helps centralize the task-relevant local features in ProtoNet [26], improving it by large margins of 4.6% on the 1-shot task and 1.2% on the 5-shot task without bells and whistles.

5-way 1-shot 5-way 5-shot
α0\alpha\approx 0 (unidirectional) 66.60 ±\pm 0.20 81.76 ±\pm 0.13
α=0.1\alpha=0.1 67.13 ±\pm 0.20 83.20 ±\pm 0.13
α=0.3\alpha=0.3 67.28 ±\pm 0.20 83.89 ±\pm 0.13
α=0.5\alpha=0.5 (MCL-Katz) 67.51 ±\pm 0.20 83.99 ±\pm 0.13
α=0.7\alpha=0.7 67.27 ±\pm 0.20 84.05 ±\pm 0.13
α=0.9\alpha=0.9 67.41 ±\pm 0.20 83.96 ±\pm 0.13
α=0.999\alpha=0.999 (MCL) 67.36 ±\pm 0.20 83.63 ±\pm 0.13
Table 9: Ablation studies on the Katz attenuation factor α\alpha on miniImageNet with VanillaFCN ResNet-12.
β=5.0\beta=5.0 β=10.0\beta=10.0 β=20.0\beta=20.0 β=50.0\beta=50.0
γ=5.0\gamma=5.0 66.12 66.15 65.87 65.39
γ=10.0\gamma=10.0 67.21 67.20 66.78 66.23
γ=20.0\gamma=20.0 67.53 67.36 66.70 66.00
γ=50.0\gamma=50.0 67.05 66.77 65.84 65.08
Table 10: Ablation studies of MCL (α=0.999\alpha=0.999) on the scaling parameters γ\gamma and β\beta in Eqn.(1) and Eqn.(2), respectively. Experiments are conducted on 5-way 1-shot miniImageNet with VanillaFCN.

Appendix H Ablation on parameters α\alpha, γ\gamma and β\beta

Ablation on the Katz attenuation factor α\alpha. Table S9 shows results with different α\alpha in the bidirectional random walks. As discussed in Sec.5, the centrality with a large α\alpha is more influenced by the endogenous topology, while that with a small α\alpha is more influenced by the in-degree paths. It can be observed in Table S9 that the optimal α\alpha differs across FSL tasks (i.e., different shots on different datasets).

In the experiments, we use α\alpha=0.9990.999 to approximate single-mode eigenvector centrality in Eqn.(6) and simply use α\alpha=0.50.5 to represent general Katz centrality in Eqn.(8).

Ablation on the scaling parameters γ\gamma and β\beta. Table S10 shows MCL results with different scaling parameters γ\gamma and β\beta. As discussed in Sec.5, a larger scaling parameter (i.e., a smaller temperature in the softmax-like random walk probability) leads to a more concentrated eigenvector centrality. However, a larger scaling parameter also inevitably biases the meta-training. Thus, there is a trade-off in selecting the optimal parameters for each task.

In the experiments, we find it empirically effective to select γ\gamma and β\beta according to the pre-trained models (similar to Table S10). In most cases, we use γ\gamma=2020, β\beta=1010 for 1-shot tasks and γ\gamma=4040, β\beta=2020 for 5-shot tasks.
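
A tiny sketch of the temperature effect discussed above: a larger scaling parameter sharpens the column-wise random-walk probabilities (the similarity values below are hypothetical):

import torch

sims = torch.tensor([0.9, 0.7, 0.5, 0.3])   # hypothetical cosine similarities
for gamma in (5.0, 20.0, 50.0):
    print(gamma, torch.softmax(gamma * sims, dim=0))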

Appendix I Additional Plugin Experiments

We have shown that our proposed centrality weighted pooling yields a consistent performance gain (especially in the extreme 1-shot scenario) over global average pooling on ProtoNet [26] and RelationNet [28] by concentrating on more task-relevant local features. Besides Tables 1, 2, 4 and S8, we give additional results in Table S11 to show that MCL can be easily plugged into methods that require no extra parameters beyond the feature extractor.

The experiments are conducted under the following rules: all compared methods are evaluated with the same pre-trained ResNet-12 backbone as in Sec. G without meta-training; we only use centrality weighted pooling to aggregate the local features of the query image, since different methods operate differently on the support features; and we fix the parameters γ\gamma=2020 and β\beta=1010 for the MCL plugins of all compared methods.

As shown in Table S11, our proposed centrality plugin consistently improves the performances of all the global feature based methods without bells and whistles.
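
For illustration, a minimal sketch of centrality weighted pooling, assuming the per-feature centrality weights (e.g., the query-side stationary distribution) have already been computed; global average pooling is recovered with uniform weights:

import torch

def centrality_weighted_pool(dense_feat, weights):
    # dense_feat: [d, r] dense features of one image; weights: [r], sum to 1.
    return dense_feat @ weights                     # [d] pooled global feature

d, r = 640, 25
feat = torch.randn(d, r)
uniform = torch.full((r,), 1.0 / r)
assert torch.allclose(centrality_weighted_pool(feat, uniform),
                      feat.mean(dim=-1), atol=1e-5)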

Method miniImageNet tieredImageNet
1-shot 5-shot 1-shot 5-shot
Baseline[6] 53.99 78.78 67.75 85.23
+MCL 56.21+2.22 80.44+1.66 68.53+0.78 85.75+0.52
Baseline++[6] 55.03 78.79 61.86 84.31
+MCL 56.93+1.90 79.93+1.14 63.00+1.14 84.61+0.30
ProtoNet[26] 56.50 79.68 64.27 84.01
+MCL 58.62+2.12 80.99+1.30 66.48+2.21 84.83+0.82
MatchingNet[30] 58.41 79.52 68.68 85.16
+MCL 59.85+1.44 80.79+1.27 69.45+0.97 85.49+0.33
R2D2[2] 59.82 78.97 70.22 85.30
+MCL 60.86+1.04 80.65+1.68 70.64 +0.42 85.84+0.54
MetaOptNet[14] 59.59 79.77 69.55 85.25
+MCL 60.66+1.07 81.27+1.50 70.11+0.55 85.76+0.51
DSN[25] 58.70 79.01 69.13 85.14
+MCL 60.08+1.38 80.66+1.65 69.59+0.46 85.61+0.47
Neg-cosine[19] 58.82 79.63 69.44 84.94
+MCL 60.20+1.38 81.04+1.41 69.85+0.41 85.31+0.37
Table 11: Few-shot classifications (%) before and after applying centrality weighted pooling. Experiment settings follow Table S8.

Appendix J Additional Visualization

Additional Grad-CAM visualizations, analogous to Table 5 of the main paper, are presented as follows: Table S12 illustrates the end-to-end MCL-Katz as in Table 5(b), and Table S13 illustrates the MCL plugin on ProtoNet [26] as in Table 5(a).

[Grad-CAM visualization grid; see the Table S12 caption below.]
Table 12: Additional Grad-CAM visualizations of MCL-katz on 5-way 1-shot FSL tasks (miniImageNet) with ResNet-12. Formatting follows Table 5(b): query images are placed in the first column for each task; ground truth support images are placed in the second column; the four images on the far right column of each task are from the confounding support classes.
[Grad-CAM visualization grid; see the Table S13 caption below.]
Table 13: Additional ProtoNet+MCL Grad-CAM visualizations on 5-way 1-shot FSL tasks (miniImageNet) with ResNet-12. Formatting follows Table 5(a): query images are placed in the first column for each task; ground truth support images are placed in the second column; the four images on the far right column of each task are from the confounding support classes. The top row in each task is from ProtoNet while the bottom row is from ProtoNet+MCL.