
Generalized Sum Pooling for Metric Learning

Yeti Z. Gürbüz† (RSiM, TU Berlin), Ozan Şener (Intel Labs), A. Aydın Alatan (OGAM and METU). †Affiliated with OGAM-METU during the research.
Abstract

A common architectural choice for deep metric learning is a convolutional neural network followed by global average pooling (GAP). Albeit simple, GAP is a highly effective way to aggregate information. One possible explanation for the effectiveness of GAP is considering each feature vector as representing a different semantic entity and GAP as a convex combination of them. Following this perspective, we generalize GAP and propose a learnable generalized sum pooling method (GSP). GSP improves GAP with two distinct abilities: i) the ability to choose a subset of semantic entities, effectively learning to ignore nuisance information, and ii) learning the weights corresponding to the importance of each entity. Formally, we propose an entropy-smoothed optimal transport problem and show that it is a strict generalization of GAP, i.e., a specific realization of the problem gives back GAP. We show that this optimization problem enjoys analytical gradients enabling us to use it as a direct learnable replacement for GAP. We further propose a zero-shot loss to ease the learning of GSP. We show the effectiveness of our method with extensive evaluations on 4 popular metric learning benchmarks. Code is available at: GSP-DML Framework

1 Introduction

Distance metric learning (DML) addresses the problem of finding an embedding function such that the semantically similar samples are embedded close to each other while the dissimilar ones are placed relatively apart in the Euclidean sense. Although the prolific and diverse literature of DML includes various architectural designs [1, 2, 3], loss functions [4], and data-augmentation techniques [5, 6], many of these methods have a shared component: a convolutional neural network (CNN) followed by a global pooling layer, mostly global average pooling (GAP) [4].

Common folklore to explain the effectiveness of GAP is considering each pixel of the CNN feature map as corresponding to a separate semantic entity [7]. For example, spatial extent of one pixel can correspond to a "tire" object making the resulting feature a representation for "tireness" of the image. If this explanation is correct, the representation space defined via output of GAP is a convex combination of semantically independent representations defined by each pixel in the feature map. Although this folklore is later empirically studied in [8, 9, 10, and references therein] and further verified for classification in [11, 12], its algorithmic implications are not clear. If each feature is truly representing a different semantic entity, should we really average over all of them? Surely, some classes belong to the background and should be discarded as nuisance variables. Moreover, is uniform average of them the best choice? Aren’t some classes more important than others? In this paper, we try to answer these questions within the context of metric learning. We propose a learnable and generalized version of GAP which learns to choose the subset of the semantic entities to utilize as well as weights to assign them while averaging.

In order to generalize the GAP operator to be learnable, we re-define it as the solution of an optimization problem. We let the solution space include 0-weights, effectively enabling us to choose a subset of the features, and carefully regularize it to discourage the degenerate solution of using all the features. Crucially, we rigorously show that the original GAP is a specific case of our proposed optimization problem for a certain realization. Our proposed optimization problem closely follows optimal transport based top-$k$ operators [13], and we utilize that literature to solve it. Moreover, we present an algorithm for an efficient computation of the gradients over this optimization problem, enabling learning. A critical desideratum for such an operator is choosing the subset of features which are discriminative and ignoring the background classes corresponding to nuisance variables. Although supervised metric learning losses provide guidance for seen classes, they carry no such information to generalize this behavior to unseen classes. To enable such a behavior, we adopt a zero-shot prediction loss as a regularization term, which is built on expressing the class label embeddings as a convex combination of attribute embeddings [14, 12].

In order to validate the theoretical claims, we design a synthetic empirical study. The results confirm that our pooling method chooses better subsets and improves generalization. Moreover, our method can be applied with any DML loss since GAP is a component shared by all of them. We apply our method to 6 DML losses and test on 4 datasets. The results show consistent improvements over the direct application of GAP as well as other pooling alternatives.

2 Related Work

Our contributions. Briefly, our contributions are that i) we introduce a general formulation for weighted sum pooling, ii) we formulate local feature selection as an optimization problem which admits a closed-form gradient expression without matrix inversion, and iii) we propose a meta-learning based zero-shot regularization term to explicitly impose unseen-class generalization on the DML problem. We now discuss the works which are most related to ours.

Distance Metric Learning (DML). Primary thrusts in DML include i) tailoring pairwise loss terms [4] that enforce the desired intra- and inter-class proximity constraints, ii) pair mining [5], iii) generating informative samples [15, 16, 17, 6], and iv) augmenting the mini-batches with virtual embeddings called proxies [18, 19]. To improve generalization, learning theoretic ideas [20, 21, 22], disentangling class-discriminative and class-shared features [2, 23], intra-batch feature aggregation [24, 25], ranking surrogates [26], and further regularization terms [27, 28, 29, 30, 31] are utilized. To go beyond a single model, ensemble [32, 1, 33, 34, 35] and multi-task based approaches [36, 37] are also used. Different to them, we propose a learnable pooling method generalizing GAP, a component shared by all of the mentioned works. Hence, our work is orthogonal to all of these and can be used jointly with any of them.

Prototype-based pooling. Most related to ours are trainable VLAD [38, 39] and optimal transport based aggregation [40, 41]. Such methods employ similarities to the prototypes to form a vector of aggregated local features for each prototype and build an ensemble of representations. Although their embeddings enjoy the $\ell_2$ metric for similarity, they typically have very large sizes, limiting their applicability for DML. Recently, reducing the dimension of the VLAD embedding via non-linear transforms has been addressed for DML [42]. Nevertheless, such methods still map a set of features to another set of features without discarding any and do not provide a natural way to aggregate the class-discriminative subset of the features. On the contrary, our pooling machine effectively enables learning to select discriminative features and maps a set of features to a single feature that is stripped of nuisance information.

Attention-based pooling. CroW [43], Trainable-SMK [44], and CBAM [45] reweight the CNN features before pooling. They build on feature magnitude based saliency, assuming that the backbone functions must be able to zero-out nuisance information. Yet, such a requirement is restrictive for the parameter space and annihilation of the non-discriminative information might not be feasible in some problems. Similarly, attention-based weighting methods DeLF [46], GSoP [47] do not have explicit control on feature selection behavior and might result in poor models when jointly trained with the feature extractor [46]. Differently, our method unifies attention-based feature masking practices (e.g., convolution, correlation) with an efficient-to-solve optimization framework and lets us do away with engineered heuristics in obtaining the masking weights (e.g., normalization, sigmoid, soft-plus) without restricting the solution space.

Optimal transport (OT) based operators. The OT distance [48] between local features is used as the DML distance metric instead of $\ell_2$ in [49]. Although effective, replacing $\ell_2$ with OT increases the memory cost of the image representation as well as the computation cost of the distance computation. Different to them, we shift the OT based computation to the pooling (i.e., feature extraction) stage while retaining OT's merits, and hence do not affect the memory and computation costs of inference since we stick to the $\ell_2$ metric. Moreover, our feature selection and aggregation formulation is closely related to OT [48] based top-$k$ [50], ranking [13] and aggregation [40, 41, 51] operators, which have not been effectively applied to DML before. Their aggregation is built on concatenating feature ensembles, resulting in very large embedding sizes. What makes our method different is the unique way we formulate the feature selection problem to fuse aggregation into it (see Appendix for technical details). Our selection operator allows a computationally appealing and matrix-inversion-free gradient computation, unlike its OT based counterparts [52].

Figure 1: Sketch of the method, where $Z=[z_{i}]_{i}$ (4.3) and the GSP vectors (4.1) are coloured w.r.t. their class label.

3 Preliminaries

Consider the data distribution $p_{\mathcal{X}\times\mathcal{Y}}$ over $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is the space of data points and $\mathcal{Y}$ is the space of labels. Given iid samples $\{(x_{i},y_{i})\}$ from $p_{\mathcal{X}\times\mathcal{Y}}$, the distance metric learning problem aims to find the parameters $\theta$ of an embedding function $e(\cdot;\theta)\colon\mathcal{X}\to\mathbb{R}^{d}$ such that the Euclidean distance in the space of embeddings is consistent with the label information, where $d$ is the embedding dimension. More specifically, $\|e(x_{i};\theta)-e(x_{j};\theta)\|_{2}$ is small whenever $y_{i}=y_{j}$, and large whenever $y_{i}\neq y_{j}$. In order to enable learning, this requirement is represented via a loss function $l((x_{i},y_{i}),(x_{j},y_{j});\theta)$ (e.g., contrastive [53], triplet [54], multi-similarity [55]).

The typical learning mechanism is gradient descent of an empirical risk function defined over a batch of data points $B$. To simplify notation throughout the paper, we will use $b=\{b(i)\mid (x_{i},y_{i})\in B\}_{i}$ to index the samples in a batch. Then, the typical empirical risk function is defined as:

$\mathcal{L}_{\text{DML}}(b;\theta)\coloneqq\tfrac{1}{|b|^{2}}\sum_{i\in b}\sum_{j\in b} l((x_{i},y_{i}),(x_{j},y_{j});\theta)\,.$ (3.1)
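To make the notation concrete, the following is a minimal NumPy sketch of the empirical risk in (3.1); the contrastive-style pairwise term and its margin are illustrative placeholders for $l(\cdot,\cdot)$, not the specific losses used in the experiments.

```python
import numpy as np

def contrastive_pair_loss(e_i, y_i, e_j, y_j, margin=0.5):
    """Illustrative pairwise term l((x_i, y_i), (x_j, y_j)); the margin is a placeholder."""
    d = np.linalg.norm(e_i - e_j)
    return d ** 2 if y_i == y_j else max(0.0, margin - d) ** 2

def dml_risk(embeddings, labels, pair_loss=contrastive_pair_loss):
    """Empirical risk (3.1): average the pairwise loss over all pairs in the batch."""
    b = len(labels)
    total = sum(pair_loss(embeddings[i], labels[i], embeddings[j], labels[j])
                for i in range(b) for j in range(b))
    return total / b ** 2
```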

We are interested in a specific class of embedding functions where a global average pooling is used. Specifically, consider the composite function family $e=g\circ f$ such that $g$ is pooling and $f$ is feature computation. We assume further structure over the functions $g$ and $f$. The feature function $f$ maps the input space $\mathcal{X}$ into $\mathbb{R}^{w\times h\times d}$, where $w$ and $h$ are spatial dimensions. Moreover, $g$ performs averaging as:

$g(f(x;\theta))=\frac{1}{w\cdot h}\sum_{i\in[w\cdot h]} f_{i}\,,$ (3.2)

where $[n]=\{1,\ldots,n\}$ and we let $f_{i}\in\mathbb{R}^{d}$ denote the $i^{\text{th}}$ spatial feature of $f(x;\theta)$ to simplify notation. In the rest of the paper, we generalize the pooling function $g$ into a learnable form and propose an algorithm to learn it.
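As a reference point for the generalization that follows, here is a minimal NumPy sketch of the GAP operation in (3.2), assuming the backbone output is given as a single $(w, h, d)$ array.

```python
import numpy as np

def global_average_pool(feature_map):
    """GAP as in (3.2): uniform average of the n = w*h local features f_i."""
    w, h, d = feature_map.shape
    f = feature_map.reshape(w * h, d)   # rows are the spatial features f_i
    return f.mean(axis=0)               # weights p_i = 1/n for every location

# e.g., a 7x7 map of 128-dimensional local features pools to a single 128-D vector
pooled = global_average_pool(np.random.randn(7, 7, 128))
```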

4 Method

Consider the pooling operation in (3.2): it is a simple averaging over the pixel-level feature maps ($f_{i}$). As we discuss in Sec. 1, one explanation for the effectiveness of this operation is considering each $f_{i}$ as corresponding to a different semantic entity associated with the spatial extent of the pixel, and the averaging as a convex combination over these semantic classes. Our method is based on generalizing this averaging such that a specific subset of pixels (correspondingly, a subset of semantic entities) is selected and their weights are adjusted according to their importance.

We generalize the pooling (3.2) in Sec. 4.1 by formulating a feature selection problem in which we prioritize a subset of the features that are closest to some trainable prototypes. If a feature is to be selected, its weight will be high. We then formulate our pooling operation as a differentiable layer to facilitate the prototype learning along with the rest of the embedding function parameters in Sec. 4.2. We learn the prototypes with class-level supervision; however, in metric learning, the learned representations should generalize to unseen classes. Thus, we introduce a zero-shot prediction loss to regularize prototype training for the zero-shot setting in Sec. 4.3.

4.1 Generalized Sum Pooling as a Linear Program

Consider the pooling function $g$ with adjustable weights as $g(f(x;\theta);\omega)=\sum_{i\in[n]}p_{i}f_{i}$, where $n=w\,h$. Note that $p_{i}=\nicefrac{1}{n}$ corresponds to average pooling. Informally, we want to control the weights to ease the metric learning problem. Specifically, we want the weights corresponding to background classes to be 0 and the ones corresponding to discriminative features to be high.

If we were given representations of discriminative semantic entities, we could simply compare them with the features (fif_{i}) and choose the ones with high similarity. Our proposed method is simply learning these representations and using them for weight computations. We first discuss the weight computation part before discussing learning the representations of prototypes.

Assume that there are $m$ discriminative semantic entities, which we call prototypes, with latent representations $\omega=\{\omega_{i}\}_{i\in[m]}$ of appropriate dimensions (same as $f_{i}$). Since we know that not all features ($\{f_{i}\}_{i\in[n]}$) are relevant, we need to choose a subset of $\{f_{i}\}_{i\in[n]}$. We perform this top-$k$ selection process by converting it into an optimal transport (OT) problem.

Consider a cost map $c_{ij}=\|\bar{\omega}_{i}-\bar{f}_{j}\|_{2}$, an $m$ (number of prototypes) by $n$ (number of features) matrix representing the closeness of the prototypes $\omega_{i}$ and the features $f_{j}$ after the normalization $\bar{u}=u/\max\{1,\|u\|_{2}\}$. We aim to find a transport map $\pi$ which re-distributes the uniform mass from the features to the prototypes. Since we do not have any prior information over the features, we also consider their marginal distribution (the importance of each feature to begin with) to be uniform. As we need to choose a subset, we set a ratio $\mu\in[0,1]$ of the mass to be transported. The resulting OT problem is:

$\rho^{\ast},\pi^{\ast}=\operatorname*{arg\,min}_{\rho,\pi\geqslant 0}\ \textstyle\sum_{ij}c_{ij}\pi_{ij}\quad\text{s.t.}\quad \rho_{j}+\textstyle\sum_{i}\pi_{ij}=\tfrac{1}{n},\ \ \textstyle\sum_{ij}\pi_{ij}=\mu.$ (P1)

Different to the typical OT literature, we introduce decision variables $\rho$ to represent the residual weights to be discarded. Explicitly modelling the discarded weight, instead of enforcing another marginalization constraint, is beneficial beyond stylistic choices since it allows us to compute gradients very efficiently. Once the introduced transport problem is solved, we perform weighting using the residual weights as:

$g(f(x;\theta);\omega)=\textstyle\sum_{i}p_{i}f_{i}=\textstyle\sum_{i}\tfrac{\nicefrac{1}{n}-\rho^{\ast}_{i}}{\mu}f_{i}\,.$ (4.1)
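For illustration only, the following sketch solves (P1) with a generic linear-programming solver (scipy.optimize.linprog) and forms the pooling weights of (4.1); the feature matrix, prototype matrix, and $\mu$ are placeholders, and the practical method instead solves the entropy-smoothed problem (P2) introduced below.

```python
import numpy as np
from scipy.optimize import linprog

def gsp_weights_lp(F, W, mu):
    """Solve (P1) for local features F (n, d) and prototypes W (m, d);
    return the pooling weights p_i = (1/n - rho_i) / mu of (4.1)."""
    normalize = lambda U: U / np.maximum(1.0, np.linalg.norm(U, axis=1, keepdims=True))
    Fb, Wb = normalize(F), normalize(W)
    m, n = W.shape[0], F.shape[0]
    C = np.linalg.norm(Wb[:, None, :] - Fb[None, :, :], axis=-1)   # cost map c_ij, shape (m, n)

    # decision variables x = [rho (n entries), pi flattened row-major (m*n entries)]
    obj = np.concatenate([np.zeros(n), C.ravel()])
    A_eq = np.zeros((n + 1, n + m * n))
    for j in range(n):                      # rho_j + sum_i pi_ij = 1/n
        A_eq[j, j] = 1.0
        A_eq[j, n + j::n] = 1.0
    A_eq[n, n:] = 1.0                       # sum_ij pi_ij = mu
    b_eq = np.concatenate([np.full(n, 1.0 / n), [mu]])
    rho = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).x[:n]
    return (1.0 / n - rho) / mu

# pooled = gsp_weights_lp(F, W, mu=0.3) @ F   # weighted sum of local features
```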

Given the set of prototypes $\{\omega_{i}\}_{i\in[m]}$, solving the problem in (P1) is a strict generalization of GAP since setting $\mu=1$ recovers the original GAP. We formalize this equivalence in the following claim.

Theorem 4.1.

If $\mu=1$, the operation in (4.1) reduces to global average pooling in (3.2).

We defer the proof to Appendix. Having generalized GAP to a learnable form, we introduce a method to learn the prototypes $\{\omega_{i}\}_{i\in[m]}$ in the next section.

4.2 GSP as a Differentiable Layer

Consider the generalized form of pooling, defined as the solution of (P1), as a layer of a neural network. The input is the feature vectors $\{f_{i}\}_{i\in[n]}$, the learnable parameters are the prototype representations $\{\omega_{i}\}_{i\in[m]}$, and the output is the residual weights $\rho^{\ast}$. To enable learning, we need the partial derivatives of $\rho^{\ast}$ with respect to $\{\omega_{i}\}_{i\in[m]}$. However, this function is not smooth. More importantly, it requires the $\mu$ parameter to be known a priori.

We use a toy example to set the stage for the rest of the formulation. Consider a $10{\times}10{\times}3$ feature map visualized as an RGB image in Fig. 2 and two corresponding prototypes with representations $(1,0,0)$ (red) and $(0,0,1)$ (blue). The true $\mu=0.5$ since half of the image corresponds to red and blue, and the other half is the background class of green. Consider an under-estimation of $\mu=0.2$: the global solution (shown as linear programming) explicitly ignores informative pixels (part of the red and blue regions). To solve this issue, we use entropy smoothing, which was first introduced in [48] to enable fast computation of OT. Consider the entropy-smoothed version of the original problem in (P1):

$\rho^{(\varepsilon)},\pi^{(\varepsilon)}=\operatorname*{arg\,min}_{\substack{\rho,\pi\geqslant 0\\ \rho_{j}+\Sigma_{i}\pi_{ij}=\nicefrac{1}{n}\\ \Sigma_{ij}\pi_{ij}=\mu}}\ \textstyle\sum_{ij}c_{ij}\pi_{ij}+\tfrac{1}{\varepsilon}\big(H(\pi)+H(\rho)\big),$ (P2)

and obtain the pooling weights by replacing $\rho^{\ast}$ with $\rho^{(\varepsilon)}$ in (4.1), where $H(u)\coloneqq\Sigma_{i}u_{i}\log u_{i}$.

Figure 2: The resultant pooling weights (the higher, the darker) of the different problems.

When the smoothing is high ($\varepsilon\to 0$), the resulting solution is uniform over the features, similar to GAP. When it is low, the result exhibits top-$k$-like behavior. For us, $\varepsilon$ controls the trade-off between picking the $\mu$ portion of the features that are closest to the prototypes and including as many features as possible in the weight transfer. We further visualize the solution of the entropy-smoothed problem in Fig. 2, showing desirable behavior even with an underestimated $\mu$.

Beyond alleviating the under-estimation of $\mu$, entropy smoothing also makes the problem strictly convex and smooth. Thus, the solution of the problem enables differentiation and, in fact, admits a closed-form gradient expression. We state the solution of (P2) and its corresponding gradient in the following propositions and defer their proofs to Appendix.

Proposition 4.2.

Given the initialization $t^{(0)}=1$, consider the following iteration:

$\rho^{(k+1)}=\nicefrac{1}{n}\,\big(1+t^{(k)}\exp(-\varepsilon c)^{\intercal}\bm{1}_{m}\big)^{-1},\qquad t^{(k+1)}=\mu\,\big(\bm{1}_{m}^{\intercal}\exp(-\varepsilon c)\,\rho^{(k+1)}\big)^{-1},$

where $\exp$ and $(\cdot)^{-1}$ are element-wise and $\bm{1}_{m}$ is the $m$-dimensional vector of ones. Then, $(\rho^{(k)},t^{(k)})$ converges to the solution of (P2), defining the transport map via $\pi^{(k)}=t^{(k)}\exp(-\varepsilon c)\,\mathrm{Diag}(\rho^{(k)})$.
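A minimal NumPy sketch of the Proposition 4.2 iteration is given below, assuming the cost map $c$ is already computed as an $m{\times}n$ array; the function and variable names are placeholders.

```python
import numpy as np

def gsp_forward(C, mu, eps, num_iters=100):
    """Fixed-point iteration of Prop. 4.2 for the cost map C of shape (m, n).
    Returns the residual weights rho (n,) and the transport map pi (m, n)."""
    m, n = C.shape
    K = np.exp(-eps * C)                                   # exp(-eps * c), element-wise
    t = 1.0                                                # t^(0) = 1
    for _ in range(num_iters):
        rho = (1.0 / n) / (1.0 + t * (K.T @ np.ones(m)))   # rho^(k+1)
        t = mu / (np.ones(m) @ K @ rho)                    # t^(k+1)
    pi = t * K * rho[None, :]                              # pi = t * exp(-eps*c) Diag(rho)
    return rho, pi

# rho, pi = gsp_forward(C, mu=0.3, eps=5.0); p = (1/n - rho) / mu gives the weights of (4.1)
```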

Proposition 4.3.

Consider any differentiable loss function $\mathcal{L}$ as a function of $(\rho,\pi)$, and suppose we are given the gradients $\frac{\partial\mathcal{L}}{\partial\rho}$ and $\frac{\partial\mathcal{L}}{\partial\pi}$, where $(\rho,\pi)$ is the solution of (P2). Let $q=\rho\odot\frac{\partial\mathcal{L}}{\partial\rho}+(\pi\odot\frac{\partial\mathcal{L}}{\partial\pi})^{\intercal}\bm{1}_{m}$ and $\eta=(\rho\odot\frac{\partial\mathcal{L}}{\partial\rho})^{\intercal}\bm{1}_{n}-n\,q^{\intercal}\rho$. Then the gradient of $\mathcal{L}$ with respect to $c$ reads:

$\frac{\partial\mathcal{L}}{\partial c}=-\varepsilon\Big(\pi\odot\frac{\partial\mathcal{L}}{\partial\pi}-n\,\pi\,\mathrm{Diag}\big(q-\tfrac{\eta}{1-\mu-n\rho^{\intercal}\rho}\big)\rho\Big)\,,$ (4.2)

where $\odot$ denotes element-wise multiplication.

Propositions 4.2 and 4.3 suggest that our feature-selective pooling can be implemented as a differentiable layer. Moreover, Proposition 4.3 gives a matrix-inversion-free computation of the gradient with respect to the costs, unlike optimal transport based operators [52]. Thus, the prototypes $\omega$ can be learned jointly and efficiently with the feature extraction.

4.3 Cross-batch Zero-shot Regularization

We have formulated a prototype-based feature pooling and learn the prototypes using class labels. Simply tying the prototypes to the class labels is a degenerate solution. We rather want the prototypes to capture transferable attributes so that the learning can be transferred to unseen classes as long as the attributes are shared. Learning with prototype-based pooling shapes the embedding geometry such that we have clusters corresponding to the prototypes in the embedding space. We want such clusters to carry transferable semantics rather than class-specific information. To enable this, we now formulate a mechanism to predict class embedding vectors from the prototype assignment vectors and use that mechanism to tailor a loss regularizing the prototypes to have transferable representations.

Our feature selection layer should learn discriminative feature prototypes $\omega$ using top-down label information. Consider two randomly selected batches $(b_{1},b_{2})$ of data sampled from the distribution. If the prototypes correspond to discriminative entities, the weights transferred to them (i.e., the marginal distribution of the prototypes) should be useful in predicting the classes, and such behavior should be consistent between batches for zero-shot prediction. Formally, if a class in $b_{2}$ does not exist in $b_{1}$, a predictor of class labels based on the marginal distribution of prototypes for each class of $b_{1}$ should still be useful for $b_{2}$. DML losses do not carry such information. We thus formulate a zero-shot prediction loss to enforce such zero-shot transfer.

We consider that we are given an embedding vector $\upsilon_{i}$ for each class label $i$, i.e., $\Upsilon=[\upsilon_{i}]_{i\in[c]}$ for $c$-many classes. We are to predict such embeddings from the marginal distribution of the prototypes. In particular, we use a linear predictor $A$ to predict the label embeddings as $\hat{\upsilon}=A\,z$, where $z$ is the normalized distribution of the weights on the prototypes;

$z=\tfrac{1}{\mu}\textstyle\sum_{i}\pi^{(\varepsilon)}_{i}\quad\text{where}\quad\pi^{(\varepsilon)}=[\pi^{(\varepsilon)}_{i}]_{i\in[n]}\,.$ (4.3)

If we consider the prototypes as semantic vectors of some auxiliary labels, such as attributes in zero-shot learning (ZSL) [14, 12, 56], then we can interpret $z$ as attribute predictions. Given the attribute predictions $\{z_{i}\}_{i\in b}$ and the corresponding class embeddings for a batch $b$, we fit the predictor as;

$A_{b}=\operatorname*{arg\,min}_{A=[a_{i}]_{i\in[m]}}\textstyle\sum_{i\in b}\|A\,z_{i}-\upsilon_{y_{i}}\|^{2}_{2}+\epsilon\textstyle\sum_{i\in[m]}\|a_{i}\|^{2}_{2}\,,$ (P3)

which admits a closed-form expression enabling back propagation, $A_{b}=\Upsilon_{b}\,(Z_{b}^{\intercal}Z_{b}+\epsilon I)^{-1}Z_{b}^{\intercal}$, where $\Upsilon_{b}=[\upsilon_{y_{i}}]_{i\in b}$ and $Z_{b}=[z_{i}]_{i\in b}$. In practice, we are not provided with the label embeddings $\Upsilon=[\upsilon_{i}]_{i\in[c]}$. Nevertheless, having a closed-form expression for $A_{b}$ enables us to exploit a meta-learning scheme like [57] to formulate a zero-shot prediction loss and learn them jointly with the rest of the parameters.
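The following NumPy sketch illustrates (4.3) and the closed-form ridge solution of (P3), assuming $z_i$ is computed from the transport map of a sample and the class embeddings are stacked row-wise; the function names and shape conventions are assumptions, not the released implementation.

```python
import numpy as np

def prototype_marginals(pi, mu):
    """z of (4.3): the prototype marginal of the transport map pi (m, n), normalized by mu."""
    return pi.sum(axis=1) / mu                          # shape (m,)

def fit_attribute_predictor(Z, labels, Upsilon, eps=0.05):
    """Closed-form ridge solution of (P3).
    Z: (b, m) rows z_i; labels: (b,) class indices; Upsilon: (c, d) class embeddings."""
    V = Upsilon[labels]                                 # targets upsilon_{y_i}, shape (b, d)
    m = Z.shape[1]
    # A = V^T Z (Z^T Z + eps I)^{-1}, computed without forming the inverse explicitly
    return np.linalg.solve(Z.T @ Z + eps * np.eye(m), Z.T @ V).T   # shape (d, m)
```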

Specifically, we split a batch $b$ into two, $b_{1}$ and $b_{2}$, such that their classes are disjoint (although we consider the simplest form, which already works well, repeating this splitting process can be beneficial). We then estimate the attribute embeddings $A_{b_{k}}$ according to (P3) using one set and use that estimate to predict the label embeddings of the other set to form a zero-shot prediction loss. Formally, our loss becomes:

$\mathcal{L}_{\text{ZS}}(b;\theta)=\tfrac{1}{|b_{2}|}\textstyle\sum_{i\in b_{2}}\log\big(1+\textstyle\sum_{j\in[c]}\mathrm{e}^{(\upsilon_{j}-\upsilon_{y_{i}})^{\intercal}A_{1}z_{i}}\big)+\tfrac{1}{|b_{1}|}\textstyle\sum_{i\in b_{1}}\log\big(1+\textstyle\sum_{j\in[c]}\mathrm{e}^{(\upsilon_{j}-\upsilon_{y_{i}})^{\intercal}A_{2}z_{i}}\big),$ (4.4)

i.e., a rearranged soft-max cross-entropy, where $A_{k}=A_{b_{k}}$ with a slight abuse of notation, and $\theta=\{\theta_{f},\omega,\Upsilon\}$ (i.e., CNN parameters, prototype vectors, label embeddings).
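A hedged sketch of the cross-batch loss in (4.4) is given below; it treats the inner sum as the usual soft-max cross-entropy rearrangement, re-implements the (P3) fit for self-containment, and all names and shapes are assumptions rather than the released code.

```python
import numpy as np

def ridge_fit(Z, y, Upsilon, eps=0.05):
    """Closed-form (P3) fit on one split: A = V^T Z (Z^T Z + eps I)^{-1}."""
    V, m = Upsilon[y], Z.shape[1]
    return np.linalg.solve(Z.T @ Z + eps * np.eye(m), Z.T @ V).T      # shape (d, m)

def zero_shot_loss(Z1, y1, Z2, y2, Upsilon, eps=0.05):
    """Cross-batch zero-shot loss of (4.4): fit on one split, score the other, and vice versa."""
    def ce(A, Z, y):
        logits = Z @ A.T @ Upsilon.T                    # (b, c): upsilon_j^T A z_i
        rows = np.arange(len(y))
        shifted = logits - logits[rows, y][:, None]     # (upsilon_j - upsilon_{y_i})^T A z_i
        mask = np.ones_like(shifted)
        mask[rows, y] = 0.0                             # soft-max cross-entropy rearrangement
        return np.mean(np.log1p((np.exp(shifted) * mask).sum(axis=1)))
    A1, A2 = ridge_fit(Z1, y1, Upsilon, eps), ridge_fit(Z2, y2, Upsilon, eps)
    return ce(A1, Z2, y2) + ce(A2, Z1, y1)
```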

We learn the attribute embeddings (i.e., the columns of $A$) as a sub-task and can define such learning as a differentiable operation. Intuitively, such a regularization should be useful for better generalization of our pooling operation to unseen classes since the attribute predictions are connected to the prototypes and the local features. We combine this loss with the metric learning loss using $\lambda$-mixing (i.e., $(1-\lambda)\mathcal{L}_{\text{DML}}+\lambda\mathcal{L}_{\text{ZS}}$) and jointly optimize.

Figure 3: Behavior analysis of GSP: (a) GAP vs GSP in aggregating features. The tokens represent learned embedding vectors, and the samples are derived from their aggregation. (b) Experiments on the Cifar Collage dataset: (b.1) presents the results, while (b.2) displays sample train and test images along with their attention maps in terms of pooling weights. Distilled denotes baseline performance on the non-collage dataset, excluding the shared classes.

4.4 Implementation Details

Embedding function. For the embedding function $f(\cdot;\theta)$ we use ResNet20 [58] for the Cifar [59] experiments, and ImageNet [60] pretrained BN-Inception [61] for the rest. We use the architectures up to the output before the global average pooling layer. We add a per-pixel linear transform (i.e., a $1{\times}1$ convolution) to the output to obtain local embedding vectors of size 128.

Pooling layer. For the baseline methods, we use global average pooling. For our method, we perform a parameter search and set the hyperparameters accordingly. Specifically, we use 64 or 128 prototypes depending on the dataset. We use $\varepsilon=0.5$ for proxy-based losses and $\varepsilon=5.0$ for non-proxy losses. For the rest, we set $\mu=0.3$, $\epsilon=0.05$, $\lambda=0.1$, and we iterate until $k=100$ in Proposition 4.2. The embedding vectors upon pooling are $\ell_2$-normalized to have unit norm.
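As an illustration of this head, a minimal TensorFlow-style sketch is given below; the layer and function names are placeholders, and the default pooling shown is GAP, into which the GSP weights of (4.1) can be substituted.

```python
import tensorflow as tf

# Per-pixel linear transform producing 128-D local embeddings on top of the backbone map.
to_local_embedding = tf.keras.layers.Conv2D(filters=128, kernel_size=1)

def embed(feature_map, pool_fn=lambda f: tf.reduce_mean(f, axis=[1, 2])):
    """feature_map: (batch, w, h, C) backbone output; pool_fn defaults to GAP and
    can be swapped for a GSP weighting of the local features."""
    local = to_local_embedding(feature_map)        # (batch, w, h, 128)
    pooled = pool_fn(local)                        # (batch, 128)
    return tf.math.l2_normalize(pooled, axis=-1)   # unit-norm embeddings
```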

5 Experiments

We start our empirical study with a synthetic experiment validating the role of GAP in learning and the impact of GSP on the feature geometry. We then examine the effectiveness of our generalized sum pooling in metric learning for various models and datasets. We also perform ablation studies on the implications of our formulation as well as the effects of the hyperparameters. We share the implementation details and the complete TensorFlow [62] code base in the supplementary material.

5.1 Analysis of the Behavior

Synthetic study. We designed a synthetic empirical study to evaluate GSP in a fully controlled manner. We considered a 16-class problem in which the classes are defined over trainable tokens. In this setting, the tokens correspond to semantic entities, but we deliberately use this wording to emphasize that they are trained as part of the learning. Each class is defined by 4 distinct tokens, and there are also 4 background tokens shared by all classes. For example, a "car" class would have tokens like "tire" and "window" as well as background tokens of "tree" and "road". We sampled class representations from both class-specific and background tokens according to a mixing ratio $\tilde{\mu}\sim\mathcal{N}(0.5,0.1)$. Such a 50-many feature collection corresponds to a training sample (i.e., we are mimicking a CNN's output with trainable tokens). We then obtained global representations using GAP and GSP. We visualize the geometry of the embedding space in Fig. 3-(a) along with the DML performances. With GAP, we observe overlapping class convex hulls resulting in poor DML performance. However, GSP gives well-separated class convex hulls, further validating that it learns to ignore the background tokens.

Selective pooling. We further extended this synthetic study to the image domain by considering the 20 super-classes of the Cifar100 dataset [59], each of which has 5 sub-classes. For each super-class, we split the sub-classes into train (2), validation (1), and test (2). We consider 4 super-classes as shared classes and compose $4{\times}4$-stitched collage images for the remaining 16 classes (see the supplementary material for details). In particular, we sample an image from a class and then sample 3 images from the shared classes (Fig. 4). We used a ResNet20 backbone pretrained on the Cifar100 classification task and followed the implementation explained in Sec. 4.4. We provide the evaluation results in Fig. 3-(b). GSP and the proposed zero-shot loss effectively increase MAP@R. We also provide sample train and test images to showcase that our pooling transfers well to unseen classes.

Figure 4: Sample generation for the Cifar Collage dataset.

Zero-shot regularization. We also evaluated the zero-shot prediction performance of the attribute vectors. We trained on the Cifar10 dataset with 8 prototypes using the ProxyNCA++ [19] (PNCA) loss with and without $\mathcal{L}_{\text{ZS}}$. We then extracted attribute histograms for each class and visualized them in Fig. 5. We observe transferable representations with $\mathcal{L}_{\text{ZS}}$, and we visually show in Fig. 5 that the semantic entities represented by the prototypes transfer across classes. We quantitatively evaluated this behavior by randomly splitting the classes into halves and applying the cross-batch zero-shot prediction explained in Sec. 4.3. Namely, we fit $A$ in (P3) for one subset and used it to predict the class embeddings for the other set. We pre-computed the class embeddings from the dataset as the class means. To this end, our evaluation assesses the generalization of both the features and the prototypes. We used MAP with both the $\ell_2$ distance and cosine similarity in our evaluation. We repeated the experiment 1000 times. We observe in Fig. 5 that the zero-shot performance of the prototypes learned with $\mathcal{L}_{\text{ZS}}$ is substantially superior. We also see that our feature aggregation method enables approximate localization of the semantic entities. Recent ZSL approaches [56, 12] can provide attribute localization and share a similar spirit with our method. However, those methods require attribute annotations, whereas we exploit only class labels to extract attribute-like features. Thus, our method can be considered an attribute-unsupervised alternative to them.

Figure 5: Comparing the distributions of the 8 learned prototypes across the classes of the Cifar10 dataset with and without $\mathcal{L}_{\text{ZS}}$. Pooling weights are coloured according to the dominant prototype at that location.

Figure 6: Summary of relative improvements: each mark without a dashed line represents a baseline DML loss using GAP, unless stated otherwise. Similar losses are color-coded. Dashed lines indicate the performance change when replacing GAP with the associated pooling method (the mark with the dashed line). Pooling methods applied on top of GMP or GMean are combined with them instead of replacing them.

5.2 Deep Metric Learning Experiments

We largely rely on the recent work explicitly studying the fair evaluation strategies for metric learning [5, 4, 63]. Specifically, we follow the procedures proposed in [4] to evaluate our method as well as the other methods. We additionally follow the relatively old-fashioned conventional procedure [64] for the evaluation of our method and provide those results in the supplementary material. We provide full detail of our experimental setup in the supplementary material for complete transparency and reproducibility.

Datasets. CUB [65], Cars196 [66], In-shop [67], and SOP [64] with typical DML data augmentation [4].

Evaluation metrics. We report mean average precision at R (MAP@R), where R is defined for each query as the total number of true references of that query.

Hyperparameters. We use the Adam [68] optimizer with learning rate $10^{-5}$, weight decay $10^{-4}$, and batch size 32 (4 samples per class). We train 4-fold: 4 models, one for each $\nicefrac{3}{4}$ partition of the train set.

Evaluation. We report the average performance (128D), where each of the 4-fold models is trained 3 times, resulting in $3^{4}=81$ realizations of different model collections. In our results we provide the mean of the 81 evaluations.

Baselines. We implement our method on top of and compare with Contrastive (C2): Contrastive with positive margin [53], MS: Multi-similarity [55], Triplet: Triplet [54], XBM: Cross-batch memory [18] with contrastive loss [69], PNCA: ProxyNCA++ [19], PAnchor: ProxyAnchor [70].

5.2.1 Results

We compared our method (GSP) against the direct application of GAP with 6 DML methods on 4 datasets. We also evaluated 14 additional pooling alternatives on the Cifar Collage and CUB datasets. We provide the tabulated results in the supplementary material. Based on the CUB performances, we picked generalized mean pooling (GeMean) [71] and DeLF [46] to compare against on the 4 DML benchmarks. We also evaluated max pooling (GMP) and its combination with GAP, as we typically observe GAP+GMP in recent works [6, 19, 70, 18]. We also applied our method with GMP (GMP+GSP) and with GeMean (GeMean+GSP) to show that per-channel selection is orthogonal to our method and thus GSP can marginally improve those methods as well.

We present the tabulated evaluation results in Tab. 2, while Fig. 6 provides a concise summary of the relative MAP@R orderings of the methods employing 128D embeddings. We observe consistent improvements over the direct application of GAP in all datasets. On average, we consistently improve the baselines by ${\approx}1\%$ points in MAP@R. Our improvement margins are superior to those of the attention-based DeLF pooling. We improve the state-of-the-art (SOTA) XBM method by up to $2\%$ points, which is good evidence that the application of GSP is not limited to loss terms but can be combined with different DML approaches. We also consistently improve the GMP and GeMean pooling methods in all datasets, yet another evidence that our method can be combined with max-pooling-based methods.

We additionally evaluated our method with different architectures and methods in conventional setting [64] for the comparison with SOTA. The tabulated results are provided in supplementary material (§ 1.1), where we observe that we achieve SOTA performances with XBM [18] and LIBC [24] methods.

Table 1: Effects of $\mathcal{L}_{\text{ZS}}$ and GSP with C2 loss (MAP@R)
Dataset →                        SOP            In-shop        CUB            Cars196
$\mathcal{L}_{\text{ZS}}$  GSP   512D   128D    512D   128D    512D   128D    512D   128D
                                 45.85  41.79   59.07  55.38   25.95  20.58   24.38  17.02
                                 46.78  42.66   59.46  55.50   26.25  20.85   25.54  17.88
                                 46.60  42.55   59.38  55.43   26.49  21.08   25.54  17.67
                                 46.81  42.84   60.01  55.94   27.12  21.52   26.25  18.31

5.2.2 Ablations

Effect of $\mathcal{L}_{\text{ZS}}$. We empirically showed the effect of $\mathcal{L}_{\text{ZS}}$ on the learned representations in Sec. 5.1. We further examined the effect of $\mathcal{L}_{\text{ZS}}$ quantitatively by enabling/disabling it. We also evaluated its effect without GSP by setting $\mu=1$, where we used GAP with the attribute vectors. The results are summarized in Tab. 1, showing that both components improve the baseline and their combination brings the best improvement. We observe similar behavior in the Cifar Collage experiment (Fig. 3-(b)), where the effect of $\mathcal{L}_{\text{ZS}}$ is more substantial.

Computation efficiency. Our pooling mechanism determines the pooling weights through a series of $k$ iterations of lightweight matrix-vector products. While back propagation can be performed via automatic differentiation, it can become computationally intensive as $k$ increases for certain problems. However, our pooling mechanism has the desirable property of a closed-form gradient expression for its backward computation, so the computation load scales minimally as $k$ increases, as evidenced by our analysis in Fig. 7. We further provide the inference times for various $k$ in the supplementary material (§ 1.5).

Effect of $\mu$. As outlined in Sec. 4.2, GSP is similar to a top-$k$ operator with an adaptive $k$ thanks to entropy smoothing. We empirically validated this behavior by sweeping the $\mu$ parameter controlling the top-$k$ behavior. The results, plotted in Fig. 7, show similar performance for lower $\mu$ values, with a decrease as $\mu$ increases, possibly due to overestimation of the foreground ratio. Hence, a suggested value for $\mu$ is $0.3$.

Figure 7: Efficiency of the closed-form gradient (left) and effect of $\mu$ (right). Shaded regions represent $\mp$ std.
Table 2: Comparison with the existing methods for the retrieval task on SOP, In-shop, CUB, Cars. Experimental setting follows MLRC [4]. Each entry reports mean∓1 std.
Dataset →    SOP                        In-shop                    CUB                        Cars196
Dim. →       512D         128D          512D         128D          512D         128D          512D         128D
Method ↓     P@1  MAP@R   P@1  MAP@R    P@1  MAP@R   P@1  MAP@R    P@1  MAP@R   P@1  MAP@R    P@1  MAP@R   P@1  MAP@R
C1+
GAP          69.29∓0.11  40.40∓0.15  65.15∓0.10  36.50∓0.11  80.11∓0.19  50.32∓0.14  75.83∓0.14  46.42∓0.13  63.32∓0.57  23.49∓0.31  56.34∓0.35  19.37∓0.29  78.01∓0.38  22.87∓0.33  65.61∓0.59  16.06∓0.14
XBM+GAP      76.54∓0.32  48.58∓0.47  73.22∓0.48  44.55∓0.57  87.76∓0.26  57.53∓0.41  85.26∓0.37  54.40∓0.45  65.56∓0.48  25.65∓0.24  57.48∓0.41  20.27∓0.19  83.55∓0.35  27.53∓0.22  72.17∓0.30  18.98∓0.17
XBM+GSP      77.88∓0.18  50.65∓0.28  74.84∓0.19  46.69∓0.28  88.33∓0.19  58.55∓0.29  85.95∓0.21  55.30∓0.21  67.00∓0.49  26.05∓0.15  58.89∓0.49  20.60∓0.16  83.31∓0.22  27.88∓0.23  73.04∓0.39  19.26∓0.19
C2+
GAP          74.20∓0.23  45.85∓0.31  70.54∓0.19  41.79∓0.26  86.47∓0.15  59.07∓0.21  83.42∓0.12  55.38∓0.13  67.35∓0.50  25.95∓0.21  58.87∓0.36  20.58∓0.13  80.96∓0.48  24.38∓0.58  69.55∓0.42  17.02∓0.31
GSP          74.91∓0.12  46.81∓0.17  71.43∓0.11  42.84∓0.14  86.90∓0.17  60.01∓0.29  83.57∓0.18  55.94∓0.17  68.85∓0.41  27.12∓0.27  60.42∓0.36  21.52∓0.16  82.83∓0.27  26.25∓0.34  71.40∓0.27  18.31∓0.22
DeLF         74.59∓0.15  46.54∓0.19  45.53∓0.18  42.47∓0.17  86.65∓0.16  59.20∓0.22  83.51∓0.09  55.36∓0.12  68.66∓0.32  27.06∓0.18  59.85∓0.18  21.42∓0.16  81.85∓0.41  24.77∓0.38  69.95∓0.38  17.32∓0.25
GeMean       74.92∓0.13  46.99∓0.15  71.53∓0.11  43.12∓0.12  86.62∓0.15  59.12∓0.19  83.83∓0.09  55.70∓0.12  68.79∓0.36  27.12∓0.19  60.37∓0.30  21.50∓0.15  82.43∓0.60  25.27∓0.63  70.23∓0.55  17.41∓0.45
GeMean+GSP   75.32∓0.08  47.69∓0.13  71.93∓0.10  43.71∓0.13  86.94∓0.15  59.98∓0.21  84.35∓0.19  56.34∓0.14  69.11∓0.49  27.56∓0.18  60.81∓0.34  21.84∓0.19  83.62∓0.36  26.98∓0.31  72.38∓0.28  19.05∓0.22
GMP          74.09∓0.15  46.13∓0.19  69.68∓0.20  41.31∓0.22  86.38∓0.12  59.04∓0.10  83.04∓0.13  54.89∓0.07  68.13∓0.40  26.43∓0.21  58.99∓0.34  20.66∓0.18  81.83∓0.62  25.11∓0.72  69.05∓0.61  17.08∓0.47
GMP+GAP      74.71∓0.11  46.70∓0.15  70.83∓0.10  42.38∓0.15  86.58∓0.16  59.22∓0.18  83.41∓0.12  55.37∓0.15  67.88∓0.48  26.63∓0.23  59.24∓0.32  20.88∓0.17  82.14∓0.40  25.66∓0.44  69.81∓0.38  17.62∓0.32
GMP+GSP      75.08∓0.10  47.12∓0.17  71.18∓0.15  42.80∓0.18  86.79∓0.16  59.43∓0.28  83.86∓0.15  55.76∓0.19  68.47∓0.58  27.49∓0.36  60.19∓0.41  21.69∓0.35  82.54∓0.46  26.30∓0.43  71.03∓0.48  18.24∓0.29
MS+
GAP          72.81∓0.14  44.19∓0.21  69.09∓0.10  40.34∓0.16  87.01∓0.20  58.79∓0.37  83.87∓0.21  54.85∓0.34  65.43∓0.46  24.95∓0.15  57.57∓0.27  20.13∓0.12  83.73∓0.34  27.16∓0.43  72.54∓0.43  18.73∓0.31
GSP          73.05∓0.11  44.72∓0.17  69.44∓0.15  40.87∓0.19  88.28∓0.21  60.49∓0.24  85.28∓0.19  56.62∓0.26  65.50∓0.33  25.09∓0.21  57.39∓0.15  20.34∓0.22  84.27∓0.35  28.58∓0.40  73.74∓0.32  19.91∓0.31
Triplet+
GAP          74.54∓0.24  45.88∓0.30  69.41∓0.38  40.01∓0.39  85.99∓0.36  59.67∓0.46  81.75∓0.38  54.25∓0.45  64.11∓0.66  23.65∓0.40  55.62∓0.46  18.54∓0.31  77.58∓0.60  22.67∓0.58  64.61∓0.59  15.74∓0.34
GSP          75.59∓0.23  47.35∓0.32  70.65∓0.20  41.38∓0.22  86.75∓0.27  60.85∓0.47  82.74∓0.33  55.54∓0.46  66.09∓0.52  24.80∓0.33  57.12∓0.42  19.38∓0.25  78.93∓0.30  23.44∓0.29  65.81∓0.35  16.14∓0.21
PNCA+
GAP          75.18∓0.15  47.11∓0.16  72.15∓0.06  43.57∓0.08  87.26∓0.14  57.43∓0.14  84.86∓0.08  54.41∓0.10  65.74∓0.51  25.27∓0.23  58.19∓0.36  20.63∓0.20  82.33∓0.25  26.21∓0.22  70.75∓0.18  18.61∓0.08
GSP          75.68∓0.11  47.74∓0.14  72.37∓0.06  43.95∓0.06  87.35∓0.10  57.65∓0.12  85.13∓0.10  54.68∓0.08  65.80∓0.38  25.48∓0.25  58.20∓0.22  20.75∓0.19  82.70∓0.27  26.93∓0.18  71.55∓0.32  19.20∓0.17
DeLF         75.29∓0.09  47.44∓0.11  72.05∓0.06  43.62∓0.07  87.19∓0.11  57.44∓0.16  84.55∓0.04  54.13∓0.10  65.42∓0.34  25.31∓0.16  57.98∓0.24  20.51∓0.14  82.37∓0.35  26.63∓0.22  71.06∓0.27  18.81∓0.14
GeMean       75.64∓0.09  47.82∓0.09  72.75∓0.07  44.43∓0.06  87.63∓0.10  57.88∓0.13  85.48∓0.14  55.14∓0.12  66.33∓0.33  25.74∓0.20  58.52∓0.39  20.71∓0.20  83.83∓0.29  27.44∓0.15  72.14∓0.28  19.16∓0.12
GeMean+GSP   75.89∓0.11  48.17∓0.12  72.91∓0.04  44.61∓0.06  87.64∓0.10  58.12∓0.16  85.58∓0.07  55.25∓0.08  67.39∓0.53  26.19∓0.26  59.39∓0.40  21.31∓0.21  83.09∓0.25  27.96∓0.30  71.95∓0.27  19.74∓0.19
GMP          74.43∓0.08  46.33∓0.08  70.80∓0.07  42.24∓0.08  86.94∓0.13  56.79∓0.13  84.53∓0.08  53.86∓0.09  65.74∓0.51  25.36∓0.29  57.61∓0.38  20.33∓0.29  83.06∓0.33  26.96∓0.27  71.19∓0.25  18.92∓0.15
GMP+GAP      75.19∓0.09  47.26∓0.11  71.97∓0.04  43.55∓0.06  87.21∓0.14  57.34∓0.15  84.95∓0.09  54.42∓0.10  65.91∓0.35  25.56∓0.26  57.92∓0.37  20.68∓0.20  82.92∓0.41  26.92∓0.36  71.33∓0.22  18.95∓0.19
GMP+GSP      75.41∓0.12  47.50∓0.12  72.10∓0.07  43.73∓0.09  87.43∓0.10  57.68∓0.14  85.10∓0.10  54.70∓0.08  66.14∓0.48  25.85∓0.23  58.12∓0.32  20.96∓0.18  83.46∓0.31  27.12∓0.21  72.04∓0.39  19.38∓0.20
PAnchor+
GAP          76.48∓0.19  48.08∓0.26  73.50∓0.14  44.33∓0.20  88.02∓0.21  58.02∓0.25  85.83∓0.18  54.98∓0.22  68.04∓0.41  26.20∓0.21  59.91∓0.34  20.94∓0.15  85.26∓0.31  27.14∓0.20  75.08∓0.23  19.15∓0.13
GSP          77.13∓0.16  49.05∓0.22  74.07∓0.13  45.07∓0.17  88.10∓0.11  58.44∓0.14  85.97∓0.06  55.34∓0.13  68.40∓0.45  26.59∓0.25  60.80∓0.31  21.44∓0.17  86.46∓0.39  28.43∓0.33  75.88∓0.25  19.90∓0.20

6 Conclusion

We proposed a learnable and generalized version of GAP. Our generalization, GSP, is a trainable pooling layer that selects a subset of the features and re-weights them during pooling. To enable effective learning of our layer, we also proposed a cross-batch regularization that improves zero-shot transfer. With extensive empirical studies, we validated the effectiveness of the proposed pooling layer on various DML benchmarks. We also established and empirically validated a valuable link between the transport maps computed for pooling and the prototypes, enabling attribute learning via meta-learning without explicit localized annotations. We believe such a connection is interesting and has the potential to offer pathways towards enhanced embedding learning for DML as well as unsupervised attribute learning.

References

  • [1] W.Kim, B.Goyal, K.Chawla, J.Lee, and K.Kwon, “Attention-based ensemble for deep metric learning,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 736–751.
  • [2] X.Lin, Y.Duan, Q.Dong, J.Lu, and J.Zhou, “Deep variational metric learning,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [3] A.Ermolov, L.Mirvakhabova, V.Khrulkov, N.Sebe, and I.Oseledets, “Hyperbolic vision transformers: Combining improvements in metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7409–7419.
  • [4] K.Musgrave, S.Belongie, and S.-N.Lim, “A metric learning reality check,” in European Conference on Computer Vision. Springer, 2020, pp. 681–699.
  • [5] K.Roth, T.Milbich, S.Sinha, P.Gupta, B.Ommer, and J. P.Cohen, “Revisiting training strategies and generalization performance in deep metric learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 8242–8252.
  • [6] S.Venkataramanan, B.Psomas, E.Kijak, laurent amsaleg, K.Karantzalos, and Y.Avrithis, “It takes two to tango: Mixup for deep metric learning,” in International Conference on Learning Representations, 2022.
  • [7] Y. Z.Gurbuz and A. A.Alatan, “Generalizable embeddings with cross-batch metric learning,” arXiv preprint arXiv:2307.07620, 2023.
  • [8] M. D.Zeiler and R.Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.
  • [9] B.Zhou, A.Khosla, A.Lapedriza, A.Oliva, and A.Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  • [10] B.Zhou, D.Bau, A.Oliva, and A.Torralba, “Interpreting deep visual representations via network dissection,” IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [11] Y. Z.Gürbüz and A. A.Alatan, “A novel bovw mimicking end-to-end trainable cnn classification framework using optimal transport theory,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3053–3057.
  • [12] W.Xu, Y.Xian, J.Wang, B.Schiele, and Z.Akata, “Attribute prototype network for zero-shot learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 21969–21980, 2020.
  • [13] M.Cuturi, O.Teboul, and J.-P.Vert, “Differentiable ranking and sorting using optimal transport,” Advances in neural information processing systems, vol. 32, 2019.
  • [14] B.Demirel, R.Gokberk Cinbis, and N.Ikizler-Cinbis, “Attributes2classname: A discriminative model for attribute-based unsupervised zero-shot learning,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1232–1241.
  • [15] B.Ko and G.Gu, “Embedding expansion: Augmentation in embedding space for deep metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7255–7264.
  • [16] C.Liu, H.Yu, B.Li, Z.Shen, Z.Gao, P.Ren, X.Xie, L.Cui, and C.Miao, “Noise-resistant deep metric learning with ranking-based instance selection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6811–6820.
  • [17] G.Gu, B.Ko, and H.-G.Kim, “Proxy synthesis: Learning with synthetic classes for deep metric learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 1460–1468.
  • [18] X.Wang, H.Zhang, W.Huang, and M. R.Scott, “Cross-batch memory for embedding learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6388–6397.
  • [19] E. W.Teh, T.DeVries, and G. W.Taylor, “Proxynca++: Revisiting and revitalizing proxy neighborhood component analysis,” in European Conference on Computer Vision (ECCV). Springer, 2020.
  • [20] M.Dong, X.Yang, R.Zhu, Y.Wang, and J.Xue, “Generalization bound of gradient descent for non-convex metric learning,” Neural Information Processing Systems Foundation, 2020.
  • [21] Y.Lei, M.Liu, and Y.Ying, “Generalization guarantee of sgd for pairwise learning,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [22] Y. Z.Gurbuz, O.Can, and A. A.Alatan, “Deep metric learning with chance constraints,” arXiv preprint arXiv:2209.09060, 2022.
  • [23] K.Roth, B.Brattoli, and B.Ommer, “Mic: Mining interclass characteristics for improved metric learning,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [24] J.Seidenschwarz, I.Elezi, and L.Leal-Taixé, “Learning intra-batch connections for deep metric learning,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. 2021, vol. 139 of Proceedings of Machine Learning Research, pp. 9410–9421, PMLR.
  • [25] J.Lim, S.Yun, S.Park, and J. Y.Choi, “Hypergraph-induced semantic tuplet loss for deep metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • [26] Y.Patel, G.Tolias, and J.Matas, “Recall@ k surrogate loss with large batches and similarity mixup,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7502–7511.
  • [27] P.Jacob, D.Picard, A.Histace, and E.Klein, “Metric learning with horde: High-order regularizer for deep embeddings,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [28] D.Zhang, Y.Li, and Z.Zhang, “Deep metric learning with spherical embedding,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [29] O.Can, Y. Z.Gürbüz, and A. A.Alatan, “Deep metric learning with alternating projections onto feasible sets,” in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 1264–1268.
  • [30] Y.Kim and W.Park, “Multi-level distance regularization for deep metric learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35.
  • [31] K.Roth, O.Vinyals, and Z.Akata, “Non-isotropy regularization for proxy-based deep metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7420–7430.
  • [32] H.Xuan, R.Souvenir, and R.Pless, “Deep randomized ensembles for metric learning,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 723–734.
  • [33] A.Sanakoyeu, V.Tschernezki, U.Buchler, and B.Ommer, “Divide and conquer the embedding space for metric learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [34] W.Zheng, C.Wang, J.Lu, and J.Zhou, “Deep compositional metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9320–9329.
  • [35] W.Zheng, B.Zhang, J.Lu, and J.Zhou, “Deep relational metric learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12065–12074.
  • [36] T.Milbich, K.Roth, H.Bharadhwaj, S.Sinha, Y.Bengio, B.Ommer, and J. P.Cohen, “Diva: Diverse visual feature aggregation for deep metric learning,” in European Conference on Computer Vision. Springer, 2020, pp. 590–607.
  • [37] K.Roth, T.Milbich, B.Ommer, J. P.Cohen, and M.Ghassemi, “S2sd: Simultaneous similarity-based self-distillation for deep metric learning,” in ICML 2021: 38th International Conference on Machine Learning, 2021, pp. 9095–9106.
  • [38] H.Zhang, J.Xue, and K.Dana, “Deep ten: Texture encoding network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 708–717.
  • [39] R.Arandjelovic, P.Gronat, A.Torii, T.Pajdla, and J.Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5297–5307.
  • [40] S.Kolouri, N.Naderializadeh, G. K.Rohde, and H.Hoffmann, “Wasserstein embedding for graph learning,” in International Conference on Learning Representations, 2020.
  • [41] G.Mialon, D.Chen, A.d’Aspremont, and J.Mairal, “A trainable optimal transport embedding for feature aggregation and its relationship to attention,” in ICLR 2021-The Ninth International Conference on Learning Representations, 2021.
  • [42] S.KAN, Y.Liang, M.Li, Y.Cen, J.Wang, and Z.He, “Coded residual transform for generalizable deep metric learning,” in Advances in Neural Information Processing Systems.
  • [43] Y.Kalantidis, C.Mellina, and S.Osindero, “Cross-dimensional weighting for aggregated deep convolutional features,” in European conference on computer vision. Springer, 2016, pp. 685–701.
  • [44] G.Tolias, T.Jenicek, and O.Chum, “Learning and aggregating deep local descriptors for instance-level recognition,” in European Conference on Computer Vision. Springer, 2020, pp. 460–477.
  • [45] S.Woo, J.Park, J.-Y.Lee, and I. S.Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
  • [46] H.Noh, A.Araujo, J.Sim, T.Weyand, and B.Han, “Large-scale image retrieval with attentive deep local features,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [47] Z.Gao, J.Xie, Q.Wang, and P.Li, “Global second-order pooling convolutional networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3024–3033.
  • [48] M.Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems, vol. 26, 2013.
  • [49] W.Zhao, Y.Rao, Z.Wang, J.Lu, and J.Zhou, “Towards interpretable deep metric learning with structural matching,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9887–9896.
  • [50] Y.Xie, H.Dai, M.Chen, B.Dai, T.Zhao, H.Zha, W.Wei, and T.Pfister, “Differentiable top-k with optimal transport,” Advances in Neural Information Processing Systems, vol. 33, pp. 20520–20531, 2020.
  • [51] N.Naderializadeh, J. F.Comer, R.Andrews, H.Hoffmann, and S.Kolouri, “Pooling by sliced-wasserstein embedding,” Advances in Neural Information Processing Systems, vol. 34, pp. 3389–3400, 2021.
  • [52] G.Luise, A.Rudi, M.Pontil, and C.Ciliberto, “Differential properties of sinkhorn approximation for learning with wasserstein distance,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [53] C.-Y.Wu, R.Manmatha, A. J.Smola, and P.Krahenbuhl, “Sampling matters in deep embedding learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
  • [54] F.Schroff, D.Kalenichenko, and J.Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • [55] X.Wang, X.Han, W.Huang, D.Dong, and M. R.Scott, “Multi-similarity loss with general pair weighting for deep metric learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [56] D.Huynh and E.Elhamifar, “Fine-grained generalized zero-shot learning via dense attribute-based attention,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4483–4493.
  • [57] L.Bertinetto, J. F.Henriques, P.Torr, and A.Vedaldi, “Meta-learning with differentiable closed-form solvers,” in International Conference on Learning Representations, 2018.
  • [58] K.He, X.Zhang, S.Ren, and J.Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
  • [59] A.Krizhevsky and G.Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.
  • [60] O.Russakovsky, J.Deng, H.Su, J.Krause, S.Satheesh, S.Ma, Z.Huang, A.Karpathy, A.Khosla, M.Bernstein, et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [61] S.Ioffe and C.Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. PMLR, 2015, pp. 448–456.
  • [62] M.Abadi, P.Barham, J.Chen, Z.Chen, A.Davis, J.Dean, M.Devin, S.Ghemawat, G.Irving, M.Isard, et al., “Tensorflow: a system for large-scale machine learning.,” in OSDI, 2016, vol. 16, pp. 265–283.
  • [63] I.Fehervari, A.Ravichandran, and S.Appalaraju, “Unbiased evaluation of deep metric learning algorithms,” arXiv preprint arXiv:1911.12528, 2019.
  • [64] H.Oh Song, Y.Xiang, S.Jegelka, and S.Savarese, “Deep metric learning via lifted structured feature embedding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4004–4012.
  • [65] C.Wah, S.Branson, P.Welinder, P.Perona, and S.Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.
  • [66] A.Krause and D.Golovin, “Submodular function maximization,” in Tractability: Practical Approaches to Hard Problems, pp. 71–104. Cambridge University Press, 2014.
  • [67] Z.Liu, P.Luo, S.Qiu, X.Wang, and X.Tang, “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1096–1104.
  • [68] D. P.Kingma and J.Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [69] R.Hadsell, S.Chopra, and Y.LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). IEEE, 2006, vol. 2, pp. 1735–1742.
  • [70] S.Kim, D.Kim, M.Cho, and S.Kwak, “Proxy anchor loss for deep metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3238–3247.
  • [71] F.Radenović, G.Tolias, and O.Chum, “Fine-tuning cnn image retrieval with no human annotation,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 7, pp. 1655–1668, 2018.
  • [72] Y.Zhu, M.Yang, C.Deng, and W.Liu, “Fewer is more: A deep graph metric learning perspective using fewer proxies,” Advances in Neural Information Processing Systems, vol. 33, pp. 17792–17803, 2020.
  • [73] M.Tan and Q.Le, “Efficientnetv2: Smaller models and faster training,” in International conference on machine learning. PMLR, 2021, pp. 10096–10106.
  • [74] N.Murray, H.Jégou, F.Perronnin, and A.Zisserman, “Interferences in match kernels,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 9, pp. 1797–1810, 2016.
  • [75] T.Ng, V.Balntas, Y.Tian, and K.Mikolajczyk, “Solar: second-order loss and attention for image retrieval,” in European conference on computer vision. Springer, 2020, pp. 253–270.
  • [76] A.Dosovitskiy, L.Beyer, A.Kolesnikov, D.Weissenborn, X.Zhai, T.Unterthiner, M.Dehghani, M.Minderer, G.Heigold, S.Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
  • [77] T.Durand, N.Thome, and M.Cord, “Weldon: Weakly supervised learning of deep convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4743–4752.
  • [78] L. M.Bregman, “The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming,” USSR computational mathematics and mathematical physics, vol. 7, no. 3, pp. 200–217, 1967.
  • [79] H. H.Bauschke and A. S.Lewis, “Dykstras algorithm with bregman projections: A convergence proof,” Optimization, vol. 48, no. 4, pp. 409–427, 2000.
  • [80] T.-T.Lu and S.-H.Shiou, “Inverses of 2×\times 2 block matrices,” Computers & Mathematics with Applications, vol. 43, no. 1-2, pp. 119–129, 2002.

Supplemental Material for "Generalized Sum Pooling for Metric Learning"

1 Extended Empirical Study for DML

In the following sections, we explain our empirical study in detail and provide additional experiments on the effect of hyperparameters as well as an evaluation under the conventional experimental settings.

Reproducibility

We provide the full details of our experimental setup and recapitulate the implementation details for the sake of complete transparency and reproducibility. Code is available at: GSP-DML Framework.

Table 3: Comparison with the existing methods for the retrieval task in conventional experimental settings with BN-Inception and ResNet50 backbones where superscripts denote embedding size. Red: the best. Blue: the second best. Bold: previous SOTA.   \ddaggerResults obtained from [24].
(a)
Backbone \rightarrow BN-Inception-512D
Dataset \rightarrow CUB Cars196 SOP In-shop
Method \downarrow R@1 R@1 R@1 R@1
C1+XBM [18] 65.80 82.00 79.50 89.90
ProxyAnchor [70] 68.40 86.10 79.10 91.50
DiVA [36] 66.80 84.10 78.10 -
ProxyFewer [72] 66.60 85.50 78.00 -
Margin+S2SD [37] 68.50 87.30 79.30 -
C1+XBM 64.32(23.59)\underset{\scriptsize{(23.59)}}{64.32} 77.63(21.67)\underset{\scriptsize{(21.67)}}{77.63} 79.29(52.59)\underset{\scriptsize{(52.59)}}{79.29} 90.16(61.39)\underset{\scriptsize{(61.39)}}{90.16}
C1+XBM+GSP 64.99(25.35)\underset{\scriptsize{(25.35)}}{64.99} 79.07(22.51)\underset{\scriptsize{(22.51)}}{79.07} 79.59(52.70)\underset{\scriptsize{(52.70)}}{\mathbf{79.59}} 90.92(63.25)\underset{\scriptsize{(63.25)}}{\mathbf{90.92}}
(b)
Backbone \rightarrow ResNet50
Dataset \rightarrow CUB Cars196 SOP In-shop
Method \downarrow R@1 R@1 R@1 R@1
C1+XBM128 [18] - - 80.60 91.30
ProxyAnchor512 [70] 69.70 87.70 80.00\ddagger 92.10\ddagger
DiVA512 [36] 69.20 87.60 79.60 -
ProxyNCA++512 [19] 66.30 85.40 80.20 88.60
Margin+S2SD512 [37] 69.00 89.50 81.20 -
LIBC512 [24] 70.30 88.10 81.40 92.80
MS+Metrix512 [6] 71.40 89.60 81.00 92.20
PAnchor+DIML128 [49] 66.46(25.58)\underset{\scriptsize{(25.58)}}{66.46} 86.13(28.11)\underset{\scriptsize{(28.11)}}{86.13} 79.22(43.04)\underset{\scriptsize{(43.04)}}{79.22} -
LIBC+GSP512 70.70 88.43 81.65 93.30
C1+XBM512 66.68(25.38)\underset{\scriptsize{(25.38)}}{66.68} 82.83(25.34)\underset{\scriptsize{(25.34)}}{82.83} 81.44(55.66)\underset{\scriptsize{(55.66)}}{81.44} 91.56(64.00)\underset{\scriptsize{(64.00)}}{91.56}
C1+XBM+GSP512 66.63(25.51)\underset{\scriptsize{(25.51)}}{66.63} 82.60(25.76)\underset{\scriptsize{(25.76)}}{82.60} 81.54(55.91)\underset{\scriptsize{(55.91)}}{\mathbf{81.54}} 91.75(64.43)\underset{\scriptsize{(64.43)}}{91.75}

1.1 Conventional Evaluation

We additionally follow the relatively old-fashioned conventional procedure [64] for the evaluation of our method. We use BN-Inception [61] and ResNet50 [58] architectures as the backbones and obtain 512D embeddings through a linear transformation after the global pooling layer. Aligned with the recent approaches [6, 19, 70, 18], we use global max pooling as well as global average pooling. The rest of the settings are disclosed in Sec. 1.6.

We evaluate our method with XBM and provide R@1 results in Tab. 3 for comparison with the SOTA. In our evaluations, we also report MAP@R scores in parentheses under the R@1 scores, together with a baseline XBM evaluation in our framework. The results are mostly consistent with the ones reported in the original paper [18] except for the CUB and Cars datasets: in XBM [18], the authors use a proxy-based trainable memory for CUB and Cars, whereas we use the official implementation provided by the authors, which does not include such proxy-based extensions.

We observe that our method improves XBM and that XBM+GSP reaches SOTA performance on the large-scale datasets. That being said, the improvement margins are less substantial than in the fair evaluation. Such a result is expected since training is terminated by early stopping, a common practice to regularize training for better generalization [20, 21]. In the conventional evaluation, early stopping is performed by monitoring test-set performance, which by itself yields good generalization to the test data. Therefore, observing smaller generalization gains with GSP is expected, owing to the boost that test-set-based early stopping already provides.

We also observe that in a few cases the R@1 performance of GSP is slightly worse than the baseline. Nevertheless, once we compare MAP@R performances, GSP consistently brings improvement without exception. We should recapitulate that R@1 is a myopic metric for assessing the quality of the embedding space geometry [4]; hence, pushing R@1 does not necessarily reflect the true order of the improvements that the methods bring.

As we observe from the MAP@R comparisons in Table 2 (main paper), methods sharing similar R@1 (i.e., P@1) performances can differ considerably more in MAP@R. In that regard, we firmly believe that comparing MAP@R performances instead of R@1 is technically sounder for demonstrating the improvements of our method.

Finally, we also apply our method with LIBC [24] to further show the wide applicability of our method. We use the official implementation of LIBC and follow their default experimental settings. The evaluations on the 4 benchmarks show that GSP improves LIBC by 0.5\approx 0.5pp R@1. To offer a complete outlook on the conventional evaluation, we also include the recall at K (R@K) scores in Table 4.

Table 4: R@K performances using 512D embeddings from LIBC [24] and XBM [18] with ResNet50 backbone
Dataset\rightarrow CUB Cars196
Method\downarrow R@1 R@2 R@4 R@8 R@1 R@2 R@4 R@8
LIBC+GSP 70.70 80.72 88.18 92.64 88.43 93.03 95.78 97.69
XBM+GSP 66.63 77.43 85.26 91.02 82.60 89.04 92.67 95.62
Dataset\rightarrow SOP In-shop
Method\downarrow R@1 R@10 R@50 R@100 R@1 R@10 R@20 R@40
LIBC+GSP 81.65 91.37 94.85 96.00 93.30 98.54 98.96 99.25
XBM+GSP 81.54 91.84 95.18 96.29 91.75 97.83 98.52 99.01

1.2 Application of GSP to Other Problems

GSP is applicable to any problem and architecture with a pooling layer. We believe GSP is particularly relevant to metric learning due to the geometry it enforces: our pooling method enhances the local geometry by reducing the overlap of class convex hulls and improves generalization to unseen classes.

In order to evaluate the applicability of GSP beyond metric learning, we applied GSP to the ImageNet classification task using ResNetV2 [58] and EfficientNetV2 [73] models. We took the official TensorFlow Keras models and only replaced their global pooling layers with GSP, as sketched below.
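The snippet is a minimal illustrative sketch rather than our exact implementation: GSPPooling2D is a hypothetical layer whose cosine-distance cost, alternating-normalization loop, and default hyperparameters are simplifying assumptions made only for illustration.

import tensorflow as tf

class GSPPooling2D(tf.keras.layers.Layer):
    """Illustrative GSP-style pooling: m trainable prototypes; the pooling
    weights come from a few alternating-normalization iterations."""
    def __init__(self, num_prototypes=128, eps=0.5, mu=0.3, iters=25, **kwargs):
        super().__init__(**kwargs)
        self.m, self.eps, self.mu, self.iters = num_prototypes, eps, mu, iters

    def build(self, input_shape):
        self.prototypes = self.add_weight(
            name="prototypes", shape=(self.m, int(input_shape[-1])),
            initializer="glorot_uniform")

    def call(self, x):                                   # x: (B, H, W, D)
        n = x.shape[1] * x.shape[2]
        f = tf.reshape(x, (-1, n, x.shape[3]))           # (B, n, D)
        # cost: negative cosine similarity to the prototypes (illustrative choice)
        c = -tf.einsum("bnd,md->bnm",
                       tf.math.l2_normalize(f, axis=-1),
                       tf.math.l2_normalize(self.prototypes, axis=-1))
        K = tf.exp(-self.eps * c)                        # (B, n, m)
        t = tf.ones_like(f[:, :1, 0])                    # scalar t per sample, (B, 1)
        for _ in range(self.iters):                      # alternating updates
            rho = (1.0 / n) / (1.0 + t * tf.reduce_sum(K, axis=-1))       # (B, n)
            t = self.mu / tf.reduce_sum(K * rho[..., None], axis=[1, 2])  # (B,)
            t = t[:, None]
        w = (1.0 / n - rho) / self.mu                    # pooling weights, (B, n)
        return tf.reduce_sum(w[..., None] * f, axis=1)   # (B, D)

backbone = tf.keras.applications.ResNet50V2(
    include_top=False, weights=None, input_shape=(224, 224, 3))
model = tf.keras.Sequential(
    [backbone, GSPPooling2D(128), tf.keras.layers.Dense(1000)])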

Table 5: Evaluation in classification task
ImageNet Acc. P@1 MAP@R
RN50V2 75.26 69.30 41.18
+GSP 76.53 71.34 42.58
ENV2B3 80.03 79.23 59.98
+GSP 82.00 80.80 62.75

The results suggest that our method is applicable beyond metric learning as it improves ImageNet classification accuracy for both ResNetV2 and EfficientNetV2 models. We additionally assessed the metric learning performance of the pre-classification embedding vectors: after reducing the embedding dimension to 512 through LDA, we evaluated the resulting embedding geometry using the P@1 and MAP@R metrics and observed that GSP yields better feature geometry.

Table 6: Evaluation of feature pooling methods on Cifar Collage and CUB datasets with Contrastive and ProxyNCA++ losses for DML task. Red: the best, Blue: the second best, Bold: the third best.
128D - MAP@R
Dataset\rightarrow Cifar Collage CUB
Method\downarrow Loss\rightarrow Contrastive ProxyNCA++ Contrastive ProxyNCA++
CBAM [45] 7.87 10.99 18.45 18.21
CroW [43] 10.09 11.48 20.88 20.42
DeLF [46] 11.44 24.83 21.42 20.51
GeMax [74] 7.04 7.83 18.85 17.83
GeMean [71] 10.97 10.60 21.50 20.71
GSoP [47] 11.15 17.73 20.52 15.72
OTP-(8𝗑16)(8\mathsf{x}\mskip 1.0mu16) [41] 7.02 11.55 15.19 13.57
OTP-(64𝗑128)(64\mathsf{x}\mskip 1.0mu128) [41] 7.57 11.79 20.88 20.48
SOLAR [75] 17.30 20.36 19.89 20.14
TFM-1 [76] 8.84 10.83 17.82 19.13
TFM-2 [76] 16.48 21.00 17.16 18.13
TFM-4 [76] 18.51 21.56 16.91 18.22
TFM-8 [76] 18.18 19.68 16.31 17.47
T-SMK [44] 9.21 13.15 21.01 20.23
VLAD-(8𝗑16)(8\mathsf{x}\mskip 1.0mu16) [39] 21.73 19.68 15.19 13.08
VLAD-(64𝗑128)(64\mathsf{x}\mskip 1.0mu128) [39] 22.52 21.15 16.67 16.53
WELDON [77] 13.81 20.38 20.67 20.31
GAP 8.09 10.68 20.58 20.63
GMP 9.53 11.25 20.66 20.33
GMP+GAP 10.01 11.85 20.88 20.68
GSP 22.68 27.61 21.52 20.75

1.3 Evaluation of Other Pooling Alternatives

We evaluate 14 additional pooling alternatives on the Cifar Collage and CUB datasets with contrastive (C2) and Proxy-NCA++ (PNCA) losses. We pick contrastive since it is one of the best performing sample-based losses, and Proxy-NCA++ since most of the pooling methods are tailored for landmark-based image retrieval and use a classification loss akin to Proxy-NCA++. We particularly consider the Cifar Collage dataset since images of different classes share a considerable amount of semantic entities, enabling us to assess the methods w.r.t. their ability to discard nuisance information.

Compared methods. In addition to our method (GSP) and global average pooling (GAP), we consider: i) global max pooling (GMP), ii) GAP+GMP [70], iii) CBAM [45], iv) CroW [43], v) DeLF [46], vi) generalized max pooling (GeMax) [74], vii) generalized mean pooling (GeMean) [71], viii) GSoP [47], ix) optimal transport based aggregation (OTP) [41, 40], x) SOLAR [75], xi) trainable SMK (T-SMK) [44], xii) NetVLAD [39], xiii) WELDON [77], and xiv) a visual transformer encoder with class token (TFM) [76]. Among those, OTP and VLAD are ensemble-based methods and typically necessitate large embedding dimensions. Thus, we experimented with both their 128D versions, (8𝗑16)(8\mathsf{x}\mskip 1.0mu16) (8 prototypes of 16D vectors), and their 8192D versions, (64𝗑128)(64\mathsf{x}\mskip 1.0mu128) (64 prototypes of 128D vectors). We note that some of our baselines utilize attention-based pooling; notably, the attention mechanism is the key part of DeLF, SOLAR, and GSoP. In fact, DeLF can be seen as equivalent to a single-layer, residual-free transformer layer with a class token. To perform a more direct comparison with transformers, we also conducted experiments replacing GAP with transformer layers using a class token, evaluating 1-, 2-, 4-, and 8-layer transformers (TFM-#).

Setting. For the CUB dataset, the experimental setting follows the fair evaluation protocol of Sec. 1.6, and we report the MAP@R performance of the 4-model average, each model using 128-dimensional embeddings. For the Cifar Collage dataset, the experimental setting follows Sec. 2.2, and we report MAP@R performance. We provide the results in Table 6.

Results. The evaluations show that our method is superior to the other pooling alternatives, including the prototype-based VLAD and OTP. In particular, for 128-dimensional embeddings, our method outperforms the prototype-based methods by a large margin. On the CUB dataset, the other pooling methods are either inferior to or on par with GAP; the improvements of the superior ones are less than 1%, implying that our improvements on the order of 1-2% reported in Table 2 (main paper) are substantial. Moreover, the methods that mask the feature map outperform GAP by a large margin on the Cifar Collage dataset. That said, our method outperforms all methods except Contrastive+VLAD by a large margin on Cifar Collage, yet more evidence for the better feature selection mechanism of our method. For instance, on CUB, DeLF and GeMean are on par with our method, which performs slightly better; yet our method outperforms both by a large margin on Cifar Collage.

Superior selection mechanism. Our method outperforms CroW, T-SMK, and CBAM by a large margin. Those methods are built on feature-magnitude-based saliency, assuming that the backbone must be able to zero out nuisance information. Yet such a requirement is restrictive for the parameter space, and annihilating the non-discriminative information might not be feasible in some problems. We experimentally observe this weakness of CroW, T-SMK, and CBAM on the Cifar Collage dataset, where the nuisance information cannot be zeroed out by the backbone. Our formulation does not have such a restrictive assumption and is thus substantially superior to them.

Superior attention mechanism. Similarly, the attention-based weighting methods DeLF and GSoP have no explicit control over the feature selection behavior and might result in poor models when jointly trained with the feature extractor [46], which we also observe in the Cifar Collage experiments. On the contrary, we have explicit control over the pooling behavior through the μ\mu parameter, and the behavior of our method is stable and consistent across datasets and loss functions. We also find that our method outperforms the direct application of transformer-based pooling.

Simpler and interpretable. The attention-based methods DeLF, GSoP, and SOLAR typically introduce several convolution layers to compute the feature weights. We only introduce an mm-kernel 1𝗑11\mathsf{x}\mskip 1.0mu1 convolution layer (i.e., mm-many trainable prototypes) and obtain better results. We should note that our pooling operation is as simple as performing a convolution (i.e., distance computation) and alternating normalization of a vector and a scalar. Additionally, we are able to naturally incorporate a zero-shot regularization into our problem by using the prototype assignment weights. Such a loss could also be incorporated into DeLF, which has a 1𝗑11\mathsf{x}\mskip 1.0mu1 convolution to compute prototype similarities; however, one would first need a mechanism to aggregate the per-prototype similarities (e.g., summation and normalization). Indeed, normalizing the similarities channel-wise and summing them spatially corresponds to solving our problem with μ=1\mu=1.

The other pooling methods, i.e., GAP, GMP, GAP+GMP, GeMax, GeMean, WELDON, VLAD, and OTP, do not build on discriminative feature selection; thus, our method substantially outperforms them.


Figure 8: Computation increase (\uparrow) in inference with GSP using kk iterations

1.4 Computational Analysis

Forward and backward computations of the proposed GSP method can be implemented using only matrix-vector products. Moreover, having a closed-form, matrix-inversion-free expression for the loss gradient enables memory-efficient backpropagation, since otherwise the output of each iteration would have to be stored.

We perform kk iterations to obtain the pooling weights, and each iteration involves only matrix-vector products. In this sense, backpropagation can also be achieved using automatic differentiation. One problem with automatic differentiation is that the computational load increases with increasing kk. With the closed-form gradient expression, on the other hand, we do not have such an issue and in fact have constant backpropagation complexity. Granted that the closed-form expression demands the exact solution of the problem (i.e., kk\to\infty), the forward computation adds little overhead and is memory efficient since we discard the intermediate outputs. Moreover, our initial empirical study shows that our problems typically converge for k>50k>50, and we observe similar performances with k25k\geqslant 25.
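For concreteness, the following NumPy sketch runs the alternating updates of Proposition 4.2 (restated in Sec. A.2) on a toy cost matrix and reads off the pooling weights as p_j = (1/n - rho_j)/mu; the cost values and hyperparameters are placeholders.

import numpy as np

def gsp_weights(c, mu=0.3, eps=0.5, iters=50):
    # c: m x n cost matrix between m prototypes and n spatial features
    m, n = c.shape
    K = np.exp(-eps * c)                     # entropy-smoothed kernel
    t = 1.0                                  # scalar; corresponds to rho^(0) = 1_n
    for _ in range(iters):                   # alternating updates (Prop. 4.2 / Eq. A.1)
        rho = (1.0 / n) / (1.0 + t * (K.T @ np.ones(m)))
        t = mu / (np.ones(m) @ K @ rho)
    pi = t * K * rho[None, :]                # transport plan, m x n
    w = pi.sum(axis=0) / mu                  # pooling weights, p_j = (1/n - rho_j) / mu
    return w, pi

# toy usage: 8 prototypes, 49 (= 7x7) spatial features with a random cost
rng = np.random.default_rng(0)
w, pi = gsp_weights(rng.random((8, 49)))
print(w.sum())                               # equals 1 up to float error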


Figure 9: Comparing closed-form gradient with automatic differentiation through analyzing the effect of kk on computation in CUB with C2 loss. Shaded regions represent \mpstd.

The choice of kk is indeed problem dependent (i.e., it depends on the size of the feature map and the number of prototypes); thus, its effect on the computational load should be analyzed. We study the effect of kk with automatic differentiation and with our closed-form gradient expression, and provide the related plots in Fig. 9. We observe that with closed-form gradients our method adds little computational overhead and increasing kk has a marginal effect. On the contrary, with automatic differentiation, the computational complexity increases substantially.

We have further provided the inference times for various optimization steps (kk) in Fig. 8. The additional computational complexity introduced by our method is minor, resulting in a less than 6% increase in the time per image (from 43.1 ms to 45.6 ms) within the typical operation interval of kk (25-50). Therefore, our method remains computationally feasible for real-time processing.


Figure 10: Parameter search for m:m\mathrel{\mathop{\mathchar 58\relax}} number of prototypes and ε\varepsilon: entropy smoothing coefficient. We fix μ=0.3\mu=0.3 and λ=0.1\lambda=0.1. (a) Searching mεm-\varepsilon space in CUB dataset. (b) Effect of mm once we fix ε=5\varepsilon=5 for Contrastive (C2) and ε=0.5\varepsilon=0.5 for Proxy NCA++ (PNCA).

1.5 Hyperparameter Selection

We first perform a screening experiment to see the effect of the parameters. We design a 2-level fractional factorial experiment (i.e., a subset of all possible parameter combinations).

We provide the results in Tab. 7. In our analysis, we find that lower values are better for λ\lambda and μ\mu; thus, we set μ=0.3\mu=0.3 and λ=0.1\lambda=0.1. ε\varepsilon is observed to have the largest effect, while the number of prototypes, mm, seems to have no significant effect. Nevertheless, we jointly search for mm and ε\varepsilon by performing a grid search on the CUB dataset with Contrastive (C2) and Proxy NCA++ (PNCA) losses. We provide the results in Fig. 10-(a). Both losses achieve their best performance at m=64m=64, whereas ε=5.0\varepsilon=5.0 works better for C2 and ε=0.5\varepsilon=0.5 works better for PNCA. We additionally perform a small experiment to see whether ε=0.5\varepsilon=0.5 is also preferable for the Proxy Anchor loss and observe that it is a better choice than ε=5.0\varepsilon=5.0. As a result of the mm-ε\varepsilon search, we set ε=5.0\varepsilon=5.0 for non-proxy-based losses and ε=0.5\varepsilon=0.5 for proxy-based losses.

Table 7: Initial 2-level fractional factorial screening experiments for parameter tuning (conducted in CUB)
Setting MAP@R
mm μ\mu ε\varepsilon λ\lambda C2 PNCA
16 0.3 0.5 0.1 40.63 40.59
16 0.7 0.5 0.5 40.41 40.34
128 0.3 0.5 0.5 40.22 40.35
128 0.7 0.5 0.1 40.07 40.85
16 0.3 20 0.5 36.11 40.51
16 0.7 20 0.1 39.11 39.88
128 0.3 20 0.1 39.61 39.12
128 0.7 20 0.5 35.36 39.92
Baseline 39.77 39.90

Fixing μ=0.3,λ=0.1,ε=0.5( or 5.0)\mu=0.3,\lambda=0.1,\varepsilon=0.5\text{( or }5.0\text{)}, we further examine the effect of the number of prototypes mm on the large datasets (i.e., SOP and In-shop). We provide the corresponding performance plots in Fig. 10-(b). Supporting our initial analysis, mm does not seem to have a significant effect once it is not too small (e.g., m64m\geqslant 64): any choice of m64m\geqslant 64 provides a performance improvement, while increasing mm further does not bring substantial gains over relatively smaller values. Considering these results, we set m=128m=128 for the SOP and In-shop datasets since both the C2 and PNCA losses perform slightly better with m=128m=128 than with the other choices of mm.

1.6 Experimental Setup

1.6.1 Datasets

We perform our experiments on 4 widely-used benchmark datasets: Stanford Online Products (SOP) [64], In-shop [67], Cars196 [66], and CUB-200-2011 (CUB) [65].

SOP has 22,634 classes with 120,053 product images. The first 11,318 classes (59,551 images) are split for training and the other 11,316 (60,502 images) classes are used for testing.

In-shop has 7,986 classes with 72,712 images. We use 3,997 classes with 25,882 images as the training set. For the evaluation, we use 14,218 images of 3,985 classes as the query and 12,612 images of 3,985 classes as the gallery set.

Cars196 contains 196 classes with 16,185 images. The first 98 classes (8,054 images) are used for training and the remaining 98 classes (8,131 images) are reserved for testing.

CUB-200-2011 dataset consists of 200 classes with 11,788 images. The first 100 classes (5,864 images) are split for training, and the remaining 100 classes (5,924 images) are used for testing.

Data augmentation follows [4]. During training, we resize each image so that its shorter side has length 256, then make a random crop with size between 40 and 256 and aspect ratio between 3/4\nicefrac{{3}}{{4}} and 4/3\nicefrac{{4}}{{3}}. We resize the resulting crop to 227𝗑227227\mathsf{x}\mskip 1.0mu227 and apply a random horizontal flip with 50%50\% probability. During evaluation, images are resized to 256 and then center cropped to 227𝗑227227\mathsf{x}\mskip 1.0mu227 (a sketch is given below).
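The sketch assumes TensorFlow image ops and our reading of the crop-size specification; the exact sampling in [4] may differ in its details.

import tensorflow as tf

def resize_shorter_side(img, target=256):
    h = tf.cast(tf.shape(img)[0], tf.float32)
    w = tf.cast(tf.shape(img)[1], tf.float32)
    scale = tf.cast(target, tf.float32) / tf.minimum(h, w)
    new_hw = tf.cast(tf.round(tf.stack([h, w]) * scale), tf.int32)
    return tf.image.resize(img, new_hw)

def train_augment(img):                  # img: H x W x 3
    img = resize_shorter_side(img, 256)
    h, w = tf.shape(img)[0], tf.shape(img)[1]
    size = tf.cast(tf.random.uniform([], 40, 257, tf.int32), tf.float32)
    ratio = tf.random.uniform([], 3.0 / 4.0, 4.0 / 3.0)
    ch = tf.minimum(tf.cast(size / tf.sqrt(ratio), tf.int32), h)   # crop height
    cw = tf.minimum(tf.cast(size * tf.sqrt(ratio), tf.int32), w)   # crop width
    img = tf.image.random_crop(img, (ch, cw, 3))
    img = tf.image.resize(img, (227, 227))
    return tf.image.random_flip_left_right(img)

def eval_preprocess(img):
    img = resize_shorter_side(img, 256)
    return tf.image.resize_with_crop_or_pad(img, 227, 227)   # center crop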

1.6.2 Training Splits

Fair evaluation. We split the datasets into disjoint training, validation, and test sets according to [4]. In particular, we partition 50%/50%\nicefrac{{50\%}}{{50\%}} into training and test, and further split the training data into 4 partitions, where 4 models are trained, each using 1/4\nicefrac{{1}}{{4}} for validation while training on the remaining 3/4\nicefrac{{3}}{{4}}.

Conventional evaluation. Following relatively old-fashioned conventional evaluation [64], we use the whole train split of the dataset for training and we use the test split for evaluation as well as monitoring the training for early stopping.

Hyperparameter tuning. For the additional experiments on the effect of hyperparameters, we split the training set into 5 partitions and train a single model on 4/5\nicefrac{{4}}{{5}} of the set while using the remaining 1/5\nicefrac{{1}}{{5}} for validation.

1.6.3 Evaluation Metrics

We consider precision at 1 (P@1) and mean average precision at R (MAP@R), where R is defined for each query (a query is an image for which similar images are to be retrieved, and the references are the images in the searchable database) as the total number of true references for that query. Among those, the MAP@R metric is shown to better reflect the geometry of the embedding space and to be less noisy as an evaluation metric [4]. Thus, we use MAP@R to monitor training in our experiments, except for the conventional evaluation setting where we monitor P@1.

P@1: Find the nearest reference to the query. The score for that query is 1 if the reference is of the same class and 0 otherwise. Averaging over all queries gives the P@1 metric.

P@R: For a query ii, find the RiR_{i} nearest references to the query and let rir_{i} be the number of true references in that RiR_{i}-neighbourhood. The score for that query is P@Ri=ri/Ri\text{P@R}_{i}=\nicefrac{{r_{i}}}{{R_{i}}}. Averaging over all queries gives the P@R metric, i.e., P@R=1ni[n]P@Ri\text{P@R}=\tfrac{1}{n}\sum\limits_{i\in[n]}\text{P@R}_{i}, where nn is the number of queries.

MAP@R: For a query ii, we define MAP@Ri1Rij[Ri]P(j)\text{MAP@R}_{i}\coloneqq\tfrac{1}{R_{i}}\sum\limits_{j\in[R_{i}]}P(j), where P(j)P(j) is the precision at cut-off jj if the jthj^{th} retrieval is correct and 0 otherwise. Averaging over all queries gives the MAP@R metric, i.e., MAP@R=1ni[n]MAP@Ri\text{MAP@R}=\tfrac{1}{n}\sum\limits_{i\in[n]}\text{MAP@R}_{i}, where nn is the number of queries.
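The sketch below computes P@1 and MAP@R from a binary relevance matrix, where rel[q, k] indicates whether the k-th retrieved reference of query q is of the same class and R holds each query's number of true references (toy values shown).

import numpy as np

def precision_at_1(rel):
    # rel[q, 0] == 1 iff the nearest reference of query q is of the same class
    return rel[:, 0].mean()

def map_at_r(rel, R):
    scores = np.zeros(len(R), dtype=float)
    for q, r in enumerate(R):
        hits = rel[q, :r]                                # first R_q retrievals
        prec_at_i = np.cumsum(hits) / (np.arange(r) + 1)
        scores[q] = np.sum(prec_at_i * hits) / r         # P(i) counted only on hits
    return scores.mean()

# toy example: 2 queries, each with R_q = 3 true references among 5 retrievals
rel = np.array([[1, 0, 1, 1, 0],
                [0, 1, 1, 0, 1]])
print(precision_at_1(rel), map_at_r(rel, [3, 3]))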

1.6.4 Training Procedure

Fair evaluation. We use the Adam [68] optimizer with a constant 10510^{\shortminus 5} learning rate, 10410^{\shortminus 4} weight decay, and default moment parameters, β1=.9\beta_{1}{=}.9 and β2=.99\beta_{2}{=}.99. We use a batch size of 32 (4 samples per class). We evaluate validation MAP@R every 100 steps of training in CUB and Cars196, and every 1000 steps in SOP and In-shop. We stop training if no improvement is observed for 15 evaluations in CUB and Cars196, and 10 evaluations in SOP and In-shop, and we recover the parameters with the best validation performance. Following [4], we train 4 models, one for each 3/4\nicefrac{{3}}{{4}} partition of the training set, and each model is trained 3 times. Hence, we have 34=813^{4}=81 realizations of 4-model collections; the sketch below illustrates this counting. We present the average performance as well as the standard deviation (std) over these 81 evaluations.
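In the sketch, evaluate is a placeholder standing in for the actual retrieval evaluation of a 4-model collection.

import itertools
import numpy as np

# each of the 4 folds (trained on a different 3/4 partition) is run 3 times
runs_per_fold = [[f"fold{f}_run{r}" for r in range(3)] for f in range(4)]

def evaluate(collection):
    # placeholder: would compute MAP@R of the 4-model average (or the
    # concatenated 512D ensemble) for the given choice of runs
    return np.random.rand()

scores = [evaluate(c) for c in itertools.product(*runs_per_fold)]
print(len(scores), np.mean(scores), np.std(scores))   # 81 realizations, mean and std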

Conventional evaluation. We use the Adam [68] optimizer with default moment parameters, β1=.9\beta_{1}{=}.9 and β2=.99\beta_{2}{=}.99. Following recent works [70], we use a reduce-on-plateau learning rate scheduler with patience 4. The initial learning rate is 10510^{\shortminus 5} for CUB, and 10410^{\shortminus 4} for Cars, SOP and In-shop. We use 10410^{\shortminus 4} weight decay for the BNInception backbone and 4 1044\,10^{\shortminus 4} weight decay for the ResNet50 backbone. We use a batch size of 128 (4 samples per class) for the BNInception backbone and 112 (4 samples per class) for the ResNet50 backbone (following [5]). We evaluate validation P@1 every 25 steps of training in CUB and Cars196, and every 250 steps in SOP and In-shop. We stop training if no improvement is observed for 15 evaluations in CUB and Cars196, and 10 evaluations in SOP and In-shop, and we recover the parameters with the best validation performance. We repeat each experiment 3 times and report the best result. For the evaluations within the LIBC framework [24], we follow their experimental setting.

Hyperparameter tuning. We use the Adam [68] optimizer with a constant 10510^{\shortminus 5} learning rate, 10410^{\shortminus 4} weight decay, and default moment parameters, β1=.9\beta_{1}{=}.9 and β2=.99\beta_{2}{=}.99. We use a batch size of 32 (4 samples per class). We evaluate validation MAP@R every 100 steps of training in CUB and Cars196, and every 1000 steps in SOP and In-shop. We stop training if no improvement is observed for 10 evaluations in CUB and Cars196, and 7 evaluations in SOP and In-shop, and we recover the parameters with the best validation performance. We train a single model on 4/5\nicefrac{{4}}{{5}} of the training set while using 1/5\nicefrac{{1}}{{5}} for validation. We repeat each experiment 3 times and report the averaged result.

1.6.5 Embedding vectors

Fair evaluation. Embedding dimension is fixed to 128. During training and evaluation, the embedding vectors are 2\ell 2 normalized. We follow the evaluation method proposed in [4] and produce two results: i) the average performance (128 dimensional) of the 4-fold models and ii) the ensemble performance (concatenated, 512 dimensional) of the 4-fold models, where the embedding vector is obtained by concatenating the 128D vectors of the individual models before retrieval.

Conventional evaluation. Embedding dimension is 512 in both BNInception and ResNet50 experiments.

Hyperparameter tuning. Embedding dimension is fixed to 128.

1.6.6 Baselines with GSP

We evaluate our method with C1+XBM+GSP: Cross-batch memory (XBM) [18] with original Contrastive loss (C1) [69], C2+GSP: Contrastive loss with positive margin [53], MS+GSP: Multi-similarity (MS) loss [55], Triplet+GSP: Triplet loss [54], PNCA+GSP: ProxyNCA++ loss [19], PAnchor+GSP: ProxyAnchor loss [70].

1.6.7 Hyperparameters

For hyperparameter selection, we exploit the work of [4], which performed a parameter search via Bayesian optimization on a variety of losses. We further experiment with the parameters suggested in the original papers and official implementations and pick the best performing ones. We perform no further tuning of the baseline methods' parameters when they are applied with our method, so as to purely examine the effectiveness of our method.

C1: We adopted XBM’s official implementation for fair comparison. We use 0.5 margin for all datasets.

C2: C2 has two parameters, (m+,m)(m^{+},m^{-}): the positive margin, m+m^{+}, and the negative margin, mm^{-}. We set (m+,m)(m^{+},m^{-}) to (0,0.3841)(0,0.3841), (0.2652,0.5409)(0.2652,0.5409), (0.2858,0.5130)(0.2858,0.5130), (0.2858,0.5130)(0.2858,0.5130) for CUB, Cars196, In-shop and SOP, respectively.

Triplet: We set its margin to 0.0961, 0.1190, 0.0451, 0.0451 for CUB, Cars196, In-shop and SOP, respectively.

MS: We set its parameters (α,β,λ)(\alpha,\beta,\lambda) to (2,40,0.5)(2,40,0.5), (14.35,75.83,0.66)(14.35,75.83,0.66), (8.49,57.38,0.41)(8.49,57.38,0.41), (2,40,0.5)(2,40,0.5) for CUB, Cars196, In-shop and SOP, respectively.

ProxyAnchor: We set its two parameters (δ,α)(\delta,\alpha) to (0.1,32)(0.1,32) for all datasets. We use 1 sample per class in the batch setting (i.e., 32 classes with 1 sample each per batch), perform 1 epoch of warm-up training of the embedding layer, and apply a learning rate multiplier of 100 for the proxies during training. For the SOP dataset, we use a 5 1065\,10^{\shortminus 6} learning rate.

ProxyNCA++: We set its temperature parameter to 0.11 for all datasets. We use 1 sample per class in the batch setting (i.e., 32 classes with 1 sample each per batch), perform 1 epoch of warm-up training of the embedding layer, and apply a learning rate multiplier of 100 for the proxies.

XBM: We evaluate XBM with C1 since, in the original paper, the contrastive loss is reported to be the best performing baseline with XBM. We set the memory size of XBM according to the dataset: for CUB and Cars196 we use a memory of 25 batches, for In-shop 400 batches, and for SOP 1400 batches. We perform 1K steps of training with the baseline loss prior to integrating the XBM loss in order to satisfy XBM's slow-drift assumption.

GSP: For the hyperparameters of our method, we perform a parameter search, the details of which are provided in Sec. 1.5. We use 64 prototypes in CUB and Cars, and 128 prototypes in SOP and In-shop. We use ε=0.5\varepsilon{=}0.5 for proxy-based losses and ε=5.0\varepsilon{=}5.0 for non-proxy losses. For the rest, we set μ=0.3\mu{=}0.3, ϵ=0.05\epsilon{=}0.05, and we iterate until k=100k{=}100 in Proposition 4.1. For the zero-shot prediction loss coefficient (i.e., (1λ)DML+λZS(1{\shortminus}\lambda)\mathcal{L}_{DML}+\lambda\mathcal{L}_{ZS}), we set λ=0.1\lambda{=}0.1.

2 Details of the Other Empirical Work

2.1 Synthetic Study

We design a synthetic empirical study to evaluate GSP in a fully controlled manner. We consider a 16-class problem in which classes are defined over trainable tokens. In this setting, tokens correspond to semantic entities, but we use this specific wording to emphasize that they are trained as part of learning. Each class is defined by 4 distinct tokens, and there are also 4 background tokens shared by all classes. For example, a "car" class would have tokens like "tire" and "window" as well as the background tokens "tree" and "road".

We sample class representations from both class-specific and background tokens according to a mixing ratio μ~𝒩(0.5,0.1)\tilde{\mu}\sim\mathcal{N}(0.5,0.1). We sample a total of 50 tokens, and such a 50-many feature collection corresponds to a training sample (i.e., we are mimicking a CNN's output with trainable tokens). For instance, given the class tokens for class-cc, ν(c)={ν1(c),ν2(c),ν3(c),ν4(c)}\nu^{(c)}=\{\nu^{(c)}_{1},\nu^{(c)}_{2},\nu^{(c)}_{3},\nu^{(c)}_{4}\}, and the shared tokens, ν(b)={ν1(b),ν2(b),ν3(b),ν4(b)}\nu^{(b)}=\{\nu^{(b)}_{1},\nu^{(b)}_{2},\nu^{(b)}_{3},\nu^{(b)}_{4}\}, we first sample μ=0.4\mu=0.4 and then sample 20 tokens from ν(c)\nu^{(c)} with replacement, and 30 tokens from ν(b)\nu^{(b)}, forming a feature collection for class-cc, i.e., f(c)={ν3(c),ν1(c),ν1(c),ν3(c),,ν4(b),ν3(b),ν4(b),ν1(b),}f^{(c)}=\{\nu^{(c)}_{3},\nu^{(c)}_{1},\nu^{(c)}_{1},\nu^{(c)}_{3},\ldots,\nu^{(b)}_{4},\nu^{(b)}_{3},\nu^{(b)}_{4},\nu^{(b)}_{1},\ldots\}. We then obtain global representations using GAP and GSP (a sketch of this generation process is given below).
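In the sketch, the tokens are drawn randomly and the 64-dimensional token size is an arbitrary choice for illustration; in the actual study the tokens are trainable parameters.

import numpy as np

rng = np.random.default_rng(0)
d, n_classes, n_tokens = 64, 16, 50                       # token dim is an arbitrary choice
class_tokens = rng.uniform(-0.3, 0.3, (n_classes, 4, d))  # 4 class-specific tokens each
background_tokens = rng.uniform(-0.3, 0.3, (4, d))        # 4 tokens shared by all classes

def sample_collection(c):
    mu_tilde = np.clip(rng.normal(0.5, 0.1), 0.0, 1.0)    # mixing ratio
    n_fg = int(round(mu_tilde * n_tokens))
    fg = class_tokens[c][rng.integers(0, 4, n_fg)]        # sampled with replacement
    bg = background_tokens[rng.integers(0, 4, n_tokens - n_fg)]
    return np.concatenate([fg, bg], axis=0)               # 50 x d feature collection

collection = sample_collection(3)
gap_representation = collection.mean(axis=0)              # GAP over the collection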

We do not apply 2\ell 2 normalization to the global representations. We also constrain the token vectors to lie in [0.3,0.3][\shortminus 0.3,0.3] to bound the magnitude of the learned vectors. We use the default Adam optimizer with a 10410^{\shortminus 4} learning rate and perform early stopping with a 30-epoch patience by monitoring MAP@R. In each batch, we use 4 samples from 16 classes.

2.2 Cifar Collage

We consider the 20 super-classes of the Cifar100 dataset [59], each of which has 5 sub-classes. For each super-class, we split the sub-classes into train (2), validation (1), and test (2). We consider 4 super-classes as the shared classes and compose 4𝗑44{\mathsf{x}\mskip 1.0mu}4-stitched collage images for the remaining 16 classes. In particular, we sample an image from a class and then sample 3 images from the shared classes. We illustrate the sample formation process in Fig. 11.


Figure 11: Sample generation for Cifar Collage dataset

We should note that the classes exploited in training, validation, and test are disjoint. For instance, if a tree class is used as a shared class in training, then that tree class does not appear in the validation or test set as a shared feature. Namely, in our problem setting, both the background and the foreground classes are disjoint across the training, validation, and test sets. Such a setting is useful for analyzing the zero-shot transfer capability of our method.

We use a ResNet20 (i.e., 3 stages, 3 blocks) backbone pretrained on the Cifar100 classification task. We apply 2\ell 2 normalization to the global representations. We use the default Adam optimizer with an initial 0.0010.001 learning rate and a reduce-on-plateau scheduler with a 0.5 decay factor and 5-epoch patience. For GSP, we set m=128,μ=0.2,ε=10,λ=0.5m=128,\mu=0.2,\varepsilon=10,\lambda=0.5. We use 4 samples from 16 classes in a batch.

2.3 Evaluation of Zero-shot Prediction Loss

We train on the Cifar10 [59] dataset with 8 prototypes using the ProxyNCA++ [19] (PNCA) loss with and without ZS\mathcal{L}_{ZS}. We then use the test set to compute attribute histograms for each class; namely, we aggregate the marginal transport plans of the samples in a class to obtain its histogram. Similarly, for each class, we compute the mean embedding vector (i.e., we average the embedding vectors of the samples of that class). Our aim is to fit a linear predictor that maps attribute vectors to the mean embeddings.

To quantify the zero-shot prediction performance, we randomly split the classes in half and apply cross-batch zero-shot prediction. Specifically, we fit a linear predictor using 5 classes and then use that transformation to map the other 5 classes to their mean embeddings. We then compute the pairwise distances between the predicted means and the true means and evaluate the nearest-neighbour classification performance, using both the 2\ell 2 distance and the cosine distance. We repeat the experiment 1000 times with different class splits (a sketch is given below).
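The sketch uses toy data (random attribute histograms and mean embeddings) and the ℓ2 distance only; it conveys the evaluation logic rather than the reported numbers.

import numpy as np

rng = np.random.default_rng(0)
n_cls, n_attr, d = 10, 8, 128
A = rng.random((n_cls, n_attr))          # per-class attribute histograms (toy)
M = rng.random((n_cls, d))               # per-class mean embeddings (toy)

def zero_shot_nn_accuracy(A, M, fit_idx, test_idx):
    # fit a linear predictor on the seen classes, predict means of unseen classes
    W, *_ = np.linalg.lstsq(A[fit_idx], M[fit_idx], rcond=None)
    pred = A[test_idx] @ W
    dists = np.linalg.norm(pred[:, None] - M[test_idx][None], axis=-1)
    return (dists.argmin(axis=1) == np.arange(len(test_idx))).mean()

accs = []
for _ in range(1000):                    # repeat with random class splits
    perm = rng.permutation(n_cls)
    accs.append(zero_shot_nn_accuracy(A, M, perm[:5], perm[5:]))
print(np.mean(accs))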

Appendix A Appendix

A.1 Proof for Theorem 4.1

Proof.

ρ\rho^{\ast} is obtained as the solution of the following optimal transport problem:

ρ,π=argminρ,π0ρj+Σiπij=1/nΣijπij=μijcijπij.\rho^{\ast},\pi^{\ast}=\!\!\!\!\!\!\!\operatorname*{arg\,min}_{\begin{subarray}{c}\rho,\pi\geqslant 0\\ \rho_{j}+\Sigma_{i}\pi_{ij}=\nicefrac{{1}}{{n}}\\ \Sigma_{ij}\pi_{ij}=\mu\end{subarray}}\!\!\!\!\!\!\!\textstyle\sum_{ij}c_{ij}\pi_{ij}.

Given solutions (ρ,π)(\rho^{\ast},\pi^{\ast}), for μ=1\mu{=}1, from the 3rd constraint, we have Σijπij=1\Sigma_{ij}\pi_{ij}^{\ast}{=}1. Then, using the 2nd2^{nd} constraint, we write:

jρj+jiπij=j1n\textstyle\sum_{j}\rho^{\ast}_{j}+\textstyle\sum_{j}\textstyle\sum_{i}\pi^{\ast}_{ij}=\textstyle\sum_{j}\tfrac{1}{n}

where j[n]j{\in}[n] for nn-many convolutional features. Hence, we have jρj=0\sum_{j}\rho^{\ast}_{j}=0, which implies ρ=0\rho^{\ast}{=}0 owing to the non-negativity constraint. Finally, the pooling weights become pi=1/nρiμ=1=1/np_i=\tfrac{\nicefrac{{1}}{{n}}-\cancel{\rho^{\ast}_{i}}}{\underset{=1}{\mu}}=\nicefrac{{1}}{{n}}.

A.2 Proof for Proposition 4.2

Before starting our proof, we first derive an iterative approach for the solution of (P2). We then prove that the iterations in Proposition 4.2 can be used to obtain the solution.

We can write (P2) equivalently as:

ρ(ε),π(ε)=argminρ,π0ρj+Σiπij=1/nΣijπij=μijcijπij+1ε(ijπijlogπij+jρjlogρj)+j0ρjijπijjρj+ijeεcij+jeε0\begin{split}\rho^{(\varepsilon)},\pi^{(\varepsilon)}=\!\!\!\!\!\!\operatorname*{arg\,min}_{\begin{subarray}{c}\rho,\pi\geqslant 0\\ \rho_{j}+\Sigma_{i}\pi_{ij}=\nicefrac{{1}}{{n}}\\ \Sigma_{ij}\pi_{ij}=\mu\end{subarray}}\!\!\!\!\!\textstyle\sum_{ij}c_{ij}\pi_{ij}&+\tfrac{1}{\varepsilon}(\textstyle\sum_{ij}\pi_{ij}\log\pi_{ij}+\textstyle\sum_{j}\rho_{j}\log\rho_{j})\\ &+\textstyle\sum_{j}0\rho_{j}-\textstyle\sum_{ij}\pi_{ij}-\textstyle\sum_{j}\rho_{j}+\textstyle\sum_{ij}\mathrm{e}^{\shortminus\varepsilon c_{ij}}+\textstyle\sum_{j}\mathrm{e}^{\shortminus\varepsilon 0}\end{split}

Rearranging the terms we get:

ρ(ε),π(ε)=argminρ,π0ρj+Σiπij=1/nΣijπij=μijπijlogπijeεcij+jρjlogρjeε0ijπijjρj+ijeεcij+jeε0\rho^{(\varepsilon)},\pi^{(\varepsilon)}=\!\!\!\!\!\!\operatorname*{arg\,min}_{\begin{subarray}{c}\rho,\pi\geqslant 0\\ \rho_{j}+\Sigma_{i}\pi_{ij}=\nicefrac{{1}}{{n}}\\ \Sigma_{ij}\pi_{ij}=\mu\end{subarray}}\!\!\!\!\!\!\textstyle\sum_{ij}\pi_{ij}\log\tfrac{\pi_{ij}}{\mathrm{e}^{\shortminus\varepsilon c_{ij}}}+\textstyle\sum_{j}\rho_{j}\log\tfrac{\rho_{j}}{\mathrm{e}^{\shortminus\varepsilon 0}}-\textstyle\sum_{ij}\pi_{ij}-\textstyle\sum_{j}\rho_{j}+\textstyle\sum_{ij}\mathrm{e}^{\shortminus\varepsilon c_{ij}}+\textstyle\sum_{j}\mathrm{e}^{\shortminus\varepsilon 0}

which is the generalized Kullback–Leibler divergence (KLD) between (\rho,\pi) and (\exp{({\shortminus}\varepsilon 0)},\exp{({\shortminus}\varepsilon c)}). Hence, we reformulate the problem as a KLD projection onto a convex set, which can be solved by Bregman projections (i.e., alternating projections onto the constraint sets) [78, 79]. Defining \mathcal{C}_{1}\coloneqq\{(\rho,\pi)\mid\rho_{j}+\textstyle\sum_{i}\pi_{ij}=\nicefrac{{1}}{{n}}\,\,\forall j\} and \mathcal{C}_{2}\coloneqq\{(\rho,\pi)\mid\textstyle\sum_{ij}\pi_{ij}=\mu\}, and omitting constants, we can write the problem as:

ρ(ε),π(ε)=argminρ,π0(ρ,π)𝒞1𝒞2ijπij(logπijeεcij1)+jρj(logρjeε01)\rho^{(\varepsilon)},\pi^{(\varepsilon)}=\!\!\!\!\operatorname*{arg\,min}_{\begin{subarray}{c}\rho,\pi\geqslant 0\\ (\rho,\pi)\in\mathcal{C}_{1}\cap\mathcal{C}_{2}\end{subarray}}\textstyle\sum_{ij}\pi_{ij}(\log\tfrac{\pi_{ij}}{\mathrm{e}^{\shortminus\varepsilon c_{ij}}}\shortminus 1)+\textstyle\sum_{j}\rho_{j}(\log\tfrac{\rho_{j}}{\mathrm{e}^{\shortminus\varepsilon 0}}\shortminus 1) (P2)

Given (\rho^{(k)},\pi^{(k)}) at iteration k, the KLD projection onto \mathcal{C}_{1}, i.e., (\rho^{(k{+}1)},\pi^{(k{+}1)})\coloneqq\mathcal{P}_{\mathcal{C}_{1}}^{KL}(\rho^{(k)},\pi^{(k)}), reads:

ρj(k+1)=1/n(ρj(k)+iπij(k))1ρj(k),\rho^{(k{+}1)}_{j}=\nicefrac{{1}}{{n}}(\rho^{(k)}_{j}+\textstyle\sum_{i}\pi^{(k)}_{ij})^{\shortminus 1}\rho^{(k)}_{j}\,,
\pi^{(k{+}1)}_{ij}=\nicefrac{{1}}{{n}}(\rho^{(k)}_{j}+\textstyle\sum_{i}\pi^{(k)}_{ij})^{\shortminus 1}\pi^{(k)}_{ij}

where the results follow from the method of Lagrange multipliers. Similarly, for \mathcal{P}_{\mathcal{C}_{2}}^{KL}(\rho^{(k)},\pi^{(k)}), we have:

ρ(k+1)=ρ(k),π(k+1)=μijπij(k)π(k).\rho^{(k{+}1)}=\rho^{(k)}\,,\quad\pi^{(k{+}1)}=\tfrac{\mu}{\textstyle\sum_{ij}\pi^{(k)}_{ij}}\pi^{(k)}\,.

Given the initialization (\rho^{(0)},\pi^{(0)})=(\bm{1}_{n},\exp({\shortminus}\varepsilon c)), we obtain the pairs (\rho^{(k)},\pi^{(k)}) for k=0,1,2,\ldots as:

ρ(k+1)=1/n(ρ(k)+π(k)𝟏m)1ρ(k),π(k+1)=μ(𝟏mπ^𝟏n)1π^whereπ^=π(k)Diag(1/n(ρ(k)+π(k)𝟏m)1)\begin{split}\rho^{(k{+}1)}&=\nicefrac{{1}}{{n}}(\rho^{(k)}+\pi^{(k)\intercal}\bm{1}_{m})^{\shortminus 1}\odot\rho^{(k)}\,,\quad\pi^{(k{+}1)}=\mu(\bm{1}_{m}^{\intercal}\hat{\pi}\bm{1}_{n})^{\shortminus 1}\hat{\pi}\\ &\text{where}\quad\hat{\pi}=\pi^{(k)}Diag\big{(}\nicefrac{{1}}{{n}}(\rho^{(k)}+\pi^{(k)\intercal}\bm{1}_{m})^{\shortminus 1}\big{)}\end{split} (A.1)
Proof.

We will prove by induction. From Proposition 4.2, we have

ρ(k+1)=1/n(1+t(k)exp(εc)𝟏m)1,t(k+1)=μ(𝟏mexp(εc)ρ(k+1))1\rho^{(k{+1})}=\nicefrac{{1}}{{n}}\,(1+t^{(k)}\exp({\shortminus}\varepsilon c)^{\intercal}\bm{1}_{m})^{\shortminus 1}\!,\,\,\,t^{(k{+}1)}=\mu\,(\bm{1}_{m}^{\intercal}\exp({\shortminus}\varepsilon c)\rho^{(k{+}1)})^{\shortminus 1}

and \pi^{(k)}=t^{(k)}\exp({\shortminus}\varepsilon c)Diag(\rho^{(k)}). It is easy to show that the expressions hold for the pair (\rho^{(1)},\pi^{(1)}). Now, assume that the expressions also hold for an arbitrary (\rho^{(k^{\prime})},\pi^{(k^{\prime})}). We then have

ρ(k+1)=1/n(ρ(k)+π(k)𝟏m)1ρ(k)\rho^{(k^{\prime}{+}1)}=\nicefrac{{1}}{{n}}(\rho^{(k^{\prime})}+\pi^{(k^{\prime})\intercal}\bm{1}_{m})^{\shortminus 1}\odot\rho^{(k^{\prime})}

Replacing π(k)=t(k)exp(εc)Diag(ρ(k))\pi^{(k^{\prime})}=t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)Diag(\rho^{(k^{\prime})}) we get:

ρ(k+1)=1/n(ρ(k)+Diag(ρ(k))t(k)exp(εc)𝟏m)1ρ(k)\rho^{(k^{\prime}{+}1)}=\nicefrac{{1}}{{n}}(\rho^{(k^{\prime})}+Diag(\rho^{(k^{\prime})})t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)^{\intercal}\bm{1}_{m})^{\shortminus 1}\odot\rho^{(k^{\prime})}

where the \rho^{(k^{\prime})} terms cancel out, resulting in:

ρ(k+1)=1/n(1+t(k)exp(εc)𝟏m)1.\rho^{(k^{\prime}{+}1)}=\nicefrac{{1}}{{n}}(1+t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)^{\intercal}\bm{1}_{m})^{\shortminus 1}.

Similarly, we express π^\hat{\pi} as:

π^=t(k)exp(εc)Diag(ρk)Diag(1/n(ρ(k)+Diag(ρ(k))t(k)exp(εc)𝟏m)1)\hat{\pi}=t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)Diag(\rho^{k^{\prime}})Diag\Big{(}\nicefrac{{1}}{{n}}\big{(}\rho^{(k^{\prime})}+Diag(\rho^{(k^{\prime})})t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)^{\intercal}\bm{1}_{m}\big{)}^{\shortminus 1}\Big{)}

where again the \rho^{(k^{\prime})} terms cancel out, resulting in:

π^=t(k)exp(εc)Diag(1/n(1+t(k)exp(εc)𝟏m)1)=t(k)exp(εc)Diag(ρ(k+1)).\hat{\pi}=t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)Diag(\nicefrac{{1}}{{n}}(1+t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)^{\intercal}\bm{1}_{m})^{\shortminus 1})=t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)Diag(\rho^{(k^{\prime}+1)}).

Hence, π(k+1)\pi^{(k^{\prime}+1)} becomes:

π(k+1)=μ(𝟏mt(k)exp(εc)Diag(ρ(k+1))𝟏n)1t(k)exp(εc)Diag(ρ(k+1))=1t(k)μ(𝟏mexp(εc)ρ(k+1))1=t(k+1)t(k)exp(εc)Diag(ρ(k+1))=t(k+1)exp(εc)Diag(ρ(k+1)),\begin{split}\pi^{(k^{\prime}+1)}&=\mu(\bm{1}_{m}^{\intercal}t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)Diag(\rho^{(k^{\prime}+1)})\bm{1}_{n})^{\shortminus 1}t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)Diag(\rho^{(k^{\prime}+1)})\\ &=\tfrac{1}{t^{(k^{\prime})}}\underbrace{\mu(\bm{1}_{m}^{\intercal}\exp({\shortminus}\varepsilon c)\rho^{(k^{\prime}+1)})^{\shortminus 1}}_{=t^{(k^{\prime}+1)}}t^{(k^{\prime})}\exp({\shortminus}\varepsilon c)Diag(\rho^{(k^{\prime}+1)})\\ &=t^{(k^{\prime}+1)}\exp({\shortminus}\varepsilon c)Diag(\rho^{(k^{\prime}+1)}),\end{split}

meaning that the expressions also hold for the pair (ρ(k+1),π(k+1))(\rho^{(k^{\prime}+1)},\pi^{(k^{\prime}+1)}). ∎
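
As a sanity check, the equivalence between the Bregman projections in (A.1) and the compact recursion of Proposition 4.2 can be verified numerically; the following sketch uses arbitrary cost values and illustrative choices of \varepsilon, \mu and the iteration count:

import numpy as np

rng = np.random.default_rng(0)
m, n, eps, mu, iters = 4, 6, 5.0, 0.5, 50
kappa = np.exp(-eps * rng.random((m, n)))          # exp(-eps * c)

# Bregman projections of Eq. (A.1), initialized at (1_n, exp(-eps * c))
rho, pi = np.ones(n), kappa.copy()
for _ in range(iters):
    scale = (1.0 / n) / (rho + pi.sum(axis=0))     # projection onto C1
    rho, pi = rho * scale, pi * scale[None, :]
    pi = mu * pi / pi.sum()                        # projection onto C2

# compact recursion of Proposition 4.2
t = 1.0
for _ in range(iters):
    rho2 = 1.0 / (n * (1.0 + t * kappa.sum(axis=0)))
    t = mu / (kappa @ rho2).sum()

print(np.allclose(rho, rho2), np.allclose(pi, t * kappa * rho2[None, :]))   # True True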

A.3 Proof for Proposition 4.3

Proof.

We start our proof by expressing (P2) in a compact form as:

x(ε)=argminx0Ax=bc¯x+1εx(logx𝟏(m+1)n)x^{(\varepsilon)}=\operatorname*{arg\,min}_{\begin{subarray}{c}x\geqslant 0\\ Ax=b\end{subarray}}\bar{c}^{\intercal}x+\tfrac{1}{\varepsilon}x^{\intercal}(\log x-\bm{1}_{(m+1)n})

where x denotes the vector formed by stacking \rho and the row vectors of \pi, \bar{c} denotes the vector formed by stacking the n-dimensional zero vector and the row vectors of c, and A and b are such that Ax=b imposes the transport constraints:

A=[In𝗑nIn𝗑nIn𝗑nm𝟎n𝟏mn],b=[1/n𝟏nμ]A=\begin{bmatrix}I_{n\mathsf{x}\mskip 1.0mun}&{\smash{\overbrace{\begin{matrix}I_{n\mathsf{x}\mskip 1.0mun}&\dotsb&I_{n\mathsf{x}\mskip 1.0mun}\end{matrix}}^{m}}}\\ \bm{0}^{\intercal}_{n}&\bm{1}_{m\,n}^{\intercal}\end{bmatrix}\,,\quad b=[\nicefrac{{1}}{{n}}\bm{1}_{n}^{\intercal}\quad\mu]^{\intercal}

From the Lagrangian dual, we have:

x(ε)=exp(ε(c¯+Aλ))x^{(\varepsilon)}=\exp({\shortminus}\varepsilon(\bar{c}{+}A^{\intercal}\lambda^{\ast}))

where \lambda^{\ast} is the optimal dual variable satisfying:

Aexp(ε(c¯+Aλ))=bA\exp({\shortminus}\varepsilon(\bar{c}{+}A^{\intercal}\lambda^{\ast}))=b

Defining [\tfrac{\partial x}{\partial c}]_{ij}\coloneqq\tfrac{\partial x_{j}}{\partial c_{i}}, we have:

x(ε)c=εI¯(I+λc¯A)Diag(x(ε))\tfrac{\partial x^{(\varepsilon)}}{\partial c}=-\varepsilon\bar{I}(I+\tfrac{\partial\lambda^{\ast}}{\partial\bar{c}}A)Diag(x^{(\varepsilon)})

where \bar{I}\coloneqq[\bm{0}_{(mn){\mathsf{x}\mskip 1.0mu}n}\quad I_{(mn){\mathsf{x}\mskip 1.0mu}(mn)}] selects the entries of x^{(\varepsilon)} corresponding to \pi^{(\varepsilon)}. Similarly, for the dual variable, we have:

ε(I+λc¯A)Diag(x(ε))A=0λc¯=Diag(x(ε))A(ADiag(x(ε))A)1.-\varepsilon(I+\tfrac{\partial\lambda^{\ast}}{\partial\bar{c}}A)Diag(x^{(\varepsilon)})A^{\intercal}=0\Rightarrow\tfrac{\partial\lambda^{\ast}}{\partial\bar{c}}=\shortminus Diag(x^{(\varepsilon)})A^{\intercal}(ADiag(x^{(\varepsilon)})A^{\intercal})^{\shortminus 1}.

Substituting the expression for \tfrac{\partial\lambda^{\ast}}{\partial\bar{c}} into \tfrac{\partial x^{(\varepsilon)}}{\partial c}, we obtain:

x(ε)c=εI¯(Diag(x(ε))Diag(x(ε))A(ADiag(x(ε))A)1ADiag(x(ε))),\tfrac{\partial x^{(\varepsilon)}}{\partial c}=-\varepsilon\bar{I}\big{(}Diag(x^{(\varepsilon)})-Diag(x^{(\varepsilon)})A^{\intercal}(ADiag(x^{(\varepsilon)})A^{\intercal})^{\shortminus 1}ADiag(x^{(\varepsilon)})\big{)},

which requires inverting the (n{+}1)-by-(n{+}1) matrix H\coloneqq ADiag(x^{(\varepsilon)})A^{\intercal}. We now show that H^{\shortminus 1} can be obtained without explicit matrix inversion.

H can be expressed as:

H=[1/nI1/nρ1/nρμ]H=\begin{bmatrix}\nicefrac{{1}}{{n}}I&\nicefrac{{1}}{{n}}-\rho\\ \nicefrac{{1}}{{n}}-\rho^{\intercal}&\mu\end{bmatrix}

H is Hermitian and positive definite. Using the block matrix inversion formula for such matrices (Corollary 4.1 of [80]), we obtain the inverse as:

H1=[nI+k1ρ^ρ^k1ρ^k1ρ^k1]H^{\shortminus 1}=\begin{bmatrix}nI+k^{\shortminus 1}\hat{\rho}\hat{\rho}^{\intercal}&-k^{\shortminus 1}\hat{\rho}\\ -k^{\shortminus 1}\hat{\rho}^{\intercal}&k^{\shortminus 1}\end{bmatrix}

where k=1μnρ(ε)ρ(ε)k=1-\mu-n\rho^{(\varepsilon)\intercal}\rho^{(\varepsilon)} and ρ^=1nρ(ε)\hat{\rho}=1-n\rho^{(\varepsilon)}.

Given \tfrac{\partial\mathcal{L}}{\partial x^{(\varepsilon)}}, i.e., (\tfrac{\partial\mathcal{L}}{\partial\rho^{(\varepsilon)}},\tfrac{\partial\mathcal{L}}{\partial\pi^{(\varepsilon)}}), the rest of the proof to obtain \tfrac{\partial\mathcal{L}}{\partial c} follows by multiplying the loss gradient with the Jacobian, i.e., \tfrac{\partial\mathcal{L}}{\partial c}=\tfrac{\partial x^{(\varepsilon)}}{\partial c}\,\tfrac{\partial\mathcal{L}}{\partial x^{(\varepsilon)}}, and rearranging the terms. ∎
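
The closed-form inverse above is easy to verify numerically; the following sketch builds H from a synthetic \rho^{(\varepsilon)} (assumed to satisfy the marginal constraints, i.e., \rho\geqslant 0 and \Sigma_{j}\rho_{j}=1{\shortminus}\mu) and compares the block formula against a direct inverse:

import numpy as np

rng = np.random.default_rng(0)
n, mu = 6, 0.5
# synthetic residual weights consistent with the constraints: rho >= 0, sum(rho) = 1 - mu
rho = 1.0 + 0.1 * rng.standard_normal(n)
rho = (1.0 - mu) * rho / rho.sum()

H = np.block([[np.eye(n) / n,            (1.0 / n - rho)[:, None]],
              [(1.0 / n - rho)[None, :], np.array([[mu]])]])

k = 1.0 - mu - n * rho @ rho
rho_hat = 1.0 - n * rho
H_inv = np.block([[n * np.eye(n) + np.outer(rho_hat, rho_hat) / k, -rho_hat[:, None] / k],
                  [-rho_hat[None, :] / k,                          np.array([[1.0 / k]])]])

print(np.allclose(H_inv, np.linalg.inv(H)))   # True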

Table 8: Comparison of our pooling method with OT-based pooling alternatives
Item\downarrow Method\rightarrow | Ensemble of SWD Monge Maps [51] | OT Monge Map [40, 41] | Ours (GSP)
optimization problem | s\dagger-many 1D OT problems | \underset{\begin{subarray}{c}\pi\geq 0\\ \Sigma_{i}\pi_{ij}=\nicefrac{{1}}{{n}}\\ \Sigma_{j}\pi_{ij}=\nicefrac{{1}}{{m}}\end{subarray}}{\mathrm{argmin}}\Sigma_{ij}c_{ij}\pi_{ij} | \underset{\begin{subarray}{c}\rho,\pi\geq 0\\ \rho_{j}+\Sigma_{i}\pi_{ij}=\nicefrac{{1}}{{n}}\\ \Sigma_{ij}\pi_{ij}=\mu\end{subarray}}{\mathrm{argmin}}\Sigma_{ij}c_{ij}\pi_{ij}
image representation | [g_{1}\mid g_{2}\mid\cdots\mid g_{s}]\dagger | \sqrt{m}[f_{1}\mid f_{2}\mid\cdots\mid f_{n}]\pi^{\intercal} | \Sigma_{i}\tfrac{1-n\rho_{i}}{n\mu}f_{i}
dimension | m\times s | m\times d | d
feature selection | ✗ | ✗ | ✓
gradient computation | auto-diff | auto-diff | closed form expression
matrix-inverse-free gradient | – | – | ✓

\dagger s: number of slices; g_{i}: projection of \{f_{j}\}_{j} onto slice i, with j sorted according to the Monge map.

Appendix B Optimal Transport Based Operators

In this section, we briefly discuss optimal transport (OT) based aggregation and selection operators. We provide their formulations to show how our formulation differs from them.

B.1 Feature Aggregation

Given a cost map c_{ij}=\|\omega_{i}\shortminus f_{j}\|_{2}, an m (number of prototypes) by n (number of features) matrix representing the closeness of the prototypes \omega_{i} and the features f_{j}, [41] considers the following OT problem:

π=argminπ0Σiπij=1/nΣjπij=1/mijcijπij\pi^{\ast}=\!\!\!\operatorname*{arg\,min}_{\begin{subarray}{c}\pi\geqslant 0\\ \Sigma_{i}\pi_{ij}=\nicefrac{{1}}{{n}}\\ \Sigma_{j}\pi_{ij}=\nicefrac{{1}}{{m}}\end{subarray}}\!\!\!\textstyle\sum_{ij}c_{ij}\pi_{ij} (P4)

and defines their aggregated feature as:

g=m[f1f2fn]π.g=\sqrt{m}[f_{1}\mid f_{2}\mid\cdots\mid f_{n}]\pi^{\intercal}\,\,. (B.1)

Namely, g is an ensemble of m vectors, each of which is a weighted aggregation of \{f_{i}\}_{i\in[n]} with weights proportional to the mass assigned to the corresponding prototype. The same aggregation scheme also arises in the context of linear Wasserstein embeddings via Monge maps, where it is shown to be a barycentric projection of the feature set with the transport plan approximating the Monge map [40]. Similarly, ensembles of Monge maps corresponding to sliced-Wasserstein distances (SWD) are employed for set aggregation [51]. In such ensemble representations, there is no feature selection mechanism and thus all the features are included in the image representation in some form.

For instance, if we further sum those m vectors of g in Eq. B.1 to obtain a single global representation, we end up with global average pooling up to a constant scale of \sqrt{m}: g\bm{1}_{m}=\sqrt{m}[f_{1}\mid f_{2}\mid\cdots\mid f_{n}]\pi^{\intercal}\bm{1}_{m}=\nicefrac{{\sqrt{m}}}{{n}}[f_{1}\mid f_{2}\mid\cdots\mid f_{n}]\bm{1}_{n}=\nicefrac{{\sqrt{m}}}{{n}}\Sigma_{i}f_{i}.
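
This identity can be checked numerically; the sketch below solves an entropy-regularized version of (P4) with Sinkhorn iterations (the regularization weight and the problem sizes are illustrative) and sums the resulting ensemble:

import numpy as np

rng = np.random.default_rng(0)
m, n, d, reg = 4, 10, 8, 0.5
proto, feats = rng.standard_normal((m, d)), rng.standard_normal((n, d))
cost = np.linalg.norm(proto[:, None] - feats[None], axis=-1)       # (m, n)

# Sinkhorn iterations for an entropy-regularized (P4): rows sum to 1/m, columns to 1/n
K = np.exp(-cost / reg)
u, v = np.ones(m), np.ones(n)
for _ in range(200):
    u = (1.0 / m) / (K @ v)
    v = (1.0 / n) / (K.T @ u)
pi = u[:, None] * K * v[None, :]                                   # column sums are exactly 1/n

g = np.sqrt(m) * feats.T @ pi.T                                    # ensemble of Eq. (B.1), (d, m)
gap = np.sqrt(m) / n * feats.sum(axis=0)                           # scaled global average pooling
print(np.allclose(g.sum(axis=1), gap))                             # True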

Briefly, those optimal transport based set aggregators [40, 41, 51] map a set of features to another set of features without discarding any of them and thus do not provide a natural way to aggregate only the class-discriminative subset of the features. Such representation schemes are useful for structural matching. Albeit enabling the \ell 2 metric as a similarity measure between sets, their ensemble based representations result in very high dimensional embeddings. In contrast, our formulation effectively enables learning to select discriminative features and maps a set of features to a single feature that has the same dimension and is distilled from nuisance information. We summarize the comparison of our pooling method with its optimal transport based counterparts in Table 8.

B.2 Top-kk Selection

Given n-many scalars x=[x_{i}]_{i\in[n]} and m-many scalars y=[y_{i}]_{i\in[m]} with y in an increasing family, i.e., y_{1}{<}y_{2}{<}\ldots, [50] considers the following OT problem:

π=argminπ0Σiπij=qjΣjπij=piijcijπij\pi^{\ast}=\!\!\operatorname*{arg\,min}_{\begin{subarray}{c}\pi\geqslant 0\\ \Sigma_{i}\pi_{ij}=q_{j}\\ \Sigma_{j}\pi_{ij}=p_{i}\end{subarray}}\!\!\textstyle\sum_{ij}c_{ij}\pi_{ij} (P5)

where c_{ij}=|y_{i}-x_{j}| and p lies in the m-dimensional probability simplex, i.e., p\in\{p{\in}I\!\!R_{\geqslant 0}^{m}\mid\Sigma_{i}p_{i}{=}1\}. Then, top-k selection is formulated with the setting q=\nicefrac{{1}}{{n}}\bm{1}_{n}, y=[0,1] and p=[\tfrac{k}{n}\,\,\tfrac{n{\shortminus}k}{n}]^{\intercal}. Similarly, sorted top-k selection is formulated with the setting y=[k] and p=[\tfrac{1}{n}\cdots\tfrac{1}{n}\,\,\tfrac{n{\shortminus}k}{n}]^{\intercal}. Solving the same problem (P5), [13] formulates top-k ranking by letting q and p be arbitrary points in the probability simplex and y be in an increasing family.
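
For illustration, an entropy-regularized version of (P5) with the two-anchor setting above can be sketched as follows; the regularization weight and the reading of the plan as a soft membership are our assumptions rather than the exact recipe of [50], and with this particular mass assignment the anchor carrying mass k/n collects the k entries of x closest to it:

import numpy as np

def soft_topk_plan(x, k, reg=0.01, iters=500):
    # entropy-regularized (P5) with y = [0, 1], q = 1/n and p = [k/n, (n-k)/n]
    n = len(x)
    x = (x - x.min()) / (x.max() - x.min())        # normalize the scores to [0, 1]
    y = np.array([0.0, 1.0])
    p, q = np.array([k / n, (n - k) / n]), np.full(n, 1.0 / n)
    K = np.exp(-np.abs(y[:, None] - x[None, :]) / reg)
    u, v = np.ones(2), np.ones(n)
    for _ in range(iters):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]             # (2, n) transport plan

x = np.array([0.3, 2.0, -1.0, 0.7, 1.5, -0.2])
pi = soft_topk_plan(x, k=2)
# n * pi[0] is a soft membership for the anchor carrying mass k/n (here it picks the k
# smallest entries of x); assigning the masses in the reverse order picks the k largest
print(np.round(len(x) * pi[0], 2))                 # approx. [0, 0, 1, 0, 0, 1]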

Such top-k formulations are suitable for selecting/ranking scalars. In our problem, the aim is to select the subset of features that are closest to the prototypes, which are representatives of the discriminative information. Namely, we have a problem of subset selection from set-to-set distances. If our problem were in a set-to-vector form, we would be able to formulate it using (P5). However, there is no natural extension of the methods in [50, 13] to our problem. Therefore, we rigorously develop an OT based formulation to express a discriminative subset selection operation analytically in a differentiable form.

Our formulation in (P1) differs from the typical optimal transport problem exploited in (P5). In optimal transport, one matches two distributions and transports all the mass from one to the other. In contrast, we transport a \mu portion of the uniformly distributed mass to the prototypes, which have no restriction on their mass distribution. In our formulation, we have a portion constraint instead of a target distribution constraint, and we use an additional decision variable, \rho, accounting for the residual masses. If we did not have \rho and set \mu=1, the problem would become a specific case of an optimal transport barycenter problem with a single distribution.

Our problem can be expressed in a compact form by absorbing \rho into \pi with zero costs associated with its entries, as in the proof (Sec. A.3). We choose to explicitly define \rho in the problem (P1) to expose its role; we believe its residual-mass role is more understandable this way. The benefits of our formulation include a matrix-inversion-free Jacobian for feature selection and the ability to turn the prototypes into background representatives simply by weighting the features with \rho instead of \nicefrac{{1}}{{n}}{\shortminus}\rho in Eq. (4.1). Our specific formulation further allows us to tailor a zero-shot regularization loss built on the prototypes learned within our pooling layer.

Appendix C Implementations with Pseudo Codes

Algorithm 1 Deep Metric Learning Loss with GSP and ZSR
(X,Y)=({xi},{yi})ib(X,Y)=(\{x_{i}\},\{y_{i}\})_{i\in b} \eqparbox[h]COMMENT// a batch of image-label pairs
FBackbone(X)F\leftarrow\mathrm{Backbone}(X) \eqparbox[h]COMMENT// a CNN backbone such as BN-Inception, ResNet
(Xp,Z){GSP(f)}fF(X_{p},Z)\leftarrow\{\mathrm{GSP}(f)\}_{f\in F} \eqparbox[h]COMMENT// get pooled features and attribute predictions, see Algorithm 2
ZSRZSR(Z,Y)\mathcal{L}_{ZSR}\leftarrow\mathrm{ZSR}(Z,Y) \eqparbox[h]COMMENT// compute ZSR loss, see Algorithm 4
DMLLossDML(Xp,Y)\mathcal{L}_{DML}\leftarrow\mathrm{LossDML}(X_{p},Y) \eqparbox[h]COMMENT// a DML loss such as contrastive, triplet, XBM, LIBC, …
(1λ)DML+λZSR\mathcal{L}\leftarrow(1{\shortminus}\lambda)\mathcal{L}_{DML}+\lambda\mathcal{L}_{ZSR} \eqparbox[h]COMMENT// we set λ=0.1\lambda{=}0.1

return \mathcal{L}

Algorithm 2 GSP(ff)

trainable parameters: ω={ωi}i[m]\omega=\{\omega_{i}\}_{i\in[m]}\eqparbox[h]COMMENT// mm-many prototypes

 

f={fi}i[n]f=\{f_{i}\}_{i\in[n]} \eqparbox[h]COMMENT// feature map, n=whn=w\,h (i.e. width𝗑height\mathrm{width}\mathsf{x}\mskip 1.0mu\mathrm{height})
ω¯iωi/max{1,ωi2}\bar{\omega}_{i}\leftarrow\nicefrac{{\omega_{i}}}{{\max\{1,\|\omega_{i}\|_{2}\}}} i[m]\forall i{\in}[m]
f¯jfj/max{1,fj2}\bar{f}_{j}\leftarrow\nicefrac{{f_{j}}}{{\max\{1,\|f_{j}\|_{2}\}}} j[n]\forall j{\in}[n]
cijω¯if¯j2c_{ij}\leftarrow\|\bar{\omega}_{i}\shortminus\bar{f}_{j}\|_{2} \eqparbox[h]COMMENT// cost map, c={cij}(i,j)[m]𝗑[n]c=\{c_{ij}\}_{(i,j)\in[m]\mathsf{x}\mskip 1.0mu[n]}
ρ,πWeightTransport(c)\rho,\pi\leftarrow\mathrm{WeightTransport}(c) \eqparbox[h]COMMENT// see Algorithm 3
f1nρμff\leftarrow\tfrac{1{\shortminus}n\rho}{\mu}\odot f \eqparbox[h]COMMENT// re-weight features, \odot: element-wise multiplication
xp(1ni[n]fip)1/px_{p}\leftarrow(\tfrac{1}{n}\textstyle\sum\limits_{i\in[n]}f_{i}^{p})^{\nicefrac{{1}}{{p}}} \eqparbox[h]COMMENT// pooled feature, GSP for p=1p{=}1, GeMean+GSP for p>1p{>}1
zi1μj[n]πijz_{i}\leftarrow\tfrac{1}{\mu}\textstyle\sum\limits_{j\in[n]}\pi_{ij} i[m]\forall i{\in}[m] \eqparbox[h]COMMENT// attribute predictions, z={zi}i[m]z=\{z_{i}\}_{i\in[m]}

return xpx_{p}, zz
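
A PyTorch-style sketch of Algorithm 2 is given below for the p=1 case; it unrolls the transport iterations of Proposition 4.2 inline and lets autograd differentiate through them (the closed-form backward of Algorithm 3 is sketched after that listing). The module name and the default values of \mu, \varepsilon and the iteration count are illustrative choices, not the exact configuration:

import torch

class GSP(torch.nn.Module):
    # a sketch of Algorithm 2 (p = 1 case) with the transport iterations unrolled inline;
    # mu, eps and iters are illustrative defaults, not the exact configuration
    def __init__(self, num_prototypes, dim, mu=0.3, eps=10.0, iters=10):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(num_prototypes, dim))
        self.mu, self.eps, self.iters = mu, eps, iters

    def forward(self, f):                              # f: (n, d) flattened feature map
        n = f.shape[0]
        w_bar = self.prototypes / self.prototypes.norm(dim=1, keepdim=True).clamp(min=1.0)
        f_bar = f / f.norm(dim=1, keepdim=True).clamp(min=1.0)
        cost = torch.cdist(w_bar, f_bar)               # (m, n) pairwise l2 distances
        kappa, t = torch.exp(-self.eps * cost), f.new_ones(())
        for _ in range(self.iters):                    # iterations of Proposition 4.2
            rho = 1.0 / (n * (1.0 + t * kappa.sum(dim=0)))
            t = self.mu / (kappa @ rho).sum()
        pi = t * kappa * rho.unsqueeze(0)
        x_p = (((1.0 - n * rho) / self.mu).unsqueeze(1) * f).mean(dim=0)   # pooled feature
        z = pi.sum(dim=1) / self.mu                    # attribute predictions
        return x_p, z

The pooled feature x_p can then be fed to any DML loss, and z to the ZSR loss of Algorithm 4.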

Algorithm 3 WeightTransport(cc)

hyperparameters: μ:\mu\mathrel{\mathop{\mathchar 58\relax}} transport ratio, ε:\varepsilon\mathrel{\mathop{\mathchar 58\relax}} entropy regularization weight, k:k\mathrel{\mathop{\mathchar 58\relax}} number of iterations

 

forward: gets cost map, cc, returns residual weights, ρ\rho, and transport plan π\pi

 

c={cij}(i,j)[m]𝗑[n]c=\{c_{ij}\}_{(i,j)\in[m]\mathsf{x}\mskip 1.0mu[n]} \eqparbox[h]COMMENT// cost map of mm-many prototypes and nn-many features
κexp(εc)\kappa\leftarrow\mathrm{exp}({\shortminus}\varepsilon c), t1t\leftarrow 1 \eqparbox[h]COMMENT// exp\mathrm{exp} is element-wise
repeat kk times  
     ρ1/n(1+tκ𝟏m)1\rho\leftarrow\nicefrac{{1}}{{n}}(1+t\,\kappa^{\intercal}\bm{1}_{m})^{\shortminus 1} \eqparbox[h]COMMENT// A𝟏mA^{\intercal}\bm{1}_{m}: sum AA along rows, ()1(\cdot)^{\shortminus 1} is element-wise
     tμ(𝟏mκρ)1t\leftarrow\mu(\bm{1}_{m}^{\intercal}\kappa\,\rho)^{\shortminus 1}

return ρ,tκDiag(ρ)\rho,\,t\,\kappa\,Diag(\rho) \eqparbox[h]COMMENT// πtκDiag(ρ)\pi\leftarrow t\,\kappa\,Diag(\rho)

 

backward: gets the solution (ρ\rho, π\pi) and the gradients (ρ,π\frac{\partial\mathcal{L}}{\partial\rho},\frac{\partial\mathcal{L}}{\partial\pi}), returns c\frac{\partial\mathcal{L}}{\partial c}

 

ρ,π,ρ,π\rho,\,\,\pi,\,\,\frac{\partial\mathcal{L}}{\partial\rho},\,\,\frac{\partial\mathcal{L}}{\partial\pi} \eqparbox[h]COMMENT// results of forward pass and the loss gradient w.r.t. them
qρρ+(ππ)𝟏mq\leftarrow\rho\odot\frac{\partial\mathcal{L}}{\partial\rho}+(\pi\odot\frac{\partial\mathcal{L}}{\partial\pi})^{\intercal}\bm{1}_{m} \eqparbox[h]COMMENT// A𝟏mA^{\intercal}\bm{1}_{m}: sum AA along rows, \odot: element-wise multiplication
η(ρρ)𝟏nnqρ\eta\leftarrow(\rho\odot\frac{\partial\mathcal{L}}{\partial\rho})^{\intercal}\bm{1}_{n}\shortminus n\,q^{\intercal}\rho
\frac{\partial\mathcal{L}}{\partial c}\leftarrow\shortminus\varepsilon\Big{(}\pi\odot\frac{\partial\mathcal{L}}{\partial\pi}-n\pi Diag\big{(}q-\tfrac{\eta}{1\shortminus\mu\shortminus n\rho^{\intercal}\rho}\rho\big{)}\Big{)}

return c\frac{\partial\mathcal{L}}{\partial c}
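
A PyTorch-style sketch of Algorithm 3 as a custom autograd function is given below; the forward pass runs the iterations of Proposition 4.2 and the backward pass applies the closed-form, matrix-inverse-free gradient of Proposition 4.3 (class and variable names are ours):

import torch

class WeightTransport(torch.autograd.Function):
    # forward: the iterations of Proposition 4.2; backward: the closed-form,
    # matrix-inverse-free gradient of Proposition 4.3 (cf. the backward pass above)
    @staticmethod
    def forward(ctx, cost, mu, eps, iters):
        m, n = cost.shape
        kappa, t = torch.exp(-eps * cost), cost.new_ones(())
        for _ in range(iters):
            rho = 1.0 / (n * (1.0 + t * kappa.sum(dim=0)))
            t = mu / (kappa @ rho).sum()
        pi = t * kappa * rho.unsqueeze(0)
        ctx.save_for_backward(rho, pi)
        ctx.mu, ctx.eps, ctx.n = mu, eps, n
        return rho, pi

    @staticmethod
    def backward(ctx, grad_rho, grad_pi):
        rho, pi = ctx.saved_tensors
        mu, eps, n = ctx.mu, ctx.eps, ctx.n
        grad_rho = torch.zeros_like(rho) if grad_rho is None else grad_rho
        grad_pi = torch.zeros_like(pi) if grad_pi is None else grad_pi
        q = rho * grad_rho + (pi * grad_pi).sum(dim=0)
        eta = (rho * grad_rho).sum() - n * (q * rho).sum()
        k = 1.0 - mu - n * (rho * rho).sum()
        grad_c = -eps * (pi * grad_pi - n * pi * (q - (eta / k) * rho).unsqueeze(0))
        return grad_c, None, None, None

# usage: rho, pi = WeightTransport.apply(cost, 0.5, 10.0, 10)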

Algorithm 4 ZSR(ZZ, YY)

trainable parameters: Υ={υi}i[#classes]\Upsilon=\{\upsilon_{i}\}_{i\in[\#\text{classes}]}\eqparbox[h]COMMENT// a semantic embedding vector for each class label

 

Z={zi}ibZ{=}\{z_{i}\}_{i\in b}, Y={yi}ibY{=}\{y_{i}\}_{i\in b} \eqparbox[h]COMMENT// a batch, bb, of attribute prediction vectors, ziz_{i}, and their labels, yiy_{i}
(b1,b2)(b_{1},b_{2})\leftarrow split bb into two class-disjoint halves s.t. {yi}ib1{yi}ib2=\{y_{i}\}_{i\in b_{1}}{\cap}\{y_{i}\}_{i\in b_{2}}=\emptyset
Υk[υyi]ibk\Upsilon_{k}\leftarrow[\upsilon_{y_{i}}]_{i\in b_{k}} for k=1,2k{=}1,2 \eqparbox[h]COMMENT// label embedding matrix for batch-kk, i.e. prediction targets
Zk[zi]ibkZ_{k}\leftarrow[z_{i}]_{i\in b_{k}} for k=1,2k{=}1,2 \eqparbox[h]COMMENT// attribute prediction matrix for batch-kk, i.e. prediction inputs
AkΥk(ZkZk+ϵI)1ZkA_{k}\leftarrow\Upsilon_{k}(Z_{k}^{\intercal}Z_{k}+\epsilon I)^{\shortminus 1}Z_{k}^{\intercal} for k=1,2k{=}1,2\eqparbox[h]COMMENT// fit label embedding predictor for batch-kk, ϵ=0.05\epsilon{=}0.05
Υ^1A2Z1\hat{\Upsilon}_{1}\leftarrow A_{2}Z_{1}, Υ^2A1Z2\hat{\Upsilon}_{2}\leftarrow A_{1}Z_{2} \eqparbox[h]COMMENT// use predictor for bkb_{k} to predict the label embeddings of bkb_{k^{\prime}}
Υ^[Υ^1Υ^2]\hat{\Upsilon}\leftarrow[\hat{\Upsilon}_{1}\mid\hat{\Upsilon}_{2}] \eqparbox[h]COMMENT// concatenate predictions
SSoftMax(Υ^Υ)S\leftarrow\mathrm{SoftMax}(\hat{\Upsilon}^{\intercal}\Upsilon) \eqparbox[h]COMMENT// similarity scores between predictions and label embeddings
ZSRCrossEntropy(S,Y)\mathcal{L}_{ZSR}\leftarrow\mathrm{CrossEntropy}(S,Y)

return ZSR\mathcal{L}_{ZSR}
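
A PyTorch-style sketch of Algorithm 4 could look as follows; here the batch is simply halved (assuming the two halves are class-disjoint, as the listing requires), the ridge term follows the \epsilon{=}0.05 of the listing, and the cross-entropy consumes the similarity scores as logits:

import torch
import torch.nn.functional as F

class ZSRLoss(torch.nn.Module):
    # a sketch of Algorithm 4; the batch is simply halved here, assuming the two halves
    # are class-disjoint, and ridge corresponds to the listing's epsilon = 0.05
    def __init__(self, num_classes, embed_dim, ridge=0.05):
        super().__init__()
        self.label_emb = torch.nn.Parameter(torch.randn(num_classes, embed_dim))
        self.ridge = ridge

    def forward(self, z, y):                       # z: (b, m) attribute predictions, y: (b,)
        half = z.shape[0] // 2
        (z1, y1), (z2, y2) = (z[:half], y[:half]), (z[half:], y[half:])
        preds = []
        for (zf, yf), zp in (((z1, y1), z2), ((z2, y2), z1)):
            # ridge regression on one half: maps attribute predictions to label embeddings
            gram = zf.t() @ zf + self.ridge * torch.eye(zf.shape[1], device=zf.device)
            A = torch.linalg.solve(gram, zf.t() @ self.label_emb[yf])      # (m, d)
            preds.append(zp @ A)                   # predict label embeddings of the other half
        pred = torch.cat(preds[::-1], dim=0)       # restore batch order: A2 on z1, A1 on z2
        scores = pred @ self.label_emb.t()         # (b, num_classes) similarity scores
        return F.cross_entropy(scores, y)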