Generalized Sum Pooling for Metric Learning
Abstract
A common architectural choice for deep metric learning is a convolutional neural network followed by global average pooling (GAP). Albeit simple, GAP is a highly effective way to aggregate information. One possible explanation for the effectiveness of GAP is considering each feature vector as representing a different semantic entity and GAP as a convex combination of them. Following this perspective, we generalize GAP and propose a learnable generalized sum pooling method (GSP). GSP improves GAP with two distinct abilities: i) the ability to choose a subset of semantic entities, effectively learning to ignore nuisance information, and ii) learning the weights corresponding to the importance of each entity. Formally, we propose an entropy-smoothed optimal transport problem and show that it is a strict generalization of GAP, i.e., a specific realization of the problem gives back GAP. We show that this optimization problem enjoys analytical gradients enabling us to use it as a direct learnable replacement for GAP. We further propose a zero-shot loss to ease the learning of GSP. We show the effectiveness of our method with extensive evaluations on 4 popular metric learning benchmarks. Code is available at: GSP-DML Framework
1 Introduction
Distance metric learning (DML) addresses the problem of finding an embedding function such that semantically similar samples are embedded close to each other while dissimilar ones are placed relatively far apart in the Euclidean sense. Although the prolific and diverse literature of DML includes various architectural designs [1, 2, 3], loss functions [4], and data-augmentation techniques [5, 6], many of these methods have a shared component: a convolutional neural network (CNN) followed by a global pooling layer, mostly global average pooling (GAP) [4].
Common folklore to explain the effectiveness of GAP is to consider each pixel of the CNN feature map as corresponding to a separate semantic entity [7]. For example, the spatial extent of one pixel can correspond to a "tire" object, making the resulting feature a representation of the "tireness" of the image. If this explanation is correct, the representation space defined via the output of GAP is a convex combination of semantically independent representations defined by each pixel in the feature map. Although this folklore was later empirically studied in [8, 9, 10, and references therein] and further verified for classification in [11, 12], its algorithmic implications are not clear. If each feature truly represents a different semantic entity, should we really average over all of them? Surely, some entities belong to the background and should be discarded as nuisance variables. Moreover, is a uniform average of them the best choice? Aren't some entities more important than others? In this paper, we try to answer these questions within the context of metric learning. We propose a learnable and generalized version of GAP which learns to choose the subset of semantic entities to utilize as well as the weights to assign to them while averaging.
In order to generalize the GAP operator into a learnable form, we re-define it as the solution of an optimization problem. We let the solution space include zero weights, effectively enabling us to choose a subset of the features, and we carefully regularize the problem to discourage the degenerate solution of using all the features. Crucially, we rigorously show that the original GAP is a specific case of our proposed optimization problem for a certain realization. Our proposed optimization problem closely follows optimal transport based top-$k$ operators [13], and we utilize that literature to solve it. Moreover, we present an algorithm for efficient computation of the gradients of this optimization problem, enabling learning. A critical desideratum of such an operator is choosing the subset of features which are discriminative and ignoring the background classes corresponding to nuisance variables. Although supervised metric learning losses provide guidance for seen classes, they carry no information to generalize this behavior to unseen classes. To enable such behavior, we adopt a zero-shot prediction loss as a regularization term, built on expressing the class label embeddings as a convex combination of attribute embeddings [14, 12].
In order to validate the theoretical claims, we design a synthetic empirical study. The results confirm that our pooling method chooses better subsets and improves generalization. Moreover, our method can be applied with any DML loss since GAP is a component shared by all of them. We apply our method to 6 DML losses and test on 4 datasets. The results show consistent improvements with respect to the direct application of GAP as well as other pooling alternatives.
2 Related Work
Our contributions. Briefly, we i) introduce a general formulation for weighted sum pooling, ii) formulate local feature selection as an optimization problem which admits a closed-form gradient expression without matrix inversion, and iii) propose a meta-learning based zero-shot regularization term to explicitly impose unseen-class generalization on the DML problem. We now discuss the works most related to ours.
Distance Metric Learning (DML). Primary thrusts in DML include tailoring pairwise loss terms [4] that enforce the desired intra- and inter-class proximity constraints, pair mining [5], generating informative samples [15, 16, 17, 6], and augmenting the mini-batches with virtual embeddings called proxies [18, 19]. To improve generalization, learning theoretic ideas [20, 21, 22], disentangling class-discriminative and class-shared features [2, 23], intra-batch feature aggregation [24, 25], ranking surrogates [26], and further regularization terms [27, 28, 29, 30, 31] are utilized. To go beyond a single model, ensemble [32, 1, 33, 34, 35] and multi-task based approaches [36, 37] are also used. Differently, we propose a learnable pooling method generalizing GAP, a component shared by all of the mentioned works. Hence, our work is orthogonal to these and can be used jointly with any of them.
Prototype-based pooling. Most related to ours are trainable VLAD [38, 39] and optimal transport based aggregation [40, 41]. Such methods employ similarities to prototypes to form a vector of aggregated local features for each prototype and build an ensemble of representations. Although their embeddings admit the Euclidean metric for similarity, they typically have very large sizes, limiting their applicability to DML. Recently, reducing the dimension of the VLAD embedding via non-linear transforms was addressed for DML [42]. Nevertheless, such methods still map a set of features to another set of features without discarding any, and they do not provide a natural way to aggregate the class-discriminative subset of the features. On the contrary, our pooling machinery effectively enables learning to select discriminative features and maps a set of features to a single feature that is distilled from nuisance information.
Attention-based pooling. CroW [43], Trainable-SMK [44], and CBAM [45] reweight the CNN features before pooling. They build on feature-magnitude based saliency, assuming that the backbone must be able to zero out nuisance information. Yet, such a requirement is restrictive for the parameter space, and annihilation of the non-discriminative information might not be feasible in some problems. Similarly, the attention-based weighting methods DeLF [46] and GSoP [47] have no explicit control over the feature selection behavior and might result in poor models when jointly trained with the feature extractor [46]. Differently, our method unifies attention-based feature masking practices (e.g., convolution, correlation) with an efficient-to-solve optimization framework and lets us do away with engineered heuristics in obtaining the masking weights (e.g., normalization, sigmoid, soft-plus) without restricting the solution space.
Optimal transport (OT) based operators. In [49], the OT distance [48] between sets of local features is used as the DML distance metric instead of the Euclidean distance. Although effective, replacing the Euclidean distance with OT increases the memory cost of the image representation as well as the computation cost of the distance evaluation. Differently, we shift the OT-based computation to the pooling (i.e., feature extraction) stage while keeping OT's merits, and hence do not affect the memory and computation costs of inference, since we stick to the Euclidean metric. Moreover, our feature selection and aggregation formulation is closely related to OT [48] based top-$k$ [50], ranking [13], and aggregation [40, 41, 51] operators, which have not been effectively applied to DML before. Their aggregation is built on concatenating feature ensembles, resulting in very large embedding sizes. What makes our method different is the unique way we formulate the feature selection problem to fuse aggregation into it (see Appendix for technical details). Our selection operator allows computationally appealing and matrix-inversion-free gradient computation, unlike its OT based counterparts [52].
3 Preliminaries
Consider the data distribution $p$ over $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is the space of data points and $\mathcal{Y}$ is the space of labels. Given i.i.d. samples from $p$ as $\{(x_i, y_i)\}$, the distance metric learning problem aims to find the parameters $\theta$ of an embedding function $f_\theta:\mathcal{X}\to\mathbb{R}^d$ such that the Euclidean distance in the space of embeddings is consistent with the label information, where $d$ is the embedding dimension. More specifically, $\lVert f_\theta(x_i)-f_\theta(x_j)\rVert_2$ is small whenever $y_i=y_j$, and large whenever $y_i\neq y_j$. In order to enable learning, this requirement is represented via a loss function $\ell(\cdot)$ (e.g., contrastive [53], triplet [54], multi-similarity [55]).
The typical learning mechanism is gradient descent on an empirical risk function defined over a batch of data points $B$. To simplify notation throughout the paper, we will use $i\in B$ to index the samples in a batch. Then, the typical empirical risk function is defined as:
$$\mathcal{L}(\theta; B) \;=\; \frac{1}{|B|}\sum_{i\in B} \ell\big(f_\theta(x_i), y_i;\ \{(f_\theta(x_j), y_j)\}_{j\in B}\big) \tag{3.1}$$
We are interested in a specific class of embedding functions where global average pooling is used. Specifically, consider the composite function family $f_\theta = p \circ \phi_\theta$ such that $p$ is pooling and $\phi$ is feature computation. We assume further structure over the functions $p$ and $\phi$. The feature function $\phi$ maps the input space into $\mathbb{R}^{w\times h\times d}$, where $w$ and $h$ are spatial dimensions. Moreover, $p$ performs averaging as:
$$p(\mathbf{f}) \;=\; \frac{1}{n}\sum_{i\in[n]} f_i\,,\qquad n = w\,h \tag{3.2}$$
where $\mathbf{f} = \phi_\theta(x)$ and we let $f_i\in\mathbb{R}^d$ denote the $i$-th spatial feature of $\mathbf{f}$ to simplify notation. In the rest of the paper, we generalize the pooling function $p$ into a learnable form and propose an algorithm to learn it.
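For concreteness, here is a tiny NumPy sketch of (3.2) viewed as a uniform convex combination of the spatial features; the shapes and variable names are illustrative and not part of the original formulation.

```python
import numpy as np

w, h, d = 7, 7, 512                       # illustrative spatial and channel sizes
feature_map = np.random.randn(w, h, d)    # stands in for the output of phi(x)

f = feature_map.reshape(-1, d)            # n x d matrix of spatial features, n = w*h
weights = np.full(len(f), 1.0 / len(f))   # uniform weights: global average pooling
gap_embedding = weights @ f               # convex combination, shape (d,)

assert np.allclose(gap_embedding, feature_map.mean(axis=(0, 1)))
```

GSP replaces the uniform `weights` above with learned, input-dependent ones.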
4 Method
Consider the pooling operation in (3.2): it is a simple averaging over the pixel-level features ($f_i$). As we discuss in Sec. 1, one explanation for the effectiveness of this operation is to consider each $f_i$ as corresponding to a different semantic entity associated with the spatial extent of the pixel, and the averaging as a convex combination over these semantic classes. Our method is based on generalizing this averaging such that a specific subset of the pixels (correspondingly, a subset of the semantic entities) is selected and their weights are adjusted according to their importance.
We generalize the pooling (3.2) in Sec. 4.1 by formulating a feature selection problem in which we prioritize a subset of the features that are closest to some trainable prototypes. If a feature is to be selected, its weight will be high. We then formulate our pooling operation as a differentiable layer to facilitate the prototype learning along with the rest of the embedding function parameters in Sec. 4.2. We learn the prototypes with class-level supervision, however in metric learning, learned representations should generalize to unseen classes. Thus, we introduce a zero-shot prediction loss to regularize prototype training for zero-shot setting in Sec. 4.3.
4.1 Generalized Sum Pooling as a Linear Program
Consider the pooling function with adjustable weights, $p_w(\mathbf{f}) = \sum_{i\in[n]} w_i\, f_i$, where $w_i \geq 0$ and $\sum_i w_i = 1$. Note that $w_i = 1/n$ corresponds to average pooling. Informally, we want to control the weights to ease the metric learning problem. Specifically, we want the weights corresponding to background classes to be $0$ and the ones corresponding to discriminative features to be high.
If we were given representations of the discriminative semantic entities, we could simply compare them with the features ($f_i$) and choose the ones with high similarity. Our proposed method simply learns these representations and uses them for weight computation. We first discuss the weight computation before discussing how the prototype representations are learned.
Assume that there are $m$ discriminative semantic entities, which we call prototypes, with latent representations $\{\omega_k\}_{k\in[m]}$ of appropriate dimension (the same as $f_i$). Since we know that not all features ($f_i$) are relevant, we need to choose a subset of them. We perform this top-$k$-like selection process by converting it into an optimal transport (OT) problem.
Consider a cost map $c$, an $m$ (number of prototypes) by $n$ (number of features) matrix representing the closeness of prototypes and features after some normalization, i.e., $c_{ki}$ measures the distance between $\omega_k$ and $f_i$. We aim to find a transport map $\pi$ which re-distributes the uniform mass from the features to the prototypes. Since we do not have any prior information over the features, we also consider their marginal distribution (the importance of each feature to begin with) to be uniform. As we need to choose a subset, we set $\mu$, the ratio of the mass to be transported. The resulting OT problem is:
$$
\begin{aligned}
(\pi^\star,\rho^\star) \;=\; \arg\min_{\pi\geq 0,\;\rho\geq 0}\quad & \textstyle\sum_{k,i} c_{ki}\,\pi_{ki}\\[2pt]
\text{subject to}\quad & \textstyle\sum_{k} \pi_{ki} + \rho_i = \tfrac{1}{n}\;\;\forall\, i\,,\qquad \textstyle\sum_{k,i}\pi_{ki} = \mu
\end{aligned}
\tag{P1}
$$
Differently from the typical OT literature, we introduce decision variables $\rho_i$ to represent the residual weights to be discarded. Explicitly modelling the discarded weight instead of enforcing another marginalization constraint is beneficial beyond stylistic choices, as it allows us to compute gradients very efficiently. Once the introduced transport problem is solved, we perform weighting using the residual weights as:
$$w_i \;=\; \tfrac{1}{\mu}\big(\tfrac{1}{n} - \rho^\star_i\big)\,,\qquad p_w(\mathbf{f}) \;=\; \textstyle\sum_{i\in[n]} w_i\, f_i \tag{4.1}$$
Given a set of prototypes $\{\omega_k\}$, solving the problem in (P1) is a strict generalization of GAP: transporting all of the available mass (i.e., setting $\mu = 1$) forces the residuals to zero and recovers the original GAP weights $w_i = 1/n$. We formalize this equivalence in Claim 4.1 and defer its proof to the Appendix. Having generalized GAP to a learnable form, we introduce a method to learn the prototypes in the next section.
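As an illustration of the selection behavior, the sketch below solves a small instance of (P1) with a generic LP solver and converts the residuals into pooling weights via (4.1). It assumes the reconstruction of (P1) and (4.1) given above; the prototypes, features, and hyperparameters are random placeholders rather than learned quantities.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, d, mu = 4, 16, 8, 0.5                   # prototypes, features, dims, transport ratio
prototypes = rng.normal(size=(m, d))
features = rng.normal(size=(n, d))

# cost: pairwise distances between prototypes and features
c = np.linalg.norm(prototypes[:, None] - features[None, :], axis=-1)   # (m, n)

# variables: x = [vec(pi) (m*n), rho (n)], all non-negative (linprog's default bounds)
obj = np.concatenate([c.ravel(), np.zeros(n)])
A_eq = np.zeros((n + 1, m * n + n))
b_eq = np.zeros(n + 1)
for i in range(n):                            # feature marginals: sum_k pi_ki + rho_i = 1/n
    A_eq[i, i : m * n : n] = 1.0              # pi[k, i] for every prototype k
    A_eq[i, m * n + i] = 1.0                  # rho_i
    b_eq[i] = 1.0 / n
A_eq[n, : m * n] = 1.0                        # total transported mass = mu
b_eq[n] = mu

res = linprog(obj, A_eq=A_eq, b_eq=b_eq, method="highs")
rho = res.x[m * n :]
w = (1.0 / n - rho) / mu                      # pooling weights of (4.1)
pooled = w @ features                         # generalized sum pooling output, shape (d,)
```

With $\mu = 1$ the equality constraints force `rho` to zero and `w` collapses to the uniform GAP weights, which is exactly the equivalence stated above.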
4.2 GSP as a Differentiable Layer
Consider the generalized form of pooling, defined as the solution of (P1), as a layer of a neural network. The input is the feature vectors $(f_i)_{i\in[n]}$, the learnable parameters are the prototype representations $(\omega_k)_{k\in[m]}$, and the output is the residual weights $(\rho_i)_{i\in[n]}$. To enable learning, we need partial derivatives of the output with respect to the parameters. However, this function is not smooth. More importantly, it requires the parameter $\mu$ to be known a priori.
We use a toy example to set the stage for the rest of the formulation. Consider a feature map visualized as an RGB image in Fig. 2 and two corresponding prototypes with representations $\omega_1$ (red) and $\omega_2$ (blue). The true transport ratio is $\mu = 0.5$ since half of the image corresponds to red and blue, and the other half is the background class of green. Consider an under-estimation of $\mu$: the global solution (shown as "linear programming" in Fig. 2) explicitly ignores informative pixels (part of the red and blue regions). To solve this issue, we use entropy smoothing, which was first introduced in [48] to enable fast computation of OT. Consider the entropy smoothed version of the original problem in (P1):
$$
\begin{aligned}
(\pi^\epsilon,\rho^\epsilon) \;=\; \arg\min_{\pi\geq 0,\;\rho\geq 0}\quad & \textstyle\sum_{k,i} c_{ki}\,\pi_{ki} \;+\; \epsilon\Big(\textstyle\sum_{k,i}\pi_{ki}\big(\log\pi_{ki}-1\big) + \textstyle\sum_{i}\rho_i\big(\log\rho_i-1\big)\Big)\\[2pt]
\text{subject to}\quad & \textstyle\sum_{k} \pi_{ki} + \rho_i = \tfrac{1}{n}\;\;\forall\, i\,,\qquad \textstyle\sum_{k,i}\pi_{ki} = \mu
\end{aligned}
\tag{P2}
$$
and obtain the pooling weights by replacing $\rho^\star$ with $\rho^\epsilon$ in (4.1), where $(\pi^\epsilon, \rho^\epsilon)$ denotes the solution of (P2) with smoothing strength $\epsilon$.
When the smoothing is high ($\epsilon\to\infty$), the resulting solution is uniform over the features, similar to GAP. When it is low, the result exhibits top-$k$-like behavior. For us, $\epsilon$ controls the trade-off between picking only the portion of the features that are closest to the prototypes and including as many features as possible in the weight transfer. We further visualize the solution of the entropy-smoothed problem in Fig. 2, showing desirable behavior even with an under-estimated $\mu$.
Beyond alleviating the problem of under-estimating $\mu$, entropy smoothing also makes the problem strictly convex and smooth. Thus, the solution of the problem enables differentiation and, in fact, admits a closed-form gradient expression. We state the solution of (P2) and its corresponding gradient in the following propositions and defer their proofs to the Appendix.
Proposition 4.2.
Given an initialization $v_{(0)} > 0$, consider the following alternating normalization iteration:
$$u_{(l+1)} \;=\; \tfrac{1}{n}\,\mathbf{1}_n \oslash\big(\mathbf{1}_n + v_{(l)}\,K^{\!\top}\mathbf{1}_m\big)\,,\qquad v_{(l+1)} \;=\; \mu\,\big/\,\big(u_{(l+1)}^{\!\top}\,K^{\!\top}\mathbf{1}_m\big)\,,$$
where $K = \exp(-c/\epsilon)$, $\exp$ and $\oslash$ (division) are element-wise, and $\mathbf{1}_n$ is the $n$-dimensional vector of ones. Then, $(u_{(l)}, v_{(l)})$ converges, and its limit $(u, v)$ defines the solution of (P2) via the transport map $\pi^\epsilon_{ki} = v\,K_{ki}\,u_i$ and the residuals $\rho^\epsilon_i = u_i$.
Proposition 4.3.
Consider any differentiable loss function $\mathcal{L}$ expressed as a function of the solution of (P2). Given the gradients $\nabla_{\pi^\epsilon}\mathcal{L}$ and $\nabla_{\rho^\epsilon}\mathcal{L}$, where $(\pi^\epsilon,\rho^\epsilon)$ is the solution of (P2), the gradient of $\mathcal{L}$ with respect to the cost matrix $c$ admits a closed-form expression, (4.2), built only from element-wise multiplications and matrix–vector products.
Propositions 4.2 and 4.3 suggest that our feature-selective pooling can be implemented as a differentiable layer. Moreover, Proposition 4.3 gives a matrix-inversion-free computation of the gradient with respect to the costs, unlike optimal transport based operators [52]. Thus, the prototypes $\{\omega_k\}$ can be learned jointly and efficiently with the feature extractor.
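To make the layer view concrete, below is a minimal TensorFlow sketch of a GSP-style pooling layer. It assumes the entropy-smoothed formulation and the alternating vector/scalar normalization as reconstructed above, uses an illustrative cosine-distance cost, relies on automatic differentiation through the unrolled iterations instead of the closed-form gradient of Proposition 4.3, and uses placeholder default hyperparameters.

```python
import tensorflow as tf

class GSPPooling(tf.keras.layers.Layer):
    """Illustrative sketch of generalized sum pooling with trainable prototypes."""

    def __init__(self, num_prototypes=64, mu=0.5, eps=0.5, num_iters=10, **kwargs):
        super().__init__(**kwargs)
        # placeholder defaults; the paper's actual hyperparameter values are not restated here
        self.m, self.mu, self.eps, self.num_iters = num_prototypes, mu, eps, num_iters

    def build(self, input_shape):
        d = int(input_shape[-1])
        # trainable prototype representations, one per assumed semantic entity
        self.prototypes = self.add_weight(
            name="prototypes", shape=(self.m, d), initializer="glorot_uniform")

    def call(self, x):
        # x: (batch, H, W, d) feature map -> (batch, n, d) spatial features
        f = tf.reshape(x, (tf.shape(x)[0], -1, tf.shape(x)[-1]))
        n = tf.cast(tf.shape(f)[1], x.dtype)
        # illustrative cost: 1 - cosine similarity between prototypes and features
        fn = tf.math.l2_normalize(f, axis=-1)
        pn = tf.math.l2_normalize(self.prototypes, axis=-1)
        cost = 1.0 - tf.einsum("bnd,md->bmn", fn, pn)
        K = tf.exp(-cost / self.eps)                        # smoothed kernel, (B, m, n)
        s = tf.reduce_sum(K, axis=1)                        # per-feature prototype sums, (B, n)
        v = tf.ones_like(s[:, :1])                          # scalar scaling per sample
        for _ in range(self.num_iters):                     # alternating normalization (Prop. 4.2)
            u = (1.0 / n) / (1.0 + v * s)                   # feature scaling, i.e., residuals rho
            v = self.mu / tf.reduce_sum(u * s, axis=-1, keepdims=True)
        w = (1.0 / n - u) / self.mu                         # pooling weights of (4.1)
        return tf.einsum("bn,bnd->bd", w, f)                # weighted sum pooling
```

With such a layer, GSP can drop in wherever `GlobalAveragePooling2D` is used; in this autodiff form the memory grows with `num_iters`, which is precisely the cost that the closed-form gradient of Proposition 4.3 avoids.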
4.3 Cross-batch Zero-shot Regularization
We formulated a prototype-based feature pooling and learn the prototypes using class labels. Simply letting the prototypes encode the seen class labels is a degenerate solution. We rather want the prototypes to capture transferable attributes so that the learning can transfer to unseen classes as long as the attributes are shared. Learning with prototype-based pooling shapes the embedding geometry such that we have clusters corresponding to the prototypes in the embedding space. We want such clusters to carry transferable semantics rather than class-specific information. To enable this, we now formulate a mechanism to predict class embedding vectors from the prototype assignment vectors and use that mechanism to tailor a loss regularizing the prototypes to have transferable representations.
Our feature selection layer should learn discriminative feature prototypes using top-down label information. Consider two randomly selected batches of data sampled from the distribution. If the prototypes correspond to discriminative entities, the weights transferred to them (i.e., the marginal distribution over the prototypes) should be useful in predicting the classes, and this behavior should be consistent between batches for zero-shot prediction. Formally, if a class in one batch does not exist in the other, a class predictor fit on the prototype marginal distributions of one batch should still be useful for the other batch. DML losses do not carry such information. We thus formulate a zero-shot prediction loss to enforce such zero-shot transfer.
We consider that we are given an embedding vector $e_y$ for each class label $y$, i.e., $\{e_y\}$ for the seen classes. We are to predict such embeddings from the marginal distribution of the prototypes. In particular, we use a linear predictor $W$ to predict label embeddings as $\hat{e} = W a$, where $a$ is the normalized distribution of the weights on the prototypes:
$$a_k \;=\; \tfrac{1}{\mu}\,\textstyle\sum_{i\in[n]}\pi^\epsilon_{ki} \tag{4.3}$$
If we consider the prototypes as semantic vectors of some auxiliary labels, such as attributes in zero-shot learning (ZSL) [14, 12, 56], then we can interpret $a$ as attribute predictions. Given the attribute predictions $\{a_i\}$ and the corresponding class embeddings $\{e_{y_i}\}$ for a batch $B$, we fit the predictor as:
$$W_B \;=\; \arg\min_{W}\ \textstyle\sum_{i\in B}\ \big\lVert W a_i - e_{y_i}\big\rVert_2^2 \;+\; \lambda\,\lVert W\rVert_F^2 \tag{P3}$$
which admits a closed-form expression, $W_B = E\,A^{\!\top}\big(A A^{\!\top} + \lambda I\big)^{-1}$, enabling back-propagation, where $A$ and $E$ stack the attribute predictions $a_i$ and the class embeddings $e_{y_i}$ as columns. In practice, we are not provided with the label embeddings $\{e_y\}$. Nevertheless, having a closed-form expression for $W_B$ enables us to exploit a meta-learning scheme like [57] to formulate a zero-shot prediction loss and learn them jointly with the rest of the parameters.
Specifically, we split a batch into two subsets $B_1$ and $B_2$ such that their classes are disjoint (although we consider this simplest form, which already works well, repeating the splitting process can be beneficial). We then estimate the attribute-to-embedding predictor according to (P3) using one set and use that estimate to predict the label embeddings of the other set, forming a zero-shot prediction loss. Formally, our loss becomes:
$$\mathcal{L}_{\mathrm{zs}} \;=\; -\tfrac{1}{|B_2|}\,\textstyle\sum_{i\in B_2}\ \log\dfrac{\exp\big(\hat{e}_i^{\!\top} e_{y_i}\big)}{\sum_{y'}\exp\big(\hat{e}_i^{\!\top} e_{y'}\big)}\,,\qquad \hat{e}_i = W_{B_1}\,a_i \tag{4.4}$$
i.e., a rearranged soft-max cross-entropy where, with a slight abuse of notation, $a_i$ denotes the prototype distribution (4.3) of sample $i\in B_2$ and $W_{B_1}$ is the predictor fit on $B_1$ via (P3). The trainable quantities are $\theta$, $\{\omega_k\}$, and $\{e_y\}$ (i.e., CNN parameters, prototype vectors, label embeddings).
We learn the attribute embeddings (i.e., the columns of $W$) as a sub-task and can define such learning as a differentiable operation. Intuitively, such a regularization should help our pooling operation generalize better to unseen classes, since the attribute predictions are connected to the prototypes and the local features. We combine this loss with the metric learning loss using a mixing coefficient $\alpha$ (i.e., $\mathcal{L}_{\mathrm{DML}} + \alpha\,\mathcal{L}_{\mathrm{zs}}$) and jointly optimize.
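The following sketch combines the closed-form predictor of (P3) with the cross-batch loss of (4.4). It follows the reconstructions above in a row-major convention (samples as rows); the ridge regularizer, dot-product similarity, and variable names are illustrative assumptions rather than the paper's exact expressions.

```python
import tensorflow as tf

def zero_shot_loss(a1, e1, a2, y2, class_emb, lam=1e-3):
    """Cross-batch zero-shot loss sketch.

    a1: (n1, m) prototype-marginal (attribute) predictions of split B1
    e1: (n1, d) label embeddings of the B1 samples
    a2: (n2, m) attribute predictions of split B2
    y2: (n2,)  integer class labels of B2
    class_emb: (C, d) trainable label-embedding table
    """
    # closed-form ridge predictor fit on B1: W = (A^T A + lam I)^-1 A^T E
    gram = tf.matmul(a1, a1, transpose_a=True)
    gram += lam * tf.eye(tf.shape(a1)[1], dtype=a1.dtype)
    W = tf.linalg.solve(gram, tf.matmul(a1, e1, transpose_a=True))      # (m, d)

    e_hat = tf.matmul(a2, W)                                  # predicted label embeddings for B2
    logits = tf.matmul(e_hat, class_emb, transpose_b=True)    # similarity to every class embedding
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y2, logits=logits))
```

In the full objective, this term would be mixed with the DML loss via the coefficient $\alpha$ mentioned above.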
4.4 Implementation Details
Embedding function. For the embedding function, we use ResNet20 [58] for the Cifar [59] experiments and ImageNet [60] pretrained BN-Inception [61] for the rest. We use the architectures up to the output just before the global average pooling layer. We add a per-pixel linear transformation (i.e., a $1{\times}1$ convolution) to the output to obtain the local embedding vectors of the target dimension.
Pooling layer. For the baseline methods, we use global average pooling. For our method, we perform a parameter search and set the hyperparameters accordingly. Specifically, we use 64- or 128-many prototypes depending on the dataset, and we use different transport ratios $\mu$ for proxy-based and non-proxy losses. The remaining hyperparameters, i.e., the entropy smoothing $\epsilon$ and the stopping tolerance of the iteration in Proposition 4.2, are set to fixed values. The embedding vectors obtained upon pooling are normalized to have unit norm.
5 Experiments
We start our empirical study with a synthetic experiment validating the role of GAP in learning and the impact of GSP on the feature geometry. We then examine the effectiveness of our generalized sum pooling in metric learning for various models and datasets, and we perform ablation studies on the implications of our formulation as well as the effects of the hyperparameters. We share the implementation details and the complete TensorFlow [62] code base in the supplementary material.
5.1 Analysis of the Behavior
Synthetic study. We designed a synthetic empirical study to evaluate GSP in a fully controlled manner. We considered a 16-class problem in which classes are defined over trainable tokens. In this setting, tokens correspond to semantic entities, but we choose this specific wording to emphasize that they are trained as part of the learning. Each class is defined by 4 distinct tokens, and there are also 4 background tokens shared by all classes. For example, a "car" class would have tokens like "tire" and "window" as well as background tokens such as "tree" and "road". We sampled class representations from both class-specific and background tokens according to a mixing ratio. Such a 50-many feature collection corresponds to a training sample (i.e., we mimic a CNN's output with trainable tokens). We then obtained global representations using GAP and GSP. We visualize the geometry of the embedding space in Fig. 3-(a) along with the DML performances. With GAP, we observe overlapping class convex hulls, resulting in poor DML performance. In contrast, GSP gives well-separated class convex hulls, further validating that it learns to ignore background tokens.
Selective pooling. We further extended this synthetic study to the image domain by considering the 20 super-classes of the Cifar100 dataset [59], each having 5 sub-classes. For each super-class, we split the sub-classes into train (2), validation (1), and test (2). We consider 4 super-classes as shared classes and compose stitched collage images for the remaining 16 classes (see the supplementary material for details). In particular, we sample an image from a class and then sample 3 images from the shared classes (Fig. 4). We used a ResNet20 backbone pretrained on the Cifar100 classification task and followed the implementation explained in Sec. 4.4. We provide the evaluation results in Fig. 3-(b). GSP and the proposed zero-shot loss effectively increase MAP@R. We also provide sample train and test images to showcase that our pooling transfers well to unseen classes.
Zero-shot regularization. We also evaluated the zero-shot prediction performance of the attribute vectors. We trained on the Cifar10 dataset with 8 prototypes using the ProxyNCA++ [19] (PNCA) loss with and without $\mathcal{L}_{\mathrm{zs}}$. We then extracted attribute histograms for each class and visualized them in Fig. 5. We observe transferable representations with $\mathcal{L}_{\mathrm{zs}}$, and we visually show in Fig. 5 that the semantic entities represented by the prototypes transfer across classes. We quantitatively evaluated this behavior by randomly splitting the classes into halves and applying the cross-batch zero-shot prediction explained in Sec. 4.3. Namely, we fit the predictor in (P3) on one subset and used it to predict the class embeddings of the other subset. We pre-computed the class embeddings from the dataset as the class means. To this end, our evaluation assesses the generalization of both the features and the prototypes. We used MAP with both the Euclidean distance and cosine similarity in our evaluation, and we repeated the experiment 1000 times. We observe in Fig. 5 that the zero-shot performance of the prototypes learned with $\mathcal{L}_{\mathrm{zs}}$ is substantially superior. We also see that our feature aggregation method enables approximate localization of the semantic entities. Recent ZSL approaches [56, 12] can provide attribute localization and share a similar spirit with our method. However, attribute annotations must be provided for those methods, whereas we exploit only class labels to extract attribute-like features. Thus, our method can be considered an attribute-unsupervised alternative to them.
5.2 Deep Metric Learning Experiments
We largely rely on the recent work explicitly studying the fair evaluation strategies for metric learning [5, 4, 63]. Specifically, we follow the procedures proposed in [4] to evaluate our method as well as the other methods. We additionally follow the relatively old-fashioned conventional procedure [64] for the evaluation of our method and provide those results in the supplementary material. We provide full detail of our experimental setup in the supplementary material for complete transparency and reproducibility.
Datasets. CUB [65], Cars196 [66], In-shop [67], and SOP [64] with typical DML data augmentation [4].
Evaluation metrics. We report mean average precision at R (MAP@R), where R is defined per query as the total number of true references of that query (a minimal sketch of this metric is given below, after the setup items).
Hyperparameters. We use the Adam [68] optimizer (with the learning rate and weight decay detailed in the supplementary material) and a batch size of 32 (4 samples per class). We train 4-fold: 4 models, one for each train-set partition.
Evaluation. We report the average performance with 128D embeddings, where each of the 4-fold models is trained 3 times, resulting in different realizations of model collections. In our results we report the mean over the resulting 81 evaluations.
Baselines. We implement our method on top of and compare with Contrastive (C2): Contrastive with positive margin [53], MS: Multi-similarity [55], Triplet: Triplet [54], XBM: Cross-batch memory [18] with contrastive loss [69], PNCA: ProxyNCA++ [19], PAnchor: ProxyAnchor [70].
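For reference, here is a minimal NumPy sketch of the MAP@R metric used above (following the protocol of [4]); it assumes Euclidean distances and is written for clarity rather than speed.

```python
import numpy as np

def map_at_r(embeddings, labels):
    """MAP@R: for each query, R = number of same-class references (query excluded);
    precision-at-i is accumulated only at ranks i where the i-th retrieval is correct."""
    labels = np.asarray(labels)
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # exclude the query itself
    scores = []
    for q in range(len(labels)):
        r = int((labels == labels[q]).sum()) - 1     # number of true references
        if r == 0:
            continue
        ranked = np.argsort(d[q])[:r]                # the R nearest references
        rel = (labels[ranked] == labels[q]).astype(float)
        prec_at_i = np.cumsum(rel) / (np.arange(r) + 1)
        scores.append((prec_at_i * rel).sum() / r)   # AP@R of this query
    return float(np.mean(scores))
```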
5.2.1 Results
We compared our method (GSP) against the direct application of GAP with 6 DML methods on 4 datasets. We also evaluated 14 additional pooling alternatives on the Cifar Collage and CUB datasets. We provide the tabulated results in the supplementary material. Based on the CUB performances, we picked generalized mean pooling (GeMean) [71] and DeLF [46] to compare against on the 4 DML benchmarks. We also evaluated global max pooling (GMP) and its combination with GAP, as we typically observe GAP+GMP in recent works [6, 19, 70, 18]. We also applied our method with GMP (GMP+GSP) and with GeMean (GeMean+GSP) to show that per-channel selection is orthogonal to our method and thus GSP can marginally improve those methods as well.
We present the tabulated evaluation results in Tab. 2, while Fig. 6 provides a concise summary of the relative MAP@R orderings of the methods with 128D embeddings. We observe consistent improvements over the direct application of GAP on all datasets. On average, we consistently improve the baselines in MAP@R, and our improvement margins are superior to those of the attention-based DeLF pooling. We also improve the state-of-the-art (SOTA) XBM method, which is good evidence that the application of GSP is not limited to loss terms but can be combined with different DML approaches. We further consistently improve the GMP and GeMean pooling methods on all datasets, yet another evidence that our method can be combined with max-pooling based methods.
We additionally evaluated our method with different architectures and methods in conventional setting [64] for the comparison with SOTA. The tabulated results are provided in supplementary material (§ 1.1), where we observe that we achieve SOTA performances with XBM [18] and LIBC [24] methods.
Table 1: Ablation of GSP and the zero-shot loss $\mathcal{L}_{\mathrm{zs}}$ (MAP@R).

| GSP | $\mathcal{L}_{\mathrm{zs}}$ | SOP 512D | SOP 128D | In-shop 512D | In-shop 128D | CUB 512D | CUB 128D | Cars196 512D | Cars196 128D |
|---|---|---|---|---|---|---|---|---|---|
|  |  | 45.85 | 41.79 | 59.07 | 55.38 | 25.95 | 20.58 | 24.38 | 17.02 |
| ✓ |  | 46.78 | 42.66 | 59.46 | 55.50 | 26.25 | 20.85 | 25.54 | 17.88 |
|  | ✓ | 46.60 | 42.55 | 59.38 | 55.43 | 26.49 | 21.08 | 25.54 | 17.67 |
| ✓ | ✓ | 46.81 | 42.84 | 60.01 | 55.94 | 27.12 | 21.52 | 26.25 | 18.31 |
5.2.2 Ablations
Effect of $\mathcal{L}_{\mathrm{zs}}$. We empirically showed the effect of $\mathcal{L}_{\mathrm{zs}}$ on the learned representations in Sec. 5.1. We further examined its effect quantitatively by enabling/disabling it. We also evaluated its effect without GSP, i.e., using GAP together with the attribute vectors. The results are summarized in Tab. 1, showing that both components improve the baseline and their combination brings the best improvement. We observe similar behavior in the Cifar Collage experiment (Fig. 3-(b)), where the effect of $\mathcal{L}_{\mathrm{zs}}$ is more substantial.
Computation efficiency. Our pooling mechanism determines the pooling weights through a series of iterations involving only lightweight matrix–vector products. While back-propagation could be performed through automatic differentiation, it becomes computationally intensive as the number of iterations increases. However, our pooling mechanism has a closed-form gradient expression for its backward computation, so the computation load scales minimally with the number of iterations, as evidenced by our analysis in Fig. 7. We further provide inference times for various settings in the supplementary material (§ 1.5).
Effect of $\mu$. As outlined in Sec. 4.2, GSP behaves like a top-$k$ operator with an adaptive $k$ thanks to entropy smoothing. We empirically validated this behavior by sweeping the parameter $\mu$ controlling the top-$k$ behavior. The results, plotted in Fig. 7, show similar performance for lower values of $\mu$, with a decrease as $\mu$ increases, possibly due to over-estimation of the foreground ratio. Hence, a relatively small $\mu$ is suggested.
Table 2: P@1 and MAP@R with 512D and 128D embeddings on SOP, In-shop, CUB, and Cars196 for the C1+XBM, C2, MS, Triplet, PNCA, and PAnchor losses combined with GAP, GSP, DeLF, GeMean, GMP, GMP+GAP, GMP+GSP, and GeMean+GSP pooling (numeric entries omitted).
6 Conclusion
We proposed a learnable and generalized version of GAP. Our proposed generalization, GSP, is a trainable pooling layer that selects a subset of the features and re-weights them during pooling. To enable effective learning of our layer, we also proposed a cross-batch regularization improving zero-shot transfer. With extensive empirical studies, we validated the effectiveness of the proposed pooling layer on various DML benchmarks. We also established and empirically validated a valuable link between the transport maps computed for pooling and the prototypes, enabling attribute learning via meta-learning without explicit localized annotations. We believe this connection is interesting and has the potential to offer pathways to enhanced embedding learning for DML as well as unsupervised attribute learning.
References
- [1] W.Kim, B.Goyal, K.Chawla, J.Lee, and K.Kwon, “Attention-based ensemble for deep metric learning,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 736–751.
- [2] X.Lin, Y.Duan, Q.Dong, J.Lu, and J.Zhou, “Deep variational metric learning,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
- [3] A.Ermolov, L.Mirvakhabova, V.Khrulkov, N.Sebe, and I.Oseledets, “Hyperbolic vision transformers: Combining improvements in metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7409–7419.
- [4] K.Musgrave, S.Belongie, and S.-N.Lim, “A metric learning reality check,” in European Conference on Computer Vision. Springer, 2020, pp. 681–699.
- [5] K.Roth, T.Milbich, S.Sinha, P.Gupta, B.Ommer, and J. P.Cohen, “Revisiting training strategies and generalization performance in deep metric learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 8242–8252.
- [6] S.Venkataramanan, B.Psomas, E.Kijak, laurent amsaleg, K.Karantzalos, and Y.Avrithis, “It takes two to tango: Mixup for deep metric learning,” in International Conference on Learning Representations, 2022.
- [7] Y. Z.Gurbuz and A. A.Alatan, “Generalizable embeddings with cross-batch metric learning,” arXiv preprint arXiv:2307.07620, 2023.
- [8] M. D.Zeiler and R.Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.
- [9] B.Zhou, A.Khosla, A.Lapedriza, A.Oliva, and A.Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
- [10] B.Zhou, D.Bau, A.Oliva, and A.Torralba, “Interpreting deep visual representations via network dissection,” IEEE transactions on pattern analysis and machine intelligence, 2018.
- [11] Y. Z.Gürbüz and A. A.Alatan, “A novel bovw mimicking end-to-end trainable cnn classification framework using optimal transport theory,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3053–3057.
- [12] W.Xu, Y.Xian, J.Wang, B.Schiele, and Z.Akata, “Attribute prototype network for zero-shot learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 21969–21980, 2020.
- [13] M.Cuturi, O.Teboul, and J.-P.Vert, “Differentiable ranking and sorting using optimal transport,” Advances in neural information processing systems, vol. 32, 2019.
- [14] B.Demirel, R.Gokberk Cinbis, and N.Ikizler-Cinbis, “Attributes2classname: A discriminative model for attribute-based unsupervised zero-shot learning,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1232–1241.
- [15] B.Ko and G.Gu, “Embedding expansion: Augmentation in embedding space for deep metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7255–7264.
- [16] C.Liu, H.Yu, B.Li, Z.Shen, Z.Gao, P.Ren, X.Xie, L.Cui, and C.Miao, “Noise-resistant deep metric learning with ranking-based instance selection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6811–6820.
- [17] G.Gu, B.Ko, and H.-G.Kim, “Proxy synthesis: Learning with synthetic classes for deep metric learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 1460–1468.
- [18] X.Wang, H.Zhang, W.Huang, and M. R.Scott, “Cross-batch memory for embedding learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6388–6397.
- [19] E. W.Teh, T.DeVries, and G. W.Taylor, “Proxynca++: Revisiting and revitalizing proxy neighborhood component analysis,” in European Conference on Computer Vision (ECCV). Springer, 2020.
- [20] M.Dong, X.Yang, R.Zhu, Y.Wang, and J.Xue, “Generalization bound of gradient descent for non-convex metric learning,” Neural Information Processing Systems Foundation, 2020.
- [21] Y.Lei, M.Liu, and Y.Ying, “Generalization guarantee of sgd for pairwise learning,” Advances in Neural Information Processing Systems, vol. 34, 2021.
- [22] Y. Z.Gurbuz, O.Can, and A. A.Alatan, “Deep metric learning with chance constraints,” arXiv preprint arXiv:2209.09060, 2022.
- [23] K.Roth, B.Brattoli, and B.Ommer, “Mic: Mining interclass characteristics for improved metric learning,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
- [24] J.Seidenschwarz, I.Elezi, and L.Leal-Taixé, “Learning intra-batch connections for deep metric learning,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. 2021, vol. 139 of Proceedings of Machine Learning Research, pp. 9410–9421, PMLR.
- [25] J.Lim, S.Yun, S.Park, and J. Y.Choi, “Hypergraph-induced semantic tuplet loss for deep metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- [26] Y.Patel, G.Tolias, and J.Matas, “Recall@ k surrogate loss with large batches and similarity mixup,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7502–7511.
- [27] P.Jacob, D.Picard, A.Histace, and E.Klein, “Metric learning with horde: High-order regularizer for deep embeddings,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
- [28] D.Zhang, Y.Li, and Z.Zhang, “Deep metric learning with spherical embedding,” Advances in Neural Information Processing Systems, vol. 33, 2020.
- [29] O.Can, Y. Z.Gürbüz, and A. A.Alatan, “Deep metric learning with alternating projections onto feasible sets,” in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 1264–1268.
- [30] Y.Kim and W.Park, “Multi-level distance regularization for deep metric learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35.
- [31] K.Roth, O.Vinyals, and Z.Akata, “Non-isotropy regularization for proxy-based deep metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7420–7430.
- [32] H.Xuan, R.Souvenir, and R.Pless, “Deep randomized ensembles for metric learning,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 723–734.
- [33] A.Sanakoyeu, V.Tschernezki, U.Buchler, and B.Ommer, “Divide and conquer the embedding space for metric learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
- [34] W.Zheng, C.Wang, J.Lu, and J.Zhou, “Deep compositional metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9320–9329.
- [35] W.Zheng, B.Zhang, J.Lu, and J.Zhou, “Deep relational metric learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12065–12074.
- [36] T.Milbich, K.Roth, H.Bharadhwaj, S.Sinha, Y.Bengio, B.Ommer, and J. P.Cohen, “Diva: Diverse visual feature aggregation for deep metric learning,” in European Conference on Computer Vision. Springer, 2020, pp. 590–607.
- [37] K.Roth, T.Milbich, B.Ommer, J. P.Cohen, and M.Ghassemi, “S2sd: Simultaneous similarity-based self-distillation for deep metric learning,” in ICML 2021: 38th International Conference on Machine Learning, 2021, pp. 9095–9106.
- [38] H.Zhang, J.Xue, and K.Dana, “Deep ten: Texture encoding network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 708–717.
- [39] R.Arandjelovic, P.Gronat, A.Torii, T.Pajdla, and J.Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5297–5307.
- [40] S.Kolouri, N.Naderializadeh, G. K.Rohde, and H.Hoffmann, “Wasserstein embedding for graph learning,” in International Conference on Learning Representations, 2020.
- [41] G.Mialon, D.Chen, A.d’Aspremont, and J.Mairal, “A trainable optimal transport embedding for feature aggregation and its relationship to attention,” in ICLR 2021-The Ninth International Conference on Learning Representations, 2021.
- [42] S.KAN, Y.Liang, M.Li, Y.Cen, J.Wang, and Z.He, “Coded residual transform for generalizable deep metric learning,” in Advances in Neural Information Processing Systems.
- [43] Y.Kalantidis, C.Mellina, and S.Osindero, “Cross-dimensional weighting for aggregated deep convolutional features,” in European conference on computer vision. Springer, 2016, pp. 685–701.
- [44] G.Tolias, T.Jenicek, and O.Chum, “Learning and aggregating deep local descriptors for instance-level recognition,” in European Conference on Computer Vision. Springer, 2020, pp. 460–477.
- [45] S.Woo, J.Park, J.-Y.Lee, and I. S.Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
- [46] H.Noh, A.Araujo, J.Sim, T.Weyand, and B.Han, “Large-scale image retrieval with attentive deep local features,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- [47] Z.Gao, J.Xie, Q.Wang, and P.Li, “Global second-order pooling convolutional networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3024–3033.
- [48] M.Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems, vol. 26, 2013.
- [49] W.Zhao, Y.Rao, Z.Wang, J.Lu, and J.Zhou, “Towards interpretable deep metric learning with structural matching,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9887–9896.
- [50] Y.Xie, H.Dai, M.Chen, B.Dai, T.Zhao, H.Zha, W.Wei, and T.Pfister, “Differentiable top-k with optimal transport,” Advances in Neural Information Processing Systems, vol. 33, pp. 20520–20531, 2020.
- [51] N.Naderializadeh, J. F.Comer, R.Andrews, H.Hoffmann, and S.Kolouri, “Pooling by sliced-wasserstein embedding,” Advances in Neural Information Processing Systems, vol. 34, pp. 3389–3400, 2021.
- [52] G.Luise, A.Rudi, M.Pontil, and C.Ciliberto, “Differential properties of sinkhorn approximation for learning with wasserstein distance,” Advances in Neural Information Processing Systems, vol. 31, 2018.
- [53] C.-Y.Wu, R.Manmatha, A. J.Smola, and P.Krahenbuhl, “Sampling matters in deep embedding learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
- [54] F.Schroff, D.Kalenichenko, and J.Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
- [55] X.Wang, X.Han, W.Huang, D.Dong, and M. R.Scott, “Multi-similarity loss with general pair weighting for deep metric learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
- [56] D.Huynh and E.Elhamifar, “Fine-grained generalized zero-shot learning via dense attribute-based attention,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4483–4493.
- [57] L.Bertinetto, J. F.Henriques, P.Torr, and A.Vedaldi, “Meta-learning with differentiable closed-form solvers,” in International Conference on Learning Representations, 2018.
- [58] K.He, X.Zhang, S.Ren, and J.Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
- [59] A.Krizhevsky and G.Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.
- [60] O.Russakovsky, J.Deng, H.Su, J.Krause, S.Satheesh, S.Ma, Z.Huang, A.Karpathy, A.Khosla, M.Bernstein, et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
- [61] S.Ioffe and C.Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. PMLR, 2015, pp. 448–456.
- [62] M.Abadi, P.Barham, J.Chen, Z.Chen, A.Davis, J.Dean, M.Devin, S.Ghemawat, G.Irving, M.Isard, et al., “Tensorflow: a system for large-scale machine learning.,” in OSDI, 2016, vol. 16, pp. 265–283.
- [63] I.Fehervari, A.Ravichandran, and S.Appalaraju, “Unbiased evaluation of deep metric learning algorithms,” arXiv preprint arXiv:1911.12528, 2019.
- [64] H.Oh Song, Y.Xiang, S.Jegelka, and S.Savarese, “Deep metric learning via lifted structured feature embedding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4004–4012.
- [65] C.Wah, S.Branson, P.Welinder, P.Perona, and S.Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.
- [66] J.Krause, M.Stark, J.Deng, and L.Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 554–561.
- [67] Z.Liu, P.Luo, S.Qiu, X.Wang, and X.Tang, “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1096–1104.
- [68] D. P.Kingma and J.Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [69] R.Hadsell, S.Chopra, and Y.LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). IEEE, 2006, vol. 2, pp. 1735–1742.
- [70] S.Kim, D.Kim, M.Cho, and S.Kwak, “Proxy anchor loss for deep metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3238–3247.
- [71] F.Radenović, G.Tolias, and O.Chum, “Fine-tuning cnn image retrieval with no human annotation,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 7, pp. 1655–1668, 2018.
- [72] Y.Zhu, M.Yang, C.Deng, and W.Liu, “Fewer is more: A deep graph metric learning perspective using fewer proxies,” Advances in Neural Information Processing Systems, vol. 33, pp. 17792–17803, 2020.
- [73] M.Tan and Q.Le, “Efficientnetv2: Smaller models and faster training,” in International conference on machine learning. PMLR, 2021, pp. 10096–10106.
- [74] N.Murray, H.Jégou, F.Perronnin, and A.Zisserman, “Interferences in match kernels,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 9, pp. 1797–1810, 2016.
- [75] T.Ng, V.Balntas, Y.Tian, and K.Mikolajczyk, “Solar: second-order loss and attention for image retrieval,” in European conference on computer vision. Springer, 2020, pp. 253–270.
- [76] A.Dosovitskiy, L.Beyer, A.Kolesnikov, D.Weissenborn, X.Zhai, T.Unterthiner, M.Dehghani, M.Minderer, G.Heigold, S.Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
- [77] T.Durand, N.Thome, and M.Cord, “Weldon: Weakly supervised learning of deep convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4743–4752.
- [78] L. M.Bregman, “The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming,” USSR computational mathematics and mathematical physics, vol. 7, no. 3, pp. 200–217, 1967.
- [79] H. H.Bauschke and A. S.Lewis, “Dykstra’s algorithm with Bregman projections: A convergence proof,” Optimization, vol. 48, no. 4, pp. 409–427, 2000.
- [80] T.-T.Lu and S.-H.Shiou, “Inverses of 2×2 block matrices,” Computers & Mathematics with Applications, vol. 43, no. 1-2, pp. 119–129, 2002.
Supplemental Material for "Generalized Sum Pooling for Metric Learning"
1 Extended Empirical Study for DML
In the following sections, we explain our empirical study in detail and provide additional experiments on effect of hyperparameters as well as evaluation with the conventional experimental settings.
Reproducibility
We provide full detail of our experimental setup and recapitulate the implementation details for the sake of complete transparency and reproducibility. Code is available at: GSP-DML Framework.
Table 3: R@1 comparison with SOTA in the conventional setting (ResNet50 backbone).

| Method | CUB R@1 | Cars196 R@1 | SOP R@1 | In-shop R@1 |
|---|---|---|---|---|
| C1+XBM (128D) [18] | - | - | 80.60 | 91.30 |
| ProxyAnchor (512D) [70] | 69.70 | 87.70 | 80.00 | 92.10 |
| DiVA (512D) [36] | 69.20 | 87.60 | 79.60 | - |
| ProxyNCA++ (512D) [19] | 66.30 | 85.40 | 80.20 | 88.60 |
| Margin+S2SD (512D) [37] | 69.00 | 89.50 | 81.20 | - |
| LIBC (512D) [24] | 70.30 | 88.10 | 81.40 | 92.80 |
| MS+Metrix (512D) [6] | 71.40 | 89.60 | 81.00 | 92.20 |
| PAnchor+DIML (128D) [49] | - | | | |
| LIBC+GSP (512D) | 70.70 | 88.43 | 81.65 | 93.30 |
| C1+XBM (512D) | | | | |
| C1+XBM+GSP (512D) | | | | |
1.1 Conventional Evaluation
We additionally follow the relatively old-fashioned conventional procedure [64] for the evaluation of our method. We use BN-Inception [61] and ResNet50 [58] architectures as the backbones. We obtain 512D (BN-Inception and ResNet50) embeddings through linear transformation after global pooling layer. Aligned with the recent approaches [6, 19, 70, 18], we use global max pooling as well as global average pooling. The rest of the settings are disclosed in Sec. 1.6.
We evaluate our method with XBM. We provide R@1 results in Tab. 3 for the comparison with SOTA. In our evaluations, we also provide MAP@R scores in parenthesis under R@1 scores. We also provide baseline XBM evaluation in our framework. The results are mostly consistent with the ones reported in the original paper [18] except for CUB and Cars datasets. In XBM [18], the authors use proxy-based trainable memory for CUB and Cars datasets. On the other hand, we use the official implementation provided by the authors, which does not include such proxy-based extensions.
We observe that our method improves XBM and XBM+GSP reaches SOTA performance in large scale datasets. With that being said, the improvement margins are less substantial than the ones in fair evaluation. Such a result is expected since training is terminated by early-stopping which is a common practice to regularize the generalization of training [20, 21]. In conventional evaluation, early-stopping is achieved by monitoring the test data performance, enabling good generalization to test data. Therefore, observing less improvement in generalization with GSP is something we expect owing to generalization boost that test data based early-stopping already provides.
We also observe that in a few cases, the R@1 performance of GSP is slightly worse than the baseline. Nevertheless, once we compare the MAP@R performances, GSP consistently brings improvement with no exception. We should recapitulate that R@1 is a myopic metric to assess the quality of the embedding space geometry [4] and hence, pushing R@1 does not necessarily reflect the true order of the improvements that the methods bring.
As we observe from the MAP@R comparisons in Table 2 (main paper), methods sharing similar R@1 (i.e., P@1) performances can differ in MAP@R relatively more significantly. In that manner, we firmly believe that comparing MAP@R performances instead of R@1 is technically more sound for showing the improvements our method brings.
Finally, we also apply our method with LIBC [24] to further show its wide applicability. We use the official implementation of LIBC and follow their default experimental settings. The evaluations on 4 benchmarks show that GSP consistently improves the R@1 of LIBC. To offer a complete outlook on the conventional evaluation, we have included the recall at K (R@K) scores in Table 4 as well.
Table 4: R@K scores in the conventional setting.

| Method | CUB R@1 | R@2 | R@4 | R@8 | Cars196 R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|---|---|---|
| LIBC+GSP | 70.70 | 80.72 | 88.18 | 92.64 | 88.43 | 93.03 | 95.78 | 97.69 |
| XBM+GSP | 66.63 | 77.43 | 85.26 | 91.02 | 82.60 | 89.04 | 92.67 | 95.62 |

| Method | SOP R@1 | R@10 | R@50 | R@100 | In-shop R@1 | R@10 | R@20 | R@40 |
|---|---|---|---|---|---|---|---|---|
| LIBC+GSP | 81.65 | 91.37 | 94.85 | 96.00 | 93.30 | 98.54 | 98.96 | 99.25 |
| XBM+GSP | 81.54 | 91.84 | 95.18 | 96.29 | 91.75 | 97.83 | 98.52 | 99.01 |
1.2 Application of GSP to Other Problems
GSP is applicable to any problem and architecture with a pooling layer. We believe GSP is particularly relevant to the problem of metric learning due to the geometry it enforces. Our pooling method enhances local geometry by reducing overlap of class convex hulls and improves unseen class generalization.
In order to evaluate the applicability of GSP beyond metric learning, we applied GSP to the ImageNet classification task using ResNetV2 [58] and EfficientNetV2 [73] models. We took the official TensorFlow Keras models and only replaced their pooling layers with GSP.
ImageNet | Acc. | P@1 | MAP@R |
---|---|---|---|
RN50V2 | 75.26 | 69.30 | 41.18 |
+GSP | 76.53 | 71.34 | 42.58 |
ENV2B3 | 80.03 | 79.23 | 59.98 |
+GSP | 82.00 | 80.80 | 62.75 |
The results suggest that our method is applicable beyond metric learning, as it improves ImageNet classification accuracy for both ResNetV2 and EfficientNetV2 models. We additionally assessed the metric learning performance of the embedding vectors before the classification layer: after reducing the embedding dimension to 512 through LDA, we evaluated the resulting embedding geometry using the P@1 and MAP@R metrics and observed that GSP yields a better feature geometry.
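A minimal Keras sketch of the pooling swap described above is given below; it assumes the GSPPooling layer sketched in Sec. 4.2, and the hyperparameter values are illustrative since the exact settings are not restated here.

```python
import tensorflow as tf
# assumes the GSPPooling layer sketched in Sec. 4.2; hyperparameters are illustrative
backbone = tf.keras.applications.ResNet50V2(include_top=False, weights="imagenet")
pooled = GSPPooling(num_prototypes=64, mu=0.5, eps=0.5)(backbone.output)  # replaces GAP
logits = tf.keras.layers.Dense(1000, activation="softmax")(pooled)
model = tf.keras.Model(backbone.input, logits)
```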
128D - MAP@R | ||||
Dataset | Cifar Collage | CUB | ||
Method Loss | Contrastive | ProxyNCA++ | Contrastive | ProxyNCA++ |
CBAM [45] | 7.87 | 10.99 | 18.45 | 18.21 |
CroW [43] | 10.09 | 11.48 | 20.88 | 20.42 |
DeLF [46] | 11.44 | 24.83 | 21.42 | 20.51 |
GeMax [74] | 7.04 | 7.83 | 18.85 | 17.83 |
GeMean [71] | 10.97 | 10.60 | 21.50 | 20.71 |
GSoP [47] | 11.15 | 17.73 | 20.52 | 15.72 |
OTP- [41] | 7.02 | 11.55 | 15.19 | 13.57 |
OTP- [41] | 7.57 | 11.79 | 20.88 | 20.48 |
SOLAR [75] | 17.30 | 20.36 | 19.89 | 20.14 |
TFM-1 [76] | 8.84 | 10.83 | 17.82 | 19.13 |
TFM-2 [76] | 16.48 | 21.00 | 17.16 | 18.13 |
TFM-4 [76] | 18.51 | 21.56 | 16.91 | 18.22 |
TFM-8 [76] | 18.18 | 19.68 | 16.31 | 17.47 |
T-SMK [44] | 9.21 | 13.15 | 21.01 | 20.23 |
VLAD- [39] | 21.73 | 19.68 | 15.19 | 13.08 |
VLAD- [39] | 22.52 | 21.15 | 16.67 | 16.53 |
WELDON [77] | 13.81 | 20.38 | 20.67 | 20.31 |
GAP | 8.09 | 10.68 | 20.58 | 20.63 |
GMP | 9.53 | 11.25 | 20.66 | 20.33 |
GMP+GAP | 10.01 | 11.85 | 20.88 | 20.68 |
GSP | 22.68 | 27.61 | 21.52 | 20.75 |
1.3 Evaluation of Other Pooling Alternatives
We evaluate 14 additional pooling alternatives on the Cifar Collage and CUB datasets with contrastive (C2) and Proxy-NCA++ (PNCA) losses. We pick contrastive since it is one of the best performing sample-based losses. We pick Proxy-NCA++ since most of the pooling methods are tailored for landmark-based image retrieval and use a classification loss akin to Proxy-NCA++. We particularly consider the Cifar Collage dataset since the images of different classes share a considerable amount of semantic entities, enabling us to assess the methods with respect to their ability to discard nuisance information.
Compared methods. In addition to our method (GSP) and global average pooling (GAP), we consider: global max pooling (GMP), GAP+GMP [70], CBAM [45], CroW [43], DeLF [46], generalized max pooling (GeMax) [74], generalized mean pooling (GeMean) [71], GSoP [47], optimal transport based aggregation (OTP) [41, 40], SOLAR [75], trainable SMK (T-SMK) [44], NetVLAD [39], WELDON [77], and a visual transformer encoder with a class token (TFM) [76]. Among those, OTP and VLAD are ensemble based methods and typically necessitate large embedding dimensions. Thus, we experimented with both their 128D versions (8 prototypes of 16D vectors) and their 8192D versions (64 prototypes of 128D vectors). We note that some of our baselines utilize attention based pooling. Notably, the attention mechanism is the key part of DeLF, SOLAR, and GSoP. In fact, DeLF can be seen as equivalent to a single-layer, residual-free transformer layer with a class token. To perform a more direct comparison with transformers, we conducted experiments by replacing GAP with transformer layers using a class token. We evaluated 1-, 2-, 4-, and 8-layer transformers (TFM-#).
Setting. For CUB dataset, the experimental setting follows Sec. 1.6-Fair evaluation and we report MAP@R performance of the 4-model average at 128 dimensional embeddings each. For Cifar Collage dataset, the experimental setting follows Sec. 2.2 and we report MAP@R performance. We provide the results in Table 6.
Results. Evaluations show that our method is superior to the other pooling alternatives, including the prototype-based VLAD and OTP. Notably, for 128-dimensional embeddings, our method outperforms the prototype-based methods by a large margin. On the CUB dataset, the other pooling methods are either inferior to or on par with GAP; the improvements of the superior ones are less than 1%, implying that our improvements on the order of 1-2% reported in Table 2 (main paper) are substantial. Moreover, the methods that mask the feature map outperform GAP by a large margin on the Cifar Collage dataset. That said, our method outperforms all the methods except Contrastive+VLAD by a large margin on Cifar Collage, which is yet another evidence of its better feature selection mechanism. For instance, on CUB, DeLF and GeMean are on par with our method, which performs slightly better; yet our method outperforms both by a large margin on Cifar Collage.
Superior selection mechanism. Our method outperforms CroW, T-SMK and CBAM by a large margin. Those methods are built on feature-magnitude-based saliency, assuming that the backbone must be able to zero out nuisance information. Such a requirement restricts the parameter space, and annihilating the non-discriminative information might not be feasible in some problems. We experimentally observe this weakness of CroW, T-SMK and CBAM on the Cifar Collage dataset, where the nuisance information cannot be zeroed out by the backbone. Our formulation does not rely on such a restrictive assumption and is thus substantially superior to them.
Superior attention mechanism. Similarly, the attention-based weighting methods DeLF and GSoP have no explicit control on the feature selection behavior and might result in poor models when trained jointly with the feature extractor [46], which we also observe in the Cifar Collage experiments. On the contrary, we have explicit control on the pooling behavior through the transport ratio parameter, and the behavior of our method is stable and consistent across datasets and loss functions. We also find that our method outperforms the direct application of transformer-based pooling.
Simpler and interpretable. The attention-based methods DeLF, GSoP, and SOLAR typically introduce several convolution layers to compute the feature weights. We only introduce a single convolution layer whose kernels are the trainable prototypes, and we obtain better results. We should note that our pooling operation is as simple as performing a convolution (i.e., a distance computation) followed by alternating normalization of a vector and a scalar; a schematic sketch is given below. Additionally, we are able to incorporate a zero-shot regularization into our problem naturally by using the prototype assignment weights. We could incorporate such a loss into DeLF as well, which has a convolution to compute prototype similarities; however, we would first need a mechanism to aggregate the per-prototype similarities (e.g., sum and normalization). Indeed, normalizing the similarities channel-wise and spatially summing them corresponds to solving a specific instance of our problem.
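To make this concrete, the following is a minimal per-sample PyTorch sketch of the operation; the squared-Euclidean cost, the initialization, and the update order are assumptions of the sketch rather than the exact iteration of Proposition 4.2.

```python
import torch

def gsp_pool(features, prototypes, rho=0.5, eps=0.1, n_iters=25):
    """Sketch of GSP-style pooling: a distance computation against the
    prototypes followed by alternating normalization of a vector (per-feature
    masses) and a scalar (total transported mass)."""
    M = features.shape[0]                                # number of local features
    cost = torch.cdist(features, prototypes).pow(2).T    # (K, M) cost map
    pi = torch.exp(-cost / eps)                          # mass transported to prototypes
    xi = torch.ones(M, device=features.device)           # residual (untransported) mass
    for _ in range(n_iters):
        pi = pi * (rho / pi.sum())                       # scalar: transported mass equals rho
        col = pi.sum(dim=0) + xi                         # vector: mass carried by each feature
        scale = (1.0 / M) / col
        pi, xi = pi * scale, xi * scale                  # every feature carries 1/M mass
    w = pi.sum(dim=0)                                    # mass each feature sends to prototypes
    w = w / w.sum()                                      # pooling weights
    return w @ features                                  # pooled global representation
```

In this sketch, setting the transport ratio to 1 drives the residual mass to zero and makes the weights uniform, so the pooled vector reduces to the GAP output.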
The other pooling methods, i.e., GAP, GMP, GAP+GMP, GeMax, GeMean, WELDON, VLAD, and OTP, do not build on discriminative feature selection; thus, our method substantially outperforms them.
1.4 Computational Analysis
Forward and backward computations of the proposed GSP method can be implemented using only matrix-vector products. Moreover, having a closed-form, matrix-inversion-free expression for the loss gradient enables memory-efficient back-propagation, since otherwise the output of each iteration must be stored.
We perform a fixed number of iterations to obtain the pooling weights, and each iteration involves only matrix-vector products. In this sense, back-propagation can be performed with automatic differentiation. One problem with automatic differentiation is that the computational load increases with the number of iterations. With the closed-form gradient expression, we do not have this issue and in fact obtain constant back-propagation complexity. Granted that the closed-form expression demands the exact solution of the problem, the forward computation adds little overhead and is memory efficient since we discard the intermediate outputs. Moreover, our initial empirical study shows that our problems typically converge within a few tens of iterations, and we observe similar performances across that range.
The choice of the number of iterations is indeed problem dependent (i.e., it depends on the size of the feature map and the number of prototypes); thus, its effect on the computational load should be analyzed. We study this effect both with automatic differentiation and with our closed-form gradient expression, and we provide the related plots in Fig. 9. With closed-form gradients, our method adds little computational overhead, and increasing the number of iterations has a marginal effect. On the contrary, with automatic differentiation, the computational cost increases substantially.
We further provide the inference times for various numbers of optimization steps in Fig. 8. The additional computational complexity introduced by our method is minor, resulting in a less than 6% increase in the time per image (from 43.1 ms to 45.6 ms) within the typical operating interval of 25-50 iterations. Therefore, our method remains computationally feasible for real-time processing.
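The growth with the number of iterations under automatic differentiation can be observed with a simple timing harness like the one below; it reuses the gsp_pool sketch above, and the shapes and iteration counts are illustrative only.

```python
import time
import torch

feats = torch.randn(49, 128, requires_grad=True)    # e.g. a 7x7 map of 128-D features
protos = torch.randn(64, 128, requires_grad=True)   # 64 prototypes

for n_iters in (10, 25, 50, 100):
    start = time.perf_counter()
    gsp_pool(feats, protos, n_iters=n_iters).sum().backward()  # autodiff unrolls all iterations
    feats.grad = protos.grad = None
    print(f"n_iters={n_iters}: {1e3 * (time.perf_counter() - start):.2f} ms")
```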
1.5 Hyperparameter Selection
We first perform a screening experiment to see the effect of the hyperparameters. We design a 2-level fractional factorial experiment (i.e., a subset of all possible combinations).
We provide the results in Tab. 7. In our analysis, we find that lower values are better for two of the parameters, and we fix them accordingly. The transport ratio is observed to have the most effect, whereas the number of prototypes seems to have no significant effect. Nevertheless, we jointly search for the transport ratio and the number of prototypes. To this end, we perform a grid search on the CUB dataset with the Contrastive (C2) and ProxyNCA++ (PNCA) losses and provide the results in Fig. 10-(a). Both losses attain their best performance at the same number of prototypes, while the preferred transport ratio differs: one value works better for C2 and another for PNCA. We additionally perform a small experiment to check whether the same holds for the ProxyAnchor loss and observe that the PNCA setting is the better choice there as well. As a result of this search, we use one setting for non-proxy-based losses and another for proxy-based losses.
Table 7: Screening experiment (2-level fractional factorial design); the first four columns list the levels of the screened hyperparameters and the last two report MAP@R with the C2 and PNCA losses.
| Setting | | | | MAP@R (C2) | MAP@R (PNCA) |
|---|---|---|---|---|---|
| 16 | 0.3 | 0.5 | 0.1 | 40.63 | 40.59 |
| 16 | 0.7 | 0.5 | 0.5 | 40.41 | 40.34 |
| 128 | 0.3 | 0.5 | 0.5 | 40.22 | 40.35 |
| 128 | 0.7 | 0.5 | 0.1 | 40.07 | 40.85 |
| 16 | 0.3 | 20 | 0.5 | 36.11 | 40.51 |
| 16 | 0.7 | 20 | 0.1 | 39.11 | 39.88 |
| 128 | 0.3 | 20 | 0.1 | 39.61 | 39.12 |
| 128 | 0.7 | 20 | 0.5 | 35.36 | 39.92 |
| Baseline | | | | 39.77 | 39.90 |
Fixing the other hyperparameters, we further examine the effect of the number of prototypes on the large datasets (i.e., SOP and In-shop). We provide the corresponding performance plots in Fig. 10-(b). Supporting our initial analysis, the number of prototypes does not have a significant effect once it is not too small. We observe that any of the evaluated choices provides a performance improvement; that being said, further increasing the number of prototypes does not bring substantial improvement over relatively smaller values. Considering these results, we use 128 prototypes for the SOP and In-shop datasets since both the C2 and PNCA losses perform slightly better with that choice than with the others.
1.6 Experimental Setup
1.6.1 Datasets
We perform our experiments on 4 widely-used benchmark datasets: Stanford Online Products (SOP) [64], In-shop [67], Cars196 [66], and CUB-200-2011 (CUB) [65].
SOP has 22,634 classes with 120,053 product images. The first 11,318 classes (59,551 images) are split for training and the other 11,316 (60,502 images) classes are used for testing.
In-shop has 7,986 classes with 72,712 images. We use 3,997 classes with 25,882 images as the training set. For the evaluation, we use 14,218 images of 3,985 classes as the query and 12,612 images of 3,985 classes as the gallery set.
Cars196 contains 196 classes with 16,185 images. The first 98 classes (8,054 images) are used for training and remaining 98 classes (8,131 images) are reserved for testing.
CUB-200-2011 dataset consists of 200 classes with 11,788 images. The first 100 classes (5,864 images) are split for training, the rest of 100 classes (5,924 images) are used for testing.
Data augmentation follows [4]. During training, we resize each image so that its shorter side has length 256, then take a random crop with size between 40 and 256 and a random aspect ratio. The resulting crop is resized to the network input resolution and flipped horizontally at random; a sketch is given below. During evaluation, images are resized to 256 and then center cropped.
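A torchvision sketch of this pipeline follows; the final crop resolution (227), the aspect-ratio range (3/4, 4/3), and the flip probability (0.5) are assumed placeholders for the values elided above.

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Resize(256),                         # shorter side -> 256
    transforms.RandomResizedCrop(
        227,                                        # assumed final resolution
        scale=((40 / 256) ** 2, 1.0),               # crop side roughly between 40 and 256
        ratio=(3 / 4, 4 / 3),                       # assumed aspect-ratio range
    ),
    transforms.RandomHorizontalFlip(p=0.5),         # assumed flip probability
    transforms.ToTensor(),
])

eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(227),                     # assumed final resolution
    transforms.ToTensor(),
])
```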
1.6.2 Training Splits
Fair evaluation. We split the datasets into disjoint training, validation and test sets according to [4]. In particular, we partition the classes into training and test sets, and further split the training data into 4 partitions; 4 models are trained, each using one partition for validation while training on the remaining three.
Conventional evaluation. Following relatively old-fashioned conventional evaluation [64], we use the whole train split of the dataset for training and we use the test split for evaluation as well as monitoring the training for early stopping.
Hyperparameter tuning. For the additional experiments related to the effect of the hyperparameters, we split the training set into 5 partitions and train a single model on 4 of them while using the remaining one for validation.
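The class-disjoint partitioning above can be sketched as follows (the fair-evaluation case with 4 folds is shown; the hyperparameter-tuning split is analogous with 5 partitions and a single model). Function and variable names are illustrative.

```python
import numpy as np

def class_disjoint_folds(labels, n_folds=4):
    """Half of the classes for train, the other half for test; the train
    classes are further divided into `n_folds` partitions, and model i
    validates on fold i while training on the remaining folds (sketch)."""
    classes = np.unique(labels)
    train_cls, test_cls = np.array_split(classes, 2)
    folds = np.array_split(train_cls, n_folds)
    splits = [
        (np.concatenate([f for j, f in enumerate(folds) if j != i]), folds[i])
        for i in range(n_folds)
    ]
    return splits, test_cls
```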
1.6.3 Evaluation Metrics
We consider precision at 1 (P@1) and mean average precision at R (MAP@R), where R is defined for each query (a query is an image for which similar images are to be retrieved, and the references are the images in the searchable database) as the total number of true references for that query. Among these, MAP@R is shown to better reflect the geometry of the embedding space and to be a less noisy evaluation metric [4]. Thus, we use MAP@R to monitor training in our experiments, except for the conventional evaluation setting where we monitor P@1.
P@1: Find the nearest reference to the query. The score for that query is 1 if the reference is of the same class, 0 otherwise. Average over all queries gives P@1 metric.
P@R: For a query $q$, find the $R$ nearest references and let $r$ be the number of true references within this $R$-neighbourhood. The score for that query is $r/R$. Averaging over all queries gives the P@R metric, i.e., $\mathrm{P@R} = \tfrac{1}{n}\sum_{i=1}^{n} r_i/R_i$, where $n$ is the number of queries.
MAP@R: For a query $q$, we define $\mathrm{MAP@R}(q) = \tfrac{1}{R}\sum_{i=1}^{R} P(i)$, where $P(i)$ is the precision at $i$ if the $i$-th retrieval is correct and $0$ otherwise. Averaging over all queries gives the MAP@R metric, i.e., $\tfrac{1}{n}\sum_{j=1}^{n}\mathrm{MAP@R}(q_j)$, where $n$ is the number of queries.
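A reference implementation of these metrics, under the convention that the query itself is excluded from the reference set, could look as follows; `emb` and `labels` are placeholders for L2-normalized embeddings and class ids.

```python
import numpy as np

def retrieval_metrics(emb, labels):
    """Compute P@1, P@R and MAP@R as defined above (sketch)."""
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)                    # a query never retrieves itself
    order = np.argsort(-sim, axis=1)                  # references sorted by similarity
    p1, pr, mapr = [], [], []
    for i in range(len(labels)):
        ranked = order[i][order[i] != i]              # drop the query from its own ranking
        correct = labels[ranked] == labels[i]
        R = int(correct.sum())                        # total number of true references
        if R == 0:
            continue
        topR = correct[:R]
        p1.append(float(correct[0]))
        pr.append(topR.mean())
        prec = np.cumsum(topR) / np.arange(1, R + 1)  # precision at each rank i <= R
        mapr.append(float((prec * topR).mean()))      # P(i) counted only for correct retrievals
    return np.mean(p1), np.mean(pr), np.mean(mapr)
```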
1.6.4 Training Procedure
Fair evaluation. We use the Adam [68] optimizer with a constant learning rate, weight decay, and the default moment parameters. We use a batch size of 32 (4 samples per class). We evaluate the validation MAP@R every 100 training steps in CUB and Cars196, and every 1000 steps in SOP and In-shop. We stop training if no improvement is observed for 15 evaluations in CUB and Cars196, and 10 in SOP and In-shop, and we recover the parameters with the best validation performance. Following [4], we train 4 models, one per partition of the training set, and each model is trained 3 times. Hence, we have $3^4 = 81$ realizations of 4-model collections, and we present the average performance as well as the standard deviation (std) over these 81 evaluations.
Conventional evaluation. We use the Adam [68] optimizer with the default moment parameters. Following recent works [70], we use a reduce-on-plateau learning rate scheduler with a patience of 4. The initial learning rate is set per dataset (one value for CUB and another for Cars196, SOP and In-shop). We apply weight decay to both the BNInception and ResNet50 backbones. We use a batch size of 128 (4 samples per class) for the BNInception backbone and 112 (4 samples per class) for the ResNet50 backbone (following [5]). We evaluate the validation P@1 every 25 training steps in CUB and Cars196, and every 250 steps in SOP and In-shop. We stop training if no improvement is observed for 15 evaluations in CUB and Cars196, and 10 in SOP and In-shop, and we recover the parameters with the best validation performance. We repeat each experiment 3 times and report the best result. For the evaluations in the LIBC framework [24], we follow their experimental setting.
Hyperparameter tuning. We use the Adam [68] optimizer with a constant learning rate, weight decay, and the default moment parameters. We use a batch size of 32 (4 samples per class). We evaluate the validation MAP@R every 100 training steps in CUB and Cars196, and every 1000 steps in SOP and In-shop. We stop training if no improvement is observed for 10 evaluations in CUB and Cars196, and 7 in SOP and In-shop, and we recover the parameters with the best validation performance. We train a single model on 4 of the 5 training partitions while using the remaining one for validation. We repeat each experiment 3 times and report the averaged result.
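The monitoring scheme shared by the three protocols above can be sketched as a small loop; `step_fn` and `eval_fn` are placeholders for one optimizer step and one validation evaluation, not functions from our framework.

```python
import copy

def train_with_early_stopping(model, step_fn, eval_fn, eval_every=100, patience=15):
    """Run `eval_every` optimization steps between validation evaluations and
    stop after `patience` evaluations without improvement, restoring the best
    parameters (sketch)."""
    best_score, best_state, stale = float("-inf"), None, 0
    while stale < patience:
        for _ in range(eval_every):
            step_fn()
        score = eval_fn()                                   # validation MAP@R (or P@1)
        if score > best_score:
            best_score, stale = score, 0
            best_state = copy.deepcopy(model.state_dict())  # checkpoint the best parameters
        else:
            stale += 1
    model.load_state_dict(best_state)                       # recover the best parameters
    return best_score
```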
1.6.5 Embedding vectors
Fair evaluation. The embedding dimension is fixed to 128. During training and evaluation, the embedding vectors are normalized. We follow the evaluation method proposed in [4] and produce two results: the average performance (128-dimensional) of the 4-fold models and the ensemble performance (concatenated, 512-dimensional) of the 4-fold models, where the embedding vector is obtained by concatenating the 128D vectors of the individual models before retrieval.
Conventional evaluation. Embedding dimension is 512 in both BNInception and ResNet50 experiments.
Hyperparameter tuning. Embedding dimension is fixed to 128.
1.6.6 Baselines with GSP
We evaluate our method with C1+XBM+GSP: Cross-batch memory (XBM) [18] with original Contrastive loss (C1) [69], C2+GSP: Contrastive loss with positive margin [53], MS+GSP: Multi-similarity (MS) loss [55], Triplet+GSP: Triplet loss [54], PNCA+GSP: ProxyNCA++ loss [19], PAnchor+GSP: ProxyAnchor loss [70].
1.6.7 Hyperparameters
For hyperparameter selection, we exploit the work of [4], which performed parameter search via Bayesian optimization on a variety of losses. We further experiment with the parameters suggested in the original papers and official implementations, and we pick the best performing ones. We perform no further tuning of the baseline methods' parameters when they are combined with our method, so as to purely examine the effectiveness of our method.
C1: We adopted XBM’s official implementation for fair comparison. We use 0.5 margin for all datasets.
C2: C2 has two parameters, the positive margin and the negative margin. We set them separately for CUB, Cars196, In-shop and SOP.
Triplet: We set its margin to 0.0961, 0.1190, 0.0451, 0.0451 for CUB, Cars196, In-shop and SOP, respectively.
MS: We set its parameters separately for CUB, Cars196, In-shop and SOP.
ProxyAnchor: We use the same values of its two parameters for all datasets. We use 1 sample per class in the batch (i.e., 32 classes with 1 sample each per batch), we perform 1 epoch of warm-up training of the embedding layer, and we apply a learning rate multiplier of 100 for the proxies during training. For the SOP dataset, we use a different learning rate.
ProxyNCA++: We set its temperature parameter to 0.11 for all datasets. We use 1 sample per class in the batch (i.e., 32 classes with 1 sample each per batch), we perform 1 epoch of warm-up training of the embedding layer, and we apply a learning rate multiplier of 100 for the proxies.
XBM: We evaluate XBM with C1 since, in the original paper, the contrastive loss is reported to be the best performing baseline with XBM. We set the memory size of XBM according to the dataset: 25 batches for CUB and Cars196, 400 batches for In-shop, and 1400 batches for SOP. We perform 1K steps of training with the baseline loss prior to integrating the XBM loss in order to satisfy XBM's slow-drift assumption.
GSP: For the hyperparameters of our method, we perform a parameter search, the details of which are provided in Sec. 1.5. We use 64 prototypes for CUB and Cars196, and 128 prototypes for SOP and In-shop. We use one transport ratio for proxy-based losses and another for non-proxy losses, as selected in Sec. 1.5. For the rest, we fix the entropy regularization weight and iterate until the condition in Proposition 4.1 is satisfied. We also fix the zero-shot prediction loss coefficient.
2 Details of the Other Empirical Work
2.1 Synthetic Study
We design a synthetic empirical study to evaluate GSP in a fully controlled manner. We consider a 16-class problem in which classes are defined over trainable tokens. In this setting, tokens correspond to semantic entities, but we use this specific wording to emphasize that they are trained as part of the learning. Each class is defined by 4 distinct tokens, and there are also 4 background tokens shared by all classes. For example, a "car" class would have tokens like "tire" and "window" as well as background tokens such as "tree" and "road".
We sample class representations from both class-specific and background tokens according to a mixing ratio. We sample a total of 50 tokens, and such a 50-token feature collection corresponds to one training sample (i.e., we mimic a CNN's output with trainable tokens). For instance, given the class tokens of a class and the shared tokens, we first sample the mixing ratio and then sample, say, 20 tokens from the class tokens with replacement and 30 tokens from the shared tokens, forming a feature collection for that class. We then obtain global representations using GAP and GSP.
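The sampling procedure can be sketched as below; the symbol `alpha` for the mixing ratio and the fixed token-set sizes are assumptions used only for illustration.

```python
import torch

def sample_collection(class_tokens, shared_tokens, alpha=0.4, n_tokens=50):
    """Form a training sample as a collection of `n_tokens` token vectors:
    a fraction `alpha` drawn (with replacement) from the class-specific tokens
    and the rest from the shared background tokens (sketch)."""
    n_cls = int(round(alpha * n_tokens))                    # e.g. 20 class tokens
    idx_c = torch.randint(len(class_tokens), (n_cls,))
    idx_s = torch.randint(len(shared_tokens), (n_tokens - n_cls,))
    return torch.cat([class_tokens[idx_c], shared_tokens[idx_s]], dim=0)  # (n_tokens, d)

# Global representations of the same collection with GAP and (sketched) GSP:
# collection = sample_collection(class_tokens, shared_tokens)
# gap_repr = collection.mean(dim=0)
# gsp_repr = gsp_pool(collection, prototypes)   # gsp_pool sketch from Sec. 1.3
```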
We do not apply normalization to the global representations. We also constrain the token vectors to a bounded range to limit the magnitude of the learned vectors. We use the default Adam optimizer and perform early stopping with a patience of 30 epochs by monitoring MAP@R. In each batch, we use 4 samples from each of the 16 classes.
2.2 Cifar Collage
We consider the 20 super-classes of the Cifar100 dataset [59], each of which has 5 sub-classes. For each super-class, we split the sub-classes into train (2), validation (1), and test (2). We treat 4 super-classes as the shared (background) classes and compose 2x2-stitched collage images for the remaining 16 classes. In particular, we sample an image from a class and then sample 3 images from the shared classes. We illustrate the formation process in Fig. 11.
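A sketch of the collage construction is given below; placing the foreground image in the top-left corner is an arbitrary choice for illustration.

```python
import torch

def make_collage(fg_img, shared_imgs):
    """Stitch one foreground image and three shared-class images into a
    single 2x2 collage; all images are (C, H, W) tensors of equal size (sketch)."""
    a, b, c = shared_imgs
    top = torch.cat([fg_img, a], dim=-1)       # concatenate along width
    bottom = torch.cat([b, c], dim=-1)
    return torch.cat([top, bottom], dim=-2)    # concatenate along height -> (C, 2H, 2W)
```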
We should note that the classes exploited in training, validation and test are disjoint. For instance, if a tree class is used as a shared class in training, then that tree class does not exist in validation or test set as a shared feature. Namely, in our problem setting, both the background and the foreground classes are disjoint across training, validation and test sets. Such a setting is useful to analyze zero-shot transfer capability of our method.
We use a ResNet20 backbone (i.e., 3 stages with 3 blocks each) pretrained on the Cifar100 classification task. We apply $\ell_2$ normalization to the global representations. We use the default Adam optimizer and a reduce-on-plateau schedule with a decay factor of 0.5 and a patience of 5 epochs. For GSP, we set its hyperparameters as in Sec. 1.5. We use 4 samples from each of the 16 classes in a batch.
2.3 Evaluation of Zero-shot Prediction Loss
We train on the Cifar10 [59] dataset with 8 prototypes using the ProxyNCA++ [19] (PNCA) loss, with and without the zero-shot prediction loss. We then use the test set to compute attribute histograms for each class; namely, we aggregate the marginal transport plans of the samples in a class to obtain its histogram. Similarly, for each class we compute the mean embedding vector (i.e., we average the embedding vectors of the samples of that class). Our aim is to fit a linear predictor that maps attribute vectors to the mean embeddings.
To quantify the zero-shot prediction performance, we randomly split the classes into two halves and apply cross-split zero-shot prediction. Specifically, we fit a linear predictor using 5 classes and then use that transformation to map the attribute vectors of the other 5 classes to their mean embeddings. We then compute the pairwise distances between the predicted means and the true means and evaluate the nearest-neighbour classification performance, using both the $\ell_2$ distance and the cosine distance. We repeat the experiment 1000 times with different class splits.
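The evaluation protocol can be sketched as follows ($\ell_2$ distance only; the cosine-distance variant replaces the pairwise distance computation). Array names are placeholders.

```python
import numpy as np

def zero_shot_nn_accuracy(attr, means, n_trials=1000, seed=0):
    """attr: (C, K) per-class attribute histograms, means: (C, d) class mean
    embeddings. Fit a linear predictor on half of the classes, predict the
    means of the held-out half, and score nearest-neighbour matching (sketch)."""
    rng = np.random.default_rng(seed)
    C = len(attr)
    accs = []
    for _ in range(n_trials):
        perm = rng.permutation(C)
        fit, held = perm[: C // 2], perm[C // 2:]
        W, *_ = np.linalg.lstsq(attr[fit], means[fit], rcond=None)   # linear predictor
        pred = attr[held] @ W                                        # predicted class means
        dist = ((pred[:, None, :] - means[held][None, :, :]) ** 2).sum(-1)
        accs.append(float((dist.argmin(axis=1) == np.arange(len(held))).mean()))
    return float(np.mean(accs))
```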
Appendix A Appendix
A.1 Proof for Theorem 4.1
Proof.
is obtained as the solution of the following optimal transport problem:
Given solutions , for , from the 3rd constraint, we have . Then, using the constraint, we write:
where for -many convolutional features. Hence, we have which implies owing to non-negativity constraint. Finally, pooling weights becomes .
∎
A.2 Proof for Proposition 4.2
Before starting our proof, we first derive an iterative approach for the solution of (P2). We then prove that the iterations in Proposition 4.2 can be used to obtain the solution.
We can write (P2) equivalently as:
Rearranging the terms we get:
which is the generalized Kullback–Leibler divergence (KLD) between the two quantities. Hence, we reformulate the problem as a KLD projection onto a convex set, which can be solved by Bregman projections (i.e., alternating projections onto the constraint sets) [78, 79]. Defining the two constraint sets and omitting constants, we can write the problem as:
(P2′) |
Given the previous iterate, the KLD projection onto the first constraint set at each iteration reads:
where the result follows from the method of Lagrange multipliers. Similarly, for the projection onto the other constraint set, we have:
Given initialization, , we obtain the pairs for as:
(A.1) |
Proof.
We will prove by induction. From Proposition 4.2, we have
and its counterpart. It is easy to show that these expressions hold for the first pair. Now, assume that they also hold for an arbitrary iteration; we then have
Replacing we get:
where terms cancel out, resulting in:
Similarly, we express as:
again terms cancel out, resulting in:
Hence, becomes:
meaning that the expressions also hold for the pair . ∎
A.3 Proof for Proposition 4.3
Proof.
We start our proof by expressing (P2′) in a compact form as:
where denotes the vector formed by stacking and the row vectors of , denotes the vector formed by stacking -dimensional zero vector and the row vectors of , and and are such that imposes transport constraints as:
From Lagrangian dual, we have:
where is the optimal dual Lagrangian satisfying:
Defining , we have;
where . Similarly, for the dual variable, we have:
Putting back the expression for in , we obtain:
which involves a matrix inversion. We now show that the gradient can be obtained without explicit matrix inversion.
can be expressed as:
is Hermitian and positive definite. Using the block matrix inversion formula for such matrices (Corollary 4.1 of [80]), we obtain the inverse as:
where and .
Given this expression, the rest of the proof follows from right-multiplying by the Jacobian and rearranging the terms. ∎
Table 8: Comparison of our pooling method with the optimal transport based counterparts (cf. Appendix B).
| Item \ Method | | | Ours (GSP) |
|---|---|---|---|
| optimization problem | -many 1D OT | | |
| image representation | | | |
| dimension | | | |
| feature selection | ✗ | ✗ | ✓ |
| gradient computation | auto-diff | auto-diff | closed-form expression |
| matrix-inverse-free gradient | ✓ | ✗ | ✓ |
(Notation of the first row: number of slices and the projections onto each slice, sorted according to the Monge map.)
Appendix B Optimal Transport Based Operators
In this section, we briefly discuss optimal transport (OT) based aggregation and selection operators. We provide their formulations to show how our formulation differs from them.
B.1 Feature Aggregation
Given a cost map, i.e., an (number of prototypes) by (number of features) matrix representing the closeness between the prototypes and the features, [41] considers the following OT problem:
(P4) |
and defines their aggregated feature as:
(B.1) |
Namely, the result is an ensemble of vectors, each of which is a weighted aggregation of the features with weights proportional to the assignment weights of the corresponding prototype. The same aggregation scheme is also derived within the context of linear Wasserstein embeddings via Monge maps, where it is shown to be a barycentric projection of the feature set with the transport plan approximating the Monge map [40]. Similarly, ensembles of Monge maps corresponding to sliced-Wasserstein distances (SWD) are further employed for set aggregation [51]. In such ensemble representations, there is no feature selection mechanism and thus all the features are included in the image representation.
For instance, if we further sum the vectors in Eq. B.1 to obtain a single global representation, we end up with global average pooling; the sketch below makes this reduction explicit.
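The snippet forms the ensemble from a transport plan with uniform column marginals (as in (P4), and ignoring any per-prototype renormalization, which is an assumption of this sketch) and verifies that summing its members recovers the GAP vector.

```python
import torch

M, K, d = 49, 8, 128
features = torch.randn(M, d)
plan = torch.rand(K, M)
plan = plan / plan.sum(dim=0, keepdim=True) / M   # each feature distributes 1/M mass

ensemble = plan @ features                        # (K, d): one aggregated vector per prototype
gap = features.mean(dim=0)                        # global average pooling
print(torch.allclose(ensemble.sum(dim=0), gap, atol=1e-5))   # True
```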
Briefly, those optimal transport based set aggregators [40, 41, 51] map a set of features to another set of features without discarding any, and they do not provide a natural way of aggregating only the class-discriminative subset of the features. Such representation schemes are useful for structural matching. Albeit enabling a transport-based metric as a similarity measure between sets, their ensemble-based representations result in very high dimensional embeddings. On the contrary, our formulation effectively enables learning to select discriminative features and maps a set of features to a single feature that is of the same dimension and is distilled from nuisance information. We summarize the comparison of our pooling method with the optimal transport based counterparts in Table 8.
B.2 Top- Selection
Given a set of input scalars and a set of anchor scalars forming an increasing sequence, [50] considers the following OT problem:
(P5) |
where the marginals lie in the probability simplex. Top-k selection is then formulated through a particular choice of the marginals and the anchors, and sorted top-k selection through another. Solving the same problem (P5), [13] formulates top-k ranking by letting the marginals be arbitrary points of the probability simplex and the anchors form an increasing sequence.
Such top-k formulations are suitable for selecting or ranking scalars (a sketch of this construction is given below). In our problem, the aim is to select a subset of features that are closest to the prototypes, which are representatives of the discriminative information. Namely, we have a problem of subset selection from set-to-set distances. If our problem were of set-to-vector form, we would be able to formulate it using (P5); however, there is no natural extension of the methods in [50, 13] to our setting. Therefore, we rigorously develop an OT based formulation to express a discriminative subset selection operation analytically in a differentiable form.
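For completeness, a minimal sketch of the scalar soft top-k idea behind (P5) is given below, using a generic Sinkhorn solver with two anchors and squared-distance costs; these choices are assumptions of the sketch and not the exact constructions of [50, 13].

```python
import torch

def soft_top_k(x, k, eps=0.1, n_iters=200):
    """Transport n uniformly weighted scalars to two anchors with target
    masses ((n-k)/n, k/n); the mass sent to the upper anchor acts as a
    differentiable top-k membership score (sketch)."""
    n = x.numel()
    xn = (x - x.min()) / (x.max() - x.min() + 1e-9)     # rescale scores to [0, 1]
    anchors = torch.tensor([0.0, 1.0])
    cost = (xn[:, None] - anchors[None, :]) ** 2        # (n, 2) squared-distance cost
    K = torch.exp(-cost / eps)
    a = torch.full((n,), 1.0 / n)                       # uniform source masses
    b = torch.tensor([(n - k) / n, k / n])              # target masses
    u, v = torch.ones(n), torch.ones(2)
    for _ in range(n_iters):                            # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return n * plan[:, 1]                               # soft membership in the top-k set
```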
Our formulation in (P1) differs from the typical optimal transport problem exploited in (P5). In optimal transport, one matches two distributions and transports all the mass from one to the other. Differently, we transport only a portion of the uniformly distributed feature masses to the prototypes, which have no restriction on their mass distribution. In other words, we have a portion constraint instead of a target distribution constraint, and we use an additional decision variable accounting for the residual masses. If we drop this variable and transport all the mass, the problem becomes a specific case of an optimal transport barycenter problem with a single distribution.
Our problem can be expressed in a compact form by absorbing the residual variable into the transport plan with zero associated costs, as in the proof (Sec. A.3). We choose to define it explicitly in (P1) to show its role; we believe its residual-mass role is easier to understand this way. The benefits of our formulation include a matrix-inversion-free Jacobian for the feature selection and the possibility of reversing the role of the prototypes to background representatives simply by using the residual weights instead of the transported weights in Eq. (4.1). Our specific formulation further allows us to tailor a zero-shot regularization loss built on the learned prototypes within our pooling layer.
Appendix C Implementations with Pseudo Codes
The implementation comprises four routines, whose interfaces are summarized below.

GSP pooling layer. Trainable parameters: the prototypes (one vector per prototype, realized as a convolution layer). It computes the cost map between the prototypes and the convolutional features, invokes the solver below, and returns the pooled representation together with the transport plan.

Solver, forward pass. Hyperparameters: the transport ratio, the entropy regularization weight, and the number of iterations. It takes the cost map and returns the residual weights and the transport plan.

Solver, backward pass. It takes the solution (residual weights and transport plan) together with the incoming gradients and returns the gradient with respect to the cost map via the closed-form expression.

Zero-shot prediction loss. Trainable parameters: a semantic embedding vector for each class label.
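Under the assumptions already noted for the sketch in Sec. 1.3 (squared-Euclidean costs, a simple alternating-normalization solver, and a cosine-based zero-shot objective), a batched PyTorch rendering of these components could look as follows; it mirrors the listings' interfaces but is not our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSPPooling(nn.Module):
    """Trainable prototypes + cost map + alternating-normalization solver;
    returns the pooled representation and the transport plan (sketch)."""

    def __init__(self, dim, n_prototypes=64, rho=0.5, eps=0.1, n_iters=25):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim) * 0.02)
        self.rho, self.eps, self.n_iters = rho, eps, n_iters

    def solve(self, cost):
        B, K, M = cost.shape
        pi = torch.exp(-cost / self.eps)                 # transported mass
        xi = torch.ones(B, M, device=cost.device)        # residual mass
        for _ in range(self.n_iters):
            pi = pi * (self.rho / pi.sum(dim=(1, 2), keepdim=True))  # scalar normalization
            scale = (1.0 / M) / (pi.sum(dim=1) + xi)                 # vector normalization
            pi, xi = pi * scale.unsqueeze(1), xi * scale
        return pi, xi

    def forward(self, fmap):                             # fmap: (B, D, H, W)
        B = fmap.shape[0]
        feats = fmap.flatten(2).transpose(1, 2)          # (B, M, D)
        protos = self.prototypes.unsqueeze(0).expand(B, -1, -1).contiguous()
        cost = torch.cdist(feats, protos).pow(2).transpose(1, 2)     # (B, K, M)
        pi, _ = self.solve(cost)
        w = pi.sum(dim=1)                                # mass each feature sends to prototypes
        w = w / w.sum(dim=1, keepdim=True)               # pooling weights
        pooled = torch.bmm(w.unsqueeze(1), feats).squeeze(1)         # (B, D)
        return F.normalize(pooled, dim=1), pi            # the plan can feed the zero-shot loss

class ZeroShotLoss(nn.Module):
    """One trainable semantic embedding per class label; the per-sample
    attribute histogram (aggregated transport plan) predicts the label
    embedding as a convex combination of attribute embeddings. Keeping the
    attribute embeddings as separate parameters and using a cosine objective
    are assumptions of this sketch."""

    def __init__(self, n_classes, n_prototypes, dim):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(n_classes, dim) * 0.02)
        self.attr_emb = nn.Parameter(torch.randn(n_prototypes, dim) * 0.02)

    def forward(self, plan, labels):                     # plan: (B, K, M), labels: (B,)
        hist = plan.sum(dim=2)                           # (B, K) attribute histogram
        hist = hist / hist.sum(dim=1, keepdim=True)      # convex-combination weights
        pred = hist @ self.attr_emb                      # predicted label embeddings
        target = self.label_emb[labels]
        return (1.0 - F.cosine_similarity(pred, target, dim=1)).mean()
```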