Dissecting the impact of different loss functions with gradient surgery

Hong Xuan
Microsoft
[email protected] Robert Pless
Geroge Washington University
[email protected]

Abstract

Pair-wise loss is an approach to metric learning that learns a semantic embedding by optimizing a loss function that encourages images from the same semantic class to be mapped closer than images from different classes. The literature reports a large and growing set of variations of the pair-wise loss strategies. Here we decompose the gradient of these loss functions into components that relate to how they push the relative feature positions of the anchor-positive and anchor-negative pairs. This decomposition allows the unification of a large collection of current pair-wise loss functions. Additionally, explicitly constructing pair-wise gradient updates to separate out these effects gives insights into which have the biggest impact, and leads to a simple algorithm that beats the state of the art for image retrieval on the CAR, CUB and Stanford Online products datasets.

1 Introduction

Deep Metric Learning trains networks to map semantically related images to similar locations in an embedding space. Metric learning is useful in extreme classification settings when there are so many classes and limited embedding size that standard approaches fail, when there is a need to compare features extracted from images in unseen classes, or when there may be incomplete labels that allow a system to know that images come from the same or different classes without knowing what those classes are.

In this domain, one popular pair-wise loss to train a network is Triplet Loss. Triplets are three images comprising an anchor image, a positive image from the same class, and a negative image from a different class. The network is trained with a loss function that penalizes situations where the anchor-negative pair is closer than the anchor-positive pair. Many variations of this basic approach explore ways to choose which triplets should be included in the optimization or how much they should be weighted, whether the optimization should consider distances or angles between the embedded vectors, and what specific loss function should drive the scoring of a particular triplet.

One recent work gave a large-scale analysis of many of these variations [11] and found that a substantial fraction of the reported performance variation disappears with careful matching of experimental conditions. In this work, we propose a further unifying analysis of these approaches, but explicitly consider how the pair-wise loss function attempts to affect the embedding location of the anchor, positive, and negative examples. Different pair-wise loss functions have gradients that directly affect the desired locations of each embedded location in different ways. Those gradients are different in terms of the direction the anchor, positive and negative examples are pushed, the overall importance or weight given to different triplets, and the relative importance or weight given to the anchor-positive vs. the anchor negative pairs.

Refer to caption — Figure 1: To realize a desired embedding space, a common method is to design a loss function which can be calculated on deep learning platforms such as PyTorch and TensorFlow(Red). The auto-grad mechanism on the platforms automatically calculates the gradient to update the model parameters to forming the desired embedding space(Blue). In practice, the goal of deep metric learning is about optimizing the separation or clustering of feature points extracted from imagery, and the loss function is a somewhat indirect approach to reach that goal, while the gradient more directly affects the update of the feature extraction. We propose the method to directly design the gradient to train models.

In addition to the analysis, we exploit the fact that PyTorch [12] allows for the programmatic specification of gradients, allowing us to explicitly control the above gradient components, and then supports back-propagation to encourage the low-level features to move in this way. This flexibility allows us to explore the relative contributions of these components of the gradient and better understand what is and is not important in the optimization. Finally, we demonstrate the potential to directly modify the gradient components to train models for deep metric learning tasks instead of loss function modification.

The three main contributions¹¹1Reject in CVPR2021, ICCV2021, CVPR2022 of this are:

•

a direct gradient framework to create a unified analysis of many recent triplet and pair-wise loss functions in terms of their gradients,
•

an experimental analysis showing how different choices for components of the gradient affects model performance,
•

a deeper understanding of the practical effects of defining a loss based on the Euclidean metric compared with the cosine similarity metric, and
•

an integration of the best choice of each component to create a new gradient rule that outperforms the current state-of-art result for image retrieval by a clear margin across multiple datasets.

2 Background

There are many loss functions that have been proposed to solve the deep metric learning problem. Pair-wise loss functions such as contrastive loss [2], binomial deviance loss [24], lifted structure loss [17] and multi-similarity loss [19] penalize pairs of same label instances if their distance is large and pairs of different label instances if their distance is small. The triplet loss function [5, 15] and its variants such as circle loss [18] form a triplet that contains anchor, positive and negative instances, where the anchor and positive instance share the same label, and anchor and negative instance share different labels. These loss functions have losses encouraging the anchor-positive distance to be smaller than anchor-negative distance. Other variants of triplet loss integrate more negative instances into a triplet, such as N-Pair loss [16]. Proxy loss [10] defines for each class a learnable anchor as a proxy. During the training, each instance is directly pulled to its proxy and pushed away from the proxy location for other classes.

Due to the explosion of many new loss functions, issues underlying the fair comparison for these loss functions have been raised in [11]. This paper works hard to re-implement many works before 2019. It tries to fix settings such as network architecture, optimizer and image prepossessing and compares different methods apple to apple. This gives a relatively clear comparison of many loss functions but does not try to explore why some methods are superior to others.

Recent works such as Multi-Similarity Loss and Circle Loss [19, 18, 22] have started with standard triplet loss formulations and adjust the gradient of loss functions to give clear improvements with very simple code modifications. These works all find an explicit loss function whose gradient creates the desired loss function. In some cases, like the current state-of-the-art approach across many datasets [19], the updated loss function for one triplet includes the relative similarity between the anchor-positive and the anchor and other examples for the anchor’s class. This more complicated loss function and more complicated gradient may cause subtle challenges in the optimization process.

Other strategies start with a desired gradient weighting function and integrate the desired gradients to solve for a loss function whose gradient has the appropriate properties. This is often limited to simple weighting strategies, such as the simple linear form in [18] and simple gradient removal for positive pairs when triplets contain hard negative in [22], because it may be hard to find the loss function whose gradient is consistent with complex weighting strategies.

The discussion of explicitly updating the direction of the gradient has been introduced in [9]. They encourage the anchor-positive and anchor-negative directional updates to be orthogonal (so they don’t cancel each other), but include this as a ”direction regularization”, which does not enforce orthogonality.

The most related work is P2Sgrad [27], the author analyzes the gradient in the family of margin-based softmax loss and directly modified the gradient with the cosine similarity for better optimization. Comparing to P2Sgrad, our work focuses on the triplet and pair-wise loss functions.

The framework in this paper directly explore the space of desired gradient updates as shown in Figure 1. By not limiting ourselves to designing a loss function with appropriate gradients, we can be more explicit in experimentally dissecting the effects of different parts of the gradient. Furthermore, we can recombine the gradient terms that are experimentally most useful in a form of gradient surgery [25] that very slightly alters existing algorithms to give improved performance.

3 The Role of the Gradient in Metric Learning

We define a collection of terms for how a batch of images affects a network. Let $\mathbf{X}$ be a batch of input images, $\mathbf{f}$ be the $L2$ normalized feature vectors of the images extracted by the network, $l$ be loss value for the batch, $\theta$ be the parameters of the network, $\eta$ be the learning rate, $f_{\theta}(\cdot)$ be the mapping function of the network, and $L(\cdot)$ be loss function. In the forward training step, the expression is:

l=L(\mathbf{f})\text{, where }\mathbf{f}=f_{\theta}(\mathbf{X})

(1)

The network weights are updated as:

\theta^{t+1}=\theta^{t}-\eta\frac{\partial L}{\partial\mathbf{f}}\frac{\partial\mathbf{f}}{\partial\theta}

(2)

This equation highlights that the gradient of the loss function (rather than the loss function itself) directly affects how the model updates its parameters. Therefore, explicitly exploring the gradient is a useful path to exploring network learning behavior.

We decompose the gradient into two terms, $\frac{\partial L}{\partial\mathbf{f}}$ and $\frac{\partial\mathbf{f}}{\partial\theta}$ . The first term represents how changing the embedded feature location affects the loss, and this is the term explored most in detail in this work. The second term represents how model parameter (network weight) changes affect the feature embedding. In a modern deep network with multiple layers, the second term is always expanded with the multiplication of multiple terms for each layer because of the derivative chain rule.

In the following discussion, we focus on the particular forms of the first term in many triplet and pair-wise loss functions and then proposed to directly set and design the first term for model training. In Section 3.1, as an example, two commonly used triplet losses are decomposed into components and then those components are categorized into three parts. Then, Section 3.2, 3.3 and 3.4 extend the analysis to more existing loss functions.

3.1 Gradient of Triplet Losses

Given a triplet, $(\mathbf{f_{a}},\mathbf{f_{p}},\mathbf{f_{n}})$ , there are two commonly used triplet losses in the literature, a triplet loss based on Euclidean distance:

L_{euc}=\max(D_{ap}^{2}-D_{an}^{2}+m,0),

(3)

where $D_{ap}=\|\mathbf{f_{a}}-\mathbf{f_{p}}\|$ , $D_{an}=\|\mathbf{f_{a}}-\mathbf{f_{n}}\|$ are the distances between the anchor-positive and the anchor-negative pairs, and $m$ is a distance margin. A second common triplet loss is the triplet loss based on cosine similarity with NCA [1]:

L_{cos}=-\log(\frac{\exp{(\tau S_{ap})}}{\exp{(\tau S_{ap})}+\exp{(\tau S_{an})}})

(4)

where $S_{ap}=\mathbf{f_{a}}^{T}\mathbf{f_{p}}$ , is the cosine similarity computed as the dot-product of the normalized anchor feature and the normalized feature from the positive example, the anchor-negative is computed in the same way, $S_{an}=\mathbf{f_{a}}^{T}\mathbf{f_{n}}$ and $\tau$ is the scaling parameter.

When comparing these two loss functions, their substantial differences make it challenging to determine how the loss affects performance. One loss is based on the Euclidean distance combined with a hinge function, while the other uses cosine similarity along with a negative log softmax function to combine the anchor-positive and anchor-negative pairs. Looking at the gradients of these loss functions makes the difference more clear. In triplet loss based on Euclidean distance, if the loss is greater than 0, its gradient can be derived from Equation 3 as:

\left\{\begin{aligned} &\frac{\partial L_{euc}}{\partial\mathbf{\mathbf{f_{p}}}}=2\|\mathbf{f_{p}}-\mathbf{f_{a}}\|\mathbf{e_{p}^{euc}}\\ &\frac{\partial L_{euc}}{\partial\mathbf{\mathbf{f_{n}}}}=2\|\mathbf{f_{a}}-\mathbf{f_{n}}\|\mathbf{e_{n}^{euc}}\\ &\frac{\partial L_{euc}}{\partial\mathbf{\mathbf{f_{a}}}}=-2\|\mathbf{f_{a}}-\mathbf{f_{p}}\|\mathbf{e_{p}^{euc}}-2\|\mathbf{f_{n}}-\mathbf{f_{a}}\|\mathbf{e_{n}^{euc}}\end{aligned}\right.

(5)

Being explicit about this gradient allows us to name the direction that the positive example is being pulled to anchor example as $\mathbf{e_{p}^{euc}}$ , and these are unit vectors defined as: $\mathbf{e_{p}^{euc}}=\frac{\mathbf{f_{p}}-\mathbf{f_{a}}}{\|\mathbf{f_{p}}-\mathbf{f_{a}}\|}$ , with corresponding directions for the negative example, $\mathbf{e_{n}^{euc}}=\frac{\mathbf{f_{a}}-\mathbf{f_{n}}}{\|\mathbf{f_{a}}-\mathbf{f_{n}}\|}$ .

The gradient of the triplet loss function based on cosine similarity can also be derived from Equation 4 to give a unit direction and magnitude:

\left\{\begin{aligned} &\frac{\partial L_{cos}}{\partial\mathbf{f_{p}}}=\frac{1}{1+\exp{(\tau(S_{ap}-S_{an}))}}\tau\mathbf{e_{p}^{cos}}\\ &\frac{\partial L_{cos}}{\partial\mathbf{f_{n}}}=\frac{1}{1+\exp{(\tau(S_{ap}-S_{an}))}}\tau\mathbf{e_{n}^{cos}}\\ &\frac{\partial L_{cos}}{\partial\mathbf{f_{a}}}=\frac{1}{1+\exp{(\tau(S_{ap}-S_{an}))}}\tau(\mathbf{e_{a_{p}}^{cos}}+\mathbf{e_{a_{n}}^{cos}})\end{aligned}\right.

(6)

where $\mathbf{e_{p}^{cos}}=-\mathbf{f_{a}}$ , $\mathbf{e_{n}^{cos}}=\mathbf{f_{a}}$ , $\mathbf{e_{a_{p}}^{cos}}=-\mathbf{f_{p}}$ and $\mathbf{e_{a_{n}}^{cos}}=\mathbf{f_{n}}$ are the unit gradient directions.

Though both $L_{euc}$ and $L_{cos}$ contain different gradient components, those components can be categorized into two major parts: unit gradient direction for moving the feature and a scalar weight that affects the length of the gradient in that direction. The weight itself can be divided into two sub-parts: the weight related to all three features in a triplet $\mathbf{f_{a}}$ , $\mathbf{f_{p}}$ and $\mathbf{f_{n}}$ (Triplet Weight), and the weight related to the positive pair $\mathbf{f_{a}}$ and $\mathbf{f_{p}}$ or negative pair $\mathbf{f_{a}}$ and $\mathbf{f_{n}}$ in a triplet (Pair Weight).

With the categorizations of the gradient components, it becomes easy to compare the effects of each component. Before the comparison, we first show how recently proposed loss functions can be characterized by computing the direction and weights of the different gradient terms in Sections 3.2, 3.3 and 3.4 and then perform comparisons of the isolated effects of each gradient component in Sections 5.1, 5.2 and 5.3

3.2 Unit Gradient Direction

The first gradient component is the unit vector in the direction of the gradient, derived from how the loss function moves the relative configuration of the anchor, positive and negative features. We refer to the unit gradient direction of the two most common metrics Euclidean distance and cosine similarity as Euclidean direction $\mathbf{e^{euc}}$ and cosine direction $\mathbf{e^{cos}}$ . Recent work [9], also suggests to other directions, Euclidean orthogonal direction $\mathbf{e^{euc-orth}}$ and cosine orthogonal direction $\mathbf{e^{cos-orth}}$ .

Euclidean Direction( $\mathbf{e^{euc}}$ ):

In equation 5, the geometric explanation of Euclidean direction is to move the positive feature directly towards the anchor and move the negative feature directly away from the anchor, as shown in Figure 2. The vector direction of the anchor image (not shown in the Figure), is a combination of these directions.

Cosine Direction( $\mathbf{e^{cos}}$ ):

In equation 6, the geometric explanation of cosine direction on positive pair is to move the positive feature in the anchor feature direction and move the anchor in positive feature direction, and on negative pair is to move negative in the opposite of the anchor feature direction and move the anchor in the opposite of the negative feature direction as shown in Figure 2.

Orthogonal Direction( $\mathbf{e^{euc-orth}}$ $\&$ $\mathbf{e^{cos-orth}}$ ):

A direct gradient modification function 7 can be applied to both the Euclidean and cosine directions. This requires the negative pair to move in a direction orthogonal to the direction the positive pair is moving. This is constrained as:

\mathbf{e_{n}}\cdot(\mathbf{f_{a}}-\mathbf{f_{p}})=0

(7)

This gradient was realized in recent work by [9] who implicitly encourage the negative examples to move orthogonally to the anchor positive direction by adding regularizer in their loss function. Our approach is directly understanding the gradient direction for each example highlights the impact of this loss function.

3.3 Pair Weight

We define the pair-weight $P$ , for the anchor-positive pair $P_{+}$ and anchor-negative pair $P_{-}$ . The pair weight of cosine similarity $P^{cos}$ based triplet loss is a constant scaling parameter. This is useful as a baseline for comparison. For this case where both pair weights are set with constant $1$ , as:

P_{+}^{con}=P_{-}^{con}=1;

(8)

In Euclidean distance based triplet loss, the pair weight $P^{euc}$ is different for the anchor-positive and anchor-negative pairs:

\left\{\begin{aligned} &P_{+}^{euc}=\|\mathbf{f_{a}}-\mathbf{f_{p}}\|\\ &P_{-}^{euc}=\|\mathbf{f_{a}}-\mathbf{f_{n}}\|\end{aligned}\right.

(9)

and indicates the pair weight is proportional to the distance between the anchor and the other element of the pair.

Recent works [18, 19, 22] argue that the weight for anchor-negative pair should be large when they are close to each other. Otherwise, as mentioned in [22], the optimization will quickly converge to bad local minima. The solution in Circle loss [18] is to apply a linear pair weight $P^{lin}$ : for negative pairs, the weight is large if the similarity is large and small if the similarity is small; for positive pairs, the weight is large if the similarity is small and small if the similarity is large:

\left\{\begin{aligned} &P_{+}^{lin}=1-S_{ap}\\ &P_{-}^{lin}=S_{an}\end{aligned}\right.

(10)

Early work binomial deviance loss [24] uses a similar pair weight but with a nonlinear sigmoid form $P^{sig}$ :

\left\{\begin{aligned} &P_{+}^{sig}=\frac{1}{1+\exp{(\alpha(S_{ap}-\lambda))}}\\ &P_{-}^{sig}=\frac{1}{1+\exp{(-\beta(S_{an}-\lambda))}}\end{aligned}\right.

(11)

where $\alpha$ , $\beta$ and $\lambda$ are three hyper-parameters.

Multi-similar(MS) loss [19] combines ideas from the lifted structure loss [17] and binomial deviance loss [24], which includes not only the self-similarity of a selected pair but also the relative similarity from other pairs.

The MS paper [19] tries to find a loss function whose derivative fits the proposed pair weight. Because the relative similarity term involves additional examples (outside the triplet), this creates additional gradients relative to those examples, even though the stated purpose is to weigh the selected pair. Therefore, it’s difficult to understand if the performance gain is coming from the proposed pair weight or from the gradients affecting the feature location of these other examples. By casting their work within our framework, we can decouple the pair-weighting and explore the impact of this term in isolation.

We follow the MS paper to cast their weighting function $P^{sig-ms}$ in our framework. Given a triplet, the self-similarity of the selected positive pair and negative pair are $S_{ap}$ and $S_{an}$ . The similarity of other positives and negatives to the anchor is considered as relative-similarity, noted as ${R_{ap}}^{i}$ and ${R_{an}}^{j}$ . In addition, [19] also defines $\mathcal{P}$ and $\mathcal{N}$ be the sets of selected ${R_{ap}}^{i}$ and ${R_{an}}^{j}$ , where

	$\displaystyle\mathcal{P}=\{{R_{ap}}^{i}\colon{R_{ap}}^{i}<\max\{S_{an},{R_{an}}^{j}\}+\epsilon\}$
	$\displaystyle\mathcal{N}=\{{R_{an}}^{j}\colon{R_{an}}^{j}>\min\{S_{ap},{R_{ap}}^{i}\}-\epsilon\}$		(12)

\left\{\begin{aligned} &P_{+}^{sig-ms}=\frac{1}{m_{+}^{sig}+\exp{(\alpha(S_{ap}-\lambda))}}\\ &P_{-}^{sig-ms}=\frac{1}{m_{-}^{sig}+\exp{(-\beta(S_{an}-\lambda))}}\end{aligned}\right.

(13)

where

	$\displaystyle m_{+}^{sig}=\frac{1}{\left\|\mathcal{P}\right\|}\sum_{\mathcal{P}}\exp{(\alpha(S_{ap}-{R_{ap}}^{i}))}$
	$\displaystyle m_{-}^{sig}=\frac{1}{\left\|\mathcal{N}\right\|}\sum_{\mathcal{N}}\exp{(-\beta(S_{an}-{R_{an}}^{j}))}$

When $m_{+}^{sig}=m_{-}^{sig}=1$ the pair weights simplify back to the sigmoid form in equation 11.

In practice, training MS loss needs to tune four hyper-parameters $\alpha$ , $\beta$ , $\lambda$ and $\epsilon$ to fit different datasets, making the training not convenient and not efficient. With analysis on relative-similarity terms $m_{+}^{sig}$ and $m_{-}^{sig}$ in the appendix, we define a clearer and parameter free version of pair weight called linear MS pair weight $P^{lin-ms}$ , which behaves similar to the original MS weight:

\left\{\begin{aligned} &P_{+}^{lin-ms}=(1-m_{+}^{lin})(1-S_{ap})\\ &P_{-}^{lin-ms}=(1+m_{-}^{lin})S_{an}\end{aligned}\right.

(14)

where

	$\displaystyle m_{+}^{lin}=\frac{1}{\left\|\mathcal{P}\right\|}\sum_{\mathcal{P}}(S_{ap}-{R_{ap}}^{i})$
	$\displaystyle m_{-}^{lin}=\frac{1}{\left\|\mathcal{N}\right\|}\sum_{\mathcal{N}}(S_{an}-{R_{an}}^{j})$

3.4 Triplet Weight

The triplet weight contains the similarity of both positive and negative pairs of a triplet, measuring whether a triplet is well separated or not. In Euclidean distance based triplet loss, the triplet weight (denoted as $T$ ) is a constant indicating that every triplet will be treated equally. For the fair comparison for other triplet weights, we set constant weight $0.5$ .

T^{con}=0.5

(15)

In cosine similarity based triplet loss, the triplet weight is:

T^{cos}=\frac{1}{1+\exp{(\tau(S_{ap}-S_{an}))}}

(16)

$T^{cos}$ is rely on the difference of $S_{ap}$ and $S_{an}$ . When a triplet in a correct configuration, $S_{ap}-S_{an}>0$ , the triplet weight is small. Otherwise, the triplet weight will be large.

In Circle loss [18], the triplet weight is:

T^{cir}=\frac{1}{1+\exp{(\tau(S_{ap}(2-S_{ap})-S_{an}^{2}))}}

(17)

Because $T^{cos}$ only considers the similarity difference $S_{ap}-S_{an}$ , some corner cases such triplet with both large $S_{ap}$ and $S_{an}$ or both small $S_{ap}$ and $S_{an}$ are not well treated. The idea of $T^{cir}$ is to introduce a non-linear mapping for $S_{ap}$ and $S_{an}$ in the exponential term in order to weight more on the corner cases.

Figure 3 shows the triplet weight diagram, a triplet visualization tool from [22], for $T^{cos}$ and $T^{cir}$ with $\tau=1$ . The equal weight line in $T^{cos}$ is straight lines with form $S_{ap}-S_{an}=\text{const.}$ . And the equal weight line in $T^{cir}$ is circular lines with form $(S_{ap}-1)^{2}+S_{an}^{2}=\text{const.}$ .

Selectively Contrastive Triplet(SCT) loss [22] selects triplets with hard negatives (the negative example in a triplet is closer to anchor than the positive example) and applies only contrastive loss to the hard negative pairs during the batch training. At gradient level, this approach is to remove the gradients from the anchor-positive pairs for triplets with hard negatives. We treat the selection as a masking operator on positive pair weight:

\displaystyle T^{sc1}(P_{+})=\begin{cases}0&\text{if $S_{an}>S_{ap}$}\\ P_{+}&\text{others}\end{cases}

(18)

Because the decision boundary of triplets selection $S_{an}=S_{ap}$ is a 1st order straight line, we note this masking operator is noted as $T^{sc1}$ . Besides, we continue to extend the selection idea with Circle loss. The triplets in the corner cases can be also selected to only separate the negative pairs. Then, the decision boundary of the selection operator becomes a 2nd order circular line. We note it as $T^{sc2}$ ,

\displaystyle T^{sc2}(P_{+})=\begin{cases}0&\text{if $S_{ap}(2-S_{ap})-S_{an}^{2}>0.5$}\\ P_{+}&\text{others}\end{cases}

(19)

Figure 3 right shows the difference decision boundaries of $T^{sc1}$ and $T^{sc2}$ .

3.5 Metric Learning Gradient Summary

Method Direction PairWeight TripletWeight Triplet (Euclidean) [15] $\mathbf{e^{euc}}$ $P^{euc}$ $T^{con}$ Triplet (cosine) [23] $\mathbf{e^{cos}}$ $P^{con}$ $T^{cos}$ Circle loss [18] $\mathbf{e^{cos}}$ $P^{lin}$ $T^{cir}$ Binomial deviance [24] $\mathbf{e^{cos}}$ $P^{sig}$ $T^{con}$ MS loss [19] $\mathbf{e^{cos}}$ $P^{sig-ms}$ $T^{con}$ DR-MS loss [9] $\mathbf{e^{cos-orth}}$ $P^{sig-ms}$ $T^{con}$ SC triplet loss [22] $\mathbf{e^{cos}}$ $P^{con}$ $T^{cos}$ , $T^{sc1}$

Table 1: Triplet loss functions define a gradient on the embedded feature locations of the anchor, positive, and negative examples of the triplet. A large collection of recently proposed triplet loss functions (left) can be put into a unified framework by decomposing the gradient into the (unit) directions they impose on the features, and the weight of that gradient due to the properties of the anchor-positive and anchor-negative pairs, and the overall configuration of the triplet. This decomposition gives some insight into why some approaches give improved results, and provides a design space for choosing particular combinations of weights to optimize overall performance.

In this section, we have derived ways to represent many previous loss functions in terms of their gradients. We have explicitly defined the gradients in terms of how the anchor, positive and negative are moved, defined them in terms of a unit vector in the direction of motion, a weight of anchor-positive term and the anchor negative term and weight of the triplet overall. Table 1 shows how to map different combinations of gradient components into currently proposed loss functions. Section 5 gives explicit experiments to understand the isolated effects of these three parts of gradient component.

4 Experiment Settings

We run a set of experiments on the CUB200 (CUB) [20], CAR196 (CAR) [7], Stanford Online Products (SOP) [17] and In-shop Cloth (In-shop) [8] dataset. All experiments are run on the PyTorch platform [12] with Nvidia Tesla V100 GPU, using ResNet [4] architectures, pre-trained on ILSVRC 2012-CLS data [14]. Training images augmented using a standard scheme (random horizontal flip and random crops padded by 10 pixels on each side), and normalized using the channel means and standard deviations. The network is trained with stochastic gradient descent (SGD) with momentum $0$ , step $0.1$ and milestone at $60\%$ of the total epochs. We refer the Easy Positive with Hard Negative mining protocol [23] to sample a batch with $C$ classes and $N$ images per class. On CUB, CAR, SOP and In-shop dataset, we sample 8, 16, 4 and 4 images per class in a mini-batch.

Small embedding size comparing to training classes size: We follow the early goal of deep metric learning works [17, 16, 10, 3] which sets the embedding size to be smaller than the number of training classes. On CUB, CAR, SOP and In-shop dataset the embedding size is 64, 64, 512, 512.

Comparison of Gradient Components: To compare each component in the gradient, we train ResNet18 on CAR dataset and In-shop dataset for 60 epochs. The training is run with batch size 128. For a given test setting, we run the test 5 times to remove the effect caused by the randomness coming from the random sampling of the batch and random initialization of the final FC embedding layer which reducing the GAP feature to a target dimension (e.g. 64 or 512). Then, the mean and standard deviation of Recall@1 are calculated.

Comparison with the State-of-the-Art: To compare the recent state-of-the-Arts results, we select ResNet50 as the backbone for 80 epochs training. The training is run with different batch sizes 128, 256, 384 and 512. Each test is run 3 times and mean Recall@K is calculated as the measurement for retrieval quality.

PyTorch Implementation: In PyTorch platform, we use torch.autograd.Function module to customize both forward and backward functions for a loss module. The backward function is to generate our customized gradient for the optimizer. During the training, the gradient is directly starting from the backward function, replacing the gradient generated by AutoGrad of the forward function.

5 Comparison Experiments

In this section, we give explicit experiment results to demonstrate the isolated effects contributed by unit gradient direction, pair weight and triplet weight. More raw results are shown in Appendix.

Direction CAR In-shop $\mathbf{e^{euc}}$ 69.5 $\pm$ 0.7 83.7 $\pm$ 0.1 $\mathbf{e^{cos}}$ 75.5 $\pm$ 0.2 85.2 $\pm$ 0.3 $\mathbf{e^{euc-orth}}$ 66.9 $\pm$ 0.5 84.1 $\pm$ 0.1 $\mathbf{e^{cos-orth}}$ 77.0 $\pm$ 0.7 86.6 $\pm$ 0.2

Table 2: Comparing recall@1 performance of different gradient directions on CAR and In-shop dataset

PairWeight $\mathbf{e^{euc}}$ $\mathbf{e^{cos}}$ $P^{con}$ 69.5 $\pm$ 0.7 75.5 $\pm$ 0.2 $P^{euc}$ 75.2 $\pm$ 0.4 77.0 $\pm$ 0.4 $P^{lin}$ 76.7 $\pm$ 0.5 77.8 $\pm$ 0.9 $P^{lin-ms}$ 78.2 $\pm$ 0.4 78.8 $\pm$ 0.9 $P^{sig}$ 71.9 $\pm$ 0.4 74.3 $\pm$ 0.2 CAR $P^{sig-ms}$ 74.6 $\pm$ 0.8 76.2 $\pm$ 0.5 $P^{con}$ 83.7 $\pm$ 0.1 85.2 $\pm$ 0.3 $P^{euc}$ 85.2 $\pm$ 0.2 84.8 $\pm$ 0.2 $P^{lin}$ 87.4 $\pm$ 0.1 87.3 $\pm$ 0.2 $P^{lin-ms}$ 87.5 $\pm$ 0.2 87.3 $\pm$ 0.1 $P^{sig}$ 86.2 $\pm$ 0.1 87.8 $\pm$ 0.2 In-shop $P^{sig-ms}$ 84.9 $\pm$ 0.4 86.4 $\pm$ 0.2

Table 3: Comparing recall@1 performance of different pair weights with Euclidean and cosine direction on CAR and In-shop dataset

TripletWeight $P^{con}$ $P^{lin}$ $T^{con}$ 75.5 $\pm$ 0.2 77.8 $\pm$ 0.9 $T^{cos}$ 75.8 $\pm$ 0.2 77.5 $\pm$ 0.5 $T^{cos}$ & $T^{sc1}$ 77.0 $\pm$ 0.2 78.8 $\pm$ 0.6 $T^{cos}$ & $T^{sc2}$ 77.2 $\pm$ 0.5 78.4 $\pm$ 0.5 $T^{cir}$ 75.5 $\pm$ 0.3 78.3 $\pm$ 0.2 $T^{cir}$ & $T^{sc1}$ 77.0 $\pm$ 0.4 78.1 $\pm$ 0.6 CAR $T^{cir}$ & $T^{sc2}$ 76.6 $\pm$ 0.8 78.8 $\pm$ 0.3 $T^{con}$ 85.2 $\pm$ 0.1 87.3 $\pm$ 0.2 $T^{cos}$ 86.0 $\pm$ 0.3 87.9 $\pm$ 0.4 $T^{cos}$ & $T^{sc1}$ 85.1 $\pm$ 0.3 87.3 $\pm$ 0.3 $T^{cos}$ & $T^{sc2}$ 84.5 $\pm$ 0.2 86.9 $\pm$ 0.2 $T^{cir}$ 86.0 $\pm$ 0.1 87.7 $\pm$ 0.2 $T^{cir}$ & $T^{sc1}$ 84.9 $\pm$ 0.2 87.2 $\pm$ 0.1 In-shop $T^{cir}$ & $T^{sc2}$ 84.2 $\pm$ 0.1 86.9 $\pm$ 0.2

Table 4: Comparing recall@1 performance of different triplet weights with constant and linear pair weight on CAR and In-shop dataset

5.1 Unit Gradient Direction

To understand the behavior of unit gradient directions in section 3.2, we set constant pair and triplet weight $T^{con}=0.5$ and $P^{con}=1$ , and vary the choice of Euclidean, cosine, Euclidean-orthogonal and cosine-orthogonal direction.

In Table 2, we find the following trends. First, the cosine and cosine-orthogonal direction have better Recall@1 accuracy than other directions. Second, the cosine-orthogonal gradient direction gives an improvement for both datasets compared to the cos direction. More analysis will be discussed in section 5.4.

5.2 Pair Weight

To understand the behavior of the pair weights, we set triplet weights with constant form $T^{con}=0.5$ and gradient direction with $\mathbf{e^{cos}}$ and $\mathbf{e^{euc}}$ for two sets of results respectively. As for baseline results, the pair weights are set with constant form $P^{con}=1$ .

In Table 3, all pair weights provide a clear performance gain to their baseline results. Also, the performance gap of gradient direction $\mathbf{e^{euc}}$ and $\mathbf{e^{cos}}$ after applying the pair weight is greatly reduced.

Both relative-similarity methods $P^{lin-ms}$ and $P^{sig-ms}$ perform better than the method with only self-similarity on CAR dataset across different learning rates. Due to the property of well separation on In-shop dataset as shown in Figure 4, the relative-similarity term will less likely exist during the train because few positive and negative examples will be in $\mathcal{P}$ and $\mathcal{N}$ set as mentioned in equation 12. $P^{lin-ms}$ is performance almost as same as $P^{lin}$ , but $P^{sig-ms}$ shows some computation instability effect. We put a further analysis of this effect in the appendix.

In summary, Table 3 shows several features related to the pair weight. First, pair weight causes substantial improvement in recall@1 accuracy. Second, in most cases, the linear and sigmoid pair weight outperforms the default Euclidean pair weight. Third, the linear version of the multi-similarity gradient direction is much more robust to different learning rates than the sigmoid version(see in appendix), and gives better performance and Recall@1 accuracy.

5.3 Triplet Weight

Table 4, we show two groups of experiments to compare the seven triplet weights. One group sets the pair-weight to be constant $P^{con}=1$ . Another group uses the linear pair weight $P^{lin}$ . All experiments use cosine gradient direction.

Comparing to the baseline method where triplet weight $T^{con}=0.5$ , $T^{cos}$ and $T^{cir}$ has minimal but slight boost in performance; $T^{sc1}$ and $T^{sc2}$ has bigger impact on CAR data set than In-shop dataset. This is due to the properties of these two datasets as shown in Figure 4. CAR dataset has low inter-class variance(images from different classes may look similar) while In-shop dataset has high inter-class variance(images from different classes look not similar). The major challenge of CAR dataset is to distinguish similar images with different labels, and this is the purpose of triplet operators $T^{sc1}$ and $T^{sc2}$ because they concentrate on separating triplets with hard negative in training. And In-shop dataset is to relatively easy to separate images with different labels, the goal is to continue separating the images better, which is the impact of $T^{cos}$ and $T^{cir}$

Therefore, we can conclude that the performance gain in Circle loss is largely from the pair weight not from the triplet weight. Selective Contrastive operator benefits the training tasks which need to separate triplets with hard negative and is not helpful for training tasks which easily separate triplets during the training.

5.4 Euclidean or Cosine Direction?

Dataset CUB(dim=64) CAR(dim=64) SOP(dim=512) In-shop(dim=512) Method R@1 R@2 R@4 R@1 R@2 R@4 R@1 R@10 R@100 R@1 R@10 R@20 LiftedStruct [17] 43.6 56.6 68.6 53.0 65.7 76.0 62.5 80.8 91.9 - - - ProxyNCA [10] 49.2 61.9 67.9 73.2 82.4 86.4 73.7 - - - - - SoftTriple [13] 60.1 71.9 81.2 78.6 86.6 91.8 78.3 90.3 95.9 - - - EasyPositive [23] 57.3 68.9 79.3 75.5 84.2 90.3 78.3 90.7 96.3 87.8 95.7 96.8 MS [19] 57.4 69.8 80.0 77.3 85.3 90.5 78.2 90.5 96.0 89.7 97.9 98.5 SCT [22] 57.7 69.8 79.6 73.4 82.0 88.0 81.9 92.6 96.8 90.9 97.5 98.2 DR-MS [9] 59.1 71.0 80.3 79.3 86.7 91.4 - - - 91.7 98.1 98.7 Proxy-anchor [6] 61.7 73.0 81.8 78.8 87.0 92.2 79.1 90.8 96.2 91.5 98.1 98.8 MS*(B128) 59.8 71.7 81.0 79.0 86.6 91.5 78.7 90.4 96.0 89.4 96.6 97.4 DR-MS*(B128) 60.7 71.9 81.3 79.9 87.0 91.7 78.8 90.4 96.1 89.6 96.4 97.4 Ours(B128) 63.5 74.8 83.6 82.5 89.1 93.3 79.9 90.5 95.5 91.4 97.7 98.4 Ours(B256) 63.8 74.8 83.7 85.5 91.0 94.6 82.0 92.3 96.8 92.2 97.8 98.4 Ours(B384) 63.8 75.0 84.2 86.5 91.6 94.8 82.2 92.5 96.8 92.0 97.8 98.3 Ours(B512) 63.1 74.6 83.2 85.7 91.2 94.7 82.3 92.5 96.9 90.8 97.2 97.9

Table 5: Retrieval Performance on the CUB, CAR, SOP and In-shop datasets comparing to the best reported results.

In Section 3.2 and Figure 2, the different gradient behaviors of $\mathbf{e^{euc}}$ and $\mathbf{e^{cos}}$ have been showed. But additional discussion will highlight the performance difference shown in Sections 5.1 and 5.2.

We first decompose the unit gradient to move positive and negative features into two directions: the direction along positive and negative feature $d_{\parallel}$ and the direction orthogonal to positive and negative feature $d_{\bot}$ . As shown in Figure 2, only the gradient component along $d_{\bot}$ effectively contributes to the angle change of anchor-positive and anchor-negative pair which directly affect the similarity score. The effective gradient projection strength for $\mathbf{e^{euc}}$ and $\mathbf{e^{cos}}$ :

\displaystyle\begin{cases}\sqrt{\frac{1+S}{2}}&\text{Euclidean direction}\\ \sqrt{1-S^{2}}&\text{cosine direction}\end{cases}

(20)

where $S$ is the similarity of a positive or negative pair. The derivation of the above projection length is shown in the appendix.

Figure 2 right shows the change of the effective gradient strength for $\mathbf{e^{euc}}$ and $\mathbf{e^{cos}}$ varying as a function of pair similarity. Because most pairs during the training have positive similarity, we focus on projection length when similarity is positive.

The Euclidean gradient has stronger force to pull positive close and push negative away than the cosine direction when two features are close to each other. Therefore, Euclidean gradient continues to force features together even when they are already relatively close, unlike the cosine gradient. In Figure 5 left column, we show the triplet diagram plot of triplets extracted from the last 5 epochs of training (epoch 55-60) on CAR dataset. The Euclidean direction clusters the same label feature more tightly than the cosine direction because there are more triplets along the right edge of the triplet diagram comparing to the scatter of cosine direction.

However, the tight clustering behavior in training leads to even the triplet with nearest positive and the nearest negative to be compact. In the middle and right column of Figure 5, we plot just these triplets for the whole training set (middle), and testing set (right). The Euclidean gradient has more triplets very close to the top right corner, indicating that point have very similar same class and different class neighbors, while cosine gradient creates triplets that are more spread out. The spread out effect indicates the feature learned by the deep model is distinguishable [26, 21, 23]. Because these Euclidean feature are more compressed (for both the anchor-positive and anchor-negative pairs), it is harder for the network to learn distinguishable features that if it is using the cosine gradient.

One more piece of evidence to support the analysis above is the pair weight result in section 5.2. When the Euclidean pair weight $P^{euc}$ is applied to Euclidean direction and cosine direction, the performance gap between these two methods is almost disappeared. This is because the Euclidean pair weight $P^{euc}$ reduces the weight de-emphasizes positive pair when they are already close, and therefore avoid its tight clustering behavior, making Euclidean direction behave similarly as cosine direction.

6 Best combination of gradients

In the previous chapter, we separately consider the gradient terms that relate to the gradient directions, the pair weights applied to the gradients from the anchor-negative and anchor-positive pairs, and the overall weight of the triplets. In terms of the gradient direction, the $\mathbf{e}^{cos-orth}$ gives the best performance and is relatively stable with respect to the learning rate. In terms of the pair-weighting, $P^{lin-ms}$ is consistently a top performer across datasets. Similarly, $T^{cir}$ shows stable improvement to both CAR and In-shop datasets. We combine these gradient components empirically to form the final gradient, and train a network by imposing this gradient combination. We compare the performance of the network trained this way with many latest state-of-the-art results.

To ensure a fair comparison, we also re-implement current related SOTA approaches, MS and DR-MS results (noted as MS* and DR-MS*) with our gradient method to create a comparison with the same network backbone, pre-processing and training settings. The implementation difference is shown in the appendix. The result is reported in Table 5. In addition, we vary the batch size 128, 256, 384, 512 on all tests for four datasets and continue to improve the Recall performance.

7 Limitations

We point out the following limitations of the paper:

•

We do not exhaustively compute all possible combination of all the three gradient components, and instead focus on the isolated effect of single gradient components. There may be additional improvements in explicitly considering the interactions between the different gradient components.
•

There are recent loss functions proposed in the deep metric learning literature such as Proxy loss [10] and N-pair loss [16]; and we currently are not able to put those loss function into our framework due to complex gradient computation for multiple negative pairs.
•

Our experiments do not fully explore training optimizations. We have fixed hyper-parameters in our sampling approach, we keep a constant step size, and we fix the hyper-parameters in gradient components such as $P^{sig-ms}$ for most experiments. Our results are based on hyper-parameter selections from earlier papers, but the gradient based approach to learning embedding functions may be improved with additional search over the hyper-parameter space.

8 Conclusion

We provide a new framework to train deep metric learning networks with direct gradient modification. In our framework, we disentangled gradient components of many loss functions into common components, and analyze the effects of each component. We find that the Euclidean gradient direction and the cosine gradient direction behave quite differently. In its default form, the Euclidean gradient creates embedding spaces that are very tightly clustered and the cosine gradient direction has a consistently big improvement over a large set of experimental conditions.

Second, recently popular works define new loss functions that, in terms of their gradient, primarily change the pair weight term, which is consistent with our findings that the pair-weight term is very important. In contrast, we find the triplet weight term to have limited impact that was not consistent across datasets.

Finally, this study of the importance of different weighting functions and components of the gradient led to a simple approach that directly defines the desired gradients and gives improvements to state-of-the-art performance relative to recent work.

References

[1] Jacob Goldberger, Geoffrey E Hinton, Sam T. Roweis, and Ruslan R Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513–520. MIT Press, 2005.
[2] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1735–1742. IEEE, 2006.
[3] Ben Harwood, BG Kumar, Gustavo Carneiro, Ian Reid, Tom Drummond, et al. Smart mining for deep metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2821–2829, 2017.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[5] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
[6] Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak. Proxy anchor loss for deep metric learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[7] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[8] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[9] Deen Dayal Mohan, Nishant Sankaran, Dennis Fedorishin, Srirangaraj Setlur, and Venu Govindaraju. Moving in the right direction: A regularization for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[10] Yair Movshovitz-Attias, Alexander Toshev, Thomas K. Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In Proc. International Conference on Computer Vision (ICCV), Oct 2017.
[11] Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A metric learning reality check. In European Conference on Computer Vision, pages 681–699. Springer, 2020.
[12] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
[13] Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, and Rong Jin. Softtriple loss: Deep metric learning without triplet sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[14] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[15] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[16] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
[17] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[18] Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. Circle loss: A unified perspective of pair similarity optimization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[19] Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5022–5030, 2019.
[20] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[21] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[22] Hong Xuan, Abby Stylianou, Xiaotong Liu, and Robert Pless. Hard negative examples are hard, but useful. In The European Conference on Computer Vision (ECCV), September 2020.
[23] Hong Xuan, Abby Stylianou, and Robert Pless. Improved embeddings with easy positive triplet mining. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020.
[24] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Deep metric learning for person re-identification. In 2014 22nd International Conference on Pattern Recognition, pages 34–39. IEEE, 2014.
[25] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782, 2020.
[26] Xu Zhang, Felix X. Yu, Sanjiv Kumar, and Shih-Fu Chang. Learning spread-out local feature descriptors. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[27] Xiao Zhang, Rui Zhao, Junjie Yan, Mengya Gao, Yu Qiao, Xiaogang Wang, and Hongsheng Li. P2sgrad: Refined gradients for optimizing deep face models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9906–9914, 2019.