Construct Informative Triplet with Two-stage Hard-sample Generation

Chuang Zhu¹¹1The first two authors contributed equally to this work. [email protected] Zheng Hu²²2The first two authors contributed equally to this work. Huihui Dong Gang He Zekuan Yu Shangshang Zhang School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China Center for Shanghai Intelligent Imaging for Critical Brain Diseases Engineering and Technology Research, Fudan University, Shanghai, China Berkeley AI Research Lab, University of California Berkeley, CA, 94720, USA

Abstract

In this paper, we propose a robust sample generation scheme to construct informative triplets. The proposed hard sample generation is a two-stage synthesis framework that produces hard samples through effective positive and negative sample generators in two stages, respectively. The first stage stretches the anchor-positive pairs with piecewise linear manipulation and enhances the quality of generated samples by skillfully designing a conditional generative adversarial network to lower the risk of mode collapse. The second stage utilizes an adaptive reverse metric constraint to generate the final hard samples. Extensive experiments on several benchmark datasets verify that our method achieves superior performance than the existing hard-sample generation algorithms. Besides, we also find that our proposed hard sample generation method combining the existing triplet mining strategies can further boost the deep metric learning performance.

keywords:

Deep metric learning, triplet , hard sample generation , adversarial network , mode collapse.

^†^†journal: Journal of LaTeX Templates

1 Introduction

Many applications, such as mobile augmented reality (MAR) and social life based on image websites, are becoming more and more popular with the increase of digital cameras and intelligent mobile devices [1]. Millions of images are produced every day from these types of digital equipment, and thus how to accurately search images from a large dataset poses a great challenge. To address this problem, the content-based image retrieval (CBIR) technique is proposed to search for images representing the same object or scene as the one depicted in a query image [2, 3].

The key for CBIR is how to build compact and robust image representations. The traditional hand-crafted features suffer large memory size consumption and will lower the searching efficiency [2, 4]. Many compact image representations [5, 6, 7, 8, 9] are thus proposed to address the above problem. Among these studies, the deep features can produce better performance than the traditional Fisher compact features (such as work [5] and [6]), and they have become the predominant stream of features used for CBIR. Generally, the Convolutional Neural Networks (CNN) layer activations are directly adopted as the off-the-shelf deep features [7, 8, 9]. However, these features are usually trained for image classification tasks. They should be further optimized by transferring the CNN model to the image retrieval task through the deep metric learning (DML) architecture [10, 11].

In deep metric learning, the matching/non-matching training pairs or triplets are first carefully constructed and then passed to the Siamese [12] or triplet architectures [13]. The off-the-shelf deep feature is thus fine-tuned by decreasing the distances of matching pairs and increasing the distances of non-matching pairs. How to construct the informative matching/non-matching pairs or triplets plays a vital role in the fine-tuning process. Typically, there are two different lines of recent research regarding this issue: hard example mining and hard example generation.

In the first category, the researchers applied hard example mining to build the pairs or triplets, which targets selecting samples that can provide informative supervision to make the model training more efficient [14, 15, 16, 17]. The hard mining strategy has been proved to be effective in accelerating the convergence of the network. However, this training strategy may lead to training a biased network [18] because only a few samples are selected for training, while most of the non-selected easy samples are considered invalid.

Refer to caption — Figure 1: Schematic diagram of our proposed hard sample generation. In Stage 1, anchor and positive embeddings $\{\mathbf{a},\mathbf{p}\}$ will be pushed away from each other; in Stage 2, the hard negative embedding will be generated with reversed metric loss, and the hard triplet is thus produced by adjusting the anchor-positive pair $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ . The generated hard triplet { $\hat{\mathbf{a}}$ , $\hat{\mathbf{p}}$ , $\hat{\mathbf{n}}$ } is thus applied to deep representation learning.

The second category tries to construct informative triplets through hard example generation. These studies focus on digging the potential information of easy samples by generating hard samples with one-stage adversarial model, instead of just mining existing samples [19, 20]. This kind of research utilizes both the original and the generated hard samples to construct the informative pairs or triplets, and then optimizes the embedding learning. However, the positive and negative samples play different roles in the deep metric learning process, which have different distributions and characteristics. It is difficult to exploit the full potential capability of all the positive and negative samples at the same time just through one adversarial model.

Approach and Contributions. In this work, we propose a novel hard sample generation for deep embedding learning. To exploit the full potential ability of all the simple samples, we propose a two-stage generator to perform the hard positive and negative generation, separately. Moreover, to avoid generating random vectors, the existing hard sample generation schemes adopt label preserving synthesis, which just requires the generated samples to have the same label with their original samples. However, the generator can only produce samples from a subset of modes to meet the above simple constraint. This will ruin the diversity of the generated hard samples and thus cause the mode collapse problem [21]. We propose to build more powerful conditional synthesis with stronger constraints to alleviate this problem. The main contributions of our work are summarized as follows:

1.

A two-stage adversarial learning architecture, as shown in Fig 1, is proposed to produce hard positives and hard negatives at different stages, respectively. In this architecture, the hard anchor-positive pair is first produced and then the hard negative generation is performed in the second stage. Through our two-stage generation, the potential ability of both the simple positives and negatives is mined.
2.

In the hard anchor-positive generation, we perform a stronger conditional synthesis scheme than the recent label-preserving method. Different from the previous works, we use a conditional generative adversarial network that introduces intermediate embeddings as discriminatory conditions to generate samples that can be as close as the original samples and lower the risk of mode collapse. Besides, a piecewise linear manipulation function is also proposed to properly control the hard level of the generated samples.
3.

In the hard negative generation, we propose an adaptive reverse triplet loss (ART-loss). We impose more stringent restrictions on reverse triplet loss with a larger threshold when the sample generation is getting better. With ART-loss we can gradually generate harder and harder negative samples.
4.

We experimentally demonstrate that our proposed two-stage hard sample generation can be directly combined with other triplet mining methods. Our method can generate informative hard samples, which can be used as a complement to previous hard mining strategies to further boost the DML model performance.

Outline. The paper is organized as follows. In Section 2, we review techniques that are related to our work. In Section 3, we show some preliminaries about deep metric learning. In Section 4, we build the whole deep embedding learning framework. After that, we present the proposed two-stage hard sample generation scheme in Section 5. Then, we report the experimental results in Section 6. Finally, the conclusions of this paper are summarized in Section 7.

2 Related Works

Hard Sample Mining. Because not all image pairs or triplets are equally informative, many works proposed schemes mining hard samples to train the deep metric models [14, 15, 22, 23, 24]. Hard sample mining is a technique for selecting informative sample pairs in deep metric learning. Considering that the number of triplets grows cubically with the size of training data, triplet selection is quite necessary for training with triplet loss. These triplets can be selected not only offline from the entire training set [23], but also from each batch of training using an online approach [14, 24, 25]. Hard negative pair mining [23] proposed the selection of the hardest positive and negative within a batch to construct triplets that produce useful gradients and therefore help triplet loss networks converge quickly. While it speeds up convergence, it also could lead to a bad local minimum and a biased model due to the hardest positive and negative may often be noise in data. In work [25], the authors proposed a more relaxed strategy, semi-hard negative pair mining, which chooses an anchor-negative pair that is farther than the anchor-positive pair, but within a margin. Based on this idea, soft-hard mining [24] restricting the sampling set using moderate negatives and positives is proposed to avoid too confusing samples that are highly likely to be noisy data. Given that the previous method only focuses on a small number of hard triplets and ignores a large number of simple samples, distance-weighted tuple mining [14] comprehensively considers samples in the whole spectrum of difficulties by introducing a sampling distribution over the range of distances between anchor and negatives.

Hard Sample Generation. The more recent researches seek approaches that can generate hard sample pairs or triplets to optimize the deep network, such as work [19], work [20], work [26] and [27].

In work [19], the authors proposed an adversarial learning algorithm to jointly optimize a hard triplet generator and an embedding network. The hard triplet generation is realized by using an inverse triplet loss function, and the generator is constrained by keeping label consistency to avoid random output. However, in metric learning, just the generated hard triplets are adopted and the original triplets are ignored. Besides, performing hard triplet generation only through one adversarial stage fails to dig the full potential information for both the simple positive and negative samples at the same time. Both work [20] and [26] focus on hard negative generation. Compared with work [20], the authors in [26] proposed an adaptive hardness-aware augmentation method to control the hard levels of the generated samples and thus achieved some improvement. The hard level here denotes the “hard degree” of the image matching pair or non-matching pair used in the deep metric learning: the hard level is high when the embedding distance is big for the matching pair or the embedding distance is small for the non-matching pair, and vice versa. However, these two studies just digging the potential information of easy negatives while ignoring the easy positives. All the above hard sample generation studies just use the simple label preserving technique to avoid the failure of generation. In work [27], the authors proposed a method of synthetic hard sample generation, which generates symmetrical synthetic embeddings with each other as an axis of symmetry and then selects the hardest negative pair within the original and synthetic embeddings. Although this scheme is computationally inexpensive, it can only synthesise samples at corresponding symmetric positions in the embedding space, and the diversity of the synthesised samples is very dependent on the number of original samples.

3 Preliminaries

In this paper, the boldface uppercase letter denotes a set (pair) of input images, embeddings, or labels; the boldface lowercase letter or the uppercase letter denotes a specific image, embedding, or label. The main notations used in this paper are listed in Table 1.

Notations	Descriptions
$\mathbf{I}$	the input images
$\mathbf{L}$ , $\hat{\mathbf{L}}$	the ground truth labels and predicted labels
${\mathbf{X}}$ , $\hat{\mathbf{X}}$	original samples and the generated hard samples
$\mathbf{a}$ , $\mathbf{p}$ , $\mathbf{n}$	a triplet embeddings of the anchor, positive and negative
$\hat{\mathbf{a}}$ , $\hat{\mathbf{p}}$ , $\hat{\mathbf{n}}$	generated hard triplet embeddings
$I_{a}$ , $I_{p}$ , $I_{n}$	a triplet input images
$l$ , $\hat{l}$	a ground truth label and a predicted label

Table 1: Main notations and their descriptions.

Deep metric learning aims to learn an embedding that can decrease the distance between matching pairs and increase the distance between non-matching pairs [3, 28]. To learn such an embedding, many works are proposed, such as the traditional contrastive loss [12, 29] under Siamese network and triplet loss [13, 25].

In this paper, we apply the triplet architecture to learn the embedding for deep metric learning. Triplet loss-based metric learning takes a triplet as the input, where each triplet contains an anchor ( ${{I}_{a}}$ ), a positive ( ${I}_{p}$ ), and a negative ( ${I}_{n}$ ) image. The triplet loss is written as:

	$\displaystyle\mathcal{L}_{\text{t}}$	$\displaystyle=\left[\left\lVert F({I}_{a})-F({I}_{p})\right\rVert^{2}_{2}-\left\lVert F({I}_{a})-F({I}_{n})\right\rVert^{2}_{2}+\tau\right]_{+}$
		$\displaystyle=\left[\left\lVert\mathbf{a}-\mathbf{p}\right\rVert^{2}_{2}-\left\lVert\mathbf{a}-\mathbf{n}\right\rVert^{2}_{2}+\tau\right]_{+}$		(1)

where $\{\mathbf{a},\mathbf{p},\mathbf{n}\}$ are the learned triplet embeddings corresponding to $\{I_{a},I_{p},I_{n}\}$ ; $F(\cdot)$ denote the learned deep network which maps the input image to the embedding space; the operator $\left[\cdot\right]_{+}$ is the max(0, $\cdot$ ) function, and $\tau>0$ is a distance margin that makes the anchor-positive pair closer than the anchor-negative pair. The triplet based metric learning is depicted in Fig. 2.

4 Deep Embedding Learning Framework with Hard-sample Generation

4.1 Framework

Our framework is shown in Fig. 3, which mainly consists of three parts: the CNN feature extractor (FE), the hard sample generation (HSG) model, and the embedding learning loss functions. Formally, we denote the feature extraction CNN by $F$ , which maps the input images $\mathbf{I}$ into embedding samples $\mathbf{X}$ . We then use the proposed HSG model to further produce the hard samples $\hat{\mathbf{X}}$ based on the original samples $\mathbf{X}$ . Note that the produced temporary hidden samples $\mathbf{X}^{\prime}$ in HSG are also applied in the loss computing process. The input images $\mathbf{I}$ , the original embedding samples $\mathbf{X}$ , the hidden samples $\mathbf{X}^{\prime}$ and the generated hard embedding samples $\hat{\mathbf{X}}$ share the same ground truth categories $\mathbf{L}=[l_{1},\cdots,l_{n}]$ , where $l_{i}\in{1,\cdots,C}$ . The goal is to train the deep embedding network $F$ with parameters $\theta_{m}$ . The general metric learning approaches train the parameters $\theta_{m}$ through optimizing the following objective function:

\theta_{m}^{*}=\mathop{\arg\min}_{\theta_{m}}J(\theta_{m};\mathbf{X},F)

(2)

We enhance the training procedure by utilizing the synthetic hard samples. Our final deep embedding learning metric is denoted as:

\theta_{m}^{*}=\mathop{\arg\min}_{\theta_{m}}J(\theta_{m};\mathbf{X},\mathbf{X}^{\prime},\hat{\mathbf{X}},\mathbf{L},F)

(3)

We rewrite (3) as:

\displaystyle J

\displaystyle=\mathcal{L}_{F}=w_{o}\mathcal{L}_{org}+w_{l}\mathcal{L}_{class}+w_{h}\mathcal{L}_{hard}

(4)

where $\mathcal{L}_{org}$ , $\mathcal{L}_{class}$ and $\mathcal{L}_{hard}$ are loss terms corresponding to the original deep distance metric loss for $\mathbf{X}$ , the category loss based on $\mathbf{L}$ and the deep distance metric loss for the generated hard sample $\hat{\mathbf{X}}$ ; $w_{o}$ , $w_{l}$ and $w_{h}$ are three weighting factors corresponding to the three terms. Although the embedding model $F$ should be optimized on the whole dataset, we discuss the loss computing in a triplet input images $\mathbf{I}=\{I_{a},I_{p},I_{n}\}$ for convenience. In the following, we will detail the objective function (4).

4.2 Objective Function

For the input images $\mathbf{I}=\{I_{a},I_{p},I_{n}\}$ , the corresponding triplet embedding samples $\mathbf{X}=\{\mathbf{a},\mathbf{p},\mathbf{n}\}$ will be produced by CNN feature extractor $F$ . The hard triplet embedding samples $\hat{\mathbf{X}}=\{\hat{\mathbf{a}},\hat{\mathbf{p}},\hat{\mathbf{n}}\}$ are then further generated by passing $\mathbf{X}$ into HSG model. Besides, the temporary hidden embedding samples ${\mathbf{X}^{\prime}}=\{{\mathbf{a}^{\prime}},{\mathbf{p}^{\prime}}\}$ are also obtained in HSG. The deep embedding network $F$ and the HSG model are simultaneously trained in an end-to-end manner. However, in the early stages of training, the deep model focuses on memorizing the simple training data [30], and thus we can decrease the weight for the generated hard samples. Besides, in the early stage, the embedding space does not have an accurate semantic structure [26], and so the quality of the generated hard embedding samples is very low and they can not provide valid information. Thus the use of the hard generated samples will ruin the learning and damage the embedding space structure. To solve this problem, we designed a weighting parameter adaptively adjusting method by using the training loss $\mathcal{L}_{G}$ of the HSG model. The embedding training objective function is represented as:

$\displaystyle\mathcal{L}_{F}$	$\displaystyle=w_{o}\mathcal{L}_{org}+w_{l}\mathcal{L}_{class}+w_{h}\mathcal{L}_{hard}$
	$\displaystyle=e^{-\frac{\beta}{\mathcal{L}_{G}}}\mathcal{L}_{org}+\phi\mathcal{L}_{class}+(1-e^{-\frac{\beta}{\mathcal{L}_{G}}})\mathcal{L}_{hard}$
	$\displaystyle=e^{-\frac{\beta}{\mathcal{L}_{G}}}\mathcal{L}_{t}(\mathbf{X})+\phi\mathcal{L}_{sm}(\mathbf{X},\mathbf{X^{\prime}},\hat{\mathbf{X}},\mathbf{L})+(1-e^{-\frac{\beta}{\mathcal{L}_{G}}})\mathcal{L}_{t}(\hat{\mathbf{X}})$	(5)

where $\beta$ and $\phi$ are pre-defined parameters; $\mathcal{L}_{t}$ and $\mathcal{L}_{sm}$ are the triplet metric loss and the softmax loss, respectively; $e^{-\frac{\beta}{\mathcal{L}_{G}}}$ , $\phi$ , and $(1-e^{-\frac{\beta}{\mathcal{L}_{G}}})$ are three balancing parameters for the three terms: the original metric loss $\mathcal{L}_{org}$ , the category loss $\mathcal{L}_{label}$ and the generated hard metric loss $\mathcal{L}_{hard}$ . In the early stage of training, ${\mathcal{L}_{G}}$ is very large and thus the weight $e^{-\frac{\beta}{\mathcal{L}_{G}}}$ for $\mathcal{L}_{org}$ is larger than the weight $(1-e^{-\frac{\beta}{\mathcal{L}_{G}}})$ for $\mathcal{L}_{hard}$ . As the training proceeds, ${\mathcal{L}_{G}}$ is decreased and thus high quality hard samples will be produced. The hard synthetics can provide information for training, and higher weight $(1-e^{-\frac{\beta}{\mathcal{L}_{G}}})$ is set to highlight the generated hard samples. Note that ${\mathcal{L}_{G}}$ is set as ${\mathcal{L}_{G_{2}}}$ in the HSG model, which will be discussed in Section 5. In (5), $\mathcal{L}_{t}(\mathbf{X})$ and $\mathcal{L}_{t}(\hat{\mathbf{X}})$ are written as (6) and (7).

\displaystyle\mathcal{L}_{t}(\mathbf{X})

\displaystyle=\left[\left\lVert\mathbf{a}-\mathbf{p}\right\rVert^{2}_{2}-\left\lVert\mathbf{a}-\mathbf{n}\right\rVert^{2}_{2}+\tau\right]_{+}

(6)

\displaystyle\mathcal{L}_{t}(\hat{\mathbf{X}})

\displaystyle=\left[\left\lVert\hat{\mathbf{a}}-\hat{\mathbf{p}}\right\rVert^{2}_{2}-\left\lVert\hat{\mathbf{a}}-\hat{\mathbf{n}}\right\rVert^{2}_{2}+\tau\right]_{+}

(7)

The category loss is performed based on the softmax loss,

	$\displaystyle\mathcal{L}_{sm}(\mathbf{X},\mathbf{X^{\prime}},\hat{\mathbf{X}},\mathbf{L})=-{\sum_{\mathbf{x}}^{\{\mathbf{a},\mathbf{p},\mathbf{n}\}}}l(\mathbf{x})\text{log}{\hat{l}(\mathbf{x})}$
	$\displaystyle-{\sum_{\mathbf{x}}^{\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}}}l(\mathbf{x})\text{log}{\hat{l}(\mathbf{x})}-{\sum_{\mathbf{x}}^{\{\hat{\mathbf{a}},\hat{\mathbf{p}},\hat{\mathbf{n}}\}}}l(\mathbf{x})\text{log}{\hat{l}(\mathbf{x})}$		(8)

where $l(\mathbf{x})$ is the one-hot encoding vector of the correct class for sample $\mathbf{x}$ ; $\hat{l}(\mathbf{x})$ denotes the predicted probability vector.

In (5), the only unknown elements are the generated samples $\hat{\mathbf{X}}$ and ${\mathbf{X}^{\prime}}$ . In the following section, we will detail the hard sample generation.

5 Two-stage Hard Sample Generation

5.1 Architecture

The detailed two-stage hard sample generation architecture is denoted in Fig. 4. In the first stage, the hard level of the input anchor-positive pair $\{\mathbf{a},\mathbf{p}\}$ will first be increased by directly manipulating the distances between them. To achieve this, the piecewise linear manipulation (PLM) is designed. Then the generated hard anchor-positive pair $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ will be processed by generator $G_{1}$ and thus the hard anchor-positive pair $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ is obtained. However, just conducting label-preserving synthesis is not enough and this may result in mode collapse, as shown in Fig. 5(a). In this case, the generator only produces samples from part modes of the distribution and ignores the other modes.

To alleviate the mode collapse issue and guarantee the diversity of the generated samples, we encourage the one-to-one relationship between the original sample and the generated sample by adding an extra conditional constraint to the generation process. The discriminator $D_{G1}$ needs to have the ability to identify the results of combining the intermediate embeddings with the original and generated embeddings respectively. In the second stage, the generated anchor-positive pair $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ and original negative sample $\mathbf{n}$ are passed into a generator $G_{2}$ to produce the final hard triplet $\{\hat{\mathbf{a}},\hat{\mathbf{p}},\hat{\mathbf{n}}\}$ . To support the adversarial learning, the discriminator $D_{G2}$ is integrated into Stage 2.

In our architecture, the PLM, the conditional synthesis (in Stage 1) and the hard triplet generation (in Stage 2) are three key parts that will be discussed in the following.

5.2 Piecewise Linear Manipulation

In order to make anchor $\mathbf{a}$ and positive $\mathbf{p}$ more difficult, we need to increase the distance between them. Here, we first linearly stretch the embeddings of anchor and positive samples, as shown in Stage 1 of Fig. 1, thus generating a harder anchor and positive pair $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ , as depicted in Fig. 4. We propose the piecewise linear manipulation (PLM) scheme as shown in (9),

\begin{split}\mathbf{a}^{*}=\mathbf{a}+\lambda(\mathbf{a}-\mathbf{p})\\ \mathbf{p}^{*}=\mathbf{p}+\lambda(\mathbf{p}-\mathbf{a})\end{split}

(9)

where $\lambda$ is a picecwise variable which controls the hard level of the generated anchor-positive pair $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ . The picecwise variable $\lambda$ plays the key role in our PLM scheme and we will discuss it in the following.

During the stretching process, if the value of $\lambda$ is too large, the anchor-positive samples may be stretched into different classes. Even if the distance is increased, the stretched samples will no longer be anchor-positive pair and thus become meaningless. In order to avoid this issue, we need to limit the scope and size of the $\lambda$ . Thus in our PLM, we set the piecewise variable $\lambda$ as:

\lambda=\begin{cases}\frac{\alpha}{e^{d(\mathbf{a},\mathbf{p})-d_{t}}},&{if}\ \ \ d(\mathbf{a},\mathbf{p})\geq d_{t}\\ \alpha+\gamma(1-\frac{d(\mathbf{a},\mathbf{p})}{d_{t}}),&if\ \ \ d(\mathbf{a},\mathbf{p})<d_{t}\end{cases}

(10)

where $d(\mathbf{a},\mathbf{p})$ denotes the distance between the anchor and positive samples; $d_{t}$ is an adaptive threshold. When $d(\mathbf{a},\mathbf{p})$ is already large enough to be greater than $d_{t}$ , we use a exponent function ${\alpha}/{e^{d(\mathbf{a},\mathbf{p})-d_{t}}}$ . At this situation, the maximum value of $\lambda$ is $\alpha$ , and the minimum value is about 0. When the distance between them is less than $d_{t}$ , we use a linear function $\alpha+\gamma(1-\frac{d(\mathbf{a},\mathbf{p})}{d_{t}})$ . At this situation, the maximum value of $\lambda$ is $\alpha+\gamma$ and the minimum value is $\alpha$ . In the calculation of $\lambda$ , $d_{t}$ is related to the distance distribution of samples in different datasets, so it is difficult to manually adjust $d_{t}$ in various datasets. In order to automatically select the best hyperparameters $d_{t}$ in each dataset, we proposed an adaptive method by choosing $d_{t}$ as:

d_{t}=\frac{1}{M}\sum_{i=1}^{M}{d_{epoch-1}(\mathbf{a}_{i},\mathbf{p}_{i})}

(11)

where $M$ is the number of anchor-positive pairs, and $d_{t}$ represents the average distance of anchor-positive pairs in the previous epoch.

After piecewise linearly manipulating the input anchor-positive pair $\{\mathbf{a},\mathbf{p}\}$ , we obtain hard anchor-positive pair $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ . However, there is no guarantee that $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ share the same label with $\{\mathbf{a},\mathbf{p}\}$ . Next, we will conduct conditional synthesis based on the generated $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ to produce valid hard anchor-positive pair $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ .

5.3 Conditional Synthesis

After piecewise linearly manipulating anchor-positive pairs, in order to ensure the generated pairs are still in the same category domain as the original samples, the recent method requires label consistency to avoid generating invalid embeddings [20]. However, the simple constraint may result in a mode collapse problem, which means the generated samples suffering from insufficient diversity, as shown in Fig 5. Thus except for the label-preserving requirement, we try to enhance the relationship between the original sample and the generated sample by adding an extra restriction. Specifically, we require that the generated samples $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ based on $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ have the same distribution as the original samples $\{\mathbf{a},\mathbf{p}\}$ . Unlike an unconditional generative adversarial network, both the generator $G_{1}$ and discriminator $D_{G1}$ consider the information of the linear-manipulated anchor-positive pair $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ . To summarize, we encourage the one-to-one relationship between the original sample and the generated sample for our conditional synthesis, and thus the diversity is ensured.

When $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ passed $G_{1}$ , they are mapped to $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ . On the one hand, we synthesize anchor-positive pairs that are further apart based on $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ ; On the other hand, no matter where the input $x^{*}$ of $G_{1}$ lies in its category space, the output of $G_{1}$ is located in the same category space as the original one by the constraint of label consistency. Our conditional synthesis is achieved by training $G_{1}$ with the following loss function:

\displaystyle\mathcal{L}_{G1}

\displaystyle=\eta(\mathcal{L}_{class1}+\mathcal{L}_{D_{G1},adv})+{(1-2\eta)}\mathcal{L}_{rec1}

(12)

where $\eta$ is the balance factor between softmax and adversarial loss; the first term makes sure the generated embeddings have the same labels with the original ones; the second term is adversarial loss with discriminator $D_{G1}$ which aims to let the generated embeddings look real, not the fake ones; the third term guarantees the generated embeddings can maintain the same hard level as the input embedings. In (12), the first term is denoted as:

\displaystyle\mathcal{L}_{class1}

\displaystyle={\sum_{\mathbf{x}^{\prime}}}\mathcal{L}_{sm}\!(C_{F}(\mathbf{x}^{\prime}),\!l_{\mathbf{x}^{\prime}})

(13)

where $\mathbf{x}^{\prime}$ denotes embeddings from $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ and $l_{\mathbf{x}^{\prime}}$ is the ground truth label of $\mathbf{x}^{\prime}$ ; the $\mathcal{L}_{sm}$ denotes the softmax loss; $C_{F}$ is a shared fully connected layer which is used to classify the generated pair. In (12), the second term is denoted as:

\displaystyle\mathcal{L}_{D_{G1},adv}

\displaystyle={\sum_{\mathbf{{x}^{\prime}_{c}}=[\mathbf{x}^{\prime},\mathbf{x}^{*}]}}\mathcal{L}_{sm}(D_{G1}(\mathbf{x}^{\prime}_{c}),1)

(14)

where $\mathbf{x}$ denotes embeddings from $\{\mathbf{a},\mathbf{p}\}$ , $\mathbf{x}^{*}$ denotes embeddings from $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ and ”1” represents the real samples. The discriminator $D_{G1}$ takes $\mathbf{x}^{\prime}_{c}$ the result of concatenating $\mathbf{x}^{\prime}$ and $\mathbf{x}^{*}$ along the last dimension as the input. This adversarial loss is used to fool the discriminator $D_{G1}$ by generating real-like samples; the third term in (12) is reconstruction loss $\mathcal{L}_{rec1}$ between $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ and $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ , which is calculated as:

\displaystyle\mathcal{L}_{rec1}={d(\mathbf{a^{*}},\mathbf{a}^{\prime})+d(\mathbf{p^{*}},\mathbf{p}^{\prime})}=\left\lVert\mathbf{a^{*}}-\mathbf{{a}^{\prime}}\right\rVert^{2}_{2}+\left\lVert\mathbf{p^{*}}-\mathbf{{p}^{\prime}}\right\rVert^{2}_{2}.

(15)

With the reconstruction loss $\mathcal{L}_{rec1}$ , the relative hardness of anchor-positive synthetic pairs in triplets will be guaranteed. In this way, our generative loss $\mathcal{L}_{G1}$ enables generated anchor-positive pairs to have a greater separation distance while remaining within their original category space.

The discriminator $D_{G1}$ is a two-category classifier for identifying whether a given embedding is a real one or a generated one. The training loss function for $D_{G1}$ is formulated as:

\displaystyle\mathcal{L}_{D_{G1}}=\frac{1}{2}\{{\sum_{\mathbf{x}_{c}=[\mathbf{x},\mathbf{x}^{*}]}}\mathcal{L}_{sm}(D_{G1}(\mathbf{x}_{c}),1)+{\sum_{\mathbf{{x}^{\prime}_{c}}=[\mathbf{x}^{\prime},\mathbf{x}^{*}]}}\mathcal{L}_{sm}(D_{G1}(\mathbf{{x}^{\prime}_{c}}),0)\}

(16)

where $\mathbf{x}$ , $\mathbf{x}^{*}$ and $\mathbf{x}^{\prime}$ denote embeddings from $\{\mathbf{a},\mathbf{p}\}$ , $\{\mathbf{a}^{*},\mathbf{p}^{*}\}$ and $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ , respectively. $\mathbf{x}_{c}$ and $\mathbf{{x}^{\prime}_{c}}$ , the input embedding of the discriminator $D_{G1}$ , are concatenated of $\mathbf{x}^{*}$ and corresponding embedding in the last dimension of the same labeled instance. ”1” and ”0” denote the real sample and the fake sample respectively.

After the adversarial training of $G_{1}$ , our hard anchor-positive pairs have met the requirements of label consistency. At the same time, conditional discriminator $D_{G1}$ ensures that the generated pairs do not have mode collapse with random generation. Next, we will use the hard anchor-positive pair $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ to generate the hard negative, and produce the final hard samples $\{\hat{\mathbf{a}},\hat{\mathbf{p}},\hat{\mathbf{n}}\}$ .

5.4 Hard Triplet Synthesis

We propose a new hard triplet generator $G_{2}$ to generate hard samples. Generator $G_{2}$ is able to map its input $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ and $\mathbf{n}$ to $\{\hat{\mathbf{a}},\hat{\mathbf{p}},\hat{\mathbf{n}}\}$ . While generating a negative sample, the hard negative sample is not only related to the information of negative sample itself, but also to the anchor-positive pair in the triplet. Therefore the inputs of $G_{2}$ are negative sample $\mathbf{n}$ and hard anchor-positive pair $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ generated in Stage 1. For our hard triplet synthesis, we propose the loss function to train $G_{2}$ as follows:

\displaystyle\mathcal{L}_{G2}=\mu\mathcal{L}_{ART}\!+\!(1\!\!-\!2\eta\!-\!\!\mu)\mathcal{L}_{rec2}\!+\!\eta(\mathcal{L}_{class2}\!+\mathcal{L}_{DG2,adv}\!\!)

(17)

where $\eta$ and $\mu$ are the balancing factors. The overall objective loss function of $G_{2}$ is composed of four parts: the adaptive reverse triplet loss (ART-loss) $\mathcal{L}_{ART}$ , reconstruction loss $\mathcal{L}_{rec2}$ , softmax loss $\!\mathcal{L}_{class2}\!$ and adversarial loss $\mathcal{L}_{DG2,adv}\!$ . The ART-loss is used to generate the hard negative; the reconstruction loss is applied to keep hard anchor-positive pair from being affected by reverse triplet loss; the softmax loss is used to ensure label consistency for the generated samples; adversarial loss with discriminator $D_{G2}$ is performed to generate samples like the real ones. These four terms are detailed in the following.

The proposed ART-loss $\mathcal{L}_{ART}$ is defined as follows:

\displaystyle\mathcal{L}_{ART}=\mathcal{L}_{\text{r}}(\mathbf{a}^{\prime},\mathbf{p}^{\prime},\mathbf{n})=\left[\left\|G_{2}(\mathbf{a^{\prime}})\!\!-\!\!G_{2}(\mathbf{n})\right\|^{2}_{2}\!\!-\left\|G_{2}(\mathbf{a}^{\prime})\!\!-\!\!G_{2}(\mathbf{p}^{\prime})\right\|^{2}_{2}+\tau_{r}\right]_{+}

(18)

The proposed ART-loss is obtained from the triplet loss by reversing the sign in front of the first two terms in (1). The purpose of ART-loss is to make the distance between $\mathbf{a}^{\prime}$ and $\mathbf{n}$ as small as possible until it is less than the distance between $\mathbf{a}^{\prime}$ and $\mathbf{p}^{\prime}$ with a parameter $\tau_{r}$ . When $\mathbf{n}$ satisfies this condition, it will be a hard negative sample. In this paper, we propose to use an adaptive parameter $\tau_{r}$ which increases with the training of the network. We impose more stringent restrictions on reverse triplet loss with larger $\tau_{r}$ when the performance of $G_{2}$ is getting better. Thus, we can gradually generate harder and harder negative samples. We set $\tau_{r}$ as a parameter that changes with $\mathcal{L}_{G2}$ as:

\tau_{r}=\nu(1-\exp^{-\frac{\beta}{\mathcal{L}_{G2}}})

(19)

where $\nu$ and $\beta$ are constant parameters. As $G_{2}$ trains better, the $\mathcal{L}_{G2}$ will get smaller, and $\tau_{r}$ will increase adaptively to enhance the difficulty of hard negative.

The reconstruction loss $\mathcal{L}_{rec2}$ aims to keep the synthetic anchor-positive pair with a consistent hard-level by decreasing the distances between $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ and $\{\hat{\mathbf{a}},\hat{\mathbf{p}}\}$ , which is denoted as:

\mathcal{L}_{rec2}={d(\mathbf{a}^{\prime},\hat{\mathbf{a}})+d(\mathbf{p}^{\prime},\hat{\mathbf{p}})}\\ =\left\lVert\mathbf{a}^{\prime}-\hat{\mathbf{a}}\right\rVert^{2}_{2}+\left\lVert\mathbf{p}^{\prime}-\hat{\mathbf{p}}\right\rVert^{2}_{2}

(20)

The softmax loss $\!\mathcal{L}_{class2}\!$ makes the generated embeddings have the same labels with the original samples, which is calculated as:

\displaystyle\mathcal{L}_{class2}

\displaystyle={\sum_{\hat{\mathbf{x}}}}\mathcal{L}_{sm}\!(C_{F}(\hat{\mathbf{x}}),\!l_{\hat{\mathbf{x}}})

(21)

where $\hat{\mathbf{x}}$ denotes embeddings from $\{\hat{\mathbf{a}},\hat{\mathbf{p}},\hat{\mathbf{n}}\}$ , and $l_{\hat{\mathbf{x}}}$ denotes the ground truth labels of $\hat{\mathbf{x}}$ .

The adversarial loss with discriminator $D_{G2}$ is

\displaystyle\mathcal{L}_{D_{G2},adv}

\displaystyle={\sum_{\hat{\mathbf{x}}}}\mathcal{L}_{sm}\!(D_{G2}(\hat{\mathbf{x}}),\!l_{\hat{\mathbf{x}}})

(22)

Different with the previous adversarial loss, such as (14), we use the ground truth label $l_{\hat{\mathbf{x}}}$ to represent the “real” samples. Thus the adversarial loss (22) is used to fool the discriminator by classifying these generated embeddings as the real samples with ground truth $l_{\hat{\mathbf{x}}}$ . Actually, our adopted discriminator $D_{G2}$ is a $(C+1)$ -class classifier where $C$ denotes the number of ground truth categories for the real samples and the left one class represents the generated samples. When an embedding is given, $D_{G2}$ can discriminate it into the $C+1$ categories [31] and thus the real-like sample generation and label consistency are achieved at the same time. The training loss for $D_{G2}$ is denoted as:

\mathcal{L}_{D_{G2}}=\frac{1}{C+1}\{{\sum_{\mathbf{x}^{\prime}}}\mathcal{L}_{sm}(D_{G2}(\mathbf{x}^{\prime}),l_{{\mathbf{x}}^{\prime}})+{\sum_{\hat{\mathbf{x}}}}\mathcal{L}_{sm}(D_{G2}(\hat{\mathbf{x}}),0)\}

(23)

where $\mathbf{x}^{\prime}$ denotes embeddings from $\{\mathbf{a}^{\prime},\mathbf{p}^{\prime}\}$ or $\mathbf{n}$ , $l_{{\mathbf{x}}^{\prime}}$ denotes the ground truth labels of ${\mathbf{x}}^{\prime}$ , and $\hat{\mathbf{x}}$ denotes embeddings from $\{\hat{\mathbf{a}},\hat{\mathbf{p}},\hat{\mathbf{n}}\}$ .

6 Experiment

In this section, extensive experiments are performed to demonstrate the superior performance of our proposed hard sample generation. Besides, an ablation study is further developed to show the contributions of each component of our proposed algorithm.

6.1 Dataset

We evaluated our method under the three datasets below. In order to better compare with other methods, we split the datasets according to [20][26][32][33], and the training set and test set contain image classes with no intersection.

1.

The CUB-200-2011 dataset [34] consists of 200 bird species with 11,788 images. We use the first 100 species (5,864 images) for training and the rest 100 species (5,924 images) for testing.
2.

The Cars196 dataset [35] consists of 196 car makes and models with 16,185 images. We use the first 98 models (8,054 images) for training and the rest 100 models (8,131 images) for testing.
3.

The Stanford Online Products dataset (SOP) [32] consists of 22,634 online products from eBay.com with 120,053 images. We use the first 11,318 products (59,551 images) for training and the rest 11,316 products (60,502 images) for testing.

6.2 Implementation

Configuration. The proposed method is implemented with Python3.7 and PyTorch 1.6.0, using the GoogleNet [36] and ResNet50 [37] architectures, pre-trained on ImageNet ILSVRC dataset [38]. All experiments are performed on individual NVIDIA Tesla T4 GPUs. We trained networks with standard back-propagation, which is performed by Adam optimizer with 4e-4 weight decay. We set the initial learning rate 1e-5 for the feature extractor, 1e-3 for the classifier, 1e-4 for the discriminators, and 1e-3 for the generators. Each generator is composed of two fully connected layers with dimensions 128 and 512. Note that all output embedding vectors of the feature extractor and generators are under normalization. Each training is run over 150 epochs for CUB200-2011/CARS196 and 100 epochs for Stanford Online Products. For a fair comparison with previous methods on deep metric learning, we used GoogLeNet [36] architecture as the CNN feature extractor with output embedding size of 512 for all the three datasets similar to [26] and [20]. While training, we set the batch size to 128, and all the training images are resized to $227\times 227$ . During the training process, we will update all discriminators, generators, and the embedding extraction network according to Algorithm 1. We set triplet margin $\tau=0.2$ and set the the maximum value of reverse triple margin $\nu=0.2$ as well. The other related parameters of this work are summarized in Table 2.

	$\alpha$	$\gamma$	$\eta$	$\beta$	$\mu$	$\phi$
Cub200	0.2	0.8	0.3	0.5	0.3	0.5
Cars196	0.2	0.8	0.3	0.5	0.3	0.25
SOP	0.2	0.8	0.3	0.5	0.3	0.2

Table 2: The related parameter settings for different datasets.

Training Flow. We alternately train $F$ and the other models, and mini-batch SGD is applied in the training process. Specifically, we first initialize all the models and then pre-train model $F$ . After that, we will iteratively train all the models for $epochs$ times. In each epoch, these models are updated for $batches$ times. For a specific batch, embeddings $\mathbf{X}$ are first produced by $F$ , and then $\mathbf{X^{*}}$ and $\mathbf{X^{\prime}}$ can be obtained by using PLM and $G_{1}$ . Then model ${D_{G_{1}}}$ can be updated with (16). After that, the generator ${{G_{1}}}$ are updated with (12) accordingly. However, for the training of $G_{2}$ and $D_{G_{2}}$ , we will use the samples produced by the latest generative networks. For this reason, we will re-extract embeddings $\mathbf{X^{\prime}}$ after the model updating of Stage 1 for the training of $G_{2}$ and $D_{G_{2}}$ in Stage 2. Then we update them with (23) and (17) accordingly. Finally, we update model $F$ with the re-extracted $\hat{\mathbf{X}}$ , $\mathbf{X^{\prime}}$ and the original $\mathbf{X}$ by using (5). With the above training flow, the final embedding model $F$ and the other models related to hard sample generation are obtained at the same time. When our network training flow is over, there is no additional computational effort involved in the final feature extraction for image retrieval. We summarize the whole embedding training flow as Algorithm 1.

Algorithm 1 embedding training flow

Input: model $F$ , $G_{1}$ , $D_{G_{1}}$ , $G_{2}$ , $D_{G_{2}}$ , images $\mathbf{I}$ , labels $\mathbf{L}$ ,
Parameter: $epochs$ ,
Output: trained $F$

1: init model

F

G_{1}

D_{G_{1}}

G_{2}

D_{G_{2}}

2: pre-train model

F

3: for

i=1

epochs

4: for

j=1

batches

5: extract embeddings

\mathbf{X}=F(\mathbf{I}_{\text{batch}_{j}})

\mathbf{X^{*}}

and

\mathbf{X^{\prime}}

6: update

{{G_{1}}}

and

{D_{G_{1}}}

with (12) and (16) respectively,

7: re-extract embeddings

\mathbf{X}^{\prime}

8: extract embeddings

\hat{\mathbf{X}}

9: update

{{G_{2}}}

and

{D_{G_{2}}}

with (23) and (17) respectively,

10: re-extract embeddings

\hat{\mathbf{X}}

11: update model

{{F}}

with (5),

12: end for

13: end for

14: return model

F

6.3 Evaluation

We designed experiments to prove the effectiveness of our hard-sample generation.

Evaluation Criteria. We report the performance for both retrieval and clustering tasks. The mAP is widely used in state-of-the-art image retrieval works and it reveals the position-sensitive ranking precision. The definition of mAP is denoted as:

\text{mAP}=\frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{r=1}^{M^{i}_{\text{relevant}}}P(r)}{M^{i}_{\text{relevant}}}

(24)

where $i$ is the $i_{th}$ query index and $N$ is the number of total queries [3]. $M^{i}_{\text{relevant}}$ is the number of relevant images corresponding to the $i_{th}$ query, ${r}$ is the rank and $P(r)$ is the precision at the cut-off rank of $r$ . The employed Recall@K metric is determined by the existence of at least one correct retrieved sample in the K nearest neighbors [26].

For the clustering task, we employed NMI and $F_{1}$ as performance metrics. The normalized mutual information (NMI) [32] is defined by the ratio of the mutual information of clusters and ground truth labels and the arithmetic mean of their entropy: $\operatorname{NMI}(\Omega,\mathbb{C})=\frac{I(\Omega;\mathbb{C})}{H(\Omega)+H(\mathbb{C})}$ , where $\Omega=\left\{\omega_{1},\ldots,\omega_{n}\right\}$ is a set of clusters and $\mathbb{C}=\left\{c_{1},\ldots,c_{n}\right\}$ is the ground truth classes. Here $I(\cdot,\cdot)$ and $H(\cdot)$ denotes mutual information and entropy respectively. $F_{1}$ score is deined as the harmonic mean of precision and recall [32].

Qualitative Analysis. We visualize our image retrieval performance in several selected cases and visually show the advantage of our algorithm. Besides, we depict the learned embeddings for different datasets and qualitatively verify the effectiveness of the proposed scheme.

6.4 Objective Comparison

Method	R@1	R@2	R@4	R@8	NMI	F1	mAP
Triplet[26]	35.9	47.7	59.1	70.0	49.8	15.0	19.2
MAC	40.3	51.9	63.6	74.1	-	-	20.1
R-MAC	41.7	53.7	66.3	76.3	-	-	20.8
GeM	42.2	54.3	67.1	77.4	-	-	21.4
DAML[20]	37.6	49.3	61.3	74.4	51.3	17.6	21.7
HDML[26]	43.6	55.8	67.7	78.3	55.1	21.9	22.8
SS[27]	51.4	63.0	74.4	84.1	59.6	26.2	-
OURS¹²⁸	55.3	66.6	77.3	86.0	62.6	31.5	26.8
OURS	57.0	68.8	79.2	86.7	64.1	33.0	27.9

Table 3: Experimental results on the CUB-200-2011 dataset in comparison with other methods(%).

Method	R@1	R@2	R@4	R@8	NMI	F1	mAP
Triplet[26]	45.1	57.4	69.7	79.2	52.9	17.9	20.8
MAC	56.5	69.3	79.1	87.2	-	-	17.0
R-MAC	60.0	71.8	81.1	88.2	-	-	20.5
GeM	60.4	71.9	81.2	88.1	-	-	21.5
DAML[20]	60.6	72.5	82.5	89.9	56.5	22.9	20.7
HDML[26]	61.0	72.6	80.7	88.5	59.4	27.2	22.5
SS[27]	69.7	78.7	86.1	91.4	62.4	31.8	-
OURS¹²⁸	79.3	86.9	92.2	95.7	65.5	35.4	29.6
OURS	82.1	88.7	93.1	96.0	66.3	35.1	31.1

Table 4: Experimental results on the Cars196 dataset in comparison with other methods(%).

Method	R@1	R@10	R@100	NMI	F1	mAP
Triplet[26]	53.9	72.1	85.7	20.2	15.0	29.7
MAC	54.2	72.5	85.9	-	-	30.5
R-MAC	55.4	73.1	86.1	-	-	31.3
GeM	56.7	74.3	86.7	-	-	32.8
DAML[20]	58.1	75.0	88.0	87.1	22.3	34.9
HDML[26]	58.5	75.5	88.3	87.2	22.5	35.7
SS[27]	65.7	81.4	91.7	88.9	30.6	-
OURS¹²⁸	66.1	80.8	90.8	87.6	25.4	33.0
OURS	67.7	82.2	91.6	87.7	26.1	34.4

Table 5: Experimental results on the SOP dataset in comparison with other methods(%).

We conduct full comparisons between our method and the other image retrieval algorithms. The involved strategies include deep pooling features and the recent deep embeddings with hard generation. For a fair comparison, all the involved algorithms are implemented with triplet loss architecture.

Typical Image Retrieval Methods. To comprehensively demonstrate the superiority of our proposed method over existing methods in the retrieval task, we compared our scheme with several typical baseline image retrieval methods built based on deep pooling features, including the average-pooled convolutional features under triplet architecture (Triplet), maximum activations of convolutions (MAC) [7], regional MAC (R-MAC) and Generalized-Mean (GeM) pooling [17]. For a fair comparison, we evaluated all the methods mentioned above using the same CNN model and trained the deep features with triplet loss.

Deep Embeddings with Hard Generation. Our proposed two-stage hard sample generation method is a novel generation method that enables the network to synthesize informative triplet more efficiently, leading to boosts in performance. The state-of-the-art deep embeddings with hard generation are included in this work for comparisons: the hard-negative generation method DAML [20], HDML [26] and symmetrical synthesis (SS) [27].

Comparison Analysis. Table 3, 4, and 5 show the quantitative results of our method and the comparison methods on the CUB-200-2011, Cars196, and SOP datasets. The feature dimension is 512 by default in the three tables above, except for special cases where the dimension is marked with superscript. Bold numbers indicate the best results and blue color represents the second-best performance. As we can see from the tables, our method achieves the best Recall@K and mAP on the CUB-200-2011, Cars196 datasets (Table 3 and Table 4), while obtaining at least the second-best performance on the Stanford dataset (Table 5). It should be noted that the image retrieval accuracy gain (Table 5) introduced by our method on the Stanford Online Products dataset is smaller than the improvement on the other two datasets (Table 3 and Table 4). The possible reason may come from the fact that the Stanford Online Products dataset [32] consists of more diverse images than the other two datasets. The existing data already contains many hard samples and the left improving space for hard sample generation on Stanford Online Products dataset is limited. For studying the performance of our method with varying embedding sizes, we conducted additional experiments on all three using a small embedding dimension. We can find that the performance of our proposed method can outperform the previous schemes even if the feature dimension is reduced to 128.

Compared with the image retrieval schemes with deep pooling features (MAC, R-MAC, GeM), the methods with hard sample generation (DAML, HDML, SS, and our scheme) produce better results. Compared with the DAML, HDML, and SS which improve performance by generating negative samples, our scheme can further boost the retrieval performance. In addition, our scheme can also obtain at least the second-best performance in the result of F1 and NMI. In summary, our proposed two-stage hard sample generation method is highly competitive in both retrieval and clustering tasks.

6.5 Combine Hard Generation and Hard Mining

The motivation of our hard sample generation approach is to synthesize difficult sample pairs from a large number of simple samples to obtain valid information. In contrast, existing hard sample mining strategies directly select the most informative sample pairs from real samples for training. To further explore whether the generated hard samples and real hard samples are complementary, we design to combine the generated samples with the mined samples for training. We perform experiments to analyze the effect of combining our proposed method with a variety of hard sample mining strategies.

Hard Sample Mining Methods. Triplet loss needs to mine training tuples from the available mini-batch. In our study, all of our experiments are based on triplet architecture, so we choose the four most representative tuple mining strategies for our experiments. These strategies include random tuple mining (R) [12], semihard triplet mining (Semi)[25], softhard triplet mining (Soft) [24] and distance-weighted tuple mining (D) [14]. In this experiment, our proposed algorithm combined with random tuple mining fully equivalent to the two-stage hard sample generation method (THSG), which can be regarded as the baseline of our methods.

Method	R@1	R@2	R@4	R@8	NMI	F1	MAP
Triplet(R)	58.0	70.1	80.3	86.0	64.4	32.6	29.9
Triplet(Semi)	59.2	70.9	80.8	86.4	64.8	33.3	30.8
Triplet(Soft)	60.3	72.1	81.8	87.2	65.9	34.6	31.6
Triplet(D)	62.5	72.9	82.0	88.7	66.3	34.8	32.3
THSG	61.3	72.8	82.3	89.1	65.6	33.8	31.5
THSG(Semi)	61.3	72.5	81.8	88.2	64.9	32.7	31.8
THSG(Soft)	62.4	74.0	82.8	89.2	66.3	34.5	32.4
THSG(D)	63.2	74.6	83.5	89.5	66.7	35.9	33.0

Table 6: Experimental results of THSG combining hard mining methods compared with their respective baselines on the CUB-200-2011 dataset(%).

Implementation Details. In order to warrant unbiased comparability in Table 6, 7, and 8, our training protocol follows settings of [39]. The difference from the previous experimental setting is that we resize images to $224\times 224$ and utilize a ResNet50 architecture with frozen Batch-Normalization layers and embedding dimensionality 128 . When combining two different hard triplet for training, we only replace the original deep distance metric loss $\mathcal{L}_{org}$ with the hard mining deep distance metric loss $\mathcal{L}_{mining}$ and keep the other training steps of ours method unchanged. Therefore, the final deep embedding learning (4) is rewritten as:

\displaystyle\mathcal{L}_{F}

\displaystyle=w_{o}\mathcal{L}_{mining}+w_{l}\mathcal{L}_{class}+w_{h}\mathcal{L}_{hard}

(25)

Analysis of Results. We can find that experimental results in all evaluation criteria are further boosted by combining our method with a hard sample mining. Both in the retrieval task and in the clustering task, our method combined with hard sample mining results in better performance compared to the individual method. The performance improvement implies there is complementary information in the mined and generated hard samples. In particular, our generation method combined with simple mining strategies, such as random tuple mining or semihard triplet mining, yields an even more obvious improvement in results. Thus, our method does generate hard samples with information that is difficult to mine in the original data.

Method	R@1	R@2	R@4	R@8	NMI	F1	MAP
Triplet(R)	67.8	78.2	85.8	91.2	60.1	27.2	26.9
Triplet(Semi)	70.6	80.1	86.7	91.5	61.3	29.7	28.8
Triplet(Soft)	76.9	84.6	90.3	94.1	62.8	30.6	31.3
Triplet(D)	77.7	85.7	90.8	94.0	64.4	33.1	31.9
THSG	80.2	87.2	91.7	95.0	66.2	34.8	31.9
THSG(Semi)	79.8	86.4	91.5	95.0	66.0	35.4	34.2
THSG(Soft)	80.3	87.0	91.8	95.2	66.4	35.6	34.2
THSG(D)	81.6	88.4	92.5	95.6	67.5	36.7	34.6

Table 7: Experimental results of THSG combining hard mining methods compared with their respective baselines on the Cars196 dataset(%).

Method	R@1	R@10	R@100	NMI	F1	MAP
Triplet(R)	71.0	85.3	93.6	88.8	30.9	38.7
Triplet(Semi)	76.2	88.8	95.4	89.9	36.2	44.3
Triplet(Soft)	76.3	89.1	95.6	89.8	35.8	44.1
Triplet(D)	77.8	90.2	95.9	90.1	36.9	45.8
THSG	76.2	88.7	95.2	89.6	34.7	43.4
THSG(Semi)	78.0	89.9	95.8	90.1	37.3	45.9
THSG(Soft)	77.7	89.8	95.7	90.0	36.6	45.5
THSG(D)	78.5	90.3	95.9	90.1	37.1	46.3

Table 8: Experimental results of THSG combining hard mining methods compared with their respective baselines on the SOP dataset(%).

6.6 Subjective Comparison

Visualized Retrieval Cases. We selected several queries from the utilized dataset and depicted the retrieval results in Fig. 6. The depicted methods are our work, the HDML of work [26] and the basic triplet scheme. The visualized results show that our method achieves better search accuracy. It should be noted that for some cases, such as the last two queries in Fig. 6, almost all the methods have many error results.

Embedding Visualization. We visualized the learned embeddings for the CUB-200-2011 and Cars196 datasets by using t-SNE [40], as denoted by Fig. 7 and Fig. 8. In each figure, several selected specific regions are enlarged to highlight the representative classes for easy observation. In each specific region, although the images in the same class suffer from large variations in poses, colors, and backgrounds, our proposed method still can group similar objects. The visualization results denote our learned embeddings have strong representation ability and thus can achieve better retrieval performance in an intuitive manner.

6.7 Ablation Study

We conducted the ablation study of the proposed method on the Cars196 dataset. In Fig. 10, we show the results based on the Recall@1 criterion of different model settings, including our proposed method (OURS), the proposed method combined with distance-weighted hard mining (OURS^∗), the proposed method without using the second stage of synthesis (w/o $G2$ ), the proposed method without using the second stage of synthesis and category loss $\mathcal{L}_{class}$ (w/o ( $G2$ & $\mathcal{L}_{class}$ )), and the proposed method without both the generated hard sample loss $\mathcal{L}_{hard}$ and the category loss $\mathcal{L}_{class}$ (w/o ( $\mathcal{L}_{hard}$ & $\mathcal{L}_{class}$ )), which is the baseline of using triplet architecture.

We can find that the Recall@1 is further improved (80.2% $\rightarrow$ 81.6%) by combining our method with a hard sample mining. The performance improvement implies there is complementary information in the mined and generated hard samples. We remove the Generator $G_{2}$ and only use the first stage of generation to synthesize the anchor-positive pairs in hard triplets, then some performance loss is introduced (80.2% $\rightarrow$ 76.5%), which reveals the importance of hard sample generation which can extract the potential information from simple samples for embedding training. By further removing category loss $\mathcal{L}_{class}$ , the performance suffers an obvious degradation again (76.5% $\rightarrow$ 71.9%), which tells that the category loss still can provide useful information under the metric learning architecture. Moreover, because we train the embedding model $F$ and the other hard sample generation models alternately, the absence of $\mathcal{L}_{class}$ will ruin the original obtained sample $\mathbf{X}$ and in turn impairs the generation of hard samples. Thus, the category loss $\mathcal{L}_{class}$ also plays a great role in our scheme. When the generated hard sample loss $\mathcal{L}_{hard}$ is totally removed (w/o $\mathcal{L}_{hard}$ ), we see the performance decreases a lot (71.9% $\rightarrow$ 67.8%), which verifies the effectiveness of the proposed stronger conditional synthesis scheme.

To further verify the contributions of each component in our method, we perform the similar experiment on all quantitative criteria including Recall@K, NMI, F1, and mAP, which are depicted as Fig. 10. The quantitative results in Fig. 10 also confirm the analysis as presented above, and the specific performance reduction brought to the model can be observed when each corresponding component is removed.

7 Conclusion

The existing mining-based triplet constructing methods directly learn the deep embeddings based on mining the hard samples, which will ignore the potential information of the easy samples. Although the recent hard sample generation schemes have tried to dig the potential information of the easy samples with the one-stage adversarial model, it is ignored the fact that the positives and negatives have different distributions and characteristics. In this study, we designed a two-stage hard sample generation scheme to achieve better embedding learning.

We found that our proposed scheme achieves the best overall image retrieval and clustering performance on three large datasets (see Table 3, 4, and 5), indicating the effectiveness of the two-stage hard sample generation scheme. Specifically, our proposed method can conduct better results than the existing metric learning with hard sample generation. We also found that both the anchor-positive pairs generated by the conditional synthesis in the first stage and the negative samples generated in the second stage are crucial. Besides, by combining the generated hard samples and real hard samples selected by hard sample mining strategies, the performance can be further boosted (see Fig. 10 and Fig. 10).

Our results provide compelling performance for image retrieval and clustering tasks through the proposed hard sample generation scheme. In this paper, the usage of generated samples in our experiments is limited to the triplet metric learning architecture. In the future, we can consider exploring more efficient usage strategies for generated samples and we will conduct more hard sample generation research on the other loss architectures, such as the n-pair loss structure. In addition, there is a general problem in current hard sample generation schemes is that the generated samples are discarded after being used once. In the future, if schemes can be designed to efficiently reuse high-quality generated samples, it is possible to save computational resources and further accelerate network training.

8 Acknowledgement

This work was supported in part by the 111 project (NO.B17007), Shandong Key Laboratory of Intelligent Buildings Technology (Grant No.SDIBT202006).

References

[1] Z. Li, J. Tang, Weakly supervised deep metric learning for community-contributed image retrieval, IEEE Transactions on Multimedia 17 (11) (2015) 1989–1999.
[2] B. Girod, V. Chandrasekhar, D. M. Chen, N.-M. Cheung, R. Grzeszczuk, Y. Reznik, G. Takacs, S. S. Tsai, R. Vedantham, Mobile visual search, IEEE signal processing magazine 28 (4) (2011) 61–76.
[3] C. Zhu, H. Dong, S. Zhang, Feature fusion for image retrieval with adaptive bitrate allocation and hard negative mining, IEEE Access 7 (2019) 161858–161870.
[4] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International journal of computer vision 60 (2) (2004) 91–110.
[5] F. Perronnin, Y. Liu, J. Sánchez, H. Poirier, Large-scale image retrieval with compressed fisher vectors, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 3384–3391.
[6] H. Jegou, F. Perronnin, M. Douze, J. Sánchez, P. Perez, C. Schmid, Aggregating local image descriptors into compact codes, IEEE transactions on pattern analysis and machine intelligence 34 (9) (2011) 1704–1716.
[7] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, S. Carlsson, From generic to specific deep representations for visual recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2015, pp. 36–45.
[8] A. Babenko, V. Lempitsky, Aggregating local deep features for image retrieval, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1269–1277.
[9] G. Tolias, R. Sicre, H. Jégou, Particular object retrieval with integral max-pooling of cnn activations, arXiv preprint arXiv:1511.05879.
[10] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, J. Sivic, Netvlad: Cnn architecture for weakly supervised place recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5297–5307.
[11] F. Radenović, G. Tolias, O. Chum, Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples, in: European conference on computer vision, Springer, 2016, pp. 3–20.
[12] J. Hu, J. Lu, Y.-P. Tan, Discriminative deep metric learning for face verification in the wild, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1875–1882.
[13] E. Hoffer, N. Ailon, Deep metric learning using triplet network, in: International Workshop on Similarity-Based Pattern Recognition, Springer, 2015, pp. 84–92.
[14] C.-Y. Wu, R. Manmatha, A. J. Smola, P. Krahenbuhl, Sampling matters in deep embedding learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
[15] Y. Yuan, K. Yang, C. Zhang, Hard-aware deeply cascaded embedding, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 814–823.
[16] F. Radenović, A. Iscen, G. Tolias, Y. Avrithis, O. Chum, Revisiting oxford and paris: Large-scale image retrieval benchmarking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5706–5715.
[17] F. Radenović, G. Tolias, O. Chum, Fine-tuning cnn image retrieval with no human annotation, IEEE transactions on pattern analysis and machine intelligence 41 (7) (2018) 1655–1668.
[18] B. Yu, T. Liu, M. Gong, C. Ding, D. Tao, Correcting the triplet selection bias for triplet loss, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 71–87.
[19] Y. Zhao, Z. Jin, G.-j. Qi, H. Lu, X.-s. Hua, An adversarial approach to hard triplet generation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 501–517.
[20] Y. Duan, W. Zheng, X. Lin, J. Lu, J. Zhou, Deep adversarial metric learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2780–2789.
[21] Y. Guo, D. An, X. Qi, Z. Luo, S.-T. Yau, X. Gu, et al., Mode collapse and regularity of optimal transportation maps, arXiv preprint arXiv:1902.02934.
[22] Y. Cui, F. Zhou, Y. Lin, S. Belongie, Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1153–1162.
[23] A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, arXiv preprint arXiv:1703.07737.
[24] R. Yu, Z. Dou, S. Bai, Z. Zhang, Y. Xu, X. Bai, Hard-aware point-to-set deep metric for person re-identification, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 188–204.
[25] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
[26] W. Zheng, Z. Chen, J. Lu, J. Zhou, Hardness-aware deep metric learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 72–81.
[27] G. Gu, B. Ko, Symmetrical synthesis for deep metric learning, arXiv preprint arXiv:2001.11658.
[28] K. Sohn, Improved deep metric learning with multi-class n-pair loss objective, in: Advances in Neural Information Processing Systems, 2016, pp. 1857–1865.
[29] S. Chopra, R. Hadsell, Y. LeCun, et al., Learning a similarity metric discriminatively, with application to face verification, in: CVPR (1), 2005, pp. 539–546.
[30] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530.
[31] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784.
[32] H. Oh Song, Y. Xiang, S. Jegelka, S. Savarese, Deep metric learning via lifted structured feature embedding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4004–4012.
[33] H. Oh Song, S. Jegelka, V. Rathod, K. Murphy, Deep metric learning via facility location, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5382–5390.
[34] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The caltech-ucsd birds-200-2011 dataset.
[35] J. Krause, M. Stark, J. Deng, L. Fei-Fei, 3d object representations for fine-grained categorization, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 554–561.
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[37] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International journal of computer vision 115 (3) (2015) 211–252.
[39] K. Roth, T. Milbich, S. Sinha, P. Gupta, B. Ommer, J. P. Cohen, Revisiting training strategies and generalization performance in deep metric learning, arXiv preprint arXiv:2002.08473.
[40] L. Van Der Maaten, Accelerating t-sne using tree-based algorithms, The Journal of Machine Learning Research 15 (1) (2014) 3221–3245.