Multi-Agent Semi-Siamese Training for
Long-tail and Shallow Face Learning

Hailin Shi^*, Dan Zeng^*, Yichun Tai, Hang Du, Yibo Hu, Zicheng Zhang, and Tao Mei * Equal contribution.Hailin Shi, Yibo Hu, and Tao Mei are with JD AI Research, Beijing 100020, China. E-mail: [email protected], [email protected], [email protected] Zeng, Yichun Tai, and Hang Du are with Shanghai University, Shanghai 200444, China. E-mail: {dzeng, taiyc, duhang}@shu.edu.cnZicheng Zhang is with University of Chinese Academy of Sciences, Beijing 100043, China. E-mail: [email protected] author: Dan Zeng.

Abstract

With the recent development of deep convolutional neural networks and large-scale datasets, deep face recognition has made remarkable progress and been widely used in various applications. However, unlike the existing public face datasets, in many real-world scenarios of face recognition, the depth of training dataset is shallow, which means only two face images are available for each ID. With the non-uniform increase of samples, such issue is converted to a more general case, a.k.a long-tail face learning, which suffers from data imbalance and intra-class diversity dearth simultaneously. These adverse conditions damage the training and result in the decline of model performance. Based on the Semi-Siamese Training (SST), we introduce an advanced solution, named Multi-Agent Semi-Siamese Training (MASST), to address these problems. MASST includes a probe network and multiple gallery agents, the former aims to encode the probe features, and the latter constitutes a stack of networks that encode the prototypes (gallery features). For each training iteration, the gallery network, which is sequentially rotated from the stack, and the probe network form a pair of semi-siamese networks. We give the theoretical and empirical analysis that, given the long-tail (or shallow) data and training loss, MASST smooths the loss landscape and satisfies the Lipschitz continuity with the help of multiple agents and the updating gallery queue. The proposed method is out of extra-dependency, thus can be easily integrated with the existing loss functions and network architectures. It is worth noting that, although multiple gallery agents are employed for training, only the probe network is needed for inference, without increasing the inference cost. Extensive experiments and comparisons demonstrate the advantages of MASST for long-tail and shallow face learning.

Index Terms:

Face Recognition, Shallow Face Learning, Long-tail Face Learning.

I Introduction

The convolutional neural networks (CNNs) [1, 2, 3] have shown great advantages in face recognition. Besides the advance of network design and training algorithm, the success largely relies on the large-scale training datasets, such as CASIA-WebFace [4], MS-Celeb-1M [5], VGGFace2 [6], etc. We term this type of dataset as deep face data, which provides abundant information both in breadth (IDs) and depth (samples per ID). In many real-world scenarios, however, the training usually encounters unfavorable data distribution, such as shallow distribution [7]. The shallow face data distribution, which often has a large number of IDs, yet contains a few (two mostly) face images per ID (a registration photo and a spot photo, so-called “gallery” and “probe”). According to the analysis in SST [7], the lack of intra-class diversity prevents the network from effective optimization, and leads to the collapse of feature dimension. Whereas, with the non-uniform increase of samples for each ID, the data tends to be long-tail distributed. For example, in the large-scale face dataset MS-Celeb-1M, only a small portion of identities, about 10%, have abundant face images, while a considerable portion has only a few samples. The head classes contain enough samples with rich intra-class diversity to avoid the obvious imbalance problem in the learning process. However, for tail classes, the amount of samples for each ID is too small to push the decision boundary correctly and classify with no bias, making the trained model prone to overfitting. As exemplified in Fig. 1, shallow face data is an extreme case of long-tail face data, where it only has samples like tail data and the number of samples for each ID is two. In such a situation, we find the existing training methods may not perform well.

Refer to caption — Figure 1: Left: in deep face data, each ID has plentiful samples, providing rich intra-class diversity. Top-right: long-tail face data. Bottom-right: shallow face data, which can be regarded as a special case of long-tail distribution, has a small number of samples per ID, and therefore the scarcity of intra-class diversity broadly exists in each ID.

With the further maturity of face recognition technology, more and more people begin to pay attention to the problems in real-world scenarios mentioned above. The main challenging issues are as follows: (1) Shallow face learning is similar to the existing problem of few-shot learning in face recognition [8], which has been widely studied. For example, to increase the sample size, some works are done to synthesize data based on generation rules [9, 10]. Some other works try to adjust the weight norm of the classifier for the few-shot classification [11]. But there are still some differences between these two issues. Most obviously, few-shot learning performs close recognition [11, 12, 13, 14], while shallow face learning focuses on the open-set recognition task in which test IDs are excluded from training IDs. Besides, few-shot learning needs pre-training in the source domain (with deep data) before fine-tuning to the target domain [13, 15, 16], while shallow face learning tends to train on shallow face data from the very beginning. Unfortunately, targeted and effective solutions have not been proposed. (2) For face recognition tasks, some metric-based losses such as [2, 17] are proved to be superior to conventional softmax loss. On that basis, the range loss [18], which can enhance the discriminative power of the deeply learned features when dealing with few-shot samples in tail classes, is proposed to solve the long-tail problem. As the inherent problems of long-tail distribution, the data imbalance and few-shot problem are concerned. Most works related to the scarcity of samples rely on a base set with adequate samples and pay more attention to few-shot classification rather than learning a general representation under data with the long-tail distribution. In addition, some classical strategies [19, 20] against imbalance problem have been made full use in current deep learning frameworks, including resampling [21, 22] and cost-sensitive learning [23, 24]. However, there are still many limitations such as discarding samples for under-sampling or introducing additional noises (e.g., assigning larger penalty on outlier samples).

To tackle the above challenges, we propose Multi-Agent Semi-Siamese Training (MASST) mechanism. In long-tail face data, the tail identities cannot provide an accurate description with the limited number of training samples, and the feature space of them are often squeezed by the head identities. Therefore, we randomly choose two photos in each ID, of which one photo is employed as the initial prototype, and the other one is employed as the training sample. However, these two photos belong to the same ID and contain limited intra-class information. To enlarge intra-class diversity, we design MASST to enforce the backbone being Semi-Siamese, which means the networks have close (but not identical) parameters. As a result, MASST is proved to achieve better performance in shallow and long-tail face learning.

MASST consists of a probe network and multiple gallery agents. For each ID in the training, the former extracts the feature from the probe as the training sample, and the latter sequentially rotated from the stack extracts the feature from the gallery as the prototype. To guarantee the intra-class diversity between the feature, we combine moving-average with an auxiliary term to update the gallery network, which can constrain the slight difference among networks. The gallery queue is built based on the gallery samples and updated in the training to replace the weights of the fully connected (FC) layer, which can reduce a large number of FC parameters. As shown in Section IV, MASST achieves significant improvement in shallow and long-tail face learning. It also outperforms the conventional training on the challenge of domain transfer. Furthermore, since the proposed method is developed without extra-dependency, it can be flexibly integrated with the existing loss functions and network architectures.

The contributions of this paper are summarized as follows:

•

Compared with shallow face learning, long-tail face learning is a more general case including the few-shot problem and the data imbalance problem at the same time. We study these two main challenges, which exist in many real-world scenarios of face recognition but have not been well addressed.
•

We propose Multi-Agent Semi-Siamese Training (MASST) method to address the few-shot problem and the data imbalance problem in shallow and long-tail face learning. MASST is able to achieve a smoother optimization process on the shallow and long-tail data by satisfying the Lipschitz continuity, and obtain the leading performance when it combines with the SOTA loss functions and network architectures flexibly.
•

We conduct comprehensive experiments to verify that MASST not only brings significant improvement on shallow and long-tail face learning, but also takes effect on both conventional deep data and domain transfer.

This paper is an extension of our previous conference version [7], and there are three major improvements over the preliminary one: (1) MASST can tackle more general problems in face recognition. For the preliminary version, the Semi-Siamese Training (SST) is targeted proposed for shallow face learning. As an extension of shallow distribution, the tail samples in long-tail distribution are shallow and the number of face images with different IDs varies greatly. Thus, shallow face learning can be seen as a subproblem of long-tail face learning. In the current version, we aim to address this more complicated issue, which is common in real-world scenarios of face recognition. (2) MASST has a smoother optimization process. The preliminary version designs SST, including a pair of Semi-Siamese networks, which have a probe network to embed the probe features, and a gallery network to update prototypes by gallery features. On this basis, benefiting from the aforementioned Semi-Siamese structure, the gallery network of the current version is sequentially rotated from the network stack which involves multiple agents rather than the fixed single gallery network in SST. At the same time, the update method of MASST is changed to achieve a smoother optimization process. (3) Theoretical analysis and comprehensive experiments are added. Through theoretical analysis, we prove that the networks in MASST can better satisfy the Lipschitz continuity. Besides, comprehensive experiments on deep face data, shallow face data, and long-tail face data show the effectiveness of MASST. It can combine with the SOTA loss functions and network architectures flexibly and obtain the leading performance.

II Related Work

For shallow and long-tail face learning, the main difficulties are the few-shot problem and the data imbalance problem. So we review methods that solve these two issues in this section.

II-A Few-shot Learning

The straightforward solution to enlarge the diversity of samples is data generation. Hariharan and Girshick [9] propose to hallucinate additional samples of few-shot classes by category-independent transformations. Dixitet et al. [25] present an approach guided by attributes for data augmentation. Delta-encoder [26] is designed to synthesize new samples for unseen categories even if seeing few samples from them, by extracting transferable intra-class deformations and applying them to the few provided examples of a novel class. Yin et al. [10] assume a Gaussian prior of the variance to transfer variations of normal classes to few-shot classes, which can augment the feature space of few-shot classes. Ding et al. [27] propose a generative adversarial one-shot face recognizer to synthesize data for one-shot classes. Specifically, a knowledge transfer generator and a general classifier are combined to guide the generation of effective data, which contributes to promoting the underrepresented classes.

Besides, to regularize the learning process, some methods adjust the conventional settings such as weight of samples and evaluation of loss. Guo et al. [11] develop the classifier to recognize few-shot samples by aligning the norms of the weight vectors of few-shot classes and the normal classes. For few-shot learning in face recognition, Ma et al. [28] adopt a hierarchical regularization method which utilizes coarse class labels for training and fine class labels for refining to avoid overfitting, Cheng et al. [13] propose the enforced softmax to guide model to produce a more compact vectorized representation by optimal dropout, selective attenuation, $L_{2}$ normalization, and model-level optimization. Simon et al. [29] introduce a dynamic classifier that is constructed from few samples to design a framework for few-shot learning. With such modeling, the results on the task of supervised and semi-supervised few-shot classification are competitive.

II-B Data Imbalance

Arising from the long-tail distribution of nature data, the imbalance problem has been extensively studied [30, 31, 32, 33, 34]. Some metric-based methods add desirable constraints on the distance between samples in the feature space. Huang et al. [35] propose quintuplet sampling with triple-header hinge loss to maintain both inter-cluster and inter-class margins for solving imbalanced distribution in vision classification tasks. Center invariant loss [36] aligns centers of the minority classes to the majority. Range loss [18] is developed to overcome the impact of long-tail distribution by simultaneously minimizing the harmonic mean of $k$ largest intra-class distance and maximizing the shortest inter-class distance within the mini-batch. Ring loss [37] applies soft feature normalization to augment standard loss functions. In addition, fair loss [38] is proposed to balance different classes by using reinforcement learning.

In terms of data, some other methods have been proposed for improvement. Classical strategies include under-sampling head classes, over-sampling tail classes, and data instance re-weighting, which aim to balance the number of samples. By SMOTE [21], over-sampling of the few-shot class and under-sampling of the normal class can be combined to achieve better classifier performance than replication-based methods. BalanceCascade [39] trains the learners sequentially. The majority of class examples classified correctly are not taken into consideration further. Unfortunately, these methods can easily suffer from over-fitting and introduce undesirable noises. The under-sampling also might discard a large portion of samples so that missing valuable information. Some of these solutions have been further extended to solve the data imbalance problem in deep learning [40]. In order to remove representation bias, REPAIR [41] learns weights for different classes to re-sample data. Zhong et al. [42] attempt to treat the head data and the tail data of long-tail distribution in an unequal way to make full use of their respective characteristics. Accompanying with noise-robust loss functions, these two training streams offer complementary information to deep feature learning. Zhang et al. [43] focus on the optimization of dataset structures and build a medium-scale face recognition training set BUPT-CBFace, which can alleviate the problem of recognition bias. Huang et al. [44] conduct extensive and systematic experiments to validate the effectiveness of classic schemes mentioned above. Furthermore, CLMLE is proposed to reduce the class imbalance inherent in the data neighborhood, by enforcing a deep network to maintain inter-cluster margins both within and between classes. As an extension of [28], the two-step training schema [45] is designed, which aggregates similar classes in tail parts and then selectively disperses those aggregated superclasses, to address the long-tail problem.

III Proposed Method

In this section, we begin with the discussion on the major problems in shallow and long-tail face data. Then, we introduce the method of Semi-Siamese and Multi-Agent Semi-Siamese Training.

III-A SST for Shallow Data

The major issue in shallow face data is the lack of intra-class information. Its adverse effect on model training has been discussed in Du et al. [7]. Here, we elaborate the detailed behavior of model when trained on shallow data with the current state-of-the-art methods, and introduce a novel perspective for understanding the difficulty of training on shallow face data and, more general cases, long-tail face data.

The state-of-the-art training methods are mostly developed from the softmax loss and its variants. Without loss of generality, taking softmax as an example, the formulation can be written as

\mathcal{L}=-\log\frac{e^{s\,\boldsymbol{w_{y}}^{T}\boldsymbol{x}}}{e^{s\,\boldsymbol{w_{y}}^{T}\boldsymbol{x}}+\sum_{j=1,j\neq y}^{n}e^{s\,\boldsymbol{w_{j}}^{T}\boldsymbol{x}}},

(1)

where ${n}$ is the class number, $s$ is the scaling parameter, $\boldsymbol{x}$ is the sample feature with the L2 normalization ( $\boldsymbol{x}=\boldsymbol{x}/\|\boldsymbol{x}\|_{2}$ ), $\boldsymbol{w_{j}}$ is the $j$ -th class weight with the L2 normalization ( $\boldsymbol{w_{j}}=\boldsymbol{w_{j}}/\|\boldsymbol{w_{j}}\|_{2}$ ), $y$ is the ground-truth label of the sample, and the output of the last FC layer is the inner product of the sample feature $x$ and $j$ -th class weight $\boldsymbol{w_{j}}$ . In the training process, the goal is maximizing the intra-class pair similarity $\boldsymbol{{w}_{y}}^{\mathrm{T}}\boldsymbol{x}$ and minimizing the inter-class pairs $\boldsymbol{{w}_{j}}^{\mathrm{T}}\boldsymbol{x}$ . If the sample number per class is large enough (a.k.a. deep data), $\boldsymbol{x}$ ’s can provide abundant intra-class diversity. There are, however, only two samples per class (i.e., the gallery $\boldsymbol{x_{g}}$ and probe $\boldsymbol{x_{p}}$ ) in shallow data. As the prototype $\boldsymbol{{{w}_{y}}}$ is determined by $\boldsymbol{x_{g}}$ and $\boldsymbol{x_{p}}$ , the three vectors will easily converge close ( $\boldsymbol{w_{y}}\approx\boldsymbol{x_{g}}\approx\boldsymbol{x_{p}}$ ) that the loss value in this class approaches to the lower bound. Therefore, in each iteration of batch-wise training, the network is well-fitted on a small number of classes and badly-fitted on the remaining ones. As a result, the total loss value will be oscillating and the training will be harmed. Moreover, due to the lack of the intra-class diversity in features space of all the classes ( $\boldsymbol{x_{g}}\approx\boldsymbol{x_{p}}$ ), the prototype $\boldsymbol{{{w}_{y}}}$ is pushed to zeros in most dimensions so that it is unable to span an effective feature space. Specifically, given a network and loss function in conventional training, the loss landscape will become much more winding when the data depth becomes shallow. One can refer to Du et al. [7] for detailed discussion.

In this context, SST is designed to improve the stability of training on shallow data. We denote gallery and probe images by $I_{g}$ and $I_{p}$ , and their features $x_{g}=\phi(I_{g})$ and $x_{p}=\phi(I_{p})$ , where $\phi$ is the backbone. $\phi$ should be Semi-Siamese, coined gallery network $\phi_{g}$ and probe network $\phi_{p}$ , to maintain the distance between $\boldsymbol{x_{g}}$ and $\boldsymbol{x_{p}}$ . $\phi_{g}$ and $\phi_{p}$ have the same architecture and close (but non-identical) parameters, $\phi_{g}=\phi_{p}+\boldsymbol{\epsilon^{\prime}}$ , to prevent features from collapse, $\phi_{g}(I_{g})=\phi_{p}(I_{p})+\boldsymbol{\epsilon}$ . To implement the Semi-Siamese networks, a common approach is to add a network constraint $\|\phi_{g}-\phi_{p}\|<\epsilon^{\prime}$ in the training loss. As suggested by MoCo [46], we choose a better approach that updating gallery network in the momentum way, which is defined as

{\phi}_{g}={m}{\phi}_{g}+{(1-m)}{\phi}_{p},

(2)

where $m$ is the weight of moving-average, and the probe network $\phi_{p}$ updates with SGD.

The prototype constraint is deployed in the training objective to enlarge the entries of prototype, such like $\mathcal{L}+\beta(\alpha-\|\boldsymbol{w_{y}}\|)$ with parameters $\alpha$ and $\beta$ , avoiding the vanishing issue in dimensions of $\boldsymbol{w_{y}}$ . This technique, however, does not show effective improvement (Table II), since the entries are enlarged uniformly. To handle this issue, SST replaces $\boldsymbol{w_{y}}$ with the gallery feature $\boldsymbol{x_{g}}$ as the prototype rather than manipulating $\boldsymbol{w_{y}}$ itself. The feature-based prototype avoids the vanishing issue, and thereby keeps the discriminative components in the prototype.

III-B From SST to MASST

The experiments in Du et al. [7] show that SST has remarkable advantage over the conventional training on shallow data. Inspired by Markov-Lipschitz theory [47], we analyze the advantage of SST from the perspective of the bi-Lipschitz continuity in the following.

A function $\eta:\mathbb{R}^{H\times W}\rightarrow\mathbb{R}^{n}$ is locally bi-Lipschitz if for all $x_{j}\in\mathcal{N}_{i}$ (the neighborhood of $x_{i}$ ), there exists a real constant $K\geq 1$ satisfying

{\|x_{i}-x_{j}\|}_{2}/K\leq{\|\eta(x_{i})-\eta(x_{j})\|}_{2}\leq K{\|x_{i}-x_{j}\|}_{2}.

(3)

$K$ is termed as bi-Lipschitz constant. According to Convex Optimization theory [48], a differentiable loss function $\mathcal{L}(x)$ is called smooth if its gradient $\eta(x)$ is Lipschitz or bi-Lipschitz continuous. The optimization of smooth function will be more stable and robust if $K$ is smaller [47].

First, we compare the bi-Lipschitz constant $K$ of the conventional training and SST to compare their stability of optimization process. For conventional training, according to Eq. 1, when the gallery $I_{g}$ and probe $I_{p}$ are fed into the backbone $\phi$ separately, we have the gradient

\begin{split}\eta_{Convt.}(I_{g})=s(\frac{e^{s\,\boldsymbol{w_{y}}^{T}\phi(I_{g})}}{\sum_{j=1}^{n}e^{s\,\boldsymbol{w_{j}}^{T}\phi(I_{g})}}-1)\phi(I_{g}),\\ \eta_{Convt.}(I_{p})=s(\frac{e^{s\,\boldsymbol{w_{y}}^{T}\phi(I_{p})}}{\sum_{j=1}^{n}e^{s\,\boldsymbol{w_{j}}^{T}\phi(I_{p})}}-1)\phi(I_{p}).\end{split}

(4)

As the training carries on, for $I_{g}$ and $I_{p}$ that belong to the same ID, $\phi(I_{g})\approx\phi(I_{p})$ and $\eta_{Convt.}(I_{g})\approx\eta_{Convt.}(I_{p})$ . This means $K$ is very large. Thereby, the optimization of conventional training on shallow data is difficult to be smooth and steady [47].

As for SST, $\boldsymbol{w_{y}}$ is replaced with the gallery feature $\boldsymbol{x_{g}}=\phi_{g}(I_{g})$ , and the gradient with respect to $\boldsymbol{x_{g}}$ becomes to

\displaystyle\eta_{x_{g}}(\boldsymbol{x})=s(\frac{e^{s\,\boldsymbol{x_{g}}^{T}\boldsymbol{x}}}{\sum_{j=1}^{n}e^{s\,\boldsymbol{x_{j}}^{T}\boldsymbol{x}}}-1)\boldsymbol{x}.

(5)

Given the shallow data including a pair of gallery and probe per ID, which are the input of SST, Eq. 5 is developed to

	$\displaystyle\eta_{x_{g}}$	$\displaystyle([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])$		(6)
		$\displaystyle=s(\frac{e^{s\,\boldsymbol{x_{g}}^{T}\boldsymbol{x_{p}}}}{e^{s\,\boldsymbol{x_{g}}^{T}\boldsymbol{x_{p}}}+\sum_{j=1,j\neq y}^{n}e^{s\,\boldsymbol{x_{j}}^{T}\boldsymbol{x_{p}}}}-1)\boldsymbol{x_{p}},$		(6)

where $\boldsymbol{x_{g}}=\phi_{g}(I_{g})$ and $\boldsymbol{x_{p}}=\phi_{p}(I_{p})$ .

For the clarity of the following deduction, we denote $f$ as a function of $\boldsymbol{x_{g}}$ and $\boldsymbol{x_{p}}$ that

\displaystyle f([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])=\frac{e^{s\,\boldsymbol{x_{g}}^{T}\boldsymbol{x_{p}}}}{e^{s\,\boldsymbol{x_{g}}^{T}\boldsymbol{x_{p}}}+\sum_{j=1,j\neq y}^{n}e^{s\,\boldsymbol{x_{j}}^{T}\boldsymbol{x_{p}}}}-1.

(7)

Then, the upper bound of the variation of $\eta_{x_{g}}$ is computed as

		$\displaystyle\lVert\eta_{x_{g}}([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])-\eta_{x_{g}}([\boldsymbol{x_{g}^{\prime}};\boldsymbol{x_{p}}])\rVert_{2}$		(8)
		$\displaystyle=\lVert sf([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])\boldsymbol{x_{p}}-sf([\boldsymbol{x_{g}^{\prime}};\boldsymbol{x_{p}}])\boldsymbol{x_{p}}\rVert_{2}$
		$\displaystyle=\lVert s(f([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])-f([\boldsymbol{x_{g}^{\prime}};\boldsymbol{x_{p}}]))\boldsymbol{x_{p}}\rVert_{2}$
		$\displaystyle=s\lVert f([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])-f([\boldsymbol{x_{g}^{\prime}};\boldsymbol{x_{p}}])\rVert_{2}$
		$\displaystyle=s\lVert\nabla_{\boldsymbol{x_{g}}}f^{T}_{\boldsymbol{x_{g}};\boldsymbol{x_{p}}}(\boldsymbol{x_{g}^{\prime}}-\boldsymbol{x_{g}})+o(\lVert\boldsymbol{x_{g}^{\prime}}-\boldsymbol{x_{g}}\rVert_{2})\rVert_{2}$
		$\displaystyle=s\lVert\boldsymbol{x_{g}}-\boldsymbol{x_{g}^{\prime}}\rVert_{2}\lVert\nabla_{\boldsymbol{x_{g}}}f^{T}_{\boldsymbol{x_{g}};\boldsymbol{x_{p}}}\rVert_{2}$
		$\displaystyle\leq\frac{s^{2}}{2}\lVert\boldsymbol{x_{g}}-\boldsymbol{x_{g}^{\prime}}\rVert_{2},$

where $\boldsymbol{x_{g}^{\prime}}=\boldsymbol{x_{g}}+\epsilon_{1}$ . Then, we can obtain the ratio to the input variation is

\displaystyle\frac{\lVert\eta_{x_{g}}([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])-\eta_{x_{g}}([\boldsymbol{x_{g}^{\prime}};\boldsymbol{x_{p}}])\rVert_{2}}{\lVert\boldsymbol{x_{g}}-\boldsymbol{x_{g}^{\prime}}\rVert_{2}}\leq\frac{s^{2}}{2}.

(9)

One can refer to the supplementary material for the deduction details.

This indicates that there always exists a constant $K$ making SST satisfies the right side inequality in Eq. 3. Besides, the ratio of variation also has a lower bound. This is because that $\eta_{x_{g}}([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])\neq\eta_{x_{g}}([\boldsymbol{x^{\prime}_{g}};\boldsymbol{x_{p}}])$ always holds owing to the Semi-Siamese structure and the momentum-updating method.

In most of the existing experiments, the parameter $s$ is set around 30, so $\frac{s^{2}}{2}>1$ . Therefore, there exists $K\geq 1$ that constrains the variation range of ${\|\eta_{x_{g}}([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])-\eta_{x_{g}}([\boldsymbol{x_{g}^{\prime}};\boldsymbol{x_{p}}])\|}_{2}$ . In contrast to the conventional training, SST can satisfy the bi-Lipschitz continuity, which is the basic reason that SST achieves stable and robust convergence on shallow data.

Based on the advantage of SST on shallow data, we further extend it to handle a more general problem that is long-tail face learning. Unfortunately, with the non-uniform distribution of training samples in long-tail face data, SST suffers from bad convergence. The tail classes have faster convergence speed due to the fewer samples, but the head classes including large number of samples converge slower. The power of SST is undermined on long-tail data.

In order to better satisfy the locally bi-Lipschitz continuity, we propose MASST to strengthen the stability and robustness with smaller value of $K$ . The single gallery network in SST is replaced with a gallery network stack including multiple agents ( $\phi_{g_{1}},\phi_{g_{2}},\cdots,\phi_{g_{S}}$ ). The framework of MASST is depicted in Fig. 2. In the training process, we sequentially rotate one agent from the stack as the gallery network. The update method of the $i$ -th agent is defined as

{\phi}_{g_{i}}=(1+a)(m{\phi}_{g_{i}}+(1-m){\phi}_{p})-a\frac{\sum_{j=1,j\neq i}^{S}{\phi}_{g_{j}}}{S-1},

(10)

where $i=1,2,\cdots,S$ , $S$ is the number of agents in the gallery network stack, and $a$ is a parameter that adjusts the weight of the first and second terms. With the second term, the agents in the gallery network stack can maintain distance intentionally. Therefore, MASST can have smaller value of $K$ to achieve smoother optimization process.

Similar to SST, we have the gradient in MASST

		$\displaystyle\eta_{x_{g}}([\boldsymbol{{x_{g}}^{ma}};\boldsymbol{{x_{p}}^{ma}}])$		(11)
		$\displaystyle=s(\frac{e^{s\,\boldsymbol{{x_{g}}^{ma}}^{T}\boldsymbol{{x_{p}}^{ma}}}}{e^{s\,\boldsymbol{{x_{g}}^{ma}}^{T}\boldsymbol{{x_{p}}^{ma}}}+\sum_{j=1,j\neq y}^{n}e^{s\,\boldsymbol{{x_{j}}^{ma}}^{T}\boldsymbol{{x_{p}}^{ma}}}}-1)\boldsymbol{{x_{p}}^{ma}},$		(11)

where $\boldsymbol{{x_{g}}^{ma}}=\phi_{g_{i}}(I_{g})$ and $\boldsymbol{{x_{p}}^{ma}}=\phi_{p}(I_{p})$ .

Without loss of generality, ignoring the scaling parameter $s$ , the ratio of the variation of $\eta_{x_{g}}([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])$ to the variation of $\eta_{x_{g}}([\boldsymbol{{x_{g}}^{ma}};\boldsymbol{{x_{p}}^{ma}}])$ can be written as

		$\displaystyle\frac{\lVert\eta_{x_{g}}([\boldsymbol{{x_{g}}^{ma}};\boldsymbol{{x_{p}}^{ma}}])-\eta_{x_{g}}([\boldsymbol{{x_{g}^{\prime}}^{ma}};\boldsymbol{{x_{p}}^{ma}}])\rVert_{2}}{\lVert\eta_{x_{g}}([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])-\eta_{x_{g}}([\boldsymbol{x_{g}^{\prime}};\boldsymbol{x_{p}}])\rVert_{2}}$		(12)
		$\displaystyle=\frac{\lVert\nabla_{\boldsymbol{{x_{g}}^{ma}}}\frac{e^{\boldsymbol{{x_{g}}^{ma}}^{T}\boldsymbol{{x_{p}}^{ma}}}}{e^{\boldsymbol{{x_{g}}^{ma}}^{T}\boldsymbol{{x_{p}}^{ma}}}+e^{\boldsymbol{{x_{g}^{\prime}}^{ma}}^{T}\boldsymbol{{x_{p}}^{ma}}}}\rVert_{2}}{\lVert\nabla_{\boldsymbol{{x_{g}}}}\frac{e^{\boldsymbol{{x_{g}}}^{T}\boldsymbol{{x_{p}}}}}{e^{\boldsymbol{{x_{g}}}^{T}\boldsymbol{{x_{p}}}}+e^{\boldsymbol{{x_{g}^{\prime}}}^{T}\boldsymbol{{x_{p}}}}}\rVert_{2}},$		(12)

where $\boldsymbol{{x_{g}^{\prime}}^{ma}}=\boldsymbol{{x_{g}}^{ma}}+\epsilon_{2}$ , and $\lVert\nabla_{\boldsymbol{{x_{g}}}}\frac{e^{\boldsymbol{{x_{g}}}^{T}\boldsymbol{{x_{p}}}}}{e^{\boldsymbol{{x_{g}}}^{T}\boldsymbol{{x_{p}}}}+e^{\boldsymbol{{x_{g}^{\prime}}}^{T}\boldsymbol{{x_{p}}}}}\rVert_{2}$ can be developed to

		$\displaystyle\lVert\nabla_{\boldsymbol{{x_{g}}}}\frac{e^{\boldsymbol{{x_{g}}}^{T}\boldsymbol{{x_{p}}}}}{e^{\boldsymbol{{x_{g}}}^{T}\boldsymbol{{x_{p}}}}+e^{\boldsymbol{{x_{g}^{\prime}}}^{T}\boldsymbol{{x_{p}}}}}\rVert_{2}$		(13)
		$\displaystyle=\lVert\frac{{e^{\boldsymbol{{x_{g}}}^{T}\boldsymbol{{x_{p}}}}e^{\boldsymbol{{x_{g}^{\prime}}}^{T}\boldsymbol{{x_{p}}}}}}{({e^{\boldsymbol{{x_{g}}}^{T}\boldsymbol{{x_{p}}}}+e^{\boldsymbol{{x_{g}^{\prime}}}^{T}\boldsymbol{{x_{p}}}}})^{2}}\boldsymbol{{x_{p}}}\rVert_{2}$
		$\displaystyle=\lVert\frac{1}{e^{\boldsymbol{{(x_{g}-x_{g}^{\prime})}}^{T}\boldsymbol{{x_{p}}}}+e^{\boldsymbol{{(x_{g}^{\prime}-x_{g})}}^{T}\boldsymbol{{x_{p}}}}+2}\rVert_{2}.$

Compared with SST, the update method of the gallery network in MASST increases differences among agents in the stack, i.e., $\epsilon_{2}>\epsilon_{1}$ . Then, we can find that

$\displaystyle\boldsymbol{{(x_{g}-x_{g}^{\prime})}}^{T}$	$\displaystyle\boldsymbol{{x_{p}}}$	(14)
	$\displaystyle\leq\lVert\boldsymbol{{(x_{g}-x_{g}^{\prime})}}\rVert_{2}\lVert\boldsymbol{{x_{p}}}\rVert_{2}=\epsilon_{1},$
$\displaystyle\boldsymbol{{({x_{g}}^{ma}-{x_{g}^{\prime}}^{ma})}}^{T}$	$\displaystyle\boldsymbol{{{x_{p}}^{ma}}}$
	$\displaystyle\leq\lVert\boldsymbol{{({x_{g}}^{ma}-{x_{g}^{\prime}}^{ma})}}\rVert_{2}\lVert\boldsymbol{{{x_{p}}^{ma}}}\rVert_{2}=\epsilon_{2}.$

According to the regional monotonicity of Equation LABEL:monotone w.r.t. $\boldsymbol{{(x_{g}-x_{g}^{\prime})}}^{T}\boldsymbol{{x_{p}}}$ (more details are shown in the supplementary material), we obtain:

\displaystyle\frac{\lVert\eta_{x_{g}}([\boldsymbol{{x_{g}}^{ma}};\boldsymbol{{x_{p}}^{ma}}])-\eta_{x_{g}}([\boldsymbol{{x_{g}^{\prime}}^{ma}};\boldsymbol{{x_{p}}^{ma}}])\rVert_{2}}{\lVert\eta_{x_{g}}([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])-\eta_{x_{g}}([\boldsymbol{x_{g}^{\prime}};\boldsymbol{x_{p}}])\rVert_{2}}\leq 1,

(15)

for $\boldsymbol{{(x_{g}-x_{g}^{\prime})}}^{T}\boldsymbol{{x_{p}}}\leq\boldsymbol{{({x_{g}}^{ma}-{x_{g}^{\prime}}^{ma})}}^{T}\boldsymbol{{{x_{p}}^{ma}}}$ . By further comparing the ratio to the input variation in SST and MASST,

		$\displaystyle\frac{\lVert\eta_{x_{g}}([\boldsymbol{{x_{g}}^{ma}};\boldsymbol{{x_{p}}^{ma}}])-\eta_{x_{g}}([\boldsymbol{{x_{g}^{\prime}}^{ma}};\boldsymbol{{x_{p}}^{ma}}])\rVert_{2}\cdot\lVert\boldsymbol{x_{g}}-\boldsymbol{x_{g}^{\prime}}\rVert_{2}}{\lVert\eta_{x_{g}}([\boldsymbol{x_{g}};\boldsymbol{x_{p}}])-\eta_{x_{g}}([\boldsymbol{x_{g}^{\prime}};\boldsymbol{x_{p}}])\rVert_{2}\cdot\lVert\boldsymbol{{x_{g}}^{ma}}-\boldsymbol{{x_{g}^{\prime}}^{ma}}\rVert_{2}}$		(16)
		$\displaystyle\leq\epsilon_{1}/\epsilon_{2}<1,$		(16)

We can verify that there exists a smaller constant $K$ making MASST better satisfies the bi-Lipschitz continuity than SST. In Section III-C, the result of experiment also proves this property of MASST.

It is worth noting that, although multiple gallery agents are involved for training, we only utilize the probe network for test inference. Thereby, the test accuracy is improved while keeping the inference cost unchanged.

III-C MASST for Long-tail Data

With the non-uniform increase of samples, the training suffers from data imbalance and intra-class diversity dearth simultaneously, which leads to the instability of the convergence process.

In order to verify the effectiveness of MASST in long-tail face learning, we construct long-tail data by the existing identity-based sampling method. It is expressed as follows,

Num_{new}=Num_{org}\cdot(index+1)^{-r},

(17)

where $index$ is the ID index sorted in descending order according to the number of intra-class samples that $index\in N^{*}$ , $Num_{org}$ and $Num_{new}$ are the numbers of intra-class samples before and after sampling, and $r$ is an adjusting parameter.

With the increase of $r$ , the degree of long-tail distribution is intensified and the sample size becomes smaller correspondingly. $Num_{new}\geq 2$ guarantees that there are at least two images as input for training in tail classes. Different from deep data, the distribution of synthetic long-tail face data is non-uniform (as shown in Fig. 3 (a)). Obviously, shallow face data is an extreme case of long-tail face data, and long-tail face learning is a more general case, which suffers from data imbalance and the lack of intra-class diversity simultaneously.

To compare the optimization process, the key problem here is measuring the loss landscape’s curvature. By means of the Taylor expansion, we have

\eta(x_{i})=\eta(x_{j})+\frac{\partial\eta(x_{j})}{\partial x_{j}}(x_{i}-x_{j})+o(\lVert x_{i}-x_{j}\rVert),

(18)

then,

	$\displaystyle{\\|\eta(x_{i})-\eta(x_{j})\\|}_{2}$	$\displaystyle\approx(x_{i}-x_{j})^{T}H^{T}H(x_{i}-x_{j})$		(19)
		$\displaystyle\leq\lVert H\rVert_{2}\lVert x_{i}-x_{j}\rVert_{2},$		(19)

where $H$ is the Hessian matrix of loss function $\mathcal{L}(x)$ .

Therefore, the Lipschitz norm is given by ${\|\eta\|}_{Lip}$ = sup $\sigma(\nabla\eta)$ = sup $\sigma(H)$ = $\sigma(H)$ , where $\sigma(A)$ is the spectral norm of the matrix $A$ ( $L_{2}$ matrix norm of $A$ ). The gradient $\eta$ satisfies the Lipschitz continuity better with the smaller spectral norm $\sigma(H)$ . Naively forming the Hessian matrix to compute the spectral norm has a prohibitive computational cost. If there are $q$ parameters (weights and biases) in the network, then the Hessian matrix has dimensions $q\times q$ . The computational effort needed to evaluate the Hessian will scale like $O(q^{2})$ , and it also requires storage that is $O(q^{2})$ . In order to compute the principal eigenvalue of the Hessian matrix efficiently, we use the power method, which is a matrix-free algorithm [49]. For a random vector $v$ , whose dimension is the same as $\eta$ , as it is independent to $\theta$ , we have

\frac{\partial(\eta^{T}v)}{\partial\theta}=\frac{\partial\eta^{T}}{\partial\theta}v+\eta^{T}\frac{\partial v}{\partial\theta}=\frac{\partial\eta^{T}}{\partial\theta}v=Hv,

(20)

where $H$ is the Hessian matrix. The curvature can be computed efficiently in this way.

Input: Model Parameter

\theta

Compute the gradient of

\mathcal{L}

w.r.t.

\theta

, i.e.,

\eta=\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta}

Initialize a random vector

v

(same dimension as

\eta

Normalize

v

v=\frac{v}{\lVert v\rVert_{2}}

for $i=1,2,...,n$ do

Compute

\eta^{T}v

Compute

Hv

by backpropagation,

Hv=\frac{\mathrm{d}(\eta^{T}v)}{\mathrm{d}\theta}

Normalize and reset

v

v=\frac{Hv}{\lVert Hv\rVert_{2}}

Algorithm 1 Power Iteration for Eigenvalue Computation

The algorithm is shown in Alg. 1. Specifically, we first compute the gradient $\eta$ , and initialize a random vector $v$ , followed by the operation of the inner product. Thus, we simply backpropagate $\eta^{T}v$ rather than computing the full Hessian matrix. When the error between two calculations of principal eigenvalue is less than the threshold value ( $thr=1e-3$ ), or the iterations exceed the maximum iterations ( $max=50$ ), the loop is stopped. In this way, the quantity of interest is not the Hessian matrix $H$ itself but the product of $H$ with some vector $v$ . The Hessian vector product $Hv$ that we wish to calculate, however, has only $q$ elements, so instead of computing the Hessian as an intermediate step, we find an efficient approach to evaluating $Hv$ directly that requires only $O(q)$ operations.

For the conventional training, SST and MASST with MobileFaceNet on shallow face data and synthetic long-tail face data, we calculate the principal eigenvalue of the Hessian matrix during the training process. As shown in Fig. 4, we can find: (1) the curvature of the conventional training becomes too small on shallow face data, which means it converges to the local minimum and optimizes invalidly; (2) on long-tail face data, the curvature of the conventional training gradually increases with the increase of iteration times, which means the training process becomes more tortuous; (3) compared with conventional training, the curvatures of SST and MASST are maintained at the medium range whether on shallow face data or long-tail face data. We also calculate the average value and standard deviation of curvature during the training process in Table I. On both shallow face data and long-tail face data, MASST has a smaller average value and standard deviation of curvature to maintain the stable optimization process.

According to the above experimental results, we further intuitively summarize the optimization processes of conventional training, SST, and MASST in Fig. 5. The optimization process of conventional training on deep data is smooth. However, when encountering long-tail data, the process to obtain an optimal solution becomes tortuous. This issue has been greatly improved in SST. Fortunately, with the combination of the gallery network stack, MASST achieves the smoothest optimization process.

TABLE I: Average (AVG.) and standard deviation (SD.) of curvature of training landscape by the conventional training, SST, and MASST along with the training iteration. One can refer to Fig. 4 for the trend of curvature along with the training iteration.

Method	Shallow Face Data		Long-tail Face Data
Method	AVG.	SD.	AVG.	SD.
Convt.	5.7518	1.5522	19.9664	16.5690
SST	11.6319	0.4789	12.0648	1.6331
MASST	9.8915	0.4484	9.4698	1.0015

IV Experiments

This section is structured as follows. Section IV-A introduces the datasets and experimental settings. Section IV-B includes the ablation study on MASST. Section IV-C demonstrates the significant improvement by MASST on shallow face data with various loss functions (Section IV-C1) and various backbones (Section IV-C2). Section IV-D verifies MASST performs well on long-tail face data. Section IV-E shows MASST can achieve leading performance on deep face data. Section IV-F studies MASST also outperforms conventional training for the domain transfer.

IV-A Datasets and Experimental Settings

Training Data We use the public datasets to prove the reproducibility. To obtain shallow data, two images are randomly selected for each ID from the MS1M-v1c [50] dataset. Thus, the shallow data includes 72,778 IDs and 145,556 images. And then, we sample from the MS1M-v1c dataset to construct long-tail data. The sampling method is concretely expressed in Section III-C. The samples in head classes are sufficient. Closer to the tail classes, the number of samples per ID goes down but there are at least two images as input for training. For deep data, we employ the full MS1M-v1c which has 44 images per ID on average. As a real-world surveillance face recognition benchmark, QMUL-SurvFace [51] is utilized for the experiment of pretrain-finetune as well.

Test Data To evaluate comprehensively, we adopt LFW [52], BLUFR [53], AgeDB-30 [54], CFP-FP [55], CALFW [56], CPLFW [57], MegaFace [58], IJB-B [59], IJB-C [60] and QMUL-SurvFace [51] datasets. AgeDB-30 and CALFW concentrate on cross-age face verification. CFP-FP and CPLFW aim at face verification with different poses. Based on the LFW dataset, BLUFR is a benchmark protocol that contains verification (VR) and open-set identification (DIR) scenarios, with a focus on the low false accept rate (FAR). VR @FAR=1e-5 is adopted for performance measurement in our experiments. MegaFace evaluates the performance of face recognition at the million scales of distractors. IJB-C is a further extension of IJB-B (the extension of IJB-A [61]), the former having more subjects with still images and more frames from videos. Compared with the above benchmark, the QMUL-SurvFace test set places emphasis on real-world surveillance face recognition.

Prepossessing Detection is carried out on each original image by the FaceBoxes [62]. Then, we employ five facial landmarks [63] for alignment and cropping to 144 $\times$ 144 RGB images and augment data through resizing, rotation, grayscale conversion, and horizontal flipping.

CNN Architecture To reduce time overhead while ensuring good performance, MobileFaceNet [64] is chosen in the ablation study and the experiments with various loss functions. Besides, we use Attention-56 [65] in the synthetic long-tail data, deep data, and transfer learning. The output is a 512-dimension feature. In addition, we adopt extra backbones to prove the convergence of MASST with various architectures, including VGG-16 [66], SE-ResNet-18 [67], ResNet-50 and -101 [3].

Training and Evaluation We employ four NVIDIA Tesla P40 GPUs for training. The batch size is 256 and the learning rate begins with 0.05. For the shallow data and long-tail data, the learning rate is divided by 10 at the 36k, 54k iterations and the training process is finished at 64k iterations. For the deep data, the learning rate is divided at the 72k, 96k, 112k iterations, and finished at 120k iterations. For transfer learning, the learning rate starts from 0.001, and we divide it by 10 at the 6k, 9k iterations, and finish at 10k iterations. The momentum is 0.9 and the weight decay is 5e-4. The number of classes in training datasets determines the size of the gallery queue, so we empirically set it as 16,384 for shallow, long-tail, and deep data, and 2,560 for QMUL-SurvFace. The scale parameter $s$ is fixed as 30 and the number of gallery agents $n$ in MASST is set to 3. In the evaluation stage, the last layer output from the probe network is extracted as the face representation. In addition, we utilize the cosine similarity as the similarity metric. In order to evaluate strictly and precisely, we remove all the overlapping IDs between training and test datasets according to the list [68].

Loss Function The proposed method can be flexible integrated with various training loss functions. We choose classification and embedding learning loss functions as the baseline simultaneously, and then integrate them with MASST to make comparisons. The classification loss functions include A-softmax [69], AM-softmax [70], Arc-softmax [71], AdaCos [72], MV-softmax [73], DP-softmax [15] and Center loss [17]. The embedding learning methods include Contrastive [74], Triplet [2] and N-pairs [75].

TABLE II: Ablation study on shallow data. Performance (

\%

) on LFW, AgeDB-30, CFP-FP, CALFW, CPLFW and BLUFR (VR @FAR=1e-5).

	LFW	AgeDB	CFP	CALFW	CPLFW	BLUFR
softmax
Convt.	92.64	73.96	70.80	73.05	62.64	27.05
A	91.36	71.85	69.00	72.14	61.35	24.87
B	93.43	76.00	71.46	74.65	62.68	30.65
C	96.62	82.63	79.10	80.18	67.55	52.05
D	98.32	88.77	84.81	86.63	74.80	69.93
SST	98.77	91.60	88.63	89.82	78.43	77.58
MASST-MA	98.80	91.15	87.70	89.62	77.92	78.42
MASST	98.85	91.63	88.13	89.80	78.53	81.53
A-softmax
Convt.	94.67	77.88	72.90	75.85	64.00	37.16
A	93.76	76.79	71.35	74.56	62.80	35.18
B	94.62	78.08	74.03	76.35	63.87	38.35
C	96.32	82.28	81.30	81.05	68.77	57.13
D	97.52	85.83	81.87	83.88	71.03	60.79
SST	98.98	91.88	89.54	89.73	77.68	80.65
MASST-MA	98.87	91.05	88.47	89.47	77.92	79.71
MASST	99.03	91.72	89.29	90.03	78.02	82.45
AM-softmax
Convt.	92.75	75.30	68.74	76.63	63.63	33.23
A	92.35	74.12	68.08	74.89	62.76	32.12
B	93.25	76.16	69.17	77.78	63.88	36.59
C	98.02	86.37	85.17	85.72	72.83	62.07
D	98.30	88.18	87.31	87.93	76.27	75.46
SST	98.97	92.25	88.97	90.23	79.45	84.95
MASST-MA	98.97	91.63	87.89	89.88	79.32	86.74
MASST	99.12	92.73	89.23	90.58	79.82	86.79
Arc-softmax
Convt.	94.32	77.80	71.25	78.15	65.45	40.34
A	93.60	77.35	70.59	77.78	64.28	40.08
B	94.48	78.42	72.15	78.65	65.78	42.50
C	98.20	85.28	81.50	83.50	71.32	60.67
D	98.08	88.68	84.54	86.92	74.40	68.84
SST	98.95	91.73	88.59	89.85	79.60	82.74
MASST-MA	98.97	91.18	88.70	89.53	78.88	82.68
MASST	99.10	92.08	89.24	90.52	79.40	84.67

IV-B Ablation Study

To analyze comprehensively, we compare SST and MASST with the different choices aforementioned, such as the network constraint ( $\|\phi_{g}-\phi_{p}\|<\epsilon^{\prime}$ ) and the prototype constraint ( $\beta(\alpha-\|w_{y}\|)$ ). Table II compares their performance with four basic loss functions (softmax, A-Softmax, AM-Softmax and Arc-softmax). In this table, “Convt.” denotes the plain training, “A” denotes the prototype constraint, “B” denotes the network constraint, “C” denotes the gallery queue, “D” denotes the combination of “B” and “C”, “SST” denotes the ultimate scheme of Semi-Siamese Training which includes the moving-average updating Semi-Siamese networks and the training scheme with gallery queue, “MASST-MA” denotes the ultimate scheme of Multi-Agent Semi-Siamese Training which replaces the single gallery network in SST with a gallery network stack that involves multiple agents and updates by the moving-average. “MASST” denotes the gallery network stack in MASST updates in a composite way. From Table II, we can conclude: (1) enlarging $w_{y}$ in every dimension indiscriminately is not effective in shallow face learning, as the prototype constraint “A” leads to decrease in most terms; (2) compared with the network constraint “B” and the gallery queue “C”, the combination of them “D” obtains further improvement; (3) SST employs moving-average updating and gallery queue, and gets better performance; (4) finally, MASST sets up the gallery network stack including multiple agents and designs a more appropriate update mode, which achieves the best results by most of the terms. The comparison indicates MASST can address the problem in shallow face learning and obtain a slight improvement in most test accuracy compared with SST.

TABLE III: Comparison of MASST and SST on synthetic long-tail data training with MobileFaceNet (

r\in(0,0.1]

). In MegaFace, “Id.” refers to face identification rank1 accuracy with 1M distractors, and “Veri.” refers to face verification rate at 1e-6 FAR.

$r$	Method	LFW	AgeDB	CFP	CALFW	CPLFW	BLUFR	MegaFace
$r$	Method	LFW	AgeDB	CFP	CALFW	CPLFW	BLUFR	Id.	Veri.
0.02	SST	99.38	93.37	92.34	91.87	82.58	87.10	77.78	82.77
0.02	MASST	99.50	94.70	93.10	92.45	82.45	91.58	80.61	83.76
0.04	SST	99.33	93.92	92.53	91.78	82.70	88.77	78.74	82.73
0.04	MASST	99.50	93.85	92.86	92.58	82.48	89.22	79.92	83.31
0.06	SST	99.35	93.83	91.90	92.20	82.00	87.07	77.78	81.92
0.06	MASST	99.40	94.25	92.86	91.87	82.13	87.48	79.97	82.59
0.08	SST	99.37	93.60	91.81	92.18	82.08	87.65	76.39	82.60
0.08	MASST	99.55	93.85	92.56	91.95	82.23	89.52	79.39	82.73
0.1	SST	99.38	93.85	92.09	91.80	81.97	89.05	76.48	80.20
0.1	MASST	99.43	94.02	91.81	92.33	81.78	88.72	77.90	83.47

TABLE IV: Performance (

\%

) on MegaFace (face verification rate at 1e-6 FAR) as the degree of long-tail distribution intensifies.

$r$	0	0.05	0.10	0.15	0.20	0.25	0.30
softmax
Convt.	92.00	90.70	86.68	82.43	64.10	41.20	21.49
SST	93.23	91.37	89.62	86.62	86.03	85.68	80.90
MASST	94.76	92.95	90.57	87.13	86.89	86.57	80.05
AM-softmax
Convt.	96.35	94.82	93.67	91.60	89.57	69.42	39.68
SST	96.96	95.67	93.96	91.94	89.76	87.14	85.61
MASST	97.18	96.40	94.42	91.83	90.18	87.68	85.90
Arc-softmax
Convt.	96.81	95.73	93.82	91.73	87.65	77.55	38.82
SST	96.50	95.71	94.95	92.71	88.02	86.94	80.46
MASST	96.67	95.56	94.57	93.16	88.01	87.46	82.12

IV-C Advantage of MASST on Shallow Data

To prove the effectiveness of MASST on shallow data learning, we not only train the network with various loss functions, but also employ it to train different CNN architectures.

IV-C1 MASST with Various Loss Functions

First, we train the network on the shallow data with various loss functions and test it on BLUFR at FAR=1e-5 (the blue bars in Fig. 6). The loss functions include classification and embedding ones such as softmax, A-softmax, AM-softmax, Arc-softmax, AdaCos, MV-softmax, DP-softmax, Center loss, Contrastive, Triplet, and N-pairs. Then, we train the same network with the same loss functions on the shallow data, but with SST scheme and MASST scheme respectively. As shown in Fig. 6, MASST can be flexibly integrated with every loss function. Compared with SST (the orange bars), MASST obtains a larger increase for shallow face learning (the green bars). Moreover, MASST is proved to be effective when facing the hard example mining strategies by the result of training on MV-softmax and embedding losses.

IV-C2 MASST with Various Network Architectures

TABLE V: Comparison of MASST, SST, and the conventional training on deep data with various loss settings.

Method	LFW	AgeDB	CFP	CALFW	CPLFW	MegaFace
Method						Id.	Veri.
Convt.(softmax)	99.58	95.33	92.66	93.18	84.47	89.89	92.00
Convt.(AM-softmax) [70]	99.70	97.03	94.17	94.41	87.00	95.67	96.35
Convt.(Arc-softmax) [71]	99.73	97.18	94.37	95.25	87.05	96.10	96.81
Convt.(DP-softmax) [15]	99.63	95.68	91.74	93.03	83.88	89.27	90.94
Convt.(Contrastive) [74]	99.50	92.91	91.76	87.56	80.13	60.14	63.59
Convt.(Triplet) [2]	99.47	93.32	94.49	89.32	82.25	65.65	69.18
Convt.(N-pairs) [75]	99.53	94.58	93.43	92.10	83.15	76.87	78.28
SST (softmax)	99.67	96.37	94.96	94.18	85.82	91.01	93.23
SST (AM-softmax)	99.75	97.20	95.10	94.62	88.35	96.27	96.96
SST (Arc-softmax)	99.77	97.12	95.96	94.78	87.15	95.63	96.50
SST (DP-softmax)	99.68	96.24	94.56	93.78	86.04	92.08	93.57
SST (Contrastive)	99.56	93.14	92.71	92.13	81.78	77.59	82.44
SST (Triplet)	99.50	94.30	93.30	92.05	82.67	81.76	83.29
SST (N-pairs)	99.65	96.12	94.86	94.32	84.74	91.72	93.48
MASST (softmax)	99.67	97.23	95.37	94.97	87.05	92.64	94.76
MASST (AM-softmax)	99.77	97.70	95.61	95.10	88.63	96.15	97.18
MASST (Arc-softmax)	99.76	97.37	96.23	95.08	87.95	95.86	96.67
MASST (DP-softmax)	99.73	97.26	95.17	94.93	87.39	93.70	94.73
MASST (Contrastive)	99.57	93.88	93.21	91.70	83.27	81.63	84.52
MASST (Triplet)	99.58	94.62	94.19	92.32	84.54	84.43	86.31
MASST (N-pairs)	99.65	96.73	95.24	95.02	86.18	93.36	94.89

TABLE VI: QMUL-SurvFace evaluation. “TPR(%)@FAR” includes the true positive verification rate at varying FARs, and “TPIR20(%)@FPIR” includes rank-20 true positive identification rate at varying false positive identification rates.

Method	TPR(%)@FAR				TPIR20(%)@FPIR
Method	0.3	0.1	0.01	0.001	0.3	0.2	0.1	0.01
Convt.(softmax)	73.09	52.29	26.07	12.54	8.09	6.25	3.98	1.13
Convt.(AM-softmax) [70]	69.59	47.67	23.90	13.24	9.07	7.14	4.65	1.34
Convt.(Arc-softmax) [71]	68.14	48.65	24.12	11.34	8.77	6.88	4.79	1.36
Convt.(DP-softmax) [15]	76.32	55.85	25.32	11.64	7.50	5.38	3.38	0.95
Convt.(Contrastive) [74]	84.48	67.99	31.87	5.31	9.16	6.91	4.44	0.10
Convt.(Triplet) [2]	85.59	69.61	33.76	7.20	10.14	7.70	4.75	0.37
Convt.(N-pairs) [75]	87.26	67.04	29.67	12.07	10.75	8.09	4.87	0.41
SST(softmax)	81.08	63.41	34.20	19.03	11.24	8.49	5.28	1.21
SST(AM-softmax)	86.49	69.41	36.21	18.51	12.22	9.51	5.85	1.63
SST(Arc-softmax)	87.00	68.21	35.72	22.18	12.38	9.71	6.61	1.72
SST(DP-softmax)	87.69	69.69	36.32	14.83	10.20	7.83	5.14	1.08
SST(Contrastive)	87.54	69.91	32.15	9.58	9.87	7.38	4.76	0.78
SST(Triplet)	90.65	73.35	33.85	12.48	11.09	8.14	5.27	0.92
SST(N-pairs)	89.31	71.26	32.34	15.96	11.30	9.13	5.68	1.22
MASST(softmax)	81.78	62.87	33.01	16.26	11.71	9.07	5.78	1.35
MASST(AM-softmax)	86.70	68.75	36.13	18.72	11.94	9.18	6.12	1.65
MASST(Arc-softmax)	85.99	69.08	37.46	21.97	12.27	9.21	5.64	1.68
MASST(DP-softmax)	86.61	67.31	36.62	20.62	11.18	8.31	5.65	1.25
MASST(Contrastive)	89.27	73.28	37.03	19.17	10.25	7.90	4.26	0.43
MASST(Triplet)	90.10	74.18	36.91	10.00	12.47	9.96	6.63	0.63
MASST(N-pairs)	88.67	69.89	36.86	16.87	11.84	8.97	5.57	1.30

To demonstrate the stable convergence in the training, we employ MASST to train different CNN architectures, including MobileFaceNet, VGG-16, SE-ResNet-18, Attention-56, ResNet-50, and -101. The test accuracy of each network on BLUFR is shown in Fig. 7. The blue bars refer to the network trained by conventional training, the green bars refer to the network trained by MASST, and the orange bars refer to the network trained by SST. For conventional training, the test accuracy decreases with the deeper network architectures, showing that the larger model size exacerbates the model degeneration and over-fitting. In contrast, as the network becomes heavy, the test accuracy of SST and MASST both increase, showing that they make increasing contributions with more complicated architectures and MASST shows advantages compared with SST.

IV-D Advantage of MASST on Long-tail Data

First, we set $r$ at 0.02 intervals between 0 and 0.1 to compare the performance of SST and MASST on LFW, AgeDB-30, CFP-FP, CALFW, CPLFW, BLUFR, and MegaFace. In order to balance performance and the time costs, we choose softmax as the loss function and use the MobileFaceNet to conduct this experiment. The result is shown in Table III. We can find that MASST gains higher accuracy in most of the test sets such as LFW, AgeDB-30, CFP-FP, CALFW, and BLUFR. In particular, it shows greater advantages than SST on MegaFace.

Then, we compare the performance of conventional training, SST and MASST on IJB-B and IJB-C. The results of experiments are shown in the supplementary material. With MASST, there are some improvements for different cases of loss functions both on IJB-B and IJB-C.

To observe the performance trends of MASST, SST, and conventional training when $r$ is changed, we set $r$ at 0.05 intervals between 0 and 0.3 (if $r>0.3$ , too few samples to get satisfied training results). The number of images in the synthetic long-tail face data for different $r$ is shown in Fig. 3 (b). We employ Attention-56 and train the network with various loss functions such as softmax, AM-softmax, and Arc-softmax. Specifically, compare the performance of conventional training, SST, and MASST on the verification task of MegaFace.

As shown in Table IV, with the increase of $r$ , i.e. the degree of long-tail distribution is intensified, the verification rate of conventional training decreases significantly. By contrast, MASST is more adaptable and achieves better performance than SST in long-tail face learning.

IV-E Stable Advantage on Deep Data

The previous experiments show MASST has well tackled the problems in shallow face learning and long-tail face learning and obtained significant improvement in test accuracy. To further explore the advantage of MASST for wider application, we adopt MASST scheme on the deep data (full version of MS1M-v1c), and make comparisons with the conventional training and SST. Table V shows the performance on LFW, AgeDB-30, CFP-FP, CALFW, CPLFW and MegaFace. MASST gains the leading accuracy in most of the test sets, and also the competitive results on CALFW and the face identification rank1 accuracy with 1M distractors. MASST (softmax) achieves about two percent improvement than conventional training (softmax) on AgeDB-30, CFP-FP, CALFW, and CPLFW which include the hard cases of large face pose or large age gap. Compared with SST, MASST shows better network property. Notably, MASST and SST reduce a large number of FC parameters by which the classification loss is computed for the conventional training.

IV-F Stable Advantage on Domain Transfer

The public training datasets, such as MS-Celeb-1M and VGGFace2, are different from the captured face images in real-world scenarios of face recognition. They are consist of well-posed face images collected from the internet. However, in real-world applications, the external conditions are more complicated. To address this issue, the typical routine is to pretrain a network on the public training datasets and fine-tune it on real-world face data. Although MASST is proved to be superior in shallow and long-tail face learning, we still look forward to extending it to the challenge on domain transfer. So, we conduct an extra experiment with pretraining on MS1M-v1c and finetuning on QMUL-SurvFace in this subsection. From Table VI, we can find that, no matter for classification learning or embedding learning, MASST boosts the performance significantly in both verification and identification, compared with the conventional training and SST.

V Conclusion

In this paper, we focus on shallow and long-tail face learning. We analyze the causes of these problems and verify the existing training methods are not effective enough to solve them. The shallow and imbalanced data increases the training difficulty. Besides, the feature space collapse leads to model degeneration and over-fitting. Then, we propose Multi-Agent Semi-Siamese Training (MASST) to address these issues in shallow and long-tail face learning. Specifically, MASST employs the Semi-Siamese networks with a gallery network stack that involves multiple agents to satisfy Lipschitz continuity and constructs the gallery queue with gallery features. We conduct extensive experiments to verify it can obtain leading performance when it combines with the existing loss functions and network architectures flexibly. MASST has great improvement compared with conventional training. The extra experiments further explore the advantage of MASST for wider applications, such as deep data training and domain transfer.

References

[1] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. So Kweon, “Attentionnet: Aggregating weak directions for accurate object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2659–2667.
[2] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[4] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
[5] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European conference on computer vision. Springer, 2016, pp. 87–102.
[6] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 67–74.
[7] H. Du, H. Shi, Y. Liu, J. Wang, Z. Lei, D. Zeng, and T. Mei, “Semi-siamese training for shallow face learning,” in European Conference on Computer Vision. Springer, 2020, pp. 36–53.
[8] L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 4, pp. 594–611, 2006.
[9] B. Hariharan and R. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3018–3027.
[10] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Feature transfer learning for deep face recognition with under-represented data,” arXiv preprint arXiv:1803.09014, 2018.
[11] Y. Guo and L. Zhang, “One-shot face recognition by promoting underrepresented classes,” arXiv preprint arXiv:1707.05574, 2017.
[12] Y. Wu, H. Liu, and Y. Fu, “Low-shot face recognition with hybrid classifiers,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 1933–1939.
[13] Y. Cheng, J. Zhao, Z. Wang, Y. Xu, K. Jayashree, S. Shen, and J. Feng, “Know you at one glance: A compact vector representation for low-shot learning,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 1924–1932.
[14] L. Wang, Y. Li, and S. Wang, “Feature learning for one-shot face recognition,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 2386–2390.
[15] X. Zhu, H. Liu, Z. Lei, H. Shi, F. Yang, D. Yi, G. Qi, and S. Z. Li, “Large-scale bisample learning on id versus spot face recognition,” International Journal of Computer Vision, vol. 127, no. 6-7, pp. 684–700, 2019.
[16] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Feature transfer learning for face recognition with under-represented data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5704–5713.
[17] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision. Springer, 2016, pp. 499–515.
[18] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao, “Range loss for deep face recognition with long-tailed training data,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5409–5418.
[19] S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, and R. Togneri, “Cost-sensitive learning of deep feature representations from imbalanced data,” IEEE transactions on neural networks and learning systems, vol. 29, no. 8, pp. 3573–3587, 2017.
[20] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[21] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
[22] Y. He, W.-J. Hsu, and C. E. Leiserson, “Provably efficient online nonclairvoyant adaptive scheduling,” IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 9, pp. 1263–1279, 2008.
[23] C. Drummond, R. C. Holte et al., “C4. 5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling,” in Workshop on learning from imbalanced datasets II, vol. 11. Citeseer, 2003, pp. 1–8.
[24] K. M. Ting, “A comparative study of cost-sensitive boosting algorithms,” in In Proceedings of the 17th International Conference on Machine Learning. Citeseer, 2000.
[25] M. Dixit, R. Kwitt, M. Niethammer, and N. Vasconcelos, “Aga: Attribute-guided augmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7455–7463.
[26] E. Schwartz, L. Karlinsky, J. Shtok, S. Harary, M. Marder, R. Feris, A. Kumar, R. Giryes, and A. M. Bronstein, “Delta-encoder: an effective sample synthesis method for few-shot object recognition,” arXiv preprint arXiv:1806.04734, 2018.
[27] Z. Ding, Y. Guo, L. Zhang, and Y. Fu, “Generative one-shot face recognition,” arXiv preprint arXiv:1910.04860, 2019.
[28] Y. Ma, M. Kan, S. Shan, and X. Chen, “Hierarchical training for large scale face recognition with few samples per subject,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 2401–2405.
[29] C. Simon, P. Koniusz, R. Nock, and M. Harandi, “Adaptive subspaces for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4136–4145.
[30] R. Salakhutdinov, A. Torralba, and J. Tenenbaum, “Learning to share visual appearance for multiclass object detection,” in CVPR 2011. IEEE, 2011, pp. 1481–1488.
[31] X. Zhu, D. Anguelov, and D. Ramanan, “Capturing long-tail distributions of object subcategories,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 915–922.
[32] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 3730–3738.
[33] W. Ouyang, X. Wang, C. Zhang, and X. Yang, “Factors in finetuning deep model for object detection with long-tail distribution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 864–873.
[34] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie, “Large scale fine-grained categorization and domain-specific transfer learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4109–4118.
[35] C. Huang, Y. Li, C. C. Loy, and X. Tang, “Learning deep representation for imbalanced classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5375–5384.
[36] Y. Wu, H. Liu, J. Li, and Y. Fu, “Deep face recognition with center invariant loss,” in Proceedings of the on Thematic Workshops of ACM Multimedia 2017, 2017, pp. 408–414.
[37] Y. Zheng, D. K. Pal, and M. Savvides, “Ring loss: Convex feature normalization for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5089–5097.
[38] B. Liu, W. Deng, Y. Zhong, M. Wang, J. Hu, X. Tao, and Y. Huang, “Fair loss: Margin-aware reinforcement learning for deep face recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10 052–10 061.
[39] X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory undersampling for class-imbalance learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539–550, 2008.
[40] M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study of the class imbalance problem in convolutional neural networks,” Neural Networks, vol. 106, pp. 249–259, 2018.
[41] Y. Li and N. Vasconcelos, “Repair: Removing representation bias by dataset resampling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9572–9581.
[42] Y. Zhong, W. Deng, M. Wang, J. Hu, J. Peng, X. Tao, and Y. Huang, “Unequal-training for deep face recognition with long-tailed noisy data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7812–7821.
[43] Y. Zhang and W. Deng, “Class-balanced training for deep face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 824–825.
[44] C. Huang, Y. Li, C. C. Loy, and X. Tang, “Deep imbalanced learning for face recognition and attribute prediction,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 11, pp. 2781–2794, 2019.
[45] Y. Ma, M. Kan, S. Shan, and X. Chen, “Learning deep face representation with long-tail data: An aggregate-and-disperse approach,” Pattern Recognition Letters, vol. 133, pp. 48–54, 2020.
[46] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[47] S. Z. Li, Z. Zhang, and L. Wu, “Markov-lipschitz deep learning,” arXiv preprint arXiv:2006.08256, 2020.
[48] Y. Nesterov et al., Lectures on convex optimization. Springer, 2018, vol. 137.
[49] Z. Yao, A. Gholami, D. Arfeen, R. Liaw, J. Gonzalez, K. Keutzer, and M. Mahoney, “Large batch size training of neural networks with adversarial training and second-order information,” arXiv preprint arXiv:1810.01021, 2018.
[50] http://trillionpairs.deepglint.com/overview.
[51] Z. Cheng, X. Zhu, and S. Gong, “Surveillance face recognition challenge,” arXiv preprint arXiv:1804.09691, 2018.
[52] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database forstudying face recognition in unconstrained environments,” in Workshop on faces in’Real-Life’Images: detection, alignment, and recognition, 2008.
[53] S. Liao, Z. Lei, D. Yi, and S. Z. Li, “A benchmark study of large-scale unconstrained face recognition,” in IEEE international joint conference on biometrics. IEEE, 2014, pp. 1–8.
[54] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou, “Agedb: the first manually collected, in-the-wild age database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 51–59.
[55] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs, “Frontal to profile face verification in the wild,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9.
[56] T. Zheng, W. Deng, and J. Hu, “Cross-age lfw: A database for studying cross-age face recognition in unconstrained environments,” arXiv preprint arXiv:1708.08197, 2017.
[57] T. Zheng and W. Deng, “Cross-pose lfw: A database for studying cross-pose face recognition in unconstrained environments,” Beijing University of Posts and Telecommunications, Tech. Rep, vol. 5, 2018.
[58] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, “The megaface benchmark: 1 million faces for recognition at scale,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4873–4882.
[59] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. Adams, T. Miller, N. Kalka, A. K. Jain, J. A. Duncan, K. Allen et al., “Iarpa janus benchmark-b face dataset,” in proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 90–98.
[60] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney et al., “Iarpa janus benchmark-c: Face dataset and protocol,” in 2018 International Conference on Biometrics (ICB). IEEE, 2018, pp. 158–165.
[61] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, and A. K. Jain, “Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1931–1939.
[62] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, “Faceboxes: A cpu real-time face detector with high accuracy,” in 2017 IEEE International Joint Conference on Biometrics (IJCB). IEEE, 2017, pp. 1–9.
[63] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu, “Wing loss for robust facial landmark localisation with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2235–2245.
[64] S. Chen, Y. Liu, X. Gao, and Z. Han, “Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices,” in Chinese Conference on Biometric Recognition. Springer, 2018, pp. 428–438.
[65] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3156–3164.
[66] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[67] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[68] X. Wang, S. Wang, J. Wang, H. Shi, and T. Mei, “Co-mining: Deep face recognition with noisy labels,” in Proceedings of the IEEE international conference on computer vision, 2019, pp. 9358–9367.
[69] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 212–220.
[70] F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
[71] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
[72] X. Zhang, R. Zhao, Y. Qiao, X. Wang, and H. Li, “Adacos: Adaptively scaling cosine logits for effectively learning deep face representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 823–10 832.
[73] X. Wang, S. Zhang, S. Wang, T. Fu, H. Shi, and T. Mei, “Mis-classified vector guided softmax loss for face recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12 241–12 248.
[74] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” Advances in neural information processing systems, vol. 27, pp. 1988–1996, 2014.
[75] K. Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” Advances in neural information processing systems, vol. 29, pp. 1857–1865, 2016.

Multi-Agent Semi-Siamese Training for Long-tail and Shallow Face Learning