Relational Deep Feature Learning
for Heterogeneous Face Recognition

MyeongAh Cho, Taeoh Kim, Ig-Jae Kim, Kyungjae Lee, and Sangyoun Lee This research was supported by R&D program for Advanced Integrated-intelligence for Identification (AIID) through the National Research Foundation of KOREA(NRF) funded by Ministry of Science and ICT (NRF-2018M3E3A1057289) M.Cho, T.Kim and S.Lee are with the School of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea (e-mail: [email protected]; [email protected]; [email protected]).I.Kim is with the Center for Imaging Media Research, Korea Institute of Science and Technology, Seoul, South Korea (e-mail: [email protected]).K.Lee is with the Department of Computer Science, Yongin University, Yongin, South Korea ([email protected]).

Abstract

Heterogeneous Face Recognition (HFR) is a task that matches faces across two different domains such as visible light (VIS), near-infrared (NIR), or the sketch domain. Due to the lack of databases, HFR methods usually exploit the pre-trained features on a large-scale visual database that contain general facial information. However, these pre-trained features cause performance degradation due to the texture discrepancy with the visual domain. With this motivation, we propose a graph-structured module called Relational Graph Module (RGM) that extracts global relational information in addition to general facial features. Because each identity’s relational information between intra-facial parts is similar in any modality, the modeling relationship between features can help cross-domain matching. Through the RGM, relation propagation diminishes texture dependency without losing its advantages from the pre-trained features. Furthermore, the RGM captures global facial geometrics from locally correlated convolutional features to identify long-range relationships. In addition, we propose a Node Attention Unit (NAU) that performs node-wise recalibration to concentrate on the more informative nodes arising from relation-based propagation. Furthermore, we suggest a novel conditional-margin loss function ( $C$ -softmax) for the efficient projection learning of the embedding vector in HFR.

The proposed method outperforms other state-of-the-art methods on five HFR databases. Furthermore, we demonstrate performance improvement on three backbones because our module can be plugged into any pre-trained face recognition backbone to overcome the limitations of a small HFR database

Index Terms:

Heterogeneous face recognition, relation embedding, graph structured module, face recognition

I Introduction

Face recognition, a task that aims to match facial images of the same person, has developed rapidly with the advent of deep learning. The features extracted through the multiple-hidden layers of deep convolutional neural network (DCNNs) contain representative information that is used to distinguish an individual[1]. However, when recognizing a face via representative features, variations such as pose, illumination, or facial expression create difficulties[2, 3, 4]. Unlike general face recognition within the visible spectrum, the Heterogeneous Face Recognition (HFR) aims to match faces across different domains such as visible light (VIS), near-infrared (NIR), or the sketch domain[5, 6]. Face recognition over different domains is important, since NIR images acquired with infrared cameras contain more useful information when visible light is lacking, while sketch-to-photo matching is important in law enforcement for rapidly identifying suspects[1]. As such, HFR can be a practical application for biometric security control or surveillance cameras under low-light scenarios[7].

Refer to caption — (a) CASIA NIR-VIS 2.0

HFR has several challenging issues, the biggest of which is the large gap between data domains. When HFR is performed with a face recognition network pre-trained on VIS face images, accuracy is significantly reduced. This is because the difference between the distributions of VIS and non-VIS data is large. Therefore, we need to reduce the domain gap through either learning domain-invariant features or common space projection methods. Another issue is the lack of HFR databases. Deep learning-based face recognition networks are usually trained with large-scale visual database such as MS-Celeb-1M[11] which consists of 10 million images with 85 thousand identities, or MegaFace[12], with 4.7 million images. By comparison, the typical HFR database has a small number of images and subjects, which causes overfitting in a deep network and makes learning a general feature difficult. Therefore, most HFR tasks are fine-tuned using a backbone which is pre-trained on a large visual database.

To solve these problems, several works [6, 13, 14] use image synthesis method to transform input images from non-VIS domain to VIS domains and recognize faces in the same domain feature space. Although this approach may create a similar domain by the the data transform, it is difficult to generate good-quality transform images with a small amount of data; this greatly impacts performance and the approach does not reduce the gap between domain properties. Other studies train the network to learn NIR-VIS-invariant features by using the Wasserstein distance[15], variational formulation[16], a triplet loss function[17], or domain-specific module [18]. These domain-invariant approaches force the network to reduce the domain gap implicitly, which makes learning and designing a network challenging. Therefore, we propose a graph-structured module that reduces the fundamental differences in heterogeneous domain characteristics by extracting global relational information using general facial features from the large-scale visual dataset.

For many computer vision tasks, the relational information within the image or video is important, in the same way that human visual processing can easily perform recognition by capturing relations. Since each identities’ relational information between intra facial parts are very similar in any modality, it is suitable for reducing the gap between domains in HFR. With our proposed Relational Graph Module (RGM), each component of the face is embedded to a node vector and edges are computed by modeling the relationships among nodes. Through graph propagation with generated nodes and edges, we create a relational node vector containing the overall relationship nodes; and perform node-wise recalibration through the correlation information for these nodes with Node Attention Unit (NAU). Also, we suggest a conditional-margin loss function ( $C$ -softmax) to learn with an efficient space margin between inter-classes when data from two different domains is projected into one latent space.

In this paper, our main contributions are as follows:

•

We propose a graph-structured module, RGM, to reduce the fundamental domain gap by modeling face components as node vectors and relational information edges. We also perform a recalibration by considering global node correlation via NAU.
•

In order to project features from different domains into common latent space efficiently, we suggest a $C$ -softmax that uses the inter-class margin conditionally.
•

The proposed module can overcome the limitation of HFR databases by plugging to a general feature extractor; we experimentally demonstrate superior performance for three different backbones and five HFR databases.

The organization of the paper is as follows. In Section II, we introduce three different approaches to HFR and briefly describe some relation capturing methods. Then, in Section III, we begin by presenting a preliminary version of this work, Relation Module (RM)[19], and explain our proposed RGM, NAU, and $C$ -softmax approaches for HFR. Next, in Section IV, the experimental results and related discussions are provided, prior to conclusion in Section V.

II Related Works

As stated in Section I, the challenge of the HFR task to match identities using conventional face recognition networks despite such domain differences as texture or style. The examples in Figure 1 of NIR-to-VIS and Sketch-to-Photo databases illustrate the gaps between the domains which can depend on variations in illumination or on the artist’s sketch style. Therefore, methods for reducing domain discrepancy are being studied, and these can be largely divided into projection to common space based methods, image synthesis based methods, and domain-invariant feature based methods. This section summarizes preceding HFR studies and then introduces methods of capturing the relational information within the image that can reduce the fundamental domain difference.

II-A Heterogeneous Face Recognition

II-A1 Projection to Common Space Based Method

Projection based methods involve learning to project features from two different domains into a discriminative common latent space where images with the same identity are close regardless of their domain. Lin and Tang [24] proposed a Common Discriminant Feature Extraction (CDFE) algorithm, in which two different domain features simultaneously learn common space to solve the inter-modality problem. With empirical separability and a local consistency regularization objective function, the model learns compact intra-class space and prevents the overfitting problem. Yi and Liao [25] suggested matching each partial patch of face images by extracting points, edges, or contours that are similar between domains. Lei et al. [26] designed a Coupled Spectral Regression (CPR) method for finding different projective vectors by representing relationships between each images and their embeddings. Different from this coupled method, which learns each domain’s representative vectors separately, Lei et al. [27] suggested the technique of learning the projection from both domains. Since target data neighbors should correspond to source data neighbors, Shao et al. [28] matched projected target and source data by using a reconstruction coefficient matrix, Z, in the common space.

With deep neural networks (DNNs) showing great improvement in face recognition performance, Sarfraz and Stiefelhagen [29] used a deep perceptual mapping (DPM) method in which DNNs learn projection of visual and thermal features together. In [30], Reale et al. used coupled NIR and VIS CNNs, initializing them with a pre-trained face recognition network to extract global features. Wu et al. [16] proposed a Coupled Deep Learning (CDL) method with relevance constraints as a regularizer and a cross-modal ranking objective function. However, these methods are difficult to train because they require extraction of domain-specific features with a small database.

II-A2 Image Synthesis Based Method

Image synthesis based methods transform face images from one domain into the other so as to perform recognition in the same modality. Liu et al. [31] proposed a pseudo-sketch synthesis method which divides a photo image into a fixed number of patches and reconstructs each patch as a corresponding sketch patch. This patch-based strategy preserves local geometry while transforming the photo image into a sketch-style image. In [32], Wang et al. proposed a multi-scale Markov network, conducting brief propagation to transform multi-scale patches. Recently, with the widespread development of generative adversarial networks (GANs) [33], many studies have focused on generating visual face images from non-visual ones[14], [6]. In [13] Song et al. transformed NIR face images to VIS face images by pixel space adversarial learning using CycleGAN[34] and feature space adversarial learning with a variance discrepancy loss function.

These methods of transforming an image from one domain to another can be effective for visually similar domain, but do not fundamentally address the modality discrepancy that the data exhibits. In addition, due to issue of small amounts of unpaired HFR data, GAN-based methods struggle to create good-quality images, which affects performance.

II-A3 Domain-invariant Feature Based Method

Another approach is to use a feature extractor to reduce domain discrepancies and enable learning of domain-invariant features. Since the NIR-to-VIS face recognition task is heavily influenced by the light source in each image, Liu et al. [35] used differential-based band-pass image filters, relying only on variation patterns of skin properties. Lui et al. [17] also proposed a TRIVET loss function which applies Triplet loss [36] to cross-domain sampling to reduce the domain gap. He et al. [5] used a division approach, using two orthogonal spaces to represent invariant identity and light source information. The Wasserstein CNN in [15] is also divided into an NIR-VIS shared layer and a specific layer. The shared layer is designed to learn domain-invariant features by minimizing the Wasserstein distance between different domain data. Based on this approach, the DVR method [37] is proposed to disentangle the cross-domain facial representations with the identity information and within-person variations.

Several studies [38], [39], [40] used relational representations that allow projection heterogeneous data into a common space. In [38], Klare and Jain proposed a random prototype subspace framework to define prototype representations and learn a subspace projection matrix with kernel similarities among face patches. With their G-HFR method [39], Peng et al. employed Markov networks to extract graphical representations. This method finds the $k$ nearest patches of a patch in a probe or gallery image from the representation dataset and linearly combines them to obtain graphical representations. Since finding $k$ nearest patches from the representation process performance score relies heavily on the value of $k$ , Peng et al. [40] proposed an adaptive spare graphical representation method which considers all possible numbers of related image patches. These methods found relations between representation dataset image patches in randomly selected pairs. Unlike these methods, our proposed RGM extracts domain-invariant features by considering global spatial pair-wise relations. We first apply a deep learning based relation approach to the HFR task with graph structured module.

II-B Relation Capturing

In many computer vision tasks such as image classification, video recognition, and so on, it is important to understand the relationships within images or videos. However, simply operating multiple neural network layers often fails to identify long-range relationships as human visual systems do. With CNN bring significant improvements in computer vision, there are many studies underway to extract relational information using the local connectivity and multi-layer structure of CNN.

In [41], Lin et al. captured local pair-wise feature interaction to solve the fine-grained categorization task, in which visual differences are small between classes and can easily be overwhelmed by other factors such as view point or pose. The features from two streams of CNN termed Bilinear CNN or B-CNN are multiplied using an outer product to capture partial feature correlation. Since the face recognition task can be seen as a subarea of fine-grained recognition, Chowdhury et al. [20] applied this bilinear model to face recognition tasks with symmetric B-CNN (see Figure 2(a)). Chen et al. [21] captured relational information with a double attention block consisting of bilinear pooling attention and feature distribution attention (Figure 2(b)). The Non-local block [22] was proposed to operate a weighted sum of all features at each position, showing outperformance in video recognition tasks (Figure 2(c)). To solve the Visual Q&A task, which requires relation reasoning information, Santoro et al. [23] introduced a Relation Network that captures all potential relations for object pairs and it was applied to other tasks [42, 19] (Figure 2(d)).

Recently, graph-based methods have proven effective in relation capturing[43]. While traditional graph analysis has usually relied on hand crafted features, Graph Neural Networks (GNN) can learn nodes or edges update by propagating each layer’s weights. Kipf and Welling [44] proposed a spectral method using a Graph Convolutional Networks (GCN) which inputs graph-structured data and uses multiple hidden layers to learn graph structures and node features. Wang and Gupta [45] applied GCNs in action recognition to understand appearance and temporal functional relationships, while Chen et al.[46] proposed a GloRe module that projects coordinate space features into interaction space to extract relation-aware features, boosting the performance of 2D and 3D CNNs on semantic segmentation, image recognition, and so on. As such, the graph-structured networks are effective for most computer vision tasks where relational information is important. In particular, since the HFR task involves small differences between classes and large within class discrepancies, relation information for faces plays an important role in representing each identity. Compared to Attentional Modules [41, 21, 22], graph-based modules better capture relations, thereby reducing the fundamental domain gap in HFR. We compare existing attentional module-based approaches [41], [22], [21] as well as graph-based module [46] to our module in Section IV-A4.

III Proposed method

In this section, we firstly present our preliminary version of this work called RM [19] and introduce RGM to model and propagate relationships of face components. Secondly, we describe NAU which helps to focus on global node correlation. Last, we introduce a $C$ -softmax, loss function with conditional margin. RGM and NAU are add-on module which can be plugged into any pre-trained face recognition backbones. We experimentally quantify the performance improvement of proposed modules in three different backbones and over five different heterogeneous databases in Section IV.

III-A Relation Module

When an NIR or VIS face image is input to a single face recognition network pre-trained with large-scale visual face images, the network cannot perform well because of the domain discrepancy. In addition, HFR databases are mostly unpaired and consist of much smaller numbers of images than large-scale database such as MS-Celeb-1M[11] and CASIA-WebFace[47], so it is difficult to fine-tune the pre-trained deep networks. To solve this problem, the RM concentrates on relationships of pair-wise face component that less depend on domain information (see Figure 2(d)). The RM is plugged in at the end of the backbone’s convolutional layer and takes an input as a feature map which is the output of the last convolutional layer. This feature map’s spatial-wise vector represents the face component, depending on the CNN’s local connectivity characteristics. From these $N\times N$ number of feature vectors, RM extracts the relationships between component pair. Since this pair-wise relationship is independent of ordering, the total number of combinations is $N^{2}\times\left(N^{2}+1\right)/2$ and an $L$ -dimensional relation vector is extracted from each pair (we use $L$ = 64).

\displaystyle\boldsymbol{RM}\left(\boldsymbol{f}_{i,j}\right)=g_{\theta}\left(\boldsymbol{f}_{i},\boldsymbol{f}_{j}\right)_{i,j=1,...,N^{2}}

(1)

In Equation 1, $g_{\theta}\left(\cdot\right)$ is the relation extracting function with shared learnable weight $\theta$ and $\boldsymbol{f}_{i}$ , $\boldsymbol{f}_{j}$ are the input feature vectors. The relation vector $\boldsymbol{RM}\left(\boldsymbol{f}_{i,j}\right)$ represents the relationship between two parts of the face such as the lips-to-nose or eye-to-eye relationship (e.g., distance, ratio or similarity) within a face. For computation, we concatenate two vectors in each combination and embed it into relation vector by shared fully connected layer. These computed relation vectors are reshaped and embedded into one embedding vector with a fully connected layer.

This process does not need to define actual relationships explicitly but simply looks at all combinations of patches and infers the relationship implicitly. Simply adding the RM can reduce intra-class variation and enlarge the inter-class space by using relational information.

III-B Relational Graph Module

As mentioned, the HFR task suffers from a problem of insufficient data and the difficulty of extracting features that reduce the domain gap. Since we confirmed with the RM that relational information in face images contains domain-invariant information, we propose our RGM for more efficient facial relationship modeling. Because RM considers every pair-wise combination and embeds all of them into $L$ -dimensional vector, it presents a computational complexity issue with an attendant overfitting risk when training on a small HFR database. Therefore, we propose a method of relation exploration through our graph-structured RGM, which consists of a node vector containing the face component information and edges that capture relationship information between node vectors. Figure 3 shows the overall framework.

III-B1 Node Embedding

We first treat the spatial feature vectors extracted through the backbone as initial graph node vectors with dimension $C$ . Then we embed the node vectors into $d$ -dimensional vectors using a transform matrix $\boldsymbol{W}_{1}\in\mathbb{R}^{C\times d}$ . We experiment with the optimal value of this embedding dimension $d$ in Section IV-A2.

III-B2 Relation Propagation Based on Directed Relation Extraction

The feature vectors of the face image from the convolutional layers represent each face component (e.g., eyes, lips, and chin). In the RM, the feature vectors are simply concatenated to extract the relation through the shared fully connected layer. In RGM, after node embedding, we extract the directed edges of each node. Because the components that represent the face are the same for every classes, we generate a fixed number of component nodes rather than selecting nodes (64 nodes are used in this paper).

	$\displaystyle E_{w_{e}}(\boldsymbol{n}_{i},\boldsymbol{n}_{j})={\boldsymbol{W}_{e}}^{T}\left[\boldsymbol{n_{i}},\boldsymbol{n_{j}}\right]$
	$\displaystyle E_{i,j}=E_{w_{e}}(\boldsymbol{n}_{i},\boldsymbol{n}_{j})$
	$\displaystyle{A}_{i,j}=\sigma(E_{i,j})$		(2)

In Equation 2, the edge yielding the relationship between two node vectors $n_{i}$ and $n_{j}$ is obtained through the edge function $E_{w_{e}}(\cdot)$ . Edge $E_{i,j}$ is a scalar value and is calculated as the weighted sum between node vector elements, where the weight $\boldsymbol{W}_{e}$ is a parameter obtained through learning. ${A}_{i,j}$ has a value in the range [0, 1] through the sigmoid function $\sigma({\cdot})$ .

\displaystyle{\boldsymbol{n}_{i}}^{*}=\sum_{k=1}^{N^{2}}{A}_{i,k}\boldsymbol{n}_{k}

(3)

Then, as shown in Equation 3, each node vector propagates in inter-dependency with all other node vectors through the edges to become a propagated node vector ${\boldsymbol{n}_{i}}^{*}$ . Each face component has different relations for each identity and updating the nodes with the relations can concentrate on component relational information rather than visual-domain features such as texture information.

As a point of comparison, Graph Attention Network (GAT)[48] adopt self-attention mechanisms by a learnable parameter $\alpha_{vu}=softmax(g(\boldsymbol{a}^{T}\left[\boldsymbol{W}^{T}\boldsymbol{n}_{v}||\boldsymbol{W}^{T}\boldsymbol{n}_{u}\right]))$ which only compute for neighbor nodes $u\in\mathcal{N}_{v}$ where an adjacency matrix( $\mathcal{N}_{v}$ ) is used to define this neighbor. By updating nodes $\sum_{u\in\mathcal{N}_{v}}\alpha_{vu}{\bf W}^{T}\boldsymbol{n}_{u}$ within an adjacency matrix, where $g(\cdot)$ is a LeakyReLU activation function and $\boldsymbol{a}$ is a vector of learnable parameters. In the RGM, the adjacency matrix and $\alpha$ are learned simultaneously; also, the RGM uses a sigmoid activation function which looks at each value separately that allows for independent values of relations. Since the relation between the nodes is independent of the relations of other nodes, sigmoid activation is more relevant than Softmax which looks all values interrelated in phase and computes the sum of values to 1. We experiment with this activation issue in Section IV-F1.

III-B3 Node Re-Embedding

After propagation, we apply the NAU as an activation function $\delta(\cdot)$ (see Figure 3). The NAU is serves as a node-adaptive activation function, this will be described in detail below. After we recalibrate nodes through the NAU, we use the weight matrix $\boldsymbol{W}_{2}\in\mathbb{R}^{d\times C}$ to re-embed the node vectors into the original input dimension and perform residual summation. Since the feature map from the backbone CNN contains general features of the face, we intended to use general feature and relational information together by residual terms [22, 46, 45]. After summation, we concatenate entire node vectors and embed them into the final representative embedding vector for comparison through the fully connected layer.

III-C Node Attention Unit

In SENet[49], when the convolutional layer computes a spatial and channel-wise combination in the receptive field, recalibration of the feature is performed to boost the network’s representative power. Inspired by this approach, each node vector that contains relational information is recalibrated through the NAU by considering inter-node correlation.

	$\displaystyle z_{i}=\frac{1}{C}\sum^{C}\boldsymbol{{n}_{i}}^{*}(c)$
	$\displaystyle\boldsymbol{s}=\sigma(\boldsymbol{{W}_{b}}^{T}\mathrm{ReLU}(\boldsymbol{{W}_{a}}^{T}\boldsymbol{z}))$
	$\displaystyle F_{recalib}(s_{i},\boldsymbol{{n}_{i}})=s_{i}\boldsymbol{{n}_{i}}$		(4)

In Equation 4, each propagated node $\boldsymbol{{n}_{i}}^{*}$ squeezes information through global average pooling to vector $\boldsymbol{z}$ . The node-squeezed vector is then aggregated through weights $\boldsymbol{W_{a}}$ and $\boldsymbol{W_{b}}$ to a node-wise scale vector $\boldsymbol{s}$ . Then, with recalibration function $F_{recalib}(\cdot)$ , each node vector is scaled (see Figure 4). This process yields a recalibration effect according to the global importance of nodes and focusing attention on the characteristic aspect of the identity. In contrast to the SENet, squeeze spatial dimension and recalibrate channels, we squeeze channels and recalibrate nodes. In contrast to the CBAM (spatial attention block)[50], since our nodes are not spatially correlated after propagation has forced each node to contain global relations, we do not perform convolution-based squeezing but instead squeeze point-wisely.

III-D Conditional Margin Loss: $C$ -softmax

The node passing through the RGM becomes the $L$ -dimensional embedding vector through the fully connected layer. In the training phase, the embedding vector goes through the Softmax layer to optimize the loss value through the cross entropy loss function. When testing, the class is predicted by computing the cosine similarity between the embedding vectors of the gallery and the probe images, respectively.

\displaystyle L_{softmax}=-\frac{1}{B}\sum_{i}^{B}log\frac{e^{{\boldsymbol{W}_{k}}^{T}\boldsymbol{x}_{i}+\boldsymbol{b}_{k}}}{\sum_{j}^{M}e^{{\boldsymbol{W}_{j}}^{T}\boldsymbol{x}_{i}+\boldsymbol{b}_{j}}}

(5)

Equation 5 defines a softmax function in which $B$ denotes batch size, $M$ is the number of classes and $x_{i}$ is the embedding vector of the ( $k$ -th class) training sample.

In [19], a triplet loss function with conditional margin is proposed that applies an conditional margin between inter-classes to reduce intra-class discrepancies. This loss function is defined in Equation 6.

	$\displaystyle\left\{{\boldsymbol{x}_{i}^{a}},{\boldsymbol{x}_{i}^{p}},{\boldsymbol{x}_{i}^{n}}\right\}\in T$
	$\displaystyle s_{p}=CS({\boldsymbol{x}_{i}^{a}},{\boldsymbol{x}_{i}^{p}})$
	$\displaystyle s_{n}=CS({\boldsymbol{x}_{i}^{a}},{\boldsymbol{x}_{i}^{n}})$
	$\displaystyle L_{tripletcondtional}=\sum_{i}^{N}\left[\frac{s_{n}+1}{s_{p}+1}-m\right]_{+}$		(6)

${\boldsymbol{x}_{i}}^{a}$ , ${\boldsymbol{x}_{i}}^{p}$ and ${\boldsymbol{x}_{i}}^{n}$ denotes anchor, positive and negative embedding vector, respectively, where anchor and positive are from same identity while negative sample is from different identity. To reduce domain discrepancy, ${\boldsymbol{x}_{i}}^{a}$ and ${\boldsymbol{x}_{i}}^{p}$ are sampled from different domains. $CS(\cdot)$ indicates cosine similarity and ${S}_{p}$ and ${S}_{n}$ are intra-class and inter-class similarity, respectively. The loss value is calculated from similarity ratios with margin $m$ and this margin considers the distributions of ${S}_{p}$ and ${S}_{n}$ .

\displaystyle s_{n}<m_{1}s_{p}+m_{2}

(7)

This triplet loss function with conditional margin is designed to satisfy Equation 7 which takes into account not only the intercept value $m_{2}$ but also the slope $m_{1}$ , meaning that every margin is computed conditionally.

Since this loss function utilizes triplet loss, the positive and negative samplings play an important role in learning; therefore online sampling should be done within a mini-batch and semi-hard example learning is required. This increases the training time and makes sampling difficult since the HFR database has only a small number of images and identities. To avoid sampling, we suggest $C$ -softmax (conditional-margin softmax) as a loss function, since it applies the margin into the Sofmax layer conditionally according to inter-class similarity.

	$\displaystyle\boldsymbol{W}^{{}^{\prime}}=\frac{\boldsymbol{W}}{\left\\|\boldsymbol{W}\right\\|},\boldsymbol{x}^{{}^{\prime}}=\frac{\boldsymbol{x}}{\left\\|\boldsymbol{x}\right\\|}$
	$\displaystyle{\boldsymbol{W}^{{}^{\prime}}_{j}}^{T}\boldsymbol{x}^{{}^{\prime}}=\frac{\boldsymbol{W}^{T}\boldsymbol{x}}{\left\\|\boldsymbol{W}\right\\|\left\\|\boldsymbol{x}\right\\|}=cos\theta_{j}$
	$\displaystyle cos\theta_{j}<m_{1}cos\theta_{i}+m_{2}$		(8)

\displaystyle L_{cond}=-\frac{1}{N}\sum_{i}^{N}log\frac{e^{\alpha({m_{1}cos\theta_{i}+m_{2}})}}{\sum_{j\neq i}^{M}e^{scos\theta_{j}}+e^{\alpha{(m_{1}cos\theta_{i}+m_{2}})}}

(9)

First, we normalize the fully connected layer $\boldsymbol{W}$ and embedding vector $\boldsymbol{x}$ ; normalized vectors are re-scaled to scale $\alpha$ , following [51]. In Equation 8, the product of these two vectors gives the angle between the two vectors, which defines cosine similarity. Therefore, the conditional margin in Equation 7 can be written as Equation 8 and it transforms as Equation 9. Here $L_{cond}$ is $C$ -softmax loss function and $m_{1}$ and $m_{2}$ indicate the slope and intercept values respectively, as in Equation 7.

Figure 5 shows the margin according to each cosine similarity in two classes. Compared to CosFace[52] and ArcFace[53], our proposed conditional margin is determined conditionally by considering the similarity between classes. When this similarity is small, sufficient inter-class space is guaranteed, so that we do not need to set a high margin. Conversely, when the similarity between classes is high, we have a hard class example, so the margin should be increased to give a stricter criterion. In this way, the margin can be conditionally determined according to the similarity value, which gives a hard sampling effect by concentrating on the hard sample (see Figure 5(c)). When we decrease $m_{1}$ , the margin at the large similarity region increases since the slope is gentle; when we increase $m_{2}$ , the margin at the small similarity region increases. To prevent a negative margin, we use a constraint such as $m_{1}-m_{2}\geq 1$ . By contrast, CosFace[52] gives a constant margin for cosine similarity using $cos\theta_{j}<cos\theta_{i}+m$ , and ArcFace[53] uses a constant margin for class angular domain with $cos\theta_{j}<cos(\theta_{i}+m)$ . When we convert the ArcFace class angular domain $\theta$ to $cos\theta$ , as shown in Figure 5(b), the margin varies depending on the similarity, but a large margin occurs only when the similarity is near the midpoint. In contrast, our margins are given conditionally so that heterogeneous data with large intra-class discrepancy can be more efficiently trained during common space learning.

Figure 6 is a toy experiment result of angular losses. We train a 2D feature embedding network with eight subjects of face images and visualized 2D features of training images, following [52, 53]. The training samples are from CASIA NIR-VIS 2.0 (around 30 NIR and VIS images per class). $C$ -softmax projects samples into a compact intra-class and a large inter-class by providing different margins for inter-class similarity, but in case of other losses, a large inter-class is not guaranteed since they provide a fixed margin value. This is because two classes of high similarity are projected closer than two classes of low similarity. Through the margin based on the similarity between two classes (small margin for small similarity and large margin for large similarity), each class is finally mapped at the almost identical intervals.

IV Experimental Results

In this section, we experiment proposed method on five HFR databases, namely CAISA NIR-VIS 2.0[8], IIIT-D Sketch[9], BUAA-VisNir[10], Oulu-CASIA NIR-VIS[54], and TUFTS[55]. For each database, we perform ablation studies and comparison with other state-of-the-art methods. Also, we compare the RGM with other attentional modules and $C$ -softmax loss function with other angular margin losses. Finally, we analyze and discuss the visualization of the extracted relational information of RGM and NAU.

Our three backbones are LightCNN-9, LightCNN-29[56], and ResNet18[57], consisting of 9, 29 and 18 convolutional layers respectively. These three baseline networks are pre-trained on the MS-Celeb-1M, a large-scale visual face database. To fine-tune the proposed modules, the pre-trained feature extractor is frozen and only the HFR database, comprising non-VIS and VIS faces, is used for training data. For the fair comparison, since learnable parameters are added by RGM and NAU, we add extra convolutional layers after the backbone in which a number of parameters are similar to RGM and NAU following [18]. Two extra convolutional layers $(128,128,1\times 1)$ are added at LightCNN-9 and LightCNN-29, and one extra convolutional layer $(512,512,1\times 1)$ is added at ResNet18. The numbers in parentheses indicate input channel, output channel, and filter size, respectively. We use 128 (or 64 for small database) batches and learning rate starts at 0.001 (or 0.01 for the IIIT-D Sketch database); to avoid over-fitting, the dropout[58] rate is set to 0.7 at the fully connected layer. For an input, we crop each image to $144\times 144$ size and randomly crop to $128\times 128$ for LightCNN and $112\times 112$ for ResNet. The RGM is plugged after the last convolutional layer and use 64 ( $8\times 8$ ) node vectors. In the NAU, the channel reduction ratio is 2; in $C$ -softmax loss function $m_{1}=0.7$ , $m_{2}=-0.3$ , and the normalized scale value $s$ = 24 are used.

IV-A CASIA NIR-VIS 2.0

IV-A1 Database

The CASIA NIR-VIS 2.0 database is one of the largest HFR databases, and is composed of NIR and VIS face images. It contains 725 subjects, imaged by VIS and NIR cameras in four recording sessions. We followed view 2 protocol, which training subjects and the corresponding testing sets are non-overlapped and the numbers of subjects are virtually identical. For evaluation, the gallery set comprises one VIS image per subject while the probe set contains several NIR images per subject. The prediction score is computed by similarity matrix over the whole gallery set and the identification accuracy and verification rate recorded.

IV-A2 Ablation Studies

We first experiment with different numbers of the node vector dimension $d$ to find the appropriate dimension for HFR. The experiment is conducted on LightCNN-9, in which there are 128 channels in the last convolutional layer. Figure 7 shows the results of training with the RGM node vector dimension $d=16,32,64,128$ and $256$ . The identification accuracy and verification rate show better performance as the dimension increases and then drop off when it becames too large. We use a dimension of 128 in LightCNN and 256 in ResNet18, whose channel size at the last convolutional layer is $512$ .

TABLE I: Ablation studies of the proposed method on the CASIA NIR-VIS 2.0 database

Models		Rank-1 Acc(%)	VR@FAR =1%(%)	VR@FAR =0.1%(%)	VR@FAR =0.01%(%)
LightCNN-9	fine-tuned	93.21	98.01	93.41	90.15
	+extra conv	96.91	98.83	95.48	93.68
	RGM	96.7	98.86	95.66	93.43
	+NAU	97.2	98.76	95.79	93.9
	+ $C$ -softmax	98.03	99.15	96.76	95.23
ResNet18	fine-tuned	88.87	91.24	79.97	74.99
	+extra conv	94.73	97.76	94.15	92.14
	RGM	96.33	98.59	96.39	94.95
	+NAU	96.67	98.71	96.47	95.07
	+ $C$ -softmax	97.44	98.79	96.71	95.43
LightCNN-29	fine-tuned	97.65	99.34	97.79	96.84
	+extra conv	98.85	99.47	98.33	97.72
	RGM	98.98	99.5	95.65	98.05
	+NAU	99.06	99.94	99.5	98.11
	+ $C$ -softmax	99.3	99.51	99.02	98.86

Table I shows ablation studies in three baseline networks on the CASIA NIR-VIS 2.0 database. In the table, the fine-tuned indicates the results of training only the fully connected layers while freezing the pre-trained feature extractor. For each network we attach our proposed RGM module, then experiment with NAU, and finally show the results of training with $C$ -softmax. The ResNet18 fine-tuned is 88.37% in rank-1 accuracy. When the RGM extracts domain-invariant features focused on relational information, the performance improves to 96.33%. When the network trains with NAU and $C$ -softmax, it shows additional performance improvement of 0.34% and 0.77% respectively which is higher than baseline with extra convolutional layer. Similarly, performance on LightCNNs are improved by 4.82% and 1.65% over fine-tuned accuracy.

IV-A3 Comparison with Other Methods

In Table II, we compare our method (LigthCNN-29 backbone) with other deep learning-based HFR methods, namely HFR-CNN[59], TRIVLET[17], ADFL[13], CDL[16], WCNN[15], DSU[60], RCN[18] and RM[19]. All comparison methods use backbone pre-trained on MS-Celeb-1M or CASIA WebFace. Since RCN[18] performance in original paper used a backbone pre-trained on five large-scale databases, we reproduced RCN with LightCNN-29 pre-trained on MS-Celeb-1M as a backbone for fair comparison. The RM method, which extracts features by pair-wise relation embedding, performs better than other deep learning methods. Our method shows 0.38% performance improvement over RM and also yields results comparable with other domain-invariant based methods.

IV-A4 Comparison with Attentional Modules

In Table III, we compare RGM with the attentional modules depicted earlier in Figure 2. We train under the same conditions, with the last feature map of LightCNN-9 and LigthCNN-29 passed through each module with cross entropy loss.The RM and graph-structured modules: GloRe [46] and RGM show higher performance than other methods, among which RGM shows 1.02% and 0.24% higher performance over second best performance in each backbone.

TABLE II: Comparison with Other Methods on the 10-fold CASIA NIR-VIS 2.0 Database

Methods	Rank-1 Acc(%)	VR@FAR =0.1%(%)
HFR-CNN[59]	85.9 $\pm$ 0.9	78.0
TRIVLET[17]	95.7 $\pm$ 0.5	91.0 $\pm$ 1.3
ADFL[13]	98.2 $\pm$ 0.3	97.2 $\pm$ 0.3
CDL[16]	98.6 $\pm$ 0.2	98.3 $\pm$ 0.1
WCNN[15]	98.7 $\pm$ 0.3	98.4 $\pm$ 0.4
DSU[60]	96.3 $\pm$ 0.4	98.4 $\pm$ 0.12
RCN[18]	98.48 $\pm$ 0.5	97.77 $\pm$ 0.4
RM[19]	98.92 $\pm$ 0.16	98.72 $\pm$ 0.2
Ours	99.3 $\pm$ 0.1	98.9 $\pm$ 0.12

TABLE III: Comparison with attentional modules on the CASIA NIR-VIS 2.0 database

Backbone	LightCNN-9		LightCNN-29
Modules	Rank-1 Acc(%)	VR@FAR =0.1%(%)	Rank-1 Acc(%)	VR@FAR =0.1%(%)
finetune	93.21	93.41	97.65	97.79
B-CNN[20]	81.67	82.17	92.04	91.05
Non-Local[22]	92.51	92.11	98.98	98.37
DoubleAttention[21]	70.48	68.29	88.24	88.11
GloRe[46]	96.18	94.68	98.82	98.33
RM[19]	94.73	94.31	98.12	97.68
RGM(Ours)	97.2	95.79	99.06	99.5

TABLE IV: Ablation studies of the proposed method on the IIIT-D Sketch, BUAA Vis-Nir and Oulu-CASIA NIR-VIS database

Models		IIIT-D Sketch			BUAA-VisNir		Oulu-CASIA NIR-VIS
Models		Rank-1 Acc(%)	VR@FAR=1%(%)	VR@FAR=0.1%(%)	Rank-1 Acc(%)	VR@FAR=1%(%)	Rank-1 Acc(%)	VR@FAR=1%(%)	VR@FAR=0.1%(%)
LightCNN-9	fine-tuned	78.72	92.84	89.75	94.78	88.22	96.35	97.81	95.21
	+extra conv	85.11	97.02	94.04	93.33	94.11	98.54	96.77	96.04
	RGM	88.08	99.78	94.47	92.67	87.33	98.44	96.88	93.65
	+NAU	88.94	97.87	95.74	95.11	88.44	99.27	98.44	96.77
	+ $C$ -softmax	88.51	96.17	94.47	97.56	98.1	99.27	99.69	98.96
ResNet18	fine-tuned	70.21	86.81	82.55	97.67	97.33	99.17	96.77	95.83
	+extra conv	83.83	95.32	94.89	95.89	95.44	100.0	98.75	97.71
	RGM	85.11	95.41	94.89	99.22	98.22	99.9	97.92	94.9
	+NAU	85.11	95.74	94.47	98.89	97.11	100.0	99.17	98.96
	+ $C$ -softmax	85.96	95.74	95.32	99	97.22	100.0	98.96	99.17
LightCNN-29	fine-tuned	62.98	84.68	81.7	97.44	98.89	99.27	99.69	98.96
	+extra conv	74.04	91.49	90.64	99.11	99.44	100.0	99.06	98.12
	RGM	74.5	92.77	91.06	99.56	99.22	100.0	98.44	96.88
	+NAU	78.72	94.47	92.34	99.56	99.11	100.0	98.44	96.88
	+ $C$ -softmax	79.15	94.04	91.49	99.67	99.22	100.0	99.17	98.96

TABLE V: Comparison with other methods on the IIIT-D Sketch database

Model	Rank-1 Acc(%)	VR@FAR =1%(%)
SIFT[9]	76.28	-
MCWLD[61]	84.24	-
VGG[62]	80.89	72.08
CenterLoss[63]	84.07	76.2
CDL[16]	85.35	82.52
RCN[18]	63.83	90.12
RM[19]	77.45	91.34
Ours	88.94	97.87

TABLE VI: Comparison with attentional modules on the IIIT-D Sketch database

Backbone	LightCNN-9		LightCNN-29
Modules	Rank-1 Acc(%)	VR@FAR =1%(%)	Rank-1 Acc(%)	VR@FAR =1%(%)
finetune	78.72	92.84	62.98	94.68
B-CNN[20]	50.21	70.64	26.81	44.26
Non-Local[22]	80	94.47	70.64	91.49
DoubleAttention[21]	44.31	67.98	27.66	48.09
GloRe[46]	79.15	94.89	74.04	91.49
RM[19]	77.45	92.34	65.11	88.94
RGM(Ours)	88.94	97.87	78.72	94.47

IV-B IIIT-D Sketch

IV-B1 Database

The IIIT-D Sketch database is designed for the sketch-to-photo face recognition task. We use the Viewed Sketch Database which comprises 238 subjects. Each subject has one image pair, a sketch and a VIS photo face image. Since there are only a small number of images for training, we train on CUHK Face Sketch FERET Database (CUFSF)[32] and evaluate on IIIT-D Sketch database, following the same protocol as in [16]. The CUFSF database includes 1,194 subjects from the FERET database [64], with a single sketch and photo image pair per subject. For testing, we use VIS photo images as the gallery set and sketch images as the probe set.

IV-B2 Ablation Studies

In IIIT-D database, where the domain discrepancy is large and the data is insufficient (only one pair image for each identity), the deeper the layer of the backbone, the lower the performance due to the overfitting problem [65]. As with the results on CASIA NIR-VIS 2.0 database, in Table IV, our approach improves further with the addition of RGM, NAU and $C$ -softmax loss. However, when LightCNN-9 is the baseline, training with the original softmax performs 0.43% better than with the $C$ -softmax. This is because the number of CUFSF and IIIT-D images is smaller than for the CASIA NIR-VIS database, so it is difficult to learn sufficiently with $C$ -softmax loss and the margin values $m_{1}$ and $m_{2}$ need to be adjusted.

IV-B3 Comparison with Other Methods

As we described in Table V, SIFT[9], MCWLD[61], VGG[62], CenterLoss[63], CDL[16] and RCN[18] are compared with our approach and all of these methods are use backbone pre-trained on MS-Celeb-1M (We reproduce RCN with LightCNN-9 pre-trained on MS-Celeb-1M for fair comparison). In particular, the sketch HFR database comprises artiest’s pictures, rather than the photos, making training based on deep learning difficult. Nevertheless, our method shows a rank-1 accuracy of 88.94%, the leading performance among deep learning and hand-crafted methods with same pre-trained on MS-Celeb-1M database condition.

IV-B4 Comparison with Attentional Modules

We also apply the attention method and the graph method on LightCNN-9 and LightCNN-29 to the sketch-to-photo HFR task (Table VI). As with the NIR database, the B-CNN and DoubleAttention module show low performance making it difficult to reduce the sketch domain discrepancy via the self-attention method by simply multiplying feature vectors. In the Sketch database, which has a large domain difference and a small number of images, the RGM shows higher performance than second best performance by a larger amount of 9.79% and 4.68% on each backbone. The RGM, which extracts relational information with small parameters, prevents overfitting and outperforms even with a small database.

TABLE VII: Comparison with other methods on the BUAA Vis-Nir database

Model	Rank-1 Acc(%)	VR@FAR =1%(%)
H2 (LBP3)[66]	88.8	88.8
TRIVLET[17]	93.9	93
ADFL[13]	95.2	95.3
CDL[16]	96.9	95.9
WCNN[15]	97.4	96
Ours	99.67	99.22

IV-C BUAA-VisNir

IV-C1 Database

The BUAA-VisNir database is composed of NIR and VIS face images of 150 subjects. Each subject has nine NIR and VIS images including one frontal view, four different other views, and four different expressions (happiness, anger, disgust and amazement). These NIR and VIS images are paired and captured simultaneously. The training set comprises 50 subjects with 900 images. For testing, 100 subjects with one VIS image each make up the gallery set, with 900 NIR images in the probe set.

IV-C2 Ablation Studies

In Table IV, the performance with LightCNN-9 and 29 incrementally improves over baseline when the RGM module, NAU, and conditional margin loss $C$ -softmax are added. With $C$ -softmax loss, the performance improves 2.45% and 0.11% respectively. On the other hand, when NAU is added to the ResNet18 baseline, the performance decreases because it becomes more difficult to learn the global node correlation with fewer training set subjects. Our approach brings performance improvements 2.78%, 1.55%, and 2.23% over fine-tune in the three baselines.

IV-C3 Comparison with Other Methods

Table VII compares our method with three other types of method (projection based, synthesis based and domain-invariant base method) on H2 (LBP3)[66], TRIVLET[17], ADFL[13], CDL[16] and WCNN[15]. Our method shows better performance than other domain-invariant feature methods such as WCNN, TRIVLET that focus on features themselves rather than relationships.

IV-D Small-scale HFR database

IV-D1 Database

Oulu-CASIA NIR-VIS facial expression database[54] consists of 80 subjects with six different expressions and three different illumination conditions. Following the protocols in [15], we randomly selected 40 subjects and eight images from each NIR and VIS domain for six expressions. The train set and test set are 20 subjects each. For test set, all 960 number of VIS and NIR images of the 20 subjects are used as gallery and probe set.

TUFTS NIR and VIS database [55] consists of 100 subjects with large pose variations. The number of images per subject is less than 36, and each subject is photographed at nine equidistant positions around a semicircle with a fixed viewpoint. Since there are no protocols or comparison papers for training and test settings, we cropped all faces to 144x144 and use identities 1–25 as the test set and identities 26–100 as the training set.

TABLE VIII: Comparison with other angular margin losses

Loss	CASIA NIR-VIS 2.0		CASIA WebFace	Small-CASIA WebFace
Loss	Rank-1 (%)	VR@FAR(1%)	LFW Ver(%)	LFW Ver(%)
Normalized-softmax	97.2	98.76	99.13 (99.1)	98.55
A-softmax[67]	87.06	94.78	99.18 (99.11)	98.32
CosFace[52]	97.17	99.05	99.52 (99.51)	99.02
ArcFace[53]	97.95	99.29	99.43 (99.53)	99.03
C-softmax(Ours)	98.03	99.15	99.58	99.27

IV-D2 Results

The Figure 8(c) and 8(d) show visualization of relationship of face components in different poses in TUFTS database. In the figure, regardless of domain and pose, the strong relational components are similar within subject. The detailed explanation of the visualization will be provided in Section IV-F2.

Since Oulu-CASIA and TUFTS databases have too small identities (number of 20 and 25) which performances are 100% and 99.66% rank-1 accuracy respectively, we only conduct experiments with ablation studies or visualization. The performance on Oulu-CASIA database increases as our method is added, but saturated (Table IV). This is because it has fewer subjects, and it has multiple gallery images per subject, unlike other HFR databases which have only one gallery image per subject.

IV-E Conditional Margin Loss: $C$ -softmax

IV-E1 CASIA NIR-VIS 2.0

We compare our conditional margin loss ( $C$ -softmax) to other angular margin losses such as Normalized-softmax[51, 68], A-softmax[67], CosFace[52] and ArcFace[53] using LightCNN-9 as a baseline and training under the same conditions (the embeddings are normalized and scaled $s$ = 24); the margin for each loss follows each study. In Table VIII, on the CASIA NIR-VIS 2.0 database, performance of ArcFace and our C-softmax show better performance than the other losses at 97.97% and 98.03% because of the different margins for the class cosine similarity, as shown in Figure 5. ArcFace reduces the margin when the class cosine similarity is large or small and increases it near the midpoint (Figure 5(b)), while $C$ -softmax increases the margin at larger class similarity values (Figure 5(c)). This helps to control classes with domain discrepancy because it effectively adjusts the margin between inter-classes.

IV-E2 LFW

In Table VIII, we also conduct experiments with LFW [9], a large-scale visual face database. For this purpose, we use CASIA WebFace [47] consisting of 10,575 subjects for the training dataset and perform evaluation in LFW. All the implementation details are same as ArcFace and the margin and scale factors of each loss function are set to the optimal values presented in the original paper. The performance of each loss function is reproduced and the value in parentheses is the performance written on each paper. The results show that $C$ -softmax achieves best performance with 99.58%, followed by the CosFace and ArcFace. In addition, we conduct experiments on a small-CASIA WebFace database with 5,287 subjects, half the size of the CASIA WebFace. The performance gap between $C$ -softmax and other losses is larger in small-CASIA WebFace than in CASIA WebFace. This result shows that $C$ -softmax improves training effectiveness when we train on the dataset with a small number of classes.

TABLE IX: Comparison of softmax and Sigmoid activation in RGM

Database	Activation function	Rank-1 Acc(%)	VR@FAR =1%(%)	VR@FAR =0.1%(%)	VR@FAR =0.01%(%)
CASIA NIR	softmax	97.95	99.16	97.15	96
CASIA NIR	sigmoid	98.03	99.15	96.76	95.23
IIITD Sketch	softmax	87.66	95.32	90.21	-
IIITD Sketch	sigmoid	88.51	96.17	94.47	-

IV-F Discussion

IV-F1 RGM with Sigmoid Activation Function

As mentioned in Section III-B2, when obtaining a directed relation between nodes, all edge values are passed through an activation function. In this case, we use a sigmoid activation function instead of softmax because relation information for each node is independent of and should not be influenced by other nodes’ relations. Unlike sigmoids where $f_{sigmoid}(\boldsymbol{x}_{i})={1}/({1+e^{-\boldsymbol{x}_{i}}})$ , softmax $f_{softmax}(\boldsymbol{x}_{i})={\boldsymbol{x}_{i}}/{\sum e^{-\boldsymbol{x}_{j}}}$ looks at the interrelation of all values. Table IX shows the results of experiments in which the activation function of RGM is varied. We use LightCNN-9 as a baseline with the CASIA NIR-VIS and IIIT-D Sketch databases. For these databases, the rank-1 accuracy is increased by 0.08% and 0.85% compared to softmax.

IV-F2 Visualization of Relations

We visualize node relation $\boldsymbol{A}_{i,j}$ in Equation 2 which is the directed relations of nodes in the RGM. In Figure 8, the first row shows VIS and NIR pair, while the second row shows VIS and Sketch pairs. In each image, we select one reference node vector (red color) and visualize five strong relational node vectors (green color) of it. Regardless of domain, within a subject, the strong relational components in the gallery and probe images are similar. For example, the subject in Figure 8(a), the left eye has a strong relationship with right eye, nose and wrinkles; the subject in Figure 8(b), the left eye has a strong relationship with wrinkles, nose, and mouth; In the case of different subjects, when comparing (a) and (b), the strong relationship components with the left eye are different. Also, in case of pose variation, both gallery and probe in Figure 8(c), nostrils have a strong relationship with left eyebrow and eye, whereas right eye has a strong relationship with nose in Figure 8(d). These relationships are obtained by passing the gallery VIS image and the probe NIR or Sketch image separately to the RGM, revealing each identity has similar relationships in faces regardless of domain. Additional visualization results are presented in the Supplementary Material.

IV-F3 Visualization of Node Attention Unit

Nodes whose relational information is propagated through the RGM are node-wisely recalibrated through the NAU. Figure 9 shows the scale value computed for node-wise recalibration from the NAU ( $s$ in Equation 4). Each row corresponds to samples of four subjects from test set; in each subject, the first row is a gallery and other rows are probe sets. Looking at the gallery and probe set, we can observe that the nodes are similarly focused for each subject. In other words, regardless of domain, the importance pattern of relation propagated nodes (output of RGM) are different across subjects and similar within each subject.

V Conclusion

The Relational Graph Module (RGM) extracts representative relational information of each identity by embedding each face component into a node vector and modeling the relationships among these. This graph-structured module solved the discrepancy problem between HFR domains using a structured approach based on extracting relations. Moreover, the RGM overcame the problem of lack of adequate HFR database by plugging into a pre-trained face extractor and fine-tuning it. In addition, through the Node Attention Unit (NAU), node-wise recalibration was performed to focus on global informative nodes among propagated node vectors. Furthermore, our novel $C$ -softmax loss helped to learn common projection space adaptively by applying a higher margin as the class similarity increases.

We applied the RGM module to several pre-trained backbones and explored performance improvements on NIR-to-VIS and Sketch-to-VIS tasks. In addition, in ablation studies, each proposed method showed the impact of its role through boosted performance. Furthermore, the visualization of relational information in VIS, NIR, and sketch images showed that relationships within the face are similar in each subject, revealing representative domain-invariant features. Finally, our proposed approach showed best performance comparing with state-of-the-art methods on the CASIA NIR-VIS 2.0, IIIT-D Sketch, BUAA-VisNir, Oulu-CASIA NIR-VIS and TUFTS.

References

[1] M. Wang and W. Deng, “Deep face recognition: A survey,” arXiv preprint arXiv:1804.06655, 2018.
[2] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs, “Frontal to profile face verification in the wild,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9.
[3] Z. Wang, X. Tang, W. Luo, and S. Gao, “Face aging with identity-preserved conditional generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7939–7947.
[4] E. Zangeneh, M. Rahmati, and Y. Mohsenzadeh, “Low resolution face recognition using a two-branch deep convolutional neural network architecture,” Expert Systems with Applications, vol. 139, p. 112854, 2020.
[5] R. He, X. Wu, Z. Sun, and T. Tan, “Learning invariant deep representation for nir-vis face recognition,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[6] H. Bi, N. Li, H. Guan, D. Lu, and L. Yang, “A multi-scale conditional generative adversarial network for face sketch synthesis,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3876–3880.
[7] S. Ouyang, T. Hospedales, Y.-Z. Song, X. Li, C. C. Loy, and X. Wang, “A survey on heterogeneous face recognition: Sketch, infra-red, 3d and low-resolution,” Image and Vision Computing, vol. 56, pp. 28–48, 2016.
[8] S. Li, D. Yi, Z. Lei, and S. Liao, “The casia nir-vis 2.0 face database,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2013, pp. 348–353.
[9] H. S. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa, “Memetic approach for matching sketches with digital face images,” Tech. Rep., 2012.
[10] D. Huang, J. Sun, and Y. Wang, “The buaa-visnir face database instructions,” School Comput. Sci. Eng., Beihang Univ., Beijing, China, Tech. Rep. IRIP-TR-12-FR-001, 2012.
[11] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 87–102.
[12] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, “The megaface benchmark: 1 million faces for recognition at scale,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4873–4882.
[13] L. Song, M. Zhang, X. Wu, and R. He, “Adversarial discriminative heterogeneous face recognition,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[14] T. Zhang, A. Wiliem, S. Yang, and B. Lovell, “Tv-gan: Generative adversarial network based thermal to visible face recognition,” in 2018 international conference on biometrics (ICB). IEEE, 2018, pp. 174–181.
[15] R. He, X. Wu, Z. Sun, and T. Tan, “Wasserstein cnn: Learning invariant features for nir-vis face recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 7, pp. 1761–1773, 2018.
[16] X. Wu, L. Song, R. He, and T. Tan, “Coupled deep learning for heterogeneous face recognition,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[17] X. Liu, L. Song, X. Wu, and T. Tan, “Transferring deep representation for nir-vis heterogeneous face recognition,” in 2016 International Conference on Biometrics (ICB). IEEE, 2016, pp. 1–8.
[18] Z. Deng, X. Peng, and Y. Qiao, “Residual compensation networks for heterogeneous face recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8239–8246.
[19] M.Cho, T. Chung, T. Kim, and S. Lee, “Nir-to-vis face recognition via embedding relations and coordinates of the pairwise features,” in 2019 international conference on biometrics (ICB). IEEE, 2019.
[20] A. R. Chowdhury, T.-Y. Lin, S. Maji, and E. Learned-Miller, “One-to-many face recognition with bilinear cnns,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9.
[21] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, “A^ 2-nets: Double attention networks,” in Advances in Neural Information Processing Systems, 2018, pp. 352–361.
[22] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[23] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” in Advances in neural information processing systems, 2017, pp. 4967–4976.
[24] D. Lin and X. Tang, “Inter-modality face recognition,” in European conference on computer vision. Springer, 2006, pp. 13–26.
[25] D. Yi, S. Liao, Z. Lei, J. Sang, and S. Z. Li, “Partial face matching between near infrared and visual images in mbgc portal challenge,” in International Conference on Biometrics. Springer, 2009, pp. 733–742.
[26] Z. Lei and S. Z. Li, “Coupled spectral regression for matching heterogeneous faces,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 1123–1128.
[27] Z. Lei, S. Liao, A. K. Jain, and S. Z. Li, “Coupled discriminant analysis for heterogeneous face recognition,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 6, pp. 1707–1716, 2012.
[28] M. Shao, D. Kit, and Y. Fu, “Generalized transfer subspace learning through low-rank constraint,” International Journal of Computer Vision, vol. 109, no. 1-2, pp. 74–93, 2014.
[29] M. S. Sarfraz and R. Stiefelhagen, “Deep perceptual mapping for thermal to visible face recognition,” arXiv preprint arXiv:1507.02879, 2015.
[30] C. Reale, N. M. Nasrabadi, H. Kwon, and R. Chellappa, “Seeing the forest from the trees: A holistic approach to near-infrared heterogeneous face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 54–62.
[31] Q. Liu, X. Tang, H. Jin, H. Lu, and S. Ma, “A nonlinear approach for face sketch synthesis and recognition,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1. IEEE, 2005, pp. 1005–1010.
[32] X. Wang and X. Tang, “Face photo-sketch synthesis and recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 1955–1967, 2008.
[33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
[34] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
[35] S. Liu, D. Yi, Z. Lei, and S. Z. Li, “Heterogeneous face image matching using multi-scale features,” in 2012 5th IAPR International Conference on Biometrics (ICB). IEEE, 2012, pp. 79–84.
[36] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
[37] X. Wu, H. Huang, V. M. Patel, R. He, and Z. Sun, “Disentangled variational representation for heterogeneous face recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9005–9012.
[38] B. F. Klare and A. K. Jain, “Heterogeneous face recognition using kernel prototype similarities,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 6, pp. 1410–1422, 2012.
[39] C. Peng, X. Gao, N. Wang, and J. Li, “Graphical representation for heterogeneous face recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 2, pp. 301–312, 2016.
[40] ——, “Sparse graphical representation based discriminant analysis for heterogeneous face recognition,” Signal Processing, vol. 156, pp. 46–61, 2019.
[41] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1449–1457.
[42] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid, “Actor-centric relation network,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 318–334.
[43] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, “Graph neural networks: A review of methods and applications,” arXiv preprint arXiv:1812.08434, 2018.
[44] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[45] X. Wang and A. Gupta, “Videos as space-time region graphs,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 399–417.
[46] Y. Chen, M. Rohrbach, Z. Yan, Y. Shuicheng, J. Feng, and Y. Kalantidis, “Graph-based global reasoning networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 433–442.
[47] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
[48] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
[49] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[50] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
[51] R. Ranjan, C. D. Castillo, and R. Chellappa, “L2-constrained softmax loss for discriminative face verification,” arXiv preprint arXiv:1703.09507, 2017.
[52] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5265–5274.
[53] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
[54] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. PietikäInen, “Facial expression recognition from near-infrared videos,” Image and Vision Computing, vol. 29, no. 9, pp. 607–619, 2011.
[55] K. Panetta, Q. Wan, S. Agaian, S. Rajeev, S. Kamath, R. Rajendran, S. Rao, A. Kaszowska, H. Taylor, A. Samani et al., “A comprehensive database for benchmarking imaging systems,” IEEE transactions on pattern analysis and machine intelligence, 2018.
[56] X. Wu, R. He, Z. Sun, and T. Tan, “A light cnn for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
[57] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[58] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
[59] S. Saxena and J. Verbeek, “Heterogeneous face recognition with cnns,” in European conference on computer vision. Springer, 2016, pp. 483–491.
[60] T. de Freitas Pereira, A. Anjos, and S. Marcel, “Heterogeneous face recognition using domain specific units,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 7, pp. 1803–1816, 2018.
[61] H. S. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa, “Memetically optimized mcwld for matching sketches with digital face images,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 5, pp. 1522–1535, 2012.
[62] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., “Deep face recognition.” in bmvc, vol. 1, no. 3, 2015, p. 6.
[63] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision. Springer, 2016, pp. 499–515.
[64] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, “The feret evaluation methodology for face-recognition algorithms,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 10, pp. 1090–1104, 2000.
[65] Z. Deng, X. Peng, Z. Li, and Y. Qiao, “Mutual component convolutional neural networks for heterogeneous face recognition,” IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 3102–3114, 2019.
[66] M. Shao and Y. Fu, “Cross-modality feature learning through generic hierarchical hyperlingual-words,” IEEE transactions on neural networks and learning systems, vol. 28, no. 2, pp. 451–463, 2016.
[67] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 212–220.
[68] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, “Normface: L2 hypersphere embedding for face verification,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1041–1049.

Relational Deep Feature Learning for Heterogeneous Face Recognition