
NIR-to-VIS Face Recognition via
Embedding Relations and Coordinates of the Pairwise Features

MyeongAh Cho Tae-young Chung Taeoh Kim Sangyoun Lee
Image and Video Pattern Recognition Laboratory
School of Electrical and Electronic Engineering, Yonsei University, Republic of Korea
maycho0305, tato0220, kto, [email protected]
Abstract

NIR-to-VIS face recognition identifies faces of two different domains by extracting domain-invariant features. However, this is a challenging problem due to the two different domain characteristics and the lack of NIR face datasets. In order to reduce the domain discrepancy while using existing face recognition models, we propose a 'Relation Module' that can simply be added on to any face recognition model. The local features extracted from a face image contain information about each component of the face. Given the two different domain characteristics, using the relationships between local features is more domain-invariant than using the features themselves. In addition to these relationships, positional information, such as the distance from lips to chin or eye to eye, also provides domain-invariant cues. In our Relation Module, the Relation Layer implicitly captures relationships and the Coordinates Layer models the positional information. In addition, our proposed triplet loss with conditional margin reduces intra-class variation in training, resulting in additional performance improvements.

Unlike general face recognition models, our add-on module does not need to be pre-trained with a large scale dataset. The proposed module is fine-tuned only with the CASIA NIR-VIS 2.0 database. With the proposed module, we achieve a 14.81% rank-1 accuracy improvement and a 15.47% improvement in the verification rate at 0.1% FAR compared to two baseline models.

1 Introduction

Recently, as Deep Convolutional Neural Networks (DCNNs) have shown promising performance across computer vision, there have been many improvements in face recognition tasks as well. Specifically, a DCNN extracts representative features of the face from an input image and classifies the features into each identity [22, 24]. To recognize each identity, there are two kinds of methods: those that use a classification layer such as softmax [26], and those that directly learn embedding features that correspond to a face similarity measure such as cosine similarity [22]. Both methods are intended to produce large inter-class distances and small intra-class distances, which leads to better performance.

Heterogeneous face recognition refers to face recognition on images acquired from two different domains, such as sketch to photo, TIR (thermal infrared) to VIS (visible light), and NIR (near infrared) to VIS [9, 10, 12]. In particular, NIR cameras are widely used in video surveillance and security, because at night or in low-light environments they are much more useful than VIS cameras [3]. Therefore, among heterogeneous face recognition tasks, many studies have been done on NIR-to-VIS.

Figure 1: (a) General face recognition network extracts local features with domain discrepancy remaining. (b) Our proposed relation module captures relations and coordinates of the pairwise feature to reduce the domain discrepancy.

The biggest challenge in NIR-to-VIS face recognition is extracting domain-invariant features. Therefore, it is important to enlarge the inter-class variation and reduce the discrepancy between the NIR and VIS features within each class. He et al. [6] design a network that extracts domain-invariant features, Liu et al. [17] use a triplet loss with hard sampling, and Song et al. [23] use CycleGAN [33] to convert NIR face images to VIS.

In face recognition, it is very important to extract a general feature that can distinguish each person. Face recognition networks use large scale datasets such as MS-Celeb-1M [5] or Labeled Faces in the Wild [8] to generalize features. However, NIR-to-VIS datasets are relatively small for training, so a network trained only on an NIR-to-VIS dataset cannot provide satisfying performance. Therefore, most NIR-to-VIS studies fine-tune pre-trained models, which makes it difficult to design a new architecture or to transfer a well-performing face recognition model to this task.

In order to solve these problems, this paper proposes an add-on 'Relation Module' that exploits off-the-shelf models trained on visual data to extract domain-invariant features without a pre-training procedure. Since texture information is dominant in the domain difference, our add-on module only extracts the relationships between texture information. Our module is inspired by the Relation Network [20], which modeled the relationships between objects in an image and applied them to the Visual Q&A problem. Similarly, our proposed Relation Module reduces the domain discrepancy between NIR and VIS by capturing the relationships between the components of the face. After passing through the convolutional network, each cell of the feature map represents a local patch of the input face image, such as the lips, eyes, or chin. Our module looks at all possible combinations of them and does not need to indicate the actual relations of patches explicitly. Since these relations characterize identities and are domain-invariant, they are well suited for the NIR-to-VIS recognition task. Positional information about each component is also important: distances such as from lips to chin or eye to eye can be characteristics of an identity. Since this is also domain-invariant information, we add a coordinates layer to provide the positions to the Relation Module.

To further reduce the domain discrepancy, we propose a triplet loss with conditional margin, which provides an adaptive margin within each class. In addition, following [17], the anchor and the positive and negative examples are sampled from different domains.

In this paper, our main contributions are,

  • To reduce the discrepancy between features of the two different domains, the Relation Module captures the relationships and positions between the pairwise patches that form the components of the face.

  • This paper proposes a triplet loss with conditional margin which considers the intra-class distribution. Also, for hard sampling, all anchors and targets (positives and negatives) are sampled from different domains.

  • The add-on module shows a 14.81% performance improvement over the baseline without pre-training and a 4.19% improvement over a fully pre-trained baseline model. Our maximum rank-1 accuracy reaches 98.92%, which is competitive with the state-of-the-art methods.

Figure 2: Framework of the proposed Relation Module. It concatenates every pairwise combination of the features from the ConvNet and the position map. The concatenated vectors are embedded by the shared fully connected layer into the final 256-dimension feature vector.

2 Related Works

A conventional approach to NIR-to-VIS face recognition is modality-invariant feature learning, which learns features that form a robust feature space across the two modalities. With the development of deep learning, Yi et al. [31] resort to an RBM combined with a removed-PCA feature. Liu et al. [17] improve performance by applying a CNN with a triplet loss to NIR-to-VIS face recognition. Wu et al. [30] utilize low-rank and block-diagonal constraints on the fully connected layer to alleviate overfitting, and propose cross-modal ranking to reduce the domain discrepancy. He et al. [7] use the Wasserstein distance [1] to reduce the domain gap and obtain domain-invariant features for NIR-to-VIS face recognition.

Another approach is data synthesis, which transforms face images from one modality into another via image synthesis. Data synthesis was first proposed to synthesize and recognize a sketch image from a face photo by Wang et al. [28]. After the development of deep learning and GANs [4], Zhao et al. [32] performed data synthesis using a GAN, and Song et al. [23] utilized CycleGAN [33] to realize cross-spectral face hallucination, facilitating heterogeneous face recognition via generation.

Our Relation Module is inspired by Santoro et al. [20], who proposed a relation network that finds the relationships between objects in the Visual Q&A problem. Kang et al. [11] also applied the relation network concept to face recognition. We apply it to the NIR-to-VIS task because the relation operation has domain-invariant characteristics.

Liu et al. [15] found that directly encoding the positional information of the features can, under certain circumstances, be very useful for improving the performance of a network. Vaswani et al. [25] also improved performance by designing encoders and decoders with positional encoding, so that the network can utilize sequential order information. This direct positional information inspired the design of our coordinates layer, which helps the Relation Module find the relations between features.

Various studies on face recognition have focused on loss design to increase the discriminative power of the features; they can be divided into softmax-based and triplet-based methods. In recent years, studies using angular margins based on cosine similarity, such as [2, 16, 27], have improved face recognition performance. These losses can also be applied to NIR-to-VIS face recognition. Although Liu et al. [17] applied a triplet loss to NIR-to-VIS face recognition, applying other improved losses with angular margins to NIR-to-VIS has not yet been studied sufficiently.

3 Proposed Approach

In this section we present an overview of our network and the proposed Relation Module, which consists of the relation layer and the coordinates layer. Then, the triplet loss with conditional margin and hard sampling is introduced, which reduces the gap between NIR and VIS face images.

3.1 Overview

Our network is designed to learn to extract similar embedding features from face images of different domains. The whole framework is illustrated in Figure 2. The input of the network is an NIR or VIS face image, and after the ConvNet an N×N feature map is extracted. For the feature extractor baseline, we use LightCNN [29]. This feature map is the input of the Relation Module. In the Relation Module, we consider the N×N feature vectors and all pairwise combinations of them together with positional information. These pairwise combinations pass through a shared fully connected layer and are embedded into L-dimension relation vectors. After a final fully connected layer, we extract a 256-dimension embedding feature vector which represents each identity.

During training, we use a softmax classifier-based method and the triplet loss with conditional margin, which is an embedding-based method. For the triplet loss, we sample the anchor from one domain and the positive and negative examples from the other.

3.2 Relation Layer

Because a CNN has local connectivity, each cell of the feature map after the CNN represents a local part of the input, and each channel-wise vector holds representative information about that local part.

In Figure 2, the output of the ConvNet is an N×N feature map (we use N=8). These N×N feature vectors represent local patches of the face such as the lips, eyes, and nose, which are important characteristics of the face. In the relation layer, we consider all pairwise combinations of the feature vectors. By pairwise combining, the relations between two patches of the face can be obtained. Since these relations are regardless of order, there are N²(N²+1)/2 orderless combinations. These combinations are embedded into L-dimension relation vectors by a shared fully connected layer. This process extracts representative relations between patches, such as relations of shape, size, etc. The relation layer does not need to define explicit or actual relations; it simply looks at all combinations of patches and discovers general relations implicitly. Since these relations reduce the domain dependency, each identity is represented by similar relation vectors regardless of the domain.
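As a concrete illustration, below is a minimal PyTorch sketch of the relation layer under stated assumptions: the feature and relation dimensions, the ReLU activation, and the summation over pairwise relation vectors (as in the original relation network [20]) are illustrative choices, since the paper only specifies the shared fully connected embedding.

```python
import torch
import torch.nn as nn

class RelationLayer(nn.Module):
    """Sketch: embed all orderless pairs of the N x N local feature vectors
    with a shared fully connected layer (dimensions are assumptions)."""

    def __init__(self, feat_dim=128, relation_dim=64):
        super().__init__()
        self.shared_fc = nn.Sequential(
            nn.Linear(2 * feat_dim, relation_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, fmap):
        # fmap: (B, C, N, N) feature map from the ConvNet backbone.
        b, c, h, w = fmap.shape
        cells = fmap.flatten(2).transpose(1, 2)                      # (B, N*N, C)
        # Orderless pairs, including self-pairs: N^2 * (N^2 + 1) / 2 combinations.
        idx_i, idx_j = torch.triu_indices(h * w, h * w, device=fmap.device)
        pairs = torch.cat([cells[:, idx_i], cells[:, idx_j]], dim=-1)  # (B, P, 2C)
        relations = self.shared_fc(pairs)                            # (B, P, relation_dim)
        return relations.sum(dim=1)                                  # aggregated relation vector
```

With N=8 this enumerates the 2,080 combinations described above; a final fully connected layer would then map the aggregated relation vector to the 256-dimension embedding.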

3.3 Coordinates Layer

The position of each part of the face is important information when classifying faces. Relative distances between face parts, such as from lips to chin or eye to eye, can be representative features of identities. Since this information does not depend on the domain, it can be effectively used for the NIR-to-VIS face recognition task. Therefore, we add a coordinates layer that gives each feature vector the positional information of its patch. Similar to [15], we simply add two additional channels which indicate the two spatial dimensions. The first row of the first channel is filled with 0's, the second row with 1's, and so on. The second channel is filled similarly, but with constant values along each column, and both channels are scaled to [-1, 1]. As depicted in Figure 2, these coordinates (CoordConv) are concatenated with each feature vector and used in capturing the relations.
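A minimal sketch of the coordinate map construction, following the CoordConv formulation of [15], is given below; the use of torch.linspace is an assumption consistent with the [-1, 1] scaling described above.

```python
import torch

def add_coord_channels(fmap):
    """Append two channels holding the row and column index of each cell,
    scaled to [-1, 1], before forming the pairwise combinations."""
    b, _, h, w = fmap.shape
    rows = torch.linspace(-1.0, 1.0, h, device=fmap.device)
    cols = torch.linspace(-1.0, 1.0, w, device=fmap.device)
    row_map = rows.view(1, 1, h, 1).expand(b, 1, h, w)   # row index, constant within a row
    col_map = cols.view(1, 1, 1, w).expand(b, 1, h, w)   # column index, constant within a column
    return torch.cat([fmap, row_map, col_map], dim=1)    # (B, C+2, N, N)
```

With these two channels concatenated, each pairwise vector in the relation layer carries the positions of both patches.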

| | Rank-1 Acc. (%) | VR@FAR=1% (%) | VR@FAR=0.1% (%) | VR@FAR=0.01% (%) |
| Pre-trained model I | 93.21 | 98.01 | 93.41 | 90.15 |
| Baseline I | 82.59 | 93.9 | 80.87 | 74.62 |
| + Relation Layer | 94.73 | 98.02 | 93.65 | 91.28 |
| + Coordinates Layer | 95.21 | 98.09 | 94.46 | 91.52 |
| + Conditional Triplet | 97.4 | 99.2 | 96.34 | 94.31 |
| Pre-trained model II | 97.65 | 99.34 | 97.79 | 96.84 |
| Baseline II | 95.21 | 97.85 | 93.83 | 91.36 |
| + Relation Layer | 98.12 | 99.37 | 97.68 | 96.86 |
| + Coordinates Layer | 98.59 | 99.23 | 97.59 | 96.69 |
| + Conditional Triplet | 98.92 | 99.44 | 98.72 | 98.30 |
Table 1: Results of the proposed method from the baseline on the 10-fold CASIA NIR-VIS 2.0 database.

3.4 Loss Function

3.4.1 Softmax Loss

While training the network, we use the softmax classification loss and the triplet loss. For the softmax loss, we normalize the embedding feature $x_i$ by L2 normalization, following [19, 26]. The normalized feature is then re-scaled by a scale factor $s$, following [19]. In Equation 1, we denote the batch size by $N$, the number of classes by $M$, the weights of the last softmax layer by $w$, and the normalized embedding vector by $\hat{x}$.

\hat{x}_i = \frac{x_i}{\left\|x_i\right\|} \times s

L_{Softmax} = -\frac{1}{N}\sum_{i}^{N}\log\frac{e^{w_{i}^{T}\hat{x}_{i}+b_{i}}}{\sum_{j}^{M}e^{w_{j}^{T}\hat{x}_{i}+b_{j}}}   (1)
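For clarity, here is a minimal sketch of this L2-normalized, re-scaled softmax loss; the scale value s and the class count are illustrative assumptions rather than values specified by the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class ScaledSoftmaxLoss(nn.Module):
    """Softmax loss over L2-normalized embeddings re-scaled by s (Equation 1 sketch)."""

    def __init__(self, feat_dim=256, num_classes=360, s=30.0):
        super().__init__()
        self.s = s
        self.classifier = nn.Linear(feat_dim, num_classes)   # weights w, bias b

    def forward(self, x, labels):
        x_hat = self.s * F.normalize(x, p=2, dim=1)   # x_hat = x / ||x|| * s
        logits = self.classifier(x_hat)               # w_j^T x_hat + b_j
        return F.cross_entropy(logits, labels)        # batch-averaged -log softmax
```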

3.4.2 Triplet Loss with Conditional Margin

Since there are large intra-class discrepancies caused by the two-domain difference, a triplet loss is introduced. Equation 2 is the original triplet loss [22], where $x^a$ (anchor) is the embedding feature vector of a randomly selected input image, $x^p$ (positive) is an embedding feature vector from the same class as the anchor, and $x^n$ (negative) is from a different class. The loss is designed to minimize the Euclidean distance between embeddings of the same identity and to maximize the distance between different identities.

Figure 3: Distribution of the probe similarity versus maximum negative similarity on the CASIA NIR-VIS 2.0 training set. With the original triplet loss, the margin constraint pushes every sample identically, while our conditional margin gives an adaptive margin that follows the correlation of the distribution.
\{x_i^a, x_i^p, x_i^n\} \in T

L_{Triplet} = \sum_{i}^{N}\left[\left\|x_i^a - x_i^p\right\|_2^2 - \left\|x_i^a - x_i^n\right\|_2^2 + m\right]_+   (2)

In Equation 2, the distance difference should be bigger than the margin $m$. We plot the closest negative embeddings for each class in Figure 3. The x-axis is the cosine similarity between anchor and positive ($S_p$) and the y-axis is the maximum cosine similarity between anchor and negative ($S_n$) in the training set. However, Equation 2 is inappropriate for our task: as the plot shows, the correlation between $S_p$ and $S_n$ is not 1, which means the loss criterion should differ according to $S_p$. Considering the $S_p$ and $S_n$ distribution, we propose a conditional margin.

S_p = CS(x_i^a, x_i^p)

S_n = CS(x_i^a, x_i^n)

\frac{S_n+1}{S_p+1} < m

L_{Conditional} = \sum_{i}^{N}\left[\frac{S_n+1}{S_p+1} - m\right]_+   (3)

In Equation 3, $CS$ denotes cosine similarity, and the margin is applied adaptively. In Figure 3, the triplet loss with conditional margin line considers not only the intercept value $(1-m)$ but also the slope $m$. Equation 3 is our triplet loss with conditional margin ($m$=0.7), and the total loss is defined in Equation 4.

L = L_{Softmax} + \lambda L_{Conditional}   (4)

To reduce the gap between domains, we sample the positive and negative examples from a different domain than the anchor [17]. This sampling forces NIR and VIS embeddings of the same identity to be close and makes the intra-class distribution compact regardless of the domain.
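A minimal sketch of the triplet loss with conditional margin (Equations 3 and 4) follows; the batch handling and the summation reduction follow the equations, while the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def conditional_triplet_loss(anchor, positive, negative, m=0.7):
    """Triplet loss with conditional margin: anchors from one domain (e.g. NIR),
    positives and negatives sampled from the other (VIS)."""
    s_p = F.cosine_similarity(anchor, positive, dim=1)   # S_p
    s_n = F.cosine_similarity(anchor, negative, dim=1)   # S_n
    # Penalize triplets whose shifted-similarity ratio exceeds the margin m.
    return torch.clamp((s_n + 1.0) / (s_p + 1.0) - m, min=0.0).sum()

# Total objective of Equation 4 with balancing parameter lambda = 10:
# loss = softmax_loss + 10.0 * conditional_triplet_loss(a, p, n, m=0.7)
```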

4 Experiments and Results

4.1 Database

For our experiments, we use the CASIA NIR-VIS 2.0 Face Database [14]. This database consists of 725 identities and 10-fold experiments. There are 1-22 VIS images and 5-50 NIR images per identity. It is the largest and most challenging database for the heterogeneous face recognition task. We crop each image to 144×144 and, during training, randomly crop it to 128×128. The training set contains about 8,600 NIR and VIS images from 360 identities. In the test set, the gallery set consists of only one VIS image per identity and the probe set consists of about 6,000 NIR images from 358 identities.

4.2 Implementation

Our baseline is LightCNN (with the softmax layer removed), which has 9 (or 29) convolutional layers. This baseline is pre-trained on the MS-Celeb-1M dataset [5]. The Relation Module takes an 8×8 feature map as input and embeds each pair into a 64-dimension relation vector. Our Relation Module is fine-tuned only with the CASIA NIR-VIS 2.0 database. To prevent the classifier from overfitting to the training set, we apply dropout at the last softmax layer. The learning rate starts from 10^-3 and gradually drops to 10^-5. The batch size is set to 128 and the balancing parameter λ is 10.
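As a rough sketch of this training configuration, the snippet below sets the stated batch size, balancing parameter, and a learning rate decaying from 10^-3 toward 10^-5; the optimizer choice and exact decay schedule are assumptions, since the paper does not specify them.

```python
import torch

BATCH_SIZE = 128   # as stated above
LAMBDA = 10.0      # balancing parameter for the conditional triplet loss

def build_optimizer(relation_module_params):
    # SGD with momentum is an assumption; only the learning-rate range is given.
    optimizer = torch.optim.SGD(relation_module_params, lr=1e-3, momentum=0.9)
    # Exponential decay gradually lowers the rate toward 1e-5 over training.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
    return optimizer, scheduler
```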

4.3 Results

4.3.1 Relation Module Results

We follow the CASIA NIR-VIS 2.0 Face Database View 2 evaluation protocol, which consists of 10 sub-experiments. In all experiments, the identities in the training set and test set are non-overlapping. Table 1 shows the rank-1 identification rate and the verification rates at 1%, 0.1% and 0.01% FAR. In Table 1, pre-trained model I is LightCNN-9 with the whole network pre-trained on MS-Celeb-1M and fine-tuned on the CASIA NIR-VIS 2.0 database. Baseline I is LightCNN-9 with only the feature extractor pre-trained and the fully connected layers fine-tuned. The rank-1 accuracy of pre-trained model I is 93.21% and that of baseline I is 82.59%. We add the relation layer and coordinates layer to baseline I with the FC layers removed. The relation layer achieves 94.73%, and adding the coordinates layer reaches 95.21%. Furthermore, adding the triplet loss with conditional margin yields 97.4%, a 14.81% accuracy improvement over baseline I. Since the Relation Module does not need pre-training, it can be added to any other face recognition feature extractor with a simple fine-tuning procedure. In Table 1, for pre-trained model II and baseline II, we use LightCNN-29, which has 29 convolutional layers. Pre-trained model II and baseline II achieve 97.65% and 95.21%, respectively. After adding the Relation Module and the triplet loss with conditional margin, we achieve 98.92% rank-1 accuracy and a 98.72% verification rate at 0.1% FAR.

| Method | Rank-1 Acc. (%) | VR@FAR=0.1% (%) |
| HFR-CNN [21] | 85.9 | 78 |
| COTS+Low-rank [13] | 89.59 | - |
| TRIVET [17] | 95.7 | 91 |
| IDR [6] | 97.33 | 95.73 |
| ADFL [23] | 98.15 | 97.21 |
| CDL [30] | 98.62 | 98.32 |
| W-CNN [7] | 98.7 | 98.4 |
| Ours | 98.92 | 98.72 |
Table 2: Comparing with other deep learning based methods on the 10-fold CASIA NIR-VIS 2.0 database.
| Loss | margin (slope) | Rank-1 Acc. (%) |
| Softmax | - | 95.21 |
| Softmax + Triplet | 0.2 | 94.74 |
| Softmax + Triplet with conditional margin | 0.4 (0.6) | 94.94 |
| Softmax + Triplet with conditional margin | 0.3 (0.7) | 97.4 |
| Softmax + Triplet with conditional margin | 0.2 (0.8) | 95.62 |
Table 3: Results on the 10-fold CASIA NIR-VIS 2.0 database with the different loss functions and the margin values.

Table 2 compares deep learning based HFR models on the CASIA NIR-VIS 2.0 database. The compared models are HFR-CNN (2016) [21], COTS+Low-rank (2017) [13], TRIVET (2016) [17], IDR (2017) [6], ADFL (2018) [23], CDL (2017) [30] and W-CNN (2018) [7]. Our Relation Module with the conditional triplet loss performs 0.52% better than W-CNN, which is competitive with the state-of-the-art models.

Figure 4: t-SNE Visualization of 256-dimension embedding features (a) in the baseline and (b) in the proposed module.

4.3.2 Triplet Loss with Conditional Margin Results

We apply different losses to the Relation Module network (with the LightCNN-9 baseline). When we only use the softmax loss, the rank-1 accuracy is 95.21% in Table 3. In Table 3, we experiment with m = 0.6, 0.7 and 0.8, corresponding to slopes of 0.6, 0.7 and 0.8 and intercepts of 0.4, 0.3 and 0.2. Compared with the original triplet margin, the proposed conditional margin brings a 2.38% gain in rank-1 accuracy. Varying m, the network performs best at m = 0.7, reaching 97.4%. This shows that the network needs to be trained with a sufficiently large conditional margin and that m is a database-dependent parameter. This result also shows that minimizing the intra-class variation with the conditional margin while separating the inter-classes with the softmax loss is more effective in training.

4.3.3 Visualization of Embedding Features

We visualize the 256-dimension embedding feature vectors produced by the trained network for NIR and VIS face inputs. Using t-SNE [18], as shown in Figure 4, V and N indicate VIS and NIR embeddings, and the number denotes each identity. (a) is the result of the baseline and (b) is the result of our proposed module, which adds the Relation Module and the triplet loss with conditional margin. In Figure 4, each color indicates an identity, and most of the identities in (b) are distinguishably separated. Comparing the distance between NIR and VIS embeddings within an identity, (b) is much closer than (a), showing a compact intra-class distribution. For example, in (a), V10 and N10 (or V13 and N13) are far from each other and close to other identities, which leads to wrong identification, while in (b) the embedding features of each class are compact and all identities are separated well enough, which leads to good performance.
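For reference, a minimal sketch of such a visualization with scikit-learn's t-SNE is shown below; the array names, marker sizes, and default perplexity are illustrative assumptions rather than the paper's exact plotting setup.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(embeddings, identities):
    """embeddings: (M, 256) NIR/VIS feature vectors; identities: (M,) identity labels."""
    points = TSNE(n_components=2, metric="cosine").fit_transform(embeddings)
    for identity in np.unique(identities):
        mask = identities == identity
        plt.scatter(points[mask, 0], points[mask, 1], s=8, label=str(identity))
    plt.legend(fontsize=6)
    plt.show()
```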

5 Conclusion

In this paper, we proposed the Relation Module, an add-on module that simultaneously captures the relations and coordinates of the pairwise features from off-the-shelf models. The relation layer effectively captures the pairwise relationships between the components of the face, and the coordinates layer models the positional information of the features. Furthermore, the proposed triplet loss with conditional margin increases performance by modeling a data-dependent adaptive margin between the anchor-positive and anchor-negative similarities.

Experimental results show that each component of our Relation Module increases the accuracy over the baseline models while training only on the target dataset, with performance competitive with the state-of-the-art algorithms. Our visualization of the embedding features shows that the Relation Module not only reduces the domain discrepancy between NIR and VIS but also enlarges the relative inter-class distances.

One of the main difficulties of heterogeneous face recognition is the lack of labeled datasets from different domains. The proposed method can effectively solve this problem by combining an existing visual face recognition model with a small NIR-VIS face dataset. In future work, we will extend the same framework to other domains such as sketch and thermal.

Acknowledgement

This research was supported by the Multi-Ministry Collaborative R&D Program (R&D program for complex cognitive technology) through the National Research Foundation of Korea (NRF) funded by MSIT, MOTIE, and KNPA (NRF2018M3E3A1057289).

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2016-0-00197, Development of the high-precision natural 3D view generation technology using smart-car multi sensors and deep learning).

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • [2] J. Deng, J. Guo, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
  • [3] S. Farokhi, J. Flusser, and U. U. Sheikh. Near infrared face recognition: A literature survey. Computer Science Review, 21:1–17, 2016.
  • [4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [5] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
  • [6] R. He, X. Wu, Z. Sun, and T. Tan. Learning invariant deep representation for nir-vis face recognition. In AAAI, volume 4, page 7, 2017.
  • [7] R. He, X. Wu, Z. Sun, and T. Tan. Wasserstein cnn: Learning invariant features for nir-vis face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [8] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
  • [9] L. Huang, J. Lu, and Y.-P. Tan. Learning modality-invariant features for heterogeneous face recognition. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 1683–1686. IEEE, 2012.
  • [10] Y. Jin, J. Lu, and Q. Ruan. Coupled discriminative feature learning for heterogeneous face recognition. IEEE Transactions on Information Forensics and Security, 10(3):640–652, 2015.
  • [11] B.-N. Kang, Y. Kim, and D. Kim. Pairwise relational networks for face recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 628–645, 2018.
  • [12] B. F. Klare and A. K. Jain. Heterogeneous face recognition using kernel prototype similarities. IEEE transactions on pattern analysis and machine intelligence, 35(6):1410–1422, 2013.
  • [13] J. Lezama, Q. Qiu, and G. Sapiro. Not afraid of the dark: Nir-vis face recognition via cross-spectral hallucination and low-rank embedding. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6807–6816. IEEE, 2017.
  • [14] S. Li, D. Yi, Z. Lei, and S. Liao. The casia nir-vis 2.0 face database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 348–353, 2013.
  • [15] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems, pages 9627–9638, 2018.
  • [16] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 1, 2017.
  • [17] X. Liu, L. Song, X. Wu, and T. Tan. Transferring deep representation for nir-vis heterogeneous face recognition. In Biometrics (ICB), 2016 International Conference on, pages 1–8. IEEE, 2016.
  • [18] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [19] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
  • [20] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
  • [21] S. Saxena and J. Verbeek. Heterogeneous face recognition with cnns. In European Conference on Computer Vision, pages 483–491. Springer, 2016.
  • [22] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. June 2015.
  • [23] L. Song, M. Zhang, X. Wu, and R. He. Adversarial discriminative heterogeneous face recognition. arXiv preprint arXiv:1709.03675, 2017.
  • [24] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
  • [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [26] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
  • [27] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414, 2018.
  • [28] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):1955–1967, 2009.
  • [29] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
  • [30] X. Wu, L. Song, R. He, and T. Tan. Coupled deep learning for heterogeneous face recognition. arXiv preprint arXiv:1704.02450, 2017.
  • [31] D. Yi, Z. Lei, and S. Z. Li. Shared representation learning for heterogenous face recognition. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, volume 1, pages 1–7. IEEE, 2015.
  • [32] J. Zhao, L. Xiong, P. K. Jayashree, J. Li, F. Zhao, Z. Wang, P. S. Pranata, P. S. Shen, S. Yan, and J. Feng. Dual-agent gans for photorealistic and identity preserving profile face synthesis. In Advances in Neural Information Processing Systems, pages 66–76, 2017.
  • [33] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.