Face Transformer for Recognition
Abstract
Recently there has been a growing interest in the Transformer not only in NLP but also in computer vision. We wonder whether the Transformer can be used in face recognition and whether it is better than CNNs. Therefore, we investigate the performance of Transformer models in face recognition. Considering that the original Transformer may neglect inter-patch information, we modify the patch generation process to generate tokens from sliding patches that overlap with each other. The models are trained on the CASIA-WebFace and MS-Celeb-1M databases, and evaluated on several mainstream benchmarks, including the LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP, AgeDB and IJB-C databases. We demonstrate that Face Transformer models trained on a large-scale database, MS-Celeb-1M, achieve comparable performance to CNNs with a similar number of parameters and MACs. To facilitate further research, Face Transformer models and code are available at https://github.com/zhongyy/Face-Transformer.
Index Terms:
Face Recognition, Neural Networks, Transformer.
I Introduction
Recently it has become a popular trend to apply the Transformer to different computer vision tasks, including image classification [1], object detection [2], video processing [3] and so on. Although the inner workings of the Transformer are not yet well understood, researchers keep proposing new ways to apply it [4, 5, 6] because of its strong representation ability.
Based on large-scale training databases [7] and effective loss functions [8, 9, 10], convolutional neural networks (CNNs), from VGGNet [11] to ResNet [12], have achieved great success in face recognition over the past few years [10]. DeepFace [13] first uses a 9-layer CNN for face recognition and obtains 97.35% accuracy on the LFW database. FaceNet [14] adopts GoogleNet [15], assisted by a private large-scale dataset, achieving state-of-the-art performance (99.63% on LFW) at that time. SphereFace [8] adopts a 64-layer ResNet [12] with a large-margin loss function, achieving 99.42% accuracy on the LFW database. ArcFace [10] develops ResNet [12] with an IR block and achieves new state-of-the-art performance on several benchmarks.
Despite the success of CNNs, we still wonder whether the Transformer can be used in face recognition and whether it is better than ResNet-like CNNs. The Transformer has shown excellent performance when combined with large-scale databases [1], and there are already many large-scale training databases in face recognition, so it is interesting to observe how the Transformer performs when combined with them. Perhaps the Transformer is well placed to challenge the hegemony of CNNs in face recognition. It is known that the efficiency bottleneck of Transformer models is also their key component, i.e., the self-attention mechanism, which incurs a complexity of $O(N^2)$ with respect to the sequence length $N$ [16]. Efficiency is of course important for face recognition models, but in this paper we mainly investigate the feasibility of applying Transformer models to face recognition and leave their potential efficiency problem aside.
We first experiment with a standard Transformer [17] as ViT [1] does. However, the original ViT directly flattens the image into non-overlapping patches, which may neglect inter-patch information, since some important facial features will be partitioned into different tokens. To better describe the inter-patch information, we slightly modify the token generation method of ViT so that the image patches overlap, which improves the performance compared with the original ViT and does not increase the computational cost. Face Transformer models are trained on a large-scale training database, the MS-Celeb-1M [7] database, supervised with CosFace [9], and evaluated on several face recognition benchmarks including the LFW [18], SLLFW [19], CALFW [20], CPLFW [21], TALFW [22], CFP-FP [23], AgeDB-30 [24], and IJB-C [25] databases. We demonstrate that Transformer models trained on a large-scale database obtain comparable performance to CNNs with a similar number of parameters and MACs. In addition, we find that the Transformer models attend to the face area, as expected.
The contribution of our work is to show the feasibility of Transformer models in face recognition and to report promising experimental results. How to further improve the performance and efficiency of Transformer models in face recognition is a promising direction for future research.

II Face Transformer
In this paper, following the open-set face recognition pipeline [8], the Face Transformer is trained on face databases (with images $x_i$ and corresponding labels $y_i$) in a supervised manner, where face images are encoded by a well-designed network and the output face image embeddings are supervised by an elaborate loss function [8, 9, 10] for better discriminative ability, as shown in Figure 1.
II-A Network Architecture
The Face Transformer model follows the architecture of ViT [1], which applies the original Transformer [17].
The only difference is that we modify the token generation method of ViT to generate tokens from sliding patches, i.e., to make the image patches overlap, for a better description of the inter-patch information, as shown in Figure 1. Specifically, we extract sliding patches from the image $x \in \mathbb{R}^{H \times W \times C}$ with patch size $P$ and stride $S$ (with implicit zero-padding on both sides of the input), and finally obtain a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$. $(H, W)$ is the resolution of the original image, while $(P, P)$ is the resolution of each image patch. The effective sequence length is the number of patches $N = \left\lfloor \frac{H + 2p - P}{S} + 1 \right\rfloor \times \left\lfloor \frac{W + 2p - P}{S} + 1 \right\rfloor$, where $p$ is the amount of zero-padding.
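For illustration, the sliding-patch tokenization can be implemented with a standard unfold operation. The following sketch assumes a PyTorch implementation; the class and argument names are illustrative (not those of the released code), and the padding of 2 is an assumption that yields $14 \times 14 = 196$ overlapping tokens for a $112 \times 112$ input with $P=12$, $S=8$.

```python
import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Sketch of the sliding-patch tokenizer: extract overlapping P x P patches
    with stride S (and zero-padding), then linearly project each flattened
    patch to the model dimension D."""

    def __init__(self, patch_size=12, stride=8, padding=2, in_chans=3, embed_dim=512):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=patch_size, stride=stride, padding=padding)
        self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)

    def forward(self, x):                      # x: (B, C, H, W), e.g. (B, 3, 112, 112)
        patches = self.unfold(x)               # (B, P*P*C, N) -- N overlapping patches
        patches = patches.transpose(1, 2)      # (B, N, P*P*C)
        return self.proj(patches)              # (B, N, D) patch embeddings

# Example: 112x112 aligned face crops, P=12, S=8, padding=2
tokens = OverlappingPatchEmbed()(torch.randn(2, 3, 112, 112))
print(tokens.shape)                            # torch.Size([2, 196, 512]) with these settings
```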
As in ViT, a trainable linear projection $E$ maps the flattened patches to the model dimension $D$ and outputs the patch embeddings $x_p^i E$. The class token, i.e., a learnable embedding $x_{\mathrm{class}}$, is concatenated to the patch embeddings, and its state at the output of the Transformer encoder, $z_L^0$, is the final face image embedding, as in Equation 2. Then, position embeddings $E_{pos}$ are added to the patch embeddings to retain positional information. The final embedding
$z_0 = [\,x_{\mathrm{class}};\ x_p^1 E;\ x_p^2 E;\ \cdots;\ x_p^N E\,] + E_{pos}, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\ E_{pos} \in \mathbb{R}^{(N+1) \times D}$  (1)
serves as input to the Transformer,
$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1, \ldots, L$  (2)
which consists of multi-headed self-attention (MSA) and MLP blocks, with LayerNorm (LN) before each block and residual connections after each block, as shown in Figure 1. In Equation 2, the class-token output $z_L^0$ of the last layer is the final output of the Transformer model.
One of the key blocks of the Transformer, MSA, is composed of $h$ parallel self-attention (SA) heads,
$[\,q, k, v\,] = z\, U_{qkv}, \qquad A = \mathrm{softmax}\!\left(q k^{\top} / \sqrt{D_h}\right), \qquad \mathrm{SA}(z) = A\, v$  (3)
where $z \in \mathbb{R}^{N \times D}$ is an input sequence, $U_{qkv}$ is the weight matrix for the linear transformation, and $A$ is the attention map. The output of MSA is the concatenation of the attention head outputs,
$\mathrm{MSA}(z) = [\,\mathrm{SA}_1(z);\ \mathrm{SA}_2(z);\ \cdots;\ \mathrm{SA}_h(z)\,]\, U_{msa}$  (4)
where $U_{msa} \in \mathbb{R}^{(h \cdot D_h) \times D}$ is the output projection matrix and $D_h$ is the dimension of each head.
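For reference, a minimal sketch of the multi-headed self-attention of Equations 3 and 4 follows (PyTorch assumed; this is the generic formulation rather than the exact released implementation):

```python
import torch
import torch.nn as nn

class MSA(nn.Module):
    """Minimal multi-head self-attention: per head, A = softmax(q k^T / sqrt(D_h))
    and SA(z) = A v (Eq. 3); head outputs are concatenated and projected (Eq. 4)."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)      # U_qkv: joint projection to q, k, v
        self.proj = nn.Linear(dim, dim)         # U_msa: projection after concatenation

    def forward(self, z):                       # z: (B, N, dim)
        B, N, _ = z.shape
        qkv = self.qkv(z).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)             # attention map A
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)   # concatenate heads
        return self.proj(out)
```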
II-B Loss Function
The output of Equation 2, i.e., the final output of the Transformer model, is supervised by an elaborate loss function [8, 9, 10] for better discriminative ability,
$L_{\mathrm{softmax}} = -\frac{1}{M}\sum_{i=1}^{M}\log p_{i,y_i} = -\frac{1}{M}\sum_{i=1}^{M}\log\frac{e^{W_{y_i}^{\top} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{\top} x_i + b_{j}}}$  (5)
where $y_i$ is the label, $p_{i,y_i}$ is the predicted probability of assigning $x_i$ to class $y_i$, $M$ is the number of training samples, $n$ is the number of identities, $W_j$ is the $j$-th column of the weight of the last fully connected layer, and $b_j$ is the bias. Softmax-based loss functions [26, 8, 9, 10] remove the bias term, transform $W_j^{\top} x_i$ into $s\cos\theta_{j,i}$ by normalizing the weights and features, and incorporate a large margin in the $\cos\theta_{y_i,i}$ term [8, 9, 10]. Therefore, softmax-based loss functions can be formulated as
$L = -\frac{1}{M}\sum_{i=1}^{M}\log\frac{e^{s\, f(\theta_{y_i,i})}}{e^{s\, f(\theta_{y_i,i})} + \sum_{j\neq y_i} e^{s\cos\theta_{j,i}}}$  (6)
where $f(\theta_{y_i,i}) = \cos\theta_{y_i,i} - m$ in CosFace [9].
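A minimal sketch of the CosFace-style margin loss of Equation 6 with $s = 64$ and $m = 0.35$ is given below (PyTorch assumed; a simplified stand-alone function, not the exact training code):

```python
import torch
import torch.nn.functional as F

def cosface_loss(features, labels, weight, s=64.0, m=0.35):
    """CosFace (Eq. 6): the target-class logit is s * (cos(theta) - m), other
    logits are s * cos(theta); bias removed, features and weights L2-normalized."""
    cos_theta = F.linear(F.normalize(features), F.normalize(weight))   # (B, n_ids)
    margin = torch.zeros_like(cos_theta).scatter_(1, labels.view(-1, 1), m)
    logits = s * (cos_theta - margin)
    return F.cross_entropy(logits, labels)

# Example: 512-D embeddings, 93,431 identities (MS-Celeb-1M clean version)
W = torch.randn(93431, 512, requires_grad=True)    # last fully connected weights
emb = torch.randn(8, 512)                          # a batch of face embeddings
y = torch.randint(0, 93431, (8,))                  # identity labels
loss = cosface_loss(emb, y, W)
```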
III Experiment
III-A Implementation Details
We use two training databases, CASIA-WebFace and MS-Celeb-1M [7]. CASIA-WebFace contains 0.49M images of 10,575 celebrities and can be seen as a relatively small-scale database compared with million-scale ones [7]. MS-Celeb-1M is a popular large-scale training database in face recognition; we use the clean version refined by InsightFace [10], which contains 5.3M images of 93,431 celebrities. We choose CosFace [9] ($s = 64$ and $m = 0.35$) as the loss function for better convergence and recognition performance. The face images are aligned to $112 \times 112$. Horizontal flipping with a probability of 50% is used for training data augmentation.
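The preprocessing above amounts to a simple transform pipeline along the following lines (torchvision assumed; the per-channel mean/std of 0.5 is a common choice in face recognition pipelines and is an assumption here, not a value stated above):

```python
import torchvision.transforms as T

# Training-time preprocessing for aligned 112x112 face crops.
# Note: the 0.5 mean/std normalization is an assumption, not specified in the paper.
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                        # flip with 50% probability
    T.ToTensor(),                                         # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```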
For comparison, the CNN architecture used in our work is the modified ResNet-100 [12] proposed in the first version of the ArcFace paper [10], which uses IR blocks (BN-Conv-BN-PReLU-Conv-BN) and applies the "BN [27]-Dropout [28]-FC-BN" structure to obtain the final 512-D embedding feature. We also experiment with the recently proposed T2T-ViT [5]. The number of parameters, MACs, and inference speed (Tesla V100, Intel Xeon E5-2698 v4) of these face recognition models are listed in Table I. Details are as follows. For the ViT models, the number of layers is 20, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. For the Token-to-Token part of the T2T-ViT model, the depth is 2, the hidden dimension is 64, and the MLP size is 512; for the backbone, the number of layers is 24, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. Note that "ViT-P10S8" denotes the ViT model with patch size 10 and stride 8, while "ViT-P8S8" (patch size 8, stride 8) denotes no overlap between tokens.
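For reference, the hyper-parameters listed above can be summarized as plain configuration dictionaries (field names are illustrative, not the identifiers used in the released code):

```python
# Illustrative configuration summary of the evaluated Transformer models
# (field names are hypothetical; values follow the description above).
VIT_P12S8 = dict(
    image_size=112, patch_size=12, stride=8,
    depth=20, heads=8, hidden_dim=512, mlp_dim=2048,
)
VIT_P10S8 = dict(VIT_P12S8, patch_size=10, stride=8)
VIT_P8S8 = dict(VIT_P12S8, patch_size=8, stride=8)     # non-overlapping tokens

T2T_VIT = dict(
    t2t=dict(depth=2, hidden_dim=64, mlp_dim=512),      # Token-to-Token stage
    backbone=dict(depth=24, heads=8, hidden_dim=512, mlp_dim=2048),
)
```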
| Training Data | Models | LFW | SLLFW | CALFW | CPLFW | TALFW | CFP-FP | AgeDB-30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CASIA-WebFace | ResNet-100 [12] | 99.55 | 98.65 | 94.13 | 90.93 | 53.17 | 96.30 | 95.50 |
| CASIA-WebFace | ViT-P8S8 [1] | 97.32 | 90.78 | 86.78 | 80.78 | 83.05 | 86.60 | 81.48 |
| CASIA-WebFace | ViT-P12S8 | 97.42 | 90.07 | 87.35 | 81.60 | 84.00 | 85.56 | 81.48 |
| MS-Celeb-1M | ResNet-100 [12] | 99.82 | 99.67 | 96.27 | 93.43 | 64.88 | 96.93 | 98.27 |
| MS-Celeb-1M | ViT-P8S8 [1] | 99.83 | 99.53 | 95.92 | 92.55 | 74.87 | 96.19 | 97.82 |
| MS-Celeb-1M | T2T-ViT [5] | 99.82 | 99.63 | 95.85 | 93.00 | 71.93 | 96.59 | 98.07 |
| MS-Celeb-1M | ViT-P10S8 | 99.77 | 99.63 | 95.95 | 92.93 | 72.95 | 96.43 | 97.83 |
| MS-Celeb-1M | ViT-P12S8 | 99.80 | 99.55 | 96.18 | 93.08 | 70.13 | 96.77 | 98.05 |
III-B Results on Mainstream Benchmarks
We mainly report the recognition performance of the models on several mainstream benchmarks, including the LFW [18], SLLFW [19], CALFW [20], CPLFW [21], TALFW [22], CFP-FP [23], AgeDB-30 [24], and IJB-C [25] databases. The LFW database contains 13,233 face images of 5,749 identities and is a classic benchmark for unconstrained face verification. The Similar-Looking LFW (SLLFW), Cross-Age LFW (CALFW), Cross-Pose LFW (CPLFW) and Transferable Adversarial LFW (TALFW) databases are constructed based on the LFW database to emphasize the similar-looking challenge, the cross-age challenge, the cross-pose challenge, and adversarial robustness, respectively. The CFP-FP database is built to facilitate research on large pose variation, and AgeDB-30 is a manually collected cross-age database. The IJB-C database contains both still images and video frames to address unconstrained face recognition.
The experimental results are shown in Table II and Table III. In Table II, we first find that Face Transformer models trained on the CASIA-WebFace database perform much worse than ResNet-100. In fact, the training accuracy of Face Transformer models on CASIA-WebFace can reach a level as high as that of ResNet-100, but the models do not generalize well to the test databases, which indicates that the scale of CASIA-WebFace may not be sufficient for Transformer models.
Things change when we use a much larger training database, MS-Celeb-1M. Face Transformer models demonstrate promising results on large-scale face training databases: their performance is competitive with that of ResNet-100 with a similar number of parameters and MACs. Compared with "ViT-P8S8", "ViT-P10S8" and "ViT-P12S8" achieve better performance, which demonstrates that overlapping patches help to some degree. T2T-ViT also obtains good performance; due to limited computing resources, more hyper-parameter settings for the T2T block remain to be explored. Another interesting point is that Transformer models obtain somewhat higher accuracy on the TALFW database, which contains transferable adversarial noise. Since the TALFW database is generated using CNNs as surrogate models, this does not indicate that Transformers are particularly special in terms of adversarial robustness. It would be interesting to explore the combination of Face Transformer models and adversarial training.
III-C Discussion
III-C1 Attention Area Analysis
Since the key of Transformer models is the self-attention mechanism, we analyze how the Transformer models concentrate on face images by examining the ViT-P12S8 model trained on MS-Celeb-1M. Specifically, we use the Attention Rollout [30] method, which recursively multiplies the modified attention matrices $A$ of all layers, where $A$ is the attention map of Equation 3. We find that the Transformer models attend to the face area as expected, as shown in Figure 2.
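A minimal sketch of Attention Rollout as we apply it (head-averaged attention with the identity added to account for residual connections, multiplied recursively across layers; simplified from [30], PyTorch assumed):

```python
import torch

def attention_rollout(attn_maps):
    """attn_maps: list of per-layer attention tensors, each of shape
    (heads, N+1, N+1) including the class token at index 0. Returns the
    rolled-out attention of the class token over the N patch tokens."""
    rollout = None
    for attn in attn_maps:
        a = attn.mean(dim=0)                      # average over heads
        a = a + torch.eye(a.size(-1))             # identity for the residual path
        a = a / a.sum(dim=-1, keepdim=True)       # re-normalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout[0, 1:]                         # class token -> patch tokens
```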

III-C2 Attention Matrices Visualization
To further understand the Transformer models (ViT-P12S8 trained on MS-Celeb-1M), we visualize the attention matrices of different layers and calculate the mean attention distance in image space, which can be regarded as analogous to the receptive field of CNNs [1], as shown in Figure 3. We find that although the deepest layers attend to long-range relationships as expected, the attention distance of the lowest layers in Face Transformer models is relatively longer than in the original ViT [1].
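The mean attention distance of a layer can be computed as the attention-weighted average pixel distance between patch centers, roughly as in the sketch below (our own formulation under the assumption of a regular, row-major patch grid; not the exact analysis script):

```python
import torch

def mean_attention_distance(attn, grid_size, stride):
    """attn: (heads, N, N) attention over patch tokens (class token excluded);
    grid_size: number of patches per side; stride: pixel spacing between patch
    centers. Returns the mean attention-weighted distance (in pixels) per head."""
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
    centers = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * stride   # (N, 2)
    dist = torch.cdist(centers, centers)                   # (N, N) pairwise pixel distances
    return (attn * dist).sum(dim=(-1, -2)) / attn.sum(dim=(-1, -2))   # per-head mean
```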

III-C3 Occlusion Robustness
The key of Face Transformer models is the self-attention mechanism, and they seem to concentrate more on the whole face; therefore, we wonder whether they are more robust when classifying partially occluded face images. To explore the occlusion robustness of Face Transformer models, we apply random occlusion (zero values) to the face images of several test datasets and evaluate the recognition performance of the models as the occlusion area increases. The experimental results are shown in Figure 4. We find that the performance of Face Transformer models decreases more than that of ResNet-100, which indicates that Face Transformer models behave no better than CNNs in terms of occlusion robustness.
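The random occlusion used in this test zeroes out a region of the aligned face crop; a sketch of the perturbation follows (PyTorch assumed; the square shape and uniform placement are our assumptions, as the text only specifies zero-valued occlusion of increasing area):

```python
import torch

def random_occlusion(img, occ_size):
    """Zero out a randomly placed occ_size x occ_size square in a (C, H, W)
    face crop. The square shape is an assumption; the paper only specifies
    zero-valued occlusion whose area is gradually increased."""
    _, H, W = img.shape
    top = torch.randint(0, H - occ_size + 1, (1,)).item()
    left = torch.randint(0, W - occ_size + 1, (1,)).item()
    out = img.clone()
    out[:, top:top + occ_size, left:left + occ_size] = 0.0
    return out

# Example: occlude a 32x32 region of a 112x112 face crop
occluded = random_occlusion(torch.rand(3, 112, 112), occ_size=32)
```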

III-C4 Abortive attempts and Observations
In addition to the reported models, we would like to share some of our abortive attempts and observations. Note that these observations may not be rigorous enough to draw conclusions, but they may be helpful for readers.
(1) We first tried SGD, as in previous works [9, 10], to train Face Transformer models, but the models did not converge. We therefore use AdamW [29], which has proven to be an effective optimizer for Transformer models (a minimal setup is sketched after this list).
(2) We tried removing the class token $x_{\mathrm{class}}$ and using the mean pooling of the other token outputs instead. Compared with using $z_L^0$ as the output, the recognition performance decreases slightly, while the accuracy on the TALFW database increases to more than 85%.
(3) We tried removing the MLP blocks to improve efficiency, but found that the training accuracy could not reach a normal level, which indicates that the MLP block is essential for Face Transformer models.
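A minimal sketch of the optimizer switch mentioned in observation (1) (PyTorch assumed; the learning rate and weight decay are placeholders, not the values used in our experiments):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)   # stand-in for the Face Transformer backbone
# AdamW [29] converged where SGD did not in our attempts; the hyper-parameters
# below are placeholders rather than the exact training settings.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=5e-2)
```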
IV Conclusion
In this paper, we investigate the feasibility of applying Transformer models to face recognition. We have demonstrated that Face Transformer models do not work well with a relatively small database, CASIA-WebFace, while they obtain promising performance on the large-scale face training database MS-Celeb-1M. In addition, we have provided some analyses for a better understanding of the Face Transformer models.
References
- [1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
- [3] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense video captioning with masked transformer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8739–8748.
- [4] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” arXiv preprint arXiv:2012.12877, 2020.
- [5] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token vit: Training vision transformers from scratch on imagenet,” arXiv preprint arXiv:2101.11986, 2021.
- [6] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer in transformer,” arXiv preprint arXiv:2103.00112, 2021.
- [7] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European conference on computer vision. Springer, 2016, pp. 87–102.
- [8] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 212–220.
- [9] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5265–5274.
- [10] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
- [11] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [13] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1701–1708.
- [14] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
- [15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
- [16] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu et al., “A survey on visual transformer,” arXiv preprint arXiv:2012.12556, 2020.
- [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
- [18] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” University of Massachusetts, Amherst, Tech. Rep. 07-49, October 2007.
- [19] W. Deng, J. Hu, N. Zhang, B. Chen, and J. Guo, “Fine-grained face verification: Fglfw database, baselines, and human-dcmn partnership,” Pattern Recognition, vol. 66, pp. 63–73, 2017.
- [20] T. Zheng, W. Deng, and J. Hu, “Cross-age LFW: A database for studying cross-age face recognition in unconstrained environments,” arXiv:1708.08197, 2017.
- [21] T. Zheng and W. Deng, “Cross-pose lfw: A database for studying cross-pose face recognition in unconstrained environments,” Beijing University of Posts and Telecommunications, Tech. Rep. 18-01, February 2018.
- [22] Y. Zhong and W. Deng, “Towards transferable adversarial attack against deep face recognition,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 1452–1466, 2020.
- [23] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs, “Frontal to profile face verification in the wild,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9.
- [24] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou, “Agedb: the first manually collected, in-the-wild age database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 51–59.
- [25] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney et al., “Iarpa janus benchmark-c: Face dataset and protocol,” in 2018 International Conference on Biometrics (ICB). IEEE, 2018, pp. 158–165.
- [26] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, “Normface: L2 hypersphere embedding for face verification,” in Proceedings of the 25th ACM international conference on Multimedia. ACM, 2017, pp. 1041–1049.
- [27] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
- [28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [29] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
- [30] S. Abnar and W. Zuidema, “Quantifying attention flow in transformers,” arXiv preprint arXiv:2005.00928, 2020.