
Robust face anti-spoofing framework with Convolutional Vision Transformer

Yunseung Lee    Youngjun Kwak    Jinho Shin
Abstract

Owing to advances in image processing technology and large-scale datasets, companies have implemented facial authentication processes, thereby stimulating increased focus on face anti-spoofing (FAS) against realistic presentation attacks. Recently, various attempts have been made to improve face recognition performance using both global and local learning on face images; however, to the best of our knowledge, this is the first study to investigate whether the robustness of FAS against domain shifts is improved by considering global information and local cues captured from face images using self-attention and convolutional layers. This study proposes a convolutional vision transformer-based framework that achieves robust performance on various unseen domain data. Our model yields 7.3%pp and 12.9%pp increases in FAS performance compared to models using only a convolutional neural network or vision transformer, respectively. It also achieves the highest average rank across the sub-protocols of the cross-dataset setting, compared with nine other benchmark models for domain generalization.

Face Anti-Spoofing, Domain Generalization, Convolutional Vision Transformer, Global and Local Features, Hybrid Feature Extraction

1 Introduction

Figure 1: The main idea of our framework for FAS comprises a hybrid feature extraction and training mechanism with a regression-based DG method. These modules lead our framework to capture a more generalized feature space and to achieve robust performance on unseen target data despite distribution shifts between source and target data.

Face anti-spoofing (FAS) aims to detect attacks that occur during face recognition. Because it is the first stage of the process, it is essential for the security of facial authentication systems (Anthony et al., 2021). Recently, as more diverse and sophisticated presentation attacks, such as printed images and video replay, threaten these systems, the need for robust FAS algorithms is increasing (Yu et al., 2022). To ensure that a FAS model performs well in real services, it must be developed for scenarios in which the distribution of training data differs from that of data flowing into the real-time system. Therefore, a training mechanism based on domain generalization (DG) should be considered (Anthony et al., 2021; Yu et al., 2022; Shao et al., 2019; Jia et al., 2020; Liu et al., 2021a; Wang et al., 2022b). Consistent with previous studies, our study aims to develop a generalized FAS model that handles domain shifts. To achieve this robustness, we designed a framework that extracts rich information from images and learns a generalized feature space invariant to domain- and person-specific attributes, as shown in Figure 1.

Maximizing the extraction of meaningful information from an image leads to better performance in a range of vision tasks, including FAS. Therefore, research has focused on identifying new backbones that learn high-quality features. For example, convolutional neural networks (CNNs), which are widely used as a representative backbone for FAS, are advantageous in extracting local information (Nagpal & Dubey, 2019). Recently, the vision transformer (ViT), designed to encode the global context between image patches based on the self-attention (SA) mechanism, has emerged as a new backbone for facial authentication (George & Marcel, 2021; Huang et al., 2022; Ming et al., 2022). Although one report indicates improved FAS performance when both local and global information are used in an intra-dataset protocol (Wang et al., 2022a), empirical studies on this are scarce, and prior studies have not examined the performance in terms of DG.

Figure 2: A robust FAS framework based on ConViT aims to extract local and global features with the gated positional self-attention mechanism and to learn a feature space for domain generalization through liveness prediction with discretized labels.

Few studies have focused on unknown attacks, which partially cover domain shifts, using N-shot learning (George & Marcel, 2021; Huang et al., 2022). Our research differs from these studies in that they use only global dependencies and overlook local information. As an example of FAS for video, one study considers the effectiveness of feature extraction for DG and utilizes SA in ViT to learn spatio-temporal relationships across video frames (Ming et al., 2022). However, no attempt has been made in image-based FAS to encode both local cues and global dependencies with a ViT architecture.

The novelty of this study lies in bridging the gap between research on improving FAS performance with global and local information and research on maintaining the robustness of FAS with domain-generalized features. To the best of our knowledge, this is the first study to propose a robust FAS model in terms of DG using a ViT structure that encodes both global and local contexts from a single image. To make our framework sufficiently robust for DG, we designed it to extract rich features using the convolutional vision transformer (ConViT) and to train them with a regression-based method as well as adversarial learning. Under the cross-dataset protocol, our proposed framework reports a higher Area Under Curve (AUC) than single-operation-based models such as EfficientNet and ViT. Our method also achieved the highest performance among nine FAS models in terms of DG, demonstrating that a rich representation trained with a generalization method contributes to robustness against domain shifts.

2 Related Work

2.1 Feature Extraction for FAS

Depending on the feature extraction method used in neural networks, the backbones for FAS can be classified as CNN, SA-based ViT, and hybrid ViT with convolutional elements (Han et al., 2022).

CNN-based models can extract local features from images and are trained efficiently based on strong assumptions, such as spatial equivariance. For example, ResNet (He et al., 2016; Shao et al., 2019; Jia et al., 2020) and EfficientNet (Tan & Le, 2019; Kong et al., 2022) have been widely used to obtain local cues from images in FAS. However, the information from a CNN is limited to the range of its receptive field, and it is difficult to aggregate information from other spatial locations (Raghu et al., 2021). Because CNNs operate on these hard-wired assumptions, a performance ceiling is unavoidable if the assumptions of locality and invariance do not fit the task (d’Ascoli et al., 2021).

To overcome these limitations of CNNs, ViT-based models design the SA mechanism with only soft assumptions. ViT divides an image into non-overlapping patches and transforms them into a sequence of linear embeddings. Global dependencies between patches are then encoded via SA (Dosovitskiy et al., 2020; Liu et al., 2021b). However, it is difficult to solve the FAS task with ViTs alone because, owing to their lack of prior knowledge, a large dataset is necessary for pretraining (Raghu et al., 2021; d’Ascoli et al., 2021; Tu et al., 2022).

Therefore, hybrid models have recently been proposed to bring the advantages of CNNs into ViT (Dosovitskiy et al., 2020; d’Ascoli et al., 2021; Tu et al., 2022). For example, ViT Hybrid uses a feature map extracted by a CNN as the input to ViT. ConViT indirectly injects locality into the ViT structure by adding a modified SA module that operates similarly to convolution in the initial stage (d’Ascoli et al., 2021). In addition, MaxViT, an advanced version of CoAtNet (Dai et al., 2021), uses convolution layers directly in several stages of the ViT (Tu et al., 2022).

2.2 Domain Generalization

The face images input into an authentication system are diverse because environmental conditions, such as background and camera angle, vary between individuals. This variation causes distribution shifts between the training and inference data, which degrades model performance. Therefore, previous studies have proposed several methods to learn a generalized feature space underlying the source domains and unseen but related target domains (Shao et al., 2019; Jia et al., 2020; Liu et al., 2021a; Wang et al., 2022b; Ming et al., 2022).

These studies have defined FAS as a binary classification problem that discriminates between real and spoof images and adopted binary cross-entropy (BCE) as the loss function. However, recent studies have indicated that models trained via BCE tend to overfit, leaving them vulnerable to attacks and decreasing generalization performance. To address this issue, these studies redefined the FAS task as a regression problem and probabilistically estimated a liveness score (Jiang et al., 2022; Kwak et al., 2023). A regression-based FAS model improves generalization ability because the converted continuous label contains more semantic information from both real and spoof images (Jiang et al., 2022). In addition, an expected liveness score-based regression neural network using a pseudo-discrete label encoding technique is more robust on unseen datasets than a model trained with BCE (Kwak et al., 2023).

3 Proposed Method

In this study, we propose a FAS framework with an SA- and convolution-based feature extractor for DG. As shown in Figure 2, our proposed framework consists of three stages: label discretization, feature extraction, and liveness prediction.

3.1 Label Discretization

We redefine the binary FAS classification as a regression problem, as this approach is more suitable for learning generalized and discriminative representations (Jiang et al., 2022; Kwak et al., 2023). The binary class labels are transformed into discretized pseudo-labels using CutMix (Yun et al., 2019). Although CutMix is primarily a data augmentation technique, we use it here to generate discretized pseudo-labels so that the FAS task can be solved with a regression-based loss function. Specifically, there are $N$ source domain datasets comprising face images $X\in\mathbb{R}^{H\times W\times 3}$ with binary labels $Y\in\{0,1\}$ indicating whether each face image is fake or real. The real and fake face images are denoted as $X_r$ and $X_f$, respectively, and $Y_r$ and $Y_f$ are the corresponding labels. Using CutMix, we swap parts of $X_r$ with those of $X_f$ and denote the combined image as the input image $X_c$. We define the discretized pseudo-label $Y_c$ as $m_l Y_r$, where $m_l$ is a single value selected from the set $M=\{0,\frac{1}{K},\frac{2}{K},\dots,\frac{K-1}{K},1\}$ obtained by dividing the interval $[0,1]$ with a constant $K$.
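The paper does not give implementation details for this stage, but the following minimal PyTorch sketch illustrates one way to paste a fake patch into a real image and snap the resulting pseudo-label onto the grid $M$; the function name, the $(C,H,W)$ tensor layout, and the choice $K=10$ are our own assumptions, not the authors' exact setup.

```python
import torch

def cutmix_discretize(x_real, x_fake, K=10):
    """Illustrative sketch: paste a random fake patch into a real image (Y_r = 1)
    and discretize the resulting label onto M = {0, 1/K, ..., 1}."""
    _, H, W = x_real.shape                            # assumes a (C, H, W) tensor
    lam = torch.rand(1).item()                        # target fraction of the image kept real
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)

    x_c = x_real.clone()
    x_c[:, y1:y2, x1:x2] = x_fake[:, y1:y2, x1:x2]    # swap part of X_r with X_f

    real_ratio = 1.0 - ((y2 - y1) * (x2 - x1)) / (H * W)   # area that is still real
    y_c = round(real_ratio * K) / K                   # snap onto the grid M -> pseudo-label Y_c
    return x_c, y_c
```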

3.2 Hybrid Feature Extraction

In the second stage, we employ ConViT as the backbone $F$ for feature extraction to encode rich information that captures cues for FAS. Specifically, the global context and local cues are obtained through gated positional self-attention (GPSA), which is modified from SA. In the original ViT, the attention score $A$ is defined as the dot product of the query embedding $Q$ and the key embedding $K$. $W_Q$, $W_K$, and $W_V$ are the weight matrices for the query, key, and value embeddings $Q$, $K$, and $V$, respectively. These embeddings are obtained by multiplying the linearly projected input $X_p$ with the corresponding weight matrices. The attention score $A_{ij}$ indicates the semantic relevance between patches $X_p^i$ and $X_p^j$ and can capture the long-range dependency between patches.

To reflect locality in the attention score, the inner product of a learnable embedding $v_{pos}$ and the relative positional encoding $r_{ij}$ is added to $Q_i K_j^{T}$. As shown in Eq. (2), the gating parameter $\sigma$ controls which type of information is emphasized. A larger $\sigma$ at the initial stage makes GPSA act as a generalized convolution, assigning higher weights to adjacent patches. After local features are extracted in the earlier layers, a smaller $\sigma$ in the upper layers gates in more global information from patches that are far apart.

$\mathrm{GPSA}(X_p) := \mathrm{normalized}[A]\,X_p W_v$,  (1)

where $A_{ij} = (1-\sigma)\times \mathrm{softmax}(Q_i K_j^{T}) + \sigma\times \mathrm{softmax}(v_{pos}^{T} r_{ij})$.  (2)
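For reference, the sketch below is a minimal single-head PyTorch rendering of Eqs. (1) and (2). The form of $r_{ij}$, the sigmoid parameterization of the gate, and the $1/\sqrt{d}$ scaling of the content term are assumptions borrowed from the original ConViT formulation (d’Ascoli et al., 2021), not details given in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPSA(nn.Module):
    """Minimal single-head sketch of gated positional self-attention (Eqs. 1-2)."""
    def __init__(self, dim, num_patches):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V
        self.gate = nn.Parameter(torch.tensor(1.0))  # sigma = sigmoid(gate), large at init
        self.v_pos = nn.Parameter(torch.randn(3))    # learnable positional embedding v_pos
        self.register_buffer("r", self._rel_pos(num_patches))  # relative encodings r_ij

    @staticmethod
    def _rel_pos(n):
        # r_ij = (dx, dy, dx^2 + dy^2) between patch centers on a sqrt(n) x sqrt(n) grid.
        side = int(n ** 0.5)
        coords = torch.stack(torch.meshgrid(
            torch.arange(side), torch.arange(side), indexing="ij"), -1).view(-1, 2).float()
        d = coords[None, :, :] - coords[:, None, :]                 # (n, n, 2)
        return torch.cat([d, (d ** 2).sum(-1, keepdim=True)], -1)   # (n, n, 3)

    def forward(self, x_p):                                    # x_p: (B, n, dim)
        q, k, v = self.w_q(x_p), self.w_k(x_p), self.w_v(x_p)
        sigma = torch.sigmoid(self.gate)
        content = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        position = F.softmax(self.r @ self.v_pos, dim=-1)      # softmax(v_pos^T r_ij)
        attn = (1 - sigma) * content + sigma * position        # Eq. (2)
        attn = attn / attn.sum(-1, keepdim=True)               # "normalized[A]" in Eq. (1)
        return attn @ v                                        # Eq. (1)
```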

3.3 Liveness Prediction

The feature vector is derived from the input image $X_c$ by averaging the patch embeddings $F(X_c)$ extracted from our backbone $F$, and it is passed to both the score regressor $R$ and the domain discriminator $D$, which contributes to a robust backbone for unseen domain data (Jia et al., 2020). As described in Eq. (3), $p_k^i$ is the probability vector of the real image for the $i$-th input image, and the liveness score $\hat{Y}_c^i$ is calculated as the sum of the element-wise multiplication between $M$ and the corresponding probabilities in $p_k^i$. The score regressor $R$ is trained to minimize the difference between the predicted score $\hat{Y}_c^i$ and the ground-truth label $Y_c^i$ through the mean squared error. The backbone $F$ is then trained adversarially to extract generalized features that make it difficult for the discriminator to predict the domain label $Y_D$ of a feature. For adversarial learning, a gradient reversal layer (GRL) is inserted after the feature extractor so that the discriminator struggles to distinguish features from different domains. The regressor and discriminator are composed of fully connected layers. In Eq. (4), $q(\cdot)$ is the probability vector of the discriminator and equals $D(F(X_c))$. Therefore, domain-invariant features are extracted from the backbone $F$ through the adversarial learning described in Eq. (4). The final loss function is defined in Eq. (5).

$\mathcal{L}_{reg} = \sum \lVert Y_c^{i} - \hat{Y}_c^{i} \rVert_2^2$, where $\hat{Y}_c^{i} = \sum_{k=0}^{K} m_k \times p_k$,  (3)

$\min_{D}\max_{F} \mathcal{L}_{adv}(F,D) = -\mathbb{E}_{(x,y)\sim(X_c,\,Y_D)} \sum_{n=1}^{N} \mathbb{1}_{[n=y]} \log q(x)$,  (4)

$\mathcal{L}_{final} = \mathcal{L}_{reg} + \mathcal{L}_{adv}$.  (5)
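To make Eqs. (3)-(5) concrete, the sketch below shows one training step with a gradient reversal layer. The module interfaces (`backbone`, `regressor`, `discriminator`), the softmax over $K{+}1$ bins, and the tensor shapes are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, negated gradient
    in the backward pass, realizing the min-max objective of Eq. (4)."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def training_step(backbone, regressor, discriminator, x_c, y_c, y_d, grid_m):
    """One hypothetical optimization step for Eqs. (3)-(5).
    grid_m is a tensor holding the K+1 grid values {0, 1/K, ..., 1}."""
    feat = backbone(x_c).mean(dim=1)                  # average the patch embeddings F(X_c)

    # Liveness score as an expectation over the discretized grid (Eq. 3).
    p = torch.softmax(regressor(feat), dim=-1)        # p_k, shape (B, K+1)
    y_hat = (p * grid_m.unsqueeze(0)).sum(dim=-1)     # \hat{Y}_c = sum_k m_k * p_k
    loss_reg = ((y_c - y_hat) ** 2).sum()             # squared-error regression term

    # Domain discrimination on gradient-reversed features (Eq. 4): minimizing this
    # trains D while the reversed gradient pushes F toward domain-invariant features.
    logits_d = discriminator(GradReverse.apply(feat))
    loss_adv = nn.functional.cross_entropy(logits_d, y_d)

    return loss_reg + loss_adv                        # Eq. (5)
```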

4 Experimental Results

4.1 Experimental Settings

We conducted experiments on various FAS benchmark datasets, namely OULU-NPU (O) (Boulkenafet et al., 2017), MSU-MFSD (M) (Wen et al., 2015), REPLAY-ATTACK (I) (Chingovska et al., 2012), and CASIA-FASD (C) (Zhang et al., 2012). We used a cross-dataset protocol to evaluate performance. The cross-dataset protocol is a leave-one-out setting in which a model is trained on several source datasets and its generalization performance is tested on an unseen target dataset. The half total error rate (HTER) and area under the curve (AUC) were used as evaluation metrics (Yu et al., 2022).
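As a reference for the metrics, the snippet below computes HTER (the mean of the false acceptance and false rejection rates at a decision threshold) and AUC from liveness scores. The fixed threshold of 0.5 is an assumption for illustration; in practice the threshold is typically chosen on a development set, which the paper does not detail.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hter(scores, labels, threshold=0.5):
    """HTER = (FAR + FRR) / 2 at a given threshold (labels: 1 = live, 0 = spoof)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    live, spoof = scores[labels == 1], scores[labels == 0]
    far = np.mean(spoof >= threshold)   # spoof samples accepted as live
    frr = np.mean(live < threshold)     # live samples rejected as spoof
    return 0.5 * (far + frr)

def auc(scores, labels):
    """Threshold-free area under the ROC curve over liveness scores."""
    return roc_auc_score(labels, scores)
```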

We used CNN, ViT, and hybrid networks pretrained on ImageNet1K to prevent overfitting and achieve further improvement in DG (Parkin & Grinchuk, 2019; Ming et al., 2022). While we compared FAS performance across the different feature extraction methods of the backbones, the structure of the score regressor and discriminator was kept identical, and each feature extractor was trained with the same loss function as in Eq. (5).

4.2 Experimental Results

Table 1: Results of the cross-dataset protocol. In each cell, the left and right numbers represent HTER (%) and AUC (%), respectively. Bold indicates the best result, and the second best is underlined.
Type | Backbone | OCI→M | OMI→C | CMO→I | CMI→O | μ±σ
CNN | ResNet | 5.8 / 91.8 | 14.9 / 89.4 | 11.5 / 94.4 | 15.0 / 90.5 | 11.8±4.3 / 91.5±2.2
CNN | EfficientNet | 10.4 / 92.4 | 18.8 / 84.2 | 18.9 / 87.4 | 21.7 / 82.5 | 17.5±4.9 / 86.6±4.3
ViT | ViT Base | 17.9 / 83.7 | 31.3 / 70.7 | 16.9 / 89.9 | 28.3 / 79.5 | 23.6±7.3 / 81.0±8.1
ViT | Swin Transformer | 10.0 / 94.3 | 23.7 / 77.0 | 16.0 / 78.5 | 10.6 / 95.2 | 15.1±6.3 / 86.2±9.8
Hybrid | ViT Hybrid | 5.0 / 97.1 | 13.6 / 92.2 | 17.9 / 89.9 | 11.3 / 95.0 | 12.0±5.4 / 93.5±3.1
Hybrid | ConViT | 5.0 / 97.2 | 12.1 / 93.6 | 15.1 / 93.3 | 14.7 / 91.6 | 11.7±4.7 / 93.9±2.4
Hybrid | MaxViT | 10.4 / 93.9 | 14.5 / 89.7 | 26.8 / 72.9 | 20.6 / 87.1 | 18.1±7.1 / 85.9±9.1

We conducted a comparative experiment to determine the most robust FAS backbone among the candidate models. Specifically, we selected ResNet and EfficientNet as representative CNN models; ViT and Swin Transformer for the ViT series; and ViT Hybrid, ConViT, and MaxViT for the hybrid series. First, Table 1 demonstrates that the main feature extraction operation significantly affects performance. When ConViT was used as the feature extractor, it achieved the best performance, with an HTER of 11.7% and AUC of 93.9%, compared to the CNN and ViT models. After ConViT, ViT Hybrid had the second-highest average AUC. In contrast, ViT and Swin Transformer, which use only SA, exhibit HTER scores of 23.6% and 15.1%, respectively, recording the highest error rates, which implies that ViT alone is vulnerable to changes in the distribution of the evaluation data. When features were extracted with a ViT-based hybrid model containing convolution-like elements, performance in the unseen domain was typically superior to that of convolution-only or SA-only models. Specifically, the ConViT-based framework outperformed EfficientNet and ViT by 7.3%pp (=93.9%-86.6%) and 12.9%pp (=93.9%-81.0%) in AUC, respectively.

Second, by comparing the hybrid models in Table 1, we found that convolution-like elements in the initial layers are effective in improving FAS performance. Specifically, MaxViT, which uses convolution in most stages, demonstrated poor performance, whereas ConViT, which indirectly adds convolutional SA in the early stage, and ViT Hybrid, which applies convolution to the input image, reported the lowest HTER values among the comparison groups. In addition, MaxViT and Swin Transformer are similar in that they use window-based SA, and their average AUC is about 86%, compared with 94% for the other hybrid models, ConViT and ViT Hybrid. Models using window-based SA therefore show an approximately 8%pp lower AUC than the other hybrid models. These results experimentally confirm that attention with a window shift is inappropriate for FAS, whereas convolutional elements in the early stages of the network contribute to robust classification.

Table 2: Comparison with other generalized FAS methods under cross-dataset testing. Results are evaluated as average ranks over the settings. † and * denote the proposed framework with ViT Hybrid and ConViT, respectively.
Model | OCI→M | OMI→C | CMO→I | CMI→O | Avg. Rank
MADDG (Shao et al., 2019) | 17.7 / 88.1 | 24.5 / 84.5 | 22.2 / 85.0 | 27.9 / 80.0 | 10.75
PAD-GAN (Wang et al., 2020) | 17.0 / 90.1 | 19.7 / 87.4 | 20.9 / 86.7 | 25.0 / 81.5 | 8.75
RF-Meta (Shao et al., 2020) | 13.9 / 94.0 | 20.3 / 88.2 | 17.3 / 90.5 | 16.5 / 91.2 | 6.5
SSDG-M (Jia et al., 2020) | 16.7 / 90.5 | 23.1 / 85.5 | 18.2 / 94.6 | 25.2 / 81.8 | 9
SDA (Wang et al., 2021) | 15.4 / 91.8 | 24.5 / 84.4 | 15.6 / 90.1 | 23.1 / 84.3 | 7
D²AM (Chen et al., 2021) | 15.4 / 91.2 | 12.7 / 95.7 | 21.0 / 85.6 | 15.3 / 90.9 | 5.75
ANRL (Liu et al., 2021a) | 10.8 / 96.8 | 17.8 / 89.3 | 16.0 / 91.0 | 15.7 / 91.9 | 4.5
SSAN-M (Wang et al., 2022b) | 10.4 / 94.8 | 16.5 / 90.8 | 14.0 / 94.6 | 19.5 / 88.2 | 4
ViTransPAD (Ming et al., 2022) | 8.4 / - | 17.9 / - | 16.0 / - | 15.7 / - | 4.25
Ours† | 5.0 / 97.1 | 13.6 / 92.2 | 17.9 / 89.9 | 11.3 / 95.0 | 3
Ours* | 5.0 / 97.2 | 12.1 / 93.6 | 15.1 / 93.3 | 14.7 / 91.6 | 1.5

Table 2 presents a comparison of the proposed framework with recent promising models. The ConViT- and ViT Hybrid-based frameworks achieved average ranks of 1.5 and 3, respectively, based on HTER over the four sub-protocols of the cross-dataset setting, showing the highest performance. In particular, the proposed ConViT-based model outperformed a previous model with a ViT structure (Ming et al., 2022), exhibiting a 2.8%pp lower average HTER. These results suggest that our ConViT-based model is the most effective generalized FAS methodology among those compared (Shao et al., 2019; Wang et al., 2020; Shao et al., 2020; Jia et al., 2020; Wang et al., 2021; Chen et al., 2021; Liu et al., 2021a; Wang et al., 2022b; Ming et al., 2022), and that extracting rich representations from single images contributes more to improved performance than learning the spatio-temporal relationship in a video (Ming et al., 2022). Consequently, the results suggest that both locality and global dependencies within patches contribute to a more robust performance.

5 Conclusion

This study demonstrates that our hybrid FAS framework with self-attention and convolution is more robust for FAS tasks than frameworks using a CNN or ViT alone. Our ConViT-based framework improved domain generalization performance over other promising FAS models. These results suggest that extracting both local and global information from images is important for FAS that is robust against domain shifts.

References

  • Anthony et al. (2021) Anthony, P., Ay, B., and Aydin, G. A review of face anti-spoofing methods for face recognition systems. In 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp.  1–9. IEEE, 2021.
  • Boulkenafet et al. (2017) Boulkenafet, Z., Komulainen, J., Li, L., Feng, X., and Hadid, A. Oulu-npu: A mobile face presentation attack database with real-world variations. In 2017 12th IEEE international conference on automatic face & gesture recognition (FG 2017), pp.  612–618. IEEE, 2017.
  • Chen et al. (2021) Chen, Z., Yao, T., Sheng, K., Ding, S., Tai, Y., Li, J., Huang, F., and Jin, X. Generalizable representation learning for mixture domain face anti-spoofing. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pp.  1132–1139, 2021.
  • Chingovska et al. (2012) Chingovska, I., Anjos, A., and Marcel, S. On the effectiveness of local binary patterns in face anti-spoofing. In 2012 BIOSIG-proceedings of the international conference of biometrics special interest group (BIOSIG), pp.  1–7. IEEE, 2012.
  • Dai et al. (2021) Dai, Z., Liu, H., Le, Q. V., and Tan, M. Coatnet: Marrying convolution and attention for all data sizes. Advances in neural information processing systems, 34:3965–3977, 2021.
  • Dosovitskiy et al. (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • d’Ascoli et al. (2021) d’Ascoli, S., Touvron, H., Leavitt, M. L., Morcos, A. S., Biroli, G., and Sagun, L. Convit: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, pp. 2286–2296. PMLR, 2021.
  • George & Marcel (2021) George, A. and Marcel, S. On the effectiveness of vision transformers for zero-shot face anti-spoofing. In 2021 IEEE International Joint Conference on Biometrics (IJCB), pp.  1–8. IEEE, 2021.
  • Han et al. (2022) Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., et al. A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence, 45(1):87–110, 2022.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Huang et al. (2022) Huang, H.-P., Sun, D., Liu, Y., Chu, W.-S., Xiao, T., Yuan, J., Adam, H., and Yang, M.-H. Adaptive transformers for robust few-shot cross-domain face anti-spoofing. In European Conference on Computer Vision, pp.  37–54. Springer, 2022.
  • Jia et al. (2020) Jia, Y., Zhang, J., Shan, S., and Chen, X. Single-side domain generalization for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  8484–8493, 2020.
  • Jiang et al. (2022) Jiang, F., Liu, P., and Zhou, X.-D. Ordinal regression with representative feature strengthening for face anti-spoofing. Neural Computing and Applications, 34(18):15963–15979, 2022.
  • Kong et al. (2022) Kong, C., Chen, B., Li, H., Wang, S., Rocha, A., and Kwong, S. Detect and locate: Exposing face manipulation by semantic-and noise-level telltales. IEEE Transactions on Information Forensics and Security, 17:1741–1756, 2022.
  • Kwak et al. (2023) Kwak, Y., Jung, M., Yoo, H., Shin, J., and Kim, C. Liveness score-based regression neural networks for face anti-spoofing. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  1–5. IEEE, 2023.
  • Liu et al. (2021a) Liu, S., Zhang, K.-Y., Yao, T., Bi, M., Ding, S., Li, J., Huang, F., and Ma, L. Adaptive normalized representation learning for generalizable face anti-spoofing. In Proceedings of the 29th ACM international conference on multimedia, pp.  1469–1477, 2021a.
  • Liu et al. (2021b) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  10012–10022, 2021b.
  • Ming et al. (2022) Ming, Z., Yu, Z., Al-Ghadi, M., Visani, M., Luqman, M. M., and Burie, J.-C. Vitranspad: video transformer using convolution and self-attention for face presentation attack detection. In 2022 IEEE International Conference on Image Processing (ICIP), pp.  4248–4252. IEEE, 2022.
  • Nagpal & Dubey (2019) Nagpal, C. and Dubey, S. R. A performance evaluation of convolutional neural networks for face anti spoofing. In 2019 International Joint Conference on Neural Networks (IJCNN), pp.  1–8. IEEE, 2019.
  • Parkin & Grinchuk (2019) Parkin, A. and Grinchuk, O. Recognizing multi-modal face spoofing with face recognition networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp.  0–0, 2019.
  • Raghu et al. (2021) Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., and Dosovitskiy, A. Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems, 34:12116–12128, 2021.
  • Shao et al. (2019) Shao, R., Lan, X., Li, J., and Yuen, P. C. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10023–10031, 2019.
  • Shao et al. (2020) Shao, R., Lan, X., and Yuen, P. C. Regularized fine-grained meta face anti-spoofing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  11974–11981, 2020.
  • Tan & Le (2019) Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. PMLR, 2019.
  • Tu et al. (2022) Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., and Li, Y. Maxvit: Multi-axis vision transformer. In European conference on computer vision, pp.  459–479. Springer, 2022.
  • Wang et al. (2020) Wang, G., Han, H., Shan, S., and Chen, X. Cross-domain face presentation attack detection via multi-domain disentangled representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  6678–6687, 2020.
  • Wang et al. (2021) Wang, J., Zhang, J., Bian, Y., Cai, Y., Wang, C., and Pu, S. Self-domain adaptation for face anti-spoofing. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pp.  2746–2754, 2021.
  • Wang et al. (2022a) Wang, W., Wen, F., Zheng, H., Ying, R., and Liu, P. Conv-mlp: A convolution and mlp mixed model for multimodal face anti-spoofing. IEEE Transactions on Information Forensics and Security, 17:2284–2297, 2022a.
  • Wang et al. (2022b) Wang, Z., Wang, Z., Yu, Z., Deng, W., Li, J., Gao, T., and Wang, Z. Domain generalization via shuffled style assembly for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4123–4133, 2022b.
  • Wen et al. (2015) Wen, D., Han, H., and Jain, A. K. Face spoof detection with image distortion analysis. IEEE Transactions on Information Forensics and Security, 10(4):746–761, 2015.
  • Yu et al. (2022) Yu, Z., Qin, Y., Li, X., Zhao, C., Lei, Z., and Zhao, G. Deep learning for face anti-spoofing: A survey. IEEE transactions on pattern analysis and machine intelligence, 45(5):5609–5631, 2022.
  • Yun et al. (2019) Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  6023–6032, 2019.
  • Zhang et al. (2012) Zhang, Z., Yan, J., Liu, S., Lei, Z., Yi, D., and Li, S. Z. A face antispoofing database with diverse attacks. In 2012 5th IAPR international conference on Biometrics (ICB), pp.  26–31. IEEE, 2012.