
Semi-Supervised Learning for Face Anti-Spoofing Using the Apex Frame

Abstract

Conventional feature extraction techniques in the face anti-spoofing domain either analyze the entire video sequence or focus on a specific segment to improve model performance. However, identifying the frames that provide the most valuable input for face anti-spoofing remains a challenging task. In this paper, we address this challenge by employing Gaussian weighting to create apex frames for videos. Specifically, an apex frame is derived from a video by computing a weighted sum of its frames, where the weights are determined using a Gaussian distribution centered around the video’s central frame. Furthermore, we explore various temporal lengths to produce multiple unlabeled apex frames using a Gaussian function, without the need for convolution. By doing so, we leverage the benefits of semi-supervised learning, which considers both labeled and unlabeled apex frames to effectively discriminate between the live and spoof classes. Our key contribution emphasizes the apex frame’s capacity to represent the most significant moments in a video, while unlabeled apex frames facilitate efficient semi-supervised learning, as they enable the model to learn from videos of varying temporal lengths. Experimental results on four face anti-spoofing databases (CASIA, REPLAY-ATTACK, OULU-NPU, and MSU-MFSD) demonstrate the apex frame’s efficacy in advancing face anti-spoofing techniques.

Index Terms—  Face anti-spoofing, Semi-supervised learning, Video representation, Deep learning, Apex frame

1 Introduction

Face recognition technology has found extensive applications in various domains, including surveillance, border security, unlocking smartphones, and law enforcement. However, a significant concern in this technology is the challenge of dealing with presentation attacks, also known as spoofing attacks or biometric vulnerabilities [1]. Presentation attacks encompass various strategies aimed at deceiving or manipulating face recognition systems by providing counterfeit or altered facial information. These tactics may include presenting printed photos, executing replay attacks, or employing 3D mask attacks. The ultimate goal of these attacks is typically to impersonate another individual or illicitly access confidential or secure information. To combat evolving threats, it is crucial for face recognition systems to employ strong anti-spoofing measures, ensuring the technology’s security and reliability in real-world applications [2].

In the context of face anti-spoofing, where both the live and spoof classes exhibit spatiotemporal attributes, video-based methods [3, 4, 5, 6, 7, 1] outperform image-based approaches [8, 9, 10] on face presentation attack detection (PAD). This is because video-based methods aim to capture temporal information, which provides valuable insights into facial movements, texture changes, and dynamic characteristics. These characteristics are crucial for effectively distinguishing genuine faces from spoofing attacks. Furthermore, spoofing attacks often exhibit irregular patterns, especially with textured materials like paper or fabric, and these inconsistencies become more apparent in video sequences. Nonetheless, it is important to note that the effectiveness of video-based face PAD methods can be affected by various factors, including video quality, resolution, frame rate, camera specifications, and the particular algorithms and features employed for the analysis.

Frame selection methods often argue that not all video frames are equally important, as certain frames may contain more valuable or distinctive content than others. These methods are mostly based on optical flow and frame differences [11, 12]. However, optical flow estimation involves estimating the motion of pixels between frames, which can be computationally intensive. More advanced approaches, such as global motion estimation, have been introduced to compensate for the effects of camera motion [13]. Nonetheless, feature-based global motion estimation methods typically work with a relatively small number of distinctive features, which might not capture the full complexity of the scene. Gaussian weighting applied to frame sequences with adaptively selected segment lengths has also been proposed [14]. However, when using adaptive frame selection during model training, there is a risk of overfitting to the specific frame selection strategy and dataset, which can render the model less robust in domain generalization scenarios.

Recently, there has been increased attention on detecting the apex frame to improve model performance in micro-expression recognition. An apex frame in video analysis refers to a specific frame within a video sequence that represents a critical or significant moment in the video. For instance, Li et al. [15] proposed a robust method to detect the apex frame in the frequency domain. Inspired by their work, we pose two questions: “Is it reasonable to address the face anti-spoofing problem by creating an apex frame?” and “Can apex frames yield superior results compared to previous frame-based approaches reported in face PAD?” Our aim is to thoroughly explore these two questions. In particular, we proceed with both apex frame generation and the development of a semi-supervised learning scheme to distinguish between real and spoof classes.

In response to our earlier observations, we introduce a new method for generating an apex frame, which represents an entire video with a single image. In particular, our key idea involves targeting the central frame within a video sequence to create a more comprehensive and descriptive frame. We accomplish this by computing a Gaussian weighting function centered around the central frame of a video. Specifically, we assign weights to each frame in the video based on its temporal distance from the central frame: the Gaussian weighting function assigns higher weights to frames that are closer in time to the central frame and lower weights to frames that are farther away. By taking the weighted average of these frames, we obtain a single, representative apex frame. Fig. 1(a) shows an example of apex frame generation. This procedure is designed to extract the most significant details from the video sequence, compressing them into a single frame for subsequent analysis. Motivated by this, we generate unlabeled data by considering different temporal lengths of the video, as shown in Fig. 1(b). To combine this complementary information, we present a semi-supervised learning scheme that uses both labeled and unlabeled data to improve the performance of face presentation attack detection. In summary, our key contributions are as follows:

  1. This study presents the first attempt to employ the apex frame to summarize an entire video into a single image, with a specific focus on enhancing the face anti-spoofing task.

  2. In contrast to frame-based approaches (i.e., optical flow or global motion estimation), our approach is computationally faster since it does not involve estimating pixel-level motion vectors between frames. This computational efficiency makes our method more practical and scalable for real-world applications.

  3. Through semi-supervised learning, we evaluated the effectiveness of our approach on four different datasets. The results demonstrate promising generalization ability, achieving new state-of-the-art performance on three datasets.

The paper is structured as follows: Section 2 explains the proposed methodology, Section 3 discusses experimental settings and comparison results, and Section 4 presents conclusions.

2 Methodology

The proposed approach comprises three key steps: (1) creating an apex frame, (2) generating unlabeled data, and (3) implementing semi-supervised learning. Fig. 2 provides an overall visualization of the proposed method, illustrating how the labeled and unlabeled data is interconnected through semi-supervised learning to address the challenge of face presentation attack detection.

2.1 Apex frame creation

The selection of important frames for face anti-spoofing is a relatively recent focus in biometric authentication research. This is primarily due to the high degree of similarity in spatial and motion information found in many neighboring frames [2]. These frames, whether derived from genuine or attack videos, often provide less significant structural and textural information to the overall analysis. Taking inspiration from Gaussian distribution theory, we employ a Gaussian distribution function to generate an apex frame by calculating Gaussian weights, with a focus on the central frame of the video. The standard deviation determines the spread or width of the Gaussian distribution, affecting the weights assigned to frames in the apex frame generation process. We implement our approach by employing the following mathematical equations.

Suppose $I_{i}$ is the $i$-th frame in the video and $I_{c}$ is the central frame. The Gaussian weight for frame $I_{i}$ relative to the central frame $I_{c}$ can be calculated as follows:

$$W_{i}=\exp\left(-\frac{(i-c)^{2}}{2\sigma^{2}}\right), \qquad (1)$$

where $W_{i}$ is the weight assigned to frame $I_{i}$, $i$ is the index of frame $I_{i}$, $c$ is the index of the central frame $I_{c}$, and $\sigma$ is the standard deviation parameter that controls the spread of the Gaussian distribution.

After calculating the weights for all frames in the video, we normalize them as follows:

$$\text{Normalized Weight}_{i}=\frac{W_{i}}{\sum_{k=1}^{N}W_{k}}, \qquad (2)$$

where $N$ is the total number of frames in the video. Finally, to compute the apex frame $F$, we use the following equation:

$$F=\sum_{i=1}^{N}\left(\text{Normalized Weight}_{i}\cdot I_{i}\right). \qquad (3)$$

This single weighted apex frame gives a representation of the video that focuses more on frames located closer to the central frame $I_{c}$, while frames farther away have lower influence due to the Gaussian weighting. Thus, generating apex frames reduces facial data redundancy and provides an informative and representative summary of the overall content.
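To make the computation concrete, the following minimal NumPy sketch implements Eqs. (1)–(3), assuming the video is already loaded as an array of shape (N, H, W, C); the function name apex_frame is our illustrative choice, not the authors’ released code.

```python
import numpy as np

def apex_frame(frames: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    """Collapse a video of shape (N, H, W, C) into one Gaussian-weighted apex frame."""
    n = frames.shape[0]
    c = n // 2                                                  # central frame index
    w = np.exp(-((np.arange(n) - c) ** 2) / (2 * sigma ** 2))   # Eq. (1): Gaussian weights
    w /= w.sum()                                                # Eq. (2): normalization
    # Eq. (3): weighted sum over the frame axis yields a single (H, W, C) image
    return np.tensordot(w, frames.astype(np.float64), axes=(0, 0))
```

For an 8-bit video, the result can be cast back to an image with apex_frame(video).astype(np.uint8).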

Fig. 1: The flowchart of the proposed Gaussian weighted apex frame method. (a) The Gaussian weighted curve indicates how much influence or weight is assigned to frames at different temporal distances from the central frame. (b) Unlabeled data is generated by utilizing various temporal lengths of the video centered around the apex (central) frame of the segment.

Fig. 2: A schematic diagram of our proposed semi-supervised learning approach.

2.2 Unlabeled data generation

The primary concept behind our unlabeled data generation involves splitting video clips into segments of different temporal lengths, denoted as $T$, and then using this pseudo data to learn spatiotemporal variations and context in an unsupervised manner. In particular, the model does not rely on explicit labels or annotations. Instead, it discovers patterns and structure in the data without human-provided guidance. To achieve this, we divide the video into segments. Each segment, denoted as $S_{j}$, contains a subset of $N=50$ frames from the video. For each segment $S_{j}$, we calculate Gaussian weights for the frames within that segment relative to the central frame $I_{c_{j}}$ of that segment. The Gaussian weighting for frame $I_{i}^{j}$ within segment $S_{j}$ is given by:

$$W_{i}^{j}=\exp\left(-\frac{(i-c_{j})^{2}}{2\sigma^{2}}\right), \qquad (4)$$

where $W_{i}^{j}$ is the weight assigned to frame $I_{i}^{j}$ in segment $S_{j}$, $i$ is the index of frame $I_{i}^{j}$ within segment $S_{j}$, $c_{j}$ is the index of the central frame $I_{c_{j}}$ in segment $S_{j}$, and $\sigma$ is the standard deviation parameter that controls the spread of the Gaussian distribution.

After calculating the Gaussian weights for frames within each segment, we create a weighted frame $F_{j}$ that represents a single apex frame for segment $S_{j}$ by taking the weighted sum of frames:

$$F_{j}=\sum_{i=1}^{N_{j}}\left(\frac{W_{i}^{j}}{\sum_{k=1}^{N_{j}}W_{k}^{j}}\cdot I_{i}^{j}\right), \qquad (5)$$

where $N_{j}$ is the total number of frames in segment $S_{j}$, $W_{i}^{j}$ is the Gaussian weight for frame $I_{i}^{j}$ in segment $S_{j}$, and $I_{i}^{j}$ is the $i$-th frame in segment $S_{j}$. We iterate through this process, considering each successive video segment, to obtain distinct apex frames for the various sections of the video.
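Under the same assumptions as the sketch in Section 2.1, the segment-wise variant below reuses apex_frame to produce one unlabeled apex frame per segment (Eqs. (4)–(5)); the non-overlapping stride is our simplification, while the segment length T = 50 follows the paper.

```python
def segment_apex_frames(frames, T=50, sigma=5.0):
    """Yield one apex frame per non-overlapping segment of length T (Eqs. 4-5)."""
    for start in range(0, frames.shape[0] - T + 1, T):
        yield apex_frame(frames[start:start + T], sigma=sigma)
```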

2.3 Semi-Supervised learning

Our approach leverages the benefits of having access to labeled data obtained from Eq. 3 while also utilizing the large pool of unlabeled data extracted from Eq. 5 to improve model performance. In particular, we generate two pseudo classes of Gaussian-guided outputs with distinct temporal lengths ($T=50$ and $T=80$). These pseudo categories are used to initialize a semi-supervised learning process. The model first undergoes standard supervised training on the labeled images with a cross-entropy loss. The same model is then used to generate predictions for a batch of unlabeled images, and the class with the highest confidence is selected as the pseudo-label. A cross-entropy loss is then computed by comparing the model’s predictions with the pseudo-labels assigned to the unlabeled images. The total loss for training is a combination of the supervised and unsupervised losses:

$$L=L_{\textit{labeled}}+\lambda\cdot L_{\textit{unlabeled}}, \qquad (6)$$

where $\lambda$ is a hyperparameter. In the proposed semi-supervised learning approach, we employ a 2D convolutional neural network [16] while also leveraging the capabilities of a long short-term memory (LSTM) network for temporal modeling [17].
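The sketch below illustrates one training step of this objective in PyTorch. The confidence threshold of 0.9 and $\lambda=1.5$ are taken from Section 3.1, but the batching, masking, and averaging details are our assumptions rather than the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(model, x_lab, y_lab, x_unlab, lam=1.5, threshold=0.9):
    # Supervised term of Eq. (6): cross-entropy on labeled apex frames
    loss_lab = F.cross_entropy(model(x_lab), y_lab)

    # Pseudo-labels: the most confident class prediction on unlabeled apex frames
    with torch.no_grad():
        probs = F.softmax(model(x_unlab), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()   # keep only reliable predictions

    # Unsupervised term of Eq. (6): cross-entropy against the retained pseudo-labels
    per_sample = F.cross_entropy(model(x_unlab), pseudo, reduction="none")
    loss_unlab = (per_sample * mask).mean()

    return loss_lab + lam * loss_unlab       # total loss L
```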

Table 1: Performance evaluation using MSU-MFSD (M), Idiap (I), CASIA (C) and OULU-NPU (O) databases. Comparison results are obtained directly from the cited papers.
Method                     | O&C&I to M     | O&M&I to C     | O&C&M to I     | I&C&M to O
                           | HTER    AUC    | HTER    AUC    | HTER    AUC    | HTER    AUC
MADDG [18]                 | 17.69   88.06  | 24.50   84.51  | 22.19   84.99  | 27.89   80.02
DAFL [19]                  | 14.58   92.58  | 17.41   90.12  | 15.13   95.76  | 14.72   93.08
DR-MD [6]                  | 17.02   90.10  | 19.68   87.43  | 20.87   86.72  | 25.02   81.47
GW + RNN [14]              |  4.12   99.93  |  7.04   99.87  | 13.48   97.42  | 41.33   88.48
RFMetaFAS [3]              | 13.89   93.98  | 20.27   88.16  | 17.30   90.48  | 16.45   91.16
FAS-DR-BC(MT) [20]         | 11.67   93.09  | 18.44   89.67  | 11.93   94.95  | 16.23   91.18
HFN + MP [21]              |  5.24   97.28  |  9.11   96.09  | 15.35   90.67  | 12.40   94.26
Cross-ADD [22]             | 11.64   95.27  | 17.51   89.98  | 15.08   91.92  | 14.27   93.04
FG + HV [23]               |  9.17   96.92  | 12.47   93.47  | 16.29   90.11  | 13.58   93.55
ADL [24]                   |  5.00   97.58  | 10.00   96.85  | 12.07   94.68  | 13.45   94.43
Regression Network [25]    |  5.41   98.85  | 10.05   94.27  |  8.62   97.60  | 11.42   95.52
CRFAS [26]                 |  7.14   97.44  |  9.88   96.56  |  8.57   96.07  | 16.38   90.87
Supervised learning        | 10.41   96.21  | 26.11   82.28  | 23.12   90.38  | 45.99   81.12
Semi-supervised learning   |  7.08   98.27  | 15.74   90.97  | 21.62   92.65  | 38.97   79.64
Semi-supervised + LSTM     |  3.10   99.95  |  4.12   98.49  |  7.37   99.18  | 28.92   96.44
Fig. 3: Receiver Operating Characteristic (ROC) curves for the four protocols: (a) O&C&I to M, (b) O&M&I to C, (c) O&C&M to I, and (d) I&C&M to O.

3 Experiments

To evaluate the effectiveness of our proposed method, our study employs several databases: Idiap Replay-Attack [27] (denoted as I), MSU Mobile Face Spoofing [10] (denoted as M), OULU-NPU [28] (denoted as O), and CASIA Face Anti-Spoofing [29] (denoted as C). We report the Half Total Error Rate (HTER), which accounts for both false acceptances and false rejections, and the Area Under the Curve (AUC), a summary metric that condenses the model’s overall performance into a single value.
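As an illustration of the HTER protocol (the threshold is fixed at the equal error rate of the source test set and then applied to the unseen target set, as described in Section 3.2), a small assumed helper is sketched below; it is not the authors’ evaluation code.

```python
import numpy as np

def hter(scores, labels, threshold):
    """HTER = (FAR + FRR) / 2 at a fixed threshold; labels: 1 = live, 0 = spoof."""
    live, spoof = scores[labels == 1], scores[labels == 0]
    frr = np.mean(live < threshold)    # live samples wrongly rejected
    far = np.mean(spoof >= threshold)  # spoof samples wrongly accepted
    return (far + frr) / 2
```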

3.1 Implementation details

The Densely Connected Convolutional Network (DenseNet-201) [16] model is fine-tuned on the labeled data after resizing all images to $224\times 224$. Traditional data augmentation transformations such as rotation, x-translation, and y-translation were applied to improve the robustness of the model. During fine-tuning, we employed the Adam optimizer, a validation frequency of every 30 steps, and a learning rate of 0.0001 with an early stopping function for model development. The $\lambda$ coefficient is set to 1.5. To generate the apex frame and produce the unlabeled data, we consistently use a standard deviation of $\sigma=5$. We set a confidence threshold of 0.9 for the semi-supervised learning, so the expanded training dataset includes only unlabeled data with reliably predicted labels.

For training the LSTM, deep features comprising 1920 dimensions were extracted directly from the last pooling layer of the model after fine-tuning it in the semi-supervised manner. The LSTM training process uses the Adam optimizer, a hidden layer dimension of 100, a validation frequency of 30, and a learning rate of 0.0001 with an early stopping function. We use the same parameters for training across all datasets to ensure the repeatability of our results.
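A minimal sketch of this temporal head is given below, assuming sequences of 1920-dimensional pooling features and the hidden size of 100 reported above; the two-class linear readout and batch-first layout are our assumptions.

```python
import torch.nn as nn

class TemporalHead(nn.Module):
    """LSTM over DenseNet-201 pooling features, followed by a live/spoof classifier."""
    def __init__(self, feat_dim=1920, hidden=100, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                 # x: (batch, seq_len, 1920)
        _, (h, _) = self.lstm(x)          # final hidden state summarizes the sequence
        return self.fc(h[-1])
```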

3.2 Comparison against state-of-the-art methods

To assess the efficacy of our proposed approach, we conducted a comparison with various state-of-the-art deep learning models, as shown in Table 1. In particular, we employed a leave-one-out (LOO) strategy, in which three datasets were used for training and the remaining dataset was held out for testing in each experiment. Among recently reported domain generalization methods [18, 19, 21, 24, 25, 26, 30, 31], we note that the proposed semi-supervised learning provides state-of-the-art performance on three testing sets (i.e., MSU, CASIA, and Replay-Attack). It is worth noting that the equal error rate is calculated using the source testing sets, and subsequently a threshold is applied to determine the Half Total Error Rate (HTER) on the unseen target testing set. We hypothesize that the Gaussian-guided apex frame can enhance the model’s generalization by focusing on the most informative moments of a video. When the model learns from unlabeled apex frames, it becomes better at recognizing essential patterns and content in videos, making it more versatile in handling unseen video data.

Additionally, as demonstrated in Table 1, our proposed method consistently achieves AUC scores exceeding 90% across all the datasets. For further analysis, the ROC curves are visualized in Fig. 3 by plotting the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis. One can observe that the ROC curves are positioned toward the top-left corner. This positioning signifies higher true positive rates and lower false positive rates, which in turn demonstrates the model’s improved discriminative ability compared to the model trained without unlabeled data.

4 Conclusions

In this paper, we highlight the significance of the apex frame in the face anti-spoofing task. Leveraging Gaussian weighting for apex frame generation allows the model to concentrate efficiently on the most relevant moments. Our study reveals that apex frames with varying temporal lengths complement semi-supervised learning, achieving state-of-the-art performance across three previously unseen face anti-spoofing datasets. Furthermore, training the model on more compact and informative apex frames enhances its generalization to unseen data. However, we acknowledge a potential limitation: the model may not capture all fine details, especially in longer videos. Therefore, future research should explore new approaches and enhancements to address this limitation and advance video summarization methods for longer videos.

References

  • [1] Usman Muhammad and Mourad Oussalah, “Self-supervised face presentation attack detection with dynamic grayscale snippets,” in 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG). IEEE, 2023, pp. 1–6.
  • [2] Usman Muhammad, Md Ziaul Hoque, Mourad Oussalah, and Jorma Laaksonen, “Deep ensemble learning with frame skipping for face anti-spoofing,” arXiv preprint arXiv:2307.02858, 2023.
  • [3] Rui Shao, Xiangyuan Lan, and Pong C Yuen, “Regularized fine-grained meta face anti-spoofing,” in AAAI, 2020, pp. 11974–11981.
  • [4] Usman Muhammad, Djamila Romaissa Beddiar, and Mourad Oussalah, “Domain generalization via ensemble stacking for face presentation attack detection,” arXiv preprint arXiv:2301.02145, 2023.
  • [5] Usman Muhammad, Mourad Oussalah, Md Ziaul Hoque, and Jorma Laaksonen, “Saliency-based video summarization for face anti-spoofing,” arXiv preprint arXiv:2308.12364, 2023.
  • [6] Guoqing Wang, Hu Han, Shiguang Shan, and Xilin Chen, “Cross-domain face presentation attack detection via multi-domain disentangled representation learning,” in CVPR, 2020, pp. 6678–6687.
  • [7] Usman Muhammad, Tuomas Holmberg, Wheidima Carneiro de Melo, and Abdenour Hadid, “Face anti-spoofing via sample learning based recurrent neural network (rnn),” in The British Machine Vision Conference 2019 (BMVC) 9th-12th September 2019, Cardiff UK. British Machine Vision Association Press, 2019.
  • [8] Zinelabidine Boulkenafet, Jukka Komulainen, and Abdenour Hadid, “Face spoofing detection using colour texture analysis,” TIFS, vol. 11, no. 8, pp. 1818–1830, 2016.
  • [9] Usman Muhammad and Abdenour Hadid, “Face anti-spoofing using hybrid residual learning framework,” in 2019 International Conference on Biometrics (ICB). IEEE, 2019, pp. 1–7.
  • [10] Di Wen, Hu Han, and Anil K Jain, “Face spoof detection with image distortion analysis,” TIFS, vol. 10, no. 4, pp. 746–761, 2015.
  • [11] Talha Ahmad Siddiqui, Samarth Bharadwaj, Tejas I Dhamecha, Akshay Agarwal, Mayank Vatsa, Richa Singh, and Nalini Ratha, “Face anti-spoofing with multifeature videolet aggregation,” in 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016, pp. 1035–1040.
  • [12] Azeddine Benlamoudi, Kamal Eddine Aiadi, Abdelkrim Ouafi, Djamel Samai, and Mourad Oussalah, “Face antispoofing based on frame difference and multilevel representation,” Journal of Electronic Imaging, vol. 26, no. 4, pp. 043007–043007, 2017.
  • [13] Usman Muhammad, Zitong Yu, and Jukka Komulainen, “Self-supervised 2d face presentation attack detection via temporal sequence sampling,” Pattern Recognition Letters, vol. 156, pp. 15–22, 2022.
  • [14] Usman Muhammad and Mourad Oussalah, “Face anti-spoofing from the perspective of data sampling,” Electronics Letters, vol. 59, no. 1, pp. e12692, 2023.
  • [15] Yante Li, Xiaohua Huang, and Guoying Zhao, “Joint local and global information learning with single apex frame detection for micro-expression recognition,” IEEE Transactions on Image Processing, vol. 30, pp. 249–263, 2020.
  • [16] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
  • [17] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [18] Rui Shao, Xiangyuan Lan, Jiawei Li, and Pong C Yuen, “Multi-adversarial discriminative deep domain generalization for face presentation attack detection,” in CVPR, 2019, pp. 10023–10031.
  • [19] Suman Saha, Wenhao Xu, Menelaos Kanakis, Stamatios Georgoulis, Yuhua Chen, Danda Pani Paudel, and Luc Van Gool, “Domain agnostic feature learning for image and video based face anti-spoofing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 802–803.
  • [20] Yunxiao Qin, Zitong Yu, Longbin Yan, Zezheng Wang, Chenxu Zhao, and Zhen Lei, “Meta-teacher for face anti-spoofing,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 10, pp. 6311–6326, 2021.
  • [21] Rizhao Cai, Zhi Li, Renjie Wan, Haoliang Li, Yongjian Hu, and Alex C Kot, “Learning meta pattern for face anti-spoofing,” IEEE Transactions on Information Forensics and Security, vol. 17, pp. 1201–1213, 2022.
  • [22] Hanye Huang, Youjun Xiang, Guodong Yang, Lingling Lv, Xianfeng Li, Zichun Weng, and Yuli Fu, “Generalized face anti-spoofing via cross-adversarial disentanglement with mixing augmentation,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 2939–2943.
  • [23] Shice Liu, Shitao Lu, Hongyi Xu, Jing Yang, Shouhong Ding, and Lizhuang Ma, “Feature generation and hypothesis verification for reliable face anti-spoofing,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 1782–1791.
  • [24] Mingxin Liu, Jiong Mu, Zitong Yu, Kun Ruan, Baiyi Shu, and Jie Yang, “Adversarial learning and decomposition-based domain generalization for face anti-spoofing,” Pattern Recognition Letters, vol. 155, pp. 171–177, 2022.
  • [25] Youngjun Kwak, Minyoung Jung, Hunjae Yoo, JinHo Shin, and Changick Kim, “Liveness score-based regression neural networks for face anti-spoofing,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [26] Guanghao Zheng, Yuchen Liu, Wenrui Dai, Chenglin Li, Junni Zou, and Hongkai Xiong, “Learning causal representations for generalizable face anti spoofing,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [27] Ivana Chingovska, André Anjos, and Sébastien Marcel, “On the effectiveness of local binary patterns in face anti-spoofing,” in Proceedings of the international conference of biometrics special interest group. IEEE, 2012, pp. 1–7.
  • [28] Zinelabinde Boulkenafet, Jukka Komulainen, Lei Li, Xiaoyi Feng, and Abdenour Hadid, “Oulu-npu: A mobile face presentation attack database with real-world variations,” in FG. IEEE, 2017, pp. 612–618.
  • [29] Zhiwei Zhang, Junjie Yan, Sifei Liu, Zhen Lei, Dong Yi, and Stan Z Li, “A face antispoofing database with diverse attacks,” in 2012 5th IAPR international conference on Biometrics (ICB). IEEE, 2012, pp. 26–31.
  • [30] Zitong Yu, Jun Wan, Yunxiao Qin, Xiaobai Li, Stan Z Li, and Guoying Zhao, “Nas-fas: Static-dynamic central difference network search for face anti-spoofing,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 9, pp. 3005–3023, 2020.
  • [31] Usman Muhammad, Jiehua Zhang, Li Liu, and Mourad Oussalah, “An adaptive spatio-temporal global sampling for presentation attack detection,” IEEE Transactions on Circuits and Systems II: Express Briefs, 2022.