
FaceCat: Enhancing Face Recognition Security with a Unified Diffusion Model

Jiawei Chen1, Xiao Yang2, Yinpeng Dong2, Hang Su2, Zhaoxia Yin1
Abstract

Face anti-spoofing (FAS) and adversarial detection (FAD) have been regarded as critical technologies to ensure the safety of face recognition systems. However, addressing them separately suffers from limited practicality, complex deployment, and additional computational overhead, making it necessary to implement both detection techniques within a unified framework. This paper aims to achieve this goal by breaking through two primary obstacles: 1) the suboptimal face feature representation and 2) the scarcity of training data. To address the limited performance caused by existing feature representations, motivated by the rich structural and detailed features of face diffusion models, we propose FaceCat, the first approach leveraging a diffusion model to simultaneously enhance the performance of FAS and FAD. Specifically, FaceCat elaborately designs a hierarchical fusion mechanism to capture the rich semantic face features of the diffusion model. These features then serve as a robust foundation for a lightweight head designed to execute FAS and FAD simultaneously. Since relying solely on single-modality image data limits feature representation, we further propose a novel text-guided multi-modal alignment strategy that utilizes text prompts to enrich feature representation and thereby enhance performance. To combat data scarcity, we build a comprehensive dataset with a wide range of 28 attack types, offering greater potential for a unified framework in facial security. Extensive experiments validate the effectiveness of FaceCat: it generalizes significantly better and obtains excellent robustness against common input transformations.

Introduction

Deep learning has propelled face recognition (FR) to the forefront of biometric applications, but its proliferation has sparked security concerns such as presentation attacks (2019; 2017; 2021) and adversarial attacks (2023; 2021; 2019a), which are crafted to deceive FR systems into granting unauthorized access or misidentifying individuals. In response to these challenges, specialized defense mechanisms have been developed, namely face anti-spoofing (2020; 2022; 2023) and face adversarial detection (2021; 2021). Existing methods treat the two as independent tasks, which necessitates deploying multiple models and thereby increases the computational overhead. Moreover, recent research (2023b; 2022; 2023) indicates that these methods generally lack generalization ability (an FAS method cannot be directly applied to detecting adversarial examples, and vice versa). In other words, these defenses cannot generalize well to unknown attack categories because they overfit to the manipulation types they are trained on, limiting their practical applicability. Therefore, it is critical for FR systems to integrate these security tasks to improve their overall robustness against attacks.

Figure 1: Comparison of our method and face generative models. (a) Face generative models used only for generation. (b) Our FaceCat exploits the abundant features inherent in face generative models to serve face anti-spoofing and adversarial detection simultaneously.

To address the problem mentioned above, there are two main challenges: (1) Suboptimal face feature representation: traditional methods, which often treat FAS and FAD as classification problems and utilize discriminative models to address multiple attack categories simultaneously (2023b; 2022), are not entirely suitable for a unified facial security framework. Such approaches typically make models focus primarily on the structural features of images (2023), whereas a richer representation of facial features is essential for a unified facial security framework whose targets include not only adversarial examples but also spoofing samples. Furthermore, these approaches rely solely on single-modality facial data, which limits their ability to further express features (2023). (2) Data scarcity: acquiring facial training data that covers a wide range of attack types is challenging compared to traditional facial data. Although some facial security datasets have been introduced, they exhibit significant shortcomings in terms of data types, such as the lack of 3D data and adversarial examples. This limitation hinders the development of facial security frameworks.

Based on the above discussions, this paper conducts a pioneering exploration of the unified facial security framework. Specifically, we address the aforementioned challenges by making the following contributions:

Study Presentation attack Adversarial attack Types
     2D 3D perturbation patch 3D-printed
FaceGuard (2023a)       ✗ 6
SiW-Mv2 (2019)       ✓ 14
WMCA (2019)       ✓ 7
HQ-WMCA (2020)       ✓ 10
HiFiMask (2021)       ✗ 3
GrandFake (2023b)       ✓ 25
FaceCat (Ours)       ✓ 28
Table 1: Comparison of the proposed protocol with others.

Effective framework for integrating facial security tasks. We propose FaceCat, a novel framework to integrate the FAS and FAD tasks. As illustrated in Figure 2, FaceCat extracts face features by leveraging a face diffusion model (FDM) (2020), which offers a wealth of facial features (as shown in Figure 3) for enhanced facial security tasks. To address the challenge of adapting FDM to face security tasks, we design a hierarchical fusion mechanism that incorporates multi-level features into rich face representations. Specifically, we amalgamate five distinct hierarchical blocks and stack them through pooling operations to form a unified face feature, adept at seamless integration with downstream network architectures. Moreover, to address the limitations of single-modal data, we propose a text-guided (TG) multi-modal alignment strategy, which utilizes textual knowledge to enrich semantic content and thereby further enhance FaceCat's representation capability. In detail, we introduce textual information to describe visual concepts and generate text embeddings. These embeddings are then integrated with FDM's image embeddings to compute a similarity score, which is regarded as the logit for optimization. Besides, owing to the similarity of face features across classes, a triplet-based margin optimization is adopted to make real samples more compact and push attack samples apart.

Thorough facial security dataset. We introduce FaceCatData, a comprehensive and fair dataset (approximately 410,000 face images) tailored for unified face security tasks. In particular, this dataset encompasses 28 diverse attack types, involving multiple new practical attacks such as 3D-printed attacks, as shown in Table 1. Based on this dataset, we compare FaceCat with several currently popular methods, including five FAS methods, three FAD methods, and four classification models. For the sake of fairness, all these methods have been fine-tuned on the proposed protocol. Moreover, three prevalent input transformations (2016; 2019) are adopted to verify the robustness of FaceCat. We also conduct an ablation study to further investigate the proposed components. Experimental results demonstrate that our method exhibits excellent performance and outstanding robustness. Our main contributions can be summarized as follows:

  • To the best of our knowledge, this is the first work to unify FAD and FAS tasks using a diffusion model to serve as a strong feature initialization.

  • We propose a text-guided multi-modal alignment strategy that leverages text embeddings to enhance performance through multi-modal supervision.

  • We develop a comprehensive and fair dataset to perform FAS and FAD simultaneously.

  • Extensive experiments demonstrate that the proposed method exhibits superior performance on the FAS and FAD tasks.

Figure 2: An overview of our proposed FaceCat framework. FaceCat includes a generative model $\epsilon(\bm{x}_{t},t)$ to encode the noisy image $\bm{x}_{t}$ into face features, a lightweight head $\mathcal{H}^{f}$ to extract image embeddings $\bm{e}$ from these face features, and a text encoder to obtain text embeddings from text prompts. Through the image embeddings and text embeddings $\{\bm{\omega}_{i}\}_{i=1}^{K}$, the multi-modal alignment strategy calculates a text-image similarity score treated as the logit. The triplet-based margin is utilized to facilitate feature learning.

Related Work

In this section, we review related work on representation with generative models, FAS, and FAD.

Representation with Generative Models

Pioneering works attempt to exploit GANs (2016) and VAEs (2014) for representation learning. (2019) demonstrates that GANs can learn competitive representations for images in latent space. In the past two years, diffusion models (2020; 2021) have garnered substantial achievements in the realm of generative learning. Therefore, recent works attempt to utilize diffusion models for representation learning to serve downstream tasks, such as image editing (2022) and semantic segmentation (2021). Despite achieving commendable performance, the majority of these studies do not investigate integrating face security tasks or the robustness of diffusion models.

Face Anti-Spoofing and Adversarial Detection

FAS (2022; 2023; 2023) and FAD (2021; 2018; 2022; 2021) have been developed to protect face recognition systems. In recent years, research in FAS has increasingly gravitated towards 3D and multi-modal approaches (2020), while FAD favors the development of attack-agnostic, universally applicable defense methods (2021; 2021). Moreover, some research (2023b; 2022) attempts to unify several face security tasks into one framework, aiming to enhance both practicality and generalization. UniFAD (2023b) attempts to unify digital attacks and FAS through k-means clustering. However, due to the inherent drawback of classification models, such methods primarily focus on the structural features of images. In contrast, FaceCat captures both global structural and deep-level detailed features by virtue of its generative backbone.

Method

We first formulate how to unify FAS and FAD. Afterward, we elucidate the technical details of the hierarchical fusion mechanism and the lightweight head. We then propose a text-guided multi-modal alignment strategy to improve performance. Moreover, a triplet-based margin optimization is utilized as an auxiliary objective to increase the distance between positive and negative samples. An overview of our proposed method is provided in Figure 2.

Problem Formulation

For FAS and FAD tasks, we can define three input types: the spoofing face $\bm{x}^{spoof}$, the adversarial face $\bm{x}^{adv}$, and the real face $\bm{x}^{real}$. A unified detector only needs to reject potential attacks without identifying whether the input is $\bm{x}^{spoof}$ or $\bm{x}^{adv}$. Thus, we define $\bm{x}^{fake}\in\{\bm{x}^{spoof},\bm{x}^{adv}\}$. The decision function can be defined as:

$$f_{\bm{\theta}}:\bm{x}\rightarrow\{0\ (\bm{x}^{fake}),\ 1\ (\bm{x}^{real})\},$$

which classifies whether a sample $\bm{x}$ is fake or real. $f_{\bm{\theta}}(\bm{x}):\mathbb{R}^{d}\rightarrow\mathbb{R}$ is the unified detector parameterized by $\bm{\theta}$. Generally, the problem is formulated as a dichotomous classification as follows:

$$\mathcal{L}_{c}(\bm{x},y)=\mathbb{E}_{p(\bm{x})}\left[-\big(y\log(f_{\bm{\theta}}(\bm{x}))+(1-y)\log(1-f_{\bm{\theta}}(\bm{x}))\big)\right],$$
$$\text{where } y\in\{0,1\},\ \bm{x}\in\{\bm{x}^{fake},\bm{x}^{real}\},$$ (1)

where $y$ is the ground truth and $\bm{x}\in\mathbb{R}^{d}$ is the input. In this paper, our method also considers unifying FAS and FAD as a dichotomous classification problem.
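As a concrete illustration, the following is a minimal PyTorch sketch of this dichotomous formulation; the detector f_theta and its training loop are placeholders, not the actual FaceCat architecture.

```python
import torch
import torch.nn.functional as F

def unified_detection_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Dichotomous objective of Eq. (1): binary cross-entropy over real (1) vs. fake (0)."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())

# Usage: any backbone that outputs a single score per image can play the role of f_theta.
# logits = f_theta(x)                        # shape (B,)
# loss = unified_detection_loss(logits, y)   # y in {0 (fake), 1 (real)}
```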

Figure 3: Multi-block feature representations are derived from the face diffusion model across different channels.

Traditional classification models may overly prioritize image structure at the expense of fine details, whereas FDM inherently contains rich global structural and deep-level detail features. Therefore, we aim to leverage it as a strong initialization for FAS and FAD. A brief overview of FDM reveals that its training process is structured in two stages: 1) the forward diffusion process gradually adds Gaussian noise to the data to obtain a sequence of noisy samples; 2) the backward generative process reverses the diffusion process to denoise images. The central objective is to predict the noise at time $t$, which is formulated as a simple mean squared error loss:

$$\mathcal{L}_{\text{simple}}=\mathbb{E}_{\bm{x}_{0},t,\bm{\epsilon}}\left[\|\epsilon_{\bm{\psi}}(\bm{x}_{t},t)-\bm{\epsilon}\|_{2}^{2}\right],$$ (2)

where $\epsilon_{\bm{\psi}}$ is the diffusion model and $\bm{x}_{t}$ is the noisy image at time $t$ (the noise level), i.e., $\bm{x}_{t}$ is obtained by adding noise $\bm{\epsilon}$ to $\bm{x}_{0}$.
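For intuition, a minimal sketch of this objective is given below, assuming a standard DDPM noise schedule with a precomputed cumulative product `alpha_bar` and a noise predictor `eps_model(x_t, t)`; all names are illustrative, not the FDM's actual implementation.

```python
import torch

def ddpm_simple_loss(eps_model, x0, alpha_bar):
    """L_simple of Eq. (2): MSE between the true and predicted noise at a random timestep.

    alpha_bar holds the cumulative product of (1 - beta_t) for every timestep.
    """
    b = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bar[t].view(b, 1, 1, 1)
    # Forward diffusion: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return ((eps_model(x_t, t) - eps) ** 2).mean()
```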

In this paper, we focus on using the Face diffusion model as a Catalyst (FaceCat) for enhancing FAS and FAD. Formally, given an FDM $\epsilon_{\bm{\psi}}$, we only obtain the features from its intermediate layers rather than the predicted noise, i.e., the input $\bm{x}_{t}$ is processed through $\epsilon_{\bm{\psi}}$ to obtain a universal face feature with dimension $d^{\prime}$. To leverage this feature, a lightweight head $\mathcal{H}:\mathbb{R}^{d^{\prime}}\rightarrow\mathbb{R}$ is adopted. Therefore, the objective function of FaceCat can be formulated as:

$$\underset{\bm{\kappa}}{\min}~\mathbb{E}_{p(\bm{x}_{t})}\left[\mathcal{L}_{c}(\mathcal{H}_{\bm{\kappa}}(\epsilon_{\bm{\psi}}(\bm{x}_{t},t)),y)\right],$$ (3)

where $\mathcal{H}$ is parameterized by $\bm{\kappa}$ and $p(\bm{x}_{t})$ is the probability distribution of $\bm{x}_{t}$. Note that Equation 3 is only adopted to optimize the lightweight head $\mathcal{H}_{\bm{\kappa}}$. The rationale behind this is twofold: 1) the features of FDM are sufficiently general-purpose; 2) by not fine-tuning FDM, we reduce the training time. By following objective (3), given a face image $\bm{x}$, we can get the probability of whether it is a real face.
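The sketch below illustrates one training step under Equation 3, assuming a frozen feature extractor `fdm_features(x_t, t)` (a stand-in for the FDM with the fusion mechanism described later) and a fixed noise level `t_fixed`; these names and values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def facecat_head_step(fdm_features, head, optimizer, x0, y, alpha_bar, t_fixed=50):
    """One optimization step of Eq. (3): the FDM is frozen, only the head is updated."""
    b = x0.shape[0]
    t = torch.full((b,), t_fixed, device=x0.device, dtype=torch.long)
    eps = torch.randn_like(x0)
    a_bar = alpha_bar[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # noisy input x_t

    with torch.no_grad():                                   # no gradients flow into the FDM
        feats = fdm_features(x_t, t)
    logits = head(feats).squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(logits, y.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                 # the optimizer holds only head.parameters()
    return loss.item()
```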

Hierarchical Fusion Mechanism and Lightweight Head

Blocks ACER EER TDR@0.2%FDR
B = 5, 6 2.62 2.59 92.81
B = 7, 8 2.67 2.74 92.89
B = 5, 12 2.82 2.79 92.58
B = 5, 6, 7, 12 2.56 2.40 93.07
B = 5, 6, 7, 8 2.51 2.45 93.19
B = 5, 6, 7, 8, 12 1.36 2.12 94.64
Table 2: The performance of FaceCat with different blocks.

As illustrated in Figure 3, different blocks of the diffusion model capture distinct facial features, with some layers representing detailed features and others abstract features. For facial security tasks, both types of features are beneficial to model performance. Thus, we propose leveraging a hierarchical fusion mechanism to integrate these diverse features:

$$f_{total}=\sum_{i=1}^{n}\mathcal{F}_{i}(f_{i}),$$ (4)

where $f_{i}$ denotes the input features to the $i^{th}$ block, $\mathcal{F}_{i}(f_{i})$ represents the output features after processing by the $i^{th}$ block, and $f_{total}$ is the fused feature. For the lightweight head, aiming for minimal weight while ensuring performance, we utilize the first four layers of ResNet-18, complemented by two additional downsampling layers.
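A minimal sketch of such a fusion module is shown below. It assumes the frozen diffusion U-Net exposes its blocks as top-level children and that each hooked block outputs a feature map; the block indices, pooling size, and the use of channel-wise concatenation of pooled features (rather than a literal summation as written in Equation 4) are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalFusion(nn.Module):
    """Collect activations from selected U-Net blocks, pool them, and stack them into f_total."""

    def __init__(self, unet, block_indices=(5, 6, 7, 8, 12), out_size=16):
        super().__init__()
        self.unet = unet.eval()            # frozen face diffusion U-Net
        self.out_size = out_size
        self._feats = []
        blocks = list(unet.children())     # assumption: blocks are exposed as children
        for idx in block_indices:
            blocks[idx].register_forward_hook(
                lambda module, inputs, output: self._feats.append(output))

    def forward(self, x_t, t):
        self._feats.clear()
        with torch.no_grad():
            self.unet(x_t, t)                              # hooks populate self._feats
        pooled = [F.adaptive_avg_pool2d(f, self.out_size) for f in self._feats]
        return torch.cat(pooled, dim=1)                    # stacked multi-block feature
```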

Although some related works (2018) have also introduced the concept of multi-level features, their application has largely been empirical within traditional classification models. Given the unique architecture of diffusion models, which differs significantly from conventional classification models, previous experience cannot be directly transferred. Our investigation into the fusion of different blocks is therefore detailed in Table 2. Observations indicate that optimal performance is achieved when blocks 5, 6, 7, 8, and 12 are selected; this configuration is adopted for subsequent experiments.

Text-guided Multi-modal Alignment

The performance of FAS and FAD often suffers from the constraints associated with the exclusive use of single-modal (RGB) data. Drawing inspiration from the ability of textual data to impart more detailed information in computer vision tasks, we design a text-guided multi-modal alignment strategy so that textual information can be exploited by face security models.

Our text-guided multi-modal alignment is based on CLIP (2021). Given a batch of image-text pairs, we can predict the image's class by computing a text-image cosine similarity. Formally, given an image $\bm{x}$, we can compute the prediction probability as follows:

$$p(\hat{y}=i\,|\,\bm{x})=\frac{\exp(\cos(\bm{\omega}_{i},\bm{e})/\tau)}{\sum_{j=1}^{K}\exp(\cos(\bm{\omega}_{j},\bm{e})/\tau)},$$ (5)

where $\tau$ is a temperature parameter and $\hat{y}$ is the predicted class label. $\bm{e}$ is the image embedding of $\bm{x}$ extracted by the image encoder, and $\{\bm{\omega}_{i}\}_{i=1}^{K}$ represent the weight vectors produced by the text encoder. $K$ and $\cos(\cdot,\cdot)$ denote the number of classes and cosine similarity, respectively.
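A minimal sketch of this similarity-based prediction is given below; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def text_image_probs(image_emb, text_embs, tau=0.07):
    """Eq. (5): softmax over cosine similarities between an image embedding
    and the K class text embeddings.

    image_emb: (B, 512), text_embs: (K, 512); tau is an illustrative temperature.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = image_emb @ text_embs.t() / tau
    return logits.softmax(dim=-1)             # (B, K) class probabilities
```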

Text Prompts
Real Prompts A photo of a real face.
This is a real face.
This is not a fake face.
An example of a real face.
A photo of the bonafide face.
An example of a bonafide face.
Fake Prompts A photo of a fake face.
This is a fake face.
This is not a real face.
An example of a fake face.
A photo of the attack face.
An example of attack face.
Table 3: The text prompts of the real and fake classes.

Specifically, we craft text prompts for this task along the lines of “a photo of a [class]”, where “[class]” is “[real face]” or “[fake face]”; all text prompts are presented in Table 3. This can be formalized as $\bm{t}=[V]_{1}[V]_{2}\ldots[V]_{M}[\text{real/fake face}]$, where each $[V]_{m}\ (m\in\{1,\ldots,M\})$ is a vector with the same dimension as the word embeddings (i.e., 512 for the text encoder), and $M$ is a hyperparameter specifying the number of context tokens. Given the text encoder $g(\cdot)$, we obtain $\bm{\omega}=g(\bm{t})$.
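As an example of how such prompts can be turned into class weight vectors, the sketch below uses the OpenAI CLIP text encoder and averages the embeddings of each class's prompts; the choice of encoder and the per-class averaging are assumptions for illustration, not necessarily the paper's exact procedure.

```python
import torch
import clip  # OpenAI CLIP package; the specific text encoder is an assumption

REAL_PROMPTS = ["A photo of a real face.", "This is a real face.",
                "An example of a bonafide face."]
FAKE_PROMPTS = ["A photo of a fake face.", "This is a fake face.",
                "An example of an attack face."]

@torch.no_grad()
def class_text_embeddings(device="cpu"):
    """Encode the prompts of each class and average them into one 512-d weight vector."""
    model, _ = clip.load("ViT-B/32", device=device)
    embs = []
    for prompts in (FAKE_PROMPTS, REAL_PROMPTS):      # index 0 = fake, 1 = real
        tokens = clip.tokenize(prompts).to(device)
        e = model.encode_text(tokens).float()         # (num_prompts, 512)
        embs.append(e.mean(dim=0))                    # prompt ensembling (an assumption)
    return torch.stack(embs)                          # omega, shape (2, 512)
```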

As for $\bm{e}$, we acquire it through the lightweight head $\mathcal{H}$. Assume that the feature layer of $\mathcal{H}$ is $\mathcal{H}^{f}:\mathbb{R}^{d^{\prime}}\rightarrow\mathbb{R}^{d^{\prime\prime}}$, where $d^{\prime\prime}=512$ matches the dimension of the word embeddings. Hence, Equation 5 can be re-expressed as:

$$p(\hat{y}=i\,|\,\bm{x}_{t})=\frac{\exp(\cos(g(\bm{t}_{i}),\mathcal{H}^{f}(\epsilon_{\bm{\psi}}(\bm{x}_{t})))/\tau)}{\sum_{j=1}^{K}\exp(\cos(g(\bm{t}_{j}),\mathcal{H}^{f}(\epsilon_{\bm{\psi}}(\bm{x}_{t})))/\tau)},$$ (6)

where $K=2$ and $\bm{x}_{t}$ is the noisy image. Additionally, we employ the more flexible focal loss (2017) as the optimization function instead of the binary cross-entropy loss. Formally, given that $p$ denotes the probability in $p(\hat{y}|\bm{x}_{t})$ that the identified image is a real face, we adopt the notation $p_{c}$ to represent the probability of the target class:

$$p_{c}=\begin{cases}p&\text{if }y=1,\\ 1-p&\text{otherwise.}\end{cases}$$ (7)

Referring to the standard $\alpha$-balanced focal loss (2017), the text-guided multi-modal alignment can be formulated as follows:

$$\text{TG}(p_{c},y)=-\alpha_{c}(1-p_{c})^{\gamma}\log(p_{c}),$$ (8)

where the parameter $\alpha_{c}$ is introduced to balance the data distribution, and its formal definition mirrors that of Equation 7, with $p$ replaced by $\alpha$. Since negative samples are approximately twice as many as positive samples, $\alpha$ is set to 0.75. To encourage $\mathcal{H}^{f}$ to focus on hard samples during training, we set the focusing parameter $\gamma$ to 2.0.
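A minimal sketch of this alpha-balanced focal loss, using the values stated above, is given below; the small clamp on the log argument is added only for numerical stability.

```python
import torch

def tg_focal_loss(p_real, y, alpha=0.75, gamma=2.0):
    """Alpha-balanced focal loss of Eq. (8).

    p_real: probability that the image is real, taken from Eq. (6); y: 0/1 labels.
    """
    p_c = torch.where(y == 1, p_real, 1.0 - p_real)                       # Eq. (7)
    alpha_c = torch.where(y == 1,
                          torch.full_like(p_real, alpha),
                          torch.full_like(p_real, 1.0 - alpha))
    return (-alpha_c * (1.0 - p_c) ** gamma * torch.log(p_c.clamp_min(1e-8))).mean()
```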

Triplet-based Margin Optimization

Face data from different classes exhibit similar features, unlike data in general classification tasks. This characteristic makes it difficult to distinguish between positive and negative sample features. Therefore, we adopt a triplet loss (2017), commonly utilized in deep metric learning, to supervise the features; it has proven effective in optimizing the relative distances between samples in the embedding space. Its mathematical form is expressed as follows:

$$\mathcal{L}_{\text{triplet}}(\bm{x}^{a},\bm{x}^{p},\bm{x}^{n})=\max\Big(0,\ \big\|\mathcal{H}^{f}(\epsilon_{\bm{\psi}}(\bm{x}^{a}))-\mathcal{H}^{f}(\epsilon_{\bm{\psi}}(\bm{x}^{p}))\big\|_{2}^{2}-\big\|\mathcal{H}^{f}(\epsilon_{\bm{\psi}}(\bm{x}^{a}))-\mathcal{H}^{f}(\epsilon_{\bm{\psi}}(\bm{x}^{n}))\big\|_{2}^{2}+m\Big),$$

where $\bm{x}^{p}$ and $\bm{x}^{n}$ are the positive sample ($\bm{x}^{real}$) and negative sample ($\bm{x}^{fake}$), respectively. $\bm{x}^{a}$ is an anchor sample belonging to the same class as $\bm{x}^{real}$. $\|\cdot\|_{2}^{2}$ denotes the squared Euclidean distance. The margin $m$ is a hyperparameter ensuring a safe gap between positive and negative pairs, preventing trivial solutions. Therefore, Equation 3 can be rewritten as:

$$\underset{\bm{\kappa}^{\prime}}{\min}~\mathbb{E}_{p(\bm{x}_{t})}\left[\text{TG}(p_{c},y)+\lambda\cdot\mathcal{L}_{\text{triplet}}(\bm{x}_{t}^{a},\bm{x}_{t}^{real},\bm{x}_{t}^{fake})\right],$$ (9)

where $\bm{\kappa}^{\prime}$ denotes the parameters of $\mathcal{H}^{f}$, and $\lambda$ is the balancing factor. $\bm{x}_{t}^{a}$, $\bm{x}_{t}^{real}$ and $\bm{x}_{t}^{fake}$ are the noisy versions of $\bm{x}^{a}$, $\bm{x}^{real}$ and $\bm{x}^{fake}$. Note that we do not optimize the parameters of $g(\cdot)$ either. Following Equation 9, we obtain a classifier that performs FAS and FAD simultaneously.
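The sketch below illustrates the triplet term and how the two losses could be combined as in Equation 9; the margin value and the helper names (head_f, fdm_features, lam) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(emb_a, emb_p, emb_n, m=0.5):
    """Squared-Euclidean triplet loss with margin m (the margin value is illustrative)."""
    d_ap = (emb_a - emb_p).pow(2).sum(dim=-1)
    d_an = (emb_a - emb_n).pow(2).sum(dim=-1)
    return F.relu(d_ap - d_an + m).mean()

# Combined objective of Eq. (9), with lam standing for the balancing factor lambda:
# emb = head_f(fdm_features(x_t, t))           # H^f image embeddings, 512-d
# probs = text_image_probs(emb, text_embs)     # Eq. (6)
# loss = tg_focal_loss(probs[:, 1], y) + lam * triplet_margin_loss(emb_a, emb_real, emb_fake)
```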

Category Method APCER BPCER ACER EER TDR@0.2%FDR
Face anti-spoofing DeepPixBis (2019) 1.89 3.63 2.76 3.12 85.48
Depthnet (2018) 2.64 5.53 4.09 4.23 88.81
FRT-PAD (2022) 2.80 3.04 2.52 2.93 89.20
Flip (2023) 1.82 2.38 2.10 2.54 91.78
CFPL (2024) 2.24 2.64 2.44 2.46 92.34
Face adversarial detection EST (2022) 5.29 2.01 3.65 3.89 88.31
FDS (2018) 4.89 3.22 4.06 3.54 87.43
DFRAA (2021) 3.52 3.12 3.32 3.26 88.19
Classification models Resnet50 (2016) 3.84 2.77 3.31 3.36 86.41
Inceptionv3 (2017) 2.91 3.69 3.30 3.35 87.54
Efficientnetb0 (2019) 3.80 3.23 3.52 3.53 86.44
ViT-b/16 (2020) 3.31 3.02 3.17 3.18 87.56
Proposed FaceCat w/o TG 1.96 2.70 2.33 2.35 92.74
FaceCat with TG 1.36 2.12 1.24 1.68 94.64
Table 4: Performance comparison (%) between the proposed FaceCat and baseline methods under the protocol.
Figure 4: Adversarial examples generated by different attacks.

FaceCatData

Dataset creation. To evaluate FaceCat's capacity for concurrent FAS and FAD, we first craft 14 types of adversarial examples from the LFW (2008) dataset targeting the state-of-the-art (SOTA) face recognition model ArcFace (2019), as shown in Figure 4. We employ various techniques, including: 1) adversarial perturbations: FGSM(w) (2014), PGD(w) (2017), DIM(w) (2019) and Evolutionary(b) (2019b); 2) adversarial patches (2021): Eyeglass(w), Sticker(w), Facemask(w). The 'w' and 'b' denote white-box and black-box attacks, respectively. Each attack type contains dodging and impersonation attacks. All adversarial examples are generated using the default parameters of the respective works. The attack success rate (ASR) and more adversarial examples are shown in Appendix A. Moreover, SiW-Mv2 is adopted as a complement to evaluate FAS. We partition the adversarial examples based on Protocol 1 of SiW-Mv2 and subsequently incorporate them into SiW-Mv2, yielding a comprehensive dataset, FaceCatData, that evaluates FAS and FAD simultaneously.
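For illustration, the sketch below shows an L_inf PGD dodging attack against a face embedding model; the epsilon, step size, step count, and the specific dodging objective are illustrative assumptions rather than the exact attack configurations used to build FaceCatData.

```python
import torch
import torch.nn.functional as F

def pgd_dodging(face_model, x, x_gallery, eps=8 / 255, alpha=2 / 255, steps=10):
    """L_inf PGD dodging attack: push the probe embedding away from the enrolled identity.

    face_model maps images to identity embeddings (e.g. an ArcFace-style backbone).
    """
    x_adv = x.clone().detach()
    with torch.no_grad():
        target = F.normalize(face_model(x_gallery), dim=-1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        emb = F.normalize(face_model(x_adv), dim=-1)
        cos_sim = (emb * target).sum(dim=-1).mean()       # similarity to the true identity
        grad = torch.autograd.grad(cos_sim, x_adv)[0]
        x_adv = (x_adv - alpha * grad.sign()).detach()    # descend: reduce similarity
        x_adv = x + (x_adv - x).clamp(-eps, eps)          # project into the L_inf ball
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv
```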

Protocol Class # Live # Spoof # Adv # Total
protocol 1 train 86404 68597 50000 205001
eval 34561 27379 20000 81940
test 51842 41218 30000 123060
protocol 2 train 86404 0 50000 136404
eval 34561 0 20000 54561
test 51842 41218 30000 123060
protocol 3 train 86404 68597 0 155001
eval 34561 27379 0 61940
test 51842 41218 30000 123060
Table 5: The data statistics of the three protocols in FaceCatData.

Types of protocols. For the proposed FaceCatData, we design three protocols, as shown in Table 5. Protocol 1, known attack pattern detection: all attack types are present in the training set, aiming to evaluate the method's capability to perform FAS and FAD simultaneously. Protocols 2 and 3 both involve unknown attack pattern detection. Specifically, Protocol 2 tests the method's performance when spoof faces are absent from the training and validation sets, while Protocol 3 assesses its performance when adversarial faces are not included in the training and validation sets. In this paper, unless stated otherwise, Protocol 1 is utilized; the experiments under Protocols 2 and 3 are detailed in Appendix B.

Experiments

In this section, we conduct extensive experiments to evaluate the performance of FaceCat and perform ablation studies to verify the effectiveness of each component.

Experimental Settings

Evaluation metrics. We evaluate with the following metrics: Attack Presentation Classification Error Rate (APCER), Bona Fide Presentation Classification Error Rate (BPCER), and their average, the Average Classification Error Rate (ACER), for a fair comparison, as well as Equal Error Rate (EER), Area Under Curve (AUC), and True Detection Rate (TDR) at a False Detection Rate (FDR) of 0.2% (TDR @ 0.2% FDR). In addition, a t-SNE visualization (2008) is also reported to further evaluate the performance. The specific training details can be found in Appendix C.
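A rough NumPy sketch of these metrics is shown below; the fixed threshold, the coarse EER grid, and the assumption that higher scores indicate real faces are illustrative, not the evaluation code used in the paper.

```python
import numpy as np

def security_metrics(scores, labels, thr=0.5, fdr_target=0.002):
    """APCER, BPCER, ACER, EER and TDR at a fixed FDR from detection scores.

    scores: higher means "more likely real"; labels: 1 for real, 0 for attack.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    real, attack = scores[labels == 1], scores[labels == 0]

    apcer = np.mean(attack >= thr)            # attacks accepted as bona fide
    bpcer = np.mean(real < thr)               # bona fide rejected as attacks
    acer = (apcer + bpcer) / 2

    ts = np.linspace(scores.min(), scores.max(), 1000)
    apcers = np.array([np.mean(attack >= t) for t in ts])
    bpcers = np.array([np.mean(real < t) for t in ts])
    eer = (apcers + bpcers)[np.argmin(np.abs(apcers - bpcers))] / 2

    # TDR @ FDR: "detection" flags an attack (score < t); FDR is the bona fide flag rate.
    valid = ts[bpcers <= fdr_target]
    tdr = float(np.mean(attack < valid.max())) if len(valid) else 0.0
    return apcer, bpcer, acer, eer, tdr
```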

Figure 5: The area under curve (%) between FaceCat and the baseline against common input transformations.

Experiment in the Proposed Protocol

Effectiveness of the proposed method. To verify the efficacy of the proposed approach, we compare FaceCat with three groups of baselines: 1) FAS: Flip (2023), CFPL (2024), DeepPixBis (2019), Depthnet (2018), FRT-PAD (2022); 2) FAD: EST (2022), FDS (2018), DFRAA (2021); 3) classification models: Resnet50 (2016), Inceptionv3 (2017), Efficientnetb0 (2019), ViT-b/16 (2020). These methods are either commonly used or represent the SOTA techniques. For a fair comparison, all of them have been fine-tuned according to the proposed protocol. Table 4 shows the performance metrics for each method. Inspection of the table allows us to draw the following conclusions:

(1) FaceCat effectively improves the security of face recognition systems. When conducting FAS and FAD simultaneously, all metrics achieve optimal performance, with ACER and EER being 1.24% and 1.68%, respectively. This can be attributed to: 1) the powerful representational capability of the diffusion model, which provides rich prior knowledge for FAS and FAD; 2) the proposed hierarchical fusion mechanism, which effectively leverages the features of the diffusion model; and 3) the text-guided multi-modal alignment strategy, which further enhances the performance of FaceCat by utilizing rich textual features.

(2) The proposed method with the text-guided multi-modal alignment strategy generally outperforms the variant without it across various metrics, demonstrating that the strategy effectively improves the performance of the proposed method. We attribute this to low-cost yet effective text embeddings, which facilitate face feature learning under the nuanced supervision of multi-modal cues.

Evaluation on input transformation. To evaluate the robustness of FaceCat, common input transformations that are excluded from data augmentation, e.g., bit-depth reduction, Gaussian blur, and JPEG compression, are employed to degrade image quality. We compare our results with a widely recognized method, CDCN (2020). In Figure 5, when image quality degrades, the performance of CDCN in addressing diverse types of attacks diminishes, with its response to JPEG compression being the most notably affected. This indicates that although the baseline demonstrates decent performance under the proposed protocol, its robustness significantly decreases with the degradation of image quality. In contrast, FaceCat continues to maintain stable performance even in the face of degraded image quality, reaching up to 55.3% better than the baseline under JPEG compression. This can be attributed to the superior feature robustness of FaceCat, i.e., even with the loss of certain image information, the proposed method still achieves robust detection of such attack threats.
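For reference, the three transformations can be applied as in the sketch below; the bit depth, blur radius, and JPEG quality are illustrative values, not the exact settings of this evaluation.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def bit_depth_reduction(img: Image.Image, bits: int = 4) -> Image.Image:
    """Quantize each channel to 2**bits levels."""
    arr = np.asarray(img).astype(np.float32)
    levels = 2 ** bits - 1
    arr = np.round(arr / 255.0 * levels) / levels * 255.0
    return Image.fromarray(arr.astype(np.uint8))

def gaussian_blur(img: Image.Image, radius: float = 1.0) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius))

def jpeg_compress(img: Image.Image, quality: int = 50) -> Image.Image:
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```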

Ablation Study

Method ACER EER TDR@0.2%FDR
MAE (2022) 2.33 2.36 92.77
SwAV (2020) 2.36 2.23 91.57
FR-ArcFace (2019) 2.75 2.41 91.14
w/o TG ($\alpha$=0.25) 2.74 2.76 88.66
w/o TG ($\alpha$=0.5) 2.89 3.02 91.58
w/o TG ($\alpha$=0.75) 2.33 2.36 92.74
FaceCat 1.36 2.12 94.64
Table 6: The ablation study of FaceCat under the protocol.

Advantages of the FDM architecture. We conduct an ablation study to verify the effectiveness of the FDM architecture, as shown in Table 6. MAE (2022) and SwAV (2020) are SOTA self-supervised learning methods that were pre-trained on the FFHQ dataset prior to fine-tuning on our proposed protocol, ensuring consistency in model initialization. FR-ArcFace (2019) is the face recognition model ArcFace, which utilizes a ResNet-18 (2016) backbone and is pre-trained for face recognition. Two key conclusions can be drawn from the table. First, even the employment of pre-trained face models can enhance the model's capability to conduct FAS and FAD; however, FDM, with its dual focus on global structural information and intricate detail features, demonstrates superior performance over these pre-trained face models. Second, utilizing face recognition models for pre-training yields inferior results compared to models pre-trained via self-supervised methods. This is attributed to the tendency of classification models to overemphasize structural features, resulting in a partial loss of feature information.

Value of $\bm{\alpha}$. In Table 6, we explore the influence of different values of $\alpha$ in TG on the experimental results. $\alpha=0.75$ performs better than the other settings. In joint defense scenarios, the quantity and diversity of real face samples are typically inferior to those of spoofing faces, which leads to an imbalanced focus of models on fake face samples. Hence, adjusting the hyper-parameter $\alpha$ to mitigate sample imbalance significantly improves the model's overall performance.

Figure 6: Feature distribution visualization of the test set on the proposed protocol using t-SNE (2008). (a) Features with TG. (b) Features without TG.

Feature visualization. The feature distribution in the test set of the proposed protocol is visualized in Figure 6 via t-SNE (2008), covering 6 types (live, PGD, makeup, silicone mask, facemask, transparent mask). It is evident from (b) that, in the absence of TG, facemask and PGD are not distinctly separable. Moreover, in (a) the live samples form a compact distribution and the clusters exhibit improved separation when the proposed method is trained with TG. These results illustrate the effectiveness of the text-guided strategy in FAS and FAD tasks. The main reason is that TG leverages text embeddings to enhance the aggregation and distinction of image features.

The ablation study on the diffusion timesteps and the real-world experiments are detailed in Appendices D and F, respectively.

Conclusion

In this paper, we proposed a novel framework called FaceCat, which treats the FDM as a pre-trained model for integrating face anti-spoofing and adversarial detection. Besides, we introduced a text-guided multi-modal alignment and hierarchical fusion mechanism to enhance semantic information and optimize the utilization of FDM’s features, respectively. We also conducted extensive experiments to evaluate the effectiveness of FaceCat for face security tasks. Moreover, we further validated its robustness against common input transformations.

References

  • Baranchuk et al. (2021) Baranchuk, D.; Rubachev, I.; Voynov, A.; Khrulkov, V.; and Babenko, A. 2021. Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126.
  • Boulkenafet et al. (2017) Boulkenafet, Z.; Komulainen, J.; Li, L.; Feng, X.; and Hadid, A. 2017. OULU-NPU: A mobile face presentation attack database with real-world variations. In 2017 12th IEEE international conference on automatic face & gesture recognition (FG 2017), 612–618. IEEE.
  • Caron et al. (2020) Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; and Joulin, A. 2020. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33: 9912–9924.
  • Carrara et al. (2018) Carrara, F.; Becarelli, R.; Caldelli, R.; Falchi, F.; and Amato, G. 2018. Adversarial examples detection in features distance spaces. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
  • Chen et al. (2023) Chen, J.; Yang, X.; Yin, H.; Ma, M.; Chen, B.; Peng, J.; Guo, Y.; Yin, Z.; and Su, H. 2023. AdvFAS: A robust face anti-spoofing framework against adversarial examples. Computer Vision and Image Understanding, 235: 103779.
  • Deb, Liu, and Jain (2023a) Deb, D.; Liu, X.; and Jain, A. K. 2023a. Faceguard: A self-supervised defense against adversarial face images. In 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), 1–8. IEEE.
  • Deb, Liu, and Jain (2023b) Deb, D.; Liu, X.; and Jain, A. K. 2023b. Unified detection of digital and physical face attacks. In 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), 1–8. IEEE.
  • Deng et al. (2019) Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 4690–4699.
  • Deng et al. (2021) Deng, Z.; Yang, X.; Xu, S.; Su, H.; and Zhu, J. 2021. Libre: A practical bayesian approach to adversarial detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 972–982.
  • Donahue, Krähenbühl, and Darrell (2016) Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.
  • Donahue and Simonyan (2019) Donahue, J.; and Simonyan, K. 2019. Large scale adversarial representation learning. Advances in neural information processing systems, 32.
  • Dong et al. (2019a) Dong, Y.; Su, H.; Wu, B.; Li, Z.; Liu, W.; Zhang, T.; and Zhu, J. 2019a. Efficient decision-based black-box adversarial attacks on face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7714–7722.
  • Dong et al. (2019b) Dong, Y.; Su, H.; Wu, B.; Li, Z.; Liu, W.; Zhang, T.; and Zhu, J. 2019b. Efficient decision-based black-box adversarial attacks on face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7714–7722.
  • Dosovitskiy et al. (2020) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • Dziugaite, Ghahramani, and Roy (2016) Dziugaite, G. K.; Ghahramani, Z.; and Roy, D. M. 2016. A study of the effect of jpg compression on adversarial images. arXiv preprint arXiv:1608.00853.
  • George and Marcel (2019) George, A.; and Marcel, S. 2019. Deep pixel-wise binary supervision for face presentation attack detection. In 2019 International Conference on Biometrics (ICB), 1–8. IEEE.
  • George et al. (2019) George, A.; Mostaani, Z.; Geissenbuhler, D.; Nikisins, O.; Anjos, A.; and Marcel, S. 2019. Biometric face presentation attack detection with multi-channel convolutional neural network. IEEE transactions on information forensics and security, 15: 42–55.
  • Goodfellow, Shlens, and Szegedy (2014) Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  • Gu et al. (2019) Gu, S.; Yi, P.; Zhu, T.; Yao, Y.; and Wang, W. 2019. Detecting adversarial examples in deep neural networks using normalizing filters. UMBC Student Collection.
  • He et al. (2022) He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; and Girshick, R. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16000–16009.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Hermans, Beyer, and Leibe (2017) Hermans, A.; Beyer, L.; and Leibe, B. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
  • Heusch et al. (2020) Heusch, G.; George, A.; Geissbühler, D.; Mostaani, Z.; and Marcel, S. 2020. Deep models and shortwave infrared information to detect face presentation attacks. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2(4): 399–409.
  • Ho, Jain, and Abbeel (2020) Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 6840–6851.
  • Huang, G. B.; Mattar, M.; Berg, T.; and Learned-Miller, E. 2008. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition.
  • Kim, Kwon, and Ye (2022) Kim, G.; Kwon, T.; and Ye, J. C. 2022. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2426–2435.
  • Kingma et al. (2014) Kingma, D. P.; Mohamed, S.; Jimenez Rezende, D.; and Welling, M. 2014. Semi-supervised learning with deep generative models. Advances in neural information processing systems, 27.
  • Komkov and Petiushko (2021) Komkov, S.; and Petiushko, A. 2021. Advhat: Real-world adversarial attack on arcface face id system. In 2020 25th International Conference on Pattern Recognition (ICPR), 819–826. IEEE.
  • Lin et al. (2017) Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, 2980–2988.
  • Liu et al. (2024) Liu, A.; Xue, S.; Gan, J.; Wan, J.; Liang, Y.; Deng, J.; Escalera, S.; and Lei, Z. 2024. CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 222–232.
  • Liu et al. (2021) Liu, A.; Zhao, C.; Yu, Z.; Su, A.; Liu, X.; Kong, Z.; Wan, J.; Escalera, S.; Escalante, H. J.; Lei, Z.; et al. 2021. 3d high-fidelity mask face presentation attack detection challenge. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 814–823.
  • Liu, Jourabloo, and Liu (2018) Liu, Y.; Jourabloo, A.; and Liu, X. 2018. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In Proceedings of the IEEE conference on computer vision and pattern recognition, 389–398.
  • Liu et al. (2019) Liu, Y.; Stehouwer, J.; Jourabloo, A.; and Liu, X. 2019. Deep tree learning for zero-shot face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4680–4689.
  • Madry et al. (2017) Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
  • Massoli et al. (2021) Massoli, F. V.; Carrara, F.; Amato, G.; and Falchi, F. 2021. Detection of face recognition adversarial attacks. Computer Vision and Image Understanding, 202: 103103.
  • Moitra, Kim, and Panda (2022) Moitra, A.; Kim, Y.; and Panda, P. 2022. Adversarial Detection without Model Information. arXiv preprint arXiv:2202.04271.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
  • Srivatsan, Naseer, and Nandakumar (2023) Srivatsan, K.; Naseer, M.; and Nandakumar, K. 2023. FLIP: Cross-domain Face Anti-spoofing with Language Guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 19685–19696.
  • Szegedy et al. (2017) Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. 2017. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI conference on artificial intelligence, volume 31.
  • Tan and Le (2019) Tan, M.; and Le, Q. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
  • Tong et al. (2021) Tong, L.; Chen, Z.; Ni, J.; Cheng, W.; Song, D.; Chen, H.; and Vorobeychik, Y. 2021. Facesec: A fine-grained robustness evaluation framework for face recognition systems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13254–13263.
  • Van der Maaten and Hinton (2008) Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research, 9(11).
  • Watson and Al Moubayed (2021) Watson, M.; and Al Moubayed, N. 2021. Attack-agnostic adversarial detection on medical data using explainable machine learning. In 2020 25th International Conference on Pattern Recognition (ICPR), 8180–8187. IEEE.
  • Wei et al. (2023) Wei, C.; Mangalam, K.; Huang, P.-Y.; Li, Y.; Fan, H.; Xu, H.; Wang, H.; Xie, C.; Yuille, A.; and Feichtenhofer, C. 2023. Diffusion Models as Masked Autoencoders. arXiv preprint arXiv:2304.03283.
  • Xie et al. (2019) Xie, C.; Zhang, Z.; Zhou, Y.; Bai, S.; Wang, J.; Ren, Z.; and Yuille, A. L. 2019. Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2730–2739.
  • Yang et al. (2023) Yang, X.; Liu, C.; Xu, L.; Wang, Y.; Dong, Y.; Chen, N.; Su, H.; and Zhu, J. 2023. Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4119–4128.
  • Yu et al. (2022) Yu, Z.; Cai, R.; Li, Z.; Yang, W.; Shi, J.; and Kot, A. C. 2022. Benchmarking joint face spoofing and forgery detection with visual and physiological cues. arXiv preprint arXiv:2208.05401.
  • Yu et al. (2023) Yu, Z.; Liu, A.; Zhao, C.; Cheng, K. H.; Cheng, X.; and Zhao, G. 2023. Flexible-modal face anti-spoofing: A benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6345–6350.
  • Yu et al. (2020) Yu, Z.; Zhao, C.; Wang, Z.; Qin, Y.; Su, Z.; Li, X.; Zhou, F.; and Zhao, G. 2020. Searching central difference convolutional networks for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5295–5305.
  • Zhang et al. (2022) Zhang, W.; Liu, H.; Liu, F.; Ramachandra, R.; and Busch, C. 2022. Effective Presentation Attack Detection Driven by Face Related Task. In European Conference on Computer Vision, 408–423. Springer.