
Gradient Alignment for Cross-Domain Face Anti-Spoofing

Binh M. Le     Simon S. Woo
Sungkyunkwan University, South Korea
{bmle,swoo}@g.skku.edu
Corresponding author.
Abstract

Recent advancements in domain generalization (DG) for face anti-spoofing (FAS) have garnered considerable attention. Traditional methods have focused on designing learning objectives and additional modules to isolate domain-specific features while retaining domain-invariant characteristics in their representations. However, such approaches often lack guarantees of consistent maintenance of domain-invariant features or the complete removal of domain-specific features. Furthermore, most prior works of DG for FAS do not ensure convergence to a local flat minimum, which has been shown to be advantageous for DG. In this paper, we introduce GAC-FAS, a novel learning objective that encourages the model to converge towards an optimal flat minimum without necessitating additional learning modules. Unlike conventional sharpness-aware minimizers, GAC-FAS identifies ascending points for each domain and regulates the generalization gradient updates at these points to align coherently with empirical risk minimization (ERM) gradient updates. This unique approach specifically guides the model to be robust against domain shifts. We demonstrate the efficacy of GAC-FAS through rigorous testing on challenging cross-domain FAS datasets, where it establishes state-of-the-art performance. The code is available at: https://github.com/leminhbinh0209/CVPR24-FAS.

1 Introduction

Figure 1: Illustration of Our Learning Objective. Most SoTA methods [77, 66, 44] for DG in FAS rely on auxiliary modules to learn domain-invariant features, and do not guarantee convergence towards a flat minimum. In contrast, our method coherently aligns the generalization gradients at ascending points of each domain with gradients derived from ERM. This approach ensures that the model converges to an optimal flat minimum and is robust against domain shifts.

With the increasing importance of security systems, face recognition technologies [9, 62, 42] have become ubiquitous in many industrial applications. However, these systems are vulnerable to presentation attacks, such as printed faces [30], and 3D masks [17], among others. Consequently, face anti-spoofing (FAS) has emerged as an essential technology to safeguard recognition systems over the past decades [8, 69, 48, 28, 6, 13, 38]. Although existing methods have achieved promising performance, they often suffer from poor generalization when exposed to unseen environments. This limitation is largely due to their assumption of stationary settings, such as lighting conditions or sensor variations, which often do not hold in real-world scenarios.

To address this challenge, recent studies have focused on improving domain generalization (DG) for FAS by learning domain-invariant features from source training domains [66, 23, 59, 57, 44, 77, 5, 55, 63]. Primary efforts include removing domain-specific features from representations through adversarial training [60, 54, 23] or meta-learning [5, 55, 63]. Subsequent works have applied metric learning methods [59, 57] and style ensemble techniques [66, 77] to enhance robustness under domain shifts. However, most of these studies assume that domain-invariant features are preserved for DG through their specific designs of additional learning modules, without ensuring convergence of the model to a local flat minimum.

In this research, we introduce a novel training objective, namely Gradient Alignment for Cross-Domain Face Anti-Spoofing (GAC-FAS), designed to guide detectors towards an optimal flat minimum that is robust to domain shifts. This approach is particularly motivated by recent advancements in Sharpness-Aware Minimization (SAM) [14, 31, 82], which offers a promising alternative to empirical risk minimization (ERM) [58] for seeking generalizable minima. Our objective function for DG in FAS is carefully modulated by considering the limitations of current SAM variants. When SAM is applied to entire datasets, it may produce biased updates due to the dominance of a particular domain, or generate inconsistent gradients when applied to individual domains. Moreover, the SAM generalization gradient updates tend to yield a model capable of handling many forms of noise, including label noise and adversarial noise, whereas our primary focus is on addressing domain shifts in the context of DG for FAS.

Consequently, we propose two essential conditions for DG in FAS. First, the objective should aim for an optimal flat minimum that is both flat and low in terms of the training loss. Second, the SAM generalization gradient updates, derived at ascending points (see its definition in Sec. 3.2) for each domain, should be coherently aligned with each other and with the ERM gradient update as illustrated in Fig. 1. This dual approach enables our model to learn a more stable local minimum and become more robust to domain shifts over different face spoofing datasets.

Our comprehensive experiments on benchmark datasets under various settings, including leave-one-out, limited source domain, and performance upon convergence, demonstrate the superiority of our method compared to current state-of-the-art (SoTA) baselines. The main contributions of our work are summarized as follows: 1) We offer a new perspective for cross-domain FAS, shifting the focus from learning domain-invariant features to finding an optimal flat minimum for significantly improving the generalization and robustness to domain shifts.

2) We propose a novel training objective: self-regulating generalization gradient updates at ascending points to coherently align with the ERM gradient update, benefiting DG in FAS.

3) We demonstrate that our approach outperforms well-known baselines in both snapshot and convergence performance across popular FAS evaluation protocol settings.

Our paper is structured as follows: Sec. 2 provides a concise review of the most pertinent literature in FAS, along with the relationship between loss landscape sharpness and model generalization. Sec. 3 introduces the preliminaries of ERM and SAM within the context of FAS, followed by a detailed presentation of our approach. Experimental results and ablation studies are discussed in Sec. 4. Finally, we draw our conclusions in Sec. 5.

2 Related Work

2.1 Face Anti-Spoofing

In the initial phases of research, handcrafted features were primarily employed as artifacts for detection. Such features include LBP [8, 2], HOG [69, 29], and SIFT [51]. Concurrently, studies have examined predefined biometric traits and behaviors, such as eye blinking [48], lip motion [28], head turning, and facial expression variations [6]. With the advent of deep neural networks, there has been a notable enhancement in detection capabilities [13, 36, 50]. Such improvements were further facilitated through diverse supervisory inputs, encompassing depth maps [70], reflection maps [74], and R-PPG signals [38]. Recently, transformer-based models have emerged, demonstrating superior efficacy in identifying spoofing attempts [21, 37].

Lately, there has been a growing interest in model generalization across disparate domains. A significant body of work has employed domain adaptation (DA) techniques, where pre-trained models are fine-tuned to novel domains using additional data [33, 61, 18, 79]. Concurrently, domain generalization (DG) methodologies, particularly those incorporating adversarial loss, have sought to achieve generalization by extracting domain-invariant features from source training domains [66, 23, 59, 57, 44, 77]. Additionally, several research works have considered meta-learning as a form of regularization to counteract domain shifts during the training phase [5, 55, 63], and others have pursued self-supervised learning to reduce reliance on labeled data [39, 45].

Prior studies in DG have primarily centered on designing auxiliary modules to eliminate domain-specific features; such approaches may not generalize effectively to unseen domains because there is no guarantee that training converges to a flat region of the loss. In contrast, our approach measures the sharpness of the model on each individual domain and aligns the resulting gradient updates, with the goal of constructing a more robust and broadly generalizable model for FAS.

2.2 Sharpness and Generalization

The relationship between sharpness and model generalization was initially broached in [20]. Building on this foundation and under the i.i.d assumption, numerous theoretical and empirical investigations delved into the relationship from the lens of loss surface geometry [25, 12, 15, 22, 14, 4]. Notably, both stochastic weight averaging (SWA) [22] and stochastic weight averaging densely (SWAD) [4] have posited, both theoretically and practically, that a flatter minimum can narrow the DG gap, leading them to propose distinct weight averaging methodologies. Nevertheless, these strategies did not explicitly encourage the model to converge towards flatter minima during its training phase.

Concurrently, Sharpness-Aware Minimization (SAM) [14] and its subsequent variants [10, 82, 46, 64] aimed to address the sharp minima problem by adjusting the objective to minimize a perturbed loss, $\mathcal{L}_{p}(\theta)$, which is defined as the maximum loss within a neighborhood of the parameter space. Specifically, Look-SAM [46] and ESAM [10] reduced the computational demands of SAM. However, they retained SAM’s primary challenge, wherein the perturbed loss $\mathcal{L}_{p}(\theta)$ could potentially disagree with the actual sharpness measure. To overcome this challenge, GSAM [82] minimized a surrogate gap, $h(\theta)\triangleq\mathcal{L}_{p}(\theta)-\mathcal{L}(\theta)$, albeit at the expense of increasing $\mathcal{L}(\theta)$. Later on, SAGM [64] rectified the inconsistencies observed in GSAM by incorporating gradient matching, ensuring model convergence to flatter regions. While SAM-based techniques have shown promise in generalizing from a single source and dealing with various types of noise, including label and adversarial noise, their application to multi-source DG for FAS has not been explored. Inspired by these foundational studies, we tailor SAM so that its generalization gradient updates derived from multi-source domains are aligned with one another and with the ERM gradient update. To the best of our knowledge, this study pioneers the exploration of SAM’s capacity for DG in FAS.

3 Methods

In this section, we first define the general empirical risk for training cross-domain FAS problems. Next, we revisit variations of Sharpness-Aware Minimization for domain generalization from which we draw our motivation. Furthermore, we propose our approach, GAC-FAS, specifically tailored for the problem of DG for FAS. Finally, we analyze the benefits and prove the convergence rate of our algorithm.

3.1 Problem Definition

We begin by introducing the notion of cross-domain FAS. Consider an input space $\bm{\mathcal{X}}\in\mathbb{R}^{d}$ and an output space $\bm{\mathcal{Y}}=\{0\text{ (fake or spoofed)},1\text{ (live)}\}$. We assume there are $k$ distinct source domains for training, represented as $\bm{\mathcal{S}}=\{\mathcal{S}_{i}\}_{i=1}^{k}$, and a single target domain denoted by $\bm{\mathcal{T}}$.

A neural network, characterized as $f:\bm{\mathcal{X}}\rightarrow\bm{\mathcal{Y}}$, is parameterized by learnable parameters $\theta$. Its aim is to distinguish whether an input $x$ from the source domains is live or spoofed (fake). A standard approach to optimization involves the empirical risk minimization (ERM) framework [58], which aims to minimize the loss described by:

\min_{\theta}\mathcal{L}(\theta;\bm{\mathcal{S}})=\min_{\theta}\mathbb{E}_{\mathcal{S}_{i}\sim\bm{\mathcal{S}}}\,\mathcal{L}(\theta;\mathcal{S}_{i}), \qquad (1)

where $\mathcal{L}(\theta;\mathcal{S}_{i})=\mathbb{E}_{(x,y)\sim\mathcal{S}_{i}}\,\ell(f(x;\theta),y)$ is the domain-wise empirical loss on the $i$-th domain, and $\ell$ could be the cross-entropy loss [23] or the $L_{1}$ regression loss [16].
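As a concrete reference, the expectation in Eq. 1 can be estimated per mini-batch as below. This is a minimal PyTorch-style sketch, not the released implementation: `f` (the network) and `domain_batches` (one (x, y) batch per source domain $\mathcal{S}_{i}$) are assumed names, and the use of cross-entropy follows the clause above.

```python
import torch
import torch.nn.functional as F

def erm_loss(f, domain_batches):
    """Average the per-domain empirical losses L(theta; S_i) over the k source domains."""
    losses = [F.cross_entropy(f(x), y) for x, y in domain_batches]
    return torch.stack(losses).mean()   # estimates E_{S_i ~ S} L(theta; S_i)
```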

To minimize the empirical risk $\mathcal{L}(\theta;\bm{\mathcal{S}})$, the neural network $f$ aspires to identify the optimal parameter set $\theta^{*}$. A notable challenge with ERM is its propensity to overfit the training data and converge towards sharp minima, compromising the performance on an unseen domain. Such tendencies might arise due to domain-specific attributes such as camera configurations or image resolution [57]. Consequently, targeting flatter minima when training on source domains becomes pivotal for addressing the DG for FAS problem.

3.2 Preliminaries: Sharpness-Aware Minimization

The Sharpness-Aware Minimization (SAM) [14] method aims to identify a flatter area in the vicinity of the minimum that exhibits a lower loss value. In order to attain this objective, given a training set $\mathcal{D}$ (which can be considered as either $\bm{\mathcal{S}}$ or $\mathcal{S}_{i}$ in our problem later), SAM addresses the following min-max problem:

\min_{\theta}\mathcal{L}_{p}(\theta;\mathcal{D})+\mathcal{R}(\theta;\mathcal{D}), \text{ where} \qquad (2)
\mathcal{L}_{p}(\theta;\mathcal{D})=\max_{\epsilon\in\mathcal{B}(\theta,\rho)}\mathcal{L}(\theta+\epsilon;\mathcal{D}), \qquad (3)

and $\mathcal{R}(\theta;\mathcal{D})$ is a regularization term, and $\mathcal{B}(\theta,\rho)=\{\epsilon:\lVert\epsilon\rVert\leq\rho\}$ is the vicinity of the model weight vector $\theta$ with a predefined constant radius $\rho$. Intuitively, for a given $\theta$, the maximization in Eq. 3 identifies the most adversarial weight perturbation, denoted as $\epsilon^{*}$, within the ball $\mathcal{B}$ of radius $\rho$. This perturbation maximizes the empirical loss, so that $\mathcal{L}(\theta+\epsilon^{*};\mathcal{D})$ is the supremum over $\mathcal{B}(\theta,\rho)$. We now refer to $\epsilon^{*}$ and $\theta+\epsilon^{*}$ as the ascending vector and ascending point, respectively. By minimizing $\mathcal{L}(\theta+\epsilon^{*};\mathcal{D})$, the approach encourages the selection of $\theta$ values that are situated in a region with a flatter loss landscape. Consequently, the function $f$ exhibits enhanced stability under domain shifts, making it more resilient to unseen domains.

SAM uses a Taylor expansion of the empirical loss around $\theta$ to estimate $\epsilon^{*}$ as follows [14]:

\hat{\epsilon}=\rho\frac{\nabla\mathcal{L}(\theta;\mathcal{D})}{\lVert\nabla\mathcal{L}(\theta;\mathcal{D})\rVert}\approx\operatorname*{arg\,max}_{\epsilon\in\mathcal{B}(\theta,\rho)}\mathcal{L}(\theta+\epsilon;\mathcal{D}). \qquad (4)

Therefore, the perturbation loss of SAM reduces to

\mathcal{L}_{p}(\theta;\mathcal{D})=\mathcal{L}(\theta+\hat{\epsilon};\mathcal{D}),\text{ where }\hat{\epsilon}=\rho\frac{\nabla\mathcal{L}(\theta;\mathcal{D})}{\lVert\nabla\mathcal{L}(\theta;\mathcal{D})\rVert}. \qquad (5)
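For completeness, the approximation in Eq. 4 follows from a first-order Taylor expansion of the inner maximization in Eq. 3 (a standard argument restated here, not an additional result of this paper):

\hat{\epsilon}=\operatorname*{arg\,max}_{\lVert\epsilon\rVert\leq\rho}\mathcal{L}(\theta+\epsilon;\mathcal{D})\approx\operatorname*{arg\,max}_{\lVert\epsilon\rVert\leq\rho}\Big[\mathcal{L}(\theta;\mathcal{D})+\epsilon^{\top}\nabla\mathcal{L}(\theta;\mathcal{D})\Big]=\rho\frac{\nabla\mathcal{L}(\theta;\mathcal{D})}{\lVert\nabla\mathcal{L}(\theta;\mathcal{D})\rVert},

since the linear term $\epsilon^{\top}\nabla\mathcal{L}(\theta;\mathcal{D})$ is maximized over the $\ell_{2}$ ball by the vector of norm $\rho$ pointing along the gradient.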
Figure 2: Illustration of Different SAM Objective Approaches. (a) Standard SAM variants applied to the entire source dataset yield a biased ascending vector $\hat{\epsilon}_{\bm{\mathcal{S}}}$, predominantly influenced by a particular domain. (b) Domain-specific gradient adjustments lead to noisy gradient estimates, impeding optimization progress. (c) Our proposed GAC-FAS addresses these issues by computing perturbation losses across the dataset at all ascending points, while concurrently adjusting gradients to align with the ERM gradients at the current point (with $\gamma$), making the model robust to domain shift.

Subsequently, Zhuang et al. [82] and Wang et al. [64] introduce a surrogate gap (sharpness), $h(\theta)\triangleq\mathcal{L}_{p}(\theta)-\mathcal{L}(\theta)$, into their objectives to obtain a better minimum, albeit at the expense of increasing $\mathcal{L}(\theta)$. Nevertheless, there are the following limitations and issues when applying SAM and its variants to DG for FAS, as discussed subsequently.

Our Preliminary Observations and Analysis. (i) Compensation for Input Changes: Optimizing $\mathcal{L}_{p}(\theta;\mathcal{D})$ acts as a compensatory mechanism for a range of input changes. These include domain shifts and various types of corruption, such as adversarial noise [68], heavy compression [32], and label noise [14]. However, our primary objective in this research is to enhance robustness against domain shifts specific to FAS. (ii) Dominated $\hat{\epsilon}$ in All-Domain Application ($\mathcal{D}=\bm{\mathcal{S}}$): By applying Eq. 5 or its variants to the problem of DG for FAS as formulated in Eq. 1, we can determine the optimal ascending vector $\hat{\epsilon}$ for the training source dataset $\mathcal{D}=\bm{\mathcal{S}}$. However, a significant challenge arises from imbalances in dataset sizes and the presence of subtle, domain-specific artifacts. This can lead to ‘learning shortcuts,’ where the model preferentially learns from less complex or larger domains. Consequently, the ascending vector $\hat{\epsilon}$ computed from $(\theta,\bm{\mathcal{S}})$ can become dominated by a particular domain $\mathcal{S}_{i}$. This behavior is similar to the long-tail problem [80, 81]. (iii) Gradient Conflicts in Domain-wise Application ($\mathcal{D}=\mathcal{S}_{i}$): On the other hand, if we apply Eq. 5 or its variants in a domain-wise manner, i.e., $\mathcal{D}=\mathcal{S}_{i}$, the gradients of the model, derived at the ascending point of each domain, can counteract one another, leading to potential conflicts between domains. This phenomenon is further illustrated in Fig. 2. In light of these intricacies and limitations, we argue that the SAM variants might not deliver optimal generalization performance for the DG for FAS problem.

3.3 Objective of GAC-FAS

Given the insights gleaned from our prior analysis, we introduce two pivotal conditions to ensure that our model remains robust across unseen domains: (i) Optimal minimum: The identified minimum should not only be sufficiently low but should also reside on a flat loss surface. (ii) Aligned cross-domain gradients: From the source training datasets, the generalization gradient update learned from some domains should align with the ERM gradient of another domain.

Intuitively, the first condition resonates with the core objective of finding an optimal minimum of the loss landscape on the training set, as discussed previously [82, 64]. The second condition serves dual purposes: it harmonizes the optimization of $h(\theta)$, reducing potential conflicts, and it encourages $\mathcal{L}_{p}(\theta;\bm{\mathcal{S}})$ to compensate exclusively for domain shifts in FAS.

To fulfill these conditions, we introduce a novel optimization objective for DG in FAS, expressed as:

\mathcal{L}(\theta;\bm{\mathcal{S}})+\mathbb{E}_{\mathcal{S}_{i}\sim\bm{\mathcal{S}}}\,\mathcal{L}_{p_{i}}(\theta-\gamma\nabla\mathcal{L}(\theta;\bm{\mathcal{S}});\bm{\mathcal{S}})+\mathcal{R}(\theta;\bm{\mathcal{S}}). \qquad (6)

In this formulation, which is inspired by [64], we perturb the model weights w.r.t. each individual domain. Simultaneously, we incorporate an auxiliary ERM gradient term, $\gamma\nabla\mathcal{L}(\theta;\bm{\mathcal{S}})$, computed over all source training domains. Eq. 6 can be further expressed as:

\mathcal{L}(\theta;\bm{\mathcal{S}})+\mathbb{E}_{\mathcal{S}_{i}\sim\bm{\mathcal{S}}}\,\mathcal{L}(\theta+\hat{\epsilon}_{i}-\gamma\nabla\mathcal{L}(\theta;\bm{\mathcal{S}});\bm{\mathcal{S}})+\mathcal{R}(\theta;\bm{\mathcal{S}}), \qquad (7)

where the optimal perturbation is characterized by $\hat{\epsilon}_{i}=\rho\frac{\nabla\mathcal{L}(\theta;\mathcal{S}_{i})}{\lVert\nabla\mathcal{L}(\theta;\mathcal{S}_{i})\rVert}$, as defined in Eq. 4.

3.4 Benefits & Convergence of GAC-FAS

In this section, we offer analyses for a deeper understanding of our proposed losses for DG in FAS, detailing how they satisfy the two conditions outlined in Sec. 3.3. Subsequently, we present our novel training algorithm and the theorem regarding its convergence rate.

Figure 3: Illustration of the effects described by Eq. 10. The term $-\eta\nabla\mathcal{L}_{p_{1}}(\mathcal{S}_{2})$ represents the generalization gradient update of the model learned from $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$. The term $-\eta\nabla\mathcal{L}(\mathcal{S}_{3})$ denotes the ERM update for domain $\mathcal{S}_{3}$ and serves as a comparative oracle for domain shift. In the absence of our regularization, the generalization update is not robust to the domain shift associated with $\mathcal{S}_{3}$ (left), as their update directions are different. Conversely, with our regularization as formulated in Eq. 10, the generalization update aligns with the ERM update on $\mathcal{S}_{3}$, suggesting that the model updates in a direction that is robust to domain shifts (right).

3.4.1 Benefits of GAC-FAS

We perform a first-order Taylor expansion around $\theta+\hat{\epsilon}_{i}$ for the second term in Eq. 6 as follows:

\mathbb{E}_{\mathcal{S}_{i}\sim\bm{\mathcal{S}}}\,\mathcal{L}_{p_{i}}(\theta-\gamma\nabla\mathcal{L}(\theta;\bm{\mathcal{S}});\bm{\mathcal{S}})\approx\mathbb{E}_{\mathcal{S}_{i}\sim\bm{\mathcal{S}}}\Big[\mathcal{L}_{p_{i}}(\theta;\bm{\mathcal{S}})-\gamma\langle\nabla\mathcal{L}_{p_{i}}(\theta;\bm{\mathcal{S}}),\nabla\mathcal{L}(\theta;\bm{\mathcal{S}})\rangle\Big]. \qquad (8)

As a result, our objective in Eq. 6 can be expressed as follows:

\mathcal{L}(\theta;\bm{\mathcal{S}})+\mathbb{E}_{\mathcal{S}_{i}\sim\bm{\mathcal{S}}}\Big[\mathcal{L}_{p_{i}}(\theta;\bm{\mathcal{S}})-\gamma\langle\nabla\mathcal{L}_{p_{i}}(\theta;\bm{\mathcal{S}}),\nabla\mathcal{L}(\theta;\bm{\mathcal{S}})\rangle\Big]+\mathcal{R}(\theta;\bm{\mathcal{S}}). \qquad (9)

The underlying objective of Eq. 9 is twofold: we aim to minimize the loss functions, namely $\mathcal{L}(\theta;\bm{\mathcal{S}})$ and $\mathcal{L}_{p_{i}}(\theta;\bm{\mathcal{S}})$, while simultaneously maximizing the inner product between the gradient of $\mathcal{L}_{p_{i}}(\theta;\bm{\mathcal{S}})$ and the gradient of $\mathcal{L}(\theta;\bm{\mathcal{S}})$. Specifically, condition (i) is satisfied by minimizing $\mathcal{L}(\theta;\bm{\mathcal{S}})$ and $\mathcal{L}_{p_{i}}(\theta;\bm{\mathcal{S}}),\forall i$. Our distinct contribution, however, lies in our method’s emphasis on cross-domain gradient alignment. This facet of our methodology particularly benefits DG for FAS, as elucidated subsequently.

Algorithm 1 Training pipeline for GAC-FAS.
1: Require: DNN $f$ parameterized by $\theta$; training dataset $\bm{\mathcal{S}}=\{\mathcal{S}_{i}\}_{i=1}^{k}$; learning rate $\eta$; alignment parameter $\gamma$ and radius $\rho$; total number of iterations $T$.
2: for $t\leftarrow 1$ to $T$ do
3:     Sample a mini-batch $\mathcal{B}=\mathcal{B}_{\mathcal{S}_{1}}+...+\mathcal{B}_{\mathcal{S}_{k}}$;
4:     Compute the gradient of the regularization term, $\nabla\mathcal{R}(\theta_{t};\mathcal{B})$;
5:     # Compute the gradient for the 1st term of Eq. 7:
6:     Compute the training loss gradient on each domain, $\{\nabla\mathcal{L}(\theta_{t};\mathcal{B}_{\mathcal{S}_{i}})\}_{i=1}^{k}$, and sum them up to obtain $\nabla\mathcal{L}(\theta_{t};\mathcal{B})$;
7:     # Compute the gradient for the 2nd term of Eq. 7:
8:     for domain $i\in\{1,...,k\}$ do
9:         $\hat{\epsilon}_{i}=\rho\frac{\nabla\mathcal{L}(\theta_{t};\mathcal{B}_{\mathcal{S}_{i}})}{\lVert\nabla\mathcal{L}(\theta_{t};\mathcal{B}_{\mathcal{S}_{i}})\rVert}$  # ascending vector
10:         $\nabla\mathcal{L}_{p}^{i}=\nabla\mathcal{L}(\theta_{t}+\hat{\epsilon}_{i}-\gamma\nabla\mathcal{L}(\theta_{t};\mathcal{B});\mathcal{B})$
11:     end for
12:     # Update weights:
13:     $\theta_{t+1}=\theta_{t}-\eta\big(\nabla\mathcal{L}(\theta_{t};\mathcal{B})+\frac{1}{k}\Sigma_{i=1}^{k}\nabla\mathcal{L}_{p}^{i}+\nabla\mathcal{R}(\theta_{t};\mathcal{B})\big)$
14:     $t=t+1$
15: end for
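The listing below is a minimal PyTorch-style reading of one iteration of Algorithm 1. It is an illustrative sketch rather than the released implementation: `model`, `domain_batches`, and the use of cross-entropy are assumptions, and the regularization term $\mathcal{R}$ (weight decay and the supervised contrastive term of Sec. 4.1) is omitted or left to the optimizer.

```python
import torch
import torch.nn.functional as F

def _grads(loss, params):
    # Gradients of a scalar loss w.r.t. the parameter list, detached from the graph.
    return [g.detach() for g in torch.autograd.grad(loss, params)]

@torch.no_grad()
def _offset(params, vecs, scale):
    # In-place parameter shift: theta <- theta + scale * vecs.
    for p, v in zip(params, vecs):
        p.add_(v, alpha=scale)

def gac_fas_step(model, domain_batches, lr=0.005, rho=0.1, gamma=2e-4):
    """One GAC-FAS update over a mini-batch B = B_{S_1} + ... + B_{S_k}."""
    params = [p for p in model.parameters() if p.requires_grad]
    k = len(domain_batches)

    # Lines 5-6: per-domain gradients at theta_t and their sum, grad L(theta_t; B).
    per_domain = [_grads(F.cross_entropy(model(x), y), params) for x, y in domain_batches]
    grad_sum = [torch.stack(gs).sum(0) for gs in zip(*per_domain)]

    # Lines 7-11: generalization gradient at each domain's ascending point.
    accum = [torch.zeros_like(p) for p in params]
    for g_i in per_domain:
        norm = torch.sqrt(sum((g ** 2).sum() for g in g_i)) + 1e-12
        shift = [rho * g / norm - gamma * s for g, s in zip(g_i, grad_sum)]  # eps_i - gamma * grad L
        _offset(params, shift, +1.0)                                          # move to the ascending point
        loss_p = sum(F.cross_entropy(model(x), y) for x, y in domain_batches)
        grad_p = _grads(loss_p, params)
        _offset(params, shift, -1.0)                                          # restore theta_t
        for a, g in zip(accum, grad_p):
            a.add_(g, alpha=1.0 / k)

    # Line 13: descent step with grad L(theta_t; B) + (1/k) * sum_i grad L_p^i.
    with torch.no_grad():
        for p, s, a in zip(params, grad_sum, accum):
            p.add_(s + a, alpha=-lr)
```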

By noting that $\nabla\mathcal{L}(\theta;\bm{\mathcal{S}})=\Sigma_{m=1}^{k}\nabla\mathcal{L}(\theta;\mathcal{S}_{m})$, maximizing $\langle\nabla\mathcal{L}_{p_{i}}(\theta;\bm{\mathcal{S}}),\nabla\mathcal{L}(\theta;\bm{\mathcal{S}})\rangle$ is equivalent to maximizing the following term:

\Sigma^{k}_{m=1}\Sigma^{k}_{n=1}\langle\nabla\mathcal{L}_{p_{i}}(\theta;\mathcal{S}_{m}),\nabla\mathcal{L}(\theta;\mathcal{S}_{n})\rangle. \qquad (10)

We are interested in the case where $\mathcal{S}_{n}$ is different from $\mathcal{S}_{m}$ and $\mathcal{S}_{i}$. Maximizing the inner product $\langle\nabla\mathcal{L}_{p_{i}}(\theta;\mathcal{S}_{m}),\nabla\mathcal{L}(\theta;\mathcal{S}_{n})\rangle$ implies that the generalization gradient update, learned from $\mathcal{S}_{i}$ (via $\hat{\epsilon}_{i}$) and $\mathcal{S}_{m}$, must align with the ERM gradient of another domain, $\mathcal{S}_{n}$. This ERM gradient acts as a comparative oracle for domain shifts, guiding the generalization update of the model to converge towards a minimum that is robust against domain shifts. A toy example of this effect is illustrated in Fig. 3. Moreover, maximizing Eq. 10 indirectly leads to matching each pair of $\{\nabla\mathcal{L}(\theta;\mathcal{S}_{n})\}^{k}$, which benefits DG [56].
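As a side note, the degree of alignment in Eq. 10 can be inspected empirically. The helper below is a diagnostic sketch (not part of the training objective): it computes normalized inner products between the perturbed-loss gradients and the per-domain ERM gradients, where both arguments are assumed to be lists of per-parameter gradient tensors.

```python
import torch

def alignment_matrix(grads_p, grads_erm):
    # grads_p[i]: gradient of L_{p_i}; grads_erm[n]: ERM gradient of domain S_n.
    scores = torch.zeros(len(grads_p), len(grads_erm))
    for i, gp in enumerate(grads_p):
        for n, ge in enumerate(grads_erm):
            dot = sum((a * b).sum() for a, b in zip(gp, ge))
            norm = torch.sqrt(sum((a ** 2).sum() for a in gp)) * \
                   torch.sqrt(sum((b ** 2).sum() for b in ge))
            scores[i, n] = dot / (norm + 1e-12)   # cosine version of the inner product in Eq. 10
    return scores
```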

3.4.2 Convergence of GAC-FAS

Theorem 1 (Proof in Sec. B of the Supp. material). Suppose that the loss function $\ell(\theta_{t})=\ell(f(x;\theta_{t}),y)$ satisfies the following assumptions: (i) its gradient $g(\theta_{t})=\nabla\ell(\theta_{t})$ is bounded, i.e., $\lVert g(\theta_{t})\rVert\leq G$, $\forall t$; (ii) the stochastic gradient is $L$-Lipschitz, i.e., $\lVert g(\theta_{t})-g(\theta_{t}^{\prime})\rVert\leq L\lVert\theta_{t}-\theta_{t}^{\prime}\rVert$, $\forall\theta_{t},\theta_{t}^{\prime}$. Let the learning rate be $\eta_{t}=\frac{\eta_{0}}{\sqrt{t}}$, and let the perturbation radius and alignment parameter be proportional to the learning rate, i.e., $\rho_{t}=\frac{\rho}{\sqrt{t}}$ and $\gamma_{t}=\frac{\gamma}{\sqrt{t}}$. Then we have:

\frac{1}{T}\Sigma^{T}_{t=1}\mathbb{E}_{\mathcal{S}_{i}\sim\bm{\mathcal{S}}}\mathbb{E}_{(x,y)\sim\mathcal{S}_{i}}\left[\lVert\nabla\ell(\theta_{t})\rVert^{2}\right]\leq\mathcal{O}\left(\frac{\log T}{\sqrt{T}}\right), \text{ and}
\frac{1}{T}\Sigma^{T}_{t=1}\mathbb{E}_{\mathcal{S}_{i}\sim\bm{\mathcal{S}}}\mathbb{E}_{(x,y)\sim\mathcal{S}_{i}}\left[\lVert\nabla\ell(\theta_{t}^{\text{adv}})\rVert^{2}\right]\leq\mathcal{O}\left(\frac{\log T}{\sqrt{T}}\right),

where $\theta_{t}^{\text{adv}}=\theta_{t}+\hat{\epsilon_{t}}-\gamma_{t}\delta_{t}$, $\delta_{t}=\Sigma_{j=1}^{k}\nabla\ell(f(x_{j}^{\prime};\theta_{t}),y_{j}^{\prime})$, and $(x_{j}^{\prime},y_{j}^{\prime})\sim\mathcal{S}_{j}$.

Theorem 1 implies that both $\ell$ and $\ell_{p}$ converge at a rate of $\mathcal{O}(\log T/\sqrt{T})$, which matches the convergence rate of first-order gradient optimizers such as Adam [27].
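For reference, the decayed schedules assumed by Theorem 1 can be written as the small helper below; this is a sketch of the theorem's assumptions only, and differs from the constant settings used for training in Sec. 4.1.

```python
import math

def theorem1_schedules(t, eta0, rho, gamma):
    """Learning rate, radius, and alignment coefficient at iteration t >= 1 (all decay as 1/sqrt(t))."""
    s = math.sqrt(t)
    return eta0 / s, rho / s, gamma / s
```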

Equipped with Theorem 1, we present the overall training pipeline of our GAC-FAS  in Algorithm 1.

Methods    ICM → O    OCM → I    OCI → M    OMI → C
           HTER↓  AUC↑    HTER↓  AUC↑    HTER↓  AUC↑    HTER↓  AUC↑
MMD-AAE [34] 40.98 63.08 31.58 75.18 27.08 83.19 44.59 58.29
MADDG [54] 27.98 80.02 22.19 84.99 17.69 88.06 24.50 84.51
RFM [55] 16.45 91.16 17.30 90.48 13.89 93.98 20.27 88.16
SSDG-M [23] 25.17 81.83 18.21 94.61 16.67 90.47 23.11 85.45
SSDG-R [23] 15.61 91.54 11.71 96.59 7.38 97.17 10.44 95.94
D2AM [5] 15.27 90.87 15.43 91.22 12.70 95.66 20.98 85.58
SDA [63] 23.10 84.30 15.60 90.10 15.40 91.80 24.50 84.40
DRDG [41] 15.63 91.75 15.56 91.79 12.43 95.81 19.05 88.79
ANRL [40] 15.67 91.90 16.03 91.04 10.83 96.75 17.85 89.26
SSAN [66] 13.72 93.63 8.88 96.79 6.67 98.75 10.00 96.67
AMEL [78] 11.31 93.96 18.60 88.79 10.23 96.62 11.88 94.39
EBDG [11] 15.66 92.02 18.69 92.28 9.56 97.17 18.34 90.01
PatchNet [59] 11.82 95.07 13.40 95.67 7.10 98.46 11.33 94.58
IADG [77] 8.86 97.14 10.62 94.50 5.41 98.19 8.70 96.40
SA-FAS [57] 10.00 96.23 6.58 97.54 5.95 96.55 8.78 95.37
UDG-FAS [44] 10.97 95.36 5.86 98.62 5.95 98.47 9.82 96.76
GAC-FAS (ours) 8.60±0.28 97.16±0.40 4.29±0.83 98.87±0.60 5.00±0.00 97.56±0.06 8.20±0.43 95.16±0.09
Table 1: Evaluation of cross-domain face anti-spoofing on four leading benchmark datasets: CASIA (C), Idiap Replay (I), MSU-MFSD (M), and Oulu-NPU (O). Methods are benchmarked for optimal performance using the standard evaluation procedure outlined in [23]. Symbols ↑ and ↓ signify that larger and smaller values are preferable, respectively.

4 Experimental Results

In this section, we compare our method with previous SoTA baselines using standard FAS evaluation protocol settings. Additionally, we assess the effectiveness of our algorithm in scenarios where the training source is limited. We then compare its convergence performance with other baselines. Finally, we conduct several ablation studies to explore alternatives to the minimizer for DG in FAS and the effects of hyperparameter tuning.

Methods    MI → C    MI → O
           HTER↓  AUC↑    HTER↓  AUC↑
MSLBP [47] 51.16 52.09 43.63 58.07
Color Texture [2] 55.17 46.89 53.31 45.16
LBPTOP [8] 45.27 54.88 47.26 50.21
MADDG [54] 41.02 64.33 39.35 65.10
SSDG-M [23] 31.89 71.29 36.01 66.88
D2AM [5] 32.65 72.04 27.70 75.36
DRDG [41] 31.28 71.50 33.35 69.14
ANRL [40] 31.06 72.12 30.73 74.10
SSAN [66] 30.00 76.20 29.44 76.62
EBDG [11] 27.97 75.84 25.94 78.28
AMEL [78] 24.52 82.12 19.68 87.01
IADG [77] 24.07 85.13 18.47 90.49
GAC-FAS (ours) 16.91±1.17 88.12±0.58 17.88±0.15 89.67±0.39
Table 2: Evaluation on limited source domains. Baseline results are sourced from [77].

4.1 Experiment Settings

Datasets. Our experiments are conducted on four benchmark datasets: Idiap Replay Attack [7] (I), OULU-NPU [3] (O), CASIA-MFSD [76] (C), and MSU-MFSD [67] (M). Consistent with prior works, we treat each dataset as a separate domain and employ a leave-one-out testing protocol to evaluate cross-domain generalization capabilities. For instance, the protocol ICM → O involves training on Idiap Replay Attack, CASIA-MFSD, and MSU-MFSD, and testing on OULU-NPU.
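For bookkeeping, the four leave-one-out protocols can be enumerated as below. The dataset keys follow the paper's abbreviations; the split helper itself and the letter ordering in the protocol name are illustrative assumptions.

```python
DOMAINS = {"I": "Idiap Replay Attack", "C": "CASIA-MFSD", "M": "MSU-MFSD", "O": "OULU-NPU"}

def leave_one_out_protocols():
    # Yield (name, source domains, target domain) for each held-out dataset.
    for target in DOMAINS:
        sources = [d for d in DOMAINS if d != target]
        yield "".join(sources) + " -> " + target, sources, target

# e.g. ("ICM -> O", ["I", "C", "M"], "O") when OULU-NPU is the unseen target.
```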

Methods AUC ↑
SVM1+IMQ [1] 70.23±12.69
CDCN [71] 88.69±10.56
CDCN++ [71] 87.53±10.90
SSAN [66] 88.01±9.93
TTN-S [65] 89.71±9.17
UDG-FAS [44] 92.43±6.86
GAC-FAS (ours) 93.39±4.27
(a) Unseen 2D attack
Method AUC ↑
Saha et al. [53] 79.20
Panwar et al. [49] 80.00
SSDG-R [23] 82.11
CIFAS [43] 83.20
UDG-FAS [44] 87.26
GAC-FAS (ours) 89.27±0.58
(b) Unseen 3D attack
Table 3: Evaluation on unseen attacks. Baseline results are sourced from [44].
Methods    ICM → O    OCM → I    OCI → M    OMI → C
           HTER↓ / AUC↑ / TPR95↑    HTER↓ / AUC↑ / TPR95↑    HTER↓ / AUC↑ / TPR95↑    HTER↓ / AUC↑ / TPR95↑
SSDG-R [23]    15.83±1.29 / 92.13±0.96 / 66.54±4.00    14.65±1.21 / 91.93±1.35 / 53.68±2.56    22.84±1.14 / 78.67±1.31 / 50.80±5.95    28.76±0.89 / 80.91±1.10 / 41.47±2.68
SSAN-R [66]    25.72±3.74 / 79.37±4.69 / 36.75±5.19    35.39±8.04 / 70.13±9.03 / 64.00±2.70    21.79±3.68 / 84.06±3.78 / 51.91±4.28    26.44±2.91 / 78.84±2.83 / 45.36±4.29
PatchNet [59]    23.49±1.80 / 84.62±1.92 / 39.39±6.83    29.75±2.76 / 80.53±1.35 / 54.25±2.18    25.92±1.13 / 83.43±0.87 / 38.75±8.31    36.26±1.98 / 71.38±1.89 / 19.22±3.85
SA-FAS [57]    11.29±0.32 / 95.23±0.24 / 73.38±1.64    11.48±1.10 / 95.74±0.55 / 77.05±3.26    14.36±1.10 / 92.06±0.53 / 55.71±4.82    19.40±0.66 / 88.69±0.67 / 50.53±3.60
GAC-FAS (ours)    9.89±0.47 / 96.44±0.18 / 80.47±1.34    12.51±3.03 / 93.03±2.24 / 77.38±8.50    12.29±1.29 / 95.35±0.57 / 72.00±3.84    15.37±1.52 / 91.67±1.67 / 58.67±10.55
Table 4: Evaluation at convergence. A comprehensive assessment of cross-domain face anti-spoofing on prominent databases: CASIA (C), Idiap Replay (I), MSU-MFSD (M), and Oulu-NPU (O). Methods are benchmarked using their mean and standard deviation performance over the final 10 evaluations. Baseline results are sourced from [57].

Implementation Details. Input images are detected and cropped using MTCNN [72], then resized to 256×256 pixels. We employ a ResNet-18 [19] architecture, pre-trained on the ImageNet dataset [52], as our feature extraction backbone to maintain consistency with SoTA baselines [57, 77, 66, 23]. The network is trained using an SGD optimizer with an initial learning rate of 0.005. Our regularization strategy includes weight decay and supervised contrastive learning, applied at an intermediate layer, to promote inter-domain discriminability [26, 57]. The hyperparameters are set as follows: $\{\gamma=0.0002,\rho=0.1\}$. We run each experiment three times and report the average performance.
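A hedged sketch of this setup is given below. The backbone, optimizer type, learning rate, and the $\gamma$, $\rho$ values come from the text; the momentum and weight-decay values and the two-class head are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torchvision

# ImageNet-pretrained ResNet-18 backbone with a binary (live vs. spoof) head.
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 2)

# SGD with the stated initial learning rate; weight decay acts as part of R(theta).
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)

GAMMA, RHO = 0.0002, 0.1   # alignment parameter and ascending radius from Sec. 4.1
```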

Evaluation Metrics. Model performance is quantified using three standard metrics: Half Total Error Rate (HTER), Area Under the Receiver Operating Characteristic Curve (AUC), and True Positive Rate (TPR) at a False Positive Rate (FPR) of 5%, denoted as TPR95.
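A minimal sketch of these metrics on held-out scores is given below, assuming scikit-learn; note that HTER is computed here at a fixed 0.5 threshold purely for illustration, whereas the evaluation threshold used in practice may be chosen differently (e.g., on a development set).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate(scores, labels):
    """scores: higher means more likely live; labels: 1 = live, 0 = spoof."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    tpr95 = np.interp(0.05, fpr, tpr)              # TPR at FPR = 5% (TPR95)
    preds = (scores >= 0.5).astype(int)
    far = np.mean(preds[labels == 0] == 1)          # false acceptance (spoof accepted as live)
    frr = np.mean(preds[labels == 1] == 0)          # false rejection (live rejected as spoof)
    hter = (far + frr) / 2.0
    return hter, auc, tpr95
```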

4.2 Comparison to SoTA baselines

Leave-One-Out. Table 1 presents a comprehensive comparison with a broad range of recent studies addressing DG in FAS. Based on the results, we make the following observations: (1) Recent advancements in FAS methods have achieved significant breakthroughs in performance. However, a performance plateau is evident among these methods, largely because they do not incorporate the sharpness of the loss landscape into their objectives. (2) Our method consistently outperforms the majority of the surveyed DG for FAS approaches [44, 57, 59, 77, 66, 23] across all four experimental setups. Notably, we report a 1.56% improvement in the HTER (reducing from 5.86% to 4.29%) in the OCM → I experiment, translating to an enhancement of over 26%.

Model / loss     DGrad     HTER ↓     AUC ↑     TPR95 ↑
Fish [56]     ✓     33.83±0.74     72.34±0.37     17.50±1.48
SAM [14]     ✗     11.99±0.49     95.13±0.16     73.31±1.57
     ✓     12.51±0.89     95.45±0.35     72.94±2.99
ASAM [31]     ✗     11.73±0.40     95.36±0.28     75.36±1.61
     ✓     12.18±0.93     95.40±0.49     75.92±1.54
SAGM [64]     ✗     11.63±0.53     95.33±0.23     74.81±2.07
     ✓     12.19±0.50     95.06±0.21     74.17±1.48
LookSAM [46]     ✗     11.56±0.54     95.56±0.15     75.64±1.19
     ✓     11.06±0.50     95.76±0.15     75.86±1.75
GSAM [82]     ✗     11.60±0.38     95.30±0.18     74.44±1.79
     ✓     12.48±0.83     95.54±0.35     73.39±2.92
Reg. $\langle\nabla\mathcal{L}_{p_{i}}(\mathcal{S}_{i}),\nabla\mathcal{L}(\bm{\mathcal{S}})\rangle$     ✓     11.35±0.54     95.55±0.18     73.58±1.21
GAC-FAS (ours)     ✓     9.89±0.47     96.44±0.18     80.47±1.34
Table 5: Ablation study: Alternative approach using domain gradient for DG in FAS. DGrad indicates whether domain-wise gradients are used in each method (see Fig. 2).

Limited Source Domains. Table 2 summarizes the performance of our method when source domains are extremely limited. Adhering to the standard settings established by prior research [77, 78, 66], we utilize MSU-MFSD and Idiap Replay Attack as source domains for training, while OULU-NPU and CASIA-MFSD serve as the test datasets. Our proposed method consistently surpasses SoTA baselines, achieving substantial margins of improvement on the HTER metric. A particularly significant enhancement is observed on the MI → C setup, where our approach yields approximately a 7% reduction in HTER (decreasing from 24.07% to 16.91%).

Figure 4: Ablation study: Sensitivity analysis of hyper-parameters $\gamma$ and $\rho$ on ICM → O upon convergence performance.
Figure 5: Loss Landscape Visualization [35]. SoTA baselines exhibit convergence to sharp minima (panels a to d) on the training set of ICM → O. In contrast, our GAC-FAS (panel e) achieves convergence to a flatter minimum, indicative of potentially better generalization.

Unseen Attacks. In this experiment, we evaluate the detector’s performance against unseen 2D and 3D attacks. For 2D attacks, we adopt the ‘leave-one-attack-type-out’ method from [1], training on two of the domains I, C, and M, and testing on an unseen attack type from the unseen domain. For 3D attacks, we train on O, C, and M, and evaluate using a 3D attack subset of the CelebA-Spoof dataset [75]. Results, shown in Table 3, indicate that our approach outperforms the baselines by significant margins of 0.96% and 2.01% on the AUC metric for 2D and 3D attacks, respectively.

Comparison Upon Convergence. In their recent study, Sun et al. highlight that a snapshot performance report on a test set may not accurately reflect the true generalization ability of a detection model [57]. In alignment with their methodology, we report the average performance of our model across the last 10 evaluations in Table 4. As evident from the results, our method consistently demonstrates superior convergence on the three metrics, HTER, AUC, and TPR95, and achieves comparable results to SA-FAS in the OCM → I experiment. Notably, we observe a 4% reduction in HTER in the OMI → C experiment (from 19.40% down to 15.37%). These results suggest that our proposed GAC-FAS enables the model to converge to flatter and more stable minima compared to other approaches.

4.3 Ablation Studies

Alternatives of Optimizer. In Table 5, we explore various alternative objectives for DG in FAS, including SAM [14], ASAM [31], GSAM [82], SAGM [64], and LookSAM [46]. Furthermore, we incorporate Fish [56], a gradient matching method for DG, into our study. We experiment with the ICM → O task and report the performance upon convergence. The checkmarks in the second column denote the utilization of domain-wise gradients (see Fig. 2). While most methods demonstrate improved convergence with whole-data gradients per iteration compared to domain-wise gradients, they do not unequivocally surpass the existing SoTA baselines in DG for FAS. Domain-wise gradients, as we observed, are prone to noisy gradients at ascending points, whereas whole-data gradients tend to be dominated by specific domains, which can impede convergence. Notably, Fish [56] exhibits competitive performance on other datasets but shows slower convergence on FAS datasets compared to SAM-based objectives.

Effects of Hyper-parameters $\gamma$ and $\rho$. We investigate the sensitivity of our GAC-FAS with respect to $\gamma$ and $\rho$, summarizing the results of our analysis in Fig. 4. This study focuses on the ICM → O task, reporting the HTER metric performance upon convergence. We test a range of values for the hyperparameters, $\gamma\in\{0.0,0.0001,0.0002,0.001,0.002\}$ and $\rho\in\{0.005,0.05,0.1,0.2,0.4\}$, noting that at $\gamma=0.0$, no regularization is applied. The results indicate that with $\gamma$ set to 0.0, our model’s performance is competitive with the current SoTA as shown in Table 4, yet it does not achieve the best result. Furthermore, when the settings for $\gamma$ and $\rho$ are not excessively large, the model’s performance tends to be stable and can exceed that of previous SoTA methods. However, higher values of these hyperparameters may deteriorate performance. It is important to note that we did not fine-tune these hyperparameters to optimize test accuracy for each experimental task; thus, the same hyper-parameter settings used in Sec. 4.2 may not be optimal, even though they outperform all current SoTA methods on the datasets.

Loss Landscape Visualization. Figure 5 presents the loss landscape visualization [35] of GAC-FAS, in comparison with four SoTA approaches: SSAN [66], SSDG [44], SA-FAS [57], and IADG [77]. We employ negative log-likelihood as the loss metric and use the training set of the ICM → O task for visualization. While all baseline methods demonstrate comparable generalization capabilities to our method in Table 1, they distinctly exhibit sharp minima in their loss functions, characterized by steep gradients in the loss landscape, as shown in Fig. 5a)-d). In contrast, our proposed approach reveals a flatter minimum, which may correlate with enhanced generalization. These observations provide additional insights into the superior numerical results achieved by our method in various experiments, as discussed in Sec. 4.2.

5 Conclusion

In this paper, we introduced GAC-FAS, a novel framework designed to optimize the minimum for Domain Generalization (DG) in Face Anti-Spoofing (FAS). Inspired by recent advancements in DG methods that leverage loss sharpness-aware objectives, our approach involves identifying and utilizing ascending points for each domain within the training dataset. A key underlying novelty in our methodology is the regulation of SAM generalization gradients of whole data at these ascending points, aligning them coherently with gradients derived from ERM. Through comprehensive analysis and a series of rigorous experiments, we have demonstrated that GAC-FAS not only achieves superior generalization capabilities in FAS tasks but also consistently outperforms current SoTA baselines by significant margins. This performance consistency is observed across a variety of common experimental setups, underscoring the robustness and effectiveness of our proposed method. Our limitations and further analyses are provided in the Supp. material to support future studies.

References

  • [1] Shervin Rahimzadeh Arashloo, Josef Kittler, and William Christmas. An anomaly detection approach to face spoofing detection: A new formulation and evaluation protocol. IEEE access, 5:13868–13882, 2017.
  • [2] Zinelabidine Boulkenafet, Jukka Komulainen, and Abdenour Hadid. Face anti-spoofing based on color texture analysis. In 2015 IEEE international conference on image processing (ICIP), pages 2636–2640. IEEE, 2015.
  • [3] Zinelabinde Boulkenafet, Jukka Komulainen, Lei Li, Xiaoyi Feng, and Abdenour Hadid. Oulu-npu: A mobile face presentation attack database with real-world variations. In 2017 12th IEEE international conference on automatic face & gesture recognition (FG 2017), pages 612–618. IEEE, 2017.
  • [4] Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405–22418, 2021.
  • [5] Zhihong Chen, Taiping Yao, Kekai Sheng, Shouhong Ding, Ying Tai, Jilin Li, Feiyue Huang, and Xinyu Jin. Generalizable representation learning for mixture domain face anti-spoofing. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 1132–1139, 2021.
  • [6] Girija Chetty. Biometric liveness checking using multimodal fuzzy fusion. In International Conference on Fuzzy Systems, pages 1–8. IEEE, 2010.
  • [7] Ivana Chingovska, André Anjos, and Sébastien Marcel. On the effectiveness of local binary patterns in face anti-spoofing. In 2012 BIOSIG-proceedings of the international conference of biometrics special interest group (BIOSIG), pages 1–7. IEEE, 2012.
  • [8] Tiago de Freitas Pereira, André Anjos, José Mario De Martino, and Sébastien Marcel. LBP-TOP based countermeasure against face spoofing attacks. In Computer Vision-ACCV 2012 Workshops: ACCV 2012 International Workshops, Daejeon, Korea, November 5-6, 2012, Revised Selected Papers, Part I 11, pages 121–132. Springer, 2013.
  • [9] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
  • [10] Jiawei Du, Hanshu Yan, Jiashi Feng, Joey Tianyi Zhou, Liangli Zhen, Rick Siow Mong Goh, and Vincent Tan. Efficient sharpness-aware minimization for improved training of neural networks. In International Conference on Learning Representations, 2021.
  • [11] Zhekai Du, Jingjing Li, Lin Zuo, Lei Zhu, and Ke Lu. Energy-based domain generalization for face anti-spoofing. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1749–1757, 2022.
  • [12] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Uncertainty in Artificial Intelligence, 2017.
  • [13] Litong Feng, Lai-Man Po, Yuming Li, Xuyuan Xu, Fang Yuan, Terence Chun-Ho Cheung, and Kwok-Wai Cheung. Integration of image quality and motion cues for face anti-spoofing: A neural network approach. Journal of Visual Communication and Image Representation, 38:451–460, 2016.
  • [14] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2020.
  • [15] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018.
  • [16] Anjith George and Sébastien Marcel. Deep pixel-wise binary supervision for face presentation attack detection. In 2019 International Conference on Biometrics (ICB), pages 1–8. IEEE, 2019.
  • [17] Andy Greenberg. Hackers say they’ve broken face id a week after iphone x release. https://www.wired.com/story/hackers-say-broke-face-id-security/, November 2017. Accessed: 2023-07-01.
  • [18] Xiao Guo, Yaojie Liu, Anil Jain, and Xiaoming Liu. Multi-domain learning for updating face anti-spoofing models. In European Conference on Computer Vision, pages 230–249. Springer, 2022.
  • [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [20] Sepp Hochreiter and Jürgen Schmidhuber. Simplifying neural nets by discovering flat minima. Advances in neural information processing systems, 7, 1994.
  • [21] Hsin-Ping Huang, Deqing Sun, Yaojie Liu, Wen-Sheng Chu, Taihong Xiao, Jinwei Yuan, Hartwig Adam, and Ming-Hsuan Yang. Adaptive transformers for robust few-shot cross-domain face anti-spoofing. In European Conference on Computer Vision, pages 37–54. Springer, 2022.
  • [22] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pages 876–885. Association For Uncertainty in Artificial Intelligence (AUAI), 2018.
  • [23] Yunpei Jia, Jie Zhang, Shiguang Shan, and Xilin Chen. Single-side domain generalization for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8484–8493, 2020.
  • [24] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2889–2898, 2020.
  • [25] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2016.
  • [26] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673, 2020.
  • [27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [28] Klaus Kollreider, Hartwig Fronthaler, Maycel Isaac Faraj, and Josef Bigun. Real-time face detection and motion analysis with application in “liveness” assessment. IEEE Transactions on Information Forensics and Security, 2(3):548–558, 2007.
  • [29] Jukka Komulainen, Abdenour Hadid, and Matti Pietikäinen. Context based face anti-spoofing. In 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–8. IEEE, 2013.
  • [30] Paul Kunert. Phones’ facial recog tech ‘fooled’ by low-res 2d photo. https://www.theregister.com/2023/05/19/2d_photograph_facial_recog/, May 2023. Accessed: 2023-07-01.
  • [31] Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pages 5905–5914. PMLR, 2021.
  • [32] Binh M Le and Simon S Woo. Quality-agnostic deepfake detection with intra-model collaborative learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22378–22389, 2023.
  • [33] Haoliang Li, Wen Li, Hong Cao, Shiqi Wang, Feiyue Huang, and Alex C Kot. Unsupervised domain adaptation for face anti-spoofing. IEEE Transactions on Information Forensics and Security, 13(7):1794–1809, 2018.
  • [34] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5400–5409, 2018.
  • [35] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in neural information processing systems, 31, 2018.
  • [36] Lei Li, Xiaoyi Feng, Zinelabidine Boulkenafet, Zhaoqiang Xia, Mingming Li, and Abdenour Hadid. An original face anti-spoofing approach using partial convolutional neural network. In 2016 Sixth international conference on image processing theory, tools and applications (IPTA), pages 1–6. IEEE, 2016.
  • [37] Chen-Hao Liao, Wen-Cheng Chen, Hsuan-Tung Liu, Yi-Ren Yeh, Min-Chun Hu, and Chu-Song Chen. Domain invariant vision transformer learning for face anti-spoofing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6098–6107, 2023.
  • [38] Bofan Lin, Xiaobai Li, Zitong Yu, and Guoying Zhao. Face liveness detection by rppg features and contextual patch-based cnn. In Proceedings of the 2019 3rd international conference on biometric engineering and applications, pages 61–68, 2019.
  • [39] Haozhe Liu, Zhe Kong, Raghavendra Ramachandra, Feng Liu, Linlin Shen, and Christoph Busch. Taming self-supervised learning for presentation attack detection: In-image de-folding and out-of-image de-mixing. arXiv preprint arXiv:2109.04100, 2021.
  • [40] Shubao Liu, Ke-Yue Zhang, Taiping Yao, Mingwei Bi, Shouhong Ding, Jilin Li, Feiyue Huang, and Lizhuang Ma. Adaptive normalized representation learning for generalizable face anti-spoofing. In Proceedings of the 29th ACM international conference on multimedia, pages 1469–1477, 2021.
  • [41] Shubao Liu, Ke-Yue Zhang, Taiping Yao, Kekai Sheng, Shouhong Ding, Ying Tai, Jilin Li, Yuan Xie, and Lizhuang Ma. Dual reweighting domain generalization for face presentation attack detection. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 867–873, 2021.
  • [42] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 212–220, 2017.
  • [43] Yuchen Liu, Yabo Chen, Wenrui Dai, Chenglin Li, Junni Zou, and Hongkai Xiong. Causal intervention for generalizable face anti-spoofing. In 2022 IEEE International Conference on Multimedia and Expo (ICME), pages 01–06. IEEE, 2022.
  • [44] Yuchen Liu, Yabo Chen, Mengran Gou, Chun-Ting Huang, Yaoming Wang, Wenrui Dai, and Hongkai Xiong. Towards unsupervised domain generalization for face anti-spoofing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20654–20664, 2023.
  • [45] Yuchen Liu, Yabo Chen, Mengran Gou, Chun-Ting Huang, Yaoming Wang, Wenrui Dai, and Hongkai Xiong. Towards unsupervised domain generalization for face anti-spoofing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20654–20664, 2023.
  • [46] Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, and Yang You. Towards efficient and scalable sharpness-aware minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12360–12370, 2022.
  • [47] Jukka Määttä, Abdenour Hadid, and Matti Pietikäinen. Face spoofing detection from single images using micro-texture analysis. In 2011 international joint conference on Biometrics (IJCB), pages 1–7. IEEE, 2011.
  • [48] Gang Pan, Lin Sun, Zhaohui Wu, and Shihong Lao. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In 2007 IEEE 11th international conference on computer vision, pages 1–8. IEEE, 2007.
  • [49] Ankush Panwar, Pratyush Singh, Suman Saha, Danda Pani Paudel, and Luc Van Gool. Unsupervised compound domain adaptation for face anti-spoofing. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pages 1–8. IEEE, 2021.
  • [50] Keyurkumar Patel, Hu Han, and Anil K Jain. Cross-database face antispoofing with robust feature representation. In Biometric Recognition: 11th Chinese Conference, CCBR 2016, Chengdu, China, October 14-16, 2016, Proceedings 11, pages 611–619. Springer, 2016.
  • [51] Keyurkumar Patel, Hu Han, and Anil K Jain. Secure face unlock: Spoof detection on smartphones. IEEE transactions on information forensics and security, 11(10):2268–2283, 2016.
  • [52] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [53] Suman Saha, Wenhao Xu, Menelaos Kanakis, Stamatios Georgoulis, Yuhua Chen, Danda Pani Paudel, and Luc Van Gool. Domain agnostic feature learning for image and video based face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 802–803, 2020.
  • [54] Rui Shao, Xiangyuan Lan, Jiawei Li, and Pong C Yuen. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10023–10031, 2019.
  • [55] Rui Shao, Xiangyuan Lan, and Pong C Yuen. Regularized fine-grained meta face anti-spoofing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11974–11981, 2020.
  • [56] Yuge Shi, Jeffrey Seely, Philip Torr, N Siddharth, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. In International Conference on Learning Representations, 2021.
  • [57] Yiyou Sun, Yaojie Liu, Xiaoming Liu, Yixuan Li, and Wen-Sheng Chu. Rethinking domain generalization for face anti-spoofing: Separability and alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24563–24574, 2023.
  • [58] Vladimir Vapnik. Principles of risk minimization for learning theory. Advances in neural information processing systems, 4, 1991.
  • [59] Chien-Yi Wang, Yu-Ding Lu, Shang-Ta Yang, and Shang-Hong Lai. Patchnet: A simple face anti-spoofing framework via fine-grained patch recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20281–20290, 2022.
  • [60] Guoqing Wang, Hu Han, Shiguang Shan, and Xilin Chen. Improving cross-database face presentation attack detection via adversarial domain adaptation. In 2019 International Conference on Biometrics (ICB), pages 1–8. IEEE, 2019.
  • [61] Guoqing Wang, Hu Han, Shiguang Shan, and Xilin Chen. Unsupervised adversarial domain adaptation for cross-domain face presentation attack detection. IEEE Transactions on Information Forensics and Security, 16:56–69, 2020.
  • [62] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018.
  • [63] Jingjing Wang, Jingyi Zhang, Ying Bian, Youyi Cai, Chunmao Wang, and Shiliang Pu. Self-domain adaptation for face anti-spoofing. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 2746–2754, 2021.
  • [64] Pengfei Wang, Zhaoxiang Zhang, Zhen Lei, and Lei Zhang. Sharpness-aware gradient matching for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3769–3778, 2023.
  • [65] Zhuo Wang, Qiangchang Wang, Weihong Deng, and Guodong Guo. Learning multi-granularity temporal characteristics for face anti-spoofing. IEEE Transactions on Information Forensics and Security, 17:1254–1269, 2022.
  • [66] Zhuo Wang, Zezheng Wang, Zitong Yu, Weihong Deng, Jiahong Li, Tingting Gao, and Zhongyuan Wang. Domain generalization via shuffled style assembly for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4123–4133, 2022.
  • [67] Di Wen, Hu Han, and Anil K Jain. Face spoof detection with image distortion analysis. IEEE Transactions on Information Forensics and Security, 10(4):746–761, 2015.
  • [68] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33:2958–2969, 2020.
  • [69] Jianwei Yang, Zhen Lei, Shengcai Liao, and Stan Z Li. Face liveness detection with component dependent descriptor. In 2013 International Conference on Biometrics (ICB), pages 1–6. IEEE, 2013.
  • [70] Zitong Yu, Xiaobai Li, Jingang Shi, Zhaoqiang Xia, and Guoying Zhao. Revisiting pixel-wise supervision for face anti-spoofing. IEEE Transactions on Biometrics, Behavior, and Identity Science, 3(3):285–295, 2021.
  • [71] Zitong Yu, Chenxu Zhao, Zezheng Wang, Yunxiao Qin, Zhuo Su, Xiaobai Li, Feng Zhou, and Guoying Zhao. Searching central difference convolutional networks for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5295–5305, 2020.
  • [72] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE signal processing letters, 23(10):1499–1503, 2016.
  • [73] Kaipeng Zhang, Zhanpeng Zhang, Hao Wang, Zhifeng Li, Yu Qiao, and Wei Liu. Detecting faces using inside cascaded contextual cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 3171–3179, 2017.
  • [74] Ke-Yue Zhang, Taiping Yao, Jian Zhang, Shice Liu, Bangjie Yin, Shouhong Ding, and Jilin Li. Structure destruction and content combination for face anti-spoofing. In 2021 IEEE International Joint Conference on Biometrics (IJCB), pages 1–6. IEEE, 2021.
  • [75] Yuanhan Zhang, ZhenFei Yin, Yidong Li, Guojun Yin, Junjie Yan, Jing Shao, and Ziwei Liu. Celeba-spoof: Large-scale face anti-spoofing dataset with rich annotations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pages 70–85. Springer, 2020.
  • [76] Zhiwei Zhang, Junjie Yan, Sifei Liu, Zhen Lei, Dong Yi, and Stan Z Li. A face antispoofing database with diverse attacks. In 2012 5th IAPR international conference on Biometrics (ICB), pages 26–31. IEEE, 2012.
  • [77] Qianyu Zhou, Ke-Yue Zhang, Taiping Yao, Xuequan Lu, Ran Yi, Shouhong Ding, and Lizhuang Ma. Instance-aware domain generalization for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20453–20463, 2023.
  • [78] Qianyu Zhou, Ke-Yue Zhang, Taiping Yao, Ran Yi, Shouhong Ding, and Lizhuang Ma. Adaptive mixture of experts learning for generalizable face anti-spoofing. In Proceedings of the 30th ACM International Conference on Multimedia, pages 6009–6018, 2022.
  • [79] Qianyu Zhou, Ke-Yue Zhang, Taiping Yao, Ran Yi, Kekai Sheng, Shouhong Ding, and Lizhuang Ma. Generative domain adaptation for face anti-spoofing. In European Conference on Computer Vision, pages 335–356. Springer, 2022.
  • [80] Yixuan Zhou, Yi Qu, Xing Xu, and Hengtao Shen. Imbsam: A closer look at sharpness-aware minimization in class-imbalanced recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11345–11355, 2023.
  • [81] Zhipeng Zhou, Lanqing Li, Peilin Zhao, Pheng-Ann Heng, and Wei Gong. Class-conditional sharpness-aware minimization for deep long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3499–3509, 2023.
  • [82] Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha C Dvornek, James S Duncan, Ting Liu, et al. Surrogate gap minimization improves sharpness-aware training. In International Conference on Learning Representations, 2021.

In this supplementary material, we first provide a brief description of the datasets used in our experiments (Section A). Next, the proof of Theorem 1 is given in Section B. In Section C, we detail our implementation and present additional studies on the detectors' robustness to unseen corruptions. Lastly, in Section D, we discuss the limitations of our proposed method and outline future work.

Appendix A Datasets

We describe here four popular benchmark datasets used to evaluate our proposed GAC-FAS:

  • Idiap Replay Attack (denoted as I) [7]: This dataset includes 1,300 videos captured from 50 clients under two different lighting conditions. It features four types of replayed faces and one type of printed face for spoof attacks.

  • OULU-NPU (denoted as O) [3]: Comprising high-resolution videos, this dataset contains 3,960 spoof face videos and 990 live face videos captured from six different cameras. It includes two kinds of printed faces and two kinds of replayed faces.

  • CASIA-MFSD (denoted as C) [76]: Consisting of 50 subjects, each with 12 videos, this dataset features three types of attacks: printed photo, cut photo, and video attacks.

  • MSU-MFSD (denoted as M) [67]: This dataset includes 280 videos for 35 subjects recorded with different cameras. It encompasses three spoof types: one kind of printed face and two kinds of replayed faces.

Following the pre-processing steps outlined in [57], we utilized MTCNN [73] to detect faces in each frame of the videos.
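
For reference, a minimal sketch of this face-cropping step is shown below. It assumes the facenet-pytorch implementation of MTCNN; the crop size and margin are illustrative and may differ from the exact settings used in [57].

```python
# Minimal face-cropping sketch; assumes the facenet-pytorch MTCNN implementation.
# The crop size and margin are illustrative, not the exact values used in [57].
from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(image_size=256, margin=0, post_process=False)

def crop_face(frame_path, out_path):
    """Detect and align the face in a single video frame, saving the crop."""
    img = Image.open(frame_path).convert("RGB")
    face = mtcnn(img, save_path=out_path)  # returns None when no face is detected
    return face
```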

Appendix B Proof of Theorem 1

Theorem 1 (Restated): Suppose that the loss function $\ell(\theta_t)=\ell(f(x;\theta_t),y)$ satisfies the following assumptions: (i) its gradient $g(\theta_t)=\nabla\ell(\theta_t)$ is bounded, i.e., $\lVert g(\theta_t)\rVert\leq G$ for all $t$; (ii) the stochastic gradient is $L$-Lipschitz, i.e., $\lVert g(\theta_t)-g(\theta_t^{\prime})\rVert\leq L\lVert\theta_t-\theta_t^{\prime}\rVert$ for all $\theta_t,\theta_t^{\prime}$. Let the learning rate be $\eta_t=\frac{\eta_0}{\sqrt{t}}$, and let the perturbation magnitudes be proportional to the learning rate, i.e., $\rho_t=\frac{\rho}{\sqrt{t}}$ and $\gamma_t=\frac{\gamma}{\sqrt{t}}$. Then we have:

\begin{align}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\mathcal{S}_i\sim\bm{\mathcal{S}}}\,\mathbb{E}_{(x,y)\sim\mathcal{S}_i}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]&\leq\mathcal{O}\!\left(\frac{\log T}{\sqrt{T}}\right),\ \text{and}\nonumber\\
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\mathcal{S}_i\sim\bm{\mathcal{S}}}\,\mathbb{E}_{(x,y)\sim\mathcal{S}_i}\left[\lVert\nabla\ell(\theta_t^{\text{adv}})\rVert^2\right]&\leq\mathcal{O}\!\left(\frac{\log T}{\sqrt{T}}\right),\nonumber
\end{align}

where $\theta_t^{\text{adv}}=\theta_t+\hat{\epsilon}_t-\gamma_t\delta_t$ and $\delta_t=\sum_{j=1}^{k}\nabla\ell(f(x_j^{\prime};\theta_t),y_j^{\prime})$ with $(x_j^{\prime},y_j^{\prime})\sim\mathcal{S}_j$.

For simplicity, we denote the update at step $t$ as:

\begin{align}
d_t=-\eta_t\, g(\theta_t)-\eta_t\, g(\theta_t^{\text{adv}}). \tag{11}
\end{align}

By the $L$-smoothness of $\ell$ and the definition $d_t=\theta_{t+1}-\theta_t$, we have:

\begin{align}
\ell(\theta_{t+1})-\ell(\theta_t)&\leq\langle\nabla\ell(\theta_t),\theta_{t+1}-\theta_t\rangle+\frac{L}{2}\lVert\theta_{t+1}-\theta_t\rVert^2\nonumber\\
&=\langle\nabla\ell(\theta_t),d_t\rangle+\frac{L}{2}\lVert d_t\rVert^2\nonumber\\
&=-\eta_t\langle\nabla\ell(\theta_t),g(\theta_t)+g(\theta_t^{\text{adv}})\rangle+\frac{L\eta_t^2}{2}\lVert g(\theta_t)+g(\theta_t^{\text{adv}})\rVert^2\nonumber\\
&=-\eta_t\langle\nabla\ell(\theta_t),\nabla\ell(\theta_t)+g(\theta_t^{\text{adv}})\rangle+\frac{L\eta_t^2}{2}\lVert g(\theta_t)+g(\theta_t^{\text{adv}})\rVert^2\nonumber\\
&=-\eta_t\langle\nabla\ell(\theta_t),\nabla\ell(\theta_t)+\nabla\ell(\theta_t)-\nabla\ell(\theta_t)+g(\theta_t^{\text{adv}})\rangle+\frac{L\eta_t^2}{2}\lVert g(\theta_t)+g(\theta_t^{\text{adv}})\rVert^2\nonumber\\
&\leq-2\eta_t\lVert\nabla\ell(\theta_t)\rVert^2-\eta_t\langle\nabla\ell(\theta_t),g(\theta_t^{\text{adv}})-g(\theta_t)\rangle+L\eta_t^2 G^2.\nonumber
\end{align}

Taking the expectation on both sides, and writing $\mathbb{E}[\cdot]:=\mathbb{E}_{\mathcal{S}_i\sim\bm{\mathcal{S}}}\,\mathbb{E}_{(x,y)\sim\mathcal{S}_i}[\cdot]$ for brevity, we have:

\begin{align}
\mathbb{E}\left[\ell(\theta_{t+1})-\ell(\theta_t)\right]\leq-2\eta_t\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]+\eta_t\,\mathbb{E}\left[\langle\nabla\ell(\theta_t),g(\theta_t)-g(\theta_t^{\text{adv}})\rangle\right]+L\eta_t^2 G^2. \tag{12}
\end{align}

Here we need to bound the term $\mathbb{E}\left[\langle\nabla\ell(\theta_t),g(\theta_t)-g(\theta_t^{\text{adv}})\rangle\right]$. We have:

\begin{align}
&\mathbb{E}\left[\langle\nabla\ell(\theta_t),g(\theta_t)-g(\theta_t^{\text{adv}})\rangle\right]\nonumber\\
&\leq\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert\cdot\lVert g(\theta_t)-g(\theta_t^{\text{adv}})\rVert\right]\nonumber\\
&\leq L\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert\cdot\lVert\theta_t-\theta_t^{\text{adv}}\rVert\right]\qquad\text{(assumption (ii))}\nonumber\\
&=L\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert\cdot\lVert\hat{\epsilon}_t-\gamma_t\delta_t\rVert\right]\nonumber\\
&\leq L\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert\cdot\lVert\hat{\epsilon}_t\rVert\right]+L\gamma_t\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert\cdot\lVert\delta_t\rVert\right]\nonumber\\
&\leq L\rho_t\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert\right]+L\gamma_t\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert\cdot\lVert\delta_t\rVert\right]\qquad(\lVert\hat{\epsilon}_t\rVert\leq\rho_t)\nonumber\\
&\leq L\rho_t\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert\right]+L\gamma_t kG\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert\right]\nonumber\\
&\leq L\rho_t G+L\gamma_t kG^2\qquad\text{(assumption (i))}. \tag{13}
\end{align}
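
The penultimate step above uses the triangle inequality together with assumption (i) to bound the alignment term:

\begin{align}
\lVert\delta_t\rVert=\Big\lVert\sum_{j=1}^{k}\nabla\ell(f(x_j^{\prime};\theta_t),y_j^{\prime})\Big\rVert\leq\sum_{j=1}^{k}\lVert\nabla\ell(f(x_j^{\prime};\theta_t),y_j^{\prime})\rVert\leq kG.\nonumber
\end{align}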

Substituting Equation (13) into Equation (12), we obtain:

\begin{align}
\mathbb{E}\left[\ell(\theta_{t+1})-\ell(\theta_t)\right]\leq-2\eta_t\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]+L\rho_t G+L\gamma_t kG^2+L\eta_t^2 G^2. \tag{14}
\end{align}

Rearranging the above inequality, we have:

\begin{align}
2\eta_t\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]\leq\mathbb{E}\left[\ell(\theta_t)-\ell(\theta_{t+1})\right]+L\rho_t G+L\gamma_t kG^2+L\eta_t^2 G^2. \tag{15}
\end{align}

Performing a telescoping sum over $t$ and taking the expectation at each step, we have:

\begin{align}
2\sum_{t=1}^{T}\eta_t\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]\leq\mathbb{E}\left[\ell(\theta_0)-\ell(\theta_T)\right]+LG\sum_{t=1}^{T}\rho_t+LkG^2\sum_{t=1}^{T}\gamma_t+LG^2\sum_{t=1}^{T}\eta_t^2. \tag{16}
\end{align}

Note that our schedules are $\eta_t=\frac{\eta_0}{\sqrt{t}}$, $\rho_t=\frac{\rho}{\sqrt{t}}$, and $\gamma_t=\frac{\gamma}{\sqrt{t}}$; then we have:

\begin{align}
\frac{2\eta_0}{\sqrt{T}}\sum_{t=1}^{T}\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]&\leq\text{LHS of (16)}\leq\text{RHS of (16)}\nonumber\\
&\leq\ell(\theta_0)-\ell_{\text{min}}+LG\rho\sum_{t=1}^{T}\frac{1}{\sqrt{t}}+LkG^2\gamma\sum_{t=1}^{T}\frac{1}{\sqrt{t}}+LG^2\eta_0^2\sum_{t=1}^{T}\frac{1}{t}\nonumber\\
&\leq\ell(\theta_0)-\ell_{\text{min}}+LG\rho\left(2\sqrt{T}-1\right)+LkG^2\gamma\left(2\sqrt{T}-1\right)+LG^2\eta_0^2\left(1+\log T\right). \tag{17}
\end{align}
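
The last inequality uses the elementary integral-comparison bounds

\begin{align}
\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\leq1+\int_{1}^{T}\frac{dt}{\sqrt{t}}=2\sqrt{T}-1,\qquad\sum_{t=1}^{T}\frac{1}{t}\leq1+\int_{1}^{T}\frac{dt}{t}=1+\log T.\nonumber
\end{align}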

Hence,

\begin{align}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]\leq C_0+\frac{C_1}{\sqrt{T}}+C_2\frac{\log T}{\sqrt{T}}=\mathcal{O}\!\left(\frac{\log T}{\sqrt{T}}\right), \tag{18}
\end{align}

where $C_0$, $C_1$, and $C_2$ are constants.

For the second part of the theorem, we have:

\begin{align}
\mathbb{E}\left[\lVert\nabla\ell(\theta_t^{\text{adv}})\rVert_2^2\right]&=\mathbb{E}\left[\lVert\nabla\ell(\theta_t)+\nabla\ell(\theta_t^{\text{adv}})-\nabla\ell(\theta_t)\rVert^2\right]\nonumber\\
&\leq2\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]+2\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t^{\text{adv}})-\nabla\ell(\theta_t)\rVert^2\right]\nonumber\\
&\leq2\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]+2\,\mathbb{E}\left[\lVert g(\theta_t^{\text{adv}})-g(\theta_t)\rVert^2\right]\nonumber\\
&\leq2\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]+2L^2\,\mathbb{E}\left[\lVert\theta_t^{\text{adv}}-\theta_t\rVert^2\right]\qquad\text{(assumption (ii))}\nonumber\\
&\leq2\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]+2L^2\,\mathbb{E}\left[\lVert\hat{\epsilon}_t-\gamma_t\delta_t\rVert^2\right]\nonumber\\
&\leq2\,\mathbb{E}\left[\lVert\nabla\ell(\theta_t)\rVert^2\right]+2L^2\left(\rho_t^2+\gamma_t^2k^2G^2\right). \tag{19}
\end{align}

Summing over $t$ and averaging, we have:

\begin{align}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\lVert\nabla\ell(\theta_t^{\text{adv}})\rVert_2^2\right]\leq2\left(C_0+\frac{C_1}{\sqrt{T}}+C_2\frac{\log T}{\sqrt{T}}\right)+2L^2\,\frac{1+\log T}{T}\left(\rho_0^2+\gamma_0^2k^2G^2\right). \tag{20}
\end{align}

Therefore,

\begin{align}
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\lVert\nabla\ell(\theta_t^{\text{adv}})\rVert_2^2\right]\leq C_3+\frac{C_4}{\sqrt{T}}+C_5\frac{\log T}{\sqrt{T}}=\mathcal{O}\!\left(\frac{\log T}{\sqrt{T}}\right), \tag{21}
\end{align}

where $C_3$, $C_4$, and $C_5$ are constants, which completes the proof.

Appendix C More Empirical Experiment

Details of our Implementation. We note from Alg. 1 that random sampling does not guarantee that each iteration draws images from all source domains. The algorithm still functions effectively even when a minibatch contains images from only two domains, or a single one; if desired, this imbalance can be removed with a domain-balanced sampler (a minimal sketch is given below). Regarding the training process, the hyperparameters are detailed in Table 6.
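
The sketch below illustrates one way to build such a domain-balanced sampler; the class and argument names are ours and this is only an illustration of the idea, not the sampler in our released code.

```python
# Domain-balanced sampler sketch: every minibatch contains an equal number of
# images from each source domain. Names and structure are illustrative only.
import random
from torch.utils.data import Sampler

class DomainBalancedSampler(Sampler):
    def __init__(self, domain_indices, samples_per_domain, num_batches):
        # domain_indices: one list of dataset indices per source domain
        self.domain_indices = domain_indices
        self.samples_per_domain = samples_per_domain
        self.num_batches = num_batches

    def __iter__(self):
        for _ in range(self.num_batches):
            batch = []
            for indices in self.domain_indices:
                batch.extend(random.sample(indices, self.samples_per_domain))
            random.shuffle(batch)
            yield from batch  # use with batch_size = samples_per_domain * num_domains

    def __len__(self):
        return self.num_batches * self.samples_per_domain * len(self.domain_indices)
```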

| Protocol | lr. step | FC lr. scale | Logit scale | Weight decay | Epochs |
| --- | --- | --- | --- | --- | --- |
| ICM $\rightarrow$ O | 40 | 10 | 12 | $1\times 10^{-4}$ | 150 |
| OMI $\rightarrow$ C | 40 | 1 | 16 | $5\times 10^{-4}$ | 80 |
| OCM $\rightarrow$ I | 40 | 10 | 32 | $6\times 10^{-4}$ | 50 |
| OCI $\rightarrow$ M | 5 | 10 | 12 | $6\times 10^{-4}$ | 20 |

Table 6: Hyper-parameter settings in our experiments.
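
To make the table concrete, the sketch below shows one plausible way to wire these settings into an SGD optimizer, where "FC lr. scale" multiplies the classifier-head learning rate and "lr. step" is the step-decay interval; the base learning rate, momentum, and decay factor here are assumptions, not values reported in the paper.

```python
# Illustrative wiring of the Table 6 hyper-parameters; base_lr, momentum, and
# the decay factor gamma are assumed values, not taken from the paper.
import torch

def build_optimizer(backbone, fc_head, base_lr, fc_lr_scale, weight_decay, lr_step):
    optimizer = torch.optim.SGD(
        [
            {"params": backbone.parameters(), "lr": base_lr},
            {"params": fc_head.parameters(), "lr": base_lr * fc_lr_scale},  # "FC lr. scale"
        ],
        momentum=0.9,
        weight_decay=weight_decay,
    )
    # "lr. step": decay the learning rate every lr_step epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_step, gamma=0.1)
    return optimizer, scheduler

# Example for the ICM -> O protocol (Table 6): lr_step=40, fc_lr_scale=10,
# weight_decay=1e-4, trained for 150 epochs.
```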

Robustness to Unseen Corruptions. In assessing the generalization capabilities of a FAS detector, it is crucial to evaluate its robustness against various types of input corruptions, a topic extensively explored in prior works [32, 24]. Adopting the experimental settings from [24], we examine the detector’s performance under six common image corruptions: saturation, contrast, block-wise distortion, white Gaussian noise, blurring, and JPEG compression, each with five levels of severity. In Figure 5, we showcase examples of live and spoof faces affected by six types of image corruption techniques [24], each applied with a severity level of 3. While these represent digital corruptions, they are still pertinent for assessing the resilience of spoof face detectors.
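
As an illustration, the sketch below applies two of these corruptions (white Gaussian noise and JPEG compression) at a chosen severity level; the severity-to-parameter mappings are our own assumptions for illustration and may differ from the exact settings of [24].

```python
# Two example corruptions with five severity levels each; the parameter values
# per severity are illustrative and not necessarily those of [24].
import io
import numpy as np
from PIL import Image

def gaussian_noise(img, severity=3):
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]  # assumed mapping
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = np.clip(x + np.random.normal(scale=sigma, size=x.shape), 0.0, 1.0)
    return Image.fromarray((x * 255).astype(np.uint8))

def jpeg_compression(img, severity=3):
    quality = [25, 18, 15, 10, 7][severity - 1]  # assumed mapping
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```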

We compare our method with four baselines: SSAN [66], SSDG [44], SA-FAS [57], and IADG [77], using their available official models. The results are shown in Fig. 6. Our method consistently exhibits robustness across severity levels, as indicated by its lower average HTER.
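
For clarity, HTER is the average of the false acceptance rate (FAR) and false rejection rate (FRR) at a fixed decision threshold; a minimal computation is sketched below, assuming liveness scores with label 1 for live and 0 for spoof.

```python
# HTER = (FAR + FRR) / 2 at a given decision threshold.
# Assumes `scores` are liveness scores and `labels` use 1 = live, 0 = spoof.
import numpy as np

def hter(scores, labels, threshold=0.5):
    scores, labels = np.asarray(scores), np.asarray(labels)
    accept = scores >= threshold                 # accepted as live
    far = accept[labels == 0].mean()             # spoof samples wrongly accepted
    frr = (~accept)[labels == 1].mean()          # live samples wrongly rejected
    return 0.5 * (far + frr)
```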

Figure 5: Illustration of six corruption types applied to live and spoof faces from the OULU-NPU dataset.

Figure 6: HTER (%) of DG spoof detectors under various image corruptions at different severity levels [24]. The experiments are conducted on ICM$\rightarrow$O, with the corruptions applied to the OULU-NPU dataset.

Appendix D Limitations and Future Works

While our proposed method has achieved SoTA performance across various experiments, we acknowledge two limitations. First, the training data require domain labels to derive the ascending points, which may limit applicability in scenarios where training data from multiple sources are merged without such labels. Second, although our method has computational demands comparable to other methods at inference time, GAC-FAS can incur higher training costs when handling a large number of domains, since the number of ascending points grows with the number of domains.

In future work, we aim to reduce the number of ascending points by exploiting similarities across domains, which includes developing a more efficient regularization approach that offers deeper insight into the generalization updates. Notably, our method, which employs a SAM-based optimizer, parallels meta-learning techniques [5, 55, 63] in generating domain-specific gradients, albeit in the opposite direction. While meta-learning methods require additional domain-specific gradient steps and may underperform our approach, combining the ascending vectors of GAC-FAS with the descending vectors of meta-learning is a promising avenue for further improving domain generalization, and our future research will concentrate on these synergistic possibilities.