Securing Social Media Against Deepfakes using Identity, Behavioral, and Geometric Signatures

Muhammad Umar Farooq, Awais Khan, Ijaz Ul Haq, and Khalid Mahmood Malik, Manuscript received XXX. This work is supported by the National Science Foundation (NSF) under award numbers 2329858 and 2231619 and Michigan Translational Research and Commercialization (MTRAC), Advanced Computing Technologies (ACT) award number 292883. This material is based upon work supported by the National Science Foundation under Grant No. 2409577Muhammad Umar Farooq, Awais Khan, Ijaz Ul Haq and Khalid Mahmood Malik are with College of Innovation and Technology, University of Michigan-Flint, Flint, MI 48502, USA (e-mail: [email protected], [email protected], [email protected] and [email protected]).

Abstract

Trust in social media is a growing concern due to its ability to influence significant societal changes. However, this space is increasingly compromised by various types of deepfake multimedia, which undermine the authenticity of shared content. Although substantial efforts have been made to address the challenge of deepfake content, existing detection techniques face a major limitation in generalization: they tend to perform well only on specific types of deepfakes they were trained on.This dependency on recognizing specific deepfake artifacts makes current methods vulnerable when applied to unseen or varied deepfakes, thereby compromising their performance in real-world applications such as social media platforms. To address the generalizability of deepfake detection, there is a need for a holistic approach that can capture a broader range of facial attributes and manipulations beyond isolated artifacts. To address this, we propose a novel deepfake detection framework featuring an effective feature descriptor that integrates Deep identity, Behavioral, and Geometric (DBaG) signatures, along with a classifier named DBaGNet. Specifically, the DBaGNet classifier utilizes the extracted DBaG signatures, leveraging a triplet loss objective to enhance generalized representation learning for improved classification. Specifically, the DBaGNet classifier utilizes the extracted DBaG signatures and applies a triplet loss objective to enhance generalized representation learning for improved classification. The comprehensive DBaG signatures captures both facial geometry inconsistencies and behavioral cues, enhancing the detection of diverse deepfake types and improving generalization. To test the effectiveness and generalizability of our proposed approach, we conduct extensive experiments using six benchmark deepfake datasets: WLDR, CelebDF, DFDC, FaceForensics++, DFD, and NVFAIR. Specifically, to ensure the effectiveness of our approach, we perform cross-dataset evaluations, and the results demonstrate significant performance gains over several state-of-the-art methods.

Index Terms:

Deepfake Detection, Multimedia Forensics, Behavioral Biometrics, face forgery detection, DBaG

I Introduction

Over the past decade, internet traffic has seen a significant shift from text-based information to multimedia files, driven by the rise of large-scale social multimedia platforms [1]. While these platforms enrich and facilitate the sharing of everyday experiences, they also pose a growing threat: the spread of fake multimedia (e.g., images, videos, and audio) that contains misleading or fabricated content. Advances in video generation technologies have played a crucial role in the rise of such forged media through various forms of manipulation.

Refer to caption — Figure 1: Analysis of in- and cross-dataset representations of individual and fused features. The left column highlights the facial regions where deep identity, behavioral, and geometric features are extracted. The scatter plots display the t-SNE visualization of real and fake samples using different feature types: deep identity in the top row, behavioral in the second, geometric in the third, and the fused DBaG (Deep identity, Behavioral, and Geometric) features in the bottom row. The DBaG fusion demonstrates superior discrimination between real and fake samples, enhancing classification performance and generalization across both within-dataset and cross-dataset tests.

Multimedia manipulation stems from techniques such as replacing facial expressions without altering the actor’s identity [2] and real-time expression transfer [3]. Moreover, the technology has advanced to the point where creating a deepfake now takes only 35 seconds, making the manipulation of multimedia faster and more accessible than ever before. Similarly, the year 2023 alone surge in the development of deepfake generation tools, with over 60% of them being newly introduced within that year¹¹1https://humanorai.io/deepfake-tools-statistics. The advancement in generative AI techniques, such as generative adversarial networks (GANs) and diffusion-based models, enabled users to produce highly realistic deepfake videos that are often indistinguishable from genuine videos.

Despite their benefits in fields such as entertainment and the film industry, deepfakes pose significant risks when used maliciously, threatening social integrity. The manipulation of faces in videos is particularly concerning, as seen in recent real-world incidents. For instance, a fabricated video of former President Obama mocking Donald Trump spread distrust, and a forged video of Ukraine’s president urging soldiers to surrender sowed confusion and misinformation [4]. In the financial sector, deepfakes have also been weaponized, with a notable case involving a fraudulent transaction of $25 million, triggered by a deepfake video of a chief financial officer during a video conference²²2https://www.cnn.com/2024/02/04/asia/deepfake-cfo-scam-hong-kong-intl-hnk/index.html. For public figures, such fake content can lead to severe reputational damage and destabilizing consequences, amplifying the urgency to address the misuse of deepfake technology.

Although substantial progress has been made in the detection of deepfakes, there are still some key challenges that continue to limit the effectiveness of existing methods. The existing methods developed to encounter deepfake can be categorized into traditional solutions using handcrafted features and recent studies using deep learning models. These solutions have proven to be effective against specified types of deepfake but often fall short in terms of generalization. For example, Zhang et al. [5] employed Speeded Up Robust Features SURF descriptors for detecting face swap deepfakes, but their method is solely developed to detect deepfake images and ineffective when confronted with videos. In another study, Ciftci et al. [6] introduced biological signal analysis (e.g., heart rate) for forensic changes. The approach presented in this study may suffer from limited generalization due to its reliance on a dataset that may introduce biases towards known generative models, specifically against unseen fake instances. Similarly, Jung et al. [7] used eye-blinking anomalies for detection, but their method fails when subjects exhibit abnormal blinking patterns. These approaches, while valuable, demonstrate that single-feature-based solutions are limited in scope. On the other hand, traditional solutions such as those focusing on eye-blinking [8] or behavioral biometrics [9] perform well in controlled scenarios but lack generalization to other types of forgeries or unseen datasets.

In contrast, recent deep learning solutions, such as those proposed by Li et al. [10] and Agarwal et al. [9], utilize facial landmarks and behavioral biometrics for deepfake detection. However, these models frequently encounter challenges when addressing diverse deepfake techniques, including lip-sync manipulations and unseen datasets. In another study, Fernandes et al. [11] employ biological signals by measuring heart rates through methods like skin color variation and Eulerian video magnification, which are subsequently analyzed using a Neural Ordinary Differential Equations (Neural-ODE) model. While this approach has shown effectiveness in detecting deepfakes, it is affected by increased computational complexity, potentially limiting its applicability in real-time scenarios. Similarly, Yang et al. [12] introduce a multi-scale texture difference network that employed a ResNet-18 architecture to capture texture variations in manipulated images, although the presented solution achieve good performance on certain deepfakes these models often struggle with generalization across varying deepfake techniques, particularly in more complex situations. Other works [13, 14] have explored CNN-based methods for the detection of swapped faces, however, there remains a critical need for more generalized solutions capable of consistently performing across diverse datasets and manipulation types. Collectively, these studies highlight the challenges faced by existing deep learning approaches in achieving comprehensive generalization, revealing that reliance on isolated handcrafted features or end-to-end deep models alone is insufficient for effective detection.

To address the limitations of existing studies, we introduce a novel framework for video deepfake detection. The proposed framework leverages a unique combination of AR/VR-inspired behavioral features, golden ratio-based geometric features, and deep identity features to train the classification model. By employing a triplet input structure and a loss function that measures similarity, the presented framework assesses the closeness of each sample to either the real or deepfake class. Unlike existing methods that rely on single feature sets or binary classification, our framework integrates complementary DBaG features and trains using a similarity matrix, enhancing its ability to capture the complex characteristics of both in-dataset and cross-dataset deepfakes, as illustrated in Figure 1. For this analysis, we trained our system on FF++ for in-dataset testing and used CelebDF for cross-dataset analysis, visualized through t-SNE. To the best of our knowledge, this is the first work to utilize a comprehensive feature set, combining expert insights from AR/VR, golden ratio principles, and facial recognition for deepfake detection. The contributions of this work are summarized as follows:

1.

We introduce a novel descriptor named DBaG that encompasses deep identity, behavioral, and geometric information for deepfake detection. DBaG captures holistic aspects of a given video, leading to effective in- and cross-datasets deepfake detection.
2.

We present a novel video deepfake detection framework that employed a triplet-based classifier DBaGNet that effectively discriminates between authentic and manipulated content by leveraging distance-based learning on the comprehensive DBaG features. The triplet learning of DBaGNet helps in enhancing generalization, particularly for unseen deepfake examples, that improves the robustness of the proposed solution in cross-dataset evaluations.
3.

We performed rigorous experimentation to test the generalization ability of the presented framework on six different datasets: FF++, DFD, DFDC, WLDR, CelebDF, and NVFAIR. The results demonstrate that the combination of features in the DBaG descriptor is effective against vast types of deepfake in the seen and unseen cross-dataset settings.
4.

The comprehensive ablation study reveals that AR/VR-inspired behavioral features, golden ratio-based facial geometrical features, and deep identity features from facial recognition perform effectively within individual datasets. Moreover, the fusion of these behavioral, geometric, and deep identity features enhances the model generalization ability towards cross datasets evaluations.

The rest of the paper is organized as follows: Section II reviews the literature on deepfake detection. Section III presents the details of the proposed framework. Section IV provides the evaluation details, including datasets and results. Lastly, Section V presents the conclusion and limitations of the paper.

II Literature Review

The rapid advancement of deepfake generation methods raises concerns about the need for effective and generalized deepfake detection methods. The existing deepfake detection approaches can be broadly categorized into handcrafted and deep features based methods. Methods that employ handcrafted features mostly analyze inconsistencies in lip-syncing and blinking (behavioral) or inconsistencies in facial landmark movements (geometric). In contrast, deep learning techniques like CNNs excel at extracting subtle inconsistencies in facial features. When applied to video data, these methods utilize temporal information to identify inconsistencies across sequential frames. In the next subsections below we provide a brief overview of each category and discuss the recent studies conducted on enhancing generalization ability in deepfake detection.

II-A Traditional ML based deepfake detection

Traditional deepfake detection methods mainly relied on handcrafted features, specifically designed to exploit signature vulnerabilities introduced by the generative algorithms. These features targeted inconsistencies induced by the deepfake generation algorithms, for instance, the authors of the study [15] revealed that deep fake videos often lack natural physiological signals like regular eye blinking and consistent head movements, which could serve as indicators for deep fake detection. In another study [16], the authors identified visual anomalies, such as changes in eye color, unconvincing reflections, and discrepancies in eye and tooth details, and used them to identify deepfakes. Similarly, the method proposed in [17] exploited distinct facial expressions. In [18], the method capitalized on generator limitations, while [19] focused on subtle variations induced by the heartbeat on the face to discern video authenticity. Though these techniques were effective in their time, their dependence on specific cues made them vulnerable to rapid advancements in deepfake technology. In contrast, some other works concentrate on methods for creating interpretable deep models through the use of self-explanatory mechanisms [20]. Instead of relying solely on raw inputs, these approaches used ”units of explanation,” such as abstract notions, to convey interpretability. This strategy was used for basic concept acquisition by Alvarez et al. in [20], which was further used for case-based reasoning and prototype learning by Kim et al. [21]. Recently, the author of [22] proposed ProtoPNet, which uses prototype learning to create predictions based on resemblance to class-specific image patches. However, although significant progress has been made in the development of intricate video classification models for explainable solutions (such as 3D CNNs, TSNs, and TRNs [23, 24], there is a lack of interpretability in complex deepfake detection methods encompassing recent deepfake forgeries.

II-B Deep learning based deepfake detection

Deep learning approaches for deepfake detection offer more robust and adaptable solutions. Unlike traditional handcrafted features, neural networks are able to learn the complex patterns and subtle inconsistencies in real and fake faces, surpassing the limitations of manually engineered features. These solutions can be broadly categorized into frame-based, temporal, and spatial-temporal approaches. In frame-based deepfake detection, [25] proposed a deepfake detector that identified blending imperfections for video distortion detection. The author of the study [26] presented a CNN-based model for detecting deformation artifacts. Similarly, the presented framework in [27] analyzed video segments to capture visual inconsistencies. In contrast, in temporal analysis, [28] combined CNNs and LSTMs to analyze blinking patterns and temporal irregularities. The author of [29] used a spatio-temporal network for deepfake manipulated artifact detection. On the other hand, some studies developed lightweight deepfake detector models that can be used in computational intensive devices. For instance, the author of the study [30] presents ShallowNet based deepfake detector which excelled at detecting GAN-generated images even at low resolution. In another study [31], Pellcier et.al., introduces a unified framework based on prototype learning for Deepfake Detection (PUDD), which uses similarity-based detection to identify deepfakes by comparing input data to known prototypes. The presented PUDD framework outperforms the existing methods by achieving 95.1% accuracy on the Celeb-DF dataset. However, the presented solution was not tested against generalization across diverse datasets. Additionally, the model’s effectiveness could be affected by sensitivity to prototype selection. Similarly, the study [32] investigates the potential of deep neural network fusion for effective deepfake detection. The author emphasize the structural ability of various neural networks to capture distinct features for deepfake detection. For this, the author presents a multi-branch and multi-level architecture that separates fixed and adaptive knowledge from pre-trained networks which enhance the detection on low-power devices such as mobile devices. However, the generalization ability of this solution remains untested. In the next subsection, we explained and discussed the research conducted towards generalization of the deepfake detection.

II-C Deepfake detection toward generalization

Generalization ability remains a critical hurdle in deepfake detection, affecting the performance of the deepfake detectors from adapting to unseen data. Despite this significant challenge, research specifically addressing generalization ability within this domain remains limited. Early attempts like FWA [33] exploited resolution differences between forged faces and backgrounds for detection. Recent studies, however, demonstrate significant progress towards enhanced generalizability. Face X-ray [34] focuses on identifying blending boundary artifacts, while SPSL [35] utilizes phase spectrum analysis in a frequency-based approach. Lip-Forensics [36] uses spatio-temporal networks to detect unnatural mouth movements, and SRM [37] analyzes high-frequency noise for broader detection capabilities. In recent study [38], The author Yan et.al., tackles the generalization issue in deepfake detection, where performance suffers from mismatched training and testing data distributions. The authors present the Latent Space Data Augmentation (LSDA) detector, which improves decision boundaries by simulating variations of forgery features in the latent space. The presented LSDA outperformed the existing detectors across multiple benchmarks. However, LSDA’s reliance on effective simulation techniques may restrict its applicability to real-world scenarios with novel unseen deepfake forgeries. In another study [39], the authors propose the Generalized Multi-Scenario Deepfake Detection framework (GM-DF), which trains models on multiple datasets and utilizes a hybrid expert modeling approach along with a domain-aware meta-learning strategy. Although the framework shows promise in enhancing generalization, it may experience rapid declines in accuracy due to dataset discrepancies. Although these studies enhance the generalization of deepfake detection methods, their reliance on predetermined forgery patterns and consideration of the entire feature space makes them susceptible to disruption from irrelevant factors like background and identity.

Therefore, collectively both handcrafted and deep learning approaches offer distinct advantages: dataset oriented deepfake detection for handcrafted features and generalization ability for deep neural network. However, their fusion within a single framework holds significant promise for achieving both in-dataset and generalized deepfake detection. This strategy has been explored in few studies, some of which are discussed in the next subsections, with the aim of enhancing the generalization ability for deepfake detection systems.

III Proposed Framework

In this section, we introduce the proposed framework for detecting video deepfakes. The presented framework is composed of three stages. The first stage involves preprocessing the video data to crop faces and extract facial landmarks. In the second stage, the extracted cropped faces and landmarks are passed through a multidisciplinary feature extraction process to obtain deep identity, behavioral, and geometric features. The obtained features are then fused together to create a comprehensive set of DBaG features. In the final stage, the fused DBaG features are input into a classifier, which uses a triplet loss objective to learn the representation of input video segments to classify them as real or fake based on a similarity index. The feature extraction process and the architecture of the proposed DBaGNet are presented in Figures 3 and 4. In the following subsections, we explain the process of each stage in detail.

III-A Preprocessing

Given a video $V$ consisting of $N$ frames, reshaped into $K$ batches ${S}^{B\times C\times W\times H}_{K}$ , where $B$ , $C$ , $W$ and $H$ represent batch size, frame channels, width and height, respectively. A Multi-task Cascaded Convolutional Network [40] is then applied on each batch $S^{B\times C\times W\times H}_{k},\text{ for }k=1,2,3,\ldots,K-1$ to get the coordinates of faces in the input frames. Batches are reshaped into a sequence of frames ${V}^{N\times C\times W\times H}$ along with the coordinates of the detected faces. To ensure the quality of resulting cropped faces for the subsequent steps, we set $120\times 120$ as the acceptable facial dimensions required by the MTCNN, in the input video frames. Similarly, in the video with multiple faces, the face with larger dimensions is selected. After the face detection, faces are cropped and rescaled to $224\times 224$ to get ${V}^{N\times C\times W\times H}_{224\times 224}$ , which is used for behavioral, geometric and deep identity features extraction.

III-B DBaG: A Multidisciplinary Feature Descriptor

We argue that a holistic analysis of a face in a given video is more robust and generalized approach than a single cue based analysis. Based on our hypothesis and reported experimental results, we proposed a multidisciplinary DBaG descriptor composed of behavioral, facial geometry, and identity signature for deepfake detection. The visualization of DBaG features descriptor is given in Figure 2 and the composition of the BDaG is visualized in Figure 3. The DBaG descriptor holistically analyze a given face for inconsistencies in facial behavior and face geometry along with relative identity information. Furthermore, the behavior and geometric features in DBaG descriptor are handcrafted, resulting in interpretability, while the identity features are extracted using deep models to detect low-level details like texture and skin color tone, and hence improve generalizability. A detailed discussion on each component of the DBaG descriptor is given in the subsequent sections.

III-B1 Blendshape Features for Behavioral Analysis

Facial behavioral patterns or expressions are the most complicated phenomena to mimic perfectly in deepfake generation. Inspired by the role of blendshape features that are commonly employed in virtual and augmented reality applications [41], we use them to capture a wider spectrum of facial expressions for behavioral analysis. The acquisition of blendshape features starts with the extraction of facial landmarks. An input face frame $V_{224\times 224}$ from ${V}^{N\times C\times W\times H}_{224\times 224}$ is passed to a MobileNetV2-like customized architecture [42] ( $M_{Net}$ ) to extract $478$ facial landmarks. Next, to capture facial expressions, a comprehensive set of 52-dimensional blendshape behavioral features is extracted with a MLP-Mixer backbone [42] ( $MLP_{Mixer}$ ). A visual representation of behavioral feature calculation is given in Figure 2(b). This approach facilitates the precise characterization of behavioral distinctions exhibited by the face throughout the frame sequence. Equations 1 & 2 formulate the calculation of blendshape features as:

LM_{[lm_{1},lm_{2},\ldots,lm_{478}]}=\sum_{j=1}^{N}M_{Net}[V_{224\times 224}^{(j)}]

(1)

F_{b_{[1\times 52]}}=\sum_{k=1}^{N}MLP_{Mixer}[LM_{[lm_{1},lm_{2},\ldots,lm_{478}^{(k)}]}]

(2)

III-B2 Golden Ratio for Facial Geometry Analysis

In some of the deepfake categories such as faceswap, the inner facial regions of a target face are replaced while the outer facial geometry is kept intact. To exploit the difference in inner and outer facial geometry, DBaG derives novel facial geometric features inspired by facial golden ratio [43]. The facial golden ratio is originally used for beauty calculation and mostly covers the symmetry of a given face. To capture facial geometry information, we compute geometric features using facial landmarks to analyze temporal inconsistencies in facial structure. These inconsistencies may include subtle deviations in symmetry or proportion that arise when source and target facial geometries blend, creating a mix that appears visually appealing but lacks the natural cohesion of real faces in video frames. For this task, we utilize Mediapipe [42] to capture facial geometry, by calculating distances formed by various landmarks. This facial geometry provides a holistic representation of the structural aspects when discerning genuine facial attributes. Specifically, we aim to capture alterations in facial geometry exhibited by an individual while speaking in a given video. Figure 2(c) represents a visual representation of the derived geometric features.

In our approach, we calculate the geometric features between specific sets of facial landmarks. Here, $P=\{p_{1},p_{2},\ldots,p_{l}\}$ represent the set of landmarks corresponding to the lower half of the outer face region, and $Q=\{q_{1},q_{2},\ldots,q_{m}\}$ represent the set of landmarks corresponding to the upper half of the outer face region. We define bottom of the face $B$ , as the central point of $P$ and the top of face $T$ as the central point of $Q$ . We compute distances from these central points to each landmark in the opposite region, covering a symmetric range of indices from $T-n$ to $T+n$ and $B-n$ to $B+n$ respectively:

G_{T_{[1\times 18]}}=\sum_{t=T-n}^{T+n}\left(|x_{B}-x_{t}|+|y_{B}-y_{t}|+|z_{B}-z_{t}|\right)

(3)

G_{B_{[1\times 18]}}=\sum_{b=B-n}^{B+n}\left(|x_{T}-x_{b}|+|y_{T}-y_{b}|+|z_{T}-z_{b}|\right)

(4)

These equations define the geometric features as distances from the central points $B$ and $T$ to the landmarks in the opposite region, spanning from $T-n$ to $T+n$ for $G_{T}$ , and $B-n$ to $B+n$ for $G_{B}$ .

The final feature vector representing the geometric properties is constructed as:

F_{g_{[1\times 36]}}=\left[G_{T_{[1\times 18]}},\,G_{B_{[1\times 18]}}\right]

(5)

This approach provides a concise and focused calculation of distances between central points and surrounding landmarks in opposite facial regions, capturing geometric relationships precisely.

III-B3 Deep Identity Features Signature Analysis

In addition to behavioral and geometric features , DBaG also include deep identity facial features to capture individual’s facial characteristics or signature. In order to capture deep identity signature, we deploy a quality adaptive face-recognition model, AdaFace [44], which uses a ResNet backbone and applies a margin based loss function to capture the deep features in low quality images as given in Equation 6.

\theta=-\log\left(\frac{\exp(g(\Theta_{yi},m))}{\exp(g(\Theta_{yi},m))+\sum_{j\neq yi}^{n}\exp(s\cos\Theta_{j})}\right)

(6)

where $\theta j$ denotes the angle between the feature vector of current frame and the $j^{th}$ classifier weight vector, $yi$ represents the index of the real face label, and $m$ corresponds to the margin, serving as a scalar hyperparameter. Finally, a feature vector of size 512 is obtained as given in Equation 7.

F_{i_{[1\times 512]}}=\sum_{m=1}^{N}M_{AF}(\theta([V_{224\times 224}]))

(7)

where $V_{224\times 224}$ is the input image, $\theta$ is the loss function as shown in 6, and $M_{AF}$ is the transformation function used in AdaFace to compute the final feature vector.

Our DBaG feature descriptor brings different aspects, including behavioral $F_{b}$ showing how the face expresses itself, geometry $F_{g}$ , revealing the face’s structure and identity signature $F_{i}$ , uncovering hidden patterns from generative algorithms. Jointly, these features form our final feature vector, providing a well-rounded view of the information needed to tell if a video is real or fake. Equation 8 formulates the final feature vector of DBaG.

Fv_{[1\times 600]}=[F_{b}+F_{g}+F_{i}]

(8)

To capture the spatial and temporal details in the video frames, we reshape the DBaG feature descriptor vectors to 2D slices of 120 frames with an overlap of 60 frames. The final input vector for the model is $Fv_{[120\times 600]}$ .

III-C Classification

To perform classification in an efficient manner, we designed a deep learning classifier DBagNet as shown in Figure 4, which employs residual learning alongside both spatial and temporal domains. This architecture incorporates attention-based mechanisms and pooling techniques, drawing on prior research [45, 46] to create a model that can effectively generalize across manipulated media.

III-C1 Representation Learning with triplet loss

Prior to representation learning, the obtained DBaG features passed to the DBaGNet consists of five squeeze and excitation based residual layers. The DBaGNet initial block consists of a 2D convolution with a 7×7 kernel and a stride of 2, followed by batch normalization, ReLU activation, and a max pooling layer with a 3×3 kernel. This block downscales the input and extracts low-level spatial features. Representational learning is primarily carried out through residual blocks, each of which contains two 3×3 convolutions with batch normalization and ReLU activation. Each block also integrates a squeeze-and-excitation (SE) component for adaptive feature recalibration, applied after the second convolution. The final convolution layers are followed by an adaptive average pooling layer, which standardizes the feature map size (1×1), regardless of the original input size. After the convolutional and pooling layers, two fully connected layers reduce the dimensionality and generate an embedding vector. These layers use Leaky ReLU activations to maintain feature sensitivity and ensure stable gradient flow, especially important for effective embedding vectors.

Following this, the proposed DBaGNet is trained with triplet margin loss to learn the triplet embeddings for generalized deepfakes detection. The training with triplet learning objective, requires the dataset to be constructed to have a triplet based input having an anchor $x_{i}$ , a positive $p_{i}$ and a negative $n_{i}$ features. Next, the DBaG features for anchor, positive and negative samples are transformed to $\hat{x}_{i}$ , $\hat{p}_{i}$ and $\hat{n}_{i}$ by feeding them into residual dense layer architecture to obtain finalized feature vectors. The resultant $\hat{p}_{i}$ has the same label as the $\hat{x}_{i}$ in contrary the $\hat{n}_{i}$ has different label. Next, we used triplet margin loss as the objective function to generate the discriminative embeddings of real and fake feature vectors. The triplet margin loss helps avoid over-fitting to the training data and better generalization. The triplet margin loss can be defined as:

\mathcal{L_{\text{triplet}}}(\hat{x}_{i},\hat{p}_{i},\hat{n}_{i})=\max\{D(\hat{x}_{i},\hat{p}_{i})-D(\hat{x}_{i},\hat{n}_{i})+\text{m},0\}

(9)

where $\hat{x}_{i},\hat{p}_{i}$ and $\hat{n}_{i}$ are anchor, positive and negative representation vectors. $m$ is the margin hyperparameter. The margin ensures that the distance between the anchor and the negative vector is not just greater than the distance between the anchor and the positive vector, but it is greater by at least the specified margin. A larger margin enforces more significant separation of positive and negative vectors, which can lead to better discrimination but might make training more challenging. $D$ is the pairwise euclidean norm between anchor, positive and anchor, negative pairs, and is defined as:

D(\hat{x}_{i},\hat{p}_{i})=\|\hat{x}_{i}-\hat{p}_{i}\|_{2}=\sqrt{\sum_{i}(\hat{x}_{i}-\hat{p}_{i})^{2}}

(10)

D(\hat{x}_{i},\hat{n}_{i})=\|\hat{x}_{i}-\hat{n}_{i}\|_{2}=\sqrt{\sum_{i}(\hat{x}_{i}-\hat{n}_{i})^{2}}

(11)

During the training phase, model learns the representations of triplets in the form of anchor, positive and negative class. After the training process is complete, embedding vectors of the training samples are saved as the reference representations along with their labels.

III-D Testing

To evaluate the model, the testing samples are passed through the trained network, generating embeddings for each slice. We implement a similarity comparison approach from the reference set for the labels prediction of test embeddings. We calculate the distance of each test embedding from the embeddings of reference set to create a distance matrix. The final label of test embedding is predicted based on the majority labels of m closest neighbors in the reference set.

Once the training process is complete, we generate embeddings for each sample in the training set, referred to as the reference set $E_{\text{ref}}$ . $E_{\text{ref}}$ is stored as embeddings, labels pairs $\{(e_{1},y_{1}),(e_{2},y_{2})\ldots,(e_{n},y_{n})\}$ , for n reference samples. These embeddings form a basis against which new (unseen) test samples are compared, providing a fixed point of reference for label prediction. For each sample in the test set, the model computes a new embedding $e_{test}$ by passing it through the trained DBagNet. This embedding represents the learned features of the test sample in the same embedding space as the reference set, capturing characteristics that differentiate real from fake data.To determine the label of $e_{test}$ , we calculate its Euclidean distance from each reference embedding $e_{i}$ in $E_{ref}$ , providing a measure of similarity. The Euclidean distance $d_{i}$ between $e_{test}$ and $e_{i}$ is given by:

d_{i}=\|e_{\text{test}}-e_{i}\|_{2}=\sqrt{\sum_{j}(e_{\text{test},j}-e_{i,j})^{2}}

(12)

where j indexes the dimensions of the embedding vectors. This step produces a set of distances ${d_{1},d_{2},d_{3},...,d_{n}}$ that indicate how close the test embedding is to each reference sample. To assign a label to $e_{test}$ , we rank the distances ${d_{1},d_{2},d_{3},...,d_{n}}$ in ascending order and identify the $m$ smallest distances, representing the nearest neighbors of $e_{test}$ in the embedding space. The label is then determined by a majority vote among the labels of these nearest neighbors. Formally, the predicted label is given by:

y_{\text{pred}}=M_{vote}(\{y_{i_{1}},y_{i_{2}},\ldots,y_{i_{m}}\})

(13)

where ${i_{1}},y_{i_{2}},\ldots,y_{i_{k}}$ are the indices of the $m$ nearest embeddings in $E_{ref}$ and $M_{vote}$ selects the label that appears most frequently among these neighbors. This majority voting process ensures that the final prediction takes into account multiple neighbors, providing robustness against outliers and minor variations in the embedding space.

IV Experimentation and results

This section discusses the details of the evaluation criteria, validation experiments on benchmark deepfake datasets including comparison with State-of-the-art methods, and the ablation study based on different combinations of feature sets and cross-validation experiments to evaluate the generalizability of the proposed framework.

TABLE I: Performance evaluation of the proposed method on DFDC, WLDR, DFD, CelebDF, and face swapping subsets of FF++.

Dataset	ACC	AUC	EER
DFDC	97.71	96.78	0.72
WLDR	99.31	99.29	0.93
DFD	95.15	93.07	7.65
CelebDF	99.71	99.43	0.46
FF-FaceSwap	99.37	99.12	2.42
FF-Deepfake	99.18	99.02	1.02
FF-FaceShifter	99.49	99.23	1.23
FF-Combined	99.47	99.41	1.82

IV-A Datasets

We evaluate the feasibility and superiority of the proposed framework on six large-scale datasets of manipulated samples from recent GAN’s and Diffusion models, confirming a well-suited evaluation to assessing the model’s efficacy for generalization on new and old types of deepfakes. Short description of each dataset is given below.

IV-A1 DFDC [47]

The DFDC is a large-scale dataset, with over 100k videos generated using 8 different GAN-based algorithms. We randomly selected 6800 videos with 1050 real and 5750 fake videos.

IV-A2 CelebDF-v2 [48]

CelebDF-v2 comprising of 590 real videos and 5639 fake videos of 59 celebrities generated with recent generative models.

IV-A3 World Leaders Dataset (WLDR) [17]

WLDR composed of five US presidential candidates including Bernie Sanders, Barack Obama, Elizabeth Warren, Joe Biden and Hillary Clinton. This dataset is comparatively challenging because the Deepfakes in WLDR are generated using impersonations of each political figure. This dataset contains 595 real and 82 fake videos.

IV-A4 DFD [49]

The deepfake detection dataset (DFD) by Google, in collaboration with Jigsaw, is generated with latest GAN algorithms. This dataset contains 363 high quality real and 3068 face-swap fake videos of 28 paid actors. This dataset has videos with one and more than one individual, performing different tasks, like talking, walking and hugging, and different emotional expressions like anger, disgust, happiness and neutral.

IV-A5 FaceForensics $++$ (FF $++$ ) [50]

FF $++$ dataset stands as the most extensively employed forgery dataset. This dataset is comprised of five manipulation techniques, including three swapping techniques (DeepFakes, FaceSwap and FaceShifter), and two expression swapping techniques (Face2Face and NeuralTexture). This dataset contains $1000$ real videos and $1000$ for each fake technique with a total of $6000$ videos.

IV-A6 NVIDIA Facial Reenactment (NVFAIR) [51]

NVFAIR is a new expression swapping dataset recently generated by NVIDIA. This dataset contains hours of expressions swapped videos of 161 identities, 24 from RAVDESS, 91 from CREMA-D, and 46 from their own video-conferencing data. They used three reenactment algorithms such as Face-vid2vid, LIA, and TPS to generate manipulated samples.

IV-B Evaluation Metrics and Experimental Setup

In our evaluation, we used Accuracy (ACC), equal error rate (EER) and area under the curve (AUC) metrics. A random split of 80%:20% is used for training and testing, respectively for DFDC, DFD, WLDR, and NVFAIR as these datasets have enough videos for each identity. While for the FF++ and CelebDF datasets we followed a different approach for train/test split, similar to [52], because these datasets have a limited number of videos per identity. Therefore, we chose the training and testing subsets from each video as follows: the first 20% of the video segment is used for training, the last 20% for testing, and the remaining video is ignored to prevent overlapping.

IV-C Performance Evaluation of the Proposed Framework

To evaluate the performance of the proposed framework, we tested our framework on the two most common types of deepfake video i.e., face swapping and expressions swapping. A detailed discussion on the obtained results is presented in subsequent sections.

IV-C1 Performance Analysis on Deepfakes

This experiment is designed to analyze the performance of the proposed framework on the most recent and challenging deepfakes on the standard benchmarks. We used DFDC, CelebDF, WLDR, DFD, and face-swapping subsets of FF $++$ for this experiment. The results are presented in Table I. It can be seen that the proposed framework achieves remarkable performance with an AUC of 93% to 99%, on DFDC, WLDR, CelebDF and FF++ datasets. The performance is comparatively low on DFD dataset, due to the presence of more than one face in some videos. The proposed framework is robust to the new types of deepfakes and achieves remarkable performance on new types of manipulations like FaceShifter and DFDC datasets.

IV-C2 Performance Analysis on Expressions Swaps

To analyze the performance evaluation on expressions swaps, we conducted experiments on NVFAIR and expression swapping subsets of the FF $++$ dataset. The results in Table II demonstrate the performance of our framework on facial reenactment. The proposed framework achieves remarkable performance on facial reenactment, including the new types in the NVFIAR dataset, with an AUC of 96% to 98%. Performance is comparatively low on the NeuralTextures type of expression swaps. This indicates the detection of NeuralTextures expressions swap is a challenging task.

TABLE II: Performance evaluation of the proposed method on expressions swapping subsets of FF++ and NVFAIR Dataset.

Manipulation Type	ACC	AUC	EER
NVFAIR-FaceV2V	97.18	96.27	6.12
NVFAIR-RAVDESS	98.08	97.62	5.96
FF-Face2Face	98.82	98.24	3.17
FF-NeuralTextures	97.27	96.38	4.21
FF-Combined	98.28	97.86	3.53

IV-D Comparative Performance Analysis with State-of-the-art

The proposed framework is compared with current state-of-the-art (SOTA) techniques to analyse its effectiveness over various datasets. We compared the proposed framework with PWL [17] and AnB [52] on WLDR dataset while 2-stream [53], XceptionNet [48],Head Pose [15], MesoNet [54], DSP-FWA[10], AnB [52], MAT [55], SLADD [56], Face-x-ray [57], PCL+12G [58], and SBI [59] on the rest of the datasets. Table III gives a comparative performance analysis. As seen, the proposed framework outperform PWL and AnB on the WLDR dataset, achieving an AUC of 99.29. It also achieves an AUC of 99.41 on FF++, 96.78 on DFDC, and 99.43 on the CelebDF dataset, outperforming other methods on DFDC and CDF datasets. The proposed method demonstrated strong performance, achieving AUC scores of 99.41% on the FF++ dataset and 92.12% on the DFD dataset, positioning it as the second-highest performer and closely matching the leading results of 99.64% and 93.2% attained by SBI and AnB, respectively.

TABLE III: Comparison with the SOTA on multiple benchmarks. Performance metric used for comparison is AUC.

Methods	WLDR	FF++	DFD	DFDC	CDF
PWL [17]	93.00	-	-	-	-
2-stream [53]	-	70.1	52.8	61.4	53.8
XceptionNet [48]	-	99.7	85.9	72.2	65.3
Head Pose [15]	-	47.3	56.1	55.9	54.6
MesoNet [54]	-	84.7	76.0	75.3	54.8
DSP-FWA[10]	-	93.0	81.1	75.5	64.6
AnB [52]	99.0	99.2	93.2	95.6	99.3
MAT [55]	-	99.61	-	-	-
SLADD [56]	-	98.40	-	-	-
Face-x-ray [57]	-	99.17	-	-	-
PCL+12G [58]	-	99.11	-	-	-
SBI [59]	-	99.64	-	-	-
Ours	99.29	99.41	92.12	96.78	99.43

TABLE IV: Cross-manipulation evaluation. DeepFakes, FaceSwap and FaceShifter manipulation sets of FF++ are denoted by DF, FS and FST, respectively.

Train	Method	DF	FS	FST	Mean
DF	EN-b4 [60]	97.97	46.24	51.26	65.82
	MAT [55]	99.92	40.61	45.39	61.97
	GFF [61]	99.87	47.21	51.93	66.34
	DCL [62]	99.98	61.01	68.45	76.48
	IID [63]	99.51	63.83	73.49	78.94
	Ours	99.82	81.24	76.68	85.91
FS	EN-b4 [60]	69.25	99.89	60.76	76.63
	MAT [55]	64.13	99.67	57.37	73.72
	GFF [61]	70.21	99.85	61.29	77.12
	DCL [62]	74.80	99.90	64.86	79.85
	IID [63]	75.39	99.73	66.18	80.43
	Ours	93.52	99.34	85.59	92.82
FST	EN-b4 [60]	61.11	56.19	99.52	72.27
	MAT [55]	58.15	55.03	99.16	70.78
	GFF [61]	61.48	56.17	99.41	72.35
	DCL [62]	63.98	58.43	99.49	73.97
	IID [63]	65.42	59.50	99.50	74.81
	Ours	89.71	83.57	99.37	90.89

TABLE V: Performance analysis of Cross-dataset evaluation. Model trained on FF

++

(c23) and tested on CelebDF, DFD and DFDC.

Method	Celeb-DF	DFD	DFDC
	AUC(%) EER(%)	AUC(%) EER(%)	AUC(%) EER(%)
XceptionNet [48]	65.27 38.77	65.27 38.77	87.86 21.04
EN-b4 [60]	68.52 35.61	68.52 35.61	87.37 21.99
Face X-ray [57]	74.20 -	85.60 -	70 -
MLDG [64]	74.56 30.81	88.14 21.34	71.86 34.44
F3-Net [65]	71.21 34.03	86.10 26.17	72.88 33.38
MAT [55]	76.65 32.83	87.58 21.73	67.34 38.31
GFF [61]	75.31 32.48	85.51 25.64	71.58 34.77
LTW [66]	77.14 29.34	88.56 20.57	74.58 33.81
Local-relation [67]	78.26 29.67	89.24 20.32	76.53 32.41
DCL [62]	82.30 26.53	91.66 16.63	76.71 31.97
UIA-ViT [68]	82.41 -	94.68 -	75.80 -
IID [63]	83.80 24.85	93.92 14.01	81.23 26.80
Ours	82.54 52.24	82.33 23.12	85.94 18.00
Ours + Aug	89.72 8.83	91.62 7.45	92.05 6.51

IV-E Cross-manipulation evaluation

To analyze the generalizability of the proposed framework on unseen type of deepfakes for the seen identities, we compared the performance against the SOTA on FaceShifter(FST), FaceSwap(FS) and DeepFakes(DF) manipulation classes of FF++. The model was trained on each of the three types of deepfakes (FST, FS, and DF) and tested on the test set of the same and the other two subsets. It can be observed in the Table IV and Figure 5 that the mean AUC achieved by our framework is significantly higher compared to the SOTA, with the mean AUC gains of 6.97, 12.39 and 16.08 respectively. The AUC gain for DF training is comparatively low due to the low performance on FST as shown in Figure 5 $(c)$ . The reason for this performance drop is the difference in quality and the nature of these two manipulation algorithms. Videos generated with DF do not have natural eye blinking and lip movements, while the videos generated with FST have more realistic behavioral characteristics, making them more difficult to detect.

IV-F Cross-dataset evaluation

To analyze the generalizability of the proposed framework, we performed a comparative analysis with the SOTA by training the model on FF++ and testing on CelebDF, DFD and DFDC datasets. To conduct this experiment, the proposed framework was trained on the FF++(C23) and tested on test sets of other three datasets. We selected SOTA methods for comparison, including XceptionNet [48], MLDG [64], MAT [55], DCL [62], EN-b4 [60], GFF [61] and IID [63]. It is shown in Table V, the proposed model achieves superior performance over the SOTA.

TABLE VI: Effectiveness of different combination of features on DFDC, WLDR, FF

++

, and NVFAIR. FF-FS refers to Face swapping manipulations and FF-ES refers to expressions swapping manipulations.

Features	DFDC	WLDR	FF-FS	FF-ES	NVFAIR
DIF	84.14	92.42	85.49	87.24	88.30
DIF+B	93.79	95.37	90.69	92.14	94.55
DIF+B+G	96.78	99.29	94.18	97.18	99.21

IV-G Ablation study

To analyze the effectiveness of extracted features, we performed an ablation study to evaluated our framework with the different combinations of feature sets on the DFDC, WLDR, FFF++ face swap, FF++ expression swap and NVFIAR datasets. The quantitative results on each dataset are presented in Table VI. To evaluate the effectiveness of different feature combinations on our model, we performed an ablation study across multiple datasets, analyzing the performance both within individual datasets and across datasets. Specifically, we examined our model on DFDC, WLDR, FF++ (face swap and expression swap), and NVFIAR datasets, testing each feature combination’s impact. Results for in-dataset evaluations are presented in Table VI, while cross-dataset results are provided in Table VII.

Starting with only deep identity features (DIF), the model established a baseline by achieving $88\pm 5\%$ AUC. With the addition of behavioral features, along with deep identity features, our model improves 9.65%, 2.95%, 5.20%, 4.90% and 6.25% AUC on DFDC, WLDR, FF-FS, FF-ES and NVFAIR, respectively. The AUC gains for geometric features, combined with deep identity and behavioral features, are 4.64%, 3.87%, 3.49%, 5.04% and 4.66%. Table VI clearly shows that the final feature set, with behavioral and geometric features, achieves a more stable performance on all datasets. For cross-dataset analysis, we trained the model on FF++ and evaluated it on DFDC, DFD, and CelebDF. For cross-dataset analysis, the model achieves an AUC of 73.90%, 76.52% and 75.54% on DFDC, DFD and CelebDF respectively, as shown in Table VII. The addition of behavioral features improves 12.90%, 10.86% and 10.89% AUC on DFDC, DFD and CelebDF. The AUC gains for final combination of features on cross-dataset evaluations are 5.25%, 4.24% and 3.29%. Our proposed combination of features achieves the best generalization towards new types of deepfakes and expressions swaps.

TABLE VII: Effectiveness of different combination of features on cross-dataset evaluation. Model is trained on FF++ and tested on DFDC, DFD and CelebDF.

Features	DFDC	DFD	CelebDF
DIF	73.90	76.52	75.54
DIF+B	86.80	87.38	86.43
DIF+B+G	92.05	91.62	89.72

V Conclusion and Future Direction

To address the limitation of poor generalizability in deepfake detectors, this paper presents DBaGNet; a triplet loss based framework with improved generalization across different manipulation. First, a DBaG descriptor is designed that is composed of identity, behavior, and geometric cues for a holistic analysis of given media. The combination of DBaG includes deep features that cover the identity of the subject, behavior analysis using blend shape features, and facial geometry analysis based on distances among various facial parts. After extracting the DBaG, the obtained features are forwarded to the proposed DBaGNet for representation learning, followed by a triplet loss-based classifier where distance-based classification is performed. Extensive experimentation is carried out on six large-scale datasets including cross-manipulation and cross-dataset evaluation which demonstrate superiority of the proposed framework against state-of-the-art methods. In the future, our aim is to explore some biological cues such as emotions or gaze estimation to strengthen the DBaG descriptor.

References

[1] H. V. Zhao, W. S. Lin, and K. R. Liu, “Behavior modeling and forensics for multimedia social networks,” IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 118–139, 2009.
[2] P. Garrido, L. Valgaerts, O. Rehmsen, T. Thormaehlen, P. Perez, and C. Theobalt, “Automatic face reenactment,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, June 2014. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2014.537
[3] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt, “Real-time expression transfer for facial reenactment,” ACM Trans. Graph., vol. 34, no. 6, nov 2015. [Online]. Available: https://doi.org/10.1145/2816795.2818056
[4] S. Xiao, G. Lan, J. Yang, Y. Li, and J. Wen, “Securing the socio-cyber world: multiorder attribute node association classification for manipulated media,” IEEE Transactions on Computational Social Systems, no. 99, pp. 1–10, 2022.
[5] Y. Zhang, L. Zheng, and V. L. Thing, “Automated face swapping and its detection,” in 2017 IEEE 2nd international conference on signal and image processing (ICSIP). IEEE, 2017, pp. 15–19.
[6] U. A. Çiftçi, İ. Demir, and L. Yin, “Deepfake source detection in a heart beat,” The Visual Computer, vol. 40, no. 4, pp. 2733–2750, 2024.
[7] T. Jung, S. Kim, and K. Kim, “Deepvision: Deepfakes detection using human eye blinking pattern,” IEEE Access, vol. 8, pp. 83 144–83 154, 2020.
[8] Y. Li, M.-C. Chang, and S. Lyu, “In ictu oculi: Exposing ai generated fake face videos by detecting eye blinking,” arXiv preprint arXiv:1806.02877, 2018.
[9] S. Agarwal, H. Farid, T. El-Gaaly, and S.-N. Lim, “Detecting deep-fake videos from appearance and behavior,” in 2020 IEEE international workshop on information forensics and security (WIFS). IEEE, 2020, pp. 1–6.
[10] Y. Li, “Exposing deepfake videos by detecting face warping artif acts,” arXiv preprint arXiv:1811.00656, 2018.
[11] S. Fernandes, S. Raj, E. Ortiz, I. Vintila, M. Salter, G. Urosevic, and S. Jha, “Predicting heart rate variations of deepfake videos using neural ode,” in Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019, pp. 0–0.
[12] J. Yang, S. Xiao, A. Li, W. Lu, X. Gao, and Y. Li, “Msta-net: Forgery detection by generating manipulation trace based on multi-scale self-texture attention,” IEEE transactions on circuits and systems for video technology, vol. 32, no. 7, pp. 4854–4866, 2021.
[13] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, “Recurrent convolutional strategies for face manipulation detection in videos,” Interfaces (GUI), vol. 3, no. 1, pp. 80–87, 2019.
[14] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “Faceforensics++: Learning to detect manipulated facial images,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1–11.
[15] X. Yang, Y. Li, and S. Lyu, “Exposing deep fakes using inconsistent head poses,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8261–8265.
[16] F. Matern, C. Riess, and M. Stamminger, “Exploiting visual artifacts to expose deepfakes and face manipulations,” in 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW). IEEE, 2019, pp. 83–92.
[17] S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, and H. Li, “Protecting world leaders against deep fakes.” in CVPR workshops, vol. 1, 2019, p. 38.
[18] S. McCloskey and M. Albright, “Detecting gan-generated imagery using saturation cues,” in 2019 IEEE international conference on image processing (ICIP). IEEE, 2019, pp. 4584–4588.
[19] H. Qi, Q. Guo, F. Juefei-Xu, X. Xie, L. Ma, W. Feng, Y. Liu, and J. Zhao, “Deeprhythm: Exposing deepfakes with attentional visual heartbeat rhythms,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 4318–4327.
[20] D. Alvarez-Melis and T. S. Jaakkola, “Towards robust interpretability with self-explaining neural networks,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18. Red Hook, NY, USA: Curran Associates Inc., 2018, p. 7786–7795.
[21] B. Kim, C. Rudin, and J. Shah, “The bayesian case model: a generative approach for case-based reasoning and prototype classification,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, p. 1952–1960.
[22] C. Chen, O. Li, C. Tao, A. J. Barnett, J. Su, and C. Rudin, This looks like that: deep learning for interpretable image recognition. Red Hook, NY, USA: Curran Associates Inc., 2019.
[23] S. Lee and S. Hong, “D-tsm: Discriminative temporal shift module for action recognition,” in 2023 20th International Conference on Ubiquitous Robots (UR). IEEE, 2023, pp. 133–136.
[24] W. Potok, A. Post, V. Beliaeva, M. Bächinger, A. M. Cassarà, E. Neufeld, R. Polania, D. Kiper, and N. Wenderoth, “Modulation of visual contrast sensitivity with trns across the visual system, evidence from stimulation and simulation,” eneuro, vol. 10, no. 6, 2023.
[25] M. A. Raza, K. M. Malik, and I. U. Haq, “Holisticdfd: Infusing spatiotemporal transformer embeddings for deepfake detection,” Information Sciences, vol. 645, p. 119352, 2023.
[26] Z. Zhao, P. Wang, and W. Lu, “Detecting deepfake video by learning two-level features with two-stream convolutional neural network,” in Proceedings of the 2020 6th International Conference on Computing and Artificial Intelligence, 2020, pp. 291–297.
[27] S. Bibi, A. Abbasi, I. U. Haq, S. W. Baik, and A. Ullah, “Digital image forgery detection using deep autoencoder and cnn features,” Hum. Cent. Comput. Inf. Sci, vol. 11, pp. 1–17, 2021.
[28] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, “Multi-task learning for detecting and segmenting manipulated facial images and videos,” in 2019 IEEE 10th international conference on biometrics theory, applications and systems (BTAS). IEEE, 2019, pp. 1–8.
[29] M. S. Saealal, M. Z. Ibrahim, D. J. Mulvaney, M. I. Shapiai, and N. Fadilah, “Using cascade cnn-lstm-fcns to identify ai-altered video based on eye state sequence,” PLoS One, vol. 17, no. 12, p. e0278989, 2022.
[30] S. Tariq, S. Lee, H. Kim, Y. Shin, and S. S. Woo, “Gan is a friend or foe? a framework to detect various fake face images,” in Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, 2019, pp. 1296–1303.
[31] A. L. Pellicer, Y. Li, and P. Angelov, “Pudd: Towards robust multi-modal prototype-based deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3809–3817.
[32] A. Agarwal and N. Ratha, “Deepfake catcher: Can a simple fusion be effective and outperform complex dnns?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3791–3801.
[33] Y. Li and S. Lyu, “Exposing deepfake videos by detecting face warping artifacts,” in CVPR Workshops, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:53298495
[34] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, “Face x-ray for more general face forgery detection,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5000–5009, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:209516424
[35] H. Liu, X. Li, W. Zhou, Y. Chen, Y. He, H. Xue, W. Zhang, and N. Yu, “Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 772–781, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:232092167
[36] A. Haliassos, K. Vougioukas, S. Petridis, and M. Pantic, “Lips don’t lie: A generalisable and robust approach to face forgery detection,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5037–5047, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:229156239
[37] Y. Luo, Y. Zhang, J. Yan, and W. Liu, “Generalizing face forgery detection with high-frequency features,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16 312–16 321, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:232320599
[38] Z. Yan, Y. Luo, S. Lyu, Q. Liu, and B. Wu, “Transcending forgery specificity with latent space augmentation for generalizable deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8984–8994.
[39] Y. Lai, Z. Yu, J. Yang, B. Li, X. Kang, and L. Shen, “Gm-df: Generalized multi-scenario deepfake detection,” arXiv preprint arXiv:2406.20078, 2024.
[40] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE signal processing letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[41] T. Menzel, M. Botsch, and M. E. Latoschik, “Automated blendshape personalization for faithful face animations using commodity smartphones,” in Proceedings of the 28th ACM Symposium on Virtual Reality Software and Technology, 2022, pp. 1–9.
[42] Google, “Mediapipe face landmarker,” https://storage.googleapis.com/mediapipe-assets/Model%20Card%20MediaPipe%20Face%20Mesh%20V2.pdf, 2024.
[43] G. B. Meisner, The golden ratio: The divine beauty of mathematics. Race Point Publishing, 2018.
[44] M. Kim, A. K. Jain, and X. Liu, “Adaface: Quality adaptive margin for face recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 750–18 759.
[45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[46] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[47] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer, “The deepfake detection challenge (dfdc) dataset,” arXiv preprint arXiv:2006.07397, 2020.
[48] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, “Celeb-df: A large-scale challenging dataset for deepfake forensics,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[49] N. Dufour and A. Gully, “Contributing data to deepfake detection research,” https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html, 2019.
[50] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “Faceforensics++: Learning to detect manipulated facial images,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1–11.
[51] E. Prashnani, K. Nagano, S. D. Mello, D. Luebke, and O. Gallo, “Avatar fingerprinting for authorized use of synthetic talking-head videos,” arXiv preprint arXiv:2305.03713, 2023.
[52] S. Agarwal, H. Farid, T. El-Gaaly, and S.-N. Lim, “Detecting deep-fake videos from appearance and behavior,” in 2020 IEEE international workshop on information forensics and security (WIFS). IEEE, 2020, pp. 1–6.
[53] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, “Two-stream neural networks for tampered face detection,” in 2017 IEEE conference on computer vision and pattern recognition workshops (CVPRW). IEEE, 2017, pp. 1831–1839.
[54] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “Mesonet: a compact facial video forgery detection network,” in 2018 IEEE international workshop on information forensics and security (WIFS). IEEE, 2018, pp. 1–7.
[55] H. Zhao, W. Zhou, D. Chen, T. Wei, W. Zhang, and N. Yu, “Multi-attentional deepfake detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2185–2194.
[56] L. Chen, Y. Zhang, Y. Song, L. Liu, and J. Wang, “Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 710–18 719.
[57] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, “Face x-ray for more general face forgery detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5001–5010.
[58] T. Zhao, X. Xu, M. Xu, H. Ding, Y. Xiong, and W. Xia, “Learning self-consistency for deepfake detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 15 023–15 033.
[59] K. Shiohara and T. Yamasaki, “Detecting deepfakes with self-blended images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 720–18 729.
[60] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning. PMLR, 2019, pp. 6105–6114.
[61] Y. Luo, Y. Zhang, J. Yan, and W. Liu, “Generalizing face forgery detection with high-frequency features,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 16 317–16 326.
[62] K. Sun, T. Yao, S. Chen, S. Ding, J. Li, and R. Ji, “Dual contrastive learning for general face forgery detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 2316–2324.
[63] B. Huang, Z. Wang, J. Yang, J. Ai, Q. Zou, Q. Wang, and D. Ye, “Implicit identity driven deepfake face swapping detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4490–4499.
[64] D. Li, Y. Yang, Y.-Z. Song, and T. Hospedales, “Learning to generalize: Meta-learning for domain generalization,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
[65] J. Wei, S. Wang, and Q. Huang, “F³net: fusion, feedback and focus for salient object detection,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 12 321–12 328.
[66] K. Sun, H. Liu, Q. Ye, Y. Gao, J. Liu, L. Shao, and R. Ji, “Domain general face forgery detection by learning to weight,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 3, 2021, pp. 2638–2646.
[67] S. Chen, T. Yao, Y. Chen, S. Ding, J. Li, and R. Ji, “Local relation learning for face forgery detection,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 2, 2021, pp. 1081–1088.
[68] W. Zhuang, Q. Chu, Z. Tan, Q. Liu, H. Yuan, C. Miao, Z. Luo, and N. Yu, “Uia-vit: Unsupervised inconsistency-aware method based on vision transformer for face forgery detection,” in European conference on computer vision. Springer, 2022, pp. 391–407.