Detecting Localized Deepfake Manipulations Using Action Unit-Guided Video Representations

Tharun Anand Siva Sankar Pravin Nair
Indian Institute of Technology Madras
[email protected], [email protected], [email protected]

Abstract

With rapid advancements in generative modeling, deepfake techniques are increasingly narrowing the gap between real and synthetic videos, raising serious privacy and security concerns. Beyond traditional face swapping and reenactment, an emerging trend in recent state-of-the-art deepfake generation methods (2021-2024) involves localized edits, such as subtle manipulations of specific facial features like raising eyebrows, altering eye shapes, or modifying mouth expressions. These fine-grained manipulations pose a significant challenge for existing detection models, which struggle to capture such localized variations. To the best of our knowledge, this work presents the first detection approach explicitly designed to generalize to localized edits in deepfake videos by leveraging spatiotemporal representations guided by facial action units. Our method uses a cross-attention-based fusion of representations learned from pretext tasks, namely random masking and action unit detection, to create an embedding that effectively encodes subtle, localized changes. Comprehensive evaluations across multiple deepfake generation methods demonstrate that our approach, despite being trained solely on the traditional FF++ dataset, sets a new benchmark in detecting recent deepfake-generated videos with fine-grained local edits, achieving a $20\%$ improvement in accuracy over current state-of-the-art detection methods. Additionally, our method delivers competitive performance on standard datasets, highlighting its robustness and generalization across diverse types of local and global forgeries.

1 Introduction

Deepfake generation techniques [71, 53] are advancing rapidly, driven by sophisticated editing capabilities enabled by state-of-the-art generative models. In particular, foundational technologies, such as Generative Adversarial Networks (GANs) [21] and diffusion models [27] have significantly enhanced the quality of deepfake videos, facilitating diverse manipulation techniques like face swapping (exchanging identities between individuals) [61, 25], face reenactment (transferring movements and poses from one face to another) [29, 8], and localized edits (modifying specific facial attributes to change expressions or context) [70, 66, 36]. While these technologies have beneficial applications, their unethical use raises serious privacy concerns over the spread of falsified media content [51, 33]. The growing prevalence of such videos underscores the critical need for robust and generalizable detection methods to protect against malicious uses of deepfake technology.

Figure 1: Locally Edited Deepfakes Detection: A real video is manipulated to produce fake videos with subtle hard-to-detect edits - raised eyebrows, gender modification, expression change to disgust (single frame shown for illustration). Our method achieves significantly higher probability scores over top methods, effectively detecting these fine-grained edits with high confidence.

Although existing detection methods have shown reasonable success in identifying deepfakes involving face swaps and reenactments, they fall short when it comes to detecting localized manipulations of specific facial attributes. Unlike traditional deepfakes that rely on full-face alterations, these emerging techniques focus on nuanced edits, such as subtle changes to the shape of the mouth, nose, or eyes, or adjustments to micro-expressions like smiles or frowns. Such targeted manipulations produce synthetic content that is nearly indistinguishable from authentic media, posing a formidable challenge for current detection models, which are typically optimized for detecting broader, more apparent alterations. This shift toward high-fidelity, localized edits introduces a significant gap in current detection capabilities, leaving existing methods vulnerable to these next-generation deepfakes. Refer to Fig. 1 for visual examples of localized manipulations.

Figure 2: Proposed Method: The input video is processed using a face detection algorithm to extract equally spaced face-centered frames. These frames are divided into $N$ tubular patches, which are fed into a novel encoder, obtained by fusing latent representations obtained from pretrained pretext tasks, to generate latent vector $\mathbf{X}_{E}$ . The encoded latent vector $\mathbf{X}_{E}$ is then passed through a classification head to detect the video as real or fake.

Addressing this challenge requires a unified detection framework capable of accurately identifying not only face swaps and reenactments, but also fine-grained, localized manipulations. To meet this need, we propose a generalization-focused detection framework trained solely on the widely used FaceForensics++ (FF++) dataset [58]. Although FF++ primarily includes videos generated by earlier deepfake methods focused on face swapping and reenactment, our method demonstrates the ability to generalize from this dataset to detect advanced, high-fidelity localized edits. This generalization capability hinges on learning optimal spatio-temporal representations that capture subtle, frame-level variations, enabling consistent detection across a wide range of manipulation types.

Our approach leverages carefully designed pretext tasks to drive representation learning. Specifically, we introduce a novel action unit-guided spatio-temporal framework that combines Facial Action Units (AUs) detection and masking-based pretext tasks to learn robust neural representations. Defined by the Facial Action Coding System (FACS) [18], AUs collectively describe a wide range of expressions and micro-expressions, where each AU corresponds to a unique movement, such as eyebrow raises, eyelid movements, lip pulls, etc. By using AU-based features to guide video representations learned through Masked Autoencoders (MAE) [65], our method effectively captures localized changes crucial for detecting subtle facial edits. This approach enables us to construct a unified latent representation that encodes both localized edits and broader alterations in face-centered videos, providing a comprehensive and adaptable solution for deepfake detection.

In this regard, our key contributions are as follows:

1. We propose a novel framework for learning robust spatio-temporal representations of videos, guided by Facial Action Unit (AU) embeddings. Through a cross-attention mechanism, our approach fuses frame-level features with AU-derived embeddings, capturing subtle, localized manipulations with high sensitivity.
2. Trained solely on the standard FF++ deepfake dataset, our model generalizes strongly to advanced deepfake techniques that involve high-fidelity, localized edits, achieving a $20\%$ improvement in detection accuracy over the latest state-of-the-art deepfake detection methods.
3. To the best of our knowledge, this is the first approach specifically designed to detect localized edits in deepfake videos. Our AU-driven representation learning is not only effective for localized edits but also competitive on traditional deepfake datasets, offering a scalable, future-proof solution for diverse deepfake challenges.

2 Related Work

Early deepfake detection methods focussed on image-level detectors that identify spatial artifacts within individual frames [1, 7, 9, 19, 28, 56]. These methods primarily employed variations of Convolutional Neural Networks (CNNs) [38], with networks like EfficientNet [62] and XceptionNet [4] becoming standard baselines due to their effectiveness. To enhance detection, researchers introduced techniques that operate in the frequency domain, detecting subtle artifacts often missed in the RGB domain [31, 32, 46, 55]. Additional approaches targeted face blending artifacts [10, 39] and artifacts due to resolution discrepancies between source and target video [41], further improving frame-level detection. However, these methods were limited by their inability to capture temporal inconsistencies present in deepfake videos.

To address this limitation, video-level detectors were developed, leveraging temporal information across multiple frames [3, 12, 13, 23, 42, 48, 59]. CNN-based recurrent models [59] incorporated recurrent units after CNNs to capture temporal dynamics, while other methods directly learned spatiotemporal features [3, 13, 72]. Domain-specific insights, such as facial action units [5, 2], lip motion [23], and identity inconsistency [17, 12], further improved detection models by exploiting unnatural temporal variation in deepfake videos. Recently, advancements in deepfake detection involve Vision Transformers (ViT) [17, 11, 35, 5], which focus on spatio-temporal regions within frames to detect low-level perturbations in manipulated videos.

With the rising threat of deepfakes in recent years, current research efforts focus on developing generalizable deepfake detection methods that capture facial cues robustly across diverse forgery patterns [50, 61, 69, 44, 73]. Most of the recent methods are fairly generalizable in detecting global manipulations like face swapping [61, 25], face reenactment [29, 8] etc. However, with recent advances in deepfake generation, a new challenging form of deepfakes has emerged where a person’s facial feature or expression can be minutely changed, which can alter the context of the video [70, 66, 36]. To the best of our knowledge, prior to our work, the generalizability of current state-of-the-art detection methods to recent deepfake generation techniques involving localized manipulations [70, 66, 36], has remained unexplored.

3 Proposed Method

The proposed method is illustrated in Fig. 2. Given a video, we first perform face detection and then sample equally spaced face-centered frames. Following the tokenization approach in Video Masked Autoencoders (VideoMAE) [65], this set of frames is then divided into $N$ tubular (3D) blocks, denoted by $\{\mathbf{X}_{\text{in}}(i)\}_{i=1}^{N}$ , each of size $T\times P\times P$ , where every token captures localized spatial and temporal information. The tokenization process is refined using a patch embedding layer, followed by the addition of positional encodings to preserve spatial-temporal relationships.

The complete token set is then passed through the proposed encoder, producing a latent representation $\mathbf{X}_{\text{E}}$ of dimensions $N\times D$ , where $D$ denotes the token dimensionality across all tubular blocks. This latent representation is then processed by a classification head to determine if the video is real or fake. Although the framework is compatible with various encoder architectures, achieving a truly generalizable latent representation that can robustly distinguish between real and synthetic content, even in the case of localized manipulations, remains a significant challenge.

To address this, we next introduce the design of our novel encoder, tailored specifically to generate rich, discriminative features optimized for detecting deepfake videos with localized edits. Our encoder construction is based on learning robust latent embeddings through two complementary pretext tasks, followed by an effective cross-attention based fusion of these embeddings. Next, we provide a comprehensive analysis of the designed pretext tasks and the architectural formulation of the proposed encoder.

3.1 Learning Representations using Pretext Tasks

The overall procedure for training our pretext tasks is illustrated in Fig. 3. The tubular tokens $\{\mathbf{X}_{\text{in}}(i)\}_{i=1}^{N}$ are initially projected to form learnable embeddings, subsequently reshaping them into token embeddings of size ${N\times D}$ . We apply a structured masking strategy to occlude a fraction of tubular tokens, ensuring that only $M$ tokens remain visible. This process partitions the latent representation into a visible subset of dimension ${M\times D}$ and a masked subset of dimension ${(N-M)\times D}$ . Such a masking strategy enhances the difficulty of the reconstruction task, encouraging the encoder to learn robust and detailed spatiotemporal structures by focusing on limited visible information.
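The partition into visible and masked tokens described above can be sketched as follows. This is a minimal illustration that assumes uniform random masking over token indices; the function name `split_tokens`, the seed, and the example values ($N=1568$ tokens and a $50\%$ mask ratio, taken from Section 3.3) are used for illustration only, and the structured masking strategy used in practice may differ.

```python
import random


def split_tokens(num_tokens, mask_ratio, seed=0):
    """Return (visible, masked) index lists for a given mask ratio.

    Assumption: tokens are masked uniformly at random. This only
    illustrates the M / (N - M) partition, not the exact strategy.
    """
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    num_visible = num_tokens - int(num_tokens * mask_ratio)  # M
    visible = sorted(indices[:num_visible])   # M token indices kept
    masked = sorted(indices[num_visible:])    # N - M token indices hidden
    return visible, masked


visible, masked = split_tokens(num_tokens=1568, mask_ratio=0.5)
print(len(visible), len(masked))  # 784 784
```

Only the visible indices would be passed to the encoder; the masked positions are filled with learnable placeholder tokens at the decoder.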

For our pretext tasks, we adopt the encoder-decoder architecture from VideoMAE [65]. As a preprocessing step, the input video is first tokenized using a patch embedding layer, followed by the addition of positional encodings to retain spatial-temporal information. We choose to mask the tubular tokens by masking the corresponding positionally encoded tokens. Then after masking, the visible tokens are processed by an encoder composed of $L$ multi-head self-attention blocks. The resulting latent embeddings are then fed into a decoder, which reconstructs the task-specific target by leveraging the partially visible information from the encoder and the placeholder learnable tokens for the masked regions (indicated in gray). The decoder architecture mirrors the encoder, preserving input dimensions throughout the process. Both encoder and decoder weights, as well as the projections for embeddings, are optimized using task-specific loss functions to ensure robust and detailed representation learning.

Unmasked Video Frame Reconstruction: The first pretext task involves reconstructing masked regions within input video frames, allowing the model to interpolate masked areas by leveraging dependencies within frames. This technique was initially proposed for representation learning in images [52] and eventually extended to videos [26, 68, 65]. By training the model to reconstruct the masked areas, we enable the model to learn rich spatio-temporal relationships and develop an implicit understanding of scene dynamics.

The reconstruction target is the original video frames, with the decoder producing a sequence of face-centered frames that closely approximates the input. This video frame reconstruction task enables the encoder, denoted as Video Frame Encoder (VFE), to produce representations that represent the visible tokens, focusing on reconstructing meaningful spatiotemporal features. Although these latent embeddings can be directly applied for the downstream task of deepfake detection, our results in Table 5 indicate that these learned features struggle to consistently differentiate between real and manipulated content, especially in the case of localized manipulations. Consequently, it becomes necessary to refine the representations learned by VFE to address the generalization challenge in deepfake detection.

AU Detection: To enhance the representations learned from the masking, we introduce another pretext task centered on local AU detection. Leveraging an encoder-decoder structure similar to the masking model architecture, this pretext task is designed to learn fine-grained local facial dynamics, capturing subtle expressions and muscle movements. To construct latent representations for AU detection, we follow a procedure similar to the random masking pretext task, with the only change being that the decoder is designed to produce the AU detection output as its reconstruction target. In particular, for a frame of dimensions $3\times H\times W$ and a chosen set of $F$ AUs, the reconstruction target is an $F\times 3\times H\times W$ label map, where each channel corresponds to a detected AU, capturing finer facial cues for robust feature extraction.

The resulting AU-based latent representations from the encoder, denoted as AU Encoder (AUE), serve as a complementary set of features to the video representation features obtained from the masking pretext task, which we will use to enrich the spatio-temporal embeddings learned from masking. These latent representations for AU detection can also be applied directly for downstream deepfake detection. However, similar to the latent embeddings from masking, i.e., the representations output by VFE, our empirical analysis indicates that the AUE’s latent representations lack sufficient generalization across diverse, locally edited deepfake videos, as demonstrated in Table 5 in Section 4. In the following section, we use these AU-based latent representations as conditioning vectors to significantly enhance the robustness and generalization of the latent representations obtained through masking.

3.2 Fused Latent Representation

We next discuss the construction of the final fused encoder. The tubular tokens $\{\mathbf{X}_{\text{in}}(i)\}_{i=1}^{N}$ are processed through patch embedding layers corresponding to both our pretrained encoders, VFE and AUE. Let the resulting two latent vectors be denoted $\mathbf{X}_{1}$ and $\mathbf{X}_{2}$ , corresponding to masking and AU detection pretext tasks respectively.

As shown in Fig. 2, in the fused encoder, the multi-head attention blocks from layers $2$ to $L$ in VFE are conditioned using cross-attention, i.e., the query embeddings are derived from the corresponding layers of AUE. Specifically, let the output of the $i$-th layer in VFE be denoted by $\mathbf{X}_{1}(i)$ , and that of AUE be denoted by $\mathbf{X}_{2}(i)$ . Then, each block’s output in VFE is computed as,

$\mathbf{X}_{1}(i)=\text{Concat}(\text{head}_{1},\text{head}_{2},\dots,\text{head}_{H})\mathbf{W}_{O},$

where each attention head is defined as,

$\text{head}_{h}=\text{softmax}\left(\frac{\mathbf{X}_{2}(i-1)\mathbf{W}_{Q}^{h}(\mathbf{X}_{1}(i-1)\mathbf{W}_{K}^{h})^{T}}{\sqrt{d}}\right)\mathbf{X}_{1}(i-1)\mathbf{W}_{V}^{h},$

with weight matrices $\mathbf{W}_{Q}^{h},\mathbf{W}_{K}^{h},\mathbf{W}_{V}^{h}\in\mathbb{R}^{D\times d}$ , $\mathbf{W}_{O}\in\mathbb{R}^{D\times D}$ , $d=D/H$ , and $H$ denoting the number of attention heads. This fused cross-attention structure enables the final output latent vector $\mathbf{X}_{1}(L)\in\mathbb{R}^{N\times D}$ to serve as the fused representation $\mathbf{X}_{E}$ from the encoder.
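To make the head definition concrete, the following is a minimal sketch of a single cross-attention head in plain Python, with queries taken from the AUE features $\mathbf{X}_{2}(i-1)$ and keys and values taken from the VFE features $\mathbf{X}_{1}(i-1)$. The helper names (`matmul`, `softmax_rows`, `cross_attention_head`) and the toy sizes ($N=3$, $D=4$, $d=2$) are illustrative assumptions; the concatenation over heads and the output projection $\mathbf{W}_{O}$ are omitted, and a practical implementation would use a tensor library.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention_head(x1, x2, w_q, w_k, w_v):
    """One head: queries from AUE features x2, keys/values from VFE features x1."""
    q = matmul(x2, w_q)  # N x d
    k = matmul(x1, w_k)  # N x d
    v = matmul(x1, w_v)  # N x d
    d = len(w_q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)  # N x d


# Toy example: N = 3 tokens, D = 4, d = 2 (all values are illustrative).
x1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
x2 = [[0.5, 0.1, 0.0, 0.2], [0.0, 0.3, 0.4, 0.1], [0.2, 0.2, 0.2, 0.2]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
head = cross_attention_head(x1, x2, w, w, w)
print(len(head), len(head[0]))  # 3 2
```

Because the attention weights are computed from AUE queries but applied to VFE values, each output token is a convex combination of VFE value vectors, weighted by how strongly the AU features attend to each location.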

3.3 Implementation Details

Preprocessing and Data Preparation: We preprocess input videos using FaceXZoo [67] for face detection, thereby extracting $16$ face-centered frames per video. This results in a sequence of face-centered frames of dimensions $16\times 3\times 224\times 224$ . Each sequence is then partitioned into 3D spatiotemporal blocks of size $2\times 16\times 16$ , where $T=2$ denotes the temporal dimension and $P=16$ represents the spatial patch size, resulting in $N=1568$ tokens per video.
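The token count follows directly from the frame and block sizes. As a quick check, the short sketch below (the function name `count_tubular_tokens` is hypothetical) reproduces $N=1568$ from the values stated above.

```python
def count_tubular_tokens(frames, height, width, t, p):
    """Number of non-overlapping T x P x P blocks in a clip."""
    if frames % t or height % p or width % p:
        raise ValueError("clip dimensions must be divisible by the block size")
    return (frames // t) * (height // p) * (width // p)


# 16 frames of 224 x 224 pixels, blocks of size 2 x 16 x 16.
n_tokens = count_tubular_tokens(frames=16, height=224, width=224, t=2, p=16)
print(n_tokens)  # 1568 = 8 temporal x 14 x 14 spatial positions
```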

Pretext Task Training: We train the pretext tasks on the CelebV-HQ dataset [74], which consists of $35,000$ high-quality facial videos. Following VideoMAE [65], we employ a $50\%$ random masking strategy during training to learn robust features. The encoder and decoder architectures utilize $L=11$ layers of multi-head attention, with a fixed token embedding dimension of $D=768$ .

For the masked frame reconstruction pretext task, we adopt an $\ell_{1}$ reconstruction loss, minimizing the difference between ground-truth frames and their reconstructions from masked inputs. For the Action Unit (AU) detection pretext task, we define a set of $F=16$ AUs, capturing fine-grained facial movements across key regions such as the eyebrows, eyelids, nose, cheeks, lips, and dimples. We obtain the required ground-truth AU annotations for supervised learning of AU detection using state-of-the-art AU detection techniques [30]. The AU detection model produces a structured output of dimensions $16\times 3\times 224\times 224$ , where each channel encodes an individual AU. Our selection of AUs is validated through extensive ablation studies in Section 4. Similar to masked frame reconstruction task, we adopt an $\ell_{1}$ loss for AU detection, minimizing the difference between ground-truth AU representations and our model output.

Fine-tuning for Deepfake Detection: For downstream deepfake detection, we fuse the pretrained encoders into a unified representation, as shown in Fig. 2, and integrate a classifier. The finetuning is performed on the FaceForensics++ (FF++) dataset [58], which contains $700$ real and $3,600$ deepfake videos generated using state-of-the-art face swapping and reenactment techniques [37, 63, 14, 64]. To address class imbalance, we use Focal Loss, a variant of cross-entropy loss [57], ensuring improved learning from hard-to-classify samples.
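For reference, a binary focal loss for a single video can be sketched as below. The focusing parameter `gamma=2.0` and class weight `alpha=0.25` are commonly used defaults and are assumptions here, not values reported in this paper; the function name is likewise illustrative.

```python
import math


def binary_focal_loss(prob_fake, label, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one sample; label is 1 for fake and 0 for real.

    gamma and alpha are assumed defaults, not values from the paper.
    """
    p = min(max(prob_fake, eps), 1.0 - eps)           # avoid log(0)
    p_t = p if label == 1 else 1.0 - p                 # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha     # class-dependent weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(prob_fake=0.9, label=1)  # confidently correct
hard = binary_focal_loss(prob_fake=0.1, label=1)  # confidently wrong
print(easy < hard)  # True
```

With `gamma=0` the modulating factor disappears and the expression reduces to a class-weighted cross-entropy, which is why focal loss is described as a variant of cross-entropy; larger `gamma` shifts the training signal toward hard-to-classify samples.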

Training Setup: Pretext task training and deepfake detection training are conducted separately on a single NVIDIA RTX $4090$ GPU. Additional details on training hyperparameters and setup are provided in Appendix A.

4 Experiments

In this section, we present a comprehensive evaluation of our proposed method, including both quantitative and qualitative comparisons against state-of-the-art deepfake detection methods for which source code is available: FTCN [72], RealForensics [24], Lip Forensics [23], EfficientNet+ViT [11], Face X-Ray [39], AltFreezing [69], CADMM [16], LAANet [50], and SBI [61]. We begin with a detailed overview of the deepfake videos used for evaluation, incorporating both state-of-the-art deepfake generation techniques with localized manipulations and traditional deepfake datasets. Then we provide an in-depth discussion of the experimental results, including self-analysis of the proposed method. We mainly highlight our model’s superior detection capabilities on localized edits while maintaining strong generalization across diverse manipulations. Additional experiments are provided in Appendix D.

Latest Locally edited Deepfake methods: We focus on locally edited deepfakes created using the latest state-of-the-art deepfake methods: Diffusion Video Autoencoders (DVA) [36], Stitch It In Time (STIT) [66], Disentangled Face Editing (DFE) [70], Tokenflow [20], VideoP2P [45], TextLive [6], and Fatezero [54]. DVA employs diffusion-based video editing to perform precise, text-guided alterations to the facial features in the diffusion feature space. In contrast, STIT and DFE exploit the rich latent space of StyleGAN for targeted facial modifications. Tokenflow, VideoP2P, Fatezero, and TextLive adapt pre-trained text-to-image diffusion models for video editing while maintaining temporal coherence to allow text-guided edits. Refer to Figures 1 and 8 for examples of fake video frames that are visually indistinguishable from real ones, highlighting the superior quality of these synthetic manipulations.

To ensure comprehensive coverage of different facial manipulations, we incorporated a wide variety of facial features and attribute edits. For facial feature editing, we modified eye size, eye-eyebrow distance, nose ratio, nose-mouth distance, lip ratio, and cheek ratio. For facial attribute editing, we varied expressions such as smile, anger, disgust, and sadness. This diversity is essential for validating the robustness of our model over a wide range of localized edits. In total, we generated $50$ videos for each of the above-mentioned editing methods and validated our method’s strong generalization for deepfake detection. Further details on the editing parameters for these deepfake generation methods are elaborated in Appendix C.

Face swap and Face reenactment Methods: For completeness, we also conduct evaluations on standard deepfake datasets, Celeb-DFv2 (CDF2) [43], DeepFake Detection (DFD) [22], DeepFake Detection Challenge (DFDC) [15], and WildDeepfake (DFW) [75]. Note that these datasets consist of global manipulations such as identity swaps and face reenactments. Additionally, the deepfake generation models used in the datasets are no longer state-of-the-art.

Evaluation Metrics: To evaluate our method against existing benchmarks and datasets, we use standard evaluation metrics in deepfake detection: Area Under the Curve (AUC), which measures the discriminative capability of the model, and Average Precision (AP), which summarizes the precision-recall trade-off at the video level. More metric evaluations using Average Recall (AR) and mean F1-score (mF) are shown in Appendix D.
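As an illustration of how the two main metrics are computed from video-level scores, the sketch below implements AUC through pairwise ranking of fake versus real scores and AP as the mean precision at the rank of each fake video. The function names and the five example videos are made up for illustration; in practice an established library implementation would typically be used, and ties in AP are not given special treatment here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (fake, real) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive sample")
    return total / hits


# Hypothetical video-level scores (1 = fake, 0 = real).
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
print(round(roc_auc(labels, scores), 4))            # 0.8333
print(round(average_precision(labels, scores), 4))  # 0.9167
```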

Detection Methods	DVA		STIT		DFE		Tokenflow		VideoP2P		TextLive		Fatezero
	CVPR’23		SIGGRAPH’22		ICCV’21		ICLR’23		CVPR’24		ECCV’22		ICCV’23
	AUC	AP	AUC	AP	AUC	AP	AUC	AP	AUC	AP	AUC	AP	AUC	AP
FTCN	27.1	30.1	34.8	37.0	33.5	35.8	29.5	32.1	31.0	33.7	30.2	32.9	28.7	31.3
RealForensics	37.5	40.3	46.9	49.8	45.6	48.6	41.2	44.2	42.8	45.8	42.0	45.1	40.5	43.4
Lip Forensics	33.8	37.1	42.0	45.8	40.9	44.6	36.5	40.3	38.0	41.9	37.3	41.2	35.8	39.5
EfficientNet+ViT	36.2	38.4	44.7	47.1	43.5	45.9	39.0	41.5	40.8	43.2	40.1	42.5	38.6	40.8
Face X-Ray	33.4	36.1	41.1	43.6	40.3	42.9	36.0	39.1	37.5	40.7	36.8	40.0	35.2	38.2
LAA Net	61.5	58.0	72.5	69.3	71.2	68.1	66.4	62.9	68.0	64.5	67.1	63.7	65.7	61.8
SBI	65.2	62.8	75.5	73.2	73.3	71.5	69.0	66.4	70.8	68.1	70.2	67.5	68.6	65.9
AltFreezing	41.1	44.0	51.8	53.6	51.2	52.8	45.6	47.1	47.5	48.9	46.9	48.3	45.0	46.5
CADMM	44.5	47.2	55.6	57.4	54.3	56.2	49.0	50.9	50.5	52.3	49.9	51.7	48.2	49.9
Ours	87.2	85.8	92.5	90.7	93.1	91.6	91.7	89.4	90.3	90.2	89.1	87.9	88.5	86.0

Table 1: Latest Locally edited Deepfake detection comparison: The proposed method demonstrates superior performance in detecting fake videos with latest deepfake generation methods, achieving at least $15$-$20\%$ increase in AUC and AP scores over the second-best method. Bold text indicates the best results, and underlined text indicates the second-best.

Method	CDF2		DFD		DFW		DFDC
	AUC	AP	AUC	AP	AUC	AP	AUC	AP
FTCN [72]	$86.9$	$86.0$	$94.4$	$90.33$	$64.73$	$65.5$	$86.0$	$87.48$
RealForensics [24]	$85.6$	$85.2$	$82.24$	$84.62$	$66.72$	$66.5$	$89.7$	$88.46$
Lip Forensics [23]	$82.4$	$82.66$	$90.2$	$89.37$	$62.3$	$60.75$	$92.53$	$93.41$
EfficientNet+ViT [11]	$79.0$	$75.61$	$87.0$	$88.09$	$72.0$	$68.74$	$91.0$	$85.12$
Face X-Ray [39]	$79.5$	$80.41$	$95.4$	$94.7$	$61.5$	$60.94$	$85.27$	$85.0$
LAANet [50]	$\mathbf{95.4}$	$\mathbf{97.64}$	$\mathbf{99.5}$	$\mathbf{99.8}$	$\underline{87.6}$	$85.08$	$86.94$	$\mathbf{97.7}$
SBI [61]	$93.18$	$85.16$	$97.56$	$92.79$	$84.83$	$\mathbf{88.37}$	$86.16$	$\underline{93.24}$
AltFreezing [69]	$89.5$	$88.46$	$\underline{98.50}$	$93.17$	$72.6$	$72.0$	$\mathbf{94.0}$	$88.11$
CADMM [16]	$93.0$	$91.12$	$99.03$	$\underline{99.59}$	$75.0$	$79.42$	$88.3$	$89.71$
Ours	$\underline{93.84}$	$\underline{95.27}$	$97.15$	$95.28$	$\mathbf{91.0}$	$\underline{88.25}$	$\underline{93.0}$	$91.93$

Table 2: Traditional deepfake detection comparison: AUC and AP scores are shown for traditional deepfake datasets. Our approach remains competitive with SOTA methods, underscoring its robustness and adaptability across diverse manipulations.

4.1 Cross-Dataset Evaluation

We first evaluate our model’s generalization capability in a cross-dataset setting using the latest deepfake generation methods [36, 66, 70, 20, 45, 54, 6], which involve local manipulations. As shown in Table 1, the existing SOTA detection methods, LAANet [50], SBI [61], AltFreezing [69] and CADMM [16], experience a significant drop in performance on the latest deepfake generation methods. The current SOTA methods exhibit AUCs as low as $48$-$71\%$ , demonstrating their poor generalization capabilities to the recent deepfakes. On the other hand, our method demonstrates robust generalization, achieving an AUC in the range $87$-$93\%$ . A similar trend is noticeable in the case of average precision as well. As shown in Table 2, our method also consistently achieves high performance on standard datasets, exceeding $90\%$ AUC and remaining competitive with recent deepfake detection models. We highlight that these standard datasets primarily contain deepfake methods introduced before 2020, prior to the emergence of recent video editing techniques for deepfake generation.

To visually illustrate our model’s superior performance, we display frames of videos with various localized edits in Fig. 8, along with probability scores for detection. Our method consistently achieves confidence scores exceeding $90\%$ in detecting localized edits within fake videos, whereas existing state-of-the-art detection methods fall below $50\%$ , underscoring the robustness of our approach. The demonstrated results highlight the limitations of current deepfake detection techniques in handling localized edits, the strong sensitivity of our model to fine-grained manipulations, and the generalizable property of the proposed method.

Perturbation	DVA	STIT	DFE	TF	T2L	FZ	VP2P
Gaussian Noise	82.3	83.1	83.8	84.5	85.0	84.2	83.7
Saturation Change	87.2	88.1	88.5	88.9	89.3	89.0	88.6
Blockwise Distortion	86.7	87.5	87.8	88.3	88.6	88.4	88.0
Contrast Change	86.8	87.6	88.0	88.4	88.8	88.5	88.1
Pixelation	86.0	86.8	87.3	87.8	88.1	87.9	87.5
Gaussian Blur	86.3	87.0	87.5	87.9	88.2	88.0	87.6
No Perturbation	88.5	89.3	89.7	90.1	90.6	90.3	89.8

Table 3: Effect of perturbations on AUC: The proposed method shows resilience to various perturbations, with only a slight reduction in AUC. The highest reduction in detection performance is observed when Gaussian noise is added to videos.

4.2 Robustness to Perturbations

In general, adversarial attacks involve introducing different perturbations such as noise, blur, etc., or changing video properties such as saturation, contrast, etc. A generalizable deepfake detection algorithm should be robust against such alterations. Similar to [16], we applied the following standard perturbations: 1) Saturation in a range of $0.5$ to $2.0$ , 2) Contrast in a range of $0.5$ to $2.0$ , 3) Gaussian Noise with a standard deviation of $0.01$ to $0.1$ , 4) Blur with a Gaussian kernel size of radius $3$ to $11$ , 5) Pixelation with downsampling factors $4$ to $16$ , and 6) Blocking artifacts with quality levels from $10$ to $50$ . In Table 3, the AUC scores of our model are evaluated on deepfake generation methods (averaged over $50$ videos), after applying these perturbations to the test videos. Results show that Gaussian noise resulted in a slight decrease in AUC scores, while other alterations had minimal impact on model performance, demonstrating the robustness of our method against various distortions.
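Three of the perturbations listed above can be sketched on a single grayscale frame with values in $[0,1]$, as shown below. This is an illustrative approximation only: the exact image operations used for the reported results are not specified here, the contrast definition (scaling deviations from the frame mean) is one common convention, and all function names and the toy $8\times 8$ frame are hypothetical.

```python
import random


def clip01(v):
    return min(1.0, max(0.0, v))


def add_gaussian_noise(frame, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return [[clip01(v + rng.gauss(0.0, sigma)) for v in row] for row in frame]


def change_contrast(frame, factor):
    """Scale deviations from the frame mean (assumed contrast definition)."""
    flat = [v for row in frame for v in row]
    mean = sum(flat) / len(flat)
    return [[clip01(mean + factor * (v - mean)) for v in row] for row in frame]


def pixelate(frame, factor):
    """Replace each factor x factor block by its average value."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for top in range(0, h, factor):
        for left in range(0, w, factor):
            rows = range(top, min(top + factor, h))
            cols = range(left, min(left + factor, w))
            avg = sum(frame[r][c] for r in rows for c in cols) / (len(rows) * len(cols))
            for r in rows:
                for c in cols:
                    out[r][c] = avg
    return out


# Toy 8 x 8 gradient frame; parameter values lie within the ranges given above.
frame = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]
noisy = add_gaussian_noise(frame, sigma=0.05)
contrasted = change_contrast(frame, factor=2.0)
blocky = pixelate(frame, factor=4)
print(len(noisy), len(contrasted[0]), len({v for row in blocky for v in row}))  # 8 8 4
```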

4.3 Ablation Study

We first perform a detailed interpretability analysis to assess whether our modified encoder effectively focuses on key action units within face-centered videos. In Fig. 5, the attention maps from the final cross-attention block of our fused encoder highlight the model’s ability to capture local dynamics, using multiple action units. This analysis visually substantiates why the latent output representation of our fused encoder is robust enough to detect deepfake videos with localized manipulations.

We next conduct an in-depth analysis by systematically selecting different action units, as presented in Table 4, demonstrating that our chosen AUs yield the best detection performance. We use the open-source benchmark [47] to generate ground truth AU maps for the first 16 AUs, which effectively capture key facial feature information. This evaluation aligns with the attention maps illustrated in Fig. 5, where our method effectively captures most of the important facial expressions by leveraging comprehensive information across action units.

In Table 5, we highlight the need for a cross-attention based fusion of latent representations from two pretext tasks for deepfake detection. Note that our model fuses latent embeddings from two encoders, VFE pretrained by masking, and AUE pretrained by AU detection. To assess the performance of each encoder independently and understand the impact of fusing VFE layer features with AUE layer features through cross-attention, we conduct experiments where each encoder is fine-tuned separately for the downstream task of deepfake detection. Note that the baseline here is a model analogous to the architecture of VFE without any pre-training.

To highlight the contribution of the AU encoder (AUE) guidance during fine-tuning, we first remove the connections from AUE and fine-tune only the frame encoder (VFE) for deepfake detection. As shown in Table 5, this leads to a substantial drop in performance across all the state-of-the-art deepfake generation methods, since these methods focus on localized edit patterns. These findings underscore the critical role of AU guidance from AUE in detecting subtle manipulations. Next, to assess the importance of VFE, we fine-tune only the AU encoder (AUE) for deepfake detection. As shown in Table 5, though the model with only AUE performs moderately well, it underperforms considerably compared to the full model with both encoders. This is likely due to the absence of VFE, which captures global facial representations that are crucial for effectively distinguishing between real and fake videos. These results suggest that the cross-attention mechanism between VFE and AUE is crucial for detecting local facial manipulations. More experiments on self-analysis of the proposed method are provided in Appendix B.

AU Set	DVA	STIT	DFE	TF	FZ	VP2P	T2L
AUs 1-5 Eyes	63.1	72.4	72.8	71.4	64.3	67.7	66.2
AUs 7-11 Nose	66.1	73.1	73.8	72.4	67.3	70.7	69.2
AUs 12-16 Lips	72.4	79.3	80.0	78.6	73.1	76.9	75.4
AUs 1-16 (All)	87.20	92.50	93.10	91.70	88.50	90.30	89.10

Table 4: Effect of Different AU Subsets: The selected set of action units is essential for effectively capturing comprehensive facial dynamics, thereby enhancing the deepfake detection accuracy.

Components	DVA	STIT	DFE	TF	FZ	VP2P	T2L
W/O VFE and AUE	$35.10$	$32.10$	$33.80$	$38.0$	$36.30$	$39.70$	$38.0$
With VFE	$52.60$	$62.60$	$65.30$	$60.90$	$54.80$	$57.70$	$56.0$
With AUE	$75.10$	$82.10$	$82.80$	$81.80$	$76.30$	$79.50$	$78.20$
Fused encoder	$\mathbf{87.20}$	$\mathbf{93.50}$	$\mathbf{93.10}$	$\mathbf{91.30}$	$\mathbf{88.50}$	$\mathbf{91.70}$	$\mathbf{90.1}$

Table 5: Effect of Encoders: Performance degradation without VFE and AUE shows their importance. Our approach of fusing two encoders achieves the best performance.

5 Conclusion

In this work, we present a novel deepfake detection method specifically designed to identify subtle, localized manipulations in video, addressing a challenge overlooked by existing methods. To the best of our knowledge, this is the first approach that effectively targets localized edits, leveraging a fusion of spatio-temporal representations from two complementary pretext tasks: masked frame reconstruction and facial action unit detection. Our method significantly outperforms existing state-of-the-art deepfake detection models in identifying localized manipulations produced by the latest deepfake generation methods. Additionally, it demonstrates competitive performance with top methods on standard deepfake datasets involving global alterations. These results validate our approach as effective across diverse deepfake types and hence robustly generalizable. Future work will explore extending our fused latent video representation framework to additional downstream tasks in video analysis, expanding beyond face-centered content.

References

Afchar et al. [2018] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. MesoNet: a compact facial video forgery detection network. Proc. IEEE International Workshop on Information Forensics and Security, pages 1–7, 2018.
Agarwal et al. [2019] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. Protecting world leaders against deep fakes. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 1:38, 2019.
Amerini et al. [2019] Irene Amerini, Leonardo Galteri, Roberto Caldelli, and Alberto Del Bimbo. Deepfake video detection through optical flow based CNN. Proc. IEEE/CVF International Conference on Computer Vision Workshops, pages 1205–1207, 2019.
Ashok and Joy [2023] V Ashok and Preetha Theresa Joy. Deepfake detection using XceptionNet. Proc. IEEE International Conference on Recent Advances in Systems Science and Engineering, pages 1–5, 2023.
Bai et al. [2023] Weiming Bai, Yufan Liu, Zhipeng Zhang, Bing Li, and Weiming Hu. AUNet: Learning relations between action units for face forgery detection. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24709–24719, 2023.
Bar-Tal et al. [2022] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. European Conference on Computer Vision, abs/2204.02491, 2022.
Bayar and Stamm [2016] Belhassen Bayar and Matthew C Stamm. A deep learning approach to universal image manipulation detection using a new convolutional layer. Proc. 4th ACM Workshop on Information Hiding and Multimedia Security, pages 5–10, 2016.
Bounareli et al. [2023] Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, Ioannis Patras, and Georgios Tzimiropoulos. HyperReenact: One-shot reenactment via jointly learning to refine and retarget faces. Proc. IEEE/CVF International Conference on Computer Vision, pages 7149–7159, 2023.
Chai et al. [2020] Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? Understanding properties that generalize. Proc. European Conference on Computer Vision, pages 103–120, 2020.
Chen et al. [2022] Liang Chen, Yong Zhang, Yibing Song, Lingqiao Liu, and Jue Wang. Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18710–18719, 2022.
Coccomini et al. [2022] Davide Alessandro Coccomini, Nicola Messina, Claudio Gennaro, and Fabrizio Falchi. Combining efficientnet and vision transformers for video deepfake detection. Proc. International Conference on Image Analysis and Processing, pages 219–229, 2022.
Cozzolino et al. [2021] Davide Cozzolino, Andreas Rössler, Justus Thies, Matthias Nießner, and Luisa Verdoliva. ID-Reveal: Identity-aware deepfake video detection. Proc. IEEE/CVF International Conference on Computer Vision, pages 15108–15117, 2021.
De Lima et al. [2020] Oscar De Lima, Sean Franklin, Shreshtha Basu, Blake Karwoski, and Annet George. Deepfake detection using spatiotemporal convolutional networks. arXiv preprint arXiv:2006.14749, 2020.
Deepfakes Community [2024] Deepfakes Community. Deepfakes github repository. GitHub Repository, 2024. Accessed: 2024-11-09.
Dolhansky et al. [2019] Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram, and Cristian Canton Ferrer. The deepfake detection challenge (DFDC) preview dataset, 2019.
Dong et al. [2023] Shichao Dong, Jin Wang, Renhe Ji, Jiajun Liang, Haoqiang Fan, and Zheng Ge. Implicit identity leakage: The stumbling block to improving deepfake detection generalization. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3994–4004, 2023.
Dong et al. [2022] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Ting Zhang, Weiming Zhang, Nenghai Yu, Dong Chen, Fang Wen, and Baining Guo. Protecting celebrities from deepfake with identity consistency transformer. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9468–9478, 2022.
Ekman and Friesen [1978] Paul Ekman and Wallace V Friesen. Facial action coding system. Environmental Psychology & Nonverbal Behavior, 1978.
Fu et al. [2022] Tsu-Jui Fu, Xin Eric Wang, Scott T Grafton, Miguel P Eckstein, and William Yang Wang. M3l: Language-based video editing via multi-modal multi-level transformers. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10513–10522, 2022.
Geyer et al. [2023] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. International Conference on Learning Representations, abs/2307.10373, 2023.
Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Proc. Communications of the ACM, 63(11):139–144, 2020.
Google Research [2024] Google Research. Contributing data to deepfake detection research. Google Research Blog, 2024. Accessed: 2024-11-09.
Haliassos et al. [2021] Alexandros Haliassos, Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Lips don’t lie: A generalisable and robust approach to face forgery detection. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5039–5049, 2021.
Haliassos et al. [2022] Alexandros Haliassos, Rodrigo Mira, Stavros Petridis, and Maja Pantic. Leveraging real talking faces via self-supervision for robust forgery detection. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14950–14962, 2022.
Han et al. [2025] Yue Han, Junwei Zhu, Keke He, Xu Chen, Yanhao Ge, Wei Li, Xiangtai Li, Jiangning Zhang, Chengjie Wang, and Yong Liu. Face-adapter for pre-trained diffusion models with fine-grained id and attribute control. Proc. European Conference on Computer Vision, pages 20–36, 2025.
He et al. [2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Proc. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
Hsu et al. [2018] Chih-Chung Hsu, Chia-Yen Lee, and Yi-Xiu Zhuang. Learning to detect fake face images in the wild. Proc. International Symposium on Computer, Consumer and Control, pages 388–391, 2018.
Hsu et al. [2022] Gee-Sern Hsu, Chun-Hung Tsai, and Hung-Yi Wu. Dual-generator face reenactment. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 642–650, 2022.
Jacob and Stenger [2021] Geethu Miriam Jacob and Bjorn Stenger. Facial action unit detection with transformers. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7680–7689, 2021.
Jeong et al. [2022a] Yonghyun Jeong, Doyeon Kim, Seungjai Min, Seongho Joe, Youngjune Gwon, and Jongwon Choi. BiHPF: Bilateral high-pass filters for robust deepfake detection. Proc. IEEE/CVF Winter Conference on Applications of Computer Vision, pages 48–57, 2022a.
Jeong et al. [2022b] Yonghyun Jeong, Doyeon Kim, Youngmin Ro, and Jongwon Choi. FrePGAN: Robust deepfake detection using frequency-level perturbations. Proc. AAAI conference on artificial intelligence, 36(1):1060–1068, 2022b.
Karnouskos [2020] Stamatis Karnouskos. Artificial intelligence in digital media: The era of deepfakes. IEEE Transactions on Technology and Society, 1(3):138–147, 2020.
Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
Khan and Dang-Nguyen [2022] Sohail Ahmed Khan and Duc-Tien Dang-Nguyen. Hybrid transformer network for deepfake detection. Proc. 19th International Conference on Content-Based Multimedia Indexing, pages 8–14, 2022.
Kim et al. [2023] Gyeongman Kim, Hajin Shim, Hyunsu Kim, Yunjey Choi, Junho Kim, and Eunho Yang. Diffusion video autoencoders: Toward temporally consistent face video editing via disentangled video encoding. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6091–6100, 2023.
Kowalski [2024] Marek Kowalski. Faceswap. GitHub Repository, 2024. Accessed: 2024-11-09.
Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Proc. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.
Li et al. [2020a] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face X-ray for more general face forgery detection. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5001–5010, 2020a.
Li et al. [2018a] Wei Li, Farnaz Abtahi, Zhigang Zhu, and Lijun Yin. EAC-Net: Deep nets with enhancing and cropping for facial action unit detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11):2583–2596, 2018a.
Li and Lyu [2019] Yuezun Li and Siwei Lyu. Exposing deepfake videos by detecting face warping artifacts. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
Li et al. [2018b] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In Ictu Oculi: Exposing AI created fake videos by detecting eye blinking. Proc. IEEE International Workshop on Information Forensics and Security, pages 1–7, 2018b.
Li et al. [2020b] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A large-scale challenging dataset for deepfake forensics. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3207–3216, 2020b.
Lin et al. [2024] Li Lin, Xinan He, Yan Ju, Xin Wang, Feng Ding, and Shu Hu. Preserving fairness generalization in deepfake detection. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16815–16825, 2024.
Liu et al. [2023] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8599–8608, 2023.
Masi et al. [2020] Iacopo Masi, Aditya Killekar, Royston Marian Mascarenhas, Shenoy Pratik Gurudatt, and Wael AbdAlmageed. Two-branch recurrent network for isolating deepfakes in videos. Proc. European Conference on Computer Vision, pages 667–684, 2020.
Miriam Jacob and Stenger [2021] Geethu Miriam Jacob and Björn Stenger. Facial action unit detection with transformers. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7676–7685, 2021.
Mittal et al. [2020] Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha. Emotions Don’t Lie: An audio-visual deepfake detection method using affective cues. Proc. 28th ACM International Conference on Multimedia, pages 2823–2832, 2020.
Nagrani et al. [2020] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. Voxceleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60:101027, 2020.
Nguyen et al. [2024] Dat Nguyen, Nesryne Mejri, Inder Pal Singh, Polina Kuleshova, Marcella Astrid, Anis Kacem, Enjie Ghorbel, and Djamila Aouada. LAA-Net: Localized artifact attention network for quality-agnostic and generalizable deepfake detection. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17395–17405, 2024.
Pantserev [2020] Konstantin A Pantserev. The malicious use of AI-based deepfake technology as the new threat to psychological security and political stability. Cyber Defence in the Age of AI, Smart Societies and Augmented Humanity, pages 37–55, 2020.
Pathak et al. [2016] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
Pei et al. [2024] Gan Pei, Jiangning Zhang, Menghan Hu, Zhenyu Zhang, Chengjie Wang, Yunsheng Wu, Guangtao Zhai, Jian Yang, Chunhua Shen, and Dacheng Tao. Deepfake generation and detection: A benchmark and survey. arXiv preprint arXiv:2403.17881, 2024.
Qi et al. [2023] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 15886–15896, 2023.
Qian et al. [2020] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. Proc. European conference on computer vision, pages 86–103, 2020.
Rahmouni et al. [2017] Nicolas Rahmouni, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. Distinguishing computer graphics from natural images using convolution neural networks. Proc. IEEE Workshop on Information Forensics and Security, pages 1–6, 2017.
Ross and Dollár [2017] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. Proc. IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
Rossler et al. [2019] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. Proc. IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019.
Sabir et al. [2019] Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, and Prem Natarajan. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI), 3(1):80–87, 2019.
Shao et al. [2018] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma. Deep adaptive attention for joint facial action unit detection and face alignment. Proceedings. European Conference on Computer Vision, pages 705–720, 2018.
Shiohara et al. [2023] Kaede Shiohara, Xingchao Yang, and Takafumi Taketomi. BlendFace: Re-designing identity encoders for face-swapping. Proc. IEEE/CVF International Conference on Computer Vision, pages 7634–7644, 2023.
Tan and Le [2019] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. Proc. International conference on machine learning, pages 6105–6114, 2019.
Thies et al. [2016] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-time face capture and reenactment of rgb videos. Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2395, 2016.
Thies et al. [2019] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures. Acm Transactions on Graphics (TOG), 38(4):1–12, 2019.
Tong et al. [2022] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Proc. Advances in Neural Information Processing Systems, 35:10078–10093, 2022.
Tzaban et al. [2022] Rotem Tzaban, Ron Mokady, Rinon Gal, Amit Bermano, and Daniel Cohen-Or. Stitch It In Time: Gan-based facial editing of real videos. Proc. SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
Wang et al. [2021] Jun Wang, Yinglu Liu, Yibo Hu, Hailin Shi, and Tao Mei. FaceX-Zoo: A pytorch toolbox for face recognition. Proc. ACM International Conference on Multimedia, pages 3779–3782, 2021.
Wang et al. [2022] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, and Lu Yuan. BEVT: BERT pretraining of video transformers. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14733–14743, 2022.
Wang et al. [2023] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, and Houqiang Li. AltFreezing for more general video face forgery detection. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4129–4138, 2023.
Yao et al. [2021] Xu Yao, Alasdair Newson, Yann Gousseau, and Pierre Hellier. A latent transformer for disentangled face editing in images and videos. Proc. IEEE/CVF International Conference on Computer Vision, pages 13789–13798, 2021.
Zhang [2022] Tao Zhang. Deepfake generation and detection, a survey. Multimedia Tools and Applications, 81(5):6259–6276, 2022.
Zheng et al. [2021] Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, and Fang Wen. Exploring temporal coherence for more general video face forgery detection. Proc. IEEE/CVF International Conference on Computer Vision, pages 15044–15054, 2021.
Zhou et al. [2023] Qianyu Zhou, Ke-Yue Zhang, Taiping Yao, Xuequan Lu, Ran Yi, Shouhong Ding, and Lizhuang Ma. Instance-aware domain generalization for face anti-spoofing. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20453–20463, 2023.
Zhu et al. [2022] Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, and Chen Change Loy. CelebV-HQ: A large-scale video facial attributes dataset. Proc. European Conference on Computer Vision, pages 650–667, 2022.
Zi et al. [2020] Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. WildDeepfake: A challenging real-world dataset for deepfake detection. Proc. 28th ACM International Conference on Multimedia, pages 2382–2390, 2020.

Appendix

In this section, we provide additional details to demonstrate the effectiveness of the proposed method. We start by providing the detailed training procedure for both the pretext tasks and the deepfake detection framework. Further, we evaluate each pretext task individually through reconstruction performance and visual comparisons. We also include additional metrics, such as Average Recall (AR) and mean F1 score (mF1), along with the primary metrics of AUC and Average Precision (AP) shown in the manuscript. This offers a more detailed comparison of our method against existing state-of-the-art detection methods across different deepfake generators. In addition, we elaborate on the construction of the latest deepfake videos with local manipulations, including descriptions of the methods and parameters used to generate fake videos with localized subtle edits.

Appendix A Comprehensive Training Details

The two pretext tasks are trained using the CelebV-HQ dataset [74], which contains approximately $35,000$ real facial videos. The first pretext model for the reconstruction of masked frames minimizes a variant of $\ell_{1}$ reconstruction loss (Huber loss) between ground truth frames and frames reconstructed from masked inputs, following a VideoMAE-like approach [65]. The second pretext model for Facial Action Unit (AU) detection is trained using Huber loss between predicted AU maps and ground truth maps, to predict $16$ AUs for each frame. To generate the ground-truth attention map for every action unit, we define landmarks corresponding to the different AUs following the conventional approach in [60, 40]. Elliptical regions are fitted to these landmarks as initial AU regions, which are then smoothed using a Gaussian filter of radius $3$. This process yields $16$ distinct AU maps, each corresponding to a specific localized action, for a single frame.
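The AU map construction and the Huber loss described above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the ellipse centre and semi-axes stand in for values that would be derived from facial landmarks, the Gaussian `sigma` and the Huber `delta` are assumed (the text specifies only a radius of $3$), and "radius" is interpreted here as the half-width of the kernel.

```python
import numpy as np

def ellipse_mask(height, width, center, axes):
    """Binary mask of an axis-aligned ellipse.

    center = (row, col); axes = (semi-axis in rows, semi-axis in cols).
    In practice these would come from landmarks; here they are hypothetical.
    """
    rows, cols = np.mgrid[0:height, 0:width]
    r0, c0 = center
    a, b = axes
    inside = ((rows - r0) / a) ** 2 + ((cols - c0) / b) ** 2 <= 1.0
    return inside.astype(np.float64)

def gaussian_kernel(radius, sigma):
    # 1-D normalized Gaussian with 2 * radius + 1 taps.
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth(mask, radius=3, sigma=1.0):
    """Separable Gaussian blur with zero padding (sigma is an assumption)."""
    k = gaussian_kernel(radius, sigma)
    out = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 0, mask)
    out = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 1, out)
    return out

def huber(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    err = np.abs(pred - target)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.where(err <= delta, quadratic, linear).mean())

# One illustrative AU map on a 32x32 grid.
au_map = smooth(ellipse_mask(32, 32, center=(12, 16), axes=(4, 8)))
```

Stacking $16$ such maps, one per action unit, would give the per-frame target against which the predicted AU maps are compared with `huber`.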

	DVA	STIT	DFE	TF	FZ	T2L	VP2P
MAE: Random Masking	2.71e-8	2.6e-8	2.64e-8	2.48e-8	2.35e-8	2.40e-8	2.38e-8
MAE: AU Detection	1.23e-8	1.17e-8	1.19e-8	1.05e-8	9.8e-9	1.02e-8	1.01e-8

Table 6: Reconstruction Error for Pretext Tasks: The pretext models for random masking and AU detection are evaluated independently, with MAE between ground truth and reconstructions tabulated across datasets (normalized between $0$ and $1$). The negligible MAE confirms their effectiveness in both tasks.

We trained both the pretext models using the Adam optimizer with a batch size of $8$ for $600$ epochs. Gradient accumulation was applied every $20$ steps. We used the pre-trained checkpoints from VideoMAE [65] to initialize our weights for both the pretext tasks.

During fine-tuning for deepfake detection, the fused encoder shown in Fig. $4$ in the manuscript is trained with a classifier on the FF++ dataset [58], consisting of $700$ real and $3,600$ fake videos generated via four manipulation methods [37, 63, 14, 64]. Focal Loss [57] is used to address class imbalance. For fine-tuning, we used a batch size of $8$ for $100$ epochs. A learning rate of $10^{-5}$ with an exponential decay of $10^{-3}$ is used for both the pretext tasks and the fine-tuning stage.
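For reference, a minimal NumPy sketch of the binary focal loss is given below. The values `gamma=2.0` and `alpha=0.25` are the defaults commonly used with focal loss and are assumptions here, since the text does not state which values were used; the label convention (fake = 1) is likewise assumed.

```python
import numpy as np

def focal_loss(prob_fake, labels, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t), averaged.

    prob_fake: predicted probability of the positive (fake) class.
    labels:    1 for fake, 0 for real (assumed convention).
    gamma, alpha: assumed defaults, not values reported in the text.
    """
    prob_fake = np.clip(prob_fake, 1e-7, 1.0 - 1e-7)
    p_t = np.where(labels == 1, prob_fake, 1.0 - prob_fake)
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

labels = np.array([1, 0])
easy = focal_loss(np.array([0.95, 0.05]), labels)   # confident, correct
hard = focal_loss(np.array([0.55, 0.45]), labels)   # uncertain, correct
```

The modulating factor $(1 - p_t)^{\gamma}$ shrinks the contribution of well-classified samples, so `easy` is much smaller than `hard`; with `gamma=0` the expression reduces to an `alpha`-weighted cross-entropy.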

Detection Methods	DVA				STIT				DFE				TokenFlow				Video-P2P				Text2LIVE				FateZero
	AUC	AP	AR	mF1	AUC	AP	AR	mF1	AUC	AP	AR	mF1	AUC	AP	AR	mF1	AUC	AP	AR	mF1	AUC	AP	AR	mF1	AUC	AP	AR	mF1
FTCN	27.1	30.1	25.4	27.2	34.8	37.0	32.1	34.5	33.5	35.8	30.9	33.3	29.5	32.1	27.0	29.5	31.0	33.7	29.2	31.5	30.2	32.9	28.6	30.8	28.7	31.3	26.9	29.1
RealForensics	37.5	40.3	36.2	38.2	46.9	49.8	44.5	47.1	45.6	48.6	43.2	45.8	41.2	44.2	39.0	41.6	42.8	45.8	40.4	43.0	42.0	45.1	39.8	42.3	40.5	43.4	38.5	40.9
Lip Forensics	33.8	37.1	32.4	34.7	42.0	45.8	40.3	43.0	40.9	44.6	39.0	41.7	36.5	40.3	34.7	37.4	38.0	41.9	36.0	38.8	37.3	41.2	35.4	38.2	35.8	39.5	33.9	36.6
EfficientNet+ViT	36.2	38.4	34.9	36.6	44.7	47.1	42.8	44.9	43.5	45.9	41.3	43.5	39.0	41.5	37.2	39.3	40.8	43.2	38.6	40.8	40.1	42.5	38.0	40.2	38.6	40.8	36.5	38.6
Face X-Ray	33.4	36.1	31.5	33.8	41.1	43.6	38.8	41.2	40.3	42.9	38.2	40.5	36.0	39.1	34.2	36.5	37.5	40.7	35.4	37.8	36.8	40.0	34.8	37.3	35.2	38.2	33.5	35.8
LAA Net	61.5	58.0	55.2	56.6	72.5	69.3	66.4	67.8	71.2	68.1	64.8	66.3	66.4	62.9	60.3	61.6	68.0	64.5	62.0	63.3	67.1	63.7	60.9	62.2	65.7	61.8	59.3	60.6
SBI	65.2	62.8	59.4	61.0	75.5	73.2	70.1	71.6	73.3	71.5	68.5	69.9	69.0	66.4	63.5	64.9	70.8	68.1	65.2	66.6	70.2	67.5	64.7	66.1	68.6	65.9	63.1	64.5
Ours	87.2	85.8	82.5	84.1	92.5	90.7	88.1	89.4	93.1	91.6	89.5	90.5	91.7	89.4	87.0	88.2	90.3	90.2	86.9	88.0	89.1	87.9	85.5	86.6	88.5	86.0	84.2	85.1

Table 7: Cross-Dataset Quantitative Comparison: AUC, AP, AR, and mF1 scores evaluated across the latest deepfake generation methods. The results highlight the superior detection performance of our method, significantly surpassing existing state-of-the-art approaches in identifying fine-grained localized edits.

Method	CDF2				DFD				DFW				DFDC
	AUC	AP	AR	mF1	AUC	AP	AR	mF1	AUC	AP	AR	mF1	AUC	AP	AR	mF1
LAA Net [50]	95.4	$\mathbf{97.64}$	$\underline{87.71}$	92.41	$\mathbf{99.5}$	$\mathbf{99.8}$	$95.47$	97.59	$\underline{87.6}$	$85.08$	$69.66$	$78.56$	$86.94$	$\mathbf{97.7}$	$73.37$	$83.81$
SBI [61]	$93.18$	$85.16$	$82.68$	$83.90$	$97.56$	$92.79$	$89.49$	$91.11$	$84.83$	88.37	$\underline{81.64}$	$\underline{84.60}$	$86.16$	$\underline{93.24}$	$71.58$	$80.99$
AltFreezing [69]	$89.5$	$88.46$	$85.50$	$86.24$	$98.50$	$97.86$	$\underline{97.0}$	$\underline{97.41}$	$72.6$	$70.86$	$68.5$	$69.66$	94.0	$92.57$	91.1	91.80
CADMM [16]	$93.0$	$91.12$	$77.00$	$83.46$	$99.03$	$\underline{99.59}$	$82.17$	$90.04$	$75.0$	$72.80$	$71.26$	$72.14$	$88.3$	$86.7$	$85.62$	$86.1$
EfficientNet+ViT [11]	$79.0$	$75.61$	$74.5$	$75.05$	$87.0$	$88.09$	$85.8$	$86.93$	$72.0$	$68.74$	$67.0$	$67.85$	$91.0$	$85.12$	$83.7$	$84.39$
Ours	$\underline{93.84}$	$\underline{95.27}$	92.66	$\underline{92.17}$	$97.15$	$95.28$	98.6	$97.23$	$\mathbf{91.0}$	$\underline{88.25}$	88.63	87.40	$\underline{93.0}$	$91.93$	$\underline{90.38}$	$\underline{91.26}$

Table 8: Cross-Dataset Quantitative Comparison: Evaluation of AUC, AP, AR, and mF1 scores across standard deepfake datasets, focused on face swapping and reenactment. The proposed method is competitive with existing SOTA approaches across all metrics. Notably, our method achieves superior AR values, indicating high sensitivity in detecting fake videos (positives).

Appendix B Evaluation on Pretext tasks

We evaluate the performance of pretext models independently to demonstrate the effectiveness of the representations learned by the respective encoders, VFE (Video Frame Encoder) and AUE (Action Unit Encoder). For the first self-supervised task - reconstruction of face-centered frames from masked input frames - we compute the Mean Absolute Error (MAE) between the output reconstructed frames and the ground truth. MAE is first computed across all 16 input frames for each video to obtain a video-level MAE. This score is then averaged over all videos across diverse methods, and presented in the first row of Table 6.

For the second self-supervised pretext task, reconstruction of AU maps for each video, we compute the MAE between the $16$ reconstructed AU maps and the corresponding ground truth maps for every frame. For each method, this per-frame MAE is first averaged across all $16$ frames of each video. These video-level MAE values are then averaged across all the videos corresponding to a particular deepfake generation method to obtain the final reconstruction error, as reported in the second row of Table 6. The low MAE values for both pretext tasks demonstrate the effectiveness of their respective learned representations. In Fig. 6, a qualitative comparison is shown, where, for a single frame, we display selected ground-truth AU maps alongside their reconstructed counterparts as output by the model. These visualizations highlight the model’s capability in accurately capturing fine-grained facial details.
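The two-stage averaging described above (per-frame MAE, then per-video, then per-method) can be sketched as follows. The array shapes and the helper names `video_level_mae` and `method_level_mae` are hypothetical and chosen only for illustration; inputs are assumed to be normalized to $[0, 1]$.

```python
import numpy as np

def video_level_mae(pred, target):
    """Average the per-frame MAE over all frames of one video.

    pred, target: arrays of shape (frames, ...) with values in [0, 1].
    """
    per_frame = np.abs(pred - target).reshape(pred.shape[0], -1).mean(axis=1)
    return float(per_frame.mean())

def method_level_mae(videos):
    """Average video-level MAE over all (pred, target) pairs of one method."""
    return float(np.mean([video_level_mae(p, t) for p, t in videos]))

# Two toy "videos" of 16 frames each with constant errors 0.1 and 0.3.
target = np.zeros((16, 4, 4))
videos = [(target + 0.1, target), (target + 0.3, target)]
mae = method_level_mae(videos)
```

Because every video here has the same number of frames, this nested average equals the plain mean over all frames; the two would differ only if video lengths varied.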

Appendix C Latest Locally edited Deepfake Videos

We leveraged seven state-of-the-art methods to test the proposed deepfake detection method: Diffusion Video Autoencoders (DVA) [36], Stitch It In Time (STIT) [66], Disentangled Face Editing (DFE) [70], TokenFlow [20], Video-P2P [45], FateZero [54], and Text2LIVE [6]. For all the methods, we utilized their official source code and generated $50$ videos each. These methods enabled localized edits targeting eyes, mouth, and expressions, as well as age and gender transformations. For DVA, we used $1000$ sampling steps, a learning rate of $0.002$ (for fine-tuning), and an editing scale of $0.5$. For the StyleGAN2 [34] based editing methods STIT and DFE, we followed the common pipeline for editing, which involves video inversion to latent space, fine-tuning the generator for a specific video, and editing the latent vector. For STIT, we used $50$ steps for fine-tuning the generator, along with an editing range of $-6$ to $+6$. Similarly, for DFE, we used $50$ steps for fine-tuning and an editing range of $-10$ to $+10$. TokenFlow, Video-P2P, and FateZero utilize pre-trained diffusion models during inference, standardized with $50$ DDIM inversion steps and a classifier-free guidance scale of $7.5$ for text fidelity. Video-P2P further employs a cross-attention replacement ratio of $0.4$ to enhance temporal consistency. Text2LIVE, in contrast, is trained for each video using a video-specific generator for $1,000$ optimization steps.

Appendix D Additional Experimental results

In this section, we evaluate our method’s generalization capability in a cross-dataset setting. As shown in Table 8, our method consistently achieves high performance on standard datasets, exceeding $90\%$ AUC and matching the performance of recent deepfake detection models: LAANet [50], SBI [61], AltFreezing [69], and CADMM [16]. For the latest locally manipulated videos, existing SOTA methods experience a significant drop in performance, with AUCs of roughly $30$-$75\%$, as shown in Table 7, whereas our method demonstrates robust generalization, achieving an AUC as high as $93\%$. A similar trend is noticeable for all other metrics as well. Notably, our approach exhibits a superior average recall across all the compared videos, indicating high sensitivity in detecting fake videos (considered as positives), with significantly fewer false negatives and a manageable number of false positives, ensuring reliable detection even for localized manipulations by recent deepfake methods.
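As a reminder of how recall relates to false negatives and false positives, the sketch below computes recall, precision, and F1 at a single decision threshold, with fake videos treated as positives. The threshold of $0.5$, the toy scores, and the helper name are assumptions; how AR and mF1 are averaged in the reported tables is not specified here, so this shows only the per-threshold quantities.

```python
import numpy as np

def recall_precision_f1(scores, labels, threshold=0.5):
    """Recall, precision, and F1 with fakes (label 1) as positives.

    threshold is an assumed operating point, not one reported in the text.
    """
    pred = scores >= threshold
    pos = labels == 1
    tp = int(np.sum(pred & pos))     # fakes flagged as fake
    fp = int(np.sum(pred & ~pos))    # reals flagged as fake
    fn = int(np.sum(~pred & pos))    # fakes missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1

# Toy example: three fake videos (one missed) and two real (one false alarm).
scores = np.array([0.95, 0.80, 0.40, 0.70, 0.10])
labels = np.array([1, 1, 1, 0, 0])
recall, precision, f1 = recall_precision_f1(scores, labels)
```

A high recall means few fake videos are missed, which is the property emphasized in the paragraph above.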

To visually illustrate our model’s superior performance, we display frames of videos with various localized edits in Fig. 8, along with probability scores for detection. All real videos utilized in this experiment are from the publicly available dataset [49]. Our method consistently achieves confidence scores exceeding $90\%$ in detecting localized edits within fake videos, in contrast to existing state-of-the-art detection methods. This observation holds consistently across a diverse range of localized edits, including expressions such as smiles, shock, disgust, sadness, anger, and modifications like eyebrow raises, eye gaze adjustments, and gender or age transformations.

Next, in Fig. 7, we present examples where our method exhibits a noticeable drop in confidence scores for detecting fake videos generated through localized edits applied to three real videos in [49]. Most of these cases occur when the subject is facing sideways or when occlusions prevent the learned representations from accurately capturing facial dynamics through action units.

[Figure: example frames with localized edits. Columns: Original | Fake Manipulations | Edit types & Probability score. Edit types by row: (Eye Raise, Old, Young); (Eye Gaze, Sad, Shock); (Shock, Young, Angry); (Disgust, Smile, Old).]

[Figure: example frames with localized edits. Columns: Original | Fake Manipulations | Edit types & Probability score. Edit types by row: (Gender, Young, Old); (Sad, Anger, Disgust); (Smile, Anger, Disgust); (Smile, EyeRaise, EyeGaze); (Shock, EyeRaise, Smile); (Disgust, Anger, EyeGaze); (Shock, Smile, Anger).]

[Figure: three example frames. Fake Score: $56.6$, $60.2$, $67.3$.]