
FreqBlender: Enhancing DeepFake Detection by Blending Frequency Knowledge

Hanzhe Li1,∗   Jiaran Zhou1,   Yuezun Li1,   Baoyuan Wu2   Bin Li3   Junyu Dong1
1 School of Computer Science and Technology, Ocean University of China
2 School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
3 Guangdong Provincial Key Laboratory of Intelligent Information Processing,
Shenzhen University, Shenzhen, China
∗ Equal contribution. Correspondence to Yuezun Li ([email protected])
Abstract

Generating synthetic fake faces, known as pseudo-fake faces, is an effective way to improve the generalization of DeepFake detection. Existing methods typically generate these faces by blending real or fake faces in the spatial domain. While these methods have shown promise, they overlook the simulation of the frequency distribution in pseudo-fake faces, limiting the in-depth learning of generic forgery traces. To address this, this paper introduces FreqBlender, a new method that can generate pseudo-fake faces by blending frequency knowledge. Concretely, we investigate the major frequency components and propose a Frequency Parsing Network to adaptively partition the frequency components related to forgery traces. Then we blend this frequency knowledge from fake faces into real faces to generate pseudo-fake faces. Since there is no ground truth for frequency components, we describe a dedicated training strategy that leverages the inner correlations among different frequency knowledge to instruct the learning process. Experimental results demonstrate the effectiveness of our method in enhancing DeepFake detection, making it a potential plug-and-play strategy for other methods.

Refer to caption
Figure 1: Overview of our method. In contrast to the existing spatial-blending methods (right part), our method explores face blending in frequency domain (left part). By leveraging the frequency knowledge, our method can generate pseudo-fake faces that closely resemble the distribution of wild fake faces. Our method can complement and work in conjunction with existing spatial-blending methods.

1 Introduction

DeepFake refers to face forgery techniques that can manipulate facial attributes, such as identity, expression, and lip movement [1]. The recent advancement of deep generative models [2, 3] has greatly sped up the evolution of DeepFake techniques, enabling the creation of highly realistic and visually imperceptible manipulations. However, the misuse of these techniques can pose serious security concerns [4], making DeepFake detection more pressing than ever before.

There have been many methods proposed for detecting DeepFakes, showing their effectiveness on public datasets [5, 6, 7, 8, 9]. However, with the continuous growth of AI techniques, new types of forgeries constantly emerge, posing a challenge for current detectors to accurately expose unknown forgeries. To address this challenge, recent efforts [10, 11, 12, 13] have focused on improving the generalizability of detection, i.e., the ability to detect unknown forgeries based on known examples. One effective approach to address this problem is to enhance the training data by generating synthetic fake faces, known as pseudo-fakes [14, 15, 16]. The intuition behind this approach is that the DeepFake generation process introduces artifacts in the face-blending step, and these methods generate pseudo-fake faces by simulating various blending artifacts. By training on these pseudo-fake faces, the models are driven to learn the corresponding artifacts. However, existing methods concentrate on simulating the spatial aspects of face blending (see Fig. 1 (right)). While they can make the pseudo-fake faces resemble the distribution of wild fake faces in the spatial domain, they do not explore the distribution in the frequency domain. Thus, current pseudo-fake faces lack frequency-based forgery clues, limiting the models' ability to learn generic forgery features.

In this paper, we shift our attention from the spatial domain to the frequency domain and propose a new method called FreqBlender to generate pseudo-fake faces by blending frequency knowledge (see Fig. 1 (left)). To achieve this, we analyze the composition of the frequency domain and accurately identify the frequency range into which forgery clues fall. Then we replace this range in real faces with the corresponding range from fake faces to generate pseudo-fake faces. However, identifying the frequency range of forgery clues is challenging for two main reasons: 1) the range varies across different fake faces due to its high dependence on face content, and 2) forgery clues may not be concentrated in a single frequency range but could be an aggregation of various portions across multiple ranges. Thus, general low-pass, high-pass, or band-pass filters are incapable of precisely pinpointing this distribution.

To address this challenge, we propose a Frequency Parsing Network (FPNet) that can adaptively partition the frequency domain based on the input faces. Specifically, we hypothesize that faces are composed of three types of frequency knowledge, representing semantic information, structural information, and noise information, respectively, and that forgery traces are likely hidden in the structural information. This hypothesis is validated in our preliminary analysis (refer to Sec. 3 for details). Building on this foundation, we design the network to consist of a shared encoder and three decoders that extract the corresponding frequency knowledge. The encoder transforms the input data into a latent frequency representation, while the decoders estimate the probability map of the corresponding frequency knowledge.

Training this network is non-trivial since no ground truth of the frequency distribution is provided. Therefore, we propose a novel training strategy that leverages the inner correlations among different frequency knowledge. Specifically, we describe dedicatedly crafted objectives that operate on various blending combinations of the decoder outputs and emphasize the properties of each frequency knowledge. The experimental results demonstrate that the network successfully parses the desired frequency knowledge under the proposed training strategy.

Once the network is trained, we can parse the frequency component corresponding to the structural information of a fake face, and blend it with a real face to generate a pseudo-fake face. It is important to note that our method is not in conflict with existing spatial blending methods, but rather complements them by addressing the defect in the frequency domain. Our method is validated on multiple recent DeepFake datasets (e.g., FF++ [5], CDF [6], DFDC [8], DFDCP [7], FFIW [9]) and compared with many state-of-the-art methods, demonstrating the efficacy of our method in improving detection performance.

The contributions of this paper are threefold: 1) To the best of our knowledge, we are the first to generate pseudo-fake faces by blending frequency knowledge. Our method pushes pseudo-fake faces closer to the distribution of wild fake faces, enhancing the learning of generic forgery features in DeepFake detection. 2) We propose a Frequency Parsing Network that can adaptively partition the frequency components corresponding to semantic information, structural information, and noise information, respectively. Since no ground truth is provided, we design dedicated objectives to train this network. 3) Extensive experimental results on several DeepFake datasets demonstrate the efficacy of our method and its potential as a plug-and-play strategy for existing methods.

2 Related Works

The rapid progress of AI generative models has spawned the development of DeepFake detection methods. These methods mainly rely on deep neural networks to identify the inconsistency between real and fake faces using various features, including biological signals [17], spatial artifacts [14, 15, 16], frequency abnormality [13, 18, 19], and auto-learned clues from dedicatedly designed models [20, 21]. These methods have shown promising results on public datasets. However, their performance often deteriorates significantly when confronted with unknown DeepFake faces, due to the large distribution discrepancy resulting from limited training data. To tackle this issue, many methods have been proposed to improve generalizability by learning generic DeepFake traces, e.g., [10, 14, 15, 16, 12, 22, 23]. One effective approach is to create synthetic fake faces during training, known as pseudo-fake faces, e.g., [10, 14, 15, 16]. FWA [10] is a pioneering method that conducts self-blending to simulate fake faces. Several extended variants (Face X-ray [14], PCL [15], SBI [16], BiG-Arts [24]) have been proposed to blend faces using curated strategies, further improving detection performance. By increasing the diversity of training faces, the gap to the distribution of wild fake faces can be reduced, allowing the models to learn the invariant DeepFake traces across different distributions. To generate the pseudo-fake faces, existing methods usually design spatial blending operations to combine different faces. This involves extracting the face region from a source image and blending it into a target image. However, these methods overlook the distribution of wild fake faces in the frequency domain. While the synthetic faces may resemble the spatial distribution, the lack of consideration for the frequency perspective hinders the models from learning the fundamental generic DeepFake traces.

3 Preliminary Analysis

We perform a statistical analysis of the frequency distribution of real and fake faces and present preliminary results for the main frequency components corresponding to semantic information, structural information, and noise information, respectively.

Inspiration and Verification. The investigation in previous works [25, 26] has indicated that the forgery traces mainly exist in high-frequency areas. However, the precise range of these areas has not been described, driving us to re-investigate the frequency distribution of forgery traces.

Specifically, we conduct verification experiments using the FaceForensics++ (FF++) [5] dataset. We extract frames from all videos and randomly select 3,000 real images and 3,000 fake images for each manipulation method (i.e., DF, F2F, FS, and NT). Then we crop out the face region in these selected images using a face detector [27] and apply the DCT [28] to generate frequency maps. For analysis, we sum up all frequency maps of real and fake images and adopt the azimuthal-average visualization process described in previous work [29, 26]. This process involves a logarithmic transformation and the calculation of azimuthally averaged flux in circular annulus apertures. By placing the center of the circular annulus aperture at the top-left corner of the frequency map, we obtain a one-dimensional array representing the spectrum diagram. The visual results of their distribution are shown in Fig. 2 (top), which is consistent with the results in [25]. However, when we directly plot the distribution differences without the logarithmic operation, the results do not match the previous figure: the disparities in high-frequency regions are not as substantial as expected, while the differences in the lower range become more noticeable (see Fig. 2 (bottom)). This is because the logarithmic operation mitigates the degree of differences in lower frequency ranges, creating the illusion that only the high-frequency range exhibits differences between real and fake faces. Therefore, we conjecture that the forgery traces may not only be concentrated in a very high-frequency range but could also spread to the low-frequency range.
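As a concrete illustration, the sketch below shows one way such an analysis could be reproduced, assuming grayscale face crops and SciPy; the helper names (`dct_map`, `azimuthal_average`) are ours, not from the original pipeline.

```python
# Minimal sketch: DCT of a face crop followed by azimuthal averaging of the
# spectrum around the top-left (DC) corner, as used for Fig. 2.
import numpy as np
from scipy.fft import dctn

def dct_map(face_gray: np.ndarray) -> np.ndarray:
    """2D type-II DCT of a grayscale face crop (h x w, values in [0, 255])."""
    return dctn(face_gray.astype(np.float64), norm="ortho")

def azimuthal_average(freq_map: np.ndarray, n_bins: int = 100) -> np.ndarray:
    """Average |DCT| magnitude in circular annuli centered at the (0, 0) corner."""
    h, w = freq_map.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.sqrt(xx ** 2 + yy ** 2)              # distance from the DC corner
    bins = np.linspace(0, radius.max(), n_bins + 1)
    which = np.digitize(radius.ravel(), bins) - 1
    mag = np.abs(freq_map).ravel()
    return np.array([mag[which == i].mean() for i in range(n_bins)])

# spectrum = np.log2(azimuthal_average(dct_map(face)) + 1e-8)   # log-scaled curve (Fig. 2, top)
# diff = azimuthal_average(dct_map(real)) - azimuthal_average(dct_map(fake))  # raw difference (Fig. 2, bottom)
```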

Refer to caption
Figure 2: Statistics of frequency distribution. The top part shows the frequency distribution of real and fake faces using the algorithms in [29, 26]. The bottom part shows the frequency difference between real and fake. The values on the vertical axis are logarithmic with base 2.
Refer to caption
Figure 3: Visualization of the frequency difference between real and fake faces. The lighter color indicates the larger difference.

Hypothesis and Validation. As shown in Fig. 2, the most significant difference can be observed in the range of very low frequency. Given the significant dissimilarity in appearance between real and fake faces, we hypothesize that the semantic information is mainly represented in this low-frequency band. Moreover, we hypothesize that the mid-to-high frequency components capture the structural information, making them more susceptible to containing forgery traces. Furthermore, we hypothesize that the highest frequency components likely correspond to the noise introduced by various video preprocessing operations, such as compression, decompression, and encoding.

Refer to caption
Figure 4: Image visualization corresponding to different frequency components.

To validate our hypothesis, we directly visualize the difference between real and fake faces on their frequency maps in Fig. 3. By observing these results, we empirically split the frequency map into three non-overlapping bands, following the form of general band-pass filters. Denote a position in the frequency map as $(x,y)$, where $(0,0)$ and $(1,1)$ denote the top-left and bottom-right corners, respectively. Specifically, we identify the region where $x+y\leq 1/16$ as containing semantic information, the region where $1/16<x+y\leq 1/2$ as containing structural information, and the region where $x+y>1/2$ as containing noise information. The corresponding results are visualized in Fig. 4, providing empirical evidence that aligns with our frequency distribution hypothesis.
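For reference, a minimal sketch of how such band masks could be constructed is given below; these masks can serve as the prior maps $m_{\rm sem}, m_{\rm str}, m_{\rm noi}$ used later in Sec. 4.2 (the function name and normalization by map size are our assumptions).

```python
# Minimal sketch of the empirical three-band split above. Coordinates are
# normalized so that (0, 0) is the top-left and (1, 1) the bottom-right corner
# of the frequency map.
import numpy as np

def prior_masks(h: int, w: int):
    yy, xx = np.mgrid[0:h, 0:w]
    s = xx / (w - 1) + yy / (h - 1)                            # normalized x + y
    m_sem = (s <= 1 / 16).astype(np.float32)                   # semantic: x + y <= 1/16
    m_str = ((s > 1 / 16) & (s <= 1 / 2)).astype(np.float32)   # structural: 1/16 < x + y <= 1/2
    m_noi = (s > 1 / 2).astype(np.float32)                     # noise: x + y > 1/2
    return m_sem, m_str, m_noi
```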

4 FreqBlender

We describe a new method to create pseudo-fakes by blending specific frequency knowledge. The motivation is that existing methods only focus on spatial-domain blending, overlooking the disparity between real and fake faces in the frequency domain. By considering the frequency distribution, the pseudo-fakes can more closely resemble wild fake faces. To achieve this, we propose a Frequency Parsing Network (FPNet) to partition the frequency domain into three components, corresponding to semantic information, structural information, and noise information, respectively. We then blend the structural information of fake faces with real faces to generate the pseudo-fakes. The details of the Frequency Parsing Network are elaborated in Sec. 4.1, the objectives and training process for this network are described in Sec. 4.2, and the deployment of our method with existing methods is introduced in Sec. 4.3.

4.1 Frequency Parsing Network

Overview. The Frequency Parsing Network (FPNet) is composed of one shared encoder and three independent decoders. The encoder transforms the input faces into frequency-critical features and the decoders aim to decompose the feature from the encoder and extract the respective frequency components.

Denote the encoder as $\mathcal{E}$ and the three decoders as $\mathcal{D}_{\rm sem},\mathcal{D}_{\rm str},\mathcal{D}_{\rm noi}$, respectively. Given an input face image $\mathbf{x}\in\mathcal{X}\subset\{0,\dots,255\}^{h\times w\times 3}$, we first convert this face to the frequency map $\phi(\mathbf{x})\in\mathbb{R}^{h\times w\times 3}$, where $\phi$ denotes the Discrete Cosine Transform (DCT). Then we feed this frequency map into the model and generate three distribution maps $\mathcal{D}_{\rm sem}(\mathcal{E}(\phi(\mathbf{x})))\in[0,1]^{h\times w}$, $\mathcal{D}_{\rm str}(\mathcal{E}(\phi(\mathbf{x})))\in[0,1]^{h\times w}$, and $\mathcal{D}_{\rm noi}(\mathcal{E}(\phi(\mathbf{x})))\in[0,1]^{h\times w}$, respectively. Each distribution map indicates the probability that the corresponding frequency component occupies each position of the frequency map. Given these distribution maps, we can conveniently select the corresponding frequency components. For example, the frequency component corresponding to semantic information can be selected by $\phi_{\rm sem}(\mathbf{x})=\phi(\mathbf{x})\,\mathcal{D}_{\rm sem}(\mathcal{E}(\phi(\mathbf{x})))$, and similarly for the other two components, i.e., $\phi_{\rm str}(\mathbf{x})=\phi(\mathbf{x})\,\mathcal{D}_{\rm str}(\mathcal{E}(\phi(\mathbf{x})))$ and $\phi_{\rm noi}(\mathbf{x})=\phi(\mathbf{x})\,\mathcal{D}_{\rm noi}(\mathcal{E}(\phi(\mathbf{x})))$. The overview of FPNet is shown in Fig. 5 (left).

Refer to caption
Figure 5: Overview of the proposed Frequency Parsing Network (FPNet). Given an input face image, our method can partition it into three frequency components, corresponding to the semantic information, structural information, and noise information respectively. Since there is no ground truth, we propose four corollaries to supervise the training. The architecture of the encoder and decoders is shown in the right part.

Network Architecture. The encoder simply consists of four convolution layers with a kernel size of $3\times 3$, a stride of 2, and a padding of 1. Each decoder also consists of four layers, and each layer is a combination of a convolutional layer and a PixelShuffle operation [30] (see Fig. 5 (right)).
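A minimal PyTorch sketch of this architecture is given below. The layer types (3×3 convolutions with stride 2, and convolution + PixelShuffle pairs) follow the description above, while the channel widths, ReLU activations, and the sigmoid head producing the probability maps are our assumptions.

```python
# Minimal sketch of FPNet: a shared 4-layer conv encoder and three decoders
# that upsample back to full resolution with PixelShuffle.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, ch=(3, 32, 64, 128, 256)):
        super().__init__()
        layers = []
        for cin, cout in zip(ch[:-1], ch[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        self.net = nn.Sequential(*layers)          # (H, W) -> (H/16, W/16)

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, ch=(256, 128, 64, 32, 1)):
        super().__init__()
        layers = []
        for cin, cout in zip(ch[:-1], ch[1:]):
            layers += [nn.Conv2d(cin, cout * 4, 3, padding=1), nn.PixelShuffle(2), nn.ReLU(inplace=True)]
        layers[-1] = nn.Sigmoid()                  # final output is a probability map in [0, 1]
        self.net = nn.Sequential(*layers)          # (H/16, W/16) -> (H, W)

    def forward(self, z):
        return self.net(z)

class FPNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = Encoder()
        self.dec_sem, self.dec_str, self.dec_noi = Decoder(), Decoder(), Decoder()

    def forward(self, freq_map):                   # freq_map: DCT of the face, shape (B, 3, H, W)
        z = self.enc(freq_map)
        return self.dec_sem(z), self.dec_str(z), self.dec_noi(z)

# Frequency components are then selected element-wise, e.g. phi_sem = freq_map * d_sem.
```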

4.2 Objective Design for FPNet

The most challenging and crucial aspect of our method is training the network for frequency parsing, as there is no ground truth available for the different frequency components. Note that the only available resources for supervising the training are the preliminary analysis results in Section 3. Nevertheless, these results are not precise and cannot adapt to different inputs, making them insufficient for model training. Therefore, we meticulously craft a set of auxiliary objectives to guide the learning of the network, allowing for its self-refinement.

These objectives are designed based on the following proposition.

Proposition 1.

Each frequency component exhibits the following properties:

  1. Semantic information can reflect the facial identity.

  2. Structural information serves as the carrier of forgery traces.

  3. Noise information has minimal impact on visual quality.

  4. The preliminary analysis findings are generally applicable.

Corollary 1.

For a given face $\mathbf{x}$, the transformed face based on its semantic information will retain the same facial identity as $\mathbf{x}$, i.e., $\forall\,\mathbf{x}\in\mathcal{X}\subset\{0,\dots,255\}^{h\times w\times 3},\;\mathcal{F}(\phi^{-1}(\phi_{\rm sem}(\mathbf{x})))=\mathcal{F}(\mathbf{x})$, where $\mathcal{F}$ denotes a face recognition model and $\phi^{-1}$ denotes the Inverse Discrete Cosine Transform (IDCT).

Facial Fidelity Loss. We introduce a facial fidelity loss $\mathcal{L}_{\rm ff}$ to penalize the discrepancy in identity between the input face image and the spatial content represented by the semantic information. To measure the identity discrepancy, we employ MobileNet [31] as our face recognition model and train it with ArcFace [32, 33], selecting MobileNet for its balance between computational efficiency and recognition accuracy. Let $\mathcal{F}$ be the MobileNet and $\mathcal{F}_{f}(\mathbf{x})$ be the facial features extracted from the face image $\mathbf{x}$. The facial fidelity loss is defined as

$\mathcal{L}_{\rm ff}(\mathbf{x})=\lVert\mathcal{F}_{f}(\phi^{-1}(\phi_{\rm sem}(\mathbf{x})))-\mathcal{F}_{f}(\mathbf{x})\rVert^{2}_{2}.$ (1)

Note that the input face 𝐱\mathbf{x} can be either real or fake, as the identity information is present in both cases.
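A minimal sketch of this term is shown below; `face_recog` stands for the ArcFace-trained MobileNet feature extractor $\mathcal{F}_{f}$, and `dct2`/`idct2` are assumed DCT/IDCT helpers (e.g., a thin wrapper around `torch_dct` or SciPy), so the exact names are ours.

```python
# Minimal sketch of the facial fidelity loss in Eq. (1).
import torch.nn.functional as F

def facial_fidelity_loss(x, d_sem, face_recog, dct2, idct2):
    phi_sem = dct2(x) * d_sem                      # select the semantic frequency component
    x_sem = idct2(phi_sem)                         # transform back to the spatial domain
    # mean-squared error over identity features, as a proxy for the squared L2 norm
    return F.mse_loss(face_recog(x_sem), face_recog(x))
```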

Corollary 2.

For a given real face $\mathbf{x}_{r}$, it can be detected as fake if and only if the structural information from a fake face $\mathbf{x}_{f}$ is inserted into it, i.e., $\mathcal{D}(\mathbf{x}_{r})=0\;{\rm iff}\;\mathbf{x}_{r}\leftarrow\mathbf{x}_{r}\oplus\phi_{\rm str}(\mathbf{x}_{f})$, where $\mathcal{D}$ denotes a DeepFake detector with labels of fake and real in $\{0,1\}$, and $\oplus$ indicates the inserting operation.

Authenticity-determinative Loss. This loss is designed to emphasize the determinative role of structural information. To evaluate the authenticity of faces, we develop a DeepFake detector 𝒟\mathcal{D}, which is implemented using a ResNet-34 [34] trained on real and fake faces. Then we construct two sets of faces by blending frequency components.

The first set contains three types of faces transformed from frequency components corresponding to 1) the semantic information of the real face, 2) the semantic information of the fake face, and 3) the semantic information of the real face blended with the structural information of the real face. We denote this set as $\mathcal{C}_{r}=\{\phi^{-1}(\phi_{\rm sem}(\mathbf{x}_{r})),\phi^{-1}(\phi_{\rm sem}(\mathbf{x}_{f})),\phi^{-1}(\phi_{\rm sem}(\mathbf{x}_{r})+\phi_{\rm str}(\mathbf{x}_{r}))\}$. Since there is no structural information from fake faces in this set, all the faces should be detected as real.

Similarly, the second set contains two types of faces: 1) the semantic information of the fake face blended with the structural information of the fake face, and 2) the semantic information of the real face blended with the structural information of the fake face. We denote this set as $\mathcal{C}_{f}=\{\phi^{-1}(\phi_{\rm sem}(\mathbf{x}_{f})+\phi_{\rm str}(\mathbf{x}_{f})),\phi^{-1}(\phi_{\rm sem}(\mathbf{x}_{r})+\phi_{\rm str}(\mathbf{x}_{f}))\}$. Since all blended faces in this set contain the structural information of fake faces, they should be detected as fake. Thus the authenticity-determinative loss $\mathcal{L}_{\rm ad}$ can be written as

$\mathcal{L}_{\rm ad}(\mathbf{x}_{r},\mathbf{x}_{f})=\frac{1}{|\mathcal{C}_{r}|}\sum_{\mathbf{x}\in\mathcal{C}_{r}}{\rm CE}(\mathcal{D}(\mathbf{x}),1)+\frac{1}{|\mathcal{C}_{f}|}\sum_{\mathbf{x}\in\mathcal{C}_{f}}{\rm CE}(\mathcal{D}(\mathbf{x}),0)$ (2)

where CE denotes the cross-entropy loss.
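A minimal sketch of this term follows; `detector` stands for the auxiliary ResNet-34 classifier $\mathcal{D}$ (returning logits over {fake: 0, real: 1}), and `dct2`/`idct2` are the same assumed transform helpers as above.

```python
# Minimal sketch of the authenticity-determinative loss in Eq. (2).
import torch
import torch.nn.functional as F

def authenticity_loss(x_r, x_f, maps_r, maps_f, detector, dct2, idct2):
    sem_r, str_r = dct2(x_r) * maps_r["sem"], dct2(x_r) * maps_r["str"]
    sem_f, str_f = dct2(x_f) * maps_f["sem"], dct2(x_f) * maps_f["str"]
    # C_r: no fake structural information -> should be classified as real (label 1)
    c_real = [idct2(sem_r), idct2(sem_f), idct2(sem_r + str_r)]
    # C_f: contains fake structural information -> should be classified as fake (label 0)
    c_fake = [idct2(sem_f + str_f), idct2(sem_r + str_f)]
    ones = lambda x: torch.ones(x.size(0), dtype=torch.long, device=x.device)
    zeros = lambda x: torch.zeros(x.size(0), dtype=torch.long, device=x.device)
    loss_r = sum(F.cross_entropy(detector(x), ones(x)) for x in c_real) / len(c_real)
    loss_f = sum(F.cross_entropy(detector(x), zeros(x)) for x in c_fake) / len(c_fake)
    return loss_r + loss_f
```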

Corollary 3.

The face should exhibit no visible change if the frequency component of noise information is removed, i.e., $\forall\,\mathbf{x}\in\mathcal{X}\subset\{0,\dots,255\}^{h\times w\times 3},\;\mathbf{x}\approx\mathbf{x}\ominus\phi_{\rm noi}(\mathbf{x})$, where $\ominus$ indicates the removing operation.

Quality-agnostic Loss. As noise information does not contain decisive details for the overall depiction of the image, the face image is expected to be similar to the image reconstructed from only the semantic and structural frequency components. This similarity is quantified by the quality-agnostic loss $\mathcal{L}_{\rm qa}$, defined as

$\mathcal{L}_{\rm qa}(\mathbf{x})=\lVert\mathbf{x}-\phi^{-1}(\phi_{\rm sem}(\mathbf{x})+\phi_{\rm str}(\mathbf{x}))\rVert^{2}_{2},$ (3)

where the face 𝐱\mathbf{x} can be either real or fake.
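A minimal sketch of this term is given below, reusing the same assumed `dct2`/`idct2` helpers.

```python
# Minimal sketch of the quality-agnostic loss in Eq. (3): dropping the noise
# component should leave the image visually unchanged.
import torch.nn.functional as F

def quality_agnostic_loss(x, d_sem, d_str, dct2, idct2):
    phi = dct2(x)
    recon = idct2(phi * d_sem + phi * d_str)       # keep semantic + structural components only
    return F.mse_loss(recon, x)
```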

Corollary 4.

Each frequency component is bound by the preliminary results, i.e., there should be no significant deviation between the predicted frequency component and the approximate frequency distribution in preliminary analysis.

Prior and Integrity Loss. According to our analysis in the Preliminary Analysis section, we have an initial understanding of the approximate frequency distribution. Denote the initial frequency maps for semantic, structural, and noise information as $m_{\rm sem},m_{\rm str},m_{\rm noi}$, respectively. These maps are utilized to accelerate the convergence of the model towards the desired direction. Moreover, we add a constraint on the integrity of their distributions, ensuring that their combination covers all elements of the frequency map. This loss $\mathcal{L}_{\rm pi}$ can be expressed as

$\mathcal{L}_{\rm pi}=\lVert\mathcal{D}_{\rm sem}(\mathcal{E}(\phi(\mathbf{x})))-m_{\rm sem}\rVert^{2}_{2}+\lVert\mathcal{D}_{\rm str}(\mathcal{E}(\phi(\mathbf{x})))-m_{\rm str}\rVert^{2}_{2}+\lVert\mathcal{D}_{\rm noi}(\mathcal{E}(\phi(\mathbf{x})))-m_{\rm noi}\rVert^{2}_{2}+\lVert(\mathcal{D}_{\rm sem}(\mathcal{E}(\phi(\mathbf{x})))+\mathcal{D}_{\rm str}(\mathcal{E}(\phi(\mathbf{x})))+\mathcal{D}_{\rm noi}(\mathcal{E}(\phi(\mathbf{x}))))-\bm{1}\rVert^{2}_{2},$ (4)

where $\bm{1}$ denotes an all-ones mask.
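A minimal sketch of this term follows; `priors` holds the empirical masks $m_{\rm sem}, m_{\rm str}, m_{\rm noi}$ from the preliminary analysis (e.g., the `prior_masks` sketch in Sec. 3).

```python
# Minimal sketch of the prior-and-integrity loss in Eq. (4).
import torch
import torch.nn.functional as F

def prior_integrity_loss(d_sem, d_str, d_noi, priors):
    prior_term = (F.mse_loss(d_sem, priors["sem"])
                  + F.mse_loss(d_str, priors["str"])
                  + F.mse_loss(d_noi, priors["noi"]))
    # the three distribution maps should sum to one everywhere
    integrity_term = F.mse_loss(d_sem + d_str + d_noi, torch.ones_like(d_sem))
    return prior_term + integrity_term
```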

Overall Objectives. The overall objectives are the summation of all these loss terms, as

$\mathcal{L}=\lambda_{1}\mathcal{L}_{\rm ff}+\lambda_{2}\mathcal{L}_{\rm ad}+\lambda_{3}\mathcal{L}_{\rm qa}+\lambda_{4}\mathcal{L}_{\rm pi},$ (5)

where $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4}$ are the weights for the different loss terms.
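Putting the pieces together, a minimal sketch of one FPNet training step is shown below, reusing the loss sketches above. For brevity, the fidelity and quality terms are computed on the fake input only, whereas the paper applies them to both real and fake faces; the weights follow Eq. (5).

```python
# Minimal sketch of combining Eqs. (1)-(5) in one training step.
def fpnet_step(fpnet, x_r, x_f, face_recog, detector, priors, dct2, idct2,
               lambdas=(1 / 12, 1.0, 1e-3, 1 / 4)):
    d_sem_r, d_str_r, d_noi_r = fpnet(dct2(x_r))
    d_sem_f, d_str_f, d_noi_f = fpnet(dct2(x_f))
    maps_r = {"sem": d_sem_r, "str": d_str_r, "noi": d_noi_r}
    maps_f = {"sem": d_sem_f, "str": d_str_f, "noi": d_noi_f}
    l_ff = facial_fidelity_loss(x_f, d_sem_f, face_recog, dct2, idct2)
    l_ad = authenticity_loss(x_r, x_f, maps_r, maps_f, detector, dct2, idct2)
    l_qa = quality_agnostic_loss(x_f, d_sem_f, d_str_f, dct2, idct2)
    l_pi = prior_integrity_loss(d_sem_f, d_str_f, d_noi_f, priors)
    l1, l2, l3, l4 = lambdas
    return l1 * l_ff + l2 * l_ad + l3 * l_qa + l4 * l_pi
```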

4.3 Deployment of FreqBlender

Given a real face $\mathbf{x}_{r}$ and a fake face $\mathbf{x}_{f}$, we can generate a pseudo-fake face by

$\mathbf{x}^{\prime}_{f}=\phi^{-1}\big(\phi(\mathbf{x}_{r})\,\mathcal{D}_{\rm sem}(\mathcal{E}(\phi(\mathbf{x}_{f})))+\phi(\mathbf{x}_{f})\,\mathcal{D}_{\rm str}(\mathcal{E}(\phi(\mathbf{x}_{f})))+\phi(\mathbf{x}_{r})\,\mathcal{D}_{\rm noi}(\mathcal{E}(\phi(\mathbf{x}_{f})))\big).$ (6)

Note that in our method, it is not necessary to perform the blending using wild fake faces. Instead, we can tactfully substitute wild fake faces with the pseudo-fake faces generated by existing spatial face blending methods. It allows us to overcome the limitations in the frequency distribution of existing pseudo-fake faces.
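A minimal sketch of this blending step is given below; `dct2`/`idct2` are the same assumed transform helpers, and the frozen FPNet parses the frequency layout of the fake (or spatially blended pseudo-fake) face.

```python
# Minimal sketch of generating a pseudo-fake face via Eq. (6): the structural
# component comes from the fake face x_f, while the semantic and noise
# components come from the real face x_r.
import torch

@torch.no_grad()
def freq_blend(fpnet, x_r, x_f, dct2, idct2):
    d_sem, d_str, d_noi = fpnet(dct2(x_f))         # parse the fake face's frequency layout
    blended = (dct2(x_r) * d_sem                   # semantic information from the real face
               + dct2(x_f) * d_str                 # structural (forgery-bearing) information from the fake face
               + dct2(x_r) * d_noi)                # noise information from the real face
    return idct2(blended)
```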

5 Experiments

5.1 Experimental Setups

Datasets. Our method is evaluated on several standard datasets, including FaceForensics++ (FF++) [5], Celeb-DF (CDF) [6], DeepFake Detection Challenge (DFDC) [8], DeepFake Detection Challenge Preview (DFDCP) [7], and FFIW-10K (FFIW) [9]. Specifically, the FF++ dataset consists of 1,000 pristine videos and 4,000 manipulated videos corresponding to four different manipulation methods: Deepfakes (DF), Face2Face (F2F), FaceSwap (FS), and NeuralTextures (NT). The CDF dataset comprises 590 pristine videos and 5,639 high-quality fake videos created by applying DeepFake alterations to celebrity videos from YouTube. DFDC is a large-scale DeepFake dataset consisting of 100,000 video clips, and DFDCP is a preview version of DFDC that is also widely used in evaluation. The FFIW dataset contains 8,250 pristine videos and 8,250 DeepFake videos with multi-face scenarios. We follow the original training and testing splits provided by the datasets.

Implementation Details. Our method is implemented using PyTorch 2.0.1 [35] on an NVIDIA RTX 3090 Ti GPU. In the training stage of FPNet, the image size is set to $400\times 400$. The batch size is set to 8 and the Adam optimizer is used with an initial learning rate of $1\times 10^{-4}$. The number of training epochs is set to 200. The hyperparameters in the objective function in Eq. (5) are set as follows: $\lambda_{1}=1/12,\lambda_{2}=1,\lambda_{3}=1\times 10^{-3},\lambda_{4}=1/4$. For DeepFake detection, we employ the vanilla EfficientNet-b4 [36] as our model following [16]. In the training phase, we create pseudo-fake faces on-the-fly: we first generate synthetic faces using the spatial-blending method [16] and then blend them with real faces using our method with a probability of $\alpha=0.2$. Other training and testing settings are the same as [16]. Further analysis of the parameters is provided in the Supplementary.
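A minimal sketch of this on-the-fly sampling is shown below; `spatial_blend` stands in for an SBI-style generator and `freq_blend` is the sketch from Sec. 4.3, so the names and exact sampling scheme are our assumptions.

```python
# Minimal sketch of drawing pseudo-fakes during detector training: a
# spatial-blending pseudo-fake is produced first, and with probability
# alpha = 0.2 it is additionally passed through FreqBlender.
import random

def make_pseudo_fake(x_real, fpnet, spatial_blend, dct2, idct2, alpha=0.2):
    x_sp = spatial_blend(x_real)                   # spatial-blending pseudo-fake (e.g., SBI-style)
    if random.random() < alpha:
        return freq_blend(fpnet, x_real, x_sp, dct2, idct2)  # add frequency-domain forgery traces
    return x_sp
```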

Table 1: The cross-dataset evaluation of different methods. Blue indicates best and red indicates second best.
Method Input Type Training Set Test Set AUC (%)
Real Fake CDF DFDC DFDCP FFIW
Two-branch (ECCV’20) [37] Video 76.65 - - -
DAM (CVPR’21) [9] Video 75.3 - 72.8 -
LipForensics (CVPR’21) [1] Video 82.4 73.50 - -
FTCN (ICCV’21) [38] Video 86.9 71.00 74.0 74.47
SST (CVPR’24) [23] Video 89.0 - - -
DSP-FWA (CVPRW’19 [10]) Frame 69.30 - - -
Face X-ray (CVPR’20) [14] Frame - - - 71.15 -
Face X-ray (CVPR’20) [14] Frame - - 80.92 -
F3-Net (ECCV’20) [28] Frame 72.93 61.16 81.96 61.58
LRL (AAAI’21) [39] Frame 78.26 - 76.53 -
FRDM (CVPR’21) [40] Frame 79.4 - 79.7 -
PCL+I2G (ICCV’21) [15] Frame - 90.03 67.52 74.37 -
DCL (AAAI’22) [41] Frame 82.30 - 76.71 71.14
SBI (CVPR’22) [16] Frame - 92.94 72.08 85.51 85.99
SBI (CVPR’22) [16] Frame - 93.18 72.42 86.15 84.83
TALL-Swin (ICCV’23) [21] Frame 90.79 76.78 - -
UCF (ICCV’23) [12] Frame 82.4 80.5 - -
BiG-Arts (PR’23) [24] Frame 77.04 - 80.48 -
F-G (CVPR’24) [42] Frame 74.42 61.47 - -
LSDA (CVPR’24) [22] Frame 83.0 73.6 81.5 -
FreqBlender (Ours) Frame - 94.59 74.59 87.56 86.14

5.2 Results

To showcase the effectiveness of our method, we train it solely on the FF++ dataset and test it on the other datasets. We employ the Area Under the Receiver Operating Characteristic Curve (AUC) as the evaluation metric following previous work [16]. Our method is compared with five video-based detection methods, including Two-branch [37], DAM [9], LipForensics [1], FTCN [38], and SST [23]. Moreover, we involve thirteen frame-level state-of-the-art methods for comparison, which are DSP-FWA [10], Face X-ray [14], F3-Net [28], LRL [39], FRDM [40], PCL [15], DCL [41], SBI [16], TALL-Swin [21], UCF [12], SST [23], F-G [42], and LSDA [22], respectively.

Cross-dataset Evaluation. We evaluate the cross-dataset performance of our method compared to other counterparts in Table 1. The best performance is highlighted in blue and the second-best is marked in red. It should be noted that our method operates on pseudo-fake faces generated by SBI, thus we do not require wild fake faces. In comparison to video-level methods, our method achieves the best performance, outperforming all of them by a large margin.

When compared to frame-level methods, our method still outperforms the others. For example, our method improves upon the most relevant counterpart, SBI, by 1.41%, 2.17%, 1.41%, and 1.31% on CDF, DFDC, DFDCP, and FFIW, respectively. This improvement can be attributed to the incorporation of frequency knowledge in pseudo-fake faces, enhancing the generalization of detection models. Note that the performance of the compared methods (except SBI* and F3-Net) is taken from their original papers. SBI* denotes the performance obtained using the officially released code, and F3-Net is reproduced using a third-party implementation (https://github.com/Leminhbinh0209/F3Net). The reproduced results closely align with the reported scores, which verifies the correctness of our configuration of their code. In subsequent experiments, we use these released codes for comparison.

Cross-manipulation Evaluation. Since SBI is the most recent and effective method, we compare our method with it for demonstration. Specifically, we compare against two variants of SBI: the first is trained on the raw set of real videos in the FF++ dataset, while the second is trained on the c23 set. Following the standard protocol, all methods are tested on c23 videos. The results are shown in Table 2. Our method outperforms SBI-raw by 3.62% and SBI-c23 by 3.89%, demonstrating its efficacy in cross-manipulation scenarios.

Table 2: The cross-manipulation evaluation of different methods.
Method FF++ Avg
DF F2F FS NT
SBI-raw [16] 98.35 91.07 96.92 83.69 92.51
SBI-c23 [16] 98.60 92.60 95.44 82.30 92.24
FreqBlender (Ours) 99.18 96.76 97.68 90.88 96.13

5.3 Analysis

Table 3: Effect of each objective term.
Setting CDF DFDC DFDCP Avg
Baseline 91.69 72.69 86.67 83.68
w/o ff\mathcal{L}_{\rm ff} 94.01 74.30 86.86 85.06
w/o ad\mathcal{L}_{\rm ad} 93.31 72.97 86.32 84.20
w/o qa\mathcal{L}_{\rm qa} 94.28 74.42 87.25 85.32
w/o pi\mathcal{L}_{\rm pi} 93.78 74.03 85.99 84.60
All 94.59 74.59 87.56 85.58
Table 4: Effect of our method complementary to spatial-blending methods.
Method FF++ CDF DFDCP FFIW Avg
DSP-FWA [10] 48.14 62.91 60.74 40.65 53.11
DSP-FWA [10] + Ours 49.46 65.47 56.18 41.81 53.23
I2G [15] 59.56 53.55 48.02 46.75 51.97
I2G [15] + Ours 63.84 48.89 49.53 48.96 52.81
Face X-ray [14] 82.26 67.99 65.00 63.65 69.73
Face X-ray [14] + Ours 84.03 76.05 63.90 67.24 72.81

Effect of Each Objective Term. This part studies the effect of each objective term on the CDF, DFDC, and DFDCP datasets. The results are shown in Table 3. Note that Baseline denotes using only the prior and integrity loss $\mathcal{L}_{\rm pi}$, and “w/o” denotes without. Removing any single objective term leads to a performance drop on all datasets, which demonstrates that the objective terms have distinct impacts and that their combination contributes the most to our method.

Complementary to Spatial-blending Methods. To validate the complementarity of our method, we replace the SBI method with other spatial-blending methods and study whether the performance improves. Specifically, we reproduce the pseudo-fake face generation operations of DSP-FWA, I2G, and Face X-ray, and combine them with our method. Since I2G and Face X-ray have not released their official codes, we rigorously follow the instructions in their papers and use the implementations widely adopted by others (I2G: https://github.com/jtchen0528/PCL-I2G; Face X-ray: https://github.com/AlgoHunt/Face-Xray). The results on the FF++, CDF, DFDCP, and FFIW datasets are presented in Table 4. By combining with DSP-FWA, our method improves performance by 0.12% on average. A similar trend is observed for I2G, with an average improvement of 0.84%. For Face X-ray, we improve the performance by 3.08% on average.

Different Network Architectures. This part validates the effectiveness of our method on different networks, including ResNet-50 [34], EfficientNet-b1 [36], VGG16 [43], Xception [44], ViT [45], F3-Net [28], and GFFD [40]. We compare our method with SBI on these networks, tested on the CDF, FF++, DFDCP, and FFIW datasets. The results are shown in Table 5. It can be observed that our method improves the average performance by 1.78%, 0.41%, 1.51%, 0.2%, 2.75%, and 4.68% on ResNet-50, EfficientNet-b1, Xception, ViT, F3-Net, and GFFD, respectively. It is noteworthy that our method slightly reduces the performance of VGG16 by 0.01%, possibly because the capacity of VGG16 is more limited than that of the other networks; learning spatial pseudo-fake faces almost fills up this capacity, leaving little room for learning frequency knowledge.

Table 5: The effect of our method on different networks.
Method CDF FF++ DFDCP FFIW Avg
ResNet-50 [34] + SBI 84.82 95.39 73.51 81.67 83.85
ResNet-50 [34] + Ours 85.44 94.61 76.16 86.32 85.63
EfficientNet-b1 [36] + SBI 90.25 94.66 87.54 82.55 88.75
EfficientNet-b1 [36] + Ours 90.53 94.65 87.70 83.76 89.16
VGG16 [43] + SBI 78.22 93.05 74.13 87.26 83.16
VGG16 [43] + Ours 78.38 93.10 73.47 87.63 83.15
Xception [44] + SBI 87.00 91.40 75.68 70.24 81.08
Xception [44] + Ours 90.52 93.32 76.07 70.43 82.59
ViT [45] + SBI 85.85 96.09 87.71 86.05 88.92
ViT [45] + Ours 86.34 96.10 87.17 86.88 89.12
F3-Net [28] + SBI 84.94 93.42 79.29 73.42 82.77
F3-Net [28] + Ours 88.10 95.16 84.32 74.49 85.52
GFFD [40] + SBI 81.34 91.81 77.19 65.53 78.97
GFFD [40] + Ours 86.71 92.18 78.25 77.45 83.65
Refer to caption
Figure 6: Grad-CAM visualization of SBI and our method on four manipulation types of FF++ dataset. Compared to SBI, our method focuses more on the manipulated structural boundaries.

Saliency Visualization. We employ Grad-CAM [46] to visualize the attention of our method compared to SBI on four manipulations in the FF++ dataset. Compared to SBI, our method concentrates more on the structural information, such as the manipulation boundaries. For example, our method highlights the face outline in DF, F2F, and FS, while focusing on the mouth contour in NT.

Effect of Using Wild Fake or Spatial Pseudo-fake Faces. As described in Sec. 4.3, our method is performed using real and spatial-blending pseudo-fake (SP-fake) faces. The rationale is that SP-fake faces are greatly diversified, containing more spatial forgery traces. Applying our method to these faces can consider both frequency and spatial traces effectively. To verify this, we directly perform our method on real and wild fake faces. The results in Table 6 indicate a notable performance drop when only wild fakes are used.

Table 6: Effect of using wild fake or SP-fake faces.
SP-fake CDF DFDC DFDCP FFIW Avg
×\times 75.79 66.30 66.30 67.70 72.95
94.59 74.59 74.59 86.14 85.72

Limitations. Our method is designed to address the drawbacks of existing spatial-blending methods. Hence, it inherits the assumption that the faces are forged by face-swapping techniques. Further research is needed to validate our performance on other types of forgery operations, such as whole-face synthesis and attribute editing.

6 Conclusion

This paper describes a new method called FreqBlender that generates pseudo-fake faces by blending frequency knowledge. To achieve this, we propose a Frequency Parsing Network that adaptively extracts the frequency component corresponding to structural information. We then blend this information from fake faces into real faces to create pseudo-fake faces. Extensive experiments demonstrate the effectiveness of our method, which can also serve as a complementary module for existing spatial-blending methods.

Acknowledgement. This work is supported in part by the National Natural Science Foundation of China (No.62402464), Shandong Natural Science Foundation (No.ZR2024QF035), and China Postdoctoral Science Foundation (No.2021TQ0314; No.2021M703036). Jiaran Zhou is supported by the National Natural Science Foundation of China (No.62102380) and Shandong Natural Science Foundation (No.ZR2021QF095, No.ZR2024MF083). Baoyuan Wu is supported by Guangdong Basic and Applied Basic Research Foundation (No.2024B1515020095), National Natural Science Foundation of China (No. 62076213), and Shenzhen Science and Technology Program (No. RCYX20210609103057050). Bin Li is supported in part by NSFC (Grant U23B2022, U22B2047).

References

  • [1] Alexandros Haliassos, Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Lips don’t lie: A generalisable and robust approach to face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5039–5049, 2021.
  • [2] Yinan He, Bei Gan, Siyu Chen, Yichun Zhou, Guojun Yin, Luchuan Song, Lu Sheng, Jing Shao, and Ziwei Liu. Forgerynet: A versatile benchmark for comprehensive forgery analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 4360–4369, 2021.
  • [3] Kunlin Liu, Ivan Perov, Daiheng Gao, Nikolay Chervoniy, Wenbo Zhou, and Weiming Zhang. Deepfacelab: Integrated, flexible and extensible face-swapping framework. Pattern Recognition(PR), 141:109628, 2023.
  • [4] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.
  • [5] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pages 1–11, 2019.
  • [6] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 3207–3216, 2020.
  • [7] Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram, and Cristian Canton Ferrer. The deepfake detection challenge (dfdc) preview dataset. arXiv preprint arXiv:1910.08854, 2019.
  • [8] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge (dfdc) dataset. arXiv preprint arXiv:2006.07397, 2020.
  • [9] Tianfei Zhou, Wenguan Wang, Zhiyuan Liang, and Jianbing Shen. Face forensics in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 5778–5788, 2021.
  • [10] Yuezun Li and Siwei Lyu. Exposing deepfake videos by detecting face warping artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [11] Jiaming Li, Hongtao Xie, Jiahong Li, Zhongyuan Wang, and Yongdong Zhang. Frequency-aware discriminative feature learning supervised by single-center loss for face forgery detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 6458–6467, 2021.
  • [12] Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. Ucf: Uncovering common features for generalizable deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22412–22423, 2023.
  • [13] Yuan Wang, Kun Yu, Chen Chen, Xiyuan Hu, and Silong Peng. Dynamic graph learning with content-guided spatial-frequency relation reasoning for deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7278–7287, 2023.
  • [14] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face x-ray for more general face forgery detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 5001–5010, 2020.
  • [15] Tianchen Zhao, Xiang Xu, Mingze Xu, Hui Ding, Yuanjun Xiong, and Wei Xia. Learning self-consistency for deepfake detection. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pages 15023–15033, 2021.
  • [16] Kaede Shiohara and Toshihiko Yamasaki. Detecting deepfakes with self-blended images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18720–18729, 2022.
  • [17] Yipin Zhou and Ser-Nam Lim. Joint audio-visual deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14800–14809, 2021.
  • [18] Changtao Miao, Zichang Tan, Qi Chu, Nenghai Yu, and Guodong Guo. Hierarchical frequency-assisted interactive networks for face manipulation detection. IEEE Transactions on Information Forensics and Security (TIFS), 17:3008–3021, 2022.
  • [19] Qiqi Gu, Shen Chen, Taiping Yao, Yang Chen, Shouhong Ding, and Ran Yi. Exploiting fine-grained face forgery clues via progressive enhancement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 735–743, 2022.
  • [20] Junyi Cao, Chao Ma, Taiping Yao, Shen Chen, Shouhong Ding, and Xiaokang Yang. End-to-end reconstruction-classification learning for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4113–4122, 2022.
  • [21] Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, and Ran He. Tall: Thumbnail layout for deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22658–22668, 2023.
  • [22] Zhiyuan Yan, Yuhao Luo, Siwei Lyu, Qingshan Liu, and Baoyuan Wu. Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8984–8994, 2024.
  • [23] Jongwook Choi, Taehoon Kim, Yonghyun Jeong, Seungryul Baek, and Jongwon Choi. Exploiting style latent flows for generalizing deepfake video detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1133–1143, 2024.
  • [24] Han Chen, Yuezun Li, Dongdong Lin, Bin Li, and Junqiang Wu. Watching the big artifacts: Exposing deepfake videos via bi-granularity artifacts. Pattern Recognition (PR), 135:109179, 2023.
  • [25] Shuai Jia, Chao Ma, Taiping Yao, Bangjie Yin, Shouhong Ding, and Xiaokang Yang. Exploring frequency adversarial attacks for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4103–4112, 2022.
  • [26] Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 7890–7899, 2020.
  • [27] Davis King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research (JMLR), Dec 2009.
  • [28] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European conference on computer vision (ECCV), pages 86–103. Springer, 2020.
  • [29] Ricard Durall, Margret Keuper, Franz-Josef Pfreundt, and Janis Keuper. Unmasking deepfakes with simple features. arXiv preprint arXiv:1911.00686, 2019.
  • [30] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1874–1883, 2016.
  • [31] Andrew G Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [32] Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Cotsia, and Stefanos P Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), page 1–1, Jan 2021.
  • [33] Jiahao Liang, Huafeng Shi, and Weihong Deng. Exploring disentangled content information for face forgery detection. In European Conference on Computer Vision (ECCV), pages 128–145. Springer, 2022.
  • [34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
  • [35] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [36] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, 09–15 Jun 2019.
  • [37] Iacopo Masi, Aditya Killekar, Royston Marian Mascarenhas, Shenoy Pratik Gurudatt, and Wael AbdAlmageed. Two-branch recurrent network for isolating deepfakes in videos. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 667–684. Springer, 2020.
  • [38] Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, and Fang Wen. Exploring temporal coherence for more general video face forgery detection. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pages 15044–15054, 2021.
  • [39] Shen Chen, Taiping Yao, Yang Chen, Shouhong Ding, Jilin Li, and Rongrong Ji. Local relation learning for face forgery detection. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 1081–1088, 2021.
  • [40] Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. Generalizing face forgery detection with high-frequency features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 16317–16326, 2021.
  • [41] Ke Sun, Taiping Yao, Shen Chen, Shouhong Ding, Jilin Li, and Rongrong Ji. Dual contrastive learning for general face forgery detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2316–2324, 2022.
  • [42] Li Lin, Xinan He, Yan Ju, Xin Wang, Feng Ding, and Shu Hu. Preserving fairness generalization in deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16815–16825, 2024.
  • [43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [44] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1251–1258, 2017.
  • [45] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  • [46] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision (IJCV), page 336–359, Feb 2020.
  • [47] Ziqi Huang, Kelvin C.K. Chan, Yuming Jiang, and Ziwei Liu. Collaborative diffusion for multi-modal face generation and editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [48] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 4401–4410, 2019.
  • [49] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 8110–8119, 2020.
  • [50] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Machine Learning (ICML), 2015.
  • [51] Jiakai Wang. Adversarial examples in physical world. In IJCAI, pages 4925–4926, 2021.
  • [52] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.
  • [53] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE symposium on security and privacy (SP), pages 39–57. IEEE, 2017.

Appendix A Appendix / supplemental material

Various Loss Weights $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4}$. These weights are empirically selected based on experimental results. As shown in Table 7, we evaluate the performance across various settings of $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4}$. The results show that our method is not particularly sensitive to the loss weights, with performance variations within approximately 0.5%. For our main experiments, we select $\lambda_{1}=1/12,\lambda_{2}=1,\lambda_{3}=0.001,\lambda_{4}=1/4$, as this setting yields the best performance.

Table 7: Effect of our method under different loss weight settings.
λ1\lambda_{1} λ2\lambda_{2} λ3\lambda_{3} λ4\lambda_{4} CDF FF++ FFIW DFDCP Avg
1 1 1 1 93.16 96.30 87.82 85.32 90.65
0.1 1 0.1 0.5 93.59 95.60 84.78 86.60 90.14
0.1 1 0.01 0.5 94.27 96.11 85.54 87.81 90.93
0.01 1 0.0001 0.1 93.60 96.00 85.85 87.38 90.71
1/12 1 0.001 1/4 94.59 96.13 86.14 87.56 91.11

More Details of Complementary to Spatial-blending Methods. Table 8 shows the detailed results for every manipulation in the FF++ dataset. It can be seen that our method improves the average performance of all three spatial-blending methods, by 1.32% for DSP-FWA, 4.28% for I2G, and 1.77% for Face X-ray. This improvement further demonstrates the effectiveness of our method.

Table 8: Per-manipulation effect on FF++ of our method complementary to spatial-blending methods.
Method FF++
DF F2F FS NT Avg
DSP-FWA [10] 55.48 43.79 49.26 44.05 48.14
DSP-FWA [10] + Ours 56.20 45.93 53.96 41.74 49.46
I2G [15] 47.83 82.13 60.82 47.47 59.56
I2G [15] + Ours 56.90 81.26 61.66 55.52 63.84
Face X-ray [14] 89.38 85.02 83.45 71.17 82.26
Face X-ray [14] + Ours 93.21 85.41 82.70 74.79 84.03

Probability $\alpha$ in FreqBlender. As shown in Table 9, we evaluate the effect of various probabilities $\alpha$, where $\alpha$ denotes the probability that a synthetic face is additionally processed by FreqBlender after spatial blending. $\alpha=1$ denotes only using FreqBlender, while $\alpha=0$ means only using spatial blending. It can be observed that using our method alone does not perform well, as no color-space knowledge is involved, hindering the overall effectiveness of the pseudo-fake faces. In contrast, using spatial-blending operations alone already reaches favorable performance. However, when pseudo-fake faces generated by FreqBlender are inserted, the generalization performance is further enhanced, as our method compensates for the missing frequency knowledge. The optimal effect on the CDF evaluation is achieved when $\alpha$ is set to 0.2. This experiment demonstrates the complementary effect of our method to existing spatial-blending methods. More generalization experiments on additional datasets are reported in the main paper, demonstrating the effectiveness of our method.

Table 9: Effect of our method on different data proportions.
α\alpha FF++ CDF
DF F2F FS NT Avg
1 63.18 59.85 65.79 54.61 60.86 56.34
0.8 98.58 94.80 97.69 89.83 95.23 90.27
0.5 99.03 96.99 97.81 91.68 96.38 91.83
0.3 99.12 97.15 97.93 90.91 96.28 93.76
0.2 99.18 96.76 97.68 90.88 96.13 94.59
0.1 99.11 97.11 97.93 90.91 96.26 93.76
0 99.17 97.63 97.77 91.35 96.50 93.58
Refer to caption
Figure 7: Heatmap visualization of the frequency difference between our method and the wild fake faces (Deepfakes, FaceSwap, Face2Face, NeuralTextures). Note that α=0\alpha=0 denotes that our method is degraded to SBI [16].

Visual Demonstration of FreqBlender. The goal of our method is to create pseudo-fake faces that resemble the frequency distribution of wild fake faces. To verify this, we conduct a visual experiment on the FF++ dataset. Specifically, we randomly select 3,000 images from each manipulation method (Deepfakes, FaceSwap, Face2Face, NeuralTextures) and calculate the average frequency map for each manipulation method. Then we create the same number of pseudo-fake faces using our method and calculate their average frequency map. Finally, we visualize the frequency difference between our method and wild fake faces. The results are illustrated in Fig. 7. It is important to note that $\alpha$ represents the probability of applying FreqBlender. Hence, $\alpha=1$ means all pseudo-fake faces are created using our method, while $\alpha=0$ denotes that our method is degraded to SBI. From this figure, we can see that the difference is minimal when $\alpha=1$. As $\alpha$ decreases, the difference becomes larger because the pseudo-fake faces are more likely created by SBI alone. This demonstrates that our method effectively simulates the frequency distribution of wild fake faces.

Performance on FF++ Low-quality (LQ). To investigate the performance of our method on low-quality videos, we conduct experiments on the FF++ Low-quality (LQ) set. The results in Table 10 show that while all methods experience substantial performance drops on the LQ set, our method consistently achieves the highest performance, demonstrating better generalization capability on low-quality videos compared to other methods.

Tentative Validation on Various Synthesized Faces. In addition to validation on the standard datasets, we investigate a new scenario: Diffusion-based face-swap deepfake detection. In this scenario, we employ a recent diffusion model (Collaborative Diffusion [47]) to synthesize faces, which are then blended into original videos. We create 200 fake faces and evaluate our method in Table 11. It can be seen that our method is effective in detecting such forged faces.

Moreover, although our method is designed for face-swapping techniques, we also test it on face images generated by StyleGAN [48] and StyleGAN2 [49]. The results, also shown in Table 11, reveal a notable performance drop across all methods. However, our method still outperforms the others.

Table 10: Performance of our method on FF++ (LQ).
Method FF++(LQ)
I2G [15] 52.20
Face X-ray [14] 65.41
SBI [16] 76.11
FreqBlender 77.56
Table 11: Performance on Diffusion-based face-swap and GAN-generated images.
Method Diffusion-based [47] StyleGAN [48] StyleGAN2 [49]
I2G [15] 63.51 47.89 43.86
Face X-ray [14] 89.81 59.11 66.54
SBI [16] 91.70 63.99 72.88
FreqBlender 94.74 64.39 76.70

Performance against Evasion Attacks. We employ the widely used TorchAttacks library with four well-known attack methods: FGSM [50], BIM [51], PGD [52], and CW [53]. The attack experiment is conducted on the CDF dataset with six models, using the default attack configuration. The results are shown in Table 12. It can be seen that, without any defense strategy, all models can be easily attacked in the white-box setting, which aligns with our expectations and the findings reported in the papers of FGSM, BIM, etc.

Table 12: AUC (%) performance against evasion attacks.
Method Original Attacked
FGSM BIM PGD CW
FWA [10] 62.91 27.43 0 0 0
I2G [15] 53.35 3.26 0 0 0
Face-Xray [14] 67.99 54.66 0 0 0
SBI-c23 [16] 92.94 27.19 0 0 0
SBI-raw [16] 93.18 11.48 0 0 0
Ours 94.59 22.31 0 0 0