Towards General Deepfake Detection with Dynamic Curriculum
Abstract
Most previous deepfake detection methods strive to discriminate forgery artifacts through end-to-end training. However, the learned networks often fail to mine general face forgery information efficiently because they ignore data hardness. In this work, we propose to introduce sample hardness into the training of deepfake detectors via the curriculum learning paradigm. Specifically, we present a simple yet effective strategy, named Dynamic Facial Forensic Curriculum (DFFC), which makes the model gradually focus on hard samples during training. First, we propose Dynamic Forensic Hardness (DFH), which integrates the facial quality score and the instantaneous instance loss to dynamically measure sample hardness during training. Furthermore, we present a pacing function that controls the progression of training subsets from easy to hard throughout the training process based on DFH. Comprehensive experiments show that DFFC improves both within- and cross-dataset performance of various kinds of end-to-end deepfake detectors in a plug-and-play manner. This indicates that DFFC helps deepfake detectors learn general discriminative forgery features by effectively exploiting the information in hard samples.
Index Terms— Deepfake detection, Curriculum learning, Face image quality
1 Introduction
Deepfake techniques [1, 2, 3] refer to a family of deep learning-based facial forgery techniques that can swap or reenact one person's face onto another in a video. They can be abused to spread false information or even enable political manipulation. Thus, detecting deepfakes has become a crucial research topic in recent years.
Early works [4, 5, 6, 7] treat deepfake detection as a binary classification problem and commonly use deep neural networks to distinguish fake faces. To improve detection performance, some works [8, 9, 10] introduce auxiliary modalities or supervision information for learning subtle forgery artifacts. However, these methods cannot mine general face forgery information efficiently because they ignore data hardness, resulting in poor generalization in real-world scenarios. To address this issue, some works introduce pseudo-forgery augmentations [11, 12, 13, 14] to enlarge the diversity of forgery artifacts, but they are prone to failure on samples with heavy post-processing.
As shown in Figure 1, judging whether facial images/videos are real or fake is difficult even for human vision because of varying visual quality. Take post-processing as an example: post-processing on real faces can be misinterpreted as forgery artifacts, while post-processing on fake faces can erase them. Analogously, the difficulty that deepfake detectors face varies from sample to sample. However, most existing detection methods treat all samples equally during training. Thus, we argue that sample hardness should be taken into account when training a general deepfake detector.
Curriculum learning (CL) [15, 16, 17], which presents training samples to the model in a specific order of hardness, is an effective scheme for hard sample mining [18, 19, 20]. Inspired by this paradigm, we present a simple yet effective strategy, named Dynamic Facial Forensic Curriculum (DFFC), that makes the model gradually focus on hard samples during training. We introduce Dynamic Forensic Hardness (DFH), which integrates the facial quality score and the instantaneous instance loss to dynamically assess sample hardness throughout training. Additionally, we present a pacing function that guides the progression of training subsets from easy to hard based on the DFH score over the training iterations. Experimental results demonstrate that DFFC improves both within- and cross-dataset performance of various kinds of deepfake detectors in a plug-and-play manner. The main contributions of our work are summarized as follows:
- To the best of our knowledge, the proposed DFFC is the first work to introduce the curriculum learning paradigm for mining hard samples in the deepfake detection task. DFFC is plug-and-play for any end-to-end deepfake detector.
- To measure sample hardness, we propose the DFH score, which dynamically integrates the instantaneous instance loss and the facial quality score during training.
- To adequately mine forgery information from hard samples, we present a pacing function that gradually controls the training subsets from easy to hard according to DFH throughout the training process.
2 Methodology
2.1 Overview
In this section, we introduce the proposed Dynamic Facial Forensic Curriculum (DFFC). The overall pipeline (see Figure 2) mainly consists of two components: 1) Dynamic Forensic Hardness (DFH), which measures sample hardness during training; and 2) a pacing function, which controls the pace of presenting data from easy to hard according to DFH. Each component is detailed below.
2.2 Dynamic Forensic Hardness
The difficulty score plays a key role in curriculum learning, as it describes the relative "hardness" of each sample. In this work, we propose Dynamic Forensic Hardness (DFH), which considers both the dynamic model behavior and the static facial quality. Let $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^{N}$ be the training dataset with $N$ samples, where $x_i$ and $y_i$ represent the $i$-th sample and its ground-truth label, respectively. Let $f_{\theta_t}$ be the deepfake detector with parameters $\theta_t$ at the $t$-th epoch. We regard the loss (i.e., the binary cross-entropy loss, denoted as $\ell(f_{\theta_t}(x_i), y_i)$) of the $i$-th sample at the $t$-th epoch, evaluated before conducting the training step, as an indicator of the current hardness of this sample as judged by the current state of the model. Thus, we propose the instantaneous hardness $a_t(i)$, which normalizes the current loss by the learning rate $\eta_t$, formulated as:

$$a_t(i) = \frac{\eta_t}{\eta_{\max}}\,\ell\big(f_{\theta_t}(x_i), y_i\big), \tag{1}$$
where $\eta_{\max}$ is the maximum learning rate during training. Inspired by Dynamic Instance Hardness (DIH) [21], we measure the dynamic hardness $D_t(i)$ through a moving average of the instantaneous hardness over the training history, defined and computed recursively as:

$$D_{t+1}(i) = \begin{cases} \gamma\, a_t(i) + (1-\gamma)\, D_t(i), & i \in S_t, \\ D_t(i), & \text{otherwise}, \end{cases} \tag{2}$$

where $\gamma \in (0,1)$ is a discount factor and $S_t$ is the subset of hard samples at the $t$-th epoch selected by the pacing function.
As mentioned in Section 1, detecting low-quality faces is harder than detecting high-quality ones for deepfake detectors. We regard the facial visual quality as a prior hardness for the deepfake detection task. To this end, we utilize a pre-trained facial quality assessment model, IFQA [22], as a teacher to provide the prior hardness $Q(i)$. Finally, we obtain DFH by integrating $D_t(i)$ and $Q(i)$ as:

$$\mathrm{DFH}_t(i) = D_t(i) + \lambda\, Q(i), \tag{3}$$

where $\lambda$ is a balance weight.
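To make the bookkeeping concrete, the following PyTorch-style sketch tracks DFH for every training sample. The class name, the default hyper-parameter values, and the exact mapping from IFQA scores to the prior hardness $Q(i)$ are our own assumptions, not the authors' released code.

```python
import torch

class DynamicForensicHardness:
    """Minimal per-sample DFH bookkeeping following Eqs. (1)-(3)."""

    def __init__(self, num_samples, quality_prior, gamma=0.9, lam=0.5, eta_max=1e-2):
        # quality_prior: tensor of prior hardness Q(i) derived from IFQA
        # scores (the exact mapping is an assumption, not the paper's).
        self.D = torch.zeros(num_samples)   # dynamic hardness D_t(i), Eq. (2)
        self.Q = quality_prior              # static facial-quality prior Q(i)
        self.gamma = gamma                  # discount factor gamma in Eq. (2)
        self.lam = lam                      # balance weight lambda in Eq. (3)
        self.eta_max = eta_max              # maximum learning rate eta_max

    def update(self, indices, losses, eta_t):
        # Eq. (1): instantaneous hardness = per-sample loss scaled by the
        # ratio of the current learning rate to the maximum learning rate.
        a = (eta_t / self.eta_max) * losses.detach().cpu()
        # Eq. (2): EMA update, applied only to the samples selected in S_t.
        self.D[indices] = self.gamma * a + (1 - self.gamma) * self.D[indices]

    def dfh(self):
        # Eq. (3): combine dynamic hardness with the facial-quality prior.
        return self.D + self.lam * self.Q
```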
2.3 Pacing Function
To control the learning pace of presenting data from easy to hard, we design a pacing function that determines the pool of training data according to DFH; the pipeline is summarized in Algorithm 1, and a code sketch is given below. As in human education, if a teacher presents materials from easy to hard over too short a period, students become confused and do not learn effectively. Thus, we define a pacing sequence $\{T_k\}_{k=0}^{K}$ to represent milestones (i.e., episodes) within the total number of training epochs $T$. During the first $T_0$ epochs, we use all samples in $\mathcal{D}$ for warm-up training. After epoch $T_0$, we change the size of $S_t$ only at each milestone $T_k$. Specifically, at each epoch of episode $k$, we select the $n_k$ samples with the top DFH in $\mathcal{D}$ (i.e., the hardest samples) as a hard sample pool $S_t^{h}$. As training proceeds, we reduce $n_k$ by a discount factor $\delta$ at each milestone (i.e., $n_{k+1} = \delta\, n_k$) so that the model gradually focuses on harder samples. To further enlarge the diversity of the data, we also select the samples with the bottom DFH in $\mathcal{D}$ (i.e., the easiest samples) as an easy sample pool $S_t^{e}$ and apply lightweight data augmentations (e.g., Gaussian blur, brightness adjustment, and affine transformation) to them. Finally, we obtain the sample pool $S_t$ by mixing the two, i.e., $S_t = S_t^{h} \cup S_t^{e}$. After that, we sample mini-batches from $S_t$ and conduct a vanilla training step to update the model parameters to $\theta_{t+1}$.
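The per-episode selection step can be sketched as follows; a minimal illustration, assuming the tracker from the earlier sketch, with the helper names (`pool_size_schedule`, `dffc_select_pool`), the augmentation flags, and the example values in the usage comments being our own conventions.

```python
import torch

def pool_size_schedule(n0, delta, milestones, epoch):
    """Hard-pool size n_k: start from n0 and shrink by the discount
    factor delta at every milestone T_k that has already passed."""
    k = sum(epoch >= m for m in milestones)
    return max(1, int(n0 * delta ** k))

def dffc_select_pool(dfh_scores, n_hard, n_easy):
    """Build the epoch's pool S_t: the n_hard top-DFH samples plus the
    n_easy bottom-DFH samples, the latter flagged for lightweight
    augmentation (Gaussian blur, brightness, affine)."""
    order = torch.argsort(dfh_scores, descending=True)
    hard_idx = order[:n_hard].tolist()      # hardest samples (top DFH)
    easy_idx = order[-n_easy:].tolist()     # easiest samples (bottom DFH)
    return [(i, False) for i in hard_idx] + [(i, True) for i in easy_idx]

# Usage before each epoch after warm-up (illustrative values only):
# n_k = pool_size_schedule(n0=50_000, delta=0.9, milestones=[5, 10, 15], epoch=t)
# pool = dffc_select_pool(tracker.dfh(), n_hard=n_k, n_easy=n_k // 4)
```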
Table 1: Within-dataset evaluation on FF++ (LQ) across five manipulation methods.

Methods | DF | F2F | FS | NT | FSh
---|---|---|---|---|---
Xception[23] | 95.15 | 83.48 | 92.09 | 77.89 | 94.96 |
+ DIH | 96.34 | 90.14 | 94.05 | 78.93 | 95.32 |
+ DFFC | 96.43 | 90.88 | 94.64 | 79.50 | 95.36 |
ENb4[24] | 94.26 | 86.86 | 92.72 | 75.99 | 94.54 |
+ DIH | 97.11 | 90.55 | 95.34 | 80.29 | 95.41 |
+ DFFC | 97.28 | 91.22 | 95.38 | 80.65 | 95.59 |
MAT[7] | 95.36 | 88.36 | 93.21 | 77.50 | 94.62 |
+ DIH | 97.25 | 90.20 | 95.23 | 78.95 | 95.45 |
+ DFFC | 97.59 | 90.55 | 95.66 | 79.05 | 95.12 |
Swin[25] | 95.10 | 88.57 | 92.00 | 76.29 | 94.17 |
+ DIH | 96.78 | 90.55 | 95.40 | 80.04 | 95.43 |
+ DFFC | 97.37 | 91.17 | 95.59 | 80.33 | 95.47 |
SPSL[8] | 93.82 | 86.82 | 91.73 | 75.97 | 91.26 |
+ DIH | 96.10 | 89.42 | 94.32 | 77.59 | 94.37 |
+ DFFC | 96.19 | 89.55 | 94.57 | 77.94 | 94.50 |
SRM[9] | 94.63 | 87.71 | 91.27 | 76.39 | 93.36 |
+ DIH | 95.38 | 89.40 | 93.23 | 76.46 | 94.30 |
+ DFFC | 96.48 | 89.92 | 93.97 | 77.82 | 94.89 |
Table 2: Cross-dataset evaluation: trained on FF++/DF (HQ) and tested on unseen datasets.

Methods | DF | CDF | Wild | DFDC-P | DFD
---|---|---|---|---|---
Xception[23] | 99.38 | 70.74 | 60.06 | 80.93 | 90.60 |
+ DIH | 99.71 | 80.45 | 59.91 | 81.15 | 88.68 |
+ DFFC | 99.71 | 82.12 | 65.87 | 81.61 | 93.43 |
ENb4[24] | 99.56 | 74.50 | 60.04 | 79.37 | 92.34 |
+ DIH | 99.58 | 81.19 | 60.43 | 81.68 | 92.00 |
+ DFFC | 99.59 | 85.01 | 68.56 | 89.33 | 95.51 |
MAT[7] | 99.48 | 76.39 | 61.10 | 76.30 | 92.56 |
+ DIH | 99.49 | 80.46 | 60.23 | 81.18 | 92.37 |
+ DFFC | 99.50 | 84.37 | 66.71 | 82.87 | 94.73 |
Swin[25] | 99.66 | 73.53 | 69.72 | 88.18 | 93.07 |
+ DIH | 99.68 | 91.69 | 68.37 | 86.98 | 92.44 |
+ DFFC | 99.86 | 92.26 | 71.75 | 90.63 | 93.50 |
SPSL[8] | 99.36 | 76.88 | 61.51 | 80.93 | 91.56 |
+ DIH | 99.31 | 81.06 | 58.81 | 82.05 | 92.38 |
+ DFFC | 99.68 | 82.24 | 62.36 | 82.10 | 92.23 |
SRM[9] | 99.68 | 73.01 | 60.79 | 79.11 | 89.46 |
+ DIH | 99.64 | 79.01 | 57.81 | 79.29 | 91.88 |
+ DFFC | 99.74 | 81.42 | 61.22 | 85.99 | 93.68 |
3 Experiments
3.1 Experiment Settings
Datasets and pre-processing. In this paper, we mainly conducted experiments on the FaceForensics++ (FF++) dataset [5]. The samples were split into disjoint training, validation, and testing sets at the video level following the official protocol. For pre-processing, we utilized MTCNN to detect and crop the face regions (enlarged by a factor of 1.3) from each video frame and resized them to 256×256.
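As an illustration, the cropping step might look like the following; a sketch assuming the `facenet-pytorch` MTCNN implementation, which is our choice of detector library, not necessarily the authors'.

```python
import cv2
from facenet_pytorch import MTCNN

detector = MTCNN(select_largest=True, device="cpu")

def crop_face(frame_bgr, scale=1.3, size=256):
    """Detect the largest face, enlarge its box by `scale`, and resize
    the crop to size x size, mirroring the described pre-processing."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    boxes, _ = detector.detect(rgb)
    if boxes is None:
        return None                                   # no face in this frame
    x1, y1, x2, y2 = boxes[0]
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2             # box center
    half = max(x2 - x1, y2 - y1) * scale / 2          # enlarge by factor 1.3
    x1, y1 = int(max(cx - half, 0)), int(max(cy - half, 0))
    x2 = int(min(cx + half, rgb.shape[1]))
    y2 = int(min(cy + half, rgb.shape[0]))
    return cv2.resize(rgb[y1:y2, x1:x2], (size, size))
```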
Implementation details. We employed an SGD optimizer with a cosine learning rate scheduler with maximum learning rate $\eta_{\max}$. For DFFC, we trained for 20 epochs in total with the pacing sequence $\{T_k\}$ and the hyper-parameters $\gamma$, $\lambda$, and $\delta$ introduced in Section 2.
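The stated optimization setup corresponds to the following sketch; the learning rate and momentum values are placeholder assumptions, as the excerpt does not report them.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # placeholder; the paper's detectors plug in here

# SGD with a cosine learning-rate schedule over the 20 training epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

for epoch in range(20):
    # ... one DFFC epoch: pool selection, vanilla updates, DFH update ...
    scheduler.step()   # eta_t follows the cosine decay used in Eq. (1)
```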
Table 3: Cross-manipulation evaluation: Xception trained on FF++ DF (HQ) and tested on the remaining manipulations.

Method | F2F | FS | NT | FSh | Avg
---|---|---|---|---|---
Xception | 71.13 | 54.87 | 76.95 | 74.72 | 69.42 |
+DIH | 76.70 | 56.50 | 76.90 | 68.14 | 69.56 |
+DFFC | 84.34 | 62.06 | 78.70 | 75.14 | 75.06 |
3.2 Evaluation of Detection Performance
In this part, we deployed the proposed DFFC on several end-to-end deepfake detectors for evaluation. We considered three typical kinds of detection methods: 1) spatial detectors, i.e., Xception [23], EfficientNet (ENb4) [24], SwinTransformerV2 (Swin) [25], and MAT [7]; 2) a frequency detector, i.e., SPSL [8]; and 3) a detector containing both spatial and frequency branches, i.e., SRM [9]. We reproduced the aforementioned methods using their official code and initialized them with ImageNet weights.
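For instance, ImageNet-initialized backbones could be loaded via `timm`; this is only an illustration of the initialization, since the authors used each method's official code.

```python
import timm

# Hypothetical backbone initialization with ImageNet weights via timm.
backbones = {
    "xception": timm.create_model("xception", pretrained=True, num_classes=2),
    "enb4": timm.create_model("efficientnet_b4", pretrained=True, num_classes=2),
}
```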
Within-dataset evaluations. We evaluated the performance of detecting five manipulation methods on FF++ (LQ). As shown in Table 1, the proposed DFFC improves the performance of all detectors.
Cross-dataset evaluations. We evaluated the generalization capability of the proposed DFFC by training on FF++/DF (HQ) and testing on several benchmark datasets, i.e., Celeb-DF (CDF) [26], WildDeepfake (Wild) [27], Deepfake Detection (DFD, http://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html), and the Deepfake Detection Challenge preview (DFDC-P) [28]. As shown in Table 2, our DFFC improves detection performance on all benchmarks.
Cross-manipulation evaluations. We conducted cross-manipulation experiments on FF++, where the model was trained on the DF (HQ) subset and tested on the remaining four manipulations. Taking Xception as an example, the results are shown in Table 3; our DFFC improves detection performance on all benchmarks.
Ablation studies of facial quality hardness. In this part, we investigated the impact of facial quality hardness. We trained detectors with DIH [21] alone, which removes the facial quality hardness from DFFC. As shown in Table 1 and Table 2, introducing DIH alone still improves both within- and cross-dataset performance on most benchmarks, while additionally introducing the facial quality hardness improves it further. This demonstrates that both the dynamic hardness and the facial quality prior are beneficial for training deepfake detectors.
Comparison with other training strategies. In this part, we compared DFFC with other training strategies, including vanilla training and BabyStep [15, 29, 30], the simplest CL strategy, which utilizes a static pre-defined hardness. We implemented BabyStep by using the IFQA [22] score as the pre-defined hardness and adopting the pacing setting in [29]. Furthermore, we also investigated the impact of data augmentations. As shown in Figure 3, both the CL paradigm and data augmentations improve performance over vanilla training in most cases. However, introducing data augmentations into BabyStep suffers severe performance degradation, because the augmented data no longer matches its pre-defined static hardness. This demonstrates that a dynamic CL strategy with data augmentations (i.e., our DFFC) is beneficial for general deepfake detection.
Table 4: Pixel-level tampering ratio (TAR) and SSIM between fake samples with top/bottom DFH and their corresponding real faces.

Metric | DFH-idx | DF | F2F | FS | NT | FSh
---|---|---|---|---|---|---
TAR (%) | Top | 2.72 | 1.13 | 1.17 | 3.25 | 2.48
TAR (%) | Bottom | 6.84 | 8.26 | 11.31 | 5.40 | 10.84
SSIM | Top | 0.9874 | 0.9920 | 0.9915 | 0.9803 | 0.9886
SSIM | Bottom | 0.9012 | 0.9489 | 0.9376 | 0.9674 | 0.9334
3.3 Analysis of DFFC Properties
In this part, we analyzed some properties of DFFC. We conducted the subsequent experiments using Xception trained on FF++/DF (HQ).
How does DFH change during training? In this part, we explored the training dynamics of DFFC. We selected samples with the top (hard), bottom (easy), and median DFH, respectively, and illustrated the variation of their DFH during training. As shown in Figure 4, DFH decreases over the course of training for all samples. We can also observe that the DFH of easy samples remains small throughout training. This is because deepfake detectors learn to identify easy samples in the early stages of training, so DFFC rarely updates their DFH. The detectors, however, need more time to mine the forgery clues of the difficult samples. This indicates that, as learning continues, easy samples become less informative, so we can select and train on fewer hard samples.
What do hard samples look like? We explored the properties of samples with different hardness mined by DFFC. We illustrate the samples with top and bottom DFH in Figure 5. We can observe that the fake faces with top DFH are of relatively high quality compared to those with bottom DFH and cannot be easily distinguished. The fake samples with bottom DFH usually contain clearer forgery clues, such as color inconsistency. For real faces, top-DFH samples usually exhibit heavy post-processing, which is easily confused with fakes. This demonstrates that the DFH score mined by our DFFC is in line with human visual perception. We also computed the pixel-level tampering ratio (TAR) and SSIM [31] between fake samples and their corresponding real faces. As shown in Table 4, fake faces with top DFH have a low TAR and high SSIM, which indicates that these samples contain fewer forgery artifacts and are more similar to the corresponding real faces. It makes sense that deepfake detectors would have difficulty identifying these samples.
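For reference, these two measurements could be computed as follows; a sketch under the assumption that a pixel counts as tampered when its absolute difference from the real face exceeds a threshold, since the excerpt does not specify the exact criterion.

```python
import numpy as np
from skimage.metrics import structural_similarity

def tamper_ratio_and_ssim(fake, real, diff_thresh=10):
    """TAR (% of tampered pixels) and SSIM [31] between a fake face and
    its corresponding real face (both uint8 RGB arrays of equal shape).
    diff_thresh is an assumed binarization threshold, not the paper's."""
    diff = np.abs(fake.astype(np.int32) - real.astype(np.int32)).max(axis=-1)
    tar = 100.0 * (diff > diff_thresh).mean()     # pixel-level tampering ratio
    ssim = structural_similarity(fake, real, channel_axis=-1, data_range=255)
    return tar, ssim
```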
4 Conclusion
In this paper, we propose DFFC as a general training strategy for deepfake detection. First, we present DFH, a metric that integrates the facial quality score and the instantaneous instance loss to dynamically evaluate sample hardness during training. Furthermore, we introduce a pacing function that controls the training subsets from easy to hard based on DFH throughout the training iterations. Together, these make deepfake detectors gradually focus on hard samples and mine general forgery clues during training.
References
- [1] Yuval Nirkin, Yosi Keller, and Tal Hassner, “FSGAN: Subject Agnostic Face Swapping and Reenactment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7184–7193.
- [2] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen, “Advancing High Fidelity Identity Swapping for Forgery Detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5074–5083.
- [3] Yuval Nirkin, Yosi Keller, and Tal Hassner, “FSGANv2: Improved Subject Agnostic Face Swapping and Reenactment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 560–575, 2023.
- [4] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “MesoNet: A Compact Facial Video Forgery Detection Network,” in 2018 IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7.
- [5] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Niessner, “FaceForensics++: Learning to Detect Manipulated Facial Images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1–11.
- [6] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao, “Thinking in Frequency: Face Forgery Detection by Mining Frequency-Aware Clues,” in ECCV, 2020, pp. 86–103.
- [7] Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Tianyi Wei, Weiming Zhang, and Nenghai Yu, “Multi-Attentional Deepfake Detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2185–2194.
- [8] Honggu Liu, Xiaodan Li, Wenbo Zhou, Yuefeng Chen, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu, “Spatial-Phase Shallow Learning: Rethinking Face Forgery Detection in Frequency Domain,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 772–781.
- [9] Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu, “Generalizing Face Forgery Detection With High-Frequency Features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16317–16326.
- [10] Han Chen, Yuzhen Lin, and Bin Li, “Exposing Face Forgery Clues via Retinex-based Image Enhancement,” in Proceedings of the Asian Conference on Computer Vision, 2022, pp. 602–617.
- [11] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo, “Face X-Ray for More General Face Forgery Detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5001–5010.
- [12] Tianchen Zhao, Xiang Xu, Mingze Xu, Hui Ding, Yuanjun Xiong, and Wei Xia, “Learning Self-Consistency for Deepfake Detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15023–15033.
- [13] Kaede Shiohara and Toshihiko Yamasaki, “Detecting Deepfakes With Self-Blended Images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18720–18729.
- [14] Han Chen, Yuzhen Lin, Bin Li, and Shunquan Tan, “Learning Features of Intra-Consistency and Inter-Diversity: Keys Toward Generalizable Deepfake Detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 3, pp. 1468–1480, 2023.
- [15] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 41–48.
- [16] Xin Wang, Yudong Chen, and Wenwu Zhu, “A Survey on Curriculum Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 4555–4576, 2022.
- [17] Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe, “Curriculum Learning: A Survey,” International Journal of Computer Vision, vol. 130, no. 6, pp. 1526–1565, 2022.
- [18] Yajing Kong, Liu Liu, Jun Wang, and Dacheng Tao, “Adaptive Curriculum Learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5067–5076.
- [19] Yulin Wang, Yang Yue, Rui Lu, Tianjiao Liu, Zhao Zhong, Shiji Song, and Gao Huang, “EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5852–5864.
- [20] Yuzhen Lin, Rangding Wang, Diqun Yan, Li Dong, and Xueyuan Zhang, “Audio Steganalysis with Improved Convolutional Neural Network,” in Proceedings of the ACM Workshop on Information Hiding and Multimedia Security, 2019, pp. 210–215.
- [21] Tianyi Zhou, Shengjie Wang, and Jeffrey Bilmes, “Curriculum Learning by Dynamic Instance Hardness,” in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 8602–8613.
- [22] Byungho Jo, Donghyeon Cho, In Kyu Park, and Sungeun Hong, “IFQA: Interpretable Face Quality Assessment,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 3444–3453.
- [23] Francois Chollet, “Xception: Deep Learning With Depthwise Separable Convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
- [24] Mingxing Tan and Quoc Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” in International Conference on Machine Learning (PMLR), 2019, pp. 6105–6114.
- [25] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo, “Swin Transformer V2: Scaling Up Capacity and Resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12009–12019.
- [26] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu, “Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3207–3216.
- [27] Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang, “WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2382–2390.
- [28] Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram, and Cristian Canton Ferrer, “The Deepfake Detection Challenge (DFDC) Preview Dataset,” arXiv:1910.08854 [cs], 2019.
- [29] Guy Hacohen and Daphna Weinshall, “On The Power of Curriculum Learning in Training Deep Networks,” in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 2535–2544.
- [30] Volkan Cirik, Eduard Hovy, and Louis-Philippe Morency, “Visualizing and understanding curriculum learning for long short-term memory networks,” arXiv preprint arXiv:1611.06204, 2016.
- [31] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.