Identifying Rhythmic Patterns for Face Forgery Detection and Categorization
Abstract
With the emergence of GANs, face forgery technologies have been heavily abused, and accurate face forgery detection is urgently needed. Remote photoplethysmography (rPPG) shows that the PPG signal corresponds to the periodic change of skin color caused by the heartbeat in face videos. Inspired by this, we observe that despite the inevitable loss of PPG signal during the forgery process, a mixture of PPG signals still remains in the forged video, with a rhythmic pattern unique to its generation method. Motivated by this key observation, we propose a framework for face forgery detection and categorization consisting of: 1) a Spatial-Temporal Filtering Network (STFNet) for PPG signal filtering, and 2) a Spatial-Temporal Interaction Network (STINet) for constraint and interaction of PPG signals. Moreover, with insight into how forgery methods generate their results, we further propose intra-source and inter-source blending to boost the performance of the framework. Extensive experiments demonstrate the superiority of our method.
1 Introduction
In the past decade, with the rapid development of digital cameras, communication technologies and mobile devices, video has become a common medium of entertainment and communication in people's lives. At the same time, face forgery technologies represented by DeepFakes [2] have flooded the Internet. Unfortunately, due to their extremely low barrier to entry and wide accessibility, these technologies have been maliciously applied to create pornographic content [1], fake news, financial fraud, and smear campaigns against politicians. Their unrestricted use not only damages privacy, law, and politics but also causes a serious crisis of trust in society.
Therefore, accurate face forgery detection is urgently needed. Some studies [19] temporarily obtain high accuracy by learning the pixel differences between real and forged images, but face forgery technology will keep improving against exactly this pixel-level supervision until it evades detection, much like GAN generators and discriminators competing with each other. To break this deadlock, we need to exploit the differences between real and fake videos more comprehensively and grasp the essential forgery cues for detection.
Heart rate, as a significant physiological signal of the human body, has a certain regularity over a continuous period of time. The blood flow caused by the heartbeat forms periodic changes in the capillaries of human skin tissue, resulting in periodic changes in the absorbed and reflected light, and these changes, reflecting the heart rate signal, can be analyzed and acquired from face videos [31]. The related technology is called remote photoplethysmography (rPPG), which is widely used in heart rate measurement and face anti-spoofing.

In this paper, we exploit the unique rhythmic pattern of the PPG signal for forgery detection. We are motivated by the key observation that despite the inevitable loss of PPG signal during the forgery process, a mixture of PPG signals still remains in the forged video, with a rhythmic pattern unique to its generation method, as shown in Figure 1. We therefore design a framework for face forgery detection and categorization. We use a Spatial-Temporal Filtering Network to filter out the irrelevant noise in the PPG signal. To further exploit the potential connections between adjacent PPG signals, we propose a Spatial-Temporal Interaction Network that lets the PPG signals interact and constrain each other through a bidirectional LSTM and an adjacency constraint, respectively. Moreover, with insight into how forgery methods generate their results, we further propose intra-source and inter-source blending to boost the performance of the framework. Extensive experiments prove the superiority of our method: it not only achieves remarkable improvements in both face forgery detection and categorization, but also exhibits strong robustness and generalization.
The contributions of this paper could be summarized as three-fold:
- We propose a framework for face forgery detection and categorization, with a Spatial-Temporal Filtering Network (STFNet) for PPG signal filtering and a Spatial-Temporal Interaction Network (STINet) for interaction and constraint of PPG signals.
- With insight into how forgery methods generate their results, we further propose intra-source and inter-source data blending to boost the performance of the framework.
- Our method not only achieves remarkable improvements in both face forgery detection and categorization, but also exhibits strong robustness and generalization capabilities.
2 Related Work
2.1 Face Forgery Generation
Face forgery has received considerable attention over the past decade. Earlier studies used graphics-based methods to implement this technology [4, 9]. Although graphics-based methods have kept evolving, they have not become widespread due to their complexity and high cost. Since the publication of GANs by Goodfellow et al. [14] in 2014, the field has made impressive progress and attracted a lot of research. Choi et al. [5] proposed StarGAN, which performs multi-domain image translation within one dataset. To address the dilemma that GANs are hard and slow to train, Karras et al. [16] proposed ProGAN, which progressively generates images from low to high quality. Overall, these GAN-based methods can easily turn random noise into indistinguishable images, which makes face forgery easier and widely available to the public.
2.2 Face Forgery Detection
Due to the rapid development of deep learning and GANs, face forgery technology has been progressing quickly. To prevent its illegal use, researchers are exploring how to identify it. Earlier studies used hand-crafted features [24, 13] for detection, but these methods do not generalize well. In recent years, methods with automatic learning ability based on deep learning have been proposed. Ding et al. [12] proposed a transfer learning model to detect swapped faces. Liu et al. [21] proposed an architecture that exploits textures in combination with a Gram Block module. Li et al. [17] assumed that a forged face is a blend of two different images and transformed the task into detecting the stitching boundary for better generalization. Although these methods achieve high accuracy, they do not take advantage of deeper features and may fail when facing new generations of forgery schemes.
2.3 Remote Photoplethysmography
Since the proposal of remote photoplethysmography (rPPG) [31], a number of heart rate measurement methods based on it have been developed [22, 32, 23]. It is precisely because of the uniqueness of the rPPG signal that researchers have gradually applied it to face anti-spoofing [18].
At the same time, face forgery detection methods based on rPPG have also been proposed. Ciftci et al. [7] analyzed the biological signals of fake and real video pairs, constructed a generalized classifier, and converted the signals into PPG maps to train a simple detection network. Qi et al. [25] proposed a motion-magnified spatial-temporal representation and a dual-spatial-temporal attention network to adapt to changing faces and various fake types. However, these methods do not make full use of the PPG signal. To further exploit its advantages, our method takes into account the noise in the PPG signals as well as the interaction and constraints of adjacent PPG signals.

3 Method
3.1 Generation of Spatial Temporal Map
Since the PPG signal does not simply rely on the texture information of the images, it cannot be represented directly using the original unprocessed images. Moreover, the PPG signal is very weak and susceptible to interference. To enable an accurate and comprehensive representation of the PPG signal, inspired by [23], we generate a spatial-temporal map, as shown in Figure 2.
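As a rough illustration, the sketch below builds such a map from per-frame sub-ROI averages in the spirit of [23]; the ROI layout, array shapes, and min-max normalization are our assumptions, since the exact construction is deferred to Figure 2.

```python
import numpy as np

def spatial_temporal_map(frames, rois, eps=1e-8):
    """Build a spatial-temporal map from one video clip (a sketch).

    frames: array of shape (T, H, W, 3), face-aligned RGB frames.
    rois:   list of n (y0, y1, x0, x1) sub-ROI boxes on the aligned face.
    Returns an array of shape (T, n, 3): per-frame, per-sub-ROI channel means,
    min-max normalized along the temporal axis (normalization scheme assumed).
    """
    T = frames.shape[0]
    n = len(rois)
    st_map = np.zeros((T, n, 3), dtype=np.float32)
    for t in range(T):
        for j, (y0, y1, x0, x1) in enumerate(rois):
            # spatial average pooling inside the sub-ROI suppresses sensor noise
            st_map[t, j] = frames[t, y0:y1, x0:x1].reshape(-1, 3).mean(axis=0)
    # temporal min-max normalization per sub-ROI and channel
    lo = st_map.min(axis=0, keepdims=True)
    hi = st_map.max(axis=0, keepdims=True)
    return (st_map - lo) / (hi - lo + eps)
```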

3.2 Intra-Source and Inter-Source Blending
It is observed that most existing face forgery methods share a common step: different original face images are transformed and then blended to generate forged face images. As mentioned in Section 2.3, each face video contains PPG signals corresponding to the unique heartbeat pattern of the human body. Although the loss of PPG signals is inevitable during the forgery generation process, the forged face videos still contain mixed PPG signals: a mixture of different residual PPG signals that may no longer correspond to human heart rate information but can reflect the unique rhythmic pattern of the face forgery method. Naturally, we propose a simple data blending scheme that imitates this step for data augmentation.
Unlike forgery methods, which process each image separately, we take into account that the spatial-temporal map already combines PPG signals from different sub-ROIs of the faces in a video clip, and blend at the map level, which is straightforward yet more effective.
Specifically, as shown in Figure 3, given two spatial-temporal maps $M_1$ and $M_2$, we generate a new spatial-temporal map $M_{new}$ by blending $M_1$ and $M_2$:

$$M_{new} = \lambda M_1 + (1 - \lambda) M_2 \qquad (1)$$

where $\lambda$ denotes the blending coefficient, which is uniformly distributed in $[0, 1]$. Moreover, since our detection is a classification task, the data blending can be divided into intra-source and inter-source blending, i.e., whether the two original spatial-temporal maps belong to the same source of face forgery or not. Correspondingly, the loss function of the framework differs, as detailed in Section 3.4. Intuitively, intra-source blending expands the data diversity within a source, while inter-source blending can be considered as further expanding the variety of sources.
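The blending itself is a single element-wise mixture. A minimal sketch, assuming the two maps are NumPy arrays of the same shape and that both source labels are kept for the loss in Section 3.4:

```python
import numpy as np

def blend_maps(m1, m2, c1, c2, rng=np.random):
    """Blend two spatial-temporal maps as in Eq. (1) (a sketch).

    m1, m2: spatial-temporal maps of identical shape.
    c1, c2: their source labels (equal for intra-source blending).
    Returns the blended map, the coefficient lambda, and the label pair
    needed by the blended loss of Eq. (4).
    """
    lam = rng.uniform(0.0, 1.0)           # blending coefficient, lambda ~ U(0, 1)
    m_new = lam * m1 + (1.0 - lam) * m2   # element-wise mixture of the two maps
    return m_new, lam, (c1, c2)
```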

3.3 Framework
In previous research on face forgery detection based on the rPPG principle, some network architectures are too simple [7] to make comprehensive use of the PPG signal, while others are too complex [25] and require considerable time. Moreover, since the input of the network is a spatial-temporal map rather than an ordinary two-dimensional image, mainstream network architectures may not be particularly suitable. Therefore, we design a framework containing a Spatial-Temporal Filtering Network (STFNet) for PPG signal filtering and a Spatial-Temporal Interaction Network (STINet) for constraint and interaction of PPG signals.
3.3.1 Spatial-Temporal Filtering Network
Although we can eliminate some noise (e.g., camera quantization noise) in the spatial-temporal map by pooling over each ROI, some irrelevant components (e.g., noise caused by illumination or motion) remain. We need to further filter this correlated interference to improve the signal-to-noise ratio of the PPG signal. Some previous methods [10, 11] use hand-crafted features based on assumptions that hold only in specific environments to eliminate these irrelevant components. However, such methods are not robust and may fail in realistic situations where the assumptions do not hold. Without such assumptions, we propose a Spatial-Temporal Filtering Network (STFNet) with a residual module to filter out the irrelevant noise in the PPG signal.
The PPG signal is so weak that it is difficult to learn it directly from the spatial-temporal map, while the irrelevant noise signals are relatively more stable. Inspired by ResNet [15], we design a module with a residual structure: the backbone layer learns the irrelevant signal, and the shortcut layer learns the difference between the PPG signal and the irrelevant signal. Finally, element-wise addition of these two layers yields the filtered PPG signal.
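A minimal PyTorch sketch of one such filtering block is given below; the layer types, channel counts, and activations are our assumptions, as the paper does not specify them.

```python
import torch
import torch.nn as nn

class STFBlock(nn.Module):
    """One filtering block in the spirit of STFNet (a sketch, sizes assumed).

    The backbone branch models the slowly varying irrelevant component
    (illumination / motion noise); the shortcut branch models the residual
    between the PPG signal and that component; their element-wise sum is
    taken as the filtered PPG representation.
    """
    def __init__(self, channels):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.shortcut = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        # element-wise addition of the two branches, as in a residual block
        return torch.relu(self.backbone(x) + self.shortcut(x))
```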
3.3.2 Spatial-Temporal Interaction Network
The heart rate has a certain periodic pattern, and the PPG signal reflects the heart rate information. To explore the temporal information in it, we propose a Spatial-Temporal Interaction Network containing a Bi-LSTM and an adjacency constraint.
Bi-LSTM. PPG signals are continuous, and intuitively there are potential connections between adjacent PPG signals. To explore this temporal information, we utilize a bidirectional LSTM to transfer information between adjacent video clips. Specifically, we simultaneously feed adjacent spatial-temporal maps into the STFNet, pass the filtered PPG signals through a convolutional feature extraction network, and then feed the extracted features into a one-layer bidirectional LSTM. Finally, the mean of the LSTM outputs is taken as our classification probability, and the cross-entropy loss is used as the loss function:
$$\mathcal{L}_{ce}(c) = -\log\Big(\frac{1}{N}\sum_{i=1}^{N} p_i(c)\Big) \qquad (2)$$

where $c$ denotes the target face forgery source, $p_i(c)$ is the probability that the LSTM output of the $i$-th clip assigns to $c$, and $N$ denotes the number of adjacent video clips.
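A possible PyTorch sketch of this interaction head is shown below; the feature dimension, hidden size, class count, and the way the averaged probability enters the cross-entropy of Eq. (2) are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STINetHead(nn.Module):
    """Interaction head in the spirit of STINet (a sketch, sizes assumed)."""
    def __init__(self, feat_dim=256, hidden=128, num_classes=6):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip_feats):
        # clip_feats: (B, N, feat_dim) features of N adjacent clips
        # coming from STFNet followed by a convolutional feature extractor
        h, _ = self.bilstm(clip_feats)             # (B, N, 2*hidden)
        logits = self.fc(h)                        # per-clip class scores
        probs = F.softmax(logits, dim=-1).mean(1)  # mean over clips, cf. Eq. (2)
        return probs

def ce_loss(probs, target):
    # cross-entropy on the averaged probability, cf. Eq. (2)
    return -torch.log(probs.gather(1, target.view(-1, 1)) + 1e-8).mean()
```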
Adjacency Constraint. Since we divide the video into clips of length $T$ with a step size $s$ ($s < T$), adjacent video clips overlap by $T - s$ frames. Therefore, when adjacent video clips are simultaneously fed into the STFNet, the filtered PPG signals will also overlap over a length of $T - s$. These overlapping parts should be related. However, due to preprocessing such as normalization, they are not necessarily identical, but they should still exhibit similar waveform variations. To measure the linear similarity error of the overlapping parts, we use the negative Pearson coefficient to calculate the adjacency loss:
$$\mathcal{L}_{adj} = \frac{1}{N-1}\sum_{i=1}^{N-1}\Big(1 - \frac{\operatorname{cov}(x_i, x_{i+1})}{\sigma_{x_i}\,\sigma_{x_{i+1}}}\Big) \qquad (3)$$

where $x_i$ and $x_{i+1}$ denote the overlapping parts of the filtered PPG signals of two adjacent video clips, $\operatorname{cov}(\cdot,\cdot)$ denotes the covariance, $\sigma$ denotes the standard deviation, and $N$ denotes the number of adjacent video clips.
Intuitively, the adjacency loss can be considered as additional supervisory information, which guides STFNet to filter out the noise of the PPG signal more accurately. In other words, via adjacency constraint, the STFNet can focus on extracting PPG signals instead of interference.
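A hedged sketch of Eq. (3) for one pair of adjacent clips, assuming the overlapping segments have already been cropped to tensors of equal length:

```python
import torch

def adjacency_loss(x, y, eps=1e-8):
    """Negative-Pearson adjacency loss for one pair of adjacent clips (a sketch).

    x, y: (B, L) tensors holding the overlapping parts of the filtered PPG
    signals of two adjacent video clips (L is the overlap length).
    """
    xc = x - x.mean(dim=1, keepdim=True)
    yc = y - y.mean(dim=1, keepdim=True)
    cov = (xc * yc).mean(dim=1)                                    # covariance of the overlaps
    std = xc.std(dim=1, unbiased=False) * yc.std(dim=1, unbiased=False)
    pearson = cov / (std + eps)
    return (1.0 - pearson).mean()                                  # 1 - rho, averaged over the batch
```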
3.4 Loss of the Framework
Overall, we summarize the loss function of our framework. The final loss can be written as:

$$\mathcal{L} = \lambda\,\mathcal{L}_{ce}(c_1) + (1-\lambda)\,\mathcal{L}_{ce}(c_2) + \beta\,\mathcal{L}_{adj} \qquad (4)$$

where $c_1$ and $c_2$ denote the two source classes of the blended spatial-temporal maps, $\lambda$ is the blending coefficient of Eq. (1), and $\beta$ is the weight for balancing the losses. For intra-source data blending, $c_1$ and $c_2$ belong to the same source $c$ and the loss function simplifies to:

$$\mathcal{L} = \mathcal{L}_{ce}(c) + \beta\,\mathcal{L}_{adj} \qquad (5)$$

where $c$ denotes the class of the source.
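Putting the pieces together, a sketch of Eq. (4) is shown below; it reuses the `ce_loss` helper from the STINet sketch above and the value returned by `adjacency_loss`, and the default balance weight `beta` is an assumption.

```python
def framework_loss(probs, c1, c2, lam, l_adj, beta=1.0):
    """Total loss of Eq. (4) under our notation (a sketch; beta assumed).

    probs: (B, num_classes) averaged class probabilities from the STINet head.
    c1, c2: source labels of the two blended maps (identical for intra-source blending).
    lam:   blending coefficient from Eq. (1).
    l_adj: adjacency loss from Eq. (3), e.g. the output of adjacency_loss().
    """
    l_ce = lam * ce_loss(probs, c1) + (1.0 - lam) * ce_loss(probs, c2)
    return l_ce + beta * l_adj   # reduces to Eq. (5) when c1 == c2
```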
4 Experiments
Datasets. To demonstrate the effectiveness of our method, we select two of the most challenging datasets for validation: FaceForensics++ [26] and Celeb-DF [20].
FaceForensics++ contains 1,000 original videos downloaded from the Internet. Five forgery video sub-datasets are generated by computer graphics-based and learning-based methods, i.e., Deepfakes [3], Face2Face [30], FaceSwap [4], NeuralTextures [29], and FaceShifter; each sub-dataset contains 1,000 videos. The FaceForensics++ dataset provides three video quality versions, i.e., RAW (raw video), HQ (constant rate quantization parameter of 23), and LQ (constant rate quantization parameter of 40). It also contains a dataset, DFD, dedicated to Deepfakes, with 363 real videos and 3,068 forgery videos.
Celeb-DF used 59 celebrity interview videos on YouTube as the original videos. In total, 590 real videos and 5,639 DeepFake videos are included.
Implementation Details. Unless otherwise specified, the dataset is divided with a ratio of 7:3, the number of frames of the video segment is , the number of sub-ROIs is , the size of the spatial-temporal map is , and the loss balance weight is . For a complete video, we use a frame step of 16 to obtain all video clips. All code is implemented in PyTorch (https://pytorch.org/) and trained on an NVIDIA GTX 1080Ti. We use SGD as our optimizer with an initial learning rate of 0.1, halved every 10 epochs. The maximum number of epochs is 100.
Prediction Aggregation. Since a video contains multiple video clips, we take the sum of the predicted probabilities of all video clips as the final predicted probability of the whole video.
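A minimal sketch of this aggregation step, assuming the per-clip probabilities are stacked into a single tensor:

```python
import torch

def aggregate_video_prediction(clip_probs):
    """Aggregate per-clip predictions into a video-level prediction (a sketch).

    clip_probs: (num_clips, num_classes) predicted probabilities of all clips
    extracted from one video (clips taken with a frame step of 16).
    """
    video_scores = clip_probs.sum(dim=0)   # sum of per-clip probabilities
    return video_scores.argmax().item()    # predicted source of the whole video
```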
4.1 Results
To verify the presence and unique rhythmic pattern of the PPG signal in each forgery method, in other words, to verify that each face forgery source has its own rhythmic pattern rather than all sources sharing the same one, we conduct pairwise detection experiments (real-fake and fake-fake) and a face forgery categorization experiment (1 real and 5 fakes) on the six sub-datasets of FaceForensics++, as shown in Table 2 and Figure 5, respectively. The results show that we achieve high accuracy in both pairwise detection and face forgery categorization.
Table 1: Accuracy (%) of face forgery detection (real-fake, Det.) and face forgery categorization (1 real and 5 fakes, Cat.) on FaceForensics++.

| Method | Det. DF | Det. F2F | Det. FSW | Det. NT | Det. FSH | Cat. Real | Cat. DF | Cat. F2F | Cat. FSW | Cat. NT | Cat. FSH | Cat. Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG [27] | 50.00 | 50.00 | 50.13 | 50.00 | 61.06 | 26.35 | 0.00 | 77.70 | 0.00 | 0.00 | 0.00 | 18.64 |
| ResNet [15] | 89.95 | 77.39 | 74.81 | 74.12 | 82.16 | 97.31 | 98.69 | 94.24 | 92.56 | 88.55 | 98.86 | 95.29 |
| Inception V3 [28] | 99.74 | 64.36 | 87.74 | 59.8 | 99.35 | 97.76 | 98.32 | 95.49 | 94.22 | 82.8 | 97.96 | 94.77 |
| Xception [6] | 99.75 | 98.53 | 96.14 | 91.46 | 99.97 | 99.84 | 99.59 | 98.55 | 92.76 | 89.62 | 100.00 | 97.11 |
| Ciftci et al. [8] | - | - | - | - | - | 97.29 | 94.66 | 91.66 | 92.33 | 81.93 | - | 93.39 |
| FakeCatcher [7] | 94.87 | 96.37 | 95.75 | 89.12 | - | - | - | - | - | - | - | - |
| DeepRhythm [25] | 100.00 | 99.50 | 100.00 | - | - | - | - | - | - | - | - | - |
| Our method | 100.00 | 99.50 | 100.00 | 97.00 | 100.00 | 97.59 | 99.66 | 97.59 | 98.62 | 96.55 | 100.00 | 98.33 |

Table 2: Pairwise detection accuracy (%) between sources on FaceForensics++.

Source | Real | DF | F2F | FSW | NT | FSH
---|---|---|---|---|---|---|
Real | - | 100.00 | 97.93 | 100.00 | 96.38 | 99.14 |
DF | - | - | 99.83 | 99.83 | 99.65 | 99.83 |
F2F | - | - | - | 99.48 | 98.28 | 99.66 |
FSW | - | - | - | - | 100.00 | 98.48 |
NT | - | - | - | - | - | 99.83 |
FSH | - | - | - | - | - | - |
4.2 Comparison
To demonstrate the superiority of our proposed method, we compare it with other methods. We select mainstream networks, i.e., VGG [27] and ResNet [15], methods that have achieved good performance on the FaceForensics++ benchmark, i.e., Inception V3 [28] and Xception [6], and methods that are also based on the rPPG principle, i.e., Ciftci et al. [8], FakeCatcher [7], and DeepRhythm [25].
It is worth noting that, except for the methods based on the rPPG principle, the remaining methods detect forgery videos from single images rather than video clips, so we adapt the training strategy for these methods for a better comparison. Specifically, we directly feed the face region detected by Dlib, with appropriate scaling, into the network, and we utilize every frame to train. This can be considered a baseline for frame-based detection. For a fair and comprehensive comparison, in the face forgery detection and face forgery categorization experiments, we follow DeepRhythm and Ciftci et al. [8] and divide the dataset in the ratio of 8:1:1 and 7:3, respectively. All methods are evaluated on the six sub-datasets of FaceForensics++.
Overall, as shown in Table 1, our method outperforms state-of-the-art methods in both face forgery detection and categorization, which fully demonstrates its effectiveness. Baseline methods that do not utilize PPG signals can achieve high accuracy on some datasets. However, when facing NeuralTextures-based rendering videos, these methods, which only consider single-frame spatial information, may fail and their accuracy drops sharply. This shows that PPG signals carry deeper information than 2D images, and that PPG signals with spatial-temporal information can capture more distinctive differences between real and fake videos. Compared with methods also based on the rPPG principle, our method achieves the same state-of-the-art accuracy as DeepRhythm in the face forgery detection experiments (DeepRhythm only tests three sub-datasets), while the model size of our method is much smaller than DeepRhythm (see Figure 6 (a)). Moreover, our method achieves a significant improvement in face forgery categorization accuracy compared with Ciftci et al. [8]. These comparisons demonstrate that our method has a stronger ability to exploit the PPG signal.

4.3 Ablation Experiments
To better understand the contribution of each component, we perform several ablations based on the face forgery categorization experiments, as shown in Table 3.
Table 3: Ablation study on face forgery categorization (average accuracy, %).

| PPG | STFNet | Intra | Inter | LSTM | Constraint | Avg Acc |
|---|---|---|---|---|---|---|
|  | ✓ |  |  |  |  | 94.25 |
| ✓ |  |  |  |  |  | 96.04 |
| ✓ | ✓ |  |  |  |  | 97.19 |
| ✓ | ✓ | ✓ |  |  |  | 97.76 |
| ✓ | ✓ |  | ✓ |  |  | 98.22 |
| ✓ | ✓ |  | ✓ | ✓ |  | 98.27 |
| ✓ | ✓ |  | ✓ |  | ✓ | 98.10 |
| ✓ | ✓ |  | ✓ | ✓ | ✓ | 98.33 |
Effectiveness of PPG Signal. We feed face images and spatial-temporal maps containing PPG signals into the network, respectively. The average detection accuracy when utilizing PPG signals is 1.79% higher than when not. It is worth emphasizing that these results demonstrate that PPG signals from different face forgery sources have their own unique rhythmic patterns, and the network achieves high detection accuracy by learning them.
Effectiveness of STFNet. We use STFNet and VGG without classification layers as the backbone of the network, respectively. The results show that STFNet outperforms VGG in all aspects and improves the average accuracy by 1.15%, proving the effectiveness of STFNet.
Effectiveness of Data Augmentation. We further evaluate the effectiveness of our intra-source and inter-source data blending. The results show that both kinds of blending improve accuracy. Moreover, inter-source blending is more effective than intra-source blending, with an improvement of 0.46% over it and of 1.03% over no blending, which shows that our data augmentation achieves a considerable improvement simply by blending spatial-temporal maps, without any additional computation or processing.
Effectiveness of STINet. Finally, we verify the effectiveness of STINet. Adding the adjacency constraint and the Bi-LSTM separately yields a 0.12% reduction and a 0.05% improvement in accuracy, respectively, while combining the two increases accuracy by 0.11%. We speculate that although the adjacency constraint alone does constrain the network, it cannot play an optimizing role without information interaction between adjacent video clips; combining the two therefore lets them reinforce each other.
4.4 Extended Experiments
To verify the generalization performance of our method, we conduct face forgery categorization experiments by adding the forgery sets of DFD, Celeb-DF, and both of them, respectively. To save computation, only 300 videos are selected for each dataset, including the six sub-datasets of FaceForensics++. The results are shown in Table 4: whether the two datasets are added separately or simultaneously, high accuracy is achieved, which proves that our method remains effective when additional face forgery sources are introduced.
Table 4: Categorization accuracy (%) when the forgery sets of DFD, Celeb-DF, or both are added (300 videos per set).

| Added set | Real | DF | F2F | FSW | NT | FSH | DFD | Celeb-DF | Avg |
|---|---|---|---|---|---|---|---|---|---|
| + DFD | 96.59 | 98.86 | 94.32 | 97.73 | 98.86 | 97.73 | 98.73 | - | 97.53 |
| + Celeb-DF | 95.45 | 100.00 | 95.45 | 98.86 | 96.59 | 97.73 | - | 98.89 | 97.57 |
| + DFD & Celeb-DF | 96.59 | 100.00 | 97.73 | 98.86 | 97.73 | 98.86 | 98.73 | 98.89 | 98.42 |
Table 5: Face forgery categorization accuracy (%) on the HQ and LQ versions of FaceForensics++.

Version | Real | DF | F2F | FSW | NT | FSH | Avg
---|---|---|---|---|---|---|---|
HQ | 92.23 | 99.65 | 93.99 | 85.11 | 85.51 | 97.53 | 92.34 |
LQ | 61.15 | 95.58 | 77.36 | 29.59 | 25.51 | 42.86 | 55.37 |
4.5 Video Clip Length and Prediction Aggregation
To test the performance and the improvement brought by prediction aggregation under different video clip lengths, we choose several clip lengths (in frames) to conduct experiments.
The results are shown in Figure 6 (b), (c), and (d). The detection accuracies for three of the clip lengths are very close, 95.28%, 95.22%, and 95.28%, respectively, while the accuracy decreases sharply for the remaining length, to only 87.55%. This matches the rPPG principle: a video clip that is too short may not contain a complete heart rate cycle and thus fails to extract the PPG signal accurately, while one that is too long may include additional noise. This again shows that our method rests on a solid theoretical basis. Moreover, prediction aggregation boosts accuracy to varying degrees.
4.6 Video Compression
To demonstrate the robustness of our proposed method to compressed videos, we conduct experiments on the HQ (light compression) and LQ (heavy compression) versions of the FaceForensics++ dataset. As shown in Table 5, our method still achieves an average accuracy of 92.34% for face forgery categorization on the HQ version, which proves its robustness under light compression. In contrast, on the LQ version, the PPG signal suffers from noisy curve shapes and inaccurate peak locations due to the information loss caused by intra-frame and inter-frame coding during video compression [32], and our method only reaches an average accuracy of 55.37%.
5 Conclusion
In this paper, we propose a framework for face forgery detection and categorization. It is motivated by the key observation that despite the inevitable loss of PPG signal during the forgery process, a mixture of PPG signals still remains in the forged video, with a rhythmic pattern unique to its generation method. Moreover, with insight into how forgery methods generate their results, we further propose intra-source and inter-source blending to boost the performance of the framework. Extensive experiments demonstrate the superiority of our method. We also explore the impact of video clip length, prediction aggregation, and video compression.
References
- [1] Deepfake porn is ruining women’s lives. https://www.technologyreview.com/2021/02/12/1018222/deepfake-revenge-porn-coming-ban/. Accessed: 2021-02-15.
- [2] Deepfakes. https://github.com/deepfakes/. Accessed: 2021-02-06.
- [3] Fakeapp. https://github.com/deepfakes/faceswap. Accessed: 2021-02-03.
- [4] FaceSwap. https://github.com/MarekKowalski/FaceSwap. Accessed: 2021-02-05.
- [5] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797, 2018.
- [6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
- [7] U. A. Ciftci, I. Demir, and L. Yin. Fakecatcher: Detection of synthetic portrait videos using biological signals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- [8] U. A. Ciftci, I. Demir, and L. Yin. How do the hearts of deep fakes beat? deep fake source detection via interpreting residuals with biological signals. In 2020 IEEE International Joint Conference on Biometrics (IJCB), pages 1–10. IEEE, 2020.
- [9] K. Dale, K. Sunkavalli, M. K. Johnson, D. Vlasic, W. Matusik, and H. Pfister. Video face replacement. In Proceedings of the 2011 SIGGRAPH Asia conference, pages 1–10, 2011.
- [10] G. De Haan and V. Jeanne. Robust pulse rate from chrominance-based rppg. IEEE Transactions on Biomedical Engineering, 60(10):2878–2886, 2013.
- [11] G. De Haan and A. Van Leest. Improved motion robustness of remote-ppg by using the blood volume pulse signature. Physiological measurement, 35(9):1913, 2014.
- [12] X. Ding, Z. Raziei, E. C. Larson, E. V. Olinick, P. Krueger, and M. Hahsler. Swapped face detection using deep learning and subjective assessment. EURASIP Journal on Information Security, 2020:1–12, 2020.
- [13] J. Fridrich and J. Kodovsky. Rich models for steganalysis of digital images. IEEE Transactions on Information Forensics and Security, 7(3):868–882, 2012.
- [14] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
- [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [16] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- [17] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo. Face x-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5001–5010, 2020.
- [18] X. Li, J. Komulainen, G. Zhao, P.-C. Yuen, and M. Pietikäinen. Generalized face anti-spoofing by detecting pulse from face videos. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 4244–4249. IEEE, 2016.
- [19] Y. Li and S. Lyu. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656, 2018.
- [20] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3207–3216, 2020.
- [21] Z. Liu, X. Qi, and P. H. Torr. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8060–8069, 2020.
- [22] X. Niu, S. Shan, H. Han, and X. Chen. Rhythmnet: End-to-end heart rate estimation from face via spatial-temporal representation. IEEE Transactions on Image Processing, 29:2409–2423, 2019.
- [23] X. Niu, Z. Yu, H. Han, X. Li, S. Shan, and G. Zhao. Video-based remote physiological measurement via cross-verified feature disentangling. In European Conference on Computer Vision, pages 295–310. Springer, 2020.
- [24] X. Pan, X. Zhang, and S. Lyu. Exposing image splicing with inconsistent local noise variances. In 2012 IEEE International Conference on Computational Photography (ICCP), pages 1–10. IEEE, 2012.
- [25] H. Qi, Q. Guo, F. Juefei-Xu, X. Xie, L. Ma, W. Feng, Y. Liu, and J. Zhao. Deeprhythm: exposing deepfakes with attentional visual heartbeat rhythms. In Proceedings of the 28th ACM International Conference on Multimedia, pages 4318–4327, 2020.
- [26] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019.
- [27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
- [29] J. Thies, M. Zollhöfer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38(4):1–12, 2019.
- [30] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2387–2395, 2016.
- [31] W. Verkruysse, L. O. Svaasand, and J. S. Nelson. Remote plethysmographic imaging using ambient light. Optics express, 16(26):21434–21445, 2008.
- [32] Z. Yu, W. Peng, X. Li, X. Hong, and G. Zhao. Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 151–160, 2019.