

Identifying Invariant Texture Violation for Robust Deepfake Detection

Xinwei Sun
Microsoft Research Asia
[email protected]
Equal contribution.
   Botong Wu
Peking University
[email protected]
   Wei Chen
Microsoft Research Asia
[email protected]
Abstract

Existing deepfake detection methods have reported promising in-distribution results by accessing published large-scale datasets. However, due to non-smooth synthesis methods, the fake samples in these datasets may expose obvious artifacts (e.g., stark visual contrast, non-smooth boundaries), on which most frame-level detection methods heavily rely. As these artifacts do not appear in real-world media forgeries, such methods can suffer a large degradation when applied to fake images that are close to reality. To improve robustness on high-realism fake data, we propose the Invariant Texture Learning (InTeLe) framework, which only accesses the published dataset with low visual quality. Our method is based on the prior that the microscopic facial texture of the source face is inevitably violated by the texture transferred from the target person, which can hence be regarded as an invariant characterization shared among all fake images. To learn such an invariance for deepfake detection, InTeLe introduces an auto-encoder framework with different decoders for pristine and fake images, which are further appended with a shallow classifier in order to separate out the obvious artifact effect. Equipped with such a separation, the embedding extracted by the encoder captures the texture violation in fake images and is fed to a classifier for the final pristine/fake prediction. As a theoretical guarantee, we prove the identifiability of this invariant texture violation, i.e., that it can be precisely inferred from observational data. The effectiveness and utility of our method are demonstrated by the promising generalization ability from low-quality images with obvious artifacts to fake images with high realism.

1 Introduction

The advances in computing capability and deep generative models (e.g., auto-encoders) make it possible to synthesize fake videos or images that replace facial information of the source person with that of the target person, known as Deepfakes [14], including identity swap and expression swap [27, 32]. By leveraging deep convolutional neural networks, current Deepfake methods can manipulate at a microscopic level that is hard for human observers to discern. The circulation of these fake images on the Internet can cause potential harm in aspects such as cybersecurity, defamation of reputation, and fake news circulation. An example is the recent fake news that "AIDS is over", attributed to Donald Trump (although it was originally generated to attract the public's attention to the AIDS issue, this video can be misleading on social media). To mitigate these side effects, a robust (here, robustness means generalizing well to high-realism fake images) and effective detection method is desired.

By accessing public large-scale datasets such as FaceForensics++ [27], which may be of low visual quality, existing methods have reported promising results on in-distribution test data. However, without meticulous post-processing during synthesis, these low-quality fake images can expose obvious artifacts such as color mismatch and stark visual contrast, and are hence far from realistic. Methods trained to capture these artifacts [1, 17] may not generalize well to real forgeries that either have better visual quality or are post-compressed to smooth out the artifacts. As illustrated in Fig. 1, with the better synthesis method adopted in [18], the fake images in the newly published Celeb-DF dataset [18] exhibit a large improvement in visual quality. Such an improvement causes the degradation of the above detection methods [18, 27] from FaceForensics++ to Celeb-DF ($0.955 \to 0.655$ in terms of AUC), as summarized in Tab. 1.

Refer to caption
Figure 1: Fake images from FaceForensics++ [27] (1st row) and Celeb-DF [18] (2nd row). The ones in the 1st row exhibit obvious artifacts, corresponding to the training data from $\mbox{P}^{\mathrm{L}}$ (in Def. 3.1) that we access. The 2nd row removes the artifacts by implementing a better synthesis method ($\mbox{P}^{\mathrm{H}}$ in Def. 3.2), serving as the test arena to validate the robustness of proposed methods.

Apart from artifacts that may not come up in the wild (e.g., in vivid, widely circulated fake images), we exploit in this paper an important prior that invariably holds across all fake images. This invariance refers to the fact that the facial texture (i.e., the multitude of microscopic details and surface orientation) of one person is uniquely distributed [31, 19]. That is, although different facial regions may expose heterogeneous textures, they give rise to a holistic regularity such as geometric coherence or high-order smoothness of spatial variation. This prior has been exploited in the literature of face recognition [25, 9, 36]. Replacing a facial region with one from another source (i.e., person) would violate the holistic regularity mentioned above. Different from the obvious artifacts due to careless synthesis, such an invariant texture-wise violation corresponds to high-frequency signals that cannot be perceived by human observers, making the detection extremely difficult.

To generalize well to real forgeries from data with low visual quality, we propose a novel framework, namely Invariant Texture Learning (InTeLe), to identify this invariant texture violation. Our InTeLe is based on a latent generative model (as shown in Fig. 2) that encapsulates the two sources of clues in low-quality fake images: the artifacts during synthesis and the texture violation. Specifically, we introduce hidden variables $Z$ to model the high-level latent components that contain the texture information. Besides, the generation process of pristine images differs from that of fake images, which can suffer from artifacts. This is illustrated by the additional arrow $Y \to X$ (where $Y$ denotes the binary pristine/fake label) in the left image of Fig. 2. For high-quality fake images with the artifacts removed, this arrow does not exist (right image in Fig. 2) and the generating process of $X$ no longer depends on $Y$.

To learn the invariant texture information for generalization, our InTeLe introduces an auto-encoder structure, with the encoder extracting the texture information, followed by a classifier for pristine/fake prediction. Motivated by the $Y \to X$ arrow in the latent generative model, our decoder has different branches for generating fake and pristine images. To model such a generating difference reflected by the artifacts, we append a shallow classifier on the decoded images to regularize the decoder, leaving the invariant texture violation solely encoded into the latent components for generalization. As a guarantee, we prove that this invariant texture information can be precisely inferred from the observational distribution, i.e., identifiability. To demonstrate the robustness of InTeLe, we test it on more realistic fake data in Celeb-DF [18] and the benchmark images in [27]. The empirical results show a large-margin improvement (e.g., $6.5\%$ AUC on Celeb-DF) compared with conventional supervised learning. The contributions are summarized as follows:

  • We are the first to consider the texture violation as an invariance for deepfake detection.

  • We propose an auto-encoder-based model that separates out the artifacts to identify this invariance for prediction.

  • We prove the identifiability of the invariant texture violation, i.e., it can be precisely inferred from the observational distribution.

  • We generalize to datasets with high realism (from low-quality data) and achieve better results than other methods.

2 Related Work

Video Detection. Video-based detection methods can exploit temporal discrepancies [28, 8]; physiological aspects [21] including incoherent head poses [35], abnormal eye blinking patterns [16] or behaviours [2]; and the joint exploitation of spatial and motion information [33]. However, in many real scenarios, image-level manipulation is enough to cause potential harm, such as compromising cybersecurity or defaming a politician's reputation by generating inappropriate expressions in certain contexts. Therefore, a robust frame-based detection method is more desirable.

Image Detection. For detecting fake images, most existing methods [17, 22, 1, 37, 38, 20] are trained on FaceForensics++ [27] (one of the first large-scale public datasets) to learn the obvious artifacts due to imperfect synthesis (elaborated in Section 3.1). These artifacts cause visually stark contrast, which corresponds to low-frequency signals for human observers and is hence far from realistic. Using shallow networks, these low-frequency signals can be captured to achieve accurate prediction [1]. Another line of methods is data-driven and directly trains DNNs, such as XceptionNet [27, 23, 15]. Benefiting from the ability of deep layers to learn high-frequency signals, these DNNs can learn both the artifacts and the texture violation (but without separating the two effects). However, these methods cannot generalize to real forgeries, e.g., [18] with color correction and smoother mask generation; this performance degradation was reported in [18].

Key Difference. Compared with existing works, our method focuses on identifying the texture violation characterized by microscopic details, as an invariance shared among all fake images. Specifically, we target generalizing from FaceForensics++, whose fake images expose obvious artifacts, to Celeb-DF, whose fake images are close to reality. Note particularly that, rather than simply performing well on Celeb-DF (since we do not use data from Celeb-DF), our goal is to propose a robust model that can exploit the invariant texture information for prediction. This goal is commonly believed to be a future prospect for deepfake detection [32, 21]. Compared with conventional feature-extraction methods (e.g., [7]), our method is able to extract the high-frequency signals by being equipped with deep neural networks.

3 Methodology

Problem Setting. With an accessible Deepfake dataset that can expose obvious artifacts due to unsatisfactory synthesis quality, we aim to generalize well to fake data with high realism. Formally, our training data contains $N$ samples $\{x_{i},y_{i}\}\overset{i.i.d}{\sim}\mbox{P}^{\mathrm{L}}(x,y)$ (where "L" stands for low quality and $\mbox{P}^{\mathrm{L}}(X,Y)$ is defined in Def. 3.1), where $(x,y)\in\mathcal{X}\times\mathcal{Y}$ denote the image and the pristine/fake label with $\mathcal{Y}:=\{0,1\}$ ("0" denotes pristine and "1" denotes fake). Our goal is to learn a robust predictor $f:\mathcal{X}\to\mathcal{Y}$ that generalizes from $\mbox{P}^{\mathrm{L}}$ to real-world media forgeries with high visual quality, characterized by $\mbox{P}^{\mathrm{H}}\neq\mbox{P}^{\mathrm{L}}$ ("H" stands for high quality and $\mbox{P}^{\mathrm{H}}$ is defined in Def. 3.2): $\min_{f}\mathbb{E}_{p^{\mathrm{H}}(x,y)}\mathbbm{1}(f(x)\neq y)$.

Outline. We start with a latent generative model in Section 3.1 to model the two effects in fake images with low visual quality, i.e., the artifacts and the texture violation, for $\mbox{P}^{\mathrm{L}}$. This generative model motivates our InTeLe framework in Section 3.2, which identifies the invariant texture violation by separating out the artifact effect. Finally, Section 3.3 provides a theoretical guarantee that this invariant texture information can be precisely inferred, i.e., identifiability.

3.1 Latent Generative Model

For manipulated/fake images with low visual quality, there are two clues for detection: (i) the intrinsic texture violation from another person; and (ii) synthetic artifacts such as splicing boundaries and color mismatch. We assume that (i) is shared by all fake images due to the following prior regarding the facial texture: the facial texture, as a generalization of a multitude of microscopic details and surface orientation, is uniquely determined for each person [31, 19]. For a manipulation that leverages the information from another person, the distribution of the source person's facial texture is inevitably violated. To understand (ii), which only comes up in fake images far from realism, take the synthesis method adopted in FaceForensics++ [27] as an illustration: after generating the target facial region from that of the source person, [27] implements a warping-back operation in order to match the original face. It is this warping-back procedure that induces the non-smooth boundary and the skin color mismatch between the source person and the target person [1], leaving obvious artifacts.

Refer to caption         Refer to caption
(a) DAG for $\mbox{P}^{\mathrm{L}}$         (b) DAG for $\mbox{P}^{\mathrm{H}}$
Figure 2: Directed Acyclic Graphs (DAGs) of $\mbox{P}^{\mathrm{L}}$, which generates low-quality fake images that expose obvious artifacts, and $\mbox{P}^{\mathrm{H}}$, which generates high-quality fake images via a better synthesis process. With the artifacts removed, the generating process of $\mbox{P}^{\mathrm{H}}$ no longer depends on the pristine/fake label. The blue arrow in $\mbox{P}^{\mathrm{H}}$ means that the corresponding generating mechanism/structural equation is changed compared to that of $\mbox{P}^{\mathrm{L}}$.

We encapsulate these two clues into a latent generative model illustrated in the left image of Fig. 2. As shown, the hidden variable $Z\in\mathbb{R}^{d_{z}}$ is introduced to model the explanatory factors of the facial image $X$; these explanatory factors encode the texture information of each local facial region such as the five sense organs, hair, beard, etc. The obvious artifact (i.e., (ii)) used to differentiate fake images is modeled by the dependency on the label $y$ during the generating process. Existing methods exploit either the artifacts (i.e., (ii)) or both (i) and (ii), and can achieve high detection recall on $\mbox{P}^{\mathrm{L}}$. However, these methods cannot generalize well to high-realism fake images from $\mbox{P}^{\mathrm{H}}$, which removes the artifacts via a better synthesis method or compression and only exposes the microscopic texture violation that is not easy for human observers to detect. Due to such a removal of the artifacts, we assume that the generating process of $\mbox{P}^{\mathrm{H}}$ only depends on the latent variable, as commonly assumed in the literature of latent generative models [13, 29]. This is shown by the missing arrow $Y\to X$ in the right image of Fig. 2, and by $\mbox{P}^{\mathrm{H}}(x|z)=\mbox{P}^{\mathrm{L}}(x|z,y=0)$ (recall that $y=0$ denotes the pristine image). The formal definitions of $\mbox{P}^{\mathrm{L}},\mbox{P}^{\mathrm{H}}$ are introduced in Def. 3.1 and 3.2.

Definition 3.1 (Latent Generative Model for $\mbox{P}^{\mathrm{L}}$).

The generative model [24] of $\mbox{P}^{\mathrm{L}}$ is defined by (i) a Directed Acyclic Graph (DAG), as illustrated in the left image of Fig. 2 and denoted as $G=(\mathcal{V},\mathcal{E})$, with $\mathcal{V}:=(X,Y,Z)$ containing the input $X$, the output $Y$ and the latent variables $Z$, and $\mathcal{E}:=\{Y\to Z,Y\to X,Z\to X\}$ characterizing the generating directions; and (ii) generating distributions $\left(p_{f_{y}}(y),p_{f_{z}}(z|y),p_{f_{x}}(x|z,y)\right)$ for each element in $\mathcal{V}$.

Note that this generative model differs from the commonly adopted graphical model with $Z\to X$ in [12] by the additional arrow $Y\to X$, implying the difference in the generating processes for pristine and fake images. Specifically, this difference lies in the obvious artifacts that do not come up in natural pristine images and are due to the non-smooth warping-back operation. With an improved synthesis method or compression to remove such artifacts, this difference no longer exists, making the microscopic texture manipulation the sole clue for detection.

Definition 3.2 (Latent Generative Model for $\mbox{P}^{\mathrm{H}}$).

The generative model for $\mbox{P}^{\mathrm{H}}$ is the same as Def. 3.1 except for the missing arrow $Y\to X$ and $p_{f^{\mathrm{H}}_{x}}(x|z)=p_{f_{x}}(x|z,y=0)$.

According to Def. 3.1 and 3.2, the distribution $p(z,y)$, and hence the mechanism $p(y|z)$ as the predicting mechanism relying only on the texture inconsistency, is shared among both generative models, and is therefore defined as the invariance with texture violation:

Definition 3.3 (Invariance with Texture Violation).

We define the predicting mechanism $p(y|z)$ as the invariance with texture violation, which is shared between $\mbox{P}^{\mathrm{L}}$ in Def. 3.1 and $\mbox{P}^{\mathrm{H}}$ in Def. 3.2.

By accessing only data from $\mbox{P}^{\mathrm{L}}$, which mixes the texture violation with the artifact effect, our goal is to identify the invariance with texture violation that is also shared by $\mbox{P}^{\mathrm{H}}$, by separating out the artifact effect. Such a learned invariance is used for the final prediction on $\mbox{P}^{\mathrm{H}}$. To learn the invariance in Def. 3.3, note that the generating mechanism of the image $X$ in $\mbox{P}^{\mathrm{H}}$, although different from, is similar to the one in $\mbox{P}^{\mathrm{L}}$: $X\leftarrow f_{x}(Z,Y,\varepsilon_{X})$ for $\mbox{P}^{\mathrm{L}}$ vs. $X\leftarrow f_{x}(Z,Y=0,\varepsilon_{X})$ for $\mbox{P}^{\mathrm{H}}$, providing us the opportunity to identify $f_{x}$, with which we can infer the latent components $Z$ from $X$ and finally predict using $p(y|z)$. In the subsequent section, we present our InTeLe framework, which learns $p(y|z)$ from $\mbox{P}^{\mathrm{L}}$ and generalizes to the data from $\mbox{P}^{\mathrm{H}}$.
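To make the two generating processes concrete, the following toy sketch (purely illustrative numpy code with hypothetical mixing weights and shift values, not the actual image-synthesis pipeline) samples from a $\mbox{P}^{\mathrm{L}}$-like model with an explicit $Y\to X$ artifact term, and from a $\mbox{P}^{\mathrm{H}}$-like model in which that term is removed:

```python
import numpy as np

rng = np.random.default_rng(0)
D_Z, D_X = 8, 32
W = rng.normal(size=(D_Z, D_X))  # hypothetical mixing weights inside f_x

def sample(n, high_quality=False):
    """Toy sampler for the DAGs in Fig. 2.

    P^L (high_quality=False): Y -> Z and (Z, Y) -> X, where the extra Y -> X term
    plays the role of synthesis artifacts. P^H (high_quality=True): the same f_x
    with the artifact term removed, mimicking P^H(x|z) = P^L(x|z, y=0).
    """
    y = rng.integers(0, 2, size=n)                        # pristine (0) / fake (1)
    mu = np.where(y[:, None] == 1, 0.5, 0.0)              # hypothetical texture shift in p(z|y)
    z = mu + rng.normal(size=(n, D_Z))                    # latent texture components Z
    artifact = 0.0 if high_quality else 2.0 * y[:, None]  # the Y -> X (artifact) effect
    x = np.tanh(z @ W) + artifact + 0.1 * rng.normal(size=(n, D_X))
    return x, y

x_low, y_low = sample(1000)            # stands in for training data from P^L
x_high, y_high = sample(1000, True)    # stands in for test data from P^H
```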

Refer to caption
Figure 3: Our InTeLe framework, which learns from $\mbox{P}^{\mathrm{L}}$ and aims at learning the invariance with texture violation, i.e., $p(y|z)$, to generalize on $\mbox{P}^{\mathrm{H}}$. InTeLe is composed of three modules: (i) an auto-encoder with the encoder $\mathrm{Enc}$ extracting the texture information and a two-branch decoder (i.e., $f_{0},f_{1}$) corresponding to the generating processes of pristine/fake images; (ii) the invariant classifier $f_{\mathrm{cls}}$, which exploits the information of the texture inconsistency; (iii) the semantic classifier $f_{\mathrm{SC}}$, which encodes the artifact effect by classifying $f_{0}(\mathrm{Enc}(x))$ as 0 and $f_{1}(\mathrm{Enc}(x))$ as 1; besides, it is trained on $(x,y)$ to regularize $f_{0},f_{1}$ away from trivial solutions.

3.2 Invariant Texture Learning Framework

We introduce our learning method, namely Invariant Texture Learning, with the goal of identifying the invariance with texture violation (i.e., $p(y|z)$) from data in $\mbox{P}^{\mathrm{L}}$ and generalizing to data from $\mbox{P}^{\mathrm{H}}$, by separating it from the artifact effect. The whole pipeline is illustrated in Fig. 3. As shown, our InTeLe is based on the Auto-Encoder (AE), as a deterministic version ($\varepsilon_{x}=0$, which can be proved to satisfy the identifiability in Section 3.3) of the Variational AE, to learn the generative model. To model the artifacts and texture violation in $\mbox{P}^{\mathrm{L}}$, (i) our encoder is expected to extract the latent components $Z$ that incorporate the texture information of local facial regions; and (ii) our two-branch decoders respectively correspond to the generating processes of fake and pristine images, which echoes the dependency on $Y$ during the generation of $X$ in Fig. 2. To further enforce the decoder to capture the obvious artifact effect, we append a shallow network (i.e., $f_{\mathrm{SC}}$) to capture the low-frequency information (which is easy for humans to observe). By separating out this artifact, the extracted latent features $Z$ only encode the texture consistency for predicting pristine/fake. The modules of the auto-encoder ($\mathrm{Enc},f_{0},f_{1}$), the classifier $f_{\mathrm{cls}}$ and the semantic classifier $f_{\mathrm{SC}}$ participate in the end-to-end training. We explain each module in more detail below.
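The following PyTorch sketch lays out the four modules of Fig. 3; the layers are deliberately toy stand-ins (the paper uses XceptionNet/EfficientNet encoders and U-Net decoders, see Section 4.2), and the class/attribute names (InTeLe, enc, f0, f1, f_cls, f_sc) are our own shorthand. It assumes 64x64 RGB inputs so that the decoder outputs match the image size:

```python
import torch
import torch.nn as nn

class InTeLe(nn.Module):
    """Minimal sketch of the InTeLe modules in Fig. 3 (toy layers, 64x64 inputs assumed)."""
    def __init__(self, d_z=128):
        super().__init__()
        # Enc: deep encoder extracting the texture embedding Z
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_z))
        # f0 / f1: two decoder branches for the pristine / fake generating processes
        def make_decoder():
            return nn.Sequential(
                nn.Linear(d_z, 64 * 8 * 8), nn.ReLU(),
                nn.Unflatten(1, (64, 8, 8)),
                nn.Upsample(scale_factor=8),          # crude upsampling back to 64x64
                nn.Conv2d(64, 3, 3, padding=1))
        self.f0, self.f1 = make_decoder(), make_decoder()
        # f_cls: invariant classifier on Z; f_sc: shallow semantic classifier on images
        self.f_cls = nn.Linear(d_z, 1)
        self.f_sc = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=4, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

    def forward(self, x):
        z = self.enc(x)
        return z, self.f0(z), self.f1(z), self.f_cls(z)
```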

Auto-Encoder Module. As mentioned earlier, the information of the texture inconsistency of each local region is embedded into the latent components $Z$ of the whole image $X$, while the artifacts are left during the synthesis of fake images. Motivated by these priors, we leverage the auto-encoder framework to model these two effects. The latent components extracted by the encoder are expected to capture the invariant texture inconsistency. For the decoder, to model the artifacts due to synthesis, we propose a two-branch version with one branch (denoted as $f_{0}$) generating the pristine images and the other (denoted as $f_{1}$) generating the fake images. Since the texture information corresponds to high-frequency signals that can be captured by deep neural networks, we choose a deep network structure for the encoder (e.g., XceptionNet, EfficientNet); the details are introduced in the experimental part. The overall loss function for the encoder $\mathrm{Enc}$ and the decoders $f_{0},f_{1}$ is:

$$\mathcal{L}_{\mathrm{AE}}(\mathrm{Enc},f_{0},f_{1})=\sum_{i=1}^{n}\mathbbm{1}(y_{i}=0)\,\|x_{i}-f_{0}(\mathrm{Enc}(x_{i}))\|_{F}^{2}+\mathbbm{1}(y_{i}=1)\,\|x_{i}-f_{1}(\mathrm{Enc}(x_{i}))\|_{F}^{2} \quad (1)$$
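A minimal sketch of this reconstruction loss, assuming rec0 and rec1 denote $f_{0}(\mathrm{Enc}(x))$ and $f_{1}(\mathrm{Enc}(x))$ computed by a model such as the sketch above (hypothetical names):

```python
import torch

def loss_ae(x, y, rec0, rec1):
    """Eq. (1): each sample is reconstructed by the decoder branch matching its
    label (f0 for pristine, f1 for fake); squared Frobenius errors are summed."""
    err0 = ((x - rec0) ** 2).flatten(1).sum(dim=1)  # ||x - f0(Enc(x))||_F^2
    err1 = ((x - rec1) ** 2).flatten(1).sum(dim=1)  # ||x - f1(Enc(x))||_F^2
    y = y.float()
    return ((1.0 - y) * err0 + y * err1).sum()
```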

Semantic Classifier Module. To further ensure that the artifacts are captured by the difference between $f_{0}$ and $f_{1}$, we append a shallow semantic classifier ($f_{\mathrm{SC}}$ in Fig. 3), which has been shown to only capture obvious low-frequency signals (e.g., the semantic pattern in object classification). We expect that, for any image $x$, this $f_{\mathrm{SC}}$ can only exploit the artifacts as the difference between $f_{0}(\mathrm{Enc}(x))$ and $f_{1}(\mathrm{Enc}(x))$ for prediction. Specifically, for each image, whether pristine or fake, $f_{\mathrm{SC}}$ is trained to classify the reconstructed image from $f_{0}$ (which exposes no artifacts) as 0 and the one from $f_{1}$ (which exposes artifacts) as 1:

$$\mathcal{L}_{\mathrm{CE}}(\mathrm{Enc},f_{0},f_{1},f_{\mathrm{SC}})=-\sum_{i=1}^{n}\log\left(1-f_{\mathrm{SC}}(f_{0}(\mathrm{Enc}(x_{i})))\right)+\log\left(f_{\mathrm{SC}}(f_{1}(\mathrm{Enc}(x_{i})))\right). \quad (2)$$

Further, to avoid learning trivial solutions for $f_{0}$ and $f_{1}$ (such as merely making the first pixel of the reconstructed image different), we add an auxiliary branch of pristine/fake classification that takes the image $x$ directly as input, so that $f_{0},f_{1}$ cannot be trivially separated by $f_{\mathrm{SC}}$:

$$\mathcal{L}_{\mathrm{aux}}(f_{\mathrm{SC}})=-\sum_{i=1}^{n}\mathbbm{1}(y_{i}=0)\log\left(1-f_{\mathrm{SC}}(x_{i})\right)+\mathbbm{1}(y_{i}=1)\log\left(f_{\mathrm{SC}}(x_{i})\right). \quad (3)$$

Combining Eq. (2) with Eq. (3), the loss function for the semantic classifier is:

$$\mathcal{L}_{\mathrm{SC}}(f_{0},f_{1},f_{\mathrm{SC}})=\mathcal{L}_{\mathrm{CE}}(f_{0},f_{1},f_{\mathrm{SC}})+\alpha*\mathcal{L}_{\mathrm{aux}}(f_{\mathrm{SC}}),$$

with $\alpha\geq 0$ denoting a hyper-parameter that controls the extent of the regularization.
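A sketch of this combined semantic-classifier loss, reusing the model sketch above (so model.f_sc outputs a logit); it averages over the batch rather than summing, which only rescales the terms:

```python
import torch
import torch.nn.functional as F

def loss_sc(model, x, y, rec0, rec1, alpha=4.0):
    """L_SC = L_CE + alpha * L_aux (Eqs. (2)-(3), batch-mean instead of sum):
    f_SC is pushed to call f0-reconstructions '0' and f1-reconstructions '1',
    while the auxiliary term trains it on the real pristine/fake label of x."""
    zeros = torch.zeros(x.size(0), device=x.device)
    ones = torch.ones(x.size(0), device=x.device)
    l_ce = (F.binary_cross_entropy_with_logits(model.f_sc(rec0).squeeze(1), zeros)
            + F.binary_cross_entropy_with_logits(model.f_sc(rec1).squeeze(1), ones))
    l_aux = F.binary_cross_entropy_with_logits(model.f_sc(x).squeeze(1), y.float())
    return l_ce + alpha * l_aux
```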

Classifier Module. With the artifact effect separated out by the semantic classifier $f_{\mathrm{SC}}$ appended after $f_{0},f_{1}$, the embedding $Z$ only contains the texture information, and is therefore taken as input to the final classifier, which is trained by:

$$\mathcal{L}_{\mathrm{cls}}(f_{\mathrm{cls}},\mathrm{Enc})=-\sum_{i=1}^{n}\mathbbm{1}(y_{i}=0)\log\left(1-f_{\mathrm{cls}}(\mathrm{Enc}(x_{i}))\right)+\mathbbm{1}(y_{i}=1)\log\left(f_{\mathrm{cls}}(\mathrm{Enc}(x_{i}))\right). \quad (4)$$
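A corresponding sketch of Eq. (4), again with a logit-output classifier and batch averaging:

```python
import torch.nn.functional as F

def loss_cls(model, x, y):
    """Eq. (4): binary cross-entropy of f_cls on the texture embedding Z = Enc(x)."""
    logits = model.f_cls(model.enc(x)).squeeze(1)
    return F.binary_cross_entropy_with_logits(logits, y.float())
```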

Training and Testing. In summary, the overall loss function is the combination of the losses of the modules above: the reconstruction loss (Eq. (1)), the shallow semantic-classifier loss that separates the artifact effect from the texture information, and the final classification loss. The overall loss $\mathcal{L}$ for training is:

$$\mathcal{L}(f_{0},f_{1},\mathrm{Enc},f_{\mathrm{SC}},f_{\mathrm{cls}})=\mathcal{L}_{\mathrm{AE}}(\mathrm{Enc},f_{0},f_{1})+\lambda_{1}*\mathcal{L}_{\mathrm{SC}}(f_{0},f_{1},f_{\mathrm{SC}})+\lambda_{2}*\mathcal{L}_{\mathrm{cls}}(f_{\mathrm{cls}},\mathrm{Enc}), \quad (5)$$

which trains all modules in an end-to-end scheme. The $\lambda_{1},\lambda_{2}$ are hyper-parameters that balance the effects of each module. During testing, given a new sample $x_{\mathrm{new}}$, we feed the extracted texture information $Z$ (i.e., $\mathrm{Enc}(x_{\mathrm{new}})$) into the classifier $f_{\mathrm{cls}}$ for the pristine/fake prediction.
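Putting the pieces together, a hedged end-to-end training and testing sketch built on the model and loss sketches above, using the SGD settings and the $\lambda_{1},\lambda_{2},\alpha$ values reported in Section 4.2:

```python
import torch

model = InTeLe()  # the module sketch from Section 3.2
opt = torch.optim.SGD(model.parameters(), lr=0.02, weight_decay=1e-4)
lambda1, lambda2 = 2.0, 0.25

def train_step(x, y):
    """One end-to-end step of Eq. (5), combining the three module losses."""
    z, rec0, rec1, _ = model(x)
    loss = (loss_ae(x, y, rec0, rec1)
            + lambda1 * loss_sc(model, x, y, rec0, rec1, alpha=4.0)
            + lambda2 * loss_cls(model, x, y))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def predict(x_new):
    """Test time: only Enc and f_cls are used; the decoders and f_SC are discarded."""
    return torch.sigmoid(model.f_cls(model.enc(x_new))).squeeze(1)
```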

3.3 Identifiability of the Texture Violation

In this section, we provide a theoretical guarantee for our InTeLe to ensure that the invariance with texture violation (i.e., $p(y|z)$) can be identified, i.e., precisely inferred from the observational distribution (here, $p^{\mathrm{L}}(x,y)$). Our analysis is inspired by the recent analysis of nonlinear ICA [10, 11], adapted to our generative model in Fig. 2 with the additional dependency on $Y$ during generation. Besides, we here generalize the definition of $Y$ to $\mathcal{Y}:=\{0,...,m\}$, which denotes the space of sub-types of images: 0 denotes the pristine image and $k\leq m$ denotes a fake image produced with the $k$-th manipulation method. Our goal is to identify the generating mechanism $f_{x}$ in order to infer the latent variable $Z$ (as the intrinsic texture information for training) and hence $p(y|z)$ for prediction. In the following, we assume that the latent variable $Z|Y$ is generated from an exponential family, i.e.,

$$p^{\mathrm{L}}_{f_{z}}(z|y)=\prod_{i=1}^{d_{z}}\exp\Big{(}\sum_{j=1}^{k_{z}}T_{i,j}(z_{i})\Gamma_{y,i,j}+B_{i}(z_{i})-A_{y,i}\Big{)},$$

where $f_{z}$ is associated with the sufficient statistics $\{T_{i,j}(z_{i})\}$; $\{\Gamma_{y,i,j}\}$ denotes the natural parameters; and $\{B_{i}\},\{A_{y,i}\}$ denote the base measures and normalizing constants that ensure the distribution integrates to 1. Let $\mathbf{T}:=\left[\mathbf{T}_{1},...,\mathbf{T}_{d_{z}}\right]\in\mathbb{R}^{k_{z}\times d_{z}}$ with $\mathbf{T}_{i}:=[T_{i,1},...,T_{i,k_{z}}]$ for all $i\in[d_{z}]$, and $\bm{\Gamma}_{y}:=\left[\bm{\Gamma}_{y,1},...,\bm{\Gamma}_{y,d_{z}}\right]\in\mathbb{R}^{k_{z}\times d_{z}}$ with $\bm{\Gamma}_{y,i}:=[\Gamma_{y,i,1},...,\Gamma_{y,i,k_{z}}]$ for all $i\in[d_{z}]$. Our goal is to identify the parameters $\theta:=\{\mathbf{T},f_{x}\}$ that give rise to the observational distribution $p^{\mathrm{L}}(x,y)$. Since $Z$ is assumed to model the high-level abstractions/concepts in the latent generative model [13], we assume that $d_{z}<d_{x}$ and that $\varepsilon_{x}\in\mathbb{R}^{d_{x}-d_{z}}$ such that $f_{x}$ is bijective. The following theorem ensures that $\theta$ is identifiable up to a linear and pointwise transformation, which is similar to the result in [10].
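For concreteness, a label-conditional Gaussian on each latent dimension is one standard instance of this exponential family (an illustration only, not a modeling choice claimed by the paper), with $k_{z}=2$, $T_{i,1}(z_{i})=z_{i}$, $T_{i,2}(z_{i})=z_{i}^{2}$ and $B_{i}(z_{i})=0$:

$$p^{\mathrm{L}}_{f_{z}}(z_{i}|y)=\mathcal{N}(z_{i};\mu_{y,i},\sigma_{y,i}^{2})=\exp\Big(z_{i}\cdot\frac{\mu_{y,i}}{\sigma_{y,i}^{2}}+z_{i}^{2}\cdot\Big(-\frac{1}{2\sigma_{y,i}^{2}}\Big)-\Big(\frac{\mu_{y,i}^{2}}{2\sigma_{y,i}^{2}}+\log\sqrt{2\pi}\,\sigma_{y,i}\Big)\Big),$$

so that $\Gamma_{y,i,1}=\mu_{y,i}/\sigma_{y,i}^{2}$, $\Gamma_{y,i,2}=-1/(2\sigma_{y,i}^{2})$ and $A_{y,i}=\mu_{y,i}^{2}/(2\sigma_{y,i}^{2})+\log\sqrt{2\pi}\,\sigma_{y,i}$.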

Theorem 3.4 (Identifiability).

Consider the DAG associated with the generative model, which has the following factorization

$$p^{\mathrm{L}}(x,z,y)=p_{f_{y}}(y)p_{f_{z}}(z|y)p_{f_{x}}(x|z,y), \quad (6)$$

under the following assumptions:

  (a) The sufficient statistics $\{T_{i,j}\}$ are differentiable with nonzero derivatives almost everywhere.

  (b) There exist $m+1:=k_{z}d_{z}+1$ values of $y$, i.e., $y_{0},y_{1},...,y_{m}$, such that the matrix $[\bm{\Gamma}_{y_{1}}-\bm{\Gamma}_{y_{0}},...,\bm{\Gamma}_{y_{m}}-\bm{\Gamma}_{y_{0}}]$ is invertible,

we have that if $\theta$ and $\tilde{\theta}:=\{\tilde{\mathbf{T}},\tilde{f}_{x}\}$ give rise to the same observational distribution, i.e., $p_{\theta}^{\mathrm{L}}(x,y)=p_{\tilde{\theta}}^{\mathrm{L}}(x,y)$ for any $(x,y)\in\mathcal{X}\times\mathcal{Y}$, then there exist an invertible matrix $A\in\mathbb{R}^{k_{z}d_{z}\times k_{z}d_{z}}$ and a vector $b\in\mathbb{R}^{k_{z}d_{z}}$ such that $\bm{T}([f_{x}^{-1}]_{\mathcal{Z}})=A\tilde{\bm{T}}([\tilde{f}_{x}^{-1}]_{\mathcal{Z}})+b$ (here, $[f]_{\mathcal{A}}$ denotes the function $f$ restricted to the dimension indices belonging to the index set $\mathcal{A}$).

Proof.

Suppose $\theta,\tilde{\theta}$ give rise to the same observational distribution, i.e., $p_{\theta}(x,y)=p_{\tilde{\theta}}(x,y)$; then we have:

$$\int p_{f_{y}}(y)p_{f_{z}}(z|y)p_{f_{x}}(x|z,y)dz=\int p_{\tilde{f}_{y}}(y)p_{\tilde{f}_{z}}(z|y)p_{\tilde{f}_{x}}(x|z,y)dz. \quad (7)$$

Since $p_{f_{y}}(y)=p_{\tilde{f}_{y}}(y)$, it follows that

$$\int p_{T^{z},\Gamma}(z|y)p_{f_{x}}(x|z,y)dz=\int p_{\tilde{T}^{z},\tilde{\Gamma}}(z|y)p_{\tilde{f}_{x}}(x|z,y)dz. \quad (8)$$

Applying the change of variables $\bar{x}:=f_{x}(z,y)$, we have

$$\int p_{T^{z},\Gamma}([f^{-1}(\bar{x})]_{\mathcal{Z}}|y)p_{\varepsilon_{x}}(x-\bar{x})|J_{f^{-1}}(\bar{x})|d\bar{x}=\int p_{\tilde{T}^{z},\tilde{\Gamma}}([\tilde{f}^{-1}(\bar{x})]_{\mathcal{Z}}|y)p_{\varepsilon_{x}}(x-\bar{x})|J_{\tilde{f}^{-1}}(\bar{x})|d\bar{x}. \quad (9)$$

This implies that $(p_{T,\Gamma,f,y}*p_{\varepsilon_{x}})(x)=(p_{\tilde{T},\tilde{\Gamma},\tilde{f},y}*p_{\varepsilon_{x}})(x)$. Taking the Fourier transform on both sides, we have $F(p_{T,\Gamma,f,y})(\omega)\phi_{\varepsilon_{x}}(\omega)=F(p_{\tilde{T},\tilde{\Gamma},\tilde{f},y})(\omega)\phi_{\varepsilon_{x}}(\omega)$. Since the characteristic function $\phi_{\varepsilon_{x}}$ is nonzero almost everywhere, this implies $F(p_{T,\Gamma,f,y})(\omega)=F(p_{\tilde{T},\tilde{\Gamma},\tilde{f},y})(\omega)$, from which we obtain $p_{T,\Gamma,f,y}=p_{\tilde{T},\tilde{\Gamma},\tilde{f},y}$ by taking the inverse Fourier transform on both sides. Here, $p_{T,\Gamma,f,y}$ is

$$p_{T,\Gamma,f,y}(x)=p_{T,\Gamma}(f^{-1}(x)|y)|J_{f^{-1}}(x)|. \quad (10)$$

By taking the logarithm on both sides of $p_{T,\Gamma,f,y}=p_{\tilde{T},\tilde{\Gamma},\tilde{f},y}$, we have

$$\langle T,\Gamma(y)\rangle=\langle\tilde{T},\tilde{\Gamma}(y)\rangle+b(y)+g(x), \quad (11)$$

where $b(y)$ depends only on $y$ and $g(x)$ is a function of $x$. Defining $\bar{\Gamma}(y):=\Gamma(y)-\Gamma(y_{0})$, we have

$$L^{\top}T=\tilde{L}^{\top}\tilde{T}+b. \quad (12)$$

According to condition (b) and applying the result in [10], the matrix $A:=(L^{\top})^{-1}\tilde{L}$ is invertible. The proof is completed. ∎

Remark 1.

Condition (b) places a requirement on the number and diversity of manipulation classes needed to identify the invariant mechanisms $\theta$. Although this condition may not be satisfied in real scenarios (since the accessible training data may only contain a limited number of manipulation types), we empirically find that incorporating $y$ into the decoder helps identify the texture information, even with $Y$ being only a binary pristine/fake label.

This theorem states that, under a deterministic setting, the latent variable can be identified up to a linear transformation (with an invertible matrix) and a point-wise transformation. Besides, we believe that the latent components (the texture and various abstractions regarding each facial region) are deterministic given the facial image. Therefore, we propose to learn such invariant texture information using an auto-encoder rather than a variational one.

4 Experiments

Table 1: Statistics of pristine/fake videos/images and the performance of XceptionNet [27] trained on the C40 data of FaceForensics++. Compared to FaceForensics++, XceptionNet suffers from performance degradation both on Celeb-DF, which has better visual quality, and on the benchmark images, which are compressed to smooth out the artifacts.
Dataset | # Pristine Videos | # Fake Videos | # Pristine Images | # Fake Images | AUC [27] | Precision [27]
FaceForensics++ [27] | 1,000 | 1,000 | 509.9k | 1,830.1k | 0.955 | -
Celeb-DF [18] | 590 | 5,639 | 225.4k | 2,116.8k | 0.655 | 0.810
Benchmark in [27] | 1,000 | - | 500 | 500 | - | 0.701

In this section, we use FaceForensics++ [27] (which exposes obvious artifacts for classification [32, 1, 18]) for training, and test our model on two datasets playing the role of $\mbox{P}^{\mathrm{H}}$: i) the test set of Celeb-DF [18], and ii) the additional 1,000 benchmark images in [27]. We introduce these three datasets in detail below, with statistics for each dataset summarized in Tab. 1.

4.1 Dataset

FaceForensics++ [27] for $\mbox{P}^{\mathrm{L}}$. FaceForensics++ transfers the face (or expression) from the target video to each frame of the source video to generate fake images. Different levels of image resolution are considered: Raw (original) with the highest resolution, the lightly compressed C23 (constant rate quantization parameter equal to 23), and C40. To synthesize a fake image, FaceForensics++ first extracts only the facial region from the source image to generate the target facial region, followed by final warping-back steps such as affine transformation and shape refinement to match the non-facial regions of the source image. Such a warping procedure can leave visual artifacts, which can be easily captured to detect the fake images [32, 1, 17].

Celeb-DF [18]. Compared to FaceForensics++, Celeb-DF significantly improves the image quality, which matches the characteristics of $\mbox{P}^{\mathrm{H}}$ and provides a more challenging arena for detection methods. Specifically, [18] adopted wider and deeper auto-encoder structures to improve the resolution. Besides, it generated a facial mask that, together with the surrounding facial context, makes the boundary smoother than the convex hull adopted in FaceForensics++. It additionally corrected the color mismatch. We can therefore approximately assume that the generation of a fake image is the same as that of a natural pristine image. Although the artifacts are removed, the fake images still exhibit inconsistent textures between facial and non-facial regions, which can be used for detection.

Benchmark images in [27]. In addition to FaceForensics++, [27] provides a competitive benchmark of 1,000 videos post-processed by unknown compression approaches to mimic real media scenarios with low resolution. This benchmark only contains 1,000 images, each of which was manually selected as the most challenging frame of the corresponding video. The data can be downloaded from the host server of [27]; the ground-truth labels are not released. As shown in Tab. 1, XceptionNet trained on C40 suffers from precision degradation on the benchmark images.

4.2 Implementations

We respectively adopt the Area Under the ROC Curve (AUC) and the prediction accuracy as the evaluation metrics for Celeb-DF and the benchmark images.
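As a small illustration of how these metrics are computed at the frame level (toy labels and scores; scikit-learn is an assumption of this sketch, not a dependency stated by the paper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

# y_true: pristine (0) / fake (1) labels; y_score: per-frame fake probabilities from f_cls
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80])

auc = roc_auc_score(y_true, y_score)         # AUC, reported for Celeb-DF
acc = accuracy_score(y_true, y_score > 0.5)  # accuracy at a 0.5 threshold, reported for the benchmark
```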

Compared Baselines. For Celeb-DF, we compare with:
a) A Convolutional Neural Network (CNN) supervised by cross entropy (CE). For future reference, we name it CE ($X\to Y$), regardless of the choice of backbone.
b) The spoof cues framework (LGSC) [6], which also employs an auto-encoder. LGSC is formulated as an anomaly detection method that regularizes the decoder to generate spoof cues for manipulated images. Specifically, it is composed of a spoof cue generator and an auxiliary classifier, with the spoof cue generator parameterized by a U-Net to capture multi-scale information. To expose spoof cues for manipulated images, it imposes a sparse regression loss on the distance between the residual feature map of pristine samples and the zero feature map, while putting no constraint on fake images.
c) To validate the effectiveness of the semantic classifier, we additionally evaluate a version of our method without $f_{\mathrm{SC}}$, with all other modules kept the same.

Implementation Details. For all methods, we first crop each image to enclose only the human face using the OpenFace toolkit [3]; this pre-processing has been shown [27] to improve detection performance. We only keep the face-replacement fake images in the training set, since Celeb-DF only contains face replacement. We implement a U-Net [26] for the decoder, which can capture multi-scale and especially fine-scale information in image reconstruction or object segmentation [4]. We implement various backbones for the encoder (also the layers before CE), including EfficientNet-B5 [30] with noisy-student initialization [34] and XceptionNet [5], to validate that the effectiveness of our method generalizes to any backbone as long as it is deep enough to capture the high-frequency signals. We use stochastic gradient descent (SGD) as the optimizer, with the learning rate set to 0.02 and the weight-decay coefficient set to 1e-4. The batch size is set to 16 and 20 for EfficientNet-B5 and XceptionNet, respectively. The $\lambda_{1},\lambda_{2}$ in Eq. (5) are set to 2 and 0.25, respectively. The $\alpha$ in $\mathcal{L}_{\mathrm{SC}}$ is set to 4.
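A hedged sketch of how such deep backbones could be plugged into the encoder of the earlier module sketch via the timm library (the library and model names are assumptions of this sketch, not tooling named by the paper):

```python
import timm
import torch.nn as nn

def make_encoder(name="xception", d_z=128):
    """Swap the toy encoder of the InTeLe sketch for a deep backbone.
    num_classes=0 makes timm return pooled features, which we project to d_z."""
    backbone = timm.create_model(name, pretrained=True, num_classes=0)
    return nn.Sequential(backbone, nn.LazyLinear(d_z))

enc_xcp = make_encoder("xception")
enc_eff = make_encoder("tf_efficientnet_b5_ns")  # EfficientNet-B5 with noisy-student pretraining
```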

Refer to caption
Refer to caption
Figure 4: ROC curves of our method and the compared baselines. The left image uses EfficientNet-B5 as the backbone; the right image uses XceptionNet as the backbone.

4.3 Results and Analysis

Refer to caption
Figure 5: Separation of artifacts by the two-branch decoder in our InTeLe. The top-left, top-right, bottom-left and bottom-right quadrants correspond to pristine images in Celeb-DF, fake images in Celeb-DF, pristine images in the benchmark, and fake images in the benchmark, respectively. From left to right, each row shows the original image $x$, the reconstruction by $f_{0}$ (i.e., $f_{0}(\mathrm{Enc}(x))$) and the reconstruction by $f_{1}$ (i.e., $f_{1}(\mathrm{Enc}(x))$). As marked by the red circles, $f_{1}(\mathrm{Enc}(x))$ exposes obvious artifacts whether $x$ is pristine or fake.

We first apply our method on the test set of Celeb-DF. The average and standard deviation over 3 runs for each method are summarized in Tab. 2. The ROC curves for these experiments are plotted in Fig. 4. As shown, our method performs better than or comparably to the other methods under all backbones. Specifically, due to its modeling of abnormal patterns with a triplet loss, LGSC also outperforms CE ($X\to Y$). Equipped with the semantic classifier $f_{\mathrm{SC}}$, our label-dependent decoder can separate out the artifact effect, making the encoded embedding $Z$ capture a cleaner invariance with texture information (i.e., Def. 3.3). The effectiveness of $f_{\mathrm{SC}}$ is validated by the improvement of our InTeLe over the variant without $f_{\mathrm{SC}}$.

Table 2: AUC (mean $\pm$ std) on Celeb-DF [18] over three runs, and accuracy (ACC) on the benchmark. As shown, our method consistently outperforms the others. The ROC curves are shown in Fig. 4.
Method | Celeb-DF [18] (EfficientNet-B5) | Celeb-DF [18] (XceptionNet) | Benchmark [27] (EfficientNet-B5)
CE ($X\to Y$) | $69.8\pm 1.0$ | $61.6\pm 0.8$ | $83.0\%$
LGSC [6] | $71.0\pm 1.8$ | $64.1\pm 1.3$ | $84.8\%$ [6]
InTeLe without $f_{\mathrm{SC}}$ | $71.3\pm 1.0$ | $65.3\pm 0.5$ | $84.4\%$
InTeLe (Ours) | $\bm{73.8\pm 1.1}$ | $\bm{68.1\pm 1.4}$ | $\bm{86.4\%}$

Results on benchmark images in [27]. For the benchmark data, we adopt a down-sampling strategy for the fake images in FaceForensics++ to make the pristine-to-fake ratio approximately 1:1 (1:2 in the original dataset). Besides, we augment the dataset by compressing the original images. Our method achieves an accuracy of $\mathbf{86.4\%}$, which outperforms LGSC ($84.8\%$) and is the best among the methods with published code on the leaderboard. The comparisons with CE ($X\to Y$) and InTeLe without $f_{\mathrm{SC}}$ show phenomena similar to those on Celeb-DF, as shown in Tab. 2.

4.4 Effect of Separating Artifacts

To validate InTeLe's effect of separating the artifacts, we visualize the triplet $(x,f_{0}(\mathrm{Enc}(x)),f_{1}(\mathrm{Enc}(x)))$ in Fig. 5. We pick two examples for each case: fake images of Celeb-DF, pristine images of Celeb-DF, fake images of the benchmark, and pristine images of the benchmark, corresponding to the top-left, top-right, bottom-left and bottom-right quadrants. As marked by the red circles, $f_{1}(\mathrm{Enc}(x))$ exhibits obvious artifacts such as stark contrast and color mismatch, whereas $f_{0}(\mathrm{Enc}(x))$ does not, even when $x$ is fake. This differentiation validates that our $f_{0},f_{1}$ successfully capture the artifact effect and hence leave the invariant texture violation encoded in $Z$, which can be attributed to the label-dependent decoder and the appended semantic classifier $f_{\mathrm{SC}}$.

5 Conclusions & Discussions

We propose the InTeLe framework, which exploits the texture violation due to manipulation for Deepfake detection. Based on an auto-encoder framework, we propose two-branch decoders appended with a semantic classifier to separate the texture information from the artifact effect. As an invariance shared among all fake images, capturing this texture violation yields a robust detection method, as validated by much better generalization results than existing artifact-based methods.

For limitations, our identifiability theorem requires that the types of generating processes (i.e., manipulation methods) are diverse enough, which may not be satisfied in real scenarios. Besides, our method should be able to generalize to more types of manipulation beyond the face replacement and expression reenactment considered in this work. Such broader generalization, together with the identification of the invariant texture information from a limited number of fake types, is left for future work.

References

  • [1] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. Mesonet: a compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–7. IEEE, 2018.
  • [2] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. Protecting world leaders against deep fakes. In CVPR Workshops, pages 38–45, 2019.
  • [3] Brandon Amos, Bartosz Ludwiczuk, Mahadev Satyanarayanan, et al. Openface: A general-purpose face recognition library with mobile applications. 2016.
  • [4] Bhakti Baheti, Shubham Innani, Suhas Gajre, and Sanjay Talbar. Eff-unet: A novel architecture for semantic segmentation in unstructured environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 358–359, 2020.
  • [5] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
  • [6] Haocheng Feng, Zhibin Hong, Haixiao Yue, Yang Chen, Keyao Wang, Junyu Han, Jingtuo Liu, and Errui Ding. Learning generalized spoof cues for face anti-spoofing. arXiv preprint arXiv:2005.03922, 2020.
  • [7] Javier Galbally, Sébastien Marcel, and Julian Fierrez. Image quality assessment for fake biometric detection: Application to iris, fingerprint, and face recognition. IEEE transactions on image processing, 23(2):710–724, 2013.
  • [8] David Güera and Edward J Delp. Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–6. IEEE, 2018.
  • [9] Abdenour Hadid, Matti Pietikainen, and T Ahonen. Face description with local binary patterns: Application to face recognition. IEEE transactions on pattern analysis and machine intelligence, 28(12):2037–2041, 2006.
  • [10] Ilyes Khemakhem, Diederik P Kingma, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In Proceedings of the 23th International Conference on Artificial Intelligence and Statistics (AISTATS-23), volume 108, Palermo, Italy, 2020. AISTATS Committee, PMLR.
  • [11] Ilyes Khemakhem, Ricardo Pio Monti, Diederik P Kingma, and Aapo Hyvärinen. Ice-beem: Identifiable conditional energy-based deep models. arXiv preprint arXiv:2002.11537, 2020.
  • [12] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
  • [13] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR 2014), Banff, Canada, 2014. ICLR Committee.
  • [14] Pavel Korshunov and Sébastien Marcel. Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint arXiv:1812.08685, 2018.
  • [15] Pavel Korshunov and Sébastien Marcel. Deepfake detection: humans vs. machines. arXiv preprint arXiv:2009.03155, 2020.
  • [16] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In ictu oculi: Exposing ai generated fake face videos by detecting eye blinking. arXiv preprint arXiv:1806.02877, 2018.
  • [17] Yuezun Li and Siwei Lyu. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656, 2018.
  • [18] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3207–3216, 2020.
  • [19] Chang Hong Liu, Charles A Collin, Reza Farivar, and Avi Chaudhuri. Recognizing faces defined by texture gradients. Perception & psychophysics, 67(1):158–167, 2005.
  • [20] Yaqi Liu, Qingxiao Guan, Xianfeng Zhao, and Yun Cao. Image forgery localization based on multi-scale convolutional neural networks. In Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security, pages 85–90, 2018.
  • [21] Siwei Lyu. Deepfake detection: Current challenges and next steps. In 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pages 1–6. IEEE, 2020.
  • [22] Falko Matern, Christian Riess, and Marc Stamminger. Exploiting visual artifacts to expose deepfakes and face manipulations. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), pages 83–92. IEEE, 2019.
  • [23] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen. Capsule-forensics: Using capsule networks to detect forged images and videos. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2307–2311. IEEE, 2019.
  • [24] Judea Pearl. Causality. Cambridge university press, 2009.
  • [25] Jean-Sébastien Pierrard and Thomas Vetter. Skin detail analysis for face recognition. In 2007 IEEE conference on computer vision and pattern recognition, pages 1–8. IEEE, 2007.
  • [26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [27] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–11, 2019.
  • [28] Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, and Prem Natarajan. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI), 3(1), 2019.
  • [29] Raphael Suter, Djordje Miladinovic, Bernhard Schölkopf, and Stefan Bauer. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In International Conference on Machine Learning, pages 6056–6065, 2019.
  • [30] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
  • [31] James W Tanaka and Diana Simonyi. The “parts and wholes” of face recognition: A review of the literature. Quarterly Journal of Experimental Psychology, 69(10):1876–1889, 2016.
  • [32] Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. Deepfakes and beyond: A survey of face manipulation and fake detection. arXiv preprint arXiv:2001.00179, 2020.
  • [33] Yaohui Wang and Antitza Dantcheva. A video is worth more than 1000 lies. comparing 3dcnn approaches for detecting deepfakes. In FG’20, 15th IEEE International Conference on Automatic Face and Gesture Recognition, May 18-22, 2020, Buenos Aires, Argentina., 2020.
  • [34] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
  • [35] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8261–8265. IEEE, 2019.
  • [36] Baochang Zhang, Yongsheng Gao, Sanqiang Zhao, and Jianzhuang Liu. Local derivative pattern versus local binary pattern: face recognition with high-order local pattern descriptor. IEEE transactions on image processing, 19(2):533–544, 2009.
  • [37] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. Two-stream neural networks for tampered face detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1831–1839. IEEE, 2017.
  • [38] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. Learning rich features for image manipulation detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1053–1061, 2018.