
Semantic-aware One-shot Face Re-enactment with Dense Correspondence Estimation

Yunfan Liu, Qi Li, Zhenan Sun, Tieniu Tan
Abstract

One-shot face re-enactment is a challenging task due to the identity mismatch between source and driving faces. Specifically, suboptimally disentangled identity information of the driving subject inevitably interferes with the re-enactment result and leads to face shape distortion. To solve this problem, this paper proposes to use a 3D Morphable Model (3DMM) for explicit facial semantic decomposition and identity disentanglement. Instead of using 3D coefficients alone for re-enactment control, we take advantage of the generative ability of 3DMM to render textured face proxies. These proxies contain abundant yet compact geometric and semantic information of human faces, which enables us to compute the face motion field between source and driving images by estimating their dense correspondence. In this way, we can approximate the re-enactment result by warping the source image according to the motion field, and a Generative Adversarial Network (GAN) is adopted to further improve the visual quality of the warping result. Extensive experiments on various datasets demonstrate the advantages of the proposed method over existing state-of-the-art benchmarks in both identity preservation and re-enactment fulfillment.

1 Introduction

Face re-enactment aims at animating the source face with the pose and expression conveyed by driving images, while keeping its identity information unchanged. Due to its wide range of practical applications, face re-enactment has attracted remarkable research attention in the last few years. With the rapid development of Generative Adversarial Networks (GANs), numerous studies have been conducted to solve the face re-enactment problem with GAN-based models (Kim et al. 2018; Thies et al. 2016; Wu et al. 2018). Although realistic results can be obtained, these methods are either designed only for a specific subject or require a large number of images of the source identity, which heavily limits their application in real-world scenarios.

Figure 1: Illustration of (a) sample face re-enactment results with ‘identity leak’. Note the change of identity in results obtained by One-shot (Zhang et al. 2019) and Latent-pose (Burkov et al. 2020), and (b) visualization of the face proxies of source and re-enacted faces.

In this work, we focus on solving a more generalized problem, i.e., one-shot face re-enactment, where only a single image of the source subject is available at test time. The key to solving this problem is to fully disentangle the geometric attributes (i.e., pose and expression) of driving images from personal characteristics; otherwise, the residual identity information of the driving subject interferes with the re-enactment process and causes shape distortion in generation results (known as the ‘identity leak’ problem (Ha et al. 2020), see Fig. 1(a)).

To this end, various representations of facial geometry have been proposed to eliminate the influence of the driving identity. Facial Action Units (FAUs), a coding system for classifying facial muscle movements, are used in some previous studies (Pumarola et al. 2018; Tripathy, Kannala, and Rahtu 2020, 2021) to describe face motion. However, the flexibility of FAUs is quite limited, as they cannot describe complex facial expressions or head poses. Facial landmarks are also widely used in existing methods (Ha et al. 2020; Huang, Yang, and Wang 2020; Nirkin, Keller, and Hassner 2019; Song et al. 2018; Wu et al. 2018; Zhang et al. 2020a, 2019; Siarohin et al. 2019; Tripathy, Kannala, and Rahtu 2022) to depict the contour and layout of facial components with facial texture excluded. Compared to FAUs, facial landmarks provide a more comprehensive description of the spatial structure of human faces. Unfortunately, they still carry personal attributes of the driving subject in terms of face shape, which may impair the preservation of the source identity.

In this paper, we propose to use an advanced 3D Morphable Model (3DMM), i.e., FLAME (Li et al. 2017), for facial geometry decomposition. Unlike most existing methods, which only use low-dimensional 3DMM coefficients (Yin et al. 2017; Nguyen-Phuoc et al. 2019; Tewari et al. 2020; Ghosh et al. 2020) or plain face meshes (Yao et al. 2020) for face re-enactment control, we also make use of its generative ability to render textured face proxies (see Fig. 1(b)). Compared to original face images, these representations not only contain richer spatial information about the skin area and facial components, but are also less subject to the influence of high-frequency image details or background content when estimating the face motion.

This inspires us to leverage textured face proxies for face motion field estimation, which further facilitates the prediction of re-enacted faces. Concretely, we first generate the proxy of the re-enacted face with the FLAME model based on a mixture of the 3DMM parameters of the source and driving faces (see Fig. 2). Then, the face motion field is computed by the proposed Dense Correspondence Guided GAN (DCG-GAN) by estimating the dense spatial correspondence between the proxies of the source and re-enacted faces. Afterwards, source images are warped according to the motion field, and the results are fed into a generator network to improve the visual quality. Extensive experimental results demonstrate the effectiveness of the proposed method and its advantages over existing state-of-the-art approaches.

Contributions of this work are summarized as follows:

  • An advanced 3DMM, i.e., FLAME, is adopted to disentangle the identity feature from pose and expression. We also use it to render textured face proxies that predict the geometry and semantics of re-enacted faces, which facilitates the estimation of face motion.

  • A novel GAN-based framework, named DCG-GAN, is proposed to generate realistic re-enactment results. It first estimates the face motion field by computing the dense correspondence, and then improves the visual quality of the warping result.

  • Extensive experiments are conducted on multiple datasets to demonstrate the advantage of DCG-GAN in solving the one-shot face re-enactment problem, compared to existing state-of-the-art methods.

2 Method

Given a source image I_{s} and a driving image I_{d}, we aim to generate the re-enacted image I_{r} with the pose and expression conveyed in I_{d} and the identity in I_{s}. As shown in Fig. 3, our method comprises three modules. First, a 3D-guided Face Proxy Editor is adopted to decompose the geometric semantics of both I_{s} and I_{d}, and to render the proxies of I_{s} and I_{r} (Section 2.1). Then, the spatial correlation \mathcal{F} between the proxies is estimated by a Dense Correspondence Estimator to represent the face motion, which is used to approximate I_{r} by warping I_{s} (Section 2.2). Thereafter, the warped I_{s} and other associated spatial representations are fed into the Facial Prior-guided Generator to synthesize I_{r} with high visual fidelity (Section 2.3).

Figure 2: Illustration of the reorganization of 3D coefficients and the rendering of textured face proxies. Black dashed lines with arrows denote that the connected networks (i.e., E_{3D} and R) share the same weights.

2.1 3D-guided Face Proxy Editor

We adopt an advanced 3D parametric model, FLAME (Li et al. 2017), to explicitly decompose face semantics into disentangled components. Specifically, FLAME depicts the geometry of a portrait image with coefficients of camera \alpha, pose \beta, expression \theta, shape \phi, lighting condition \lambda, and facial texture \mu. As shown in Fig. 2, an encoder network, denoted as E_{3D}: I \rightarrow C (C = \{\alpha, \beta, \theta, \phi, \lambda, \mu\}), is proposed to regress the FLAME coefficients C from a given face image I. Concretely, coefficients for the disentangled semantics of I_{s} and I_{d} are computed as C_{s} = E_{3D}(I_{s}) and C_{d} = E_{3D}(I_{d}), respectively. Thus, the set of coefficients for the re-enacted face I_{r} can be obtained by simply mixing elements of C_{s} and C_{d}, denoted as C_{r} = \{\alpha_{d}, \beta_{d}, \theta_{d}, \phi_{s}, \lambda_{s}, \mu_{s}\} (geometry-related coefficients are taken from I_{d}, and the rest from I_{s}).
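To make the coefficient reorganization concrete, the following minimal sketch assembles C_r from C_s and C_d. The dictionary keys and coefficient dimensions are illustrative assumptions; the actual E_{3D} encoder and FLAME coefficient layout are not reproduced here.

```python
import torch

# Hypothetical outputs of E_3D for the source and driving images; the key
# names and dimensions are placeholders, not the exact FLAME layout.
C_s = {"camera": torch.randn(3), "pose": torch.randn(6), "exp": torch.randn(50),
       "shape": torch.randn(100), "light": torch.randn(27), "tex": torch.randn(50)}
C_d = {k: torch.randn_like(v) for k, v in C_s.items()}

def mix_coefficients(C_s, C_d):
    """C_r takes the geometry-related coefficients (camera, pose, expression)
    from the driving face and shape, lighting, and texture from the source."""
    return {"camera": C_d["camera"], "pose": C_d["pose"], "exp": C_d["exp"],
            "shape": C_s["shape"], "light": C_s["light"], "tex": C_s["tex"]}

C_r = mix_coefficients(C_s, C_d)  # subsequently rendered as P_r = R(C_r)
```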

Figure 3: An overview of the proposed model DCG-GAN. The Dense Correspondence Estimation Module computes the correlation matrix based on feature vectors extracted from face proxies. Afterwards, warped spatial representations of the source face are fed into the Facial Prior-guided Generator to synthesize re-enactment results with high quality.

Unlike most existing methods that merely use low-dimensional 3DMM coefficients for re-enactment control, we also make use of the generative ability of FLAME to convey more semantic information in the spatial dimension. To this end, given the set of parameters C_{r}, a differentiable rendering network (Feng et al. 2021) (denoted as R: C \rightarrow I) is adopted to generate the corresponding proxy image P_{r} = R(C_{r}). Notably, P_{r} helps mitigate the identity gap, as it shares the same identity with P_{s} = R(C_{s}), i.e., the textured mesh of I_{s}, and contains abundant yet compact spatial information (e.g., no background but adequate facial texture) to help estimate the desired face motion field.

2.2 Dense Correspondence Estimator

Intuitively, the mismatch between I_{s} and I_{d} in both identity and image content (e.g., background and high-frequency texture) makes it difficult to accurately estimate the face motion field desired for animating I_{s}. However, as discussed in Section 2.1, the textured face proxies rendered by FLAME (i.e., P_{s} and P_{r}) help mitigate these gaps and provide an aligned space for motion field estimation. In this work, we model the face motion between I_{s} and I_{d} with a dense correspondence field \mathcal{F}, where each element f_{i,j} \in \mathcal{F} denotes the correlation between two image patches p_{i} and p_{j} within P_{s} and P_{r}, respectively.

Specifically, we use an encoder network E_{corr} followed by two sequences of ResBlocks, denoted as E_{res}^{(s)} and E_{res}^{(r)}, to extract features from P_{s} and P_{r}, respectively. Feature maps computed by a pre-trained HR-Net (Sun et al. 2019) (denoted as L_{s} and L_{r}) are also involved to represent the distribution of facial key-points, which helps the network better explore the inner relationship between facial components. So far, both the face proxy P and the landmark embedding L only describe the content within the facial area and do not capture the rest of the image. Therefore, to make the network learn about the spatial layout beyond facial regions, we also feed the parsing map of I_{s} (denoted as M_{s}), computed by a pre-trained BiSeNet (Yu et al. 2018), to the feature extraction network (see Fig. 3).

Mathematically, the feature vector for the source face image could be written as

f_{s} = E_{res}^{(s)}(E_{corr}(P_{s}), L_{s}, M_{s})   (1)

and similarly, we have

f_{r} = E_{res}^{(r)}(E_{corr}(P_{r}), L_{r})   (2)

describing the feature of P_{r}. Therefore, the dense correspondence field \mathcal{F} could be represented by the correlation matrix between f_{s} and f_{r} (Zhang et al. 2020b; Zhou et al. 2021), which could be formulated as

\mathcal{F} = softmax(\hat{f}_{r} \cdot \hat{f}_{s}^{T}) \in \mathbb{R}^{hw \times hw}   (3)

where \hat{f}_{s} \in \mathbb{R}^{hw \times 1} and \hat{f}_{r} \in \mathbb{R}^{hw \times 1} denote the flattened and normalized versions of f_{s} \in \mathbb{R}^{h \times w} and f_{r} \in \mathbb{R}^{h \times w}, respectively.
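A minimal sketch of Eq. 3 in PyTorch is given below. It assumes multi-channel feature maps of shape (C, h, w) for f_s and f_r and L2-normalizes each position before the matrix product; the channel dimension, the normalization choice, and applying the softmax over source positions (described as column-wise in the text) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dense_correspondence(f_s, f_r):
    """Compute the hw x hw correspondence matrix of Eq. 3 from feature maps
    f_s, f_r of shape (C, h, w) extracted from the proxies P_s and P_r."""
    C, h, w = f_s.shape
    f_s_hat = F.normalize(f_s.reshape(C, h * w).t(), dim=1)  # (hw, C), per-position L2 norm
    f_r_hat = F.normalize(f_r.reshape(C, h * w).t(), dim=1)  # (hw, C)
    # Softmax over source positions: every row (one target position) sums to 1,
    # so warping becomes a convex combination of source pixels.
    return torch.softmax(f_r_hat @ f_s_hat.t(), dim=-1)      # (hw, hw)
```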

Figure 4: The network structure of AFM. Feature maps of the foreground and background are fused with an attention map computed from all inputs, and the result serves as semantic guidance for the subsequent generation process.

Notably, with the column-wise softmax operation in Eq. 3, warping of I_{s} can simply be achieved by matrix multiplication, i.e., I_{s}^{warp} = \mathcal{F} \cdot \tilde{I}_{s}, where \tilde{I}_{s} \in \mathbb{R}^{hw \times 1} denotes I_{s} after down-sampling and flattening. The advantage is two-fold: 1) the operation can easily be extended to other spatial representations of I_{s}, e.g., the textured face proxy P_{s} (P_{s}^{warp} = \mathcal{F} \cdot \tilde{P}_{s}) or the parsing map M_{s} (M_{s}^{warp} = \mathcal{F} \cdot \tilde{M}_{s}), to provide multi-modal prior knowledge for the subsequent generation process (see Section 2.3), and 2) the reverse mapping can also be implemented with a simple matrix multiplication, which facilitates the computation of the cycle consistency constraint (see Section 2.4).
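The warping by matrix multiplication can be sketched as follows; the bilinear down-sampling of the source representation to the h × w resolution of the correspondence field is an assumed implementation detail.

```python
import torch
import torch.nn.functional as F

def warp(corr, x, h, w):
    """Warp a source representation (I_s, P_s, or M_s) of shape (c, H, W) with
    the hw x hw correspondence matrix: each output pixel is a weighted sum of
    source pixels, i.e. X_s^warp = corr @ X_s_flattened."""
    c = x.shape[0]
    x_small = F.interpolate(x.unsqueeze(0), size=(h, w), mode="bilinear",
                            align_corners=False)[0]           # down-sample
    x_flat = x_small.reshape(c, h * w).t()                    # (hw, c), flatten
    return (corr @ x_flat).t().reshape(c, h, w)

# Reverse mapping used by the cycle-consistency loss in Section 2.4 (a sketch):
# I_s_rec = warp(torch.softmax(corr.t(), dim=-1), I_s_warp, h, w)
```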

2.3 Facial Prior-guided Generator

Although I_{s}^{warp} could theoretically approximate the ideal re-enactment result and serve as a generative prior, it is often blurred due to the loss of high-frequency details in I_{s}, and distorted due to errors in the estimated warping field \mathcal{F} (see Fig. 3). To solve this problem, we propose a Facial Prior-guided Generator that synthesizes the final re-enactment result with high visual quality, incorporating various warped spatial representations of the source subject (i.e., I_{s}^{warp}, M_{s}^{warp}, and P_{s}^{warp}) as guiding prior knowledge.

As shown in Fig. 3, the generator takes the down-sampled concatenation of I_{s}^{warp}, M_{s}^{warp}, and P_{s}^{warp} as input, and generates the final output with a sequence of ResBlocks. For ResBlocks at coarse levels, I_{s}^{warp}, M_{s}^{warp}, and P_{s}^{warp} are involved to introduce prior knowledge about the spatial layout of re-enacted faces. Since they contain different semantic information about I_{r}, we design an Attention-based Fusion Module (AFM, shown in Fig. 4) to fuse them into a single fused feature map, which is injected into the main body of the generator as semantic guidance.

Concretely, since the information about the image background is only embedded in I_{s}^{warp}, while that of the foreground region (i.e., the facial skin area) is contained in both I_{s}^{warp} and P_{s}^{warp}, the features of the background (f_{bg}) and foreground (f_{fg}) could be computed as

f_{bg} = conv(STN(I_{s}^{warp}))   (4)
f_{fg} = conv(concat(STN(I_{s}^{warp}), STN(P_{s}^{warp})))   (5)

where STN denotes the Spatial Transformer Network (Jaderberg et al. 2015), conv stands for a convolutional block, and concat refers to channel-wise concatenation. Then, f_{bg} and f_{fg} are merged by an attention map A computed from all three inputs, as shown in Fig. 4. The final output f_{out} is obtained by merging f_{bg} and f_{fg} via

f_{out} = A \cdot f_{fg} + (1 - A) \cdot f_{bg}   (6)

Afterwards, f_{out} is integrated into the corresponding ResBlock via spatially-adaptive normalization (Park et al. 2019). By contrast, due to the inevitable error in warping field estimation, we do not incorporate I_{s}^{warp}, M_{s}^{warp}, and P_{s}^{warp} into ResBlocks at finer levels, since doing so may introduce noise into high-frequency textural details.
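The fusion step of AFM (Eqs. 4-6) can be sketched as below. The channel counts, the omission of the STN alignment, and the architecture of the attention branch are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class AFMSketch(nn.Module):
    """Attention-based fusion of background (from I_s^warp) and foreground
    (from I_s^warp and P_s^warp) features, following Eqs. 4-6 (STN omitted)."""
    def __init__(self, ch=64):
        super().__init__()
        self.conv_bg = nn.Conv2d(3, ch, 3, padding=1)             # Eq. 4
        self.conv_fg = nn.Conv2d(6, ch, 3, padding=1)             # Eq. 5
        self.attn = nn.Sequential(                                 # attention from all three inputs
            nn.Conv2d(9, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, I_warp, P_warp, M_warp):
        # Inputs are assumed to be (B, 3, H, W) tensors of the same size.
        f_bg = self.conv_bg(I_warp)
        f_fg = self.conv_fg(torch.cat([I_warp, P_warp], dim=1))
        A = self.attn(torch.cat([I_warp, P_warp, M_warp], dim=1))
        return A * f_fg + (1 - A) * f_bg                           # Eq. 6
```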

2.4 Objective Functions

The dense correspondence estimator and facial prior-guided generator are trained simultaneously with the following objective functions.

Losses on Dense Correspondence Estimation

To regulate the estimated dense correspondence field \mathcal{F}, a straightforward pixel-level loss L_{mw} is used to minimize the error of proxy warping, which could be written as

L_{mw} = \|P_{s}^{warp} - P_{r}\|_{1}   (7)

Similar to (Zhu et al. 2017a), in order to regularize the learned mapping, a cycle consistency loss L_{cc} is also adopted to ensure that the reconstructed source image I_{s}^{\prime} is close to I_{s}, which could be formulated as

L_{cc} = \|I_{s}^{\prime} - I_{s}\|_{1}   (8)

Here, I_{s}^{\prime} = softmax(\mathcal{F}^{T}) \cdot \tilde{I}_{s}^{warp} denotes the result of reverse-mapping I_{s}^{warp} via matrix multiplication, which demonstrates the advantage of modeling the face motion with a correlation matrix.
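A sketch of the two correspondence losses is given below, assuming flattened (hw, c) image tensors for the cycle term; whether the L1 norms are summed or averaged is an implementation assumption.

```python
import torch

def motion_warp_loss(P_s_warp, P_r):
    """Eq. 7: L1 distance between the warped source proxy and the target proxy."""
    return (P_s_warp - P_r).abs().mean()

def cycle_consistency_loss(corr, I_s_warp_flat, I_s_flat):
    """Eq. 8: reverse-map the warped image with softmax(F^T) and compare to I_s.
    corr: (hw, hw) correspondence matrix; *_flat: (hw, c) flattened images."""
    I_s_rec = torch.softmax(corr.t(), dim=-1) @ I_s_warp_flat
    return (I_s_rec - I_s_flat).abs().mean()
```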

Losses on Semantic Consistency

To fulfill the requirement of face re-enactment, explicit constraints L_{id} and L_{geo} are adopted to control the personal attributes and facial geometry of I_{r}, respectively. Mathematically, the identity loss L_{id} could be formulated as

L_{id} = 1 - \cos(\mathcal{A}(I_{r}), \mathcal{A}(I_{s}))   (9)

where \mathcal{A} is a pre-trained ArcFace (Deng et al. 2019) network and \cos(\cdot, \cdot) denotes the cosine similarity of two face embeddings. The geometry loss L_{geo} could be written as

L_{geo} = \|\beta_{r} - \beta_{d}\|_{1} + \|\theta_{r} - \theta_{d}\|_{1}   (10)

where \beta_{r} (\beta_{d}) and \theta_{r} (\theta_{d}) are the 3DMM coefficients for the pose and expression of I_{r} (I_{d}) regressed by E_{3D}, respectively.
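The two semantic-consistency terms can be sketched as follows, treating the pre-trained ArcFace embedder and the coefficient regressor E_{3D} as frozen callables (their loading and exact output formats are assumptions).

```python
import torch
import torch.nn.functional as F

def identity_loss(arcface, I_r, I_s):
    """Eq. 9: one minus the cosine similarity of the ArcFace face embeddings."""
    return 1.0 - F.cosine_similarity(arcface(I_r), arcface(I_s), dim=-1).mean()

def geometry_loss(e_3d, I_r, I_d):
    """Eq. 10: L1 error between pose and expression coefficients regressed
    from the re-enacted and driving images (dictionary keys are illustrative)."""
    C_r, C_d = e_3d(I_r), e_3d(I_d)
    return ((C_r["pose"] - C_d["pose"]).abs().mean()
            + (C_r["exp"] - C_d["exp"]).abs().mean())
```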

Unlike existing methods (Ha et al. 2020; Wang, Mallya, and Liu 2021; Yao et al. 2021) that merely use paired data \langle I_{s}, I_{s}^{\ast} \rangle from the same video for image-level supervision, we adopt a hybrid learning scheme in which I_{s} and I_{d} with different identities are also involved during training. The dense pixel-level loss L_{pix} is therefore formulated as

L_{pix} = \begin{cases} \|I_{r}^{\ast} - I_{s}^{\ast}\|_{2} + \|M_{s}^{warp} - \mathcal{M}(I_{s}^{\ast})\|_{2} & \text{if } I_{d} = I_{s}^{\ast} \\ 0 & \text{otherwise} \end{cases}   (11)

where \mathcal{M} denotes the pre-trained BiSeNet (Yu et al. 2018) for computing the face parsing map. The strategy for selecting I_{d} will be detailed in Section 3.1.
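The switch between paired and unpaired supervision in Eq. 11 can be sketched as below; the target parsing map is assumed to be precomputed by BiSeNet, and the mean-squared-error reduction is an assumption.

```python
import torch

def pixel_loss(I_r, I_target, M_s_warp, M_target, paired):
    """Eq. 11: dense image- and parsing-level supervision, applied only when the
    driving frame comes from the same video as the source (paired=True)."""
    if not paired:
        return torch.zeros((), device=I_r.device)
    return ((I_r - I_target) ** 2).mean() + ((M_s_warp - M_target) ** 2).mean()
```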

Losses on Visual Fidelity

Let us use G to roughly refer to the entire DCG-GAN network for face re-enactment, i.e., I_{r} = G(I_{s}, I_{d}). To improve the visual quality of the generation results, a discriminator network D is adopted to distinguish synthetic images from real ones. D has an ordinary network structure similar to PatchGAN (Isola et al. 2017), and the hinge loss h(t) = \min(0, -1 + t) is used for discriminator regularization (Wang et al. 2018b; Zhang et al. 2020b). Mathematically, the adversarial losses for G and D can be written as

L_{adv}^{D} = -\mathbb{E}_{I_{d} \sim p_{data}}[h(D(I_{d}))] - \mathbb{E}_{I_{s}, I_{d} \sim p_{data}}[h(-D(G(I_{s}, I_{d})))]   (12)
L_{adv}^{G} = -\mathbb{E}_{I_{s}, I_{d} \sim p_{data}}[D(G(I_{s}, I_{d}))]   (13)

where p_{data} refers to the real data distribution.
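The hinge-based adversarial terms of Eqs. 12-13 can be sketched as follows, with D assumed to output real/fake scores that are averaged over patches.

```python
import torch

def hinge(t):
    """h(t) = min(0, -1 + t)."""
    return torch.clamp(t - 1.0, max=0.0)

def discriminator_loss(D, I_d, I_fake):
    """Eq. 12 (I_fake = G(I_s, I_d), detached when updating D)."""
    return -hinge(D(I_d)).mean() - hinge(-D(I_fake.detach())).mean()

def generator_loss(D, I_fake):
    """Eq. 13: non-saturating adversarial term for the generator."""
    return -D(I_fake).mean()
```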

The Overall Training Objective

In summary, the overall training objective function for GG and DD could be written as

L_{G} = \lambda_{mw} L_{mw} + \lambda_{cc} L_{cc} + \lambda_{id} L_{id} + \lambda_{geo} L_{geo} + \lambda_{pix} L_{pix} + \lambda_{adv_{G}} L_{adv}^{G}   (14)
L_{D} = L_{adv}^{D}   (15)

where \lambda_{mw}, \lambda_{cc}, \lambda_{id}, \lambda_{geo}, \lambda_{pix}, and \lambda_{adv_{G}} are coefficients balancing the importance of the different loss terms. L_{G} and L_{D} are minimized alternately until convergence.

Figure 5: Visualization of intermediate warping results. It is clear that the warped representations successfully predict the semantics of re-enacted faces.

3 Experiments

3.1 Configuration Setup

Implementation Details The proposed model is trained with the Adam optimizer (\beta_{1} = 0.9, \beta_{2} = 0.999) and a learning rate of 1e-4. We empirically set \lambda_{geo} to 1.0, and roughly adjust the other coefficients so that their corresponding loss values are one order of magnitude lower. At each iteration, the network is trained with paired data from the same video with probability p \in [0, 1], and with unpaired face images with probability 1 - p. To facilitate convergence, p is set to a high value (e.g., 0.8) in the starting epochs and gradually decays to accustom the network to unpaired input. All experiments are run on an NVIDIA A100 GPU with the batch size set to 8.
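As a sketch of the hybrid sampling schedule, the snippet below decays p linearly across epochs; the decay shape and the final value are assumptions, since only the high starting value (around 0.8) is specified.

```python
import random

def paired_probability(epoch, total_epochs, p_start=0.8, p_end=0.2):
    """Probability of drawing a paired (same-video) source/driving sample;
    decays so that the network gradually sees more unpaired input."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return p_start + (p_end - p_start) * t

def sample_is_paired(epoch, total_epochs):
    """Train with paired data with probability p, unpaired data with 1 - p."""
    return random.random() < paired_probability(epoch, total_epochs)
```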

Datasets Experiments are conducted on the following datasets with all images resized to 256 × 256.

  • FF++ (Rossler et al. 2019): FaceForensics++ (FF++) contains 1,000 video sequences of near frontal face without occlusion. We evenly extract 10 frames from each video and collect 10,000 images in total.

  • Celeb-DF (Li et al. 2020): Celeb-DF is a large-scale dataset containing 890 realistic videos collected from YouTube, which is challenging as it covers a wide range of age, gender, and ethnic groups with extreme poses and varying image quality.

  • VoxCeleb1 (Nagrani, Chung, and Zisserman 2017): VoxCeleb1 is another large-scale video dataset of 1,251 celebrities acquired from YouTube. We directly use the cropped images released on the official website (https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) in our experiments.

Similar to (Siarohin et al. 2019), on FF++ and Celeb-DF we randomly select the videos of 80% of the subjects for training, and use the rest for testing with no identity overlap with the training split. VoxCeleb1 is used as a driving dataset to test the performance of cross-domain face re-enactment, where source images are selected from the testing splits of FF++ and Celeb-DF.

Benchmark Methods To achieve a fair performance comparison, we choose four representative methods with released code as benchmarks, including One-shot (Zhang et al. 2019), Latent-pose (Burkov et al. 2020), Bi-layer (Zakharov et al. 2020), and FOMM (Siarohin et al. 2019). Note that other approaches have also been proposed for one-shot face re-enactment (e.g., (Yao et al. 2021, 2020)). However, no publicly available code has been found for these methods, and it would be unfair to compare against our own re-implementations.

Evaluation Metrics For quantitative evaluation, three metrics commonly used in previous one-shot face re-enactment studies (Ha et al. 2020; Wang, Mallya, and Liu 2021; Yao et al. 2021; Zhang et al. 2019) are adopted. The cosine similarity (CosSim) between the features of I_{r} and I_{s} is computed to evaluate identity preservation. The Mean Squared Error (MSE) of the pose (PoseMSE) and expression (ExpMSE) embeddings between I_{r} and I_{d} measures the accuracy of re-enactment fulfillment. We use CosFace (Wang et al. 2018a) and Hopenet (Ruiz, Chong, and Rehg 2018) to extract the identity and pose features, respectively. To provide a more comprehensive evaluation, both 3DDFA (Zhu et al. 2017b) and FaceWarehouse (Cao et al. 2013) are adopted to compute the expression embedding.
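The three metrics can be sketched as below, treating the pre-trained CosFace, Hopenet, and expression regressors as frozen embedding functions (their loading and pre-processing are omitted, and the embedding shapes are assumptions).

```python
import torch
import torch.nn.functional as F

def cos_sim(id_net, I_r, I_s):
    """CosSim: cosine similarity between identity embeddings of I_r and I_s."""
    return F.cosine_similarity(id_net(I_r), id_net(I_s), dim=-1).mean()

def embedding_mse(embed_net, I_r, I_d):
    """PoseMSE / ExpMSE: mean squared error between pose or expression
    embeddings of the re-enacted and driving images."""
    return ((embed_net(I_r) - embed_net(I_d)) ** 2).mean()
```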

3.2 Qualitative Results

Single-domain face re-enactment. Sample results of one-shot face re-enactment on FF++ and Celeb-DF are shown in Fig. 5. It is clear that the textured proxies of re-enacted faces (i.e., P_{r}, in the red box) are able to convey the pose and expression of driving faces while maintaining the identity and lighting condition of source images. The key to this success is that the 3DMM provides us with a set of disentangled geometric bases, which serve as a common coordinate system onto which we can project face images to obtain aligned coefficient representations. As a result, it improves the accuracy of dense correspondence estimation by eliminating the interference of both the driving identity and redundant high-frequency details in non-facial areas. The warped representations shown in Fig. 5 indicate that the computed correspondence field \mathcal{F} is also effective beyond the facial region, thanks to the prior knowledge (i.e., M_{s}, L_{s}, and L_{r}) introduced when extracting features from P_{s} and P_{r}.

Figure 6: Sample results of single-domain face re-enactment on FF++ and Celeb-DF. All images are aligned according to eye positions for ease of comparison.
Figure 7: Sample results of cross-domain face re-enactment with the name of dataset labeled underneath. All images are aligned according to eye positions for ease of comparison.

According to the comparison results shown in Fig. 6, both One-shot and Latent-pose fail to completely disentangle the identity information of driving faces, which thus interferes with the generation results and causes large face shape distortion. Although FOMM shows better performance in identity preservation, it fails to accurately capture pose changes and subtle movements of facial components (especially the eyes and mouth). Also note that FOMM requires image sequences as driving input at inference time, while our method works even when only a single driving image is available. Moreover, our method outperforms Bi-layer in synthesizing more realistic hair regions and preserving background texture.

Cross-domain face re-enactment. We also test the performance of DCG-GAN when driving images are sampled from an unseen dataset, which is very common in practical applications. According to the results shown in Fig. 7, our method adapts well to the discrepancy in data distributions. This is because the 3DMM also helps eliminate the interference of domain-specific data bias, in addition to the identity of driving subjects. Inaccurately manipulated pose and expression can be clearly observed in results obtained by One-shot and FOMM. Although Latent-pose achieves more satisfying performance in re-enactment fulfillment, it still suffers from a large identity shift. Similar to the results of single-domain face re-enactment, Bi-layer fails to handle the translation of non-facial regions.

Table 1: Quantitative results of both single- and cross-domain face re-enactment. Elements in angle brackets denote the datasets from which source (first) and driving (second) images are sampled. ExpMSE1 and ExpMSE2 denote the MSE between expression embeddings computed by 3DDFA and FaceWarehouse, respectively. Under each metric, scores ranked first and second are marked with † and ‡, respectively (Note: One-shot* and Latent-pose* are not considered in the ranking due to their poor performance in identity preservation).
⟨FF++, FF++⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.34±0.12 | 3.63±1.88 | 0.98±0.41 | 3.41±0.68
Latent-pose* | 0.37±0.12 | 10.25±5.87 | 1.33±0.61 | 3.48±0.64
Bi-layer | 0.61±0.13 | 9.14±3.76 | 1.26±0.56 ‡ | 2.93±0.58 †
FOMM | 0.91±0.08 † | 11.42±6.55 | 1.40±0.66 | 3.61±0.69
DCG-GAN-sD | 0.68±0.11 | 8.82±4.94 ‡ | 1.24±0.55 | 3.48±0.70
DCG-GAN-sG | 0.22±0.11 | 8.86±4.55 | 1.22±0.52 ‡ | 3.46±0.67
DCG-GAN-sDG | 0.07±0.09 | 9.88±5.13 | 1.25±0.58 | 3.48±0.72
DCG-GAN | 0.84±0.12 ‡ | 8.50±4.52 † | 1.21±0.51 † | 3.42±0.70 ‡

⟨Celeb-DF, Celeb-DF⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.29±0.12 | 5.24±2.88 | 1.14±0.44 | 3.41±0.68
Latent-pose* | 0.36±0.12 | 12.13±8.67 | 1.33±0.62 | 3.20±0.60
Bi-layer | 0.56±0.15 | 11.73±5.83 | 1.26±0.53 | 3.03±0.58 †
FOMM | 0.83±0.12 † | 11.89±8.60 ‡ | 1.40±0.60 | 3.53±0.65 ‡
DCG-GAN-sD | 0.50±0.14 | 14.92±5.92 | 1.42±0.66 | 3.61±0.69
DCG-GAN-sG | 0.06±0.08 | 12.49±5.58 | 1.49±0.66 | 3.84±0.66
DCG-GAN-sDG | 0.06±0.09 | 15.70±5.96 | 1.56±0.60 | 3.95±0.75
DCG-GAN | 0.79±0.17 ‡ | 10.92±5.20 † | 1.25±0.58 † | 3.55±0.68

⟨FF++, Celeb-DF⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.31±0.12 | 4.70±2.76 | 1.09±0.42 | 3.48±0.73
Latent-pose* | 0.37±0.12 | 12.37±8.62 | 1.35±0.66 | 3.26±0.65
Bi-layer | 0.60±0.13 | 10.89±6.25 | 1.40±0.59 ‡ | 3.46±0.64 ‡
FOMM | 0.84±0.11 † | 10.30±5.61 ‡ | 1.43±0.66 | 3.59±0.74
DCG-GAN | 0.77±0.17 ‡ | 10.01±5.54 † | 1.36±0.60 † | 3.43±0.71 †

⟨Celeb-DF, FF++⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.31±0.13 | 4.40±2.29 | 1.10±0.43 | 3.34±0.66
Latent-pose* | 0.36±0.12 | 11.22±6.47 | 1.37±0.65 | 3.20±0.66
Bi-layer | 0.59±0.14 | 11.01±4.35 ‡ | 1.36±0.56 ‡ | 3.57±0.62 ‡
FOMM | 0.90±0.09 † | 14.66±8.54 | 1.50±0.70 | 3.65±0.71
DCG-GAN | 0.82±0.17 ‡ | 10.75±6.05 † | 1.31±0.58 † | 3.54±0.66 †

⟨FF++, VoxCeleb1⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.30±0.12 | 5.00±2.84 | 1.09±0.44 | 3.48±0.73
Latent-pose* | 0.36±0.12 | 12.14±7.67 | 1.36±0.66 | 3.27±0.68
Bi-layer | 0.59±0.13 | 11.46±6.32 | 1.33±0.58 ‡ | 3.61±0.64
FOMM | 0.80±0.13 ‡ | 10.35±5.81 ‡ | 1.43±0.66 | 3.59±0.74 ‡
DCG-GAN | 0.81±0.11 † | 10.15±5.41 † | 1.31±0.58 † | 3.55±0.70 †

⟨Celeb-DF, VoxCeleb1⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.28±0.13 | 5.35±2.97 | 1.11±0.44 | 3.45±0.69
Latent-pose* | 0.36±0.12 | 12.98±8.68 | 1.36±0.64 | 3.22±0.70
Bi-layer | 0.58±0.15 | 11.19±6.09 | 1.37±0.56 ‡ | 3.47±0.67 ‡
FOMM | 0.79±0.14 † | 10.18±7.16 ‡ | 1.38±0.65 | 3.50±0.71
DCG-GAN | 0.74±0.17 ‡ | 9.81±5.20 † | 1.31±0.60 † | 3.42±0.75 †

3.3 Quantitative Results

Quantitative results for both single- and cross-domain face re-enactment are reported in Table 1. Since One-shot forces the shape of re-enacted faces to be close to that of driving images, it produces lower errors in re-enactment fulfillment but completely fails to preserve the identity of source faces (CosSim values lower than 0.35 in all cases). Latent-pose also suffers from poor identity preservation, with CosSim values around 0.36 to 0.37. In contrast, our method achieves much better performance in preserving the source identity, indicating the success of using the 3DMM to bridge the identity gap in one-shot face re-enactment. Although the CosSim scores obtained by FOMM are slightly higher than ours, it produces larger errors in pose and expression transformation than our method in most cases. Moreover, FOMM requires a sequence of faces with no abrupt changes for driving, while our method only needs one driving image to achieve face re-enactment, which enables a much broader range of applications.

Figure 8: Sample results obtained by DCG-GAN and its variants on (a) FF++ and (b) Celeb-DF. Red boxes encircle image distortion caused by using the simplified dense correspondence estimation module.

3.4 Ablation Study

In this subsection, we validate the effectiveness of the network design of each component in our method. Specifically, we investigate the following variants of DCG-GAN: 1) in DCG-GAN-sD (simple Dense correspondence estimation), \mathcal{F} is computed based only on P_{s} and P_{r}, with no L_{s}, L_{r}, or M_{s} involved; 2) DCG-GAN-sG (simple Generator) refers to the setup where no AFM network is used to guide the generation process, and I_{s}^{warp}, M_{s}^{warp}, and P_{s}^{warp} are fed to the generator only at the bottom layer; and 3) DCG-GAN-sDG combines the configurations of both previous settings for comparison.

As shown in Fig. 8, removing the facial landmarks (L_{s} and L_{r}) and the parsing map (M_{s}) impairs the accuracy of motion field estimation, especially for the hair and mouth regions. The reason might be that the model fails to extract sufficient information about both the image layout (contained in M_{s}) and the shape of facial components (embedded in L_{s} and L_{r}). Results obtained by DCG-GAN-sG have severe problems in identity preservation, as no detailed textural information is provided to the ResBlocks at higher resolutions. DCG-GAN-sDG is very unstable to train, and we only present results obtained with the last checkpoint before collapse, which suffer from severe artifacts. The quantitative results reported in Table 1 are in line with these observations: our full model consistently outperforms its variants under all measurements, indicating the effectiveness of the network design of each module.

4 Conclusion

In this paper, we propose a novel GAN-based model, named DCG-GAN, for one-shot face re-enactment. DCG-GAN adopts a 3DMM to decompose and reorganize the geometric semantics of faces, and generates textured mesh images that help estimate the dense correspondence between source and re-enacted faces. The resulting dense correspondence field is used to warp source images to approximate the re-enactment result, followed by a generator network that improves the visual quality. Extensive experiments on multiple datasets demonstrate the effectiveness of DCG-GAN in generating lifelike images with accurate re-enactment fulfillment. We also discuss limitations of DCG-GAN and propose possible solutions as future research directions.

References

  • Burkov et al. (2020) Burkov, E.; Pasechnik, I.; Grigorev, A.; and Lempitsky, V. 2020. Neural head reenactment with latent pose descriptors. In IEEE Conference on Computer Vision and Pattern Recognition, 13786–13795.
  • Cao et al. (2013) Cao, C.; Weng, Y.; Zhou, S.; Tong, Y.; and Zhou, K. 2013. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3): 413–425.
  • Deng et al. (2019) Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. Arcface: Additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 4690–4699.
  • Feng et al. (2021) Feng, Y.; Feng, H.; Black, M. J.; and Bolkart, T. 2021. Learning an Animatable Detailed 3D Face Model from In-The-Wild Images. ACM Transactions on Graphics, 40(8).
  • Ghosh et al. (2020) Ghosh, P.; Gupta, P. S.; Uziel, R.; Ranjan, A.; Black, M. J.; and Bolkart, T. 2020. GIF: Generative interpretable faces. In International Conference on 3D Vision, 868–878.
  • Ha et al. (2020) Ha, S.; Kersner, M.; Kim, B.; Seo, S.; and Kim, D. 2020. Marionette: Few-shot face reenactment preserving identity of unseen targets. In AAAI Conference on Artificial Intelligence, 10893–10900.
  • Huang, Yang, and Wang (2020) Huang, P.-H.; Yang, F.-E.; and Wang, Y.-C. F. 2020. Learning Identity-Invariant Motion Representations for Cross-ID Face Reenactment. In IEEE Conference on Computer Vision and Pattern Recognition, 7084–7092.
  • Isola et al. (2017) Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, 1125–1134.
  • Jaderberg et al. (2015) Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. In Conference on Neural Information Processing Systems, 1–9.
  • Kim et al. (2018) Kim, H.; Garrido, P.; Tewari, A.; Xu, W.; Thies, J.; Niessner, M.; Pérez, P.; Richardt, C.; Zollhöfer, M.; and Theobalt, C. 2018. Deep video portraits. ACM Transactions on Graphics, 37(4): 1–14.
  • Li et al. (2017) Li, T.; Bolkart, T.; Black, M. J.; Li, H.; and Romero, J. 2017. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, 36(6): 194:1–194:17.
  • Li et al. (2020) Li, Y.; Yang, X.; Sun, P.; Qi, H.; and Lyu, S. 2020. Celeb-df: A large-scale challenging dataset for deepfake forensics. In IEEE Conference on Computer Vision and Pattern Recognition, 3207–3216.
  • Nagrani, Chung, and Zisserman (2017) Nagrani, A.; Chung, J. S.; and Zisserman, A. 2017. Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
  • Nguyen-Phuoc et al. (2019) Nguyen-Phuoc, T.; Li, C.; Theis, L.; Richardt, C.; and Yang, Y.-L. 2019. Hologan: Unsupervised learning of 3d representations from natural images. In International Conference on Computer Vision, 7588–7597.
  • Nirkin, Keller, and Hassner (2019) Nirkin, Y.; Keller, Y.; and Hassner, T. 2019. Fsgan: Subject agnostic face swapping and reenactment. In IEEE Conference on Computer Vision and Pattern Recognition, 7184–7193.
  • Park et al. (2019) Park, T.; Liu, M.-Y.; Wang, T.-C.; and Zhu, J.-Y. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In IEEE Conference on Computer Vision and Pattern Recognition, 2337–2346.
  • Pumarola et al. (2018) Pumarola, A.; Agudo, A.; Martinez, A. M.; Sanfeliu, A.; and Moreno-Noguer, F. 2018. Ganimation: Anatomically-aware facial animation from a single image. In European Conference on Computer Vision, 818–833.
  • Rossler et al. (2019) Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; and Nießner, M. 2019. Faceforensics++: Learning to detect manipulated facial images. In International Conference on Computer Vision, 1–11.
  • Ruiz, Chong, and Rehg (2018) Ruiz, N.; Chong, E.; and Rehg, J. M. 2018. Fine-grained head pose estimation without keypoints. In IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2074–2083.
  • Siarohin et al. (2019) Siarohin, A.; Lathuilière, S.; Tulyakov, S.; Ricci, E.; and Sebe, N. 2019. First order motion model for image animation. In Conference on Neural Information Processing Systems, volume 32, 7137–7147.
  • Song et al. (2018) Song, L.; Lu, Z.; He, R.; Sun, Z.; and Tan, T. 2018. Geometry guided adversarial facial expression synthesis. In ACM International Conference on Multimedia, 627–635.
  • Sun et al. (2019) Sun, K.; Xiao, B.; Liu, D.; and Wang, J. 2019. Deep High-Resolution Representation Learning for Human Pose Estimation. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Tewari et al. (2020) Tewari, A.; Elgharib, M.; Bharaj, G.; Bernard, F.; Seidel, H.-P.; Pérez, P.; Zollhofer, M.; and Theobalt, C. 2020. Stylerig: Rigging stylegan for 3d control over portrait images. In IEEE Conference on Computer Vision and Pattern Recognition, 6142–6151.
  • Thies et al. (2016) Thies, J.; Zollhofer, M.; Stamminger, M.; Theobalt, C.; and Nießner, M. 2016. Face2face: Real-time Face Capture and Reenactment of RGB Videos. In IEEE Conference on Computer Vision and Pattern Recognition, 2387–2395.
  • Tripathy, Kannala, and Rahtu (2020) Tripathy, S.; Kannala, J.; and Rahtu, E. 2020. Icface: Interpretable and controllable face reenactment using gans. In IEEE Winter Conference on Applications of Computer Vision, 3385–3394.
  • Tripathy, Kannala, and Rahtu (2021) Tripathy, S.; Kannala, J.; and Rahtu, E. 2021. FACEGAN: Facial Attribute Controllable rEenactment GAN. In IEEE Winter Conference on Applications of Computer Vision, 1329–1338.
  • Tripathy, Kannala, and Rahtu (2022) Tripathy, S.; Kannala, J.; and Rahtu, E. 2022. Single Source One Shot Reenactment Using Weighted Motion From Paired Feature Points. In IEEE Winter Conference on Applications of Computer Vision, 2715–2724.
  • Wang et al. (2018a) Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; and Liu, W. 2018a. Cosface: Large margin cosine loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 5265–5274.
  • Wang et al. (2018b) Wang, T.-C.; Liu, M.-Y.; Zhu, J.-Y.; Tao, A.; Kautz, J.; and Catanzaro, B. 2018b. High-resolution image synthesis and semantic manipulation with conditional gans. In IEEE Conference on Computer Vision and Pattern Recognition, 8798–8807.
  • Wang, Mallya, and Liu (2021) Wang, T.-C.; Mallya, A.; and Liu, M.-Y. 2021. One-shot free-view neural talking-head synthesis for video conferencing. In IEEE Conference on Computer Vision and Pattern Recognition, 10039–10049.
  • Wu et al. (2018) Wu, W.; Zhang, Y.; Li, C.; Qian, C.; and Loy, C. C. 2018. Reenactgan: Learning to reenact faces via boundary transfer. In European Conference on Computer Vision, 603–619.
  • Yao et al. (2021) Yao, G.; Yuan, Y.; Shao, T.; Li, S.; Liu, S.; Liu, Y.; Wang, M.; and Zhou, K. 2021. One-shot Face Reenactment Using Appearance Adaptive Normalization. In AAAI Conference on Artificial Intelligence, 3172–3180.
  • Yao et al. (2020) Yao, G.; Yuan, Y.; Shao, T.; and Zhou, K. 2020. Mesh guided one-shot face reenactment using graph convolutional networks. In ACM International Conference on Multimedia, 1773–1781.
  • Yin et al. (2017) Yin, X.; Yu, X.; Sohn, K.; Liu, X.; and Chandraker, M. 2017. Towards large-pose face frontalization in the wild. In International Conference on Computer Vision, 3990–3999.
  • Yu et al. (2018) Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; and Sang, N. 2018. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In European Conference on Computer Vision, 325–341.
  • Zakharov et al. (2020) Zakharov, E.; Ivakhnenko, A.; Shysheya, A.; and Lempitsky, V. 2020. Fast bi-layer neural synthesis of one-shot realistic head avatars. In European Conference on Computer Vision, 524–540.
  • Zhang et al. (2020a) Zhang, J.; Zeng, X.; Wang, M.; Pan, Y.; Liu, L.; Liu, Y.; Ding, Y.; and Fan, C. 2020a. Freenet: Multi-identity face reenactment. In IEEE Conference on Computer Vision and Pattern Recognition, 5326–5335.
  • Zhang et al. (2020b) Zhang, P.; Zhang, B.; Chen, D.; Yuan, L.; and Wen, F. 2020b. Cross-domain Correspondence Learning for Exemplar-based Image Translation. In IEEE Conference on Computer Vision and Pattern Recognition, 5143–5153.
  • Zhang et al. (2019) Zhang, Y.; Zhang, S.; He, Y.; Li, C.; Loy, C. C.; and Liu, Z. 2019. One-shot Face Reenactment. In British Machine Vision Conference.
  • Zhou et al. (2021) Zhou, X.; Zhang, B.; Zhang, T.; Zhang, P.; Bao, J.; Chen, D.; Zhang, Z.; and Wen, F. 2021. CoCosNet v2: Full-Resolution Correspondence Learning for Image Translation. In IEEE Conference on Computer Vision and Pattern Recognition, 11465–11475.
  • Zhu et al. (2017a) Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017a. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision, 2223–2232.
  • Zhu et al. (2017b) Zhu, X.; Liu, X.; Lei, Z.; and Li, S. Z. 2017b. Face alignment in full pose range: A 3d total solution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1): 78–92.