
Semantic-aware One-shot Face Re-enactment with Dense Correspondence Estimation

Yunfan Liu, Qi Li, Zhenan Sun, Tieniu Tan
Abstract

One-shot face re-enactment is a challenging task due to the identity mismatch between source and driving faces. Specifically, suboptimally disentangled identity information of the driving subject inevitably interferes with the re-enactment result and leads to face shape distortion. To solve this problem, this paper proposes to use a 3D Morphable Model (3DMM) for explicit facial semantic decomposition and identity disentanglement. Instead of using 3D coefficients alone for re-enactment control, we take advantage of the generative ability of 3DMM to render textured face proxies. These proxies contain abundant yet compact geometric and semantic information of human faces, which enables us to compute the face motion field between source and driving images by estimating their dense correspondence. In this way, we can approximate the re-enactment result by warping the source image according to the motion field, and a Generative Adversarial Network (GAN) is adopted to further improve the visual quality of the warping result. Extensive experiments on various datasets demonstrate the advantages of the proposed method over existing state-of-the-art benchmarks in both identity preservation and re-enactment fulfillment.

1 Introduction

Face re-enactment aims at animating the source face with the pose and expression conveyed by driving images, while keeping its identity information unchanged. Due to its wide range of practical applications, face re-enactment has attracted remarkable research attention in the last few years. With the rapid development of Generative Adversarial Networks (GANs), numerous studies have been conducted to solve the face re-enactment problem with GAN-based models (Kim et al. 2018; Thies et al. 2016; Wu et al. 2018). Although realistic results can be obtained, these methods are either designed only for a specific subject or require a large number of images of the source identity, which heavily limits their application in real-world scenarios.

Figure 1: Illustration of (a) sample face re-enactment results with ‘identity leak’. Note the change of identity in results obtained by One-shot (Zhang et al. 2019) and Latent-pose (Burkov et al. 2020), and (b) visualization of the face proxies of source and re-enacted faces.

In this work, we focus on solving a more generalized problem, i.e., one-shot face re-enactment, where only a single image of the source subject is available at test time. The key to solving this problem is to fully disentangle the geometric attributes (i.e., pose and expression) of driving images from personal characteristics; otherwise, the residual identity information of the driving subject interferes with the re-enactment process and causes shape distortion in generation results (known as the ‘identity leak’ problem (Ha et al. 2020), see Fig. 1(a)).

To this end, various representations of facial geometry have been proposed to eliminate the influence of the driving identity. Facial Action Units (FAUs), a coding system for classifying facial muscle movements, are used in some previous studies (Pumarola et al. 2018; Tripathy, Kannala, and Rahtu 2020, 2021) to describe face motion. However, the flexibility of FAUs is quite limited, as they cannot describe complex facial expressions or head poses. Facial landmarks are also widely used in existing methods (Ha et al. 2020; Huang, Yang, and Wang 2020; Nirkin, Keller, and Hassner 2019; Song et al. 2018; Wu et al. 2018; Zhang et al. 2020a, 2019; Siarohin et al. 2019; Tripathy, Kannala, and Rahtu 2022) to depict the contour and layout of facial components with facial texture excluded. Compared to FAUs, facial landmarks provide a more comprehensive description of the spatial structure of human faces. Unfortunately, they still carry personal attributes of the driving subject in terms of face shape, which may impair the preservation of the source identity.

In this paper, we propose to use an advanced 3D Morphable Model (3DMM), i.e., FLAME (Li et al. 2017), for facial geometry decomposition. Unlike most existing methods, which only use low-dimensional 3DMM coefficients (Yin et al. 2017; Nguyen-Phuoc et al. 2019; Tewari et al. 2020; Ghosh et al. 2020) or plain face meshes (Yao et al. 2020) for face re-enactment control, we also make use of its generative ability to render textured face proxies (see Fig. 1(b)). Compared to original face images, these representations not only contain richer spatial information about the skin area and facial components, but are also less subject to the influence of high-frequency image details or background content when estimating the face motion.

This inspires us to leverage textured face proxies for face motion field estimation, which further facilitates the prediction of re-enacted faces. Concretely, we first generate the proxy of the re-enacted face with the FLAME model based on a mixture of the 3DMM parameters of the source and driving faces (see Fig. 2). Then, the face motion field is computed by the proposed Dense Correspondence Guided GAN (DCG-GAN) by estimating the dense spatial correspondence between the proxies of the source and re-enacted faces. Afterwards, source images are warped according to the motion field, and the results are fed into a generator network to improve the visual quality. Extensive experimental results demonstrate the effectiveness of the proposed method and its advantages over existing state-of-the-art approaches.

Contributions of this work are summarized as follows:

  • An advanced 3DMM, i.e., FLAME, is adopted to disentangle the identity feature from pose and expression. We also use it to render textured face proxies that predict the geometry and semantics of re-enacted faces, which facilitates the estimation of face motion.

  • A novel GAN-based framework, named DCG-GAN, is proposed to generate realistic re-enactment results. It first estimates the face motion field by computing the dense correspondence, and then improves the visual quality of the warping result.

  • Extensive experiments are conducted on multiple datasets to demonstrate the advantage of DCG-GAN in solving the one-shot face re-enactment problem, compared to existing state-of-the-art methods.

2 Method

Given a source image I_{s} and a driving image I_{d}, we aim to generate the re-enacted image I_{r} with the pose and expression conveyed in I_{d} and the identity in I_{s}. As shown in Fig. 3, our method comprises three modules. First, a 3D-guided Face Proxy Editor is adopted to decompose the geometric semantics of both I_{s} and I_{d}, and to render the proxies of I_{s} and I_{r} (Section 2.1). Then, the spatial correlation \mathcal{F} between the proxies is estimated by a Dense Correspondence Estimator to represent the face motion, which is used to approximate I_{r} by warping I_{s} (Section 2.2). Thereafter, the warped I_{s} and other associated spatial representations are fed into the Facial Prior-guided Generator to synthesize I_{r} with high visual fidelity (Section 2.3).

Figure 2: Illustration of the reorganization of 3D coefficients and the rendering of textured face proxies. Black dashed lines with arrows denote that the connected networks (i.e., E_{3D} and R) share the same weights.

2.1 3D-guided Face Proxy Editor

We adopt an advanced 3D parametric model, FLAME (Li et al. 2017), to explicitly decompose face semantics into disentangled components. Specifically, FLAME depicts the geometry of a portrait image with coefficients of camera \alpha, pose \beta, expression \theta, shape \phi, lighting condition \lambda, and facial texture \mu. As shown in Fig. 2, an encoder network, denoted as E_{3D}: I \rightarrow C (C = \{\alpha, \beta, \theta, \phi, \lambda, \mu\}), is proposed to regress the FLAME coefficients C from a given face image I. Concretely, coefficients for the disentangled semantics of I_{s} and I_{d} are computed as C_{s} = E_{3D}(I_{s}) and C_{d} = E_{3D}(I_{d}), respectively. Thus, the set of coefficients for the re-enacted face I_{r} can be obtained by simply mixing elements of C_{s} and C_{d}, denoted as C_{r} = \{\alpha_{d}, \beta_{d}, \theta_{d}, \phi_{s}, \lambda_{s}, \mu_{s}\} (geometry-related coefficients are taken from I_{d}, and the rest from I_{s}).
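To make the coefficient reorganization concrete, the following minimal sketch assembles C_r from C_s and C_d. The dictionary keys and coefficient dimensions are illustrative assumptions; the actual E_{3D} encoder and FLAME coefficient layout are not reproduced here.

```python
import torch

# Hypothetical outputs of E_3D for the source and driving images; the key
# names and dimensions are placeholders, not the exact FLAME layout.
C_s = {"camera": torch.randn(3), "pose": torch.randn(6), "exp": torch.randn(50),
       "shape": torch.randn(100), "light": torch.randn(27), "tex": torch.randn(50)}
C_d = {k: torch.randn_like(v) for k, v in C_s.items()}

def mix_coefficients(C_s, C_d):
    """C_r takes the geometry-related coefficients (camera, pose, expression)
    from the driving face and shape, lighting, and texture from the source."""
    return {"camera": C_d["camera"], "pose": C_d["pose"], "exp": C_d["exp"],
            "shape": C_s["shape"], "light": C_s["light"], "tex": C_s["tex"]}

C_r = mix_coefficients(C_s, C_d)  # subsequently rendered as P_r = R(C_r)
```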

Figure 3: An overview of the proposed model DCG-GAN. The Dense Correspondence Estimation Module computes the correlation matrix based on feature vectors extracted from face proxies. Afterwards, warped spatial representations of the source face are fed into the Facial Prior-guided Generator to synthesize re-enactment results with high quality.

Unlike most existing methods that merely use low-dimensional 3DMM coefficients for re-enactment control, we also make use of the generative ability of FLAME to convey more semantic information in the spatial dimension. To this end, given the set of parameters C_{r}, a differentiable rendering network (Feng et al. 2021) (denoted as R: C \rightarrow I) is adopted to generate the corresponding proxy image P_{r} = R(C_{r}). Notably, P_{r} helps mitigate the identity gap, as it shares the same identity with P_{s} = R(C_{s}), i.e., the textured mesh of I_{s}, and contains abundant yet compact spatial information (e.g., no background but adequate facial texture) to help estimate the desired face motion field.

2.2 Dense Correspondence Estimator

Intuitively, the mismatch between I_{s} and I_{d} in both identity and image content (e.g., background and high-frequency texture) makes it difficult to accurately estimate the face motion field desired for animating I_{s}. However, as discussed in Section 2.1, the textured face proxies rendered by FLAME (i.e., P_{s} and P_{r}) help mitigate these gaps and provide an aligned space for motion field estimation. In this work, we model the face motion between I_{s} and I_{d} with a dense correspondence field \mathcal{F}, where each element f_{i,j} \in \mathcal{F} denotes the correlation between two image patches p_{i} and p_{j} within P_{s} and P_{r}, respectively.

Specifically, we use an encoder network E_{corr} followed by two sequences of ResBlocks, denoted as E_{res}^{(s)} and E_{res}^{(r)}, to extract features from P_{s} and P_{r}, respectively. Feature maps computed by a pre-trained HR-Net (Sun et al. 2019) (denoted as L_{s} and L_{r}) are also involved to represent the distribution of facial key-points, which helps the network better explore the inner relationship between facial components. So far, both the face proxy P and the landmark embedding L only describe the content within the facial area and do not capture the rest of the image. Therefore, to make the network learn about the spatial layout beyond facial regions, we also feed the parsing map of I_{s} (denoted as M_{s}), computed by a pre-trained BiSeNet (Yu et al. 2018), to the feature extraction network (see Fig. 3).

Mathematically, the feature vector for the source face image could be written as

f_{s} = E_{res}^{(s)}(E_{corr}(P_{s}), L_{s}, M_{s})   (1)

and similarly, we have

f_{r} = E_{res}^{(r)}(E_{corr}(P_{r}), L_{r})   (2)

describing the feature of P_{r}. Therefore, the dense correspondence field \mathcal{F} could be represented by the correlation matrix between f_{s} and f_{r} (Zhang et al. 2020b; Zhou et al. 2021), which could be formulated as

\mathcal{F} = softmax(\hat{f}_{r} \cdot \hat{f}_{s}^{T}) \in \mathbb{R}^{hw \times hw}   (3)

where \hat{f}_{s} \in \mathbb{R}^{hw \times 1} and \hat{f}_{r} \in \mathbb{R}^{hw \times 1} denote the flattened and normalized versions of f_{s} \in \mathbb{R}^{h \times w} and f_{r} \in \mathbb{R}^{h \times w}, respectively.
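A minimal sketch of Eq. 3 in PyTorch is given below. It assumes multi-channel feature maps of shape (C, h, w) for f_s and f_r and L2-normalizes each position before the matrix product; the channel dimension, the normalization choice, and applying the softmax over source positions (described as column-wise in the text) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dense_correspondence(f_s, f_r):
    """Compute the hw x hw correspondence matrix of Eq. 3 from feature maps
    f_s, f_r of shape (C, h, w) extracted from the proxies P_s and P_r."""
    C, h, w = f_s.shape
    f_s_hat = F.normalize(f_s.reshape(C, h * w).t(), dim=1)  # (hw, C), per-position L2 norm
    f_r_hat = F.normalize(f_r.reshape(C, h * w).t(), dim=1)  # (hw, C)
    # Softmax over source positions: every row (one target position) sums to 1,
    # so warping becomes a convex combination of source pixels.
    return torch.softmax(f_r_hat @ f_s_hat.t(), dim=-1)      # (hw, hw)
```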

Figure 4: The network structure of AFM. Feature maps of the foreground and background are fused with an attention map computed from all inputs, and the result serves as semantic guidance for the subsequent generation process.

Notably, with the column-wise softmax operation in Eq. 3, warping of I_{s} can simply be achieved by matrix multiplication, i.e., I_{s}^{warp} = \mathcal{F} \cdot \tilde{I}_{s}, where \tilde{I}_{s} \in \mathbb{R}^{hw \times 1} denotes I_{s} after down-sampling and flattening. The advantage is two-fold: 1) the operation can easily be extended to other spatial representations of I_{s}, e.g., the textured face proxy P_{s} (P_{s}^{warp} = \mathcal{F} \cdot \tilde{P}_{s}) or the parsing map M_{s} (M_{s}^{warp} = \mathcal{F} \cdot \tilde{M}_{s}), to provide multi-modal prior knowledge for the subsequent generation process (see Section 2.3), and 2) the reverse mapping can also be implemented with a simple matrix multiplication, which facilitates the computation of the cycle consistency constraint (see Section 2.4).
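The warping by matrix multiplication can be sketched as follows; the bilinear down-sampling of the source representation to the h × w resolution of the correspondence field is an assumed implementation detail.

```python
import torch
import torch.nn.functional as F

def warp(corr, x, h, w):
    """Warp a source representation (I_s, P_s, or M_s) of shape (c, H, W) with
    the hw x hw correspondence matrix: each output pixel is a weighted sum of
    source pixels, i.e. X_s^warp = corr @ X_s_flattened."""
    c = x.shape[0]
    x_small = F.interpolate(x.unsqueeze(0), size=(h, w), mode="bilinear",
                            align_corners=False)[0]           # down-sample
    x_flat = x_small.reshape(c, h * w).t()                    # (hw, c), flatten
    return (corr @ x_flat).t().reshape(c, h, w)

# Reverse mapping used by the cycle-consistency loss in Section 2.4 (a sketch):
# I_s_rec = warp(torch.softmax(corr.t(), dim=-1), I_s_warp, h, w)
```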

2.3 Facial Prior-guided Generator

Although I_{s}^{warp} could theoretically approximate the ideal re-enactment result and serve as a generative prior, it is often blurred due to the loss of high-frequency details in I_{s}, and distorted due to errors in the estimated warping field \mathcal{F} (see Fig. 3). To solve this problem, we propose a Facial Prior-guided Generator that synthesizes the final re-enactment result with high visual quality, incorporating various warped spatial representations of the source subject (i.e., I_{s}^{warp}, M_{s}^{warp}, and P_{s}^{warp}) as guiding prior knowledge.

As shown in Fig. 3, the generator takes the down-sampled concatenation of I_{s}^{warp}, M_{s}^{warp}, and P_{s}^{warp} as input, and generates the final output with a sequence of ResBlocks. For ResBlocks at coarse levels, I_{s}^{warp}, M_{s}^{warp}, and P_{s}^{warp} are involved to introduce prior knowledge about the spatial layout of re-enacted faces. Since they contain different semantic information about I_{r}, we design an Attention-based Fusion Module (AFM, shown in Fig. 4) to fuse them into a single fused feature map, which is injected into the main body of the generator as semantic guidance.

Concretely, since the information about the image background is only embedded in I_{s}^{warp}, while that of the foreground region (i.e., the facial skin area) is contained in both I_{s}^{warp} and P_{s}^{warp}, the features of the background (f_{bg}) and foreground (f_{fg}) could be computed as

f_{bg} = conv(STN(I_{s}^{warp}))   (4)
f_{fg} = conv(concat(STN(I_{s}^{warp}), STN(P_{s}^{warp})))   (5)

where STN denotes the Spatial Transformer Network (Jaderberg et al. 2015), conv stands for a convolutional block, and concat refers to channel-wise concatenation. Then, f_{bg} and f_{fg} are merged by an attention map A computed from all three inputs, as shown in Fig. 4. The final output f_{out} is obtained by merging f_{bg} and f_{fg} via

f_{out} = A \cdot f_{fg} + (1 - A) \cdot f_{bg}   (6)

Afterwards, f_{out} is integrated into the corresponding ResBlock via spatially-adaptive normalization (Park et al. 2019). By contrast, due to the inevitable error in warping field estimation, we do not incorporate I_{s}^{warp}, M_{s}^{warp}, and P_{s}^{warp} into ResBlocks at finer levels, since doing so may introduce noise into high-frequency textural details.
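The fusion step of AFM (Eqs. 4-6) can be sketched as below. The channel counts, the omission of the STN alignment, and the architecture of the attention branch are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class AFMSketch(nn.Module):
    """Attention-based fusion of background (from I_s^warp) and foreground
    (from I_s^warp and P_s^warp) features, following Eqs. 4-6 (STN omitted)."""
    def __init__(self, ch=64):
        super().__init__()
        self.conv_bg = nn.Conv2d(3, ch, 3, padding=1)             # Eq. 4
        self.conv_fg = nn.Conv2d(6, ch, 3, padding=1)             # Eq. 5
        self.attn = nn.Sequential(                                 # attention from all three inputs
            nn.Conv2d(9, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, I_warp, P_warp, M_warp):
        # Inputs are assumed to be (B, 3, H, W) tensors of the same size.
        f_bg = self.conv_bg(I_warp)
        f_fg = self.conv_fg(torch.cat([I_warp, P_warp], dim=1))
        A = self.attn(torch.cat([I_warp, P_warp, M_warp], dim=1))
        return A * f_fg + (1 - A) * f_bg                           # Eq. 6
```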

2.4 Objective Functions

The dense correspondence estimator and facial prior-guided generator are trained simultaneously with the following objective functions.

Losses on Dense Correspondence Estimation

To regulate the estimated dense correspondence field \mathcal{F}, a straightforward pixel-level loss L_{mw} is used to minimize the error of proxy warping, which could be written as

L_{mw} = \|P_{s}^{warp} - P_{r}\|_{1}   (7)

Similar to (Zhu et al. 2017a), in order to regularize the learned mapping, a cycle consistency loss L_{cc} is also adopted to ensure that the reconstructed source image I_{s}^{\prime} is close to I_{s}, which could be formulated as

L_{cc} = \|I_{s}^{\prime} - I_{s}\|_{1}   (8)

Here, I_{s}^{\prime} = softmax(\mathcal{F}^{T}) \cdot \tilde{I}_{s}^{warp} denotes the result of reverse-mapping I_{s}^{warp} via matrix multiplication, which demonstrates the advantage of modeling the face motion with a correlation matrix.
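A sketch of the two correspondence losses is given below, assuming flattened (hw, c) image tensors for the cycle term; whether the L1 norms are summed or averaged is an implementation assumption.

```python
import torch

def motion_warp_loss(P_s_warp, P_r):
    """Eq. 7: L1 distance between the warped source proxy and the target proxy."""
    return (P_s_warp - P_r).abs().mean()

def cycle_consistency_loss(corr, I_s_warp_flat, I_s_flat):
    """Eq. 8: reverse-map the warped image with softmax(F^T) and compare to I_s.
    corr: (hw, hw) correspondence matrix; *_flat: (hw, c) flattened images."""
    I_s_rec = torch.softmax(corr.t(), dim=-1) @ I_s_warp_flat
    return (I_s_rec - I_s_flat).abs().mean()
```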

Losses on Semantic Consistency

To fulfill the requirement of face re-enactment, explicit constraints L_{id} and L_{geo} are adopted to control the personal attributes and facial geometry of I_{r}, respectively. Mathematically, the identity loss L_{id} could be formulated as

L_{id} = 1 - \cos(\mathcal{A}(I_{r}), \mathcal{A}(I_{s}))   (9)

where \mathcal{A} is a pre-trained ArcFace (Deng et al. 2019) network and \cos(\cdot, \cdot) denotes the cosine similarity of two face embeddings. The geometry loss L_{geo} could be written as

L_{geo} = \|\beta_{r} - \beta_{d}\|_{1} + \|\theta_{r} - \theta_{d}\|_{1}   (10)

where \beta_{r} (\beta_{d}) and \theta_{r} (\theta_{d}) are the 3DMM coefficients for the pose and expression of I_{r} (I_{d}) regressed by E_{3D}, respectively.
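The two semantic-consistency terms can be sketched as follows, treating the pre-trained ArcFace embedder and the coefficient regressor E_{3D} as frozen callables (their loading and exact output formats are assumptions).

```python
import torch
import torch.nn.functional as F

def identity_loss(arcface, I_r, I_s):
    """Eq. 9: one minus the cosine similarity of the ArcFace face embeddings."""
    return 1.0 - F.cosine_similarity(arcface(I_r), arcface(I_s), dim=-1).mean()

def geometry_loss(e_3d, I_r, I_d):
    """Eq. 10: L1 error between pose and expression coefficients regressed
    from the re-enacted and driving images (dictionary keys are illustrative)."""
    C_r, C_d = e_3d(I_r), e_3d(I_d)
    return ((C_r["pose"] - C_d["pose"]).abs().mean()
            + (C_r["exp"] - C_d["exp"]).abs().mean())
```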

Unlike existing methods (Ha et al. 2020; Wang, Mallya, and Liu 2021; Yao et al. 2021) that merely use paired data \langle I_{s}, I_{s}^{\ast} \rangle from the same video for image-level supervision, we adopt a hybrid learning scheme in which I_{s} and I_{d} with different identities are also involved during training. The dense pixel-level loss L_{pix} is therefore formulated as

L_{pix} = \begin{cases} \|I_{r}^{\ast} - I_{s}^{\ast}\|_{2} + \|M_{s}^{warp} - \mathcal{M}(I_{s}^{\ast})\|_{2} & \text{if } I_{d} = I_{s}^{\ast} \\ 0 & \text{otherwise} \end{cases}   (11)

where \mathcal{M} denotes the pre-trained BiSeNet (Yu et al. 2018) for computing the face parsing map. The strategy for selecting I_{d} will be detailed in Section 3.1.
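The switch between paired and unpaired supervision in Eq. 11 can be sketched as below; the target parsing map is assumed to be precomputed by BiSeNet, and the mean-squared-error reduction is an assumption.

```python
import torch

def pixel_loss(I_r, I_target, M_s_warp, M_target, paired):
    """Eq. 11: dense image- and parsing-level supervision, applied only when the
    driving frame comes from the same video as the source (paired=True)."""
    if not paired:
        return torch.zeros((), device=I_r.device)
    return ((I_r - I_target) ** 2).mean() + ((M_s_warp - M_target) ** 2).mean()
```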

Losses on Visual Fidelity

Let us use G to roughly refer to the entire DCG-GAN network for face re-enactment, i.e., I_{r} = G(I_{s}, I_{d}). To improve the visual quality of the generation results, a discriminator network D is adopted to distinguish synthetic images from real ones. D has an ordinary network structure similar to PatchGAN (Isola et al. 2017), and the hinge loss h(t) = \min(0, -1 + t) is used for discriminator regularization (Wang et al. 2018b; Zhang et al. 2020b). Mathematically, the adversarial losses for G and D can be written as

L_{adv}^{D} = -\mathbb{E}_{I_{d} \sim p_{data}}[h(D(I_{d}))] - \mathbb{E}_{I_{s}, I_{d} \sim p_{data}}[h(-D(G(I_{s}, I_{d})))]   (12)
L_{adv}^{G} = -\mathbb{E}_{I_{s}, I_{d} \sim p_{data}}[D(G(I_{s}, I_{d}))]   (13)

where p_{data} refers to the real data distribution.
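The hinge-based adversarial terms of Eqs. 12-13 can be sketched as follows, with D assumed to output real/fake scores that are averaged over patches.

```python
import torch

def hinge(t):
    """h(t) = min(0, -1 + t)."""
    return torch.clamp(t - 1.0, max=0.0)

def discriminator_loss(D, I_d, I_fake):
    """Eq. 12 (I_fake = G(I_s, I_d), detached when updating D)."""
    return -hinge(D(I_d)).mean() - hinge(-D(I_fake.detach())).mean()

def generator_loss(D, I_fake):
    """Eq. 13: non-saturating adversarial term for the generator."""
    return -D(I_fake).mean()
```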

The Overall Training Objective

In summary, the overall training objective function for GG and DD could be written as

L_{G} = \lambda_{mw} L_{mw} + \lambda_{cc} L_{cc} + \lambda_{id} L_{id} + \lambda_{geo} L_{geo} + \lambda_{pix} L_{pix} + \lambda_{adv_{G}} L_{adv}^{G}   (14)
L_{D} = L_{adv}^{D}   (15)

where \lambda_{mw}, \lambda_{cc}, \lambda_{id}, \lambda_{geo}, \lambda_{pix}, and \lambda_{adv_{G}} are coefficients balancing the importance of the different loss terms. L_{G} and L_{D} are minimized alternately until convergence.

Figure 5: Visualization of intermediate warping results. It is clear that the warped representations successfully predict the semantics of re-enacted faces.

3 Experiments

3.1 Configuration Setup

Implementation Details The proposed model is trained with the Adam optimizer (\beta_{1} = 0.9, \beta_{2} = 0.999) and a learning rate of 1e-4. We empirically set \lambda_{geo} to 1.0, and roughly adjust the other coefficients so that their corresponding loss values are one order of magnitude lower. At each iteration, the network is trained with paired data from the same video with probability p \in [0, 1], and with unpaired face images with probability 1 - p. To facilitate convergence, p is set to a high value (e.g., 0.8) in the starting epochs and gradually decays to accustom the network to unpaired input. All experiments are run on an NVIDIA A100 GPU with the batch size set to 8.
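As a sketch of the hybrid sampling schedule, the snippet below decays p linearly across epochs; the decay shape and the final value are assumptions, since only the high starting value (around 0.8) is specified.

```python
import random

def paired_probability(epoch, total_epochs, p_start=0.8, p_end=0.2):
    """Probability of drawing a paired (same-video) source/driving sample;
    decays so that the network gradually sees more unpaired input."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return p_start + (p_end - p_start) * t

def sample_is_paired(epoch, total_epochs):
    """Train with paired data with probability p, unpaired data with 1 - p."""
    return random.random() < paired_probability(epoch, total_epochs)
```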

Datasets Experiments are conducted on the following datasets with all images resized to 256 × 256.

  • FF++ (Rossler et al. 2019): FaceForensics++ (FF++) contains 1,000 video sequences of near frontal face without occlusion. We evenly extract 10 frames from each video and collect 10,000 images in total.

  • Celeb-DF (Li et al. 2020): Celeb-DF is a large-scale dataset containing 890 realistic videos collected from YouTube, which is challenging as it covers a wide range of age, gender, and ethnic groups with extreme poses and varying image quality.

  • VoxCeleb1 (Nagrani, Chung, and Zisserman 2017): VoxCeleb1 is another large-scale video dataset of 1,251 celebrities acquired from YouTube. We directly use the cropped images released on the official website (https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) in our experiments.

Similar to (Siarohin et al. 2019), on FF++ and Celeb-DF we randomly select the videos of 80% of the subjects for training, and use the rest for testing with no identity overlap with the training split. VoxCeleb1 is used as a driving dataset to test the performance of cross-domain face re-enactment, where source images are selected from the testing splits of FF++ and Celeb-DF.

Benchmark Methods To achieve a fair performance comparison, we choose four representative methods with released code as benchmarks, including One-shot (Zhang et al. 2019), Latent-pose (Burkov et al. 2020), Bi-layer (Zakharov et al. 2020), and FOMM (Siarohin et al. 2019). Note that other approaches have also been proposed for one-shot face re-enactment (e.g., (Yao et al. 2021, 2020)). However, no publicly available code has been found for these methods, and it would be unfair to compare against our own re-implementations.

Evaluation Metrics For quantitative evaluation, three metrics commonly used in previous one-shot face re-enactment studies (Ha et al. 2020; Wang, Mallya, and Liu 2021; Yao et al. 2021; Zhang et al. 2019) are adopted. The cosine similarity (CosSim) between the features of I_{r} and I_{s} is computed to evaluate identity preservation. The Mean Squared Error (MSE) of the pose (PoseMSE) and expression (ExpMSE) embeddings between I_{r} and I_{d} measures the accuracy of re-enactment fulfillment. We use CosFace (Wang et al. 2018a) and Hopenet (Ruiz, Chong, and Rehg 2018) to extract the identity and pose features, respectively. To provide a more comprehensive evaluation, both 3DDFA (Zhu et al. 2017b) and FaceWarehouse (Cao et al. 2013) are adopted to compute the expression embedding.
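The three metrics can be sketched as below, treating the pre-trained CosFace, Hopenet, and expression regressors as frozen embedding functions (their loading and pre-processing are omitted, and the embedding shapes are assumptions).

```python
import torch
import torch.nn.functional as F

def cos_sim(id_net, I_r, I_s):
    """CosSim: cosine similarity between identity embeddings of I_r and I_s."""
    return F.cosine_similarity(id_net(I_r), id_net(I_s), dim=-1).mean()

def embedding_mse(embed_net, I_r, I_d):
    """PoseMSE / ExpMSE: mean squared error between pose or expression
    embeddings of the re-enacted and driving images."""
    return ((embed_net(I_r) - embed_net(I_d)) ** 2).mean()
```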

3.2 Qualitative Results

Single-domain face re-enactment. Sample results of one-shot face re-enactment on FF++ and Celeb-DF are shown in Fig. 5. It is clear that the textured proxies of re-enacted faces (i.e., P_{r}, in the red box) are able to convey the pose and expression of driving faces while maintaining the identity and lighting condition of source images. The key to this success is that the 3DMM provides us with a set of disentangled geometric bases, which serve as a common coordinate system onto which we can project face images to obtain aligned coefficient representations. As a result, it improves the accuracy of dense correspondence estimation by eliminating the interference of both the driving identity and redundant high-frequency details in non-facial areas. The warped representations shown in Fig. 5 indicate that the computed correspondence field \mathcal{F} is also effective beyond the facial region, thanks to the prior knowledge (i.e., M_{s}, L_{s}, and L_{r}) introduced when extracting features from P_{s} and P_{r}.

Figure 6: Sample results of single-domain face re-enactment on FF++ and Celeb-DF. All images are aligned according to eye positions for ease of comparison.
Figure 7: Sample results of cross-domain face re-enactment with the name of dataset labeled underneath. All images are aligned according to eye positions for ease of comparison.

According to the comparison results shown in Fig. 6, both One-shot and Latent-pose fail to completely disentangle the identity information of driving faces, which thus interferes with the generation results and causes large face shape distortion. Although FOMM shows better performance in identity preservation, it fails to accurately capture pose changes and subtle movements of facial components (especially the eyes and mouth). Also note that FOMM requires image sequences as driving input at inference time, while our method works even when only a single driving image is available. Moreover, our method outperforms Bi-layer in synthesizing more realistic hair regions and preserving background texture.

Cross-domain face re-enactment. We also test the performance of DCG-GAN when driving images are sampled from an unseen dataset, which is very common in practical applications. According to the results shown in Fig. 7, our method adapts well to the discrepancy in data distributions. This is because the 3DMM also helps eliminate the interference of domain-specific data bias, in addition to the identity of driving subjects. Inaccurately manipulated pose and expression can be clearly observed in results obtained by One-shot and FOMM. Although Latent-pose achieves more satisfying performance in re-enactment fulfillment, it still suffers from a large identity shift. Similar to the results of single-domain face re-enactment, Bi-layer fails to handle the translation of non-facial regions.

Table 1: Quantitative results of both single- and cross-domain face re-enactment. Elements in angle brackets denote the datasets from which source (first) and driving (second) images are sampled. ExpMSE1 and ExpMSE2 denote the MSE between expression embeddings computed by 3DDFA and FaceWarehouse, respectively. Under each metric, scores ranked first and second are marked with † and ‡, respectively (Note: One-shot* and Latent-pose* are not considered in the ranking due to their poor performance in identity preservation).
⟨FF++, FF++⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.34±0.12 | 3.63±1.88 | 0.98±0.41 | 3.41±0.68
Latent-pose* | 0.37±0.12 | 10.25±5.87 | 1.33±0.61 | 3.48±0.64
Bi-layer | 0.61±0.13 | 9.14±3.76 | 1.26±0.56 ‡ | 2.93±0.58 †
FOMM | 0.91±0.08 † | 11.42±6.55 | 1.40±0.66 | 3.61±0.69
DCG-GAN-sD | 0.68±0.11 | 8.82±4.94 ‡ | 1.24±0.55 | 3.48±0.70
DCG-GAN-sG | 0.22±0.11 | 8.86±4.55 | 1.22±0.52 ‡ | 3.46±0.67
DCG-GAN-sDG | 0.07±0.09 | 9.88±5.13 | 1.25±0.58 | 3.48±0.72
DCG-GAN | 0.84±0.12 ‡ | 8.50±4.52 † | 1.21±0.51 † | 3.42±0.70 ‡

⟨Celeb-DF, Celeb-DF⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.29±0.12 | 5.24±2.88 | 1.14±0.44 | 3.41±0.68
Latent-pose* | 0.36±0.12 | 12.13±8.67 | 1.33±0.62 | 3.20±0.60
Bi-layer | 0.56±0.15 | 11.73±5.83 | 1.26±0.53 | 3.03±0.58 †
FOMM | 0.83±0.12 † | 11.89±8.60 ‡ | 1.40±0.60 | 3.53±0.65 ‡
DCG-GAN-sD | 0.50±0.14 | 14.92±5.92 | 1.42±0.66 | 3.61±0.69
DCG-GAN-sG | 0.06±0.08 | 12.49±5.58 | 1.49±0.66 | 3.84±0.66
DCG-GAN-sDG | 0.06±0.09 | 15.70±5.96 | 1.56±0.60 | 3.95±0.75
DCG-GAN | 0.79±0.17 ‡ | 10.92±5.20 † | 1.25±0.58 † | 3.55±0.68

⟨FF++, Celeb-DF⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.31±0.12 | 4.70±2.76 | 1.09±0.42 | 3.48±0.73
Latent-pose* | 0.37±0.12 | 12.37±8.62 | 1.35±0.66 | 3.26±0.65
Bi-layer | 0.60±0.13 | 10.89±6.25 | 1.40±0.59 ‡ | 3.46±0.64 ‡
FOMM | 0.84±0.11 † | 10.30±5.61 ‡ | 1.43±0.66 | 3.59±0.74
DCG-GAN | 0.77±0.17 ‡ | 10.01±5.54 † | 1.36±0.60 † | 3.43±0.71 †

⟨Celeb-DF, FF++⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.31±0.13 | 4.40±2.29 | 1.10±0.43 | 3.34±0.66
Latent-pose* | 0.36±0.12 | 11.22±6.47 | 1.37±0.65 | 3.20±0.66
Bi-layer | 0.59±0.14 | 11.01±4.35 ‡ | 1.36±0.56 ‡ | 3.57±0.62 ‡
FOMM | 0.90±0.09 † | 14.66±8.54 | 1.50±0.70 | 3.65±0.71
DCG-GAN | 0.82±0.17 ‡ | 10.75±6.05 † | 1.31±0.58 † | 3.54±0.66 †

⟨FF++, VoxCeleb1⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.30±0.12 | 5.00±2.84 | 1.09±0.44 | 3.48±0.73
Latent-pose* | 0.36±0.12 | 12.14±7.67 | 1.36±0.66 | 3.27±0.68
Bi-layer | 0.59±0.13 | 11.46±6.32 | 1.33±0.58 ‡ | 3.61±0.64
FOMM | 0.80±0.13 ‡ | 10.35±5.81 ‡ | 1.43±0.66 | 3.59±0.74 ‡
DCG-GAN | 0.81±0.11 † | 10.15±5.41 † | 1.31±0.58 † | 3.55±0.70 †

⟨Celeb-DF, VoxCeleb1⟩
Method | CosSim↑ | PoseMSE↓ | ExpMSE1↓ | ExpMSE2↓
One-shot* | 0.28±0.13 | 5.35±2.97 | 1.11±0.44 | 3.45±0.69
Latent-pose* | 0.36±0.12 | 12.98±8.68 | 1.36±0.64 | 3.22±0.70
Bi-layer | 0.58±0.15 | 11.19±6.09 | 1.37±0.56 ‡ | 3.47±0.67 ‡
FOMM | 0.79±0.14 † | 10.18±7.16 ‡ | 1.38±0.65 | 3.50±0.71
DCG-GAN | 0.74±0.17 ‡ | 9.81±5.20 † | 1.31±0.60 † | 3.42±0.75 †

3.3 Quantitative Results

Quantitative results for both single- and cross-domain face re-enactment are reported in Table 1. Since One-shot forces the shape of re-enacted faces to be close to that of driving images, it produces lower errors in re-enactment fulfillment but completely fails to preserve the identity of source faces (CosSim values lower than 0.35 in all cases). Latent-pose also suffers from poor identity preservation, with CosSim values around 0.36 to 0.37. In contrast, our method achieves much better performance in preserving the source identity, indicating the success of using the 3DMM to bridge the identity gap in one-shot face re-enactment. Although the CosSim scores obtained by FOMM are slightly higher than ours, it produces larger errors in pose and expression transformation than our method in most cases. Moreover, FOMM requires a sequence of faces with no abrupt changes for driving, while our method only needs one driving image to achieve face re-enactment, which enables a much broader range of applications.

Figure 8: Sample results obtained by DCG-GAN and its variants on (a) FF++ and (b) Celeb-DF. Red boxes encircle image distortion caused by using the simplified dense correspondence estimation module.

3.4 Ablation Study

In this subsection, we validate the effectiveness of the network design of each component in our method. Specifically, we investigate the following variants of DCG-GAN: 1) in DCG-GAN-sD (simple Dense correspondence estimation), \mathcal{F} is computed based only on P_{s} and P_{r}, with no L_{s}, L_{r}, or M_{s} involved; 2) DCG-GAN-sG (simple Generator) refers to the setup where no AFM network is used to guide the generation process, and I_{s}^{warp}, M_{s}^{warp}, and P_{s}^{warp} are fed to the generator only at the bottom layer; and 3) DCG-GAN-sDG combines the configurations of both previous settings for comparison.

As shown in Fig. 8, removing the facial landmarks (L_{s} and L_{r}) and the parsing map (M_{s}) impairs the accuracy of motion field estimation, especially for the hair and mouth regions. The reason might be that the model fails to extract sufficient information about both the image layout (contained in M_{s}) and the shape of facial components (embedded in L_{s} and L_{r}). Results obtained by DCG-GAN-sG have severe problems in identity preservation, as no detailed textural information is provided to the ResBlocks at higher resolutions. DCG-GAN-sDG is very unstable to train, and we only present results obtained with the last checkpoint before collapse, which suffer from severe artifacts. The quantitative results reported in Table 1 are in line with these observations: our full model consistently outperforms its variants under all measurements, indicating the effectiveness of the network design of each module.

4 Conclusion

In this paper, we propose a novel GAN-based model, named DCG-GAN, for one-shot face re-enactment. DCG-GAN adopts a 3DMM to decompose and reorganize the geometric semantics of faces, and generates textured mesh images that help estimate the dense correspondence between source and re-enacted faces. The resulting dense correspondence field is used to warp source images to approximate the re-enactment result, followed by a generator network that improves the visual quality. Extensive experiments on multiple datasets demonstrate the effectiveness of DCG-GAN in generating lifelike images with accurate re-enactment fulfillment. We also discuss limitations of DCG-GAN and propose possible solutions as future research directions.

References

  • Burkov et al. (2020) Burkov, E.; Pasechnik, I.; Grigorev, A.; and Lempitsky, V. 2020. Neural head reenactment with latent pose descriptors. In IEEE Conference on Computer Vision and Pattern Recognition, 13786–13795.
  • Cao et al. (2013) Cao, C.; Weng, Y.; Zhou, S.; Tong, Y.; and Zhou, K. 2013. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3): 413–425.
  • Deng et al. (2019) Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. Arcface: Additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 4690–4699.
  • Feng et al. (2021) Feng, Y.; Feng, H.; Black, M. J.; and Bolkart, T. 2021. Learning an Animatable Detailed 3D Face Model from In-The-Wild Images. ACM Transactions on Graphics, 40(8).
  • Ghosh et al. (2020) Ghosh, P.; Gupta, P. S.; Uziel, R.; Ranjan, A.; Black, M. J.; and Bolkart, T. 2020. GIF: Generative interpretable faces. In International Conference on 3D Vision, 868–878.
  • Ha et al. (2020) Ha, S.; Kersner, M.; Kim, B.; Seo, S.; and Kim, D. 2020. Marionette: Few-shot face reenactment preserving identity of unseen targets. In AAAI Conference on Artificial Intelligence, 10893–10900.
  • Huang, Yang, and Wang (2020) Huang, P.-H.; Yang, F.-E.; and Wang, Y.-C. F. 2020. Learning Identity-Invariant Motion Representations for Cross-ID Face Reenactment. In IEEE Conference on Computer Vision and Pattern Recognition, 7084–7092.
  • Isola et al. (2017) Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, 1125–1134.
  • Jaderberg et al. (2015) Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. In Conference on Neural Information Processing Systems, 1–9.
  • Kim et al. (2018) Kim, H.; Garrido, P.; Tewari, A.; Xu, W.; Thies, J.; Niessner, M.; Pérez, P.; Richardt, C.; Zollhöfer, M.; and Theobalt, C. 2018. Deep video portraits. ACM Transactions on Graphics, 37(4): 1–14.
  • Li et al. (2017) Li, T.; Bolkart, T.; Black, M. J.; Li, H.; and Romero, J. 2017. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, 36(6): 194:1–194:17.
  • Li et al. (2020) Li, Y.; Yang, X.; Sun, P.; Qi, H.; and Lyu, S. 2020. Celeb-df: A large-scale challenging dataset for deepfake forensics. In IEEE Conference on Computer Vision and Pattern Recognition, 3207–3216.
  • Nagrani, Chung, and Zisserman (2017) Nagrani, A.; Chung, J. S.; and Zisserman, A. 2017. Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
  • Nguyen-Phuoc et al. (2019) Nguyen-Phuoc, T.; Li, C.; Theis, L.; Richardt, C.; and Yang, Y.-L. 2019. Hologan: Unsupervised learning of 3d representations from natural images. In International Conference on Computer Vision, 7588–7597.
  • Nirkin, Keller, and Hassner (2019) Nirkin, Y.; Keller, Y.; and Hassner, T. 2019. Fsgan: Subject agnostic face swapping and reenactment. In IEEE Conference on Computer Vision and Pattern Recognition, 7184–7193.
  • Park et al. (2019) Park, T.; Liu, M.-Y.; Wang, T.-C.; and Zhu, J.-Y. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In IEEE Conference on Computer Vision and Pattern Recognition, 2337–2346.
  • Pumarola et al. (2018) Pumarola, A.; Agudo, A.; Martinez, A. M.; Sanfeliu, A.; and Moreno-Noguer, F. 2018. Ganimation: Anatomically-aware facial animation from a single image. In European Conference on Computer Vision, 818–833.
  • Rossler et al. (2019) Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; and Nießner, M. 2019. Faceforensics++: Learning to detect manipulated facial images. In International Conference on Computer Vision, 1–11.
  • Ruiz, Chong, and Rehg (2018) Ruiz, N.; Chong, E.; and Rehg, J. M. 2018. Fine-grained head pose estimation without keypoints. In IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2074–2083.
  • Siarohin et al. (2019) Siarohin, A.; Lathuilière, S.; Tulyakov, S.; Ricci, E.; and Sebe, N. 2019. First order motion model for image animation. In Conference on Neural Information Processing Systems, volume 32, 7137–7147.
  • Song et al. (2018) Song, L.; Lu, Z.; He, R.; Sun, Z.; and Tan, T. 2018. Geometry guided adversarial facial expression synthesis. In ACM International Conference on Multimedia, 627–635.
  • Sun et al. (2019) Sun, K.; Xiao, B.; Liu, D.; and Wang, J. 2019. Deep High-Resolution Representation Learning for Human Pose Estimation. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Tewari et al. (2020) Tewari, A.; Elgharib, M.; Bharaj, G.; Bernard, F.; Seidel, H.-P.; Pérez, P.; Zollhofer, M.; and Theobalt, C. 2020. Stylerig: Rigging stylegan for 3d control over portrait images. In IEEE Conference on Computer Vision and Pattern Recognition, 6142–6151.
  • Thies et al. (2016) Thies, J.; Zollhofer, M.; Stamminger, M.; Theobalt, C.; and Nießner, M. 2016. Face2face: Real-time Face Capture and Reenactment of RGB Videos. In IEEE Conference on Computer Vision and Pattern Recognition, 2387–2395.
  • Tripathy, Kannala, and Rahtu (2020) Tripathy, S.; Kannala, J.; and Rahtu, E. 2020. Icface: Interpretable and controllable face reenactment using gans. In IEEE Winter Conference on Applications of Computer Vision, 3385–3394.
  • Tripathy, Kannala, and Rahtu (2021) Tripathy, S.; Kannala, J.; and Rahtu, E. 2021. FACEGAN: Facial Attribute Controllable rEenactment GAN. In IEEE Winter Conference on Applications of Computer Vision, 1329–1338.
  • Tripathy, Kannala, and Rahtu (2022) Tripathy, S.; Kannala, J.; and Rahtu, E. 2022. Single Source One Shot Reenactment Using Weighted Motion From Paired Feature Points. In IEEE Winter Conference on Applications of Computer Vision, 2715–2724.
  • Wang et al. (2018a) Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; and Liu, W. 2018a. Cosface: Large margin cosine loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 5265–5274.
  • Wang et al. (2018b) Wang, T.-C.; Liu, M.-Y.; Zhu, J.-Y.; Tao, A.; Kautz, J.; and Catanzaro, B. 2018b. High-resolution image synthesis and semantic manipulation with conditional gans. In IEEE Conference on Computer Vision and Pattern Recognition, 8798–8807.
  • Wang, Mallya, and Liu (2021) Wang, T.-C.; Mallya, A.; and Liu, M.-Y. 2021. One-shot free-view neural talking-head synthesis for video conferencing. In IEEE Conference on Computer Vision and Pattern Recognition, 10039–10049.
  • Wu et al. (2018) Wu, W.; Zhang, Y.; Li, C.; Qian, C.; and Loy, C. C. 2018. Reenactgan: Learning to reenact faces via boundary transfer. In European Conference on Computer Vision, 603–619.
  • Yao et al. (2021) Yao, G.; Yuan, Y.; Shao, T.; Li, S.; Liu, S.; Liu, Y.; Wang, M.; and Zhou, K. 2021. One-shot Face Reenactment Using Appearance Adaptive Normalization. In AAAI Conference on Artificial Intelligence, 3172–3180.
  • Yao et al. (2020) Yao, G.; Yuan, Y.; Shao, T.; and Zhou, K. 2020. Mesh guided one-shot face reenactment using graph convolutional networks. In ACM International Conference on Multimedia, 1773–1781.
  • Yin et al. (2017) Yin, X.; Yu, X.; Sohn, K.; Liu, X.; and Chandraker, M. 2017. Towards large-pose face frontalization in the wild. In International Conference on Computer Vision, 3990–3999.
  • Yu et al. (2018) Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; and Sang, N. 2018. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In European Conference on Computer Vision, 325–341.
  • Zakharov et al. (2020) Zakharov, E.; Ivakhnenko, A.; Shysheya, A.; and Lempitsky, V. 2020. Fast bi-layer neural synthesis of one-shot realistic head avatars. In European Conference on Computer Vision, 524–540.
  • Zhang et al. (2020a) Zhang, J.; Zeng, X.; Wang, M.; Pan, Y.; Liu, L.; Liu, Y.; Ding, Y.; and Fan, C. 2020a. Freenet: Multi-identity face reenactment. In IEEE Conference on Computer Vision and Pattern Recognition, 5326–5335.
  • Zhang et al. (2020b) Zhang, P.; Zhang, B.; Chen, D.; Yuan, L.; and Wen, F. 2020b. Cross-domain Correspondence Learning for Exemplar-based Image Translation. In IEEE Conference on Computer Vision and Pattern Recognition, 5143–5153.
  • Zhang et al. (2019) Zhang, Y.; Zhang, S.; He, Y.; Li, C.; Loy, C. C.; and Liu, Z. 2019. One-shot Face Reenactment. In British Machine Vision Conference.
  • Zhou et al. (2021) Zhou, X.; Zhang, B.; Zhang, T.; Zhang, P.; Bao, J.; Chen, D.; Zhang, Z.; and Wen, F. 2021. CoCosNet v2: Full-Resolution Correspondence Learning for Image Translation. In IEEE Conference on Computer Vision and Pattern Recognition, 11465–11475.
  • Zhu et al. (2017a) Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017a. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision, 2223–2232.
  • Zhu et al. (2017b) Zhu, X.; Liu, X.; Lei, Z.; and Li, S. Z. 2017b. Face alignment in full pose range: A 3d total solution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1): 78–92.