
Human Pose Transfer with Augmented Disentangled Feature Consistency

Kun Wu ([email protected]), Syracuse University, Syracuse, NY, USA 13244; Chengxiang Yin ([email protected]), Syracuse University, Syracuse, NY, USA 13244; Zhengping Che ([email protected], [email protected]), Midea Group, Beijing, China; Bo Jiang ([email protected]), Didi Chuxing, Beijing, China; Jian Tang ([email protected]), Midea Group, Beijing, China; Zheng Guan ([email protected]), Computer Science School, Beijing Institute of Technology, Beijing, China; and Gangyi Ding ([email protected]), Computer Science School, Beijing Institute of Technology, Beijing, China
(2023; 11 August 2022; 9 August 2023; 9 September 2023)
Abstract.

Deep generative models have made great progress in synthesizing images with arbitrary human poses and transferring the pose of one person to others. Though many different methods have been proposed to generate images with high visual fidelity, the main challenge remains and comes from two fundamental issues: pose ambiguity and appearance inconsistency. To alleviate the current limitations and improve the quality of the synthesized images, we propose a pose transfer network with augmented Disentangled Feature Consistency (DFC-Net) to facilitate human pose transfer. Given a pair of images containing the source and target person, DFC-Net extracts pose and static information from the source and target respectively, and then synthesizes an image of the target person with the desired pose from the source. Moreover, DFC-Net leverages disentangled feature consistency losses in the adversarial training to strengthen the transfer coherence and integrates a keypoint amplifier to enhance the pose feature extraction. With the help of the disentangled feature consistency losses, we further propose a novel data augmentation scheme that introduces unpaired support data with augmented consistency constraints to improve the generality and robustness of DFC-Net. Extensive experimental results on Mixamo-Pose and EDN-10k demonstrate that DFC-Net achieves state-of-the-art performance on pose transfer.

Human pose transfer, Generative adversarial network, Image generation, Computer vision
Copyright: ACM. Journal year: 2023. DOI: XXXXXXX.XXXXXXX. Journal: JACM, Volume 37, Number 4, Article 111. Publication month: 8. CCS Concepts: Computing methodologies → Appearance and texture representations; Computing methodologies → Reconstruction.

1. Introduction

Human pose transfer has become increasingly compelling recently, since it can be applied to real-world applications such as movie special effects (Tung et al., 2017), entertainment systems (Xia et al., 2017), reenactment (Liu et al., 2019b), and so forth (Moeslund et al., 2006; Ding and Tao, 2016). At the same time, it is closely related to many computer vision tasks such as human-object interaction recognition (Qi et al., 2018; Zhou et al., 2020, 2021), person re-identification (Liu et al., 2020; Tian et al., 2021), human pose segmentation (Zhou et al., 2022; Wang et al., 2020), and human parsing (Li et al., 2020; Wang et al., 2021), and these tasks can benefit from one another. Given some images of a target person and a source person image with the desired pose (e.g., judo, dance), the goal of the human pose transfer task is to synthesize a realistic image of the target person with the desired pose of the source person.

With the power of deep learning, especially generative adversarial networks (GANs) (Goodfellow et al., 2014), pioneering works have proposed impressive solutions to human image generation (Ma et al., 2017; Neverova et al., 2018; Liu et al., 2019a) by efficiently leveraging image-to-image translation schemes and have achieved significant progress. Intuitively, early approaches coarsely conduct human pose transfer through general image-to-image translation methods such as Pix2Pix (Isola et al., 2017) and CycleGAN (Zhu et al., 2017), which attempt to translate the extracted skeleton image of the source person to an image of the target person with the desired pose.

Subsequent approaches (Ma et al., 2017, 2018; Li et al., 2019) adopt modules specifically designed for human pose transfer. Specifically, the U-net architecture with skip connections in (Esser et al., 2018) is employed to keep the low-level features. To mitigate the pose misalignment between the source and target persons, (Siarohin et al., 2018) uses part-wise affine transformations with a modified feature fusion mechanism to warp the appearance features onto the target pose. Later, extensive works have been presented to strengthen the modeling of body deformation and feature transfer with different techniques, including 3D surface models (Neverova et al., 2018; Li et al., 2019; Grigorev et al., 2019), local attention (Zhu et al., 2019; Ren et al., 2020), and optical flow (Wang et al., 2018a). (Li et al., 2020) and (Wang et al., 2021) propose a self-learning rectification strategy and a hierarchical information framework, respectively, for human parsing, which benefits the downstream pose transfer task. However, warping methods commonly struggle with pose ambiguity when the viewpoint changes, occlusions occur, or a complicated pose has to be transferred. To address the pose ambiguity, a series of works (Wang et al., 2018a; Liu et al., 2019a) use predictive branches to hallucinate and replenish new content for invisible regions. When the hallucinated content has a different context style than the locally warped content, the generated images will have low visual fidelity due to appearance inconsistency. One of the main reasons for pose ambiguity and appearance inconsistency is that the commonly used reconstruction loss and adversarial generative loss only constrain the synthesized image generation at the image level.

To alleviate the above limitations, it is important to disentangle the pose and appearance information and exploit the disentangled pose and appearance feature consistencies between the synthesized and real images, i.e., the synthesized target image should have a high-level appearance feature similar to the real target person as well as a high-level pose feature similar to the real source person. The disentangled pose and appearance feature consistencies constrain the training at the feature level and lead to more consistent and realistic synthesized results. In CDMS (Zhou et al., 2022), a multi-mutual consistency learning strategy is proposed for the human pose segmentation task, showing the importance of feature consistency for distinguishing the human pose.

In this paper, we propose a pose transfer network with augmented Disentangled Feature Consistency (DFC-Net) to facilitate human pose transfer. DFC-Net contains a pose feature encoder and a static feature encoder to extract pose and appearance features from the source and target person, respectively. In the pose feature encoder, we integrate a pre-trained pose estimator such as OpenPose (Cao et al., 2019) to extract the keypoint heatmaps. Note that the pose estimator is pre-trained on the COCO keypoint challenge dataset (Lin et al., 2014), which is different from the datasets used in our experiments. As shown in Figure 2, although the pre-trained pose estimator can predict pose heatmaps for unseen subjects in our datasets, it does not generalize well and the heatmaps contain much noise, which hinders subsequent pose transfer. To remedy the distortion of the extracted keypoints caused by this distribution shift, we introduce a keypoint amplifier to eliminate the noise in the keypoint heatmaps. An image generator then synthesizes a realistic image of the target person conditioned on the disentangled pose and appearance features. This disentangled design enables us to present novel feature-level pose and appearance consistency losses (Zhu et al., 2017), which reinforce the consistency of pose and appearance information in the feature space while maintaining visual fidelity. Additionally, to further improve the robustness and generality of DFC-Net, by disentangling the pose information from different source persons, we present a novel data augmentation scheme that builds an extra unpaired support dataset as the source images, which provides different persons with poses unseen in the training set together with augmented consistency constraints.

We also notice that the commonly used real-person datasets and benchmarks (Zheng et al., 2015; Liu et al., 2016) usually do not contain an image of the target person performing the desired pose of another source person, i.e., the ground truth for evaluation. It is common practice to use a target person image directly from the testing dataset to provide the pose information during evaluation. This practice risks information leakage and is also inconsistent with real-world usage (i.e., the pose information comes from another source person). To be consistent with real-world applications and to better evaluate the proposed method, inspired by (Aberman et al., 2019), we collect an animation character image dataset named Mixamo-Pose from Adobe Mixamo (Adobe Systems Inc., 2018), a 3D animation library, to accurately generate different characters performing identical poses, as a benchmark to assess human pose transfer between different people. Mixamo-Pose contains four different animation characters performing 15 kinds of poses. To further evaluate DFC-Net, we also build a real-person dataset called EDN-10k upon (Chan et al., 2019), which contains 10k high-resolution images of four real subjects performing different poses. The experimental results on these two datasets demonstrate that our model can effectively synthesize realistic images and conduct pose transfer for both animation characters and real persons.

In summary, our contributions are as follows:

  • We propose a novel method, DFC-Net, for human pose transfer with two disentangled feature consistency losses that enforce consistency between the real and synthesized images at the feature level.

  • We propose a novel data augmentation scheme that enforces augmented consistency constraints with an unpaired support dataset to further improve the generality of our model.

  • We collect an animation character dataset Mixamo-Pose as a new benchmark to enable the accurate evaluation of pose transfer between different people in the animation domain.

  • We conduct extensive experiments on the Mixamo-Pose and EDN-10k datasets, and the empirical results demonstrate the effectiveness of our method.

2. Related Work

Generative adversarial networks (Goodfellow et al., 2014) and diffusion models (Ho et al., 2020) have achieved tremendous success in image generation tasks, whose goal is to generate high-fidelity images based on images or text prompts from a different domain. Pix2Pix (Isola et al., 2017) proposes a framework based on cGANs (Mirza and Osindero, 2014) with an encoder-decoder architecture (Hinton and Salakhutdinov, 2006); CycleGAN (Zhu et al., 2017) addresses this problem by using cycle-consistent GANs; DualGAN (Yi et al., 2017) and (Hoshen and Wolf, 2018) are also unsupervised image-to-image translation methods trained on unpaired datasets. Similarly, (Liu et al., 2017; Bousmalis et al., 2017; Huang et al., 2018) are image-to-image translation techniques as well, but they try to generate a labeled dataset of the target domain for domain adaptation tasks. The above works can be exploited as general approaches to the human pose transfer task, with the precondition that there is a specific image domain that can be converted to the synthesized image domain, e.g., using a pose estimator (Cao et al., 2017) to generate a paired skeleton image dataset. Based on the diffusion model, DiffuStereo (Shao et al., 2022) proposes a diffusion kernel and stereo constraints for 3D human reconstruction from sparse cameras. MotionDiffuse (Zhang et al., 2022) leverages the diffusion model for the text-driven motion generation task. In this work, we focus on the 2D pose-guided motion transfer task, which differs from the above 3D reconstruction and text-driven tasks. Different from the image-to-image translation methods, DFC-Net improves the quality of the synthesized image by adding consistency constraints in the feature space.

Recently, there have been a growing number of human pose transfer methods with specifically designed modules. One branch is the spatial transformation methods (Siarohin et al., 2018; Dong et al., 2018; Li et al., 2019), which aim to build the deformation mapping of the keypoint correspondences in the human body. By leveraging the spatial transformation capability of CNNs, (Jaderberg et al., 2015) presented the spatial transformer network (STN) that approximates a global affine transformation to warp the features. Following STN, several variant works (Zhang and He, 2017; Lin and Lucey, 2017; Jiang et al., 2019) have been proposed to synthesize images with better performance. (Wang et al., 2020) introduced an external eye-tracking dataset and two cascaded attention modules for comprehensive pose segmentation. (Wang et al., 2021) incorporated three different inference processes to detect each part of the human body. (Balakrishnan et al., 2018) used image segmentation to decompose the problem into modular subtasks for each body part and then integrated all parts into the final result. (Siarohin et al., 2018) built deformable skip connections to move information and transfer textures for pose transfer. Monkey-Net (Siarohin et al., 2019a) encoded pose information via dense flow fields generated from keypoints learned in a self-supervised fashion. The First Order Motion Model (Siarohin et al., 2019b) decoupled appearance and pose and proposed to use learned keypoints and local affine transformations to generate image animation. (Liu et al., 2019a) integrated human pose transfer, appearance transfer, and novel view synthesis into one unified framework by using SMPL (Loper et al., 2015) to generate a human body mesh. The spatial transformation methods usually implicitly assume that the warping operation can cover the whole body. However, when the viewpoint changes or occlusions occur, this assumption does not hold, leading to pose ambiguity and performance degradation.

Another branch of methods is pose-guided and aims to predict new appearance content in uncovered regions to handle the pose ambiguity problem. One of the earliest works, PG2 (Ma et al., 2017), presented a two-stage method using U-Net to synthesize the target person with arbitrary poses. (Ma et al., 2018) further decomposed the image into foreground, background, and pose features to achieve more precise control of different information. (Si et al., 2018) introduced a multi-stage GAN loss and synthesized each body part separately. (Neverova et al., 2018) leveraged DensePose (Alp Güler et al., 2018) rather than the commonly used 2D keypoints to perform accurate pose transfer. (Chan et al., 2019) learned a direct mapping from skeleton images to synthesized images with corresponding poses based on the Pix2PixHD architecture (Wang et al., 2018b). PATN (Zhu et al., 2019) introduced cascaded pose-attentional transfer blocks (PATBs) to refine pose and appearance features simultaneously. Inspired by PATN, PMAN (Chen et al., 2021) proposed a progressive multi-attention framework with memory networks to improve image quality. However, some of these methods (Neverova et al., 2018; Zheng et al., 2019; Wang et al., 2018a) focus on synthesizing results at the image level (i.e., with adversarial and reconstruction losses), thus leading to appearance inconsistency when the predicted local contents are not consistent with the surrounding context. Some works (Zhao et al., 2021; Shen et al., 2021) designed lightweight networks to accelerate the training and inference process. Our method can also benefit from these lightweight networks to achieve highly efficient human pose transfer.

In contrast, our method learns to disentangle and reassemble the pose and appearance in the feature space. A closely related work is C2GAN (Tang et al., 2019), which consists of three generation cycles (i.e., one for image generation and two for keypoint generation). C2GAN explores cross-modal information at the image level at the cost of model complexity and training instability, while DFC-Net only introduces two feature consistency losses into the full objective, which keeps the model simple and effective. By disentangling the pose and appearance features, we can enforce the feature consistencies between the synthesized and real images and leverage the pose features from an unpaired dataset to improve performance.

Figure 1. Upper left: DFC-Net synthesizes an image of the target person performing the pose of the source person. Upper right: the Pose Feature Encoder $M(\cdot)$ includes three components: a pre-trained Pose Estimator, a Keypoint Amplifier, and a Pose Refiner. Bottom: overview of the training process of DFC-Net. Note that the images $\bm{x}_{\mathrm{s}}$ surrounded by orange dotted boxes are from the support set, and the augmented consistency loss $\mathcal{L}_{\mathrm{sup}}$ is the sum of the terms $\mathcal{L}_{\mathrm{adv}}^{-}$, $\mathcal{L}_{\mathrm{mc}}$, and $\mathcal{L}_{\mathrm{sc}}$ surrounded by orange dotted boxes.

3. Methodology

3.1. Overview

The training and inference process of the proposed model is shown in Figure 1. Given one image $\bm{x}_{\mathrm{s}}$ of a source person and another image $\bm{x}_{\mathrm{t}}$ of a target person, DFC-Net synthesizes an image $\bm{x}_{\mathrm{syn}}$, which preserves a) the pose information, e.g., pose and location, of the source person in $\bm{x}_{\mathrm{s}}$, and b) the static information, e.g., person appearance and environment background, from the target image $\bm{x}_{\mathrm{t}}$. For each image, DFC-Net attempts to disentangle the pose and static information into orthogonal features. Specifically, DFC-Net consists of the following core components: 1) a Pose Feature Encoder $M(\cdot)$, which extracts pose features $M(\bm{x})$ from an image $\bm{x}$; 2) a Static Feature Encoder $S(\cdot)$, which extracts static features $S(\bm{x}^{\prime})$ from an image $\bm{x}^{\prime}$; and 3) an Image Generator $G(\cdot)$, which synthesizes an image $G(M(\bm{x}),S(\bm{x}^{\prime}))$ based on the encoded pose and static features $M(\bm{x})$ and $S(\bm{x}^{\prime})$ from images $\bm{x}$ and $\bm{x}^{\prime}$ separately. In the remainder of this section, we describe the model architecture and introduce the training procedure, followed by the model instantiations.

Figure 2. Comparison of the amplified heatmaps of Subject 1 in EDN-10k, generated by the Keypoint Amplifier with different temperatures $T$: (a) the input image of Subject 1; (b) the generated heatmaps with temperature $T=1$; (c) the generated heatmaps with temperature $T=0.1$; (d) the generated heatmaps with temperature $T=0.01$. We can observe that the heatmap with temperature $T=0.01$ is the most discriminative, which is beneficial to later pose synthesis.

3.2. Pose Transfer Network Architecture

3.2.1. Pose Feature Encoder

Our designed Pose Feature Encoder consists of a Pose Estimator network, a Keypoint Amplifier block, and a Pose Refiner network. Given an RGB image $\bm{x}\in\mathbb{R}^{3\times H\times W}$ of height $H$ and width $W$, the pre-trained Pose Estimator aims at extracting pose information $P(\bm{x})$ from the image $\bm{x}$. Similar to (Cao et al., 2017), the extracted pose information contains the downsampled keypoint heatmaps $\bm{h}\in\mathbb{R}^{18\times\frac{H}{8}\times\frac{W}{8}}$ and the part affinity fields $\bm{p}\in\mathbb{R}^{38\times\frac{H}{8}\times\frac{W}{8}}$. The keypoint heatmaps $\bm{h}$ store the heatmaps of 18 body parts, and the part affinity fields $\bm{p}$ store the location and orientation information for the body parts and the background, which amounts to 38 ($=(18+1)\times 2$) channels.

As the pose estimator (OpenPose (Cao et al., 2019) in our implementation) is pre-trained on the COCO keypoint challenge dataset (Lin et al., 2014), when it is applied to the Mixamo-Pose and EDN-10k datasets, which have different distributions, the extracted keypoint heatmaps $\bm{h}$ contain more noise. To reduce the interference of this noise, we apply a softmax function with a relatively small temperature $T$ (e.g., 0.01) as the Keypoint Amplifier, which denoises the extracted keypoint heatmaps by increasing the gap between large and small values and produces the amplified heatmaps $\bm{h}^{\prime}$ by

(1) $\displaystyle\bm{h}^{\prime}=\mathrm{softmax}\left(\frac{1}{T}\cdot\bm{h}\right).$

As shown in Figure 2, by employing the Keypoint Amplifier on the input heatmaps, small probabilities, e.g., 0.2, are squeezed toward 0.0, while large probabilities, e.g., 0.8, are pushed toward 1.0. Without the Keypoint Amplifier, the generator may still synthesize blurry limbs for the low-probability areas and twist the generated person.
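The Keypoint Amplifier reduces to a channel-wise softmax over spatial locations with a small temperature, as in Equation (1). The following PyTorch sketch is a minimal illustration; the tensor shape and the flattening over spatial dimensions are our assumptions about one reasonable way to apply the softmax per keypoint channel, not details prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def keypoint_amplifier(heatmaps: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    """Sharpen keypoint heatmaps via a low-temperature softmax (Eq. (1)).

    Assumes `heatmaps` has shape (B, 18, H/8, W/8); the softmax is taken
    over the spatial locations of each keypoint channel.
    """
    b, c, h, w = heatmaps.shape
    flat = heatmaps.view(b, c, h * w)             # flatten spatial dimensions
    amplified = F.softmax(flat / temperature, dim=-1)
    return amplified.view(b, c, h, w)

# Example: noisy low values are squeezed toward 0, the confident peak toward 1.
if __name__ == "__main__":
    noisy = torch.rand(1, 18, 32, 32) * 0.2
    noisy[0, :, 16, 16] = 0.8                     # a confident keypoint location
    print(keypoint_amplifier(noisy)[0, 0].max())  # close to 1.0
```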

Finally, the Pose Refiner takes both the part affinity fields $\bm{p}$ and the amplified keypoint heatmaps $\bm{h}^{\prime}$ and produces the encoded pose feature vector $M(\bm{x})$. In this way, the pose information extracted from the Pose Estimator can be refined, and the influence caused by different limb ratios and/or camera angles and distances can be reduced.

3.2.2. Static Feature Encoder

While the Pose Feature Encoder is not capable of extracting static information, the static information of another image $\bm{x}^{\prime}$, including background, personal appearance, etc., is captured automatically by another module with the help of the full objective function. Named the Static Feature Encoder, this module extracts only static features $S(\bm{x}^{\prime})$ from $\bm{x}^{\prime}$.

3.2.3. Image Generator

Given pose features $M(\bm{x}_{\mathrm{s}})$ extracted from a source image $\bm{x}_{\mathrm{s}}$ and static features $S(\bm{x}_{\mathrm{t}})$ extracted from a target image $\bm{x}_{\mathrm{t}}$, the Image Generator outputs the synthesized image $\bm{x}_{\mathrm{syn}}$ by

(2) $\displaystyle\bm{x}_{\mathrm{syn}}=G\left(M(\bm{x}_{\mathrm{s}}),S(\bm{x}_{\mathrm{t}})\right).$

It is noted that many existing methods (e.g., (Chan et al., 2019)) attempt to learn the pose-to-image or pose-to-appearance mapping solely via the generator. In that case, the generator has to learn three different functionalities: 1) memorizing the static information of the target person, 2) extracting representative pose features, and 3) combining the static and pose information to synthesize the target person image with the desired pose. Even if the generator can memorize the static information of the target person $\bm{x}_{\mathrm{t}}$ perfectly, once the desired pose $\bm{x}_{\mathrm{s}}$ is very different from the poses in the training dataset (e.g., in the distance from the camera, the skeleton scale of different persons, or occlusions), it is too difficult for the generator to achieve the second and third functionalities at the same time. The results of Pix2Pix (Isola et al., 2017) and Everybody Dance Now (EDN) (Chan et al., 2019) in Section 4 also validate this disadvantage. DFC-Net, instead, decomposes the above three functionalities into three network modules, namely the pose feature encoder, the static feature encoder, and the image generator, and thus improves the reconstruction quality.
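To make this decomposition concrete, the sketch below shows how the three modules compose at inference time. The module classes are placeholders standing in for the encoders and generator described in this section; only the data flow of Equation (2) is taken from the paper.

```python
import torch
import torch.nn as nn

class DFCNet(nn.Module):
    """Minimal wiring sketch: synthesize G(M(x_s), S(x_t)) as in Eq. (2)."""

    def __init__(self, pose_encoder: nn.Module, static_encoder: nn.Module,
                 generator: nn.Module):
        super().__init__()
        self.M = pose_encoder      # Pose Feature Encoder (estimator + amplifier + refiner)
        self.S = static_encoder    # Static Feature Encoder
        self.G = generator         # Image Generator

    def forward(self, x_s: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        pose_feat = self.M(x_s)    # pose of the source person
        static_feat = self.S(x_t)  # appearance/background of the target person
        return self.G(pose_feat, static_feat)
```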

3.3. Training DFC-Net

We train the pose transfer network in an adversarial manner with disentangled feature consistency losses as well as other objectives. Basically, the model is trained with a set of images of the same person, possibly from one or several video clips. To further improve the generalization ability of DFC-Net, we propose to train DFC-Net with a support set and the augmented consistency losses. We show the ablation study results in Section 4.3.

3.3.1. Adversarial Training

We employ an Image Discriminator $D(\cdot)$ in an adversarial learning manner to ensure that the synthesized image $\bm{x}_{\mathrm{syn}}$ borrows the pose and static information from the source and target images ($\bm{x}_{\mathrm{s}}$ and $\bm{x}_{\mathrm{t}}$) separately. As both the source and target images during training contain the same person with the same appearance and background, they share almost the same static features, i.e., $S(\bm{x}_{\mathrm{t}})\simeq S(\bm{x}_{\mathrm{s}})$. Therefore, the output of the model $\bm{x}_{\mathrm{syn}}$ can also be treated as a reconstruction of the source image $\bm{x}_{\mathrm{s}}$, as the synthesized image contains the same pose features $M(\bm{x}_{\mathrm{s}})$ as the source image. This inspires us to resort to the conditional generative adversarial network (cGAN) (Isola et al., 2017), where the Image Discriminator attempts to discern between the real sample $\bm{x}_{\mathrm{s}}$ and the generated image $\bm{x}_{\mathrm{syn}}$, conditioned on the pose features $M(\bm{x}_{\mathrm{s}})$ extracted from the source image. That is, the Image Discriminator attempts to fit $D(\bm{x}_{\mathrm{s}},M(\bm{x}_{\mathrm{s}}))=1$ and $D(\bm{x}_{\mathrm{syn}},M(\bm{x}_{\mathrm{s}}))=0$. The adversarial loss is described as follows:

(3) $\displaystyle\mathcal{L}_{\mathrm{adv}}=-(\mathcal{L}_{\mathrm{adv}}^{+}+\mathcal{L}_{\mathrm{adv}}^{-}),$

where

(4) $\displaystyle\mathcal{L}_{\mathrm{adv}}^{+}=\log D(\bm{x}_{\mathrm{s}},M(\bm{x}_{\mathrm{s}})),$
(5) $\displaystyle\mathcal{L}_{\mathrm{adv}}^{-}=\log\left(1-D\left(\bm{x}_{\mathrm{syn}},M(\bm{x}_{\mathrm{s}})\right)\right).$

We enhance the Image Discriminator with a multi-scale discriminator $D=(D_{1},D_{2})$ (Wang et al., 2018b) and include the discriminator feature matching loss $\mathcal{L}_{\mathrm{fm}}$ in our objective. The feature matching loss is a weighted sum of feature losses from 5 different layers of the Image Discriminator, calculated as the $L_{1}$ distance between the corresponding features of $\bm{x}_{\mathrm{s}}$ and $\bm{x}_{\mathrm{syn}}$.

In order to increase the training stability and improve the synthesized image quality, we also add the perceptual loss $\mathcal{L}_{\mathrm{per}}$ (Johnson et al., 2016) based on a pre-trained VGG network (Simonyan and Zisserman, 2014a).
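For concreteness, a direct transcription of the conditioned adversarial terms in Equations (3)-(5) is sketched below. It assumes the discriminator returns a probability in (0, 1) for an (image, pose-feature) pair; the small epsilon for numerical safety and the mean reduction are our additions, not details specified by the paper, and the multi-scale and feature matching terms are omitted.

```python
import torch

def adversarial_loss(D, x_s, x_syn, pose_feat):
    """Literal form of Eqs. (3)-(5).

    `D` is assumed to output a probability in (0, 1) for an
    (image, conditioned-pose-feature) pair.
    """
    eps = 1e-8                                                      # numerical safety
    l_adv_pos = torch.log(D(x_s, pose_feat) + eps).mean()           # Eq. (4)
    l_adv_neg = torch.log(1.0 - D(x_syn, pose_feat) + eps).mean()   # Eq. (5)
    return -(l_adv_pos + l_adv_neg)                                 # Eq. (3)
```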

3.3.2. Disentangled Feature Consistency Losses

The above adversarial training losses aim at penalizing the discrepancy between the synthesized and source images directly in the raw image space. To improve the accuracy and robustness of the pose transfer results, we also introduce two disentangled feature consistency losses in terms of pose and static features to ensure that the synthesized person looks like the target person and behaves like the source person. The pose consistency loss $\mathcal{L}_{\mathrm{mc}}$ measures the differences between the synthesized and source images in the pose feature space, and the static consistency loss $\mathcal{L}_{\mathrm{sc}}$ measures the differences between the synthesized and target images in the static feature space. They are both $L_{1}$ distances between the outputs of the corresponding encoders, formally defined as

(6) $\displaystyle\mathcal{L}_{\mathrm{mc}}={\left\|M(\bm{x}_{\mathrm{syn}})-M(\bm{x}_{\mathrm{s}})\right\|}_{1},$
(7) $\displaystyle\mathcal{L}_{\mathrm{sc}}={\left\|S(\bm{x}_{\mathrm{syn}})-S(\bm{x}_{\mathrm{t}})\right\|}_{1}.$
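A minimal sketch of the two consistency terms in Equations (6)-(7) follows; `pose_encoder` and `static_encoder` stand for $M(\cdot)$ and $S(\cdot)$ and are assumed to return feature tensors of matching shape.

```python
import torch
import torch.nn.functional as F

def consistency_losses(pose_encoder, static_encoder, x_syn, x_s, x_t):
    """L1 feature-space consistencies of Eqs. (6)-(7)."""
    # Pose consistency: the synthesized image should carry the source pose.
    l_mc = F.l1_loss(pose_encoder(x_syn), pose_encoder(x_s))
    # Static consistency: the synthesized image should carry the target appearance.
    l_sc = F.l1_loss(static_encoder(x_syn), static_encoder(x_t))
    return l_mc, l_sc
```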

3.3.3. Augmented Consistency Loss

Through disentangling the pose features from the source images, we find that images of different persons can also be fed into the training process as the source images $\bm{x}_{\mathrm{s}}$ to improve the generalization ability of our model. Hence, we introduce a novel data augmentation method that extends the training dataset with images of different persons, referred to as the support set, which provides many kinds of unseen poses. Note that the subjects in the support set can be arbitrary and are different from those in the primary training dataset, so ground-truth images of the target person performing the pose of the source person are not available at all. As a result, the corresponding losses $\mathcal{L}_{\mathrm{adv}}^{+}$, $\mathcal{L}_{\mathrm{per}}$, and $\mathcal{L}_{\mathrm{fm}}$ are not applicable for the support set, and we only optimize the relevant objective terms, defined by

(8) $\displaystyle\mathcal{L}_{\mathrm{sup}}=\lambda_{\mathrm{adv}}\mathcal{L}_{\mathrm{adv}}^{-}+\lambda_{\mathrm{mc}}\mathcal{L}_{\mathrm{mc}}+\lambda_{\mathrm{sc}}\mathcal{L}_{\mathrm{sc}},$

where $\lambda_{\mathrm{adv}},\lambda_{\mathrm{mc}},\lambda_{\mathrm{sc}}$ are the weights for the corresponding loss terms.

3.3.4. Full Objective

By bringing all the objective terms together, we train all components jointly, except for the Pose Estimator, to minimize the full objective $\mathcal{L}_{\mathrm{full}}$ below.

(9) $\displaystyle\mathcal{L}_{\mathrm{full}}=\lambda_{\mathrm{adv}}\mathcal{L}_{\mathrm{adv}}+\lambda_{\mathrm{fm}}\mathcal{L}_{\mathrm{fm}}+\lambda_{\mathrm{per}}\mathcal{L}_{\mathrm{per}}+\lambda_{\mathrm{mc}}\mathcal{L}_{\mathrm{mc}}+\lambda_{\mathrm{sc}}\mathcal{L}_{\mathrm{sc}}+\mathcal{L}_{\mathrm{sup}},$

where $\lambda_{\mathrm{adv}},\lambda_{\mathrm{fm}},\lambda_{\mathrm{per}}$ are set to 1, 10, and 10 following Pix2Pix (Isola et al., 2017) and Everybody Dance Now (EDN) (Chan et al., 2019), while $\lambda_{\mathrm{mc}},\lambda_{\mathrm{sc}}$ are set to 0.1 and 0.01 by grid search. We set $\lambda_{\mathrm{sc}}$ to 0.01, smaller than $\lambda_{\mathrm{mc}}$, to balance $\mathcal{L}_{\mathrm{mc}}$ and $\mathcal{L}_{\mathrm{sc}}$.
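Putting the terms together, one way the full objective in Equation (9) could be assembled is sketched below. The individual loss values are placeholders for the terms defined above, and the weights follow the values reported in the text; this is an illustrative sketch rather than the authors' implementation.

```python
import torch

# Loss weights reported in Section 3.3.4: adv/fm/per follow Pix2Pix/EDN,
# mc/sc were found by grid search.
LAMBDA = {"adv": 1.0, "fm": 10.0, "per": 10.0, "mc": 0.1, "sc": 0.01}

def full_objective(losses: dict) -> torch.Tensor:
    """Eq. (9): weighted sum over the primary pair plus the support-set term.

    `losses` is assumed to hold already-computed scalar tensors under the keys
    'adv', 'fm', 'per', 'mc', 'sc' (primary pair) and 'sup' (the Eq. (8) term).
    """
    weighted = sum(LAMBDA[k] * losses[k] for k in ("adv", "fm", "per", "mc", "sc"))
    return weighted + losses["sup"]
```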

3.4. Training and Inference Process

For each subject in the training dataset (i.e., Mixamo-Pose and EDN-10k in our experiments), we train one separate model following the same scheme as (Chan et al., 2019) (e.g., we trained four models for the four subjects in the EDN-10k dataset). For the sake of fair comparison, we also train all the baseline methods following the same scheme.

During the training stage, given a training dataset consisting of $N$ images of one subject and a support set, for each training iteration, we randomly choose a pair of images as $\bm{x}_{\mathrm{s}}$ and $\bm{x}_{\mathrm{t}}$ from the training dataset and an additional source image $\bm{x}_{\mathrm{s}}$ from the support set, pass them into DFC-Net, and train it using the full objective in Equation 9.

During the inference stage, given a desired pose image $\bm{x}_{\mathrm{s}}$, we randomly choose an image $\bm{x}_{\mathrm{t}}$ from the training dataset and synthesize the result. For the EDN-10k dataset, the pose image $\bm{x}_{\mathrm{s}}$ is chosen from the testing dataset and depicts a pose unseen during training. Even though the pose image $\bm{x}_{\mathrm{s}}$ and the target person image $\bm{x}_{\mathrm{t}}$ contain the same person (i.e., the ground truth of the pose image $\bm{x}_{\mathrm{s}}$ with another person is unavailable for real-world data), by passing through the pose feature encoder, the static information in the pose image $\bm{x}_{\mathrm{s}}$ is discarded, and only the keypoint information is preserved. For the Mixamo-Pose dataset, the pose image $\bm{x}_{\mathrm{s}}$ is chosen from the testing dataset and may contain a person different from the target person (e.g., the target person image $\bm{x}_{\mathrm{t}}$ is from Liam and the source person image $\bm{x}_{\mathrm{s}}$ is from Remy). For both benchmarks, DFC-Net has to extract the static features from the target person image $\bm{x}_{\mathrm{t}}$ and combine them with the pose features of the pose image $\bm{x}_{\mathrm{s}}$ to synthesize the final images, where the poses are unseen during training. Thus there is no information leakage.
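The training scheme described above can be summarized by the per-iteration sketch below. The model, optimizer, image lists, and loss helpers are hypothetical names used only to illustrate how a primary (source, target) pair and an unpaired support source image are consumed in one step; they are not part of the released implementation.

```python
import random

def training_step(model, optimizer, train_images, support_images,
                  primary_losses, support_losses):
    """One DFC-Net iteration (Section 3.4): a (source, target) pair of the same
    subject from the primary set plus an unpaired source image from the support
    set. `primary_losses` and `support_losses` are hypothetical callables
    returning scalar loss tensors for the respective image groups."""
    x_s, x_t = random.sample(train_images, 2)   # same subject, different poses
    x_sup = random.choice(support_images)       # different person, unseen pose

    x_syn = model(x_s, x_t)                     # G(M(x_s), S(x_t)), Eq. (2)
    x_syn_sup = model(x_sup, x_t)               # target person in the support pose

    loss = primary_losses(x_syn, x_s, x_t) + support_losses(x_syn_sup, x_sup, x_t)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```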

3.5. Implementation Details

We employed the pre-trained VGG-19 (Simonyan and Zisserman, 2014b) network from (Cao et al., 2017) for the Pose Estimator and adopted an approach similar to (Wang et al., 2018a) to build our network. The detailed designs are as follows:

  • Pose Refiner: It is composed of a convolutional block, a channel-wise upsampling module, and five residual blocks (He et al., 2016); a minimal sketch of this module is given after this list. The convolutional block consists of a reflection padding layer, a $7\times 7$ convolutional layer, a batch normalization layer, and ReLU. The channel-wise upsampling module, which increases the number of channels from 64 to 512, contains three convolutional blocks, each with a $3\times 3$ convolutional layer, a batch normalization layer, and ReLU. Each of the five residual blocks consists of two small convolutional blocks, and each small block has a reflection padding layer, a $3\times 3$ convolutional layer, and a batch normalization layer; the first small convolutional block also has a ReLU at the end.

  • Static Feature Encoder: It first has the same convolutional block as in the Pose Refiner. Then it contains three convolutional downsampling blocks, each consisting of a $3\times 3$ convolutional layer, a batch normalization layer, and ReLU. The downsampling blocks are followed by five residual blocks, the same as in the Pose Refiner.

  • Image Generator: It is composed of four residual blocks, an upsampling module, and a convolutional block. Each of the residual blocks is the same as in the Pose Refiner and Static Feature Encoder. The upsampling module consists of three transposed convolutional blocks, each composed of a $3\times 3$ transposed convolutional layer, a batch normalization layer, and ReLU. The last convolutional block contains two reflection padding layers, two $7\times 7$ convolutional layers, and a tanh function.

  • Image Discriminator: It contains two discriminators at different scales, similar to (Wang et al., 2018b). Each discriminator is composed of five convolutional blocks. The first block has a $4\times 4$ convolutional layer and LeakyReLU. Each of the next three blocks has a $4\times 4$ convolutional layer, a batch normalization layer, and LeakyReLU. The last block only has a $4\times 4$ convolutional layer.
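As referenced in the Pose Refiner item above, the following PyTorch sketch assembles that module from the blocks described. The kernel sizes, channel range (64 to 512), and block order follow the text; the input channel count (38 PAF + 18 heatmap channels), the intermediate channel progression, padding values, and the exact residual wiring are our assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two reflection-padded 3x3 conv blocks with a skip connection (assumed residual wiring)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(channels, channels, 3),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(channels, channels, 3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)

class PoseRefiner(nn.Module):
    """Sketch of the Pose Refiner: conv block, channel-wise upsampling (64 -> 512),
    and five residual blocks. Input: concatenated PAFs and amplified heatmaps."""
    def __init__(self, in_channels: int = 56):   # 38 + 18 channels (our assumption)
        super().__init__()
        self.head = nn.Sequential(
            nn.ReflectionPad2d(3), nn.Conv2d(in_channels, 64, 7),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        channels = [64, 128, 256, 512]           # assumed intermediate progression
        ups = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            ups += [nn.Conv2d(c_in, c_out, 3, padding=1),
                    nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.upsample_channels = nn.Sequential(*ups)
        self.res_blocks = nn.Sequential(*[ResidualBlock(512) for _ in range(5)])

    def forward(self, pafs: torch.Tensor, heatmaps: torch.Tensor) -> torch.Tensor:
        x = torch.cat([pafs, heatmaps], dim=1)
        return self.res_blocks(self.upsample_channels(self.head(x)))
```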

4. Experiments

4.1. Experimental Setup

4.1.1. Datasets

We built Mixamo-Sup as a support set for data augmentation to boost the generality of DFC-Net and organized two datasets, Mixamo-Pose and EDN-10k, to verify the effectiveness of the proposed DFC-Net for human pose transfer.

  • EDN-10k: We processed and tailored the dataset released by (Chan et al., 2019) to build the EDN-10k dataset. The original dataset consists of five long target videos, each lasting from 8 to 17 minutes and split into training and test sets. In each video, a different subject performed a series of different motions, and the camera was fixed to keep the background unchanged. We chose the first four subjects, since subject five only performed less complex dance poses, and uniformly sampled 10k frames as the training set and 1k frames as the test set for each subject. Since the original images have a large resolution of $1024\times 512$ and most areas are the fixed background, we cropped all frames to the middle $512\times 512$ square areas and resized them to $256\times 256$.

  • Mixamo-Pose: We randomly chose 4 characters, Andromeda, Liam, Remy, and Stefani, with 30 different pose sequences from Mixamo. To render the 3D animations into 2D images, we loaded each character performing each pose sequence on a white background into Blender (Blender Online Community, 2018), placed two cameras in front of and behind the character, and took the images. We centered the characters in the images according to their keypoints and resized them to $256\times 256$. Mixamo-Pose was split into training and test sets. For each character, the training set contains 1488 images with 15 poses, and the test set contains 1185 images with 15 other poses.

  • Mixamo-Sup: For data augmentation, we built a support set by rendering 15684 images of six new characters from Mixamo (Adobe Systems Inc., 2018) with another 15 unseen poses, in the same way as Mixamo-Pose. Since the source person images need not contain the same person as the target person images, DFC-Net leveraged the support set as the source person images $\bm{x}_{\mathrm{s}}$. When training on both EDN-10k and Mixamo-Pose, we use Mixamo-Sup as the support set. Note that Mixamo-Sup has a totally different distribution from EDN-10k but still yields a substantial improvement, as shown in Section 4.3.

Note that the experiments on EDN-10k only include pose transfer on the same person because the ground truth of different people performing the same pose is unavailable. For Mixamo-Pose, since we can manipulate different characters to perform the same action, the experiments include pose transfer both between different people, e.g., transferring an unseen pose of Liam to Andromeda, and on the same person, e.g., transferring an unseen pose of Andromeda to herself.

4.1.2. Baseline Methods

We compared our DFC-Net with the following competitive baselines:

  • Nearest Neighbors (NN): For each source person image $\bm{x}_{\mathrm{s}}$, we chose the image $\bm{x}^{\prime}$ in the training set $\mathcal{D}_{\mathrm{tr}}$ with the lowest mean square error (MSE) between the pose information $P(\bm{x}_{\mathrm{s}})$ and $P(\bm{x}^{\prime})$ as $\bm{x}_{\mathrm{syn}}$ (a sketch of this search is given after this list).

    (10) $\displaystyle\bm{x}_{\mathrm{syn}}={\arg\min}_{\bm{x}^{\prime}\in\mathcal{D}_{\mathrm{tr}}}{\left\|P(\bm{x}_{\mathrm{s}})-P(\bm{x}^{\prime})\right\|}^{2}_{2}.$

    The pose information was extracted by the same Pose Estimator as in our method.

  • Pose-guided Methods: We chose CycleGAN (Zhu et al., 2017), Pix2Pix (Isola et al., 2017), and Everybody Dance Now (EDN) (Chan et al., 2019) as baselines. They all take skeleton images as input instead of the original images; we employed a pre-trained pose estimator (Cao et al., 2017) to extract keypoints and used OpenCV (Bradski, 2000) to connect pairs of keypoints with different colors to generate the skeleton images. To ensure fair comparisons, the face GAN and face keypoint estimator in EDN were not adopted in our implementation, as they are independent components that can be seamlessly adopted by the other learning-based baselines.

  • Spatial Transformation Methods: We selected Liquid Warping GAN (LWG) (Liu et al., 2019a), Monkey-Net (MKN) (Siarohin et al., 2019a) and First Order Motion Model (FOMM) (Siarohin et al., 2019b). LWG calculates the flow fields with additional 3D human models and integrates the human pose transfer, appearance transfer, and novel view synthesis into one unified framework. MKN and FOMM are both object-agnostic frameworks using learned keypoints to generate image animation in a self-supervised fashion.
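As noted in the NN item above, the nearest-neighbor baseline of Equation (10) amounts to a brute-force search over the training set. The sketch below assumes the pose maps have already been extracted into NumPy arrays; the function and argument names are illustrative only.

```python
import numpy as np

def nearest_neighbor_baseline(source_pose: np.ndarray,
                              train_poses: list,
                              train_images: list) -> np.ndarray:
    """Return the training image whose pose map is closest (in MSE) to the
    source pose map, as in Eq. (10). Pose maps are assumed precomputed."""
    errors = [np.mean((source_pose - p) ** 2) for p in train_poses]
    return train_images[int(np.argmin(errors))]
```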

4.1.3. Evaluation Metrics

We evaluated the quality of the synthesized images with five commonly used metrics:

  • MSE: The mean squared error between the values of pixels of synthesized images and ground-truth images. Lower MSE values are better.

  • PSNR: The peak signal-to-noise ratio, which provides an empirical measure of the quality of synthesized images regarding ground-truth images. Higher PSNR values are better.

  • SSIM: Structural similarity (Wang et al., 2004), another perceptual metric that quantifies the quality of synthesized images given ground-truth images and focuses more on structural information (e.g., luminance). Higher SSIM values are better.

  • IS: Inception Score (Salimans et al., 2016) is a metric for estimating the quality of the synthetic images based on the Inception-V3 model (Szegedy et al., 2016). Higher IS values are better.

  • FID: Frechet Inception Distance (Heusel et al., 2017), also an Inception-V3-based metric, which evaluates the synthetic images by comparing their feature statistics with those of real images. Lower FID values are better.

We calculated the average scores over all pairs of synthesized and ground-truth images on the test set. On Mixamo-Pose, for each character as the target person, we reported the average metrics over 4 different characters as the source person. On EDN-10k, we reported the metrics of the same-person task for every subject.
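For reference, the pixel-level metrics can be computed per image pair as in the sketch below, using NumPy and scikit-image; the 8-bit data range of 255 and the HxWx3 layout are assumptions about how the frames are stored, and recent scikit-image versions expose the `channel_axis` argument used here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def pixel_metrics(synthesized: np.ndarray, ground_truth: np.ndarray) -> dict:
    """MSE, PSNR, and SSIM between one synthesized image and its ground truth,
    both assumed to be uint8 arrays of shape (H, W, 3)."""
    diff = synthesized.astype(np.float64) - ground_truth.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    psnr = peak_signal_noise_ratio(ground_truth, synthesized, data_range=255)
    # channel_axis=2 treats the last axis as color channels (skimage >= 0.19).
    ssim = structural_similarity(ground_truth, synthesized, channel_axis=2, data_range=255)
    return {"MSE": mse, "PSNR": psnr, "SSIM": ssim}
```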

Table 1. Comparisons on EDN-10k in terms of MSE with the best results (lowest values) in bold.
Method Subject1 Subject2 Subject3 Subject4
NN 54.9448 36.2267 55.2041 26.2531
CycleGAN (Zhu et al., 2017) 64.1959 77.4336 70.3681 52.5171
Pix2Pix (Isola et al., 2017) 58.3633 43.0771 62.0203 24.1688
EDN (Chan et al., 2019) 56.3549 36.3887 55.9625 21.5724
LWG (Liu et al., 2019a) 51.6246 43.2884 53.4031 21.4314
MKN (Siarohin et al., 2019a) 48.1603 30.9902 47.6255 21.6634
FOMM (Siarohin et al., 2019b) 46.2852 30.7603 51.1431 21.2709
Ours 45.2043 30.4782 48.5436 20.7248
Table 2. Comparisons on EDN-10k in terms of SSIM with the best results (highest values) in bold.
Method Subject1 Subject2 Subject3 Subject4
NN 0.6138 0.8253 0.7616 0.8437
CycleGAN (Zhu et al., 2017) 0.5256 0.4911 0.5869 0.7821
Pix2Pix (Isola et al., 2017) 0.6238 0.8040 0.7585 0.8767
EDN (Chan et al., 2019) 0.6205 0.8445 0.8233 0.8939
LWG (Liu et al., 2019a) 0.6394 0.8375 0.7434 0.8634
MKN (Siarohin et al., 2019a) 0.7007 0.8503 0.8030 0.8904
FOMM (Siarohin et al., 2019b) 0.6645 0.8445 0.7896 0.8649
Ours 0.7083 0.8670 0.8241 0.9083
Table 3. Comparisons on EDN-10k in terms of PSNR with the best results (highest values) in bold.
Method Subject1 Subject2 Subject3 Subject4
NN 30.7957 32.6439 31.3307 34.2481
CycleGAN (Zhu et al., 2017) 30.0577 29.2421 29.7007 31.3398
Pix2Pix (Isola et al., 2017) 30.6020 32.0304 30.5843 34.6860
EDN (Chan et al., 2019) 30.7109 32.8639 30.9957 35.0369
LWG (Liu et al., 2019a) 31.0085 31.7812 31.0503 35.0118
MKN (Siarohin et al., 2019a) 31.3094 33.2530 30.6493 34.9108
FOMM (Siarohin et al., 2019b) 31.5272 33.3145 31.3099 34.9612
Ours 31.5978 33.3509 31.4159 35.0718
Table 4. Comparisons on EDN-10k in terms of IS with the best results (highest values) in bold.
Method Subject1 Subject2 Subject3 Subject4
NN 3.1284 3.3052 3.1525 3.3271
CycleGAN (Zhu et al., 2017) 2.9294 2.8902 2.9776 3.0343
Pix2Pix (Isola et al., 2017) 3.1903 3.2966 3.1864 3.5012
EDN (Chan et al., 2019) 3.1802 3.4328 3.4083 3.5316
LWG (Liu et al., 2019a) 3.1774 3.4035 3.1365 3.4122
MKN (Siarohin et al., 2019a) 3.3481 3.4680 3.3793 3.5238
FOMM (Siarohin et al., 2019b) 3.2794 3.4075 3.3281 3.4019
Ours 3.3502 3.5227 3.4240 3.5682

4.2. Quantitative Evaluations

4.2.1. Results on EDN-10k

Tables 1, 2, 3, 4, and 5 depict the results in terms of MSE, SSIM, PSNR, IS, and FID on the EDN-10k dataset, and Table 6 provides comparisons on EDN-10k in terms of the average results of these five metrics over the four subjects. The experimental results validate the advantages of DFC-Net on real images:

  • Our method consistently outperformed all the baseline methods on all subjects. When synthesizing real person images, the most significant result is on Subject1, where our method achieved an MSE of 45.2043 while NN, CycleGAN, Pix2Pix, EDN, LWG, and MKN only obtained MSE values of 54.9448, 64.1959, 58.3633, 56.3549, 51.6246, and 48.1603 in Table 1, which indicates that the images generated by our method have clearer details. From Tables 4 and 5, DFC-Net also surpassed the other baselines for all four subjects according to the Inception Score (IS) and Frechet Inception Distance (FID). As shown in Table 6, our method also achieved the highest average SSIM of 0.8269 over all subjects, while no other method except MKN obtained an SSIM score greater than 0.8, which shows that our synthesized images are more realistic and better aligned with human visual perception.

  • Secondly, we noticed that CycleGAN has the worst results, with MSE scores of 64.1959, 77.4336, and 70.3681 and SSIM scores of 0.5256, 0.4911, and 0.5869 on Subject1, Subject2, and Subject3 in Tables 1 and 2, respectively. We argue that CycleGAN is better at transferring the color or style of images between two domains than at changing the geometry of the images, such as recovering the human appearance from the human skeleton, because CycleGAN aims to learn a mapping directly from unpaired images. This property of CycleGAN is also supported by its inferior PSNR scores compared with the other methods.

  • We observe that NN can achieve low MSE scores in Table 1. Since there are a lot of training images, it is easy for NN to find an image whose motion is very close to the desired motion. Moreover, the fixed background also gives NN higher SSIM and PSNR scores in Tables 2 and 3, while the other methods have to learn to generate an accurate background. However, the images retrieved by NN usually do not perform the desired motions and have no temporal coherence when the input is a motion sequence, since the results depend solely on the training set.

  • Moreover, we observed that EDN, LWG, MKN, and FOMM also provided good results compared with the other baselines, especially on the MSE metric, e.g., average values of 42.5696, 42.4368, 37.1098, and 37.3648 over all subjects in Table 6. Taking Subject2 as an example, EDN, LWG, MKN, and FOMM provided SSIM values of 0.8445, 0.8375, 0.8503, and 0.8445 in Table 2, which are higher than the results of NN, CycleGAN, and Pix2Pix. The higher SSIM values show that these methods can synthesize images closer to the ground truth.

Table 5. Comparisons on EDN-10k in terms of FID with the best results (lowest values) in bold.
Method Subject1 Subject2 Subject3 Subject4
NN 24.8053 19.2783 22.4787 23.3219
CycleGAN (Zhu et al., 2017) 37.8460 38.8926 35.3055 32.9549
Pix2Pix (Isola et al., 2017) 23.5316 21.3092 24.7829 17.5752
EDN (Chan et al., 2019) 23.8172 18.3454 19.0175 14.4926
LWG (Liu et al., 2019a) 22.3348 18.1062 25.7232 19.7245
MKN (Siarohin et al., 2019a) 19.4209 16.5735 19.5568 14.4617
FOMM (Siarohin et al., 2019b) 20.3571 17.6391 21.3258 18.7283
Ours 18.3029 15.2877 17.8782 13.7508
Table 6. Comparisons on EDN-10k in terms of the average results of the 5 metrics over 4 subjects.
Method MSE(\downarrow) SSIM(\uparrow) PSNR(\uparrow) IS(\uparrow) FID(\downarrow)
NN 43.1572 0.7611 32.2546 3.2283 22.4711
CycleGAN (Zhu et al., 2017) 66.1287 0.5964 30.0851 2.9579 36.2498
Pix2Pix (Isola et al., 2017) 46.9074 0.7658 31.9757 3.2936 21.7997
EDN (Chan et al., 2019) 42.5696 0.7956 32.4019 3.3882 18.9182
LWG (Liu et al., 2019a) 42.4368 0.7709 32.2129 3.2824 21.4496
MKN (Siarohin et al., 2019a) 37.1098 0.8111 32.5306 3.4298 17.5032
FOMM (Siarohin et al., 2019b) 37.3648 0.7908 32.7782 3.3542 19.5126
Ours 36.2377 0.8269 32.8591 3.4662 16.3049
Table 7. Comparisons on Mixamo-Pose in terms of MSE with the best results (lowest values) in bold.
Method Andromeda Liam Remy Stefani
NN 24.9047 28.4846 27.1801 24.8609
CycleGAN (Zhu et al., 2017) 27.5520 29.9436 28.2237 22.7682
Pix2Pix (Isola et al., 2017) 23.9370 23.5610 24.0687 21.5841
EDN (Chan et al., 2019) 24.3244 23.1203 23.9229 35.0930
LWG (Liu et al., 2019a) 24.2905 22.8587 22.9707 22.0910
MKN (Siarohin et al., 2019a) 30.5934 39.3444 29.4817 24.1297
FOMM (Siarohin et al., 2019b) 27.7809 29.0469 27.4474 25.3934
Ours 23.8539 21.7328 22.0763 21.2587
Table 8. Comparisons on Mixamo-Pose in terms of SSIM with best results (highest values) in bold.
Method Andromeda Liam Remy Stefani
NN 0.7357 0.7265 0.7487 0.7411
CycleGAN (Zhu et al., 2017) 0.7205 0.7154 0.7377 0.7753
Pix2Pix (Isola et al., 2017) 0.7784 0.7955 0.7932 0.8069
EDN (Chan et al., 2019) 0.7817 0.7931 0.7926 0.8058
LWG (Liu et al., 2019a) 0.7613 0.7912 0.7887 0.7858
MKN (Siarohin et al., 2019a) 0.7076 0.6874 0.7531 0.7753
FOMM (Siarohin et al., 2019b) 0.7165 0.7404 0.7538 0.74642
Ours 0.7726 0.8040 0.8057 0.8071

4.2.2. Results on Mixamo-Pose

Tables 7, 8, 9, 10, and 11 respectively show the quantitative results of pose transfer in terms of MSE, SSIM, PSNR, IS, and FID on the Mixamo-Pose dataset, and Table 12 provides comparisons on Mixamo-Pose in terms of the average results of these five metrics over the four characters. The empirical results clearly demonstrate the effectiveness of DFC-Net on animation images:

  • As shown in Tables 7 through 12, our DFC-Net again outperformed all competing baselines on all five metrics on average, as on EDN-10k. Notably, in terms of average MSE, DFC-Net outperformed the second-best LWG by 0.8223 in Table 12. To be more concrete, the lowest MSEs of DFC-Net indicate that it provides the most accurate motion transfer images, with the shortest $L_{2}$ distance to the ground-truth images. For instance, our method achieves an MSE of 22.0763 on Remy, compared to 27.1801, 28.2237, 24.0687, and 23.9229 from NN, CycleGAN, Pix2Pix, and EDN in Table 7. Similar results for IS and FID can also be observed in Tables 10 and 11. Furthermore, together with the highest PSNR and SSIM scores, our DFC-Net generates synthesized images with the most accurate motion transfer and the best image quality simultaneously.

  • Secondly, NN provided relatively good results on Andromeda but a higher MSE on Liam and Remy in Table 7, since its performance depends entirely on the training dataset and it cannot provide stable synthesized results. Moreover, in Tables 7 and 8, CycleGAN also does not perform well, with MSE scores of 27.5520, 29.9436, and 28.2237 and SSIM scores of 0.7205, 0.7154, and 0.7377 on Andromeda, Liam, and Remy, respectively. These results once again validate that it is difficult to directly learn a mapping from skeleton images to human images using unpaired data.

  • Thirdly, from Tables 7 through 12, Pix2Pix, EDN, and LWG delivered superior performance compared with NN and CycleGAN, because they aim to learn a paired mapping from the skeleton images to human images directly. Besides, the skeleton images they take as input carry accurate motion information without any noise, which makes the learning process easier. In contrast, even without converting source person images to skeleton images, our DFC-Net fills the gap between the original source person images and the skeleton images to some extent by introducing the keypoint amplifier, the two consistency losses, and a support dataset for training. Without any pre-processing steps, so that the source person images contain much noise and redundant information, our method still achieves further improvements over Pix2Pix and EDN according to all five metrics.

  • Compared with the results on EDN-10k, MKN and FOMM showed degraded performance when transferring poses between different people on Mixamo-Pose. It is difficult for them to extract keypoint features without a pre-trained pose estimator when the source person is not in the training dataset. For example, in Table 8, the SSIM score of MKN on Andromeda is 0.7076, compared with 0.7357, 0.7784, 0.7817, and 0.7726 from NN, Pix2Pix, EDN, and our method.

Table 9. Comparisons on Mixamo-Pose in terms of PSNR with the best results (highest values) in bold.
Method Andromeda Liam Remy Stefani
NN 34.4420 34.0575 34.3865 34.3977
CycleGAN (Zhu et al., 2017) 33.7903 33.4230 33.6786 34.6130
Pix2Pix (Isola et al., 2017) 34.4049 34.5167 34.3875 34.8497
EDN (Chan et al., 2019) 34.3524 34.5882 34.4214 34.7247
LWG (Liu et al., 2019a) 34.3426 34.6762 34.6183 34.7519
MKN (Siarohin et al., 2019a) 33.6626 32.8841 33.7487 34.4779
FOMM (Siarohin et al., 2019b) 33.7777 33.5912 33.8337 34.1919
Ours 34.4336 34.8798 34.7823 34.9303
Table 10. Comparisons on Mixamo-Pose in terms of IS with the best results (highest values) in bold.
Method Andromeda Liam Remy Stefani
NN 3.3671 3.4907 3.3170 3.3815
CycleGAN (Zhu et al., 2017) 2.9237 2.9105 3.0148 3.0681
Pix2Pix (Isola et al., 2017) 3.3892 3.5026 3.3356 3.5073
EDN (Chan et al., 2019) 3.4308 3.5418 3.4509 3.5624
LWG (Liu et al., 2019a) 3.3704 3.4052 3.3177 3.4339
MKN (Siarohin et al., 2019a) 3.3856 3.5697 3.4212 3.5105
FOMM (Siarohin et al., 2019b) 3.3921 3.4893 3.4082 3.4721
Ours 3.4285 3.6297 3.5618 3.6075
Table 11. Comparisons on Mixamo-Pose in terms of FID with the best results (lowest values) in bold.
Method Andromeda Liam Remy Stefani
NN 21.9411 23.4728 22.5086 24.7382
CycleGAN (Zhu et al., 2017) 23.1479 25.1871 21.0418 21.2344
Pix2Pix (Isola et al., 2017) 14.5051 12.7375 12.8751 11.1583
EDN (Chan et al., 2019) 14.6281 11.1481 11.6892 11.2089
LWG (Liu et al., 2019a) 16.7303 11.4930 13.5728 14.3207
MKN (Siarohin et al., 2019a) 21.4839 21.4015 17.3208 16.7219
FOMM (Siarohin et al., 2019b) 19.6782 19.3755 21.2926 20.0382
Ours 14.3172 10.1062 11.2756 10.7603
Table 12. Comparisons on Mixamo-Pose in terms of the average results of the 5 metrics over 4 characters.
Method MSE(\downarrow) SSIM(\uparrow) PSNR(\uparrow) IS(\uparrow) FID(\downarrow)
NN 26.3576 0.7380 34.3209 3.3891 23.1652
CycleGAN (Zhu et al., 2017) 27.1219 0.7372 33.8762 2.9793 22.6528
Pix2Pix (Isola et al., 2017) 23.2877 0.7935 34.5397 3.4337 12.8190
EDN (Chan et al., 2019) 26.6151 0.7933 34.5217 3.4965 12.1686
LWG (Liu et al., 2019a) 23.0527 0.7817 34.5972 3.3818 14.0292
MKN (Siarohin et al., 2019a) 30.8873 0.7308 33.6933 3.4718 19.2320
FOMM (Siarohin et al., 2019b) 28.6544 0.7392 33.8486 3.4404 20.0961
Ours 22.2304 0.7973 34.7565 3.5569 11.6148

4.3. Ablation Study

To better understand the merits of the designs of DFC-Net, we conducted detailed ablation studies on Subject1 from EDN-10k and Liam from Mixamo-Pose. The evaluation results are shown in Tables 13 and 14.

4.3.1. Keypoint Amplifier

In Tables 13 and 14, the baseline (the first row) is our model without the keypoint amplifier, the consistency losses, and the support dataset. This baseline provided the worst results on the three metrics, e.g., an MSE of 26.4595 in Table 14. The results in the second row versus those in the first row show that the keypoint amplifier strengthens the pose transfer performance, e.g., raising the SSIM of the baseline from 0.6595 to 0.6676 in Table 13. Even with the consistency losses and the support set (rows 6 and 7), it continuously reduced the interference of the noise on the keypoint heatmaps. These results validate that the keypoint amplifier can pick out the real keypoint locations and suppress the noise in the keypoint heatmaps.

Table 13. Ablation studies on Subject1 from EDN-10k. KA denotes the Keypoint Amplifier.
KA $\mathcal{L}_{\mathrm{sc}}$ $\mathcal{L}_{\mathrm{mc}}$ $\mathcal{L}_{\mathrm{sup}}$ MSE(\downarrow) PSNR(\uparrow) SSIM(\uparrow)
1 52.5749 30.9361 0.6595
2 50.7810 31.0860 0.6676
3 49.3831 31.2043 0.6742
4 49.8352 31.1662 0.6710
5 48.6431 31.2714 0.6796
6 46.3699 31.4801 0.6937
7 45.2043 31.5978 0.7083
Table 14. Ablation studies on Liam from Mixamo-Pose. KA denotes the Keypoint Amplifier.
KA $\mathcal{L}_{\mathrm{sc}}$ $\mathcal{L}_{\mathrm{mc}}$ $\mathcal{L}_{\mathrm{sup}}$ MSE(\downarrow) PSNR(\uparrow) SSIM(\uparrow)
1 26.4595 33.9856 0.7605
2 24.7312 34.2838 0.7726
3 23.7580 34.4666 0.7830
4 22.6160 34.6913 0.8011
5 21.9509 34.8292 0.8032
6 22.0240 34.8047 0.8031
7 21.7328 34.8798 0.8040

4.3.2. Consistency Losses

We explored the effects of the consistency losses $\mathcal{L}_{\mathrm{sc}}$ and $\mathcal{L}_{\mathrm{mc}}$ on the pose transfer task. Comparing the second row with the third row, the static feature consistency loss $\mathcal{L}_{\mathrm{sc}}$ boosts the pose transfer task, which verifies its validity. Moreover, the performance variations between the second row and the fourth row clearly evidence the advantage of the pose feature consistency loss $\mathcal{L}_{\mathrm{mc}}$. Together with these two feature consistency losses, the model achieved better results with an MSE of 21.9509, a PSNR of 34.8292, and an SSIM of 0.8032, as shown in the fifth row of Table 14. The comparison among these cases shows that each consistency loss, whether the static one or the motion one, can enforce the consistency between the real and synthesized images, thus improving the quality of the synthesized images.

It is noted that the static feature consistency was more useful than the motion one on EDN-10k, while we observed the opposite on Mixamo-Pose. For example, in the third and fourth rows of Table 14, the static feature consistency loss achieved a 0.0104 improvement in SSIM, while the motion feature consistency loss provided a 0.0285 improvement. We conjecture that since the backgrounds of images in Mixamo-Pose are simply white (i.e., easy to learn), the static feature consistency could only boost the performance for personal appearance on Mixamo-Pose. On the contrary, the backgrounds are more complex in EDN-10k, so the static feature consistency was more effective on EDN-10k than on Mixamo-Pose.

4.3.3. Support Set with Augmented Consistency Loss

We added the support set and applied the augmented consistency loss \mathcal{L}_{\mathrm{sup}} during training. The comparisons between the final row and the other rows clearly indicate that the support set improves the generalization ability and robustness of our model, especially when the poses of the source person are close to the poses in the support set. The support set provides additional negative examples beyond the original training set and helps the discriminator form a more accurate decision boundary, which in turn strengthens the generator. We also noticed that the support set was more useful on EDN-10k: since EDN-10k was sampled from video clips of the subjects, it contains less motion variance than Mixamo-Pose. Moreover, the limb ratios and the distances between the person and the camera in the support set differ from those in EDN-10k, which provides more valuable negative samples for EDN-10k.
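The following simplified sketch (with hypothetical module and function names) illustrates how an unpaired support image can be used: it contributes only an unseen pose, so supervision comes from the feature-level consistency constraints rather than from pixel-level ground truth, and the synthesized result additionally serves as a negative example for the discriminator.

import torch
import torch.nn as nn

def augmented_support_step(generator: nn.Module, static_enc: nn.Module,
                           pose_enc: nn.Module, target_img: torch.Tensor,
                           support_img: torch.Tensor, l1=nn.L1Loss()):
    """Illustrative sketch of the augmented consistency constraint L_sup.

    The support image provides an unseen pose; since no paired ground truth
    exists for this combination, supervision comes only from feature-level
    consistency.
    """
    support_pose = pose_enc(support_img).detach()
    fake = generator(target_img, support_pose)
    l_pose = l1(pose_enc(fake), support_pose)                         # keep the unseen pose
    l_static = l1(static_enc(fake), static_enc(target_img).detach())  # keep the target appearance
    # `fake` is also added to the discriminator's pool of negative examples
    # during its update, providing the extra decision-boundary signal
    # discussed above.
    return l_pose + l_static, fake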

4.3.4. Different Pose Estimator Backbone

To further examine the contribution of different pre-trained pose estimator backbones, we also conducted extensive experiments on Subject1 from EDN-10k and Liam from Mixamo-Pose with ResNet architectures, including ResNet-18, ResNet-50, ResNet-101, and ResNet-152. For ResNet-18, no pre-trained pose estimator was available online, so we re-implemented (Cao et al., 2017) by replacing the VGG-19 backbone with ResNet-18 and following the same training procedure as (Cao et al., 2017). For ResNet-50, ResNet-101, and ResNet-152, we directly used the pre-trained models from (Xiao et al., 2018). The results are shown in Tables 15 and 16. A larger pose estimator backbone generally yielded better performance on all five metrics. For instance, on Subject1 from EDN-10k, ResNet-152 achieved the best SSIM of 0.7261 among all backbones. This is because a stronger pre-trained backbone provides more representative pose information and thus helps generate higher-fidelity images. It also suggests that our method could be further improved by incorporating more advanced pose estimation technologies. In this work, however, we mainly focus on adding consistency in the feature space rather than on using a better pose estimator backbone.
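As a concrete illustration of swapping the backbone, the hypothetical helper below strips an ImageNet-pretrained ResNet from torchvision down to its convolutional trunk, on top of which a deconvolutional heatmap head in the style of (Xiao et al., 2018) would be attached; it is a sketch rather than the exact construction used in our experiments.

import torch.nn as nn
from torchvision import models

def build_pose_backbone(name: str = "resnet50") -> nn.Module:
    """Return the convolutional trunk of an ImageNet-pretrained ResNet,
    to be used as the feature extractor of the pose estimator."""
    builders = {
        "resnet18": models.resnet18,
        "resnet50": models.resnet50,
        "resnet101": models.resnet101,
        "resnet152": models.resnet152,
    }
    net = builders[name](pretrained=True)
    # Drop global average pooling and the classification layer; keep the
    # stride-32 convolutional features that the heatmap head consumes.
    return nn.Sequential(*list(net.children())[:-2])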

4.4. Qualitative Results

Table 15. Ablation studies of different pose estimator backbones on Subject1 from EDN-10k.
Network MSE(\downarrow) PSNR(\uparrow) SSIM(\uparrow) IS(\uparrow) FID(\downarrow)
ResNet-18 49.3291 30.5388 0.6802 3.2811 20.7048
ResNet-50 (Xiao et al., 2018) 44.3875 31.7714 0.7039 3.3255 18.1374
ResNet-101 (Xiao et al., 2018) 42.8027 31.9208 0.7255 3.4039 16.8275
ResNet-152 (Xiao et al., 2018) 41.5562 32.3248 0.7261 3.4415 16.4553
VGG-19 (Cao et al., 2017) 45.2043 31.5978 0.7083 3.3502 18.3029
Table 16. Ablation studies of different pose estimator backbones on Liam from Mixamo-Pose.
Network MSE(\downarrow) PSNR(\uparrow) SSIM(\uparrow) IS(\uparrow) FID(\downarrow)
ResNet-18 23.3403 34.3123 0.8057 3.5564 10.2478
ResNet-50 (Xiao et al., 2018) 21.3634 35.2125 0.8073 3.6187 9.9064
ResNet-101 (Xiao et al., 2018) 20.8277 35.8716 0.8342 3.6571 9.5731
ResNet-152 (Xiao et al., 2018) 20.5169 35.8920 0.8188 3.6933 9.4565
VGG-19 (Cao et al., 2017) 21.7328 34.8798 0.8040 3.6297 10.1062
Figure 3. Visualizations of pose transfer with different characters and various poses on EDN-10k. The columns from left to right show target persons, source persons, results of different methods, and ground truth, respectively.
Figure 4. Visualizations of Subject1 from EDN-10k with more details. The image in the leftmost column is the source image, which also serves as the ground truth. We did not show the target image (i.e., the same person with a different pose) due to space limitations and cropped the results to the areas surrounded by the red dashed lines.
Figure 5. Visualizations of motion transfer with different characters and various motions on Mixamo-Pose. The columns from left to right show target persons, source persons, results of different methods, and ground truth, respectively.
Figure 6. Visualizations of Subject1 from EDN-10k. The columns from left to right show target persons, source persons, results of 6 different settings in the ablation studies, and ground truth, respectively. The indices of the columns are the same as the setting indices in Table 13.

We further highlighted the superiority of our proposed approach by contrasting the visualizations of the synthesized results of all models on EDN-10k and Mixamo-Pose, respectively. In addition, we presented visualizations of the ablation comparisons for Subject1 from EDN-10k to demonstrate the effectiveness of the components of DFC-Net.

4.4.1. Visual comparisons on EDN-10k

We also provided visualization results on EDN-10k in Figure 3, with cropped close-ups showing more details in Figure 4. DFC-Net synthesized real-person images with better quality, which clearly indicates the effectiveness of our method. For instance:

  • As shown in the second row of Figure 3, the difference between the source pose and the target pose is very large, and the posture involves changes in almost all parts of the human body. In this case, DFC-Net generated the correct human pose for Subject2, while other methods such as LWG, MKN, and FOMM only synthesized distorted poses. Another example is in Figure 4: the images synthesized by our method contain more details, including a clear face and wrinkles on the clothes. The pose transfer result of Subject1 generated by DFC-Net has a clearer face structure, and the face orientation is consistent with the target image, while MKN and FOMM failed to synthesize the correct face orientation and hand poses.

  • Sometimes NN generated images that were closer to the ground truths, e.g., the image of Subject3 in the third row of Figure 3, since the training set grows from 1488 images in Mixamo-Pose to 10000 in EDN-10k. However, one can still easily tell that the poses of its generated images differ from the ground truths.

  • The results produced by Pix2Pix and EDN were blurrier, and the Pix2Pix results showed aliasing along the edges of the subject. EDN produced better images than Pix2Pix: in the third row of Figure 3, the image of Subject3 synthesized by Pix2Pix had considerable noise in the subject's hair, whereas the EDN result was smoother and more realistic. Nevertheless, the EDN results lacked many details compared with DFC-Net, and the face in particular was still distorted.

4.4.2. Visual comparisons on Mixamo-Pose

As depicted in Figure 5, we employed multiple target persons and source persons with various poses that differ from those in the training set of Mixamo-Pose. The first column shows the target person image, and the second column shows the source person image. Note that only our method, DFC-Net, takes the target person and source person images at the same time; CycleGAN, Pix2Pix, and EDN take only the skeleton image extracted from the source person image as input. The last column shows real images of the target person performing the desired motion, which serve as the ground truth. We discuss the details as follows:

  • Firstly, compared with the aforementioned baselines, DFC-Net produced better transfer results and alleviated their drawbacks in synthesizing body parts, clothes, and faces. For instance, only DFC-Net generated the left hand of Liam in the first and second rows, while the other baselines produced only the arm. When the target person and the source person are not the same person, e.g., in rows 2-8, DFC-Net can still successfully disentangle the pose information of the source person from the static information and synthesize high-quality results. In contrast, MKN and FOMM cannot fully strip out the pose information, so their results look odd, e.g., in the fourth row, the person in the FOMM result is facing away.

  • Meanwhile, although Pix2Pix and EDN could synthesize the target persons with the desired poses, some body parts (e.g., arms, hands) generated by Pix2Pix were severely corrupted. The images produced by EDN had dull colors and many noisy pixels on the clothes and faces, and EDN failed to capture the details of the shoes in the third row, since the shoes are white and easily confused with the background.

  • Moreover, even though the image results of NN were clearer and not blurry, the NN method always made incorrect pose transfers since it can only synthesize poses that exist in the training set. Besides, NN cannot synthesize images with temporal coherence, whereas the input is usually a sequence of desired motions rather than a single image.

  • CycleGAN failed to produce poses similar to the ground truths: almost all of its poses were distorted, and the quality of the corresponding images was consistently degraded.

4.4.3. Visual ablation studies on Subject1

We presented the visualization of the ablation study results on Subject1 in Figure 6. We observed that each component improves the results to a different degree, e.g., yielding a clearer head and better hand gestures. Specifically:

  • The results produced by setting 1, which does not include the Keypoint Amplifier, have more noise than the other settings, especially around the head area. Compared with baseline setting 1, setting 2 with the Keypoint Amplifier significantly improved the quality of the generated images. For instance, the results of setting 1 in the first and second rows failed to synthesize a complete head and hands, while the corresponding results of setting 2 fixed these errors.

  • As shown in the column of setting 7, DFC-Net with all components synthesized images with more accurate details and better quality. In the second row, the person generated by setting 5 lacks an arm, while setting 7, with the augmented consistency loss from the support set, completes the arm.

5. Conclusion

In this paper, we proposed DFC-Net, a novel network with disentangled feature consistencies for human pose transfer. We introduced two disentangled feature consistency losses to enforce consistency of the pose and static information between the synthesized and real images. Besides, we leveraged the keypoint amplifier to denoise the keypoint heatmaps and ease the extraction of pose features. Moreover, we showed that the support set Mixamo-Sup, which contains different subjects with unseen poses, can boost pose transfer performance and enhance the robustness of the model. To enable accurate evaluation of pose transfer between different people, we collected an animation character dataset, Mixamo-Pose. Results on both the animation and real-image datasets, Mixamo-Pose and EDN-10k, consistently demonstrated the effectiveness of the proposed model.

References

  • Aberman et al. (2019) Kfir Aberman, Rundi Wu, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. 2019. Learning Character-Agnostic Motion for Motion Retargeting in 2D. ACM Transactions on Graphics (TOG) (2019).
  • Adobe Systems Inc. (2018) Adobe Systems Inc. 2018. https://www.mixamo.com. Accessed: 2018-12-27.
  • Alp Güler et al. (2018) Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. 2018. Densepose: Dense human pose estimation in the wild. In CVPR.
  • Balakrishnan et al. (2018) Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. 2018. Synthesizing images of humans in unseen poses. In CVPR.
  • Blender Online Community (2018) Blender Online Community. 2018. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam. http://www.blender.org
  • Bousmalis et al. (2017) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR.
  • Bradski (2000) G. Bradski. 2000. The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000).
  • Cao et al. (2019) Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
  • Cao et al. (2017) Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR.
  • Chan et al. (2019) Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. 2019. Everybody dance now. In ICCV.
  • Chen et al. (2021) Baoyu Chen, Yi Zhang, Hongchen Tan, Baocai Yin, and Xiuping Liu. 2021. PMAN: Progressive Multi-Attention Network for Human Pose Transfer. IEEE Transactions on Circuits and Systems for Video Technology (2021).
  • Ding and Tao (2016) Changxing Ding and Dacheng Tao. 2016. A Comprehensive Survey on Pose-Invariant Face Recognition. ACM Trans. Intell. Syst. Technol. (2016).
  • Dong et al. (2018) Haoye Dong, Xiaodan Liang, Ke Gong, Hanjiang Lai, Jia Zhu, and Jian Yin. 2018. Soft-gated warping-gan for pose-guided person image synthesis. In NIPS.
  • Esser et al. (2018) Patrick Esser, Ekaterina Sutter, and Björn Ommer. 2018. A variational u-net for conditional appearance and shape generation. In CVPR.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.
  • Grigorev et al. (2019) Artur Grigorev, Artem Sevastopolsky, Alexander Vakhitov, and Victor Lempitsky. 2019. Coordinate-based texture inpainting for pose-guided human image generation. In CVPR.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NIPS (2017).
  • Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. science (2006).
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In NIPS.
  • Hoshen and Wolf (2018) Yedid Hoshen and Lior Wolf. 2018. Identifying Analogies Across Domains. In ICLR.
  • Huang et al. (2018) Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. 2018. Multimodal unsupervised image-to-image translation. In ECCV.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In CVPR.
  • Jaderberg et al. (2015) Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. In NIPS.
  • Jiang et al. (2019) Wei Jiang, Weiwei Sun, Andrea Tagliasacchi, Eduard Trulls, and Kwang Moo Yi. 2019. Linearized multi-sampling for differentiable image transformation. In ICCV.
  • Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In ECCV.
  • Li et al. (2020) Tao Li, Zhiyuan Liang, Sanyuan Zhao, Jiahao Gong, and Jianbing Shen. 2020. Self-learning with rectification strategy for human parsing. In CVPR.
  • Li et al. (2019) Yining Li, Chen Huang, and Chen Change Loy. 2019. Dense intrinsic appearance flow for human pose transfer. In CVPR.
  • Lin and Lucey (2017) Chen-Hsuan Lin and Simon Lucey. 2017. Inverse compositional spatial transformer networks. In CVPR.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In ECCV.
  • Liu et al. (2019b) Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Hyeongwoo Kim, Florian Bernard, Marc Habermann, Wenping Wang, and Christian Theobalt. 2019b. Neural rendering and reenactment of human actor videos. ACM Transactions on Graphics (2019).
  • Liu et al. (2017) Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017. Unsupervised image-to-image translation networks. In NIPS.
  • Liu et al. (2020) Wenhe Liu, Xiaojun Chang, Ling Chen, Dinh Phung, Xiaoqin Zhang, Yi Yang, and Alexander G. Hauptmann. 2020. Pair-Based Uncertainty and Diversity Promoting Early Active Learning for Person Re-Identification. ACM Trans. Intell. Syst. Technol. (2020).
  • Liu et al. (2019a) Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao. 2019a. Liquid warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In ICCV.
  • Liu et al. (2016) Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR.
  • Loper et al. (2015) Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. 2015. SMPL: A skinned multi-person linear model. ACM transactions on graphics (TOG) (2015).
  • Ma et al. (2017) Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. 2017. Pose guided person image generation. In NIPS.
  • Ma et al. (2018) Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, and Mario Fritz. 2018. Disentangled person image generation. In CVPR.
  • Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
  • Moeslund et al. (2006) Thomas B Moeslund, Adrian Hilton, and Volker Krüger. 2006. A survey of advances in vision-based human motion capture and analysis. Computer vision and image understanding (2006).
  • Neverova et al. (2018) Natalia Neverova, Riza Alp Guler, and Iasonas Kokkinos. 2018. Dense pose transfer. In ECCV.
  • Qi et al. (2018) Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. 2018. Learning human-object interactions by graph parsing neural networks. In ECCV.
  • Ren et al. (2020) Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H Li, and Ge Li. 2020. Deep image spatial transformation for person image generation. In CVPR.
  • Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. NIPS (2016).
  • Shao et al. (2022) Ruizhi Shao, Zerong Zheng, Hongwen Zhang, Jingxiang Sun, and Yebin Liu. 2022. Diffustereo: High quality human reconstruction via diffusion-based stereo using sparse cameras. In ECCV.
  • Shen et al. (2021) Jianbing Shen, Yuanpei Liu, Xingping Dong, Xiankai Lu, Fahad Shahbaz Khan, and Steven Hoi. 2021. Distilled siamese networks for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
  • Si et al. (2018) Chenyang Si, Wei Wang, Liang Wang, and Tieniu Tan. 2018. Multistage adversarial losses for pose-based human image synthesis. In CVPR.
  • Siarohin et al. (2019a) Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019a. Animating arbitrary objects via deep motion transfer. In CVPR.
  • Siarohin et al. (2019b) Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019b. First Order Motion Model for Image Animation. In NIPS.
  • Siarohin et al. (2018) Aliaksandr Siarohin, Enver Sangineto, Stéphane Lathuilière, and Nicu Sebe. 2018. Deformable gans for pose-based human image generation. In CVPR.
  • Simonyan and Zisserman (2014a) Karen Simonyan and Andrew Zisserman. 2014a. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Simonyan and Zisserman (2014b) Karen Simonyan and Andrew Zisserman. 2014b. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In CVPR.
  • Tang et al. (2019) Hao Tang, Dan Xu, Gaowen Liu, Wei Wang, Nicu Sebe, and Yan Yan. 2019. Cycle in cycle generative adversarial networks for keypoint-guided image generation. In Proceedings of the 27th ACM International Conference on Multimedia.
  • Tian et al. (2021) Jiajie Tian, Qihao Tang, Rui Li, Zhu Teng, Baopeng Zhang, and Jianping Fan. 2021. A Camera Identity-Guided Distribution Consistency Method for Unsupervised Multi-Target Domain Person Re-Identification. ACM Trans. Intell. Syst. Technol. (2021).
  • Tung et al. (2017) Hsiao-Yu Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. 2017. Self-supervised learning of motion capture. In NIPS.
  • Wang et al. (2018a) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018a. Video-to-Video Synthesis. In NIPS.
  • Wang et al. (2018b) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018b. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR.
  • Wang et al. (2020) Wenguan Wang, Jianbing Shen, Xiankai Lu, Steven CH Hoi, and Haibin Ling. 2020. Paying attention to video object pattern understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
  • Wang et al. (2021) Wenguan Wang, Tianfei Zhou, Siyuan Qi, Jianbing Shen, and Song-Chun Zhu. 2021. Hierarchical human semantic parsing with comprehensive part-relation modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
  • Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing (2004).
  • Xia et al. (2017) Shihong Xia, Lin Gao, Yu-Kun Lai, Ming-Ze Yuan, and Jinxiang Chai. 2017. A survey on human performance capture and animation. Journal of Computer Science and Technology (2017).
  • Xiao et al. (2018) Bin Xiao, Haiping Wu, and Yichen Wei. 2018. Simple Baselines for Human Pose Estimation and Tracking. In ECCV.
  • Yi et al. (2017) Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV.
  • Zhang and He (2017) Haoyang Zhang and Xuming He. 2017. Deep free-form deformation network for object-mask registration. In ICCV.
  • Zhang et al. (2022) Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. 2022. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022).
  • Zhao et al. (2021) Zongji Zhao, Sanyuan Zhao, and Jianbing Shen. 2021. Real-time and light-weighted unsupervised video object segmentation network. Pattern Recognition (2021).
  • Zheng et al. (2019) Haitian Zheng, Lele Chen, Chenliang Xu, and Jiebo Luo. 2019. Unsupervised pose flow learning for pose guided synthesis. arXiv preprint arXiv:1909.13819 (2019).
  • Zheng et al. (2015) Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In ICCV.
  • Zhou et al. (2022) Tao Zhou, Huazhu Fu, Chen Gong, Ling Shao, Fatih Porikli, Haibin Ling, and Jianbing Shen. 2022. Consistency and diversity induced human motion segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Zhou et al. (2021) Tianfei Zhou, Siyuan Qi, Wenguan Wang, Jianbing Shen, and Song-Chun Zhu. 2021. Cascaded parsing of human-object interaction recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
  • Zhou et al. (2020) Tianfei Zhou, Wenguan Wang, Siyuan Qi, Haibin Ling, and Jianbing Shen. 2020. Cascaded human-object interaction recognition. In CVPR.
  • Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
  • Zhu et al. (2019) Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. 2019. Progressive pose attention transfer for person image generation. In CVPR.