
Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis

Wen Liu, Zhixin Piao, Zhi Tu, Wenhan Luo, Lin Ma, and Shenghua Gao Wen Liu is with School of Information Science and Technology, ShanghaiTech University, and Chinese Academy of Sciences, Shanghai Institute of Microsystem and Information Technology, and University of Chinese Academy of Sciences, China. Zhixin Piao, Zhi Tu, and Shenghua Gao are with School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China. Shenghua Gao is the corresponding author.
Abstract

We tackle human image synthesis, including human motion imitation, appearance transfer, and novel view synthesis, within a unified framework. Once trained, the model can handle all of these tasks. The existing task-specific methods mainly use 2D keypoints (pose) to estimate the human body structure. However, they only express the position information and cannot characterize the personalized shape of the person or model the limb rotations. In this paper, we propose to use a 3D body mesh recovery module to disentangle the pose and shape. It can not only model the joint locations and rotations but also characterize the personalized body shape. To preserve the source information, such as texture, style, color, and face identity, we propose an Attentional Liquid Warping GAN with an Attentional Liquid Warping Block (AttLWB) that propagates the source information in both the image and the feature space to the synthesized reference. Specifically, the source features are extracted by a denoising convolutional auto-encoder to characterize the source identity well. Furthermore, our proposed method supports a more flexible warping from multiple sources. To further improve the generalization to unseen source images, a one/few-shot adversarial learning scheme is applied: it first trains a model on an extensive training set, and then finetunes the model on one or a few unseen images in a self-supervised way to generate high-resolution (512×512 and 1024×1024) results. Also, we build a new dataset, namely the Impersonator (iPER) dataset, for the evaluation of human motion imitation, appearance transfer, and novel view synthesis. Extensive experiments demonstrate the effectiveness of our methods in terms of preserving face identity, shape consistency, and clothes details. All code and the dataset are available at https://impersonator.org/work/impersonator-plus-plus.html.

Index Terms:
Human Image Synthesis, Motion Imitation, Appearance Transfer, Novel View Synthesis, Generative Adversarial Network, and One/Few-Shot Learning

1 Introduction

Human image synthesis aims to produce believable and photo-realistic images of humans, including motion imitation [1, 2, 3], appearance transfer [4, 5] and novel view synthesis [6, 7]. It has vast potential applications in character animation, re-enactment, virtual clothes try-on, movie or game making, etc. Given a source human image and a reference human image, i) the goal of motion imitation is to generate an image with the texture of the source person and the pose of the reference person, as depicted in the top row of Fig. 1; ii) human novel view synthesis aims to synthesize new images of the human body captured from different viewpoints, as illustrated in the middle row of Fig. 1; iii) the goal of appearance transfer is to generate a human image that preserves the source face identity while wearing the clothes of the reference, as shown in the bottom row of Fig. 1, where each garment (upper clothes or pants) might come from a different person.

Figure 1: Illustration of human motion imitation, novel view synthesis and appearance transfer. The first row shows the source images and the second row shows the reference conditions, such as an image or a novel camera viewpoint. The third row shows the synthesized results.

Taking human motion imitation as an example, existing methods can be roughly categorized into an image-to-image translation-based pipeline [8, 9, 10] and a warping-based pipeline [2, 3, 1, 11]. The image-to-image translation-based pipeline learns a person-specific mapping function from human conditions, characterized by a skeleton, dense pose, or parsing result, to the image, using a video with paired sequences of conditions and images. Thus, everybody needs to train their own model from scratch, and a model trained for one person cannot be applied to others. Besides, it cannot easily be extended to other tasks, such as appearance transfer. To overcome this shortcoming, researchers have proposed warping-based methods, which warp the input images into the reference conditions (skeleton, dense pose, or parsing) and generate the desired image, so that a trained model can be applied to other input images with different identities. We summarize the recent warping-based approaches in Fig. 2. An early work [2], shown in Fig. 2 (a), feeds the concatenation of the source image (with its pose condition) and the target pose condition into a network with adversarial training to generate an image with the desired pose. However, direct concatenation does not consider the spatial layout, and it is ambiguous for the generator where to place the pixels of the source image; thus, it often results in blurred images and loses the source identity. Later, inspired by spatial transformer networks (STN) [12], a texture warping method [1], shown in Fig. 2 (b), was proposed. It first fits a rough affine transformation matrix from the source and the reference keypoints, then uses an STN to warp the source image into the reference pose, and finally generates the result based on the warped image. However, texture warping still cannot preserve the source information well, in terms of color, style, or face identity, because the generator might drop the source information after several downsampling operations, such as strided convolution and pooling. Meanwhile, contemporary work [11, 3, 13] proposes to warp the deep features of the source images into the target poses rather than warping in the image space, as shown in Fig. 2 (c), named feature warping. However, the features extracted by the encoder in feature warping cannot be guaranteed to characterize the source identity accurately, which inevitably produces blurred or low-fidelity images.

Figure 2: Three existing approaches to propagate the source information into the target condition. (a) Early concatenation, which concatenates the source image, the source condition, and the target condition in the color channel. (b) and (c) are texture and feature warping, respectively. The source image or its features are propagated into the target condition under a fitted transformation flow.

The aforementioned existing methods encounter challenges in generating realistic-looking images for several reasons: 1) diverse clothes, in terms of texture, style and color, and the highly structured face identity are difficult to capture and preserve in their network architectures; 2) articulated and deformable human bodies result in large spatial layout and geometric changes for arbitrary pose manipulations; 3) all these methods cannot handle multiple source inputs, such as in appearance transfer, where different parts might come from different source people; 4) the generalization is poor when the inputs are out of the domain of the training set, because to synthesize photo-realistic images all these methods apply the adversarial constraints of discriminators, which push the results towards the distribution of the training set.

In this paper, we follow the warping-based pipeline. To preserve the source details of the clothes and face identity, we propose a Liquid Warping Block (LWB) and an advanced version, the Attentional Liquid Warping Block (AttLWB), to address the loss of the source information from several aspects: 1) a denoising convolutional auto-encoder is used to extract useful features that preserve the source information, including texture, color, style and face identity; 2) the source features of each local part are blended into a global feature stream by our proposed LWB and AttLWB to further preserve the source details; 3) they support multiple-source warping, such as in appearance transfer, where the features of the head (local identity) from one source and those of the body from another are warped and aggregated into a global feature stream; 4) a one/few-shot learning strategy is utilized to improve the generalization of the network.

In addition, existing approaches mainly rely on a 2D pose [1, 2, 3], a dense pose [14] or a body parsing result [11]. These conditions only capture the layout locations and ignore the personalized shape and limb (joint) rotations, which are even more essential than layout locations in human image synthesis. For example, in the extreme case of a tall man imitating the actions of a short person, using the 2D skeleton, dense pose or body parsing as the condition will unavoidably change the height and size of the tall person, as shown at the bottom of Fig. 9. To overcome these issues, we use a parametric statistical human body model, SMPL [15, 16, 17, 18], which disentangles a human body into the pose (joint rotations) and the shape. It outputs a 3D mesh (without clothes) rather than the layouts of joints and parts. Further, the transformation flow can be easily calculated by matching the correspondences between two 3D triangulated meshes, which is more accurate and results in fewer misalignments than the affine matrices previously fitted from keypoints [1, 3].

Based on the SMPL model and the Liquid Warping Block (LWB) or the Attentional Liquid Warping Block (AttLWB), our method can be further extended to other tasks, including human appearance transfer and novel view synthesis, for free, and one model can handle all three tasks. We summarize our contributions as follows: 1) we propose an LWB and an AttLWB to propagate the source information, such as texture, style, color, and face identity, in both the image and the feature space, and thereby address its loss; 2) by taking advantage of both the LWB (AttLWB) and the 3D parametric model, our method is a unified framework for human motion imitation, appearance transfer, and novel view synthesis; 3) since the previous datasets [19, 20] are limited in the diversity of poses and can only be used for motion imitation, we build a dataset for these tasks, especially for human motion imitation in videos, and release all code and datasets for the convenience of further research in the community.

This paper is an extension of our previous work [21]. We extend the framework in the following aspects:

i) our previous LWB [21] directly adds the warped features of multiple sources into the global features, which enlarges the magnitude of the features in overlap areas and thereby results in artifacts. To address this, motivated by the attention architecture [22], we propose a more advanced Attentional Liquid Warping Block (AttLWB). It first learns the similarities between the global features and all the source features, and then fuses the multiple source features by a linear combination of the learned similarities and the source features in feature space. Finally, to better propagate the source identity (style, color, and texture) into the global stream, we warp the fused source features into the global stream by Spatially-Adaptive Normalization (SPADE) [23], which further improves the final result;

ii) our previous network does not generalize well when the input images are far from the training domain, for example in motion imitation across people of different races. The reason might be that, to generate images with high fidelity, an adversarial (GAN) loss is essential [2, 1, 3, 21], and it pushes the generated images towards the distribution of the training set. The input images are diverse in terms of human races, face identities, and clothes styles, and it is infeasible to collect a dataset containing all such individuals. In the testing phase, once an individual is unique in face identity or clothes style, the well-trained network might produce a high-fidelity result similar to the training samples but fail to preserve its source identity in terms of face and clothes. To improve the generalization, inspired by SinGAN [24] and Few-Shot Adversarial Learning [25], we apply a one/few-shot adversarial learning scheme that pushes the network to focus on the individual input with several steps of adaptation, namely personalization.

iii) our previous method achieves decent results at 256×256 resolution; in this version, based on the AttLWB and personalization, we further achieve high-fidelity results at higher 512×512 and 1024×1024 resolutions.

We organize the rest of this paper as follows: In Section 2, we summarize the related work on human image synthesis, including motion imitation, appearance transfer, and novel view synthesis. In Section 3, we first introduce the essential modules of our proposed Attentional Liquid Warping GAN, followed by the training strategies, the loss functions, the one/few-shot personalization, and the inference details. In Section 4, extensive experiments on different datasets and tasks validate the effectiveness of our work. In Section 5, ablation studies and analyses are conducted to evaluate the impact of different components. We conclude our work in Section 6.

Figure 3: The training pipeline of our method. We randomly sample a pair of images from a video, denoting the source and the reference image as I_{s_i} and I_r. (a) A body mesh recovery module estimates the 3D mesh of each image and renders their correspondence maps, C_s and C_t; (b) The flow composition module first calculates the transformation flow T based on the two correspondence maps and their projected vertices in the image space. Then it separates the source image I_{s_i} into a foreground image I^{ft}_{s_i} and a masked background I_{bg}. Finally, it warps the source image based on the transformation flow T and produces a warped image I_{syn}; (c) In the last GAN module, the generator consists of three streams, which separately generate the background image \hat{I}_{bg} by G_{BG}, reconstruct the source image \hat{I}_s by G_{SID}, and synthesize the target image \hat{I}_t under the reference condition by G_{TSF}. To preserve the details of the source image, we propose a novel LWB and AttLWB (shown in Fig. 4) that propagate the source features of G_{SID} into G_{TSF} at several layers and preserve the source information, in terms of texture, style and color.

2 Related work

2.1 Human Motion Imitation

We summarize the recent image-to-image translation-based and the warping-based methods as follows.

Image-to-Image translation-based methods. Esser et al. [26] use a Variational U-Net to learn a mapping function from a 2D skeleton to an image. Chan et al. [9] learn a mapping function from a 2D skeleton to an image by a pix2pixHD [27] with a specialized Face GAN and a temporally coherent GAN. Wang et al. [28] propose a vid2vid framework and learn a mapping function from a 2D dense pose to an image. Meanwhile, Shysheya et al. [29] first build a full-texture UV image of a person from multi-view cameras, then learn a mapping function from a 3D skeleton to part coordinates of the UV map, and finally render a result based on the coordinates and the UV image. Contemporarily, Liu et al. [30] first use a monocular video to reconstruct a full 3D character model of a person with a static pose, then render the texture of each body part, and finally learn a mapping from synthetic to real images. However, all these methods train a mapping from keypoints or parts to the images of each person, and everybody needs to train their own model, which might limit their wide application.

Warping-based methods. Recent work is mainly based on conditional generative adversarial networks (CGAN) [2, 1, 31, 14, 32]. Their key technical idea is to combine the source image with the source pose (2D skeleton) as inputs and to generate a realistic image by GANs using a reference pose. The differences among those approaches are merely in the network architectures, warping strategies, and adversarial losses. Ma et al. [2] directly concatenate the source image and the reference pose, and then design a U-Net [33] generator with a coarse-to-fine strategy to generate 256×256 images. Neverova et al. [14] replace the sparse 2D keypoints with dense correspondences between the image and the surface of the human body given by DensePose [34]. Si et al. [32] propose a multistage adversarial loss and separately generate the foreground (or different body parts) and the background. Balakrishnan et al. [1] first fit an affine transformation matrix based on the source and the target 2D keypoints and then use a texture warping strategy to generate the foreground and the background separately. These works [3, 11, 35, 13] focus on how to warp the source features into the target conditions, such as a skeleton or a parsing result. Besides, Li et al. [36] propose to learn a transformation flow from 2D keypoints and warp the deep features based on the learned transformations.

2.2 Human Appearance Transfer

Human appearance modeling and transfer is a vast topic, especially in the field of virtual try-on applications, ranging from computer graphics pipelines [37] to learning-based pipelines [4, 5]. Graphics-based methods first estimate the detailed 3D human mesh with clothes via garment models and 3D scanners [38] or multiple camera arrays [39]; the human appearance with clothes can then be transferred from one person to another based on the detailed 3D mesh. Although these methods can produce high-fidelity results, their cost, size, and controlled capture environment are unfriendly and inconvenient for customers. Recently, in light of deep generative models, SwapNet [4] first learns a pose-guided clothing segmentation synthesis network, and then the clothing parsing results, together with texture features from the source image, are fed into an encoder-decoder network to generate the image with the desired garment. In [5], the authors leverage a geometric 3D shape model combined with learning methods, swap the colors of the visible vertices of the triangulated mesh, and train a model to infer those of the invisible vertices. Instead of estimating the 3D clothes with other sensors, in MGN [40] the authors train a network on 3D scan data and predict the body shape and clothing directly from 8 frames or a video. They perform garment transfer based on the estimated 3D body mesh with clothes.

2.3 Human Novel View Synthesis

Novel view synthesis aims to synthesize new images of the same object or human body from arbitrary viewpoints. The core step of existing methods is to fit a correspondence map from the observable views to the new views with convolutional neural networks. In [41], the authors use CNNs to predict an appearance flow and synthesize new images of the same object by copying pixels from a source image based on the appearance flow; they achieve decent results on rigid objects such as vehicles. The follow-up work [42] proposes to infer the invisible textures based on the appearance flow and a generative adversarial network (GAN) [43], while Zhu et al. [7] argue that appearance-flow-based methods perform poorly on articulated and deformable objects, such as human bodies, and propose an appearance-shape-flow strategy to synthesize different views of human bodies. Besides, Zhao et al. [6] design a GAN-based method to synthesize high-resolution views in a coarse-to-fine way. Recently, in PIFu [44], the authors learn an implicit function with multi-layer perceptrons (MLPs) to digitize the human body and infer the 3D surface and texture from a single or multiple frames. The fully digitized human body can then be used to synthesize different views.

2.4 One/Few-shot Learning in Image Synthesis

Ding et al. [45] propose a generative adversarial one-shot face recognizer to synthesize new face images. Shaham et al. [24] introduce SinGAN, an unconditional generative model learned from a single image. Zakharov et al. [25] apply few-shot adversarial learning to generate realistic talking heads. In light of the success of meta-learning in classification, reinforcement learning and network architecture search [46, 47, 48], Lee et al. [10] propose MetaPix for few-shot motion imitation. Wang et al. [49] extend the previous vid2vid [28] framework to a few-shot setting and make it capable of synthesizing videos of unseen subjects by leveraging a few example images.

3 Our Approach

In this section, we first introduce the overall model of our framework. It contains three modules: a body mesh recovery module, a flow composition module, and a GAN module with the Liquid Warping Block (LWB) or the Attentional Liquid Warping Block (AttLWB). We then describe the training details and loss functions. Further, to improve the generalization, we introduce a one/few-shot learning strategy. We illustrate the details of how to apply our model to the three tasks in the inference section (Sect. 3.6).

Once the model has been trained on one task, it can deal with the other tasks as well. Here, we use motion imitation as an example, as shown in Fig. 3. Our framework supports multiple source inputs, denoting the source images as {I_{s_1}, I_{s_2}, ..., I_{s_n}} and the reference image as I_r. Here, s_n is the number of source images. First, the body mesh recovery module will estimate the 3D meshes of I_{s_i} and I_r and render their correspondence maps, C_{s_i} and C_t. Next, the flow composition module will calculate the transformation flow T_{s_i \to t} from each source image to the reference, based on the two correspondence maps and their projected meshes in image space. Each source image I_{s_i} is thereby decomposed into a foreground image I^{ft}_{s_i} and a masked background I^{bg}_{s_i}. Since all source images share the same background, we randomly choose one of the masked backgrounds, denoted as I_{bg}. Simultaneously, each source image contributes its visible textures to warp a synthetic image I^{syn}_t, based on the transformation flow T_{s_i \to t}. The last (Attentional) Liquid Warping GAN module consists of three streams. It separately generates the background image by G_{BG}, reconstructs the source image \hat{I}_{s_i} by G_{SID}, and synthesizes the final result \hat{I}_t under the reference condition by G_{TSF}. To preserve the details of the source image, we propose the novel Liquid Warping Block (LWB) and Attentional Liquid Warping Block (AttLWB), which propagate the source features of G_{SID} into G_{TSF} at multiple layers.

3.1 Body Mesh Recovery Module

As shown in Fig. 3 (a), given the source image I_{s_i} and the reference image I_r, the role of this stage is to predict the kinematic pose (rotations of limbs) and shape parameters, as well as the 3D mesh of each image. In this paper, we use HMR [17, 18] as the 3D pose and shape estimator due to its good trade-off between accuracy and efficiency. In HMR, an image is first encoded into a feature in \mathbb{R}^{2048} by a ResNet-50 [50], followed by an iterative 3D regression network that predicts the pose \theta \in \mathbb{R}^{72} and the shape \beta \in \mathbb{R}^{10} of SMPL [16], as well as the weak-perspective camera K \in \mathbb{R}^3. SMPL is a 3D body model that can be defined as a differentiable function M(\theta, \beta) \in \mathbb{R}^{N_v \times 3}; it parameterizes a triangulated mesh by N_v = 6,890 vertices and N_f = 13,776 faces with the parameters of a pose \theta \in \mathbb{R}^{72} and a shape \beta \in \mathbb{R}^{10}. Here, the shape parameters \beta are the coefficients of a low-dimensional shape space learned from thousands of registered scans, and the pose parameters \theta are the joint rotations that articulate the bones via forward kinematics. Through this process, we obtain the body reconstruction estimates of each source image, {K_{s_i}, \theta_{s_i}, \beta_{s_i}, M_{s_i}}, and those of the reference image, {K_r, \theta_r, \beta_r, M_r}, respectively.
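To make the recovery interface concrete, the following is a minimal sketch of this step. The `hmr` and `smpl` objects, their attribute names, and their signatures are illustrative assumptions rather than the released code; only the tensor shapes follow the description above.

```python
# Hedged sketch of the body mesh recovery step; `hmr` and `smpl` are
# hypothetical callables that mirror the interfaces described in the text.
import torch

def recover_body(image, hmr, smpl):
    """image: (1, 3, H, W) tensor normalized to [-1, 1]."""
    feat = hmr.encoder(image)                # (1, 2048) ResNet-50 feature
    theta, beta, cam = hmr.regressor(feat)   # pose (1, 72), shape (1, 10), camera (1, 3)
    verts = smpl(theta, beta)                # (1, 6890, 3) SMPL mesh vertices M(theta, beta)
    return {"cam": cam, "theta": theta, "beta": beta, "verts": verts}

# src_infos = [recover_body(I_s, hmr, smpl) for I_s in source_images]
# ref_info = recover_body(I_r, hmr, smpl)
```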

3.2 Flow Composition Module

Based on the previous estimations, we first render a correspondence map and a weight index map for each source mesh M_{s_i} and the reference mesh M_r under the camera views K_{s_i} and K_r. Here, we denote the source weight index map, the source correspondence map, and the target correspondence map as W_{s_i}, C_{s_i} and C_t, respectively. In this paper, we use a fully differentiable renderer, the Neural Mesh Renderer (NMR) [51]. We then project the source vertices V_{s_i} into the 2D image space by a weak-perspective camera, v_{s_i} = \pi(V_{s_i}, K_{s_i}), where \pi is the weak-perspective projection function. Then, we calculate the barycentric coordinates of each mesh face and obtain f_{s_i} \in \mathbb{R}^{N_f \times 2}. Next, we calculate the transformation flow T_{s_i \to t} \in \mathbb{R}^{H \times W \times 2} by matching the correspondences between the source correspondence map (with its mesh face coordinates f_{s_i}) and the target correspondence map. Here, H \times W is the size of the image. By the same means, we obtain the transformation flow T_{r \to t} of the reference correspondence map. We describe the procedure to obtain the transformation flow in Algorithm 1. Consequently, a foreground image I^{ft}_{s_i} and a masked background image I^{bg}_{s_i} are derived by masking the source image I_{s_i} based on C_{s_i}. We randomly pick one of the masked backgrounds, denoted as I_{bg}, because all source images share the same background. Finally, we warp the visible textures of each source image I_{s_i} to the desired condition by the transformation flow T_{s_i \to t} and thereby obtain a synthetic image I^{syn}_t, as depicted in Fig. 3.

Algorithm 1 The procedure of obtaining the transformation flow T_{s_i \to t}.
1: Input: W_{s_i}, V_{s_i}, F_{s_i}, K_{s_i}, C_{s_i}, C_t.
  • K_{s_i} \in \mathbb{R}^{3 \times 1}: source weak-perspective camera;
  • V_{s_i} \in \mathbb{R}^{N_v \times 3}: N_v is the number of vertices;
  • F_{s_i} \in \mathbb{R}^{N_f \times 3}: N_f is the number of faces;
  • W_{s_i} \in \mathbb{R}^{H \times W \times 3}: the weight index map of the source mesh; the value of each pixel indicates the barycentric weights of the triangulated face in image space;
  • C_{s_i} (C_t) \in \mathbb{R}^{H \times W \times 1}: the correspondence map of the source (target) mesh; the value of each pixel indicates the face index of the mesh.
2: Output: T_{s_i \to t} \in \mathbb{R}^{H \times W \times 2}, the transformation flow;
3: v_{s_i} = \pi(V_{s_i}, K_{s_i})  # project the source vertices V_{s_i} into the 2D image space by the weak-perspective camera;
4: tri_{s_i} = v_{s_i}[F_{s_i}] \in \mathbb{R}^{N_f \times 3 \times 2}  # the triangulated faces with vertices in the 2D image space;
5: Vis_{s_i} \in \mathbb{R}^{N_f \times 1}  # the face visibility;
6: for f = 1 to N_f do
7:     Vis_{s_i}(f) = 1 if f appears in C_{s_i} else 0;
8: end for
9: initialize T_{s_i \to t} \in \mathbb{R}^{H \times W \times 2};
10: for i = 1 to H do
11:     for j = 1 to W do
12:         f = C_t(i, j)  # the face index at the current pixel;
13:         T_{s_i \to t}(i, j) = W_{s_i}(i, j) \times tri_{s_i}(f), if Vis_{s_i}(f) is 1;
14:     end for
15: end for
16: return T_{s_i \to t}.
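A vectorized NumPy sketch of Algorithm 1 is given below. It assumes that background pixels of the correspondence maps store the sentinel value -1 and that `project` implements the weak-perspective projection \pi; both are assumptions for illustration, not the released implementation.

```python
import numpy as np

def transformation_flow(W_s, V_s, F_s, K_s, C_s, C_t, project):
    """Vectorized version of Algorithm 1 (background face index assumed to be -1)."""
    C_s, C_t = np.squeeze(C_s), np.squeeze(C_t)      # (H, W) face-index maps
    H, W_img = C_t.shape
    v_s = project(V_s, K_s)                          # (N_v, 2) vertices in image space
    tri_s = v_s[F_s]                                 # (N_f, 3, 2) projected face triangles

    vis = np.zeros(len(F_s), dtype=bool)             # face visibility from the source view
    vis[np.unique(C_s[C_s >= 0]).astype(int)] = True

    T = np.zeros((H, W_img, 2), dtype=np.float32)
    f_idx = C_t.astype(int)
    valid = (f_idx >= 0) & vis[np.clip(f_idx, 0, len(F_s) - 1)]
    # barycentric weights (P, 3) combine the three projected vertices (P, 3, 2) of each face
    T[valid] = np.einsum("pk,pkc->pc", W_s[valid], tri_s[f_idx[valid]])
    return T
```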

3.3 Attentional Liquid Warping GAN

This stage synthesizes high-fidelity human images under the desired condition. More specifically, it 1) synthesizes the background image; 2) predicts the color of invisible parts based on the visible parts; and 3) generates the pixels of clothes, hair, and other regions beyond the SMPL reconstruction.

Generator. Our generator works in a three-stream manner. One stream, named G_{BG}, works on the concatenation (in the color channel) of the masked background image I_{bg} and the mask obtained by binarizing C_{s_i}, and generates the realistic background image \hat{I}_{bg}, as shown in the top stream of Fig. 3 (c). The other two streams are the source identity stream G_{SID} and the transfer stream G_{TSF}. G_{SID} is a denoising convolutional auto-encoder that guides the encoder to extract features capable of preserving the source information. Together with \hat{I}_{bg}, it takes the masked source foreground I^{ft}_{s_i} and the correspondence map C_{s_i} as inputs and reconstructs the source foreground image \hat{I}_s. The G_{TSF} stream synthesizes the final result; it receives the foreground warped by a bilinear sampler and the correspondence map C_t as inputs. To preserve the source information, such as texture, style, and color, we propose a novel Liquid Warping Block (LWB), as well as its advanced version, the Attentional Liquid Warping Block (AttLWB), which link the source stream with the target stream. They blend the source features from G_{SID} and fuse them into the transfer stream G_{TSF}, as shown at the bottom of Fig. 3 (c).

G_{BG} and G_{SID} have similar architectures with separate parameters and follow the structure of CycleGAN [52] with 6 residual blocks [53]. The kernel sizes and numbers of filters are illustrated in Fig. 5. G_{TSF} is a combination of a ResNet and a U-Net [33], named ResUnet. For G_{BG}, we directly regress the final background image \hat{I}_{bg}, while for G_{SID} and G_{TSF}, we generate an attention map A and a color map P, as shown in Fig. 5. The final image is obtained as follows:

\hat{I}_{s_i} = P_s \odot A_{s_i} + \hat{I}_{bg} \odot (1 - A_{s_i})    (1)
\hat{I}_t = P_t \odot A_t + \hat{I}_{bg} \odot (1 - A_t).

Here, \odot represents element-wise multiplication. The total trainable parameters in the generator are \theta_G = {\theta_{BG}, \theta_{SID}, \theta_{TSF}, \theta_{AttLWB}}, with respect to G_{BG}, G_{SID}, G_{TSF} and the AttLWB.
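As a small illustration of Eq. (1), the composition can be written as a single blending function (a PyTorch sketch with illustrative tensor names):

```python
import torch

def compose(P, A, I_bg_hat):
    """P, I_bg_hat: (B, 3, H, W) color/background maps; A: (B, 1, H, W) attention in [0, 1]."""
    return P * A + I_bg_hat * (1.0 - A)

# I_s_hat = compose(P_s, A_s, I_bg_hat)   # first line of Eq. (1)
# I_t_hat = compose(P_t, A_t, I_bg_hat)   # second line of Eq. (1)
```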

Discriminator. To push the discriminators to focus on different aspects of the generated images, such as the clothes on the human body and the face identity, we utilize a global-local content-oriented architecture. It consists of three sub-discriminators. The first one is a global discriminator, D_{Global}, which regularizes the entire generated \hat{I}_t to be more realistic-looking. The other two are a body discriminator D_{Body} and a face discriminator D_{Head}; they push the cropped body area and the head (face) part of the generated \hat{I}_t to be realistic-looking. All of them are conditional discriminators, and they take the generated images and the correspondence map C_t as inputs. We illustrate the details of our discriminators in Fig. 5. The total trainable parameters of the discriminators are \theta^D = {\phi_{Global}, \phi_{Body}, \phi_{Head}}.

Figure 4: Illustration of our LWB and AttLWB. They share the same overall structure illustrated in (b) but use a separate AddWB (illustrated in (a)) or AttWB (illustrated in (c)). (a) is the structure of AddWB. Through AddWB, \widehat{X}_t^l is obtained by aggregating the warped source features and the features from G_{TSF}. (b) is the shared structure of the (Attentional) Liquid Warping Block. {X^l_{s_1}, X^l_{s_2}, ..., X^l_{s_n}} are the feature maps of the different sources extracted by G_{SID} at the l-th layer. {T_{s_1 \to t}, T_{s_2 \to t}, ..., T_{s_n \to t}} are the transformation flows from the different sources to the target. X^l_t is the feature map of G_{TSF} at the l-th layer. (c) is the architecture of AttWB. Through AttWB, the final output features \widehat{X}_t^l are obtained with SPADE by denormalizing the feature map from G_{TSF} with a weighted combination of the source features warped by a bilinear sampler (BS) with respect to the corresponding flows T_{s_i \to t}.

Attentional Liquid Warping Block. One advantage of our proposed Liquid Warping Block (LWB) and Attentional Liquid Warping Block (AttLWB) is that they address the issue of multiple sources. For instance, in human motion imitation, the source images can be multi-view inputs, and in appearance transfer, different parts of garments come from different people. The different parts of features are aggregated into G_{TSF} by their transformation flows independently. As shown in Fig. 4, we denote X^l_{s_1} and X^l_{s_2} as the feature maps extracted by G_{SID} from different sources at the l-th layer, and X^l_t as the feature map of G_{TSF} at the l-th layer. Each part of the source features is warped by its transformation flow and aggregated into the features of G_{TSF}. We use a bilinear sampler (BS) to warp the source features X^l_{s_1} and X^l_{s_2} with respect to the corresponding transformation flows, T_{s_1 \to t} and T_{s_2 \to t}. The way the warped source features are aggregated into the global stream is the main difference between LWB and AttLWB.

LWB, as illustrated in Fig. 4 (a), directly uses element-wise addition over all features and fuses them into the global features as:

X^l_{s_i \to t} = BS(X^l_{s_i}, T_{s_i \to t})    (2)
\widehat{X}_t^l = \sum_{i=1}^{s_n} X^l_{s_i \to t} + X_t^l.

However, LWB enlarges the magnitude of the features in the overlap area, thereby resulting in artifacts. To address this, motivated by the attention architecture [22], we propose a more advanced Attentional Liquid Warping Block (AttLWB), as shown in Fig. 4 (c). It first learns the similarities between the global features and all the source features, and then fuses the multiple source features by a linear combination of the learned similarities and the source features in feature space. Finally, to better propagate the source identity (style, color, and texture) into the global stream, we use SPADE [23] to denormalize the feature map of G_{TSF} with the fused source features to obtain the global stream, which further improves the final result. We describe the entire procedure of AttLWB in Algorithm 2.

Algorithm 2 The procedure of our AttLWB.
1: Input: {T_{s_1 \to t}, ..., T_{s_n \to t}}, {X^l_{s_1}, ..., X^l_{s_n}}, and X^l_t.
  • {T_{s_1 \to t}, ..., T_{s_n \to t}}: the transformation flows from the different sources to the target;
  • {X^l_{s_1}, ..., X^l_{s_n}}: the feature maps of the different sources extracted by G_{SID} at the l-th layer;
  • X^l_t: the feature map of G_{TSF} at the l-th layer;
2: Output: \widehat{X}_t^l, the output features;
3: X^l_{s_i \to t} = BS(X^l_{s_i}, T_{s_i \to t})  # warp each source feature;
4: Q = f_Q(X^l_t)  # query embeddings;
5: K = [f_K(X^l_{s_1 \to t}), ..., f_K(X^l_{s_n \to t})]  # key embeddings;
6: V = [f_V(X^l_{s_1 \to t}), ..., f_V(X^l_{s_n \to t})]  # value embeddings;
7: \tilde{x}_s = Attention(Q, K, V) = Softmax(QK^T / \sqrt{d_k}) V  # fused source features; d_k is the number of channels of K;
8: \widehat{X}_t^l = SPADE(X^l_t, \tilde{x}_s)  # conditioned on \tilde{x}_s;
9: return \widehat{X}_t^l;
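The following PyTorch sketch illustrates one way to realize Algorithm 2. It computes full non-local attention between target pixels and all warped source pixels and reduces the SPADE step to predicting per-pixel scale and bias from the fused features; the layer names, embedding size, and attention granularity are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttLWBSketch(nn.Module):
    def __init__(self, channels, embed_dim=128):
        super().__init__()
        self.f_q = nn.Conv2d(channels, embed_dim, 1)
        self.f_k = nn.Conv2d(channels, embed_dim, 1)
        self.f_v = nn.Conv2d(channels, channels, 1)
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Conv2d(channels, channels, 3, padding=1)   # SPADE-style scale
        self.beta = nn.Conv2d(channels, channels, 3, padding=1)    # SPADE-style bias

    def forward(self, x_t, warped_sources):
        # x_t: (B, C, H, W) feature of G_TSF; warped_sources: list of (B, C, H, W)
        # source features already warped by BS(., T_{s_i -> t}).
        B, C, H, W = x_t.shape
        q = self.f_q(x_t).flatten(2).transpose(1, 2)                        # (B, HW, d_k)
        k = torch.cat([self.f_k(x).flatten(2) for x in warped_sources], 2)  # (B, d_k, n*HW)
        v = torch.cat([self.f_v(x).flatten(2) for x in warped_sources], 2)  # (B, C, n*HW)
        attn = F.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)                 # (B, HW, n*HW)
        fused = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(B, C, H, W)
        return self.norm(x_t) * (1 + self.gamma(fused)) + self.beta(fused)  # SPADE denorm
```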
Figure 5: The details of the network architectures of our Attentional Liquid Warping GAN, including the generator and the discriminators. Here s represents the stride size of the convolution and transposed convolution.

3.4 Training Details and Loss Functions

In this part, we will introduce the loss functions and how to train the whole system. For the body recovery module, we follow the network architecture and loss functions of HMR [17, 18]. Here, we use a pre-trained (off-the-shelf) SMPL estimator.

Note that our proposed Attentional Liquid Warping GAN is a unified framework for motion imitation, appearance transfer, and novel view synthesis. Therefore, once we have trained the model on one task, it can be applied to the other tasks. These three tasks share the same training pipeline in our method, except for the way the source and the reference images are sampled. In motion imitation, we randomly sample s_n + 1 images with different poses from each video and set the first s_n as the source images {I_{s_1}, ..., I_{s_n}} and the remaining one as the reference I_r. In appearance transfer, we need to sample s_n + 1 images of the same person wearing different clothes, while in novel view synthesis, we need to sample s_n + 1 images of the same person under different camera views. In our experiments, we train a model for motion imitation and then apply it to appearance transfer and novel view synthesis.

The whole loss function of the generator contains four terms, which are perceptual loss [54], face identity loss, attention regularization loss, and adversarial loss.

Perceptual Loss. It regularizes the reconstructed source image \hat{I}_{s_i} towards the ground truth I_{s_i} and pushes the generated target image \hat{I}_t and the reference image I_r to be closer in a VGG [55] feature subspace. Its formulation is given as follows:

L_p = \frac{1}{s_n} \sum_{i=1}^{s_n} \|\hat{I}_{s_i} - I_{s_i}\|_1 + \|f(\hat{I}_t) - f(I_r)\|_1.    (3)

Here, f is a VGG-19 [55] pre-trained on ImageNet [56].
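A hedged sketch of Eq. (3) is shown below, using torchvision's VGG-19 as f. The chosen feature layer and the omission of ImageNet color normalization are simplifying assumptions; the exact layers used by the authors are not restated here.

```python
import torch
import torchvision

vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:26].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(I_s_hat_list, I_s_list, I_t_hat, I_r):
    # First term: pixel-wise L1 over the reconstructed sources; second term: VGG feature L1.
    rec = torch.stack([(a - b).abs().mean() for a, b in zip(I_s_hat_list, I_s_list)]).mean()
    return rec + (vgg(I_t_hat) - vgg(I_r)).abs().mean()
```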

Face Identity Loss. It regularizes the cropped face from the synthesized target image \hat{I}_t to be similar to that from the ground-truth image I_r, which pushes the generator to preserve the face identity. It is shown as follows:

L_f = \|g(\hat{I}_t) - g(I_r)\|_1.    (4)

Here, g is a pre-trained SphereFaceNet [57].

Adversarial Loss. It pushes the distribution of the synthesized images towards the distribution of real images. We use an LSGAN_{-110} [58] loss in a PatchGAN manner over all discriminators, D_{Global}, D_{Body} and D_{Head}. They push the entire generated image, the cropped body area, and the head (face) part to be realistic-looking. We denote the bounding boxes of the head and body in the ground truth I_r as head^{xy} and body^{xy}, respectively, and we calculate them from the projected vertices in the image space. \hat{I}^b_t, I^b_r and C^b_t are the bodies cropped from the generated image, the reference image and the correspondence map, based on the body bounding box body^{xy}. \hat{I}^h_t, I^h_r and C^h_t are the corresponding cropped heads with respect to the head bounding box head^{xy}. We arrive at the total adversarial loss as follows:

L^G_{adv} = \sum D_{Global}(\hat{I}_t, C_t)^2 + \sum D_{Body}(\hat{I}^b_t, C^b_t)^2    (5)
    + \sum D_{Head}(\hat{I}^h_t, C^h_t)^2

Attention Regularization Loss. It regularizes the attention maps A_t and A_{s_i} to be smooth and prevents them from saturating. Since there is no ground truth for the attention map A or the color map P, they are learned from the gradients of the above losses. However, the attention masks can easily saturate to 1, which prevents the generator from working. To alleviate this, we regularize the masks to be close to the silhouettes S rendered from the 3D body mesh. Since the silhouette is a rough map that contains the body mask without clothes and hair, we additionally introduce a Total Variation Regularization [59] over A to compensate for the shortcomings of the silhouettes. It is shown as:

L_a = \|A_s - S_s\|^2_2 + \|A_t - S_t\|^2_2 + TV(A_s) + TV(A_t)    (6)
TV(A) = \sum_{i,j} [A(i,j) - A(i-1,j)]^2 + [A(i,j) - A(i,j-1)]^2.
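A short PyTorch sketch of Eq. (6) is given below; the silhouettes S are assumed to be rendered binary masks of the same spatial size as the attention maps.

```python
import torch

def total_variation(A):
    # A: (B, 1, H, W) attention map
    dh = (A[:, :, 1:, :] - A[:, :, :-1, :]) ** 2
    dw = (A[:, :, :, 1:] - A[:, :, :, :-1]) ** 2
    return dh.sum() + dw.sum()

def attention_reg(A_s, S_s, A_t, S_t):
    return ((A_s - S_s) ** 2).sum() + ((A_t - S_t) ** 2).sum() \
        + total_variation(A_s) + total_variation(A_t)
```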

For the generator, the full objective function is shown as follows, where \lambda_p, \lambda_f and \lambda_a are the weights of the perceptual, face identity and attention losses, respectively.

L^G = \lambda_p L_p + \lambda_f L_f + \lambda_a L_a + L^G_{adv}.    (7)

For the discriminators, the full objective function is

L^D = \sum [D_{Global}(\hat{I}_t, C_t) + 1]^2 + \sum [D_{Global}(I_r, C_t) - 1]^2    (8)
    + \sum [D_{Body}(\hat{I}^b_t, C^b_t) + 1]^2 + \sum [D_{Body}(I^b_r, C^b_t) - 1]^2
    + \sum [D_{Head}(\hat{I}^h_t, C^h_t) + 1]^2 + \sum [D_{Head}(I^h_r, C^h_t) - 1]^2.
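The per-discriminator terms of Eqs. (5) and (8) follow the LSGAN_{-110} pattern below (a sketch for a single conditional discriminator D; the same pattern is repeated for D_Global, D_Body and D_Head, with D returning patch-wise scores that are squared and summed):

```python
import torch

def g_adv_loss(D, fake, cond):
    # Generator term of Eq. (5): push D's score on fakes towards 0.
    return (D(fake, cond) ** 2).sum()

def d_adv_loss(D, fake, real, cond):
    # Discriminator term of Eq. (8): fakes towards -1, reals towards +1.
    return ((D(fake.detach(), cond) + 1) ** 2).sum() + ((D(real, cond) - 1) ** 2).sum()
```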
Figure 6: Illustration of the calculation of the transformation flows for the different tasks during the testing phase. The left shows the body parameters of both the source and the reference images, disentangled by the Body Recovery module. The right shows the different ways the transformation flow is calculated for each task.

3.5 One/Few-shot Personalization by Fine-tuning

Although we can train our model on a large dataset with diverse people and clothes, such a generator is still hard to generalize well to inputs outside the domain of the training set. After all, it is infeasible to build a universal dataset and generator that handle all possible face identities, clothes styles, and backgrounds. To improve the generalization, inspired by SinGAN [24] and meta-learning [25, 49, 10, 46], we apply one/few-shot adversarial learning to push the network to focus on each individual through several steps of fast personal adaptation. In real application scenarios, a user might only provide a small number (s_n) of photos with different views or poses, and in the extreme case, only one image is accessible. In this paper, we focus on the setting where no more than eight images (s_n \leq 8) [25] are available in the testing phase.

Specifically, we first train our model, including a generator and a discriminator, on a combined large dataset, and consequently obtain the generator's pre-trained parameters \theta^M_G and the discriminator's pre-trained parameters \theta^M_D. Then, for each specific person P_i with s_n images, we learn the person-specific generator \theta^{P_i}_G and discriminator \theta^{P_i}_D from the s_n images by fine-tuning the pre-trained model. This process is called one/few-shot personalization. To further push the generator from the pre-trained \theta^M_G to the person-specific \theta^{P_i}_G, we discard the pre-trained parameters of the discriminator \theta^M_D and train the person-specific discriminator \theta^{P_i}_D from scratch. The overall loss functions in the personalization phase are similar to those in the training phase, except for the adversarial loss. Since there are only a few images (s_n \leq 8), to avoid overfitting and reduce the time consumption of each iteration in personalization, we only use the global discriminator.
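A schematic of the personalization loop is sketched below. The sampling routine and loss callables are passed in as placeholders, since the concrete functions are not restated here; the learning rate and the 100 adaptation steps follow Sect. 4.2.

```python
import copy
import torch

def personalize(G_meta, make_discriminator, images,
                sample_pair, g_loss_fn, d_loss_fn, steps=100, lr=1e-4):
    G = copy.deepcopy(G_meta)        # start from the pre-trained generator theta_G^M
    D = make_discriminator()         # person-specific global discriminator, trained from scratch
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    for _ in range(steps):
        src, ref = sample_pair(images)           # reference drawn from the s_n provided images
        fake = G(src, ref)

        opt_d.zero_grad()
        d_loss_fn(D, fake, ref).backward()       # LSGAN discriminator loss (fake detached inside)
        opt_d.step()

        opt_g.zero_grad()
        g_loss_fn(G, D, fake, src, ref).backward()   # perceptual + face + attention + adversarial
        opt_g.step()
    return G
```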

3.6 Inference

After personalization, the person-specific generator can be applied to all three tasks. The difference lies in the computation of the transformation flow, due to the different conditions of the various tasks. The remaining modules, Body Mesh Recovery and (Attentional) Liquid Warping GAN, stay the same. The following are the details of the Flow Composition module for each task in the testing phase.

Motion Imitation. We firstly copy the pose parameters of the reference \theta_r into those of the source and get the synthetic parameters of SMPL, as well as the 3D mesh, M_t = M(\theta_r, \beta_s). Next, we render the correspondence maps of the source mesh M_s and the synthetic mesh M_t under the camera view K_s. Here, we denote the source and the synthetic correspondence maps as C_s and C_t, respectively. Then, we project the source vertices into the 2D image space by a weak-perspective camera, v_s = \pi(V_s, K_s). Here, \pi is the weak-perspective projection function. Next, we calculate the barycentric coordinates of each mesh face and obtain f_s \in \mathbb{R}^{N_f \times 2}. Finally, we calculate the transformation flow T \in \mathbb{R}^{H \times W \times 2} by matching the correspondences between the source correspondence map (with its mesh face coordinates f_s) and the synthetic correspondence map. It is shown in Fig. 6 (a).

Novel View Synthesis. Given a new camera view, in terms of a rotation R and a translation t, we firstly calculate the 3D mesh under the novel view, M_t = M_s R + t. The subsequent operations are similar to those of motion imitation: we render the correspondence maps of the source mesh M_s and the novel mesh M_t under the weak-perspective camera K_s and calculate the transformation flow T \in \mathbb{R}^{H \times W \times 2}, as depicted in Fig. 6 (b).
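For novel view synthesis, the view change above amounts to a rigid transform of the recovered source vertices before re-rendering, e.g. (NumPy sketch):

```python
import numpy as np

def novel_view_vertices(V_s, R, t):
    """V_s: (N_v, 3) source vertices; R: (3, 3) rotation; t: (3,) translation."""
    return V_s @ R + t    # M_t = M_s R + t applied row-wise to every vertex
```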

Appearance Transfer. We need to "copy" the clothes on the body from the reference image while keeping the head (face, eyes, hair and so on) identity of the source. We split the transformation flow T into two sub-flows, the source flow T_1 and the reference flow T_2. We denote the head mesh as M^h = (V^h, F^h) and the body mesh as M^b = (V^b, F^b), where M = M^h \cup M^b. For T_1, we first project the head mesh M^h_s of the source into the image space and thereby obtain the silhouette S^h_s. Then, we create a mesh grid G \in \mathbb{R}^{H \times W \times 2}. Next, we mask G by S^h_s and derive T_1 = G \odot S^h_s, where \odot represents element-wise multiplication. T_2 is obtained similarly to motion imitation: we render the correspondence maps of the source body M^b_s and the reference body M^b_t, denoted as C^b_s and C^b_t, respectively, and calculate the transformation flow T_2 from the correspondences between C^b_s and C^b_t. We illustrate this in Fig. 6 (c).

Figure 7: The statistics of the iPER dataset, including the action, clothes, height and weight distributions of the actors.

4 Experiments

4.1 Dataset

iPER. To evaluate the performance of our proposed method on motion imitation, appearance transfer, and novel view synthesis, we build a new video dataset with diverse styles of clothes, named the Impersonator (iPER) dataset. There are 30 subjects with different shapes, heights, and genders. Each subject wears different clothes and performs an A-pose video and a video with random actions. There are 103 clothes in total. The whole dataset contains 206 video sequences with 241,564 frames. We split it into a training and a testing set at a ratio of 8:2 according to the different clothes. All the clothes and 29% of the actors in the testing set do not appear in the training set. We illustrate the details of the iPER dataset in terms of action classes, clothes styles, and the weight and height distributions of the actors in Fig. 7. We show some samples in the first two rows of Fig. 8.

Figure 8: Samples from the four datasets. The first two rows are samples from the iPER dataset. The third row shows samples from the MotionSynthetic dataset and the fourth row from the FashionVideo dataset. The last row shows samples from the Youtube-Dancer-18 dataset.

MotionSynthetic. We also build a synthetic dataset, named MotionSynthetic, for the convenience of evaluation, especially for human appearance transfer and novel view synthesis, because we can synthesize ground-truth images with different views and garments by modifying the meshes. This dataset borrows 24 human meshes from People Snapshot [60] and 96 human meshes from MultiGarments [40], 120 meshes in total. All of these meshes with UV texture images have been registered to SMPL [16]. For each mesh, we choose a pose sequence from Mixamo and a background image from the Internet. Based on these materials (mesh, UV image, pose sequence, and background image), we render the synthetic images by NMR [51], resulting in 39,529 frames in total. We split it into a training and a testing set at a ratio of 8:2 according to the different meshes and illustrate some synthetic images in the 3rd row of Fig. 8.

FashionVideo. It contains 500 training and 100 testing videos, each with a single female model wearing fashionable clothes [20]. Each video has around 350 frames. The clothes and textures are diverse, while the gestures are limited to a few standard poses. Moreover, this dataset lacks diversity in background, and all the backgrounds are black. We display some samples in the 4th row of Fig. 8.

Youtube-Dancer-18. To further validate the effectiveness and generalization of our method, we evaluate it on in-the-wild Internet videos, Youtube-Dancer-18 [10]. It consists of 18 videos of people dancing downloaded from Youtube, each lasting from 4 to 12 minutes. We follow the setting of MetaPix [10]: we sample frames at 30 FPS, use only s_n \leq 8 frames from the training sequences for personalization, and then evaluate on the testing sequences. Some samples are shown at the bottom of Fig. 8. It should be mentioned that we do not train the model on this dataset; we only sample s_n frames for personalization and directly test on it to evaluate the generalization of all methods.

4.2 Implementation Details

We train our Attentional Liquid Warping GAN on a combined dataset consisting of the iPER, MotionSynthetic, and FashionVideo datasets and perform evaluations on these three datasets. To evaluate the generalization of our methods, we also test on the additional Youtube-Dancer-18 dataset without training on it. We crop all images based on the bounding box of the human body, rescale the cropped images while keeping the original aspect ratio, and then pad them to a 512×512 resolution. We normalize the color space of all images to [-1, 1]. In our experiments, including the training and personalization phases, we use the Adam [61] based stochastic gradient descent optimizer for both the generators and the discriminators. \lambda_p, \lambda_f and \lambda_a are 10.0, 5.0 and 2.5, respectively.
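The preprocessing described above can be sketched as follows (OpenCV/NumPy; the zero padding value and the centered placement are assumptions for illustration):

```python
import cv2
import numpy as np

def preprocess(image, box, size=512):
    """image: (H, W, 3) uint8; box: (x0, y0, x1, y1) person bounding box."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    scale = size / max(crop.shape[:2])                       # keep the aspect ratio
    new_w, new_h = int(crop.shape[1] * scale), int(crop.shape[0] * scale)
    resized = cv2.resize(crop, (new_w, new_h))
    canvas = np.zeros((size, size, 3), dtype=np.uint8)       # pad to size x size (value assumed)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas.astype(np.float32) / 127.5 - 1.0           # normalize colors to [-1, 1]
```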

i): In the training phase, we randomly sample s_n + 1 images from each video and set the first s_n as the source images {I_{s_1}, ..., I_{s_n}} and the remaining one as the reference I_r. We fix s_n = 2 and the mini-batch size to 2. There are two training epochs. We fix the learning rate at 0.0001 for the first quarter of training and gradually decrease it to 0.00001 by the end.

ii): In the personalization and testing phase, s_n can be flexible; because of the memory limitation of the GPU devices, in our experiments we set s_n \in {1, 2, 4, 8}. Besides, I_r lies in the set of source images {I_{s_1}, ..., I_{s_n}}. We fix the learning rate at 0.0001 and take T = 100 steps for personalization.

4.3 Results of Human Motion Imitation

Figure 9: Comparison of our method with others for motion imitation on the iPER and FashionVideo datasets (zoom in for the best view). All results are at 512×512 resolution. The 2D pose-guided methods PG2 [2], DSC [3], SHUP [1] and DIAF [36] cannot preserve the clothes details, face identity and shape consistency of the source images. We highlight the details with red and blue rectangles.
Figure 10: Examples of motion imitation from our proposed methods (zoom in for the best view). All results are at 512×512 resolution. Our method produces high-fidelity images that preserve the face identity, shape consistency and clothes details of the source. We recommend accessing the supplementary material for more results in video form.

Evaluation Metrics. We propose an evaluation protocol for the testing sets of the iPER, MotionSynthetic, FashionVideo, and Youtube-Dancer-18 datasets; it indicates the performance of different methods in terms of different aspects. The details are as follows:

1): In each video with actor P_i, {I^{P_i}_1, ..., I^{P_i}_t, ..., I^{P_i}_L}, we select eight images as candidate images with different views, such as frontal, sideways or back. Here, L is the number of frames.

2): We choose s_n \leq 8 images as sources, {I^{P_i}_{s_1}, ..., I^{P_i}_{s_n}}, from the eight candidate images for personalization. For a fair comparison with other methods [2, 1, 3, 36, 35], which only use a single source image, we separately report the results for s_n = 1 (one-shot setting) and 2 \leq s_n \leq 8 (few-shot setting).

3): After personalization, we perform self-imitation, in which each actor P_i imitates actions from images of themselves, with I^{P_i}_t as the reference image. We denote \hat{I}^{P_i \to P_i}_t as the synthesized image referring to I^{P_i}_t. As criteria, we use PSNR, SSIM [62], Learned Perceptual Image Patch Similarity (LPIPS) [63], Body-CS and Face-CS to measure the similarity between \hat{I}^{P_i \to P_i}_t and I^{P_i}_t.

Body-Cosine-Similarity (Body-CS): the similarity between the cropped person region of the synthesized image and that of the ground-truth image. In particular, we first use a YOLOv3 [64] detector to get the person bounding boxes of the synthesized and ground-truth images. Then, we crop the person patches according to the bounding boxes. Finally, we use a pre-trained person re-identification (ReID) model, OS-Net [65], to get the embedding features of the cropped person patches; we normalize the features and calculate the cosine similarity between them to obtain the Body-CS.

Face-Cosine-Similarity (Face-CS): similar to Body-CS, it is the similarity between the cropped face region of the synthesized image and that of the ground-truth image. Specifically, we first use an MTCNN [66] face detector to get the face bounding boxes of the synthesized and ground-truth images. Then, we crop the face regions according to the bounding boxes. Finally, we use a pre-trained face recognition model [67] to get the embedding features of the cropped face patches; we normalize the features and calculate the cosine similarity between the normalized features to obtain the Face-CS (a sketch of this computation is given after this list).

4): We also conduct cross-imitation, in which an actor P_i imitates actions from others, such as P_j. We denote {\hat{I}^{P_i \to P_j}_1, ..., \hat{I}^{P_i \to P_j}_L} as the sequence of synthesized images referring to {I^{P_j}_1, ..., I^{P_j}_L}, and {I^{P_i}_{s_1}, ..., I^{P_i}_{s_n}} as the sequence of real images. Since there is no ground truth for the synthesized images, the similarity metrics mentioned above cannot be used; instead, we use the Fréchet Inception Distance (FID) [68] to measure perceptual realism. It calculates the distance between the set of synthesized images and that of real images. We further propose the Fréchet Distance of a pre-trained ReID model, OS-Net [65], namely Body-FD, and that of a face recognition model, namely Face-FD. We also collect L consecutive frames from the actor P_i, denoted as {I^{P_i}_1, ..., I^{P_i}_L}, and then calculate the Body-CS and Face-CS as aforementioned.
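Both Body-CS and Face-CS reduce to the same comparison of normalized embeddings; a sketch is given below, where `embed` stands in for the OS-Net or face recognition backbone and the detection and cropping step is assumed to have been done beforehand.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_metric(patch_syn, patch_gt, embed):
    """patch_syn, patch_gt: (B, 3, h, w) cropped person or face patches."""
    f_syn = F.normalize(embed(patch_syn), dim=-1)
    f_gt = F.normalize(embed(patch_gt), dim=-1)
    return (f_syn * f_gt).sum(dim=-1).mean().item()
```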

TABLE I: One-shot average results for human motion imitation of different methods on the iPER, MotionSynthetic, and FashionVideo datasets. \uparrow means the larger the better, and \downarrow means the smaller the better. A higher SSIM may not mean a better quality of an image [63].
Self-Imitation: PSNR\uparrow SSIM\uparrow LPIPS\downarrow Body-CS\uparrow Face-CS\uparrow | Cross-Imitation: Face-CS\uparrow Face-FD\downarrow Body-CS\uparrow Body-FD\downarrow FID\downarrow
PG2 [2] 23.699 0.876 0.130 0.744 0.085 0.148 429.142 0.709 240.429 119.378
SHUP [1] 23.979 0.881 0.080 0.855 0.288 0.297 243.599 0.820 80.973 51.823
DSC [3] 20.782 0.732 0.331 0.695 0.139 0.204 407.070 0.673 273.103 150.082
DIAF [36] 22.753 0.829 0.108 0.851 0.390 0.364 166.560 0.808 102.807 63.528
PATB [35] 20.387 0.798 0.169 0.738 0.129 0.363 218.333 0.731 259.135 136.911
Our-LWB 23.932 0.843 0.089 0.901 0.560 0.538 99.258 0.862 48.619 32.370
Our-AttLWB 24.513 0.856 0.074 0.911 0.591 0.564 73.217 0.869 44.022 30.503
TABLE II: Few-shot results for human motion imitation of different methods on the Youtube-Dancer-18 dataset. The number of source images $s_n$ is 2. \uparrow means the larger the better, and \downarrow means the smaller the better.
Self-Imitation: PSNR\uparrow SSIM\uparrow LPIPS\downarrow Body-CS\uparrow Face-CS\uparrow | Cross-Imitation: Face-CS\uparrow Face-FD\downarrow Body-CS\uparrow Body-FD\downarrow FID\downarrow
pix2pixHD [27] 11.134 0.196 0.633 0.616 0.106 0.136 221.661 0.565 266.552 175.574
SPADE [23] 8.984 0.120 0.780 0.535 0.106 0.131 294.672 0.513 431.670 304.698
MetaPix Pix2PixHD [10] 14.052 0.385 0.550 0.549 0.134 0.187 277.555 0.523 441.495 257.457
MetaPix SHUP [10] 18.857 0.649 0.269 0.765 0.234 0.191 185.363 0.693 160.485 83.501
Our-LWB 19.485 0.642 0.245 0.830 0.413 0.355 96.280 0.738 102.075 70.743
Our-AttLWB 19.691 0.649 0.232 0.831 0.437 0.380 82.053 0.743 99.575 65.454

Quantitative Comparison with Other Methods under One-shot Setting. We compare the performance of our method with that of existing methods, including PG2 [2], SHUP [1], DSC [3], DIAF [36], and PATB [35]. We train all these methods on a combined dataset consisting of the iPER, MotionSynthetic, and FashionVideo datasets and apply the one-shot evaluation protocol mentioned above to them. We report the results in Table I: our method outperforms the others on all metrics except SSIM, for which a higher value does not necessarily mean better image quality, as reported in [63].

Quantitative Comparison with Other Methods under Few-shot Setting. We compare the performance of our method with pix2pixHD [27], SPADE [23], MetaPix pix2pixHD, and MetaPix SHUP [10] under this setting. Here, we report the results on the Youtube-Dancer-18 dataset with the number of source images $s_n$ being 2 in Table II, and our method outperforms the others.

Qualitative Comparison. We also analyze the generated images and compare ours with those of the above methods. From Fig. 9, we find that: 1) the above methods, which use 2D pose-guided inputs, change the body shape of the source. For example, in the $2^{nd}$ row of Fig. 9, a tall person imitates the motion of a short person, and the baseline methods change the height of the source body; in contrast, our method keeps the body shape unchanged because it disentangles the pose and the personalized shape of each actor. 2) Owing to our proposed AttLWB (LWB) and face identity loss, our method is more powerful than the other methods in preserving the source identity, such as the face identity and clothes details of the source, as shown in the $1^{st}$ and $2^{nd}$ rows of Fig. 9. 3) Our method also produces high-fidelity images in the cross-imitation setting (imitating actions from others), as illustrated in Fig. 10: the face identity and the clothes details, in terms of texture, color, and style, are preserved well. This shows that our method achieves decent results in cross-imitation even when the reference image comes from the Internet and is thus out of the domain of our training dataset.

4.4 Results of Human Appearance Transfer

Figure 11: Examples of human appearance transfer with our proposed AttLWB on the testing set of iPER (zoom in for the best view). All results are in $512 \times 512$ resolution. Our method produces high-fidelity and decent images that preserve the face identity and shape consistency of the source image and keep the clothes details of the reference image.
Figure 12: Examples of human novel view synthesis with our proposed AttLWB. It is capable of preserving the source information, in terms of face identity and the logo details of clothes, even when the person wears a long dress and has fluffy hair.

It is worth emphasizing that once the model has been trained, it can be directly applied to all three tasks, including motion imitation, appearance transfer, and novel view synthesis. We conduct the experiments on the iPER dataset.

Evaluation Metrics. In the iPER dataset, subjects might wear different clothes, and we sample images of the same person in different clothes as the source and the reference image. We use the aforementioned PSNR, SSIM [62], LPIPS [63], Body-CS, and Face-CS as the metrics.

Quantitative Results. We report the results of our methods with LWB and AttLWB on the iPER dataset in Table III. The results show that the Attentional Liquid Warping Block (AttLWB) is slightly better than the LWB.

TABLE III: Results for human appearance transfer of our LWB and AttLWB on the iPER dataset. Here, we report the PSNR, SSIM, LPIPS, Body-CS, and Face-CS. \uparrow means the larger the better, and \downarrow means the smaller the better. A higher SSIM may not mean a better quality of an image [63].
PSNR\uparrow SSIM\uparrow LPIPS\downarrow Body-CS\uparrow Face-CS\uparrow
Our-LWB 17.707 0.734 0.225 0.891 0.642
Our-AttLWB 17.783 0.726 0.220 0.896 0.706

Qualitative Results. We randomly pick some examples displayed in Fig. 11. The face identity and clothes details, in terms of texture, color, and style, are preserved well by our method. It demonstrates that our method can achieve decent results in appearance transfer, even when the reference image comes from the Internet and is out of the domain of the iPER dataset, such as the last five columns in Fig. 11.

4.5 Results of Human Novel View Synthesis

Evaluation Metrics. In the iPER dataset, we have videos containing different views of a subject performing an A-pose, and in the MotionSynthetic dataset, we render A-pose images with 3D meshes from different viewpoints. Thus, we obtain images of the same person from different views. For evaluation, we use PSNR, SSIM [62], and LPIPS [63] as the metrics.

TABLE IV: Results for human novel view synthesis of different methods, including AppFlow [41], MV2NV [69], and our LWB and AttLWB, on the iPER and MotionSynthetic datasets. Here, we report the PSNR, SSIM, and LPIPS [63]. \uparrow means the larger the better, and \downarrow means the smaller the better. A higher SSIM may not mean a better quality of an image [63].
iPER: PSNR\uparrow SSIM\uparrow LPIPS\downarrow | MotionSynthetic: PSNR\uparrow SSIM\uparrow LPIPS\downarrow
AppFlow 23.342 0.849 0.133 25.575 0.896 0.083
MV2NV 24.950 0.883 0.125 25.951 0.837 0.097
LWB 24.518 0.862 0.090 25.055 0.779 0.106
AttLWB 25.246 0.867 0.078 28.625 0.934 0.037

Quantitative Results. In Table IV, we report the results of our AttLWB and those of other state-of-the-art methods, including AppFlow [41] and MV2NV [69], on the iPER and MotionSynthetic datasets under the above evaluation metrics. The results show that our method outperforms the other methods.

Qualitative Results. We randomly sample source images from the testing set of the iPER dataset and change the views from $30^{\circ}$ to $330^{\circ}$. The results are illustrated in Fig. 12. Our method is capable of predicting reasonable content for the invisible parts when switching to other views and of keeping the source information, in terms of face identity and clothes details, even in the self-occlusion case, such as the middle and bottom rows in Fig. 12. From Fig. 12, we can see that: 1) even when the subjects undergo large motion deformation, as in the $1^{st}$ row, our method keeps the logo details of the clothes; 2) the $2^{nd}$ row shows the results when the subjects have self-occlusion; 3) our method can also handle cases with a complex background, as the $3^{rd}$ row shows; 4) the $4^{th}$ row shows cases in which subjects wear a long dress and have fluffy hair; 5) the $5^{th}$ row is a case with complex clothes texture.

5 Ablation Studies and Analysis

In this section, we perform experiments to analyze the impacts of several factors in our system, including with/without personalization, ablation studies of the different loss functions, and a comparison of our proposed LWB and AttLWB with other warping strategies, such as input concatenation, texture warping, and feature warping. We further report the running time and analyze the failure cases.

5.1 Impact of Personalization

Figure 13: Comparison of our proposed AttLWB with and without personalization (zoom in for the best view). The two subjects in the left column are source images from the Youtube-Dancer-18 dataset, and those in the right column are cartoon images from the Internet. From the top-left segment, we can see that our method preserves the color style of the source background with personalization. From the bottom-left segment, we find that our method without personalization might lose the details of the logo structure in the source images, while with personalization it preserves the logo details. The subjects in the right column demonstrate that, with personalization, our method generalizes better: our model can deal with scenarios in which the source images are out of the domain of the training set, even when the source images are cartoon-style images from the Internet.

We perform ablation studies with/without personalization to verify the effectiveness of personalization. Besides, we also analyze the effect of the hyper-parameters, including the number of source images $1 \leq s_n \leq 8$ and the number of steps $T$ for personalization. Since we only use the Youtube-Dancer-18 dataset in the testing phase, it is reasonable to evaluate the generalization of our method with/without personalization on this dataset. Here, we use the self-imitation evaluation metrics mentioned above.

TABLE V: Comparison of our proposed AttLWB with and without personalization on the Youtube-Dancer-18 dataset. \uparrow means the larger the better and \downarrow means the smaller the better.
PSNR\uparrow SSIM\uparrow LPIPS\downarrow Body-CS\uparrow Face-CS\uparrow FID\downarrow
w/o 16.932 0.519 0.302 0.792 0.335 79.321
with 17.974 0.579 0.263 0.834 0.413 59.832

With/Without Personalization. We conduct comparative experiments with and without personalization. Here, we fix $s_n = 2$ and $T = 100$ in the personalization phase. Table V shows that our method with personalization achieves 1.0421 higher PSNR, 0.0599 higher SSIM, and 0.039 lower LPIPS than that without personalization on the Youtube-Dancer-18 dataset. Furthermore, we display some example results in Fig. 13, where the two subjects in the left column are source images from the Youtube-Dancer-18 dataset, and the two in the right column are cartoon images from the Internet. We find that, with personalization, 1) our method keeps the color style of the background unchanged, as shown in the top-left segment of Fig. 13; 2) our method is capable of preserving the logo details of the source clothes, as depicted in the bottom-left segment of Fig. 13; 3) our method generalizes better, even when the source images are cartoon-style, as illustrated in the right column of Fig. 13. These results demonstrate that personalization indeed plays a significant role in improving the generalization of our system.

Figure 14: Comparison of different numbers of source images $s_n$ and numbers of steps $T$ for personalization. The performance grows with the increase of $s_n$ when $T$ is large enough. When $T$ is small with respect to a large $s_n$, in this case $T = 10$ and $s_n = 8$, the performance decreases.

Number of Source Images $s_n$. In our system, we adopt a few source images, $1 \leq s_n \leq 8$, for personalization, and we analyze the impact of $s_n$ on the final results. Here, we fix the number of steps to $T \in \{10, 50, 100, 150, 200\}$, respectively, and plot the PSNR for different $s_n \in \{1, 2, 4, 8\}$ in Fig. 14. It shows that the performance grows with an increase of $s_n$ when $T$ is large enough, because additional source images provide more of the otherwise invisible textures. However, it is worth noticing that when $T$ is small with respect to a large $s_n$, as in the case of $T = 10$ and $s_n = 8$, the performance decreases. The reason might be that with so few steps, the network cannot fit that many source images.

Number of Steps $T$ for Personalization. In real applications, we should take the number of steps $T$ into consideration because more steps take more time; it is necessary to consider the trade-off between performance and the time overhead of personalization. We set $s_n \in \{1, 2, 4, 8\}$ and plot the performance for different $T$ in Fig. 14. From Fig. 14, we can see that the performance saturates at around 150 steps.

In summary, based on the above analysis, we recommend finetuning for around 100 steps in the personalization stage, and if the time for personalization is limited, it is better to use fewer source images.
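To make this recommendation concrete, the personalization stage can be viewed as a short self-supervised finetuning loop over the $s_n$ source images for $T$ steps. The sketch below illustrates this idea only; the generator call signature, the reconstruction loss, and the learning rate are illustrative placeholders rather than the exact settings of our implementation:

  import random
  import torch

  def personalize(generator, source_images, reconstruction_loss, steps=100, lr=1e-4):
      # generator:           pre-trained synthesis network (placeholder interface).
      # source_images:       list of the s_n preprocessed source images.
      # reconstruction_loss: callable (generated, target) -> scalar loss,
      #                      e.g. a perceptual loss (placeholder).
      optimizer = torch.optim.Adam(generator.parameters(), lr=lr)
      for _ in range(steps):                            # T personalization steps
          target = random.choice(source_images)         # image to reconstruct
          generated = generator(source_images, target)  # condition on all sources
          loss = reconstruction_loss(generated, target)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
      return generator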

5.2 Impact of Different Loss Functions

In our method, we apply a perceptual loss $L_p$, a face identity loss $L_f$, an attention regularization loss $L_a$, and an adversarial loss $L^G_{adv}$ (comprising global, body, and head adversarial losses) in the full training objective. To validate the effectiveness of each term, we perform ablation studies with different loss functions. From Table VI, we can see that the model with the full loss achieves the best performance. Besides, adding $L_f$ and $L^G_{adv}$ improves the performance compared with the trial using only $L_p$.
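For clarity, each ablated objective in Table VI is a weighted sum of these terms. The snippet below only illustrates how such a composition can be assembled; the weights shown are placeholders, not the values used in our experiments:

  def generator_objective(losses, weights=None):
      # losses:  dict with the already-computed tensors for 'perceptual' (L_p),
      #          'face' (L_f), 'attention' (L_a), and 'adv' (L^G_adv).
      # weights: per-term coefficients; the defaults below are placeholders.
      if weights is None:
          weights = {'perceptual': 10.0, 'face': 5.0, 'attention': 1.0, 'adv': 1.0}
      return sum(weights[name] * losses[name] for name in weights)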

TABLE VI: Comparison of results with different loss functions on the Youtube-Dancer-18 dataset. \uparrow means the larger the better, and \downarrow means the smaller the better.
PSNR\uparrow SSIM\uparrow LPIPS\downarrow Body-CS\uparrow Face-CS\uparrow
$L_p$ 18.204 0.575 0.274 0.791 0.314
$L_p + L^G_{adv}$ 19.656 0.638 0.231 0.810 0.334
$L_p + L^G_{adv} + L_f$ 19.542 0.629 0.247 0.809 0.351
$L_{full}$ 20.038 0.656 0.212 0.826 0.421

5.3 Impact of Different Warping Strategies

To verify the impact of our proposed Attentional Liquid Warping Block (AttLWB), we design several baselines with the aforementioned ways of propagating the source information, including input concatenation, texture warping, and feature warping. The body recovery and flow composition modules, the basic network architectures, and all loss functions are the same for our method and all warping baselines; only the propagation strategies differ. Here, we denote early concatenation, texture warping, and feature warping as $W_C$, $W_T$, and $W_F$, respectively. Also, we denote the $s_n$ source images as $\{I_{s_1}, \ldots, I_{s_n}\}$, their corresponding conditional inputs as $\{C_{s_1}, \ldots, C_{s_n}\}$, and their corresponding feature maps at the $l^{th}$ layer as $\{X^l_{s_1}, \ldots, X^l_{s_n}\}$. The reference conditional input is $C_t$, and the transformation flow from each source image to the reference is $T_{s_i \to t}$. We list the details of all warping baselines as follows:

Input Concatenation $W_C$. It directly concatenates all source images, their corresponding conditional inputs, and the reference conditional input, and then feeds them into the $G_{TSF}$ network, as shown in Fig. 2 (a).

Texture Warping $W_T$. Based on each transformation flow $T_{s_i \to t}$, we warp each source image $I_{s_i}$ to the reference condition, average the pixels of the overlapping regions, and synthesize an initial image. Then, we feed it into the $G_{TSF}$ network to generate the final image, as shown in Fig. 2 (b).

Feature Warping $W_F$. Instead of warping the source information in the image space, it propagates the source information in the feature space, based on the transformation flow. As mentioned above, we first obtain the warped feature $X^l_{s_i \to t}$ by using a bilinear sampler (BS) to warp each source feature $X^l_{s_i}$ according to the corresponding transformation flow $T_{s_i \to t}$. According to the way the global feature $X^l_t$ is aggregated from the multiple warped source features $\{X^l_{s_1 \to t}, \ldots, X^l_{s_n \to t}\}$, we subdivide the feature-warping baselines as follows (a code sketch of the warping and aggregation variants is given after this list):

  1. Attention $W^{Att}_F$ (ours), as shown in Algorithm 2.

  2. Add-Aggregation $W^A_F$ (ours). It is the first version of our proposed Liquid Warping Block (LWB) [21], as shown in Fig. 4 (a) and Equation (2).

  3. Mean-Aggregation $W^M_F$. Directly adding the warped features enlarges the magnitude of the features in the overlapping area and thereby results in artifacts. A naive remedy is to average all the warped features:

     $\widehat{X}_t^l = \frac{1}{s_n}\sum_{i=1}^{s_n} X^l_{s_i \to t} + X_t^l.$ (9)

  4. Add-Soft-Gate $W^{A\odot}_F$. The warped features might introduce a misalignment problem; to address it, Dong et al. [11] utilize a gated convolution to control the transformation degree. We first add all the warped features and then apply a gated convolution, as shown in Equation (10). Here, $g$ is a function with two convolution layers followed by a Sigmoid activation, $g(X_t^l) \in [0, 1]$, and $\odot$ denotes element-wise multiplication.

     $\widehat{X}_t^l = g(X_t^l) \odot \sum_{i=1}^{s_n} X^l_{s_i \to t} + X_t^l.$ (10)

  5. Mean-Soft-Gate $W^{M\odot}_F$. It first averages all the warped features, and the following steps are the same as in $W^{A\odot}_F$:

     $\widehat{X}_t^l = g(X_t^l) \odot \frac{1}{s_n}\sum_{i=1}^{s_n} X^l_{s_i \to t} + X_t^l.$ (11)
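The PyTorch-style sketch below illustrates the bilinear feature warping shared by all $W_F$ variants and the add, mean, and soft-gate aggregations of Equations (2) and (9)-(11). The attention aggregation of Algorithm 2 is not repeated here, and the gating module is a simplified stand-in for the gated convolution of [11]:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  def warp_feature(x_src, flow_grid):
      # Bilinear sampling (BS) of a source feature map x_src (NxCxHxW) with a
      # normalized sampling grid flow_grid (NxHxWx2) whose values lie in [-1, 1].
      return F.grid_sample(x_src, flow_grid, mode='bilinear', align_corners=True)

  class SoftGate(nn.Module):
      # g(.): two convolution layers followed by a Sigmoid, cf. Equation (10).
      def __init__(self, channels):
          super().__init__()
          self.net = nn.Sequential(
              nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
              nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())
      def forward(self, x):
          return self.net(x)

  def aggregate(warped, x_t, mode='mean', gate=None):
      # warped: list of warped source features [X^l_{s_i -> t}]; x_t: X^l_t.
      stacked = torch.stack(warped, dim=0)
      if mode == 'add':         # Equation (2), add-aggregation
          return stacked.sum(dim=0) + x_t
      if mode == 'mean':        # Equation (9), mean-aggregation
          return stacked.mean(dim=0) + x_t
      if mode == 'add_gate':    # Equation (10), add-soft-gate
          return gate(x_t) * stacked.sum(dim=0) + x_t
      if mode == 'mean_gate':   # Equation (11), mean-soft-gate
          return gate(x_t) * stacked.mean(dim=0) + x_t
      raise ValueError('unknown aggregation mode: ' + mode)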

We conduct a user study with 64 volunteers to assess the quality of the generated videos and to compare the performance of the warping strategies mentioned above. Participants are shown 17 groups of videos, each containing 7 videos generated by the 7 warping strategies in random order. Besides, the shared source image and reference video of each group are also shown to the participants for reference. Participants are asked to choose the best video in each group, considering the quality of the face, clothes texture, figure pose, and background. Finally, 64 responses are collected, and the results are shown in Fig. 15. We can see that our proposed AttLWB and AddLWB receive the best appraisal, scoring 41.73% and 20.04%, respectively, far higher than the others.

Figure 15: Results of the user study (%). The user preference for the videos with the best quality regarding the quality of the face, the clothes texture, and the background.

5.4 Running Time

Our method can produce results at different image resolutions, ranging from $256\times 256$, $512\times 512$, and $1024\times 1024$ to $1920\times 1920$. Here, we benchmark the running time of our system at different image resolutions. Since a higher resolution requires more GPU memory, we perform all the tests on a Tesla V100S-PCIe-32G GPU with Intel Xeon(R) E5-2620 2.10GHz CPUs. The resolution of the source images is $4032\times 3024$, and that of the reference video with 165 frames is $1920\times 1080$. In Fig. 16, we separately report the running time of preprocessing, personalization, and inference when synthesizing images at different resolutions. From Fig. 16, we can see that a higher resolution consumes more running time, especially in personalization and inference.
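The reported timings are wall-clock measurements of each stage. The helper below sketches a GPU-safe way to time a stage by synchronizing CUDA before reading the clock; the stage callables themselves are placeholders:

  import time
  import torch

  def time_stage(stage_fn, *args, **kwargs):
      # stage_fn: a placeholder callable for one stage
      # (preprocessing, personalization, or inference).
      if torch.cuda.is_available():
          torch.cuda.synchronize()
      start = time.time()
      output = stage_fn(*args, **kwargs)
      if torch.cuda.is_available():
          torch.cuda.synchronize()
      return output, time.time() - start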

Figure 16: Running time when producing images at different resolutions. The I/O consumption has been taken into account. The larger the resolution, the more time it consumes, particularly in the personalization and inference stages.

5.5 Failure Cases and Limitations

There are three main types of failure cases of our method. The first, as shown in the $1^{st}$ row of Fig. 17, occurs when the source image contains a large area of self-occlusion, which introduces ambiguity in the textures and thereby results in a bad synthesized image. The second occurs when the Body Recovery Module fails and cannot accurately estimate the pose parameters, as illustrated in the $2^{nd}$ row of Fig. 17. The last occurs when the background inpaintor $G_{BG}$ fails, as shown in the $3^{rd}$ row of Fig. 17.

Figure 17: The failure cases of our system. There are mainly three types of failure cases. The first occurs when the source images contain a large self-occlusion area, as shown in the top row. The second row shows a case where the body recovery module fails. The third row shows the artifacts when the background inpainting network fails.

In addition, there are still some limitations of our system: 1) it cannot imitate the motions of hands and facial expressions from the reference images, since the 3D parametric body model SMPL [16] used in our system does not contain articulated hands or an expressive face; 2) it cannot animate large-motion bodies with too loose clothing, such as a skirt or an evening dress; 3) it is affected by different lighting environments among the sources.

Therefore, for a better result, the input source images need to follow these guidelines:

  • They share the same static background without overly complex scene structures. If possible, we recommend using the actual background.

  • The person in the source images holds an A-pose so as to expose as many visible textures as possible.

  • It is recommended to capture the source images in an environment without too much contrast in lighting conditions and lock auto-exposure and auto-focus of the camera.

6 Conclusion

We propose a unified framework to handle human motion imitation, appearance transfer, and novel view synthesis. It employs a body recovery module to estimate the 3D body mesh, which is more powerful than 2D poses. To preserve the source information, we further design a novel warping strategy, the Attentional Liquid Warping Block (AttLWB), which propagates the source information in both image and feature spaces and supports more flexible warping from multiple sources. Besides, with fast personalization, our method generalizes well when the input images are out of the domain of the training set and synthesizes higher-resolution ($512\times 512$ and $1024\times 1024$) results. Extensive experiments show that our framework outperforms others and produces decent results.

ACKNOWLEDGMENT

We thank Dr. Weixin Luo for the meaningful discussions throughout the whole procedure, and we appreciate all the help from Min Jie in building the first version of the iPER dataset.

References

  • [1] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag, “Synthesizing images of humans in unseen poses,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [2] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool, “Pose guided person image generation,” in Advances in Neural Information Processing Systems, 2017, pp. 405–415.
  • [3] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe, “Deformable gans for pose-based human image generation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [4] A. Raj, P. Sangkloy, H. Chang, J. Hays, D. Ceylan, and J. Lu, “Swapnet: Image based garment transfer,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XII, 2018, pp. 679–695.
  • [5] M. Zanfir, A.-I. Popa, A. Zanfir, and C. Sminchisescu, “Human appearance transfer,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [6] B. Zhao, X. Wu, Z. Cheng, H. Liu, Z. Jie, and J. Feng, “Multi-view image generation from a single-view,” in 2018 ACM Multimedia Conference on Multimedia Conference, MM 2018, Seoul, Republic of Korea, October 22-26, 2018, 2018, pp. 383–391.
  • [7] H. Zhu, H. Su, P. Wang, X. Cao, and R. Yang, “View extrapolation of human body from a single image,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [8] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976, 2017.
  • [9] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, “Everybody dance now,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [10] J. Lee, D. Ramanan, and R. Girdhar, “Metapix: Few-shot video retargeting,” in ICLR, 2019.
  • [11] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin, “Soft-gated warping-gan for pose-guided person image synthesis,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., 2018, pp. 472–482.
  • [12] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 2017–2025.
  • [13] B. AlBahar and J.-B. Huang, “Guided image-to-image translation with bi-directional feature transformation,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [14] N. Neverova, R. A. Güler, and I. Kokkinos, “Dense pose transfer,” in European Conference on Computer Vision (ECCV), 2018.
  • [15] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, “Keep it smpl: Automatic estimation of 3d human pose and shape from a single image,” in European Conference on Computer Vision.   Springer, 2016, pp. 561–578.
  • [16] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “SMPL: A skinned multi-person linear model,” ACM Trans. Graphics (Proc. SIGGRAPH Asia), vol. 34, no. 6, pp. 248:1–248:16, oct 2015.
  • [17] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, “End-to-end recovery of human shape and pose,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [18] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis, “Learning to reconstruct 3d human pose and shape via model-fitting in the loop,” in ICCV, 2019.
  • [19] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [20] P. Zablotskaia, A. Siarohin, B. Zhao, and L. Sigal, “Dwnet: Dense warp-based network for pose-guided human video generation,” in 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019.   BMVA Press, 2019, p. 51.
  • [21] W. Liu, Z. Piao, J. Min, W. Luo, L. Ma, and S. Gao, “Liquid warping gan: A unified framework for human motion imitation, appearance transfer and novel view synthesis,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
  • [23] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [24] T. R. Shaham, T. Dekel, and T. Michaeli, “Singan: Learning a generative model from a single natural image,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [25] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky, “Few-shot adversarial learning of realistic neural talking head models,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [26] P. Esser, E. Sutter, and B. Ommer, “A variational u-net for conditional appearance and shape generation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [27] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [28] T. Wang, M. Liu, J. Zhu, N. Yakovenko, A. Tao, J. Kautz, and B. Catanzaro, “Video-to-video synthesis,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, 2018, pp. 1152–1164.
  • [29] A. Shysheya, E. Zakharov, K.-A. Aliev, R. Bashirov, E. Burkov, K. Iskakov, A. Ivakhnenko, Y. Malkov, I. Pasechnik, D. Ulyanov, A. Vakhitov, and V. Lempitsky, “Textured neural avatars,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [30] L. Liu, W. Xu, M. Zollhoefer, H. Kim, F. Bernard, M. Habermann, W. Wang, and C. Theobalt, “Neural rendering and reenactment of human actor videos,” ACM Transactions on Graphics 2019 (TOG), 2019.
  • [31] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz, “Disentangled person image generation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [32] C. Si, W. Wang, L. Wang, and T. Tan, “Multistage adversarial losses for pose-based human image synthesis,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [33] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference Munich, Germany, October 5 - 9, 2015, Proceedings, Part III, 2015, pp. 234–241.
  • [34] I. K. Rıza Alp Güler, Natalia Neverova, “Densepose: Dense human pose estimation in the wild,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [35] Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and X. Bai, “Progressive pose attention transfer for person image generation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [36] Y. Li, C. Huang, and C. C. Loy, “Dense intrinsic appearance flow for human pose transfer,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [37] G. Pons-Moll, S. Pujades, S. Hu, and M. J. Black, “Clothcap: seamless 4d clothing capture and retargeting,” ACM Trans. Graph., vol. 36, no. 4, pp. 73:1–73:15, 2017.
  • [38] C. Zhang, S. Pujades, M. J. Black, and G. Pons-Moll, “Detailed, accurate, human shape estimation from clothed 3d scan sequences,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 5484–5493.
  • [39] V. Leroy, J. Franco, and E. Boyer, “Multi-view dynamic shape refinement using local temporal integration,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 3113–3122.
  • [40] B. L. Bhatnagar, G. Tiwari, C. Theobalt, and G. Pons-Moll, “Multi-garment net: Learning to dress 3d people from images,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [41] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, “View synthesis by appearance flow,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, 2016, pp. 286–301.
  • [42] E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg, “Transformation-grounded image generation network for novel 3d view synthesis,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [43] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2014, pp. 2672–2680.
  • [44] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li, “Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [45] Z. Ding, Y. Guo, L. Zhang, and Y. Fu, “One-shot face recognition via generative learning,” in 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, Xi’an, China, May 15-19, 2018.   IEEE Computer Society, 2018, pp. 1–7.
  • [46] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70.   PMLR, 2017, pp. 1126–1135.
  • [47] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” CoRR, vol. abs/1803.02999, 2018.
  • [48] D. Lian, Y. Zheng, Y. Xu, Y. Lu, L. Lin, P. Zhao, J. Huang, and S. Gao, “Towards fast adaptation of neural architectures with meta learning,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.   OpenReview.net, 2020.
  • [49] T. Wang, M. Liu, A. Tao, G. Liu, B. Catanzaro, and J. Kautz, “Few-shot video-to-video synthesis,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 2019, pp. 5014–5025.
  • [50] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, 2016, pp. 630–645.
  • [51] H. Kato, Y. Ushiku, and T. Harada, “Neural 3d mesh renderer,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 3907–3916.
  • [52] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
  • [53] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770–778.
  • [54] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, 2016, pp. 694–711.
  • [55] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 2015.
  • [56] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  • [57] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 6738–6746.
  • [58] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, “On the effectiveness of least squares generative adversarial networks,” CoRR, vol. abs/1712.06391, 2017. [Online]. Available: http://arxiv.org/abs/1712.06391
  • [59] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “Ganimation: Anatomically-aware facial animation from a single image,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part X, 2018, pp. 835–851.
  • [60] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll, “Video based reconstruction of 3d people models,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [61] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, vol. abs/1412.6980, 2015.
  • [62] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [63] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [64] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016.   IEEE Computer Society, 2016, pp. 779–788.
  • [65] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature learning for person re-identification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [66] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, 2016.
  • [67] https://github.com/timesler/facenet-pytorch/.
  • [68] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in Neural Information Processing Systems 30: 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 6626–6637.
  • [69] S.-H. Sun, M. Huh, Y.-H. Liao, N. Zhang, and J. J. Lim, “Multi-view to novel view: Synthesizing novel views with self-learned confidence,” in European Conference on Computer Vision, 2018.
Wen Liu received the bachelor degree from Northwestern Polytechnical University, Xian, China, in 2016. He is currently pursuing a Ph.D. degree at ShanghaiTech University. His research interests focus on human 3D body reconstruction, image synthesis, motion transfer, novel view synthesis, neural rendering and video anomaly detection.
Zhixin Piao received the bachelor degree from Southeast University, Nanjing, China, in 2017. He is currently pursuing a master degree at ShanghaiTech University. His research topic is human 3D reconstruction and motion transfer.
Zhi Tu received the bachelor degree from ShanghaiTech University, Shanghai, China, in 2020. His research topic is human motion transfer and medical image analysis.
Wenhan Luo received the Ph.D. degree from Imperial College London, UK, 2016, M.E. degree from Institute of Automation, Chinese Academy of Sciences, China, 2012 and B.E. degree from Huazhong University of Science and Technology, China, 2009. His research interests include several topics in computer vision and machine learning, such as motion analysis (especially object tracking), image/video quality restoration, object detection and recognition, reinforcement learning.
Lin Ma received the B.E. and M.E. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 2006 and 2008, respectively, and the Ph.D. degree from the Department of Electronic Engineering, The Chinese University of Hong Kong, in 2013. He was a Researcher with the Huawei Noah’s Ark Laboratory, Hong Kong, from 2013 to 2016. He is currently a Principal Researcher with the Tencent AI Laboratory, Shenzhen, China. His current research interests lie in the areas of computer vision, multimodal deep learning, specifically for image and language, image/video understanding, and quality assessment. Dr. Ma received the Best Paper Award from the Pacific-Rim Conference on Multimedia in 2008. He was a recipient of the Microsoft Research Asia Fellowship in 2011. He was a finalist in HKIS Young Scientist Award in engineering science in 2012.
Shenghua Gao is an assistant professor, PI in ShanghaiTech University, China. He received the B.E. degree from the University of Science and Technology of China in 2008 (outstanding graduates), and received the Ph.D. degree from the Nanyang Technological University in 2012. From Jun 2012 to Jul 2014, he worked as a postdoctoral fellow in Advanced Digital Sciences Center, Singapore. His research interests include computer vision and machine learning.