
Do as we do:
Multiple Person Video-To-Video Transfer

Mickael Cormier3,1 Houraalsadat Mortazavi Moshkenan3 Franz Lörch1  Jürgen Metzler1,2 Jürgen Beyerer1,3
{mickael.cormier, franz.loerch, juergen.metzler, juergen.beyerer}@iosb.fraunhofer.de, [email protected]
1Fraunhofer IOSB, Karlsruhe, Germany; 2Fraunhofer Center for Machine Learning;
3Vision and Fusion Lab, Institute for Anthropomatics and Robotics,
Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
Abstract

Our goal is to transfer the motion of real people from a source video to a target video with realistic results. While recent advances have significantly improved image-to-image translation, only a few works account for body motion and temporal consistency, and those focus only on video retargeting for single actors. In this work, we propose a marker-less approach for multiple-person video-to-video transfer using pose as an intermediate representation. Given a source video with multiple persons dancing or working out, our method transfers the body motion of all actors to a new set of actors in a different video. Unlike recent “do as I do” methods, we focus specifically on transferring multiple persons at the same time and tackle the related identity-switch problem. Our method is able to convincingly transfer body motion to the target video while preserving specific features of the target video, such as the feet touching the floor and the relative positions of the actors. The evaluation is performed with visual quality and appearance metrics on publicly available videos used with the permission of their owners.

Index Terms:
GAN, motion transfer, video-to-video transfer, video retargeting
Teaser figure: “Do as we do” motion transfer: given a clip of a team working out [18] (top) and a video of people performing other exercises [9], our method transfers the workout onto the second team (bottom).

I Introduction

Human motion analysis is an important topic in the computer vision community. Recent advances in human-related application systems bring new human-centered challenges such as driver behavior recognition [25], crowd pose estimation for crowd motion analysis [15], or human action recognition in the dark [35]. However, the CNNs trained for such higher-level tasks require large amounts of annotated data, which is often challenging to collect and properly annotate. Therefore, synthetic photo-realistic data is often considered a cost-effective way of augmenting existing datasets [14, 12, 21]. In this work, we introduce a method for synthesizing realistic-looking videos of multiple persons dancing side by side and switching places, based on real input and target videos. Building on the Everybody Dance Now work [3] and similar to recent video-to-video translation works [41, 16, 28], we first extract pose skeletons using a state-of-the-art method [2, 1, 29, 34]. Since those works only address the video-to-video translation problem for a single person, we extend the simple yet efficient method from [3] to the multiple-person transfer problem. We collect online videos of dance workouts with different numbers of persons and perform an ablation study over the number of persons. Furthermore, we improve the face generation network by using more accurate face landmarks. Finally, this scenario brings new challenges regarding the pose transfer of each individual in the group. The normalization step is adapted in order to accurately map each subject from the input video to its counterpart in the target. Furthermore, we address the problem of persons switching places by adapting the keypoint correspondence network from [30] to track each individual in the video.

II Related Work

Recent breakthroughs in the field of image-to-image translation were enabled by the introduction of conditional GANs for paired and unpaired images [19, 42]. Those works were rapidly followed by numerous methods for image and video manipulation. In this section, we review related work on image-to-image translation and appearance transfer.

Wang et al. [32] presented a method to generate high-resolution results of 2048×1024 pixels from semantic label maps using a perceptual loss, a coarse-to-fine generator, and a multi-scale discriminator architecture. An approach was proposed in [22] to generate high-resolution images using semantic segmentation and texture prediction. Various generative adversarial networks were proposed to increase the visual quality of generated images using labels and texts [43, 39, 36, 27]. Liu et al. [23] proposed an encoder-decoder for pose-guided high-resolution appearance transfer to a target pose. They use local descriptors with a progressive local perceptual loss and local discriminators at the highest resolution, followed by training of the autoencoder architecture. Zanfir et al. [38] successfully transferred the appearance of a person in source images to a person in target images while preserving the body outline of the target person, using 3D pose as an intermediate representation. Kundu et al. [20] propose a recurrent network for the long-term synthesis of 3D person interactions. The Attribute-Decomposed GAN [26] introduces a generative model for controllable person image synthesis, which generates desired human attributes such as pose, head, upper clothes, and pants.

Efros et al. [11] transfer actions based on predicted skeletons, introducing the concepts of “Do as I do”, where images of a target person are generated according to a driver's movement, and “Do as I say”, where images of target persons are produced based on imposed commands. More recently, Zhou et al. [41] trained a model on a relatively long video of a target person, enabling the transfer of any movement of choice from a reference video to the target person while preserving the target's appearance. The model receives a frame of the target person and a pose from the reference as input and generates an image of the target person in that pose as output. Wang et al. [31] proposed a few-shot vid2vid framework for generating images of previously unseen targets, including humans and scenes; the model generalizes the poses of the reference video to a few example images of the target. Liu et al. [24] proposed a generative adversarial learning-based approach to upper-body video synthesis. They transfer the body and facial landmarks of the source person onto the target person, normalize the upper-body landmarks, and generate facial features in the target video with spatio-temporal smoothing. Chan et al. [3] proposed a similar approach for motion transfer from a source video to a target video. Their approach consists of two steps, pose encoding and normalization, followed by pose-to-video translation. Poses are normalized by evaluating the ankle positions and height of the subjects in order to adapt the size of the source to the target. Their pose-to-video translation uses a three-step coarse-to-fine approach, with one step explicitly addressing the quality of the generated face, and applies temporal smoothing. Videos for unseen in-the-wild poses are generated in [28] using data augmentation and unpaired learning to improve generalization and minimize the domain gap between training and testing pose sequences. Gomes et al. [16] account for the pose, shape, appearance, and motion features of the moving target. Finally, a graph convolutional network is proposed in [13] for generating dance videos from audio information, creating natural motions that preserve the key movements of different music styles. Nevertheless, these methods focus only on the transfer of a single subject at a time.

Figure 1: An overview of our method. First, a pose detector is used to detect the pose of each actor. For each person, a pose stick figure is generated with a distinct set of colors, which is kept consistent over time through pose tracking. These are then normalized into stick figures for the target domain. Finally, the trained generator is applied.

III Method

Given a video with a fixed number of source persons and a video with the same number of target persons, we aim to generate a new synthetic version of the target video in which the target persons perform the movements seen in the source. Following [3], the pipeline is divided into three stages: pose detection, global pose normalization, and mapping from normalized skeletons to the target subjects.

Pose Encoding   Since the focus of our work lies on generalizing pose transfer, we use a pre-trained state-of-the-art pose estimation model [2, 1, 29, 34] to produce accurate pose estimates for the input frames. We then generate a colored pose stick figure for each person. In order to reliably learn the appearance of each individual using the poses as intermediate representation, we found empirically that the stick figures need clearly distinct colors for each person. If the body-part colors of two persons are too similar, the model tends to average both appearances and produce less realistic features.
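The following sketch illustrates this per-person coloring idea; the limb list is an illustrative subset of an OpenPose BODY_25-style layout and the palette values are arbitrary examples, not the exact colors used in our implementation.

```python
import cv2
import numpy as np

# Illustrative subset of limb connections (indices into a BODY_25-style keypoint array).
LIMBS = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (10, 11), (8, 12), (12, 13), (13, 14)]

# One clearly distinct base color per person (BGR); too-similar colors blur appearances.
PERSON_PALETTES = [(0, 0, 255), (0, 255, 0), (255, 0, 0), (0, 255, 255), (255, 0, 255)]


def draw_stick_figures(frame_shape, poses):
    """poses: list of (25, 3) arrays [x, y, confidence], one per tracked person, in a fixed order."""
    canvas = np.zeros(frame_shape, dtype=np.uint8)
    for person_id, kps in enumerate(poses):
        base = np.array(PERSON_PALETTES[person_id % len(PERSON_PALETTES)], dtype=float)
        for limb_id, (a, b) in enumerate(LIMBS):
            if kps[a, 2] < 0.1 or kps[b, 2] < 0.1:  # skip limbs with low-confidence joints
                continue
            # Shade each limb slightly differently while keeping the person's base color.
            color = tuple(int(c) for c in base * (0.5 + 0.5 * limb_id / len(LIMBS)))
            pt_a = tuple(int(v) for v in kps[a, :2])
            pt_b = tuple(int(v) for v in kps[b, :2])
            cv2.line(canvas, pt_a, pt_b, color, thickness=4)
    return canvas
```

Keeping the person-to-palette mapping fixed over the whole video is what allows the generator to associate each color set with one appearance.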

Pose Normalization  Since the target and source videos have different environment and camera settings as well as people with different physical appearances, such as height and shape, a normalization step between the input and target subjects is needed in order to produce more realistic frames. For instance, in Figure 2, the horizon of the source video is higher than the horizon of the target video, which caused the people in the generated video to be above the horizon: their feet are not located on the floor. Moreover, if a person in the source video is taller than the corresponding person in the target video, the subject in the generated frame is abnormally tall. Similarly, if the distance between the people and the camera in the source video is shorter than in the target video, the people in the generated video appear disproportionately larger than they should be.
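As a rough illustration of how such a per-subject normalization can be estimated, the sketch below fits a scale and a vertical translation from ankle positions and body heights collected over both videos, in the spirit of [3]; the function names and the use of medians are our own simplifications, not the exact procedure.

```python
import numpy as np


def fit_normalization(src_ankles, src_heights, tgt_ankles, tgt_heights):
    """Estimate a scale and vertical translation mapping one source subject to its target.

    src_ankles / tgt_ankles: arrays of ankle y-coordinates collected over the videos.
    src_heights / tgt_heights: arrays of per-frame body heights (head to ankle) in pixels.
    """
    # Scale so the source subject matches the target subject's typical height.
    scale = np.median(tgt_heights) / np.median(src_heights)
    # Translate so the scaled source ankles land on the target's floor line.
    translation = np.median(tgt_ankles) - scale * np.median(src_ankles)
    return scale, translation


def normalize_pose(keypoints, scale, translation):
    """Apply the mapping to a (num_joints, 2) array of source keypoints."""
    out = keypoints.astype(float).copy()
    out[:, 1] = scale * out[:, 1] + translation                            # height and floor position
    out[:, 0] = scale * (out[:, 0] - out[:, 0].mean()) + out[:, 0].mean()  # scale around the x-center
    return out
```

Estimating the statistics separately for each source/target pair keeps each subject anchored to its counterpart's floor line and apparent size.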

Changes of Place between Source Subjects   If the source subjects change places, an identity switch occurs in the target video, meaning that an input person exchanges appearance with another after changing places. To address this, each subject needs to be tracked over time. In this case, poses are tracked before encoding using keypoint correspondences as proposed in [30], with scenario-specific adjustments. First, we extend the model to track all 25 available body landmarks instead of only 17. Although we could also use face and hand landmarks, those are not predicted as reliably; therefore, we use body landmarks only. Since the number of subjects in each video remains the same, we drop frames in which more poses are detected than expected. Typically, when more poses are predicted than there are people in a frame, two poses are assigned to one person, which can result in identity switches. For better keypoint heatmap accuracy, we train the keypoint correspondence network with input images of size 512×512 pixels instead of 256×256 pixels. As the persons form a closed set, no similarity threshold is used and poses are always assigned in a greedy fashion to the identities of the closed set.
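The sketch below illustrates the closed-set greedy assignment; the simple joint-distance similarity is only a stand-in for the learned keypoint-correspondence scores of [30], and all names are illustrative.

```python
import numpy as np


def assign_ids(detections, previous_by_id, expected_count):
    """detections: list of (25, 3) pose arrays for one frame.
    previous_by_id: dict identity -> last assigned (25, 3) pose.
    Returns dict identity -> pose, or None if the frame should be dropped."""
    if len(detections) != expected_count:
        return None  # more (or fewer) poses than people: drop the frame to avoid identity switches

    def similarity(a, b):
        # Stand-in similarity: negative mean joint distance over confidently detected joints.
        valid = (a[:, 2] > 0.1) & (b[:, 2] > 0.1)
        if not valid.any():
            return -np.inf
        return -np.linalg.norm(a[valid, :2] - b[valid, :2], axis=1).mean()

    pairs = [(similarity(det, prev), det_idx, pid)
             for det_idx, det in enumerate(detections)
             for pid, prev in previous_by_id.items()]
    pairs.sort(reverse=True, key=lambda p: p[0])

    assigned, used_dets, used_ids = {}, set(), set()
    for score, det_idx, pid in pairs:  # greedy: best matches first, no similarity threshold
        if det_idx in used_dets or pid in used_ids:
            continue
        assigned[pid] = detections[det_idx]
        used_dets.add(det_idx)
        used_ids.add(pid)
    return assigned
```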

Pose to Video Translation   After preprocessing, the tracked skeletons are used as input to our model. An overview of our method is given in Figure 1. Our model is based on a conditional GAN setup, trained in three stages: global, local, and face. For more details, we refer to [3].

Figure 2: Top: input images. Bottom left: a generated image before normalizing the source keypoints. Bottom right: the same frame after normalizing the source keypoints. In this case, the normalization step allows our model to generate the feet of the subjects on the floor.

IV Experiments

IV-A Setup

A separate model is trained on the collected frames of each training video with 2, 3, 4 and 5 people, respectively, at a resolution of 1024×512 pixels. The models are then evaluated on unseen test frames. Each model is trained separately in three stages. In the first stage, a global generator is used for training the model. In the second stage, the model is refined with a local enhancer generator, and finally FaceGAN is used in the last stage. For our experiments we use the Adam optimizer with a learning rate of 0.0002 and β = 0.999. For all experiments the batch size is set to 1. We also set λ_VGG = 10. We trained our models for about 168,000 iterations, which required about 35 hours in total on an RTX 2080Ti. As a baseline we reproduce the results from [3] for single-to-single person motion transfer.
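For reference, the snippet below reproduces these hyperparameters in a PyTorch-style setup; the networks are trivial placeholders rather than the actual multi-stage generators, and since the paper only states β = 0.999, we assume it refers to Adam's second moment coefficient, with β1 left at a GAN-typical 0.5.

```python
import torch
import torch.nn as nn

# Placeholder networks; the real models are pix2pixHD-style multi-scale generators/discriminators.
G = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(6, 64, 3, padding=1), nn.LeakyReLU(0.2), nn.Conv2d(64, 1, 3, padding=1))

# Learning rate 0.0002 as stated; beta1 = 0.5 is an assumption, beta2 = 0.999 from the text.
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

BATCH_SIZE = 1      # one 1024x512 frame per iteration
LAMBDA_VGG = 10.0   # weight of the VGG perceptual loss term in the generator objective
```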

Evaluation Metrics We measure the quality of the synthesized frames using four metrics: 1) Peak Signal-to-Noise Ratio (PSNR), which measures pixel-level similarity between generated images and ground truth, 2) Structural Similarity (SSIM) [33], which compares two images in terms of luminance, contrast, and structure, 3) Learned Perceptual Image Patch Similarity (LPIPS) [40], which measures the perceptual similarity between synthesized images and ground truth, and 4) Fréchet Inception Distance (FID) [17], which measures the quality of the frames of the generated videos. Higher values are better for 1) and 2), lower values for 3) and 4).
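A minimal sketch of how these four metrics could be computed with common open-source packages (scikit-image for PSNR/SSIM, the lpips package of Zhang et al. [40], and pytorch-fid) is shown below; the folder names are placeholders and the authors' exact evaluation code may differ.

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from pytorch_fid.fid_score import calculate_fid_given_paths

lpips_net = lpips.LPIPS(net='alex')  # learned perceptual metric [40]


def frame_metrics(generated, ground_truth):
    """generated, ground_truth: HxWx3 uint8 frames of the same size."""
    psnr = peak_signal_noise_ratio(ground_truth, generated)
    ssim = structural_similarity(ground_truth, generated, channel_axis=2)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_net(to_tensor(generated), to_tensor(ground_truth)).item()
    return psnr, ssim, lp


# FID compares the two sets of frames as distributions rather than frame by frame;
# the directory names below are placeholders.
fid = calculate_fid_given_paths(['generated_frames/', 'ground_truth_frames/'],
                                batch_size=50, device='cpu', dims=2048)
```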

IV-B Quantitative Results

We compare our approach quantitatively for different numbers of people. We perform the evaluation on held-out test data and report our results in Table I. We first report results for single-to-single transfer as a baseline. For this model, almost four times more data is available for training than for multiple-person transfer, which may partially explain the performance gap between the baseline and our models. Overall, our models provide satisfying results given the few constraints we imposed when collecting the video pairs. Surprisingly, our 5-to-5 model performs especially well on the FID and LPIPS metrics, which suggests that its results should be more convincing to the human eye. In order to refine our results, and since acceptable face keypoints for the 3-to-3 and 5-to-5 videos were available, we added 60 supplemental face landmarks to our model and report our findings in Table II. This addition brought a strong improvement in terms of FID to the 3-to-3 model and a smaller improvement to the 5-to-5 model. However, we emphasize that these improvements rely heavily on the quality of the pose estimator's predictions; such additional facial landmarks may not always be available or of sufficient quality.

Model    FID ↓    LPIPS ↓   PSNR ↑    SSIM ↑
1-to-1   20.843   0.060     37.837    0.950
2-to-2   25.280   0.085     36.576    0.947
3-to-3   24.364   0.178     34.135    0.865
4-to-4   26.797   0.211     33.319    0.843
5-to-5    8.510   0.088     35.666    0.925
TABLE I: The 1-to-1 model is trained with 23,000 frames as in [3]. The other models are trained with around 5,600 frames due to the unavailability of more frames.
Model    Face KP   FID ↓    LPIPS ↓   PSNR ↑    SSIM ↑
3-to-3   8         24.364   0.178     34.135    0.865
5-to-5   8          8.510   0.088     35.666    0.925
3-to-3   68        20.210   0.173     34.214    0.865
5-to-5   68         8.086   0.088     33.110    0.830
TABLE II: Our best models are further optimized using 68 face keypoints instead of only eight.

IV-C Qualitative Results

Transfer results for multiple source and target subjects can be seen in the teaser figure and in Figure 3. The advantage of using target normalization is clearly visible in the teaser figure, where the input subjects are shifted to the left as in the learned target video. This property is important since the target scene could contain physical objects onto which the persons would otherwise mistakenly be projected. We show results for more difficult face poses in Figure 3, with one example in which a turning face is handled properly and another in which our model struggles. In the latter case, the turning head of the dancer in the input video has never been seen in a similar fashion during training for the target subject. Therefore, our model cannot handle the projection of the back of the head and produces a strong artifact instead of generating hair. Furthermore, failure cases for previously unseen extreme poses are illustrated in Figure 6. As shown in Table I, as the number of subjects to transfer grows, the performance of the model generally declines, which is to be expected when increasing the difficulty of the task without altering the parameters of the model. However, the 5-to-5 model delivers contradictory results. We show qualitative results for this model in Figure 4. These results, and particularly the faces, are convincingly smooth. We notice the similarities in clothing between the target subjects. This setup not only boosts the performance metrics for this video due to the clothing, but also allows better performance for the whole scene. We argue that this is related to the limited number of parameters available to our model: fewer parameters are required to learn the appearances of the lower bodies, so more parameters are available to accurately represent faces and upper bodies. Future work could progressively increase the number of subjects and investigate the model size required to reach optimal transfer performance. Finally, identity switches are handled correctly, as shown in Figure 5. However, such tracking is highly dependent on the quality of the pose estimator. Therefore, for input scenes in which a subject disappears for a long time behind another subject, the track may be lost, requiring a new mapping from source to target. While our results demonstrate the feasibility of convincingly transferring multiple persons at the same time, we find that the quality of the synthesized faces still needs improvement. Furthermore, extreme arm and face poses remain challenging.

Figure 3: Given a source video [37] (top) and two different target videos from [8] (middle) and [6] (bottom), our approach transfers the movements of the people in the top row to the people in the middle and bottom rows. While the middle row handles the turning face (right), the bottom row cannot manage the face pose and produces a dark artifact in the center of the face.
Figure 4: Results for our 5-to-5 person model. Input appearance from [5] (top) followed by our results (bottom) on [4]. Due to the relatively uniform target clothes, the results are particularly smooth.
Figure 5: Transfer results on [7] (bottom) with input subjects from [10] (top) switching places without identity switch.
Figure 6: Failure cases. Input appearance from [5] (top) followed by our results (bottom) on [4]. If our model has seen only relatively small motions during training, it cannot handle extreme poses.

V Conclusion

We extended and generalized the concept of video-based human motion transfer to multiple persons using a relatively simple yet efficient model. We addressed the pose normalization of multiple subjects and the potential identity switches when different actors change places. Our method, while using only a few thousand frames, delivers high-quality videos of a target group of persons following the visual instructions of another group, even generating convincing shadows. However, our results are strongly limited by the available data for the target group, which is difficult to collect. Furthermore, input and target videos are required to share a similar perspective. Future work could focus on the training data and on extracting even more information, such as semantic masks, dense poses, or clothing information. A potential application is to create photo-realistic avatars from synthesized poses in order to efficiently render individuals anonymous and thereby facilitate the generation of new realistic data in the target domain.

References

  • [1] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [2] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
  • [3] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In IEEE International Conference on Computer Vision (ICCV), 2019.
  • [4] BollyX YouTube channel. Bollywood workout. https://www.youtube.com/watch?v=ZxFbrdY_i8Y&list=PLXAfj_NH9m0DO3SHcFojg-Wthnm0tWNqn&index=29, Jul 2019.
  • [5] Sunny Funny Fitness YouTube channel. Chica. chung ha. https://www.youtube.com/watch?v=5Y5L3FXkHVM&list=PLXAfj_NH9m0BYfum_AwXgReAQm_wXHPNN&index=7, Aug 2019.
  • [6] Sunny Funny Fitness YouTube channel. 4 week diet challenge- day 16. https://www.youtube.com/watch?v=yoXpVh13BrE&t=509s, Jan 2020.
  • [7] Sunny Funny Fitness YouTube channel. Circus. https://www.youtube.com/watch?v=KNsWzSJe1yo&list=PLXAfj_NH9m0DO3SHcFojg-Wthnm0tWNqn&index=39&t=160s, Aug 2020.
  • [8] Sunny Funny Fitness YouTube channel. Ring ring. https://www.youtube.com/watch?v=5IVqO55N3Mk&list=PLXAfj_NH9m0BYfum_AwXgReAQm_wXHPNN&index=1, Jan 2020.
  • [9] Sunny Funny Fitness YouTube channel. Shape of you. https://www.youtube.com/watch?v=q2izeugK_xM&list=PLXAfj_NH9m0BYfum_AwXgReAQm_wXHPNN&index=3, Jan 2020.
  • [10] KYARA dance cover YouTube channel. Twice-cheer up. https://www.youtube.com/watch?v=f9sn3JQeWsE&list=PLXAfj_NH9m0DO3SHcFojg-Wthnm0tWNqn&index=49, Nov 2017.
  • [11] Alexei A. Efros, Alexander C. Berg, Greg Mori, and Jitendra Malik. Recognizing action at a distance. In IEEE International Conference on Computer Vision, pages 726–733, Nice, France, 2003.
  • [12] Matteo Fabbri, Fabio Lanzi, Simone Calderara, Andrea Palazzi, Roberto Vezzani, and Rita Cucchiara. Learning to detect and track visible and occluded body joints in a virtual world. In European Conference on Computer Vision (ECCV), 2018.
  • [13] J. P. Ferreira, T. M. Coutinho, T. L. Gomes, J. F. Neto, R. Azevedo, R. Martins, and E. R. Nascimento. Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio. Computers & Graphics, 94:11 – 21, 2021.
  • [14] Francis Engelmann, Theodora Kontogianni, Alexander Hermans, and Bastian Leibe. Exploring spatial context for 3d semantic segmentation of point clouds. In IEEE International Conference on Computer Vision, 3DRMS Workshop, ICCV, 2017.
  • [15] Thomas Golda, Tobias Kalb, Arne Schumann, and Jürgen Beyerer. Human Pose Estimation for Real-World Crowded Scenarios. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019.
  • [16] T. L. Gomes, R. Martins, J. Ferreira, and E. R. Nascimento. Do as i do: Transferring human motion and appearance between monocular videos with spatial and temporal constraints. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 3355–3364, 2020.
  • [17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.
  • [18] incorporesony YouTube channel. Girls like you. https://www.youtube.com/watch?v=Diddxm9hYEY&list=PLXAfj_NH9m0BYfum_AwXgReAQm_wXHPNN&index=6, Jul 2019.
  • [19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
  • [20] J. N. Kundu, H. Buckchash, P. Mandikal, R. M. V, A. Jamkhandi, and R. V. Babu. Cross-conditioned recurrent networks for long-term synthesis of inter-person human motion interactions. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2713–2722, 2020.
  • [21] Igor Kviatkovsky, Nadav Bhonker, and Gerard Medioni. From real to synthetic and back: Synthesizing training data for multi-person scene understanding, 2020.
  • [22] Christoph Lassner, Gerard Pons-Moll, and Peter V Gehler. A generative model of people in clothing. In Proceedings of the IEEE International Conference on Computer Vision, pages 853–862, 2017.
  • [23] Ji Liu, Heshan Liu, Mang-Tik Chiu, Yu-Wing Tai, and Chi-Keung Tang. Pose-guided high-resolution appearance transfer via progressive training. arXiv preprint arXiv:2008.11898, 2020.
  • [24] Zhaoxiang Liu, Huan Hu, Zipeng Wang, Kai Wang, Jinqiang Bai, and Shiguo Lian. Video synthesis of human upper body with realistic face. In 2019 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pages 200–202. IEEE, 2019.
  • [25] Manuel Martin, Alina Roitberg, Monica Haurilet, Matthias Horne, Simon Reiß, Michael Voit, and Rainer Stiefelhagen. Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In The IEEE International Conference on Computer Vision (ICCV), Oct 2019.
  • [26] Yifang Men, Yiming Mao, Yuning Jiang, Wei-Ying Ma, and Zhouhui Lian. Controllable person image synthesis with attribute-decomposed gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [27] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In International conference on machine learning, pages 2642–2651, 2017.
  • [28] Jian Ren, Menglei Chai, Sergey Tulyakov, Chen Fang, Xiaohui Shen, and Jianchao Yang. Human motion transfer from poses in the wild. In Adrien Bartoli and Andrea Fusiello, editors, Computer Vision – ECCV 2020 Workshops, pages 262–279, Cham, 2020. Springer International Publishing.
  • [29] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
  • [30] Umer Rafi, Andreas Doering, Bastian Leibe, and Juergen Gall. Self-supervised keypoint correspondences for multi-person pose estimation and tracking in videos. arXiv preprint arXiv:2004.12652, 2020.
  • [31] Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Bryan Catanzaro, and Jan Kautz. Few-shot video-to-video synthesis. Advances in Neural Information Processing Systems, 32:5013–5024, 2019.
  • [32] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018.
  • [33] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [34] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [35] Yuecong Xu, Jianfei Yang, Haozhi Cao, Kezhi Mao, Jianxiong Yin, and Simon See. Arid: A new dataset for recognizing action in the dark, 2020.
  • [36] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
  • [37] BollyX YouTube channel. Coca cola. https://www.youtube.com/watch?v=CcLbVm1gSQM&list=PLXAfj_NH9m0BYfum_AwXgReAQm_wXHPNN&index=4, Feb 2019.
  • [38] Mihai Zanfir, Alin-Ionut Popa, Andrei Zanfir, and Cristian Sminchisescu. Human appearance transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5391–5399, 2018.
  • [39] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017.
  • [40] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • [41] Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, and Tamara Berg. Dance dance generation: Motion transfer for internet videos. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.
  • [42] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
  • [43] Shizhan Zhu, Raquel Urtasun, Sanja Fidler, Dahua Lin, and Chen Change Loy. Be your own prada: Fashion synthesis with structural coherence. In Proceedings of the IEEE international conference on computer vision, pages 1680–1688, 2017.