
Towards Scalable Unpaired Virtual Try-On via Patch-Routed Spatially-Adaptive GAN

Zhenyu Xie1,  Zaiyu Huang1,  Fuwei Zhao1,  Haoye Dong1
Michael Kampffmeyer2, Xiaodan Liang1,3
1Shenzhen Campus of Sun Yat-Sen University
2UiT The Arctic University of Norway, 3Peng Cheng Laboratory
{xiezhy6,huangzy225,zhaofw,donghy7}@mail2.sysu.edu.cn
[email protected], [email protected]
Xiaodan Liang is the corresponding author. Our code will be available at PASTA-GAN.
Abstract

Image-based virtual try-on is one of the most promising applications of human-centric image generation due to its tremendous real-world potential. Yet, as most try-on approaches fit in-shop garments onto a target person, they require the laborious and restrictive construction of a paired training dataset, severely limiting their scalability. While a few recent works attempt to transfer garments directly from one person to another, alleviating the need to collect paired datasets, their performance is impacted by the lack of paired (supervised) information. In particular, disentangling style and spatial information of the garment becomes a challenge, which existing methods either address by requiring auxiliary data or extensive online optimization procedures, thereby still inhibiting their scalability. To achieve a scalable virtual try-on system that can transfer arbitrary garments between a source and a target person in an unsupervised manner, we thus propose a texture-preserving end-to-end network, the PAtch-routed SpaTially-Adaptive GAN (PASTA-GAN), that facilitates real-world unpaired virtual try-on. Specifically, to disentangle the style and spatial information of each garment, PASTA-GAN consists of an innovative patch-routed disentanglement module for successfully retaining garment texture and shape characteristics. Guided by the source person keypoints, the patch-routed disentanglement module first decouples garments into normalized patches, thus eliminating the inherent spatial information of the garment, and then reconstructs the normalized patches to the warped garment complying with the target person pose. Given the warped garment, PASTA-GAN further introduces novel spatially-adaptive residual blocks that guide the generator to synthesize more realistic garment details. Extensive comparisons with paired and unpaired approaches demonstrate the superiority of PASTA-GAN, highlighting its ability to generate high-quality try-on images when faced with a large variety of garments (e.g. vests, shirts, pants), taking a crucial step towards real-world scalable try-on.

Figure 1: Example virtual try-on results from our PASTA-GAN, which is flexible for various try-on scenarios, e.g., garment transfer for the upper body, the lower body, and the full body.

1 Introduction

Image-based virtual try-on, the process of computationally transferring a garment onto a particular person in a query image, is one of the most promising applications of human-centric image generation with the potential to revolutionize shopping experiences and reduce purchase returns. However, to fully exploit its potential, scalable solutions are required that can leverage easily accessible training data, handle arbitrary garments, and provide efficient inference results. Unfortunately, to date, most existing methods bochao2018cpvton ; yun2019vtnfp ; han2019clothflow ; dong2019fwgan ; han2020acgpn ; ge2021dcton ; ge2021pfafn ; choi2021vitonhd ; xie2021wasvton ; Zhao202m3dvton rely on paired training data, i.e., a person image and its corresponding in-shop garment, leading to laborious data-collection processes. Furthermore, these methods are unable to exchange garments directly between two person images, thus largely limiting their application scenarios and raising the need for unpaired solutions to ensure scalability.

While unpaired solutions have recently started to emerge, performing virtual try-on in an unsupervised setting is extremely challenging and tends to affect the visual quality of the try-on results. Specifically, without access to the paired data, these models are usually trained by reconstructing the same person image, which is prone to over-fitting, and thus underperform when handling garment transfer during testing. The performance discrepancy is mainly reflected in the garment synthesis results, in particular the shape and texture, which we argue is caused by the entanglement of the garment style and spatial representations in the synthesis network during the reconstruction process.

Traditional paired try-on approaches bochao2018cpvton ; han2019clothflow ; han2020acgpn ; ge2021pfafn avoid this problem and preserve the garment characteristics by utilizing a supervised warping network to obtain the warped garment in the target shape; this is not possible in the unpaired setting due to the lack of warped ground truth. The few works that do attempt to achieve unpaired virtual try-on therefore choose to circumvent this problem by either relying on person images in various poses for feature disentanglement men2020adgan ; sarkar2020nhrr ; sarkar2021humangan ; sarkar2021posestylegan ; albahar2021pose ; Cui2021dior , which again leads to a laborious data-collection process, or requiring extensive online optimization procedures neuberger2020oviton ; ira2021vogue to obtain fine-grained details of the original garments, harming the inference efficiency. However, none of the existing unpaired try-on methods directly consider the problem of coupled style and spatial garment information, which is crucial for obtaining accurate garment transfer results in the unpaired and unsupervised virtual try-on scenario.

In this paper, to tackle the essential challenges mentioned above, we propose a novel PAtch-routed SpaTially-Adaptive GAN, named PASTA-GAN, a scalable solution to the unpaired try-on task. Our PASTA-GAN can precisely synthesize garment shape and style (see Fig. 1) by introducing a patch-routed disentanglement module that decouples the garment style and spatial features, as well as a novel spatially-adaptive residual module to mitigate the problem of feature misalignment.

The innovation of our PASTA-GAN includes three aspects: First, by separating the garments into normalized patches with the inherent spatial information largely reduced, the patch-routed disentanglement module encourages the style encoder to learn spatial-agnostic garment features. These features enable the synthesis network to generate images with accurate garment style regardless of varying spatial garment information. Second, given the target human pose, the normalized patches can be easily reconstructed to the warped garment complying with the target shape, without requiring a warping network or a 3D human model. Finally, the spatially-adaptive residual module extracts the warped garment feature and adaptively inpaints the region that is misaligned with the target garment shape. Thereafter, the inpainted warped garment features are embedded into the intermediate layer of the synthesis network, guiding the network to generate try-on results with realistic garment texture.

We collect a scalable UnPaired virtual Try-on (UPT) dataset and conduct extensive experiments on the UPT dataset and two existing try-on benchmark datasets (i.e., the DeepFashion liuLQWTcvpr16DeepFashion and MPV Dong2019mgvton datasets). Experimental results demonstrate that our unsupervised PASTA-GAN outperforms both previous unpaired and paired try-on approaches.

2 Related Work

Figure 2: Overview of the inference process. (a) Given the source and target person images $(I_s, I_t)$, we extract the source garment $G_s$, the source pose $J_s$, and the target pose $J_t$. The three are then sent to the patch-routed disentanglement module to yield the normalized garment patches $P_n$ and the warped garment $G_t$. (b) The modified conditional StyleGAN2 first collaboratively exploits the disentangled style code $w$, projected from $P_n$, and the person identity feature $f_{id}$, encoded from the target head and pose $(H_t, J_t)$, to synthesize the coarse try-on result $\tilde{I}_t^{\prime}$ in the style synthesis branch along with the target garment mask $M_g$. It then leverages the warped garment feature $f_g$ in the texture synthesis branch to generate the final try-on result $I_t^{\prime}$.

Paired Virtual Try-on. Paired try-on methods xintong2018viton ; bochao2018cpvton ; yun2019vtnfp ; han2019clothflow ; Matiur2020cpvton+ ; han2020acgpn ; ge2021dcton ; ge2021pfafn ; xie2021wasvton aim to transfer an in-shop garment onto a reference person. Among them, VITON xintong2018viton for the first time integrates a U-Net ronneberger2015unet based generation network with a TPS bookstein1989TPS based deformation approach to synthesize the try-on result. CP-VTON bochao2018cpvton improves this paradigm by replacing the time-consuming warping module with a trainable geometric matching module. VTNFP yun2019vtnfp adopts human parsing to guide the generation of various body parts, while Matiur2020cpvton+ ; han2020acgpn ; Zhao202m3dvton introduce a smooth constraint for the warping module to alleviate the excessive distortion in TPS warping. Besides the TPS-based warping strategy, han2019clothflow ; xie2021wasvton ; ge2021pfafn turn to the flow-based warping scheme which models the per-pixel deformation. Recently, VITON-HD choi2021vitonhd focuses on high-resolution virtual try-on and proposes an ALIAS normalization mechanism to resolve the garment misalignment. PF-AFN ge2021pfafn improves the learning process by employing knowledge distillation, achieving state-of-the-art results. However, all of these methods require paired training data and are incapable of exchanging garments between two person images.

Unpaired Virtual Try-on. Different from the above methods, some recent works men2020adgan ; sarkar2020nhrr ; sarkar2021humangan ; sarkar2021posestylegan ; neuberger2020oviton ; ira2021vogue eliminate the need for in-shop garment images and directly transfer garments between two person images. Among them, men2020adgan ; sarkar2020nhrr ; sarkar2021humangan ; sarkar2021posestylegan ; albahar2021pose ; Cui2021dior leverage pose transfer as the pretext task to learn disentangled pose and appearance features for human synthesis, but require images of the same person in different poses. (As the concurrent work StylePoseGAN sarkar2021posestylegan is the most related pose-transfer-based approach, we provide a more elaborate discussion of the inherent differences in the supplementary material.) In contrast, neuberger2020oviton ; ira2021vogue are more flexible and can be directly trained with unpaired person images. However, OVITON neuberger2020oviton requires online appearance optimization for each garment region during testing to maintain the texture detail of the original garment, while VOGUE ira2021vogue needs to separately optimize the latent codes for each person image and the interpolation coefficients for the final try-on result during testing. Therefore, existing unpaired methods require either cumbersome data collection or extensive online optimization, severely harming their scalability in real scenarios.

3 PASTA-GAN

Given a source image $I_s$ of a person wearing a garment $G_s$, and a target person image $I_t$, the unpaired virtual try-on task aims to synthesize the try-on result $I_t^{\prime}$ retaining the identity of $I_t$ but wearing the source garment $G_s$. To achieve this, our PASTA-GAN first utilizes the patch-routed disentanglement module (Sec. 3.1) to transform the garment $G_s$ into normalized patches $P_n$ that are mostly agnostic to the spatial features of the garment, and further deforms $P_n$ to obtain the warped garment $G_t$ complying with the target person pose. Then, an attribute-decoupled conditional StyleGAN2 (Sec. 3.2) is designed to synthesize try-on results in a coarse-to-fine manner, where we introduce novel spatially-adaptive residual blocks (Sec. 3.3) to inject the warped garment features into the generator network for more realistic texture synthesis. The loss functions and training details will be described in Sec. 3.4. Fig. 2 illustrates the overview of the inference process for PASTA-GAN.

3.1 Patch-routed Disentanglement Module

Since paired data for supervised training is unavailable in the unpaired virtual try-on task, the synthesis network has to be trained in an unsupervised manner via image reconstruction: it takes a person image as input and separately extracts the feature of the intact garment and the feature of the person representation to reconstruct the original person image. While such a training strategy retains the intact garment information, which is helpful for the garment reconstruction, the features of the intact garment entangle the garment style with the spatial information in the original image. This is detrimental to garment transfer during testing. Note that the garment style here refers to the garment color and category (e.g., long sleeve, short sleeve), while the garment spatial information refers to the location, the orientation, and the relative size of the garment patch in the person image; the first two are influenced by the human pose, while the third is determined by the relative camera distance to the person.

To address this issue, we explicitly divide the garment into normalized patches to remove its inherent spatial information. Taking the sleeve patch as an example, through division and normalization, various sleeve regions from different person images can be deformed to normalized patches with the same orientation and scale. Without the guidance of the spatial information, the network is forced to learn the garment style feature to reconstruct the garment in the synthesized image.

Figure 3: The process of the patch-routed deformation. Please zoom in for more details.

Fig. 3 illustrates the process of obtaining normalized garment patches, which includes two main steps: (1) pose-guided garment segmentation, and (2) perspective transformation-based patch normalization. Specifically, in the first step, the source garment $G_s$ and human pose (joints) $J_s$ are first obtained by applying Gong2019Graphonomy and openpose to the source person $I_s$, respectively. Given the body joints, we can segment the source garment into several patches $P_s$, which can be quadrilaterals of arbitrary shape (e.g., rectangle, square, trapezoid) and will later be normalized. Taking the torso region as an example, with the coordinates of the left/right shoulder joints and the left/right hip joints in $P_s^i$, a quadrilateral crop (of which the four corner points are visualized in color in $P_s^i$ of Fig. 3) covering the torso region of $G_s$ can be easily performed to produce an unnormalized garment patch. Note that we define eight patches for upper-body garments, i.e., the patches around the left/right upper/lower arm, the patches around the left/right hips, a patch around the torso, and a patch around the neck. In the second step, all patches are normalized to remove their spatial information by perspective transformations. For this, we first define the same number of template patches $P_n$ with a fixed $64\times 64$ resolution as transformation targets for all unnormalized source patches, and then compute a homography matrix $\mathcal{H}_{s\rightarrow n}^{i}\in\mathbb{R}^{3\times 3}$ zhou2019stnhomography for each pair of $P_s^i$ and $P_n^i$, based on the four corresponding corner points of the two patches. Concretely, $\mathcal{H}_{s\rightarrow n}^{i}$ serves as a perspective transformation that relates the pixel coordinates in the two patches, formulated as:

\left[\begin{array}{c} x_{n}^{i} \\ y_{n}^{i} \\ 1 \end{array}\right] = \mathcal{H}_{s\rightarrow n}^{i} \left[\begin{array}{c} x_{s}^{i} \\ y_{s}^{i} \\ 1 \end{array}\right] = \left[\begin{array}{ccc} h_{11}^{i} & h_{12}^{i} & h_{13}^{i} \\ h_{21}^{i} & h_{22}^{i} & h_{23}^{i} \\ h_{31}^{i} & h_{32}^{i} & h_{33}^{i} \end{array}\right] \left[\begin{array}{c} x_{s}^{i} \\ y_{s}^{i} \\ 1 \end{array}\right] \qquad (1)

where $(x_{n}^{i}, y_{n}^{i})$ and $(x_{s}^{i}, y_{s}^{i})$ are the pixel coordinates in the normalized template patch and the unnormalized source patch, respectively. To compute the homography matrix $\mathcal{H}_{s\rightarrow n}^{i}$, we directly leverage the OpenCV API, which takes as inputs the corner points of the two patches and is implemented by using least-squares optimization and the Levenberg-Marquardt method gavin2019levenberg . After obtaining $\mathcal{H}_{s\rightarrow n}^{i}$, we can transform the source patch $P_s^i$ to the normalized patch $P_n^i$ according to Eq. 1.
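To make this step concrete, the following is a minimal sketch of the patch normalization using the OpenCV API mentioned above; the joint-to-corner assignment and all variable names are illustrative assumptions rather than the authors' released pre-processing code.

```python
import cv2
import numpy as np

PATCH_SIZE = 64  # resolution of the normalized template patches P_n

def normalize_patch(person_img, corner_pts):
    """Warp one quadrilateral garment patch of I_s to a 64x64 normalized patch.

    person_img: H x W x 3 source person image I_s (numpy array).
    corner_pts: (4, 2) array with the patch corners in I_s, e.g. left/right
                shoulder and left/right hip joints for the torso patch.
    """
    # Corners of the fixed 64x64 template patch, in the same corner order.
    template_pts = np.float32([[0, 0],
                               [PATCH_SIZE - 1, 0],
                               [PATCH_SIZE - 1, PATCH_SIZE - 1],
                               [0, PATCH_SIZE - 1]])
    # Homography H_{s->n} of Eq. (1) from the four point correspondences;
    # OpenCV solves it by least squares with Levenberg-Marquardt refinement.
    H_s2n, _ = cv2.findHomography(np.float32(corner_pts), template_pts)
    # Resample the source patch onto the normalized template grid (P_n^i).
    patch_n = cv2.warpPerspective(person_img, H_s2n, (PATCH_SIZE, PATCH_SIZE))
    return patch_n, H_s2n
```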

Moreover, the normalized patches $P_n$ can further be transformed to target garment patches $P_t$ by utilizing the target pose $J_t$, which can be obtained from the target person $I_t$ via openpose . The mechanism of that backward transformation is equivalent to the forward one in Eq. 1, i.e., computing the homography matrix $\mathcal{H}_{n\rightarrow t}^{i}$ based on the four point pairs extracted from the normalized patch $P_n^i$ and the target pose $J_t$. The recovered target patches $P_t$ can then be stitched to form the warped garment $G_t$ that will be sent to the texture synthesis branch in Fig. 2 to generate more realistic garment transfer results. We can also regard $\mathcal{H}_{s\rightarrow t}=\mathcal{H}_{n\rightarrow t}\cdot\mathcal{H}_{s\rightarrow n}$ as the combined deformation matrix that warps the source garment to the target person pose, bridged by an intermediate normalized patch representation that is helpful for disentangling garment styles and spatial features.
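The backward transform and the composed warp can be sketched in the same way; this continues the previous snippet (cv2, np, and PATCH_SIZE already defined) and the function name and arguments are again illustrative assumptions.

```python
def warp_patch_to_target(person_img, src_corner_pts, tgt_corner_pts, out_size):
    """Warp a source garment patch directly into the target person's frame."""
    template_pts = np.float32([[0, 0],
                               [PATCH_SIZE - 1, 0],
                               [PATCH_SIZE - 1, PATCH_SIZE - 1],
                               [0, PATCH_SIZE - 1]])
    H_s2n, _ = cv2.findHomography(np.float32(src_corner_pts), template_pts)
    H_n2t, _ = cv2.findHomography(template_pts, np.float32(tgt_corner_pts))
    H_s2t = H_n2t @ H_s2n  # combined deformation H_{s->t} = H_{n->t} . H_{s->n}
    # out_size = (width, height) of the target person image; stitching the
    # warped patches of all body parts yields the coarse warped garment G_t.
    return cv2.warpPerspective(person_img, H_s2t, out_size)
```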

3.2 Attribute-decoupled Conditional StyleGAN2

Motivated by the impressive performance of StyleGAN2 karras2020stylegan2 in the field of image synthesis, our PASTA-GAN inherits the main architecture of StyleGAN2 and modifies it into a conditional version (see Fig. 2). In the synthesis network, the normalized patches $P_n$ are projected to the style code $w$ through a style encoder followed by a mapping network; this code is spatially agnostic thanks to the disentanglement module. In parallel, the conditional information, including the target head $H_t$ and pose $J_t$, is encoded by the identity encoder into a feature map $f_{id}$ that captures the identity of the target person. Thereafter, the synthesis network starts from the identity feature map and injects the style code into each synthesis block to generate the try-on result $\tilde{I}_t^{\prime}$.
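A schematic PyTorch sketch of this conditioning path is given below; the layer configuration, channel sizes, and the stacked-patch and pose-heatmap input layouts are assumptions for illustration and do not correspond to the released architecture.

```python
import torch
import torch.nn as nn

class ConditioningPath(nn.Module):
    """Produces the style code w and the identity feature map f_id."""
    def __init__(self, patch_channels=24, style_dim=512, id_channels=512):
        super().__init__()
        # Style encoder: normalized patches P_n (here assumed to be 8 patches
        # stacked along the channel axis, 8 x 3 = 24 channels) -> garment code.
        self.style_encoder = nn.Sequential(
            nn.Conv2d(patch_channels, 64, 3, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 3, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, style_dim))
        # Mapping network: garment code -> disentangled style code w.
        self.mapping = nn.Sequential(
            nn.Linear(style_dim, style_dim), nn.LeakyReLU(0.2),
            nn.Linear(style_dim, style_dim))
        # Identity encoder: target head H_t + pose heatmaps J_t -> f_id.
        self.id_encoder = nn.Sequential(
            nn.Conv2d(3 + 18, 128, 3, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, id_channels, 3, 2, 1), nn.LeakyReLU(0.2))

    def forward(self, patches, head, pose_map):
        w = self.mapping(self.style_encoder(patches))           # style code w
        f_id = self.id_encoder(torch.cat([head, pose_map], 1))  # identity map
        # w is injected into every synthesis block; f_id seeds the generator.
        return w, f_id
```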

However, the standalone conditional StyleGAN2 is insufficient to generate compelling garment details, especially in the presence of complex textures or logos. For example, although the illustrated $\tilde{I}_t^{\prime}$ in Fig. 2 can recover the accurate garment style (color and shape) given the disentangled style code $w$, it lacks the complete texture pattern. The reasons for this are twofold: First, the style encoder projects the normalized patches into a one-dimensional vector, resulting in a loss of high-frequency information. Second, due to the large variety of garment textures, learning the local distribution of particular garment details is highly challenging for the basic synthesis network.

To generate more accurate garment details, instead of only having a one-way synthesis network, we intentionally split PASTA-GAN into two branches after the $128\times 128$ synthesis block, namely the Style Synthesis Branch (SSB) and the Texture Synthesis Branch (TSB). The SSB, with normal StyleGAN2 synthesis blocks, aims to generate intermediate try-on results $\tilde{I}_t^{\prime}$ with accurate garment style and to predict a precise garment mask $M_g$ that will be used by the TSB. The purpose of the TSB is to exploit the warped garment $G_t$, which contains rich texture information to guide the synthesis path, and to generate high-quality try-on results. We introduce a novel spatially-adaptive residual module right before the final synthesis block of the TSB, which embeds the warped garment feature $f_g$ (obtained by passing $M_g$ and $G_t$ through the garment encoder) into the intermediate features and then sends them to the newly designed spatially-adaptive residual blocks, which are beneficial for successfully synthesizing the texture of the final try-on result $I_t^{\prime}$. The details of this module are described in the following section.

Figure 4: Illustration of the misalignment between the warped garment and the target garment shape. The orange and green regions represent the regions to be inpainted and removed, respectively.

3.3 Spatially-adaptive Residual Module

Given the style code that factors out the spatial information and only keeps the style information of the garment, the style synthesis branch in Fig. 2 can accurately predict the mean color and the shape mask of the target garment. However, its inability to model the complex texture raises the need to exploit the warped garment $G_t$ to provide features that encode high-frequency texture patterns, which is in fact the motivation of the target garment reconstruction in Fig. 3.

However, as the coarse warped garment $G_t$ is directly obtained by stitching the target patches together, its shape is inaccurate and usually misaligns with the predicted mask $M_g$ (see Fig. 4). Such shape misalignment in $G_t$ will consequently reduce the quality of the extracted warped garment feature $f_g$.

To address this issue, we introduce the spatially-adaptive residual module between the last two synthesis blocks in the texture synthesis branch, as shown in Fig. 2. The module comprises a garment encoder and three spatially-adaptive residual blocks with a feature inpainting mechanism, which modulate the intermediate features by leveraging the inpainted warped garment feature.

To be specific, in the feature inpainting process we first remove the part of $G_t$ that falls outside of $M_g$ (green region in Fig. 4), and explicitly inpaint the misaligned regions of the feature map within $M_g$ with average feature values (orange region in Fig. 4). The inpainted feature map can then help the final synthesis block infer reasonable texture in the internal misaligned parts.

Therefore, given the predicted garment mask $M_g$, the coarse warped garment $G_t$, and its mask $M_t$, the feature inpainting process can be formulated as:

M_{align}=M_{g}\cap M_{t}, \qquad (2)
M_{misalign}=M_{g}-M_{align}, \qquad (3)
f^{\prime}_{g}=\mathcal{E}_{g}(G_{t}\odot M_{g}), \qquad (4)
f_{g}=f_{g}^{\prime}\odot(1-M_{misalign})+\mathcal{A}(f_{g}^{\prime}\odot M_{align})\odot M_{misalign}, \qquad (5)

where $\mathcal{E}_{g}(\cdot)$ represents the garment encoder and $f_{g}^{\prime}$ denotes the raw feature map of $G_t$ masked by $M_g$. $\mathcal{A}(\cdot)$ calculates the average garment features and $f_g$ is the final inpainted feature map.
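The feature inpainting of Eqs. (2)-(5) can be sketched in PyTorch as follows; the garment encoder is a placeholder, and the resizing of the image-level masks to the feature resolution is an assumption about how the two are matched.

```python
import torch
import torch.nn.functional as F

def inpaint_warped_garment_feature(garment_encoder, G_t, M_g, M_t):
    """G_t: warped garment image, M_g: predicted garment mask,
    M_t: mask of the coarse warped garment (N x C/1 x H x W tensors)."""
    M_align = M_g * M_t                    # Eq. (2): aligned region
    M_misalign = M_g - M_align             # Eq. (3): region to be inpainted
    f_g_raw = garment_encoder(G_t * M_g)   # Eq. (4): raw garment feature f'_g

    # Match the image-level masks to the spatial size of the feature map.
    size = f_g_raw.shape[-2:]
    m_align = F.interpolate(M_align, size=size, mode='nearest')
    m_mis = F.interpolate(M_misalign, size=size, mode='nearest')

    # Eq. (5): keep aligned features, fill the misaligned region with the
    # average feature computed over the aligned region.
    avg = (f_g_raw * m_align).sum(dim=(2, 3), keepdim=True) / \
          (m_align.sum(dim=(2, 3), keepdim=True) + 1e-6)
    f_g = f_g_raw * (1 - m_mis) + avg * m_mis
    return f_g
```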

Subsequently, inspired by the SPADE ResBlk from SPADE park2019spade , the inpainted garment features are used to calculate a set of affine transformation parameters that efficiently modulate the normalized feature map within each spatially-adaptive residual block. The normalization and modulation process for a particular sample $h_{z,y,x}$ at location ($z\in C$, $y\in H$, $x\in W$) in a feature map can then be formulated as:

\gamma_{z,y,x}(f_{g})\frac{h_{z,y,x}-\mu_{z}}{\sigma_{z}}+\beta_{z,y,x}(f_{g}), \qquad (6)

where $\mu_{z}=\frac{1}{HW}\sum_{y,x}h_{z,y,x}$ and $\sigma_{z}=\sqrt{\frac{1}{HW}\sum_{y,x}\left(h_{z,y,x}-\mu_{z}\right)^{2}}$ are the mean and standard deviation of the feature map along channel $C$. $\gamma_{z,y,x}(\cdot)$ and $\beta_{z,y,x}(\cdot)$ are the convolution operations that convert the inpainted feature to affine parameters.

Eq. 6 serves as a learnable normalization layer for the spatially-adaptive residual block to better capture the statistical information of the garment feature map, thus helping the synthesis network to generate more realistic garment texture.
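A minimal PyTorch sketch of the modulation in Eq. 6 is shown below, assuming the inpainted garment feature $f_g$ has already been resized to the spatial size of the modulated feature map $h$; the hidden width and kernel sizes are illustrative, not the authors' configuration.

```python
import torch.nn as nn

class SpatiallyAdaptiveNorm(nn.Module):
    """Normalizes h per channel over (H, W) and re-modulates it with
    spatially-varying gamma/beta predicted from the garment feature f_g."""
    def __init__(self, feat_channels, cond_channels, hidden=128):
        super().__init__()
        # Parameter-free normalization: per-channel mean/std over H, W (Eq. 6).
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, h, f_g):
        ctx = self.shared(f_g)
        # Spatially-varying affine parameters applied to the normalized map.
        return self.to_gamma(ctx) * self.norm(h) + self.to_beta(ctx)
```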

With the modulated intermediate feature maps produced by the spatially-adaptive residual module, the texture synthesis branch can effectively utilize the reconstructed warped garment and generate the final compelling try-on result with high-frequency texture patterns.

3.4 Loss Functions and Training Details

As paired training data is unavailable, our PASTA-GAN is trained in an unsupervised manner via image reconstruction. During training, we utilize the reconstruction loss $\mathcal{L}_{rec}$ and the perceptual loss johnson2016perceptual $\mathcal{L}_{perc}$ for both the coarse try-on result $\widetilde{I}^{\prime}$ and the final try-on result $I^{\prime}$:

\mathcal{L}_{rec}=\sum_{I\in\{\widetilde{I}^{\prime},I^{\prime}\}}\|I-I_{s}\|_{1} \quad \text{and} \quad \mathcal{L}_{perc}=\sum_{I\in\{\widetilde{I}^{\prime},I^{\prime}\}}\sum_{k=1}^{5}\lambda_{k}\left\|\phi_{k}(I)-\phi_{k}\left(I_{s}\right)\right\|_{1}, \qquad (7)

where $\phi_{k}(I)$ denotes the $k$-th feature map in a VGG-19 network DBLP:journals/corr/SimonyanZ14a pre-trained on the ImageNet ILSVRC15 dataset. We also use the $L_{1}$ loss between the predicted garment mask $M_{g}$ and the real mask $M_{gt}$, which is obtained via human parsing Gong2019Graphonomy :

\mathcal{L}_{mask}=\|M_{g}-M_{gt}\|_{1}. \qquad (8)

Besides, for both $\widetilde{I}^{\prime}$ and $I^{\prime}$, we calculate the adversarial loss $\mathcal{L}_{GAN}$, which is the same as in StyleGAN2 karras2020stylegan2 . The total loss can be formulated as

\mathcal{L}=\mathcal{L}_{GAN}+\lambda_{rec}\mathcal{L}_{rec}+\lambda_{perc}\mathcal{L}_{perc}+\lambda_{mask}\mathcal{L}_{mask}, \qquad (9)

where $\lambda_{rec}$, $\lambda_{perc}$, and $\lambda_{mask}$ are the trade-off hyper-parameters.
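For reference, a hedged sketch of this objective is given below; the VGG feature extractor, discriminator output, and per-layer weights $\lambda_k$ are placeholders, and the adversarial term uses a standard non-saturating logistic loss in place of the full StyleGAN2 objective with its regularizers.

```python
import torch
import torch.nn.functional as F

def pasta_gan_generator_loss(I_coarse, I_final, I_s, M_g, M_gt,
                             d_logits_fake, vgg_features, lambdas_k,
                             lam_rec=40.0, lam_perc=40.0, lam_mask=100.0):
    outs = [I_coarse, I_final]
    # Eq. (7): L1 reconstruction + VGG perceptual loss on both try-on results.
    l_rec = sum(F.l1_loss(I, I_s) for I in outs)
    l_perc = sum(lam_k * F.l1_loss(phi_I, phi_s)
                 for I in outs
                 for lam_k, phi_I, phi_s in zip(lambdas_k,
                                                vgg_features(I),
                                                vgg_features(I_s)))
    # Eq. (8): L1 mask loss against the parsing-derived garment mask.
    l_mask = F.l1_loss(M_g, M_gt)
    # Generator adversarial term (non-saturating logistic GAN loss).
    l_gan = F.softplus(-d_logits_fake).mean()
    # Eq. (9): weighted total, with the paper's hyper-parameters as defaults.
    return l_gan + lam_rec * l_rec + lam_perc * l_perc + lam_mask * l_mask
```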

Figure 5: Comparison among the source garment and different warped garments.

During training, although the source and target pose are the same, the coarse warped garment $G_t$ is not identical to the intact source garment $G_s$, due to the crop mechanism in the patch-routed disentanglement module. More specifically, the quadrilateral crop for $G_s$ is by design not seamless, and there will accordingly often exist small seams between adjacent patches in $G_t$ as well as incompleteness along the boundary of the torso region. To further reduce the training-test gap of the warped garment, we introduce two random erasing operations during the training phase. First, we randomly remove one of the four arm patches in the warped garment with a probability of $\alpha_1$. Second, we use the random mask from liu2018partialconv to additionally erase parts of the warped garment with a probability of $\alpha_2$. Both erasing operations imitate self-occlusion in the source person image. Fig. 5 illustrates the process by displaying the source garment $G_s$, the warped garment $G_t^{\prime}$ that is obtained by directly stitching the warped patches together, and the warped garment $G_t$ that is sent to the network. We can observe a considerable difference between $G_t$ and $G_s$. An ablation experiment validating the necessity of the random erasing operations for the unsupervised training is included in the supplementary material.
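A hedged sketch of the two erasing operations is given below; the mask sources, tensor layout, and function names are illustrative assumptions, with only the two probabilities $\alpha_1$ and $\alpha_2$ taken from the text.

```python
import random

def random_erase_warped_garment(G_t, arm_patch_masks, irregular_mask,
                                alpha1, alpha2):
    """G_t: warped garment tensor (C x H x W); arm_patch_masks: four binary
    masks (1 x H x W) covering the arm patches; irregular_mask: a random
    free-form mask as in partial-convolution inpainting."""
    out = G_t.clone()
    # 1) With probability alpha1, drop one of the four arm patches.
    if random.random() < alpha1:
        out = out * (1 - random.choice(arm_patch_masks))
    # 2) With probability alpha2, erase additional parts with a random mask.
    if random.random() < alpha2:
        out = out * (1 - irregular_mask)
    return out
```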

4 Experiments

Datasets. We conduct experiments on two existing virtual try-on benchmark datasets (MPV Dong2019mgvton dataset and DeepFashion liuLQWTcvpr16DeepFashion dataset) and our newly collected large-scale benchmark dataset for unpaired try-on, named UPT. UPT contains 33,254 half- and full-body front-view images of persons wearing a large variety of garments, e.g., long/short sleeve, vest, sling, pants, etc. UPT is further split into a training set of 27,139 images and a testing set of 6,115 images. In addition, we also pick out the front view images from MPV Dong2019mgvton and DeepFashion liuLQWTcvpr16DeepFashion to expand the size of our training and testing set to 54,714 and 10,493, respectively. Personally identifiable information (i.e. face information) has been masked out.

Metrics. We apply the Fréchet Inception Distance (FID) parmar2021cleanfid to measure the similarity between real and synthesized images, and perform a human evaluation to quantitatively assess the synthesis quality of the different methods. For the human evaluation, we design three questionnaires corresponding to the three datasets used. In each questionnaire, we randomly select 40 try-on results generated by our PASTA-GAN and the other compared methods. We then invite 30 volunteers to complete the 40 tasks by choosing the most realistic try-on result for each. Finally, the human evaluation score of a method is the percentage of times it was chosen.

Implementation Details. Our PASTA-GAN is implemented using PyTorch paszke2019pytorch and is trained on 8 Tesla V100 GPUs. During training, the batch size is set to 96 and the model is trained for 4 million iterations with a learning rate of 0.002 using the Adam optimizer Kingma2014adam with $\beta_1=0$ and $\beta_2=0.99$. The loss hyper-parameters $\lambda_{rec}$, $\lambda_{perc}$, and $\lambda_{mask}$ are set to 40, 40, and 100, respectively. The hyper-parameters for the random erasing probabilities $\alpha_1$ and $\alpha_2$ are set to 0.2 and 0.9, respectively. (Additional details on the UPT dataset, e.g., data distribution and pre-processing, as well as on the human evaluation, training, and inference time analysis, are provided in the supplementary material.)

Baselines. To validate the effectiveness of our PASTA-GAN, we compare it with state-of-the-art methods that have released official code and pre-trained weights, including three paired virtual try-on methods, CP-VTON bochao2018cpvton , ACGPN han2020acgpn , and PFAFN ge2021pfafn , and two unpaired methods, Liquid Warping GAN lwb2019 and ADGAN men2020adgan . (For all these prior approaches, research use is permitted according to the respective licenses.) Note that we are unable to compare with sarkar2021posestylegan , ira2021vogue , and neuberger2020oviton as they have not released their code or pre-trained models. We directly use the pre-trained models of the compared methods, as their training procedures depend on paired garment-person or person-person images, which are unavailable in our dataset. When testing paired methods under the unpaired try-on setting, we extract the desired garment from the person image and regard it as the in-shop garment to meet the input requirement of the paired approaches. To fairly compare with the paired methods, we further conduct another experiment on the paired MPV dataset Dong2019mgvton , in which the paired methods take an in-shop garment and a person image as inputs, while our PASTA-GAN still directly receives two person images. See the following two subsections for detailed comparisons in both the paired and unpaired settings.

4.1 Comparison with state-of-the-art methods on the unpaired benchmark

Quantitative: As reported in Table 1, when testing on the DeepFashion liuLQWTcvpr16DeepFashion and the UPT dataset under the unpaired setting, our PASTA-GAN outperforms both the paired methods bochao2018cpvton ; han2020acgpn ; ge2021pfafn and the unpaired methods men2020adgan ; lwb2019 by a large margin, obtaining the lowest FID score and the highest human evaluation score, demonstrating that PASTA-GAN can generate more photo-realistic images. Note that, although ADGAN men2020adgan is trained on the DeepFashion dataset, our PASTA-GAN still surpasses it. Since the data in the DeepFashion dataset is more complicated than the data in UPT, the FID scores for the DeepFashion dataset are generally higher than the FID scores for the UPT dataset.

Table 1: The FID score parmar2021cleanfid and human evaluation score among different methods under the unpaired setting on the DeepFashion dataset liuLQWTcvpr16DeepFashion and our UPT dataset.
Method                        DeepFashion                   UPT
                              FID ↓   Human Evaluation ↑    FID ↓   Human Evaluation ↑
CP-VTON bochao2018cpvton      69.46   2.177%                70.76   1.551%
ACGPN han2020acgpn            44.41   4.597%                37.99   3.448%
PFAFN ge2021pfafn             46.19   4.677%                36.69   4.224%
ADGAN men2020adgan            37.36   21.29%                39.60   7.241%
Liquid Warping GAN lwb2019    42.18   12.98%                33.18   9.310%
PASTA-GAN (Ours)              21.58   54.27%                7.852   74.22%
Figure 6: Visual comparison among PASTA-GAN and the baseline methods under the unpaired setting on the UPT dataset. Please zoom in for more details.

Qualitative: As shown in Fig. 6, under the unpaired setting, PASTA-GAN is capable of generating more realistic and accurate try-on results. On the one hand, the paired methods bochao2018cpvton ; han2020acgpn ; ge2021pfafn tend to fail in deforming the cropped garment to the target shape, resulting in distorted warped garments that are largely misaligned with the target body parts. On the other hand, the unpaired method ADGAN men2020adgan cannot preserve the garment texture and the person identity well due to its severe overfitting on the DeepFashion dataset. Liquid Warping GAN lwb2019 , another publicly available appearance transfer model, heavily relies on the 3D body model SMPL matthew2015smpl to obtain the appearance transfer flow. It is sensitive to the prediction accuracy of the SMPL parameters, and is thus prone to incorrectly transferring the appearance of other body parts (e.g., hand, lower body) into the garment region when the SMPL predictions are inaccurate. In comparison, benefiting from the patch-routed mechanism, PASTA-GAN can learn appropriate garment features and predict a precise garment shape. Further, the spatially-adaptive residual module can leverage the warped garment feature to guide the network to synthesize try-on results with realistic garment textures. Note that, in the top-left example of Fig. 6, our PASTA-GAN seems to smooth out the belt region. The reason for this is a parsing error. Specifically, the human parsing label set liang2018look does not designate a label for the belt, so the parsing estimator Gong2019Graphonomy assigns the belt region to another label (e.g., pants, upper clothes, or background). For this particular example, the belt region is assigned the background label. This means that the pants obtained according to the predicted human parsing do not contain the belt, which is therefore contained in neither the normalized patches nor the warped pants. The style synthesis branch then predicts the precise mask for the pants (including the belt region) and the texture synthesis branch inpaints the belt region with white according to the features of the pants.

Table 2: The FID score parmar2021cleanfid and human evaluation score among different methods under their corresponding test setting on the MPV dataset Dong2019mgvton .
Method               CP-VTON bochao2018cpvton   ACGPN han2020acgpn   PFAFN ge2021pfafn   PASTA-GAN (Ours)
FID ↓                37.72                      23.20                17.40               16.48
Human Evaluation ↑   8.071%                     12.64%               28.71%              50.57%
Figure 7: Visual comparison among PASTA-GAN and the paired baseline methods under their corresponding test setting on the MPV dataset Dong2019mgvton . Please zoom in for more details.
Figure 8: Qualitative and quantitative results of the ablation study with different configurations, in which SSB, TSB, GP, NRB, and SRB refer to the style synthesis branch, texture synthesis branch, garment patches, normal residual blocks, and spatially-adaptive residual blocks, respectively.

4.2 Comparison with state-of-the-art methods on the paired benchmark

Quantitative: Tab. 2 presents the quantitative comparison on the MPV dataset Dong2019mgvton , in which the paired methods are tested under the classical paired setting, i.e., transferring an in-shop garment onto a reference person. Our unpaired PASTA-GAN nevertheless surpasses the paired methods, especially the state-of-the-art PFAFN ge2021pfafn , in both FID and human evaluation score, further evidencing the superiority of our PASTA-GAN.

Qualitative: Under the paired setting, the visual quality of the paired methods improves considerably, as shown in Fig. 7. The paired methods depend on TPS-based or flow-based warping architectures to deform the whole garment, which may distort texture and shape since the global interpolation or pixel-level correspondence is error-prone under large pose variation. Our PASTA-GAN, instead, warps semantic garment patches separately to alleviate the distortion and preserve the original garment texture to a larger extent. Besides, the paired methods are unable to handle garments such as slings that rarely appear in the dataset, and perform poorly on full-body images. Our PASTA-GAN instead generates compelling results even in these challenging scenarios.

4.3 Ablation Studies

Patch-routed Disentanglement Module: To validate its effectiveness, we train two PASTA-GANs without the texture synthesis branch, denoted as PASTA-GAN$\star$ and PASTA-GAN*, which take the intact garment and the garment patches, respectively, as input to the style encoder. As shown in Fig. 8, PASTA-GAN$\star$ fails to generate an accurate garment shape. In contrast, PASTA-GAN*, which factors out the spatial information of the garment, can focus more on the garment style information, leading to accurate synthesis of the garment shape. However, without the texture synthesis branch, both of them are unable to synthesize detailed garment texture. The models with the texture synthesis branch preserve the garment texture well, as illustrated in Fig. 8.

Spatially-adaptive Residual Module: To validate the effectiveness of this module, we further train two PASTA-GANs with the texture synthesis branch, denoted as PASTA-GAN$\dagger$ and PASTA-GAN$\ddagger$, which exclude the style synthesis branch and replace the spatially-adaptive residual blocks with normal residual blocks, respectively. Without the support of the corresponding components, both PASTA-GAN$\dagger$ and PASTA-GAN$\ddagger$ fail to fix the garment misalignment problem, leading to artifacts outside the target shape and blurred texture synthesis results. The full PASTA-GAN, instead, can generate try-on results with a precise garment shape and texture details. The quantitative comparison results in Fig. 8 further validate the effectiveness of our designed modules.

5 Conclusion

We propose the PAtch-routed SpaTially-Adaptive GAN (PASTA-GAN) to facilitate scalable unpaired virtual try-on. By utilizing the novel patch-routed disentanglement module and the spatially-adaptive residual module, PASTA-GAN effectively disentangles garment style and spatial information and generates realistic and accurate virtual try-on results without requiring auxiliary data or extensive online optimization procedures. Experiments highlight PASTA-GAN's ability to handle a large variety of garments, outperforming previous methods in both the paired and the unpaired setting.

We believe that this work will inspire new scalable approaches, facilitating the use of the large amount of available unlabeled data. However, as with most generative applications, misuse of these techniques is possible in the form of image forgeries, i.e. warping of unwanted garments with malicious intent.

Acknowledgments and Disclosure of Funding

We would like to thank all the reviewers for their constructive comments. Our work was supported in part by National Key R&D Program of China under Grant No. 2018AAA0100300, National Natural Science Foundation of China (NSFC) under Grant No.U19A2073 and No.61976233, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No.2019B1515120039, Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), Shenzhen Fundamental Research Program (Project No. RCYX20200714114642083, No. JCYJ20190807154211365), CSIG Youth Fund.

References

  • (1) Badour AlBahar, Jingwan Lu, Jimei Yang, Zhixin Shu, Eli Shechtman, and Jia-Bin Huang. Pose with Style: Detail-preserving pose-guided image synthesis with conditional stylegan. ACM Transactions on Graphics, 2021.
  • (2) F.L. Bookstein. Principal warps: thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6):567–585, 1989.
  • (3) Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(01):172–186, 2021.
  • (4) Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14131–14140, 2021.
  • (5) Aiyu Cui, Daniel McKee, and Svetlana Lazebnik. Dressing in order: Recurrent person image generation for pose transfer, virtual try-on and outfit editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14638–14647, 2021.
  • (6) Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin. Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9026–9035, 2019.
  • (7) Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bowen Wu, Bing-Cheng Chen, and Jian Yin. Fw-gan: Flow-navigated warping gan for video virtual try-on. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1161–1170, 2019.
  • (8) Henri P. Gavin. The levenberg-marquardt algorithm for nonlinear least squares curve-fitting problems. 2013.
  • (9) Chongjian Ge, Yibing Song, Yuying Ge, Han Yang, Wei Liu, and Ping Luo. Disentangled cycle consistency for highly-realistic virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16928–16937, 2021.
  • (10) Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. Parser-free virtual try-on via distilling appearance flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8485–8493, 2021.
  • (11) Ke Gong, Yiming Gao, Xiaodan Liang, Xiaohui Shen, Meng Wang, and Liang Lin. Graphonomy: Universal human parsing via graph transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7450–7459, 2019.
  • (12) Xintong Han, Xiaojun Hu, Weilin Huang, and Matthew R. Scott. Clothflow: A flow-based model for clothed person generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10471–10480, 2019.
  • (13) Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7543–7552, 2018.
  • (14) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pages 694–711, 2016.
  • (15) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8110–8119, 2020.
  • (16) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • (17) Kathleen M Lewis, Srivatsan Varadharajan, and Ira Kemelmacher-Shlizerman. Vogue: Try-on by stylegan interpolation optimization. In arXiv preprint arXiv:2101.02285, 2021.
  • (18) Xiaodan Liang, Ke Gong, Xiaohui Shen, and Liang Lin. Look into person: Joint body parsing & pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):871 – 885, 2018.
  • (19) Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In European Conference on Computer Vision (ECCV), page 85–100, 2018.
  • (20) Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao. Liquid warping gan: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5904–5913, 2019.
  • (21) Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1096–1104, 2016.
  • (22) Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: a skinned multi-person linear model. ACM Transactions on Graphics, 34(6), 2015.
  • (23) Yifang Men, Yiming Mao, Yuning Jiang, Wei-Ying Ma, and Zhouhui Lian. Controllable person image synthesis with attribute-decomposed gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5084–5093, 2020.
  • (24) Matiur Rahman Minar, Thai Thanh Tuan, Heejune Ahn, Paul Rosin, and Yu-Kun Lai. Cp-vton+: Clothing shape and texture preserving image-based virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020.
  • (25) Assaf Neuberger, Eran Borenstein, Bar Hilleli, Eduard Oks, and Sharon Alpert. Image based virtual try-on network from unpaired data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5184–5193, 2020.
  • (26) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2337–2346, 2019.
  • (27) Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in fid calculation. In arXiv preprint arXiv:2104.11222, 2021.
  • (28) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
  • (29) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241, 2015.
  • (30) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • (31) Kripasindhu Sarkar, Vladislav Golyanik, Lingjie Liu, and Christian Theobalt. Style and pose control for image synthesis of humans from a single monocular view. In arXiv preprint arXiv:2102.11263, 2021.
  • (32) Kripasindhu Sarkar, Lingjie Liu, Vladislav Golyanik, and Christian Theobalt. Humangan: A generative model of humans images. In arXiv preprint arXiv:2103.06902, 2021.
  • (33) Kripasindhu Sarkar, Dushyant Mehta, Weipeng Xu, Vladislav Golyanik, and Christian Theobalt. Neural re-rendering of humans from a single image. In European Conference on Computer Vision (ECCV), pages 596–613, 2020.
  • (34) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, International Conference on Learning Representations (ICLR), 2015.
  • (35) Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward characteristic-preserving image-based virtual try-on network. In European Conference on Computer Vision (ECCV), pages 589–604, 2018.
  • (36) Zhenyu Xie, Xujie Zhang, Fuwei Zhao, Haoye Dong, Michael C. Kampffmeyer, Haonan Yan, and Xiaodan Liang. Was-vton: Warping architecture search for virtual try-on network. In Proceedings of the 29th ACM International Conference on Multimedia (ACM MM), pages 3350–3359, 2021.
  • (37) Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards photo-realistic virtual try-on by adaptively generating preserving image content. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7850–7859, 2020.
  • (38) Ruiyun Yu, Xiaoqi Wang, and Xiaohui Xie. Vtnfp: An image-based virtual try-on network with body and clothing feature preservation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10511–10520, 2019.
  • (39) Fuwei Zhao, Zhenyu Xie, Michael Kampffmeyer, Haoye Dong, Songfang Han, Tianxiang Zheng, Tao Zhang, and Xiaodan Liang. M3d-vton: A monocular-to-3d virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13239–13249, 2021.
  • (40) Qiang Zhou and Xin Li. Stn-homography: estimate homography parameters directly. In arXiv preprint arXiv:1906.02539, 2019.