
Shape Controllable Virtual Try-on for Underwear Models

Xin Gao1, Zhenjiang Liu1, Zunlei Feng2, Chengji Shen2, Kairi Ou1, Haihong Tang1, Mingli Song2
1Alibaba Group, 2Zhejiang University
zimu.gx, stan.lzj, [email protected], zunleifeng, chengji.shen, [email protected], [email protected]
(2021)
Abstract.

The image virtual try-on task has abundant applications and has become a hot research topic recently. Existing 2D image-based virtual try-on methods aim to transfer a target clothing image onto a reference person, and have two main disadvantages: they cannot control the size and length precisely, and they cannot accurately estimate the user’s figure when the user wears thick clothes, resulting in an inaccurate dressing effect. In this paper, we put forward an akin task that aims to dress clothing on underwear models. To address the above drawbacks, we propose a Shape Controllable Virtual Try-On Network (SC-VTON), in which a graph attention network integrates the information of the model and the clothing to generate the warped clothing image. In addition, control points are incorporated into SC-VTON to specify the desired clothing shape. Furthermore, by adding a Splitting Network and a Synthesis Network, we can use clothing/model pair data to help optimize the deformation module and generalize the task to the typical virtual try-on task. Extensive experiments show that the proposed method achieves accurate shape control. Meanwhile, compared with other methods, our method can generate high-resolution results with detailed textures.

Virtual try-on; Graph attention networks; Image warping
journalyear: 2021; copyright: acmlicensed; conference: Proceedings of the 29th ACM International Conference on Multimedia, October 20–24, 2021, Virtual Event, China; booktitle: Proceedings of the 29th ACM International Conference on Multimedia (MM ’21), October 20–24, 2021, Virtual Event, China; price: 15.00; doi: 10.1145/3474085.3475210; isbn: 978-1-4503-8651-7/21/10; ccs: Applied computing, Online shopping; ccs: Computing methodologies, Image-based rendering

1. Introduction

Figure 1. Shape controllable visual results. The shape can be continuously controlled in length and tightness.

The online virtual try-on system (Han et al., 2019, 2018; Issenhuth et al., 2019, 2020; Jandial et al., 2020; Li et al., 2020; Minar et al., 2020a; Raffiee and Sollami, 2020; Ren et al., 2020; Roy et al., 2020; Wang et al., 2018b; Wu et al., 2019; Yang et al., 2020; Yu et al., 2019) has become a research hot-spot in recent years to accommodate the vast market demand. Virtual try-on applications provide users with a convenient what-you-see-is-what-you-get function: users can pick a clothing item and virtually put it on themselves to see whether it matches their taste. It remains a big challenge for virtual try-on algorithms to make the results both realistic and attractive.

Existing virtual try-on techniques can be divided into 3D virtual try-on methods and 2D image-based methods. 3D virtual try-on methods need to model characters and clothing and then render the digitalized clothing on the character, achieving controllable and detailed results. However, modeling characters and clothing is costly and time-consuming, and the rendering process also demands high performance from mobile devices.

Han et al. (Han et al., 2018) proposed a 2D image-based virtual try-on task that adopts model and clothing pairs to train a clothing deformation sub-network and then composites the deformed clothing and coarse model with a generator. Following (Han et al., 2018), many 2D image-based methods (Issenhuth et al., 2020; Wang et al., 2018b; Yang et al., 2020; Yu et al., 2019) have been proposed. Most of them focus on finding a better deformation module or a more realistic generator. However, few of these methods can control the clothing shape, which is a significant capability for the virtual try-on task. What’s more, existing 2D virtual try-on methods use parsing and key point information of the clothed model image as input, which cannot accurately capture the model’s figure and leads to inaccurate dressing effects. Limited by low-resolution datasets and methodological constraints, most methods can only generate low-resolution results and produce obviously blurry outputs at high resolution, which is not applicable in the online e-commerce scene.

This paper proposes the task of dressing underwear models with clothing, which is also an urgent demand for online clothing shops that need to exhibit new clothing efficiently. The model’s figure can be obtained more easily when the model wears underwear than when wearing thick clothing.

Considering the real application requirements and the drawbacks of existing 2D image-based methods, we propose a Shape Controllable Virtual Try-On Network (SC-VTON) for the underwear model dressing task. As shown in Fig. 1, SC-VTON can achieve continuous shape control in tightness and length. SC-VTON uses a GAT (Graph Attention Network) (Veličković et al., 2017) to learn the deformation offsets of the meshed clothing. A shape information extraction module is used to extract the features of the underwear model and the clothing. The extracted feature maps are concatenated as shape vectors in the channel dimension. With the guidance of shape vectors and control points (discrete points near the silhouette of the model), the GAT predicts coordinate offsets of all mesh vertices. The warped clothing image is then obtained with the differentiable rendering module. The underwear model and clothing pairs for training SC-VTON are generated with an annotation tool based on As-Rigid-As-Possible (ARAP) deformation (Sorkine and Alexa, 2007), with clothing key points as control points.

Furthermore, in-shop clothing/model pairs are adopted to optimize SC-VTON. To generate self-loop supervision information, we devise a splitting network to predict the underwear model, and a synthesis network to synthesize the clothed model from the underwear model and the clothing image. By combining SC-VTON with the pre-trained splitting network and synthesis network, the self-loop supervision enhances the robustness of SC-VTON in real applications. What’s more, with the splitting and synthesis networks, the proposed SC-VTON can be easily extended to the typical virtual try-on task for a model in clothing.

Therefore, our approach is the first method for the underwear model dressing task and can adaptively generate high-quality results at different resolutions. The proposed SC-VTON is the first method that achieves continuous shape control in length and tightness. What’s more, the devised self-loop optimization improves the robustness of SC-VTON and extends the framework to the typical virtual try-on task for a model in clothing. Extensive experiments demonstrate that our method achieves more pleasing visual results than state-of-the-art methods.

2. Related Works

2.1. Virtual Try-on

Virtual try-on systems have been an attractive topic even before the renaissance of deep learning.

3D virtual try-on systems based on human reconstruction in computer graphics have been widely researched, such as (Alldieck et al., 2019; Bhatnagar et al., 2019; Li et al., 2010; Loper et al., 2015; Umetani et al., 2011). While Physics-Based Simulation (PBS) can accurately drape a 3D garment on a 3D body (Gundogdu et al., 2019; Patel et al., 2020; Santesteban et al., 2019; Vidaurre et al., 2020), it remains too costly for real-time applications. 3D human estimation methods try to digitalize real-world 2D character photos into 3D models and have made significant progress with the popularity of deep learning, such as (Guler and Kokkinos, 2019; Kanazawa et al., 2018; Kolotouros et al., 2019; Xu et al., 2019). Although (Habermann et al., 2020; Jiang et al., 2020; Saito et al., 2019; Saito et al., 2020) also digitize characters together with garments, these methods still need improvement in texture authenticity.

Recently, 2D image-based virtual try-on tasks have gained much attention from academia and industry. The task is highly challenging because it requires warping the clothing onto the target person while preserving its patterns and characteristics.

Han et al. (Han et al., 2018) proposed a 2D image-based virtual try-on task that adopts model and clothing pairs to train a clothing deformation sub-network and then composites the deformed clothing and coarse model with a generation network. Based on (Han et al., 2018), Wang et al. (Wang et al., 2018b) refined the architecture with a geometric matching module to learn the parameters of a Thin-Plate Spline (TPS) deformation (Wood, 2003). Considering the complexity of poses and self-occlusion cases, Yu et al. (Yu et al., 2019) and Raffiee et al. (Raffiee and Sollami, 2020) took parsing information into account and used a generator to predict the target model’s parsing result, which generates better results in self-occlusion cases. What’s more, Wu et al. (Wu et al., 2019) adopted DensePose (Alp Güler et al., 2018) to get more precise information about the model. Minar et al. (Minar et al., 2020a) solved the problem by first mapping the 2D clothing texture to a 3D character model, deforming the clothing in 3D space, and then mapping it back to 2D space. Chaudhuri et al. (Chaudhuri et al., 2021) proposed a method to generate diverse high-fidelity texture maps for 3D human meshes in a semi-supervised setup. On the other hand, some researchers focus on improving the deformation module. In (Issenhuth et al., 2019, 2020; Jandial et al., 2020; Yang et al., 2020), spatial transformer networks (Jaderberg et al., 2015) are adopted to get more accurate deformation results. Roy et al. (Roy et al., 2020) adopted human and fashion landmarks to help optimize the thin-plate spline deformation model. Han et al. (Han et al., 2019) and Ren et al. (Ren et al., 2020) adopted optical flow to get a more natural texture deformation effect. Li et al. (Li et al., 2020) tried to get better results by using multiple specialized warps. Ge et al. (Ge et al., 2021) proposed a knowledge distillation method to produce images without human parsing.

Based on the high-resolution images generated by StyleGAN (Karras et al., 2019), Yildirim et al. (Yildirim et al., 2019) trained a conditional StyleGAN for generating virtual try-on images with the clothing image as conditional input on around 380K entries. Cheng et al. (Cheng et al., 2020) embedded human geometries into the latent space as independent codes and achieved flexible and continuous control of geometries via mixing and interpolation operations in explicit style representations. However, acceptable visual results of StyleGAN usually require hundreds of thousands of samples, which are very difficult to obtain.

Figure 2. An overview of the whole framework. Stage I: The shape information extraction module extracts feature maps from the underwear model and the clothing image with CNN extractors. The clothing is meshed and represented as a graph. The GAT based deformation module predicts the coordinate offsets, taking the clothing graph as input under the guidance of control points and shape information extracted from the model and clothing feature maps. The warped clothing image is obtained with the differentiable rendering module. Stage II: The splitting network and the synthesis network are added to SC-VTON (Stage I). The whole framework is extended to the typical virtual try-on task for a clothed model image. Then, the whole framework can be trained with real pairs (clothing image and clothed model image).

2.2. GCN based Deformation

Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016) extend the convolution operation from Euclidean data (such as images) to more general topological structures and are widely used in natural language processing, recommendation systems, computer vision, and computer graphics. Wang et al. (Wang et al., 2018a) represented 3D meshes in a graph-based CNN and produced correct geometry by progressively deforming an ellipsoid. Based on (Wang et al., 2018a), Wen et al. (Wen et al., 2019) used a GCN to solve the problem of shape generation in 3D mesh representation from a few color images with known camera poses. For the virtual try-on task, Vidaurre et al. (Vidaurre et al., 2020) used a GCN for parametric predefined 2D panels with arbitrary mesh topology. Kolotouros et al. (Kolotouros et al., 2019) adopted a Graph-CNN to estimate 3D human pose and shape from a single image. Zhu et al. (Zhu et al., 2020) used a GCN to predict 3D clothing feature lines and used them to warp the clothing.

Graph Attention Networks (GATs) (Veličković et al., 2017) assign different weights to different nodes in a neighborhood, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions. In this work, we use a GAT to obtain more precise warping results.

3. Methods

Different from the typical virtual try-on task, the primary goal of this paper is to dress an underwear model: given an underwear model image $I_m$ and a clothing image $I_c$, we aim to generate a model image $I_g$ wearing the clothing with the same pose as $I_m$. It is hard to capture the triplet $\{I_m, I_c, I_g\}$, i.e., the underwear model and the same model wearing the clothing in the same pose at the same time. We adopt an annotation tool based on ARAP with manual fine-tuning to get pseudo-labeled pairs $\{I_m, I_c, I_{g'}\}$, which are used as ground truth pairs for SC-VTON in stage I (Section 3.1). Furthermore, real pairs $\{I_c, I_g\}$ are used to improve the robustness of SC-VTON in application scenarios. To adapt to real pairs $\{I_c, I_g\}$, a splitting network and a synthesis network are added to SC-VTON, which extends it to handle the virtual try-on task for a model in clothing (Section 3.2). The whole framework is illustrated in Fig. 2.

3.1. GAT based Clothing Deformation for Underwear Models

The GAT based clothing deformation contains three parts: the shape information extraction module, the control points guided deformation module and the differentiable rendering module.

The shape information extraction module is devised for extracting features from the clothing image and the underwear model image. The extracted feature maps are concatenated as shape vectors and sent to the deformation module as part of the guiding information. For the control points guided deformation module, the clothing is first meshed with the Delaunay Triangulation algorithm (Lee and Schachter, 1980); the meshed coordinate points are set as vertices $V$ of a graph $G=(V,E)$, where $E$ denotes the edge set (see supplementary materials for more details). For each vertex in $V$, the vertex's information is expressed as a vertex vector. Next, the shape control points and the shape vectors are integrated into the corresponding vertex vector as the deformation's guiding information. The GAT predicts coordinate offsets of all mesh vertices, and the warped clothing image is obtained with the differentiable rendering module. The whole deformation process is trained in an iterative manner to obtain higher accuracy. The following subsections give detailed descriptions.

3.1.1. Shape Information Extraction Module

The shape information extraction module contains two branches. The first branch extracts the model's shape feature maps $F_{model}$ from the original model image and the model's parsing image with a CNN extractor. The second branch extracts clothing feature maps $F_{cloth}$ from the original clothing image and the clothing's parsing image with the same architecture (not sharing weights). It is worth noting that the clothing image has been coarsely aligned to the model image with some key points as references. For each vertex in the clothing graph, Perceptual Feature Pooling (Wang et al., 2018a) is adopted to map the vertex's coordinate to a feature vector's coordinate in the extracted feature maps $F_{model}$ and $F_{cloth}$. Next, the feature vectors $x_{model}\in F_{model}$ and $x_{cloth}\in F_{cloth}$ at the same coordinate are concatenated as shape vectors $x_{shape}=[x_{model}\,||\,x_{cloth}]$, which are appended to the vertex vector $x$ (here $||$ denotes concatenation).
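To make the pooling step concrete, the following is a minimal sketch of bilinear per-vertex feature pooling in PyTorch; the helper name pool_features and the toy tensor shapes are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pool_features(feature_map, vertices):
    """Bilinearly sample a feature map at normalized vertex coordinates.
    feature_map: (1, C, H, W) tensor from a CNN extractor;
    vertices: (K, 2) tensor of (x, y) coordinates in [-1, 1].
    Returns a (K, C) tensor of per-vertex feature vectors."""
    grid = vertices.view(1, -1, 1, 2)                       # (1, K, 1, 2)
    sampled = F.grid_sample(feature_map, grid, mode='bilinear',
                            align_corners=True)             # (1, C, K, 1)
    return sampled.squeeze(-1).squeeze(0).t()               # (K, C)

# Toy shapes: 448-channel maps (64 + 128 + 256 pooled channels), 200 vertices.
F_model = torch.randn(1, 448, 64, 48)
F_cloth = torch.randn(1, 448, 64, 48)
verts = torch.rand(200, 2) * 2 - 1                          # normalized coords
x_model = pool_features(F_model, verts)                     # (200, 448)
x_cloth = pool_features(F_cloth, verts)                     # (200, 448)
x_shape = torch.cat([x_model, x_cloth], dim=1)              # (200, 896)
```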

3.1.2. Control Points Guided Deformation Module

Here we first introduce how to get the meshed clothing. A Deeplabv3+ model (Chen et al., 2018) is trained to get the clothing's parsing results. Then, we uniformly sample points along the contour of the clothing according to the parsing results. Next, Delaunay Triangulation (Lee and Schachter, 1980) is adopted to mesh the inner part of the clothing. Contour points and inner points form the vertex set $V$ (see supplementary materials for more details). Each vertex's information is expressed as a vertex vector $x=[v_{vertex}\,||\,v_{control}\,||\,x_{shape}]$, which is composed of three parts: the vertex's coordinate $v_{vertex}$, the control point's coordinate $v_{control}$ and the corresponding shape feature vector $x_{shape}$.
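As a concrete illustration, here is a small sketch of the meshing step using SciPy's Delaunay triangulation; the toy contour and helper names are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np
from scipy.spatial import Delaunay

def mesh_clothing(contour_points, inner_points):
    """Build the clothing graph G = (V, E) from contour and inner points.
    Both arguments are (N, 2) arrays of (x, y) coordinates taken from the
    clothing parsing result; returns vertices, triangles and undirected edges."""
    vertices = np.concatenate([contour_points, inner_points], axis=0)
    triangles = Delaunay(vertices).simplices              # (M, 3) vertex indices
    edges = set()
    for a, b, c in triangles:
        edges.update({tuple(sorted(e)) for e in [(a, b), (b, c), (a, c)]})
    return vertices, triangles, edges

# Toy example: a rectangular contour plus a few uniformly sampled inner points.
contour = np.array([[0, 0], [1, 0], [1, 2], [0, 2], [0.5, 0], [0.5, 2]], float)
inner = np.random.rand(20, 2) * [1, 2]
V, tris, E = mesh_clothing(contour, inner)
```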

Control points are discrete points near the silhouette of the model. The red points in Fig. 3 show control points for different clothing types. Control points are obtained indirectly through model key points and model parsing results. Since SC-VTON controls both the tightness and the length of the clothing, points near the model's limbs and torso are used. By slightly adjusting the control points' positions (for length, moving the points along the extension of the limbs; for tightness, dilating the model's parsing result and finding the corresponding positions on the dilated parsing), we can achieve a shape controllable virtual try-on result.
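A minimal sketch of these two adjustments is given below, assuming OpenCV is used for the dilation; the function names, the snapping rule and the limb-direction parameterization are illustrative assumptions rather than the paper's exact procedure.

```python
import cv2
import numpy as np

def adjust_tightness(parsing_mask, control_points, pixels):
    """Loosen the fit by snapping control points onto a dilated body silhouette.
    parsing_mask: (H, W) uint8 binary mask of the relevant body part;
    control_points: (K, 2) integer (x, y) points near the silhouette;
    pixels: dilation radius, larger values give a looser garment."""
    kernel = np.ones((2 * pixels + 1, 2 * pixels + 1), np.uint8)
    dilated = cv2.dilate(parsing_mask, kernel)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)        # OpenCV >= 4
    boundary = np.concatenate(contours, axis=0).reshape(-1, 2)   # (x, y) points
    moved = [boundary[np.linalg.norm(boundary - p, axis=1).argmin()]
             for p in control_points]
    return np.stack(moved)

def adjust_length(control_points, limb_direction, offset):
    """Lengthen or shorten by shifting control points along the limb direction."""
    direction = limb_direction / np.linalg.norm(limb_direction)
    return control_points + offset * direction
```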

Figure 3. Control points definitions. The control points are discrete points near the silhouette of the model. The red points show the control points with different clothing types.

GAT introduces an attention mechanism as a substitute for the statically normalized convolution operation. Below are the equations to compute the node embedding $h_i^{l+1}$ of layer $l+1$ from the embeddings of layer $l$:

\begin{align}
z_{i}^{l} &= W^{l}h_{i}^{l},\\
\alpha_{ij}^{l} &= \frac{\exp\big(\mathrm{LeakyReLU}\big({\overrightarrow{a}^{l}}^{T}(z_{i}^{l}\,||\,z_{j}^{l})\big)\big)}{\sum_{k\in\mathcal{N}(i)}\exp\big(\mathrm{LeakyReLU}\big({\overrightarrow{a}^{l}}^{T}(z_{i}^{l}\,||\,z_{k}^{l})\big)\big)},\\
h_{i}^{l+1} &= \sigma\Big(\sum_{j\in\mathcal{N}(i)}\alpha_{ij}^{l}\,z_{j}^{l}\Big),
\end{align}

where $h_i^l$ is the lower layer embedding and $W^l$ is its learnable weight matrix. The second equation computes a pair-wise normalized attention score between neighbors, where $j$ is a neighbor of node $i$: it first concatenates the $z$ embeddings of the two nodes, then takes a dot product with a learnable weight vector $\overrightarrow{a}^{l}$, and applies a LeakyReLU. Each node applies a softmax to normalize the attention scores over its incoming edges. Finally, the embeddings from neighbors are aggregated together, scaled by the attention scores, and passed through a nonlinearity $\sigma$.
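To make the computation concrete, below is a minimal single-head attention layer implementing the three equations above in PyTorch; it is a sketch rather than the authors' implementation, and the per-node softmax is written as a loop for readability rather than speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer following the three equations above."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # z_i^l = W^l h_i^l
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention vector a^l

    def forward(self, h, edges):
        """h: (K, in_dim) vertex vectors; edges: (E, 2) directed pairs (j, i),
        meaning node j sends a message to node i (self-loops included)."""
        z = self.W(h)                                      # (K, out_dim)
        src, dst = edges[:, 0], edges[:, 1]
        # Unnormalized scores e_ij = LeakyReLU(a^T [z_i || z_j]).
        e = F.leaky_relu(self.a(torch.cat([z[dst], z[src]], dim=1)),
                         negative_slope=0.2).squeeze(-1)
        # Softmax over each node's incoming edges.
        alpha = torch.zeros_like(e)
        for i in torch.unique(dst):
            mask = dst == i
            alpha[mask] = torch.softmax(e[mask], dim=0)
        # Aggregate neighbor embeddings weighted by attention, then nonlinearity.
        out = torch.zeros_like(z).index_add_(0, dst, alpha.unsqueeze(-1) * z[src])
        return F.relu(out)
```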

The network structure of the adopted GAT is shown in Fig. 4. Since the task is a deformation task using low-level features, a shallow GAT with two graph convolution blocks is used; each block is composed of three graph attention layers. The ground truth for the GAT deformation is the offset between the vertex coordinates in the clothing image $I_c$ and in the generated pseudo image $I_{g'}$. In the training stage, the vertex number $K$ of each clothing sample can differ, and samples are batched using diagonal alignment.

Figure 4. Details of the GAT based deformation module. The input contains the clothing meshes and control points, as well as clothing and model features extracted by CNN extractors. The output is the vertices' coordinate offsets. An iterative training strategy is adopted to optimize the result. For each iteration, model features are recalculated according to the updated vertex coordinates.

3.1.3. Differentiable Rendering Module

The GAT predicts the coordinate offsets of all vertices; the differentiable rendering module is then used to get the warped clothing image. We first compute an indicator matrix $\mathbb{I}\in\mathbb{R}^{H\times W\times N}$ during the preprocessing step, whose last dimension is a one-hot vector indicating which mesh each pixel belongs to ($N$ is the number of meshes). An affine transformation matrix for all meshes $M\in\mathbb{R}^{N\times 2\times 3}$ can be obtained by multiplying the coordinates of the warped mesh vertices with the inverse of the coordinates of the original mesh vertices. The warped clothing image $\tilde{I}_c$ is obtained by the following equation:

\begin{equation}
\tilde{I}_{c}=\mathrm{GridSample}\big(I_{c},\,G_{c}\odot(\mathbb{I}\times M)\big),
\end{equation}

where $I_c$ is the original clothing image and $G_c$ is the normalized grid of the clothing image. After getting the flow-grid of all pixels, GridSample (Jaderberg et al., 2015) is used to get the warped clothing image $\tilde{I}_c$. This step is differentiable, so the module can be trained in an end-to-end manner.
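A sketch of how this rendering step could be realized with PyTorch's grid_sample is given below; it assumes vertex coordinates are already normalized to [-1, 1] and ignores masking of pixels that fall outside all meshes. The function and variable names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def affine_per_mesh(src_tri, dst_tri):
    """Solve a 2x3 affine matrix per triangle mapping warped coordinates back to
    the source image (the inverse mapping is what grid sampling needs).
    src_tri, dst_tri: (N, 3, 2) vertex coordinates in [-1, 1]."""
    ones = torch.ones(dst_tri.shape[:-1] + (1,))
    A = torch.cat([dst_tri, ones], dim=-1)                     # (N, 3, 3)
    M = torch.linalg.solve(A, src_tri).transpose(1, 2)         # (N, 2, 3)
    return M

def render_warped(cloth, indicator, M):
    """cloth: (1, 3, H, W); indicator: (H, W, N) one-hot mesh membership;
    M: (N, 2, 3) per-mesh affine matrices."""
    _, _, H, W = cloth.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing='ij')
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)  # (H, W, 3)
    pixel_M = torch.einsum('hwn,nij->hwij', indicator, M)      # (H, W, 2, 3)
    flow = torch.einsum('hwij,hwj->hwi', pixel_M, grid)        # sampling coords
    # Pixels outside every mesh get a zero flow here; in practice a mask would
    # be applied before compositing the warped clothing.
    return F.grid_sample(cloth, flow.unsqueeze(0), align_corners=True)
```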

3.1.4. Iterative Training Policy

In the training stage, an iterative training policy is adopted to optimize the results continuously. After each iteration, Perceptual Feature Pooling is reused to extract a new feature vector $x_{model}$ from the feature map $F_{model}$ according to the updated mesh coordinates; see Fig. 4 for more details. In our experiments, SC-VTON achieves satisfying results with $2$-$3$ iterations.
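The loop below sketches this iterative refinement; it assumes the pool_features helper from Section 3.1.1 and a generic gat module mapping per-vertex vectors to coordinate offsets, both of which are illustrative placeholders.

```python
import torch

def deform_iteratively(gat, verts, v_control, edges, F_model, F_cloth, iters=2):
    """Iteratively refine the mesh; `pool_features` is the bilinear sampling
    helper sketched earlier, and `gat` is any module mapping per-vertex
    vectors (plus edges) to coordinate offsets."""
    for _ in range(iters):
        x_model = pool_features(F_model, verts)   # re-pool at the updated coords
        x_cloth = pool_features(F_cloth, verts)
        x = torch.cat([verts, v_control, x_model, x_cloth], dim=1)
        verts = verts + gat(x, edges)             # add predicted offsets
    return verts
```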

To achieve controllable and precise results, SC-VTON should satisfy two principles under the constraint of the control points. First, the deformed clothing should fit and cover the model's body as much as possible; second, the deformed clothing should maintain the clothing silhouette. The training loss $\mathcal{L}_{gat}$ of SC-VTON is composed of three parts: $\mathcal{L}_{vertex}$, $\mathcal{L}_{pixel}$ and $\mathcal{L}_{smooth}$. The first term is a Huber loss (Huber, 1992) constraining all vertices of the meshed clothing, the second term is an L2 pixel loss, and the last term $\mathcal{L}_{smooth}$ is a smoothing constraint on adjacent vertices:

\begin{align}
\mathcal{L}_{gat} &= \mathcal{L}_{vertex}+\lambda_{1}\mathcal{L}_{pixel}+\lambda_{2}\mathcal{L}_{smooth},\\
\mathcal{L}_{vertex} &= \sum^{K}_{k=1}Huber(\Delta v_{k},\Delta\overline{v}_{k}),\quad v_{k}\in V,\\
\mathcal{L}_{pixel} &= ||I_{c}-\tilde{I}_{c}||^{2}_{2},\\
\mathcal{L}_{smooth} &= \frac{1}{K}\sum^{K}_{k=1}\sum^{|Neighbor(v_{k})|}_{t=1}||\Delta(v_{k},v_{t})-\Delta(\hat{v}_{k},\hat{v}_{t})||^{2}_{2},
\end{align}

where $K$ denotes the number of all vertices in $V$, $\Delta v_k$ denotes the predicted vertex coordinate offset, $\Delta\overline{v}_k$ denotes the ground truth vertex coordinate offset, $|Neighbor(v_k)|$ denotes the number of adjacent vertices of the vertex $v_k$, $\Delta(v_k,v_t)$ denotes the offset between the vertex $v_k$ and its adjacent vertex $v_t$ in the original clothing image, $\Delta(\hat{v}_k,\hat{v}_t)$ denotes the offset between the vertex $\hat{v}_k$ and its adjacent vertex $\hat{v}_t$ in the deformed clothing image, and $\lambda_1,\lambda_2$ are the balance parameters.
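A compact sketch of this loss in PyTorch follows; the default weights reproduce the paper's setting ($\lambda_1=1$, $\lambda_2=10$), while the function and argument names are illustrative assumptions.

```python
import torch.nn.functional as F

def gat_loss(pred_offset, gt_offset, target_cloth, warped_cloth,
             verts, verts_warped, edges, lam1=1.0, lam2=10.0):
    """Deformation loss: Huber vertex term, L2 pixel term and smoothness term."""
    # Huber (smooth L1) loss on the predicted vertex offsets, summed over vertices.
    l_vertex = F.smooth_l1_loss(pred_offset, gt_offset, reduction='sum')
    # L2 pixel term between the images denoted I_c and \tilde{I}_c above.
    l_pixel = ((target_cloth - warped_cloth) ** 2).mean()
    # Edge vectors before and after deformation should stay similar.
    src, dst = edges[:, 0], edges[:, 1]
    delta_orig = verts[src] - verts[dst]
    delta_warp = verts_warped[src] - verts_warped[dst]
    l_smooth = ((delta_orig - delta_warp) ** 2).sum(dim=1).mean()
    return l_vertex + lam1 * l_pixel + lam2 * l_smooth
```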

3.2. Self-loop Optimization with Real Pairs

In Section 3.1, a large number of pseudo-labeled pairs $\{I_m, I_c, I_{g'}\}$ are adopted to train SC-VTON. Since the pseudo-labeled pairs are generated with an ARAP tool, the performance on real data is limited.

To solve the above drawback, real pairs $\{I_c, I_g\}$ of the clothing image $I_c$ and the model image $I_g$ in the same clothing are adopted to optimize SC-VTON. We devise two sub-networks, a splitting network and a synthesis network, and add them to SC-VTON, which generates self-loop supervision information for promoting the robustness and performance of SC-VTON. The splitting network is devised for predicting the underwear model $I'_m$ and control points from the clothed model image $I_g$. The synthesis network is devised for synthesizing the clothed model image $I'_g$ from the generated underwear model image $I'_m$ and the warped clothing image $\tilde{I}_c$. The difference between the real clothed model image $I_g$ and the synthesized clothed image $I'_g$ provides the supervision information for training the whole framework.

3.2.1. Splitting Network

As described above, the input of the splitting network is the real clothed model image $I_g$; the outputs contain the predicted underwear model image $\hat{I}_m$, the underwear model's parsing result $\hat{I}_{mp}$ and the control points. Before training the whole framework together, the splitting network is pre-trained on pseudo-labeled pairs $\{I_m, I_c, I_{g'}\}$ with the following loss:

\begin{equation}
\begin{split}
\mathcal{L}_{split}=||\hat{I}_{m}-I_{m}||^{2}_{2}+\alpha_{1}\sum_{t=1}^{T}\|\phi_{t}(\hat{I}_{m})-\phi_{t}(I_{m})\|_{1}\\
-\alpha_{2}\,I_{mp}\log(\hat{I}_{mp})+\alpha_{3}\sum^{\hat{K}}_{k=1}||\hat{v}^{k}_{control}-v^{k}_{control}||^{2}_{2},
\end{split}
\end{equation}

where $I_{mp}$ and $\hat{I}_{mp}$ are the original and predicted parsing of the model, $\hat{v}^k_{control}$ and $v^k_{control}$ are the coordinates of the predicted and labeled control points, and $\phi_t(I_m)$ and $\phi_t(\hat{I}_m)$ denote the feature maps of the image $I_m$ and the generated image $\hat{I}_m$ at the $t$-th layer ($T=4$) of the visual perception network $\phi$, a VGG19 (Simonyan and Zisserman, 2014) pre-trained on ImageNet (Deng et al., 2009). $\alpha_1,\alpha_2,\alpha_3$ are the balance parameters.
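The following sketch shows how this pre-training loss could be assembled, using a torchvision VGG19 for the perceptual term; the slice indices of the four feature stages are approximate and the default weights follow the paper's $\alpha_1=0.1$, $\alpha_2=1$, $\alpha_3=1$, while all names are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torchvision

class VGGPerceptual(nn.Module):
    """L1 distance between feature maps of the first T = 4 VGG19 stages."""

    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg19(pretrained=True).features.eval()
        self.slices = nn.ModuleList([vgg[:4], vgg[4:9], vgg[9:18], vgg[18:27]])
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x, y):
        loss = 0.0
        for block in self.slices:
            x, y = block(x), block(y)
            loss = loss + nn.functional.l1_loss(x, y)
        return loss

def split_loss(I_m_hat, I_m, parsing_logits, parsing_gt, cp_hat, cp_gt,
               perceptual, a1=0.1, a2=1.0, a3=1.0):
    """Pre-training loss of the splitting network (sketch)."""
    l_rgb = ((I_m_hat - I_m) ** 2).mean()                      # L2 image term
    l_perc = perceptual(I_m_hat, I_m)                          # VGG feature term
    l_parse = nn.functional.cross_entropy(parsing_logits, parsing_gt)
    l_cp = ((cp_hat - cp_gt) ** 2).sum(dim=1).mean()           # control points
    return l_rgb + a1 * l_perc + a2 * l_parse + a3 * l_cp
```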

3.2.2. Synthesis Network

As depicted in Fig. 2, the synthesis network takes three inputs: the warped clothing image, the underwear model image, and the corresponding parsing. The output of the synthesis network is the synthesized clothed model image. Similarly, the synthesis network is also pre-trained on the pseudo-labeled pairs $\{I_m, I_c, I_{g'}\}$ with the following loss:

\begin{equation}
\mathcal{L}_{synth}=||\hat{I}_{g^{\prime}}-I_{g^{\prime}}||^{2}_{2}+\beta\sum_{t=1}^{T}\|\phi_{t}(\hat{I}_{g^{\prime}})-\phi_{t}(I_{g^{\prime}})\|_{1},
\end{equation}

where $\hat{I}_{g'}$ is the synthesized clothed model image with the pseudo-labeled pair as input, and $\beta$ is the balance parameter.

3.2.3. Whole Framework Optimization

With the pre-trained splitting network and the synthesis network, the whole framework can be trained on real pairs $\{I_c, I_g\}$ with the following loss:

\begin{equation}
\mathcal{L}_{real}=||\hat{I}_{g}-I_{g}||^{2}_{2}+\gamma\sum_{t=1}^{T}\|\phi_{t}(\hat{I}_{g})-\phi_{t}(I_{g})\|_{1},
\end{equation}

where $\hat{I}_g$ is the synthesized clothed model image of the whole framework, $I_g$ is the corresponding real clothed model image used as input, and $\gamma$ is the balance parameter.

In the training stage of the whole framework, SC-VTON is also trained on the pseudo-labeled pairs simultaneously. The total loss $\mathcal{L}$ of the whole framework is defined as follows:

\begin{equation}
\mathcal{L}=\mathcal{L}_{synth}+\eta\,\mathcal{L}_{real},
\end{equation}

where $\eta$ is the balance parameter.

3.2.4. Extension to Clothed Model Virtual Try-on Task

With the splitting network and the synthesis network, the whole framework can be easily extended to the virtual try-on task for the clothed model image. For a clothed model image, the splitting network first predicts the underwear model image, parsing, and control points. Then GAT uses the meshed clothing, predicted underwear model, and control points as input to get the warped clothing image. The synthesis network synthesizes the final result.
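The extended inference pipeline can be summarized by the sketch below; splitting_net, sc_vton and synthesis_net are placeholder callables wrapping the trained modules, not the authors' API.

```python
def try_on_clothed_model(I_g, I_c_new, splitting_net, sc_vton, synthesis_net):
    """Extended inference pipeline for a clothed model image (sketch)."""
    # 1. Strip the current clothing: underwear model, parsing and control points.
    I_m_pred, parsing_pred, control_points = splitting_net(I_g)
    # 2. Warp the new clothing with the GAT based deformation module.
    I_c_warped = sc_vton(I_c_new, I_m_pred, parsing_pred, control_points)
    # 3. Paste directly or refine with the synthesis network.
    return synthesis_net(I_m_pred, parsing_pred, I_c_warped)
```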

4. Experiments

4.1. Implementation Details

Underwear Model Virtual Try-On (UMV) Dataset. The UMV dataset consists of $143{,}428$ pseudo-labeled pairs in total. To get the pseudo-labeled pairs, we photographed $2{,}648$ underwear models with different poses and collected tens of thousands of clothing images from the e-commerce platform. Key points and parsing labels for both models and clothing are defined. We manually label the model images to get accurate results. For clothing images, we use pre-trained models (Li et al., 2019; Sun et al., 2019) to predict key points and parsing results and remap them to our own definitions. An ARAP based annotation tool with manual fine-tuning is used to get the pseudo label $I_{g'}$. The supplementary materials provide more detailed descriptions of the UMV dataset.

Multi-Pose Virtual Try-On (MPV) Dataset. The MPV dataset (Dong et al., 2019) consists of $37{,}723$/$14{,}360$ person/clothing images. The subset used for our experiments is one part of the MPV dataset, consisting of $14{,}754$ pairs of top clothing images and front-view images of female models. The resolution of images in the dataset is $256\times 192$.

Network Architecture. The GAT module contains two graph attention blocks, each with three graph attention layers followed by a ReLU activation. The output dimension of each layer is $256$, and the final output dimension is $2$. The feature maps of the first three stages of the VGG network, with $64$, $128$ and $256$ channels, are used in the shape information extraction module. For each vertex, Perceptual Feature Pooling extracts the feature from the four nearest pixels using bilinear interpolation according to the vertex's coordinate, yielding a $1\times 448$ vector. The two $1\times 448$ vectors extracted from the model and the clothing are concatenated into a $1\times 896$ vector, whose dimension is then reduced by a fully connected layer to obtain the final $1\times 256$ vector.
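The extractor side of this description could look like the sketch below, assuming a torchvision VGG19 backbone; the exact slice indices and the simple resize-and-stack strategy are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# First three VGG19 stages give 64-, 128- and 256-channel maps; resized and
# stacked they form the 448-channel map pooled per vertex.
vgg = torchvision.models.vgg19(pretrained=True).features
stages = [vgg[:4], vgg[4:9], vgg[9:18]]               # 64, 128, 256 channels

def extract_448(image):                               # image: (1, 3, H, W)
    feats, x = [], image
    for stage in stages:
        x = stage(x)
        feats.append(F.interpolate(x, size=image.shape[-2:],
                                   mode='bilinear', align_corners=False))
    return torch.cat(feats, dim=1)                    # (1, 448, H, W)

reduce_fc = nn.Linear(896, 256)                       # 448 + 448 -> 256 per vertex
```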

The Splitting Network is a U-Net (Ronneberger et al., 2015) like encoder-decoder network. Three contracting blocks are used in the encoder, each with two convolution layers, and the output stride is $8$. The decoder uses three expansive blocks, each with two convolution layers and a transposed convolution to recover the feature map size and concatenate with the corresponding contracting block's output. The Synthesis Network has a similar architecture to the Splitting Network. More details about the network architecture are given in the supplementary materials.

Parameters Setting. For training, we set the batch size to $4$, the number of epochs to $30$, and use $2$ iterations. The ADAM optimizer is used with $\beta_1=0.5$ and $\beta_2=0.999$, and the learning rate is $10^{-3}$. We set $\lambda_1=1$, $\lambda_2=10$, $\alpha_1=0.1$, $\alpha_2=1$, $\alpha_3=1$, $\beta=0.1$, $\gamma=0.1$, $\eta=1$. The hyperparameters are set according to the importance of each sub-term and the balance of magnitudes between the losses.

Figure 5. Controllable shape results with different control point settings. The shape can be continuously controlled in length and tightness.
Figure 6. Visual comparison with CP-VTON (Wang et al., 2018b), CP-VTON+ (Minar et al., 2020b) and ACGPN (Yang et al., 2020). The textures of the clothing range from simple to complex; our method is better at keeping the clothing's texture and maintaining the clothing's silhouette.

4.2. Experiment on Controllable Shape

To verify the effectiveness of shape control, we set different control points as guidance for the same clothing and underwear model. Fig. 5 shows the results for various models and clothing, where we can see that the tightness of the sweater and pants and the length of the skirt and sleeves are successfully controlled with different control points as inputs. What's more, for the pants in the middle column of Fig. 5, the tightness and length are controlled simultaneously.

4.3. Qualitative Results

Fig. 6 illustrates our results compared with other methods on test samples selected from the MPV dataset. CP-VTON (Wang et al., 2018b), CP-VTON+ (Minar et al., 2020b) and ACGPN (Yang et al., 2020) are adopted for comparison. The textures of the clothing range from simple to complex. Compared with CP-VTON and CP-VTON+, the texture produced by our approach is much clearer; please see the results in the last two rows of Fig. 6. Meanwhile, compared with ACGPN, our approach can better maintain the clothing's silhouette. For example, for the two cases in the last row of Fig. 6, ACGPN turns short sleeves into sleeveless ones, which is an obvious mistake. Overall, our method is better at keeping the clothing's texture and can better maintain the clothing's silhouette.

4.4. Quantitative Results

We conduct a user study on the Qince Platform (a crowdsourcing platform developed by Alibaba Inc.). Given a clothing image and two clothed model images synthesized by two different methods (at resolutions $256\times 192$ and $512\times 384$, respectively), a worker is asked to choose the one that is more realistic and accurate in a virtual try-on situation. We choose $500$ testing samples from the MPV dataset. All synthesized results are presented five times to different workers to avoid individual preferences. Quantitative comparisons are summarized in Table 1. As can be seen from the table, our method gives better results than the other methods at a resolution of $256\times 192$. When the resolution is enlarged to $512\times 384$, our evaluation results are significantly better than those of the other methods, demonstrating that our method can better maintain texture details for high-resolution images.

Table 1. Perceptual user study results with other methods on the MPV dataset. The score represents the proportion of samples that workers considered more realistic.
Method | Resolution 256×192 | Resolution 512×384
CP-VTON (Wang et al., 2018b) | 0.37 | 0.27
SC-VTON (Ours) | 0.63 | 0.73
CP-VTON+ (Minar et al., 2020b) | 0.43 | 0.33
SC-VTON (Ours) | 0.57 | 0.67
ACGPN (Yang et al., 2020) | 0.42 | 0.38
SC-VTON (Ours) | 0.58 | 0.62

4.5. Comparing the Warping Results

Figure 7. Warping results compared with CP-VTON, CP-VTON+ and ACGPN. Compared with other methods, our warping results are flatter and more rigid, and better maintain the original silhouette of the garment.
Table 2. Precision within a certain threshold on upper clothing, pants and their average. Larger is better. SC-VTON (UMV) is the optimal model trained with UMV. SC-VTON (UMV & MPV) is the optimal model trained with both UMV and MPV.
Method | Uppers | Pants | Average
SC-VTON (w/o cloth info) | 0.67 | 0.80 | 0.74
SC-VTON (w/o $\mathcal{L}_{smooth}$) | 0.84 | 0.87 | 0.86
SC-VTON (w/o iterative training) | 0.71 | 0.83 | 0.77
SC-VTON (UMV) | 0.76 | 0.83 | 0.80
SC-VTON (UMV & MPV) | 0.78 | 0.85 | 0.82

As shown in Fig. 7, we further analyze and compare the warping modules of CP-VTON, CP-VTON+ and ACGPN on the MPV dataset. Both CP-VTON and CP-VTON+ use a Spatial Transformer Network (STN) to learn the parameters of a Thin-Plate Spline (TPS) to warp the clothing image. ACGPN uses TPS with a second-order difference constraint to obtain less distorted results. As shown in Fig. 7, the results of CP-VTON and CP-VTON+ are often distorted in the sleeve and torso parts, and ACGPN is prone to skewing. This is because, with TPS, the deformations of different clothing parts can affect each other.

We take this into account in the pre-processing stage. In the parsing definitions of person and clothing, we distinguish between different parts; for example, the left arm, torso and right arm are marked as different parsing categories (see supplementary materials), so we know which part each mesh belongs to when meshing the clothing. The GAT allows overlap between meshes of different parts, so different parts do not affect each other, and the overlapping regions can be handled in order during the rendering step. Therefore, compared with other methods, our warping results are flatter and more rigid, and also better maintain the original silhouette of the garment.

4.6. Ablation Experiments

Here we conduct a series of ablation experiments to verify the effectiveness of each module and loss of the proposed method. As the evaluation metric, we calculate the precision as the percentage of vertices whose prediction lies within a certain threshold $\tau=10^{-2}$ of the ground truth.
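This metric amounts to the simple computation sketched below (coordinates are assumed to be normalized; the function name is illustrative).

```python
import numpy as np

def vertex_precision(pred, gt, tau=1e-2):
    """Fraction of predicted vertices within distance tau of the ground truth."""
    dist = np.linalg.norm(pred - gt, axis=1)
    return float((dist < tau).mean())
```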

Table 2 and Fig. 8 show the quantitative and qualitative results of the ablation study. SC-VTON (UMV) is the optimal model trained with the UMV dataset, and SC-VTON (UMV & MPV) is the optimal model trained with both the UMV and MPV datasets. From Table 2 and Fig. 8, we can see that removing the clothing's image and parsing information makes the results worse. Removing $\mathcal{L}_{smooth}$ increases the accuracy, but the silhouette of the clothing becomes jagged. There is also a slight decrease without iterative training, demonstrating that iterative training improves the deformation result. Training with UMV and MPV simultaneously also slightly boosts the performance.

Figure 8. Visual results of the ablation study. From left to right: model image, result without using clothing information in GAT, result without using $\mathcal{L}_{smooth}$, result without using iterative training in GAT. SC-VTON (UMV) is the optimal model trained with UMV. SC-VTON (UMV & MPV) is the optimal model trained with both UMV and MPV.

4.7. Extend to Typical Virtual Try-on Task

As described in Section 3.2, with the splitting network and the synthesis network, the whole framework can be extended to the typical virtual try-on task for a clothed model. At inference time, both the clothed model image $I_g$ and a different clothing image $I_{c'}$ are sent to the network. We can directly paste the warped clothing image onto the predicted underwear model to get the final result, or use the Synthesis Network to generate a refined result. More results are given in the supplementary materials. The task of predicting the underwear model is very difficult, hence the visual result of the predicted underwear model is not quite good. Even so, adding it can optimize the warping result while increasing the robustness on new samples.

4.8. Failure Cases

All warping methods based on 2D clothing images face a common problem: they cannot handle side postures. Most works infer the side textures of the clothing with generators, and the results are often blurry and unrealistic. Our method also has this problem, and this paper only deals with front-view models. For side posture cases, the results are not good; please see Fig. 9 for visual results. We propose several directions to address the problem as future work: for person modeling, 3D information needs to be considered, and the DensePose (Güler et al., 2018) or SMPL (Loper et al., 2015) model may be two options; a more detailed clothing dataset needs to be collected, which should contain at least two clothing images from front and back views; and for the warping algorithm, depth information needs to be considered during the warping process.

Figure 9. Failure cases. Our method fails on side postures with large angles. For the obscured part (left arm), the deformation result is wrong.

5. Conclusion

In this paper, we propose a Shape Controllable Virtual Try-On Network (SC-VTON) to dress underwear models with clothing, which is an urgent demand for online clothing shops that need to exhibit new clothing efficiently. The proposed method comprises two parts: GAT based clothing deformation for the underwear model and self-loop optimization with real pairs. The former is devised for deforming the clothing for the underwear model under the constraint of shape control points. The latter is introduced for improving the robustness and performance of SC-VTON, and it also extends the whole framework to the typical virtual try-on task. The advantage of the proposed method is that SC-VTON can achieve continuous shape control in length and tightness. As far as we know, SC-VTON is the first shape controllable method for the 2D image-based virtual try-on task. What's more, based on the inherent advantage of the underwear model image, we can extract more accurate figure information than from a clothed model, which brings more pleasing virtual try-on visual results. Our method can maintain detailed texture and clothing silhouettes for high-resolution images, which has practical advantages in real applications.

Acknowledgements.
This work is supported by the Key Research and Development Program of Zhejiang Province (2018C01004), the National Natural Science Foundation of China (61976186, U20B2066), the Fundamental Research Funds for the Central Universities (2021FZZX001-23), and the Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies.

References

  • Alldieck et al. (2019) Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, Christian Theobalt, and Gerard Pons-Moll. 2019. Learning to reconstruct people in clothing from a single RGB camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1175–1186.
  • Alp Güler et al. (2018) Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. 2018. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7297–7306.
  • Bhatnagar et al. (2019) Bharat Lal Bhatnagar, Garvita Tiwari, Christian Theobalt, and Gerard Pons-Moll. 2019. Multi-garment net: Learning to dress 3d people from images. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5420–5430.
  • Chaudhuri et al. (2021) Bindita Chaudhuri, Nikolaos Sarafianos, Linda Shapiro, and Tony Tung. 2021. Semi-supervised Synthesis of High-Resolution Editable Textures for 3D Humans. arXiv preprint arXiv:2103.17266 (2021).
  • Chen et al. (2018) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV). 801–818.
  • Cheng et al. (2020) Haoqing Cheng, Heng Liu, Fei Gao, and Zhuo Chen. 2020. ADGAN: A Scalable GAN-based Architecture for Image Anomaly Detection. In 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Vol. 1. IEEE, 987–993.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. Ieee, 248–255.
  • Dong et al. (2019) Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin. 2019. Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE International Conference on Computer Vision. 9026–9035.
  • Ge et al. (2021) Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. 2021. Parser-Free Virtual Try-on via Distilling Appearance Flows. arXiv preprint arXiv:2103.04559 (2021).
  • Guler and Kokkinos (2019) Riza Alp Guler and Iasonas Kokkinos. 2019. Holopose: Holistic 3d human reconstruction in-the-wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10884–10894.
  • Güler et al. (2018) Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. 2018. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7297–7306.
  • Gundogdu et al. (2019) Erhan Gundogdu, Victor Constantin, Amrollah Seifoddini, Minh Dang, Mathieu Salzmann, and Pascal Fua. 2019. GarNet: A two-stream network for fast and accurate 3D cloth draping. In Proceedings of the IEEE International Conference on Computer Vision. 8739–8748.
  • Habermann et al. (2020) Marc Habermann, Weipeng Xu, Michael Zollhofer, Gerard Pons-Moll, and Christian Theobalt. 2020. Deepcap: Monocular human performance capture using weak supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5052–5063.
  • Han et al. (2019) Xintong Han, Xiaojun Hu, Weilin Huang, and Matthew R Scott. 2019. Clothflow: A flow-based model for clothed person generation. In Proceedings of the IEEE International Conference on Computer Vision. 10471–10480.
  • Han et al. (2018) Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. 2018. Viton: An image-based virtual try-on network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7543–7552.
  • Huber (1992) Peter J Huber. 1992. Robust estimation of a location parameter. In Breakthroughs in statistics. Springer, 492–518.
  • Issenhuth et al. (2019) Thibaut Issenhuth, Jérémie Mary, and Clément Calauzènes. 2019. End-to-End Learning of Geometric Deformations of Feature Maps for Virtual Try-On. arXiv preprint arXiv:1906.01347 (2019).
  • Issenhuth et al. (2020) Thibaut Issenhuth, Jérémie Mary, and Clément Calauzènes. 2020. Do Not Mask What You Do Not Need to Mask: a Parser-Free Virtual Try-On. arXiv preprint arXiv:2007.02721 (2020).
  • Jaderberg et al. (2015) Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. In Advances in neural information processing systems. 2017–2025.
  • Jandial et al. (2020) Surgan Jandial, Ayush Chopra, Kumar Ayush, Mayur Hemani, Balaji Krishnamurthy, and Abhijeet Halwai. 2020. SieveNet: A Unified Framework for Robust Image-Based Virtual Try-On. In The IEEE Winter Conference on Applications of Computer Vision. 2182–2190.
  • Jiang et al. (2020) Boyi Jiang, Juyong Zhang, Yang Hong, Jinhao Luo, Ligang Liu, and Hujun Bao. 2020. BCNet: Learning Body and Cloth Shape from A Single Image. arXiv preprint arXiv:2004.00214 (2020).
  • Kanazawa et al. (2018) Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. 2018. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7122–7131.
  • Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4401–4410.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Kolotouros et al. (2019) Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. 2019. Convolutional mesh regression for single-image human shape reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4501–4510.
  • Lee and Schachter (1980) Der-Tsai Lee and Bruce J Schachter. 1980. Two algorithms for constructing a Delaunay triangulation. International Journal of Computer & Information Sciences 9, 3 (1980), 219–242.
  • Li et al. (2010) Jituo Li, Juntao Ye, Yangsheng Wang, Li Bai, and Guodong Lu. 2010. Fitting 3D garment models onto individual human models. Computers & graphics 34, 6 (2010), 742–755.
  • Li et al. (2020) Kedan Li, Min Jin Chong, Jingen Liu, and David Forsyth. 2020. Toward Accurate and Realistic Virtual Try-on Through Shape Matching and Multiple Warps. arXiv preprint arXiv:2003.10817 (2020).
  • Li et al. (2019) Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. 2019. Self-Correction for Human Parsing. arXiv preprint arXiv:1910.09777 (2019).
  • Loper et al. (2015) Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. 2015. SMPL: A skinned multi-person linear model. ACM transactions on graphics (TOG) 34, 6 (2015), 1–16.
  • Minar et al. (2020a) Matiur Rahman Minar, Thai Thanh Tuan, Heejune Ahn, Paul Rosin, and Yu-Kun Lai. 2020a. 3D Reconstruction of Clothes using a Human Body Model and its Application to Image-based Virtual Try-On. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • Minar et al. (2020b) Matiur Rahman Minar, Thai Thanh Tuan, Heejune Ahn, Paul Rosin, and Yu-Kun Lai. 2020b. CP-VTON+: Clothing Shape and Texture Preserving Image-Based Virtual Try-On. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • Patel et al. (2020) Chaitanya Patel, Zhouyingcheng Liao, and Gerard Pons-Moll. 2020. Tailornet: Predicting clothing in 3d as a function of human pose, shape and garment style. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7365–7375.
  • Raffiee and Sollami (2020) Amir Hossein Raffiee and Michael Sollami. 2020. GarmentGAN: Photo-realistic Adversarial Fashion Transfer. arXiv preprint arXiv:2003.01894 (2020).
  • Ren et al. (2020) Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H Li, and Ge Li. 2020. Deep image spatial transformation for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7690–7699.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. Springer, 234–241.
  • Roy et al. (2020) Debapriya Roy, Sanchayan Santra, and Bhabatosh Chanda. 2020. LGVTON: A Landmark Guided Approach to Virtual Try-On. arXiv preprint arXiv:2004.00562 (2020).
  • Saito et al. (2019) Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. 2019. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE International Conference on Computer Vision. 2304–2314.
  • Saito et al. (2020) Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. 2020. PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 84–93.
  • Santesteban et al. (2019) Igor Santesteban, Miguel A Otaduy, and Dan Casas. 2019. Learning-Based Animation of Clothing for Virtual Try-On. In Computer Graphics Forum, Vol. 38. Wiley Online Library, 355–366.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Sorkine and Alexa (2007) Olga Sorkine and Marc Alexa. 2007. As-rigid-as-possible surface modeling. In Symposium on Geometry processing, Vol. 4. 109–116.
  • Sun et al. (2019) Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. 2019. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514 (2019).
  • Umetani et al. (2011) Nobuyuki Umetani, Danny M Kaufman, Takeo Igarashi, and Eitan Grinspun. 2011. Sensitive couture for interactive garment modeling and editing. ACM Trans. Graph. 30, 4 (2011), 90.
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
  • Vidaurre et al. (2020) Raquel Vidaurre, Igor Santesteban, Elena Garces, and Dan Casas. 2020. Fully Convolutional Graph Neural Networks for Parametric Virtual Try-On. arXiv preprint arXiv:2009.04592 (2020).
  • Wang et al. (2018b) Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. 2018b. Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European Conference on Computer Vision (ECCV). 589–604.
  • Wang et al. (2018a) Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. 2018a. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV). 52–67.
  • Wen et al. (2019) Chao Wen, Yinda Zhang, Zhuwen Li, and Yanwei Fu. 2019. Pixel2mesh++: Multi-view 3d mesh generation via deformation. In Proceedings of the IEEE International Conference on Computer Vision. 1042–1051.
  • Wood (2003) Simon N Wood. 2003. Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 1 (2003), 95–114.
  • Wu et al. (2019) Zhonghua Wu, Guosheng Lin, Qingyi Tao, and Jianfei Cai. 2019. M2e-try on net: Fashion from model to everyone. In Proceedings of the 27th ACM International Conference on Multimedia. 293–301.
  • Xu et al. (2019) Yuanlu Xu, Song-Chun Zhu, and Tony Tung. 2019. Denserac: Joint 3d pose and shape estimation by dense render-and-compare. In Proceedings of the IEEE International Conference on Computer Vision. 7760–7770.
  • Yang et al. (2020) Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. 2020. Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7850–7859.
  • Yildirim et al. (2019) Gokhan Yildirim, Nikolay Jetchev, Roland Vollgraf, and Urs Bergmann. 2019. Generating high-resolution fashion model images wearing custom outfits. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 0–0.
  • Yu et al. (2019) Ruiyun Yu, Xiaoqi Wang, and Xiaohui Xie. 2019. Vtnfp: An image-based virtual try-on network with body and clothing feature preservation. In Proceedings of the IEEE International Conference on Computer Vision. 10511–10520.
  • Zhu et al. (2020) Heming Zhu, Yu Cao, Hang Jin, Weikai Chen, Dong Du, Zhangye Wang, Shuguang Cui, and Xiaoguang Han. 2020. Deep Fashion3D: A Dataset and Benchmark for 3D Garment Reconstruction from Single Images. arXiv preprint arXiv:2003.12753 (2020).