XCloth: Extracting template-free textured 3D clothes from a monocular image
Abstract.
We present a novel framework for high-fidelity, template-free, textured 3D digitization of arbitrary garment styles from an in-the-wild monocular image. Existing approaches for 3D garment reconstruction either assume a predefined template for the garment geometry (restricting them to fixed clothing styles) or yield vertex-colored meshes (lacking high-frequency textural details). Our proposed framework co-learns geometric and semantic information of the garment surface from the input image. More specifically, we propose to extend the PeeledHuman representation to predict pixel-aligned, layered depth and semantic maps to extract 3D garments. The layered representation is further exploited to UV parametrize the arbitrary surface of the extracted garment without any human intervention, forming a UV atlas. The texture is then imparted onto the UV atlas in a hybrid fashion by first projecting pixels from the input image to UV space for the visible region, followed by inpainting the occluded regions. Thus, we are able to digitize arbitrarily loose clothing styles while retaining high-frequency textural details from a monocular image. To enable learning of semantic information, we curated cloth and body segmentation labels for two publicly available datasets, which we intend to release. We show high-quality reconstruction results on the aforementioned datasets and on internet images.

Given an input image, our framework extracts high-fidelity, textured 3D garments.
1. Introduction
The high-fidelity digitization of 3D garments from 2D images is essential for achieving photorealism in a wide range of computer vision applications, e.g., 3D virtual try-on, human digitization for AR/VR, realistic virtual avatar animation, etc. The goal of 3D digitization is to recover the 3D surface geometry as well as the high-frequency textural details of garments with arbitrary styles/designs. Textured 3D digitization of garments remains notoriously difficult, as designing clothing is a work of art that involves a great deal of creativity on the designer's end. The lack of standardization in design vocabulary (e.g., cloth panels) makes the task even more challenging. Traditionally, expensive multi-view capture/scan setups were used to digitize garments, but they cannot scale to fast-fashion scenarios (bhardwaj2010fast; gazzola2020trends), owing to their cost, scan latency and the manual effort involved.
The majority of existing learning-based garment digitization methods (yang2018physics; mir20pix2surf; majithia2022robust) rely on the availability of a predefined 3D template mesh for a specific clothing style, generally derived from popular parametric body models (e.g., SMPL (smpl)) or designed by an artist (UCBerkeley). Specifically, one class of methods (e.g., (mir20pix2surf; majithia2022robust)) proposes to transfer the texture from the RGB image to the UV map of the predefined template while retaining the fixed template geometry. The other class of methods (e.g., (bhatnagar2019mgn; bcnet; deep)) proposes to deform the predefined template using cues from the input RGB image in a learnable fashion, but does not attempt to recover texture. MGN (bhatnagar2019mgn) proposed a hybrid approach that achieves the best of both: it learns to locally deform the template mesh using SMPL+D (alldieck2019learning) to recover geometry, and learns to inpaint the underlying UV map to recover texture from (single/multi-view) RGB image(s). However, these methods restrict the clothing styles to a set of predefined templates, which are very simplistic in nature and are unable to model arbitrary clothing styles with high-frequency geometrical details (e.g., folds in long skirts). A recent attempt (zhu2020deep) proposes to turn the SMPL template into an adaptable template using handle-based deformation and to register it to a non-parametric mesh recovered by OccNet (mescheder2019occupancy) in order to accommodate loose clothing. Another, very recent attempt (https://doi.org/10.48550/arxiv.2203.15007) further extends this idea by incorporating a learnable segmentation & garment-boundary field to improve the reconstruction quality. However, neither of these methods attempts to recover textured or even vertex-colored meshes. A similar scenario exists in the related domain of clothed human body reconstruction, where existing methods either recover the 3D mesh surface without texture or produce a fixed (SMPL mesh) texture-map-based reconstruction, which imposes a tight-clothing limitation (please refer to Section LABEL: in the supplementary draft for more details).
A textured mesh not only retains high-quality visual details, which is crucial for applications such as 3D virtual try-on, but can also be easily exploited for extended applications such as texture-map super-resolution for enhanced detailing, appearance manipulation by texture swapping, etc. In terms of computational efficiency, texture representations are also memory efficient, as high-frequency appearance details can be retained while rendering low-resolution meshes (with fewer faces), as shown in Fig. LABEL:fig:textvsvcolor. Additionally, texture often helps model the appearance of fine-grained geometrical details (wrinkles and bump maps) that are hard to recover with existing geometry reconstruction methods.
In this paper, we propose a novel method for template-free, textured 3D digitization of arbitrary garment styles from a monocular image. The proposed framework predicts the 3D geometry of the garment in the form of a sparse, non-parametric peeled representation (introduced in (jinka2020peeledhuman)) to handle occlusions. We extend this representation to incorporate semantic information and surface normals in the form of semantic and normal peelmaps. The former provides supervision for extracting the clothes separately and also helps in dealing with complex garment geometry and pose, while the latter helps in recovering high-frequency surface details. We further exploit the peeled representation in a novel way to automatically UV parametrize the extracted 3D garment. Since the peeled representation provides natural view-specific partitions of the 3D garment mesh, we propose to automatically infer seams by exploiting these partitions for UV parametrization.
Finally, we recover the associated texture map by appropriately projecting the RGB peelmaps onto each of the corresponding partitions. However, when the garment has high-frequency textural details, generative methods tend to yield blurred predictions; hence, as a remedy, we propose to use structural inpainting for the RGB peelmaps. We thoroughly evaluate the proposed framework, report qualitative and quantitative results on publicly available datasets as well as internet images, and demonstrate superior performance compared to existing SOTA methods. We intend to release the proposed model in the public domain.
2. Proposed Framework
The proposed framework is divided into three key modules. The first is the “3D Reconstruction Module”, which recovers a dense point cloud of the garment in the PeeledHuman representation. The next is the “Geometry Refinement Module”, where the point cloud is further refined and meshified to obtain the 3D surface of the garment. The last is the “Peeled Texture Mapping & In-painting Module”, which parameterizes the 3D surface of the garment into 2D UV space and infers the texture from the peeled RGB maps, followed by texture in-painting. We discuss each of these modules in detail below.
2.1. 3D Reconstruction Module
The first task is to recover the 3D geometry of each garment from the monocular image. To this end, we adopt the architecture of PeelGAN (jinka2020peeledhuman) and extend it with two novel decoder branches. More specifically, the common encoder of PeelGAN (jinka2020peeledhuman) now encodes the input image along with peeled SMPL (depth and body-part segmentation) priors. This common encoding is fed to four decoder branches: the RGB, depth, segmentation and normal branches. The RGB and depth branches serve the same purpose as in PeelGAN, i.e., predicting RGB peelmaps and depth peelmaps. The segmentation branch predicts semantic labels across all the peeled layers, providing pixel-aligned semantic information in 3D in order to differentiate between multiple garments and body parts in the depth peelmaps. To train this decoder, we minimize the cross-entropy loss over the label classes. The classes in the ground-truth labels correspond to the classes provided by PGN (gong2018instancelevel). Please refer to Section LABEL:sec:data-generation for the complete semantic data generation process. It is important to note that the semantic labels not only help in identifying different clothes, but also improve the depth prediction and correct distortions in body parts, as demonstrated in (SnehithICVGIP2021). To further improve the prediction of clothing deformation due to the underlying body, similar to (jinka2021sharp), we provide SMPL depth peelmaps as a prior along with the input RGB image. Additionally, we also provide peeled semantic labels of the SMPL body parts (see Fig. LABEL:fig:). This helps the network predict the semantic labels more accurately and learn the correct deformation of clothing in the presence of the underlying human body.
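For reference, the snippet below is a minimal sketch of the per-layer cross-entropy supervision described above, assuming the segmentation branch emits per-layer class logits; the tensor layout and the layer/class counts are placeholders, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def peeled_segmentation_loss(pred_logits: torch.Tensor, gt_labels: torch.Tensor) -> torch.Tensor:
    """pred_logits: (B, L, C, H, W) class logits for L peeled layers,
       gt_labels:   (B, L, H, W)    integer class labels per peeled layer.
       Every peel layer is treated as an independent segmentation map."""
    B, L, C, H, W = pred_logits.shape
    return F.cross_entropy(pred_logits.reshape(B * L, C, H, W),
                           gt_labels.reshape(B * L, H, W))
```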
Another key architectural improvement is the addition of a normal branch, which estimates peeled normal maps. The normal maps add further surface detail by refining the wrinkles and folds present in the garments. The general trend is to predict front and back normal maps from the input image using an off-the-shelf network (as in PIFuHD (saito2020pifuhd) and ICON (xiu2022icon)) and then pass them as a prior to subsequent generator networks, thereby improving the fidelity of the predicted surface. However, this prevents the network from overcoming any noise present in the predicted prior normal maps and leads to overfitting and failure to generalize to loose clothing. Thus, our normal peelmap estimation branch acts as a passive regularizer that improves the common encoding, which in turn helps improve the prediction of the peeled depth maps. We minimize the L1 and perceptual losses over the normal peelmaps.
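A minimal sketch of such a combined L1 + perceptual objective is shown below; the choice of a frozen VGG-16 feature extractor, the absence of ImageNet normalization, and the 0.1 weighting are assumptions made for illustration, not the exact configuration used in the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG-16 features as a generic perceptual backbone (an assumption).
_vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def normal_peelmap_loss(pred: torch.Tensor, gt: torch.Tensor, w_perc: float = 0.1) -> torch.Tensor:
    """pred, gt: (B, L, 3, H, W) peeled normal maps in [-1, 1]."""
    l1 = F.l1_loss(pred, gt)
    B, L, C, H, W = pred.shape
    # Treat every peel layer as an independent 3-channel image for feature extraction.
    feat_pred = _vgg(pred.reshape(B * L, C, H, W))
    feat_gt = _vgg(gt.reshape(B * L, C, H, W))
    perc = F.l1_loss(feat_pred, feat_gt)
    return l1 + w_perc * perc
```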
The aforementioned reason is one of the motivations behind using a simple yet efficient multi-branch encoder-decoder architecture. The different decoder branches provide each other with only the relevant information by propagating gradients through the common encoder, while avoiding direct interference with one another, resulting in more flexible learning that is robust to noisy ground-truth information.
It is important to note that all the peelmaps (i.e., depth, RGB, segmentation and normal) are pixel-aligned with each other. The output of the 3D reconstruction module is the point cloud estimated by back-projecting the depth peelmaps, where each point is also assigned a segmentation label directly from the predicted semantic peelmaps. This labelling information is used to separate the garments. The extracted garment point clouds are passed to the next module.
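For concreteness, a minimal sketch of this back-projection under a simple pinhole peeling-camera model is shown below; the intrinsics and the convention that zero depth marks an empty pixel are assumptions for illustration.

```python
import numpy as np

def backproject_peelmaps(depth, labels, fx, fy, cx, cy):
    """depth:  (L, H, W) peeled depth maps (camera-space z, 0 = empty pixel),
       labels: (L, H, W) per-pixel semantic labels from the segmentation branch.
       Returns an (N, 3) point cloud and an (N,) label array."""
    L, H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel grid
    pts, lbl = [], []
    for layer in range(L):
        z = depth[layer]
        valid = z > 0                                 # skip empty pixels
        x = (u[valid] - cx) * z[valid] / fx           # unproject x
        y = (v[valid] - cy) * z[valid] / fy           # unproject y
        pts.append(np.stack([x, y, z[valid]], axis=-1))
        lbl.append(labels[layer][valid])
    return np.concatenate(pts), np.concatenate(lbl)

# The garments can then be separated by label, e.g. points[labels == topwear_id].
```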

2.2. Geometry Refinement Module
The output of the reconstruction module is a point-cloud representation of the garments, which has missing regions due to an inherent drawback of the PeeledHuman representation. These missing regions are caused by surface patches that are tangential to the peeling camera rays and hence need to be filled to recover a complete, dense surface representation of the garments. One common solution is to run PSR (Poisson Surface Reconstruction) on the point cloud, which gives a watertight mesh. However, PSR smooths out the details coming from the depth peelmaps (the back-projected point cloud). Instead of naively running PSR, we first meshify the depth peelmaps in image space independently for each peeling layer and merge them, filling only the missing tangential regions by sampling from the Poisson surface reconstruction of the point cloud. This helps retain the fine-grained surface details present in the predicted depth maps. Please refer to Fig. 2 for the geometry refinement. Additionally, meshifying each depth peelmap allows us to store the peeling information, i.e., which peeling layer each vertex belongs to. Storing this information is useful, as it is later needed for texture mapping.
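A minimal sketch of the image-space meshification of a single depth peelmap is given below, assuming a grid triangulation over valid pixels with a simple depth-discontinuity threshold; the threshold value and the pixel-space vertex coordinates (which would subsequently be unprojected as in the back-projection step) are illustrative choices, not the exact procedure of the paper.

```python
import numpy as np

def meshify_peelmap(depth, max_edge=0.02):
    """Triangulate one (H, W) depth peelmap on the pixel grid.
       Two triangles are emitted per 2x2 pixel block whose four depths are
       valid and mutually close (max_edge rejects faces bridging depth
       discontinuities). Also returns the per-pixel vertex indices so the
       peeling layer of every vertex can be stored for texture mapping."""
    H, W = depth.shape
    valid = depth > 0
    idx = -np.ones((H, W), dtype=np.int64)
    idx[valid] = np.arange(valid.sum())
    rows, cols = np.nonzero(valid)
    verts = np.stack([cols, rows, depth[valid]], axis=-1).astype(np.float64)

    faces = []
    for y in range(H - 1):
        for x in range(W - 1):
            quad = idx[y:y + 2, x:x + 2].ravel()          # tl, tr, bl, br
            if (quad < 0).any():
                continue
            d = depth[y:y + 2, x:x + 2]
            if d.max() - d.min() > max_edge:              # depth discontinuity
                continue
            tl, tr, bl, br = quad
            faces.append([tl, bl, tr])
            faces.append([tr, bl, br])
    return verts, np.asarray(faces, dtype=np.int64).reshape(-1, 3), idx
```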

2.3. Peeled Texture Mapping & In-painting Module
While existing garment digitization methods require a predefined parametrized garment template and then learn to transfer texture from (multi-view) input image(s) to the UV space, our novel approach exploits the predicted depth and RGB peelmaps for both UV parametrization and texture-map filling.
One naive idea is to simply apply a flat-plane parametrization to the geometry, but this causes overlap, as multiple vertices project onto the same UV location. Various parameterization techniques are available, such as LSCM (levy2002least), which estimate a mapping from 3D space to 2D space. However, when parametrizing a genus-zero surface like the human body (heckbert1986survey), most UV parametrization methods either suffer from discontinuities in the texture space or from multiple faces overlapping in the UV space. The former results in visible seams across the whole geometry, while the latter projects the wrong texture onto the geometry. Discontinuity in the UV space can be partially addressed by in-painting the seams, but estimating seams that produce minimal distortion of the geometry is not straightforward.
We propose to automatically find the seams and split the garment mesh into multiple parts, guided by the predicted peeled depth maps, and then apply UV parametrization to the individual part meshes. The peelmaps are useful here, as they naturally split the shape into parts and hence can be used to find consistent seams. Instead of projecting all vertices to the same UV space, we project vertices according to the peeling layer they belong to, thereby constructing a layered UV space that avoids the overlap issue.
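As an illustration of how the peeling information can drive this split, the following is a minimal sketch, assuming each vertex already carries the id of the peeling layer it was reconstructed from (stored in the geometry refinement module); the majority-vote face assignment is our own simplification, not necessarily the exact rule used in the paper.

```python
import numpy as np

def split_mesh_by_peel_layer(faces, vert_layer):
    """faces:      (F, 3) vertex indices of the garment mesh,
       vert_layer: (V,)   peeling-layer id per vertex.
       Each face is assigned to the layer shared by the majority of its
       vertices, partitioning the mesh into one submesh per peeling layer."""
    face_layers = vert_layer[faces]                            # (F, 3) layer ids
    # Majority vote per face (ties resolve to the smallest layer id).
    assigned = np.array([np.bincount(fl).argmax() for fl in face_layers])
    return {layer: faces[assigned == layer] for layer in np.unique(assigned)}
```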
We use Boundary First Flattening (sawhney2017boundary), a parametrization that produces a flattening with minimal area distortion and virtually zero angle distortion. The initial mapping can be interactively edited using handle-based manipulation to reduce vertex overlap and to allot more texture area to selected regions. In this work, we instead exploit the depth peelmaps and split the mesh according to the peeling information to handle the overlap.
The proposed split texture mapping approach has three steps:
(1) Automated Seam Estimation: Since we store the peeling information for each vertex while reconstructing the mesh from each depth peelmap, we know which peeling layer every mesh vertex belongs to, as every mesh vertex projects to a pixel in a peelmap. For vertices belonging to faces that are tangential to the peeling camera rays, we employ a nearest-neighbor search to associate them with the depth peelmaps. This yields an automatic definition of seams, consisting of the set of vertices that are equidistant (in terms of mesh connectivity) from the respective peelmap pairs. We replicate these seam vertices across the two splits to avoid artifacts near the split boundary.
(2) Peeled Parameterization: We parametrize each split part mesh separately using Boundary First Flattening (sawhney2017boundary). This gives us a 3D-to-2D parametrization for each part with minimal distortion.
(3) Texture Filling & In-painting: Traditional methods fill the texture map by projecting the respective 3D points onto the visible multi-view images and interpolating the RGB colors for the corresponding texture-map pixels. However, this is computationally expensive and requires multi-view images. Instead, we exploit the multi-layer RGB peelmaps for texture filling, which our depth-peelmap-guided split UV parametrization strategy enables. Every split texture map of a specific garment has an associated RGB peelmap that is pixel-aligned with it; thus, for each pixel in the texture map, we take the RGB value from the corresponding RGB peelmap pixel (see the sketch after this list). This fills a large part of the texture map. However, surface regions that are not visible in the peelmaps remain unfilled and still need a valid texture. We employ an off-the-shelf structured image in-painting solution (huang2014image) for this task, as we already know the mask of unfilled pixels in the texture map.
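The texture-filling step above relies purely on the pixel alignment between a split's texture map and its RGB peelmap. A minimal sketch of this idea is given below, assuming per-vertex UV coordinates from BFF and per-vertex peelmap pixel coordinates are already available; the vertex-level splatting here is a simplification of the per-texel filling described above (a full implementation would rasterize each UV triangle).

```python
import numpy as np

def fill_texture_from_peelmap(uv, verts_px, rgb_peelmap, tex_size=1024):
    """uv:          (V, 2) per-vertex UV coordinates in [0, 1],
       verts_px:    (V, 2) the (u, v) pixel each vertex projects to in its peelmap,
       rgb_peelmap: (H, W, 3) the RGB peelmap associated with this split.
       Splats the pixel-aligned colour of every vertex into the texture map;
       texels that receive no colour stay at -1 and form the in-painting mask."""
    tex = -np.ones((tex_size, tex_size, 3), dtype=np.float32)
    tx = np.clip((uv[:, 0] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    ty = np.clip((uv[:, 1] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    px = np.clip(verts_px[:, 0].astype(int), 0, rgb_peelmap.shape[1] - 1)
    py = np.clip(verts_px[:, 1].astype(int), 0, rgb_peelmap.shape[0] - 1)
    tex[ty, tx] = rgb_peelmap[py, px]
    inpaint_mask = tex[..., 0] < 0        # unfilled texels to be in-painted
    return tex, inpaint_mask
```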
Additionally, we empirically observed that predicting high-quality texture for unseen areas is challenging for generative models like ours. Thus, if the garment in the image has high-frequency texture, we propose to apply the same structured in-painting to the second and subsequent RGB peelmaps, using a known patch from the visible region of the input image or an externally provided texture image of the underlying cloth (as shown in Figure LABEL:fig:).
3. Experiments and Results
3.1. Implementation Details
We employ a multi-branch encoder-decoder network for XCloth, which is trained in an end-to-end fashion. The network takes the input image concatenated with the SMPL peelmaps at 512 × 512 resolution. The shared encoder consists of a convolutional layer and 2 downsampling layers which have 64, 128 and 256 kernels of size 7 × 7, 3 × 3 and 3 × 3, respectively. This is followed by ResNet blocks which take downsampled feature maps of size 128 × 128 × 256. The decoders for predicting D_fused, D_rd and R consist of two upsampling layers followed by a convolutional layer, with the same kernel sizes as the shared encoder. A sigmoid activation is used in the D_fused and D_rd decoder branches, while a tanh activation is used for the R decoder branch. The D_rd output values are scaled to a [−1, 0.5] range, which was found empirically. We use the Adam optimizer with an exponentially decaying learning rate starting from 5 × 10^−4. Our network takes around 18 hrs to train for 20 epochs on 4 Nvidia GTX 1080Ti GPUs with a batch size of 8 and the loss weights set to 1, 1, 0.1 and 0.001, respectively, found empirically. We use the trimesh (Dawson-Haggerty et al., 2019) library for rendering the peelmaps.
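For illustration, the following is a minimal PyTorch sketch of such a multi-branch encoder-decoder; the number of peeled layers (4), the input channel count, the segmentation class count, the exact upsampling operator and the activations of the segmentation/normal branches are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class MultiBranchPeelNet(nn.Module):
    """in_ch covers the RGB image concatenated with the SMPL depth and
       body-part segmentation peelmaps (11 channels is an assumption)."""
    def __init__(self, in_ch=11, n_res=6, n_layers=4, n_classes=8):
        super().__init__()
        self.encoder = nn.Sequential(                    # 512 -> 128 spatial
            nn.Conv2d(in_ch, 64, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            *[ResBlock(256) for _ in range(n_res)])

        def decoder(out_ch, act):
            return nn.Sequential(                        # 128 -> 512 spatial
                nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
                nn.ReLU(inplace=True),
                nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(64, out_ch, 7, padding=3), act)

        self.depth  = decoder(n_layers,             nn.Sigmoid())   # depth peelmaps
        self.rgb    = decoder(n_layers * 3,         nn.Tanh())      # RGB peelmaps
        self.seg    = decoder(n_layers * n_classes, nn.Identity())  # class logits
        self.normal = decoder(n_layers * 3,         nn.Tanh())      # normal peelmaps

    def forward(self, x):
        f = self.encoder(x)
        return self.depth(f), self.rgb(f), self.seg(f), self.normal(f)
```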
3.2. Datasets for cloth digitization:
In this subsection, we explore the available datasets that can be leveraged for cloth digitization. (bhatnagar2019mgn) introduced the Digital Wardrobe dataset, which consists of 94 real body scans with textured meshes in simple postures. The dataset covers only five categories of garments, all of which are relatively tight clothing. DeepFashion3D (zhu2020deep) introduced vertex-colored point-cloud data of garments from diverse categories; however, it does not provide the input images, which prevents us from using it for reconstruction from monocular images. (bertiche2020cloth3d) proposed a large-scale dynamic dataset of garments in motion. However, the dataset is synthetic in nature and is of limited use for training models intended to extract clothes from real-world data. [Thuman2.0] proposed a large-scale dataset of around 500 real body scans with relatively tight clothing and diverse poses. Recently, [3DHuman] introduced the 3DHumans dataset, which contains high-frequency textured scans of the South-Asian population with diverse and loose clothing.
Hence, we train our network on the 3DHumans, THUman2.0 and Digital Wardrobe datasets and show superior performance over SOTA methods, both qualitatively and quantitatively.
3.3. Qualitative Comparison


We show the qualitative results of our method on three different datasets in Figure 5, where (i) & (ii) correspond to 3DHumans, (iii) & (iv) to THUman2.0, and (v) & (vi) to the Digital Wardrobe dataset, respectively. Our method reconstructs high-quality textured geometry for topwear and bottomwear separately. Owing to the texture mapping, the patterns in the input image are also preserved in the rendered mesh.
We further showcase our results on in-the-wild internet images, rendered from different views (Figure 4). Our method generalizes to loose and diverse garments and recovers high-fidelity textured meshes. Note how the textured reconstruction retains the high-frequency appearance details of the garments.
3.4. Quantitative Comparison
We evaluate our framework separately for two classes of garments (topwear & bottomwear). To evaluate the reconstruction error, we compute the Point-to-Surface distance (P2S) with respect to the ground-truth garment mesh. For evaluating the output of the segmentation branch, we use Intersection-Over-Union (IOU) for each class. Finally, to evaluate the quality of the extracted garment surface, we compute the normal reprojection error (NRE) separately for each class (ignoring the background). We report the quantitative results in Table 1.
Table 1.
Datasets | Topwear P2S | Topwear IOU | Topwear NRE | Bottomwear P2S | Bottomwear IOU | Bottomwear NRE
Ours (3DHumans) | 0.0087 | 0.82 | 0.089 | 0.0077 | 0.83 | 0.081
Ours (THUman2.0) | 0.0079 | 0.80 | 0.094 | 0.0072 | 0.74 | 0.085
Ours (MGN) | 0.0091 | 0.91 | 0.86 | 0.088 | 0.95 | 0.85
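For reference, the sketch below shows one common way to compute P2S and per-class IOU (using the trimesh library for the point-to-surface query); the exact evaluation protocol used for the numbers above may differ.

```python
import numpy as np
import trimesh

def point_to_surface(points, gt_mesh):
    """Mean distance from predicted garment points (N, 3) to the
       ground-truth surface (a trimesh.Trimesh)."""
    _, dist, _ = trimesh.proximity.closest_point(gt_mesh, points)
    return dist.mean()

def segmentation_iou(pred, gt, class_id):
    """IoU of one semantic class over (possibly stacked) label maps."""
    p, g = (pred == class_id), (gt == class_id)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union > 0 else 1.0
```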
We train OccNet on the 3DHumans and THUman2.0 datasets and compare it with our method in terms of P2S. OccNet has no provision for per-class semantic segmentation, nor is its final reconstruction pixel-aligned with the input image; therefore, we skip IOU and NRE in this quantitative comparison. Additionally, the training code of other methods is not available, so we do not report quantitative comparisons with them. We report the quantitative comparison with OccNet in Table 2.
Table 2.
Methods | P2S (Topwear) | P2S (Bottomwear)
OccNet (3DHumans) | 0.081 | 0.081
Ours (3DHumans) | 0.0087 | 0.0077
OccNet (THUman2.0) | 0.073 | 0.081
Ours (THUman2.0) | 0.0079 | 0.0072
4. Ablation Study
In this section, we discuss the performance of our framework under different training settings and perform an ablation study.
4.1. Effect of Normal Loss
We extend the original PeelGAN architecture by adding another branch from the shared encoder for predicting normal peelmaps. Depth values in the depth maps depend on the distance from the camera, whereas normal maps are a pure representation of the surface geometry; therefore, incorporating normal peelmaps helps retain fine wrinkles and other high-frequency surface details of the garments. Moreover, computing a loss on the normal peelmaps regularizes the depth-map prediction and lowers the P2S and NRE, as shown in Table 1 and Table 3.
Table 3.
Datasets | Topwear P2S | Topwear NRE | Bottomwear P2S | Bottomwear NRE
Ours w/o normal loss (3DHumans) | 0.0091 | 0.118 | 0.0082 | 0.122
Ours w/o normal loss (THUman2.0) | 0.0088 | 0.182 | 0.0078 | 0.177
Ours w/o normal loss (MGN) | 0.0096 | 0.107 | 0.0089 | 0.097
4.2. Effect of Segmentation Prior
Although the segmentation branch of Module 1 generalizes to in-the-wild images, the chance of segmentation errors is high for highly challenging poses and complex clothing. For such cases, we modify our framework to take the segmentation map of the first layer from an off-the-shelf semantic segmentation network [JPPNet, PGN] and to predict the segmentation maps for the remaining layers. This small augmentation helps achieve accurate semantic labelling in complex cases, as can be seen from the improved IOU in Table 4 compared to Table 1.
Table 4.
Methods | IOU (Topwear) | IOU (Bottomwear)
Ours with seg. prior (3DHumans) | 0.94 | 0.96
Ours with seg. prior (THUman2.0) | 0.90 | 0.93
Ours with seg. prior (MGN) | 0.91 | 0.95