
Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On

Lingxiao Lu1, Shengyi Wu1, Haoxuan Sun1, Junhong Gou1,
Jianlou Si2, Chen Qian2, Jianfu Zhang1 , Liqing Zhang1∗
1Shanghai Jiao Tong University, 2SenseTime
{lulingxiao,wsykk2,guwangtu,goujunhong,c.sis,zhang-lq}@sjtu.edu.cn
{sijianlou,qianchen}@sensetime.com
Corresponding authors.
Abstract

Virtual clothes try-on has emerged as a vital feature in online shopping, offering consumers a critical tool to visualize how clothing fits. In our research, we introduce an innovative approach for virtual clothes try-on, utilizing a self-supervised Vision Transformer (ViT) coupled with a diffusion model. Our method emphasizes detail enhancement by contrasting local clothing image embeddings, generated by ViT, with their global counterparts. Techniques such as conditional guidance and focus on key regions have been integrated into our approach. These combined strategies empower the diffusion model to reproduce clothing details with increased clarity and realism. The experimental results showcase substantial advancements in the realism and precision of details in virtual try-on experiences, significantly surpassing the capabilities of existing technologies.

1 Introduction

The integration of virtual clothes try-on functionality in online shopping platforms has become a key feature, providing consumers with an invaluable tool to visualize and evaluate the fit of clothing items prior to purchase. With the growing trend of online shopping, the advancement of sophisticated virtual try-on techniques is crucial in elevating the overall user experience [34, 2]. These developments play a significant role in bridging the gap between physical and online retail experiences, catering to the evolving needs of consumers and reshaping the landscape of the retail industry.

A considerable portion of previous research in virtual try-on has relied heavily on Generative Adversarial Networks (GANs) to generate lifelike images [2]. To better preserve intricate features, earlier studies [23, 34, 12, 37, 9] have integrated specialized warping components tailored to align the target clothing accurately with human figures. The warped clothing, combined with a clothes-agnostic representation of the person, is then fed into GAN-based generators to produce the final visual output. Certain efforts [6, 17] have further extended these methodologies to high-resolution imagery. It is essential to acknowledge, however, that the efficacy of such methods hinges heavily on the quality of the garment warping process. Recently, diffusion models have emerged as notable alternatives to traditional generative models [15, 29, 32, 30], recognized for their comprehensive distribution coverage, well-defined training objectives, and enhanced scalability [26]. The Stable Diffusion network, in particular, has gained prominence for its ability to create lifelike images by leveraging the reverse diffusion process [17]. However, a significant challenge in current approaches is their limited capacity to accurately and authentically replicate complex clothing details; they struggle to maintain precise control over the fine details that are vital for a realistic virtual try-on experience. This is because previous methods [24, 36] employ CLIP [27] to extract detailed information from the reference image as guidance for Stable Diffusion (SD). CLIP's bias towards semantic description yields pronounced differences between dissimilar semantic descriptions but limited variation within the same semantic category, reducing distinctiveness among items with similar semantic meanings. This poses a significant challenge in virtual try-on scenarios, where most clothing items share similar textual descriptions, making it difficult to accurately describe specific designs and layouts.

In our proposed methodology, we address the limitations of current virtual clothes try-on techniques by introducing strategies focused on improving detail accuracy and realism. To overcome the constraints of textual descriptions, it becomes crucial to extract effective information from clothing data to create optimal conditions for SD to produce superior images. Inspired by previous self-supervised transformer work [3], which uses self-supervised learning for visual representation, our approach follows suit. The training methodology revolves around contrasting representations from different views of the same clothes, facilitated by a dual training framework comprising teacher and student networks. This enables the extraction of semantically rich features from clothing data, significantly improving the quality of the condition supplied to SD. Furthermore, in applying a self-supervised approach to Vision Transformers (ViT) [8], we found that a pre-trained ViT adapts poorly to our dataset and lacks the capacity to adequately express clothing-specific features. To address this, we intentionally fine-tune ViT on our dataset, concentrating on essential clothing aspects such as collars, sleeves, text, and patterns. By using the self-attention maps of ViT to identify key clothing elements, we selectively sample local crops around these crucial keypoints. This enriches the network with the detailed information necessary for an accurate representation of clothing and ensures that ViT precisely identifies and emphasizes critical details during fine-tuning. Importantly, our method introduces only minor additional inference time compared to other diffusion-based methods. Through these innovations, we aim for our proposed approach to deliver substantial improvements in the realism and detail precision of virtual try-on experiences, positioning it as a superior alternative to current solutions.

2 Related Works

2.1 Virtual Try-On

Virtual try-on technology has increasingly become a focal point of research, driven by its potential to revolutionize the consumer shopping experience. The domain encompasses two primary methodologies, 2D and 3D approaches, as outlined in previous studies [13]. 3D virtual try-on [7, 39, 16, 21], while offering an immersive user experience, requires sophisticated 3D parametric human models and extensive 3D datasets, often entailing significant costs. Conversely, 2D virtual try-on methods are more accessible and prevalent in practical applications due to their lighter computational demands. Prior research in 2D virtual try-on [12, 23, 34] has concentrated on adapting clothing to fit human figures flexibly, yet it faces challenges in managing substantial geometric deformations. Furthermore, flow-based methods [11, 9, 13] have been explored to model the appearance flow field between clothing and the human body, aiming to improve garment fit. While these methods offer significant advancements, they frequently fall short in generating high-quality, high-resolution imagery, particularly when it comes to replicating intricate clothing details.

Moreover, with increasing resolution, synthesis stages based on Generative Adversarial Networks (GANs) [17, 6, 20, 18] encounter difficulties in preserving the integrity and characteristics of clothing, resulting in reduced realism and an escalation of artifacts. The generative capabilities of GANs, particularly in conjunction with high-resolution clothing warping, are constrained, impacting the overall quality of the synthesized images. These challenges underscore the need for more sophisticated approaches capable of surpassing the limitations inherent in current virtual try-on methods. Employing diffusion models emerges as a promising solution, offering significant improvements in virtual try-on performance. Specifically, diffusion models excel in managing complex clothing attributes and achieving realistic image synthesis, addressing the critical shortcomings of existing technologies.

2.2 Virtual Try-On Methods Based on Diffusion Models

In recent years, diffusion models [15, 32] have been proposed to generate realistic images by reversing a gradual noising process. Latent diffusion models [29], trained on latent-space representations, incorporate cross-attention layers within their network structure to handle generic conditions as input. This advancement has established diffusion models as formidable rivals to GANs in image synthesis tasks. Recent studies have also explored using text information as a condition in the denoising process, enabling diffusion models to create images with text-relevant features [28, 33, 10, 19, 4].

LaDI-VTON [24] introduces a novel textual inversion component, mapping the visual attributes of in-shop garments to CLIP token embeddings to generate pseudo-word token embeddings, conditioning the generation process and preserving detailed textures. Furthermore, the development of spatial-level guidance techniques in diffusion models [22, 5], particularly through targeted interventions in the denoising phase, has significantly enhanced their applicability across various domains. Despite these advancements, effective implementation of virtual try-on using diffusion models remains a challenge. Text-to-image methods [28, 31] alone are insufficient for accurately depicting the varied appearances of clothing. Similarly, inpainting techniques [36] struggle with maintaining precise control over finer details.

In response to these limitations, our approach introduces a novel self-supervised component specifically designed to enhance detail accuracy and realism in virtual clothing try-ons, focusing on the extraction of critical keypoints.

3 Method

In this section, we will detail the framework of our method, covering the foundational aspects of the diffusion model. We will also introduce our Vision Transformer training process and the procedure for extracting key points of clothing.

3.1 Overall Structure

Figure 1: Overall framework of our network. We utilize the Stable Diffusion (SD) inpainting network and employ a specially finetuned Vision Transformer (ViT) to direct the network's focus towards intricate clothing details. The finetuned ViT, denoted as $\tau$, also functions as an essential feature extractor, instrumental in calculating the loss and further refining the inpainting process. Alongside, we integrate warp features into the input to enhance the alignment between the network's internal features and those in the given condition. For simplicity of representation, we omit the encoder $E$ and the decoder $D$ of the SD network in our depiction.

In this research, our objective is to harness diffusion models within an inpainting framework for virtual try-on tasks, focusing on intricacies of clothing such as sleeves, collars, and textual patterns. Previous methodologies have explored various approaches for injecting explicit information, yet they often overlook these critical clothing details. To address this, we introduce a self-supervised learning-based detail enhancer, designed to help our network better recognize and integrate these essential features. As depicted in Figure 1, our methodology processes a person image $I_o \in \mathbb{R}^{H\times W\times 3}$ and a clothing image $I_c \in \mathbb{R}^{H'\times W'\times 3}$ to generate a synthesized image $\hat{I} \in \mathbb{R}^{H\times W\times 3}$. This image aims to realistically blend the person's attributes from $I_o$ with the clothing elements of $I_c$. For try-on scenarios, we utilize a binary mask $m \in \{0,1\}^{H\times W}$ to identify the areas requiring inpainting, generally covering the upper body and arms of the person. In the resulting image $\hat{I}$, areas where $m=0$ should faithfully reflect $I_o$, and those with $m=1$ should seamlessly incorporate elements from $I_c$, ensuring a natural integration with the person's figure.

The clothing image $I_c$ plays a pivotal role in guiding the model to generate try-on images. It is input into a Vision Transformer (ViT) [8] to produce conditional embeddings $c$ that direct the diffusion models. To improve the accuracy of detail replication, we fine-tune ViT as a condition encoder $\tau$ using our specialized clothing dataset. The fine-tuning process adopts distillation approaches, involving the collaboration of teacher and student networks. Throughout this phase, the teacher model processes the entire global crop $I_c$, whereas the student model processes both $I_c$ and its locally cropped segments. Such an arrangement allows the student model to effectively assimilate knowledge from the teacher, thereby promoting self-supervised learning within ViT. After fine-tuning, the ViT teacher model, now adept at capturing details, is employed as a detail injector in our Stable Diffusion inpainting network. It takes complete clothing images as input to ensure enhanced precision in the final output.

To facilitate improved interaction with ViT features, our model's input incorporates a warped image. During this warping process, the clothing image $I_c$ is input into a warping network, designed to predict an appearance flow field that aligns the clothing accurately. The outcome is a warped clothing image, which, when combined with a clothes-agnostic person image and a mask, yields a masked coarse result $I_{lc}$. The warping process in our framework adopts an iterative refinement strategy, adept at capturing long-range correspondences. This approach is particularly effective in addressing significant misalignments in clothing images. For a more comprehensive understanding of the warping techniques employed, please refer to [9, 11].
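As a minimal sketch (not the paper's actual implementation), the composition of the masked coarse result can be written as follows; the tensor shapes, value ranges, and the function name `compose_masked_coarse` are illustrative assumptions.

```python
import torch

def compose_masked_coarse(person_agnostic: torch.Tensor,
                          warped_clothes: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Combine the clothes-agnostic person image with the warped clothing
    inside the masked (to-be-inpainted) region to obtain the coarse input I_lc.

    person_agnostic, warped_clothes: (B, 3, H, W) tensors in [-1, 1]
    mask: (B, 1, H, W) binary tensor, 1 = region to inpaint
    """
    # Outside the mask keep the person untouched; inside the mask paste the
    # warped clothing that the diffusion model will subsequently refine.
    return person_agnostic * (1.0 - mask) + warped_clothes * mask
```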

Subsequently, we will explore the intricacies of the diffusion model, our method for self-supervised ViT fine-tuning, and the content correction loss mechanism that is specifically developed based on ViT insights.

3.2 Virtual Try-on by Diffusion Models

Existing studies [36, 24] have demonstrated the considerable capabilities of diffusion models in try-on tasks, which is why our work also adopts a diffusion model, specifically leveraging the open-source Stable Diffusion (SD) [17] framework. The SD network excels in generating realistic images by leveraging the reverse diffusion process. Starting with the target image $I_0$, an initial forward diffusion process denoted as $q(\cdot)$ is applied. This process incrementally introduces noise following a Markov chain, ultimately transforming the image into a Gaussian distribution. To optimize computational efficiency, a latent diffusion model is employed. This involves transforming images from the image space to the latent space using a pre-trained encoder $E$, followed by image reconstruction with a pre-trained decoder $D$. The forward diffusion process on the latent variable $z_0 = E(I_0)$ at any given timestep $t$ is defined as:

$z_t = \sqrt{\alpha_t}\,z_0 + \sqrt{1-\alpha_t}\,\epsilon,$  (1)

where $\alpha_t := \prod_{s=1}^{t}(1-\beta_s)$ and $\epsilon \sim N(0, I)$. Here, $\beta$ denotes a predefined variance schedule spanning $T$ steps. The latent code $z_{lc}$ is derived by passing $I_{lc}$ through the encoder $E$; it is then concatenated with $z_t$ and the downsampled mask $m$ to form the input $\{z_t, z_{lc}, m\}$. For the denoising phase, an enhanced diffusion UNet is utilized to predict a refined version of the input. Within this framework, the condition $c$ extracted from $I_c$ by a condition encoder $\tau$, such that $c = \tau(I_c)$, is incorporated into the diffusion UNet through a cross-attention mechanism. The objective function of the UNet is defined as:

$\mathcal{L}_{\text{SD}} = \|\epsilon - \epsilon_\theta(z_t, z_{lc}, m, c, t)\|_2^2$  (2)

The condition $c$ is critical in the SD framework for inpainting tasks, as it provides essential clothing information for the process. Therefore, the selection and training of the condition encoder $\tau$ are of utmost importance. The Paint-by-Example approach [36] suggests using the CLIP [27] image ViT encoder to extract this condition. However, our findings indicate that this is not the most effective strategy, given that the CLIP encoder is pre-trained on open-world images rather than clothing-specific images. Consequently, we propose fine-tuning the ViT encoder $\tau$ to yield a more accurate condition, tailored for clothing images.
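Putting Eqs. (1) and (2) together, a single training step can be sketched as below. The module names (`vae_encoder` for $E$, `unet` for $\epsilon_\theta$, `vit_encoder` for $\tau$) and the exact way the mask and coarse latent are concatenated are assumptions for illustration; the actual SD inpainting implementation may wire these differently.

```python
import torch
import torch.nn.functional as F

def sd_training_step(I0, I_lc, mask, I_c,
                     vae_encoder, unet, vit_encoder, alphas_cumprod):
    """One denoising training step following Eqs. (1) and (2).

    I0   : target try-on image   (B, 3, H, W)
    I_lc : masked coarse result  (B, 3, H, W)
    mask : inpainting mask       (B, 1, H, W)
    I_c  : in-shop clothing image
    alphas_cumprod : precomputed alpha_t = prod_{s<=t}(1 - beta_s), length T
    """
    B = I0.shape[0]
    T = alphas_cumprod.shape[0]

    # Encode images into the latent space with the frozen VAE encoder E.
    z0 = vae_encoder(I0)
    z_lc = vae_encoder(I_lc)
    # Downsample the mask to the latent resolution.
    m = F.interpolate(mask, size=z0.shape[-2:], mode="nearest")

    # Forward diffusion, Eq. (1): z_t = sqrt(alpha_t) z0 + sqrt(1 - alpha_t) eps.
    t = torch.randint(0, T, (B,), device=z0.device)
    eps = torch.randn_like(z0)
    a_t = alphas_cumprod[t].view(B, 1, 1, 1)
    z_t = a_t.sqrt() * z0 + (1.0 - a_t).sqrt() * eps

    # Condition c = tau(I_c) from the finetuned ViT, injected via cross-attention.
    c = vit_encoder(I_c)

    # The UNet sees the noisy latent concatenated with the coarse latent and mask.
    unet_input = torch.cat([z_t, z_lc, m], dim=1)
    eps_pred = unet(unet_input, t, c)

    # Eq. (2): noise-prediction MSE.
    return F.mse_loss(eps_pred, eps)
```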

3.3 Finetuning ViT on Clothes

Paint by Example utilizes CLIP to enhance detailed information, yet CLIP is inclined towards classification, leading to significant inter-class differences but minimal intra-class variation. This results in a lack of distinctiveness for items within the same category. Consequently, extracting effective information from clothing datasets becomes crucial for producing optimal conditions for SD to generate superior images. However, this is challenging due to the scarcity of annotations or other supervised methods beneficial for virtual try-on in clothing datasets. Inspired by [3], which utilizes self-supervised learning for visual representation, we implement a similar approach. The core of this training methodology lies in contrasting representations from different perspectives of the same data instance.

This concept is operationalized through a dual training setup comprising teacher and student networks. The teacher network $\tau$ produces pseudo-labels for the student network $\tau'$ by analyzing representations from augmented views of the input data. Notably, $\tau$ is maintained as the moving average of $\tau'$, ensuring more stable and consistent feature extraction. In parallel, the student network $\tau'$ is trained to conform to these pseudo-labels, striving to minimize similarity across representations from various instances. This strategy facilitates the extraction of semantically rich features from clothing data, significantly enhancing the conditioning quality for SD. The loss function of the finetuning process can be expressed as follows:

$\mathcal{L}_{\text{SS}} = \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j\neq i}}^{N+M} H\big(\tau(I_{c_i}), \tau'(I_{c_j})\big),$  (3)

where $M$ and $N$ represent the total number of global and local crops, respectively. The functions $\tau(\cdot)$ and $\tau'(\cdot)$ denote the ViT feature extraction performed by the teacher and student networks, respectively. The term $H(a,b) = -a\log b$ denotes the cross-entropy. $I_{c_j}$ refers to an augmented view of the input data instance $I_c$. The augmentation process begins with a random resized crop operation, applying a scale factor and bicubic interpolation. This is followed by flip and color jitter transformations. Additionally, the process incorporates Gaussian blur and normalization for enhanced diversity in the data representation. The scales for the local crops are specified in the range $[0.05, 0.25]$ of the whole image, while those for the global crops fall within the range $[0.25, 1]$.
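A simplified sketch of the teacher-student objective in Eq. (3) is given below. Details of the actual recipe in [3] (projection heads, output centering, temperature schedules) are omitted; the temperatures, the `ema_update` momentum, and the module names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # The teacher tau is kept as the moving average of the student tau'.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.data.mul_(momentum).add_(ps.data, alpha=1.0 - momentum)

def self_supervised_loss(teacher, student, global_crops, local_crops,
                         t_temp=0.04, s_temp=0.1):
    """Multi-crop objective in the spirit of Eq. (3): the teacher sees only
    global crops, the student sees all crops, and H(a, b) = -sum a log b is
    averaged over crop pairs with i != j."""
    all_crops = global_crops + local_crops          # M global + N local views
    with torch.no_grad():
        t_out = [F.softmax(teacher(g) / t_temp, dim=-1) for g in global_crops]
    s_out = [F.log_softmax(student(v) / s_temp, dim=-1) for v in all_crops]

    loss, n_terms = 0.0, 0
    for i, t_probs in enumerate(t_out):             # teacher view i (global)
        for j, s_logp in enumerate(s_out):          # student view j (any crop)
            if j == i:                              # skip the identical view
                continue
            loss = loss + torch.sum(-t_probs * s_logp, dim=-1).mean()
            n_terms += 1
    return loss / n_terms
```

For the multi-crop augmentation itself, a standard pipeline along the lines described above would use random resized crops with scale $(0.25, 1)$ for global views and $(0.05, 0.25)$ for local views, followed by flips, color jitter, Gaussian blur, and normalization.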

The teacher network is utilized as the final condition encoder in our framework. In Figure 2, we present visualizations of the average head attention corresponding to the class token in ViT, both before and after self-supervised fine-tuning. In these visualizations, “SS-” represents the state without fine-tuning, while “SS+RF” indicates the application of self-supervised fine-tuning. A comparative analysis between “SS-” and “SS+RF” demonstrates that self-supervised fine-tuning on our dataset results in heightened attention to specific key areas of the clothing. Overall, this self-supervised approach empowers ViT to focus distinctively on each feature of the clothing. After undergoing self-supervised fine-tuning, ViT shows increased attention to particular local areas within the clothing images, reflecting an enhanced understanding of clothing features and details.

Figure 2: Visualization of the Average Head's Attention for the Class Token in ViT. “SS-” represents the scenario without any finetuning, “SS+RF” indicates the use of random local crops for self-supervised finetuning, and “SS+SF” signifies the application of our method, which involves selectively choosing local crops for self-supervised finetuning.
Figure 3: In this visualization, (a) displays the original image input to the condition encoder $\tau$. Subfigure (b) illustrates the attention maps of two specific heads within the self-attention mechanism of ViT, highlighting areas of focus. Subfigure (c) shows the focal points derived from the attention maps presented in (b), pinpointing the specific areas receiving the highest attention. The aggregation of focal points across all heads is depicted in (d), demonstrating the comprehensive attention landscape. Based on the focal points in (d), clustering is conducted to identify key cluster centers, which are prominently marked in red in subfigure (e), indicating areas of significant attention across all heads.

3.4 Finetuning ViT with Keypoints

While we apply a self-supervised approach to ViT, we find that using a pre-trained ViT directly offers limited adaptability to our dataset and inadequate capability to capture clothing-specific features. During the initial stages of the fine-tuning process, local crops are randomly sampled. However, this approach often overlooks vital keypoints of the clothing, resulting in the omission of important details. Given our objective for ViT to accurately identify key elements of clothing, such as collars, sleeves, text, and patterns, we intentionally fine-tune ViT on our dataset to enhance its attentiveness to these critical aspects. To overcome the shortcomings of random sampling, we refine our methodology to selectively sample local crops that encompass these essential keypoints. This adjustment ensures that the network is furnished with the detailed information it needs, enabling a more precise representation of clothing.

In the self-attention maps of the ViT, we observe that different heads focus on distinct segments of clothing, closely aligning with key areas. Figure 3 showcases this, where subfigures (b) and (c) accurately target text and sleeves on the clothing. Consequently, we collect all points with attention surpassing a specified threshold, referred to as high-attention points, from each head's attention map and merge them into a single map, as depicted in subfigure (d). We then cluster these points based on their proximity, ensuring the number of clusters corresponds to the number of local crops. As shown in Figure 3 (e), the centroids of these clusters effectively cover vital garment areas such as sleeves, collars, hems, and patterns. These identified centroids are then used as the basis for generating new local crops. Following this, we apply a range of data augmentation techniques to derive the final local crops, which are integral to the fine-tuning of ViT on the clothes dataset. After fine-tuning, the stabilized ViT model is incorporated into the training of the SD inpainting network, enhancing its effectiveness.
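A sketch of this keypoint-guided crop selection is given below, assuming the per-head class-token attention maps have already been reshaped to the patch grid. The attention threshold, the use of k-means from scikit-learn, and the default crop area are illustrative choices rather than the paper's exact settings; the number of clusters and the crop-scale range used in practice are given in the next paragraph.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_keypoint_crops(attn_maps, image_hw, n_crops=10,
                          attn_threshold=0.6, crop_area=0.15):
    """Pick local-crop boxes around high-attention keypoints.

    attn_maps : (n_heads, h, w) class-token attention per head, reshaped to
                the patch grid and normalized to [0, 1].
    image_hw  : (H, W) of the clothing image.
    crop_area : target ratio of crop area to image area.
    Returns a list of (y0, x0, y1, x1) boxes in image coordinates.
    """
    n_heads, h, w = attn_maps.shape
    H, W = image_hw

    # Merge high-attention points from every head into one map (Fig. 3d).
    ys, xs = np.nonzero((attn_maps >= attn_threshold).any(axis=0))
    points = np.stack([ys, xs], axis=1).astype(np.float32)
    if len(points) < n_crops:                       # fall back to all patches
        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        points = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float32)

    # Cluster the merged points; the centroids act as clothing keypoints (Fig. 3e).
    centers = KMeans(n_clusters=n_crops, n_init=10).fit(points).cluster_centers_

    # Turn each centroid (patch coordinates) into a crop box of the target area.
    half = 0.5 * crop_area ** 0.5
    boxes = []
    for cy, cx in centers:
        cy, cx = cy / h * H, cx / w * W             # patch grid -> pixels
        y0, y1 = max(0.0, cy - half * H), min(float(H), cy + half * H)
        x0, x1 = max(0.0, cx - half * W), min(float(W), cx + half * W)
        boxes.append((int(y0), int(x0), int(y1), int(x1)))
    return boxes
```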

Specifically, we identify 10 centroids in the clustering process as the keypoints for each clothing item. The local crops are then centered around these keypoints, maintaining a $[0.05, 0.25]$ ratio to the entire image. In Figure 2, “SS+SF” denotes the application of our method in selecting local crops for self-supervised fine-tuning. Comparing “SS-” and “SS+RF”, it is apparent that self-supervised fine-tuning on our dataset leads to increased attention on specific key areas. However, “SS+SF” demonstrates a more pronounced focus on crucial points, text, patterns, and other intricate details on the clothing. This observation confirms that our self-supervised approach enables ViT to extract and emphasize the unique characteristics of clothing more effectively than previous methods.

4 Experiments

4.1 Datasets

Our experimental evaluations primarily utilize the VITON-HD dataset [6], comprising 13,679 pairs of frontal-view images of women and top clothing items. Following precedent [6, 17], we partition the dataset into training and test sets of 11,647 and 2,032 pairs, respectively. Experiments are conducted at a resolution of $512\times 384$. Furthermore, to validate the robustness of our method in more intricate scenarios, we extend our experiments to the DressCode dataset [25]. Detailed results of these extended evaluations are provided in the supplementary material.

4.2 Evaluation Metrics

To evaluate the effectiveness of our method in different test scenarios, we employ a range of metrics. In the paired setting, where a clothing image is used to reconstruct a person's image, we use two widely accepted metrics: Structural Similarity (SSIM) [35] and Learned Perceptual Image Patch Similarity (LPIPS) [38]. For the unpaired setting, involving the alteration of the person's clothing, we measure performance using the Fréchet Inception Distance (FID) [14] and the Kernel Inception Distance (KID) [1]. To capture the complex aspects of human perception, we also incorporate a user study into our evaluation. Specifically, we create composite images using various methods for 200 randomly selected pairs from our test set, each at a resolution of $512\times 384$. These images are then evaluated by a panel of 30 human judges tasked with identifying, for each pair, the method that most effectively restores clothing details and the method that produces the most lifelike result. The frequency with which each method is chosen as the best performer in these two aspects is documented, providing detailed insight into our method's performance.
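For reference, the four automatic metrics can be computed with the torchmetrics package as sketched below; the constructor arguments may differ slightly across torchmetrics versions, and the data-loading helpers are assumed to yield float images in $[0, 1]$.

```python
import torch
from torchmetrics.image import (StructuralSimilarityIndexMeasure,
                                LearnedPerceptualImagePatchSimilarity,
                                FrechetInceptionDistance,
                                KernelInceptionDistance)

def evaluate(paired_batches, unpaired_batches, device="cuda"):
    """paired_batches yields (generated, ground_truth) float tensors in [0, 1];
    unpaired_batches yields (generated, real_reference) tensors in [0, 1]."""
    ssim = StructuralSimilarityIndexMeasure(data_range=1.0).to(device)
    lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex",
                                                  normalize=True).to(device)
    fid = FrechetInceptionDistance(feature=2048).to(device)
    kid = KernelInceptionDistance(subset_size=50).to(device)

    # Paired setting: reconstruction quality against the ground truth.
    for gen, gt in paired_batches:
        gen, gt = gen.to(device), gt.to(device)
        ssim.update(gen, gt)
        lpips.update(gen, gt)

    # Unpaired setting: distribution distance to real images (uint8 inputs).
    for gen, real in unpaired_batches:
        gen_u8 = (gen.to(device) * 255).to(torch.uint8)
        real_u8 = (real.to(device) * 255).to(torch.uint8)
        fid.update(real_u8, real=True)
        fid.update(gen_u8, real=False)
        kid.update(real_u8, real=True)
        kid.update(gen_u8, real=False)

    kid_mean, _ = kid.compute()
    return {"SSIM": ssim.compute().item(),
            "LPIPS": lpips.compute().item(),
            "FID": fid.compute().item(),
            "KID x100": 100 * kid_mean.item()}   # KID scaled as in Table 1
```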

4.3 Implementation Details

The training of our network is divided into two distinct stages. In the initial stage, we leverage local crops derived from attention maps to train the ViT model. We use 10 local crops and opt for the ViT-Base model with a patch size of 16. Training starts with an initial learning rate of 2e-5, and the batch size is set to 32. This model is trained for more than 30 epochs on our dataset. In the second stage, we employ the pre-trained ViT model with its parameters fixed and focus on training the entire network. For this phase, the learning rate is adjusted to 1e-5, and the batch size is reduced to 8. This stage extends over 40 epochs, allowing for comprehensive learning and adaptation. Overall, the training process requires only 6 hours to finetune ViT and 24 hours to finetune SD, which is significantly faster than full-scale SD training. Testing is conducted on a single 3090 GPU, and we adopt PLMS sampling with 100 steps for inference. It is crucial to highlight that our model incurs virtually no additional inference time relative to SD: the average generation time per image is 6.31 seconds, roughly equivalent to the 6.28 seconds of PbE.
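The two-stage schedule above can be summarized in a small configuration; the dictionary structure and key names are ours, and only the values are taken from the text.

```python
# Illustrative summary of the two-stage training schedule; key names are
# ours, the values come from the implementation details above.
TRAIN_CONFIG = {
    "stage1_vit_finetune": {
        "backbone": "ViT-Base/16",
        "n_local_crops": 10,
        "learning_rate": 2e-5,
        "batch_size": 32,
        "epochs": 30,           # "more than 30 epochs"
    },
    "stage2_sd_inpainting": {
        "vit_frozen": True,     # stage-1 ViT is fixed as the condition encoder
        "learning_rate": 1e-5,
        "batch_size": 8,
        "epochs": 40,
    },
    "inference": {
        "sampler": "PLMS",
        "steps": 100,
        "resolution": (512, 384),
    },
}
```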

Figure 4: Qualitative comparisons.
Method SSIM↑ LPIPS↓ FID↓ KID↓ User↑
CP-VTON 0.791 0.141 30.25 4.012 0.006/0.003
VITON-HD 0.843 0.076 11.64 0.300 0.061/0.058
PF-AFN 0.858 0.082 11.30 0.283 0.273/0.187
HR-VITON 0.878 0.061 9.90 0.188 0.121/0.071
PbE 0.843 0.087 10.15 0.204 0.003/0.038
LaDI-VTON 0.876 0.059 9.07 0.148 0.067/0.121
Ours 0.886 0.052 8.93 0.117 0.469/0.522
Table 1: Quantitative comparisons. KID has been multiplied by 100 to facilitate comparison. The user study scores indicate the frequencies of method selection based on two distinct evaluation criteria: similarity of reconstruction (left) and realism (right).

4.4 Quantitative Evaluations

Our baselines are CP-VTON [34], PF-AFN [9], VITON-HD [6], and HR-VITON [17], as well as the diffusion-based methods Paint-by-Example (PbE) [36] and LaDI-VTON [24]. The comparison results on the VITON-HD dataset are shown in Table 1. The performance of traditional methods such as CP-VTON, VITON-HD, and PF-AFN is relatively inferior, while the PbE method demonstrates acceptable results after finetuning on the dataset. Furthermore, the HR-VITON method, which performs warp operations on high-resolution images, exhibits favorable outcomes. Similarly, LaDI-VTON, which utilizes textual inversion to map images to CLIP image embeddings, also yields promising results. Our method achieves superior performance, attributed to the effective use of self-supervised learning techniques, which enables the network to focus on details and key points, thereby addressing the shortcomings of previous models.

We further compare our method against competitive approaches, namely PF-AFN [9], HR-VITON [17], and LaDI-VTON [24], on the DressCode dataset [25]. Table 2 shows that our method achieves the best results across the upper-body, lower-body, and dress categories in terms of both similarity and realism metrics. These quantitative results demonstrate the effectiveness of our method on a diverse range of datasets, highlighting its strong transferability and its ability to adapt to various clothing categories.

4.5 Qualitative Evaluations

In Figure 4, we present qualitative comparisons of our proposed method with other state-of-the-art methods on the VITON-HD dataset. The analysis reveals distinct differences in performance. CP-VTON, for instance, tends to produce results that appear more artificial and less seamlessly integrated. VITON-HD, on the other hand, falls short in capturing textural details. Compared to these methods, PF-AFN and HR-VITON come closer to replicating the original images but still exhibit deficiencies in several detail aspects. Paint-by-Example (PbE) shows some ability to learn similar features but struggles with capturing variations between individuals. LaDI-VTON, while showing promise, faces challenges similar to PbE.

Our method stands out in handling numerous details more effectively. Particular attention has been given to key areas such as collars and hems, as evidenced in the first and third rows of the figure. Additionally, our approach more accurately reflects the overall texture and material of the original images, as seen in the second and fourth rows, and demonstrates a slight superiority in handling various intricate details, as shown in the fifth and sixth rows. These successes are largely due to our fine-tuning process, which has enhanced the network's ability to concentrate on crucial details at different positions, thereby improving overall precision. More visual results are shown in Figure 8.

To compare the visual results of our method on the DressCode dataset with LaDI-VTON [24], the baseline with the highest scores on the evaluation metrics, we present a visual comparison in Figure 7. This allows a direct assessment of our method against the state-of-the-art baseline. Our method achieves a high level of fidelity, faithfully replicating intricate texture patterns and other essential features of the target garments. Notably, our approach also exhibits heightened realism in the overall fit and texture of the generated clothing, further underscoring its advantage over existing approaches.

Method Upper-LPIPS↓ Upper-FID↓ Lower-LPIPS↓ Lower-FID↓ Dresses-LPIPS↓ Dresses-FID↓
PF-AFN 0.0380 14.32 0.0445 18.32 0.0758 13.59
HR-VITON 0.0635 16.86 0.0811 22.81 0.1132 16.12
LaDI-VTON 0.0249 13.26 0.0317 14.79 0.0442 13.40
Ours 0.0216 10.65 0.0255 11.89 0.0382 10.76
Table 2: Quantitative comparisons with competitive baselines on DressCode dataset.
Method SSIM↑ LPIPS↓
(a) CLIP Encoder 0.841 0.142
(b) SS- Encoder 0.847 0.107
(c) SS+RF Encoder 0.860 0.105
(d) SS+SF Encoder 0.862 0.104
(e) Final Network 0.886 0.052
Table 3: Results of Ablation Studies

4.6 Ablation Studies

In our ablation studies, we assess the impact of each individual enhancement on the overall performance. These experiments use images of size $512\times 384$. The CLIP encoder has a patch size of 14 and an image embedding size of 1024, while our ViT has a patch size of 16 and an image embedding size of 768. To ensure fairness, we map the ViT embedding to 1024 dimensions using a linear layer. We compare the following configurations: (a) CLIP Encoder: utilizing CLIP as the encoder for conditioning. (b) SS- Encoder: employing a self-supervised pre-trained ViT for conditioning, without any fine-tuning on our dataset. (c) SS+RF Encoder: conditioning with ViT after fine-tuning with random local crops. (d) SS+SF Encoder: conditioning with ViT after fine-tuning with local crops selected from self-attention maps. (e) Final Network: our complete network, integrating warping to interact with the features.
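The dimension matching mentioned above amounts to a single learned projection; a minimal PyTorch sketch (the variable name is illustrative) would be:

```python
import torch.nn as nn

# Project the 768-d ViT embedding to the 1024-d conditioning dimension of the
# CLIP baseline so that every ablation variant feeds the same cross-attention
# interface of the SD UNet.
vit_to_condition = nn.Linear(768, 1024)
```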

As shown in Table 3, ViT subjected to self-supervised training yields a notably larger improvement than CLIP. Notably, fine-tuning with random local crops on our dataset fails to yield substantial improvements, whereas our proposed fine-tuning method accounts for pivotal clothing regions, capturing intricate clothing features more comprehensively and leading to further gains. Lastly, our full-network analysis shows that, without the warp features, the UNet features attend only weakly to the ViT condition features; after the warp features are introduced, the UNet attends to the corresponding ViT feature regions far more effectively, significantly elevating the overall quality of the results.

Figure 6 visualizes these ablations and corroborates the quantitative findings: the self-supervised ViT clearly outperforms CLIP as a condition encoder, our keypoint-aware fine-tuning captures intricate clothing features more faithfully, and adding warp features visibly improves how well the UNet attends to the corresponding ViT feature regions.

5 Limitations

The results presented in Figure 5 highlight certain unresolved issues with our proposed approach. Specifically, we have observed that the effectiveness of our method is not yet perfect when it comes to addressing minor details (as evidenced by the first row). This limitation is not unique to our approach, as it represents a common bottleneck for most current methods in the field. In addition, we have also noted that providing two-dimensional guidance for laying out clothes may potentially mislead the three-dimensional construction process, as illustrated by the shoulder straps in the second row. While these challenges remain, we are committed to exploring potential solutions to address them in future works.

Figure 5: Visualization of Limitations of our method.

6 Conclusions

In this paper, we present an innovative and effective methodology for virtual clothes try-on. The method integrates a self-supervised ViT with a diffusion model. It focuses on enhancing details by comparing local and global clothing image embeddings from ViT, demonstrating an acute understanding of complex visual elements. Techniques like conditional guidance, focus on key regions, and specialized content loss contribute to its thoroughness. These strategies enable the diffusion model to accurately replicate clothing details, significantly enhancing the realism and clarity of virtual try-on experiences.

References

  • [1] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
  • [2] Zhipeng Cai, Zuobin Xiong, Honghui Xu, Peng Wang, Wei Li, and Yi Pan. Generative adversarial networks. ACM Computing Surveys, pages 1–38, 2022.
  • [3] Mathilde Caron, Hugo Touvron, Ishan Misra, Herve Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, Oct 2021.
  • [4] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. TOG, 42(4):1–10, 2023.
  • [5] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. In ICCV, Oct 2021.
  • [6] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In CVPR, Jun 2021.
  • [7] Toby Chong, I-Chao Shen, Yinfei Qian, Nobuyuki Umetani, and Takeo Igarashi. Real-time image-based virtual try-on with measurement garment. In SIGGRAPH Asia, 2021.
  • [8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [9] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. Parser-free virtual try-on via distilling appearance flows. In CVPR, Jun 2021.
  • [10] Julia Guerrero-Viu, Milos Hasan, Arthur Roullier, Midhun Harikumar, Yiwei Hu, Paul Guerrero, Diego Gutierrez, Belen Masia, and Valentin Deschaintre. Texsliders: Diffusion-based texture editing in clip space. SIGGRAPH, 2024.
  • [11] Xintong Han, Xiaojun Hu, Weilin Huang, and Matthew R Scott. Clothflow: A flow-based model for clothed person generation. In ICCV, pages 10471–10480, 2019.
  • [12] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S. Davis. Viton: An image-based virtual try-on network. In CVPR, Jun 2018.
  • [13] Sen He, Yi-Zhe Song, and Tao Xiang. Style-based global appearance flow for virtual try-on. In CVPR, pages 3470–3479, 2022.
  • [14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
  • [15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [16] Maria Korosteleva and Olga Sorkine-Hornung. Garmentcode: Programming parametric sewing patterns. ACM Transactions on Graphics (TOG), 42(6):1–15, 2023.
  • [17] Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo. High-resolution virtual try-on with misalignment and occlusion-handled conditions. In ECCV, 2022.
  • [18] Kathleen M Lewis, Srivatsan Varadharajan, and Ira Kemelmacher-Shlizerman. Tryongan: Body-aware try-on via layered interpolation. TOG, 40(4), 2021.
  • [19] Pengzhi Li, Qinxuan Huang, Yikang Ding, and Zhiheng Li. Layerdiffusion: Layered controlled image editing with diffusion models. In SIGGRAPH Asia. 2023.
  • [20] Anran Lin, Nanxuan Zhao, Shuliang Ning, Yuda Qiu, Baoyuan Wang, and Xiaoguang Han. Fashiontex: Controllable virtual try-on with text and texture. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–9, 2023.
  • [21] Lijuan Liu, Xiangyu Xu, Zhijie Lin, Jiabin Liang, and Shuicheng Yan. Towards garment sewing pattern reconstruction from a single image. ACM Transactions on Graphics (TOG), 42(6):1–15, 2023.
  • [22] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  • [23] Matiur Rahman Minar, Thai Thanh Tuan, Heejune Ahn, Paul Rosin, and Yu-Kun Lai. Cp-vton+: Clothing shape and texture preserving image-based virtual try-on. In CVPR Workshops, volume 3, pages 10–14, 2020.
  • [24] Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. In ACMMM, 2023.
  • [25] Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress code: high-resolution multi-category virtual try-on. In CVPR, pages 2231–2235, 2022.
  • [26] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
  • [28] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • [29] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  • [30] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In SIGGRAPH, 2022.
  • [31] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • [32] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • [33] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. In SIGGRAPH, pages 1–11, 2023.
  • [34] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward characteristic-preserving image-based virtual try-on network. In ECCV, pages 589–604, 2018.
  • [35] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600–612, 2004.
  • [36] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In CVPR, 2023.
  • [37] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards photo-realistic virtual try-on by adaptively generating preserving image content. In CVPR, Jun 2020.
  • [38] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Jun 2018.
  • [39] Fuwei Zhao, Zhenyu Xie, Michael Kampffmeyer, Haoye Dong, Songfang Han, Tianxiang Zheng, Tao Zhang, and Xiaodan Liang. M3d-vton: A monocular-to-3d virtual try-on network. In ICCV, 2021.
Figure 6: Visualization of ablation studies.
Figure 7: Qualitative Comparisons with Competitive Baseline on DressCode Dataset.
Figure 8: More Qualitative Comparisons with Baselines on VITON-HD Dataset.