Single Stage Warped Cloth Learning and Semantic-Contextual Attention Feature Fusion for Virtual TryOn
Abstract
Image-based virtual try-on aims to fit an in-shop garment onto a clothed person image. Garment warping, which aligns the target garment with the corresponding body parts in the person image, is a crucial step in achieving this goal. Existing methods often use multi-stage frameworks to handle cloth warping, person body synthesis, and try-on generation separately, or rely on noisy intermediate parser-based labels. We propose a novel single-stage framework that learns all of these implicitly, without explicit multi-stage learning. Our approach utilizes a novel semantic-contextual fusion attention module for garment-person feature fusion, enabling efficient and realistic cloth warping and body synthesis from target pose keypoints. By introducing a lightweight linear attention framework that attends to garment regions and fuses multiple sampled flow fields, we also address the misalignment and artifacts present in previous methods. To achieve simultaneous learning of the warped garment and the try-on result, we introduce a Warped Cloth Learning Module. Our proposed approach significantly improves the quality and efficiency of virtual try-on, providing users with a more reliable and realistic virtual try-on experience.
Index Terms— Virtual Try-On, Single-Stage Synthesis, Garment Warping, Parser-Free Try-On
1 Introduction
Virtual try-on technology has gained significant traction in the retail industry, offering customers a realistic and reliable way to virtually try on in-shop garments on their own images. Among the different areas within virtual try-on, garment try-on has garnered substantial research interest and has been extensively explored in recent years [1, 2, 3, 4].
Due to the complexity of the virtual try-on problem, preserving both human and garment texture across varying body scenarios is crucial. Over the years, numerous methods have been proposed to address this challenge. VITON [5], one of the pioneering works in this field, utilized TPS warping to align garments with the human body and synthesized coarse results. CP-VTON [6] and CP-VTON+ [1] further improved the results by introducing semantic category refinement. OVITON [7] proposed an online optimization module for appearance refinement. Building on these advances, [8] introduced disentanglement of garments and incorporated cycle consistency into self-supervised training.
Flow-based models, such as Flow-Style [9] and DAFlow [10], used flow estimation to warp garments effectively.
Researchers have also leveraged prior knowledge, such as UV correspondence, to handle unobserved appearances. TryOnGAN [11] utilized the StyleGAN2 generation method to improve body shape deformation. To further enhance resolution, VITON-HD [4] and DressCode [3] were introduced. WUTON [2] adopted a student-teacher model for appearance flow distillation, further pushing the boundaries of virtual try-on techniques.
Most try-on techniques follow a two-stage process (garment warping, then try-on synthesis). In the context of garment warping, three primary techniques have been employed: the Spatial Transformer Network (STN) [12], which learns affine transformations for rigid garment warping; Thin Plate Spline (TPS) deformation [1], used for non-rigid garment warping by manipulating control points; and optical flow-based methods [10], which estimate dense per-pixel offsets for highly deformable warping.

Initially, many virtual try-on methods utilized TPS warping for the input garment, leveraging its control points to align and non-rigidly warp the garment using an energy function. However, this approach has largely been replaced by flow-based warping techniques. Ge et al. [8] disentangled the garment and non-garment regions, and He et al. [9] and Bai et al. [10] proposed Flow-Style and DAFlow, respectively, to warp garments using flow estimation. Furthermore, some studies combined pose transfer with virtual try-on: Dong et al. [13] proposed MGTON, which realizes pose-guided virtual try-on, and Xie et al. [11] introduced PASTA-GAN to implement patch-routed disentanglement in unsupervised training.
Recent advancements in self-attention mechanisms in vision transformers (ViTs) [14], the Swin Transformer [14], the Focal Transformer [14], and Restormer [14] have shown promise in various computer vision tasks, which has inspired the exploration of attention in try-on. One such work is PG-VTON [3], which leverages attention to identify the crucial regions where the clothing makes contact with the body. Similarly, FashionGAN+ [11] employs attention to transfer the appearance of clothing from one person to another, while SDAFN (DAWarp) [10] focuses on warping a garment onto a subject's body while considering the subject's pose.
While single-stage frameworks for virtual try-on exist [10], they forgo a separate flow network for warping and instead apply an end-to-end try-on pipeline with deformable flow. Nevertheless, these methods may suffer from disproportionate warping, as the learning is not task-specific as it is in multi-stage methods. Additionally, some approaches introduce heavy priors in the form of human parsing or DensePose.
We identify three main gaps. First, most virtual try-on solutions rely on an explicit, multi-stage pipeline in which intermediate representations such as optical flow or TPS transformations are learned sequentially by a separate module to compute the warped garment. Second, the flow estimation follows rigid-body paradigms: pixel offsets are computed without semantics or attention as guidance, giving equal importance to foreground and background regions. Third, blending multiple output features/images with a shallow decoder, or formulating a complex generative model as a separate learning framework, causes misalignment and texture artifacts.
Our work makes significant contributions in the following aspects:
- A novel warped garment learning module that jointly learns the warped garment, person synthesis, and the final try-on result explicitly, within a single-stage learning process.
- A lightweight linear attention framework that attends to garment regions and fuses multiple sampled flow fields to learn an optimal implicit garment flow.
- We attain state-of-the-art results on the VITON dataset, demonstrating the effectiveness of our parser-free, single-stage virtual try-on approach.
2 Methodology
2.1 Problem Definition
The objective of virtual try-on is to produce a try-on image from a person image and an in-shop garment image. The garment should align seamlessly with the corresponding regions of the person image, creating a coherent visual outcome. Moreover, the generated result should retain both the intricate details of the garment and the non-garment regions of the person image; in other words, the person should remain unaltered in the output except for the newly worn garment.
Our proposed methodology encompasses several key components: An Attention-based Flow Estimator, Semantic-Contextual Fusion Attention Block, and a Joint Learning Tryon Module.
The inputs to our method are the in-shop garment and a cloth-agnostic representation of the subject person. Additionally, we utilize a keypoint visualization to facilitate target person synthesis.
Our method utilises two feature extractors. The source input is the in-shop garment, whereas the reference input is the channel-wise concatenation of the cloth-agnostic person representation and the keypoint visualization. A refined pyramid extractor computes features at multiple scales [10].
These features are then employed to learn a deformable flow through our Linear Attention Flow (LAF) block. LAF learns both attention and flow across multiple scales, which our warp module uses to compute warped features. The warped features are fed into the Semantic-Contextual Fusion Attention (SCFA) block to predict the final try-on image. Attention and flow are learned in a multi-scale residual manner (Fig. 2).
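For intuition, the sketch below shows one common way a warp module can apply a predicted dense flow field to garment features via bilinear grid sampling; the function name, flow convention, and normalization are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    """Warp a feature map with a dense flow field (illustrative sketch).

    feat: (B, C, H, W) garment features
    flow: (B, 2, H, W) pixel offsets (dx, dy) predicted by the flow estimator
    """
    b, _, h, w = feat.shape
    # Base sampling grid in the normalized [-1, 1] coordinates expected by grid_sample
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=feat.device),
        torch.linspace(-1, 1, w, device=feat.device),
        indexing="ij",
    )
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert pixel offsets to normalized offsets and displace the grid
    norm_flow = torch.stack(
        (2.0 * flow[:, 0] / max(w - 1, 1), 2.0 * flow[:, 1] / max(h - 1, 1)), dim=-1
    )
    return F.grid_sample(feat, base_grid + norm_flow, align_corners=True)
```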
In the following subsections, we describe every module in detail.
2.2 Linear Attention Flow Block (LAF)
Transformers, pivotal in NLP and computer vision, rely on Query, Key, and Value (QKV) formulation. Despite enabling intricate dependency modeling, its quadratic complexity poses computational challenges. Addressing this, our proposed Linear Attention Flow Block (LAF) employs efficient linear attention, as in Wang et al. [14]. This alternative minimizes costly matrix multiplications, significantly reducing computation and memory requirements, ensuring LAF’s scalability and efficiency at high spatial resolutions.
The fundamental principle of our proposed linear self-attention is the incorporation of two linear projection matrices, $E$ and $F$, during key and value computation. We first project the original $(n \times d)$-dimensional key and value layers, $K$ and $V$, into $(k \times d)$-dimensional projected counterparts $EK$ and $FV$. The context mapping matrix $\bar{P}$ of dimensions $n \times k$ is then derived through scaled dot-product attention, as illustrated by the equation:

$$\bar{P} = \operatorname{softmax}\!\left(\frac{Q\,(EK)^{\top}}{\sqrt{d_k}}\right) \tag{1}$$
The final step entails generating context embeddings for each head$_i$ utilizing $\bar{P}\,(FV)$. Importantly, these operations exhibit a time and space complexity of $O(nk)$, which is considerably efficient. This efficiency becomes more pronounced when we opt for a small projected dimension $k \ll n$, substantially reducing memory and space requirements. Our LAF block thus strikes a balance between computational efficiency and robust attention modeling, making it well suited to our virtual try-on pipeline. Finally, the attended values are obtained as $\hat{V} = \bar{P}\,(FV)$.
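A minimal single-head PyTorch sketch of this Linformer-style linear attention follows; the class name, projected dimension, and single-head simplification are illustrative assumptions rather than the exact LAF implementation.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Single-head Linformer-style attention: keys and values are projected
    from sequence length n down to k before the softmax (illustrative sketch)."""

    def __init__(self, dim, n_tokens, k=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.E = nn.Linear(n_tokens, k, bias=False)  # projects keys along the token axis
        self.F = nn.Linear(n_tokens, k, bias=False)  # projects values along the token axis
        self.scale = dim ** -0.5

    def forward(self, x):                            # x: (B, n, dim)
        q = self.q(x)                                # (B, n, dim)
        k, v = self.kv(x).chunk(2, dim=-1)           # each (B, n, dim)
        k = self.E(k.transpose(1, 2)).transpose(1, 2)   # (B, k, dim)
        v = self.F(v.transpose(1, 2)).transpose(1, 2)   # (B, k, dim)
        attn = (q @ k.transpose(1, 2)) * self.scale     # (B, n, k) context map
        attn = attn.softmax(dim=-1)
        return attn @ v                                 # (B, n, dim), O(nk) cost
```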

2.3 Semantic-Contextual Fusion Attention
In this subsection, we propose a novel approach for computing the final try-on result in our virtual try-on pipeline. The baseline feeds the pixel-wise summation of the warped-garment features and the generated reference-person features into a two-layer convolutional shallow decoder to compute the target try-on result. This approach cannot effectively capture the fine-grained relationships between the two sets of features, potentially leading to suboptimal try-on results. To address this limitation, we introduce a Semantic-Contextual Fusion Attention (SCFA) module that exploits the attention mechanism's capability to capture semantic information while considering the context between the garment and the person during the try-on process.
2.3.1 Key-Value Attention Computation
Given the warped-garment features $F_g$ and the generated reference-person features $F_p$, we compute the KV attention as follows. We first apply a convolutional layer to create key and value embeddings, $(K_g, V_g)$ and $(K_p, V_p)$, for the two sets of features. We then compute the scaled dot products of the "keys" and "values" embeddings and apply a softmax to obtain the attention weights for the person and garment features:
$$A_p = \operatorname{softmax}\!\left(\frac{K_p V_p^{\top}}{\sqrt{d_k}}\right) \tag{2}$$

$$A_g = \operatorname{softmax}\!\left(\frac{K_g V_g^{\top}}{\sqrt{d_k}}\right) \tag{3}$$
2.3.2 Attention-Guided Feature Fusion
Next, we utilize the computed attention weights to create attention features, which act as weights for fusing the input features. The attention features $\hat{F}_p$ and $\hat{F}_g$ are computed as follows:
$$\hat{F}_p = A_p \odot F_p \tag{4}$$

$$\hat{F}_g = A_g \odot F_g \tag{5}$$
where $A_p$ and $A_g$ denote the attention weights for the person and garment features, respectively, and $\odot$ represents element-wise multiplication. The resulting attention features have the same dimensions as the input feature maps.
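Because the exact tensor shapes are not spelled out above, the sketch below shows one dimensionally consistent way to realize the key-value attention and element-wise gating of Eqs. (2)-(5) in PyTorch; the per-location softmax, shared 1x1 convolutions, and module name are assumptions, not the exact SCFA implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of attention-guided fusion: per-pixel weights gate the warped
    garment and person features before decoding (illustrative, not exact)."""

    def __init__(self, channels):
        super().__init__()
        # Shared 1x1 convolutions produce "key" and "value" embeddings
        self.key_embed = nn.Conv2d(channels, channels, kernel_size=1)
        self.val_embed = nn.Conv2d(channels, channels, kernel_size=1)

    def attention_map(self, feat):
        k = self.key_embed(feat)                      # (B, C, H, W)
        v = self.val_embed(feat)                      # (B, C, H, W)
        # Scaled dot product along channels -> one weight per spatial location
        scores = (k * v).sum(dim=1, keepdim=True) / (feat.shape[1] ** 0.5)
        b, _, h, w = scores.shape
        return torch.softmax(scores.view(b, 1, -1), dim=-1).view(b, 1, h, w)

    def forward(self, person_feat, garment_feat):
        a_p = self.attention_map(person_feat)         # (B, 1, H, W)
        a_g = self.attention_map(garment_feat)
        return a_p * person_feat, a_g * garment_feat  # element-wise gating
```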
2.3.3 Decoder for Final Try-On Result
The attention-guided features $\hat{F}_p$ and $\hat{F}_g$ are then concatenated and passed through a shallow decoder to generate the final try-on result $\hat{I}$. The shallow decoder is applied as follows:
$$\hat{I} = \operatorname{Decoder}\!\left(\left[\hat{F}_p \,;\, \hat{F}_g\right]\right) \tag{6}$$
where $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation and Decoder is a shallow neural network responsible for producing the final try-on result.
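A matching sketch of the shallow decoder of Eq. (6) is given below; the layer count, channel width, and output activation are illustrative assumptions, and only a single scale is shown for brevity.

```python
import torch
import torch.nn as nn

class ShallowDecoder(nn.Module):
    """Concatenates the attended person/garment features and decodes them
    to an RGB try-on image (illustrative sketch; layer sizes are assumed)."""

    def __init__(self, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
            nn.Tanh(),  # output image in [-1, 1]
        )

    def forward(self, attended_person, attended_garment):
        return self.net(torch.cat([attended_person, attended_garment], dim=1))
```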
2.4 Joint Learning Module: Cloth Warping and Try-On
The warping module is the backbone of the try-on methodology, and the results obtained from this stage play a crucial role in the subsequent try-on module.
The key idea behind this constraint is to integrate explicit learning of the structural and perceptual complexity of the warped garment into our pipeline. The availability of high-quality garment segmentation is one of the key factors that motivated us to formulate a learning constraint on the warped garment directly within the single-stage pipeline [10]. This results in joint learning of the warped garment and the try-on result, improving both in parallel.
The Warped Cloth Learning Module (WCLM) uses the parser-based garment segmentation ground truth as the guiding signal and is optimized with an L1 loss. The overall objective is a combination of the warped-garment loss $\mathcal{L}_{warp}$ and the try-on loss $\mathcal{L}_{tryon}$ and can be written as:
$$\mathcal{L}_{joint} = \mathcal{L}_{warp} + \mathcal{L}_{tryon} \tag{7}$$
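As a sketch only, the joint objective could be assembled roughly as follows; the function name, the equal weights, and the plain L1 terms are assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def joint_warp_tryon_loss(pred_warp, gt_warp, pred_tryon, gt_tryon,
                          w_warp=1.0, w_tryon=1.0):
    """Illustrative joint objective: an L1 term on the warped garment (guided
    by the garment-segmentation ground truth) plus an L1 term on the try-on
    result. The weights are placeholders, not the paper's values."""
    loss_warp = F.l1_loss(pred_warp, gt_warp)
    loss_tryon = F.l1_loss(pred_tryon, gt_tryon)
    return w_warp * loss_warp + w_tryon * loss_tryon
```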
2.5 Training Loss Functions
The L1 loss combines the pixel-wise errors of the try-on result and the warped garment, where $\hat{I}$ and $I^{gt}$ denote the predicted and ground-truth try-on images and $\hat{W}$ and $W^{gt}$ the predicted and ground-truth warped garments. It is mathematically formulated as:
$$\mathcal{L}_{1} = \left\lVert \hat{I} - I^{gt} \right\rVert_{1} + \left\lVert \hat{W} - W^{gt} \right\rVert_{1} \tag{8}$$
The perceptual loss is based on the VGG network and calculates the L1 distance between the feature maps extracted by the network, for both the try-on result and the warped garment, where $\phi_i$ denotes the $i$-th VGG feature map:
$$\mathcal{L}_{perc} = \sum_{i} \left\lVert \phi_i(\hat{I}) - \phi_i(I^{gt}) \right\rVert_{1} + \sum_{i} \left\lVert \phi_i(\hat{W}) - \phi_i(W^{gt}) \right\rVert_{1} \tag{9}$$
The style loss is the statistical error between the feature maps of the prediction and the ground truth, where the Gram matrix measures the feature correlation between the two inputs:
$$\mathcal{L}_{style} = \sum_{i} \left\lVert G\!\left(\phi_i(\hat{I})\right) - G\!\left(\phi_i(I^{gt})\right) \right\rVert_{1} \tag{10}$$
where $G(\cdot)$ denotes the Gram matrix of the features.
The overall loss is presented as:
$$\mathcal{L} = \sum_{s} \lambda_{s}\, \mathcal{L}^{s} \tag{11}$$
where $\mathcal{L}^{s}$ is the loss at scale $s$, combining $\mathcal{L}_{1}$, $\mathcal{L}_{perc}$, and $\mathcal{L}_{style}$, and scales closer to the output are given larger weights $\lambda_{s}$.
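For concreteness, a hedged PyTorch sketch of the combined L1, perceptual, and style terms of Eqs. (8)-(10) at a single scale is shown below; the VGG19 layer indices and the loss weights are placeholders, not the values used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class VGGFeatures(nn.Module):
    """Frozen VGG19 slices used for the perceptual and style terms (sketch)."""

    def __init__(self, layers=(3, 8, 17, 26)):   # placeholder layer indices
        super().__init__()
        vgg = vgg19(weights="DEFAULT").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layers = set(layers)

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
        return feats

def gram(f):
    # Gram matrix G of a feature map, normalized by its size
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def total_loss(vgg_feats, pred, gt, w_l1=1.0, w_perc=1.0, w_style=1.0):
    """L1 + perceptual + style between a prediction and its ground truth."""
    fp, fg = vgg_feats(pred), vgg_feats(gt)
    l1 = F.l1_loss(pred, gt)
    perc = sum(F.l1_loss(a, b) for a, b in zip(fp, fg))
    style = sum(F.l1_loss(gram(a), gram(b)) for a, b in zip(fp, fg))
    return w_l1 * l1 + w_perc * perc + w_style * style
```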
3 Experiments
3.1 Datasets
The experiments were conducted on two commonly used and publicly available datasets, VITON [5] and VITON-HD [4]. VITON consists of 14,221 training images and a test set of 2,032 image pairs; the images are frontal views paired with the corresponding in-shop garments for try-on. VITON-HD contains high-resolution paired images of in-shop garments and the corresponding human models wearing them, consisting of 11,647 training images and 2,032 test images.
Table 1: Quantitative comparison with state-of-the-art methods on the VITON dataset.

| Methods | Parser | Warping | FID | SSIM |
|---|---|---|---|---|
| VITON [5] | Y | TPS | 55.71 | 0.74 |
| CP-VTON [6] | Y | TPS | 24.45 | 0.72 |
| CP-VTON+ [1] | Y | TPS | 21.04 | 0.75 |
| ClothFlow [15] | Y | AF | 14.43 | 0.84 |
| ACGPN [16] | Y | TPS | 16.64 | 0.84 |
| DCTON [8] | Y | TPS | 14.82 | 0.83 |
| PF-AFN [17] | N | AF | 10.09 | 0.89 |
| [15] | N | AF | 10.73 | 0.89 |
| Ours | N | AF | 9.14 | 0.90 |
3.2 Implementation details
All experiments were conducted on two A5000 GPUs using PyTorch. The Adam optimizer is used with a batch size of 8 and a learning rate of 0.000035, and the model is trained for 250 epochs. The deformable flow is learned from multiple sampled flow fields, the loss terms are combined with predefined weights, and features are extracted at multiple scales by the pyramid extractor.
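A minimal sketch of the optimizer setup implied by these hyperparameters follows; the helper name is hypothetical, and anything beyond the reported values (Adam, learning rate 3.5e-5) is an assumption.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    """Optimizer setup reflecting the reported hyperparameters (Adam,
    learning rate 3.5e-5). Batch size 8 and 250 epochs are configured in
    the data loader and training loop, which are not shown here."""
    return torch.optim.Adam(model.parameters(), lr=3.5e-5)
```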
Table 2: Ablation study of the proposed modules.

| Config | SSIM | PSNR |
|---|---|---|
| Baseline | 0.817 | 20.78 |
| +LAF | 0.835 | 21.84 |
| +SCFA | 0.863 | 22.43 |
| +Warp Loss | 0.90 | 26.5 |
Table 3: Quantitative comparison on the VITON-HD dataset.

| Method | FID | KID | FID | KID | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| VITON-HD [4] | 14.64 | 6.10 | 12.81 | 5.52 | 0.848 | 0.1216 |
| HR-VITON [18] | 12.15 | 3.42 | 9.92 | 3.06 | 0.860 | 0.1038 |
| GP-VTON [19] | 10.49 | 2.23 | 7.71 | 2.01 | 0.857 | 0.0897 |
| DCI-VTON [20] | 11.14 | 3.35 | 8.19 | 2.93 | 0.875 | 0.0816 |
| Proposed | 8.93 | 1.37 | 5.60 | 0.83 | 0.877 | 0.0803 |
3.3 Qualitative Results
Our method was qualitatively evaluated against CP-VTON+ [1], ACGPN [16], PF-AFN [17], and SDAFN [10] (Fig. 2). While CP-VTON+ and ACGPN are parser-based and prone to misalignment issues, PF-AFN, a parser-free framework, avoids misalignment but suffers from blurred body parts such as arms and shoulders.
All methods roughly align the deformed garment with the wearer, but artifacts become more visible as texture and structural complexity increase. Our model preserves image contrast and skin texture, especially in the neck region, much better. In Fig. 2, our model exhibits superior preservation of the garment on the left shoulder and more realistic arm generation, while ensuring structural integrity with minimal misalignment in the sleeves and collar edges.
3.4 Quantitative Results
The baseline of the proposed approach uses a simple warp module and flow estimation without any attention module; its scores are reported in Table 2. As improvements, we introduce LAF, SCFA, and the warp loss. The LAF module improves the flow, which in turn improves garment warping; the SCFA module captures semantic information while considering the context between the garment and the person during the try-on process; and the warp loss improves warping by suppressing learning from poorly warped garment regions (Table 2). On the VITON dataset (Table 1), the proposed method achieves outstanding results with an SSIM of 0.90 and an FID of 9.14, outperforming the other methods. The scores on the VITON-HD dataset (Table 3) likewise favor the proposed method over competing approaches, demonstrating the effectiveness of our single-stage framework.
4 Conclusion
We proposed a novel single-stage framework for image-based virtual try-on, achieving photo-realistic fitting results by aligning in-shop garments with clothed person images. Our approach jointly learns warped garments and flow fields using the semantic-contextual fusion attention module and a lightweight linear attention framework. The Warped Cloth Learning Module allows simultaneous learning of the warped garment and the try-on result. Experiments on the VITON dataset demonstrate state-of-the-art performance both qualitatively and quantitatively. Our method eliminates multi-stage processing, offering a more efficient and reliable virtual try-on experience and showing potential for real-world applications. Future work will include enhancing robustness on diverse datasets and exploring real-time applications.
References
- [1] Matiur Rahman Minar, Thai Thanh Tuan, Heejune Ahn, Paul Rosin, and Yu-Kun Lai, “Cp-vton+: Clothing shape and texture preserving image-based virtual try-on,” in CVPR Workshops, 2020, vol. 3, pp. 10–14.
- [2] Thibaut Issenhuth, Jérémie Mary, and Clément Calauzènes, “Do not mask what you do not need to mask: a parser-free virtual try-on,” ArXiv, vol. abs/2007.02721, 2020.
- [3] Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, and Kerui Hu, “Pg-vton: A novel image-based virtual try-on method via progressive inference paradigm,” arXiv preprint arXiv:2304.08956, 2023.
- [4] Seunghwan Choi, Sunghyun Park, Min Gi Lee, and Jaegul Choo, “Viton-hd: High-resolution virtual try-on via misalignment-aware normalization,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14126–14135, 2021.
- [5] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S. Davis, “Viton: An image-based virtual try-on network,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7543–7552, 2017.
- [6] Bochao Wang, Huabing Zhang, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang, “Toward characteristic-preserving image-based virtual try-on network,” ArXiv, vol. abs/1807.07688, 2018.
- [7] Assaf Neuberger, Eran Borenstein, Bar Hilleli, Eduard Oks, and Sharon Alpert, “Image based virtual try-on network from unpaired data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- [8] Chongjian Ge, Yibing Song, Yuying Ge, Han Yang, Wei Liu, and Ping Luo, “Disentangled cycle consistency for highly-realistic virtual try-on,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16923–16932, 2021.
- [9] Sen He, Yi-Zhe Song, and Tao Xiang, “Style-based global appearance flow for virtual try-on,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3460–3469, 2022.
- [10] Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, and Hongxia Yang, “Single stage virtual try-on via deformable attention flows,” in European Conference on Computer Vision, 2022.
- [11] Kiran Mayee Adavala, “Generation of “hand-drawn” images using deep convolutional generative adversarial networks,” resmilitaris, vol. 13, no. 3, pp. 2104–2111, 2023.
- [12] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and koray kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds. 2015, vol. 28, Curran Associates, Inc.
- [13] Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin, “Towards multi-pose guided virtual try-on network,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9026–9035.
- [14] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 5728–5739.
- [15] Xintong Han, Weilin Huang, Xiaojun Hu, and Matthew R. Scott, “Clothflow: A flow-based model for clothed person generation,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10470–10479, 2019.
- [16] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo, “Towards photo-realistic virtual try-on by adaptively generating-preserving image content,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- [17] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo, “Parser-free virtual try-on via distilling appearance flows,” arXiv preprint arXiv:2103.04559, 2021.
- [18] Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo, “High-resolution virtual try-on with misalignment and occlusion-handled conditions,” arXiv preprint arXiv:2206.14180, 2022.
- [19] Xie Zhenyu, Huang Zaiyu, Dong Xin, Zhao Fuwei, Dong Haoye, Zhang Xijin, Zhu Feida, and Liang Xiaodan, “Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
- [20] Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang, “Taming the power of diffusion models for high-quality virtual try-on with appearance flow,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023.