Single Stage Warped Cloth Learning and Semantic-Contextual Attention Feature Fusion for Virtual TryOn
Abstract
Image-based virtual try-on aims to fit an in-shop garment onto a clothed person image. Garment warping, which aligns the target garment with the corresponding body parts in the person image, is a crucial step in achieving this goal. Existing methods often use multi-stage frameworks to handle cloth warping, person body synthesis, and try-on generation separately, or rely on noisy intermediate parser-based labels. We propose a novel single-stage framework that learns all of these implicitly, without explicit multi-stage learning. Our approach utilizes a novel semantic-contextual fusion attention module for garment-person feature fusion, enabling efficient and realistic cloth warping and body synthesis from target pose keypoints. By introducing a lightweight linear attention framework that attends to garment regions and fuses multiple sampled flow fields, we also address the misalignment and artifacts present in previous methods. To achieve simultaneous learning of the warped garment and the try-on result, we introduce a Warped Cloth Learning Module. Our proposed approach significantly improves the quality and efficiency of virtual try-on, providing users with a more reliable and realistic virtual try-on experience.
Index Terms— Virtual Try-On, Single-Stage Synthesis, Garment Warping, Parser-Free Try-On
1 Introduction
Virtual try-on technology has gained significant traction in the retail industry, offering customers a realistic and reliable way to virtually try on in-shop garments on their own images. Among the different areas within virtual try-on, garment try-on has garnered substantial research interest and has been extensively explored in recent years [1, 2, 3, 4].
Due to the complexity of the virtual try-on problem, preserving both human and garment texture across varying body scenarios is crucial. Over the years, numerous methods have been proposed to address this challenge. VITON [5], one of the pioneering works in this field, utilized TPS warping to align garments with the human body and synthesized coarse results. CP-VTON [6] and CP-VTON+ [1] further improved the results by introducing semantic category refinement. OVITON [7] proposed an online optimization module for appearance refinement. Building on these advances, [8] introduced disentanglement of garments and incorporated cycle consistency into self-supervised training.
Flow-based models, such as Flow-Style [9] and DAFlow [10], used flow estimation to warp garments effectively.
Researchers have also leveraged prior knowledge, such as UV correspondence, to handle unobserved appearances. TryOnGAN [11] utilized the StyleGAN2 generation method to improve body shape deformation. To further enhance resolution, VITON-HD [4] and DressCode [3] were introduced. WUTON [2] adopted a student-teacher model for appearance flow distillation, further pushing the boundaries of virtual try-on techniques.
Most try-on techniques follow a two-stage process (garment warping, then try-on synthesis). In the context of garment warping, three primary techniques have been employed: the Spatial Transformer Network (STN) [12], which learns affine transformations for rigid garment warping; Thin Plate Spline (TPS) deformation [1], used for non-rigid garment warping by manipulating control points; and optical flow-based methods [10], which estimate dense per-pixel offsets for highly deformable warping.

Initially, many virtual try-on methods utilized TPS warping for the input garment, leveraging its control points to align and non-rigidly warp the garment using an energy function. However, this approach has largely been replaced by flow-based warping techniques. Ge et al. [8] disentangled the garment and non-garment regions, and He et al. [9] and Bai et al. [10] proposed Flow-Style and DAFlow, respectively, to warp garments using flow estimation. Furthermore, some studies combined pose transfer with virtual try-on: Dong et al. [13] proposed MGTON, which realizes pose-guided virtual try-on, and Xie et al. [11] introduced PASTA-GAN to implement patch-routed disentanglement in unsupervised training.
Recent advancements in self-attention mechanisms in vision transformers (ViTs) [14], the Swin Transformer [14], the Focal Transformer [14], and Restormer [14] have shown promise in various computer vision tasks, which has inspired the exploration of attention in try-on. One such work is PG-VTON [3], which leverages attention to identify the crucial regions where the clothing makes contact with the body. Similarly, FashionGAN+ [11] employs attention to transfer the appearance of clothing from one person to another, while SDAFN (DAWarp) [10] focuses on warping a garment onto a subject's body while considering the subject's pose.
While single-stage frameworks for virtual try-on exist [10], they forgo a separate flow network for warping and instead apply an end-to-end try-on pipeline with deformable flow. Nevertheless, these methods may suffer from disproportionate warping, as the learning is not task-specific as it is in multi-stage methods. Additionally, some approaches introduce heavy priors in the form of human parsing or DensePose.
We identify three main gaps. First, most virtual try-on solutions rely on an explicit, multi-stage pipeline in which intermediate representations such as optical flow or TPS transformations are learned sequentially by a separate module to compute the warped garment. Second, the flow estimation follows rigid-body paradigms: pixel offsets are computed without semantics or attention as guidance, giving equal importance to foreground and background regions. Third, blending multiple output features/images with a shallow decoder, or formulating a complex generative model as a separate learning framework, causes misalignment and texture artifacts.
Our work makes significant contributions in the following aspects:
- A novel warped garment learning module that jointly learns the warped garment, person synthesis, and the final try-on result explicitly, within a single-stage learning process.
- A lightweight linear attention framework that attends to garment regions and fuses multiple sampled flow fields to learn an optimal implicit garment flow.
- We attain state-of-the-art results on the VITON dataset, demonstrating the effectiveness of our parser-free, single-stage virtual try-on approach.
2 Methodology
2.1 Problem Definition
The objective of virtual try-on is to produce a try-on image from a person image and an in-shop garment image. The garment should align seamlessly with the corresponding regions of the person image, creating a coherent visual outcome. Moreover, the generated result should retain both the intricate details of the garment and the non-garment regions of the person image; in other words, the person should remain unaltered in the output except for the newly worn garment.
Our proposed methodology encompasses several key components: An Attention-based Flow Estimator, Semantic-Contextual Fusion Attention Block, and a Joint Learning Tryon Module.
The inputs to our method are the in-shop garment and a cloth-agnostic representation of the subject person. Additionally, we utilize a keypoint visualization to facilitate target person synthesis.
Our method utilises two feature extractors. The source input is the in-shop garment, whereas the reference input is the channel-wise concatenation of the cloth-agnostic person representation and the keypoint visualization. A refined pyramid extractor computes features at multiple scales [10].
These features are then employed to learn a deformable flow through our Linear Attention Flow (LAF) block. LAF learns both attention and flow across multiple scales, which our warp module uses to compute warped features. The warped features are fed into the Semantic-Contextual Fusion Attention (SCFA) block to predict the final try-on image. Attention and flow are learned in a multi-scale residual manner (Fig. 2).
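For intuition, the sketch below shows one common way a warp module can apply a predicted dense flow field to garment features via bilinear grid sampling; the function name, flow convention, and normalization are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    """Warp a feature map with a dense flow field (illustrative sketch).

    feat: (B, C, H, W) garment features
    flow: (B, 2, H, W) pixel offsets (dx, dy) predicted by the flow estimator
    """
    b, _, h, w = feat.shape
    # Base sampling grid in the normalized [-1, 1] coordinates expected by grid_sample
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=feat.device),
        torch.linspace(-1, 1, w, device=feat.device),
        indexing="ij",
    )
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert pixel offsets to normalized offsets and displace the grid
    norm_flow = torch.stack(
        (2.0 * flow[:, 0] / max(w - 1, 1), 2.0 * flow[:, 1] / max(h - 1, 1)), dim=-1
    )
    return F.grid_sample(feat, base_grid + norm_flow, align_corners=True)
```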
In the following subsections, we describe every module in detail.
2.2 Linear Attention Flow Block (LAF)
Transformers, pivotal in NLP and computer vision, rely on Query, Key, and Value (QKV) formulation. Despite enabling intricate dependency modeling, its quadratic complexity poses computational challenges. Addressing this, our proposed Linear Attention Flow Block (LAF) employs efficient linear attention, as in Wang et al. [14]. This alternative minimizes costly matrix multiplications, significantly reducing computation and memory requirements, ensuring LAF’s scalability and efficiency at high spatial resolutions.
The fundamental principle of our proposed linear self-attention is the incorporation of two linear projection matrices, $E$ and $F$, during key and value computation. We first project the original $(n \times d)$-dimensional key and value layers, $K$ and $V$, into $(k \times d)$-dimensional projected counterparts $EK$ and $FV$. The context mapping matrix $\bar{P}$ of dimensions $n \times k$ is then derived through scaled dot-product attention, as illustrated by the equation:

$$\bar{P} = \operatorname{softmax}\!\left(\frac{Q\,(EK)^{\top}}{\sqrt{d_k}}\right) \tag{1}$$
The final step entails generating context embeddings for each head$_i$ utilizing $\bar{P}\,(FV)$. Importantly, these operations exhibit a time and space complexity of $O(nk)$, which is considerably efficient. This efficiency becomes more pronounced when we opt for a small projected dimension $k \ll n$, substantially reducing memory and space requirements. Our LAF block thus strikes a balance between computational efficiency and robust attention modeling, making it well suited to our virtual try-on pipeline. Finally, the attended values are obtained as $\hat{V} = \bar{P}\,(FV)$.
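A minimal single-head PyTorch sketch of this Linformer-style linear attention follows; the class name, projected dimension, and single-head simplification are illustrative assumptions rather than the exact LAF implementation.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Single-head Linformer-style attention: keys and values are projected
    from sequence length n down to k before the softmax (illustrative sketch)."""

    def __init__(self, dim, n_tokens, k=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.E = nn.Linear(n_tokens, k, bias=False)  # projects keys along the token axis
        self.F = nn.Linear(n_tokens, k, bias=False)  # projects values along the token axis
        self.scale = dim ** -0.5

    def forward(self, x):                            # x: (B, n, dim)
        q = self.q(x)                                # (B, n, dim)
        k, v = self.kv(x).chunk(2, dim=-1)           # each (B, n, dim)
        k = self.E(k.transpose(1, 2)).transpose(1, 2)   # (B, k, dim)
        v = self.F(v.transpose(1, 2)).transpose(1, 2)   # (B, k, dim)
        attn = (q @ k.transpose(1, 2)) * self.scale     # (B, n, k) context map
        attn = attn.softmax(dim=-1)
        return attn @ v                                 # (B, n, dim), O(nk) cost
```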

2.3 Semantic-Contextual Fusion Attention
In this subsection, we propose a novel approach for computing the final try-on result in our virtual try-on pipeline. The baseline feeds the pixel-wise summation of the warped-garment features and the generated reference-person features into a two-layer convolutional shallow decoder to compute the target try-on result. This approach cannot effectively capture the fine-grained relationships between the two sets of features, potentially leading to suboptimal try-on results. To address this limitation, we introduce a Semantic-Contextual Fusion Attention (SCFA) module that exploits the attention mechanism's capability to capture semantic information while considering the context between the garment and the person during the try-on process.
2.3.1 Key-Value Attention Computation
Given the warped-garment features $F_g$ and the generated reference-person features $F_p$, we compute the KV attention as follows. We first apply a convolutional layer to create key and value embeddings, $(K_g, V_g)$ and $(K_p, V_p)$, for the two sets of features. We then compute the scaled dot products of the "keys" and "values" embeddings and apply a softmax to obtain the attention weights for the person and garment features:
$$A_p = \operatorname{softmax}\!\left(\frac{K_p V_p^{\top}}{\sqrt{d_k}}\right) \tag{2}$$

$$A_g = \operatorname{softmax}\!\left(\frac{K_g V_g^{\top}}{\sqrt{d_k}}\right) \tag{3}$$
2.3.2 Attention-Guided Feature Fusion
Next, we utilize the computed attention weights to create attention features, which act as weights for fusing the input features. The attention features $\hat{F}_p$ and $\hat{F}_g$ are computed as follows:
$$\hat{F}_p = A_p \odot F_p \tag{4}$$

$$\hat{F}_g = A_g \odot F_g \tag{5}$$
where $A_p$ and $A_g$ denote the attention weights for the person and garment features, respectively, and $\odot$ represents element-wise multiplication. The resulting attention features have the same dimensions as the input feature maps.
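Because the exact tensor shapes are not spelled out above, the sketch below shows one dimensionally consistent way to realize the key-value attention and element-wise gating of Eqs. (2)-(5) in PyTorch; the per-location softmax, shared 1x1 convolutions, and module name are assumptions, not the exact SCFA implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of attention-guided fusion: per-pixel weights gate the warped
    garment and person features before decoding (illustrative, not exact)."""

    def __init__(self, channels):
        super().__init__()
        # Shared 1x1 convolutions produce "key" and "value" embeddings
        self.key_embed = nn.Conv2d(channels, channels, kernel_size=1)
        self.val_embed = nn.Conv2d(channels, channels, kernel_size=1)

    def attention_map(self, feat):
        k = self.key_embed(feat)                      # (B, C, H, W)
        v = self.val_embed(feat)                      # (B, C, H, W)
        # Scaled dot product along channels -> one weight per spatial location
        scores = (k * v).sum(dim=1, keepdim=True) / (feat.shape[1] ** 0.5)
        b, _, h, w = scores.shape
        return torch.softmax(scores.view(b, 1, -1), dim=-1).view(b, 1, h, w)

    def forward(self, person_feat, garment_feat):
        a_p = self.attention_map(person_feat)         # (B, 1, H, W)
        a_g = self.attention_map(garment_feat)
        return a_p * person_feat, a_g * garment_feat  # element-wise gating
```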
2.3.3 Decoder for Final Try-On Result
The attention-guided features $\hat{F}_p$ and $\hat{F}_g$ are then concatenated and passed through a shallow decoder to generate the final try-on result $\hat{I}$. The shallow decoder is applied as follows:
$$\hat{I} = \operatorname{Decoder}\!\left(\left[\hat{F}_p \,;\, \hat{F}_g\right]\right) \tag{6}$$
where $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation and Decoder is a shallow neural network responsible for producing the final try-on result.
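A matching sketch of the shallow decoder of Eq. (6) is given below; the layer count, channel width, and output activation are illustrative assumptions, and only a single scale is shown for brevity.

```python
import torch
import torch.nn as nn

class ShallowDecoder(nn.Module):
    """Concatenates the attended person/garment features and decodes them
    to an RGB try-on image (illustrative sketch; layer sizes are assumed)."""

    def __init__(self, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
            nn.Tanh(),  # output image in [-1, 1]
        )

    def forward(self, attended_person, attended_garment):
        return self.net(torch.cat([attended_person, attended_garment], dim=1))
```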
2.4 Joint Learning Module: Cloth Warping and Try-On
The warping module is the backbone of the try-on methodology, and the results obtained from this stage play a crucial role in the subsequent try-on module.
The key idea behind this constraint is to integrate explicit learning of the structural and perceptual complexity of the warped garment into our pipeline. The availability of high-quality garment segmentation is one of the key factors that motivated us to formulate a learning constraint on the warped garment directly within the single-stage pipeline [10]. This results in joint learning of the warped garment and the try-on result, improving both in parallel.
The Warped Cloth Learning Module (WCLM) uses the parser-based garment segmentation ground truth as the guiding signal and is optimized with an L1 loss. The overall objective is a combination of the warped-garment loss $\mathcal{L}_{warp}$ and the try-on loss $\mathcal{L}_{tryon}$ and can be written as:
$$\mathcal{L}_{joint} = \mathcal{L}_{warp} + \mathcal{L}_{tryon} \tag{7}$$
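As a sketch only, the joint objective could be assembled roughly as follows; the function name, the equal weights, and the plain L1 terms are assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def joint_warp_tryon_loss(pred_warp, gt_warp, pred_tryon, gt_tryon,
                          w_warp=1.0, w_tryon=1.0):
    """Illustrative joint objective: an L1 term on the warped garment (guided
    by the garment-segmentation ground truth) plus an L1 term on the try-on
    result. The weights are placeholders, not the paper's values."""
    loss_warp = F.l1_loss(pred_warp, gt_warp)
    loss_tryon = F.l1_loss(pred_tryon, gt_tryon)
    return w_warp * loss_warp + w_tryon * loss_tryon
```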
2.5 Training Loss Functions
The L1 loss combines the pixel-wise errors of the try-on result and the warped garment, where $\hat{I}$ and $I^{gt}$ denote the predicted and ground-truth try-on images and $\hat{W}$ and $W^{gt}$ the predicted and ground-truth warped garments. It is mathematically formulated as:
$$\mathcal{L}_{1} = \left\lVert \hat{I} - I^{gt} \right\rVert_{1} + \left\lVert \hat{W} - W^{gt} \right\rVert_{1} \tag{8}$$
The perceptual loss is based on the VGG network and calculates the L1 distance between the feature maps extracted by the network, for both the try-on result and the warped garment, where $\phi_i$ denotes the $i$-th VGG feature map:
$$\mathcal{L}_{perc} = \sum_{i} \left\lVert \phi_i(\hat{I}) - \phi_i(I^{gt}) \right\rVert_{1} + \sum_{i} \left\lVert \phi_i(\hat{W}) - \phi_i(W^{gt}) \right\rVert_{1} \tag{9}$$
The style loss is the statistical error between the feature maps of the prediction and the ground truth, where the Gram matrix measures the feature correlation between the two inputs:
$$\mathcal{L}_{style} = \sum_{i} \left\lVert G\!\left(\phi_i(\hat{I})\right) - G\!\left(\phi_i(I^{gt})\right) \right\rVert_{1} \tag{10}$$
where $G(\cdot)$ denotes the Gram matrix of the features.
The overall loss is presented as:
$$\mathcal{L} = \sum_{s} \lambda_{s}\, \mathcal{L}^{s} \tag{11}$$
where $\mathcal{L}^{s}$ is the loss at scale $s$, combining $\mathcal{L}_{1}$, $\mathcal{L}_{perc}$, and $\mathcal{L}_{style}$, and scales closer to the output are given larger weights $\lambda_{s}$.
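For concreteness, a hedged PyTorch sketch of the combined L1, perceptual, and style terms of Eqs. (8)-(10) at a single scale is shown below; the VGG19 layer indices and the loss weights are placeholders, not the values used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class VGGFeatures(nn.Module):
    """Frozen VGG19 slices used for the perceptual and style terms (sketch)."""

    def __init__(self, layers=(3, 8, 17, 26)):   # placeholder layer indices
        super().__init__()
        vgg = vgg19(weights="DEFAULT").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layers = set(layers)

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
        return feats

def gram(f):
    # Gram matrix G of a feature map, normalized by its size
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def total_loss(vgg_feats, pred, gt, w_l1=1.0, w_perc=1.0, w_style=1.0):
    """L1 + perceptual + style between a prediction and its ground truth."""
    fp, fg = vgg_feats(pred), vgg_feats(gt)
    l1 = F.l1_loss(pred, gt)
    perc = sum(F.l1_loss(a, b) for a, b in zip(fp, fg))
    style = sum(F.l1_loss(gram(a), gram(b)) for a, b in zip(fp, fg))
    return w_l1 * l1 + w_perc * perc + w_style * style
```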
3 Experiments
3.1 Datasets
The experiments were conducted on two commonly used and publicly available datasets, VITON [5] and VITON-HD [4]. VITON consists of 14,221 training images and a test set of 2,032 image pairs; the images are frontal views paired with the corresponding in-shop garments for try-on. VITON-HD contains high-resolution paired images of in-shop garments and the corresponding human models wearing them, consisting of 11,647 training images and 2,032 test images.
Table 1: Quantitative comparison with state-of-the-art methods on the VITON dataset.

| Methods | Parser | Warping | FID | SSIM |
|---|---|---|---|---|
| VITON [5] | Y | TPS | 55.71 | 0.74 |
| CP-VTON [6] | Y | TPS | 24.45 | 0.72 |
| CP-VTON+ [1] | Y | TPS | 21.04 | 0.75 |
| ClothFlow [15] | Y | AF | 14.43 | 0.84 |
| ACGPN [16] | Y | TPS | 16.64 | 0.84 |
| DCTON [8] | Y | TPS | 14.82 | 0.83 |
| PF-AFN [17] | N | AF | 10.09 | 0.89 |
| [15] | N | AF | 10.73 | 0.89 |
| Ours | N | AF | 9.14 | 0.90 |
3.2 Implementation details
All experiments were conducted on two A5000 GPUs using PyTorch. The Adam optimizer is used with a batch size of 8 and a learning rate of 0.000035, and the model is trained for 250 epochs. The deformable flow is learned from multiple sampled flow fields, the loss terms are combined with predefined weights, and features are extracted at multiple scales by the pyramid extractor.
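A minimal sketch of the optimizer setup implied by these hyperparameters follows; the helper name is hypothetical, and anything beyond the reported values (Adam, learning rate 3.5e-5) is an assumption.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    """Optimizer setup reflecting the reported hyperparameters (Adam,
    learning rate 3.5e-5). Batch size 8 and 250 epochs are configured in
    the data loader and training loop, which are not shown here."""
    return torch.optim.Adam(model.parameters(), lr=3.5e-5)
```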
Table 2: Ablation study of the proposed modules.

| Config | SSIM | PSNR |
|---|---|---|
| Baseline | 0.817 | 20.78 |
| +LAF | 0.835 | 21.84 |
| +SCFA | 0.863 | 22.43 |
| +Warp Loss | 0.90 | 26.5 |
Table 3: Quantitative comparison on the VITON-HD dataset.

| Method | FID | KID | FID | KID | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| VITON-HD [4] | 14.64 | 6.10 | 12.81 | 5.52 | 0.848 | 0.1216 |
| HR-VITON [18] | 12.15 | 3.42 | 9.92 | 3.06 | 0.860 | 0.1038 |
| GP-VTON [19] | 10.49 | 2.23 | 7.71 | 2.01 | 0.857 | 0.0897 |
| DCI-VTON [20] | 11.14 | 3.35 | 8.19 | 2.93 | 0.875 | 0.0816 |
| Proposed | 8.93 | 1.37 | 5.60 | 0.83 | 0.877 | 0.0803 |
3.3 Qualitative Results
Our method was qualitatively evaluated against CP-VTON+ [1], ACGPN [16], PF-AFN [17], and SDAFN [10] (Fig. 2). While CP-VTON+ and ACGPN are parser-based and prone to misalignment issues, PF-AFN, a parser-free framework, avoids misalignment but suffers from blurred body parts such as arms and shoulders.
All methods roughly align the deformed garment with the wearer, but artifacts become more visible as texture and structural complexity increase. Our model preserves image contrast and skin texture, especially in the neck region, much better. In Fig. 2, our model exhibits superior preservation of the garment on the left shoulder and more realistic arm generation, while ensuring structural integrity with minimal misalignment in the sleeves and collar edges.
3.4 Quantitative Results
The baseline of the proposed approach uses a simple warp module and flow estimation without any attention module; its scores are reported in Table 2. As improvements, we introduce LAF, SCFA, and the warp loss. The LAF module improves the flow, which in turn improves garment warping; the SCFA module captures semantic information while considering the context between the garment and the person during the try-on process; and the warp loss improves warping by suppressing learning from poorly warped garment regions (Table 2). On the VITON dataset (Table 1), the proposed method achieves outstanding results with an SSIM of 0.90 and an FID of 9.14, outperforming the other methods. The scores on the VITON-HD dataset (Table 3) likewise favor the proposed method over competing approaches, demonstrating the effectiveness of our single-stage framework.
4 Conclusion
We proposed a novel single-stage framework for image-based virtual try-on, achieving photo-realistic fitting results by aligning in-shop garments with clothed person images. Our approach jointly learns warped garments and flow fields using the semantic-contextual fusion attention module and a lightweight linear attention framework. The Warped Cloth Learning Module allows simultaneous learning of the warped garment and the try-on result. Experiments on the VITON dataset demonstrate state-of-the-art performance both qualitatively and quantitatively. Our method eliminates multi-stage processing, offering a more efficient and reliable virtual try-on experience and showing potential for real-world applications. Future work will include enhancing robustness on diverse datasets and exploring real-time applications.
References
- [1] Matiur Rahman Minar, Thai Thanh Tuan, Heejune Ahn, Paul Rosin, and Yu-Kun Lai, “Cp-vton+: Clothing shape and texture preserving image-based virtual try-on,” in CVPR Workshops, 2020, vol. 3, pp. 10–14.
- [2] Thibaut Issenhuth, Jérémie Mary, and Clément Calauzènes, “Do not mask what you do not need to mask: a parser-free virtual try-on,” ArXiv, vol. abs/2007.02721, 2020.
- [3] Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, and Kerui Hu, “Pg-vton: A novel image-based virtual try-on method via progressive inference paradigm,” arXiv preprint arXiv:2304.08956, 2023.
- [4] Seunghwan Choi, Sunghyun Park, Min Gi Lee, and Jaegul Choo, “Viton-hd: High-resolution virtual try-on via misalignment-aware normalization,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14126–14135, 2021.
- [5] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S. Davis, “Viton: An image-based virtual try-on network,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7543–7552, 2017.
- [6] Bochao Wang, Huabing Zhang, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang, “Toward characteristic-preserving image-based virtual try-on network,” ArXiv, vol. abs/1807.07688, 2018.
- [7] Assaf Neuberger, Eran Borenstein, Bar Hilleli, Eduard Oks, and Sharon Alpert, “Image based virtual try-on network from unpaired data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- [8] Chongjian Ge, Yibing Song, Yuying Ge, Han Yang, Wei Liu, and Ping Luo, “Disentangled cycle consistency for highly-realistic virtual try-on,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16923–16932, 2021.
- [9] Sen He, Yi-Zhe Song, and Tao Xiang, “Style-based global appearance flow for virtual try-on,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3460–3469, 2022.
- [10] Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, and Hongxia Yang, “Single stage virtual try-on via deformable attention flows,” in European Conference on Computer Vision, 2022.
- [11] Kiran Mayee Adavala, “Generation of “hand-drawn” images using deep convolutional generative adversarial networks,” resmilitaris, vol. 13, no. 3, pp. 2104–2111, 2023.
- [12] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and koray kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds. 2015, vol. 28, Curran Associates, Inc.
- [13] Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin, “Towards multi-pose guided virtual try-on network,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9026–9035.
- [14] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 5728–5739.
- [15] Xintong Han, Weilin Huang, Xiaojun Hu, and Matthew R. Scott, “Clothflow: A flow-based model for clothed person generation,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10470–10479, 2019.
- [16] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo, “Towards photo-realistic virtual try-on by adaptively generating-preserving image content,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- [17] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo, “Parser-free virtual try-on via distilling appearance flows,” arXiv preprint arXiv:2103.04559, 2021.
- [18] Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo, “High-resolution virtual try-on with misalignment and occlusion-handled conditions,” arXiv preprint arXiv:2206.14180, 2022.
- [19] Xie Zhenyu, Huang Zaiyu, Dong Xin, Zhao Fuwei, Dong Haoye, Zhang Xijin, Zhu Feida, and Liang Xiaodan, “Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
- [20] Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang, “Taming the power of diffusion models for high-quality virtual try-on with appearance flow,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023.