
Virtual Accessory Try-On via Keypoint Hallucination

Junhong Gou1, Bo Zhang1, Li Niu1, Jianfu Zhang1, Jianlou Si2, Chen Qian2, Liqing Zhang1
1Shanghai Jiao Tong University, 2Sensetime Research
{goujunhong, bo-zhang, ustcnewly, c.sis}@sjtu.edu.cn, [email protected]
{sijianlou, qianchen}@sensetime.com
Abstract

The virtual try-on task refers to fitting the clothes from one image onto another portrait image. In this paper, we focus on virtual accessory try-on, which fits accessories (e.g., glasses, ties) onto a face or portrait image. Unlike clothing try-on, which relies on the human silhouette as guidance, accessory try-on must warp the accessory into an appropriate location and shape to generate a plausible composite image. In contrast to previous try-on methods that treat foreground (i.e., accessories) and background (i.e., human faces or bodies) equally, we propose a background-oriented network to utilize the prior knowledge of human bodies and accessories. Specifically, our approach learns human body priors and hallucinates the target locations of specified foreground keypoints in the background. Our approach then injects foreground information with accessory priors into the background UNet. Based on the hallucinated target locations, the warping parameters are calculated to warp the foreground. Moreover, this background-oriented network can easily incorporate auxiliary human face/body semantic segmentation supervision to further boost performance. Experiments conducted on the STRAT dataset validate the effectiveness of our proposed method.

1 Introduction

Virtual try-on technology has gained popularity due to the booming online shopping industry. The goal of virtual clothing try-on technology is to improve the customer's online shopping experience by providing a sense of how a garment may look when worn. Given an image with target clothes and another image with the target person, virtual clothing try-on fits the target clothes onto the target person. Many previous works in this field have proposed flow-based [13, 14] or Thin Plate Spline (TPS) warping [15, 25, 35, 30, 34, 12] methods, which warp the clothes into shapes that fit the human body and then perform the final synthesis.

Figure 1: On the left side: synthesis results of a pair of foreground and background images with different methods. On the right side: the overall pipeline of our method, which can be described as follows. Given a background image $I_b$ and a foreground image $I_f$ with mask $M_f$, our network predicts four keypoint heatmaps, which indicate the target locations of four foreground keypoints ($A$, $B$, $C$, $D$) on the background image. Based on pairs of source and target locations, the transformation matrix $T$ is calculated to warp the foreground image, which is combined with the background image to obtain the composite image $I_c$.

In this work, we focus on a specific and practical task: virtual accessory try-on (e.g., glasses, hats, ties), which involves warping the foreground object (accessory) and pasting it onto another background image (face or portrait image). The necessity of specifically designing methods for virtual accessory try-on is evident. Simple cut-and-paste operations are bound to result in unrealistic synthesis results, such as unnatural placement of foreground objects (e.g., position, scale) and mismatched camera viewpoints between foreground and background (i.e., human faces or human bodies). These issues could negatively affect the quality of the composite image. Although there are many previous virtual clothing try-on methods, they are unfortunately not appropriate for this accessory try-on task (see Figure 1 (a)). The reason is that virtual clothing try-on warps the target clothes to fit the human body, while virtual accessory try-on requires finding a suitable location and shape for the accessory without an explicit target to fit. Moreover, accessories can often be warped using rigid transformations, such as perspective transformations, leading to a more compact solution.

There are only a few works [1, 23, 18, 21, 36] focusing on virtual accessory try-on. They also fall into the scope of spatial transformation for image composition, similar to existing virtual clothing try-on methods. In these works, the spatial transformation parameters are learned to warp the foreground, and the warped foreground and background are combined to create the composite image. Most of the previous works used generative models to synthesize composite images, but their performances are far from satisfactory (see Figure 1 (b)). They fail to learn the matching between foreground and background due to the lack of supervision, i.e., the ground-truth composite image given a pair of foreground and background. Recently, Zhang et al. [37] constructed the STRAT dataset with ground-truth supervision, which contains three sub-datasets (STRAT-glasses, STRAT-hat, and STRAT-tie). To take advantage of the ground-truth supervision, they proposed CorrelNet to capture the correspondence between foreground pixels and background pixels, based on which the warping parameters can be calculated to warp the foreground. However, capturing the correlations between foreground and background pixels is not an easy task due to the significant appearance difference between the foreground and the background, and the results of CorrelNet are still not promising (see Figure 1 (c)). Intuitively, the foreground may contain noisy information (e.g., colors, textures) that is harmful to predicting the target locations, while the background information can determine the approximate target locations of foreground keypoints based on the prior knowledge of the human body (e.g., a tie should be around the neck, and the legs of glasses should rest on the ears). The foreground information with prior knowledge of the accessory (e.g., shape, fine-grained category) can then refine the location prediction and the transformation matrix estimated from the background. Take the situation in Figure 1 as an example: the man is facing slightly to the right, so the glasses should also face to the right according to human body priors; then, according to accessory priors, the left lens should appear larger than the right one. Hence, we argue that the background should play a more important role than the foreground.

Motivated by the above observation, we design a background-oriented network to learn human body priors and hallucinate background keypoints, i.e., the target locations (four vertices of the foreground bounding box) on the background. Our hallucination network mainly considers the background image. We adopt a UNet [27] structure, which takes in the background image and predicts a heatmap for each of the four target locations. We employ an encoder to extract foreground information with accessory priors, which is then injected into the bottleneck of the UNet using a Dual Attention Fusion (DAF) module. According to the predicted target locations, we can calculate a transformation matrix $T$ to warp the foreground, which is combined with the background to obtain the composite image. Our background-oriented network design can also leverage extra background information (e.g., semantic segmentation), which can be easily incorporated by employing an additional decoder. The decoder features with rich semantic information are added to the original decoder features, helping predict the target locations. Note that the semantic information is only used as auxiliary information during training. Our proposed method is effective and flexible: it can quickly adapt to new accessories in these categories without additional training. To evaluate our proposed method, we conduct extensive experiments on the STRAT dataset [37] and compare with previous works, which demonstrates that our method achieves excellent performance.

2 Related Work

Virtual Try-on

The popularity of online shopping has made virtual try-on essential for consumers to improve their online shopping experience. Referring to [11, 17], we can divide existing virtual try-on technologies into 2D and 3D categories. 3D virtual try-on technology can bring a better user experience, but it relies on 3D parametric human models, and building a large-scale 3D dataset for training is expensive. Compared with 3D, image-based (i.e., 2D) virtual try-on, although not as flexible as 3D (e.g., allowing viewing with arbitrary views and poses), is more lightweight and generally more prevalent.

Many previous 2D virtual try-on works [15, 30, 35, 40, 25, 12] have used the Thin Plate Spline (TPS) method to flexibly deform clothes to cover the human body. However, TPS can only provide simple deformation, which roughly migrates the clothing to the target area and cannot handle larger geometric deformations. More recently, CT-Net [34] combined distance-field-guided dense warping and TPS warping to achieve more precise clothing transfer. In addition, many flow-based methods [14, 13, 17, 2] have been proposed; they model the appearance flow field between clothes and the corresponding regions of the human body to better fit the clothes to the person. There are also methods [6, 20] that deal with virtual try-on at high resolution, which imposes higher quality requirements on clothing warping and image synthesis.

The 2D virtual try-on works mentioned above mainly focus on clothes, while we focus on accessories such as glasses, hats, and ties. There are only a few works [1, 23, 18, 21, 37, 36] on virtual accessory try-on, which also fall under spatial transformation for image composition and will be discussed in the next subsection.

Image Composition

Image composition means overlaying a foreground object on a background image to generate a composite image. However, the obtained composite image may look unrealistic, and the lack of realism can be caused by many issues. One issue is the different lighting conditions between foreground and background. Image harmonization [7, 8, 29, 24, 41] seeks to fix this issue by correcting the illumination statistics of the foreground to make it compatible with the background, resulting in a harmonious composite image. Another issue is the unreasonable placement of the foreground object. To find reasonable locations and scales to place the foreground object, many object placement methods [10, 19, 39, 38] have been proposed. To further support more flexible shape adjustment, spatial transformation methods [1, 23, 18, 21, 37, 5, 36] for image composition predict the warping parameters of the foreground object. Taking virtual accessory try-on as an example application, previous works [1, 23, 18, 21, 36] studied spatial transformation of the accessory image to obtain the composite image. Most of them concatenated foreground and background as input to predict the warping parameters. Due to the lack of ground-truth supervision (the ground-truth composite image given a pair of foreground and background), most of the previous works used adversarial learning to generate composite images that are indistinguishable from real images, but their performances are far from satisfactory. The recent work [37] modeled the correlation between foreground and background, calculated the warping parameters on this basis, and achieved good results using the ground-truth supervision in their proposed STRAT dataset. All the above methods treat foreground and background equally. In contrast, we propose a novel background-oriented network which is more suitable for virtual accessory try-on.

Figure 2: Overview of our proposed network structure. The main branch is a UNet with background encoder $E_b$ and heatmap decoder $D_{hm}$, which takes in the background $I_b$ and predicts four heatmaps $\hat{H}$. The foreground with mask $\{I_f, M_f\}$ passes through $E_f$, and the output foreground feature map $F_f$ is fused with $F_b$ by our DAF module (detailed structure shown in the right subfigure). The semantic decoder $D_{sm}$ predicts background semantic masks $\hat{S}$, and its decoder features are added to the heatmap decoder $D_{hm}$ to help predict heatmaps. The blue dashed lines denote the skip connections between $E_b$ and $\{D_{hm}, D_{sm}\}$, while the orange solid lines denote the connections between the two decoders.

3 Our Method

The overall view of our model is shown in Figure 2. The main branch is the background UNet [27], which takes in the background image $I_b$ and produces four heatmaps corresponding to the target locations of the four foreground keypoints. The background UNet consists of a background encoder $E_b$ and a heatmap decoder $D_{hm}$. We adopt ResNet18 [16] as $E_b$ to extract the background feature map $F_b$. The decoder $D_{hm}$ adopts the decoder structure in [22]. We use a foreground encoder $E_f$ with the same structure as $E_b$ to extract the foreground feature map $F_f$. We inject foreground information into the main branch by fusing $F_f$ and $F_b$ through our Dual Attention Fusion (DAF) module. To further improve the performance of our network, we introduce a semantic decoder $D_{sm}$, which takes in the background feature map $F_b$ and predicts background semantic masks. The semantic decoder $D_{sm}$ shares the same structure as the heatmap decoder $D_{hm}$. Next, we first introduce our main branch for heatmap prediction.
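To make the data flow concrete, below is a minimal PyTorch sketch of the architecture described above. It is only a sketch: the module names are our own placeholders, the stand-in decoder does not reproduce the residual-block decoder of [22], the encoder-to-decoder skip connections and the semantic-to-heatmap feature links are omitted, and DAFModule refers to the fusion module sketched later in Sec. 3.2.

```python
import torch
import torch.nn as nn
import torchvision


def make_decoder(in_ch, out_ch):
    """Stand-in decoder with five upsampling stages (placeholder for the
    residual-block decoder of [22])."""
    layers, ch = [], in_ch
    for _ in range(5):
        nxt = max(ch // 2, out_ch)
        layers += [nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                   nn.Conv2d(ch, nxt, 3, padding=1), nn.ReLU(inplace=True)]
        ch = nxt
    layers.append(nn.Conv2d(ch, out_ch, 1))
    return nn.Sequential(*layers)


class TryOnKeypointNet(nn.Module):
    """Background-oriented network: background UNet plus foreground injection."""
    def __init__(self, num_keypoints=4, num_classes=12):
        super().__init__()
        resnet_b = torchvision.models.resnet18(weights=None)
        self.enc_b = nn.Sequential(*list(resnet_b.children())[:-2])            # E_b
        resnet_f = torchvision.models.resnet18(weights=None)
        resnet_f.conv1 = nn.Conv2d(4, 64, 7, stride=2, padding=3, bias=False)  # RGB + mask
        self.enc_f = nn.Sequential(*list(resnet_f.children())[:-2])            # E_f
        self.daf = DAFModule(512)                        # fusion module, sketched in Sec. 3.2
        self.dec_hm = make_decoder(512, num_keypoints)   # D_hm
        self.dec_sm = make_decoder(512, num_classes)     # D_sm

    def forward(self, bg, fg, fg_mask):
        f_b = self.enc_b(bg)                                # background feature map F_b
        f_f = self.enc_f(torch.cat([fg, fg_mask], dim=1))   # foreground feature map F_f
        f_fuse = self.daf(f_b, f_f)                         # inject foreground information
        return self.dec_hm(f_fuse), self.dec_sm(f_b)        # heatmaps, semantic masks
```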

3.1 Main Branch

As the main branch of our network, the background UNet contains a background encoder $E_b$ and a heatmap decoder $D_{hm}$, and generates heatmaps $\hat{H}$ based on the input background image $I_b$. Following the structure in [22], the decoder $D_{hm}$ consists of five stages: the first four stages consist of two to three stacked residual blocks, with a deconvolution layer included in the first residual block of each stage, and the last stage consists of only a deconvolution layer and a convolutional layer to obtain the final output. The encoder feature maps from the last three stages of the encoder $E_b$ are added to the feature maps of the corresponding stages in the decoder $D_{hm}$ via skip connections. We denote the predicted (resp., ground-truth) $k$-th heatmap as $\hat{H}_k$ (resp., $H_k$). We omit the subscript $k$ when not specifying a heatmap.

Loss Function

Instead of simply setting the value at the target location to 1 and the values at other locations to 0, the ground-truth heatmap is generated by plotting a Gaussian distribution with radius $g$ centered at the target location. Directly labeling the locations close to the target location as 0 could hinder the training of the network, because they are highly similar to the target location. With the soft ground-truth heatmap, we observe that the network converges more easily.
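As an illustration, such a soft ground-truth heatmap can be generated as below; the mapping from the radius $g$ to the Gaussian standard deviation is our own assumption, since the text does not specify it.

```python
import numpy as np

def make_gt_heatmap(size, center, radius=20):
    """Soft ground-truth heatmap: a Gaussian centered at the target location.

    size:   (H, W) of the heatmap
    center: (x, y) target location of one foreground keypoint
    radius: Gaussian radius g (g=20 in Sec. 4.2)
    """
    h, w = size
    xs = np.arange(w)[None, :]
    ys = np.arange(h)[:, None]
    cx, cy = center
    sigma = radius / 3.0   # assumption: ~99.7% of the mass falls within the radius
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2)).astype(np.float32)
```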

We employ Adaptive Wing (AWing) Loss [31] to measure the difference between predicted heatmaps and ground-truth heatmaps, which is defined as follows,

$$AWing(y,\hat{y})=\begin{cases}\omega\ln\big(1+\big|\tfrac{y-\hat{y}}{\epsilon}\big|^{\alpha-y}\big),&\text{if } |y-\hat{y}|<\theta,\\ A\,|y-\hat{y}|-B,&\text{otherwise,}\end{cases} \quad (1)$$

where $y$ and $\hat{y}$ denote the value at a certain location on the ground-truth and predicted heatmap, respectively. $\omega$, $\alpha$, $\theta$, and $\epsilon$ are positive hyper-parameters. $A=\omega\big(1/(1+(\theta/\epsilon)^{\alpha-y})\big)(\alpha-y)\big((\theta/\epsilon)^{\alpha-y-1}\big)(1/\epsilon)$ and $B=\theta A-\omega\ln\big(1+(\theta/\epsilon)^{\alpha-y}\big)$ make the loss function continuous and smooth at $|y-\hat{y}|=\theta$. Following [31], we set $\alpha=2.1$, $\omega=14$, $\epsilon=1$, and $\theta=0.5$. Compared with the commonly used MSE loss, AWing loss is more sensitive to subtle errors, with a higher gradient when $|y-\hat{y}|$ approaches 0.
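A direct element-wise PyTorch implementation of Eqn. 1 reads as follows (a sketch; the reduction over locations is handled by the weighted loss of Eqn. 3 below).

```python
import torch

def awing_loss(gt, pred, omega=14.0, alpha=2.1, epsilon=1.0, theta=0.5):
    """Element-wise Adaptive Wing loss of Eqn. 1, following [31]."""
    delta = (gt - pred).abs()
    # A and B depend on the ground-truth value y, so they are tensors here.
    A = omega * (1.0 / (1.0 + (theta / epsilon) ** (alpha - gt))) \
        * (alpha - gt) * ((theta / epsilon) ** (alpha - gt - 1)) / epsilon
    B = theta * A - omega * torch.log(1.0 + (theta / epsilon) ** (alpha - gt))
    loss_small = omega * torch.log(1.0 + (delta / epsilon) ** (alpha - gt))
    loss_large = A * delta - B
    return torch.where(delta < theta, loss_small, loss_large)
```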

In addition, the numbers of positive locations (locations with large values) and negative locations (locations with small values) in the ground-truth heatmap are highly unbalanced. To address this issue, we assign larger weights to the positive locations when calculating the loss. Following [31], we define a positive mask $M_{pos}$ to indicate the positive locations:

$$M_{pos}(p)=\begin{cases}1,&\text{where } \tilde{H}(p)\geq 0.2,\\ 0,&\text{otherwise,}\end{cases} \quad (2)$$

where $p$ is a location, $\tilde{H}$ is generated from the ground-truth heatmap $H$ by $3\times 3$ dilation, and $M_{pos}(p)$ (resp., $\tilde{H}(p)$) is the value at location $p$ on $M_{pos}$ (resp., $\tilde{H}$). Intuitively, the positive locations include the target location and a part of the surrounding locations.

With the positive mask $M_{pos}$, the heatmap prediction loss $\mathcal{L}_{hm}$ is calculated as follows,

$$\mathcal{L}_{hm}=AWing(H,\hat{H})\otimes(\gamma\cdot M_{pos}+1), \quad (3)$$

where $\gamma$ is a hyper-parameter to control the degree of magnification and $\otimes$ means element-wise multiplication.
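Eqns. 2 and 3 can then be sketched as below, reusing awing_loss from above. The 3×3 grayscale dilation is implemented with max-pooling, and the final mean reduction over locations is our own choice, since the reduction is not stated explicitly.

```python
import torch.nn.functional as F

def weighted_heatmap_loss(gt, pred, gamma=10.0):
    """Heatmap loss of Eqn. 3: AWing loss up-weighted at positive locations.

    gt, pred: (B, K, H, W) ground-truth and predicted heatmaps.
    """
    gt_dilated = F.max_pool2d(gt, kernel_size=3, stride=1, padding=1)  # 3x3 dilation
    pos_mask = (gt_dilated >= 0.2).float()                             # Eqn. 2
    loss = awing_loss(gt, pred) * (gamma * pos_mask + 1.0)             # Eqn. 3
    return loss.mean()
```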

Spatial Transformation

After obtaining the heatmaps, we do not directly take the location with the maximum value in each heatmap as the predicted target location. For robustness, we use the values in the heatmap as weights to calculate the target location. Similar to [28], given the predicted $k$-th heatmap $\hat{H}_k$, the $k$-th target location $v^{t}_{k}$ is obtained by

$$v^{t}_{k}=\sum_{p=1}^{N}p\cdot \mathrm{softmax}(\beta\cdot\hat{H}_{k}(p)), \quad (4)$$

in which $N$ is the size of the heatmap, $p$ is a location, and $\hat{H}_{k}(p)$ is the value at location $p$ on $\hat{H}_{k}$. $\beta$ is a hyper-parameter (set to 1000) to exaggerate the impact of locations with near-maximum values, and $\mathrm{softmax}$ means softmax normalization.
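A sketch of this soft-argmax (Eqn. 4) in PyTorch, returning locations in (x, y) pixel coordinates:

```python
import torch

def soft_argmax(heatmaps, beta=1000.0):
    """Weighted target locations of Eqn. 4.

    heatmaps: (B, K, H, W) predicted heatmaps.
    returns:  (B, K, 2) target locations in (x, y) pixel coordinates.
    """
    b, k, h, w = heatmaps.shape
    weights = torch.softmax(beta * heatmaps.reshape(b, k, -1), dim=-1)  # over N = H*W locations
    ys = torch.arange(h, dtype=heatmaps.dtype, device=heatmaps.device)
    xs = torch.arange(w, dtype=heatmaps.dtype, device=heatmaps.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_x.reshape(-1), grid_y.reshape(-1)], dim=-1)  # (N, 2)
    return weights @ coords   # (B, K, N) @ (N, 2) -> (B, K, 2)
```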

The perspective transformation matrix $T$ is calculated according to the target locations $\{v^{t}_{k}\}_{k=1}^{4}$ and their source locations $\{v^{s}_{k}\}_{k=1}^{4}$ on the foreground image. The details of calculating the transformation matrix based on pairs of source and target locations can be found in [37]. Then, we warp the foreground image according to the perspective transformation matrix $T$ to get the warped foreground $\tilde{I}_f$ and the warped foreground mask $\tilde{M}_f$. Finally, the composite image $I_c$ can be obtained by

$$I_{c}=\tilde{M}_{f}\otimes\tilde{I}_{f}+(1-\tilde{M}_{f})\otimes I_{b}. \quad (5)$$

This process is shown in Figure 1.
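As an illustration, the warping and compositing steps (the transformation matrix $T$ and Eqn. 5) can be sketched with OpenCV; the paper computes $T$ as in [37], whereas cv2.getPerspectiveTransform is used here for simplicity.

```python
import cv2
import numpy as np

def composite(fg, fg_mask, bg, src_pts, dst_pts):
    """Warp the foreground with a perspective transform and paste it onto the background (Eqn. 5).

    fg, bg:   HxWx3 uint8 images of the same resolution
    fg_mask:  HxW float mask in [0, 1] for the foreground accessory
    src_pts:  4x2 source keypoints (vertices of the foreground bounding box)
    dst_pts:  4x2 hallucinated target locations on the background
    """
    h, w = bg.shape[:2]
    T = cv2.getPerspectiveTransform(src_pts.astype(np.float32), dst_pts.astype(np.float32))
    warped_fg = cv2.warpPerspective(fg, T, (w, h))
    warped_mask = cv2.warpPerspective(fg_mask, T, (w, h))[..., None]
    return (warped_mask * warped_fg + (1.0 - warped_mask) * bg).astype(np.uint8)
```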

3.2 Injecting Foreground Information

| Method | STRAT-glasses (LSSIM↑ / IoU↑ / Disp↓ / User↑) | STRAT-hat (LSSIM↑ / IoU↑ / Disp↓ / User↑) | STRAT-tie (LSSIM↑ / IoU↑ / Disp↓ / User↑) |
| --- | --- | --- | --- |
| ST-GAN | 0.5655 / 0.5932 / 0.0240 / 4.33% | 0.4362 / 0.6859 / 0.0316 / 1.80% | 0.2780 / 0.1126 / 0.0440 / 0.80% |
| ST-GAN(+s) | 0.6061 / 0.6579 / 0.0198 / 0.87% | 0.5164 / 0.7455 / 0.0235 / 4.80% | 0.2517 / 0.1211 / 0.0393 / 0.53% |
| CompGAN | 0.5362 / 0.5593 / 0.0279 / 4.27% | 0.4311 / 0.6675 / 0.0343 / 3.80% | 0.2768 / 0.0918 / 0.0506 / 0.20% |
| CompGAN(+s) | 0.5807 / 0.6353 / 0.0216 / 2.00% | 0.5047 / 0.7303 / 0.0246 / 1.87% | 0.2452 / 0.1064 / 0.0425 / 0.40% |
| RegGAN | 0.5356 / 0.5069 / 0.0299 / 1.67% | 0.4028 / 0.6147 / 0.0371 / 1.13% | 0.2469 / 0.0792 / 0.0603 / 0.93% |
| SF-GAN | 0.5406 / 0.5472 / 0.0267 / 2.27% | 0.4140 / 0.6521 / 0.0365 / 1.87% | 0.2575 / 0.0885 / 0.0544 / 0.33% |
| AGCP | 0.5240 / 0.4750 / 0.0347 / 0.20% | 0.3954 / 0.5981 / 0.0386 / 0.60% | 0.2348 / 0.0641 / 0.0640 / 0.27% |
| CorrelNet | 0.6886 / 0.7573 / 0.0145 / 16.11% | 0.5470 / 0.7873 / 0.0184 / 12.17% | 0.2883 / 0.3948 / 0.0131 / 13.17% |
| PF-AFN | 0.6403 / 0.7391 / 0.0186 / 7.86% | 0.5325 / 0.7579 / 0.0235 / 6.27% | 0.2735 / 0.2314 / 0.0201 / 1.43% |
| CP-VTON+ | 0.6596 / 0.7485 / 0.0159 / 8.01% | 0.5321 / 0.7395 / 0.0349 / 4.91% | 0.2758 / 0.1232 / 0.1989 / 0.67% |
| Ours(w/o sm) | 0.6971 / 0.7753 / 0.0125 / 20.44% | 0.5656 / 0.8181 / 0.0158 / 27.17% | 0.3146 / 0.4744 / 0.0094 / 38.07% |
| Ours | **0.7038** / **0.7814** / **0.0122** / **31.98%** | **0.5718** / **0.8239** / **0.0147** / **33.62%** | **0.3229** / **0.4832** / **0.0093** / **43.20%** |
Table 1: The quantitative results on the STRAT dataset. The best results are emphasized in bold. "User" denotes the frequency with which each method is chosen as the best method in the user study.

Although the main branch alone can predict the target locations of foreground keypoints, it is challenging to make precise predictions in the absence of foreground information. For example, in the hat try-on task, although the four target locations can be roughly positioned around the top of the head, only with the specific shape and type of hat (e.g., baseball cap, cowboy hat) can they be decided more precisely. Therefore, it is necessary to inject foreground information into the main branch.

We first concatenate the foreground image $I_f$ and its mask $M_f$, and feed them into the foreground encoder $E_f$ to extract the foreground feature map $F_f$. Then, we fuse the foreground feature map $F_f$ and the background feature map $F_b$ in the bottleneck to ensure that the heatmap decoder $D_{hm}$ receives both foreground and background information. Inspired by previous works on attention [33] and feature fusion [9], we propose a Dual Attention Fusion (DAF) module to fuse the two feature maps attentively. The DAF module first applies dual attention (the channel attention and spatial attention in [33]) to the summation of the two feature maps to predict the fusion weight map $M_{fuse}$ with the same size as $\{F_f, F_b\}$. Then, $F_f$ and $F_b$ are fused according to the fusion weight map $M_{fuse}$:

$$F_{fuse}=M_{fuse}\otimes F_{b}+(1-M_{fuse})\otimes F_{f}, \quad (6)$$

where $F_{fuse}$ is the fused feature. The values in $M_{fuse}$ are between 0 and 1 after a sigmoid, enabling our network to make a soft selection between the two feature maps. The DAF module is illustrated in Figure 2, in which the dashed line means reversing the fusion weight map.
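Since the exact layer composition of DAF is not fully specified above, the following PyTorch sketch is one plausible instantiation: CBAM-style channel and spatial attention [33] applied to $F_b+F_f$, followed by a 1×1 convolution with a sigmoid that outputs the fusion weight map of Eqn. 6.

```python
import torch
import torch.nn as nn

class DAFModule(nn.Module):
    """Sketch of the Dual Attention Fusion module."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention (squeeze-and-excite style, as in CBAM [33]).
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        # Spatial attention over the channel-refined feature.
        self.spatial_att = nn.Sequential(nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())
        # 1x1 convolution producing the fusion weight map M_fuse.
        self.fuse_conv = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, f_b, f_f):
        x = f_b + f_f
        x = x * self.channel_att(x)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        x = x * self.spatial_att(pooled)
        m_fuse = self.fuse_conv(x)                    # same size as F_b / F_f
        return m_fuse * f_b + (1.0 - m_fuse) * f_f    # Eqn. 6
```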

3.3 Utilizing Auxiliary Background Information

Due to our background-oriented network design, our method can be easily extended to utilize additional background information. In this work, we take semantic information as an example to investigate how to leverage auxiliary background information in our network. It is reasonable to presume that background semantic information could benefit target location hallucination, because the relative positions and distances to different semantic components (e.g., eyes, nose, ears) could help determine the target locations of foreground keypoints. Similar to previous multitask learning works [29, 32, 3, 4], we introduce a semantic decoder $D_{sm}$ to predict the background semantic mask. The semantic decoder $D_{sm}$ and the heatmap decoder $D_{hm}$ share the same background encoder $E_b$, and $D_{sm}$ and $E_b$ are also linked by skip connections. To leverage the semantic knowledge learned by $D_{sm}$, the feature map of each layer in the decoder $D_{sm}$ is propagated to the decoder $D_{hm}$ by adding it to the corresponding feature map, except for the last layer, which accounts for heatmap prediction.
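A rough sketch of this propagation, assuming the two decoders are represented as lists of stage modules with matching feature sizes (skip connections from $E_b$ are omitted):

```python
def decode_with_semantics(f_fuse, f_b, hm_stages, sm_stages):
    """Add each semantic-decoder feature map to the corresponding heatmap-decoder
    feature map, except at the last stage, which directly predicts the heatmaps."""
    x_hm, x_sm = f_fuse, f_b
    for i, (hm_stage, sm_stage) in enumerate(zip(hm_stages, sm_stages)):
        x_sm = sm_stage(x_sm)
        x_hm = hm_stage(x_hm)
        if i < len(hm_stages) - 1:
            x_hm = x_hm + x_sm
    return x_hm, x_sm   # predicted heatmaps, semantic logits
```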

Loss Function

We denote the predicted semantic mask as $\hat{S}\in\mathbb{R}^{C\times N}$, in which $C$ is the number of semantic classes (e.g., eyes, nose, ears) and $N$ is the size of the semantic mask (the same as the size of the heatmap). Besides, we use $c_p$ to denote the ground-truth label at location $p$. Then, the pixel-wise cross-entropy loss can be represented by

$$\mathcal{L}_{sm}=-\frac{1}{N}\sum_{p=1}^{N}\log\left(\frac{\exp{\hat{S}_{c_{p},p}}}{\sum_{c=1}^{C}\exp{\hat{S}_{c,p}}}\right). \quad (7)$$

The overall objective function can be written as

$$\mathcal{L}=\mathcal{L}_{hm}+\lambda\mathcal{L}_{sm}, \quad (8)$$

where $\lambda$ is a hyper-parameter to balance the two tasks.
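A sketch of the overall objective (with $\lambda=0.1$ as used in Sec. 4.2), reusing weighted_heatmap_loss from Sec. 3.1:

```python
import torch.nn.functional as F

def total_loss(pred_heatmaps, gt_heatmaps, sem_logits, sem_labels, lam=0.1):
    """Overall objective of Eqn. 8.

    sem_logits: (B, C, H, W) predicted semantic scores
    sem_labels: (B, H, W) ground-truth class index at each location
    """
    l_hm = weighted_heatmap_loss(gt_heatmaps, pred_heatmaps)   # Eqn. 3
    l_sm = F.cross_entropy(sem_logits, sem_labels)             # pixel-wise CE, Eqn. 7
    return l_hm + lam * l_sm
```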

4 Experiments

4.1 Dataset and Evaluation Metrics

We conduct experiments on the STRAT dataset [37], which consists of three sub-datasets (STRAT-glasses, STRAT-hat, and STRAT-tie). Each sub-dataset contains 2000 training tuples and 1000 test tuples, where each tuple has a foreground image with mask, a background image, and the ground-truth composite image. The foreground/background images in the training set and test set have no overlap. For quantitative evaluation, we adopt the local structural similarity index (LSSIM), vertex displacement error (Disp), and intersection over union (IoU), following [37]. We also consider human perception and include a user study for a more comprehensive comparison. Specifically, we collect the composite images generated by different methods for the whole test set, and 20 human raters are asked to select the most reasonable and realistic result for each test tuple. Then, we report the frequency with which each method is selected as the best one.
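For reference, IoU and Disp can be approximated as below. This is only a sketch under our reading of [37] (in particular, the normalization of Disp may differ from the official implementation), and LSSIM is omitted.

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU between the warped foreground mask and the ground-truth one."""
    pred, gt = pred_mask > 0.5, gt_mask > 0.5
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / max(union, 1)

def vertex_disp(pred_pts, gt_pts, img_size):
    """Mean L2 distance between predicted and ground-truth vertices,
    normalized by the image size (assumed normalization)."""
    return np.linalg.norm(pred_pts - gt_pts, axis=-1).mean() / img_size
```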

4.2 Implementation Details

We use PyTorch [26] v1.11.0 to implement our model and train it on an NVIDIA RTX 3090 GPU. For the STRAT-glasses and STRAT-hat sub-datasets, we resize both foreground and background images to $224\times 224$. For the STRAT-tie sub-dataset, we resize the images to $448\times 448$ because the foreground object only occupies a small proportion of the background, following [37]. We set the Gaussian radius of the ground-truth heatmap to $g=20$, and set $\gamma=10$ in Eqn. 3 and $\lambda=0.1$ in Eqn. 8 via cross-validation.

We use the Adam optimizer with the learning rate initialized to 0.0002 and the weight decay set to $1\times 10^{-5}$. The batch size is 32 and our model is trained for 40 epochs, with the learning rate decreasing linearly to 0.00005 starting from epoch 10.
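A sketch of this optimization setup (the values come from the text above; the exact shape of the linear decay schedule is our assumption):

```python
import torch

model = TryOnKeypointNet()   # from the earlier architecture sketch
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-5)

def lr_factor(epoch, total=40, decay_start=10, final=0.00005 / 0.0002):
    # Keep the initial learning rate for the first 10 epochs, then decay it
    # linearly toward 0.00005 over the remaining epochs.
    if epoch < decay_start:
        return 1.0
    return 1.0 - (epoch - decay_start) / (total - decay_start) * (1.0 - final)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
# Call scheduler.step() once per training epoch.
```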

For the semantic classes of the background image, we divide the three sub-datasets into two groups. The background images in the first group (STRAT-glasses and STRAT-hat) are human faces, and those in the second group (STRAT-tie) are human bodies. Human face images have 12 semantic classes: background, hair, hat, eyebrows, glasses, eyes, nose, mouth, skin, neck, ears, and clothing. Human body images have 8 semantic classes: background, head, upper, lower, arms, legs, shoes, and bags.

4.3 Comparison with Baselines

Figure 3: Qualitative results of various methods on the STRAT dataset. In the last row, we zoom in to better observe the bow tie.

We compare our method with previous virtual accessory try-on methods: CorrelNet [37], ST-GAN [23], RegGAN [18], CompGAN [1], SF-GAN [36], and AGCP [21], as well as representative virtual clothing try-on methods: CP-VTON+ [25] and PF-AFN [13]. In particular, "(+s)" (e.g., ST-GAN(+s) and CompGAN(+s)) means adding additional supervision (MSE loss) between the generated composite images and the ground-truth composite images on top of the original methods, which allows them to utilize the ground-truth annotations more effectively (see [37] for more details). Note that the two virtual clothing try-on methods require semantic parsing, and we provide them with ground-truth semantic annotations. As for the other input of these two methods, i.e., the pose heatmaps, we do not include the corresponding information in the input. We only use the first stage of these two methods to obtain the warped foreground image, and then use the foreground mask as guidance to composite the two images, which is consistent with the process of our method. In addition, since the optical-flow-based warping method PF-AFN cannot directly provide the final locations of the specified vertices, we calculate a perspective transformation matrix by sampling some matching point pairs, through which we can obtain the final locations of these vertices to approximately estimate the Disp metric.

The quantitative results are summarized in Table 1. Among the virtual accessory try-on methods, the recent method CorrelNet [37] achieves the best results, surpassing the other methods by a large margin, which demonstrates the advantage of calculating warping parameters based on selected keypoint pairs. The virtual clothing try-on methods [25, 13] are very competitive and outperform most early virtual accessory try-on methods. We observe that [25, 13] perform admirably on the STRAT-glasses and STRAT-hat datasets, but less ideally on the STRAT-tie dataset. The reason could be that, with the foreground tie only occupying a small region of the background, the approach of seeking global correspondence between foreground and background becomes less effective. This is more obvious for CP-VTON+ because it only establishes such correspondence on high-level features. Its poor performance on the STRAT-tie dataset is also attributable to the difficulty of distinguishing between neck ties and bow ties. We refer to our method without semantic information as Ours(w/o sm). It can be seen that Ours(w/o sm) performs better than all previous methods on all metrics, which demonstrates the superiority of our method. After using semantic information, our approach achieves further improvement (Ours vs. Ours(w/o sm)).

| # | SM | FG | $\mathcal{L}_{hm}$ | LSSIM↑ | IoU↑ | Disp↓ |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | | | wAW | 0.4962 | 0.7323 | 0.0262 |
| 2 | + | | wAW | 0.4979 | 0.7358 | 0.0258 |
| 3 | | DAF | wAW | 0.5656 | 0.8181 | 0.0158 |
| 4 | + | add | wAW | 0.5678 | 0.8224 | 0.0149 |
| 5 | + | DAF(s) | wAW | 0.5701 | 0.8227 | 0.0149 |
| 6 | + | DAF | MSE | 0.5527 | 0.8165 | 0.0158 |
| 7 | + | DAF | AW | 0.5653 | 0.8211 | 0.0149 |
| 8 | + | DAF | wMSE | 0.5596 | 0.8217 | 0.0148 |
| 9 | + | DAF | wAW | 0.5718 | 0.8239 | 0.0147 |
Table 2: Ablation Studies of loss functions and network components of our model. AW (resp., wAW) means AWing (resp., weighted AWing) loss and MSE (resp., wMSE) means MSE (resp., weighted MSE) loss. FG (resp., SM) means foreground (resp., semantic).

4.4 Ablation Studies

Taking STRAT-hat as an example, we conduct ablation studies to validate the effectiveness of each component in our network, and the results are shown in Table 2. First, we only use the main branch (background encoder $E_b$ and heatmap decoder $D_{hm}$). As shown in row 1, the overall results are very poor. On the basis of row 1, we introduce the semantic mask and the foreground information, and the results are shown in row 2 and row 3, respectively. As can be seen, the additional semantic information brings limited improvement in this case. The addition of foreground information results in substantial improvement, because our model is unable to predict heatmaps accurately without knowing the foreground information (e.g., shape and sub-type of hat). On the basis of row 3, we add the semantic mask to arrive at our full-fledged method in row 9.

Then, we investigate the strategy of injecting foreground information. One simple strategy is adding the feature maps $F_b$ and $F_f$ to obtain the fused feature map $F_{fuse}$. The results in row 4 are lower than those in row 9, which demonstrates that the DAF module can fuse foreground and background feature maps more effectively. Besides, we also experiment with a simplified version of DAF, referred to as DAF(s) in row 5. Specifically, we remove the dual attention [33] and directly output the fusion weight map. The comparison between row 5 and row 9 shows that the features refined by dual attention are more conducive to feature fusion.

Next, we study the impact of different loss functions. We use only the MSE loss or the AWing loss [31] to supervise the heatmaps, as shown in row 6 and row 7, respectively. We observe that the AWing loss is superior in supervising heatmap prediction. In row 8 (resp., 9), we add the positive-location weighting on top of row 6 (resp., 7). Whether using the MSE loss or the AWing loss, assigning higher weights to the positive locations helps address the imbalance issue.

4.5 Qualitative Analysis

Composite Results

The composite images produced by various methods on the three sub-datasets of STRAT are exhibited in Figure 3. As can be seen, the composite images generated by many previous methods are irrational, whereas our results are more realistic and plausible, as well as closer to the ground-truth. Although the two virtual clothing try-on methods [25, 13] perform well on the STRAT-glasses and STRAT-hat datasets, some illogical deformations may occur due to the incompatibility of their warping methods, as illustrated in the first row, second column. Besides, these two methods struggle to handle such tiny accessories when coping with the STRAT-tie dataset. Particularly for the bow tie, they find it hard to predict its size and location, which leads to poor results.

As for the previous virtual accessory try-on methods, CorrelNet [37] is undoubtedly the best performing one, while the composite results of the other methods are not ideal. Compared with CorrelNet, our method still performs better in many aspects. Specifically, in the first two rows, the warped foreground of our method is more in line with the orientation of the face than that of CorrelNet. Additionally, in the second row, our model can more accurately predict the tilt of the glasses with the help of semantic information. The middle two rows show that our method can estimate the size and orientation of the hat better than CorrelNet, and can predict the size of the hat even better when semantic information is included. Similarly, in the tie try-on examples, our model better predicts the size and shape of the tie. More examples of composite results and a discussion of the limitations of our method are presented in the Supplementary.

Heatmap Visualization

Figure 4: Visualization of heatmaps. On the left, we show the foreground with four keypoints and the background. In the four groups corresponding to the four foreground keypoints, we show the ground-truth heatmap in the left column and the predicted heatmap in the right column. At the bottom, we also superimpose the heatmaps on the background for better visualization.

We visualize the ground-truth heatmaps and our predicted heatmaps for the four foreground keypoints (A, B, C, D) in Figure 4. Our predicted heatmaps have Gaussian-like distributions and are basically consistent with the ground-truth heatmaps. The bright spots on our generated heatmaps are clearly visible and the overall heatmaps are not blurred, which helps us locate the predicted keypoints more accurately. More visualization examples of heatmaps are included in the Supplementary.

Semantic Mask

Figure 5: Visualization of semantic masks. In each example, we show the background image, the ground-truth semantic mask, and the predicted semantic mask (Best viewed in color).

To verify that our model has learned useful semantic knowledge to help predict heatmaps, we visualize the semantic masks predicted by our model in Figure 5. Recall that human face images are used in the STRAT-glasses and STRAT-hat datasets, while human body images are used in the STRAT-tie dataset. Therefore, we show one example for each type of background image. The results demonstrate that, despite some inaccurate details, our model can learn the rough semantic information, which is sufficient to help predict better heatmaps. For tasks such as hat try-on, as long as our model can predict some semantic components (e.g., hair, ears, and facial skin) of the person in the background, we can obtain a rough outline of the head, which is helpful for keypoint heatmap generation.

4.6 Hyper-parameter Analyses and Significance Test

To better assess the performance of our model, we perform hyper-parameter analyses and a significance test. In the hyper-parameter analyses, we vary the values of three hyper-parameters (i.e., $\lambda$ in Eqn. 8, $\gamma$ in Eqn. 3, and the Gaussian radius $g$ of the ground-truth heatmap) and plot the performance variation of our method to observe their impact. The significance test is conducted between our method and the strongest baseline, CorrelNet [37]. The detailed results are left to the Supplementary due to space limitations.

5 Conclusion

In this work, we have proposed a novel way to address the virtual accessory try-on task, by converting the spatial transformation task to a keypoint hallucination task, that is, forecasting the target locations of foreground keypoints on the background. We have designed a novel background-oriented network and explored using auxiliary background semantic information for performance improvement. The experimental results on the STRAT dataset have demonstrated the superiority of our method.

References

  • [1] Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and Trevor Darrell. Compositional gan: Learning image-conditional binary composition. International Journal of Computer Vision, 128(10):2570–2585, 2020.
  • [2] Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, and Hongxia Yang. Single stage virtual try-on via deformable attention flows. In ECCV, 2022.
  • [3] Deblina Bhattacharjee, Tong Zhang, Sabine Süsstrunk, and Mathieu Salzmann. Mult: An end-to-end multitask learning transformer. In CVPR, 2022.
  • [4] Marcela Carvalho, Bertrand Le Saux, Pauline Trouvé-Peloux, Frédéric Champagnat, and Andrés Almansa. Multitask learning of height and semantics from aerial images. IEEE Geoscience and Remote Sensing Letters, 17(8):1391–1395, 2019.
  • [5] Bor-Chun Chen and Andrew Kae. Toward realistic image compositing with adversarial learning. In CVPR, 2019.
  • [6] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In CVPR, 2021.
  • [7] Wenyan Cong, Li Niu, Jianfu Zhang, Jing Liang, and Liqing Zhang. Bargainnet: Background-guided domain translation for image harmonization. In ICME, 2021.
  • [8] Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. Dovenet: Deep image harmonization via domain verification. In CVPR, 2020.
  • [9] Yimian Dai, Fabian Gieseke, Stefan Oehmcke, Yiquan Wu, and Kobus Barnard. Attentional feature fusion. In WACV, 2021.
  • [10] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Modeling visual context is key to augmenting object detection datasets. In ECCV, 2018.
  • [11] Ruili Feng, Cheng Ma, Chengji Shen, Xin Gao, Zhenjiang Liu, Xiaobo Li, Kairi Ou, Deli Zhao, and Zheng-Jun Zha. Weakly supervised high-fidelity clothing model generation. In CVPR, 2022.
  • [12] Chongjian Ge, Yibing Song, Yuying Ge, Han Yang, Wei Liu, and Ping Luo. Disentangled cycle consistency for highly-realistic virtual try-on. In CVPR, 2021.
  • [13] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. Parser-free virtual try-on via distilling appearance flows. In CVPR, 2021.
  • [14] Xintong Han, Xiaojun Hu, Weilin Huang, and Matthew R Scott. Clothflow: A flow-based model for clothed person generation. In ICCV, 2019.
  • [15] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. In CVPR, 2018.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [17] Sen He, Yi-Zhe Song, and Tao Xiang. Style-based global appearance flow for virtual try-on. In CVPR, 2022.
  • [18] Kotaro Kikuchi, Kota Yamaguchi, Edgar Simo-Serra, and Tetsunori Kobayashi. Regularized adversarial training for single-shot virtual try-on. In ICCV Workshops, 2019.
  • [19] Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Context-aware synthesis and placement of object instances. In NeurIPS, 2018.
  • [20] Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo. High-resolution virtual try-on with misalignment and occlusion-handled conditions. In ECCV, 2022.
  • [21] Xiang Li, Guowei Teng, Ping An, and Hai-Yan Yao. Image synthesis via adversarial geometric consistency pursuit. Signal Processing: Image Communication, 99:116489, 2021.
  • [22] Yaoyi Li and Hongtao Lu. Natural image matting via guided contextual attention. In AAAI, 2020.
  • [23] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. St-gan: Spatial transformer generative adversarial networks for image compositing. In CVPR, 2018.
  • [24] Jun Ling, Han Xue, Li Song, Rong Xie, and Xiao Gu. Region-aware adaptive instance normalization for image harmonization. In CVPR, 2021.
  • [25] Matiur Rahman Minar, Thai Thanh Tuan, Heejune Ahn, Paul Rosin, and Yu-Kun Lai. Cp-vton+: Clothing shape and texture preserving image-based virtual try-on. In CVPR Workshops, 2020.
  • [26] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
  • [27] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [28] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In ECCV, 2018.
  • [29] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In CVPR, 2017.
  • [30] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward characteristic-preserving image-based virtual try-on network. In ECCV, 2018.
  • [31] Xinyao Wang, Liefeng Bo, and Li Fuxin. Adaptive wing loss for robust face alignment via heatmap regression. In ICCV, 2019.
  • [32] Yiren Wang, Cheng Xiang Zhai, and Hany Hassan Awadalla. Multi-task learning for multilingual neural machine translation. In EMNLP, 2020.
  • [33] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In ECCV, 2018.
  • [34] Fan Yang and Guosheng Lin. Ct-net: Complementary transfering network for garment transfer with arbitrary geometric changes. In CVPR, 2021.
  • [35] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In CVPR, 2020.
  • [36] Fangneng Zhan, Hongyuan Zhu, and Shijian Lu. Spatial fusion gan for image synthesis. In CVPR, 2019.
  • [37] Bo Zhang, Yue Liu, Kaixin Lu, Li Niu, and Liqing Zhang. Spatial transformation for image composition via correspondence learning. arXiv preprint arXiv:2207.02398, 2022.
  • [38] Lingzhi Zhang, Tarmily Wen, Jie Min, Jiancong Wang, David Han, and Jianbo Shi. Learning object placement by inpainting for compositional data augmentation. In ECCV, 2020.
  • [39] Song-Hai Zhang, Zheng-Ping Zhou, Bin Liu, Xi Dong, and Peter Hall. What and where: A context-based recommendation system for object insertion. Computational Visual Media, 6(1):79–93, 2020.
  • [40] Na Zheng, Xuemeng Song, Zhaozheng Chen, Linmei Hu, Da Cao, and Liqiang Nie. Virtually trying on new clothing with arbitrary poses. In ACM MM, 2019.
  • [41] Jun-Yan Zhu, Philipp Krahenbuhl, Eli Shechtman, and Alexei A Efros. Learning a discriminative model for the perception of realism in composite images. In ICCV, 2015.