
DRDM: A Disentangled Representations Diffusion Model for Synthesizing Realistic Person Images
This research is supported by the National Natural Science Foundation of China (62262045). Corresponding Email: [email protected]

Enbo Huang1, Yuan Zhang2, Faliang Huang1, Guangyu Zhang3, Yang Liu4
1Guangxi Key Lab of Human-machine Interaction and Intelligent Decision, Nanning Normal University, Nanning, China
2Zhidayuan AI Lab, Nanning, China
3College of Mathematics and Informatics, South China Agricultural University, Guangzhou, China
4School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Abstract

Person image synthesis with controllable body poses and appearances is an essential task owing to practical needs in virtual try-on, image editing and video production. However, existing methods face significant challenges with missing details, limb distortion and garment style deviation. To address these issues, we propose a Disentangled Representations Diffusion Model (DRDM) to generate photo-realistic images from source portraits in specific desired poses and appearances. First, a pose encoder encodes pose features into a high-dimensional space to guide the generation of person images. Second, a body-part subspace decoupling block (BSDB) disentangles features from the different body parts of the source figure and feeds them to the various layers of the noise prediction block, thereby supplying the network with rich disentangled features for generating a realistic target image. Moreover, during inference, we develop a parsing map-based disentangled classifier-free guided sampling method, which amplifies the conditional signals of texture and pose. Extensive experimental results on the DeepFashion dataset demonstrate the effectiveness of our approach in achieving pose transfer and appearance control.

Index Terms:
Person image synthesis, Diffusion Model, Disentangled Representation, Human Parsing

I Introduction

The goal of the controllable person image synthesis task is to explicitly manipulate the body posture and appearance of a source person image, generating a photo-realistic person image that aligns with user expectations [1, 2, 3]. At present, the majority of approaches utilize conditional Generative Adversarial Networks (GANs) to transform the source person image into the desired target person image. For example, MUST [4] and ADGAN [5] encode style information from various semantic parts using encoders and incorporate style features through the Adaptive Instance Normalization (AdaIN) operation. GFLA [6] estimates the relationship between the source and target poses to guide the generation of the target image, allowing it to preserve texture details. By contrast, CASD [7] and NTED [8] use attention mechanisms to better preserve the textural properties of the source image. Nevertheless, these techniques frequently yield images with unclear and indistinct details, primarily because GANs struggle to accurately establish a direct transformation pathway from the source to the target image [9, 10, 11, 12, 13, 14, 15]. Recently, the diffusion model [16] has inspired researchers to produce high-quality synthetic images using a denoising score matching mechanism. For instance, models such as PIDM [1], OOTD [17], CFLD [2], and HumanSD [3] can effectively preserve the texture details of a source image through a diffusion-based approach. However, these models face difficulties in fully maintaining the textural properties of every body part because they directly encode the texture of the entire source figure, leading to inaccuracies in clothing styles and limb textures, particularly when occlusions are present and postures change. As shown in Figure 1, PIDM tends to generate stretched limbs and garments with incorrect styles. In fact, these problems are also commonly encountered in other spatial-temporal analysis and embodied AI tasks [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28].

Figure 1: Illustration of challenging instances and our results. Missing details, limb distortion and garment style deviation are three key types of degradation in the person image composition task. Our model generates more accurate human limbs and clothing styles, and more vivid details, compared with recent state-of-the-art methods (e.g., PIDM [1]) under varied viewpoints and occluded conditions.

To address the aforementioned issues, we propose a Disentangled Representations Diffusion Model (DRDM) that consists of a body-part subspace decoupling block (BSDB), a pose encoder, and a noise prediction block to achieve the disentanglement of the different body parts and the decoupling of structure and texture in person images. Subsequently, we reassemble these disentangled appearance features according to the given target pose to generate a target person image. In particular, the BSDB utilizes the human parsing map as a regularization factor to extract disentangled features from the different body parts of the source person image, and the disentangled textural features are then obtained through a series of dedicated texture encoders. We introduce self-attention mechanisms within each encoder to address texture inaccuracies arising from imprecise semantic segmentation maps, allowing the texture encoders to focus solely on the accurate features of the corresponding body parts. Finally, we use a cross-attention mechanism to pass these disentangled features to the different layers of the noise prediction block to supply the network with rich disentangled features for generating a realistic target person image.

DRDM effectively addresses common occlusion issues, distortion at limb extremities and garment style deviation through the disentangled feature representation. This approach also enables explicit control over the source person image's appearance and body posture by substituting texture features of different body parts of the source image and pose features (Figure 1). During inference, we devise a disentangled classifier-free guided sampling method based on the parsing map, which amplifies the conditional signals of texture and pose. This approach ensures a close alignment between the appearance of the generated person image and the target pose to preserve the realism of the generated result under varying poses. The experimental and evaluation results demonstrate that our method achieves state-of-the-art results on the DeepFashion dataset. Moreover, our approach generates images with more realistic details, and its superiority is confirmed through quantitative and qualitative comparisons and user studies. Our major contributions are as follows:

  • A diffusion model with a disentangled regularization module for synthesizing highly realistic person images. We can reassemble the disentangled textural features according to the parsing map, and demonstrate that this regularization factor is critical to performance in addressing common issues such as occlusion and limb distortion.

  • A parsing map-based classifier-free disentangled guidance sampling method that ensures a close alignment between the appearance of the generated person image and the target pose, thereby guaranteeing the authentic restoration of details during pose transformations.

II Proposed methods

Figure 2: Overview of our model. (a) DRDM consists of the BSDB, pose encoder $E_p$, and noise prediction block $\mathcal{H}_N$. (b) We design a disentangled classifier-free guided sampling, which amplifies the conditional signals of texture and pose based on the parsing map, ensuring a close alignment between the appearance of the generated person image and the target pose.

Overall Framework: Our model focuses on achieving feature-decoupled person image synthesis, enabling realistic person generation by flexibly manipulating the appearance and body pose of the source person image. This method excels in addressing the common challenges of occlusion, severe distortion at limb extremities and garment style deviation within this task. The overall model architecture is illustrated in Figure 2. The DRDM framework consists of the BSDB, pose encoder $E_p$, and noise prediction block $\mathcal{H}_N$. During the generation of the target person image, we employ $E_p$ to encode the input pose $P_t$ into a high-dimensional space, which guides the network in recombining the decoupled texture features of the source person image. We employ the BSDB to decouple features of different body parts in the source person image $I_s$ to address occlusion and limb-end distortion issues. The disentangled textures of the various body parts are separately encoded and stacked to form a multi-scale disentangled texture feature $F_s$. Subsequently, the BSDB uses a cross-attention mechanism guided by the target pose to extract the required texture style from $F_s$ and inject it into different layers of $\mathcal{H}_N$. This process enables the network to generate high-fidelity texture details. Decoupling the textural features of the source person image helps the network learn the relationship between body part textures and human structure, facilitating effective resolution of common issues such as occlusion and distortion at limb extremities. Additionally, we can explicitly manipulate the appearance and body pose of the source portrait by replacing body part textures and body poses through the disentangled feature representation. During inference, we propose a classifier-free disentangled guided sampling method to amplify the conditional signals of the disentangled textures and the target pose. This approach enables us to closely align the generated person image's appearance with the target pose to enhance the quality of the generated images.
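To make this data flow concrete, the following minimal PyTorch-style sketch wires the three components together for a single denoising step. The class name, module interfaces, and tensor layouts are illustrative assumptions rather than the authors' implementation.

```python
import torch.nn as nn

# A minimal sketch of the DRDM data flow for one denoising step (Figure 2a).
# Module names and interfaces are assumptions, not the authors' code.
class DRDMStep(nn.Module):
    def __init__(self, pose_encoder: nn.Module, bsdb: nn.Module, noise_block: nn.Module):
        super().__init__()
        self.pose_encoder = pose_encoder   # E_p: encodes the 18-channel pose map P_t
        self.bsdb = bsdb                   # decouples body-part textures of I_s via M_s
        self.noise_block = noise_block     # H_N: predicts the added noise

    def forward(self, y_t, t, I_s, M_s, P_t):
        pose_feats = self.pose_encoder(P_t)   # multi-scale pose guidance features
        F_s = self.bsdb(I_s, M_s)             # disentangled multi-scale texture features
        # H_N consumes the noisy image y_t plus the pose features, and queries F_s
        # through cross-attention at several internal resolutions.
        return self.noise_block(y_t, t, pose_feats, F_s)
```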

II-A Pose Encoder

The proposed pose encoder $E_p$ is composed of multiple channel attention (CA) layers and convolutional down-sampling layers. A pre-trained human pose estimation method [29] is employed to obtain estimated joint positions from the human body pose and construct an 18-channel pose estimation map as the pose input $P_t$. $E_p$ encodes $P_t$ into a high-dimensional space to guide the network in reassembling the disentangled texture features of the source person image, thus achieving better texture and pose transfer effects.
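As a hedged illustration of this design, the sketch below assembles a pose encoder from SE-style channel attention blocks and strided convolutional down-sampling; the exact CA formulation, channel widths, and number of scales are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention; the exact CA formulation is an assumption."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)   # reweight the channels of the pose feature map

class PoseEncoder(nn.Module):
    """Sketch of E_p: strided-conv down-sampling interleaved with CA layers."""
    def __init__(self, in_channels: int = 18, widths=(64, 128, 256)):
        super().__init__()
        stages, c_in = [], in_channels
        for c_out in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),  # down-sample by 2
                nn.SiLU(),
                ChannelAttention(c_out),
            ))
            c_in = c_out
        self.stages = nn.ModuleList(stages)

    def forward(self, P_t):
        feats, x = [], P_t
        for stage in self.stages:
            x = stage(x)
            feats.append(x)       # one pose feature map per scale (E_p^i)
        return feats

# Example: an 18-channel pose map at 256x176 yields pose features at three scales.
# feats = PoseEncoder()(torch.randn(1, 18, 256, 176))
```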

II-B BSDB

To address occlusion and limb-end distortion issues, we propose the BSDB. The BSDB utilizes the semantic segmentation map $M_s$ to separate the features of the different body parts in the source person image $I_s$, achieving the disentanglement of features for the various semantic regions of the person. These disentangled features of different body parts are represented as $T = [T_{head}, T_{cloth}, T_{pant}, T_{hand}]$. To further achieve feature disentanglement, we pass each disentangled body part feature through the corresponding texture encoder and stack the outputs to obtain the multi-scale feature representation $F_s$ for the entire source person image. Self-attention is introduced within the texture encoder of each body part to address the texture distortion in generated results caused by inaccurate semantic segmentation maps; it ensures that each encoder focuses solely on the texture of the body part it is responsible for, rather than encoding the textures of other body parts together. This mechanism exploits the parsing map as a regularization factor for preserving the figure shape and body-part subspace texture style, which guarantees the isolation of texture encoding for each individual body part, thereby alleviating the texture distortion problem caused by inaccurate semantic segmentation. Finally, we use cross-attention mechanisms at resolutions of $32\times 32$, $16\times 16$, and $8\times 8$ to provide the required features to the noise prediction block $\mathcal{H}_N$. At each resolution, we use network outputs containing pose features $E_p^i$ to extract the required texture style from $F_s$ and inject it into different layers of $\mathcal{H}_N$. Decoupling the features of the source person image helps the network learn the relationship between body part textures and human structure, enabling effective resolution of common issues such as occlusion and distortion at limb extremities.
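The sketch below illustrates one plausible realization of the BSDB's texture decoupling: the parsing map masks out each body-part region (the part labels follow $T_{head}, T_{cloth}, T_{pant}, T_{hand}$ above), a dedicated texture encoder with self-attention encodes each region, and the outputs are stacked into $F_s$. The mask layout, encoder architecture, and token dimensions are our assumptions.

```python
import torch
import torch.nn as nn

PARTS = ["head", "cloth", "pant", "hand"]  # part labels from T = [T_head, T_cloth, T_pant, T_hand]

class PartTextureEncoder(nn.Module):
    """Encodes the texture of a single body-part region (design is an assumption)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim // 2, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Self-attention keeps each encoder focused on its own part's texture,
        # mitigating leakage caused by imprecise parsing maps.
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, x):
        h = self.conv(x)                              # (B, dim, H/8, W/8)
        tokens = h.flatten(2).transpose(1, 2)         # (B, H/8 * W/8, dim)
        tokens, _ = self.attn(tokens, tokens, tokens)
        return tokens

class BSDB(nn.Module):
    """Decouples body-part textures with the parsing map and stacks them into F_s."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoders = nn.ModuleDict({p: PartTextureEncoder(dim) for p in PARTS})

    def forward(self, I_s, M_s):
        # Assumed layout: M_s has shape (B, num_parts, H, W) with one binary mask per part.
        part_tokens = []
        for i, part in enumerate(PARTS):
            T_part = I_s * M_s[:, i:i + 1]            # disentangled body-part region
            part_tokens.append(self.encoders[part](T_part))
        return torch.cat(part_tokens, dim=1)          # F_s: stacked disentangled texture tokens
```

In the full model, $F_s$ would then serve as the key/value sequence for the cross-attention layers at the $32\times 32$, $16\times 16$, and $8\times 8$ resolutions of $\mathcal{H}_N$.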

II-C Noise Prediction Block

We use the pose $P_t$ to guide the noise prediction block $\mathcal{H}_N$ in predicting the noise $\epsilon_\theta(y_t, t, I_s, M_s, P_t)$ throughout the entire process of noise prediction. During this phase, the BSDB employs cross-attention mechanisms to provide the required multi-scale texture features $F_s$ to $\mathcal{H}_N$, thereby aiding $\mathcal{H}_N$ in more accurately predicting the noise. In the training process, we generate noise samples $\boldsymbol{y}_t \sim q(\boldsymbol{y}_t \mid \boldsymbol{y}_0)$ by adding random Gaussian noise $\epsilon$ to $y_0$ to train our network.
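A minimal sketch of this training objective is given below, assuming the standard DDPM forward process and a simple noise-prediction (MSE) loss; the noise schedule and any loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def drdm_training_loss(model, y_0, I_s, M_s, P_t, alphas_cumprod):
    """One training step for the noise prediction block, sketched under the
    standard DDPM forward process q(y_t | y_0)."""
    B = y_0.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=y_0.device)         # random timestep per sample
    eps = torch.randn_like(y_0)                               # Gaussian noise
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    y_t = a_bar.sqrt() * y_0 + (1.0 - a_bar).sqrt() * eps     # sample y_t ~ q(y_t | y_0)
    eps_pred = model(y_t, t, I_s, M_s, P_t)                   # epsilon_theta(y_t, t, I_s, M_s, P_t)
    return F.mse_loss(eps_pred, eps)                          # noise-prediction objective
```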

II-D Parsing Map-based Disentangled Guided Sampling

To achieve higher image quality, we design a disentanglement-guided sampling approach based on classifier-free guidance [30] and the parsing map, named parsing map-based disentanglement classifier-free guidance (PMDCF-guidance) sampling. In practice, we iteratively sample $\boldsymbol{y}_{t-1}$ from $p_\theta(\boldsymbol{y}_{t-1} \mid \boldsymbol{y}_t, I_s, M_s, P_t)$ from $t=T$ to $t=1$, converging to the learned conditional and unconditional distributions. The parameterization for predicting the mean and variance of $\boldsymbol{y}_{t-1}$ is as follows:

p_\theta(\boldsymbol{y}_{t-1} \mid \boldsymbol{y}_t, I_s, M_s, P_t) = \mathcal{N}\big(\boldsymbol{y}_{t-1};\, \mu_\theta(\boldsymbol{y}_t, t, I_s, M_s, P_t),\, \Sigma_\theta(\boldsymbol{y}_t, t, I_s, M_s, P_t)\big).   (1)

In the process, we effectively decouple and amplify the conditional signals of texture and pose, improving the alignment of the generated person image’s appearance with the target pose and enhancing the quality of the generated images. The formulation is as follows:

\boldsymbol{\epsilon}_{\text{cond}} = \boldsymbol{\epsilon}_{\text{uncond}} + w_s\,\boldsymbol{\epsilon}_{\text{style}} + w_p\,\boldsymbol{\epsilon}_{\text{pose}},   (2)

where $\epsilon_{\text{uncond}} = \epsilon_\theta(\boldsymbol{y}_t, t, \emptyset, \emptyset, \emptyset)$ represents the model's unconditional prediction. The style-guided prediction is denoted as $\epsilon_{\text{style}} = \epsilon_\theta(\boldsymbol{y}_t, t, I_s, M_s, P_t) - \epsilon_\theta(\boldsymbol{y}_t, t, \emptyset, \emptyset, P_t)$, and the pose-guided prediction is denoted as $\epsilon_{\text{pose}} = \epsilon_\theta(\boldsymbol{y}_t, t, \emptyset, \emptyset, P_t) - \epsilon_\theta(\boldsymbol{y}_t, t, \emptyset, \emptyset, \emptyset)$. Here, $\emptyset$ is a zero tensor used to replace the missing conditions, and $w_s$ and $w_p$ are the guidance scales for the style and pose conditions, respectively.
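A minimal sketch of Eq. (2) in code: three forward passes of the noise prediction network are combined into the guided prediction $\boldsymbol{\epsilon}_{\text{cond}}$, with zero tensors standing in for dropped conditions; the function signature is illustrative.

```python
import torch

def pmdcf_guidance(model, y_t, t, I_s, M_s, P_t, w_s=3.0, w_p=2.5):
    """PMDCF-guidance (Eq. 2), sketched: zero tensors replace missing conditions."""
    zeros = torch.zeros_like
    eps_full   = model(y_t, t, I_s, M_s, P_t)                       # texture + pose conditioned
    eps_pose_c = model(y_t, t, zeros(I_s), zeros(M_s), P_t)         # pose-only conditioned
    eps_uncond = model(y_t, t, zeros(I_s), zeros(M_s), zeros(P_t))  # unconditional
    eps_style = eps_full - eps_pose_c                               # style (texture) direction
    eps_pose  = eps_pose_c - eps_uncond                             # pose direction
    return eps_uncond + w_s * eps_style + w_p * eps_pose            # combined prediction eps_cond
```

The combined prediction is then used in the reverse step of Eq. (1) to obtain $\boldsymbol{y}_{t-1}$.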

III Experiments

Experimental Details: We evaluate the performance of DRDM by conducting experiments on the widely used DeepFashion dataset, utilizing the Rebalanced Parsing model [31] to generate parsing maps. During the training phase, the Adam optimizer is configured with a learning rate of 2e-5, and the batch size is 7. We randomly set 10% of the variables $I_s$, $M_s$, and $P_t$ to zero tensors to encourage the model to effectively learn the unconditional distribution. For sampling, we set $w_s$ to 3.0 and $w_p$ to 2.5.

Our experiments were conducted on a system equipped with four NVIDIA V100 GPUs.
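A hedged sketch of the condition dropout used to learn the unconditional distribution follows; whether the three conditions are zeroed jointly or independently is our assumption, since the text only states the 10% rate.

```python
import torch

def drop_conditions(I_s, M_s, P_t, p_drop=0.10):
    """Randomly replace each condition with a zero tensor during training so the
    model also learns the unconditional distribution needed for PMDCF-guidance.
    Independent dropout per condition is an assumption."""
    if torch.rand(()) < p_drop:
        I_s = torch.zeros_like(I_s)
    if torch.rand(()) < p_drop:
        M_s = torch.zeros_like(M_s)
    if torch.rand(()) < p_drop:
        P_t = torch.zeros_like(P_t)
    return I_s, M_s, P_t
```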

III-A Quantitative Comparison

We conduct a quantitative comparison of our proposed DRDM with several state-of-the-art methods [1, 8, 7, 32, 6, 5]. Table I displays the results, showing that we achieved the best scores in terms of SSIM and LPIPS metrics.

TABLE I: Quantitative comparison of the proposed DRDM with several state-of-the-art models in terms of FID, SSIM and LPIPS.
Methods FID (↓) SSIM (↑) LPIPS (↓)
Def-GAN 18.457 0.6786 0.2330
PATN 20.751 0.6709 0.2562
ADGAN 14.458 0.6721 0.2283
PISE 13.610 0.6629 0.2059
GFLA 10.573 0.7074 0.2341
DPTN 11.387 0.7112 0.1931
CASD 11.373 0.7248 0.1936
NTED 8.6838 0.7182 0.1752
PIDM 7.8615 0.7312 0.1678
DRDM (Ours) 7.7462 0.7409 0.1652

These results indicate that our method can preserve complete and clear textural structures while accurately aligning with the target pose. In fact, GAN-based methods may produce conspicuous artifacts when confronting significant texture deformations. Meanwhile, the diffusion-based PIDM [1] frequently struggles to retain clothing styles and limb details in cases with occlusions.

Five representative results are shown in Figure 3. The composition images for the baseline methods were generated using pre-trained models provided by their respective authors. Notably, our model demonstrates superior performance in preserving fine texture details, maintaining accurate forms, and reconstructing realistic limbs and facial features.

Figure 3: Qualitative comparisons with several state-of-the-art models on the DeepFashion dataset.

III-B User Study

While quantitative comparisons offer valuable insights into various aspects of the generated results, human image composition tasks are often user-centric. Accordingly, we conduct a user study involving 90 volunteers to evaluate the perceptual performance of the generated results. The user study consists of two parts. (a) Comparison with real images: We randomly select 90 real images and 90 generated images from the test set and shuffle them. The volunteers are required to determine whether a given image is real or synthetic within 1 second. (b) Comparison with other methods: We present 90 groups of randomly selected images to the volunteers, including source images, target poses, real images, and images generated by our method and the baseline methods. The volunteers are asked to choose the most realistic and plausible image compared with the source and real images. In the user study, R2G represents the percentage of real images mistaken for generated images, G2R denotes the percentage of generated images mistaken for real images, and Jab indicates the percentage of images judged the best among all models. As shown in Table II, our method exhibits the best results, indicating that the images generated by our method are better aligned with human perception.

TABLE II: User study results in terms of R2G, G2R and Jab, computed from user feedback on images shown at 256×176 resolution for DeepFashion.
Methods R2G (↑) G2R (↑) Jab (↑)
ADGAN 22.2% 22.2% 4.22%
PISE 29.2% 25.8% 5%
GFLA 23% 26.2% 5.77%
CASD 34.7% 31.7% 5.77%
NTED 32.4% 36% 13%
PIDM 36.3% 43% 20.4%
DRDM (Ours) 40.7% 51.4% 45.7%

III-C Ablation Study

We further evaluate our proposed method by means of ablation experiments. We create three variants by selectively removing specific components from the complete model (w/o BSDB and w/o self-attention) and by not using our proposed sampling method (w/o PMDCF-guidance). Table III shows that our complete model achieves the best performance in all quantitative comparisons. The results highlight the significance of each component in contributing to the overall performance of the final outcomes.

TABLE III: Ablation study on the DeepFashion dataset.
Methods FID (↓) SSIM (↑) LPIPS (↓)
w/o BSDB 10.598 0.7016 0.2487
w/o self-attention 7.9172 0.7365 0.1733
w/o PMDCF-guidance 9.1425 0.7366 0.1747
Model-full (ours) 7.7462 0.7409 0.1652

IV Conclusion

In this work, we proposed DRDM, which allows for precise control of the pose and appearance of the source person image to synthesize realistic person images. In practice, three key types of degradation exist in portrait composition, i.e., missing details, limb distortion and garment style deviation. We exploit the parsing map as a regularization factor for preserving the figure shape and body-part subspace texture style, and then extract structure and texture style features in each subspace. The introduction of the disentangled texture fusion block enables the decoupling of textures in different semantic parts of the source person image, which can then be guided by the target pose to compose the appearance of the target person. Additionally, we utilize a parsing map-based disentangled classifier-free guided sampling approach to achieve accurate detail reconstruction under pose variations. We demonstrate the effectiveness of DRDM for the controllable person image synthesis task through extensive quantitative comparisons, ablation studies and user studies.

References

  • [1] Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Jorma Laaksonen, Mubarak Shah, and Fahad Shahbaz Khan, “Person image synthesis via denoising diffusion model,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 5968–5976.
  • [2] Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, and Jianhuang Lai, “Coarse-to-fine latent diffusion for pose-guided person image synthesis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 6420–6429.
  • [3] Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, and Qiang Xu, “Humansd: A native skeleton-guided diffusion model for human image generation,” in IEEE International Conference on Computer Vision (ICCV), 2023, pp. 15988–15998.
  • [4] Tianxiang Ma, Bo Peng, Wei Wang, and Jing Dong, “Must-gan: Multi-level statistics transfer for self-driven person image generation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 13622–13631.
  • [5] Yifang Men, Yiming Mao, Yuning Jiang, Wei-Ying Ma, and Zhouhui Lian, “Controllable person image synthesis with attribute-decomposed gan,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5084–5093.
  • [6] Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H Li, and Ge Li, “Deep image spatial transformation for person image generation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7690–7699.
  • [7] Xinyue Zhou, Mingyu Yin, Xinyuan Chen, Li Sun, Changxin Gao, and Qingli Li, “Cross attention based style distribution for controllable person image synthesis,” in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 161–178.
  • [8] Yurui Ren, Xiaoqing Fan, Ge Li, Shan Liu, and Thomas H Li, “Neural texture extraction and distribution for controllable person image synthesis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 13535–13544.
  • [9] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila, “Analyzing and improving the image quality of stylegan,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8110–8119.
  • [10] Tero Karras, Samuli Laine, and Timo Aila, “A style-based generator architecture for generative adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4401–4410.
  • [11] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8798–8807.
  • [12] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” in International Conference on Learning Representations (ICLR), 2018.
  • [13] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2223–2232.
  • [14] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-image translation with conditional adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1125–1134.
  • [15] Xin Ding, Yongwei Wang, Zuheng Xu, William J Welch, and Z Jane Wang, “Ccgan: Continuous conditional generative adversarial networks for image generation,” in International Conference on Learning Representations (ICLR), 2020.
  • [16] Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems (NeurIPS), 2020, vol. 33, pp. 6840–6851.
  • [17] Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin, “Improving diffusion models for virtual try-on,” arXiv preprint arXiv:2403.05139, 2024.
  • [18] Yang Liu, Zhaoyang Lu, Jing Li, and Tao Yang, “Hierarchically learned view-invariant representations for cross-view action recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2416–2430, 2018.
  • [19] Yang Liu, Zhaoyang Lu, Jing Li, Tao Yang, and Chao Yao, “Deep image-to-video adaptation and fusion networks for action recognition,” IEEE Transactions on Image Processing, vol. 29, pp. 3168–3182, 2019.
  • [20] Yang Liu, Keze Wang, Guanbin Li, and Liang Lin, “Semantics-aware adaptive knowledge distillation for sensor-to-vision action recognition,” IEEE Transactions on Image Processing, vol. 30, pp. 5573–5588, 2021.
  • [21] Yang Liu, Keze Wang, Lingbo Liu, Haoyuan Lan, and Liang Lin, “Tcgl: Temporal contrastive graph for self-supervised video representation learning,” IEEE Transactions on Image Processing, vol. 31, pp. 1978–1993, 2022.
  • [22] Zhengji Li, Yuhong Xie, Xi Xiao, Lanju Tao, Jinyuan Liu, and Ke Wang, “An image data augmentation algorithm based on yolov5s-da for pavement distress detection,” in 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), 2022, pp. 891–895.
  • [23] Yuying Zhu, Yang Zhang, Lingbo Liu, Yang Liu, Guanbin Li, Mingzhi Mao, and Liang Lin, “Hybrid-order representation learning for electricity theft detection,” IEEE Transactions on Industrial Informatics, vol. 19, no. 2, pp. 1248–1259, 2022.
  • [24] Yang Liu, Guanbin Li, and Liang Lin, “Cross-modal causal relational reasoning for event-level visual question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 11624–11641, 2023.
  • [25] Junfan Lin, Yuying Zhu, Lingbo Liu, Yang Liu, Guanbin Li, and Liang Lin, “Denselight: efficient control for large-scale traffic signals with dense feedback,” arXiv preprint arXiv:2306.07553, 2023.
  • [26] Hong Yan, Yang Liu, Yushen Wei, Zhen Li, Guanbin Li, and Liang Lin, “Skeletonmae: graph-based masked autoencoder for skeleton sequence pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5606–5618.
  • [27] Yushen Wei, Yang Liu, Hong Yan, Guanbin Li, and Liang Lin, “Visual causal scene refinement for video question answering,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 377–386.
  • [28] Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin, “Aligning cyber space with physical world: A comprehensive survey on embodied ai,” arXiv preprint arXiv:2407.06886, 2024.
  • [29] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7291–7299.
  • [30] Jonathan Ho and Tim Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
  • [31] Enbo Huang, Zhuo Su, Fan Zhou, and Ruomei Wang, “Learning rebalanced human parsing model from imbalanced datasets,” Image and Vision Computing, 2020.
  • [32] Jinsong Zhang, Kun Li, Yu-Kun Lai, and Jingyu Yang, “Pise: Person image synthesis and editing with decoupled gan,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 7982–7990.