
1Zhejiang University     2The University of Adelaide     3The University of Hong Kong
Project Page: https://yongtaoge.github.io/projects/humanwild

3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models
(Work was done when YG was visiting Zhejiang University. HC is the corresponding author.)

Yongtao Ge2,1   Wenjia Wang3   Yongfan Chen1   Hao Chen1   Chunhua Shen1
Abstract

Despite the remarkable progress made on 3D human pose and shape estimation (HPS), current state-of-the-art methods rely heavily on either confined indoor mocap datasets or datasets generated by rendering engines using computer graphics (CG). Both categories of datasets fall short in providing diverse human identities and authentic in-the-wild background scenes, which are crucial for accurately simulating real-world distributions. In this work, we show that synthetic data created by generative models is complementary to CG-rendered data for achieving remarkable generalization performance on diverse real-world scenes. Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. We first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions and surface normal images. Then, we train a customized ControlNet model on this dataset to generate diverse human images and initial ground-truth labels. The key to this step is that numerous surface normal images can be easily obtained from a 3D human parametric model, e.g., SMPL-X, by rendering the 3D mesh onto the image plane. As there is inevitable noise in the initial labels, we then apply an off-the-shelf foundation segmentation model, i.e., SAM, to filter out negative data samples. Our data generation pipeline is flexible and customizable to facilitate different real-world tasks, e.g., ego-centric scenes and perspective-distortion scenes. By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection. The generated dataset comprises 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities. We train various HPS regressors on the generated data and evaluate them on a wide range of benchmarks (3DPW, RICH, EgoBody, AGORA, SSP-3D) to verify the effectiveness of the generated data.

Keywords:
Controllable Human Generation · Synthetic Dataset · 3D Human Reconstruction · Diffusion Model

1 Introduction

Estimating human pose and shape (HPS) [27, 41, 36, 39, 11] from a single RGB image is a core challenge in computer vision and has many applications in robotics [10, 82, 75], computer graphics [90, 74], and digital content creation [76]. Current HPS estimation methods require well-annotated datasets to achieve good performance. Unfortunately, collecting large-scale 3D human body data is time-consuming and expensive.

Figure 1: Dataset appearance distributions of synthesized datasets and in-the-wild real-world datasets.

As shown in Tab. 1, contemporary methodologies for acquiring precise 3D human body data predominantly employ two primary pipelines. The first pipeline encompasses indoor motion capture (mocap) systems, such as marker-based and vision-based systems, utilized by numerous existing datasets [22, 71, 8, 49] for capturing human body attributes. Nonetheless, this mocap pipeline confronts two primary limitations: firstly, the intricate and costly nature of mocap systems, involving synchronization and operation complexities that necessitate specialized expertise; secondly, the restricted number of actors in these datasets, typically set against indoor or laboratory backgrounds, precluding large-scale human data collection in diverse settings. The alternative pipeline involves synthesizing 3D human datasets via CG rendering [5, 84, 79]. However, this approach also presents twofold challenges: firstly, the collection costs incurred in acquiring high-quality 3D assets, including avatars and scene elements, alongside the requisite expertise in 3D rendering; secondly, ensuring realism in synthetic human bodies and background scenes, as persistent appearance domain gaps remain between rendered and real-world images. This domain gap is discernible in Fig. 1, where we visualize feature distributions extracted by DINO-v2 [53] through UMAP dimension reduction [47], with red points denoting real-world in-the-wild data and blue points representing CG-rendered data. Evidently, a conspicuous domain gap exists between real-world and CG-rendered data.

Since both mocap-based and CG-rendered datasets fail to provide diverse, complex in-the-wild images, some researchers have considered capturing pseudo 3D ground truth by estimating it from 2D cues or using additional sensors. SMPLify [6] proposed to fit the parameters of a 3D human model to the locations of 2D keypoints. EFT [25] introduced the Exemplar Fine-Tuning strategy, overfitting a pre-trained 3D pose regressor with a 2D keypoint reprojection loss and taking the final output of the regressor as pseudo labels. Nevertheless, these methodologies encounter challenges in accurately estimating camera and body parameters, resulting in suboptimal performance on 3D human pose estimation benchmarks.

Contemporary generative models, e.g., Stable Diffusion XL [58], are trained on billion-scale text-image pairs. This extensive training regime has demonstrated efficacy in faithfully simulating real-world distributions. In this paper, we focus on creating a new automatic and scalable synthetic data generation pipeline solely through generative models. The challenge of the pipeline lies in sampling diverse data pairs from the prior distribution inherent to generative models. It is crucial to ensure diversity in pose, shape, and scene representation within generated human images, which is paramount for accurately simulating the real-world human distribution. Conversely, ensuring precise alignment between human images and generated annotations stands as another critical facet. This alignment is indispensable for effectively training downstream tasks. DiffusionHPC [78] proposes a simple solution by providing diverse text prompts to the model and subsequently utilizing pre-trained HPS models to obtain pseudo labels. Nonetheless, relying solely on textual prompts lacks the necessary granularity to precisely control aspects such as pose, shape, and spatial positioning of human bodies. Furthermore, the resultant 3D pseudo labels often exhibit significant levels of noise. In contrast, we propose a method that begins by sampling SMPL-X parameters, which represent human body configurations, from large-scale human motion capture datasets such as AMASS [44, 5]. Subsequently, we utilize a random camera setup to render the human mesh into a normal map, thereby adding an extra input condition. This normal map provides supplementary information about the surface orientation of the human body. Finally, we feed text prompts and normal maps to a customized ControlNet [88] to generate human images. Notably, we collect over 1M image-text-normal pairs from LAION5B [64], COYO-700M [7], Hi4D [86], and BEDLAM [5] to train the normal-based ControlNet. As such, our method attains precise control over the visual characteristics of generated human images while simultaneously acquiring initial training data pairs derived from the input conditions and resultant human images. Empirical investigations uncovered instances of label noise within the initial training data pairs, including occurrences where the generated human and the input conditions form mirror pairs, or where the human head orientation in the image diverges from the input SMPL-X parameters. To mitigate this challenge, we leverage a pre-trained segmentation foundation model, namely SAM [30], to predict the human mask of the generated images. Subsequently, we compute the intersection-over-union (IoU) between the ground-truth human mask and the predicted human mask, thereby enabling the filtration of data samples with an IoU below a predetermined threshold.

With the aforementioned pipeline, we can finally generate a large-scale 3D human dataset in the wild, with around 0.79M samples at 768×768 resolution. Compared to previous datasets, our pipeline can generate diverse human identities and various in-the-wild scenes. Notably, the pipeline is much cheaper than both mocap-based and CG-based counterparts and is scalable to generate 3D human datasets in the wild with versatile human identities and real-world scenes.

Table 1: Comparison of different types of 3D HPS datasets.

Data type Datasets Subjects Scenes Frames
Real-world (Monocular) COCO [42] - - 104K
Real-world (Monocular) MPI-INF-3DHP [49] 8 - 1.4M
Real-world (Monocular) MPII [1] - - 24K
Real-world (Multi-view) HuMMan [8] 1000 1 60M
Real-world (Multi-view) ZJU Mocap [57] 6 1 >1K
Real-world (Multi-view) AIST++ [38] 30 1 10.1M
Real-world (Mocap) 3DPW [46] 5 <60 51K
Real-world (Mocap) Human3.6M [22] 11 1 3.6M
Synthetic AGORA [55] >350 - 18K
Synthetic Synbody [84] 10K 6 1.2M
Synthetic BEDLAM [5] 217 103 280K
Generated HumanWild ∞ ∞ 0.79M

Our contributions can be summarised as follows. 1) We propose a fully automatic pipeline to synthesize realistic and diverse human images with well-aligned annotations, including SMPL-X parameters and text descriptions. The dataset can empower a wide range of downstream perception tasks by rendering the SMPL-X mesh into the corresponding annotation format, e.g., human pose and shape estimation, human part segmentation, and human surface normal estimation. 2) We verify the quality of the generated dataset on the 3D HPS task. Experimental results indicate that the data produced by the proposed pipeline is complementary to CG-rendered data, enhancing performance across multiple challenging HPS benchmarks under consistent settings.

2 Related Works

2.1 Human Pose and Shape Estimation Datasets

Real-world human pose data plays a pivotal role in achieving precision and realism in 3D HPS tasks. High-quality 3D human data is typically captured using advanced motion capture devices such as inertial measurement units (IMUs) [46, 44, 66] or optical sensors [22], designed to capture precise marker movements or joint rotations. Nevertheless, the utilization of these tools may present challenges attributable to factors such as financial expense, intricacies in configuration, and spatial constraints. To address these challenges, researchers have explored alternative methods to capture pseudo labels from diverse image types, including single-view images [6], RGBD [15], and multi-view images [8], eliminating the need for motion capture gear. SLOPER4D [9] consolidates data from IMU sensors, LiDAR, and RGB information to construct a large-scale urban human-scene dataset. Such methods often leverage perception models to derive 2D cues from images, which are further optimized with a 3D joint re-projection loss.

Synthetic human pose datasets, developed with computer graphics techniques, have been used for many years. SURREAL [72] applies human skin and cloth textures to bare SMPL meshes, which lack realistic details. AGORA [55] uses high-quality static human scans for image rendering, but this routine also suffers from a high workload of scanning and rigging. Rendering realistic, manipulable synthetic human datasets involves many challenges, including the need for diverse virtual assets. BEDLAM [5] and Synbody [84] augment SMPL-X meshes [56] with diverse hair models and skin textures, facilitating the simulation of physically realistic cloth and hair dynamics. These processes are resource-intensive, and the use of rendering engines demands professional skills; thus, the rendering process can be computationally expensive and time-consuming.

Controllable human image generation has gained great traction with the advancement of Stable Diffusion [60, 88]. Text2Human [24] uses a diffusion-based transformer sampler in response to text prompts and predicts indices from a hierarchical texture-aware codebook to conditionally generate realistic human images. HumanSD [26] introduces a skeleton-guided diffusion model with a novel heatmap loss for pose-conditioned human image generation. HyperHuman [43] proposes to jointly denoise the surface normal and depth along with the synthesized RGB image, conditioned on a text prompt and pose skeleton.

Generative models for perception tasks form a promising research direction that aims to enhance perception models with the capabilities of generative models. Several pioneering works have proposed generating perception datasets with diffusion models. For instance, [2] and [73] verified the effectiveness of datasets synthesized by generative models on fundamental perception tasks, i.e., image classification and object detection. StableRep [69] argues that training modern self-supervised methods on synthetic images from Stable Diffusion models can yield impressive results; the learned representations often surpass those learned from real images of the same sample size. DatasetDM [80] leverages the prior of the Stable Diffusion model by training customized perception decoders on the output of the U-Net [61], and then employs them to generate pseudo labels for semantic segmentation, depth estimation, and 2D human pose estimation.

3 Method

Figure 2: The overall pipeline of the proposed controllable data generation. Our ControlNet is conditioned on fine-grained normal maps and structured text prompts. The text prompt includes human appearance, pose, and indoor/outdoor scene types, depending on whether background normals are provided.

We present HumanWild, a simple yet effective approach for creating versatile human body images and corresponding perception annotations in a fully automated fashion, which can be used for many downstream human perception tasks, such as 2D/3D human pose and shape estimation (see Fig. 2). The core idea of the proposed pipeline is to create large-scale image-mesh-caption pairs by combining 2D generative models, e.g., ControlNet [88], with 3D human parametric models [56]. For the sake of completeness, we give a brief review of controllable text-to-image (T2I) and image-to-image (I2I) generative models and the 3D human parametric model SMPL-X [56] in Sec. 3.1. In the following subsections, we illustrate how we generate the initial human image-annotation pairs in Sec. 3.2 and Sec. 3.3. Finally, we show how we filter out noisy labels to obtain high-quality training pairs in Sec. 3.4.

3.1 Prerequisites

Stable Diffusion models [60, 58] are text-to-image diffusion models capable of generating near photo-realistic images given any text input. The model consists of an autoencoder and a U-Net. During training, the forward process gradually transforms an image latent variable y\in\mathbb{R}^{h\times w\times c} into a noisy latent y_{t}\in\mathbb{R}^{h\times w\times c}, and the U-Net is trained to predict the added noise \epsilon from y_{t} and a textual description:

y_{t}=\sqrt{\bar{\alpha}_{t}}\,y+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon, \quad \epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}), (1)

where t=[1,\cdots,T] is the timestep controlling the noise level, \bar{\alpha}_{t} is determined by the noise scheduler [19], and h, w, and c are the height, width, and channels of the latent variable. During inference, the U-Net takes a noise map as input and progressively recovers an approximation of the original latent variable using a denoising scheduler [67].
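For concreteness, the forward noising step in Eq. 1 can be sketched in a few lines of PyTorch; the linear beta schedule and latent shapes below are illustrative assumptions rather than the exact training configuration.

```python
import torch

def make_alpha_bar(T: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02) -> torch.Tensor:
    # Linear beta schedule (an assumption; any schedule from [19] works the same way).
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)        # \bar{alpha}_t for t = 1..T

def q_sample(y: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """Eq. 1: y_t = sqrt(abar_t) * y + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(y)
    abar = alpha_bar[t].view(-1, 1, 1, 1)           # broadcast over (B, C, H, W) latents
    y_t = abar.sqrt() * y + (1.0 - abar).sqrt() * eps
    return y_t, eps                                  # eps is the U-Net's regression target

# Example: noise a batch of latent maps at random timesteps.
alpha_bar = make_alpha_bar()
latents = torch.randn(4, 4, 96, 96)                  # SD latents are (B, 4, H/8, W/8); sizes assumed
timesteps = torch.randint(0, 1000, (4,))
noisy_latents, target_eps = q_sample(latents, timesteps, alpha_bar)
```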

SMPL-X [56], defined as M(\boldsymbol{\beta},\boldsymbol{\theta},\boldsymbol{\psi}):\mathbb{R}^{|\theta|\times|\beta|\times|\psi|}\rightarrow\mathbb{R}^{3N}, is a 3D whole-body human parametric model, employing shape \boldsymbol{\beta}\in\mathbb{R}^{200}, expression \boldsymbol{\psi}\in\mathbb{R}^{50}, and pose \boldsymbol{\theta}\in\mathbb{R}^{55\times 3} parameters to control the entire body mesh. SMPL-X provides a differentiable skinning function that takes pose, shape, and expression parameters as inputs and delivers a triangulated mesh V\in\mathbb{R}^{N\times 3} with N=10475 vertices. The reconstructed 3D joints J\in\mathbb{R}^{144\times 3} can be obtained through a forward kinematics process. We use the SMPL-X model to parameterize the human body.
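As a reference for how this parameterization is used in practice, the sketch below builds a SMPL-X body with the public smplx Python package; the model path is a placeholder and the coefficient counts follow the package defaults rather than the larger shape/expression spaces quoted above.

```python
import torch
import smplx  # https://github.com/vchoutas/smplx

# 'models/' is a placeholder path to the downloaded SMPL-X model files; the
# coefficient counts are package defaults, not the larger spaces used in the text.
body_model = smplx.create(
    "models/", model_type="smplx", gender="neutral",
    num_betas=10, num_expression_coeffs=10, use_pca=False,
)

betas = torch.zeros(1, 10)           # shape coefficients (beta)
expression = torch.zeros(1, 10)      # facial expression coefficients (psi)
body_pose = torch.zeros(1, 21 * 3)   # axis-angle body pose (part of theta)
global_orient = torch.zeros(1, 3)    # root orientation

out = body_model(betas=betas, expression=expression, body_pose=body_pose,
                 global_orient=global_orient, return_verts=True)
vertices = out.vertices              # (1, 10475, 3) triangulated mesh V
joints = out.joints                  # (1, J, 3) 3D joints from forward kinematics
```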

3.2 Surface Normal Estimation

The training of the ControlNet requires large-scale in-the-wild image and surface normal map pairs. However, it is impossible to manually annotate such data. Previous work [43] employed an off-the-shelf method, Omnidata [12], for generating pseudo surface normal maps. Unfortunately, in our pilot studies, we observed that surface normal maps generated by Omnidata [12] tend to lose details. Inspired by recent works that leverage diffusion priors for downstream perception tasks [28, 34, 80], we fine-tune a pretrained latent diffusion model for surface normal estimation. This fine-tuning results in a deterministic mapping between input images and output surface normal maps. Drawing from strategies proposed in [3, 17], we introduce a blending strategy and redesign the diffusion process. Specifically, we treat the input image x as the noise map and the surface normal map n as the generation target. We then extract the corresponding latent variables \bar{x} and \bar{n} through the pre-trained VAE encoder. Finally, we reformulate the diffusion process described in Eq. 1 as:

\bar{n}_{t}=\sqrt{\bar{\alpha}_{t}}\,\bar{n}+\sqrt{1-\bar{\alpha}_{t}}\,\bar{x}, \quad t=[1,\cdots,T]. (2)

See  Fig. 3 for illustration. To predict surface normals from a pre-trained T2I model, we fine-tune a UNet using the v-prediction objective [62] on a synthetic dataset, Hypersim [59]. During inference, the UNet progressively denoises input images and extracts surface normals using the DDIM scheduler [67].
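A minimal sketch of the blending step in Eq. 2 is given below; the VAE/U-Net handles in the comments are placeholders, and only the interpolation itself follows the formulation above.

```python
import torch

def blend_latents(n_bar: torch.Tensor, x_bar: torch.Tensor,
                  t: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    """Eq. 2: the image latent x_bar plays the role of the noise map, so the
    'noised' sample is a deterministic blend of normal and image latents."""
    abar = alpha_bar[t].view(-1, 1, 1, 1)
    return abar.sqrt() * n_bar + (1.0 - abar).sqrt() * x_bar

# Sketch of one training step (`vae`, `unet`, `v_loss` are placeholder handles):
#   x_bar = vae.encode(image).latent_dist.mode()      # image latent
#   n_bar = vae.encode(normal).latent_dist.mode()     # normal-map latent (target)
#   t     = torch.randint(0, T, (B,))
#   n_t   = blend_latents(n_bar, x_bar, t, alpha_bar)
#   loss  = v_loss(unet(n_t, t, text_emb), n_bar, x_bar, t)   # v-prediction objective [62]
```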

Figure 3: The overall training pipeline of the surface normal estimation.

3.3 Initial Human Image and Annotation Generation

Camera simulation. One drawback of vision-based motion capture systems is that they need to calibrate and synchronize the cameras' intrinsic and extrinsic parameters during capture. Thus, the collected human data are limited in terms of scale and view diversity. In contrast, our pipeline gets rid of physical RGBD cameras and can simulate arbitrary human scales and body orientations. Specifically, we randomly determine the orthographic scale s of the human body (s\in[0.45,1.1]), along with the horizontal shift (t_{x},t_{y}) within a range of [-0.4/s,0.4/s]. This ensures that the majority of body parts are visible in the image. Following [27, 77], we determine the translation of the body as transl=[t_{x},t_{y},f/s]. The focal length in normalized device coordinate (NDC) space, denoted as f, is computed as f=1/\tan(FoV/2), where FoV is the horizontal field-of-view angle, randomly sampled between 25 and 120 degrees following [5, 77].
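The sampling procedure can be summarized by the short sketch below; uniform sampling and the NDC-to-pixel conversion are assumptions on top of the ranges stated in the text.

```python
import numpy as np

def simulate_camera(rng: np.random.Generator):
    """Sample a random body placement and camera following the ranges above;
    uniform distributions are assumed."""
    s = rng.uniform(0.45, 1.1)                        # orthographic scale of the body
    tx, ty = rng.uniform(-0.4 / s, 0.4 / s, size=2)   # horizontal/vertical shift
    fov = np.deg2rad(rng.uniform(25.0, 120.0))        # horizontal field of view
    f = 1.0 / np.tan(fov / 2.0)                       # focal length in NDC space
    transl = np.array([tx, ty, f / s])                # body translation [tx, ty, f/s]
    return transl, f

rng = np.random.default_rng(0)
transl, f_ndc = simulate_camera(rng)
# For a W-pixel-wide image, the pixel focal length would be f_ndc * W / 2 (assumption).
```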

Image condition generation. To synthesize realistic human images with paired pose annotations, we leverage ControlNet, equipped with the state-of-the-art diffusion model SDXL [58], as our image generator. Existing ControlNet variants take a 2D skeleton, depth map, or canny-edge map as condition inputs. These inputs are typically detected from real-world images by pre-trained perception models. However, there are two main drawbacks to generating image conditions from these pre-trained perception models. On one hand, it is laborious to crawl diverse human pose and shape images from the Internet. On the other hand, the perception models cannot guarantee fully accurate annotations, so annotations across modalities have discrepancies, e.g., the 2D keypoint heatmap from a 2D pose estimator and the depth map from a depth estimator are not aligned. In such cases, if we take the perception results as the multi-condition inputs of ControlNet, the generated images would likely exhibit artifacts.

To resolve this problem, we construct the input of ControlNet by taking advantage of the 3D human parametric model, SMPL-X. There exist several large-scale human motion capture databases [44, 5] with diverse body poses and shapes in SMPL-X format. Thanks to the disentanglement of the pose and shape parameters of the SMPL-X model, we can even recombine the two parameters to generate a human mesh that does not exist in the databases, for example, an overweight man doing an extremely difficult yoga pose. Given the simulated camera parameters described in Sec. 3.3 and a 3D human mesh from SMPL-X, we can render the mesh onto the image plane to obtain the corresponding surface normal map. Notably, the surface normal map proves crucial for generating accurate body shapes and orientations.

Human & scene positioning. We offer an optional flexible background selection feature, allowing users to generate scenes based on text prompts alone or in conjunction with normal maps. To integrate specific backgrounds, we randomly select scene meshes from the ScanNet++ dataset [85] and utilize an Octree [48] to partition the room mesh into discrete voxels. With vertex-level classification annotations, we efficiently identify the ground-plane normal and height to anchor the human mesh. We randomly place the human mesh within unoccupied areas of the voxel space. Subsequently, we refine the human mesh position by optimizing its translation with the Chamfer distance to preclude inter-mesh collisions. Finally, we simulate a random camera perspective. The intrinsic matrix is obtained with the previous rules, while the extrinsic matrix is crafted using a random azimuth angle. We sample the camera height within [-1m, +1m] relative to the pelvis height and point the camera at a random location on the human torso to achieve a realistic viewpoint.
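The sketch below illustrates the placement logic under simplifying assumptions: the Octree voxel partition and translation optimization are replaced by rejection sampling over the floor plane with a one-sided Chamfer-style clearance check, and a z-up scene mesh is assumed.

```python
import numpy as np
import trimesh
from scipy.spatial import cKDTree

def place_human(human_verts: np.ndarray, scene: trimesh.Trimesh,
                clearance: float = 0.05, max_tries: int = 100,
                rng: np.random.Generator = np.random.default_rng()):
    """Drop the human vertices onto the scene floor at a random (x, y) location and
    keep only placements whose nearest-neighbor distances to the scene indicate
    contact without interpenetration (a simplified Chamfer-style check)."""
    floor_z = scene.bounds[0, 2]                      # z-up scene assumed
    scene_tree = cKDTree(scene.vertices)
    for _ in range(max_tries):
        xy = rng.uniform(scene.bounds[0, :2], scene.bounds[1, :2])
        offset = np.array([xy[0], xy[1], floor_z]) - human_verts.min(axis=0)
        candidate = human_verts + offset
        dists, _ = scene_tree.query(candidate)        # human vertex -> nearest scene point
        if (dists < clearance).mean() < 0.02:         # allow foot contact, reject collisions
            return candidate
    return None                                       # no collision-free placement found
```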

Text prompt generation. The aforementioned normal maps come in two variants, with or without a room mesh as background. Since background geometry is only available for scenes retrieved from ScanNet++ [85], for in-the-wild data we rely on text prompts to provide background information. Moreover, the human body normal maps are not fine-grained enough to determine the gender and appearance of the person. Thus, we incorporate a structured text prompt template to handle this issue. In particular, we designed a simple text template, "A {gender} {action} {environment}". The gender and the action of the person are determined by the SMPL-X annotations. The environment is generated by a large language model, e.g., ChatGPT [51] or LLaMA [70]. To create photo-realistic humans, we also feed negative text prompts, e.g., "ugly, extra limbs, poorly drawn face, poorly drawn hands, poorly drawn feet", to the model. Finally, we have all of the input conditions of the ControlNet. We apply a total of 40 inference steps for each sample. The resolution of the generated images is 768×768. The generated images and the input conditions (normal maps, SMPL-X parameters) constitute the initial data pairs. See Sup. Mat. for a detailed analysis.
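A minimal sketch of the prompt construction is shown below; the environment list is assumed to be pre-generated offline by the LLM, and the helper names are illustrative.

```python
import random

NEGATIVE_PROMPT = ("ugly, extra limbs, poorly drawn face, "
                   "poorly drawn hands, poorly drawn feet")

def build_prompt(gender: str, action: str, environments: list) -> tuple:
    """Fill the 'A {gender} {action} {environment}' template. Gender and action come
    from the SMPL-X annotations; the environment list is assumed to be generated
    offline by the LLM."""
    environment = random.choice(environments)
    return f"A {gender} {action} {environment}", NEGATIVE_PROMPT

prompt, negative = build_prompt(
    gender="woman", action="practicing yoga",
    environments=["on a rooftop terrace at sunset", "in a quiet park"],
)
```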

Figure 4: Data sample visualization of HumanWild. For each data sample, the left side is the normal map rendered from the SMPL-X model and the right side is the image generated by the HumanWild pipeline. The first two rows demonstrate indoor activities. The third row shows generated images with severe perspective distortion. The fourth row illustrates diverse camera views. The fifth row shows overweight human bodies.

3.4 Label Denoising

The generated images are not always well-aligned with the input conditions. The most common failure case is that the generated human and the input conditions form a mirror pair. To resolve this problem, we employ an off-the-shelf foundation segmentation model, i.e., SAM [30], to filter negative samples from the final dataset. Specifically, we first obtain the ground-truth segmentation mask by rendering the 3D mesh onto the image plane. Then, we sample a random point coordinate from the ground-truth segmentation mask and feed the generated image and this point coordinate into the SAM model to obtain the predicted human mask for the generated image. Finally, we compute the intersection-over-union (IoU) between the ground-truth mask and the predicted mask and filter out data samples with an IoU lower than 0.8.
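The filtering step can be sketched as follows with the public segment-anything API; the checkpoint name and single-point prompting are assumptions consistent with the description above.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def keep_sample(image: np.ndarray, gt_mask: np.ndarray,
                predictor: SamPredictor, iou_thresh: float = 0.8) -> bool:
    """Prompt SAM with one point drawn from the rendered ground-truth mask and
    keep the sample only if the predicted mask overlaps the ground truth enough."""
    ys, xs = np.nonzero(gt_mask)
    idx = np.random.randint(len(xs))
    point = np.array([[xs[idx], ys[idx]]])            # SAM expects (x, y) coordinates
    predictor.set_image(image)                        # H x W x 3 uint8 RGB image
    masks, _, _ = predictor.predict(point_coords=point,
                                    point_labels=np.array([1]),
                                    multimask_output=False)
    pred_mask = masks[0]
    inter = np.logical_and(gt_mask, pred_mask).sum()
    union = np.logical_or(gt_mask, pred_mask).sum()
    return inter / max(union, 1) > iou_thresh

# sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # checkpoint name assumed
# predictor = SamPredictor(sam)
```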

4 Experiments

4.1 Implementation Details and Evaluation Metrics

To illustrate the effectiveness and efficiency of our proposed HumanWild, we report the Fréchet Inception Distance (FID [18]) and Kernel Inception Distance (KID [4]), which are widely used to measure the quality of synthesized images. We measure the alignment accuracy between the SMPL-X-rendered mask and the generated images with mean intersection over union (mIoU). We use standard metrics to evaluate body pose and shape accuracy. PVE and MPJPE represent the average error in vertex and joint positions, respectively, after aligning the pelvis. PA-MPJPE further aligns rotation and scale before computing the distance. PVE-T-SC is the per-vertex error in a neutral pose (T-pose) after scale correction. For a fair comparison with the CG-rendered counterpart, we sample SMPL-X parameters from the BEDLAM [5] dataset and render them into surface normal maps with the random camera parameters described in Sec. 3. We re-implement a regression-based HPS method, CLIFF [39], to evaluate the effectiveness of the synthesized dataset on a variety of evaluation benchmarks: 3DPW [46], AGORA [55], EgoBody [89], RICH [20], and SSP-3D [65] (for shape evaluation).
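For reference, the pelvis-aligned and Procrustes-aligned joint errors can be computed as in the sketch below (PVE applies the same pelvis-aligned error to mesh vertices); the pelvis index depends on the skeleton convention and is an assumption.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray, pelvis: int = 0) -> float:
    """Mean per-joint position error after pelvis alignment; pred, gt are (J, 3).
    The same formula applied to mesh vertices gives PVE."""
    pred = pred - pred[pelvis]
    gt = gt - gt[pelvis]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Procrustes-aligned MPJPE: solve for scale, rotation, and translation
    before measuring the joint error."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # avoid reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return float(np.linalg.norm(aligned - gt, axis=-1).mean())
```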

4.2 Ablation on ControlNet Input Conditions

Figure 5: Surface normal visualization. (a), (d) are two images sampled from LAION5B [64]. (b), (e) are estimated by our method. (c), (f) are estimated by OmniData.

In the initial stages of our research, we employed various input conditions, including 2D keypoint heatmaps, depth images, and normal images, to generate data samples. However, a notable challenge emerged regarding the ambiguity inherent in both the keypoint and depth conditions. Specifically, the diffusion model struggled to accurately discern the front and back of individuals depicted in these modalities and had difficulty aligning generated hands and faces with the ground truth (see Sup. Mat. for some failure cases), whereas we found that the surface normal condition can greatly reduce this kind of ambiguity. Thus, we choose it as the input condition of the ControlNet. To obtain accurate surface normal maps in the wild, we train a customized diffusion model on a photo-realistic synthetic indoor dataset, Hypersim [59]. As shown in Tab. 3, our surface normal estimator outperforms OmniData [12], which is trained on a mixture of datasets including Hypersim [59]. We also show qualitative surface normal results in Fig. 5. The results show that our diffusion-based estimator generates more detailed normal maps than OmniData [12] in real-world scenes. We annotate around 1M image/surface-normal pairs from LAION5B [64] and COYO-700M [7] for training ControlNet. Please refer to the supplementary material for more details. Tab. 2 compares our trained ControlNet with two publicly available keypoint-based and depth-based ControlNets [21, 68] in terms of generated image quality and image/ground-truth alignment accuracy. To ensure an equitable comparison, all ControlNets employ aligned sets of 1,000 input conditions derived from identical SMPL-X parameters and camera specifications. The results demonstrate that our ControlNet consistently outperforms the keypoint-based and depth-based ControlNets.

Table 2: Ablation of different input conditions of ControlNet.
ControlNet Image Quality Alignment Accuracy
Condition FID\downarrow KID\downarrow mIoU\uparrow
Keypoint [68] 29.6 2.92 41.3
Depth [21] 28.1 2.87 49.2
Normal 24.5 2.83 56.8
Table 3: Surface normal evaluation on Hypersim dataset.
Model Encoder Training data Mean\downarrow Median\downarrow RMS\downarrow
OmniData [83] DPT mix 25.3 16.2 30.4
Ours UNet Hypersim 18.9 9.5 23.2

4.3 Verification of the Necessity of In-the-wild 3D Data

Proper initialization of backbone weights is crucial for the performance of HPS models. Different weight initialization strategies profoundly impact an HPS model's convergence speed and final accuracy. HMR-Benchmarks [54] conducts a systematic analysis by initializing the same backbone with weights pre-trained from different sources, i.e., ImageNet classification weights and 2D pose estimation weights from MPII and COCO. The first three rows of Tab. 4 reproduce the results of their paper. In conclusion, employing a backbone pre-trained on in-the-wild pose estimation datasets proves beneficial for HPS tasks. Furthermore, the choice of pose estimation dataset for pretraining significantly influences model performance. Notably, datasets like COCO yield superior performance compared to MPII, likely due to the greater diversity and complexity of wild scenes in COCO. Rows 4 to 6 of Tab. 4 elucidate the impact of backbone pretraining on the CG-rendered datasets BEDLAM [5] and AGORA [55]. A consistent conclusion is drawn from these results: pretraining with 2D in-the-wild data also has a strong positive impact on CG-rendered datasets. We posit that neither indoor mocap datasets nor CG-rendered datasets offer a sufficiently varied array of photo-realistic, in-the-wild scenes. Consequently, the network may struggle to acquire robust features conducive to generalizable testing. In the final row of Tab. 4, the inclusion of in-the-wild synthesized 3D data generated by HumanWild significantly enhances HPS performance over the CG-rendered datasets alone. Hence, data synthesized by diffusion models, featuring diverse human identities and real-world scenes, complements CG-rendered data effectively.

Table 4: Ablation experiments on 3D pose and shape estimation. ‘R’ denotes mixed real-world datasets, ‘H’ denotes HumanWild, ‘B’ denotes BEDLAM and ‘A’ denotes AGORA. PA-MPJPE, MPJPE and PVE are evaluated on 3DPW. PVE-T-SC is evaluated on SSP-3D.
Method Dataset Pretrain PA-MPJPE\downarrow MPJPE\downarrow PVE\downarrow PVE-T-SC\downarrow
PARE [31, 54] R ImageNet 54.8 N/A N/A N/A
PARE [31, 54] R MPII 51.5 N/A N/A N/A
PARE [31, 54] R COCO 49.5 N/A N/A N/A
CLIFF [39, 5] B+A scratch 61.7 96.5 115.0 N/A
CLIFF [39, 5] B+A ImageNet 51.8 82.1 96.9 N/A
CLIFF [39, 5] B+A COCO 47.4 73.0 86.6 14.2
CLIFF [39, 5] B+A+H COCO 44.9 70.2 82.7 13.9

4.4 Scale-up

Table 5: Scale-up ablation. We study how performance scales with the amount of data and the model size. The metrics are MPJPE for Human3.6M [22] and PVE for the other evaluation benchmarks. Models are named “ViT-M", where M indicates the size of the ViT backbone (S, B, L, H). MPE: mean primary error across the 5 evaluation benchmarks. AGORA uses the validation set, and EgoBody uses the EgoSet.
#Crops #Inst. Model #Param. AGORA [55] EgoBody [89] RICH [20] 3DPW [46] H36M [22] MPE
25% 0.19M ViT-S 23M 125.0 114.2 116.5 118.2 102.3 115.2
50% 0.39M ViT-S 23M 116.2 103.6 104.1 110.4 95.6 106.0
100% 0.79M ViT-S 23M 106.8 97.3 107.7 106.5 87.1 101.1
100% 0.79M ViT-B 88M 103.6 94.2 105.1 103.5 82.5 97.8
100% 0.79M ViT-L 305M 97.5 85.6 97.2 97.4 75.7 90.7
100% 0.79M ViT-H 633M 90.2 82.3 92.1 90.3 72.4 85.5

We perform experiments to demonstrate the effectiveness of the generated dataset by studying how performance scales with the amount of data and the model size in Tab. 5. Notably, our training set is independent of the evaluation benchmarks in Tab. 5. We show results with various ViT backbones and percentages of the proposed dataset. We observe that 1) more training data leads to better performance, and 2) a larger backbone achieves higher performance on the proposed dataset.

4.5 Comparing with Other Real/Synthetic Datasets

In Table 6, we analyze the performance of different training datasets using mixed strategies on CLIFF with an HRNet-W48 backbone. Key findings are: 1) CG-rendered data outperforms real-world data (motion-capture, pseudo-labeled) on the 3DPW and RICH datasets, suggesting an advantage of synthetic data. 2) Training solely on HumanWild yields slightly lower performance than BEDLAM; we conjecture that label noise remains, especially in the hand and face parameters. 3) Combining HumanWild with the CG-rendered datasets (BEDLAM, AGORA) and the IMU-based dataset (3DPW) enhances performance, emphasizing the benefit of diverse data integration.

Table 6: Reconstruction error on 3DPW and RICH. ‘R’ denotes mixed real-world datasets, ‘H’ denotes HumanWild, ‘B’ denotes BEDLAM [5], and ‘A’ denotes AGORA [55]. ‘P’ denotes 3DPW [46].
Methods Training Data 3DPW (14) RICH (24)
PA-MPJPE\downarrow MPJPE\downarrow PVE\downarrow PA-MPJPE\downarrow MPJPE\downarrow PVE\downarrow
HMR [27] Real 76.7 130 N/A 90.0 158.3 186.0
SPIN [33] Real 59.2 96.9 116.4 69.7 122.9 144.2
SPEC [32] Real 53.2 96.5 118.5 72.5 127.5 146.5
PARE [31] Real 50.9 82.0 97.9 64.9 104.0 119.7
HybrIK [36] Real 48.8 80 94.5 56.4 96.8 110.4
CLIFF [39, 5] Real 46.4 73.9 87.6 55.7 90.0 102.0
CLIFF [39, 5] B 50.5 76.1 90.6 - - -
CLIFF [39, 5] H 52.7 87.3 102.1 - - -
CLIFF [39, 5] B+A 46.6 72.0 85.0 51.2 84.5 96.6
CLIFF [39, 5] H+A 47.5 74.1 87.8 51.9 85.3 97.4
CLIFF[39, 5] H+B+A 44.9 70.2 82.7 51.0 84.4 96.1
CLIFF[39, 5] B+A+P 43.0 66.9 78.5 50.2 84.4 95.6
CLIFF[39, 5] H+B+A+P 41.9 65.2 76.8 48.4 79.7 91.1

These results provide additional validation for the efficacy of the generated HumanWild dataset. Specifically, they demonstrate that incorporating HumanWild data enables the HPS regressor to encounter a more diverse range of scenarios, thereby enhancing its ability to generalize to in-the-wild scenes.

4.6 Result on Challenging Benchmarks

To verify the effectiveness of HumanWild in different scenarios, we report detailed performance on three challenging benchmarks. First, we evaluate the capability of HumanWild in handling perspective distortion by performing a fair ablation using Zolly [77] with a ResNet50 [16] backbone. The results in Tab. 7 show that the HumanWild dataset achieves higher generalization performance on SPEC-MTP [32] than PDHuman [77], a CG-rendered perspective-distortion dataset. Second, MOYO [71] encompasses challenging yoga poses not present in the BEDLAM and HumanWild datasets, providing a unique testbed for evaluating the model's generalization ability on hard poses. The results in Tab. 8 demonstrate that HumanWild complements existing datasets, resulting in improved generalization performance on the MOYO test set. In Tab. 9, our analysis reveals that while HumanWild demonstrates superior performance over BEDLAM on body metrics, it exhibits relatively poorer performance on hand and face metrics on the AGORA test set. We hypothesize that the alignment of hand and facial annotations with the generated images remains noisy, primarily due to inherent limitations in current diffusion models' ability to accurately generate hand poses, particularly at small resolutions.

Table 7: SPEC-MTP.
Training Data PA-MPJPE\downarrow MPJPE\downarrow PVE\downarrow
PDHuman 102.4 159.1 168.0
HumanWild 90.0 151.8 160.9
Table 8: MOYO.
Training Data MPJPE \downarrow PA-MPJPE \downarrow PVE \downarrow PA-PVE \downarrow
B+A+P 143.1 94.9 163.9 95.7
B+A+P+H 134.6 89.1 154.1 88.9
Table 9: AGORA test set. \dagger denotes the methods that are finetuned on the AGORA training set. \ast denotes the methods that are trained on AGORA training set only.
NMVE\downarrow (mm) NMJE\downarrow (mm) MVE\downarrow (mm) MPJPE\downarrow (mm)
Method All Body All Body All Body Face LHand RHand All Body Face LHand RHand
BEDLAM [5] 179.5 132.2 177.5 131.4 131.0 96.5 25.8 38.8 39.0 129.6 95.9 27.8 36.6 36.7
Hand4Whole [50]\dagger 144.1 96.0 141.1 92.7 135.5 90.2 41.6 46.3 48.1 132.6 87.1 46.1 44.3 46.2
BEDLAM [5]\dagger 142.2 102.1 141.0 101.8 103.8 74.5 23.1 31.7 33.2 102.9 74.3 24.7 29.9 31.3
PyMaF-X [87]\dagger 141.2 94.4 140.0 93.5 125.7 84.0 35.0 44.6 45.6 124.6 83.2 37.9 42.5 43.7
OSX [40]\ast 130.6 85.3 127.6 83.3 122.8 80.2 36.2 45.4 46.1 119.9 78.3 37.9 43.0 43.9
HumanWild\dagger 120.5 73.7 115.7 72.3 112.1 68.5 37.0 46.7 47.0 107.6 67.2 38.5 41.2 41.4

5 Discussion and Conclusions

Based on the experiments, we can answer the question "Is synthetic data generated by generative models complementary to CG-rendered data for the 3D HPS task?" Our results strongly suggest that, by integrating 3D human priors with ControlNet, it is feasible to produce high-fidelity pseudo labels covering a wide array of real-world scenarios. As increasingly large generative models emerge, there is a promising prospect of expanding diverse 3D human training datasets without intricate mocap systems and CG pipelines. We hope that our endeavors pave the way for leveraging generative models to generate high-quality datasets that enhance the efficacy of 3D human perception tasks.

Our pipeline can be applied to a series of similar tasks where high-quality data pairs are hard to collect, e.g., 3D animal pose estimation and 3D reconstruction of human-object/human-human interaction. Addressing these challenges requires further advancements in our methodology.

Acknowledgement

This work was supported by National Key R&D Program of China (No. 2022ZD0118700).

Appendix 0.A Appendix

0.A.1 Implementation Details for Training ControlNet

HumanWild employs a customized ControlNet to generate human images and corresponding annotations. In the preliminary phase of our research, we adopted the off-the-shelf depth-to-image and keypoint-to-image ControlNets, which are trained on the LAION5B dataset without filtering out low-quality images. Experiments show that the pre-trained models face challenges when generating human images with hard poses. To resolve this issue, we collect high-quality human images from multiple datasets and then fine-tune on the collected data for better generation performance.

Data preprocessing.

We curate data from four datasets: LAION5B [64], COYO-700M [7], Hi4D [86], and BEDLAM [5]. Following HyperHuman [43], we employ YOLOS [13] for human detection on the LAION5B [64] and COYO-700M [7] datasets; images containing 1 to 3 human bounding boxes are retained. We use BLIP [37] to obtain text captions for BEDLAM [5] and Hi4D [86].

Training.

We train the customized ControlNet with AdamW [29] using a learning rate of 1e-5 and a weight decay of 0.01 for 80,000 iterations on 8 A100 GPUs.
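A minimal sketch of this optimization setup is given below; the module standing in for the trainable ControlNet branch is a placeholder, and only the hyperparameters follow the text.

```python
import torch
from torch import nn

# Placeholder module standing in for the trainable ControlNet branch
# (the SDXL backbone stays frozen); only the hyperparameters follow the text.
controlnet = nn.Conv2d(4, 320, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(controlnet.parameters(), lr=1e-5, weight_decay=0.01)
num_iterations = 80_000
```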

0.A.2 Failure Cases of the keypoint and depth-based ControlNet

In Fig. 6, we show some failure cases of the initial data pairs generated by a multi-condition (i.e., depth- and keypoint-based) ControlNet. The inconsistency between the image and the 3D mesh would affect the performance of 3D human pose estimation. Thus, it is necessary to retrain a customized ControlNet with the surface normal condition proposed in this work.

Figure 6: Failure cases of the initial data pairs generated by ControlNet. See highlighted regions with red circles.

0.A.3 More Visualizations of Human Interactions

We show visualization results on human interactions. Fig. 7 demonstrates that our pipeline can generate well-aligned image-annotation pairs where people are in close interaction. The generated data pairs are of great value for enhancing existing human interaction datasets collected in studio environments (e.g., Hi4D [86] and CHI3D [14]).

Figure 7: Visualization of Human interaction. The SMPL interaction annotations are sampled from the Hi4D [86] dataset.

0.A.4 Text prompt examples generated by LLM

In Tab. 10, we show some text prompt examples, which are generated by ChatGPT [52] with diverse human actions and scenes.

Table 10: Text prompt examples.
gender action environment
a man playing soccer at the park
a woman swimming in the pool
a man shopping at the mall
a woman running in the park
a man studying at the library
a man working at the office
a man chatting at a cafe

0.A.5 Misalignment Analysis of Complicated Prompts and Normals

We present instances of failure where text prompts conflict with the surface normal condition. Specifically, when the text prompt suggests an action that deviates from the surface normal map, the resulting images often fail to adhere to the text prompt, particularly when the control factor of the surface normal map approaches 1.0. As shown in Fig. 8, despite employing the same input surface normal map and a control factor of 0.95, the generated images exhibit similar foreground human identities but differ in background elements corresponding to distinct text prompts. Fig. 9 illustrates that when the control factor of the surface normal is extremely low, the generated images may disregard the surface normal map condition.

Figure 8: Utilizing identical surface normal maps and control scale factor of the surface normal as input, we change the input text prompts.
Figure 9: Utilizing identical surface normal maps and text prompts as input, we manipulate the control scale factor of the surface normal map across values of 0.75, 0.5, and 0.25, respectively.

0.A.6 Results on 2D HPE

2D keypoint refinement. HumanWild can also be applied to the 2D human pose estimation task. Specifically, we obtain the initial 2D keypoints by projecting the 3D joints of SMPL-X onto the image plane with the simulated camera parameters. The intuition behind the 2D keypoint refinement is that different pose datasets provide different skeleton formats [63], even though they sometimes share the same joint names. To tackle these label discrepancies, it is necessary to refine the initial 2D keypoints to the convention of the target 2D pose dataset. Here, we take the COCO dataset as an example to explain the proposed strategy for refining the initial 2D keypoints converted from the SMPL-X model.
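Before refinement, the initial keypoints are simply the perspective projection of the SMPL-X joints under the simulated camera from Sec. 3.3; a minimal sketch (assuming a centered principal point and square images) is given below.

```python
import numpy as np

def project_joints(joints3d: np.ndarray, transl: np.ndarray,
                   f_ndc: float, img_size: int = 768) -> np.ndarray:
    """Project SMPL-X 3D joints (J, 3) to pixel coordinates with the simulated
    perspective camera from Sec. 3.3 (principal point at the image center assumed)."""
    cam = joints3d + transl                          # move the body into camera space
    x_ndc = f_ndc * cam[:, 0] / cam[:, 2]
    y_ndc = f_ndc * cam[:, 1] / cam[:, 2]
    u = (x_ndc + 1.0) * 0.5 * img_size               # NDC [-1, 1] -> pixels
    v = (y_ndc + 1.0) * 0.5 * img_size
    return np.stack([u, v], axis=-1)                 # (J, 2) initial 2D keypoints
```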

We leverage a COCO-pre-trained keypoint decoder proposed in [45] to obtain more accurate 2D keypoint labels. Concretely, we replace the coarse proposals from fully connected layers with the initial 2D keypoints converted from SMPL-X, and then several deformable cross-attention [91] operations are performed between the image features and keypoint queries to gradually generate the 2D keypoints in the COCO format. Compared to a pure pseudo-labeling process, our refinement strategy starts from more reliable initial keypoint proposals and thus has a higher upper bound on the quality of the final 2D keypoint labels.

In Tab. 12, we adopt two types of 2D pose estimators, RTMPose [23] and RLEPose [35], to verify the effectiveness of the proposed data generation pipeline. For a fair comparison, all models are trained for 10 epochs. Our pipeline consistently improves detection performance when mixed with different COCO training subsets (from 1% to 100%). The performance of HumanWild is comparable with BEDLAM across all data crops. When jointly training with all three datasets, both 2D pose regressors achieve their best performance.

We summarize the key results in Tab. 11. (1) Due to the lack of occlusion and multi-person scenes in the generated images, HumanWild cannot improve the results on the OCHuman validation set; we conjecture that this is because the generated dataset only has one person per image, lacking human-scene occlusion and human-human interaction. (2) HumanWild outperforms DatasetDM [80] by a large margin on the COCO validation set under the same training setting. We also find that the classification-based RTMPose is less data-hungry than the regression-based RLEPose in the low-data regime, e.g., achieving much higher AP on the 1% COCO training set.

Table 11: Main results on 2D human pose estimation. ‘P’ denotes HumanWild; ‘D’ denotes DatasetDM [80]; ‘C’ denotes COCO; ‘B’ denotes BEDLAM [5]. Crops % only applies to COCO during training. We evaluate results on the COCO and OCHuman datasets.
Method Backbone Training Set Crop COCO OCHuman
AP APm APl AP APm APl
RTMPose [23] CSPNeXt C 100 75.2 71.6 81.9 69.9 67.0 69.8
RTMPose [23] CSPNeXt P+C 100 75.7 72.4 82.9 67.2 62.5 67.2
SimplePose [81] HRNet-W32 C 100 74.9 71.3 81.5 59.8 65.3 59.8
SimplePose [81] HRNet-W32 D+C 1 47.5 44.2 52.6 N/A N/A N/A
SimplePose [81] HRNet-W32 P+C 1 50.3 44.7 59.1 29.5 18.7 29.5
Refer to caption
Figure 10: More visualizations of HumanWild with background surface normal information extracted from ScanNet++ [85]. The first two columns are compositional meshes with both the SMPL-X [56] model and indoor background meshes. The third column is the surface normal extracted from the compositional meshes. The fourth and fifth columns are images generated from the third column. The sixth column is the surface normal map without background normals. The last two columns are images generated from the sixth column.
Table 12: Ablation experiments on 2D human pose estimation. ‘C’ denotes COCO; ‘B’ denotes BEDLAM; ‘P’ denotes HumanWild, and Crops % only applies to COCO. All experiments are evaluated on the COCO validation set. AP is used as the evaluation metric.
Method Dataset Output Type Backbone 1% Crops\uparrow 5% Crops\uparrow 10% Crops\uparrow 100% Crops\uparrow
RTMPose C Classification CSPNeXt 0.0 7.2 23.6 67.9
RTMPose B+C Classification CSPNeXt 46.4 55.9 58.7 68.4
RTMPose P+C Classification CSPNeXt 49.1 55.7 58.0 68.1
RTMPose P+B+C Classification CSPNeXt 61.9 63.1 64.4 71.3
RLEPose C Regression ResNet50 0.0 3.9 19.2 53.5
RLEPose B+C Regression ResNet50 40.6 47.7 55.3 64.8
RLEPose P+C Regression ResNet50 31.5 39.0 50.3 65.1
RLEPose P+B+C Regression ResNet50 51.8 56.3 58.5 66.6

References

  • [1] Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 3686–3693 (2014)
  • [2] Azizi, S., Kornblith, S., Saharia, C., Norouzi, M., Fleet, D.J.: Synthetic data from diffusion models improves imagenet classification. arXiv preprint arXiv:2304.08466 (2023)
  • [3] Bansal, A., Borgnia, E., Chu, H.M., Li, J.S., Kazemi, H., Huang, F., Goldblum, M., Geiping, J., Goldstein, T.: Cold diffusion: Inverting arbitrary image transforms without noise. In: Proc. Int. Conf. Learn. Representations (2022)
  • [4] Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying mmd gans. arXiv preprint arXiv:1801.01401 (2018)
  • [5] Black, M.J., Patel, P., Tesch, J., Yang, J.: Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 8726–8737 (2023)
  • [6] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In: Proc. Eur. Conf. Comp. Vis. pp. 561–578. Springer (2016)
  • [7] Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset (2022)
  • [8] Cai, Z., Ren, D., Zeng, A., Lin, Z., Yu, T., Wang, W., Fan, X., Gao, Y., Yu, Y., Pan, L., et al.: Humman: Multi-modal 4d human dataset for versatile sensing and modeling. In: Proc. Eur. Conf. Comp. Vis. pp. 557–577. Springer (2022)
  • [9] Dai, Y., Lin, Y., Lin, X., Wen, C., Xu, L., Yi, H., Shen, S., Ma, Y., Wang, C.: Sloper4d: A scene-aware dataset for global 4d human pose estimation in urban environments. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 682–692 (2023)
  • [10] Dou, Z., Chen, X., Fan, Q., Komura, T., Wang, W.: C·ASE: Learning conditional adversarial skill embeddings for physics-based characters. In: SIGGRAPH Asia 2023 Conference Papers. pp. 1–11 (2023)
  • [11] Dou, Z., Wu, Q., Lin, C., Cao, Z., Wu, Q., Wan, W., Komura, T., Wang, W.: Tore: Token reduction for efficient human mesh recovery with transformer. arXiv preprint arXiv:2211.10705 (2022)
  • [12] Eftekhar, A., Sax, A., Malik, J., Zamir, A.: Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10786–10796 (2021)
  • [13] Fang, Y., Liao, B., Wang, X., Fang, J., Qi, J., Wu, R., Niu, J., Liu, W.: You only look at one sequence: Rethinking transformer in vision through object detection. In: Proc. Advances in Neural Inf. Process. Syst. (2021)
  • [14] Fieraru, M., Zanfir, M., Oneata, E., Popa, A.I., Olaru, V., Sminchisescu, C.: Three-dimensional reconstruction of human interactions. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (June 2020)
  • [15] Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3d human pose ambiguities with 3d scene constraints. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 2282–2292 (2019)
  • [16] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 770–778 (2016)
  • [17] Heitz, E., Belcour, L., Chambon, T.: Iterative α-(de)blending: a minimalist deterministic diffusion model. arXiv preprint arXiv:2305.03486 (2023)
  • [18] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)
  • [19] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) NIPS. vol. 33, pp. 6840–6851. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
  • [20] Huang, C.H.P., Yi, H., Höschle, M., Safroshkin, M., Alexiadis, T., Polikovsky, S., Scharstein, D., Black, M.J.: Capturing and inferring dense full-body human-scene contact. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 13274–13285 (2022)
  • [21] huggingface: Sdxl-controlnet: Depth (2023), https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0
  • [22] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2013)
  • [23] Jiang, T., Lu, P., Zhang, L., Ma, N., Han, R., Lyu, C., Li, Y., Chen, K.: Rtmpose: Real-time multi-person pose estimation based on mmpose (2023). https://doi.org/10.48550/ARXIV.2303.07399, https://arxiv.org/abs/2303.07399
  • [24] Jiang, Y., Yang, S., Qiu, H., Wu, W., Loy, C.C., Liu, Z.: Text2human: Text-driven controllable human image generation. ACM Trans. Graphics 41(4), 1–11 (2022)
  • [25] Joo, H., Neverova, N., Vedaldi, A.: Exemplar fine-tuning for 3d human model fitting towards in-the-wild 3d human pose estimation. In: Int. Conf. 3D. Vis. pp. 42–52. IEEE (2021)
  • [26] Ju, X., Zeng, A., Zhao, C., Wang, J., Zhang, L., Xu, Q.: Humansd: A native skeleton-guided diffusion model for human image generation. arXiv preprint arXiv:2304.04269 (2023)
  • [27] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 7122–7131 (2018)
  • [28] Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repurposing diffusion-based image generators for monocular depth estimation. arXiv preprint arXiv:2312.02145 (2023)
  • [29] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv: Comp. Res. Repository (2014)
  • [30] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)
  • [31] Kocabas, M., Huang, C.H.P., Hilliges, O., Black, M.J.: Pare: Part attention regressor for 3d human body estimation. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 11127–11137 (2021)
  • [32] Kocabas, M., Huang, C.H.P., Tesch, J., Müller, L., Hilliges, O., Black, M.J.: Spec: Seeing people in the wild with an estimated camera. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 11035–11045 (2021)
  • [33] Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 2252–2261 (2019)
  • [34] Lee, H.Y., Tseng, H.Y., Lee, H.Y., Yang, M.H.: Exploiting diffusion prior for generalizable pixel-level semantic prediction. arXiv preprint arXiv:2311.18832 (2023)
  • [35] Li, J., Bian, S., Zeng, A., Wang, C., Pang, B., Liu, W., Lu, C.: Human pose regression with residual log-likelihood estimation. In: Proc. IEEE Int. Conf. Comp. Vis. (2021)
  • [36] Li, J., Xu, C., Chen, Z., Bian, S., Yang, L., Lu, C.: Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 3383–3393 (2021)
  • [37] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML (2022)
  • [38] Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Ai choreographer: Music conditioned 3d dance generation with aist++. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13401–13412 (2021)
  • [39] Li, Z., Liu, J., Zhang, Z., Xu, S., Yan, Y.: Cliff: Carrying location information in full frames into human pose and shape estimation. arXiv: Comp. Res. Repository (2022)
  • [40] Lin, J., Zeng, A., Wang, H., Zhang, L., Li, Y.: One-stage 3d whole-body mesh recovery with component aware transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21159–21168 (2023)
  • [41] Lin, K., Wang, L., Liu, Z.: Mesh graphormer. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 12939–12948 (2021)
  • [42] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proc. Eur. Conf. Comp. Vis. (2014)
  • [43] Liu, X., Ren, J., Siarohin, A., Skorokhodov, I., Li, Y., Lin, D., Liu, X., Liu, Z., Tulyakov, S.: Hyperhuman: Hyper-realistic human generation with latent structural diffusion. arXiv preprint arXiv:2310.08579 (2023)
  • [44] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 5442–5451 (2019)
  • [45] Mao, W., Ge, Y., Shen, C., Tian, Z., Wang, X., Wang, Z., Hengel, A.v.d.: Poseur: Direct human pose regression with transformers. In: Proc. Eur. Conf. Comp. Vis. (October 2022)
  • [46] von Marcard, T., Henschel, R., Black, M., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In: Proc. Eur. Conf. Comp. Vis. (2018)
  • [47] McInnes, L., Healy, J., Melville, J.: Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
  • [48] Meagher, D.: Geometric modeling using octree encoding. Computer graphics and image processing 19(2), 129–147 (1982)
  • [49] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: Int. Conf. 3D. Vis. (2017)
  • [50] Moon, G., Choi, H., Lee, K.M.: Accurate 3d hand pose estimation for whole-body 3d human mesh estimation. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 2308–2317 (2022)
  • [51] OpenAI: Gpt-3: Generative pre-trained transformer 3. https://openai.com/research/gpt-3 (2020)
  • [52] OpenAI: Gpt-4 technical report (2023)
  • [53] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., et al.: DINOv2: Learning robust visual features without supervision. TMLR (2024)
  • [54] Pang, H.E., Cai, Z., Yang, L., Zhang, T., Liu, Z.: Benchmarking and analyzing 3d human pose and shape estimation beyond algorithms. In: Proc. Advances in Neural Inf. Process. Syst. (2022)
  • [55] Patel, P., Huang, C.H.P., Tesch, J., Hoffmann, D.T., Tripathi, S., Black, M.J.: AGORA: Avatars in geography optimized for regression analysis. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (Jun 2021)
  • [56] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 10975–10985 (2019)
  • [57] Peng, S., Zhang, Y., Xu, Y., Wang, Q., Shuai, Q., Bao, H., Zhou, X.: Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 9054–9063 (2021)
  • [58] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
  • [59] Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: Proc. IEEE Int. Conf. Comp. Vis. (2021)
  • [60] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 10684–10695 (2022)
  • [61] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 234–241. Springer (2015)
  • [62] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: Proc. Int. Conf. Learn. Representations (2022)
  • [63] Sárándi, I., Hermans, A., Leibe, B.: Learning 3D human pose estimation from dozens of datasets using a geometry-aware autoencoder to bridge between skeleton formats. In: Proc. Winter Conf. on Appl. of Comp. Vis. (2023)
  • [64] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: Laion-5b: An open large-scale dataset for training next generation image-text models (2022)
  • [65] Sengupta, A., Budvytis, I., Cipolla, R.: Synthetic training for accurate 3d human pose and shape estimation in the wild. arXiv preprint arXiv:2009.10013 (2020)
  • [66] Sigal, L., Balan, A.O., Black, M.J.: Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision 87(1-2), 4–27 (2010)
  • [67] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
  • [68] thibaud: Sdxl-controlnet: Pose (2023), https://huggingface.co/thibaud/controlnet-openpose-sdxl-1.0
  • [69] Tian, Y., Fan, L., Isola, P., Chang, H., Krishnan, D.: Stablerep: Synthetic images from text-to-image models make strong visual representation learners. arXiv preprint arXiv:2306.00984 (2023)
  • [70] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  • [71] Tripathi, S., Müller, L., Huang, C.H.P., Taheri, O., Black, M.J., Tzionas, D.: 3D human pose estimation via intuitive physics. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2023)
  • [72] Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M.J., Laptev, I., Schmid, C.: Learning from synthetic humans. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2017)
  • [73] Voetman, R., Aghaei, M., Dijkstra, K.: The big data myth: Using diffusion models for dataset generation to train deep detection models. arXiv preprint arXiv:2306.09762 (2023)
  • [74] Wan, W., Dou, Z., Komura, T., Wang, W., Jayaraman, D., Liu, L.: Tlcontrol: Trajectory and language control for human motion synthesis. arXiv preprint arXiv:2311.17135 (2023)
  • [75] Wang, J., Yuan, Y., Luo, Z., Xie, K., Lin, D., Iqbal, U., Fidler, S., Khamis, S.: Learning human dynamics in autonomous driving scenarios. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 20796–20806 (2023)
  • [76] Wang, J., Liu, Y., Dou, Z., Yu, Z., Liang, Y., Li, X., Wang, W., Xie, R., Song, L.: Disentangled clothed avatar generation from text descriptions. arXiv preprint arXiv:2312.05295 (2023)
  • [77] Wang, W., Ge, Y., Mei, H., Cai, Z., Sun, Q., Wang, Y., Shen, C., Yang, L., Komura, T.: Zolly: Zoom focal length correctly for perspective-distorted human mesh reconstruction. arXiv preprint arXiv:2303.13796 (2023)
  • [78] Weng, Z., Bravo-Sánchez, L., Yeung, S.: Diffusion-hpc: Generating synthetic images with realistic humans. arXiv preprint arXiv:2303.09541 (2023)
  • [79] Wood, E., Baltrušaitis, T., Hewitt, C., Johnson, M., Shen, J., Milosavljević, N., Wilde, D., Garbin, S., Sharp, T., Stojiljković, I., et al.: 3d face reconstruction with dense landmarks. In: Proc. Eur. Conf. Comp. Vis. pp. 160–177. Springer (2022)
  • [80] Wu, W., Zhao, Y., Chen, H., Gu, Y., Zhao, R., He, Y., Zhou, H., Shou, M.Z., Shen, C.: Datasetdm: Synthesizing data with perception annotations using diffusion models. arXiv: Comp. Res. Repository (2023)
  • [81] Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Proc. Eur. Conf. Comp. Vis. pp. 466–481 (2018)
  • [82] Xiao, Z., Wang, T., Wang, J., Cao, J., Zhang, W., Dai, B., Lin, D., Pang, J.: Unified human-scene interaction via prompted chain-of-contacts. arXiv preprint arXiv:2309.07918 (2023)
  • [83] Yang, X., Yuan, L., Wilber, K., Sharma, A., Gu, X., Qiao, S., Debats, S., Wang, H., Adam, H., Sirotenko, M., et al.: Polymax: General dense prediction with mask transformer. In: Proc. Winter Conf. on Appl. of Comp. Vis. pp. 1050–1061 (2024)
  • [84] Yang, Z., Cai, Z., Mei, H., Liu, S., Chen, Z., Xiao, W., Wei, Y., Qing, Z., Wei, C., Dai, B., et al.: Synbody: Synthetic dataset with layered human models for 3d human perception and modeling. arXiv preprint arXiv:2303.17368 (2023)
  • [85] Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: Scannet++: A high-fidelity dataset of 3d indoor scenes. In: Proc. IEEE Int. Conf. Comp. Vis. pp. 12–22 (2023)
  • [86] Yin, Y., Guo, C., Kaufmann, M., Zarate, J., Song, J., Hilliges, O.: Hi4d: 4d instance segmentation of close human interaction. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2023)
  • [87] Zhang, H., Tian, Y., Zhang, Y., Li, M., An, L., Sun, Z., Liu, Y.: Pymaf-x: Towards well-aligned full-body model regression from monocular images. arXiv preprint arXiv:2207.06400 (2022)
  • [88] Zhang, L., Agrawala, M.: Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543 (2023)
  • [89] Zhang, S., Ma, Q., Zhang, Y., Qian, Z., Kwon, T., Pollefeys, M., Bogo, F., Tang, S.: Egobody: Human body shape and motion of interacting people from head-mounted devices. In: Proc. Eur. Conf. Comp. Vis. pp. 180–200. Springer (2022)
  • [90] Zhou, W., Dou, Z., Cao, Z., Liao, Z., Wang, J., Wang, W., Liu, Y., Komura, T., Wang, W., Liu, L.: Emdm: Efficient motion diffusion model for fast, high-quality motion generation. arXiv preprint arXiv:2312.02256 (2023)
  • [91] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable Transformers for end-to-end object detection. In: Proc. Int. Conf. Learn. Representations (2021)