
GenHMR: Generative Human Mesh Recovery

Muhammad Usama Saleem1, Ekkasit Pinyoanuntapong1, Pu Wang1, Hongfei Xue1, Srijan Das1, Chen Chen2
Abstract

Human mesh recovery (HMR) is crucial in many computer vision applications, from health to arts and entertainment. HMR from monocular images has predominantly been addressed by deterministic methods that output a single prediction for a given 2D image. However, HMR from a single image is an ill-posed problem due to depth ambiguity and occlusions. Probabilistic methods have attempted to address this by generating and fusing multiple plausible 3D reconstructions, but their performance has often lagged behind deterministic approaches. In this paper, we introduce GenHMR, a novel generative framework that reformulates monocular HMR as an image-conditioned generative task, explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping process. GenHMR comprises two key components: (1) a pose tokenizer that converts 3D human poses into a sequence of discrete tokens in a latent space, and (2) an image-conditional masked transformer that learns the probabilistic distributions of the pose tokens, conditioned on the input image prompt along with a randomly masked token sequence. During inference, the model samples from the learned conditional distribution to iteratively decode high-confidence pose tokens, thereby reducing 3D reconstruction uncertainties. To further refine the reconstruction, a 2D pose-guided refinement technique is proposed to directly fine-tune the decoded pose tokens in the latent space, forcing the projected 3D body mesh to align with 2D pose cues. Experiments on benchmark datasets demonstrate that GenHMR significantly outperforms state-of-the-art methods. The project website can be found at https://m-usamasaleem.github.io/publication/GenHMR/GenHMR.html

Figure 1: State of the art (SOTA) methods, HMR2.0 (Goel et al. 2023) and TokenHMR (Dwivedi et al. 2024), leverage vision transformers to recover 3D human mesh from a single image. The errors, highlighted by red circles, reveal the limitations of these SOTA approaches in handling unusual poses or ambiguous scenarios. Our method, GenHMR, overcomes these challenges by explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping process, resulting in more accurate and robust 3D pose reconstructions in such complex scenarios.

Introduction

Recovering 3D human mesh from monocular images is an essential task in computer vision, with applications spanning diverse fields such as character animation for video games and movies, the metaverse, human-computer interaction, and sports performance optimization. However, recovering 3D human mesh from monocular images remains challenging due to inherent ambiguities in lifting 2D observations to 3D space, flexible body kinematic structures, complex interactions with the environment, and insufficient annotated 3D data (Tian et al. 2023). To address these challenges, recent efforts have focused on two classes of methods: (1) deterministic HMR and (2) probabilistic HMR. Both face critical limitations.

Deterministic methods, as the dominant approach for HMR, are designed to produce a single prediction for a given 2D image. These methods estimate the shape and pose parameters of a 3D body model either by regressing from 2D image features extracted by deep neural networks (Choi et al. 2022; Kocabas et al. 2021; Kanazawa et al. 2018) or by directly optimizing the parametric body model by fitting it to 2D image cues, such as 2D keypoints (Bogo et al. 2016; Pavlakos et al. 2019; Xu et al. 2020), silhouettes (Omran et al. 2018), and part segmentations (Lassner et al. 2017). Recent deterministic HMR models, utilizing vision transformers as their backbone, have achieved state-of-the-art (SOTA) mesh reconstruction accuracy. However, despite such promising progress, these SOTA deterministic models face a critical limitation: they constrain neural networks to produce a single prediction hypothesis. This approach overlooks the inherent depth ambiguity in monocular images, which can result in multiple plausible 3D reconstructions that equally fit the same 2D evidence.

To mitigate this problem, several works have proposed probabilistic approaches that generate multiple predictions from a single image using various generative models, such as conditional variational autoencoders (CVAEs) (Sharma et al. 2019; Sohn, Lee, and Yan 2015) and diffusion models (Shan et al. 2023; Holmquist and Wandt 2023). However, this increase in diversity typically comes at the cost of accuracy, because strategically aggregating multiple solutions into a single accurate prediction is challenging due to the potential kinematic inconsistency among these 3D human mesh predictions. As a result, none of these multi-hypothesis probabilistic methods are competitive with the latest single-output deterministic models.

To overcome the limitations of existing methods, we introduce GenHMR, a novel generative framework for 3D human mesh recovery from a single 2D image. GenHMR is built on two key components: a pose tokenizer and an image-conditional masked transformer. The framework follows a two-stage training paradigm. In the first stage, the pose tokenizer is trained using a Vector Quantized Variational Autoencoder (VQ-VAE) (Van Den Oord, Vinyals et al. 2017), which converts the continuous human pose (i.e., the rotations of skeletal joints) into a sequence of discrete tokens in a latent space, based on a learned codebook. In the second stage, a portion of the pose token sequence is randomly masked. The image-conditional masked transformer is then trained to predict the masked tokens by learning the conditional categorical distribution of each token, given the input image and the unmasked tokens. This generative masking training allows GenHMR to learn an explicit probabilistic mapping from the 2D image to the human pose. Leveraging this property, we propose uncertainty-guided iterative sampling during inference, where the model decodes multiple pose tokens simultaneously in each iteration by sampling from the learned image-conditioned pose distributions. The tokens with low prediction uncertainties are kept, while the others are re-masked and re-predicted in the next iteration. This allows GenHMR to iteratively reduce 2D-to-3D mapping uncertainties and progressively correct wrong joint rotations, improving the mesh reconstruction accuracy. To further refine the reconstruction quality, we propose a novel 2D pose-guided refinement technique, which directly optimizes the decoded pose tokens in the latent space, with the objective of forcing the projected 3D body mesh to align with 2D pose cues. Our contributions are summarized as follows:

  • We introduce GenHMR, a novel generative framework for accurate HMR from a single image, which largely departs from existing deterministic and probabilistic methods in terms of model architecture, training paradigm and inference process.

  • We leverage generative masking training to learn intricate image-conditioned pose distributions, thus effectively capturing the 2D-to-3D mapping uncertainty.

  • We propose a novel iterative inference strategy, incorporating uncertainty-guided sampling followed by 2D pose-guided refinement to progressively mitigate HMR errors.

  • We demonstrate through extensive experiments that GenHMR outperforms SOTA methods on standard datasets, including Human3.6M in the controlled environment, and 3DPW and EMDB for in-the-wild scenarios. In both cases, GenHMR achieves a 20-30% error reduction (in terms of MPJPE) compared with SOTA methods. Qualitative results show that GenHMR is robust to ambiguous image observations (Fig. 1).

Related Work

Deterministic HMR

The field of Human Mesh Recovery (HMR) from monocular images has been primarily dominated by deterministic approaches, which aim to generate a single output for a given 2D image. Early work mainly adopted optimization-based approaches that fit a parametric human body model such as SMPL (Loper et al. 2015) to 2D image cues (Pavlakos et al. 2019; Kolotouros et al. 2019; Lassner et al. 2017). Later, learning-based methods became more prevalent, leveraging CNNs to directly regress SMPL parameters from images (Choi et al. 2022; Kocabas et al. 2021; Kanazawa et al. 2018) and videos (Cho et al. 2023; Kanazawa et al. 2019). Recently, vision transformers (Alexey 2020) have been adopted for HMR tasks. For example, HMR 2.0 (Goel et al. 2023) and TokenHMR (Dwivedi et al. 2024) achieve state-of-the-art mesh reconstruction accuracy by leveraging the transformer's ability to model long-range correlations and thus the dependencies between different human body parts. However, these deterministic methods are limited by their single-output nature, which fails to capture the inherent depth ambiguity in complex scenarios, leading to reconstruction errors when multiple plausible 3D interpretations exist.

Probabilistic HMR

To address the limitations of deterministic methods, various probabilistic models have been exploited to address the inherent uncertainty in the reconstruction process and enable the generation of multiple plausible 3D mesh predictions from a single 2D image. These methods include mixture density networks (MDNs) (Bishop 1994; Li and Lee 2019), conditional variational autoencoders (CVAEs) (Sohn, Lee, and Yan 2015; Pavlakos et al. 2018), and normalizing flows (Wehrbein et al. 2021; Kolotouros et al. 2021). Recent advancements in diffusion-based HMR models, such as DDHPose (Cai et al. 2024), Diff-HMR (Cho and Kim 2023), and D3DP (Shan et al. 2023), have shown significant promise in generating diverse and realistic human meshes. However, these approaches face a great challenge in synthesizing multiple predictions into a single, coherent 3D pose. Current methods often rely on simplistic aggregation techniques, such as averaging all hypotheses (Wehrbein et al. 2021; Li and Lee 2019, 2020) or performing pose or joint-level fusion (Shan et al. 2023; Cai et al. 2024), which frequently result in kinematically inconsistent and suboptimal reconstructions.

Unlike existing probabilistic methods that model and sample the distributions of entire 3D body meshes, GenHMR leverages masked generative transformers to model the pose and rotation distribution of each individual skeletal joint, while also capturing the inherent interdependence among joints. This approach enables our model to iteratively perform fine-grained joint-level sampling, progressively reconstructing the human mesh from ambiguous 2D image observations. Our generative framework draws inspiration from the great success of masked language and image models in text, image, and video generation (Ghazvininejad et al. 2019; Devlin et al. 2019; Zhang et al. 2021; Qian et al. 2020; Chang et al. 2022; Ding et al. 2022; Chang et al. 2023). However, while generative language and image models focus on increasing sample diversity, GenHMR aims to minimize 3D mesh generation diversity and uncertainty, conditioned on 2D image prompts.

Figure 2: GenHMR Training Phase. GenHMR consists of two key components: (1) a Pose Tokenizer that encodes 3D human poses into a sequence of discrete tokens within a latent space, and (2) an Image-Conditioned Masked Transformer that models the probabilistic distributions of these tokens, conditioned on the input image and a partially masked token sequence.
Figure 3: Our inference strategy comprises two key stages: (1) Uncertainty-Guided Sampling, which iteratively samples high-confidence pose tokens based on their probabilistic distributions, and (2) 2D Pose-Guided Refinement, which fine-tunes the sampled pose tokens to further minimize 3D reconstruction uncertainty by ensuring consistency between the 3D body mesh and 2D pose estimates.

Proposed Method: GenHMR

The goal of GenHMR is to achieve accurate 3D human mesh reconstructions from monocular images $I$ by learning body pose $\theta$, shape $\beta$, and camera parameters $T$. As shown in Figure 2, GenHMR comprises two main modules: the pose tokenizer and the image-conditioned masked transformer. The pose tokenizer converts 3D human pose parameters $\theta$ into a sequence of discrete pose tokens. The image-conditioned masked transformer predicts masked pose tokens based on multi-scale image features extracted by an image encoder. During inference, we use an iterative decoding process to predict high-confidence pose tokens, progressively refining predictions by masking low-confidence tokens and leveraging both image semantics and inter-token dependencies. To further enhance accuracy, a 2D pose-guided refinement strategy is proposed to optimize the pose queries by aligning the re-projected 3D pose with the estimated 2D pose. Additionally, the shape parameters $\beta$ and weak-perspective camera parameters $T$ are directly regressed from the image features, completing the 3D human mesh reconstruction process.

Body Model

We utilize SMPL, a differentiable parametric body model (Loper et al. 2015), to represent the human body. The SMPL model encodes the body using pose parameters $\theta\in\mathbb{R}^{24\times 3}$ and shape parameters $\beta\in\mathbb{R}^{10}$. The pose parameters $\theta=[\theta_{1},\ldots,\theta_{24}]$ include the global orientation $\theta_{1}\in\mathbb{R}^{3}$ of the whole body and the local rotations $[\theta_{2},\ldots,\theta_{24}]\in\mathbb{R}^{23\times 3}$ of the body joints, where each $\theta_{k}$ represents the axis-angle rotation of joint $k$ relative to its parent in the kinematic tree. Given these pose and shape parameters, the SMPL model generates a body mesh $M(\theta,\beta)\in\mathbb{R}^{3\times N}$ with $N=6890$ vertices. The body joints $J\in\mathbb{R}^{3\times k}$ are then defined as a linear combination of these vertices, calculated as $J=MW$, where $W\in\mathbb{R}^{N\times k}$ represents fixed weights that map vertices to joints.
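
For illustration, a minimal sketch of the joint regression step $J=MW$ is given below; the batch-first tensor shapes and function name are our own conventions, not the released implementation.

```python
import torch

def smpl_joints_from_vertices(vertices: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Regress k body joints as a linear combination of the N = 6890 mesh vertices (J = MW).

    vertices: (B, N, 3) body mesh M(theta, beta), batch-first for convenience
    W:        (N, k) fixed vertex-to-joint regression weights
    returns:  (B, k, 3) 3D joints
    """
    return torch.einsum('bnc,nk->bkc', vertices, W)
```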

Pose Tokenizer

The goal of the pose tokenizer is to learn a discrete latent space for 3D pose parameters by quantizing the encoder's output into a learned codebook $C$, as shown in Figure 2(a). We leverage the VQ-VAE (Van Den Oord, Vinyals et al. 2017) to pretrain the pose tokenizer. Specifically, given the SMPL pose parameters $\theta$, we use a convolutional encoder $E$ to map $\theta$ into latent embeddings $z$. Each embedding $z_i$ is then quantized to a code $c\in C$ by finding the nearest codebook entry based on the Euclidean distance, described by $\hat{z}_{i}=\arg\min_{c_{k}\in C}\|z_{i}-c_{k}\|_{2}$. The total loss function is defined as follows

\mathcal{L}_{\text{vq}}=\lambda_{\text{re}}\mathcal{L}_{\text{re}}+\lambda_{\text{E}}\|\operatorname{sg}[z]-c\|_{2}+\lambda_{\alpha}\|z-\operatorname{sg}[c]\|_{2}

which consists of an SMPL reconstruction loss, a latent embedding loss, and a commitment loss, where $\lambda_{\text{re}}$, $\lambda_{\text{E}}$, and $\lambda_{\alpha}$ are their respective weights and $\operatorname{sg}[\cdot]$ denotes the stop-gradient operator. To improve reconstruction quality, we employ an L1 loss $\mathcal{L}_{\text{re}}=\lambda_{\theta}\mathcal{L}_{\theta}+\lambda_{\text{V}}\mathcal{L}_{\text{V}}+\lambda_{\text{J}}\mathcal{L}_{\text{J}}$ to minimize the difference between the predicted SMPL quantities and their ground truth, including pose parameters $\theta$, mesh vertices $V$, and kinematic joints $J$. The tokenizer is optimized using a straight-through gradient estimator, with the codebook updated via exponential moving average and codebook reset, following the methodology outlined in (Esser, Rombach, and Ommer 2021; Williams et al. 2020).
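
A minimal sketch of the quantization step and the embedding/commitment terms of $\mathcal{L}_{\text{vq}}$ follows; tensor shapes, the squared-error form of the two terms, and the helper name are illustrative assumptions (the actual codebook is additionally updated with EMA and codebook reset).

```python
import torch
import torch.nn.functional as F

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbour quantization of encoder outputs against the codebook C.

    z:        (B, T, D) latent embeddings from the pose encoder E
    codebook: (K, D) learned codebook entries
    Returns the quantized embeddings (with straight-through gradients),
    the token indices, and the embedding/commitment loss terms.
    """
    # distance of every embedding to every codebook entry
    dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))  # (B, T, K)
    idx = dists.argmin(-1)                              # nearest codebook entry per embedding
    z_q = codebook[idx]                                 # (B, T, D)

    embed_loss = F.mse_loss(z_q, z.detach())            # ~ ||sg[z] - c||, pulls codes toward encoder outputs
    commit_loss = F.mse_loss(z, z_q.detach())           # ~ ||z - sg[c]||, commits encoder to the codebook
    z_q = z + (z_q - z).detach()                        # straight-through gradient estimator
    return z_q, idx, embed_loss, commit_loss
```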

Image Conditioned Masked Transformer

The image conditioned masked transformer comprises two main components: the image encoder and the masked transformer decoder with multi-scale deformable cross attention.

Image Encoder. Our encoder employs a vision transformer (ViT) to extract image features (Alexey 2020; Dwivedi et al. 2024; Goel et al. 2023). We utilize the ViT-H/16 variant, which processes 16×16 pixel patches through transformer layers to generate feature tokens. Inspired by ViTDet (Alexey 2020), we adopt a multi-scale feature approach by upsampling the initial feature map from the encoder to create a set of feature maps with varying resolutions. High-resolution feature maps capture fine-grained visual details (e.g., the presence and rotation of individual joints), while low-resolution feature maps preserve high-level semantics, e.g., the structure of the human skeleton.

Masked Transformer Decoder. Our decoder is a multi-layer transformer whose input is the pose token sequence obtained from the pose tokenizer. These pose tokens serve as queries that cross-attend to the multi-scale feature maps from the image encoder. Since these feature maps are of high resolution, we adopt multi-scale deformable cross-attention to mitigate the computational cost (Zhu et al. 2020). In particular, each pose token only attends to a small set of sampling points around a reference point on the multi-scale feature maps, regardless of the spatial size of the feature maps. The multi-scale deformable attention is expressed as:

\text{MSDA}(Y,\hat{p}_{y},\{x^{l}\}_{l=1}^{L})=\sum_{l=1}^{L}\sum_{k=1}^{K}A_{lyk}\cdot\mathbf{W}x^{l}\left(\hat{p}_{y}+\Delta p_{lyk}\right)

where $Y$ represents the pose token queries, $\hat{p}_{y}$ denotes learnable reference points, $\Delta p_{lyk}$ are the learnable sampling offsets around these points, $\{x^{l}\}_{l=1}^{L}$ are the multi-scale image features, $A_{lyk}$ are the attention weights, and $\mathbf{W}$ is a learnable weight matrix. The [MASK] token is a special-purpose token used in the masked transformer decoder. During training, it represents masked pose tokens, and the model learns to predict the actual tokens that replace them. During inference, [MASK] tokens serve as placeholders for pose token generation.
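
To make the mechanism concrete, the sketch below implements a simplified, single-head variant of this deformable cross-attention; the module name, the use of `grid_sample`, and the offset normalization are our assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMSDeformAttn(nn.Module):
    """Minimal single-head sketch of multi-scale deformable cross-attention (Zhu et al. 2020)."""
    def __init__(self, dim=256, n_levels=3, n_points=4):
        super().__init__()
        self.n_levels, self.n_points = n_levels, n_points
        self.sampling_offsets = nn.Linear(dim, n_levels * n_points * 2)   # Delta p_{lyk}
        self.attention_weights = nn.Linear(dim, n_levels * n_points)      # A_{lyk}
        self.value_proj = nn.Linear(dim, dim)                             # plays the role of W
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_maps):
        # queries: (B, L, C) pose-token queries; ref_points: (B, L, 2) in [0, 1]
        # feat_maps: list of (B, C, H_l, W_l) multi-scale image features
        B, L, C = queries.shape
        offsets = self.sampling_offsets(queries).reshape(B, L, self.n_levels, self.n_points, 2)
        weights = self.attention_weights(queries).reshape(B, L, self.n_levels * self.n_points)
        weights = weights.softmax(-1).reshape(B, L, self.n_levels, self.n_points)
        out = 0
        for l, feat in enumerate(feat_maps):
            value = self.value_proj(feat.flatten(2).transpose(1, 2))            # (B, H*W, C)
            value = value.transpose(1, 2).reshape(B, C, *feat.shape[-2:])        # back to (B, C, H, W)
            loc = (ref_points[:, :, None, :] + offsets[:, :, l]) * 2.0 - 1.0     # (B, L, K, 2) in [-1, 1]
            sampled = F.grid_sample(value, loc, align_corners=False)             # (B, C, L, K)
            out = out + (sampled * weights[:, None, :, l, :]).sum(-1)            # weighted sum over points
        return self.out_proj(out.transpose(1, 2))                                # (B, L, C)
```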

Training Strategy

Generative Masking. Given a pose token sequence $Y=[y_{i}]_{i=1}^{L}$ from the pose tokenizer, where $L$ denotes the sequence length, our model is trained to reconstruct the pose token sequence, conditioned on the image prompt, under random masking strategies. In particular, we randomly mask out $m=\lceil\gamma(\tau)\cdot L\rceil$ tokens, where $\gamma(\tau)\in[0,1]$ is a masking ratio function with $\tau$ following a uniform distribution $U(0,1)$. We adopt a cosine masking ratio function $\gamma(\tau)=\cos\left(\frac{\pi\tau}{2}\right)$, similar to the ones used in generative text-to-image models (Chang et al. 2022, 2023). The masked tokens are replaced with learnable [MASK] tokens, forming the corrupted pose sequence $Y_{M}$. The categorical distribution of each pose token, conditioned on the corrupted sequence $Y_{M}$ and image prompt $X$, is $p(y_{i}|Y_{M},X)$, which explicitly models the uncertainty during the 2D-to-3D mapping process. The training objective is to minimize the negative log-likelihood of the pose token sequence prediction:

\mathcal{L}_{\text{mask}}=-\mathbb{E}_{Y\in\mathcal{D}}\left[\sum_{i\in[1,L]}\log p(y_{i}|Y_{M},X)\right]. (1)
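
A minimal sketch of this masked-training objective, assuming a hypothetical [MASK] token index and a transformer that maps a corrupted token sequence and image features to per-token logits:

```python
import math
import torch
import torch.nn.functional as F

MASK_ID = 2048  # hypothetical index reserved for the learnable [MASK] token

def masked_training_loss(transformer, pose_tokens, image_feats):
    """One masked-training step (sketch of Eq. 1).

    pose_tokens: (B, L) ground-truth token indices from the frozen pose tokenizer
    image_feats: image prompt X (e.g., multi-scale ViT features)
    """
    B, L = pose_tokens.shape
    tau = torch.rand(B, device=pose_tokens.device)                     # tau ~ U(0, 1)
    gamma = torch.cos(math.pi * tau / 2.0)                             # cosine masking ratio
    num_mask = (gamma * L).ceil().long().clamp(min=1)                  # m = ceil(gamma(tau) * L)

    # pick num_mask random positions per sample via a random-score ranking
    ranks = torch.rand(B, L, device=pose_tokens.device).argsort(-1).argsort(-1)
    mask = ranks < num_mask[:, None]                                   # (B, L) positions to corrupt
    corrupted = pose_tokens.masked_fill(mask, MASK_ID)                 # Y_M

    logits = transformer(corrupted, image_feats)                       # (B, L, K): p(y_i | Y_M, X)
    # negative log-likelihood over all positions, following Eq. 1
    return F.cross_entropy(logits.flatten(0, 1), pose_tokens.flatten())
```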

Training-time Differentiable Sampling. The training loss $\mathcal{L}_{\text{mask}}$ not only aids in capturing the uncertainty inherent in monocular human mesh recovery but also enforces accurate estimation of the pose parameters $\theta$ within the discrete latent space. However, prior research (Kanazawa et al. 2018) indicates that, in addition to ensuring correct $\theta$ estimation, it is highly advantageous to incorporate an additional 3D loss between the predicted and ground-truth 3D joints, as well as a 2D loss between the projections of these predicted 3D joints and the ground-truth 2D joints. The challenge in incorporating these losses into generative model training lies in the need to convert pose tokens in the latent space into the pose parameters $\theta$ in the SMPL space. This conversion requires sampling the categorical distribution of pose tokens during training, which is non-differentiable. To overcome this challenge, we adopt the straight-through Gumbel-Softmax technique (Jang, Gu, and Poole 2016), which uses categorical sampling during the forward pass and differentiable sampling from the continuous Gumbel-Softmax distribution during the backward pass, approximating the categorical distribution via temperature annealing. The final overall loss is $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{mask}}+\mathcal{L}_{\text{SMPL}}+\mathcal{L}_{\text{3D}}+\mathcal{L}_{\text{2D}}$, which combines the pose token prediction loss ($\mathcal{L}_{\text{mask}}$), 3D loss ($\mathcal{L}_{\text{3D}}$), 2D loss ($\mathcal{L}_{\text{2D}}$), and SMPL parameter loss ($\mathcal{L}_{\text{SMPL}}$), the latter minimizing errors in the shape and pose parameters in the SMPL space.
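
The straight-through Gumbel-Softmax step can be sketched as follows; the function signature and the assumption that the transformer outputs per-token logits over the codebook are ours. The resulting embeddings can then be passed through the frozen tokenizer decoder to obtain $\theta$, from which the 3D joints and their 2D projections are computed for $\mathcal{L}_{\text{3D}}$ and $\mathcal{L}_{\text{2D}}$.

```python
import torch
import torch.nn.functional as F

def differentiable_token_embeddings(logits: torch.Tensor, codebook: torch.Tensor, tau: float = 1.0):
    """Straight-through Gumbel-Softmax sampling of pose-token embeddings (sketch).

    logits:   (B, L, K) per-token predictions from the masked transformer
    codebook: (K, D) frozen pose-tokenizer codebook
    tau:      annealed Gumbel-Softmax temperature
    Forward pass uses hard (categorical) samples; the backward pass flows
    through the soft relaxation, so the 3D/2D losses can be backpropagated.
    """
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)   # (B, L, K)
    return one_hot @ codebook                                        # (B, L, D) embeddings
```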

Inference Strategy

As shown in Fig. 3, our inference strategy comprises two key stages: (1) uncertainty-guided sampling, which iteratively samples high-confidence pose tokens based on their probabilistic distributions and (2) 2D pose-guided refinement, which fine-tunes the sampled pose tokens to further minimize 3D reconstruction uncertainty by ensuring the consistency between the 3D body mesh and 2D pose estimates.

Uncertainty-Guided Sampling. The sampling process begins with a fully masked sequence $Y_{1}$ of length $L$, where all tokens are initially set to [MASK]. The sequence is decoded over $T$ iterations. At each iteration $t$, the masked tokens are decoded by stochastic sampling, where tokens are drawn from their prediction distributions $p(y_{i}|Y_{M},X)$. After the token sampling, a certain number of tokens with low prediction confidence are re-masked and re-predicted in the next iteration. The number of tokens to be re-masked is determined by a masking schedule $\lceil\gamma\left(\frac{t}{T}\right)\cdot L\rceil$, where $\gamma$ is a decaying function of the iteration $t$ that produces a higher masking ratio in the early iterations, when the prediction confidence is low, and a lower masking ratio in later iterations, when the prediction confidence increases as more context becomes available from previous iterations. We adopt the cosine function for $\gamma$; the impact of other decaying functions is shown in the supplementary material.
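
A minimal sketch of this uncertainty-guided decoding loop, with the [MASK] token index, the confidence bookkeeping, and the model interface as assumptions:

```python
import math
import torch

@torch.no_grad()
def uncertainty_guided_sampling(model, image_feats, L=96, T=5, mask_id=2048):
    """Iterative decoding sketch: keep confident tokens, re-mask and re-predict the rest."""
    device = next(model.parameters()).device
    tokens = torch.full((1, L), mask_id, dtype=torch.long, device=device)    # fully masked Y_1
    conf = torch.zeros(1, L, device=device)

    for t in range(1, T + 1):
        logits = model(tokens, image_feats)                                  # (1, L, K)
        probs = logits.softmax(-1)                                           # p(y_i | Y_M, X)
        sampled = torch.multinomial(probs.view(-1, probs.size(-1)), 1).view(1, L)
        sampled_conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)   # confidence of each sample
        keep_prev = tokens != mask_id
        tokens = torch.where(keep_prev, tokens, sampled)                     # only fill masked slots
        conf = torch.where(keep_prev, conf, sampled_conf)

        if t < T:
            num_mask = math.ceil(math.cos(math.pi / 2 * t / T) * L)          # cosine masking schedule
            remask = conf.argsort(-1).argsort(-1) < num_mask                 # lowest-confidence tokens
            tokens = tokens.masked_fill(remask, mask_id)
            conf = conf.masked_fill(remask, 0.0)
    return tokens
```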

2D Pose-Guided Refinement. To further reduce uncertainties and ambiguities in the 3D reconstruction, we refine the pose tokens $Y$, while keeping the whole network frozen, so that the 3D pose estimates are better aligned with 2D pose cues from off-the-shelf 2D pose detectors such as OpenPose (Cao et al. 2017). This optimization process is initialized with the pose tokens from uncertainty-guided sampling, and these tokens are then iteratively updated to minimize a composite guidance function $G(Y_{p},J_{\text{2D}},\theta^{\prime})$ that penalizes the misalignment between the 3D and 2D poses, along with regularization terms:

Y^{+}=\arg\min_{Y_{p}}\left(\mathcal{L}_{\text{2D}}(J_{\text{3D}}^{\prime})+\lambda_{\theta^{\prime}}\mathcal{L}_{\theta^{\prime}}(\theta^{\prime})\right) (2)

The term $\mathcal{L}_{\text{2D}}(J_{\text{3D}}^{\prime})$ ensures that the reprojected 3D joints $J_{\text{3D}}^{\prime}$ are aligned with the detected 2D keypoints $J_{\text{2D}}$:

\mathcal{L}_{\text{2D}}(J_{\text{3D}}^{\prime})=\left|\Pi(K(J_{\text{3D}}^{\prime}))-J_{\text{2D}}\right|^{2} (3)

where $\Pi(K(\cdot))$ represents the perspective projection with camera intrinsics $K$. The regularization term $\mathcal{L}_{\theta^{\prime}}(\theta^{\prime})$ ensures that the pose parameters $\theta^{\prime}$ remain close to the initial estimate, preventing excessive deviations and maintaining plausible human body poses. At each iteration $p$, the pose embeddings $Y_{p}$ are updated using the following gradient-based approach:

Y_{p+1}=Y_{p}-\eta\nabla_{Y_{p}}G(Y_{p},J_{\text{2D}},\theta^{\prime}) (4)

Here, $\eta$ controls the magnitude of the updates to the pose embeddings, while $\nabla_{Y_{p}}G(Y_{p},J_{\text{2D}},\theta^{\prime})$ is the gradient of the objective function with respect to the pose embeddings at iteration $p$. This refinement process continues over $P$ iterations, and our experiments show that only a small number of iterations (5 to 10) is sufficient to yield a satisfactory enhancement.
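
A minimal sketch of this refinement loop (Eqs. 2-4); the step size, regularization weight, and the callable interfaces for the frozen decoder, SMPL layer, and projection are assumptions:

```python
import torch

def pose_guided_refinement(decoder, smpl_joints, project, Y_init, j2d,
                           P=10, eta=1e-2, lam=1e-3):
    """Sketch of 2D pose-guided refinement of the latent pose tokens.

    decoder:     frozen tokenizer decoder mapping latent pose embeddings Y to SMPL pose theta
    smpl_joints: frozen SMPL layer returning 3D joints J'_3D from theta
    project:     perspective projection Pi(K(.)) including the camera translation
    Y_init:      (1, L, D) pose embeddings from uncertainty-guided sampling
    j2d:         (1, k, 2) detected 2D keypoints J_2D (e.g., from OpenPose)
    """
    theta_init = decoder(Y_init).detach()
    Y = Y_init.clone().requires_grad_(True)
    for _ in range(P):
        theta = decoder(Y)
        j3d = smpl_joints(theta)
        loss_2d = ((project(j3d) - j2d) ** 2).sum()        # Eq. 3: reprojection error
        loss_reg = ((theta - theta_init) ** 2).sum()       # keep theta near the initial estimate
        G = loss_2d + lam * loss_reg                       # Eq. 2: composite guidance function
        grad = torch.autograd.grad(G, Y)[0]
        Y = (Y - eta * grad).detach().requires_grad_(True) # Eq. 4: gradient step on the tokens
    return Y.detach()
```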

Experiments

Table 1: Reconstructions Evaluated in 3D: Reconstruction errors (in mm) on the Human3.6M, 3DPW, and EMDB datasets. Lower values (↓) indicate better performance. Underlined results show the second-best performance in each column. Blue indicates the improvement of our method over the second-best method. "–" means results are not reported.
Human3.6M 3DPW EMDB
Methods Venue PA-MPJPE (↓) MPJPE (↓) PA-MPJPE (↓) MPJPE (↓) MVE (↓) PA-MPJPE (↓) MPJPE (↓) MVE (↓)
Deterministic HMR Methods
FastMETRO (Cho 2022) ECCV 2022 33.7 52.2 65.7 109.0 121.6 72.7 108.1 119.2
PARE (Kocabas et al. 2021) ICCV 2021 50.6 76.8 50.9 82.0 97.9 72.2 113.9 133.2
Virtual Marker (Ma et al. 2023) CVPR 2023 - - 48.9 80.5 93.8 - - -
CLIFF (Li et al. 2022) ECCV 2022 32.7 47.1 46.4 73.9 87.6 68.8 103.1 122.9
HMR2.0 (Goel et al. 2023) ICCV 2023 33.6 44.8 44.5 70.0 84.1 61.5 97.8 120.1
VQ HPS (Fiche et al. 2023) ECCV 2024 - - 45.2 71.1 84.8 65.2 99.9 112.9
TokenHMR (Dwivedi et al. 2024) CVPR 2024 36.3 48.4 47.5 75.8 86.5 66.1 98.1 116.2
Probabilistic HMR Methods
Diff-HMR (Cho and Kim 2023) ICCV 2023 - - 55.9 94.5 109.8 - - -
3D Multibodies (Biggs et al. 2020) NeurIPS 2020 42.2 58.2 55.6 75.8 - - - -
ProHMR (Kolotouros et al. 2021) ICCV 2021 - - 52.4 84.0 - - - -
GenHMR Ours 22.4 (10.3↓) 33.5 (11.3↓) 32.6 (13.8↓) 54.7 (15.3↓) 67.5 (17.3↓) 38.2 (23.3↓) 68.5 (31.4↓) 76.4 (43.7↓)

Results for existing probabilistic HMR methods are reported using 25 hypotheses.

Datasets. We trained the pose tokenizer using the AMASS (Mahmood et al. 2019) standard training split and MOYO (Tripathi et al. 2023). For GenHMR, following prior work (Goel et al. 2023) and to ensure fair comparisons, we used standard datasets (SD): Human3.6M (H36M) (Ionescu et al. 2013), COCO (Lin et al. 2014), MPI-INF-3DHP (Mehta et al. 2017), and MPII (Andriluka et al. 2014).

Evaluation Metrics. We evaluate GenHMR using the Mean Per Joint Position Error (MPJPE) and Mean Vertex Error (MVE) for 3D pose estimation accuracy. We also report the Procrustes-Aligned MPJPE (PA-MPJPE) to assess the alignment between predicted and ground-truth poses after a rigid transformation. To quantify computational efficiency, we use the Average Inference Time per Image (AITI, in seconds), similar to the average inference/optimization time per image reported by OpenPose (Cao et al. 2017); however, AITI measures per-iteration processing time, enabling a fine-grained analysis of our uncertainty-guided sampling and 2D pose-guided refinement. Lower values across all these metrics indicate better performance. GenHMR is tested on the Human3.6M testing split, following previous works (Kolotouros et al. 2019). To evaluate GenHMR's generalization to challenging in-the-wild data with varying camera motions and diverse 3D poses, we test on 3DPW (Von Marcard et al. 2018) and EMDB (Kaufmann et al. 2023) without training on them, ensuring a fair assessment on unseen data.
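
For reference, MPJPE and PA-MPJPE can be computed as sketched below; the Procrustes alignment uses the standard closed-form rotation/scale/translation solution, and the function signatures are our own.

```python
import torch

def mpjpe(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean Per Joint Position Error; pred, gt of shape (B, K, 3), same units as the input (e.g., mm)."""
    return (pred - gt).norm(dim=-1).mean()

def pa_mpjpe(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Procrustes-Aligned MPJPE: rigidly align pred to gt (rotation, scale, translation) first."""
    mu_p, mu_g = pred.mean(1, keepdim=True), gt.mean(1, keepdim=True)
    Xp, Xg = pred - mu_p, gt - mu_g
    H = Xp.transpose(1, 2) @ Xg                               # (B, 3, 3) cross-covariance
    U, S, Vh = torch.linalg.svd(H)
    sign = torch.sign(torch.linalg.det(U @ Vh))               # fix reflections -> proper rotation
    S_fixed = S.clone(); S_fixed[:, -1] *= sign
    U_fixed = U.clone(); U_fixed[:, :, -1] *= sign.unsqueeze(-1)
    R = U_fixed @ Vh                                          # optimal rotation
    scale = S_fixed.sum(-1) / (Xp ** 2).sum((1, 2))           # optimal isotropic scale
    aligned = scale[:, None, None] * (Xp @ R) + mu_g
    return mpjpe(aligned, gt)
```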

Comparison to State-of-the-art Approaches

We evaluate our approach against a range of state-of-the-art deterministic and probabilistic HMR methods on the Human3.6M, 3DPW, and EMDB datasets, as detailed in Table 1. GenHMR consistently outperforms existing methods across key evaluation metrics—PA-MPJPE, MPJPE, and MVE—demonstrating its superior ability to produce accurate 3D reconstructions. A significant contributor to this success is GenHMR’s capability to model and refine uncertainty throughout the reconstruction process, making it particularly effective in challenging scenarios involving complex poses and occlusions.

This capability enables GenHMR to not only achieve substantial performance gains on the controlled Human3.6M dataset but also to deliver impressive error reduction on the more challenging in-the-wild 3DPW and EMDB datasets. For example, GenHMR achieves a 25.2% reduction in MPJPE on the Human3.6M dataset, a 21.8% reduction on 3DPW, and a remarkable 29.9% reduction on EMDB, compared to existing state-of-the-art methods. These consistent improvements across multiple datasets, even without training on in-the-wild datasets like 3DPW and EMDB, highlight GenHMR’s effectiveness in addressing the complexities of real-world scenarios and delivering high-quality 3D reconstructions with unprecedented accuracy.

Refer to caption
Figure 4: Impact of 2D Pose-Guided Refinement on 3D pose reconstruction. Red circles highlight areas of errors after each refinement iteration, which showcases how the method progressively refines these poses. By fine-tuning pose tokens to align the 3D pose with 2D detections, our method iteratively reduces uncertainties and improves accuracy. Significant improvements are seen in the early iterations, with errors largely minimized at the 10th iteration. Note that the initial mesh comes from UGS.

Ablation Study

The key to GenHMR’s effectiveness lies in its mask modeling and iterative refinement techniques. In this ablation study, we investigate how iterative refinement and mask-scheduling strategies influence the model’s performance. In the Supplementary Material, we provide extensive additional experimental results and visualizations that offer in-depth analyses of key factors contributing to GenHMR’s performance. These include: (1) architectural components (pose tokenizer design, backbones, feature resolutions), (2) training strategies (masking scheduling, Gumbel-Softmax annealing, regularization via keypoint and SMPL losses in GenHMR), (3) inference techniques (speed-accuracy trade-offs in iterative refinement stages), (4) impact of training dataset size, and (5) model limitations.

Table 2: Impact of Iterations in Uncertainty-Guided Sampling
3DPW EMDB
# of iter. AITI (sec) MPJPE MVE MPJPE MVE
1 0.032 73.5 84.2 92.2 104.5
3 0.075 70.2 81.9 89.8 101.6
5 0.102 68.1 77.5 88.2 99.5
10 0.255 67.8 79.1 87.5 98.3
15 0.321 67.6 78.5 87.1 96.6
20 0.412 67.4 78.1 86.8 95.4
Table 3: Impact of iterations on 2D Pose Guided Refinement.
H36M 3DPW EMDB AITI
# of iter. MPJPE MPJPE MVE MPJPE MVE (s)
UGS* 37.1 68.1 77.5 88.2 99.5 0.102
1 36.2 66.5 77.0 86.5 96.5 0.154
3 35.1 63.6 73.5 83.7 92.1 0.209
5 34.3 60.2 70.4 79.1 87.5 0.253
10 33.2 57.5 68.5 73.5 82.0 0.444
20 33.5 54.7 67.5 68.5 76.4 0.638
  • * The first row, UGS*, reflects Uncertainty-Guided Sampling (UGS) with 5 iterations, used to initialize the pose query $Y$. AITI is measured on a single mid-grade GPU (NVIDIA RTX A5000).

Effectiveness of Uncertainty-Guided Sampling. The number of iterations during inference plays a critical role in balancing speed and accuracy in 3D pose estimation. As demonstrated in Table 2, increasing the iterations generally enhances the accuracy of pose reconstructions, as reflected by improvements in MVE and MPJPE. For instance, on the 3DPW dataset, MVE decreases from 84.2 to 78.1, and MPJPE drops from 73.5 to 67.4 when the number of iterations increases from 1 to 20. However, beyond a certain threshold, such as after 10 iterations, the benefits of additional iterations diminish, with only marginal reductions in error. Notably, even with a smaller number of iterations, like 5, the model achieves significant accuracy improvements.

Effectiveness of 2D Pose Guided Refinement. Fig. 4 visualizes how 2D pose-guided refinement iteratively improves 3D pose reconstruction accuracy, particularly in scenarios with complex poses and occlusions. As shown in Table 3, significant accuracy gains occur early: on 3DPW, MVE drops by 11.04% (77.0 mm to 68.5 mm) from 1 to 10 iterations, with only a minimal further improvement to 67.5 mm at 20 iterations (a 12.34% reduction from 1 to 20 iterations). On the EMDB dataset, MVE decreases by 15.02% (96.5 mm to 82.0 mm) from 1 to 10 iterations and further to 76.4 mm (a 20.8% reduction) at 20 iterations. These results underscore the method's effectiveness, though a trade-off with computational time is evident, as AITI increases from 0.154 seconds at 1 iteration to 0.638 seconds at 20 iterations. Notably, substantial gains are achievable with fewer iterations; at 5 iterations, MVE improves by 8.57% on 3DPW (77.0 mm to 70.4 mm) and 9.3% on EMDB (96.5 mm to 87.5 mm), while AITI remains manageable at 0.253 seconds. This highlights that 2D pose-guided refinement effectively balances accuracy and efficiency, making it practical for real-time human mesh recovery.

Table 4: Impact of masking ratio during training. These results were derived from the Uncertainty-Guided Sampling (UGS) stage with 5 iterations, where we evaluated the initial pose and shape estimate by iteratively refining the pose tokens.
3DPW EMDB
Masking Ratio $\gamma(\tau)$ MPJPE MVE MPJPE MVE
$\gamma(\tau\in\mathcal{U}(0,1))$ 68.1 77.5 88.2 99.5
$\gamma(\tau\in\mathcal{U}(0,0.3))$ 72.1 84.2 93.5 105.3
$\gamma(\tau\in\mathcal{U}(0,0.5))$ 69.8 81.7 90.2 101.5
$\gamma(\tau\in\mathcal{U}(0,0.7))$ 67.9 77.1 88.7 99.4

Masking Ratio during Training. The ablation study on masking ratios during training underscores their influence on the accuracy of our generative model for human mesh recovery (HMR). We utilize a cosine-based masking ratio function $\gamma(\tau)$, with $\tau$ sampled from a uniform distribution, to randomly mask segments of the pose token sequence during training. This induces varying levels of information loss, compelling the model to develop more robust reconstruction capabilities. The results reveal that broader masking ratios, such as $\gamma(\tau\in\mathcal{U}(0,0.7))$, lead to the most accurate reconstructions, demonstrated by the lowest MPJPE of 67.9 mm and a corresponding MVE of 77.1 mm on the 3DPW dataset. In contrast, narrower masking ratios, like $\gamma(\tau\in\mathcal{U}(0,0.3))$, result in higher error rates, with MPJPE rising to 72.1 mm and MVE to 84.2 mm, suggesting reduced generalization capabilities. These findings emphasize the pivotal role of selecting an appropriate masking ratio during training to achieve effective model generalization and precise 3D pose reconstruction.

Conclusion

In this paper, we introduced GenHMR, a novel generative approach to monocular human mesh recovery that effectively addresses the longstanding challenges of depth ambiguity and occlusion. By reformulating HMR as an image-conditioned generative task, GenHMR explicitly models and mitigates uncertainties in the complex 2D-to-3D mapping process. At its core, GenHMR consists of two key components: a pose tokenizer that encodes 3D human poses into discrete tokens within a latent space, and an image-conditional masked transformer that learns rich probabilistic distributions of these pose tokens. These learned distributions enable two powerful inference techniques: uncertainty-guided iterative sampling and 2D pose-guided refinement, which together produce robust and accurate 3D human mesh reconstructions. Extensive experiments on benchmark datasets demonstrate GenHMR’s superiority over SOTA HMR methods.

References

  • Alexey (2020) Alexey, D. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv: 2010.11929.
  • Andriluka et al. (2014) Andriluka, M.; Pishchulin, L.; Gehler, P.; and Schiele, B. 2014. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, 3686–3693.
  • Biggs et al. (2020) Biggs, B.; Novotny, D.; Ehrhardt, S.; Joo, H.; Graham, B.; and Vedaldi, A. 2020. 3d multi-bodies: Fitting sets of plausible 3d human models to ambiguous image data. Advances in neural information processing systems, 33: 20496–20507.
  • Bishop (1994) Bishop, C. M. 1994. Mixture density networks.
  • Bogo et al. (2016) Bogo, F.; Kanazawa, A.; Lassner, C.; Gehler, P.; Romero, J.; and Black, M. J. 2016. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, 561–578. Springer.
  • Cai et al. (2024) Cai, Q.; Hu, X.; Hou, S.; Yao, L.; and Huang, Y. 2024. Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 882–890.
  • Cao et al. (2017) Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7291–7299.
  • Chang et al. (2023) Chang, H.; Zhang, H.; Barber, J.; Maschinot, A.; Lezama, J.; Jiang, L.; Yang, M.-H.; Murphy, K.; Freeman, W. T.; Rubinstein, M.; et al. 2023. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704.
  • Chang et al. (2022) Chang, H.; Zhang, H.; Jiang, L.; Liu, C.; and Freeman, W. T. 2022. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11315–11325.
  • Cho et al. (2023) Cho, H.; Ahn, J.; Cho, Y.; and Kim, J. 2023. Video inference for human mesh recovery with vision transformer. In 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), 1–6. IEEE.
  • Cho and Kim (2023) Cho, H.; and Kim, J. 2023. Generative approach for probabilistic human mesh recovery using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4183–4188.
  • Cho (2022) Cho, Y. K. O. T.-H., Junhyeong. 2022. Cross-attention of disentangled modalities for 3d human mesh recovery with transformers. In European Conference on Computer Vision, 342–359. Springer.
  • Choi et al. (2022) Choi, H.; Moon, G.; Park, J.; and Lee, K. M. 2022. Learning to estimate robust 3d human mesh from in-the-wild crowded scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1475–1484.
  • Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv. arXiv preprint arXiv:1810.04805.
  • Ding et al. (2022) Ding, M.; Zheng, W.; Hong, W.; and Tang, J. 2022. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems, 35: 16890–16902.
  • Dwivedi et al. (2024) Dwivedi, S. K.; Sun, Y.; Patel, P.; Feng, Y.; and Black, M. J. 2024. TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1323–1333.
  • Esser, Rombach, and Ommer (2021) Esser, P.; Rombach, R.; and Ommer, B. 2021. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12873–12883.
  • Fiche et al. (2023) Fiche, G.; Leglaive, S.; Alameda-Pineda, X.; Agudo, A.; and Moreno-Noguer, F. 2023. VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space. arXiv preprint arXiv:2312.08291.
  • Ghazvininejad et al. (2019) Ghazvininejad, M.; Levy, O.; Liu, Y.; and Zettlemoyer, L. 2019. Mask-predict: Parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324.
  • Goel et al. (2023) Goel, S.; Pavlakos, G.; Rajasegaran, J.; Kanazawa, A.; and Malik, J. 2023. Humans in 4D: Reconstructing and tracking humans with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 14783–14794.
  • Gu et al. (2018) Gu, C.; Sun, C.; Ross, D. A.; Vondrick, C.; Pantofaru, C.; Li, Y.; Vijayanarasimhan, S.; Toderici, G.; Ricco, S.; Sukthankar, R.; et al. 2018. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6047–6056.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Holmquist and Wandt (2023) Holmquist, K.; and Wandt, B. 2023. Diffpose: Multi-hypothesis human pose estimation using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 15977–15987.
  • Ionescu et al. (2013) Ionescu, C.; Papava, D.; Olaru, V.; and Sminchisescu, C. 2013. Human3. 6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7): 1325–1339.
  • Jang, Gu, and Poole (2016) Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • Johnson and Everingham (2011) Johnson, S.; and Everingham, M. 2011. Learning effective human pose estimation from inaccurate annotation. In CVPR 2011, 1465–1472. IEEE.
  • Kanazawa et al. (2018) Kanazawa, A.; Black, M. J.; Jacobs, D. W.; and Malik, J. 2018. End-to-end recovery of human shape and pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7122–7131.
  • Kanazawa et al. (2019) Kanazawa, A.; Zhang, J. Y.; Felsen, P.; and Malik, J. 2019. Learning 3d human dynamics from video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5614–5623.
  • Kaufmann et al. (2023) Kaufmann, M.; Song, J.; Guo, C.; Shen, K.; Jiang, T.; Tang, C.; Zárate, J. J.; and Hilliges, O. 2023. Emdb: The electromagnetic database of global 3d human pose and shape in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 14632–14643.
  • Kocabas et al. (2021) Kocabas, M.; Huang, C.-H. P.; Hilliges, O.; and Black, M. J. 2021. PARE: Part attention regressor for 3D human body estimation. In Proceedings of the IEEE/CVF international conference on computer vision, 11127–11137.
  • Kolotouros et al. (2019) Kolotouros, N.; Pavlakos, G.; Black, M. J.; and Daniilidis, K. 2019. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings of the IEEE/CVF international conference on computer vision, 2252–2261.
  • Kolotouros et al. (2021) Kolotouros, N.; Pavlakos, G.; Jayaraman, D.; and Daniilidis, K. 2021. Probabilistic modeling for human mesh recovery. In Proceedings of the IEEE/CVF international conference on computer vision, 11605–11614.
  • Lassner et al. (2017) Lassner, C.; Romero, J.; Kiefel, M.; Bogo, F.; Black, M. J.; and Gehler, P. V. 2017. Unite the people: Closing the loop between 3d and 2d human representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6050–6059.
  • Li and Lee (2019) Li, C.; and Lee, G. H. 2019. Generating multiple hypotheses for 3d human pose estimation with mixture density network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9887–9895.
  • Li and Lee (2020) Li, C.; and Lee, G. H. 2020. Weakly supervised generative network for multiple 3d human pose hypotheses. arXiv preprint arXiv:2008.05770.
  • Li et al. (2022) Li, Z.; Liu, J.; Zhang, Z.; Xu, S.; and Yan, Y. 2022. Cliff: Carrying location information in full frames into human pose and shape estimation. In European Conference on Computer Vision, 590–606. Springer.
  • Lin et al. (2014) Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 740–755. Springer.
  • Loper et al. (2015) Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; and Black, M. J. 2015. SMPL: A Skinned Multi-Person Linear Model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 34(6): 248:1–248:16.
  • Ma et al. (2023) Ma, X.; Su, J.; Wang, C.; Zhu, W.; and Wang, Y. 2023. 3d human mesh estimation from virtual markers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 534–543.
  • Mahmood et al. (2019) Mahmood, N.; Ghorbani, N.; Troje, N. F.; Pons-Moll, G.; and Black, M. J. 2019. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, 5442–5451.
  • Mehta et al. (2017) Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; and Theobalt, C. 2017. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 international conference on 3D vision (3DV), 506–516. IEEE.
  • Omran et al. (2018) Omran, M.; Lassner, C.; Pons-Moll, G.; Gehler, P.; and Schiele, B. 2018. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 2018 international conference on 3D vision (3DV), 484–494. IEEE.
  • Pavlakos et al. (2019) Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A. A.; Tzionas, D.; and Black, M. J. 2019. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10975–10985.
  • Pavlakos et al. (2018) Pavlakos, G.; Zhu, L.; Zhou, X.; and Daniilidis, K. 2018. Learning to estimate 3D human pose and shape from a single color image. In Proceedings of the IEEE conference on computer vision and pattern recognition, 459–468.
  • Qian et al. (2020) Qian, L.; Zhou, H.; Bao, Y.; Wang, M.; Qiu, L.; Zhang, W.; Yu, Y.; and Li, L. 2020. Glancing transformer for non-autoregressive neural machine translation. arXiv preprint arXiv:2008.07905.
  • Shan et al. (2023) Shan, W.; Liu, Z.; Zhang, X.; Wang, Z.; Han, K.; Wang, S.; Ma, S.; and Gao, W. 2023. Diffusion-based 3d human pose estimation with multi-hypothesis aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 14761–14771.
  • Sharma et al. (2019) Sharma, S.; Varigonda, P. T.; Bindal, P.; Sharma, A.; and Jain, A. 2019. Monocular 3d human pose estimation by generation and ordinal ranking. In Proceedings of the IEEE/CVF international conference on computer vision, 2325–2334.
  • Sohn, Lee, and Yan (2015) Sohn, K.; Lee, H.; and Yan, X. 2015. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28.
  • Tian et al. (2023) Tian, Y.; Zhang, H.; Liu, Y.; and Wang, L. 2023. Recovering 3d human mesh from monocular images: A survey. IEEE transactions on pattern analysis and machine intelligence.
  • Tripathi et al. (2023) Tripathi, S.; Müller, L.; Huang, C.-H. P.; Taheri, O.; Black, M. J.; and Tzionas, D. 2023. 3D human pose estimation via intuitive physics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 4713–4725.
  • Van Den Oord, Vinyals et al. (2017) Van Den Oord, A.; Vinyals, O.; et al. 2017. Neural discrete representation learning. Advances in neural information processing systems, 30.
  • Von Marcard et al. (2018) Von Marcard, T.; Henschel, R.; Black, M. J.; Rosenhahn, B.; and Pons-Moll, G. 2018. Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European conference on computer vision (ECCV), 601–617.
  • Wehrbein et al. (2021) Wehrbein, T.; Rudolph, M.; Rosenhahn, B.; and Wandt, B. 2021. Probabilistic monocular 3d human pose estimation with normalizing flows. In Proceedings of the IEEE/CVF international conference on computer vision, 11199–11208.
  • Williams et al. (2020) Williams, W.; Ringer, S.; Ash, T.; MacLeod, D.; Dougherty, J.; and Hughes, J. 2020. Hierarchical quantized autoencoders. Advances in Neural Information Processing Systems, 33: 4524–4535.
  • Wu et al. (2017) Wu, J.; Zheng, H.; Zhao, B.; Li, Y.; Yan, B.; Liang, R.; Wang, W.; Zhou, S.; Lin, G.; Fu, Y.; et al. 2017. Ai challenger: A large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475.
  • Xu et al. (2020) Xu, H.; Bazavan, E. G.; Zanfir, A.; Freeman, W. T.; Sukthankar, R.; and Sminchisescu, C. 2020. Ghum & ghuml: Generative 3d human shape and articulated pose models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6184–6193.
  • Zhang et al. (2021) Zhang, Z.; Ma, J.; Zhou, C.; Men, R.; Li, Z.; Ding, M.; Tang, J.; Zhou, J.; and Yang, H. 2021. M6-UFC: Unifying multi-modal controls for conditional image synthesis via non-autoregressive generative transformers. arXiv preprint arXiv:2105.14211.
  • Zhu et al. (2020) Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; and Dai, J. 2020. Deformable DETR: Deformable Transformers for End-to-End Object Detection. ArXiv, abs/2010.04159.

Appendix

Overview

The appendix is organized into the following sections:

  • Implementation Details

  • Data Augmentation

  • Camera Model

  • Ablation for Pose Tokenizer

  • Training on Large-Scale Datasets

  • Effectiveness of Training-time Differentiable Sampling

    • Varying Temperatures in Cosine Annealing

    • Impact of Annealing Durations

    • Impact of Various Losses

  • Uncertainty Reduction Strategies at Inference

    • Confidence-based Masking

    • Mask Scheduling Functions

    • Token Sampling Strategies

  • Impact of Pose Tokenizer on GenHMR

  • Ablation of Feature Resolutions

  • Impact of Deformable Cross-Attention Layers

  • Qualitative Results

Implementation Details

Our model, GenHMR, implemented using PyTorch, consists of two primary training stages: the pose tokenizer and the image-conditioned masked transformer.

In the first stage, we train the pose tokenizer to learn discrete pose representations using mocap data from the AMASS (Mahmood et al. 2019) and MOYO (Tripathi et al. 2023) datasets, which we converted from SMPL-H to SMPL format using the SMPL-X instructions (Pavlakos et al. 2019). The resulting pose parameters $\theta\in\mathbb{R}^{24\times 3}$ include both the global orientation $\theta_{1}\in\mathbb{R}^{3}$ and the local rotations $[\theta_{2},\ldots,\theta_{24}]\in\mathbb{R}^{23\times 3}$ of the body joints. The pose tokenizer architecture comprises ResBlocks (He et al. 2016) and 1D convolutions for both the encoder and decoder, with a single quantization layer. We train the pose tokenizer for 200K iterations using the Adam optimizer, with a batch size of 512 and a learning rate of $1\times 10^{-4}$. The loss weights for this stage are set to $\lambda_{\text{re}}=1.0$, $\lambda_{\text{E}}=0.02$, $\lambda_{\theta}=1.0$, $\lambda_{\text{V}}=0.5$, and $\lambda_{\text{J}}=0.3$. Based on the validation-set reconstruction errors on the AMASS dataset, we select the final pose tokenizer containing 96 tokens and a codebook of size $2048\times 256$ for GenHMR training.

The second stage focuses on training the image-conditioned masked transformer, with the pose tokenizer from the first stage frozen to leverage the learned pose prior. To balance computational efficiency and mesh reconstruction accuracy, we utilize feature maps at 1×, 4×, and 8× resolutions. We adopt the Gumbel-Softmax operation for consistent end-to-end training, approximating the categorical distribution via temperature annealing. A cosine annealing schedule is used for the temperature parameter $\tau$, starting at $\tau_{\text{start}}=1.0$ and decreasing to $\tau_{\text{end}}=0.01$ over 50% of the training duration. The final overall loss is $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{mask}}+\mathcal{L}_{\text{SMPL}}+\mathcal{L}_{\text{3D}}+\mathcal{L}_{\text{2D}}$, combining the pose token prediction loss ($\mathcal{L}_{\text{mask}}$), 3D loss ($\mathcal{L}_{\text{3D}}$), 2D loss ($\mathcal{L}_{\text{2D}}$), and SMPL parameter loss ($\mathcal{L}_{\text{SMPL}}$) that minimizes errors in the shape ($\beta$) and pose ($\theta$) parameters in the SMPL space. The loss weights for this stage are: $\lambda_{\text{mask}}=1.0$, $\lambda_{\text{SMPL}}=1.5\times 10^{-3}$ (with $\lambda_{\theta}=1\times 10^{-3}$ for pose and $\lambda_{\beta}=5\times 10^{-4}$ for shape within $\mathcal{L}_{\text{SMPL}}$), $\lambda_{\text{3D}}=5\times 10^{-2}$, and $\lambda_{\text{2D}}=1\times 10^{-2}$. During inference with Uncertainty-Guided Sampling (UGS), we employ greedy sampling with top-k $=1$, selecting the most confident token predictions over a default of 5 iterations. We train the masked transformer on two NVIDIA RTX A6000 GPUs, using the Adam optimizer with a batch size of 48 and a learning rate of $1\times 10^{-5}$.

Data Augmentation

In the first stage of training, the Pose Tokenizer benefits from incorporating prior information about valid human poses, which is crucial for the overall performance of GenHMR. To achieve this, we rotate the poses at varying degrees, enabling the model to learn a robust representation of pose parameters across different orientations. During the training of GenHMR, we further enhanced the model’s robustness by applying random augmentations to both images and poses. These augmentations, including scaling, rotation, random horizontal flips, and color jittering, are designed to make the model more resilient to challenges such as occlusions and incomplete body information. Data augmentation is thus essential for improving the generalization and accuracy of the human mesh reconstruction process in GenHMR.
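
As an illustration, an image-level augmentation pipeline along these lines could be written with torchvision as follows; the parameter ranges are assumptions, and in practice the geometric transforms must also be applied consistently to the 2D keypoints and SMPL pose.

```python
from torchvision import transforms

# Illustrative augmentation pipeline; the exact ranges used by GenHMR are not specified here.
image_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                  # requires flipping keypoints/pose too
    transforms.RandomAffine(degrees=30, scale=(0.75, 1.25)),  # random rotation and scaling
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
])
```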

Camera Model

Our method GenHMR employs a perspective camera model with a fixed focal length and an intrinsic matrix $K\in\mathbb{R}^{3\times 3}$. By simplifying the rotation matrix $R$ to the identity matrix $I_{3}$ and concentrating on the translation vector $T\in\mathbb{R}^{3}$, we project the 3D joints $J_{\text{3D}}$ to 2D coordinates $J_{\text{2D}}$ using $J_{\text{2D}}=\Pi(K(J_{\text{3D}}+T))$, where $\Pi$ represents the perspective projection with camera intrinsics $K$. This simplification reduces parameter complexity and enhances the computational efficiency of our human mesh recovery pipeline.
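
A minimal sketch of this projection, with batch-first tensor shapes assumed:

```python
import torch

def project_joints(j3d: torch.Tensor, T: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Perspective projection J_2D = Pi(K(J_3D + T)) with the rotation fixed to identity.

    j3d: (B, k, 3) 3D joints, T: (B, 3) camera translation, K: (3, 3) intrinsics.
    """
    cam = (j3d + T[:, None, :]) @ K.T          # translate into the camera frame, apply intrinsics
    return cam[..., :2] / cam[..., 2:3]        # perspective division by depth
```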

Ablation for Pose Tokenizer

In Tables 5 and 6, we present an extensive ablation study examining the design choices of the pose tokenizer on the AMASS test set and MOYO validation set, respectively. To assess out-of-distribution performance, most experiments are conducted using models trained exclusively on AMASS, while the final pose tokenizer employed in GenHMR was trained on both datasets. The findings underscore that the codebook size exerts a more substantial influence on performance than the number of pose tokens. Specifically, increasing the codebook size from 1024 to 4096 resulted in a reduction of Mean Vertex Error (MVE) by 3.3 mm (from 9.2 to 5.9) on AMASS and by 6.1 mm (from 14.5 to 8.4) on MOYO. Although increasing the number of tokens from 48 to 384 also enhanced performance (Table 6), we selected a codebook size of 2048 × 256 with 96 tokens for our final model, optimizing the trade-off between performance and computational efficiency.

Table 5: Impact of Codebook Size (Pose Tokens = 96) on Pose Tokenizer.
AMASS MOYO
# of codes × code dimension MPJPE MVE MPJPE MVE
1024 × 256 10.5 9.2 16.2 14.5
2048 × 128 8.9 8.1 13.9 12.7
2048 × 256 8.5 7.8 13.4 12.2
4096 × 256 6.1 5.9 9.0 8.4
Table 6: Impact of number of Pose Tokens (Codebook = 2048 × 256) on Pose Tokenizer
AMASS MOYO
Tokens MPJPE MVE MPJPE MVE
48 11.1 10.5 16.7 15.3
96 8.5 7.8 13.4 12.2
192 7.8 6.9 11.5 10.8
384 7.1 6.0 10.7 10.2

Training on Large-Scale Datasets

To enhance GenHMR’s real-world performance and generalization, we expanded our training data to include diverse datasets. We incorporated in-the-wild 2D datasets (ITW) such as InstaVariety (Kanazawa et al. 2019), AVA (Gu et al. 2018), and AI Challenger (Wu et al. 2017) with their pseudo ground truth (p-GT), and integrated the synthetic dataset BEDLAM (BL) as suggested by TokenHMR (Dwivedi et al. 2024). This approach exposes our model to varied poses, camera motions, and environments. For a fair comparison, we only compare GenHMR with state-of-the-art models using the same backbone and datasets. To evaluate generalization, we test on unseen 3DPW (Von Marcard et al. 2018) and EMDB (Kaufmann et al. 2023) datasets, assessing performance on challenging in-the-wild scenarios.

As shown in Table 7, GenHMR consistently achieves the lowest error rates across all datasets and metrics, highlighting its superiority in 3D human mesh reconstruction. This is particularly evident when GenHMR is trained on standard datasets (SD) combined with ITW, where it outperforms leading models like HMR2.0 (Goel et al. 2023) and TokenHMR (Dwivedi et al. 2024), with significant reductions in MPJPE and MVE. The addition of BEDLAM further enhances its performance, as demonstrated on the challenging EMDB dataset, where GenHMR achieves an MPJPE of 67.5 mm and an MVE of 74.8 mm, reductions of 24.2 mm and 34.6 mm, respectively, compared to TokenHMR. These improvements stem from GenHMR's novel inference strategy, which combines Uncertainty-Guided Sampling and 2D Pose-Guided Refinement. This approach incrementally refines pose predictions by focusing on high-confidence tokens and aligning 3D estimates with 2D detections, ensuring accurate and consistent reconstructions. Consequently, GenHMR demonstrates enhanced generalization and robustness across diverse datasets, establishing itself as a leading solution for 3D human mesh reconstruction.

Table 7: Training on Large-Scale Datasets. Reconstruction errors (in mm) on the Human3.6M, 3DPW, and EMDB datasets. Lower values (↓) indicate better performance. Values in parentheses show the improvement of our method over the second-best method.
Training Data | Method | Human3.6M PA-MPJPE (↓) | Human3.6M MPJPE (↓) | 3DPW PA-MPJPE (↓) | 3DPW MPJPE (↓) | 3DPW MVE (↓) | EMDB PA-MPJPE (↓) | EMDB MPJPE (↓) | EMDB MVE (↓)
SD + ITW | HMR2.0 | 32.4 | 50.0 | 54.3 | 81.3 | 94.4 | 79.3 | 118.5 | 140.6
SD + ITW | TokenHMR | 33.8 | 50.2 | 49.3 | 76.2 | 88.1 | 67.5 | 102.4 | 124.4
SD + ITW | GenHMR | 24.3 (8.1↓) | 37.8 (12.2↓) | 37.5 (11.8↓) | 58.6 (17.6↓) | 72.5 (15.6↓) | 44.5 (23.0↓) | 74.6 (27.8↓) | 82.8 (41.6↓)
SD + ITW + BL | HMR2.0 | 28.5 | 47.7 | 47.4 | 77.4 | 88.4 | 62.8 | 99.3 | 120.7
SD + ITW + BL | TokenHMR | 29.4 | 46.9 | 44.3 | 71.0 | 84.6 | 55.6 | 91.7 | 109.4
SD + ITW + BL | GenHMR | 22.1 (6.4↓) | 32.1 (14.8↓) | 31.0 (13.3↓) | 52.1 (18.9↓) | 65.6 (19.0↓) | 37.5 (18.1↓) | 67.5 (24.2↓) | 74.8 (34.6↓)

Appendix A: Effectiveness of Training-time Differentiable Sampling

Varying Temperature Schedules in Cosine Annealing. In our study, we explored the impact of various temperature schedules on model performance using the Gumbel-Softmax estimator with cosine annealing. The temperature parameter $\tau$ plays a crucial role in balancing exploration and exploitation during the learning process. We implemented a cosine annealing schedule to gradually reduce $\tau$ from an initial value $\tau_{\text{start}}$ to a final value $\tau_{\text{end}}$ over $T$ training iterations, following the equation:

$$\tau_{t}=\tau_{\text{end}}+\frac{1}{2}\left(\tau_{\text{start}}-\tau_{\text{end}}\right)\left(1+\cos\left(\frac{t\pi}{T}\right)\right) \qquad (5)$$

This schedule allows for a smooth transition from broad exploration (higher $\tau$) to focused exploitation (lower $\tau$). The Gumbel-Softmax distribution, governed by $\tau$, determines the probability of selecting codebook entries:

$$p_{m,k}=\frac{\exp\left(\frac{l_{m,k}+n_{k}}{\tau}\right)}{\sum_{j=1}^{K}\exp\left(\frac{l_{m,j}+n_{j}}{\tau}\right)} \qquad (6)$$

where $l_{m,k}$ are logits and $n_{k}$ are Gumbel noise samples. As $\tau$ decreases, this distribution sharpens, leading to more deterministic predictions.
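A minimal NumPy sketch of Eqs. (5) and (6) (illustrative only; function names and default values are ours):

```python
import numpy as np

def cosine_tau(t, T, tau_start=1.0, tau_end=0.01):
    """Cosine-annealed temperature at iteration t of T (Eq. 5)."""
    return tau_end + 0.5 * (tau_start - tau_end) * (1 + np.cos(np.pi * t / T))

def gumbel_softmax_probs(logits, tau, rng):
    """Gumbel-Softmax selection probabilities over K codebook entries (Eq. 6).
    logits: (M, K) per-token logits; returns (M, K) probabilities."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y -= y.max(axis=-1, keepdims=True)                      # numerical stability
    p = np.exp(y)
    return p / p.sum(axis=-1, keepdims=True)
```

During training, these probabilities would weight the codebook entries, keeping the token selection differentiable with respect to the logits.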

Our experiments, detailed in Table 8, revealed that starting with a moderate temperature ($\tau_{\text{start}}=1.0$ or $0.8$) and annealing to a very low temperature ($\tau_{\text{end}}=0.01$) yielded optimal performance, minimizing both MPJPE and MVE. This approach effectively balances initial exploration with subsequent exploitation. We found that higher starting temperatures (e.g., $\tau_{\text{start}}=2.0$) led to slower convergence and increased errors due to excessive exploration, while lower starting temperatures (e.g., $\tau_{\text{start}}=0.5$) accelerated convergence but reduced pose diversity, ultimately resulting in higher errors. These findings underscore the importance of a carefully tuned temperature schedule for achieving both accuracy and generalization in human mesh reconstruction.

Table 8: Impact of different temperature ($\tau$) schedules on the performance of human mesh recovery during training with the Gumbel-Softmax estimator using a cosine annealing schedule. Here, results are from 5 iterations of UGS to evaluate initial 3D pose estimates at inference.
$\tau_{\text{start}}$ | $\tau_{\text{end}}$ | 3DPW MPJPE | 3DPW MVE | EMDB MPJPE | EMDB MVE
2.0 | 0.01 | 90.1 | 99.5 | 110.2 | 120.5
2.0 | 0.1 | 85.4 | 95.9 | 105.7 | 115.8
1.5 | 0.01 | 78.3 | 88.2 | 96.5 | 107.3
1.5 | 0.05 | 82.1 | 91.6 | 100.3 | 111.1
1.2 | 0.01 | 74.5 | 83.2 | 92.0 | 103.5
1.0 | 0.01 | 68.1 | 77.5 | 88.2 | 99.5
0.8 | 0.01 | 68.0 | 78.2 | 88.8 | 99.1
0.5 | 0.1 | 77.2 | 86.0 | 98.3 | 109.7

Impact of Annealing Durations. We conducted a comprehensive analysis of the effect of annealing duration on model performance, varying the annealing period as a percentage of total training iterations and employing a cosine schedule to gradually reduce the temperature $\tau$. Our findings, presented in Table 9, demonstrate how the annealing duration shapes the transition from smooth, differentiable sampling to more discrete, categorical sampling, which in turn affects the model's ability to learn effective representations for human mesh recovery. Annealing durations spanning 50% to 100% of the training period generally yield the best results, optimizing both accuracy and generalization: this range allows for an initial phase of broad exploration, gradually transitioning to focused exploitation of learned features. Shorter durations, such as 50%, yield competitive results but slightly compromise generalization due to limited exploration time. Conversely, extended durations (e.g., 150% or 200%) maintain an exploratory state for too long, slightly diminishing overall performance by delaying the transition to exploitation.

Table 9: Impact of different annealing durations on the performance of GenHMR using the Gumbel-Softmax estimator. Here, results reflect 5 iterations of UGS to assess initial 3D pose estimates at inference.
Annealing duration | 3DPW MPJPE | 3DPW MVE | EMDB MPJPE | EMDB MVE
50% | 68.1 | 77.5 | 88.2 | 99.5
75% | 80.5 | 89.4 | 99.7 | 110.5
100% | 68.9 | 78.2 | 88.9 | 98.8
125% | 70.0 | 79.0 | 89.5 | 99.2
150% | 72.5 | 81.0 | 90.8 | 100.7
200% | 75.2 | 84.1 | 94.3 | 105.0

Impact of Losses. The results in Table 10 highlight the critical role of Training-time Differentiable Sampling in enhancing GenHMR's performance, particularly when combined with key regularization losses. Incorporating all losses ($L_{mask}$, $L_{\theta}$, $L_{3D}$, $L_{2D}$, and $\beta$) yields the best performance across all evaluation metrics on both the 3DPW and EMDB datasets. The 3D loss ($L_{3D}$) maintains structural integrity by aligning predicted 3D joints with the ground truth, while the 2D loss ($L_{2D}$) addresses monocular reconstruction ambiguities by ensuring alignment between projected 3D joints and observed 2D keypoints. Notably, excluding either $L_{3D}$ or $L_{2D}$ leads to significant increases in error, underscoring their vital role in producing accurate and plausible 3D reconstructions. These findings demonstrate that differentiable sampling not only enables seamless integration of these losses but also ensures their direct contribution to the accuracy and reliability of the model's predictions. The combination of differentiable sampling and carefully chosen losses is thus crucial for high-quality human mesh recovery, allowing the model to learn from multiple constraints and produce more accurate and robust 3D reconstructions across datasets and metrics.
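As a rough illustration, the two alignment losses could take the following form (a minimal sketch assuming an L1 penalty and per-joint visibility weights; the exact loss definitions and weights are not specified here):

```python
import numpy as np

def l3d_loss(J_pred, J_gt):
    """L1 distance between predicted and ground-truth 3D joints, both (N, 3)."""
    return np.abs(J_pred - J_gt).mean()

def l2d_loss(J_pred, K, T, kp2d, vis):
    """Reprojection loss: project predicted 3D joints with the fixed camera
    (see the camera model above) and compare to detected 2D keypoints."""
    cam = J_pred + T
    uvw = cam @ K.T
    proj = uvw[:, :2] / uvw[:, 2:3]                 # perspective projection
    diff = np.abs(proj - kp2d) * vis[:, None]       # weight by per-joint visibility
    return diff.sum() / max(vis.sum(), 1.0)
```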

Table 10: Impact of different losses on MPJPE, PA-MPJPE, and MVE errors on the 3DPW and EMDB datasets. Deltas (Δ) show error increases relative to the configuration with all losses included. These results are from the UGS stage with 5 iterations, where we evaluated the initial 3D pose estimates.
Loss configuration | 3DPW PA-MPJPE (↓) | 3DPW MPJPE (↓) | 3DPW MVE (↓) | EMDB PA-MPJPE (↓) | EMDB MPJPE (↓) | EMDB MVE (↓)
All losses ($L_{mask}$, $L_{\theta}$, $L_{3D}$, $L_{2D}$, $\beta$) | 42.1 | 68.1 | 77.5 | 51.7 | 88.2 | 99.5
| 45.5 (Δ3.4) | 80.5 (Δ12.4) | 92.5 (Δ15.0) | 60.8 (Δ9.1) | 98.8 (Δ10.6) | 113.9 (Δ14.4)
| 48.5 (Δ6.4) | 82.0 (Δ13.9) | 95.6 (Δ18.1) | 61.7 (Δ10.0) | 100.7 (Δ12.5) | 118.0 (Δ18.5)
| 57.9 (Δ15.8) | 87.5 (Δ19.4) | 146.4 (Δ68.9) | 67.9 (Δ16.2) | 110.9 (Δ22.7) | 160.8 (Δ61.3)
| 103.6 (Δ61.5) | 1160.6 (Δ1092.5) | 1167.7 (Δ1090.2) | 110.7 (Δ59.0) | 1180.8 (Δ1092.6) | 1190.7 (Δ1091.2)
| 45.1 (Δ3.0) | 80.0 (Δ11.9) | 91.5 (Δ14.0) | 58.9 (Δ7.2) | 97.6 (Δ9.4) | 110.6 (Δ11.1)
| 43.7 (Δ1.6) | 78.5 (Δ10.4) | 90.0 (Δ12.5) | 56.9 (Δ5.2) | 93.7 (Δ5.5) | 106.9 (Δ7.4)

Uncertainty Reduction Strategies at Inference

Figure 5: Visualization of mask tokens in each iteration. Black squares indicate [MASK] tokens, and orange squares indicate unmasked tokens.
Figure 6: Heatmap visualization of the Uncertainty-Guided Sampling process. The heatmap illustrates the iterative decoding of a masked sequence over $T$ iterations. The color gradient reflects prediction confidence, with yellow representing high confidence and purple representing low confidence.

Confidence-based Masking. During inference, we employ an Uncertainty-Guided Sampling process. This process is visualized in Figures 5 and 6, both using the x-axis for 96 pose token indices (0-96) and the y-axis for iterations (0-9). The process begins with high uncertainty and progressively refines predictions. Figure 5 illustrates the masking pattern: black squares represent masked tokens, and orange squares unmasked tokens. Initially, most tokens are masked, reflecting the high uncertainty in early 3D pose estimates. As the model refines its predictions through iterations, the number of masked tokens decreases. Complementing this, Figure 6 shows prediction confidence levels: purple indicates low confidence, yellow high confidence. Early iterations display predominantly low confidence (purple), gradually shifting to higher confidence (yellow) in later iterations, mirroring the reduction in 2D-to-3D ambiguity. Notably, confidence typically increases for earlier tokens first, then spreads to later tokens, suggesting that initial context improves subsequent 3D estimates. This iterative approach enables GenHMR to transition from low-confidence to high-confidence predictions, leading to increasingly accurate and coherent human mesh recoveries while effectively reducing uncertainties in 2D-to-3D pose estimation.
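To make the procedure concrete, one possible implementation is sketched below (assumptions: `predict_logits` is a hypothetical stand-in for the image-conditioned masked transformer, Top-1 selection is used, and `gamma` is any decaying mask schedule with `gamma(1) = 0`; this is an illustration, not the released implementation):

```python
import numpy as np

def uncertainty_guided_sampling(predict_logits, gamma, L=96, T=10, mask_id=-1):
    """Iteratively decode pose tokens, keeping only high-confidence predictions.
    predict_logits(tokens) -> (L, K) logits for all positions (hypothetical model call).
    gamma(r) -> fraction of tokens kept masked at normalized time r = t / T."""
    tokens = np.full(L, mask_id)                      # start from a fully masked sequence
    for t in range(1, T + 1):
        logits = predict_logits(tokens)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)                  # Top-1 token per position
        conf = probs.max(axis=-1)
        n_mask = int(np.ceil(gamma(t / T) * L))       # tokens kept masked this round
        keep = np.argsort(conf)[::-1][: L - n_mask]   # most confident positions survive
        tokens = np.full(L, mask_id)
        tokens[keep] = pred[keep]
    return tokens                                     # fully decoded once gamma(1) == 0
```

Any of the scheduling functions listed in the next subsection can serve as `gamma` (for example, the cosine schedule).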

Mask Scheduling Function. The mask scheduling function determines how many tokens are re-masked at each iteration of the sampling process. We investigate four distinct masking schedule functions, illustrated in Figure 8, to evaluate their influence on sequence generation. Let $L$ denote the sequence length and $T$ the total number of iterations. The number of tokens to be re-masked at iteration $t$ is given by $\lceil\gamma(t/T)\cdot L\rceil$, where $\gamma$ is a decaying function of the normalized iteration time $t/T$. The functions are as follows (a small code sketch follows the list):

  • Cosine Function yields a smooth S-curve, balancing gradual initial exploration, rapid mid-process transition, and refined final exploitation throughout decoding.

    $\gamma_{\text{cos}}\left(\frac{t}{T}\right)=\frac{1+\cos\left(\frac{\pi t}{T}\right)}{2}$

  • Linear Function provides a constant decrease rate, maintaining a uniform balance between exploration and exploitation throughout the process.

    $\gamma_{\text{linear}}\left(\frac{t}{T}\right)=1-\frac{t}{T}$

  • Cubic Function follows a concave curve, favoring extended initial exploration before rapidly transitioning to exploitation in later iterations.

    $\gamma_{\text{cubic}}\left(\frac{t}{T}\right)=1-\left(\frac{t}{T}\right)^{3}$

  • Square Root Function produces a convex curve, enabling rapid initial exploration followed by gradual convergence for extended fine-tuning in later iterations.

    $\gamma_{\text{sqrt}}\left(\frac{t}{T}\right)=\sqrt{1-\left(\frac{t}{T}\right)^{2}}$
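A minimal Python sketch of the four schedules and the resulting re-mask count (names are illustrative):

```python
import math

def gamma(name, r):
    """Fraction of tokens to keep masked at normalized time r = t / T."""
    if name == "cosine":
        return (1 + math.cos(math.pi * r)) / 2
    if name == "linear":
        return 1 - r
    if name == "cubic":
        return 1 - r ** 3
    if name == "sqrt":
        return math.sqrt(1 - r ** 2)
    raise ValueError(f"unknown schedule: {name}")

def tokens_to_remask(name, t, T, L=96):
    """Number of tokens re-masked at iteration t out of T, for a sequence of length L."""
    return math.ceil(gamma(name, t / T) * L)
```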

Our analysis of masking functions in uncertainty-guided sampling (Table 11) reveals significant performance variations across iterations and datasets, closely tied to each function's exploration-exploitation balance. The Cosine function consistently outperforms the other schedules on both the 3DPW and EMDB datasets, achieving near-optimal performance with just 5 iterations. This suggests an efficient balance between initial exploration and rapid convergence to accurate solutions. While all functions improve with additional iterations, performance generally plateaus or slightly declines beyond 10-20 iterations. The Cubic and Linear functions, with their distinct decay patterns, reach peak performance at 20 iterations. In contrast, the Square Root function shows the least improvement, peaking early at 10 iterations, indicating a potentially premature transition from exploration to exploitation. These findings highlight how each function's masking strategy influences the model's ability to navigate the solution space, and underscore the importance of selecting a masking function that balances thorough exploration of the pose space with efficient convergence to accurate solutions.

Table 11: Impact of Iterations and Masking Functions in Uncertainty-Guided Sampling.
Function | # of iter. | 3DPW MPJPE | 3DPW MVE | EMDB MPJPE | EMDB MVE
Cosine | 1 | 73.5 | 84.2 | 92.2 | 104.5
Cosine | 3 | 70.2 | 81.9 | 89.8 | 101.6
Cosine | 5 | 68.1 | 77.5 | 88.2 | 99.5
Cosine | 10 | 67.8 | 79.1 | 87.5 | 98.3
Cosine | 15 | 67.6 | 78.5 | 87.1 | 96.6
Cosine | 20 | 67.4 | 78.1 | 86.8 | 95.4
Cubic | 5 | 69.7 | 78.9 | 89.8 | 101.2
Cubic | 10 | 69.1 | 79.1 | 89.0 | 100.1
Cubic | 15 | 68.8 | 79.0 | 88.5 | 99.3
Cubic | 20 | 68.6 | 79.2 | 88.2 | 98.7
Linear | 5 | 71.6 | 81.2 | 91.7 | 103.1
Linear | 10 | 71.3 | 80.7 | 91.2 | 102.4
Linear | 15 | 71.0 | 80.9 | 90.8 | 101.3
Linear | 20 | 70.8 | 81.1 | 90.3 | 100.6
Square Root | 5 | 73.4 | 82.8 | 93.5 | 104.9
Square Root | 10 | 73.1 | 83.0 | 93.1 | 104.2
Square Root | 15 | 73.3 | 83.3 | 93.3 | 104.6
Square Root | 20 | 73.6 | 83.5 | 93.6 | 105.0
Figure 7: Choices of mask scheduling functions $\gamma\left(\frac{t}{T}\right)$ and number of iterations $T$.
Figure 8:

Token Sampling Strategies. Table 12 shows that the Top-k sampling strategy has a significant impact on GenHMR's performance. Using Top-1 sampling yields the best results, with the lowest MPJPE and MVE on both the 3DPW and EMDB datasets. As k increases, performance degrades noticeably, with higher k values leading to larger MPJPE and MVE, indicating reduced accuracy in 3D pose estimation and mesh reconstruction. This suggests that restricting the model to its most confident token prediction (Top-1) is crucial for maintaining high precision, while broader sampling (higher k) introduces noise and uncertainty that degrade the final pose estimate.
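For reference, Top-k sampling over a single token's codebook distribution can be sketched as follows (Top-1 reduces to an argmax; names are illustrative):

```python
import numpy as np

def topk_sample(probs, k, rng=None):
    """Sample a codebook index from the k most probable entries of probs (shape (K,))."""
    rng = rng or np.random.default_rng()
    top = np.argsort(probs)[::-1][:k]          # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()          # renormalize over the Top-k set
    return int(rng.choice(top, p=p))
```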

Table 12: Impact of Top-k Sampling on GenHMR. Here, results are from 5 UGS iterations, refining pose tokens to evaluate initial 3D pose estimates at inference.
Top-k | 3DPW MPJPE | 3DPW MVE | EMDB MPJPE | EMDB MVE
1 | 68.1 | 77.5 | 88.2 | 99.5
2 | 75.7 | 90.3 | 98.2 | 116.9
5 | 85.1 | 98.6 | 108.6 | 123.1
10 | 90.5 | 110.1 | 118.1 | 130.6
20 | 98.9 | 120.6 | 130.9 | 140.3

Impact of Pose Tokenizer on GenHMR

The results of our experiments highlight the critical role of the pose tokenizer's design in the overall performance of GenHMR, as shown in Tables 13 and 14. Our ablation on codebook size shows that increasing the codebook size initially enhances performance, as seen when moving from a $1024\times 256$ to a $2048\times 256$ configuration, with notable improvements in both MPJPE and MVE across the 3DPW and EMDB datasets (Table 13). However, further expansion to a $4096\times 256$ codebook leads to a decline in accuracy, indicating that while larger codebooks provide richer pose representations, they may also introduce complexity that hinders 3D pose estimation in the second stage. Moreover, the number of tokens critically impacts GenHMR's performance in the second stage (Table 14). A moderate token count (96) provides the best results, with lower (48) and higher (192, 384) counts leading to reduced accuracy. This highlights the need for a balance between the richness of the pose representation and the model's ability to effectively utilize it for accurate 3D human mesh reconstruction.

Table 13: Impact of Codebook Size (Tokens = 96) on GenHMR. Here, results are from 5 iterations of UGS, iteratively refining pose tokens to evaluate initial 3D pose estimates at inference.
Codebook (entries × dim) | 3DPW MPJPE (↓) | 3DPW MVE (↓) | EMDB MPJPE (↓) | EMDB MVE (↓)
1024 × 256 | 70.1 | 82.5 | 90.2 | 103.7
2048 × 128 | 69.9 | 80.3 | 89.2 | 99.9
2048 × 256 | 68.1 | 77.5 | 88.2 | 99.5
4096 × 256 | 69.5 | 80.4 | 90.4 | 100.4
Table 14: Impact of Tokens (Codebook = 2048 × 256) on GenHMR. Here, results are from 5 iterations of UGS, iteratively refining pose tokens to evaluate initial 3D pose estimates at inference.
# of tokens | 3DPW MPJPE (↓) | 3DPW MVE (↓) | EMDB MPJPE (↓) | EMDB MVE (↓)
48 | 68.5 | 80.5 | 92.6 | 105.6
96 | 68.1 | 77.5 | 88.2 | 99.5
192 | 72.5 | 83.6 | 96.8 | 110.7
384 | 82.5 | 91.5 | 101.4 | 135.4

Ablation of Feature Resolutions

The ablation study on feature resolutions in GenHMR provides crucial insights into the efficacy of multi-scale feature representation for human mesh recovery. Results, as shown in Table 15, consistently demonstrate that increasing the feature resolution from 1× to 16× significantly enhances accuracy, reducing MPJPE by 3.2 mm on 3DPW and 4.1 mm on EMDB. However, further increases yield diminishing returns, indicating an optimal balance between performance and computational efficiency at the 16× resolution. Crucially, the study reveals a complementary relationship between high- and low-resolution features, with performance degrading when lower scales (1× or 4×) are omitted. This underscores the importance of GenHMR's multi-scale approach in capturing both fine-grained details and overall structure.

Table 15: Impact of feature resolutions on MPJPE error on the 3DPW and EMDB datasets. Deltas (Δ) show error increases relative to using all feature maps. Here, results are from 5 iterations of UGS, iteratively refining pose tokens to evaluate the initial pose estimate at inference.
Feature scales (1×, 4×, 8×, 16×) | 3DPW MPJPE (↓) | EMDB MPJPE (↓)
All scales | 68.3 | 88.3
| 71.3 (Δ3.0) | 92.3 (Δ4.0)
| 69.8 (Δ1.5) | 90.3 (Δ2.0)
| 68.8 (Δ0.5) | 89.3 (Δ1.0)
| 68.8 (Δ0.5) | 88.8 (Δ0.5)
| 69.3 (Δ1.0) | 89.8 (Δ1.5)
| 69.8 (Δ1.5) | 89.9 (Δ1.6)

Impact of Deformable Cross Attention Layers

The results show that the number of deformable cross-attention layers is crucial for GenHMR's performance. Increasing the number of layers from 2 to 4 significantly improves MPJPE and MVE on the 3DPW and EMDB datasets, enhancing 3D pose estimation and mesh reconstruction, as shown in Table 16. However, adding more than 4 layers yields diminishing returns and slight performance degradation. This indicates that 4 layers provide the best balance between complexity and performance, allowing GenHMR to achieve accurate and efficient 3D human mesh reconstruction.

Table 16: Impact of # of Deformable Cross Attention Layers in GenHMR. Here, results are from 5 iterations of UGS, iteratively refining pose tokens to evaluate initial 3D pose estimates at inference.
# of Deformable Cross Attention Layers | 3DPW MPJPE | 3DPW MVE | EMDB MPJPE | EMDB MVE
2 | 70.5 | 85.9 | 94.8 | 107.0
4 | 68.1 | 77.5 | 88.2 | 99.5
6 | 69.5 | 77.1 | 88.1 | 100.2
8 | 70.9 | 79.9 | 91.9 | 104.9

Qualitative Results

We present qualitative results of GenHMR in Figures 9, 10, and 11, demonstrating the model’s robustness in handling extreme poses and partial occlusions. These results highlight the effectiveness of our approach, where reconstructions are well-aligned with the input images and remain valid when viewed from novel perspectives. A key factor contributing to this success is GenHMR’s explicit modeling and reduction of uncertainty during the 2D-to-3D mapping process. By iteratively refining pose estimates and focusing on high-confidence predictions, GenHMR is able to mitigate the challenges that typically hinder other state-of-the-art methods. This approach ensures more accurate and consistent 3D reconstructions, even in complex scenarios where traditional deterministic models often falter. Additionally, the refinement process, as illustrated in Figure 11, plays a crucial role in aligning 3D outputs with 2D pose detections, further enhancing the model’s ability to produce realistic and accurate meshes.

Figure 9: State-of-the-art (SOTA) methods, such as HMR2.0 (Goel et al. 2023) and TokenHMR (Dwivedi et al. 2024), utilize vision transformers to recover 3D human meshes from single images. However, the limitations of these SOTA approaches, particularly in dealing with unusual poses or ambiguous situations, are evident in the errors marked by red circles. Our approach, GenHMR, addresses these challenges by explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping process, leading to more accurate and robust 3D pose reconstructions in complex scenarios.
Figure 10: Qualitative results of our approach on challenging poses from the LSP (Johnson and Everingham 2011) dataset. Results are directly from UGS.
Figure 11: The effect of 2D Pose-Guided Refinement on 3D pose reconstruction. The red circles highlight error-prone areas after each refinement iteration, demonstrating how the method progressively corrects these errors. By fine-tuning the pose tokens to better align the 3D pose with 2D detections, our approach iteratively reduces uncertainties and enhances accuracy. Notable improvements are observed in the early iterations, with most errors significantly reduced by the 10th iteration. The initial mesh is derived from uncertainty-guided sampling.
Figure 12: Failure Cases of GenHMR in 3D Human Reconstruction: GenHMR often encounters errors when dealing with unusual body articulations and complex depth ordering of body parts. These challenges typically result in inaccurate 3D poses and non-valid outputs. The root of this limitation lies in the model’s reliance on the SMPL parametric model, which may not fully capture the complexity of extreme or uncommon human poses.