
GATOR: Graph-Aware Transformer with Motion-Disentangled Regression for Human Mesh Recovery from a 2D Pose

Abstract

3D human mesh recovery from a 2D pose plays an important role in various applications. However, it is hard for existing methods to simultaneously capture the multiple relations during the evolution from skeleton to mesh, including joint-joint, joint-vertex, and vertex-vertex relations, which often leads to implausible results. To address this issue, we propose a novel solution, called GATOR, that contains a Graph-Aware Transformer (GAT) encoder and a Motion-Disentangled Regression (MDR) decoder to explore these multiple relations. Specifically, GAT combines a GCN and a graph-aware self-attention in parallel to capture physical and hidden joint-joint relations. Furthermore, MDR models joint-vertex and vertex-vertex interactions to explore joint and vertex relations. Based on the clustering characteristics of vertex offset fields, MDR regresses the vertices by composing the predicted base motions. Extensive experiments show that GATOR achieves state-of-the-art performance on two challenging benchmarks. Code is available at https://github.com/kasvii/GATOR.

Index Terms—  3D Human Mesh Recovery, Transformer, Graph Convolutional Network, Motion Disentangling

1 Introduction

3D human mesh recovery from 2D observations is an essential task for many 3D applications [1]. However, image-based methods suffer from the domain gap in image appearance between well-controlled datasets and in-the-wild scenes, while pose-based methods naturally relieve this problem by taking skeleton inputs [2, 3, 4, 5]. Existing pose-based methods, however, neglect the multiple relations that arise during the evolution from skeleton to mesh, including joint-joint, joint-vertex, and vertex-vertex relations, and are therefore prone to producing implausible results.

Existing pose-based methods follow an encoder-decoder paradigm [1]. For encoders, Graph Convolutional Networks (GCNs) and Transformers have become the mainstream [6, 7, 8]. Benefiting from the graph structure of the human skeleton, GCNs naturally capture the physical relations between neighboring joints [9, 10], but they struggle to capture non-local relations. In contrast, Transformers can explore global information through the attention mechanism, yet weaken the graph topology and local relations. Recently, several methods have combined GCNs and Transformers to complement each other [4, 11]. However, the neglect of graph structure in the Transformer and the cascaded architecture they adopt may limit their effectiveness.

Fig. 1: The offset field (template → target) can be clustered into several base motions, drawn in the corresponding colors.

For decoders, some methods regress vertex coordinates directly [11, 12, 13], while some recent works predict offset fields that are added to the template mesh [4, 14, 15]. Both regress 3D coordinates directly from high-dimensional features in a purely data-driven manner that ignores physical plausibility. As shown in Fig. 1, the offset field from the template mesh to the target mesh can be clustered into several base motions due to the motion similarity within the same body part. This inspires us to generate vertex motions by predicting base motions and using them to constitute each vertex offset. Compared with directly regressing the vertex offsets, predicting and aggregating base motions reduces the training burden of the network and yields more accurate results.
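As an illustration of this observation, per-vertex offsets between a template and a target mesh can be grouped with an off-the-shelf clustering algorithm. The sketch below runs k-means on a placeholder offset array; the mesh resolution, the random offsets, and the cluster count are assumptions for illustration, not values taken from GATOR.

```python
# A minimal sketch of the base-motion observation, assuming vertex offsets are
# available as a (V, 3) array of template-to-target displacements per vertex.
import numpy as np
from sklearn.cluster import KMeans

V = 6890                                   # SMPL-resolution mesh size (assumption)
rng = np.random.default_rng(0)
offsets = rng.normal(size=(V, 3))          # placeholder for real template->target offsets

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(offsets)
base_motions = kmeans.cluster_centers_     # (20, 3): one representative motion per cluster
labels = kmeans.labels_                    # cluster index per vertex, roughly body-part-wise
print(base_motions.shape, np.bincount(labels)[:5])
```

On real meshes, the cluster assignments tend to follow body parts, which is the motivation for composing vertex offsets from a small set of base motions.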

Fig. 2: Architecture of GATOR. Given an input 2D pose, the Graph-Aware Transformer (GAT) encoder learns local-global joint features by the parallel GCN and Graph-Aware Self-Attention (GA-SA), where GA-SA adds two encodings for skeleton graph awareness. The Motion-Disentangled Regression (MDR) decoder then generates the vertex offsets by composing base motions.

Based on the above observations, we present a novel network, termed GATOR, comprising a Graph-Aware Transformer (GAT) encoder and a Motion-Disentangled Regression (MDR) decoder, which recovers a 3D human mesh from a 2D human pose. (1) In GAT, we design a two-branch module that contains a GCN branch and a Graph-Aware Self-Attention (GA-SA) branch to explore physical and hidden joint relations, where GA-SA incorporates two important skeleton encodings to enhance graph awareness. (2) Moreover, MDR models joint-vertex and vertex-vertex interactions and generates the vertex offsets by composing the predicted base motions. (3) Experimental results show that GATOR outperforms previous state-of-the-art methods on two benchmark datasets.

2 Method

Fig. 2 illustrates the architecture of GATOR, which consists of GAT and MDR. Given 2D human joints estimated by an off-the-shelf 2D pose detector, GAT first extracts physical and hidden joint features and then generates a 3D pose. MDR learns vertex features through joint-vertex and vertex-vertex interactions, then predicts base motions that constitute the vertex offsets, which are added to the template mesh to form the final mesh.

2.1 Graph-Aware Transformer Encoder

The joint relations include the physical skeleton topology and action-specific information (e.g., the relation between hands and feet is strong during running but weak when sitting), which is difficult to capture with a static graph [6]. Therefore, we propose a two-branch module named Static and Dynamic Graph Attention (SDGA). One branch is GA-SA, which incorporates two important skeleton priors to improve graph awareness for global and dynamic feature learning. The other is a GCN branch that reinforces the physical topology along a static graph.

Graph-Aware Self-Attention. Inspired by graph representation learning [16], where priors injected into the attention mechanism can adaptively reshape the attention distribution, we design GA-SA by introducing two crucial skeleton priors.

One is the multi-hop connectivity between joints, represented by an $N \times N$ matrix $D$, where $N$ is the number of joints and $D_{ij} = \phi(i, j)$ denotes the hop distance between joint $i$ and joint $j$. A learnable embedding table $T_{s} \in \mathbb{R}^{\max(D) \times H}$, where $H$ is the number of attention heads, projects each hop value in $D$ to a vector of size $H$, embedding $D$ into a learnable tensor named Hop Encoding (HE) $\in \mathbb{R}^{N \times N \times H}$:

\operatorname{HE}_{ij} = T_{s}[D_{ij}],   (1)

where $[\cdot]$ denotes the indexing operation.
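For illustration, a minimal PyTorch sketch of Eq. (1) indexes a learnable embedding table with the hop-distance matrix; the joint count, head count, and hop distances below are placeholders rather than GATOR's actual configuration.

```python
# A minimal sketch of the Hop Encoding (Eq. 1), assuming N joints and H attention
# heads; in practice D would come from the skeleton graph, not from random values.
import torch
import torch.nn as nn

N, H = 17, 4                                   # joint and head counts (assumptions)
D = torch.randint(0, 5, (N, N))                # placeholder hop distances phi(i, j)
hop_table = nn.Embedding(int(D.max()) + 1, H)  # learnable table T_s (+1 to cover hop 0)

HE = hop_table(D)                              # (N, N, H): HE_ij = T_s[D_ij]
print(HE.shape)
```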

The other is the path information between joints, which reflects the bone lengths and body proportions [1]. The Path Encoding (PE) is built upon a distance graph $\{J, E\}$, where $J$ denotes the joints and $E$ denotes the spatial distances between adjacent joints. The vector $p_{ij} = \{e_{ij}^{1}, e_{ij}^{2}, \dots, e_{ij}^{D_{ij}}\}$ is defined as the path from joint $i$ to joint $j$. A linear embedding layer $f(\cdot)$ projects each path $p_{ij}$ to a learnable tensor $f(p_{ij}) \in \mathbb{R}^{D_{ij} \times H}$. The path encoding of joint pair $(i, j)$ is defined as the average of the edge embeddings weighted by learnable weights along the path:

\operatorname{PE}_{ij} = \frac{1}{D_{ij}} \sum_{k=1}^{D_{ij}} W_{ij}^{k} f(e_{ij}^{k}),   (2)

where $W_{ij} \in \mathbb{R}^{D_{ij}}$ denotes the learnable weights for $p_{ij}$.
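A minimal sketch of Eq. (2) for a single joint pair is shown below, assuming each edge on the path is described by a scalar bone length; the path length, edge values, and head count are illustrative assumptions.

```python
# A minimal sketch of the Path Encoding for one joint pair (Eq. 2): a learnable
# weighted average of the embedded edges along the path from joint i to joint j.
import torch
import torch.nn as nn

H = 4                                              # number of attention heads (assumption)
path_edges = torch.tensor([[0.3], [0.5], [0.2]])   # D_ij = 3 edges e_ij^k (placeholder lengths)
f = nn.Linear(1, H)                                # linear embedding f(.) applied per edge
w = nn.Parameter(torch.ones(path_edges.shape[0]))  # learnable weights W_ij

edge_emb = f(path_edges)                           # (D_ij, H)
PE_ij = (w.unsqueeze(1) * edge_emb).mean(dim=0)    # (H,): (1/D_ij) * sum_k W_ij^k f(e_ij^k)
print(PE_ij.shape)
```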

By adding the hop and path encodings to the attention matrix $A$, the improved attention matrix $A^{\prime}$ can be written as:

A^{\prime}_{ij} = A_{ij} + \operatorname{HE}_{ij} + \operatorname{PE}_{ij},   (3)
A_{ij} = (X_{i} W_{Q}^{s})(X_{j} W_{K}^{s})^{T} / \sqrt{d},   (4)

where $d$ is the feature dimension, $X_{i}, X_{j} \in \mathbb{R}^{d}$ are the input features, and $W_{Q}^{s}, W_{K}^{s} \in \mathbb{R}^{d \times d}$ are learnable weight matrices that project the input to different representations.
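The biased attention of Eqs. (3)-(4) can be sketched as follows, assuming precomputed per-head HE and PE tensors and random joint features; the head splitting and shapes are illustrative assumptions rather than the released implementation.

```python
# A minimal sketch of Graph-Aware Self-Attention (Eqs. 3-4): attention logits plus
# the hop and path encodings as per-head additive biases, then a row-wise softmax.
import torch
import torch.nn as nn

N, d, H = 17, 128, 4                       # joints, feature dim, heads (assumptions)
X = torch.randn(N, d)
W_Q, W_K = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
HE = torch.randn(N, N, H)                  # hop encoding (placeholder values)
PE = torch.randn(N, N, H)                  # path encoding (placeholder values)

dh = d // H
Q = W_Q(X).view(N, H, dh).transpose(0, 1)  # (H, N, dh)
K = W_K(X).view(N, H, dh).transpose(0, 1)
A = Q @ K.transpose(1, 2) / d ** 0.5       # (H, N, N): Eq. 4 per head
A_prime = A + HE.permute(2, 0, 1) + PE.permute(2, 0, 1)   # Eq. 3: add graph-aware biases
attn = A_prime.softmax(dim=-1)
print(attn.shape)                          # (4, 17, 17)
```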

Fig. 3: Qualitative results on COCO. From left to right: input image, GraphCMR [12], I2L-MeshNet [17], GTRS [4], and ours.

Static and Dynamic Graph Learning. GA-SA is driven by the input features during both training and inference: when the input pose varies, the attention maps change accordingly to capture dynamic, action-specific relations. However, GA-SA weakens the physical and local interactions. We therefore introduce a GCN branch following MGCN [9], whose parameters are updated during training and fixed during inference to capture the physical topology. The joint feature $X_{in} \in \mathbb{R}^{N \times d}$ enters each branch and is transformed by the attention maps and the adjacency matrices, respectively. By summing the two branch outputs, the result contains information from both the static and dynamic graphs.
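A compact sketch of the two-branch idea is given below; for brevity it uses a plain one-layer GCN and PyTorch's standard multi-head attention in place of the MGCN layer and GA-SA described above, so it illustrates only the static-plus-dynamic fusion, not the exact layers.

```python
# A minimal sketch of the SDGA idea: a static GCN branch over a fixed skeleton
# adjacency plus a dynamic self-attention branch, summed at the output.
import torch
import torch.nn as nn

class SDGABlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.gcn_weight = nn.Linear(dim, dim, bias=False)                # static-graph branch
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # dynamic branch

    def forward(self, x, adj):
        # x: (B, N, dim) joint features; adj: (N, N) normalized skeleton adjacency
        static = adj @ self.gcn_weight(x)    # neighborhood aggregation along the fixed graph
        dynamic, _ = self.attn(x, x, x)      # input-dependent global attention
        return static + dynamic              # fuse static and dynamic graph information

N, dim = 17, 128
block = SDGABlock(dim)
out = block(torch.randn(2, N, dim), torch.eye(N))   # identity as a placeholder adjacency
print(out.shape)                                    # (2, 17, 128)
```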

2.2 Motion-Disentangled Regression Decoder

We design MDR with a Linear Blend Featuring (LBF) module that learns vertex features through joint-vertex and vertex-vertex interactions, and a Motion-Disentangled Regression Head (MDR Head) that predicts base motions and uses them to constitute the vertex offsets. To avoid redundancy and make training more efficient [11, 13], MDR processes a coarse mesh with 431 vertices, then upsamples the vertex offsets to the full 6K-vertex resolution and adds them to the original template mesh to obtain the final result.

Linear Blend Featuring. Previous pose-based methods [2, 3, 4] ignore the inherent joint-vertex relations in the transition from skeleton to mesh. In Linear Blend Skinning (LBS), each vertex is driven by the joints, and its coordinate can be represented as a weighted sum over all joints [18, 19].

Inspired by LBS, a cross-attention module is designed to perform joint-vertex interaction, which can be expressed as:

X^{v}_{cro} = \operatorname{softmax}(Q_{v} K_{j}^{T} / \sqrt{d}) V_{j},   (5)
Q_{v} = X^{v}_{in} W^{c}_{Q}, \quad K_{j} = X^{j}_{in} W^{c}_{K}, \quad V_{j} = X^{j}_{in} W^{c}_{V},   (6)

where $d$ is the feature dimension, $X^{v}_{in} \in \mathbb{R}^{V \times d}$ and $X^{j}_{in} \in \mathbb{R}^{N \times d}$ denote the input vertex and joint features, and $V$ and $N$ denote the numbers of vertices and joints, respectively. $W_{Q}^{c}, W_{K}^{c}, W_{V}^{c} \in \mathbb{R}^{d \times d}$ are learnable weight matrices. Thus, each vertex feature is a weighted sum of the joint features. The input joint feature $X^{j}_{in}$ is the concatenation of the 2D and 3D joint coordinates and the output joint feature from GAT, while the input feature of a vertex, $X^{v}_{in}$, is the concatenation of its coordinate on the coarse template mesh and the coordinate of its nearest 3D joint. After the joint-vertex interaction, we introduce a self-attention module for vertex-vertex interaction.
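For illustration, Eqs. (5)-(6) reduce to a standard cross-attention in which vertices query joints; the sketch below uses random features and assumed dimensions rather than the actual LBF inputs.

```python
# A minimal sketch of the LBF joint-to-vertex cross-attention (Eqs. 5-6): every
# vertex feature becomes a weighted sum of the joint features, in the spirit of LBS.
import torch
import torch.nn as nn

V, N, d = 431, 17, 64                      # coarse vertices, joints, feature dim (assumptions)
X_v = torch.randn(V, d)                    # vertex features (template coord + nearest joint)
X_j = torch.randn(N, d)                    # joint features from GAT (placeholder)
W_Q, W_K, W_V = (nn.Linear(d, d, bias=False) for _ in range(3))

Q, K, Vj = W_Q(X_v), W_K(X_j), W_V(X_j)
attn = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)   # (V, N) joint-vertex weights
X_cro = attn @ Vj                                  # (V, d) blended vertex features
print(X_cro.shape)
```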

Motion-Disentangled Regression Head. Traditional mesh regression heads project high-dimensional features to vertex coordinates with a linear layer, ignoring physical plausibility. Motivated by the observation in Sec. 1, we propose a novel regression head based on the motion similarity within the same body part. Specifically, we predict several base motions to constitute the vertex offsets of the coarse mesh (431 vertices), which are added to the original template mesh (6K vertices) after upsampling. The coarse vertex offsets $\Delta M^{\prime} \in \mathbb{R}^{V \times 3}$ consist of a weighted sum of base motions, which determines the general orientation and translation, and learnable biases for refinement:

\Delta M^{\prime} = \alpha \times \operatorname{softmax}(M_{A}) M_{B} + M_{C},   (7)

where $M_{A} \in \mathbb{R}^{V \times m}$ is the motion weight matrix, $M_{B} \in \mathbb{R}^{m \times 3}$ denotes the $m$ base motions, $M_{C} \in \mathbb{R}^{V \times 3}$ denotes the motion biases, and $\alpha \in \mathbb{R}^{V}$ denotes the scaling factors, all of which are learned by the network. The coarse vertex offsets $\Delta M^{\prime}$ are upsampled to the original resolution of 6K vertices through a linear projection and added to the template mesh to obtain the final mesh.
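A minimal sketch of the MDR head (Eq. (7)) together with the coarse-to-full upsampling is shown below; the SMPL-style 6,890-vertex full resolution, the feature dimension, and predicting $M_A$ from the vertex features are assumptions for illustration.

```python
# A minimal sketch of the MDR head: per-vertex softmax weights over m base motions,
# a per-vertex scale, and a bias produce coarse offsets, which are then upsampled.
import torch
import torch.nn as nn

class MDRHead(nn.Module):
    def __init__(self, dim, num_coarse=431, num_full=6890, m=20):
        super().__init__()
        self.to_weights = nn.Linear(dim, m)                        # predicts rows of M_A
        self.base_motions = nn.Parameter(torch.zeros(m, 3))        # M_B
        self.bias = nn.Parameter(torch.zeros(num_coarse, 3))       # M_C
        self.scale = nn.Parameter(torch.ones(num_coarse, 1))       # alpha
        self.upsample = nn.Linear(num_coarse, num_full)            # coarse -> full resolution

    def forward(self, vert_feat, template):
        # vert_feat: (num_coarse, dim) vertex features; template: (num_full, 3) template mesh
        M_A = self.to_weights(vert_feat)                                           # (V, m)
        coarse = self.scale * (M_A.softmax(-1) @ self.base_motions) + self.bias    # Eq. 7
        full = self.upsample(coarse.T).T                           # (num_full, 3) offsets
        return template + full                                     # final mesh vertices

head = MDRHead(dim=64)
mesh = head(torch.randn(431, 64), torch.zeros(6890, 3))
print(mesh.shape)                                                  # (6890, 3)
```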

2.3 Loss Functions

GAT is first pretrained using the 3D joint loss to supervise the intermediate 3D pose. Then, following [2, 4], the whole model is supervised by four losses: the mesh vertex loss $\mathcal{L}_{M}$, the 3D joint loss $\mathcal{L}_{J}$ (on joints obtained from the final mesh), the surface normal loss $\mathcal{L}_{N}$, and the surface edge loss $\mathcal{L}_{E}$. The total loss is:

\mathcal{L} = \lambda_{M}\mathcal{L}_{M} + \lambda_{J}\mathcal{L}_{J} + \lambda_{N}\mathcal{L}_{N} + \lambda_{E}\mathcal{L}_{E},   (8)

where $\lambda_{M} = 1$, $\lambda_{J} = 1$, $\lambda_{N} = 0.1$, and $\lambda_{E} = 20$ in our experiments.
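For concreteness, a sketch of Eq. (8) with the weights above is given below; the L1 form of the vertex, joint, and edge terms and the omission of the normal term are assumptions, since the paper follows the loss definitions of [2, 4].

```python
# A minimal sketch of the total loss (Eq. 8) with the paper's weights; the toy
# edge list and random tensors are placeholders for real mesh supervision.
import torch
import torch.nn.functional as F

def total_loss(pred_verts, gt_verts, pred_joints, gt_joints, edges,
               lam=(1.0, 1.0, 0.1, 20.0)):
    L_M = F.l1_loss(pred_verts, gt_verts)                       # mesh vertex loss
    L_J = F.l1_loss(pred_joints, gt_joints)                     # 3D joint loss
    L_N = torch.tensor(0.0)                                     # surface normal loss (omitted here)
    pred_len = (pred_verts[edges[:, 0]] - pred_verts[edges[:, 1]]).norm(dim=-1)
    gt_len = (gt_verts[edges[:, 0]] - gt_verts[edges[:, 1]]).norm(dim=-1)
    L_E = F.l1_loss(pred_len, gt_len)                           # surface edge (length) loss
    return lam[0] * L_M + lam[1] * L_J + lam[2] * L_N + lam[3] * L_E

verts_p, verts_g = torch.randn(431, 3), torch.randn(431, 3)
joints_p, joints_g = torch.randn(17, 3), torch.randn(17, 3)
edges = torch.tensor([[0, 1], [1, 2], [2, 3]])
print(total_loss(verts_p, verts_g, joints_p, joints_g, edges))
```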

Table 1: Comparison with state-of-the-art methods on the Human3.6M and 3DPW datasets. § denotes input from 2D pose detectors [20, 21]; * denotes ground-truth 2D pose input. None of these methods uses 3DPW for training.

Input   | Method           | Human3.6M MPJPE ↓ | Human3.6M PA-MPJPE ↓ | 3DPW MPJPE ↓ | 3DPW PA-MPJPE ↓ | 3DPW MPVE ↓
image   | HMR [22]         | 88.0 | 56.8 | -    | 81.3 | 130.0
image   | GraphCMR [12]    | -    | 50.1 | -    | 70.2 | -
image   | SPIN [23]        | 62.5 | 41.1 | 96.9 | 59.2 | 116.4
image   | PyMAF [24]       | 57.7 | 40.5 | 92.8 | 58.9 | 110.1
image   | I2L-MeshNet [17] | 55.7 | 41.1 | 93.2 | 57.7 | -
image   | ProHMR [25]      | -    | 41.2 | -    | 59.8 | -
image   | OCHMR [26]       | -    | -    | 89.7 | 58.3 | 107.1
video   | VIBE [27]        | 65.6 | 41.4 | 93.5 | 56.5 | 113.4
video   | TCMR [28]        | 62.3 | 41.1 | 95.0 | 55.8 | 111.5
video   | AdvLearning [29] | -    | -    | 92.6 | 55.2 | 111.9
video   | MPS-Net [30]     | -    | -    | 91.6 | 54.0 | 109.6
2D pose | Pose2Mesh§ [2]   | 64.9 | 46.3 | 88.9 | 58.3 | 106.3
2D pose | PQ-GCN§ [3]      | 64.6 | 47.9 | 89.2 | 58.3 | 106.4
2D pose | GTRS§ [4]        | 64.3 | 45.4 | 88.5 | 58.9 | 106.2
2D pose | GATOR§ (Ours)    | 64.0 | 44.7 | 87.5 | 56.8 | 104.5
2D pose | Pose2Mesh* [2]   | 51.3 | 35.9 | 65.1 | 34.6 | -
2D pose | GTRS* [4]        | 50.6 | 34.4 | 53.8 | 34.5 | 61.6
2D pose | GATOR* (Ours)    | 48.8 | 31.2 | 50.8 | 30.5 | 59.6

3 Experiments

Datasets. Human3.6M [31], 3DPW [32], COCO [33], and MuCo-3DHP [34] are used following previous works [2, 3, 4].

Evaluation Metrics. Three metrics are used to report the experimental results: Mean Per Joint Position Error (MPJPE), Procrustes-Aligned MPJPE (PA-MPJPE), i.e., MPJPE after rigid alignment, and Mean Per Vertex Error (MPVE).
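For reference, the two joint metrics can be sketched as follows; the similarity (Procrustes) alignment used for PA-MPJPE is the standard formulation, and the random inputs are placeholders.

```python
# A minimal sketch of MPJPE and PA-MPJPE on (N, 3) joint arrays (e.g., in millimetres).
import numpy as np

def mpjpe(pred, gt):
    # mean Euclidean distance per joint
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    # similarity (Procrustes) alignment of pred to gt, then MPJPE
    P, G = pred - pred.mean(0), gt - gt.mean(0)
    K = P.T @ G
    U, s, Vt = np.linalg.svd(K)
    Z = np.eye(3)
    Z[-1, -1] = np.sign(np.linalg.det(U @ Vt))     # guard against reflections
    R = Vt.T @ Z @ U.T
    scale = np.trace(R @ K) / (P ** 2).sum()
    aligned = scale * P @ R.T + gt.mean(0)
    return mpjpe(aligned, gt)

pred, gt = np.random.randn(17, 3), np.random.randn(17, 3)
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```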

Implementation Details. We first pretrain GAT and then train the whole GATOR in an end-to-end manner. GAT is stacked with $L_{1} = 6$ layers with feature dimension $d_{1} = 128$. In MDR, LBF is stacked with $L_{2} = 3$ layers with feature dimension $d_{2} = 64$. GAT is pretrained with the Adam optimizer for 30 epochs with a batch size of 256 and a learning rate of $8 \times 10^{-4}$, while the whole GATOR is trained with a batch size of 64 and a learning rate of $1 \times 10^{-4}$ for 30 epochs. All experiments are conducted on one NVIDIA RTX 3090 GPU.

3.1 Comparison with State-of-the-Art Methods

Table 1 compares GATOR with previous image-, video-, and pose-based methods on the Human3.6M and 3DPW datasets. For a fair comparison, we follow the same settings as previous works [4, 2, 22, 23]. For Human3.6M, GATOR is trained on the Human3.6M training set, and PA-MPJPE is measured on the frontal-camera set. For 3DPW, GATOR is trained on multiple datasets, including Human3.6M, COCO, and MuCo-3DHP, and evaluated on the 3DPW test set to examine cross-dataset generalization. When using detected 2D poses [20, 21] as input, our method outperforms previous pose-based methods and achieves results comparable to image- and video-based methods. On 3DPW in particular, although the input poses are not highly accurate and no image or temporal information is used, GATOR outperforms existing methods in MPJPE and MPVE. When using ground-truth 2D poses as input, the performance improves by a large margin (9.3%, from 34.4 mm to 31.2 mm, on Human3.6M and 11.6%, from 34.5 mm to 30.5 mm, on 3DPW in PA-MPJPE). This indicates that our method can benefit further from a more accurate 2D pose detector.

3.2 Ablation Study

Effectiveness of GA-SA and GCN. Table 2 examines the components of GAT in terms of the intermediate 3D pose and the final mesh. The proposed GAT improves both pose and mesh performance, and a more accurate 3D pose leads to higher mesh accuracy. Each of HE, PE, and GCN individually brings similar improvements, but combining them boosts the performance by a clear margin (2.3 mm in MPJPE and 3.2 mm in MPVE of the 3D mesh). These improvements indicate that the hidden action-specific information explored by GA-SA and the physical skeleton topology extracted by the GCN effectively complement each other.

Table 2: Comparison of GAT components on 3DPW.

Configuration             | 3D Pose MPJPE ↓ | 3D Mesh MPJPE ↓ | 3D Mesh MPVE ↓
Baseline (no HE, PE, GCN) | 87.1 | 89.8 | 107.7
+ one component           | 86.5 | 88.6 | 105.5
+ one component           | 86.4 | 88.7 | 106.0
+ one component           | 86.6 | 88.7 | 105.6
+ two components          | 86.0 | 88.0 | 105.3
+ two components          | 85.9 | 88.3 | 105.3
+ two components          | 85.2 | 88.0 | 105.1
Full GAT (HE + PE + GCN)  | 84.3 | 87.5 | 104.5

Table 3: Comparison of different regressors on 3DPW. MDR-m denotes the MDR head with m base motions.

Regressor      | 3D Mesh PA-MPJPE ↓ | 3D Mesh MPVE ↓
Linear         | 63.2 | 119.9
Linear + LBF   | 58.8 | 107.7
MDR-20 w/o LBF | 58.2 | 107.2
MDR-5          | 56.8 | 104.8
MDR-10         | 57.4 | 105.5
MDR-20         | 56.8 | 104.5
MDR-30         | 57.4 | 105.2
MDR-40         | 57.5 | 105.5

Fig. 4: Qualitative comparison of mesh details between Pose2Mesh [2] and our proposed GATOR.

Effectiveness of MDR. Table 3 evaluates the impact of MDR by removing LBF, replacing the MDR Head with a plain linear layer, and varying the number of base motions. The top two rows show that the joint and vertex interactions in LBF are effective for exploring relations and significantly improve the results. The MDR Head further improves performance, with the variant using 20 base motions (MDR-20) achieving the best results. Using more or fewer base motions may lead to inappropriate clustering, which degrades performance.

3.3 Qualitative Results

Fig. 3 shows qualitative results compared to GraphCMR [12], I2L-MeshNet [17], and GTRS [4] on the COCO dataset. The first two are image-based methods that are often affected by the background, whereas the two pose-based methods are more robust; among them, GATOR produces the most plausible results. Fig. 4 compares mesh details between Pose2Mesh [2] and GATOR. Pose2Mesh is prone to artifacts due to sub-optimal vertex predictions, while our method provides more accurate vertices and finer-grained meshes.

4 Conclusion

In this paper, we present GATOR, a novel network for 3D human mesh recovery from a 2D pose. To explore the multiple joint and vertex relations in the evolution from skeleton to mesh, GAT combines a GCN branch and a GA-SA branch in parallel for static and dynamic graph learning, where GA-SA incorporates two important skeleton encodings to enhance graph awareness. In addition, MDR models joint-vertex and vertex-vertex interactions and generates the vertex coordinates through motion-disentangled regression, which provides more accurate results. Extensive experiments show that our method achieves state-of-the-art performance on two challenging benchmarks.

References

  • [1] Y. Tian, H. Zhang, Y. Liu, and L. Wang, “Recovering 3D human mesh from monocular images: A survey,” arXiv preprint arXiv:2203.01923, 2022.
  • [2] H. Choi, G. Moon, and K. M. Lee, “Pose2Mesh: Graph convolutional network for 3D human pose and mesh recovery from a 2D human pose,” in Proc. ECCV, 2020, pp. 769–787.
  • [3] L. Wang, X. Liu, X. Ma, J. Wu, J. Cheng, and M. Zhou, “A progressive quadric graph convolutional network for 3D human mesh recovery,” IEEE TCSVT, pp. 104–117, 2022.
  • [4] C. Zheng, M. Mendieta, P. Wang, A. Lu, and C. Chen, “A lightweight graph transformer network for human mesh reconstruction from 2D human pose,” in Proc. ACM MM, 2022, pp. 5496–5507.
  • [5] C. Wu, Y. Li, X. Tang, and J. Wang, “MUG: Multi-human graph network for 3D mesh reconstruction from 2D pose,” arXiv preprint arXiv:2205.12583, 2022.
  • [6] A. Zeng, X. Sun, L. Yang, N. Zhao, M. Liu, and Q. Xu, “Learning skeletal graph neural networks for hard 3D pose estimation,” in Proc. ICCV, 2021, pp. 11436–11445.
  • [7] W. Li, H. Liu, R. Ding, M. Liu, P. Wang, and W. Yang, “Exploiting temporal contexts with strided transformer for 3D human pose estimation,” IEEE TMM, 2022.
  • [8] W. Li, H. Liu, H. Tang, P. Wang, and L. Van Gool, “MHFormer: Multi-hypothesis transformer for 3D human pose estimation,” in Proc. CVPR, 2022, pp. 13147–13156.
  • [9] Z. Zou and W. Tang, “Modulated graph convolutional network for 3D human pose estimation,” in Proc. ICCV, 2021, pp. 11477–11487.
  • [10] W. Li, H. Liu, T. Guo, H. Tang, and R. Ding, “GraphMLP: A graph MLP-like architecture for 3D human pose estimation,” arXiv preprint arXiv:2206.06420, 2022.
  • [11] K. Lin, L. Wang, and Z. Liu, “Mesh graphormer,” in Proc. ICCV, 2021, pp. 12939–12948.
  • [12] N. Kolotouros, G. Pavlakos, and K. Daniilidis, “Convolutional mesh regression for single-image human shape reconstruction,” in Proc. CVPR, 2019, pp. 4501–4510.
  • [13] K. Lin, L. Wang, and Z. Liu, “End-to-end human pose and mesh reconstruction with transformers,” in Proc. CVPR, 2021, pp. 1954–1963.
  • [14] T. Luan, Y. Wang, J. Zhang, Z. Wang, Z. Zhou, and Y. Qiao, “PC-HMR: Pose calibration for 3D human mesh recovery from 2D images/videos,” in Proc. AAAI, 2021, vol. 35, pp. 2269–2276.
  • [15] M. Zanfir, A. Zanfir, E. G. Bazavan, W. T. Freeman, R. Sukthankar, and C. Sminchisescu, “THUNDR: Transformer-based 3D human reconstruction with markers,” in Proc. ICCV, 2021, pp. 12971–12980.
  • [16] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T. Y. Liu, “Do transformers really perform badly for graph representation?,” in Proc. NeurIPS, 2021, pp. 28877–28888.
  • [17] G. Moon and K. M. Lee, “I2L-MeshNet: Image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image,” in Proc. ECCV, 2020, pp. 752–768.
  • [18] A. Jacobson, Z. Deng, L. Kavan, and J. P. Lewis, “Skinning: Real-time shape deformation,” in ACM SIGGRAPH 2014 Courses. 2014.
  • [19] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “SMPL: A skinned multi-person linear model,” ACM TOG, vol. 34, no. 6, pp. 1–16, 2015.
  • [20] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, “Integral human pose regression,” in Proc. ECCV, 2018, pp. 529–545.
  • [21] F. Zhang, X. Zhu, H. Dai, M. Ye, and C. Zhu, “Distribution-aware coordinate representation for human pose estimation,” in Proc. CVPR, 2020, pp. 7093–7102.
  • [22] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, “End-to-end recovery of human shape and pose,” in Proc. CVPR, 2018, pp. 7122–7131.
  • [23] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis, “Learning to reconstruct 3D human pose and shape via model-fitting in the loop,” in Proc. ICCV, 2019, pp. 2252–2261.
  • [24] H. Zhang, Y. Tian, X. Zhou, W. Ouyang, Y. Liu, L. Wang, and Z. Sun, “PyMAF: 3D human pose and shape regression with pyramidal mesh alignment feedback loop,” in Proc. ICCV, 2021, pp. 11446–11456.
  • [25] N. Kolotouros, G. Pavlakos, D. Jayaraman, and K. Daniilidis, “Probabilistic modeling for human mesh recovery,” in Proc. ICCV, 2021, pp. 11605–11614.
  • [26] R. Khirodkar, S. Tripathi, and K. Kitani, “Occluded human mesh recovery,” in Proc. CVPR, 2022, pp. 1715–1725.
  • [27] M. Kocabas, N. Athanasiou, and M. J. Black, “VIBE: Video inference for human body pose and shape estimation,” in Proc. CVPR, 2020, pp. 5253–5263.
  • [28] H. Choi, G. Moon, J. Y. Chang, and K. M. Lee, “Beyond static features for temporally consistent 3D human pose and shape from a video,” in Proc. CVPR, 2021, pp. 1964–1973.
  • [29] Y. Sun, J. Zhang, and W. Wang, “Adversarial learning enhancement for 3D human pose and shape estimation,” in Proc. ICASSP, 2022, pp. 3743–3747.
  • [30] W. L. Wei, J. C. Lin, T. L. Liu, and H. Y. M. Liao, “Capturing humans in motion: temporal-attentive 3D human pose and shape estimation from monocular video,” in Proc. CVPR, 2022, pp. 13211–13220.
  • [31] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments,” IEEE TPAMI, vol. 36, no. 7, pp. 1325–1339, 2013.
  • [32] T. Von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll, “Recovering accurate 3D human pose in the wild using IMUs and a moving camera,” in Proc. ECCV, 2018, pp. 601–617.
  • [33] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. ECCV, 2014, pp. 740–755.
  • [34] D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll, and C. Theobalt, “Single-shot multi-person 3D pose estimation from monocular RGB,” in Proc. 3DV, 2018, pp. 120–130.