Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers
Abstract
3D human pose estimation localizes human joints in three-dimensional space, preserving depth information and the physical structure of the body. This is essential for applications that require precise pose information, such as human-computer interaction, scene understanding, and rehabilitation training. Due to the challenges of data collection, mainstream 3D human pose estimation datasets are primarily composed of multi-view video data collected in laboratory environments, which contain rich spatial-temporal correlation information beyond the image frame content. Given the remarkable self-attention mechanism of transformers, which is capable of capturing spatial-temporal correlations from multi-view video datasets, we propose a multi-stage framework for 3D sequence-to-sequence (seq2seq) human pose estimation. First, a spatial module represents the human pose features from intra-image content, while a frame-image relation module extracts temporal relationships and 3D spatial positional relationships between the multi-perspective images. Second, the self-attention mechanism is adopted to eliminate interference from non-human-body patches and to reduce computational cost. Our method is evaluated on Human3.6M, a popular 3D human pose estimation dataset. Experimental results demonstrate that our approach achieves state-of-the-art performance on this dataset. The source code will be available at https://github.com/WUJINHUAN/3D-human-pose.
Index Terms:
3D Human Pose Estimation, Vision Transformers, Multi-Perspective, Spatial-Temporal Relationship
I Introduction
Human pose estimation provides the positions of key human joints from images or videos and is a core technology for understanding human behaviors and motion scenes. The reliability and generalization of downstream applications are highly dependent on the precision of human pose data; in particular, human-computer interaction, virtual reality, and rehabilitation training require the 3D human pose. The primary task of 3D human pose estimation is to predict the three-dimensional coordinates of human body keypoints. Compared with image-based 2D pose estimation, which has matured considerably thanks to powerful deep learning methods [1, 2, 3, 4], 3D human pose estimation [5, 6, 7, 8] provides 3D keypoint positions. With depth information, the 3D human pose preserves the physical structure of the body and the connectivity between joints. This rich representation is essential for applications requiring precise pose information.
This paper targets 3D human pose estimation from multi-view video data, which constitutes the mainstream form of 3D pose datasets [9, 10, 11]. Besides the image frame content, these datasets provide temporal information and 3D spatial relations. Numerous studies aim to leverage the spatial-temporal information in these datasets to enhance human pose estimation [12, 13, 14]; they are broadly categorized into two types according to the backbone network: CNN-based methods [15] and transformer-based methods [16, 17, 18]. Given the richness of multi-view information and temporal relations in 3D human pose datasets, the transformer is well suited to model long-range dependencies with its attention mechanism. However, current transformer-based methods primarily investigate image features without harnessing the power of transformers to extract the rich feature information available in video datasets [9, 10, 11]. For instance, PRTR [18] adopts a structure comprising both transformer encoders and decoders, refining keypoint positions in a cascaded manner. Conversely, TokenPose [16] and TransPose [17] utilize an encoder-only transformer architecture to process features extracted by CNNs. These approaches fail to fully exploit the rich feature information in the datasets or to overcome the limitation of CNN-based methods, which focus solely on intrinsic image features.

Based on the shortcomings identified above, our network architecture is founded on the self-attention mechanism of the transformer, enabling the comprehensive extraction of the various kinds of rich feature information in 3D human pose estimation. Regarding input data, our network takes all frames of a video sequence and feeds them into the transformer architecture, facilitating the direct construction of spatial and temporal information across video frames. Our network comprises two components: the spatial module and the frame-image relation module.
Fig. 1 illustrates the structure of our proposed approach. Its effectiveness is evaluated on the widely adopted dataset for 3D human body pose estimation, Human3.6M [10]. As the Human3.6M dataset [10] comprises videos captured from four cameras, it inherently contains image content, 3D spatial positional information, and temporal information. The spatial module is designed to extract image features, while the image relations module harvests the 3D spatial and temporal features.
Firstly, the spatial module gathers internal information within video frames. It extracts feature information from images to construct intra-frame spatial relationships between keypoints. Considering that the Human3.6M [10] dataset is collected in the same laboratory scene, we perform image-patch cropping in this phase. A windowed self-attention mechanism is employed as the backbone to retain image patches with high attention and to eliminate the disturbance of non-human-body patches. This also reduces the computational load caused by long sequences of input images.
Secondly, the image relations module models the relationships between frame images. On the temporal scale, the output image features of the spatial module are treated as tokens to model the temporal relationships within a sequence of frames. On the 3D spatial scale, the 3D spatial positional relationships between the multi-view images are modeled. Given that the 3D spatial and temporal features require modeling global relationships, we employ a basic transformer model for this purpose. Our contributions are summarized in three key aspects:
• We introduce a novel 3D sequence-to-sequence human pose detection network that incorporates intra-frame spatial, temporal, and 3D positional information. Notably, we apply window self-attention for the first time to frame-based human pose detection, and we combine window self-attention with global self-attention, effectively reducing computational complexity while still modeling global relationships. This integration maximizes the advantages of both structures.
• In the spatial domain, we crop image patches, selectively preserving the patches relevant to pose detection. Focusing on pertinent information within the image improves network performance.
• In the temporal domain, we propose to model the temporal information between video frames and extract the corresponding 3D spatial positional information. This comprehensive consideration of temporal dynamics and inter-frame spatial relationships contributes to a more robust understanding of context in video-based human pose detection.
II Related Work
In this part, we present a review of relevant literature on 2D and 3D human pose estimation, discussing the insights derived from these methods, and elucidating the distinctions from our proposed approach. Subsequently, we delve into the impact of existing research on vision transformers, highlighting how these studies have influenced our methodology.
II-A 2D Human Pose Detection
Before the widespread application of transformers in various image tasks, most 2D human keypoint detection methods relied on Convolutional Neural Networks (CNNs) as the main backbone. These 2D human keypoint detection approaches are categorized into multi-person and single-person detection based on image content. Multi-person human keypoint detection is further divided into top-down [19, 20, 21, 22] and bottom-up [3, 4] approaches. In the bottom-up approach, all keypoints of all individuals are detected first, followed by using clustering algorithms to classify keypoints belonging to the same person, resulting in the final output. On the other hand, the top-down approach involves using object detection algorithms [23, 24] to obtain bounding boxes for each person and then performing single-person keypoint detection based on these bounding boxes. Among these methods, the top-down approach consistently achieves state-of-the-art results on benchmark datasets, leading to the choice of employing the top-down approach in this work.
While CNNs and their variants remain the primary backbone architecture for computer vision applications, the modeling capabilities of transformers [25, 26] for various information relationships are unparalleled by CNNs. In this context, we aim to leverage the transformer architecture to extract rich feature information from 3D human pose datasets. Despite the continued prominence of CNN models, the unique ability of transformers to harvest complex relationships motivates our exploration of their potential in capturing intricate features within 3D human pose data.
II-B 3D Human Pose Estimation
3D human pose estimation can be broadly categorized into direct methods [5] and two-stage methods [6, 7, 27]. Direct methods aim to extract raw feature information directly from images. For instance, C2F-Vol [5] draws inspiration from the Hourglass structure employed in 2D Human Pose Estimation (HPE) and represents the 3D pose in the form of 3D heatmaps. Two-stage methods, on the other hand, simplify the prediction process. The simple baseline [8] utilizes a relatively lightweight feedforward neural network for estimating 3D pose, taking 2D keypoint coordinates as input and directly mapping the 2D pose to 3D space through fully connected layers with residual connections. Hossain and Little [14] introduce a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units to leverage temporal information within input sequences. Pavllo et al. [15] incorporate a temporal convolutional network to estimate 3D poses from continuous 2D keypoint sequences. In [6, 28], efficient human pose estimation is explored using event point cloud data. However, these networks [14, 15] often struggle to model the information relationships between frames in multi-view video sequences effectively. To address this limitation, we employ a transformer structure to extract information from multi-view video datasets. Leveraging rich feature information, this approach yields accurate 2D human pose estimates. Finally, a two-stage method is employed to achieve 3D human pose estimation.
II-C Vision Transformer
After the initial introduction of transformers to vision tasks, numerous variant architectures suitable for various image-related tasks have been proposed. DETR [23] achieved an end-to-end framework for object detection. ViT [26] successfully employed a pure transformer structure for image classification, demonstrating excellent performance. Given the substantial computational resources required for self-attention on images, several models focusing on reducing computational complexity have been introduced. The Swin transformer [29] divides images into small windows, performing self-attention within these windows. TCFormer [30] achieves sparse input by clustering and pruning the input tokens. DynamicViT [31] scores the tokens generated from an image and discards low-scoring background tokens to produce sparse input. Moreover, the transformer architecture has been introduced to various tasks such as semantic segmentation [32, 33], content outpainting [34, 35], and pose estimation [2, 27, 36, 37]. In this work, our proposed network architecture leverages windowed self-attention to efficiently extract spatial information about human keypoints in the image. Furthermore, we introduce an image relations module to harvest temporal relationship features among frames in a video sequence and 3D spatial position features, enhancing estimation performance.

III Method
III-A Overview Architecture
In this section, we introduce a multi-stage framework for 3D sequence-to-sequence (seq2seq) human pose detection, as illustrated in Fig. 2. Our architecture primarily comprises a spatial module and a frame-image relation module, both adeptly leveraging the self-attention mechanism [25]. The spatial module is utilized to extract human body pose features inherent in the images themselves, while the frame-image relation module extracts temporal relationships and 3D spatial positional relationship features between the images. Connecting these two components is a frame-image information aggregation module, responsible for aggregating all tokenized information containing image details into a sequence of tokens that represent the frames within a video. Finally, keypoint coordinates for all input frames are directly estimated through a regression head, in accordance with the standard seq2seq architecture.
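To make the overall data flow concrete, the following is a minimal PyTorch sketch of the seq2seq pipeline described above. The module internals (a simple convolutional stem, average-pooling aggregation, a generic transformer encoder) and all names and dimensions are illustrative placeholders of our own choosing, not the actual backbone configuration of our network.

```python
import torch
from torch import nn

class PoseSeq2Seq(nn.Module):
    """Schematic data flow: per-frame spatial features -> one token per frame ->
    global frame relations -> keypoint regression for every input frame."""
    def __init__(self, feat_dim=256, num_joints=17, depth=4, heads=8):
        super().__init__()
        # stand-in for the windowed-attention spatial module (one feature map per frame)
        self.spatial = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=16, stride=16), nn.GELU())
        # frame-image information aggregation: pool each feature map to a single token
        self.aggregate = nn.AdaptiveAvgPool2d(1)
        # frame-image relation module: a standard transformer over per-frame tokens
        layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        self.relations = nn.TransformerEncoder(layer, depth)
        # regression head: 2D coordinates for every joint in every frame
        self.head = nn.Linear(feat_dim, num_joints * 2)

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        x = self.spatial(frames.flatten(0, 1))   # (B*T, C, h, w)
        tokens = self.aggregate(x).flatten(1)    # (B*T, C): one token per frame
        tokens = self.relations(tokens.view(B, T, -1))
        return self.head(tokens).view(B, T, -1, 2)

# seq2seq: every input frame receives its own pose estimate
poses = PoseSeq2Seq()(torch.randn(1, 8, 3, 224, 224))   # (1, 8, 17, 2)
```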
III-B Motivation
The fundamental premise of our research centers on the self-attention mechanism [25], an architecture adept at modeling dependencies within input sequences. Initially, the input image $I$ undergoes a patch embedding [26] operation,

$X = \mathrm{Embedding}(I), \quad X \in \mathbb{R}^{N \times D},$  (1)

where $X$ represents the patch sequence, with $N$ being the length of the patch sequence and $D$ being the dimensionality of each patch. Subsequently, the self-attention mechanism transforms $X$ into three vectors, $Q$, $K$, and $V$, through three distinct linear transformations:

$Q = X W_Q, \quad K = X W_K, \quad V = X W_V.$  (2)

Among these, $Q$ can be understood as the information to be queried, $K$ represents the patches to be queried, and $V$ signifies the features of each patch. Following the derivation of these three vectors, the computation of self-attention can be expressed as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{Q K^{\top}}{\sqrt{d_k}}\right) V,$  (3)

where $\sqrt{d_k}$ serves as a normalization factor for the values of $Q K^{\top}$, preventing numerical instability that may lead to vanishing gradients. The softmax function normalizes the computed scores, generating a weight matrix that, when multiplied by the feature vectors $V$, yields the attention values. To capture intricate feature information, the self-attention mechanism employs a multi-head self-attention layer ($\mathrm{MSA}$) for parallel processing of feature information. Each head computes self-attention simultaneously, and the outputs of the $h$ heads are concatenated. This can be expressed as:

$\mathrm{head}_i = \mathrm{Attention}\big(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\big), \quad i = 1, \dots, h,$  (4)

$\mathrm{MSA}(Q, K, V) = \mathrm{Concat}\big(\mathrm{head}_1, \dots, \mathrm{head}_h\big)\, W^{O}.$  (5)
In 3D pose estimation datasets, rich feature information is present, and we aim to model the specific dependencies among these features. Therefore, we employ the self-attention mechanism as the core mechanism in our network.
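For concreteness, the following is a minimal sketch of Eqs. (2)-(5) in PyTorch. The projection matrices, tensor shapes, and head count are illustrative assumptions rather than the configuration used in our network.

```python
import math
import torch

def multi_head_self_attention(x, num_heads, w_q, w_k, w_v, w_o):
    """Scaled dot-product self-attention over a patch sequence x of shape (B, N, D)."""
    B, N, D = x.shape
    d_head = D // num_heads
    # Eq. (2): three linear projections of the same input sequence
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # each (B, N, D)
    # split into heads: (B, num_heads, N, d_head)
    q, k, v = (t.view(B, N, num_heads, d_head).transpose(1, 2) for t in (q, k, v))
    # Eqs. (3)-(4): softmax(QK^T / sqrt(d_k)) V, computed per head in parallel
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
    heads = scores.softmax(dim=-1) @ v                         # (B, heads, N, d_head)
    # Eq. (5): concatenate the heads and apply the output projection
    return heads.transpose(1, 2).reshape(B, N, D) @ w_o

# toy usage with random projection matrices
B, N, D, H = 2, 196, 256, 8
x = torch.randn(B, N, D)
w_q, w_k, w_v, w_o = (torch.randn(D, D) * D**-0.5 for _ in range(4))
out = multi_head_self_attention(x, H, w_q, w_k, w_v, w_o)      # (B, N, D)
```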
III-C Spatial Module
As a network tailored for processing video sequences, we need to design a spatial feature extraction module capable of handling long input video frame sequences. Given the substantial volume of video frame data, we employ a moving window self-attention mechanism to extract spatial features, which helps alleviate the computational burden and reduce processing time. Specifically, we partition a single image into multiple small blocks using a small window, limiting self-attention computations to the interior of these blocks. Simultaneously, we incorporate window movement to capture global features, facilitating interaction with image information outside the current window. Subsequently, within the generated image blocks after movement, another round of self-attention computation is performed. Consequently, as illustrated in Fig. 3(a), a complete spatial transformer block involves two rounds of self-attention computation, which can be expressed as:
$\hat{z}^{l} = \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1},$  (6)

$z^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l},$  (7)

$\hat{z}^{l+1} = \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l},$  (8)

$z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}.$  (9)

In this context, $\hat{z}^{l}$ and $z^{l}$ respectively denote the output features of the window self-attention module and of the MLP module in block $l$. Here, W-MSA and SW-MSA refer to window-based multi-head self-attention utilizing the conventional and the sliding (shifted) window partitions, respectively.
To enable the network to learn hierarchical features, a patch merging module performs downsampling on the images. This aids in reducing the length of the patch sequence, increasing the patch dimension, and enlarging the receptive field for extracting contextual cues. The downsampling operation generates images at different scales, facilitating the network in learning features from multiple scales.
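A simplified sketch of the windowed self-attention described above is given below, following the regular/shifted window scheme of Eqs. (6)-(9). For brevity it omits the MLP sub-layer and the attention mask that a full Swin block applies after the cyclic shift, and all dimensions, window sizes, and names are illustrative assumptions.

```python
import torch

def window_partition(x, ws):
    # (B, H, W, C) feature map -> (num_windows*B, ws*ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    # inverse of window_partition
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowBlock(torch.nn.Module):
    """One (shifted-)window self-attention step; MLP and shift mask omitted."""
    def __init__(self, dim, heads, ws, shift=0):
        super().__init__()
        self.ws, self.shift = ws, shift
        self.norm = torch.nn.LayerNorm(dim)
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm(x)
        if self.shift:                         # cyclic shift for SW-MSA
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.ws)     # attention only inside each window
        win, _ = self.attn(win, win, win)
        x = window_reverse(win, self.ws, H, W)
        if self.shift:                         # undo the cyclic shift
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        return shortcut + x                    # residual connection

# usage: a regular window pass followed by a shifted window pass, as in Eqs. (6)-(9)
feat = torch.randn(2, 56, 56, 96)
out = WindowBlock(96, 3, ws=7, shift=3)(WindowBlock(96, 3, ws=7)(feat))  # (2, 56, 56, 96)
```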

III-D Image Relations Module
Following the extraction of image spatial features using the spatial module, we consolidate the information of each image into a vector that encapsulates the spatial characteristics of the image. Subsequently, we concatenate the feature information of each frame’s image as tokens and input them into the standard transformer model.
When modeling the relationships between images, as illustrated in Fig. 3(b), we employ a standard transformer model [25]. Because a significant amount of spatial image feature information has already been consolidated into a single vector per frame by the spatial module, global self-attention can be computed flexibly. In a standard transformer model with $T$ input frames of resolution $H \times W$ and patch size $P$, the token sequence fed to the transformer has length $T \cdot HW/P^{2}$ after patch mapping. In our structure, after spatial feature extraction, only $T$ tokens are fed to the transformer model. This significantly reduces computational complexity, enabling effortless calculation of global self-attention; for example, with $T = 8$ frames of $224 \times 224$ pixels and a patch size of $16$, a standard transformer would process $8 \times 196 = 1568$ tokens, whereas our relation module processes only $8$ frame tokens. The Image Relations Module is divided into two parts. The first part models the temporal relationships between video frames, utilizing self-attention to learn global dependencies across frames. The second part models the 3D spatial positional relationships between corresponding images: since data collection involves four cameras, each 3D pose corresponds to four 2D poses, and a spatial positional relationship exists among these four views, which we likewise learn with self-attention. Additionally, to fully leverage the temporal information in video sequences, long sequences of video frames may be input, a scenario our model readily accommodates.
Our Image Relations Module can be represented as follows:
$Z'_{l} = \mathrm{MSA}\big(\mathrm{LN}(Z_{l-1})\big) + Z_{l-1}, \quad l = 1, \dots, L,$  (10)

$Z_{l} = \mathrm{MLP}\big(\mathrm{LN}(Z'_{l})\big) + Z'_{l}, \quad l = 1, \dots, L,$  (11)

$Y = \mathrm{LN}(Z_{L}),$  (12)

where $\mathrm{MSA}$ represents the multi-head self-attention module, $\mathrm{LN}$ [38] represents the normalization layer, $Y$ is the output, and $Z_{l}$ represents the output of the $l$-th layer of the transformer module, with a total of $L$ layers.
We employ positional embeddings [26] to represent the temporal dependencies and the 3D spatial relationships of the images:

$Z_{0}^{\mathrm{time}} = [\,f_{1};\, f_{2};\, \dots;\, f_{T}\,] + E_{\mathrm{time}},$  (13)

$Z_{0}^{\mathrm{space}} = [\,f_{1};\, f_{2};\, f_{3};\, f_{4}\,] + E_{\mathrm{space}},$  (14)

where $f_{i}$ represents the feature information of each image, and $E_{\mathrm{time}}$ and $E_{\mathrm{space}}$ denote the embeddings of the two types of relationships.
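The sketch below illustrates one way the two relation stages of Eqs. (10)-(14) could be realized: per-frame tokens first attend over time within each camera view, then over the four synchronized views at each time step. The tensor layout, embedding shapes, depths, and module names are assumptions for illustration, not the exact implementation.

```python
import torch
from torch import nn

class ImageRelations(nn.Module):
    """Temporal self-attention over the frame tokens of each view, then
    self-attention over the four synchronized views of each time step."""
    def __init__(self, dim=256, heads=8, depth=4, max_frames=128, num_views=4):
        super().__init__()
        self.e_time = nn.Parameter(torch.zeros(1, max_frames, dim))    # cf. Eq. (13)
        self.e_space = nn.Parameter(torch.zeros(1, num_views, dim))    # cf. Eq. (14)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.spatial3d = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)

    def forward(self, tokens):                  # tokens: (B, V, T, D) per-frame features
        B, V, T, D = tokens.shape
        # temporal relations: each view contributes one sequence of T frame tokens
        x = tokens.reshape(B * V, T, D) + self.e_time[:, :T]
        x = self.temporal(x).reshape(B, V, T, D)
        # 3D spatial relations: the V views of one time step form a short sequence
        x = x.permute(0, 2, 1, 3).reshape(B * T, V, D) + self.e_space[:, :V]
        x = self.spatial3d(x).reshape(B, T, V, D).permute(0, 2, 1, 3)
        return x                                # relation-aware tokens, same shape

feats = torch.randn(1, 4, 32, 256)              # 4 camera views, 32 frames, 256-d tokens
out = ImageRelations()(feats)                   # (1, 4, 32, 256)
```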
MPJPE (mm) | Venue | Dir. | Disc. | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Average
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dabral et al. [39] | ECCV’18 | 44.8 | 50.4 | 44.7 | 49.0 | 52.9 | 61.4 | 43.5 | 45.5 | 63.1 | 87.3 | 51.7 | 48.5 | 52.2 | 37.6 | 41.9 | 52.1 |
Cai et al. [40] | ICCV’19 | 44.6 | 47.4 | 45.6 | 48.8 | 50.8 | 59.0 | 47.2 | 43.9 | 57.9 | 61.9 | 49.7 | 46.6 | 51.3 | 37.1 | 39.4 | 48.8 |
Pavllo et al. [15] | CVPR’19 | 45.2 | 46.7 | 43.3 | 45.6 | 48.1 | 55.1 | 44.6 | 44.3 | 57.3 | 65.8 | 47.1 | 44.0 | 49.0 | 32.8 | 33.9 | 46.8 |
Lin et al. [41] | BMVC’19 | 42.5 | 44.8 | 42.6 | 44.2 | 48.5 | 57.1 | 52.6 | 41.4 | 56.5 | 64.5 | 47.4 | 43.0 | 48.1 | 33.0 | 35.1 | 46.6 |
Yeh et al. [42] | NIPS’19 | 44.8 | 46.1 | 43.3 | 46.4 | 49.0 | 55.2 | 44.6 | 44.0 | 58.3 | 62.7 | 47.1 | 43.9 | 48.6 | 32.7 | 33.3 | 46.7 |
Liu et al. [12] | CVPR’20 | 41.8 | 44.8 | 41.1 | 44.9 | 47.4 | 54.1 | 43.4 | 42.2 | 56.2 | 63.6 | 45.3 | 43.5 | 45.3 | 31.3 | 32.2 | 45.1 |
SRNet [43] | ECCV’20 | 46.6 | 47.1 | 43.9 | 41.6 | 45.8 | 49.6 | 46.5 | 40.0 | 53.4 | 61.1 | 46.1 | 42.6 | 43.1 | 31.5 | 32.6 | 44.8 |
UGCN [44] | ECCV’20 | 41.3 | 43.9 | 44.0 | 42.2 | 48.0 | 57.1 | 42.2 | 43.2 | 57.3 | 61.3 | 47.0 | 43.5 | 47.0 | 32.6 | 31.8 | 45.6 |
Chen et al. [13] | TCSVT’21 | 42.1 | 43.8 | 41.0 | 43.8 | 46.1 | 53.5 | 42.4 | 43.1 | 53.9 | 60.5 | 45.7 | 42.1 | 46.2 | 32.2 | 33.8 | 44.6 |
PoseFormer [27] | ICCV’21 | 41.5 | 44.8 | 39.8 | 42.5 | 46.5 | 51.6 | 42.1 | 42.0 | 53.3 | 60.7 | 45.5 | 43.3 | 46.1 | 31.8 | 32.2 | 44.3 |
Ours | – | 35.2 | 41.0 | 37.9 | 36.9 | 39.4 | 45.7 | 38.7 | 38.7 | 54.4 | 58.2 | 40.8 | 38.8 | 41.0 | 27.5 | 29.5 | 40.3
P-MPJPE (mm) | Venue | Dir. | Disc. | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Average
Pavlakos et al. [45] | CVPR’18 | 34.7 | 39.8 | 41.8 | 38.6 | 42.5 | 47.5 | 38.0 | 36.6 | 50.7 | 56.8 | 42.6 | 39.6 | 43.9 | 32.1 | 36.5 | 41.8 |
Hossain et al. [14] | ECCV’18 | 35.7 | 39.3 | 44.6 | 43.0 | 47.2 | 54.0 | 38.3 | 37.5 | 51.6 | 61.3 | 46.5 | 41.4 | 47.3 | 34.2 | 39.4 | 44.1 |
Cai et al. [40] | ICCV’19 | 35.7 | 37.8 | 36.9 | 40.7 | 39.6 | 45.2 | 37.4 | 34.5 | 46.9 | 50.1 | 40.5 | 36.1 | 41.0 | 29.6 | 32.3 | 39.0 |
Lin et al. [41] | BMVC’19 | 32.5 | 35.3 | 34.3 | 36.2 | 37.8 | 43.0 | 33.0 | 32.2 | 45.7 | 51.8 | 38.4 | 32.8 | 37.5 | 25.8 | 28.9 | 36.8 |
Pavllo et al. [15] | CVPR’19 | 34.1 | 36.1 | 34.4 | 37.2 | 36.4 | 42.2 | 34.4 | 33.6 | 45.0 | 52.5 | 37.4 | 33.8 | 37.8 | 25.6 | 27.3 | 36.5 |
Liu et al. [12] | CVPR’20 | 32.3 | 35.2 | 33.3 | 35.8 | 35.9 | 41.5 | 33.2 | 32.7 | 44.6 | 50.9 | 37.0 | 32.4 | 37.0 | 25.2 | 27.2 | 35.6 |
UGCN [44] | ECCV’20 | 32.9 | 35.2 | 35.6 | 34.4 | 36.4 | 42.7 | 31.2 | 32.5 | 45.6 | 50.2 | 37.3 | 32.8 | 36.3 | 26.0 | 23.9 | 35.5 |
Chen et al. [13] | TCSVT’21 | 33.1 | 35.3 | 33.4 | 35.9 | 36.1 | 41.7 | 32.8 | 33.3 | 42.6 | 49.4 | 37.0 | 32.7 | 36.5 | 25.5 | 27.9 | 35.6 |
PoseFormer [27] | ICCV’21 | 32.5 | 34.8 | 32.6 | 34.6 | 35.3 | 39.5 | 32.1 | 32.0 | 42.8 | 48.5 | 34.8 | 32.4 | 35.3 | 24.5 | 26.0 | 34.6 |
Ours | – | 30.5 | 30.5 | 30.5 | 30.5 | 30.5 | 30.5 | 28.8 | 28.1 | 39.2 | 46.3 | 31.6 | 28.2 | 31.0 | 20.7 | 22.5 | 30.5
IV Experiments
IV-A Datasets
Human3.6M [10] is a large public dataset for 3D human pose estimation research, featuring 3.6 million images with corresponding 3D human poses. The dataset includes 11 professional actors (6 male, 5 female) and spans a variety of everyday scenarios (such as discussions, smoking, photography, and phone calls). Comprising videos captured by four calibrated high-resolution cameras at 50 Hz, the dataset's labels are derived from precise 3D joint positions and angles obtained from a high-speed motion capture system. We use five subjects from the dataset for training and reserve two subjects for testing.
We evaluate our approach on the Human3.6M [10] dataset. To comprehensively assess our method, the experiments are primarily divided into three categories. The first category involves evaluating the 2D detection results as our method aims to improve 3D results through enhanced 2D human pose detection. The second category assesses the application of 2D detection results to 3D detection for evaluating the 3D pose results. The third category evaluates the impact of different numbers of input frames on the detection results since our method models temporal relationships based on video frame sequences.
IV-B Implementation Details
Our method is implemented in PyTorch [46]. All experiments were conducted on a single NVIDIA RTX 3090 GPU. We selected five subjects from the Human3.6M dataset for training and evaluated the model on two subjects. Training employed the AdamW [47] optimizer with a cosine decay learning rate scheduler and a linear warm-up during the initial epochs. During training, three different input sequence lengths were used, i.e., 8, 32, and 128 frames. Our spatial module was built on a pre-trained Swin transformer backbone [29] and fine-tuned, while the image relations module was fine-tuned from a pre-trained vision transformer [26].
IV-C Evaluation of 2D Human Body Detection Results
We conducted a quantitative comparison using ViT-S [26], ViT-B [26], HRNet-32 [4], and HRNet-48 [4]. The model excluding the Image Relations Module is designated as the baseline model, and experiments were performed on this baseline to investigate the impact of the Image Relations Module on performance. Training was carried out on these networks using the Human3.6M dataset [10], and evaluation metrics such as AP (Average Precision), AR (Average Recall), PCK (Percentage of Correct Keypoints), and MSE (Mean Squared Error) were employed for the assessment of 2D human body pose. We computed the time taken for each model to train one epoch on the S1 subject as an indicator of the model’s inference speed.
From Table I, it is evident that our approach achieves the highest accuracy with minimal error. Compared to the baseline model, our full model exhibits higher accuracy, indicating the effectiveness of the Image Relations Module and its significant enhancement of overall performance. Although not the fastest in terms of inference speed, our method demonstrates excellent cost-effectiveness: it notably improves prediction accuracy without substantially increasing the inference time. Moreover, our model shows only a marginal increase in inference time compared to the baseline model, suggesting that the Image Relations Module does not introduce significant computational overhead.
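As a reference for the 2D metric, the sketch below shows a typical PCK computation: a joint counts as correct when its distance to the ground truth falls below a threshold times a per-sample normalization length. The normalization length (e.g. head or torso size) depends on the evaluation protocol and is therefore left as an input; the array shapes and threshold are illustrative.

```python
import numpy as np

def pck(pred, gt, norm, thr=0.5):
    """Percentage of Correct Keypoints for 2D poses.
    pred, gt: (N, J, 2) joint coordinates; norm: (N,) normalization lengths."""
    dist = np.linalg.norm(pred - gt, axis=-1)          # (N, J) per-joint pixel errors
    correct = dist <= thr * norm[:, None]              # compare against the threshold
    return 100.0 * correct.mean()                      # percentage in [0, 100]

# toy usage with noisy predictions and a fixed normalization length
gt = np.random.rand(8, 17, 2) * 256
pred = gt + np.random.randn(8, 17, 2) * 5
print(pck(pred, gt, norm=np.full(8, 60.0)))
```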
IV-D Evaluation of 3D Human Pose Detection Results
The 2D human body pose estimation network proposed in this paper is designed to facilitate 3D human body pose estimation. To demonstrate the effectiveness of our network, we utilize the 2D human body pose estimation results as input for 3D human body pose estimation. PoseFormer [27] is employed as the 3D human body pose detection network, and the network is trained following the experimental settings of PoseFormer [27].
We use MPJPE (Mean Per Joint Position Error) [48] as the evaluation metric, which measures the mean Euclidean distance between predicted and ground-truth joint positions. Evaluation is conducted on the test set (S9, S11) over 15 actions. As shown in Table II, inference is performed for the 15 actions of the two test subjects (S9, S11), and the last column reports the average error across all actions.
As shown in Table II, we use the 2D poses generated by our method as input, employing PoseFormer [27] as the 3D pose network, and quantitatively evaluate the average error of the joint positions using MPJPE and P-MPJPE as evaluation criteria. We draw on the relevant experimental results reported in PoseFormer [27]. In the original table, red indicates the best performance and blue the second-best. From Table II, our method leads on both evaluation metrics. Compared to the original PoseFormer [27], our method reduces the average MPJPE by 4.0 mm, decreasing from 44.3 mm to 40.3 mm. In terms of the MPJPE metric, our method achieves the best result on almost all actions, except for the 'Sit' action. In the P-MPJPE [48] metric, our method achieves the best result on all actions, demonstrating an enhancement in 3D human body pose estimation effectiveness.
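For clarity, the sketch below shows how MPJPE and P-MPJPE are commonly computed: the former is the mean Euclidean joint error, the latter measures the same error after a rigid Procrustes alignment (scale, rotation, translation) of each predicted pose to the ground truth. Array shapes are illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error (e.g. in mm). pred, gt: (N, J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_mpjpe(pred, gt):
    """MPJPE after Procrustes-aligning each predicted pose to its ground truth."""
    errs = []
    for p, g in zip(pred, gt):
        mu_p, mu_g = p.mean(0), g.mean(0)
        p0, g0 = p - mu_p, g - mu_g                    # centre both poses
        # optimal rotation and scale from the SVD of the cross-covariance matrix
        u, s, vt = np.linalg.svd(g0.T @ p0)
        r = u @ vt
        if np.linalg.det(r) < 0:                       # avoid an improper reflection
            u[:, -1] *= -1
            s[-1] *= -1
            r = u @ vt
        scale = s.sum() / (p0 ** 2).sum()
        aligned = scale * p0 @ r.T + mu_g              # rigidly aligned prediction
        errs.append(np.linalg.norm(aligned - g, axis=-1).mean())
    return float(np.mean(errs))

pred, gt = np.random.rand(4, 17, 3), np.random.rand(4, 17, 3)
print(mpjpe(pred, gt), p_mpjpe(pred, gt))
```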
As shown in Fig. 4, our method processes all images in the Human3.6M dataset and outputs 2D human body poses for each of them. Many complex actions are detected accurately, and Fig. 4 shows that many occluded human keypoints are predicted correctly, thanks to the image relations module in our network, which can infer these occluded keypoints from temporal relations and 3D spatial positional relations. We feed all predicted 2D human body poses into the 3D human pose detection network PoseFormer to reconstruct the 3D human body poses.

IV-E Frame Sequence Length Analysis
Frame | AP | AR | PCK | MSE |
---|---|---|---|---|
8 | 90.3 | 94.1 | 97.7 | 40.2 |
32 | 91.4 | 94.8 | 98.6 | 33.6 |
128 | 92.1 | 95.3 | 98.9 | 28.7 |
Most 3D pose detection is based on video datasets [9, 10, 11], so we want the network to be able to process the information of an entire video and learn the relationships between all of its frames. We set three different input frame sequence lengths, i.e., 8, 32, and 128 frames, to evaluate the effect of sequence length on the network. Table III shows the effect of the number of frames on the results. As the number of frames increases, the accuracy improves, indicating that the longer the input video sequence, the easier it is to capture the relationships between frame images, resulting in better overall performance. This result shows that our network is able to model the relationships within long input sequences, extract the temporal relations in them, and achieve better output.
V Conclusion
In this paper, we have looked into multi-perspective temporal-relational 3D human pose estimation and proposed a 2D human body pose detection network that incorporates temporal and spatial relationships to enhance 3D human body pose detection. The spatial module is designed to extract features from individual images, whereas an Image Relations Module is established for the global modeling of relationships between images. The Image Relations Module not only captures temporal relationships between video frames but also learns 3D spatial positional relationships. Extensive experimental results demonstrate that our proposed approach not only achieves state-of-the-art performance in 2D human body pose estimation but also significantly enhances the effectiveness of 3D human body pose estimation. We also investigated the impact of the length of video frame sequences on our approach and observed an improvement in accuracy as the length of the input video frame sequences increased.
The future work of this study will focus on two main directions. Firstly, we intend to investigate pruning techniques for Transformers to reduce the computational complexity of self-attention, thereby enhancing the modeling of spatial-temporal context in an efficient manner. We aim to process entire video sequences through the network efficiently, meeting the real-time processing requirements. Secondly, we plan to integrate spatial geometry and self-attention mechanisms to better model the 3D spatial relationships within video frames, leading to improved output results for 3D human body pose estimation.
References
- [1] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proc. CVPR, 2016, pp. 4724–4732.
- [2] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “ViTPose: Simple vision transformer baselines for human pose estimation,” in Proc. NeurIPS, vol. 35, 2022, pp. 38 571–38 584.
- [3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2D pose estimation using part affinity fields,” in Proc. CVPR, 2017, pp. 1302–1310.
- [4] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proc. CVPR, 2019, pp. 5693–5703.
- [5] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, “Coarse-to-fine volumetric prediction for single-image 3D human pose,” in Proc. CVPR, 2017, pp. 1263–1272.
- [6] J. Chen, H. Shi, Y. Ye, K. Yang, L. Sun, and K. Wang, “Efficient human pose estimation via 3D event point cloud,” in Proc. 3DV, 2022, pp. 1–10.
- [7] H. Qiu, C. Wang, J. Wang, N. Wang, and W. Zeng, “Cross view fusion for 3D human pose estimation,” in Proc. ICCV, 2019, pp. 4341–4350.
- [8] J. Martinez, R. Hossain, J. Romero, and J. J. Little, “A simple yet effective baseline for 3D human pose estimation,” in Proc. ICCV, 2017, pp. 2659–2668.
- [9] D. Mehta et al., “Monocular 3D human pose estimation in the wild using improved CNN supervision,” in Proc. 3DV, 2017.
- [10] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, 2014.
- [11] L. Sigal, A. O. Balan, and M. J. Black, “HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion,” International Journal of Computer Vision, vol. 87, no. 1-2, pp. 4–27, 2010.
- [12] R. Liu, J. Shen, H. Wang, C. Chen, S.-c. Cheung, and V. Asari, “Attention mechanism exploits temporal contexts: Real-time 3D human pose reconstruction,” in Proc. CVPR, 2020, pp. 5063–5072.
- [13] T. Chen, C. Fang, X. Shen, Y. Zhu, Z. Chen, and J. Luo, “Anatomy-aware 3D human pose estimation with bone-based pose decomposition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 198–209, 2022.
- [14] M. R. I. Hossain and J. J. Little, “Exploiting temporal information for 3D human pose estimation,” in Proc. ECCV, vol. 11214, 2018, pp. 69–86.
- [15] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli, “3D human pose estimation in video with temporal convolutions and semi-supervised training,” in Proc. CVPR, 2019, pp. 7753–7762.
- [16] Y. Li et al., “TokenPose: Learning keypoint tokens for human pose estimation,” in Proc. ICCV, 2021, pp. 11 293–11 302.
- [17] S. Yang, Z. Quan, M. Nie, and W. Yang, “TransPose: Keypoint localization via transformer,” in Proc. ICCV, 2021, pp. 11 782–11 792.
- [18] K. Li, S. Wang, X. Zhang, Y. Xu, W. Xu, and Z. Tu, “Pose recognition with cascade transformers,” in Proc. CVPR, 2021, pp. 1944–1953.
- [19] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded pyramid network for multi-person pose estimation,” in Proc. CVPR, 2018, pp. 7103–7112.
- [20] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in Proc. ECCV, vol. 9912, 2016, pp. 483–499.
- [21] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in Proc. ECCV, vol. 11210, 2018, pp. 472–487.
- [22] W. Li et al., “Rethinking on multi-stage networks for human pose estimation,” arXiv preprint arXiv:1901.00148, 2019.
- [23] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. ECCV, vol. 12346, 2020, pp. 213–229.
- [24] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” in Proc. ICCV, 2017, pp. 2980–2988.
- [25] A. Vaswani et al., “Attention is all you need,” in Proc. NeurIPS, vol. 30, 2017, pp. 6000–6010.
- [26] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. ICLR, 2021.
- [27] C. Zheng, S. Zhu, M. Mendieta, T. Yang, C. Chen, and Z. Ding, “3D human pose estimation with spatial and temporal transformers,” in Proc. ICCV, 2021, pp. 11 636–11 645.
- [28] X. Yin et al., “Rethinking event-based human pose estimation with 3D event representations,” arXiv preprint arXiv:2311.04591, 2023.
- [29] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. ICCV, 2021, pp. 9992–10 002.
- [30] W. Zeng et al., “Not all tokens are equal: Human-centric visual analysis via token clustering transformer,” in Proc. CVPR, 2022, pp. 11 091–11 101.
- [31] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, “DynamicViT: Efficient vision transformers with dynamic token sparsification,” in Proc. NeurIPS, vol. 34, 2021, pp. 13 937–13 949.
- [32] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” in Proc. NeurIPS, vol. 34, 2021, pp. 12 077–12 090.
- [33] J. Zhang, K. Yang, A. Constantinescu, K. Peng, K. Müller, and R. Stiefelhagen, “Trans4Trans: Efficient transformer for transparent object and semantic scene segmentation in real-world navigation assistance,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 19 173–19 186, 2022.
- [34] H. Shi et al., “FishDreamer: Towards fisheye semantic completion via unified image outpainting and segmentation,” in Proc. CVPRW, 2023, pp. 6434–6444.
- [35] H. Shi, Q. Jiang, K. Yang, X. Yin, H. Ni, and K. Wang, “Beyond the field-of-view: Enhancing scene visibility and perception with clip-recurrent transformer,” arXiv preprint arXiv:2211.11293, 2022.
- [36] W. Li, H. Liu, H. Tang, P. Wang, and L. Van Gool, “MHFormer: Multi-hypothesis transformer for 3D human pose estimation,” in Proc. CVPR, 2022, pp. 13 137–13 146.
- [37] W. Li, H. Liu, R. Ding, M. Liu, P. Wang, and W. Yang, “Exploiting temporal contexts with strided transformer for 3D human pose estimation,” IEEE Transactions on Multimedia, vol. 25, pp. 1282–1293, 2023.
- [38] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
- [39] R. Dabral, A. Mundhada, U. Kusupati, S. Afaque, A. Sharma, and A. Jain, “Learning 3D human pose from structure and motion,” in Proc. ECCV, vol. 11213, 2018, pp. 679–696.
- [40] Y. Cai et al., “Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks,” in Proc. ICCV, 2019, pp. 2272–2281.
- [41] J. Lin and G. H. Lee, “Trajectory space factorization for deep video-based 3D human pose estimation,” in Proc. BMVC, 2019, p. 101.
- [42] R. Yeh, Y.-T. Hu, and A. Schwing, “Chirality nets for human pose regression,” in Proc. NeurIPS, vol. 32, 2019, pp. 8161–8171.
- [43] A. Zeng, X. Sun, F. Huang, M. Liu, Q. Xu, and S. Lin, “SRNet: Improving generalization in 3D human pose estimation with a split-and-recombine approach,” in Proc. ECCV, vol. 12359, 2020, pp. 507–523.
- [44] J. Wang, S. Yan, Y. Xiong, and D. Lin, “Motion guided 3D pose estimation from videos,” in Proc. ECCV, vol. 12358, 2020, pp. 764–780.
- [45] G. Pavlakos, X. Zhou, and K. Daniilidis, “Ordinal depth supervision for 3D human pose estimation,” in Proc. CVPR, 2018, pp. 7307–7316.
- [46] A. Paszke et al., “Automatic differentiation in PyTorch,” 2017.
- [47] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
- [48] C. Zheng et al., “Deep learning-based human pose estimation: A survey,” ACM Computing Surveys, vol. 56, no. 1, pp. 1–37, 2024.