
Self-Attentive 3D Human Pose and Shape Estimation from Videos

Yun-Chun Chen, Marco Piccirilli, Robinson Piramuthu, Ming-Hsuan Yang ([email protected])
Department of Computer Science, University of Toronto, ON, Canada; eBay Inc., San Jose, CA, USA; Amazon, Oakland, CA, USA; School of Engineering, University of California at Merced, CA, USA

Abstract

We consider the task of estimating 3D human pose and shape from videos. While existing frame-based approaches have made significant progress, these methods are independently applied to each image, thereby often leading to inconsistent predictions. In this work, we present a video-based learning algorithm for 3D human pose and shape estimation. The key insights of our method are two-fold. First, to address the inconsistent temporal prediction issue, we exploit temporal information in videos and propose a self-attention module that jointly considers short-range and long-range dependencies across frames, resulting in temporally coherent estimations. Second, we model human motion with a forecasting module that allows the transition between adjacent frames to be smooth. We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. Extensive experimental results show that our algorithm performs favorably against the state-of-the-art methods.

Keywords: 3D human pose and shape estimation, Self-supervised learning, Occlusion handling
MSC: 41A05, 41A10, 65D05, 65D17
Journal: Computer Vision and Image Understanding

1 Introduction

3D human pose and shape estimation (Kanazawa et al., 2018; Kolotouros et al., 2019a; Bogo et al., 2016) is an active research topic in computer vision and computer graphics that finds numerous applications (Xu et al., 2019; Liu et al., 2019). The inherent under-constrained nature where multiple 3D meshes can explain the same 2D projection makes this problem very challenging. While frame-based methods (Kanazawa et al., 2018; Kolotouros et al., 2019a; Bogo et al., 2016) and video-based approaches (Kocabas et al., 2020; Lee et al., 2018b; Rayat Imtiaz Hossain and Little, 2018; Kanazawa et al., 2019; Zhang et al., 2019b) have been developed to recover human pose in the literature, numerous issues remain to be addressed. First, existing approaches employ recurrent neural networks (RNNs) to model temporal information for consistent predictions. However, it is difficult to train RNNs to capture long-range dependencies (Vaswani et al., 2017; Pascanu et al., 2013). On the other hand, one recent approach employing RNNs does not consistently render smooth predictions across frames (Kocabas et al., 2020).

Refer to caption

Fig. 1: 3D human pose and shape estimation. The results are generated by our method without prior information of camera or manual initialization. The embedded video can be viewed using Adobe Acrobat.

Second, as most real-world datasets do not contain ground-truth camera parameter annotations, existing methods typically reproject the predicted 3D joints onto the 2D space using the estimated camera parameters, followed by a loss enforced between the reprojected 2D joints and the corresponding ground-truth 2D joints. Nevertheless, such regularization terms are still insufficient to account for complex scenes. Third, existing methods (Kocabas et al., 2020; Kanazawa et al., 2019; Zhang et al., 2019b; Kanazawa et al., 2018) do not perform well for humans under heavy occlusion or out-of-view, as there is no explicit constraint enforced on the invisible regions.

In this paper, we propose the Self-attentive Pose and Shape Network (SPS-Net) for 3D human pose and shape estimation from videos. Our key insights are two-fold. First, motivated by the attention models in neural machine translation (Vaswani et al., 2017) and image generation (Zhang et al., 2019a) tasks, we develop a self-attention module to exploit temporal cues in videos for coherent predictions. For each input frame, our self-attention module derives a visual representation by observing past and future frames and predicting the associated attention weights. Second, motivated by the autoregressive models in human motion prediction (Kanazawa et al., 2019; Zhang et al., 2019b), we develop a forecasting module that leverages visual cues from human motion to encourage our model to generate temporally smooth predictions. By jointly considering both features, our SPS-Net is able to estimate accurate and temporally coherent human pose and shape (see Figure 1).

To account for images without ground-truth camera parameter annotations, we exploit the property that the camera parameters for the overlapped frames of two segments from the same video should be the same. We enforce this constraint with a camera parameter consistency loss. Furthermore, we address the occlusion and out-of-view issues by masking out some regions of the video frames. Our core idea is to leverage the predictions of the original video frames to supervise those of the synthesized occluded or partially visible data, making our model more robust to the occlusion and out-of-view issues. We demonstrate the effectiveness of the proposed SPS-Net on three standard benchmarks, including the 3DPW (von Marcard et al., 2018), MPI-INF-3DHP (Mehta et al., 2017a), and Human3.6M (Ionescu et al., 2013) datasets.

Our main contributions can be summarized as follows:

  • We present a video-based learning algorithm for 3D human pose and shape estimation.

  • We propose a camera parameter consistency loss that provides additional supervisory signals for model training, resulting in more accurate camera parameter predictions.

  • Our model learns to predict plausible estimations when occlusion or out-of-view occurs in a self-supervised fashion.

  • Extensive evaluations on three challenging benchmarks demonstrate that our method achieves the state-of-the-art performance against existing approaches.

2 Related Work

3D human pose and shape estimation. Existing methods for 3D human pose and shape estimation can be broadly categorized as frame-based and video-based. Frame-based methods typically use an off-the-shelf keypoint detector (e.g., DeepCut (Pishchulin et al., 2016)) to fit the SMPL (Loper et al., 2015) body model (Bogo et al., 2016), leverage silhouettes and keypoints for model fitting (Lassner et al., 2017), or directly regress the parameters for the SMPL (Loper et al., 2015) body model from pixels using neural networks (Kolotouros et al., 2019a; Kanazawa et al., 2018; Kolotouros et al., 2019b). While these frame-based approaches are able to recover 3D poses from a single image, independently applying these algorithms to each video frame often leads to temporally inconsistent predictions. Video-based methods, on the other hand, usually adopt RNN-based models to generate temporally coherent predictions. These approaches either focus on estimating the human body of the current frame (Arnab et al., 2019; Sun et al., 2019; Kocabas et al., 2020) or predicting the past and future motions (Kanazawa et al., 2019; Zhang et al., 2019b).

Our algorithm differs from these video-based methods in three aspects. First, in contrast to adopting RNN-based models, we develop a self-attention module to aggregate temporal information and a forecasting module to model human motion for predicting temporally coherent estimations. Second, we enforce a consistency loss on the prediction of camera parameters to regularize model learning. Third, we address the occlusion and out-of-view issues with a self-supervised learning scheme to generate plausible human pose and shape predictions.

Attention models. Attention models have been shown effective in neural machine translation (Vaswani et al., 2017) and image generation problems (Zhang et al., 2019a; Parmar et al., 2018). For machine translation, employing self-attention models (Vaswani et al., 2017) helps capture short-range and long-range correlations between tokens in the sentence for improving the translation quality. In image generation, the Image Transformer (Parmar et al., 2018) and SAGAN (Zhang et al., 2019a) show that leveraging self-attention mechanisms facilitates the models to generate realistic images. In 3D human pose and shape estimation, the VIBE (Kocabas et al., 2020) method adopts a self-attention scheme in the discriminator for feature aggregation, allowing the discriminator to better distinguish the motions of attended video frames between the real sequences and generated ones.

We adopt self-attention modules in both the SPS-Net and discriminator. Our method differs from the VIBE (Kocabas et al., 2020) in that our self-attention module aims to derive a representation for each frame that contains temporal information by jointly considering short-range and long-range dependencies across video frames, whereas the VIBE (Kocabas et al., 2020) method aims to derive a single representation for the entire pose sequence.

Future human pose predictions. Predicting future poses from videos has been studied by a few approaches in the literature. Existing algorithms estimate 2D poses from pixels (Denton and Birodkar, 2017; Finn et al., 2016), optical flow (Walker et al., 2016), or 2D poses (Walker et al., 2017), or predict 3D outputs based on 3D inputs (Butepage et al., 2017; Fragkiadaki et al., 2015; Jain et al., 2016; Li et al., 2018; Villegas et al., 2018). Other approaches learn 3D pose prediction from 2D inputs (Zhang et al., 2019b; Kanazawa et al., 2019).

Similar to the HMMR (Kanazawa et al., 2019) and PHD (Zhang et al., 2019b) methods, we leverage visual cues from human motion to predict temporally smooth predictions. Our method differs from them in that our self-attention module helps capture short-range and long-range dependencies across video frames in the input video, while the 1D convolution in the temporal encoder and autoregressive module of these methods does not have such ability.

Refer to caption
Fig. 2: Overview of the Self-attentive Pose and Shape Network (SPS-Net). Our SPS-Net is composed of four main components: a feature encoder EE (highlighted in blue), a self-attention module AA (highlighted in green), a forecasting module FF (highlighted in orange), and three parameter regressors RshapeR_{\mathrm{shape}}, RposeR_{\mathrm{pose}}, and RcameraR_{\mathrm{camera}} (highlighted in yellow). The feature encoder extracts features from input video frames. The encoded features are then passed to the self-attention module to produce latent representations that contain temporal information of past and future frames and to the forecasting module to predict the features of the next time step. The latent representations and the predicted features of the same time step are forwarded to the feature fusion module for feature aggregation. Finally, the fused representations are passed to three parameter regressors to predict the corresponding shape, pose, and camera parameters, respectively.

Consistency constraints for visual learning. Exploiting consistency constraints to regularize model learning has been shown effective in numerous applications, including semantic matching (Zhou et al., 2015), optical flow estimation (Meister et al., 2018), depth prediction (Gordon et al., 2019), and image-to-image translation (Zhu et al., 2017; Lee et al., 2018a; Huang et al., 2018). Other methods exploit consistency constraints across multiple network outputs, including depth and optical flow estimation (Zou et al., 2018), joint semantic matching and object co-segmentation (Chen et al., 2020), ego-motion (Zhou et al., 2017), and domain adaptation (Chen et al., 2019b). In our work, we show that enforcing consistency constraints on the prediction of camera parameters for the overlapped video frames of two segments from the same video results in performance improvement.

3 Proposed Algorithm

In this section, we first provide an overview of our approach. Next, we describe the details of the self-attention and forecasting modules, followed by formulating the proposed camera parameter consistency loss. We then motivate the self-supervised learning scheme for addressing the occlusion and out-of-view issues.

3.1 Algorithmic overview

Given an input video V={Ii}i=1NV=\{I_{i}\}_{i=1}^{N} of length NN containing a single person, our goal is to learn a model that recovers the 3D human body of each frame. We present the Self-attentive Pose and Shape Network (SPS-Net), comprising four components: 1) feature encoder EE, 2) self-attention module AA, 3) forecasting module FF, and 4) three parameter regressors RshapeR_{\mathrm{shape}}, RposeR_{\mathrm{pose}}, and RcameraR_{\mathrm{camera}}.

As shown in Figure 2, we first apply the encoder EE to each frame IiVI_{i}\in V to extract the feature fi=E(Ii)df_{i}=E(I_{i})\in\mathbb{R}^{d}, where dd denotes the number of channels of the feature fif_{i}. Next, the self-attention module AA takes all the encoded features {fi}i=1N\{f_{i}\}_{i=1}^{N} as input and outputs the corresponding latent representations {hi}i=1N\{h_{i}\}_{i=1}^{N}, where hidh_{i}\in\mathbb{R}^{d} denotes the latent representation for IiI_{i}, containing temporal information of past and future frames. The forecasting module FF takes each encoded feature fif_{i} as input and forecasts the feature of the next time step fi+1=F(fi)df_{i+1}^{\prime}=F(f_{i})\in\mathbb{R}^{d}. The latent representations {hi}i=1N\{h_{i}\}_{i=1}^{N} and the predicted features {fi+1}i=1N\{f_{i+1}^{\prime}\}_{i=1}^{N} of the same time step (e.g., hih_{i} and fif_{i}^{\prime}) are passed to a feature fusion module to derive the fused representations {Fi}i=1N\{F_{i}\}_{i=1}^{N}, where FidF_{i}\in\mathbb{R}^{d} contains both global temporal and local motion information. The pose parameter regressor RposeR_{\mathrm{pose}} takes each fused representation FiF_{i} as input and renders the pose parameters θi\theta_{i} for each frame IiI_{i}, where θi=Rpose(Fi)72\theta_{i}=R_{\mathrm{pose}}(F_{i})\in\mathbb{R}^{72}. The shape parameter regressor RshapeR_{\mathrm{shape}}, on the other hand, takes all the fused representations {Fi}i=1N\{F_{i}\}_{i=1}^{N} as input and regresses the shape parameters β10\beta\in\mathbb{R}^{10} of the input video VV.

3D human body representation. Similar to the state-of-the-art methods (Kanazawa et al., 2018; Kolotouros et al., 2019a; Kocabas et al., 2020), we adopt the SMPL (Loper et al., 2015) body model to describe the human body using a 3D mesh representation. The SMPL (Loper et al., 2015) model is described by the pose \theta \in \mathbb{R}^{72} and shape \beta \in \mathbb{R}^{10} parameters. The pose parameters \theta contain the global body rotation and the relative 3D rotations of 23 joints in axis-angle format. The shape parameters \beta are parameterized by the first 10 linear coefficients of a PCA shape space. We use a gender-neutral shape model as in previous work (Kanazawa et al., 2018; Kolotouros et al., 2019a; Kocabas et al., 2020). The differentiable SMPL (Loper et al., 2015) body model takes the pose \theta and shape \beta parameters as input and outputs a triangular mesh M(\theta,\beta) \in \mathbb{R}^{6890\times 3} consisting of 6,890 mesh vertices by shaping a template body mesh based on forward kinematics. The 3D keypoints X \in \mathbb{R}^{k\times 3} of k body joints are obtained by applying a pre-trained linear regressor W to the 3D mesh M(\theta,\beta), i.e., X = WM(\theta,\beta).

Refer to caption
Fig. 3: Overview of the self-attention module AA.

Camera model. Similar to existing approaches (Kanazawa et al., 2018; Kolotouros et al., 2019a; Kocabas et al., 2020), we use a weak-perspective camera model in this work. By estimating the camera parameters {s,R,t}\{s,R,t\} using the regressor RcameraR_{\mathrm{camera}}, where ss\in\mathbb{R} denotes the scale, R3×3R\in\mathbb{R}^{3\times 3} is the global rotation in axis-angle format, and t2t\in\mathbb{R}^{2} denotes the translation, the 2D projection xk×2x\in\mathbb{R}^{k\times 2} of the 3D keypoints XX can be obtained by x=sΠ(RX(θ,β))+tx=s\Pi(RX(\theta,\beta))+t, where Π\Pi is an orthographic projection.
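To make the camera model concrete, the PyTorch sketch below computes the 3D keypoints X = WM(\theta,\beta) from the mesh vertices and projects them with the weak-perspective camera x = s\Pi(RX) + t. It is an illustration under our own assumptions rather than the authors' released code; the joint regressor W, the number of joints (24), and all variable names are placeholders.

import torch

def project_weak_perspective(vertices, joint_regressor, s, R, t):
    # 3D keypoints X = W M(theta, beta): (k, 6890) @ (6890, 3) -> (k, 3)
    X = joint_regressor @ vertices
    # Global rotation followed by the orthographic projection Pi (drop the depth axis)
    X_rot = X @ R.T                  # (k, 3)
    return s * X_rot[:, :2] + t      # (k, 2) reprojected 2D keypoints

# Toy usage with random tensors; real inputs come from the SMPL model and R_camera.
vertices = torch.randn(6890, 3)      # mesh M(theta, beta)
W = torch.rand(24, 6890)             # hypothetical linear joint regressor
x2d = project_weak_perspective(vertices, W, s=torch.tensor(1.0),
                               R=torch.eye(3), t=torch.zeros(2))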

3.2 Self-attention module

Given a sequence of features \{f_{i}\}_{i=1}^{N} encoded by the encoder E, our goal is to leverage temporal cues in the input video to provide more information that helps regularize the estimation of human pose and shape. Existing methods exploit temporal information by resorting to an RNN-based model, e.g., a GRU (Kocabas et al., 2020) or an LSTM (Lee et al., 2018b; Rayat Imtiaz Hossain and Little, 2018). However, RNN-based models are difficult to train and struggle to capture long-range dependencies (Vaswani et al., 2017; Pascanu et al., 2013).

Motivated by the attention models (Vaswani et al., 2017; Zhang et al., 2019a; Parmar et al., 2018), which have been shown effective at jointly capturing short-range and long-range dependencies while being more parallelizable to train (Vaswani et al., 2017), we develop a self-attention module to learn latent representations h that jointly observe past and future video frames for producing temporally consistent pose and shape predictions.

We exploit the idea that occluded frames can benefit from the information of non-occluded frames, while non-occluded frames do not have to depend on the information from occluded frames, i.e., the attention between two frames can be anti-symmetric. To achieve this, our self-attention module A contains an attention network Q and an attention network K. As shown in Figure 3, for each feature f_{i}, we apply the attention network Q and the attention network K to encode an attention vector q_{i} = Q(f_{i}) \in \mathbb{R}^{d} and an attention vector k_{i} = K(f_{i}) \in \mathbb{R}^{d}, respectively. To consider the dependency between two input frames I_{i} and I_{j}, we compute the inner product between the attention vector q_{i} of frame I_{i} and the attention vector k_{j} of frame I_{j}, i.e., w_{i}^{j} = q_{i} \cdot k_{j} \in \mathbb{R}. To derive the latent representation h_{i} for frame I_{i}, we apply a softmax layer to the weights \{w_{i}^{l}\}_{l=1}^{N} computed between q_{i} and all attention vectors \{k_{l}\}_{l=1}^{N}. The attention weights \{a_{i}^{l}\}_{l=1}^{N} are computed by

a_{i}^{l}=\frac{\exp(w_{i}^{l})}{\sum_{n=1}^{N}\exp(w_{i}^{n})}.  (1)

We then apply a weighted sum layer to sum over all input features {fl}l=1N\{f_{l}\}_{l=1}^{N} with the associated attention weights {ail}l=1N\{a_{i}^{l}\}_{l=1}^{N}. In addition, we add a residual connection (He et al., 2016) to pass the input feature fif_{i} to the output of the self-attention module. Specifically, the latent representation hih_{i} is described by

h_{i}=f_{i}+\sum_{l=1}^{N}a_{i}^{l}\cdot f_{l}.  (2)
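A minimal PyTorch sketch of the self-attention module A is given below, assuming the two-layer fully connected attention networks with LeakyReLU described in Section 4.1; it illustrates Eqs. (1)-(2) rather than the exact released implementation. Because Q and K are separate networks, the weight that frame i places on frame j need not equal the weight that frame j places on frame i, which realizes the anti-symmetric attention discussed above.

import torch
import torch.nn as nn

def _mlp(dim):
    # Two fully connected layers, each followed by a LeakyReLU (Section 4.1)
    return nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(),
                         nn.Linear(dim, dim), nn.LeakyReLU())

class SelfAttention(nn.Module):
    """Sketch of the self-attention module A."""
    def __init__(self, dim=2048):
        super().__init__()
        self.Q = _mlp(dim)   # attention network Q
        self.K = _mlp(dim)   # attention network K

    def forward(self, f):                # f: (N, dim), one feature per frame
        q = self.Q(f)                    # attention vectors q_i
        k = self.K(f)                    # attention vectors k_i
        w = q @ k.t()                    # w_i^j = q_i . k_j, shape (N, N)
        a = torch.softmax(w, dim=-1)     # Eq. (1): normalized attention weights
        return f + a @ f                 # Eq. (2): weighted sum + residual connection

h = SelfAttention()(torch.randn(32, 2048))   # latent representations h_i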

3.3 Forecasting module

In addition to considering global temporal information as in the self-attention module AA, we exploit visual cues from human motion to encourage our model to generate temporally smooth predictions. Motivated by methods that focus on tackling human motion prediction (Kanazawa et al., 2019; Zhang et al., 2019b), we develop a forecasting module FF that takes each encoded feature fif_{i} as input and forecasts the feature of the next time step fi+1f_{i+1}^{\prime}. As the feature of the next time step is available (given by the encoder), we train the forecasting module FF in a self-supervised fashion with a feature regression loss:

\mathcal{L}_{\mathrm{feature}}=\sum_{i=1}^{N-1}\|f_{i+1}-f_{i+1}^{\prime}\|_{2}.  (3)

We note that since the ground-truth feature for the time step after f_{N} is not available, we do not compute the feature regression loss on the prediction f_{N+1}^{\prime}.
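The sketch below illustrates the forecasting module F and the feature regression loss of Eq. (3); the two-layer architecture follows Section 4.1, while the tensor shapes and the batching over a single sequence are our own assumptions.

import torch
import torch.nn as nn

class Forecasting(nn.Module):
    """Sketch of the forecasting module F: predicts the feature of the next time step."""
    def __init__(self, dim=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(),
                                 nn.Linear(dim, dim), nn.LeakyReLU())

    def forward(self, f):        # f: (N, dim) encoder features
        return self.net(f)       # f'_{i+1} = F(f_i), still (N, dim)

def feature_regression_loss(f, f_pred):
    # Eq. (3): compare F(f_i) with the encoder feature f_{i+1}; the prediction
    # made from the last frame has no target and is therefore dropped.
    return (f[1:] - f_pred[:-1]).norm(dim=-1).sum()

f = torch.randn(32, 2048)
loss = feature_regression_loss(f, Forecasting()(f))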

3.4 3D human pose and shape estimation

To jointly consider the latent representations {hi}i=1N\{h_{i}\}_{i=1}^{N} that contain global temporal information and the predicted features {fi+1}i=1N1\{f_{i+1}^{\prime}\}_{i=1}^{N-1} that contain local motion information for predicting the parameters for 3D human pose and shape estimation, we have a feature fusion module that fuses {hi}i=1N\{h_{i}\}_{i=1}^{N} and {fi+1}i=1N1\{f_{i+1}^{\prime}\}_{i=1}^{N-1} at the same time step to derive the fused representations {Fi}i=1N\{F_{i}\}_{i=1}^{N}. We note that since our encoder EE is pre-trained on single-image pose and shape estimation task and fixed during training as in prior work (Kanazawa et al., 2018; Kocabas et al., 2020), the feature fif_{i} encoded by the encoder EE is static and does not contain motion information. Therefore, we use the predicted feature fif_{i}^{\prime} from the forecasting module FF that contains motion information for feature fusion.

As shown in Figure 4, our feature fusion module is composed of a fully connected (FC) layer, followed by a softmax layer. Given a latent representation hih_{i} and a predicted feature fif_{i}^{\prime}, we first apply the FC layer to each input feature to predict a weight. The predicted weights are then normalized using a softmax layer. The two input features are then fused by Fi=ahihi+afifidF_{i}=a_{h_{i}}\cdot h_{i}+a_{f_{i}^{\prime}}\cdot f_{i}^{\prime}\in\mathbb{R}^{d}. We note that since f1f_{1}^{\prime} is not available, we define F1=h1F_{1}=h_{1}.
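A possible implementation of the feature fusion module is sketched below, assuming a single shared scoring layer as in Figure 4; the handling of the first frame (F_1 = h_1) is omitted for brevity.

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of the feature fusion module: a shared FC layer scores h_i and f'_i,
    a softmax normalizes the two scores, and the features are blended."""
    def __init__(self, dim=2048):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # shared fully connected layer

    def forward(self, h, f_pred):        # h, f_pred: (N, dim)
        w = torch.cat([self.score(h), self.score(f_pred)], dim=-1)  # (N, 2)
        a = torch.softmax(w, dim=-1)     # normalized weights a_{h_i}, a_{f'_i}
        return a[:, :1] * h + a[:, 1:] * f_pred                     # fused F_i

F_fused = FeatureFusion()(torch.randn(32, 2048), torch.randn(32, 2048))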

Next, we pass all the fused features {Fi}i=1N\{F_{i}\}_{i=1}^{N} to the shape RshapeR_{\mathrm{shape}}, pose RposeR_{\mathrm{pose}}, and camera RcameraR_{\mathrm{camera}} parameter regressors to predict the corresponding parameters, respectively. Similar to one prior work (Kanazawa et al., 2018), we adopt an iterative error feedback scheme to regress the parameters. To train the proposed SPS-Net, we impose a SMPL parameter regression loss SMPL\mathcal{L}_{\mathrm{SMPL}} on the estimated pose {θ^i}i=1N\{\hat{\theta}_{i}\}_{i=1}^{N} and shape β^\hat{\beta} parameters, a 3D joint loss joint3D\mathcal{L}_{\mathrm{joint}}^{3D} on the predicted 3D joints {X^i}i=1N\{\hat{X}_{i}\}_{i=1}^{N}, and a 2D joint loss joint2D\mathcal{L}_{\mathrm{joint}}^{2D} on the reprojected 2D joints {x^i}i=1N\{\hat{x}_{i}\}_{i=1}^{N} (Kanazawa et al., 2018; Kocabas et al., 2020). Specifically, the SMPL parameter regression loss SMPL\mathcal{L}_{\mathrm{SMPL}}, the 3D joint loss joint3D\mathcal{L}_{\mathrm{joint}}^{3D}, and the 2D joint loss joint2D\mathcal{L}_{\mathrm{joint}}^{2D} are defined as

\mathcal{L}_{\mathrm{SMPL}}=\|\beta-\hat{\beta}\|_{2}+\sum_{i=1}^{N}\|\theta_{i}-\hat{\theta}_{i}\|_{2},\quad\mathcal{L}_{\mathrm{joint}}^{3D}=\sum_{i=1}^{N}\|X_{i}-\hat{X}_{i}\|_{2},\quad\mathcal{L}_{\mathrm{joint}}^{2D}=\sum_{i=1}^{N}\|x_{i}-\hat{x}_{i}\|_{2}.  (4)
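The losses in Eq. (4) could be implemented as below; the per-joint L2 reduction is our assumption, since the exact reduction is not spelled out in the text.

import torch

def smpl_and_joint_losses(beta, beta_hat, theta, theta_hat, X, X_hat, x, x_hat):
    """Sketch of Eq. (4). Shapes: beta (10,), theta (N, 72), X (N, k, 3), x (N, k, 2)."""
    l_smpl = (beta - beta_hat).norm() + (theta - theta_hat).norm(dim=-1).sum()
    l_3d = (X - X_hat).norm(dim=-1).sum()   # per-joint distances, summed (assumption)
    l_2d = (x - x_hat).norm(dim=-1).sum()
    return l_smpl, l_3d, l_2d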

Mask loss. Since the ground-truth pose \{\theta_{i}\}_{i=1}^{N}, shape \beta, and 3D joint \{X_{i}\}_{i=1}^{N} annotations are usually not available, using the 2D joint loss \mathcal{L}_{\mathrm{joint}}^{2D} alone is insufficient to train the SPS-Net, as there are numerous 3D meshes that can explain the same 2D projection. To address this issue, we exploit the idea that the reprojection of the 3D mesh using the estimated camera parameters should be consistent with the segmentation mask obtained by directly segmenting the human from the input video frame. We leverage an off-the-shelf instance segmentation model (Bolya et al., 2019) to compile a pseudo ground-truth segmentation mask m_{i}^{\mathrm{pseudo}} for each input video frame I_{i}. (We note that while other existing instance segmentation models can also be used for compiling segmentation masks, we leave the discussion of adopting different instance segmentation models as future work.) Then, we use the pseudo ground-truth segmentation mask to supervise the reprojection of the 3D mesh with a mask loss:

\mathcal{L}_{\mathrm{mask}}=-\sum_{i=1}^{N}m_{i}^{\mathrm{pseudo}}\log(m_{i}^{\mathrm{proj}}),  (5)

where miprojm_{i}^{\mathrm{proj}} denotes the reprojection of the 3D mesh using the estimated camera parameters.
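A sketch of the mask loss in Eq. (5) is shown below. It assumes that both masks are per-pixel probability maps and that m_i^proj is produced by a differentiable silhouette renderer (not shown); the per-pixel summation is our assumption.

import torch

def mask_loss(m_pseudo, m_proj, eps=1e-6):
    """Eq. (5): cross-entropy between the pseudo ground-truth person masks and the
    reprojected-mesh silhouettes. m_pseudo, m_proj: (N, H, W) with values in [0, 1]."""
    return -(m_pseudo * torch.log(m_proj.clamp(min=eps))).sum()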

Refer to caption
Fig. 4: Overview of the feature fusion module. Our feature fusion module consists of a shared fully connected layer and a softmax layer. We first apply the FC layer to each input feature to predict a weight. We then apply a softmax layer to normalize the predicted weights. The input features are combined with the normalized weights to generate FiF_{i}.

Camera parameter consistency loss. Since there are no ground-truth camera parameter annotations for most datasets, existing methods (Kocabas et al., 2020; Kanazawa et al., 2018; Kolotouros et al., 2019a) regularize the estimation of camera parameters via reprojecting the detected 3D keypoints onto 2D space and enforcing a 2D joint loss joint2D\mathcal{L}_{\mathrm{joint}}^{2D} between the reprojected 2D joints and the corresponding ground-truth 2D joints. This weaker form of supervision, however, is still under-constrained. To address the absence of ground-truth camera parameter annotations, we exploit the idea that the overlapped video frames in different sequence segments from the same video should have the same camera parameter predictions. Given two input sequence segments S1={IiS1}i=nkS_{1}=\{I_{i}^{S_{1}}\}_{i=n}^{k} and S2={IiS2}i=n+1k+1S_{2}=\{I_{i}^{S_{2}}\}_{i=n+1}^{k+1} from the same video VV, the overlapped frames are {Ii}i=n+1k\{I_{i}\}_{i=n+1}^{k}. We enforce the camera parameter predictions of the overlapped frames {Ii}i=n+1k\{I_{i}\}_{i=n+1}^{k} to be the same in these two input sequence segments S1S_{1} and S2S_{2}. To achieve this, we propose a camera parameter consistency loss camera\mathcal{L}_{\mathrm{camera}} which is defined as

\mathcal{L}_{\mathrm{camera}}=\sum_{i=n+1}^{k}\|R_{\mathrm{camera}}(F_{i}^{S_{1}})-R_{\mathrm{camera}}(F_{i}^{S_{2}})\|_{2},  (6)

where FiS1dF_{i}^{S_{1}}\in\mathbb{R}^{d} and FiS2dF_{i}^{S_{2}}\in\mathbb{R}^{d} are the fused feature of frame IiS1I_{i}^{S_{1}} and frame IiS2I_{i}^{S_{2}}, respectively. Incorporating such consistency loss during training not only regularizes the prediction of camera parameters but also provides more supervisory signals to facilitate model training.
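A sketch of the camera parameter consistency loss in Eq. (6) is given below; the number of camera parameters per frame and the segment length are assumptions that follow the weak-perspective camera model and the 32-frame segments used in our experiments.

import torch

def camera_consistency_loss(cam_s1, cam_s2):
    """Eq. (6). cam_s1 / cam_s2: camera parameters predicted for the same overlapped
    frames from the two segments S1 and S2, shape (M, C) with M overlapped frames."""
    return (cam_s1 - cam_s2).norm(dim=-1).sum()

# With 32-frame segments shifted by one frame, 31 frames overlap; C = 3 below is a
# hypothetical (scale, translation_x, translation_y) parameterization.
loss = camera_consistency_loss(torch.randn(31, 3), torch.randn(31, 3))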

Adversarial loss. In addition to the aforementioned loss functions, we also adopt an adversarial learning scheme that aims to encourage our method to recover a sequence of 3D meshes with realistic motions (Kocabas et al., 2020). Similar to the VIBE (Kocabas et al., 2020) method, we adopt the AMASS (Mahmood et al., 2019) dataset and employ a discriminator DD that takes as input a sequence of pose parameters with the associated shape parameters Θ^=[θ^1,,θ^N,β^]\hat{\Theta}=[\hat{\theta}_{1},...,\hat{\theta}_{N},\hat{\beta}] estimated by the SPS-Net (treated as a fake example) and a sequence of those Θ=[θ1,,θN,β]\Theta=[\theta_{1},...,\theta_{N},\beta] sampled from the AMASS (Mahmood et al., 2019) dataset (treated as a real example), and aims to distinguish whether the input sequences are realistic or not.

As shown in Figure 5, our discriminator DD is composed of a self-attention module ADA_{D} and a classifier CDC_{D}. We first concatenate the estimated shape parameters β^\hat{\beta} with each of the estimated pose parameters {θ^i}i=1N\{\hat{\theta}_{i}\}_{i=1}^{N} to form the joint representations {J^i}i=1N\{\hat{J}_{i}\}_{i=1}^{N}, where J^i=[β^,θ^i]82\hat{J}_{i}=[\hat{\beta},\hat{\theta}_{i}]\in\mathbb{R}^{82}. We then pass all joint representations {J^i}i=1N\{\hat{J}_{i}\}_{i=1}^{N} to the self-attention module ADA_{D} to derive the latent representations {H^i}i=1N\{\hat{H}_{i}\}_{i=1}^{N}, where H^i82\hat{H}_{i}\in\mathbb{R}^{82} is the latent representation of J^i\hat{J}_{i}. To derive the motion representation M^\hat{M} of Θ^\hat{\Theta}, we average all the latent representations {H^i}i=1N\{\hat{H}_{i}\}_{i=1}^{N}, i.e., M^=1Ni=1NH^i82\hat{M}=\frac{1}{N}\sum_{i=1}^{N}\hat{H}_{i}\in\mathbb{R}^{82}. The motion representation M82M\in\mathbb{R}^{82} of Θ\Theta can be derived similarly. The classifier CDC_{D} takes the motion representations M^\hat{M} and MM as input and distinguishes whether the input motion representations are realistic or not. Specifically, we have an adversarial loss adv\mathcal{L}_{\mathrm{adv}} which is defined as

\mathcal{L}_{\mathrm{adv}}=\mathbb{E}_{\Theta\sim p_{\Theta}}[\|D(\Theta)-1\|_{2}]+\mathbb{E}_{\hat{\Theta}\sim p_{\hat{\Theta}}}[\|D(\hat{\Theta})\|_{2}].  (7)

Leveraging the unpaired data from the AMASS (Mahmood et al., 2019) dataset serves as a weak supervision to encourage the SPS-Net to recover a sequence of 3D meshes with realistic motions.
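A sketch of the motion discriminator D and the adversarial objective of Eq. (7) is shown below. SelfAttention refers to the sketch given after Eq. (2) (here with dimension 82), and the squared penalty is a common LSGAN-style implementation choice that we adopt as an assumption; for a scalar discriminator output it differs from the L2 norm in Eq. (7) only by the squaring.

import torch
import torch.nn as nn

class MotionDiscriminator(nn.Module):
    """Sketch of the discriminator D: self-attention A_D over per-frame [beta, theta_i]
    vectors, temporal averaging into a motion representation, and a classifier C_D."""
    def __init__(self, dim=82):
        super().__init__()
        self.attn = SelfAttention(dim)                              # A_D (earlier sketch)
        self.cls = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())   # C_D

    def forward(self, J):            # J: (N, 82), rows [beta, theta_i]
        H = self.attn(J)             # latent representations H_i
        M = H.mean(dim=0)            # motion representation M, shape (82,)
        return self.cls(M)           # realism score in [0, 1]

def adversarial_loss_discriminator(D, J_real, J_fake):
    # Eq. (7): push real AMASS motions toward 1 and generated motions toward 0.
    return (D(J_real) - 1).pow(2).mean() + D(J_fake).pow(2).mean()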

We note that our discriminator DD is different from that of the VIBE (Kocabas et al., 2020) method in two aspects. First, our discriminator has a self-attention module, while the discriminator of the VIBE (Kocabas et al., 2020) method has two GRU layers. Second, we use self-attention to derive a representation for each frame that contains temporal information by jointly considering short-range and long-range dependencies across video frames, whereas the VIBE (Kocabas et al., 2020) method leverages self-attention to derive a single representation for the entire pose sequence.

Self-supervised occlusion handling. While the aforementioned loss functions regularize the learning of the SPS-Net, the 2D and 3D joint losses and the mask loss are only enforced on the visible keypoints and regions of the human body. That is, there is no explicit constraint imposed on the invisible keypoints and regions. We develop a self-supervised learning scheme to allow our model to produce plausible predictions in order to account for the occlusion and out-of-view scenarios. For each input frame IiI_{i}, we first synthesize the occluded version IiI_{i}^{\prime} by randomly masking out some regions. We then leverage the predictions of the original frames to supervise those of the synthesized occluded or partially visible frames and develop a self-supervised parameter regression loss param\mathcal{L}_{\mathrm{param}} to exploit this property with

\mathcal{L}_{\mathrm{param}}=\|\hat{\beta}-\hat{\beta}^{\prime}\|_{2}+\sum_{i=1}^{N}\|\hat{\theta}_{i}-\hat{\theta}_{i}^{\prime}\|_{2}+\sum_{i=1}^{N}\|R_{\mathrm{camera}}(F_{i})-R_{\mathrm{camera}}(F_{i}^{\prime})\|_{2}.  (8)

By simulating the occlusion and out-of-view scenes, our model is able to predict plausible shape, pose, and camera parameters from the occluded or partially visible frames.
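The self-supervised parameter regression loss of Eq. (8) could look as follows. In practice the predictions on the original frames would likely be detached so that they act as fixed targets for the occluded copies; whether the released implementation does this is not stated, so we mark it as an assumption.

import torch

def self_supervised_param_loss(beta, beta_occ, theta, theta_occ, cam, cam_occ):
    """Eq. (8): predictions on the original frames supervise those on the occluded frames.
    Shapes: beta (10,), theta (N, 72), cam (N, C)."""
    # Detaching the original-frame predictions treats them as targets (assumption).
    beta, theta, cam = beta.detach(), theta.detach(), cam.detach()
    return ((beta - beta_occ).norm()
            + (theta - theta_occ).norm(dim=-1).sum()
            + (cam - cam_occ).norm(dim=-1).sum())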

Refer to caption
Fig. 5: Overview of the discriminator DD. Our discriminator DD consists of a self-attention module ADA_{D} and a classifier CDC_{D}. Given a sequence of pose parameters {θi}i=1N\{\theta_{i}\}_{i=1}^{N} and the associated shape parameters β\beta, we first derive the latent representations {Hi}i=1N\{H_{i}\}_{i=1}^{N} using the self-attention module ADA_{D}. We then average all the latent representations {Hi}i=1N\{H_{i}\}_{i=1}^{N} to derive the motion representation MM. The classifier CDC_{D} takes the motion representation MM as input and distinguishes whether the input motion representation MM is realistic or not.

4 Experimental Results

In this section, we first describe the implementation details. Next, we describe the datasets for model training and testing, followed by the evaluation metrics. We then present the quantitative and visual comparisons to existing methods as well as the ablation study.

4.1 Implementation details

We implement our model using PyTorch (Paszke et al., 2019). Same as prior work (Kanazawa et al., 2018; Kocabas et al., 2020), we adopt the ResNet-50 (He et al., 2016) pre-trained on the single-image pose and shape estimation task (Kanazawa et al., 2018; Kolotouros et al., 2019a) to serve as our encoder E. Our encoder E is fixed and outputs a 2,048-dimensional feature for each frame, i.e., f_{i} \in \mathbb{R}^{2048}. We set the length of the input sequence to 32 with a batch size of 16. Both the attention network Q and the attention network K in the self-attention module A consist of 2 fully connected layers, each of which has a hidden size of 2,048, followed by a LeakyReLU layer. As for the forecasting module F, unlike prior methods (Zhang et al., 2019b; Kanazawa et al., 2019) that use 1D convolution layers, our forecasting module F is composed of 2 fully connected layers, each of which has a hidden size of 2,048, followed by a LeakyReLU layer. Both the attention network Q and the attention network K in the self-attention module A_D also consist of 2 fully connected layers, each of which has a hidden size of 82, followed by a LeakyReLU layer. The classifier C_D in the discriminator D is composed of a fully connected layer, followed by a sigmoid function. The input and output dimensions of the classifier C_D are 82 and 1, respectively. Similar to the HMR (Kanazawa et al., 2018), the SMPL (Loper et al., 2015) parameter regressors {R_pose, R_shape} are composed of 2 fully connected layers with a hidden size of 1,024. The shape R_shape, pose R_pose, and camera R_camera parameter regressors are initialized from the pre-trained weights of the HMR (Kanazawa et al., 2018) approach. The weights of the self-attention module A, the forecasting module F, the feature fusion module, and the discriminator D are randomly initialized. We use the ADAM (Kingma and Ba, 2014) optimizer for training. The learning rates for the SPS-Net and the discriminator D are set to 5\times 10^{-5} and 1\times 10^{-4}, respectively. Following the VIBE (Kocabas et al., 2020) method, we set the hyperparameters for the loss functions as follows: \lambda_{\beta}=0.06, \lambda_{\theta}=60, \lambda_{\mathrm{joint}}^{3D}=300, \lambda_{\mathrm{joint}}^{2D}=300, and \lambda_{\mathrm{adv}}=2. For the other hyperparameters, we set \lambda_{\mathrm{feature}}=1, \lambda_{\mathrm{mask}}=300, \lambda_{\mathrm{camera}}=0.1, \lambda_{\mathrm{param}}^{\beta}=0.06, \lambda_{\mathrm{param}}^{\theta}=60, and \lambda_{\mathrm{param}}^{\mathrm{camera}}=0.1. We train our model on a single NVIDIA V100 GPU with 32 GB of memory for 120 epochs, with 500 iterations per epoch.
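For reference, a hypothetical assembly of the overall training objective from the individual loss terms and the weights listed above is sketched below; the paper does not spell out the exact combination, so the plain weighted sum is an assumption.

# Hypothetical weighted sum of the loss terms using the weights from this section.
LOSS_WEIGHTS = {
    'smpl_beta': 0.06, 'smpl_theta': 60.0, 'joint_3d': 300.0, 'joint_2d': 300.0,
    'adv': 2.0, 'feature': 1.0, 'mask': 300.0, 'camera': 0.1,
    'param_beta': 0.06, 'param_theta': 60.0, 'param_camera': 0.1,
}

def total_loss(losses):
    """losses: dict mapping the keys above to scalar loss tensors."""
    return sum(LOSS_WEIGHTS[k] * losses[k] for k in LOSS_WEIGHTS)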

Camera parameter consistency loss \mathcal{L}_{\mathrm{camera}}. To compute the camera parameter consistency loss \mathcal{L}_{\mathrm{camera}}, in each iteration we sample two consecutive sequence segments by shifting the starting index for data sampling by 1. Assuming that the starting index for data sampling is n, we first sample a sequence segment S_{1}=\{I_{i}\}_{i=n}^{n+31}. We then shift the starting index by 1 and sample another sequence segment S_{2}=\{I_{i}\}_{i=n+1}^{n+32}. Given these two sequence segments S_{1} and S_{2}, the overlapped video frames are \{I_{i}\}_{i=n+1}^{n+31}. We enforce the camera parameter predictions of the overlapped video frames \{I_{i}\}_{i=n+1}^{n+31} to be the same in these two sequence segments with the camera parameter consistency loss.

Self-supervised occlusion handling. Since the ground-truth 2D joint annotations are available, for each training image, we randomly sample 3 to 5 keypoints. For each keypoint, we randomly sample a width offset between 25 and 50 pixels and a height offset between 25 and 50 pixels to determine the region to be masked out for synthesizing the occluded training data. The shape, pose, and camera parameter predictions of the occluded training data are supervised by those of the original training data. We note that for frames with ground-truth pose parameter annotations, the self-supervised parameter regression loss \mathcal{L}_{\mathrm{param}} can be computed against the ground truth. However, in our training set, only the MPI-INF-3DHP (Mehta et al., 2017a) and Human3.6M (Ionescu et al., 2013) datasets contain ground-truth pose parameter annotations. For ease of implementation, we choose to compute the loss against the predictions of the original frames. This formulation of the self-supervised parameter regression loss \mathcal{L}_{\mathrm{param}} is applicable to all training data, with or without ground truth.
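The occlusion synthesis described above could be implemented roughly as follows; whether the sampled offsets are half-widths or full extents and how the masked pixels are filled are not specified, so those choices are assumptions.

import random
import torch

def synthesize_occlusion(frame, keypoints, num_patches=(3, 5), offset=(25, 50)):
    """Mask out rectangular regions around 3-5 randomly chosen 2D keypoints.
    frame: (3, H, W) image tensor; keypoints: (K, 2) pixel coordinates."""
    frame = frame.clone()
    _, H, W = frame.shape
    k = random.randint(*num_patches)
    for (x, y) in keypoints[torch.randperm(len(keypoints))[:k]]:
        dw, dh = random.randint(*offset), random.randint(*offset)
        x0, x1 = max(int(x) - dw, 0), min(int(x) + dw, W)
        y0, y1 = max(int(y) - dh, 0), min(int(y) + dh, H)
        frame[:, y0:y1, x0:x1] = 0.0   # fill value is an assumption
    return frame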

Multi-person tracking. To recover human body from videos that contain multiple person instances, we first leverage a multi-person tracker to detect and track each person instance. We then apply our SPS-Net to each person tracking result to estimate the 3D human pose and shape. The multi-person tracker is composed of an object detector and an object tracker. We adopt the YOLOv4 (Bochkovskiy et al., 2020) as the object detector and the SORT (Bewley et al., 2016) as the object tracker. The multi-person tracker first applies the YOLOv4 (Bochkovskiy et al., 2020) detector to each video frame to detect each person instance. Then the person detection results are passed to the SORT (Bewley et al., 2016) method to associate the detected person instances in the current frame to the existing ones. Specifically, the SORT (Bewley et al., 2016) first predicts the bounding box in the current frame for each existing person. Then, we compute the intersection over union (IoU) between the detected bounding boxes and the predicted bounding boxes. By using the Hungarian algorithm with a minimum IoU threshold, we can assign each detected person instance to an existing one or consider the detected person instance a new one.
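The data association step of the tracker could be sketched as below, using IoU costs and the Hungarian algorithm via scipy.optimize.linear_sum_assignment; the minimum-IoU threshold and the box format are assumptions, and the actual SORT implementation additionally runs a Kalman filter to predict each track's box.

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(detections, track_predictions, iou_min=0.3):
    """Match detected boxes to the tracks' predicted boxes with the Hungarian algorithm;
    detections below the IoU threshold start new tracks."""
    cost = np.array([[1.0 - iou(d, p) for p in track_predictions] for d in detections])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols)
               if iou(detections[r], track_predictions[c]) >= iou_min]
    matched = {r for r, _ in matches}
    new_tracks = [r for r in range(len(detections)) if r not in matched]
    return matches, new_tracks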

Table 1: Experimental results of 3D human pose and shape estimation. We present the experimental results with comparisons to existing methods. (Left) Results on the 3DPW (von Marcard et al., 2018) dataset. (Middle) Results on the MPI-INF-3DHP (Mehta et al., 2017a) dataset. (Right) Results on the Human3.6M (Ionescu et al., 2013) dataset. The bold and underlined numbers indicate the top two results, respectively. The “-” indicates the result is not available.
Method Number of parameters 3DPW (von Marcard et al., 2018) MPI-INF-3DHP (Mehta et al., 2017a) Human3.6M (Ionescu et al., 2013)
PA-MPJPE \downarrow MPJPE \downarrow PVE \downarrow Acceleration Error \downarrow PA-MPJPE \downarrow MPJPE \downarrow PCK \uparrow PA-MPJPE \downarrow MPJPE \downarrow
Frame based Yang et al. (Yang et al., 2018) - - - - - - - 69.0 - -
Chen et al. (Chen et al., 2019a) - - - - - - - 71.1 - -
Mehta et al. (Mehta et al., 2017b) 9.81M - - - - - - 72.5 - -
EpipolarPose (Kocabas et al., 2019) 34.28M - - - - - - 77.5 - -
TCN (Cheng et al., 2020) - - - - - - - 84.1 - -
RepNet (Wandt and Rosenhahn, 2019) 10.03M - - - - - 97.8 82.5 - -
CMR (Kolotouros et al., 2019b) 46.31M 70.2 - - - - - - 50.1 -
STRAPS (Sengupta et al., 2020) 12.48M 66.8 - - - - - - 55.4 -
NBF (Omran et al., 2018) 68.11M 90.7 - - - - - - 59.9 -
ExPose (Choutas et al., 2020) 47.22M 60.7 93.4 - - - - - - -
HUND (Zanfir et al., 2020) - 56.5 87.7 - - - - - 53.0 72.0
HMR (Kanazawa et al., 2018) 26.98M 76.7 130.0 - 37.4 89.8 124.2 72.9 56.8 88.0
SPIN (Kolotouros et al., 2019a) 26.98M 59.2 96.9 116.4 29.8 67.5 105.2 76.4 41.1 -
Video based Temporal 3D Kinetics (Arnab et al., 2019) - 72.2 - - - - - - - -
Motion to the Rescue (Doersch and Zisserman, 2019) - 74.7 - - - - - - - -
DSD-SATN (Sun et al., 2019) - 69.5 - - - - - - 42.4 59.1
HMMR (Kanazawa et al., 2019) 29.76M 72.6 116.5 139.3 15.2 - - - 56.9 -
VIBE (Kocabas et al., 2020) 48.30M 56.5 93.5 113.4 27.1 63.4 97.7 89.0 41.5 65.9
Ours 51.43M 50.4 85.8 100.6 22.1 60.7 94.3 90.1 38.7 58.9
Refer to caption

SPIN

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption

VIBE

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption

Ours

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Fig. 6: Visual comparisons. We present two visual comparisons with the SPIN (Kolotouros et al., 2019a) and VIBE (Kocabas et al., 2020) methods. Our method is capable of estimating shapes that cover human bodies well and predicting more accurate poses for limbs in particular.

4.2 Experimental settings

We describe the datasets and the evaluation metrics below.

4.2.1 Datasets

Similar to the state-of-the-art human pose and shape estimation methods (Kanazawa et al., 2018, 2019; Kolotouros et al., 2019a; Kocabas et al., 2020), we adopt a number of datasets that contain either 2D or 3D ground-truth annotations for training. Specifically, we use the PennAction (Zhang et al., 2013), InstaVariety (Kanazawa et al., 2019), PoseTrack (Andriluka et al., 2018), MPI-INF-3DHP (Mehta et al., 2017a), and Human3.6M (Ionescu et al., 2013) datasets for training. Same as the VIBE (Kocabas et al., 2020) method, we use the Kinetics-400 (Kay et al., 2017) dataset to complement the missing parts of the InstaVariety (Kanazawa et al., 2019) dataset. We evaluate our method on the 3DPW (von Marcard et al., 2018), MPI-INF-3DHP (Mehta et al., 2017a), and Human3.6M (Ionescu et al., 2013) datasets. The details of each dataset are described below.

3DPW (von Marcard et al., 2018). The 3DPW dataset is an in-the-wild 3D dataset containing 60 videos of several in-the-wild and indoor activities. The training, validation, and test sets are composed of 24, 12, and 24 video sequences, respectively. We evaluate our method on the 3DPW test set.

MPI-INF-3DHP (Mehta et al., 2017a). The MPI-INF-3DHP dataset consists of multi-view videos captured in indoor environments. The training set contains 8 subjects, each of which has 16 videos. Following existing approaches (Kolotouros et al., 2019a; Kocabas et al., 2020), we use the training set for model training and evaluate our SPS-Net on the test set.

Human3.6M (Ionescu et al., 2013). The Human3.6M dataset is composed of 15 sequences of several people performing different actions. This dataset is collected in an indoor and controlled environment. The training set contains 1.5 million images, each of which has 3D ground-truth annotations. Same as the VIBE (Kocabas et al., 2020) method, we train our model on 5 subjects (i.e., S1, S5, S6, S7, and S8) and evaluate our method on the remaining 2 subjects (i.e., S9 and S11).

PennAction (Zhang et al., 2013). The PennAction dataset is composed of 2,326 videos of 15 actions. Each video is annotated with 2D keypoints. We use this dataset for training.

InstaVariety (Kanazawa et al., 2019). The InstaVariety dataset is composed of 24 hours of videos collected from Instagram. Each video is annotated with 2D joints obtained by using the OpenPose (Cao et al., 2019) and Detect and Track (Girdhar et al., 2018) methods. We adopt this dataset for training.

PoseTrack (Andriluka et al., 2018). The PoseTrack dataset consists of 1,337 videos. The training set is composed of 792 videos, the validation set contains 170 videos, and the test set comprises 375 videos. Each video is annotated with 15 keypoints. We use the training set for model training.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Fig. 7: Visual results of occlusion handling. We present visual results on the CrowdPose dataset (Li et al., 2019), which demonstrate the ability of the SPS-Net to recover plausible human bodies for the occluded person instances.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Fig. 8: Qualitative results of 3D human pose and shape estimation. We visualize the 3D human body from different viewpoints recovered by our SPS-Net on the 3DPW (von Marcard et al., 2018) test set.

Ours w/o camera\mathcal{L}_{\mathrm{camera}}

Refer to caption
Refer to caption
Refer to caption

Ours w/o mask\mathcal{L}_{\mathrm{mask}}

Refer to caption
Refer to caption
Refer to caption

Ours

Refer to caption
Refer to caption
Refer to caption

Ours

Refer to caption
Refer to caption
Refer to caption
Fig. 9: Visual comparisons with our variant methods. (Left) Visual comparisons with the Ours w/o camera\mathcal{L}_{\mathrm{camera}} method. (Right) Visual comparisons with the Ours w/o mask\mathcal{L}_{\mathrm{mask}} approach.
Refer to caption
Refer to caption
Fig. 10: Sensitivity analysis of hyperparameters. We report the PA-MPJPE results of our method on the 3DPW (von Marcard et al., 2018) dataset. Experimental results show that the performance of our SPS-Net is stable when the hyperparameters are set within a suitable range.

4.2.2 Evaluation metrics

We use the Procrustes-aligned mean per joint position error (PA-MPJPE), mean per joint position error (MPJPE), percentage of correct keypoints (PCK) (Mehta et al., 2017a), per vertex error (PVE), and mean acceleration error of every joint in mm/s^2 (Kanazawa et al., 2019) for performance evaluation.
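For clarity, reference implementations of the main metrics are sketched below in NumPy: MPJPE, PA-MPJPE with a standard per-frame similarity (Procrustes) alignment, and the acceleration error computed from second finite differences. The frame rate and the exact alignment protocol (e.g., the evaluated joint subset) are assumptions; the numbers reported in this paper follow the standard evaluation code of prior work.

import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error in mm. pred, gt: (N, k, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred, gt):
    """MPJPE after per-frame Procrustes (similarity) alignment."""
    errs = []
    for p, g in zip(pred, gt):
        p0, g0 = p - p.mean(0), g - g.mean(0)
        U, s, Vt = np.linalg.svd(p0.T @ g0)
        if np.linalg.det(U @ Vt) < 0:          # avoid reflections
            Vt[-1] *= -1
            s[-1] *= -1
        R = (U @ Vt).T
        scale = s.sum() / (p0 ** 2).sum()
        aligned = scale * p0 @ R.T + g.mean(0)
        errs.append(np.linalg.norm(aligned - g, axis=-1).mean())
    return float(np.mean(errs))

def acceleration_error(pred, gt, fps=30.0):
    """Mean difference of joint accelerations (second finite differences) in mm/s^2."""
    acc_p = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    acc_g = gt[2:] - 2.0 * gt[1:-1] + gt[:-2]
    return float(np.linalg.norm(acc_p - acc_g, axis=-1).mean() * fps ** 2)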

4.3 Performance evaluation and comparisons

We compare the performance of our SPS-Net with existing frame-based methods (Yang et al., 2018; Chen et al., 2019a; Kocabas et al., 2019; Mehta et al., 2017b; Cheng et al., 2020; Wandt and Rosenhahn, 2019; Kolotouros et al., 2019b; Sengupta et al., 2020; Omran et al., 2018; Choutas et al., 2020; Zanfir et al., 2020; Kanazawa et al., 2018; Kolotouros et al., 2019a) and video-based approaches (Kanazawa et al., 2019; Arnab et al., 2019; Doersch and Zisserman, 2019; Sun et al., 2019; Kocabas et al., 2020). Table 1 presents the quantitative results on the 3DPW (von Marcard et al., 2018), MPI-INF-3DHP (Mehta et al., 2017a), and Human3.6M (Ionescu et al., 2013) datasets.

Experimental results on all three datasets show that our method performs favorably against existing frame-based and video-based approaches on the PA-MPJPE, MPJPE, PVE, and PCK evaluation metrics. However, the acceleration error of our method is higher than that of the HMMR (Kanazawa et al., 2019) approach. The reason is that the HMMR (Kanazawa et al., 2019) method is designed to predict past and future motions given a single image. While we have a forecasting module F that predicts the feature of the future frame based on past information, we do not aim to optimize performance on the human motion prediction task; instead, we focus on learning to estimate the 3D human pose and shape of the current frame. On the other hand, as noted in the VIBE (Kocabas et al., 2020) paper, the HMMR (Kanazawa et al., 2019) method applies smoothing to its predictions, leading to overly smooth poses at the expense of pose and shape estimation accuracy.

In addition to quantitative comparisons, we present 1) visual comparisons with the VIBE (Kocabas et al., 2020) and SPIN (Kolotouros et al., 2019a) methods, 2) visual results of occlusion handling, and 3) visual results of different viewpoints.

Visual comparisons with the VIBE and SPIN methods. Figure 6 shows two visual comparisons with the VIBE (Kocabas et al., 2020) and SPIN (Kolotouros et al., 2019a) methods. We observe that our model recovers body shapes that cover the humans well and estimates more accurate poses, for the limbs in particular.

Visual results of occlusion handling. Figure 7 presents example visual results of occlusion handling on the CrowdPose dataset (Li et al., 2019). We observe that our model is able to recover plausible human bodies for the occluded person instances, demonstrating the robustness of our SPS-Net.

Visual results of different viewpoints. We visualize human bodies recovered by our SPS-Net from different viewpoints in Figure 8. Our results show that our method estimates accurate rotation parameters.

4.4 Ablation study

Loss functions. To analyze the effectiveness of each loss function, we conduct an ablation study by removing one loss function at a time. Specifically, we analyze how much performance gain each loss function contributes. Table 2 shows the results on the 3DPW (von Marcard et al., 2018) test set.

Without the camera parameter consistency loss \mathcal{L}_{\mathrm{camera}}, there is no explicit constraint imposed on the prediction of camera parameters, leading to performance drops of 1.7 in PA-MPJPE and 3.5 in PVE. When removing the mask loss \mathcal{L}_{\mathrm{mask}}, our model does not have any constraint to regularize the 3D mesh, and performance drops of 5.9 in PA-MPJPE and 7.0 in PVE occur. Without the self-supervised parameter regression loss \mathcal{L}_{\mathrm{param}}, our model does not learn to produce plausible predictions when occlusion or out-of-view issues occur, resulting in performance drops of 5.4 in PA-MPJPE and 4.6 in PVE. When removing the adversarial loss \mathcal{L}_{\mathrm{adv}}, our model does not learn to render 3D meshes with realistic motions, and performance drops occur on all three evaluation metrics, which concurs with the findings in the HMR (Kanazawa et al., 2018) and VIBE (Kocabas et al., 2020) papers.

Figure 9 presents two visual comparisons with the variant methods of our SPS-Net (i.e., Ours w/o camera\mathcal{L}_{\mathrm{camera}} and Ours w/o mask\mathcal{L}_{\mathrm{mask}}). Our visual results show that both the camera parameter consistency loss camera\mathcal{L}_{\mathrm{camera}} and the mask loss mask\mathcal{L}_{\mathrm{mask}} allow our model to predict more accurate pose and shape estimates.

The ablation study on loss functions shows that all four losses are crucial to the SPS-Net.

Table 2: Ablation study on loss functions. We report the experimental results on the 3DPW (von Marcard et al., 2018) test set. The bold and underlined numbers indicate the top two results, respectively.
Method PA-MPJPE \downarrow MPJPE \downarrow PVE \downarrow
Ours 50.4 85.8 100.6
Ours w/o camera\mathcal{L}_{\mathrm{camera}} 52.1 88.2 104.1
Ours w/o mask\mathcal{L}_{\mathrm{mask}} 56.3 90.0 107.6
Ours w/o param\mathcal{L}_{\mathrm{param}} 55.8 89.4 105.2
Ours w/o adv\mathcal{L}_{\mathrm{adv}} 56.2 93.4 112.5
Table 3: Ablation study on the self-attention and forecasting modules. We report the experimental results on the 3DPW (von Marcard et al., 2018) test set. The bold and underlined numbers indicate the top two results, respectively.
Method Number of parameters PA-MPJPE \downarrow MPJPE \downarrow PVE \downarrow Acceleration Error \downarrow
Ours 51.43M 50.4 85.8 100.6 22.1
Ours w/o Forecasting FF 47.23M 54.2 91.9 104.3 23.3
Ours w/o Self-Attention AA 34.64M 57.6 96.6 104.7 22.9

Self-attention and forecasting modules. We conduct an ablation study to analyze the contributions of the self-attention module A and the forecasting module F in the SPS-Net. Specifically, we show the contribution of each component by disabling (removing) one at a time. Table 3 shows the results on the 3DPW (von Marcard et al., 2018) test set. Without either the forecasting module F or the self-attention module A, the degraded method suffers from a significant performance loss on all metrics. When both modules are jointly utilized, our model achieves the best results, demonstrating that the two components are complementary. Figure 12 shows visual comparisons with these variants (i.e., Ours w/o Self-Attention A and Ours w/o Forecasting F) on the CrowdPose dataset (Li et al., 2019); without either module, the degraded model cannot recover accurate poses.

Table 4: Ablation study on different temporal modules. We report the experimental results on the 3DPW (von Marcard et al., 2018) test set. The bold and underlined numbers indicate the top two results, respectively.
Method Number of parameters PA-MPJPE \downarrow MPJPE \downarrow PVE \downarrow
Ours (Self-Attention) 51.43M 50.4 85.8 100.6
Ours (GRU) 50.88M 52.8 87.7 103.2

Self-attention module vs. GRU. To analyze the effectiveness of employing different temporal modules, we conduct an ablation study by swapping the self-attention module AA in the SPS-Net with a two-layer GRU module as in the VIBE (Kocabas et al., 2020) model, i.e., comparing the performance between the “Ours (Self-Attention)” method and the “Ours (GRU)” approach. Table 4 presents the results on the 3DPW (von Marcard et al., 2018) test set. We observe that employing the self-attention module results in performance improvement over adopting the GRU on all three evaluation metrics.

Table 5: Ablation study of the input sequence length. We present the experimental results on the 3DPW (von Marcard et al., 2018) test set. The bold and underlined numbers indicate the top two results, respectively.
Input sequence length PA-MPJPE \downarrow MPJPE \downarrow PVE \downarrow
8 55.3 92.4 110.8
16 53.1 87.6 105.5
32 50.4 85.8 100.6
48 50.2 85.1 100.2
Table 6: Run time analysis.
Method Platform Training Inference
Yang et al. (Yang et al., 2018) Titan X - 1.1
Mehta et al. (Mehta et al., 2017b) Titan X - 3.3
RepNet (Wandt and Rosenhahn, 2019) Titan X - 10
CMR (Kolotouros et al., 2019b) RTX 2080Ti - 3.3
STRAPS (Sengupta et al., 2020) RTX 2080Ti 120 0.25
NBF (Omran et al., 2018) V100 18 -
ExPose (Choutas et al., 2020) Quadro P5000 - 0.16
HUND (Zanfir et al., 2020) P100 72 0.055
HMR (Kanazawa et al., 2018) Titan 1080Ti 120 0.04
SPIN (Kolotouros et al., 2019a) - - 3
Temporal 3D Kinetics (Arnab et al., 2019) - - 2
VIBE (Kocabas et al., 2020) RTX 2080Ti 1 0.07
Ours V100 12 0.09

Input sequence length. We conduct an ablation study to analyze the effect of the input sequence length. Table 5 presents the results on the 3DPW (von Marcard et al., 2018) dataset. Our results show that the performance on all three metrics improves as the input sequence length increases. When the input sequence length increases from 32 (the default setting in our experiments) to 48, the results improve further. However, due to GPU memory constraints, we are not able to experiment with longer input sequence lengths.
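As a small illustration of how the sequence length enters the pipeline, the helper below splits a video's per-frame features into fixed-length windows before temporal processing; this non-overlapping windowing is an assumed batching strategy for illustration and may differ from the one used in our experiments.

import torch

def chunk_video(frame_features, seq_len=32):
    # Split (T, feat_dim) per-frame features into non-overlapping windows of
    # length seq_len, dropping the incomplete tail (hypothetical helper).
    num_seqs = frame_features.shape[0] // seq_len
    return frame_features[: num_seqs * seq_len].view(num_seqs, seq_len, -1)

# Example: a 100-frame video with 2048-d features yields 3 windows of length 32.
windows = chunk_video(torch.randn(100, 2048), seq_len=32)
print(windows.shape)  # torch.Size([3, 32, 2048])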

Sensitivity analysis. To analyze the sensitivity of the SPS-Net to the hyperparameters \lambda_{\mathrm{mask}} and \lambda_{\mathrm{camera}}, we report the PA-MPJPE results on the 3DPW (von Marcard et al., 2018) test set. Figure 10 presents the experimental results.

We observe that when a hyperparameter is set to 0 (i.e., the corresponding loss function is removed), our SPS-Net suffers from performance drops. When the hyperparameters are set within a suitable range (i.e., around 300 for \lambda_{\mathrm{mask}} and around 0.1 for \lambda_{\mathrm{camera}}), the performance of our SPS-Net improves, demonstrating the effectiveness of the corresponding loss functions. When the hyperparameters are set to large values (e.g., 1\times 10^{4} for \lambda_{\mathrm{mask}} and 1\times 10^{3} for \lambda_{\mathrm{camera}}), training is dominated by optimizing the corresponding loss, leading to performance drops.

The sensitivity analysis of hyperparameters shows that when each hyperparameter is set within a suitable range, the performance of our method is improved and remains stable.

4.6 Failure modes

We present the failure cases of our method in Figure 11. Since our SPS-Net assumes that each input video frame contains a single detected person, our method cannot estimate the 3D human pose and shape when the person detection fails (i.e., the person is missed in a frame).

Fig. 11: Failure cases. When the person detection fails, our SPS-Net is unable to predict the 3D human body.
Fig. 12: Visual comparisons with our variant methods: Ours w/o Self-Attention A, Ours w/o Forecasting F, and Ours.

5 Conclusions

We propose the SPS-Net for estimating 3D human pose and shape from videos. The main contributions of this work lie in the design of the self-attention module, which captures short-range and long-range dependencies across video frames, and the forecasting module, which allows our model to exploit visual cues from human motion to produce temporally coherent predictions. To address the absence of ground-truth camera parameter annotations, we propose a camera parameter consistency loss that not only regularizes the learning of camera parameter prediction but also provides additional supervisory signals to facilitate model training. We further develop a self-supervised learning scheme that explicitly models the occlusion and out-of-view scenarios by masking out regions in the video frames. By leveraging the predictions on the original video frames to supervise those on the synthesized occluded or partially visible data, our model learns to produce plausible estimates. Extensive experimental results on three challenging datasets show that our SPS-Net performs favorably against state-of-the-art 3D human pose and shape estimation methods.

References

  • Andriluka et al. (2018) Andriluka, M., Iqbal, U., Insafutdinov, E., Pishchulin, L., Milan, A., Gall, J., Schiele, B., 2018. Posetrack: A benchmark for human pose estimation and tracking, in: CVPR.
  • Arnab et al. (2019) Arnab, A., Doersch, C., Zisserman, A., 2019. Exploiting temporal context for 3d human pose estimation in the wild, in: CVPR.
  • Bewley et al. (2016) Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B., 2016. Simple online and realtime tracking, in: ICIP.
  • Bochkovskiy et al. (2020) Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M., 2020. Yolov4: Optimal speed and accuracy of object detection. arXiv .
  • Bogo et al. (2016) Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J., 2016. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image, in: ECCV.
  • Bolya et al. (2019) Bolya, D., Zhou, C., Xiao, F., Lee, Y.J., 2019. Yolact: real-time instance segmentation, in: ICCV.
  • Butepage et al. (2017) Butepage, J., Black, M.J., Kragic, D., Kjellstrom, H., 2017. Deep representation learning for human motion prediction and classification, in: CVPR.
  • Cao et al. (2019) Cao, Z., Martinez, G.H., Simon, T., Wei, S.E., Sheikh, Y.A., 2019. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. TPAMI .
  • Chen et al. (2019a) Chen, C.H., Tyagi, A., Agrawal, A., Drover, D., Stojanov, S., Rehg, J.M., 2019a. Unsupervised 3d pose estimation with geometric self-supervision, in: CVPR.
  • Chen et al. (2019b) Chen, Y.C., Lin, Y.Y., Yang, M.H., Huang, J.B., 2019b. Crdoco: Pixel-level domain transfer with cross-domain consistency, in: CVPR.
  • Chen et al. (2020) Chen, Y.C., Lin, Y.Y., Yang, M.H., Huang, J.B., 2020. Show, match and segment: Joint weakly supervised learning of semantic matching and object co-segmentation. TPAMI .
  • Cheng et al. (2020) Cheng, Y., Yang, B., Wang, B., Tan, R.T., 2020. 3d human pose estimation using spatio-temporal networks with explicit occlusion training, in: AAAI.
  • Choutas et al. (2020) Choutas, V., Pavlakos, G., Bolkart, T., 2020. Monocular expressive body regression through body-driven attention, in: ECCV.
  • Denton and Birodkar (2017) Denton, E., Birodkar, V., 2017. Unsupervised learning of disentangled representations from video, in: NeurIPS.
  • Doersch and Zisserman (2019) Doersch, C., Zisserman, A., 2019. Sim2real transfer learning for 3d human pose estimation: motion to the rescue, in: NeurIPS.
  • Finn et al. (2016) Finn, C., Goodfellow, I., Levine, S., 2016. Unsupervised learning for physical interaction through video prediction, in: NeurIPS.
  • Fragkiadaki et al. (2015) Fragkiadaki, K., Levine, S., Felsen, P., Malik, J., 2015. Recurrent network models for human dynamics, in: ICCV.
  • Girdhar et al. (2018) Girdhar, R., Gkioxari, G., Torresani, L., Paluri, M., Tran, D., 2018. Detect-and-track: Efficient pose estimation in videos, in: CVPR.
  • Gordon et al. (2019) Gordon, A., Li, H., Jonschkowski, R., Angelova, A., 2019. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras, in: ICCV.
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: CVPR.
  • Huang et al. (2018) Huang, X., Liu, M.Y., Belongie, S., Kautz, J., 2018. Multimodal unsupervised image-to-image translation, in: ECCV.
  • Ionescu et al. (2013) Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C., 2013. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI .
  • Jain et al. (2016) Jain, A., Zamir, A.R., Savarese, S., Saxena, A., 2016. Structural-rnn: Deep learning on spatio-temporal graphs, in: CVPR.
  • Kanazawa et al. (2018) Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J., 2018. End-to-end recovery of human shape and pose, in: CVPR.
  • Kanazawa et al. (2019) Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J., 2019. Learning 3d human dynamics from video, in: CVPR.
  • Kay et al. (2017) Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A., 2017. The kinetics human action video dataset. arXiv .
  • Kingma and Ba (2014) Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization, in: ICLR.
  • Kocabas et al. (2020) Kocabas, M., Athanasiou, N., Black, M.J., 2020. Vibe: Video inference for human body pose and shape estimation, in: CVPR.
  • Kocabas et al. (2019) Kocabas, M., Karagoz, S., Akbas, E., 2019. Self-supervised learning of 3d human pose using multi-view geometry, in: CVPR.
  • Kolotouros et al. (2019a) Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K., 2019a. Learning to reconstruct 3d human pose and shape via model-fitting in the loop, in: ICCV.
  • Kolotouros et al. (2019b) Kolotouros, N., Pavlakos, G., Daniilidis, K., 2019b. Convolutional mesh regression for single-image human shape reconstruction, in: CVPR.
  • Lassner et al. (2017) Lassner, C., Romero, J., Kiefel, M., Bogo, F., Black, M.J., Gehler, P.V., 2017. Unite the people: Closing the loop between 3d and 2d human representations, in: CVPR.
  • Lee et al. (2018a) Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M., Yang, M.H., 2018a. Diverse image-to-image translation via disentangled representations, in: ECCV.
  • Lee et al. (2018b) Lee, K., Lee, I., Lee, S., 2018b. Propagating lstm: 3d pose estimation based on joint interdependency, in: ECCV.
  • Li et al. (2019) Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.S., Lu, C., 2019. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark, in: CVPR.
  • Li et al. (2018) Li, Z., Zhou, Y., Xiao, S., He, C., Huang, Z., Li, H., 2018. Auto-conditioned recurrent networks for extended complex human motion synthesis, in: ICLR.
  • Liu et al. (2019) Liu, L., Xu, W., Zollhoefer, M., Kim, H., Bernard, F., Habermann, M., Wang, W., Theobalt, C., 2019. Neural rendering and reenactment of human actor videos. ACM TOG .
  • Loper et al. (2015) Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J., 2015. Smpl: A skinned multi-person linear model. ACM TOG .
  • Mahmood et al. (2019) Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J., 2019. Amass: Archive of motion capture as surface shapes, in: ICCV.
  • von Marcard et al. (2018) von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G., 2018. Recovering accurate 3d human pose in the wild using imus and a moving camera, in: ECCV.
  • Mehta et al. (2017a) Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C., 2017a. Monocular 3d human pose estimation in the wild using improved cnn supervision, in: 3DV.
  • Mehta et al. (2017b) Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H.P., Xu, W., Casas, D., Theobalt, C., 2017b. Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM TOG .
  • Meister et al. (2018) Meister, S., Hur, J., Roth, S., 2018. Unflow: Unsupervised learning of optical flow with a bidirectional census loss, in: AAAI.
  • Omran et al. (2018) Omran, M., Lassner, C., Pons-Moll, G., Gehler, P., Schiele, B., 2018. Neural body fitting: Unifying deep learning and model based human pose and shape estimation, in: 3DV.
  • Parmar et al. (2018) Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., Tran, D., 2018. Image transformer, in: ICML.
  • Pascanu et al. (2013) Pascanu, R., Mikolov, T., Bengio, Y., 2013. On the difficulty of training recurrent neural networks, in: ICML.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. Pytorch: An imperative style, high-performance deep learning library, in: NeurIPS.
  • Pishchulin et al. (2016) Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B., 2016. Deepcut: Joint subset partition and labeling for multi person pose estimation, in: CVPR.
  • Rayat Imtiaz Hossain and Little (2018) Rayat Imtiaz Hossain, M., Little, J.J., 2018. Exploiting temporal information for 3d human pose estimation, in: ECCV.
  • Sengupta et al. (2020) Sengupta, A., Budvytis, I., Cipolla, R., 2020. Synthetic training for accurate 3d human pose and shape estimation in the wild. arXiv .
  • Sun et al. (2019) Sun, Y., Ye, Y., Liu, W., Gao, W., Fu, Y., Mei, T., 2019. Human mesh recovery from monocular images via a skeleton-disentangled representation, in: ICCV.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need, in: NeurIPS.
  • Villegas et al. (2018) Villegas, R., Yang, J., Ceylan, D., Lee, H., 2018. Neural kinematic networks for unsupervised motion retargetting, in: CVPR.
  • Walker et al. (2016) Walker, J., Doersch, C., Gupta, A., Hebert, M., 2016. An uncertain future: Forecasting from static images using variational autoencoders, in: ECCV.
  • Walker et al. (2017) Walker, J., Marino, K., Gupta, A., Hebert, M., 2017. The pose knows: Video forecasting by generating pose futures, in: ICCV.
  • Wandt and Rosenhahn (2019) Wandt, B., Rosenhahn, B., 2019. Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose estimation, in: CVPR.
  • Xu et al. (2019) Xu, W., Chatterjee, A., Zollhoefer, M., Rhodin, H., Fua, P., Seidel, H.P., Theobalt, C., 2019. Mo2cap2: Real-time mobile 3d motion capture with a cap-mounted fisheye camera. TVCG .
  • Yang et al. (2018) Yang, W., Ouyang, W., Wang, X., Ren, J., Li, H., Wang, X., 2018. 3d human pose estimation in the wild by adversarial learning, in: CVPR.
  • Zanfir et al. (2020) Zanfir, A., Bazavan, E.G., Zanfir, M., Freeman, W.T., Sukthankar, R., Sminchisescu, C., 2020. Neural descent for visual 3d human pose and shape. arXiv .
  • Zhang et al. (2019a) Zhang, H., Goodfellow, I., Metaxas, D., Odena, A., 2019a. Self-attention generative adversarial networks, in: ICML.
  • Zhang et al. (2019b) Zhang, J.Y., Felsen, P., Kanazawa, A., Malik, J., 2019b. Predicting 3d human dynamics from video, in: ICCV.
  • Zhang et al. (2013) Zhang, W., Zhu, M., Derpanis, K.G., 2013. From actemes to action: A strongly-supervised representation for detailed action understanding, in: ICCV.
  • Zhou et al. (2017) Zhou, T., Brown, M., Snavely, N., Lowe, D.G., 2017. Unsupervised learning of depth and ego-motion from video, in: CVPR.
  • Zhou et al. (2015) Zhou, T., Jae Lee, Y., Yu, S.X., Efros, A.A., 2015. Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences, in: CVPR.
  • Zhu et al. (2017) Zhu, J.Y., Park, T., Isola, P., Efros, A.A., 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks, in: ICCV.
  • Zou et al. (2018) Zou, Y., Luo, Z., Huang, J.B., 2018. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency, in: ECCV.