
Diffusion Motion: Generate Text-Guided 3D Human Motion by Diffusion Model

Abstract

We propose a simple and novel method for generating 3D human motion from complex natural language sentences, which describe different velocities, directions, and compositions of all kinds of actions. Different from existing methods that use classical generative architectures, we apply the Denoising Diffusion Probabilistic Model to this task, synthesizing diverse motion results under the guidance of texts. The diffusion model converts white noise into structured 3D motion through a Markov process with a series of denoising steps and is efficiently trained by optimizing a variational lower bound. To achieve text-conditioned motion synthesis, we use the classifier-free guidance strategy to inject the text embedding into the model during training. Our experiments demonstrate that our model achieves competitive results on the HumanML3D test set quantitatively and can generate more visually natural and diverse examples. We also show experimentally that our model is capable of zero-shot generation of motions for unseen text guidance.

Index Terms—  Diffusion Model, 3D motion generation, Multi-modalities

1 Introduction

Generating 3D human motion from natural language sentences is an interesting and useful task. It has extensive applications across virtual avatar controlling, robot motion planning, virtual assistants and movie script visualization.

The task has two major challenges. First, since natural language can have very fine-grained representations, generating visually natural and semantically relevant motions from texts is difficult. Specifically, the text inputs can contain many subtleties. For instance, given different verbs and adverbs in the text, the model needs to generate different motions. The input may indicate different velocities or directions, e.g., “a person is running fast forward then walking slowly backward”. The input may also describe a diverse set of motions, e.g., “a man is playing golf”, “a person is playing the violin”, “a person walks steadily along a path while holding onto rails to keep balance”. The second challenge is that one textual description can map to multiple motions. This requires the generative model to be probabilistic. For instance, the description “a person is walking” should yield multiple output samples with different velocities and directions.

Early methods [1, 2, 3, 4] for generating 3D human motions are based on very simple textual descriptions, such as an action category, e.g., jump, throw, or run. This type of setup has two limitations. First, the feature space of the input texts is too sparse, so the solutions do not generalize to texts outside the distribution of the dataset. Second, category-based texts have very limited applications in real-world scenarios. With the emergence of the KIT-ML dataset [5], which contains 6,278 long sentence descriptions and 3,911 complex motions, a series of works [6, 7] started to convert complex sentences into motions. They usually design a sequence-to-sequence architecture that generates a single result. However, this is inconsistent with the nature of the motion generation task, because every language description corresponds to a very diverse set of 3D motions. Most recently, a new dataset, HumanML3D, and a new model have been proposed in [8] to address the above problems. The dataset consists of 14,616 motion clips and 44,970 text descriptions and provides the basis for training models that can generate multiple results. The model proposed in [8] is able to generate high-fidelity and multiple samples and achieves state-of-the-art quantitative results. However, the generated samples have very limited diversity, and the model is not capable of zero-shot generation. In addition, it consists of several sub-models that cannot be trained end-to-end, and the inference process is very complex.

A new paradigm for image and video generation, denoising diffusion probabilistic models, has recently emerged and achieved remarkable results [9, 10]. A diffusion model learns an iterative denoising process that gradually recovers the target output from Gaussian noise at inference time. Many recent papers aim to generate images from textual descriptions. They blend text into the input and guide the generation using techniques such as classifier guidance [11] and classifier-free guidance [12], which synthesize impressive samples, e.g., [13]. Diffusion models have also been applied to other modalities such as speech generation [14] and point cloud generation [15], achieving much better results than previously possible.

We apply the diffusion model to the task of 3D human motion generation based on textual descriptions. Our results show that the generated samples have better fidelity and wider diversity, and that the guidance of texts is more controllable. More specifically, we make the following contributions:

  • To the best of our knowledge, we are the first to construct a diffusion architecture for 3D human motion generation based on textual descriptions.

  • Based on visualized experimental results, we discuss the advantages of our diffusion model in terms of flexibility in text control, diversity of generated samples, and zero-shot capability for motion generation.

  • Quantitatively, we achieve very competitive results with existing metrics on the HumanML3D test set.

2 Method

In this section, we first formulate the probabilistic model of the forward and reverse diffusion processes for 3D human motion generation from text descriptions. Then, we explain the mathematical expression of the modified objective for training the model. Lastly, we present the full training and sampling algorithms. The overall architecture is shown in Fig. 1.

2.1 Diffusion Processes

Each 3D human pose is represented by the positions of $J$ keypoints. Each keypoint has three coordinates in 3D. Therefore, a 3D human pose can be denoted by $\mathbf{p}\in\mathbb{R}^{J\times 3}$. A 3D human motion is a sequence of poses and hence can be written as $\mathbf{x}\in\mathbb{R}^{L\times J\times 3}$, where $L$ represents the number of time steps. For simplicity, we flatten the pose representation into a one-dimensional vector. Then the motion can be represented as $\mathbf{x}\in\mathbb{R}^{L\times C}$, where $C=J\times 3$.
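As a concrete illustration of this representation, the short sketch below flattens a motion clip into the $L\times C$ matrix the model consumes (a minimal sketch; the clip length and joint count used here are illustrative assumptions, not the dataset's exact values):

```python
import torch

L, J = 60, 22                    # assumed clip length and joint count (illustrative only)
motion = torch.randn(L, J, 3)    # a motion clip: L frames, each with J keypoints in 3D

# Flatten each pose (J, 3) into a C-dimensional vector, C = J * 3.
C = J * 3
x0 = motion.reshape(L, C)        # shape (L, C); this plays the role of x_0 in the diffusion model
print(x0.shape)                  # torch.Size([60, 66])
```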

The denoising diffusion probabilistic model consists of a forward process and a reverse process. The forward diffusion process converts the original highly structured and semantically relevant keypoint distribution into a Gaussian noise distribution. We denote motions with increasing levels of noise as $\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{T}$. The forward diffusion process is modeled as a Markov chain:

$q\left(\mathbf{x}_{1:T}\mid\mathbf{x}_{0}\right)=\prod_{t=1}^{T}q\left(\mathbf{x}_{t}\mid\mathbf{x}_{t-1}\right)$ (1)

where $q\left(\mathbf{x}_{t}\mid\mathbf{x}_{t-1}\right)$ denotes the forward transition, which adds noise to the motion at the previous time step to generate the distribution at the next time step. Given a set of pre-defined hyper-parameters $\beta_{1},\beta_{2},\ldots,\beta_{T}$, each transition step can be written as

$q\left(\mathbf{x}_{t}\mid\mathbf{x}_{t-1}\right):=\mathcal{N}\left(\sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1},\beta_{t}\mathbf{I}\right)$ (2)

where $\beta_{t}$ controls the diffusion rate of the process and usually ranges from 0 to 1.
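For reference, a minimal sketch of the forward process: composing the Gaussian transitions of Eq. (2) lets $\mathbf{x}_{t}$ be sampled directly from $\mathbf{x}_{0}$ via $\bar{\alpha}_{t}=\prod_{s\le t}(1-\beta_{s})$, which is also how Algorithm 1 below constructs its noisy input. The linear schedule values are assumptions, since the exact schedule is not stated here:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear beta schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    if noise is None:
        noise = torch.randn_like(x0)         # epsilon ~ N(0, I)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```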

The reverse diffusion process is a generation process where keypoints sampled from a Gaussian noise distribution $p\left(\mathbf{x}_{T}\right)$ are given as the initial input. The initial input is then progressively denoised along an inverse Markov chain following the given text description. The text description is encoded into a latent variable $\mathbf{z}$ using the BERT [16] model, and $\theta$ denotes the parameters of the denoising model. The reverse diffusion process can be written as follows:

$p_{\boldsymbol{\theta}}\left(\mathbf{x}_{0:T}\mid\mathbf{z}\right)=p\left(\mathbf{x}_{T}\right)\prod_{t=1}^{T}p_{\boldsymbol{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{z}\right)$ (3)

$p_{\boldsymbol{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{z}\right)=\mathcal{N}\left(\mathbf{x}_{t-1}\mid\boldsymbol{\mu}_{\boldsymbol{\theta}}\left(\mathbf{x}_{t},t,\mathbf{z}\right),\beta_{t}\mathbf{I}\right)$ (4)

where $\boldsymbol{\mu}_{\boldsymbol{\theta}}$ is the target we estimate with a neural network, and $t$ is the timestep indicating how far the denoising process has proceeded, which is encoded as a vector based on the cosine schedule [17]. $p_{\boldsymbol{\theta}}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{z}\right)$ represents the reverse transition probability of the keypoints from one step to the previous step.
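The sketch below spells out one reverse transition of Eq. (4) with the standard DDPM parameterization of $\boldsymbol{\mu}_{\boldsymbol{\theta}}$ in terms of the predicted noise; it is a generic illustration under the same assumed schedule as above, not the authors' exact implementation:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # same assumed schedule as in the previous sketch
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def p_sample_step(denoiser, x_t, t, z):
    """One reverse transition x_t -> x_{t-1} given the predicted noise epsilon_theta(x_t, t, z)."""
    eps_hat = denoiser(x_t, t, z)
    mean = (x_t - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    if t == 0:
        return mean                           # no noise is added at the final step
    return mean + betas[t].sqrt() * torch.randn_like(x_t)
```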

2.2 Training Objective

In the previous subsection, we established the forward and reverse diffusion process for 3D human motion generation from textual descriptions. In this subsection, we will derive the mathematical expressions for our training objective of the reverse diffusion process.

In order to approximate the intractable marginal likelihood $p_{\boldsymbol{\theta}}\left(\mathbf{x}_{0}\right)$, we maximize a variational lower bound on the log-likelihood:

$\mathrm{E}_{q\left(\mathbf{x}_{0}\right)}\left[\log p_{\theta}\left(\mathbf{x}_{0}\right)\right]\geq\mathrm{E}_{q\left(\mathbf{x}_{0:T}\right)}\left[\log\frac{p_{\theta}\left(\mathbf{x}_{0:T},\mathbf{z}\right)}{q\left(\mathbf{x}_{1:T},\mathbf{z}\mid\mathbf{x}_{0}\right)}\right]$ (5)

This is the same objective as in DDPM [9] except that it involves the text embedding $\mathbf{z}$. During training, we fuse the text embedding $\mathbf{z}$ and the time embedding $\mathbf{t}$ into a single embedding. For simplicity, we still denote the fused embedding by $\mathbf{t}$, which makes the objective consistent with DDPM:

$\mathrm{E}_{q\left(\mathbf{x}_{0}\right)}\left[\log p_{\theta}\left(\mathbf{x}_{0}\right)\right]\geq\mathrm{E}_{q\left(\mathbf{x}_{0:T}\right)}\left[\log\frac{p_{\theta}\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T}\mid\mathbf{x}_{0}\right)}\right]$ (6)

Applying the same simplification as DDPM, our final training objective is

$\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\left(\mathbf{x}_{t},\mathbf{t}\right)\right\|^{2},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ (7)

where $\boldsymbol{\epsilon}$ is the noise sampled from the standard Gaussian distribution and $\boldsymbol{\epsilon}_{\theta}\left(\mathbf{x}_{t},\mathbf{t}\right)$ is the output of the noise prediction model.
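A minimal sketch of Eq. (7) as a training loss (the names `denoiser` and `cond_emb`, and the schedule, are illustrative assumptions; `cond_emb` stands for the fused time/text embedding $\mathbf{t}$):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, x0, t, cond_emb):
    """Simplified objective of Eq. (7): MSE between the sampled and the predicted noise."""
    noise = torch.randn_like(x0)                    # epsilon ~ N(0, I)
    a_bar = alpha_bars[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return F.mse_loss(denoiser(x_t, cond_emb), noise)
```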

Fig. 1: Architecture overview. The forward process is a Markov process with no trainable parameters; the reverse process uses a U-Net-like architecture to predict the noise at every step. We utilize a pre-trained BERT model to extract features from the text input.
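The fusion $f(\mathbf{t},\mathbf{z})$ of the timestep embedding and the BERT sentence embedding is not spelled out here; the sketch below shows one plausible choice (project the text embedding and add it to the time embedding), purely as an assumption for illustration:

```python
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Hypothetical fusion f(t, z): project the BERT sentence embedding to the
    timestep-embedding dimension and add the two (an assumed design, not the authors' exact f)."""
    def __init__(self, text_dim=768, time_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, time_dim)

    def forward(self, t_emb, z):
        # t_emb: (B, time_dim) cosine-schedule timestep embedding
        # z:     (B, text_dim) BERT sentence embedding (or a null embedding when unconditioned)
        return t_emb + self.text_proj(z)
```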

2.3 Training Algorithm

In principle, the training target is to minimize Eq. (7). The simplified training and sampling algorithms are as follows:

Algorithm 1 Training Process (Simplified)
  repeat
      $\mathbf{x}_{0}, \mathbf{z} \sim q\left(\mathbf{x}_{0}\right)$
      $\mathbf{z} \leftarrow \varnothing$ with probability $p_{\text{uncond}}$
      $t \sim \operatorname{Uniform}(\{1,\ldots,T\})$
      $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$
      $\mathbf{t} = f(t,\mathbf{z})$
      Take a gradient descent step on
          $\nabla_{\theta}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon},\mathbf{t}\right)\right\|^{2}$
  until converged
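The step "$\mathbf{z}\leftarrow\varnothing$ with probability $p_{\text{uncond}}$" can be sketched as follows; representing $\varnothing$ with a learned null embedding and the value of $p_{\text{uncond}}$ are assumptions, as they are not specified here:

```python
import torch
import torch.nn as nn

p_uncond = 0.1                                     # assumed probability of dropping the condition
null_embedding = nn.Parameter(torch.zeros(768))    # assumed learned stand-in for the empty condition

def maybe_drop_condition(z):
    """With probability p_uncond, replace the text embedding by the null embedding,
    so the same network also learns the unconditional noise prediction."""
    if torch.rand(()) < p_uncond:
        return null_embedding.expand_as(z)
    return z
```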
Algorithm 2 Sampling Process (Simplified)
  $\mathbf{x}_{T} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$
  for $t = T,\ldots,1$ do:
      $\tilde{\boldsymbol{\epsilon}}_{t}=(1+w)\,\boldsymbol{\epsilon}_{\theta}\left(\mathbf{x}_{t},\mathbf{z}\right)-w\,\boldsymbol{\epsilon}_{\theta}\left(\mathbf{x}_{t}\right)$
      Sampling step (we adopt the DDPM sampling step)
  end for
  return $\mathbf{x}_{0}$
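A sketch of the full classifier-free guided sampling loop of Algorithm 2, combining the guided noise estimate $\tilde{\boldsymbol{\epsilon}}_{t}$ with the DDPM sampling step (the guidance weight, the schedule, and the `null_z` argument standing for the unconditional embedding are assumptions):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def guided_sample(denoiser, shape, z, null_z, w=2.0):
    """Generate a motion from Gaussian noise with classifier-free guidance."""
    x = torch.randn(shape)                           # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_cond = denoiser(x, t, z)                 # epsilon_theta(x_t, z)
        eps_uncond = denoiser(x, t, null_z)          # epsilon_theta(x_t), i.e. unconditional
        eps = (1.0 + w) * eps_cond - w * eps_uncond  # guided noise estimate (Algorithm 2)
        # Standard DDPM sampling step using the guided estimate.
        mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean if t == 0 else mean + betas[t].sqrt() * torch.randn_like(x)
    return x
```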

3 Experiments

We present our experimental setup in Section 3.1 and compare quantitative and visualized results with the current SOTA method Temporal VAE [8] in Section 3.2.

3.1 Experimental Setup

Datasets. For 3D human motion generation from textual descriptions, we use the HumanML3D [8] dataset, which originates from a combination of the HumanAct12 [1] and AMASS [18] datasets. It covers a broad range of human actions, such as daily activities.

Baseline Methods. We compare our work to Seq2Seq [19], Language2Pose [6], Text2Gesture [20], MoCoGAN [3], Dance2Music [21] and Temporal VAE [8]. Except for Temporal VAE, all of these existing methods are either deterministic, meaning they can generate only one result from each text input, or unable to perform 3D human motion generation from textual descriptions directly.

Evaluation Metrics. Our quantitative evaluation measures the naturalness, semantic relevance and diversity of the generated 3D human motions. The four metrics are detailed below:

  1. Recognition Precision. We use a pre-trained motion encoder and text encoder to compute embeddings of the generated samples and of the texts, and then compare their similarity.

  2. Frechet Inception Distance (FID). Unlike image FID, where the Inception network is the agreed-upon feature extractor, there is no agreed-upon network for 3D human motion. To better compare with Temporal VAE's experimental results, we use the same encoder as in their work to measure FID. Their model is both trained and evaluated with this encoder, which puts our method at a disadvantage, so we report this number only as a reference point.

  3. Diversity. We propose this new metric, which evaluates diversity without an encoder. Specifically, diversity measures how much the generated motions vary for the same text input. Given a set of generated 3D motions for $T$ text inputs, for the $t$-th text input we randomly sample two subsets of the same size $S$, and use $\mathbf{m}$ to represent a generated motion (see the sketch after this list). The diversity is formalized as:

     Diversity $=\frac{1}{T\times S}\sum_{t=1}^{T}\sum_{i=1}^{S}\left\|\mathbf{m}_{t,i}-\mathbf{m}_{t,i}^{\prime}\right\|_{2}$ (8)
    Method            | R Precision (Top-1) ↑ | R Precision (Top-2) ↑ | R Precision (Top-3) ↑ | FID ↓  | Diversity ↑ | Variance →
    Real motions      | 0.511                 | 0.703                 | 0.707                 | 0.002  | 0           | 9.503
    Seq2Seq [19]      | 0.180                 | 0.300                 | 0.396                 | 11.75  | 0           | 6.223
    Language2Pose [6] | 0.246                 | 0.387                 | 0.486                 | 11.02  | 0           | 7.626
    Text2Gesture [20] | 0.165                 | 0.267                 | 0.345                 | 7.664  | 0           | 6.409
    MoCoGAN [3]       | 0.037                 | 0.072                 | 0.106                 | 94.41  | 9.421       | 0.462
    Dance2Music [21]  | 0.033                 | 0.065                 | 0.097                 | 66.98  | 7.235       | 0.725
    Temporal VAE [8]  | 0.457                 | 0.639                 | 0.740                 | 1.067  | 18.529      | 9.188
    Ours              | 0.406                 | 0.612                 | 0.735                 | 10.21  | 23.692      | 7.660
    Table 1: Quantitative evaluation results on the HumanML3D test set. R Precision (Top-k) is short for Recognition Precision with Top-k accuracy. ↑ means larger is better; ↓ means smaller is better; → means closer to Real motions is better.
  4. Variance. This metric measures the variance of the generated samples without considering the text input. For any two subsets of the dataset with the same size $S$, suppose that after encoding with the Temporal VAE encoder their motion feature vectors are $\{\mathbf{v}_{1},\ldots,\mathbf{v}_{S}\}$ and $\{\mathbf{v}_{1}^{\prime},\ldots,\mathbf{v}_{S}^{\prime}\}$ (also illustrated in the sketch below). The variance of motions is calculated as:

     Variance $=\frac{1}{S}\sum_{i=1}^{S}\left\|\mathbf{v}_{i}-\mathbf{v}_{i}^{\prime}\right\|_{2}$ (9)
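The two metrics above can be sketched as follows; the tensor shapes and names are illustrative assumptions (Diversity operates directly on generated motions for the same text inputs, Variance on encoded feature vectors):

```python
import torch

def diversity(motions_a, motions_b):
    """Eq. (8): mean distance between paired motions from two equally sized
    subsets generated for the same T text inputs. Shapes: (T, S, ...)."""
    T, S = motions_a.shape[:2]
    diffs = (motions_a - motions_b).flatten(start_dim=2)   # flatten each motion to a vector
    return diffs.norm(dim=-1).sum() / (T * S)

def variance(feats_a, feats_b):
    """Eq. (9): mean distance between encoded feature vectors of two equally
    sized subsets, ignoring the text input. Shapes: (S, D)."""
    return (feats_a - feats_b).norm(dim=-1).mean()
```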

A person walks steadily along a path while holding onto rails to keep balance.

Fig. 2: Visualized results of our method vs. those of [8]. Given the text input shown above, we show generated motions from our method and from Temporal VAE. For the same input, our method clearly produces much more diverse samples than Temporal VAE.

3.2 Quantitative and Visualized Evaluation

Table 1 shows the quantitative evaluation results on the HumanML3D dataset. For Recognition Precision, our method and Temporal VAE are far superior to the other methods. This indicates that the generated keypoint sequences are highly correlated with the textual descriptions. For the FID and Diversity metrics, Seq2Seq [19], Language2Pose [6] and Text2Gesture [20] can generate natural results, but due to their deterministic model design, they have no diversity on this task. MoCoGAN [3] and Dance2Music [21] have unfavorable FID scores but achieve diversity. Clearly, there is a trade-off between the naturalness and diversity of the generated results. Overall, our method and Temporal VAE are far better than the other methods in terms of both naturalness and diversity.

Fig. 2 illustrates the visualized results of our method vs. Temporal VAE. For the input “A person walks steadily along a path while holding onto rails to keep balance”, our method generates motions with different directions, velocities and distances. This shows that the diversity of our method goes beyond numerical variance and reaches the semantic level. We do not observe this level of diversity in other models. For instance, the generated results from Temporal VAE contain only minor differences. Our model even interprets ‘path’ as ‘stairs’, as shown in rows 4 and 5.

Thanks to the pre-trained BERT model [16] and the classifier-free guidance technique [12], our diffusion model has very impressive zero-shot capability. As shown in Fig. 3, for unseen actions and combinations, our method can generate natural and semantically relevant results.


Fig. 3: Zero-shot generation results. The zero-shot text description inputs are:
Row 1: a person sits cross-legged for a while, and then suddenly stands up.
Row 2: a person is jumping hard over a big rock.
Row 3: a person flips twice in a row.
Row 4: a person first runs counterclockwise, then walks clockwise, then runs counterclockwise.
Row 5: a man crawling on the ground with his knees.

4 Conclusion

Our work applies diffusion models to the task of 3D human motion generation based on textual descriptions. We introduce three new techniques: we modify the architecture of the denoising model to take 3D motion input, we fuse the BERT embedding into the time embedding, and we apply the classifier-free guidance technique in this setting. On the largest language-motion dataset, HumanML3D, our method performs competitively in quantitative evaluations. The visualized results show that we excel in diversity by a wide margin in meaningful ways. We also demonstrate that our method achieves impressive performance in zero-shot settings.

References

  • [1] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng, “Action2motion: Conditioned generation of 3d human motions,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2021–2029.
  • [2] Haoye Cai, Chunyan Bai, Yu-Wing Tai, and Chi-Keung Tang, “Deep video generation, prediction and completion of human action sequences,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 366–382.
  • [3] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz, “Mocogan: Decomposing motion and content for video generation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1526–1535.
  • [4] Mathis Petrovich, Michael J Black, and Gül Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10985–10995.
  • [5] Matthias Plappert, Christian Mandery, and Tamim Asfour, “The kit motion-language dataset,” Big data, vol. 4, no. 4, pp. 236–252, 2016.
  • [6] Chaitanya Ahuja and Louis-Philippe Morency, “Language2pose: Natural language grounded pose forecasting,” in 2019 International Conference on 3D Vision (3DV). IEEE, 2019, pp. 719–728.
  • [7] Anindita Ghosh, Noshaba Cheema, Cennet Oguz, Christian Theobalt, and Philipp Slusallek, “Synthesis of compositional animations from textual descriptions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1396–1406.
  • [8] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161.
  • [9] Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  • [10] Jiaming Song, Chenlin Meng, and Stefano Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
  • [11] Prafulla Dhariwal and Alexander Nichol, “Diffusion models beat gans on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.
  • [12] Jonathan Ho and Tim Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
  • [13] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
  • [14] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,” arXiv preprint arXiv:2009.09761, 2020.
  • [15] Shitong Luo and Wei Hu, “Diffusion probabilistic models for 3d point cloud generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2837–2845.
  • [16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [17] Alexander Quinn Nichol and Prafulla Dhariwal, “Improved denoising diffusion probabilistic models,” in International Conference on Machine Learning. PMLR, 2021, pp. 8162–8171.
  • [18] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black, “Amass: Archive of motion capture as surface shapes,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5442–5451.
  • [19] Angela S. Lin, Lemeng Wu, Rodolfo Corona, Kevin Tai, Qixing Huang, and Raymond J. Mooney, “Generating animated videos of human activities from natural language descriptions,” in Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS 2018, December 2018.
  • [20] Uttaran Bhattacharya, Nicholas Rewkowski, Abhishek Banerjee, Pooja Guhan, Aniket Bera, and Dinesh Manocha, “Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents,” in 2021 IEEE Virtual Reality and 3D User Interfaces (VR). IEEE, 2021, pp. 1–10.
  • [21] Taoran Tang, Jia Jia, and Hanyang Mao, “Dance with melody: An lstm-autoencoder approach to music-oriented dance synthesis,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 1598–1606.