
MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation

Nhat M. Hoang1,2, Kehong Gong1, Chuan Guo1, Michael Bi Mi1
Work done during an internship at Huawei. Corresponding author.
Abstract

Controllable generation of 3D human motions has become an important topic as the world embraces digital transformation. Existing works, though making promising progress with the advent of diffusion models, heavily rely on meticulously captured and annotated (e.g., text) high-quality motion corpora, a resource-intensive endeavor in the real world. This motivates our proposed MotionMix, a simple yet effective weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences. Specifically, we separate the denoising objectives of a diffusion model into two stages: obtaining conditional rough motion approximations in the initial T-T^{*} steps by learning from the noisy annotated motions, followed by the unconditional refinement of these preliminary motions during the last T^{*} steps using unannotated motions. Notably, though learning from two sources of imperfect data, our model does not compromise motion generation quality compared to fully supervised approaches that access gold data. Extensive experiments on several benchmarks demonstrate that our MotionMix, as a versatile framework, consistently achieves state-of-the-art performance on text-to-motion, action-to-motion, and music-to-dance tasks.

Figure 1: Examples of applying MotionMix to text-to-motion generation. Unlike previous works, our training data consist only of noisy annotated motions and unannotated motions. https://nhathoang2002.github.io/MotionMix-page/

1 Introduction

The rapidly growing attention and interest in digital humans create great demand for human motion generation in a wide range of fields, such as game and movie animation (Ling et al. 2020), human-machine interaction (Koppula and Saxena 2013), and VR/AR and metaverse development (Lee et al. 2021). Over the years, the automated generation of human motions that align with user preferences, spanning aspects such as prefix poses (Ruiz, Gall, and Moreno-Noguer 2018; Guo et al. 2022c), action classes (Petrovich, Black, and Varol 2021; Cervantes et al. 2022), textual descriptions (Petrovich, Black, and Varol 2022; Ahuja and Morency 2019; Tevet et al. 2022), or music (Aristidou et al. 2021; Siyao et al. 2022; Gong et al. 2023), has been a focal point of research. Recently, building upon the advancement of diffusion models, human motion generation has experienced a notable improvement in quality and controllability. However, these prior diffusion models are commonly trained on well-crafted motions that come with explicit annotations such as textual descriptions. Capturing motions from the real world is already a laborious effort, and annotating these motion sequences further compounds the problem.

In contrast, motions with lower fidelity or fewer annotations are far more accessible in the real world. For example, 3D human motions are readily extracted from monocular videos through video-based pose estimation (Kanazawa et al. 2017; Kocabas, Athanasiou, and Black 2019; Choutas et al. 2020). Meanwhile, a wealth of unannotated motion sequences, such as those available from Mixamo (Inc. 2021) and AMASS (Mahmood et al. 2019), remains largely untapped. This brings up the question we investigate in this work, as illustrated in Figure 1: can we learn reliable diffusion models for controllable motion generation from the supervision of noisy and unannotated motion sequences?

Fortunately, with the inherent denoising mechanism of diffusion models, we are able to answer this question with a simple yet effective solution that applies separate diffusion steps depending on the source of the training motion data, referred to as MotionMix. To demonstrate our application and approach, we split each gold annotated motion dataset into two halves: the first half of the motions are injected with random-scale Gaussian noise (noisy half), and the second half is deprived of annotations (clean half). As in Figure 2, the diffusion model relies on the clean samples for diffusion steps in [1,T^{*}], with the condition input erased. Meanwhile, noisy motions supervise the model with explicit conditions for the remaining steps [T^{*}+1,T]. Note that T^{*} is an experimental hyper-parameter, with its role analyzed in later ablation studies. Our key insight is that, during sampling, starting from Gaussian noise, the model first produces rough motion approximations with conditional guidance in the initial T-T^{*} steps; afterward, these rough approximations are further refined by unconditional sampling in the last T^{*} steps. Despite learning from weak supervision signals, our proposed MotionMix empirically facilitates motion generation of higher quality than fully supervised models on multiple applications. Benefiting from its concise design, MotionMix finds its place in many applications. In this work, we thoroughly examine the effectiveness and flexibility of the proposed approach through extensive experiments on benchmarks of text-to-motion, music-to-dance, and action-to-motion tasks.

The main contributions of our work can be summarized as follows:

\bullet We present MotionMix, the first weakly-supervised approach for conditional diffusion models that utilizes both noisy annotated and clean unannotated motion sequences simultaneously.

\bullet We demonstrate that by training with these two sources of data simultaneously, MotionMix can improve upon prior state-of-the-art motion diffusion models across various tasks and benchmarks, without any conflict.

\bullet Our approach opens new avenues for addressing the scarcity of clean and annotated motion sequences, paving the way for scaling up future research by effectively harnessing available motion resources.

Figure 2: (Left) Training Process. The model is trained with a mixture of noisy and clean data. A noise timestep in the ranges [1,T^{*}] and [T^{*}+1,T] is sampled for each clean and noisy sample, respectively. Here, T^{*} is a denoising pivot that determines the starting point from which the diffusion model refines the noisy motion sequences into clean ones without any guidance. (Right) Sampling Process. The sampling process consists of two stages. In Stage-1, from timestep T to T^{*}+1, the model generates rough motion approximations, guided by the conditional input c. In Stage-2, from timestep T^{*} to 1, the model refines these approximations into high-quality motion sequences while the input c is masked.

2 Related Work

2.1 Weakly-Supervised Learning

To tackle the limited availability of annotated data, researchers have been exploring the use of semi-supervised generative models, using both annotated and unannotated data (Kingma et al. 2014; Li et al. 2017; Lucic et al. 2019). However, the investigation of semi-supervised diffusion models remains limited (You et al. 2023), possibly due to the significant performance gap observed between conditional and unconditional diffusion models (Bao et al. 2022; Dhariwal and Nichol 2021; Tevet et al. 2022). Moreover, many state-of-the-art models, such as Stable Diffusion (Rombach et al. 2021), implicitly assume the availability of abundant annotated data for training (Chang, Koulieris, and Shum 2023; Kawar et al. 2023). This assumption poses a challenge when acquiring high-quality annotated data is expensive, particularly in the case of 3D human motion data.

Recent interest has emerged in developing data-efficient approaches for training conditional diffusion models with low-quality data (Daras et al. 2023; Kawar et al. 2023), or utilizing unsupervised (Tur et al. 2023), semi-supervised (You et al. 2023), or self-supervised (Miao et al. 2023) methods. These approaches have exhibited promising results across various domains and hold potential for future exploration of diffusion models when handling limited annotated data. However, in the domain of human motion generation, efforts toward these approaches have been even more limited. One related work, Make-An-Animation (Azadi et al. 2023), trains a diffusion model utilizing unannotated motions in a semi-supervised setting. In contrast, our work introduces a unique aspect by training with noisy annotated motions and clean unannotated motions.

2.2 Conditional Motion Generation

Over the years, human motion generation has been extensively studied using various signals, including prefix poses (Ruiz, Gall, and Moreno-Noguer 2018; Guo et al. 2022c; Petrovich, Black, and Varol 2021), action classes (Guo et al. 2020; Petrovich, Black, and Varol 2021; Cervantes et al. 2022), textual descriptions (Guo et al. 2022b; Petrovich, Black, and Varol 2022; Ghosh et al. 2021; Guo et al. 2022a; Ahuja and Morency 2019; Bhattacharya et al. 2021), or music (Li et al. 2020; Aristidou et al. 2021; Li et al. 2021; Siyao et al. 2022; Gong et al. 2023). However, it is non-trivial for these methods to align the distributions of motion sequences and conditions such as natural language or speech (Chen et al. 2022). Diffusion models resolve this problem using a dedicated multi-step gradual diffusing and denoising process (Ramesh et al. 2022a; Saharia et al. 2022; Ho et al. 2022). Recent advancements, such as MDM (Tevet et al. 2022), MotionDiffuse (Zhang et al. 2022), and MLD (Chen et al. 2022), have demonstrated the ability of diffusion-based models to generate plausible human motion, guided by textual descriptions or action classes. In the domain of music, EDGE (Tseng, Castellon, and Liu 2022) showcased high-quality dance generation across diverse music categories. Nevertheless, these works still rely on high-quality motion datasets with annotated guidance.

3 Method

3.1 Problem Formulation

Conditional motion generation involves generating high-quality and diverse human motion sequences based on a desired conditional input c. This input can take various forms, such as a textual description w^{1:N} of N words (Guo et al. 2022b), an action class a\in A (Guo et al. 2020), music audio m (Li et al. 2021), or even an empty condition c=\emptyset (unconditional input) (Raab et al. 2022). Our goal is to train a diffusion model in a weakly-supervised manner, using both noisy motion sequences with conditional inputs c\in\{\emptyset,a,w^{1:N},m\} (where \emptyset is used when classifier-free guidance (Ho and Salimans 2022) is applied) and clean motion sequences with the unconditional input c=\emptyset. Despite being trained with noisy motions, our model can consistently generate plausible motion sequences. To achieve this, we propose a two-stage reverse process, as illustrated in Figure 2.

3.2 Diffusion Probabilistic Model

The general idea of a diffusion model, as defined by the denoising diffusion probabilistic model (DDPM) (Ho, Jain, and Abbeel 2020), is to design a diffusion process that gradually adds noise to a data sample and to train a neural model to learn a reverse process that denoises it back to a clean sample. Specifically, the diffusion process can be modeled as a Markov noising process \{\mathbf{x}_{t}\}_{t=0}^{T}, where \mathbf{x}_{0}\sim p(x) is a clean sample drawn from the data distribution. The noised \mathbf{x}_{t} is obtained by applying Gaussian noise \boldsymbol{\epsilon}_{t} to \mathbf{x}_{0} through the posterior:

q(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0},(1-\bar{\alpha}_{t})\mathbf{I})   (1)

where \bar{\alpha}_{t}\in(0,1) are constants that follow a monotonically decreasing schedule. Thus, when \bar{\alpha}_{t} is small enough, we can approximate \mathbf{x}_{T}\sim\mathcal{N}(0,\mathbf{I}).

In the reverse process, given the condition c, a neural model f_{\theta} is trained to estimate the clean sample \mathbf{x}_{0} (Ramesh et al. 2022b) or the added noise \boldsymbol{\epsilon}_{t} (Ho, Jain, and Abbeel 2020) for all t. The model parameters \theta are optimized using the "simple" objective introduced by Ho, Jain, and Abbeel:

\mathcal{L}_{\text{simple}}=\mathbb{E}_{t\sim[1,T],\mathbf{s}_{t}}\Big[\|\mathbf{s}_{t}-f_{\theta}(\mathbf{x}_{t},t,c)\|^{2}\Big]   (2)

where the target objective \mathbf{s}_{t} refers to either \mathbf{x}_{0} or \boldsymbol{\epsilon}_{t} for ease of notation.
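To make the above concrete, below is a minimal PyTorch sketch of Equation 1 and the simple objective of Equation 2. The helper names (q_sample, simple_loss) and the alpha_bar indexing convention are our own illustrative assumptions, not code from the MDM, MotionDiffuse, or EDGE implementations.

```python
import torch

def q_sample(x0, t, alpha_bar):
    """Eq. (1): draw x_t ~ N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).

    x0: clean motions, shape (B, ...); t: integer timesteps, shape (B,);
    alpha_bar: cumulative noise-schedule products indexed by timestep.
    """
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over motion dims
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise, noise

def simple_loss(model, x0, t, cond, alpha_bar, predict_x0=True):
    """Eq. (2): regress either the clean sample x_0 or the injected noise."""
    x_t, noise = q_sample(x0, t, alpha_bar)
    target = x0 if predict_x0 else noise
    pred = model(x_t, t, cond)
    return ((target - pred) ** 2).mean()
```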

3.3 Training

We propose a novel weakly-supervised learning approach that enables a diffusion model to effectively utilize both noisy and clean motion sequences. During the training phase, we construct batches comprising both noisy and clean samples, each coupled with a corresponding guidance condition c, as further detailed in Subsection 3.5. To learn the denoising process, we apply the diffusion process to this batch using Equation 1 with varying noise timesteps. In contrast to conventional training, where both noisy and clean motion sequences are treated as the ground truth \mathbf{x}_{0} with diffusion steps spanning [1,T], our approach adopts separate ranges for the two data types. For noisy samples, we randomly select noise timesteps t\in[T^{*}+1,T], while for clean samples, we confine them to t\in[1,T^{*}]. Here, T^{*} serves as a denoising pivot, determining when the diffusion model starts refining noisy motion sequences into cleaner versions. This pivot is especially crucial in real-world applications, where motion capture data might be corrupted by noise due to diverse factors. This denoising strategy for noisy motions draws inspiration from (Nie et al. 2022), which purified adversarial images by diffusing them up to a specific timestep T^{*} before denoising them back to clean images. The determination of T^{*} typically relies on empirical estimation; its impact on generation quality is further analyzed in Table 4.

Through this training process, the model becomes adept at generating initial rough motions from T to T^{*}+1, and subsequently refining these rough motions into high-quality ones from T^{*} to 1. By dividing into two distinct time ranges, the model can effectively learn from both noisy and clean motion sequences as ground truth without any conflict.
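As a rough illustration of this split, the sketch below samples per-example timesteps so that noisy annotated samples fall in [T^{*}+1,T] and clean unannotated samples in [1,T^{*}]; both halves are otherwise treated as ground truth under the simple objective sketched in Subsection 3.2. Names such as is_noisy and motionmix_step are hypothetical.

```python
import torch

def sample_motionmix_timesteps(is_noisy, T, T_star, device):
    """Per-sample diffusion timesteps respecting the denoising pivot T*."""
    n = is_noisy.numel()
    t_noisy = torch.randint(T_star + 1, T + 1, (n,), device=device)  # [T*+1, T]
    t_clean = torch.randint(1, T_star + 1, (n,), device=device)      # [1, T*]
    return torch.where(is_noisy, t_noisy, t_clean)

def motionmix_step(model, x0, cond, is_noisy, alpha_bar, T, T_star):
    # cond is already the empty condition for the clean, unannotated samples
    t = sample_motionmix_timesteps(is_noisy, T, T_star, x0.device)
    return simple_loss(model, x0, t, cond, alpha_bar)  # reuses the sketch above
```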

3.4 Two-stage Sampling and Guidance

Our approach introduces a modification to the conventional DDPM sampling procedure, which commonly relies on the same explicit conditional input c to guide the denoising operation at each timestep t, starting from T and denoising back to the next timestep t-1 until reaching t=0. However, it is important to note that our work specifically handles clean, unannotated samples. As discussed in Subsection 3.3, these samples are trained with the identical guidance condition c=\emptyset, confined to the time interval [1,T^{*}]. Consequently, if the conventional DDPM sampling process were employed within this temporal range, it could lead to jittering or the generation of unrealistic motions, because the model is not trained to handle varying conditions within this specific range. To tackle this issue, we adopt a distinct strategy to align the sampling process accordingly. Specifically, when the model reaches the denoising pivot T^{*} during sampling, we substitute the conditional input with c=\emptyset from T^{*} onward.

In the case of using classifier-free guidance (Ho and Salimans 2022), guided inference is employed for all t, which involves generating motion samples through a weighted sum of unconditionally and conditionally generated samples:

\hat{\mathbf{s}}(\mathbf{x}_{t},t,c)=w\cdot f_{\theta}(\mathbf{x}_{t},t,c)+(1-w)\cdot f_{\theta}(\mathbf{x}_{t},t,\emptyset)   (3)

where w is the guidance weight during sampling.
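A rough sketch of the resulting two-stage sampler is given below, assuming a model that predicts the clean sample and an externally supplied helper ddpm_posterior_sample implementing the standard DDPM posterior step; both the helper and the argument names are assumptions for illustration. In Stage-1 (t = T, ..., T^{*}+1) the guided estimate of Equation 3 is used; in Stage-2 (t = T^{*}, ..., 1) the condition is replaced by \emptyset, so the conditional and unconditional branches coincide.

```python
import torch

@torch.no_grad()
def two_stage_sample(model, shape, cond, empty_cond, T, T_star, w, ddpm_posterior_sample):
    x_t = torch.randn(shape)                        # start from Gaussian noise
    for t in range(T, 0, -1):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        if t > T_star:
            # Stage-1: classifier-free guidance, Eq. (3)
            s_hat = w * model(x_t, t_batch, cond) + (1 - w) * model(x_t, t_batch, empty_cond)
        else:
            # Stage-2: condition masked, unconditional refinement
            s_hat = model(x_t, t_batch, empty_cond)
        x_t = ddpm_posterior_sample(x_t, s_hat, t_batch)  # one reverse DDPM step
    return x_t
```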

3.5 Data Preparation

To facilitate our setting, we randomly partition an existing training dataset into two subsets. In one subset, we retain the annotated condition and introduce noise into the motion sequences to approximate real noisy samples. In the other subset, we keep the data clean but discard the annotated conditions by replacing them with c=\emptyset.

Motivated by the use of Gaussian noise to approximate noisy samples in previous works (Tiwari et al. 2022; Fiche et al. 2023), we apply Equation 1 to gradually introduce noise into the clean samples. Since the precise noise schedule of real-world motion capture data is unknown, we address this uncertainty by applying a random noising step sampled from the range [T_{1},T_{2}], where T_{1} and T_{2} are hyperparameters simulating the level of disruption in real noisy motions. Interestingly, our experiments (Tab. 6) show that neither smaller values of T_{1} and T_{2} nor a smaller T_{2}-T_{1} necessarily leads to better final performance. Due to the page limit, examples of noisy motions used for training are presented in the supplementary videos.

It is worth noting that the dataset partitioning and the preparation of noisy and unannotated samples take place only on the training split. The evaluation dataset, diffusion models, and training procedure are kept unchanged from previous works.
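The preparation step can be summarized by the sketch below (our own illustration, reusing q_sample from Subsection 3.2): the training set is randomly split, one part is corrupted with the forward process at a random step in [T_{1},T_{2}] while keeping its annotation, and the other part keeps its clean motion but has its annotation replaced by the empty condition.

```python
import torch

def prepare_weak_supervision(motions, conds, alpha_bar, T1, T2,
                             noisy_ratio=0.5, empty_cond=None):
    """Return (motion, condition, is_noisy) triplets for MotionMix training."""
    n = len(motions)
    perm = torch.randperm(n)
    n_noisy = int(noisy_ratio * n)

    dataset = []
    for i in perm[:n_noisy].tolist():                 # noisy but annotated half
        t = torch.randint(T1, T2 + 1, (1,))
        x_noisy, _ = q_sample(motions[i].unsqueeze(0), t, alpha_bar)
        dataset.append((x_noisy.squeeze(0), conds[i], True))
    for i in perm[n_noisy:].tolist():                 # clean but unannotated half
        dataset.append((motions[i], empty_cond, False))
    return dataset
```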

4 Experiments

We thoroughly evaluate our MotionMix on diverse tasks using different conditional motion generation diffusion models as backbones: (1) MDM (Tevet et al. 2022) for the text-to-motion task on HumanML3D (Guo et al. 2022b) and KIT-ML (Plappert, Mandery, and Asfour 2016), as well as the action-to-motion task on HumanAct12 (Guo et al. 2020) and UESTC (Ji et al. 2018); (2) MotionDiffuse (Zhang et al. 2022) for the text-to-motion task; and (3) EDGE (Tseng, Castellon, and Liu 2022) for the music-to-dance task on AIST++ (Li et al. 2021).

4.1 Models

\bullet MDM (Tevet et al. 2022). MDM is a lightweight diffusion model that utilizes a transformer encoder-only architecture (Vaswani et al. 2017). Its training objective is to estimate the clean sample \mathbf{x}_{0} (Ramesh et al. 2022b). In the text-to-motion task, MDM encodes the text description c=w^{1:N} using a frozen CLIP-ViT-B/32. During training, classifier-free guidance (Ho and Salimans 2022) is employed by randomly masking the condition with c=\emptyset with a probability of 10%. Meanwhile, in the action-to-motion task, the conditioning c=a is projected to a linear action embedding, and classifier-free guidance is not applied. Additionally, three geometric losses are incorporated as training constraints for this task.

\bullet MotionDiffuse (Zhang et al. 2022). MotionDiffuse employs a series of transformer decoder layers (Vaswani et al. 2017) and incorporates a frozen CLIP-ViT-B/32 for text description encoding. However, in contrast to MDM, MotionDiffuse focuses on estimating the noise \boldsymbol{\epsilon} as its training objective and does not incorporate classifier-free guidance (Ho and Salimans 2022).

\bullet EDGE (Tseng, Castellon, and Liu 2022). EDGE shares similarities with MDM in terms of its transformer encoder-only architecture (Vaswani et al. 2017) and the adoption of geometric losses for the music-to-dance task. In addition, the authors introduced a novel Contact Consistency Loss to enhance foot contact prediction control. In the case of music conditioning, EDGE utilizes a pre-trained Jukebox model (Dhariwal et al. 2020) to extract audio features m from music, which then serve as the conditioning input c=m. During inference, the approach incorporates classifier-free guidance (Ho and Salimans 2022) with a masking probability of 25%.

4.2 Text-to-motion

\bullet Datasets. Two leading benchmarks used for text-driven motion generation are HumanML3D (Guo et al. 2022b) and KIT-ML (Plappert, Mandery, and Asfour 2016). The KIT-ML dataset provides 6,353 textual descriptions corresponding to 3,911 motion sequences, while the HumanML3D dataset combines 14,616 motion sequences from HumanAct12 (Guo et al. 2020) and AMASS (Mahmood et al. 2019), along with 44,970 sequence-level textual descriptions. As suggested by Guo et al., we adopt a redundant motion representation that concatenates root velocities, root height, local joint positions, velocities, rotations, and the binary labels of foot contact. This representation, denoted as \mathbf{x}\in\mathbb{R}^{N\times D}, is used for both HumanML3D and KIT-ML, with D being the dimension of the pose vector, equal to 263 for HumanML3D and 251 for KIT-ML. This motion representation is also employed in previous works (Tevet et al. 2022; Zhang et al. 2022; Chen et al. 2022).

\bullet Implementation Details. On both datasets, we train the MDM and MotionDiffuse models from scratch for 700K and 200K steps, respectively. To approximate the noisy motion data \tilde{\mathbf{x}} from \mathbf{x}\in\mathbb{R}^{N\times D}, we use noisy ranges of [20,60] and [20,40] for HumanML3D and KIT-ML, respectively.

\bullet Evaluation Metrics. As suggested by Guo et al., the metrics are based on a text feature extractor and a motion feature extractor jointly trained under contrastive loss to produce feature vectors for matched text-motion pairs. R Precision (top 3) measures the accuracy of the top 3 retrieved descriptions for each generated motion, while the Frechet Inception Distance (FID) is calculated using the motion extractor as the evaluator network. Multimodal Distance measures the average Euclidean distance between the motion feature of each generated motion and the text feature of its corresponding description in the test set. Diversity measures the variance of the generated motions across all action categories, while MultiModality measures the diversity of generated motions within each condition.
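For reference, the sketch below shows a generic FID computation over extracted motion features; the contrastively trained feature extractors themselves come from Guo et al. and are not reproduced here, and the function name is ours.

```python
import numpy as np
from scipy import linalg

def motion_fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to motion features of shape (num, dim)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                      # discard small imaginary parts
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```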

\bullet Quantitative Result. Table 1 presents quantitative results of our weakly-supervised MotionMix using MDM and MotionDiffuse backbones, in comparison with their original models trained on fully annotated and clean motion sequences. To our surprise, in most settings, MotionMix even improves motion quality (i.e., FID) and multimodal consistency (i.e., R Precision) over the fully supervised backbones. For example, on the HumanML3D and KIT-ML datasets, MDM (MotionMix) reduces FID by over 0.16 compared to MDM, along with improvements in both R Precision and Multimodal Distance. We attribute this to the better generalizability and robustness gained by involving noisy data in our MotionMix. In the specific setting of MotionDiffuse (MotionMix) on HumanML3D, though inferior to the original MotionDiffuse, our MotionMix maintains competitive performance on par with other fully supervised baselines, such as Language2Pose (Ahuja and Morency 2019), Text2Gestures (Bhattacharya et al. 2021), and Guo et al. (Guo et al. 2022b).

Figure 3: Qualitative comparison of the baseline MDM and MotionDiffuse models, trained exclusively on high-quality annotated data, with our MotionMix approach, which learns from imperfect data sources. Their visualized motion results are presented alongside real references for three distinct text prompts. Please refer to the supplementary files for more animations.
Method R Precision (top 3) ↑ FID ↓ Multimodal Dist. ↓ Diversity → Multimodality ↑
HumanML3D Real Motion 0.797±.002 0.002±.000 2.974±.008 9.503±.065 -
Language2Pose 0.486±.002 11.02±.046 5.296±.008 7.676±.058 -
Text2Gestures 0.345±.002 7.664±.030 6.030±.008 6.409±.071 -
Guo et al. 0.740±.003 1.067±.002 3.340±.008 9.188±.002 2.090±.083
MLD 0.772±.002 0.473±.013 3.196±.010 9.724±.082 2.413±.079
MDM 0.611±.007 0.544±.440 5.566±.027 9.559±.860 2.799±.072
MDM (MotionMix) 0.632±.006 (↑3.4%) 0.381±.042 (↑30.0%) 5.325±.026 (↑4.3%) 9.520±.090 (↑69.6%) 2.718±.019 (↓2.9%)
MotionDiffuse 0.782±.001 0.630±.001 3.113±.001 9.410±.049 1.553±.042
MotionDiffuse (MotionMix) 0.738±.006 (↓5.6%) 1.021±.071 (↓62.1%) 3.310±.020 (↓6.3%) 9.297±.083 (↓121.5%) 1.523±.153 (↓1.9%)
KIT-ML Real Motion 0.779±.006 0.031±.004 2.788±.012 11.080±.097 -
Language2Pose 0.483±.005 6.545±.072 5.147±.030 9.073±.100 -
Text2Gestures 0.338±.004 12.12±.183 6.964±.029 9.334±.079 -
Guo et al. 0.693±.007 2.770±.109 3.401±.008 10.910±.119 1.482±.065
MLD 0.734±.007 0.404±.027 3.204±.027 10.800±.117 2.192±.071
MDM 0.396±.004 0.497±.021 9.191±.022 10.847±.109 1.907±.214
MDM (MotionMix) 0.404±.005 (↑2.0%) 0.322±.020 (↑35.2%) 9.068±.019 (↑1.3%) 10.781±.098 (↓28.3%) 1.946±.019 (↑2.0%)
MotionDiffuse 0.739±.004 1.954±.062 2.958±.005 11.100±.143 0.730±.013
MotionDiffuse (MotionMix) 0.742±.005 (↑0.4%) 1.192±.073 (↑39.0%) 3.066±.018 (↓3.6%) 10.998±.072 (↓310%) 1.391±.111 (↑90.5%)
Table 1: Quantitative results of text-to-motion on the test sets of HumanML3D and KIT-ML. Note that all baselines are trained with gold data. We run all evaluations 20 times (except Multimodality, which runs 5 times), and ± indicates the 95% confidence interval. ↑ means higher is better, ↓ means lower is better, → means closer to the real distribution is better. ↑x% and ↓x% indicate the percentage improvement or deterioration when comparing our approach to its corresponding baseline.

4.3 Action-to-motion

\bullet Datasets. We evaluate our MotionMix on two benchmarks: HumanAct12 (Guo et al. 2020) and UESTC (Ji et al. 2018). HumanAct12 offers 1,191 motion clips categorized into 12 action classes, while UESTC provides 24K sequences spanning 40 action classes. For this task, we use the pre-processed sequences provided by Petrovich, Black, and Varol as the gold clean motion sequences, and further process them to approximate noisy samples. A pose sequence of N frames is represented in the 24-joint SMPL format (Loper et al. 2015), using the 6D rotation (Zhou et al. 2018) for every joint, resulting in \mathbf{p}\in\mathbb{R}^{N\times 24\times 6}. A single root translation \mathbf{r}\in\mathbb{R}^{N\times 1\times 3} is padded and concatenated with \mathbf{p} to obtain the final motion representation \mathbf{x}=\text{Concat}([\mathbf{p},\mathbf{r}])\in\mathbb{R}^{N\times 25\times 6}.
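The sketch below illustrates only the tensor shapes of this representation; how the root translation is padded from 3 to 6 channels is our assumption, since the text only states that it is padded before concatenation.

```python
import torch

N = 60                                     # example number of frames
p = torch.zeros(N, 24, 6)                  # per-joint 6D rotations
r = torch.zeros(N, 1, 3)                   # root translation
r_padded = torch.cat([r, torch.zeros(N, 1, 3)], dim=-1)  # pad 3 -> 6 channels (assumed)
x = torch.cat([p, r_padded], dim=1)        # final representation
assert x.shape == (N, 25, 6)
```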

\bullet Implementation Details. Following the experimental setup of Tevet et al., we train MDM (MotionMix) from scratch on the HumanAct12 and UESTC datasets for 750K and 2M steps, respectively. In our approximation preprocess, we determine the amount of noise to be injected into both the pose sequence \mathbf{p} and the root translation \mathbf{r} by randomly sampling from the range [10,30]. The resulting \tilde{\mathbf{p}} and \tilde{\mathbf{r}} are then concatenated to obtain the noisy motion \tilde{\mathbf{x}}.

\bullet Evaluation Metrics. Four metrics are used to assess the quality of generated motions. FID is commonly used to evaluate the overall quality of generated motions. Accuracy measures the correlation between the generated motion and its action class. Diversity and MultiModality are defined as in the text-to-motion metrics.

\bullet Quantitative Result. Table 2 presents the performance of MDM (MotionMix) and several baseline models, including Action2Motion (Guo et al. 2020), ACTOR (Petrovich, Black, and Varol 2021), INR (Cervantes et al. 2022), MLD (Chen et al. 2022), and MDM (Tevet et al. 2022), on both the HumanAct12 and UESTC datasets. Following the methodology of Tevet et al., we perform 20 evaluations, each comprising 1000 samples, and present average scores with a 95% confidence interval. The results highlight that our MotionMix achieves competitive performance with significantly fewer high-quality annotated data instances. In particular, the improvement on the UESTC dataset underscores its efficacy when training with noisy motion data from real-world scenarios. On the other hand, the performance deterioration on HumanAct12 suggests that our approach is better suited to larger datasets, given that HumanAct12 is markedly smaller than UESTC. Nevertheless, our supplementary videos demonstrate that the model trained on HumanAct12 remains capable of generating quality motion sequences based on the provided action classes.

Method FID ↓ Accuracy ↑ Diversity → MultiModality →
HumanAct12 Real Motion 0.053±.003 0.995±.001 6.835±.045 2.604±.040
Action2Motion 0.338±.015 0.917±.001 6.850±.050 2.511±.023
ACTOR 0.120±.000 0.955±.008 6.840±.030 2.530±.020
INR 0.088±.004 0.973±.001 6.881±.048 2.569±.040
MLD 0.077±.004 0.964±.002 6.831±.050 2.824±.038
MDM 0.100±.000 0.990±.000 6.860±.050 2.520±.010
MDM (MotionMix) 0.196±.007 (↓96%) 0.930±.003 (↓6.1%) 6.836±.062 (↑96%) 3.043±.054 (↓422.6%)
UESTC Real Motion 2.790±.290 0.988±.001 33.349±.320 14.160±.060
ACTOR 23.430±2.200 0.911±.003 31.960±.330 14.520±.090
INR 15.000±.090 0.941±.001 31.590±.190 14.680±.070
MLD 15.790±.079 0.954±.001 33.520±.140 13.570±.060
MDM 12.810±1.460 0.950±.000 33.100±.290 14.260±.120
MDM (MotionMix) 11.400±.393 (↑11%) 0.960±.003 (↑1.1%) 32.806±.176 (↓118%) 14.277±.094 (↓17%)
Table 2: Quantitative results of action-to-motion on the HumanAct12 dataset and UESTC test set. We run the evaluation 20 times, and the metric details are similar to Table 1.

4.4 Music-to-dance

\bullet Datasets. We utilize the AIST++ dataset (Li et al. 2021), which comprises 1,408 high-quality dance motions accompanied by music from a diverse range of genres. Following the experimental setup proposed by Tseng, Castellon, and Liu, we adopt a configuration in which all training samples are trimmed to 5 seconds at 30 FPS. Similar to the action-to-motion data, we concatenate the N-frame pose sequence \mathbf{p}\in\mathbb{R}^{N\times 24\times 6=N\times 144}, a single root translation \mathbf{r}\in\mathbb{R}^{N\times 3}, and an additional binary contact label for the heel and toe of each foot \mathbf{b}\in\{0,1\}^{N\times 4}. Consequently, EDGE is trained using the final motion representation \mathbf{x}=\text{Concat}([\mathbf{b},\mathbf{r},\mathbf{p}])\in\mathbb{R}^{N\times 151}.

\bullet Implementation Details. Similar to the action-to-motion task, we inject noise into both \mathbf{p} and \mathbf{r} using the same noise timestep sampled from [20,80]. Since the contact label \mathbf{b} is obtained from both \mathbf{p} and \mathbf{r}, it is not necessary to inject noise into \mathbf{b}. Following the setup of Tseng, Castellon, and Liu, we train both the EDGE model and our EDGE (MotionMix) from scratch on AIST++ for 2000 epochs.

\bullet Evaluation Metrics. To evaluate the quality of the generated dance, we adopt the same evaluation settings as suggested in the EDGE paper, including Physical Foot Contact (PFC), Beat Alignment, and Diversity. PFC is a physically-inspired metric that evaluates physical plausibility by capturing realistic foot-ground contact without explicit physical modeling or assuming static contact. Following previous works (Li et al. 2021; Siyao et al. 2022), Beat Alignment evaluates the tendency of generated dances to follow the beat of the music, while Diversity measures the distribution of generated dances in the "kinetic" (Dist_k) and "geometric" (Dist_g) feature spaces.

\bullet Quantitative Result. In contrast to prior works, which typically report only a single evaluation result, we have observed that the metrics can be inconsistent. Thus, to offer a more comprehensive evaluation, we report the average and 95% confidence interval over 20 evaluation runs for our retrained EDGE model and our EDGE (MotionMix) variant. For Bailando (Siyao et al. 2022) and FACT (Li et al. 2021), we directly fetch results from the EDGE paper (Tseng, Castellon, and Liu 2022). The results in Table 3 demonstrate that our EDGE (MotionMix) significantly outperforms the baseline across all metrics, showing improvements of up to 43.1% in PFC and 95.0% in Dist_k. This further reinforces the generalizability of our MotionMix approach, consistent with the outcomes observed in our text-to-motion experiments.

Method PFC ↓ Beat Align. ↑ Dist_k → Dist_g →
Real Motion 1.380 0.314 9.545 7.766
Bailando 1.754 0.23 10.58 7.72
FACT 2.2543 0.22 10.85 6.14
EDGE\dagger 1.605±.224 0.224±.025 5.549±.783 4.831±.752
EDGE (MotionMix) 1.988±.120 (↑43.1%) 0.256±.013 (↑13.3%) 10.103±2.039 (↑95.0%) 6.595±.173 (↑15.1%)
Table 3: Quantitative results of music-to-dance on the AIST++ test set. We run the evaluation 20 times, and the metric details are similar to Table 1. \dagger denotes the EDGE model re-trained by us. The results of the EDGE baseline differ from the ones submitted to AAAI'24 due to a multi-GPU bug; however, our EDGE (MotionMix) still achieves overall better performance.

5 Ablation Studies

MotionMix is introduced as a potential solution that enables the diffusion model to effectively leverage both noisy motion sequences and unannotated data. To demonstrate the efficacy of this approach, we approximate noisy samples from existing datasets and train the model on them; this process involves several essential hyperparameters: (1) the denoising pivot T^{*}; (2) the ratio of noisy to clean training data; (3) the noisy range [T_{1},T_{2}] used to approximate noisy data. In this section, we thoroughly assess the impact of each hyperparameter within MotionMix. All ablation experiments are carried out on the HumanML3D dataset using the MDM model with the identical settings described in Subsection 4.2.
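For clarity, the three hyperparameters and the HumanML3D defaults used in Subsection 4.2 can be grouped as in the hypothetical configuration below (the dataclass itself is our own sketch; the values are those reported in this paper).

```python
from dataclasses import dataclass

@dataclass
class MotionMixConfig:
    T_star: int = 60          # denoising pivot T*
    noisy_ratio: float = 0.5  # fraction of training motions kept noisy but annotated
    T1: int = 20              # lower bound of the noisy range
    T2: int = 60              # upper bound of the noisy range
```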

5.1 Effect of The Denoising Pivot TT^{*}

We begin our ablation studies by examining the impact of the denoising pivot T^{*}. To evaluate it, we conduct experiments with a fixed noisy range [T_{1},T_{2}]=[20,60] and a noisy ratio of 50%, and test T^{*} values of 20, 40, 60, and 80. The results, detailed in Table 4, reveal a notable observation: a roughly estimated denoising pivot is sufficient for real-world scenarios, as evidenced by the competitive outcomes across various T^{*} values. This robustness underlines the versatility of our MotionMix approach. Additionally, selecting a very small denoising pivot (e.g., T^{*}=0 or 20) enables conditions to steer the model toward diverse rough motion sequences before the refining phase, as reflected in the MultiModality score trend. However, such a small value may compromise motion quality, leading to subpar results on the other metrics. In contrast, the choice of T^{*}=60, which aligns well with our predefined noisy range, yields superior results on multiple evaluation metrics. This sheds light on the need to tune the denoising pivot to optimize results, as this hyperparameter determines the starting point from which the diffusion model transforms initial noisy motions into high-quality sequences.

Method R Precision (top 3) ↑ FID ↓ Multimodal Dist. ↓ Diversity → Multimodality ↑
Real Motion 0.797±.002 0.002±.000 2.974±.008 9.503±.065 -
MDM (Tevet et al. 2022) \underline{0.611±.007} 0.544±.440 5.566±.027 \underline{9.559±.860} 2.799±.072
50% noisy, T_{1}=20, T_{2}=60
MDM (MotionMix) (T^{*}=0) 0.598±.006 0.714±.045 \underline{5.503±.036} 9.750±.123 \boldsymbol{3.044±.054}
MDM (MotionMix) (T^{*}=20) 0.601±.005 0.497±.048 5.562±.026 9.414±.092 \underline{2.935±.059}
MDM (MotionMix) (T^{*}=40) 0.604±.008 \underline{0.402±.032} 5.524±.033 9.396±.094 2.747±.070
MDM (MotionMix) (T^{*}=60) \boldsymbol{0.632±.006} \boldsymbol{0.381±.042} \boldsymbol{5.325±.026} \boldsymbol{9.520±.090} 2.718±.019
MDM (MotionMix) (T^{*}=80) 0.594±.005 0.589±.059 5.670±.033 9.242±.086 2.602±.057
Table 4: We evaluate MDM (MotionMix) on the HumanML3D test set using different values of the denoising pivot T^{*}. The metrics are calculated in the same manner as detailed in Table 1. The best and second best results are in bold and underlined, respectively.

5.2 Effect of Noisy/Clean Data Ratio

In this ablation study, we evaluate how the noisy/clean data ratio affects our approach, keeping T^{*}=60 and [T_{1},T_{2}]=[20,60] constant. We experiment with noisy ratios of 30%, 50%, and 70%. The results, presented in Table 5, show interesting trends across the evaluation metrics. Notably, the higher noisy ratios (i.e., 50% and 70%) consistently outperform the lower ratio (i.e., 30%). Note that a higher noisy ratio allows the model to access a wider range of annotated text conditions, yielding better R Precision and Multimodal Distance. On the other hand, the 30% ratio, despite being trained with a greater amount of clean data, exhibits suboptimal motion quality (scoring 0.898 in FID) compared to the higher ratios, while remaining competitive with the supervised baselines in Table 1, such as Language2Pose (FID of 11.02), Text2Gestures (FID of 7.664), and Guo et al. (FID of 1.067). It also achieves results on par with the supervised MDM baseline in terms of multimodal consistency (i.e., Multimodal Distance). These observations underscore the resilience of our MotionMix approach to variations in the noisy/clean data ratio.

Method R Precision (top 3) ↑ FID ↓ Multimodal Dist. ↓ Diversity → Multimodality ↑
Real Motion 0.797±.002 0.002±.000 2.974±.008 9.503±.065 -
MDM 0.611±.007 0.544±.440 5.566±.027 9.559±.860 2.799±.072
T_{1}=20, T_{2}=60, T^{*}=60
MDM (MotionMix) (30% noisy) 0.601±.007 0.898±.045 5.581±.030 9.080±.092 \underline{2.856±.074}
MDM (MotionMix) (50% noisy) \boldsymbol{0.632±.006} \underline{0.381±.042} \boldsymbol{5.325±.026} \boldsymbol{9.520±.090} 2.718±.019
MDM (MotionMix) (70% noisy) \underline{0.615±.006} \boldsymbol{0.359±.030} \underline{5.545±.031} \underline{9.457±.098} \boldsymbol{2.867±.107}
Table 5: We evaluate MDM (MotionMix) on the HumanML3D test set using different ratios of noisy to clean data. The metrics are calculated in the same manner as detailed in Table 1. The best and second best results are in bold and underlined, respectively.

5.3 Effect of The Noisy Range

The purpose of the noisy range in our work is to approximate the noise schedule found in real-world motion capture data. Thus, for the different datasets in Section 4, we choose noisy ranges based on visual inspection of motions from each dataset. For example, UESTC (Ji et al. 2018) contains noisy mocap data, while HumanML3D (Guo et al. 2022b), derived from AMASS (Mahmood et al. 2019), consists of clean motion sequences. This ablation, therefore, comprehensively evaluates the effectiveness of our MotionMix approach when handling different noise levels in motion sequences. We categorize the evaluations into two groups: narrow/wide ranges of noise and low/high schedules of noise. All experiments are conducted with a noisy ratio of 50%, and the denoising pivot T^{*} is set equal to the chosen T_{2}. The results are presented in Table 6.

\bullet Narrow/Wide Noisy Range. Three noisy ranges [T_{1},T_{2}]\in\{[20,40],[20,60],[20,80]\} are set to analyze the effect of how much the range spans. Counterintuitively, a smaller noisy range does not equal better performance. For example, the range [20,60] leads to the best overall performance, compared to [20,40]. A large noisy range (i.e., [20,80]), however, inevitably deteriorates the model capacity.

\bullet Low/High Noisy Schedule. Four contrasting ranges [T_{1},T_{2}]\in\{[10,30],[20,40],[40,60],[60,80]\} are experimented with to evaluate the robustness of MotionMix with respect to the corruption level of noisy motions. Notably, our proposed MotionMix performs reasonably stably across different levels of corrupted motions. More visual animations are provided in our supplementary videos.

Method R Precision (top 3) ↑ FID ↓ Multimodal Dist. ↓ Diversity → Multimodality ↑
Real Motion 0.797±.002 0.002±.000 2.974±.008 9.503±.065 -
MDM 0.611±.007 0.544±.440 5.566±.027 9.559±.860 2.799±.072
50% noisy, T^{*}=T_{2}
MDM (MotionMix) (T_{1}=20, T_{2}=40) \underline{0.616±.006} \underline{0.451±.033} \underline{5.459±.027} 9.585±.101 2.585±.076
MDM (MotionMix) (T_{1}=20, T_{2}=60) \boldsymbol{0.632±.006} \boldsymbol{0.381±.042} \boldsymbol{5.325±.026} \boldsymbol{9.520±.090} \underline{2.718±.019}
MDM (MotionMix) (T_{1}=20, T_{2}=80) 0.604±.004 0.614±.060 5.540±.024 \underline{9.554±.104} \boldsymbol{2.768±.095}
50% noisy, T^{*}=T_{2}
MDM (MotionMix) (T_{1}=10, T_{2}=30) 0.592±.008 0.713±.048 5.633±.028 9.567±.109 2.783±.139
MDM (MotionMix) (T_{1}=20, T_{2}=40) \boldsymbol{0.616±.006} \underline{0.451±.033} \boldsymbol{5.459±.027} 9.585±.101 2.585±.076
MDM (MotionMix) (T_{1}=40, T_{2}=60) \underline{0.598±.004} 0.554±.076 5.600±.031 \boldsymbol{9.479±.100} \underline{2.815±.094}
MDM (MotionMix) (T_{1}=60, T_{2}=80) 0.597±.008 \boldsymbol{0.437±.039} \underline{5.554±.033} \underline{9.452±.092} \boldsymbol{2.895±.079}
Table 6: We evaluate MDM (MotionMix) on the HumanML3D test set using different noisy ranges [T_{1},T_{2}] to approximate the noisy motion sequences. The table presents two distinct scenarios: the upper block ablates how much the range spans, while the lower block examines the impact of the corruption level of noisy motions. The metrics are calculated in the same manner as detailed in Table 1. For each setting, the best and second best results are in bold and underlined, respectively.

6 Conclusion

In this work, we look into the realm of conditional human motion generation, delving into the challenge of training with both noisy annotated and clean unannotated motion sequences. The proposed approach, MotionMix, pioneers the use of a weakly-supervised diffusion model as a potential solution to this challenge. This method effectively overcomes the constraints arising from limited high-quality annotated data, achieving competitive results compared to fully supervised models. The versatility of MotionMix is showcased across multiple motion generation benchmarks and fundamental diffusion model designs. Comprehensive ablation studies further demonstrate its resilience to diverse noisy schedules and the strategic selection of the denoising pivot.

Method PFC ↓ Beat Align. ↑ Dist_k → Dist_g →
Real Motion (AIST++) 1.380 0.314 9.545 7.766
Real Motion (AMASS) 1.032 - - -
EDGE\dagger 1.605±.224 0.224±.025 \underline{5.549±.783} \underline{4.831±.752}
Half noisy AIST++ and half clean AIST++ (in our main paper)
EDGE (MotionMix) 1.988±.120 \boldsymbol{0.256±.013} \boldsymbol{10.103±2.039} \boldsymbol{6.595±.173}
Combine clean AIST++ and clean AMASS
EDGE (MotionMix) (T^{*}=20) \underline{1.310±.078} 0.236±.007 3.437±.229 4.308±.134
EDGE (MotionMix) (T^{*}=40) \boldsymbol{1.062±.080} \underline{0.240±.009} 3.639±.292 4.371±.111
Table 7: Quantitative results of music-to-dance on the AIST++ test set. We run the evaluation 20 times. The best and second best results are in bold and underlined, respectively. \dagger denotes the EDGE model re-trained by us.

Appendix A Application - Real Case Scenario

We experimented with training the EDGE model using both AIST++ and AMASS together. With AMASS (which has a low PFC), our model can generate plausible motions with less skating (PFC: 1.06, Tab. 7), as visually supported by the videos on our project page.

Appendix B Application - Motion Editing

MDM (Tevet et al. 2022) introduced two motion editing applications: in-betweening and body part editing. These applications follow the same approach in the temporal and spatial domains, respectively. For in-betweening, they keep the initial and final 25% of the motion sequence fixed, while the model generates the intermediate 50%. In the context of body part editing, specific joints are held fixed, leaving the model responsible for generating the remaining parts. In particular, their experimentation focused on editing the upper body joints exclusively. In our supplementary videos, we demonstrate that, in both scenarios, our MDM (MotionMix) does not compromise this useful feature, exhibiting the ability to produce coherent motion sequences that align with both the motion's fixed section and the given condition (if provided).

References

  • Ahuja and Morency (2019) Ahuja, C.; and Morency, L.-P. 2019. Language2Pose: Natural Language Grounded Pose Forecasting. 2019 International Conference on 3D Vision (3DV), 719–728.
  • Aristidou et al. (2021) Aristidou, A.; Yiannakidis, A.; Aberman, K.; Cohen-Or, D.; Shamir, A.; and Chrysanthou, Y. 2021. Rhythm is a Dancer: Music-Driven Motion Synthesis with Global Structure. IEEE transactions on visualization and computer graphics, PP.
  • Azadi et al. (2023) Azadi, S.; Shah, A.; Hayes, T.; Parikh, D.; and Gupta, S. 2023. Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation. ArXiv, abs/2305.09662.
  • Bao et al. (2022) Bao, F.; Li, C.; Sun, J.; and Zhu, J. 2022. Why Are Conditional Generative Models Better Than Unconditional Ones? ArXiv, abs/2212.00362.
  • Bhattacharya et al. (2021) Bhattacharya, U.; Rewkowski, N.; Banerjee, A.; Guhan, P.; Bera, A.; and Manocha, D. 2021. Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents. 2021 IEEE Virtual Reality and 3D User Interfaces (VR), 1–10.
  • Cervantes et al. (2022) Cervantes, P.; Sekikawa, Y.; Sato, I.; and Shinoda, K. 2022. Implicit Neural Representations for Variable Length Human Motion Generation. ArXiv, abs/2203.13694.
  • Chang, Koulieris, and Shum (2023) Chang, Z.; Koulieris, G. A.; and Shum, H. P. H. 2023. On the Design Fundamentals of Diffusion Models: A Survey. ArXiv, abs/2306.04542.
  • Chen et al. (2022) Chen, X.; Jiang, B.; Liu, W.; Huang, Z.; Fu, B.; Chen, T.; Yu, J.; and Yu, G. 2022. Executing your Commands via Motion Diffusion in Latent Space. ArXiv, abs/2212.04048.
  • Choutas et al. (2020) Choutas, V.; Pavlakos, G.; Bolkart, T.; Tzionas, D.; and Black, M. J. 2020. Monocular Expressive Body Regression through Body-Driven Attention. ArXiv, abs/2008.09062.
  • Daras et al. (2023) Daras, G.; Shah, K.; Dagan, Y.; Gollakota, A.; Dimakis, A. G.; and Klivans, A. R. 2023. Ambient Diffusion: Learning Clean Distributions from Corrupted Data. ArXiv, abs/2305.19256.
  • Dhariwal et al. (2020) Dhariwal, P.; Jun, H.; Payne, C.; Kim, J. W.; Radford, A.; and Sutskever, I. 2020. Jukebox: A Generative Model for Music. ArXiv, abs/2005.00341.
  • Dhariwal and Nichol (2021) Dhariwal, P.; and Nichol, A. 2021. Diffusion Models Beat GANs on Image Synthesis. ArXiv, abs/2105.05233.
  • Fiche et al. (2023) Fiche, G.; Leglaive, S.; Alameda-Pineda, X.; and S’eguier, R. 2023. Motion-DVAE: Unsupervised learning for fast human motion denoising. ArXiv, abs/2306.05846.
  • Ghosh et al. (2021) Ghosh, A.; Cheema, N.; Oguz, C.; Theobalt, C.; and Slusallek, P. 2021. Synthesis of Compositional Animations from Textual Descriptions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 1376–1386.
  • Gong et al. (2023) Gong, K.; Lian, D.; Chang, H.; Guo, C.; Zuo, X.; Jiang, Z.; and Wang, X. 2023. TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration. arXiv:2304.02419.
  • Guo et al. (2022a) Guo, C.; Xuo, X.; Wang, S.; and Cheng, L. 2022a. TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts. ArXiv, abs/2207.01696.
  • Guo et al. (2022b) Guo, C.; Zou, S.; Zuo, X.; Wang, S.; Ji, W.; Li, X.; and Cheng, L. 2022b. Generating Diverse and Natural 3D Human Motions From Text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5152–5161.
  • Guo et al. (2020) Guo, C.; Zuo, X.; Wang, S.; Zou, S.; Sun, Q.; Deng, A.; Gong, M.; and Cheng, L. 2020. Action2Motion: Conditioned Generation of 3D Human Motions. Proceedings of the 28th ACM International Conference on Multimedia.
  • Guo et al. (2022c) Guo, W.; Du, Y.; Shen, X.; Lepetit, V.; Alameda-Pineda, X.; and Moreno-Noguer, F. 2022c. Back to MLP: A Simple Baseline for Human Motion Prediction. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 4798–4808.
  • Ho et al. (2022) Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A. A.; Kingma, D. P.; Poole, B.; Norouzi, M.; Fleet, D. J.; and Salimans, T. 2022. Imagen Video: High Definition Video Generation with Diffusion Models. ArXiv, abs/2210.02303.
  • Ho, Jain, and Abbeel (2020) Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. ArXiv, abs/2006.11239.
  • Ho and Salimans (2022) Ho, J.; and Salimans, T. 2022. Classifier-Free Diffusion Guidance. arXiv:2207.12598.
  • Inc. (2021) Inc., A. S. 2021. Mixamo. https://www.mixamo.com/. Accessed: 2021-12-25.
  • Ji et al. (2018) Ji, Y.; Xu, F.; Yang, Y.; Shen, F.; Shen, H. T.; and Zheng, W. 2018. A Large-scale RGB-D Database for Arbitrary-view Human Action Recognition. Proceedings of the 26th ACM international conference on Multimedia.
  • Kanazawa et al. (2017) Kanazawa, A.; Black, M. J.; Jacobs, D. W.; and Malik, J. 2017. End-to-End Recovery of Human Shape and Pose. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7122–7131.
  • Kawar et al. (2023) Kawar, B.; Elata, N.; Michaeli, T.; and Elad, M. 2023. GSURE-Based Diffusion Model Training with Corrupted Data. ArXiv, abs/2305.13128.
  • Kingma et al. (2014) Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014. Semi-supervised Learning with Deep Generative Models. ArXiv, abs/1406.5298.
  • Kocabas, Athanasiou, and Black (2019) Kocabas, M.; Athanasiou, N.; and Black, M. J. 2019. VIBE: Video Inference for Human Body Pose and Shape Estimation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5252–5262.
  • Koppula and Saxena (2013) Koppula, H. S.; and Saxena, A. 2013. Anticipating human activities for reactive robotic response. 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2071–2071.
  • Lee et al. (2021) Lee, L.-H.; Braud, T.; Zhou, P.; Wang, L.; Xu, D.; Lin, Z.; Kumar, A.; Bermejo, C.; and Hui, P. 2021. All One Needs to Know about Metaverse: A Complete Survey on Technological Singularity, Virtual Ecosystem, and Research Agenda. ArXiv, abs/2110.05352.
  • Li et al. (2017) Li, C.; Xu, T.; Zhu, J.; and Zhang, B. 2017. Triple Generative Adversarial Nets. ArXiv, abs/1703.02291.
  • Li et al. (2020) Li, J.; Yin, Y.; Chu, H.; Zhou, Y.; Wang, T.; Fidler, S.; and Li, H. 2020. Learning to Generate Diverse Dance Motions with Transformer. ArXiv, abs/2008.08171.
  • Li et al. (2021) Li, R.; Yang, S.; Ross, D. A.; and Kanazawa, A. 2021. AI Choreographer: Music Conditioned 3D Dance Generation with AIST++. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 13381–13392.
  • Ling et al. (2020) Ling, H. Y.; Zinno, F.; Cheng, G.; and van de Panne, M. 2020. Character controllers using motion VAEs. ACM Transactions on Graphics (TOG), 39: 40:1 – 40:12.
  • Loper et al. (2015) Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; and Black, M. J. 2015. SMPL: a skinned multi-person linear model. ACM Trans. Graph., 34: 248:1–248:16.
  • Lucic et al. (2019) Lucic, M.; Tschannen, M.; Ritter, M.; Zhai, X.; Bachem, O.; and Gelly, S. 2019. High-Fidelity Image Generation With Fewer Labels. ArXiv, abs/1903.02271.
  • Mahmood et al. (2019) Mahmood, N.; Ghorbani, N.; Troje, N. F.; Pons-Moll, G.; and Black, M. J. 2019. AMASS: Archive of Motion Capture As Surface Shapes. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 5441–5450.
  • Miao et al. (2023) Miao, Y.-C.; Zhang, L.; Zhang, L.; and Tao, D. 2023. DDS2M: Self-Supervised Denoising Diffusion Spatio-Spectral Model for Hyperspectral Image Restoration. ArXiv, abs/2303.06682.
  • Nie et al. (2022) Nie, W.; Guo, B.; Huang, Y.; Xiao, C.; Vahdat, A.; and Anandkumar, A. 2022. Diffusion Models for Adversarial Purification. arXiv:2205.07460.
  • Petrovich, Black, and Varol (2021) Petrovich, M.; Black, M. J.; and Varol, G. 2021. Action-Conditioned 3D Human Motion Synthesis with Transformer VAE. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 10965–10975.
  • Petrovich, Black, and Varol (2022) Petrovich, M.; Black, M. J.; and Varol, G. 2022. TEMOS: Generating diverse human motions from textual descriptions. ArXiv, abs/2204.14109.
  • Plappert, Mandery, and Asfour (2016) Plappert, M.; Mandery, C.; and Asfour, T. 2016. The KIT Motion-Language Dataset. Big Data, 4(4): 236–252.
  • Raab et al. (2022) Raab, S.; Leibovitch, I.; Li, P.; Aberman, K.; Sorkine-Hornung, O.; and Cohen-Or, D. 2022. MoDi: Unconditional Motion Synthesis from Diverse Data. arXiv:2206.08010.
  • Ramesh et al. (2022a) Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; and Chen, M. 2022a. Hierarchical Text-Conditional Image Generation with CLIP Latents. ArXiv, abs/2204.06125.
  • Ramesh et al. (2022b) Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; and Chen, M. 2022b. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125.
  • Rombach et al. (2021) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10674–10685.
  • Ruiz, Gall, and Moreno-Noguer (2018) Ruiz, A. H.; Gall, J.; and Moreno-Noguer, F. 2018. Human Motion Prediction via Spatio-Temporal Inpainting. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 7133–7142.
  • Saharia et al. (2022) Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E. L.; Ghasemipour, S. K. S.; Ayan, B. K.; Mahdavi, S. S.; Lopes, R. G.; Salimans, T.; Ho, J.; Fleet, D. J.; and Norouzi, M. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. ArXiv, abs/2205.11487.
  • Siyao et al. (2022) Siyao, L.; Yu, W.; Gu, T.; Lin, C.; Wang, Q.; Qian, C.; Loy, C. C.; and Liu, Z. 2022. Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11040–11049.
  • Tevet et al. (2022) Tevet, G.; Raab, S.; Gordon, B.; Shafir, Y.; Cohen-Or, D.; and Bermano, A. H. 2022. Human Motion Diffusion Model. ArXiv, abs/2209.14916.
  • Tiwari et al. (2022) Tiwari, G.; Antic, D.; Lenssen, J. E.; Sarafianos, N.; Tung, T.; and Pons-Moll, G. 2022. Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields. In European Conference on Computer Vision (ECCV).
  • Tseng, Castellon, and Liu (2022) Tseng, J.-H.; Castellon, R.; and Liu, C. K. 2022. EDGE: Editable Dance Generation From Music. ArXiv, abs/2211.10658.
  • Tur et al. (2023) Tur, A. O.; Dall’Asen, N.; Beyan, C.; and Ricci, E. 2023. Exploring Diffusion Models for Unsupervised Video Anomaly Detection. ArXiv, abs/2304.05841.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N. M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In NIPS.
  • You et al. (2023) You, Z.; Zhong, Y.; Bao, F.; Sun, J.; Li, C.; and Zhu, J. 2023. Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels. ArXiv, abs/2302.10586.
  • Zhang et al. (2022) Zhang, M.; Cai, Z.; Pan, L.; Hong, F.; Guo, X.; Yang, L.; and Liu, Z. 2022. MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model. ArXiv, abs/2208.15001.
  • Zhou et al. (2018) Zhou, Y.; Barnes, C.; Lu, J.; Yang, J.; and Li, H. 2018. On the Continuity of Rotation Representations in Neural Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5738–5746.