KinMo: Kinematic-aware Human Motion Understanding and Generation
Abstract
Controlling human motion based on text presents an important challenge in computer vision. Traditional approaches often rely on holistic action descriptions for motion synthesis, which struggle to capture subtle movements of local body parts. This limitation restricts the ability to isolate and manipulate specific movements. To address this, we propose a novel motion representation that decomposes motion into distinct body joint group movements and interactions from a kinematic perspective. We design an automatic dataset collection pipeline that enhances the existing text-motion benchmark by incorporating fine-grained local joint-group motion and interaction descriptions. To bridge the gap between text and motion domains, we introduce a hierarchical motion semantics approach that progressively fuses joint-level interaction information into the global action-level semantics for modality alignment. With this hierarchy, we introduce a coarse-to-fine motion synthesis procedure for various generation and editing downstream applications. Our quantitative and qualitative experiments demonstrate that the proposed formulation enhances text-motion retrieval by improving joint-spatial understanding, and enables more precise joint-motion generation and control. Project Page: https://andypinxinliu.github.io/KinMo/
1 Introduction
Controlling human motion through natural language is a rapidly expanding area within computer vision, enabling interactive systems to generate or modify 3D human motions based on textual inputs. This technology has a wide range of applications, including robotics [11], virtual reality [12], and automatic animation [37, 36], where human-like motion is crucial for user interaction and immersion.
Despite substantial advancements in general motion generation [25, 29, 35, 27, 21, 9], the challenge of fine-grained control over individual body parts remains largely unsolved. Current models are proficient at producing coherent whole-body movements from high-level language descriptions but struggle when tasked with editing or controlling specific body parts independently. This limitation prevents these systems from achieving the precision and adaptability required for most real-world applications.
Recent advancements [19, 10] have introduced more refined approaches by incorporating controllability into motion generation. However, these models are limited to processing simple instructions and still lack the fine-grained compatibility required for scenarios where multiple body parts must coordinate to perform complex actions. Similarly, generative models for motion synthesis [9, 21] present innovative methods but do not directly tackle the issue of controlling individual body parts in response to specific textual descriptions.
This challenge is rooted in the inherent ambiguity of motion text descriptions in existing datasets. For example, multiple phrases (such as pick up an object from the ground and bend down to reach something) can describe the same motion. Conversely, a single term (like running) can encompass a wide range of variations, depending on factors such as speed, arm movement, or direction. This many-to-many mapping problem [13] hinders existing models from handling the complexities of natural language, often resulting in inconsistent or unnatural outcomes when trying to generate or edit specific body parts.
To solve this problem, we introduce a novel motion representation based on six fundamental kinematic components: torso, head, left arm, right arm, left leg, and right leg. Unlike current methods, which typically treat the body as a whole, this decomposition explicitly combines the motion of each joint group and their interactions, representing the global complex action in terms of local body regions. For instance, consider a sneaking motion, where the torso and legs coordinate the movement while the arms provide balance. Building on this, we propose a kinematic-aware formulation that opens new possibilities for text-motion understanding, fine-grained motion generation, and editing.
To achieve this goal, we restructure the HumanML3D [8] dataset and propose a semi-supervised annotation system named KinMo, which reformulates existing motion descriptions to support body-part-specific control. We enhance the text-motion alignment by constructing the hierarchical semantic representation for global action, joint-group motion, and joint interactions. Additionally, we extend the current motion generative model, MoMask [9], to incorporate fine-grained body-part generation and editing, enabling enhanced control and manipulation of motion sequences. Our contributions can be summarized as follows:
1. We introduce KinMo, a novel framework that redefines human motion as a three-level hierarchy: global actions, local joint groups, and joint interactions, together with a semi-supervised dataset upgrading system. This representation significantly enhances text-to-motion understanding, generation, and editability.
2. We refine Text-Motion Alignment by progressively encoding the proposed hierarchical text semantics, facilitating comprehensive learning of spatial motion-semantic correspondences.
3. We decompose the Motion Generation process into a coarse-to-fine procedure, allowing the model to transition smoothly from global actions to local joints and their interactions in support of various fine-grained generation and editing applications.
2 Related Work
Text-to-Motion Understanding. Aligning the text and motion modalities in a joint embedding space is a core component of human motion understanding. PoseScript [4] uses fine-grained text descriptions to represent various human poses. ChatPose [6] leverages LLMs to describe poses in natural language. MotionCLIP [28] and TMR [18] extend the alignment from single poses to motion sequences. MotionLLM [2] creates a large Motion-QA corpus for language-motion understanding. However, these methods focus only on global action-level motion and cannot capture subtle local joint motions or the extent of the motions.
Text-to-Motion Generation. As Diffusion models have demonstrated notable success in image generation [16, 24], some early methods, e.g., MDM [29] and MotionDiffuse [35], have adopted them for motion generation. Other methods, such as T2M-GPT [33] and MotionGPT [23], represent motions as discrete tokens and leverage autoregressive models to improve motion generation quality. MMM [21] and MoMask [9] further improve motion generation quality with a bidirectional masking mechanism and introduce the concept of editability. Inspired by previous work [30], we propose a hierarchical text semantic representation for generating human motions in a coarse-to-fine scheme. Such a representation ensures that descriptions of high-level actions, low-level joint groups, and joint interactions can control fine-grained motion generations.
Motion Control and Editing. Motion Editing allows users to modify or refine the motion generation. Initial work, such as PoseFix [5], automates the generation of frame-level 3D pose and text modifiers for supervised editing. Some other diffusion-based models [29, 35] can perform zero-shot spatiotemporal editing by infilling specific joints or frames, but they may create unnatural discontinuities. TLControl [31] and OmniControl [32] can control arbitrary joints at any time by combining spatial and temporal control together but lack editability over various local joints to represent complex motions. CoMo [10] prompts a large language model (LLM) to edit the original motion directly. However, none of these methods adopt a fine-grained approach that allows independent editing of body part movements while ensuring overall motion compatibility.

3 Kinematic-aware Human Motion
Our framework (see Fig. 2) refines motion representation with local joint groups and joint interactions, with corresponding low-level text semantics for understanding and motion generation. We first present our motion representation formulation (Sec. 3.1) and introduce KinMo dataset (Sec. 3.2). We then present a semantic hierarchy for enhanced text-motion alignment and understanding (Sec. 3.3). We finally formulate motion generation as a coarse-to-fine procedure based on our semantic hierarchy (Sec. 3.4) and show its various downstream applications (Sec. 4).
3.1 Joint-Kinematic Motion Representation
While existing text-to-motion datasets annotate only global action descriptions, such a formulation struggles to represent regional joint-group movements and their interactions, and it fails to support spatial understanding, generation, and editing of fine-grained motion. To resolve this problem, we reformulate the motion representation by organizing the joints into a set of kinematic groups $\mathcal{G}=\{G_1,\dots,G_6\}$ (torso, head, left arm, right arm, left leg, right leg), with each group $G_k$ comprising a subset of joints.
Joint-Group Representation. For each group $G_k$ at time $t$, we define the Group Position $\mathbf{p}_k^t$ as the average position of the joints within that group:

$$\mathbf{p}_k^t=\frac{1}{|G_k|}\sum_{j\in G_k}\mathbf{x}_j^t, \tag{1}$$

where $\mathbf{x}_j^t$ denotes the position of joint $j$ at time $t$. Additionally, we define the Limb Angles $\boldsymbol{\theta}_k^t$ as the collection of joint rotations within the group, and the Group Velocity $\mathbf{v}_k^t$ as the average velocity of the joints:

$$\boldsymbol{\theta}_k^t=\{\mathbf{r}_j^t\}_{j\in G_k},\qquad \mathbf{v}_k^t=\frac{1}{|G_k|}\sum_{j\in G_k}\dot{\mathbf{x}}_j^t. \tag{2}$$
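For concreteness, the following NumPy sketch computes these group-level features from raw joint positions and rotations; the joint indices assigned to each group and the frame rate are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

# Hypothetical joint indices per kinematic group (SMPL-style 22-joint skeleton);
# the paper's exact grouping is given in its supplementary material.
KINEMATIC_GROUPS = {
    "torso":     [0, 3, 6, 9],
    "head":      [12, 15],
    "left_arm":  [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg":  [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}

def group_features(positions, rotations, fps=20.0):
    """positions: (T, J, 3) joint positions; rotations: (T, J, 6) 6D joint rotations.
    Returns the Group Position, Limb Angles, and Group Velocity of Eqs. (1)-(2)."""
    velocities = np.gradient(positions, 1.0 / fps, axis=0)    # finite-difference joint velocities
    feats = {}
    for name, idx in KINEMATIC_GROUPS.items():
        feats[name] = {
            "position": positions[:, idx].mean(axis=1),       # Eq. (1): average joint position
            "angles":   rotations[:, idx],                     # Eq. (2): rotations of joints in the group
            "velocity": velocities[:, idx].mean(axis=1),       # Eq. (2): average joint velocity
        }
    return feats
```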
Joint-Interaction Representation. Human motion also involves the relationships between different kinematic groups. For physically connected groups (e.g., torso and neck), we define the Relative Position, Relative Limb Angles (angles of the connecting joint between two physically connected groups), and Relative Velocity (angular velocity of the connecting joint between two physically connected groups). For non-physically connected groups (e.g., left arm and right arm), only the Relative Position exists. Each pair of groups $(G_k, G_l)$ can be represented as:

$$I_{k,l}^t=\left(\Delta\mathbf{p}_{k,l}^t,\ \Delta\boldsymbol{\theta}_{c}^t,\ \Delta\mathbf{v}_{c}^t\right), \tag{3}$$

where $\Delta\mathbf{p}_{k,l}^t$ denotes the difference in position, $\Delta\boldsymbol{\theta}_{c}^t$ represents the relevant angles at the connecting joint $c$, and $\Delta\mathbf{v}_{c}^t$ indicates the angular velocity of that joint. This formalism allows for a comprehensive analysis of motion dynamics across both physically and non-physically connected kinematic groups.
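Continuing the sketch above, a minimal pairwise-interaction computation might look as follows; the table of connected group pairs and their connecting joints is hypothetical.

```python
import numpy as np

# Assumed physically connected group pairs and their connecting joint index.
CONNECTED = {("torso", "head"): 12, ("torso", "left_arm"): 13, ("torso", "right_arm"): 14,
             ("torso", "left_leg"): 1, ("torso", "right_leg"): 2}

def interaction_features(feats, rotations, fps=20.0):
    """feats: output of group_features(); rotations: (T, J, 6) joint rotations.
    Returns the pairwise interaction features of Eq. (3) for all 15 group pairs."""
    ang_vel = np.gradient(rotations, 1.0 / fps, axis=0)        # joint angular velocities
    names, pairs = list(feats), {}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            k, l = names[a], names[b]
            inter = {"rel_position": feats[k]["position"] - feats[l]["position"]}
            c = CONNECTED.get((k, l), CONNECTED.get((l, k)))
            if c is not None:                                  # physically connected groups only
                inter["rel_angles"] = rotations[:, c]          # angles at the connecting joint
                inter["rel_velocity"] = ang_vel[:, c]          # angular velocity of that joint
            pairs[(k, l)] = inter
    return pairs
```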
The formulation above can be transformed back to the existing global motion representation framework used in various text-to-motion understanding and generation methods, as shown in the supplementary document. Based on this formulation, we collect the text descriptions of local joint groups and joint interactions.
3.2 Kinematic-Group Motion Dataset
Kinematic-aware Joint-Motion Text Annotation. Our method aims to capture detailed spatial-temporal local joint motions and their relationships to global action-level dynamics. To achieve this, we enhance the HumanML3D dataset [8] with fine-grained joint motion annotations by leveraging large language models for semantic representation. To achieve high-quality automatic annotation of the dataset, we delineate two strategies, detailed below.
Spatial-Temporal Motion Processing.
Existing human motion understanding models [2, 8, 23] struggle to capture subtle local movements due to the complex interplay of spatial and temporal dynamics. To address this, we propose a two-stage disentanglement approach, first resolving spatial dynamics and then temporal dynamics.
To capture fine-grained spatial information, we utilize PoseScript [4] to generate text descriptions for each pose frame with detailed angular rotations of joints for any target human pose, thus producing rich spatial annotations. To capture fine-grained temporal information, we propose a keyframe selection pipeline. We use sBERT [22] to extract embeddings for the per-frame pose descriptions. By calculating the cosine similarity between text embeddings, we assess the similarity of poses across the time frames. If the cosine similarity falls below a user-defined threshold, we label that frame as a keyframe. This process effectively filters keyframes from a sequence of poses based on temporal transitions. We then estimate temporal local motions by analyzing the pose differences for any joint group across a specified time window.
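A minimal sketch of this keyframe selection step is shown below, assuming per-frame PoseScript descriptions are already available; the sBERT checkpoint name and the 0.85 threshold are illustrative choices, not values reported in the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def select_keyframes(frame_descriptions, threshold=0.85):
    """Embed per-frame pose descriptions with sBERT and open a new keyframe
    whenever similarity to the last keyframe drops below the threshold."""
    model = SentenceTransformer("all-MiniLM-L6-v2")                 # any sBERT checkpoint
    emb = model.encode(frame_descriptions, normalize_embeddings=True)
    keyframes = [0]
    for t in range(1, len(frame_descriptions)):
        if float(np.dot(emb[keyframes[-1]], emb[t])) < threshold:   # pose changed enough
            keyframes.append(t)
    return keyframes
```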
Semi-supervised Annotation Refinement. A proper prompt design is crucial for achieving high-quality automatic annotation. To this end, two human evaluators curate the automatic annotations to refine the prompt. We randomly sample 20 motion sequences that undergo the aforementioned preprocessing and use GPT-4o-mini [1] to infer local joint motions from the pose annotations. The two evaluators independently judge the annotated local motion sequences and record the errors the LLM made. This feedback is then returned to the LLM to improve the automatic prompt design. The procedure iterates until the evaluators reach a consensus, with a kappa statistic above 0.8, ensuring that the subsequent LLM annotations for the remaining data align with our objective and achieve high quality. Please refer to the supplementary document for additional details on the text prompt used for this inference.

3.3 Hierarchical Text-Motion Alignment
With fine-grained text descriptions for joint-group and interactions to represent human motions, we extend the existing text-motion alignment to improve understanding of spatial human motion. To achieve this goal, we treat text descriptions in the text-to-motion (T2M) framework as coarse semantics and progressively integrate joint-level texts to refine the representation, achieving hierarchical alignment.
Modality Encoders. We follow TMR [18] to establish the motion and text encoders. The motion sequences comprise joint velocities, local joint positions, 6D local joint rotations, and foot contact labels, as described in [8]. The original text descriptions from HumanML3D act as coarse semantics. Additionally, we introduce joint-group motion descriptions and joint-interaction descriptions as the intermediate and final levels, yielding hierarchical representations. Please refer to the supplementary document for details on the network architecture.

A cross-attention layer fuses the coarse action-level embedding $E_g$ with the joint-level embeddings $E_j$ to produce refined distribution parameters:

$$(\mu',\ \sigma')=\mathrm{CrossAttn}\big(E_g,\ E_j\big). \tag{4}$$

This operation allows us to integrate the base semantics into the low-level joint semantics. For the subsequent levels, we progressively add the resulting mean and standard deviation from the joint-group and joint-interaction embeddings back into the mean representation of the coarse text.
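A minimal PyTorch sketch of one such fusion step follows; treating the coarse latents as the attention query, together with the latent dimension and head count, is an assumption made for illustration rather than the released architecture.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Coarse action-level latents attend over joint-level text embeddings,
    and the result is added back to the coarse mean/std (cf. Eq. (4))."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mu_g, sigma_g, joint_tokens):
        # mu_g, sigma_g: (B, D) coarse text latents; joint_tokens: (B, N, D)
        coarse = torch.stack([mu_g, sigma_g], dim=1)                   # (B, 2, D)
        fused, _ = self.attn(query=coarse, key=joint_tokens, value=joint_tokens)
        return mu_g + fused[:, 0], sigma_g + fused[:, 1]               # residual update
```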
Contrastive Learning. We use contrastive learning to bridge the text and motion modalities [18] at each semantic level of the hierarchy. For simplicity, we denote any level of text and motion latent pair as $(z^T, z^M)$. For a batch of $N$ positive pairs of latent codes $\{(z_i^T, z_i^M)\}_{i=1}^{N}$, any pair $(z_i^T, z_j^M)$ with $i\neq j$ is considered a negative sample. The similarity matrix $S$ collects the pairwise cosine similarities for all pairs in the batch, defined as $S_{ij}=\cos(z_i^T, z_j^M)$. We apply the InfoNCE loss [30], as follows:

$$\mathcal{L}_{\mathrm{NCE}}=-\frac{1}{2N}\sum_{i=1}^{N}\left(\log\frac{\exp(S_{ii}/\tau)}{\sum_{j}\exp(S_{ij}/\tau)}+\log\frac{\exp(S_{ii}/\tau)}{\sum_{j}\exp(S_{ji}/\tau)}\right), \tag{5}$$

where $\tau$ represents a temperature parameter.
To maximize the proximity between the two modalities, we follow TMR [18] and construct a weighted sum of three losses for each semantic level of the hierarchy: (a) a Kullback–Leibler divergence loss $\mathcal{L}_{\mathrm{KL}}$, (b) a cross-modal embedding similarity loss $\mathcal{L}_{\mathrm{E}}$, and (c) a motion reconstruction loss $\mathcal{L}_{\mathrm{R}}$.
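As a reference, a minimal PyTorch implementation of the symmetric InfoNCE term in Eq. (5) could look like the following; the temperature value is illustrative. In training, this term would be combined with the KL, embedding-similarity, and reconstruction losses above at every level of the hierarchy.

```python
import torch
import torch.nn.functional as F

def info_nce(z_text, z_motion, tau=0.1):
    """Symmetric InfoNCE (Eq. 5) for one level of the semantic hierarchy.
    z_text, z_motion: (N, D) latent codes of a batch of positive pairs."""
    z_text = F.normalize(z_text, dim=-1)
    z_motion = F.normalize(z_motion, dim=-1)
    S = z_text @ z_motion.t() / tau                          # (N, N) cosine similarities / temperature
    targets = torch.arange(S.size(0), device=S.device)       # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(S, targets) + F.cross_entropy(S.t(), targets))
```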
3.4 Coarse-to-Fine Motion Generation
Joint Motion Reasoner. For open-vocabulary motion generation, we finetune LLaMA on the KinMo dataset as a low-level motion reasoner that, given a global action-level text description, generates the corresponding joint-group and joint-interaction text scripts through supervised learning. Given the low-level joint motion descriptions from the Motion Reasoner, we obtain their corresponding text embeddings from the hierarchical text encoders of Sec. 3.3 to perform various motion generation applications.
Text-to-Motion Generation. We enhance the existing state-of-the-art motion generator, MoMask [9], through a coarse-to-fine generation process. Specifically, we modify the text conditioning by constructing $E_g$, $E_j$, and $E_i$ to represent the three levels of text semantic control (global action, joint group, and joint interaction). We leverage one cross-attention layer, akin to Sec. 3.3, to learn the correlation between high-level action and low-level joint semantics. These embeddings are prepended to the motion tokens before they are fed into the generator, as shown in Fig. 2. The Mask Motion Generator first generates coarse motions based on the global action-level description. These initial motions are then refined by re-feeding them into the generator, now conditioned on joint-level motion semantics. The process is repeated once more, further refining the motion representations by incorporating joint-interaction semantics. For efficiency, the generator shares weights and the same logit classification loss functions used for motion reconstruction across the three levels of text conditioning. All other configurations follow MoMask [9].
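The following Python sketch illustrates the coarse-to-fine loop; the `mask_generator` and `text_enc` interfaces (`generate`, `refine`, `decode`, and the encoder methods) are hypothetical stand-ins, not KinMo's actual API.

```python
def coarse_to_fine_generate(mask_generator, text_enc, action_text, joint_texts, interact_texts):
    """Run the same masked generator three times, each pass re-conditioned on a
    finer level of text semantics (global -> joint groups -> joint interactions)."""
    e_g = text_enc.encode_global(action_text)
    e_j = text_enc.encode_joint_groups(joint_texts)
    e_i = text_enc.encode_interactions(interact_texts)

    tokens = mask_generator.generate(condition=[e_g])                   # coarse pass: global action
    tokens = mask_generator.refine(tokens, condition=[e_g, e_j])        # refine with joint-group semantics
    tokens = mask_generator.refine(tokens, condition=[e_g, e_j, e_i])   # refine with joint interactions
    return mask_generator.decode(tokens)                                # RVQ decoder -> motion sequence
```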
Text-to-Motion Editing. We leverage a Joint Motion Reasoner to refine both global and local action descriptions based on the users’ input. This model enables precise action-level edits (e.g., changing running to jumping) or local joint adjustments (e.g., slightly raising the hands). Our method uses a coarse-to-fine approach, assisted by a masking mechanism, to perform these edits at varying levels of granularity. Specifically, by masking the target sequences and using the mask generator to fill in the masked area, we can dynamically adjust the motion to meet the target requirement. For more details on the masking-based editing process, please refer to MMM [21].
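Below is a minimal sketch of the masking-based editing step in the spirit of MMM [21]; the `infill` interface and the mask token id are assumptions.

```python
import numpy as np

def edit_motion_tokens(mask_generator, tokens, edit_condition, start, end, mask_id=-1):
    """Replace the tokens inside the edited temporal window with a mask id and
    let the generator re-synthesize only those positions under the new condition."""
    tokens = np.array(tokens, copy=True)
    tokens[start:end] = mask_id                                         # mask the frames to be edited
    return mask_generator.infill(tokens, condition=edit_condition)      # fill only the masked slots
```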
Motion Trajectory Control. Inspired by OmniControl [32], we propose latent joint spatial conditioning for the mask motion generator, as shown in Fig. 4. The motion control model is a trainable copy of the frozen mask motion generator that provides spatial guidance. To process the trajectories, we use a Spatial Encoder consisting of convolution layers that encode the trajectory signals.
We combine a motion reconstruction loss with the RQ-VAE decoder, which maps the latent representation back into the motion space, yielding the predicted motion $\hat{\mathbf{x}}$. The control loss $\mathcal{L}_{\mathrm{control}}$ is defined as follows:

$$\mathcal{L}_{\mathrm{control}}=\frac{\sum_{t}\sum_{j} m_{t,j}\,\big\lVert R(\hat{\mathbf{x}})_{t,j}-\mathbf{c}_{t,j}\big\rVert_2}{\sum_{t}\sum_{j} m_{t,j}}, \tag{6}$$

where $R(\cdot)$ transforms the joint local positions to global absolute locations, $\mathbf{c}_{t,j}$ denotes the target trajectory location, and $m_{t,j}$ is a binary mask indicating whether joint $j$ at frame $t$ is active. Given the trajectory control, we can achieve joint-level editing by modifying joint motion scripts while keeping the positions of all other joints fixed to the source sequence.
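A direct PyTorch transcription of Eq. (6) might look as follows; the tensor shapes and the `to_global` callable are assumptions about the interface.

```python
import torch

def control_loss(pred_local, targets, mask, to_global):
    """Eq. (6): masked average of global joint-position errors.
    pred_local: (T, J, 3) predicted local positions; targets: (T, J, 3) desired
    global locations; mask: (T, J) binary control mask; to_global: the mapping R(.)."""
    pred_global = to_global(pred_local)                          # local -> global absolute positions
    err = torch.norm(pred_global - targets, dim=-1)              # per-joint L2 error
    return (mask * err).sum() / mask.sum().clamp(min=1)          # average over the active control joints
```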
4 Experiments

| Protocol | Method | T2M R@1 | T2M R@2 | T2M R@3 | T2M R@5 | T2M R@10 | T2M MedR | M2T R@1 | M2T R@2 | M2T R@3 | M2T R@5 | M2T R@10 | M2T MedR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) All | TEMOS [17] | 2.12 | 4.09 | 5.87 | 8.26 | 13.52 | 173.0 | 3.86 | 4.54 | 6.94 | 9.38 | 14.00 | 183.25 |
| (a) All | HumanML3D [8] | 1.80 | 3.42 | 4.79 | 7.12 | 12.47 | 81.00 | 2.92 | 3.74 | 6.00 | 8.36 | 12.95 | 81.50 |
| (a) All | TMR [18] | 5.68 | 10.59 | 14.04 | 20.34 | 30.94 | 28.00 | 9.95 | 12.44 | 17.95 | 23.56 | 32.69 | 28.50 |
| (a) All | Ours (DistilBERT) | 8.13 | 14.16 | 19.69 | 27.07 | 39.18 | 18.00 | 9.29 | 15.51 | 20.61 | 28.29 | 40.25 | 18.00 |
| (a) All | Ours (RoBERTa) | 9.05 | 15.23 | 20.47 | 28.62 | 41.60 | 16.00 | 9.01 | 15.92 | 21.42 | 29.50 | 41.43 | 16.00 |
| (b) All with threshold | TEMOS [17] | 5.21 | 8.22 | 11.14 | 15.09 | 22.12 | 79.00 | 5.48 | 6.19 | 9.00 | 12.01 | 17.10 | 129.0 |
| (b) All with threshold | HumanML3D [8] | 5.30 | 7.83 | 10.75 | 14.59 | 22.51 | 54.00 | 4.95 | 5.68 | 8.93 | 11.64 | 16.94 | 69.50 |
| (b) All with threshold | TMR [18] | 11.60 | 15.39 | 20.50 | 27.72 | 38.52 | 19.00 | 13.20 | 15.73 | 22.03 | 27.65 | 37.63 | 21.50 |
| (b) All with threshold | Ours (DistilBERT) | 10.82 | 18.49 | 25.33 | 33.89 | 46.54 | 12.00 | 12.25 | 19.69 | 24.98 | 32.70 | 44.04 | 14.00 |
| (b) All with threshold | Ours (RoBERTa) | 11.39 | 19.18 | 25.73 | 34.76 | 47.94 | 12.00 | 11.65 | 19.54 | 25.45 | 33.67 | 45.08 | 14.00 |
| (c) Dissimilar subset | TEMOS [17] | 33.00 | 42.00 | 49.00 | 57.00 | 66.00 | 4.00 | 35.00 | 44.00 | 50.00 | 56.00 | 70.00 | 3.50 |
| (c) Dissimilar subset | HumanML3D [8] | 34.00 | 48.00 | 57.00 | 72.00 | 84.00 | 3.00 | 34.00 | 47.00 | 59.00 | 72.00 | 83.00 | 3.00 |
| (c) Dissimilar subset | TMR [18] | 47.00 | 61.00 | 71.00 | 80.00 | 86.00 | 2.00 | 48.00 | 63.00 | 69.00 | 80.00 | 84.00 | 2.00 |
| (c) Dissimilar subset | Ours (DistilBERT) | 45.73 | 62.80 | 70.73 | 79.88 | 90.85 | 2.00 | 46.95 | 62.80 | 70.12 | 82.93 | 91.46 | 2.00 |
| (c) Dissimilar subset | Ours (RoBERTa) | 57.73 | 78.35 | 81.44 | 86.60 | 90.72 | 1.00 | 63.92 | 80.41 | 82.47 | 87.63 | 90.72 | 1.00 |
| (d) Small batches [8] | TEMOS [17] | 40.49 | 53.52 | 61.14 | 70.96 | 84.15 | 2.33 | 39.96 | 53.49 | 61.79 | 72.40 | 85.89 | 2.33 |
| (d) Small batches [8] | HumanML3D [8] | 52.48 | 71.05 | 80.65 | 89.66 | 96.58 | 1.39 | 52.00 | 71.21 | 81.11 | 89.87 | 96.78 | 1.38 |
| (d) Small batches [8] | TMR [18] | 67.16 | 81.32 | 86.81 | 91.43 | 95.36 | 1.04 | 67.97 | 81.20 | 86.35 | 91.70 | 95.27 | 1.03 |
| (d) Small batches [8] | Ours (DistilBERT) | 72.28 | 85.42 | 90.15 | 94.01 | 97.09 | 1.00 | 72.21 | 85.19 | 90.00 | 94.42 | 97.04 | 1.00 |
| (d) Small batches [8] | Ours (RoBERTa) | 72.88 | 85.54 | 89.91 | 93.46 | 96.68 | 1.00 | 73.00 | 85.64 | 90.17 | 93.70 | 96.49 | 1.00 |
| Method | FID | R-Prec (Top 3) | MM-Dist | Diversity | HTMA-S |
|---|---|---|---|---|---|
| STMC [20] | 0.561 | 0.612 | 3.864 | 8.952 | 0.636 |
| MMM [21] | 0.102 | 0.685 | 3.574 | 9.573 | 0.598 |
| MoMask [9] | 0.068 | 0.696 | 3.825 | 9.424 | 0.575 |
| Ours (G) | 0.089 | 0.712 | 3.434 | 9.453 | 0.712 |
| Ours (G + J) | 0.083 | 0.754 | 3.356 | 9.575 | 0.721 |
| Ours (G + J + I) | 0.086 | 0.734 | 3.203 | 9.364 | 0.744 |
| Method | FID | R-Precision (Top 3) | Traj. err. (50 cm) | Loc. err. (50 cm) | Avg. err. |
|---|---|---|---|---|---|
| Real | 0.002 | 0.797 | 0.0000 | 0.0000 | 0.0000 |
| MDM [29] | 0.698 | 0.602 | 0.4022 | 0.3076 | 0.5959 |
| PriorMDM [26] | 0.475 | 0.583 | 0.3457 | 0.2132 | 0.4417 |
| OmniControl [32] | 0.212 | 0.678 | 0.3041 | 0.1873 | 0.3226 |
| MotionLCM [3] | 0.531 | 0.752 | 0.1887 | 0.0769 | 0.1897 |
| KinMo (Ours) | 0.103 | 0.756 | 0.2034 | 0.0696 | 0.1657 |

We conduct experiments on the motion-text benchmark dataset HumanML3D [8], which collects 14,616 motions from the AMASS [15] and HumanAct12 [7] datasets, with each motion described by 3 text scripts, totaling 44,970 descriptions. We adopt its pose representation and augment the dataset by mirroring, followed by an 80/5/15 split for training, validation, and testing, respectively. We enhance this dataset so that each motion is additionally described by 6 joint-motion and 15 joint-interaction text scripts, as in Sec. 3.2. Implementation details are provided in the supplementary document.
4.1 Text-Motion Retrieval
Evaluation metrics. To validate the effectiveness of incorporating joint semantics, we adopt the TMR [18] settings and measure retrieval performance using recall at various ranks (e.g., R@1, R@2) and the median rank (MedR). MedR is the median ranking position of the ground-truth result, with lower values indicating more precise retrieval. The four evaluation protocols used in our experiments are as follows: (i) All uses the complete test set, though similar negative pairs may affect precision; (ii) All with threshold additionally treats a retrieval as correct when its similarity to the ground truth exceeds a 0.8 threshold; (iii) Dissimilar subset uses 100 sampled pairs whose texts are distinctly different, as measured by sBERT [22] embedding distance; and (iv) Small batches evaluates performance on random batches of 32 motion-text pairs.
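For reference, a small NumPy helper that computes R@k and MedR from a text-to-motion similarity matrix could look like this; the matrix layout, with the ground-truth match on the diagonal, is an assumption.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 2, 3, 5, 10)):
    """sim: (N, N) text-to-motion similarity matrix whose ground-truth match for
    query i is column i. Returns recall@k (in %) and the median rank (MedR)."""
    ranks = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)                                 # descending similarity
        ranks.append(int(np.where(order == i)[0][0]) + 1)        # 1-based rank of the ground truth
    ranks = np.asarray(ranks)
    recalls = {f"R@{k}": float((ranks <= k).mean() * 100) for k in ks}
    return recalls, float(np.median(ranks))
```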
Evaluation results. We benchmark KinMo against three baselines [17, 8, 18]. Tab. 1 shows that our model outperforms the existing baselines, especially under setting (a). This is mainly ascribed to our joint-level text descriptions, which help resolve action-level text-motion correspondence ambiguities and thereby improve the discrimination of motions that differ only in subtle joint movements yet share similar action-level descriptions. The contribution of joint-level semantics is further enhanced when RoBERTa [14] is used as a stronger text encoder.


4.2 Text-Motion Generation
Evaluation Metrics. We adopt (1) Frechet Inception Distance (FID), as an overall motion quality metric to measure the distribution difference between features of the generated and real motions; (2) R-Precision and multimodal distance to quantify the semantic alignment between input text and generated motions; and (3) Multimodality for assessing the diversity of motions generated from the same text following T2M [8]. Please refer to the supplementary document for further details on our metrics.
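As an illustration of the first metric, the standard FID computation over motion features (e.g., from the pretrained feature extractor of [8]) is sketched below.

```python
import numpy as np
from scipy import linalg

def fid(feat_real, feat_gen):
    """Frechet Inception Distance between two (N, D) arrays of motion features."""
    mu1, mu2 = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    c1 = np.cov(feat_real, rowvar=False)
    c2 = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)                      # matrix square root of the covariance product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                           # discard numerical imaginary residue
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2.0 * covmean))
```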
Evaluation Results. We compare KinMo against various methods for T2M Generation [29, 35, 9, 21, 17, 3]. As presented in Tab. 2, our method attains the best motion generation quality with the highest text alignment score. Additionally, thanks to the introduction of explicit joint-group and interaction motion representations, we observe that KinMo generates better aligned motions for any given dense and fine-grained text descriptions shown in Fig. 5, while other baseline methods fail to capture local body part movements.
User Study. We conducted a user study with 20 participants and 320 samples, 80 from each of KinMo, MoMask [9], MMM [21], and STMC [20], to assess the quality of our results. Each participant was shown the video clips in random order and asked to rate each result from 1 (lowest) to 5 (highest) on (1) realism, (2) correctness of text-motion alignment, and (3) overall impression. Fig. 6 shows that KinMo achieves higher Mean Opinion Scores (MOS) on all criteria compared to the other methods, indicating more accurate synthesized motion and better alignment with the text condition.
4.3 Text-Motion Editing
Evaluation Metrics. Due to the lack of benchmark datasets and metrics for this task, we generate 200 fine-grained text prompts, each with a corresponding edited version, using GPT-4o [1]. For comparison, each model first generates the motion corresponding to the original text and then edits that generation according to the new instruction. To evaluate editing quality beyond the generation metrics, we propose a Text-Motion Similarity score, denoted HTMA-S, which measures the similarity between the edited motion and the edited global motion description using the alignment model of Sec. 3.3.
Evaluation Results. We benchmark KinMo against various T2M generation methods [21, 9, 20]. We find that KinMo is the only method capable of local temporal editing while maintaining motion naturalness, as shown in Fig. 4 and Tab. 4. Edits to the global semantics are captured at both the joint and interaction semantic levels, yielding better generation and editing, as shown in Figs. 5 and 7.
4.4 Motion Trajectory Control
Evaluation metrics. In addition to the metrics from Sec. 4.2, we report (1) Trajectory error (Traj. err.), the ratio of unsuccessful trajectories, i.e., those with any control-joint location error exceeding a predetermined threshold; (2) Location error (Loc. err.), the ratio of unsuccessful control joints; and (3) Average error (Avg. err.), the mean location error of the control joints.
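A small NumPy sketch of these three error metrics is given below, assuming global joint positions and a binary mask of controlled joints per sequence; the 50 cm threshold matches the table header.

```python
import numpy as np

def control_errors(preds, targets, masks, threshold=0.5):
    """preds/targets: lists of (T, J, 3) global joint positions per sequence;
    masks: list of (T, J) binary arrays marking the controlled joints."""
    traj_fail, loc_fail, avg_err = [], [], []
    for p, t, m in zip(preds, targets, masks):
        dist = np.linalg.norm(p - t, axis=-1)[m.astype(bool)]   # errors at controlled joints only
        traj_fail.append(float((dist > threshold).any()))       # trajectory fails if any joint fails
        loc_fail.append(float((dist > threshold).mean()))       # ratio of failed control joints
        avg_err.append(float(dist.mean()))                      # mean location error
    return np.mean(traj_fail), np.mean(loc_fail), np.mean(avg_err)
```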
Evaluation Results. We compare KinMo with open-source models [32, 3], specifically focusing on pelvis control and excluding test-time optimization for all baselines to ensure a fair comparison. As shown in Tab. 4, our method achieves more robust and accurate controlled generation with lower errors and FID compared to other methods.
| Motion semantics | T2M R@1 | T2M R@2 | T2M R@3 | T2M MedR | M2T R@1 | M2T R@2 | M2T R@3 | M2T MedR |
|---|---|---|---|---|---|---|---|---|
| global | 3.67 | 7.17 | 10.32 | 40.00 | 8.08 | 11.56 | 17.23 | 38.00 |
| + joint | 7.58 | 13.16 | 16.97 | 22.00 | 8.58 | 14.51 | 19.21 | 21.00 |
| + interact | 9.05 | 15.23 | 20.47 | 16.00 | 9.01 | 15.92 | 21.42 | 16.00 |
| - cross | 7.63 | 13.13 | 16.94 | 22.00 | 8.60 | 14.54 | 19.21 | 21.00 |
4.5 Ablation Study
Hierarchical Motion Semantics. We investigate several strategies for semantic incorporation of text-motion alignment in Tab. 5: (1) only global action-level descriptions (global), (2) + joint-group (+ joint), (3) + joint-interaction (+ interact), and (4) - cross-attention (- cross). We observe that adding joint motion descriptions enhances the motion understanding, resolving the ambiguity of global action descriptions. In addition, our design based on cross-attention improves the connectivity of different hierarchy levels for motion representation. For an additional analysis of motion semantics, please refer to the supplementary document. As shown in Fig. 7, both joint and interaction semantics are beneficial for text-motion alignment on local body parts (hands in the example) during the generation process.
Coarse-to-Fine Motion Generation. To assess the contribution of each component of our pipeline, we design the following variants: (1) CLIP-G: only the global motion description (the original HumanML3D text) is used for motion generation (as in MoMask [9]); (2) CLIP-J: joint-group semantics are added on top of the CLIP-based conditioning; (3) CLIP-I: joint-interaction semantics are added on top of the previous setting. We also apply these three settings with HTMA (Hierarchical Text-Motion Alignment) encoders to validate the effectiveness of our coarse-to-fine generation strategy and the benefit of text-motion alignment for generation. As shown in Fig. 8, the coarse-to-fine procedure enhances motion generation quality. In addition, our proposed text-motion alignment significantly speeds up training and improves performance.
5 Conclusion
We present KinMo, a novel framework that represents human motion as joint movements and interactions, thereby enabling fine-grained text-to-motion understanding, generation, editability and trajectory control. Our method progressively encodes global actions with joint-level interactions, and leverages these hierarchical text descriptions to generate coarse-to-fine motion through cross attention and contrastive learning. The dataset will be made publicly available to the scientific community. Extensive comparisons with state-of-the-art methods show that KinMo improves text-motion alignment and kinematic body part control. Future directions would include extending our method to fine-grained text-guided image-to-video generation or video-to-video editing tasks.
Acknowledgements: We thank Brian Burritt and Avi Goyal for helping with visualizations, and Kyle Olszewski and Ari Shapiro for valuable discussions.
References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Chen et al. [2024] Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, and Lei Zhang. Motionllm: Understanding human behaviors from human motions and videos. arXiv preprint arXiv:2405.20340, 2024.
- Dai et al. [2024] Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, and Yansong Tang. Motionlcm: Real-time controllable motion generation via latent consistency model. arXiv preprint arXiv:2404.19759, 2024.
- Delmas et al. [2022] Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, and Grégory Rogez. Posescript: 3d human poses from natural language. In European Conference on Computer Vision, pages 346–362. Springer, 2022.
- Delmas et al. [2024] Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, and Grégory Rogez. Posefix: Correcting 3d human poses with natural language, 2024.
- Feng et al. [2024] Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, and Michael J Black. Chatpose: Chatting about 3d human pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2093–2103, 2024.
- Guo et al. [2020] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2021–2029, 2020.
- Guo et al. [2022] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022.
- Guo et al. [2024] Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024.
- Huang et al. [2024] Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, and Lingjie Liu. Como: Controllable motion generation through language guided pose code editing. arXiv preprint arXiv:2403.13900, 2024.
- Koppula and Saxena [2013] Hema Swetha Koppula and Ashutosh Saxena. Anticipating human activities for reactive robotic response. In IROS. Tokyo, 2013.
- Li et al. [2024] Yong-Lu Li, Xiaoqian Wu, Xinpeng Liu, Zehao Wang, Yiming Dou, Yikun Ji, Junyi Zhang, Yixing Li, Xudong Lu, Jingru Tan, et al. From isolated islands to pangea: Unifying semantic space for human action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16582–16592, 2024.
- Liu et al. [2023] Xinpeng Liu, Yong-Lu Li, Ailing Zeng, Zizheng Zhou, Yang You, and Cewu Lu. Bridging the gap between human motion and action semantics via kinematic phrases. arXiv preprint arXiv:2310.04189, 2023.
- Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
- Mahmood et al. [2019] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019.
- Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023.
- Petrovich et al. [2022] Mathis Petrovich, Michael J Black, and Gül Varol. Temos: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision, pages 480–497. Springer, 2022.
- Petrovich et al. [2023] Mathis Petrovich, Michael J Black, and Gül Varol. TMR: Text-to-motion retrieval using contrastive 3D human motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9488–9497, 2023.
- Petrovich et al. [2024a] Mathis Petrovich, Or Litany, Umar Iqbal, Michael J Black, Gul Varol, Xue Bin Peng, and Davis Rempe. Multi-track timeline control for text-driven 3d human motion generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1911–1921, 2024a.
- Petrovich et al. [2024b] Mathis Petrovich, Or Litany, Umar Iqbal, Michael J. Black, Gül Varol, Xue Bin Peng, and Davis Rempe. Multi-track timeline control for text-driven 3d human motion generation. In CVPR Workshop on Human Motion Generation, 2024b.
- Pinyoanuntapong et al. [2024] Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, and Chen Chen. Mmm: Generative masked motion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1546–1555, 2024.
- Reimers [2019] N Reimers. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
- Ribeiro-Gomes et al. [2024] Jose Ribeiro-Gomes, Tianhui Cai, Zoltán Á Milacski, Chen Wu, Aayush Prakash, Shingo Takagi, Amaury Aubel, Daeil Kim, Alexandre Bernardino, and Fernando De La Torre. Motiongpt: Human motion synthesis with improved diversity and realism via gpt-3 prompting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5070–5080, 2024.
- Rombach et al. [2021] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
- Shafir et al. [2023] Yonatan Shafir, Guy Tevet, Roy Kapon, and Amit H Bermano. Human motion diffusion as a generative prior. arXiv preprint arXiv:2303.01418, 2023.
- Shafir et al. [2024] Yoni Shafir, Guy Tevet, Roy Kapon, and Amit Haim Bermano. Human motion diffusion as a generative prior. In The Twelfth International Conference on Learning Representations, 2024.
- Sun et al. [2024] Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, and Yong-jin Liu. Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models. ACM Transactions on Graphics (TOG), 43(4):1–9, 2024.
- Tevet et al. [2022a] Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. In European Conference on Computer Vision, pages 358–374. Springer, 2022a.
- Tevet et al. [2022b] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H. Bermano. Human motion diffusion model, 2022b.
- van den Oord et al. [2019] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2019.
- Wan et al. [2024] Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, and Lingjie Liu. Tlcontrol: Trajectory and language control for human motion synthesis, 2024.
- Xie et al. [2024] Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, and Huaizu Jiang. Omnicontrol: Control any joint at any time for human motion generation. In The Twelfth International Conference on Learning Representations, 2024.
- Zhang et al. [2023a] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.
- Zhang et al. [2023b] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying Shan. Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14730–14740, 2023b.
- Zhang et al. [2022] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022.
- Zhang and Kong [2024] Pengfei Zhang and Deying Kong. Handformer2t: A lightweight regression-based model for interacting hands pose estimation from a single rgb image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6248–6257, 2024.
- Zhu et al. [2024] Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. arXiv preprint arXiv:2403.14781, 2024.