
MAAIG: Motion Analysis And Instruction Generation

Wei-Hsin Yeh [email protected] 0009-0001-2194-0419 Academia Sinica, Taipei, Taiwan; Pei Hsin Lin [email protected] Academia Sinica, Taipei, Taiwan; Yu-An Su [email protected] 0009-0004-9842-3251 Academia Sinica, Taipei, Taiwan; Wen Hsiang Cheng [email protected] 0009-0000-6894-548X Academia Sinica, Taipei, Taiwan; and Lun-Wei Ku [email protected] 0000-0003-2691-5404 Academia Sinica, Taipei, Taiwan
(2023)
Abstract.

Many people engage in self-directed sports training at home but lack the real-time guidance of professional coaches, making them susceptible to injuries or the development of incorrect habits. In this paper, we propose a novel application framework called MAAIG (Motion Analysis And Instruction Generation). Given a user-provided sports video, it generates an embedding vector for each frame from the 3D skeleton estimated for that frame and feeds these embeddings into a pretrained T5 model. The model then uses this information to generate specific sports instructions. It can identify potential issues and provide real-time guidance in a manner akin to that of a professional coach, helping users improve their sports skills and avoid injuries.

deep learning, instruction generation, natural language generation, computer vision
journalyear: 2023; copyright: rightsretained; conference: ACM Multimedia Asia Workshops, December 6–8, 2023, Tainan, Taiwan; booktitle: ACM Multimedia Asia Workshops (MMAsia ’23 Workshops), December 6–8, 2023, Tainan, Taiwan; doi: 10.1145/3611380.3630165; isbn: 979-8-4007-0326-3/23/12; submissionid: 355; ccs: Computing methodologies → Natural language generation

1. Introduction

In this era, many individuals choose to self-learn various athletic movements at home due to constraints such as time limitations, inconvenient access to sports facilities, or other factors. However, autonomous learning of sports movements comes with several challenges, with one of the primary concerns being the absence of real-time guidance from professional coaches. In such circumstances, learners are susceptible to injuries or may develop incorrect exercise habits, potentially having adverse effects on their health and performance.

To address this issue, we propose a novel framework called Motion Analysis And Instruction Generation (MAAIG). This application accepts user-provided videos of sports movements and leverages advanced machine learning techniques to analyze motion characteristics and patterns within the videos. It can identify potential issues and provide real-time instructional language to users, akin to what a professional coach would say, aiding them in improving their athletic skills. This technology proves highly beneficial for accelerated learning and injury prevention, especially for those engaged in self-directed athletic training at home.

Our approach is rooted in instruction generation. Initially, the videos captured by users are input into a pose recognition model to generate a dataset of 3D skeletal information. Subsequently, this dataset is transformed into embedding vectors, which are fed into the instruction generation model to provide guidance language to the users. Building upon this foundation, we introduce a novel architecture that redefines the instruction generation process. Traditional instruction generation often relies on static text or predefined categories, limiting its adaptability and real-time applicability. In contrast, our approach harnesses motion recognition from the source video, offering richer information to the instruction generation model.

Figure 1. Framework of the Motion to Instruction model: pretrain on the task “Motion to Text”, then fine-tune on the task “Motion to Instruction”.

2. Related Work

2.1. Motion to Text

Several studies have addressed motion-to-text generation (Guo et al., 2022; Jiang et al., 2023). In (Guo et al., 2022), the authors introduce an approach that transforms a sequence of motions into tokenized motion representations in a codebook, employing the VQ-VAE (van den Oord et al., 2017) framework. This process yields a compact yet semantically rich representation of 3D motions. Leveraging these quantized embedding vectors (motion tokens), the authors employ a Transformer model to map human motions to textual descriptions. This approach effectively connects quantized motion tokens to text tokens, so that the generated text better describes the motion sequence. However, it is worth noting that this research primarily focuses on generating text that captures the semantic essence of an entire motion sequence; for the motion-to-instruction task, it remains difficult to capture the detailed information in a sequence of motions.

In (Jiang et al., 2023), MotionGPT employs T5 as its underlying language model and endows it with motion-aware capabilities. Training involves a two-stage process: pretraining followed by instruction tuning. This approach yields improvements across various tasks and enhances the model’s performance, particularly on previously unseen tasks or prompts.

2.2. Motion to Instruction

In (Zhao et al., 2022), the framework for providing exercise feedback relies on a graph convolutional kernel. The authors employ a classifier model to predict action labels (i.e., instruction labels). It is worth noting that, instead of using the 3D joint positions directly, they take the DCT (Ahmed et al., 1974) coefficients of joint trajectories as the classifier’s input. The predicted action labels are then passed to a feedback model, converted into a tensor, and fed into a correction model together with the DCT coefficients of the joints. This mechanism enables the correction model to improve the accuracy of the corrected motion. The framework, however, only generates instruction labels. We believe the output text would be more valuable if it provided precise instructions rather than labels; in this way, the framework could offer more meaningful guidance for exercise routines.

3. Methodology

This paper presents a framework for converting skeletal information into language guidance, and the basic structure is depicted in Figure 1. In the following sections, individual models will be discussed in detail.

3.1. Human Pose Embedding

The motion-to-instruction model takes skeleton-based data as input, obtained from video with the HybrIK pose estimation model (Li et al., 2020), which estimates 3D pose via twist-and-swing decomposition. Because our motion-to-instruction model is dedicated to helping figure skaters improve their athletic postures, and figure skating involves a multitude of airborne rotations and body twists, we choose HybrIK (Li et al., 2020) to construct the skeletal structure from the video.

Additionally, figure skating takes place on an ice rink and our goal is to provide accurate instruction; we therefore construct a 3D skeleton so that the motion-to-instruction model also has access to distance-related information.

The input data consists of a sequence of frames, each containing a set of 3D pose joint coordinates. These coordinates are relative to the human body parts in the SMPL (Loper et al., 2015) format, with dimension 22×3: “22” is the number of joints and “3” is the x, y, z coordinates of each joint.

3.2. Motion to Instruction Model

Here we describe the proposed model in detail. The inputs to the model consist of 3D skeleton-based data with per-frame joint dimensions of 22×3. We apply a linear transformation to convert each 3D skeleton, flattened to 66 dimensions, into a 512-dimensional embedding.
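As a concrete illustration, the following minimal PyTorch sketch shows this projection; the module and variable names are ours, and the actual implementation may differ.

    import torch
    import torch.nn as nn

    class SkeletonEmbedding(nn.Module):
        """Project a flattened 22x3 SMPL skeleton into a 512-d embedding per frame."""
        def __init__(self, num_joints=22, coord_dim=3, embed_dim=512):
            super().__init__()
            self.proj = nn.Linear(num_joints * coord_dim, embed_dim)

        def forward(self, joints):
            # joints: (batch, frames, 22, 3) -> (batch, frames, 66) -> (batch, frames, 512)
            batch, frames = joints.shape[:2]
            return self.proj(joints.reshape(batch, frames, -1))

    # Example: one clip of 120 frames
    frame_embeddings = SkeletonEmbedding()(torch.randn(1, 120, 22, 3))  # (1, 120, 512)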

Our model is built upon the T5 (Text-to-Text Transfer Transformer) (Raffel et al., 2019) architecture introduced by Google Research in 2019, which represents a significant advancement in natural language processing (NLP). T5 is a variant of the Transformer (Vaswani et al., 2023) architecture, whose self-attention mechanism allows the model to attend to all positions of the input sequence simultaneously; this capability is particularly valuable for capturing long-range dependencies in text.

The fundamental idea behind T5 is to frame NLP tasks as text-to-text transformations. It adopts an end-to-end approach in which both inputs and outputs are treated as text sequences, so that tasks spanning question answering, text classification, text generation, summarization, translation, and more can all be cast as mappings from one text sequence to another. Architecturally, T5 comprises the two core components of the Transformer: an encoder, which encodes the input text into an intermediate representation, and a decoder, which maps this intermediate representation to the desired output text.

During training, T5 learns its parameters by minimizing the discrepancy between the predicted output text and the target text, which enables it to generate high-quality outputs at inference time. Whether the task involves answering questions, classifying text, generating text, summarizing, or translating, T5’s text-to-text framework offers a unified approach. This versatility makes T5 a robust and adaptable NLP model: rather than designing distinct models for individual tasks, one only needs to adjust the input and output text specifications.

To tokenize instructions, we use the tokenizer from mT5 (multilingual Text-to-Text Transfer Transformer), a multilingual NLP model based on the Transformer architecture known for its strong cross-lingual processing capabilities. This tokenizer lets us tokenize, process, and represent textual instructions in a manner compatible with our model’s architecture, so the model can handle 3D skeletal data and textual instructions simultaneously to perform the required tasks.
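To make the pairing of motion embeddings with the text decoder concrete, the sketch below feeds precomputed frame embeddings to an encoder-decoder model through the Hugging Face inputs_embeds interface. The checkpoint name and example instruction are placeholders: the paper does not specify a model size, and for a self-contained sketch we load the mT5 checkpoint (d_model = 512) for both tokenizer and backbone.

    import torch
    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    # Placeholder checkpoint: mt5-small has d_model = 512, matching the frame embeddings.
    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
    model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

    # frame_embeddings: (batch, frames, 512), e.g. the output of SkeletonEmbedding above.
    frame_embeddings = torch.randn(1, 120, 512)
    labels = tokenizer("Keep the free leg extended on take-off.",
                       return_tensors="pt").input_ids

    # The encoder accepts precomputed embeddings in place of token embeddings,
    # so the motion sequence stands in for the input text.
    loss = model(inputs_embeds=frame_embeddings, labels=labels).loss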

Furthermore, to equip the model with the capability to handle 3D skeletal data and generate language, we pretrain the T5 model on the HumanML3D dataset (Plappert et al., 2016). HumanML3D pairs three-dimensional human motion sequences with textual descriptions of the corresponding actions, which makes it well suited for connecting skeletal data to language. Such motion-language data hold significant value in applications such as sports analysis, motion capture, human-computer interaction, and virtual reality, where the positions, orientations, and movements of human joints must be identified and interpreted. Through large-scale pretraining on this dataset, the model acquires a general representation of human poses and motions that transfers to other tasks and datasets.
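The two-stage procedure can be summarized by the following sketch; the epoch counts, learning rate, batch size, and dataset objects are placeholders, and we assume clips have been padded or cropped to a common length.

    from torch.optim import AdamW
    from torch.utils.data import DataLoader

    def train_stage(model, embed, dataset, epochs, lr=1e-4):
        """One stage of teacher-forced training on (motion, text) pairs."""
        optimizer = AdamW(list(model.parameters()) + list(embed.parameters()), lr=lr)
        loader = DataLoader(dataset, batch_size=8, shuffle=True)
        for _ in range(epochs):
            for joints, text_ids in loader:          # joints: (B, T, 22, 3)
                loss = model(inputs_embeds=embed(joints), labels=text_ids).loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()

    # Stage 1: pretrain on HumanML3D motion-description pairs.
    # train_stage(model, embed, humanml3d_pairs, epochs=50)
    # Stage 2: fine-tune on YourSkatingCoach motion-instruction pairs.
    # train_stage(model, embed, yourskatingcoach_pairs, epochs=20)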

Figure 2. The annotation instruction system. The coach first selects the video interval with the yellow bars on the left, then enters the corresponding instruction or feedback on the right.

4. Experiments

In this section, we first provide a detailed description of YourSkatingCoach, a private dataset that we collected specifically for this experiment. We then describe our framework implementation. Finally, we present empirical results showcasing the efficacy of our framework’s instruction generation, assessed with a diverse array of machine translation evaluation metrics, namely BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE (Grusky, 2023).
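The paper does not state which implementations of these metrics were used; the sketch below shows one common way to compute sentence-level scores with NLTK and the rouge-score package, on a made-up reference/hypothesis pair.

    from nltk.translate.bleu_score import sentence_bleu
    from nltk.translate.meteor_score import meteor_score   # requires nltk.download("wordnet")
    from rouge_score import rouge_scorer

    reference = "bend the knee more before the take-off".split()
    hypothesis = "bend the knee before take-off".split()

    bleu_1 = sentence_bleu([reference], hypothesis, weights=(1, 0, 0, 0))
    bleu_4 = sentence_bleu([reference], hypothesis, weights=(0.25, 0.25, 0.25, 0.25))
    meteor = meteor_score([reference], hypothesis)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        " ".join(reference), " ".join(hypothesis))["rougeL"].fmeasure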

4.1. YourSkatingCoach Dataset

Given the lack of an existing motion-instruction pair dataset, we collect our own figure skating dataset for this task using our self-developed instruction annotation system. Experts can select specific time segments from the videos in the database and annotate them with instructions. An example of a selected motion clip and its corresponding instructions is shown in Figure 2. In this section, we outline the process of acquiring YourSkatingCoach, which involves three phases: Motion Collection, Instruction Collection, and Dataset Generation.

Motion Collection. We obtain video footage of two skaters performing the Axel jump, a maneuver that involves transitioning from the forward outside edge of one skate to the backward outside edge of the other, with one (or more) and a half turns in mid-air. We categorize these videos into two groups: “Axel” for single jumps and “Axel_com” (Axel combo) for sequences comprising one or more consecutive jumps. However, due to the inadequate amount of data in the Axel subset, we exclusively use the Axel_com subset in this experiment. In total, 89 videos of athletes performing these two motions were collected.

Instruction Collection. To acquire instructions for the collected motions, we hired a professional skater to annotate the video tapes. The annotation process works as follows: the skater watches a video from the previous phase, selects a start time and an end time for a clip, and leaves an instruction on how the performer can improve the Axel. In the end, we clip 164 motion clips from the original 89 videos and split them 90/10 into training and testing sets.

Dataset Generation. With the motions obtained from Motion Collection and the instructions acquired through Instruction Collection, we construct a dataset of motion clips paired with their respective instructions. This involves segmenting the videos according to the start and end times chosen by the annotator and merging the segments with the corresponding annotations. When multiple instructions are assigned to the same time interval in a single video, we unify them into a single coherent directive by joining them with a designated separator symbol.
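A minimal sketch of this pairing step is shown below; the field names, frame rate, and separator symbol are placeholders, since the paper does not specify them.

    from collections import defaultdict

    SEPARATOR = " ; "   # placeholder separator symbol

    def build_pairs(annotations, skeletons, fps=30):
        """annotations: list of {"video", "start", "end", "instruction"} dicts.
        skeletons: dict mapping video id -> (num_frames, 22, 3) joint array."""
        grouped = defaultdict(list)
        for ann in annotations:
            grouped[(ann["video"], ann["start"], ann["end"])].append(ann["instruction"])

        pairs = []
        for (video, start, end), instructions in grouped.items():
            clip = skeletons[video][int(start * fps):int(end * fps)]
            # Instructions on the same interval are unified into one directive.
            pairs.append((clip, SEPARATOR.join(instructions)))
        return pairs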

Table 1. Evaluation Metrics for Different Models
Model Pretrain Bleu_1 Bleu_2 Bleu_3 Bleu_4 METEOR ROUGE_L
Transformer N/A 0.100418 0.067011 0.047725 0.037824 0.084673 0.180627
Transformer HumanML3D (world) 0.250667 0.170773 0.133011 0.110372 0.161843 0.212154
Transformer HumanML3D (local) 0.268576 0.163972 0.106748 0.076333 0.125742 0.202856
T5 N/A 0.371186 0.284594 0.235283 0.204323 0.20255 0.329223
T5 HumanML3D (world) 0.427747 0.335629 0.282911 0.250055 0.216211 0.380814
T5 HumanML3D (local) 0.439764 0.345505 0.299627 0.271761 0.220812 0.388061

4.2. Implementation Details

Initially, we develop the model from scratch using the T5 architecture. However, this approach yields subpar results owing to the inherent limitations of our YourSkatingCoach dataset, which comprises a relatively modest collection of just 164 video clips. These constraints make it infeasible to train, from such a small corpus alone, a language model capable of processing textual content effectively.

In response to these constraints, we turn to the HumanML3D dataset (Petrovich et al., 2022), a valuable resource encompassing a substantial repository of 44,970 textual descriptions corresponding to 14,616 3D human motions. This dataset serves as a pivotal pretraining source for our model, empowering it to attain a holistic comprehension of motion representations.

However, we note that there is a difference in the skeletal data between YourSkatingCoach and HumanML3D: the data in YourSkatingCoach are represented in a local coordinate system, while HumanML3D uses the world coordinate system. We therefore explore an alternative setting in which we remap HumanML3D into the local coordinate system to address this discrepancy.
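One simple way to perform such a remapping is to express every joint relative to the root joint of its frame, as in the sketch below; the exact remapping used in our experiments may additionally handle root orientation, so this is only an illustrative assumption.

    import numpy as np

    def world_to_local(joints, root_index=0):
        """Express joints relative to the per-frame root joint.
        joints: (frames, 22, 3) positions in the world coordinate system."""
        root = joints[:, root_index:root_index + 1, :]   # (frames, 1, 3)
        return joints - root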

In addition to our exploration of the T5 architecture, we extend our experimentation to the Transformer architecture. Our experiments cover the following six settings:

(1) Train from scratch using the Transformer.
(2) Fine-tune from the HumanML3D pretraining in the world coordinate system using the Transformer.
(3) Fine-tune from the HumanML3D pretraining in the local coordinate system using the Transformer.
(4) Train from scratch using T5.
(5) Fine-tune from the HumanML3D pretraining in the world coordinate system using T5.
(6) Fine-tune from the HumanML3D pretraining in the local coordinate system using T5.

4.3. Results

In Table 1, we present the results of our experiment. Two key observations emerge from this table: (1) Pretraining on a large dataset helps mitigate the limitations of a small dataset. (2) Converting the coordinate systems further enhances performance.

Comparing With or Without Pretraining. The table clearly illustrates the substantial performance improvement achieved by using pretraining across both the Transformer and T5 architectures. This demonstrates the efficacy of pretraining on a large dataset in overcoming the limitations of a smaller dataset, thereby enhancing the language understanding within our model. Notably, all scores exhibit a significant elevation when HumanML3D is utilized as the pretraining source.

Comparing With or Without Coordinate Conversion. The table reveals that performance is notably better when the HumanML3D dataset is used in the local coordinate system rather than the world coordinate system. This observation underscores the critical role of data compatibility when transferring language understanding from one dataset to another. When YourSkatingCoach and HumanML3D share the same coordinate system, the best result is obtained with the T5 architecture fine-tuned from the HumanML3D pretraining in the local coordinate system.

Figure 3. Example of failing to generate a complete sentence: a video and its instruction labels.

4.4. Discussion

While pretraining on the large dataset with T5 significantly boosts performance, there are still cases where the model performs poorly. For example, we observe that the model cannot distinguish left from right; we suspect this is due to the scarcity of annotations mentioning left and right. Second, the vanilla Transformer model fails to generate fluent sentences, with common failures including omitted verbs, repeated words, and ungrammatical constructions, as shown in Figure 3. Although these issues are largely mitigated when we switch to the T5 implementation, every evaluation score still leaves room for improvement.

In the future, we aim to first convert the video actions into the world coordinate system by estimating implicit and explicit camera parameters, so that the model can learn left and right. After converting the human pose from the camera coordinate system to the world coordinate system, we will train our MAAIG model in the world coordinate system, enabling it to learn direction-related information. We can also take kinematic information of the skeleton in the world coordinate system, e.g., angular velocity, linear velocity, and rotation, as input to the MAAIG model. In addition, we will employ part-of-speech (POS) tagging on the instructions to help ensure that the model output is grammatically correct. Finally, it is non-trivial to study the effect of different movement speeds on model performance, so we plan to run a variety of experiments to ensure the model’s stability under arbitrary conditions.
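As one illustration of the kinematic inputs mentioned above, per-joint linear velocity can be approximated with finite differences over frames; the function below is a sketch with an assumed frame rate, not part of the current system.

    import numpy as np

    def add_linear_velocity(joints, fps=30.0):
        """joints: (frames, 22, 3) world-coordinate positions.
        Returns (frames, 22, 6): positions concatenated with per-joint velocity."""
        velocity = np.diff(joints, axis=0) * fps                     # (frames-1, 22, 3)
        velocity = np.concatenate([velocity[:1], velocity], axis=0)  # repeat first frame
        return np.concatenate([joints, velocity], axis=-1)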

5. Conclusion

In this paper, we introduce a model based on the T5 architecture for generating sports coaching instructions. The model uses pose recognition technology to generate 3D skeletons, which are transformed into embedding vectors and fed into the instruction generation model to produce coaching instructions. In our experiments, we applied this model to the Axel combination movement in figure skating. By pretraining on the HumanML3D dataset and using BLEU as the measure of accuracy, we achieve a BLEU-1 score of 0.439. This indicates that language models are capable of reading and analyzing input 3D skeleton data and generating corresponding coaching instructions. Our model provides a promising tool for sports coaches and athletes, as it can generate personalized instructions based on the user’s actual movements, and it has the potential to play a significant role in various sports training and sports science applications.

Acknowledgements.
We thank figure skating coach Kristina Stepanova for helping with video shooting and instruction generation. This work is supported by the National Science and Technology Council of Taiwan under grants 112-2425-H-007-002- and by the Academia Sinica and National Tsing Hua University collaborative project.

References

  • Ahmed et al. (1974) N. Ahmed, T. Natarajan, and K.R. Rao. 1974. Discrete Cosine Transform. IEEE Trans. Comput. C-23, 1 (1974), 90–93. https://doi.org/10.1109/T-C.1974.223784
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, 65–72. https://aclanthology.org/W05-0909
  • Grusky (2023) Max Grusky. 2023. Rogue Scores. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 1914–1934. https://doi.org/10.18653/v1/2023.acl-long.107
  • Guo et al. (2022) Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. 2022. TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts. arXiv:2207.01696 [cs.CV]
  • Jiang et al. (2023) Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. 2023. MotionGPT: Human Motion as a Foreign Language. arXiv:2306.14795 [cs.CV]
  • Li et al. (2020) Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. 2020. HybrIK: A Hybrid Analytical-Neural Inverse Kinematics Solution for 3D Human Pose and Shape Estimation. CoRR abs/2011.14672 (2020). arXiv:2011.14672 https://arxiv.org/abs/2011.14672
  • Loper et al. (2015) Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34, 6 (Oct. 2015), 248:1–248:16. https://doi.org/10.1145/2816795.2818013
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https://doi.org/10.3115/1073083.1073135
  • Petrovich et al. (2022) Mathis Petrovich, Michael J. Black, and Gül Varol. 2022. TEMOS: Generating diverse human motions from textual descriptions. arXiv:2204.14109 [cs.CV]
  • Plappert et al. (2016) Matthias Plappert, Christian Mandery, and Tamim Asfour. 2016. The KIT Motion-Language Dataset. Big Data 4, 4 (dec 2016), 236–252. https://doi.org/10.1089/big.2016.0028
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. CoRR abs/1910.10683 (2019). arXiv:1910.10683 http://arxiv.org/abs/1910.10683
  • van den Oord et al. (2017) Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural Discrete Representation Learning. CoRR abs/1711.00937 (2017). arXiv:1711.00937 http://arxiv.org/abs/1711.00937
  • Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL]
  • Zhao et al. (2022) Ziyi Zhao, Sena Kiciroglu, Hugues Vinzant, Yuan Cheng, Isinsu Katircioglu, Mathieu Salzmann, and Pascal Fua. 2022. 3D Pose Based Feedback for Physical Exercises. arXiv:2208.03257 [cs.CV]