
AIMusicGuru: Music Assisted Human Pose Correction

Snehesh Shrestha, Cornelia Fermüller, Tianyu Huang, Pyone Thant Win, Adam Zukerman,
Chethan M. Parameshwara, Yiannis Aloimonos

University of Maryland, College Park, MD, USA
{snehesh,fermulcm,andy0412,pwin17,adamzuk,cmparam9,jyaloimo}@umd.edu
Abstract

Pose estimation techniques rely on visual cues available through observations represented in the form of pixels. However, their performance is bounded by the frame rate of the video and suffers from motion blur, occlusions, and lack of temporal coherence. These issues are magnified when people interact with objects and instruments, for example when playing the violin. Standard post-processing approaches use interpolation and smoothing functions to filter noise and fill gaps, but they cannot model highly non-linear motion. We present a method that leverages the strong causal relationship between the sound produced and the motion that produces it. We use the audio signature to refine and predict accurate human body pose motion models. We propose MAPnet (Music Assisted Pose network) for generating a fine-grained motion model from sparse input pose sequences and continuous audio. To accelerate further research in this domain, we also open-source MAPdat, a new multi-modal dataset of 3D violin-playing motion with music. We compare different standard machine learning models and analyze input modalities, sampling techniques, and audio and motion features. Experiments on MAPdat suggest that multi-modal approaches like ours are a promising direction for tasks previously approached with visual methods only. Our results show, both qualitatively and quantitatively, how audio can be combined with visual observation to improve any pose estimation method.

1 Introduction

The future of education is going to be enhanced by AI [31]. Online classes, self-practice, and journaling one's progress have become part of modern music education. Can AI enhance this experience by providing insights and analytics based on visual and aural observations for students and teachers [5]? Specifically, the art and skill of playing a musical instrument are learned through many iterations of lessons, practice, and feedback over multiple years. The teacher makes acute observations of how the student holds the instrument and moves to produce music, and provides feedback on movement corrections to produce the desired sound. The traditional approach to capturing such movement involves motion capture systems, but it does not scale beyond research labs.

Figure 1: AI Music Guru. We present a new 3D violin-playing music and motion dataset, MAPdat, with rich motion capture ground truth, audio, and video of four master musicians. We also present MAPnet (Music Assisted Pose network), which generates a fine-grained motion model from sparse input pose estimate sequences and continuous audio. A standard smartphone or webcam can capture video at a low frame rate, which MAPnet enhances to generate higher-temporal-resolution, fine-grained pose estimates.

Modern visual pose estimation algorithms in 2D [40, 9] and 3D [24, 29] have paved a promising path for using smartphones and webcams to estimate human pose. Approaches based on this technology are already used to assist with large body movements, such as in yoga, physical training, and golf [7]. However, predicting poses for fine movements, such as in music training, where body movement is dictated by the audio generation, is challenging. Since the pose and audio signals are recorded at different signal frequencies (audio data are generated at a higher frequency than poses), current visual pose estimation models [40, 24] are not equipped to handle the non-linear relationship between the music and the motion that produces it. In this work, we present the Music Assisted Pose network (MAPnet), which combines visual observations with temporally dense audio signals to produce accurate human poses, as shown in Fig. 1.

MAPnet learns the temporal distribution of pose and audio with the help of transformers. The features are then fused by a cross-modal fusion transformer that learns the joint distribution of pose and audio. Finally, the model is trained to predict poses at high temporal resolution in an auto-regressive, sliding-window fashion. Recent multi-modal pose and audio methods [17, 15, 36, 34, 32] aim at producing stylistically representative motion, not at predicting accurate pose. But accurately recovered pose is essential for music-learning applications. To our knowledge, this is the first multi-modal approach that can predict accurate human pose.

To train such a model, we need an appropriate dataset. While there are multi-modal datasets featuring people playing musical instruments [19, 30, 32], they were collected in the wild without ground truth data. Collecting motion capture data of music players requires careful set-up and instrumented environments. In this work, we present a multi-modal audio-video dataset with precise ground truth human poses of people playing musical instruments. The authors of the TELMI [38] database collected and released raw motion capture marker data along with video. However, this data lacks kinematic human body joint modeling, body measurements, and the camera calibration needed to transform world 3D ground truth into image-frame 3D coordinates, which makes it difficult to repurpose for machine learning tasks. We detail how we post-process the raw data and model the pose estimation challenges to create a new dataset, the Music Assisted Pose dataset (MAPdat), which could help steer this community toward a new research direction in multi-modal machine learning for music and motion.

To our knowledge, MAPdat is the first benchmark dataset for precise fine motor 3D pose estimation conditioned on music. In summary, our contributions are as follows:

  • We propose the Music Assisted Pose network (MAPnet), which generates a precise, high-frame-rate pose estimation sequence from a low-frame-rate video and low-bandwidth music data.

  • We introduce the Music Assisted Pose dataset (MAPdat), containing music audio, high-frame-rate motion capture ground truth, and simulated pose estimation errors of advanced violin players.

Dataset Type Interaction Fine Pose Estimates Audio Video Length (s)
MoVi [10] M 61,200
HumanEva [33] M  1,300
Human3.6M [13] M 17,890
TWH 16.2M [17] S ∼ 180,000
IEMOCAP [4] S ∼ 43,200
CreativeIT [26] S 28,800
AIST++ [21] D 18,694
DwM [35] D
GrooveNet [1] D
DanceNet [42] D
EA-MUD [8] D
PHENICX-conduct [30] C ∼ 3,420
URMP [19] I 4,680
QUARTET [28] I ∼ 1,742
TELMI [39] I ∼ ∼ 8,625
MAPdat (Ours) I 120,690
Table 1: Comparison of 3D human kinematic pose and audio datasets. (✓) means the dataset fully satisfies, (✗) does not satisfy, and (∼) partially satisfies the corresponding field. Dataset types include Motion (M), Speech (S), Dance (D), Conducting (C), and Playing Instruments (I). Interaction refers to people interacting with objects, which makes pose estimation difficult due to occlusions, complex inter-body-part geometry, and joint deformation from external forces. Fine refers to the motion velocity and complexity: slow-moving, large changes are easier to observe, while fast, small movements are much more challenging due to motion blur and pixel saturation. Among the instrument-playing datasets, QUARTET and TELMI contain raw motion capture markers, audio, and video, and URMP has audio and video. But they lack accurate 3D human joints, and therefore lack 3D pose ground truth, and they do not provide ground-truth-to-video calibration, making comparison with pose estimates from video difficult. MAPdat has full-body 3D human pose and simulated noisy pose estimates.

2 Related Work

2.1 Pose Estimation with Audio

The original pose estimation algorithms generate 2D skeletons from monocular image frames [40, 9]. However, due to limited expressiveness and ambiguity, 2D representations are often not sufficiently powerful for downstream tasks. Recently, monocular 3D pose estimation approaches, often built on top of 2D skeletons, have gained success [24, 6, 12, 41]. However, both 2D and 3D pose estimators suffer from issues such as jitter, joint inversions, joint swaps, and misses, which have been studied extensively [22, 27]. In 3D, we observe additional challenges with scale, depth disparity, and joint drift. To overcome these issues, filter-based methods such as the simple moving average, Kalman filters, and particle filters have been used [23] with some success, but they struggle to model highly non-linear motion trajectories.

Recently, a new paradigm demonstrating the use of multi-modal data has emerged. It has shown tremendous success in generating poses from audio in dance sequences [21, 18, 32], in speech gesture generation [20, 11], and in learning new multi-modal features from visual and sound data that can be used for a variety of classic vision or audio tasks [2].

2.2 Pose and Audio Datasets

While there are plenty of audio-video datasets for speech [17, 4, 26], dance [21, 35, 1, 42, 30], everyday actions [10, 13, 33], and music [21, 35, 1, 28, 30], there is limited ground truth motion capture data on how people move while playing musical instruments. Table 1 compares audio-video datasets.

There are other datasets with synchronized audio and motion capture data that are also used in gesture generation, such as AIST++ [21] and EA-MUD [8]. These datasets have motion capture data from dances with synchronized music. However, in these scenarios, the audio is not directly produced by the human motion; instead, people move in response to, or in anticipation of, the beats of the music. As shown by prior work [21], such data can be used to learn realistic human motions that are consistent with the motion styles. However, generating such pose trajectories does not have a single unique solution, and approximations are made in the predictions. Therefore, the motion trajectories are not precise and do not conform to the real observations in the videos.

For understanding the relationship between music production and the motion that produces it, we find four potential datasets [14, 19, 28, 39]. FCVID [14] contains YouTube videos from the wild and hence lacks 3D pose ground truth. URMP [19] was recorded in the lab but without a motion capture or multi-camera system, so it lacks motion ground truth. QUARTET [28] consists of raw motion capture marker data; however, it covers only the upper body, offers limited view angles from a single camera, and includes only 30 of the 101 recorded video sequences, making the data very limited. The TELMI [39] dataset consists of 145 sessions with three camera angles and full-body raw motion capture marker data of the body, violin, and bow. However, the motion capture markers require kinematic human body model fitting to generate accurate 3D joint coordinates, and TELMI does not provide the calibration data necessary to map the ground truth to the image frame. Additionally, it has a limited number of subjects, making it difficult to use for downstream machine learning tasks. We therefore conducted extensive evaluation, synchronization, and post-processing to bring the data to a machine-learning-ready state, as detailed in Section 4.

Figure 2: Overview of the proposed MAPnet: MAPnet consists of three transformers for pose, audio, and fusion, respectively. The rebalancing layer is essential for encoding the pose and audio embeddings so that the fusion transformer can learn attention weights at different input-to-output fps ratios (τ). MAPnet takes a 3-second sequence of audio and 3D poses at a varying low fps as input and generates accurate 3D poses at a high fps.

3 Method

3.1 Overview

As shown in Fig. 2, we propose the Music Assisted Pose Network (MAPnet), which takes temporally sparse and noisy pose skeleton data and refines it into temporally dense poses with the help of the audio. The primary building blocks of MAPnet are three transformers: (1) the Pose Transformer and (2) the Audio Transformer learn temporal attention from the input pose and audio signals. Since the pose and audio embeddings have different temporal dimensions, we use novel Rebalancing layers for both. The output pose and audio embeddings are then fused by the (3) Cross-Modal Fusion Transformer to learn the correspondences between pose and audio.

Figure 3: The above naming convention of 13 human body joints is based on TELMI. Each joint is represented in the Euclidean x, y, and z coordinate system. For example, C7 is the 7th cervical vertebra, RSHO and LSHO are the right and left shoulders, RMEL and LMEL are the right and left elbows, RMWR and LMWR are the right and left wrists, RBWT and LBWT are the right and left back waist, RKNE and LKNE are the right and left knees, and RTOE and LTOE are the right and left toes.
Figure 4: (a) Input audio waveform and (b) Mel-power spectrogram visualization. The audio input features: (c) Mel-Frequency Cepstral Coefficients (MFCCs), (d) Chroma Energy Normalized (CENS), and (e) onset strength and peaks of the spectral-flux onset strength envelope.

3.2 Preliminaries: Transformers

Initially introduced in natural language processing for machine translation, the Transformer [37] is a very powerful technique employing multi-head attention (MHA), which allows a model to learn to attend to multiple salient features in sequential data. The Transformer's fundamental building block is an MHA layer followed by a feed-forward (FF) layer, with normalization and a residual connection applied after each layer.

3.3 MAPnet: Music Assisted Pose Network

As motion features, we directly take the sequence of normalized poses as shown in Fig. 3. The input 3D skeletons have dimension $\mathbb{R}^{T_{in}\times 13\times 3}$, where $T_{in}$ is the number of input frames and equals 3 seconds times the input frame rate. We have 13 joints for each person, and each joint is described by its 3D position (x, y, z). We flatten the last two dimensions to form our pose feature $P\in\mathbb{R}^{T_{in}\times 39}$.
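A minimal sketch of this pose feature construction is shown below. The left-toe origin follows the preprocessing described in Section 4.2; the joint index of the left toe and the helper name are our own assumptions for illustration.

```python
import numpy as np

def pose_features(skeletons, left_toe_idx=12):
    # skeletons: (T_in, 13, 3) array of 3D joints ordered as in Fig. 3; the
    # left-toe index is an assumed joint ordering, not the dataset's.
    origin = skeletons[0, left_toe_idx]                  # fix the origin at the left toe
    normalized = skeletons - origin
    return normalized.reshape(skeletons.shape[0], 39)    # P in R^{T_in x 39}
```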

3.4 Audio Features

Raw audio, sampled at 44 kHz, is divided into 3-second sliding windows with a 1-second hop. Each 3-second window is then divided into 150 time steps to generate 150 audio feature frames. We compared raw audio and the audio features used by other audio-based networks [21]. After our experiments, we carefully selected the following features: the 1-dim envelope (Fig. 4e), 20-dim MFCC (Fig. 4c), 12-dim chroma (Fig. 4d), 1-dim one-hot peaks (Fig. 4e), and 1-dim RMS, yielding a 35-dim music feature at 150 time instances, $A\in\mathbb{R}^{150\times 35}$.
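A hedged sketch of how such features could be extracted with Librosa [25] (used in Section 5.1) follows. The hop length is chosen so that a 3-second window at 44 kHz yields roughly 150 frames, the peaks here are taken from the onset-strength envelope as in Fig. 4e, and all parameter values are illustrative assumptions rather than the exact configuration.

```python
import numpy as np
import librosa

SR, N_FRAMES, HOP = 44100, 150, 882   # 44100 samples/s * 3 s / 150 frames ~= 882 samples/hop

def audio_features(y_window, sr=SR):
    # y_window: raw samples of one 3-second sliding window
    mfcc = librosa.feature.mfcc(y=y_window, sr=sr, n_mfcc=20, hop_length=HOP)   # (20, T)
    chroma = librosa.feature.chroma_cens(y=y_window, sr=sr, hop_length=HOP)     # (12, T)
    env = librosa.onset.onset_strength(y=y_window, sr=sr, hop_length=HOP)       # (T,)
    rms = librosa.feature.rms(y=y_window, hop_length=HOP)[0]                    # (T,)
    peaks = np.zeros_like(env)                                                  # one-hot peaks of the
    onsets = librosa.onset.onset_detect(onset_envelope=env, sr=sr, hop_length=HOP)
    peaks[onsets] = 1.0                                                         # onset envelope (Fig. 4e)
    T = min(mfcc.shape[1], chroma.shape[1], env.shape[0], rms.shape[0])
    feats = np.vstack([env[None, :T], mfcc[:, :T], chroma[:, :T],
                       peaks[None, :T], rms[None, :T]])                         # (35, T)
    return feats[:, :N_FRAMES].T                                                # A in R^{150 x 35}
```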

Given the pose feature $P\in\mathbb{R}^{B\times T_{in}\times 39}$ and the audio feature $A\in\mathbb{R}^{B\times 150\times 35}$, where $B$ is the batch size, MAPnet first embeds the two features with a fixed hidden size $H_{1}$ into $P\in\mathbb{R}^{B\times T_{in}\times H_{1}}$ and $A\in\mathbb{R}^{B\times 150\times H_{1}}$. The two embeddings are then fed into the Pose Transformer and the Audio Transformer, respectively, with positional encoding. The positional encoding ensures the temporal order of the concatenated features. The output embeddings are then passed to the Rebalancing layer, which reshapes them into $P\in\mathbb{R}^{B\times H_{2}\times H_{1}}$ and $A\in\mathbb{R}^{B\times H_{2}\times H_{1}}$. These are concatenated to obtain the combined embedding $C\in\mathbb{R}^{B\times 2H_{2}\times H_{1}}$ and sent to the fusion transformer without positional encoding.

We pass the output of the fusion transformer through three fully connected layers to obtain our output in $\mathbb{R}^{B\times T_{out}\times 39}$, which is reshaped into $\mathbb{R}^{B\times T_{out}\times 13\times 3}$ to compute the loss against the ground truth, where $T_{out}$ is the number of output frames and equals 3 seconds times the output frame rate. The whole network with its three transformers is learned in an end-to-end manner.
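The following TensorFlow sketch illustrates this forward pass under stated assumptions: the attention-head count, feed-forward width, the Dense-over-time realization of the Rebalancing layer (Sec. 3.5), and the widths of the fully connected head are our own illustrative choices, not the authors' configuration. Only the tensor shapes, $H_1$=160, $H_2$=150, and the 12/2 late-fusion layer split (Table 4) come from the paper.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

H1, H2, T_OUT, N_JOINTS = 160, 150, 150, 13     # hidden sizes; 50 fps x 3 s output

def positional_encoding(length, depth):
    # Standard sinusoidal positional encoding, returned as a (length, depth) constant.
    pos, i = np.arange(length)[:, None], np.arange(depth)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / depth)
    return tf.constant(np.where(i % 2 == 0, np.sin(angle), np.cos(angle)), tf.float32)

def transformer_block(x, num_heads=8, ff_dim=4 * H1):
    # Multi-head attention + feed-forward, each with a residual and layer norm.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=H1 // num_heads)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(H1)(ff)
    return layers.LayerNormalization()(x + ff)

def rebalance(x):
    # Re-encode (B, T, H1) -> (B, H2, H1): a Dense over the time axis, one
    # plausible realization of the Rebalancing layer described in Sec. 3.5.
    x = layers.Permute((2, 1))(x)          # (B, H1, T)
    x = layers.Dense(H2)(x)                # (B, H1, H2)
    return layers.Permute((2, 1))(x)       # (B, H2, H1)

def build_mapnet(t_in, n_layers=12, n_fusion_layers=2):    # late fusion (Table 4)
    pose_in = layers.Input((t_in, N_JOINTS * 3))            # sparse, noisy poses
    audio_in = layers.Input((150, 35))                      # 35-dim music features

    p = layers.Dense(H1)(pose_in) + positional_encoding(t_in, H1)
    a = layers.Dense(H1)(audio_in) + positional_encoding(150, H1)
    for _ in range(n_layers):                               # Pose / Audio Transformers
        p, a = transformer_block(p), transformer_block(a)

    c = layers.Concatenate(axis=1)([rebalance(p), rebalance(a)])   # (B, 2*H2, H1)
    for _ in range(n_fusion_layers):                        # Cross-Modal Fusion, no PE
        c = transformer_block(c)

    out = layers.Flatten()(c)
    for units in (1024, 1024):                              # assumed FC head widths
        out = layers.Dense(units, activation="relu")(out)
    out = layers.Dense(T_OUT * N_JOINTS * 3)(out)
    return tf.keras.Model([pose_in, audio_in],
                          layers.Reshape((T_OUT, N_JOINTS, 3))(out))
```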

3.5 Rebalancing Layer

In our experiments, simply concatenating the modalities does not work, and the number of layers in the individual transformers versus the fusion transformer makes a huge difference. Therefore, the rebalancing layer we introduce is vital to making the model work, and we include an ablation study for it (see Table 5). The rebalancing layer re-encodes the outputs of the individual pose and audio transformers, which reduces the load on the fusion transformer when learning the attention mapping between the pose and audio transformer embeddings.

3.6 Loss Function

We use the Mean Per Joint Position Error (MPJPE) as our loss function. MPJPE calculates the mean L2 distance between the ground truth and predicted joint positions. For our case, the MPJPE is defined as:

$\mathcal{L}=\frac{1}{13}\sum_{n=1}^{13}\left\|j_{n}-\hat{j}_{n}\right\|_{2}$

where $j_{n}$ and $\hat{j}_{n}$ are the ground truth and estimated 3D coordinates of the n-th joint.
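A minimal TensorFlow version of this loss, averaged over the batch and output frames as well as over the 13 joints (the function name is ours), could look like:

```python
import tensorflow as tf

def mpjpe_loss(y_true, y_pred):
    # y_true, y_pred: (B, T_out, 13, 3) ground-truth and predicted joints in mm.
    # L2 distance per joint, then mean over joints, frames, and batch.
    return tf.reduce_mean(tf.norm(y_true - y_pred, axis=-1))
```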

4 MAPdat Dataset

4.1 Data Description

The proposed MAPdat dataset is developed from the publicly available TELMI Open Database [39]. The TELMI dataset includes motion capture, depth, video, and sound data of four master practitioners playing different violin techniques, ranging from basic techniques such as controlling the bow weight to advanced techniques such as staccato and sautillé articulation.

Figure 5: The four master practitioners recorded in the TELMI dataset. Their recordings are included as part of the MAPdat video data.

The recordings feature the four master practitioners shown in Figure 5, two to four of whom performed the 41 different techniques. Although significant research has been done by the original owners of the TELMI dataset, the data was not in an ideal format for machine learning.

Our current MAPdat is a prototype dataset and uses only a subset of TELMI. Since the TELMI consortium owns the TELMI data, we release scripts that automate the download, post-process the original data, and generate the machine-learning-ready MAPdat data, allowing the community to replicate our work.

4.2 Dataset Preparation

While the TELMI dataset is rich, it was built for a different type of research than our use case (e.g., [3]). The video and motion capture data of TELMI are recorded at 50 frames per second, and the motion capture data is synchronized with the audio and video [38]. The original data includes 32 motion capture markers on the body: ariel, right and left forehead, back-head, shoulder, back-shoulder, inner elbow, outer elbow, inner wrist, outer wrist, pinky, thumb, back waist, inner knee, outer knee, toe, and two points on the vertebrae (TS and T10). We ignore markers that cannot be used for kinematic fitting of the human body with actual physical measurements, such as the head and hand markers, and perform a kinematic fit with the useful markers using heuristic calculations for the clavicle, shoulders, elbows, wrists, hips, knees, and toes. From this, we compute the final 13 joints of a human body model, as described in Fig. 3, consistent with the output of video-based pose estimation approaches.
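As an illustration of this heuristic reduction, the sketch below averages paired markers into joint centers. The marker names and pairings are our assumptions based on the marker list above, and the actual fitting additionally uses physical body measurements.

```python
import numpy as np

# Assumed inner/outer marker pairings whose midpoints approximate joint centers;
# joint names follow Fig. 3, marker names are placeholders for the TELMI labels.
MARKER_PAIRS = {
    "RMEL": ("R_inner_elbow", "R_outer_elbow"),
    "RMWR": ("R_inner_wrist", "R_outer_wrist"),
    "RKNE": ("R_inner_knee", "R_outer_knee"),
    # ... analogous pairs for the left side, shoulders, and waist
}

def joint_centers(markers):
    # markers: dict mapping a marker name to its (T, 3) trajectory in mm
    return {joint: 0.5 * (markers[a] + markers[b])
            for joint, (a, b) in MARKER_PAIRS.items()}
```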

Although the motion capture data was synchronized with the audio and video, we found that the video, audio, and motion capture marker data had different lengths in many trials. Therefore, after temporally synchronizing them, we trim the data to the same length. We validated the motion capture and audio-video synchronization by calculating the onset and offset of the music from the audio features and the bow-violin distances and velocities from the motion capture data, followed by a qualitative manual review. In the TELMI data, data collection began a few seconds before and ended a few seconds after the master practitioners played the violin. Since the portions where no violin is being played are not relevant to our study, we also crop the parts before the onset and after the offset using the calculated bow-violin distances. We then transformed all the motion capture data to use the left toe as the origin.

Each sample is normalized and then augmented with jitter, modeled as Gaussian noise, and joint inversions, modeled as random swaps of joints, to create ten variations, as detailed in the next section. These synthetic noise types are based on the most common challenges that pose estimators face [23]. The resulting 10x sets are randomly divided into train:valid:test sets in an 8:1:1 ratio. Each sample and its corresponding audio are then re-sampled into three-second sliding windows with a one-second hop. This results in 40,230 samples, or 120,690 seconds of total data.
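The windowing itself can be sketched as follows (50 fps, 3-second windows, 1-second hop, per the text; the helper is illustrative):

```python
import numpy as np

FPS, WIN, HOP = 50, 3 * 50, 1 * 50   # 150-frame windows with a 50-frame hop

def sliding_windows(sequence):
    # sequence: (T, ...) pose (or feature) sequence for one trial variant
    starts = range(0, sequence.shape[0] - WIN + 1, HOP)
    return np.stack([sequence[s:s + WIN] for s in starts])   # (n_windows, 150, ...)
```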

4.3 Pose Error Characterization

From our pilot studies, we find that the two most prominent forms of noise in 3D pose estimation are jitter and joint inversions, which is reinforced by [22, 27]. Depending on the pose estimation network, jitter can vary in magnitude, and joint inversions can vary in the frequency of swaps and in which joints get swapped. To emulate the noisy pose estimates of a monocular 3D pose estimator, we add jitter and joint inversions to our ground truth motion capture data. We determined the noise parameters from the average error of the 3D pose estimates of MediaPipe [24] and VideoPose3D [29] on the TELMI [38] video data. The magnitude of the Gaussian noise is based on the average distance of the pose estimates from the ground truth motion capture data. While joint swaps are highly dependent on the camera viewpoint, we distribute them randomly based on their average occurrence over the total length of the video. In MAPdat, we model jitter as Gaussian noise with a standard deviation of 300 mm. We model the distribution of joint-swapping events over time as a Poisson random variable with an average of five swaps per minute. Once a joint swap occurs, the swap length is uniformly distributed between 0.5 s and 3.0 s. Using these parameters, we randomly generate ten variants of each video.
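A sketch of this noise model is given below. The parameters (300 mm jitter, five swaps per minute, 0.5-3.0 s swap lengths at 50 fps) come from the text, while the choice of which joints to swap is random and illustrative.

```python
import numpy as np

FPS, SIGMA_MM, SWAPS_PER_MIN = 50, 300.0, 5

def add_pose_noise(poses, rng=np.random.default_rng()):
    # poses: (T, 13, 3) ground-truth joint positions in mm
    noisy = poses + rng.normal(0.0, SIGMA_MM, size=poses.shape)      # Gaussian jitter
    n_frames = poses.shape[0]
    n_swaps = rng.poisson(SWAPS_PER_MIN * n_frames / (60 * FPS))     # Poisson swap events
    for _ in range(n_swaps):
        start = rng.integers(0, n_frames)
        length = int(rng.uniform(0.5, 3.0) * FPS)                    # 0.5-3.0 s swap length
        j1, j2 = rng.choice(13, size=2, replace=False)               # two joints to swap
        end = min(start + length, n_frames)
        noisy[start:end, [j1, j2]] = noisy[start:end, [j2, j1]]      # swap their trajectories
    return noisy
```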

5 Experiments

5.1 Implementation Details

We implemented all our methods in TensorFlow and used four NVIDIA GTX 1080 Ti GPUs for training and testing. The number of output frames $T_{out}$ across our experiments is 150, so our model predicts 3D skeleton poses at 50 fps. We define τ as the ratio of the input frame rate to the output frame rate; since the output frame rate is 50 fps, input frame rates of 50, 25, and 17 fps correspond to τ=1.0, τ=0.5, and τ=0.33, i.e., the same, one half, and one third of the output frame rate. The hidden sizes $H_{1}$ and $H_{2}$ used to embed the features were set to 160 and 150, respectively. For all experiments, we trained with a batch size of 128 using the Adam optimizer [16].
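Using the sketches from Section 3, the stated setup could be wired up as follows. The Adam optimizer and batch size of 128 are from the text; the learning rate, epoch count, and the random placeholder arrays are assumptions purely to make the snippet self-contained.

```python
import numpy as np
import tensorflow as tf

# Random placeholder data; in practice these come from the MAPdat windows (Sec. 4).
t_in = 3 * 17                                               # tau = 0.33 (17 fps input)
train_pose = np.random.randn(128, t_in, 39).astype("float32")
train_audio = np.random.randn(128, 150, 35).astype("float32")
train_gt = np.random.randn(128, 150, 13, 3).astype("float32")

model = build_mapnet(t_in)                                  # from the sketch in Sec. 3
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # assumed lr
              loss=mpjpe_loss)                              # MPJPE loss from Sec. 3.6
model.fit([train_pose, train_audio], train_gt, batch_size=128, epochs=100)
```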

To extract the audio input features, we use the audio processing toolbox Librosa [25]. Most dance-based pose generation uses the beats feature as a cyclical marker for motions such as foot-to-ground contact. In violin music, no distinct beats can be detected. Instead, we use changes in bow direction as the one-hot peaks, together with the RMS feature, to better model the bow motion. A sample waveform and the audio features are visualized in Fig. 4.

Figure 6: Error plot of the normalized L2 distance between ground truth and prediction outputs at each frame. Blue: MAPnet predictions. Green: Pose-only transformer predictions. Magenta: Simple Moving Average predictions. Red: LSTM pose-only predictions. Olive: LSTM audio-and-pose predictions.
Figure 7: To demonstrate the difference in performance qualitatively, we plot the trajectory of the most salient joint, i.e., the right wrist, which represents the bowing motion. In this plot, we compare the performance of (a) the Simple Moving Average (SMA) (left) and (c) the Pose-only transformer (right) with (b) the MAPnet model (center) in predicting the right wrist movement at a τ=0.33 input rate (17 fps). For clarity, we also show zoomed-in results of (d) SMA, (e) MAPnet, and (f) the Pose-only transformer.

5.2 Analysis of Experimental Results

We compare MAPnet, with and without rebalancing, against a Simple Moving Average (SMA), a Long Short-Term Memory (LSTM) network using pose only (LSTM Po) and one using pose and audio (LSTM PA), and a Transformer based on pose only (PoT). Table 2 shows the quantitative results of our experiments, using the Mean Per Joint Position Error (MPJPE) as the metric. For the Pose-only Transformer (PoT), we observe that at 50 fps the model struggles to filter out large jitter in the input data. PoT loses the motion trajectory and deteriorates as the input frame rate declines, at both the τ=0.5 (25 fps) and τ=0.33 (17 fps) input rates. In contrast, because MAPnet models the music features, which are aligned with the actual motion trajectory, it filters the jitter and learns to model the joint swaps. Additionally, as we reduce the input frame rate from 25 to 17 fps, we do not see further deterioration but rather a marginal improvement, which can be explained by the transformer giving more attention to the audio signal to capture very fine motion trajectories, as can be observed in Figure 7.

| Method | τ=1.00 | τ=0.50 | τ=0.33 |
| --- | --- | --- | --- |
| SMA | 292.31 | 303.63 | 311.20 |
| LSTM Po | 345.86 | 341.80 | 338.17 |
| LSTM PA | 318.83 | 316.17 | 314.32 |
| PoT | 35.49 | 43.16 | 41.65 |
| MAPnet w/o Rebalance (Ours) | 26.68 | 28.16 | 31.35 |
| MAPnet (Ours) | 26.69 | 26.62 | 26.60 |
Table 2: Quantitative results of our experiments for LSTM Pose-only (LSTM Po), LSTM Pose and Audio (LSTM PA), the Pose-only Transformer (PoT), and MAPnet. The columns correspond to different input frame rates, given as the ratio τ of the input to the output frame rate. The numbers denote the MPJPE (↓) in mm.
| Method | Fine | Gross | Inversions |
| --- | --- | --- | --- |
| LSTM Po | 385.17 | 310.62 | 273.72 |
| LSTM PA | 351.83 | 263.41 | 259.50 |
| PoT | 46.27 | 44.43 | 29.01 |
| MAPnet (Ours) | 27.76 | 22.29 | 19.05 |
Table 3: Samples are categorized into three groups. Fine and Gross motion are distinguished by the linearity of the data: large and slow motions tend to be linear or piece-wise linear and are therefore the easiest for most models, including the simple moving average, to predict, whereas fine motions are highly non-linear, so we categorize and annotate these samples as the hard set. Inversions occur when random joints get swapped with other joints; we categorize them as medium hardness, since most non-linear models (both Pose-only and MAPnet) perform well on this set, while simple filtering methods fail with very high MPJPE (↓).

The Simple Moving Average (SMA) is a traditional noise filtering method in which a specified number of consecutive data points are averaged to flatten out the noise. We calculated the SMA on the input frames, and since it is a piece-wise linear noise-filtering scheme, we used the midpoints of the SMA results to generate the higher-frequency output. This method is not suitable for predicting motion at a high frame rate for two reasons. First, SMA is meant for linear systems, but the fine bow movements of the right wrist are highly non-linear, as shown by the ground truth in Fig. 7; therefore, it cannot distinguish fine bow movements from noise. Second, since it averages the input data, the motion between two input frames is always predicted as linear, which is highly inaccurate for non-linear bow movements. As shown in Table 2, as the frame sparseness increases, the performance of the simple moving average deteriorates, while the MAPnet prediction, with the help of audio features, stays close to the ground truth trajectory. The simple moving average results in this comparison are consistent with [23].
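For reference, the SMA baseline amounts to the following; the window size and the 2x midpoint upsampling shown here are illustrative choices that mirror the piece-wise-linear upsampling described above.

```python
import numpy as np

def sma_upsample(poses, window=5):
    # poses: (T_in, 39) noisy low-fps pose vectors; returns (2*T_in - 1, 39)
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, poses)   # moving average
    midpoints = 0.5 * (smoothed[:-1] + smoothed[1:])                   # linear in-between frames
    out = np.empty((2 * smoothed.shape[0] - 1, poses.shape[1]))
    out[0::2], out[1::2] = smoothed, midpoints                         # interleave frames
    return out
```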

Inspired by the LSTM network architecture presented in [32], we also replace our transformer models with LSTM models. We observe that the LSTM struggles to model the non-linearity between pose and music: in Table 2, its MPJPE is far higher than that of our MAPnet model, so it fails to predict accurate poses. The LSTM also struggles across the different motion difficulty levels (see Table 3). We attribute this behavior to the LSTM inherently depending on sequential structure and being best suited to a single modality (either audio or pose), whereas our transformer-based MAPnet is equipped to correlate multi-modal, non-sequential data structures.

As shown in Figure 7, at the low input visual frame rate of τ=0.33, during fine bow movements, the SMA (Fig. 7a and 7d) and the Pose-only model (Fig. 7c and 7f) poorly predict the motion of the wrist joint. However, the MAPnet model, due to the coupling between the motion and the music, generates more accurate predictions, as can be seen in Fig. 7b and 7e. There is a significant error in the motion trajectories of the SMA and Pose-only predictions, whereas the MAPnet prediction overlaps substantially with the ground truth.

5.2.1 Ablation Studies

Rebalancing Layer: We compared the effect of adding and removing the rebalancing layer and observed a degradation of 6.31% in the per-frame MPJPE metric and 17.64% in the across-frame MPJAE metric (see Table 5). This indicates that blindly concatenating the outputs of the two transformers is not optimal.

Early, balanced, or late fusion: We also conducted an early, balanced, and late fusion ablation (see Table 4), which shows as much as a 17.39% difference in MPJPE. Overall, our experiments suggest that pose-only and traditional filtering methods do not work well, and our model outperforms them significantly.

Results and Metrics: In Tables 4 and 5, we include the additional metric MPJAE (acceleration) to demonstrate temporal smoothness and correctness. The MPJAE results are consistent with the MPJPE results.
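The text defines MPJAE as an acceleration-based metric; one way to compute such a metric, as a sketch assuming second-order finite differences of the joint trajectories, is:

```python
import numpy as np

def mpjae(gt, pred, fps=50):
    # gt, pred: (T, 13, 3) joint trajectories in mm; accelerations are estimated
    # with second-order finite differences across frames at the given fps.
    dt = 1.0 / fps
    acc_gt = np.diff(gt, n=2, axis=0) / dt ** 2
    acc_pred = np.diff(pred, n=2, axis=0) / dt ** 2
    return np.linalg.norm(acc_gt - acc_pred, axis=-1).mean()
```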

| Ablation | MPJPE (↓) | MPJAE (↓) |
| --- | --- | --- |
| Early-Fusion | 32.20 | 44.37 |
| Balanced-Fusion | 28.22 | 41.53 |
| Late-Fusion | 26.60 | 37.51 |
Table 4: Ablation studies on when to fuse the pose and audio modalities. We conduct experiments in three settings: (1) Early-Fusion: 2 layers of Pose/Audio Transformer and 12 layers of Fusion Transformer; (2) Balanced-Fusion: 7 layers of Pose/Audio Transformer and 7 layers of Fusion Transformer; (3) Late-Fusion: 12 layers of Pose/Audio Transformer and 2 layers of Fusion Transformer. The results show that the Late-Fusion strategy performs best in both MPJPE and MPJAE.
| Ablation | MPJPE (↓) | MPJAE (↓) |
| --- | --- | --- |
| Without Re-balancing | 28.39 | 45.55 |
| With Re-balancing | 26.60 | 37.51 |
Table 5: We also conducted experiments to evaluate the importance of the re-balancing layer in MAPnet. Removing the re-balancing layer makes a difference of 6.31% in MPJPE and 17.65% in MPJAE.

6 Limitations

The current approach and methods have limitations that suggest future work. The ground truth data has a limited number of joints; in particular, hand keypoints were not recorded. The joints are not kinematically exact, and there is no mapping between the ground truth mocap joints and the video, because the calibration data needed to compare against pose estimator results is unavailable. The data has a limited number of subjects, with wide variation in the number of samples per subject. The subjects are advanced players, so the data does not include many variations in playing quality, instrument characteristics, and environmental effects. The data covers only one instrument (the violin) and may not generalize to other instruments. The current ground truth is limited to 50 fps, limiting the upper bound on how fast a motion can be captured. The ground truth motion complexity has been classified into gross motion, fine motion, and inversions; more analysis is needed into the nature of these issues.

The current approach relies on motions that are accompanied by sound; however, certain motions, such as when the bow hand moves away from the violin, may produce no sound to assist. The input is raw joints and five types of sound features; while this shows promise as one possible proof of concept, more exploration of other motion and sound features is needed. The current method relies on the network to figure out the mapping between each joint coordinate vector and the sound feature vectors; more work is needed to provide a better coupling of motion and sound. The loss function does not consider the motion trajectory and is therefore limited in accurately modeling fine, fast-moving splines. Comparisons across datasets and against state-of-the-art methods are needed to test for generalization.

7 Conclusion

This paper presents a method that leverages audio features to refine and predict accurate human body pose motion models. We propose MAPnet (Music Assisted Pose network) for generating a fine-grained motion model from sparse input pose sequences and continuous audio. To accelerate further research in this domain, we also open-source MAPdat, a new multi-modal dataset of 3D violin playing. We hope this work will be useful to computer vision researchers, who can leverage rich data from other modalities and pair it with vision in creative ways. Our results suggest that audio can be a rich resource to correct and fill in visual information between adjacent frames. Our work shows promising results and creates opportunities for future research on cases like music, where there is a strong causal coupling between movement and audio.

References

  • [1] Omid Alemi, Jules Françoise, and Philippe Pasquier. Groovenet: Real-time music-driven dance movement generation using artificial neural networks. networks, 8(17):26, 2017.
  • [2] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pages 609–617, 2017.
  • [3] Angel David Blanco, Simone Tassani, and Rafael Ramirez. Real-Time sound and motion feedback for violin bow technique learning: A controlled, randomized trial. Front. Psychol., 12:648479, Apr. 2021.
  • [4] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335–359, 2008.
  • [5] Jason Chi Wai Chen. Ai in music education: The impact of using artificial intelligence (ai) application to practise scales and arpeggios in a virtual learning environment. In Learning Environment and Design, pages 307–322. Springer, 2020.
  • [6] Tianlang Chen, Chen Fang, Xiaohui Shen, Yiheng Zhu, Zhili Chen, and Jiebo Luo. Anatomy-aware 3d human pose estimation in videos. arXiv preprint arXiv:2002.10322, 2020.
  • [7] Gisela Miranda Difini, Marcio Garcia Martins, and Jorge Luis Victória Barbosa. Human pose estimation for training assistance: a systematic literature review. In Proceedings of the Brazilian Symposium on Multimedia and the Web, pages 189–196, 2021.
  • [8] Rukun Fan, Songhua Xu, and Weidong Geng. Example-based automatic music-driven conventional dance motion synthesis. IEEE transactions on visualization and computer graphics, 18(3):501–515, 2011.
  • [9] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. Rmpe: Regional multi-person pose estimation. In ICCV, 2017.
  • [10] Saeed Ghorbani, Kimia Mahdaviani, Anne Thaler, Konrad Kording, Douglas James Cook, Gunnar Blohm, and Nikolaus F Troje. Movi: A large multi-purpose human motion and video dataset. Plos one, 16(6):e0253157, 2021.
  • [11] Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. Learning individual styles of conversational gesture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3497–3506, 2019.
  • [12] Kehong Gong, Jianfeng Zhang, and Jiashi Feng. Poseaug: A differentiable pose augmentation framework for 3d human pose estimation. In CVPR, 2021.
  • [13] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
  • [14] Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, and Shih-Fu Chang. Fcvid: Fudan-columbia video dataset.
  • [15] Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG), 36(4):1–12, 2017.
  • [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [17] Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S Srinivasa, and Yaser Sheikh. Talking with hands 16.2 m: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 763–772, 2019.
  • [18] Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, and Jan Kautz. Dancing to music. arXiv preprint arXiv:1911.02001, 2019.
  • [19] Bochen Li, Xinzhao Liu, Karthik Dinesh, Zhiyao Duan, and Gaurav Sharma. Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2):522–535, 2018.
  • [20] Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He, and Linchao Bao. Audio2gestures: Generating diverse gestures from speech audio with conditional variational autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11293–11302, 2021.
  • [21] Ruilong Li, Shan Yang, David A Ross, and Angjoo Kanazawa. Ai choreographer: Music conditioned 3d dance generation with aist++. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13401–13412, 2021.
  • [22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [23] Hongyi Liu and Lihui Wang. Gesture recognition for human-robot collaboration: A review. Int. J. Ind. Ergon., 68:355–367, Nov. 2018.
  • [24] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019.
  • [25] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, volume 8, pages 18–25. Citeseer, 2015.
  • [26] Angeliki Metallinou, Zhaojun Yang, Chi-chun Lee, Carlos Busso, Sharon Carnicke, and Shrikanth Narayanan. The usc creativeit database of multimodal dyadic interactions: From speech and full body motion capture to continuous emotional annotations. Language resources and evaluation, 50(3):497–521, 2016.
  • [27] Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Posefix: Model-agnostic general human pose refinement network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7773–7781, 2019.
  • [28] Panagiotis Papiotis. A computational approach to studying interdependence in string quartet performance. PhD thesis, Universitat Pompeu Fabra, 2016.
  • [29] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [30] Álvaro Sarasúa Berodia et al. Musical interaction based on the conductor metaphor. PhD thesis, Universitat Pompeu Fabra, 2017.
  • [31] Neil Selwyn. Should robots replace teachers?: AI and the future of education. John Wiley & Sons, 2019.
  • [32] Eli Shlizerman, Lucio Dery, Hayden Schoen, and Ira Kemelmacher-Shlizerman. Audio to body dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7574–7583, 2018.
  • [33] L. Sigal, A. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1):4–27, Mar. 2010.
  • [34] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.
  • [35] Taoran Tang, Jia Jia, and Hanyang Mao. Dance with melody: An lstm-autoencoder approach to music-oriented dance synthesis. In Proceedings of the 26th ACM international conference on Multimedia, pages 1598–1606, 2018.
  • [36] Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 36(4):1–11, 2017.
  • [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [38] Gualtiero Volpe, Ksenia Kolykhalova, Erica Volta, Simone Ghisio, George Waddell, Paolo Alborno, Stefano Piana, Corrado Canepa, and Rafael Ramirez-Melendez. A multimodal corpus for technology-enhanced learning of violin playing. In Proceedings of the 12th Biannual Conference on Italian SIGCHI Chapter, number Article 25 in CHItaly ’17, pages 1–5, New York, NY, USA, Sept. 2017. Association for Computing Machinery.
  • [39] Gualtiero Volpe, Ksenia Kolykhalova, Erica Volta, Simone Ghisio, George Waddell, Paolo Alborno, Stefano Piana, Corrado Canepa, and Rafael Ramirez-Melendez. A multimodal corpus for technology-enhanced learning of violin playing. In Proceedings of the 12th Biannual Conference on Italian SIGCHI Chapter, pages 1–5, 2017.
  • [40] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [41] Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris N. Metaxas. Semantic graph convolutional networks for 3d human pose regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3425–3435, 2019.
  • [42] Wenlin Zhuang, Congyi Wang, Siyu Xia, Jinxiang Chai, and Yangang Wang. Music2dance: Dancenet for music-driven dance generation. arXiv preprint arXiv:2002.03761, 2020.