Piano Skills Assessment
Abstract
Can a computer determine a piano player’s skill level? Is it preferable to base this assessment on visual analysis of the player’s performance, or should we trust our ears over our eyes? Since current CNNs have difficulty processing long videos, how can shorter clips be sampled to best reflect a player’s skill level? In this work, we collect and release a first-of-its-kind dataset for multimodal skill assessment focused on assessing a piano player’s skill level, answer these questions, initiate work in automated evaluation of piano playing skills, and provide baselines for future work.
Index Terms— Automated Piano Skills Assessment, Action Quality Assessment
1 Introduction
Automated evaluation of skills/action quality involves quantifying how skillful a person is at the task at hand or how well an action was performed. Automated skills assessment (SA)/action quality assessment (AQA) is needed in a variety of areas, including sports judging and education, as has recently been underscored by the ongoing COVID-19 pandemic, which has severely reduced in-person teaching and guidance. Moreover, automated evaluation of skills can make learning more accessible to socioeconomically disadvantaged subgroups of our society. Apart from teaching and guidance, it can be employed in video retrieval and can provide second opinions in the case of controversial judging decisions in order to remove biases.
In this paper, we address automated determination of piano-playing skill level on a 10-point scale using a new multimodal PIano Skills Assessment (PISA) dataset (Fig. 1) that accounts for both visual and auditory evidence.
2 Related Work
Piano + Computer Vision: There has been work exploring the use of computer vision in determining the accuracy of pianists [1, 2, 3, 4, 5]; automatic transcription [6, 7, 8, 9, 10]; generating audio from video/silent performance [11, 12, 13]; and generating a pianist’s body pose from audio/MIDI [14]. However, no prior work addresses predicting a pianist’s skill level from a recording of their performance; ours is the first to do so in an automated fashion.
Skills-Assessment/Action Quality Assessment: While there has been considerable recent progress in SA [15, 16, 17, 18, 19, 20, 21, 22] and AQA [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], none of these works address the assessment of a pianist’s skills. Moreover, ours is the first work to take a multimodal learning approach in skills assessment or action quality assessment.
3 Multimodal PISA Dataset
Fig. 1: Example hand-region clips from the PISA dataset, each annotated with a song level (SL) and a player level (PL). The four examples shown have SL values of 2, 2, 6, and 9 and corresponding PL values of 2, 3, 6, and 10.
The new PISA dataset includes two attributes in need of definition: player skill and music difficulty. A 10-point grading system was selected for player skill/level (PL) based on a technical and repertoire syllabus [50] developed by the local chapter of the Music Teachers National Association (MTNA). For certification services with other metrics, the grading system used in this paper can be roughly translated as follows: levels 1-9 represent pre-collegiate certification/skill, while level 10 represents collegiate or post-collegiate mastery. Player skill is primarily evaluated through the technique required for the most difficult song the player is able to play; in this way, song level generally indicates player skill.
The methodology for determining the song level (SL) of difficulty of a piano piece utilized multiple syllabi, again on a 10-point grading scale. Songs rated in the Con Brio syllabus [51] as levels 1-8 were given a 1-8 rating, songs rated as pre-collegiate were given a 9 rating, and songs rated as collegiate or beyond were given a 10 rating. Songs not present in the Con Brio syllabus were referenced in the Henle syllabus [52], which rates pieces from 1-9. Pieces at level 9 in the Henle syllabus were cross-checked with the Royal Conservatory of Music syllabus [53] to determine whether they would be assigned a 9 or a 10 in our dataset (because the distinction between 9 and 10 is not made in the Henle syllabus).
Existing skills assessment and action quality assessment datasets have been prepared either through crowd-sourcing or from annotations already present in the video footage. Compared to those methods, we found collecting piano skills assessment data to be challenging. We had to rely on a trained pianist to collect videos from YouTube, analyze them, and determine each player’s skill level (as described above). As such, scaling up the dataset is difficult. We collected a total of 61 piano performances. Minimum, average, and maximum performance lengths were 570, 2690, and 10038 frames, respectively. Histograms of the number of samples for each player level and song level are shown in Fig. 2. To mitigate the small dataset size, we create multiple unique, non-overlapping samples, each 160 frames long. In this way, we have a total of 992 unique samples. We considered two types of sampling schemes: 1) contiguous; 2) uniformly distributed, which are illustrated in Fig. 3. Out of the 61 piano performances, we use 31 (516 samples) as our training set and 30 (476 samples) as our test set. Note that no overlap exists between the train and test sets. For each piano performance, we provide the following annotations: 1) player skill level; 2) song difficulty level; 3) name of the song; 4) a bounding box around the pianist’s hands (refer to Fig. 1).
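To make the two sampling schemes concrete, below is a minimal sketch of how the clip start indices for one sample could be generated under each scheme. It assumes each 160-frame sample is formed from ten 16-frame clips (consistent with Sec. 4); the function name and the exact offset arithmetic are our assumptions, not the released codebase.

```python
import numpy as np

def sample_clip_starts(num_frames, scheme, clip_len=16, clips_per_sample=10, sample_idx=0):
    """Return the start indices of the clips forming one 160-frame sample.

    scheme is "contiguous" (ten back-to-back 16-frame clips) or "uniform"
    (ten clips spread evenly over the whole performance). Returns None when
    the requested sample does not fit inside the performance.
    """
    sample_len = clip_len * clips_per_sample  # 160 frames per sample
    if scheme == "contiguous":
        # The k-th sample covers frames [k * 160, (k + 1) * 160).
        starts = sample_idx * sample_len + np.arange(clips_per_sample) * clip_len
    elif scheme == "uniform":
        # Spread the ten clips evenly across the performance; different samples
        # shift the clips by multiples of clip_len so they do not overlap.
        stride = num_frames // clips_per_sample
        starts = np.arange(clips_per_sample) * stride + sample_idx * clip_len
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    if (starts + clip_len > num_frames).any():
        return None
    return starts

# e.g., sample_clip_starts(2690, "uniform", sample_idx=3)
```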



4 Approach

In this section, we first present unimodal SA approaches, which are followed by a multimodal approach. Our full framework is shown in Fig. 4. Finally, we provide our learning objective.
4.1 Visual branch
While judging music visually rather than aurally may initially seem impossible, there are skills that can be judged visually. These skills generally fall into two categories: technical skills, which build from grade to grade (in the above syllabi), and virtuosic or professional skills, which are skills unique to a pianist and would be present only in grade 9 or 10 pianists. The first category includes the difficulty and speed of the scales and arpeggios included in the music. For example, the level 10 syllabus of the Las Vegas Music Teachers Association Technical Portion [54] includes all major scales (four octaves) and all minor scales (four octaves; natural, harmonic, and melodic included). In a visual examination of a grade 10 song, such as Chopin’s Etude Op. 25 No. 11, the four-octave A harmonic minor scale at high speed would immediately indicate a grade 10 piece. The same logic can be applied to arpeggios and cadences. This covers the first category; the second category comprises skills unique to a pianist. The lack of such skills would not give any indication as to the level of a pianist, but their presence would immediately indicate a very high level of technical achievement. For example, professional pianists who must play at high speeds may play intervals of an eighth (an octave) with their first and third fingers, which is incredibly difficult for the average pianist.
To take the above factors into account, we believe processing short clips with 3DCNNs is more suitable than processing single frames using 2DCNNs. Since we sample several such clips and process them individually using 3DCNNs, we need to aggregate the individual clip-level features. Prior work [36] in AQA has shown averaging to work well, and it also enables end-to-end training. RNN-based aggregation would not enable end-to-end training when used with 3DCNNs due to the large number of parameters and the consequent overfitting. Therefore, we choose averaging as our aggregation scheme to obtain whole sample-level video features from the clip-level features.
When considering just the visual branch, we deactivate everything other than the blue-colored portion of Fig. 4. We further pass the whole sample-level video features through a linear layer to reduce their dimensionality to 128.
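The following PyTorch-style sketch illustrates the visual branch as described above: per-clip 3DCNN features are averaged and then reduced to 128 dimensions. The backbone interface and its 256-dimensional output are assumptions; this is not the released implementation.

```python
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    """Clip-level 3DCNN features -> average aggregation -> 128-d sample feature."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 256):
        super().__init__()
        self.backbone = backbone            # 3DCNN applied to each 16-frame clip
        self.fc = nn.Linear(feat_dim, 128)  # reduce sample-level feature to 128-d

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_clips, 3, 16, 112, 112)
        b, n = clips.shape[:2]
        clip_feats = self.backbone(clips.flatten(0, 1))       # (b * n, feat_dim)
        sample_feat = clip_feats.view(b, n, -1).mean(dim=1)   # averaging aggregation
        return self.fc(sample_feat)                           # (b, 128)
```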
4.2 Aural branch
A significant amount of information can be detected from audio as well. The velocity of the music (notes per second) can be detected aurally, which is a simple yet valuable tool for judging the technical skill required for a piece. The presence of multiple simultaneous notes, with cadences that correspond to those included in the grade-level syllabi above, would be recognizable through both auditory and visual analysis. Unfortunately, the significant variety in style, clarity, and dynamics from song to song, while an important basis for judging piano skill in competitions and performances by judges familiar with those songs, makes an unfamiliar judge or computer system less effective at determining skill.
We convert the raw audio signal corresponding to sampled clips (the same clips as those mentioned in the visual branch) to its melspectrogram, which we then process using a 2DCNN. Similar to the visual branch, we aggregate the clip-level audio features using an averaging function. We learn the parameters of the 2DCNN end-to-end.
When considering just the aural branch, we deactivate everything other than the orange-colored part of Fig. 4. We further pass the whole sample-level audio features through a linear layer to reduce their dimensionality to 128.
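Analogously to the visual branch, a minimal sketch of the aural branch might look as follows; the 512-dimensional feature size matches ResNet-18’s pooled output, but the interface is otherwise an assumption.

```python
import torch
import torch.nn as nn

class AuralBranch(nn.Module):
    """Clip-level melspectrogram features -> average aggregation -> 128-d feature."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.backbone = backbone            # 2DCNN (e.g., ResNet-18) over melspectrograms
        self.fc = nn.Linear(feat_dim, 128)

    def forward(self, specs: torch.Tensor) -> torch.Tensor:
        # specs: (batch, num_clips, 1, 224, 224) melspectrogram "images"
        b, n = specs.shape[:2]
        clip_feats = self.backbone(specs.flatten(0, 1)).view(b, n, -1)
        return self.fc(clip_feats.mean(dim=1))  # (b, 128)
```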
4.3 Multimodal branch
In the multimodal branch, we concatenate the sample-level video and audio features to produce multimodal features. We further pass the resulting sample-level feature through a linear layer to reduce its dimensionality to 128. We do not back-propagate from the multimodal branch to the individual modality backbones in order to avoid cross-modality contamination.
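A minimal sketch of this fusion step is shown below; stopping the gradient with detach() is one way to realize "no back-propagation into the modality backbones", and the layer sizes are our assumptions.

```python
import torch
import torch.nn as nn

class MultimodalHead(nn.Module):
    """Concatenate detached video and audio sample features -> 128-d fused feature."""

    def __init__(self, video_dim: int = 128, audio_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(video_dim + audio_dim, 128)

    def forward(self, video_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # detach() blocks gradients from flowing back into either backbone,
        # so the fusion head cannot contaminate the unimodal branches.
        fused = torch.cat([video_feat.detach(), audio_feat.detach()], dim=1)
        return self.fc(fused)
```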
4.4 Objective function
Unlike a typical classification problem, in our player-level prediction problem the distance between categories has meaning. For example, for a ground-truth player level of 5, although predicted levels of 2 and 6 are both incorrect, a predicted level of 6 is “less wrong” than a predicted level of 2. To address this characteristic, we incorporate a distance loss ($\mathcal{L}_{dist}$) in the objective in addition to the cross-entropy loss ($\mathcal{L}_{CE}$). Specifically, we use a sum of L1 and L2 distances as our $\mathcal{L}_{dist}$; L1 was found to be beneficial in [36]. The overall objective function is shown in Eq. 1, where the superscripts $v$ and $a$ denote the visual and audio cues, respectively:

$$\mathcal{L} = \sum_{m \in \{v, a\}} \left( \alpha_{m}\, \mathcal{L}^{m}_{CE} + \beta_{m}\, \mathcal{L}^{m}_{dist} \right) \qquad (1)$$
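Below is a sketch of how one branch’s term in Eq. 1 could be computed. The paper does not specify exactly how the L1/L2 distance is formed from the network output; computing it between the expected level under the softmax and the ground-truth level is our assumption.

```python
import torch
import torch.nn.functional as F

def branch_loss(logits: torch.Tensor, target: torch.Tensor,
                alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    """Cross-entropy plus an (L1 + L2) distance term for one modality branch.

    logits: (batch, 10) scores over player levels; target: (batch,) levels in 0..9.
    """
    ce = F.cross_entropy(logits, target)
    # Soft prediction: expected level under the softmax distribution (an assumption).
    levels = torch.arange(logits.shape[1], dtype=logits.dtype, device=logits.device)
    pred_level = (logits.softmax(dim=1) * levels).sum(dim=1)
    diff = pred_level - target.float()
    dist = diff.abs().mean() + diff.pow(2).mean()  # L1 + L2
    return alpha * ce + beta * dist

# Total objective (Eq. 1): sum the visual and aural branch terms, e.g.
# loss = branch_loss(video_logits, y) + branch_loss(audio_logits, y)
```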
5 Experiments
Preprocessing: Visual information pertaining to the whole scene might not be useful in determining the player level. Instead, we crop and use visual information pertaining to the forearms, hands, and the piano as shown in Fig. 1. Using librosa [55], we convert the audio signal to its melspectrogram (settings adopted from [56]), and express that information in decibels.
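A minimal sketch of this audio preprocessing step with librosa follows; the sampling rate and number of mel bands shown here are placeholders, since the actual settings follow [56].

```python
import librosa
import numpy as np

def audio_to_melspec_db(audio_path: str, sr: int = 22050, n_mels: int = 128) -> np.ndarray:
    """Load a clip's audio and convert it to a decibel-scaled melspectrogram."""
    y, sr = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # express the spectrogram in decibels
    return mel_db  # 2D array, later treated as a single-channel image for the 2DCNN
```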
Implementation details: We use PyTorch [57] to implement our networks. We use the Adam optimizer [58] with a learning rate of 0.0001 and train all the networks for 100 epochs with a batch size of 4. In Eq. 1, we set the values of $\alpha_{v}$ and $\alpha_{a}$ to 1, and those of $\beta_{v}$ and $\beta_{a}$ to 0.1. The codebase will be made publicly available.
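For completeness, the optimizer setup described above as a small sketch; the model argument stands in for whichever network is being trained.

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    # Adam with the learning rate reported above; training then runs for
    # 100 epochs with a batch size of 4.
    return torch.optim.Adam(model.parameters(), lr=1e-4)
```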
Visual branch: Since our dataset is small, we design a small, custom 3DCNN network (C64, MaxP, C128, MaxP, C256, MaxP, C256, MaxP, C256, MaxP) to process the visual information. To avoid overfitting, we pretrain this 3DCNN on the UCF101 action recognition dataset [59]. We use 16 consecutive frames to form the input clip for our 3DCNN. All frames are resized to 112 × 112 pixels. We apply horizontal flipping during training.
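A sketch of what the listed architecture could look like in PyTorch. Only the channel progression (C64, C128, three C256 blocks, each followed by max pooling) comes from the text; the kernel sizes, the spatial-only pooling in the last block (needed because the 16-frame temporal dimension is exhausted after four poolings), and the final global pooling are our assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, pool=(2, 2, 2)):
    # 3x3x3 convolution + ReLU followed by max pooling.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=pool),
    )

class Custom3DCNN(nn.Module):
    """C64-MaxP-C128-MaxP-C256-MaxP-C256-MaxP-C256-MaxP over 16 x 112 x 112 clips."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64),
            conv_block(64, 128),
            conv_block(128, 256),
            conv_block(256, 256),
            conv_block(256, 256, pool=(1, 2, 2)),  # temporal dim is already 1 here
        )
        self.pool = nn.AdaptiveAvgPool3d(1)  # collapse the remaining T x H x W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, 16, 112, 112) -> (batch, 256)
        return self.pool(self.features(x)).flatten(1)
```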
Aural branch: We use ResNet-18 (R18) [60] as the backbone for our aural branch. We found that initializing the weights with ImageNet [61] pretraining helped considerably. We change the number of input channels of the first convolutional layer of the pretrained network from 3 to 1. We convert the melspectrograms of the audio signals to single-channel images and resize these images to 224 × 224 pixels before use with R18. We found that applying random cropping hurt performance. This may be because useful information is present in the lowest and highest frequencies, and removing those in the process of cropping adversely affects performance.
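A sketch of the single-channel adaptation of ResNet-18. The paper only states that the first convolution’s input channels change from 3 to 1; reusing the pretrained RGB filters by summing them, and exposing the pooled 512-d feature via an identity head, are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

def single_channel_resnet18() -> nn.Module:
    """ImageNet-pretrained ResNet-18 adapted to 1-channel melspectrogram input."""
    net = models.resnet18(pretrained=True)
    old_conv = net.conv1
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        # Sum the RGB filters so the pretrained weights are not discarded (an assumption).
        net.conv1.weight.copy_(old_conv.weight.sum(dim=1, keepdim=True))
    net.fc = nn.Identity()  # expose the 512-d pooled feature instead of class logits
    return net
```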
We conduct experiments to answer the following questions:
1. Is it possible to determine a pianist’s skill level using machine learning/computer vision?
2. Which sampling strategy is better: contiguous or uniformly distributed?
3. Is a multimodal assessment better than a unimodal assessment?
Performance Metric: We use accuracy (in %) as the performance metric.
Table 1: Accuracy (%) of each modality under the two sampling schemes.

| Modality | Contiguous | Uniformly Dist. |
| --- | --- | --- |
| Video | 65.55 | 73.95 |
| Audio | 53.36 | 64.50 |
| MMDL | 61.55 | 74.60 |
The results are presented in Table 1. We observe that uniformly distributed sampling is better than contiguous sampling. This may be because different parts of a song can contain different elements, providing a more diverse basis for assessing the pianist’s skill set. Another finding to note is that visual analysis was better than aural analysis. Moreover, using both visual and aural features yielded the best results. As discussed in Sec. 4, different/unique elements of skill can be observed visually and aurally, which may account for the small boost in performance over visual alone in the uniform sampling case. The results also show that we were able to train both networks end-to-end on a small dataset, which justifies our design choices.
Furthermore, the significant performance gap between the sampling schemes also suggests that the networks are not biased towards trivial, low-level, local/static cues present in the video or audio streams.
6 Conclusion
In this work, we addressed the automated assessment of piano skills. We introduced PISA, the first-ever piano skills assessment dataset. We found that assessing performance on the basis of visual elements was better than on the basis of aural elements; however, our best-performing model was the one that combined visual and aural elements. We also found that uniformly distributed sampling was significantly better than contiguous sampling at reflecting a player’s skill level. Computer vision is finding increasing applications in piano transcription, video-to-audio generation, etc. With our work, we hope to inspire future work in the direction of automated piano skills assessment and tutoring. To that end, we have released our dataset and codebase and provided performance baselines. While our approach yielded good results, we believe there is significant scope for improvement in this direction.
References
- [1] Jangwon Lee, Bardia Doosti, Yupeng Gu, David Cartledge, David Crandall, and Christopher Raphael, “Observing pianist accuracy and form with computer vision,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 1505–1513.
- [2] Yoshinari Takegawa, Tsutomu Terada, and Shojiro Nishio, “Design and implementation of a real-time fingering detection system for piano performance,” .
- [3] Dmitry O Gorodnichy and Arjun Yogeswaran, “Detection and tracking of pianist hands and fingers,” in The 3rd Canadian Conference on Computer and Robot Vision (CRV’06). IEEE, 2006, pp. 63–63.
- [4] Akiya Oka and Manabu Hashimoto, “Marker-less piano fingering recognition using sequential depth images,” in The 19th Korea-Japan Joint Workshop on Frontiers of Computer Vision. IEEE, 2013, pp. 1–4.
- [5] Potcharapol Suteparuk, “Detection of piano keys pressed in video,” .
- [6] Mohammad Akbari and Howard Cheng, “Real-time piano music transcription based on computer vision,” IEEE Transactions on Multimedia, vol. 17, no. 12, pp. 2113–2121, 2015.
- [7] Souvik Sinha Deb and Ajit Rajwade, “An image analysis approach for transcription of music played on keyboard-like instruments,” in Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, 2016, pp. 1–6.
- [8] Bochen Li, Xinzhao Liu, Karthik Dinesh, Zhiyao Duan, and Gaurav Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522–535, 2018.
- [9] A Sophia Koepke, Olivia Wiles, Yael Moses, and Andrew Zisserman, “Sight to sound: An end-to-end approach for visual piano transcription,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1838–1842.
- [10] Jun Li, Wei Xu, Yong Cao, Wei Liu, and Wenqing Cheng, “Robust piano music transcription based on computer vision,” in Proceedings of the 2020 4th High Performance Computing and Cluster Technologies Conference & 2020 3rd International Conference on Big Data and Artificial Intelligence, 2020, pp. 92–97.
- [11] Seongjae Kang, Jaeyoon Kim, and Sung-eui Yoon, “Virtual piano using computer vision,” arXiv preprint arXiv:1910.12539, 2019.
- [12] Kun Su, Xiulong Liu, and Eli Shlizerman, “Audeo: Audio generation for a silent performance video,” Advances in Neural Information Processing Systems, vol. 33, 2020.
- [13] Chuang Gan, Deng Huang, Peihao Chen, Joshua B Tenenbaum, and Antonio Torralba, “Foley music: Learning to generate music from videos,” arXiv preprint arXiv:2007.10984, 2020.
- [14] Bochen Li, Akira Maezawa, and Zhiyao Duan, “Skeleton plays piano: Online generation of pianist body movements from midi performance.,” in ISMIR, 2018, pp. 218–224.
- [15] Hazel Doughty, Dima Damen, and Walterio Mayol-Cuevas, “Who’s better? who’s best? pairwise deep ranking for skill determination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6057–6066.
- [16] Ruimin Li, Bin Li, Shixiong Zhang, Hong Fu, Wai-Lun Lo, Jie Yu, Cindy HP Sit, and Desheng Wen, “Evaluation of the fine motor skills of children with dcd using the digitalised visual-motor tracking system,” The Journal of Engineering, vol. 2018, no. 2, pp. 123–129, 2018.
- [17] Hazel Doughty, Walterio Mayol-Cuevas, and Dima Damen, “The pros and cons: Rank-aware temporal attention for skill determination in long videos,” in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7862–7871.
- [18] Zhenqiang Li, Yifei Huang, Minjie Cai, and Yoichi Sato, “Manipulation-skill assessment from videos with spatial attention network,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0–0.
- [19] Chanjin Seo, Masato Sabanai, Hiroyuki Ogata, and Jun Ohya, “Understanding sprinting motion skills using unsupervised learning for stepwise skill improvements of running motion.,” in ICPRAM, 2019, pp. 467–475.
- [20] Ruimin Li, Hong Fu, Yang Zheng, Wai-Lun Lo, J Yu Jane, Cindy HP Sit, Zheru Chi, Zongxi Song, and Desheng Wen, “Automated fine motor evaluation for developmental coordination disorder,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, no. 5, pp. 963–973, 2019.
- [21] Tianyu Wang, Yijie Wang, and Mian Li, “Towards accurate and interpretable surgical skill assessment: A video-based method incorporating recognized surgical gestures and skill levels,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2020, pp. 668–678.
- [22] Long-fei Chen, Yuichi Nakamura, and Kazuaki Kondo, “Toward adaptive guidance: Modeling the variety of user behaviors in continuous-skill-improving experiences of machine operation tasks,” arXiv preprint arXiv:2003.03025, 2020.
- [23] Andrew S Gordon, “Automated video assessment of human performance,” in Proceedings of AI-ED, 1995, pp. 16–19.
- [24] Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba, “Assessing the quality of actions,” in European Conference on Computer Vision. Springer, 2014, pp. 556–571.
- [25] Paritosh Parmar and Brendan Tran Morris, “Measuring the quality of exercises,” in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2016, pp. 2241–2244.
- [26] Paritosh Parmar and Brendan Tran Morris, “Learning to score olympic events,” in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 20–28.
- [27] Yongjun Li, Xiujuan Chai, and Xilin Chen, “End-to-end learning for action quality assessment,” in Pacific Rim Conference on Multimedia. Springer, 2018, pp. 125–134.
- [28] Xiang Xiang, Ye Tian, Austin Reiter, Gregory D Hager, and Trac D Tran, “S3d: Stacking segmental p3d for action quality assessment,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 928–932.
- [29] Yongjun Li, Xiujuan Chai, and Xilin Chen, “Scoringnet: Learning key fragment for action quality assessment with ranking loss in skilled sports,” in Asian Conference on Computer Vision. Springer, 2018, pp. 149–164.
- [30] Chengming Xu, Yanwei Fu, Bing Zhang, Zitian Chen, Yu-Gang Jiang, and Xiangyang Xue, “Learning to score figure skating sport videos,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
- [31] Paritosh Parmar and Brendan Morris, “Action quality assessment across multiple actions,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 1468–1476.
- [32] Qing Lei, Hong-Bo Zhang, Ji-Xiang Du, Tsung-Chih Hsiao, and Chih-Cheng Chen, “Learning effective skeletal representations on rgb video for fine-grained human action quality assessment,” Electronics, vol. 9, no. 4, pp. 568, 2020.
- [33] Paritosh Parmar and Brendan Morris, “Hallucinet-ing spatiotemporal representations using 2d-cnn,” arXiv preprint arXiv:1912.04430, 2019.
- [34] Faegheh Sardari, Adeline Paiement, and Majid Mirmehdi, “View-invariant pose analysis for human movement assessment from rgb data,” in International Conference on Image Analysis and Processing. Springer, 2019, pp. 237–248.
- [35] Jia-Hui Pan, Jibin Gao, and Wei-Shi Zheng, “Action assessment by joint relation graphs,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6331–6340.
- [36] Paritosh Parmar and Brendan Tran Morris, “What and how well you performed? a multitask learning approach to action quality assessment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 304–313.
- [37] Hiteshi Jain and Gaurav Harit, “An unsupervised sequence-to-sequence autoencoder based human action scoring model,” in 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2019, pp. 1–5.
- [38] Ryoji Ogata, Edgar Simo-Serra, Satoshi Iizuka, and Hiroshi Ishikawa, “Temporal distance matrices for squat classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
- [39] Jibin Gao, Wei-Shi Zheng, Jia-Hui Pan, Chengying Gao, Yaowei Wang, Wei Zeng, and Jianhuang Lai, “An asymmetric modeling for action assessment,” in European Conference on Computer Vision. Springer, 2020, pp. 222–238.
- [40] Tianyu Wang, Minhao Jin, Jingying Wang, Yijie Wang, and Mian Li, “Towards a data-driven method for rgb video-based hand action quality assessment in real time,” in Proceedings of the 35th Annual ACM Symposium on Applied Computing, 2020, pp. 2117–2120.
- [41] Jiahao Wang, Zhengyin Du, Annan Li, and Yunhong Wang, “Assessing action quality via attentive spatio-temporal convolutional networks,” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2020, pp. 3–16.
- [42] Ling-An Zeng, Fa-Ting Hong, Wei-Shi Zheng, Qi-Zhi Yu, Wei Zeng, Yao-Wei Wang, and Jian-Huang Lai, “Hybrid dynamic-static context-aware attention network for action assessment in long videos,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2526–2534.
- [43] Mahdiar Nekoui, Fidel Omar Tito Cruz, and Li Cheng, “Falcons: Fast learner-grader for contorted poses in sports,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 900–901.
- [44] Yansong Tang, Zanlin Ni, Jiahuan Zhou, Danyang Zhang, Jiwen Lu, Ying Wu, and Jie Zhou, “Uncertainty-aware score distribution learning for action quality assessment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9839–9848.
- [45] Chen Du, Sarah Graham, Shiwei Jin, Colin Depp, and Truong Nguyen, “Multi-task center-of-pressure metrics estimation from skeleton using graph convolutional network,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 2313–2317.
- [46] Hiteshi Jain, Gaurav Harit, and Avinash Sharma, “Action quality assessment using siamese network-based deep metric learning,” arXiv preprint arXiv:2002.12096, 2020.
- [47] Faegheh Sardari, Adeline Paiement, Sion Hannuna, and Majid Mirmehdi, “Vi-net—view-invariant quality of human movement assessment,” Sensors, vol. 20, no. 18, pp. 5258, 2020.
- [48] Gladys Calle Condori, Eveling Castro-Gutierrez, and Luis Alfaro Casas, “Virtual rehabilitation using sequential learning algorithms,” .
- [49] Behnoosh Parsa and Ashis G Banerjee, “A multi-task learning approach for human action detection and ergonomics risk assessment,” arXiv preprint arXiv:2008.03014, 2020.
- [50] “Chase-riecken musicianship exams - syllabi & study guides,” https://lvmta.com/musicianship, [Online].
- [51] “Con brio examinations - grade indications,” https://www.conbrioexams.com/exams-grade-inidications, [Online].
- [52] “G. henle verlag - levels of difficulty (piano),” https://www.henle.de/us/about-us/levels-of-difficulty-piano/, [Online].
- [53] “The royal conservatory of music-piano-syllabus-2015 edition,” https://files.rcmusic.com//sites/default/files/files/RCM-Piano-Syllabus-2015.pdf, [Online].
- [54] “Las vegas music teacher association technic - level 10,” https://content.web-repository.com/s/02553545902171053/uploads/Images/Technic_Materials_10-0954933.pdf, [Online].
- [55] B. McFee, C. Raffel, D. Liang, D.P.W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” in 14th annual Scientific Computing with Python conference, July 2015, SciPy.
- [56] “Audio classification using librosa and pytorch,” https://medium.com/@hasithsura/audio-classification-d37a82d6715, [Online].
- [57] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, pp. 8026–8037. 2019.
- [58] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [59] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
- [60] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [61] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.