
Deep Composer Classification Using Symbolic Representation

Abstract

In this study, we train deep neural networks to classify composers in the symbolic domain. The model takes a two-channel two-dimensional input, i.e., onset and note activations in a time-pitch representation converted from MIDI recordings, and performs single-label classification. In experiments on the MAESTRO dataset, we report an F1 score of 0.8333 for the classification of 13 classical composers.

1 Introduction and Proposed System

Symbolic representation can be used as the input representation to make the model independent of timbre and the acoustic recording environment, letting it focus on note-related aspects such as pitch and duration. For example, n-grams of symbolic musical events were used as the input of a composer style classification model in [1].

In the proposed system, as illustrated in Figure 1, we use a ResNet [2] to classify the composer from a segment of the input music, and the final decision is made by majority voting over segments. For the input, we adopt a symbolic representation that was originally used in the piano transcription task [3]. In detail, the MIDI note events of a recording are converted to onset and frame channels. Each channel is a piano-roll-like 2D array of shape (time, pitch), where a bin represents 0.05 second and 1 semitone, respectively. A bin of the onset channel is binary, where 1 indicates a note onset. In the frame channel, each bin represents the activation of a note by its recorded MIDI velocity (0-127).
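As an illustration, this conversion could be sketched in Python with the pretty_midi library; the following is a minimal sketch under our reading of the representation, not the authors' released preprocessing code.

import numpy as np
import pretty_midi

TIME_RES = 0.05   # seconds per time bin
N_PITCH = 128     # MIDI pitch range

def midi_to_channels(midi_path):
    # Convert a MIDI file into (onset, frame) arrays of shape (time, pitch).
    pm = pretty_midi.PrettyMIDI(midi_path)
    n_bins = int(np.ceil(pm.get_end_time() / TIME_RES)) + 1
    onset = np.zeros((n_bins, N_PITCH), dtype=np.float32)
    frame = np.zeros((n_bins, N_PITCH), dtype=np.float32)
    for inst in pm.instruments:
        for note in inst.notes:
            start = int(note.start / TIME_RES)
            end = int(note.end / TIME_RES)
            onset[start, note.pitch] = 1.0                    # binary onset bin
            frame[start:end + 1, note.pitch] = note.velocity  # MIDI velocity (0-127)
    return onset, frame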

With this design, we expect the ResNet to learn specific patterns that indicate the musical styles of composers. It can do so by learning spatial features in the input, such as pitch-interval tendencies (e.g., chords and voicing). We also hypothesize that the network can learn useful velocity patterns from the frame channel, such as the variation of velocity within a chord or its local/global dynamic range.

2 Experiments and Discussion

2.1 Overview

We used the MIDI-only archive of the MAESTRO dataset v2.0.0 [4]. It provides over 200 hours of classical MIDI performances along with metadata such as composer, title, and duration. We excluded pieces with multiple composers to remove ambiguity and selected composers with more than 16 pieces. These steps left 505 pieces by 13 composers for our experiment. The data are split into 7:3 train and test sets (347 and 158 pieces, respectively) using stratified sampling with respect to the composer label.
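A minimal sketch of this selection and split, assuming the MAESTRO v2.0.0 metadata CSV with its canonical_composer and canonical_title columns; the multiple-composer marker and file name below are assumptions for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv("maestro-v2.0.0.csv")

# Drop pieces attributed to multiple composers (assumed to be joined with "/")
# and keep one row per unique (composer, title) pair.
meta = meta[~meta["canonical_composer"].str.contains("/")]
pieces = meta.drop_duplicates(subset=["canonical_composer", "canonical_title"])

# Keep composers with more than 16 unique pieces.
counts = pieces["canonical_composer"].value_counts()
pieces = pieces[pieces["canonical_composer"].isin(counts[counts > 16].index)]

# 7:3 train/test split, stratified by the composer label.
train, test = train_test_split(
    pieces, test_size=0.3, stratify=pieces["canonical_composer"], random_state=0)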

In the dataset, each MIDI recording corresponds to a piece. To deal with the variable durations, only a randomly selected segment of the whole recording is fed to the model at a time. The duration of a segment is set to 20 seconds (400 bins). For both training and inference of the model, we use 90 segments per recording, uniformly sampled over time. All the details of the experiment, including the model and dataset, are released at https://github.com/KimSSung/Deep-Composer-Classification.
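The segmentation can be sketched as follows, assuming the onset/frame arrays from the conversion above and a recording of at least 400 bins; the exact sampling strategy of the released code may differ.

import numpy as np

SEG_LEN = 400      # 20 s at 0.05 s per bin
N_SEGMENTS = 90

def sample_segments(onset, frame, n_segments=N_SEGMENTS, seg_len=SEG_LEN):
    # Stack the two channels into (2, time, pitch) and cut n_segments windows
    # whose start positions are spread uniformly over the recording.
    x = np.stack([onset, frame])
    max_start = max(x.shape[1] - seg_len, 1)
    starts = np.linspace(0, max_start - 1, n_segments).astype(int)
    return np.stack([x[:, s:s + seg_len, :] for s in starts])   # (n, 2, 400, 128)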


Figure 1: A MIDI recording is converted to onset/frame channels, which are segmented into the input size of the model. During inference, the final classification of a piece is decided by majority voting over segment-wise predictions.
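A sketch of this piece-level decision, assuming a trained PyTorch model that outputs class logits for a batch of segments:

import torch

def classify_piece(model, segments):
    # segments: tensor of shape (n_segments, 2, 400, 128) from one piece.
    model.eval()
    with torch.no_grad():
        seg_preds = model(segments).argmax(dim=1)   # per-segment composer index
    return torch.mode(seg_preds).values.item()      # majority vote over segments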
Composer (abb.) Pieces Composer (abb.) Pieces
F. Chopin (Chop) 64 W. A. Mozart (Moza) 29
J. S. Bach (Bach) 62 D. Scarlatti (Scar) 25
L. V. Beethoven (Beet) 62 J. Haydn (Hayd) 20
F. Liszt (Lisz) 60 A. Scriabin (Scri) 19
F. Schubert (F.Sch) 58 R. Schumann (R.Sch) 18
C. Debussy (Debu) 37 J. Brahms (Brah) 17
S. Rachmaninoff (Rach) 34
Table 1: List of the remaining pieces grouped by canonical composer. Pieces is the number of unique canonical titles by each composer.


Figure 2: Per-composer F1 scores of the best-performing model, sorted by the composers' birth years.

After a preliminary experiment in which we tested ResNet-18/34/50/101, we chose ResNet-50 as the reference model based on its performance. The model was trained using SGD with momentum and weight decay, with the cross-entropy (CE) loss and cosine annealing for learning rate scheduling.
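The training setup can be sketched as follows. The concrete hyperparameter values (learning rate, epochs, weight decay) are placeholders rather than the values used in the paper, the adaptation of the first convolution to the two-channel input is our assumption, and train_loader is assumed to yield (segment, composer-label) batches.

import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(num_classes=13)
# Adapt the first convolution to the two-channel (onset, frame) input.
model.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for x, y in train_loader:          # x: (batch, 2, 400, 128), y: composer index
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()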

2.2 Result and Discussion

Our reference model achieved a 'weighted' averaged F1 score of 0.8333; ResNet-18/34/101 achieved F1 scores of 0.7962/0.7881/0.7892. This can be loosely compared to [5, 6], although the problem definition and the dataset vary. [5], using gradient boosting and a CNN, achieved F1 scores of 0.742 and 0.700 for classifying 15 and 6 classical composers, respectively. Similarly, a ConvRNN-based system in [6] achieved 70% accuracy for classifying 6 composers.
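The 'weighted' averaged F1 reported above can be computed with scikit-learn as below; y_true and y_pred stand for piece-level composer labels and predictions, shown here with toy values.

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]   # true composer indices (toy example)
y_pred = [0, 1, 1, 1, 2, 0]   # predicted composer indices (toy example)
print(f1_score(y_true, y_pred, average="weighted"))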

Based on this reference model, we investigated the performance variation across composers. First, the Spearman rank correlation coefficient between the composers' birth years and their performances was -0.45, i.e., the model performed better for relatively old classical composers. This result seems musically plausible and related to a fundamental property of the problem: because musical style has diversified and more complicated chords have been popularized over time, recognizing the composer of a piece may be inherently more difficult for relatively modern composers. Second, we also hypothesized that the label imbalance in the training data has an effect on the performance, but the Spearman rank correlation coefficient between the number of training items (pieces) and performance was -0.13, which does not seem significant.
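The correlation analysis above can be reproduced with SciPy; the per-composer F1 values and birth years below are illustrative placeholders, not the paper's numbers.

from scipy.stats import spearmanr
from sklearn.metrics import f1_score

# per_composer_f1 = f1_score(y_true, y_pred, average=None)  # one score per class
per_composer_f1 = [0.95, 0.90, 0.88, 0.85, 0.80, 0.70]   # illustrative values
birth_years = [1685, 1732, 1756, 1770, 1810, 1862]       # Bach ... Debussy

rho, p_value = spearmanr(birth_years, per_composer_f1)
print(rho)   # a negative rho means better performance for earlier composers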


Figure 3: Confusion matrix of the classification results on the validation set, sorted by birth year. Bold boxes denote the musical eras.

The confusion matrix in Figure 3 shows that the majority of misclassifications occurred within the same era (denoted with bold boxes). Out of 19 misclassified pairs, only 5 were classified as composers from a different era. We assume that composers from the same era are more likely to share similar musical patterns, and this indicates that the model successfully learned such patterns during training.

No. of Segments   F1
5                 .5713
10                .7196
20                .7687
30                .8148
60                .8249
90                .8333

Onset Channel     F1
Used              .8333
Omitted           .7858

Frame Channel     F1
Continuous        .8333
Binarized         .8525
Table 2: Ablation study results (F1 score) of controlling i) the number of segments, ii) onset channel usage, and iii) frame channel binarization.

2.3 Ablation Study

Number of segments: We trained instances of the model with the number of segments increased from 5 to 90, as reported in the first column of Table 2. The results indicate that 30 or more segments provide information diverse enough for the model to perform at a certain level (F1 score above 0.80).

Onset Channel Usage: As shown in the second column of Table 2, removing the onset channel from the input degraded performance by 0.0475. This result demonstrates two interesting aspects of the task and the model. First, even with only the frame channel, i.e., by comparing chords and voicing, the model can classify composers with reasonable performance. Second, the location of onsets may help solve the task by clarifying the chords and/or providing rhythmic information.

Frame Binarization: In the third column of Table 2, we compare the result of our reference model (annotated as “Continuous”) with a model trained using a binarized frame channel (i.e., one where the velocity values are binarized), in an attempt to maximize the symbolic characteristic of our data by removing the remaining performer-related information. Using this new input, somewhat surprisingly, the “Binarized” model achieved an F1 score of 0.8525, outperforming the reference model by 0.0192.
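A minimal sketch of this binarized variant: every velocity value in the frame channel is thresholded to {0, 1}, removing the remaining performer dynamics.

import numpy as np

def binarize_frame(frame):
    # frame: (time, pitch) array of MIDI velocities in 0-127.
    return (frame > 0).astype(np.float32)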

2.4 Conclusion

In this paper, we showed that composer classification can be performed using a symbolic representation. We investigated the trained model by varying the within-piece coverage and the input data representation, as well as by analyzing the per-composer performance. Future directions include separating the influence of performers, generalizing to other instruments, and domain adaptation to other genres.

References

[1] J. Wołkowicz, Z. Kulka, and V. Kešelj, “N-gram-based approach to composer recognition,” Archives of Acoustics, vol. 33, no. 1, pp. 43–55, 2008.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[3] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck, “Onsets and frames: Dual-objective piano transcription,” International Society for Music Information Retrieval conference, 2017.
[4] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” in International Conference on Learning Representations, 2019.
[5] S. Jain, A. Smit, and T. Yngesjö, “Analysis and classification of symbolic western classical music by composer,” Preprint. [Online]. Available: http://cs229.stanford.edu/proj2019aut/data/assignment_308832_raw/26583519.pdf
[6] G. Micchi, “A neural network for composer classification,” International Society for Music Information Retrieval conference – Late Breaking/Demo session, 2018.