Optimal Embedding Calibration for Symbolic Music Similarity
Abstract
In natural language processing (NLP), the semantic similarity task requires large-scale, high-quality human-annotated labels for fine-tuning or evaluation. By contrast, such labels for music similarity are expensive to collect and depend heavily on the annotator's artistic preferences. Recent research has demonstrated that embedding calibration techniques can greatly improve the semantic similarity performance of pre-trained language models without fine-tuning. However, it remains unknown which calibration method is best and how much improvement can be achieved. To address these issues, we propose using composer information to construct labels for automatically evaluating music similarity. Under this paradigm, we discover the optimal combination of embedding calibrations, which achieves better metrics than the baseline methods.
1 Introduction
Symbolic music research has benefited greatly from the natural language processing (NLP) paradigm, with the Transformer architecture Vaswani et al. (2017) and powerful pre-trained models such as BERT Devlin et al. (2019) and GPT-2 Radford et al. (2019). Treating symbolic music problems as language modeling makes classical NLP functionality such as semantic similarity available. However, traditional methods for semantic similarity Reimers and Gurevych (2019); Zhang et al. (2020); Li et al. (2020) all require datasets with large-scale, high-quality human-annotated labels. In the music domain, such labels are expensive to collect and are severely biased by the annotator's artistic preferences, so these methods are difficult to deploy for music similarity.
On the other hand, some baseline methods only rely on a pre-trained language model and do not require fine-tuning. The widely used baselines either take a single specific token embedding (e.g., the [CLS] token in BERT) as the sentence embedding, or average the token embeddings from the last Transformer layer, and then compute the cosine distance between sentences as a measure of their similarity. However, recent studies have revealed that these baseline methods perform poorly: results by Reimers and Gurevych (2019) demonstrate that they cannot even outperform the GloVe algorithm Pennington et al. (2014).
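The last-layer mean-pooling baseline described above can be sketched as follows; the toy 2-dimensional token embeddings are illustrative only, not real model outputs.

```python
import math

def mean_pool(token_embeddings):
    """Average the last-layer token embeddings into one sentence embedding."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity between two sentence embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-token, 2-dim "last layer" outputs for two sequences.
seq_a = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4]]
seq_b = [[0.9, 0.1], [0.7, 0.3], [0.5, 0.5]]
sim = cosine_similarity(mean_pool(seq_a), mean_pool(seq_b))
```

Despite its simplicity, this is exactly the kind of uncalibrated baseline that the studies above found to underperform GloVe.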
We therefore consider embedding calibration techniques Li et al. (2020); Mu and Viswanath (2018). Recent findings indicate that standard normalization (SN) and nulling away the top-k singular vectors (NATSV; Li et al., 2020; Mu and Viswanath, 2018) can significantly improve semantic similarity performance without requiring fine-tuning, and they are thus frequently used as baseline methods. However, these studies do not explore all feasible calibration combinations, for example, averaging more than two Transformer layers (Li et al., 2020). As a result, the performance boundary for embedding calibration remains unknown.

To address these issues, we investigate the optimal embedding calibration for music similarity. The following are our contributions.
• We propose an automated method for evaluating music similarity that makes use of composer information. This method produces statistically significant results without requiring human annotations.
• Under the proposed evaluation paradigm, we find that the optimal performance is obtained by averaging the last 8 of 12 Transformer layers, in combination with standard normalization calibration. The correlation metric increases to 0.223 with the optimally calibrated embedding, compared to 0.154/0.028 for the baseline methods.
2 Embedding Calibration for Music Similarity
2.1 Pre-trained Language Model
We release a pre-trained auto-regressive Transformer language model for symbolic music. To build the composer-centric labels, we train the model on the MAESTRO dataset Hawthorne et al. (2019), which contains large-scale, high-quality symbolic music data with accurate metadata. However, past works Huang et al. (2019); Simon and Oore (2017) used a vocabulary of only 308 words (MIDI events), far smaller than the vocabularies of benchmark NLP language models. Additionally, music sequences are significantly longer than a typical language model's maximum context length. To remedy this, we employ a vocabulary aggregation technique, illustrated in Figure 1. Each MIDI event in the vocabulary is treated as a single-length word. The SentencePiece model Kudo and Richardson (2018) is then used to aggregate and expand the vocabulary by merging high-frequency single-event (single-length) words into multiple-event (multiple-length) words and adding them to the vocabulary. This produces a vocabulary containing both single-event and multiple-event words. The details of the pre-trained model are available in our publicly available code.
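The merging step can be illustrated with a minimal BPE-style sketch: the most frequent adjacent pair of events becomes a new multiple-event word. This is a simplified stand-in for the SentencePiece training the paper uses, and the event names below are hypothetical.

```python
from collections import Counter

def aggregate_vocabulary(event_sequences, num_merges):
    """Greedily merge the most frequent adjacent event pair into a new
    multiple-event word. Simplified stand-in for SentencePiece training."""
    seqs = [list(s) for s in event_sequences]
    vocab = {ev for s in seqs for ev in s}
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + "+" + b
        vocab.add(merged)
        # Replace every occurrence of the winning pair with the merged word.
        for i, s in enumerate(seqs):
            out, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and s[j] == a and s[j + 1] == b:
                    out.append(merged)
                    j += 2
                else:
                    out.append(s[j])
                    j += 1
            seqs[i] = out
    return vocab, seqs

# Hypothetical MIDI-event token names for two short sequences.
events = [["NOTE_ON_60", "NOTE_OFF_60", "NOTE_ON_62", "NOTE_OFF_62"],
          ["NOTE_ON_60", "NOTE_OFF_60", "NOTE_ON_64", "NOTE_OFF_64"]]
vocab, merged_seqs = aggregate_vocabulary(events, num_merges=1)
```

After one merge, the frequent pair `NOTE_ON_60`/`NOTE_OFF_60` becomes a single multiple-event word, shortening the sequences while growing the vocabulary.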
2.2 Embedding Calibration
Let $h_{l,p} \in \mathbb{R}^{d}$ denote the embedding vector of the token at layer $l$ and position $p$, where $L$ denotes the total number of Transformer layers ($0 \le l \le L$), $P$ denotes the maximum context length ($1 \le p \le P$), and $d$ denotes the embedding size. We propose to investigate all possible combinations of the following three embedding calibration techniques.
Last Layer Average Calibration: average the token embeddings over the last $m$ Transformer layers, with $1 \le m \le L$.
Standard Normalization (SN) Calibration: each sentence embedding $s$ is calibrated as $\tilde{s} = (s - \mu)/\sigma$, where $\mu$ and $\sigma$ denote the mean and standard deviation of all sentence embeddings.
Nulling Away Top Singular Vector (NATSV) Calibration: calculate and remove the first $k$ principal components in the sentence embedding space, with $k \ge 0$.
Note that the original NATSV algorithm Mu and Viswanath (2018) operates in the token embedding space. Conducting principal component analysis (PCA) on all tokens in the MAESTRO corpus would be computationally challenging, so we heuristically perform PCA in the sentence embedding space instead.
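The three calibrations compose into a single pipeline; the following NumPy sketch shows one plausible implementation under our reading of the steps above (array shapes and the small random inputs are illustrative assumptions, not the paper's code).

```python
import numpy as np

def calibrate(hidden_states, m, k, use_sn=True):
    """hidden_states: array of shape (num_sentences, L+1, P, d) holding the
    per-layer token embeddings of each sequence. Returns calibrated
    sentence embeddings of shape (num_sentences, d)."""
    # 1) LayerAvg: average token embeddings over positions and the last m layers.
    sent = hidden_states[:, -m:, :, :].mean(axis=(1, 2))
    # 2) SN: standardize each dimension across the set of sentence embeddings.
    if use_sn:
        sent = (sent - sent.mean(axis=0)) / (sent.std(axis=0) + 1e-8)
    # 3) NATSV: null away the top-k principal components of the
    #    sentence embedding space (computed here via SVD on centered data).
    if k > 0:
        centered = sent - sent.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        top = vt[:k]                       # (k, d) top singular vectors
        sent = sent - (sent @ top.T) @ top
    return sent

rng = np.random.default_rng(0)
states = rng.normal(size=(10, 13, 16, 8))  # 10 sentences, 12+1 layers, 16 tokens, d=8
emb = calibrate(states, m=8, k=1)
```

Setting `m=1, k=0, use_sn=False` recovers the plain last-layer averaging baseline, so the search over calibration combinations contains the baselines as special cases.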

2.3 Automated Labels for Evaluation using Composer Information
We propose an automated method for creating labels for evaluating music similarity. Note that the MAESTRO dataset contains precise MIDI events that record performance data. Intuitively, music composed by the same composer tends to be more internally consistent than music composed by different composers, so two sequences by the same composer will in general be more similar than two sequences by different composers. Although this is not a perfect similarity annotation and hence cannot be used for fine-tuning, it can be used to evaluate all candidate calibration methods fairly. To ensure statistical significance, the number of such labels can be made much greater than in human-annotated NLP datasets (as in the cases in Reimers and Gurevych, 2019).
We construct music sequences using a sliding window whose size equals the maximum context length, with a stride of half the window size. We then randomly select 100,000 pairs of sequences by the same composer and label them positively, and another 100,000 pairs of sequences by different composers and label them negatively. The similarity between two sequences is measured by the cosine distance between their sentence embeddings. As the evaluation metric for a calibration, we use the widely adopted Spearman's correlation between the calculated similarities and the constructed labels. Without requiring a human-annotated dataset, we can now search for the optimal combination of embedding calibrations.
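The evaluation metric can be computed without external libraries; below is a minimal Spearman's correlation (Pearson correlation of ranks, with average ranks for ties, as needed for the binary labels), applied to a tiny hypothetical batch of pair similarities.

```python
import math

def spearman_correlation(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks (ties get average ranks)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
            for t in range(i, j + 1):
                r[order[t]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical same-composer labels vs. cosine similarities of sequence pairs.
labels = [1, 1, 0, 0, 1, 0]
similarities = [0.9, 0.7, 0.2, 0.4, 0.8, 0.1]
rho = spearman_correlation(similarities, labels)
```

A rho near 1 would mean same-composer pairs are consistently scored as more similar; in practice the paper reports correlations in the 0.0–0.3 range over 200,000 pairs.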
3 Experiments
3.1 Optimal Embedding Calibration
The main results are shown in Figure 2. “LayerAvg” on the x-axis indicates $m$, and the different markers indicate $k$. For simplicity, we also refer to $m$ as the “LayerAvg value” and to $k$ as the “NATSV value”. The left figure column shows results with standard normalization (SN); the right column shows results without it. All experiments are statistically significant. Since our model is auto-regressive, the last token has a full attention range covering all tokens in the sequence. We therefore choose the last token embedding as the baseline corresponding to the [CLS] token embedding baseline; it is marked with LayerAvg equal to 0 in both figure columns. The other baseline is the set of points with LayerAvg equal to 1.


As shown in Figure 2, the optimal metric is obtained in the left column by setting LayerAvg to 8, with SN, and NATSV set to zero. The correlation metric peaks at LayerAvg equal to 8: while averaging more layers, as demonstrated by Li et al. (2020), is beneficial, there is an optimal point beyond which the correlation metric degrades.
The performance boost provided by SN is significant: in Figure 2, the curve with SN and NATSV set to 0 in the left figure column is substantially better. The gain from NATSV, by contrast, is only noticeable in a few spots and is not as large as the gain from SN. For instance, on the curve without SN in the right figure column, NATSV equal to 1 is always preferable to 0, whereas NATSV equal to 2 is only partially preferable to 1.
3.2 Impact of Token Position
Since our model is auto-regressive, the attention span of each token differs from that of an auto-encoder Transformer model, which raises a further question: how does a token's position affect its importance for similarity? We conduct the following experiment to address this question. The current method of computing the sentence embedding averages over all positions, implying that all positions are equally important; here we examine two scenarios in which a token's importance is unequal and depends on its position.
Linear Position Weighting:
$$\bar{h}_l = \frac{\sum_{p=1}^{P} p \, h_{l,p}}{\sum_{p=1}^{P} p} \qquad (1)$$
Inverse Linear Position Weighting:
$$\bar{h}_l = \frac{\sum_{p=1}^{P} (P + 1 - p) \, h_{l,p}}{\sum_{p=1}^{P} (P + 1 - p)} \qquad (2)$$
where $\bar{h}_l$ is the intermediate embedding for layer $l$, which is subsequently averaged across layers to obtain the sentence embedding. Results are shown in Figure 3: linear position weighting outperforms inverse linear position weighting in terms of the correlation metric. Additionally, the former attains an optimal metric fairly close to that of Figure 2, whereas the latter drops further. These results indicate that in an auto-regressive Transformer model, tokens with larger position indices are more important for music similarity.
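The two weighting schemes can be sketched as a single function; the weights rise linearly with position or fall linearly (the exact normalization here is an assumption, and the toy token embeddings are illustrative).

```python
def position_weighted_average(token_embeddings, inverse=False):
    """Average token embeddings with weights proportional to the 1-based
    position (linear) or decreasing linearly with position (inverse)."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    weights = [(n - p) if inverse else (p + 1) for p in range(n)]
    total = sum(weights)
    return [sum(w * tok[d] for w, tok in zip(weights, token_embeddings)) / total
            for d in range(dim)]

# Toy 3-token, 2-dim layer outputs.
tokens = [[0.0, 1.0], [1.0, 0.0], [2.0, -1.0]]
linear = position_weighted_average(tokens)             # weights 1, 2, 3
inv = position_weighted_average(tokens, inverse=True)  # weights 3, 2, 1
```

In the linear scheme the last token dominates, which matches the experimental finding that later tokens, with their fuller attention range, carry more similarity-relevant information.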
4 Conclusion
We investigate music similarity using embedding calibration. We propose an automated method for constructing labels using composer information for evaluation. Our results show that the optimal embedding calibration is obtained by averaging the last 8 out of 12 layers of token embeddings using standard normalization calibration. The correlation metric significantly increases compared to the baseline methods. This method only requires the pre-trained language model and does not rely on additional datasets for fine-tuning.
References
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Hawthorne et al. (2019) Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. 2019. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In International Conference on Learning Representations.
- Huang et al. (2019) Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music transformer. In International Conference on Learning Representations.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics.
- Li et al. (2020) Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020. On the sentence embeddings from pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9119–9130, Online. Association for Computational Linguistics.
- Mu and Viswanath (2018) Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
- Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Simon and Oore (2017) Ian Simon and Sageev Oore. 2017. Performance RNN: Generating music with expressive timing and dynamics. https://magenta.tensorflow.org/performance-rnn.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Zhang et al. (2020) Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An unsupervised sentence embedding method by mutual information maximization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1601–1610, Online. Association for Computational Linguistics.