
An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging

Abstract

Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data. It is particularly compelling in the music domain, where labeling data is time-consuming, error-prone, and often ambiguous. During the self-supervised process, models are trained on pretext tasks, with the primary objective of acquiring robust and informative features that can later be fine-tuned for specific downstream tasks. The choice of the pretext task is critical as it guides the model to shape the feature space with meaningful constraints for information encoding. In the context of music, most works have relied on contrastive learning or masking techniques. In this study, we expand the scope of pretext tasks applied to music by investigating and comparing the performance of new self-supervised methods for music tagging. We open-source a simple ResNet model trained on a diverse catalog of millions of tracks. Our results demonstrate that, although most of these pre-training methods yield similar downstream results, contrastive learning consistently leads to better downstream performance than the other self-supervised pre-training methods. This also holds true in a limited-data downstream setting.

2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Index Terms—  audio representations, music information retrieval, self-supervised learning

1 Introduction

Over the past few years, self-supervised learning methods have become popular for training robust, generalizable machine learning models [1, 2]. These methods eliminate the need for labeled data. Instead, we broadly define self-supervised learning as a paradigm in which a model is trained to accomplish a task whose ground truth is trivially generated from the data itself. This task, known as “pretext”, may not have practical interest, but its resolution requires the machine to encode reusable generic signal information efficiently. These approaches offer numerous advantages. First, they allow the use of extremely large datasets for pre-training and can thus help mitigate some of the biases and ambiguities associated with labeled data [3, 4]. Moreover, self-supervised models can produce features that are not specialized for solving specific supervised tasks. They instead encapsulate richer mid-level representations that make them adaptable to various downstream scenarios with minimal effort and resources; often, a single multilayer perceptron (MLP) trained upon self-supervised embeddings is sufficient for competitive downstream performance [5, 6]. Presently, self-supervised learning approaches have achieved competitive or even superior results compared to traditional supervised learning methods. They have also paved the way for new research directions, such as learning multimodal spaces for text and image [7] or vision and audio [8].

Self-supervised methods rely on several key elements, such as the choice of data, the method for generating different input views, and the definition of the pretext task. The data used plays a critical role in determining the generalization scope of the trained model; generating diverse data views determines which signal traits the model becomes variant or invariant to; and the pretext task defines the supervised learning objective that is used to update the model. While all of these aspects are essential in defining a proper self-supervised task, in this study, we focus on understanding the impact of the pretext task on the overall effectiveness of a pre-trained model in the context of music.

Self-supervised approaches can be clustered into two main categories that are based on the definition of the pretext task [9]. In the first, the self-supervised method is defined for a single view of the input signal. The model is then trained to predict missing or altered parts of the data, which usually leads to it learning useful properties of the data. These methods include masking techniques commonly employed by large language models [10] or spectral inpainting [11]. On the other hand, tasks in the second category rely on several views of the same data point. The model is then trained to generate embeddings that are similar when the input consists of different views of the same data instance. Data augmentation techniques are often used to create these diverse views. However, the challenge lies in ensuring that the models do not collapse into a trivial constant solution, where everything is deemed similar to everything else, as highlighted in [12]. To address this issue, different methods diverge in how they measure agreement and prevent collapse. In the realm of music, contrastive learning, a popular pretext task in this category, which we will outline later in this paper, has proved to be effective for numerous tasks, such as tagging [13, 14], classification [15], and even beat tracking [16] and artist identification [17]. However, more recent pretext tasks that have gained traction in the computer vision community are yet to be benchmarked on popular music information retrieval tasks.

Hence, this paper focuses on the second category of pretext tasks applied to classic music tagging tasks. We pre-train a simple ResNet architecture [18] on the same, large-scale music dataset with five popular pretext tasks: contrastive learning [19], Bootstrap Your Own Latent (BYOL) [20], clustering [21, 22], Barlow Twins [23], and Variance-Invariance-Covariance Regularization (VICReg) [24, 25]. The embeddings generated by each model are then used to train a single-layer MLP model for five downstream tagging tasks. We also report performance on these downstream tasks in a limited-data setting, where only small portions of the training set are used. Our results demonstrate that the models trained upon contrastive learning embeddings consistently outperform the others, though the gap is quite narrow. We hope that these findings will aid researchers and engineers in the music or audio industry in selecting the best-performing pre-trained model for their needs. Furthermore, we open-source the trained models (https://github.com/deezer/multi-view-ssl-benchmark), with the hope that these practitioners will be able to take advantage of the scale of the catalog we used for training, and perhaps investigate the musical features that the generated embeddings encode.

2 Pretext Methodology

Dataset        #Labels  #Train  #Val.  #Test  Ref.
MSD100           100     71388   15618  15281  [26] (a)
MTAT              50     15244    1529   4332  [27]
JamTop50          50     32136   10888  11356  [28] (b)
JamInstrument     40     14395    5466   5115  [28] (b)
JamMood           56      9949    3802   4231  [28] (b)
Table 1: Overview of the downstream datasets. (a) Splits from: https://github.com/minzwon/tag-based-music-retrieval. (b) We use the first provided split for each Jamendo experiment.

This study compares the impact of self-supervised pretext tasks that leverage multiple views of the same data point. Our methodology is, therefore, consistent across the pretraining pipeline. We notably ensure that the data, view generation, and model architecture remain constant throughout all approaches in order to better measure the impact of the pretext definition on downstream performance.

Fig. 1: Downstream results. We apply transfer learning to each task by training an MLP classifier on top of the embeddings generated by the frozen pretext model. We utilize bootstrapping. Each dot represents the metric of a resampled batch. The marker indicates the mean of each result.

2.1 Data

We use an in-house dataset, which consists of approximately 4 million full tracks, to pre-train our models. We aim to make our dataset as musically balanced and diverse as possible in order to expand the models’ generalization ability toward unseen data. To do so, we collect tracks that have been used in past curated playlists from the music streaming service Deezer. These playlists are categorized by mood or genre; our dataset thus contains a vast collection of high-quality sounds and songs that span various genres and periods. We acknowledge that these may be biased toward Western and Billboard music. Each piece of audio is resampled at 16 kHz, normalized, and converted to mono. We then compute a log-magnitude mel-spectrogram with 128 frequency bins to use as input for each of our models.
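A minimal sketch of this input pipeline, assuming librosa for loading and feature extraction; the STFT window and hop sizes are left at library defaults since the paper does not report them:

```python
import librosa
import numpy as np

def log_mel_spectrogram(path: str, sr: int = 16000, n_mels: int = 128) -> np.ndarray:
    # Load as mono and resample to 16 kHz.
    audio, _ = librosa.load(path, sr=sr, mono=True)
    # Peak-normalize the waveform.
    audio = audio / (np.max(np.abs(audio)) + 1e-8)
    # 128-bin mel-spectrogram on a log scale (STFT parameters are assumptions).
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)
```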

In music, multi-view methods rely on the idea that various facets of music, such as songs, albums, or artists, collectively share common features [29], which are sufficiently rich and informative to extract meaningful music-invariant traits. For each self-supervised approach, the input data consists of pairs of four-second audio segments, which we call the anchor and the positive. For each song, an anchor is randomly selected. The corresponding positive is then selected from the same song. We force this segment to be at least four seconds away from the anchor but no further than sixteen seconds away in order to ensure both segments are contextually similar. Furthermore, we use online training: every time we see an audio track, the pairs are constructed on the fly. All self-supervised models are then trained with batches of 256 pairs for 1000 epochs of 512 steps.
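As an illustration, a sketch of this sampling rule, under the assumption that the 4- to 16-second constraint is measured between segment start times and that the positive follows the anchor; the names are ours, not the authors':

```python
import random
import numpy as np

SR = 16000
SEGMENT = 4 * SR    # four-second segments
MIN_GAP = 4 * SR    # at least four seconds away from the anchor
MAX_GAP = 16 * SR   # no further than sixteen seconds away

def sample_pair(track: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # Assumes the track is longer than SEGMENT + MAX_GAP samples.
    anchor_start = random.randint(0, len(track) - SEGMENT - MAX_GAP)
    positive_start = anchor_start + random.randint(MIN_GAP, MAX_GAP)
    anchor = track[anchor_start:anchor_start + SEGMENT]
    positive = track[positive_start:positive_start + SEGMENT]
    return anchor, positive
```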

2.2 Architecture

Our self-supervised learning models have a ResNet backbone architecture and contain a total of 2.8 million trainable parameters. Similarly to [30], we incorporate an attention layer that summarizes the sequential data into a single, 1024-dimensional embedding vector. A projector head is then placed upon the backbone to solve our designated pretext task. This ensures that our generated embeddings are not tailored to the pre-training method; they maintain a broader, more general utility. All the approaches we study in this work use the same head. It comprises two blocks, each consisting of a linear layer, batch normalization, and a ReLU activation. The first linear layer has 1024 units, while the final one has 2048. All our approaches are trained using a single GPU (an NVIDIA GeForce GTX TITAN X).
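A sketch of such a projector head in PyTorch, reading 1024 and 2048 as the output sizes of the two linear layers (an interpretation, not a confirmed detail); the ResNet backbone and attention pooling are omitted:

```python
import torch.nn as nn

def projector(embedding_dim: int = 1024, hidden_dim: int = 1024, out_dim: int = 2048) -> nn.Sequential:
    # Two blocks of linear layer + batch normalization + ReLU.
    return nn.Sequential(
        nn.Linear(embedding_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, out_dim),
        nn.BatchNorm1d(out_dim),
        nn.ReLU(),
    )
```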

Fig. 2: Limited data results. Each dataset’s train set is randomly sampled four times at four different percentages. We report the mean test set metrics obtained for each approach. We also include the results obtained with the full train set to show each model’s performance relative to that reference.

2.3 Pre-training Tasks

Our paper studies how five different self-supervised pretext tasks popular in image processing adapt to the audio domain.

Contrastive learning aims to project similar samples closer together in the embedding space while pushing dissimilar samples apart. To do so, we use the normalized temperature-scaled cross-entropy loss [31] with a temperature of 0.1. In this setting, similarly to [14], the negative examples for each anchor and positive pair consist of the rest of the audio segments in the same batch. Similarity between embeddings is measured with cosine similarity.
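A minimal sketch of this loss (often called NT-Xent) for a batch of B anchor/positive projection pairs; the in-batch negative scheme follows the description above:

```python
import torch
import torch.nn.functional as F

def nt_xent(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # Stack the 2B projections and normalize so dot products are cosine similarities.
    z = F.normalize(torch.cat([anchor, positive], dim=0), dim=1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))  # a segment is never its own negative
    b = anchor.shape[0]
    # The positive of row i is row i + B (and vice versa); all other rows are negatives.
    targets = torch.cat([torch.arange(b) + b, torch.arange(b)])
    return F.cross_entropy(sim, targets)
```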

BYOL [20] uses a teacher-student approach. The student model is encouraged to match a target embedding representation generated by the teacher. The networks are almost identical and start with the same random initialization. The student, however, needs an extra prediction layer for stability reasons. The distance used to compare embeddings is a mean squared error between the normalized student predictions and target teacher projections. At each training step, the student is updated with the normal gradient, whereas the teacher is updated using an exponential moving average (EMA) of recent student weights. Unlike in this work, in [32], a unique, data-augmented segment of audio is used for BYOL pre-training. We hope to explore whether this type of pre-processing leads to more favorable self-supervised learning in the future.
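A sketch of the two BYOL ingredients just described: the mean squared error between normalized student predictions and (detached) teacher projections, and the EMA update of the teacher weights; the momentum value is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def byol_loss(student_pred: torch.Tensor, teacher_proj: torch.Tensor) -> torch.Tensor:
    # teacher_proj is assumed to be detached from the computation graph.
    p = F.normalize(student_pred, dim=1)
    t = F.normalize(teacher_proj, dim=1)
    return ((p - t) ** 2).sum(dim=1).mean()

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.99) -> None:
    # teacher <- momentum * teacher + (1 - momentum) * student
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```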

Clustering [21, 22] groups embeddings based on similarity. It uses a teacher-student approach close to BYOL. However, instead of computing similarity between embeddings, the teacher generates the target class distribution that the student has to match via a cross-entropy loss. Both models are initialized identically and have the same architecture. At each training step, the student is updated normally, whereas the teacher is updated with an EMA. As presented in [21], we use centering and sharpening to avoid collapsing into a single class or a uniform distribution.
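A sketch of this objective in the spirit of [21]: the teacher distribution is centered and sharpened before serving as the cross-entropy target for the student; the temperatures and the center update rate below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def clustering_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, center: torch.Tensor,
                    student_temp: float = 0.1, teacher_temp: float = 0.04) -> torch.Tensor:
    # Centering (subtract a running mean of teacher logits) and sharpening (low temperature).
    teacher_probs = F.softmax((teacher_logits.detach() - center) / teacher_temp, dim=1)
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=1)
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()

@torch.no_grad()
def update_center(center: torch.Tensor, teacher_logits: torch.Tensor, rate: float = 0.9) -> torch.Tensor:
    # EMA of the teacher's mean output, used for centering at the next step.
    return rate * center + (1.0 - rate) * teacher_logits.mean(dim=0)
```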

The last two methods are grounded in imposing statistical properties on the embedding space. Barlow Twins [23] encourages independence between embedding dimensions. The pretext task’s goal is to ensure that the cross-correlation matrix between the anchor and positive embeddings is close to the identity matrix. This reduces the redundancy between the components of the embedding vectors. VICReg [24, 25] combines variance, invariance, and covariance terms. The invariance term encourages proximity between the anchor and positive embedding vectors by minimizing their mean squared distance. The variance term promotes diversity among embedding vectors within a batch. Finally, the covariance term is similar to Barlow Twins, but it focuses exclusively on reducing off-diagonal terms, effectively decorrelating the variables within each embedding.
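Sketches of both objectives, with loss weights set to illustrative values rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def barlow_twins_loss(za: torch.Tensor, zb: torch.Tensor, off_diag_weight: float = 5e-3) -> torch.Tensor:
    n, d = za.shape
    # Standardize each dimension, then compute the d x d cross-correlation matrix.
    za = (za - za.mean(dim=0)) / (za.std(dim=0) + 1e-6)
    zb = (zb - zb.mean(dim=0)) / (zb.std(dim=0) + 1e-6)
    c = (za.t() @ zb) / n
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                      # push diagonal toward 1
    off_diag = (c - torch.diag_embed(torch.diagonal(c))).pow(2).sum()   # push off-diagonal toward 0
    return on_diag + off_diag_weight * off_diag

def vicreg_loss(za: torch.Tensor, zb: torch.Tensor,
                sim_w: float = 25.0, var_w: float = 25.0, cov_w: float = 1.0) -> torch.Tensor:
    n, d = za.shape
    invariance = F.mse_loss(za, zb)                                     # keep the two views close
    def variance(z):                                                    # keep each dimension's std above 1
        return F.relu(1.0 - torch.sqrt(z.var(dim=0) + 1e-4)).mean()
    def covariance(z):                                                  # penalize off-diagonal covariance only
        z = z - z.mean(dim=0)
        cov = (z.t() @ z) / (n - 1)
        off = cov - torch.diag_embed(torch.diagonal(cov))
        return off.pow(2).sum() / d
    return sim_w * invariance + var_w * (variance(za) + variance(zb)) + cov_w * (covariance(za) + covariance(zb))
```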

3 Evaluation

3.1 Downstream Tasks

We apply the same procedure across all the downstream tasks and pretext approaches. We use an MLP that operates directly on the averaged embedding of each track to find a linear separation between classes. Thus, we optimize 1024 × N weights, where N denotes the number of classes of each downstream task. The backbone generating the embeddings remains frozen. We evaluate the effectiveness and generalizability of each pretext approach on a set of five music tagging datasets, which are described in Table 1. We use the standard evaluation metrics: area under the Receiver Operating Characteristic curve (ROC) and mean average precision (mAP).
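A sketch of this probe: a single linear layer mapping the frozen 1024-dimensional track embedding to N tag logits. The multi-label binary cross-entropy objective is our assumption; the paper does not specify the training loss:

```python
import torch.nn as nn

def tagging_probe(embedding_dim: int = 1024, num_classes: int = 50) -> nn.Module:
    # 1024 x N trainable weights (plus N biases), applied to the averaged track embedding.
    return nn.Linear(embedding_dim, num_classes)

criterion = nn.BCEWithLogitsLoss()  # assumed multi-label tagging objective
```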

3.2 Pretext to Downstream Transfer Learning

All downstream tasks are first trained using their full training set. We use a batch size of 256 and 25 epochs of 128 steps. We use the last checkpoint of the pretext backbone and the best downstream model checkpoint, in terms of validation loss, for test set evaluation. Both ROC and mAP are computed using all song classifications in the test set; they therefore do not provide insight into the metrics’ robustness to data variation. To overcome this limitation, we use bootstrapping: we sample the test dataset without replacement several times and compute the mAP and ROC for each sample. We sample 50% of each test set 50 times and report the mean and standard deviation across all samples.
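A sketch of this bootstrapping protocol, using scikit-learn metrics; it assumes every draw contains at least one positive and one negative example per class:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def bootstrap_metrics(y_true: np.ndarray, y_score: np.ndarray,
                      n_draws: int = 50, fraction: float = 0.5, seed: int = 0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    rocs, maps = [], []
    for _ in range(n_draws):
        # Draw 50% of the test set without replacement.
        idx = rng.choice(n, size=int(fraction * n), replace=False)
        rocs.append(roc_auc_score(y_true[idx], y_score[idx], average="macro"))
        maps.append(average_precision_score(y_true[idx], y_score[idx], average="macro"))
    return (np.mean(rocs), np.std(rocs)), (np.mean(maps), np.std(maps))
```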

Results in Figure 1 reveal that the model pre-trained using contrastive learning consistently achieves superior performance compared to those pre-trained with other pretext tasks on both metrics. This observation is noteworthy, particularly since this is not the case in the image domain, which serves as inspiration for the majority of this work. Furthermore, despite other works using contrastive learning with more advanced architectures and incorporating more intricate probes [13, 33], our simple ResNet and MLP combination delivers comparable performance. In second place, clustering exhibits strong performance, most likely due to its ability to create well-defined groupings within the embedding space. We observe an uncharted collapse mode, where the model utilizes only a subset of clusters to partition the feature space, preventing this pretext task from reaching its full potential. In contrast, the BYOL method exhibits lower performance, which can be attributed to its distinct design. Predicting the same embedding space appears to be overly restrictive and stringent in this context. The differing performances of VICReg and Barlow Twins are intriguing. While the simple orthogonalization and diagonalization of Barlow Twins is successful for conventional tagging tasks like MSD100 and JamTop50, it proves insufficient for separating instruments and mood, where VICReg’s regularization is better suited.

3.3 Limited Data Music Tagging

Annotating a music dataset is a time-consuming, error-prone process frequently plagued by inherent ambiguities [34]. As a result, a substantial portion of datasets within the MIR community remain relatively small in scale, which hinders our ability to train deep neural networks on them. In this experiment, our objective is to assess the adaptability of each approach when faced with the constraint of limited data availability. We focus on three datasets: MSD100, MTAT, and JamTop50. We randomly sample each dataset’s train set four times at four different percentages (1%, 5%, 10%, and 20%) and train a different MLP for each iteration and percentage using the same specifications as earlier. We then evaluate their performance on the full test set, as shown in Figure 2. It is interesting to note that all approaches demonstrate decent performance with just 1% of the training set and nearly optimal results with just 10%. Subsequent performance improvements occur at a slower rate. We also observe that, as seen previously, contrastive learning outperforms all other methods. It is, however, worth mentioning that the performance gap between clustering and contrastive methods is less pronounced when using limited data compared to the full dataset. All other pre-training methods perform similarly to one another, but worse than these two methods.

3.4 Training Stability

In our empirical findings, we observed that contrastive learning and Barlow Twins are the most stable methods during training. They both avoid convergence issues. These models also rely on a minimal set of hyperparameters: the temperature for the former and the diagonal and off-diagonal ratios for the latter. On the other hand, VICReg is highly sensitive to hyperparameter choice since it combines multiple losses. This may explain its notable fluctuations in performance. BYOL exhibits instability: we observe substantial fluctuations in both training and validation loss from one epoch to another. It is highly sensitive to the EMA momentum, where slight value variations have a significant impact on the model weights. It is also worth noting that it requires the most time per backpropagation step compared to other approaches. Clustering demonstrates an even greater reliance on hyperparameter selection, as it encompasses a complex interplay of factors, such as the EMA momentum for the teacher and centering updates, the temperature settings for the sharpening and centering processes used to avoid collapse, and weight decay. Striking the right balance among these elements is delicate and often requires the use of multiple schedulers to achieve effective model optimization. We found that the default values for most hyperparameters proposed in the original works for image processing are not well-suited for audio applications. As a result, we introduced new parameter values to mitigate early training plateaus. Further hyperparameter tuning could lead to improved results, a direction that warrants exploration in future investigations. All our settings can be found in the GitHub repository linked to this publication.

4 Conclusion

This paper presents a comparative analysis of various self-supervised pretext tasks for music tagging: contrastive learning, BYOL, Barlow Twins, VICReg, and clustering. A simple ResNet model is pre-trained with these methods. From there, an MLP is trained upon the embeddings generated by each method to solve the downstream tagging tasks selected in this work. Our work highlights the importance of choosing a relevant pretext task that aligns with our specific domain. For our downstream tasks, contrastive learning stands out as the preferred choice. It consistently demonstrates superior performance and relies minimally on hyperparameter tuning. Clustering shows promise. However, its performance is strongly linked to hyperparameter tuning and may be affected by the uncharted collapse we observed. Strategies to mitigate this issue are left for future research. Finally, models pre-trained with BYOL, Barlow Twins, and VICReg do not perform as well as models trained with the two pretext tasks mentioned earlier in this paragraph, even in a limited-data setting. Our models and code are readily accessible; we hope they can be a valuable resource to members of our community who would like to take advantage of the training scale used or investigate the musical features the generated embeddings encode.

References

  • [1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.
  • [2] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” NeurIPS, 2020.
  • [3] Priya Goyal, Quentin Duval, Isaac Seessel, Mathilde Caron, Mannat Singh, Ishan Misra, Levent Sagun, Armand Joulin, and Piotr Bojanowski, “Vision models are more robust and fair when pretrained on uncurated images without supervision,” arXiv:2202.08360, 2022.
  • [4] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song, “Using self-supervised learning can improve model robustness and uncertainty,” NeurIPS, 2019.
  • [5] Luyu Wang, Pauline Luc, Yan Wu, Adria Recasens, Lucas Smaira, Andrew Brock, Andrew Jaegle, Jean-Baptiste Alayrac, Sander Dieleman, et al., “Towards learning universal audio representations,” in IEEE ICASSP, 2022.
  • [6] Linus Ericsson, Henry Gouk, and Timothy M Hospedales, “How well do self-supervised models transfer?,” in IEEE/CVF CVPR, 2021.
  • [7] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  • [8] Relja Arandjelovic and Andrew Zisserman, “Look, listen and learn,” in IEEE ICCV, 2017.
  • [9] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al., “A cookbook of self-supervised learning,” arXiv:2304.12210, 2023.
  • [10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al., “Language models are few-shot learners,” NeurIPS, 2020.
  • [11] Marco Tagliasacchi, Beat Gfeller, Félix de Chaumont Quitry, and Dominik Roblek, “Pre-training audio representations with self-supervision,” IEEE Signal Processing Letters, 2020.
  • [12] Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian, “Understanding dimensional collapse in contrastive self-supervised learning,” in ICLR, 2022.
  • [13] Matthew C McCallum, Filip Korzeniowski, Sergio Oramas, Fabien Gouyon, and Andreas Ehmann, “Supervised and unsupervised learning of audio representations for music understanding,” in ISMIR, 2022.
  • [14] Janne Spijkervet and John Ashley Burgoyne, “Contrastive learning of musical representations,” ISMIR, 2021.
  • [15] Aaqib Saeed, David Grangier, and Neil Zeghidour, “Contrastive learning of general-purpose audio representations,” in IEEE ICASSP, 2021.
  • [16] Dorian Desblancs, Vincent Lostanlen, and Romain Hennequin, “Zero-note samba: Self-supervised beat tracking,” IEEE Trans. on Audio, Speech, and Language Processing, 2023.
  • [17] Hiromu Yakura, Kento Watanabe, and Masataka Goto, “Self-supervised contrastive learning for singing voices,” IEEE Trans. on Audio, Speech, and Language Processing, 2022.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in IEEE CVPR, 2016.
  • [19] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020.
  • [20] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al., “Bootstrap your own latent-a new approach to self-supervised learning,” NeurIPS, 2020.
  • [21] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” NeurIPS, 2020.
  • [22] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, “Emerging properties in self-supervised vision transformers,” CoRR, 2021.
  • [23] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in ICML, 2021.
  • [24] Adrien Bardes, Jean Ponce, and Yann Lecun, “Vicreg: Variance-invariance-covariance regularization for self-supervised learning,” in ICLR, 2022.
  • [25] Adrien Bardes, Jean Ponce, and Yann LeCun, “Vicregl: Self-supervised learning of local visual features,” arXiv:2210.01571, 2022.
  • [26] Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere, “The million song dataset,” in ISMIR, 2011.
  • [27] Edith Law, Kris West, Michael I Mandel, Mert Bay, and J Stephen Downie, “Evaluation of algorithms using games: The case of music tagging.,” in ISMIR, 2009.
  • [28] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra, “The mtg-jamendo dataset for automatic music tagging,” in ICML, 2019.
  • [29] Pablo Alonso-Jiménez, Xavier Serra, and Dmitry Bogdanov, “Music representation learning based on editorial metadata from discogs,” in ISMIR, 2022.
  • [30] Ilaria Manco, Emmanouil Benetos, Elio Quinton, and György Fazekas, “Contrastive audio-language learning for music,” in ISMIR, 2022.
  • [31] Kihyuk Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” NeurIPS, 2016.
  • [32] Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino, “Byol for audio: Self-supervised learning for general-purpose audio representation,” in IEEE IJCNN, 2021.
  • [33] Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, et al., “Mert: Acoustic music understanding model with large-scale self-supervised training,” arXiv:2306.00107, 2023.
  • [34] Rachel M Bittner, Katherine Pasalo, Juan José Bosch, Gabriel Meseguer-Brocal, and David Rubinstein, “vocadito: A dataset of solo vocals with f0, note, and lyric annotations,” arXiv:2110.05580, 2021.