UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation
Abstract
In this paper, we introduce UnFuSeD, a novel approach to leverage self-supervised learning and reduce the need for large amounts of labeled data for audio classification. Unlike prior works, which directly fine-tune a self-supervised pre-trained encoder on a target dataset, we use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step. We first train an encoder using a novel self-supervised learning (SSL) algorithm on an unlabeled audio dataset. Then, we use that encoder to generate pseudo-labels on our target task dataset by clustering the extracted representations. These pseudo-labels are then used to guide self-distillation on a randomly initialized model, a step we call unsupervised fine-tuning. Finally, the resultant encoder is fine-tuned on our target task dataset. Through UnFuSeD, we propose the first system that moves away from the generic SSL paradigm in the literature, which pre-trains and fine-tunes the same encoder, and present a novel self-distillation-based system to leverage SSL pre-training for low-resource audio classification. In practice, UnFuSeD achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming all our baselines. Additionally, UnFuSeD achieves this with a 40% reduction in the number of parameters over the previous state-of-the-art system. We make all our code publicly available at https://github.com/Sreyan88/LAPE.
Index Terms— audio, speech, self-supervision
1 Introduction
Self-Supervised Learning (SSL) has proven to be one of the biggest successes of the past decade, enabling deep learning models to learn useful representations in low-resource labeled data settings. SSL has been adopted successfully in speech [1], vision [2, 3], and text [4], outperforming all prior art trained with only labeled data supervision on several benchmark datasets [5]. Though current SSL models pre-trained using Masked Acoustic Modeling (MAM) have been shown to generalize well over speech tasks like Automatic Speech Recognition (ASR), Phoneme Recognition (PR), etc., they fail to perform well on non-speech tasks like acoustic scene classification [6]. We list some possible reasons for this phenomenon in Section 2. Thus, we emphasize the importance of learning general-purpose audio representations that can generalize over both speech and non-speech tasks, a problem that is currently largely understudied in the literature compared to SSL in speech using MAM.
In the recent past, researchers have proposed novel algorithms for learning general-purpose audio representations [7, 8, 9]. A common trait among all these systems is that they directly fine-tune the model after SSL pre-training. However, this direct fine-tuning approach may result in sub-optimal performance due to the significant discrepancy between the pre-training and fine-tuning domains [10]. For example, most of these systems perform SSL pre-training on AudioSet [11] (everyday sounds, like the sound of a toothbrush) and evaluate their learned representations on tasks like Speaker Verification [12] (human spoken utterances). Additionally, under the linear evaluation setup, we argue that the downstream tasks cannot leverage the SSL representations to their full extent because their learning capacity is constrained to an affine transform.
Main Contributions: We present UnFuSeD, a new framework to improve downstream audio classification performance in low-resource labeled data settings by leveraging SSL. Unlike all prior systems in the literature, UnFuSeD does not directly fine-tune an SSL pre-trained model; instead, it uses the model to extract and cluster audio features to generate pseudo-labels on a downstream task dataset, which are then used to perform unsupervised fine-tuning. More precisely, we perform a step of self-distillation, guided by the generated pseudo-labels, on a randomly initialized convnet encoder divided into student and teacher encoders. Finally, after unsupervised fine-tuning, we perform supervised fine-tuning and evaluate downstream task performance of our model under the linear evaluation setup. Additionally, to pre-train our encoder using SSL, we propose a novel SSL algorithm obtained by modifying DECAR [13]. Fig. 1 shows a clear pictorial representation of our complete training process. We emphasize that UnFuSeD changes the paradigm in which SSL is leveraged to tackle data scarcity and improve downstream task performance. In practice, UnFuSeD achieves state-of-the-art (SOTA) performance on the LAPE Benchmark [7] with an encoder that has 40% fewer parameters than the current SOTA model on LAPE.

2 Related Work
Self-Supervised Learning in Speech and Audio. The past decade has seen massive success in self-supervised learning in vision (CV), speech (SLP), and text (NLP), pushing the boundaries of low-resource representation learning for downstream classification [1, 4]. The most common systems for SSL in speech solve a Masked Acoustic Modeling (MAM) task, either using contrastive learning [1], frame reconstruction [14], or pseudo-label prediction [15]. However, recent research has shown that solving MAM makes the model representations mimic human articulatory responses [16], making them unsuitable for non-speech tasks. Thus, in the recent past, researchers have proposed novel systems to learn audio representations that can generalize over both speech and non-speech tasks. Following SSL in speech, these systems solve either a contrastive-learning-based instance discrimination task [8], a clustering-based pseudo-label prediction task [13], or a reconstruction task [9]. Knowledge distillation has shown great success in CV and NLP, with major applications in model compression [17]. In a supervised setting, researchers have explored knowledge distillation for automatic speech recognition, speech emotion recognition, and speaker verification [18]. DistilHuBERT [19] was the first work on distilling SSL-based speech models; it performs layer-wise knowledge distillation (KD) to compress a full HuBERT model [15]. When both encoder architectures are the same, the setup is known as self-distillation (SD) [20], which has shown impressive results, with the student often outperforming the original teacher. However, to our knowledge, no existing work leverages SD for general-purpose self-supervised audio representation learning, and we are the first to explore this through UnFuSeD.
3 Methodology
Fig. 1 illustrates our proposed UnFuSeD learning algorithm, and Algorithm 1 provides a detailed algorithmic overview of the same. In practice, UnFuSeD has three main steps: (1) Upstream SSL Pre-training, (2) Unsupervised Fine-tuning, and (3) Downstream Supervised Fine-tuning. We describe each step in detail in the following paragraphs.
(1) Upstream SSL Pre-training. Let $X = \{x_1, \dots, x_N\}$ be an unlabeled dataset of size $N$. In our case, $N$ = 0.25 million, following the exact pre-training setup proposed by the LAPE benchmark. Our primary aim is to learn general-purpose audio representations from this unlabeled audio dataset. To achieve this, we use a simple convnet-based architecture [21, 22], popular in prior-art [7, 9], for a fair comparison. For upstream SSL pre-training, we propose DECAR-v2, an improved version of DECAR [13], based on findings in [23]. DECAR-v2 has two main phases: (a) the Assignment Phase and (b) the Training Phase.
(a) Assignment Phase: The primary purpose of this phase is to obtain a “pseudo-label” $y_n$ for every unlabeled audio sample $x_n \in X$. To achieve this, we first store in memory the embeddings $z_n$ obtained from our convnet's projection head for the entire $X$. After this, we apply Spherical K-means to cluster these embeddings and obtain the “pseudo-label” $y_n$ for every $x_n$ as follows: $\min_{C} \frac{1}{N} \sum_{n=1}^{N} \min_{y_n \in \{0,1\}^{K},\, y_n^{\top}\mathbf{1}=1} \lVert z_n - C\,y_n \rVert_2^2$, where $C$ is the centroid matrix and $y_n$ is the optimal one-hot cluster assignment for $z_n$. Both $z_n$ and the columns of $C$ are $\ell_2$-normalized, $K$ represents the number of clusters, and each $z_n$ is computed from an augmented and sampled version of the original audio sample $x_n$. Additionally, for convnet training stability, we keep the prototype head parameters frozen throughout pre-training, and at the end of every assignment phase, its parameters are replaced by $C$.
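To make the assignment phase concrete, below is a minimal sketch of spherical K-means over pre-extracted projection-head embeddings, assuming PyTorch tensors; the function name, random initialization, and fixed iteration count are illustrative choices and not the exact DECAR-v2 implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def assign_pseudo_labels(embeddings: torch.Tensor, num_clusters: int = 512,
                         n_iters: int = 10):
    """Spherical K-means sketch: cluster L2-normalized embeddings and return
    (pseudo_labels, centroid_matrix). Illustrative, not the exact DECAR-v2 code."""
    z = F.normalize(embeddings, dim=1)                 # (N, D), unit-norm rows
    init = torch.randperm(z.size(0))[:num_clusters]    # random init from samples
    centroids = z[init].clone()                        # (K, D), unit-norm rows
    for _ in range(n_iters):
        # For unit-norm vectors, maximizing the dot product is equivalent to
        # minimizing the Euclidean distance in the clustering objective.
        labels = (z @ centroids.t()).argmax(dim=1)     # (N,) cluster pseudo-labels
        for k in range(num_clusters):
            members = z[labels == k]
            if members.shape[0] > 0:                   # keep old centroid if empty
                centroids[k] = F.normalize(members.mean(dim=0), dim=0)
    return labels, centroids
```

Under this sketch, the returned centroid matrix would be what replaces the frozen prototype head's parameters at the end of each assignment phase, as described above.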
(b) Training Phase: We train the network using supervision from the “pseudo-labels” obtained in the assignment phase. To do this, we first obtain the prediction $p_n$ by passing $z_n$, the output of the projection head, through the prototype head followed by a softmax. Post this step, we minimize the multinomial logistic loss between $p_n$ and $y_n$: $-\frac{1}{N} \sum_{n=1}^{N} y_n^{\top} \log p_n$. The “pseudo-labels” are kept fixed during the training phase and updated for the entire $X$ only once every epoch, during the assignment phase. Similar to [23], the assignment phase and training phase take place in isolation only at the first epoch, after which we use the embeddings obtained from the previous epoch. These embeddings are stored in memory at every iteration of an epoch, right after the back-propagation step.
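A corresponding sketch of the training-phase objective is shown below; it assumes a frozen linear prototype head whose weights are overwritten with the centroid matrix after every assignment phase, and the layer sizes and function names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

proj_dim, num_clusters = 256, 512                       # illustrative sizes
prototypes = nn.Linear(proj_dim, num_clusters, bias=False)
prototypes.weight.requires_grad_(False)                 # frozen during pre-training

@torch.no_grad()
def load_centroids(centroids: torch.Tensor) -> None:
    """Replace the prototype weights with the centroid matrix after assignment."""
    prototypes.weight.copy_(centroids)                  # centroids: (K, D)

def training_phase_loss(projections: torch.Tensor,
                        pseudo_labels: torch.Tensor) -> torch.Tensor:
    """Multinomial logistic loss between prototype predictions and pseudo-labels."""
    logits = prototypes(F.normalize(projections, dim=1))  # (B, K)
    return F.cross_entropy(logits, pseudo_labels)         # labels are cluster ids
```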
(2) Unsupervised Downstream Fine-tuning. After SSL pre-training, we do not fine-tune the pre-trained convnet on the target task dataset directly; instead, we use it for unsupervised fine-tuning of a randomly initialized convnet. We call this step unsupervised fine-tuning because we use the target task dataset but not its actual labels. Let the target task labeled dataset consist of audio samples and their associated labels. For unsupervised fine-tuning, we first generate pseudo-labels by extracting and clustering the representations obtained by passing the target task audio through the pre-trained convnet; DECAR-v2 produces clusterable embeddings, which helps in this pseudo-label generation. We then use these pseudo-labels to perform self-distillation on the randomly initialized convnet. We first divide this convnet, which follows an architecture similar to the pre-trained one, into a student and a teacher network. The convnet has 4 individual blocks, where the first 3 make up the student and the last block makes up the teacher. For more details on the architecture, we refer our readers to [7, 21]. For self-distillation, we treat each student block as a separate classifier by adding a linear transform to it and solve three losses in parallel: a KL-divergence loss $\mathcal{L}_{KL}$, a Mean-Squared Error loss $\mathcal{L}_{MSE}$, and a Cross-Entropy loss $\mathcal{L}_{CE}$. $\mathcal{L}_{CE}$ ensures that the student blocks correctly classify the pseudo-labels and thus utilize the weak supervision knowledge hidden in them. $\mathcal{L}_{MSE}$ ensures that knowledge of the deepest layers is leveraged to improve feature extraction in the shallow layers. $\mathcal{L}_{KL}$ ensures that the classification results of the student classifiers are similar to those of the teacher classifier. Finally, to optimize our network, we use a weighted average of $\mathcal{L}_{KL}$, $\mathcal{L}_{MSE}$, and $\mathcal{L}_{CE}$, weighted by $\alpha$ and $\lambda$, as shown:

$\mathcal{L} = (1 - \alpha)\,\mathcal{L}_{CE} + \alpha\,\mathcal{L}_{KL} + \lambda\,\mathcal{L}_{MSE}$   (1)
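As an illustration of how Eq. (1) could be computed for a single student block, the sketch below combines the three losses; the (1 − α)/α/λ weighting, the absence of a distillation temperature, and the assumption that student and teacher features are already projected to a common dimensionality are ours, not necessarily the exact UnFuSeD implementation.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits: torch.Tensor,
                           student_feats: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           teacher_feats: torch.Tensor,
                           pseudo_labels: torch.Tensor,
                           alpha: float = 0.7, lam: float = 0.003) -> torch.Tensor:
    """Weighted combination of CE, KL, and MSE for one student block (sketch)."""
    # L_CE: the student block classifies the cluster pseudo-labels.
    l_ce = F.cross_entropy(student_logits, pseudo_labels)
    # L_KL: the student's class distribution matches the (detached) teacher's.
    l_kl = F.kl_div(F.log_softmax(student_logits, dim=1),
                    F.softmax(teacher_logits.detach(), dim=1),
                    reduction="batchmean")
    # L_MSE: shallow-block features mimic the deepest (teacher) features;
    # assumes both were already projected to the same dimensionality.
    l_mse = F.mse_loss(student_feats, teacher_feats.detach())
    return (1.0 - alpha) * l_ce + alpha * l_kl + lam * l_mse
```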
Table 1: Downstream task (DT) results (%) on the LAPE Benchmark under linear evaluation.
DT | BYOL-A | SimCLR | DECAR-v1 | DeLoRes-S | MoCo | DeLoRes-M | DECAR-v2 | UnFuSeD
Speech | ||||||||
SC-V1 | 77.3 | 82.3 | 86.1 | 91.6 | 94.4 | |||
SC-V2(12) | 91.0 | 77.2 | 83.0 | 85.4 | 90.6 | 94.1 | ||
SC-V2(35) | 92.2 | 66.0 | 73.6 | 80.0 | 87.2 | 90.1 | ||
LBS | 89.0 | 91.0 | 90.0 | 92.5 | 97.0 | |||
VC | 40.1 | 28.9 | 25.6 | 31.2 | 33.0 | 50.0 | ||
IC | 59.8 | 63.2 | 60.7 | 65.2 | 66.0 | |||
VF | 90.2 | 69.2 | 74.1 | 76.5 | 78.2 | 89.8 | ||
Non-Speech | ||||||||
NS | 74.1 | 61.3 | 70.7 | 66.3 | 69.8 | 76.4 | ||
BSD | 85.2 | 87.7 | 86.7 | 88.5 | 90.0 | |||
TUT | 52.4 | 62.5 | 58.6 | 64.6 | 66.8 | |||
US8K | 79.1 | 69.1 | 70.1 | 71.2 | 73.2 | 83.2 | ||
Average | 66.9 | 71.2 | 72.1 | 75.8 | 81.6 |
(3) Supervised Downstream Fine-tuning. After unsupervised downstream fine-tuning, we perform supervised downstream fine-tuning on the student model using the labeled target task dataset. For a fair comparison with prior-art in this space, we do not train all the layers of our model; instead, we train only a task-specific linear head added to the encoder. This method of training is also known as linear evaluation and is an effective technique for evaluating learned audio representations.
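As a concrete illustration of this linear evaluation protocol, the following sketch freezes the fine-tuned student encoder and trains only an attached linear head; the encoder interface, feature dimension, and optimizer are assumptions, with the learning rate matching the value reported in Section 4.

```python
import torch
import torch.nn as nn

def linear_evaluation_model(encoder: nn.Module, feat_dim: int,
                            num_classes: int) -> nn.Module:
    """Freeze the encoder and attach a trainable task-specific linear head.

    Assumes the encoder maps a batch of audio inputs to (B, feat_dim) features.
    """
    for p in encoder.parameters():
        p.requires_grad_(False)
    encoder.eval()                                   # keep frozen statistics fixed
    return nn.Sequential(encoder, nn.Linear(feat_dim, num_classes))

# Only the linear head is optimized; the lr follows the value in Section 4.
# model = linear_evaluation_model(student_encoder, feat_dim=2048, num_classes=35)
# optimizer = torch.optim.Adam(model[-1].parameters(), lr=1e-3)
```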
4 Experimental Setup
Datasets. In our experiments, we use the exact same upstream and downstream training setups proposed by LAPE [7]. For SSL-based pre-training, we use a balanced subset of 10% of the complete AudioSet (roughly 0.2 million samples) and FSD50K [24]. For downstream tasks (DT), we evaluate our learned representations on LibriSpeech (LBS) [25] and VoxCeleb (VC) [26] for speaker identification, Speech Commands (SC) v1 and v2 [27] for keyword spotting, VoxForge (VF) [12] for language identification, IEMOCAP (IC) [28] for speech emotion recognition, NSynth [29] for musical instrument classification, TUT Urban [6] for acoustic scene classification, US8K [30] for acoustic event classification, and finally Bird Song Detection (BSD) [31].
Hyperparameter Tuning. For SSL pre-training (DECAR-v2), we find the optimal number of clusters to be 512, the learning rate 0.005, the batch size 512, and the number of epochs 100. The projection head performs a non-linear transformation using multiple linear layers. For unsupervised fine-tuning, we use a learning rate of 0.007, a batch size of 512, 50 epochs, $\alpha$ = 0.7, and $\lambda$ = 0.003. The linear transform added to each student block maps its output to the number of classes in the target dataset, and the projectors attached to the student blocks perform non-linear transformations. Finally, for linear evaluation, we use a learning rate of 0.001, a batch size of 32, and 50 epochs. All hyperparameter choices were made based on an extensive grid search while considering the average performance across all downstream tasks.
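For quick reference, the hyperparameters above can be collected into a configuration sketch; the key names below are illustrative and do not mirror the actual codebase.

```python
# Hyperparameters as reported in Section 4 (key names are illustrative).
UNFUSED_CONFIG = {
    "ssl_pretraining": {            # DECAR-v2 upstream pre-training
        "num_clusters": 512, "lr": 0.005, "batch_size": 512, "epochs": 100,
    },
    "unsupervised_finetuning": {    # self-distillation on cluster pseudo-labels
        "lr": 0.007, "batch_size": 512, "epochs": 50, "alpha": 0.7, "lambda": 0.003,
    },
    "linear_evaluation": {          # frozen encoder + linear head
        "lr": 0.001, "batch_size": 32, "epochs": 50,
    },
}
```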
5 Results and Result Analysis
As is clearly evident from Table 1, UnFuSeD outperforms all other approaches in the literature by a significant margin. Results for BYOL-A are borrowed from the original papers. SimCLR was proposed as the pre-training approach in COLA [32] and was re-implemented with our convnet encoder under the LAPE upstream dataset settings. We hypothesize that the gap from the results in the original paper may be due to its use of a more powerful encoder and 10× more AudioSet data than in our setup. Measuring the effect of a change in encoders is beyond the scope of this paper. Our proposed DECAR-v2 outperforms the previously proposed DECAR-v1 by a margin of 4.6% (averaged across all tasks). Additionally, UnFuSeD outperforms DECAR-v2 by a margin of 5.8% (averaged across all tasks). Owing to space constraints, we provide results of UnFuSeD with different SSL training frameworks on our GitHub. Additionally, our final convnet encoder used for downstream task evaluation has 40% fewer parameters than DeLoRes-M [7] (the current SOTA system on the LAPE Benchmark).
6 Conclusion
In this paper, we propose UnFuSeD, a novel methodology to leverage SSL for low-resource audio classification. In practice, UnFuSeD significantly outperforms all other approaches in the literature on the LAPE audio evaluation benchmark. Additionally, we propose a new SSL algorithm, DECAR-v2, to learn general-purpose audio representations from unlabeled data.
References
- [1] Baevski et al., “wav2vec 2.0: A framework for self-supervised learning of speech representations,” NeurIPS 2020, vol. 33, pp. 12449–12460.
- [2] Grill et al., “Bootstrap your own latent-a new approach to self-supervised learning,” NeurIPS 2020, vol. 33, pp. 21271–21284.
- [3] He et al., “Momentum contrast for unsupervised visual representation learning,” in IEEE CVPR 2020.
- [4] Devlin et al., “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [5] Yang et al., “Superb: Speech processing universal performance benchmark,” arXiv preprint arXiv:2105.01051, 2021.
- [6] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, “A multi-device dataset for urban acoustic scene classification,” 2018.
- [7] Ghosh et al., “Decorrelating feature spaces for learning general-purpose audio representations,” IEEE Journal of Selected Topics in Signal Processing, pp. 1–13, 2022.
- [8] Saeed et al., “Contrastive learning of general-purpose audio representations,” in IEEE ICASSP 2021, pp. 3875–3879.
- [9] Niizumi et al., “Byol for audio: Self-supervised learning for general-purpose audio representation,” in IEEE IJCNN 2021, pp. 1–8.
- [10] Lee et al., “Self-distillation for further pre-training of transformers,” arXiv preprint arXiv:2210.02871, 2022.
- [11] Gemmeke et al., “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE ICASSP 2017, pp. 776–780.
- [12] Voxforge.org, “Free speech… recognition (linux, windows and mac) - voxforge.org,” accessed 06/25/2014.
- [13] Ghosh et al., “Deep clustering for general-purpose audio representations,” arXiv preprint arXiv:2110.08895, 2021.
- [14] Liu et al., “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,” in IEEE ICASSP 2020, pp. 6419–6423.
- [15] Hsu et al., “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM TASLP, vol. 29, pp. 3451–3460, 2021.
- [16] Wu et al., “Speaker-independent acoustic-to-articulatory speech inversion,” arXiv preprint arXiv:2302.06774, 2023.
- [17] Gou et al., “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021.
- [18] Liu et al., “Self-knowledge distillation via feature enhancement for speaker verification,” in IEEE ICASSP 2022.
- [19] Chang et al., “Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit bert,” in IEEE ICASSP 2022.
- [20] Pham et al., “Revisiting self-distillation,” arXiv preprint arXiv:2206.08491, 2022.
- [21] Koizumi et al., “The ntt dcase2020 challenge task 6 system: Automated audio captioning with keywords and sentence length estimation,” arXiv preprint arXiv:2007.00225, 2020.
- [22] Takeuchi et al., “Effects of word-frequency based pre- and post-processings for audio captioning,” arXiv preprint arXiv:2009.11436, 2020.
- [23] Caron et al., “Unsupervised learning of visual features by contrasting cluster assignments,” NeurIPS 2020.
- [24] Fonseca et al., “Fsd50k: an open dataset of human-labeled sound events,” IEEE/ACM TASLP, vol. 30, pp. 829–852, 2021.
- [25] Panayotov et al., “Librispeech: An asr corpus based on public domain audio books,” in IEEE ICASSP 2015, pp. 5206–5210.
- [26] Nagrani et al., “Voxceleb: A large-scale speaker identification dataset,” in ISCA Interspeech 2017.
- [27] Pete Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” 2018.
- [28] Busso et al., “Iemocap: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
- [29] Engel et al., “Neural audio synthesis of musical notes with wavenet autoencoders,” in ICML 2017.
- [30] Justin Salamon, Christopher Jacoby, and Juan Pablo Bello, “A dataset and taxonomy for urban sound research,” in ACM MM 2014, 2014, p. 1041–1044.
- [31] Stowell et al., “Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge,” Methods in Ecology and Evolution 2019.
- [32] Wang et al., “Towards learning universal audio representations,” in IEEE ICASSP 2022, pp. 4593–4597.