UNFUSED : UNsupervised Finetuning Using SElf supervised Distillation

Abstract

In this paper, we introduce UnFuSeD, a novel approach to leverage self-supervised learning and reduce the need for large amounts of labeled data for audio classification. Unlike prior works, which directly fine-tune a self-supervised pre-trained encoder on a target dataset, we use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step. We first train an encoder using a novel self-supervised learning algorithm (SSL) on an unlabeled audio dataset. Then, we use that encoder to generate pseudo-labels on our target task dataset via clustering the extracted representations. These pseudo-labels are then used to guide self-distillation on a randomly initialized model, which we call unsupervised fine-tuning. Finally, the resultant encoder is fine-tuned on our target task dataset. Through UnFuSeD, we propose the first system that moves away from generic SSL paradigms in literature, which pre-train and fine-tune the same encoder, and presents a novel self-distillation-based system to leverage SSL pre-training for low-resource audio classification. In practice, UnFuSeD achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming all our baselines. Additionally, UnFuSeD allows us to achieve this at a $\approx$ 40% reduction in the number of parameters over the previous state-of-the-art system. We make all our codes publicly available¹¹1https://github.com/Sreyan88/LAPE.

Index Terms— audio, speech, self-supervision

1 Introduction

Self-Supervised Learning (SSL) has proven to be one of the biggest success of the past decade enabling deep learning models to learn useful representations under low-resource labeled data settings. SSL has been adopted successfully in speech [1], vision [2, 3], and text [4] outperforming all prior-art trained with only labeled data supervision on several benchmark datasets [5]. Though current SSL models pre-trained using Masked Acoustic Modeling (MAM) have been shown to generalize well over speech tasks like Automatic Speech Recognition (ASR), Phoneme Recognition (PR), etc., they fail to perform well on non-speech tasks like acoustic scene classification [6]. We list some possible reasons for this phenomenon in Section 2. Thus, we emphasize the importance of learning general-purpose audio representations that can generalize over both speech and non-speech tasks, which is currently largely understudied in the literature compared to SSL in speech using MAM.

In the recent past, researchers have proposed novel algorithms for learning general-purpose audio representations [7, 8, 9]. A common trait among all these systems is that they directly fine-tune the model post-SSL pre-training. However, this direct fine-tuning approach may result in sub-optimal performance due to significant discrepancy between the pre-training and fine-tuning domains [10]. For example, most of these systems perform SSL pre-training on the AudioSet [11] (every day sounds like the sound of a toothbrush) and evaluate their learned representations on tasks like Speaker Verification [12] (human spoken utterances). Additionally, under the linear evaluation setup, we argue that the downstream tasks cannot leverage the SSL representations to their full extent due to their learning capacity being constrained to an affine transform.

Main Contributions: We present UnFuSeD, a new framework to improve downstream audio classification performance in low-resource labeled data settings leveraging SSL. Unlike all prior systems in literature, UnFuSeD does not directly fine-tune an SSL pre-trained model but uses it to extract and cluster audio features to generate pseudo-labels on a downstream task dataset which is then used to perform un-supervised fine-tuning. More precisely, we perform a step of self distillation, guided by the generated pseudo-labels, on a randomly initialized convnet encoder, divided into student and teacher encoders. Finally, post unsupervised fine-tuning, we perform supervised fine-tuning and evaluate downstream task performance on our model linear evaluation setup. Additionally, to pre-train our encoder using SSL, we propose a novel SSL algorithm by modifying over DECAR [13]. Fig.1 shows a clear pictorial representation of our complete training process. We emphasize that UnFuSeD changes the paradigm in which SSL is leveraged to tackle data scarcity and improve downstream task performance. In practice, UnFuSeD achieves state-of-the-art (SOTA) performance on the LAPE Benchmark [7] with an encoder with $\approx$ 40% fewer parameters than the current SOTA model on LAPE.

Refer to caption — Fig. 1: Illustration of UnFuSeD: UnFuSeD follows a 3 step training process from upstream SSL pre-training to downstream task-specific fine-tuning. \raisebox{-0.9pt}{1}⃝ SSL pre-training. We first pre-train an convnet using un-labeled audio using DECAR-v2 (described in Section 3).\raisebox{-0.9pt}{2}⃝ Unsupervised Fine-tuning. We now pass the downstream task-specific data through the upstream model pre-trained in the last stage and extract and cluster these representations to generate pseudo-labels. These pseudo-labels are then used to perform *unsupervised fine-tuning* on a randomly initialized convnet. \raisebox{-0.9pt}{3}⃝ Linear Evaluation. Finally, a task-specific linear head is added to the convnet obtained from the previous step, and we perform supervised fine-tuning on the task-specific labeled dataset keeping the convnet frozen.

2 Related Work

Self Supervised Learning in Speech and Audio. The past decade has seen massive success in self-supervised learning in vision (CV), speech (SLP), and text (NLP), pushing the boundaries of low-resource representation learning for downstream classification [1, 4]. The most common systems for SSL with speech solve a Masked Acoustic Modelling (MAM) task, either using contrastive learning [1], frame reconstruction [14], or pseudo-label prediction [15]. However, recent research has shown that solving MAM makes the model representations mimic human articulatory responses [16], thus making it unsuitable for non-speech tasks. Thus, in the recent past, researchers have proposed novel systems to learn audio representations that can generalize over both speech and non-speech tasks. Following SSL in speech, these systems either solve a contrastive learning-based instance discrimination task [8], a clustering-based pseudo-label prediction task [13], or a reconstruction task [9]. Knowledge distillation has shown great success in CV, and NLP with major applications in model compression [17]. In a supervised setting, researchers have explored knowledge distillation for automatic speech recognition, speech emotion recognition, and speaker verification [18]. DistillHubert [19] was the first work on distilling SSL-based speech models and performs layer-wise knowledge-distillation (KD) to compress a full HuBERT [15]. On the other hand, when both encoder architectures are the same, this is known as self distillation (SD) [20], and has shown impressive results with the student often outperforming the original teacher. However, to our knowledge, no existing work leverages SD for general-purpose self-supervised audio representation learning, and we are the first to explore this through UnFuSeD.

3 Methodology

Fig.1 illustrates our proposed UnFuSeD learning algorithm. Algorithm 1 provides a detailed algorithmic overview of the same. In practice, UnFuSeD has three main steps, namely, (1) Upstream SSL Pre-training, (2) Unsupervised Fine-tuning, and (3) Downstream Supervised Fine-tuning. In the next paragraphs, we describe each step in detail.

(1) Upstream SSL Pre-training. Let $X_{pre}$ be an unlabeled dataset of size $J$ where $X_{pre}$ = $\{x_{1},\cdots,x_{j},\cdots,x_{J}\}$ . In our case, here $J$ = 0.25 million, following the exact pre-training setup proposed by the LAPE benchmark. Our primary aim is to learn general-purpose audio representations from this unlabeled audio dataset. To achieve this, we use a simple convnet-based architecture [21, 22], popular in prior-art [7, 9] for a fair comparison. For upstream SSL pre-training, we propose DECAR-v2, an improved version of DECAR [13], based on findings in [23]. DECAR-v2 has two main steps or phases: (a) Assignment Phase and the (b) Training Phase.
(a) Assignment Phase: The primary purpose of this phase is to obtain “pseudo-labels” $\mathbf{q}$ for every unlabelled audio sample $x\in X_{pre}$ . To achieve this, we first store all the embeddings $\mathbf{g_{\tilde{x}}}$ obtained from our convnet projection head $h_{proj}$ in memory for the entire $X_{pre}$ . After this, we apply Spherical K-means to cluster and get the “pseudo-labels” $\mathbf{q}$ for every $x$ as follows: $\min_{\mathbf{C}\in\mathbb{R}^{d\times K}}\frac{1}{N}\sum_{n=1}^{N}\min_{\mathbf{q}}-\mathbf{g_{\tilde{x}}}^{\top}\mathbf{Cq}$ where C is the Centroid matrix. Both $\mathbf{g_{\tilde{x}}}$ and columns of C are $l_{2}$ normalized. $K$ represents the number of clusters, and $\tilde{x}\approx x$ is an augmented and sampled version of the original audio sample. Additionally, for ConvNet training stability, we keep the prototype head $h_{prot}$ parameters frozen throughout pre-training, and at the end of every assignment phase, the parameters of $h_{prot}$ are replaced by C.
(b) Training Phase: We train the network using supervision from the “pseudo-labels” $\mathbf{q}$ obtained from the assignment phase. To do this, we first obtain the prediction $\mathbf{p}$ using $softmax(z)$ where $z$ is the output of the $h_{prot}$ . Post this step; we minimize the multinomial logistic loss between $\mathbf{p}$ and $\mathbf{q}$ with: $\ell(\mathbf{p},\mathbf{q})=-\sum_{k}\mathbf{q}^{(k)}\log\mathbf{p}^{(k)}$ . The “pseudo-labels” are kept fixed during the training phase and updated for the entire X only once every epoch during the assignment phase. Similar to [23], the assignment phase and training phase take place in isolation only at the first epoch, after which we use the embeddings $\mathbf{g_{\tilde{x}}}$ obtained from the previous epoch. These embeddings are stored in memory at every iteration of an epoch right after the back-propagation step.

(2) Unsupervised Downstream Fine-tuning. After SSL pre-training, we don’t fine-tune the pre-trained convnet $f_{pre}$ on the target task dataset directly but instead, use it for unsupervised fine-tuning on a randomly initialized convnet $f_{sd}$ . We call this step unsupervised fine-tuning as we use the target task dataset but without using its actual labels. Let $D_{targ}=\{X_{targ},Y_{targ}\}$ be the target task labeled dataset of size $I$ where $Y_{targ}$ are labels associated with audio samples $X_{targ}$ . For unsupervised fine-tuning, we first generate $Y_{pseduo}$ by extracting and clustering representations obtained on passing $X_{targ}$ through $f_{pre}$ . DECAR-v2 generates clusterable embeddings, which helps in the $Y_{pseudo}$ generation. We then use $Y_{pseduo}$ to perform self-distillation on $f_{sd}$ . We first divide $f_{sd}$ , which follows a similar architecture to $f_{pre}$ , into a student ( $f^{s}_{sd}$ ) and teacher network ( $f^{t}_{sd}$ ). $f_{sd}$ has 4 individual blocks, where the first 3 make $f^{s}_{sd}$ and the last block makes $f^{t}_{sd}$ . For more details on the architecture of $f^{s}_{sd}$ , we refer our readers to [7, 21]. For self-distillation, we treat each block $b_{i}$ as a separate classifier and add a linear transform $h^{i}_{proj}$ to $b_{i}$ to solve three losses parallelly, KL-divergence $L_{kl}$ , Mean-Square error $L_{mse}$ and Cross Entorpy $L_{ce}$ . $L_{ce}$ ensures that the student blocks correctly classify the pseudo labels $Y_{pseduo}$ and thus utilize the weak supervision knowledge hidden in them. $L_{mse}$ ensures that knowledge of the deepest layers is leveraged to improve feature extraction in shallow layers. $L_{kl}$ ensures that the classification results of student classifiers are similar to that of the teacher classifier. Finally, to optimize our network, we use a weighted average of $L_{kl}$ , $L_{mse}$ and $L_{ce}$ , which we weigh by $\alpha$ , $\beta$ as shown:

L_{all}=L_{ce}+\alpha\sum^{3}_{i=1}L^{i}_{ce}+(1-\alpha)\sum^{3}_{i=1}L^{i}_{kl}+\beta\sum^{3}_{i=1}L^{i}_{mse}

(1)

	$\displaystyle L_{ce}\leftarrow L_{ce}(l,Y_{pseudo}),L^{i}_{ce}\leftarrow L^{i}_{ce}(z^{i},Y_{pseudo}),$
	$\displaystyle L^{i}_{kl}\leftarrow L^{i}_{kl}(z^{i},l),L^{i}_{mse}\leftarrow L^{i}_{mse}(z^{i},f_{sd}(X_{targ})))$
	$\displaystyle where\,z^{i}=h^{i}_{proj}(f^{i}_{sd}(X_{targ})),l=h_{cl}(f_{sd}(X_{targ}))$

// SSL-pretraining

Data: dataset

\mathcal{X}_{pre}

; number of clusters

K

; epoch

\mathcal{E}

; batch size

\mathcal{N}

for $epoch=1$ $to$ $\mathcal{E}$ do

Sample a mini batch

\mathcal{X}^{n}_{pre}

from

\mathcal{X}_{pre}

;

Perform augmentations on

\mathcal{X}_{pre}

to get

\tilde{\mathcal{X}}_{pre}

;

Compute feature embedding obtained from encoder

f_{pre}(\tilde{\mathcal{X}}_{pre})

and obtain

z=h_{proj}(f_{pre}(\tilde{\mathcal{X}}_{pre}))

;

Initialize weights of

h_{prot}

with centroid matrix

C

obtained by

Kmeans(z)

Compute

Y_{pseudo}

for

\mathcal{X}_{pre}

using

C

;

Minimize the cross-entropy

L_{ce}(Y^{n}_{pseudo},\hat{Y}^{n})

where

\hat{Y}^{n}=softmax(h_{prot}(z^{n}))

;

Update

f_{pre}

h_{proj}

using gradient descent;

end for

// Self-Distillation

Data: target dataset

\mathcal{X}_{targ}

; number of classes

t

; epoch

\mathcal{E}

; batch size

\mathcal{N}

for $epoch=1$ $to$ $\mathcal{E}$ do

Sample a mini batch

\mathcal{X}^{n}_{targ}

from

\mathcal{X}_{targ}

;

Compute

Y_{pseudo}

using

Kmeans(f_{pre}(\mathcal{X}_{targ}))

where

K=t

;

Compute

z^{i}

h^{i}_{proj}(f^{i}_{sd}(\mathcal{X}^{n}_{targ}))

and

l

h_{cl}(f_{sd}(\mathcal{X}^{n}_{targ}))

for each Block

b_{i}

where

i\in\{1,2,3\}

;

Compute Cross-Entropy

L^{i}_{ce}(z^{i},Y^{n}_{pseudo})

, KL-divergence

L^{i}_{kl}(z^{i},l)

and MSE

L^{i}_{mse}(z^{i},f_{sd}(X_{targ}))

loss for each Block

b_{i}

where

i\in\{1,2,3\}

(use Eq:1);

Combine all losses

L_{all}

with appropriate parameters

\alpha

\beta

as stated in Eq:1;

Update

f_{sd}

h^{i}_{proj}

for each Block

b_{i}

where

i\in\{1,2,3\}

and

h_{cl}

using gradient descent;

end for

Algorithm 1 UnFuSeD

Table 1: Result comparison of various SSL methods with proposed method DECAR-v2 and UnFuSeD on the linear evaluation setup with frozen encoder. The best results for each task are presented in bold. UnFuSeD outperforms all our baselines.

DT	BYOL-A	SimCLR	DECAR-v1	DeLoRes-S	MoCo	DeLoRes-M	DECAR-v2	UnFuSeD
Speech
SC-V1	$-$	77.3	82.3	86.1	$93.6$	$94.0$	91.6	94.4
SC-V2(12)	91.0	77.2	83.0	85.4	$93.2$	$93.3$	90.6	94.1
SC-V2(35)	92.2	66.0	73.6	80.0	$89.3$	$89.7$	87.2	90.1
LBS	$-$	89.0	91.0	90.0	$95.5$	$95.7$	92.5	97.0
VC	40.1	28.9	25.6	31.2	$42.5$	$45.3$	33.0	50.0
IC	$-$	59.8	63.2	60.7	$65.1$	$65.2$	65.2	66.0
VF	90.2	69.2	74.1	76.5	$87.3$	$88.0$	78.2	89.8
Non-Speech
NS	74.1	61.3	70.7	66.3	$74.7$	$75.0$	69.8	76.4
BSD	$-$	85.2	87.7	86.7	$89.0$	$89.6$	88.5	90.0
TUT	$-$	52.4	62.5	58.6	$66.7$	$65.7$	64.6	66.8
US8K	79.1	69.1	70.1	71.2	$81.2$	$82.7$	73.2	83.2
Average	$-$	66.9	71.2	72.1	$79.8$	$80.4$	75.8	81.6

(3) Supervised Downstream Fine-tuning Post unsupervised downstream fine-tuning, we do supervised downstream fine-tuning on the student model $f_{sd}$ using ${D}_{targ}$ . For a fair comparison with prior-art in this space, we don’t train all the layers of our model and instead just train a task-specific linear head added to the encoder. This method of training is also known as linear evaluation and proves to be an effective technique for evaluating learned audio representations.

4 Experimental Setup

Datasets. In our experiments, we use the exact same upstream and downstream training setups proposed by LAPE [7]. For SSL-based pre-training, we use a balanced subset of 10% of the complete AudioSet (0.2 million) and the FSD50K [24]. For downstream tasks (DT), we evaluate our learned representations on LibriSpeech (LBS) [25] and VoxCeleb (VC) [26] for speaker identification, Speech Commands (SC) v1 and v2 [27] for keyword spotting, VoxForge (VF) [12] for language identification, IEMOCAP (IC) [28] for speech emotion recognition, NSynth [29] for TUT Urban [6] and US8K [30] for acoustic event classification and finally Bird Song Detection (BSD) [31].

Hyperparameter Tuning. For SSL Pre-training (DECAR-v2), we find the optimal values for the number of clusters as 512, learning rate as 0.005, batch size as 512, and number of epochs as 100. Projector $h_{proj}$ performs a $\mathbb{R}^{2048}\rightarrow\mathbb{R}^{512}$ non-linear transformation using multiple linear-layers. For Unsupervised Fine-tuning, we use the learning rate as 0.007, batch size as 512, number of epochs as 50, $\alpha$ as 0.7, and $\beta$ as 0.003. $h_{cf}$ performs a $\mathbb{R}^{2048}\rightarrow\mathbb{R}^{t}$ linear transformation, where $t$ is number of classes in target dataset. Projectors $h^{1}_{proj}$ , $h^{2}_{proj}$ and $h^{3}_{proj}$ perform $\mathbb{R}^{2048}\rightarrow\mathbb{R}^{t}$ , $\mathbb{R}^{1024}\rightarrow\mathbb{R}^{t}$ and $\mathbb{R}^{512}\rightarrow\mathbb{R}^{t}$ non-linear transformations respectively. Finally, for Linear Evaluation, we use the learning rate as 0.001, batch size as 32, and number of epochs as 50. All the hyperparameter choices were made based on an extensive grid search while considering the average performance across all the downstream tasks.

5 Results and Result Analysis

As clearly evident from Table 1, UnFuSeD outperforms all other approaches in literature by a significant margin. Results of BYOL-A were borrowed from their original papers. SimCLR was proposed as the pre-training approach in COLA [32] and was repeated on our convnet encoder using LAPE upstream dataset settings. We hypothesize that the gap in results from the original paper may be due to using a powerful encoder and 10 $\times$ more data from the AudioSet used in the paper. Measuring the effect of change in encoders is beyond the scope of this paper. Our proposed DECAR-v2 outperforms the already proposed DECAR-v1 by a margin of 4.6% (averaged across all tasks). Additionally, UnFuSeD outperforms DECAR-v2 by a margin of 5.8% (averaged across all tasks). Owing to space constraints, we provide results of UnFuSeD with different SSL training frameworks on our GitHub. Additionally, our final convnet encoder $f^{s}_{sd}$ used for downstream task evaluation has $\approx$ 40% fewer parameters than DeLoRes-M [7] (current SOTA system on the LAPE Benchmark).

6 Conclusion

In this paper, we propose UnFuSeD, a novel methodology to leverage SSL for low-resource audio classification. In practice, UnFuSeD significantly outperforms all other approaches in literature on the LAPE audio evaluation benchmark. Additionally, we propose a new SSL algorithm called DECAR-v2 to learn general-purpose audio representations from unlabeled data.

References

[1] Baevski et al., “wav2vec 2.0: A framework for self-supervised learning of speech representations,” NeurIPS 2020, vol. 33, pp. 12449–12460.
[2] Grill et al., “Bootstrap your own latent-a new approach to self-supervised learning,” NeurIPS 2020, vol. 33, pp. 21271–21284.
[3] He et al., “Momentum contrast for unsupervised visual representation learning,” in IEEE CVPR 2020.
[4] Devlin et. al, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[5] Yang et al, “Superb: Speech processing universal performance benchmark,” arXiv preprint arXiv:2105.01051, 2021.
[6] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, “A multi-device dataset for urban acoustic scene classification,” 2018.
[7] Ghosh et al., “Decorrelating feature spaces for learning general-purpose audio representations,” IEEE Journal of Selected Topics in Signal Processing, pp. 1–13, 2022.
[8] Saeed et al., “Contrastive learning of general-purpose audio representations,” in IEEE ICASSP 2021, pp. 3875–3879.
[9] Niizumi et al, “Byol for audio: Self-supervised learning for general-purpose audio representation,” in IEEE IJCNN 2021, pp. 1–8.
[10] Lee et al, “Self-distillation for further pre-training of transformers,” arXiv preprint arXiv:2210.02871, 2022.
[11] Gemmeke et al, “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE ICASSP 2022. IEEE, 2017, pp. 776–780.
[12] Voxforge.org, “Free speech… recognition (linux, windows and mac) - voxforge.org,” accessed 06/25/2014.
[13] Ghosh et al., “Deep clustering for general-purpose audio representations,” arXiv preprint arXiv:2110.08895, 2021.
[14] Liu et al., “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,” in IEEE ICASSP 2020, pp. 6419–6423.
[15] Hsu et al., “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM TASLP, vol. 29, pp. 3451–3460, 2021.
[16] Wu et al, “Speaker-independent acoustic-to-articulatory speech inversion,” arXiv preprint arXiv:2302.06774, 2023.
[17] Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao, “Knowledge distillation: A survey,” vol. 129, no. 6, pp. 1789–1819, jun 2021.
[18] Liu et al., “Self-knowledge distillation via feature enhancement for speaker verification,” in IEEE ICASSP 2022.
[19] Chang et al., “Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit bert,” in IEEE ICASSP 2022.
[20] Pham at al., “Revisiting self-distillation,” arXiv preprint arXiv:2206.08491, 2022.
[21] Koizumi et al., “The ntt dcase2020 challenge task 6 system: Automated audio captioning with keywords and sentence length estimation,” arXiv preprint arXiv:2007.00225, 2020.
[22] Takeuchi et al, “Effects of word-frequency based pre-and post-processings for audio captioning,” arXiv preprint arXiv:2009.11436, 2020.
[23] Caron et al., “Unsupervised learning of visual features by contrasting cluster assignments,” NeurIPS 2022.
[24] Fonseca et al., “Fsd50k: an open dataset of human-labeled sound events,” IEEE/ACM TASLP, vol. 30, pp. 829–852, 2021.
[25] Panayotov et al., “Librispeech: An asr corpus based on public domain audio books,” in IEEE ICASSP 2015, pp. 5206–5210.
[26] Nagrani at al., “Voxceleb: A large-scale speaker identification dataset,” ISCA Interspeech 2017.
[27] Pete Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” 2018.
[28] Busso et al., “Iemocap: Interactive emotional dyadic motion capture database,” LREC 2008, vol. 42, no. 4, pp. 335–359.
[29] Engel et al., “Neural audio synthesis of musical notes with wavenet autoencoders,” in ICML 2017.
[30] Justin Salamon, Christopher Jacoby, and Juan Pablo Bello, “A dataset and taxonomy for urban sound research,” in ACM MM 2014, 2014, p. 1041–1044.
[31] Stowell et al., “Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge,” Methods in Ecology and Evolution 2019.
[32] Wang at al., “Towards learning universal audio representations,” in IEEE ICASSP 2022, pp. 4593–4597.