Augmented Contrastive Self-Supervised Learning for Audio Invariant Representations
Abstract
Improving generalization is a major challenge in audio classification due to labeled data scarcity. Self-supervised learning (SSL) methods tackle this by leveraging unlabeled data to learn useful features for downstream classification tasks. In this work, we propose an augmented contrastive SSL framework to learn invariant representations from unlabeled data. Our method applies various perturbations to the unlabeled input data and utilizes contrastive learning to learn representations robust to such perturbations. Experimental results on the Audioset and DESED datasets show that our framework significantly outperforms state-of-the-art SSL and supervised learning methods on sound/event classification tasks.
Index Terms— self-supervised learning, contrastive learning, audio data augmentation, feature invariance
1 Introduction
Improving generalization is a key challenge in supervised learning for audio classification [1]. The traditional approach to this problem is to obtain more labeled training data and learn invariant features via shared labels. However, data labeling is highly expensive, and audio recordings are usually only weakly labeled. This limits the availability of the labeled data needed to train models in a supervised fashion.
Recently, self-supervised learning (SSL) methods have addressed the labeled-data scarcity issue by leveraging unlabeled data to learn useful representations and transferring the pretrained features to subsequent supervised tasks [2, 3, 4]. In this approach, the invariance of the pretrained features to transformations is crucial for downstream tasks. Recent works have proposed methods to encourage invariance for vision tasks; however, for audio applications, learning invariant representations without labeled data remains a challenge.
Inspired by recent advances in SSL for audio, we propose an augmented contrastive learning framework that learns invariant features from unlabeled data. In this approach, the embeddings are learnt by comparing multiple augmented audio segments from the same or different recordings in a contrastive fashion: the model encourages similarity between representations learnt from the same audio recording and discourages similarity between those from different recordings. We incorporate various types of perturbations of the input data and minimize a contrastive loss to learn representations robust to such perturbations. By exploiting a vast amount of unlabeled data, this produces representations invariant to nuisance factors and improves generalization of the supervised downstream task given the same amount of labeled data. Our experimental results show that our finetuned model outperforms the supervised network and a naive SSL approach for sound-event classification on the Audioset [5] and DESED [6] datasets.
Prior work. Recently proposed self-supervised learning frameworks in vision and audio have shown great promise in learning useful representations for downstream tasks. Most of these frameworks use contrastive learning approaches that rely on negative sampling [2, 3, 4, 7, 8, 9]; some, however, avoid negative samples by introducing architectural changes that compensate for their absence [10, 11]. In vision, besides altering the network architecture, different augmentation techniques have been used to encourage feature invariance and have delivered better performance than fully supervised training [4, 7, 8]. Inspired by this success, several SSL methods [9, 12, 13, 14] have been presented to exploit massive amounts of unlabeled data to learn audio representations.
The framework most related to this work is COLA [9], which learns general-purpose audio representations in a contrastive fashion. However, COLA has no mechanism to enforce feature invariance. This explains why, despite pretraining its features on a vast amount of unlabeled data, its results are only comparable to, or slightly better than, supervised learning on the same labeled dataset, as shown in our experiments.
Outline. The rest of the paper is organized as follows. In section 2, we introduce our augmented contrastive learning framework and its modules. There, we describe the augmentation strategies used in our work. In section 3, we present the experimental results and show that our augmented framework outperforms state-of-the-art SSL and supervised training.
2 Contrastive Learning Framework for Audio/Speech
The general contrastive learning pipeline is illustrated in Figure 1. We learn audio/speech representations by pretraining a network using unlabeled data. As shown in Figure 1, the pipeline includes: 1) an augmentation module, 2) an Encoder network with a projection head, and 3) a loss function.
During pretraining of the contrastive network, we maximize the similarity between latent representations of the augmented segments from the same audio clip (positive pair) and minimize that between augmented representations of different clips (negative pairs). The augmentation module serves as a regularizer encouraging the contrastive network to learn invariant representations. The contrastive feature extractor is later used in a downstream task such as sound event classification.
Augmentation: This module applies a series of transformations to the input audio/speech signal. Given a batch of size $N$, for any signal $x$ in the batch the module produces $N$ pairs of data: one positive pair and $N-1$ negative pairs. Let $t, t' \sim \mathcal{T}$, where $\mathcal{T}$ is the set of transformations applied by this module, and let $\mathcal{B}$ denote the selected batch. We denote the two augmented views of a clip $x \in \mathcal{B}$ by $\tilde{x} = t(x)$ and $\tilde{x}' = t'(x)$.

The positive and negative pairs are then defined as follows: $(\tilde{x}, \tilde{x}')$ is a positive pair when both views are drawn from the same clip $x$, while $(\tilde{x}, t'(x^{-}))$ with $x^{-} \in \mathcal{B} \setminus \{x\}$ is a negative pair.
The augmentation module outputs a positive or negative pair of transformed signals, a pair of mel-spectrograms in our case, that is fed to the encoder. We discuss our proposed transformations in detail in section 2.1.
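As a concrete illustration, the sketch below (in PyTorch-style code, with hypothetical helpers `sample_segment` and `augment` standing in for the cropping and transformation steps described above) shows one way anchor, positive, and negative views can be formed from a batch; it is not the authors' exact implementation.

```python
import torch

def make_views(batch, sample_segment, augment):
    """Form anchor/positive views for each clip in a batch of raw audio.

    batch: tensor of shape (N, num_samples) with N unlabeled clips.
    sample_segment: callable cropping a random 1-second chunk from a clip.
    augment: callable applying the stochastic transformations t ~ T.
    """
    anchors = torch.stack([augment(sample_segment(x)) for x in batch])
    positives = torch.stack([augment(sample_segment(x)) for x in batch])
    # For anchor i, positives[i] forms the positive pair; positives[j] with
    # j != i come from different clips and serve as the N-1 negative pairs.
    return anchors, positives
```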
Encoder: In our model, the feature extractor is a convolutional encoder that receives mel-spectrograms as its input. The extracted features are passed through a projection network, a multi-layer perceptron (MLP), and then used in the contrastive loss. Let $f(\cdot)$ be the encoder network that maps each input into the latent space, and let $g(\cdot)$ represent the projection head. For each pair of inputs $(\tilde{x}, \tilde{x}')$, we notate

$$z = g(f(\tilde{x})), \qquad z' = g(f(\tilde{x}')). \tag{1}$$
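A minimal sketch of Eq. (1) with a generic encoder module is given below; the projection dimension and tanh activation are placeholders borrowed from the COLA setup rather than values stated in this paper.

```python
import torch
import torch.nn as nn

class ContrastiveModel(nn.Module):
    """Encoder f(.) followed by a projection head g(.), as in Eq. (1)."""

    def __init__(self, encoder: nn.Module, feat_dim: int, proj_dim: int = 512):
        super().__init__()
        self.encoder = encoder            # f: mel-spectrogram -> feature vector
        self.projection = nn.Sequential(  # g: shallow projection head (placeholder design)
            nn.Linear(feat_dim, proj_dim),
            nn.Tanh(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        h = self.encoder(mel)             # h = f(x~)
        return self.projection(h)         # z = g(f(x~))
```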
We follow the COLA’s [9] framework for the feature extractor network architecture in our work. The extracted features are then used for optimizing the contrastive loss function via a similarity measure.
Similarity measure: As mentioned earlier, we aim to maximize the similarity between two augmented segments of the same sound clip and minimize it for segments from different clips. Common choices of similarity measure include the dot product and cosine similarity. In our framework, we choose a bilinear transformation, which has been used before in [9, 3]. The score is calculated as

$$s(z, z') = z^{\top} W z', \tag{2}$$

where $W$ denotes the bilinear parameters.
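The bilinear score in Eq. (2) can be computed for all anchor/candidate pairs in a batch at once; a sketch with a learnable parameter matrix `W`:

```python
import torch
import torch.nn as nn

class BilinearSimilarity(nn.Module):
    """Pairwise bilinear scores s(z, z') = z^T W z' (Eq. (2))."""

    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, z: torch.Tensor, z_prime: torch.Tensor) -> torch.Tensor:
        # z: (N, d) anchor embeddings, z_prime: (N, d) candidate embeddings.
        # Returns an (N, N) matrix whose (i, j) entry is s(z_i, z'_j).
        return z @ self.W @ z_prime.t()
```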
Loss function: Our contrastive loss is the multi-class cross entropy over the similarity scores. For a given anchor with positive embedding $z^{+}$ and negative embeddings $\{z^{-}_{j}\}_{j=1}^{N-1}$, the loss is

$$\mathcal{L} = -\log \frac{\exp\!\big(s(z, z^{+})\big)}{\exp\!\big(s(z, z^{+})\big) + \sum_{j=1}^{N-1} \exp\!\big(s(z, z^{-}_{j})\big)}, \tag{3}$$

where $z$, $z^{+}$, and $z^{-}_{j}$ are the projected embeddings from Eq. (1). This loss is closely related to the InfoNCE loss [3].
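With the score matrix from the previous sketch, Eq. (3) reduces to a row-wise softmax cross entropy whose target for anchor $i$ is its own positive (the diagonal entry); a minimal sketch:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: (N, N) matrix with scores[i, j] = s(z_i, z'_j).

    Entry (i, i) corresponds to the positive pair of anchor i, and the other
    N-1 entries in row i are its negatives, so Eq. (3) reduces to multi-class
    cross entropy with target class i for each row.
    """
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```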
2.1 Proposed Augmentation Module
Unlike traditional augmentation techniques that aim to enlarge the training dataset with artificial samples, we combine augmentation and contrastive training to learn representations invariant to nuisance factors. In particular, our proposed augmentation module generates the positive and negative pairs for the contrastive network. The module contains a series of transformations that, via the contrastive loss, preserve useful information and discard nuisance factors in the input audio/speech signals. The input to the module is a 10-second audio/speech clip. We randomly choose a 1-second chunk of this clip and add noise to it; we then compute the mel-spectrogram of this segment to construct the anchor feature. The positive feature is another 1-second segment from the same clip that is passed through the augmentation blocks. Finally, we choose a 1-second segment from another clip in the batch, different from the clip from which the anchor and positive features are sampled, as the negative feature. This negative sample is also fed through the augmentation blocks. Figure 2 illustrates the transformations in the augmentation module, described below.
Time stretching: This block randomly speeds up or slows down the original audio clip in time by a random stretch factor. It makes the network robust to temporal variations in the input. Note that for the positive and negative pairs, we choose the 1-second segments after this block.
RIR filtering: This block passes the 1-second segment through a room impulse response filter chosen from [15]. This transformation makes the contrastive features invariant to different recording environments.
Time/Freq masking: This operation applies time and frequency masking to the computed mel-spectrograms. It encourages the contrastive network to learn representations robust to large distortions in the input.
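The sketch below shows one plausible realization of the three blocks using `torchaudio`; the stretch handling, RIR normalization, and mask widths are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F
import torchaudio

def time_stretch(wave: torch.Tensor, sr: int, rate: float) -> torch.Tensor:
    # Speed the clip up (rate > 1) or slow it down (rate < 1) by resampling;
    # this simple variant also shifts pitch.
    return torchaudio.functional.resample(wave, orig_freq=sr, new_freq=int(sr / rate))

def rir_filter(segment: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    # Convolve a 1-second segment with a room impulse response (RIR).
    rir = rir / rir.norm()                       # energy-normalize the RIR
    out = F.conv1d(segment.view(1, 1, -1),
                   rir.flip(-1).view(1, 1, -1),
                   padding=rir.numel() - 1)
    return out.view(-1)[: segment.numel()]       # keep the original segment length

# SpecAugment-style masking applied to the mel-spectrogram; mask widths are placeholders.
spec_masking = torch.nn.Sequential(
    torchaudio.transforms.FrequencyMasking(freq_mask_param=8),
    torchaudio.transforms.TimeMasking(time_mask_param=10),
)
```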
Even though each augmentation type leads to discriminative features benefiting certain classes in the downstream classification task, our ablation study indicates that a combination of some of these transformations produces the highest classification performance averaged over all classes in the considered dataset.
2.2 Downstream Tasks
We perform downstream tasks for sound/event classification on both the DESED [6] and Audioset [5] datasets, aiming to improve generalization with the proposed augmentation module. As both test datasets are imbalanced, meaning the class labels are unequally represented, standard evaluation metrics such as accuracy are unreliable and misleading. Therefore, we use the F1 score and the weighted average precision (wAP) score to evaluate model performance. These metrics are calculated from the precision and recall scores defined below:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad \text{F1-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{4}$$
Here, TP, FP, and FN correspond to the true positives, false positives, and false negatives for each class label, respectively. The weighted average precision (wAP) score is calculated by taking the weighted mean of the per-class precision scores, where the weights are the numbers of true instances for each class label.
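These metrics map directly onto scikit-learn; a sketch is given below (the choice of macro averaging for the aggregate F1 is an assumption, since the averaging scheme is not stated above).

```python
from sklearn.metrics import f1_score, precision_score

def evaluate(y_true, y_pred):
    """y_true, y_pred: 1-D arrays of integer class labels on the test set."""
    # Aggregate F1 over classes (macro averaging assumed here).
    f1 = f1_score(y_true, y_pred, average="macro")
    # wAP: per-class precision averaged with weights equal to the number of
    # true instances of each class ("weighted" averaging in scikit-learn).
    wap = precision_score(y_true, y_pred, average="weighted", zero_division=0)
    return f1, wap
```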
3 Experiments
3.1 Experimental Setup and Datasets
We pretrain our contrastive network on the unbalanced Audioset dataset and transfer it to sound-event classification on a subset of Audioset and on the DESED [6] dataset. Audioset [5] consists of over 2 million 10-second audio clips from YouTube videos, weakly labeled over more than 500 classes. We do not use these labels when pretraining the self-supervised network. This dataset has been used for pretraining SSL models in the literature [13, 11, 9].
We perform two types of transfer for the downstream task: (1) freezing the encoder and training a linear classifier on top of it, and (2) finetuning the entire network.
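For the frozen-encoder transfer, only a linear classifier on top of the pretrained features is trained; a minimal sketch, reusing the hypothetical `ContrastiveModel` from the earlier sketch:

```python
import torch.nn as nn

def linear_probe(model, feat_dim: int, num_classes: int) -> nn.Module:
    # Freeze the pretrained encoder so only the classifier is updated.
    for p in model.encoder.parameters():
        p.requires_grad = False
    # Linear classifier trained on top of the frozen features; for the
    # finetuning transfer, skip the freezing step and train everything.
    return nn.Linear(feat_dim, num_classes)
```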
The mel-spectrograms computed in the augmentation module use fixed window and hop sizes and a fixed number of mel-frequency bins per 1-second segment. Following COLA, the encoder is EfficientNet-B0 [16] and the projection head is a fully-connected layer. We use ADAM [17] for optimization.
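A sketch of the log-mel front end and optimizer follows; every numeric setting here (sampling rate, window, hop, number of mel bins, learning rate) is a placeholder chosen to mirror common COLA-style configurations, since the paper's exact values are not reproduced above.

```python
import torch
import torchaudio

SAMPLE_RATE = 16000  # assumed sampling rate

# Log-mel front end; all numbers below are illustrative placeholders.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400,        # ~25 ms window at 16 kHz
    hop_length=160,   # ~10 ms hop
    n_mels=64,
)

def log_mel(wave: torch.Tensor) -> torch.Tensor:
    return torch.log(mel(wave) + 1e-6)

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # ADAM with a placeholder learning rate.
    return torch.optim.Adam(model.parameters(), lr=1e-4)
```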
3.2 Results: DESED
The DESED dataset [6] comprises 10-second audio clips recorded in domestic environments or synthesized to simulate a domestic environment, and focuses on 10 classes of sound events. The dataset contains both real and synthetic recordings. We used the strongly labeled real validation set for training the linear classifier or finetuning the entire network, and the evaluation set for testing.
Table 1 shows the F1 scores on the DESED dataset. We present results for both cases: freezing the encoder and training a linear classifier on top of it, and finetuning the entire network. We use COLA [9] as the baseline; it is a special case of our framework in which only the noise block is used in the augmentation module. We consider different augmentation strategies and compare the corresponding F1 scores against the baseline and the supervised model. The results show that COLA with finetuning, which has no mechanism to enforce feature invariance, was only slightly better than the supervised model. In contrast, our finetuned model with time stretching and time/freq masking augmentation surpassed full supervision by almost 4% in F1 score. This highlights the importance of our augmentation module in this framework. Note that the low F1 scores of the SSL models are due to the distribution mismatch between the SSL data (Audioset) and the classification data (DESED).
Table 1: F1 scores on the DESED evaluation set for different augmentation strategies.

| Augmentation type | Freeze Encoder | Finetune |
| --- | --- | --- |
| Noise only (baseline) | 0.434 | 0.629 |
| Time stretching | 0.454 | 0.640 |
| RIR filtering | 0.453 | 0.616 |
| Time/freq masking | 0.492 | 0.624 |
| Time stretching + time/freq masking | 0.511 | 0.644 |
| Supervised | 0.62 | |
Figure 3 presents the per-class F1 scores for the considered augmentation combinations. Evidently, for almost all classes, there was at least one transformation combination producing a higher F1 score than the baseline. Furthermore, the F1 scores yielded by our best combination, time stretching + time/freq masking, were among the top scores for the majority of classes. This is consistent with Table 1, which shows that this strategy has the highest score averaged over all classes.
Finally, in Figure 4, we show the normalized confusion matrices for four augmentation types in the frozen-encoder case. The normalization is applied over the total number of true labels for each class. The off-diagonal elements of the time stretching + time/freq masking confusion matrix generally have lower values than those of the other confusion matrices, indicating that this strategy yields the best classification performance. This matches the insights from the previous results.
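The row normalization used in Figure 4 (dividing by the number of true instances per class) is available directly in scikit-learn:

```python
from sklearn.metrics import confusion_matrix

def normalized_confusion(y_true, y_pred):
    # Each row is divided by the number of true instances of that class,
    # so the rows sum to one (normalization over true labels).
    return confusion_matrix(y_true, y_pred, normalize="true")
```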
3.3 Results: Audioset
Table 2: wAP scores on the Audioset subset for different augmentation strategies.

| Augmentation type | Freeze Encoder | Finetune |
| --- | --- | --- |
| Noise only (baseline) | 0.361 | 0.375 |
| Time stretching | 0.391 | 0.393 |
| RIR filtering | 0.384 | 0.385 |
| Time/freq masking | 0.378 | 0.395 |
| Time stretching + time/freq masking | 0.393 | 0.399 |
| Supervised | 0.371 | |
We also assessed the representations learned by SSL pretraining using a subset of Audioset. This subset includes 2000 weakly labeled sound clips from the 10 most populated classes. As the test set is imbalanced, we report in Table 2 the weighted average precision (wAP) scores for different augmentation types and compare them to the baseline (the COLA framework [9]) and the supervised network. The results echo Table 1: COLA is inferior to the supervised network without finetuning and only slightly better than it with finetuning. Notably, all augmentation strategies in our framework are superior to full supervision, both with and without finetuning. This indicates that, compared to naive contrastive SSL methods such as COLA, our augmented contrastive SSL framework learns more robust representations that benefit downstream tasks.
4 Conclusion
Improving generalization for audio classification requires addressing the labeled-data scarcity issue. In this work, we propose an augmented contrastive self-supervised learning method that uses unlabeled data to learn invariant representations benefiting downstream supervised tasks. Using the Audioset and DESED datasets, we show that our framework significantly outperforms state-of-the-art SSL and supervised learning methods on audio classification problems. Understanding how contrastive SSL combined with data augmentation produces features invariant to nuisance factors remains an open problem; we leave it for future work.
References
- [1] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, “Understanding deep learning (still) requires rethinking generalization,” Communications of the ACM, vol. 64, no. 3, pp. 107–115, 2021.
- [2] Philip Bachman, R Devon Hjelm, and William Buchwalter, “Learning representations by maximizing mutual information across views,” arXiv preprint arXiv:1906.00910, 2019.
- [3] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- [4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
- [5] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
- [6] Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon, “Sound event detection in domestic environments with weakly labeled data and soundscape synthesis,” working paper or preprint, June 2019.
- [7] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
- [8] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” arXiv preprint arXiv:2006.09882, 2020.
- [9] Aaqib Saeed, David Grangier, and Neil Zeghidour, “Contrastive learning of general-purpose audio representations,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3875–3879.
- [10] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko, “Bootstrap your own latent - a new approach to self-supervised learning,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds. 2020, vol. 33, pp. 21271–21284, Curran Associates, Inc.
- [11] Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino, “Byol for audio: Self-supervised learning for general-purpose audio representation,” arXiv preprint arXiv:2103.06695, 2021.
- [12] Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel PW Ellis, Shawn Hershey, Jiayang Liu, R Channing Moore, and Rif A Saurous, “Unsupervised learning of semantic audio representations,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 126–130.
- [13] Eduardo Fonseca, Diego Ortego, Kevin McGuinness, Noel E O’Connor, and Xavier Serra, “Unsupervised contrastive learning of sound event representations,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 371–375.
- [14] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio, “Multi-task self-supervised learning for robust speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6989–6993.
- [15] Chandan KA Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, and Sriram Srinivasan, “Interspeech 2021 deep noise suppression challenge,” in INTERSPEECH, 2021.
- [16] Mingxing Tan and Quoc Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.
- [17] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.