Learning Audio Concepts from Counterfactual Natural Language
Abstract
Conventional audio classification relies on predefined classes and cannot learn from free-form text. Recent methods unlock learning joint audio-text embeddings from raw audio-text pairs that describe audio in natural language. Despite these advancements, there has been little exploration of systematic methods to train models to recognize sound events and sources in alternative scenarios, such as distinguishing fireworks from gunshots at outdoor events in otherwise similar situations. This study introduces causal reasoning and counterfactual analysis into the audio domain. We generate counterfactual instances and incorporate them into our model across several aspects. Our model considers acoustic characteristics and sound source information from human-annotated reference texts. To validate the effectiveness of our model, we pre-train on multiple audio captioning datasets and evaluate on several common downstream tasks, demonstrating the merits of the proposed method as one of the first works to leverage counterfactual information in the audio domain. Specifically, top-1 accuracy in the open-ended language-based audio retrieval task increased by more than 43%.
Index Terms— sound event detection, audio understanding, multimodal representations, free-form text, counterfactual representation learning, audio captioning
1 Introduction
Conventional audio processing in machine learning relies on predefined categories, limiting the ability of models to understand audio nuances through descriptive text. These constraints restrict open-ended and contrastive training for better audio-text alignment. Recent work improves on classic models by learning jointly from audio data and matching natural language descriptions, following approaches that have proved successful in image-text tasks [1]. Learning audio representations from pairs of audio and their textual descriptions facilitates the development of foundational models for audio tasks, enabling audio-text models to generalize beyond the confines of predefined classes by leveraging natural language descriptions [2].

Advancements in audio-text representation include AudioCLIP [3], a tri-modal model, and Wav2CLIP [4], which extends CLIP [1] to audio. Subsequently, Elizalde et al. proposed CLAP (Contrastive Language-Audio Pretraining) [5], a method inspired by successful image-text models [1]. CLAP trains on audio and text directly, without relying on image data for the learning process. However, human-annotated audio captioning data is expensive and time-consuming to acquire. Recently, large language models (LLMs) such as ChatGPT, an upgraded version of GPT-3 [6] fine-tuned to follow human instructions [7], have become popular and have been used to augment learning in various domains; both LAION-Audio-630K [8] and WavCaps [9] leverage their powerful text re-writing and editing capabilities to acquire more audio-text pairs. Furthermore, methods such as Pengi [2] and Listen, Think, and Understand [10] have built upon these foundational methodologies.


Distinguishing between sounds in similar conditions, such as a firework and a gunshot at the same concert event, requires controlled trials of both sounds and a learning algorithm that differentiates between them in the same context. However, existing audio-text datasets lack such alternative scenarios for ethical and practical reasons. Counterfactual reasoning has been utilized to improve multimodal models involving the vision modality [11] and to aid grounding concepts within visual objects [12]. The semantic differences between pairs of similar but slightly different audio clips have been discussed in [13] and [14] in the context of audio captioning. To the best of our knowledge, our work is the first to utilize the knowledge base and capabilities of LLMs to integrate counterfactual reasoning into audio-text learning. This provides data and methods to enhance the learning of audio-text correspondence with counterfactual information. The proposed method utilizes counterfactual language and multimodal embeddings to improve textual discriminability in audio processing tasks, as shown in Figure 1. We then develop a composite loss function that integrates the concept of triplet angular spaces and enforces factual audio-text consistency. We are inspired by recent developments in causality using LLMs [15] that offer new possibilities, such as the augmentation of meaningful counterfactuals in situations where audio learning data may not be available. The proposed method has the potential to influence a wide array of applications, including automatic speech recognition, sound event detection, and audio-visual scene understanding. The core innovation of our work lies in being the first to integrate causal and counterfactual models into the analysis of audio data. The generated counterfactual sentences and the prompts we used will be available at https://github.com/ali-vosoughi/counterfactual-audio.
2 Learning Audio from Counterfactual
Causality describes the relationship between a cause and its subsequent effect. This study focuses on the causal relationships within audio samples that can be inferred from their corresponding text captions. Specifically, we address scenarios where acquiring counterfactual audio is impossible or costly by leveraging natural language as a substitute for imaginative data. Our method aligns with recent advancements in causality in natural language that aim to extend the domain of causality to LLMs [16, 15]. We first review the prior work on CLAP [17] that motivated us, and then propose our method as the first work to extend counterfactuals to audio learning.
Contrastive Language-Audio Pretraining (CLAP): Given a batch of $N$ pairs $\{(x_i^a, x_i^t)\}_{i=1}^{N}$, $x_i^a$ represents an audio sample and $x_i^t$ is its corresponding text caption for $i = 1, \ldots, N$. We define the audio and text encoders as $f_a(\cdot)$ and $f_t(\cdot)$, respectively. These encoders transform each $x_i^a$ and $x_i^t$ into vectors in $\mathbb{R}^d$, which are aggregated to form matrices $\hat{A} \in \mathbb{R}^{N \times d}$ and $\hat{T} \in \mathbb{R}^{N \times d}$. The similarity matrix $C$ in the CLAP framework is given by (1) [5]:

$$C = \tau \, \big(\hat{A}\,\hat{T}^{\top}\big) \qquad (1)$$
The CLAP loss minimizes the discrepancy between audio and text representations; it is denoted by $\mathcal{L}_{\mathrm{CLAP}}$ and formulated as:

$$\mathcal{L}_{\mathrm{CLAP}} = \frac{1}{2}\big(\ell_{\mathrm{audio}}(C) + \ell_{\mathrm{text}}(C)\big), \qquad (2)$$

where $\ell_{\mathrm{audio}}$ and $\ell_{\mathrm{text}}$ are the cross-entropy losses computed along the audio and text axes of $C$, respectively.
Here, $\tau$ serves as a scaling factor in (1), modulating the effect of the similarity scores. While the original CLAP framework, shown in Figure 2(a), is valuable for audio-text pretraining, it does not allow the expression of causality. Addressing this gap, we introduce a counterfactual natural language strategy to infuse the CLAP framework with causality. This addition is beneficial as it circumvents the difficulties associated with data collection for counterfactual audio analysis.
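As a point of reference, the following is a minimal PyTorch sketch of the symmetric contrastive objective in (1)-(2); the function name, the default scaling factor `tau`, and the assumption of pre-computed embeddings are illustrative choices rather than the exact implementation of [5].

```python
import torch
import torch.nn.functional as F

def clap_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 1 / 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N paired audio/text embeddings (N x d)."""
    # L2-normalize so that dot products are cosine similarities
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # Similarity matrix C scaled by tau, as in (1); tau = 1/0.07 is a common choice, assumed here
    logits = tau * (a @ t.t())
    # Matched audio-text pairs lie on the diagonal
    targets = torch.arange(a.size(0), device=a.device)
    # Cross-entropy along the audio axis and the text axis, averaged as in (2)
    loss_audio = F.cross_entropy(logits, targets)
    loss_text = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_audio + loss_text)
```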
Causal Identification and Intervention: The counterfactual sentences in our model are generated through a prompt-based intervention on an observed caption $x_i^t$, producing a counterfactual caption $\tilde{x}_i^t$. The prompt is designed to fulfill three key aspects: it is factually grounded, it identifies acoustic sources in captions to serve as identified causes, and it manipulates these sources to alter the caption, serving as causal interventions. Identifiability means that all causal sources of the acoustic waves can be obtained from language alone [15]. Examples of these transformations are depicted in Table 1.
Table 1: Examples of original captions and their generated counterfactuals.

| Dataset | Original Caption | Generated Counterfactual |
|---|---|---|
| Clotho | A gun is loaded, then loaded by hand some more | A piano is played, then played by hand some more. |
| Clotho | A few gunshots are fired at the target shooting range | A few fireworks light up the night sky at shooting range. |
| AudioCaps | An adult male speaks and a crash occurs | An adult male speaks and a thunderstorm rumbles. |
| AudioCaps | Large group of people clapping | Flock of birds chirping in unison. |
| AudioCaps | Idling car, train blows horn and passes | Dogs barking, train blows horn and passes. |
| MACS | A crowd of people indoors talking | A group of cars honking on a busy street. |
| MACS | Adults and children are walking and talking | Cars and trucks are honking and zooming. |
| MACS | Adults talking and some footsteps coming across | Dogs barking and some footsteps coming across. |
Control Mechanisms for Counterfactual Language: We utilize prompts based on the Chain-of-Thought (CoT) method [21] to align with the objectives of causal identification, causal intervention, and counterfactual generation. Figure 2(b) shows our prompt design’s rationale, objectives, and context. We introduce a two-step prompting mechanism, denoted as $P_1 \rightarrow P_2$, as shown in Fig. 2(b). In this mechanism, $P_1$ anchors the discussion in factual elements and dissects acoustic objects, playing the role of the causal identifier. Concurrently, $P_2$ governs the generation of counterfactual statements by intervening on the identified causal acoustic sources. Decomposing the audio captions into their acoustic sources and objects through $P_1$, based on the physics of the acoustic waves, grounds the generated output of the LLM in physically possible scenarios. The counterfactuals can range from fully negative examples, as found in hard negative sampling [22], to minor, physically plausible counterfactual scenarios, and this steering influences the learning of the audio-text embeddings. For instance, for the caption "children and adult voices with footsteps and birds singing in the background," $P_2$ can be used to change the primary sound inferred from the caption, "children and adult voices," the background sound, "with footsteps," or the ambient sound, "birds singing in the background." In this paper, we use a combination of these settings without restricting the choice, for the sake of generalizability.
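A minimal sketch of this two-step mechanism follows; `call_llm` is a hypothetical wrapper around a chat-style LLM (e.g., GPT-3.5-Turbo), and the prompt wording is illustrative rather than the exact released prompts.

```python
def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical wrapper around a chat-style LLM API; replace with your client of choice."""
    raise NotImplementedError

def generate_counterfactual(caption: str) -> str:
    # Step 1 (P1): stay factual and identify the acoustic sources (causal identification)
    p1 = ("List the distinct sound sources described in this audio caption, "
          "grounded only in what is stated: " + caption)
    sources = call_llm("You are an expert in acoustics.", p1)
    # Step 2 (P2): intervene on the identified sources to produce a physically
    # plausible alternative scenario (causal intervention)
    p2 = (f"Caption: {caption}\nSound sources: {sources}\n"
          "Rewrite the caption by replacing one or more sound sources with different, "
          "physically plausible sources, keeping the rest of the scene unchanged.")
    return call_llm("You are an expert in acoustics.", p2)
```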
Loss Functions to Incorporate Counterfactuals: The angle loss encourages each audio embedding to be angularly closer to its factual caption than to its counterfactual caption by at least a margin. With $a_i$, $t_i$, and $\tilde{t}_i$ denoting the normalized audio, factual-caption, and counterfactual-caption embeddings, respectively, it is defined as:

$$\theta_i = \arccos\big(a_i^{\top} t_i\big), \qquad \tilde{\theta}_i = \arccos\big(a_i^{\top} \tilde{t}_i\big), \qquad (3)$$

$$\mathcal{L}_{\mathrm{ang}} = \frac{1}{N}\sum_{i=1}^{N} \max\big(0,\ \theta_i - \tilde{\theta}_i + m\big). \qquad (4)$$
Here, $m$ represents the angular margin. To encourage factual consistency between audio samples and their corresponding captions, we define the factual consistency loss as follows:

$$\mathcal{L}_{\mathrm{fact}} = \frac{1}{N}\sum_{i=1}^{N} \big(1 - a_i^{\top} t_i\big). \qquad (5)$$
The total loss, combining both the factual consistency loss and the angle loss, is expressed in (6):

$$\mathcal{L} = \lambda_{\mathrm{ang}}\,\mathcal{L}_{\mathrm{ang}} + \lambda_{\mathrm{fact}}\,\mathcal{L}_{\mathrm{fact}}. \qquad (6)$$
The choice of the hyperparameters $\lambda_{\mathrm{ang}}$ and $\lambda_{\mathrm{fact}}$ in (6) is pragmatic, serving as a trade-off that best exploits the capabilities of counterfactuals while ensuring factual consistency.
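A minimal sketch of the composite objective, assuming the forms written in (3)-(6) above; the variable names, default margin, and default coefficients are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F

def counterfactual_loss(audio_emb, text_emb, cf_text_emb,
                        margin: float = 0.2,
                        lambda_ang: float = 1.0,
                        lambda_fact: float = 100.0) -> torch.Tensor:
    """Composite loss over N triplets (audio, factual caption, counterfactual caption)."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    t_cf = F.normalize(cf_text_emb, dim=-1)
    # Angles between audio and factual / counterfactual captions, as in (3)
    theta_fact = torch.arccos((a * t).sum(-1).clamp(-1 + 1e-6, 1 - 1e-6))
    theta_cf = torch.arccos((a * t_cf).sum(-1).clamp(-1 + 1e-6, 1 - 1e-6))
    # Angle loss with angular margin, as in (4): the audio should be angularly
    # closer to the factual caption than to the counterfactual one
    loss_ang = F.relu(theta_fact - theta_cf + margin).mean()
    # Factual consistency loss, as in (5): pull audio toward its own caption
    loss_fact = (1.0 - (a * t).sum(-1)).mean()
    # Total loss, as in (6)
    return lambda_ang * loss_ang + lambda_fact * loss_fact
```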
3 Experimental design
3.1 Encoders
Audio encoder: We use the PANNs [23] encoder, specifically ResNet-38, with pretrained weights and adapter layers added to fine-tune the model and align the embeddings.
Text encoders: We use the CLIP text encoder modules [1] provided by HuggingFace [24] (https://huggingface.co/docs/transformers/model_doc/clip) for encoding both captions and counterfactuals. The weights of the text encoders were frozen in all stages; therefore, we exclude effects that may arise from the encoding performance of the language encoders.
We employ logarithmic Mel spectrograms of audio sampled at 32 kHz. The hop size is set to 320 samples, the window size to 1024 samples, and we use 64 Mel bins spanning the frequency range of 50-14000 Hz. Audio clips are randomly truncated to contiguous 10-second segments for training, with zero padding applied to shorter clips. The captions remain unaltered. During training, batches containing pairs of audio and text are randomly selected.
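The feature extraction above can be approximated as below with librosa; the cropping and padding details are a sketch of the described procedure, not the authors' exact pipeline.

```python
import numpy as np
import librosa

def log_mel(path: str, clip_seconds: int = 10, sr: int = 32000) -> np.ndarray:
    """Load audio at 32 kHz and compute a log-Mel spectrogram with the stated settings."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    target = clip_seconds * sr
    if len(y) >= target:
        # Random contiguous 10-second crop for training
        start = np.random.randint(0, len(y) - target + 1)
        y = y[start:start + target]
    else:
        # Zero-pad shorter clips to 10 seconds
        y = np.pad(y, (0, target - len(y)))
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=320,
        n_mels=64, fmin=50, fmax=14000)
    return librosa.power_to_db(mel)
```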
3.2 Data
Training datasets: A total of 44,292 pairs from AudioCaps [20], 29,646 pairs from Clotho [18] (each audio clip has five captions, so we created five pairs per clip), and 17,276 pairs from MACS [19] were used during pretraining. We chose these three datasets because they are purely annotated by humans.
Test datasets: We use the Clotho dataset [18] to evaluate the model’s performance on the language-based audio retrieval task, generating counterfactual captions with GPT-3.5-Turbo as for the training sets. To evaluate zero-shot classification in conventional settings with limited classes, we use Environmental Sound Classification 50 (ESC-50) [25], with 50 predefined audio categories, and UrbanSound8K (US8K) [26], with ten classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music.
3.3 Baseline
We adopt the approach from CLAP [5] and train our own version on the same datasets used to generate counterfactuals, namely AudioCaps, Clotho, and MACS.
4 Results and Discussions
4.1 Evaluation on Downstream Tasks
Results on Clotho: We use Clotho [18] to test the performance of our method on the language-based audio retrieval task. As listed in Table 2, our method yields a 43% improvement in top-1 accuracy on text-based retrieval, reinforcing its superior precision; the improvement for top-10 retrieval is smaller. We use the cosine similarity of text and audio embeddings to measure performance. Therefore, the text encoder plays a crucial role in capturing small wording nuances that have significant effects, and its limitations in capturing such nuances impact the model’s performance. For instance, the cosine similarity of the embeddings of "This is the sound of a dog" and "This is the sound of a cat" under the BERTScore text encoder [27] is 0.99, even though the two sentences describe two distinctly different animals. While this study excludes the effects of different text encoders to focus on the merits of counterfactual audio analysis, a more comprehensive token dictionary and text encoder could further improve sentence discrimination.
| Method | Top-1 | Top-10 |
|---|---|---|
| CLAP | 0.088 | 0.395 |
| Our method | 0.126 | 0.423 |
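The retrieval metric in Table 2 reduces to a top-k match over the text-to-audio cosine similarity matrix; the sketch below assumes precomputed, L2-normalized embeddings and is illustrative rather than the authors' evaluation script.

```python
import torch

def topk_recall(text_emb: torch.Tensor, audio_emb: torch.Tensor, k: int = 10) -> float:
    """Fraction of text queries whose paired audio is among the k most similar clips.
    Both inputs are N x d, L2-normalized, with row i of each forming a matched pair."""
    sims = text_emb @ audio_emb.t()                       # N x N cosine similarity matrix
    topk = sims.topk(k, dim=-1).indices                   # indices of the k closest clips per query
    targets = torch.arange(text_emb.size(0), device=text_emb.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```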
Results on ESC-50 and US8K: We evaluated the zero-shot classification performance of our proposed model on two benchmark datasets, ESC-50 and US8K, summarized in Table 3. As the table shows, our model performs commendably on the ESC-50 dataset, which features 50 classes; this performance is slightly better than that of the CLAP method. Conversely, performance lags behind CLAP on the US8K dataset. One reason might be that the number of classes in US8K is much lower than what our model encountered during training. Because the class labels of US8K carry little textual detail about the data, classification becomes harder: all labels receive relatively similar and high similarity scores. For instance, the BERTScore [27] between the class label "siren" and all other US8K class labels is 0.844 ± 0.020, and the trend repeats consistently across all class labels of the dataset. In contrast, for the ESC-50 dataset, the similarity of the class label "siren" to all other classes is 0.821 ± 0.018. Further analysis is necessary to ascertain the reasons behind the reduced performance on US8K and to refine our model for enhanced overall accuracy.
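Zero-shot classification here amounts to scoring each audio embedding against text embeddings of the class labels (optionally wrapped in a template such as "This is the sound of a {label}", which is an assumption on our part); a minimal sketch:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(audio_emb: torch.Tensor, class_text_emb: torch.Tensor) -> torch.Tensor:
    """Return the predicted class index for each audio clip.
    audio_emb: B x d audio embeddings; class_text_emb: C x d embeddings of class-label prompts."""
    a = F.normalize(audio_emb, dim=-1)
    c = F.normalize(class_text_emb, dim=-1)
    return (a @ c.t()).argmax(dim=-1)  # the class with the highest cosine similarity wins
```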
4.2 Ablation Studies
Figure 3 shows the t-SNE visualizations of the embeddings for various settings of the loss coefficients. Starting from a random guess, the embeddings evolve with each subsequent addition of the loss terms. The plots in Fig. 3 show that when all loss coefficients are set to zero, the audio embeddings reduce to the PANN embeddings in our model, while the frozen text encoders show the fixed space of captions and counterfactuals. One particular observation is that the audio embeddings of CLAP are closer to the facts (orange dots) than to the counterfactuals (blue dots). This by itself shows that CLAP favorably learns to stay closer to facts; one reason is that our generated counterfactuals combine various types, and some of them resemble negatives, which CLAP can already handle. By incorporating counterfactuals via the angle loss, the audio embeddings of our method move away from the counterfactuals while staying closer to the facts. Alternatively, with only the factual consistency loss, the audio embeddings tend to align fully with the factual embeddings. Noteworthy is the trade-off: adding the factual consistency loss reduces the distance to the counterfactuals, but the distance to the facts becomes much smaller as well. Having both the factual consistency loss and the angle loss ensures that the counterfactuals stay distant enough while the facts are kept closer to the audio embeddings.
| $\lambda_{\mathrm{ang}}$ | $\lambda_{\mathrm{fact}}$ | Top-1 | Top-10 |
|---|---|---|---|
| 0 | 0 | 0.0002 | 0.01 |
| 1 | 0 | 0.0819 | 0.3365 |
| 0 | 100 | 0.1328 | 0.4379 |
| 1 | 100 | 0.1102 | 0.3782 |
To quantify the closeness of the audio embeddings to the original and counterfactual texts, we use the cosine similarity of the audio-text and audio-counterfactual embeddings. Based on these similarities, we count the number of times, over the 1,044 samples of the Clotho evaluation set, that the audio embeddings are closer to the facts than to the counterfactuals. As expected, the first row corresponds to a random guess of 517, roughly half of the 1,044 test samples. As Table 4 shows, adding the angle loss and using counterfactuals improve our audio embeddings. Another pattern is that the factual consistency loss improves accuracy; however, the closeness of audio to text, relative to the counterfactuals, is not as good as when the angle loss is added. Therefore, from the patterns in Fig. 4 and Table 4, there is a trade-off between facts and counterfactuals. By moving closer to the facts through the factual consistency loss, Clotho retrieval accuracy increases, while the model does not learn to distinguish well between facts and counterfactuals. In contrast, adding the angle loss incorporates counterfactuals in training, helping to distinguish between descriptions that are directly (facts/captions) or indirectly (counterfactual captions) related to the same audio. This trade-off is consistent with intuition. Imagine a judge evaluating two cases: one presents strong evidence (facts), and the other presents a scenario that might have happened but did not (a counterfactual). A judge who focuses solely on ensuring the judgments align perfectly with the facts might struggle to distinguish between the two cases effectively. However, by considering both the facts and the counterfactual scenario, the judge becomes better at discerning their nuances. This balance between staying close to the facts and acknowledging the counterfactuals is what our model learns during training.
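The count reported above can be reproduced by a per-sample comparison of the two cosine similarities; a sketch, assuming embeddings are available for each audio clip, its caption, and its counterfactual caption.

```python
import torch
import torch.nn.functional as F

def count_closer_to_fact(audio_emb, fact_emb, cf_emb) -> int:
    """Number of samples whose audio embedding is more similar to the factual caption
    than to the generated counterfactual caption."""
    a = F.normalize(audio_emb, dim=-1)
    sim_fact = (a * F.normalize(fact_emb, dim=-1)).sum(-1)
    sim_cf = (a * F.normalize(cf_emb, dim=-1)).sum(-1)
    return int((sim_fact > sim_cf).sum().item())
```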




It is important to note that some existing methods may outperform ours, e.g., on the US8K dataset, owing to their use of larger training datasets. We argue that these methods often overlook the importance of causal reasoning, tending to improve accuracy rather than introducing counterfactuals as an insightful and novel bridge between counterfactual reasoning and the audio community. In contrast, our method considers both the data and its underlying causal factors, providing a more robust and insightful audio-text representation by borrowing identification and intervention from the science of causality.
5 Conclusion and Future Direction
For the first time, we incorporate a counterfactual framework into the audio domain. We leverage LLMs for counterfactual reasoning through prompt-based interventions on identified acoustic objects. This integration aims to identify variations in audio-text representations by using natural language as a surrogate when alternative audio-text data, in which the origin and root cause of the acoustic waves vary, is unavailable. Our method exploits human-generated reference captions to create surrogate counterfactuals and adopts them in audio-text pretraining within a triplet model with factual consistency. Counterfactual natural language effectively compensates for the scarcity of comprehensive counterfactual audio data, which is hard to collect for ethical or feasibility reasons, and enhances the distinguishability of audio-text models. Empirical evaluations on various datasets substantiate the effectiveness of our method; in particular, it yields a 43% improvement in top-1 accuracy for open-text retrieval. Future research may explore the efficacy of these counterfactuals in challenging existing factual representations and their subsequent impact on audio-text correlation. Another avenue for future work could involve examining various levels of counterfactual reasoning.
References
- [1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
- [2] Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang, “Pengi: An audio language model for audio tasks,” arXiv preprint arXiv:2305.11834, 2023.
- [3] Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel, “Audioclip: Extending clip to image, text and audio,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 976–980.
- [4] Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello, “Wav2clip: Learning robust audio representations from clip,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4563–4567.
- [5] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- [7] L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
- [8] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- [9] X. Mei et al., “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” arXiv preprint arXiv:2303.17395, 2023.
- [10] Yuan Gong, Hongyin Luo, Alexander H Liu, Leonid Karlinsky, and James Glass, “Listen, think, and understand,” arXiv preprint arXiv:2305.10790, 2023.
- [11] Mehdi Zemni, Mickaël Chen, Éloi Zablocki, Hédi Ben-Younes, Patrick Pérez, and Matthieu Cord, “Octet: Object-aware counterfactual explanations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15062–15071.
- [12] Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, and Zeynep Akata, “Grounding visual explanations,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 264–279.
- [13] Shunsuke Tsubaki, Yohei Kawaguchi, Tomoya Nishida, Keisuke Imoto, Yuki Okamoto, Kota Dohi, and Takashi Endo, “Audio-change captioning to explain machine-sound anomalies,” in Proceedings of the DCASE 2023 Workshop, 2023.
- [14] Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, and Kunio Kashino, “Audio difference captioning utilizing similarity-discrepancy disentanglement,” in Proceedings of the DCASE 2023 Workshop, 2023.
- [15] Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan, “Causal reasoning and large language models: Opening a new frontier for causality,” arXiv preprint arXiv:2305.00050, 2023.
- [16] Judea Pearl, Causality, Cambridge university press, 2009.
- [17] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [18] Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen, “Clotho: An audio captioning dataset,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 736–740.
- [19] Irene Martín-Morató and Annamaria Mesaros, “What is the ground truth? reliability of multi-annotator data for audio tagging,” in 2021 29th European Signal Processing Conference (EUSIPCO). IEEE, 2021, pp. 76–80.
- [20] Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim, “Audiocaps: Generating captions for audios in the wild,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 119–132.
- [21] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
- [22] Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka, “Contrastive learning with hard negative samples,” arXiv preprint arXiv:2010.04592, 2020.
- [23] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.
- [24] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al., “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, 2020, pp. 38–45.
- [25] Karol J Piczak, “Esc: Dataset for environmental sound classification,” in Proceedings of the 23rd ACM international conference on Multimedia, 2015, pp. 1015–1018.
- [26] Justin Salamon, Christopher Jacoby, and Juan Pablo Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 1041–1044.
- [27] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi, “Bertscore: Evaluating text generation with bert,” arXiv preprint arXiv:1904.09675, 2019.
- [28] Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello, “Wav2clip: Learning robust audio representations from clip,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 4563–4567.
- [29] Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel, “Audioclip: Extending clip to image, text and audio,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 976–980.