
CAT: Causal Audio Transformer for Audio Classification

Abstract

Attention-based Transformers have been increasingly applied to audio classification because of their global receptive field and ability to handle long-term dependencies. However, the existing frameworks, which are mainly extended from Vision Transformers, are not perfectly compatible with audio signals. In this paper, we introduce a Causal Audio Transformer (CAT), consisting of Multi-Resolution Multi-Filter (MRMF) feature extraction and an acoustic attention block for more optimized audio modeling. In addition, we propose a causal module that alleviates over-fitting, helps with knowledge transfer, and improves interpretability. CAT achieves classification performance that is higher than or comparable to the state of the art on the ESC50, AudioSet, and UrbanSound8K datasets, and can be easily generalized to other Transformer-based models.

Index Terms:
audio classification, transformer, causal inference, attention

I Introduction

Audio classification is the task of analyzing audio recordings and assigning the corresponding labels to each audio sample. Following the huge success of Transformers [1] in Natural Language Processing (NLP) and Computer Vision (CV), they have been adapted to the audio modality and have obtained state-of-the-art performance. Part of this success is attributed to the global receptive field of Transformers, which captures long-range context in audio signals. Existing audio Transformer models inherit the structure of the well-known Vision Transformer (ViT) [2], mainly because one of the most widely used acoustic features, the Mel-spectrogram, has the same format as an image. Instead of letting the x- and y-axes carry spatial information as in image modeling, the x-axis of the Mel-spectrogram denotes temporal information while the y-axis carries discrete frequency information [3, 4, 5, 6, 7, 8].

While achieving superior performance, these trendy audio Transformers still face open challenges: (1) the commonly used acoustic representations leverage different time-frequency transformations and contain acoustic semantics of various scales and granularities [6, 5, 4], which can hardly be captured effectively by ViTs [2] using conventional self-attention, patch sampling, and embeddings; (2) designs that succeed in visual tasks, such as ResNet [9] and mixup [10], are shown to be more prone to over-fitting and less generalizable in the acoustic domain; (3) feature selection and representation learning are critical in computer vision, yet have often been overlooked in acoustic modeling.

To address these challenges, we propose a Causal Audio Transformer (CAT), which includes Multi-Resolution Multi-Filter (MRMF) feature extraction, an acoustic attention, and a causal module. Spectrograms, the standard inputs of audio models, are computed with the Fourier transform, leading to a natural trade-off between temporal and frequency resolution. CAT balances this trade-off by extracting comprehensive time-frequency feature patches at multiple resolutions and under multiple filters, which are later combined with 3D positional embeddings. The acoustic attention is then proposed to effectively extract semantics from such representations, taking the feature patches as inputs. The patches from different filters are evenly distributed among the attention heads, while pair-wise attention is computed among patches from different resolutions but within the same time frames, allowing information exchange at various granularities. We further introduce a causal module that establishes the necessary and sufficient relationship between the learned representation and the predicted labels based on counterfactual reasoning [11, 12]. We extend [12] to the context of audio classification, where a lower bound of the Probability of Necessity and Sufficiency (PNS) in terms of the interventional distribution is provided. Since such a lower bound can only be estimated from the true distribution, we propose a causal module that learns a mapping from the interventional distribution to the observational dataset (i.e., the one we have), to alleviate over-fitting, improve interpretability, and enable better knowledge transfer.

CAT achieves SOTA performance on ESC50 [13], AudioSet [14] and UrbanSound8K [15]. In summary, our contributions are:

  • Causal Audio Transformer (CAT), with Multi-Resolution Multi-Filter (MRMF) feature extraction and an acoustic attention for acoustic modeling.

  • A causal loss with a reconstruction block, which explicitly measures feature quality using the Probability of Necessity and Sufficiency (PNS), alleviates over-fitting, and improves knowledge transfer across different datasets.

Fig. 1: Left: overview of the proposed CAT; Right: the detailed structure of the proposed acoustic attention.

II Related Work

Audio classification: from CNNs to Transformers. Audio classification is the task of predicting labels of soundtracks. The great success of large-scale training originates from the rise of CNN models in audio classification [8, 16, 17, 18, 19, 20]. More recently, with the attention mechanism triumphing in Natural Language Processing (NLP) and Computer Vision (CV), self-attention-based Transformers were first applied to acoustic classification by AST [5], where the spectrogram is used as the input to a ViT [2] backbone. Multiple subsequent studies focus on improving model efficiency: PaSST [4] proposes a patch-out mechanism, and HTS-AT [6] adopts a hierarchical backbone [21]. To further boost performance, MBT [3] introduces a visual-acoustic fusion and PSLA [7] introduces a model-agnostic framework. However, the network structures in most recent studies are heavily borrowed from vanilla Transformers originally proposed for NLP and CV tasks, and are more prone to over-fitting and less generalizable to acoustic data. Therefore, in this paper, we introduce the acoustic attention, which incorporates MRMF features specialized for acoustic modeling. The proposed acoustic attention can be easily extended to various Transformer blocks.

Causal inference in feature representation. The concept of causality was first introduced in graphical probabilistic models [11]. Although causal inference is a relatively new concept in audio classification, it has shown advances in interpretable machine learning and representation learning. Modeling causality among generating factors significantly encourages the learning of representative features [22]. [23] adopts counterfactual information that helps knowledge transfer across different domains. [12] proves a lower bound on the learned representation being a necessary and sufficient cause of the label prediction. However, such a lower bound is in the interventional setting, and thus cannot be estimated directly without knowing the true distribution. We base our design on [12] and extend it to a causal module that maps the lower bound to the observational dataset in the context of audio classification.

III Causal Audio Transformer

In this section, we introduce our Causal Audio Transformer (CAT). CAT first extracts MRMF feature patches with 3D positional embeddings, which are fed into the acoustic attention (Section III-A). A causal module is then proposed to alleviate over-fitting and improve interpretability and knowledge transfer (Section III-B).

III-A Transformer For Acoustic Modeling

Image modeling is largely different from acoustic modeling because a 2D image only carries spatial information along its axes, whereas acoustic features convey both temporal and frequency information. It is therefore not trivial to apply the conventional Vision Transformer to efficient acoustic modeling. We bridge the gap with MRMF feature extraction and acoustic attention.

Multi-resolution multi-filter feature extraction. Due to the nature of the 1D Fourier transform (FT), there is a trade-off between frequency and temporal resolution in a time-frequency representation (e.g., a spectrogram). We propose to extract spectrograms of different temporal resolutions as:

x_{\text{MR}} = \{\text{FFT}_{\omega}(s)\}, \quad \omega \in \{\omega_1, \omega_2, \ldots, \omega_K\}    (1)

where $s$ denotes the input audio wave, $\text{FFT}_{\omega}$ denotes the 1D FFT with a window size of $\omega$, and $\{*\}$ represents a set. For each raw input $x$, $x_{\text{MR}}$ contains $K$ spectrograms, each of size $\mathbb{R}^{T_i \times F}, i=1,2,\ldots,K$, where $T_i$ is the number of time frames and $F$ is the number of frequency bins. Such multi-resolution spectrograms share a similar intuition with [24, 25], but ours are paired with the acoustic attention and a 3D positional embedding, introduced in the next section, so that the information can be used more effectively. We further allow spectrograms under multiple filters to serve as parallel inputs to different attention heads. For example, we could pass a copy of $x_{\text{MR}}$ through the Mel-filter bank, obtaining the MRMF features $x_{\text{MRMF}} \in \mathbb{R}^{\{T_i\} \times K \times F \times 2}$ as follows:

x_{\text{MRMF}} = \{\text{mel}(x_{\text{MR}}),\ x_{\text{MR}}\}    (2)

Spectrograms of different resolutions have different sizes, and there are multiple ways to align them. One possibility is to concatenate them along the temporal dimension, forming a super spectrogram. Another is to project the spectrograms to the same temporal dimension. We adopt the latter based on better empirical results; a minimal sketch of the whole extraction step is shown below.
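To make Eqs. (1)-(2) concrete, the following PyTorch/torchaudio sketch extracts spectrograms at several window sizes, adds a Mel-filtered copy as the second "filter", and projects all resolutions to a shared temporal dimension. The specific window sizes, the power spectrogram, the pooling of the linear spectrogram to $F$ bins, and the interpolation step are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of MRMF feature extraction (Eqs. 1-2); shapes and
# hyper-parameters below are assumptions for illustration only.
import torch
import torchaudio

def extract_mrmf(wave: torch.Tensor,
                 window_sizes=(256, 512, 1024),
                 n_freq_bins=128,
                 target_frames=512,
                 sample_rate=32000):
    """wave: (batch, samples) -> MRMF features of shape
    (batch, target_frames, K, n_freq_bins, 2), with K = len(window_sizes)."""
    per_resolution = []
    for w in window_sizes:
        # Linear-frequency power spectrogram at this temporal resolution.
        spec = torchaudio.transforms.Spectrogram(
            n_fft=w, hop_length=w // 4, power=2.0)(wave)        # (B, w//2+1, T_i)
        # Mel-filtered copy at the same resolution (the second "filter").
        mel = torchaudio.transforms.MelScale(
            n_mels=n_freq_bins, sample_rate=sample_rate,
            n_stft=w // 2 + 1)(spec)                             # (B, F, T_i)
        # Reduce the linear spectrogram to F bins so both filters align.
        lin = torch.nn.functional.adaptive_avg_pool2d(
            spec.unsqueeze(1), (n_freq_bins, spec.shape[-1])).squeeze(1)
        # Project every resolution to a shared temporal dimension
        # (the "same temporal dimension" option adopted above).
        lin = torch.nn.functional.interpolate(lin, size=target_frames)
        mel = torch.nn.functional.interpolate(mel, size=target_frames)
        per_resolution.append(torch.stack([mel, lin], dim=-1))   # (B, F, T, 2)
    x = torch.stack(per_resolution, dim=1)                       # (B, K, F, T, 2)
    return x.permute(0, 3, 1, 2, 4)                              # (B, T, K, F, 2)
```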

Patching and 3D positional embedding. The input $x_{\text{MRMF}}$ is patchified and aggregated with windows $\{W_i\} \in \mathbb{R}^{1 \times K \times F \times 1}$, and we use linear projections $\{\xi\}$ to match the dimension to $M$. Different from the square-shaped input patching in computer vision [2], the proposed patch $x_p$ contains frequency information at various resolutions. This design aggregates and preserves the important multi-scale frequency information, which is critical for audio classification. Inspired by the 3D positional embedding used in video Transformers [26], we propose an acoustic 3D positional embedding to make the network aware of features extracted with different window sizes and filters:

pe = (g([pe_1, pe_3]) \otimes \mathbf{1}_F)^T + pe_2 \otimes \mathbf{1}_T    (3)

where $g$ is a linear projection, $pe_1$ is a $T$-dimensional sinusoidal time embedding and $pe_2$ an $F$-dimensional sinusoidal frequency embedding, following [1]. $pe_3$ is the one-hot encoding of the resolutions, and $\mathbf{1}_F$ is the $F$-dimensional all-ones vector.
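Eq. (3) leaves some freedom in how $pe_1$, $pe_2$ and $pe_3$ are broadcast over the time-frequency grid. The sketch below is one plausible reading in which the time and frequency terms are standard sinusoidal tables [1] and $g$ fuses the time embedding with the resolution one-hot; all shapes and helper names are our assumptions.

```python
# A minimal sketch of the acoustic 3D positional embedding of Eq. (3);
# the exact broadcasting scheme is our interpretation, not the paper's code.
import math
import torch

def sinusoid(length: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding table of shape (length, dim) [1]."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def acoustic_3d_pos_embed(T: int, F: int, K: int, k: int, dim: int,
                          g: torch.nn.Linear) -> torch.Tensor:
    """Embedding for resolution index k out of K, returned as (T, F, dim)."""
    pe1 = sinusoid(T, dim)                                  # time embedding
    pe2 = sinusoid(F, dim)                                  # frequency embedding
    pe3 = torch.nn.functional.one_hot(torch.tensor(k), K).float()
    pe3 = pe3.expand(T, K)                                  # repeat over time
    # g fuses the time embedding with the resolution one-hot: (dim + K) -> dim.
    time_res = g(torch.cat([pe1, pe3], dim=-1))             # (T, dim)
    # Broadcast the time/resolution term over frequency and vice versa.
    return time_res.unsqueeze(1) + pe2.unsqueeze(0)         # (T, F, dim)

# Example usage with assumed sizes: dim = 256, K = 3 resolutions.
g = torch.nn.Linear(256 + 3, 256)
pe = acoustic_3d_pos_embed(T=512, F=128, K=3, k=0, dim=256, g=g)  # (512, 128, 256)
```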

Acoustic attention. We consequently design the acoustic attention to incorporate MRMF feature patches in audio Transformers. Specifically, the multi-filtered inputs are fed into different heads and processed with separate scaled dot-product attention (Figure 1), so that attentions with and without Mel-filtering are computed independently. Note that the proposed acoustic attention works with various attention kernels (e.g., multi-head attention [1], SWIN attention [21], etc.). We use the SWIN kernel in our experiments.
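As a rough illustration of the filter-separated heads (not the SWIN-kernel version used in the experiments), the sketch below routes the mel and linear token streams to disjoint groups of heads so their scaled dot-products never mix; head counts, token counts, and the use of plain multi-head attention are simplifying assumptions.

```python
# Hedged sketch of filter-separated attention; each filter stream owns its
# own heads, so mel and linear attentions are computed independently.
import torch
import torch.nn as nn

class FilterSeparatedAttention(nn.Module):
    def __init__(self, dim: int, heads_per_filter: int = 4, n_filters: int = 2):
        super().__init__()
        self.n_filters = n_filters
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(dim, heads_per_filter, batch_first=True)
            for _ in range(n_filters)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_filters, n_tokens, dim); the tokens of one stream cover
        # all resolutions within the same time frames, so cross-resolution
        # attention happens inside each filter stream.
        outs = []
        for f in range(self.n_filters):
            h, _ = self.attn[f](x[:, f], x[:, f], x[:, f])
            outs.append(h)
        return torch.stack(outs, dim=1)       # same shape as the input

# Example: 2 filters (mel / linear), 96 tokens of width 256 per stream.
attn = FilterSeparatedAttention(dim=256)
tokens = torch.randn(8, 2, 96, 256)
out = attn(tokens)                            # (8, 2, 96, 256)
```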

III-B Causal module for better representation

The idea of encouraging the learned hidden representation to be a necessary and sufficient cause of the label prediction was first proposed in [12], by maximizing a lower bound under intervention; [12] demonstrated that it helps build models with less over-fitting and more robustness. However, such a bound requires an estimation of the unobserved confounder $C$, making it hard to apply in empirical setups. In this paper, we introduce a causal module with a causal loss that applies additional constraints on bijective mappings to resolve the issue. In this section, we first introduce the preliminaries from a causal perspective, then introduce the proposed module.

Preliminaries. Let $X$ denote the training data and $Y$ the counterfactual label function. We define a function $f$ that maps $X$ to its latent representation $Z$, i.e., $Z = f(X)$. The probability of necessity and sufficiency of $\mathbb{I}\{Z=z\}$ for $\mathbb{I}\{Y=y\}$ is defined as [12]:

\text{PNS}_{Z=z, Y=y} = P(Y(Z \neq z) \neq y,\ Y(Z=z) = y)    (4)

where $Y(Z=z)$ reasons about what the label would be if $Z=z$. To combat over-fitting and ensure the feature is representative, our goal is to maximize the PNS during training. [12] established the following lower bound of the PNS:

\text{PNS} \geq P(Y=y \mid do(Z=z)) - P(Y \neq y \mid do(Z \neq z))    (5)

where the $do$ notation [11] stands for an intervention on $Z$.

However, without knowing the true distribution, the corresponding label change under a certain intervention on the latent representation $Z$ cannot be estimated directly. Instead of solving the problem through deconfounding, we derive an estimate of the lower bound as:

\text{PNS} \geq \int P(Y|X)\,[P(X \mid f(X)=z) - P(X \mid f(X) \neq z)]\,dX    (6)

From a probabilistic view, maximizing $P(Y|X)$ is the classification objective, and $P(X \mid f(X)=z)$ is the probability of inferring $X$ from a specific latent representation $Z=z$. Given that we want to maximize this lower bound of the PNS, a feasible strategy is to maximize both $P(Y|X)$ and $P(X \mid f(X)=z) - P(X \mid f(X) \neq z)$. The former is consistent with the classification objective, while the latter indicates that we need an accurate and bijective mapping from $Z$ to $X$.

Reconstruction and causal loss. We argue that such a mapping can be accomplished with a reconstruction module and a "causal" loss, as shown in Figure 1. We propose a simple reconstruction block that runs in parallel with the classification layer, so that $X$ and $Z$ approximate a bijective relationship measured by the reconstruction loss $l_{rs}$:

l_{rs} = \lVert \Phi(Z) - X \rVert_2    (7)

where $\Phi$ is parameterized by a reconstruction block as in Figure 1.

With this underlying requirement satisfied, we further minimize the causal loss ([12], Equation 25) as:

l_c = -\sum_{j=1}^{d} \log \text{PNS}_n(f_j(X), Y \mid f_{-j}(X))    (8)

where

\text{PNS}_n(f_j(X), Y \mid f_{-j}(X)) = \prod_{i=1}^{n} \text{PNS}_{f_j(X_i), Y_i \mid f_{-j}(X_i)}    (9)

The objective of CAT is the sum of a cross-entropy loss $l_{\theta}$, the reconstruction loss, and the causal loss:

L = \operatorname*{arg\,min}\,(l_{\theta} + l_c + l_{rs})    (10)
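A minimal sketch of how the three terms of Eq. (10) could be combined in practice is given below. The per-dimension PNS estimates of Eqs. (8)-(9) are left as an input, since the paper follows the estimator of [12]; all function and argument names are hypothetical.

```python
# Hedged sketch of the overall objective in Eq. (10), assuming the
# per-dimension PNS estimates are computed elsewhere (e.g. following [12]).
import torch
import torch.nn.functional as F

def cat_objective(logits, labels, z, x, recon_block, pns_per_dim):
    """logits: (B, n_classes); z: (B, d) latent; x: target to reconstruct;
    recon_block: the reconstruction head Phi of Eq. (7);
    pns_per_dim: (d,) tensor of PNS_n(f_j(X), Y | f_{-j}(X)) values."""
    l_theta = F.cross_entropy(logits, labels)                 # classification
    l_rs = torch.norm(recon_block(z) - x, p=2)                # Eq. (7)
    l_c = -torch.log(pns_per_dim.clamp_min(1e-8)).sum()       # Eq. (8)
    return l_theta + l_c + l_rs                               # Eq. (10)
```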

IV Experiments

IV-A Dataset

We evaluate CAT on three datasets. AudioSet [14] contains over two million 10-second audio clips with 527 labels; we report mean average precision (mAP) on the evaluation set following [5]. ESC50 [13] contains 2000 audio recordings in 50 categories; we report the average accuracy of 5-fold cross-validation and the mAP over 5 rounds of experiments following [5]. UrbanSound8K [15] contains 8732 labeled sound excerpts of 10 classes; we report the average accuracy of 10-fold cross-validation following [27].

IV-B Implementation details

Following previous recipes [8, 5, 4], we use a 32 kHz sample rate for all experiments. We generate both the spectrogram and the Mel-spectrogram with a window of 25 ms and a hop length of 10 ms. In training, we adopt inverse proportional weighting [28] to balance the AudioSet training data, and mixup with a ratio of 0.5 [29] is used in all of our experiments.
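For reference, the stated front-end settings translate roughly into the following torchaudio configuration; the FFT size and the number of mel bins are assumptions not stated in the text.

```python
# Hedged sketch of the front-end settings above (32 kHz sampling,
# 25 ms window, 10 ms hop); n_fft and n_mels are assumed values.
import torchaudio

sample_rate = 32000
win_length = int(0.025 * sample_rate)   # 25 ms -> 800 samples
hop_length = int(0.010 * sample_rate)   # 10 ms -> 320 samples

spectrogram = torchaudio.transforms.Spectrogram(
    n_fft=1024, win_length=win_length, hop_length=hop_length)
mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024,
    win_length=win_length, hop_length=hop_length, n_mels=128)
```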

IV-C Main Results

Table I: Evaluation on the AudioSet and ESC50 datasets. We report mAP for AudioSet and top-1 average accuracy for ESC50. * denotes the ensemble version of the model.
Models | Modality | AudioSet (mAP) | ESC50 (%)
PANN [8] | Audio | 0.434 | 94.7
AST [5] | Audio | 0.347 | 95.6
HTS-AT [6] | Audio | 0.471 | -
CAT | Audio | 0.478 | 96.4
PSLA* [7] | Audio | 0.474 | -
AST* [5] | Audio | 0.475 | 95.7
HTS-AT* [6] | Audio | 0.487 | 97.0
PaSST* [4] | Audio | 0.496 | 96.8
CLAP [30] | Audio+Text | - | 96.7
AnT [31] | Audio+Video | - | 95.7
CAT* | Audio | 0.489 | 96.9
CAT + PANN* | Audio | 0.491 | 97.2

AudioSet. We show CAT performance on AudioSet in Table I. Our model significantly outperforms previous work based on ConvNets [8], as well as SOTA Transformer-based models by a noticeable margin (+0.7% compared to [6]). Note that both CAT and HTS-AT share a similar SWIN backbone; the superior performance demonstrates that the proposed audio Transformer with MRMF features and acoustic attention, plus the causal module, is effective. It is worth mentioning that, following the same ensemble protocol, CAT achieves performance comparable to SOTA ensemble models. By simply ensembling CAT with PANN [8], we achieve performance comparable to the previous 9-model ensemble results [4]. This also shows that CAT provides information complementary to ConvNet-based models.

ESC50. CAT achieves SOTA performance on the ESC50 dataset as well (Table I). We observe a similar pattern: the proposed CAT outperforms the previous SOTA [5, 6] by 0.8% and 0.2% without and with ensemble, respectively, demonstrating the effectiveness of the proposed design. Note that ensembling CAT with another acoustic model [8] achieves performance comparable to previous work trained on additional modalities; e.g., CAT is 0.7% better than AnT [31], which utilizes visual-acoustic information. We anticipate that incorporating multimodal features into CAT may further improve performance.

IV-D Ablations

Table II: Ablations on ESC50 and AudioSet. CL denotes the causal loss and US8K denotes UrbanSound8K.
 | AudioSet | ESC50
w/o CL | 0.474 | 95.9
w. CL | 0.478 | 96.4
(a)

Pre-train | AudioSet | ESC50
N/A | 0.451 | 89.4
ImageNet | 0.478 | 96.4
(b)

 | AST [5] | CAT
- | 88.6 | 88.7
+ MRMF | 88.9 | 89.1
+ CL | 89.1 | 89.4
(c)

Pre-train | Val | w/o CL | w. CL
US8K | ESC50 | 84.1 | 86.7
ESC50 | US8K | 95.9 | 96.1
(d)

We perform ablations on ESC50 and AudioSet.

Effectiveness of the causal loss. Table II(a) shows that the causal loss improves performance on datasets of different scales.

Impact of pre-training. ImageNet pre-training improves performance on both datasets (Table II(b)); however, the impact is more significant on ESC50. This is because Transformers generally require large-scale data to train.

Generalization. We show that the proposed MRMF and causal loss generalize well to different backbones (e.g., backbones with [21] and without [5] hierarchy) and consistently improve performance (Table II(c)).

Acoustic embedding quality. We show that the proposed causal loss is able to eliminate noise so that the features transfer better across datasets (Table II(d)). We pre-train on small-scale datasets to ensure the performance gain comes from the proposed causal loss rather than data overlap.

V Conclusion and Future Work

In this paper, we propose the Causal Audio Transformer (CAT), an acoustic Transformer designed for audio classification. CAT consists of MRMF feature extraction, an acoustic attention, and a causal module. We empirically show that the proposed CAT achieves SOTA performance on multiple datasets. Our ablations also demonstrate that the MRMF features and the causal loss generalize well to different backbones.

References

  • [1] Ashish Vaswani et al., “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [3] Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, and Chen Sun, “Attention bottlenecks for multimodal fusion,” Advances in Neural Information Processing Systems, vol. 34, pp. 14200–14213, 2021.
  • [4] Khaled Koutini, Jan Schlüter, Hamid Eghbal-zadeh, and Gerhard Widmer, “Efficient training of audio transformers with patchout,” arXiv preprint arXiv:2110.05069, 2021.
  • [5] Yuan Gong, Yu-An Chung, and James Glass, “Ast: Audio spectrogram transformer,” arXiv preprint arXiv:2104.01778, 2021.
  • [6] Ke Chen et al., “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 646–650.
  • [7] Yuan Gong, Yu-An Chung, and James Glass, “Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3292–3306, 2021.
  • [8] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.
  • [9] Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
  • [10] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
  • [11] Judea Pearl, Causality, Cambridge university press, 2009.
  • [12] Yixin Wang and Michael I Jordan, “Desiderata for representation learning: A causal perspective,” arXiv preprint arXiv:2109.03795, 2021.
  • [13] Karol J Piczak, “Esc: Dataset for environmental sound classification,” in Proceedings of the 23rd ACM international conference on Multimedia, 2015, pp. 1015–1018.
  • [14] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 776–780.
  • [15] Justin Salamon, Christopher Jacoby, and Juan Pablo Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 1041–1044.
  • [16] Paulo Lopez-Meyer, Juan A del Hoyo Ontiveros, Hong Lu, and Georg Stemmer, “Efficient end-to-end audio embeddings generation for audio classification on target applications,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 601–605.
  • [17] Jaehun Kim, “Urban sound tagging using multi-channel audio feature with convolutional neural networks,” Proceedings of the Detection and Classification of Acoustic Scenes and Events, 2020.
  • [18] Loris Nanni, Gianluca Maguolo, Sheryl Brahnam, and Michelangelo Paci, “An ensemble of convolutional neural networks for audio classification,” Applied Sciences, vol. 11, no. 13, pp. 5796, 2021.
  • [19] Hardik B Sailor, Dharmesh M Agrawal, and Hemant A Patil, “Unsupervised filterbank learning using convolutional restricted boltzmann machine for environmental sound classification.,” in InterSpeech, 2017, vol. 8, p. 9.
  • [20] Jonathan J Huang and Juan Jose Alvarado Leanos, “Aclnet: efficient end-to-end audio classification cnn,” arXiv preprint arXiv:1811.06669, 2018.
  • [21] Ze Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
  • [22] Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang, “Causalvae: Disentangled representation learning via neural structural causal models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9593–9602.
  • [23] Sharmila Reddy Nangi, Niyati Chhaya, Sopan Khosla, Nikhil Kaushik, and Harshit Nyati, “Counterfactuals to control latent disentangled text representations for style transfer,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2021, pp. 40–48.
  • [24] Boqing Zhu, Changjian Wang, Feng Liu, Jin Lei, Zhen Huang, Yuxing Peng, and Fei Li, “Learning environmental sounds with multi-scale convolutional neural network,” in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–8.
  • [25] Emad M Grais, Hagen Wierstorf, Dominic Ward, and Mark D Plumbley, “Multi-resolution fully convolutional neural networks for monaural audio source separation,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2018, pp. 340–350.
  • [26] Yanyi Zhang et al., “Vidtr: Video transformer without convolutions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13577–13587.
  • [27] Avi Gazneli, Gadi Zimerman, Tal Ridnik, Gilad Sharir, and Asaf Noy, “End-to-end audio strikes back: Boosting augmentations towards an efficient audio classification network,” arXiv preprint arXiv:2204.11479, 2022.
  • [28] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson, “Averaging weights leads to wider optima and better generalization,” arXiv preprint arXiv:1803.05407, 2018.
  • [29] Yuji Tokozume, Yoshitaka Ushiku, and Tatsuya Harada, “Learning from between-class examples for deep sound recognition,” arXiv preprint arXiv:1711.10282, 2017.
  • [30] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang, “Clap: Learning audio concepts from natural language supervision,” arXiv preprint arXiv:2206.04769, 2022.
  • [31] Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, and Yejin Choi, “Connecting the dots between audio and text without parallel data through visual knowledge transfer,” arXiv preprint arXiv:2112.08995, 2021.