
State-Space Large Audio Language Models

Saurabhchand Bhati1, Yuan Gong1, Leonid Karlinsky2,3, Hilde Kuehne3,4, Rogerio Feris2,3, James Glass1
1MIT, USA, 2IBM Research AI, USA, 3MIT-IBM Watson AI Lab, USA, 4University of Bonn, Germany
[email protected]
Abstract

Large Audio Language Models (LALMs) combine audio perception models with Large Language Models (LLMs) and show a remarkable ability to reason about input audio, infer its meaning, and understand the speaker's intent. However, these systems rely on Transformers, whose cost scales quadratically with input sequence length, which poses computational challenges for deployment in memory- and time-constrained scenarios. Recently, state-space models (SSMs) have emerged as an alternative to Transformer networks.

While there have been successful attempts to replace transformer-based audio perception models with state-space ones, state-space-based LALMs remain unexplored. We first replace the transformer-based audio perception module, then replace the transformer-based LLM, and thereby propose the first state-space-based LALM. Experimental results demonstrate that the state-space-based LALM, despite having significantly fewer parameters, performs competitively with transformer-based LALMs on close-ended tasks across a variety of datasets.

Index Terms:
State-space models, Large audio language models, Audio reasoning

I Introduction

Large Language Models (LLMs) [1, 2, 3, 4, 5, 6], built on the powerful transformer architecture and trained on web-scale text data, have made significant progress in Natural Language Processing. LLMs can read, write, understand, and reason about the world through text. ChatGPT and similar systems are increasingly becoming commonplace.

While text-based systems have made significant strides, in our day-to-day lives we are surrounded by complex audio signals. Understanding audio is a crucial part of developing the next generation of intelligent systems that can interact with the world. Motivated by this, there have been attempts to supplement the LLMs with the ability to understand audio and speech [7, 8, 9, 10, 11, 12].

Despite the success of Transformers [13], they suffer from quadratic time and memory complexity, which is a bottleneck for long audio and speech signals and for use on devices with low computational resources. While substantial effort has gone into reducing the computational requirements of transformer architectures [14, 15], there is a need to explore alternative models for use cases with memory and time constraints.

State-space models (SSMs) have emerged as an alternative to transformer-based models [16, 17, 18, 19]. SSMs have linear complexity in sequence length and perform on par with transformers, while demonstrating faster inference and lower memory requirements [20, 21, 18]. SSMs have shown strong performance on text [17] and images [21, 18] and are increasingly common for modeling speech and audio signals [22, 23, 24, 19]. Recently, SSMs coupled with knowledge distillation have been shown to outperform their transformer-based teachers and achieve state-of-the-art performance on audio perception tasks [19].

While there have been works exploring state-space-based audio perception systems, there have been no attempts to explore state-space-based LALMs. In this paper, we systematically explore the impact of using state-space-based components in developing LALMs. First, we replace the audio perception module, moving from a transformer-based system, AST, to a state-space-based system, DASS, while keeping the transformer-based LLM. Then, we also replace the LLM with a state-space LLM and propose, to the best of our knowledge, the first state-space LALM. We evaluate the model on a mix of classification and captioning datasets and show that the state-space LALM performs competitively with transformer-based LALMs.

II Related Work

LLMs trained on web-scale data with the next-token prediction task show impressive reasoning and understanding of world knowledge. These models learn general-purpose representations that can be aligned to desired responses via instruction tuning [25]. Large Audio Language Models extend the capabilities of these models beyond text to include general audio and speech.

Pengi [7] uses the hierarchical transformer HTSAT [26] as the audio encoder, the CLIP text encoder as the text encoder, and GPT-2 [27] as the language model. AudioGPT [11] augments ChatGPT's ability to handle complex audio and speech tasks: it analyzes the user prompt and assigns a model based on the prompt; for example, Whisper [28] is used for speech recognition, whereas Conv-TasNet [29] is used for speech enhancement. SALMONN [8] uses dual encoders, a Whisper model to extract speech and information about background noises and a BEATs encoder [30] to extract high-level non-speech audio semantic information, with LLaMA as the language model.

LTU [9] uses AST [31] as the audio encoder and LLaMA as the language model. LTU outperforms existing LALMs on close-ended tasks and shows free-form open-ended question-answering capabilities. Our approaches build upon LTU and reduce the computational cost of the model by using state-space models while retaining its performance. GAMA [12] is a concurrent approach that builds upon LTU and combines information extracted from various layers of AST via an Audio Q-Former. GAMA contains significantly more trainable parameters (~300M) compared to LTU (~100M) and our proposed models (~40M, ~60M). GAMA also proposed and used an improved instruction-tuning dataset to improve complex reasoning about the input audio. This work explores alternatives to the Transformer backbone and builds computationally efficient audio-language generation systems.

III State-space Large Audio Language Models

III-A State-Space Models

Structured state space sequence models (S4) [16] are inspired by classical state-space models such as Kalman filters and Hidden Markov Models. State-space models map a 1-D sequence $x(t)\in\mathbb{R}$ to $y(t)\in\mathbb{R}$ through a hidden state $h(t)\in\mathbb{R}^{N}$ via linear ordinary differential equations as follows:

$\mathbf{h}'(t) = \mathbf{A}\mathbf{h}(t) + \mathbf{B}x(t),$  (1)
$y(t) = \mathbf{C}\mathbf{h}(t)$  (2)

where $\mathbf{A}\in\mathbb{R}^{N\times N}$ and $(\mathbf{B}\in\mathbb{R}^{N\times 1},\ \mathbf{C}\in\mathbb{R}^{1\times N})$ are called the evolution and projection parameters, respectively.

A discretization step transforms the continuous parameters $\mathbf{A},\mathbf{B}$ into discrete parameters $\bar{\mathbf{A}},\bar{\mathbf{B}}$. A commonly used discretization method, the zero-order hold, uses a timescale parameter $\Delta$ as follows:

$\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}),$  (3)
$\bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A}) - \mathbf{I})\,\Delta\mathbf{B}$  (4)
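As a concrete illustration, the sketch below applies this zero-order-hold discretization in NumPy for a diagonal state matrix $\mathbf{A}$ (a common parameterization in practice); the diagonal assumption and array shapes are ours, not part of the original formulation.

```python
import numpy as np

def discretize_zoh(A_diag, B, delta):
    """Zero-order-hold discretization of Eqs. (3)-(4), assuming a diagonal A.

    A_diag: (N,) diagonal of the continuous state matrix A
    B:      (N,) continuous input matrix
    delta:  scalar timescale parameter
    Returns the discrete parameters (A_bar, B_bar), each of shape (N,).
    """
    A_bar = np.exp(delta * A_diag)                          # Eq. (3)
    B_bar = (A_bar - 1.0) / (delta * A_diag) * (delta * B)  # Eq. (4), elementwise for diagonal A
    return A_bar, B_bar
```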

After the discretization step, the state-space equations can be written as:

$h_{t} = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_{t}$  (5)
$y_{t} = \mathbf{C}h_{t}$  (6)

This view of state-space models is analogous to RNNs, where the output at the current time step depends only on the previous hidden state and the current input. Since the state-space parameters are not time- or input-dependent, the SSM can also be viewed as a convolution $y_{t}=(x_{0},x_{1},\ldots,x_{t})*(\mathbf{C}\bar{\mathbf{B}},\mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\ldots,\mathbf{C}\bar{\mathbf{A}}^{M-1}\bar{\mathbf{B}})=\mathbf{x}*\bar{\mathbf{K}}$, where $\bar{\mathbf{K}}$ is called the global convolutional kernel.

SSMs can thus be viewed either as CNNs or as RNNs depending on the task. During training, the convolutional view enables parallel training. During inference, the recurrent view allows faster inference and unbounded context. However, the linear time-invariant nature of these models limits their performance on content-based reasoning tasks.
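The sketch below is a small NumPy check (with toy, arbitrarily chosen sizes) that the recurrent view in Eqs. (5)-(6) and the convolutional view with kernel $\bar{\mathbf{K}}$ produce the same outputs for a time-invariant SSM.

```python
import numpy as np

N, T = 4, 8                                      # toy state size and sequence length
rng = np.random.default_rng(0)
A_bar = np.diag(rng.uniform(0.1, 0.9, N))        # discretized state matrix (diagonal for simplicity)
B_bar = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(T)

# Recurrent view: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t
h = np.zeros((N, 1))
y_rec = []
for t in range(T):
    h = A_bar @ h + B_bar * x[t]
    y_rec.append((C @ h).item())

# Convolutional view: y = x * K_bar with K_bar[k] = C A_bar^k B_bar
K_bar = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(T)])
y_conv = [sum(K_bar[k] * x[t - k] for k in range(t + 1)) for t in range(T)]

assert np.allclose(y_rec, y_conv)                # both views agree
```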

Gu et al. [17] proposed a parametrization method to make the state-space parameters input-dependent. However, this breaks the convolutional view and poses a challenge for efficient computation. To address this, a hardware-aware parallel algorithm is used to efficiently compute the output.

Figure 1: Overview of the proposed state-space Large Audio Language Model. The components in green, i.e., DASS, the downsampling modules, and the LoRA adapters, are trainable, while the LLM is frozen.
TABLE I: Performance comparison on closed-ended audio tasks. ZS: zero-shot evaluation, the entire dataset is not seen in training; ZS-: weak zero-shot evaluation, the dataset is not used in training, but it is sourced from the same project as part of the training data. Mean average precision (mAP) underestimates the performance of LALMs, as they do not predict the likelihood of non-prominent sound classes. Some approaches use a higher sampling rate of 44.1 kHz than the 16 kHz used in the others.
Model | ESC50 (ZS-, Acc) | DCASE (ZS-, Mi-F1) | VS (ZS, Acc) | TUT (ZS, Acc) | BJO (ZS, Acc) | VGG (Acc) | FSD (mAP) | AudioSet (mAP) | Classif. Avg. | AudioCaps (SPICE) | Clotho (SPICE) | Cap. Avg.
Best specialized models trained supervisedly on each dataset; not generalizable to unseen label sets and tasks:
Best Supervised & Specialized | 97.0 | 64.6 | 98.0 | 74.6 | 97.5 | 59.5 | 56.2 | 47.3 | 74.3 | 17.7 | 13.5 | 15.6
CLIP-like audio-text models; generalizable to unseen labels, but a pre-defined label set is required for inference:
AudioCLIP [32] | 69.4 | - | - | - | - | - | - | 25.9 | - | - | - | -
Wu et al. [33] | 89.1 | - | - | - | - | - | - | - | - | - | - | -
CLAP [34] | 82.6 | 30.0 | 49.5 | 29.6 | 47.5 | - | 30.2 | 5.8 | 40.7 | - | - | -
Baseline LALMs; directly output label names, no pre-defined label set is required at inference:
LTU (7B) [9] | 83.1 | 45.9 | 55.6 | 32.5 | 69.9 | 50.3 | 46.3 | 18.7 | 50.3 | 17.0 | 11.9 | 14.5
SALMONN (7B) [8] | 16.4 (ZS-) | 18.0 (ZS-) | 16.9 (ZS-) | 7.8 (ZS-) | 25.0 (ZS-) | 23.3 (ZS-) | 22.1 (ZS-) | 13.4 (ZS-) | 17.9 | 8.3 | 7.6 | 8.0
Pengi [7] | 80.8 (ZS-) | 29.6 (ZS-) | 46.4 (ZS-) | 18.4 (ZS-) | 47.3 (ZS-) | 16.6 (ZS-) | 35.8 | 11.5 | 35.8 | 12.7 | 7.0 | 9.9
AudioGPT [11] | 41.3 | 20.9 | 35.8 | 14.9 | 21.6 | 5.6 | 18.8 | 12.7 | 21.5 | 6.9 | 6.2 | 6.6
State-space audio encoder (DASS) + LLaMA:
Small Hybrid-LALM (7B) | 87.4 | 47.9 | 58.2 | 28.0 | 67.0 | 48.6 | 46.4 | 18.4 | 50.2 | 17.0 | 12.6 | 14.8
Medium Hybrid-LALM (7B) | 85.6 | 49.5 | 59.4 | 30.6 | 59.3 | 49.4 | 46.1 | 18.3 | 49.8 | 17.6 | 12.4 | 15.0
State-space audio encoder (DASS) + state-space LLM:
Small ssLALM (3B) | 84.3 | 46.4 | 55.8 | 34.1 | 61.9 | 51.2 | 47.8 | 18.6 | 50.0 | 18.0 | 12.1 | 15.1
Medium ssLALM (3B) | 86.8 | 47.9 | 61.2 | 35.9 | 61.0 | 51.0 | 47.7 | 19.4 | 51.4 | 17.7 | 11.7 | 14.7

III-B DASS: Distilled Audio State-Space Model

DASS [19] was one of the first attempts to use a pure state-space-based model to classify audio signals. DASS uses AST [31], a transformer-based teacher model, to guide and train a state-space audio classifier. DASS combines the best of both worlds: it outperforms the transformer-based models and retains the computational advantages of state-space models.

The DASS model is divided into four groups. Each group consists of a state-space block, and the first three groups also contain a downsampling layer. Each group progressively reduces the sequence length and increases the feature size. A pooling layer generates the final embedding that summarizes the input spectrogram. DASS shows remarkable duration scalability: even a model trained on ten-second utterances can infer information from hour-long audio.

In this work, we use DASS pretrained on the AudioSet-2M dataset [35], with the classification layer removed, as the audio feature extractor for the input audio. It takes a spectrogram of size 1024×128 as input and generates a feature map of size 32×4×768 as output. To further reduce the spatial dimension, we apply a two-dimensional convolution with a kernel size of 3 and a stride of 2, and then use a linear layer to map the features from 768 dimensions to the input size of the language model, i.e., 4096 for LLaMA and 2560 for the state-space LLM.
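A minimal PyTorch sketch of this projection step is shown below; the module layout, the padding choice, and the resulting token count are illustrative assumptions rather than the exact released implementation.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Downsample DASS features and project them to the LLM embedding size."""
    def __init__(self, feat_dim=768, llm_dim=4096):   # 4096 for LLaMA, 2560 for the state-space LLM
        super().__init__()
        self.down = nn.Conv2d(feat_dim, feat_dim, kernel_size=3, stride=2, padding=1)
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats):                 # feats: (B, 32, 4, 768) from DASS
        x = feats.permute(0, 3, 1, 2)         # -> (B, 768, 32, 4)
        x = self.down(x)                      # -> (B, 768, 16, 2)
        x = x.flatten(2).transpose(1, 2)      # -> (B, 32, 768) sequence of audio tokens
        return self.proj(x)                   # -> (B, 32, llm_dim)

tokens = AudioProjector()(torch.randn(1, 32, 4, 768))
print(tokens.shape)                           # torch.Size([1, 32, 4096])
```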

III-C LLM

We use the following LLMs in this work:

LLaMA: We follow LTU and use the Vicuna instruction-tuned LLaMA-7B LLM [36]. LLaMA is pretrained on large amounts of natural language and code in a self-supervised manner. Vicuna [37] is further trained on instruction-following prompts generated by GPT models, which improves the model's performance on reasoning and generation tasks.

State-space LLM: We use the 2.8B-parameter state-space LLM [17] trained on the Pile dataset. The state-space LLM outperforms similarly sized transformer-based models such as GPT-Neo 2.7B.

Low-rank Adapters: Instead of finetuning all the weights of the LLMs on our task, we use Low-Rank Adaptation (LoRA) [38] adapters. LoRA adds a small set of learnable weights on top of the pretrained LLM weights; the learnable update can be decomposed into a product of two low-rank matrices. This allows us to modify the large parameter matrices of the LLMs without adding many learnable parameters. The final LLM parameters are the sum of the frozen parameters and the low-rank learnable update.

For LLaMA, we add LoRA adapters (rank = 8, α = 16) to the key and query projection layers in all self-attention layers, which introduces 4.2M learnable parameters. For the state-space LLM, we add LoRA adapters (rank = 8, α = 16) to the input projection layers of the state-space blocks, which adds 6.5M learnable parameters.
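The sketch below shows a hand-written LoRA wrapper (rank 8, α = 16) around a frozen linear projection, of the kind applied to the attention and input projections above; in practice an adapter library can be used instead, and this class is only an assumed, simplified stand-in.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a learnable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # keep the pretrained weights frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)            # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # W x + (alpha / r) * B A x : frozen projection plus the low-rank correction
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))
```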

III-D Training Objective

We train our models using the next-token prediction task conditioned on the input audio and past tokens. We maximize the probability $P(x_{t}\mid x_{1:t-1}, A)$ with a cross-entropy loss over all text tokens $1 < t \leq T$, conditioned on the preceding text tokens and the reference audio $A$. For generation, we use the following settings: temperature = 0.1, top-k = 500, and top-p = 0.95, with a repetition penalty of 1.1.
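A minimal sketch of this objective is given below: audio tokens are prepended to the text tokens and the cross-entropy loss is computed only over text positions. Tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, labels, num_audio_tokens):
    """logits: (B, L, V) over [audio tokens; text tokens];
    labels: (B, L) token ids aligned with the logits."""
    labels = labels.clone()
    labels[:, :num_audio_tokens] = -100                  # no loss on the audio prefix
    shift_logits = logits[:, :-1, :]                     # predict token t from tokens < t
    shift_labels = labels[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)

# At inference time, decoding would use temperature=0.1, top_k=500, top_p=0.95,
# and repetition_penalty=1.1, e.g. as arguments to Hugging Face `model.generate`.
```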

III-E Experiments

We train our models on the OpenAQA dataset [9]. This dataset contains (audio, question, answer) tuples, where the model takes the audio and question as input and generates the answer as output.

We follow the same training pipeline as LTU for the models proposed in this paper. We use 4× RTX A6000 GPUs for training. The hybrid LALM is trained for about 3 days. For the state-space LALM, we can increase the batch size from 4 to 16; we use gradient accumulation so the effective batch size stays the same, i.e., 256. This speeds up training and allows us to train the model in less than two days. However, the reason for the speed-up is unclear: it could simply be that the state-space LLM is smaller, or that the state-space model is more computationally efficient. In the future, when larger state-space LLMs become publicly available, we plan to increase the model size and compare the computational efficiency of state-space and Transformer-based LALMs.

The small and medium hybrid-LALMs contain 42M and 61M trainable parameters out of 6.8B total parameters, respectively. The small and medium ssLALMs contain 43M and 62M trainable parameters out of 2.8B total parameters, respectively.

Question: Based on the combination of sounds in the audio clip, what kind of atmosphere or environment does it evoke? [Scene Understanding]
Reference: The combination of sounds suggests a rural or natural setting with animals and machinery.
LTU: The combination of sounds in the audio clip evokes a busy, active atmosphere, possibly a quirky environment with a mix of mechanical devices, talking, an animal, and tapping.
Small Hybrid-LALM: The audio clip evokes a natural and outdoor atmosphere with animals and people interacting together.
Medium Hybrid-LALM: The combination of sounds in the audio clip evokes a rural or natural environment with animals present.
Small ssLALM: The combination of sounds in the audio clip evokes a busy, active atmosphere, possibly a quirky environment with a mix of mechanical devices, talking, an animal, and tapping.
Medium ssLALM: The combination of sounds creates a lively and bustling atmosphere with people talking while pigs are making noise.
Question: Based on the audio clip’s acoustic features, how can we differentiate between Liquid and Gurgling, bubbling sounds? [Detailed Perception Question]
Reference: Liquid sounds are characterized by Smooth, flowing, and consonant features, while Gurgling, bubbling sounds have a Low-pitched, rhythmic, wet popping sensation.
LTU: Liquid sounds are characterized by Smooth, flowing, and consonant features, while Gurgling, bubbling sounds have a Low-pitched, rhythmic, wet popping sensation.
Small Hybrid-LALM: Liquid sound is smooth and flowing while Gurgling, bubbling sound is more irregular and intermittent.
Medium Hybrid-LALM: Liquid sound is characterized by a smooth and continuous flow of water, while gurgling and bubbling sounds are caused by air bubbles in the liquid that create turbulence and noise.
Small ssLALM: Liquid sound is smooth with a continuous flow of water while gurgling or bubbling noise has intermittent bursts.
Medium ssLALM: Liquid sound is smooth and flowing while gurgling has a distinct bubbly quality.
TABLE II: Open-ended AQA samples for the LALMs proposed in the paper.

To reason about and understand audio, an LALM must first be able to recognize the input audio. One important step in benchmarking LALMs is comparing performance on close-ended tasks where the output labels are predefined. We follow the evaluation pipeline from LTU [9] and compare our models with existing LALMs on 8 audio classification benchmarks and 2 audio captioning benchmarks.

Audio Classification: LALMs do not directly predict a class index but instead output label names or descriptions. To evaluate these models, we encode the LALM output and each evaluation label with a text encoder and compute the cosine similarity between them. For single-label classification tasks, we take the label with the highest similarity score and compute accuracy or F1-score; for multi-label classification tasks, we use the cosine similarity as the prediction score and compute mAP. We use the prompt "write an audio caption describing the sound" for the classification tasks.
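The sketch below illustrates this similarity-based scoring; the `embed` function stands in for whatever text encoder the evaluation pipeline uses and is an assumption of this example.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_labels(embed, lalm_output, label_names):
    """Return a similarity score for every candidate label.
    `embed` maps a string to a 1-D numpy vector (any sentence-embedding model)."""
    out_vec = embed(lalm_output)
    return {label: cosine(out_vec, embed(label)) for label in label_names}

# Single-label tasks: take the argmax label and compute accuracy / F1.
# Multi-label tasks: use the scores directly as predictions and compute mAP.
```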

Audio Captioning: For the caption generation tasks, we use the prompt "write an audio caption describing the sound" and take the LALM output as the prediction. We use the AudioCaps and Clotho datasets, with SPICE as the evaluation metric.

As seen in Table I, our proposed models outperform other existing models such as SALMONN, Pengi, and AudioGPT. For SALMONN [8], Pengi [7], and AudioGPT [11], we use the results reported in the GAMA paper. For both the audio classification and audio captioning tasks, our proposed models outperform the existing models on most of the datasets and in overall average performance. For the audio captioning task, our models perform similarly to the best supervised systems.

Our proposed models match the performance of LTU [9], which our approaches build upon. The audio encoder used in LTU, AST, is first pretrained on audio-visual data and then finetuned on the AudioSet-2M dataset. In contrast, the DASS model is trained only on the audio data from AudioSet-2M. DASS trained on only AudioSet-2M outperforms AST trained on the same dataset and shows more robustness and duration scalability [19].

The state-space LLM used in the ssLALM is much smaller and is not instruction-tuned, unlike the LLaMA used in the hybrid LALMs or in transformer LALMs such as LTU. Despite being smaller and having both the audio encoder and LLM trained on less data, the ssLALMs perform competitively with the best transformer-based LALMs.

All the LALMs, including the ssLALMs, perform poorly on the multi-label classification task on AudioSet. We believe this is because LALMs mainly predict the prominent class and underestimate the likelihood of non-prominent sound classes in the input audio, which results in low mAP scores. We also show some open-ended question-answering abilities of the models in Table II. All the models can infer information from the audio and generate reasonable answers.

IV Conclusions and Future Work

In this paper, we propose the first state-space large audio language model. We systematically replace the audio-perception and LLM components in the LALM and analyze the performance. Our experiments show that ssLALMs perform competitively with transformer-based LALMs despite using significantly fewer parameters.

In the future, we would like to scale up the training data and use more complex reasoning datasets such as CompA-R [12]. We would also like to build larger ssLALMs with larger state-space LLMs. Finally, we would like to explore state-space-attention hybrid models such as Jamba as the language model, as they combine the best of state-space models and transformers.

References

  • [1] Jacob Devlin, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [2] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.
  • [3] Tom B Brown, “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
  • [4] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
  • [5] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022.
  • [6] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
  • [7] Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang, “Pengi: An audio language model for audio tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 18090–18108, 2023.
  • [8] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang, “Salmonn: Towards generic hearing abilities for large language models,” arXiv preprint arXiv:2310.13289, 2023.
  • [9] Yuan Gong, Hongyin Luo, Alexander H Liu, Leonid Karlinsky, and James Glass, “Listen, think, and understand,” arXiv preprint arXiv:2305.10790, 2023.
  • [10] Yuan Gong, Alexander H Liu, Hongyin Luo, Leonid Karlinsky, and James Glass, “Joint audio and speech understanding,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.
  • [11] Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al., “Audiogpt: Understanding and generating speech, music, sound, and talking head,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2024, vol. 38, pp. 23802–23804.
  • [12] Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, and Dinesh Manocha, “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,” arXiv preprint arXiv:2406.11768, 2024.
  • [13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [14] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma, “Linformer: Self-attention with linear complexity,” arXiv preprint arXiv:2006.04768, 2020.
  • [15] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.
  • [16] Albert Gu, Karan Goel, and Christopher Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.
  • [17] Albert Gu and Tri Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
  • [18] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu, “Vmamba: Visual state space model,” arXiv preprint arXiv:2401.10166, 2024.
  • [19] Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, and James Glass, “Dass: Distilled audio state space models are stronger and more duration-scalable learners,” arXiv preprint arXiv:2407.04082, 2024.
  • [20] Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur, “Long range language modeling via gated state spaces,” arXiv preprint arXiv:2206.13947, 2022.
  • [21] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,” arXiv preprint arXiv:2401.09417, 2024.
  • [22] Mehmet Hamza Erol, Arda Senocak, Jiu Feng, and Joon Son Chung, “Audio mamba: Bidirectional state space model for audio representation learning,” arXiv e-prints, pp. arXiv–2406, 2024.
  • [23] Jiaju Lin and Haoxuan Hu, “Audio mamba: Pretrained audio state space model for audio tagging,” arXiv preprint arXiv:2405.13636, 2024.
  • [24] Siavash Shams, Sukru Samet Dindar, Xilin Jiang, and Nima Mesgarani, “Ssamba: Self-supervised audio representation learning with mamba state space model,” arXiv preprint arXiv:2405.11831, 2024.
  • [25] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al., “Instruction tuning for large language models: A survey,” arXiv preprint arXiv:2308.10792, 2023.
  • [26] Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 646–650.
  • [27] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, pp. 9, 2019.
  • [28] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518.
  • [29] Yi Luo and Nima Mesgarani, “Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  • [30] Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei, “Beats: Audio pre-training with acoustic tokenizers,” arXiv preprint arXiv:2212.09058, 2022.
  • [31] Yuan Gong, Yu-An Chung, and James Glass, “Ast: Audio spectrogram transformer,” arXiv preprint arXiv:2104.01778, 2021.
  • [32] Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel, “Audioclip: Extending clip to image, text and audio,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 976–980.
  • [33] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [34] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [35] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 776–780.
  • [36] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [37] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), vol. 2, no. 3, p. 6, 2023.
  • [38] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.