TRANSAVS: END-TO-END AUDIO-VISUAL SEGMENTATION WITH TRANSFORMER
Abstract
Audio-Visual Segmentation (AVS) is a challenging task, which aims to segment sounding objects in video frames by exploring audio signals. Generally, AVS faces two key challenges: (1) Audio signals inherently exhibit a high degree of information density, as sounds produced by multiple objects are entangled within the same audio stream; (2) Objects of the same category tend to produce similar audio signals, making it difficult to distinguish between them and thus leading to unclear segmentation results. Toward this end, we propose TransAVS, the first Transformer-based end-to-end framework for the AVS task. Specifically, TransAVS disentangles the audio stream into audio queries, which interact with images and are decoded into segmentation masks with a full transformer architecture. This scheme not only promotes comprehensive audio-image communication but also explicitly extracts instance cues encapsulated in the scene. Meanwhile, to encourage these audio queries to capture distinctive sounding objects instead of degenerating into homogeneous ones, we devise two self-supervised loss functions at the query and mask levels, allowing the model to capture distinctive features within similar audio data and achieve more precise segmentation. Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset, highlighting its effectiveness in bridging the gap between audio and visual modalities.
Index Terms— Audio-visual segmentation, multi-modal learning, transformer.
1 Introduction
Humans possess the remarkable ability to leverage audio and visual input signals to enhance the perception of the world [1]. For instance, we can identify the location of an object not only based on its visual appearance but also by the sounds it produces. This intrinsic connection has paved the way for numerous audio-visual tasks, including audio-visual correspondence [2, 3, 4, 5], audio-visual event localization [6, 7, 8, 9, 10], audio-visual video parsing [11, 12, 13, 14], and sound source localization [2, 3, 15]. However, the absence of pixel-wise annotations has limited these methods to frame or patch-level comprehension, ultimately restricting their training objectives to the classification of audible images.
Recently, a novel audio-visual segmentation (AVS) task was introduced in [16] with the aim of segmenting sounding objects corresponding to audio cues in video frames. This task is inherently non-trivial due to the following two challenges. Firstly, audio signals are information-dense, as they often contain sounds from multiple sources simultaneously. For example, in a concert, the sounds of instruments and human voices often become intertwined. This necessitates the disentanglement of audio signals at each timestamp into multiple latent components to effectively capture the unique sounding features of individual objects. Secondly, audio signals from objects of the same category often exhibit similar frequencies, such as in the case of the Husky and the Tibetan Mastiff. This ambiguity places greater demands on the audio signal representation throughout the network to avoid inaccurately locating the sources of sound. However, the existing method [16] fails to address these challenges. Concretely, it simply extracts audio features at each timestamp using an audio encoder, fuses them with image embeddings through convolution, and then generates the final prediction with an FPN-based scheme [17] under standard segmentation supervision [18, 19].

To this end, we propose a novel transformer-based end-to-end audio-visual segmentation framework (TransAVS), drawing inspiration from the recent success of the transformer architecture in multi-modal learning [20]. As depicted in Fig. 1(b), TransAVS is a multi-modal transformer that leverages audio cues to guide both the fusion with visual features and the segmentation. Concretely, to handle scenarios with multiple sounding objects, we first disentangle the audio stream to initialize several audio queries, which encourages the model to explicitly attend to different objects and facilitates the acquisition of instance-level awareness and discrimination. In addition, we introduce two self-supervised loss functions at the query and mask levels, respectively. These functions play a pivotal role in optimizing the audio queries by encouraging heterogeneity during the learning process. This design empowers the model to discern and capture unique features embedded within similar audio streams, resulting in more precise segmentation.

In summary, the main contributions of this paper are four-fold: (1) To the best of our knowledge, we are the first to introduce a multi-modal transformer-based framework to tackle the AVS task, leveraging its potent long-range modeling abilities to promote cross-modal interaction; (2) To guide the model towards perceiving and discriminating sounding objects at the instance level, we explicitly disentangle audio cues as audio queries; (3) To effectively address the homogeneity of sounds among objects of the same category, we design two self-supervised loss functions, enabling the model to capture distinctive features within similar audio streams; (4) Qualitative and quantitative experiments demonstrate the state-of-the-art performance of our method on the AVSBench dataset.
2 Methodology
In this section, we will delve into the details of our proposed TransAVS framework. We begin by introducing the problem formulation in Section 2.1, followed by a comprehensive explanation of the TransAVS architecture in Section 2.2. Furthermore, we outline the design and rationale behind our self-supervised loss functions in Section 2.3. Lastly, Section 2.4 explains how TransAVS infers sounding object masks.
2.1 Problem Formulation
Table 1: Quantitative comparison with SSL, VOS, SOD methods and the AVS baseline [16] on AVSBench ($\mathcal{M}_\mathcal{J}$ and $\mathcal{M}_\mathcal{F}$, %).

| Metric | Setting | LVS [21] (SSL) | MSSL [22] (SSL) | 3DC [23] (VOS) | SST [24] (VOS) | iGAN [25] (SOD) | LGVT [26] (SOD) | Baseline [16] (ResNet) | Baseline [16] (Pvt-v2) | Ours (ResNet) | Ours (Swin-base) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $\mathcal{M}_\mathcal{J}$ | Single-source (S4) | 37.9 | 44.9 | 57.1 | 66.3 | 61.6 | 74.9 | 72.8 | 78.7 | 83.1 | 89.4 |
| $\mathcal{M}_\mathcal{J}$ | Multi-source (MS3) | 29.5 | 26.1 | 36.9 | 42.6 | 42.9 | 40.7 | 47.9 | 54.0 | 58.9 | 63.5 |
| $\mathcal{M}_\mathcal{F}$ | Single-source (S4) | 51.0 | 66.3 | 75.9 | 80.1 | 77.8 | 87.3 | 84.8 | 87.9 | 90.6 | 94.2 |
| $\mathcal{M}_\mathcal{F}$ | Multi-source (MS3) | 31.0 | 36.3 | 50.3 | 57.2 | 54.4 | 59.3 | 59.3 | 64.5 | 72.9 | 75.2 |
For the AVS task, the input data comprises a sequence of video frames $\{I_t\}_{t=1}^{T}$, where $I_t \in \mathbb{R}^{H \times W \times 3}$, and a $T$-second audio stream $A$. The goal of AVS is to segment all sounding objects in each frame $I_t$ under the acoustic guidance of $A$. The segmentation results are binary masks $\{M_t\}_{t=1}^{T}$, where $M_t \in \{0, 1\}^{H \times W}$, with '1' indicating sounding objects and '0' corresponding to background or silent objects.
2.2 The Architecture of TransAVS
Our TransAVS framework consists of three modules: (1) a feature extractor responsible for extracting multi-scale image features $\{F_v^i\}_{i=1}^{3}$ and the audio feature $F_a$; (2) an audio-visual transformer-based fusion module, which disentangles $F_a$ into audio queries $Q_a$ and fuses them with $\{F_v^i\}_{i=1}^{3}$ in a transformer fashion; (3) a mask generation module predicting binary masks $m$ together with the probability $p$ of sounding objects.
2.2.1 Feature Extractor
Visual Feature: Taking one frame $I_t$ in $\{I_t\}_{t=1}^{T}$ as input, a pretrained visual backbone is employed to extract visual features. To exploit semantic information at different levels, we extract visual features at three scales $\{F_v^i\}_{i=1}^{3}$, where $F_v^i \in \mathbb{R}^{C_i \times H_i \times W_i}$ and $C_i$ is the channel dimension, which depends on the encoder. We also upsample the backbone features into an image embedding $F_{mask}$ for later mask generation.
Audio Feature: Given the audio clip $A$, we first convert it into a spectrogram via the short-time Fourier transform, then pass it to a pretrained audio backbone, VGGish [27], to obtain the audio embedding $F_a$.
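As a concrete illustration, the sketch below shows one way this audio branch could be implemented in PyTorch, assuming a log-mel front end computed with torchaudio; the `embedder` argument is a hypothetical stand-in for the pretrained VGGish backbone [27], and the spectrogram parameters are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torchaudio

class AudioFeatureExtractor(nn.Module):
    """Sketch of the audio branch: waveform -> log-mel spectrogram -> embedding F_a.

    `embedder` stands in for the pretrained VGGish backbone [27]; any module that
    maps a (B, 1, n_mels, frames) spectrogram to a (B, d) feature would work here.
    """
    def __init__(self, embedder: nn.Module, sample_rate: int = 16000):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=64)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.embedder = embedder

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (B, num_samples) mono audio clip A for one timestamp
        spec = self.to_db(self.melspec(waveform))   # (B, n_mels, frames)
        spec = spec.unsqueeze(1)                    # (B, 1, n_mels, frames)
        return self.embedder(spec)                  # F_a: (B, d)
```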
2.2.2 Audio-visual Transformer-based Fusion Module
As previously mentioned, we first disentangle the audio feature $F_a$ into audio queries to facilitate the model's learning of instance-level awareness and discrimination, then adopt the attention mechanism to establish long-range connections between audio cues and visual features. Technically, we begin by projecting $F_a$ into $N$ independent queries $Q_a^{0} \in \mathbb{R}^{N \times d}$ with a linear transform $W_{proj}$:
$Q_a^{0} = W_{proj}\, F_a$    (1)
then feed them into $L_e$ encoder layers to capture their dependencies. Concretely, at the $l$-th encoder layer:
$\hat{Q}_a^{l} = \mathrm{softmax}\!\big( (W_q^{l} Q_a^{l-1})(W_k^{l} Q_a^{l-1})^{\top} / \sqrt{d} \big)\,(W_v^{l} Q_a^{l-1}) + Q_a^{l-1}$    (2)

$Q_a^{l} = \mathrm{FFN}(\hat{Q}_a^{l}) + \hat{Q}_a^{l}$    (3)
where $W_q^{l}$, $W_k^{l}$, and $W_v^{l}$ respectively represent the query, key, and value transformation matrices at the $l$-th encoder layer, following the standard attention scheme [28].
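A minimal PyTorch sketch of Eqs. (1)-(3), assuming the projection of $F_a$ into $N$ queries is realized as a single linear layer reshaped to $(N, d)$; the hidden dimension, number of heads, and number of encoder layers are assumptions (only $N = 300$ is suggested by the ablation in Section 3.4).

```python
import torch
import torch.nn as nn

class AudioQueryEncoder(nn.Module):
    """Project the audio feature F_a into N audio queries and refine them
    with standard self-attention encoder layers (Eqs. (1)-(3))."""
    def __init__(self, d_audio: int, d_model: int = 256,
                 num_queries: int = 300, num_layers: int = 6, nhead: int = 8):
        super().__init__()
        # Linear transform W_proj that disentangles F_a into N independent queries.
        self.proj = nn.Linear(d_audio, num_queries * d_model)
        self.num_queries, self.d_model = num_queries, d_model
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, f_a: torch.Tensor) -> torch.Tensor:
        # f_a: (B, d_audio) audio feature for one timestamp
        q = self.proj(f_a).view(-1, self.num_queries, self.d_model)  # Q_a^0: (B, N, d)
        return self.encoder(q)                                       # Q_a^{L_e}: (B, N, d)
```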
After $L_e$ encoder layers, the output $Q_a^{L_e}$ conveys the information of different audio components, providing the network with guidance for attending to different sounding regions during the cross-modal fusion process within the following $L_d$ decoder layers. Specifically, at the $l$-th decoder layer, the audio queries serve as queries while the image features $F_v^{1}$, $F_v^{2}$, $F_v^{3}$ act as keys and values in a round-robin fashion:
$F_v^{(l)} = F_v^{\,((l-1) \bmod 3) + 1}$    (4)

$\hat{Q}_a^{l} = \mathrm{softmax}\!\big( (W_q^{l} Q_a^{l-1})(W_k^{l} F_v^{(l)})^{\top} / \sqrt{d} \big)\,(W_v^{l} F_v^{(l)}) + Q_a^{l-1}$    (5)

$Q_a^{l} = \mathrm{FFN}(\hat{Q}_a^{l}) + \hat{Q}_a^{l}$    (6)
where ‘mod’ denotes the modulo operation, and $W_q^{l}$, $W_k^{l}$, and $W_v^{l}$ denote the query, key, and value transformation matrices at the $l$-th decoder layer, respectively. This approach not only establishes long-range connections between the audio stream and visual frames but also compels the model, through the audio queries, to be aware of and discriminate sounding objects at the instance level.
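The round-robin cross-attention of Eqs. (4)-(6) could be sketched as follows; the residual and FFN layout mirrors a standard transformer decoder layer, and the number of decoder layers is an assumption.

```python
import torch
import torch.nn as nn

class CrossModalDecoderLayer(nn.Module):
    """One decoder layer: audio queries attend to one visual scale (Eqs. (5)-(6))."""
    def __init__(self, d_model: int = 256, nhead: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, q_a, f_v):
        # q_a: (B, N, d) audio queries; f_v: (B, H_i*W_i, d) one flattened visual scale
        attn, _ = self.cross_attn(query=q_a, key=f_v, value=f_v)
        q_a = self.norm1(q_a + attn)
        return self.norm2(q_a + self.ffn(q_a))

class CrossModalDecoder(nn.Module):
    """Stack of decoder layers cycling over the three visual scales (Eq. (4))."""
    def __init__(self, num_layers: int = 9, d_model: int = 256):
        super().__init__()
        self.layers = nn.ModuleList(CrossModalDecoderLayer(d_model)
                                    for _ in range(num_layers))

    def forward(self, q_a, f_v_scales):
        # f_v_scales: list of 3 flattened visual features [(B, H_i*W_i, d), ...]
        for l, layer in enumerate(self.layers):
            q_a = layer(q_a, f_v_scales[l % 3])   # round-robin scale selection
        return q_a
```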
2.2.3 Mask Generation
Based on the fused audio queries $Q_a^{L_d}$ and the image embedding $F_{mask}$, the mask generation module predicts segmentation masks $m$ together with the probability $p$ of sounding objects.
Technically, for binary mask generation, we apply a $1 \times 1$ convolution, denoted as $f_{conv}$, on $F_{mask}$ to adjust its channel dimension to $d$, then multiply it with $Q_a^{L_d}$ followed by a sigmoid function:
$m = \mathrm{sigmoid}\big( Q_a^{L_d} \cdot f_{conv}(F_{mask}) \big)$    (7)
Meanwhile, to calculate the probability $p$, a classifier $f_{cls}$ and a softmax function are utilised:
$p = \mathrm{softmax}\big( f_{cls}(Q_a^{L_d}) \big)$    (8)
where $K$ is the number of categories. $m$ and $p$ are paired with the audio queries as $\{(m_i, p_i)\}_{i=1}^{N}$ for optimization and inference.
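A sketch of the mask generation module implementing Eqs. (7)-(8); the einsum realizes the query-pixel dot product, and the extra "no sounding object" class in the classifier is an assumption borrowed from query-based segmentation heads rather than something stated in the text.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Predict per-query masks m (Eq. (7)) and class probabilities p (Eq. (8))."""
    def __init__(self, d_model: int = 256, mask_channels: int = 256,
                 num_classes: int = 23):
        super().__init__()
        # 1x1 convolution f_conv that maps F_mask to the query dimension d.
        self.f_conv = nn.Conv2d(mask_channels, d_model, kernel_size=1)
        # Classifier over K categories plus an assumed "no sounding object" class.
        self.f_cls = nn.Linear(d_model, num_classes + 1)

    def forward(self, q_a: torch.Tensor, f_mask: torch.Tensor):
        # q_a: (B, N, d) fused audio queries; f_mask: (B, C, H, W) image embedding
        pix = self.f_conv(f_mask)                                         # (B, d, H, W)
        masks = torch.sigmoid(torch.einsum('bnd,bdhw->bnhw', q_a, pix))   # m: (B, N, H, W)
        probs = self.f_cls(q_a).softmax(dim=-1)                           # p: (B, N, K+1)
        return masks, probs
```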
2.3 Self-Supervised Loss Functions
As mentioned before, objects of the same category tend to produce similar sound frequencies, resulting in a significant degree of homogeneity that can hinder the model's performance. To address this, we propose two loss functions at the query and mask levels, namely the Audio Query Distance Loss (AQDL) and the Audio Query Mask Loss (AQML), with the goal of increasing heterogeneity and thus enhancing segmentation accuracy.
To be specific, AQDL, denoted as $\mathcal{L}_{AQDL}$, penalizes audio queries that predict sounding objects but lie too close to each other, which indicates that they are highly similar and thus provide less distinct guidance:

$S_q = \big\{\, q_i \;\big|\; \max_k p_i(k) > \tau_q \,\big\}$    (9)

$\mathcal{L}_{AQDL} = \dfrac{1}{|S_q|} \sum\limits_{q_i, q_j \in S_q,\, i \neq j} \max\!\big( 0,\; \delta - \| h(q_i) - h(q_j) \|_2 \big)$    (10)
where $h(\cdot)$ represents a projection head, $q_i$ and $q_j$ are elements of the set $S_q$, $\tau_q$ is the confidence threshold used to construct $S_q$, $|S_q|$ is the cardinality of $S_q$, $\|\cdot\|_2$ is the $\ell_2$ norm, and $\delta$ is the distance threshold. AQDL promotes heterogeneity by preventing confident queries from getting too close to each other.
On the other hand, AQML, denoted as $\mathcal{L}_{AQML}$, encourages the audio queries to predict mutually exclusive sounding masks as much as possible. It focuses on the intersecting pixels between different sounding-object masks:
$S_m = \big\{\, m_i \;\big|\; \max_k p_i(k) > \tau_m \,\big\}$    (11)

$\mathcal{L}_{AQML} = \dfrac{1}{|S_m|} \sum\limits_{m_i, m_j \in S_m,\, i \neq j} \big\| \mathrm{Bin}(m_i) \odot \mathrm{Bin}(m_j) \big\|_1$    (12)
where $m_i$ and $m_j$ belong to the set $S_m$, $\tau_m$ is the confidence threshold used to construct $S_m$, $|S_m|$ is the cardinality of $S_m$, 'Bin' denotes the binarization operation with threshold $\tau_{bin}$, and $\odot$ is the Hadamard product. AQML forces queries to attend to different parts of the image, thus reducing their homogeneity.
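The snippet below is one possible reading of AQDL and AQML as reconstructed in Eqs. (9)-(12): select confident queries/masks, then apply a hinge on pairwise query distances and a penalty on overlapping binarized masks. The thresholds `tau_q`, `tau_m`, `tau_bin`, the margin `delta`, and the use of the last class index as "no object" are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def aqdl(queries, probs, proj_head, tau_q=0.7, delta=1.0):
    """Audio Query Distance Loss: penalize confident queries that lie too close.

    queries: (N, d) audio queries, probs: (N, K+1) class probabilities,
    proj_head: projection head h(.) applied before measuring distances.
    """
    conf = probs[:, :-1].max(dim=-1).values          # confidence over sounding classes
    sel = proj_head(queries[conf > tau_q])           # confident queries, the set S_q
    if sel.shape[0] < 2:
        return queries.new_zeros(())
    dist = torch.cdist(sel, sel)                     # pairwise L2 distances
    off_diag = ~torch.eye(sel.shape[0], dtype=torch.bool, device=sel.device)
    return F.relu(delta - dist[off_diag]).mean()     # hinge on small distances

def aqml(masks, probs, tau_m=0.7, tau_bin=0.5):
    """Audio Query Mask Loss: penalize overlap between confident sounding masks."""
    conf = probs[:, :-1].max(dim=-1).values
    sel = (masks[conf > tau_m] > tau_bin).float()    # binarized masks, the set S_m
    n = sel.shape[0]
    if n < 2:
        return masks.new_zeros(())
    inter = torch.einsum('ihw,jhw->ij', sel, sel)    # pairwise intersection areas
    off_diag = ~torch.eye(n, dtype=torch.bool, device=sel.device)
    return inter[off_diag].mean() / sel[0].numel()   # normalized overlap penalty
```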
2.4 Inference Stage
During inference, TransAVS predicts the label of each pixel at location $(h, w)$ based on the paired outputs $\{(m_i, p_i)\}_{i=1}^{N}$:
(14) | |||
(15) |
That is, a pixel is assigned to a sounding object in $\hat{M}$ only when both the class probability and the mask prediction probability are sufficiently high.
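A hedged sketch of this decision rule following the reconstructed Eqs. (14)-(15); the threshold `tau` and the use of the last class index as "no object" are assumptions.

```python
import torch

def infer_sounding_mask(masks, probs, tau=0.5):
    """Combine per-query masks m_i and class probabilities p_i into one binary map.

    masks: (N, H, W) sigmoid mask predictions, probs: (N, K+1) class probabilities
    (last index assumed to be "no object"). A pixel is marked sounding only if some
    query is both confident about a sounding class and confident about the pixel.
    """
    cls_score = probs[:, :-1].max(dim=-1).values      # (N,) sounding-class confidence
    score = cls_score[:, None, None] * masks          # (N, H, W) joint score
    return (score.max(dim=0).values > tau).float()    # (H, W) binary mask
```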
3 Experiments
3.1 AVSBench Dataset
All videos in the AVSBench dataset are trimmed to 5 seconds and separated into two subsets based on the number of sound sources: single-source sound segmentation (S4) and multiple-sound-source segmentation (MS3). Each video is then divided into five non-overlapping 1-second clips, and one frame is sampled from each clip. As shown in Table 2, only the first sampled frame of each video in the training split of S4 is annotated, while all frames in the other splits are annotated. S4 contains 23 classes (Cls.), covering sounds from humans, animals, vehicles, and musical instruments. Each video in the MS3 subset includes two or more categories from the S4 subset.
Table 2: Statistics of the AVSBench dataset.

| Subset | Cls. | Videos | Train / Valid / Test videos | Annotated frames (Train / Valid / Test) |
|---|---|---|---|---|
| Single-source (S4) | 23 | 4,932 | 3,452 / 740 / 740 | 3,452 / 3,700 / 3,700 |
| Multi-source (MS3) | 23 | 424 | 296 / 64 / 64 | 1,480 / 320 / 320 |
3.2 Implementation Details
For the visual backbone, we choose two representative ones: the standard CNN-based ResNet [31] backbone R101 and the Transformer-based Swin Transformer [32] backbone swin-base. R101 is pretrained on ImageNet-1K [33], while swin-base is pretrained on ImageNet-22K. For the loss weights, we assign weights $\lambda_{AQDL}$ and $\lambda_{AQML}$ to AQDL and AQML, respectively. For the optimizer, we use AdamW [34] with an initial learning rate of 0.0001 for both the R101 and swin-base backbones; a learning rate multiplier is also applied. All models are trained on 8 RTX 3090 GPUs for 90k iterations with a fixed batch size.
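For illustration, the optimizer setup might look like the sketch below; the backbone learning-rate multiplier value and the parameter-name convention are placeholders, since their exact settings are not recoverable from the text.

```python
import torch

def build_optimizer(model, base_lr=1e-4, backbone_multiplier=0.1):
    """AdamW with a separately scaled learning rate for the visual backbone.

    backbone_multiplier is a placeholder value: a multiplier is applied in the
    paper, but its magnitude is not recoverable from the extracted text.
    """
    backbone_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (backbone_params if name.startswith('backbone') else other_params).append(p)
    return torch.optim.AdamW([
        {'params': other_params, 'lr': base_lr},
        {'params': backbone_params, 'lr': base_lr * backbone_multiplier},
    ])
```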
3.3 Main Results
Since AVS is a newly proposed problem, we compare our network with the baseline in [16] and with methods from three related tasks: sound source localization (SSL), video object segmentation (VOS), and salient object detection (SOD). For each task, we report the results of two SOTA methods on the AVSBench dataset, i.e., LVS [21] and MSSL [22] for SSL, 3DC [23] and SST [24] for VOS, and iGAN [25] and LGVT [26] for SOD. To ensure fairness, all backbones of these methods are pretrained on the ImageNet-1K [33] dataset.
Quantitative Comparison. Given a test frame, we denote the predicted mask as $\hat{M}$ and the ground truth as $G$; the Jaccard index [35] and the F-score are used to measure region similarity and contour accuracy, respectively:
$\mathcal{J} = \dfrac{|\hat{M} \cap G|}{|\hat{M} \cup G|}$    (16)

$\mathcal{F} = \dfrac{(1 + \beta^2)\, \mathrm{precision} \times \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$    (17)
where $\beta^2$ is set to 0.3. We use $\mathcal{M}_\mathcal{J}$ and $\mathcal{M}_\mathcal{F}$ to denote the mean metric values over the whole test dataset. The quantitative results are shown in Table 1. It is evident that our proposed approach consistently outperforms existing methods in both subsets across all visual backbones. Even in the S4 subset, where the baseline achieves high values on the $\mathcal{M}_\mathcal{J}$ metric (72.8 with ResNet50 and 78.7 with Pvt-v2), our proposed TransAVS still shows improvements: 10.3 points higher with ResNet and 10.7 points higher with Pvt-v2. We attribute this improvement to our transformer framework, which uses audio queries to explicitly learn instance-level awareness and discrimination of sounding objects, as well as to our loss functions, which increase the heterogeneity among sounds from objects of the same category. These design choices allow our model to exploit important audio cues and achieve better segmentation.
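For reference, a minimal NumPy computation of the two per-frame metrics in Eqs. (16)-(17); averaging over the test set then yields $\mathcal{M}_\mathcal{J}$ and $\mathcal{M}_\mathcal{F}$.

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J = |pred ∩ gt| / |pred ∪ gt| for binary masks (Eq. (16))."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 1.0

def f_score(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """F-score combining precision and recall with weight beta^2 (Eq. (17))."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    denom = beta2 * precision + recall
    return float((1 + beta2) * precision * recall / denom) if denom > 0 else 0.0
```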
Qualitative Comparison. We provide some qualitative examples from both S4 and MS3 in Fig. 3. The segmentation results clearly demonstrate that our method outperforms the baseline. We believe that our method's instance-level awareness and discrimination, which enable it to distinguish individual sounding sources, contribute significantly to the more precise segmentation. This is especially evident in Fig. 3(b), where TransAVS effectively delineates the shape of the sounding sources (guitars) while discarding the silent objects (hands).
3.4 Ablation Study
Table 3: Ablation study on the number of audio queries $N$ and the mode of the confidence thresholds $\tau_q$ and $\tau_m$ (ResNet backbone).

| $N$ | Mode of $\tau_q$ and $\tau_m$ | S4 $\mathcal{M}_\mathcal{J}$ | S4 $\mathcal{M}_\mathcal{F}$ | MS3 $\mathcal{M}_\mathcal{J}$ | MS3 $\mathcal{M}_\mathcal{F}$ |
|---|---|---|---|---|---|
| 100 | increasing $\tau_q$ and $\tau_m$ | 80.8 | 87.4 | 56.2 | 70.3 |
| 500 | increasing $\tau_q$ and $\tau_m$ | 81.2 | 89.4 | 57.1 | 71.8 |
| 300 | only $\tau_q$ fixed | 80.4 | 87.9 | 56.3 | 70.1 |
| 300 | only $\tau_m$ fixed | 80.5 | 87.7 | 56.2 | 70.4 |
| 300 | increasing $\tau_q$ and $\tau_m$ | 83.1 | 90.6 | 58.9 | 72.9 |
In Table 3, we verify the effectiveness of each key design of the proposed method with the ResNet backbone on both the S4 and MS3 subsets. Based on the first two rows and the last row, the optimal performance is obtained when the number of audio queries $N$ is set to 300. In the third and fourth rows, $\tau_q$ and $\tau_m$ are kept fixed, respectively. Both fixed modes show a notable decrease compared with the increasing mode adopted in the last row:
$\tau(t) = \tau_{0} + (\tau_{\max} - \tau_{0}) \cdot \dfrac{t}{T_{total}}$    (18)
where $\tau_{0}$ is the initial threshold, $\tau_{\max}$ is the final threshold, and $t / T_{total}$ is the fraction of training completed. We hypothesize that this phenomenon can be explained as follows: as TransAVS becomes increasingly confident about its mask predictions, a fixed threshold may fail to penalize under-confident audio queries at earlier stages while pushing too many audio queries away at later epochs. This leads to poorer performance compared with using an increasing threshold.
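As a sketch, a linear ramp matching the reconstructed Eq. (18) could be implemented as below; the linear form and the endpoint values are assumptions, since the text only states that the thresholds increase during training.

```python
def increasing_threshold(step: int, total_steps: int,
                         tau_start: float = 0.5, tau_end: float = 0.9) -> float:
    """Linearly ramp a confidence threshold from tau_start to tau_end over training.

    The linear schedule and the endpoint values are illustrative placeholders."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start + (tau_end - tau_start) * frac
```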
4 Conclusion
In this paper, we introduce TransAVS, the first transformer-based end-to-end framework for the AVS task. We disentangle the audio stream into audio queries to explicitly guide the model in learning instance-level awareness and discrimination of sounding objects. Additionally, we design two self-supervised loss functions to address the homogeneity of sounds within the same category. Experimental results on the AVSBench dataset show that TransAVS achieves SOTA performance on both the S4 and MS3 subsets, demonstrating its effectiveness in bridging the gap between the audio and visual modalities.
References
- [1] Dana M Small and John Prescott, “Odor/taste integration and the perception of flavor,” Exp Brain Res, 2005.
- [2] Relja Arandjelovic and Andrew Zisserman, “Look, listen and learn,” in ICCV, 2017.
- [3] Relja Arandjelovic and Andrew Zisserman, “Objects that sound,” in ECCV, 2018.
- [4] Yusuf Aytar, Carl Vondrick, and Antonio Torralba, “Soundnet: Learning sound representations from unlabeled video,” PAMI, 2016.
- [5] Janani Ramaswamy and Sukhendu Das, “See the sound, hear the pixels,” in WACV, 2020.
- [6] Yan-Bo Lin, Yu-Jhe Li, and Yu-Chiang Frank Wang, “Dual-modality seq2seq network for audio-visual event localization,” in ICASSP, 2019.
- [7] Yan-Bo Lin and Yu-Chiang Frank Wang, “Audiovisual transformer with instance attention for audio-visual event localization,” in ACCV, 2020.
- [8] Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu, “Audio-visual event localization in unconstrained videos,” in ECCV, 2018.
- [9] Yu Wu, Linchao Zhu, Yan Yan, and Yi Yang, “Dual attention matching for audio-visual event localization,” in ECCV, 2019.
- [10] Haoming Xu, Runhao Zeng, Qingyao Wu, Mingkui Tan, and Chuang Gan, “Cross-modal relation-aware networks for audio-visual event localization,” in ACM MM, 2020.
- [11] Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, and Ming-Hsuan Yang, “Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing,” NIPS, 2021.
- [12] Yapeng Tian, Dingzeyu Li, and Chenliang Xu, “Unified multisensory perception: Weakly-supervised audio-visual video parsing,” in ECCV, 2020.
- [13] Yu Wu and Yi Yang, “Exploring heterogeneous clues for weakly-supervised audio-visual video parsing,” in CVPR, 2021.
- [14] Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, and Limin Wang, “Joint-modal label denoising for weakly-supervised audio-visual video parsing,” in ECCV, 2022.
- [15] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon, “Learning to localize sound source in visual scenes,” in CVPR, 2018.
- [16] Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, and Yiran Zhong, “Audio–visual segmentation,” in ECCV, 2022.
- [17] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, “Feature pyramid networks for object detection,” in CVPR, 2017.
- [18] Jianzong Wu, Xiangtai Li, Shilin Xu, Haobo Yuan, Henghui Ding, Yibo Yang, Xia Li, Jiangning Zhang, Yunhai Tong, Xudong Jiang, Bernard Ghanem, et al., “Towards open vocabulary learning: A survey,” arXiv preprint arXiv:2306.15880, 2023.
- [19] Xiangtai Li, Jiangning Zhang, Yibo Yang, Guangliang Cheng, Kuiyuan Yang, Yunhai Tong, and Dacheng Tao, “Sfnet: Faster and accurate semantic segmentation via semantic flow,” IJCV, 2023.
- [20] Peng Xu, Xiatian Zhu, and David A Clifton, “Multimodal learning with transformers: A survey,” PAMI, 2023.
- [21] Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman, “Localizing visual sounds the hard way,” in CVPR, 2021.
- [22] Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, and Weiyao Lin, “Multiple sound sources localization from coarse to fine,” in ECCV, 2020.
- [23] Sabarinath Mahadevan, Ali Athar, Aljoša Ošep, Sebastian Hennen, Laura Leal-Taixé, and Bastian Leibe, “Making a case for 3d convolutions for object segmentation in videos,” BMVC, 2020.
- [24] Brendan Duke, Abdalla Ahmed, Christian Wolf, Parham Aarabi, and Graham W Taylor, “Sstvos: Sparse spatiotemporal transformers for video object segmentation,” in CVPR, 2021.
- [25] Yuxin Mao, Jing Zhang, Zhexiong Wan, Yuchao Dai, Aixuan Li, Yunqiu Lv, Xinyu Tian, Deng-Ping Fan, and Nick Barnes, “Transformer transforms salient object detection and camouflaged object detection,” CoRR, 2021.
- [26] Jing Zhang, Jianwen Xie, Nick Barnes, and Ping Li, “Learning generative vision transformer with energy-based latent space for saliency prediction,” NIPS, 2021.
- [27] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., “Cnn architectures for large-scale audio classification,” in ICASSP, 2017.
- [28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” NIPS, 2017.
- [29] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” in ICCV, 2017.
- [30] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3DV, 2016.
- [31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
- [32] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.
- [33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “Imagenet large scale visual recognition challenge,” Int J Comput Vis, 2015.
- [34] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” in ICLR, 2019.
- [35] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman, “The pascal visual object classes (voc) challenge,” Int J Comput Vis, 2010.