

TRANSAVS: END-TO-END AUDIO-VISUAL SEGMENTATION WITH TRANSFORMER

Abstract

Audio-Visual Segmentation (AVS) is a challenging task, which aims to segment sounding objects in video frames by exploring audio signals. Generally, AVS faces two key challenges: (1) audio signals are inherently information-dense, as sounds produced by multiple objects are entangled within the same audio stream; (2) objects of the same category tend to produce similar audio signals, making them difficult to distinguish and thus leading to unclear segmentation results. To address these challenges, we propose TransAVS, the first Transformer-based end-to-end framework for the AVS task. Specifically, TransAVS disentangles the audio stream into audio queries, which interact with images and are decoded into segmentation masks by a fully transformer-based architecture. This scheme not only promotes comprehensive audio-image communication but also explicitly excavates instance cues encapsulated in the scene. Meanwhile, to encourage these audio queries to capture distinct sounding objects instead of degenerating into homogeneous representations, we devise two self-supervised loss functions at the query and mask levels, allowing the model to capture distinctive features within similar audio data and achieve more precise segmentation. Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset, highlighting its effectiveness in bridging the gap between the audio and visual modalities.

Index Terms—  Audio-visual segmentation, multi-modal learning, transformer.


1 Introduction

Humans possess the remarkable ability to leverage audio and visual input signals to enhance the perception of the world [1]. For instance, we can identify the location of an object not only based on its visual appearance but also by the sounds it produces. This intrinsic connection has paved the way for numerous audio-visual tasks, including audio-visual correspondence [2, 3, 4, 5], audio-visual event localization [6, 7, 8, 9, 10], audio-visual video parsing [11, 12, 13, 14], and sound source localization [2, 3, 15]. However, the absence of pixel-wise annotations has limited these methods to frame or patch-level comprehension, ultimately restricting their training objectives to the classification of audible images.

Recently, a novel audio-visual segmentation (AVS) task was introduced in [16] with the aim of segmenting the sounding objects corresponding to audio cues in video frames. This task is inherently non-trivial due to the following two challenges. First, audio signals are information-dense, as they often contain sounds from multiple sources simultaneously. For example, at a concert, the sounds of instruments and human voices become intertwined. This necessitates disentangling the audio signal at each timestamp into multiple latent components to effectively capture the unique sounding features of individual objects. Second, audio signals from objects of the same category often exhibit similar frequencies, as in the case of a Husky and a Tibetan Mastiff. This ambiguity places greater demands on the audio representation throughout the network if the sound sources are to be located accurately. However, the existing method [16] fails to address these challenges. Concretely, it simply extracts audio features at each timestamp with an audio encoder, fuses them with image embeddings through convolution, and then generates the final prediction with an FPN-based scheme [17] under standard segmentation supervision [18, 19].

Fig. 1: The pipeline comparison between: (a) the existing method and (b) our proposed TransAVS framework.

To this end, we propose a novel transformer-based end-to-end audio-visual segmentation framework (TransAVS), drawing inspiration from the recent success of the transformer architecture in multi-modal learning [20]. As depicted in Fig. 1(b), TransAVS is a multi-modal transformer that leverages audio cues to guide both the fusion with visual features and the segmentation. Concretely, to handle scenarios with multiple sounding objects, we first disentangle the audio stream to initialize several audio queries, which encourages the model to explicitly attend to different objects and facilitates the acquisition of instance-level awareness and discrimination. In addition, we introduce two self-supervised loss functions at the query and mask levels, respectively. These losses play a pivotal role in optimizing the audio queries by encouraging heterogeneity during the learning process. This design empowers the model to discern and capture unique features embedded within similar audio streams, resulting in more precise segmentation.

Fig. 2: Architecture of the Transformer-based end-to-end framework, TransAVS. In this framework, the audio stream is disentangled into audio queries, which guide both the fusion with visual features and the segmentation process in a Transformer manner. To address the challenge of sound homogeneity among objects of the same category, we introduce two self-supervised loss functions at the query and mask levels. These designs, distinct from the existing method, not only enable the model to attain instance-level awareness and discrimination, but also to distinguish and capture unique features embedded within similar audio streams, resulting in more precise segmentation.

In summary, the main contributions of this paper are four-fold: (1) To the best of our knowledge, we are the first to introduce a multi-modal transformer-based framework for the AVS task, leveraging its potent long-range modeling ability to promote cross-modal interaction; (2) To guide the model towards perceiving and discriminating sounding objects at the instance level, we explicitly disentangle audio cues into audio queries; (3) To address the homogeneity of sounds among objects of the same category, we design two self-supervised loss functions that enable the model to capture distinctive features within similar audio streams; (4) Qualitative and quantitative experiments demonstrate the state-of-the-art performance of our method on the AVSBench dataset.

2 Methodology

In this section, we will delve into the details of our proposed TransAVS framework. We begin by introducing the problem formulation in Section 2.1, followed by a comprehensive explanation of the TransAVS architecture in Section 2.2. Furthermore, we outline the design and rationale behind our self-supervised loss functions in Section 2.3. Lastly, Section 2.4 explains how TransAVS infers sounding object masks.

2.1 Problem Formulation

Table 1: Quantitative comparison of different methods on AVSBench. Our method outperforms the Baseline by a large margin on both the S4 and MS3 subsets across all visual backbones. Results of the mean Jaccard index $\mathcal{M}_{\mathcal{J}}$ and F-score $\mathcal{M}_{\mathcal{F}}$ are reported.

| Metric | Setting | SSL: LVS[21] | SSL: MSSL[22] | VOS: 3DC[23] | VOS: SST[24] | SOD: iGAN[25] | SOD: LGVT[26] | Baseline[16]: ResNet | Baseline[16]: Pvt-v2 | Ours: ResNet | Ours: Swin-base |
| $\mathcal{M}_{\mathcal{J}}$ | Single-source (S4) | 37.9 | 44.9 | 57.1 | 66.3 | 61.6 | 74.9 | 72.8 | 78.7 | 83.1 | 89.4 |
| $\mathcal{M}_{\mathcal{J}}$ | Multi-source (MS3) | 29.5 | 26.1 | 36.9 | 42.6 | 42.9 | 40.7 | 47.9 | 54.0 | 58.9 | 63.5 |
| $\mathcal{M}_{\mathcal{F}}$ | Single-source (S4) | 51.0 | 66.3 | 75.9 | 80.1 | 77.8 | 87.3 | 84.8 | 87.9 | 90.6 | 94.2 |
| $\mathcal{M}_{\mathcal{F}}$ | Multi-source (MS3) | 31.0 | 36.3 | 50.3 | 57.2 | 54.4 | 59.3 | 59.3 | 64.5 | 72.9 | 75.2 |

For the AVS task, the input comprises a sequence of video frames $\mathcal{V}=\{v_i\}_{i=1}^{T}$, where $v_i\in\mathbb{R}^{3\times H_v\times W_v}$, and a $T$-second audio stream $\mathcal{A}$. The goal of AVS is to segment all sounding objects in each frame $v_i$ under the acoustic guidance of $\mathcal{A}$. The segmentation results are binary masks $\mathcal{M}=\{m_i\}_{i=1}^{T}$, where $m_i\in\{0,1\}^{H_v\times W_v}$, with '1' indicating sounding objects and '0' corresponding to background or silent objects.

2.2 The Architecture of TransAVS

Our TransAVS framework consists of three modules: (1) a feature extractor, which extracts multi-scale image features $F_v$ and audio features $F_a$; (2) an audio-visual transformer-based fusion module, which disentangles $F_a$ into audio queries $A_q$ and fuses them with $F_v$ in a transformer fashion; (3) a mask generation module, which predicts binary masks together with the probability that each query corresponds to a sounding object.

2.2.1 Feature Extractor

Visual Feature: Taking one frame $v_i$ in $\mathcal{V}$ as input, a pretrained visual backbone is employed to extract visual features. To exploit semantic information at different levels, we extract vision features at three scales, $F_v=\{f_v^i\}_{i=1}^{3}$, where $f_v^i\in\mathbb{R}^{C_v\times\frac{H_v}{2^{4-i}}\times\frac{W_v}{2^{4-i}}}$ and $C_v$ is the channel dimension, which depends on the encoder. We also upsample $f_v^3$ to $f_v^4\in\mathbb{R}^{C_v\times H_v\times W_v}$ for later mask generation.

Audio Feature: Given the audio clip $\mathcal{A}$, we first convert it to a spectrogram via the short-time Fourier transform, then pass it to a pretrained audio backbone, VGGish [27], to obtain the audio embedding $F_a\in\mathbb{R}^{T\times d}$.
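
As a reference point, below is a minimal PyTorch sketch of this feature-extraction stage. It assumes a torchvision ResNet-101 for the visual branch and a stand-in MLP on log-mel spectrograms in place of the pretrained VGGish audio backbone; the frame size, embedding dimension $d$, and spectrogram shape are illustrative, and in practice each visual scale would also be projected to a common channel dimension $C_v$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet101
from torchvision.models.feature_extraction import create_feature_extractor

T, Hv, Wv, d = 5, 128, 128, 256                   # 5 one-second clips; sizes are illustrative

# Visual branch: pull three intermediate feature maps as f_v^1..f_v^3.
backbone = create_feature_extractor(
    resnet101(weights=None),                      # pretrained weights omitted in this sketch
    return_nodes={"layer1": "f1", "layer2": "f2", "layer3": "f3"},
)
frames = torch.randn(T, 3, Hv, Wv)
feats = backbone(frames)                          # dict of multi-scale visual features
f4 = F.interpolate(feats["f3"], size=(Hv, Wv),    # upsampled map used later for masks
                   mode="bilinear", align_corners=False)

# Audio branch: VGGish maps each 1-second clip to an embedding; a hypothetical
# MLP on flattened log-mel spectrograms stands in for it here.
num_mel_frames, n_mels = 96, 64
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(num_mel_frames * n_mels, d))
spectrograms = torch.randn(T, num_mel_frames, n_mels)
Fa = audio_encoder(spectrograms)                  # F_a in R^{T x d}

for name, feat in feats.items():
    print(name, tuple(feat.shape))
print("f4", tuple(f4.shape), "F_a", tuple(Fa.shape))
```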

2.2.2 Audio-visual Transformer-based Fusion Module

As previously mentioned, we first disentangle the audio features $F_a$ into audio queries $A_q$ to facilitate the model's learning of instance-level awareness and discrimination, then adopt the attention mechanism to establish long-range connections between audio cues and visual features. Technically, we begin by projecting $F_a$ into $N$ independent queries with a linear transform $W_1\in\mathbb{R}^{N\times T}$:

$A_q^0 = W_1 F_a$   (1)

then input them into $N_1$ encoder layers to capture their dependencies. Concretely, at the $n$-th layer:

$\text{Atten}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$   (2)
$A_q^{n+1} = \text{Atten}(A_q^n W_Q^n,\ A_q^n W_K^n,\ A_q^n W_V^n) + A_q^n$   (3)

where $A_q^n\in\mathbb{R}^{N\times d}$, and $W_Q^n$, $W_K^n$, $W_V^n\in\mathbb{R}^{d\times d}$ denote the query, key, and value transform matrices at the $n$-th layer, following the standard attention scheme [28].
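
For concreteness, here is a minimal PyTorch sketch of Eqs. (1)-(3), assuming single-head attention, $N=300$ queries as in our ablation study, and illustrative values for $d$ and $N_1$.

```python
import torch
import torch.nn as nn

class AudioQueryEncoderLayer(nn.Module):
    """Self-attention over the N audio queries with a residual connection (Eqs. 2-3)."""
    def __init__(self, d):
        super().__init__()
        self.Wq = nn.Linear(d, d, bias=False)     # W_Q^n
        self.Wk = nn.Linear(d, d, bias=False)     # W_K^n
        self.Wv = nn.Linear(d, d, bias=False)     # W_V^n
        self.scale = d ** 0.5

    def forward(self, A):                         # A: (N, d) audio queries A_q^n
        Q, K, V = self.Wq(A), self.Wk(A), self.Wv(A)
        attn = torch.softmax(Q @ K.T / self.scale, dim=-1)   # Eq. 2
        return attn @ V + A                                  # Eq. 3

T, N, d, N1 = 5, 300, 256, 3                      # d and N1 are illustrative
Fa = torch.randn(T, d)                            # audio features F_a
W1 = nn.Parameter(torch.randn(N, T) / T ** 0.5)   # W_1 in R^{N x T}
Aq = W1 @ Fa                                      # Eq. 1: A_q^0 = W_1 F_a, shape (N, d)

encoder = nn.ModuleList([AudioQueryEncoderLayer(d) for _ in range(N1)])
for layer in encoder:
    Aq = layer(Aq)                                # A_q^{N_1} after the last layer
print(Aq.shape)                                   # torch.Size([300, 256])
```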

After $N_1$ encoder layers, the output $A_q^{N_1}$ conveys information about different audio components, guiding the network to attend to different sounding regions during the cross-modal fusion process within the following $N_2$ decoder layers. Specifically, at the $l$-th layer, the audio queries $A_q = A_q^{N_1}$ serve as queries while the image features $f_v^1$, $f_v^2$, $f_v^3$ act as keys and values in a round-robin fashion:

$i = (l \bmod 3) + 1$   (4)
$F_{av}^{l} = \begin{cases} A_q & \text{if } l = 1 \\ F_{av}^{l} & \text{otherwise} \end{cases}$   (5)
$F_{av}^{l+1} = \text{Atten}(F_{av}^{l} W_Q^{l},\ f_v^{i} W_K^{l},\ f_v^{i} W_V^{l}) + F_{av}^{l}$   (6)

where 'mod' denotes the modulo operation, $F_{av}^{l}\in\mathbb{R}^{N\times d}$, and $W_Q^{l}$, $W_K^{l}$, $W_V^{l}$ denote the query, key, and value transformation matrices at the $l$-th layer, respectively. This approach not only establishes long-range connections between the audio stream and visual frames but also compels the model, through the audio queries, to perceive and discriminate sounding objects at the instance level.
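
Continuing the sketch above under the same single-head assumption, the decoder of Eqs. (4)-(6) can be written as follows; the token counts per scale and the value of $N_2$ are illustrative.

```python
import torch
import torch.nn as nn

class CrossAttentionLayer(nn.Module):
    """Audio queries attend to one flattened visual scale (Eq. 6, with residual)."""
    def __init__(self, d):
        super().__init__()
        self.Wq = nn.Linear(d, d, bias=False)     # W_Q^l
        self.Wk = nn.Linear(d, d, bias=False)     # W_K^l
        self.Wv = nn.Linear(d, d, bias=False)     # W_V^l
        self.scale = d ** 0.5

    def forward(self, F_av, f_v):                 # F_av: (N, d), f_v: (H_i*W_i, d)
        Q, K, V = self.Wq(F_av), self.Wk(f_v), self.Wv(f_v)
        attn = torch.softmax(Q @ K.T / self.scale, dim=-1)
        return attn @ V + F_av

N, d, N2 = 300, 256, 6
Aq = torch.randn(N, d)                            # encoder output A_q^{N_1}
# Three visual scales, each flattened to (H_i*W_i, d) tokens (already projected to d).
fv = [torch.randn(28 * 28, d), torch.randn(56 * 56, d), torch.randn(112 * 112, d)]

layers = nn.ModuleList([CrossAttentionLayer(d) for _ in range(N2)])
F_av = Aq                                         # Eq. 5: F_av^1 = A_q
for l, layer in enumerate(layers, start=1):
    i = (l % 3) + 1                               # Eq. 4: round-robin scale index
    F_av = layer(F_av, fv[i - 1])
print(F_av.shape)                                 # torch.Size([300, 256])
```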

2.2.3 Mask Generation

Based on the fused features $F_{av}=F_{av}^{N_2}$ and the image embedding $f_v^4$, the mask generation module predicts segmentation masks $\mathcal{M}=\{M_i\}_{i=1}^{N}$ with the probability $\mathcal{P}=\{p_i\}_{i=1}^{N}$ of sounding objects.

Technically, for binary mask generation, we apply a $1\times 1$ convolution, denoted $W_2$, on $f_v^4$ to adjust the channel dimension to $d$, then multiply it with $F_{av}$ and apply a sigmoid function:

$\mathcal{M} = \text{sigmoid}(F_{av} W_2 f_v^4)$   (7)

Meanwhile, to calculate the probability $\mathcal{P}$, a classifier $g\in\mathbb{R}^{d\times K}$ and a softmax function are utilised:

$\mathcal{P} = \text{softmax}[g(F_{av})]$   (8)

where $K=2$ is the number of categories. $\mathcal{M}$ and $\mathcal{P}$ are paired with the audio queries $A_q$ as $\mathcal{Z}=\{z_i=(q_i,m_i,p_i)\}_{i=1}^{N}$ for optimization and inference.
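
A minimal sketch of Eqs. (7)-(8) follows, assuming $N=300$ queries, an embedding dimension $d=256$, and a per-pixel map $f_v^4$ with 64 channels before the $1\times 1$ convolution; the shapes are illustrative.

```python
import torch
import torch.nn as nn

N, d, K, Hv, Wv, Cv = 300, 256, 2, 224, 224, 64
F_av = torch.randn(N, d)                          # fused audio-visual queries
f_v4 = torch.randn(Cv, Hv, Wv)                    # upsampled per-pixel features

W2 = nn.Conv2d(Cv, d, kernel_size=1)              # 1x1 conv adjusting channels to d
pixel_embed = W2(f_v4.unsqueeze(0)).squeeze(0)    # (d, Hv, Wv)

# Eq. 7: one soft mask per query from the query/pixel-embedding dot product.
masks = torch.sigmoid(torch.einsum("nd,dhw->nhw", F_av, pixel_embed))   # (N, Hv, Wv)

# Eq. 8: per-query probability over the K = 2 classes (sounding vs. not).
g = nn.Linear(d, K)                               # classifier g in R^{d x K}
probs = torch.softmax(g(F_av), dim=-1)            # (N, K)
print(masks.shape, probs.shape)
```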

2.3 Self-Supervised Loss Functions

As mentioned before, objects of the same category tend to produce similar sound frequencies, resulting in a significant degree of homogeneity that can hinder the model's performance. To address this, we propose two loss functions at the query and mask levels, namely the Audio Query Distance Loss (AQDL) and the Audio Query Mask Loss (AQML), with the goal of increasing heterogeneity and thus enhancing segmentation accuracy.

To be specific, AQDL, denoted $\mathcal{L}_{AQDL}$, penalizes queries $q_i$ that predict sounding objects yet lie too close to each other, since such queries are highly similar and provide less distinctive guidance:

Fig. 3: Qualitative comparison between the Baseline and our proposed TransAVS. Our method's instance-level awareness and discrimination, which enable it to distinguish individual sounding sources, contribute significantly to more precise segmentation. This is evident in subfigure (b), where TransAVS effectively delineates the shape of the sounding sources (guitars) while discarding the silent objects (hands).

$\mathcal{L}_{AQDL} = \frac{2}{n_1(n_1+1)}\sum_{i=1}^{n_1}\sum_{j=i+1}^{n_1} d(q_i, q_j)$   (9)
$d(q_i,q_j) = \begin{cases} \frac{1}{\|h(q_i)-h(q_j)\|_2^2} & \text{if } \|h(q_i)-h(q_j)\|_2^2 < d_0 \\ 0 & \text{otherwise} \end{cases}$   (10)

where $h$ represents a projection head in $\mathbb{R}^{d\times d}$, $q_i$ and $q_j$ belong to the set $S_1=\{z_k\,|\,p_k>\delta_1\}$, $\delta_1$ is the confidence threshold of $\mathcal{L}_{AQDL}$, $n_1$ is the cardinality of $S_1$, $\|\cdot\|_2$ is the $L_2$ norm, and $d_0$ is the threshold for $d(q_i,q_j)$. AQDL promotes heterogeneity by preventing queries from collapsing onto each other.
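
The following is a minimal sketch of $\mathcal{L}_{AQDL}$ (Eqs. 9-10), assuming a linear projection head $h$ and illustrative values for $\delta_1$ and $d_0$.

```python
import torch
import torch.nn as nn

def aqdl_loss(queries, sounding_probs, h, delta1=0.6, d0=1.0):
    """queries: (N, d); sounding_probs: (N,) confidence of predicting a sounding object."""
    sel = queries[sounding_probs > delta1]        # confident queries, the set S_1
    n1 = sel.shape[0]
    if n1 < 2:
        return queries.new_zeros(())
    z = h(sel)                                    # h(q_i)
    dist2 = torch.cdist(z, z).pow(2)              # pairwise squared L2 distances
    iu, ju = torch.triu_indices(n1, n1, offset=1) # all pairs with i < j
    d2 = dist2[iu, ju]
    penalty = torch.where(d2 < d0, 1.0 / d2.clamp_min(1e-6), torch.zeros_like(d2))
    return 2.0 / (n1 * (n1 + 1)) * penalty.sum()  # Eq. 9

d = 256
h = nn.Linear(d, d)                               # projection head in R^{d x d}
print(aqdl_loss(torch.randn(300, d), torch.rand(300), h))
```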

On the other hand, AQML encourages each $q_i$ to predict a mask $m_i$ that is as exclusive as possible, focusing on the pixels where different sounding-object masks intersect:

$\mathcal{L}_{AQML} = \frac{2}{n_2(n_2+1)}\sum_{i=1}^{n_2}\sum_{j=i+1}^{n_2} I(m_i, m_j)$   (11)
$I(m_i,m_j) = \frac{1}{2HW}\left[\text{Bin}(m_i)\odot m_j + m_i\odot\text{Bin}(m_j)\right]$   (12)

where $m_i$ and $m_j$ belong to the set $S_2=\{z_k\,|\,p_k>\delta_2\}$, $\delta_2$ is the confidence threshold of $\mathcal{L}_{AQML}$, $n_2$ is the cardinality of $S_2$, 'Bin' denotes binarization with a threshold of $0.5$, and $\odot$ is the Hadamard product. AQML forces queries to attend to different parts of the image, thus increasing their heterogeneity.
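
Correspondingly, here is a minimal sketch of $\mathcal{L}_{AQML}$ (Eqs. 11-12) on soft masks, with an illustrative $\delta_2$; the pairwise loop is kept for clarity rather than speed.

```python
import torch

def aqml_loss(masks, sounding_probs, delta2=0.6):
    """masks: (N, H, W) soft masks in [0, 1]; sounding_probs: (N,) confidence."""
    sel = masks[sounding_probs > delta2]          # confident masks, the set S_2
    n2 = sel.shape[0]
    if n2 < 2:
        return masks.new_zeros(())
    H, W = sel.shape[-2:]
    binarized = (sel > 0.5).float()               # Bin(m) with threshold 0.5
    loss = masks.new_zeros(())
    for i in range(n2):
        for j in range(i + 1, n2):
            overlap = (binarized[i] * sel[j] + sel[i] * binarized[j]).sum()
            loss = loss + overlap / (2 * H * W)   # Eq. 12
    return 2.0 / (n2 * (n2 + 1)) * loss           # Eq. 11

print(aqml_loss(torch.rand(50, 56, 56), torch.rand(50)))
```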

Besides AQDL and AQML, we also use two supervised segmentation losses: the focal classification loss $\mathcal{L}_{class}$ [29] and the dice loss $\mathcal{L}_{dice}$ [30]. All losses are linearly combined as the optimization objective during our end-to-end training:

$\mathcal{L} = \lambda_1\mathcal{L}_{AQDL} + \lambda_2\mathcal{L}_{AQML} + \lambda_3\mathcal{L}_{class} + \lambda_4\mathcal{L}_{dice}$   (13)

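As a small illustration, the combined objective of Eq. (13) is just a weighted sum, shown below with the loss weights reported in Sec. 3.2 (the individual terms are assumed to be precomputed scalar tensors).

```python
import torch

def total_loss(l_aqdl, l_aqml, l_class, l_dice,
               lambdas=(2.0, 2.0, 5.0, 5.0)):     # lambda_1..lambda_4 from Sec. 3.2
    l1, l2, l3, l4 = lambdas
    return l1 * l_aqdl + l2 * l_aqml + l3 * l_class + l4 * l_dice

print(total_loss(torch.tensor(0.10), torch.tensor(0.05),
                 torch.tensor(0.30), torch.tensor(0.40)))
```
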
2.4 Inference Stage

During inference, TransAVS labels each pixel at location $(x,y)$ based on $\mathcal{Z}$:

$\mathop{\arg\max}_{c_i}\ p_i(c_i)\times m_i(x,y)$   (14)
$c_i = \mathop{\arg\max}_{c\in\{1,\dots,K\}}\ p_i(c), \quad \forall z_i\in\mathcal{Z}$   (15)

That is, a pixel $(x,y)$ is assigned to $z_i$ only when both the class probability $p_i(c_i)$ and the mask prediction probability $m_i(x,y)$ are sufficiently high.
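
A minimal sketch of this inference rule (Eqs. 14-15) follows, assuming the sounding class has index 1 and using random predictions purely for shape checking.

```python
import torch

N, K, H, W = 300, 2, 224, 224
probs = torch.softmax(torch.randn(N, K), dim=-1)  # P: per-query class scores
masks = torch.rand(N, H, W)                       # M: per-query soft masks

class_conf, class_id = probs.max(dim=-1)          # Eq. 15: c_i = argmax_c p_i(c)
scores = class_conf[:, None, None] * masks        # p_i(c_i) * m_i(x, y)
assignment = scores.argmax(dim=0)                 # Eq. 14: winning query per pixel

# A pixel is kept as "sounding" only if its winning query predicts the sounding
# class (index 1, by assumption here) with a sufficiently high mask value.
winning_mask = masks.gather(0, assignment[None]).squeeze(0)
sounding = (class_id[assignment] == 1) & (winning_mask > 0.5)
print(sounding.shape, sounding.float().mean())
```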

3 Experiments

3.1 AVSBench Dataset

All videos in the AVSBench dataset are trimmed to 5 seconds and split into two subsets based on the number of sound sources: single sound source segmentation (S4) and multiple sound source segmentation (MS3). Each video is then divided into five non-overlapping 1-second clips, and one frame is sampled from each clip. As shown in Table 2, only the first sampled frame of each video in the S4 training split is annotated, while all frames in the other splits are annotated. S4 contains 23 classes (Cls.), covering sounds from humans, animals, vehicles, and musical instruments. Each video in the MS3 subset includes two or more categories from the S4 subset.

Table 2: AVSBench dataset statistics. In the S4 training split, each video has one annotated frame; all other splits contain five annotations per video.

| Subset | Cls. | Videos | Train / Valid / Test videos | Annotated frames (Train / Valid / Test) |
| Single-source (S4) | 23 | 4,932 | 3,452 / 740 / 740 | 3,452 / 3,700 / 3,700 |
| Multi-source (MS3) | 23 | 424 | 296 / 64 / 64 | 1,480 / 320 / 320 |

3.2 Implementation Details

For the visual backbone, we choose two representative ones: the standard CNN-based ResNet [31] backbone (R101) and the Transformer-based Swin Transformer [32] backbone (swin-base). R101 is pretrained on ImageNet-1K [33], while swin-base is pretrained on ImageNet-22K. For the loss weights, we set $\lambda_1=\lambda_2=2.0$ and $\lambda_3=\lambda_4=5.0$. For the optimizer, we use AdamW [34] with an initial learning rate of 0.0001 for both the R101 and swin-base backbones. A learning rate multiplier of 0.1 is also applied. All models are trained on eight 3090 GPUs for 90k iterations with a batch size of 8.
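
A minimal sketch of the optimizer setup is given below, under two assumptions not spelled out above: the 0.1 multiplier is applied to the backbone learning rate (a common convention), and a weight decay of 0.05 is used as a placeholder since no value is stated here.

```python
import torch

def build_optimizer(model, base_lr=1e-4, backbone_mult=0.1, weight_decay=0.05):
    """AdamW with a reduced learning rate for backbone parameters (assumed convention)."""
    backbone, rest = [], []
    for name, param in model.named_parameters():
        (backbone if name.startswith("backbone") else rest).append(param)
    return torch.optim.AdamW(
        [{"params": backbone, "lr": base_lr * backbone_mult},
         {"params": rest, "lr": base_lr}],
        lr=base_lr, weight_decay=weight_decay)

class Dummy(torch.nn.Module):                     # stand-in for the full TransAVS model
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(4, 4)
        self.head = torch.nn.Linear(4, 2)

optimizer = build_optimizer(Dummy())
print([group["lr"] for group in optimizer.param_groups])   # [1e-05, 0.0001]
```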

3.3 Main Results

Since AVS is a newly proposed problem, we compare our network with the baseline in [16] and with methods from three related tasks: sound source localization (SSL), video object segmentation (VOS), and salient object detection (SOD). For each task, we report the results of two state-of-the-art methods on the AVSBench dataset, i.e., LVS [21] and MSSL [22] for SSL, 3DC [23] and SST [24] for VOS, and iGAN [25] and LGVT [26] for SOD. To ensure fairness, the backbones of all these methods were pretrained on the ImageNet-1K [33] dataset.

Quantitative Comparison. Given a test frame, we denote the predicted mask as $m$ and the ground truth as $y$. The Jaccard index $\mathcal{J}$ [35] and the F-score $\mathcal{F}$ are used to measure region similarity and contour accuracy, respectively:

$\mathcal{J} = \frac{m\cap y}{m\cup y}$   (16)
$\mathcal{F} = \frac{(1+\beta^2)\times\text{precision}\times\text{recall}}{\beta^2\times\text{precision}+\text{recall}}$   (17)

where $\beta=0.3$. We use $\mathcal{M}_{\mathcal{J}}$ and $\mathcal{M}_{\mathcal{F}}$ to denote the mean metric values over the whole test set. The quantitative results are shown in Table 1. It is evident that our proposed approach consistently outperforms existing methods on both subsets across all visual backbones. Even on the S4 subset, where the Baseline already achieves high $\mathcal{M}_{\mathcal{J}}$ values (72.8 with ResNet50 and 78.7 with Pvt-v2), our proposed TransAVS still improves substantially: 10.3 points with ResNet and 10.7 points with Swin-base over the Pvt-v2 baseline. We attribute this improvement to our transformer framework, which uses audio queries to explicitly learn instance-level awareness and discrimination of sounding objects, as well as to our loss functions, which increase heterogeneity among sounds from objects of the same category. These design choices allow our model to exploit important audio cues for better segmentation.
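
For reference, a minimal sketch of the two metrics in Eqs. (16)-(17) on binary masks, with $\beta=0.3$ as stated above; a small epsilon guards against empty masks.

```python
import torch

def jaccard(pred, gt, eps=1e-6):
    """pred, gt: boolean (H, W) masks. Region similarity J (Eq. 16)."""
    inter = (pred & gt).sum().float()
    union = (pred | gt).sum().float()
    return (inter / (union + eps)).item()

def f_score(pred, gt, beta=0.3, eps=1e-6):
    """F-score (Eq. 17) from mask-level precision and recall."""
    b2 = beta ** 2
    tp = (pred & gt).sum().float()
    precision = tp / (pred.sum().float() + eps)
    recall = tp / (gt.sum().float() + eps)
    return ((1 + b2) * precision * recall / (b2 * precision + recall + eps)).item()

pred = torch.rand(224, 224) > 0.5
gt = torch.rand(224, 224) > 0.5
print(jaccard(pred, gt), f_score(pred, gt))
```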

Qualitative Comparison. We provide qualitative examples from both S4 and MS3 in Fig. 3. The segmentation results clearly demonstrate that our method outperforms the baseline. We believe that our method's instance-level awareness and discrimination, which enable it to distinguish individual sounding sources, contribute significantly to the more precise segmentation. This is especially evident in Fig. 3(b), where TransAVS effectively delineates the shape of the sounding sources (guitars) while discarding the silent objects (hands).

3.4 Ablation Study

Table 3: Ablation study on the S4 and MS3 subsets, validating that our key designs are essential to the performance of TransAVS.

| $N$ | Mode of $\delta_1$ and $\delta_2$ | S4 $\mathcal{M}_{\mathcal{J}}$ | S4 $\mathcal{M}_{\mathcal{F}}$ | MS3 $\mathcal{M}_{\mathcal{J}}$ | MS3 $\mathcal{M}_{\mathcal{F}}$ |
| 100 | increasing $\delta_1$ and $\delta_2$ | 80.8 | 87.4 | 56.2 | 70.3 |
| 500 | increasing $\delta_1$ and $\delta_2$ | 81.2 | 89.4 | 57.1 | 71.8 |
| 300 | only fixed $\delta_1=0.6$ | 80.4 | 87.9 | 56.3 | 70.1 |
| 300 | only fixed $\delta_2=0.6$ | 80.5 | 87.7 | 56.2 | 70.4 |
| 300 | increasing $\delta_1$ and $\delta_2$ | 83.1 | 90.6 | 58.9 | 72.9 |

In Table 3, we verify the effectiveness of each key design of the proposed method with the ResNet backbone on both the S4 and MS3 subsets. Based on the first two rows and the last row, the optimal performance is obtained when the number of queries $N$ is set to 300. In the third and fourth rows, $\delta_1$ and $\delta_2$ are fixed at $0.6$, respectively. Both fixed modes show a notable drop compared with the increasing schedule adopted in the last row:

$\delta_i = a + \frac{b-a}{18}\times\left\lfloor\frac{\text{iterations}}{n_{iter}}\right\rfloor, \quad i\in\{1,2\}$   (18)

where $a=0.55$, $b=0.65$, and $n_{iter}=5000$. We hypothesize that this phenomenon can be explained as follows: as TransAVS becomes increasingly confident in its mask predictions, a fixed threshold may fail to penalize low-confidence audio queries at earlier stages while pushing too many audio queries away at later epochs, leading to poorer performance than an increasing threshold.
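
A minimal sketch of this increasing-threshold schedule (Eq. 18) with the constants above; over the 90k training iterations it raises the thresholds from 0.55 to 0.65 in 18 steps.

```python
def threshold(iteration, a=0.55, b=0.65, n_iter=5000, n_steps=18):
    """delta_i = a + (b - a) / 18 * floor(iteration / n_iter), i in {1, 2}."""
    return a + (b - a) / n_steps * (iteration // n_iter)

for it in (0, 5000, 45000, 90000):
    print(it, round(threshold(it), 4))            # 0.55, 0.5556, 0.6, 0.65
```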

4 Conclusion

In this paper, we introduce TransAVS, the first transformer-based end-to-end framework for the AVS task. We disentangle the audio stream into audio queries to explicitly guide the model toward instance-level awareness and discrimination of sounding objects. Additionally, we design two self-supervised loss functions to address the homogeneity of sounds within the same category. Experimental results on the AVSBench dataset show that TransAVS achieves SOTA performance on both the S4 and MS3 subsets, demonstrating its effectiveness in bridging the gap between the audio and visual modalities.

References

  • [1] Dana M Small and John Prescott, “Odor/taste integration and the perception of flavor,” Exp Brain Res, 2005.
  • [2] Relja Arandjelovic and Andrew Zisserman, “Look, listen and learn,” in ICCV, 2017.
  • [3] Relja Arandjelovic and Andrew Zisserman, “Objects that sound,” in ECCV, 2018.
  • [4] Yusuf Aytar, Carl Vondrick, and Antonio Torralba, “Soundnet: Learning sound representations from unlabeled video,” PAMI, 2016.
  • [5] Janani Ramaswamy and Sukhendu Das, “See the sound, hear the pixels,” in WACV, 2020.
  • [6] Yan-Bo Lin, Yu-Jhe Li, and Yu-Chiang Frank Wang, “Dual-modality seq2seq network for audio-visual event localization,” in ICASSP, 2019.
  • [7] Yan-Bo Lin and Yu-Chiang Frank Wang, “Audiovisual transformer with instance attention for audio-visual event localization,” in ACCV, 2020.
  • [8] Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu, “Audio-visual event localization in unconstrained videos,” in ECCV, 2018.
  • [9] Yu Wu, Linchao Zhu, Yan Yan, and Yi Yang, “Dual attention matching for audio-visual event localization,” in ECCV, 2019.
  • [10] Haoming Xu, Runhao Zeng, Qingyao Wu, Mingkui Tan, and Chuang Gan, “Cross-modal relation-aware networks for audio-visual event localization,” in ACM MM, 2020.
  • [11] Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, and Ming-Hsuan Yang, “Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing,” NIPS, 2021.
  • [12] Yapeng Tian, Dingzeyu Li, and Chenliang Xu, “Unified multisensory perception: Weakly-supervised audio-visual video parsing,” in ECCV, 2020.
  • [13] Yu Wu and Yi Yang, “Exploring heterogeneous clues for weakly-supervised audio-visual video parsing,” in CVPR, 2021.
  • [14] Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, and Limin Wang, “Joint-modal label denoising for weakly-supervised audio-visual video parsing,” in ECCV, 2022.
  • [15] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon, “Learning to localize sound source in visual scenes,” in CVPR, 2018.
  • [16] Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, and Yiran Zhong, “Audio–visual segmentation,” in ECCV, 2022.
  • [17] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, “Feature pyramid networks for object detection,” in CVPR, 2017.
  • [18] Jianzong Wu, Xiangtai Li, Shilin Xu, Haobo Yuan, Henghui Ding, Yibo Yang, Xia Li, Jiangning Zhang, Yunhai Tong, Xudong Jiang, Bernard Ghanem, et al., “Towards open vocabulary learning: A survey,” arXiv preprint arXiv:2306.15880, 2023.
  • [19] Xiangtai Li, Jiangning Zhang, Yibo Yang, Guangliang Cheng, Kuiyuan Yang, Yunhai Tong, and Dacheng Tao, “Sfnet: Faster and accurate semantic segmentation via semantic flow,” IJCV, 2023.
  • [20] Peng Xu, Xiatian Zhu, and David A Clifton, “Multimodal learning with transformers: A survey,” PAMI, 2023.
  • [21] Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman, “Localizing visual sounds the hard way,” in CVPR, 2021.
  • [22] Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, and Weiyao Lin, “Multiple sound sources localization from coarse to fine,” in ECCV, 2020.
  • [23] Sabarinath Mahadevan, Ali Athar, Aljoša Ošep, Sebastian Hennen, Laura Leal-Taixé, and Bastian Leibe, “Making a case for 3d convolutions for object segmentation in videos,” BMVC, 2020.
  • [24] Brendan Duke, Abdalla Ahmed, Christian Wolf, Parham Aarabi, and Graham W Taylor, “Sstvos: Sparse spatiotemporal transformers for video object segmentation,” in CVPR, 2021.
  • [25] Yuxin Mao, Jing Zhang, Zhexiong Wan, Yuchao Dai, Aixuan Li, Yunqiu Lv, Xinyu Tian, Deng-Ping Fan, and Nick Barnes, “Transformer transforms salient object detection and camouflaged object detection,” CoRR, 2021.
  • [26] Jing Zhang, Jianwen Xie, Nick Barnes, and Ping Li, “Learning generative vision transformer with energy-based latent space for saliency prediction,” NIPS, 2021.
  • [27] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., “Cnn architectures for large-scale audio classification,” in ICASSP, 2017.
  • [28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” NIPS, 2017.
  • [29] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” in ICCV, 2017.
  • [30] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3DV, 2016.
  • [31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [32] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.
  • [33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “Imagenet large scale visual recognition challenge,” Int J Comput Vis, 2015.
  • [34] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” in ICLR, 2019.
  • [35] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman, “The pascal visual object classes (voc) challenge,” Int J Comput Vis, 2010.