
Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett
Department of Computer Science, Faculty of Engineering, University of Bristol, Bristol, UK

Centre Stage: Centricity-based
Audio-Visual Temporal Action Detection

Abstract

Previous one-stage action detection approaches have modelled temporal dependencies using only the visual modality. In this paper, we explore different strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities. We also demonstrate the correlation between the distance from the timestep to the action centre and the accuracy of the predicted boundaries. Thus, we propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score. This leads to increased confidence for proposals that exhibit more precise boundaries. Our method can be integrated with other one-stage anchor-free architectures and we demonstrate this on three recent baselines on the EPIC-Kitchens-100 action detection benchmark where we achieve state-of-the-art performance. Detailed ablation studies showcase the benefits of fusing audio and our proposed centricity scores. Code and models for our proposed method are publicly available at https://github.com/hanielwang/Audio-Visual-TAD.git.

1 Introduction

Temporal action detection aims to predict the boundaries of action segments from a long untrimmed video and classify the actions, as a fundamental step towards video understanding [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He, Carreira and Zisserman(2017), Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, Tang, and Van Gool]. A typically challenging scenario involves unscripted actions in egocentric videos [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al., Grauman et al.(2022)Grauman, Westbury, Byrne, Chavis, Furnari, Girdhar, Hamburger, Jiang, Liu, Liu, et al.], which contain dense action segments of various lengths, ranging from seconds to minutes, within a single unedited video.

Most recently, a few works have approached egocentric action detection by modelling long-range visual dependencies with transformers [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Wang et al.(2023)Wang, Singh, and Torresani, Nawhal and Mori(2021), Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem]. However, using only visual information means a missed opportunity to exploit potentially meaningful aural action cues. As shown in Figure 1(a), sound exhibits discriminating characteristics around the starting point of actions, such as ‘open drawer’, ‘take spoon’ and ‘scoop yoghurt’, which can be useful for boundary regression. Also for action classification, the sound of flowing water can boost confidence in identifying an action as ‘turn-on tap’ rather than ‘turn-off tap’, even though their visual content is similar.

Figure 1: Our motivation – (a) Sounds help detect actions both in refining the action boundaries (regression) and in identifying the action within these boundaries (classification), (b) Timesteps closer to the action centre generate better proposals with high tIoU.

Unlike methods [Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem, Tian et al.(2018)Tian, Shi, Li, Duan, and Xu, Kazakos et al.(2021a)Kazakos, Huh, Nagrani, Zisserman, and Damen] that directly fuse audio and visual modalities at the same scale through concatenation, addition or gating modules, in this paper we learn these modalities with separate encoders and fuse their representations using a cross-modal attention mechanism at different temporal scales. This allows us to exploit sufficient audio-visual information to detect actions of various duration.

Recent one-stage anchor-free methods [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] operate on egocentric videos by simultaneously predicting boundaries and action categories for each timestep. In contrast to anchor-based methods, anchor-free methods do not require pre-defined anchors to locate actions but directly generate one proposal for each timestep. We have observed that timesteps near the centre of actions tend to produce proposals with more precise boundaries. These proposals have higher temporal Intersection-over-Union (tIoU) values with corresponding ground-truth segments. As shown in Figure 1(b), the closer the current timestep is to the action centre, the greater the tIoU. Inspired by this observation, we introduce a centricity head that predicts a score so as to indicate how close the current timestep is to the action centre. This score is then integral to calculating the confidence scores for ranking candidate proposals, where those with more precise boundaries will be ranked higher. Our approach can be incorporated into most one-stage anchor-free action detectors and achieve significant improvement.

In summary, our key contributions are as follows: (i) we introduce a framework to effectively fuse audio and visual modalities using a cross-modal attention mechanism at various temporal scales, (ii) we propose a novel centricity head to predict how temporally close each frame is to the action centre – this boosts a proposal’s confidence score and allows for the preferential selection of proposals with more precise boundaries, and (iii) we achieve state-of-the-art results on the EPIC-Kitchens-100 action detection benchmark, demonstrating the effectiveness of the audio modality and the benefits of centricity in improving detection performance.

2 Related Work

Temporal action detection – Current temporal action detection methods can be divided into: (i) two-stage methods that first generate proposals and then classify them, and (ii) one-stage methods that predict boundaries and corresponding classes simultaneously. Some two-stage works generate proposals by estimating boundary probabilities [Lin et al.(2018)Lin, Zhao, Su, Wang, and Yang, Su et al.(2021)Su, Gan, Wu, Yan, and Qiao, Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Lin et al.(2020)Lin, Li, Wang, Tai, Luo, Cui, Wang, Li, Huang, and Ji] and action-ness scores [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin]. Many one-stage methods [Gao et al.(2017)Gao, Yang, Chen, Sun, and Nevatia, Lin et al.(2017a)Lin, Zhao, and Shou, Long et al.(2019)Long, Yao, Qiu, Tian, Luo, and Mei, Xu et al.(2017)Xu, Das, and Saenko, Liu and Wang(2020)] rely on pre-defined anchors to model temporal relations, which often leads to inflexibility and poor boundaries when detecting actions with various lengths. To address this, recent anchor-free methods [Yang et al.(2020)Yang, Peng, Zhang, Fu, and Han, Lin et al.(2021)Lin, Xu, Luo, Wang, Tai, Wang, Li, Huang, and Fu, Zhang et al.(2022)Zhang, Wu, and Li] predict the action category and offsets to the boundaries simultaneously for each timestep using parallel classification and regression heads. Then, candidate proposals constructed by these predictions are filtered to obtain the final results. Our work follows such an anchor-free pipeline.

Inspired by the DETR framework [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko], some works input relational queries [Shi et al.(2022)Shi, Zhong, Cao, Zhang, Ma, Li, and Tao], learned actions [Liu et al.(2022)Liu, Wang, Hu, Tang, Zhang, Bai, and Bai] or graph queries [Nawhal and Mori(2021)] to a transformer decoder to detect actions. However, with a limited number of queries, these methods struggle to cover a large number of actions in long videos. Alternatively, other works [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Chang et al.(2022)Chang, Wang, Wang, Li, and Shou, Cheng and Bertasius(2022)] use multi-scale transformer encoders [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] to model temporal dependencies for stronger video representations. For example, ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] applies local self-attention to extract a discriminative feature pyramid, which is then used for classification and regression. Our work falls under this workflow.

Audio-visual learning – Sight and hearing are both vital sensory modes that assist humans in perceiving the world, and computational approaches can likewise learn models from both. Numerous works [Afouras et al.(2020)Afouras, Owens, Chung, and Zisserman, Arandjelovic and Zisserman(2017), Aytar et al.(2016)Aytar, Vondrick, and Torralba, Hu et al.(2019)Hu, Nie, and Li, Korbar et al.(2018)Korbar, Tran, and Torresani, Owens and Efros(2018)] have focused on jointly learning audio and visual representations for tasks such as action recognition [Kazakos et al.(2019)Kazakos, Nagrani, Zisserman, and Damen, Kazakos et al.(2021a)Kazakos, Huh, Nagrani, Zisserman, and Damen, Nagrani et al.(2020)Nagrani, Sun, Ross, Sukthankar, Schmid, and Zisserman], video parsing [Wu and Yang(2021), Mo and Tian(2022)] and event localization [Tian et al.(2018)Tian, Shi, Li, Duan, and Xu, Rao et al.(2022)Rao, Khalil, Li, Dai, and Lu, Xia and Zhao(2022)]. Audio-visual event localization aims to classify each timestep into a limited number of categories [Tian et al.(2018)Tian, Shi, Li, Duan, and Xu], relying on clear audio-visual signals and without the need to predict temporal boundaries. In contrast, our action detection task aims to leverage the audio-visual representation to detect temporal boundaries for dense actions with various lengths and unclear audio cues, and then classify them into a wide range of categories. OWL [Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem] attempts different strategies for fusing audio and visual modalities, but it fuses at a single temporal scale only and classifies pre-generated proposals from [Lin et al.(2019)Lin, Liu, Li, Ding, and Wen], rather than detecting boundaries. In [Lee et al.(2021)Lee, Jain, Park, and Yun], the authors address this task by extracting intra-modal features, but their proposed framework is designed for simple, weakly-labelled data with sparse actions per video. Our work focuses on large-scale egocentric data comprising dense, complex actions of various durations, and we propose a framework to incorporate audio-visual learning and centricity into one-stage anchor-free methods.

3 Method

We propose a novel framework for temporal action detection, rooted in audio-visual data, which can be incorporated into one-stage anchor-free pipelines [Zhang et al.(2022)Zhang, Wu, and Li, Lin et al.(2021)Lin, Xu, Luo, Wang, Tai, Wang, Li, Huang, and Fu, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] (see Figure 2). Similar to such temporal action detection works, we define the problem as follows. Given an untrimmed video, we extract features for the video and audio modalities and then process them using transformer encoders to obtain the visual and audio representation sequences $F^{v}=\{f^{v}_{t}\}^{T}_{t=1}$ and $F^{a}=\{f^{a}_{t}\}^{T}_{t=1}$, respectively. Based on these, our approach is to learn to predict a set of possible action instances $\Phi=\{(s,e,\alpha)_{m}\}^{M}_{m=1}$, where $s$ and $e$ represent the starting and ending boundaries of an action, and $\alpha$ represents the predicted action class.

Figure 2: An overview of our architecture – Given an untrimmed video, audio and visual features are extracted from video clips, then fed into encoders $E_{a}$ and $E_{v}$ to generate the audio $F^{a}$ and visual $F^{v}$ feature pyramids. These are fused using cross-attention across $N$ temporal scales to build the audio-visual representation $F^{av}$. This representation is passed to the centricity head $h_{\mathcal{C}}$ for centricity scores, and to prediction heads $h^{v}_{b}$, $h^{v}_{c}$, and $h^{a}_{c}$ for boundary, classification, and audio scores respectively. Finally, predicted scores and boundaries are used to construct candidate proposals, which are then filtered to obtain the final predictions.
Figure 3: Three strategies for fusing modalities – (a) Proposal fusion: visual and audio modalities are separately fed into their respective streams and generate corresponding sets of proposals, which are directly concatenated to obtain the final set of proposals, (b) Classification scores fusion: classification scores from both modalities are combined alongside boundaries from the video to obtain the final set of proposals, (c) Feature pyramid fusion: feature pyramids from the two modalities are fused through cross-attention and fed into parallel heads to predict boundaries and class scores, which then construct the set of proposals.

3.1 Audio-visual Fusion

In this section, we explore three different strategies to effectively utilise the audio modality and combine it with visual information to improve action detection performance.

Proposal fusion – In this strategy (see Figure 3(a)), at first the visual and audio representations $(F^{v},F^{a})$ are produced by encoders $E_{v}$ and $E_{a}$ respectively. Classification heads $h^{v}_{c}$ and $h^{a}_{c}$ and regression heads $h^{v}_{b}$ and $h^{a}_{b}$ are then used to predict the classification scores $(p_{t}^{v},p_{t}^{a})$ and boundaries $(s^{v}_{t},e^{v}_{t})$ and $(s^{a}_{t},e^{a}_{t})$ for the visual and audio modalities, respectively. Thus, we can obtain a set of candidate proposals for the visual modality $\Phi_{v}=\{(s^{v}_{t},\ e^{v}_{t},\ p^{v}_{t})\}^{T}_{t=1}$ and similarly, a set of candidate proposals for the audio modality $\Phi_{a}=\{(s^{a}_{t},\ e^{a}_{t},\ p^{a}_{t})\}^{T}_{t=1}$. Then, we concatenate these two sets as $\Phi_{o}=\{(s^{v}_{t},\ e^{v}_{t},\ p^{v}_{t}),(s^{a}_{t},\ e^{a}_{t},\ p^{a}_{t})\}^{T}_{t=1}$.

Classification scores fusion – Although sounds can be associated with actions for classification purposes, the duration of an action does not necessarily correspond to its audio start and end as recently shown in [Huh et al.(2023)Huh, Chalk, Kazakos, Damen, and Zisserman]. Thus, we discard the audio boundaries, integrate the classification scores from both visual and audio modalities, and then use them along with the visual boundaries to generate proposals.

We use an approach similar to [Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem] to fuse visual and audio classification scores. Specifically, as shown in Figure 3(b), based on $F^{v}$ and $F^{a}$, the visual classification head $h^{v}_{c}$, audio classification head $h^{a}_{c}$ and visual boundary head $h^{v}_{b}$ predict scores $p_{t}^{v}$ and $p_{t}^{a}$, and frame boundaries $s^{v}_{t}$ and $e^{v}_{t}$, respectively. We fuse the classification scores $p_{t}^{v}$ and $p_{t}^{a}$ by simple addition and combine them with the visual boundaries $s^{v}_{t}$ and $e^{v}_{t}$ to construct the set of fused candidate proposals $\Phi_{c}=\{(s^{v}_{t},\ e^{v}_{t},\ p^{v}_{t}+p^{a}_{t})\}^{T}_{t=1}$.
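As an illustration, the minimal sketch below applies this fusion to per-timestep score tensors; the shapes, names and random inputs are placeholders, and in practice the heads operate over every level of the feature pyramid.

```python
import torch

# Illustrative shapes: T timesteps, C action classes.
T, C = 8, 97
p_v = torch.rand(T, C)          # visual classification scores p_t^v
p_a = torch.rand(T, C)          # audio classification scores p_t^a
s_v = torch.rand(T)             # visual start boundaries s_t^v
e_v = s_v + torch.rand(T)       # visual end boundaries e_t^v

p_fused = p_v + p_a             # fuse classification scores by simple addition
# Candidate proposals Phi_c: visual boundaries paired with fused scores
# (the audio boundaries are discarded).
proposals = [(s_v[t].item(), e_v[t].item(), p_fused[t]) for t in range(T)]
```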

Feature pyramid fusion – In this strategy (see Figure 3(c)), $F^{v}$ and $F^{a}$ are put through a cross-attention mechanism [Lee et al.(2021)Lee, Jain, Park, and Yun, Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem] (see also the grey box of Figure 2) to model their inter-modal dependencies, which results in a single representation vector $F^{av}$.

Firstly, $F^{v}$ and $F^{a}$ are projected into query $Q=W_{Q}F^{v}$, key $K=W_{K}F^{a}$, and value $V=W_{V}F^{a}$, where $F^{v}$ serves as the query input and $F^{a}$ serves as the key and value inputs. $W_{Q}$, $W_{K}$ and $W_{V}\in\mathbb{R}^{d\times d}$ are learnable weight matrices, where $d$ is the embedding dimension. Next, we calculate the audio-visual representation vector

$$F^{av}=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V, \qquad (1)$$

which is then fed into a classification head $h_{c}$ and a regression head $h_{b}$ to obtain the classification scores $p^{av}_{t}$ and the boundaries $s^{av}_{t}$, $e^{av}_{t}$ for each timestep. Therefore, the set of candidate proposals is $\Phi_{f}=\{(s^{av}_{t},\ e^{av}_{t},\ p^{av}_{t})\}^{T}_{t=1}$.
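The following is a minimal single-head PyTorch sketch of the cross-attention in Eq. (1); it assumes unbatched $(T, d)$ features at each pyramid level and omits the multi-head projections, residual connections and normalisation typically used in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Single-head cross-attention following Eq. (1): visual features
    form the query; audio features form the key and value."""
    def __init__(self, d):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, f_v, f_a):
        # f_v, f_a: (T, d) visual / audio features at one pyramid level
        Q, K, V = self.W_q(f_v), self.W_k(f_a), self.W_v(f_a)
        attn = F.softmax(Q @ K.t() / self.d ** 0.5, dim=-1)   # (T, T)
        return attn @ V                                       # F^{av}: (T, d)

# Toy pyramid with three levels, halving the temporal length at each level;
# the fusion is applied independently at every scale.
d = 64
visual_pyramid = [torch.randn(T, d) for T in (256, 128, 64)]
audio_pyramid = [torch.randn(T, d) for T in (256, 128, 64)]
fuse = CrossModalAttention(d)
f_av = [fuse(f_v, f_a) for f_v, f_a in zip(visual_pyramid, audio_pyramid)]
```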

We ablate these three strategies in Sec. 4.1. Based on the ablations, we chose feature pyramid fusion to generate audio-visual representations across NN temporal pyramid scales for assessment by the centricity head to predict corresponding scores (see Sec. 3.2 next). We also selected the classification scores fusion approach to generate stronger audio-visual classification scores to predict action categories and calculate confidence scores.

3.2 Audio-visual Centricity Head

We investigated the relationship between the distance of a timestep from the action centre and the tIoU value between its generated proposal and the ground truth. As shown in Figure 4(a), as a timestep gets closer to the action centre, its generated proposal has a higher tIoU value. This indicates that timesteps around the centre of an action can generate proposals with more reliable action boundaries. Thus, we propose a simple, yet effective, centricity head based on the audio-visual representation $F^{av}$ to estimate how close the timestep $t$ is to the centre of the action (as shown in Figure 4(b)). The centricity head consists of three 1D convolutional layers with layer normalization and a ReLU activation function.
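A PyTorch-style sketch of such a head is given below; only the layer types are fixed by the description above, so the kernel size, hidden width and the final sigmoid are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CentricityHead(nn.Module):
    """Sketch of the centricity head: three 1D convolutions with layer
    normalisation and ReLU, ending in a sigmoid so that the predicted
    scores lie in [0, 1]. Kernel size and hidden width are illustrative."""
    def __init__(self, d=512, hidden=512, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(d, hidden, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size, padding=pad)
        self.conv3 = nn.Conv1d(hidden, 1, kernel_size, padding=pad)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.relu = nn.ReLU()

    def forward(self, f_av):             # f_av: (B, T, d) audio-visual features
        x = f_av.transpose(1, 2)         # (B, d, T) for Conv1d
        x = self.relu(self.norm1(self.conv1(x).transpose(1, 2))).transpose(1, 2)
        x = self.relu(self.norm2(self.conv2(x).transpose(1, 2))).transpose(1, 2)
        return torch.sigmoid(self.conv3(x)).squeeze(1)   # (B, T) centricity scores
```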

Figure 4: The illustration of centricity – (a) The tIoUs between ground-truth segments and predictions (from RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]) are plotted across various centre distances. Actions are divided into five groups based on segment lengths (in seconds): XS (0, 2], S (2, 4], M (4, 6], L (6, 8], and XL (8, inf). (b) The centricity head takes in the audio-visual feature $F^{av}$ and produces the centricity scores $p_{t}^{\mathcal{C}}$. The ground-truth centricity score $p_{t}^{\mathcal{C}*}$ is calculated based on the relative distance $d_{t}$ between the timestep $t$ and the action centre.

Label assignment – We require the centricity scores $p_{t}^{\mathcal{C}*}$ calculated from ground-truth data as supervision signals for training. For each timestep $t$, we consider the relative distance $d_{t}$ between the current timestep $t$ and the centre of the corresponding ground-truth action to map the training labels of centricity scores

$$p_{t}^{\mathcal{C}*}=\exp\left(-(d_{t})^{2}/2\sigma^{2}\right), \qquad (2)$$

where $\sigma$ is a scaling hyperparameter; the closer a timestep is to the action centre, the higher its centricity score. Such Gaussian scaling has previously been explored to predict boundary confidences [Li et al.(2016)Li, Lan, Xing, Zeng, Yuan, and Liu, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett], but not specifically for centricity. The centricity scores are normalized to a range of 0 to 1.
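A small sketch of this label assignment is given below; whether $d_{t}$ is additionally normalised (e.g. by action length) is left open here, and the example simply uses the raw distance with $\sigma=1.7$ as in Sec. 4.

```python
import torch

def centricity_labels(t, centres, sigma=1.7):
    """Ground-truth centricity scores p_t^{C*} from Eq. (2).
    t: (T,) timestep positions; centres: (T,) centre of the matched
    ground-truth action for each timestep (same time units as t)."""
    d_t = t - centres                       # relative distance d_t
    return torch.exp(-(d_t ** 2) / (2 * sigma ** 2))

# Example: timesteps 10..20 matched to an action whose centre is at 15.
t = torch.arange(10, 21, dtype=torch.float32)
labels = centricity_labels(t, torch.full_like(t, 15.0))   # peaks at t = 15
```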

Training – We optimize the loss between the ground-truth $p_{t}^{\mathcal{C}*}$ and the predicted centricity scores $p_{t}^{\mathcal{C}}$ using the Mean Square Error (MSE) loss as

$$L_{\mathcal{C}}=\frac{1}{T^{\prime}}\sum_{t=1}^{T^{\prime}}\left(p_{t}^{\mathcal{C}*}-p_{t}^{\mathcal{C}}\right)^{2}, \qquad (3)$$

where $T^{\prime}$ is the total number of timesteps used for training from all scales of the audio-visual feature pyramid. Our method can be integrated into any one-stage anchor-free framework and trained in an end-to-end manner. The total loss is $L_{total}=L_{g}+\lambda_{1}L_{c}+\lambda_{2}L_{b}+\lambda_{3}L_{\mathcal{C}}$, where $L_{g}$ and $L_{c}$ are losses for regression [Rezatofighi et al.(2019)Rezatofighi, Tsoi, Gwak, Sadeghian, Reid, and Savarese] and classification [Lin et al.(2017b)Lin, Goyal, Girshick, He, and Dollár] and are the same as in [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]. $L_{b}$ is the boundary confidence loss from [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]. $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ denote the loss balancing weights, and $\lambda_{2}$ is set to 0 when the baseline is ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] or TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao], which have no boundary confidence loss.
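The sketch below assembles this objective, assuming the baseline's regression, classification and boundary confidence losses have already been computed; only the centricity term of Eq. (3) is written out, and the default weights follow Sec. 4.

```python
import torch
import torch.nn.functional as F

def total_loss(L_g, L_c, L_b, p_cent_pred, p_cent_gt,
               lam1=1.0, lam2=0.5, lam3=1.7):
    """Sketch of the training objective. L_g, L_c and L_b are the
    regression, classification and boundary-confidence losses of the
    chosen baseline (passed in here as precomputed scalars); the last
    term is the centricity MSE of Eq. (3) over the T' training
    timesteps gathered from all pyramid levels. lam2 = 0 for baselines
    without a boundary-confidence head."""
    L_cent = F.mse_loss(p_cent_pred, p_cent_gt)           # Eq. (3)
    return L_g + lam1 * L_c + lam2 * L_b + lam3 * L_cent

# Toy usage with scalar baseline losses and random centricity targets.
loss = total_loss(torch.tensor(0.5), torch.tensor(1.2), torch.tensor(0.3),
                  torch.rand(100), torch.rand(100))
```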

3.3 Post-processing

For each timestep $t$, the network produces the visual and audio classification scores $p_{t}^{v}$, $p_{t}^{a}$, the corresponding action class label $\alpha$, a centricity score $p_{t}^{\mathcal{C}}$, and a pair of starting and ending boundaries $s_{t}$, $e_{t}$ with their corresponding boundary confidences $p_{s_{t}}^{s}$ and $p_{e_{t}}^{e}$ (with these confidences computed as in [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]). The final confidence score for timestep $t$ is then a weighted combination of the learnt knowledge, i.e.

$$\mathcal{S}=p_{t}^{v}+\tau\,p_{t}^{a}+\beta\,p_{t}^{\mathcal{C}}+\gamma\,(p^{s}_{s_{t}}+p^{e}_{e_{t}}), \qquad (4)$$

where $\tau$, $\beta$ and $\gamma$ are fusion weights, and $\gamma$ is set to 0 when the baseline is ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] or TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao]. Finally, we follow standard practice [Lin et al.(2018)Lin, Zhao, Su, Wang, and Yang, Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Su et al.(2021)Su, Gan, Wu, Yan, and Qiao, Zhang et al.(2022)Zhang, Wu, and Li] to rank these candidate actions based on the final confidence score $\mathcal{S}$ and filter them using Soft-NMS [Bodla et al.(2017)Bodla, Singh, Chellappa, and Davis] to obtain a final set of $M$ predictions $\Phi=\{(s,e,\alpha)_{m}\}^{M}_{m=1}$.
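A minimal sketch of this scoring step is shown below; the tensors are random placeholders, the weights follow Sec. 4, and Soft-NMS itself is omitted.

```python
import torch

def fused_confidence(p_v, p_a, p_cent, p_start, p_end,
                     tau=0.2, beta=1.0, gamma=0.7):
    """Final confidence score of Eq. (4) for each candidate proposal;
    gamma is set to 0 for baselines without boundary-confidence
    estimates (e.g. ActionFormer, TriDet)."""
    return p_v + tau * p_a + beta * p_cent + gamma * (p_start + p_end)

# Toy usage: score and rank T candidate proposals before Soft-NMS.
T = 100
scores = fused_confidence(torch.rand(T), torch.rand(T), torch.rand(T),
                          torch.rand(T), torch.rand(T))
ranked = torch.argsort(scores, descending=True)   # Soft-NMS then filters these
```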

4 Experiments

Dataset – We conduct experiments on EPIC-Kitchens-100 [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al.], a large-scale audio-visual dataset that contains 700 unscripted videos with 97 verb and 300 noun classes. On average, there are 128 action instances per video, with significant overlap.

Evaluation metric – We use mean Average Precision (mAP) for verb, noun and action tasks at various IoU thresholds {0.1, 0.2, 0.3, 0.4, 0.5} to evaluate our method against others.

Baselines – Our approach is integrated with three one-stage anchor-free approaches [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]. ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] is chosen as it is a pioneering anchor-free work that models long-range temporal dependencies using the Transformer for action detection. TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] extends this by incorporating scalable-granularity perception layers and a Trident head to regress boundaries. Finally, Wang et al. [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] (hereafter RAB) introduces a method to estimate boundary confidences through Gaussian scaling.

Implementation details – We compare our approach against state-of-the-art (SOTA) methods [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al., Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem, Huang et al.(2022)Huang, Zhang, Pan, Qing, Tang, Liu, and Ang Jr, Zhang et al.(2022)Zhang, Wu, and Li, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] for temporal action detection. For a fair comparison, we employ the same visual features as [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett], extracted from an action recognition model [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al.] that is pre-trained with the SlowFast [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] network on EPIC-Kitchens-100. To obtain features with a dimension of 1x2304, we have a window size of 32 and a stride of 16 frames. For audio features, we generate 512×128 spectrograms using a window size of 2.6ms and a stride of 1.3ms. These spectrograms are then fed into a SlowFast audio recognition model [Kazakos et al.(2021b)Kazakos, Nagrani, Zisserman, and Damen], with features extracted after the average pooling layer with a dimension 1x2304.

Again, following [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett], the feature pyramid generated by the transformer encoder has $N=6$ levels, with a level scaling factor of 2. For training, we use one Nvidia P100 GPU. We crop the video features of various lengths to 2304. The loss balancing weights in Sec. 3.2 are $\lambda_{1}=1$, $\lambda_{2}=0.5$, and $\lambda_{3}=1.7$. The weight ratio of the classification loss between verb and noun is set to 2:3. The scaling hyperparameter in Eq. (2) is set to $\sigma=1.7$. During the inference stage, the confidence score weights in Sec. 3.3 are assigned as $\tau=0.2$, $\beta=1$, and $\gamma=0.7$. For the multi-task classification, we select the top 11 verb and the top 33 noun predictions to combine the candidate actions.
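For reference, these settings can be collected into a single dictionary; the keys below are illustrative only and do not correspond to the released configuration files.

```python
# Hyperparameters stated above, gathered in one place for reference.
config = {
    "pyramid_levels": 6,             # N levels, scaling factor 2
    "max_seq_len": 2304,             # cropped feature length
    "lambda1": 1.0,                  # classification loss weight
    "lambda2": 0.5,                  # boundary-confidence loss weight
    "lambda3": 1.7,                  # centricity loss weight
    "sigma": 1.7,                    # Eq. (2) scaling hyperparameter
    "tau": 0.2,                      # audio score weight in Eq. (4)
    "beta": 1.0,                     # centricity score weight in Eq. (4)
    "gamma": 0.7,                    # boundary-confidence weight in Eq. (4)
    "verb_noun_loss_ratio": (2, 3),  # verb:noun classification loss weight
}
```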

Main results – Table 1 shows that the proposed method outperforms recent SOTA approaches [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al., Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem, Huang et al.(2022)Huang, Zhang, Pan, Qing, Tang, Liu, and Ang Jr, Zhang et al.(2022)Zhang, Wu, and Li, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] on the EPIC-Kitchens-100 action detection benchmark, and achieves significant improvements when added to existing SOTA one-stage multi-scale methods [Zhang et al.(2022)Zhang, Wu, and Li, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao]. ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] and TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] train different models for verb and noun detection and do not detect actions. Instead, we train one model and add a multi-task classification head to predict their results for the action task. Our proposed method with audio fusion and centricity improves performance in every metric: in action detection, mAP improves by 1.35% and 0.97% for ActionFormer and TriDet, respectively. RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] also performs well on egocentric data and follows the same anchor-free pipeline as ours; the improvement achieved on RAB is 1.32%, which is also the best result amongst the baselines.

Method | Venue | Feature | Audio | Avg. mAP@task (Verb / Noun / Action)
BMN [Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al.] | IJCV 2022 | SF [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] | × | 8.36 / 6.53 / 5.21
OWL [Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem] | ArXiv 2022 | SF [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] | ✓ | 11.47 / 12.63 / 8.35
BMN+TSN [Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Huang et al.(2022)Huang, Zhang, Pan, Qing, Tang, Liu, and Ang Jr] | ICLR 2022 | TSN [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, Tang, and Van Gool] | × | 13.47 / 12.37 / 9.71
BMN+TAda2D [Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Huang et al.(2022)Huang, Zhang, Pan, Qing, Tang, Liu, and Ang Jr] | ICLR 2022 | TAda2D [Huang et al.(2022)Huang, Zhang, Pan, Qing, Tang, Liu, and Ang Jr] | × | 16.78 / 17.39 / 13.18
ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] | ECCV 2022 | SF [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] | × | 20.45 / 20.90 / 16.63
ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] + Ours | - | SF [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] | ✓ | 20.48 / 22.41 / 17.98
TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] | CVPR 2023 | SF [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] | × | 20.87 / 21.04 / 17.21
TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] + Ours | - | SF [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] | ✓ | 21.94 / 22.86 / 18.18
RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] | AVSS 2022 | SF [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] | × | 20.71 / 20.53 / 17.18
RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] + Ours | - | SF [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] | ✓ | 21.10 / 23.08 / 18.50
Table 1: Comparative results on the EPIC-Kitchens-100 action detection validation set – ActionFormer and TriDet only provide results for verb and noun detection, hence we produce action results by modifying them with a multi-task action classification head [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett].
Figure 5: Qualitative results on the EPIC-Kitchens-100 action detection dataset – The top row shows the visual content of selected frames. The middle seven lines display the ground-truth (GT) and the predictions of RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett], RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]+Ours, TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao], TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao]+Ours, AF [Zhang et al.(2022)Zhang, Wu, and Li], AF [Zhang et al.(2022)Zhang, Wu, and Li]+Ours for a zoomed-in region. The bottom seven lines represent the whole video sequence. These results demonstrate the effectiveness of our method in accurately detecting dense actions.

Qualitative results – Qualitative plots of ‘RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]+Ours’ on the EPIC-Kitchens-100 action detection validation dataset are shown in Figure 5. The bottom two lines showcase the model’s ability to detect dense actions with different classes and durations in a video, demonstrating that our approach can effectively utilise the audio modality to learn discriminative representations. The middle three lines show a zoomed-in, detailed look where it is easier to see that our method better deals with challenging actions, e.g. see the action ‘put fork’ in the 77th second of the first video and the action ‘wash hand’ in the 68th second of the second video that were missed by the baseline model. This indicates that our centricity head enhances the confidence scores for actions with more precise boundaries, resulting in their preferential ranking and selection during the Soft-NMS processing.

4.1 Ablations

Components ablation – We ablate the contributions of our two main components, audio-visual fusion and the centricity head, in Table 2. Both components yield notable improvements over the baseline methods [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett], whether engaged individually or in combination, particularly for action detection.

Baseline | Audio | Centricity | Avg. mAP@task (Verb / Noun / Action)
ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] | × | × | 20.45 / 20.90 / 16.63
ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] + Ours | ✓ | × | 19.75 / 21.46 / 17.22
ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] + Ours | × | ✓ | 21.50 / 22.25 / 17.60
ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] + Ours | ✓ | ✓ | 20.48 / 22.41 / 17.98
TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] | × | × | 20.87 / 21.04 / 17.21
TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] + Ours | ✓ | × | 21.07 / 21.75 / 17.61
TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] + Ours | × | ✓ | 21.65 / 21.23 / 17.42
TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] + Ours | ✓ | ✓ | 21.94 / 22.86 / 18.18
RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] | × | × | 20.71 / 20.53 / 17.18
RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] + Ours | ✓ | × | 20.93 / 21.84 / 17.88
RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] + Ours | × | ✓ | 21.47 / 21.43 / 17.69
RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] + Ours | ✓ | ✓ | 21.10 / 23.08 / 18.50
Table 2: Components analysis on the EPIC-Kitchens-100 action detection validation set.
Classification weight $\lambda_{1}$ | Avg. mAP@task (Verb / Noun / Action)
0.5 | 22.04 / 22.70 / 18.22
1 | 21.10 / 23.08 / 18.50
2 | 22.13 / 22.52 / 18.01
4 | 22.36 / 22.11 / 18.09
6 | 22.21 / 22.75 / 17.83
8 | 22.20 / 22.92 / 17.83
Table 3: Ablation on varying the classification weight $\lambda_{1}$.
Figure 6: Effect of centricity on confidence scores (see text for details).

Loss function weights – We train our network in an end-to-end manner by minimizing the total loss function in Sec. 3.2, where we assign three weights to balance the various losses. Table 3 shows that $\lambda_{1}=1$ is the best value for the classification loss $L_{c}$ weight on the action task, across the improvements made to the baseline. For the boundary confidence loss $L_{b}$ weight $\lambda_{2}$, we follow the baseline RAB’s recommended setting [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]. Finally, Table 4 demonstrates that our results are relatively stable when varying the centricity loss $L_{\mathcal{C}}$ weight $\lambda_{3}$ between 0.5 and 2, with the best action detection result at $\lambda_{3}=1.7$.

The effect of centricity on confidence scores – Figure 6 shows that as the centre distance increases, the average tIoU values (blue bars) between the ground truth and the proposals generated by timesteps at that centre distance exhibit a notable decreasing trend (↓0.16). However, the original confidence scores (dashed blue line) only slightly decrease (↓0.03). Adding centricity into the confidence scores (solid blue line) responds better to the expected trend (↓0.09). Thus, proposals with more accurate boundaries (higher tIoU values) rank higher based on their confidence scores when centricity is incorporated.

Audio-visual fusion strategies – The three strategies for fusing the audio and visual modalities (see Section 3.1) are compared in Table 5. The proposal fusion strategy has the lowest overall action detection performance due to the audio stream’s proposals having less precise boundaries. The classification scores fusion strategy improves on Visual-only through multiplication (↑0.19%) or addition (↑0.51%). For the feature pyramid fusion, while a direct concatenation achieves a relative increase (↑0.27%), the cross-attention mechanism provides a more significant improvement (↑0.70%).

Centricity weight $\lambda_{3}$ | Avg. mAP@task (Verb / Noun / Action)
0.5 | 21.18 / 22.82 / 18.04
1 | 21.19 / 22.02 / 18.11
1.5 | 21.23 / 23.24 / 18.24
1.6 | 21.71 / 22.89 / 18.16
1.7 | 21.10 / 23.08 / 18.50
1.8 | 21.16 / 22.77 / 18.13
2 | 21.27 / 22.90 / 18.22
Table 4: Ablation on varying the centricity weight $\lambda_{3}$.
Fusion Strategy | Avg. mAP@task (Verb / Noun / Action)
Visual-only | 20.71 / 20.53 / 17.18
Audio-only | 7.94 / 6.41 / 4.57
Proposal fusion (Figure 3(a)) | 20.36 / 20.38 / 16.65
Classif. scores fusion (multiplication) (Figure 3(b)) | 20.93 / 21.82 / 17.37
Classif. scores fusion (addition) (Figure 3(b)) | 20.18 / 21.89 / 17.69
Feature pyramid fusion (concatenation) (Figure 3(c)) | 20.95 / 21.52 / 17.45
Feature pyramid fusion (cross-attention) (Figure 3(c)) | 20.93 / 21.84 / 17.88
Table 5: Comparing different strategies to fuse the audio and visual modalities.

Centricity vs. action-ness – Centricity establishes how close the current timestep is to the action centre, while action-ness [Chang et al.(2022)Chang, Wang, Wang, Li, and Shou] represents the probability of an action occurring. Figure 7 displays individual action instances with their centricity scores, action-ness scores and the average tIoUs between the proposals generated by the corresponding timesteps and their ground truth. In the middle of an action, timesteps tend to exhibit peak tIoU values that gradually decrease towards both sides, and a similar trend is observed in the centricity scores. This suggests that timesteps with higher centricity scores are inclined to generate proposals with more precise boundaries. In contrast, the action-ness scores (purple line) tend to drop in the middle of the action.

Figure 7: Visualization examples of tIoU values, and centricity and action-ness scores – The x-axis represents the re-scaling of the temporal dimension of an action segment to the range [0, 1]. The bars represent the average tIoU between the proposals generated by corresponding timesteps and their ground truth. The green and purple lines are the centricity and action-ness scores, respectively.

5 Conclusion

We introduced an audio-visual fusion approach and a novel centricity head for one-stage anchor-free action detectors. Our method achieves state-of-the-art results on the large-scale egocentric EPIC-Kitchens-100 action detection benchmark where audio and video streams are available. Detailed ablations demonstrated the benefits of fusing audio and visual modalities and emphasized the importance of centricity scores.

Many questions about the use of multiple modalities in temporal action detection remain unexplored, such as discrepancies in the training data and the temporal misalignment between different modalities. An extension of our work would involve joint learning from visual, audio, and language modalities to enhance action detection performance, with a specific focus on mitigating the misalignment among these three modalities and developing novel fusion techniques that provide discriminative representations.

References

  • [Afouras et al.(2020)Afouras, Owens, Chung, and Zisserman] Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 208–224, 2020.
  • [Arandjelovic and Zisserman(2017)] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 609–617, 2017.
  • [Aytar et al.(2016)Aytar, Vondrick, and Torralba] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (NeurIPS), pages 892–900, 2016.
  • [Bodla et al.(2017)Bodla, Singh, Chellappa, and Davis] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-NMS - improving object detection with one line of code. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5562–5570, 2017.
  • [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229, 2020.
  • [Carreira and Zisserman(2017)] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the Kinetics dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6299–6308, 2017.
  • [Chang et al.(2022)Chang, Wang, Wang, Li, and Shou] Shuning Chang, Pichao Wang, Fan Wang, Hao Li, and Zheng Shou. Augmented transformer with adaptive graph for temporal action proposal generation. In Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis, pages 41–50, 2022.
  • [Cheng and Bertasius(2022)] Feng Cheng and Gedas Bertasius. TallFormer: Temporal action localization with a long-memory transformer. In Proceedings of the European Conference on Computer Vision (ECCV), pages 503–521, 2022.
  • [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al.] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision. International Journal of Computer Vision (IJCV), pages 33–55, 2022.
  • [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6202–6211, 2019.
  • [Gao et al.(2017)Gao, Yang, Chen, Sun, and Nevatia] Jiyang Gao, Zhenheng Yang, Kan Chen, Chen Sun, and Ram Nevatia. TURN TAP: Temporal unit regression network for temporal action proposals. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3628–3636, 2017.
  • [Grauman et al.(2022)Grauman, Westbury, Byrne, Chavis, Furnari, Girdhar, Hamburger, Jiang, Liu, Liu, et al.] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, 2022.
  • [Hu et al.(2019)Hu, Nie, and Li] Di Hu, Feiping Nie, and Xuelong Li. Deep multimodal clustering for unsupervised audiovisual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9248–9257, 2019.
  • [Huang et al.(2022)Huang, Zhang, Pan, Qing, Tang, Liu, and Ang Jr] Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Mingqian Tang, Ziwei Liu, and Marcelo H Ang Jr. TAda! temporally-adaptive convolutions for video understanding. In Proceedings of International Conference on Learning Representations (ICLR), pages 1–23, 2022.
  • [Huh et al.(2023)Huh, Chalk, Kazakos, Damen, and Zisserman] Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, and Andrew Zisserman. EPIC-SOUNDS: A Large-Scale Dataset of Actions that Sound. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  • [Kazakos et al.(2019)Kazakos, Nagrani, Zisserman, and Damen] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. EPIC-Fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5492–5501, 2019.
  • [Kazakos et al.(2021a)Kazakos, Huh, Nagrani, Zisserman, and Damen] Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, and Dima Damen. With a little help from my temporal context: Multimodal egocentric action recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2021a.
  • [Kazakos et al.(2021b)Kazakos, Nagrani, Zisserman, and Damen] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. Slow-Fast auditory streams for audio recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 855–859. IEEE, 2021b.
  • [Korbar et al.(2018)Korbar, Tran, and Torresani] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. Advances in Neural Information Processing Systems (NeurIPS), pages 7774–7785, 2018.
  • [Lee et al.(2021)Lee, Jain, Park, and Yun] Jun-Tae Lee, Mihir Jain, Hyoungwoo Park, and Sungrack Yun. Cross-attentional audio-visual fusion for weakly-supervised action localization. In Proceedings of International Conference on Learning Representations (ICLR), 2021.
  • [Li et al.(2016)Li, Lan, Xing, Zeng, Yuan, and Liu] Yanghao Li, Cuiling Lan, Junliang Xing, Wenjun Zeng, Chunfeng Yuan, and Jiaying Liu. Online human action detection using joint classification-regression recurrent neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 203–220, 2016.
  • [Lin et al.(2020)Lin, Li, Wang, Tai, Luo, Cui, Wang, Li, Huang, and Ji] Chuming Lin, Jian Li, Yabiao Wang, Ying Tai, Donghao Luo, Zhipeng Cui, Chengjie Wang, Jilin Li, Feiyue Huang, and Rongrong Ji. Fast learning of temporal action proposal via dense boundary generator. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 11499–11506, 2020.
  • [Lin et al.(2021)Lin, Xu, Luo, Wang, Tai, Wang, Li, Huang, and Fu] Chuming Lin, Chengming Xu, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Learning salient boundary feature for anchor-free temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3320–3329, 2021.
  • [Lin et al.(2017a)Lin, Zhao, and Shou] Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot temporal action detection. In Proceedings of the 25th ACM International Conference on Multimedia, pages 988–996, 2017a.
  • [Lin et al.(2018)Lin, Zhao, Su, Wang, and Yang] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. BSN: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
  • [Lin et al.(2019)Lin, Liu, Li, Ding, and Wen] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. BMN: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3889–3898, 2019.
  • [Lin et al.(2017b)Lin, Goyal, Girshick, He, and Dollár] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2980–2988, 2017b.
  • [Liu and Wang(2020)] Qinying Liu and Zilei Wang. Progressive boundary refinement network for temporal action detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 11612–11619, 2020.
  • [Liu et al.(2022)Liu, Wang, Hu, Tang, Zhang, Bai, and Bai] Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, and Xiang Bai. End-to-end temporal action detection with transformer. IEEE Transactions on Image Processing, pages 5427–5441, 2022.
  • [Long et al.(2019)Long, Yao, Qiu, Tian, Luo, and Mei] Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 344–353, 2019.
  • [Mo and Tian(2022)] Shentong Mo and Yapeng Tian. Multi-modal grouping network for weakly-supervised audio-visual video parsing. In Advances in Neural Information Processing Systems (NeurIPS), pages 34722–34733, 2022.
  • [Nagrani et al.(2020)Nagrani, Sun, Ross, Sukthankar, Schmid, and Zisserman] Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, and Andrew Zisserman. Speech2Action: Cross-modal supervision for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10317–10326, 2020.
  • [Nawhal and Mori(2021)] Megha Nawhal and Greg Mori. Activity graph transformer for temporal action localization. arXiv preprint arXiv:2101.08540, 2021.
  • [Owens and Efros(2018)] Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018.
  • [Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem] Merey Ramazanova, Victor Escorcia, Fabian Caba Heilbron, Chen Zhao, and Bernard Ghanem. OWL (Observe, Watch, Listen): Localizing actions in egocentric video via audiovisual temporal context. arXiv preprint arXiv:2202.04947, 2022.
  • [Rao et al.(2022)Rao, Khalil, Li, Dai, and Lu] Varshanth Rao, Md Ibrahim Khalil, Haoda Li, Peng Dai, and Juwei Lu. Dual perspective network for audio-visual event localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 689–704, 2022.
  • [Rezatofighi et al.(2019)Rezatofighi, Tsoi, Gwak, Sadeghian, Reid, and Savarese] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 658–666, 2019.
  • [Shi et al.(2022)Shi, Zhong, Cao, Zhang, Ma, Li, and Tao] Dingfeng Shi, Yujie Zhong, Qiong Cao, Jing Zhang, Lin Ma, Jia Li, and Dacheng Tao. ReAct: Temporal action detection with relational queries. In Proceedings of the European Conference on Computer Vision (ECCV), pages 105–121, 2022.
  • [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, and Dacheng Tao. TriDet: Temporal action detection with relative boundary modeling. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18857–18866, 2023.
  • [Su et al.(2021)Su, Gan, Wu, Yan, and Qiao] Haisheng Su, Weihao Gan, Wei Wu, Junjie Yan, and Yu Qiao. BSN++: Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 2602–2610, 2021.
  • [Tian et al.(2018)Tian, Shi, Li, Duan, and Xu] Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 247–263, 2018.
  • [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] Hanyuan Wang, Majid Mirmehdi, Dima Damen, and Toby Perrett. Refining action boundaries for one-stage detection. In 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–8. IEEE, 2022.
  • [Wang et al.(2023)Wang, Singh, and Torresani] Huiyu Wang, Mitesh Kumar Singh, and Lorenzo Torresani. Ego-Only: Egocentric action detection without exocentric pretraining. arXiv preprint arXiv:2301.01380, 2023.
  • [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, Tang, and Van Gool] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2016.
  • [Wu and Yang(2021)] Yu Wu and Yi Yang. Exploring heterogeneous clues for weakly-supervised audio-visual video parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1326–1335, 2021.
  • [Xia and Zhao(2022)] Yan Xia and Zhou Zhao. Cross-modal background suppression for audio-visual event localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19989–19998, 2022.
  • [Xu et al.(2017)Xu, Das, and Saenko] Huijuan Xu, Abir Das, and Kate Saenko. R-C3D: Region convolutional 3d network for temporal activity detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5783–5792, 2017.
  • [Yang et al.(2020)Yang, Peng, Zhang, Fu, and Han] Le Yang, Houwen Peng, Dingwen Zhang, Jianlong Fu, and Junwei Han. Revisiting anchor mechanisms for temporal action localization. IEEE Transactions on Image Processing, pages 8535–8548, 2020.
  • [Zhang et al.(2022)Zhang, Wu, and Li] Chen-Lin Zhang, Jianxin Wu, and Yin Li. ActionFormer: Localizing moments of actions with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 492–510, 2022.
  • [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2914–2923, 2017.