\addauthorHanyuan Wang [email protected]
\addauthorMajid Mirmehdi [email protected]
\addauthorDima Damen [email protected]
\addauthorToby Perrett [email protected]
\addinstitution
Department of Computer Science
Faculty of Engineering
University of Bristol
Bristol, UK
Centricity-based Audio-Visual Temporal Action Detection
Centre Stage: Centricity-based
Audio-Visual Temporal Action Detection
Abstract
Previous one-stage action detection approaches have modelled temporal dependencies using only the visual modality. In this paper, we explore different strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities. We also demonstrate the correlation between the distance from the timestep to the action centre and the accuracy of the predicted boundaries. Thus, we propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score. This leads to increased confidence for proposals that exhibit more precise boundaries. Our method can be integrated with other one-stage anchor-free architectures and we demonstrate this on three recent baselines on the EPIC-Kitchens-100 action detection benchmark where we achieve state-of-the-art performance. Detailed ablation studies showcase the benefits of fusing audio and our proposed centricity scores. Code and models for our proposed method are publicly available at https://github.com/hanielwang/Audio-Visual-TAD.git.
1 Introduction
Temporal action detection aims to predict the boundaries of action segments from a long untrimmed video and classify the actions, as a fundamental step towards video understanding [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He, Carreira and Zisserman(2017), Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, Tang, and Van Gool]. A typically challenging scenario arises with unscripted actions in egocentric videos [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al., Grauman et al.(2022)Grauman, Westbury, Byrne, Chavis, Furnari, Girdhar, Hamburger, Jiang, Liu, Liu, et al.], where a single unedited video contains dense action segments of varying lengths, ranging from seconds to minutes.
Most recently, a few works have approached egocentric action detection by modelling long-range visual dependencies with transformers [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Wang et al.(2023)Wang, Singh, and Torresani, Nawhal and Mori(2021), Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem]. However, using only visual information means a missed opportunity to exploit potentially meaningful aural action cues. As shown in Figure 1(a), sound exhibits discriminating characteristics around the starting point of actions such as ‘open drawer’, ‘take spoon’ and ‘scoop yoghurt’, which can be useful for boundary regression. Similarly, for action classification, the sound of flowing water can boost confidence in identifying an action as ‘turn-on tap’ rather than ‘turn-off tap’, even though their visual content is similar.
[Figure 1: (a) Audio signals exhibit discriminating characteristics around the start of actions such as ‘open drawer’, ‘take spoon’ and ‘scoop yoghurt’. (b) The closer a timestep is to the action centre, the higher the tIoU between its generated proposal and the ground truth.]
Unlike methods [Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem, Tian et al.(2018)Tian, Shi, Li, Duan, and Xu, Kazakos et al.(2021a)Kazakos, Huh, Nagrani, Zisserman, and Damen] that directly fuse audio and visual modalities at the same scale through concatenation, addition or gating modules, in this paper we learn these modalities with separate encoders and fuse their representations using a cross-modal attention mechanism at different temporal scales. This allows us to exploit sufficient audio-visual information to detect actions of various duration.
Recent one-stage anchor-free methods [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] operate on egocentric videos by simultaneously predicting boundaries and action categories for each timestep. In contrast to anchor-based methods, anchor-free methods do not require pre-defined anchors to locate actions but directly generate one proposal for each timestep. We have observed that timesteps near the centre of actions tend to produce proposals with more precise boundaries. These proposals have higher temporal Intersection-over-Union (tIoU) values with corresponding ground-truth segments. As shown in Figure 1(b), the closer the current timestep is to the action centre, the greater the tIoU. Inspired by this observation, we introduce a centricity head that predicts a score so as to indicate how close the current timestep is to the action centre. This score is then integral to calculating the confidence scores for ranking candidate proposals, where those with more precise boundaries will be ranked higher. Our approach can be incorporated into most one-stage anchor-free action detectors and achieve significant improvement.
In summary, our key contributions are as follows: (i) we introduce a framework to effectively fuse audio and visual modalities using a cross-modal attention mechanism at various temporal scales, (ii) we propose a novel centricity head to predict the degree of closeness of each frame’s temporal distance to the action centre – this boosts a proposal’s confidence score and allows for the preferential selection of proposals with more precise boundaries, and (iii) we achieve state-of-the-art results on the EPIC-Kitchens-100 action detection benchmark, demonstrating the effectiveness of audio modality and the benefits of centricity in improving detection performance.
2 Related Work
Temporal action detection – Current temporal action detection methods can be divided into: (i) two-stage methods that first generate proposals and then classify them, and (ii) one-stage methods that predict boundaries and corresponding classes simultaneously. Some two-stage works generate proposals by estimating boundary probabilities [Lin et al.(2018)Lin, Zhao, Su, Wang, and Yang, Su et al.(2021)Su, Gan, Wu, Yan, and Qiao, Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Lin et al.(2020)Lin, Li, Wang, Tai, Luo, Cui, Wang, Li, Huang, and Ji] and action-ness scores [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin]. Many one-stage methods [Gao et al.(2017)Gao, Yang, Chen, Sun, and Nevatia, Lin et al.(2017a)Lin, Zhao, and Shou, Long et al.(2019)Long, Yao, Qiu, Tian, Luo, and Mei, Xu et al.(2017)Xu, Das, and Saenko, Liu and Wang(2020)] rely on pre-defined anchors to model temporal relations, which often leads to inflexibility and poor boundaries when detecting actions of various lengths. To address this, recent anchor-free methods [Yang et al.(2020)Yang, Peng, Zhang, Fu, and Han, Lin et al.(2021)Lin, Xu, Luo, Wang, Tai, Wang, Li, Huang, and Fu, Zhang et al.(2022)Zhang, Wu, and Li] predict the action category and offsets to the boundaries simultaneously for each timestep using parallel classification and regression heads. Then, candidate proposals constructed from these predictions are filtered to obtain the final results. Our work follows such an anchor-free pipeline.
Inspired by the DETR framework [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko], some works input relational queries [Shi et al.(2022)Shi, Zhong, Cao, Zhang, Ma, Li, and Tao], learned actions [Liu et al.(2022)Liu, Wang, Hu, Tang, Zhang, Bai, and Bai] or graph queries [Nawhal and Mori(2021)] to a transformer decoder to detect actions. However, with a limited number of queries, these methods struggle to cover a large number of actions in long videos. Alternatively, other works [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Chang et al.(2022)Chang, Wang, Wang, Li, and Shou, Cheng and Bertasius(2022)] use multi-scale transformer encoders [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] to model temporal dependencies for stronger video representations. For example, ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] applies local self-attention to extract a discriminative feature pyramid, which is then used for classification and regression. Our work falls under this workflow.
Audio-visual learning – Sight and hearing are both vital sensory modes that assist humans in perceiving the world, and this insight carries over to computational models that learn from both modalities. Numerous works [Afouras et al.(2020)Afouras, Owens, Chung, and Zisserman, Arandjelovic and Zisserman(2017), Aytar et al.(2016)Aytar, Vondrick, and Torralba, Hu et al.(2019)Hu, Nie, and Li, Korbar et al.(2018)Korbar, Tran, and Torresani, Owens and Efros(2018)] have focused on jointly learning audio and visual representations for tasks such as action recognition [Kazakos et al.(2019)Kazakos, Nagrani, Zisserman, and Damen, Kazakos et al.(2021a)Kazakos, Huh, Nagrani, Zisserman, and Damen, Nagrani et al.(2020)Nagrani, Sun, Ross, Sukthankar, Schmid, and Zisserman], video parsing [Wu and Yang(2021), Mo and Tian(2022)] and event localization [Tian et al.(2018)Tian, Shi, Li, Duan, and Xu, Rao et al.(2022)Rao, Khalil, Li, Dai, and Lu, Xia and Zhao(2022)]. Audio-visual event localization aims to classify each timestep into a limited number of categories [Tian et al.(2018)Tian, Shi, Li, Duan, and Xu], relying on clear audio-visual signals and without the need to predict temporal boundaries. In contrast, our action detection task aims to leverage the audio-visual representation to detect temporal boundaries for dense actions of various lengths and with unclear audio cues, and then classify them into a wide range of categories. OWL [Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem] attempts different strategies for fusing audio and visual modalities, but it fuses at a single temporal scale only and classifies pre-generated proposals from [Lin et al.(2019)Lin, Liu, Li, Ding, and Wen], rather than detecting boundaries. In [Lee et al.(2021)Lee, Jain, Park, and Yun], the authors address this task by extracting intra-modal features, but their proposed framework is designed for simple, weakly-labelled data with sparse actions per video. Our work focuses on large-scale egocentric data comprising dense, complex actions of various durations, and we propose a framework to incorporate audio-visual learning and centricity into one-stage anchor-free methods.
3 Method
We propose a novel framework for temporal action detection, rooted in audio-visual data, which can be incorporated into one-stage anchor-free pipelines [Zhang et al.(2022)Zhang, Wu, and Li, Lin et al.(2021)Lin, Xu, Luo, Wang, Tai, Wang, Li, Huang, and Fu, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] (see Figure 2). Similar to such temporal action detection works, we define the problem as follows. Given an untrimmed video, we extract features for the video and audio modalities and then process them using transformer encoders to obtain the visual and audio representation sequences $F_v$ and $F_a$, respectively. Based on these, our approach is to learn to predict a set of possible action instances $\Psi = \{(s_n, e_n, c_n)\}_{n=1}^{N}$, where $s_n$ and $e_n$ represent the starting and ending boundaries of an action, and $c_n$ represents the predicted action class.
[Figure 2: Overview of the proposed framework; the grey box denotes the cross-modal attention fusion applied across temporal scales.]
[Figure 3: The three audio-visual fusion strategies: (a) proposal fusion, (b) classification scores fusion, (c) feature pyramid fusion.]
3.1 Audio-visual Fusion
In this section, we explore three different strategies to effectively utilise the audio modality and combine it with visual information to improve action detection performance.
Proposal fusion – In this strategy (see Figure 3 (a)), at first the visual and audio representations $F_v$ and $F_a$ are produced by encoders $E_v$ and $E_a$ respectively. Classification heads $H^{cls}_v$ and $H^{cls}_a$ and regression heads $H^{reg}_v$ and $H^{reg}_a$ are then used to predict the classification scores $p^v_t$ and $p^a_t$, and boundaries $(s^v_t, e^v_t)$ and $(s^a_t, e^a_t)$, for the visual and audio modalities, respectively. Thus, we can obtain a set of candidate proposals $\Psi_v = \{(s^v_t, e^v_t, p^v_t)\}$ for the visual modality and similarly, a set of candidate proposals $\Psi_a = \{(s^a_t, e^a_t, p^a_t)\}$ for the audio modality. Then, we concatenate these two sets as $\Psi = \Psi_v \cup \Psi_a$.
Classification scores fusion – Although sounds can be associated with actions for classification purposes, the duration of an action does not necessarily correspond to its audio start and end as recently shown in [Huh et al.(2023)Huh, Chalk, Kazakos, Damen, and Zisserman]. Thus, we discard the audio boundaries, integrate the classification scores from both visual and audio modalities, and then use them along with the visual boundaries to generate proposals.
We use an approach similar to [Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem] to fuse visual and audio classification scores. Specifically, as shown in Figure 3(b), based on $F_v$ and $F_a$, the visual classification head $H^{cls}_v$, audio classification head $H^{cls}_a$ and visual boundary head $H^{reg}_v$ predict scores $p^v_t$ and $p^a_t$, and frame boundaries $s^v_t$ and $e^v_t$, respectively. We fuse the classification scores $p^v_t$ and $p^a_t$ by simple addition and combine them with the visual boundaries $s^v_t$ and $e^v_t$ to construct the set of fused candidate proposals $\Psi$, such that $\Psi = \{(s^v_t, e^v_t, p^v_t + p^a_t)\}$.
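The first two strategies reduce to simple set and score operations. A minimal sketch is given below; the function and tensor names are illustrative and not the exact implementation.

```python
import torch

# Strategy (a): proposal fusion - keep the proposals of both modality-specific
# branches and rank the concatenated set later in post-processing.
def fuse_proposals(proposals_visual, proposals_audio):
    # each proposal is a (start, end, class_scores) tuple
    return proposals_visual + proposals_audio

# Strategy (b): classification scores fusion - add the per-timestep class
# scores of the two modalities and keep only the visual boundaries.
def fuse_classification_scores(p_visual, p_audio, starts_v, ends_v):
    p_fused = p_visual + p_audio                       # (T, num_classes)
    return [(s, e, p) for s, e, p in zip(starts_v, ends_v, p_fused)]
```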
Feature pyramid fusion – In this strategy (see Figure 3(c)), $F_v$ and $F_a$ are put through a cross-attention mechanism [Lee et al.(2021)Lee, Jain, Park, and Yun, Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem] (see also grey box of Figure 2) to model their inter-modal dependencies, which then results in a single representation vector $F_{av}$.
Firstly, $F_v$ and $F_a$ are projected into a query $Q$, key $K$ and value $V$, where one modality serves as the query input and the other serves as the key and value inputs. $W_Q$, $W_K$ and $W_V$ are learnable weight matrices, and $D$ is the embedding dimension. Next, we calculate the audio-visual representation vector
$F_{av} = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{D}\right)V, \qquad (1)$
which is then fed into a classification head and a regression head to obtain the classification scores $p_t$ and the boundaries $(s_t, e_t)$ for each timestep $t$. Therefore, the set of candidate proposals is $\Psi = \{(s_t, e_t, p_t)\}$.
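For reference, a minimal PyTorch sketch of this cross-attention fusion is given below; the module name, feature dimension and the choice of which modality provides the query are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses a visual and an audio feature sequence at one pyramid level (Eq. (1))."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V
        self.scale = dim ** -0.5                     # 1 / sqrt(D)

    def forward(self, f_query: torch.Tensor, f_kv: torch.Tensor) -> torch.Tensor:
        # f_query, f_kv: (batch, timesteps, dim); one modality attends to the other
        q, k, v = self.w_q(f_query), self.w_k(f_kv), self.w_v(f_kv)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                              # audio-visual representation F_av

# Applied independently at every level of the temporal feature pyramid:
# f_av_pyramid = [fusion(fv_l, fa_l) for fv_l, fa_l in zip(fv_pyramid, fa_pyramid)]
```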
We ablate these three strategies in Sec. 4.1. Based on the ablations, we chose feature pyramid fusion to generate audio-visual representations across temporal pyramid scales for assessment by the centricity head to predict corresponding scores (see Sec. 3.2 next). We also selected the classification scores fusion approach to generate stronger audio-visual classification scores to predict action categories and calculate confidence scores.
3.2 Audio-visual Centricity Head
We investigated the relationship between the distance of a timestep from the action centre and the tIoU value between its generated proposal and the ground truth. As shown in Figure 4(a), as a timestep gets closer to the action centre, its generated proposal has a higher tIoU value. This indicates that timesteps around the centre of an action can generate proposals with more reliable action boundaries. Thus, we propose a simple, yet effective, centricity head based on the audio-visual representation $F_{av}$ to estimate how close the timestep is to the centre of the action (as shown in Figure 4 (b)). The centricity head consists of three 1D conv layers with layer normalization and a ReLU activation function.
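A minimal sketch of such a head is shown below; the kernel size, channel width, the use of GroupNorm as a channel-wise stand-in for layer normalization, and the final sigmoid are assumptions consistent with the description above rather than the exact configuration.

```python
import torch
import torch.nn as nn

class CentricityHead(nn.Module):
    """Predicts, per timestep, how close it lies to an action centre (score in [0, 1])."""

    def __init__(self, dim: int = 512):
        super().__init__()
        layers = []
        for _ in range(2):
            layers += [nn.Conv1d(dim, dim, kernel_size=3, padding=1),
                       nn.GroupNorm(1, dim),          # layer-norm-style normalisation
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv1d(dim, 1, kernel_size=3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, f_av: torch.Tensor) -> torch.Tensor:
        # f_av: (batch, dim, timesteps) audio-visual features from one pyramid level
        return torch.sigmoid(self.net(f_av)).squeeze(1)   # (batch, timesteps)
```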
[Figure 4: (a) Average tIoU between generated proposals and the ground truth versus the distance of the timestep from the action centre. (b) The proposed centricity head.]
Label assignment – We require the centricity scores calculated from ground-truth data as supervision signals for training. For each timestep $t$, we consider the relative distance $d_t$ between the current timestep and the centre of the corresponding ground-truth action to map the training labels of centricity scores
$g_t = \exp\!\left(-\,d_t^{2}/(2\sigma^{2})\right), \qquad (2)$
where $\sigma$ is a scaling hyperparameter which defines that the closer a timestep is to the action centre, the higher the centricity score. This has previously been explored to predict boundary confidences [Li et al.(2016)Li, Lan, Xing, Zeng, Yuan, and Liu, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett], but not specifically for centricity use cases. The centricity scores are normalized to a range of 0 to 1.
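A sketch of this label assignment is given below; it assumes a Gaussian decay of the relative distance to the action centre, consistent with the description above, and the default value of the scaling hyperparameter is hypothetical.

```python
import torch

def centricity_labels(timesteps, start, end, sigma=0.15):
    """Ground-truth centricity for timesteps matched to one action.

    timesteps: (T,) tensor of timestep locations on the feature grid;
    start, end: boundaries of the matched ground-truth action;
    sigma: scaling hyperparameter (hypothetical default).
    """
    centre = 0.5 * (start + end)
    rel_dist = (timesteps - centre) / (end - start)       # relative distance to the centre
    return torch.exp(-rel_dist ** 2 / (2 * sigma ** 2))   # in (0, 1], peaks at the centre
```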
Training – We optimize the loss between the ground-truth centricity scores $g_t$ and the predicted centricity scores $\hat{g}_t$ using a Mean Square Error (MSE) loss as
$\mathcal{L}_{cen} = \frac{1}{T}\sum_{t=1}^{T}\left(\hat{g}_t - g_t\right)^{2}, \qquad (3)$
where $T$ is the total number of timesteps used for training from all scales of the audio-visual feature pyramid. Our method can be integrated into any one-stage anchor-free framework and trained in an end-to-end manner. The total loss is $\mathcal{L} = \mathcal{L}_{reg} + \lambda_{cls}\mathcal{L}_{cls} + \lambda_{conf}\mathcal{L}_{conf} + \lambda_{cen}\mathcal{L}_{cen}$, where $\mathcal{L}_{reg}$ and $\mathcal{L}_{cls}$ are losses for regression [Rezatofighi et al.(2019)Rezatofighi, Tsoi, Gwak, Sadeghian, Reid, and Savarese] and classification [Lin et al.(2017b)Lin, Goyal, Girshick, He, and Dollár] and are the same as in [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]. $\mathcal{L}_{conf}$ is the boundary confidence loss from [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]. $\lambda_{cls}$, $\lambda_{conf}$ and $\lambda_{cen}$ denote the loss balancing weights, and $\lambda_{conf}$ is set to 0 when the baseline is ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] or TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao].
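A compact sketch of the training objective follows; it assumes the regression loss is unweighted while the three balancing weights multiply the classification, boundary-confidence and centricity terms, and the argument names are ours.

```python
import torch.nn.functional as F

def total_loss(l_reg, l_cls, l_conf, pred_centricity, gt_centricity,
               w_cls, w_conf, w_cen):
    """Combine regression, classification, boundary-confidence and centricity losses.

    w_conf is zero for baselines without a boundary-confidence head
    (e.g. ActionFormer, TriDet).
    """
    l_cen = F.mse_loss(pred_centricity, gt_centricity)   # Eq. (3)
    return l_reg + w_cls * l_cls + w_conf * l_conf + w_cen * l_cen
```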
3.3 Post-processing
For each timestep $t$, the network produces the visual and audio classification scores $p^v_t$ and $p^a_t$, the corresponding action class label $c_t$, a centricity score $\hat{g}_t$, and a pair of starting and ending boundaries $(s_t, e_t)$ with their corresponding boundary confidences $\kappa^{s}_t$ and $\kappa^{e}_t$ (with these confidences computed as in [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]). The final confidence score $\phi_t$ for timestep $t$ is then a weighted combination of the learnt knowledge, i.e.
$\phi_t = \lambda_{p}\,(p^v_t + p^a_t) + \lambda_{g}\,\hat{g}_t + \lambda_{\kappa}\,\kappa^{s}_t\,\kappa^{e}_t, \qquad (4)$
where $\lambda_{p}$, $\lambda_{g}$ and $\lambda_{\kappa}$ are fusion weights, and $\lambda_{\kappa}$ is set to 0 when the baseline is ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] or TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao]. Finally, we follow standard practice [Lin et al.(2018)Lin, Zhao, Su, Wang, and Yang, Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Su et al.(2021)Su, Gan, Wu, Yan, and Qiao, Zhang et al.(2022)Zhang, Wu, and Li] to rank these candidate actions based on the final confidence score and filter them using Soft-NMS [Bodla et al.(2017)Bodla, Singh, Chellappa, and Davis] to obtain a final set of predictions.
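A sketch of this post-processing step is shown below: candidates are scored with a weighted combination of the learnt cues and then filtered with Gaussian Soft-NMS [Bodla et al.(2017)Bodla, Singh, Chellappa, and Davis]. The weighted-sum form, the helper names and the default parameters are illustrative assumptions.

```python
import numpy as np

def final_confidence(p_av, centricity, conf_start, conf_end, w_p, w_g, w_k):
    # one possible weighted combination of the learnt cues; w_k = 0 for
    # baselines without boundary-confidence heads
    return w_p * p_av + w_g * centricity + w_k * conf_start * conf_end

def temporal_iou(seg, segs):
    """tIoU between one (start, end) segment and an (N, 2) array of segments."""
    inter = np.clip(np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]), 0, None)
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def soft_nms(segments, scores, sigma=0.5, min_score=1e-3):
    """Gaussian Soft-NMS over 1-D temporal segments; returns indices of kept proposals."""
    scores = scores.astype(float).copy()
    remaining = list(range(len(scores)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        keep.append(best)
        remaining.remove(best)
        if not remaining:
            break
        ious = temporal_iou(segments[best], segments[remaining])
        for idx, iou in zip(list(remaining), ious):
            scores[idx] *= np.exp(-(iou ** 2) / sigma)   # decay overlapping proposals
            if scores[idx] < min_score:
                remaining.remove(idx)
    return keep
```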
4 Experiments
Dataset – We conduct experiments on EPIC-Kitchens-100 [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al.], a large-scale audio-visual dataset that contains 700 unscripted videos with 97 verb and 300 noun classes. On average, there are 128 action instances per video, with significant overlap.
Evaluation metric – We use mean Average Precision (mAP) for verb, noun and action tasks at various IoU thresholds {0.1, 0.2, 0.3, 0.4, 0.5} to evaluate our method against others.
Baselines – Our approach is integrated with three one-stage anchor-free approaches [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]. ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] is chosen as it is a pioneering anchor-free work that models long-range temporal dependencies using the Transformer for action detection. TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] extends this by incorporating scalable-granularity perception layers and a Trident head to regress boundaries. Finally, Wang et al. [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] (hereafter RAB) introduces a method to estimate boundary confidences through Gaussian scaling.
Implementation details – We compare our approach against state-of-the-art (SOTA) methods [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al., Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem, Huang et al.(2022)Huang, Zhang, Pan, Qing, Tang, Liu, and Ang Jr, Zhang et al.(2022)Zhang, Wu, and Li, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] for temporal action detection. For a fair comparison, we employ the same visual features as [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett], extracted from an action recognition model [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al.] that is pre-trained with the SlowFast [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] network on EPIC-Kitchens-100. To obtain features with a dimension of 1x2304, we have a window size of 32 and a stride of 16 frames. For audio features, we generate 512×128 spectrograms using a window size of 2.6ms and a stride of 1.3ms. These spectrograms are then fed into a SlowFast audio recognition model [Kazakos et al.(2021b)Kazakos, Nagrani, Zisserman, and Damen], with features extracted after the average pooling layer with a dimension 1x2304.
Again, following [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett], the feature pyramid generated by the transformer encoder has $L = 6$ levels, with a level scaling factor of 2. For training, we use one Nvidia P100 GPU. We crop the variable-length video features to a length of 2304. The loss balancing weights in Sec. 3.2 are , , and . The weight ratio of the classification loss between verb and noun is set to 2:3. The scaling hyperparameter $\sigma$ in Eq. (2) is set to . During the inference stage, the confidence score weights in Sec. 3.3 are assigned as , , and . For the multi-task classification, we select the top 11 verb and the top 33 noun predictions to form the candidate actions.
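As an illustration of this multi-task candidate construction, the sketch below pairs the top verb and noun predictions of a proposal; combining the two scores by multiplication is an assumption.

```python
import torch

def build_action_candidates(verb_scores, noun_scores, k_verb=11, k_noun=33):
    """Pair the top-k verb and noun predictions into scored action candidates.

    verb_scores: (num_verbs,) tensor; noun_scores: (num_nouns,) tensor.
    """
    v_scores, v_ids = verb_scores.topk(k_verb)
    n_scores, n_ids = noun_scores.topk(k_noun)
    candidates = [((int(v), int(n)), float(sv * sn))     # (verb id, noun id), score
                  for sv, v in zip(v_scores, v_ids)
                  for sn, n in zip(n_scores, n_ids)]
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```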
Main results – Table 1 shows that the proposed method outperforms recent SOTA approaches [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al., Lin et al.(2019)Lin, Liu, Li, Ding, and Wen, Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem, Huang et al.(2022)Huang, Zhang, Pan, Qing, Tang, Liu, and Ang Jr, Zhang et al.(2022)Zhang, Wu, and Li, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] on the EPIC-Kitchens-100 action detection benchmark, and achieves significant improvements when added to existing SOTA one-stage multi-scale methods [Zhang et al.(2022)Zhang, Wu, and Li, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao]. ActionFormer [Zhang et al.(2022)Zhang, Wu, and Li] and TriDet [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] train different models for verb and noun detection and do not detect actions. Instead, we train one model and add a multi-task classification head to predict their results for the action task. It can be seen that our enhanced proposed method with audio fusion and centricity improves performance in every metric. In action detection, mAP improves by and , respectively. RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] also performs well on egocentric data and follows the same anchor-free pipeline as ours. The improvement achieved on RAB[Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] was and is also the best result amongst the baselines.
Qualitative results – Qualitative plots of ‘RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]+Ours’ on the EPIC-Kitchens-100 action detection validation dataset are shown in Figure 5. The bottom two lines showcase the model’s ability to detect dense actions with different classes and durations in a video, demonstrating that our approach can effectively utilise the audio modality to learn discriminative representations. The middle three lines show a zoomed-in, detailed look where it is easier to see that our method better deals with challenging actions, e.g. see action ‘put fork’ in the second of the first video and the action ‘wash hand’ in the second of the second video that were missed by the baseline model. This indicates that our centricity head enhances the confidence scores for actions with more precise boundaries, resulting in their preferential ranking and selection during the Soft-NMS processing.
4.1 Ablations
All our ablations use RAB [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] as the baseline and are performed on the EPIC-Kitchens-100 validation set [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al.].
Components ablation – We ablate the contributions of our two main components, audio-visual fusion and the centricity head, as seen in Table 2. Both components deliver notable improvements over the baseline methods [Zhang et al.(2022)Zhang, Wu, and Li, Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao, Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett], whether engaged singly or in combination, for action detection.
Classification Weight | Avg. mAP@task | ||
Verb | Noun | Action | |
0.5 | 22.04 | 22.70 | 18.22 |
1 | 21.10 | 23.08 | 18.50 |
2 | 22.13 | 22.52 | 18.01 |
4 | 22.36 | 22.11 | 18.09 |
6 | 22.21 | 22.75 | 17.83 |
8 | 22.20 | 22.92 | 17.83 |
Loss function weights – We train our network in an end-to-end manner by minimizing the total loss function in Sec. 3.2, where we assign three weights to balance the various losses. Table 6 shows that is the best value for the classification loss weight $\lambda_{cls}$ on the action task, across the improvements made to the baseline. For the boundary confidence loss weight $\lambda_{conf}$, we follow the baseline RAB's recommended setting [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett]. Finally, Table 5 demonstrates that our results are relatively stable when varying the centricity loss weight $\lambda_{cen}$ between , with the best action detection result at .
The effect of centricity on confidence score – Figure 6 shows that as the centre distance increases, the average tIoU values (blue bars) between the ground truth and the proposals generated by timesteps at that centre distance show a notable decreasing trend (0.16). However, the original confidence scores (dashed blue line) decrease only slightly (0.03). Adding centricity into the confidence scores (solid blue line) follows the expected trend more closely (0.09). Thus, proposals with more accurate boundaries (higher tIoU values) rank higher based on their confidence scores when centricity is incorporated.
Audio-visual fusion strategies – The three strategies for fusing audio and visual modalities (see Section 3.1) are compared in Table 5. The proposal fusion strategy has the lowest overall action detection performance because the audio stream's proposals have less precise boundaries. The classification scores fusion strategy improves on Visual-only, whether fusing by multiplication (0.19%) or addition (0.51%). For the feature pyramid fusion, while direct concatenation achieves a relative increase (0.27%), the cross-attention mechanism provides a more substantial gain (0.70%).
Centricity Weight | Avg. mAP@task | ||
Verb | Noun | Action | |
0.5 | 21.18 | 22.82 | 18.04 |
1 | 21.19 | 22.02 | 18.11 |
1.5 | 21.23 | 23.24 | 18.24 |
1.6 | 21.71 | 22.89 | 18.16 |
1.7 | 21.10 | 23.08 | 18.50 |
1.8 | 21.16 | 22.77 | 18.13 |
2 | 21.27 | 22.90 | 18.22 |
Fusion Strategies | Avg. mAP@task | ||
Verb | Noun | Action | |
Visual-only | 20.71 | 20.53 | 17.18 |
Audio-only | 7.94 | 6.41 | 4.57 |
Proposals fusion (Figure 3 (a)) | 20.36 | 20.38 | 16.65 |
Classif. scores fusion (multiplication) (Fig 3(b)) | 20.93 | 21.82 | 17.37 |
Classif. scores fusion (addition) (Fig 3(b)) | 20.18 | 21.89 | 17.69 |
Feature pyramid fusion (concatenation) (Fig 3(c)) | 20.95 | 21.52 | 17.45 |
Feature pyramid fusion (cross-attention) (Fig 3(c)) | 20.93 | 21.84 | 17.88 |
Centricity vs. action-ness – Centricity establishes how close the current timestep is to the action centre, while action-ness [Chang et al.(2022)Chang, Wang, Wang, Li, and Shou] represents the probability of an action occurring. Figure 7 displays individual action instances with their centricity scores, action-ness scores and the average tIoUs between the proposals generated by the corresponding timesteps and their ground truth. In the middle of the action, timesteps tend to exhibit peak tIoU values that gradually decrease on both sides, and a similar trend is observed in the centricity scores. This suggests that timesteps associated with higher centricity scores are inclined to generate proposals with more precise boundaries. In contrast, the action-ness scores (purple line) tend to drop in the middle of the action.
[Figure 7: Example action instances (panels (a)–(d)) comparing centricity scores, action-ness scores and the average tIoUs of proposals generated at each timestep.]
5 Conclusion
We introduced an audio-visual fusion approach and a novel centricity head for one-stage anchor-free action detectors. Our method achieves state-of-the-art results on the large-scale egocentric EPIC-Kitchens-100 action detection benchmark where audio and video streams are available. Detailed ablations demonstrated the benefits of fusing audio and visual modalities and emphasized the importance of centricity scores.
Many questions about the use of multiple modalities in temporal action detection remain unexplored, such as discrepancies in the training data and the temporal misalignment between modalities. An extension of our work would involve joint learning from visual, audio and language modalities to enhance action detection performance, with a specific focus on mitigating the misalignment among these three modalities and developing novel fusion techniques to provide discriminative representations.
References
- [Afouras et al.(2020)Afouras, Owens, Chung, and Zisserman] Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 208–224, 2020.
- [Arandjelovic and Zisserman(2017)] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 609–617, 2017.
- [Aytar et al.(2016)Aytar, Vondrick, and Torralba] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (NeurIPS), pages 892–900, 2016.
- [Bodla et al.(2017)Bodla, Singh, Chellappa, and Davis] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-NMS - improving object detection with one line of code. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5562–5570, 2017.
- [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229, 2020.
- [Carreira and Zisserman(2017)] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the Kinetics dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6299–6308, 2017.
- [Chang et al.(2022)Chang, Wang, Wang, Li, and Shou] Shuning Chang, Pichao Wang, Fan Wang, Hao Li, and Zheng Shou. Augmented transformer with adaptive graph for temporal action proposal generation. In Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis, pages 41–50, 2022.
- [Cheng and Bertasius(2022)] Feng Cheng and Gedas Bertasius. TallFormer: Temporal action localization with a long-memory transformer. In Proceedings of the European Conference on Computer Vision (ECCV), pages 503–521, 2022.
- [Damen et al.(2022)Damen, Doughty, Farinella, Furnari, Kazakos, Ma, Moltisanti, Munro, Perrett, Price, et al.] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision. International Journal of Computer Vision (IJCV), pages 33–55, 2022.
- [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6202–6211, 2019.
- [Gao et al.(2017)Gao, Yang, Chen, Sun, and Nevatia] Jiyang Gao, Zhenheng Yang, Kan Chen, Chen Sun, and Ram Nevatia. TURN TAP: Temporal unit regression network for temporal action proposals. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3628–3636, 2017.
- [Grauman et al.(2022)Grauman, Westbury, Byrne, Chavis, Furnari, Girdhar, Hamburger, Jiang, Liu, Liu, et al.] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, 2022.
- [Hu et al.(2019)Hu, Nie, and Li] Di Hu, Feiping Nie, and Xuelong Li. Deep multimodal clustering for unsupervised audiovisual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9248–9257, 2019.
- [Huang et al.(2022)Huang, Zhang, Pan, Qing, Tang, Liu, and Ang Jr] Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Mingqian Tang, Ziwei Liu, and Marcelo H Ang Jr. TAda! temporally-adaptive convolutions for video understanding. In Proceedings of International Conference on Learning Representations (ICLR), pages 1–23, 2022.
- [Huh et al.(2023)Huh, Chalk, Kazakos, Damen, and Zisserman] Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, and Andrew Zisserman. EPIC-SOUNDS: A Large-Scale Dataset of Actions that Sound. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
- [Kazakos et al.(2019)Kazakos, Nagrani, Zisserman, and Damen] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. EPIC-Fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5492–5501, 2019.
- [Kazakos et al.(2021a)Kazakos, Huh, Nagrani, Zisserman, and Damen] Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, and Dima Damen. With a little help from my temporal context: Multimodal egocentric action recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2021a.
- [Kazakos et al.(2021b)Kazakos, Nagrani, Zisserman, and Damen] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. Slow-Fast auditory streams for audio recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 855–859. IEEE, 2021b.
- [Korbar et al.(2018)Korbar, Tran, and Torresani] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. Advances in Neural Information Processing Systems (NeurIPS), pages 7774–7785, 2018.
- [Lee et al.(2021)Lee, Jain, Park, and Yun] Jun-Tae Lee, Mihir Jain, Hyoungwoo Park, and Sungrack Yun. Cross-attentional audio-visual fusion for weakly-supervised action localization. In Proceedings of International Conference on Learning Representations (ICLR), 2021.
- [Li et al.(2016)Li, Lan, Xing, Zeng, Yuan, and Liu] Yanghao Li, Cuiling Lan, Junliang Xing, Wenjun Zeng, Chunfeng Yuan, and Jiaying Liu. Online human action detection using joint classification-regression recurrent neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 203–220, 2016.
- [Lin et al.(2020)Lin, Li, Wang, Tai, Luo, Cui, Wang, Li, Huang, and Ji] Chuming Lin, Jian Li, Yabiao Wang, Ying Tai, Donghao Luo, Zhipeng Cui, Chengjie Wang, Jilin Li, Feiyue Huang, and Rongrong Ji. Fast learning of temporal action proposal via dense boundary generator. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 11499–11506, 2020.
- [Lin et al.(2021)Lin, Xu, Luo, Wang, Tai, Wang, Li, Huang, and Fu] Chuming Lin, Chengming Xu, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Learning salient boundary feature for anchor-free temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3320–3329, 2021.
- [Lin et al.(2017a)Lin, Zhao, and Shou] Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot temporal action detection. In Proceedings of the 25th ACM International Conference on Multimedia, pages 988–996, 2017a.
- [Lin et al.(2018)Lin, Zhao, Su, Wang, and Yang] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. BSN: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
- [Lin et al.(2019)Lin, Liu, Li, Ding, and Wen] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. BMN: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3889–3898, 2019.
- [Lin et al.(2017b)Lin, Goyal, Girshick, He, and Dollár] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2980–2988, 2017b.
- [Liu and Wang(2020)] Qinying Liu and Zilei Wang. Progressive boundary refinement network for temporal action detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 11612–11619, 2020.
- [Liu et al.(2022)Liu, Wang, Hu, Tang, Zhang, Bai, and Bai] Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, and Xiang Bai. End-to-end temporal action detection with transformer. IEEE Transactions on Image Processing, pages 5427–5441, 2022.
- [Long et al.(2019)Long, Yao, Qiu, Tian, Luo, and Mei] Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 344–353, 2019.
- [Mo and Tian(2022)] Shentong Mo and Yapeng Tian. Multi-modal grouping network for weakly-supervised audio-visual video parsing. In Advances in Neural Information Processing Systems (NeurIPS), pages 34722–34733, 2022.
- [Nagrani et al.(2020)Nagrani, Sun, Ross, Sukthankar, Schmid, and Zisserman] Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, and Andrew Zisserman. Speech2Action: Cross-modal supervision for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10317–10326, 2020.
- [Nawhal and Mori(2021)] Megha Nawhal and Greg Mori. Activity graph transformer for temporal action localization. arXiv preprint arXiv:2101.08540, 2021.
- [Owens and Efros(2018)] Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018.
- [Ramazanova et al.(2022)Ramazanova, Escorcia, Heilbron, Zhao, and Ghanem] Merey Ramazanova, Victor Escorcia, Fabian Caba Heilbron, Chen Zhao, and Bernard Ghanem. OWL (Observe, Watch, Listen): Localizing actions in egocentric video via audiovisual temporal context. arXiv preprint arXiv:2202.04947, 2022.
- [Rao et al.(2022)Rao, Khalil, Li, Dai, and Lu] Varshanth Rao, Md Ibrahim Khalil, Haoda Li, Peng Dai, and Juwei Lu. Dual perspective network for audio-visual event localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 689–704, 2022.
- [Rezatofighi et al.(2019)Rezatofighi, Tsoi, Gwak, Sadeghian, Reid, and Savarese] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 658–666, 2019.
- [Shi et al.(2022)Shi, Zhong, Cao, Zhang, Ma, Li, and Tao] Dingfeng Shi, Yujie Zhong, Qiong Cao, Jing Zhang, Lin Ma, Jia Li, and Dacheng Tao. ReAct: Temporal action detection with relational queries. In Proceedings of the European Conference on Computer Vision (ECCV), pages 105–121, 2022.
- [Shi et al.(2023)Shi, Zhong, Cao, Ma, Li, and Tao] Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, and Dacheng Tao. TriDet: Temporal action detection with relative boundary modeling. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18857–18866, 2023.
- [Su et al.(2021)Su, Gan, Wu, Yan, and Qiao] Haisheng Su, Weihao Gan, Wei Wu, Junjie Yan, and Yu Qiao. BSN++: Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 2602–2610, 2021.
- [Tian et al.(2018)Tian, Shi, Li, Duan, and Xu] Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 247–263, 2018.
- [Wang et al.(2022)Wang, Mirmehdi, Damen, and Perrett] Hanyuan Wang, Majid Mirmehdi, Dima Damen, and Toby Perrett. Refining action boundaries for one-stage detection. In 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–8. IEEE, 2022.
- [Wang et al.(2023)Wang, Singh, and Torresani] Huiyu Wang, Mitesh Kumar Singh, and Lorenzo Torresani. Ego-Only: Egocentric action detection without exocentric pretraining. arXiv preprint arXiv:2301.01380, 2023.
- [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, Tang, and Van Gool] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2016.
- [Wu and Yang(2021)] Yu Wu and Yi Yang. Exploring heterogeneous clues for weakly-supervised audio-visual video parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1326–1335, 2021.
- [Xia and Zhao(2022)] Yan Xia and Zhou Zhao. Cross-modal background suppression for audio-visual event localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19989–19998, 2022.
- [Xu et al.(2017)Xu, Das, and Saenko] Huijuan Xu, Abir Das, and Kate Saenko. R-C3D: Region convolutional 3d network for temporal activity detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5783–5792, 2017.
- [Yang et al.(2020)Yang, Peng, Zhang, Fu, and Han] Le Yang, Houwen Peng, Dingwen Zhang, Jianlong Fu, and Junwei Han. Revisiting anchor mechanisms for temporal action localization. IEEE Transactions on Image Processing, pages 8535–8548, 2020.
- [Zhang et al.(2022)Zhang, Wu, and Li] Chen-Lin Zhang, Jianxin Wu, and Yin Li. ActionFormer: Localizing moments of actions with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 492–510, 2022.
- [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2914–2923, 2017.