Context-aware Proposal Network for Temporal Action Detection
Abstract
This technical report presents our first-place solution for the temporal action detection task in the CVPR-2022 ActivityNet Challenge. The task aims to localize the temporal boundaries of action instances with specific classes in long untrimmed videos. Recent mainstream approaches rely on dense boundary matching and enumerate all possible start-end combinations to produce proposals. We argue that the generated proposals contain rich contextual information, which may benefit detection confidence prediction. To this end, our method consists of three steps: 1) action classification and feature extraction with Slowfast [10], CSN [20], TimeSformer [4], TSP [1], I3D-flow [7], VGGish-audio [11], TPN [33] and ViViT [3]; 2) proposal generation, where our Context-aware Proposal Network (CPN) builds on BMN [16], G-TAD [32] and PRN [26] and aggregates contextual information by randomly masking some proposal features; 3) action detection, where the final detections are obtained by assigning the proposals the corresponding video-level classification results. Finally, we ensemble the results under different feature combination settings and achieve 45.8% average mAP on the test set, which improves the champion result of the CVPR-2021 ActivityNet Challenge [26] by 1.1%.
1 Introduction
Recently, the emergence of large-scale datasets [5, 34, 17, 6, 7, 8, 12] and deep models [15, 22, 10] has promoted the development of video understanding, which has broad application prospects in fields such as security, surveillance, and autonomous driving. Video understanding includes many sub-directions, such as Action Recognition [22, 10, 13, 30], Action Detection [16, 18, 2, 28, 29, 25, 27, 24, 23], and Spatio-Temporal Action Detection [19, 14]. In this report, we present our competition method for the temporal action detection task in the CVPR-2022 ActivityNet Challenge [5].
For the temporal action detection task, we need to localize the temporal boundaries of action instances (i.e., start time and end time) and classify the target categories in long untrimmed videos. This task is challenging: action instances have wide temporal spans, background and foreground are easily confused, and proposal-level contextual information is limited. Current mainstream approaches [16, 32, 1, 23, 31] usually adopt the “proposal and classification” paradigm, which generates proposals by computing boundary probabilities at each time point, combining start points with end points, and then classifying the proposals. To produce high-quality detection results, the generated proposals should precisely cover instances with high recall and reliable confidence scores. Since proposal-level classification is limited by insufficient instance information, video-level classification has attracted much attention [25, 26]; it takes the entire video as input to obtain the final results. In this report, we follow this paradigm to design our solution for this challenge. Our main observation is that when predicting the confidence map of dense proposals, the proposals can be mutually inferential, i.e., the confidence of a proposal may be inferred from the surrounding proposals. We thus apply a random masking strategy to the proposal features and encourage the model to aggregate contextual associations for precise proposal confidence prediction. Moreover, to further improve performance, we apply several data pre-processing techniques, namely too-long instance removal, short instance resampling, action instance resize, and temporal shift perturbation [28, 15]. Finally, we ensemble several existing methods [16, 32, 18, 26] and achieve 45.8% average mAP on the test set of ActivityNet v1.3 [5], which improves the champion result of the CVPR-2021 ActivityNet Challenge [26] by 1.1%.

2 Feature Extractor
In recent years, a large number of advanced deep learning algorithms have been proposed for action classification. These methods can act as feature extractors for action detection and can also be adopted to generate video-level classification results. In this section, we introduce the deep action classification networks used in our solution.
2.1 Slowfast
The Slowfast network [10] was proposed for action classification and combines a slow branch and a fast branch. The slow branch takes input at a low frame rate and captures spatial semantic information, while the fast branch takes input at a high frame rate and captures motion information. Note that the fast branch is a lightweight network, because its channel width is relatively small. Due to its excellent performance in action recognition and detection, we choose Slowfast as one of our backbone models.
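For intuition, a minimal sketch of the two-rate sampling idea is given below; the clip shape and the `tau`/`alpha` values are illustrative assumptions (following the notation of the Slowfast paper), not our training configuration.

```python
import torch

def sample_two_pathways(frames: torch.Tensor, tau: int = 16, alpha: int = 8):
    """Split a decoded clip into slow/fast pathway inputs.

    frames: (C, T, H, W) video clip.
    tau:    temporal stride of the slow pathway (1 frame every tau frames).
    alpha:  speed ratio; the fast pathway samples alpha times more frames.
    """
    slow = frames[:, ::tau]                    # low frame rate, full channel width
    fast = frames[:, ::max(tau // alpha, 1)]   # high frame rate, lightweight branch
    return slow, fast

# Example: a 64-frame RGB clip.
clip = torch.randn(3, 64, 224, 224)
slow, fast = sample_two_pathways(clip)
print(slow.shape, fast.shape)  # (3, 4, 224, 224) and (3, 32, 224, 224)
```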
Table 1: Action recognition results on the validation set of ActivityNet v1.3 [5].

Model | Pretrain | Top-1 Acc. | Top-5 Acc. |
---|---|---|---|
I3D-flow [7] | K400 | 79.5% | 93.6% |
Slowfast50 [10] | K400 | 85.3% | 95.8% |
Slowfast101 [10] | K400 | 87.1% | 97.4% |
Slowfast152 [10] | K700 | 88.9% | 97.8% |
TPN [33] | K400 | 87.4% | 97.1% |
TSP [1] | ANet | 86.4% | 97.4% |
CSN [20] | K400 | 90.3% | 98.1% |
ViViT-B/16x2 [3] | K700 | 91.2% | 98.0% |
TimeSformer [4] | K600 | 91.1% | 97.3% |
ANet-2020 champion [25] | Ensemble | 91.8% | 98.1% |
ANet-2021 champion [26] | Ensemble | 93.6% | 98.5% |
Ours | Ensemble | 94.6% | 98.7% |
2.2 I3D-flow
Inflated 3D ConvNet (I3D) [7] is based on 2D ConvNet inflation: it expands the filters and pooling kernels of deep image recognition networks into 3D, so that the inflated convolutions cover different spatio-temporal receptive fields and the network becomes suitable for spatio-temporal modeling. In our solution, we apply the I3D network to extract flow features for the ActivityNet v1.3 dataset.
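The inflation step itself can be sketched as follows: a pretrained 2D kernel is repeated along a new temporal axis and rescaled so that, on a temporally constant input, the inflated 3D filter reproduces the original 2D response. The kernel shapes below are illustrative.

```python
import torch

def inflate_conv2d_weight(w2d: torch.Tensor, time_dim: int = 3) -> torch.Tensor:
    """Inflate a 2D conv kernel (out, in, kH, kW) into a 3D kernel
    (out, in, kT, kH, kW) by repeating it over time and dividing by kT."""
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return w3d / time_dim

# Example: inflate a 7x7 image-pretrained kernel into a 3x7x7 kernel.
w2d = torch.randn(64, 3, 7, 7)
print(inflate_conv2d_weight(w2d).shape)  # torch.Size([64, 3, 3, 7, 7])
```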
2.3 CSN
Channel-Separated Convolutional Network (CSN) [20] aims to reduce the parameters of 3D convolution while extracting useful information by identifying important channels. It learns feature representations efficiently through grouped convolution and channel interaction, reaching a good balance between effectiveness and efficiency.
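As a rough illustration (not the full CSN architecture), a channel-separated block factorizes a standard 3D convolution into a pointwise convolution for channel interaction and a depthwise spatio-temporal convolution:

```python
import torch
import torch.nn as nn

class ChannelSeparatedConv3d(nn.Module):
    """ir-CSN-style block: pointwise conv for channel mixing followed by a
    depthwise 3x3x3 spatio-temporal conv."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
        self.depthwise = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1,
                                   stride=stride, groups=out_ch, bias=False)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))

x = torch.randn(2, 64, 8, 56, 56)                 # (batch, channels, T, H, W)
print(ChannelSeparatedConv3d(64, 128)(x).shape)   # (2, 128, 8, 56, 56)
```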
2.4 TimeSformer
TimeSformer [4] adapts the standard Transformer architecture to video by enabling spatio-temporal feature learning directly from a sequence of frame-level patches. In addition, TimeSformer shows that applying temporal attention and spatial attention separately within each block leads to the best video classification accuracy.
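A minimal sketch of divided space-time attention is given below; residual connections, normalisation and the class token are omitted, and the token layout is an illustrative assumption.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Attend over time for each spatial patch, then over space for each frame."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, D) patch tokens for T frames and N spatial patches.
        B, T, N, D = x.shape
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)   # group by spatial position
        t, _ = self.temporal(t, t, t)
        x = t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        s = x.reshape(B * T, N, D)                       # group by frame
        s, _ = self.spatial(s, s, s)
        return s.reshape(B, T, N, D)

out = DividedSpaceTimeAttention(128)(torch.randn(2, 8, 196, 128))
print(out.shape)  # (2, 8, 196, 128)
```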
2.5 TPN
Temporal Pyramid Network (TPN) [33] is a feature pyramid architecture that captures the visual tempos of action instances. TPN can be applied to existing 2D/3D architectures in a plug-and-play manner and brings consistent improvements. Considering its excellent spatio-temporal modeling ability, we also use it to extract spatio-temporal features.
2.6 ViViT
Since transformers [21, 9, 29] have shown powerful abilities on various vision tasks, we apply ViViT [3] as one of our backbones. ViViT is a pure Transformer-based model for action recognition. It extracts spatio-temporal tokens from the input video, which are then encoded by a series of Transformer layers. To handle the long token sequences encountered in videos, several efficient variants of ViViT factorise the spatial and temporal dimensions of the input. We apply the ViViT-B/16x2 version with a factorised encoder, initialize it from an ImageNet-pretrained ViT [9], and then pretrain it on the Kinetics-700 dataset [6].
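The “B/16x2” naming refers to tubelet embedding, where each token covers a 16x16 spatial patch spanning 2 frames; a simplified sketch using a strided 3D convolution is shown below (the clip size is an assumption for illustration).

```python
import torch
import torch.nn as nn

# Tubelet embedding for ViViT-B/16x2: each token covers 2 frames x 16x16 pixels.
embed = nn.Conv3d(in_channels=3, out_channels=768,
                  kernel_size=(2, 16, 16), stride=(2, 16, 16))

clip = torch.randn(1, 3, 32, 224, 224)        # (B, C, T, H, W)
tokens = embed(clip)                          # (1, 768, 16, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)    # (1, 16*14*14, 768) token sequence
print(tokens.shape)
```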
2.7 Classification results
In addition to the models mentioned above, we also utilize TSP features [1] and VGGish-audio features [11]. Table 1 shows the action recognition results of the above methods on the validation set of ActivityNet v1.3 [5]. From the results, we draw the following conclusions: 1) with Kinetics-400 pretraining, the CSN model outperforms Slowfast101 by 3.2% on the ActivityNet dataset; 2) Transformer-based models indeed obtain better performance than CNN-based models; for instance, ViViT and TimeSformer achieve 91.2% and 91.1% Top-1 accuracy, respectively; 3) the flow feature alone does not perform as well as the RGB spatio-temporal features. We then ensemble all the models and achieve a 1.0% gain over the ActivityNet-2021 champion result.
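Our video-level ensemble can be sketched as a weighted average of per-model class probabilities; the weights, class count and shapes below are placeholders rather than the exact challenge configuration.

```python
import torch

def ensemble_video_scores(score_list, weights=None):
    """Fuse per-model class probabilities of shape (num_videos, num_classes)."""
    scores = torch.stack(score_list)                      # (M, V, C)
    if weights is None:
        weights = torch.ones(len(score_list))
    weights = weights / weights.sum()
    return (weights.view(-1, 1, 1) * scores).sum(dim=0)   # (V, C)

# Example with three hypothetical models over the 200 ActivityNet classes.
probs = [torch.softmax(torch.randn(4, 200), dim=1) for _ in range(3)]
fused = ensemble_video_scores(probs, torch.tensor([1.0, 1.0, 2.0]))
print(fused.shape)  # torch.Size([4, 200])
```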
Table 2: Proposal evaluation results on the validation set of ActivityNet v1.3, measured by AR@100 and AUC.

Model | Feature | AR@100 | AUC |
---|---|---|---|
BMN [16] | Slowfast101 | 75.8% | 68.6% |
PRN [26] | Slowfast101 | 76.5% | 69.3% |
CPN | Slowfast101 | 76.9% | 69.5% |

Table 3: Detection results on ActivityNet v1.3, reported as mAP@0.5 and average mAP on the validation set (test-set average mAP in parentheses where available).

Model | Feature | mAP@0.5 | Average mAP |
---|---|---|---|
BMN [16] | Slowfast101 | 56.3% | 37.7% |
PRN [26] | Slowfast101 | 57.2% | 38.8% |
CPN | Slowfast101 | 57.8% | 39.0% |
BMN [16] | Slowfast152 | 55.5% | 36.8% |
PRN [26] | Slowfast152 | 56.5% | 38.0% |
CPN | Slowfast152 | 56.6% | 38.8% |
BMN [16] | CSN | 56.9% | 38.1% |
PRN [26] | CSN | 57.9% | 39.4% |
CPN | CSN | 58.6% | 39.5% |
BMN [16] | ViViT | 55.1% | 36.7% |
PRN [26] | ViViT | 55.5% | 37.5% |
CPN | ViViT | 56.3% | 38.1% |
PRN [26] | Ensemble | 59.7% | 42.0% (test: 44.7%) |
Ours | Ensemble | 60.8% | 43.3% (test: 45.8%) |
3 Context-aware Proposal Network
In this section, we introduce our proposed Context-aware Proposal Network (CPN). As shown in Figure 1, CPN mainly contains two key components: data pre-processing strategies and proposal feature random masking. We introduce each part in detail below and then report the detection performance.
3.1 Data pre-processing strategies
In our solution, we mainly use four data pre-processing tricks: too-long instance removal, short instance resampling, action instance resize, and temporal shift perturbation.
Too-long instance removal means that we delete training videos in which action instances cover too large a fraction of the video (e.g., 98%). The intuition is that such training data lack negative samples (i.e., proposals with low IoU against the ground truth) when generating confidence maps, which may harm the training process.
Short instance resampling means that training videos containing short instances are sampled repeatedly, because recall and localization precision for short instances are low, and we hope to alleviate this problem by resampling.
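A minimal sketch of these two dataset-level steps (too-long instance removal and short instance resampling), assuming a hypothetical annotation format and thresholds:

```python
def filter_and_resample(annos, max_coverage=0.98, short_len=2.0, repeat=2):
    """annos: {video_id: {"duration": float, "segments": [(start, end), ...]}}.
    Drop videos almost fully covered by actions and repeat videos that
    contain short instances (thresholds are illustrative)."""
    train_list = []
    for vid, a in annos.items():
        covered = sum(e - s for s, e in a["segments"]) / a["duration"]
        if covered > max_coverage:                           # too-long instance removal
            continue
        has_short = any(e - s < short_len for s, e in a["segments"])
        train_list += [vid] * (repeat if has_short else 1)   # short instance resampling
    return train_list
```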
Action instance resize extracts action instances according to the ground-truth annotations and rescales them along the temporal dimension, which simulates changes in the speed of action instances.
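A possible implementation of instance resizing on snippet-level features is sketched below; the feature layout, the scale range, and the omission of the corresponding label adjustment are simplifying assumptions.

```python
import random
import torch
import torch.nn.functional as F

def resize_instance(features: torch.Tensor, start: int, end: int,
                    scale_range=(0.5, 1.5)) -> torch.Tensor:
    """features: (T, C) snippet features; [start, end) indexes a ground-truth
    instance. Linearly rescale the instance along time to simulate speed
    changes (the annotation must be rescaled accordingly, omitted here)."""
    scale = random.uniform(*scale_range)
    inst = features[start:end].t().unsqueeze(0)              # (1, C, L)
    new_len = max(1, int(round((end - start) * scale)))
    inst = F.interpolate(inst, size=new_len, mode="linear", align_corners=False)
    inst = inst.squeeze(0).t()                               # (new_len, C)
    return torch.cat([features[:start], inst, features[end:]], dim=0)
```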
The temporal shift operation was first applied for action recognition in TSM [15] and was later used as a perturbation in SSTAP [28] for semi-supervised learning. Here we reuse this perturbation as a feature augmentation: part of the feature channels is shifted one step forward in time and another part one step backward along the temporal dimension. This module improves the robustness of the models.
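A minimal sketch of this perturbation on snippet-level features, using the 1/8 channel fraction from TSM as an assumed default:

```python
import torch

def temporal_shift(feat: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """feat: (B, C, T) snippet-level feature sequence.
    Shift the first C/fold_div channels backward in time, the next C/fold_div
    forward, and keep the remaining channels unchanged."""
    B, C, T = feat.shape
    fold = C // fold_div
    out = torch.zeros_like(feat)
    out[:, :fold, :-1] = feat[:, :fold, 1:]                  # shift backward in time
    out[:, fold:2 * fold, 1:] = feat[:, fold:2 * fold, :-1]  # shift forward in time
    out[:, 2 * fold:] = feat[:, 2 * fold:]                   # untouched channels
    return out

print(temporal_shift(torch.randn(2, 256, 100)).shape)  # (2, 256, 100)
```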
3.2 Proposal feature random masking
Recall that temporal action detection aims to accurately locate the boundaries of the target actions. To capture contextual associations among proposals, we randomly mask some proposal features: a simple dropout3d operation is applied to the sampled dense proposal feature maps, so that the confidence of a masked proposal has to be inferred from its surrounding proposals.
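A minimal sketch of the masking step on a BMN-style boundary-matching feature map is given below; the tensor layout, dropout rate and head design are simplified assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn

class MaskedProposalHead(nn.Module):
    """Random masking on a dense proposal feature map of shape (B, C, N, D, T):
    N sampled points per proposal, D proposal durations, T start positions."""
    def __init__(self, in_ch: int = 128, p: float = 0.2):
        super().__init__()
        self.mask = nn.Dropout3d(p)   # zeroes whole feature channels during training
        self.reduce = nn.Conv3d(in_ch, 512, kernel_size=(32, 1, 1))  # collapse N
        self.confidence = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 2, kernel_size=1),   # simplified two-channel confidence map
            nn.Sigmoid(),
        )

    def forward(self, bm_feature):
        x = self.mask(bm_feature)     # randomly mask proposal features
        x = self.reduce(x).squeeze(2) # (B, 512, D, T)
        return self.confidence(x)     # (B, 2, D, T)

bm = torch.randn(2, 128, 32, 100, 100)
print(MaskedProposalHead()(bm).shape)  # (2, 2, 100, 100)
```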
To evaluate proposal quality, we compute the Average Recall (AR) under different Average Numbers of proposals (AN), denoted AR@AN (e.g., AR@100), and the Area under the AR vs. AN Curve (AUC). Table 2 presents the results of BMN, PRN and CPN on the validation set of ActivityNet v1.3, which show that CPN outperforms BMN significantly; in particular, our method improves AUC from 68.6% to 69.5%, a gain of 0.9%. In addition, CPN also brings a modest improvement over PRN.
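For reference, a simplified single-video sketch of these metrics is given below; the official ActivityNet evaluation additionally normalises AN over the whole validation set and averages across videos, which is omitted here.

```python
import numpy as np

def tiou(props, gts):
    """Temporal IoU between proposals (M, 2) and ground truths (G, 2)."""
    inter = (np.minimum(props[:, None, 1], gts[None, :, 1])
             - np.maximum(props[:, None, 0], gts[None, :, 0])).clip(min=0)
    union = ((props[:, 1] - props[:, 0])[:, None]
             + (gts[:, 1] - gts[:, 0])[None, :] - inter)
    return inter / union

def average_recall(props, gts, an=100, thresholds=np.arange(0.5, 1.0, 0.05)):
    """AR@AN for one video: mean over tIoU thresholds of the recall achieved by
    the top-`an` proposals (proposals assumed sorted by confidence)."""
    iou = tiou(props[:an], gts)
    return np.mean([(iou.max(axis=0) >= t).mean() for t in thresholds])

def auc(props, gts, max_an=100):
    """Normalised area under the AR-vs-AN curve for one video."""
    ar = [average_recall(props, gts, an) for an in range(1, max_an + 1)]
    return np.trapz(ar, dx=1.0) / (max_an - 1)
```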
3.3 Detection results
We follow the “proposal + classification” pipeline to generate the final detection results. Mean Average Precision (mAP) is adopted as the evaluation metric for temporal action detection, and this challenge reports the average mAP over IoU thresholds from 0.5 to 0.95 with a step of 0.05.
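In this pipeline, a detection is formed by attaching video-level class scores to each proposal and multiplying them with the proposal confidence; a minimal sketch is given below (keeping the top-2 classes per video is an assumption, not necessarily the challenge setting).

```python
import torch

def fuse_detections(proposals, video_scores, class_names, topk=2):
    """proposals: list of (start, end, confidence) for one video.
    video_scores: (num_classes,) video-level classification probabilities.
    Returns detections as (start, end, label, score), sorted by score."""
    top_scores, top_ids = torch.topk(video_scores, k=topk)
    detections = []
    for start, end, conf in proposals:
        for cls_score, cls_id in zip(top_scores.tolist(), top_ids.tolist()):
            detections.append((start, end, class_names[cls_id], conf * cls_score))
    return sorted(detections, key=lambda d: d[-1], reverse=True)
```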
To demonstrate the effectiveness of CPN, we conduct experiments with different features, as shown in Table 3. The results show that the proposed CPN gains 1.5% at IoU 0.5 and 1.3% in average mAP over BMN when the Slowfast101 feature is adopted. We then ensemble all the results and reach 43.3% average mAP on the validation set and 45.8% on the test set. The ensemble strategies mainly involve multi-scale fusion and feature combination. We also use boundary refinement methods [18, 25] to predict boundaries more accurately.
Moreover, we find that the Transformer-based ViViT shows very strong performance on the classification task but is unsatisfactory on the detection task compared with the CNN models. The reason may be that the Transformer tends to capture global information through self-attention and thus loses local information, which is also important for detection. Meanwhile, models that perform well on the classification task do not necessarily achieve better performance on the detection task: Slowfast152 exceeds Slowfast101 by 1.8% in Top-1 classification accuracy, but suffers a 1.2% drop at IoU 0.5 for detection with our CPN.
4 Conclusion
In this report, we present our solution for the temporal action detection task in the CVPR-2022 ActivityNet Challenge. For this task, we propose CPN to leverage the rich contextual information among proposals, and we apply several data pre-processing strategies to improve robustness. Experimental results show that CPN outperforms the baseline methods significantly. By fusing the detection results from different backbones, we obtain 45.8% average mAP on the test set, a gain of 1.1% over the champion method of the CVPR-2021 ActivityNet Challenge.
5 Acknowledgment
This work is supported by the National Natural Science Foundation of China under Grant 61871435, the Fundamental Research Funds for the Central Universities under Grant 2019kfyXKJC024, and the 111 Project on Computational Intelligence and Intelligent Control under Grant B18024.
References
- [1] Humam Alwassel, Silvio Giancola, and Bernard Ghanem. Tsp: Temporally-sensitive pretraining of video encoders for localization tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021.
- [2] Humam Alwassel, Fabian Caba Heilbron, and Bernard Ghanem. Action search: Spotting actions in videos and its application to temporal action localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 251–266, 2018.
- [3] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691, 2021.
- [4] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In International Conference on Machine Learning, pages 813–824. PMLR, 2021.
- [5] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
- [6] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019.
- [7] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- [8] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018.
- [9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [10] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
- [11] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017.
- [12] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
- [13] Ziyuan Huang, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Rong Jin, and Marcelo Ang. Self-supervised motion learning from static images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [14] Jianwen Jiang, Yu Cao, Lin Song, Shiwei Zhang, Yunkai Li, Ziyao Xu, Qian Wu, Chuang Gan, Chi Zhang, and Gang Yu. Human centric spatio-temporal action localization. In ActivityNet Workshop on CVPR, 2018.
- [15] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7083–7093, 2019.
- [16] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. Bmn: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3889–3898, 2019.
- [17] Xiaolong Liu, Yao Hu, Song Bai, Fei Ding, Xiang Bai, and Philip HS Torr. Multi-shot temporal event localization: a benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12596–12606, 2021.
- [18] Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, and Nong Sang. Temporal context aggregation network for temporal action proposal refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [19] Lin Song, Shiwei Zhang, Gang Yu, and Hongbin Sun. Tacnet: Transition-aware context network for spatio-temporal action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11987–11995, 2019.
- [20] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5552–5561, 2019.
- [21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
- [22] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11):2740–2755, 2018.
- [23] Qiang Wang, Yanhao Zhang, Yun Zheng, and Pan Pan. Rcl: Recurrent continuous localization for temporal action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13566–13575, 2022.
- [24] Xiang Wang, Changxin Gao, Shiwei Zhang, and Nong Sang. Multi-level temporal pyramid network for action detection. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 41–54. Springer, 2020.
- [25] Xiang Wang, Baiteng Ma, Zhiwu Qing, Yongpeng Sang, Changxin Gao, Shiwei Zhang, and Nong Sang. Cbr-net: Cascade boundary refinement network for action detection: Submission to activitynet challenge 2020 (task 1). arXiv preprint arXiv:2006.07526, 2020.
- [26] Xiang Wang, Zhiwu Qing, Ziyuan Huang, Yutong Feng, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Changxin Gao, and Nong Sang. Proposal relation network for temporal action detection. arXiv preprint arXiv:2106.11812, 2021.
- [27] Xiang Wang, Zhiwu Qing, Ziyuan Huang, Yutong Feng, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Yuanjie Shao, and Nong Sang. Weakly-supervised temporal action localization through local-global background modeling. arXiv preprint arXiv:2106.11811, 2021.
- [28] Xiang Wang, Shiwei Zhang, Zhiwu Qing, Yuanjie Shao, Changxin Gao, and Nong Sang. Self-supervised learning for semi-supervised temporal action proposal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [29] Xiang Wang, Shiwei Zhang, Zhiwu Qing, Yuanjie Shao, Zhengrong Zuo, Changxin Gao, and Nong Sang. Oadtr: Online action detection with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7565–7575, 2021.
- [30] Xiang Wang, Shiwei Zhang, Zhiwu Qing, Mingqian Tang, Zhengrong Zuo, Changxin Gao, Rong Jin, and Nong Sang. Hybrid relation guided set matching for few-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19948–19957, 2022.
- [31] Mengmeng Xu, Juan Manuel Perez Rua, Xiatian Zhu, Bernard Ghanem, and Brais Martinez. Low-fidelity video encoder optimization for temporal action localization. Advances in Neural Information Processing Systems, 34, 2021.
- [32] Mengmeng Xu, Chen Zhao, David S Rojas, Ali Thabet, and Bernard Ghanem. G-tad: Sub-graph localization for temporal action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10156–10165, 2020.
- [33] Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, and Bolei Zhou. Temporal pyramid network for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 591–600, 2020.
- [34] Hang Zhao, Antonio Torralba, Lorenzo Torresani, and Zhicheng Yan. Hacs: Human action clips and segments dataset for recognition and temporal localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8668–8678, 2019.