TinyAction Challenge: Recognizing Real-world Low-resolution Activities in Videos
Abstract
This paper summarizes the TinyAction challenge (https://www.crcv.ucf.edu/tiny-actions-challenge-cvpr2021), which was organized in the ActivityNet workshop at CVPR 2021. The challenge focuses on recognizing real-world low-resolution activities in videos. Action recognition research currently centers on classifying actions in high-quality videos where the actors and actions are clearly visible. While various approaches have proven effective for this task in recent works, they often do not handle lower-resolution videos where the action occupies only a tiny region. However, many real-world security videos capture the actual action at a small resolution, making action recognition in a tiny region a challenging task. In this work, we propose a benchmark dataset, TinyVIRAT-v2 (https://www.crcv.ucf.edu/tiny-actions-challenge-cvpr2021/data/TinyVIRAT-v2.zip), which is comprised of naturally occurring low-resolution actions. It is an extension of the TinyVIRAT dataset [7] and consists of actions with multiple labels. The videos are extracted from security footage, which makes them realistic and more challenging. We benchmark current state-of-the-art action recognition methods on the dataset and propose the TinyAction Challenge.
1 Introduction
In recent years, action recognition from videos has become widely applied in security analysis and automation tasks. The availability of large-scale datasets and the progress of neural networks have brought significant improvements to the video action recognition task. Datasets with multiple actors and actions such as UCF-101 [21], Kinetics [20, 13], AVA [8], YouTube-8M [1] and Moments-in-Time [15] provide a large, versatile pool of data for training neural networks. This has enabled several state-of-the-art architectures such as C3D [22], I3D [3], ResNet-3D [9] and R(2+1)D [23], which are effective at recognizing the correct actions. While the development of such architectures and larger datasets has improved action recognition in videos, it ignores a large portion of real-life videos where the actions occur at a distance and at a small resolution. Existing research in action recognition is mostly focused on high-quality videos where the action is distinctly visible. Recognizing low-resolution actions is a challenging problem since the available architectures are not designed to handle low-resolution regions that carry little information. Due to the lack of appropriate architectures and datasets that focus on such low-resolution actions, performance is still far from satisfactory when the action is not distinctly visible.

Table 1: Comparison of TinyVIRAT-v2 with existing action recognition datasets (ANF: average number of frames per video; ML: multi-label; NC: number of classes; NV: number of videos).

Dataset | Resolution | ANF | ML | NC | NV | Train | Val | Test |
---|---|---|---|---|---|---|---|---|
UCF-101 [21] | 320x240 | 186.50 | No | 101 | 13320 | 9537 | - | 3783 |
HMDB-51 [14] | 320x240 | 94.49 | No | 51 | 7000 | 3570 | 1530 | - |
AVA [8] | 264x440 - 360x640 | 127081.66 | Yes | 80 | 385,446(272) | 210,634 | 57,371 | 117,441 |
TinyVIRAT [7] | 10x10 - 128x128 | 93.93 | Yes | 26 | 12829 | 7663 | - | 5166 |
TinyVIRAT-v2 | 10x10 - 128x128 | 76.14 | Yes | 26 | 26355 | 16950 | 3308 | 6097 |
The ActivityNet challenge has featured a wide range of tasks relevant to action recognition, ranging from temporal activity recognition to spatio-temporal action detection. However, in all the tasks seen so far, the focus has rarely been on low-resolution activities. In real-world security environments, the actions in videos are captured at a wide range of resolutions, and most activities occur at a distance and at a small resolution. In contrast, the most widely used datasets contain high-resolution videos where the occurring activities cover most of the frame.
In this work, the focus is on recognizing tiny actions in low-resolution videos. Existing approaches addressing this issue perform their experiments on artificially created datasets in which high-resolution videos are down-scaled to produce low-resolution samples. However, re-scaling a high-resolution video to a lower resolution does not reflect real-world low-resolution video quality. Real-world low-resolution videos suffer from grain, camera sensor noise, and other factors which are not present in down-scaled videos.
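To make this distinction concrete, the following minimal sketch (assuming OpenCV and NumPy; the noise level and JPEG quality are arbitrary illustrative values, not parameters used in building the dataset) contrasts plain bicubic down-scaling with a rough simulation of the sensor grain and compression artifacts found in real security footage:

```python
import cv2
import numpy as np

def naive_downscale(frame, size=(64, 64)):
    """Artificial low-resolution sample: plain bicubic down-scaling."""
    return cv2.resize(frame, size, interpolation=cv2.INTER_CUBIC)

def simulate_sensor_degradation(frame, size=(64, 64), noise_std=8.0, jpeg_quality=40):
    """Rough, illustrative approximation of real security-camera footage:
    down-scale, add Gaussian sensor noise, then re-encode with JPEG
    compression to mimic grain and blocking artifacts."""
    small = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
    noisy = small.astype(np.float32) + np.random.normal(0.0, noise_std, small.shape)
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)
    _, buf = cv2.imencode(".jpg", noisy, [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```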
We address this problem with a two-pronged approach. First, we provide the TinyVIRAT-v2 dataset, a benchmark for activity recognition that contains natural low-resolution activities. Second, we host the TinyAction Challenge to create a competitive opportunity for the research community to focus on the low-resolution action recognition task and to develop architectures that tackle its specific challenges.
The TinyVIRAT-v2 dataset is built upon the existing TinyVIRAT dataset [7] and consists of realistic low-resolution action videos extracted from the security videos of the VIRAT [16] and MEVA [5] datasets. It is a multi-label dataset with multiple actions per video clip, which makes it even more challenging. In addition to the outdoor scenes of TinyVIRAT, it also includes indoor scenes, making the problem more challenging and realistic.
2 TinyAction Challenge
This challenge is the first of its kind for the low-resolution action recognition task. Our goal is to generate interest in the research community for this task, which is often overlooked in other large-scale action datasets. As modern security and analysis videos often contain multiple actions occurring in a low-resolution region far away from the camera, it is essential to bridge the gap between strong action recognition architectures and real-world videos with low-resolution actions.
2.1 TinyVIRAT Dataset
Most existing action recognition datasets contain high-resolution, actor-centric videos [25, 12, 13, 19, 2, 18, 8, 11, 17, 6]. For example, Kinetics [13], Charades [19] and YouTube-8M [2] are collected from YouTube videos where actions cover most of the image region in every frame. Using these videos to create low-resolution benchmark datasets does not reflect real-world conditions and is not appropriate, as they generally contain large actors. In the real world, we encounter low-quality actions mostly in security clips where the camera is placed far from the scene. Even though a security camera is capable of recording high-quality video, an action happening far away from the camera will suffer from a lack of detail. Thus, security videos are the perfect candidate for this problem. The VIRAT dataset has naturally occurring tiny actors, which makes it well suited for the low-resolution action recognition task.
We introduce the TinyVIRAT dataset, which is based on the VIRAT dataset [16], for the real-life tiny action recognition problem. The VIRAT dataset is a natural candidate for low-resolution actions, but it contains a large variety of actor sizes and is very complex, since actions can happen at any time and in any spatial position. To focus only on the low-resolution action recognition problem, we crop small action clips from VIRAT videos. In the VIRAT dataset, actors can perform multiple actions, and these actions can start and end at different times. Before deciding which actions are tiny, we merge spatio-temporally overlapping actions and create multi-label action clips. We split these clips whenever the label set changes over time; this step ensures that the resulting clips are trimmed. We select clips that are spatially smaller than 128x128. Finally, long videos are split into smaller chunks, and actions which do not have enough samples are removed from the dataset. TinyVIRAT has 7,663 training and 5,166 testing videos with 26 action labels. Table 1 shows statistics from TinyVIRAT and several other datasets.
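A simplified Python sketch of this filtering and chunking logic is given below; the ActionTube structure, the chunk length, and the minimum-samples threshold are hypothetical placeholders for illustration and do not reflect the actual VIRAT/MEVA annotation format or the exact values used to build TinyVIRAT (only the 128x128 size limit comes from the procedure above):

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical minimal representation of a merged, spatio-temporally
# overlapping action region; field names are illustrative only.
@dataclass
class ActionTube:
    labels: set      # action classes active in this clip
    t_start: int     # start frame
    t_end: int       # end frame
    width: int       # spatial width of the cropped region
    height: int      # spatial height of the cropped region

MAX_SIZE = 128               # keep only regions smaller than 128x128
CHUNK_LEN = 300              # assumed chunk length for splitting long clips
MIN_SAMPLES_PER_CLASS = 50   # assumed threshold for dropping rare classes

def is_tiny(tube: ActionTube) -> bool:
    """A clip is kept only if its cropped region is spatially tiny."""
    return tube.width < MAX_SIZE and tube.height < MAX_SIZE

def split_into_chunks(tube: ActionTube, chunk_len: int = CHUNK_LEN):
    """Long multi-label clips are cut into smaller fixed-length chunks."""
    for start in range(tube.t_start, tube.t_end, chunk_len):
        end = min(start + chunk_len, tube.t_end)
        yield ActionTube(tube.labels, start, end, tube.width, tube.height)

def drop_rare_classes(clips, min_samples=MIN_SAMPLES_PER_CLASS):
    """Remove action labels that do not have enough samples in the dataset."""
    counts = Counter(label for clip in clips for label in clip.labels)
    kept = []
    for clip in clips:
        labels = {l for l in clip.labels if counts[l] >= min_samples}
        if labels:
            kept.append(ActionTube(labels, clip.t_start, clip.t_end,
                                    clip.width, clip.height))
    return kept
```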
2.2 TinyVIRAT-v2 Dataset
TinyVIRAT-v2 is an extension of the TinyVIRAT dataset in which we use the MEVA dataset [5] to extract tiny actions. Much like TinyVIRAT, TinyVIRAT-v2 is based on security videos, and we apply the same strategy used for VIRAT to extract tiny actions from MEVA. While TinyVIRAT was restricted to outdoor videos, TinyVIRAT-v2 also contains indoor scenes, which makes the problem more challenging. This adds a new sub-domain to the data, so models trained on TinyVIRAT-v2 can generalize better and can be trusted to work on both indoor and outdoor data for the covered actions.
TinyVIRAT-v2 has 16,950 videos in the train split, 3,308 in the validation split, and 6,097 in the test split. Table 1 shows statistics for TinyVIRAT-v2 and several other datasets. Figure 2 shows the number of samples per action label, and Figure 3 shows the resolution-wise sample distribution; samples are reported as a percentage per class.
3 Results
3.1 Evaluation metrics
Since TinyVIRAT-v2 has multiple labels per sample, submissions have to predict multiple action classes for each sample. Contestants choose a prediction threshold of their choice and submit only the occurring activities for each sample as a multi-hot vector. Submissions are evaluated using precision, recall, and F1-score, and the challenge winners are determined by the F1-score averaged over classes.
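A minimal sketch of this evaluation follows, assuming the ground truth and thresholded predictions are NumPy arrays of shape (num_samples, num_classes); the thresholding itself is left to each contestant:

```python
import numpy as np

def per_class_scores(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-8):
    """Per-class precision, recall and F1 from multi-hot ground truth and
    multi-hot predictions, both of shape (num_samples, num_classes)."""
    tp = (y_true * y_pred).sum(axis=0)
    fp = ((1 - y_true) * y_pred).sum(axis=0)
    fn = (y_true * (1 - y_pred)).sum(axis=0)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

def challenge_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Ranking metric: F1-score averaged over all classes."""
    _, _, f1 = per_class_scores(y_true, y_pred)
    return float(f1.mean())
```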
3.2 Baseline scores
We apply recent state-of-the-art video action recognition architectures to the TinyVIRAT-v2 dataset and evaluate them based on their precision, recall, and F1-scores. We use the base versions of I3D [3], ResNet-3D [9], R(2+1)D [23] and WideResNet-3D [26] and modify them to take low-resolution input videos by removing certain pooling layers, which maintains the size of the output feature maps for the final classification task. The evaluation results are shown in Table 2. We observe that the R(2+1)D architecture gives the best overall F1-score of 0.32.
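As a rough illustration of this kind of modification (not the exact change applied to the baselines, which remove pooling layers; here the spatial stride of the stem is removed instead, which similarly preserves feature-map size), one could adapt a torchvision 3D ResNet-18 for tiny multi-label clips as follows; the input size and all hyper-parameters are assumed:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

def build_low_res_r3d(num_classes: int = 26) -> nn.Module:
    """Sketch: adapt a 3D ResNet-18 so that tiny inputs are not
    down-sampled too aggressively before the classification head."""
    model = r3d_18()
    # The default stem convolution uses spatial stride 2; replacing it with
    # stride 1 keeps the feature maps of already-tiny inputs larger.
    stem_conv = model.stem[0]
    model.stem[0] = nn.Conv3d(
        3, 64,
        kernel_size=stem_conv.kernel_size,
        stride=(1, 1, 1),
        padding=stem_conv.padding,
        bias=False,
    )
    # Multi-label head: one logit per action class, to be trained with a
    # binary cross-entropy loss.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

if __name__ == "__main__":
    net = build_low_res_r3d()
    clip = torch.randn(2, 3, 16, 64, 64)  # (batch, channels, frames, height, width)
    print(net(clip).shape)                # torch.Size([2, 26])
```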
We present the per-class performance of the two best baseline models, I3D and R(2+1)D, in Figure 4. In Figure 5, we compare the performance of these models with the average resolution of the training samples for each class. Finally, in Figure 6, we compare the per-class performance with the total number of training samples for each class.
Table 2: Baseline results on TinyVIRAT-v2.

Method | F1-Score | Precision | Recall |
---|---|---|---|
I3D [3] | 0.31 | 0.36 | 0.32 |
ResNet-3D [9] | 0.25 | 0.24 | 0.36 |
R(2+1)D [23] | 0.32 | 0.34 | 0.37 |
WideResNet-3D [26] | 0.29 | 0.29 | 0.33 |
3.3 Challenge winner scores
The submissions from each team were evaluated using the same metrics on an evaluation server. At the end of the evaluation, the teams were ranked based on the overall F1-score. The top three team scores are shown in Table 3.
4 Conclusion
We introduce an improved low-resolution tiny action recognition benchmark dataset, TinyVIRAT-v2, consisting of natural low-resolution videos. We also organize a challenge focused on tiny action recognition, the TinyAction Challenge, which allows researchers and enthusiasts to develop novel architectures aimed at improving action recognition in natural low-resolution videos. We show that existing state-of-the-art methods do not perform well in the tiny-action setting, as they are trained only on datasets with larger action-to-frame ratios. Since those datasets exclude real-life security videos with naturally occurring low-resolution actions, TinyVIRAT-v2 provides a unique opportunity for the research community to improve methods for tiny actions. The top three performers of the challenge were able to significantly improve the classification scores across different metrics (F1-score, precision, recall). This challenge demonstrates the need for dedicated methods to improve tiny action recognition.
5 Acknowledgement
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
References
- [1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
- [2] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark, 2016.
- [3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- [4] Liu Cen, Yunbo Peng, and Yue Lin. Along. 2021. https://www.crcv.ucf.edu/tiny-actions-challenge-cvpr2021/submissions/ALONG.pdf.
- [5] Kellie Corona, Katie Osterdahl, Roderic Collins, and Anthony Hoogs. Meva: A large-scale multiview, multimodal video dataset for activity detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1060–1068, January 2021.
- [6] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset, 2018.
- [7] Ugur Demir, Yogesh S Rawat, and Mubarak Shah. Tinyvirat: low-resolution video action recognition. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 7387–7394. IEEE, 2021.
- [8] Chunhui Gu, Chen Sun, Sudheendra Vijayanarasimhan, Caroline Pantofaru, David A Ross, George Toderici, Yeqing Li, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. arXiv preprint arXiv:1705.08421, 2017.
- [9] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018.
- [10] Jianye He, Zhiguang Zhang, Zhenyu Xu, and Zhipeng Luo. Delving into high quality action recognition for low resolution videos. 2021. https://www.crcv.ucf.edu/tiny-actions-challenge-cvpr2021/submissions/DeepBlueAI_Report.pdf.
- [11] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015.
- [12] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
- [13] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- [14] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In 2011 International conference on computer vision, pages 2556–2563. IEEE, 2011.
- [15] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfruend, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–8, 2019.
- [16] Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, JK Aggarwal, Hyungtae Lee, Larry Davis, et al. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR 2011, pages 3153–3160. IEEE, 2011.
- [17] O.V. Ramana Murthy and Roland Goecke. Ordered trajectories for large scale human action recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, December 2013.
- [18] Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. A database for fine grained activity detection of cooking activities. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1194–1201, 2012.
- [19] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding, 2016.
- [20] Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, and Andrew Zisserman. A short note on the kinetics-700-2020 human action dataset. arXiv preprint arXiv:2010.10864, 2020.
- [21] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- [22] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4489–4497. IEEE, 2015.
- [23] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
- [24] Teng Wang, Tiantian Geng, Jinbao Wang, and Feng Zheng. Sustech&hku submission to tinyaction challenge 2021. 2021. https://www.crcv.ucf.edu/tiny-actions-challenge-cvpr2021/submissions/SUSTech&HKU_Report.pdf.
- [25] Chenliang Xu, Shao-Hang Hsieh, Caiming Xiong, and Jason J. Corso. Can humans fly? action understanding with multiple classes of actors. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2264–2273, 2015.
- [26] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.