
Human-Centered Prior-Guided and Task-Dependent Multi-Task Representation Learning for Action Recognition Pre-Training

Abstract

Recently, much progress has been made in self-supervised action recognition. Most existing approaches emphasize the contrastive relations among videos, including appearance and motion consistency. However, two main issues remain for existing pre-training methods: 1) the learned representation is neutral and not informative for a specific task; 2) multi-task learning-based pre-training sometimes leads to sub-optimal solutions due to inconsistent domains of different tasks. To address the above issues, we propose a novel action recognition pre-training framework, which exploits human-centered prior knowledge to generate more informative representations and avoids conflicts between multiple tasks by using task-dependent representations. Specifically, we distill knowledge from a human parsing model to enrich the semantic capability of the representation. In addition, we combine knowledge distillation with contrastive learning to constitute a task-dependent multi-task framework. We achieve state-of-the-art performance on two popular benchmarks for the action recognition task, i.e., UCF101 and HMDB51, verifying the effectiveness of our method.

1 Corresponding author.

Index Terms—  Multi-Task Learning, Knowledge Distillation, Video Representation Learning.

1 Introduction

Action recognition is a hot topic in the computer vision community, with many practical applications such as intelligent surveillance, human-computer interaction, and behavior analysis [1]. The critical challenge of video-based action recognition is to model the complex spatial-temporal information in videos, which is more difficult than understanding static images [2]. Early works usually follow a two-stream paradigm [3, 4, 5, 6] or exploit 3D convolutional neural networks (3D CNNs) [7, 8, 9, 10] to explore visual appearances and temporal dynamics. However, these supervised methods require great annotation effort [11]. Recently, many works have focused on action recognition pre-training with self-supervised learning frameworks that do not use categorical labels. These approaches usually rely on the design of pretext tasks for representation learning, such as temporal shuffling [12], future frame prediction [13], video-based space-time cubic puzzle completion [14], and contrastive learning-based tasks [15, 16, 2].

Fig. 1: Two rows show some examples of human-centered actions and corresponding segmentation maps from human parsing models, respectively. The human parsing prior provides useful knowledge to action recognition tasks.

Although great effort has been made in video representation learning, two main issues remain for existing pre-training methods. First, the commonly used contrastive learning paradigm emphasizes instance-level similarity [17], which can hardly capture the abundant semantic information in videos, resulting in neutral and less informative representations. This in turn causes severe performance degradation on downstream tasks whose objectives differ substantially from those of the pre-training stage. Second, many multi-task learning-based pre-training frameworks fail to consider the potential conflicts among different and inconsistent objectives from multiple tasks, leading to sub-optimal solutions [18, 19].

To address the above issues, in this paper, we propose a novel prior-guided and task-dependent multi-task representation learning framework for video-based action recognition pre-training. First, we incorporate the human-centered prior by distilling informative knowledge from a human parsing teacher model with an encoder-decoder network. This is based on the intuition that human parsing knowledge reflects human actions and is therefore well-aligned with the downstream action recognition objectives; an example is shown in Fig. 1. In addition, we combine contrastive learning with both appearance and motion consistency into a multi-task learning framework. To avoid potential conflicts among multiple tasks, task-dependent models are employed, and the generated task-dependent representations are further combined for downstream tasks. The framework of our proposed method is illustrated in Fig. 2. We conduct extensive experiments for action recognition on the UCF101 and HMDB51 datasets and achieve state-of-the-art (SOTA) performance, verifying the effectiveness of our proposed method.

The main contributions of this paper are summarized as follows: 1) we present a novel framework that incorporates the human-centered prior for representation learning via knowledge distillation (KD) from a human parsing teacher model; 2) we employ a multi-task learning framework with a task-dependent representation learning strategy; 3) we conduct experiments on the action recognition task, demonstrating the effectiveness of the proposed multi-task learning framework for video representation learning.

2 Related Work

Self-Supervised Learning for Visual Representation. Self-supervised learning aims at learning discriminative representations by leveraging information from unlabeled data. Most works explore self-supervised visual representation learning through the design of pretext tasks, such as image inpainting [20], permutation [21], solving jigsaw puzzles [22], and contrastive learning strategies. Recently, the extension from image to video representation learning has become increasingly popular due to the richer temporal information of videos. Like representation learning for images, self-supervised video representation learning also focuses on the design of pretext tasks, yet extended to consider temporal consistency in video clips. For example, [23, 12, 24] shuffle the frame and clip order along the temporal dimension; [15] proposes a pretext task to predict future frames; [14] learns video features by solving space-time cubic puzzles; and [25] proposes SpeedNet to predict the motion speed in videos. In addition, contrastive learning is also widely used in the design of pretext tasks. Specifically, [15] proposes a dense predictive coding method with a contrastive loss for video frames; [16] exploits contrastive multi-view video coding, inspired by [26] for image coding; [2] proposes a relative contrastive speed perception task to learn the motion information of videos. However, the representation learned by self-supervised training may not be well-aligned with the objectives of downstream tasks, leading to severe performance degradation.

Video-Based Action Recognition. Some works on action recognition follow a two-stream architecture to model appearance and temporal information separately. For example, [3] exploits both a spatial 2D CNN and a temporal CNN to extract spatial and motion information from RGB frames and optical flows. [27] utilizes a shift module to better capture temporal information. [28] proposes a slow-fast network to extract spatial semantics and temporal motion information from video clips. However, the 2D CNNs adopted in the two-stream architecture have limited ability to capture the dynamics of visual tempos [29]. To address this issue, some recent works on action recognition are based on 3D CNNs and their variants, which extract appearance and temporal information jointly. Specifically, [8] first employs 3D convolutions on adjacent frames to model the spatial and temporal information of videos. [9] inflates pre-trained 2D convolutions into 3D convolution kernels. [10] decomposes 3D convolution kernels into a 2D+1D paradigm to improve the performance of 3D CNNs. [30] proposes a self-attention mechanism to model long-range temporal dynamics of videos. Despite the great progress made in recent works, how to generate more informative representations with a self-supervised pre-training framework for action recognition remains underexplored and needs further study.

Fig. 2: Overview of our proposed framework. Video clips are first augmented and then sent to a shared low-level encoder to generate common low-level feature maps across multiple tasks. The low-level feature maps are then sent to two distinct tasks, namely the human-centered prior knowledge distillation task and the video contrastive learning task, following a multi-task learning framework. The knowledge distillation branch is supervised by a pre-trained human parsing teacher model, while the contrastive learning branch combines both motion and appearance contrastive relationships. Task-dependent embeddings are learned from two distinct high-level encoders and concatenated as the final representation.

3 Proposed Method

In this section, we present our prior-guided and task-dependent multi-task representation learning framework for video-based action recognition pre-training. We first describe how to incorporate the human-centered prior information by distilling informative knowledge from a human parsing teacher model with an encoder-decoder network, to enrich the semantic capability of the representation. Then, we combine knowledge distillation (KD) with video contrastive learning into a multi-task learning framework to boost discrimination with fused task-dependent representations.

3.1 Human-Centered Prior Knowledge Distillation

Since most action recognition videos are related to human actions, we incorporate the human-centered prior into representation learning with a pre-trained human parsing guided teacher network. Denote $f_{l}(\cdot)$ and $h_{h}(\cdot)$ as the low-level encoder and high-level encoder of the human prior representation module, respectively, where the low-level encoder $f_{l}(\cdot)$ shares its weights across multiple tasks. Given the input video clip $\boldsymbol{X}$, the human prior representation $\boldsymbol{z}_{h}$ can be obtained by

$\boldsymbol{z}_{h}=h_{h}(f_{l}(\boldsymbol{X})).$  (1)

To guide the learning of the human prior representation, we employ a pre-trained human parsing network $f_{t}(\cdot)$ as the teacher model. Given the middle frame $\boldsymbol{X}_{m}$ of the input video clip, the teacher model generates the human parsing segmentation feature map $\boldsymbol{F}_{t}$ as follows,

$\boldsymbol{F}_{t}=f_{t}(\boldsymbol{X}_{m}).$  (2)

Meanwhile, the prior representation is fed into a decoder $g_{h}(\cdot)$ to generate a parsing segmentation map as follows,

$\boldsymbol{F}_{h}=g_{h}(\boldsymbol{z}_{h}).$  (3)

We assume both $\boldsymbol{F}_{t}$ and $\boldsymbol{F}_{h}$ are feature maps normalized by the softmax operation. $\boldsymbol{F}_{h}$ is supervised by the teacher model with the KL-divergence loss $\mathcal{L}_{KL}$ as follows,

$\mathcal{L}_{KL}=-\frac{1}{N}\sum_{i,j}^{N}\sum_{c=1}^{C}\boldsymbol{F}_{t}(i,j,c)\log\boldsymbol{F}_{h}(i,j,c),$  (4)

where $i,j$ are the spatial indices, $c$ is the segmentation class index, $C$ is the number of segmentation classes, and $N$ is the number of spatial positions. With human parsing guided learning, the prior representation $\boldsymbol{z}_{h}$ should contain rich semantic information related to human-centered actions.
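To make Eqs. (1)-(4) concrete, the following PyTorch sketch shows one way the distillation objective could be computed. All module names (`low_encoder`, `high_encoder_h`, `decoder_h`, `teacher`) are placeholders for the components described above, and the spatial-resampling step is our assumption rather than a detail stated in the paper.

```python
import torch
import torch.nn.functional as F

def human_prior_kd_loss(clip, mid_frame, low_encoder, high_encoder_h, decoder_h, teacher):
    """Sketch of the human-parsing distillation loss (Eqs. 1-4).

    clip:      video clip X with shape (B, C, T, H, W)
    mid_frame: middle frame X_m with shape (B, C, H, W), fed to the frozen teacher
    The module arguments stand in for f_l, h_h, g_h, and f_t, respectively.
    """
    # Student branch: shared low-level encoder, then the prior-specific high-level encoder (Eq. 1)
    z_h = high_encoder_h(low_encoder(clip))
    # Decode the prior representation into parsing logits of shape (B, C_cls, H', W') (Eq. 3)
    logits_h = decoder_h(z_h)

    # Frozen teacher predicts the target parsing map from the middle frame (Eq. 2)
    with torch.no_grad():
        logits_t = teacher(mid_frame)
        # Match the teacher map to the student's spatial resolution if they differ (assumption)
        logits_t = F.interpolate(logits_t, size=logits_h.shape[-2:], mode="bilinear",
                                 align_corners=False)

    # Softmax-normalize both maps over the class dimension, as assumed in Eq. (4)
    log_p_h = F.log_softmax(logits_h, dim=1)
    p_t = F.softmax(logits_t, dim=1)

    # Soft cross-entropy against the teacher targets, averaged over the N spatial positions
    n_spatial = logits_h.shape[-2] * logits_h.shape[-1]
    return -(p_t * log_p_h).sum(dim=1).sum(dim=(-2, -1)).mean() / n_spatial
```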

3.2 Task-Dependent Multi-Task Learning

To learn discriminative representations among videos, inspired by [2], we also incorporate a video contrastive learning module into the multi-task learning framework. On top of the output of the low-level encoder $f_{l}(\cdot)$, we stack another high-level encoder $h_{c}(\cdot)$, different from the $h_{h}(\cdot)$ used in the human prior module, to generate the contrastive representation $\boldsymbol{z}_{c}$ as follows,

$\boldsymbol{z}_{c}=h_{c}(f_{l}(\boldsymbol{X})).$  (5)

Two projection heads, i.e., $g_{m}(\cdot)$ and $g_{a}(\cdot)$, are further employed to learn motion and appearance contrastive relations with the following margin ranking loss and InfoNCE loss [31],

$\mathcal{L}_{m}=\max(0,\gamma-(d^{+}-d^{-})),$
$\mathcal{L}_{a}=-\log\frac{q^{+}}{q^{+}+\sum_{n=1}^{K}q_{n}^{-}},$  (6)

where $d^{+/-}=d(g_{m}(\boldsymbol{z}_{c}),g_{m}(\boldsymbol{z}_{c}^{+/-}))$ denotes the distance between the anchor motion embedding and the embedding with the same/different playback speed [2], and $q^{+/-}=\exp(d(g_{a}(\boldsymbol{z}_{c}),g_{a}(\boldsymbol{z}_{c}^{+/-}))/\tau)$ denotes the similarity between the anchor appearance embedding and the embedding from the same/different video clip.
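For illustration, the two contrastive objectives in Eq. (6) can be sketched as below. The use of cosine similarity as the function $d$, the margin `gamma`, the temperature `tau`, and the tensor shapes are assumptions made for this sketch; the paper follows the relative speed perception setting of [2].

```python
import torch
import torch.nn.functional as F

def motion_ranking_loss(anchor, pos, neg, gamma=0.5):
    """Margin ranking loss on motion embeddings g_m(z_c) (Eq. 6, first line).

    anchor, pos, neg: (B, D) embeddings of clips with the same (pos) or a
    different (neg) playback speed relative to the anchor.
    """
    d_pos = F.cosine_similarity(anchor, pos, dim=-1)  # d^+ (higher means closer)
    d_neg = F.cosine_similarity(anchor, neg, dim=-1)  # d^-
    return F.relu(gamma - (d_pos - d_neg)).mean()

def appearance_infonce_loss(anchor, pos, negatives, tau=0.07):
    """InfoNCE loss on appearance embeddings g_a(z_c) (Eq. 6, second line).

    anchor, pos: (B, D); negatives: (B, K, D) embeddings from other video clips
    (e.g., drawn from a momentum queue as in MoCo [31]).
    """
    q_pos = torch.exp(F.cosine_similarity(anchor, pos, dim=-1) / tau)                     # (B,)
    q_neg = torch.exp(F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / tau)  # (B, K)
    return (-torch.log(q_pos / (q_pos + q_neg.sum(dim=1)))).mean()
```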

With the human prior knowledge distillation module and video contrastive module, we form a multi-task learning framework with the following total loss,

$\mathcal{L}=\lambda_{k}\mathcal{L}_{KL}+\lambda_{m}\mathcal{L}_{m}+\lambda_{a}\mathcal{L}_{a},$  (7)

where $\lambda_{k}$, $\lambda_{m}$, and $\lambda_{a}$ are the weights of the individual losses. Unlike widely adopted multi-task learning approaches in which different tasks share a common representation, we use task-dependent and uncorrelated representations $\boldsymbol{z}_{h}$ and $\boldsymbol{z}_{c}$ to learn different semantic information from videos, which better avoids conflicts across different tasks. We use the concatenated representation $[\boldsymbol{z}_{h}\,||\,\boldsymbol{z}_{c}]$ as the final representation of each video clip.
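As a schematic summary of the task-dependent design and the total objective in Eq. (7), the shared and task-specific encoders could be wrapped as follows; the class, its arguments, and the default loss weights are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TaskDependentEncoders(nn.Module):
    """Shared low-level encoder f_l feeding two task-dependent high-level encoders
    h_h and h_c; their embeddings are concatenated as [z_h || z_c] for downstream use."""

    def __init__(self, low_encoder, high_encoder_h, high_encoder_c):
        super().__init__()
        self.low_encoder = low_encoder          # f_l, shared across tasks
        self.high_encoder_h = high_encoder_h    # h_h, human-prior branch
        self.high_encoder_c = high_encoder_c    # h_c, contrastive branch

    def forward(self, clip):
        feat = self.low_encoder(clip)           # common low-level feature maps
        z_h = self.high_encoder_h(feat)         # task-dependent prior embedding (Eq. 1)
        z_c = self.high_encoder_c(feat)         # task-dependent contrastive embedding (Eq. 5)
        return z_h, z_c, torch.cat([z_h, z_c], dim=-1)  # fused representation [z_h || z_c]

def total_loss(loss_kl, loss_m, loss_a, lam_k=1.0, lam_m=1.0, lam_a=1.0):
    """Weighted multi-task objective of Eq. (7); the weights here are illustrative."""
    return lam_k * loss_kl + lam_m * loss_m + lam_a * loss_a
```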

4 Experiment

4.1 Implementation Details

We adopt the Kinetics-400 [9] dataset for self-supervised pre-training. We employ the SOTA human parsing model SCHP [32] as the pre-trained teacher model for human-centered prior knowledge distillation. After pre-training, we finetune the model on the UCF101 [33] and HMDB51 [34] datasets for the downstream action recognition task. We evaluate performance with top-1 and top-5 accuracy (Acc@1 and Acc@5, respectively). More implementation details are provided in the Supplementary Material.
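For clarity, the Acc@1/Acc@5 metrics reported below can be computed as in this minimal sketch (our own illustration of the standard metric, not the authors' evaluation code):

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    """Top-k accuracy (Acc@1, Acc@5) in percent, from classifier logits.

    logits: (B, num_classes); labels: (B,) ground-truth class indices.
    """
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)          # (B, max_k) predicted class ids
    hits = pred.eq(labels.unsqueeze(1))          # (B, max_k) boolean matches
    return {f"Acc@{k}": hits[:, :k].any(dim=1).float().mean().item() * 100 for k in ks}
```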

Table 1: Top-1 and Top-5 accuracy on UCF101 and HMDB51 datasets compared with SOTA methods. Best performance is marked in bold for each type of architecture.
| Method | UCF101 Acc@1 | UCF101 Acc@5 | HMDB51 Acc@1 | HMDB51 Acc@5 |
| --- | --- | --- | --- | --- |
| C3D Architecture | | | | |
| VCP [35] | 68.5 | - | 32.5 | - |
| MAS [36] | 61.2 | - | 33.4 | - |
| RTT [37] | 69.9 | - | 39.6 | - |
| RSPNet [2] | 76.7 | - | 44.6 | - |
| RSPNet* [2] | 77.6 | 93.7 | 45.4 | 75.7 |
| Ours | **80.4** | **95.7** | **46.1** | **78.4** |
| R(2+1)D Architecture | | | | |
| VCP [35] | 66.3 | - | 32.2 | - |
| PSP [38] | 74.8 | - | 36.8 | - |
| ClipOrder [24] | 72.4 | - | 30.9 | - |
| PRP [39] | 72.4 | - | 35.0 | - |
| Pace [11] | 77.1 | - | 36.6 | - |
| RSPNet [2] | 81.1 | - | 41.8 | - |
| RSPNet* [2] | 79.4 | 94.3 | 43.0 | 74.5 |
| Ours | **81.6** | **95.3** | **46.1** | **74.8** |
Table 2: Ablation Studies on Action Recognition Task on UCF101 and HMDB51 datasets. Best performance is marked in bold for each type of architecture.
| Method | UCF101 Acc@1 | UCF101 Acc@5 | HMDB51 Acc@1 | HMDB51 Acc@5 |
| --- | --- | --- | --- | --- |
| C3D Architecture | | | | |
| w/o KD | 77.6 | 93.7 | 45.4 | 75.7 |
| TI | 79.3 | 92.1 | 44.8 | 67.5 |
| Full model | **80.4** | **95.7** | **46.1** | **78.4** |
| R(2+1)D Architecture | | | | |
| w/o KD | 79.4 | 94.3 | 43.0 | 74.5 |
| TI | 80.1 | 94.3 | 44.6 | 74.4 |
| Full model | **81.6** | **95.3** | **46.1** | **74.8** |
Fig. 3: Some examples of visualization results. The first row and second row show RSPNet [2] and our proposed method, respectively. As expected, our model focuses more on human actions than on background regions.

4.2 Evaluation on Video Action Recognition Task

Comparison with SOTA Methods. We compare our method with other SOTA methods on the UCF101 and HMDB51 datasets. We report top-1 and top-5 accuracy with the C3D [8] and R(2+1)D [10] architectures in Table 1. RSPNet* denotes our reproduction with 100-epoch finetuning rather than the result reported in the original paper. Our method outperforms all the other SOTA methods by a large margin, verifying its effectiveness.

Ablation Study. We conduct ablation studies on each component of our proposed method on the UCF101 and HMDB51 datasets, with the results shown in Table 2. We denote the variant without the knowledge distillation module as "w/o KD", and the variant using a task-independent representation shared across multiple tasks as "TI". Both variants show a significant degradation compared with the full model, further demonstrating the effectiveness of each individual component of our proposed method.

Visualization. To better reveal the effectiveness of our framework, we show some visualization examples of the regions of interest obtained with the class-activation map (CAM) technique [40] in Fig. 3. The produced heatmap is overlaid on the original video frames for each example. The first row shows the results from RSPNet [2], while the second row shows the results of our proposed method. Our method concentrates more on the human action, while RSPNet is distracted by uncorrelated background regions. These visualization results support our motivation of incorporating human-centered prior knowledge into multi-task pre-training to learn discriminative representations. More visualization examples are included in the Supplementary Material.

5 Conclusion

In this paper, we present a novel prior-guided and task-dependent multi-task representation learning framework for video-based action recognition pre-training. First, we incorporate human-centered prior information by knowledge distillation from a human parsing teacher model to enrich the learned representation. Moreover, we follow a multi-task learning framework with a task-dependent representation learning strategy to mitigate the conflicts of the multi-task training paradigm. Experimental results on the action recognition task demonstrate the effectiveness of the proposed multi-task learning framework for video representation learning.

Acknowledgement. This work was supported by the National Natural Science Foundation of China under Grant 62106219.

References

  • [1] De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, and Juan Carlos Niebles, “What makes a video a video: Analyzing temporal information in video understanding models and datasets,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7366–7375.
  • [2] Peihao Chen, Deng Huang, Dongliang He, Xiang Long, Runhao Zeng, Shilei Wen, Mingkui Tan, and Chuang Gan, “Rspnet: Relative speed perception for unsupervised video representation learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 1045–1053.
  • [3] Karen Simonyan and Andrew Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
  • [4] Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes, “Spatiotemporal multiplier networks for video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4768–4777.
  • [5] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1933–1941.
  • [6] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 20–36.
  • [7] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu, “3d convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2012.
  • [8] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
  • [9] Joao Carreira and Andrew Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
  • [10] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.
  • [11] Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu, “Self-supervised video representation learning by pace prediction,” in European Conference on Computer Vision. Springer, 2020, pp. 504–521.
  • [12] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang, “Unsupervised representation learning by sorting sequences,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 667–676.
  • [13] Nadine Behrmann, Jurgen Gall, and Mehdi Noroozi, “Unsupervised video representation learning by bidirectional feature prediction,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021, pp. 1670–1679.
  • [14] Dahun Kim, Donghyeon Cho, and In So Kweon, “Self-supervised video representation learning with space-time cubic puzzles,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 8545–8552.
  • [15] Tengda Han, Weidi Xie, and Andrew Zisserman, “Video representation learning by dense predictive coding,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
  • [16] Li Tao, Xueting Wang, and Toshihiko Yamasaki, “Self-supervised video representation learning using inter-intra contrastive framework,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2193–2201.
  • [17] Lin Zhang, Qi She, Zhengyang Shen, and Changhu Wang, “How incomplete is contrastive learning? An inter-intra variant dual representation method for self-supervised video recognition,” arXiv preprint, 2021.
  • [18] Alex Kendall, Yarin Gal, and Roberto Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491.
  • [19] Ozan Sener and Vladlen Koltun, “Multi-task learning as multi-objective optimization,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 525–536.
  • [20] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
  • [21] Ishan Misra and Laurens van der Maaten, “Self-supervised learning of pretext-invariant representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 6707–6717.
  • [22] Dahun Kim, Donghyeon Cho, Donggeun Yoo, and In So Kweon, “Learning image representations by completing damaged jigsaw puzzles,” in 2018 IEEE Winter Conference on Applications of Computer Vision. IEEE, 2018, pp. 793–802.
  • [23] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould, “Self-supervised video representation learning with odd-one-out networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3636–3645.
  • [24] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang, “Self-supervised spatiotemporal learning via video clip order prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10334–10343.
  • [25] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel, “Speednet: Learning the speediness in videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9922–9931.
  • [26] Yonglong Tian, Dilip Krishnan, and Phillip Isola, “Contrastive multiview coding,” in European Conference on Computer Vision. Springer, 2020, pp. 776–794.
  • [27] Ji Lin, Chuang Gan, and Song Han, “Temporal shift module for efficient video understanding,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7082–7092.
  • [28] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He, “Slowfast networks for video recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6202–6211.
  • [29] Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, and Bolei Zhou, “Temporal pyramid network for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 591–600.
  • [30] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
  • [31] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
  • [32] Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang, “Self-correction for human parsing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [33] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [34] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre, “Hmdb: a large video database for human motion recognition,” in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2011, pp. 2556–2563.
  • [35] Dezhao Luo, Chang Liu, Yu Zhou, Dongbao Yang, Can Ma, Qixiang Ye, and Weiping Wang, “Video cloze procedure for self-supervised spatio-temporal learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, vol. 34, pp. 11701–11708.
  • [36] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu, “Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4006–4015.
  • [37] Simon Jenni, Givi Meishvili, and Paolo Favaro, “Video representation learning by recognizing temporal transformations,” in European Conference on Computer Vision. Springer, 2020, pp. 425–442.
  • [38] Hyeon Cho, Taehoon Kim, Hyung Jin Chang, and Wonjun Hwang, “Self-supervised spatio-temporal representation learning using variable playback speed prediction,” arXiv preprint arXiv:2003.02692, 2020.
  • [39] Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, and Qixiang Ye, “Video playback rate perception for self-supervised spatio-temporal representation learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 6548–6557.
  • [40] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.