RSPNet: Relative Speed Perception for
Unsupervised Video Representation Learning

Peihao Chen1,†    Deng Huang1,†    Dongliang He2    Xiang Long2    Runhao Zeng1    Shilei Wen2    Mingkui Tan1,*    Chuang Gan3
1South China University of Technology    2Baidu Inc    3MIT-IBM Watson AI Lab
† Equal contribution. This work was done when Peihao Chen was a research intern at Baidu.
* Corresponding author.
{phchencs, im.huangdeng, runhaozeng.cs, ganchuang1990}@gmail.com,
{hedongliang01, longxiang, wenshilei}@baidu.com, [email protected]
Abstract

We study unsupervised video representation learning, which seeks to learn both motion and appearance features from unlabeled video only; the learnt features can be reused for downstream tasks such as action recognition. This task, however, is extremely challenging due to: 1) the highly complex spatial-temporal information in videos; and 2) the lack of labeled data for training. Unlike representation learning for static images, it is difficult to construct a suitable self-supervised task to model both motion and appearance features well. More recently, several attempts have been made to learn video representation through video playback speed prediction. However, it is non-trivial to obtain precise speed labels for videos. More critically, the learnt models may tend to focus on motion patterns and thus may not learn appearance features well. In this paper, we observe that the relative playback speed is more consistent with the motion pattern and thus provides more effective and stable supervision for representation learning. Therefore, we propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels. In this way, we are able to perceive speed well and learn better motion features. Moreover, to ensure the learning of appearance features, we further propose an appearance-focused task, in which we enforce the model to perceive the appearance difference between two video clips. We show that optimizing the two tasks jointly consistently improves the performance on two downstream tasks. Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7% accuracy without using any labeled data for pre-training. Code and pre-trained models can be found at https://github.com/PeihaoChen/RSPNet.

Figure 1: An illustrative example of content-label inconsistency. In existing speed perception-based methods [1], 1) both video clips (a) and (b) are labeled as 1x speed, i.e., sampled consecutively, but the contents of these two clips are dissimilar: the player on the left shoots more slowly, while the player in the middle has finished shooting within the same time period. 2) Although clip (c) is labeled as 2x speed, i.e., the sampling interval is set to 2 frames, it looks similar to the middle clip even though their speed labels differ.

1 Introduction

Video analysis [24] has been a prominent research topic in computer vision due to its vast potential applications, such as action recognition [32, 8], event detection [14], action localization [4, 49, 50], audio-visual scene analysis [16, 5, 12], etc. Compared with static images, videos often contain more complex spatial-temporal contents and have a larger data volume, making them very challenging to annotate and analyze. How to learn effective video representations with few annotations, or even without annotations, is an important yet challenging task [13, 11, 10].

Recently, unsupervised video representation learning, which seeks to learn appearance and motion features from unlabeled videos, has attracted great attention [7, 1, 9, 15]. This task, however, is very difficult due to several challenges: 1) The downstream video understanding tasks, such as action recognition, rely on both appearance features (e.g., texture and shape of objects, background scene) and motion features (e.g., the movement of objects). It is difficult to learn representation for both appearance and motion simultaneously because of the complex spatial-temporal information in videos. 2) It is difficult to mine effective supervision from unlabeled video data for representation learning.

Existing methods attempt to address these challenges by designing pretext tasks to obtain pseudo labels for video representation learning. The pretext tasks include context prediction [20], playback rate perception [1], temporal clip order prediction [46], etc. Among them, training models on a playback speed perception task achieves great success because models have to focus on the moving objects to perceive the playback speed [43]. This helps models learn representative motion features. Specifically, Benaim et al. [1] train a model to determine whether videos are sped up or not. Some works [9, 48, 43] try to predict the specific playback speed of each video.

However, these works suffer from two limitations. First, the playback speed labels used for the pretext task can be imprecise because they may be inconsistent with the motion content in videos. As shown in Figure 1, clips with different labels (i.e., different playback speeds) may look similar to each other. The underlying reason is that different people often perform the same action at different speeds. Using such inconsistent speed labels for training may make it difficult to learn discriminative features. Second, perceiving speed mainly relies on motion content. It does not explicitly encourage models to explore appearance features, which, however, are also important for video understanding. Recently, the instance discrimination task [44, 23] has shown its effectiveness for learning appearance features in the image domain. However, how to extend it to the video domain and combine it well with motion feature learning is non-trivial.

To address the imprecise label issue in the above methods, we observe that the relative playback speed can provide more precise supervision for training. To this end, we propose a new pretext task that exploits relative playback speed as labels for perceiving speed, namely Relative Speed Perception (RSP). Specifically, we sample two clips from the same video and train a neural network to identify their relative playback speed instead of predicting the specific playback speed of each video clip. The relative playback speed label is obtained by comparing the playback speeds of two clips from the same video (e.g., 2x is faster than 1x). We observe that, for the same video, the higher the playback speed, the faster the objects will move. Consequently, such labels are independent of the original speed of objects in a video and reveal the precise motion distinction between two clips. In this sense, the labels are more consistent with the motion content and can provide more effective supervision for representation learning.

Moreover, to encourage models to pay attention to learning appearance features, we follow the spirit of the instance discrimination task in the image domain and design an Appearance-focused Video Instance Discrimination (A-VID) task. In this task, we require the model to pick out two clips sampled from the same video from among a set of clips taken from other videos. Considering that different clips in the same video are often at the same speed, we propose a speed augmentation strategy, i.e., randomizing the playback speed of each clip. Consequently, models cannot finish this task by simply learning speed information. Instead, models tend to learn appearance features, such as the background scene and the texture of objects, because these features are consistent within a video but vary among different videos. We train models on the RSP and A-VID tasks jointly using a two-branch architecture, so that models are expected to learn both motion and appearance features simultaneously. We name our model RSPNet. Experimental results on three datasets show that the learnt features perform well on two downstream tasks, i.e., action recognition and video retrieval.

To sum up, our contributions are as follows:

  • We propose a relative speed perception task for unsupervised video representation learning. The relative speed labels are more consistent with the motion content and thus provide more effective supervision for representation learning.

  • We extend the instance discrimination task to the video domain and propose a speed augmentation strategy that makes it focus more on exploring appearance content. In this way, we can combine it well with the relative speed perception task to learn representations for both motion and appearance contents simultaneously.

  • We verify the effectiveness of the RSP and A-VID tasks for learning video representation on two downstream tasks and three datasets. Remarkably, without the need for annotations during pre-training, our action recognition accuracy on UCF101 significantly outperforms that of models pre-trained on ImageNet with supervision (93.7% vs. 86.6%).

Figure 2: Illustration of the proposed self-supervised video representation learning scheme. Given a set of video clips with different playback speeds, we use a spatial-temporal encoder $f(\cdot;\theta)$ followed by two projection heads (i.e., $g_m$ and $g_a$) to extract clip features for two pretext tasks. In the relative speed perception (RSP) task, we identify the relative playback speed between clips instead of predicting their specific playback speeds. In the appearance-focused video instance discrimination (A-VID) task, we distinguish video clips relying on the appearance contents. We formulate the two tasks as a metric learning problem and use the triplet loss $\mathcal{L}_m$ and the InfoNCE loss $\mathcal{L}_a$ for training.

2 Related work

Unsupervised video representation learning.

In recent years, unsupervised video representation learning, which uses the video itself as supervision, has become a popular topic [25]. Existing methods learn representations through various carefully designed pretext tasks. Xu et al. [46] proposed a video clip order prediction task that leverages the temporal order of image sequences. Luo et al. [33] proposed the video cloze procedure task, which predicts the spatio-temporal operation applied to video clips. Instead of focusing on the RGB domain, Ng et al. [36] proposed a multitask learning model trained by estimating optical flow to learn motion representation. Since videos contain multiple frames, predicting future frames in a latent space [41] is also an effective task for learning visual representation.

More recently, many works have been proposed to learn features through discriminating playback speeds. Epstein et al. [9] try to predict whether a clip is sped up or not. Some works [43, 48] attempt to predict the specific playback speed of one clip. However, these works suffer from the imprecise speed label issue. Cho et al. [7] design a method to sort video clips according to their playback speeds. However, they do not explicitly encourage the model to learn appearance features. Our method makes use of relative speed to resolve the imprecise label issue. Moreover, we extend the instance discrimination task [44] to the video domain to encourage appearance learning.

Metric Learning.

Metric learning [45] aims to automatically construct task-specific distance metrics that compare two samples from a specific aspect. Based on such a metric, similar pairs of samples are pulled together and dissimilar pairs are pushed apart. It has achieved great success in many areas, e.g., face recognition [37], music recommendation [34], and person re-identification [47]. Recently, many works have successfully adopted metric learning for self-supervised representation learning [44, 23, 40]. They usually generate positive pairs by creating multiple views of each data sample and generate negative pairs by randomly choosing images/patches/videos. In this work, we aim to learn video representation by comparing two video clips using metric learning. Unlike existing works, we propose to identify their speed distinction and appearance distinction so as to learn motion and appearance features from unlabeled data.

3 Proposed method

Problem Definition.

Let $\mathcal{V}=\{v_i\}_{i=1}^{N}$ be a video set containing $N$ videos. We sample a clip $\mathbf{c}_i$ from a video at playback speed $s_i$. Unsupervised video representation learning aims to learn a spatial-temporal encoder $f(\cdot;\theta)$ that maps a video clip $\mathbf{c}_i$ to its corresponding feature $\mathbf{x}_i$, which best describes the content in $\mathbf{c}_i$.

This task is very challenging because of the complex spatial-temporal information in videos and the lack of annotations. It is difficult to construct supervision from the unlabeled videos $\mathcal{V}$ to train a model that learns representations for both appearance and motion contents. Recently, some unsupervised learning methods attempt to learn video representation through playback speed perception. However, most of them suffer from the imprecise speed label issue and do not explicitly encourage models to learn appearance features. Consequently, the learnt features may not be suitable for downstream video understanding tasks such as action recognition and video retrieval.

3.1 General scheme of RSPNet

In this paper, we observe that relative playback speed can provide more effective labels for representation learning. Thus, we propose a relative speed perception task, i.e., predicting whether two clips have the same speed or not, to resolve the imprecise label issue and learn motion features. Moreover, we extend the instance discrimination task to the video domain and propose a speed augmentation strategy to explicitly make models pay attention to exploring appearance features. Considering the success of metric learning for representation learning [19, 17], we formulate these two tasks as metric learning problems, in which we seek to maximize the similarity of two clip features in a positive pair while minimizing that of a negative pair.

Formally, for the relative speed perception task, instead of directly predicting the playback speed $s_i$ of clip $\mathbf{c}_i$, we propose to compare the speeds of two clips $\mathbf{c}_i$ and $\mathbf{c}_j$ that are sampled from the same video. Since the actions in $\mathbf{c}_i$ (or $\mathbf{c}_j$) are often performed by the same subject, the motions in these two clips are similar when $s_i=s_j$ and dissimilar when $s_i\neq s_j$. In this sense, the relative speed labels are obtained by comparing $s_i$ and $s_j$ (i.e., clips $\mathbf{c}_i$ and $\mathbf{c}_j$ are labeled as a positive pair when $s_i=s_j$ and as a negative pair otherwise). Such labels are more consistent with the motion content in videos and reveal the precise motion distinction. For the appearance-focused video instance discrimination task, we enforce the model to predict whether two clips $\mathbf{c}_i$ and $\mathbf{c}_l$ are sampled from the same video. The intuition is that clips sampled from the same video often share similar appearance contents, which can be used as an important clue for distinguishing videos. We also randomize the playback speed, i.e., $s_i$ may or may not be equal to $s_l$. In this way, models are encouraged to pay more attention to learning appearance features instead of finishing this task by learning playback speed information.
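To make the clip sampling and the relative speed labeling concrete, below is a minimal PyTorch sketch, assuming videos are already decoded into frame tensors; the function names are illustrative and not part of our released code.

```python
import torch

def sample_clip(frames, speed, clip_len=16):
    """Sample a clip of clip_len frames with a sampling interval equal to speed.

    frames: (T, C, H, W) tensor of decoded video frames (assumed to contain at
    least clip_len * speed frames); speed=1 samples consecutive frames (1x),
    speed=2 samples every second frame (2x), and so on.
    """
    span = clip_len * speed                                    # temporal span covered by the clip
    start = torch.randint(0, frames.shape[0] - span + 1, (1,)).item()
    return frames[start:start + span:speed]

# Two clips from the SAME video form a positive pair for RSP when their
# playback speeds are equal, and a negative pair otherwise.
def is_positive_pair(s_i, s_j):
    return s_i == s_j
```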

We use two individual projection heads $g_m(\cdot;\theta_m)$ and $g_a(\cdot;\theta_a)$ to map the spatial-temporal features $\mathbf{x}_i$ of clip $\mathbf{c}_i$ to $\mathbf{m}_i$ and $\mathbf{a}_i$ for the two tasks, respectively. We train models on these two tasks jointly. The objective function is formulated as follows,

\mathcal{L}(\mathcal{V};\theta,\theta_{a},\theta_{m}) = \mathcal{L}_{m}(\mathcal{V};\theta,\theta_{m}) + \lambda\,\mathcal{L}_{a}(\mathcal{V};\theta,\theta_{a}), \quad (1)

where $\mathcal{L}_m$ and $\mathcal{L}_a$ denote the loss functions of the two tasks, respectively, and $\lambda$ is a fixed hyper-parameter that controls the relative importance of each term. During inference for downstream tasks, we forward a video clip through the spatial-temporal encoder $f(\cdot;\theta)$ and take $\mathbf{x}_i$ as its spatial-temporal feature. The schematic of our approach is shown in Figure 2. In the following, we introduce the two pretext tasks in more detail in Section 3.2.
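As a reference point, the two-branch scheme can be sketched as follows in PyTorch; the class and attribute names are illustrative, and the backbone encoder is left abstract.

```python
import torch
import torch.nn as nn

class TwoBranchModel(nn.Module):
    """Shared spatial-temporal encoder with two projection heads (a sketch;
    `encoder` is any 3D backbone returning a feat_dim-dimensional vector)."""
    def __init__(self, encoder, feat_dim, proj_dim=128):
        super().__init__()
        self.encoder = encoder                      # f(.; theta)
        self.g_m = nn.Linear(feat_dim, proj_dim)    # projection head for RSP
        self.g_a = nn.Linear(feat_dim, proj_dim)    # projection head for A-VID

    def forward(self, clip):
        x = self.encoder(clip)                      # spatial-temporal feature x
        return self.g_m(x), self.g_a(x)             # (m, a) features for the two tasks

# Joint objective of Equation (1): given loss_m and loss_a from the two tasks,
#   loss = loss_m + lam * loss_a
```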

3.2 RSP and A-VID tasks

Relative speed perception.

This task aims to maximize the similarity of two clips with the same playback speed and minimize the similarity of two clips with different playback speeds. Given a video, we sample three clips $\mathbf{c}_i$, $\mathbf{c}_j$, and $\mathbf{c}_k$ with playback speeds $s_i$, $s_j$, and $s_k$, respectively, where $s_i=s_j\neq s_k$. We feed each clip into the spatial-temporal encoder $f(\cdot;\theta)$ followed by a projection head $g_m(\cdot;\theta_m)$ to obtain their corresponding features $\mathbf{m}_i$, $\mathbf{m}_j$, $\mathbf{m}_k$. A dot product function $d(\cdot,\cdot)$ is used to measure the similarity between two clips. As clips with the same playback speed share similar motion features, we expect their features to be closer than those of clips with different playback speeds. We achieve this objective by using a triplet loss [37] as follows,

\mathcal{L}_{m}(\mathcal{V};\theta,\theta_{m}) = \max\left(0,\, \gamma - (p^{+} - p^{-})\right), \quad (2)

where $p^{+}=d(\mathbf{m}_i,\mathbf{m}_j)$, $p^{-}=d(\mathbf{m}_i,\mathbf{m}_k)$, and $\gamma>0$ is a margin. We desire that the similarity of a positive pair is larger than that of a negative pair by at least the margin $\gamma$.
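A minimal sketch of this loss in PyTorch is given below; L2-normalizing the projected features before the dot product is an assumption made here to keep the similarities bounded, not a detail stated above.

```python
import torch
import torch.nn.functional as F

def rsp_triplet_loss(m_i, m_j, m_k, margin=0.15):
    """Triplet loss of Equation (2) with dot-product similarity d(.,.).

    m_i, m_j: projected features of two clips with the same playback speed;
    m_k:      projected feature of a clip with a different speed. All (B, D).
    """
    # assumption: features are L2-normalized so the dot product lies in [-1, 1]
    m_i, m_j, m_k = (F.normalize(t, dim=-1) for t in (m_i, m_j, m_k))
    p_pos = (m_i * m_j).sum(dim=-1)                 # d(m_i, m_j)
    p_neg = (m_i * m_k).sum(dim=-1)                 # d(m_i, m_k)
    return F.relu(margin - (p_pos - p_neg)).mean()  # max(0, gamma - (p+ - p-))
```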

Algorithm 1 Training method of RSPNet
0:  Require: video set $\mathcal{V}=\{v_i\}_{i=1}^{N}$, number of negative pairs $K$ for A-VID.
1:  Initialize parameters $\theta,\theta_a,\theta_m$ for $f(\cdot;\theta)$, $g_a(\cdot;\theta_a)$, $g_m(\cdot;\theta_m)$, respectively.
2:  while not converged do
3:     Randomly sample a video $v^{+}$ from $\mathcal{V}$; extract clips $\mathbf{c}_i$, $\mathbf{c}_j$, $\mathbf{c}_k$ from $v^{+}$ with speeds $s_i$, $s_j$, $s_k$, where $s_i=s_j\neq s_k$.
4:     Sample $K$ clips $\{\mathbf{c}_n\}_{n=1}^{K}$ from the video set $\mathcal{V}\setminus\{v^{+}\}$.
5:     Extract features $\mathbf{x}_i$, $\mathbf{x}_j$, $\mathbf{x}_k$, and $\{\mathbf{x}_n\}_{n=1}^{K}$ from clips $\mathbf{c}_i$, $\mathbf{c}_j$, $\mathbf{c}_k$, $\{\mathbf{c}_n\}_{n=1}^{K}$ using the encoder $f(\cdot;\theta)$.
6:     // RSP task
7:     Obtain features $\mathbf{m}_i$, $\mathbf{m}_j$, $\mathbf{m}_k$ from $\mathbf{x}_i$, $\mathbf{x}_j$, $\mathbf{x}_k$ using $g_m(\cdot;\theta_m)$.
8:     Compute $\mathcal{L}_m$ using Equation (2).
9:     // A-VID task
10:     Obtain features $\mathbf{a}_i$, $\mathbf{a}_j$, $\{\mathbf{a}_n\}_{n=1}^{K}$ from $\mathbf{x}_i$, $\mathbf{x}_j$, $\{\mathbf{x}_n\}_{n=1}^{K}$ using $g_a(\cdot;\theta_a)$.
11:     Compute $\mathcal{L}_a$ and $\mathcal{L}$ using Equations (3) and (1), respectively.
12:     Update parameters $\theta,\theta_a,\theta_m$ via stochastic gradient descent.
13:  end while
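One iteration of Algorithm 1 can be sketched as a single PyTorch training step as follows; `rsp_triplet_loss` and `avid_infonce_loss` refer to the loss sketches accompanying Equations (2) and (3), the model is assumed to return the (m, a) projections of a clip as in the two-branch sketch above, and any memory bank of negatives is simplified into an explicit tensor of negative clips.

```python
import torch

def train_step(model, optimizer, c_i, c_j, c_k, neg_clips, lam=1.0):
    """One iteration of Algorithm 1 (a simplified sketch).

    c_i, c_j: clips from one video with the same speed; c_k: a clip from the
    same video with a different speed, each of shape (B, C, T, H, W).
    neg_clips: K clips from other videos, shape (K, C, T, H, W).
    """
    # Step 5 + Steps 7/10: shared encoder followed by the two projection heads
    m_i, a_i = model(c_i)
    m_j, a_j = model(c_j)
    m_k, _ = model(c_k)
    _, a_n = model(neg_clips)

    # Step 8: RSP loss, Equation (2)
    loss_m = rsp_triplet_loss(m_i, m_j, m_k)
    # Step 11: A-VID loss, Equation (3), and the joint loss of Equation (1)
    loss_a = avid_infonce_loss(a_i, a_j, a_n)
    loss = loss_m + lam * loss_a

    # Step 12: SGD update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```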
Table 1: Comparison of different pre-training settings on UCF101 and HMDB51 datasets. All models are pre-trained on the Kinetics-100 dataset except for the w/o pre-training setting. SP denotes speed prediction for each individual clip. VID denotes video instance discrimination without speed augmentation strategy.
Pre-training settings    | UCF101                   | HMDB51
                         | TSM-18  ResNet-18  C3D   | TSM-18  ResNet-18  C3D
w/o pre-training         | 49.7    42.3       59.0  | 17.5    19.0       24.9
w/ RSP only              | 54.5    49.7       67.2  | 26.5    25.9       29.4
w/ A-VID only            | 60.8    57.2       68.1  | 30.2    31.1       35.1
SP + A-VID               | 59.8    57.8       70.9  | 29.7    30.7       35.1
RSP + VID                | 57.5    54.2       70.8  | 30.1    29.9       34.5
RSP + A-VID (Ours)       | 61.2    60.2       71.5  | 32.2    32.6       36.3

Appearance-focused video instance discrimination.

To explicitly encourage models to learn appearance features, we propose the A-VID task to further regularize the learning process. Motivated by the fact that different clips from the same video always share similar spatial information, we extend contrastive learning from the image domain [44] to the video domain. Specifically, we sample two clips $\mathbf{c}_i$ and $\mathbf{c}_j$ from the same randomly selected video $v^{+}$ and $K$ clips $\{\mathbf{c}_n\}_{n=1}^{K}$ from $K$ videos in the subset $\mathcal{V}\setminus\{v^{+}\}$. After that, we feed each clip into the spatial-temporal encoder $f(\cdot;\theta)$ followed by a projection head $g_a(\cdot;\theta_a)$ to obtain their corresponding features. The encoder $f(\cdot;\theta)$ shares weights with the encoder in the RSP task, while the weights of the projection head $g_a(\cdot;\theta_a)$ are independent of $g_m(\cdot;\theta_m)$. We consider ($\mathbf{c}_i$, $\mathbf{c}_j$) as a positive pair and ($\mathbf{c}_i$, $\mathbf{c}_n$) as negative pairs. We then apply the InfoNCE loss [23] as the training loss:

\mathcal{L}_{a}(\mathcal{V};\theta,\theta_{a}) = -\log\frac{q^{+}}{q^{+}+\sum_{n=1}^{K}q^{-}_{n}}, \quad (3)

where $q^{+}=\exp(d(\mathbf{a}_i,\mathbf{a}_j)/\tau)$, $q^{-}_{n}=\exp(d(\mathbf{a}_i,\mathbf{a}_n)/\tau)$, and $\tau$ is a temperature hyper-parameter [44] that affects the concentration level of the distribution. Optimizing Equation (3) pulls positive pairs closer while pushing negative pairs apart.
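A sketch of this loss in PyTorch follows; as with the RSP loss, L2 normalization of the features is an assumption, and the loss is computed as cross-entropy over logits with the positive at index 0, which is equivalent to Equation (3).

```python
import torch
import torch.nn.functional as F

def avid_infonce_loss(a_i, a_j, a_neg, tau=0.07):
    """InfoNCE loss of Equation (3).

    a_i, a_j: projected features of two clips from the same video, shape (B, D);
    a_neg:    projected features of K clips from other videos, shape (K, D).
    """
    a_i = F.normalize(a_i, dim=-1)     # assumption: L2-normalized features
    a_j = F.normalize(a_j, dim=-1)
    a_neg = F.normalize(a_neg, dim=-1)
    l_pos = (a_i * a_j).sum(dim=-1, keepdim=True) / tau   # q+ logits, (B, 1)
    l_neg = a_i @ a_neg.t() / tau                          # q- logits, (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1)
    labels = torch.zeros(a_i.size(0), dtype=torch.long, device=a_i.device)
    # cross-entropy with the positive at index 0 equals -log(q+ / (q+ + sum q-))
    return F.cross_entropy(logits, labels)
```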

An underlying question is how to sample these video clips. A naive solution is to sample all clips at the same playback speed. In this case, clips $\mathbf{c}_i$ and $\mathbf{c}_j$ would share similar motion features while the motion features of $\mathbf{c}_i$ and $\mathbf{c}_n$ would be dissimilar. This may provide clues for models to tell whether two clips are from the same video or not. To encourage models to pay more attention to learning appearance features, we propose a speed augmentation strategy. Concretely, we randomize the playback speed of each clip, i.e., $s_i$, $s_j$, and $s_n$ are randomly selected from the possible playback speeds, so that motion features cannot provide effective clues for this task. In this way, models have to focus on learning other informative features, including background and object appearance, for discriminating video instances. The training method is shown in Algorithm 1.

4 Experiments

Datasets.

We pre-train models on the training set of the Kinetics-400 dataset [3], which consists of around 240K training videos covering 400 human action classes. Each video lasts about 10 seconds. To reduce the training cost of ablation studies, we build a lightweight dataset, namely Kinetics-100, by selecting the 100 classes of Kinetics-400 whose videos occupy the least disk space. The UCF101 [38] dataset consists of 13,320 videos from 101 realistic action categories on YouTube. The HMDB51 [29] dataset consists of 6,849 clips from 51 action classes. Compared with UCF101 and HMDB51, the Something-Something-V2 (Something-V2) dataset [18] contains 220,847 videos with 174 classes and focuses more on modeling temporal relationships [31, 51].

Pre-training details.

We instantiate each projection head as a fully connected layer with a 128-dimensional output. After pre-training, we drop the projection heads and use the features before them for downstream tasks. Unless otherwise stated, we sample 16 consecutive frames with a 112 × 112 spatial size for each clip, following Kim et al. [28]. Clips are augmented using random cropping with resizing, random color jitter, and random Gaussian blur [6]. We use SGD as the optimizer with a mini-batch size of 64. We train the model for 200 epochs by default. The learning rate follows a linear cosine decay schedule starting from 0.1. Following He et al. [23], we set $\tau=0.07$, $K=16384$, $\gamma=0.15$, and $\lambda=1$ for Equations (1), (2), and (3). All videos are at 25 fps. The possible playback speeds $s$ for clips in this paper are 1x (i.e., sampling frames consecutively) and 2x (i.e., the sampling interval is set to 2 frames).
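For reference, the hyper-parameters above can be gathered into a configuration sketch; the field names are illustrative and are not the configuration keys of the released code.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PretrainConfig:
    """Pre-training hyper-parameters listed above (a reference sketch)."""
    clip_len: int = 16                # consecutive frames per clip
    crop_size: int = 112              # spatial size (224 x 224 for S3D-G runs)
    proj_dim: int = 128               # output dimension of each projection head
    batch_size: int = 64
    epochs: int = 200
    base_lr: float = 0.1              # SGD, linear cosine decay
    tau: float = 0.07                 # temperature in Equation (3)
    num_negatives: int = 16384        # K in Equation (3)
    margin: float = 0.15              # gamma in Equation (2)
    lam: float = 1.0                  # lambda in Equation (1)
    speeds: Tuple[int, ...] = (1, 2)  # candidate playback speeds (sampling intervals)
    fps: int = 25
```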

Fine-tuning details.

We fine-tune our RSPNet on UCF101, HMDB51, and Something-V2 with labeled videos for action recognition. We train for 30, 70, 50 epochs on these datasets, respectively, with a learning rate of 0.01. Following Xu et al.[46], we initialize the models with the weights from the pre-trained RSPNet except for the newly appended fully-connected layer with randomly initialized weights.
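The initialization described above can be sketched as follows; the classifier attribute name `fc` and the checkpoint layout are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def init_for_finetune(model, ckpt_path, num_classes):
    """Load pre-trained RSPNet encoder weights and re-initialize the classifier.

    The projection heads used during pre-training are dropped, so missing or
    unexpected keys are tolerated when loading the checkpoint.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state, strict=False)
    # newly appended fully-connected layer with randomly initialized weights
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```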

Table 2: Comparison with other unsupervised methods on UCF101 and HMDB51 datasets. We show the backbone architecture and the pre-training dataset of each method. *We pre-train the model for 1000 epochs.
Method                Architecture   Pre-train Dataset   UCF101   HMDB51
Shuffle&Learn [35]    CaffeNet       UCF101              50.2     18.1
CMC [40]              CaffeNet       UCF101              59.1     26.7
OPN [30]              VGG            UCF101              59.8     23.8
VCP [33]              C3D            UCF101              68.5     32.5
PSP [7]               R(2+1)D        UCF101              74.8     36.8
ClipOrder [46]        R(2+1)D        UCF101              72.4     30.9
PRP [48]              R(2+1)D        UCF101              72.1     35.0
3D ST-Puzzle [28]     C3D            Kinetics-400        60.6     28.3
MAS [42]              C3D            Kinetics-400        61.2     33.4
3D ST-Puzzle [28]     ResNet-18      Kinetics-400        65.8     33.7
3DRotNet [26]         ResNet-18      Kinetics-400        66.0     37.1
DPC [20]              ResNet-18      Kinetics-400        68.2     34.5
MemDPC [21]           ResNet-34      Kinetics-400        78.7     41.2
Pace [43]             R(2+1)D        Kinetics-400        77.1     36.6
CBT [39]              S3D            Kinetics-600        79.5     44.6
CoCLR [22]            S3D            Kinetics-400        87.9     54.6
SpeedNet [1]          S3D-G          Kinetics-400        81.1     48.8
Fully supervised      S3D-G          ImageNet            86.6     57.7
Fully supervised      S3D-G          Kinetics-400        96.8     75.9
RSPNet (Ours)         C3D            Kinetics-400        76.7     44.6
RSPNet (Ours)         ResNet-18      Kinetics-400        74.3     41.8
RSPNet (Ours)         R(2+1)D        Kinetics-400        81.1     44.6
RSPNet (Ours)         S3D-G          Kinetics-400        89.9     59.6
RSPNet (Ours)         S3D-G          Kinetics-400        93.7*    64.7*

4.1 Ablation studies

Effectiveness of two pretext tasks.

In this paper, we propose two tasks, namely RSP and A-VID, to learn video representation. To verify the effectiveness of each task, we pre-train models using either RSP or A-VID on three backbone networks.

In Table 1, compared with training from scratch, using either the RSP or the A-VID task for pre-training significantly improves the action recognition performance on the UCF101 and HMDB51 datasets, which demonstrates that models learn useful clues for action recognition through pre-training on our designed pretext tasks. The improvement brought by the A-VID task is relatively larger than that brought by the pair-wise speed discrimination of RSP. The underlying reason is that the UCF101 and HMDB51 datasets focus more on appearance information than on temporal relationships [31]. Models pre-trained on A-VID are more sensitive to object appearance and background scene, while models pre-trained on RSP are more sensitive to the movement of objects. When we pre-train models on both tasks jointly, we achieve the best results on all three models. Compared with the w/o pre-training setting, we achieve absolute improvements of 11.5%, 17.9%, and 12.5% on UCF101 and 14.7%, 13.6%, and 11.4% on HMDB51 in top-1 accuracy. This demonstrates that the two pretext tasks are complementary to each other and are effective for learning video representation.

Does relative speed perception help?

As discussed in Section 1, we train models to perceive the relative speed of two clips to resolve the imprecise speed label issue. Here, we implement a variant of our method by replacing RSP with directly predicting the speed of each clip (i.e., 1x or 2x speed). We formulate it as a classification problem and use a cross-entropy loss for optimization, following Wang et al. [43]. We denote this task as speed prediction (SP). Table 1 shows that exploiting relative speed as labels consistently improves the performance on three backbone networks and two datasets compared with directly using the playback speed of each clip (SP + A-VID vs. RSP + A-VID). These results demonstrate that relative speed labels are more consistent with the motion content and help models learn more discriminative video features.

Does speed augmentation help?

Instead of naively extending the instance discrimination task from the image domain to the video domain, we propose to randomize the speed of each clip. To verify its effectiveness, we implement a variant that drops speed augmentation. We denote it as VID, as it is not appearance-focused. Table 1 shows that the speed augmentation strategy significantly improves the performance (RSP + VID vs. RSP + A-VID). The reason is that the speed augmentation strategy makes the VID task speed-agnostic. In this way, models are encouraged to pay more attention to learning appearance features. Together with the motion features learnt from the RSP task, models can extract more discriminative representations of appearance and motion, which are both important for action recognition.

4.2 Evaluation on action recognition task

Performance on UCF101 and HMDB51.

We compare our method with the state-of-the-art self-supervised learning methods in Table 2. We report top-1 accuracy on UCF101 and HMDB51 datasets together with the backbone and pre-training dataset. As the prior works use different backbone networks for experiments, we report results using the same settings as theirs for fair comparisons.

Our RSPNet achieves the best results on all backbone networks over the two datasets. Specifically, with C3D, our method outperforms MAS [42] by a large margin (76.7% vs. 61.2% on UCF101 and 44.6% vs. 33.4% on HMDB51). With ResNet-18, our method outperforms DPC [20] by 6.1% and 7.3% absolute improvement on the two datasets, respectively. With R(2+1)D, our RSPNet improves the accuracy from 77.1% to 81.1% on UCF101 and from 36.6% to 44.6% on HMDB51. For S3D-G, we follow SpeedNet [1] and use video frames of size 224 × 224 as input for pre-training and fine-tuning. Under the same settings, our RSPNet increases the accuracy from 81.1% to 89.9% on UCF101 and from 48.8% to 59.6% on HMDB51.

When we train longer (i.e., 1000 epochs), we can further improve the top-1 accuracy to 93.7% and 64.7% on the two datasets, respectively. In Figure 3, we show the curves of the pre-training losses and the performance on UCF101 for the S3D-G model using different checkpoints. As the losses decrease, the performance on the downstream task increases consistently. This demonstrates the effectiveness of the proposed RSP and A-VID tasks. The model does learn semantic representations to solve them instead of learning a trivial solution. Remarkably, without the need for any annotation during pre-training, our RSPNet outperforms the ImageNet supervised pre-trained variant (93.7% vs. 86.6%) and achieves performance close to the Kinetics supervised pre-trained model (96.8%).

Figure 3: Pre-training losses of two pretext tasks and Top-1 accuracy of UCF101 after fine-tuning. We pre-train S3D-G model on K-400 for 1000 epochs and report the results every 200 epochs.

Performance on Something-V2.

We compare our RSPNet with supervised learning methods on Something-V2, a challenging dataset in which temporal information is essential [31]. Following the settings in Lin et al. [31], we train models for 50 epochs and set the initial learning rate to 0.01 (decayed by 0.1 at epochs 20 and 40). For the supervised pre-trained models, ResNet-18 and S3D-G are pre-trained on the K-400 dataset, and C3D is pre-trained on the Sport-1M dataset [27]. Both K-400 and Sport-1M are large-scale datasets with manually annotated action labels, and thus the supervised pre-trained models are strong baselines for our unsupervised pre-trained RSPNet.

In Table 3, despite not using manual annotation, RSPNet consistently increases the accuracy compared with the randomly initialized models on three backbone architectures. Surprisingly, RSPNet even outperforms the supervised pre-trained models on ResNet-18 and C3D, increasing the accuracy from 43.7% to 44.0% and from 47.0% to 47.8%, respectively. This shows the benefits of the discriminative features learnt from the proposed two pretext tasks.

Table 3: Performance comparison on Something-V2.
                      ResNet-18   C3D    S3D-G
w/o pre-training      42.1        45.8   51.2
Fully supervised      43.7        47.0   56.8
Unsupervised (Ours)   44.0        47.8   55.0
Table 4: Video retrieval results on UCF101, measured by top-$k$ retrieval accuracy (%).
Method               Architecture   k=1    k=5    k=10   k=20   k=50
OPN [30]             OPN            19.9   28.7   34.0   40.6   51.6
Buchler et al. [2]   CaffeNet       25.7   36.2   42.2   49.2   59.5
ClipOrder [46]       R3D            14.1   30.3   40.0   51.1   66.5
SpeedNet [1]         S3D-G          13.0   28.1   37.5   49.5   65.0
VCP [33]             R(2+1)D        19.9   33.7   42.0   50.5   64.4
Pace [43]            C3D            31.9   49.7   59.2   68.9   80.2
RSPNet (Ours)        C3D            36.0   56.7   66.5   76.3   87.7
RSPNet (Ours)        ResNet-18      41.1   59.4   68.4   77.8   88.7

4.3 Evaluation on video retrieval task

Given a query video, we use nearest neighbor search to retrieve relevant videos based on the cosine similarity of their features. Specifically, following previous works [1, 46], we evenly sample 10 clips for each video and take the output of the last convolutional layer in the spatial-temporal encoder as clip-level features. Then, we perform spatial max-pooling on each clip and average-pooling over the 10 clips to obtain a video-level feature vector. We use videos in the testing set as queries to retrieve videos in the training set. We evaluate our method on split 1 of the UCF101 dataset and report the top-$k$ accuracies ($k$ = 1, 5, 10, 20, 50) as evaluation metrics. Our RSPNet is pre-trained on the K-400 dataset.
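A sketch of this retrieval protocol is given below; how the temporal dimension of the last convolutional feature map is pooled is not specified above, so the temporal average used here is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def video_feature(encoder, clips):
    """Video-level feature from 10 evenly sampled clips (a sketch).

    clips: tensor of shape (10, C, T, H, W); `encoder` is assumed to return the
    last convolutional feature map of shape (10, C', T', H', W').
    """
    fmap = encoder(clips)
    clip_feat = fmap.amax(dim=(-2, -1))   # spatial max-pooling -> (10, C', T')
    clip_feat = clip_feat.mean(dim=-1)    # temporal average (an assumption) -> (10, C')
    return clip_feat.mean(dim=0)          # average-pooling over the 10 clips -> (C',)

def topk_retrieval(query_feat, gallery_feats, k=5):
    """Indices of the k most similar gallery videos by cosine similarity."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=-1)
    return sims.topk(k).indices
```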

From Table 4, our method outperforms the state-of-the-art methods by a large margin for all values of $k$. For example, our method achieves much better performance than Pace [43] for all values of $k$ using the same C3D backbone. With ResNet-18 as the backbone network, we achieve even better retrieval performance. These results imply that the proposed pretext tasks help us learn more discriminative features for video retrieval tasks.

We further provide some retrieval results in Figure 4 as a qualitative study. For the two query clips, we successfully retrieve highly relevant videos with very similar appearance and motion. This implies that our method is able to learn both meaningful appearance and motion features for videos.

4.4 RoI visualization

As described in Section 3.1, we formulate the two pretext tasks as metric learning, which seeks to maximize the similarity of the positive pair. To better understand the clues learnt for the two pretext tasks, we visualize the region of interest (RoI) that contributes most to the similarity score using the class-activation map (CAM) technique [52]. We first describe the technical details of the visualization and then analyze the results.

Figure 4: Qualitative examples of video retrieval.

In our RSPNet, we calculate the similarity $s$ between video clip features $\mathbf{x}_i$ and $\mathbf{x}_j$ using the cosine distance, i.e., $s=(\mathbf{W}_j\mathbf{x}_j)^{\top}(\mathbf{W}_i\mathbf{x}_i)=((\mathbf{W}_j\mathbf{x}_j)^{\top}\mathbf{W}_i)\mathbf{x}_i$, where $\mathbf{W}_i\in\mathbb{R}^{128\times C}$ and $\mathbf{W}_j\in\mathbb{R}^{128\times C}$ are the parameters of the projection head $g_m$ (or $g_a$). The feature $\mathbf{x}_i$ is average-pooled from the last convolutional feature map $\mathbf{F}_i\in\mathbb{R}^{C\times H\times W\times T}$ of the spatial-temporal encoder, where $C$ is the number of channels and $H$, $W$, $T$ are the spatial-temporal sizes. In analogy with CAM [52], the similarity activation map $\mathbf{M}_s\in\mathbb{R}^{H\times W\times T}$ of clip $\mathbf{c}_i$ for the similarity score $s$ can be defined as

\mathbf{M}_{s} = ((\mathbf{W}_{j}\mathbf{x}_{j})^{\top}\mathbf{W}_{i})\,\mathbf{F}_{i}. \quad (4)

Such similarity activation maps indicate the salient regions of clip $\mathbf{c}_i$ that the model uses to figure out whether the two clips form a positive pair. We can obtain activation maps of clip $\mathbf{c}_j$ in a similar manner. For more details, please refer to Zhou et al. [52].

Although both the RSP and A-VID pretext tasks are based on the same features $\mathbf{F}_i$, we use two independent projection heads $g_m$ and $g_a$ to map $\mathbf{F}_i$ to different 128-D embedding spaces, as shown in Figure 2. Thus, the parameters of the linear layers, i.e., $\mathbf{W}_i$ and $\mathbf{W}_j$, differ between the two pretext tasks. Consequently, the activation maps can differ, and the model can focus on learning different clues to complete each specific pretext task.
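A sketch of computing the similarity activation map of Equation (4) is given below; the final min-max normalization is only for visualization and is an assumption, not part of the equation.

```python
import torch

def similarity_activation_map(W_i, W_j, x_j, F_i):
    """Similarity activation map M_s of Equation (4).

    W_i, W_j: projection-head weights, shape (128, C);
    x_j:      pooled feature of clip c_j, shape (C,);
    F_i:      last convolutional feature map of clip c_i, shape (C, H, W, T).
    """
    w = (W_j @ x_j) @ W_i                        # ((W_j x_j)^T W_i), shape (C,)
    m_s = torch.einsum("c,chwt->hwt", w, F_i)    # channel-weighted sum -> (H, W, T)
    m_s = m_s - m_s.min()                        # min-max normalize for visualization
    return m_s / (m_s.max() + 1e-8)
```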

In Figure 5, we show the heatmaps of three positive clip pairs. We use the middle frame of a clip to visualize its heatmap. For the RSP task, the heatmaps tend to cover the whole region of the action, which provides rich information for perceiving the relative speed. For the A-VID task, models tend to focus on small but discriminative regions (e.g., the striped clothes and the eyes of a baby in two of the pair samples, respectively) to identify two clips from the same video. One interesting finding is that the models are able to adaptively localize the same object even when it appears at different locations in a frame. This may provide a new perspective for person re-identification, and we leave it for future work.

Figure 5: Visualization of RoI learnt for RSP and A-VID. Our model focuses on the regions containing rich motion and appearance information for two pretext tasks, respectively. We outline the area where the heatmap is higher than a threshold with a rectangle.

5 Conclusion

In this paper, we have proposed an unsupervised video representation learning framework named RSPNet. We train models to perceive the relative playback speed between clips, which resolves the imprecise speed label issue and helps learn motion features. We also extend the instance discrimination task to the video domain and propose a speed augmentation strategy to make models focus on learning appearance features. Extensive experiments show that the features learnt by RSPNet perform well on the action recognition and video retrieval downstream tasks. Visualization of the RoI implies that RSPNet focuses on discriminative areas for the two tasks.

References

  • [1] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. Speednet: Learning the speediness in videos. In CVPR, 2020.
  • [2] Uta Büchler, Biagio Brattoli, and Björn Ommer. Improving spatiotemporal self-supervision by deep reinforcement learning. In ECCV, 2018.
  • [3] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, 2017.
  • [4] Peihao Chen, Chuang Gan, Guangyao Shen, Wenbing Huang, Runhao Zeng, and Mingkui Tan. Relation attention for temporal action localization. IEEE Trans. Multim., 22:2723–2733, 2020.
  • [5] Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, and Chuang Gan. Generating visually aligned sound from videos. IEEE Trans. Image Process., 29:8292–8302, 2020.
  • [6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. arXiv, abs/2002.05709, 2020.
  • [7] Hyeon Cho, Taehoon Kim, Hyung Jin Chang, and Wonjun Hwang. Self-supervised spatio-temporal representation learning using variable playback speed prediction. arXiv, abs/2003.02692, 2020.
  • [8] Jinwoo Choi, Chen Gao, Joseph C. E. Messou, and Jia-Bin Huang. Why can’t I dance in the mall? learning to mitigate scene bias in action recognition. In NeurIPS, 2019.
  • [9] Dave Epstein, Boyuan Chen, and Carl Vondrick. Oops! predicting unintentional action in video. In CVPR, 2020.
  • [10] Lijie Fan, Wen-bing Huang, Chuang Gan, Stefano Ermon, Boqing Gong, and Junzhou Huang. End-to-end learning of motion representation for video understanding. In CVPR, pages 6016–6025, 2018.
  • [11] Chuang Gan, Boqing Gong, Kun Liu, Hao Su, and Leonidas J. Guibas. Geometry guided convolutional neural networks for self-supervised video representation learning. In CVPR, pages 5589–5597, 2018.
  • [12] Chuang Gan, Deng Huang, Peihao Chen, Joshua B Tenenbaum, and Antonio Torralba. Foley Music : Learning to Generate Music from Videos. In ECCV, 2020.
  • [13] Chuang Gan, Chen Sun, Lixin Duan, and Boqing Gong. Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, ECCV, volume 9907, pages 849–866, 2016.
  • [14] Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alexander G. Hauptmann. Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR, pages 2568–2577, 2015.
  • [15] Chuang Gan, Ting Yao, Kuiyuan Yang, Yi Yang, and Tao Mei. You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. In CVPR, pages 923–932, 2016.
  • [16] Chuang Gan, Hang Zhao, Peihao Chen, David D. Cox, and Antonio Torralba. Self-supervised moving vehicle tracking with stereo sound. In ICCV, pages 7052–7061, 2019.
  • [17] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv, abs/1706.02677, 2017.
  • [18] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fründ, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The ”something something” video database for learning and evaluating visual common sense. In ICCV, 2017.
  • [19] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
  • [20] Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In ICCVW, 2019.
  • [21] Tengda Han, Weidi Xie, and Andrew Zisserman. Memory-augmented dense predictive coding for video representation learning. In ECCV, 2020.
  • [22] Tengda Han, Weidi Xie, and Andrew Zisserman. Self-supervised co-training for video representation learning. In NeurIPS, 2020.
  • [23] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  • [24] Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, and Chuang Gan. Location-aware graph convolutional networks for video question answering. In AAAI, pages 11021–11028, 2020.
  • [25] Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 2020.
  • [26] Longlong Jing, Xiaodong Yang, Jingen Liu, and Yingli Tian. Self-supervised spatiotemporal feature learning via video rotation prediction. arXiv, abs/1811.11387, 2018.
  • [27] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Fei-Fei Li. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • [28] Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. In AAAI, 2019.
  • [29] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
  • [30] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In ICCV, 2017.
  • [31] Ji Lin, Chuang Gan, and Song Han. TSM: temporal shift module for efficient video understanding. In ICCV, 2019.
  • [32] Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, and Shilei Wen. Attention clusters: Purely attention based local feature integration for video classification. In CVPR, pages 7834–7843, 2018.
  • [33] Dezhao Luo, Chang Liu, Yu Zhou, Dongbao Yang, Can Ma, Qixiang Ye, and Weiping Wang. Video cloze procedure for self-supervised spatio-temporal learning. In AAAI, 2020.
  • [34] Brian McFee, Luke Barrington, and Gert R. G. Lanckriet. Learning content similarity for music recommendation. IEEE Trans. Speech Audio Process., 20:2207–2218, 2012.
  • [35] Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
  • [36] Joe Yue-Hei Ng, Jonghyun Choi, Jan Neumann, and Larry S. Davis. Actionflownet: Learning motion representation for action recognition. In WACV, 2018.
  • [37] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
  • [38] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv, abs/1212.0402, 2012.
  • [39] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Learning video representations using contrastive bidirectional transformer. arXiv, abs/1906.05743, 2019.
  • [40] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv, abs/1906.05849, 2019.
  • [41] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv, abs/1807.03748, 2018.
  • [42] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In CVPR, 2019.
  • [43] Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu. Self-supervised video representation learning by pace prediction. arXiv, abs/2008.05861, 2020.
  • [44] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
  • [45] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart J. Russell. Distance metric learning with application to clustering with side-information. In NeurIPS, 2002.
  • [46] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 2019.
  • [47] Xun Yang, Meng Wang, and Dacheng Tao. Person re-identification with metric learning using privileged information. IEEE Trans. Image Process., 27:791–805, 2018.
  • [48] Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, and Qixiang Ye. Video playback rate perception for self-supervised spatio-temporal representation learning. In CVPR, 2020.
  • [49] Runhao Zeng, Chuang Gan, Peihao Chen, Wenbing Huang, Qingyao Wu, and Mingkui Tan. Breaking Winner-Takes-All : Iterative-Winners-Out Networks for Weakly Supervised Temporal Action Localization. IEEE Transactions on Image Processing, 28:5797–5808, 2019.
  • [50] Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, and Chuang Gan. Dense regression network for video grounding. In CVPR, pages 10284–10293, 2020.
  • [51] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, 2018.
  • [52] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.