
Video 3D Sampling for Self-supervised Representation Learning

Wei Li, Dezhao Luo, Bo Fang, Yu Zhou, Weiping Wang [email protected], [email protected], [email protected], [email protected], [email protected] Institute of Information Engineering, Chinese Academy of Sciences
(2021)
Abstract.

Most existing video self-supervised methods mainly leverage temporal signals, ignoring that the semantics of moving objects and the environmental information are also critical for video-related tasks. In this paper, we propose a novel self-supervised method for video representation learning, referred to as Video 3D Sampling (V3S). To sufficiently utilize the spatial and temporal information provided in videos, we pre-process a video along three dimensions (width, height, time). As a result, we can leverage the spatial information (the size of objects) and the temporal information (the direction and magnitude of motions) as our learning targets. In our implementation, we combine the sampling of the three dimensions and propose scale and projection transformations in space and time, respectively. The experimental results show that, when applied to action recognition, video retrieval and action similarity labeling, our approach improves the state of the art by significant margins.

3D sampling, self-supervised, action recognition, video retrieval
Figure 1. V3S extracts video information from three dimensions (width, height, time) and uses it as supervisory labels. For T-Sampling, frames of the video are skipped at intervals. For H-Sampling, rows of pixels in the frame are removed, which stretches the image along its width. Accordingly, W-Sampling stretches the image along its height.

1. Introduction

With the development of convolutional neural networks (CNNs), many computer vision tasks have achieved great progress in recent years. Even though supervised learning shows promising results, training CNNs requires large labeled datasets such as ImageNet (Deng et al., 2009) and Kinetics (Kay et al., 2017). However, annotating large-scale datasets is expensive and time-consuming, especially for complex annotation tasks: semantic segmentation requires a category label for each pixel, and action detection requires the boundaries and category of each action instance. From this point of view, training CNNs in a self-supervised manner is of great significance.

Recently, self-supervised learning has been proposed to utilize unlabeled data for representation learning. Typically, self-supervised methods automatically generate labels from the raw data itself and design a proxy task to predict these labels. In this manner, CNNs are encouraged to learn representative features without manual annotations, and the learned features can then be fine-tuned for downstream tasks.

For image self-supervised learning, early methods intended to learn representations by predicting the rotation angle of images (Gidaris et al., 2018), the relative patch location (Noroozi and Favaro, 2016), or the removed region of an image (Pathak et al., 2016). For video self-supervised learning, a growing body of research has focused on modeling temporal transformations of videos. (Lee et al., 2017; Misra et al., 2016; Xu et al., 2019) shuffled video frames or clips and utilized the original order as the learning target. SpeedNet (Benaim et al., 2020), PRP (Yao et al., 2020), and PacePred (Wang et al., 2020) randomly sped up the video and predicted its speed to learn representations. (Jenni et al., 2020) investigated four temporal transformations (speed, random, periodic, and warp) and demonstrated their effectiveness for guiding representation learning.

While promising results have been achieved, the above works still have some drawbacks. First, they do not sufficiently exploit the information provided by videos. The motion of an object has two parameters: magnitude and direction. For temporal representation learning, speed/playback-rate based methods (Benaim et al., 2020; Yao et al., 2020) modify the magnitude of motion and use it as the learning target, but they ignore the moving direction of objects. Moreover, previous methods in video self-supervised learning tend to focus on temporal feature learning, and most of them do not design tasks for spatial representation learning. However, appearance information, including the semantics of moving objects and environmental information, is also essential for video-related tasks. Second, the preprocessing strategies proposed by previous methods tend to destroy the video's semantic structure, resulting in unreasonable content. For example, the order-based method (Xu et al., 2019) disrupts the motion pattern and uses the original order as the learning target, but shuffling video frames seriously affects the content semantics. While the space-time cubic puzzle (Kim et al., 2019) and VCP (Luo et al., 2020) design spatial labels, they severely destroy the spatial structure. Therefore, we intend to investigate how motion direction, as well as spatial semantics, contributes to video representation learning in a simple yet effective manner.

In this paper, we propose a novel self-supervised representation learning approach referred to as Video 3D Sampling (V3S). As shown in Fig. 1, our goal is to make full use of the spatio-temporal information in videos without changing their semantics. To learn temporal features, we leverage the direction and magnitude of motions as the learning target. To learn spatial features, we apply spatial scale and spatial projection to modify the size of objects and the direction of motions. Accordingly, we apply temporal scale and temporal projection for temporal representation learning. To avoid losing too many frames in the speed-up process, which might discard the original semantic information, we propose a progressive fast-forward sampling strategy; in this way, the task of predicting the speed becomes that of predicting the changing pattern of the speed. The transformations mentioned above serve as our supervisory signals. To fully integrate spatial and temporal features, we formulate V3S as a multi-task learning framework.

The main contributions of this work are summarized as follows:

  • We propose video 3D sampling (V3S) to learn spatio-temporal representations. To comprehensively exploit the information a video contains, V3S samples videos along all three dimensions: width, height, and time.

  • For spatial representation learning, we modify the aspect ratio of objects and use it as the learning target. For temporal representation learning, we exploit the direction of motions and propose to speed up videos progressively.

  • We verify the effectiveness of V3S on 4 backbones (C3D, R3D, R(2+1)D, S3D-G) and 3 tasks (action recognition, video retrieval, action similarity labeling), demonstrating that V3S improves the state of the art by significant margins.

2. Related Work

In this section, we first introduce video representation learning in self-supervised manners. Then we introduce the recent development of video action recognition.

2.1. Self-Supervised Representation Learning

By generating pseudo labels from the raw data, self-supervised learning methods can learn rich representations without expensive human-annotated labels. Self-supervised image representation learning has witnessed rapid progress, with various proxy tasks such as jigsaw puzzles (Noroozi and Favaro, 2016), rotation prediction (Gidaris et al., 2018), colorization (Zhang et al., 2016), inpainting (Pathak et al., 2016), and context prediction (Doersch et al., 2015), to name a few. Video representation learning can be categorized into temporal representation learning and spatio-temporal representation learning.

2.1.1. Temporal Representation Learning

Existing self-supervised learning methods in image classification can be directly applied to video representation learning due to the fact that video frames are images in essence. Moreover, distinct temporal information of videos was demonstrated effective for many vision tasks (e.g. action recognition). Prior works have explored the temporal ordering of the video frames as a supervisory signal.

Based on 2D-CNNs, (Lee et al., 2017; Misra et al., 2016) took temporally shuffled frames as inputs and trained a ConvNet to sort the shuffled sequences. (Wei et al., 2018) exploited the arrow of time as a supervisory signal. In (Fernando et al., 2017), an odd-one-out network was proposed to identify the unrelated or odd clips from a set of otherwise related clips. Recently, video self-supervised learning performance has been largely boosted due to 3D-CNNs (e.g. C3D (Tran et al., 2015), S3D-G (Xie et al., 2018)).

VCOP (Xu et al., 2019) extended the 2D frame ordering pretext tasks to 3D clip ordering. In SpeedNet (Benaim et al., 2020) and PRP (Yao et al., 2020), the network was trained to predict the video playback rate, which proved effective for learning about foreground moving objects. Similarly, (Wang et al., 2020) introduced a pace prediction task that includes the novel option of slow motion. Furthermore, (Jenni et al., 2020) investigated multiple temporal transformations (speed, periodic, warp, etc.) to build a useful representation of videos for action recognition.

Figure 2. Illustration of the V3S framework. Given a raw video, spatial and temporal transformations are applied to it in turn. A backbone extracts the feature of the transformed video, which is then fed to two FC layers to predict the specific category in space and in time separately. $O$ denotes the original video, $S_S$ the spatial scale transformation, $S_P$ spatial projection, $T_S$ temporal scale, and $T_P$ temporal projection. The details of the parameters are discussed in Sec. 3.

2.1.2. Spatio-Temporal Representation Learning

Despite the success of temporal representation learning, spatial transformations applied to video frames are still necessary: spatial representation learning focuses more on the appearance of objects, while temporal representation learning tends to learn motion patterns. Several spatio-temporal self-supervised methods have been proposed recently. (Kim et al., 2019) trained 3D-CNNs by completing space-time cubic puzzles. (Jing et al., 2018) proposed 3DRotNet, which uses the rotation angle as a supervisory signal and extends the rotation operation from images to videos. (Wang et al., 2019) proposed to regress both motion and appearance statistics along spatial and temporal dimensions for representation learning. VCP (Luo et al., 2020) designed the video cloze procedure task to exploit the spatio-temporal information of videos to full advantage. In addition, future frame prediction (Han et al., 2019, 2020a) is another commonly considered approach for video representation learning.

Self-supervised learning combined with contrastive learning has recently demonstrated promising results, e.g., (Wang et al., 2020). In this paper, we focus on designing pure proxy tasks and leave the potential extension to contrastive learning for future research.

2.2. Video Action Recognition

Action recognition is one of the most important tasks for video understanding. It takes a video clip as the input and outputs the specific action category of the video. Since the dynamic information is complex to understand, action recognition is challenging.

Based on 2D CNN feature extractors, (Simonyan and Zisserman, 2014) proposed two-stream convolutional networks in which the results of the RGB stream and the optical flow stream are fused. TSN (Wang et al., 2016) extracted multiple clips from a video and utilized video-level supervision over the whole action video. (Zhou et al., 2018) built temporal dependencies among video frames for action recognition.

Recently, 3D CNN feature extractors have attracted much attention due to their strong temporal modeling ability. C3D (Tran et al., 2015) designed 3D convolutional kernels, which model spatial and temporal features simultaneously. R3D (Tran et al., 2018) extended C3D with ResNet (He et al., 2016). S3D-G (Xie et al., 2018) replaced the 3D convolutions at the bottom of the network with low-cost 2D convolutions.

3. Methods

Recent methods use sampling interval (Benaim et al., 2020) or clip order (Xu et al., 2019) as the learning target to learn temporal features. Specifically, (Benaim et al., 2020) generates speed-up videos by interval sampling, which enhances the magnitude of the motions. In this work, in addition to learning the magnitude of motions, we also leverage the direction of motions as one of our learning targets. Moreover, spatial semantics are used in our methods, which are ignored by previous methods.

Our goal is to encourage CNNs to learn rich spatial and temporal representations. For spatial representation learning, we apply scale and projection transformations. For temporal representation learning, we further extend scale and projection transformation on the temporal dimension. We are going to describe these transformations in the following.

3.1. Spatial Transformation

To encourage the model to learn spatial representations, we design two transformations on the appearance of video clips. Let $I(x,y)$ be the original frame and $I(u,v)$ be the transformed frame. A conversion formula maps $(x,y)$ to $(u,v)$:

$$u=\frac{m_{0}x+m_{1}y+m_{2}}{m_{6}x+m_{7}y+1},\qquad v=\frac{m_{3}x+m_{4}y+m_{5}}{m_{6}x+m_{7}y+1}.$$

The matrix $\begin{bmatrix}m_{0}&m_{1}&m_{2}\\ m_{3}&m_{4}&m_{5}\\ m_{6}&m_{7}&1\end{bmatrix}$ is calculated by solving the linear system

$$\begin{bmatrix}x_{0}&y_{0}&1&0&0&0&-x_{0}u_{0}&-y_{0}u_{0}\\ x_{1}&y_{1}&1&0&0&0&-x_{1}u_{1}&-y_{1}u_{1}\\ x_{2}&y_{2}&1&0&0&0&-x_{2}u_{2}&-y_{2}u_{2}\\ x_{3}&y_{3}&1&0&0&0&-x_{3}u_{3}&-y_{3}u_{3}\\ 0&0&0&x_{0}&y_{0}&1&-x_{0}v_{0}&-y_{0}v_{0}\\ 0&0&0&x_{1}&y_{1}&1&-x_{1}v_{1}&-y_{1}v_{1}\\ 0&0&0&x_{2}&y_{2}&1&-x_{2}v_{2}&-y_{2}v_{2}\\ 0&0&0&x_{3}&y_{3}&1&-x_{3}v_{3}&-y_{3}v_{3}\end{bmatrix}\cdot\begin{bmatrix}m_{0}\\ m_{1}\\ m_{2}\\ m_{3}\\ m_{4}\\ m_{5}\\ m_{6}\\ m_{7}\end{bmatrix}=\begin{bmatrix}u_{0}\\ u_{1}\\ u_{2}\\ u_{3}\\ v_{0}\\ v_{1}\\ v_{2}\\ v_{3}\end{bmatrix},$$

where $(x_{i},y_{i})$ and $(u_{i},v_{i})$ $(i=0,1,2,3)$ are the coordinates of the four vertices of the original and the transformed frame, respectively. In detail, we first take the coordinates $(x,y)$ of the four vertices of the original frame, and then specify the coordinates $(u,v)$ of the corresponding vertices after transformation. The coefficients $m_{i}$ $(i=0,\dots,7)$ are obtained by solving the linear system above, so every pixel in the original frame can be mapped to its corresponding location in the transformed frame.

To make it simple, we set $(x_{0},y_{0})=(0,0)$, $(x_{1},y_{1})=(0,H)$, $(x_{2},y_{2})=(W,H)$, $(x_{3},y_{3})=(W,0)$, where $W$ and $H$ denote the width and the height of the original frame. We also set $(u_{0},v_{0})=(0,0)$. In the following, we describe the details of the transformations.
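As an illustration, this warp can be implemented with OpenCV, whose cv2.getPerspectiveTransform solves exactly this linear system from the four vertex correspondences; a minimal sketch (the helper name and the output-size choice are ours, not from the paper):

```python
import cv2
import numpy as np

def warp_frame(frame, dst_corners):
    """Warp a frame so its corners (0,0), (0,H), (W,H), (W,0) map to dst_corners."""
    h, w = frame.shape[:2]
    src = np.float32([[0, 0], [0, h], [w, h], [w, 0]])
    dst = np.float32(dst_corners)
    m = cv2.getPerspectiveTransform(src, dst)  # the 3x3 matrix built from m0..m7
    out_w = int(np.ceil(dst[:, 0].max()))      # bounding size of the warped frame
    out_h = int(np.ceil(dst[:, 1].max()))
    return cv2.warpPerspective(frame, m, (out_w, out_h))
```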

Scale: In order to change the size (aspect ratio) of an object, we modify the height or width of the video frame. Note that we do not scale the height and width equally (at the same rate), because that would only change the resolution of the image, which is trivial to learn. Moreover, if the network can still infer an object's original proportion after we change its aspect ratio, this demonstrates that the network has learned the semantic information of the object.

In scale, we set $(u_{1},v_{1})=(0,bH)$, $(u_{2},v_{2})=(aW,bH)$, $(u_{3},v_{3})=(aW,0)$, which means the width and height of the transformed frame are $a$ and $b$ times those of the original frame. In our implementation, $(a,b)$ is the hyperparameter and the learning target. Fig. 2 shows an example with $(a,b)=(1,0.3)$.

Projection: Projection transforms a rectangle into a trapezoid. The head end of the trapezoid is shorter than the tail end, which has the effect of expanding objects close to the camera and shrinking distant objects. In order to change the sizes of objects in different regions at different rates, we apply projection to the video frame.

To transform the frame into a trapezoid, we randomly choose one side as the head end and shorten its length. For example, taking the right side as the head end, we set $(u_{1},v_{1})=(0,H)$, $(u_{2},v_{2})=(W,(H+cH)/2)$, $(u_{3},v_{3})=(W,(H-cH)/2)$. The transformed frame is then a trapezoid whose tail end is the left side and whose head end is the right side, with head-end length $cH$. For spatial projection, $c$ and the choice of the head-end side are the learning targets. Fig. 2 shows an example with $c=0.5$ and the right side as the head end.

Through spatial transformations, we can change the size (aspect ratio) of objects uniformly (scale) or non-uniformly (projection) while maintaining the semantic information. Fig. 3 shows an example of a three-frame video with an object moving in a straight line: the spatial transformations also modify the direction of the motion.
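Under the vertex convention above, the two spatial transformations reduce to choosing destination corners; a minimal sketch (the helper names are ours), whose outputs can be fed to a perspective warp such as the one sketched earlier:

```python
def scale_corners(w, h, a, b):
    # Spatial scale: the transformed frame is a*W wide and b*H high,
    # so the aspect ratio changes whenever a != b (e.g. (a, b) = (1, 0.3)).
    return [(0, 0), (0, b * h), (a * w, b * h), (a * w, 0)]

def projection_corners(w, h, c):
    # Spatial projection with the right side as the head end: the right
    # edge is shortened to c * H, turning the frame into a trapezoid.
    return [(0, 0), (0, h), (w, (h + c * h) / 2), (w, (h - c * h) / 2)]
```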

Figure 3. The left (video frames) shows moving objects in videos. The right (motions) shows the magnitude and direction of the motions. V3S can leverage the direction of motions as the learning target.

3.2. Temporal Transformation

To encourage the model to learn rich temporal representations, two temporal transformations on video clips are carried out. We design different frame sampling strategies for different temporal transformations.

Scale: The spatial scale transformation modifies the size of an object along its width or height. Accordingly, temporal scale changes the duration of an action. To achieve this, we speed up a video by interval sampling and use the sampling interval as the learning target: for speed $s$, $s-1$ frames are skipped between consecutive sampled frames. Given a video $V=\{v_{i}\}$, where $v_{i}$ is the $i$-th frame of $V$, we generate $V^{s}$, which plays at $s$ times the speed of $V$, as $V^{s}=[v_{r},v_{r+s},v_{r+2s},\dots,v_{r+(l-1)s}]$, where $r+(l-1)s$ does not exceed the length of $V$, $r$ is a random start frame in $V$, and $l$ is the length of $V^{s}$. In our implementation, $l=16$. For temporal scale, the speed $s$ is the learning target. Fig. 2 shows an example with $s=2$.
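A minimal sketch of this sampling, assuming `video` is an indexable sequence of frames long enough for the requested speed (the helper name is ours):

```python
import random

def temporal_scale(video, s, l=16):
    # Keep every s-th frame starting from a random frame r: s - 1 frames are
    # skipped between consecutive samples, so the clip plays at s times speed.
    r = random.randint(0, len(video) - (l - 1) * s - 1)
    return [video[r + i * s] for i in range(l)]
```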

Projection: The spatial projection scales objects in different regions of the frame at different rates. Accordingly, we propose temporal projection, which progressively speeds up a video so that it has different playback rates in different stages. In our implementation, we use a multi-stage speed-up mode, i.e., a single video contains multiple speeds. To generate a training sample from a video $V$, we first sample $l_{1}$ frames with $s_{1}-1$ frames skipped between consecutive samples, then $l_{2}$ frames with $s_{2}-1$ frames skipped. The total length of $V^{p}$ is $l=l_{1}+l_{2}$; in our implementation, $l_{1}=l_{2}=8$. Given a specific speed pattern $p=(s_{1},s_{2})$ and $V$, we generate $V^{p}$ as $V^{p}=[V^{s_{1}},V^{s_{2}}]$, where $V^{s_{1}}$ and $V^{s_{2}}$ are generated in the same way as in the temporal scale. For temporal projection, $p$ is the learning target. Fig. 2 shows an example with $p=(1,2)$.
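Temporal projection concatenates two such stages sampled at different speeds; a minimal sketch under the same assumptions:

```python
import random

def temporal_projection(video, s1, s2, l1=8, l2=8):
    # Stage 1: l1 frames at speed s1; stage 2: l2 frames at speed s2,
    # continuing from where stage 1 stopped.
    r = random.randint(0, len(video) - l1 * s1 - l2 * s2)
    stage1 = [video[r + i * s1] for i in range(l1)]
    start2 = r + l1 * s1
    stage2 = [video[start2 + i * s2] for i in range(l2)]
    return stage1 + stage2
```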

With temporal transformation, we speed up the video straightforwardly (scale) or progressively (projection), which encourages the model to capture rich temporal representations.

3.3. Representation Learning

Given a video clip, we first apply a spatial transformation and then a temporal transformation on it. Then we feed it to a backbone to extract features and use a multi-task network to predict the specific transformation.

Feature Extraction: To extract video representations, we choose C3D (Tran et al., 2015), R3D, R(2+1)D (Tran et al., 2018), and S3D-G (Xie et al., 2018) as backbones. C3D stacks five 3D convolution blocks with 3 × 3 × 3 convolution kernels in all layers. Within the framework of residual learning, an R3D block consists of two 3D convolution layers followed by batch normalization and ReLU layers. R(2+1)D is a ResNet with (2+1)D convolutions, which decompose a full 3D convolution into a 2D convolution followed by a 1D convolution. Unlike many other 3D CNNs, S3D-G replaces many of the 3D convolutions, especially those at the bottom of the network, with low-cost 2D convolutions, and exhibits distinguished feature extraction ability for action recognition.
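For reference, the (2+1)D factorization mentioned above replaces a full 3 × 3 × 3 convolution with a 1 × 3 × 3 spatial convolution followed by a 3 × 1 × 1 temporal convolution; a minimal PyTorch sketch (the intermediate width `mid` is a free design choice, not a value from the paper):

```python
import torch.nn as nn

def conv2plus1d(in_ch, out_ch, mid):
    # Factorizes a full 3x3x3 convolution into a spatial 1x3x3 convolution
    # followed by a temporal 3x1x1 convolution, as in R(2+1)D.
    return nn.Sequential(
        nn.Conv3d(in_ch, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        nn.BatchNorm3d(mid),
        nn.ReLU(inplace=True),
        nn.Conv3d(mid, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
    )
```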

Category Prediction: We cast the prediction as a classification task. For each transformation, we fix the parameters and treat each parameter setting as a specific category for classification. Note that, to make the most of the spatial and temporal transformations, we formulate V3S as a multi-task network.

The feature extracted by the 3D CNN is fed to two fully connected (FC) layers, which complete the prediction. The output of each FC layer is a probability distribution over the transformation categories. With $a_{i}$ the $i$-th output of the FC layer for a transformation, the probabilities are as follows:

$$p_{i}=\frac{\exp(a_{i})}{\sum^{n}_{j=1}\exp(a_{j})}$$

where $p_{i}$ is the probability that the transformation belongs to class $i$, and $n$ is the number of transformations. We update the parameters of the network by minimizing the regularized cross-entropy loss of the predictions:

$$\mathcal{L}=-\sum^{n}_{i=1}y_{i}\log(p_{i})$$

where $y_{i}$ is the ground truth. Let $\mathcal{L}^{S}$ be the cross-entropy loss for spatial transformation prediction and $\mathcal{L}^{T}$ the loss for temporal transformation prediction. The objective function of V3S is:

$$\mathcal{L}_{\mathcal{ST}}=\mathcal{L}^{S}+\mathcal{L}^{T}$$
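A minimal PyTorch sketch of this multi-task objective, assuming `backbone` maps a clip to a pooled feature vector; the class name `V3SHead`, the feature dimension, and the category counts are illustrative, not from the paper:

```python
import torch.nn as nn

class V3SHead(nn.Module):
    def __init__(self, backbone, feat_dim, n_spatial, n_temporal):
        super().__init__()
        self.backbone = backbone
        self.fc_spatial = nn.Linear(feat_dim, n_spatial)    # spatial categories
        self.fc_temporal = nn.Linear(feat_dim, n_temporal)  # temporal categories

    def forward(self, clip):
        feat = self.backbone(clip)                          # (B, feat_dim)
        return self.fc_spatial(feat), self.fc_temporal(feat)

def v3s_loss(logits_s, logits_t, y_s, y_t):
    # L_ST = L_S + L_T: cross-entropy applies the softmax and log-loss above.
    ce = nn.CrossEntropyLoss()
    return ce(logits_s, y_s) + ce(logits_t, y_t)
```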
Figure 4. Visualization of the motion speediness curve of the cliff diving action video. Cliff diving action can be divided into 3 stages: 1. Before the jumping, the motion speediness curve is constant around 0; 2. Moment of jumping, the speediness curve rises rapidly from 0 to 0.9; 3. Falling, the speediness curve constantly fluctuates around 0.9. Our V3S tries to capture the speed patterns at different action stages.

3.4. Discussion

In Fig. 4, we show the motion speediness curve (Benaim et al., 2020) of a cliff diving action. As one can see, the speediness of the action differs across stages rather than being a constant value: taking cliff diving as an example, there is a rapid rise in motion speediness from stage 1 to stage 3. We argue that if the network can sense this variation of speed, it can better understand the characteristics of an action. V3S utilizes temporal projection to capture the variation of speed, thereby remedying a deficiency of previous speed-based methods.

4. Experiment

4.1. Experimental Setting

4.1.1. Datasets

In our experiments, we use four datasets: the UCF101 (Soomro et al., 2012), the HMDB51 (Kuehne et al., 2011), the Kinetics-400 (K-400) (Kay et al., 2017) and the ASLAN (Kliper-Gross et al., 2011) to evaluate the effectiveness of our method.

UCF101 is a widely used dataset in action recognition task, which is collected from websites including Prelinger archive, YouTube and Google videos. The dataset contains 101 action categories with 9.5k videos for training and 3.5k videos for testing.

HMDB51 is extracted from a variety of sources ranging from digitized movies to YouTube, which contains 3.4k videos for training and 1.4k videos for testing with 51 action categories.

Kinetics-400 contains 246K train videos. The videos are collected from realistic YouTube videos.

ASLAN is an action similarity labeling dataset. It includes 3,631 videos in over 400 action categories. The goal of the action similarity labeling task is to estimate whether two videos present the same action or not, and the dataset is composed of video pairs with "same" or "not-same" labels. This task is challenging because its test set only contains videos of never-before-seen actions. We use this task to verify the spatio-temporal feature extraction capabilities of our model.

Method Value Acc.
$S_S$ {(1,1.15), (1,1.3), (1,1.45), (1.15,1), (1.3,1), (1.45,1)} 73.2
$S_S$ {(1,1.3), (1,1.6), (1,1.9), (1.3,1), (1.6,1), (1.9,1)} 71.1
$S_S$ {(1,1.45), (1,1.9), (1,2.35), (1.45,1), (1.9,1), (2.35,1)} 72.8
$S_P$ {0.8, 0.75, 0.7} 73.5
$S_P$ {0.8, 0.7, 0.6} 73.6
$S_P$ {0.8, 0.65, 0.5} 73.7
$T_S$ {1, 2} 71.0
$T_S$ {1, 2, 3} 76.4
$T_S$ {1, 2, 3, 4} 76.3
$T_P$ {(1,2), (2,3), (2,1), (3,2)} 75.0
$T_P$ {(1,2), (2,3), (3,4), (2,1), (3,2), (4,3)} 76.8
$T_P$ {(1,2), (2,3), (3,4), (4,5), (2,1), (3,2), (4,3), (5,4)} 77.0
Table 1. Evaluation of V3S with R(2+1)D under different parameters. $S_S$ denotes spatial scale, $S_P$ spatial projection, $T_S$ temporal scale, $T_P$ temporal projection.
Method Network Input Size Params Pre-train Dataset UCF101 HMDB51
Random C3D 112 × 112 58.3M UCF101 63.7 24.7
VCP(Luo et al., 2020) C3D 112 × 112 58.3M UCF101 68.5 32.5
PRP(Yao et al., 2020) C3D 112 × 112 58.3M UCF101 69.1 34.5
V3S(Ours) C3D 112 × 112 58.3M UCF101 74.8 34.9
Random R3D 112 × 112 33.6M UCF101 54.5 23.4
ST-puzzle(Kim et al., 2019) R3D 224 × 224 33.6M Kinetics-400 65.8 33.7
VCP(Luo et al., 2020) R3D 112 × 112 33.6M UCF101 66.0 31.5
PRP(Yao et al., 2020) R3D 112 × 112 33.6M UCF101 66.5 29.7
V3S(Ours) R3D 112 × 112 33.6M UCF101 74.0 38.0
Random R(2+1)D 112 × 112 14.4M UCF101 55.8 22.0
VCP(Luo et al., 2020) R(2+1)D 112 × 112 14.4M UCF101 66.3 32.2
PRP(Yao et al., 2020) R(2+1)D 112 × 112 14.4M UCF101 72.1 35.0
PacePred(Wang et al., 2020) w/o Ctr R(2+1)D 112 × 112 14.4M UCF101 73.9 33.8
PacePred(Wang et al., 2020) R(2+1)D 112 × 112 14.4M UCF101 75.9 35.0
PacePred(Wang et al., 2020) R(2+1)D 112 × 112 14.4M Kinetics-400 77.1 35.0
V3S(Ours) R(2+1)D 112 × 112 14.4M UCF101 79.1 38.7
V3S(Ours) R(2+1)D 112 × 112 14.4M Kinetics-400 79.2 40.4
SpeedNet(Benaim et al., 2020) S3D-G 112 × 112 9.6M Kinetics-400 81.1 48.8
CoCLR(Han et al., 2020b) S3D 128 × 128 9.6M UCF101 81.4 52.1
V3S(Ours) S3D-G 112 × 112 9.6M UCF101 85.4 53.2
Table 2. Action recognition accuracy on UCF101 and HMDB51.

4.1.2. Implementation Details

For 3D sampling, we first transform the raw video and then sample a 16-frame clip from it. Each frame is resized to 224 × 224 and randomly cropped to 112 × 112. Specially for spatial projection, if the head end is shorter than 224, we simply crop the frame to $l\times l$, where $l$ is the length of the head end, and then resize it to 112 × 112.

We set the initial learning rate to 0.01, the momentum to 0.9, and the batch size to 32. The pre-training process stops after 300 epochs and the model with the best validation accuracy is used for downstream tasks. Specifically, to match the input requirement of S3D-G, each frame is first resized to 256 × 256 and then randomly cropped to 224 × 224. In addition, we set the learning rate to 0.005 for R3D for better convergence. For UCF101, the training set of the first split is used in our pre-training stage, where we randomly choose 800 videos for validation. For Kinetics-400, we use its training set to train our self-supervised model and randomly select 3000 samples to build the validation set; the batch size for Kinetics-400 is 16 and we train for 70 epochs.

Method $S_S$ $S_P$ $S_S$+$S_P$ $T_S$ $T_P$ $T_S$+$T_P$ V3S
Acc. 73.2 73.7 73.8 76.4 77.0 78.0 79.1
Table 3. Combining spatial and temporal transformations. V3S denotes $S_S+S_P+T_S+T_P$.

4.2. Ablation Study

In this section, we evaluate the effectiveness and discuss the hyperparameters of the designed four transformations on the first split of UCF101. For simplicity, we choose R(2+1)D as the backbone for our ablation studies.

As shown in Table 1, we conduct extensive experiments for the selection of each parameter. In Table 1 (top), we discuss the influence of the hyperparameters for the spatial transformations, i.e., $(a,b)$ for spatial scale $S_S$ and $c$ for spatial projection $S_P$. For $S_S$, we randomly select $(a,b) \in$ {(1,1.15), (1,1.3), (1,1.45), (1.15,1), (1.3,1), (1.45,1)} in the following experiments, because this setting demonstrates the best performance (73.2% vs. 71.1%/72.8%). For $S_P$, we accordingly select the projection magnitude $c \in$ {0.8, 0.65, 0.5} in the following experiments.

In Table 1 (bottom), we discuss the hyperparameters for the temporal transformations, i.e., $s$ for $T_S$ and $p$ for $T_P$. For $T_S$, we randomly select $s$ from $\{1,2\}$, $\{1,2,3\}$, or $\{1,2,3,4\}$. Sampling $s \in \{1,2,3\}$ gives the best performance (76.4% vs. 71.0%/76.3%), so we set the sampling speed $s \in \{1,2,3\}$ in the following experiments. For $T_P$, we accordingly select the speed pattern $p \in$ {(1,2), (2,3), (3,4), (4,5), (2,1), (3,2), (4,3), (5,4)} in the following experiments.

To combine the designed transformations, we integrate $S_S$, $S_P$, $T_S$, and $T_P$ in Table 3. After combining spatial scale $S_S$ and spatial projection $S_P$, V3S achieves 73.8%. V3S also shows better performance (78.0%) when combining $T_S$ and $T_P$. After combining all four spatial and temporal transformations, an accuracy of 79.1% is achieved, which surpasses the accuracy of the individual spatial or temporal transformations. In summary, the spatial and temporal transformations are complementary; with their combination as the final proxy task, more powerful representations can be learned.

Method Top1 Top5 Top10 Top20 Top50
Jigsaw 19.7 28.5 33.5 40.0 49.4
OPN 19.9 28.7 34.0 40.6 51.6
Büchler 25.7 36.2 42.2 49.2 59.5
C3D(random) 16.7 27.5 33.7 41.4 53.0
C3D(VCP(Luo et al., 2020)) 17.3 31.5 42.0 52.6 67.7
C3D(PRP(Yao et al., 2020)) 23.2 38.1 46.0 55.7 68.4
C3D(PacePred(Wang et al., 2020)) 20.0 37.4 46.9 58.5 73.1
C3D(V3S) 21.8 39.0 47.7 57.0 69.2
R3D(random) 9.9 18.9 26.0 35.5 51.9
R3D(VCP(Luo et al., 2020)) 18.6 33.6 42.5 53.5 68.1
R3D(PRP(Yao et al., 2020)) 22.8 38.5 46.7 55.2 69.1
R3D(PacePred(Wang et al., 2020)) 19.9 36.2 46.1 55.6 69.2
R3D(V3S) 28.3 43.7 51.3 60.1 71.9
R(2+1)D(random) 10.6 20.7 27.4 37.4 53.1
R(2+1)D(VCP(Luo et al., 2020)) 19.9 33.7 42.0 50.5 64.4
R(2+1)D(PRP(Yao et al., 2020)) 20.3 34.0 41.9 51.7 64.2
R(2+1)D(PacePred(Wang et al., 2020)) 17.9 34.3 44.6 55.5 72.9
R(2+1)D(V3S) 23.1 40.5 48.7 58.5 72.4
R(2+1)D(V3S*) 23.5 40.0 49.4 59.7 73.9
S3D-G(SpeedNet*(Benaim et al., 2020)) 13.0 28.1 37.5 49.5 65.0
S3D-G(V3S) 16.6 32.2 41.8 52.3 68.0
Table 4. Video retrieval performance on UCF101. Methods marked with * are pretrained with Kinetics-400.

4.3. Action Recognition

Utilizing self-supervised pre-training to initialize action recognition models is an established and effective way to evaluate the representations learned via self-supervised tasks. To verify the effectiveness of our method, we conduct experiments on the action recognition task. We initialize the backbone with the V3S pre-trained model and initialize the fully connected layer randomly. Following the protocol of (Xu et al., 2019), backbones are pre-trained for 300 epochs and the fine-tuning procedure stops after 160 epochs. For testing, we sample 10 clips for each video and average the predicted probabilities to obtain the final action category.
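A minimal sketch of this 10-clip test-time protocol, assuming `model` maps a single-clip batch to class logits and `clips` holds the 10 sampled clip tensors:

```python
import torch

@torch.no_grad()
def predict_video(model, clips):
    # Average the per-clip class probabilities and take the most likely action.
    probs = torch.stack([torch.softmax(model(c.unsqueeze(0)), dim=1)[0]
                         for c in clips])
    return probs.mean(dim=0).argmax().item()
```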

For C3D, R3D and R(2+1)D, we set the initial learning rate to 0.001. The number of clip frames is 16 and each frame is first resized to 128 × 171 and randomly cropped to 112 × 112. For S3D-G, we set the initial learning rate to 0.01, the input clip length is 64 frames, and each frame is first resized to 256 × 256 and randomly cropped to 224 × 224. For action recognition, the batch size is set to 8 and the momentum to 0.9.

Table 2 shows the split-1 accuracy on UCF101 and HMDB51 for the action recognition task. With S3D-G pre-trained on UCF101, V3S obtains 85.4% and 53.2% on UCF101 and HMDB51 respectively, outperforming CoCLR (Han et al., 2020b) by 4.0% and 1.1%. With R(2+1)D pre-trained on Kinetics-400, V3S achieves 79.2% and 40.4% on UCF101 and HMDB51, outperforming PacePred (Wang et al., 2020) by 2.1% and 5.4%. With C3D and R3D, V3S obtains 74.8%/34.9% and 74.0%/38.0%, respectively. Note that V3S is purely pretext-task based and does not incorporate contrastive learning; nevertheless, it performs better than a series of methods with contrastive losses such as PacePred (Wang et al., 2020) and CoCLR (Han et al., 2020b), which further verifies its effectiveness.

Method Top1 Top5 Top10 Top20 Top50
C3D(random) 7.4 20.5 31.9 44.5 66.3
C3D(VCP(Luo et al., 2020)) 7.8 23.8 35.5 49.3 71.6
C3D(PRP(Yao et al., 2020)) 10.5 27.2 40.4 56.2 75.9
C3D(PacePred(Wang et al., 2020)) 8.0 25.2 37.8 54.4 77.5
C3D(V3S) 11.1 26.5 38.0 52.0 73.0
R3D(random) 6.7 18.3 28.3 43.1 67.9
R3D(VCP(Luo et al., 2020)) 7.6 24.4 36.6 53.6 76.4
R3D(PRP(Yao et al., 2020)) 8.2 25.8 38.5 53.3 75.9
R3D(PacePred(Wang et al., 2020)) 8.2 24.2 37.3 53.3 74.5
R3D(V3S) 10.8 30.6 42.3 56.2 77.1
R(2+1)D(random) 4.5 14.8 23.4 38.9 63.0
R(2+1)D(VCP(Luo et al., 2020)) 6.7 21.3 32.7 49.2 73.3
R(2+1)D(PRP(Yao et al., 2020)) 8.2 25.3 36.2 51.0 73.0
R(2+1)D(PacePred(Wang et al., 2020)) 10.0 24.6 37.6 54.4 77.1
R(2+1)D(V3S) 9.6 24.0 37.2 54.3 77.9
R(2+1)D(V3S*) 9.8 26.9 38.5 52.7 72.2
Table 5. Video retrieval performance on HMDB51. Methods marked with * are pretrained with Kinetics-400. The best results are bold, and the second best results are underlined.

4.4. Video Retrieval

To further validate the effectiveness of V3S, we adopt video retrieval as another downstream task. In the retrieval process, we extract features at the last pooling layer of the backbones. For each clip in the testing split, we query the top-K nearest videos from the training set by computing the cosine similarity between feature vectors. A retrieval is counted as correct when the testing video and a retrieved video come from the same category.
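A minimal sketch of this retrieval protocol, assuming the pooled features and category labels are stored as NumPy arrays (the function name and signature are ours):

```python
import numpy as np

def topk_retrieval_accuracy(train_feats, train_labels, test_feats, test_labels, k):
    # L2-normalize so that a dot product equals cosine similarity.
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                        # (num_test, num_train)
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of the k nearest videos
    hits = (train_labels[topk] == test_labels[:, None]).any(axis=1)
    return hits.mean()
```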

Video retrieval results on UCF101 and HMDB51 are listed in Table 4 and Table 5, respectively. V3S outperforms state-of-the-art methods with different backbones on most metrics from Top-1 to Top-50. These results indicate that, in addition to providing a good weight initialization for downstream models, V3S can also extract high-quality and discriminative spatio-temporal features.

Features Type Acc.
C3D Supervised 78.3
HOF Hand-crafted 56.7
HNF Hand-crafted 59.5
HOG Hand-crafted 59.8
STS (Wang et al., 2019), R3D Self-supervised 60.9
V3S, R3D Self-supervised 65.4
Table 6. Action similarity accuracy on ASLAN.

4.5. Action Similarity Labeling

In this section, we exploit action similarity labeling task (Kliper-Gross et al., 2011) to verify the quality of the learned spatio-temporal representations from another perspective on the ASLAN dataset(Kliper-Gross et al., 2011). Unlike action recognition task, the action similarity labeling task focuses on action similarity (same/not-same). The model needs to determine whether the action categories of the two videos are the same. This task is quite challenging as the test set contains never-before-seen actions.

Refer to caption
Figure 5. Attention maps of video frames.

To evaluate on the action similarity labeling task, we use the pre-trained model to extract features from video pairs and use a linear SVM for the binary classification. Specifically, for each video pair, each video is first split into 16-frame clips with a stride of 8. The network takes these clips as input and extracts clip-level features from the pool3, pool4 and pool5 layers. The clip-level features are then averaged and L2-normalized to obtain a spatio-temporal video-level feature for each layer. To measure the similarity of two videos, we calculate 12 different distances for each feature type as described in (Kliper-Gross et al., 2011). A 36-dimensional descriptor is obtained by concatenating the distances of the three feature types, and it is normalized so that each distance is on the same scale, following (Tran et al., 2015). Finally, a linear SVM determines whether the two videos belong to the same category.
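A minimal sketch of this pipeline, assuming the per-layer video-level features have already been extracted; only a few illustrative distances are shown in place of the 12 distances of (Kliper-Gross et al., 2011), and the helper names are ours:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def pair_descriptor(feats_a, feats_b):
    # feats_a / feats_b: dicts mapping layer name -> L2-normalized video feature.
    dists = []
    for layer in ("pool3", "pool4", "pool5"):
        a, b = feats_a[layer], feats_b[layer]
        dists += [np.dot(a, b),            # cosine similarity (features are L2-normalized)
                  np.linalg.norm(a - b),   # euclidean distance
                  np.abs(a - b).sum()]     # L1 distance
    return np.array(dists)

def train_similarity_classifier(descriptors, same_labels):
    # Scale each distance to a comparable range, then fit a linear SVM.
    clf = make_pipeline(StandardScaler(), LinearSVC())
    return clf.fit(descriptors, same_labels)
```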

Action similarity results on ASLAN are listed in Table 6. The results show that V3S with R3D outperforms the previous methods and further narrows the gap between supervised and self-supervised methods. This demonstrates that the features extracted by the V3S network have excellent intra-class similarity and inter-class dissimilarity.

4.6. Visualization

In order to gain a better understanding of what V3S learns, we adopt the Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017), an improved version of CAM(Zhou et al., 2016) to visualize the attention map.

Fig. 5 shows samples of such heat maps (baseball pitching, archery, band marching, baseball pitch, walking with dog, wall pushups, typing). One can see that the highly activated regions of these heat maps correlate strongly with the movement of the actions. For example, in different stages of the archery action, the activation area varies along with the motion: when the man takes out the arrow, the activation concentrates on the left hand drawing the arrow, and when he draws the bow, the activation moves to the right hand.

5. Conclusion

In this paper, we propose a novel self-supervised method referred to as V3S to obtain rich spatio-temporal features without human annotations. To fully utilize the information in videos, V3S applies spatial scale and spatial projection to uniformly or non-uniformly scale the objects in a video, and temporal scale and temporal projection to straightforwardly or progressively speed up a video. Experimental results show the effectiveness of V3S on downstream tasks such as action recognition, video retrieval and action similarity labeling. Our work offers two insights for video understanding: pure pretext-task based self-supervised learning, without contrastive learning, has not yet reached its upper bound, and powerful representations can be learned from relatively small datasets such as UCF101.

References

  • Benaim et al. (2020) Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. 2020. SpeedNet: learning the speediness in videos. In Proceedings of the IEEE conference on computer vision and pattern recognition. 9922–9931.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition. 248–255.
  • Doersch et al. (2015) Carl Doersch, Abhinav Gupta, and Alexei A Efros. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision. 1422–1430.
  • Fernando et al. (2017) Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. 2017. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3636–3645.
  • Gidaris et al. (2018) Spyros Gidaris, Praveer Singh, and Nikos Komodakis. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018).
  • Han et al. (2019) Tengda Han, Weidi Xie, and Andrew Zisserman. 2019. Video representation learning by dense predictive coding. In Proceedings of the IEEE international conference on computer vision, workshop. 0–0.
  • Han et al. (2020a) Tengda Han, Weidi Xie, and Andrew Zisserman. 2020a. Memory-augmented dense predictive coding for video representation learning. In Proceedings of the european conference on computer vision. 312–329.
  • Han et al. (2020b) Tengda Han, Weidi Xie, and Andrew Zisserman. 2020b. Self-supervised co-training for video representation learning. arXiv preprint arXiv:2010.09709 (2020).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Jenni et al. (2020) Simon Jenni, Givi Meishvili, and Paolo Favaro. 2020. Video representation learning by recognizing temporal transformations. In Proceedings of the european conference on computer vision. 425–442.
  • Jing et al. (2018) Longlong Jing, Xiaodong Yang, Jingen Liu, and Yingli Tian. 2018. Self-supervised spatiotemporal feature learning via video rotation prediction. arXiv preprint arXiv:1811.11387 (2018).
  • Kay et al. (2017) Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
  • Kim et al. (2019) Dahun Kim, Donghyeon Cho, and In So Kweon. 2019. Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 8545–8552.
  • Kliper-Gross et al. (2011) Orit Kliper-Gross, Tal Hassner, and Lior Wolf. 2011. The action similarity labeling challenge. IEEE transactions on pattern analysis and machine Intelligence 34, 3 (2011), 615–621.
  • Kuehne et al. (2011) Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE international conference on computer vision. IEEE, 2556–2563.
  • Lee et al. (2017) Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. 2017. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE international conference on computer vision. 667–676.
  • Luo et al. (2020) Dezhao Luo, Chang Liu, Yu Zhou, Dongbao Yang, Can Ma, Qixiang Ye, and Weiping Wang. 2020. Video cloze procedure for self-supervised spatio-temporal learning. In Proceedings of the AAAI conference on artificial intelligence. 11701–11708.
  • Misra et al. (2016) Ishan Misra, C Lawrence Zitnick, and Martial Hebert. 2016. Shuffle and learn: Unsupervised learning using temporal order verification. In Proceedings of the european conference on computer vision. Springer, 527–544.
  • Noroozi and Favaro (2016) Mehdi Noroozi and Paolo Favaro. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the european conference on computer vision. Springer, 69–84.
  • Pathak et al. (2016) Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2536–2544.
  • Selvaraju et al. (2017) Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision. 618–626.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems. 568–576.
  • Soomro et al. (2012) Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
  • Tran et al. (2015) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE international conference on computer vision. 4489–4497.
  • Tran et al. (2018) Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6450–6459.
  • Wang et al. (2019) Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. 2019. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4006–4015.
  • Wang et al. (2020) Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu. 2020. Self-supervised video representation learning by pace prediction. In Proceedings of the european conference on computer vision. Springer, 504–521.
  • Wang et al. (2016) Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the european conference on computer vision. Springer, 20–36.
  • Wei et al. (2018) Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. 2018. Learning and using the arrow of time. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8052–8060.
  • Xie et al. (2018) Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. 2018. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the european conference on computer vision. 305–321.
  • Xu et al. (2019) Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. 2019. Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition. 10334–10343.
  • Yao et al. (2020) Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, and Qixiang Ye. 2020. Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6548–6557.
  • Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. 2016. Colorful image colorization. In Proceedings of the european conference on computer vision. Springer, 649–666.
  • Zhou et al. (2018) Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. 2018. Temporal relational reasoning in videos. In Proceedings of the european conference on computer vision. 803–818.
  • Zhou et al. (2016) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2921–2929.