
Long-Short Temporal Contrastive Learning of Video Transformers

Jue Wang1           Gedas Bertasius2           Du Tran1           Lorenzo Torresani1,3
1Facebook AI Research           2UNC Chapel Hill           3Dartmouth
Abstract

Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.

1 Introduction

Since the introduction of AlexNet [36], deep convolutional neural networks (CNNs) have emerged as the prominent model in numerous computer vision tasks [30, 21, 55, 22, 66, 65]. More recently, the Transformer model [62] has received much attention due to its impressive performance in the field of natural language processing (NLP) [15]. While CNNs rely on the local operation of convolution, the building block of transformers is self-attention [62], which is particularly effective at modeling long-range dependencies. In the image domain, the Vision Transformer (ViT) [16] was proposed as a convolution-free architecture which uses self-attention between non-overlapping patches in all layers of the model. ViT was shown to be competitive with state-of-the-art CNNs on the task of image categorization. In the last few months, several adaptations of ViT to video have been proposed [6, 45, 3]. In order to capture salient temporal information from the video, these works typically extend the self-attention mechanism to operate along the time axis in addition to within each frame. Since video transformers have a larger number of parameters and fewer inductive biases compared to CNNs, they typically require large-scale pretraining on supervised image datasets, such as ImageNet-21K [53] or JFT [3], in order to achieve top performance.

Self-supervised learning has been shown to be an effective solution to eliminate the need for large-scale supervised pretraining of transformers both in NLP [15] as well as in image analysis [60, 9]. In this work, we show that, even in the video domain, self-supervised learning provides an effective way to pretrain video transformers. Specifically, we introduce Long-Short Temporal Contrastive Learning (LSTCL), a contrastive formulation that maximizes representation similarity between a long video clip (say, 8 seconds long) and a much shorter clip (say, 2 seconds long) where both clips are sampled from the same video. We argue that by training the short-clip representation to match the long-clip representation, the model is forced to extrapolate from a short extent the contextual information exhibited in the longer temporal span. As the long clip includes temporal segments not included in the short clip, this self-supervised strategy trains the model to anticipate the future and to predict the past from a small temporal window in order to match the representation extracted from the long clip. We believe that this is a good pretext task for video representation learning, as it can be accomplished only by a successful understanding and recognition of the structure and correlation of atomic actions in a long video. Furthermore, such a framework is particularly suitable for video transformers, as they have been recently shown to effectively capture long-term temporal cues [6]. In this work we demonstrate that these long-term temporal cues can be effectively encoded into a short-range clip-level representation, leading to a substantial improvement in video classification performance.

To demonstrate the generality of our findings, we experiment with two different video transformer architectures whose code is publicly available. The first is TimeSformer [6], which reduces the computational cost of self-attention over the 3D video volume by means of a space-time factorization. The second architecture is the Swin transformer [39], which we further extend into a 3D version, dubbed Space-Time Swin transformer, that computes hierarchical spatiotemporal self-attention by using 3D shifting windows. We show that our unsupervised LSTCL pretraining scheme allows both of these video transformers to outperform their respective counterparts pretrained with full supervision on the large-scale ImageNet-21K dataset.

The contributions of this paper can be summarized as follows:

  • We introduce Long-Short Temporal Contrastive Learning (LSTCL), which encodes temporal context from a long view of the video into a short-range clip representation.

  • We demonstrate that for recent video transformer models, our proposed LSTCL pretraining provides an effective alternative to large-scale supervised pretraining on images.

  • We propose a Space-Time Swin transformer for spatiotemporal feature learning, and show that it achieves strong results on multiple action recognition benchmarks.

2 Related work

Self-supervised Learning in Images. Early attempts at self-supervised visual representation learning used a variety of pretext tasks, such as image rotation prediction [35], auto-encoder learning [63, 49, 56], or solving jigsaw puzzles [46]. In comparison, recent approaches in self-supervised learning leverage contrastive learning [29, 11, 12, 14, 9, 52, 28]. The idea is to generate two views of the same image through data augmentation and then minimize the distance between their representations while, optionally, maximizing the distance to other images [11, 29]. One disadvantage of contrastive learning is that it requires a large number of negative examples, which implies a large batch size [11] or the use of a memory bank [29]. To tackle the high computational cost of such contrastive approaches, several recent methods proposed to eliminate the reliance on negative samples [13, 14, 26, 8].

Self-supervised Learning in Videos. Several methods for self-supervised video representation learning focused on predictive spatiotemporal ordering tasks [23, 33, 1, 58, 69, 44, 71, 59, 38, 72, 31]. Other approaches have leveraged temporal cues such as tempo and speed to define self-supervised pretext tasks [5, 67]. Just like in the image domain, more recent approaches [20, 27, 51] adopt contrastive learning objectives. Our method also falls into the category of contrastive approaches. In comparison to prior contrastive video methods, we propose a contrastive formulation where a positive pair is generated from a short clip and a long clip, both of which are sampled from the same video. This pushes our model to learn a short clip-level representation that captures global video-level context.

The approach that is most closely related to our own is the BraVe system [52]. BraVe shares the same underlying idea of training a model to match a long (broad) view to a short (narrow) view of the same video. However, our work differs in several aspects. First of all, our main focus is to leverage self-supervised learning as a means to train video transformers without labeled image data, while BraVe is applied to 3D CNNs. Video transformers are emerging as a competitive alternative to 3D CNNs. However, as discussed earlier, they suffer from the limitation of requiring image-based supervised pretraining. Thus, we believe that this is an important and timely problem to address. Additionally, we note that our LSTCL is considerably simpler than BraVe: while our model uses shared parameters, a single projection network, and a single prediction network, BraVe requires separate backbones, separate projection networks, and separate prediction networks for the two views in order to achieve the best performance; furthermore, while LSTCL can be applied with any traditional contrastive loss (as demonstrated by our experiments with MoCo v3, BYOL, and SimSiam), BraVe uses a combination of two specific regression objectives (broad-to-narrow and narrow-to-broad) and employs distinct augmentation strategies for the two views. Despite the bare-bone simplicity of our learning formulation, we demonstrate that it delivers impressive results, elevating the accuracy of video transformers to the state-of-the-art on challenging action classification benchmarks without the need for any supervised image-level pretraining.

Transformers in Vision. Transformer-based models [62, 15] currently define the state-of-the-art for the majority of natural language processing (NLP) tasks. Similarly, there have also been several attempts to adopt transformer-based architectures for vision problems. Initially, these attempts focused on architectures mixing convolution with self-attention [68, 70, 7, 32, 73]. The recent introduction of the Vision Transformer (ViT) [16] has demonstrated that it is possible to achieve competitive image classification results with a convolution-free architecture. To improve the data efficiency of the original ViT, Touvron et al. [60] proposed a training recipe based on distillation. Lastly, the recently introduced Swin transformer [39] significantly reduces the number of parameters and the cost of ViT by employing local rather than global self-attention.

ViT models were also adapted to the video domain by introducing different forms of spatiotemporal self-attention [6, 3, 2, 50]. However, due to their large number of parameters, these models typically require large amounts of training data, which usually comes in the form of a large-scale labeled dataset (such as ImageNet or JFT). To address this issue, Fan et al. [17] introduced a multi-scale vision transformer (MViT), which uses a much smaller number of parameters and can be successfully trained from scratch. Instead of reducing the model capacity, as done in MViT, we show that it is possible to train large-capacity video transformer models without any external data by means of our proposed LSTCL self-supervised learning framework.

3 Video Transformers

Several recent attempts have been made to extend ViT to the video domain [6, 3, 17, 2, 50]. Most video transformers share common principles, which we review below. We then discuss specific designs differentiating the video transformers considered in our experiments.

3.1 Overview

Linear and positional embeddings. Each patch $\boldsymbol{p}_{(i,t)}$ is linearly embedded into a feature vector $\boldsymbol{z}^{0}_{(i,t)}\in\mathbb{R}^{D}$ by means of a learnable matrix $W\in\mathbb{R}^{D\times(P^{2}\cdot C)}$ and a learnable vector $\boldsymbol{e}_{(i,t)}\in\mathbb{R}^{D}$ representing the spatial-temporal positional embedding: $\boldsymbol{z}^{0}_{(i,t)}=W\boldsymbol{p}_{(i,t)}+\boldsymbol{e}_{(i,t)}$.
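As a concrete illustration, the embedding step can be sketched as follows in PyTorch (a minimal sketch under assumed tensor shapes; the class and argument names are illustrative and not taken from the authors' code):

```python
import torch
import torch.nn as nn

# Minimal sketch of the embedding step described above (illustrative names).
# Each flattened P x P x C patch p_(i,t) is mapped to a D-dimensional token and
# a learnable space-time positional embedding e_(i,t) is added.
class PatchEmbedding(nn.Module):
    def __init__(self, num_tokens, patch_dim, embed_dim):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)                      # W in R^{D x (P^2 * C)}
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))   # e_(i,t) in R^D

    def forward(self, patches):
        # patches: (batch, num_tokens, P*P*C), num_tokens = patches per frame * num frames
        return self.proj(patches) + self.pos                             # z^0 = W p + e
```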

Multi-headed attention. The multi-headed self-attention (MHA) is the key component of the transformer. It implements the query-key-value computation for each patch, and it is interleaved with layer normalization [4] (LN) and a multilayer perceptron (MLP) within each block $\ell$. Thus, the intermediate representation $\boldsymbol{z}^{\ell}$ for a patch in block $\ell$ is obtained from its features in the previous block, as:

\tilde{\boldsymbol{z}}^{\ell}=\text{MHA}(\text{LN}(\boldsymbol{z}^{\ell-1}))+\boldsymbol{z}^{\ell-1}   (1)
\boldsymbol{z}^{\ell}=\text{MLP}(\text{LN}(\tilde{\boldsymbol{z}}^{\ell}))+\tilde{\boldsymbol{z}}^{\ell}.   (2)
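For reference, a minimal sketch of the pre-norm block of Eqs. (1)-(2) is given below; the module names and the 4x MLP expansion are assumptions rather than the authors' exact implementation:

```python
import torch.nn as nn

# Minimal sketch of the pre-norm transformer block of Eqs. (1)-(2).
class TransformerBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, z):
        # Eq. (1): z_tilde = MHA(LN(z^{l-1})) + z^{l-1}
        y = self.norm1(z)
        z = self.attn(y, y, y, need_weights=False)[0] + z
        # Eq. (2): z^l = MLP(LN(z_tilde)) + z_tilde
        return self.mlp(self.norm2(z)) + z
```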

Classification. As in BERT [15], a classification token $\boldsymbol{p}_{(0,0)}$ is added at the beginning of the input sequence. In the last layer of the network, a linear layer with softmax activation function is attached to the classification token in order to output the final classification probabilities.

3.2 TimeSformer

TimeSformer [6] extends ViT [16] to the video domain. It uses two independent multi-head attention blocks for spatial and temporal self-attention. As shown in Figure 1, the spatial self-attention compares the query patch only to image patches appearing in the same frame. Conversely, the temporal self-attention compares the query patch to the image patches in the same spatial location but from the other frames. The decomposition over space and time dramatically reduces the cost of self-attention compared to a dense comparison over all pairs of patches of the video. Thus, the feature representation is computed as:

\boldsymbol{z}^{\ell}_{t}=\text{MHA}_{Time}(\text{LN}(\boldsymbol{z}^{\ell-1}))+\boldsymbol{z}^{\ell-1}   (3)
\boldsymbol{z}^{\ell}_{s}=\text{MHA}_{Space}(\text{LN}(\boldsymbol{z}^{\ell}_{t}))+\boldsymbol{z}^{\ell}_{t}
\boldsymbol{z}^{\ell}=\text{MLP}(\text{LN}(\boldsymbol{z}^{\ell}_{s}))+\boldsymbol{z}^{\ell}_{s}
Figure 1: An illustration of the self-attention mechanisms in TimeSformer [6] and Space-Time (ST) Swin Transformer. Each column in the figure depicts a different self-attention block. Patches that have the same color are compared during the self-attention computation.
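To make the divided space-time attention of Eq. (3) concrete, the following is a minimal sketch assuming tokens arranged as (B, T, N, D) with N patches per frame; the classification token is omitted for brevity and all names are illustrative:

```python
import torch.nn as nn

# Minimal sketch of divided space-time self-attention (Eq. 3), illustrative only.
class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm_t, self.norm_s, self.norm_mlp = (
            nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim))
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):                                   # z: (B, T, N, D)
        B, T, N, D = z.shape
        # Temporal attention: one length-T sequence per spatial location.
        zt = z.permute(0, 2, 1, 3).reshape(B * N, T, D)
        y = self.norm_t(zt)
        zt = self.attn_t(y, y, y, need_weights=False)[0] + zt
        z = zt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: one length-N sequence per frame.
        zs = z.reshape(B * T, N, D)
        y = self.norm_s(zs)
        zs = self.attn_s(y, y, y, need_weights=False)[0] + zs
        z = zs.reshape(B, T, N, D)
        # Feed-forward with residual connection.
        return self.mlp(self.norm_mlp(z)) + z
```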

3.3 Space-Time Swin Transformer

Compared to ViT, the Swin Transformer [39] applies self-attention locally. The features are learned hierarchically by aggregating information from local neighborhoods of patches in each layer. Here we adapt the original Swin transformer, which was introduced for still images, to video. We name this new variant Space-Time Swin Transformer (ST Swin). Instead of considering 2D neighborhoods of image patches for self-attention computation, ST Swin uses local 3D space-time volumes. Specifically, as proposed in the original paper [39], ST Swin uses two distinct self-attention mechanisms: uniform partition and shifted partition. In our case, both of these self-attention schemes are adapted to video by considering the temporal dimension in the local patch neighborhoods. As shown in Figure 1, the uniform partition splits the entire clip into 4 non-overlapping 3D sections, with the patches in each section sharing the same partition index. Spatiotemporal self-attention is then computed between image patches that have the same partition index. Similarly, the shifted partition generates multiple non-overlapping 3D sections at different scales, and spatiotemporal patches within each section are compared for self-attention computation. The uniform partition and the shifted partition are stacked to form two successive attention blocks, which implement cross-window connections, further increasing the model capacity. Thus, the complete transformation carried out in each layer $\ell$ of the ST Swin transformer can be summarized as follows:

\boldsymbol{z}^{\ell}_{u}=\text{MHA}_{Uniform}(\text{LN}(\boldsymbol{z}^{\ell-1}))+\boldsymbol{z}^{\ell-1}   (4)
\boldsymbol{z}^{\ell}=\text{MLP}(\text{LN}(\boldsymbol{z}^{\ell}_{u}))+\boldsymbol{z}^{\ell}_{u}
\boldsymbol{z}^{\ell+1}_{s}=\text{MHA}_{Shift}(\text{LN}(\boldsymbol{z}^{\ell}))+\boldsymbol{z}^{\ell}
\boldsymbol{z}^{\ell+1}=\text{MLP}(\text{LN}(\boldsymbol{z}^{\ell+1}_{s}))+\boldsymbol{z}^{\ell+1}_{s}

We adopt the 3D relative positional embedding and the patch merging strategy used in Swin [39]. However, we only merge image patches along the spatial axis while maintaining fixed temporal resolution through the layers.
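As an illustration of how such space-time windows can be formed, the sketch below partitions a grid of patch features into local spatial windows that extend over the full temporal extent of the clip; the function name and tensor layout are assumptions for illustration, not the authors' implementation:

```python
# Minimal sketch of a space-time window partition for ST Swin: each local window
# covers a window_size x window_size spatial neighborhood extended over all T
# frames, so self-attention compares patches across frames within the same
# spatial neighborhood (illustrative layout and naming).
def space_time_window_partition(z, window_size):
    # z: (B, T, H, W, D) grid of patch features; H and W divisible by window_size.
    B, T, H, W, D = z.shape
    z = z.reshape(B, T, H // window_size, window_size, W // window_size, window_size, D)
    # -> (B * num_windows, T * window_size * window_size, D): one token sequence
    #    per space-time window, over which local self-attention is computed.
    windows = z.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, T * window_size * window_size, D)
    return windows
```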

4 Long-Short Temporal Contrastive Learning

Overview. Video transformers have been shown to be particularly effective at long-range temporal modeling [6]. Our aim is to design a contrastive learning framework that exploits this characteristic. Our proposed Long-Short Temporal Contrastive Learning (LSTCL) framework takes as input a pair of clips sampled from the same video: a long clip and a short clip. The procedure trains the video transformer to match the representation of the short clip to the representation of the long clip. This forces the model to predict the future and the past from a small temporal window, which is beneficial for capturing the general structure of the video. Below we describe specific details related to our LSTCL.

Given a batch $B$ of unlabeled training videos, we randomly sample a short clip and a long clip from each of them. While both the long and the short clip include a total of $T$ frames, we use largely different sampling temporal strides $\tau_{S}$ and $\tau_{L}$ with $\tau_{S}<\tau_{L}$ in order for the long clip to cover a much longer temporal extent than the short clip. The sets of short and long clips in the batch $B$ are denoted as $X_{S}=\{x_{S}^{1},x_{S}^{2},\dots,x_{S}^{B}\}$ and $X_{L}=\{x_{L}^{1},x_{L}^{2},\dots,x_{L}^{B}\}$, respectively, where $x_{S}^{i}$ and $x_{L}^{i}$ represent the short clip and the long clip sampled from the $i$-th example in the batch. The set of short clips is processed by an encoder $f_{q}$ to yield a set of "query" examples $Q=\{q^{1},q^{2},\dots,q^{B}\}$ where $q^{i}=f_{q}(x_{S}^{i})\in\mathbb{R}^{D}$. The set of long clips is processed by a separate encoder $f_{k}$ to produce "key" examples $K=\{k^{1},k^{2},\dots,k^{B}\}$. We optimize the encoders to yield similar query-key representations for pairs consisting of a long clip and a short clip taken from the same video, and dissimilar representations for cases where the long clip and the short clip are sampled from different videos. This is achieved by adopting an InfoNCE [47] loss on the sets $Q$ and $K$:

\mathcal{L}_{NCE}=\sum_{i}-\log\frac{\exp({q^{i}}^{\top}k^{i}/\rho)}{\exp({q^{i}}^{\top}k^{i}/\rho)+\sum_{j\neq i}\exp({q^{i}}^{\top}k^{j}/\rho)}   (5)

where $\rho$ is a temperature hyperparameter that controls the sharpness of the output distribution. As commonly done [14, 13, 26, 8], we symmetrize the loss function. In our case this is achieved by adding to the loss term above a dual term obtained by reversing the roles of the long and the short clips, i.e., by computing queries from long clips $q^{i}=f_{q}(x_{L}^{i})$ and keys from short clips $k^{i}=f_{k}(x_{S}^{i})$. The encoder $f_{q}$ consists of a video transformer backbone, an MLP projection head, and an additional MLP prediction head. The purpose of the prediction layer is to transform the representation of the query clip to match the key. The encoder $f_{k}$ consists of a video transformer backbone and an MLP projection head. Our experiments present results obtained with different contrastive learning optimizations to update the parameters of $f_{q}$ and $f_{k}$. In the case of our default optimization based on MoCo v3 [14], the parameters of $f_{q}$ are updated by minimizing $\mathcal{L}_{NCE}$ via backpropagation, while the parameters of $f_{k}$ are updated as a moving average of the parameters of $f_{q}$. We refer the reader to our supplementary materials for details of the optimizations based on the other contrastive learning frameworks considered in our experiments, BYOL and SimSiam.
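A minimal sketch of one LSTCL training step under the default MoCo v3-style optimization is shown below; `f_q` and `f_k` denote the query and key encoders defined above, features are assumed to be l2-normalized, and all function names are illustrative:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the symmetrized LSTCL objective (Eq. 5), illustrative only.
def info_nce(q, k, rho):
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)     # assumed l2-normalization
    logits = q @ k.t() / rho                                 # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)        # positives on the diagonal
    return F.cross_entropy(logits, labels)

def lstcl_loss(f_q, f_k, x_short, x_long, rho=0.2):
    # Queries from short clips, keys from long clips ...
    loss = info_nce(f_q(x_short), f_k(x_long).detach(), rho)
    # ... plus the dual term with the roles of the two views reversed.
    return loss + info_nce(f_q(x_long), f_k(x_short).detach(), rho)

@torch.no_grad()
def momentum_update(online_params, key_params, m=0.99):
    # The key-encoder parameters track a moving average of the corresponding
    # online-encoder (backbone + projection) parameters.
    for p_q, p_k in zip(online_params, key_params):
        p_k.mul_(m).add_(p_q, alpha=1.0 - m)
```

In MoCo-style training, the momentum update would typically be applied after each optimizer step on the parameters of $f_{q}$.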

Figure 2: We study how the temporal extents of the short and the long view (controlled by $\tau_{S}$ and $\tau_{L}$) in LSTCL affect the video-level accuracy on Kinetics-400. We can see that, for each choice of $\tau_{S}$, accuracy monotonically increases as the long stride $\tau_{L}$ is made larger. The best result is obtained for $\tau_{S}=8$ and $\tau_{L}=32$, corresponding to a long view that is 4 times longer than the short view.

Clip Sampling Strategy. Since we want our model to be able to extrapolate the context observed in the entire video from the brief extent of the short clip, we propose to sample the long and the short clip at random and independently from each video. By doing so, the learning cannot leverage any synchrony between the two clips; and because the temporal offset is random for every pair of long-short samples, the optimization forces the short-clip representation to encode as much as possible of the context exhibited over the entire video. To demonstrate the value of random independent sampling, in our ablation study we contrast this strategy (named "Random Independent") to two alternative schemes. The first, named "Random Included," consists of sampling the short clip at random but such that it falls completely within the temporal extent spanned by the long clip (which is sampled first at random). The second, named "Random Disjoint," samples the two clips at random but enforces the constraint that they cannot overlap at all, i.e., they are completely disjoint. We refer the reader to our experiments, which validate our hypothesis that random independent sampling is indeed the superior strategy for long-short temporal contrastive learning of video transformers.
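A minimal sketch of the "Random Independent" strategy is given below; the function names and the frame-index convention are illustrative assumptions:

```python
import random

# Minimal sketch: the short and the long clip are drawn at random and
# independently of each other from the same video.
def sample_clip_indices(num_video_frames, T, tau):
    span = (T - 1) * tau + 1                 # temporal extent covered by the clip
    start = random.randint(0, max(0, num_video_frames - span))
    return [min(start + i * tau, num_video_frames - 1) for i in range(T)]

def sample_long_short_pair(num_video_frames, T=8, tau_s=8, tau_l=32):
    short_idx = sample_clip_indices(num_video_frames, T, tau_s)
    long_idx = sample_clip_indices(num_video_frames, T, tau_l)   # independent draw
    return short_idx, long_idx
```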

Implementation Details. We implement LSTCL under three popular contrastive learning frameworks: BYOL [26], MoCo v3 [14], and SimSiam [13]. For training we adopt the video data augmentations described in [20] using clips of size $224\times 224\times 8$ sampled from the video. We experiment with two video transformer architectures: TimeSformer with Divided Space-Time attention [6] and our adaptation of the Swin-B model [39] to video (Space-Time Swin). We use the AdamW [43] optimizer, which is commonly used for training vision transformer models [6, 9, 14, 3, 2, 16, 60]. In our default setup, we train LSTCL for 200 epochs on the 240K videos of Kinetics-400 [34] using linear warm-up [24] during the first 40 epochs. We apply a cosine decay schedule [42] after the warm-up, and the learning rate is set to $lr\times BatchSize/256$. We adopt the base learning rate and weight decay from [14]. Our experiments are run on 64 V100 GPUs with a distributed training setup in PyTorch [48]. The training of 200 epochs takes about three days.
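For clarity, the assumed learning-rate schedule (linear warm-up followed by cosine decay, with the base rate scaled by BatchSize/256) can be sketched as:

```python
import math

# Minimal sketch of the assumed epoch-level learning-rate schedule.
def lr_at_epoch(epoch, base_lr, batch_size, total_epochs=200, warmup_epochs=40):
    peak_lr = base_lr * batch_size / 256.0
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs        # linear warm-up
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay
```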

5 Experiments

We evaluate our proposed LSTCL on several action recognition benchmarks: Kinetics-400 [34], Kinetics-600 [10], Something-Something-V2 [25] (SSv2), HMDB51 [37], and UCF101 [57]. Our experimental setup is as follows. First, we perform self-supervised LSTCL pretraining on Kinetics-400 with clips of $T=8$ frames but using distinct temporal sampling strides for the short view and the long view, so that the two views effectively span temporal extents of different lengths in seconds. Afterwards, we finetune the LSTCL-pretrained model for 200 epochs in a fully supervised fashion on each of these datasets. During inference, we uniformly sample 5 clips with center cropping from each video and average the clip-level predictions to perform video-level classification. In the following ablation studies, unless otherwise noted, we adopt TimeSformer as the backbone in our LSTCL with an input clip of size $8\times 224\times 224$.
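A minimal sketch of this inference protocol is given below; `model` and the clip-sampling helper are assumed to be defined elsewhere:

```python
import torch

# Minimal sketch of video-level inference: 5 clips sampled uniformly in time,
# each center-cropped, with softmax predictions averaged.
# `sample_center_cropped_clip(video, idx, num_clips)` is an assumed helper that
# returns a (C, T, H, W) clip tensor.
@torch.no_grad()
def video_level_prediction(model, video, num_clips=5):
    probs = []
    for idx in range(num_clips):
        clip = sample_center_cropped_clip(video, idx, num_clips).unsqueeze(0)  # (1, C, T, H, W)
        probs.append(model(clip).softmax(dim=-1))
    return torch.stack(probs).mean(dim=0)   # averaged video-level class scores
```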

5.1 Ablation Studies

Importance of the Temporal Extent. We first ablate the choice of $\tau_{S}$ and $\tau_{L}$ for self-supervised training, while keeping the finetuning temporal stride fixed to the value $\tau=8$ (i.e., sampling one frame every 8 from the video, starting from a random frame). Figure 2 shows how different combinations of $\tau_{S}$ and $\tau_{L}$ affect the final video-level accuracy on Kinetics-400. For ease of interpretation we split the visualization of results over 4 distinct plots, representing 4 different values of $\tau_{S}$: $\tau_{S}\in\{4,8,16,32\}$. Each plot shows how the final video-level accuracy varies for different temporal stride values $\tau_{L}$ of the long clip, where $\tau_{L}\geq\tau_{S}$ and $\tau_{S}$ is kept fixed. There are two important observations we can make from these results. The first is that, for each choice of $\tau_{S}$, the larger the gap between the two strides (i.e., the larger the value of $\tau_{L}-\tau_{S}$), the higher the accuracy. This can be seen in the first three plots, where the accuracy curve monotonically increases as $\tau_{L}$ is made larger starting from the initial value of $\tau_{L}=\tau_{S}$. This validates the importance of contrasting views of different temporal lengths during self-supervised pretraining. The second observation is that our model performs best when $\tau_{S}=8$ and $\tau_{L}=32$. This result makes intuitive sense, as a short clip sampled with $\tau_{S}=8$ is temporally long enough to allow predicting the context of the long clip; at the same time, it is short enough to allow the method to use a long view that is significantly longer (up to 4 times longer than the short view). Conversely, choosing a larger value of $\tau_{S}$ (i.e., 16 or 32) reduces the maximum possible gap $\tau_{L}-\tau_{S}$ between the two views, while choosing a smaller value of $\tau_{S}$ (i.e., 4) would cause the contrastive learning between the two views to be overly difficult due to the excessive brevity of the short clip.

$\tau_{S}$ | $\tau_{L}$ | Accuracy
4 | {8, 16, 32} | 73.9
{4, 8, 16} | 32 | 74.8
8 | {8, 16, 32} | 75.5
{8, 16, 32} | 32 | 75.9
{4, 8, 16, 32} | {4, 8, 16, 32} | 75.6
{8, 16, 32} | {8, 16, 32} | 76.0
8 | 32 | 76.6
Table 1: We analyze the potential benefits of randomly sampling either $\tau_{S}$ and/or $\tau_{L}$ (for the short and the long clips, respectively). Accuracy is measured for video-level classification on Kinetics-400 after pretraining with our LSTCL system using MoCo v3. The best result is still achieved for fixed values of $\tau_{S}=8$ and $\tau_{L}=32$.

In Table 1, we include additional performance points corresponding to settings where $\tau_{S}$ and/or $\tau_{L}$ are sampled randomly for each training video clip. Specifically, the first row in the table shows the performance of our system when $\tau_{S}=4$ and $\tau_{L}$ is sampled randomly from $\{8,16,32\}$; the second row represents the opposite setting where $\tau_{L}$ is kept fixed ($\tau_{L}=32$) and $\tau_{S}$ is randomly sampled from $\{4,8,16\}$; the setting in the third row is similar to that of the first row but with $\tau_{S}=8$; the fourth row shows the same setting as the second row but excludes $\tau_{S}=4$; the fifth and the sixth rows show configurations where both temporal strides are randomly chosen for each training video clip, subject to $\tau_{S}\leq\tau_{L}$. As before, we keep the finetuning temporal stride fixed to the value $\tau=8$. The results in Table 1 clearly show that adding randomness in the choice of the temporal extents for the long and short clips does not produce improved performance. The best performance is still achieved when $\tau_{S}=8$ and $\tau_{L}=32$ (shown in the last row). Thus, we adopt this setup for all subsequent experiments.

Now we turn to study the impact of the finetuning stride $\tau$ on video-level accuracy. The two plots in Figure 3 show how the accuracy on Kinetics-400 varies as we change the value of $\tau$ (on the horizontal axis) for two different choices of $\tau_{S}$ ($\tau_{S}=4$ in the left plot, and $\tau_{S}=8$ in the right plot). The different curves in each plot correspond to different choices of $\tau_{L}$. We see that setting the finetuning stride to $\tau=8$ tends to produce the best results across all possible choices of $\tau_{S}$ and $\tau_{L}$. This makes sense, since with $\tau=8$ the 5 inference clips are short enough not to overlap, so that they provide complementary information for the video-level classification. At the same time, $\tau=8$ implies an inference clip long enough to yield good classification on its own.

Figure 3: The plots show video-level accuracy on Kinetics-400 for different values of the temporal sampling stride $\tau$ used for supervised finetuning and test-time inference.

Different Contrastive Learning Frameworks. Next, we investigate the effects of different contrastive learning frameworks in our LSTCL system. Specifically, we experiment with three recent approaches: BYOL, MoCo v3, and SimSiam. Figure 4 shows that a larger temporal stride $\tau_{L}$ for the long view leads to better accuracy for all three of these frameworks. Specifically, setting $\tau_{L}=32$ leads to the following performance gains compared to the setting where $\tau_{L}=\tau_{S}=8$: +2.6% for BYOL, +3.1% for MoCo v3, and +1.6% for SimSiam. The lower absolute performance of SimSiam can be explained by the lack of the momentum encoder, which we observed to be important when training video transformer models with LSTCL. Thus, based on these results, we adopt MoCo v3 as our base learning framework for all subsequent experiments.

Figure 4: Kinetics-400 accuracy achieved by pretraining with LSTCL using three self-supervised strategies, with two possible stride values for the long clip ($\tau_{L}\in\{8,32\}$); the stride for the short clip is fixed to $\tau_{S}=8$. All three methods benefit from using views of different lengths ($\tau_{L}=32$ instead of $\tau_{L}=\tau_{S}=8$).

Weight Sharing and Contrastive Loss. Here we ablate the two main differences between LSTCL and BraVe [52]. 1) BraVe has two independent backbones, projectors, and predictors, which define a broad stream and a narrow stream. Instead, our LSTCL adopts online and momentum encoders with shared parameters. 2) Each stream in BraVe is specialized to process a particular type of view (either broad or narrow). Training is done by means of a combination of two regression objectives (one mapping from broad to narrow, the other mapping in the opposite direction). In LSTCL, a single encoder takes both views, and our model is optimized using a single contrastive loss, which minimizes the distance between the representations of the two views.

In Table 2, we present ablation results of LSTCL with respect to differences 1) and 2) outlined above. For 1), we modify LSTCL to use distinct networks (independent backbones and projectors) for the two views, as in BraVe. For 2), in addition to using separate networks, we also adopt the data feeding and learning objectives from BraVe in our LSTCL. The results show that LSTCL (first row) achieves superior performance with only half the number of parameters compared to these two alternative setups.

Loss | Shared Backbone | Accuracy | Params
InfoNCE | Yes | 76.6 | 121.4M
InfoNCE | No | 73.2 | 242.8M
Regression | No | 70.8 | 242.8M
Table 2: We compare our proposed approach (first row) against the weight sharing and loss proposed in BraVe [52] by evaluating the effects on Kinetics-400.
Sampling Method | Accuracy
Random Disjoint | 72.6
Random Included | 76.2
Random Independent | 76.6
Table 3: Comparison of different clip sampling strategies for LSTCL on Kinetics-400. In these experiments we use $\tau_{S}=8$ and $\tau_{L}=32$ for LSTCL and $\tau=8$ for finetuning.

Clip Sampling Strategy in LSTCL. In Table 3, we study the effect of different clip sampling strategies. These results indicate that random independent sampling works best in our setting. Intuitively, this makes sense as it forces our model to extrapolate to arbitrary video views.

Video Transformers. In Table 4 we compare the performance of three distinct video transformer architectures: TimeSformer, Swin, and Space-Time (ST) Swin. We train each of these models under three different scenarios on Kinetics-400: 1) from scratch (without pretraining), 2) using supervised pretraining on the large-scale ImageNet-1K dataset, and 3) using our self-supervised LSTCL pretraining. We can see that among these three training strategies, our LSTCL pretraining provides the highest accuracy, outperforming the models that use large-scale supervised ImageNet-1K pretraining for all three architectures.

5.2 Comparison to the State-of-the-Art

For our final experiments, we adopt the Space-Time Swin transformer, as it achieves the strongest results in our ablation studies. For this comparison to the state-of-the-art, we also train models using clips of $T=16$ frames during both pretraining with LSTCL and supervised finetuning. In this case we set the temporal stride to $\tau_{S}=4$ for short clips and $\tau_{L}=16$ for long clips, which preserves the 4:1 ratio between the temporal extents of the two views.

Kinetics-400 & Kinetics-600. In Table 6, we report results on Kinetics-400, listing for each method the clip size, the accuracy, the inference cost (in TFLOPs), and the number of parameters. We group methods on the basis of the input clip size, since models trained on longer clips or higher-resolution frames tend to yield higher accuracy. The first two groups include models operating on clips of the same size as those used by our system ($8\times 224^{2}$ and $16\times 224^{2}$). It can be seen that the ST Swin model pretrained with LSTCL achieves the highest accuracy among all previous methods that use the same input clip size and that do not make use of additional data. Furthermore, compared to prior video transformer models that are pretrained with full supervision on large-scale labeled datasets (the bottom part of the table), our method still achieves competitive results and often yields better accuracy. Lastly, note that compared to training our ST Swin model from scratch, LSTCL pretraining leads to a significant $8.7\%$ boost on Kinetics-400.

Model | Scratch | IN-1K | LSTCL | Params
TimeSformer [6] | 60.4 | 75.8 | 76.6 | 121.4M
Swin | 66.2 | 73.3 | 75.5 | 88.0M
ST Swin | 71.1 | 76.0 | 79.8 | 88.0M
Table 4: Comparing self-supervised pretraining using LSTCL to training from scratch and supervised pretraining on ImageNet-1K (IN-1K). The results show video classification accuracy on Kinetics-400 for three video transformer architectures.
Model | Pretraining dataset | UCF101 | HMDB51
BraVe [52] | K400 (Unsup.) | 95.1 | 74.6
$\rho$BYOL [20] | K400 (Unsup.) | 96.3 | 75.0
ST Swin | IN-1K (Superv.) | 78.1 | 40.2
ST Swin | K400 (Superv.) | 88.9 | 61.2
ST Swin w/ LSTCL | K400 (Unsup.) | 96.8 | 75.9
Table 5: Transfer learning results on UCF101 and HMDB51. We report the performance using the full fine-tuning setting. Our method outperforms previous state-of-the-art approaches on both UCF101 and HMDB51. Furthermore, our unsupervised LSTCL pretraining scheme achieves better results than the approaches based on supervised pretraining (on IN-1K and K400).

Table 7 shows a comparison with the state-of-the-art on the Kinetics-600 dataset. Here too, ST Swin pretrained with LSTCL achieves the best accuracy within the two groups of models using the same clip sizes as our networks. Furthermore, LSTCL produces a gain of $7.3\%$ compared to learning from scratch.

Something-Something-V2. In Table 8 we report the performance on the Something-Something-V2 dataset. Most prior methods leverage supervised large-scale pretraining on external datasets in order to achieve strong performance on this benchmark, since the dataset is relatively small. The results in the table highlight that our ST Swin models pretrained without labels on Kinetics-400 using LSTCL achieve higher accuracy than methods that leverage pretraining on larger datasets and manually labeled data. Moreover, our LSTCL pretraining yields a gain of $26.4\%$ over the same model trained from scratch. This remarkable improvement is due to the fact that the Something-Something-V2 dataset requires thorough temporal reasoning for good accuracy. Our LSTCL method trains the clip representation to predict the temporal context from the entire video and thus yields strong benefits on this benchmark.

Table 6: Comparison to the state-of-the-art on Kinetics-400. Among methods using the same clip sizes as our models and no additional data (first two groups in the table), ST Swin networks pretrained with LSTCL achieve the highest accuracy, and they are on par with models that use longer or higher-resolution clips (third group) or that leverage additional data for supervised pretraining (bottom group).
Method | Clip Size | Additional Data (# Samples) | Top-1 | Top-5 | TFLOPs | Params
SlowFast [19] | $8\times 224^2$ | - | 77.9 | 93.2 | 3.0 | 59.9M
TimeSformer from scratch | $8\times 224^2$ | - | 60.4 | 76.7 | 0.59 | 121.4M
ST Swin from scratch | $8\times 224^2$ | - | 71.1 | 85.2 | 0.60 | 88.0M
ST Swin w/ LSTCL | $8\times 224^2$ | - | 79.8 | 94.0 | 0.60 | 88.0M
CorrNet-101 [64] | $16\times 224^2$ | - | 79.2 | - | 7.0 | -
SlowFast-NL [19] | $16\times 224^2$ | - | 79.8 | 93.9 | 7.0 | 59.9M
MViT-B [17] | $16\times 224^2$ | - | 78.4 | 93.5 | 0.36 | 36.6M
ST Swin w/ LSTCL | $16\times 224^2$ | - | 81.5 | 95.2 | 1.80 | 88.0M
X3D-XL [18] | $16\times 312^2$ | - | 79.1 | 93.9 | 1.45 | 11.0M
ip-CSN-152 [61] | $32\times 224^2$ | - | 77.8 | 92.8 | 3.3 | 32.8M
MViT-B [17] | $64\times 224^2$ | - | 81.2 | 95.1 | 4.09 | 36.6M
TimeSformer [6] | $8\times 224^2$ | ImageNet-21K (14M) | 78.0 | 93.7 | 0.59 | 121.4M
STAM [54] | $16\times 224^2$ | ImageNet-21K (14M) | 79.3 | - | 0.27 | 96.0M
TEINet [40] | $16\times 224^2$ | ImageNet-1K (1.2M) | 76.2 | 92.5 | 1.8 | -
Mformer [50] | $16\times 224^2$ | ImageNet-21K (14M) | 79.7 | 94.2 | 11.1 | 109.1M
ViViT-L [3] | $16\times 224^2$ | ImageNet-21K (14M) | 80.6 | 94.7 | 47.9 | 310.0M
VATT-B [2] | $32\times 320^2$ | AudioSet + HowTo100M (3.2M) | 79.6 | 94.9 | 9.08 | 88.0M
TimeSformer-L [6] | $96\times 224^2$ | ImageNet-21K (14M) | 80.7 | 94.7 | 7.14 | 121.4M
Table 7: Video-level accuracy on Kinetics-600. The ST Swin model trained with LSTCL achieves results comparable with the state-of-the-art without using additional data or labels.
Method | Clip Size | Additional Data (# Samples) | Top-1 | Top-5
SlowFast [19] | $8\times 224^2$ | - | 80.4 | 94.8
ST Swin from scratch | $8\times 224^2$ | - | 74.7 | 92.2
ST Swin w/ LSTCL | $8\times 224^2$ | - | 82.0 | 95.5
SlowFast [19] | $16\times 224^2$ | - | 81.8 | 95.1
MViT-B [17] | $16\times 224^2$ | - | 82.1 | 95.7
ST Swin w/ LSTCL | $16\times 224^2$ | - | 83.6 | 96.6
X3D-XL [18] | $16\times 312^2$ | - | 81.9 | 95.9
MViT-B [17] | $32\times 224^2$ | - | 83.4 | 96.3
TimeSformer [6] | $8\times 224^2$ | ImageNet-21K (14M) | 79.1 | 94.4
Mformer [50] | $16\times 224^2$ | ImageNet-21K (14M) | 81.6 | 95.6
ViViT-L [3] | $16\times 224^2$ | ImageNet-21K (14M) | 82.5 | 95.6
VATT-B [2] | $32\times 320^2$ | AudioSet + HowTo100M (3.2M) | 80.5 | 95.5
VATT-L [2] | $32\times 320^2$ | AudioSet + HowTo100M (3.2M) | 83.6 | 96.6
TimeSformer-L [6] | $96\times 224^2$ | ImageNet-21K (14M) | 82.2 | 95.6
Table 8: Video-level classification accuracy on Something-Something-V2. Our ST Swin models pretrained without labels using LSTCL yield results on par with the state-of-the-art.
Method | Clip Size | Additional Data (# Samples) | Pretraining | Top-1 | Top-5
TimeSformer [6] | $8\times 224^2$ | ImageNet-21K (14M) | supervised | 59.5 | -
ResNet50 [20] | $8\times 224^2$ | K400 (240K) | unsupervised | 55.8 | -
ST Swin from scratch | $8\times 224^2$ | - | - | 38.4 | 65.5
ST Swin w/ LSTCL | $8\times 224^2$ | K400 (240K) | unsupervised | 64.8 | 89.4
TEINet [40] | $16\times 224^2$ | ImageNet-1K (1.2M) | supervised | 64.7 | -
Mformer [50] | $16\times 224^2$ | ImageNet-21K + K400 (14.2M) | supervised | 66.5 | 90.1
ViViT-L [3] | $16\times 224^2$ | ImageNet-21K (14M) | supervised | 65.4 | 89.8
MViT-B [17] | $16\times 224^2$ | K400 (240K) | supervised | 64.7 | 89.2
ST Swin w/ LSTCL | $16\times 224^2$ | K400 (240K) | unsupervised | 67.0 | 90.5
TimeSformer-L [6] | $96\times 224^2$ | ImageNet-21K (14M) | supervised | 62.4 | -
MViT-B [17] | $64\times 224^2$ | K400 (240K) | supervised | 67.7 | 90.9

HMDB51 & UCF101. Finally, we assess the ability to transfer the unsupervised representation learned by LSTCL from Kinetics-400 to the small-scale HMDB51 [37] and UCF101 [57] datasets via supervised finetuning. The results are shown in Table 5, where we also include accuracies obtained via fully-supervised pretraining (using class labels) on IN-1K and K400, as well as the two recent self-supervised methods $\rho$BYOL [20] and BraVe [52]. It can be seen that LSTCL outperforms both (i) the previous state-of-the-art unsupervised pretraining methods, and (ii) the supervised pretraining baselines on both datasets.

6 Conclusion

This paper introduces Long-Short Temporal Contrastive Learning (LSTCL), an unsupervised pretraining scheme for video transformers. By contrasting representations obtained from a long view and a short view of each video, it forces the model to encode context from the whole video into the features of short clips. We demonstrate our LSTCL under three different contrastive frameworks and two video transformer architectures, including a new variant, the Space-Time Swin transformer. In our experiments we show that unsupervised pretraining with LSTCL leads to similar or better video classification accuracy compared to pretraining with full supervision on ImageNet-21K, and it achieves competitive results on three different video classification benchmarks. LSTCL effectively eliminates the need for large-scale supervised image pretraining in video transformers.

References

  • [1] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [2] Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. arXiv preprint arXiv:2104.11178, 2021.
  • [3] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, October 2021.
  • [4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [5] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. Speednet: Learning the speediness in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9922–9931, 2020.
  • [6] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 813–824. PMLR, 2021.
  • [7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
  • [8] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
  • [9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
  • [10] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [11] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [12] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • [13] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.
  • [14] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised visual transformers. arXiv preprint arXiv:2104.02057, 2021.
  • [15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [17] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6824–6835, October 2021.
  • [18] Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 203–213, 2020.
  • [19] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
  • [20] Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross B. Girshick, and Kaiming He. A large-scale study on unsupervised spatiotemporal representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 3299–3309. Computer Vision Foundation / IEEE, 2021.
  • [21] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
  • [22] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
  • [23] Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE international conference on computer vision, pages 4086–4093, 2015.
  • [24] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • [25] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017.
  • [26] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
  • [27] Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
  • [28] Tengda Han, Weidi Xie, and Andrew Zisserman. Self-supervised co-training for video representation learning. In Neurips, 2020.
  • [29] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
  • [30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [31] Kai Hu, Jie Shao, Yuan Liu, Bhiksha Raj, Marios Savvides, and Zhiqiang Shen. Contrast and order representations for video self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7939–7949, 2021.
  • [32] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 603–612, 2019.
  • [33] Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H. Adelson. Learning visual groups from co-occurrences in space and time. CoRR, abs/1511.06811, 2015.
  • [34] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [35] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), 2018.
  • [36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
  • [37] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.
  • [38] Yang Liu, Keze Wang, Lingbo Liu, Haoyuan Lan, and Liang Lin. Tcgl: Temporal contrastive graph for self-supervised video representation learning. IEEE Transactions on Image Processing, 31:1978–1993, 2022.
  • [39] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
  • [40] Zhaoyang Liu, Donghao Luo, Yabiao Wang, Limin Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Tong Lu. Teinet: Towards an efficient architecture for video recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11669–11676, 2020.
  • [41] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. arXiv preprint arXiv:2106.13230, 2021.
  • [42] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [43] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [44] Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In Proceedings of (ECCV) European Conference on Computer Vision, pages 527 – 544, October 2016.
  • [45] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. arXiv preprint arXiv:2102.00719, 2021.
  • [46] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69–84. Springer, 2016.
  • [47] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [48] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
  • [49] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  • [50] Mandela Patrick, Dylan Campbell, Yuki M Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João Henriques, et al. Keeping your eye on the ball: Trajectory attention in video transformers. Advances in Neural Information Processing Systems, 2021.
  • [51] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge J. Belongie, and Yin Cui. Spatiotemporal contrastive video representation learning. CoRR, abs/2008.03800, 2020.
  • [52] Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Pătrăucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, and Andrew Zisserman. Broaden your views for self-supervised video learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1255–1265, October 2021.
  • [53] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.
  • [54] Gilad Sharir, Asaf Noy, and Lihi Zelnik-Manor. An image is worth 16x16 words, what is a video worth? arXiv preprint arXiv:2103.13915, 2021.
  • [55] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199, 2014.
  • [56] Jingkuan Song, Hanwang Zhang, Xiangpeng Li, Lianli Gao, Meng Wang, and Richang Hong. Self-supervised video hashing with hierarchical binary auto-encoder. IEEE Transactions on Image Processing, 27(7):3210–3221, 2018.
  • [57] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. In CRCV-TR-12-01, 2012.
  • [58] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 843–852, Lille, France, 07–09 Jul 2015. PMLR.
  • [59] Li Tao, Xueting Wang, and Toshihiko Yamasaki. Self-supervised video representation learning using inter-intra contrastive framework. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2193–2201, 2020.
  • [60] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
  • [61] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5552–5561, 2019.
  • [62] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • [63] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008.
  • [64] Heng Wang, Du Tran, Lorenzo Torresani, and Matt Feiszli. Video modeling with correlation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 352–361, 2020.
  • [65] Jue Wang and Anoop Cherian. Learning discriminative video representations using adversarial perturbations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 685–701, 2018.
  • [66] Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8201–8211, 2019.
  • [67] Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu. Self-supervised video representation learning by pace prediction. In European conference on computer vision, pages 504–521. Springer, 2020.
  • [68] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [69] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2794–2802, 2015.
  • [70] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503, 2020.
  • [71] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10334–10343, 2019.
  • [72] Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, and Tao Mei. Seco: Exploring sequence supervision for unsupervised representation learning. In AAAI, volume 2, page 7, 2021.
  • [73] Li Zhang, Dan Xu, Anurag Arnab, and Philip HS Torr. Dynamic graph message passing networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3726–3735, 2020.

Appendix A Self-Supervised Learning Frameworks

In the main paper we presented experimental results obtained by implementing our proposed LSTCL procedure under three popular self-supervised contrastive learning frameworks: MoCo v3 [14], BYOL [26], and SimSiam [13]. Since we used the framework of MoCo v3 to present our method (Section 4), here we provide a short description of the other two frameworks, BYOL and SimSiam, and of how we use them in our method.

BYOL. BYOL is a self-supervised learning framework consisting of a momentum encoder ($f_{\theta_{m}}$), an online encoder ($f_{\theta}$) and a predictor MLP ($g_{\theta_{p}}$) which is connected to the online encoder. The momentum encoder is the moving average of the online encoder, controlled by a momentum parameter $m$. There is no back-propagation into the momentum-encoder parameters $\theta_{m}$. The momentum update can be written as:

\theta_{m}=m\theta_{m}+(1-m)\theta   (6)

Differently from MoCo, BYOL uses only positive pairs of examples. It minimizes the negative cosine similarity over all positive pairs. In our setting, the set of short clips is processed by the online encoder $f_{\theta}$ to yield a set of "query" examples $Q=\{q^{1},q^{2},\dots,q^{B}\}$ where $q^{i}=f_{\theta}(x_{S}^{i})\in\mathbb{R}^{D}$. The set of long clips is processed by the momentum encoder $f_{\theta_{m}}$ to produce "key" examples $K=\{k^{1},k^{2},\dots,k^{B}\}$. Then, LSTCL with the BYOL loss minimizes the following objective:

\mathcal{L}_{BYOL}=\sum_{i}\left[2-2\cdot\frac{g_{\theta_{p}}(q^{i})^{\top}k^{i}}{\|g_{\theta_{p}}(q^{i})\|_{2}\cdot\|k^{i}\|_{2}}\right].   (7)

We symmetrize the objective by adding to this loss a dual term obtained by reversing the roles of the long and the short clips, i.e., by computing queries from long clips with the online encoder ($q^{i}=f_{\theta}(x_{L}^{i})$) and keys from the short clips with the momentum encoder ($k^{i}=f_{\theta_{m}}(x_{S}^{i})$).
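A minimal sketch of this symmetrized BYOL objective, as we use it in LSTCL, is given below; `f_online`, `f_momentum`, and the predictor `g` are assumed modules and the names are illustrative:

```python
import torch.nn.functional as F

# Minimal sketch of the symmetrized BYOL objective (Eq. 7); only positive
# long/short pairs are used and the key branch receives no gradient.
def byol_loss(f_online, f_momentum, g, x_short, x_long):
    def neg_cos(p, k):
        # 2 - 2 * cosine similarity, averaged over the batch.
        return (2.0 - 2.0 * F.cosine_similarity(p, k.detach(), dim=-1)).mean()

    loss = neg_cos(g(f_online(x_short)), f_momentum(x_long))         # short -> long
    return loss + neg_cos(g(f_online(x_long)), f_momentum(x_short))  # dual term
```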

SimSiam. SimSiam can be viewed as the BYOL method without the momentum encoder and momentum update. Thus, in our setting the queries and the keys are computed from the short view and the long view, respectively, but using the same encoder: $q^{i}=f_{\theta}(x_{S}^{i})$, $k^{i}=f_{\theta}(x_{L}^{i})$. During training, SimSiam applies the stop-gradient operation on the key view. Thus, LSTCL with the SimSiam loss minimizes:

\mathcal{L}_{SimSiam}=\sum_{i}\left[2-2\cdot\frac{g_{\theta_{p}}(q^{i})^{\top}SG(k^{i})}{\|g_{\theta_{p}}(q^{i})\|_{2}\cdot\|SG(k^{i})\|_{2}}\right]   (8)

where $SG(\cdot)$ denotes the stop-gradient operation. Here too, we symmetrize the objective by adding a dual term with reversed roles for the long and the short clip.
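A minimal sketch of the symmetrized SimSiam objective is given below; the stop-gradient is implemented with `.detach()` and all names are illustrative:

```python
import torch.nn.functional as F

# Minimal sketch of the symmetrized SimSiam objective (Eq. 8): a single encoder
# `f` processes both views; `g` is the predictor MLP.
def simsiam_loss(f, g, x_short, x_long):
    q_s, q_l = f(x_short), f(x_long)

    def neg_cos(p, k):
        # Stop-gradient on the key view via .detach().
        return (2.0 - 2.0 * F.cosine_similarity(p, k.detach(), dim=-1)).mean()

    return neg_cos(g(q_s), q_l) + neg_cos(g(q_l), q_s)
```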

Appendix B Analysis of Space-Time Swin Transformer

We note that our Space-Time Swin Transformer (ST Swin) is closely related to the Video Swin model [41], which is concurrently presented in unpublished work. Video Swin differs from our ST Swin in the way it treats the temporal dimension. Video Swin subdivides the video volume into 3D neighborhoods for a self-attention computation that is local in both space and time. Conversely, our ST Swin inflates 2D spatial Swin blocks [39] into space-time attention tubes that cover the entire temporal extent of the clip. This gives our model the ability to compare patches from all frames within the same spatial neighborhood. Here we present an empirical comparison between Video Swin (using the code provided by the authors) and our ST Swin, with both models pretrained using our LSTCL and then finetuned on K400. Table 9 shows that ST Swin achieves better accuracy. We also list the result reported in the original Video Swin paper for the case when this model is pretrained on ImageNet and then finetuned on K400. This result shows that Video Swin trained with our LSTCL achieves higher accuracy than the same model pretrained on ImageNet using longer clips (32 frames instead of 16).

Model | Clip Size | Additional data | Pretraining | Acc.
ST Swin | $16\times 224^2$ | - | LSTCL | 81.5%
Video Swin | $16\times 224^2$ | - | LSTCL | 81.0%
Video Swin | $32\times 224^2$ | IN-1K | Superv. | 80.6%
Table 9: Comparison between ST Swin and Video Swin on K400.