
Self-Supervised Video Representation Learning by Video Incoherence Detection

Haozhi Cao,1,* Yuecong Xu,2,* Jianfei Yang,1
Kezhi Mao,1 Lihua Xie,1 Jianxiong Yin,3 Simon See3
(* Equal contribution)
Abstract

This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning. It is rooted in the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos. Specifically, each training sample, denoted as an incoherent clip, is constructed from multiple sub-clips hierarchically sampled from the same raw video with varying lengths of incoherence between each other. The network is trained to learn high-level representation by predicting the location and length of incoherence given the incoherent clip as input. Additionally, intra-video contrastive learning is introduced to maximize the mutual information between incoherent clips from the same raw video. We evaluate our proposed method through extensive experiments on action recognition and video retrieval utilizing various backbone networks. Experiments show that our proposed method achieves state-of-the-art performance across different backbone networks and different datasets compared with previous coherence-based methods.

Introduction

Fully supervised learning has achieved great success in video representation learning during the past decade. However, its outstanding performance relies heavily on large amounts of labeled data, whose annotation is resource-expensive and time-consuming. Additionally, fully supervised methods are designed to extract task-specific representation and therefore suffer from poor transferability and generalization. To address these issues, recent works have paid more attention to self-supervised learning, which aims to extract generalized representation from the more accessible unlabeled data on the Internet.

Refer to caption
Figure 1: Illustration of how video incoherence affects motion information. Both videos demonstrate the action “High Jump”. The incoherence caused by the loss of frame (3) leads to a distortion of the athlete's motion between frame (2) and frame (4) in the incoherent video, which is incompatible with our understanding of “High Jump”. This observation suggests that incoherence detection requires a comprehensive understanding of videos.

Typically, the core of self-supervised methods is to design a pretext task where the network is driven to learn representation through characteristics of unlabeled data. Existing self-supervised methods can be categorized into two main types: (i) dense prediction and (ii) spatio-temporal reasoning. Methods using dense prediction require the network to predict parts of low-level representation, such as future frames (Vondrick, Pirsiavash, and Torralba 2016; Srivastava, Mansimov, and Salakhutdinov 2015) and optical flows (Gan et al. 2018). While outstanding performance can be achieved, they usually require hand-crafted features (e.g. optical flow (Gan et al. 2018)) or complicated computation processes (Tian et al. 2020; Han, Xie, and Zisserman 2020), leading to an expensive cost of time and resources. To improve efficiency, recent spatio-temporal reasoning methods, such as clip order prediction (Xu et al. 2019; Fernando et al. 2017; Misra, Zitnick, and Hebert 2016; Lee et al. 2017), video speed prediction (Wang, Jiao, and Liu 2020; Jenni, Meishvili, and Favaro 2020; Yao et al. 2020) and spatio-temporal statistic prediction (Wang et al. 2019), tend to learn the high-level spatio-temporal correlations of raw videos. Methods based on clip order prediction attempt to leverage video coherence for representation learning, where the supervision signal is generated by disrupting the frame order. In this paper, we propose an incoherence detection method that leverages video coherence for video representation learning from a new perspective.

Intuitively, our visual system can easily identify the incoherence of videos (e.g. frames lost due to connection latency), since we can recognize abnormal motion based on our understanding of videos. In this case, the incoherence can be viewed as noise added to the motion information. Detecting incoherence therefore requires a comprehensive understanding of videos, which motivates this paper. For example, as illustrated in Figure 1, we can easily judge whether there is incoherence between frame (2) and frame (4). This is because, given the previous frames (1-2), we can deduce that the athlete should be leaping over the bar in the next frame. Yet given frames (1-2) where the athlete remains on the left side of the bar, the athlete in frame (4) suddenly appears on the right side without the process of leaping, which is incompatible with our deduction. This bi-directional reasoning over video content can be an effective supervision signal for the network to learn high-level representation of videos.

Inspired by this observation, we propose a simple-yet-effective method called Video Incoherence Detection (VID) for video representation learning in a self-supervised manner. Each training sample is generated as an incoherent clip constructed from multiple sub-clips of the same raw video. Specifically, sub-clips are hierarchically sampled from the raw video given a randomly generated incoherence location and length. The incoherent clip is then constructed as the concatenation of the sub-clips along the temporal dimension. Different from previous coherence-based methods (Xu et al. 2019; Fernando et al. 2017; Misra, Zitnick, and Hebert 2016; Lee et al. 2017), which undermine temporal order, VID preserves the sequential relationship of the raw video during generation. The network can therefore learn temporal representation for incoherence detection.

Given the incoherent clips as input, the network is trained to detect the incoherence by two novel pretext tasks that predict the location and length of incoherence, denoted as Incoherence Location Detection (LoD) and Incoherence Length Detection (LeD), respectively. Moreover, we introduce the Intra-Video Contrastive Learning (ICL) as an additional optimization objective to maximize the mutual information between different incoherent clips from the same raw video.

In summary, our contributions are three-fold. Firstly, motivated by the fact that detecting incoherence requires semantic understanding, we propose a simple-yet-effective self-supervised method, called Video Incoherence Detection (VID), utilizing a single temporal transformation method for video representation learning. Secondly, Incoherence Location Detection (LoD) and Incoherence Length Detection (LeD) are proposed to learn spatio-temporal representation by detecting incoherence while avoiding shortcuts. Thirdly, we introduce Intra-Video Contrastive Learning (ICL) to maximize the mutual information between incoherent clips from the same video. Extensive experiments show that our VID achieves state-of-the-art performance on action recognition and video retrieval compared with previous coherence-based methods.

Literature review

Self-supervised learning. To leverage the more accessible unlabeled data on the Internet, recent methods pay more attention to self-supervised learning. Self-supervised learning stems from previous works (Caruana and de Sa 1997; Ando, Zhang, and Bartlett 2005) and has been widely explored for images (Wu et al. 2018; Hjelm et al. 2019; Misra and Maaten 2020) and natural language (Devlin et al. 2018; Lan et al. 2019). Early works have expanded self-supervised methods from other domains to videos, e.g. DPC (Han, Xie, and Zisserman 2019) inspired by CPC (Oord, Li, and Vinyals 2018) in the image domain and (Sun et al. 2019b, a) inspired by BERT (Devlin et al. 2018).

Recent self-supervised methods for video representation learning can be categorized into two types: dense prediction and spatio-temporal reasoning. Methods based on dense prediction (Vondrick, Pirsiavash, and Torralba 2016; Gan et al. 2018; Han, Xie, and Zisserman 2019, 2020; Srivastava, Mansimov, and Salakhutdinov 2015; Tian et al. 2020) require the network to predict low-level information of videos. (Vondrick, Pirsiavash, and Torralba 2016; Srivastava, Mansimov, and Salakhutdinov 2015) proposed to learn video representation by predicting future frames whose foreground and background are generated from independent streams. To leverage video information of multiple modalities, some previous works proposed to generate the supervision signal from multi-modal inputs, such as 3D videos (Gan et al. 2018) and RGB-D data (Luo et al. 2017).

Instead of directly predicting low-level information, methods based on spatio-temporal reasoning generate supervision signals from correlations or characteristics of videos. Compared with dense prediction, previous spatio-temporal reasoning methods require dedicated pretext tasks, such as temporal order prediction (Fernando et al. 2017; Xu et al. 2019; Lee et al. 2017; Misra, Zitnick, and Hebert 2016; Kim, Cho, and Kweon 2019) and video speed prediction (Wang, Jiao, and Liu 2020; Yao et al. 2020; Jenni, Meishvili, and Favaro 2020). Inspired by the sequential relationships of videos, previous works (Fernando et al. 2017; Lee et al. 2017; Misra, Zitnick, and Hebert 2016) attempted to predict or identify the correct frame order given clips shuffled along the temporal dimension. (Xu et al. 2019) further applied the order prediction method with 3D CNNs and (Kim, Cho, and Kweon 2019) expanded order prediction to the spatial dimension. On the other hand, recent methods (Yao et al. 2020; Wang, Jiao, and Liu 2020; Jenni, Meishvili, and Favaro 2020) proposed to extract effective representation by predicting the speed of videos. Specifically, (Yao et al. 2020; Wang, Jiao, and Liu 2020) combined the speed prediction task with re-generation and contrastive learning, respectively. (Jenni, Meishvili, and Favaro 2020) achieved state-of-the-art performance by recognizing various temporal transformations under different speeds. Inspired by humans' sensitivity towards incoherence in videos, we argue that video incoherence detection requires semantic understanding of video content, which can be exploited to learn effective video representation.

Contrastive learning. Contrastive learning has been proven to be an effective optimization objective in self-supervised learning. For image representation learning, multiple methods (Misra and Maaten 2020; Hjelm et al. 2019; Bachman, Hjelm, and Buchwalter 2019; Wu et al. 2018) proposed to learn effective image representation using contrastive learning. Inspired by this success, recent methods (Dwibedi et al. 2019; Wang, Jiao, and Liu 2020; He et al. 2020; Lorre et al. 2020; Yao et al. 2021) have been proposed to leverage contrastive learning for video representation learning. The basic idea is to maximize the mutual information between positive pairs. For instance, (Dwibedi et al. 2019; Wang, Jiao, and Liu 2020) attempted to align the spatio-temporal representation of the same action or the same context. Recently, (Yao et al. 2021) conducted contrastive learning from spatial, spatio-temporal and sequential perspectives. In this work, we utilize intra-video contrastive learning to maximize the mutual information between different incoherent clips from the same video.

Proposed methods

Coherence is one of the crucial properties of videos. Natural videos are formed by sets of coherently observed frames. Our visual system can easily identify the incoherence caused by lost frames within a video clip, which suggests that detecting incoherence requires a semantic understanding of videos. This motivates us to design a self-supervised method that leverages incoherence detection for video representation learning.

In this work, we propose to extract effective spatio-temporal representation by Video Incoherence Detection (VID) based on a single temporal transformation in a self-supervised manner. We first illustrate how to generate incoherent clips from raw videos. Based on these generated clips, Incoherence Location Detection (LoD), Incoherence Length Detection (LeD) and Intra-Video Contrastive Learning (ICL) are proposed for self-supervised learning and described in detail. To clarify the whole learning procedure, we summarize the overall learning objective and framework of VID in Section Network structure and training.

Generation of incoherent video clips

To utilize VID, we first generate incoherent clips from raw videos. Given a raw video $V$, the incoherent clip $\mathcal{V}_{inc}$ is constructed from $k$ sub-clips $\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_k$ sampled from $V$ with a certain length of incoherence between each other. The location $L_{loc}$ and length $l_{inc}$ of the incoherence are both randomly generated. The length of incoherence $l_{inc}$ between sub-clips is limited within the range of:

$l_{inc} \in [l_{inc}^{min}, l_{inc}^{min}+1, \ldots, l_{inc}^{max}]$,   (1)

where $l_{inc}^{min}$ and $l_{inc}^{max}$ are hyper-parameters indicating the lower and upper bounds of the incoherence length, respectively. The purposes of this constraint are two-fold. Firstly, the constraint on $l_{inc}$ is necessary for our Incoherence Length Detection (LeD). Secondly, the constraint on the incoherence length prevents the incoherence between sub-clips from being either too vague or too obvious, which avoids learning trivial solutions. For simplicity, we take the case where $\mathcal{V}_{inc}$ is constructed from two sub-clips as an example to thoroughly illustrate the generation process of incoherent clips, as shown in Figure 2.

Refer to caption
Figure 2: Generation process of the incoherent clip $\mathcal{V}_{inc}$. Indices 1-16 in squares denote the frame indices in the raw video $V$, while indices (0-7) denote the relative frame indices in $\mathcal{V}_{inc}$. Squares in shadow and color refer to the sample range and sampled frames for the corresponding sub-clip, respectively. $\mathcal{V}_{inc}$ is generated as the concatenation of $\mathcal{V}_1$ and $\mathcal{V}_2$ along the temporal dimension as shown in (e).

Selection of incoherence location. The incoherence location $L_{loc}$ refers to the relative concatenation location between two sub-clips. Formally, given the desired length of incoherent clips $l_0$, the location of incoherence $L_{loc}$ is uniformly selected as:

$l_1 \in \{1, 2, \ldots, l_0-1\}, \quad l_2 = l_0 - l_1$,   (2)
$L_{loc} = l_1 - 1$,   (3)

where $l_1$, $l_2$ are the lengths of $\mathcal{V}_1$, $\mathcal{V}_2$ as illustrated in Figure 2(b), where squares in different colors refer to the allocated frame positions for different sub-clips in $\mathcal{V}_{inc}$. $L_{loc}$ is the relative location of incoherence where the sub-clips are concatenated, and it also serves as the label for the following LoD task.

Refer to caption
Figure 3: The structure of our proposed VID method. The first row indicates two raw videos $V_1$, $V_2$. Two incoherent clips are generated from each raw video. Subsequently, they are fed into a single 3D CNN backbone. The extracted high-level representation $H$ is then passed to three different linear or non-linear layers to perform three different sub-tasks, including Incoherence Location Detection (LoD), Incoherence Length Detection (LeD) and Intra-Video Contrastive Learning (ICL).

Hierarchical selection of sub-clips. Given the sub-clip lengths $l_1, l_2$, the sub-clips $\mathcal{V}_1$, $\mathcal{V}_2$ are hierarchically sampled from the raw video $V$ with incoherence between each other. The incoherent clip $\mathcal{V}_{inc}$ is generated as the temporal concatenation of $\mathcal{V}_1$ and $\mathcal{V}_2$. While previous work (Wang, Jiao, and Liu 2020) proposed to sample frames by looping over the raw video, such a strategy is not compatible with our proposed VID since it could introduce unexpected incoherence when looping from the end back to the start of the video. Instead, to preserve the sequential relationship of the raw video, we propose a hierarchical sampling strategy that maximizes the sample range of each sub-clip while satisfying Equation 1.

As illustrated in the upper row of Figure 2(c), given the raw video $V$ of length $T$, the sample range $T_1$ of the first sub-clip $\mathcal{V}_1$ is determined by reserving sufficient frames for subsequent sub-clips. In this way, the sub-clip $\mathcal{V}_2$ can be sampled from the remaining raw frames wherever $\mathcal{V}_1$ is located within $T_1$, which preserves the sequential relationship and satisfies the constraint in Equation 1. Formally, given $l_2$ and $l_{inc}^{min}$, the sample range $T_1$ is computed as:

$t_1^{min} = 1$,   (4)
$t_1^{max} = T - l_{inc}^{min} - l_2$,   (5)
$T_1 = \{t_1^{min}, t_1^{min}+1, \ldots, t_1^{max}\}$,   (6)

where $t_1^{min}$, $t_1^{max}$ are the lower and upper bounds of the range $T_1$. Given the range $T_1$, $\mathcal{V}_1$ is uniformly sampled as $\mathcal{V}_1 \in T_1$, as illustrated in the lower row of Figure 2(c).

Hierarchically, the range $T_2$ of the second sub-clip is decided by the sampled sub-clip $\mathcal{V}_1$ and the range of $l_{inc}$ in Equation 1. As shown in the upper row of Figure 2(d), given the raw frame index of the last frame in $\mathcal{V}_1$, denoted as $m_1 = \max(\mathcal{V}_1)$, the sample range $T_2$ is computed as:

$t_2^{min} = m_1 + l_{inc}^{min} + 1$,   (7)
$t_2^{max} = \min(m_1 + l_{inc}^{max} + l_2,\; T)$,   (8)
$T_2 = \{t_2^{min}, t_2^{min}+1, \ldots, t_2^{max}\}$,   (9)

where $t_2^{min}$, $t_2^{max}$ are the lower and upper bounds which ensure that $l_{inc}$, the length of incoherence between $\mathcal{V}_2$ and $\mathcal{V}_1$, always satisfies the constraint in Equation 1. Similar to $\mathcal{V}_1$, the second sub-clip $\mathcal{V}_2$ is uniformly sampled as $\mathcal{V}_2 \in T_2$, as shown in the lower row of Figure 2(d).

Given the sub-clips $\mathcal{V}_1, \mathcal{V}_2$, the incoherent clip $\mathcal{V}_{inc}$ and its label $L_{len}$ for the Incoherence Length Detection (LeD) task are generated as:

$\mathcal{V}_{inc} = \mathcal{V}_1 \oplus \mathcal{V}_2$,   (10)
$l_{inc} = \min(\mathcal{V}_2) - \max(\mathcal{V}_1)$,   (11)
$L_{len} = l_{inc} - l_{inc}^{min}$,   (12)

where $\oplus$ indicates the concatenation of the two sub-clips $\mathcal{V}_1$, $\mathcal{V}_2$ along the temporal dimension.
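To make the two-sub-clip sampling procedure concrete, the following is a minimal NumPy sketch of Equations 2-12. The function and argument names (e.g. generate_incoherent_clip, l_min, l_max) are our own illustration rather than released code, and a sampling stride of 1 is assumed.

```python
import numpy as np

def generate_incoherent_clip(video, l0=16, l_min=3, l_max=10, rng=np.random):
    """Sample one incoherent clip from `video` (an array of T frames).

    Returns the clip together with the LoD label (incoherence location)
    and the LeD label (incoherence length minus its lower bound),
    following Eqs. (2)-(12). Assumes T >= l0 + l_min and a sampling
    stride of 1; names are illustrative, not the authors' released code.
    """
    T = len(video)
    # Eq. (2)-(3): split the l0 output frames into two sub-clips.
    l1 = rng.randint(1, l0)                       # l1 in {1, ..., l0 - 1}
    l2 = l0 - l1
    loc_label = l1 - 1                            # L_loc

    # Eq. (4)-(6): range of the first sub-clip (1-indexed frames),
    # reserving at least l_min + l2 frames after it.
    t1_max = T - l_min - l2
    start1 = rng.randint(1, t1_max - l1 + 2)      # V1 occupies [start1, start1 + l1 - 1]
    m1 = start1 + l1 - 1                          # m1 = max(V1)

    # Eq. (7)-(9): range of the second sub-clip, keeping the incoherence
    # between the two sub-clips within the allowed bounds.
    t2_min = m1 + l_min + 1
    t2_max = min(m1 + l_max + l2, T)
    start2 = rng.randint(t2_min, t2_max - l2 + 2)

    # Eq. (10)-(12): temporal concatenation and the LeD label.
    clip1 = video[start1 - 1 : start1 - 1 + l1]   # back to 0-indexed slicing
    clip2 = video[start2 - 1 : start2 - 1 + l2]
    l_inc = start2 - m1                           # min(V2) - max(V1), Eq. (11)
    len_label = l_inc - l_min                     # L_len, Eq. (12)
    return np.concatenate([clip1, clip2], axis=0), loc_label, len_label
```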

Optimization objectives

We propose two novel self-supervised tasks, Incoherence Location Detection (LoD) and Incoherence Length Detection (LeD), to detect the incoherence in incoherent clips, while maximizing the mutual information between different incoherent clips from the same raw video. Specifically, given an incoherent clip $\mathcal{V}_{inc}$, the high-level representation is first extracted as $h = f(\mathcal{V}_{inc})$, where $f(\cdot)$ denotes the encoder. Given the representation $h$, the optimization objective of VID includes three components: Incoherence Location Detection (LoD), Incoherence Length Detection (LeD) and Intra-Video Contrastive Learning (ICL).

Incoherence Location Detection (LoD). Given the high-level representation $h$ and its location label $L_{loc}$, the network is required to predict the location of incoherence in $\mathcal{V}_{inc}$. This is mainly inspired by humans' sensitivity towards lost frames within video clips. The network is driven to identify the abnormal motion caused by incoherence, which encourages it to learn semantic representation of videos. The LoD task is formulated as a single-label classification problem. Given the representation $h$ and label $L_{loc}$, the network is optimized by the cross-entropy loss:

$l_{LoD} = -\sum_{i=0}^{l_0-1} y_i^{loc} \log\left(\frac{\exp(z_i^{loc})}{\sum_{j=0}^{l_0-1}\exp(z_j^{loc})}\right)$,   (13)

where $z^{loc} \in \mathbb{R}^{l_0-1}$ is the output of the fully-connected layers $\phi^{loc}(\cdot)$ given the representation $h$ as input, and $y^{loc} \in \mathbb{R}^{l_0-1}$ is the ground-truth label vector whose element at $L_{loc}$ equals 1 while the rest equal 0. In practice, given a mini-batch of outputs $Z^{loc} \in \mathbb{R}^{N \times (l_0-1)}$, where $N$ denotes the batch size, Equation 13 is applied to each representation in $Z^{loc}$. The loss $\mathcal{L}_{LoD}$ is then calculated as the average of all losses over $Z^{loc}$.
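As an illustration, a minimal PyTorch sketch of the LoD head and loss follows. The feature dimension and module names are assumptions; the paper only specifies fully-connected layers $\phi^{loc}(\cdot)$ producing $l_0 - 1$ location logits trained with cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationHead(nn.Module):
    """Fully-connected head phi_loc producing l0-1 location logits.

    `feat_dim` (the dimension of the backbone representation h) is an
    assumption; it depends on the chosen 3D CNN backbone.
    """
    def __init__(self, feat_dim=512, l0=16):
        super().__init__()
        self.fc = nn.Linear(feat_dim, l0 - 1)

    def forward(self, h):
        return self.fc(h)                 # z_loc with shape (N, l0 - 1)

def lod_loss(z_loc, loc_labels):
    # Cross-entropy of Eq. (13), averaged over the mini-batch.
    return F.cross_entropy(z_loc, loc_labels)
```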

Incoherence Length Detection (LeD). In addition to LoD, given the high-level representation $h$ and its corresponding label $L_{len}$ from Equation 12, the network is required to predict the length of incoherence. The proposed LeD task is designed as a regularization measure to avoid trivial learning. In some cases, the incoherence may be located at a period when the distribution of low-level representation changes intensively (e.g. intensive camera movement or sudden changes of lighting conditions). This could cause a distinct difference in low-level representation between sub-clips of the incoherent clip $\mathcal{V}_{inc}$, leading to trivial learning during LoD. Compared with LoD, which extracts semantic representation, our proposed LeD can be regarded as a simple yet challenging task that extracts additional temporal information by enforcing the network to deduce the length of incoherence with respect to the raw video. In practice, the accuracy of LeD is relatively low (Top-1 accuracy around 25%), while our ablation experiments show that it brings noticeable improvement to the performance, mainly because it improves the robustness of VID towards intensive changes of low-level information.

Similar to LoD, the LeD task can also be formulated as a classification problem, where cross-entropy loss is utilized for optimization as:

$l_{LeD} = -\sum_{i=0}^{\Delta l_{inc}} y_i^{len} \log\left(\frac{\exp(z_i^{len})}{\sum_{j=0}^{\Delta l_{inc}}\exp(z_j^{len})}\right)$,   (14)

where $z^{len} \in \mathbb{R}^{\Delta l_{inc}}$ is the output of the fully-connected layers $\phi^{len}(\cdot)$ given $h$ as input, $y^{len} \in \mathbb{R}^{\Delta l_{inc}}$ is the ground-truth label of the incoherence length, and $\Delta l_{inc}$ is the difference between the upper and lower bounds of the incoherence length. Similar to LoD, Equation 14 is applied to each representation of the mini-batch $Z^{len} \in \mathbb{R}^{N \times \Delta l_{inc}}$, and the loss $\mathcal{L}_{LeD}$ is calculated as the average of all losses over $Z^{len}$.

Intra-Video Contrastive Learning (ICL). Contrastive learning can effectively extract the mutual information between variously augmented samples from the same source. Recent works (Yao et al. 2021; Wang, Jiao, and Liu 2020) demonstrate its great potential, exceeding other self-supervised or even supervised methods. In this work, we include Intra-Video Contrastive Learning (ICL) as an extra optimization objective to maximize the mutual information between different incoherent clips from the same video. This is inspired by the fact that our visual systems extract mutual information from incoherent clips, and thus can still recognize the correct motions even under incoherent circumstances.

Formally, given a mini-batch of $N$ raw videos $\mathbf{V} = \{V_1, V_2, \ldots, V_N\}$, two incoherent clips are randomly generated for each raw video $V_i \in \mathbf{V}$ as described in Section Generation of incoherent video clips. The incoherent clips from the same raw video $V_i$ are considered as a positive pair, denoted as $\{\mathcal{V}_{inc}^i, \widetilde{\mathcal{V}}_{inc}^i\}$, while incoherent clips from different raw videos are regarded as negative pairs, denoted as $\{\mathcal{V}_{inc}^i, \mathcal{V}_{inc}^k\}, k \neq i$. Each incoherent clip $\mathcal{V}_{inc}^i$ is then fed to the network $f(\cdot)$, forming the high-level representation $h_i$. The representation $h_i$ is subsequently fed to a fully-connected layer $\phi^{cl}(\cdot)$ followed by a non-linear ReLU activation, generating the feature $z_i^{cl}$. Provided with features of positive pairs $\{z_i^{cl}, \widetilde{z}_i^{cl}\}$ and features of negative pairs $\{z_i^{cl}, z_k^{cl}\}, k \neq i$, the contrastive loss is computed as:

$\mathcal{L}_{ICL} = -\frac{1}{2N}\sum_{i=1}^{2N}\log\left(\frac{\exp(s(z_i^{cl}, \widetilde{z}_i^{cl}))}{\exp(s(z_i^{cl}, \widetilde{z}_i^{cl})) + \mathcal{D}(i)}\right)$,   (15)
$\mathcal{D}(i) = \sum_{k\neq i}\exp(s(z_i^{cl}, z_k^{cl}))$,   (16)

where $s(u, v) = u^{\top}v / \lVert u \rVert \lVert v \rVert$ denotes the similarity between features $u$ and $v$, and $\mathcal{D}(i)$ is the summation of exponential similarities between features of negative pairs.
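The ICL objective of Equations 15-16 can be implemented in the style of an NT-Xent loss over the $2N$ clip features; a minimal PyTorch sketch is given below. The batch layout (clip $i$ and clip $i+N$ coming from the same raw video) is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def icl_loss(z):
    """Intra-video contrastive loss of Eqs. (15)-(16).

    `z` has shape (2N, D); rows i and i + N are assumed to hold the two
    incoherent clips generated from the same raw video (a batch-layout
    assumption made for this sketch).
    """
    two_n = z.shape[0]
    n = two_n // 2
    z = F.normalize(z, dim=1)                       # so that z_i . z_k = s(z_i, z_k)
    sim = torch.exp(z @ z.t())                      # exp(s(.,.)) for all pairs
    pos_idx = torch.cat([torch.arange(n, two_n, device=z.device),
                         torch.arange(0, n, device=z.device)])
    pos = sim[torch.arange(two_n, device=z.device), pos_idx]   # exp(s(z_i, z~_i))
    mask = ~torch.eye(two_n, dtype=torch.bool, device=z.device)
    neg = (sim * mask).sum(dim=1) - pos             # D(i), Eq. (16): negatives only
    return -torch.log(pos / (pos + neg)).mean()     # Eq. (15)
```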

Network structure and training

The overall network structure is illustrated in Figure 3. Given the unlabeled raw videos, the incoherent clips are first generated as described in Section Generation of incoherent video clips, where each raw video randomly generates two incoherent clips as shown in Figure 3(a). Subsequently, the batch of incoherent clips is fed to the encoder $f(\cdot)$, implemented as a 3D CNN backbone. The overall optimization objective is formulated as:

$\mathcal{L} = \alpha\mathcal{L}_{LoD} + \beta\mathcal{L}_{LeD} + \lambda\mathcal{L}_{ICL}$,   (17)

where $\alpha$, $\beta$, $\lambda$ are the coefficients of the three loss terms from the sub-tasks, respectively.
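Putting the pieces together, one training step simply combines the three weighted losses computed from the shared backbone representation, as in the following sketch, which reuses the illustrative helpers from the earlier sketches; the LeD head and the projection layer for ICL are analogous assumed modules.

```python
import torch.nn.functional as F   # LocationHead, lod_loss and icl_loss follow the sketches above

def vid_training_step(backbone, loc_head, len_head, proj_head,
                      clips, loc_labels, len_labels,
                      alpha=1.0, beta=0.1, lam=0.1):
    """One VID training step (sketch). `backbone` is any 3D CNN encoder f(.)
    returning a flat representation h; the heads are illustrative modules,
    not the authors' released code."""
    h = backbone(clips)                                 # (2N, feat_dim): two clips per raw video
    l_lod = lod_loss(loc_head(h), loc_labels)           # Incoherence Location Detection, Eq. (13)
    l_led = F.cross_entropy(len_head(h), len_labels)    # Incoherence Length Detection, Eq. (14)
    z = F.relu(proj_head(h))                            # FC layer + ReLU before contrastive learning
    l_icl = icl_loss(z)                                 # Intra-Video Contrastive Learning, Eq. (15)
    return alpha * l_lod + beta * l_led + lam * l_icl   # Eq. (17)
```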

Experiments

In this section, we present thorough experiments to justify our proposed VID. We first illustrate our experiment settings and subsequently justify our VID design through detailed ablation studies. Finally, VID is evaluated on two downstream tasks, action recognition and video retrieval, in comparison with state-of-the-art methods. Code and visualization are available in the supplementary material.

Experiment settings

Datasets. We evaluate our VID across three action recognition datasets, including UCF101 (Soomro, Zamir, and Shah 2012), HMDB51 (Kuehne et al. 2011) and Kinetics-400 (Kay et al. 2017). UCF101 is a widely used video dataset for action recognition, which contains 13,320 videos with 101 action categories. HMDB51 is a relatively smaller yet challenging dataset for action recognition. It includes about 7,000 videos with 51 action classes. Both UCF101 and HMDB51 are divided into three training and testing splits. Kinetics-400, denoted as K-400, is a large dataset for action recognition. It contains about 304,000 videos with 400 action classes collected from the online video platform YouTube. Same as the setting of prior work (Wang, Jiao, and Liu 2020; Jenni, Meishvili, and Favaro 2020), we utilize the training split of Kinetics-400 and the training split 1 of UCF101 for self-supervised pre-training. The training split 1 of UCF101 and HMDB51 are utilized during fine-tuning for action recognition.

Backbone networks. As for the 3D CNN backbones, to fairly compare our proposed method with others (Wang, Jiao, and Liu 2020; Xu et al. 2019), we utilize three different 3D CNN networks in our experiments, including C3D (Tran et al. 2015), R3D (Hara, Kataoka, and Satoh 2018) and R(2+1)D (Tran et al. 2018). These backbones have been widely used to evaluate self-supervised methods in previous research. Specifically, C3D (Tran et al. 2015) is constructed by directly extending the 2D kernels of 2D CNNs to 3D ones. R3D (Hara, Kataoka, and Satoh 2018) introduces the residual connections of 2D CNNs into 3D CNNs. Following previous works (Yao et al. 2020; Jenni, Meishvili, and Favaro 2020; Kim, Cho, and Kweon 2019), we utilize R3D-18, the 18-layer variant of R3D. R(2+1)D (Tran et al. 2018) replaces the traditional 3D kernel with the combination of a 2D kernel and a 1D kernel for spatial and temporal feature extraction, respectively. In this work, we mainly conduct our experiments with R(2+1)D thanks to its superior performance compared with the others.

Augmentation and parameters. Following the settings of prior works (Jenni, Meishvili, and Favaro 2020; Wang, Jiao, and Liu 2020), each incoherent clip includes 16 frames. The sampling interval for each sub-clip is 1 and the range of the incoherence length is set as $l_{inc} \in \{3, 4, \ldots, 10\}$. When pre-training on UCF101, we follow the setting in (Wang, Jiao, and Liu 2020; Alwassel et al. 2019), which increases the epoch size from 9k to 90k, with color jittering applied along the temporal dimension. Frames are resized to $128 \times 171$ and then randomly cropped to $112 \times 112$. The whole input clip is then flipped horizontally with a probability of 50%. The network is trained with a batch size of 30. Stochastic gradient descent (Bottou 2010) is utilized for optimization with the weight decay set to 0.005 and the momentum set to 0.9. The coefficients of the sub-tasks $\alpha$, $\beta$, $\lambda$ are empirically set to 1, 0.1 and 0.1, respectively. The learning rate is initialized as 0.001 and divided by 10 every 6 epochs, with a total of 18 training epochs.
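For reference, a sketch of the optimizer configuration implied by these settings is given below; `model` is a placeholder standing in for the 3D backbone and heads.

```python
import torch
import torch.nn as nn

model = nn.Conv3d(3, 64, kernel_size=3)   # placeholder for the 3D backbone plus heads

optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.005)
# Learning rate divided by 10 every 6 epochs; 18 epochs in total, batch size 30.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.1)
```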

| Method | Jittering | Range of $l_{inc}$ | UCF101 (%) |
|---|---|---|---|
| Random | - | - | 56.7 |
| VID | ✓ | [2, 10] | 76.7 |
| VID | ✓ | [4, 10] | 77.3 |
| VID | ✓ | [5, 10] | 76.9 |
| VID | ✓ | [3, 6] | 76.8 |
| VID | ✓ | [3, 8] | 77.5 |
| VID | ✓ | [3, 12] | 77.1 |
| VID | ✓ | [3, 14] | 76.7 |
| VID | ✓ | [3, 10] | 78.1 |
| VID | × | [3, 10] | 76.3 |
Table 1: Ablation study of the range of incoherence lengths and jittering. The range of $l_{inc}$ is denoted as $[l_{inc}^{min}, l_{inc}^{max}]$.

Ablation studies

In this section, we justify the design of our proposed VID by ablation studies. We first illustrate the optimal range of the incoherence length, and then conduct experiments utilizing different sub-tasks. Our proposed incoherence detection is additionally evaluated with various backbones compared with previous coherence-based methods. Except as otherwise specified, our ablation studies are conducted with R(2+1)D (Tran et al. 2018) pre-trained on UCF101.

Range of the incoherence length. We first explore the best range of the incoherence length $l_{inc}$. As illustrated in Table 1, the experiments are conducted by changing either the lower bound $l_{inc}^{min}$ or the upper bound $l_{inc}^{max}$ of the range. As $l_{inc}^{min}$ increases from 2, we observe an improvement in performance from 76.7%, peaking at 78.1% with $l_{inc}^{min} = 3$, while the performance begins to decrease when $l_{inc}^{min}$ increases further. When $l_{inc}^{min}$ is smaller than 3, the incoherence between sub-clips is too difficult for the network to identify, whereas a further increase of the lower bound decreases the variety of incoherence lengths, leading to a drop in performance. Similarly, when the upper bound of the incoherence length increases from $l_{inc}^{max} = 6$, the performance of VID rises consistently from 76.8%, reaching a peak when $l_{inc}^{max} = 10$, whereas decreasing performance is observed as the upper bound increases further, dropping from 78.1% with $l_{inc}^{max} = 10$ to 76.7% with $l_{inc}^{max} = 14$. As $l_{inc}^{max}$ increases, the sample range of incoherence becomes more abundant, but the incoherence becomes too obvious when $l_{inc}^{max} > 10$. This observation indicates that an inappropriate range of $l_{inc}$ can result in too vague or too obvious incoherence, which leads to inferior performance. We thus set the range of $l_{inc}$ to $[3, 10]$ in the following experiments.

| Sub-tasks | LoD / $\alpha$ | LeD / $\beta$ | ICL / $\lambda$ | UCF101 (%) |
|---|---|---|---|---|
| Random Init | - | - | - | 56.7 |
| LoD | 1 | - | - | 75.4 |
| LeD | - | 1 | - | 70.9 |
| ICL | - | - | 1 | 72.1 |
| LoD+LeD | 1 | 0.1 | - | 77.3 |
| LoD+ICL | 1 | - | 0.1 | 76.9 |
| LeD+ICL | - | 1 | 0.1 | 71.8 |
| LoD+LeD+ICL | 1 | 0.1 | 0.1 | 78.1 |
Table 2: Ablation study of different sub-tasks.

Different sub-tasks. We further evaluate the performance of different sub-tasks. As shown in Table 2, when utilizing a single sub-task, we observe that networks pre-trained with any sub-task significantly exceed random initialization, with a relative improvement of more than 25.0%. The network with LoD obtains the highest performance of 75.4% on UCF101, which justifies the effectiveness of LoD. The network with ICL also achieves a competitive performance of 72.1%. However, when optimizing with only LeD, the network is required to directly predict the incoherence length without locating it. Therefore, the network cannot fully leverage incoherence detection in videos, leading to an inferior performance of 70.9%.

As for pairs of sub-tasks, we observe that LoD-based pairs (LoD+LeD and LoD+ICL) surpass the single LoD with noticeable margins of more than 2.0%. This justifies the effectiveness of LeD and ICL as additional objectives. We also observe an inferior performance of the LeD-based pair (LeD+ICL), which aligns with the performance of the network trained with LeD alone.

Refer to caption
Figure 4: Comparison with coherence-based methods.

Comparison with coherence-based methods. To justify the effectiveness of incoherence detection, we evaluate our VID without the additional ICL sub-task against previous coherence-based methods (Luo et al. 2020; Xu et al. 2019) that utilize order prediction. As illustrated in Figure 4, our VID outperforms previous coherence-based methods across various backbones on different datasets. On UCF101, VID exceeds VCOP (Xu et al. 2019) and VCP (Luo et al. 2020) by 4.3%-7.9% and 1.4%-11.0%, respectively. On HMDB51, the proposed VID surpasses VCOP (Xu et al. 2019) and VCP (Luo et al. 2020) by over 10% relatively. The improvement indicates that incoherence detection requires a more comprehensive understanding of videos compared with frame order reasoning.

Evaluating self-supervised representation

| Method | Pre-train | UCF101 (%) | HMDB51 (%) |
|---|---|---|---|
| C3D (PRP (Yao et al. 2020)) | UCF101 | 69.1 | 34.5 |
| C3D (PMAS (Wang et al. 2019)) | K-400 | 58.8 | 32.6 |
| C3D (RTT (Jenni, Meishvili, and Favaro 2020)) | K-600 | 69.9 | 39.6 |
| C3D (Ours) | UCF101 | 70.2±0.5 | 37.7±0.7 |
| R3D (ST-Puzzle (Kim, Cho, and Kweon 2019)) | K-400 | 58.8 | 32.6 |
| R3D (PRP (Yao et al. 2020)) | UCF101 | 66.5 | 29.7 |
| R3D (RTT (Jenni, Meishvili, and Favaro 2020)) | UCF101 | 77.3 | 47.5 |
| R3D (Ours) | UCF101 | 73.6±0.5 | 38.0±0.6 |
| R(2+1)D (PRP (Yao et al. 2020)) | UCF101 | 72.1 | 35.0 |
| R(2+1)D (PP (Wang, Jiao, and Liu 2020)) | K-400 | 77.1 | 36.6 |
| R(2+1)D (RTT (Jenni, Meishvili, and Favaro 2020)) | UCF101 | 81.6 | 46.4 |
| R(2+1)D (Ours) | UCF101 | 78.1±0.6 | 40.1±0.6 |
| R(2+1)D (Ours) | K-400 | 78.5±0.4 | 41.5±0.5 |
Table 3: Performance of action recognition compared with previous methods. RTT (Jenni, Meishvili, and Favaro 2020) is a SOTA method utilizing multiple data transformations. Results are the average of three evaluations (mean±std).

Action recognition. To verify the effectiveness of our proposed VID, we evaluate our VID with different backbones on action recognition which is a primary downstream task adopted in prior works (Wang, Jiao, and Liu 2020; Yao et al. 2020; Jenni, Meishvili, and Favaro 2020). For action recognition, the network is initialized with the weights of the pre-trained model while the fully-connected layer is randomly initialized. The whole network is trained using the cross-entropy loss with an initial learning rate of 0.003. Other augmentation and parameter settings are the same as the pre-training stage. For testing, following the evaluation protocol of previous works (Wang, Jiao, and Liu 2020; Yao et al. 2020), we uniformly sample 10 clips from each video followed by a center crop. The final predictions for each video are the average result of all sampled clips.
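The clip-level testing protocol can be sketched as follows; uniform clip sampling and prediction averaging follow the text, while the helper names and tensor layout are our own assumptions.

```python
import torch

def predict_video(model, frames, num_clips=10, clip_len=16, center_crop=None):
    """Average clip-level predictions over uniformly sampled clips.

    `frames` is a (T, C, H, W) tensor with T >= clip_len; `center_crop` is an
    optional callable cropping each clip to 112x112. Helper names are
    illustrative of the evaluation protocol described above.
    """
    T = frames.shape[0]
    starts = torch.linspace(0, T - clip_len, num_clips).long()
    logits = []
    with torch.no_grad():
        for s in starts:
            clip = frames[s : s + clip_len]              # (clip_len, C, H, W)
            if center_crop is not None:
                clip = center_crop(clip)
            clip = clip.permute(1, 0, 2, 3).unsqueeze(0) # (1, C, T, H, W) for 3D CNNs
            logits.append(model(clip))
    return torch.stack(logits).mean(dim=0)               # averaged prediction for the video
```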

As shown in Table 3, our proposed VID achieves state-of-the-art (SOTA) results compared with self-supervised spatio-temporal reasoning methods that utilize a single data transformation. For C3D, our proposed VID outperforms the SOTA PRP method (Yao et al. 2020) by 1.1% on UCF101 and 3.2% on HMDB51. For R3D, VID also exceeds PRP and ST-Puzzle (Kim, Cho, and Kweon 2019), the previous SOTA methods on UCF101 and HMDB51, with noticeable margins of 7.1% on UCF101 and 5.4% on HMDB51. With R(2+1)D pre-trained on UCF101, VID further surpasses the previous SOTA method PP (Wang, Jiao, and Liu 2020) by 1.0% on UCF101 and 3.1% on HMDB51. When pre-trained on Kinetics-400, the margins of improvement further expand to 1.4% and 4.5%, respectively. Given these results, VID learns richer spatio-temporal representations compared with previous single-transformation methods.

In Table 3, we also include the results of RTT (Jenni, Meishvili, and Favaro 2020), which assembles multiple transformations, leading to superior performance compared with single-transformation methods. Nevertheless, for C3D, VID is the only single-transformation method that outperforms RTT, by 0.3% on UCF101, and it provides competitive performance on HMDB51. It would be possible to further improve the performance of ensemble-based methods by including our VID, while in this work we mainly focus on leveraging video coherence using a single temporal transformation.

| Method | Top1 | Top5 | Top10 | Top20 | Top50 |
|---|---|---|---|---|---|
| C3D (PRP (Yao et al. 2020)) | 23.2 | 38.1 | 46.0 | 55.7 | 68.4 |
| C3D (PP (Wang, Jiao, and Liu 2020)) | 20.0 | 37.4 | 46.9 | 58.5 | 73.1 |
| C3D (Ours) | 26.9 | 43.6 | 53.6 | 63.8 | 78.2 |
| R3D (PRP (Yao et al. 2020)) | 22.8 | 38.5 | 46.7 | 55.2 | 69.1 |
| R3D (PP (Wang, Jiao, and Liu 2020)) | 19.9 | 36.2 | 46.1 | 55.6 | 69.8 |
| R3D (RTT (Jenni, Meishvili, and Favaro 2020)) | 26.1 | 48.5 | 59.1 | 69.6 | 82.8 |
| R3D (Ours) | 26.4 | 44.5 | 54.1 | 63.9 | 78.2 |
| R(2+1)D (PRP (Yao et al. 2020)) | 20.3 | 34.0 | 41.9 | 51.7 | 64.2 |
| R(2+1)D (PP (Wang, Jiao, and Liu 2020)) | 17.9 | 34.3 | 44.6 | 55.5 | 72.0 |
| R(2+1)D (Ours) | 22.0 | 40.4 | 51.2 | 61.8 | 74.7 |
Table 4: Performance of video retrieval on UCF101.
| Method | Top1 | Top5 | Top10 | Top20 | Top50 |
|---|---|---|---|---|---|
| C3D (PRP (Yao et al. 2020)) | 10.5 | 27.2 | 40.4 | 56.2 | 75.9 |
| C3D (PP (Wang, Jiao, and Liu 2020)) | 8.0 | 25.2 | 37.8 | 54.4 | 77.5 |
| C3D (Ours) | 11.6 | 29.6 | 43.3 | 58.4 | 77.3 |
| R3D (PRP (Yao et al. 2020)) | 8.2 | 25.8 | 38.5 | 53.3 | 75.9 |
| R3D (PP (Wang, Jiao, and Liu 2020)) | 8.2 | 24.2 | 37.3 | 53.3 | 74.5 |
| R3D (Ours) | 11.2 | 32.2 | 45.4 | 59.8 | 79.2 |
| R(2+1)D (PRP (Yao et al. 2020)) | 8.2 | 25.3 | 36.2 | 51.0 | 73.0 |
| R(2+1)D (PP (Wang, Jiao, and Liu 2020)) | 10.1 | 24.6 | 37.6 | 54.4 | 77.1 |
| R(2+1)D (Ours) | 10.4 | 27.9 | 42.7 | 58.1 | 76.7 |
Table 5: Performance of video retrieval on HMDB51.

Video retrieval. We further evaluate our VID on the downstream task of nearest-neighbour video retrieval, which evaluates the quality of the features extracted by the self-supervised pre-trained model. To make a fair comparison, our evaluation follows the protocol of previous state-of-the-art methods (Wang, Jiao, and Liu 2020; Yao et al. 2020). All models are pre-trained on UCF101. Given ten 16-frame clips sampled from each video, their features are extracted from the last pooling layer of the pre-trained backbone model. During inference, frames of each clip are first resized to $128 \times 171$ and then centrally cropped to $112 \times 112$. Clips in the testing split are utilized to query the top-$k$ nearest samples based on their corresponding features. Here we consider $k$ equal to 1, 5, 10, 20 and 50.
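A sketch of this retrieval protocol is given below. Feature extraction from the last pooling layer follows the text, while cosine similarity is assumed as the distance measure since the metric is not restated here.

```python
import torch
import torch.nn.functional as F

def topk_retrieval(query_feats, query_labels, gallery_feats, gallery_labels,
                   ks=(1, 5, 10, 20, 50)):
    """Nearest-neighbour retrieval sketch. Each row of `query_feats` and
    `gallery_feats` is one clip feature taken from the last pooling layer;
    cosine similarity is an assumption. Returns the Top-k hit rate for each k."""
    q = F.normalize(query_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    sim = q @ g.t()                                     # (num_query, num_gallery)
    results = {}
    for k in ks:
        topk = sim.topk(k, dim=1).indices               # indices of the k nearest gallery clips
        hit = (gallery_labels[topk] == query_labels.unsqueeze(1)).any(dim=1)
        results[k] = hit.float().mean().item()
    return results
```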

As shown in Table 4 and Table 5, VID outperforms the state-of-the-art methods PP (Wang, Jiao, and Liu 2020) and PRP (Yao et al. 2020) on most evaluation metrics of UCF101 and HMDB51 across all backbones with significant margins (e.g. 1.7%-4.4% for Top1 on UCF101). Specifically, VID surpasses all previous methods on HMDB51 across all evaluation metrics except Top50, with improvements ranging from 0.3% to 8.1%. The significant improvement further justifies that our VID extracts more effective spatio-temporal representation for downstream tasks.

Conclusion

In this paper, we propose a novel self-supervised method based on video incoherence detection for video representation learning. The incoherent clip is generated as the concatenation of sub-clips sampled from the same video with incoherence between each other. By detecting the location and length of incoherence, the network can extract effective spatio-temporal features. Intra-video contrastive learning is further developed to maximize the mutual information between incoherent clips from the same raw video. Extensive experiments show that VID achieves state-of-the-art performance with significant margins compared with previous methods. The proposed VID reveals a new perspective for leveraging video coherence for video representation learning.

Appendix

Refer to caption
Figure 5: Visualization of heat maps with/without LeD. The heat maps are generated from the last convolution layer based on Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al. 2017). The indices below indicate the sub-clips to which the frames belong.

Implementation details

In addition to the experiment settings in our paper, we present the implementation details of our experiments. As mentioned in Sec. 4.1, color jittering is applied to each extracted incoherent clip along the temporal dimension. Specifically, we randomly modify the brightness, contrast, saturation and hue of each frame in the incoherent clip, whose augmentation ranges are $[0.2, 1.8]$, $[0.2, 1.8]$, $[0.2, 1.8]$ and $[-0.2, 0.2]$, respectively. Color jittering is enabled during the pre-training and fine-tuning stages and disabled during evaluation. We conduct our experiments using PyTorch (Paszke et al. 2017) with two NVIDIA Tesla P100 GPUs. Our code is available in the appendix.
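The per-frame color jittering with the ranges above can be reproduced with a standard torchvision transform, as in the sketch below (assuming a torchvision version that accepts tensor frames); applying the transform independently to each frame is our reading of jittering along the temporal dimension.

```python
import torch
from torchvision import transforms

# Factor ranges as stated above: brightness/contrast/saturation in [0.2, 1.8], hue in [-0.2, 0.2].
frame_jitter = transforms.ColorJitter(brightness=(0.2, 1.8), contrast=(0.2, 1.8),
                                      saturation=(0.2, 1.8), hue=(-0.2, 0.2))

def jitter_clip(clip):
    """Apply color jittering independently to each frame of a (T, C, H, W) clip,
    our reading of jittering 'along the temporal dimension'."""
    return torch.stack([frame_jitter(frame) for frame in clip])
```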

Heat map visualization

We visualize the heat maps of the extracted representation to justify the effectiveness of the LeD task, as shown in Figure 5. To validate the regularization effect of LeD, we additionally present the corresponding heat maps from the network pre-trained without LeD.

Refer to caption
Figure 6: Heat map visualization of our VID. The column “Input” contains original frames from the test samples. The following frames are sampled from different sub-clips.

As shown in the last two rows of Figure 5, when there is only a subtle difference between the scenes of the sub-clips, networks pre-trained with or without LeD both focus on the actors to detect the abnormal motion caused by incoherence, which shows that incoherence detection requires motion understanding. When the scenes change intensively, as shown in the first row, the network pre-trained with LeD maintains its concentration on the motion areas, while the network without LeD is distracted by the dynamic scene. This observation justifies that the utilization of LeD increases the robustness of VID towards severe changes of low-level information, which avoids trivial learning.

Refer to caption
Figure 7: Visualization of video retrieval results of our VID and previous PRP (Yao et al. 2020). The figures in the first column are queries. For each query, we present Top 3 retrieval results of our VID and the previous state-of-the-art PRP (Yao et al. 2020). Action classes in red represent the correct retrieval results.

In addition to Figure 5, we present more heat map visualizations of our VID to justify its effectiveness. As shown in Figure 6, each row shows augmented frames extracted from a test sample of UCF101 (Soomro, Zamir, and Shah 2012). The first column is the original frame representing the input sample. The following columns are heat map visualizations generated by Grad-CAM (Selvaraju et al. 2017) for different sub-clips.

The heat maps provided in Figure 6 justify our assumption that incoherence detection requires an understanding of motion in videos. For example, given the golf swing in the first row, the network pre-trained with our VID concentrates on the upper body of the actor to detect the movements for incoherence detection. Additionally, as illustrated in the lower three rows of Figure 6, when there are intensive changes between the scenes of different sub-clips, our VID maintains its concentration on the motion areas to detect incoherence. For example, given the input depicting horse racing in the last row, the network pre-trained with our VID continuously focuses on the horses and riders regardless of the intensive changes of the background.

Video retrieval results

We present multiple examples of video retrieval results in comparison with the previous state-of-the-art method PRP (Yao et al. 2020). Following the evaluation protocols of previous works (Yao et al. 2020; Xu et al. 2019), we extract the features from the last pooling layer of the pre-trained network. For PRP (Yao et al. 2020), we evaluate its performance with its provided pre-trained model. Both our VID and PRP (Yao et al. 2020) are evaluated with ResNet18, the 18-layer variant of R3D (Tran et al. 2018). As illustrated in Figure 7, our proposed VID provides more reasonable results compared to PRP (Yao et al. 2020). For example, given the query of applying eye makeup shown in the first row, our VID retrieves two samples of the same action class as the query among the Top 3, while PRP retrieves samples that belong to similar-yet-incorrect actions, such as haircut and blow-dry hair. The retrieval results indicate that the network pre-trained with our VID obtains a more comprehensive understanding of videos compared to previous methods.

References

  • Alwassel et al. (2019) Alwassel, H.; Mahajan, D.; Korbar, B.; Torresani, L.; Ghanem, B.; and Tran, D. 2019. Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667.
  • Ando, Zhang, and Bartlett (2005) Ando, R. K.; Zhang, T.; and Bartlett, P. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(11).
  • Bachman, Hjelm, and Buchwalter (2019) Bachman, P.; Hjelm, R. D.; and Buchwalter, W. 2019. Learning Representations by Maximizing Mutual Information Across Views. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Bottou (2010) Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, 177–186. Springer.
  • Caruana and de Sa (1997) Caruana, R.; and de Sa, V. R. 1997. Promoting Poor Features to Supervisors: Some Inputs Work Better as Outputs. In Mozer, M. C.; Jordan, M.; and Petsche, T., eds., Advances in Neural Information Processing Systems, volume 9. MIT Press.
  • Devlin et al. (2018) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dwibedi et al. (2019) Dwibedi, D.; Aytar, Y.; Tompson, J.; Sermanet, P.; and Zisserman, A. 2019. Temporal Cycle-Consistency Learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1801–1810.
  • Fernando et al. (2017) Fernando, B.; Bilen, H.; Gavves, E.; and Gould, S. 2017. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3636–3645.
  • Gan et al. (2018) Gan, C.; Gong, B.; Liu, K.; Su, H.; and Guibas, L. J. 2018. Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5589–5597.
  • Han, Xie, and Zisserman (2019) Han, T.; Xie, W.; and Zisserman, A. 2019. Video representation learning by dense predictive coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 0–0.
  • Han, Xie, and Zisserman (2020) Han, T.; Xie, W.; and Zisserman, A. 2020. Memory-Augmented Dense Predictive Coding for Video Representation Learning. In Vedaldi, A.; Bischof, H.; Brox, T.; and Frahm, J.-M., eds., Computer Vision – ECCV 2020, 312–329. Cham: Springer International Publishing. ISBN 978-3-030-58580-8.
  • Hara, Kataoka, and Satoh (2018) Hara, K.; Kataoka, H.; and Satoh, Y. 2018. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 6546–6555.
  • He et al. (2020) He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
  • Hjelm et al. (2019) Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2019. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  • Jenni, Meishvili, and Favaro (2020) Jenni, S.; Meishvili, G.; and Favaro, P. 2020. Video representation learning by recognizing temporal transformations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII 16, 425–442. Springer.
  • Kay et al. (2017) Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
  • Kim, Cho, and Kweon (2019) Kim, D.; Cho, D.; and Kweon, I. S. 2019. Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01): 8545–8552.
  • Kuehne et al. (2011) Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; and Serre, T. 2011. HMDB51: A Large Video Database for Human Motion Recognition. In Proceedings of the IEEE International Conference on Computer Vision, 2556–2563.
  • Lan et al. (2019) Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  • Lee et al. (2017) Lee, H.-Y.; Huang, J.-B.; Singh, M.; and Yang, M.-H. 2017. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, 667–676.
  • Lorre et al. (2020) Lorre, G.; Rabarisoa, J.; Orcesi, A.; Ainouz, S.; and Canu, S. 2020. Temporal contrastive pretraining for video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 662–670.
  • Luo et al. (2020) Luo, D.; Liu, C.; Zhou, Y.; Yang, D.; Ma, C.; Ye, Q.; and Wang, W. 2020. Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 11701–11708. AAAI Press.
  • Luo et al. (2017) Luo, Z.; Peng, B.; Huang, D.-A.; Alahi, A.; and Fei-Fei, L. 2017. Unsupervised learning of long-term motion dynamics for videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2203–2212.
  • Misra and Maaten (2020) Misra, I.; and Maaten, L. v. d. 2020. Self-Supervised Learning of Pretext-Invariant Representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Misra, Zitnick, and Hebert (2016) Misra, I.; Zitnick, C. L.; and Hebert, M. 2016. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, 527–544. Springer.
  • Oord, Li, and Vinyals (2018) Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • Paszke et al. (2017) Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS-W.
  • Selvaraju et al. (2017) Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In 2017 IEEE International Conference on Computer Vision (ICCV), 618–626.
  • Soomro, Zamir, and Shah (2012) Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
  • Srivastava, Mansimov, and Salakhutdinov (2015) Srivastava, N.; Mansimov, E.; and Salakhutdinov, R. 2015. Unsupervised Learning of Video Representations Using LSTMs. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, 843–852. JMLR.org.
  • Sun et al. (2019a) Sun, C.; Baradel, F.; Murphy, K.; and Schmid, C. 2019a. Learning video representations using contrastive bidirectional transformer. arXiv preprint arXiv:1906.05743.
  • Sun et al. (2019b) Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; and Schmid, C. 2019b. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7464–7473.
  • Tian et al. (2020) Tian, Y.; Che, Z.; Bao, W.; Zhai, G.; and Gao, Z. 2020. Self-supervised Motion Representation via Scattering Local Motion Cues. In Vedaldi, A.; Bischof, H.; Brox, T.; and Frahm, J.-M., eds., Computer Vision – ECCV 2020, 71–89. Cham: Springer International Publishing. ISBN 978-3-030-58568-6.
  • Tran et al. (2015) Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In 2015 IEEE International Conference on Computer Vision (ICCV), 4489–4497.
  • Tran et al. (2018) Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; and Paluri, M. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 6450–6459.
  • Vondrick, Pirsiavash, and Torralba (2016) Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Generating Videos with Scene Dynamics. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, 613–621. Red Hook, NY, USA: Curran Associates Inc. ISBN 9781510838819.
  • Wang et al. (2019) Wang, J.; Jiao, J.; Bao, L.; He, S.; Liu, Y.; and Liu, W. 2019. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4006–4015.
  • Wang, Jiao, and Liu (2020) Wang, J.; Jiao, J.; and Liu, Y.-H. 2020. Self-supervised video representation learning by pace prediction. In European Conference on Computer Vision, 504–521. Springer.
  • Wu et al. (2018) Wu, Z.; Xiong, Y.; Yu, S. X.; and Lin, D. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3733–3742.
  • Xu et al. (2019) Xu, D.; Xiao, J.; Zhao, Z.; Shao, J.; Xie, D.; and Zhuang, Y. 2019. Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10334–10343.
  • Yao et al. (2021) Yao, T.; Zhang, Y.; Qiu, Z.; Pan, Y.; and Mei, T. 2021. SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning. In 35th AAAI Conference on Artificial Intelligence.
  • Yao et al. (2020) Yao, Y.; Liu, C.; Luo, D.; Zhou, Y.; and Ye, Q. 2020. Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).