
A Novel Self-Knowledge Distillation Approach with
Siamese Representation Learning for
Action Recognition

Duc-Quang Vu
Dept. of CSIE, National Central University
Taoyuan, Taiwan
Email: [email protected]

Thi-Thu-Trang Phung
Thai Nguyen University
Thai Nguyen, Vietnam
Email: [email protected]

Jia-Ching Wang
Dept. of CSIE, National Central University
Taoyuan, Taiwan
Email: [email protected]
Abstract

Knowledge distillation is an effective transfer of knowledge from a heavy network (teacher) to a small network (student) that boosts the student's performance. Self-knowledge distillation, a special case of knowledge distillation, removes the costly training of a large teacher network while preserving the student's performance. This paper introduces a novel self-knowledge distillation approach via Siamese representation learning, which minimizes the difference between two representation vectors of two different views of a given sample. Our proposed method, SKD-SRL, utilizes both soft label distillation and the similarity of representation vectors. Therefore, SKD-SRL can generate more consistent predictions and representations across different views of the same data point. We evaluate our method on various standard benchmark datasets. The experimental results show that SKD-SRL significantly improves accuracy compared to existing supervised learning and knowledge distillation methods regardless of the network architecture.

I Introduction

Action recognition is one of the most important problems in computer vision. Various methods have been proposed to address this task, such as HOG3D [1], SIFT3D [2], ESURF [3], MBH [4], iDTs [5], etc. Instead of using the traditional hand-crafted descriptors above, deep learning models are now trained to learn features automatically with convolutional neural networks (CNNs) [6], which has brought substantial advances to computer vision and image processing. Various CNN models have been proposed for action recognition in recent years, such as I3D [7], SlowFastNet [8], 3D ResNet [9], ip-CSN [10], ir-CSN [10], etc. However, these deep models require many layers with millions of parameters, so they may not be suitable for deployment on embedded or mobile devices with limited resources.

Knowledge distillation (KD) has become a promising approach to address the above limitation by transferring knowledge from a larger deep neural network (i.e., the teacher network) to a small network (i.e., the student network). Various KD approaches [11, 12, 13, 14, 15] have been proposed to transfer different forms of knowledge, such as using logit vectors from the last layer or feature maps from intermediate layers of a large network to guide the learning of the student model (see Fig. 1 (a)). Although this approach can effectively improve the performance of the student network, the teacher network complicates the training of a single network, especially for 3D CNNs, whose training time and GPU memory costs are much higher than those of 2D CNNs. Self-knowledge distillation (Self-KD) has been proposed to remove the large and expensive teacher network. In Self-KD, the student learns and distills knowledge by itself without using any teacher. Various Self-KD-based methods have been proposed for many tasks such as image classification [16, 17], object detection [18], machine translation [18], natural language processing [19], etc. In [20], Xu et al. demonstrated that the Self-KD approach outperforms conventional KD-based methods without training teacher networks.


Figure 1: Comparison of various distillation approaches. The black line is the forward path; the black dashed line indicates the soft label distillation; the red dashed line denotes the feature distillation at intermediate layers; the blue dashed line illustrates the similarity loss between two representation vectors of the same sample. Two data augmentation subsets are sampled from the set of augmentations ($t \sim \mathcal{T}$ and $t^{\prime} \sim \mathcal{T}$) and applied to each data point to obtain two different views. (a) Conventional knowledge distillation with the heavy teacher (the green network). (b) Self-knowledge distillation via data augmentation. (c) Self-knowledge distillation by an auxiliary network (the dark blue network). (d) Our proposed approach with Siamese representation learning.

Self-KD methods can largely be divided into two main categories: data augmentation-based approaches and auxiliary network-based approaches. Data augmentation-based approaches usually enforce consistent predictions between two differently distorted versions of a sample or between a pair [20] of samples of the same class [16] (see Fig. 1 (b)). Meanwhile, auxiliary network-based approaches utilize additional layers in the middle of the classifier network to transfer knowledge to itself via feature maps at intermediate layers and/or soft label distillation at the classification layer [17] (see Fig. 1 (c)).

In this paper, we introduce a novel self-knowledge distillation approach with Siamese representation learning (SKD-SRL) for action recognition. Siamese networks are weight-sharing neural networks applied to two or more inputs [21]; they are usually used to maximize the similarity between representations of the same object under different conditions. Various Siamese network-based methods have been proposed and achieve state-of-the-art performance in self-supervised learning, such as SimCLR [22], SwAV [23], Barlow Twins [24], etc. Siamese representation learning is a simple yet robust method that maximizes the similarity of two views of a given input and is usually applied in unsupervised visual representation learning [21]. In SKD-SRL, we focus on both the consistency between the two predictive distributions (via soft label distillation) and the similarity between the two representation vectors of the two distorted views (via Siamese representation learning) of the same sample (see Fig. 1 (d)). Experimental results show that our proposed SKD-SRL significantly improves accuracy compared to independent training and conventional KD methods. Moreover, our method outperforms state-of-the-art supervised learning and KD methods. The details are discussed in Section IV.

II Related Work

Many different approaches and network architectures have been proposed for action recognition in recent years. FASTER [25] learns to aggregate the predictions of models of different complexities, using expensive models to capture subtle motion information and lightweight representations from cheap models to cover scene changes in the video. Combined with a recurrent network, FASTER significantly reduces computational cost while maintaining state-of-the-art accuracy across popular datasets. In [26], the authors presented a new architecture based on a combination of a deep subnet operating on low-resolution frames with a compact subnet operating on high-resolution frames. MoViNets [27] are a family of computation- and memory-efficient video networks found by neural architecture search. These methods have been shown to require less computational cost and memory while achieving state-of-the-art performance. Among distillation approaches, Crasto et al. [14] proposed MARS, which distills knowledge from a flow network (the teacher) to an RGB network (the student). Besides, several transformer-based approaches have been proposed and achieve state-of-the-art performance, such as MViT [28], VTN [29], etc.

III Proposed Method


Figure 2: Overview of the SKD-SRL approach. The black line is the forward path; CE loss is the standard cross-entropy loss; the distillation loss is calculated by Eq. 2 and the similarity loss by Eq. 3; $\oplus$ denotes the add operator. The hard label is the ground-truth label of the video.

SKD-SRL is a combination of Self-KD and Siamese representation learning. We first apply data augmentation twice to obtain two versions (two views) of the input video. Each view is forward propagated through the encoder network $f_{\theta}$ to create a representation vector. Both representation vectors are passed into a fully connected ($FC$) layer to generate logit prediction vectors. The sum of the logit predictions is used as the soft label to transfer knowledge to each branch. Besides, both representation vectors are also forwarded to the projector MLP network $g_{\xi}$ and the predictor network $q_{\mu}$, which transform one vector's output and match it to the other vector by minimizing their negative cosine similarity. The proposed SKD-SRL method is illustrated in Figure 2. In summary, our proposed approach includes three main steps:

  1. Calculating the representation vectors of two randomly augmented views of a given video with the encoder network $f_{\theta}$.

  2. Distilling the knowledge via soft labels, which are calculated from the representation vectors of step 1.

  3. Calculating the similarity between the two representation vectors via Siamese representation learning.

III-A Training Paradigm

Let $\{(x^{(i)},y^{(i)})\}_{i=1}^{N}$ denote a labelled source dataset of $N$ training samples from $K$ classes, where $x^{(i)}$ is a video and $y^{(i)}$ is a $K$-dimensional one-hot vector serving as its hard label (i.e., the action). For a mini-batch input $B=\{(x^{(i)},y^{(i)})\}_{i=1}^{n}$, we apply data augmentation operators (e.g., flip, contrast adjustment, etc.) to each $x^{(i)}$ in $B$. Let $x_{1}^{(i)},x_{2}^{(i)}=\mathcal{T}(x^{(i)})$ denote the two randomly augmented views of the original video $x^{(i)}$, where $\mathcal{T}(\cdot)$ is the set of data augmentation operators. Note that $x_{1}^{(i)}$ and $x_{2}^{(i)}$ share the same label $y^{(i)}$. The two views $x_{1}^{(i)},x_{2}^{(i)}$ are passed into an encoder network $f_{\theta}$, where $\theta$ denotes the set of parameters of $f$. Let $\mathbf{r}_{1}^{(i)}$ and $\mathbf{r}_{2}^{(i)}$ be the output vectors of the network $f_{\theta}$ for the inputs $x_{1}^{(i)}$ and $x_{2}^{(i)}$ (i.e., $\mathbf{r}_{1}^{(i)}=f_{\theta}(x_{1}^{(i)})$ and $\mathbf{r}_{2}^{(i)}=f_{\theta}(x_{2}^{(i)})$). The encoder $f_{\theta}$ shares weights between the two views. An $FC$ layer is utilized to generate the logit predictions from $\mathbf{r}_{1}^{(i)}$ and $\mathbf{r}_{2}^{(i)}$ as follows:

$\mathbf{p}_{1}^{(i)}=FC(\mathbf{r}_{1}^{(i)}), \qquad \mathbf{p}_{2}^{(i)}=FC(\mathbf{r}_{2}^{(i)})$   (1)

where $\mathbf{p}_{1}^{(i)}$ and $\mathbf{p}_{2}^{(i)}$ denote the $K$-dimensional logit prediction vectors, in which each dimension represents the logit value for the $k^{th}$ class (with $k=1,2,\dots,K$). Let $\mathbf{p}^{(i)}=\mathbf{p}_{1}^{(i)}\oplus\mathbf{p}_{2}^{(i)}$, where $\oplus$ denotes the add operator. Similar to other self-knowledge distillation methods, SKD-SRL performs knowledge distillation through the soft label as follows:

$\mathcal{L}_{KL}(\mathbf{p}_{1}^{(i)},\mathbf{p}_{2}^{(i)};\tau)=D_{KL}\Big(softmax\big(\tfrac{\mathbf{p}^{(i)}}{\tau}\big)\,\big\|\,softmax\big(\tfrac{\mathbf{p}_{1}^{(i)}}{\tau}\big)\Big)+D_{KL}\Big(softmax\big(\tfrac{\mathbf{p}^{(i)}}{\tau}\big)\,\big\|\,softmax\big(\tfrac{\mathbf{p}_{2}^{(i)}}{\tau}\big)\Big)$   (2)

where $D_{KL}$ denotes the Kullback-Leibler (KL) divergence and $\tau$ is the temperature scaling parameter. $\mathbf{r}_{1}^{(i)}$ and $\mathbf{r}_{2}^{(i)}$ are also passed into a projector MLP network $g_{\xi}$, giving $\mathbf{z}_{1}^{(i)}=g_{\xi}(\mathbf{r}_{1}^{(i)})$ and $\mathbf{z}_{2}^{(i)}=g_{\xi}(\mathbf{r}_{2}^{(i)})$. A predictor MLP $q_{\mu}$ is utilized to transform one vector $\mathbf{z}_{1}^{(i)}$ ($\mathbf{z}_{2}^{(i)}$) and match it to the other vector $\mathbf{z}_{2}^{(i)}$ ($\mathbf{z}_{1}^{(i)}$) by minimizing their negative cosine similarity; $\xi$ and $\mu$ denote the sets of parameters of the networks $g$ and $q$, respectively. Denote the two output vectors as $\mathbf{v}_{1}^{(i)}=q_{\mu}(\mathbf{z}_{1}^{(i)})$ and $\mathbf{v}_{2}^{(i)}=q_{\mu}(\mathbf{z}_{2}^{(i)})$. We use the following similarity loss function:

$\mathcal{L}_{sim}(\mathbf{z}_{1}^{(i)},\mathbf{z}_{2}^{(i)},\mathbf{v}_{1}^{(i)},\mathbf{v}_{2}^{(i)})=\frac{1}{2}D_{sim}\big(\mathbf{v}_{1}^{(i)},stopgrad(\mathbf{z}_{2}^{(i)})\big)+\frac{1}{2}D_{sim}\big(\mathbf{v}_{2}^{(i)},stopgrad(\mathbf{z}_{1}^{(i)})\big)$   (3)

where $stopgrad$ is the stop-gradient operator [21], which means that $\mathbf{z}_{1}^{(i)}$ and $\mathbf{z}_{2}^{(i)}$ are treated as constants in this term. $D_{sim}$ is the negative cosine similarity:

$D_{sim}(\mathbf{v}_{1},\mathbf{z}_{2})=-\frac{\mathbf{v}_{1}}{\left\lVert\mathbf{v}_{1}\right\rVert_{2}}\cdot\frac{\mathbf{z}_{2}}{\left\lVert\mathbf{z}_{2}\right\rVert_{2}},$   (4)

where $\left\lVert\cdot\right\rVert_{2}$ is the $\ell_{2}$-norm. By integrating Eq. 2 and Eq. 3, we can construct the final optimization objective for the entire network as follows:

$\mathcal{L}_{net}=\frac{1}{n}\sum_{i=1}^{n}\Big[\mathcal{L}_{CE}\big(y^{(i)},softmax(\mathbf{p}_{1}^{(i)})\big)+\mathcal{L}_{CE}\big(y^{(i)},softmax(\mathbf{p}_{2}^{(i)})\big)+\alpha\,\mathcal{L}_{KL}(\mathbf{p}_{1}^{(i)},\mathbf{p}_{2}^{(i)};\tau)+\beta\,\mathcal{L}_{sim}(\mathbf{z}_{1}^{(i)},\mathbf{z}_{2}^{(i)},\mathbf{v}_{1}^{(i)},\mathbf{v}_{2}^{(i)})\Big]$   (5)

where $\mathcal{L}_{CE}$ is the standard cross-entropy loss, and $\alpha$ and $\beta$ are the loss weights for the distillation loss $\mathcal{L}_{KL}$ and the similarity loss $\mathcal{L}_{sim}$, respectively. A possible implementation of these loss terms is sketched below, and the overall pseudo-code of SKD-SRL is given in Algorithm 1.
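
To make the loss terms concrete, the following PyTorch-style sketch shows one way the distillation loss in Eq. 2 and the similarity loss in Eqs. 3-4 could be implemented. The function and argument names are ours for illustration and are not taken from the paper's code.

```python
import torch.nn.functional as F


def distillation_loss(p1, p2, tau=10.0):
    """Eq. 2: KL divergence between the combined soft label and each branch."""
    p = p1 + p2  # soft-label logits: sum of the two branches' logits
    target = F.softmax(p / tau, dim=1)
    return (F.kl_div(F.log_softmax(p1 / tau, dim=1), target, reduction="batchmean")
            + F.kl_div(F.log_softmax(p2 / tau, dim=1), target, reduction="batchmean"))


def neg_cosine(v, z):
    """Eq. 4: negative cosine similarity; z is detached (stop-gradient)."""
    z = z.detach()
    return -(F.normalize(v, dim=1) * F.normalize(z, dim=1)).sum(dim=1).mean()


def similarity_loss(z1, z2, v1, v2):
    """Eq. 3: symmetrized negative cosine similarity with stop-gradient."""
    return 0.5 * neg_cosine(v1, z2) + 0.5 * neg_cosine(v2, z1)
```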

Input: $f_{\theta}$, $g_{\xi}$, $q_{\mu}$: the encoder, projector, and predictor networks, respectively.
$\mathcal{T}$: the set of data augmentation operators.
$\alpha,\beta,\tau$: loss weights and temperature factor.
1   Initialize parameters $\theta,\xi,\mu$
2   while $\theta$ has not converged do
3       Sample a batch $(x,y)$ from the training set
4       $x_{1}, x_{2} = \mathcal{T}(x)$
5       $\mathbf{r}_{1}, \mathbf{r}_{2} = f_{\theta}(x_{1}), f_{\theta}(x_{2})$
6       $\mathbf{p}_{1}, \mathbf{p}_{2} = FC(\mathbf{r}_{1}), FC(\mathbf{r}_{2})$
7       $\mathbf{z}_{1}, \mathbf{z}_{2} = g_{\xi}(\mathbf{r}_{1}), g_{\xi}(\mathbf{r}_{2})$
8       $\mathbf{v}_{1}, \mathbf{v}_{2} = q_{\mu}(\mathbf{z}_{1}), q_{\mu}(\mathbf{z}_{2})$
9       Calculate the loss $\mathcal{L}_{net}$ by Eq. 5
10      Update parameters $\theta$, $\xi$, $\mu$
return the encoder network $f_{\theta}$
Algorithm 1: SKD-SRL pseudo-code for action recognition
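
As a complement to Algorithm 1, a minimal PyTorch-style training step might look like the sketch below. It reuses the hypothetical distillation_loss and similarity_loss helpers from the previous sketch and assumes that encoder, fc, projector, and predictor correspond to $f_{\theta}$, the $FC$ layer, $g_{\xi}$, and $q_{\mu}$; this is an illustrative sketch under those assumptions, not the authors' released implementation.

```python
import torch.nn.functional as F


def train_step(clip1, clip2, labels, encoder, fc, projector, predictor,
               optimizer, alpha=0.1, beta=1.0, tau=10.0):
    """One SKD-SRL update (Eq. 5) on a mini-batch of two augmented views."""
    r1, r2 = encoder(clip1), encoder(clip2)      # representation vectors
    p1, p2 = fc(r1), fc(r2)                      # logit predictions (Eq. 1)
    z1, z2 = projector(r1), projector(r2)        # projected vectors
    v1, v2 = predictor(z1), predictor(z2)        # predicted vectors

    loss = (F.cross_entropy(p1, labels) + F.cross_entropy(p2, labels)
            + alpha * distillation_loss(p1, p2, tau)
            + beta * similarity_loss(z1, z2, v1, v2))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```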

III-B Network Architecture and Data Augmentation

Network Architecture. For the encoder network $f_{\theta}$, we consider two state-of-the-art CNN architectures, the 3D ResNet-18 and the 3D ResNet-50 in [9]. For the projector network $g_{\xi}$, which maps the representation vectors $\mathbf{r}_{1}$ and $\mathbf{r}_{2}$ to the vectors $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$, we use just a single linear layer of output size 2048. The predictor network $q_{\mu}$ is a two-layer MLP in which the first hidden layer is followed by BN and ReLU, and the output layer has no BN or activation function. Since the predictor's input and output dimensions are 2048, we set the predictor's hidden dimension to 512, following [21]; a sketch of these modules is given below. We leave the investigation of optimal $g_{\xi}$ and $q_{\mu}$ architectures to future work.
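
As an illustration of the description above, the projector and predictor could be written as the following PyTorch modules, assuming a 2048-dimensional encoder representation (e.g., from 3D ResNet-50). The variable names and the choice to omit the bias before BN are our own assumptions, not the authors' code.

```python
import torch.nn as nn

# Projector g_xi: a single linear layer producing a 2048-dimensional projection.
projector = nn.Linear(2048, 2048)

# Predictor q_mu: two-layer MLP with a 512-dim hidden layer; BN and ReLU follow
# the hidden layer only, and the output layer has no BN or activation.
predictor = nn.Sequential(
    nn.Linear(2048, 512, bias=False),
    nn.BatchNorm1d(512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 2048),
)
```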

Data Augmentation. Given an input video, our data augmentation takes two steps. We first trim two clips of $T$ continuous frames from the raw video. Each frame in a clip is scaled so that its shorter edge is 128 pixels, with the other edge scaled to maintain the original aspect ratio. A random cropping window of $112\times 112$ is then generated and applied to all frames. We then randomly choose data augmentation operators to apply to each clip to generate the two views of the original clip. The augmentation operators used in our work are flip, contrast adjustment, brightness adjustment, hue adjustment, Gaussian blur, and channel splitting [30]. Each operator is chosen with probability 0.5. A rough sketch of this two-view pipeline is shown below.
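
The following sketch approximates this pipeline with torchvision's functional transforms on a clip tensor of shape (T, C, H, W) with values in [0, 1], assuming a recent torchvision version with tensor support. The per-operator probability of 0.5 matches the text, while the adjustment ranges are our own illustrative choices, and channel splitting [30] is omitted for brevity.

```python
import random
import torchvision.transforms.functional as TF


def augment_view(clip):
    """Apply each augmentation operator with probability 0.5 to a clip tensor."""
    if random.random() < 0.5:
        clip = TF.hflip(clip)                                         # flip
    if random.random() < 0.5:
        clip = TF.adjust_contrast(clip, random.uniform(0.6, 1.4))     # contrast
    if random.random() < 0.5:
        clip = TF.adjust_brightness(clip, random.uniform(0.6, 1.4))   # brightness
    if random.random() < 0.5:
        clip = TF.adjust_hue(clip, random.uniform(-0.1, 0.1))         # hue
    if random.random() < 0.5:
        clip = TF.gaussian_blur(clip, kernel_size=5)                  # Gaussian blur
    return clip


def two_views(clip):
    """Generate the two randomly augmented views x1, x2 from one cropped clip."""
    return augment_view(clip.clone()), augment_view(clip.clone())
```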

IV Experiment

IV-A Datasets and Implementation

We have conducted experiments on three datasets: UCF101 [31], HMDB51 [32], and Kinetics400 [33].

UCF101: includes 13,320 action instances from 101 human action classes. The average duration of each video is about 7 seconds.

HMDB51: is a small dataset including 6,766 videos from 51 human action classes. The average duration of each video is about 3 seconds.

Kinetics400: is a large dataset with 400 human action classes [33]. The videos are temporally trimmed to last around 10 seconds, with 200–1000 clips per action class. In total, Kinetics400 contains 306,245 videos.

Implementation Details. All networks are trained from scratch and optimized by stochastic gradient descent (SGD) with a momentum of 0.9 and an initial learning rate of 0.01. The weight decay is set to $5\times 10^{-4}$. The input of the encoder network is a video clip of 16 frames; each frame has dimensions $112\times 112\times 3$ and is normalized to [-1, 1]. We use a mini-batch of 32 clips per GPU and train for 200 epochs. The learning rate is dropped by 10x if the validation accuracy does not improve for 10 epochs. For our method, the temperature $\tau$ is set to 10, and the loss weights $\alpha$ and $\beta$ are set to 0.1 and 1, respectively. These settings are summarized in the configuration sketch below.
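
Under the assumption of a PyTorch training setup, the optimizer, learning-rate schedule, and loss hyperparameters above could be wired up roughly as follows; the model placeholder stands in for the full SKD-SRL network (encoder, FC layer, projector, and predictor) and is only there so the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Placeholder for the full SKD-SRL network; replace with the real modules.
model = nn.Linear(2048, 400)

# SGD with momentum 0.9, initial LR 0.01, and weight decay 5e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)

# Drop the LR by 10x when validation accuracy has not improved for 10 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=10)

# Loss hyperparameters used in Eq. 5.
tau, alpha, beta = 10.0, 0.1, 1.0

# Training loop outline: run train_step over the mini-batches of each epoch,
# evaluate validation accuracy, then call scheduler.step(val_accuracy).
```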

IV-B Comparison with independently training

To examine the effectiveness of the proposed SKD-SRL method, we compare our approach against the baseline (independent training with cross-entropy loss) on the standard datasets with both the 3D ResNet-18 and 3D ResNet-50 networks.

TABLE I: Top-1 accuracy of SKD-SRL compared to the baseline method with both the 3D ResNet-18 and the 3D ResNet-50 networks on standard datasets.
Method Backbone UCF101 HMDB51 Kinetics400
Baseline ResNet-18 46.5 17.1 54.2
SKD-SRL ResNet-18 69.8 24.7 66.7
Baseline ResNet-50 59.2 22.0 61.3
SKD-SRL ResNet-50 71.9 29.8 75.6

As shown in Table I, the SKD-SRL method outperforms the baseline on both the large- and small-scale datasets regardless of the backbone network. Specifically, on the small-scale UCF101 dataset, SKD-SRL increases accuracy by 23.3% and 12.7% for the 3D ResNet-18 and 3D ResNet-50 networks, respectively. On the large-scale Kinetics400 dataset, SKD-SRL increases accuracy by 12.5% and 14.3% for the 3D ResNet-18 and 3D ResNet-50 networks, respectively. From these results, we find that our approach significantly improves generalization and performance compared to independent training with only hard labels.

IV-C Comparison with other KD mechanisms

Table II compares our SKD-SRL method with other distillation mechanisms. As expected, the student's performance with distillation improves compared to independent training (i.e., the baseline). Moreover, the Self-KD method shows better top-1 accuracy than conventional KD (+5.3%). This shows that, although there is no teacher network, Self-KD still significantly improves performance via its self-teaching and self-learning mechanism. Meanwhile, the proposed SKD-SRL method outperforms Self-KD by 3.1%, demonstrating that our approach enhances the generalization capability of a single network by combining Self-KD with Siamese representation learning.

TABLE II: Comparison with different distillation mechanisms on the UCF101 dataset. The student network in all mechanisms is the ResNet-18 network.
Mechanism Teacher Network Top-1 Accuracy
Baseline None 46.5
Baseline + Data augment None 55.6
KD ResNet-50 61.4
Self-KD itself 66.7
SKD-SRL itself 69.8

IV-D Comparison with state-of-the-art methods

In this part, we evaluate SKD-SRL on the Kinetics400 dataset with two backbone networks, 3D ResNet-18 and 3D ResNet-50. As shown in Table III, the proposed SKD-SRL method obtains state-of-the-art performance while utilizing fewer frames (16 vs. 32 or 64) and a lower frame resolution (112 vs. 224). In particular, the SKD-SRL method uses only RGB frames without optical flow, which significantly reduces the cost of optical flow computation and of model training in the optical flow domain. Moreover, our proposed SKD-SRL achieves better performance with a shallower model than other state-of-the-art methods (3D ResNet-50 vs. 3D ResNeXt-101, DenseNet-169).

TABLE III: Top-1 accuracy of the SKD-SRL method compared to state-of-the-art methods on the Kinetics400 dataset. * indicates methods that use both RGB and optical flow frames in the training phase.
Method | Backbone | Pretraining dataset | #frames | Frame resolution | Top-1 Accuracy
I3D [7] | InceptionNet | ImageNet | 64 | 224×224 | 71.1
FASTER [25] | R(2+1)D-50 | None | 32 | 224×224 | 71.7
R(2+1)D* [34] | ResNet-34 | Sport-1M | 32 | 112×112 | 73.3
bLVNet [26] | ResNet-50 | None | 24 | 224×224 | 73.5
MoViNets [27] | MoViNet-A1 | None | 50 | 224×224 | 72.7
MoViNets [27] | MoViNet-A2 | None | 50 | 224×224 | 75.0
MViT [28] | MViT-S | None | 16 | 224×224 | 76.0
R3D [9] | ResNet-152 | None | 16 | 112×112 | 63.0
R3D [9] | ResNeXt-101 | None | 16 | 112×112 | 65.1
STC [15] | ResNeXt-101 | ImageNet | 32 | 112×112 | 68.7
T3D [13] | DenseNet-169 | ImageNet | 32 | 224×224 | 62.2
MARS* [14] | ResNeXt-101 | None | 16 | 112×112 | 68.9
SKD-SRL | ResNet-18 | None | 16 | 112×112 | 66.7
SKD-SRL | ResNet-50 | None | 16 | 112×112 | 75.6

V Conclusion

In this work, we have introduced a combination of self-knowledge distillation and Siamese representation learning. Our approach utilizes the sum of the predictive distributions from two different views of the same sample for soft label distillation. Moreover, we minimize the difference between the two representation vectors of these views via Siamese representation learning. Experiments conducted across different network architectures show that our proposed method achieves state-of-the-art performance compared to other methods on standard datasets. In future work, we will investigate optimal architectures for the projector and predictor networks. In addition, the proposed method can be adapted and applied to other computer vision tasks.

References

  • [1] Alexander Klaser, Marcin Marszałek, and Cordelia Schmid, “A spatio-temporal descriptor based on 3d-gradients,” in BMVC 2008-19th British Machine Vision Conference. British Machine Vision Association, 2008, pp. 275–1.
  • [2] Paul Scovanner, Saad Ali, and Mubarak Shah, “A 3-dimensional sift descriptor and its application to action recognition,” in Proceedings of the 15th ACM international conference on Multimedia, 2007, pp. 357–360.
  • [3] Geert Willems, Tinne Tuytelaars, and Luc Van Gool, “An efficient dense and scale-invariant spatio-temporal interest point detector,” in ECCV. Springer, 2008, pp. 650–663.
  • [4] Navneet Dalal, Bill Triggs, and Cordelia Schmid, “Human detection using oriented histograms of flow and appearance,” in ECCV. Springer, 2006, pp. 428–441.
  • [5] Heng Wang and Cordelia Schmid, “Action recognition with improved trajectories,” in ICCV, 2013, pp. 3551–3558.
  • [6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep learning, MIT press, 2016.
  • [7] Joao Carreira and Andrew Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in CVPR, 2017, pp. 6299–6308.
  • [8] Feichtenhofer Christoph, Haoqi Fan, Jitendra Malik, and Kaiming He, “Slowfast networks for video recognition,” in The IEEE 2019 International Conference on Computer Vision (ICCV). 2019, pp. 6201–6210, IEEE.
  • [9] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?,” in CVPR, 2018, pp. 6546–6555.
  • [10] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli, “Video classification with channel-separated convolutional networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5552–5561.
  • [11] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean, “Distilling the knowledge in a neural network,” CoRR, vol. abs/1503.02531, 2015.
  • [12] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio, “Fitnets: Hints for thin deep nets,” in ICLR, 2015.
  • [13] Ali Diba, Mohsen Fayyaz, Vivek Sharma, Amir Hossein Karami, Mohammad Mahdi Arzani, Rahman Yousefzadeh, and Luc Van Gool, “Temporal 3d convnets: New architecture and transfer learning for video classification,” arXiv preprint arXiv:1711.08200, 2017.
  • [14] Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid, “Mars: Motion-augmented rgb stream for action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7882–7891.
  • [15] Ali Diba, Mohsen Fayyaz, Vivek Sharma, M Mahdi Arzani, Rahman Yousefzadeh, Juergen Gall, and Luc Van Gool, “Spatio-temporal channel correlation networks for action classification,” in ECCV, 2018, pp. 284–299.
  • [16] Sukmin Yun, Jongjin Park, Kimin Lee, and Jinwoo Shin, “Regularizing class-wise predictions via self-knowledge distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13876–13885.
  • [17] Mingi Ji, Seungjae Shin, Seunghyun Hwang, Gibeom Park, and Il-Chul Moon, “Refine myself by teaching myself: Feature refinement via self-knowledge distillation,” arXiv preprint arXiv:2103.08273, 2021.
  • [18] Kyungyul Kim, ByeongMoon Ji, Doyoung Yoon, and Sangheum Hwang, “Self-knowledge distillation: A simple way for better generalization,” arXiv preprint arXiv:2006.12000, 2020.
  • [19] Sangchul Hahn and Heeyoul Choi, “Self-knowledge distillation in natural language processing,” arXiv preprint arXiv:1908.01851, 2019.
  • [20] Ting-Bing Xu and Cheng-Lin Liu, “Data-distortion guided self-distillation for deep neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 5565–5572.
  • [21] Xinlei Chen and Kaiming He, “Exploring simple siamese representation learning,” arXiv preprint arXiv:2011.10566, 2020.
  • [22] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
  • [23] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” arXiv preprint arXiv:2006.09882, 2020.
  • [24] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” arXiv preprint arXiv:2103.03230, 2021.
  • [25] Linchao Zhu, Du Tran, Laura Sevilla-Lara, Yi Yang, Matt Feiszli, and Heng Wang, “Faster recurrent networks for efficient video classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, vol. 34, pp. 13098–13105.
  • [26] Quanfu Fan, Chun-Fu (Richard) Chen, Hilde Kuehne, Marco Pistoia, and David Cox, “More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. 2019, vol. 32, Curran Associates, Inc.
  • [27] Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing Gong, “Movinets: Mobile video networks for efficient video recognition,” arXiv preprint arXiv:2103.11511, 2021.
  • [28] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer, “Multiscale vision transformers,” arXiv preprint arXiv:2104.11227, 2021.
  • [29] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann, “Video transformer network,” arXiv preprint arXiv:2102.00719, 2021.
  • [30] Duc-Quang Vu, Ngan Le, and Jia-Ching Wang, “Teaching yourself: A self-knowledge distillation approach to action recognition,” IEEE Access, vol. 9, pp. 105711–105723, 2021.
  • [31] Khurram Soomro, Amir Roshan Zamir, and M Shah, “A dataset of 101 human action classes from videos in the wild,” Center for Research in Computer Vision, vol. 2, 2012.
  • [32] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre, “Hmdb: a large video database for human motion recognition,” in Proceedings of the IEEE international conference on computer vision. IEEE, 2011, pp. 2556–2563.
  • [33] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
  • [34] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.