
Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training

Saurabh Sahu
Samsung Research America
[email protected]
   Palash Goyal
Samsung Research America
[email protected]
Abstract

The introduction of the Transformer model has led to tremendous advancements in sequence modeling, especially in the text domain. However, the use of attention-based models for video understanding is still relatively unexplored. In this paper, we introduce the Gated Adversarial Transformer (GAT) to enhance the applicability of attention-based models to videos. GAT uses a multi-level attention gate to model the relevance of a frame based on local and global contexts. This enables the model to understand the video at various granularities. Further, GAT uses adversarial training to improve model generalization. We propose a temporal attention regularization scheme to improve the robustness of attention modules to adversarial examples. We illustrate the performance of GAT on the large-scale YouTube-8M data set on the task of video categorization, and present ablation studies along with quantitative and qualitative analysis to showcase the improvements.

1 Introduction

Video classification aims at understanding the visual and audio features to assign one or more relevant tags to the video. With the rapid increase in the amount of video content, this task is crucial for applications such as smart content search, user profiling and alleviating missing metadata. Moving from image to video classification adds several challenges to the task. First, the temporal dimension increases the overall size of the input, in turn increasing the model capacity required to make accurate predictions. Second, the number of possible tags increases due to variations in the sequence. For example, a leaf falling from a tree can be tagged as nature, but the reverse could indicate science-fiction elements in the video.

Existing works on video classification using deep learning can be broadly divided into four categories: (i) convolutional neural networks (CNNs) [21, 12, 4, 20], (ii) recurrent neural networks (RNNs) [28, 27, 26], (iii) graph-based methods [13, 2], and (iv) attention-based models [17, 10, 7]. Convolutional approaches use 3-D CNNs along with optical flow networks to jointly model spatial and temporal features. Recurrent models use variations of recurrent cells, including the gated recurrent unit (GRU) and long short-term memory (LSTM), to understand the video. Graph-based approaches create a similarity graph of frames to cluster scenes together. Lastly, attention-based models use self-attention blocks at each layer to understand a frame’s role with respect to other frames in the video. In this paper, we focus on attention approaches as they have low computation cost compared to 3-D CNNs and outperform RNN-based approaches on the video classification task [17].

Current attention-based approaches suffer from two key issues which we aim to tackle in this paper. First, the video representation and classification are highly dependent on the attention weights. This can lead to incomplete predictions if high attention is given to only a handful of relevant frames, and incorrect predictions if it is given to irrelevant frames. For example, in Figure 1 (left), for a news video on protests, we show the attention weights and prediction of a Transformer model trained on 4 million YouTube videos. We observe that high attention is given to the frames containing a crowd, and thus the final prediction is sports (confusing it with a sports crowd). The second issue, which has recently been shown for videos [24], is that the models are prone to adversarial examples. In Figure 1 (right), we observe that a minor change in the input modifies the attention profiles obtained from a self-attention based architecture.

In this paper, to alleviate the above issues, we propose the Gated Adversarial Transformer (GAT), which makes two technical contributions: (1) multi-level gated attention: instead of using a single global profile, we define a local and a global attention profile and use a gated mixture-of-experts model to learn a weighted combination of the representations obtained from the two profiles; (2) temporal adversarial training: we compute the adversarial direction for each example and add an attention-regularization term to ensure robustness to adversarial examples at two levels: the attention maps and the output classification.

We make the following contributions in this paper:

  • We observe a global attention dependency problem in state-of-the-art self-attention based models and propose a gated multi-level attention module to tackle it.

  • We propose temporal adversarial training for video understanding to enhance the robustness of attention-based encoders.

  • We provide comprehensive large-scale quantitative and qualitative experiments on YouTube-8M data, showing significant improvements with our approach over state-of-the-art models.

2 Related Work

Attention based models: With the success of attention-based architectures (such as Transformers) in NLP tasks, they have recently been used in video-related tasks as well. The self-attention block proposed in [22] is closely related to the non-local blocks used for video classification [23]. Chen et al. [5] proposed increasing the efficacy of existing convolutional blocks for action classification by using a two-stage attention mechanism to collect relevant features from different patches of video frames at various instants. Kmiec et al. [10] used a Transformer encoder along with NetVLAD blocks to get effective representations of audio/video features for large-scale video understanding. Wu et al. [25] used an attention block to compute the interactions between short-term and long-term feature representations for detailed video understanding. Girdhar et al. [7] proposed ‘action Transformers’ where features extracted using a 3D CNN are aggregated and fed to a self-attention block to leverage the spatio-temporal information around a person for action localization. Bertasius et al. [3] extended the idea of [6] to apply Transformers to video classification by collectively using the patches obtained from a frame at different time-steps. While the prior works use the vanilla self-attention block’s operations, we propose modifications to the self-attention architecture itself to make it more suitable for video understanding. Furthermore, our method can also be used when raw videos are not available for training (as in the YouTube-8M dataset), a setting that renders some of the above methods infeasible as they rely on extracting information from image patches.

Adversarial training: It has been shown that machine learning models misclassify correctly classified images when a small imperceptible perturbation is added to them [19, 8, 16]. [24] showed the same for videos, where a trained model fails to detect the correct class of a perturbed video. For a threat model, they generated the adversarial perturbations in an iterative way by maximising the cross-entropy loss between the model’s output for a perturbed video and its ground-truth label. They additionally minimize the $L_{21}$ norm of the perturbations so that the perturbed video is semantically close to the original video. This ensures that even though both videos are visually similar, the model outputs differ from each other. In [8, 14], the authors propose a way of computing the adversarial counterparts of images while training and adding an extra loss regularization that forces the model to correctly classify them or make their predictions close to those of the original image. We extend this idea to videos. As we are using a self-attention based architecture, we additionally implement an attention regularization term in the loss function that promotes similarity between the attention maps of original and adversarial videos, ensuring robustness in both the attention and output spaces.

3 Gated Adversarial Transformer (GAT)

Figure 2: Gated Multi-level Self Attention (GMSA) architecture.

In this section, we explain our model architecture and training algorithm. Given $D$-dimensional input features for $T$ frames of a video, we encode the temporal information with a positional encoding block to get a representation $X\in\mathbb{R}^{T\times D}$ and pass it through a multi-head self-attention (SA) block to get the output representations. This is followed by layer normalization, residual connections and feed-forward layers as proposed in [22]. We then get a video-level representation by taking a mean across the temporal dimension. We compute these video-level representations for both the audio and visual modalities; they are then concatenated and passed through a hidden layer before producing the final output. We incorporate our novelties in the way we obtain the output representation from the SA block. Additionally, we train the model using regularization loss terms computed from adversarial examples along with the vanilla cross-entropy loss term.
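To make the overall pipeline concrete, the following is a minimal PyTorch sketch of this encoder-per-modality design (positional encoding, self-attention with residual connections and layer normalization, temporal mean pooling, and a fusion MLP). The class names, the learned positional encoding, and the hidden sizes are illustrative assumptions rather than the exact configuration used in our experiments; the feature dimensions (1024 visual, 128 audio) and the 431 classes follow the YouTube-8M setup described in Section 4.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """One modality branch: positional encoding -> multi-head self-attention ->
    residual + layer norm -> feed-forward -> temporal mean pooling."""
    def __init__(self, dim, num_heads=8, ff_dim=1024, max_len=300):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_len, dim))  # learned positional encoding (assumption)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))

    def forward(self, x):                      # x: (B, T, D)
        x = x + self.pos[: x.size(1)]
        a, _ = self.attn(x, x, x)              # vanilla SA block; replaced by GMSA in Section 3.1
        x = self.norm1(x + a)
        x = self.norm2(x + self.ff(x))
        return x.mean(dim=1)                   # video-level representation (B, D)

class VideoClassifier(nn.Module):
    """Concatenate the audio and visual video-level representations and classify."""
    def __init__(self, vid_dim=1024, aud_dim=128, hidden=512, num_classes=431):
        super().__init__()
        self.venc, self.aenc = ModalityEncoder(vid_dim), ModalityEncoder(aud_dim)
        self.mlp = nn.Sequential(nn.Linear(vid_dim + aud_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, xv, xa):                 # xv: (B, T, 1024), xa: (B, T, 128)
        z = torch.cat([self.venc(xv), self.aenc(xa)], dim=-1)
        return self.mlp(z)                     # logits; trained with sigmoid/binary cross-entropy
```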

3.1 Gated Multi-level Self Attention

For a video, the inter-frame relationship is different when seen in a local context versus a global context. Hence, to classify a video correctly, we compute multiple output representations from local as well as global attention maps. The final output representation is computed by taking a weighted combination of these representations, the weights being determined by a soft-gating mechanism. Figure 2 outlines the gated multi-level attention architecture.

In a multi-headed SA block with $M$ heads, the input $X$ is first transformed into query, key and value matrices. For the $m$-th head, we compute the attention map $A_{m}\in\mathbb{R}^{T\times T}$ using the scaled dot product of the corresponding query $Q_{m}\in\mathbb{R}^{T\times D_{M}}$ and key $K_{m}\in\mathbb{R}^{T\times D_{M}}$ matrices. It is then multiplied with the value matrix $V_{m}\in\mathbb{R}^{T\times D_{M}}$ to get a global-level output representation for the head, $O_{gm}\in\mathbb{R}^{T\times D_{M}}$. We call it a global-level representation because it is computed using the attention map that depicts the importance of a frame $i$ with respect to a frame $j$ taking into context the entire video. These outputs from the different heads are then concatenated and a linear transformation is applied to get the final output representation $Y^{g}\in\mathbb{R}^{T\times D}$.

$$
\begin{aligned}
Q_{m} &= XW^{q}_{m}, \quad K_{m}=XW^{k}_{m}, \quad V_{m}=XW^{v}_{m}\\
A_{m} &= \mathrm{softmax}\Big(\frac{Q_{m}K_{m}^{T}}{\sqrt{d_{m}}}\Big) \qquad (1)\\
O_{gm} &= A_{m}V_{m}\\
Y^{g} &= \mathrm{concat}(O_{g1},\ldots,O_{gM})\,W^{g}
\end{aligned}
$$

where $W^{q}_{m}\in\mathbb{R}^{D\times D_{M}}$, $W^{k}_{m}\in\mathbb{R}^{D\times D_{M}}$, $W^{v}_{m}\in\mathbb{R}^{D\times D_{M}}$ and $W^{g}\in\mathbb{R}^{D\times D}$ are learnable matrices. Note that in this formulation, only a global attention map is computed by each head to get the final output representation.

We additionally compute local attention maps and use them to influence the final output representation of the block. Specifically, the query, key and value matrices are divided into $N$ segments, each of duration $T_{N}=T/N$. For the $n$-th segment in the $m$-th head, the query matrix $Q_{n,m}\in\mathbb{R}^{T_{N}\times D_{M}}$ is obtained by taking the frames $\{nT_{N},\ldots,(n+1)T_{N}-1\}$ of the matrix $Q_{m}$. Similarly, we get the localised key $K_{n,m}\in\mathbb{R}^{T_{N}\times D_{M}}$ and value $V_{n,m}\in\mathbb{R}^{T_{N}\times D_{M}}$ matrices. Using these matrices, we get the output $O_{n,m}\in\mathbb{R}^{T_{N}\times D_{M}}$ as shown in the equations below. These are then concatenated along the temporal dimension to get the local-level representation for the head, $O_{lm}\in\mathbb{R}^{T\times D_{M}}$, followed by the final local-level representation $Y^{l}\in\mathbb{R}^{T\times D}$ obtained by multiplying with a learnable matrix $W^{l}\in\mathbb{R}^{D\times D}$.

$$
\begin{aligned}
O_{n,m} &= \mathrm{softmax}\Big(\frac{Q_{n,m}K_{n,m}^{T}}{\sqrt{d_{m}}}\Big)V_{n,m}\\
O_{lm} &= \mathrm{concat}(O_{1,m},\ldots,O_{N,m})\\
Y^{l} &= \mathrm{concat}(O_{l1},\ldots,O_{lM})\,W^{l}
\end{aligned}
$$

This local-level output representation can carry relevant information about the video. It is used along with the global-level representation to get the final output. We incorporate a gating mechanism that treats the multi-level output representations as different experts and computes the final output representation, which is fed to subsequent layers for video classification. We call this gated multi-level self-attention (GMSA). Specifically, the relevances $R^{g}\in\mathbb{R}^{T\times D}$ and $R^{l}\in\mathbb{R}^{T\times D}$ are computed using the gating network. This is followed by computing the final output $Y\in\mathbb{R}^{T\times D}$ as shown below

$$
\begin{aligned}
R^{g} &= \mathrm{concat}(O_{g1},\ldots,O_{gM})\,W^{g}_{g}\\
R^{l} &= \mathrm{concat}(O_{l1},\ldots,O_{lM})\,W^{l}_{g}\\
R^{g},R^{l} &= \mathrm{softmax}([R^{g},R^{l}])\\
Y &= R^{g}\odot Y^{g}+R^{l}\odot Y^{l}
\end{aligned}
$$

where $W^{g}_{g}\in\mathbb{R}^{D\times D}$ and $W^{l}_{g}\in\mathbb{R}^{D\times D}$ are learnable matrices and $\odot$ denotes element-wise multiplication. In the gating network, the softmax is taken across the experts, similar to [18]. In the above multi-level formulation we have considered two output representations, one at the local and one at the global level. Note that we can obtain multiple such attention representations for different values of $N$, each of them acting as an expert, with the gating network computing the relevance weights for them.

Since, for local-level representations, the softmax is computed over a time period of $T_{N}$ instead of $T$, it gives us additional information about how a frame is perceived in relation to a local context rather than the entire global context. This extra information is complementary to the global-level context, as shown in our experiments in Section 4.
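A compact PyTorch sketch of the GMSA block with one global and one local expert is given below. It assumes batch-first inputs of shape (B, T, D) with $T$ divisible by the window length $T_N$; the module and variable names are ours, and details such as dropout and masking are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMSA(nn.Module):
    """Gated multi-level self-attention sketch: a global expert (full-length attention)
    and a local expert (windowed attention), mixed per element by a soft gate."""
    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.h, self.dm, self.window = num_heads, dim // num_heads, window
        self.wq = nn.Linear(dim, dim, bias=False)   # W^q (all heads stacked)
        self.wk = nn.Linear(dim, dim, bias=False)   # W^k
        self.wv = nn.Linear(dim, dim, bias=False)   # W^v
        self.wg = nn.Linear(dim, dim, bias=False)   # W^g
        self.wl = nn.Linear(dim, dim, bias=False)   # W^l
        self.wgg = nn.Linear(dim, dim, bias=False)  # W^g_g (gate, global expert)
        self.wgl = nn.Linear(dim, dim, bias=False)  # W^l_g (gate, local expert)

    def _split(self, x):                            # (B, T, D) -> (B, H, T, Dm)
        B, T, _ = x.shape
        return x.view(B, T, self.h, self.dm).transpose(1, 2)

    def forward(self, x):                           # x: (B, T, D), T divisible by window
        B, T, D = x.shape
        q, k, v = self._split(self.wq(x)), self._split(self.wk(x)), self._split(self.wv(x))

        # Global expert: scaled dot-product attention over all T frames (Eq. 1).
        att_g = F.softmax(q @ k.transpose(-2, -1) / self.dm ** 0.5, dim=-1)    # (B, H, T, T)
        og = (att_g @ v).transpose(1, 2).reshape(B, T, D)                      # concatenate heads
        yg = self.wg(og)

        # Local expert: the same attention restricted to N = T / window segments.
        n = T // self.window
        ql = q.reshape(B, self.h, n, self.window, self.dm)
        kl = k.reshape(B, self.h, n, self.window, self.dm)
        vl = v.reshape(B, self.h, n, self.window, self.dm)
        att_l = F.softmax(ql @ kl.transpose(-2, -1) / self.dm ** 0.5, dim=-1)  # (B, H, N, Tn, Tn)
        ol = (att_l @ vl).reshape(B, self.h, T, self.dm).transpose(1, 2).reshape(B, T, D)
        yl = self.wl(ol)

        # Soft gating: softmax across the two experts, element-wise mixture.
        rg, rl = F.softmax(torch.stack([self.wgg(og), self.wgl(ol)]), dim=0)
        y = rg * yg + rl * yl
        return y, att_g.mean(dim=1)                 # also return the head-averaged global map A
```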

3.2 Adversarial Perturbation based Regularization

Having changed the architecture of the self-attention block to take into account attention computed at different granularities, we now focus on modifying the loss function for more robust learning. In [24], the authors showed that adding a small perturbation to an existing video, one that does not change the semantics of the video, can fool a pre-trained video classification model. We address this vulnerability of deep learning based video classification models. Below, we formalize our approach.

We denote the training set with $L$ data points as $\{X_{i},y_{i}\}, i=1,\ldots,L$, where $X_{i}\in\mathbb{R}^{T\times D}$ represents the frame-wise feature representation of video $i$ and $y_{i}\in\mathbb{R}^{K}$ represents the ground-truth labels, $K$ being the number of classes. We represent the video classification model's output vector of probabilities for the point $X_{i}$ as $\theta(X_{i})$. $l(\theta(X_{i}),y_{i})$ is the loss for the data point $X_{i}$, which we consider to be the cross-entropy loss in our multi-class scenario.

$$\mathcal{L}_{CE}=\frac{1}{L}\sum_{i=1}^{L}l(y_{i},\theta(X_{i})) \qquad (2)$$

However, such models would be susceptible to adversarial examples, hurting the generalizability of the model. To address this limitation, we add a regularization term minimizing the loss for adversarial counterparts of the training samples, as proposed in [8].

$$
\begin{aligned}
\mathcal{L}_{Adv} &= \mathcal{L}_{CE}+\alpha\cdot\frac{1}{L}\sum_{i=1}^{L}l(y_{i},\theta(X_{i}+R_{i}))\\
R_{i} &= \arg\max_{R_{i}:\|R_{i}\|_{2}\leq\epsilon}l(y_{i},\theta(X_{i}+R_{i}))
\end{aligned}
$$

The loss function is approximated to behave linearly around the input $X_{i}$ to get the perturbation term, which can be easily calculated using backpropagation.

$$R_{i}\approx\epsilon\frac{G_{i}}{\|G_{i}\|_{2}}, \quad \text{where } G_{i}=\nabla_{X_{i}}l(y_{i},\theta(X_{i})) \qquad (3)$$

In the above equations, the norm is taken across the rows of the input tensor. We compute the gradient of the loss with respect to the features from each of the video and audio modalities to get the corresponding adversarial features. Note that we train the model to be invariant to adversarial samples within the $\epsilon$-ball. Hence, optimizing this loss function introduces two hyper-parameters to tune, $\alpha$ and $\epsilon$. We investigate the impact of these hyper-parameters on the model performance in our experiments.
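As a sketch of how the perturbation in Equation 3 can be computed in practice, the helper below takes a closure that evaluates the classification loss for one modality's features and returns the one-step, row-normalized perturbation. The closure interface and the per-frame (row-wise) normalization are our reading of the description above.

```python
import torch

def adversarial_perturbation(loss_fn, x, epsilon):
    """One-step linear approximation of the worst-case L2-bounded perturbation (Eq. 3).

    loss_fn(x) must return the scalar classification loss for features x of shape (..., T, D);
    the gradient is normalized per frame (row-wise), and the result is treated as a constant.
    """
    x = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(x), x)[0]              # G = grad_x l(y, theta(x))
    grad = grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)   # row-wise L2 normalization
    return (epsilon * grad).detach()                          # R = eps * G / ||G||_2

# Example with a hypothetical two-modality model, perturbing the visual features only:
# r_v = adversarial_perturbation(
#     lambda xv: F.binary_cross_entropy_with_logits(model(xv, xa), y), xv, epsilon=0.5)
```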

3.2.1 Attention-map regularization

The attention maps generated in the self-attention blocks should also be invariant to adversarial examples. We add a regularization term to our loss function to enforce this condition. For a given input, we average over the attention maps $A_{m}$ generated by each head to get the attention map $A\in\mathbb{R}^{T\times T}$ for that input. The corresponding attention map generated using the adversarial example is denoted by $A^{adv}$. Based on how the similarity is enforced in the attention space, we propose two variations of the adversarial loss function:

  • We minimize the Frobenius norm of the difference between the attention maps and average over the mini-batch, in which case the loss function becomes

    $$\mathcal{L}_{advFr}=\mathcal{L}_{Adv}+\beta_{Fr}\cdot\frac{1}{L}\sum_{i=1}^{L}\|A_{i}-A^{adv}_{i}\|_{Fr}$$
  • Each row of the attention map can be treated as a probability distribution. Since the Jensen-Shannon (JS) divergence between two distributions is symmetric, we minimize it to enforce similarity between the attention maps:

    $$\mathcal{L}_{advJS}=\mathcal{L}_{Adv}+\beta_{JS}\cdot\frac{1}{LT}\sum_{i=1}^{L}\sum_{t=1}^{T}\mathrm{JSD}(A_{i,t},A^{adv}_{i,t})$$

    where $A_{i,t}$ and $A^{adv}_{i,t}$ denote the $t$-th row of the attention map generated by the $i$-th example in the mini-batch and its adversarial counterpart respectively.

Note that we can use both the local and global attention maps computed by the GMSA block for this regularization. However, we observed no significant change in performance when using both attention maps. Hence, we only show results for the global attention map regularization in our experiment section.
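The two regularizers can be written compactly as follows; this is a hedged sketch that assumes the head-averaged attention maps are available as (B, T, T) tensors whose rows already sum to one.

```python
import torch

def frobenius_reg(att, att_adv):
    """Mean (over the mini-batch) Frobenius norm of the difference between the clean
    and adversarial head-averaged attention maps, each of shape (B, T, T)."""
    return (att - att_adv).flatten(1).norm(dim=1).mean()

def js_reg(att, att_adv, eps=1e-12):
    """Row-wise Jensen-Shannon divergence between the two attention maps,
    averaged over rows and over the mini-batch."""
    m = 0.5 * (att + att_adv)
    kl_pm = (att * ((att + eps).log() - (m + eps).log())).sum(dim=-1)          # KL(A || M)
    kl_qm = (att_adv * ((att_adv + eps).log() - (m + eps).log())).sum(dim=-1)  # KL(A_adv || M)
    return (0.5 * (kl_pm + kl_qm)).mean()
```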

We call the model combining gated multi-level self-attention (GMSA), adversarial perturbation and attention regularization the Gated Adversarial Transformer (GAT).

3.2.2 Algorithm

The overall algorithm describing our training procedure, given features from the audio and video modalities and the ground-truth labels, is described below. For a mini-batch, we compute the cross-entropy loss and the adversarial perturbations using Equations 2 and 3 respectively. The function $get\_att$ then returns the global attention map averaged over the heads, computed using Equation 1. We compute the loss for the adversarial samples and the attention-map regularization term and obtain the final loss $L$ used to update the model parameters.

Function gatTraining
    Initialise $\theta^{enc}_{v}$, $\theta^{enc}_{a}$ and $\theta_{MLP}$
    for $e=1\ldots E$ do
        for $b=1\ldots B$ do
            $\hat{y}_{b}=\theta_{MLP}(\theta^{enc}_{v}(X_{b,v}),\theta^{enc}_{a}(X_{b,a}))$
            $L_{CE}=l(y_{b},\hat{y}_{b})$
            for $m$ in $\{a,v\}$ do
                $R_{b,m}=\epsilon\,\nabla_{X_{b,m}}L_{CE}\,/\,\|\nabla_{X_{b,m}}L_{CE}\|_{2}$
                $X_{b,m}^{ad}=X_{b,m}+R_{b,m}$
                $A_{b,m}=get\_att(\theta^{enc}_{m},X_{b,m})$
                $A_{b,m}^{ad}=get\_att(\theta^{enc}_{m},X_{b,m}^{ad})$
            $L_{F}=\|A_{b,v}-A^{ad}_{b,v}\|+\|A_{b,a}-A^{ad}_{b,a}\|$
            $\hat{y}_{b}^{ad}=\theta_{MLP}(\theta^{enc}_{v}(X_{b,v}^{ad}),\theta^{enc}_{a}(X_{b,a}^{ad}))$
            $L_{CE}^{ad}=l(y_{b},\hat{y}_{b}^{ad})$
            $L=L_{CE}+\alpha\,L_{CE}^{ad}+\beta_{Fr}\,L_{F}$
            $\theta^{enc}_{v}=\theta^{enc}_{v}-\eta\,\nabla_{\theta^{enc}_{v}}L$
            $\theta^{enc}_{a}=\theta^{enc}_{a}-\eta\,\nabla_{\theta^{enc}_{a}}L$
            $\theta_{MLP}=\theta_{MLP}-\eta\,\nabla_{\theta_{MLP}}L$
    return $\theta^{enc}_{v}$, $\theta^{enc}_{a}$ and $\theta_{MLP}$
Algorithm 1: Training GAT with Frobenius attention-map regularization, given input audio-visual features $(X_{v},X_{a})$, ground-truth labels $y$, loss function $l$, gated multi-level attention based Transformer encoder blocks $\theta^{enc}_{v}$ and $\theta^{enc}_{a}$ for the two modalities, MLP parameters $\theta_{MLP}$, radius $\epsilon$, hyper-parameters $\alpha$ and $\beta_{Fr}$, and learning rate $\eta$, for $E$ epochs with mini-batch size $B$.
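For readers who prefer code, the mini-batch update of Algorithm 1 can be sketched in PyTorch as below. The `return_attention=True` interface, which makes the model return the head-averaged global attention maps of the two modality encoders alongside the logits, is our assumption and not part of the algorithm's specification.

```python
import torch
import torch.nn.functional as F

def gat_training_step(model, optimizer, xv, xa, y, epsilon=0.5, alpha=1.0, beta_fr=1e-3):
    """One mini-batch update of GAT with Frobenius attention-map regularization (sketch)."""
    xv = xv.clone().detach().requires_grad_(True)
    xa = xa.clone().detach().requires_grad_(True)

    logits, att_v, att_a = model(xv, xa, return_attention=True)
    loss_ce = F.binary_cross_entropy_with_logits(logits, y)

    # Adversarial directions for each modality (Eq. 3), with row-wise L2 normalization.
    g_v, g_a = torch.autograd.grad(loss_ce, (xv, xa), retain_graph=True)
    xv_ad = (xv + epsilon * g_v / (g_v.norm(dim=-1, keepdim=True) + 1e-12)).detach()
    xa_ad = (xa + epsilon * g_a / (g_a.norm(dim=-1, keepdim=True) + 1e-12)).detach()

    # Forward pass on the adversarial features.
    logits_ad, att_v_ad, att_a_ad = model(xv_ad, xa_ad, return_attention=True)
    loss_ad = F.binary_cross_entropy_with_logits(logits_ad, y)

    # Frobenius attention-map regularization between clean and adversarial attention maps.
    loss_fr = (att_v - att_v_ad).flatten(1).norm(dim=1).mean() + \
              (att_a - att_a_ad).flatten(1).norm(dim=1).mean()

    loss = loss_ce + alpha * loss_ad + beta_fr * loss_fr
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```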

4 Experiments

Table 1: Overall performance (in percentages) of baselines and different variations of our proposed model for the video categorization task on our YouTube-8M test sets. Each column represents the results for a training paradigm defined by the architecture and the loss function used. We present the mean and standard deviation obtained for five non-overlapping partitions of the entire test set.
Metrics | SA, $\mathcal{L}_{CE}$ | GMSA, $\mathcal{L}_{CE}$ | GMSA, $\mathcal{L}_{Adv}$ | GAT$_{AdvJS}$ | GAT$_{AdvFr}$ | Gain
GAP | 91.48±0.02 | 92.18±0.02 | 92.49±0.02 | 92.56±0.02 | 92.60±0.02 | +1.12
MAP | 91.29±0.02 | 92.32±0.03 | 92.68±0.01 | 92.70±0.04 | 92.80±0.03 | +1.51
PERR | 89.46±0.04 | 90.03±0.04 | 90.34±0.04 | 90.41±0.03 | 90.44±0.03 | +0.98
Hit@1 | 94.84±0.05 | 95.02±0.04 | 95.16±0.03 | 95.21±0.05 | 95.22±0.04 | +0.38

In this section, we explain our experiments and results. We first describe our experimental setup. Then, we present the results for video classification on a large-scale YouTube dataset. We further analyse the various novel aspects of our model and present quantitative and qualitative analysis.

4.1 Experimental-setup

We use the YouTube-8M dataset for our experiments, which consists of frame-wise video and audio features for approximately 5 million videos, extracted using Inception v3 and VGGish respectively, followed by PCA [1]. We use the hierarchical label space with 431 classes (see [17]). We use binary cross-entropy loss to train our models. We evaluate our models using the four metrics mentioned in [11]: (i) Global Average Precision (GAP), (ii) Mean Average Precision (MAP), (iii) Precision at Equal Recall Rate (PERR), and (iv) Hit@1. Our training set consists of approximately 4 million videos. We use 64000 videos from the official development set for validation and use the rest as the test set. Our baseline Transformer model consists of a single layer of multi-head attention with 8 attention heads for each of the audio and video modalities. For training, we used the Adam optimizer with an initial learning rate of 0.0002 and a batch size of 64. We compute the validation-set GAP every 10000 iterations and perform early stopping with a patience of 5. We also use it for the learning-rate scheduler, which decreases the learning rate by a factor of 0.1 with a patience of 3.
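The optimization setup above can be expressed as the following hedged sketch; the helper names (`train_step`, `evaluate_gap`) and the loop structure are placeholders we introduce here, while the learning rate, evaluation interval, patience values and decay factor are the ones stated in the text.

```python
import torch

def fit(model, train_loader, val_loader, train_step, evaluate_gap,
        lr=2e-4, eval_every=10000, lr_patience=3, es_patience=5, max_epochs=100):
    """Adam + validation-GAP-driven LR decay and early stopping (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='max', factor=0.1, patience=lr_patience)

    best_gap, bad_evals, step = 0.0, 0, 0
    for _ in range(max_epochs):
        for batch in train_loader:
            train_step(model, optimizer, batch)          # e.g. the GAT step sketched in Section 3.2.2
            step += 1
            if step % eval_every == 0:                   # validation GAP every 10000 iterations
                gap = evaluate_gap(model, val_loader)
                scheduler.step(gap)                      # decay the LR by 0.1 with patience 3
                if gap > best_gap:
                    best_gap, bad_evals = gap, 0
                else:
                    bad_evals += 1
                    if bad_evals >= es_patience:         # early stopping with patience 5
                        return best_gap
    return best_gap
```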

4.2 Results

We present the results for video categorization using a baseline Transformer encoder with and without our novelties. Specifically, we compare the performance of SA with GMSA using the cross-entropy loss function $\mathcal{L}_{CE}$. The window size $T_{N}$ for the GMSA module was set to 20 based on the best GAP on the validation set. We choose the best performing architecture and implement the adversarial loss function $\mathcal{L}_{Adv}$. Based on validation set results, the values of $\epsilon$ and $\alpha$ for adversarial training were set to 0.5 and 1 respectively. $\beta_{Fr}$ and $\beta_{JS}$ are set to 0.001 and 0.01 respectively. To test the variance in performance of the different models under different test set distributions, we divide our test set into five equal partitions. We compute the mean and standard deviation of each of the four metrics obtained across these five partitions.

From Table 1, we see that using the GMSA block in the Transformer outperforms the SA block on all metrics. The improvement in MAP is larger than in GAP and Hit@1, suggesting that the performance boost is greater for underrepresented classes. Moreover, using the adversarial loss $\mathcal{L}_{Adv}$, we note an improvement in performance compared to the baseline cross-entropy loss. Furthermore, enforcing attention-map regularization between the original and adversarial examples leads to even better performance, showing the value of the extra regularization term. We obtain slightly better results with the Frobenius norm compared to JS divergence for attention regularization.

4.3 Analysis

We first provide hyperparameter analysis and qualitative analysis for gated multi-level self-attention. We then analyze the effect of attention-map regularization on model robustness and performance. We further show qualitative analysis and class-wise analysis of the final model.

4.3.1 Gated multi-level self attention

We present the validation-set GAP for varying local attention window lengths ($T_{N}$) in Figure 3. We also show results when both experts are given the global attention map, which still improves over the baseline SA architecture since the two experts can learn complementary information. Further, we show results with three gating experts: one global attention map and two local attention maps.

Figure 3: Validation GAP for (i) the baseline SA architecture (B), (ii) GMSA with global attention as input to both experts (S), (iii) GMSA with different values of the window size $T_{N}$, and (iv) GMSA with three gating experts ($T_{N,M}$): a global attention map and two local attention maps with window sizes $T_{N}$ and $T_{M}$.
Figure 4: Local and global attention profiles obtained for a video in the test set labeled as ‘Vegetables’ (youtube.com/watch?v=uEXlc2kSXpM).

We observe that using multiple gated experts improves the performance of a baseline SA block. Note that performance improves using both the local and global attention maps. We also note that the validation set GAP saturates when more than two experts are used in GMSA.

We also compare the global attention profile obtained from a baseline SA block and the local attention profile from a GMSA block for two videos in Figure 1 and Figure 4. For the global attention profile, we take the mean over the rows of the 2D attention map to get an attention vector of length $T$. Similarly, we get $N$ local attention profiles, each of length $T_{N}$, which are then concatenated to get the final local attention profile of length $T$. In Figure 1, we observe that the global attention profile peaks abruptly towards the end of the video, which shows crowds chanting, possibly leading the model to believe it is a ‘Sports’ video. The local attention profile attends to the frames more evenly, leading to the correct prediction that it is a ‘News’ video. We observe that most of the frames in the video correspond to a studio setting with people having a discussion. Hence, it makes sense that attending to these frames uniformly leads to the correct prediction. In Figure 4, while the baseline model can predict the presence of ‘Vehicle’ with high probability (0.7) and that the video is about ‘Food’ with some probability (0.3), it fails to predict the presence of ‘Vegetables’ (probability $<0.1$), which is actually the ground-truth label of the video. The GMSA module predicts that this video is about ‘Vegetables’ with a very high probability (0.8), along with ‘Vehicle’ (with probability 0.4).
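The attention profiles used in these figures can be derived from the attention maps as in the short sketch below (our naming): the global profile averages the $T\times T$ map over its rows, and the local profile concatenates the per-segment row averages.

```python
import torch

def global_attention_profile(att):
    """att: (T, T) head-averaged global attention map -> length-T profile,
    obtained by averaging the map over its rows."""
    return att.mean(dim=0)

def local_attention_profile(att_local):
    """att_local: (N, T_N, T_N) per-segment attention maps -> length-T profile,
    obtained by averaging each segment's map over its rows and concatenating the N pieces."""
    return att_local.mean(dim=1).reshape(-1)
```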

4.3.2 Adversarial training

Figure 5: Validation GAP for models with cross-entropy loss and adversarial loss: the neighborhood radius $\epsilon$ was varied with $\alpha=1$ (left) and $\alpha$ was varied with $\epsilon=0.5$ (right).

Based on the experiments in the previous section, we use a GMSA based Transformer encoder block with $T_{N}=20$ for our further experiments. We train the model with the adversarial loss function $\mathcal{L}_{Adv}$. We aim to understand the impact of the two hyper-parameters $\epsilon$ and $\alpha$ on the model performance by perturbing one of them while keeping the other constant. By altering $\epsilon$, we aim to understand the impact of the smoothing radius around the data points on the model performance, while perturbing $\alpha$ changes the weight of the adversarial loss in the overall optimization. The plots comparing the validation GAP for different values of the hyper-parameters are shown in Figure 5. First, the value of $\alpha$ was kept fixed at 1 and $\epsilon$ was varied. For lower values of $\epsilon$, models trained with the adversarial loss function $\mathcal{L}_{Adv}$ show an improvement over the model trained with $\mathcal{L}_{CE}$, peaking at $\epsilon=0.5$. As we increase the value of $\epsilon$, the model's performance starts deteriorating. This is expected since $\epsilon$ defines the neighborhood around an input feature vector over which the conditional distribution is smoothed. Increasing the radius of this neighborhood forces our model to learn smoother functions that cannot capture the complexity of the conditional distribution, thereby decreasing its performance on the validation set. Similarly, as we increase the adversarial loss weight $\alpha$, the performance increases, peaks at 1, and starts reducing as the relative weight of the classification loss goes down.

We note that after using the attention-map regularization, the Frobenius norm of the difference between the attention maps computed from the validation set samples and their adversarial counterparts reduces from 1.23 to 0.02. Qualitatively, the effect of using the attention-map based regularisation term in the loss ($\mathcal{L}_{advFr}$) can be seen in Figure 6. Here, we compare the attention profiles generated for a sample in the test set and its adversarial counterpart by models trained using the vanilla adversarial loss $\mathcal{L}_{Adv}$ and our approach $\mathcal{L}_{advFr}$. We observe that the attention generated by $\mathcal{L}_{advFr}$ is more robust to adversarial perturbations and that the attention profiles of the original sample and its adversarial counterpart overlap to a great extent. On the other hand, the model trained with $\mathcal{L}_{Adv}$ exhibits larger variations in the attention profiles as a result of adversarial perturbations to the input. Another thing to notice from the figure is that the maximum attention given to any frame reduces by an order of magnitude when using $\mathcal{L}_{advFr}$. In other words, the attention map generated by $\mathcal{L}_{advFr}$ is smoother, enforcing the temporal coherence property which has been shown to help video classification model performance [9, 15].

Figure 6: Attention profiles generated by GMSA trained with $\mathcal{L}_{Adv}$ (left) and $\mathcal{L}_{advFr}$ (right) for a test sample and its adversary.
Table 2: Qualitative analysis for example videos in YouTube-8M comparing the performance of GAT with the baseline Transformer model. Up to the top 3 predictions with probabilities $>0.2$ for each model are shown for comparison. Abbreviations MusicIns, StrIns, VG and Perf stand for MusicInstrument, StringInstrument, VideoGame and PerformanceArt respectively.

 | Example 1 | Example 2 | Example 3 | Example 4
Links | https://bit.ly/3bTxexj | https://bit.ly/3lpTbr2 | https://bit.ly/3rWdMph | https://bit.ly/3qTOQNN
GT | Music; Music:MusicIns:StrIns | Art:Perf:Dance | Sports:Combat:Wrestling | Transport:Air:Spaceship
GAT | Music: 0.35 | Art:Perf: 0.87; Art:Perf:Dance: 0.78 | Sports:Combat: 0.65; Sports:Combat:Wrestling: 0.46; Music: 0.36 | Transport:Air:Spaceship: 0.91
Baseline | Game:VG: 0.77; Game:VG:Action: 0.51 | Sports: 0.83; Sports:Ball: 0.77 | Music: 0.6 | Game:VG: 0.7; Game:VG:Action: 0.26

From Figure 7, we further notice that enforcing attention-map regularization for adversarial samples makes the model more robust to adversarial attacks. For each of the saved models trained with the losses $\mathcal{L}_{CE}$, $\mathcal{L}_{Adv}$ and $\mathcal{L}_{advFr}$, adversarial counterparts were created for the validation set using the current parameters. These were then used to evaluate the performance of the corresponding model. We notice that the models trained with $\mathcal{L}_{advFr}$ outperform those trained with $\mathcal{L}_{Adv}$ and $\mathcal{L}_{CE}$ significantly.

Figure 7: Performance of GMSA based models trained with $\mathcal{L}_{CE}$, $\mathcal{L}_{Adv}$ and $\mathcal{L}_{advFr}$ when predicting the classes of adversarial examples generated from validation set samples.

4.3.3 Qualitative analysis

We provide some example videos and the categories predicted by the baseline SA based Transformer model trained using $\mathcal{L}_{CE}$ and by our GAT model in Table 2. For each YouTube video, we show the ground truth and up to the top 3 predictions from the baseline model and our proposed model. Example 1 is a music video with a static poster. While our model correctly predicts it to be of the ‘Music’ category, the baseline detects it as a video game because of the image graphics. Example 2 is a video of a person dancing, but the baseline model predicts it as a ‘Sports’ video. In example 3, the video is of people arm wrestling, overlaid with a rock song playing throughout. The baseline model classifies it as ‘Music’, whereas GAT correctly identifies that the video is about ‘Wrestling’. Example 4 is an interesting case. It is a video about UFOs. It is quite dark in general and only has selective frames where a UFO-like object can be seen. While the baseline model incorrectly identifies it as a ‘Video Game’, GAT confidently predicts its class correctly.

4.3.4 Class-wise analysis

For each class, we computed the difference between the average precision (AP) obtained using GAT and the baseline model. We present the top five and bottom five classes in terms of improvement of GAT over the baseline in Figure 8. We observe that GAT performs worse than the baseline model in only two classes out of 431, and the degradation in performance is small. For the classes with the smallest gains, the performance of both the baseline and GAT is already quite high. It is interesting to see that some of the biggest improvements are achieved in underrepresented classes such as a specific landmark, a specific book/games genre, or science categories.
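The per-class comparison can be computed as in the following sketch, which uses scikit-learn's average precision as a stand-in for the AP computation (the exact evaluation code behind the reported numbers may differ).

```python
import numpy as np
from sklearn.metrics import average_precision_score

def classwise_ap_gain(y_true, p_gat, p_base):
    """Per-class AP difference between GAT and the baseline.

    y_true: (num_videos, K) binary label matrix; p_gat, p_base: (num_videos, K) predicted
    probabilities. Returns a length-K array; sort it to find the top/bottom classes."""
    num_classes = y_true.shape[1]
    ap_gat = np.array([average_precision_score(y_true[:, k], p_gat[:, k]) for k in range(num_classes)])
    ap_base = np.array([average_precision_score(y_true[:, k], p_base[:, k]) for k in range(num_classes)])
    return ap_gat - ap_base
```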

Figure 8: Average Precision (in %) for the top 5 (left) and bottom 5 (right) classes. The difference between the model performances is shown beside each bar plot. Abbreviations Lm, VG, St and MT stand for Landmark, video-game, strategy and Music-Type.

5 Conclusions

In this paper, we proposed a novel approach named Gated Adversarial Transformer (GAT) for the task of video classification. We introduced a gated multi-level attention module which modifies a self-attention block to capture representations of a feature set based on local and global contexts. We further enhanced the performance of our model through an adversarial training procedure. We introduced a regularization term in the loss function that equips the model with adversarial robustness in the attention maps as well as in the final output space. We performed experiments on the large-scale YouTube-8M dataset and showed consistent improvements over the baseline models using our methods. We presented various ablation studies to showcase the effect of each component of the model. Further, we showed that our model not only improves performance on test data but also provides significant robustness to adversarial attacks (Figure 7).

In the future, we would like to extend gated multi-level attention to more than two levels creating a hierarchy. Further, we would like to use video segmentation techniques to assist multi-level attention. To further improve the adversarial training aspect, we would like to include more realistic adversarial examples as well as add weighted regularization for various layers in the model.

References

  • [1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
  • [2] Ali Aminian. Vidsage: Unsupervised video representational learning with graph convolutional networks.
  • [3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095, 2021.
  • [4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [5] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A2-nets: Double attention networks. arXiv preprint arXiv:1810.11579, 2018.
  • [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [7] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 244–253, 2019.
  • [8] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
  • [9] De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, and Juan Carlos Niebles. What makes a video a video: Analyzing temporal information in video understanding models and datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7366–7375, 2018.
  • [10] Sebastian Kmiec, Juhan Bae, and Ruijian An. Learnable pooling methods for video classification. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
  • [11] Joonseok Lee, Walter Reade, Rahul Sukthankar, George Toderici, et al. The 2nd youtube-8m large-scale video understanding challenge. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
  • [12] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7083–7093, 2019.
  • [13] Feng Mao, Xiang Wu, Hui Xue, and Rong Zhang. Hierarchical video frame sequence representation with deep convolutional graph network. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
  • [14] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
  • [15] Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 737–744, 2009.
  • [16] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2574–2582, 2016.
  • [17] Saurabh Sahu, Palash Goyal, Shalini Ghosh, and Chul Lee. Cross-modal non-linear guided attention and temporal coherence in multi-modal deep video models. In Proceedings of the 28th ACM International Conference on Multimedia, pages 313–321, 2020.
  • [18] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • [19] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • [20] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  • [21] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5552–5561, 2019.
  • [22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • [23] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [24] Xingxing Wei, Jun Zhu, Sha Yuan, and Hang Su. Sparse adversarial perturbations for videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8973–8980, 2019.
  • [25] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 284–293, 2019.
  • [26] Zuxuan Wu, Ting Yao, Yanwei Fu, and Yu-Gang Jiang. Deep learning for video classification and captioning. In Frontiers of multimedia research, pages 3–29. 2017.
  • [27] Yinchong Yang, Denis Krompass, and Volker Tresp. Tensor-train recurrent neural networks for video classification. In International Conference on Machine Learning, pages 3891–3900. PMLR, 2017.
  • [28] Linchao Zhu, Du Tran, Laura Sevilla-Lara, Yi Yang, Matt Feiszli, and Heng Wang. Faster recurrent networks for efficient video classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13098–13105, 2020.