
Efficient Robustness Assessment via
Adversarial Spatial-Temporal Focus on Videos

Xingxing Wei, Member, IEEE, Songping Wang, and Huanqian Yan Xingxing Wei, Songping Wang and Huanqian Yan are with the Institute of Artificial Intelligence, Beihang University, No.37, Xueyuan Road, Haidian District, 100191, Beijing, P.R. China. Xingxing Wei is the corresponding author ([email protected])
Abstract

Adversarial robustness assessment for video recognition models has raised concerns owing to their wide applications in safety-critical tasks. Compared with images, videos have much higher dimensions, which brings huge computational costs when generating adversarial videos. This is especially serious for query-based black-box attacks, where gradient estimation for the threat model is usually utilized, and high dimensions lead to a large number of queries. To mitigate this issue, we propose to simultaneously eliminate the temporal and spatial redundancy within the video to achieve effective and efficient gradient estimation on the reduced search space, so that the number of queries can decrease. To implement this idea, we design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on the simultaneously focused key frames and key regions from the inter-frames and intra-frames of the video. The AstFocus attack is based on the cooperative Multi-Agent Reinforcement Learning (MARL) framework. One agent is responsible for selecting key frames, and another agent is responsible for selecting key regions. These two agents are jointly trained by the common rewards received from the black-box threat model to perform a cooperative prediction. By continuous querying, the reduced search space composed of key frames and key regions becomes increasingly precise, and the overall query number becomes smaller than that on the original video. Extensive experiments on four mainstream video recognition models and three widely used action recognition datasets demonstrate that the proposed AstFocus attack outperforms the SOTA methods, being superior in fooling rate, query number, time, and perturbation magnitude at the same time.

Index Terms:
Adversarial examples, Video recognition, Reinforcement learning, Black-box attacks, Spatial-Temporal analysis

1 Introduction

Deep Neural Networks (DNNs) have made remarkable achievements in various tasks such as object detection [1], action recognition [2], scene understanding [3], and so on. Recent studies illustrate DNNs' vulnerability to the so-called adversarial examples [4, 5, 6]. Afterwards, a series of methods have been proposed to evaluate the adversarial robustness of DNNs. Among these works, the attack-based robustness evaluation methods [7, 8, 9] are more popular and practical because of their good implementability. They mainly seek the minimum adversarial perturbations required for successful attacks to measure the robustness [10]. On one hand, an accurate assessment of adversarial robustness can help deploy DNNs into safety-critical systems. On the other hand, it provides a quantitative metric for designing more robust DNNs. Therefore, adversarial robustness assessment is of both theoretical and practical value.

Video recognition [11, 12, 13] is a major branch of computer vision. Leveraging the temporal and spatial relationships within video data can effectively locate and classify the objects or behaviors in videos, and thus help perform video analysis. Owing to the DNNs' advantages, current video recognition models are usually designed based on DNNs, so the DNNs' vulnerability is inevitably inherited by video recognition models. Owing to their wide applications in safety-critical tasks such as security surveillance, evaluating their adversarial robustness becomes necessary. Currently, more and more users begin to employ the video recognition APIs released by commercial cloud platforms because of their easy accessibility. In such cases, the APIs' details are not public, and we can only assess their adversarial robustness according to the outputs obtained by querying the systems. Such methods are therefore called query-based black-box attacks, which mainly rely on gradients estimated for the APIs [14, 15].

Compared with images, videos have much higher dimensions owing to the additional temporal information, which brings huge computational costs when generating adversarial videos. This is especially serious for query-based black-box attacks because the high-dimensional video data needs a large number of queries to obtain an accurate gradient estimation. Thus, seeking the minimum adversarial perturbations on videos is more challenging than on images, and a reasonable attack algorithm should first reduce the video dimensions so as to improve the attack's efficiency and reduce the perturbation magnitude. To meet this goal, temporally sparse video attacks [16, 17, 18] have been proposed to eliminate the redundancy in the temporal domain, and spatial video attacks [19] try to eliminate the redundancy in the spatial domain. More importantly, the spatial and temporal redundancy should be jointly considered, i.e., modeling the key regions within key frames and then evaluating the robustness on these areas. The current related methods [20, 21] both regard selecting key frames and selecting key regions as two separate steps and do not simultaneously consider their interaction, thus leading to sub-optimal attack efficiency and performance.

Figure 1: Overview of the proposed AstFocus attack. It integrates a cooperative Multi-Agent Reinforcement Learning (MARL) module into the PGD attack with NES gradient estimation [22], and thus selects key frames and key patches within the video to reduce dimensions. In this way, an effective and efficient gradient estimation on the reduced space is achieved, and the evaluation's efficiency and accuracy are improved at the same time.

However, simultaneously optimizing the key frames and key regions is difficult because they belong to different domains and are closely coupled, i.e., changing the key frames also affects the selection of key regions. This is more challenging in query-based black-box attacks, where only the feedback from the threat model can be used to perform the optimization. Considering the above points, this paper mainly addresses the following problem: How to simultaneously learn the precise key frames and key regions to efficiently and accurately assess the adversarial robustness of video recognition in the query-based black-box setting?

To answer this question, in this paper, we design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on the simultaneously focused key frames and key regions from the inter-frames and intra-frames of the video. The key frames and key regions are dynamically adjusted through interaction with the threat model. Technically, this process is achieved based on cooperative Multi-Agent Reinforcement Learning (MARL) [23]. One agent is responsible for selecting key frames (temporal agent), and another agent is responsible for selecting key regions (spatial agent). These two agents share one backbone network, and are jointly trained by the common rewards received from the black-box threat model to perform a cooperative prediction. By continuous querying, the focused space composed of key frames and key regions becomes increasingly precise, and the overall query number becomes smaller than that on the original video.

More specifically, the AstFocus attack is constructed based on the PGD+NES baseline, which extends PGD [24] to the black-box attack with a Natural Evolution Strategy (NES) [25] gradient estimator. We attach two agents before the gradient estimator module to reduce the video dimension. In each PGD iteration, the NES gradient estimator is first performed on the key frames and key regions predicted by the agents. Then the local adversarial perturbations are generated to attack the threat model. Finally, these two agents are updated according to the computed rewards to predict better key frames and key regions in the next iteration. This process is continuously repeated until a successful attack is achieved. The two agents have both similarities and differences. For the policy networks, we apply the same backbone network to extract the feature maps from the input video frames for both of them, but design distinct LSTM-based [26] structures according to their own characteristics to predict the optimal actions. For the actions, the temporal agent's actions are defined as the sets composed of different key frames, while the spatial agent's actions are defined as the sets composed of different patch regions located in each frame. For the rewards, three rewards are carefully designed to train the agents. The first one is the common reward from the feedback of the black-box threat model, which simultaneously guides both agents. The other two rewards are specially designed for the temporal and spatial agent, respectively, and mainly measure the actions from the view of appearance. The whole flowchart of the AstFocus attack is shown in Figure 1, and the code is released at https://github.com/DeepSota/AstFocus.

This paper is an extended work based on our conference version [27] and has the following major improvements. Firstly, besides the temporal redundancy considered in the previous version, we also consider the spatial redundancy, and further propose the novel AstFocus attack to simultaneously learn key frames and key regions and generate perturbations. This is a major change in the idea, which comprehensively makes use of the videos' spatial-temporal character to perform attacks. Secondly, we design a cooperative multi-agent RL based method to implement the new idea, while the previous version uses single-agent RL; thus the rewards, actions, and policies are carefully re-designed. Thirdly, more experiments are given and discussed, involving parameter tuning, ablation studies, and comparisons with SOTA methods. We also re-write the abstract, introduction, methodology, and experiment sections to better introduce our motivation and methods. We believe these modifications significantly improve the quality of our work.

In summary, this paper has the following contributions:

  • We propose the AstFocus attack, a novel query-based black-box attack method to assess the adversarial robustness of video recognition models. The adversarial perturbations are added only on the key spatial-temporal focused spaces, which helps significantly reduce the number of attack queries and the perturbation magnitude.

  • A cooperative multi-agent reinforcement learning module is adopted to identify the key frames and key regions at the same time. For that, we carefully design the actions, policy networks, and rewards for both agents according to the specific task. The agents are updated in each iteration rather than only after each round of successful attacks, and thus converge efficiently.

  • Compared with state-of-the-art video attack algorithms, the proposed AstFocus attack achieves fewer queries and smaller adversarial perturbations. Specifically, it reduces the query number by at least 10% and improves the fooling rate by at least 5% with the smallest perturbations, which verifies the efficiency and effectiveness of the AstFocus attack.

The rest of this paper is organized as follows: we briefly review the related works in Section 2. The proposed AstFocus attack algorithm is described in Section 3. Experimental results and analysis are presented in Section 4. Finally, we conclude the whole paper in Section 5.

2 Related Works

2.1 Adversarial Attacks on Videos

An adversarial example [4, 24, 28] is a maliciously crafted input designed to make the classifier produce a wrong output. To keep its existence imperceptible to humans, the generation of adversarial examples is often limited by deliberate constraints, such as the noise magnitude and the number of queries. Adversarial video attacks and adversarial image attacks are similar; the difference is that the attack space of a video is much larger than that of an image. It is not easy to directly extend image attack algorithms to such high-dimensional video data. High dimensions usually bring a huge search space, leading to high costs to achieve successful attacks. Especially in the black-box setting, a huge search space brings a large number of queries.

Some video attack techniques have been proposed to find adversarial videos. Wei et al. [16] generate sparse 3D adversarial perturbations to add on the whole video. To reduce the attacking space, an $l_{2,1}$-norm regularization based optimization is designed to make the adversarial perturbations more concentrated on some key frames of the input video. This method shows the sparse ability of adversarial video noises. Similarly, [18] proposes the "one frame attack", which adds adversarial noise on only one video frame. The perturbation can easily defeat deep learning-based action recognition systems, and the vulnerable frame is perturbed with a gradient-based adversarial attack method. In addition, [29] finds that the temporal structure is key to generating adversarial videos. They use a generative adversarial network to generate adversarial examples that cause a large misclassification rate for video recognition models.

Table I: Comparisons with query-based black-box video attack methods. “Temporal” denotes reducing temporal redundancy in the video, “Spatial” denotes reducing spatial redundancy, “Jointly” denotes jointly learning for reducing the spatial and temporal redundancy in the video.
Temporal Spatial Jointly
VBAD attack [19]
Sparse attack [17]
Motion-sampler attack [30]
GEO-TRAP attack [31]
Heuristic attack [20]
RLSB attack [21]
AstFocus attack (ours)

Besides white-box video attacks, black-box video attacks have also been explored. One class of such methods is based on transferability across different models. For example, Wei et al. [32] perform black-box video attacks based on adversarial perturbations generated on image models.

Another class of black-box video attacks consists of query-based methods, which generate perturbations by querying the target video recognition system. Among them, Jiang et al. [19] extend the PGD algorithm to video attacks with gradient estimators computed using super-pixels. To reduce attacking costs, some efficient black-box video attack algorithms have been proposed. [30] argues that the initialized random noises in [19] are less effective; they utilize the intrinsic movement pattern and regional relative motion, and propose motion-aware noises to replace random noises. By using this prior in gradient estimation, fewer queries are needed to perform video attacks. Wei et al. [20] search for a subset of frames based on the importance of each video frame to the recognition model. Besides, they also limit the adversarial perturbations to some salient regions. Because the temporal and spatial reductions are separately formulated, the method usually needs hundreds of thousands of queries. To mitigate this defect, Wei et al. [17] propose a sparse video attack algorithm based on reinforcement learning, where an agent is designed to identify key frames through interactions with the threat model. It can significantly reduce the adversarial perturbations, but updates the agent only after each round of successful attacks. This poor update mechanism leads to many unnecessary queries and a weak fooling rate. The RLSB attack [21] explores selecting key frames and key regions to reduce the high computation cost. However, the reinforcement learning is only applied to select key frames, which is similar to [17]. The process of selecting key regions is based on saliency maps; it is independent of the process of selecting key frames and not integrated into the reinforcement learning framework. Thus, the selection of key frames and key regions is separately formulated. Recently, [31] proposes parameterizing the temporal structure of the search space using geometric transformations to reduce the temporal search space, so that the gradients can be efficiently estimated.

In this paper, we also explore the important search space, which is different from previous works that focus only on key frames in the temporal domain: we jointly consider the identification of key regions in the spatial domain besides key frames in the temporal domain. For that, a multi-agent reinforcement learning method is designed to identify a reduced space through rewards based on the inherent properties of the video and interactions with the threat model. The comparisons with query-based black-box video attack methods are summarized in Table I.

2.2 Spatial-Temporal Property for Videos

A video can be regarded as multiple continuous images; therefore, video processing often needs to consider both spatial and temporal correlations, and their simultaneous consideration is key to video-related tasks. Video action recognition is a longstanding research topic in multimedia and computer vision. Many mainstream algorithms are motivated by advances in image classification and improved by utilizing the temporal dimension of the video data. To facilitate classification performance, Wu et al. [33] propose a hybrid deep learning framework for video classification, which is able to harness not only the spatial and short-term motion features, but also the long-term temporal clues. They integrate the spatial and temporal features in a deep neural model with elaborately designed regularizations to explore feature correlations, and the method produces competitive classification performance. More works based on the spatial-temporal property can be found in [11, 12, 13].

Unlike the above methods, we consider the spatial-temporal property of videos in the video attack task. The temporal and spatial redundancy within videos is reduced to improve the efficiency of video attacks, which extends the application scope of the spatial-temporal property of videos.

3 Methodology

In this section, we first give the baseline video attack algorithm: PGD [24] attack with NES [25] gradient estimator. Then the details of integrating cooperative Multi-Agent Reinforcement Learning (MARL) [23] into the baseline are introduced. Finally, the whole algorithm is summarized.

3.1 Preliminaries

We assume $F(\cdot)$ is a black-box video recognition model of which only the top-1 information, including the category label and the confidence score, can be acquired. Given a video $X=\{x_{i}\,|\,i=1,...,M\}$ with ground-truth label $y$, where $x_{i}\in\mathbb{R}^{H\times W\times 3}$ denotes the $i$-th frame and $M$ is the total frame number, the predicted category label is $y=F(X)$, and the corresponding confidence score is $P(y|X)$.

To attack the video recognition model, we extend Projected Gradient Descent (PGD) [24] to adapt to the video data. The adversarial video $X^{\prime}$ under the un-targeted attack is defined as:

$X^{\prime}_{t+1}=Proj(X^{\prime}_{t}+\alpha\cdot sign(\nabla_{X}l(X^{\prime}_{t},y)))$,   (1)

where $Proj(\cdot)$ projects the updated adversarial example to a valid range, $\alpha$ is the attack step used to control the magnitude of the adversarial noise added in each iteration, $sign(\cdot)$ is the sign function, and $l(\cdot)$ is the cross-entropy loss function. Due to the limitation of the black-box setting, we cannot obtain the accurate gradient $g$ by directly computing $g=\nabla_{X}l(X^{\prime}_{t},y)$. Instead, [22] proposes to utilize the Natural Evolution Strategy (NES) [25] to estimate $g$ by querying the threat model. Specifically, NES can be described as:

$g\approx\cfrac{1}{\Delta n}\sum_{i=1}^{n}\delta_{i}\cdot P(y|X_{t}^{\prime}+\Delta\cdot\delta_{i})$.   (2)

It first samples $n/2$ values $\delta_{i}\sim N(0,I)$, and then sets $\delta_{j}=-\delta_{n-j+1}$, $j\in\{n/2+1,...,n\}$. Finally, the gradient $g$ is estimated through averaging the ratio of the predicted results to the search variance $\Delta$.
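For illustration, a minimal Python sketch of this antithetic NES estimator is given below. The helper query_score(·) is a hypothetical wrapper that queries the black-box model and returns $P(y|\cdot)$; the tensor layout and default values are assumptions rather than the exact implementation.

import torch

def nes_gradient(x_adv, query_score, n=60, sigma=1e-3):
    """Antithetic NES gradient estimate of Eq.(2) on the (possibly reduced) video tensor.

    x_adv:       current adversarial video, e.g. shape (M, H, W, 3)
    query_score: hypothetical callable returning P(y | x) from the black-box model
    n:           number of sampled points (even; n/2 antithetic pairs)
    sigma:       search variance (Delta in Eq.(2))
    """
    grad = torch.zeros_like(x_adv)
    for _ in range(n // 2):
        delta = torch.randn_like(x_adv)          # delta_i ~ N(0, I)
        for d in (delta, -delta):                # antithetic pair: delta and -delta
            score = query_score(x_adv + sigma * d)
            grad += score * d                    # accumulate score-weighted directions
    return grad / (sigma * n)                    # average and divide by the search variance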

For the targeted attack, Eq.(1) is modified as follows:

$X^{\prime}_{t+1}=Proj(X^{\prime}_{t}-\alpha\cdot sign(\nabla_{X}l(X^{\prime}_{t},y^{\prime})))$,   (3)

where $y^{\prime}$ is a target category label pre-defined by the adversary. In Eq.(2), the ground-truth label $y$ should also be replaced with the target label $y^{\prime}$ to estimate the gradients with respect to the target label.

In practical applications, directly performing Eq.(2) is inefficient because the number of sample points $n$ is related to the dimension of $X_{t}^{\prime}\in\mathbb{R}^{M\times H\times W\times 3}$. Owing to the high dimension of the video data $X_{t}^{\prime}$, we need to set a large value of $n$ to compute an accurate gradient in each iteration $t$, which leads to a large number of queries to the threat model. To improve the attack efficiency, the video dimension should be reduced by selecting the key frames and key regions, obtaining reduced $M$, $H$, and $W$, so that a small value of $n$ becomes sufficient. Technically, we hope to replace $X_{t}^{\prime}$ in Eq.(2) with $\hat{X}_{t}^{\prime}=\Gamma(X_{t}^{\prime})$, where $\Gamma(\cdot)$ denotes the reduction operation and $\hat{X}_{t}^{\prime}$ is the reduced video.

Figure 2: The designed actions of the spatial agent. In each frame, we uniformly divide the frame into overlapping patches according to a predefined stride. All the patch candidates constitute the actions. For simplicity, the stride equals the patch size in this example.

3.2 The Proposed AstFocus Attack

To implement the above idea, we build the so-called AstFocus attack based on a cooperative multi-agent reinforcement learning (MARL) to jointly solve for the key frames and key regions during the black-box attack process. In AstFocus attack, one agent is responsible for selecting key frames (temporal agent), and another agent is responsible for selecting key regions (spatial agent). These two agents are cooperative to achieve the same goal. The processes of selecting key frames and key regions in each iteration of PGD are formulated into the Markov Decision Processes (MDP). The details of these two agents as well as the optimization algorithm are given below.

3.2.1 Spatial Agent

The spatial agent essentially solves an object localization problem (detecting key regions); we detail it in three parts.

Action Design: To construct the actions of the spatial agent, we uniformly divide each video frame into overlapping patches, inspired by the Vision Transformer [34]. In this way, we obtain a candidate patch set for the $i$-th frame $x_{i}$: $B_{i}=\{b_{i}^{j}\,|\,j=1,...,D\}$, where $b_{i}^{j}$ denotes the $j$-th patch region within $x_{i}$, and $D$ is the total number of candidate patches in this frame. $b_{i}^{j}\in\mathbb{R}^{h\times w}$ denotes that the patch's size is $h\times w$, whose values will be tuned in the experiments. The goal of the spatial agent is to select an optimal patch $b_{i}^{*}\in B_{i}$ in each frame as the key region, and thus the final selected action is a sequence set $a^{p}=\{b_{i}^{*}\,|\,i=1,...,M\}$. From this definition, we can see that there are in total $D^{M}$ action combinations for the given video $X$, which implies the search space is huge. An example of the actions in one frame is shown in Figure 2, where $D=16$.
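A small sketch of how such a candidate patch set can be enumerated is given below; the frame size, patch size, and stride used here are illustrative assumptions (the patch size actually used is tuned in the experiments).

def patch_candidates(frame_h, frame_w, patch=65, stride=65):
    """Enumerate the overlapping patch candidates B_i of one frame as (top, left, h, w) boxes."""
    boxes = []
    for top in range(0, frame_h - patch + 1, stride):
        for left in range(0, frame_w - patch + 1, stride):
            boxes.append((top, left, patch, patch))
    return boxes

# Example: a 260x260 frame with patch = stride = 65 yields D = 16 candidates, as in Figure 2.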

Policy Network Design: The spatial policy network $\pi^{p}(a^{p}|s^{p})$ is used to predict the spatial action $a^{p}$ when the state $s^{p}$ is given. The flowchart of our policy network is shown in Figure 3. Because we need to handle sequential video data, an LSTM-based [26] structure is used to construct the policy network $\pi^{p}(a^{p}|s^{p})$. For the $i$-th frame $x_{i}$, a lightweight convolutional neural network (CNN) $f(x_{i})$ first extracts the frame-level feature maps $e_{i}$. In our experiments, we use MobileNet V2 as the lightweight CNN backbone for simplicity; other lightweight CNNs can also be applied. The features are then fed into the LSTM unit to predict the logits for each patch. Next, a Fully Connected Layer (FCL) with Softmax is attached to output each patch's probability $p_{b_{i}^{j}}$. Finally, we utilize categorical sampling to obtain the optimal patch region $b_{i}^{*}$ according to the probability values $p(B_{i})=\{p_{b_{i}^{j}}\,|\,j=1,...,D\}$. To guarantee a smooth change of the selected patch between adjacent frames, we concatenate the local patch features $e^{b^{*}}_{i-1}$ of the previously selected patch $b_{i-1}^{*}$ with the current frame-level features $e_{i}$ to jointly predict the current patch region, where $e^{b^{*}}_{i-1}$ is extracted via a simple multilayer perceptron (MLP) on the corresponding patch features of $e_{i-1}$.

Formally, the frame-level feature maps are extracted by:

$e_{i}=f(x_{i}),\ i=1,2,...,M$,   (4)

next, the optimal action for each frame is achieved by:

$p(B_{i})=\pi^{p}_{\theta}(\cdot\,|\,concat(e_{i},e^{b^{*}}_{i-1}),h^{\pi}_{i-1}),\ i=1,2,...,M$,   (5)
$b_{i}^{*}=categorical(p(B_{i})),\ i=1,2,...,M$,   (6)

where $h^{\pi}_{i-1}$ denotes the hidden state output by the LSTM unit at the $(i{-}1)$-th frame. Thus, the state $s^{p}$ in our method is defined as the concatenated feature $concat(e_{i},e^{b^{*}}_{i-1})$. Eq.(6) is repeated $M$ times to obtain the optimal action $a^{p}=\{b_{i}^{*}\,|\,i=1,...,M\}$.

In our method, the policy network is updated in each iteration $t$ of the PGD attack; therefore, the optimal action is also updated in each iteration until the PGD attack stops.
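To make the structure concrete, a simplified PyTorch sketch of this spatial policy head is shown below. The feature dimensions are assumptions, and the previous patch feature is approximated by a projection of the previous frame feature rather than by re-cropping the selected patch, so this is only an illustrative sketch of Eqs.(4)-(6).

import torch
import torch.nn as nn

class SpatialPolicy(nn.Module):
    """LSTM-based spatial policy head: predicts a categorical distribution over
    the D patch candidates of each frame and samples the key region b_i*."""
    def __init__(self, feat_dim=1280, patch_dim=128, hidden=256, num_patches=16):
        super().__init__()
        self.patch_mlp = nn.Sequential(nn.Linear(feat_dim, patch_dim), nn.ReLU())
        self.lstm = nn.LSTMCell(feat_dim + patch_dim, hidden)
        self.fc = nn.Linear(hidden, num_patches)

    def forward(self, frame_feats):
        # frame_feats: (M, feat_dim) pooled backbone features e_i, one row per frame
        h = torch.zeros(1, self.lstm.hidden_size)
        c = torch.zeros(1, self.lstm.hidden_size)
        prev_patch = torch.zeros(1, self.patch_mlp[0].out_features)
        actions, log_probs = [], []
        for e_i in frame_feats:                          # iterate over frames
            state = torch.cat([e_i.unsqueeze(0), prev_patch], dim=1)
            h, c = self.lstm(state, (h, c))
            probs = torch.softmax(self.fc(h), dim=1)     # p(B_i) over the D patches
            dist = torch.distributions.Categorical(probs)
            b_star = dist.sample()                       # categorical sampling, Eq.(6)
            actions.append(int(b_star.item()))
            log_probs.append(dist.log_prob(b_star))
            # the full method re-extracts features of the selected patch from e_{i-1};
            # here a projection of the whole frame feature is used as a stand-in
            prev_patch = self.patch_mlp(e_i.unsqueeze(0))
        return actions, torch.stack(log_probs)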

Figure 3: The flowchart for the Policy network of the proposed spatial agent. It is used to identify the crucial regions of each video frame.

Reward Design: In each iteration, the spatial policy network receives feedback from the environment to update its parameters $\theta^{p}$. Therefore, we need to design reasonable rewards to guide the update of the policy network. Because the AstFocus attack is based on the Multi-Agent Reinforcement Learning (MARL) framework, we design two kinds of rewards: one is specific to the spatial agent, and the other is a common reward shared with the temporal agent.

For the specific reward, an intuitive idea to evaluate a patch's importance is the area covered by the foreground objects, because video recognition models mainly perform predictions based on foreground objects such as persons, cars, etc. Therefore, if the policy network $\pi^{p}(a^{p}|s^{p})$ selects a foreground patch, the specific reward should be enlarged, and thus the policy network will be encouraged to select the foreground object in the next iteration. Based on this idea, we need a metric to measure the objectness score for a given patch. We here choose a classic objectness model: edgeboxes [35]. It calculates the edge response of each pixel and determines the boundary of the object using a structured edge detector.

More concretely, the reward $r^{i}_{edgebox}$ for the selected patch $b_{i}^{*}$ can be described as follows:

$r^{i}_{edgebox}=\cfrac{\prod_{k}w_{b}(s_{k})\cdot u_{k}}{2\cdot(w+h)^{2}},\ i=1,2,...,M$,   (7)

The edgebox reward for the whole video is defined as:

$r_{edgebox}=\sum_{i=1}^{M}r^{i}_{edgebox}$,   (8)

where $w$ and $h$ are the patch's width and height, $w_{b}(s_{k})$ measures the affinity of the $k$-th edge group in the selected patch, and $u_{k}$ is the sum over the $k$-th edge group in the selected patch. In general, a large patch often results in a large edgebox value. More detailed information about the edgebox function can be found in [35].

The common reward comes from the feedback of the black-box threat model. If the selected patch is reasonable, the generated adversarial patch should have a strong attacking ability, and thus will make the confidence score output by the threat model drop significantly. Therefore, we can use the confidence drop of the ground-truth label as a metric to compute this reward. Because this reward is also useful to the temporal agent, it is called the common reward. Specifically, the common reward $r_{common}$ is defined as follows:

$V(X^{\prime})=exp(P(y^{\prime}|X^{\prime})-P(y|X^{\prime}))$;   (9)
$r_{common}=\cfrac{V(X^{\prime}_{t+1})-V(X^{\prime}_{t})}{V(X^{\prime}_{t})}$,   (10)

where $exp(\cdot)$ is the exponential function, and $P(y|X^{\prime})$ represents the ground-truth label's confidence when $X^{\prime}$ is fed into the video recognition model. In the un-targeted attack, $P(y^{\prime}|X^{\prime})$ represents the second-ranked label's confidence, which is considered the most competitive label to replace the ground-truth label. Only when the second-ranked label's confidence becomes larger than that of the ground-truth label does $V(X^{\prime})$ become large. We then use the relative change of $V(X^{\prime})$ between iterations $t$ as the metric for the common reward. Eq.(10) is designed to encourage the agent to add perturbations on the selected regions so that the second-ranked label's confidence gradually approaches the ground-truth label's confidence and finally exceeds it. In the targeted attack, $P(y^{\prime}|X^{\prime})$ is the confidence of the target label pre-defined by the adversary.

In summary, the $t$-th iteration reward for the spatial agent is:

$r^{t}_{spatial}=r^{t}_{common}+\lambda_{1}r^{t}_{edgebox}$,   (11)

where $\lambda_{1}$ denotes a weight to balance the two terms.
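A minimal sketch of computing the common reward (Eqs.(9)-(10)) and the combined spatial reward (Eq.(11)) is given below; p_target and p_gt stand for the competing (or pre-defined target) label's confidence and the ground-truth label's confidence returned by the threat model, and the default weight follows the value tuned later in the experiments.

import math

def potential(p_target, p_gt):
    """V(X') in Eq.(9): exp of the gap between the competing (or target) label's
    confidence and the ground-truth label's confidence."""
    return math.exp(p_target - p_gt)

def common_reward(v_prev, v_curr):
    """r_common in Eq.(10): relative change of V(X') between two PGD iterations."""
    return (v_curr - v_prev) / v_prev

def spatial_reward(r_common, r_edgebox, lam1=0.2):
    """r_spatial in Eq.(11); lam1 is the weight tuned in the experiments."""
    return r_common + lam1 * r_edgebox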

3.2.2 Temporal Agent

There exists a major distinction between the temporal agent and the spatial agent: the spatial agent solves an object localization problem, while the temporal agent solves a binary classification problem (selecting or not selecting a frame). Thus, the actions, rewards, and policy network of the temporal agent are re-designed.

Action Design: Key frames refer to those video frames that are conducive to a successful attack, and their number is smaller than that of the whole video. The goal of the temporal agent is to select key frames from the whole input video $X$, and thus the final selected action is also a sequence set $a^{f}=\{o_{i}^{*}\,|\,i=1,...,M\}$, just like the spatial agent. Here $o_{i}^{*}\in\{0,1\}$ indicates whether the $i$-th frame is selected. Therefore, there are in total $2^{M}$ different actions, which is not friendly to direct optimization.

Figure 4: The flowchart for the policy network of the proposed temporal agent. It is used to select the key video frames from the input video.

Policy Network Design: The temporal policy network $\pi^{f}(a^{f}|s^{f})$ is used to predict the temporal action $a^{f}$ when the state $s^{f}$ is given. It is constructed with an LSTM structure. The skeleton diagram of the temporal policy network is shown in Figure 4. The input of the policy network is the concatenated feature composed of the current frame-level features $e_{i}$ and a video-level global feature $e^{g}$. Combining these two features allows the key frames to be selected with the global video information taken into account. The global feature $e^{g}$ is obtained by a fully connected layer over all the frame-level features $e_{i}$, $i=1,...,M$. The output of the LSTM network is then fed to a Fully Connected Layer (FCL) with Softmax to predict the probability $p_{i}$ that $o_{i}=1$. Technically, the temporal policy network can be expressed as:

$p_{i}=\pi^{f}_{\theta}(\cdot\,|\,concat(e_{i},e^{g}),h^{\pi}_{i-1}),\ i=1,2,...,M$,   (12)
$o_{i}^{*}=Bernoulli(p_{i}),\ i=1,2,...,M$,   (13)

where $Bernoulli(\cdot)$ denotes sampling from a Bernoulli distribution, and $h^{\pi}_{i-1}$ denotes the hidden state output by the LSTM unit at the $(i{-}1)$-th frame. The state $s^{f}$ is defined as the concatenated feature $concat(e_{i},e^{g})$. Eq.(13) is repeated $M$ times to get $a^{f}=\{o_{i}^{*}\,|\,i=1,...,M\}$.
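A simplified PyTorch sketch of this temporal policy head is given below. The dimensions are assumptions, the global feature is built from mean-pooled frame features followed by a fully connected layer, and a sigmoid replaces the Softmax-with-FCL head for the binary decision, so the snippet is only an illustrative sketch of Eqs.(12)-(13).

import torch
import torch.nn as nn

class TemporalPolicy(nn.Module):
    """LSTM-based temporal policy head: for every frame it outputs a Bernoulli
    probability p_i of being selected as a key frame and samples o_i*."""
    def __init__(self, feat_dim=1280, global_dim=256, hidden=256):
        super().__init__()
        self.global_fc = nn.Linear(feat_dim, global_dim)   # video-level global feature e^g
        self.lstm = nn.LSTMCell(feat_dim + global_dim, hidden)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, frame_feats):
        # frame_feats: (M, feat_dim) pooled backbone features e_i
        e_g = torch.relu(self.global_fc(frame_feats.mean(dim=0, keepdim=True)))
        h = torch.zeros(1, self.lstm.hidden_size)
        c = torch.zeros(1, self.lstm.hidden_size)
        picks, log_probs = [], []
        for e_i in frame_feats:
            state = torch.cat([e_i.unsqueeze(0), e_g], dim=1)
            h, c = self.lstm(state, (h, c))
            p_i = torch.sigmoid(self.fc(h))                 # probability that o_i = 1
            dist = torch.distributions.Bernoulli(probs=p_i)
            o_i = dist.sample()                             # Bernoulli sampling, Eq.(13)
            picks.append(int(o_i.item()))
            log_probs.append(dist.log_prob(o_i))
        return picks, torch.stack(log_probs).squeeze()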

Reward Design: To make the temporal agent intelligent, the temporal policy network interacts with the environment to update its parameters $\theta^{f}$. Similar to the training of the spatial agent, in addition to the common reward shared with the spatial agent, we design two specific rewards to guide the temporal agent. The first specific reward is the sparse reward $r_{sparse}$:

$r_{sparse}=exp\left(-\dfrac{1}{M}\left|\sum_{i=1}^{M}o_{i}-L\right|\right)$,   (14)

where $L$ is used to control the number of key frames selected by the temporal agent, and $L<M$. The second specific reward mainly evaluates the representative ability of the video frames selected by the temporal agent, because the selected video frames need to be sparse yet still represent the semantic information of the whole input video. The representative reward [36] $r_{rep}$ is defined as:

$r_{rep}=exp\left(-\frac{1}{M}\sum_{i=1}^{M}\min\limits_{t^{\prime}\in\mathcal{K}}||e_{i}-e_{t^{\prime}}||_{2}\right)$,   (15)

where $\mathcal{K}$ is the set of selected frames, i.e., frames with $o_{i}=1$. Through these reward functions, the temporal agent is forced to recognize few but critical video frames. The selected video frames can effectively reduce the temporal redundancy of the entire video and improve the subsequent attack efficiency.

To make the key video frames conducive to successful attacks, $r_{common}$ also joins the learning of the temporal agent. For the $t$-th iteration, the corresponding reward is:

$r^{t}_{temporal}=r^{t}_{common}+\lambda_{2}r^{t}_{sparse}+\lambda_{3}r^{t}_{rep}$,   (16)

where $\lambda_{2}$ and $\lambda_{3}$ are two balance coefficients, which will be discussed and set in the experimental section.
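The sketch below illustrates how the temporal rewards in Eqs.(14)-(16) can be computed from the key-frame decisions $o_{i}$ and the frame-level features $e_{i}$; the default weights follow the values tuned in the experiments, and the implementation details are assumptions.

import math
import torch

def sparse_reward(picks, L=10):
    """r_sparse in Eq.(14): encourages roughly L selected key frames out of M."""
    M = len(picks)
    return math.exp(-abs(sum(picks) - L) / M)

def representative_reward(frame_feats, picks):
    """r_rep in Eq.(15): every frame should be close, in feature space, to some key frame."""
    key_idx = [i for i, o in enumerate(picks) if o == 1]
    if not key_idx:
        return 0.0                                            # no key frame selected: no reward
    dists = torch.cdist(frame_feats, frame_feats[key_idx])    # (M, |K|) pairwise L2 distances
    return math.exp(-dists.min(dim=1).values.mean().item())

def temporal_reward(r_common, r_sparse, r_rep, lam2=0.4, lam3=0.6):
    """r_temporal in Eq.(16)."""
    return r_common + lam2 * r_sparse + lam3 * r_rep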

So far, through the cooperation of the spatial agent and the temporal agent, the key regions within the key frames of the input video can be identified. In the procedure of multi-agent reinforcement learning, the agents interact with the threat model many times, and their predictions become increasingly inclined toward a rapid and successful attack. Therefore, the critical attacking spaces selected by multi-agent reinforcement learning are the spaces sensitive to attacks, which can effectively improve the attack efficiency.

3.2.3 Optimization Algorithm

There are two parts to optimize: one is the lightweight CNN backbone $f(\cdot)$, and the other is the policy network $\pi(\cdot)$.

CNN backbone: In our method, the CNN backbone $f(\cdot)$ is shared by the temporal agent and the spatial agent. It extracts the frames' feature maps to construct the state $s$. To decouple the training of the CNN backbone and the policy networks, we directly apply a MobileNet V2 backbone pre-trained on the ImageNet dataset as the feature extractor. In this way, we can focus on the optimization of the two policy networks.

Policy network: Policy gradient methods are used to optimize the temporal and spatial policy networks. They directly adjust the parameters $\theta$ to maximize the objective $J(\theta)=\mathbb{E}_{s\sim\rho^{\pi},a\sim\pi_{\theta}}[R]$ by taking steps in the direction of $\nabla_{\theta}J(\theta)$. By introducing an action-value function $Q^{\pi}(s,a)$, the policy gradient can be written as:

$\nabla_{\theta}J(\theta)=\mathbb{E}_{s\sim\rho^{\pi},a\sim\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)\,Q^{\pi}(s,a)]$.   (17)

To solve Eq.(17), we utilize the actor-critic reinforcement learning framework [37], where a critic network is applied to approximate the action-value function $Q^{\pi}(s,a)$. The actor networks are the policy networks in Figure 3 and Figure 4.

Our method focuses on a cooperative multi-agent task, in which the two agents try to optimize a shared reward function. Each agent is decentralized and only has access to locally available information: the temporal agent can only observe the change of key frames, and the spatial agent can only observe the change of key patches. Therefore, our setting can be described as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [38]. To solve this problem, [23] presents the multi-agent decentralized-actor, centralized-critic approach, and thus Eq.(17) is reformulated as:

$\nabla_{\theta_{k}}J(\theta_{k})=\mathbb{E}_{\mathbf{s}\sim\rho^{\pi},a_{k}\sim\pi_{k}}[\nabla_{\theta_{k}}\log\pi_{k}(a_{k}|s_{k})\,Q_{k}^{\pi}(\mathbf{s},a_{1},...,a_{K})]$,   (18)

where $\pi_{k}$ denotes the policy network of the $k$-th agent, and $\theta_{k}$ denotes the corresponding parameters. In our method, there are two agents in total, whose policy networks are $\pi^{p}(a^{p}|s^{p})$ with parameters $\theta^{p}$ and $\pi^{f}(a^{f}|s^{f})$ with parameters $\theta^{f}$. Here $Q_{k}^{\pi}(\mathbf{s},a_{1},a_{2})$ is a centralized action-value function that takes as input the actions of both agents $(a_{1},a_{2})$ in addition to the state information $\mathbf{s}=[s^{p},s^{f}]$, and outputs the Q-value for agent $k$. In this way, communication between the two agents is performed. In cooperative MARL, each agent is expected to maximize the common reward and its specific reward; therefore, we just need to solve Eq.(18) with the rewards of the spatial agent and the temporal agent, respectively.
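As an illustration of the centralized critic in Eq.(18), a minimal PyTorch sketch is given below; the network architecture and the way the actions are encoded (e.g., one-hot vectors) are assumptions rather than the exact design.

import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Centralized action-value function Q_k(s, a_1, a_2): it sees the joint state
    of both agents and the actions of both agents, and outputs a scalar Q-value."""
    def __init__(self, state_dim, spatial_action_dim, temporal_action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + spatial_action_dim + temporal_action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_state, a_spatial, a_temporal):
        # joint_state: concatenation of s^p and s^f; actions are encoded vectors
        return self.net(torch.cat([joint_state, a_spatial, a_temporal], dim=-1))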

Algorithm 1 AstFocus black-box video attack algorithm
Input: Clean video $X$; ground-truth label $y$; feature extractor $f(\cdot)$; black-box video recognition model $F(\cdot)$; max PGD iterations $T$; PGD attack step $\alpha$; learning rate $\epsilon$.
Output: Adversarial video $X^{\prime}$
1:  Initialize parameters $\theta^{f}$ and $\theta^{p}$ for the temporal policy network $\pi^{f}(\cdot)$ and the spatial policy network $\pi^{p}(\cdot)$;
2:  Extract frame-level features $\{e_{i}\,|\,i=1,...,M\}$ via Eq.(4);
3:  while $t<T$ do
4:     Compute key regions $a^{p}=\{b_{i}^{*}\,|\,i=1,...,M\}$ via Eq.(5) and Eq.(6);
5:     Compute key frames $a^{f}=\{o_{i}^{*}\,|\,i=1,...,M\}$ via Eq.(12) and Eq.(13);
6:     Obtain the core video $\hat{X}=\{o_{i}^{*}\cdot b_{i}^{*}\,|\,i=1,...,M\}$;
7:     Estimate the gradients on $\hat{X}$ via Eq.(2);
8:     Generate the adversarial video $X^{\prime}_{t+1}$ via Eq.(1);
9:     if $F(X^{\prime}_{t+1})\neq y$ then
10:        $X^{\prime}\leftarrow X^{\prime}_{t+1}$; Break;
11:     else
12:        Compute the spatial reward $r^{t}_{spatial}$ via Eq.(11) and the temporal reward $r^{t}_{temporal}$ via Eq.(16);
13:        Compute $\nabla_{\theta^{f}}J(\theta^{f})$ and $\nabla_{\theta^{p}}J(\theta^{p})$ via Eq.(18);
14:        Update $\theta^{f}\leftarrow\theta^{f}+\epsilon\cdot\nabla_{\theta^{f}}J(\theta^{f})$;
15:        Update $\theta^{p}\leftarrow\theta^{p}+\epsilon\cdot\nabla_{\theta^{p}}J(\theta^{p})$;
16:     end if
17:  end while
18:  return $X^{\prime}$

To solve Eq.(18) for the spatial agent $\pi^{p}(a^{p}|s^{p})$ and the temporal agent $\pi^{f}(a^{f}|s^{f})$, we use Proximal Policy Optimization (PPO), a popular single-agent on-policy RL algorithm [39], to obtain $\theta^{f}$ and $\theta^{p}$. For the details of the PPO algorithm, please refer to [39].
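For reference, the clipped PPO surrogate loss used to update either policy network can be sketched as follows; the clipping threshold and the advantage estimation are assumptions following the standard PPO formulation [39], not the exact training recipe.

import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss for either agent's policy network.
    new_log_probs / old_log_probs: log pi_theta(a|s) after and before the update step.
    advantages: reward-based advantage estimates (e.g., reward minus a critic baseline)."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()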

3.3 The Overall Framework

After the MARL module, the key frames and key regions are obtained. The video $X_{t}^{\prime}\in\mathbb{R}^{M\times H\times W\times 3}$ in Eq.(1) and Eq.(2) is reduced to the video $\hat{X}_{t}^{\prime}\in\mathbb{R}^{m\times h\times w\times 3}$ composed of key frames and key regions, where $m$ denotes the number of key frames, and $h,w$ denote the key patches' height and width. It is clear that $m\ll M$, $h\ll H$, $w\ll W$. The AstFocus attack finally utilizes $\hat{X}_{t}^{\prime}\in\mathbb{R}^{m\times h\times w\times 3}$ to compute Eq.(2). Because of the reduced dimension, the gradient estimation becomes efficient.
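A minimal sketch of this reduction step is given below, assuming the video is stored as an (M, H, W, 3) tensor, the agents' outputs are the key-frame indicators $o_{i}$ and the key-region boxes $b_{i}^{*}$, and at least one key frame is selected.

import torch

def reduce_video(video, picks, patch_boxes):
    """Build the reduced video from the agents' outputs.

    video:       tensor of shape (M, H, W, 3)
    picks:       list of M binary key-frame decisions o_i from the temporal agent
    patch_boxes: list of M (top, left, h, w) key regions b_i* from the spatial agent
    Returns a tensor of shape (m, h, w, 3) with m = number of selected key frames.
    """
    crops = []
    for frame, o_i, (top, left, h, w) in zip(video, picks, patch_boxes):
        if o_i == 1:                                            # keep only key frames
            crops.append(frame[top:top + h, left:left + w])     # crop the key region
    return torch.stack(crops)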

We now give the overall algorithm of the AstFocus attack, illustrated under the un-targeted attack. The agent learning is an unsupervised process: through continuous interaction with the threat model, the agents receive feedback from the attack effect and from external evaluation indicators of the video itself, which updates them and encourages better behavior. The whole algorithm is summarized in Algorithm 1.

4 Experiments and Results

4.1 Datasets and Recognition Models

Datasets. In our experiments, three public action recognition datasets are used: UCF-101 [40], HMDB-51 [41], and Kinetics-400 [42]. UCF-101 contains 13,320 videos with 101 action categories. HMDB-51 is a dataset for human motion recognition, which contains 51 action categories with a total of about 7,000 videos. Kinetics-400 contains 400 human action classes, with at least 400 video clips for each action. All of these datasets divide 70% of the videos into training sets and 30% into test sets. We randomly sample 100 videos from the UCF-101 test set, 50 videos from the HMDB-51 test set, and 400 videos from the Kinetics-400 test set. All sampled videos are classified correctly by the recognition models.

Recognition Models. For recognition models, four representative methods are used in our experiments: C3D [11], Temporal Segment Network (TSN) [12], Temporal Shift Module (TSM) [13], and the SlowFast network [43]. These models are all mainstream methods for the video classification task. For TSN, TSM, and SlowFast on the three datasets, we utilize the corresponding pre-trained weights released by MMAction2 [44], a widely used open-source toolbox for video understanding based on PyTorch. For C3D, because MMAction2 only releases the pre-trained weights on UCF-101, to ensure consistency, we utilize the officially pre-trained weights on the three datasets released by the authors (https://github.com/kenshohara/3D-ResNets-PyTorch). Table II lists their accuracy values on the test sets.

Table II: The accuracy of four different models on three datasets.
Models          UCF-101   HMDB-51   Kinetics-400
C3D [11]        85.88%    59.57%    54.20%
TSN [12]        83.03%    56.08%    70.42%
TSM [13]        94.58%    74.77%    71.90%
SlowFast [43]   92.78%    65.95%    74.42%

4.2 Evaluation metrics

Four metrics are used to test the performance of our method from various aspects: Fooling Rate, Query Number, Mean Absolute Perturbation, and Time.

Fooling Rate (FR): indicates the percentage of adversarial videos that successfully fool the threat model out of all tested videos. FR reflects the probability of successfully generating adversarial examples. A higher FR value means better attack performance.

Mean Absolute Perturbation (MAP): denotes the magnitude of the generated adversarial perturbation $\mathbf{r}$. For a given video, MAP $=\frac{1}{M}\sum_{i}|\mathbf{r}_{i}|$, where $M$ is the number of frames in the video, and $\mathbf{r}_{i}$ is the perturbation intensity vector on the $i$-th frame. To be intuitive, MAP values are rescaled to 0-255. In the experiments, we report the average MAP across the test videos. A lower MAP value means better imperceptibility.
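For clarity, a small sketch of this metric is given below, reading $|\mathbf{r}_{i}|$ as the mean absolute perturbation intensity of the $i$-th frame after rescaling to 0-255; this interpretation is an assumption consistent with the magnitude of the MAP values reported later.

import torch

def mean_absolute_perturbation(perturbation):
    """MAP for one adversarial video: perturbation has shape (M, H, W, 3), values scaled to 0-255."""
    per_frame = perturbation.abs().flatten(start_dim=1).mean(dim=1)   # |r_i| for each frame
    return per_frame.mean().item()                                    # average over the M frames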

Query Number (QN): denotes the number of queries used to successfully fool the threat model for a given adversarial video. It reflects the efficiency of different video attack methods. In the experiments, we set an upper bound on the query number; if the queries reach the upper bound but the threat model is still not fooled, we consider this adversarial video not successfully generated. The average query number across the test videos is reported. A lower QN value means higher efficiency.

Time (T): denotes the total time cost until a successful attack is finished, measured in seconds. In the experiments, we report the average seconds across the test videos. A lower time value means higher efficiency.

Note that previous works [27, 17, 20] have also used these metrics, but this paper differs slightly from them. In [27, 17, 20], the MAP and QN values are computed only for the adversarial videos that successfully perform the attacks; the MAP and QN values of failed videos are not considered. In contrast, this paper computes MAP and QN values for all the test videos. We think this is more reasonable because the failed videos also generate perturbations and cost queries to the threat models.

4.3 State-of-the-art attack competitors

Here, we compare our method with six state-of-the-art black-box video attack methods in terms of effectiveness and speed, namely the VBAD attack [19], Heuristic attack [20], Sparse attack [17], GEO-TRAP attack [31], RLSB attack [21], and Motion-sampler attack [30]. Detailed introductions of these competitors can be found in the related works section. We use their officially released codes to conduct comparisons (for the Sparse attack, we directly use the well-trained agent to predict key frames and then perform attacks; there is no released code for the RLSB attack, so we implement it according to the paper). For fair comparisons, all the settings are the same.

4.4 Implementation details

In query-based black-box attacks, the query number is a key metric to evaluate the attack performance. Thus, given a video, we set a maximum query number for all the compared methods. If the used query number exceeds the maximum, the adversarial attack is regarded as a failure for this video. We set the maximum query number to $1.5\times 10^{4}$ for the un-targeted attack and $3\times 10^{4}$ for the targeted attack. In NES, we set the search variance $\Delta$ to $10^{-3}$ for the un-targeted attack and $10^{-6}$ for the targeted attack according to our experience.

4.5 Parameter tuning

There are some hyperparameters in our method. In this section, we determine their values via parameter tuning on the validation set. Specifically, we randomly select 20 videos from HMDB-51 to construct the validation set, and then perform parameter tuning on the C3D model.

Figure 5: The parameter tuning results of AstFocus attacks with different patch sizes. (A) The effects for fooling rate. (B) The effects for query number. (C) The effects for perturbation magnitude.

4.5.1 Patch size for spatial agent

The first hyperparameter is the patch size $h$ and $w$ used when designing the spatial agent's actions. A reasonable patch size leads to fewer queries and smaller perturbations. The parameter tuning results for the patch size are given in Figure 5, where we explore its effects on the fooling rate, query number, and perturbation magnitude, respectively. From the figure, we see that the patch size mainly affects the query number but causes only slight changes in the fooling rate and perturbation magnitude. Moreover, Figure 5 (B) shows the query number is relatively sensitive to the patch size. This is reasonable because the pre-defined patch size determines the proportion of the selected key regions out of the whole image, and thus affects the query number. (From the last column in Table III, we see that the AstFocus attack has small variance when performed multiple times: for attacking the C3D model on HMDB-51, the variance is zero for FR, and only about 1% around the mean for MAP and 4% around the mean for QN. Therefore, the unsmooth curve is not caused by significant variance.) Overall, when the patch size is set to 65, the query number reaches its smallest value. Therefore, we set $h=w=65$.

4.5.2 Upper bound of key frames

The second hyperparameter is the upper bound $L$ of the selected key frames in Eq.(14). A reasonable $L$ helps our method select the minimal set of key frames needed for a successful video attack, and thus the query number can be reduced. The parameter tuning results for the upper bound $L$ are given in Figure 6, where we also explore its effects on the fooling rate, query number, and perturbation magnitude, respectively. We can see that with the increase of $L$, the fooling rate gradually stabilizes at 100% and the query number slowly decreases, but the perturbation magnitude increases considerably. To balance the three evaluation metrics, we set $L=10$ in the following experiments.

4.5.3 Sample number in NES

The third hyperparameter is the number of sampled points $n$ in Eq.(2). The sample number $n$ per iteration has a great influence on the accuracy of the estimated gradient, especially when the attacking space changes. To explore the impact of the sample number $n$ on the attack effect, we conduct a series of experiments. The parameter tuning results are given in Figure 7. We can see that with the increase of $n$, the fooling rate gradually stabilizes at 100%, but the query number and perturbation magnitude achieve their optimal values when $n$ is around 60. Therefore, we set $n=60$ in the following experiments.

Figure 6: The parameter tuning results of AstFocus attacks with different upper bounds of key frames. (A) The effects for fooling rate. (B) The effects for query number. (C) The effects for perturbation magnitude.
Figure 7: The parameter tuning results of AstFocus attacks with different sample numbers nn. (A) The effects for fooling rate. (B) The effects for query number. (C) The effects for perturbation magnitude.

4.5.4 Weights for various rewards

There are three weights to tune in the reward functions: $\lambda_{1}$ in Eq.(11), and $\lambda_{2}$ and $\lambda_{3}$ in Eq.(16), which measure the importance of their corresponding rewards. The parameter tuning results are given in Figure 8. According to the figure, we set $\lambda_{1}=0.2$, $\lambda_{2}=0.4$, and $\lambda_{3}=0.6$, respectively. This indicates that there is more redundancy to reduce in the temporal domain than in the spatial domain, and thus larger reward weights are needed in Eq.(16) to guide the agent in learning key frames.

4.6 Ablation study

To explore the effectiveness of different components of the proposed algorithm, a series of experiments is conducted here. Specifically, we investigate the effects of various agents and various rewards, respectively. Similarly, we randomly select 20 videos from HMDB-51 to construct the validation set, and then perform the ablation study on the C3D video recognition model. Because the gradient estimator module introduces randomness, we perform each ablation study five times and report the mean ± variance for the different metrics.

Figure 8: The parameter tuning results of AstFocus attacks with different reward weights. (A) The effects for fooling rate. (B) The effects for query numbers. (C) The effects for perturbation magnitude.
Table III: Effects of various agents to AstFocus attacks in an un-targeted setting.
Metrics   Baseline     Spatial      Temporal     Spatial&Temporal
FR(%)     73.3±2.9     88.3±2.9     93.3±2.9     100±0.0
QN        3662±244     2670±155     2757±125     2227±92
MAP       6.37±0.08    4.21±0.05    4.53±0.07    3.35±0.03

4.6.1 Effects of various agents

In our method, the baseline is the PGD+NES algorithm. We then integrate two agents into PGD+NES to reduce the video dimension from the temporal and spatial domains, respectively. Here we perform an ablation study on whether these two agents help the video attack. The results are given in Table III, where "Baseline" denotes PGD+NES. In this setting, because there is no dimension reduction module, the perturbations are added on the whole video, which can be called a "dense attack". The term "Spatial" denotes integrating the spatial agent into the baseline; in this setting, we reduce the spatial redundancy by selecting the key patches in each frame. The term "Temporal" denotes integrating the temporal agent into the baseline, which reduces the temporal redundancy by selecting the key frames. The term "Spatial&Temporal" denotes the full version of the AstFocus attack, i.e., simultaneously reducing the temporal and spatial redundancy via two agents.

We show the effects on fooling rate (FR), query number (QN), and perturbation magnitude (MAP). From the table, we can see that the dimension reduction is indeed useful to the attack performance, i.e., "Spatial" and "Temporal" achieve higher FR and smaller QN and MAP than the "Baseline". By simultaneously reducing the temporal and spatial redundancy, "Spatial&Temporal" achieves the highest FR (100%) and the smallest QN and MAP. The average FR increases by 26.7% (73.3%→100%), and the average QN and MAP decrease by about 36% (3662→2227) and 47% (6.37→3.35) versus the baseline, respectively. In addition, "Spatial&Temporal" has smaller variance than the baseline. For example, the variance is zero for the FR metric, and only about 1% around the mean for the MAP metric and 4% around the mean for the QN metric. For this reason, we neglect the variance values in the following comparison experiments. This verifies the important role of dimension reduction when performing video attacks.

Figure 9: Comparison between RL-based agent and random agents.

We also compare our RL-based agents with random agents, where the key frames and key patches are randomly selected in each PGD iteration while other settings are kept the same. Figure 9 shows the comparison results. We can see that for all the evaluation metrics, our RL-based agents outperform the random agents (100%±0% vs 86.7%±2.89%, 2227±92 vs 2632±114, 3.35±0.03 vs 3.62±0.04), which verifies that the temporal and spatial agents in our method are jointly well-trained under the guidance of the carefully designed rewards. With the feedback of the threat models, the temporal and spatial agents become intelligent.

4.6.2 Effects of various rewards

To better guide the agent learning, we carefully design the rewards: one common reward, Eq.(10), coming from the black-box threat model, and two kinds of specific rewards, Eq.(8) for the spatial agent as well as Eq.(14) and Eq.(15) for the temporal agent. The common reward is shared by the spatial agent and the temporal agent, while the specific rewards belong to their own agents. In this section, we explore the effects of these rewards on the fooling rate, query number, and perturbations.

Table IV: Effects of various rewards to AstFocus attacks in an un-targeted setting.
Metrics   Common       +Edgebox     +Sparse      +Representative
FR(%)     76.7±2.9     91.7±2.9     96.7±2.9     100±0.0
QN        3780±183     2690±122     2490±87      2227±92
MAP       3.62±0.07    3.58±0.04    3.39±0.04    3.35±0.03

Table IV lists the ablation study on the effects of various rewards. The term "Common" denotes guiding both the temporal and spatial agents using only the common reward in Eq.(10). The terms "+Edgebox", "+Sparse", and "+Representative" denote adding the corresponding rewards (Eq.(8), Eq.(14), and Eq.(15), respectively) on top of the former basis to guide the agent learning. From the table, we see that with the addition of more and more rewards, the average FR gradually increases, and the average QN and average MAP gradually decrease. Compared with the sole common reward, the full version (the rightmost column) with all the rewards improves the average FR by 23.3% (76.7%→100%), and reduces the average QN by 43% (3780→2227) and the average MAP by 8% (3.62→3.35). The variance also becomes smaller and smaller. The contrast verifies the rationality of the designed rewards.

Table V: The comparative results versus four different threat models on UCF-101 dataset. The best results are highlighted in red. The symbol “-” means the used NQ exceeds the maximum NQ. ↑ denotes the larger, the better, and ↓ denotes the smaller, the better.
Datasets Threat Models Attack Methods Un-targeted attacks Targeted attacks
MAP↓ NQ↓ FR↑ T(s)↓ MAP↓ NQ↓ FR↑ T(s)↓
UCF-101 TSM [13] VBAD attack [19] 6.023 2803 84% 31.2 9.208 15304 63% 201.7
Heuristic attack [20] 5.956 10657 40% 212.9
Sparse attack [17] 3.417 8529 58% 147.2
Motion-sampler attack [30] 7.237 5187 83% 124.6 7.415 13577 79% 288.6
GEO-TRAP attack [31] 5.865 3782 88% 87.2 6.265 11494 84% 247.6
RLSB attack [21] 4.823 4898 87% 101.3 7.274 20532 40% 365.4
AstFocus attack (ours) 3.355 1138 96% 24.6 4.546 8064 100% 274.5
TSN [12] VBAD attack [19] 6.168 2450 84% 13.8 9.391 18960 47% 394.6
Heuristic attack [20] 5.265 9135 51% 141.7
Sparse attack [17] 3.131 6916 64% 181.4
Motion-sampler attack [30] 6.895 4744 78% 95.6 6.903 19626 59% 316.8
GEO-TRAP attack [31] 5.472 3782 75% 87.3 5.952 18585 55% 301.2
RLSB attack [21] 5.238 3504 93% 48.2 8.448 20668 44% 289.5
AstFocus attack (ours) 3.265 2015 99% 37.4 4.495 8483 76% 272.9
C3D [11] VBAD attack [19] 6.800 4890 75% 43.4 10.760 20234 60% 139.2
Heuristic attack [20] 6.295 14160 30% 143.2
Sparse attack [17] 3.009 9507 42% 86.0
Motion-sampler attack [30] 6.153 8132 62% 97.9 7.242 20690 47% 252.9
GEO-TRAP attack [31] 5.877 7045 75% 74.4 6.332 17340 77% 205.4
RLSB attack [21] 5.326 6568 68% 72.0 7.225 22018 35% 207.1
AstFocus attack (ours) 4.015 4224 90% 66.8 4.225 13470 88% 236.4
SlowFast [43] VBAD attack [19] 6.302 4089 77% 42.5 9.118 19423 53% 499.2
Heuristic attack [20] 5.869 12776 34% 260.5
Sparse attack [17] 3.164 8642 58% 168.8
Motion-sampler attack [30] 7.086 5166 77% 136.3 7.275 17928 56% 451.8
GEO-TRAP attack [31] 5.712 4334 84% 120.5 6.273 16506 62% 532.1
RLSB attack [21] 5.586 4563 85% 94.5 7.655 22552 43% 448.4
AstFocus attack (ours) 4.286 1435 93% 35.7 4.436 13660 85% 385.3
Table VI: The comparative results versus four different threat models on the HMDB-51 dataset. The best results are highlighted in red. The symbol “-” means the used NQ exceeds the maximum NQ. ↑ denotes the larger, the better, and ↓ denotes the smaller, the better.
Datasets Threat Models Attack Methods Un-targeted attacks (MAP↓ NQ↓ FR↑ T(s)↓) Targeted attacks (MAP↓ NQ↓ FR↑ T(s)↓)
HMDB-51 TSM [13] VBAD attack [19] 6.361 1818 92% 22.4 9.057 17550 60% 495.6
Heuristic attack [20] 5.043 10385 58% 211.4
Sparse attack [17] 3.334 6244 62% 102.5
Motion-sampler attack [30] 7.229 3911 90% 95.8 8.012 19508 58% 686.7
GEO-TRAP attack [31] 5.919 3164 92% 84.8 6.222 10836 84% 337.7
RLSB attack [21] 5.323 5950 82% 112.9 7.754 20171 28% 575.8
AstFocus attack (ours) 3.411 1529 100% 34.7 4.326 7319 92% 419.1
TSN [12] VBAD attack [19] 5.873 2373 90% 26.4 9.244 21795 46% 659.1
Heuristic attack [20] 5.395 10146 58% 172.5
Sparse attack [17] 3.271 6765 74% 105.9
Motion-sampler attack [30] 7.275 3667 88% 74.1 8.087 24332 28% 964.6
GEO-TRAP attack [31] 5.192 3392 88% 61.6 6.344 21492 36% 744.6
RLSB attack [21] 5.312 4217 92% 68.6 6.494 22718 22% 719.1
AstFocus attack (ours) 3.52 2198 96% 51.1 4.090 9953 74% 522.4
C3D [11] VBAD attack [19] 6.743 4107 78% 36.5 10.528 22302 64% 361.2
Heuristic attack [20] 4.838 10534 42% 117.4
Sparse attack [17] 2.983 8545 46% 73.8
Motion-sampler attack [30] 7.035 6491 68% 89.5 7.973 22199 44% 558.4
GEO-TRAP attack [31] 5.666 5082 84% 48.2 6.324 16374 74% 518.2
RLSB attack [21] 4.688 7279 62% 77.1 7.212 18190 60% 530.8
AstFocus attack (ours) 3.835 3628 92% 45.6 4.025 9997 86% 473.6
SlowFast [43] VBAD attack [19] 6.528 5442 72% 46.7 10.615 22955 36% 693.3
Heuristic attack [20] 5.875 9094 54% 162.9
Sparse attack [17] 3.228 8977 56% 138.0
Motion-sampler attack [30] 7.163 6553 74% 156.4 7.956 18513 52% 652.3
GEO-TRAP attack [31] 6.242 5741 78% 122.8 6.179 17636 46% 666.1
RLSB attack [21] 5.68 4495 84% 81.9 7.416 20378 34% 650.0
AstFocus attack (ours) 4.078 2295 96% 47.1 4.682 13970 78% 567.9

4.6.3 Convergence of AstFocus attacks

Because our agents are updated by the rewards in each iteration, it is necessary to investigate whether the agents converge as the iterations increase. For that, we plot the change of Eq.(9)'s value with increasing PGD iterations in Figure 10. Eq.(9) directly reflects the success or failure of an attack: if the target class's confidence score is above the ground-truth class's confidence score, the value of Eq.(9) will be above 1, representing a successful attack, and vice versa. From the figure, we can see that Eq.(9)'s values for all the threat models gradually increase until they become stable. When the iteration reaches 400, all the models achieve convergence. This verifies the good convergence of the AstFocus attack. In practice, the attack usually stops as soon as Eq.(9)'s value exceeds 1, i.e., step 9 in Algorithm 1, so only a few iterations are needed in application. Figure 11 gives a qualitative example of the agents in AstFocus attacks, where the key frames and key patches in different iterations are illustrated by the bounding boxes. We can see that the spatial agent gradually focuses on the foreground objects. This is reasonable because these areas are key cues for the video recognition task. In addition, the temporal agent tends to select the frames with large changes in the actions. These frames are highly representative of the whole video in appearance, which shows that such key frames are sensitive to attacks.
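A minimal sketch of the early-stopping logic described above is given below. The score-ratio form used in place of Eq.(9) and all function names are assumptions; the exact procedure is that of Algorithm 1 in the paper.

```python
import numpy as np

def attack_with_early_stop(query_model, pgd_update, adv_video,
                           target_class, true_class, max_iters=400):
    """Iterate a PGD-style update and stop once an Eq.(9)-style ratio exceeds 1.

    query_model(video) is assumed to return a vector of class confidences;
    pgd_update(video) stands in for one focused PGD step on key frames/patches.
    The ratio target_score / true_score is an assumed form of Eq.(9).
    """
    for it in range(max_iters):
        scores = query_model(adv_video)
        if scores[target_class] / max(scores[true_class], 1e-12) > 1.0:
            return adv_video, it, True      # target class dominates: success, stop querying
        adv_video = pgd_update(adv_video)
    return adv_video, max_iters, False

# Toy stand-ins so the sketch runs end-to-end (not a real threat model).
toy_model = lambda v: np.array([0.5, 0.2, 0.2, 0.2, float(v.mean())])
toy_update = lambda v: np.clip(v + 0.05, 0.0, 1.0)
video = np.zeros((16, 224, 224, 3), dtype=np.float32)
print(attack_with_early_stop(toy_model, toy_update, video, target_class=4, true_class=0)[1:])
```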

Figure 10: The convergence of the proposed AstFocus attacks.
Figure 11: A qualitative example of selecting key frames and key regions in AstFocus attacks. From top to bottom: the selected key frames and key patches at the 5th, 10th, and 20th PGD iterations.
Table VII: The comparative results versus four different threat models on the Kinetics-400 dataset. The best results are highlighted in red. The symbol “-” means the used NQ exceeds the maximum NQ. ↑ denotes the larger, the better, and ↓ denotes the smaller, the better.
Datasets Threat Models Attack Methods Un-targeted attacks (MAP↓ NQ↓ FR↑ T(s)↓) Targeted attacks (MAP↓ NQ↓ FR↑ T(s)↓)
Kinetics-400 TSM [13] VBAD attack [19] 6.480 3626 78% 16.8 10.338 23670 34% 593.6
Heuristic attack [20] 5.744 11918 56% 202.1
Sparse attack [17] 2.725 9105 60% 147.6
Motion-sampler attack [30] 7.234 4494 88% 105.9 8.033 21032 46% 803.1
GEO-TRAP attack [31] 6.007 3962 92% 87.1 6.217 15585 72% 536.0
RLSB attack [21] 5.856 4422 84% 91.4 7.200 21724 26% 653.2
AstFocus attack (ours) 3.658 2416 96% 44.1 4.482 9758 88% 556.5
TSN [12] VBAD attack [19] 5.764 1668 92% 22.6 9.414 17560 58% 427.9
Heuristic attack [20] 4.806 10080 52% 165.6
Sparse attack [17] 2.579 8212 54% 117.8
Motion-sampler attack [30] 6.982 3422 90% 80.3 7.979 22119 36% 735.2
GEO-TRAP attack [31] 5.636 2684 90% 53.1 5.789 16402 50% 459.7
RLSB attack [21] 5.395 3774 94% 63.7 7.140 23574 22% 622.4
AstFocus attack (ours) 3.349 1021 100% 26.7 4.684 9940 90% 558.4
C3D [11] VBAD attack [19] 5.640 3444 90% 13.2 10.230 22070 52% 135.2
Heuristic attack [20] 5.805 11384 48% 44.7
Sparse attack [17] 2.769 5045 78% 20.7
Motion-sampler attack [30] 6.895 2485 96% 14.1 7.808 15679 70% 163.2
GEO-TRAP attack [31] 6.135 3436 96% 15.6 6.334 12260 90% 120.5
RLSB attack [21] 4.925 5915 76% 25.2 8.053 20975 36% 179.6
AstFocus attack (ours) 3.858 1055 100% 9.1 4.728 10840 92% 152.9
SlowFast [43] VBAD attack [19] 6.667 2732 86% 33.6 10.532 19970 44% 599.2
Heuristic attack [20] 4.723 9154 54% 159.5
Sparse attack [17] 3.043 5901 70% 97.8
Motion-sampler attack [30] 7.145 2282 92% 55.4 7.953 20264 40% 878.3
GEO-TRAP attack [31] 5.832 1646 94% 37.5 6.197 9594 86% 366.8
RLSB attack [21] 4.636 4802 88% 88.1 7.405 21137 34% 705.4
AstFocus attack (ours) 3.356 851 100% 24.4 4.015 7572 98% 386.2

4.7 Comparisons with SOTA methods

Here, we compare the proposed AstFocus attack with six state-of-the-art black-box video attack methods on three public datasets and four widely used video recognition models. The comparative results in the un-targeted and targeted settings are recorded in Table V, Table VI, and Table VII (for a fair comparison, the target label for all the methods is the same when performing targeted attacks). From the tables, we see that: (1) For attack effect (FR and MAP), our method consistently outperforms the other six SOTA methods on the FR metric versus all the threat models on all the datasets (by at least 2% and often by more than 10%), showing a big advantage in attacking ability. For the MAP metric, AstFocus attack only slightly loses to Sparse attack in the un-targeted setting but clearly outperforms the other five video attacks. This is because Sparse attack adds adversarial perturbations only on fixed key frames in each PGD iteration, and the clip operation Proj(·) in Eq.(1) projects the perturbations into a small range, so the upper bound of the adversarial perturbations generated by Sparse attack is small. However, this design also limits the attacking efficiency and effectiveness; for example, Sparse attack achieves only 42%-78% FR while requiring roughly 5000-9500 queries for un-targeted attacks, far inferior to AstFocus attack. Overall, AstFocus attack is better than Sparse attack. A small MAP under a high FR means an accurate evaluation of the models' adversarial robustness; from this viewpoint, AstFocus attack is more suitable for evaluating different video models. (2) For attack efficiency (NQ and T), AstFocus attack also significantly beats the other six SOTA methods on NQ versus all the threat models on all the datasets, reducing the queries by 7%-65% compared with the second-best video attacks. For the time metric, AstFocus attack only slightly loses to VBAD attack but still beats the other five video attacks. This is reasonable because AstFocus attack integrates two additional agents to reduce dimensions during attacks, while VBAD does not involve this step; in return, AstFocus attack greatly outperforms VBAD on the other three metrics. Overall, AstFocus attack achieves high efficiency. (3) For simultaneous modeling, AstFocus attack remarkably outperforms RLSB attack in all the settings, showing that simultaneously modeling the key frames and key regions is indeed more effective than modeling them separately. This also supports the core idea of this paper. (4) From the viewpoint of robustness evaluation, all seven black-box video attacks show that C3D has better adversarial robustness than the other models: C3D yields lower FR values but higher NQ values, which shows it is harder to attack. This may motivate an in-depth study of the C3D structure to design more robust video recognition models.
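For reference, the projection Proj(·) mentioned above is the standard step that keeps an accumulated perturbation inside a small ε-ball; a minimal L∞ version is sketched below. The radius and names are illustrative choices, not the exact setting of Eq.(1).

```python
import numpy as np

def project_linf(perturbation, epsilon=8.0 / 255.0):
    """Clip an accumulated perturbation into the L-infinity ball of radius epsilon.

    This is the generic Proj(.) step of PGD-style attacks; the radius here is an
    illustrative choice rather than the paper's exact bound.
    """
    return np.clip(perturbation, -epsilon, epsilon)

delta = np.random.default_rng(0).normal(scale=0.1, size=(16, 224, 224, 3))
print(float(np.abs(project_linf(delta)).max()) <= 8.0 / 255.0)  # True
```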

Figure 12: Comparisons of different gradient estimators within AstFocus.
Figure 13: Two qualitative examples output by AstFocus attacks. For each example, the rows from top to bottom are the clean video, the adversarial perturbations, and our adversarial video, respectively. Two adversarial videos generated by RLSB attack and GEO-TRAP attack are listed below the dotted line as a reference. We report two metrics, shown as (blurriness/SSIM), for each video frame: for the blurriness degree [45] (left value), the smaller the better; for SSIM [46] (right value), the larger the better.
Table VIII: Results of AstFocus attack against the defended C3D model on HMDB-51.
Metrics No defense PGD-AT [24] OUD [47] AdvIT [48]
FR(%) 92.0 60.0 (↓32) 72.0 (↓20) 70.0 (↓22)
NQ 3628 7005 (↑3377) 6283 (↑2655) 2802 (↓826)
MAP 3.835 4.814 (↑0.979) 5.090 (↑1.255) 3.512 (↓0.323)

4.8 Integrated with other gradient estimators

In our AstFocus attack, the current gradient estimator is NES. In fact, NES can be replaced with other state-of-the-art gradient estimators. To verify this point, we conduct experiments with two SOTA gradient estimators: Prior convictions [49] and ZOO [50]. Figure 12 gives the results. We can see that when the gradient estimator is changed, the fooling rate, query number, and perturbation magnitude show only slight variations. Relatively speaking, NES achieves the best performance across the three metrics. This shows that AstFocus attack is a flexible framework in which the modules other than the MARL module can be replaced; the PGD optimizer can also be replaced with its improved variants.
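For context, a minimal antithetic NES-style gradient estimator is sketched below; it only illustrates the kind of estimator that is plugged into the reduced search space, and the sample count, smoothing factor, and function names are assumptions rather than the paper's settings.

```python
import numpy as np

def nes_gradient(loss_fn, x, n_samples=20, sigma=1e-3, rng=None):
    """Estimate the gradient of loss_fn at x with antithetic NES sampling.

    loss_fn(x) is assumed to query the black-box model and return a scalar loss,
    so each antithetic pair costs two queries. Hyper-parameters are illustrative.
    """
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(size=x.shape)
        grad += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) * u
    return grad / (2.0 * sigma * n_samples)

# Sanity check on a quadratic loss whose true gradient is 2*x.
x = np.array([0.5, -1.0, 2.0])
print(nes_gradient(lambda v: float(np.sum(v ** 2)), x, n_samples=200))  # ~[1, -2, 4]
```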

4.9 Qualitative results of AstFocus attacks

We show two adversarial videos and the corresponding perturbations generated by AstFocus attacks in Figure 13. The adversarial videos are visually consistent with the original videos, showing the imperceptibility of the adversarial perturbations. To better display the perturbations, we enlarge their values. The final adversarial perturbations are sparse both across frames and within frames; they exhibit a superposition of the many noise patches generated in each PGD iteration, and they mainly cover the foreground regions in the key frames. We also give two adversarial videos generated by other recently published attack methods (RLSB attack and GEO-TRAP attack) as a reference, from which we can see that our method produces less perceptible perturbations than the other methods.

To better show this advantage, we compute two metrics to quantitatively measure the image quality: the blurriness degree [45], for which the smaller the better, and SSIM [46], for which the larger the better. We list these two values below each video frame as (blurriness/SSIM), from which we see that our adversarial videos have better image quality than those of RLSB and GEO-TRAP attacks.
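As an example of how the (blurriness/SSIM) values can be obtained, the snippet below computes per-frame SSIM with scikit-image on grayscale frames; the blurriness metric of [45] is not reproduced here, and the video shapes and value ranges are illustrative assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity

def per_frame_ssim(clean_video, adv_video):
    """Per-frame SSIM between clean and adversarial videos.

    Both videos are assumed to be float arrays of shape (T, H, W) in [0, 1];
    the blurriness degree of [45] would be computed separately for each frame.
    """
    return [structural_similarity(c, a, data_range=1.0)
            for c, a in zip(clean_video, adv_video)]

rng = np.random.default_rng(0)
clean = rng.random((4, 64, 64))
adv = np.clip(clean + 0.01 * rng.normal(size=clean.shape), 0.0, 1.0)
print(per_frame_ssim(clean, adv))  # values close to 1 indicate high similarity
```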

4.10 AstFocus attack against defense methods

We evaluate the performance of AstFocus attack against defense methods. Three kinds of representative video defense methods are chosen: the adversarial training method (PGD-AT [24]), the network-architecture modification method (OUD [47]), and the pre-processing method (AdvIT [48]). Note that AdvIT is originally proposed to detect adversarial examples; to adopt it as a defense, we attach it before the threat model, and if the input is detected as an adversarial example, it is not fed into the threat model. For this reason, the NQ and MAP may decrease rather than increase. The results for the C3D model on the HMDB-51 dataset are reported in Table VIII, where the changes compared with the un-defended C3D are listed in parentheses. We can see that both the attacking performance and the efficiency decrease. Specifically, the maximum drop of FR after defense is 32%, NQ increases by 3377 at most, and MAP increases by 33% at most. This is reasonable because a defended model is harder to attack, but the FR, NQ, and MAP are still acceptable. This shows that AstFocus attack is effective in evaluating adversarial robustness even for defended action recognition models.
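A minimal sketch of the detection-gated setup used for AdvIT above is given below; the detector and threat-model interfaces are assumptions made only to illustrate why rejected queries can lower NQ and MAP.

```python
def defended_query(detector, threat_model, video):
    """Gate the threat model with an adversarial-example detector.

    detector(video) is assumed to return True for suspected adversarial inputs;
    rejected inputs are never forwarded to the threat model, so the attacker
    receives no useful feedback for those queries.
    """
    if detector(video):
        return None                    # query rejected by the detector
    return threat_model(video)

# Toy usage with stand-in callables.
print(defended_query(lambda v: sum(v) > 3.0, lambda v: [0.1, 0.9], [1.0, 1.0, 1.0]))
```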

5 Conclusion

In this paper, we designed the novel adversarial spatial-temporal focus (AstFocus) attack on videos to simultaneously identify the key frames and key regions in the video. AstFocus attack is based on the cooperative multi-agent reinforcement learning framework: one agent is responsible for selecting key frames, and another agent is responsible for selecting key regions. These two agents are jointly trained by the common rewards received from the black-box threat models. By continuously querying, the reduced searching space composed of key frames and key regions becomes increasingly precise, and the whole query number becomes less than that on the original video. Extensive experiments on four mainstream video recognition models and three public action recognition datasets verified the efficiency and effectiveness of our method, which is superior in fooling rate, query number, time, and perturbation magnitude at the same time.

Acknowledgment

This work is supported by the National Key R&D Program of China (Grant No.2020AAA0104002) and the National Natural Science Foundation of China (No.62076018).

References

  • [1] W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, and R. Yang, “Salient object detection in the deep learning era: An in-depth survey,” TPAMI, 2021.
  • [2] Y. Zhu, X. Li, C. Liu, M. Zolfaghari, Y. Xiong, C. Wu, Z. Zhang, J. Tighe, R. Manmatha, and M. Li, “A comprehensive study of deep video action recognition,” arXiv preprint:2012.06567, 2020.
  • [3] S. Yang, W. Wang, C. Liu, and W. Deng, “Scene understanding in deep learning-based end-to-end controllers for autonomous vehicles,” IEEE TSMC: Systems, vol. 49, no. 1, pp. 53–63, 2018.
  • [4] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint:1412.6572, 2014.
  • [5] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, “Adversarial examples are not bugs, they are features,” in NeurIPS, 2019, pp. 125–136.
  • [6] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in CVPR, 2017.
  • [7] H. Wang, F. He, Z. Peng, T. Shao, Y.-L. Yang, K. Zhou, and D. Hogg, “Understanding the robustness of skeleton-based action recognition under adversarial attack,” in CVPR, 2021.
  • [8] S. Tang, R. Gong, Y. Wang, A. Liu, J. Wang, X. Chen, F. Yu, X. Liu, D. Song, A. Yuille et al., “Robustart: Benchmarking robustness on architecture design and training techniques,” arXiv preprint:2109.05211, 2021.
  • [9] S. Geisler, T. Schmidt, H. Şirin, D. Zügner, A. Bojchevski, and S. Günnemann, “Robustness of graph neural networks at scale,” NeurIPS, 2021.
  • [10] U.-A. M. Chapman-Rounds, U. Bhatt, E. Pazos, M.-A. Schulz, and K. Georgatzis, “Fimap: feature importance by minimal adversarial perturbation,” in AAAI, 2021.
  • [11] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?” in CVPR, 2018.
  • [12] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in ECCV, 2016, pp. 20–36.
  • [13] J. Lin, C. Gan, and S. Han, “Tsm: Temporal shift module for efficient video understanding,” in CVPR, 2019, pp. 7083–7093.
  • [14] Y. Dong, S. Cheng, T. Pang, H. Su, and J. Zhu, “Query-efficient black-box adversarial attacks guided by a transfer-based prior,” IEEE TPAMI, 2021.
  • [15] X. Wei, Y. Guo, and J. Yu, “Adversarial sticker: A stealthy attack method in the physical world,” IEEE TPAMI, 2022.
  • [16] X. Wei, J. Zhu, S. Yuan, and H. Su, “Sparse adversarial perturbations for videos,” in AAAI, vol. 33, 2019, pp. 8973–8980.
  • [17] X. Wei, H. Yan, and B. Li, “Sparse black-box video attack with reinforcement learning,” IJCV, pp. 1–15, 2022.
  • [18] J. Hwang, J.-H. Kim, J.-H. Choi, and J.-S. Lee, “Just one moment: Structural vulnerability of deep action recognition against one frame attack,” in ICCV, 2021, pp. 7668–7676.
  • [19] L. Jiang, X. Ma, S. Chen, J. Bailey, and Y.-G. Jiang, “Black-box adversarial attacks on video recognition models,” in ACMMM, 2019, pp. 864–872.
  • [20] Z. Wei, J. Chen, X. Wei, L. Jiang, T.-S. Chua, F. Zhou, and Y.-G. Jiang, “Heuristic black-box adversarial attacks on video recognition models,” in AAAI, 2020.
  • [21] Z. Wang, C. Sha, and S. Yang, “Reinforcement learning based sparse black-box adversarial attack on video recognition models,” arXiv preprint:2108.13872, 2021.
  • [22] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, “Black-box adversarial attacks with limited queries and information,” in ICML, 2018.
  • [23] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” NeurIPS, 2017.
  • [24] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint:1706.06083, 2017.
  • [25] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber, “Natural evolution strategies,” in IEEE Congress on Evolutionary Computation, 2008, pp. 3381–3387.
  • [26] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “Lstm: A search space odyssey,” TNNLS, vol. 28, no. 10, pp. 2222–2232, 2016.
  • [27] H. Yan and X. Wei, “Efficient sparse attacks on videos using reinforcement learning,” in ACMMM, 2021, pp. 2326–2334.
  • [28] X. Yuan, P. He, Q. Zhu, and X. Li, “Adversarial examples: Attacks and defenses for deep learning,” IEEE TNNLS, vol. 30, no. 9, pp. 2805–2824, 2019.
  • [29] S. Li, A. Neupane, S. Paul, C. Song, S. V. Krishnamurthy, A. K. R. Chowdhury, and A. Swami, “Adversarial perturbations against real-time video classification systems,” in NDSS, 2019.
  • [30] H. Zhang, L. Zhu, Y. Zhu, and Y. Yang, “Motion-excited sampler: Video adversarial attack with sparked prior,” in ECCV, 2020.
  • [31] S. Li, A. Aich, S. Zhu, S. Asif, C. Song, A. Roy-Chowdhury, and S. Krishnamurthy, “Adversarial attacks on black box video classifiers: Leveraging the power of geometric transformations,” NeurIPS, 2021.
  • [32] Z. Wei, J. Chen, Z. Wu, and Y. Jiang, “Cross-modal transferable adversarial attacks from images to videos,” in CVPR, 2022, pp. 15 064–15 073.
  • [33] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, “Modeling spatial-temporal clues in a hybrid deep learning framework for video classification,” in ACMMM, 2015, pp. 461–470.
  • [34] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token vit: Training vision transformers from scratch on imagenet,” in ICCV, 2021.
  • [35] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014, pp. 391–405.
  • [36] K. Zhou, Y. Qiao, and T. Xiang, “Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward,” in AAAI, 2018.
  • [37] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” NeurIPS, 1999.
  • [38] M. T. Spaan, “Partially observable markov decision processes,” in Reinforcement Learning, 2012, pp. 387–414.
  • [39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint:1707.06347, 2017.
  • [40] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint:1212.0402, 2012.
  • [41] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a large video database for human motion recognition,” in ICCV, 2011, pp. 2556–2563.
  • [42] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in CVPR, 2017.
  • [43] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in CVPR, 2019, pp. 6202–6211.
  • [44] M. Contributors, “Openmmlab’s next generation video understanding toolbox and benchmark,” https://github.com/open-mmlab/mmaction2, 2020.
  • [45] P. Marziliano, F. Dufaux, S. Winkler, and T. Ebrahimi, “A no-reference perceptual blur metric,” in Proceedings. International conference on image processing, vol. 3.   IEEE, 2002, pp. III–III.
  • [46] A. Hore and D. Ziou, “Image quality metrics: Psnr vs. ssim,” in 2010 20th international conference on pattern recognition.   IEEE, 2010, pp. 2366–2369.
  • [47] S.-Y. Lo, J. M. J. Valanarasu, and V. M. Patel, “Overcomplete representations against adversarial videos,” in ICIP, 2021.
  • [48] C. Xiao, R. Deng, B. Li, T. Lee, B. Edwards, J. Yi, D. Song, M. Liu, and I. Molloy, “Advit: Adversarial frames identifier based on temporal consistency in videos,” in CVPR, 2019, pp. 3968–3977.
  • [49] A. Ilyas, L. Engstrom, and A. Madry, “Prior convictions: Black-box adversarial attacks with bandits and priors,” arXiv preprint:1807.07978, 2018.
  • [50] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in AISec, 2017.
Xingxing Wei received his Ph.D. degree in computer science from Tianjin University and his B.S. degree in automation from Beihang University, China. He is now an Associate Professor at Beihang University (BUAA). His research interests include computer vision, adversarial machine learning, and their applications to multimedia content analysis. He has authored papers in refereed journals and conferences, including IEEE TPAMI, TMM, TCYB, TGRS, IJCV, PR, CVIU, CVPR, ICCV, ECCV, ACMMM, AAAI, and IJCAI.
Songping Wang is a Master's student at the School of Software, Beihang University (BUAA). His research interests include deep learning and adversarial robustness in machine learning.
Huanqian Yan is pursuing his Ph.D. degree at the School of Computer Science and Engineering, Beihang University, Beijing, China. He received his Master's degree in computer application and technology from Lanzhou University in July 2018 and his Bachelor's degree in computer science and technology from Changchun University of Science and Technology in July 2015. His current research interests include object detection, adversarial examples, and clustering analysis.