Efficient Robustness Assessment via
Adversarial Spatial-Temporal Focus on Videos
Abstract
Adversarial robustness assessment for video recognition models has raised concerns owing to their wide applications in safety-critical tasks. Compared with images, videos have much higher dimensions, which brings huge computational costs when generating adversarial videos. This is especially serious for query-based black-box attacks, where gradient estimation for the threat models is usually utilized and the high dimensionality leads to a large number of queries. To mitigate this issue, we propose to simultaneously eliminate the temporal and spatial redundancy within the video to achieve effective and efficient gradient estimation on a reduced searching space, so that the query number can be decreased. To implement this idea, we design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on the simultaneously focused key frames and key regions from the inter-frames and intra-frames of the video. The AstFocus attack is based on the cooperative Multi-Agent Reinforcement Learning (MARL) framework. One agent is responsible for selecting key frames, and another agent is responsible for selecting key regions. These two agents are jointly trained by the common rewards received from the black-box threat models to perform a cooperative prediction. By continuously querying, the reduced searching space composed of key frames and key regions becomes increasingly precise, and the whole query number becomes smaller than that on the original video. Extensive experiments on four mainstream video recognition models and three widely used action recognition datasets demonstrate that the proposed AstFocus attack outperforms the SOTA methods, being simultaneously superior in fooling rate, query number, time, and perturbation magnitude.
Index Terms:
Adversarial examples, Video recognition, Reinforcement learning, Black-box attacks, Spatial-temporal analysis
1 Introduction
Deep Neural Networks (DNNs) have made remarkable achievements in various tasks such as object detection [1], action recognition [2], scene understanding [3], and so on. Recent studies illustrate the DNNs' vulnerability to the so-called adversarial examples [4, 5, 6]. Afterwards, a series of methods were proposed to evaluate the adversarial robustness of DNNs. Among these works, the attack-based robustness evaluation methods [7, 8, 9] are more popular and practical because of their good implementability. They mainly seek the minimum adversarial perturbations that achieve successful attacks to measure the robustness [10]. On one hand, an accurate assessment of adversarial robustness helps deploy DNNs into safety-critical systems. On the other hand, it provides a quantitative metric for designing more robust DNNs. Therefore, adversarial robustness assessment is of both theoretical and practical value.
Video recognition [11, 12, 13] is a major branch of computer vision. Leveraging the temporal and spatial relationships within video data can effectively locate and classify the objects or behaviors in videos, and thus help perform video analysis. Owing to the DNNs' advantages, current video recognition models are usually designed based on DNNs, so the DNNs' vulnerability is inevitably inherited by video recognition models. Owing to their wide applications in safety-critical tasks like security surveillance, evaluating their adversarial robustness becomes necessary. Currently, more and more users employ the video recognition APIs released by commercial cloud platforms because of their easy accessibility. In such cases, the APIs' details are not public, so we can only assess their adversarial robustness according to the outputs obtained by querying the systems. These methods are therefore called query-based black-box attacks, which mainly rely on gradients estimated for the APIs [14, 15].
Compared with images, videos have much higher dimensions owing to the additional temporal information, which brings huge computational costs when generating adversarial videos. This is especially serious for query-based black-box attacks because the high-dimensional video data needs a large number of queries to obtain an accurate gradient estimation. Thus, seeking the minimum adversarial perturbations on videos is more challenging than on images; a reasonable attack algorithm should first reduce the video dimensions, so as to improve the attack's efficiency and reduce the perturbation magnitude. To meet this goal, temporally sparse video attacks [16, 17, 18] are proposed to eliminate the redundancy in the temporal domain, and spatial video attacks [19] try to eliminate the redundancy in the spatial domain. More importantly, the spatial and temporal redundancy should be jointly considered, i.e., modeling the key regions within key frames, and then evaluating the robustness on these areas. The current related methods [20, 21] both regard selecting key frames and selecting key regions as two separate steps and do not consider their interaction simultaneously, thus leading to sub-optimal attacking efficiency and performance.

However, simultaneously optimizing the key frames and key regions is difficult because they belong to different domains and are closely coupled, i.e., changing the key frames also affects the selection of key regions. This is even more challenging in query-based black-box attacks, where only the feedback from the threat model can be used to perform the optimization. Considering the above points, this paper mainly addresses the following problem: how to simultaneously learn the precise key frames and key regions to efficiently and accurately assess the adversarial robustness of video recognition in the query-based black-box setting?
To answer this question, in this paper, we design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on the simultaneously focused key frames and key regions from the inter-frames and intra-frames of the video. The key frames and key regions are dynamically adjusted through interaction with the threat model. Technically, this process is achieved based on cooperative Multi-Agent Reinforcement Learning (MARL) [23]. One agent is responsible for selecting key frames (temporal agent), and another agent is responsible for selecting key regions (spatial agent). These two agents share one backbone network and are jointly trained by the common rewards received from the black-box threat models to perform a cooperative prediction. By continuously querying, the focused space composed of key frames and key regions becomes increasingly precise, and the whole query number becomes smaller than that on the original video.
More specifically, the AstFocus attack is constructed based on the PGD+NES baseline, which extends PGD [24] to the black-box attack with the Natural Evolution Strategy (NES) [25] gradient estimator. We attach two agents before the gradient estimator module to reduce the video dimension. In each PGD iteration, the NES gradient estimator is first performed on the key frames and key regions predicted by the agents. Then the local adversarial perturbations are generated to attack the threat model. Finally, these two agents are updated according to the computed rewards to predict better key frames and key regions in the next iteration. This process is repeated until a successful attack is achieved. The two agents have similarities and also differences. For policy networks, we apply the same backbone network to extract feature maps from the input video frames for both of them, but design distinct LSTM-based [26] structures according to their own characteristics to predict the optimal actions. For actions, the temporal agent's actions are defined as the sets composed of different key frames, while the spatial agent's actions are defined as the sets composed of different patch regions located in each frame. For rewards, three rewards are carefully designed to train the agents. The first one is the common reward from the feedback of the black-box threat models, which is used to simultaneously guide both agents. The other two rewards are specially designed for the temporal and spatial agents, respectively, and mainly measure the actions from the view of appearance. The whole flowchart of the AstFocus attack is shown in Figure 1, and the code is released at https://github.com/DeepSota/AstFocus.
This paper is an extended work based on our conference version [27] and has the following major improvements. Firstly, we consider the spatial redundancy besides the temporal redundancy in the previous version, and further propose the novel AstFocus attack to simultaneously learn key frames and key regions and generate perturbations. This is a major change in the idea, which comprehensively makes use of the videos' spatial-temporal characteristics to perform attacks. Secondly, we design a cooperative multi-agent RL based method to implement the new idea, while the previous version uses single-agent RL; thus the rewards, actions, and policies are carefully re-designed. Thirdly, more experiments are given and discussed, involving parameter tuning, an ablation study, and comparisons with SOTA methods. We also rewrite the abstract, introduction, methodology, and experiment sections to better introduce our motivation and methods. We believe these modifications significantly improve the quality of our work.
In summary, this paper has the following contributions:
• We propose the AstFocus attack, a novel query-based black-box attack method to assess the adversarial robustness of video recognition models. The adversarial perturbations are only added on the key spatial-temporal focused spaces, which significantly reduces the number of attack queries and the perturbation magnitude.
• A cooperative multi-agent reinforcement learning module is adopted for identifying the key frames and key regions at the same time. For that, we carefully design the actions, policy networks, and rewards for both agents according to the specific task. The agents are updated in each iteration rather than after each round of successful attack, so they converge efficiently.
• Compared with the state-of-the-art video attack algorithms, the proposed AstFocus attack achieves fewer queries and smaller adversarial perturbations. Specifically, it reduces the query number by at least 10% and improves the fooling rate by at least 5% with the smallest perturbations, which verifies the efficiency and effectiveness of the AstFocus attack.
The rest of this paper is organized as follows: we briefly review the related works in Section 2. The proposed AstFocus attack algorithm is described in Section 3. Experimental results and analysis are presented in Section 4. Finally, we conclude the whole paper in Section 5.
2 Related Works
2.1 Adversarial Attacks on Videos
An adversarial example [4, 24, 28] is a maliciously crafted input designed to make the classifier produce a wrong output. To keep its existence imperceptible to humans, the generation of adversarial examples is often limited by some deliberate conditions, such as noise size and query numbers. Adversarial video attacks and adversarial image attacks are similar; the difference is that the attack space of videos is much larger than that of images. It is not easy to directly extend image attack algorithms to such high-dimensional video data. High dimensions usually bring a huge search space, leading to high costs to achieve successful attacks. Especially in the black-box setting, a huge search space will bring a large number of queries.
Some video attack techniques have been proposed to find adversarial videos. Wei et al. [16] generate sparse 3D adversarial perturbations to add on the whole video. To reduce the attacking space, an $\ell_{2,1}$-norm regularization based optimization is designed to make the adversarial perturbations more concentrated in some key frames of the input video. This method shows the sparse ability of adversarial video noises. Similarly, [18] proposes the "one frame attack", which only adds adversarial noise on one video frame; the perturbation can easily defeat deep learning-based action recognition systems, where the vulnerable frame is perturbed with a gradient-based adversarial attack method. In addition, [29] finds that the temporal structure is key to generating adversarial videos. They use a generative adversarial network to generate adversarial examples that cause a high misclassification rate for video recognition models.
Besides white-box video attacks, black-box video attacks have also been explored. One class of such methods is based on transferability across different models. For example, Wei et al. [32] perform black-box video attacks based on adversarial perturbations generated on image models.
Other black-box video attacks belong to query-based methods. They generate perturbations by querying the target video recognition system. Among them, Jiang et al. [19] extend the PGD algorithm to video attacks with gradient estimators computed using super-pixels. To reduce attacking costs, some efficient black-box video attack algorithms have been proposed. [30] argues that the initialized random noises in [19] are less effective; they utilize the intrinsic movement pattern and regional relative motion, and propose motion-aware noises to replace random noises. By using this prior in gradient estimation, fewer queries are needed to perform video attacks. Wei et al. [20] search for a subset of frames based on the importance of each video frame to the recognition model. Besides, they also limit the adversarial perturbations to some salient regions. Because the temporal and spatial reductions are separately formulated, the method usually needs hundreds of thousands of queries. To mitigate this defect, Wei et al. [17] propose a sparse video attack algorithm based on reinforcement learning. An agent is designed to identify key frames through interactions with the threat model. It can significantly reduce the adversarial perturbations, but updates the agent only after each round of successful attack. This poor update mechanism leads to many unnecessary queries and a weak fooling rate. The RLSB attack [21] explores selecting key frames and key regions to reduce the high computational cost. However, reinforcement learning is only applied to select key frames, which is similar to [17]. The process of selecting key regions is based on saliency maps; it is independent of the process of selecting key frames and is not integrated into the reinforcement learning framework. Thus, the selection of key frames and key regions is formulated separately. Recently, [31] proposes to parameterize the temporal structure of the search space using geometric transformations, and then reduce the temporal search space, so that gradients can be efficiently estimated.
In this paper, we also explore a reduced searching space, but different from previous works that focus only on key frames in the temporal domain, we jointly consider the identification of key regions in the spatial domain besides the temporal domain. For that, a multi-agent reinforcement learning framework is designed to identify a reduced space through rewards based on the inherent properties of the video and interactions with the threat model. The comparisons with query-based black-box video attack methods are summarized in Table I.
2.2 Spatial-Temporal Property for Videos
A video can be regarded as multiple continuous images; therefore video processing often needs to consider both spatial and temporal correlations. The simultaneous consideration of the temporal and spatial correlation of video is key to video-related tasks. Video action recognition is a longstanding research topic in multimedia and computer vision. Many mainstream algorithms are motivated by advances in image classification and improved by utilizing the temporal dimension of the video data. To facilitate classification performance, Wu et al. [33] propose a hybrid deep learning framework for video classification, which is able to harness not only the spatial and short-term motion features, but also the long-term temporal clues. They integrate the spatial and temporal features in a deep neural model with elaborately designed regularizations to explore feature correlations. The method produces competitive classification performance. More works based on the spatial-temporal property can be found in [11, 12, 13].
Unlike the above methods, we consider the spatial-temporal property of videos in the video attack task. The temporal and spatial redundancy within videos is reduced to improve the efficiency of video attacks, which extends the application scope of the spatial-temporal property of videos.
3 Methodology
In this section, we first give the baseline video attack algorithm: PGD [24] attack with NES [25] gradient estimator. Then the details of integrating cooperative Multi-Agent Reinforcement Learning (MARL) [23] into the baseline are introduced. Finally, the whole algorithm is summarized.
3.1 Preliminaries
We assume $F$ is a black-box video recognition model of which only the top-1 information, including the category label and confidence score, can be acquired. Given a video $X=\{x_1,\dots,x_N\}$ with ground-truth label $y$, where $x_n$ denotes the $n$-th frame and $N$ is the total frame number, the predicted category label is $\hat{y}=F(X)$, and the corresponding confidence score is $P(\hat{y}\,|\,X)$.
To attack the video recognition model, we extend Projected Gradient Descent (PGD) [24] to adapt to video data. The adversarial video under the un-targeted attack is defined as:
$X^{adv}_{t+1} = \Pi_{X,\epsilon}\Big(X^{adv}_{t} + \alpha \cdot \mathrm{sign}\big(\nabla_{X^{adv}_{t}} L(F(X^{adv}_{t}), y)\big)\Big),$  (1)
where $\Pi_{X,\epsilon}$ projects the updated adversarial example to a valid range, $\alpha$ is the attack step used to control the magnitude of the added adversarial noise per iteration, $\mathrm{sign}(\cdot)$ is the sign function, and $L$ is the cross-entropy loss function. Due to the limitation of the black-box setting, we cannot obtain the accurate gradient by directly computing $\nabla_{X^{adv}_{t}} L(F(X^{adv}_{t}), y)$. Instead, [22] proposes to utilize the Natural Evolution Strategy (NES) [25] to estimate it by querying the threat model. Specifically, NES can be described as:
$\nabla_{X} L\big(F(X), y\big) \approx \frac{1}{n\sigma}\sum_{i=1}^{n} L\big(F(X + \sigma\delta_i), y\big)\,\delta_i, \quad \delta_i \sim \mathcal{N}(0, I),$  (2)
which first samples $n$ values $\delta_i$, then sets the query points to $X+\sigma\delta_i$, and finally estimates the gradient by averaging the sampled directions weighted by the predicted losses and divided by the search variance $\sigma$.
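To make the estimator concrete, the following is a minimal NumPy sketch of the NES gradient estimation in Eq.(2). The function and argument names (e.g., `loss_fn`) are illustrative assumptions; `loss_fn` is assumed to query the black-box model and return the scalar loss.

```python
import numpy as np

def nes_gradient(x, loss_fn, n=60, sigma=1e-3):
    """Estimate the gradient of loss_fn at x with n Gaussian samples (Eq.(2)).

    Each call costs n queries to the black-box model; antithetic sampling
    (also querying x - sigma*delta) is a common variance-reduction variant
    at the cost of 2*n queries.
    """
    grad = np.zeros_like(x, dtype=np.float64)
    for _ in range(n):
        delta = np.random.randn(*x.shape)      # delta_i ~ N(0, I)
        grad += loss_fn(x + sigma * delta) * delta
    return grad / (n * sigma)
```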
For the targeted attack, Eq.(1) is modified as follows:
$X^{adv}_{t+1} = \Pi_{X,\epsilon}\Big(X^{adv}_{t} - \alpha \cdot \mathrm{sign}\big(\nabla_{X^{adv}_{t}} L(F(X^{adv}_{t}), y^{*})\big)\Big),$  (3)
where $y^{*}$ is a target category label pre-defined by the adversary. In Eq.(2), the ground-truth label $y$ should also be replaced with the target label $y^{*}$ to estimate the gradients with respect to the target label.
In practical applications, directly performing Eq.(2) is inefficient because the number of sample points $n$ is related to the dimension of $X$. Owing to the high dimension of video data $X$, we need to set a large value of $n$ to compute an accurate gradient in each iteration $t$, which will lead to a large number of queries to the threat model. To improve the attack efficiency, the video dimension should be reduced by selecting the key frames and key regions, so that a small value of $n$ becomes available. Technically, we hope to replace $X$ in Eq.(2) with $R(X)$, where $R(\cdot)$ denotes the reduction operation, and $R(X)$ is the reduced video.
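As an illustration of the reduction operation $R(\cdot)$, here is a small sketch (with assumed array shapes and helper names) showing how the focused sub-video is gathered and how a gradient estimated on it is scattered back into a full-size, mostly-zero gradient tensor:

```python
import numpy as np

def reduce_video(x, key_frames, key_patches, ph, pw):
    """Gather R(X): the selected patches within the selected key frames.

    x: (N, H, W, 3) video; key_frames: list of frame indices;
    key_patches: dict mapping a frame index to its (top, left) patch corner.
    """
    patches = []
    for f in key_frames:
        t, l = key_patches[f]
        patches.append(x[f, t:t + ph, l:l + pw])
    return np.stack(patches)

def scatter_gradient(grad_r, video_shape, key_frames, key_patches, ph, pw):
    """Place the gradient estimated on R(X) back into a full-size tensor;
    all entries outside the focused space stay zero."""
    grad = np.zeros(video_shape)
    for i, f in enumerate(key_frames):
        t, l = key_patches[f]
        grad[f, t:t + ph, l:l + pw] = grad_r[i]
    return grad
```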

3.2 The Proposed AstFocus Attack
To implement the above idea, we build the so-called AstFocus attack based on cooperative multi-agent reinforcement learning (MARL) to jointly solve for the key frames and key regions during the black-box attack process. In the AstFocus attack, one agent is responsible for selecting key frames (temporal agent), and another agent is responsible for selecting key regions (spatial agent). These two agents cooperate to achieve the same goal. The processes of selecting key frames and key regions in each iteration of PGD are formulated as Markov Decision Processes (MDPs). The details of these two agents as well as the optimization algorithm are given below.
3.2.1 Spatial Agent
The spatial agent essentially solves an object localization problem (detecting key regions). We detail it from three parts: action design, policy network design, and reward design.
Action Design: To construct the actions of the spatial agent, we uniformly divide each video frame into overlapped patches, inspired by the Vision Transformer [34]. In this way, we obtain a candidate patch set $B_n=\{b^{1}_{n},\dots,b^{M}_{n}\}$ for the $n$-th frame $x_n$, where $b^{m}_{n}$ denotes the $m$-th patch region within $x_n$, and $M$ is the total number of candidate patches in this frame. Each patch has size $w_p \times h_p$, whose values will be tuned in the experiments. The goal of the spatial agent is to select an optimal patch $b^{*}_{n}$ in each frame as the key region, and thus the final selected action is a sequence set $a^{S}=\{b^{*}_{1},\dots,b^{*}_{N}\}$. From this definition, we can see that there are in total $M^{N}$ action combinations for the given video $X$, which implies the search space is huge. An example of the actions in one frame is shown in Figure 2.
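For illustration, below is a minimal sketch of how the candidate patch set for one frame could be enumerated; the stride that controls the overlap is an assumption, since the paper only states that the patches overlap:

```python
def candidate_patches(frame_h, frame_w, ph=65, pw=65, stride=32):
    """Enumerate overlapping candidate patches as (top, left) corners.

    The spatial agent's per-frame action is an index into this list,
    so M = len(candidate_patches(...)).
    """
    tops = range(0, max(frame_h - ph, 0) + 1, stride)
    lefts = range(0, max(frame_w - pw, 0) + 1, stride)
    return [(t, l) for t in tops for l in lefts]
```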
Policy Network Design: The spatial policy network $\pi^{S}$ is used to predict the spatial action when the state is given. The flowchart of our policy network is shown in Figure 3. Overall, because we need to deal with sequential video data, an LSTM-based [26] structure is used to construct the policy network $\pi^{S}$. For the $n$-th frame $x_n$, a lightweight convolutional neural network (CNN) first extracts the frame-level feature maps $e_n$. In our experiments, we use MobileNet V2 as the lightweight CNN backbone for simplicity; users can also apply other lightweight CNNs. The features are then fed into the LSTM unit to predict the logits for each patch. Next, a Fully Connected Layer (FCL) with Softmax is attached to output each patch's probability. Finally, we utilize categorical sampling to obtain the optimal patch region $b^{*}_{n}$ according to the probability values. To guarantee a smooth change of the selected patch between adjacent frames, we concatenate the local patch feature $p_{n-1}$ of the previously selected patch with the current frame-level feature $e_n$ to jointly predict the current patch region, where $p_{n-1}$ is extracted via a simple multilayer perceptron (MLP) on the corresponding patch features of $b^{*}_{n-1}$.
Formally, the frame-level feature maps are extracted by:
$e_n = \mathrm{CNN}(x_n),$  (4)
and the optimal action for each frame is obtained by:
$h_n = \mathrm{LSTM}\big([e_n, p_{n-1}],\, h_{n-1}\big),$  (5)
$b^{*}_{n} \sim \mathrm{Categorical}\big(\mathrm{Softmax}(\mathrm{FCL}(h_n))\big),$  (6)
where $h_{n-1}$ denotes the hidden state output by the LSTM unit at the $(n{-}1)$-th frame. Thus, the state in our method is defined as the concatenated feature $[e_n, p_{n-1}]$. Eq.(6) is repeated $N$ times to obtain the optimal action $a^{S}$.
In our method, the policy network is updated in each iteration of the PGD attack; therefore, the optimal action $a^{S}$ is also updated in each iteration until the PGD attack stops.
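Below is a minimal PyTorch sketch of such a spatial policy network. The hidden sizes, the stand-in for the selected patch's feature, and the class/argument names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class SpatialPolicy(nn.Module):
    """Sketch of the spatial agent's policy network (Figure 3).

    Frame features come from a frozen MobileNet V2 backbone; an LSTM cell
    consumes the concatenation of the current frame feature and the previous
    patch feature, and an FC+Softmax head scores the M candidate patches.
    """
    def __init__(self, num_patches, feat_dim=1280, patch_dim=128, hidden=256):
        super().__init__()
        self.backbone = models.mobilenet_v2(weights="DEFAULT").features
        self.backbone.requires_grad_(False)           # pre-trained, kept fixed
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.patch_mlp = nn.Sequential(nn.Linear(feat_dim, patch_dim), nn.ReLU())
        self.lstm = nn.LSTMCell(feat_dim + patch_dim, hidden)
        self.head = nn.Linear(hidden, num_patches)

    def forward(self, frames, prev_patch_feat, state):
        """frames: (N, 3, H, W); prev_patch_feat: (patch_dim,); state: (h, c)."""
        feats = self.pool(self.backbone(frames)).flatten(1)     # e_n, Eq.(4)
        h, c = state
        actions, log_probs, p_prev = [], [], prev_patch_feat
        for e_n in feats:                                        # per-frame recurrence
            h, c = self.lstm(torch.cat([e_n, p_prev]).unsqueeze(0), (h, c))  # Eq.(5)
            dist = torch.distributions.Categorical(logits=self.head(h))      # Eq.(6)
            a = dist.sample()
            actions.append(a)
            log_probs.append(dist.log_prob(a))
            # stand-in for the MLP feature of the actually selected patch region
            p_prev = self.patch_mlp(e_n)
        return torch.stack(actions), torch.stack(log_probs)
```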

Reward Design: In each iteration, the spatial policy network receives feedback from the environment to update its parameters $\theta^{S}$. Therefore, we need to design reasonable rewards to guide the update of the policy network. Because the AstFocus attack is based on the Multi-Agent Reinforcement Learning (MARL) framework, we design two kinds of rewards: one is specific to the spatial agent, and the other is the common reward shared with the temporal agent.
For the specific reward, an intuitive idea to evaluate a patch's importance is the area covered by foreground objects, because video recognition models mainly perform predictions based on foreground objects like persons, cars, etc. Therefore, if the policy network selects a foreground patch, the specific reward should be enlarged, so that the policy network is encouraged to select the foreground object in the next iteration. Based on this idea, we need a metric to measure the objectness score of a given patch. We choose a classic objectness model: EdgeBoxes [35]. It calculates the edge response of each pixel and determines the boundary of the object by using the structured edge detector.
More concretely, the reward for the selected patch can be described as follows:
$r^{edge}_{n} = \frac{\sum_{i} w_{b^{*}_{n}}(s_i)\, m_i}{2\,(w_p + h_p)}.$  (7)
The edgebox reward for the whole video is defined as:
$r^{edge} = \sum_{n=1}^{N} r^{edge}_{n},$  (8)
where $w_p$ and $h_p$ are the patch's width and height, $w_{b^{*}_{n}}(s_i)$ measures the affinity of the $i$-th edge group $s_i$ with the selected patch, and $m_i$ is the sum of the edge magnitudes of the $i$-th edge group in the selected patch. In general, a larger patch often results in a larger edgebox value. More detailed information about the edgebox function can be found in [35].
The common reward comes from the feedback of the black-box threat model. If the selected patch is reasonable, the generated adversarial patch should have a strong attacking ability, and thus cause a large drop in the confidence score output by the threat model. Therefore, we can use the confidence drop of the ground-truth label as a metric to compute this reward. Because it is also useful to the temporal agent, it is called the common reward. Specifically, the common reward is defined as follows:
$s_t = \exp\!\big(P(y'\,|\,X^{adv}_{t}) - P(y\,|\,X^{adv}_{t})\big),$  (9)
$r^{com}_{t} = s_t - s_{t-1},$  (10)
where $\exp(\cdot)$ is the exponential function, and $P(y\,|\,X^{adv}_{t})$ represents the ground-truth label's confidence when $X^{adv}_{t}$ is fed into the video recognition model. In the un-targeted attack, $P(y'\,|\,X^{adv}_{t})$ represents the second-ranked label's confidence, which is considered the most competitive label to replace the ground-truth label. Only when the second-ranked label's confidence becomes larger than that of the ground-truth label does $s_t$ exceed 1. We then use the change of $s_t$ between iterations as the common reward. Eq.(10) is designed to encourage the agent to add perturbations on the selected regions that make the second-ranked label's confidence gradually approach the ground-truth label's and finally exceed it. In the targeted attack, $P(y'\,|\,X^{adv}_{t})$ is the confidence of the target label pre-defined by the adversary.
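A small sketch of this common reward under the above (assumed) form of Eqs.(9)-(10), where the confidence scores are those returned by the queried model at consecutive PGD iterations:

```python
import math

def common_reward(score_gt_prev, score_2nd_prev, score_gt, score_2nd):
    """Common reward r_com: the change of s_t = exp(P(y'|X) - P(y|X)).

    It is positive whenever the competing label gains on the ground-truth
    label between two consecutive iterations.
    """
    s_prev = math.exp(score_2nd_prev - score_gt_prev)
    s_curr = math.exp(score_2nd - score_gt)
    return s_curr - s_prev
```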
In summary, the $t$-th iteration reward for the spatial agent is:
$r^{S}_{t} = r^{com}_{t} + \lambda_{1}\, r^{edge}_{t},$  (11)
where $\lambda_{1}$ denotes a weight to balance the two terms.
3.2.2 Temporal Agent
There exists a major distinction between the temporal agent and the spatial agent. The spatial agent aims at solving an object localization problem, while the temporal agent aims at solving a binary classification problem (selecting or not selecting a frame). Thus, the actions, rewards, and policy network of the temporal agent need to be re-designed.
Action Design: Key frames refer to the video frames that are conducive to a successful attack, and their number is smaller than that of the whole video. The goal of the temporal agent is to select some key frames from the whole input video $X$, and thus the final selected action is also a sequence set $a^{T}=\{a^{T}_{1},\dots,a^{T}_{N}\}$, just like the spatial agent. Each $a^{T}_{n}\in\{0,1\}$ indicates whether the $n$-th frame is selected or not. Therefore, there are in total $2^{N}$ different actions, which is not friendly to direct optimization.

Policy Network Design: The temporal policy network $\pi^{T}$ is used to predict the temporal action when the state is given. It is constructed with an LSTM structure, and its skeleton diagram is shown in Figure 4. The input of the policy network is the concatenated feature composed of the current frame-level feature $e_n$ and a video-level global feature $g$. Combining these two features helps select key frames by considering the global video information. The global feature $g$ is obtained by a fully connected layer over all the frame-level features $\{e_1,\dots,e_N\}$. The output of the LSTM unit is then fed to a Fully Connected Layer (FCL) with Softmax to predict the probability that $a^{T}_{n}=1$. Technically, the temporal policy network can be expressed as:
$h_n = \mathrm{LSTM}\big([e_n, g],\, h_{n-1}\big),$  (12)
$a^{T}_{n} \sim \mathrm{Bernoulli}\big(\mathrm{Softmax}(\mathrm{FCL}(h_n))\big),$  (13)
where $\mathrm{Bernoulli}(\cdot)$ denotes Bernoulli sampling, and $h_{n-1}$ denotes the hidden state output by the LSTM unit at the $(n{-}1)$-th frame. The state is defined as the concatenated feature $[e_n, g]$. Eq.(13) is repeated $N$ times to get $a^{T}$.
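A minimal PyTorch sketch of such a temporal policy head is given below; the hidden sizes and the mean pooling used before the global fully connected layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalPolicy(nn.Module):
    """Sketch of the temporal agent's policy head (Figure 4).

    It reuses the frame-level features e_n from the shared backbone together
    with a video-level global feature g obtained by a fully connected layer.
    """
    def __init__(self, feat_dim=1280, hidden=256):
        super().__init__()
        self.global_fc = nn.Linear(feat_dim, feat_dim)
        self.lstm = nn.LSTMCell(2 * feat_dim, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frame_feats, state):
        """frame_feats: (N, feat_dim); returns the 0/1 frame mask and log-probs."""
        g = self.global_fc(frame_feats.mean(dim=0))   # video-level feature (mean-pooled)
        h, c = state
        mask, log_probs = [], []
        for e_n in frame_feats:
            h, c = self.lstm(torch.cat([e_n, g]).unsqueeze(0), (h, c))    # Eq.(12)
            dist = torch.distributions.Bernoulli(logits=self.head(h))     # Eq.(13)
            a = dist.sample()
            mask.append(a)
            log_probs.append(dist.log_prob(a))
        return torch.cat(mask).squeeze(-1), torch.cat(log_probs).squeeze(-1)
```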
Reward Design: To make the temporal agent intelligent, the temporal policy network interacts with the environment to update its parameters $\theta^{T}$. Similar to the training of the spatial agent, in addition to the common reward function shared with the spatial agent, we also design two specific rewards to guide the temporal agent. The first specific reward function is the sparse reward:
$r^{spa} = -\Big|\frac{1}{N}\sum_{n=1}^{N} a^{T}_{n} - \frac{K}{N}\Big|,$  (14)
where $K$ is used to control the number of key frames selected by the temporal agent, and $K < N$. The second specially designed reward function is mainly used to evaluate the representative ability of the video frames selected by the temporal agent, because the selected video frames need to be sparse but still effectively represent the semantic information of the whole input video. The representativeness reward function [36] is defined as:
$r^{rep} = \exp\!\Big(-\frac{1}{N}\sum_{n=1}^{N}\min_{n'\in\mathcal{Y}} \big\lVert e_n - e_{n'} \big\rVert_{2}\Big),$  (15)
where $\mathcal{Y}$ is the set of selected frames, i.e., frames with $a^{T}_{n}=1$. Through these reward functions, the temporal agent is forced to recognize few but critical video frames. The selected video frames can effectively reduce the temporal redundancy of the entire video and improve the subsequent attack efficiency.
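To make the two temporal rewards concrete, a short PyTorch sketch follows. The exact functional form of the sparse reward is an assumption (the representativeness reward follows the form in [36]), and `frame_feats` stands for the frame-level features $e_n$.

```python
import torch

def sparsity_reward(frame_mask, tau=0.5):
    """Sparse reward sketch (Eq.(14)): penalize selecting more than a
    proportion tau of the frames; tau plays the role of K/N."""
    ratio = frame_mask.float().mean()
    return -(ratio - tau).clamp(min=0.0)

def representativeness_reward(frame_feats, frame_mask):
    """Representativeness reward following [36]: selected frames should be
    close (in feature space) to every frame of the video."""
    selected = frame_feats[frame_mask.bool()]
    if selected.numel() == 0:
        return frame_feats.new_tensor(0.0)
    d = torch.cdist(frame_feats, selected)       # (N, |Y|) pairwise distances
    return torch.exp(-d.min(dim=1).values.mean())
```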
To make the key video frames conducive to successful attacks, the common reward $r^{com}_{t}$ also joins the learning of the temporal agent. For the $t$-th iteration, the corresponding reward is:
$r^{T}_{t} = r^{com}_{t} + \lambda_{2}\, r^{spa}_{t} + \lambda_{3}\, r^{rep}_{t},$  (16)
where $\lambda_{2}$ and $\lambda_{3}$ are two balance coefficients, which will be discussed and set in the experimental section.
So far, through the cooperation of the spatial agent and the temporal agent, the key regions in the key frames of the input video can be identified. During multi-agent reinforcement learning, the agents interact with the threat model many times, and their predictions increasingly favor a rapid and successful attack. Therefore, the critical attacking spaces selected by multi-agent reinforcement learning are the spaces sensitive to attacks, which can effectively improve the attack efficiency.
3.2.3 Optimization Algorithm
There are two parts to optimize: one is the lightweight CNN backbone, and the other is the two policy networks $\pi^{S}$ and $\pi^{T}$.
CNN backbone: In our method, the CNN backbone is shared by the temporal agent and the spatial agent. It extracts the frames' feature maps to construct the states. To decouple the training of the CNN backbone and the policy networks, we directly apply a MobileNet V2 backbone pre-trained on the ImageNet dataset as the feature extractor. In this way, we can focus on the optimization of the two policy networks.
Policy network: Policy gradient methods are used to optimize the temporal and spatial policy networks. They directly adjust the parameters $\theta$ in order to maximize the objective $J(\theta)$ by taking steps in the direction of $\nabla_{\theta} J(\theta)$. By introducing an action-value function $Q^{\pi}(s,a)$, the policy gradient can be written as:
$\nabla_{\theta} J(\theta) = \mathbb{E}\big[\nabla_{\theta} \log \pi_{\theta}(a\,|\,s)\, Q^{\pi}(s,a)\big].$  (17)
To solve Eq.(17), we utilize the actor-critic reinforcement learning framework [37], where a critic network is applied to approximate the action-value function $Q^{\pi}(s,a)$. The actor networks are the policy networks in Figure 3 and Figure 4.
Our method focuses on a cooperative multi-agent task, in which the two agents try to optimize a shared reward function. Each agent is decentralized and only has access to locally available information; for example, the temporal agent can only observe the change of key frames, and the spatial agent can only observe the change of key patches. Therefore, our method can be described as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [38]. To solve this problem, [23] presents the multi-agent decentralized actor, centralized critic approach, and Eq.(17) is reformulated as:
$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\big[\nabla_{\theta_i} \log \pi_{i}(a_i\,|\,o_i)\, Q^{\pi}_{i}(s, a_1, a_2)\big],$  (18)
where $\pi_i$ denotes the policy network of the $i$-th agent, $\theta_i$ is the corresponding parameters, and $o_i$ is the agent's local observation. In our method, there are in total two agents, whose policy networks are $\pi^{S}$ with parameters $\theta^{S}$ and $\pi^{T}$ with parameters $\theta^{T}$. Here $Q^{\pi}_{i}(s, a_1, a_2)$ is a centralized action-value function that takes as input the actions of all agents in addition to some state information $s$, and outputs the Q-value for agent $i$. In this way, we can perform communication between the two agents. In cooperative MARL, each agent is expected to maximize the common reward and its specific reward; therefore, we just need to solve Eq.(18) according to the rewards of the spatial agent and the temporal agent, respectively.
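The following is a minimal sketch of a corresponding surrogate loss for the two decentralized actors with a centralized critic; the dictionary keys and the simple REINFORCE-style form are assumptions for illustration, not the exact optimizer used by the authors.

```python
import torch

def marl_policy_loss(log_probs_s, log_probs_t, q_values):
    """REINFORCE-style surrogate for Eq.(18) with a centralized critic.

    log_probs_s / log_probs_t: log-probabilities of the actions taken by the
    spatial / temporal agents in this PGD iteration; q_values: the critic's
    estimates Q_i(s, a_S, a_T) for each agent. Minimizing this loss ascends
    the policy gradient for both agents.
    """
    adv_s = q_values["spatial"].detach()
    adv_t = q_values["temporal"].detach()
    loss_s = -(log_probs_s.sum() * adv_s)
    loss_t = -(log_probs_t.sum() * adv_t)
    return loss_s + loss_t
```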
3.3 The Overall Framework
After the MARL module, the key frames and key regions are obtained. The video $X\in\mathbb{R}^{N\times H\times W\times 3}$ in Eq.(1) and Eq.(2) is reduced to the video $R(X)\in\mathbb{R}^{K\times h_p\times w_p\times 3}$ composed of key frames and key regions, where $K$ denotes the number of key frames, and $h_p$ and $w_p$ denote the key patches' height and width. It is clear that $K\times h_p\times w_p \ll N\times H\times W$. The AstFocus attack finally utilizes $R(X)$ to compute Eq.(2). Because of the reduced dimension, the gradient estimation can be efficient.
We now give the overall algorithm of the AstFocus attack, illustrated under the un-targeted attack. The agent learning is an unsupervised process: through continuous interaction with the threat model, the agents receive feedback from the attack effect and from evaluation indicators computed on the video itself, and are updated to perform better. The whole algorithm is summarized in Algorithm 1.
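A high-level Python sketch of this loop is given below. Here `query_model`, `agents`, and `nes_grad` are stand-ins for the components sketched above, the hyperparameter values are placeholders rather than the paper's settings, and `scatter_gradient` refers to the earlier reduction sketch.

```python
import numpy as np

def astfocus_attack(video, label, query_model, agents, nes_grad,
                    alpha=1.0, epsilon=16.0, max_queries=30000):
    """High-level sketch of the un-targeted AstFocus loop (Algorithm 1)."""
    x_adv, queries = video.astype(np.float32), 0
    while queries < max_queries:
        frames, patches = agents.act(x_adv)                   # focused search space
        grad_r, used = nes_grad(x_adv, frames, patches, label, query_model)
        queries += used
        grad = scatter_gradient(grad_r, video.shape, frames, patches, 65, 65)
        x_adv = np.clip(x_adv + alpha * np.sign(grad),        # PGD step (Eq.(1))
                        video - epsilon, video + epsilon).clip(0, 255)
        pred, score = query_model(x_adv)                      # one more query
        queries += 1
        agents.update(score)         # rewards (Eqs.(9)-(16)) computed internally
        if pred != label:
            return x_adv, queries, True
    return x_adv, queries, False
```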
4 Experiments and Results
4.1 Datasets and Recognition Models
Datasets. In our experiments, three public action recognition datasets are used: UCF-101 [40], HMDB-51 [41], and Kinetics-400 [42]. UCF-101 contains 13,320 videos with 101 action categories. HMDB-51 is a dataset for human motion recognition, which contains 51 action categories with a total of about 7,000 videos. Kinetics-400 contains 400 human action classes, with at least 400 video clips for each action. Each dataset is split into 70% training and 30% testing videos. We randomly sample 100 videos from the UCF-101 test set, 50 videos from the HMDB-51 test set, and 400 videos from the Kinetics-400 test set. All sampled videos are correctly classified by the recognition models.
Recognition Models. For recognition models, four representative methods are used in our experiments: C3D [11], Temporal Segment Network (TSN) [12], Temporal Shift Module (TSM) [13], and SlowFast network [43]. These models are all mainstream methods for the video classification task. For TSN, TSM, and SlowFast on the three datasets, we utilize the corresponding pre-trained weights released by MMAction2 [44], a widely used open-source toolbox for video understanding based on PyTorch. For C3D, because MMAction2 only releases the pre-trained weights on UCF-101, to ensure consistency, we utilize the officially pre-trained weights on the three datasets released by the authors (https://github.com/kenshohara/3D-ResNets-PyTorch). Table II lists their accuracy on the test sets.
4.2 Evaluation metrics
There are four metrics to test the performance of our method from various aspects: Fooling Rate, Query Number, Mean Absolute Perturbation, and Time.
Fooling Rate (FR): indicates the percentage of adversarial videos that successfully fool the threat model out of all tested videos. FR reflects the probability of successfully generating adversarial examples. A higher FR value means better attack performance.
Mean Absolute Perturbation (MAP): denotes the magnitude of the generated adversarial perturbation. For a given video, $\mathrm{MAP}=\frac{1}{N}\sum_{n=1}^{N}\mathrm{mean}(|E_n|)$, where $N$ is the number of frames in the video, and $E_n$ is the perturbation intensity on the $n$-th frame. To be intuitive, the value of MAP is rescaled to 0-255. In the experiments, we report the average MAP across the test videos. A lower MAP value means better imperceptibility.
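A small sketch of this metric under the above (assumed) per-frame averaging, with pixel values in [0, 255]:

```python
import numpy as np

def mean_absolute_perturbation(video, adv_video):
    """Per-frame mean absolute perturbation, averaged over frames."""
    diff = np.abs(adv_video.astype(np.float64) - video.astype(np.float64))
    return diff.reshape(diff.shape[0], -1).mean(axis=1).mean()
```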
Query Number (QN): denotes the number of queries used to successfully fool the threat model for a given adversarial video. It reflects the efficiency of different video attack methods. In the experiments, we set an upper bound for the query number; if the queries reach the upper bound but the threat model is still not fooled, we regard this adversarial video as not successfully generated. The average query number across the test videos is reported. A lower QN value means higher efficiency.
Time (T): denotes the total time cost until a successful attack is finished, measured in seconds. In the experiments, we report the average seconds across the test videos. A lower time value means higher efficiency.
Note that previous works [27, 17, 20] also used these metrics, but this paper has a slight difference from them. In [27, 17, 20], MAP and QN are computed only for the adversarial videos that successfully perform the attacks; the MAP and QN values of failed videos are not considered. In contrast, this paper computes MAP and QN for all the test videos. We think this is more reasonable because the failed videos also generate perturbations and cost queries to the threat models.
4.3 State-of-the-art attack competitors
Here, we use six state-of-the-art black-box video attack methods for comparison with our method in terms of effectiveness and speed, namely the VBAD attack [19], Heuristic attack [20], Sparse attack [17], GEO-TRAP attack [31], RLSB attack [21], and Motion-sampler attack [30]. Detailed introductions to these competitors can be found in the related works section. We use their officially released codes to conduct comparisons (for the Sparse attack, we directly use the well-trained agent to predict key frames and then perform attacks; there is no released code for the RLSB attack, so we implement it according to the paper). For fair comparisons, all the settings are the same.
4.4 Implementation details
In the query-based black-box attacks, the query number is a key metric to evaluate the attacks' performance. Thus, given a video, we set a maximum query number for all the compared methods; if the used query number exceeds the maximum, the adversarial attack is regarded as a failure for this video. We set different maximum query numbers for the un-targeted and the targeted attacks. In NES, we also set the search variance $\sigma$ to different values for the un-targeted and the targeted attacks according to our experience.
4.5 Parameter tuning
There are some hyperparameters in our method. In this section, we determine their values via parameter tuning on a validation set. Specifically, we randomly select 20 videos from HMDB-51 to construct the validation set, and then perform parameter tuning against the C3D model.

4.5.1 Patch size for spatial agent
The first hyperparameter is the patch size $w_p \times h_p$ used when designing the spatial agent's actions. A reasonable patch size leads to fewer queries and smaller perturbations. The parameter tuning results for the patch size are given in Figure 5, where we explore its effects on the fooling rate, query number, and perturbations, respectively. From the figure, we see that the patch size mainly affects the query number but causes only slight changes in the fooling rate and perturbation magnitude. Moreover, Figure 5 (B) shows that the query number is relatively sensitive to the patch size. This is reasonable because the pre-defined patch size determines the proportion of the selected key regions out of the whole image, and thus affects the query number. (From the last column in Table III, we see that the AstFocus attack has a small variance when performed multiple times: for attacking the C3D model on HMDB-51, the variance is zero for FR and only about 1% of the mean for MAP and 4% of the mean for QN, so the unsmooth curve is not caused by significant variance.) Overall, when the patch size is set to 65, the query number reaches the smallest value. Therefore, we set $w_p = h_p = 65$.
4.5.2 Upper bound of key frames
The second hyperparameter is the upper bound $K$ of selected key frames in Eq.(14). A reasonable $K$ helps our method select the minimal key frames needed to perform a successful video attack, so the query number can be reduced. The parameter tuning results for the upper bound are given in Figure 6, where we also explore its effects on the fooling rate, query number, and perturbations. We can see that with the increase of $K$, the fooling rate gradually stabilizes at 100% and the query number slowly decreases, but the perturbation magnitude increases considerably. To balance the three evaluation metrics, we set $K$ accordingly in the following experiments.
4.5.3 Sample number in NES
The third hyperparameter is the number $n$ of sampled points in Eq.(2). The sample number per iteration has a great influence on the accuracy of the estimated gradient, especially when the attacking space changes. To explore the impact of the sample number on the attack effect, we conduct a series of experiments; the parameter tuning results are given in Figure 7. We can see that with the increase of $n$, the fooling rate gradually stabilizes at 100%, but the query number and perturbation magnitude achieve their optimal values when $n$ is around 60. Therefore, we set $n=60$ in the following experiments.


4.5.4 Weights for various rewards
There are three weights to tune in the reward functions: $\lambda_1$ in Eq.(11), and $\lambda_2$ and $\lambda_3$ in Eq.(16), which measure the importance of the corresponding rewards. The parameter tuning results are given in Figure 8. According to the figure, we set $\lambda_1$, $\lambda_2$, and $\lambda_3$ accordingly, with larger weights for Eq.(16). This means there exists more redundancy to reduce in the temporal domain than in the spatial domain, so larger rewards are needed in Eq.(16) to guide the agent toward learning key frames.
4.6 Ablation study
To explore the effectiveness of different components of the proposed algorithm, a series of experiments are conducted here. Specifically, we investigate the effects of various agents and various rewards, respectively. Similarly, we randomly select 20 videos from HMDB-51 to construct the validation set, and then perform the ablation study against the C3D video recognition model. Because the gradient estimator module introduces randomness, we perform each ablation study five times, and then report the mean and variance for the different metrics.

Table III: Effects of various agents (different agent versions).

| Metrics | Baseline | Spatial | Temporal | Spatial&Temporal |
|---|---|---|---|---|
| FR (%) | 73.3±2.9 | 88.3±2.9 | 93.3±2.9 | 100±0.0 |
| QN | 3662±244 | 2670±155 | 2757±125 | 2227±92 |
| MAP | 6.37±0.08 | 4.21±0.05 | 4.53±0.07 | 3.35±0.03 |
4.6.1 Effects of various agents
In our method, the baseline is the PGD+NES algorithm. We then integrate two agents into PGD+NES to reduce the video dimension in the temporal and spatial domains, respectively. Here we perform an ablation study on whether these two agents help the video attack. The results are given in Table III, where "Baseline" denotes PGD+NES. In this setting, because there is no dimension reduction module, the perturbations are added on the whole video, which can be called a "dense attack". The term "Spatial" denotes integrating the spatial agent into the baseline; in this setting, we reduce the spatial redundancy by selecting the key patches in each frame. The term "Temporal" denotes integrating the temporal agent into the baseline, which reduces the temporal redundancy by selecting the key frames. The term "Spatial&Temporal" denotes the full version of the AstFocus attack, i.e., simultaneously reducing the temporal and spatial redundancy via the two agents.
We show the effects on the fooling rate (FR), query number (QN), and perturbation magnitude (MAP). From the table, we can see that the dimension reduction is indeed useful to the attacking performance, i.e., "Spatial" and "Temporal" achieve higher FR and smaller QN and MAP than "Baseline". By simultaneously reducing the temporal and spatial redundancy, "Spatial&Temporal" achieves the highest FR (100%) and the smallest QN and MAP. The average FR increases by 26.7% (73.3%→100%), and the average QN and MAP decrease by about 36% (3662→2227) and 47% (6.37→3.35) versus the baseline, respectively. In addition, "Spatial&Temporal" has a smaller variance than the baseline: the variance is zero for the FR metric, and only about 1% of the mean for the MAP metric and 4% of the mean for the QN metric. For this reason, we neglect the variance in the following comparison experiments. This verifies the important role of dimension reduction when performing video attacks.

We also compare our RL-based agents with random agents, where the key frames and key patches are randomly selected in each PGD iteration and the other settings are the same. Figure 9 shows the comparison results. We can see that for all the evaluation metrics, our RL-based agents outperform the random agents (100±0% vs. 86.7±2.89% for FR, 2227±92 vs. 2632±114 for QN, 3.35±0.03 vs. 3.62±0.04 for MAP), which verifies that the temporal and spatial agents in our method are jointly well-trained under the guidance of the carefully designed rewards. With the feedback of the threat models, the temporal and spatial agents become intelligent.
4.6.2 Effects of various rewards
To better guide the agent learning, we carefully design the rewards: one common reward (Eq.(10)) coming from the black-box threat model, and specific rewards (Eq.(8) for the spatial agent, and Eq.(14) and Eq.(15) for the temporal agent). The common reward is shared by the spatial agent and the temporal agent, while the specific rewards belong only to their own agents. In this section, we explore the effects of these rewards on the fooling rate, query number, and perturbations.
Table IV: Effects of various rewards (different reward versions).

| Metrics | Common | +Edgebox | +Sparse | +Representative |
|---|---|---|---|---|
| FR (%) | 76.7±2.9 | 91.7±2.9 | 96.7±2.9 | 100±0.0 |
| QN | 3780±183 | 2690±122 | 2490±87 | 2227±92 |
| MAP | 3.62±0.07 | 3.58±0.04 | 3.39±0.04 | 3.35±0.03 |
Table IV lists the ablation study on the effects of various rewards. The term "Common" denotes guiding both the temporal and spatial agents only with the common reward in Eq.(10). The terms "+Edgebox", "+Sparse", and "+Representative" denote adding the corresponding rewards (Eq.(8), Eq.(14), and Eq.(15), respectively) on top of the former setting to guide the agent learning. From the table, we see that with the addition of more and more rewards, the average FR gradually increases, and the average QN and average MAP gradually decrease. Compared with the solo common reward, the full version (the rightmost column) with all the rewards improves the average FR by 23.3% (76.7%→100%), and reduces the average QN by 43% (3780→2227) and the average MAP by 8% (3.62→3.35). The variance also becomes smaller and smaller. This contrast verifies the rationality of the designed rewards.
Table V: Comparisons with SOTA black-box video attack methods on UCF-101.

Dataset | Threat Model | Attack Method | Un-targeted: MAP | NQ | FR | T(s) | Targeted: MAP | NQ | FR | T(s)
---|---|---|---|---|---|---|---|---|---|---
UCF-101 | TSM [13] | VBAD attack [19] | 6.023 | 2803 | 84% | 31.2 | 9.208 | 15304 | 63% | 201.7 |
Heuristic attack [20] | 5.956 | 10657 | 40% | 212.9 | – | – | – | – | ||
Sparse attack [17] | 3.417 | 8529 | 58% | 147.2 | – | – | – | – | ||
Motion-sampler attack [30] | 7.237 | 5187 | 83% | 124.6 | 7.415 | 13577 | 79% | 288.6 | ||
GEO-TRAP attack [31] | 5.865 | 3782 | 88% | 87.2 | 6.265 | 11494 | 84% | 247.6 | ||
RLSB attack [21] | 4.823 | 4898 | 87% | 101.3 | 7.274 | 20532 | 40% | 365.4 | ||
AstFocus attack (ours) | 3.355 | 1138 | 96% | 24.6 | 4.546 | 8064 | 100% | 274.5 | ||
TSN [12] | VBAD attack [19] | 6.168 | 2450 | 84% | 13.8 | 9.391 | 18960 | 47% | 394.6 | |
Heuristic attack [20] | 5.265 | 9135 | 51% | 141.7 | – | – | – | – | ||
Sparse attack [17] | 3.131 | 6916 | 64% | 181.4 | – | – | – | – | ||
Motion-sampler attack [30] | 6.895 | 4744 | 78% | 95.6 | 6.903 | 19626 | 59% | 316.8 | ||
GEO-TRAP attack [31] | 5.472 | 3782 | 75% | 87.3 | 5.952 | 18585 | 55% | 301.2 | ||
RLSB attack [21] | 5.238 | 3504 | 93% | 48.2 | 8.448 | 20668 | 44% | 289.5 | ||
AstFocus attack (ours) | 3.265 | 2015 | 99% | 37.4 | 4.495 | 8483 | 76% | 272.9 | ||
C3D [11] | VBAD attack [19] | 6.800 | 4890 | 75% | 43.4 | 10.760 | 20234 | 60% | 139.2 | |
Heuristic attack [20] | 6.295 | 14160 | 30% | 143.2 | – | – | – | – | ||
Sparse attack [17] | 3.009 | 9507 | 42% | 86.0 | – | – | – | – | ||
Motion-sampler attack [30] | 6.153 | 8132 | 62% | 97.9 | 7.242 | 20690 | 47% | 252.9 | ||
GEO-TRAP attack [31] | 5.877 | 7045 | 75% | 74.4 | 6.332 | 17340 | 77% | 205.4 | ||
RLSB attack [21] | 5.326 | 6568 | 68% | 72.0 | 7.225 | 22018 | 35% | 207.1 | ||
AstFocus attack (ours) | 4.015 | 4224 | 90% | 66.8 | 4.225 | 13470 | 88% | 236.4 | ||
SlowFast [43] | VBAD attack [19] | 6.302 | 4089 | 77% | 42.5 | 9.118 | 19423 | 53% | 499.2 | |
Heuristic attack [20] | 5.869 | 12776 | 34% | 260.5 | – | – | – | – | ||
Sparse attack [17] | 3.164 | 8642 | 58% | 168.8 | – | – | – | – | ||
Motion-sampler attack [30] | 7.086 | 5166 | 77% | 136.3 | 7.275 | 17928 | 56% | 451.8 | ||
GEO-TRAP attack [31] | 5.712 | 4334 | 84% | 120.5 | 6.273 | 16506 | 62% | 532.1 | ||
RLSB attack [21] | 5.586 | 4563 | 85% | 94.5 | 7.655 | 22552 | 43% | 448.4 | ||
AstFocus attack (ours) | 4.286 | 1435 | 93% | 35.7 | 4.436 | 13660 | 85% | 385.3 |
Table VI: Comparisons with SOTA black-box video attack methods on HMDB-51.

Dataset | Threat Model | Attack Method | Un-targeted: MAP | NQ | FR | T(s) | Targeted: MAP | NQ | FR | T(s)
---|---|---|---|---|---|---|---|---|---|---
HMDB-51 | TSM [13] | VBAD attack [19] | 6.361 | 1818 | 92% | 22.4 | 9.057 | 17550 | 60% | 495.6 |
Heuristic attack [20] | 5.043 | 10385 | 58% | 211.4 | – | – | – | – | ||
Sparse attack [17] | 3.334 | 6244 | 62% | 102.5 | – | – | – | – | ||
Motion-sampler attack [30] | 7.229 | 3911 | 90% | 95.8 | 8.012 | 19508 | 58% | 686.7 | ||
GEO-TRAP attack [31] | 5.919 | 3164 | 92% | 84.8 | 6.222 | 10836 | 84% | 337.7 | ||
RLSB attack [21] | 5.323 | 5950 | 82% | 112.9 | 7.754 | 20171 | 28% | 575.8 | ||
AstFocus attack (ours) | 3.411 | 1529 | 100% | 34.7 | 4.326 | 7319 | 92% | 419.1 | ||
TSN [12] | VBAD attack [19] | 5.873 | 2373 | 90% | 26.4 | 9.244 | 21795 | 46% | 659.1 | |
Heuristic attack [20] | 5.395 | 10146 | 58% | 172.5 | – | – | – | – | ||
Sparse attack [17] | 3.271 | 6765 | 74% | 105.9 | – | – | – | – | ||
Motion-sampler attack [30] | 7.275 | 3667 | 88% | 74.1 | 8.087 | 24332 | 28% | 964.6 | ||
GEO-TRAP attack [31] | 5.192 | 3392 | 88% | 61.6 | 6.344 | 21492 | 36% | 744.6 | ||
RLSB attack [21] | 5.312 | 4217 | 92% | 68.6 | 6.494 | 22718 | 22% | 719.1 | ||
AstFocus attack (ours) | 3.52 | 2198 | 96% | 51.1 | 4.090 | 9953 | 74% | 522.4 | ||
C3D [11] | VBAD attack [19] | 6.743 | 4107 | 78% | 36.5 | 10.528 | 22302 | 64% | 361.2 | |
Heuristic attack [20] | 4.838 | 10534 | 42% | 117.4 | – | – | – | – | ||
Sparse attack [17] | 2.983 | 8545 | 46% | 73.8 | – | – | – | – | ||
Motion-sampler attack [30] | 7.035 | 6491 | 68% | 89.5 | 7.973 | 22199 | 44% | 558.4 | ||
GEO-TRAP attack [31] | 5.666 | 5082 | 84% | 48.2 | 6.324 | 16374 | 74% | 518.2 | ||
RLSB attack [21] | 4.688 | 7279 | 62% | 77.1 | 7.212 | 18190 | 60% | 530.8 | ||
AstFocus attack (ours) | 3.835 | 3628 | 92% | 45.6 | 4.025 | 9997 | 86% | 473.6 | ||
SlowFast [43] | VBAD attack [19] | 6.528 | 5442 | 72% | 46.7 | 10.615 | 22955 | 36% | 693.3 | |
Heuristic attack [20] | 5.875 | 9094 | 54% | 162.9 | – | – | – | – | ||
Sparse attack [17] | 3.228 | 8977 | 56% | 138.0 | – | – | – | – | ||
Motion-sampler attack [30] | 7.163 | 6553 | 74% | 156.4 | 7.956 | 18513 | 52% | 652.3 | ||
GEO-TRAP attack [31] | 6.242 | 5741 | 78% | 122.8 | 6.179 | 17636 | 46% | 666.1 | ||
RLSB attack [21] | 5.68 | 4495 | 84% | 81.9 | 7.416 | 20378 | 34% | 650.0 | ||
AstFocus attack (ours) | 4.078 | 2295 | 96% | 47.1 | 4.682 | 13970 | 78% | 567.9 |
4.6.3 Convergence of AstFocus attacks
Because our agents are updated by the rewards in each iteration, it is necessary to investigate whether the agents converge as the iterations increase. For that, we plot the change of Eq.(9)'s value over the PGD iterations in Figure 10. Eq.(9) directly reflects the success or failure of an attack: if the target class's confidence score is above the ground-truth class's confidence score, the value of Eq.(9) is above 1, representing that the attack is successful, and vice versa. From the figure, we can see that Eq.(9)'s values for all the threat models gradually increase until they stabilize. When the iteration reaches 400, all the models achieve convergence, which verifies the good convergence of the AstFocus attack. In practice, the attack usually stops once Eq.(9)'s value is above 1, i.e., step 9 in Algorithm 1; therefore, only a few iterations are needed in application. Figure 11 gives a qualitative example of the agents in the AstFocus attack, where the key frames and key patches in different iterations are illustrated by bounding boxes. We can see that the spatial agent gradually focuses on the foreground objects. This is reasonable because these areas are key cues for the video recognition task. In addition, the temporal agent tends to select the frames with big changes in the actions. These frames have a strong ability to represent the whole video from the appearance, which shows that key frames are sensitive to attacks.


Table VII: Comparisons with SOTA black-box video attack methods on Kinetics-400.

Dataset | Threat Model | Attack Method | Un-targeted: MAP | NQ | FR | T(s) | Targeted: MAP | NQ | FR | T(s)
---|---|---|---|---|---|---|---|---|---|---
Kinetics-400 | TSM [13] | VBAD attack [19] | 6.480 | 3626 | 78% | 16.8 | 10.338 | 23670 | 34% | 593.6 |
Heuristic attack [20] | 5.744 | 11918 | 56% | 202.1 | – | – | – | – | ||
Sparse attack [17] | 2.725 | 9105 | 60% | 147.6 | – | – | – | – | ||
Motion-sampler attack [30] | 7.234 | 4494 | 88% | 105.9 | 8.033 | 21032 | 46% | 803.1 | ||
GEO-TRAP attack [31] | 6.007 | 3962 | 92% | 87.1 | 6.217 | 15585 | 72% | 536.0 | ||
RLSB attack [21] | 5.856 | 4422 | 84% | 91.4 | 7.200 | 21724 | 26% | 653.2 | ||
AstFocus attack (ours) | 3.658 | 2416 | 96% | 44.1 | 4.482 | 9758 | 88% | 556.5 | ||
TSN [12] | VBAD attack [19] | 5.764 | 1668 | 92% | 22.6 | 9.414 | 17560 | 58% | 427.9 | |
Heuristic attack [20] | 4.806 | 10080 | 52% | 165.6 | – | – | – | – | ||
Sparse attack [17] | 2.579 | 8212 | 54% | 117.8 | – | – | – | – | ||
Motion-sampler attack [30] | 6.982 | 3422 | 90% | 80.3 | 7.979 | 22119 | 36% | 735.2 | ||
GEO-TRAP attack [31] | 5.636 | 2684 | 90% | 53.1 | 5.789 | 16402 | 50% | 459.7 | ||
RLSB attack [21] | 5.395 | 3774 | 94% | 63.7 | 7.140 | 23574 | 22% | 622.4 | ||
AstFocus attack (ours) | 3.349 | 1021 | 100% | 26.7 | 4.684 | 9940 | 90% | 558.4 | ||
C3D [11] | VBAD attack [19] | 5.640 | 3444 | 90% | 13.2 | 10.230 | 22070 | 52% | 135.2 | |
Heuristic attack [20] | 5.805 | 11384 | 48% | 44.7 | – | – | – | – | ||
Sparse attack [17] | 2.769 | 5045 | 78% | 20.7 | – | – | – | – | ||
Motion-sampler attack [30] | 6.895 | 2485 | 96% | 14.1 | 7.808 | 15679 | 70% | 163.2 | ||
GEO-TRAP attack [31] | 6.135 | 3436 | 96% | 15.6 | 6.334 | 12260 | 90% | 120.5 | ||
RLSB attack [21] | 4.925 | 5915 | 76% | 25.2 | 8.053 | 20975 | 36% | 179.6 | ||
AstFocus attack (ours) | 3.858 | 1055 | 100% | 9.1 | 4.728 | 10840 | 92% | 152.9 | ||
SlowFast [43] | VBAD attack [19] | 6.667 | 2732 | 86% | 33.6 | 10.532 | 19970 | 44% | 599.2 | |
Heuristic attack [20] | 4.723 | 9154 | 54% | 159.5 | – | – | – | – | ||
Sparse attack [17] | 3.043 | 5901 | 70% | 97.8 | – | – | – | – | ||
Motion-sampler attack [30] | 7.145 | 2282 | 92% | 55.4 | 7.953 | 20264 | 40% | 878.3 | ||
GEO-TRAP attack [31] | 5.832 | 1646 | 94% | 37.5 | 6.197 | 9594 | 86% | 366.8 | ||
RLSB attack [21] | 4.636 | 4802 | 88% | 88.1 | 7.405 | 21137 | 34% | 705.4 | ||
AstFocus attack (ours) | 3.356 | 851 | 100% | 24.4 | 4.015 | 7572 | 98% | 386.2 |
4.7 Comparisons with SOTA methods
Here, we compare the proposed AstFocus attack with six state-of-the-art black-box video attack methods on three public datasets and four widely used video recognition models. The comparative results in the un-targeted and targeted settings are recorded in Table V, Table VI, and Table VII (for fair comparison, the target label for all the methods is the same when performing targeted attacks). From the tables, we see that: (1) For attack effect (FR and MAP), our method significantly outperforms the other six SOTA methods on the FR metric (by at least 5%) against all the threat models on all the datasets, showing a big advantage in attacking ability. For the MAP metric, the AstFocus attack only slightly loses to the Sparse attack in the un-targeted setting but clearly outperforms the other five video attacks. This is because the Sparse attack adds adversarial perturbations only on fixed key frames in each PGD iteration, and the clip operation in Eq.(1) projects the perturbations to a small range, so the upper bound of the adversarial perturbations generated by the Sparse attack is small. However, this design also limits the attacking efficiency and effectiveness; for example, the Sparse attack only achieves about 40% FR and needs almost 9000 queries on average for un-targeted attacks, far inferior to the AstFocus attack. Overall, the AstFocus attack is better than the Sparse attack. A small MAP under a high FR means an accurate evaluation of the models' adversarial robustness; from this viewpoint, the AstFocus attack is more suitable to evaluate different video models. (2) For attack efficiency (NQ and T), AstFocus also significantly beats the other six SOTA methods on NQ against all the threat models on all the datasets, reducing queries by at least 10% compared with the second-best video attacks. For the time metric, the AstFocus attack only slightly loses to the VBAD attack but still beats the other five video attacks. This is reasonable because the AstFocus attack integrates two additional agents to reduce dimensions during attacks, while VBAD does not involve this step; in return, AstFocus greatly outperforms VBAD on the other three metrics. Overall, AstFocus has high efficiency. (3) For simultaneous modeling, the AstFocus attack remarkably outperforms the RLSB attack in all settings, showing that simultaneously modeling the key frames and key regions is indeed more effective than modeling them separately. This also demonstrates the core idea of this paper. (4) From the view of robustness evaluation, all seven black-box video attacks show that C3D has better adversarial robustness than the other models: C3D has lower FR values but higher NQ values, which means it is harder to attack. This may motivate an in-depth study of the C3D structure to design robust video recognition models.



4.8 Integrated with other gradient estimators
In our AstFocus attack, the current gradient estimator is NES. In fact, NES can be replaced with other state-of-the-art gradient estimators. To verify this point, we conduct experiments with two SOTA gradient estimators: Prior convictions [49] and ZOO [50]. Figure 12 gives the results. We can see that when the gradient estimator is changed, the fooling rate, query number, and perturbation magnitude show only slight variations. Relatively speaking, NES achieves the best performance on the three metrics. This indicates that AstFocus attack is a flexible framework in which the modules other than the MARL module can be replaced; for example, PGD can also be replaced with its improved versions.
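For reference, a minimal sketch of an antithetic NES estimator in the style of [22, 25] is given below; it is not the exact implementation used in our experiments, and `sigma` and `n_samples` are illustrative values. Every call to `loss_fn` corresponds to one query of the black-box model, which is why estimating gradients over a reduced spatial-temporal search space directly reduces the query number.

```python
import numpy as np

def nes_gradient(loss_fn, x, sigma=1e-3, n_samples=24):
    """Antithetic NES estimate of the gradient of loss_fn at x.

    `loss_fn` queries the black-box model and returns a scalar loss;
    `x` is a float array (e.g., the focused frames/regions being perturbed).
    Each pair of antithetic samples costs two queries.
    """
    grad = np.zeros_like(x, dtype=np.float64)
    for _ in range(n_samples // 2):
        u = np.random.randn(*x.shape)          # Gaussian search direction
        grad += u * loss_fn(x + sigma * u)     # forward sample
        grad -= u * loss_fn(x - sigma * u)     # antithetic sample
    return grad / (n_samples * sigma)
```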
4.9 Qualitative results of AstFocus attacks
We show two adversarial videos and the corresponding perturbations generated by AstFocus attack in Figure 13. The adversarial videos are visually consistent with the original videos, showing the imperceptibility of the adversarial perturbations. To make the perturbations visible, we enlarge their values for display. The final adversarial perturbations are sparse both across frames (inter-frame) and within frames (intra-frame); they exhibit a superposition of the noise patches generated in each PGD iteration and mainly cover the foreground regions of the key frames. We also give two adversarial videos generated by other recently published attack methods (RLSB attack and GEO-TRAP attack) as a reference, where we can see that our method produces more imperceptible perturbations than the other methods.
To better show this advantage, we compute two metrics to quantitatively measure image quality. The first metric measures the degree of blurriness [45]; for this metric, smaller is better. The second metric is SSIM [46]; for this metric, larger is better. We list these two metric values below each video frame as (blurriness/SSIM), which shows that our adversarial videos have better image quality than those of RLSB and GEO-TRAP attacks.
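For readers who wish to reproduce such measurements, the sketch below computes SSIM with scikit-image and uses the variance of the Laplacian as a simple sharpness proxy; note that [45] defines an edge-width-based no-reference blur metric, so this proxy is only an illustrative stand-in rather than the exact metric reported in Figure 13.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import laplace
from skimage.metrics import structural_similarity

def frame_quality(clean_frame, adv_frame):
    """Image-quality indicators for one adversarial frame.

    `clean_frame` and `adv_frame` are H x W x 3 uint8 arrays.
    Returns (sharpness_proxy, ssim): the Laplacian variance of the
    adversarial frame (lower values indicate a blurrier image) and the
    SSIM between the two frames (higher means closer to the clean frame).
    """
    ssim = structural_similarity(clean_frame, adv_frame,
                                 channel_axis=-1, data_range=255)
    sharpness_proxy = laplace(rgb2gray(adv_frame)).var()
    return sharpness_proxy, ssim
```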
4.10 AstFocus attack against defense methods
We evaluate the performance of AstFocus attack against defense methods. Three representative kinds of video defense methods are chosen: an adversarial training method (PGD-AT [24]), a method that modifies the network architecture (OUD [47]), and a pre-processing method (AdvIT [48]). AdvIT is originally proposed to detect adversarial examples; to adopt it as a defense, we attach it before the threat model, and any input detected as an adversarial example is not fed into the threat model, which is why QN and MAP may decrease rather than increase. The results for the C3D model on the HMDB-51 dataset are reported in Table VIII, where the changes compared with the un-defended C3D are listed in brackets. We can see that both the attacking performance and the efficiency decrease: the maximum drop of FR after defense is 32%, QN increases by at most 3377, and MAP increases by at most 33%. This is reasonable because a defended model is harder to attack, yet the FR, QN, and MAP remain acceptable. This shows that AstFocus attack is effective in evaluating the adversarial robustness even of defended action recognition models.
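A detect-then-reject wrapper in the spirit of this setup might look like the following hypothetical sketch, where `detector` stands in for AdvIT's temporal-consistency check and `threat_model` for the video classifier; inputs flagged as adversarial are simply not classified, so an attack query against the wrapped model can fail outright.

```python
class DetectThenClassify:
    """Attach a detector in front of a threat model (illustrative only).

    `detector(video)` is assumed to return True when the input looks
    adversarial; flagged videos are rejected instead of being classified.
    """

    def __init__(self, detector, threat_model):
        self.detector = detector
        self.threat_model = threat_model

    def predict(self, video):
        if self.detector(video):        # flagged as adversarial
            return None                 # reject: no label is returned
        return self.threat_model(video) # otherwise query the classifier
```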
5 Conclusion
In this paper, we designed the novel adversarial spatial-temporal focus (AstFocus) attack on videos to simultaneously identify the key frames and key regions in a video. AstFocus attack was based on the cooperative multi-agent reinforcement learning framework: one agent was responsible for selecting key frames, and the other agent was responsible for selecting key regions. These two agents were jointly trained by the common rewards received from the black-box threat models. Through continuous querying, the reduced searching space composed of key frames and key regions became increasingly precise, and the overall query number became smaller than that on the original video. Extensive experiments on four famous video recognition models and three public action recognition datasets verified the efficiency and effectiveness of AstFocus attack, which is superior in fooling rate, query number, running time, and perturbation magnitude at the same time.
Acknowledgment
This work was supported by the National Key R&D Program of China (Grant No. 2020AAA0104002) and the National Natural Science Foundation of China (No. 62076018).
References
- [1] W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, and R. Yang, “Salient object detection in the deep learning era: An in-depth survey,” TPAMI, 2021.
- [2] Y. Zhu, X. Li, C. Liu, M. Zolfaghari, Y. Xiong, C. Wu, Z. Zhang, J. Tighe, R. Manmatha, and M. Li, “A comprehensive study of deep video action recognition,” arXiv preprint:2012.06567, 2020.
- [3] S. Yang, W. Wang, C. Liu, and W. Deng, “Scene understanding in deep learning-based end-to-end controllers for autonomous vehicles,” IEEE TSMC: Systems, vol. 49, no. 1, pp. 53–63, 2018.
- [4] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint:1412.6572, 2014.
- [5] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, “Adversarial examples are not bugs, they are features,” in NeurIPS, 2019, pp. 125–136.
- [6] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in CVPR, 2017.
- [7] H. Wang, F. He, Z. Peng, T. Shao, Y.-L. Yang, K. Zhou, and D. Hogg, “Understanding the robustness of skeleton-based action recognition under adversarial attack,” in CVPR, 2021.
- [8] S. Tang, R. Gong, Y. Wang, A. Liu, J. Wang, X. Chen, F. Yu, X. Liu, D. Song, A. Yuille et al., “Robustart: Benchmarking robustness on architecture design and training techniques,” arXiv preprint:2109.05211, 2021.
- [9] S. Geisler, T. Schmidt, H. Şirin, D. Zügner, A. Bojchevski, and S. Günnemann, “Robustness of graph neural networks at scale,” NeurIPS, 2021.
- [10] U.-A. M. Chapman-Rounds, U. Bhatt, E. Pazos, M.-A. Schulz, and K. Georgatzis, “Fimap: feature importance by minimal adversarial perturbation,” in AAAI, 2021.
- [11] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?” in CVPR, 2018.
- [12] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in ECCV, 2016, pp. 20–36.
- [13] J. Lin, C. Gan, and S. Han, “Tsm: Temporal shift module for efficient video understanding,” in CVPR, 2019, pp. 7083–7093.
- [14] Y. Dong, S. Cheng, T. Pang, H. Su, and J. Zhu, “Query-efficient black-box adversarial attacks guided by a transfer-based prior,” IEEE TPAMI, 2021.
- [15] X. Wei, Y. Guo, and J. Yu, “Adversarial sticker: A stealthy attack method in the physical world,” IEEE TPAMI, 2022.
- [16] X. Wei, J. Zhu, S. Yuan, and H. Su, “Sparse adversarial perturbations for videos,” in AAAI, vol. 33, 2019, pp. 8973–8980.
- [17] X. Wei, H. Yan, and B. Li, “Sparse black-box video attack with reinforcement learning,” IJCV, pp. 1–15, 2022.
- [18] J. Hwang, J.-H. Kim, J.-H. Choi, and J.-S. Lee, “Just one moment: Structural vulnerability of deep action recognition against one frame attack,” in ICCV, 2021, pp. 7668–7676.
- [19] L. Jiang, X. Ma, S. Chen, J. Bailey, and Y.-G. Jiang, “Black-box adversarial attacks on video recognition models,” in ACMMM, 2019, pp. 864–872.
- [20] Z. Wei, J. Chen, X. Wei, L. Jiang, T.-S. Chua, F. Zhou, and Y.-G. Jiang, “Heuristic black-box adversarial attacks on video recognition models,” in AAAI, 2020.
- [21] Z. Wang, C. Sha, and S. Yang, “Reinforcement learning based sparse black-box adversarial attack on video recognition models,” arXiv preprint:2108.13872, 2021.
- [22] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, “Black-box adversarial attacks with limited queries and information,” in ICML, 2018.
- [23] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” NeurIPS, 2017.
- [24] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint:1706.06083, 2017.
- [25] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber, “Natural evolution strategies,” in IEEE Congress on Evolutionary Computation, 2008, pp. 3381–3387.
- [26] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “Lstm: A search space odyssey,” TNNLS, vol. 28, no. 10, pp. 2222–2232, 2016.
- [27] H. Yan and X. Wei, “Efficient sparse attacks on videos using reinforcement learning,” in ACMMM, 2021, pp. 2326–2334.
- [28] X. Yuan, P. He, Q. Zhu, and X. Li, “Adversarial examples: Attacks and defenses for deep learning,” IEEE TNNLS, vol. 30, no. 9, pp. 2805–2824, 2019.
- [29] S. Li, A. Neupane, S. Paul, C. Song, S. V. Krishnamurthy, A. K. R. Chowdhury, and A. Swami, “Adversarial perturbations against real-time video classification systems,” in NDSS, 2019.
- [30] H. Zhang, L. Zhu, Y. Zhu, and Y. Yang, “Motion-excited sampler: Video adversarial attack with sparked prior,” in ECCV, 2020.
- [31] S. Li, A. Aich, S. Zhu, S. Asif, C. Song, A. Roy-Chowdhury, and S. Krishnamurthy, “Adversarial attacks on black box video classifiers: Leveraging the power of geometric transformations,” NeurIPS, 2021.
- [32] Z. Wei, J. Chen, Z. Wu, and Y. Jiang, “Cross-modal transferable adversarial attacks from images to videos,” in CVPR, 2022, pp. 15 064–15 073.
- [33] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, “Modeling spatial-temporal clues in a hybrid deep learning framework for video classification,” in ACMMM, 2015, pp. 461–470.
- [34] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token vit: Training vision transformers from scratch on imagenet,” in ICCV, 2021.
- [35] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014, pp. 391–405.
- [36] K. Zhou, Y. Qiao, and T. Xiang, “Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward,” in AAAI, 2018.
- [37] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” NeurIPS, 1999.
- [38] M. T. Spaan, “Partially observable markov decision processes,” in Reinforcement Learning, 2012, pp. 387–414.
- [39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint:1707.06347, 2017.
- [40] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint:1212.0402, 2012.
- [41] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a large video database for human motion recognition,” in ICCV, 2011, pp. 2556–2563.
- [42] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in CVPR, 2017.
- [43] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in CVPR, 2019, pp. 6202–6211.
- [44] M. Contributors, “Openmmlab’s next generation video understanding toolbox and benchmark,” https://github.com/open-mmlab/mmaction2, 2020.
- [45] P. Marziliano, F. Dufaux, S. Winkler, and T. Ebrahimi, “A no-reference perceptual blur metric,” in ICIP, vol. 3, 2002.
- [46] A. Hore and D. Ziou, “Image quality metrics: PSNR vs. SSIM,” in ICPR, 2010, pp. 2366–2369.
- [47] S.-Y. Lo, J. M. J. Valanarasu, and V. M. Patel, “Overcomplete representations against adversarial videos,” in ICIP, 2021.
- [48] C. Xiao, R. Deng, B. Li, T. Lee, B. Edwards, J. Yi, D. Song, M. Liu, and I. Molloy, “Advit: Adversarial frames identifier based on temporal consistency in videos,” in CVPR, 2019, pp. 3968–3977.
- [49] A. Ilyas, L. Engstrom, and A. Madry, “Prior convictions: Black-box adversarial attacks with bandits and priors,” arXiv preprint:1807.07978, 2018.
- [50] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in AISec, 2017.
Xingxing Wei received his Ph.D. degree in computer science from Tianjin University and his B.S. degree in Automation from Beihang University, China. He is now an Associate Professor at Beihang University (BUAA). His research interests include computer vision, adversarial machine learning, and their applications to multimedia content analysis. He has authored papers in refereed journals and conferences such as IEEE TPAMI, TMM, TCYB, TGRS, IJCV, PR, CVIU, CVPR, ICCV, ECCV, ACMMM, AAAI, and IJCAI.
Songping Wang is now a Master's student at the School of Software, Beihang University (BUAA). His research interests include deep learning and adversarial robustness in machine learning.
Huanqian Yan is pursuing his Ph.D. degree at the School of Computer Science and Engineering, Beihang University, Beijing, China. He received his Master's degree in computer application and technology from Lanzhou University in July 2018 and his Bachelor's degree in computer science and technology from Changchun University of Science and Technology in July 2015. His current research interests include object detection, adversarial examples, and clustering analysis.