
Recover Triggered States: Protect Model Against Backdoor Attack in Reinforcement Learning

Hao Chen1, Chen Gong1, Yizhe Wang1, Xinwen Hou1 (corresponding author)
1Institute of Automation, Chinese Academy of Sciences, Beijing, China
This work was supported by the Chinese Academy of Sciences and the University of Chinese Academy of Sciences.
Abstract

A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thus inserting a backdoor into the trained agent. Such attacks compromise the RL system’s reliability, leading to potentially catastrophic results in various key fields. In contrast, relatively limited research has investigated effective defenses against backdoor attacks in RL. This paper proposes the Recovery Triggered States (RTS) method, a novel approach that effectively protects the victim agents from backdoor attacks. RTS involves building a surrogate network to approximate the dynamics model. Developers can then recover the environment from the triggered state to a clean state, thereby preventing attackers from activating backdoors hidden in the agent by presenting the trigger. When training the surrogate to predict states, we incorporate agent action information to reduce the discrepancy between the actions taken by the agent on predicted states and the actions taken on real states. RTS is the first approach to defend against backdoor attacks in a single-agent setting. Our results show that using RTS, the cumulative reward only decreased by 1.41% under the backdoor attack.

I Introduction

While deep reinforcement learning (DRL) algorithms have brought significant improvements in efficiency in real-world applications [1], a growing body of research has recently focused on the potential threat of backdoor attacks on agents trained through DRL algorithms. Malicious attackers can poison the training data or manipulate the training environment of the agents. Developing effective defenses against backdoor attacks in RL is crucial to ensure the security and trustworthiness of RL-based systems. Backdoor attacks can target RL agents used in critical applications such as autonomous driving [2, 3], resource scheduling [4], robotic control [5], and competitive games [3]. Although the threat of backdoor attacks on DRL agents is widespread, the research into developing effective defenses to help agents evade such attacks in RL has been insufficient.

A previous method proposed eliminating backdoors in multi-agent scenarios [6] with a backdoor seeker and a memory-loss procedure. However, this method is impractical in single-agent scenarios, because we cannot deliberately adjust the observation as we wish in the single-agent setting. Moreover, reverse-engineering methods designed to locate triggers in the dataset cannot locate the triggers of untargeted attacks [7, 8]. Overall, new defense methods are needed specifically for single-agent scenarios. In practice, once a backdoor is injected into an RL agent, it becomes difficult to remove. Additionally, it can be challenging to determine whether an agent has been implanted with a backdoor if the malicious user does not present triggers. Our defense is inspired by the observation that an attacker must manipulate the agent's observation by adding the trigger in order to activate the hidden backdoor within it. Therefore, we design a novel defense approach called "Recovery Triggered States" (RTS), which aims to restore a state that has been tampered with by the trigger back to its original state. The backdoor then cannot be activated without the presentation of the trigger.

To recover the triggered states, we exploit a dynamics model: we build a surrogate network to approximate the transition function of the environment. We can then leverage this network to restore a triggered state to its original state. The agent performs an action $a$ in a state $s$, and the environment transitions to the next state $s^{\prime}$ according to the environment dynamics $T$. If attackers manage to steal the trigger $\delta$ and apply it to the state $s^{\prime}$, they can present the manipulated state, denoted $s^{\prime}+\delta$, to the victim agent in order to activate the backdoor. However, if we have access to the transition function $T$, we can easily restore the original state by computing $s^{\prime}=T(s,a)$. We propose a defense strategy called Recovery Triggered States (RTS), which leverages the consistency of the environment's dynamics to prevent the backdoor from being triggered.

Current methods use a deterministic dynamics model that focuses solely on the accuracy of state prediction to simulate the environment dynamics; we refer to this as a single-objective dynamics model. However, when employing a single-objective dynamics model to defend against trigger-stage attacks in continuous control scenarios, even small gaps between predicted and real states can cause the policy to execute significantly different actions, resulting in a decline in defense performance. This problem is related to the compounding errors caused by undesired actions in model inference tasks. In a model inference task of length $\mathcal{L}$, the agent receives the true state $s_{t_0}$ at the beginning and then acts on states predicted by the dynamics model. The dynamics model receives the true state $s_{t_0}$, and at every subsequent step it predicts the next state $s^{p}_{t+1}$ from the predicted state $s^{p}_{t}$ and the policy's action $\pi(s^{p}_{t})$. The environment thus receives the agent's actions $\pi(s_{t_0}),\pi(s^{p}_{t_1}),\ldots,\pi(s^{p}_{t_{\mathcal{L}}})$. The vulnerability and over-fitting of neural networks cause the policy to execute different actions on predicted states than on real states.

Compared to traditional dynamics model training, which solely focuses on fitting the transition function of the environment, we recognize the importance of accurate action prediction on the predicted states. Thus, we add an action-related regularization term to the training target of the single-deterministic dynamics model. RTS specifically utilizes a dual-objective dynamics model as the defender, which generates predicted states that both conform to the transition dynamics and guide the agent to execute actions similar to those taken in the original states. This approach helps to reduce compounding errors. As shown by our theoretical analysis, we can improve defense performance by utilizing predicted states that guide the agent to execute actions similar to those taken on the original states. Thus, in both single and consecutive attack scenarios, the dual-objective dynamics model performs better than the single-objective dynamics model.

We conducted extensive experiments to compare the performance of victim agents in different settings: those where backdoors were not triggered, those where attackers triggered backdoors, and those where the defender protected the input from being corrupted by attackers to trigger backdoors. The results show that the dual-objective defender can mitigate the harm from backdoor attacks, maintaining on average 101.7% of the agent's performance under single-step attacks and 95.49% under consecutive two-step attacks, while the single-objective defender maintains 86.59% under single-step attacks and 15.01% under consecutive two-step attacks. When agents are exposed directly to backdoor attacks without the protection of RTS, the performance of the victim agent decreases to 57.57% and 4.74%, respectively.

In this work, we establish an upper bound on the performance of backdoor attacks in terms of their ability, through the trigger-stage attack, to make the action distribution deviate from that taken on real states. The advantages of our method are its ability to prevent both targeted and untargeted attacks from being triggered, as well as improving the similarity between the agent's actions on predicted states and those taken on real states. In summary, we have developed a defense mechanism against backdoor attacks using a dynamics model, and proposed a dynamics model that is specifically designed for situations involving consecutive attacks. To the best of our knowledge, RTS is the first work that utilizes a dynamics model for this purpose. We believe that our work can serve as a source of inspiration for new ideas in the design of dynamics models and anomaly detection techniques. The replication package of RTS is available online: https://github.com/Spaxilia/Recover-triggered-states.

II Related work

II-A Backdoor attacks

Backdoor attacks can be planted in a large number of systems, such as brain-computer interfaces [9], traffic control systems [10], and code generation [11]. Backdoor attacks can also target different types of neural networks, such as transformer models [12]. The backdoor can be activated through imperceptible media such as ultrasonic sound waves [13]. A backdoor attacker can set multiple triggers to reinforce the attack's effectiveness [14]. Researchers introduced a backdoor attack method against RL agents in single-agent scenarios in electronic games [8]. In the real world, researchers have used backdoor attacks to demolish reinforcement-learning-augmented autonomous driving. In competition scenarios, researchers have proposed attacking an RL agent competing with other agents [15, 16] and triggering the backdoor through a sequence of legal actions allowed in the environment. Researchers have also introduced a way to enhance the effectiveness of backdoor attacks [3].

These backdoor attack methods threaten the use of intelligent agents in many aspects of our society. We should not only highlight data security issues but also design algorithms for defending against backdoor attacks.

Figure 1: Our threat and defense model involves backdoor attackers leaving poisoned datasets online or burying triggers in unprotected datasets. RL developers then unknowingly train agents using these datasets, resulting in the agents being injected with backdoors. The poisoned agents may still perform well in the developers’ view, but when the attacker presents triggered states to the victim agent, its performance deteriorates. To defend against this threat, our model uses a defender to recognize abnormal inputs and provide a predicted state to guide the agent, thereby preventing it from executing an abnormal action.

II-B Defending against backdoor attacks in RL

Many methods for defending against backdoor attacks require true samples, such as [17, 18]. However, true samples cannot be provided in most RL settings, which means some methods used to defend against backdoor attacks in tasks such as image classification cannot be used in RL. Researchers also introduced a backdoor seeker and memory-loss technique to eliminate backdoors in multi-agent competition scenarios [6], but it is not applicable here, as mentioned above. Others have tried to use reverse engineering [7] to locate the polluted data before training the agent, but it does not work well for locating untargeted attacks because random neurons are activated, as shown in [8].

Recently, researchers have also introduced defense methods for classification tasks that do not need true samples, such as using sparsity [19], which may help defend against backdoor attacks in RL.

III Preliminaries

Deep reinforcement learning (DRL) algorithms train an agent to solve sequential decision-making tasks modeled as a Markov decision process (MDP). An MDP is formally described as a 4-tuple $\langle\mathcal{S},\mathcal{A},\mathcal{R},\mathcal{T}\rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{R}$ is the reward function, and $\mathcal{T}$ is the transition function.

In DRL, an agent interacts with an environment over a series of time steps. At each time step, the agent observes a state $s_{t}$ and selects an action $a_{t}$ based on its policy $\pi(\cdot|s_{t})$. The agent then receives an immediate reward $r_{t}$ from the environment based on its action, and the environment transitions to a new state $s_{t+1}$ as determined by the transition function $\mathcal{T}(\cdot|s_{t},a_{t})$. This process continues until the agent reaches a termination state $s_{T}$, generating a trajectory $\mathcal{N}:(s_{0},a_{0},r_{0},s_{1},a_{1},r_{1},\cdots,s_{T},a_{T},r_{T})$. The cumulative discounted return $R(\mathcal{N})=\sum_{i=0}^{T}g^{i}r_{i}$ is the sum of discounted rewards along the trajectory, where the discount factor $g\in(0,1)$. The agent's goal is to learn an optimal policy $\pi^{*}$ that maximizes the expected cumulative discounted return over all possible trajectories,

\pi^{\ast}=\arg\max_{\pi}\mathbb{E}\left[R(\mathcal{N}_{\pi})\right], \quad (1)

where $\mathcal{N}_{\pi}$ denotes the distribution of trajectories generated by policy $\pi$.

When an agent is implanted with a backdoor, it behaves normally until it encounters a specific trigger $\delta$ in its observation, at which point it fails or behaves abnormally. The trigger can be a specific pattern or input that the agent has been trained to recognize as a signal to perform a specific action. We formalize the target of a backdoor attack as follows,

\max\mathbb{E}\left[R(\mathcal{N}_{\pi_{b}})\right]+\min\mathbb{E}\left[R\left((\mathcal{N}+\delta)_{\pi_{b}}\right)\right] \quad (2)

where $\pi_{b}$ refers to the policy that has been implanted with a backdoor, while $\mathcal{N}+\delta$ denotes a trajectory that includes a state in which the backdoor trigger is present.

IV The Recovery Triggered States Method

IV-A The Threat and Protection Model

In our threat model, a backdoor attack can be separated into two stages: the injection stage and the trigger stage. In the injection stage, the attacker implants backdoors into the dataset; the RL trainer then uses the polluted dataset to train the model, which causes the agent to be poisoned, as shown in Fig. 1. The proportion of polluted data in the dataset is very low [5, 8], which decreases the possibility of detection by human inspection. Backdoor attacks on RL are notorious because they can hardly be removed once injected. Moreover, human inspection is impractical because the amount of data needed to train an RL agent with an offline RL algorithm is too large; for example, in general we need 10M to 20M pieces of data to train an RL agent with an offline RL algorithm in the environment of our experiment. The injection-stage attack warns us to watch the data we use for training agents. In the trigger stage, the attacker can launch single-step or consecutive attacks; the latter can further deviate the victim agent's actions from their normal values, causing more severe damage.

In our defense model, we propose to defend against backdoor attacks at the trigger stage. Before deployment, we let the victim agent interact with a clean environment and collect trajectories to train the dynamics model; we then apply the dynamics model when the victim agent is deployed. Our defense method is practical because the amount of data needed to train the dynamics model is relatively small; for example, in our experiment we need 50K pieces of data to train a dynamics model for single-step state prediction with an L2 prediction error below 4. With little effort we can prevent the agent from being harassed by backdoor attacks, and if we use RL evaluation methods to cover more circumstances before deployment, we can further improve the defense performance of RTS.

Overall, our method defends against the attack in a unified and simple way: the defender can detect and prevent the trigger-stage attack using the same dynamics model, and it protects the victim agent from both targeted and untargeted attacks. The tenet of our defense strategy can be summarized as follows.

\mathbb{E}\left[R(\mathcal{N}_{\pi_{b}})\right]=\mathbb{E}\left[R\left(T^{d}(\mathcal{N}+\delta)_{\pi_{b}}\right)\right] \quad (3)

In Equation (3), $T^{d}$ is our defender. If an attacker wants to trigger the backdoor, he must distort the input, which breaks the continuity of the environment dynamics, especially in continuous control scenarios; our defender can therefore easily capture the difference.

In image-input scenarios, it can be hard to distinguish whether the incoming state is triggered by merely comparing the predicted state with the incoming state. In this case, we can use the difference between the policy's action on the predicted state and that on the incoming state to determine whether the state is triggered. Our proposed action-related regularization term is particularly helpful here, because $\pi(s^{p})$ will be similar to $\pi(s)$, improving the accuracy of detecting a trigger-stage attack.

Overall, the performance of a backdoor attack is bounded both by whether the attacker can successfully trigger the backdoor and by its ability to change the action through the triggered input $s^{*}$. Our RTS method provides a simple and efficient way to detect abnormal states and prevent attacks.

IV-B The Design and Use of the Defender

Algorithm 1 The RTS Method for Defending Against the Backdoor Attack in Continuous Control

Input: Victim Agent $\pi_{V}$, Dynamics Model $T$, Environment env, Episode Length $N$, Final State $s_{F}$, Reward $r$, Done flag done, Threshold $H$.

1:  $s^{(0)} = env.start$
2:  $n = 0$
3:  while $n < N$ and not done do
4:     $a_{V}^{(n)} = \pi_{V}(s^{(n)})$
5:     $s^{(n+1)}, r, done = env.step(a_{V}^{(n)})$
6:     At this point, the attacker may skew $s^{(n+1)}$ into a triggered state $s^{*}$
7:     $s^{p} = T(s^{(n)}, a_{V}^{(n)})$
8:     if $\left\|s^{p}-s^{(n+1)}\right\|_{2} > H$ then
9:        $s^{(n+1)} = s^{p}$
10:     else
11:        keep $s^{(n+1)}$ as observed
12:     end if
13:     $n \mathrel{+}= 1$
14:  end while
15:  return
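A Python sketch of this defense loop is given below. It follows Algorithm 1 under our own assumptions: a classic Gym-style env, a callable policy(state) for $\pi_{V}$, and a callable defender(state, action) that returns the predicted next state; all function and variable names are ours, not from the paper.

```python
import numpy as np

def rts_episode(env, policy, defender, threshold, max_steps):
    """One episode protected by RTS (a sketch of Algorithm 1).

    Assumptions (ours): `env` follows the classic Gym API, `policy(state)`
    returns the victim agent's action, and `defender(state, action)` returns
    the predicted next state.
    """
    state = env.reset()
    done, n = False, 0
    while n < max_steps and not done:
        action = policy(state)
        next_obs, reward, done, info = env.step(action)  # the attacker may tamper with next_obs here
        predicted = defender(state, action)               # s^p = T(s^(n), a_V^(n))
        if np.linalg.norm(predicted - next_obs) > threshold:
            state = predicted                             # recover: act on the predicted clean state
        else:
            state = next_obs                              # no anomaly: trust the observation
        n += 1
```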

Model choice: In designing the structure of the defender, considering data expenditure, we use a single deterministic model to simulate the environment dynamics. In the experiment, the error between the predicted state and the real state given by the model is within an L2 norm of 4.

Data collection: In a clean environment, we let the poisoned agent perform tasks and collect the trajectory data. This premise can be easily fulfilled, because we do not need the backdoor to be triggered to collect the data, which saves data collection time.
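A minimal sketch of this collection step, under the same Gym-style assumptions as the loop above (the function name is ours):

```python
def collect_clean_transitions(env, policy, num_steps):
    """Roll out the (possibly poisoned) agent in the clean environment and
    gather (s_{t-1}, a_{t-1}, s_t, a_t) tuples for training the defender.
    The backdoor never has to be triggered during this collection.
    """
    data = []
    state = env.reset()
    for _ in range(num_steps):
        action = policy(state)
        next_state, reward, done, info = env.step(action)
        data.append((state, action, next_state, policy(next_state)))
        state = env.reset() if done else next_state
    return data
```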

Train the Defender: After the data is collected, we begin to train a defender. The training target is denoted in (4), where $T^{d}$ is the defender, $s_{t-1}$ and $a_{t-1}$ are the inputs to the defender, and $s_{t}$ and $a_{t}$ are the targets.

T^{d}=\underset{T(\cdot\,;\theta)}{\operatorname{argmin}}\left\|s_{t}-T\left(s_{t-1},a_{t-1};\theta\right)\right\|_{2}+\left\|a_{t}-\pi\left(T\left(s_{t-1},a_{t-1};\theta\right)\right)\right\|_{2} \quad (4)
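A minimal PyTorch sketch of the dual objective in (4), assuming model(state, action) returns the predicted next state and policy is the frozen victim policy (function and variable names are ours):

```python
import torch

def dual_objective_loss(model, policy, s_prev, a_prev, s_next, a_next):
    """Dual-objective defender loss of Eq. (4): state term plus action term."""
    s_pred = model(s_prev, a_prev)                          # T(s_{t-1}, a_{t-1}; theta)
    state_loss = torch.norm(s_next - s_pred, p=2, dim=-1)   # || s_t - s_pred ||_2
    # Action-related regularization: the policy's action on the predicted state
    # should match the action it takes on the real next state.  Gradients flow
    # through the frozen policy into the dynamics model's parameters.
    action_loss = torch.norm(a_next - policy(s_pred), p=2, dim=-1)
    return (state_loss + action_loss).mean()
```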

Defend the Attack: At time step $t$, the defender takes the previous state $s_{t-1}$ and the agent's action $a_{t-1}$ and gives a prediction $s_{t}^{p}=T^{d}(s_{t-1},a_{t-1})$. The incoming state $s_{t}$ should be close to $s_{t}^{p}$. The attacker may distort a state into a triggered state $s_{t}^{*}$. The defender considers the difference between $s_{t}$ and $s_{t}^{p}$. We could also measure the difference of other quantities, such as the difference between $\pi(s_{t})$ and $\pi(s_{t}^{p})$. If the difference exceeds the threshold $h$, the defender proceeds to recover the state. In the experiment we use the L2 norm, as shown in (5).

\left\|T^{d}(s_{t-1},a_{t-1})-s_{t}\right\|_{2}>h \quad (5)

If the defender captures an abnormal distance between $s_{t}$ and $s_{t}^{p}$, it recovers the prediction $s_{t}^{p}$ for the victim agent to act on, which finally stops the backdoor from being triggered. Neural-network dynamics models have a better ability to capture the dynamics in an episodic manner [20, 21], which encourages us to use such a model to simulate the dynamics of a fixed policy $\pi$ interacting with the environment.
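The detection rule in (5) requires choosing the threshold $h$. One simple calibration recipe, which is our assumption and not specified in the paper, is to take a high percentile of the prediction errors on held-out clean transitions:

```python
import numpy as np

def calibrate_threshold(defender, clean_transitions, percentile=99.9):
    """Choose the detection threshold h from clean held-out transitions.

    `clean_transitions` is assumed to be a list of (s_prev, a_prev, s_next)
    numpy arrays; the percentile trades false alarms against missed triggers.
    """
    errors = [np.linalg.norm(defender(s_prev, a_prev) - s_next)
              for s_prev, a_prev, s_next in clean_transitions]
    return float(np.percentile(errors, percentile))
```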

IV-C Defense performance bound and its similarity with model inference tasks

The trigger-stage attack of a backdoor attack is similar in process to an adversarial attack. In both kinds of attack, the attacker skews the input of the victim agent, inducing it to perform a different action, and neither attack influences the dynamics of the environment. Thus, we can use the performance bound of adversarial noise attacks to bound the attack performance of backdoor attacks.

Figure 2: Dual-objective training target of our defender. By adding an action-related term, we improve the dynamics model's ability to predict a state on which the agent takes actions similar to those taken in real states.

Previous work has defined a performance bound for adversarial attacks, shown in (6), which indicates that the effect of an adversarial attack is determined by its ability to change the policy's action on the noised input. This equation also gives an upper bound on the maximum difference between value functions under distinct action distributions [22].

\max_{s\in S}\left\{V_{\pi}(s)-V_{\pi}(s^{*})\right\}\leq\alpha\max_{s\in S}\max_{s^{*}\in\beta(s)}D_{TV}\left(\pi(\cdot|s),\pi(\cdot|s^{*})\right) \quad (6)

In Equation (6), $V_{\pi}(\cdot)$ is the value function of the policy $\pi$, $\beta(\cdot)$ outputs $s^{*}$, the state with adversarial noise added, $D_{TV}(\pi(\cdot|s),\pi(\cdot|s^{*}))$ is the total variation distance between $\pi(\cdot|s)$ and $\pi(\cdot|s^{*})$, and $\alpha$ is a constant that does not depend on $\pi$.

As explained above, because these two attack methods share certain similarities, we use (7) to establish a limit on the effectiveness of the backdoor attack in terms of its capacity to sway the victim's policy through the triggered input $s^{*}$.

\max_{s\in S}\left\{V_{\pi^{*}}(s)-V_{\pi^{*}}(s^{*})\right\}\leq\alpha\max_{s\in S}\max_{s^{*}\in T(s)}D_{TV}\left(\pi^{*}(\cdot|s),\pi^{*}(\cdot|s^{*})\right) \quad (7)

In Equation (7), $\pi^{*}$ denotes the poisoned agent, $T(\cdot)$ outputs $s^{*}$, the state triggered by the attacker, and $D_{TV}(\pi^{*}(\cdot|s),\pi^{*}(\cdot|s^{*}))$ is the total variation distance between $\pi^{*}(\cdot|s)$ and $\pi^{*}(\cdot|s^{*})$.

Thus, as shown in (8), we can evaluate the defender's performance in preventing the trigger-stage attack by examining the difference between $\pi(\cdot|s_{t})$ and $\pi(\cdot|T^{d}(s_{t-1},a_{t-1}))$.

\max_{s\in S}\left\{V_{\pi^{*}}(s)-V_{\pi^{*}}(T^{d}(s^{*}))\right\}\leq\alpha\max_{s\in S}\max_{s^{*}\in T(s)}D_{TV}\left(\pi^{*}(\cdot|s),\pi^{*}(\cdot|T^{d}(s_{t-1},a_{t-1}))\right) \quad (8)

This equation shows the importance of the defender's ability to generate a predicted state that guides the victim agent's actions to closely approximate those in the true state, in order to effectively defend against attacks. Once the defender detects a trigger-stage attack, it lets the victim agent act on nearly normal input, rather than maliciously controlled input, to prevent the attack.

We aim to use a single deterministic dynamics model as the defender to benefit from the generalization ability of neural networks while avoiding the large data consumption associated with recurrent neural networks. The training target of the traditional single-deterministic dynamics model is shown in (9),

T^{d}=\underset{T(\cdot\,;\theta)}{\operatorname{argmin}}\left\|s_{t}-T\left(s_{t-1},a_{t-1};\theta\right)\right\|_{2} \quad (9)

which is very common. However, in continuous control scenarios, even when the gaps between predicted states and real states are small, the policy may execute very different actions.

The problem can be illustrated in (10).

\|\pi(s)-\pi(s+\Delta)\|_{2}\geq L,\ \text{where}\ \|\Delta\|\leq d \quad (10)

This issue hinders defense performance, and it is more severe in consecutive attack scenarios.

This problem is related to the accuracy degradation caused by undesired actions in model inference tasks. When the policy receives a predicted state that differs from the real state, it executes an action different from the one it would take in the real state. This action further causes the dynamics model to generate states that differ from the real states, even if the dynamics model can predict states accurately within a certain bound, and the errors finally accumulate into large compounding errors. Thus, one way to improve model inference performance and reduce the effect of compounding errors is to generate a state that guides the agent to execute an action similar to the one taken on the real state, so that the dynamics model can keep predicting accurately. The existence of adversarial noise [23, 22] reminds us that with a little perturbation of the input, the output of a neural network can be very different. So it is reasonable to use this regularization term in single-deterministic dynamics models as well as other dynamics models to reduce accumulated errors.

s_{t+1}^{p}=f_{\theta}\left(s_{t}^{p},\pi(s_{t}^{p})\right) \quad (11)
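A short sketch of the recursion in (11), under the same assumption as the earlier sketches that model(state, action) returns the predicted next state (function and variable names are ours):

```python
def model_rollout(model, policy, s0, horizon):
    """Roll the dynamics model forward from the true start state s0 (Eq. 11).

    Any gap between the action taken on a predicted state and the action the
    policy would take on the real state feeds back into the next prediction;
    this is how compounding errors build up over the horizon.
    """
    states, actions = [s0], []
    s_pred = s0
    for _ in range(horizon):
        a = policy(s_pred)            # pi(s^p_t)
        s_pred = model(s_pred, a)     # s^p_{t+1} = f_theta(s^p_t, pi(s^p_t))
        actions.append(a)
        states.append(s_pred)
    return states, actions
```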

A dual-objective dynamics model performs better in both single and consecutive attack scenarios, as shown in Fig. 3. The defense performance of the single-objective dynamics model degrades because the agent does not act as expected, and the error further causes the dynamics model to predict undesired states, finally causing the failure of the agent. With the dual-objective dynamics model as the defender, the agent plays to the end of the episode.

Figure 3: A comparison of different conditions under a single attack at $t_{m}$ and consecutive attacks at $t_{n}$ and $t_{n+1}$. When we deploy poisoned agents without protection, attackers can easily make the agent fail. When we deploy agents protected by the single-objective defender, the defender can provide protection against a single-step attack, but fails to protect the agent under consecutive attacks due to compounding errors. When we deploy agents protected by the dual-objective defender, it can protect the poisoned agent from both single-step and consecutive trigger-stage attacks.

By guiding the agent to execute actions as if on the real state, the defender achieves a smaller distance $D_{TV}(\pi^{*}(\cdot|s),\pi^{*}(\cdot|T^{d}(s^{*})))$, which improves its performance against trigger-stage attacks, as shown in (8).

V Experiments

Figure 4: Comparisons in the single step attack. We can see that in the second scenario, the dual-objective dynamics model outperforms the single-objective dynamics model.
Figure 5: Comparisons in the consecutive two-step attack show that our dual-objective dynamics model outperforms the single-objective model.

Fig. 4 and Fig. 5 show the results of our experiments, where the x-axis indicates the episode index and the y-axis indicates the total return of the policy. We test victim agents trained on the MuJoCo Hopper environment, taking the victim agent's policy at two points during training on Hopper, at 500K steps and 800K steps respectively.

V-A The defender model’s structure

We use data collected while the victim agent interacts with the environment to train the defender. The defender's network is: (State Dimension + Action Dimension, 256), (256, 256), (256, State Dimension). When collecting the data, we modify the policy by adding Gaussian noise to its actions to maintain exploration. We find that applying the noise with probability 0.01 is sufficient for sampling the data. We use 50K pieces of data to train a defender in the Hopper environment.
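A PyTorch sketch of this defender network with the layer sizes above (the ReLU activation is our assumption; the paper does not state which nonlinearity is used):

```python
import torch
import torch.nn as nn

class Defender(nn.Module):
    """Single deterministic dynamics model: (state, action) -> predicted next state."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),   # activation assumed, not specified in the paper
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```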

V-B The single-step attack

We use the deep deterministic policy gradient (DDPG) [24] to train the victim model, and inject the targeted backdoor attack data after the policy has run for 25K steps to reproduce the injection-stage attack. The proportion of polluted data is 4% of a 25K-step dataset, 0.2% of a 500K-step dataset, and 0.0125% of an 800K-step dataset.

In the single-step attack scenario, we trigger the backdoor once every twenty steps by distorting the input while a well-trained victim agent is working, as shown in Fig. 4. On average, the victim agent scores 1523.09 and 2994.43 points without attack in the two scenarios. In contrast, the scores decrease to 1246.96 and 998.28 under the trigger-stage attack. The scores rise again to 1526.24 and 2213.60 when protected by the single-objective defender, and to 1515.84 and 3112.66 when protected by the dual-objective defender.

As the agent gradually learns the optimal strategy and executes more steps in a trajectory, the dual-objective defender provides better defense.

V-C The consecutive attack

We then perform the attack for two consecutive steps in every twenty steps, with results shown in Fig. 5. Under the trigger-stage attack, the scores decrease to 112.08 and 63.84. The scores rise to 112.89 and 679.38 when protected by the single-objective defender, and to 1340.88 and 3084.50 when protected by the dual-objective defender.

Our dual-objective defender provides better defense performance than the single-objective defender. Due to the existence of perturbations, the scores of the protected agents in attack scenarios are sometimes slightly higher than those of the agents that have not been attacked; this deviation is acceptable.

Figure 6: Comparisons of action loss and state loss using different models. In general, on the states predicted by the dual-objective dynamics model, the agent's actions are more similar to those taken on real states. However, the states predicted by the single-objective dynamics model are more similar to the real states.

V-D The state loss and action loss

To verify that the dual-objective dynamics model can generate states that guide the agent to execute actions similar to those taken on real states, we measure the difference between predicted states and real states in a length $\mathcal{L}=1$ model inference task. The predicted states are produced separately by the single-objective model and the dual-objective model.

For the agent trained for 500K steps, the average L2 distance between the real state and the states predicted by the single-objective defender and the dual-objective defender was 0.1715 and 0.3125, respectively. The average L2 distance between the actions on predicted states and those on real states was 0.3525 and 0.0796, respectively.

For the agent trained for 800K steps, the average L2 distance between the real state and the states predicted by the single-objective defender and the dual-objective defender was 0.4099 and 0.6304, respectively. The average L2 distance between the actions on predicted states and those on real states was 0.7043 and 0.3941, respectively.

The results are shown in Fig. 6. They confirm that our dual-objective defender can predict states that guide the agent to act as if on real states.

VI Discussion and Future Work

The experimental results also confirm our theorem that the defense performs better when the agent's actions on the predicted states are closer to the actions it executes on real states; thus the dual-objective dynamics model performs better than the single-objective dynamics model as a defender against trigger-stage attacks.

We also want to introduce our regularization to other dynamics models and eventually use them in model-based RL. Like the long-gap-term prediction model trained with episodic data in [21], we train a dual-objective model with data generated by a fixed policy interacting with the environment, so we face an associated question: can an episode-aware dynamics model yield better performance in model-based RL? Whether our regularization can improve model-based reinforcement learning remains to be revealed. Moreover, it is possible to apply the action-related regularization term to other types of dynamics models, but whether it is effective in such cases remains to be researched.

VII CONCLUSIONS

Our method can mitigate backdoor attacks by applying a dynamics model to detect and defend against trigger-stage backdoor attacks. Current works on trigger design ignore the criterion of whether the trigger breaks the consistency held in the environment; our work points out this leak, and how to plant a backdoor without being detected by a dynamics model is an issue worth researching. We discuss the similarities between adversarial attacks and backdoor attacks and give a performance bound for backdoor attacks. We present a dual-objective dynamics model that can reduce the compounding errors brought by inaccurate state predictions. To the best of our knowledge, this is the first work to use a dynamics model for defending against backdoor attacks, as well as the first to introduce an action-related regularization term in the training of a single-deterministic dynamics model, which we hope will provoke new ideas on the usage of dynamics models in fields like anomaly detection and model inference tasks.

References

  • [1] C. Gong, Q. He, Y. Bai, X. Hou, G. Fan, and Y. Liu, “Wide-sense stationary policy optimization with bellman residual on video games,” in 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1–6.
  • [2] S. Ding, Y. Tian, F. Xu, Q. Li, and S. Zhong, “Trojan attack on deep generative models in autonomous driving,” in International Conference on Security and Privacy in Communication Systems.   Springer, 2019, pp. 299–318.
  • [3] C. Gong, Z. Yang, Y. Bai, J. He, J. Shi, A. Sinha, B. Xu, X. Hou, G. Fan, and D. Lo, “Mind your data! hiding backdoors in offline reinforcement learning datasets,” arXiv preprint arXiv:2210.04688, 2022.
  • [4] Z. Huang, Q. Wang, Y. Chen, and X. Jiang, “A survey on machine learning against hardware trojan attacks: Recent advances and challenges,” IEEE Access, vol. 8, pp. 10 796–10 826, 2020.
  • [5] G. W. Clark, M. V. Doran, and T. R. Andel, “Cybersecurity issues in robotics,” in 2017 IEEE conference on cognitive and computational aspects of situation management (CogSIMA).   IEEE, 2017, pp. 1–5.
  • [6] J. Guo, A. Li, and C. Liu, “Backdoor detection in reinforcement learning,” arXiv preprint arXiv:2202.03609, 2022.
  • [7] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” in 2019 IEEE Symposium on Security and Privacy (SP).   IEEE, 2019, pp. 707–723.
  • [8] K. Panagiota, W. Kacper, S. Jha, and L. Wenchao, “Trojdrl: Trojan attacks on deep reinforcement learning agents,” in Proc. 57th ACM/IEEE Design Automation Conference (DAC), 2020.
  • [9] L. Meng, J. Huang, Z. Zeng, X. Jiang, S. Yu, T.-P. Jung, C.-T. Lin, R. Chavarriaga, and D. Wu, “Eeg-based brain-computer interfaces are vulnerable to backdoor attacks,” arXiv preprint arXiv:2011.00101, 2020.
  • [10] Y. Wang, M. Maniatakos, and S. E. Jabari, “A trigger exploration method for backdoor attacks on deep learning-based traffic control systems,” in 2021 60th IEEE Conference on Decision and Control (CDC).   IEEE, 2021, pp. 4394–4399.
  • [11] Z. Yang, B. Xu, J. M. Zhang, H. J. Kang, J. Shi, J. He, and D. Lo, “Stealthy backdoor attack for code models,” 2023.
  • [12] M. Zheng, Q. Lou, and L. Jiang, “Trojvit: Trojan insertion in vision transformers,” arXiv preprint arXiv:2208.13049, 2022.
  • [13] S. Koffas, J. Xu, M. Conti, and S. Picek, “Can you hear it? backdoor attacks via ultrasonic triggers,” in Proceedings of the 2022 ACM Workshop on Wireless Security and Machine Learning, 2022, pp. 57–62.
  • [14] X. Zhou, W. Jiang, S. Qi, and Y. Mu, “Multi-target invisibly trojaned networks for visual recognition and detection.” in IJCAI, 2021, pp. 3462–3469.
  • [15] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch, “Emergent complexity via multi-agent competition,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=Sy0GnUxCb
  • [16] C. Gong, Z. Yang, Y. Bai, J. Shi, A. Sinha, B. Xu, D. Lo, X. Hou, and G. Fan, “Curiosity-driven and victim-aware adversarial policies,” in Proceedings of the 38th Annual Computer Security Applications Conference, ser. ACSAC ’22, 2022, p. 186–200. [Online]. Available: https://doi.org/10.1145/3564625.3564636
  • [17] J. Hayase, W. Kong, R. Somani, and S. Oh, “Spectre: Defending against backdoor attacks using robust statistics,” in International Conference on Machine Learning.   PMLR, 2021, pp. 4129–4139.
  • [18] S. Udeshi, S. Peng, G. Woo, L. Loh, L. Rawshan, and S. Chattopadhyay, “Model agnostic defence against backdoor attacks in machine learning,” IEEE Transactions on Reliability, vol. 71, no. 2, pp. 880–895, 2022.
  • [19] T. Chen, Z. Zhang, Y. Zhang, S. Chang, S. Liu, and Z. Wang, “Quarantine: Sparsity can uncover the trojan attack trigger for free,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 598–609.
  • [20] N. Lambert, B. Amos, O. Yadan, and R. Calandra, “Objective mismatch in model-based reinforcement learning,” Proceedings of Machine Learning Research, vol. 120, pp. 1–15, 2020.
  • [21] N. Lambert, A. Wilcox, H. Zhang, K. S. Pister, and R. Calandra, “Learning accurate long-term dynamics for model-based reinforcement learning,” in 2021 60th IEEE Conference on Decision and Control (CDC).   IEEE, 2021, pp. 2880–2887.
  • [22] Z. Xiong, J. Eappen, H. Zhu, and S. Jagannathan, “Defending observation attacks in deep reinforcement learning via detection and denoising,” in 2022 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2022.
  • [23] R. Yang, C. Bai, X. Ma, Z. Wang, C. Zhang, and L. Han, “RORL: Robust offline reinforcement learning via conservative smoothing,” in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds., 2022.
  • [24] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International conference on machine learning.   Pmlr, 2014, pp. 387–395.