Sampling Attacks on Meta Reinforcement Learning: A Minimax Formulation and Complexity Analysis
Abstract
Meta reinforcement learning (meta RL), as a combination of meta-learning ideas and reinforcement learning (RL), enables the agent to adapt to different tasks using a few samples. However, this sampling-based adaptation also makes meta RL vulnerable to adversarial attacks. By manipulating the reward feedback from sampling processes in meta RL, an attacker can mislead the agent into building wrong knowledge from training experience, which deteriorates the agent's performance when dealing with different tasks after adaptation. This paper provides a game-theoretical underpinning for understanding this type of security risk. In particular, we formally define the sampling attack model as a Stackelberg game between the attacker and the agent, which yields a minimax formulation. It leads to two online attack schemes, Intermittent Attack and Persistent Attack, which enable the attacker to learn an optimal sampling attack, defined by an $\epsilon$-first-order stationary point, within $\mathcal{O}(\epsilon^{-2})$ iterations. These attack schemes free-ride on the learner's training process, requiring no extra interactions with the environment. By corroborating the convergence results with numerical experiments, we observe that a minor effort by the attacker can significantly deteriorate the learning performance, and that the minimax approach can also help robustify meta RL algorithms. Code and demos are available at https://github.com/NYU-LARX/Attack_MetaRL.
1 Introduction
Meta Reinforcement Learning (meta RL), as a combination of meta learning and reinforcement learning, intends to learn a decision-making model that can quickly adapt to new tasks. Compared with learning from scratch, meta RL enables the agent to leverage previous training data, requiring fewer samples when dealing with new tasks. The successes of meta RL (Wang et al., 2016; Duan et al., 2016; Finn et al., 2017; Fallah et al., 2020) can be explained from a feature learning standpoint: meta RL aims at building an internal representation of the policy (a meta policy) that is broadly suitable for a collection of similar tasks. This internal representation is learned by either recurrent neural networks (Wang et al., 2016; Duan et al., 2016) or stochastic optimization (Finn et al., 2017; Fallah et al., 2020), when agents are exposed to the collection of tasks via a unique sampling process. In meta RL, in addition to producing sample trajectories within a single task, the sampling process also randomly samples environments for learning a common representation shared across these tasks.
However, this unique sampling process also makes meta RL vulnerable to adversarial attacks. By manipulating sample data collected in the meta training phase, the attacker can mislead the agent into learning a wrong meta policy that adapts poorly to individual tasks from the collection. For example, in Model-Agnostic Meta Reinforcement Learning (MAMRL) (Finn et al., 2017), a unique sampling process is required for optimizing the meta policy. As shown by our numerical results, by slightly perturbing the reward feedback from this sampling process, the attacker can significantly degrade the performance of meta RL agents across different tasks.
Contributions This paper aims to provide a theoretical understanding of adversarial attacks on sampling processes of meta RL, which is referred to as sampling attacks. Due to its mathematical clarity, we choose MAMRL (Finn et al., 2017) as the departure point of our discussion. Yet, the idea of sampling attacks also applies to other meta RL formulations (Wang et al., 2016; Duan et al., 2016) (see Appendix B).
Our contributions are three-fold. 1) We characterize the sampling attack as a Stackelberg game (Von Stackelberg, 2010) between the attacker and the agent (the learner), which yields a minimax formulation. 2) Based on this formulation, two online black-box training-time attack schemes are proposed: Intermittent Attack and Persistent Attack, which enable the attacker to learn optimal sampling attacks concurrently with the meta learning, without extra interactions with the environment. 3) The optimality of sampling attacks is defined by $\epsilon$-first-order stationarity, and our nonasymptotic analysis shows that the proposed attack schemes can achieve this optimality within $\mathcal{O}(\epsilon^{-2})$ iterations.
To the best of our knowledge, this work is among the first endeavors to investigate online training-time adversarial attacks on the sampling processes of meta RL. In addition, our analysis of the nonconvex-nonconcave minimax problem can be extended to other RL problems sharing the same nonconvex nature.
2 Related Works
This section reviews recent progress in the area of adversarial attacks on RL. We position our work using the following categorizations and comment on the differences between existing approaches and the proposed one in this paper, highlighting our contribution to the security study of RL and meta RL.
Testing-time and Training-time Attacks Unlike testing-time attacks, where the attacker intends to degrade the performance of a learned and deployed policy, training-time attacks aim to enforce a target policy in favor of the attacker by manipulating the learning process.
Our work falls within the class of training-time attacks. Even though our attack methods are based on reward manipulations, which have also been explored in the literature (Ma et al., 2019; Huang and Zhu, 2019; Zhang et al., 2020b, 2021), we emphasize that previous works mainly target the Q-learning algorithm (Ma et al., 2019; Huang and Zhu, 2019; Zhang et al., 2020b) or neural policy gradient (Zhang et al., 2021), both of which rely on value estimation. For these value-based algorithms, the influence of the reward manipulation on the value function is more straightforward than in meta RL, where the adaptation step in the meta training complicates the analysis. The optimality results for previous attacks do not apply to meta RL, which makes the theoretical analysis of the optimality of reward manipulation in this paper more challenging.
White-box and Black-box Attacks For both testing-time and training-time attacks on RL agents, existing works have investigated both white-box (Ma et al., 2019; Zhang et al., 2020b; Huang and Zhu, 2019) and black-box settings (Xu et al., 2021; Russo and Proutiere, 2021). For white-box attacks, the attacker needs to fully know the learner’s learning algorithm and the policy model. Some attacks additionally require the knowledge of the environment, e.g., transition dynamics (Xu et al., 2021).
In contrast, the proposed attacks, as black-box attacks (the boundary between white-box and black-box may vary across contexts; we follow the notion used in (Xu et al., 2021) and treat our proposed attack as a black-box attack), do not require prior knowledge of the environment setup or learning algorithms. With access to the policy model, the attacker can adjust its attack by gradient updates, regardless of the learning algorithm or the environment parameters.
Compared with previous black-box attacks (Xu et al., 2021; Russo and Proutiere, 2021), the proposed attacks do not require interactions with the training environment, e.g., the simulator. All it needs are the sample trajectories rolled out by the learner during the training process. In other words, the attacker considered in this paper is a free rider who does not actively collect information regarding the agent’s learning algorithm or the environment.
Offline and Online Attacks Previous works mainly consider offline training-time reward manipulations (Ma et al., 2019; Huang and Zhu, 2019), where the attacker has full knowledge of the training dataset. The attacker makes a one-shot decision on reward manipulation and then provides the poisoned data to the RL agent. In this work, online attacks are studied, where the attacker manipulates the reward feedback sequentially, adapting to the learner's learning process.
3 Preliminaries
Let $\{\mathcal{M}_i\}_{i}$ be the set of finite-horizon Markov Decision Processes (MDPs) representing different tasks/environments. Each task $\mathcal{M}_i$ is given by a tuple $(\mathcal{S}, \mathcal{A}, P_i, r_i, \rho_i, H)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the spaces of states and actions, respectively. For each $i$, $P_i(\cdot \mid s, a)$ denotes the transition kernel that maps a state-action pair $(s, a)$ to a Borel probability measure over the state space. Finally, the agent receives a reward $r_i(s, a)$ when taking action $a$ at state $s$, and the initial state is drawn from the initial distribution $\rho_i$. For the sake of simplicity, we assume that all tasks share the same state and action spaces, as well as the same horizon length $H$; our analysis also applies to cases where different tasks have different MDP formulations (Fallah et al., 2020).
To facilitate our discussion, the following notations are introduced. A trajectory from an MDP $\mathcal{M}_i$ is defined as $\tau = (s_0, a_0, \ldots, s_{H-1}, a_{H-1})$, where $H$ denotes the horizon length. A batch of trajectories under the same MDP is considered a training data set $\mathcal{D}_i$. Given a trajectory $\tau$ (a shorthand for the state-action sequence above), the total reward received over this trajectory is $R(\tau) := \sum_{h=0}^{H-1} r_i(s_h, a_h)$.
Suppose that the agent's policy in $\mathcal{M}_i$, denoted $\pi_\theta$, is parameterized by $\theta$, and $\pi_\theta(a \mid s)$ is the probability of choosing action $a$ at state $s$. Then, given the transition dynamics $P_i$ and the initial distribution $\rho_i$, the probability that a trajectory $\tau$ occurs is $p(\tau; \theta, \mathcal{M}_i) = \rho_i(s_0)\prod_{h=0}^{H-1}\pi_\theta(a_h \mid s_h)\,P_i(s_{h+1} \mid s_h, a_h)$. Therefore, the expected cumulative reward under the policy $\pi_\theta$ in the task $\mathcal{M}_i$ is $J_i(\theta) := \mathbb{E}_{\tau \sim p(\cdot;\theta,\mathcal{M}_i)}[R(\tau)]$. RL algorithms, such as policy gradient (Sutton et al., 2000), seek to find a $\theta$ that maximizes $J_i(\theta)$.
The idea of the policy gradient method is to apply gradient ascent with respect to the objective function $J_i(\theta)$. Following the policy gradient theorem (Sutton et al., 2000), we obtain $\nabla_\theta J_i(\theta) = \mathbb{E}_{\tau\sim p(\cdot;\theta,\mathcal{M}_i)}[g_i(\tau;\theta)]$, where $g_i(\tau;\theta) := \sum_{h=0}^{H-1}\nabla_\theta \log \pi_\theta(a_h\mid s_h)\, R(\tau)$. In RL practice, the policy gradient is replaced by its Monte Carlo (MC) estimation, since evaluating the exact value is intractable. Given a batch of trajectories $\mathcal{D}_i$ under the policy $\pi_\theta$, the MC estimation is $\hat{\nabla}_\theta J_i(\theta;\mathcal{D}_i) = \frac{1}{|\mathcal{D}_i|}\sum_{\tau\in\mathcal{D}_i} g_i(\tau;\theta)$.
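As a concrete illustration, the following is a minimal NumPy sketch of this MC policy-gradient estimator for a tabular softmax policy; the toy MDP, function names, and batch format are our own illustrative choices and not the implementation used in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def log_policy_grad(theta, s, a):
    """Gradient of log pi_theta(a|s) for a tabular softmax policy.

    theta has shape (n_states, n_actions).
    """
    probs = softmax(theta[s])
    grad = np.zeros_like(theta)
    grad[s] = -probs
    grad[s, a] += 1.0
    return grad

def mc_policy_gradient(theta, trajectories):
    """Average over trajectories of (sum_t grad log pi(a_t|s_t)) * total reward.

    Each trajectory is a list of (state, action, reward) tuples.
    """
    estimate = np.zeros_like(theta)
    for traj in trajectories:
        total_reward = sum(r for _, _, r in traj)
        score = sum(log_policy_grad(theta, s, a) for s, a, _ in traj)
        estimate += score * total_reward
    return estimate / len(trajectories)

# toy usage: 3 states, 2 actions, two fake trajectories
theta = np.zeros((3, 2))
trajs = [[(0, 1, 1.0), (2, 0, 0.5)], [(1, 0, -0.2), (2, 1, 0.3)]]
print(mc_policy_gradient(theta, trajs))
```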
Instead of focusing on individual tasks, MAMRL, as introduced in (Finn et al., 2017), aims to find a good initial policy, also called a meta policy, that returns satisfactory rewards in expectation when it is updated using a limited number of gradient steps under a new environment $\mathcal{M}_i$, sampled from a task distribution $p$. In particular, for only one step of policy gradient update with a step size $\alpha$, the optimization problem of MAMRL (Fallah et al., 2020) is
$\max_{\theta}\; \mathbb{E}_{i\sim p}\big[\, J_i\big(\theta + \alpha\, \nabla_\theta J_i(\theta)\big) \,\big]$.   (1)
A stochastic gradient method (SG-MRL) is proposed in (Fallah et al., 2020) for solving the maximization problem (1) (see Algorithm 1), where an unbiased estimator of the gradient of the objective in (1) can be efficiently computed for performing gradient ascent. Suppose, for simplicity, that the data set sampled using the current policy contains only one trajectory. The MC estimation of the meta gradient is given below:
(2)
where the remaining quantities are estimated using another batch of trajectories rolled out under the adapted policy. The average over sampled tasks constitutes an unbiased estimate of the gradient of the objective in (1).
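The estimator in (2) can also be obtained implicitly via automatic differentiation, which is how most MAML-style implementations compute it in practice. The sketch below, written with PyTorch and a linear softmax policy on random data, differentiates through the one-step adaptation; the surrogate-loss form, function names, and dummy data are our illustrative assumptions, not the paper's code (in the real algorithm the outer batch is sampled under the adapted policy).

```python
import torch

def surrogate_loss(policy_params, batch):
    """REINFORCE surrogate: mean over trajectories of (sum_t log pi(a_t|s_t)) * R(tau).

    `batch` is a list of (states, actions, total_reward); a linear softmax
    policy is assumed purely for illustration.
    """
    losses = []
    for states, actions, ret in batch:
        logits = states @ policy_params                      # (T, n_actions)
        logp = torch.log_softmax(logits, dim=-1)
        logp_a = logp.gather(1, actions.unsqueeze(1)).sum()  # sum_t log pi(a_t|s_t)
        losses.append(logp_a * ret)
    return torch.stack(losses).mean()

def meta_objective(theta, inner_batch, outer_batch, alpha=0.1):
    """Estimate of J_i(theta + alpha * grad J_i(theta)) with two batches,
    keeping the graph so that gradients flow through the adaptation step."""
    inner_loss = surrogate_loss(theta, inner_batch)
    grad = torch.autograd.grad(inner_loss, theta, create_graph=True)[0]
    theta_adapted = theta + alpha * grad
    return surrogate_loss(theta_adapted, outer_batch)

# toy usage with random data: 4-dim states, 2 actions
torch.manual_seed(0)
theta = torch.zeros(4, 2, requires_grad=True)
make_batch = lambda: [(torch.randn(5, 4), torch.randint(2, (5,)), torch.randn(()))
                      for _ in range(3)]
obj = meta_objective(theta, make_batch(), make_batch())
meta_grad = torch.autograd.grad(obj, theta)[0]   # estimate of the meta gradient (cf. (2))
print(meta_grad.shape)
```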
4 Sampling Attack Models
In this section, we propose three sampling attack models and characterize them as Stackelberg games between the attacker and the learner. As shown in Algorithm 1, MAMRL includes two sampling processes. The first sampling (line 5 in Algorithm 1) is for the MC estimation of the corresponding policy gradient, while the second sampling (line 7 in Algorithm 1) is for the MC estimation of the gradient of the objective function . The main idea of the sampling attack is to manipulate the reward signal so that MC estimations are corrupted. Since the meta training includes two sampling processes, there are three possible sampling attacks: 1) Inner Sampling Attack (ISA): only attacking the first sampling process, where samples are used to compute the gradient adaptation in the inner loop; 2) Outer Sampling Attack (OSA): only attacking the second sampling process, where samples are used to perform one step meta optimization in the outer loop; 3) Combined Sampling Attack (CSA): attacking both the first and the second sampling process.
Practicability of Sampling Attacks Before delving into the details of these attacks, we first illustrate how this attack can happen in a training environment. For the training process in RL, a learner, as the orange box in Figure 1, usually consists of two components: a sampler and an optimizer. At each state, the sampler chooses an action based on the current policy, and sends it to the simulator, which returns a reward as well as the next state. The state-action-reward tuple will be stored within the sampler, and this process repeats until multiple trajectories are sampled. These sample trajectories make up a training data set fed to the optimizer, which performs one-step policy improvement.
To launch sampling attacks, the attacker first covertly infiltrates the learning system shown in Figure 1 through Advanced Persistent Threat (APT) processes (Zhu and Rass, 2018). Details on this infiltration are included in Section A.1. After the infiltration, the attacker intercepts communication between the sampler and the simulator and sends the learner poisoned reward signals, misleading the learner. One may argue that the attacker could gain control of the whole system, and its possible maneuvers are more than reward manipulation (e.g., compromising the simulator, corrupting the state input). In Section A.2, we justify our choice of reward manipulation as the first step of the investigation without losing the generality of sampling attacks.
The following summarizes three important aspects of the proposed sampling attacks. 1) Knowledge of the attacker: the attacker only has access to the sample trajectories rolled out by the learner and the current policy. The simulator and the learner's internal learning process remain black-box to the attacker. 2) Attack strategies: the attacker is allowed to modify the rewards of sample trajectories provided by the simulator. It can adjust its attack sequentially depending on the meta learning process. Note that the attacker needs to choose the manipulation strategy from a constraint set, which can be interpreted as the attack budget (see Section A.3). 3) The attacker's objective: this paper considers a greedy attacker who misleads the learner into learning the worst possible meta policy, i.e., the value function of the learned meta policy [the objective in (1)] is minimized after sampling attacks. In the following, we use ISA as an example to elaborate on these aspects of the proposed sampling attacks.
Inner Sampling Attack The first attack applies reward manipulations to the first sampling process. Denote the manipulated reward by $\tilde{r}_\delta$, which is parameterized by $\delta$. We assume that $\tilde{r}_\delta = \delta\, r$, i.e., the manipulated reward is obtained by applying a linear (scaling) transformation to the original reward, and $\delta \in \Delta$. A more general practice is to represent $\tilde{r}_\delta$ by a neural network, with $\delta$ being the weights of the network. Another remark is that the admissible manipulation strategies are confined to a bounded set $\Delta$ (see Assumption 6), limiting the attacker's ability. Section A.3 compares our attack constraints with existing constraints in the literature.
Under ISA, the policy adaptation is compromised, as the inner batch is replaced by its corrupted counterpart, in which every reward is scaled by $\delta$. For simplicity, it is assumed that reward signals in the batch are uniformly scaled by the same $\delta$. More sophisticated manipulation strategies, such as adaptive poisoning (Zhang et al., 2020b), can also be considered, yet to keep the theoretical analysis clean, we limit ourselves to this uniform scaling within each batch. In the following, notations with a tilde denote either data corrupted by the manipulation or quantities computed using corrupted data.
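The sketch below illustrates the linear reward-scaling manipulation and which batch it touches under ISA, OSA, and CSA; the data structures and function names are our own illustrative choices.

```python
from typing import List, Tuple

Trajectory = List[Tuple[int, int, float]]   # (state, action, reward) steps

def scale_rewards(batch: List[Trajectory], delta: float) -> List[Trajectory]:
    """Uniformly scale every reward in the batch by delta (the linear attack)."""
    return [[(s, a, delta * r) for (s, a, r) in traj] for traj in batch]

def poison(inner_batch, outer_batch, delta, attack="ISA"):
    """Apply the sampling attack to the two batches used in one meta-iteration.

    ISA: only the inner (adaptation) batch is corrupted.
    OSA: only the outer (meta-update) batch is corrupted.
    CSA: both batches are corrupted.
    """
    if attack in ("ISA", "CSA"):
        inner_batch = scale_rewards(inner_batch, delta)
    if attack in ("OSA", "CSA"):
        outer_batch = scale_rewards(outer_batch, delta)
    return inner_batch, outer_batch

# toy usage
inner = [[(0, 1, 1.0)], [(1, 0, -0.5)]]
outer = [[(2, 1, 0.3)]]
print(poison(inner, outer, delta=-0.24, attack="ISA"))
```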
Following the policy gradient theorem reviewed in Section 3, since the corrupted return differs from the original one only in that every reward is scaled by $\delta$, the corrupted policy gradient satisfies $\nabla_\theta \tilde{J}_i(\theta) = \delta\, \nabla_\theta J_i(\theta)$; similarly, for the MC estimation, $\hat{\nabla}_\theta \tilde{J}_i(\theta;\tilde{\mathcal{D}}_i) = \delta\, \hat{\nabla}_\theta J_i(\theta;\mathcal{D}_i)$. Finally, the goal of the attack is to find a $\delta$ such that the performance of the learned meta policy [evaluated through the objective function specified in (1)] is minimized. The optimization problem associated with ISA is given by
$\min_{\delta\in\Delta}\; \mathbb{E}_{i\sim p}\big[\, J_i\big(\theta^*(\delta) + \alpha\, \nabla_\theta J_i(\theta^*(\delta))\big) \,\big]$   (ISA)
s.t. $\theta^*(\delta) \in \arg\max_{\theta}\; \mathbb{E}_{i\sim p}\big[\, J_i\big(\theta + \alpha\,\delta\, \nabla_\theta J_i(\theta)\big) \,\big]$
Outer Sampling Attack Another sampling attack is to consider manipulating the samples of the second sampling process in Algorithm 1. The outer sampling collects samples under the updated policy, which are used to compute the MC estimation of the gradient of the meta objective. In this case, the optimization problem in OSA shares the same attack objective as in ISA, but the constraint becomes
s.t. $\theta^*(\delta) \in \arg\max_{\theta}\; \delta\, \mathbb{E}_{i\sim p}\big[\, J_i\big(\theta + \alpha\, \nabla_\theta J_i(\theta)\big) \,\big]$   (OSA)
Combined Sampling Attack Finally, the inner and the outer sampling attacks can be combined. Following the same line of argument, the objective is the same as in (ISA), and the constraint is
s.t. $\theta^*(\delta) \in \arg\max_{\theta}\; \delta\, \mathbb{E}_{i\sim p}\big[\, J_i\big(\theta + \alpha\,\delta\, \nabla_\theta J_i(\theta)\big) \,\big]$   (CSA)
Some remarks are in order. Note that compared with the other two sampling attacks, OSA is rather trivial: reward flipping, i.e., setting $\delta = -1$, is sufficient, as it already turns the learner's maximization into minimization. Hence, for the rest of the paper, we mainly discuss (ISA) and (CSA). Our theoretical analysis focuses on (ISA), and its extension to (CSA) is straightforward (see Section E.1). Another remark is that even though there may exist multiple maximizers $\theta^*(\delta)$, as the objective function is nonconvex (though differentiable), we assume that in our formulation $\theta^*(\delta)$ is properly selected, and hence unique. One way to justify this assumption is that $\theta^*(\delta)$ can be viewed as the policy parameter returned by the RL algorithm in the training process, and hence it is unique. We will elaborate on this selection more mathematically (see Lemma E1 and its proof) in Section 5.
Note that all three sampling attacks are formulated as bi-level optimization problems. Hence they amount to a Stackelberg game between the attacker (leader) and the learner (follower). The best response of the learner is to learn an optimal policy under the reward manipulation, and the three constraints on in the above formulations correspond to the learner’s meta learning processes under ISA, OSA, and CSA, respectively. Knowing the learner’s best response, the attacker aims to find a such that the learned meta policy adapts poorly in the testing phase without further attacks.
Since the attacker targets the testing performance of the learned meta policy $\theta^*(\delta)$, it needs to consider the policy gradient adaptation performed with genuine sample trajectories, i.e., trajectories without manipulation, when evaluating the attack objective. However, in (ISA), the attacker can only access the learner's contaminated training data [i.e., sample trajectories collected under the current policy]. In this case, the feedback of the attacker's manipulation [e.g., estimates of the attack objective] is not directly available during the training time. Hence, searching for the optimal attack defined in (ISA) in an online manner is challenging, due to the attacker's limited knowledge of the meta learning process.
From Stackelberg to Minimax As discussed above, it is a daunting task for the attacker to find the optimal by solving the bi-level optimization problem in (ISA). Thus, we reformulate the attacker’s problem as a minimax problem (MMP), which is more tractable than (ISA), as the attacker now targets the learner’s training performance as shown below.
$\min_{\delta\in\Delta}\; \max_{\theta}\; \mathbb{E}_{i\sim p}\big[\, J_i\big(\theta + \alpha\,\delta\, \nabla_\theta J_i(\theta)\big) \,\big]$   (MMP)
Note that in general the min and max operators in (MMP) are not interchangeable, since the objective is nonconvex-nonconcave (Fallah et al., 2020). Hence, (MMP) still amounts to a Stackelberg game, yet the attacker now aims to minimize the training performance under ISA, i.e., the value function under the corrupted dataset [the max part of (MMP)]. Sample trajectories rolled out by the learner can also help guide the attacker's search for the optimal attack, and no extra data or information is needed. In this case, the attack free-rides on the meta learning process, making the proposed sampling attack stealthy and lightweight. This free-ride nature will become clearer in Section 5, where two online attack schemes based on the gradient feedback are introduced.
Before concluding this section, we comment on the relationship between the original (ISA) and (MMP). Even though the minimax value of (MMP) is by definition different from the Stackelberg equilibrium payoff given by (ISA), the following Proposition 1 states that under the assumption of bounded policy gradients (see Assumption 3 in Appendix D), the minimax value, when combined with a norm regularization, serves as an upper bound on the value function under the optimal attack.
Proposition 1.
Under the assumption of bounded policy gradients, there exists a constant such that for $\theta^*(\delta)$ defined in (ISA) and any batch of trajectories,
(3)
As a consequence,
(4)
From Proposition 1, the free-rider nature can be explained in the following way: as long as the training performance, i.e., the right-hand side of (4), deteriorates, the testing performance or the quality of the meta policy, i.e., the left-hand side of (4), cannot be any better. Even though the proposed sampling attack models are discussed in the context of MAMRL, they are also relevant to other meta RL frameworks, as discussed in Appendix B.
5 Sampling Attacks under Concurrent Learning
Based on (MMP), this section presents two online attack schemes that enable the attacker to learn the optimal attack alongside the learner's training process. We present the two schemes in the context of ISA, and the extension to CSA is straightforward (see Section E.1). To streamline the investigation into the optimality of the proposed mechanisms, we omit the norm regularization term in the analysis, treating it as a regularizer only in the numerical implementation, and mainly focus on (MMP).
Intermittent Attack In (MMP), the inner maximization problem corresponds to the meta RL under ISA, while the attacker's problem is the outer minimization. A natural way to seek the minimizer is to apply gradient descent to the objective function. Using a similar argument as in (2), we arrive at the following MC estimation of the gradient with respect to $\delta$,
(5)
Similarly, for the gradient with respect to $\theta$ with $\delta$ fixed, the MC estimate is given by
(6)
The attacker's learning proceeds as follows. At first, the attacker initializes the attack and closely monitors the learning process. During the meta training process, the attacker remains dormant, i.e., $\delta$ remains constant, until the policy parameter stabilizes. Then it is activated and updates $\delta$ using one-step stochastic gradient descent followed by the Euclidean projection onto $\Delta$. It repeats the above steps until the attack parameter stabilizes. Since the attacker only adapts the reward manipulation intermittently, we refer to such an attack as Intermittent Attack. The pseudocode is in Algorithm 2.
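A minimal sketch of the Intermittent Attack loop is given below, with the learner's meta update and the attacker's gradient estimate abstracted as stand-in callables and the stabilization test replaced by a simple parameter-change check; these simplifications are ours and do not reproduce Algorithm 2 exactly.

```python
import numpy as np

def intermittent_attack(delta0, theta0, learner_step, attacker_grad,
                        eta_theta=0.1, eta_delta=0.05,
                        delta_box=(-2.0, 2.0), n_outer=50, n_inner=200, tol=1e-3):
    """Intermittent Attack sketch: the attacker stays dormant until the learner's
    policy parameter stabilizes, then takes one projected gradient descent step on delta.

    learner_step(theta, delta, eta) -> new theta (one meta RL update under attack)
    attacker_grad(delta, theta)     -> MC estimate of the gradient w.r.t. delta
    """
    delta, theta = delta0, theta0
    for _ in range(n_outer):
        # let the learner train under the current (fixed) attack until it stabilizes
        for _ in range(n_inner):
            new_theta = learner_step(theta, delta, eta_theta)
            if np.linalg.norm(new_theta - theta) < tol:
                theta = new_theta
                break
            theta = new_theta
        # one projected SGD step for the attacker
        delta = delta - eta_delta * attacker_grad(delta, theta)
        delta = np.clip(delta, *delta_box)      # Euclidean projection onto the box
    return delta, theta

# toy usage with quadratic stand-ins (not a real meta RL learner)
step = lambda th, d, eta: th + eta * (d - th)   # learner drifts toward delta
agrad = lambda d, th: 2.0 * (d - 0.5)           # attacker's toy objective gradient
print(intermittent_attack(np.array(1.0), np.array(0.0), step, agrad))
```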
Persistent Attack Instead of waiting for the learner to converge, the attacker can also implement a real-time attack, i.e., it performs gradient descent along with the learner's gradient ascent, no matter whether the current policy parameter has stabilized or not. In this case, the attacker is spared from detecting the stabilization of the learner's training and updates $\delta$ every time the policy parameter is updated. Algorithm 3 summarizes this kind of attack, which we refer to as Persistent Attack, since the attacker keeps adjusting $\delta$ according to the learner's progress all the time. The intuition behind Persistent Attack is that if the attacker's learning rate is far smaller than the learner's, the learner views the attack as quasi-static, while the attacker treats the policy as stabilized (Lin et al., 2019), which shares the same spirit as Intermittent Attack.
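Persistent Attack can be sketched as a two-timescale gradient descent ascent loop, shown below with the same toy stand-ins as in the Intermittent Attack sketch; the callables and step sizes are illustrative assumptions rather than the paper's Algorithm 3.

```python
import numpy as np

def persistent_attack(delta0, theta0, learner_step, attacker_grad,
                      eta_theta=0.1, eta_delta=0.005,
                      delta_box=(-2.0, 2.0), n_iters=2000):
    """Persistent Attack sketch: the attacker updates delta at every learner
    iteration with a much smaller step size (two-timescale gradient descent
    ascent), so the learner sees a quasi-static attack."""
    delta, theta = delta0, theta0
    for _ in range(n_iters):
        theta = learner_step(theta, delta, eta_theta)              # learner: ascent
        delta = delta - eta_delta * attacker_grad(delta, theta)    # attacker: descent
        delta = np.clip(delta, *delta_box)                         # projection onto the box
    return delta, theta

# same toy stand-ins as above (not a real meta RL learner)
step = lambda th, d, eta: th + eta * (d - th)
agrad = lambda d, th: 2.0 * (d - 0.5)
print(persistent_attack(np.array(1.0), np.array(0.0), step, agrad))
```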
Before proceeding to the theoretical analysis, we comment on the differences between the two schemes in terms of cost and stealthiness. Since Intermittent Attack utilizes the stabilized policy parameter, the attacker’s MC estimation (6) in Intermittent Attack tends to return a more accurate gradient descent direction than its counterpart in Persistent Attack when minimizing the objective function in (MMP). For the attacker’s gradient estimation under two attack schemes, the reader is referred to Lemma E1, Lemma E5, and associated proofs. As a result, in practice, Intermittent Attack tends to find the optimal attack using fewer iterations (as shown in Figure 2(b)(d).) However, in Intermittent Attack, the attacker needs to detect the stabilization of the learner’s training. This detection can be costly, as the attacker needs to keep a record of the average return of each iteration. In contrast, in Persistent Attack, the attacker simply updates the attack per iteration without detection. On the other hand, since Persistent Attack manipulates the rewards of sample trajectories during each iteration, it is less stealthy than Intermittent Attack.
Optimality of Sampling Attacks The objective in (MMP) is nonconvex-nonconcave. An appropriate optimality condition is the $\epsilon$-first-order stationary point ($\epsilon$-FOSP), widely adopted in the study of nonconvex programming and game theory (Nouiehed et al., 2019; Li et al., 2022).
Definition 1 ($\epsilon$-FOSP).
A point is an $\epsilon$-FOSP of a continuously differentiable function if the first-order stationarity measures of both the minimizing and the maximizing variable are bounded by $\epsilon$.
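For concreteness, one commonly used form of this condition for a constrained minimax problem $\min_{\delta\in\Delta}\max_{\theta} f(\delta,\theta)$, following the projected-gradient stationarity measure of Nouiehed et al. (2019), is sketched below; the notation is ours and may differ from the paper's exact statement.

```latex
% A common form of the epsilon-FOSP condition for
% min_{delta in Delta} max_{theta} f(delta, theta);
% notation is ours, not necessarily the paper's exact statement.
\[
  \big\| \nabla_{\theta} f(\delta^{*}, \theta^{*}) \big\| \le \epsilon,
  \qquad
  \Big\| \tfrac{1}{\eta}\Big( \delta^{*}
      - \mathcal{P}_{\Delta}\big[\delta^{*} - \eta \nabla_{\delta} f(\delta^{*}, \theta^{*})\big] \Big) \Big\| \le \epsilon,
\]
% where P_Delta is the Euclidean projection onto Delta and eta > 0 is a fixed step size.
```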
Since the objective function in (MMP) is nonconvex-nonconcave, in order to show the optimality of the attack, we need to impose some regularity conditions. For ease of notation, we view the objective of (MMP) as a function of $(\delta, \theta)$ and denote its MC estimation accordingly, where the randomness comes from the collection of datasets to be sampled in the sampling processes for its computation. The following assumptions extend the Polyak-Łojasiewicz (PL) condition (Polyak, 1963; Łojasiewicz, 1963) and the restricted secant inequality (RSI) (Zhang and Yin, 2013) to (MMP). Note that both Assumptions 1 and 2 are weaker than the commonly assumed strong convexity condition (Lin et al., 2019; Nouiehed et al., 2019), and detailed discussions are presented in Appendix D.
Assumption 1 (Minimax PL).
For (MMP), it is assumed that there exists a constant such that .
Assumption 2 (Minimax RSI).
For (MMP), it is assumed that there exists a constant such that for any fixed , .
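For reference, the standard single-objective forms of the two conditions that the minimax assumptions above extend are sketched below in our own notation; the paper's exact minimax statements may differ.

```latex
% Standard (single-objective) forms of the two conditions, for a
% maximization problem max_theta g(theta); notation is ours.
% Polyak-Lojasiewicz (PL):
\[
  \tfrac{1}{2}\,\big\| \nabla g(\theta) \big\|^{2}
  \;\ge\; \mu \Big( \max_{\theta'} g(\theta') - g(\theta) \Big),
  \qquad \forall \theta .
\]
% Restricted secant inequality (RSI), with theta_p the projection of theta
% onto the set of maximizers:
\[
  \big\langle \nabla g(\theta),\, \theta_{p} - \theta \big\rangle
  \;\ge\; \mu \,\| \theta_{p} - \theta \|^{2},
  \qquad \forall \theta .
\]
```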
In addition to the above assumptions, some regularity assumptions are also needed to show the convergence of Intermittent and Persistent Attack. They are presented in Appendix D.
Nonasymptotic Analysis of Intermittent Attack The convergence result rests on the observation that once the meta learning stabilizes, the attacker's gradient feedback coincides with the gradient of the inner-maximized objective (see Lemma E1). Then, the standard analysis of gradient descent in nonconvex programming can be applied.
Theorem 1.
Under Assumption 1 and the regularity assumptions, for any given $\epsilon > 0$, let the step sizes as well as the batch size be properly chosen (see Section E.2); if the numbers of iterations are sufficiently large, then there exists an index such that the iterate generated by Algorithm 2 is an $\epsilon$-FOSP.
Nonasymptotic Analysis of Persistent Attack Although Assumption 1 suffices for proving the optimality of Intermittent Attack, it is too weak a condition for dealing with Persistent Attack. The gap is that in Persistent Attack, the attacker's gradient update does not utilize the exact gradient or its approximation at the stabilized policy; instead, the gradient estimate at the current, possibly non-stabilized, policy is applied. To control the resulting difference, we show in Section E.3 that the distance to the inner maximizer exhibits a linear contraction under Assumption 2, leading to the convergence result stated as follows.
Theorem 2.
Under Assumption 2 and the regularity conditions, for any given $\epsilon > 0$, let the step sizes as well as the batch size be properly chosen (see Section E.3); then, if the number of iterations is sufficiently large, there exists an index such that the iterate generated by Algorithm 3 is an $\epsilon$-FOSP.
6 Numerical Experiments
This section presents experimental evaluations of the proposed attack schemes for ISA and CSA. We consider two benchmark meta learning tasks, including a 2D-navigation task and a more complex locomotion problem: Half-cheetah (H-C) (Finn et al., 2017). The following briefly introduces the two tasks, and more details can be found in Section C.1. In the 2D-navigation task, starting from a fixed initial position [the black triangle in Figure 2(a)] in a 2D plane, a point agent [the red dot in Figure 2(a)] needs to move to a goal position [the green star in Figure 2(a)] randomly sampled from a given task distribution. In the H-C task, a planar cheetah [as shown in Figure 2(c)] is required to run at a particular velocity sampled from the task distribution.
Experimental Setup Hyperparameter setup for the two algorithms is detailed in Section C.2. When evaluating the corrupted meta policy in the testing phase, we compare the adaptation performance [as defined in (1)] of meta policies learned under different attack settings and under the benchmark setting without attacks. One-step policy adaptation is considered throughout the paper. Additional experiments on multi-step adaptation can be found in Section C.3. As in (Finn et al., 2017), the evaluation metric is the average of cumulative rewards of policies adapted from the meta policy.
Experiment Results Figure 2 presents an example of the poor adaptation of the corrupted meta policies and the convergence of the two online attack schemes under ISA in a single experiment. In Figure 2(a), when using the policy adapted from the meta policy, the agent is misled into searching the lower-right corner [as the search paths concentrate on that corner], opposite to the true goal. Similarly, with a small perturbation, the cheetah agent fails to learn the desired running-forward policy as shown in Figure 2(c). In this example, it is observed in Figure 2(b)(d) that both attack schemes converge. Intermittent Attack takes fewer iterations since its gradient estimation is more accurate, as discussed in Section 5. To demonstrate the impact of sampling attacks, the first two columns in Table 1 summarize the average rewards (and their standard deviations) of the policies after one-step adaptation from the meta policies corrupted by different attacks.
Table 1: Average rewards (and standard deviations) of policies after one-step adaptation from the corrupted meta policies.

| Task | Attack | Intermittent Attack | Persistent Attack | Robust Training | Benchmark |
|---|---|---|---|---|---|
| 2D | ISA | | | | |
| 2D | CSA | | | | |
| H-C | ISA | | | | |
| H-C | CSA | | | | |
Robust Training against Sampling Attacks In Persistent Attack, the attacker and the learner operate in a two-timescale manner. Intuitively, the learner first achieves the inner maximization using a larger step size. On the contrary, if the learner's step size is smaller than the attacker's, then the attacker would first achieve the minimization. In this case, the Stackelberg game becomes a maxmin problem, inspiring a robust training method. During the meta learning process, the learner can decrease its step size when it is suspicious of possible attacks. The obtained meta policy then returns the best possible adaptation performance when the training data is corrupted. In one of our experiments, with the learner's step size chosen smaller than the attacker's, it is observed that the meta policy is robust to Persistent Attack under ISA, as the adaptation performance is close to the benchmark. The column "Robust Training" in Table 1 compares the performance of the meta policies trained under Persistent Attack, Robust Training, and the benchmark.
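A sketch of this robust-training variant is given below: it reuses the Persistent Attack loop but makes the learner's step size much smaller than the attacker's; the stand-in callables and step sizes are illustrative assumptions.

```python
import numpy as np

def robust_training(delta0, theta0, learner_step, attacker_grad,
                    eta_theta=0.005, eta_delta=0.1,
                    delta_box=(-2.0, 2.0), n_iters=2000):
    """Same loop as Persistent Attack, but with eta_theta << eta_delta, so the
    attacker effectively converges first and the learner ends up solving the
    outer maximization of the resulting maxmin problem."""
    delta, theta = delta0, theta0
    for _ in range(n_iters):
        delta = np.clip(delta - eta_delta * attacker_grad(delta, theta), *delta_box)
        theta = learner_step(theta, delta, eta_theta)
    return theta

# toy stand-ins (not a real meta RL learner)
step = lambda th, d, eta: th + eta * (d - th)
agrad = lambda d, th: 2.0 * (d - 0.5)
print(robust_training(np.array(1.0), np.array(0.0), step, agrad))
```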
[Figure 2: (a) search paths of the policy adapted from the corrupted meta policy in the 2D-navigation task; (b) convergence of Intermittent and Persistent Attack in 2D navigation; (c) the Half-cheetah agent under the corrupted meta policy; (d) convergence of the two attack schemes in Half-cheetah.]
7 Discussion
This work formally defines three sampling attack models on meta RL in the training phase. The optimal sampling attack is formulated as the solution to a minimax problem, which leads to two online black-box attack mechanisms: Intermittent Attack and Persistent Attack. Experiments show that by simply scaling the reward feedback during training, the attacker can significantly deteriorate the adaptation performance of the learned meta policy.
The main limitation of the proposed online attack schemes is that with noisy gradient feedback, the attacker searches for an $\epsilon$-FOSP (see Definition 1), which is a local notion. Since the objective function is nonconvex-nonconcave, it is challenging to characterize these stationary points and compare the corresponding objective values with the global optimum. Hence, quantifying the impact of sampling attacks theoretically (e.g., the performance loss or robustness) remains an open problem. Finally, we comment on the potential negative societal impacts of the proposed sampling attacks. As meta RL has been applied to many real-world applications (Hospedales et al., 2021), the stealthy and lightweight sampling attack schemes pose a great threat to these learning systems. Our robust training idea points out a promising way to combat the attack, yet more efforts are needed for a systematic investigation of defense mechanisms.
References
- Agarwal et al. [2019] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. arXiv, 2019.
- Bannon et al. [2020] J. Bannon, B. Windsor, W. Song, and T. Li. Causality and batch reinforcement learning: Complementary approaches to planning in unknown domains. arXiv preprint arXiv:2006.02579, 2020. doi: 10.48550/ARXIV.2006.02579. URL https://arxiv.org/abs/2006.02579.
- Bernhard and Rapaport [1995] P. Bernhard and A. Rapaport. On a theorem of danskin with an application to a theorem of von neumann-sion. Nonlinear Analysis: Theory, Methods & Applications, 24(8):1163–1181, 1995. ISSN 0362-546X. doi: https://doi.org/10.1016/0362-546X(94)00186-L. URL https://www.sciencedirect.com/science/article/pii/0362546X9400186L.
- Dayan and Watkins [1992] P. Dayan and C. J. Watkins. Q-Learning. Machine Learning, 8(3-4):279–292, 1992. ISSN 0885-6125. doi: 10.1007/bf00992698.
- Duan et al. [2016] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast Reinforcement Learning via Slow Reinforcement Learning. arXiv, 2016.
- Fakoor et al. [2020] R. Fakoor, P. Chaudhari, S. Soatto, and A. J. Smola. Meta-q-learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJeD3CEFPH.
- Fallah et al. [2020] A. Fallah, K. Georgiev, A. Mokhtari, and A. Ozdaglar. On the Convergence Theory of Debiased Model-Agnostic Meta-Reinforcement Learning. arXiv, 2020.
- Finn et al. [2017] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
- Hospedales et al. [2021] T. M. Hospedales, A. Antoniou, P. Micaelli, and A. J. Storkey. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis & Machine Intelligence, (01):1–1, may 2021. ISSN 1939-3539. doi: 10.1109/TPAMI.2021.3079209.
- Huang and Zhu [2019] Y. Huang and Q. Zhu. Deceptive reinforcement learning under adversarial manipulations on cost signals. In Decision and Game Theory for Security, pages 217–237, Cham, 2019. Springer International Publishing. ISBN 978-3-030-32430-8.
- Karimi et al. [2016] H. Karimi, J. Nutini, and M. Schmidt. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition. In European Conference on Machine Learning and Knowledge Discovery in Databases - Volume 9851, ECML PKDD 2016, pages 795–811, Berlin, Heidelberg, 2016. Springer-Verlag. ISBN 9783319461274. doi: 10.1007/978-3-319-46128-1_50. URL https://doi.org/10.1007/978-3-319-46128-1_50.
- Li and Zhu [2019] T. Li and Q. Zhu. On Convergence Rate of Adaptive Multiscale Value Function Approximation for Reinforcement Learning. 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2019. doi: 10.1109/mlsp.2019.8918816.
- Li et al. [2021] T. Li, G. Peng, and Q. Zhu. Blackwell Online Learning for Markov Decision Processes. 2021 55th Annual Conference on Information Sciences and Systems (CISS), 00:1–6, 2021. doi: 10.1109/ciss50987.2021.9400319.
- Li et al. [2022] T. Li, G. Peng, Q. Zhu, and T. Başar. The confluence of networks, games, and learning a game-theoretic framework for multiagent decision making over networks. IEEE Control Systems Magazine, 42(4):35–67, 2022. doi: 10.1109/MCS.2022.3171478.
- Lin et al. [2019] T. Lin, C. Jin, and M. I. Jordan. On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems. arXiv, 2019.
- Liu et al. [2020] T. Liu, Z. Liu, Q. Liu, W. Wen, W. Xu, and M. Li. StegoNet: Turn Deep Neural Network into a Stegomalware. Annual Computer Security Applications Conference, pages 928–938, 2020. doi: 10.1145/3427228.3427268.
- Łojasiewicz [1963] S. Łojasiewicz. A topological property of real analytic subsets (in French). Colloques du CNRS, Les équations aux dérivées partielles, pages 87–89, 1963.
- Ma et al. [2019] Y. Ma, X. Zhang, W. Sun, and J. Zhu. Policy poisoning in batch reinforcement learning and control. Advances in Neural Information Processing Systems, 32:14570–14580, 2019.
- Mehta et al. [2020] B. Mehta, T. Deleu, S. C. Raparthy, C. J. Pal, and L. Paull. Curriculum in Gradient-Based Meta-Reinforcement Learning. arXiv, 2020.
- Nouiehed et al. [2019] M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn. Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods. arXiv, 2019.
- Polyak [1963] B. T. Polyak. Gradient methods for minimizing functionals. Zhurnal vychislitel’noi matematiki i matematicheskoi fiziki, 3(4):643–653, 1963.
- Rakelly et al. [2019] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5331–5340. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/rakelly19a.html.
- Russo and Proutiere [2021] A. Russo and A. Proutiere. Towards Optimal Attacks on Reinforcement Learning Policies. 2021 American Control Conference (ACC), 00:4561–4567, 2021. doi: 10.23919/acc50511.2021.9483025.
- Schulman et al. [2016] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In Y. Bengio and Y. LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1506.02438.
- Silver et al. [2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. v. d. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–9, 2016. ISSN 0028-0836. doi: 10.1038/nature16961.
- Sutton et al. [2000] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057—1063. MIT press, 2000.
- Todorov et al. [2012] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012. doi: 10.1109/IROS.2012.6386109.
- Von Stackelberg [2010] H. Von Stackelberg. Market structure and equilibrium. 2010. doi: 10.1007/978-3-642-12586-7.
- Wang et al. [2016] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv, 2016.
- Xu et al. [2021] H. Xu, R. Wang, L. Raizman, and Z. Rabinovich. Transferable environment poisoning: Training-time attack on reinforcement learning. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’21, page 1398–1406, Richland, SC, 2021. ISBN 9781450383073.
- Zhang and Yin [2013] H. Zhang and W. Yin. Gradient methods for convex minimization: better rates under weaker conditions. arXiv preprint arXiv:1303.4645, 2013.
- Zhang et al. [2020a] H. Zhang, H. Chen, C. Xiao, B. Li, M. Liu, D. Boning, and C.-J. Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems, 33:21024–21037, 2020a.
- Zhang et al. [2020b] X. Zhang, Y. Ma, A. Singla, and X. Zhu. Adaptive reward-poisoning attacks against reinforcement learning. In International Conference on Machine Learning, pages 11225–11234. PMLR, 2020b.
- Zhang et al. [2021] X. Zhang, Y. Chen, X. Zhu, and W. Sun. Robust Policy Gradient against Strong Data Corruption. arXiv, 2021.
- Zhu and Rass [2018] Q. Zhu and S. Rass. On Multi-Phase and Multi-Stage Game-Theoretic Modeling of Advanced Persistent Threats. IEEE Access, 6:13958–13971, 2018. ISSN 2169-3536. doi: 10.1109/access.2018.2814481.
Appendix A Practicability of Sampling Attacks
A.1 Infiltration through Advanced Persistent Threats
In reality, sampling attacks can be achieved through Advanced Persistent Threats [Zhu and Rass, 2018], a class of multi-stage stealthy attack processes. In the first stage, the attacker embeds malware in neural network models by replacing the model parameters with malware bytes Liu et al. [2020]. Then, the malware is delivered covertly along with the model into the system in Figure 1. In the second stage, the malware performs reconnaissance and tailors its attack to the communication channel between the simulator and the sampler. In the final stage, the malware launches sampling attacks using either Intermittent Attack (see Algorithm 2) or Persistent Attack (see Algorithm 3). When implementing sampling attacks, the attacker stands between the simulator and the learner, cutting off the information transmission between the two and sending the learner manipulated reward feedback to mislead the learner.
A.2 Justification of the Reward Manipulation
Once it has covertly infiltrated the learning system, the attacker can corrupt any component within the learning system in Figure 1. This work considers sampling attacks, a particular case of man-in-the-middle attacks, where the attacker interrupts an existing conversation or data transfer between the simulator and the sampler. Sampling attacks only require sample trajectories under the current policy model to sabotage the learning process. In contrast, if the attacker were to control the simulator or other components of the system, it would require knowledge of how these components are built and operate. Therefore, sampling attacks are more lightweight and stealthy, requiring less knowledge of the learning system.
The attacker can manipulate everything in the sample trajectories during sampling attacks, including states, actions, and rewards. We choose reward manipulation as the first step of the investigation because rewards are one-dimensional scalars and are more vulnerable to adversarial manipulations (state and action manipulations may require more advanced devices [Zhang et al., 2020a, Russo and Proutiere, 2021] than simple linear scaling). Its analysis leads to sufficient insights without losing the generality of sampling attacks, since the Stackelberg formulation also applies to state and action manipulations.
A.3 Attack Budgets
The attack budget aims to limit the attacker’s ability to sabotage the learning process; otherwise, it is not surprising that an omnipotent attacker can compromise the learning system of interest. Previous works mainly study two kinds of attack budgets. The first one is to restrict the amount of sample data the attacker can attack [Zhang et al., 2021, Huang and Zhu, 2019, Ma et al., 2019] which we refer to as the volume constraint. This volume constraint is often considered in offline RL or batch RL scenarios where the training dataset is prefixed [Zhang et al., 2021, Ma et al., 2019]. The other one is to confine the attacker’s perturbation to a limited magnitude [Zhang et al., 2020a, Russo and Proutiere, 2021, Zhang et al., 2020b], which we refer to as the magnitude constraint.
This work considers the magnitude constraint as in other studies on online attacks [Zhang et al., 2020a, b]. Note that the proposed Intermittent and Persistent Attacks can accommodate the volume constraint by restricting the number of sample trajectories the attacker can poison. In this case, our Stackelberg formulations are still valid, yet the optimality analysis needs refinements as the gradient-based analysis is not directly available. The associated optimality analysis is beyond the nonconvex-nonconcave programming framework and remains an open problem.
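The two kinds of budgets can be sketched as follows: a magnitude constraint implemented by projecting the scaling factor onto a bounded interval, and a volume constraint implemented by limiting how many trajectories may be poisoned; the interval bounds and the function names are our illustrative assumptions.

```python
import numpy as np

def project_magnitude(delta, low=-2.0, high=2.0):
    """Magnitude constraint: keep the scaling factor inside a bounded interval."""
    return float(np.clip(delta, low, high))

def poison_with_volume_budget(batch, delta, max_poisoned):
    """Volume constraint: scale rewards in at most `max_poisoned` trajectories."""
    poisoned = []
    for i, traj in enumerate(batch):
        if i < max_poisoned:
            poisoned.append([(s, a, delta * r) for (s, a, r) in traj])
        else:
            poisoned.append(traj)
    return poisoned

# toy usage
print(project_magnitude(-39.56, low=-2.0, high=2.0))
print(poison_with_volume_budget([[(0, 0, 1.0)], [(0, 1, 2.0)]], -1.0, max_poisoned=1))
```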
Appendix B Relevance to Broader Meta RL research
This work chooses a well-accepted meta RL framework [Finn et al., 2017, Fallah et al., 2020] and studies associated sampling attacks. In particular, our convergence analysis in Section 5 is built on the theoretical results regarding SG-MRL established in [Fallah et al., 2020].
Picking MAMRL serves as a starting point, and our findings can apply to many other variants of meta RL algorithms. The proposed minimax formulation in this work remains valid for other non-MAMRL-based algorithms. Recall that in (MMP), the max part corresponds to the objective in MAMRL. We can replace this objective with its counterparts using other meta RL formulations, such as recurrent-network-based ones [Wang et al., 2016, Duan et al., 2016]. Similar to MAMRL, meta RL methods in [Wang et al., 2016, Duan et al., 2016] utilize on-policy data during both meta-training and adaptation. In this case, the attacker can also leverage these on-policy data to estimate its gradient feedback. Hence, Intermittent and Persistent Attacks also apply to these meta RL frameworks.
On the other hand, our online attack schemes need refinements when dealing with meta RL methods based on off-policy learning [Fakoor et al., 2020, Rakelly et al., 2019]. In off-policy learning, such as Q-learning [Dayan and Watkins, 1992] and its variants [Silver et al., 2016, Li and Zhu, 2019, Li et al., 2021], sample trajectories are collected using a behavior policy different from the target policy (the optimal policy). In this case, the attacker may not be able to assess its attack impact on the target policy if it can only access sample trajectories under the behavior policy. One possible remedy would be to reveal to the attacker some side information on the importance ratio between the target policy and the behavior policy [Bannon et al., 2020] so that the target policy’s performance can be estimated using off-policy data. We leave this topic as future work.
Appendix C Experiment Details
C.1 Meta Reinforcement Learning Tasks
2D Navigation
In a 2D plane, a point agent must move to different goal positions that are sampled according to a task distribution. Goals are picked from a unit square centered at the origin, with coordinates ranging over [-0.5, 0.5]. The state observation is the current 2D position, a 2-dimensional vector, and actions correspond to velocity commands clipped to the range [-0.1, 0.1]. The reward is the negative squared distance to the goal, and episodes terminate when the agent is within 0.01 of the goal or at the horizon of 100 steps. The discounting factor is .
Half-Cheetah
In the half-cheetah goal-velocity problem, a planar cheetah is required to run at a particular velocity randomly sampled from the task distribution. The state variable is a 17-dimensional vector representing the positions and velocities of the cheetah's joints, and actions correspond to continuous momenta of the six leg joints. The transition follows the physical rules simulated by MuJoCo (Todorov et al. [2012]), and the reward is defined as the negative absolute difference between the agent's current velocity and the goal velocity, which is chosen uniformly at random from [0, 2]. The horizon length is 200. The discounting factor is .
C.2 Experimental Setup
The attack program is built on the meta RL program developed in [Mehta et al., 2020]. We modify the original meta RL program to support parallel computing using graphics cards, saving a significant amount of training time. All the experiments are conducted on a Linux (Ubuntu 18.04) desktop with 4 Nvidia RTX8000 graphics cards and an AMD Threadripper 3990X (64 cores, 2.90 GHz) processor. The experiments use a feedforward neural network policy with two 64-unit hidden layers and tanh activations. The rest of this subsection summarizes the experimental setup of the meta training and the sampling attack schemes.
Meta RL Training
Recall that in Algorithm 1, a batch of i.i.d. tasks is first sampled from the task distribution. This batch of tasks is referred to as the meta batch [Finn et al., 2017, Fallah et al., 2020]. Then, for each task, the sampler returns two batches of trajectories: the inner batch and the outer batch. The batch sizes (the number of trajectories/episodes) of the inner and the outer batch are the same in the experiments. This size is referred to as the batch size.
In the training phase, the total iteration number is 1000 for 2D navigation (2D) and half-cheetah (H-C) tasks, and the meta batch size is 20 for 2D and 40 for H-C. For both tasks, the batch size is 20 episodes. The learning rate is . The policy gradient is obtained through the generalized advantage estimator (GAE) [Schulman et al., 2016], which is a refinement of the vanilla policy gradient reviewed in Section 3. Unlike the original policy gradient estimator , the GAE estimator takes the following form
where the advantage function is computed from the sampled rewards and a baseline function. The reinforce loss (whose MC estimation is the empirical reinforce loss) is the optimization objective maximizing the probability of taking good actions; in short, the higher the reinforce loss, the better the policy is. The introduction of the advantage function helps reduce the estimation variance in RL training. The reader is referred to [Schulman et al., 2016] for more details on computing the baseline function and interpretations of the advantage function.
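A minimal sketch of the GAE advantage computation (Schulman et al., 2016) for a single trajectory is given below; the discount and lambda values, as well as the function names, are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards: array of length T; values: array of length T + 1 (includes the
    bootstrap value of the final state, 0 for terminal states).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        gae = td_error + gamma * lam * gae
        advantages[t] = gae
    return advantages

# toy usage
print(gae_advantages(np.array([1.0, 0.5, -0.2]), np.array([0.3, 0.2, 0.1, 0.0])))
```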
Sampling Attacks
For both Intermittent (Algorithm 2) and Persistent (Algorithm 3) Attacks, the learner's meta learning process uses the hyperparameters specified above, although the inner batch and the outer batch are poisoned according to the sampling attack schemes. Unless otherwise specified, the learner's and the attacker's learning rates are set separately for Intermittent Attack and for Persistent Attack. In sampling attack experiments, the regularization is incorporated into the (MMP) objective function.
Testing
To evaluate the quality of the learned meta policy, we conduct test experiments. In each experiment, a batch of tasks is first sampled. The meta policy first performs the gradient adaptation using the inner batch of each sampled task. Then, the adaptation performance is given by the average cumulative reward of episodes from the outer batch of every sampled task. Note that in the testing phase, no attacks are allowed. For a single meta policy, we repeat the test experiment with 10 random seeds and report the mean rewards and standard deviations of these experimental results in Table 1 and the other tables in the following subsection.
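A minimal sketch of this testing protocol is given below; the callables for task sampling, rollout, and one-step adaptation are stand-ins for the actual experiment code, and the function names are our own.

```python
import numpy as np

def evaluate_meta_policy(meta_theta, sample_tasks, rollout, adapt, n_tasks=20):
    """Average return of policies adapted from `meta_theta` by one clean
    (unattacked) policy-gradient step, over freshly sampled tasks.

    sample_tasks(n)       -> list of task identifiers
    rollout(theta, task)  -> batch of trajectories (list of lists of (s, a, r))
    adapt(theta, batch)   -> adapted parameters after one policy-gradient step
    """
    returns = []
    for task in sample_tasks(n_tasks):
        inner_batch = rollout(meta_theta, task)
        adapted = adapt(meta_theta, inner_batch)
        outer_batch = rollout(adapted, task)
        returns.append(np.mean([sum(r for _, _, r in traj) for traj in outer_batch]))
    return float(np.mean(returns)), float(np.std(returns))

# toy usage with trivial stand-ins
tasks = lambda n: list(range(n))
roll = lambda th, t: [[(0, 0, 1.0)], [(0, 1, 0.5)]]
adap = lambda th, b: th
print(evaluate_meta_policy(0.0, tasks, roll, adap, n_tasks=3))
```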
C.3 Additional Experiment Results
The intuition behind Sampling Attacks
By scaling the reward signals by $\delta$, the attacker distorts the learner's perception of its reinforce loss, misleading the meta learning process. This distortion can be explained using the reinforce loss plots in Figure 3. We choose one representative experiment for each meta task under ISA and plot three kinds of reinforce loss functions (after smoothing), reviewed in the paragraph "Meta RL Training". The benchmark loss (red curves) is obtained by averaging the empirical reinforce loss (ERL) of policies adapted from the meta policy during the training phase in the benchmark setting (no attack). Denoting the meta policy at the $t$-th iteration of the benchmark training by $\theta_t$, the benchmark loss at iteration $t$ is given by
where the inner batch and the outer batch are sampled using the meta policy and its adapted policies, respectively (see Algorithm 1). ERL is the MC estimation of the reinforce loss.
The benchmark loss shows the training progress of meta RL in a neutral environment and serves as the baseline. To compare the training progress under adversarial attacks with this baseline, we introduce two other loss functions. Denote by the meta policy obtained under ISA and by the attacker’s manipulation, then the attacked loss (green curves for Intermittent Attack and blue curves for Persistent Attack) and the actual loss (yellow curves) are given by
Attacked Loss: the ERL computed on the corrupted sample data fed to the learner.
Actual Loss: the ERL of the adapted policies computed on genuine, unmanipulated sample data.
The attacked loss implies the learner’s distorted perception of its training progress as its policy gradient is corrupted. The actual loss, on the other hand, reflects the true learning progress in the adversarial environment since the ERL of reflects the current meta policy’s adaptation performance under actual sample data.
As shown in Figure 3, the agent mistakenly believes that it is making progress in the adversarial environment, as the green/blue curves go up (see Figures 3(a)(d)) or even above the benchmark (see Figures 3(b)). However, the actual performance of the learning either keeps deteriorating (see Figures 3(c)(d)) or remains at a modest level (see Figures 3(a)(b)).
[Figure 3: Reinforce loss curves (benchmark, attacked, and actual losses) for the 2D-navigation and Half-cheetah tasks under ISA, panels (a)-(d).]
Additional Results on Intermittent and Persistent Attacks
This subsection presents additional experimental results on sampling attacks. We begin with experiments on attacking meta RL with two-step policy gradient adaptation (the one-step adaptation case is in Section 3). Algorithm 4 summarizes the MAMRL algorithm using two-step policy gradient adaptation. Note that when using two-step policy adaptation, the inner sampling process requires two batches of trajectories, both of which are attacked in the experiments.
Similar to our findings in the one-step case, the proposed sampling attacks can significantly deteriorate the two-step adaptation performance of meta policies. Since Intermittent and Persistent Attacks are based on stochastic gradient descent, and the objective function is nonconvex-nonconcave, the two attack schemes do not return the same optimal attack or the same corrupted meta policy in general. Table 3 summarizes the optimal attacks under different attack settings. Finally, we comment on the robust training idea. As discussed in Section 6, decreasing the learner's step size turns the minimax into a maxmin. For example, the maxmin problem in ISA is $\max_{\theta}\; \min_{\delta\in\Delta}\; \mathbb{E}_{i\sim p}\big[\, J_i\big(\theta + \alpha\,\delta\, \nabla_\theta J_i(\theta)\big) \,\big]$.
In this case, the learner aims to maximize its training performance under sampling attacks, which does not necessarily lead to the maximization of the testing performance. It is not guaranteed that the meta policy obtained through the robust training achieves satisfactory adaptation performance in the testing phase. Table 4 summarizes the robust training experiments under this setting. Some encouraging results (shown in bold in Table 4) do appear: in some experiments, the obtained meta policies achieve adaptation performance comparable to the benchmark ones. However, the robust training does not succeed in the half-cheetah task under CSA (the blank entries in Table 4): we ran multiple robust training experiments in half-cheetah, yet not every obtained meta policy shows adaptation performance comparable to the benchmark ones. This maxmin training remains largely a heuristic, and more efforts are needed to investigate defense mechanisms for meta learning thoroughly.
| Task | Attack Model | Intermittent Attack | Persistent Attack | Benchmark |
|---|---|---|---|---|
| 2D | ISA | | | |
| 2D | CSA | | | |
| H-C | ISA | | | |
| H-C | CSA | | | |
Table 3: Optimal attacks found under different attack settings.

| Task | Gradient step | Attack Model | Intermittent Attack | Persistent Attack |
|---|---|---|---|---|
| 2D | 1 | ISA | -0.24 | -39.56 |
| 2D | 1 | CSA | -12.96 | -38.58 |
| 2D | 2 | ISA | 2.78 | -2.36 |
| 2D | 2 | CSA | -0.29 | -2.60 |
| H-C | 1 | ISA | 1.11 | 0.98 |
| H-C | 1 | CSA | -5.94 | 0.89 |
| H-C | 2 | ISA | 5.81 | 0.06 |
| H-C | 2 | CSA | 4.23 | 0.66 |
Table 4: Robust training experiments.

| Task | Gradient step | Attack Model | Persistent Attack | Robust Training | Benchmark |
|---|---|---|---|---|---|
| 2D | 1 | ISA | | | |
| 2D | 1 | CSA | | | |
| 2D | 2 | ISA | | | |
| 2D | 2 | CSA | | | |
| H-C | 1 | ISA | | | |
| H-C | 1 | CSA | | | |
| H-C | 2 | ISA | | | |
| H-C | 2 | CSA | | | |
Appendix D Regularity Assumptions
In this section, we provide further justifications for the regularity conditions assumed in this paper. In our analysis, it is assumed that the policy gradient is bounded, as stated in Assumption 3.
Assumption 3 (Bounded Policy Gradient).
There exists a constant , such that for any trajectory and policy parameter , . Equivalently, for any and any batch of trajectories , , .
Moreover, the value function is assumed to be Lipschitz smooth in Assumption 4.
Assumption 4 (Lipschitz Gradients).
The function is continuously differentiable in both and . Furthermore, there exist constants and such that for every and ,
(7) |
The two assumptions are direct extensions of [Fallah et al., 2020, Lemma 1] to the context of adversarial MAMRL, and they rest on customary assumptions (e.g., bounded rewards) in the policy gradient literature (Agarwal et al. [2019]). On the other hand, the unbiasedness and finite variance of the stochastic gradient assumed in Assumption 5 are also valid in MAMRL, which has been proved in Fallah et al. [2020] in order to show the convergence of Algorithm 1.
Assumption 5 (Stochastic Gradients).
The stochastic gradient , with being the random variable corresponding to the batch of trajectories with batch size , satisfies
(8) |
Independent of the above assumptions regarding the regularity of the value function, we require that the admissible set of attacks is also regular, as stated in Assumption 6, which helps the derivation of the complexity results.
Assumption 6.
The set is convex and compact, with its diameter upper bounded by . Without loss of generality, it is assumed that .
In addition to the assumptions discussed above, the PL condition in Assumption 1 and the RSI condition in Assumption 2 play an important role in establishing the convergence of the proposed attack mechanisms. Given their relationships to strong convexity (SC) Karimi et al. [2016], namely SC ⇒ RSI ⇒ PL for Lipschitz-smooth functions,
this work extends the convergence results of Gradient Descent Ascent [Karimi et al., 2016, Lin et al., 2019] to nonconvex settings under these weaker conditions. The key point is that the linear contraction of (see Lemma E4), previously established under strong convexity Lin et al. [2019], also holds under Assumption 2. Our theoretical treatment of nonconvex-nonconcave minimax problems can be helpful for future studies on this topic. Even though it is unclear under which conditions the value function of RL or meta RL satisfies PL or RSI, we provide numerical evidence in Figure 4 and Figure 5 demonstrating their validity in our experiments.
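For the reader's convenience, we also recall these conditions in the minimization form of Karimi et al. [2016], using generic placeholders ($f$ an $L$-smooth function with minimum value $f^*$, $x_p$ the projection of $x$ onto the solution set, and $\mu > 0$); the paper applies the analogous conditions to the maximization of the value function:
\[
\text{(PL)}\ \ \tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,\big(f(x)-f^*\big),
\qquad
\text{(RSI)}\ \ \langle \nabla f(x),\, x - x_p \rangle \ge \mu\,\|x - x_p\|^2,
\qquad
\text{(QG)}\ \ f(x)-f^* \ge \tfrac{\mu}{2}\,\|x - x_p\|^2 ,
\]
and, up to changes in the constant $\mu$, strong convexity implies RSI, RSI implies PL, and PL implies quadratic growth (QG) for $L$-smooth $f$.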
[Figure 4 and Figure 5: numerical evidence for the validity of the PL and RSI conditions in the experiments.]
Appendix E Proofs
E.1 Proofs in Section 4
Proof of Proposition 1.
As shown in [Fallah et al., 2020, Lemma 1], under Assumption 3, for any , we have
where is a constant depending on . Fix a and let , ; then the above inequality leads to
where the second inequality holds by the boundedness of assumed in Assumption 3. Combining the triangle inequality with the above yields (3). Taking expectations on both sides of (3) and then minimizing the left-hand side, we obtain
Finally, minimizing the right-hand side of the above inequality leads to (4). ∎
Extensions to CSA
Proposition 2.
Proof.
Note that
where in the last inequality we use the following result obtained in the proof of Proposition 1
Let (assumed to be bounded). According to the definition of the value function , the horizon length , and the discount factor , we obtain
Since , can be controlled by the diameter . Hence, we obtain
leading to (E1). Following the same argument in the proof of Proposition 1, we obtain (E3). ∎
The above proposition leads to the MMP formulation of CSA below.
Since is a compact convex set, our nonasymptotic convergence analysis of Intermittent and Persistent Attacks under ISA also applies to CSA via the above MMP formulation. Finally, we comment on the relationship between MMP and ISA (CSA). The original Stackelberg formulations in (ISA) and (CSA) target the testing performance, while in (MMP) the objective is the training performance. Hence, the two are different problems, even though the minimax value serves as an upper bound for (ISA) and (CSA). When access to the testing performance is not available during the training phase, this minimax upper bound helps guide the attacker's updates.
E.2 Proofs in Section 5
In this section, we present the full version of Theorem 1 with detailed choices of step sizes and batch size . We begin with some technical lemmas.
Lemma E1.
Proof of Lemma E1.
Our first step is to show that for any and , there exists such that .
First, based on the assumption of Lipschitz gradients in (7),
which is equivalent to , since . Then, using the PL condition in (1), we obtain
which implies that
(E4) |
As shown in [Karimi et al., 2016, Theorem 2], the PL condition implies quadratic growth; that is, if satisfies the -PL condition, the following holds:
(E5) |
where and . In other words, is the projection of onto the set of optimal solutions, and is the distance from the point to the optimal solution set.
The above lemma, as an extension of [Nouiehed et al., 2019, Lemma A.1], serves as a substitute for Danskin's theorem in convex analysis [Bernhard and Rapaport, 1995]: it states that the gradient of can be evaluated directly using the gradient . Moreover, this gradient function is Lipschitz smooth, which reduces the attacker's convergence problem to a standard gradient descent analysis for nonconvex objectives.
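In generic placeholder notation (writing $\tilde V$ for the objective, $\delta$ for the attack, $\Theta$ for the policy parameter set, and $\theta^*(\delta)$ for a maximizer of the inner problem; these symbols need not match the paper's), the Danskin-type statement invoked here reads
\[
g(\delta) := \max_{\theta \in \Theta} \tilde V(\theta, \delta)
\quad\Longrightarrow\quad
\nabla g(\delta) = \nabla_\delta \tilde V\big(\theta^*(\delta), \delta\big),
\qquad
\|\nabla g(\delta) - \nabla g(\delta')\| \le L_g \|\delta - \delta'\| \ \text{for some } L_g > 0,
\]
so the attacker may treat $g$ as an ordinary smooth (nonconvex) objective and run projected gradient descent on it.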
Our next lemma shows that returned by the inner loop properly approximates in the sense that , where denotes the final iterate of the inner loop under . The proof of the following lemma relies on the assumption that there exists a constant such that holds for every , which can be verified using Lemma E1, as shown in [Nouiehed et al., 2019, Lemma A.6].
Lemma E2.
Let , , where . Assume that there exists a constant such that . For any , if and are large enough such that
then for ,
(E7) |
and
(E8) |
Proof of Lemma E2.
Since is assumed to be Lipschitz-smooth, is also Lipschitz-smooth, and we have
Plugging the updating rule into the above inequality, we obtain
which implies
By Assumption 5, we have
where the second-to-last inequality follows from Young's inequality. Combining the two inequalities above, we arrive at
Recursively applying the above inequality and letting , we obtain
As shown in [Nouiehed et al., 2019], there exists a constant , such that holds for every . Then, using the quadratic growth shown in the proof of Lemma E1, we have
We are now ready to present the full version of Theorem 1 and its proof in the following.
Theorem E1 (Full version of Theorem 1).
Proof of Theorem 1.
Due to projection,
Letting and taking expectations on both sides, we have
which leads to
(E9) |
On the other hand, since is L-Lipschitz smooth,
(E10) |
where the last inequality holds because of (E9) and the law of total probability.
Also by projection,
(E11) |
where the last inequality is obtained using our assumptions that , and the diameter of the set is .
Notice that by the choice of in Lemma E2,
which reduces (E11) to the following
Let , then we have
(E12) |
and combining (E12) with (E10) yields
Therefore, we have
Denote , and then the above inequality can be rewritten as
Let , and then by Lemma E2, we obtain
which implies that there exists an index for which . This completes the proof. ∎
E.3 Proofs in Section 5
This section presents the full version of Theorem 2 with detailed choices of step sizes and batch size , and the main results rest on the following lemmas.
Lemma E3.
The iterates generated by Algorithm 3 satisfy the following inequality:
(E13) |
Proof of Lemma E3.
Since is -smooth (see Lemma E1), we have
In Algorithm 3, gradient descent is used to update , hence , and the above inequality can be rewritten as
By taking a conditional expectation on both sides, we have
(E14) |
By the Cauchy-Schwarz inequality,
(E15) | ||||
(E16) |
Lemma E4.
Under Assumptions 2, 4, and 5, for the iterates generated by Algorithm 3, let , where , and let , where is defined in Lemma E2; then
(E18) |
Proof of Lemma E4.
According to the gradient update with respect to ,
Let , then the above inequality leads to
(E19) |
Denote the right-hand side of the above inequality by , then
(E20) |
By Assumption 2, we have
(E21) |
Taking expectations on both sides of (E20) and plugging in (E21) yields
(E22) |
Similar to the argument in the proof of Lemma E2, by Young's inequality and Assumptions 4 and 5,
Hence, plugging the above inequality and (E22) into (E19) after taking expectations, we obtain
Let and with the choice of , the above inequality can be rewritten as
(E23) |
By Young’s inequality
By the Lipschitz continuity of shown in (E6), . Without loss of generality, it is assumed that are small enough. Furthermore, according to the stochastic gradient ascent, we have
Lemma E5.
Under Assumptions 2, 4, and 5, for the iterates generated by Algorithm 3, and for defined in Lemma E4,
(E24) |
Proof of Lemma E5.
Lemma E5 implies that the difference is controlled by the sum of , that is,
Since exhibits a linear contraction as shown in Lemma E4, can be further controlled by . Eventually, one can show that is bounded, implying the convergence of Algorithm 3. The full version of Theorem 2 is stated as follows.
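The elementary recursion behind this step, stated with generic placeholders $a_t \ge 0$, $\rho \in (0,1)$, and $b \ge 0$, is
\[
a_t \le \rho\, a_{t-1} + b \ \ \text{for all } t \ge 1
\quad\Longrightarrow\quad
a_t \le \rho^{t} a_0 + \frac{b}{1-\rho}
\quad\text{and}\quad
\sum_{t=1}^{T} a_t \le \frac{a_0}{1-\rho} + \frac{T b}{1-\rho} .
\]
This type of bound underlies the recursive applications of (E18) and (E26) in the proof of Theorem E2 below.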
Theorem E2.
Proof of Theorem 2.
Let . Applying (E18) recursively yields
(E25) |
Combining (E24) and (E25), we obtain
(E26) |
Applying (E26) recursively leads to
By the choice of , , and hence , which implies
Combining all the inequalities above, and suppressing the constants in front of variance terms, we arrive at
By the definition of , the above inequality leads to
Since , and ,
Following the same argument as in the proof of Theorem 1, we arrive at the conclusion. ∎