
Robust Policy Gradient against Strong Data Corruption

Xuezhou Zhang    Yiding Chen    Xiaojin Zhu    Wen Sun
Abstract

We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions. Our attack model assumes an adaptive adversary who can arbitrarily corrupt the reward and transition at every step within an episode, for at most an $\varepsilon$-fraction of the learning episodes. Our attack model is strictly stronger than those considered in prior works. Our first result shows that no algorithm can find a better than $O(\varepsilon)$-optimal policy under our attack model. Next, we show that, surprisingly, the natural policy gradient (NPG) method retains a natural robustness property if the reward corruption is bounded, and can find an $O(\sqrt{\varepsilon})$-optimal policy. Consequently, we develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an $O(\varepsilon^{1/4})$-optimal policy. We emphasize that FPG is the first algorithm that achieves a meaningful learning guarantee when a constant fraction of episodes is corrupted. Complementary to the theoretical results, we show that a neural implementation of FPG achieves strong robust learning performance on the MuJoCo continuous control benchmarks.


1 Introduction

Policy gradient methods are a popular class of Reinforcement Learning (RL) methods among practitioners, as they are amenable to parametric policy classes (schulman2015high; schulman2017proximal), resilient to mismatches in modeling assumptions (agarwal2019optimality; agarwal2020pc), and directly optimize the objective of interest. However, one current drawback of these methods, and of most existing RL algorithms, is the lack of robustness to data corruption, which severely limits their application to high-stakes decision-making domains with highly noisy data, such as autonomous driving, quantitative trading, or medical diagnosis.

In fact, data corruption can be a larger threat in the RL paradigm than in traditional supervised learning, because supervised learning is often applied in a controlled environment where data are collected and cleaned by highly skilled data scientists and domain experts, whereas RL agents are developed to learn in the wild using raw feedback from the environment. While the increased autonomy and reduced supervision mark a step closer to the goal of general artificial intelligence, they also make the learning system more susceptible to data corruption: autonomous vehicles can misread traffic signs when the signs are contaminated by adversarial stickers (eykholt2018robust); a chatbot can be mistaught by a small group of Twitter users to make misogynistic and racist remarks (neff2016automation); recommendation systems can be fooled by a small number of fake clicks/reviews/comments to rank products higher than they should be. Despite these many vulnerabilities, robustness against data corruption in RL has only recently begun to receive extensive study.

The existing works on robust RL are mostly theoretical and can be viewed as successors of the adversarial bandit literature. However, several drawbacks of this line of work make it insufficient against the modern real-world threats faced by RL agents. We elaborate on them below:

  1. Reward vs. transition contamination: The majority of prior works on adversarial RL focus on reward contamination (even2009online; neu2010online; neu2012adversarial; zimin2013online; rosenberg2019online; jin2020learning), while in reality the adversary often has stronger control during the adversarial interactions. For example, when a chatbot interacts with an adversarial user, the user has full control over both the rewards and transitions during that conversation episode.

  2. Density of contamination: The existing works that do handle adversarial/time-varying transitions can only tolerate a sublinear number of corrupted interactions (lykouris2019corruption; cheung2019non; ornik2019learning; ortner2019variational). These methods fail when the adversary’s attack budget also grows linearly with time, which is often the case in practice.

  3. Practicability: The majority of these works focus on the setting of tabular MDPs and cannot be applied to real-world RL problems that have large state and action spaces and require function approximation.

In this work, we address the above shortcomings by developing a variant of natural policy gradient (NPG) methods that, under a linear value function assumption, is provably robust against strongly adaptive adversaries who can arbitrarily contaminate both rewards and transitions in an $\varepsilon$ fraction of all learning episodes. Our algorithm does not need to know $\varepsilon$ and is adaptive to the contamination level. Specifically, it is guaranteed to find an $\tilde{O}(\varepsilon^{1/4})$-optimal policy in a polynomial number of steps. Complementarily, we also present a corresponding lower bound, showing that no algorithm can consistently find a better than $\Omega(\varepsilon)$-optimal policy, even with infinite data. In addition to the theoretical results, we also develop a neural network implementation of our algorithm, which achieves strong robustness performance on the MuJoCo continuous control benchmarks (todorov2012mujoco), demonstrating that our algorithm can be applied to real-world, high-dimensional RL problems.

2 Related Work

RL in standard MDPs.

Learning MDPs with stochastic rewards and transitions is relatively well-studied for the tabular case (that is, a finite number of states and actions). For example, in the episodic setting, the UCRL2 algorithm (auer2009near) achieves $O(\sqrt{H^{4}S^{2}AT})$ regret, where $H$ is the episode length, $S$ is the state space size, $A$ is the action space size, and $T$ is the total number of steps. Later, the UCBVI algorithm (azar2017minimax; dann2017unifying) achieves the optimal $O(\sqrt{H^{2}SAT})$ regret, matching the lower bound (osband2016lower; dann2015sample). Recent work extends the analysis to various linear settings (jin2020provably; yang2019reinforcement; yang2019sample; zanette2020frequentist; ayoub2020model; zhou2020provably; cai2019provably; du2019provably; kakade2020information) with known linear features. For unknown features, (agarwal2020flambe) proposes a sample-efficient algorithm that explicitly learns a feature representation under the assumption that the transition matrix is low rank. Beyond the linear settings, there are works assuming the function class has low Eluder dimension, which so far is known to be small only for linear functions and generalized linear models (osband2014model). For more general function approximation, (jiang2017contextual; sun2019model) showed that polynomial sample complexity is achievable as long as the MDP and the given function class together induce low Bellman rank and Witness rank, which include almost all prior models such as tabular MDPs, linear MDPs (yang2019reinforcement; jin2020provably), kernelized nonlinear regulators (kakade2020information), low-rank MDPs (agarwal2020flambe), and Bellman completeness under linear functions (zanette2020frequentist).

Policy Gradient and Policy Optimization.

Policy gradient (williams1992simple; sutton1999policy) and policy optimization methods are widely used in practice (kakade2002approximately; schulman2015high; schulman2017proximal) and have demonstrated impressive performance on challenging applications (berner2019dota; akkaya2019solving). Unlike model-based approaches or Bellman-backup-based approaches, PG methods directly optimize the objective function and are often more robust to model misspecification (agarwal2020pc). In addition to being robust to model misspecification, we show in this work that vanilla NPG is also robust to a constant fraction of bounded adversarial corruption on both rewards and transitions.

RL with adversarial rewards.

Almost all prior works on adversarial RL study the setting where the reward functions can be adversarial but the transitions are still stochastic and remain unchanged throughout the learning process. Specifically, at the beginning of each episode, the adversary must decide on a reward function for this episode, and cannot change it for the rest of the episode. Also, the majority of these works focus on tabular MDPs. Early works on adversarial MDPs assume a known transition function and full-information feedback. For example, (even2009online) proposes the algorithm MDP-E and proves a regret bound of $\tilde{O}(\tau\sqrt{T\log A})$ in the non-episodic setting, where $\tau$ is the mixing time of the MDP. Later, (zimin2013online) consider the episodic setting and propose the O-REPS algorithm, which applies Online Mirror Descent over the space of occupancy measures, a key component adopted by (rosenberg2019online) and (jin2020learning). O-REPS achieves the optimal regret $\tilde{O}(\sqrt{H^{2}T\log(SA)})$ in this setting. Several works consider the harder bandit feedback model while still assuming known transitions. The work (neu2010online) achieves regret $\tilde{O}(\sqrt{H^{3}AT}/\alpha)$ assuming that all states are reachable with some probability $\alpha$ under all policies. Later, (neu2010online) eliminates the dependence on $\alpha$ but only achieves $O(T^{2/3})$ regret. The O-REPS algorithm of (zimin2013online) again achieves the optimal regret $\tilde{O}(\sqrt{H^{3}SAT})$. To deal with unknown transitions, (neu2012adversarial) proposes the Follow the Perturbed Optimistic Policy algorithm and achieves $\tilde{O}(\sqrt{H^{2}S^{2}A^{2}T})$ regret given full-information feedback. Combining the idea of confidence sets with Online Mirror Descent, the UC-O-REPS algorithm of (rosenberg2019online) improves the regret to $\tilde{O}(\sqrt{H^{2}S^{2}AT})$. A few recent works consider the hardest setting, with unknown transitions as well as bandit feedback. (rosenberg2019online) achieves $O(T^{3/4})$ regret, which (jin2020learning) improves to $\tilde{O}(\sqrt{H^{2}S^{2}AT})$, matching the regret of UC-O-REPS in the full-information setting. Also, note that the lower bound of $\Omega(\sqrt{H^{2}SAT})$ (jin2018q) still applies. In summary, on tabular MDPs with oblivious reward contamination, an $O(\sqrt{T})$ regret can still be achieved. Recent improvements include best-of-both-worlds algorithms (jin2020simultaneously), data-dependent bounds (lee2020bias), and extensions to linear function approximation (neu2020online).

RL with adversarial transitions and rewards.

Very few prior works study the problem of both adversarial transitions and adversarial rewards; in fact, we are aware of only one (lykouris2019corruption). They study a setting where only a constant number $C$ of episodes can be corrupted by the adversary, and most of their technical effort is dedicated to designing an algorithm that is agnostic to $C$, i.e., the algorithm doesn’t need to know the contamination level ahead of time. As a result, their algorithm takes a multi-layer structure and cannot be easily implemented in practice. Their algorithm achieves a regret of $O(C\sqrt{T})$ for tabular MDPs and $O(C^{2}\sqrt{T})$ for linear MDPs, which unfortunately becomes vacuous when $C\geq\Omega(\sqrt{T})$ and $C\geq\Omega(T^{1/4})$, respectively. Note that the contamination ratio $C/T$ approaches zero as $T$ increases, and hence their algorithm cannot handle constant-fraction contamination. Notably, in all of the above works, the adversary can only partially adapt to the learner’s behavior, in the sense that the adversary can pick an adversarial MDP $\mathcal{M}_{k}$ or reward function $r_{k}$ at the start of episode $k$ based on the history of interactions so far. However, it can no longer adapt its strategy after the episode starts, and therefore the learner can still use a randomization strategy to trick the adversary.

A separate line of work studies the online MDP setting, where the MDP is not adversarial but changes slowly over time, and the amount of change is bounded under a total-variation metric (cheung2019non; ornik2019learning; ortner2019variational; domingues2020kernel). Due to the slow-changing nature of the environment, algorithms in these works typically use a sliding-window approach, in which the algorithm keeps discarding old data and learns a policy only from recent data, assuming that most of it comes from the MDP that the agent is currently experiencing. These methods typically achieve a regret of the form $O(\Delta^{c}K^{1-c})$, where $\Delta$ is the total-variation bound. It is worth noting that all of these regret bounds become vacuous when the amount of variation is linear in time, i.e., $\Delta\geq\Omega(T)$. Separately, it is shown that when both the transitions and the rewards are adversarial in every episode, the problem is at least as hard as the stochastic parity problem, for which no computationally efficient algorithm is known (yadkori2013online).

Learning robust controller.

A different type of robustness has also been considered in RL (pinto2017robust; derman2020bayesian) and robust control (zhou1998essentials; petersen2012robust), where the goal is to learn a control policy that is robust to potential misalignment between the training and deployment environments. Such approaches are often conservative, i.e., the learned policies are sub-optimal even when there is no corruption. In comparison, our approach can learn as effectively as standard RL algorithms when there is no corruption. Interestingly, parallel to our work, a line of concurrent work in the robust control literature (zhang2020policy; zhang2020stability; zhang2021derivative) has also found that policy optimization methods enjoy an implicit regularization/robustness property and can automatically converge to robust control policies. An interesting future direction is to understand the connection between these two kinds of robustness.

Robust statistics.

One of the most important discoveries in modern robust statistics is that there exist computationally efficient robust estimators that can learn near-optimally even under the strongest adaptive adversary. For example, for the classic problem of Gaussian mean estimation, the recent works (diakonikolas2016robust; lai2016agnostic) present the first computationally and sample-efficient algorithms. The algorithm in (diakonikolas2016robust) generates a robust mean estimate $\hat{\mu}$ such that $\|\hat{\mu}-\mu\|_{2}\leq O(\varepsilon\sqrt{\log(1/\varepsilon)})$ under $\varepsilon$-corruption. Crucially, the error bound does not scale with the dimension $d$ of the problem, suggesting that the estimator remains robust even in high-dimensional problems. Similar results have since been developed for robust mean estimation under weaker assumptions (diakonikolas2017being), and for supervised and unsupervised learning tasks (charikar2017learning; diakonikolas2019sever). We refer readers to (diakonikolas2019recent) for a more thorough survey of recent advances in high-dimensional robust statistics.

3 Problem Definitions

A Markov Decision Process (MDP) $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma,\mu_{0})$ is specified by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition model $P:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ (where $\Delta(\mathcal{S})$ denotes a distribution over $\mathcal{S}$), a (stochastic and possibly unbounded) reward function $r:\mathcal{S}\times\mathcal{A}\to\Delta(\mathbb{R})$, a discounting factor $\gamma\in[0,1)$, and an initial state distribution $\mu_{0}\in\Delta(\mathcal{S})$, i.e., $s_{0}\sim\mu_{0}$. In this paper, we assume that $\mathcal{A}$ is a small and finite set, and denote $A=\lvert\mathcal{A}\rvert$. A policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$ specifies a decision-making strategy in which the agent chooses actions based on the current state, i.e., $a\sim\pi(\cdot|s)$.

The value function $V^{\pi}:\mathcal{S}\to\mathbb{R}$ is defined as the expected discounted sum of future rewards, starting at state $s$ and executing $\pi$, i.e., $V^{\pi}(s):=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,|\,\pi,s_{0}=s\right]$, where the expectation is taken with respect to the randomness of the policy and the environment $\mathcal{M}$. Similarly, the state-action value function $Q^{\pi}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is defined as $Q^{\pi}(s,a):=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,|\,\pi,s_{0}=s,a_{0}=a\right]$.

We define the discounted state-action distribution $d_{s^{\prime}}^{\pi}$ of a policy $\pi$ as $d_{s^{\prime}}^{\pi}(s,a):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\mathbb{P}^{\pi}(s_{t}=s,a_{t}=a|s_{0}=s^{\prime})$, where $\mathbb{P}^{\pi}(s_{t}=s,a_{t}=a|s_{0}=s^{\prime})$ is the probability that $s_{t}=s$ and $a_{t}=a$ after we execute $\pi$ from $t=0$ onwards, starting at state $s^{\prime}$ in model $\mathcal{M}$. Similarly, we define $d^{\pi}_{s^{\prime},a^{\prime}}(s,a):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\mathbb{P}^{\pi}(s_{t}=s,a_{t}=a|s_{0}=s^{\prime},a_{0}=a^{\prime})$. For any state-action distribution $\nu$, we write $d^{\pi}_{\nu}(s,a):=\sum_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}\nu(s^{\prime},a^{\prime})d^{\pi}_{s^{\prime},a^{\prime}}(s,a)$. For ease of presentation, we assume that the agent can reset to $s_{0}\sim\mu_{0}$ at any point in the trajectory. We denote $d^{\pi}_{\nu}(s)=\sum_{a}d^{\pi}_{\nu}(s,a)$.

The goal of the agent is to find a policy $\pi$ that maximizes the expected value from the starting state $s_{0}$, i.e., the optimization problem is $\max_{\pi}V^{\pi}(\mu_{0}):=\mathbb{E}_{s\sim\mu_{0}}V^{\pi}(s)$, where the $\max$ is over some policy class.

For completeness, we specify a $d^{\pi}_{\nu}$-sampler and an unbiased estimator of $Q^{\pi}(s,a)$ in Algorithm 1, which are standard in discounted MDPs (agarwal2019optimality; agarwal2020pc). The $d^{\pi}_{\nu}$-sampler samples $(s,a)$ i.i.d. from $d^{\pi}_{\nu}$, and the $Q^{\pi}$-estimator returns an unbiased estimate of $Q^{\pi}(s,a)$ for a given pair $(s,a)$ via a single roll-out from $(s,a)$. Later, when we define the contamination model and the sample complexity of learning, we treat each call of the $d^{\pi}_{\nu}$-sampler (optionally followed by a $Q^{\pi}(s,a)$-estimator) as a single episode, as in practice both of these procedures can be carried out in a single roll-out from $\mu_{0}$.
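To make these two procedures concrete, here is a minimal Python sketch, assuming a hypothetical environment object `env` with `reset_to(s, a)` and `step(a)` methods that return the clean (reward, next state) feedback; these names, and the callables `nu` and `pi`, are our own illustrative interface rather than anything specified in the paper.

```python
import numpy as np


def d_nu_pi_sampler(env, pi, nu, gamma, rng=None):
    """Sample a pair (s, a) ~ d^pi_nu: roll out pi from (s0, a0) ~ nu, stop w.p. 1 - gamma."""
    rng = rng or np.random.default_rng()
    s, a = nu(rng)                       # draw (s0, a0) from the reset distribution nu
    env.reset_to(s, a)                   # hypothetical API: put the environment in state s
    while True:
        if rng.random() < 1.0 - gamma:   # geometric stopping yields the discounted occupancy
            return s, a
        _, s = env.step(a)               # hypothetical API: returns (reward, next_state)
        a = pi(s, rng)                   # sample a ~ pi(.|s)


def q_pi_estimator(env, pi, s, a, gamma, rng=None):
    """Unbiased estimate of Q^pi(s, a): sum rewards along a geometrically truncated roll-out."""
    rng = rng or np.random.default_rng()
    env.reset_to(s, a)
    q_hat, cur_a = 0.0, a
    while True:
        r, next_s = env.step(cur_a)
        q_hat += r                       # accumulate r(s_i, a_i) up to the stopping time
        if rng.random() < 1.0 - gamma:
            return q_hat
        cur_a = pi(next_s, rng)
```

The geometric termination with parameter $1-\gamma$ is what makes the returned $(s,a)$ distributed according to $d^{\pi}_{\nu}$ and makes $\widehat{Q}^{\pi}(s,a)$ an unbiased estimate of the discounted return.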

Assumption 3.1 (Linear Q function).

For the theoretical analysis, we focus on the setting of linear value function approximation. In particular, we assume that there exists a feature map $\phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d}$, such that for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ and any policy $\pi:\mathcal{S}\to\Delta_{\mathcal{A}}$, we have

\[
Q^{\pi}(s,a)=\phi(s,a)^{\top}w^{\pi},\quad\text{for some }\|w^{\pi}\|\leq W. \qquad (1)
\]

We also assume that the features are bounded, i.e., $\max_{s,a}\|\phi(s,a)\|_{2}\leq 1$, and that the reward function has bounded first and second moments, i.e., $\mathbb{E}\left[r(s,a)\right]\in[0,1]$ and $\mathrm{Var}(r(s,a))\leq\sigma^{2}$ for all $(s,a)$.

Remark 3.1.

Assumption 3.1 is satisfied, for example, in tabular MDPs and linear MDPs of (jin2020provably) or (yang2019sample). Unlike most theoretical RL literature, we allow the reward to be stochastic and unbounded. Such a setup aligns better with applications with a low signal-to-noise ratio and motivates the requirement for nontrivial robust learning techniques.

Notation.

When clear from context, we write $d^{\pi}(s,a)$ and $d^{\pi}(s)$ to denote $d_{\mu_{0}}^{\pi}(s,a)$ and $d^{\pi}_{\mu_{0}}(s)$ respectively. For iterative algorithms which obtain policies at each episode, we let $V^{i}$, $Q^{i}$ and $A^{i}$ denote the corresponding quantities associated with episode $i$. For a vector $v$, we denote $\|v\|_{2}=\sqrt{\sum_{i}v_{i}^{2}}$, $\|v\|_{1}=\sum_{i}|v_{i}|$, and $\|v\|_{\infty}=\max_{i}|v_{i}|$. We use $\text{Uniform}(\mathcal{A})$ (in short $\text{Unif}_{\mathcal{A}}$) to represent a uniform distribution over the set $\mathcal{A}$.

3.1 The Contamination Model

In this paper, we study the robustness of policy gradient methods under the $\varepsilon$-contamination model, a widely studied adversarial model in the robust statistics literature, e.g., see (diakonikolas2016robust). In the classic robust mean estimation problem, given a dataset $D$ and a learning algorithm $f$, the $\varepsilon$-contamination model assumes that the adversary has full knowledge of the dataset $D$ and the learning algorithm $f$, and can arbitrarily change an $\varepsilon$-fraction of the data in the dataset and then send the contaminated data to the learner. The goal of the learner is to identify an $O(\mathrm{poly}(\varepsilon))$-optimal estimator of the mean despite the $\varepsilon$-contamination.

Unfortunately, the original $\varepsilon$-contamination model is defined for the offline learning setting and does not directly generalize to the online setting, because it does not specify what the adversary knows, and in what order the adversary and the learner act, over time. In this paper, we define the $\varepsilon$-contamination model for online learning as follows:

Definition 3.1 ($\varepsilon$-contamination model for Reinforcement Learning).

Given $\varepsilon$ and the clean MDP $\mathcal{M}$, an $\varepsilon$-contamination adversary operates as follows:

  1. The adversary has full knowledge of the MDP $\mathcal{M}$ and the learning algorithm, and observes all the historical interactions.

  2. At any time step $t$, the adversary observes the current state-action pair $(s_{t},a_{t})$, as well as the reward and next state returned by the environment, $(r_{t},s_{t+1})$. He can then decide whether to replace $(r_{t},s_{t+1})$ with an arbitrary reward and next state $(r^{\dagger}_{t},s^{\dagger}_{t+1})\in\mathbb{R}\times\mathcal{S}$.

  3. The only constraint on the adversary is that, if the learning process terminates after $K$ episodes, he can contaminate at most $\varepsilon K$ episodes.

Compared to the standard adversarial models studied in online learning (shalev2011online), adversarial bandits (bubeck2012regret; lykouris2018stochastic; gupta2019better) and adversarial RL (lykouris2019corruption; jin2020learning), the $\varepsilon$-contamination model in Definition 3.1 is stronger in several ways: (1) the adversary can adaptively attack after observing the action of the learner as well as the feedback from the clean environment; (2) the adversary can perturb the data arbitrarily (any real-valued reward and any next state from the state space) rather than sampling it from a pre-specified bounded adversarial reward function or adversarial MDP.
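The following Python sketch makes the interaction protocol of Definition 3.1 explicit. The wrapper class, its `attack_fn` callback, and the `env.step` interface are all our own illustrative constructions (not code from the paper); the only constraint the wrapper enforces is the $\varepsilon K$ episode budget of item 3.

```python
class EpsilonContaminationAdversary:
    """Illustrative implementation of the adversary in Definition 3.1.

    attack_fn(history, s, a, r, s_next) may return a corrupted pair (r_dagger, s_dagger),
    or None to leave this step clean. It sees the full history, mimicking an adversary
    with complete knowledge of past interactions.
    """

    def __init__(self, env, attack_fn, eps, num_episodes):
        self.env = env
        self.attack_fn = attack_fn
        self.episode_budget = int(eps * num_episodes)   # at most eps * K corrupted episodes
        self.corrupted_episodes = 0
        self.in_corrupted_episode = False
        self.history = []                               # all past interactions, visible to the adversary

    def reset(self):
        self.in_corrupted_episode = False
        return self.env.reset()

    def step(self, s, a):
        r, s_next = self.env.step(a)                    # clean feedback (hypothetical env API)
        # The adversary may overwrite (r, s_next) arbitrarily, as long as the number of
        # episodes containing at least one corruption stays within the budget.
        may_attack = self.in_corrupted_episode or self.corrupted_episodes < self.episode_budget
        if may_attack:
            attack = self.attack_fn(self.history, s, a, r, s_next)
            if attack is not None:
                if not self.in_corrupted_episode:
                    self.in_corrupted_episode = True
                    self.corrupted_episodes += 1
                r, s_next = attack
        self.history.append((s, a, r, s_next))
        return r, s_next
```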

Given the contamination model, our first result is a lower bound showing that under the $\varepsilon$-contamination model, one can only hope to find an $O(\varepsilon)$-optimal policy. Exact optimal policy identification is not possible even with infinite data.

Theorem 3.1 (lower bound).

For any algorithm, there exists an MDP such that the algorithm fails to find an $\left(\frac{\varepsilon}{2(1-\gamma)}\right)$-optimal policy under the $\varepsilon$-contamination model with a probability of at least $1/4$.

The high-level idea is to construct two MDPs, $M$ and $M^{\prime}$, with the following properties: (1) no policy can be $O(\varepsilon/(1-\gamma))$-optimal on both MDPs simultaneously; (2) an $\varepsilon$-contamination adversary can, with large probability, mimic one MDP via contamination in the other, regardless of the learner’s behavior. Therefore, under contamination, the learner cannot distinguish $M$ from $M^{\prime}$ and must suffer an $\Omega(\varepsilon/(1-\gamma))$ gap on at least one of them.
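As one concrete instantiation of this argument (our illustration; the formal construction is deferred to the paper's proof), consider two single-state MDPs with two actions that differ only in the reward of $a_{1}$:

\[
M:\; r(a_{1})\sim\mathrm{Bernoulli}\!\left(\tfrac{1}{2}+\varepsilon\right),\quad r(a_{2})=\tfrac{1}{2}+\tfrac{\varepsilon}{2};
\qquad
M^{\prime}:\; r(a_{1})\sim\mathrm{Bernoulli}\!\left(\tfrac{1}{2}\right),\quad r(a_{2})=\tfrac{1}{2}+\tfrac{\varepsilon}{2}.
\]

In $M$ the optimal action is $a_{1}$ and in $M^{\prime}$ it is $a_{2}$, and playing the wrong action loses $\varepsilon/2$ reward per step, i.e., $\frac{\varepsilon}{2(1-\gamma)}$ in value. Since the two reward laws of $a_{1}$ differ by $\varepsilon$ in total variation, an adversary that corrupts (roughly) an $\varepsilon$-fraction of the episodes can make the observed feedback in $M$ follow exactly the law of $M^{\prime}$, so no learner can reliably tell the two apart.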

3.2 Background on NPG

Given a differentiable parameterized policy $\pi_{\theta}:\mathcal{S}\to\Delta(\mathcal{A})$, NPG can be written in the following actor-critic style update form. With the dataset $\{s_{i},a_{i},\widehat{Q}^{\pi_{\theta}}(s_{i},a_{i})\}_{i=1}^{N}$, where $s_{i},a_{i}\sim d^{\pi_{\theta}}_{\nu}$ and $\widehat{Q}^{\pi_{\theta}}(s_{i},a_{i})$ is an unbiased estimate of $Q^{\pi_{\theta}}(s_{i},a_{i})$ (e.g., via the $Q^{\pi}$-estimator), we have

\[
\widehat{w}\in\operatorname*{\arg\min}_{w:\|w\|_{2}\leq W}\sum_{i=1}^{N}\left(w^{\top}\nabla\log\pi_{\theta}(a_{i}|s_{i})-\widehat{Q}^{\pi_{\theta}}(s_{i},a_{i})\right)^{2},
\qquad
\theta^{\prime}=\theta+\eta\widehat{w}. \qquad (2)
\]

In the theoretical part of this work, we focus on softmax linear policies, i.e., $\pi_{\theta}(a|s)\propto\exp(\theta^{\top}\phi(s,a))$. In this case, note that $\nabla\log\pi_{\theta}(a|s)=\phi(s,a)$, and it is not hard to verify that the policy update procedure is equivalent to:

\[
\pi_{\theta^{\prime}}(a|s)\propto\pi_{\theta}(a|s)\exp\left(\eta\widehat{w}^{\top}\phi(s,a)\right),\quad\forall s,a,
\]

which is equivalent to running Mirror Descent on each state with the reward vector $\widehat{w}^{\top}\phi(s,\cdot)\in\mathbb{R}^{|\mathcal{A}|}$. We refer readers to (agarwal2019optimality) for a more detailed explanation of NPG and the equivalence between the form in Eq. (2) and the classic form that uses the Fisher information matrix. Similar to (agarwal2019optimality), we make the following assumption of access to an exploratory reset distribution, under which it has been shown that NPG converges to the optimal policy without contamination.
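As a reference point for the update in Eq. (2), the following numpy sketch performs one sample-based NPG iteration for the softmax linear policy class (our paraphrase, not the authors' implementation); for simplicity the $\|w\|_{2}\leq W$ constraint is enforced by rescaling the unconstrained least-squares solution, which is one convenient surrogate for the constrained argmin stated above.

```python
import numpy as np


def npg_update(theta, phi_sa, q_hat, eta, W):
    """One sample-based NPG step for pi_theta(a|s) proportional to exp(theta^T phi(s, a)).

    phi_sa : (N, d) array of features phi(s_i, a_i), with (s_i, a_i) ~ d^{pi_theta}_nu
    q_hat  : (N,)  array of unbiased estimates of Q^{pi_theta}(s_i, a_i)
    """
    # Critic: least-squares fit of the Q estimates on the features (Eq. (2), first line).
    w, *_ = np.linalg.lstsq(phi_sa, q_hat, rcond=None)
    # Keep ||w||_2 <= W by rescaling (a simple surrogate for the constrained argmin).
    norm = np.linalg.norm(w)
    if norm > W:
        w = w * (W / norm)
    # Actor: natural gradient step, i.e. mirror descent on every state simultaneously.
    return theta + eta * w


def softmax_linear_policy(theta, phi_s):
    """Action distribution of pi_theta at one state; phi_s is the (|A|, d) feature matrix."""
    logits = phi_s @ theta
    logits -= logits.max()               # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```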

Assumption 3.2 (Relative condition number).

With respect to any state-action distribution $\upsilon$, define:

\[
\Sigma_{\upsilon}=\mathbb{E}_{s,a\sim\upsilon}\left[\phi_{s,a}\phi_{s,a}^{\top}\right],
\]

and define

\[
\sup_{w\in\mathbb{R}^{d}}\ \frac{w^{\top}\Sigma_{d^{\star}}w}{w^{\top}\Sigma_{\nu}w}=\kappa,
\quad\text{where } d^{\star}(s,a)=d^{\pi^{\star}}_{\mu_{0}}(s)\circ\text{Unif}_{\mathcal{A}}(a).
\]

We assume $\kappa$ is finite and small with respect to a reset distribution $\nu$ available to the learner at training time.
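Although $\kappa$ appears only in the analysis and is never computed by the algorithm, it may help intuition to note that it is simply the largest generalized eigenvalue of the pair $(\Sigma_{d^{\star}},\Sigma_{\nu})$; the sketch below estimates it from feature samples (our illustration, with a small ridge term added for numerical invertibility).

```python
import numpy as np
from scipy.linalg import eigh


def relative_condition_number(phi_dstar, phi_nu, ridge=1e-8):
    """Estimate kappa = sup_w (w^T Sigma_{d*} w) / (w^T Sigma_nu w) from feature samples.

    phi_dstar : (n1, d) features drawn from d*(s, a) = d^{pi*}_{mu0}(s) x Unif_A(a)
    phi_nu    : (n2, d) features drawn from the reset distribution nu
    """
    d = phi_dstar.shape[1]
    sigma_dstar = phi_dstar.T @ phi_dstar / len(phi_dstar)
    sigma_nu = phi_nu.T @ phi_nu / len(phi_nu) + ridge * np.eye(d)
    # kappa equals the largest generalized eigenvalue of (Sigma_{d*}, Sigma_nu).
    eigvals = eigh(sigma_dstar, sigma_nu, eigvals_only=True)
    return float(eigvals[-1])
```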

4 The Natural Robustness of NPG Against Bounded Corruption

Our first result shows that, surprisingly, NPG is already robust against $\varepsilon$-contamination if the adversary can only generate small and bounded rewards. In particular, we assume that the adversarial rewards are bounded in $[0,1]$ (the feature $\phi(s,a)$ is already bounded).

Theorem 4.1 (Natural robustness of NPG).

Under Assumptions 3.1 and 3.2, given a desired optimality gap $\alpha$, there exists a set of hyperparameters agnostic to the contamination level $\varepsilon$, such that Algorithm LABEL:alg:q_npg_sample guarantees, with $\mathrm{poly}(1/\alpha,1/(1-\gamma),|\mathcal{A}|,W,\sigma,\kappa)$ sample complexity, that under $\varepsilon$-contamination with adversarial rewards bounded in $[0,1]$, we have

\[
\mathbb{E}\left[V^{*}(\mu_{0})-V^{\hat{\pi}}(\mu_{0})\right]\leq\tilde{O}\left(\max\left[\alpha,\ W\sqrt{\frac{|\mathcal{A}|\kappa\varepsilon}{(1-\gamma)^{3}}}\ \right]\right),
\]

where $\hat{\pi}$ is the uniform mixture of $\pi^{(1)}$ through $\pi^{(T)}$.

A few remarks are in order.

Remark 4.1 (Agnostic to the contamination level $\varepsilon$).

It is worth emphasizing that, to achieve the above bound, the hyperparameters of NPG are agnostic to the value of $\varepsilon$, and so the algorithm can be applied in the more realistic setting where the agent does not know the contamination level $\varepsilon$, similar to what's achieved in (lykouris2019corruption) with a complicated nested structure. The same property is also achieved by the FPG algorithm in the next section.

Remark 4.2 (Dimension-independent robustness guarantee).

Theorem 4.1 guarantees that NPG can find an $O(\varepsilon^{1/2})$-optimal policy after a polynomial number of episodes, provided that $|\mathcal{A}|$ and $\kappa$ are small. Conceptually, the relative condition number $\kappa$ indicates how well-aligned the initial state distribution is to the occupancy distribution of the optimal policy. A good initial distribution can have a $\kappa$ as small as $1$, and so $\kappa$ is independent of $d$. Interested readers can refer to (agarwal2019optimality) (Remark 6.3) for additional discussion on the relative condition number. Importantly, the optimality gap here does not directly scale with $d$, and so the guarantee will not blow up on high-dimensional problems. This is an important attribute of robust learning algorithms, heavily emphasized in the traditional robust statistics literature.

The proof of Theorem 4.1 relies on the following NPG regret lemma, first developed by (even2009online) for the MDP-Expert algorithm and later extended to NPG by (agarwal2019optimality; agarwal2020pc):

Lemma 4.1 (NPG Regret Lemma).

Suppose Assumptions 3.1 and 3.2 hold and Algorithm LABEL:alg:q_npg_sample starts with $\theta^{(0)}=0$ and $\eta=\sqrt{2\log|\mathcal{A}|/(W^{2}T)}$. Suppose in addition that the (random) sequence of iterates satisfies the assumption that

\[
\mathbb{E}\left[\mathbb{E}_{s,a\sim d^{(t)}}\left[\left(Q^{\pi^{(t)}}(s,a)-\phi(s,a)^{\top}w^{(t)}\right)^{2}\right]\right]\leq\varepsilon_{stat}^{(t)}.
\]

Then, we have that

\[
\mathbb{E}\left[\sum_{t=1}^{T}\left\{V^{*}(\mu_{0})-V^{(t)}(\mu_{0})\right\}\right]
\leq\frac{W}{1-\gamma}\sqrt{2\log|\mathcal{A}|\,T}+\sum_{t=1}^{T}\sqrt{\frac{4|\mathcal{A}|\kappa\varepsilon_{stat}^{(t)}}{(1-\gamma)^{3}}}. \qquad (3)
\]

Intuitively, Lemma 4.1 decomposes the regret of NPG into two terms. The first term corresponds to the regret of the standard mirror descent procedure, which scales with $\sqrt{T}$. The second term corresponds to the estimation error of the Q values, which act as the reward signal for mirror descent. When not under attack, the estimation error $\varepsilon_{stat}^{(t)}$ goes to zero as the number of samples $M$ gets larger, which in turn implies the global convergence of NPG. However, under bounded attack, the estimation error $\varepsilon_{stat}^{(t)}$ will not go to zero even with infinite data. Nevertheless, we can show that it is bounded by $O(\varepsilon^{(t)})$ when the sample size $M$ is large enough, where $\varepsilon^{(t)}$ denotes the fraction of episodes corrupted in iteration $t$. Note that by definition, we have $\sum_{t}\varepsilon^{(t)}\leq\varepsilon T$.

Lemma 4.2 (Robustness of linear regression under bounded contamination).

Suppose the adversarial rewards are bounded in $[0,1]$, and in a particular iteration $t$ the adversary contaminates an $\varepsilon^{(t)}$ fraction of the episodes. Then, given $M$ episodes, it is guaranteed that with probability at least $1-\delta$,

\[
\mathbb{E}_{s,a\sim d^{(t)}}\left[\left(Q^{\pi^{(t)}}(s,a)-\phi(s,a)^{\top}w^{(t)}\right)^{2}\right]
\leq 4\left(W^{2}+WH\right)\left(\varepsilon^{(t)}+\sqrt{\frac{8}{M}\log\frac{4d}{\delta}}\right), \qquad (4)
\]

where $H=(\log\delta-\log M)/\log\gamma$ is the effective horizon.

This, along with the NPG regret lemma, guarantees that the expected regret of NPG is bounded by $O(\sqrt{T}+M^{-1/4}+\sqrt{\varepsilon}T)$, which in turn guarantees the identification of an $O(\sqrt{\varepsilon})$-optimal policy.
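To spell out that last step (a sketch that drops the constant factors and the $W$, $H$ dependence): substituting the bound of Lemma 4.2 into the second term of Lemma 4.1 and using $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$, followed by Cauchy–Schwarz and $\sum_{t}\varepsilon^{(t)}\leq\varepsilon T$,

\[
\sum_{t=1}^{T}\sqrt{\varepsilon_{stat}^{(t)}}
\;\lesssim\;\sum_{t=1}^{T}\left(\sqrt{\varepsilon^{(t)}}+\Big(\tfrac{8}{M}\log\tfrac{4d}{\delta}\Big)^{1/4}\right)
\;\leq\;\sqrt{T\sum_{t=1}^{T}\varepsilon^{(t)}}+T\Big(\tfrac{8}{M}\log\tfrac{4d}{\delta}\Big)^{1/4}
\;\leq\;T\sqrt{\varepsilon}+T\Big(\tfrac{8}{M}\log\tfrac{4d}{\delta}\Big)^{1/4},
\]

so the average optimality gap over the $T$ iterates is $O\!\left(1/\sqrt{T}+M^{-1/4}+\sqrt{\varepsilon}\right)$, and taking $T$ and $M$ large enough leaves the irreducible $O(\sqrt{\varepsilon})$ term.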

In the special case of tabular MDPs, the feature vectors $\phi(s,a)$ are all one-hot and $W$ will in general be on the order of $O(\sqrt{SA})$, which means that the bound given by Theorem 4.1 still scales with the size of the state space. In the following corollary, we show that this dependency can be removed through a tighter analysis.

Corollary 4.1 (Dimension-free Robustness of NPG in tabular MDPs).

Given a tabular MDP satisfying Assumption 3.2 and a desired optimality gap $\alpha$, there exists a set of hyperparameters agnostic to the contamination level $\varepsilon$, such that Algorithm LABEL:alg:q_npg_sample guarantees, with $\mathrm{poly}(1/\alpha,1/(1-\gamma),|\mathcal{A}|,W,\sigma,\kappa)$ sample complexity, that under $\varepsilon$-contamination with adversarial rewards bounded in $[0,1]$, we have

\[
\mathbb{E}\left[V^{*}(\mu_{0})-V^{\hat{\pi}}(\mu_{0})\right]\leq\tilde{O}\left(\max\left[\alpha,\ \sqrt{\frac{|\mathcal{A}|\kappa\varepsilon}{(1-\gamma)^{5}}}\ \right]\right),
\]

where $\hat{\pi}$ is the uniform mixture of $\pi^{(1)}$ through $\pi^{(T)}$.

In the more general case of linear MDPs, $W$ will not necessarily scale with $d$ in an obvious way, and thus we leave Theorem 4.1 untouched.

5 FPG: Robust NPG Against Unbounded Corruption

Algorithm 1: $d_{\nu}^{\pi}$-sampler and $Q^{\pi}$-estimator

$d_{\nu}^{\pi}$-sampler:
  Input: a reset distribution $\nu\in\Delta(\mathcal{S}\times\mathcal{A})$.
  Sample $s_{0},a_{0}\sim\nu$.
  Execute $\pi$ from $(s_{0},a_{0})$; at any step $t$ with $(s_{t},a_{t})$, return $(s_{t},a_{t})$ with probability $1-\gamma$.

$Q^{\pi}$-estimator:
  Input: current state-action pair $(s,a)$, a policy $\pi$.
  Execute $\pi$ from $(s_{0},a_{0})=(s,a)$; at step $t$ with $(s_{t},a_{t})$, terminate with probability $1-\gamma$.
  Return $\widehat{Q}^{\pi}(s,a)=\sum_{i=0}^{t}r(s_{i},a_{i})$.

[In an adversarial episode, the adversary can hijack the $d_{\nu}^{\pi}$-sampler to return any $(s,a)$ pair and the $Q^{\pi}$-estimator to return any $\widehat{Q}^{\pi}(s,a)\in\mathbb{R}$.]

Algorithm LABEL:alg:q_npg_sample (sample-based NPG):

  Require: learning rate $\eta$; number of episodes per iteration $M$.
  Initialize $\theta^{(0)}=0$.
  For $t=0,1,\ldots,T-1$:
    Call Algorithm 1 $M$ times with $\pi^{(t)}$ to obtain a dataset consisting of $s_{i},a_{i}\sim d^{(t)}_{\nu}$ and $\widehat{Q}^{(t)}(s_{i},a_{i})$, $i\in[M]$.
    Solve the linear regression problem
    \[
    w^{(t)}=\operatorname*{\arg\min}_{\|w\|_{2}\leq W}\sum_{i=1}^{M}\left(\widehat{Q}^{(t)}(s_{i},a_{i})-w^{\top}\phi(s_{i},a_{i})\right)^{2}.
    \]
    Update $\theta^{(t+1)}=\theta^{(t)}+\eta w^{(t)}$.
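Putting the pieces together, a minimal training loop corresponding to the listing above might look as follows. This is an illustrative sketch built on the hypothetical interfaces and helper functions (`d_nu_pi_sampler`, `q_pi_estimator`, `npg_update`, `softmax_linear_policy`) from the earlier snippets, not the authors' code.

```python
import numpy as np


def run_npg(env, feature_map, nu, num_actions, dim, gamma, eta, W, T, M, rng=None):
    """Sample-based NPG: T iterations, M episodes per iteration (see the listing above).

    feature_map(s, a) -> (d,) numpy array; nu(rng) -> (s0, a0); env as in the earlier sketches.
    """
    rng = rng or np.random.default_rng()
    theta = np.zeros(dim)

    def pi(s, rng):
        # softmax linear policy pi_theta(a|s) proportional to exp(theta^T phi(s, a))
        phi_s = np.stack([feature_map(s, a) for a in range(num_actions)])
        probs = softmax_linear_policy(theta, phi_s)
        return rng.choice(num_actions, p=probs)

    for _ in range(T):
        feats, targets = [], []
        for _ in range(M):              # one episode = one sampler call followed by one Q estimate
            s, a = d_nu_pi_sampler(env, pi, nu, gamma, rng)
            q = q_pi_estimator(env, pi, s, a, gamma, rng)
            feats.append(feature_map(s, a))
            targets.append(q)
        theta = npg_update(theta, np.stack(feats), np.asarray(targets), eta, W)
    return theta
```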