
Generative Adversarial Imitation Learning with Neural Networks: Global Optimality and Convergence Rate

Yufeng Zhang (Northwestern University; [email protected])   Qi Cai (Northwestern University; [email protected])   Zhuoran Yang (Princeton University; [email protected])   Zhaoran Wang (Northwestern University; [email protected])
Abstract

Generative adversarial imitation learning (GAIL) demonstrates tremendous success in practice, especially when combined with neural networks. In contrast to reinforcement learning, GAIL learns both the policy and the reward function from expert (human) demonstration. Despite its empirical success, it remains unclear whether GAIL with neural networks converges to the globally optimal solution. The major difficulty comes from the nonconvex-nonconcave minimax optimization structure. To bridge the gap between practice and theory, we analyze a gradient-based algorithm with alternating updates and establish its sublinear convergence to the globally optimal solution. To the best of our knowledge, our analysis establishes the global optimality and convergence rate of GAIL with neural networks for the first time.

1 Introduction

The goal of imitation learning (IL) is to learn to perform a task based on expert demonstration (Ho and Ermon, 2016). In contrast to reinforcement learning (RL), the agent only has access to the expert trajectories but not the rewards. The most straightforward approach of IL is behavioral cloning (BC) (Pomerleau, 1991). BC treats IL as the supervised learning problem of predicting the actions based on the states. Despite its simplicity, BC suffers from the compounding errors caused by covariate shift (Ross et al., 2011; Ross and Bagnell, 2010). Another approach of IL is inverse reinforcement learning (IRL) (Russell, 1998; Ng and Russell, 2000; Levine and Koltun, 2012; Finn et al., 2016), which jointly learns the reward function and the corresponding optimal policy. IRL formulates IL as a bilevel optimization problem. Specifically, IRL solves an RL subproblem given a reward function at the inner level and searches for the reward function which makes the expert policy optimal at the outer level. However, IRL is computationally inefficient as it requires fully solving an RL subproblem at each iteration of the outer level. Moreover, the desired reward function may be nonunique. To address such issues of IRL, Ho and Ermon (2016) propose generative adversarial imitation learning (GAIL), which searches for the optimal policy without fully solving an RL subproblem given a reward function at each iteration. GAIL solves IL via minimax optimization with alternating updates. In particular, GAIL alternates between (i) minimizing the discrepancy in expected cumulative reward between the expert policy and the learned policy and (ii) maximizing such a discrepancy over the reward function class. Such an alternating update scheme mirrors the training of generative adversarial networks (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017), where the learned policy acts as the generator while the reward function acts as the discriminator.

Incorporated with neural networks, which parameterize the learned policy and the reward function, GAIL achieves significant empirical success in challenging applications, such as natural language processing (Yu et al., 2016), autonomous driving (Kuefler et al., 2017), human behavior modeling (Merel et al., 2017), and robotics (Tai et al., 2018). Despite its empirical success, GAIL with neural networks remains less understood in theory. The major difficulty arises from the following aspects: (i) GAIL involves minimax optimization, while the existing analysis of policy optimization with neural networks (Anthony and Bartlett, 2009; Liu et al., 2019; Bhandari and Russo, 2019; Wang et al., 2019) only focuses on a minimization or maximization problem. (ii) GAIL with neural networks is nonconvex-nonconcave, and therefore, the existing analysis of convex-concave optimization with alternating updates is not applicable (Nesterov, 2013). There is an emerging body of literature (Rafique et al., 2018; Zhang et al., 2019b) that casts nonconvex-nonconcave optimization as bilevel optimization, where the inner level is solved to a high precision as in IRL. However, such analysis is not applicable to GAIL as it involves alternating updates.

In this paper, we bridge the gap between practice and theory by establishing the global optimality and convergence of GAIL with neural networks. Specifically, we parameterize the learned policy and the reward function with two-layer neural networks and consider solving GAIL by alternately updating the learned policy via a step of natural policy gradient (Kakade, 2002; Peters and Schaal, 2008) and the reward function via a step of gradient ascent. In particular, we parameterize the state-action value function (also known as the Q-function) with a two-layer neural network and apply a variant of the temporal difference algorithm (Sutton and Barto, 2018) to solve the policy evaluation subproblem in natural policy gradient. We prove that the learned policy $\bar{\pi}$ converges to the expert policy $\pi_{\text{E}}$ at a $1/\sqrt{T}$ rate in the $\mathcal{R}$-distance (Chen et al., 2020), which is defined as $\mathbb{D}_{\mathcal{R}}(\pi_{\text{E}},\bar{\pi})=\max_{r\in\mathcal{R}}J(\pi_{\text{E}};r)-J(\bar{\pi};r)$. Here $J(\pi;r)$ is the expected cumulative reward of a policy $\pi$ given a reward function $r(s,a)$ and $\mathcal{R}$ is the reward function class. The core of our analysis is constructing a potential function that tracks the $\mathcal{R}$-distance. Such a rate of convergence implies that the learned policy $\bar{\pi}$ (approximately) outperforms the expert policy $\pi_{\text{E}}$ given any reward function $r\in\mathcal{R}$ within a finite number of iterations $T$. In other words, the learned policy $\bar{\pi}$ is globally optimal. To the best of our knowledge, our analysis establishes the global optimality and convergence of GAIL with neural networks for the first time. It is worth mentioning that our analysis is straightforwardly applicable to linear and tabular settings, which, however, are not our focus.

Related works. Our work extends an emerging body of literature on RL with neural networks (Xu et al., 2019a; Zhang et al., 2019a; Bhandari and Russo, 2019; Liu et al., 2019; Wang et al., 2019; Agarwal et al., 2019) to IL. This line of research analyzes the global optimality and convergence of policy gradient for solving RL, which is a minimization or maximization problem. In contrast, we analyze GAIL, which is a minimax optimization problem.

Our work is also related to the analysis of apprenticeship learning (Syed et al., 2008) and GAIL (Cai et al., 2019a; Chen et al., 2020). Syed et al. (2008) analyze the convergence and generalization of apprenticeship learning. They assume the state space to be finite, and thus, do not require function approximation for the policy and the reward function. In contrast, we assume the state space to be infinite and employ function approximation based on neural networks. Cai et al. (2019a) study the global optimality and convergence of GAIL in the setting of linear-quadratic regulators. In contrast, our analysis handles general MDPs without restrictive assumptions on the transition kernel and the reward function. Chen et al. (2020) study the convergence and generalization of GAIL in the setting of general MDPs. However, they only establish the convergence to a stationary point. In contrast, we establish the global optimality of GAIL.

Notations. Let $[n]=\{1,\ldots,n\}$ for $n\in\mathbb{N}_{+}$ and $[m:n]=\{m,m+1,\ldots,n\}$ for $m<n$. Also, let $N(\mu,\Sigma)$ be the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. We denote by ${\mathcal{P}}({\mathcal{X}})$ the set of all probability measures over the space ${\mathcal{X}}$. For a function $f:{\mathcal{X}}\rightarrow\mathbb{R}$, a constant $p\geq 1$, and a probability measure $\mu\in{\mathcal{P}}({\mathcal{X}})$, we denote by $\|f\|_{p,\mu}=(\int_{\mathcal{X}}|f(x)|^{p}\,{\mathrm{d}}\mu(x))^{1/p}$ the $L_{p}(\mu)$ norm of the function $f$. For any two functions $f,g:{\mathcal{X}}\rightarrow\mathbb{R}$, we denote by $\langle f,g\rangle_{\mathcal{X}}=\int_{\mathcal{X}}f(x)\cdot g(x)\,{\mathrm{d}}x$ the inner product on the space ${\mathcal{X}}$.

2 Background

In this section, we introduce reinforcement learning (RL) and generative adversarial imitation learning (GAIL).

2.1 Reinforcement Learning

We consider a Markov decision process (MDP) $({\mathcal{S}},\mathcal{A},r,P,\rho,\gamma)$. Here ${\mathcal{S}}\subseteq\mathbb{R}^{d_{1}}$ is the state space, $\mathcal{A}\subseteq\mathbb{R}^{d_{2}}$ is the action space, which is assumed to be finite throughout this paper, $r:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function, $P:{\mathcal{S}}\times\mathcal{A}\rightarrow{\mathcal{P}}({\mathcal{S}})$ is the transition kernel, $\rho\in{\mathcal{P}}({\mathcal{S}})$ is the initial state distribution, and $\gamma\in(0,1)$ is the discount factor. Without loss of generality, we assume that ${\mathcal{S}}\times\mathcal{A}$ is compact and that $\|(s,a)\|_{2}\leq 1$ for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}\subseteq\mathbb{R}^{d}$, where $d=d_{1}+d_{2}$. An agent following a policy $\pi:{\mathcal{S}}\rightarrow{\mathcal{P}}(\mathcal{A})$ interacts with the environment in the following manner. At the state $s_{t}\in{\mathcal{S}}$, the agent takes the action $a_{t}\in\mathcal{A}$ with probability $\pi(a_{t}\,|\,s_{t})$ and receives the reward $r_{t}=r(s_{t},a_{t})$. The environment then transitions into the next state $s_{t+1}$ with probability $P(s_{t+1}\,|\,s_{t},a_{t})$. Given a policy $\pi$ and a reward function $r(s,a)$, we define the state-action value function $Q^{\pi}_{r}:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathbb{R}$ as follows,

$$Q^{\pi}_{r}(s,a)=\mathbb{E}_{\pi}\Bigl[(1-\gamma)\cdot\sum_{t=0}^{\infty}\gamma^{t}\cdot r(s_{t},a_{t})\,\Big|\,s_{0}=s,a_{0}=a\Bigr]. \quad (2.1)$$

Here the expectation $\mathbb{E}_{\pi}$ is taken with respect to $a_{t}\sim\pi(\cdot\,|\,s_{t})$ and $s_{t+1}\sim P(\cdot\,|\,s_{t},a_{t})$. Correspondingly, we define the state value function $V^{\pi}_{r}:{\mathcal{S}}\rightarrow\mathbb{R}$ and the advantage function $A^{\pi}_{r}:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathbb{R}$ as follows,

$$V^{\pi}_{r}(s)=\mathbb{E}_{a\sim\pi(\cdot\,|\,s)}\bigl[Q_{r}^{\pi}(s,a)\bigr],\quad A^{\pi}_{r}(s,a)=Q_{r}^{\pi}(s,a)-V^{\pi}_{r}(s).$$

The goal of RL is to maximize the following expected cumulative reward,

$$J(\pi;r)=\mathbb{E}_{s\sim\rho}\bigl[V_{r}^{\pi}(s)\bigr]. \quad (2.2)$$

The policy $\pi$ induces a state visitation measure $d_{\pi}\in{\mathcal{P}}({\mathcal{S}})$ and a state-action visitation measure $\nu_{\pi}\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A})$, which take the forms of

$$d_{\pi}(s)=(1-\gamma)\cdot\sum_{t=0}^{\infty}\gamma^{t}\cdot\mathbb{P}\bigl(s_{t}=s\,\big|\,s_{0}\sim\rho,a_{t}\sim\pi(\cdot\,|\,s_{t})\bigr),\quad\nu_{\pi}(s,a)=d_{\pi}(s)\cdot\pi(a\,|\,s). \quad (2.3)$$

It then holds that $J(\pi;r)=\mathbb{E}_{(s,a)\sim\nu_{\pi}}[r(s,a)]$. Meanwhile, we assume that the policy $\pi$ induces a state stationary distribution $\varrho_{\pi}$ over ${\mathcal{S}}$, which satisfies that

$$\varrho_{\pi}(s)=\mathbb{P}\bigl(s_{t+1}=s\,\big|\,s_{t}\sim\varrho_{\pi},a_{t}\sim\pi(\cdot\,|\,s_{t})\bigr).$$

We denote by $\rho_{\pi}(s,a)=\varrho_{\pi}(s)\cdot\pi(a\,|\,s)$ the state-action stationary distribution over ${\mathcal{S}}\times\mathcal{A}$.
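Since the visitation measure $\nu_{\pi}$ is only accessed through samples in the sequel, it is helpful to keep in mind the Monte Carlo reading of the identity $J(\pi;r)=\mathbb{E}_{(s,a)\sim\nu_{\pi}}[r(s,a)]$. The snippet below is a minimal sketch; `env`, `policy`, and `reward_fn` are hypothetical interfaces that do not appear in the paper.

def estimate_return(env, policy, reward_fn, gamma, n_rollouts=100, horizon=200):
    """Monte Carlo estimate of J(pi; r) = E_{(s,a) ~ nu_pi}[r(s, a)] as in (2.2).

    Hypothetical interfaces (not from the paper):
      env.reset() -> initial state s_0 ~ rho; env.step(a) -> (next_state, done)
      policy(s)   -> action sampled from pi(. | s)
      reward_fn(s, a) -> scalar reward r(s, a)
    """
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset()
        for t in range(horizon):
            a = policy(s)
            # The (1 - gamma) * gamma^t weights match the normalization in (2.1) and (2.3).
            total += (1.0 - gamma) * (gamma ** t) * reward_fn(s, a)
            s, done = env.step(a)
            if done:
                break
    return total / n_rollouts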

2.2 Generative Adversarial Imitation Learning

The goal of imitation learning (IL) is to learn a policy that outperforms the expert policy $\pi_{\text{E}}$ based on the expert trajectory $\{(s_{i}^{\text{E}},a_{i}^{\text{E}})\}_{i\in[T_{\text{E}}]}$. We denote by $\nu_{\text{E}}=\nu_{\pi_{\text{E}}}$ and $d_{\text{E}}=d_{\pi_{\text{E}}}$ the state-action and state visitation measures induced by the expert policy, respectively, and assume that the expert trajectory $\{(s_{i}^{\text{E}},a_{i}^{\text{E}})\}_{i\in[T_{\text{E}}]}$ is drawn from $\nu_{\text{E}}$. To this end, we parameterize the policy and the reward function by $\pi_{\theta}$ for $\theta\in{\mathcal{X}}_{\Pi}$ and $r_{\beta}(s,a)$ for $\beta\in{\mathcal{X}}_{R}$, respectively, and solve the following minimax optimization problem known as GAIL (Ho and Ermon, 2016),

$$\min_{\theta\in{\mathcal{X}}_{\Pi}}\max_{\beta\in{\mathcal{X}}_{R}}L(\theta,\beta),\quad\text{where }L(\theta,\beta)=J(\pi_{\text{E}};r_{\beta})-J(\pi_{\theta};r_{\beta})-\lambda\cdot\psi(\beta). \quad (2.4)$$

Here $J(\pi;r)$ is the expected cumulative reward defined in (2.2), $\psi:{\mathcal{X}}_{R}\rightarrow\mathbb{R}_{+}$ is the regularizer, and $\lambda\geq 0$ is the regularization parameter. Given a reward function class $\mathcal{R}$, we define the $\mathcal{R}$-distance between two policies $\pi_{1}$ and $\pi_{2}$ as follows,

$$\mathbb{D}_{\mathcal{R}}(\pi_{1},\pi_{2})=\max_{r\in\mathcal{R}}J(\pi_{1};r)-J(\pi_{2};r)=\max_{r\in\mathcal{R}}\mathbb{E}_{\nu_{\pi_{1}}}\bigl[r(s,a)\bigr]-\mathbb{E}_{\nu_{\pi_{2}}}\bigl[r(s,a)\bigr]. \quad (2.5)$$

When $\mathcal{R}$ is the class of $1$-Lipschitz functions, $\mathbb{D}_{\mathcal{R}}(\pi_{1},\pi_{2})$ is the Wasserstein-$1$ metric between the state-action visitation measures induced by $\pi_{1}$ and $\pi_{2}$. However, $\mathbb{D}_{\mathcal{R}}(\pi_{1},\pi_{2})$ is not a metric in general. When $\mathbb{D}_{\mathcal{R}}(\pi_{1},\pi_{2})\leq 0$, the policy $\pi_{2}$ outperforms the policy $\pi_{1}$ for any reward function $r\in\mathcal{R}$. Such a notion of $\mathcal{R}$-distance is used in Chen et al. (2020). We denote by $\mathcal{R}_{\beta}=\{r_{\beta}(s,a)\,|\,\beta\in{\mathcal{X}}_{R}\}$ the reward function class parameterized with $\beta$. Hence, the optimization problem in (2.4) minimizes the $\mathcal{R}_{\beta}$-distance $\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\pi_{\theta})$ (up to the regularizer $\lambda\cdot\psi(\beta)$), which searches for a policy $\bar{\pi}$ that (approximately) outperforms the expert policy given any reward function $r_{\beta}\in\mathcal{R}_{\beta}$.
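As a concrete reading of (2.5), the $\mathcal{R}$-distance can be estimated from samples of the two visitation measures by maximizing the empirical reward gap over the reward class. The following sketch assumes a finite collection of candidate reward functions; the names `reward_class`, `samples_pi1`, and `samples_pi2` are illustrative and not from the paper.

def empirical_r_distance(reward_class, samples_pi1, samples_pi2):
    """Empirical version of D_R(pi_1, pi_2) in (2.5).

    reward_class: iterable of candidate reward functions r(s, a) (hypothetical input),
    samples_pi1 / samples_pi2: lists of (s, a) pairs drawn from nu_{pi_1} / nu_{pi_2}.
    """
    best = float("-inf")
    for r in reward_class:
        gap = (sum(r(s, a) for s, a in samples_pi1) / len(samples_pi1)
               - sum(r(s, a) for s, a in samples_pi2) / len(samples_pi2))
        best = max(best, gap)
    return best

In GAIL, the maximization over a finite collection is replaced by gradient ascent over the reward parameter $\beta$, which is the role of the update (3.14) below.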

3 Algorithm

In this section, we introduce an algorithm with alternating updates for GAIL with neural networks, which employs natural policy gradient (NPG) to update the policy $\pi_{\theta}$ and gradient ascent to update the reward function $r_{\beta}(s,a)$.

3.1 Parameterization with Neural Networks

We define a two-layer neural network with rectified linear units (ReLU) as follows,

$$u_{W,b}(s,a)=\frac{1}{\sqrt{m}}\sum_{l=1}^{m}b_{l}\cdot\mathds{1}\bigl\{(s,a)^{\top}[W]_{l}>0\bigr\}\cdot(s,a)^{\top}[W]_{l}=\sum_{l=1}^{m}\bigl[\phi_{W,b}(s,a)\bigr]_{l}^{\top}[W]_{l}. \quad (3.1)$$

Here $m\in\mathbb{N}_{+}$ is the width of the neural network, $b=(b_{1},\ldots,b_{m})^{\top}\in\mathbb{R}^{m}$ and $W=([W]_{1}^{\top},\ldots,[W]_{m}^{\top})^{\top}\in\mathbb{R}^{md}$ are the parameters, and $\phi_{W,b}(s,a)=([\phi_{W,b}(s,a)]^{\top}_{1},\ldots,[\phi_{W,b}(s,a)]^{\top}_{m})^{\top}\in\mathbb{R}^{md}$ is called the feature vector in the sequel, where

$$\bigl[\phi_{W,b}(s,a)\bigr]_{l}=m^{-1/2}\cdot b_{l}\cdot\mathds{1}\bigl\{(s,a)^{\top}[W]_{l}>0\bigr\}\cdot(s,a). \quad (3.2)$$

It then holds that $u_{W,b}(s,a)=W^{\top}\phi_{W,b}(s,a)$. Note that the feature vector $\phi_{W,b}(s,a)$ depends on the parameters $W$ and $b$. We consider the following random initialization,

$$b_{l}\overset{\text{i.i.d.}}{\sim}{\text{Unif}}\bigl(\{-1,1\}\bigr),\quad[W_{0}]_{l}\overset{\text{i.i.d.}}{\sim}N(0,I_{d}/d),\quad\forall l\in[m]. \quad (3.3)$$

Throughout the training process, we keep the parameter $b$ fixed while updating $W$. For notational simplicity, we write $u_{W,b}(s,a)$ as $u_{W}(s,a)$ and $\phi_{W,b}(s,a)$ as $\phi_{W}(s,a)$ in the sequel. We denote by $\mathbb{E}_{\text{init}}$ the expectation taken with respect to the random initialization in (3.3). For an absolute constant $B>0$, we define the parameter domain as

$$S_{B}=\bigl\{W\in\mathbb{R}^{md}\,\big|\,\|W-W_{0}\|_{2}\leq B\bigr\}, \quad (3.4)$$

which is the ball centered at $W_{0}$ with the domain radius $B$.
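For concreteness, the parameterization (3.1)-(3.4) can be written in a few lines of NumPy. The sketch below follows the definitions above; the function names are ours rather than the paper's.

import numpy as np

def init_network(m, d, rng):
    """Random initialization (3.3): b_l ~ Unif{-1, 1}, [W_0]_l ~ N(0, I_d / d)."""
    b = rng.choice([-1.0, 1.0], size=m)
    W0 = rng.normal(0.0, np.sqrt(1.0 / d), size=(m, d))
    return W0, b

def feature(W, b, x):
    """Feature vector phi_{W,b}(s, a) in (3.2); x = (s, a) with ||x||_2 <= 1, output shape (m, d)."""
    act = (W @ x > 0).astype(float)                      # indicator 1{(s,a)^T [W]_l > 0}
    return (b * act)[:, None] * x[None, :] / np.sqrt(W.shape[0])

def u(W, b, x):
    """Two-layer ReLU network u_{W,b}(s, a) = W^T phi_{W,b}(s, a) in (3.1)."""
    return float(np.sum(W * feature(W, b, x)))

def project(W, W0, B):
    """Projection onto the parameter domain S_B in (3.4) (Frobenius norm = flattened l2 norm)."""
    diff = W - W0
    norm = np.linalg.norm(diff)
    return W if norm <= B else W0 + diff * (B / norm)

For instance, `W0, b = init_network(m=256, d=5, rng=np.random.default_rng(0))` initializes a width-$256$ network; keeping `b` fixed and updating `W` within `project(., W0, B)` mirrors how $\theta$, $\beta$, and $\omega$ are constrained to $S_{B_{\theta}}$, $S_{B_{\beta}}$, and $S_{B_{\omega}}$ in Algorithm 1.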

In the sequel, we consider the following energy-based policy,

$$\pi_{\theta}(a\,|\,s)=\frac{\exp\bigl(\tau\cdot u_{\theta}(s,a)\bigr)}{\sum_{a^{\prime}\in\mathcal{A}}\exp\bigl(\tau\cdot u_{\theta}(s,a^{\prime})\bigr)}, \quad (3.5)$$

where $\tau\geq 0$ is the inverse temperature parameter and $u_{\theta}(s,a)$ is the energy function parameterized by the neural network defined in (3.1) with $W=\theta$. In the sequel, we call $\theta$ the policy parameter. Meanwhile, we parameterize the reward function $r_{\beta}(s,a)$ as follows,

$$r_{\beta}(s,a)=(1-\gamma)^{-1}\cdot u_{\beta}(s,a), \quad (3.6)$$

where $u_{\beta}(s,a)$ is the neural network defined in (3.1) with $W=\beta$ and $\gamma$ is the discount factor. Here we use the scaling parameter $(1-\gamma)^{-1}$ to ensure that for any policy $\pi$ the state-action value function $Q^{\pi}_{r_{\beta}}(s,a)$ defined in (2.1) is well approximated by the neural network defined in (3.1). In the sequel, we call $\beta$ the reward parameter and define the reward function class as

$$\mathcal{R}_{\beta}=\bigl\{r_{\beta}(s,a)\,\big|\,\beta\in S_{B_{\beta}}\bigr\},$$

where $S_{B_{\beta}}$ is the parameter domain of $\beta$ defined in (3.4) with domain radius $B_{\beta}$.
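Given the network above, the energy-based policy (3.5) and the reward parameterization (3.6) admit the following direct translation; this is again a sketch, where `u` is the network function from the previous snippet and `actions` is the finite action set, both treated as given.

import numpy as np

def policy_probs(theta, b, s, actions, tau):
    """Energy-based policy pi_theta(. | s) in (3.5) over a finite action set."""
    energies = np.array([tau * u(theta, b, np.concatenate([s, a])) for a in actions])
    energies -= energies.max()        # numerical stability; does not change the softmax
    p = np.exp(energies)
    return p / p.sum()

def reward(beta, b, s, a, gamma):
    """Reward r_beta(s, a) = (1 - gamma)^{-1} * u_beta(s, a) in (3.6)."""
    return u(beta, b, np.concatenate([s, a])) / (1.0 - gamma)

Note that with $\tau_{0}=0$ the softmax above recovers the uniform policy used at initialization in Algorithm 1.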

To facilitate algorithm design, we establish the following proposition, which calculates the explicit expressions of the gradients $\nabla L(\theta,\beta)$ and the Fisher information $\mathcal{I}(\theta)$. Recall that the Fisher information is defined as

$$\mathcal{I}(\theta)=\mathbb{E}_{(s,a)\sim\nu_{\pi_{\theta}}}\bigl[\nabla_{\theta}\log\pi_{\theta}(s,a)\,\nabla_{\theta}\log\pi_{\theta}(s,a)^{\top}\bigr]. \quad (3.7)$$
Proposition 3.1 (Gradients and Fisher Information).

We call $\iota_{\theta}(s,a)=\tau^{-1}\cdot\nabla_{\theta}\log\pi_{\theta}(a\,|\,s)$ the temperature-adjusted score function. It holds that

$$\iota_{\theta}(s,a)=\phi_{\theta}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi_{\theta}(\cdot\,|\,s)}\bigl[\phi_{\theta}(s,a^{\prime})\bigr]. \quad (3.8)$$

It then holds that

$$\mathcal{I}(\theta)=\tau^{2}\cdot\mathbb{E}_{(s,a)\sim\nu_{\pi_{\theta}}}\bigl[\iota_{\theta}(s,a)\,\iota_{\theta}(s,a)^{\top}\bigr], \quad (3.9)$$
$$\nabla_{\theta}L(\theta,\beta)=-\tau\cdot\mathbb{E}_{(s,a)\sim\nu_{\pi_{\theta}}}\bigl[Q^{\pi_{\theta}}_{r_{\beta}}(s,a)\cdot\iota_{\theta}(s,a)\bigr], \quad (3.10)$$
$$\nabla_{\beta}L(\theta,\beta)=(1-\gamma)^{-1}\cdot\mathbb{E}_{(s,a)\sim\nu_{\text{E}}}\bigl[\phi_{\beta}(s,a)\bigr]-(1-\gamma)^{-1}\cdot\mathbb{E}_{(s,a)\sim\nu_{\pi_{\theta}}}\bigl[\phi_{\beta}(s,a)\bigr]-\lambda\cdot\nabla_{\beta}\psi(\beta), \quad (3.11)$$

where $Q^{\pi_{\theta}}_{r_{\beta}}(s,a)$ is the state-action value function defined in (2.1) with $\pi=\pi_{\theta}$ and $r=r_{\beta}$, $\nu_{\pi_{\theta}}$ is the state-action visitation measure defined in (2.3) with $\pi=\pi_{\theta}$, and $\mathcal{I}(\theta)$ is the Fisher information defined in (3.7).

Proof.

See §C.1 for a detailed proof. ∎

Note that the expression of the policy gradient $\nabla_{\theta}L(\theta,\beta)$ in (3.10) of Proposition 3.1 involves the state-action value function $Q^{\pi_{\theta}}_{r_{\beta}}(s,a)$. To this end, we estimate the state-action value function $Q_{r}^{\pi}(s,a)$ by $\widehat{Q}_{\omega}(s,a)$, which is parameterized as follows,

$$\widehat{Q}_{\omega}(s,a)=u_{\omega}(s,a). \quad (3.12)$$

Here $u_{\omega}(s,a)$ is the neural network defined in (3.1) with $W=\omega$. In the sequel, we call $\omega$ the value parameter.

3.2 GAIL with Alternating Updates

We employ an actor-critic scheme with alternating updates of the policy and the reward function, which is presented in Algorithm 1. Specifically, in the actor step, we update the policy parameter $\theta$ via natural policy gradient and update the reward parameter $\beta$ via gradient ascent, while in the critic step, we estimate the state-action value function $Q^{\pi}_{r}(s,a)$ via neural temporal difference learning (TD) (Cai et al., 2019c).

Actor Step. In the $k$-th actor step, we update the policy parameter $\theta$ and the reward parameter $\beta$ as follows,

$$\theta_{k+1}=\tau_{k+1}^{-1}\cdot(\tau_{k}\cdot\theta_{k}-\eta\cdot\delta_{k}), \quad (3.13)$$
$$\beta_{k+1}=\text{Proj}_{S_{B_{\beta}}}\bigl\{\beta_{k}+\eta\cdot\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})\bigr\}, \quad (3.14)$$

where

$$\tau_{k+1}=\eta+\tau_{k},\quad\delta_{k}\in\mathop{\mathrm{argmin}}_{\delta\in S_{B_{\theta}}}\bigl\|\widehat{\mathcal{I}}(\theta_{k})\delta-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr\|_{2}. \quad (3.15)$$

Here $\eta>0$ is the stepsize, $S_{B_{\theta}}$ and $S_{B_{\beta}}$ are the parameter domains of $\theta$ and $\beta$ defined in (3.4) with domain radii $B_{\theta}$ and $B_{\beta}$, respectively, $\text{Proj}_{S_{B_{\beta}}}:\mathbb{R}^{md}\rightarrow S_{B_{\beta}}$ is the projection operator, $\tau_{k}$ is the inverse temperature parameter of $\pi_{\theta_{k}}$, and $\widehat{\mathcal{I}}(\theta_{k}),\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k}),\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})$ are the estimators of $\mathcal{I}(\theta_{k}),\nabla_{\theta}L(\theta_{k},\beta_{k}),\nabla_{\beta}L(\theta_{k},\beta_{k})$, respectively, which are defined in the sequel. In (3.13), we update the policy parameter $\theta_{k}$ along the direction $\delta_{k}$, which approximates the natural policy gradient $\mathcal{I}(\theta)^{-1}\cdot\nabla_{\theta}L(\theta,\beta)$, and in (3.15) we update the inverse temperature parameter $\tau_{k}$ to ensure that $\theta_{k+1}\in S_{B_{\theta}}$. Meanwhile, in (3.14), we update the reward parameter $\beta$ via (projected) gradient ascent. Following from (3.9) and (3.10) of Proposition 3.1, we construct the estimators of $\mathcal{I}(\theta_{k})$ and $\nabla_{\theta}L(\theta_{k},\beta_{k})$ as follows,

$$\widehat{\mathcal{I}}(\theta_{k})=\frac{\tau_{k}^{2}}{N}\sum_{i=1}^{N}\iota_{\theta_{k}}(s_{i},a_{i})\,\iota_{\theta_{k}}(s_{i},a_{i})^{\top}, \quad (3.16)$$
$$\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})=-\frac{\tau_{k}}{N}\sum_{i=1}^{N}\widehat{Q}_{\omega_{k}}(s_{i},a_{i})\cdot\iota_{\theta_{k}}(s_{i},a_{i}), \quad (3.17)$$

where $\{(s_{i},a_{i})\}_{i\in[N]}$ is sampled from the state-action visitation measure $\nu_{\pi_{\theta_{k}}}$ given $\theta_{k}$ with the batch size $N$, and $\widehat{Q}_{\omega_{k}}(s,a)$ is the estimator of $Q^{\pi_{\theta_{k}}}_{r_{\beta_{k}}}(s,a)$ computed in the critic step. Meanwhile, following from (3.11) of Proposition 3.1, we construct the estimator of $\nabla_{\beta}L(\theta_{k},\beta_{k})$ as follows,

$$\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})=\frac{1}{N\cdot(1-\gamma)}\sum_{i=1}^{N}\bigl[\phi_{\beta_{k}}(s_{i}^{\text{E}},a_{i}^{\text{E}})-\phi_{\beta_{k}}(s_{i},a_{i})\bigr]-\lambda\cdot\nabla_{\beta}\psi(\beta_{k}), \quad (3.18)$$

where $\{(s_{i}^{\text{E}},a_{i}^{\text{E}})\}_{i\in[N]}$ are drawn from the expert trajectory. For notational simplicity, we write $\pi_{k}=\pi_{\theta_{k}}$, $r_{k}(s,a)=r_{\beta_{k}}(s,a)$, $d_{k}=d_{\pi_{k}}$, and $\nu_{k}=\nu_{\pi_{k}}$ for the $k$-th step hereafter, where $\pi_{\theta}$ is the policy, $r_{\beta}(s,a)$ is the reward function, and $d_{\pi},\nu_{\pi}$ are the visitation measures defined in (2.3).
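To make the actor step concrete, the following sketch assembles (3.13)-(3.18) from a sampled batch. It is a simplified illustration rather than the exact procedure: in particular, $\delta_{k}$ is obtained by a projected least-squares solve instead of the constrained minimization in (3.15). Here `score`, `q_hat`, `feat_beta`, and `grad_psi` are hypothetical callables standing in for $\iota_{\theta_{k}}$, $\widehat{Q}_{\omega_{k}}$, $\phi_{\beta_{k}}$, and $\nabla_{\beta}\psi$, and `project` is reused from the network sketch in §3.1.

import numpy as np

def actor_step(theta, beta, tau, eta, gamma, lam, batch, expert_batch,
               score, q_hat, feat_beta, grad_psi, W0, B_theta, B_beta):
    """One actor step of Algorithm 1; theta, beta, W0 are flattened parameter arrays."""
    N = len(batch)
    iotas = np.stack([score(s, a) for s, a in batch])                    # iota_{theta_k}, shape (N, m*d)
    I_hat = tau ** 2 * iotas.T @ iotas / N                               # (3.16)
    g_theta = -tau * np.mean([q_hat(s, a) * score(s, a) for s, a in batch], axis=0)   # (3.17)

    delta, *_ = np.linalg.lstsq(I_hat, tau * g_theta, rcond=None)        # surrogate for (3.15)
    delta = project(delta, W0, B_theta)

    tau_next = tau + eta                                                 # (3.15)
    theta_next = (tau * theta - eta * delta) / tau_next                  # (3.13)

    g_beta = ((np.mean([feat_beta(s, a) for s, a in expert_batch], axis=0)
               - np.mean([feat_beta(s, a) for s, a in batch], axis=0)) / (1.0 - gamma)
              - lam * grad_psi(beta))                                    # (3.18)
    beta_next = project(beta + eta * g_beta, W0, B_beta)                 # (3.14)
    return theta_next, beta_next, tau_next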

Critic Step. Note that the estimator $\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})$ in (3.17) involves the estimator $\widehat{Q}_{\omega_{k}}(s,a)$ of $Q^{\pi_{k}}_{r_{k}}(s,a)$. To this end, we parameterize $\widehat{Q}_{\omega}(s,a)$ as in (3.12) and adapt neural TD (Cai et al., 2019c), which solves the following minimization problem,

$$\omega_{k}=\mathop{\mathrm{argmin}}_{\omega\in S_{B_{\omega}}}\mathbb{E}_{(s,a)\sim\rho_{k}}\bigl[\widehat{Q}_{\omega}(s,a)-{\mathcal{T}}^{\pi_{k}}_{r_{k}}\widehat{Q}_{\omega}(s,a)\bigr]^{2}. \quad (3.19)$$

Here $S_{B_{\omega}}$ is the parameter domain with domain radius $B_{\omega}$, $\rho_{k}$ is the state-action stationary distribution induced by $\pi_{k}$, and ${\mathcal{T}}^{\pi_{k}}_{r_{k}}$ is the Bellman operator. Note that the Bellman operator ${\mathcal{T}}^{\pi}_{r}$ is defined as follows,

$${\mathcal{T}}^{\pi}_{r}Q(s,a)=(1-\gamma)\cdot r(s,a)+\gamma\cdot\mathbb{E}_{\pi}\bigl[Q(s^{\prime},a^{\prime})\,\big|\,s,a\bigr],$$

where the expectation is taken with respect to $s^{\prime}\sim P(\cdot\,|\,s,a)$ and $a^{\prime}\sim\pi(\cdot\,|\,s^{\prime})$. In neural TD, we iteratively update the value parameter $\omega$ via

$$\delta(j)=Q_{\omega(j)}(s,a)-r(s,a)-\gamma\cdot Q_{\omega(j)}(s^{\prime},a^{\prime}),$$
$$\omega(j+1)=\text{Proj}_{S_{B_{\omega}}}\bigl\{\omega(j)-\alpha\cdot\delta(j)\cdot\nabla_{\omega}Q_{\omega(j)}(s,a)\bigr\}, \quad (3.20)$$

where $\delta(j)$ is the Bellman residual, $\alpha>0$ is the stepsize, $(s,a)$ is sampled from the state-action stationary distribution $\rho_{k}$, and $s^{\prime}\sim P(\cdot\,|\,s,a),a^{\prime}\sim\pi_{k}(\cdot\,|\,s^{\prime})$ are the subsequent state and action. We defer the detailed discussion of neural TD to §B.
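The critic step can likewise be sketched as the projected stochastic update (3.20). Below, `sample_transition` is a hypothetical sampler that returns $(s,a,r,s^{\prime},a^{\prime})$ with $(s,a)\sim\rho_{k}$, $s^{\prime}\sim P(\cdot\,|\,s,a)$, and $a^{\prime}\sim\pi_{k}(\cdot\,|\,s^{\prime})$, with $r$ the reward fed to the residual in (3.20); `u`, `feature`, and `project` are reused from the network sketch in §3.1, using that $\nabla_{\omega}Q_{\omega}(s,a)=\phi_{\omega}(s,a)$ almost everywhere for the two-layer ReLU network.

import numpy as np

def neural_td(W0, b, B_omega, gamma, alpha, T_td, sample_transition):
    """Sketch of the neural TD critic step, iterating the update (3.20)."""
    omega = W0.copy()
    for _ in range(T_td):
        s, a, r, s_next, a_next = sample_transition()
        x, x_next = np.concatenate([s, a]), np.concatenate([s_next, a_next])
        delta = u(omega, b, x) - r - gamma * u(omega, b, x_next)   # Bellman residual delta(j)
        omega = project(omega - alpha * delta * feature(omega, b, x), W0, B_omega)
    return omega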

To approximately obtain the compatible function approximation (Sutton et al., 2000; Wang et al., 2019), we share the random initialization among the policy $\pi_{\theta}$, the reward function $r_{\beta}(s,a)$, and the state-action value function $\widehat{Q}_{\omega}(s,a)$. In other words, we set $\theta_{0}=\beta_{0}=\omega(0)=W_{0}$ in our algorithm, where $W_{0}$ is the random initialization in (3.3). The output of GAIL is the mixed policy $\bar{\pi}$ (Altman, 1999). Specifically, the mixed policy $\bar{\pi}$ of $\pi_{0},\ldots,\pi_{T-1}$ is executed by randomly selecting a policy $\pi_{k}$ for $k\in[0:T-1]$ with equal probability before time $t=0$ and exclusively following $\pi_{k}$ thereafter. It then holds for any reward function $r(s,a)$ that

$$J(\bar{\pi};r)=\frac{1}{T}\sum_{k=0}^{T-1}J(\pi_{k};r). \quad (3.21)$$
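Executing the mixed policy $\bar{\pi}$ amounts to drawing one index uniformly before the episode starts and committing to it, which is what makes the identity (3.21) hold; a two-line sketch:

import random

def sample_mixed_policy(policies, rng=random):
    """Mixed policy of pi_0, ..., pi_{T-1}: pick one uniformly before t = 0, then follow it."""
    return rng.choice(policies)   # use the returned policy for every step of the episode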
Algorithm 1 GAIL
Input: Expert trajectory $\{(s_{i}^{\text{E}},a_{i}^{\text{E}})\}_{i\in[T_{\text{E}}]}$, number of iterations $T$, number of iterations $T_{\text{TD}}$ of neural TD, stepsize $\eta$, stepsize $\alpha$ of neural TD, batch size $N$, and domain radii $B_{\theta},B_{\omega},B_{\beta}$.
1:  Initialization. Initialize $b_{l}\sim{\text{Unif}}(\{-1,1\})$ and $[W_{0}]_{l}\sim N(0,I_{d}/d)$ for any $l\in[m]$, and set $\tau_{0}\leftarrow 0$, $\theta_{0}\leftarrow W_{0}$, and $\beta_{0}\leftarrow W_{0}$.
2:  for $k=0,1,\ldots,T-1$ do
3:     Update the value parameter $\omega_{k}$ via Algorithm 2 with $\pi_{k}$, $r_{k}$, $W_{0}$, $b$, $T_{\mathrm{TD}}$, and $\alpha$ as the input.
4:     Sample $\{(s_{i},a_{i})\}_{i=1}^{N}$ from the state-action visitation measure $\nu_{k}$, and estimate $\widehat{\mathcal{I}}(\theta_{k})$, $\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})$, and $\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})$ via (3.16), (3.17), and (3.18), respectively.
5:     Solve $\delta_{k}\leftarrow\mathop{\mathrm{argmin}}_{\delta\in S_{B_{\theta}}}\|\widehat{\mathcal{I}}(\theta_{k})\cdot\delta-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\|_{2}$ and set $\tau_{k+1}\leftarrow\tau_{k}+\eta$.
6:     Update the policy parameter $\theta$ via $\theta_{k+1}\leftarrow\tau_{k+1}^{-1}\cdot(\tau_{k}\cdot\theta_{k}-\eta\cdot\delta_{k})$.
7:     Update the reward parameter $\beta$ via $\beta_{k+1}\leftarrow\text{Proj}_{S_{B_{\beta}}}\{\beta_{k}+\eta\cdot\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})\}$.
8:  end for
Output: Mixed policy $\bar{\pi}$ of $\pi_{0},\ldots,\pi_{T-1}$.

4 Main Results

In this section, we first present the assumptions for our analysis. Then, we establish the global optimality and convergence of Algorithm 1.

4.1 Assumptions

We impose the following assumptions on the stationary distributions $\varrho_{k}\in{\mathcal{P}}({\mathcal{S}}),\rho_{k}\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A})$ and the visitation measures $d_{k}\in{\mathcal{P}}({\mathcal{S}}),\nu_{k}\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A})$.

Assumption 4.1.

We assume that the following properties hold.

  • (a) Let $\mu$ be either $\rho_{k}$ or $\nu_{k}$. We assume for an absolute constant $c>0$ that

    $$\mathbb{E}_{(s,a)\sim\mu}\Bigl[\mathds{1}\bigl\{|W^{\top}(s,a)|\leq y\bigr\}\Bigr]\leq\frac{c\cdot y}{\|W\|_{2}},\quad\forall y>0,\ W\neq 0.$$

  • (b) We assume for an absolute constant $C_{h}>0$ that

    $$\max_{k\in\mathbb{N}}\biggl\{\bigg\|\frac{{\mathrm{d}}d_{\text{E}}}{{\mathrm{d}}d_{k}}\bigg\|_{2,d_{k}}+\bigg\|\frac{{\mathrm{d}}\nu_{\text{E}}}{{\mathrm{d}}\nu_{k}}\bigg\|_{2,\nu_{k}}\biggr\}\leq C_{h},\quad\max_{k\in\mathbb{N}}\biggl\{\bigg\|\frac{{\mathrm{d}}d_{\text{E}}}{{\mathrm{d}}\varrho_{k}}\bigg\|_{2,\varrho_{k}}+\bigg\|\frac{{\mathrm{d}}\nu_{\text{E}}}{{\mathrm{d}}\rho_{k}}\bigg\|_{2,\rho_{k}}\biggr\}\leq C_{h}.$$

    Here ${\mathrm{d}}d_{\text{E}}/{\mathrm{d}}d_{k}$, ${\mathrm{d}}\nu_{\text{E}}/{\mathrm{d}}\nu_{k}$, ${\mathrm{d}}d_{\text{E}}/{\mathrm{d}}\varrho_{k}$, and ${\mathrm{d}}\nu_{\text{E}}/{\mathrm{d}}\rho_{k}$ are the Radon-Nikodym derivatives.

Assumption 4.1 (a) holds when the probability density functions of $\rho_{k}$ and $\nu_{k}$ are uniformly upper bounded across $k$. Assumption 4.1 (b) states that the concentrability coefficients are uniformly upper bounded across $k$, which is commonly used in the analysis of RL (Szepesvári and Munos, 2005; Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Scherrer et al., 2015; Farahmand et al., 2016; Lazaric et al., 2016).

For notational simplicity, we write $u_{0}(s,a)=u_{W_{0}}(s,a)$ and $\phi_{0}(s,a)=\phi_{W_{0}}(s,a)$, where $u_{W_{0}}(s,a)$ is the neural network defined in (3.1) with $W=W_{0}$, $\phi_{W_{0}}(s,a)$ is the feature vector defined in (3.2) with $W=W_{0}$, and $W_{0}$ is the random initialization in (3.3). We impose the following assumptions on the neural network $u_{0}(s,a)$ and the transition kernel $P$.

Assumption 4.2.

We assume that the following properties hold.

  • (a) Let $\bar{U}=\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}|u_{0}(s,a)|$. We assume for absolute constants $M_{0}>0$ and $v>0$ that

    $$\mathbb{E}_{\text{init}}[\bar{U}^{2}]\leq M_{0}^{2},\quad\mathbb{P}(\bar{U}>t)\leq\exp(-v\cdot t^{2}),\quad\forall t>2M_{0}. \quad (4.1)$$

  • (b) We assume that the transition kernel $P$ belongs to the following class,

    $$\widetilde{\mathcal{M}}_{\infty,B_{P}}=\biggl\{P(s^{\prime}\,|\,s,a)=\int\vartheta(s,a;w)^{\top}\varphi(s^{\prime};w)\,{\mathrm{d}}q(w)\,\bigg|\,\sup_{w}\biggl\|\int\varphi(s;w)\,{\mathrm{d}}s\biggr\|_{2}\leq B_{P}\biggr\}.$$

    Here $B_{P}>0$ is an absolute constant, $q$ is the probability density function of $N(0,I_{d}/d)$, and $\vartheta(s,a;w)$ is defined as $\vartheta(s,a;w)=\mathds{1}\{w^{\top}(s,a)>0\}\cdot(s,a)$.

Assumption 4.2 (b) states that the MDP belongs to (a variant of) the class of linear MDPs (Yang and Wang, 2019a, b; Jin et al., 2019; Cai et al., 2019b). However, our class of transition kernels is infinite-dimensional, and thus captures a rich class of MDPs. To understand Assumption 4.2 (a), recall that we initialize the neural network with $[W_{0}]_{l}\sim N(0,I_{d}/d)$ and $b_{l}\sim{\text{Unif}}(\{-1,1\})$ for any $l\in[m]$. Thus, the neural network $u_{0}(s,a)$ defined in (3.1) with $W=W_{0}$ converges to a Gaussian process indexed by $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ as $m$ goes to infinity. Following from the facts that the maximum of a Gaussian process over a compact index set is sub-Gaussian (van de Geer and Muro, 2014) and that ${\mathcal{S}}\times\mathcal{A}$ is compact, it is reasonable to assume that $\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}|u_{0}(s,a)|$ is sub-Gaussian, which further implies the existence of the absolute constants $M_{0}$ and $v$ in (4.1) of Assumption 4.2 (a). Moreover, if we assume that $m$ is even and initialize the parameters $W_{0},b$ as follows,

$$\begin{cases}[W_{0}]_{l}\overset{\text{i.i.d.}}{\sim}N(0,I_{d}/d),\quad b_{l}\sim{\text{Unif}}\bigl(\{-1,1\}\bigr),&\forall l=1,\ldots,m/2,\\ [W_{0}]_{l}=[W_{0}]_{l-m/2},\quad b_{l}=-b_{l-m/2},&\forall l=m/2+1,\ldots,m,\end{cases} \quad (4.2)$$

then $u_{0}(s,a)=0$ for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$, which allows us to set $M_{0}=0$ and $v=+\infty$ in Assumption 4.2 (a). Also, it holds that $0=u_{0}(s,a)\in\mathcal{R}_{\beta}$, which implies that $\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{1},\pi_{2})\geq 0$ for any $\pi_{1}$ and $\pi_{2}$. The proof of our results with the random initialization in (4.2) is identical.

Finally, we impose the following assumption on the regularizer $\psi(\beta)$ and the variances of the estimators $\widehat{\mathcal{I}}(\theta)$, $\widehat{\nabla}_{\theta}L(\theta,\beta)$, and $\widehat{\nabla}_{\beta}L(\theta,\beta)$ defined in (3.16), (3.17), and (3.18), respectively.

Assumption 4.3.

We assume that the following properties hold.

  • (a) We assume for an absolute constant $\sigma>0$ that

    $$\mathbb{E}_{k}\Bigl[\bigl\|\widehat{\mathcal{I}}(\theta_{k})W-\mathbb{E}_{k}\bigl[\widehat{\mathcal{I}}(\theta_{k})W\bigr]\bigr\|_{2}^{2}\Bigr]\leq\tau_{k}^{4}\cdot\sigma^{2}/N,\quad\forall W\in S_{B_{\theta}}, \quad (4.3)$$
    $$\mathbb{E}_{k}\Bigl[\bigl\|\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})-\mathbb{E}_{k}\bigl[\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr]\bigr\|_{2}^{2}\Bigr]\leq\tau_{k}^{2}\cdot\sigma^{2}/N, \quad (4.4)$$
    $$\mathbb{E}_{k}\Bigl[\bigl\|\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\mathbb{E}_{k}\bigl[\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})\bigr]\bigr\|_{2}^{2}\Bigr]\leq\sigma^{2}/N, \quad (4.5)$$

    where $\tau_{k}$ is the inverse temperature parameter in (3.5), $N\in\mathbb{N}_{+}$ is the batch size, and $S_{B_{\theta}}$ is the parameter domain of $\theta$ defined in (3.4) with the domain radius $B_{\theta}$. Here the expectation $\mathbb{E}_{k}$ is taken with respect to the $k$-th batch, which is drawn from $\nu_{k}$ given $\theta_{k}$.

  • (b) We assume that the regularizer $\psi(\beta)$ in (2.4) is convex and $L_{\psi}$-Lipschitz continuous over the compact parameter domain $S_{B_{\beta}}$.

Assumption 4.3 (a) holds when $\widehat{Q}_{\omega_{k}}(s_{i},a_{i})\cdot\iota_{\theta_{k}}(s_{i},a_{i})$, $\iota_{\theta_{k}}(s_{i},a_{i})\iota_{\theta_{k}}(s_{i},a_{i})^{\top}$, and $\phi_{\beta_{k}}(s_{i},a_{i})$ have uniformly upper bounded variances across $i\in[N]$ and $k$, and the Markov chain that generates $\{(s_{i},a_{i})\}_{i\in[N]}$ mixes sufficiently fast (Zhang et al., 2019a). Similar assumptions are also used in the analysis of policy optimization (Xu et al., 2019a, b). Also, Assumption 4.3 (b) holds for most commonly used regularizers (Ho and Ermon, 2016).

4.2 Global Optimality and Convergence

In this section, we establish the global optimality and convergence of Algorithm 1. The following proposition, adapted from Cai et al. (2019c), characterizes the global optimality and convergence of neural TD, which is presented in Algorithm 2.

Proposition 4.4 (Global Optimality and Convergence of Neural TD).

In Algorithm 2, we set $T_{\mathrm{TD}}=\Omega(m)$, $\alpha=\min\{(1-\gamma)/8,m^{-1/2}\}$, and $B_{\omega}=c\cdot(B_{\beta}+B_{P}\cdot(M_{0}+B_{\beta}))$ for an absolute constant $c>0$. Let $\pi_{k},r_{k}$ be the input and $\omega_{k}$ be the output of Algorithm 2. Under Assumptions 4.1 and 4.2, it holds for an absolute constant $C_{v}>0$ that

$$\mathbb{E}_{\text{init}}\Bigl[\bigl\|Q_{\omega_{k}}(s,a)-Q^{\pi_{k}}_{r_{k}}(s,a)\bigr\|^{2}_{2,\rho_{k}}\Bigr]=\mathcal{O}\bigl(B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr). \quad (4.6)$$

Here the expectation $\mathbb{E}_{\text{init}}$ is taken with respect to the random initialization in (3.3).

Proof.

See §B.1 for a detailed proof. ∎

The term $B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})$ in (4.6) of Proposition 4.4 characterizes the hardness of estimating the state-action value function $Q^{\pi_{k}}_{r_{k}}(s,a)$ by the neural network defined in (3.1), which arises because $\|Q^{\pi_{k}}_{r_{k}}(s,a)\|_{\infty}$ is not uniformly upper bounded across $k$. Note that if we employ the random initialization in (4.2), we have $C_{v}=+\infty$, and consequently such a term vanishes. We are now ready to establish the global optimality and convergence of Algorithm 1.

Theorem 4.5 (Global Optimality and Convergence of GAIL).

We set $\eta=1/\sqrt{T}$, $B_{\omega}=c\cdot(B_{\beta}+B_{P}\cdot(M_{0}+B_{\beta}))$ for an absolute constant $c>0$, and $B_{\theta}=B_{\omega}$ in Algorithm 1. Let $\bar{\pi}$ be the output of Algorithm 1. Under Assumptions 4.1-4.3, it holds that

$$\mathbb{E}\bigl[\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\bar{\pi})\bigr]\leq\underbrace{\frac{(1-\gamma)^{-1}\cdot\log|\mathcal{A}|+13\bar{B}^{2}+M_{0}^{2}+8}{\sqrt{T}}}_{\text{(i)}}+\underbrace{2\lambda\cdot L_{\psi}\cdot\bar{B}}_{\text{(ii)}}+\underbrace{\frac{1}{T}\sum_{k=0}^{T-1}\varepsilon_{k}}_{\text{(iii)}}. \quad (4.7)$$

Here $\bar{B}=\max\{B_{\theta},B_{\omega},B_{\beta}\}$, $\mathbb{D}_{\mathcal{R}_{\beta}}$ is the $\mathcal{R}_{\beta}$-distance defined in (2.5) with $\mathcal{R}_{\beta}=\{r_{\beta}(s,a)\,|\,\beta\in S_{B_{\beta}}\}$, the expectation is taken with respect to the random initialization in (3.3) and the $T$ batches, and the error term $\varepsilon_{k}$ satisfies

$$\varepsilon_{k}=\underbrace{2\sqrt{2}\cdot C_{h}\cdot\bar{B}\cdot\sigma\cdot N^{-1/2}}_{\text{(iii.a)}}+\underbrace{\epsilon_{Q,k}}_{\text{(iii.b)}}+\underbrace{\mathcal{O}(k\cdot\bar{B}^{3/2}\cdot m^{-1/4}+\bar{B}^{5/4}\cdot m^{-1/8})}_{\text{(iii.c)}}, \quad (4.8)$$

where $C_{h}$ is defined in Assumption 4.1, $L_{\psi}$ and $\sigma$ are defined in Assumption 4.3, and $\epsilon_{Q,k}=\mathcal{O}(B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2}))$ is the error induced by neural TD (Algorithm 2).

Proof.

See §5 for a detailed proof. ∎

The optimality gap in (4.7) of Theorem 4.5 is measured by the expected $\mathcal{R}_{\beta}$-distance $\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\bar{\pi})$ between the expert policy $\pi_{\text{E}}$ and the learned policy $\bar{\pi}$. Thus, by showing that the optimality gap is upper bounded by $\mathcal{O}(1/\sqrt{T})$, we prove that $\bar{\pi}$ (approximately) outperforms the expert policy $\pi_{\text{E}}$ in expectation when the number of iterations $T$ goes to infinity. As shown in (4.7) of Theorem 4.5, the optimality gap is upper bounded by the sum of three terms. Term (i) corresponds to the $1/\sqrt{T}$ rate of convergence of Algorithm 1. Term (ii) corresponds to the bias induced by the regularizer $\lambda\cdot\psi(\beta)$ in the objective function $L(\theta,\beta)$ defined in (2.4). Term (iii) is upper bounded by the sum of the three terms in (4.8) of Theorem 4.5. In detail, term (iii.a) corresponds to the error induced by the variances of $\widehat{\mathcal{I}}(\theta)$, $\widehat{\nabla}_{\theta}L(\theta,\beta)$, and $\widehat{\nabla}_{\beta}L(\theta,\beta)$ in (4.3), (4.4), and (4.5) of Assumption 4.3, which vanishes as the batch size $N$ in Algorithm 1 goes to infinity. Term (iii.b) is the error of estimating $Q^{\pi}_{r}(s,a)$ by $\widehat{Q}_{\omega}(s,a)$ using neural TD (Algorithm 2). As shown in Proposition 4.4, the estimation error $\epsilon_{Q,k}$ vanishes as $m$ and $B_{\omega}$ go to infinity. Term (iii.c) corresponds to the linearization error of the neural network defined in (3.1), which is characterized in Lemma A.2. Following from Theorem 4.5, for $B_{\omega}=\Omega((C_{v}^{-1}\cdot\log T)^{1/2})$, $m=\Omega(\bar{B}^{10}\cdot T^{6})$, and $N=\Omega(\bar{B}^{2}\cdot T\cdot\sigma^{2})$, it holds that $\mathbb{E}[\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\bar{\pi})]=\mathcal{O}(T^{-1/2}+\lambda)$, which implies the $1/\sqrt{T}$ rate of convergence of Algorithm 1 (up to the bias induced by the regularizer).
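To make the final claim concrete, the following back-of-the-envelope calculation is a sketch of how the stated choices of $B_{\omega}$, $m$, and $N$ control each term of (4.8); absolute constants and the remaining terms of $\epsilon_{Q,k}$, which vanish even faster, are suppressed.

\begin{align*}
B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2}) &= \mathcal{O}(T^{-1}\cdot\log T) &&\text{for } B_{\omega}^{2}=\Omega(C_{v}^{-1}\cdot\log T),\\
k\cdot\bar{B}^{3/2}\cdot m^{-1/4}+\bar{B}^{5/4}\cdot m^{-1/8} &= \mathcal{O}(T^{-1/2})\ \text{for all } k\leq T &&\text{for } m=\Omega(\bar{B}^{10}\cdot T^{6}),\\
\bar{B}\cdot\sigma\cdot N^{-1/2} &= \mathcal{O}(T^{-1/2}) &&\text{for } N=\Omega(\bar{B}^{2}\cdot T\cdot\sigma^{2}),
\end{align*}

so that $T^{-1}\sum_{k=0}^{T-1}\varepsilon_{k}=\mathcal{O}(T^{-1/2})$, and together with terms (i) and (ii) of (4.7) the optimality gap is $\mathcal{O}(T^{-1/2}+\lambda)$.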

5 Proof of Main Results

In this section, we present the proof of Theorem 4.5, which establishes the global optimality and convergence of Algorithm 1. For notational simplicity, we write $\pi^{s}(a)=\pi(a\,|\,s)$ for any policy $\pi$, state $s\in{\mathcal{S}}$, and action $a\in\mathcal{A}$. For any policies $\pi_{1},\pi_{2}$ and distribution $\mu$ over ${\mathcal{S}}$, we denote the expected Kullback-Leibler (KL) divergence by ${\mathrm{KL}}^{\mu}$, which is defined as ${\mathrm{KL}}^{\mu}(\pi_{1}\,\|\,\pi_{2})=\mathbb{E}_{s\sim\mu}[{\mathrm{KL}}(\pi_{1}^{s}\,\|\,\pi_{2}^{s})]$. For any visitation measures $d_{\pi}\in{\mathcal{P}}({\mathcal{S}})$ and $\nu_{\pi}\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A})$, we denote by $\mathbb{E}_{d_{\pi}}$ and $\mathbb{E}_{\nu_{\pi}}$ the expectations taken with respect to $s\sim d_{\pi}$ and $(s,a)\sim\nu_{\pi}$, respectively.

Following from the property of the mixed policy $\bar{\pi}$ in (3.21), we have that

$$\mathbb{E}\bigl[\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\bar{\pi})\bigr]=\mathbb{E}\Bigl[\max_{\beta^{\prime}\in S_{B_{\beta}}}J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\bar{\pi};r_{\beta^{\prime}})\Bigr]=\mathbb{E}\biggl[\max_{\beta^{\prime}\in S_{B_{\beta}}}\frac{1}{T}\sum_{k=0}^{T-1}J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\pi_{k};r_{\beta^{\prime}})\biggr]. \quad (5.1)$$

We now upper bound the optimality gap in (5.1) by upper bounding the following difference of expected cumulative rewards,

$$J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\pi_{k};r_{\beta^{\prime}})=\underbrace{J(\pi_{\text{E}};r_{k})-J(\pi_{k};r_{k})}_{\text{(i)}}+\underbrace{L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})}_{\text{(ii)}}+\underbrace{\lambda\cdot\bigl(\psi(\beta^{\prime})-\psi(\beta_{k})\bigr)}_{\text{(iii)}}, \quad (5.2)$$

where $\beta^{\prime}\in S_{B_{\beta}}$ is chosen arbitrarily and $L(\theta,\beta)$ is the objective function defined in (2.4). Following from Assumption 4.3 and the fact that $\beta_{k},\beta^{\prime}\in S_{B_{\beta}}$, we have that

$$\lambda\cdot\bigl(\psi(\beta^{\prime})-\psi(\beta_{k})\bigr)\leq\lambda\cdot L_{\psi}\cdot\|\beta^{\prime}-\beta_{k}\|_{2}\leq 2\lambda\cdot L_{\psi}\cdot B_{\beta}, \quad (5.3)$$

which upper bounds term (iii) of (5.2). It remains to upper bound terms (i) and (ii) of (5.2), which hinges on the one-point convexity of $J(\pi;r)$ with respect to $\pi$ and the (approximate) convexity of $L(\theta,\beta)$ with respect to $\beta$.

Upper bound of term (i) in (5.2). In what follows, we upper bound term (i) of (5.2). We first introduce the following cost difference lemma (Kakade and Langford, 2002), which corresponds to the one-point convexity of $J(\pi;r)$ with respect to $\pi$. Recall that $d_{\text{E}}\in{\mathcal{P}}({\mathcal{S}})$ is the state visitation measure induced by the expert policy $\pi_{\text{E}}$.

Lemma 5.1 (Cost Difference Lemma, Lemma 6.1 in Kakade and Langford (2002)).

For any policy π\pi and reward function r(s,a)r(s,a), it holds that

$$J(\pi_{\text{E}};r)-J(\pi;r)=(1-\gamma)^{-1}\cdot\mathbb{E}_{d_{\text{E}}}\Bigl[\bigl\langle Q^{\pi}_{r}(s,\cdot),\pi_{\text{E}}^{s}-\pi^{s}\bigr\rangle_{\mathcal{A}}\Bigr], \quad (5.4)$$

where γ\gamma is the discount factor.

Furthermore, we establish the following lemma, which upper bounds the right-hand side of (5.4) in Lemma 5.1.

Lemma 5.2.

Under Assumptions 4.1-4.3, we have that

$$\mathbb{E}_{d_{\text{E}}}\Bigl[\bigl\langle Q^{\pi_{k}}_{r_{k}}(s,\cdot),\pi_{\text{E}}^{s}-\pi_{k}^{s}\bigr\rangle_{\mathcal{A}}\Bigr]=\eta^{-1}\cdot{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k})-\eta^{-1}\cdot{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k+1})+\Delta_{k}^{\text{(i)}},$$

where

$$\mathbb{E}\bigl[|\Delta_{k}^{\text{(i)}}|\bigr]=2\sqrt{2}\cdot C_{h}\cdot B_{\theta}^{1/2}\cdot\sigma^{1/2}\cdot N^{-1/4}+\epsilon_{Q,k}+\eta\cdot(M_{0}^{2}+9B_{\theta}^{2})+\mathcal{O}(\eta^{-1}\cdot\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}+B_{\theta}^{5/4}\cdot m^{-1/8}). \quad (5.5)$$

Here $M_{0}$ is defined in Assumption 4.2 (a), $\sigma$ is defined in Assumption 4.3, $N$ is the batch size in (3.16)-(3.18), and $\epsilon_{Q,k}=\mathcal{O}(B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2}))$ for an absolute constant $C_{v}>0$, which depends on the absolute constant $v$ in Assumption 4.2 (a).

Proof.

See §C.2 for a detailed proof. ∎

Combining Lemmas 5.1 and 5.2, we have that

$$J(\pi_{\text{E}};r_{k})-J(\pi_{k};r_{k})\leq\frac{{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k})-{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k+1})+\eta\cdot\Delta_{k}^{\text{(i)}}}{\eta\cdot(1-\gamma)}, \quad (5.6)$$

which upper bounds term (i) of (5.2). Here $\Delta_{k}^{\text{(i)}}$ is upper bounded in (5.5) of Lemma 5.2.

Upper bound of term (ii) in (5.2). In what follows, we upper bound term (ii) of (5.2). We first establish the following lemma, which characterizes the (approximate) convexity of $L(\theta,\beta)$ with respect to $\beta$.

Lemma 5.3.

Under Assumption 4.1 (a), it holds for any $\beta^{\prime}\in S_{B_{\beta}}$ that

$$\mathbb{E}_{\text{init}}\bigl[L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})\bigr]=\mathbb{E}_{\text{init}}\bigl[\nabla_{\beta}L(\theta_{k},\beta_{k})^{\top}(\beta^{\prime}-\beta_{k})\bigr]+\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4}).$$
Proof.

See §C.3 for a detailed proof. ∎

The term $\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4})$ in Lemma 5.3 arises from the linearization error of the neural network, which is characterized in Lemma A.2. Based on Lemma 5.3, we establish the following lemma, which upper bounds term (ii) of (5.2).

Lemma 5.4.

Under Assumptions 4.1 (a) and 4.3, we have that

$$L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})\leq\eta^{-1}\cdot\|\beta_{k}-\beta^{\prime}\|_{2}^{2}-\eta^{-1}\cdot\|\beta_{k+1}-\beta^{\prime}\|_{2}^{2}-\eta^{-1}\cdot\|\beta_{k+1}-\beta_{k}\|_{2}^{2}+\Delta_{k}^{\text{(ii)}},$$

where

$$\mathbb{E}\bigl[|\Delta_{k}^{\text{(ii)}}|\bigr]=\eta\cdot\bigl((2+\lambda\cdot L_{\psi})^{2}+\sigma^{2}\cdot N^{-1}\bigr)+2B_{\beta}\cdot\sigma\cdot N^{-1/2}+\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4}). \quad (5.7)$$
Proof.

See §C.4 for a detailed proof. ∎

By Lemma 5.4, we have that

$$L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})\leq\eta^{-1}\cdot\bigl(\|\beta_{k}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta_{k}\|_{2}^{2}\bigr)+\Delta_{k}^{\text{(ii)}}, \quad (5.8)$$

which upper bounds term (ii) of (5.2). Here $\Delta_{k}^{\text{(ii)}}$ is upper bounded in (5.7) of Lemma 5.4.

Plugging (5.3), (5.6), and (5.8) into (5.2), we obtain that

$$J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\pi_{k};r_{\beta^{\prime}})\leq\frac{{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k})-{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k+1})}{\eta\cdot(1-\gamma)}+\eta^{-1}\cdot\bigl(\|\beta_{k}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta^{\prime}\|_{2}^{2}\bigr)+2\lambda\cdot L_{\psi}\cdot B_{\beta}+\Delta_{k}. \quad (5.9)$$

Here $\Delta_{k}=\Delta_{k}^{\text{(i)}}+\Delta_{k}^{\text{(ii)}}$, where $\Delta_{k}^{\text{(i)}}$ and $\Delta_{k}^{\text{(ii)}}$ are upper bounded in (5.5) and (5.7) of Lemmas 5.2 and 5.4, respectively. Note that the upper bound of $\Delta_{k}$ does not depend on $\theta$ and $\beta$. Upon telescoping (5.9) with respect to $k$, we obtain that

$$J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\bar{\pi};r_{\beta^{\prime}})=\frac{1}{T}\sum_{k=0}^{T-1}\bigl[J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\pi_{k};r_{\beta^{\prime}})\bigr]\leq\frac{(1-\gamma)^{-1}\cdot{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{0})+\|\beta_{0}-\beta^{\prime}\|_{2}^{2}}{\eta\cdot T}+2\lambda\cdot L_{\psi}\cdot B_{\beta}+\frac{1}{T}\sum_{k=0}^{T-1}|\Delta_{k}|. \quad (5.10)$$

Following from the fact that $\tau_{0}=0$ and the parameterization of $\pi_{\theta}$ in (3.5), the policy $\pi_{0}^{s}$ is the uniform distribution over $\mathcal{A}$ for any $s\in{\mathcal{S}}$. Thus, we have ${\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{0})\leq\log|\mathcal{A}|$. Meanwhile, following from the fact that $\beta^{\prime}\in S_{B_{\beta}}$, it holds that $\|\beta^{\prime}-\beta_{0}\|_{2}\leq B_{\beta}$. Finally, by setting $\eta=T^{-1/2}$, $\tau_{k}=k\cdot\eta$, and $\bar{B}=\max\{B_{\theta},B_{\omega},B_{\beta}\}$ in (5.10), we have that

$$\mathbb{E}\bigl[\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\bar{\pi})\bigr]=\mathbb{E}\Bigl[\max_{\beta^{\prime}\in S_{B_{\beta}}}J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\bar{\pi};r_{\beta^{\prime}})\Bigr]\leq\frac{(1-\gamma)^{-1}\cdot\log|\mathcal{A}|+4B_{\beta}^{2}}{\eta\cdot T}+2\lambda\cdot L_{\psi}\cdot B_{\beta}+\frac{\mathbb{E}\bigl[\max_{\beta^{\prime}}\sum_{k=0}^{T-1}|\Delta_{k}|\bigr]}{T}=\frac{(1-\gamma)^{-1}\cdot\log|\mathcal{A}|+13\bar{B}^{2}+M_{0}^{2}+8}{\sqrt{T}}+2\lambda\cdot L_{\psi}\cdot\bar{B}+\frac{\sum_{k=0}^{T-1}\varepsilon_{k}}{T}.$$

Here $\varepsilon_{k}$ is upper bounded as follows,

$$\varepsilon_{k}=2\sqrt{2}\cdot C_{h}\cdot\bar{B}\cdot\sigma\cdot N^{-1/2}+\epsilon_{Q,k}+\mathcal{O}(k\cdot\bar{B}^{3/2}\cdot m^{-1/4}+\bar{B}^{5/4}\cdot m^{-1/8}),$$

where $\epsilon_{Q,k}=\mathcal{O}(B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2}))$ for an absolute constant $C_{v}>0$. Thus, we complete the proof of Theorem 4.5.

References

  • Agarwal et al. (2019) Agarwal, A., Kakade, S. M., Lee, J. D. and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
  • Altman (1999) Altman, E. (1999). Constrained Markov decision processes, vol. 7. CRC Press.
  • Anthony and Bartlett (2009) Anthony, M. and Bartlett, P. L. (2009). Neural network learning: Theoretical foundations. Cambridge University Press.
  • Antos et al. (2008) Antos, A., Szepesvári, C. and Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71 89–129.
  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S. and Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • Bhandari and Russo (2019) Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
  • Cai et al. (2019a) Cai, Q., Hong, M., Chen, Y. and Wang, Z. (2019a). On the global convergence of imitation learning: A case for linear quadratic regulator. arXiv preprint arXiv:1901.03674.
  • Cai et al. (2019b) Cai, Q., Yang, Z., Jin, C. and Wang, Z. (2019b). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.
  • Cai et al. (2019c) Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019c). Neural temporal-difference learning converges to global optima. arXiv preprint arXiv:1905.10027.
  • Chen et al. (2020) Chen, M., Wang, Y., Liu, T., Yang, Z., Li, X., Wang, Z. and Zhao, T. (2020). On computation and generalization of generative adversarial imitation learning. arXiv preprint arXiv:2001.02792.
  • Farahmand et al. (2016) Farahmand, A.-m., Ghavamzadeh, M., Szepesvári, C. and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research, 17 4809–4874.
  • Farahmand et al. (2010) Farahmand, A.-m., Szepesvári, C. and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems.
  • Finn et al. (2016) Finn, C., Levine, S. and Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems.
  • Ho and Ermon (2016) Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems.
  • Hofmann et al. (2008) Hofmann, T., Schölkopf, B. and Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics 1171–1220.
  • Jin et al. (2019) Jin, C., Yang, Z., Wang, Z. and Jordan, M. I. (2019). Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388.
  • Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, vol. 2.
  • Kakade (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.
  • Kuefler et al. (2017) Kuefler, A., Morton, J., Wheeler, T. and Kochenderfer, M. (2017). Imitating driver behavior with generative adversarial networks. In IEEE Intelligent Vehicles Symposium. IEEE.
  • Lazaric et al. (2016) Lazaric, A., Ghavamzadeh, M. and Munos, R. (2016). Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17 583–612.
  • Levine and Koltun (2012) Levine, S. and Koltun, V. (2012). Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617.
  • Liu et al. (2019) Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
  • Merel et al. (2017) Merel, J., Tassa, Y., TB, D., Srinivasan, S., Lemmon, J., Wang, Z., Wayne, G. and Heess, N. (2017). Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201.
  • Munos and Szepesvári (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. The Journal of Machine Learning Research, 9 815–857.
  • Nesterov (2013) Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media.
  • Ng and Russell (2000) Ng, A. Y. and Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In International Conference on Machine Learning.
  • Peters and Schaal (2008) Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71 1180–1190.
  • Pomerleau (1991) Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3 88–97.
  • Rafique et al. (2018) Rafique, H., Liu, M., Lin, Q. and Yang, T. (2018). Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv:1810.02060.
  • Rahimi and Recht (2008) Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems.
  • Rahimi and Recht (2009) Rahimi, A. and Recht, B. (2009). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in Neural Information Processing Systems 1313–1320.
  • Ross and Bagnell (2010) Ross, S. and Bagnell, D. (2010). Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics.
  • Ross et al. (2011) Ross, S., Gordon, G. and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics.
  • Russell (1998) Russell, S. (1998). Learning agents for uncertain environments. In Conference on Learning Theory.
  • Scherrer et al. (2015) Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. and Geist, M. (2015). Approximate modified policy iteration and its application to the game of Tetris. The Journal of Machine Learning Research, 16 1629–1676.
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
  • Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P. and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
  • Syed et al. (2008) Syed, U., Bowling, M. and Schapire, R. E. (2008). Apprenticeship learning using linear programming. In International Conference on Machine Learning.
  • Szepesvári and Munos (2005) Szepesvári, C. and Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In International Conference on Machine Learning. ACM.
  • Tai et al. (2018) Tai, L., Zhang, J., Liu, M. and Burgard, W. (2018). Socially compliant navigation through raw depth inputs with generative adversarial imitation learning. In IEEE International Conference on Robotics and Automation. IEEE.
  • van de Geer and Muro (2014) van de Geer, S. and Muro, A. (2014). On higher order isotropy conditions and lower bounds for sparse quadratic forms. Electronic Journal of Statistics, 8 3031–3061.
  • Wang et al. (2019) Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
  • Xu et al. (2019a) Xu, P., Gao, F. and Gu, Q. (2019a). An improved convergence analysis of stochastic variance-reduced policy gradient. arXiv preprint arXiv:1905.12615.
  • Xu et al. (2019b) Xu, P., Gao, F. and Gu, Q. (2019b). Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610.
  • Yang and Wang (2019a) Yang, L. and Wang, M. (2019a). Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning.
  • Yang and Wang (2019b) Yang, L. F. and Wang, M. (2019b). Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389.
  • Yu et al. (2016) Yu, L., Zhang, W., Wang, J. and Yu, Y. (2016). SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473.
  • Zhang et al. (2019a) Zhang, K., Koppel, A., Zhu, H. and Başar, T. (2019a). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.
  • Zhang et al. (2019b) Zhang, K., Yang, Z. and Başar, T. (2019b). Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. arXiv preprint arXiv:1906.00729.

Appendix A Neural Networks

In what follows, we present the properties of the neural network defined in (3.1). First, we define the following function class.

Definition A.1 (Function Class).

For B>0B>0 and m+m\in\mathbb{N}_{+}, we define

B,m={Wϕ0(s,a)|Wmd,WW02B},\displaystyle\mathcal{F}_{B,m}=\bigl{\{}W^{\top}\phi_{0}(s,a)\,\big{|}\,W\in\mathbb{R}^{md},\leavevmode\nobreak\ \|W-W_{0}\|_{2}\leq B\bigr{\}},

where ϕ0(s,a)\phi_{0}(s,a) is the feature vector defined in (3.2) with W=W0W=W_{0}.

As shown in Rahimi and Recht (2008), the feature $\phi_{0}(s,a)$ induces a reproducing kernel Hilbert space (RKHS), namely $\mathcal{H}$. When $m$ goes to infinity, $\mathcal{F}_{B,m}$ approximates a ball in $\mathcal{H}$, which captures a rich class of functions (Hofmann et al., 2008; Rahimi and Recht, 2008). Furthermore, we obtain the following lemma from Cai et al. (2019c), which characterizes the linearization error of the neural network defined in (3.1).
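To make the objects in Definition A.1 concrete, the following sketch instantiates the random initialization in (3.3), the feature vector $\phi_{W}(s,a)$ (whose blocks, as used in the proof of Lemma A.2 below, take the form $m^{-1/2}\cdot\operatorname{\mathds{1}}\{(s,a)^{\top}[W]_{l}>0\}\cdot(s,a)$), and an element of $\mathcal{F}_{B,m}$. The function names and the NumPy-based implementation are illustrative only and are not part of the paper.

import numpy as np

def init_weights(m, d, seed=0):
    # Random initialization mirroring (3.3): each block [W_0]_l ~ N(0, I_d / d).
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))

def feature(W, x):
    # phi_W(s, a) with x = (s, a) and ||x||_2 <= 1: stacked blocks
    # m^{-1/2} * 1{x^T [W]_l > 0} * x, as in the proof of Lemma A.2.
    m = W.shape[0]
    active = (W @ x > 0).astype(float)
    return (active[:, None] * x[None, :]) / np.sqrt(m)  # shape (m, d)

def element_of_F_B_m(W, W0, x, B):
    # An element W^T phi_0(s, a) of F_{B,m} in Definition A.1, i.e., a linear
    # function of the feature vector phi_0 = phi_{W_0} at initialization.
    assert np.linalg.norm(W - W0) <= B + 1e-12
    return float(np.sum(W * feature(W0, x)))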

Lemma A.2 (Linearization Error, Lemma 5.1 in Cai et al. (2019c)).

Under Assumption (a), it holds for any W,W1,W2SBW,W_{1},W_{2}\in S_{B} that,

𝔼init[WϕW1(s,a)WϕW2(s,a)2,μ2]=𝒪(B3m1/2),\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\bigl{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{W_{2}}(s,a)\bigr{\|}_{2,\mu}^{2}\Bigr{]}=\mathcal{O}(B^{3}\cdot m^{-1/2}),
𝔼init[WϕW1(s,a)WϕW2(s,a)1,μ]=𝒪(B3/2m1/4),\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{W_{2}}(s,a)\big{\|}_{1,\mu}\Bigr{]}=\mathcal{O}(B^{3/2}\cdot m^{-1/4}),

where ϕW(s,a)\phi_{W}(s,a) is the feature vector defined in (3.2) and μ𝒫(𝒮×𝒜)\mu\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A}) is a distribution that satisfies Assumption (a).

Proof.

See §A.1 for a detailed proof. ∎

Following from Lemma A.2, the function class B,m\mathcal{F}_{B,m} defined in Definition A.1 is a first-order approximation of the class of the neural networks defined in (3.1). Meanwhile, we establish the following lemma to characterize the sub-Gaussian property of the neural network defined in (3.1).

Lemma A.3.

Under Assumption 4.2, for any W,WSBW,W^{\prime}\in S_{B}, it holds that sup(s,a)𝒮×𝒜|WϕW(s,a)|\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}|W^{\top}\phi_{W^{\prime}}(s,a)| is sub-Gaussian, where the randomness comes from the random initialization W0W_{0} in the definition of SBS_{B} in (3.4). Moreover, it holds that

𝔼init[sup(s,a)𝒮×𝒜|WϕW(s,a)|2]2M02+18B2\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}^{2}\Bigr{]}\leq 2M_{0}^{2}+18B^{2}

and that

(sup(s,a)𝒮×𝒜|WϕW(s,a)|>t)exp(vt2/2),t>2M0+6B.\displaystyle\mathbb{P}\Bigl{(}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}>t\Bigr{)}\leq\exp(-v\cdot t^{2}/2),\quad\forall t>2M_{0}+6B.
Proof.

See §A.2 for a detailed proof. ∎

A.1 Proof of Lemma A.2

Proof.

We consider any W,WSBW,W^{\prime}\in S_{B}. By the definition of ϕW(s,a)\phi_{W}(s,a) in (3.2) and the triangle inequality, we have that

|WϕW(s,a)Wϕ0(s,a)|\displaystyle\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)-W^{\top}\phi_{0}(s,a)\bigr{|}
1ml=1m|[W]l(s,a)||𝟙{(s,a)[W]l>0}𝟙{(s,a)[W0]l>0}|.\displaystyle\quad\leq\frac{1}{\sqrt{m}}\sum_{l=1}^{m}\bigl{|}[W]_{l}^{\top}(s,a)\bigr{|}\cdot\Bigl{|}\operatorname{\mathds{1}}\bigl{\{}(s,a)^{\top}[W^{\prime}]_{l}>0\bigr{\}}-\operatorname{\mathds{1}}\bigl{\{}(s,a)^{\top}[W_{0}]_{l}>0\bigr{\}}\Bigr{|}. (A.1)

We now upper bound the right-hand side of (A.1). For the term |[W]l(s,a)||[W]_{l}^{\top}(s,a)| in (A.1), we have that

|[W]l(s,a)|\displaystyle\bigl{|}[W]_{l}^{\top}(s,a)\bigr{|} |[W0]l(s,a)|+|([W]l[W0]l)(s,a)|\displaystyle\leq\bigl{|}[W_{0}]_{l}^{\top}(s,a)\bigr{|}+\Bigl{|}\bigl{(}[W]_{l}-[W_{0}]_{l}\bigr{)}^{\top}(s,a)\Bigr{|}
|[W0]l(s,a)|+[W]l[W0]l2,\displaystyle\leq\bigl{|}[W_{0}]_{l}^{\top}(s,a)\bigr{|}+\big{\|}[W]_{l}-[W_{0}]_{l}\big{\|}_{2}, (A.2)

where the first inequality follows from the triangle inequality and the second inequality follows from the Cauchy-Schwartz inequality and the fact that (s,a)21\|(s,a)\|_{2}\leq 1. To upper bound the term |𝟙{(s,a)[W]l>0}𝟙{(s,a)[W0]l>0}||\operatorname{\mathds{1}}\{(s,a)^{\top}[W^{\prime}]_{l}>0\}-\operatorname{\mathds{1}}\{(s,a)^{\top}[W_{0}]_{l}>0\}| on the right-hand side of (A.1), note that 𝟙{(s,a)[W]l>0}𝟙{(s,a)[W0]l>0}\operatorname{\mathds{1}}\{(s,a)^{\top}[W^{\prime}]_{l}>0\}\neq\operatorname{\mathds{1}}\{(s,a)^{\top}[W_{0}]_{l}>0\} implies that

|[W0]l(s,a)||[W]l(s,a)[W0]l(s,a)|[W]l[W0]l2.\displaystyle\bigl{|}[W_{0}]_{l}^{\top}(s,a)\bigr{|}\leq\bigl{|}[W^{\prime}]_{l}^{\top}(s,a)-[W_{0}]_{l}^{\top}(s,a)\bigr{|}\leq\big{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\big{\|}_{2}.

Thus, we have that

|𝟙{(s,a)[W]l>0}𝟙{(s,a)[W0]l>0}|𝟙{|(s,a)[W0]l|[W]l[W0]l2}.\displaystyle\Bigl{|}\operatorname{\mathds{1}}\bigl{\{}(s,a)^{\top}[W^{\prime}]_{l}>0\bigr{\}}-\operatorname{\mathds{1}}\bigl{\{}(s,a)^{\top}[W_{0}]_{l}>0\bigr{\}}\Bigr{|}\leq\operatorname{\mathds{1}}\Bigl{\{}\bigl{|}(s,a)^{\top}[W_{0}]_{l}\bigr{|}\leq\bigl{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\bigr{\|}_{2}\Bigr{\}}. (A.3)

Plugging (A.2) and (A.3) into (A.1), we have that

|WϕW(s,a)Wϕ0(s,a)|\displaystyle\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)-W^{\top}\phi_{0}(s,a)\bigr{|}
1ml=1m𝟙{|(s,a)[W0]l|[W]l[W0]l2}(|(s,a)[W0]l|+[W]l[W0]l2)\displaystyle\quad\leq\frac{1}{\sqrt{m}}\sum_{l=1}^{m}\operatorname{\mathds{1}}\Bigl{\{}\bigl{|}(s,a)^{\top}[W_{0}]_{l}\bigr{|}\leq\bigl{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\bigr{\|}_{2}\Bigr{\}}\cdot\Bigl{(}\bigl{|}(s,a)^{\top}[W_{0}]_{l}\bigr{|}+\bigl{\|}[W]_{l}-[W_{0}]_{l}\bigr{\|}_{2}\Bigr{)}
1ml=1m𝟙{|(s,a)[W0]l|[W]l[W0]l2}([W]l[W0]l2+[W]l[W0]l2).\displaystyle\quad\leq\frac{1}{\sqrt{m}}\sum_{l=1}^{m}\operatorname{\mathds{1}}\Big{\{}\big{|}(s,a)^{\top}[W_{0}]_{l}\big{|}\leq\big{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\big{\|}_{2}\Big{\}}\cdot\Bigl{(}\big{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\big{\|}_{2}+\big{\|}[W]_{l}-[W_{0}]_{l}\big{\|}_{2}\Bigr{)}.

By the fact that W,WSBW,W^{\prime}\in S_{B}, we obtain that

|WϕW(s,a)Wϕ0(s,a)|24B2ml=1m𝟙{|(s,a)[W0]l|[W]l[W0]l2}.\displaystyle\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)-W^{\top}\phi_{0}(s,a)\bigr{|}^{2}\leq\frac{4B^{2}}{m}\sum_{l=1}^{m}\operatorname{\mathds{1}}\Bigl{\{}\bigl{|}(s,a)^{\top}[W_{0}]_{l}\bigr{|}\leq\bigl{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\bigr{\|}_{2}\Bigr{\}}.

By setting y=[W]l[W0]l2y=\|[W^{\prime}]_{l}-[W_{0}]_{l}\|_{2} in Assumption (a), we have that

WϕW(s,a)Wϕ0(s,a)2,μ28B2ml=1mc[W]l[W0]l2[W0]l2.\displaystyle\bigl{\|}W^{\top}\phi_{W^{\prime}}(s,a)-W^{\top}\phi_{0}(s,a)\bigr{\|}_{2,\mu}^{2}\leq\frac{8B^{2}}{m}\sum_{l=1}^{m}\frac{c\cdot\bigl{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\bigr{\|}_{2}}{\bigl{\|}[W_{0}]_{l}\bigr{\|}_{2}}.

Taking the expectation with respect to the random initialization in (3.3) and using the Cauchy-Schwartz inequality, we have that

𝔼init[WϕW(s,a)Wϕ0(s,a)2,μ2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W^{\prime}}(s,a)-W^{\top}\phi_{0}(s,a)\big{\|}_{2,\mu}^{2}\Bigr{]}
𝔼init[8cB2m(l=1m[W]l[W0]l22)1/2(l=1m1/[W0]l22)1/2]\displaystyle\quad\leq\mathbb{E}_{\text{init}}\biggl{[}\frac{8cB^{2}}{m}\Bigl{(}\sum_{l=1}^{m}\bigl{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\bigr{\|}_{2}^{2}\Bigr{)}^{1/2}\cdot\Bigl{(}\sum_{l=1}^{m}1/\bigl{\|}[W_{0}]_{l}\bigr{\|}_{2}^{2}\Bigr{)}^{1/2}\biggr{]}
8cB3m𝔼init[(l=1m1/[W0]l22)1/2]\displaystyle\quad\leq\frac{8cB^{3}}{m}\mathbb{E}_{\text{init}}\biggl{[}\Bigl{(}\sum_{l=1}^{m}1/\bigl{\|}[W_{0}]_{l}\bigr{\|}_{2}^{2}\Bigr{)}^{1/2}\biggr{]}
8cB3m(𝔼wN(0,Id/d)[1/w22])1/2\displaystyle\quad\leq\frac{8cB^{3}}{\sqrt{m}}\Bigl{(}\mathbb{E}_{w\sim N(0,I_{d}/d)}\bigl{[}1/\|w\|_{2}^{2}\bigr{]}\Bigr{)}^{1/2}
=𝒪(B3m1/2),\displaystyle\quad=\mathcal{O}(B^{3}\cdot m^{-1/2}),

where the second inequality follows from the fact that $\|W^{\prime}-W_{0}\|_{2}\leq B$, the third inequality follows from Jensen's inequality, and the last equality follows from Assumption (a) and the random initialization in (3.3). Thus, for any $W,W_{1},W_{2}\in S_{B}$, we have that

𝔼init[WϕW1(s,a)WϕW2(s,a)2,μ2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{W_{2}}(s,a)\big{\|}_{2,\mu}^{2}\Bigr{]}
2𝔼init[WϕW1(s,a)Wϕ0(s,a)2,μ2]+2𝔼init[WϕW2(s,a)Wϕ0(s,a)2,μ2]\displaystyle\quad\leq 2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{0}(s,a)\big{\|}_{2,\mu}^{2}\Bigr{]}+2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{2}}(s,a)-W^{\top}\phi_{0}(s,a)\big{\|}_{2,\mu}^{2}\Bigr{]}
=𝒪(B3m1/2).\displaystyle\quad=\mathcal{O}(B^{3}\cdot m^{-1/2}).

Moreover, following from the Cauchy-Schwartz inequality, we have that 1,μ2,μ\|\cdot\|_{1,\mu}\leq\|\cdot\|_{2,\mu}. Thus, by Jensen’s inequality, we have that

𝔼init[WϕW1(s,a)WϕW2(s,a)1,μ]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{W_{2}}(s,a)\big{\|}_{1,\mu}\Bigr{]}
𝔼init[WϕW1(s,a)WϕW2(s,a)2,μ]\displaystyle\quad\leq\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{W_{2}}(s,a)\big{\|}_{2,\mu}\Bigr{]}
=𝒪(B3/2m1/4),\displaystyle\quad=\mathcal{O}(B^{3/2}\cdot m^{-1/4}),

which completes the proof of Lemma A.2. ∎

A.2 Proof of Lemma A.3

In what follows, we present the proof of Lemma A.3.

Proof.

Recall that we write uW(s,a)=WϕW(s,a)u_{W}(s,a)=W^{\top}\phi_{W}(s,a) and u0(s,a)=uW0(s,a)u_{0}(s,a)=u_{W_{0}}(s,a). Then, we have

|WϕW(s,a)|\displaystyle\big{|}W^{\top}\phi_{W^{\prime}}(s,a)\big{|} |u0(s,a)|+|(WW)ϕW(s,a)|+|uW(s,a)u0(s,a)|\displaystyle\leq\bigl{|}u_{0}(s,a)\bigr{|}+\bigl{|}(W-W^{\prime})^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}+\bigl{|}u_{W^{\prime}}(s,a)-u_{0}(s,a)\bigr{|}
|u0(s,a)|+WW2ϕW(s,a)2+|uW(s,a)u0(s,a)|,\displaystyle\leq\bigl{|}u_{0}(s,a)\bigr{|}+\|W-W^{\prime}\|_{2}\cdot\big{\|}\phi_{W^{\prime}}(s,a)\big{\|}_{2}+\bigl{|}u_{W^{\prime}}(s,a)-u_{0}(s,a)\bigr{|}, (A.4)

where the last inequality follows from the Cauchy-Schwartz inequality. It suffices to upper bound the three terms on the right-hand side of (A.4). Since $W,W^{\prime}\in S_{B}$ and $\|\phi_{W^{\prime}}(s,a)\|_{2}\leq 1$, we have that

WW2ϕW(s,a)22B.\displaystyle\|W-W^{\prime}\|_{2}\cdot\big{\|}\phi_{W^{\prime}}(s,a)\big{\|}_{2}\leq 2B. (A.5)

It remains to upper bound the term $|u_{W^{\prime}}(s,a)-u_{0}(s,a)|$ in (A.4). Note that $u_{W}(s,a)$ is almost everywhere differentiable with respect to $W$ and that $\nabla_{W}u_{W}(s,a)=\phi_{W}(s,a)$. Thus, following from the mean-value theorem and the Cauchy-Schwartz inequality, we have that

|uW(s,a)u0(s,a)|supWSBϕW(s,a)2WW02B,\displaystyle\bigl{|}u_{W^{\prime}}(s,a)-u_{0}(s,a)\bigr{|}\leq\sup_{W\in S_{B}}\big{\|}\phi_{W}(s,a)\big{\|}_{2}\cdot\|W^{\prime}-W_{0}\|_{2}\leq B, (A.6)

where the second inequality follows from the fact that $\|\phi_{W}(s,a)\|_{2}\leq 1$ and $W^{\prime}\in S_{B}$. Plugging (A.5) and (A.6) into (A.4), we have that

sup(s,a)𝒮×𝒜|WϕW(s,a)|sup(s,a)𝒮×𝒜|u0(s,a)|+3B.\displaystyle\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}\leq\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}u_{0}(s,a)\bigr{|}+3B.

Following from Assumption 4.2, we have that sup(s,a)𝒮×𝒜|WϕW(s,a)|\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}|W^{\top}\phi_{W^{\prime}}(s,a)| is sub-Gaussian. Furthermore, it holds that

𝔼init[sup(s,a)𝒮×𝒜|WϕW(s,a)|2]2𝔼init[sup(s,a)𝒮×𝒜|u0(s,a)|2]+18B22M02+18B2\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}^{2}\Bigr{]}\leq 2\mathbb{E}_{\text{init}}\Bigl{[}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}u_{0}(s,a)\bigr{|}^{2}\Bigr{]}+18B^{2}\leq 2M_{0}^{2}+18B^{2}

and that

(sup(s,a)𝒮×𝒜|WϕW(s,a)|>t)\displaystyle\mathbb{P}\Bigl{(}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}>t\Bigr{)} (sup(s,a)𝒮×𝒜|u0(s,a)|+3B>t)\displaystyle\leq\mathbb{P}\Bigl{(}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}u_{0}(s,a)\bigr{|}+3B>t\Bigr{)}
exp(v(t3B)2)exp(vt2/2)\displaystyle\leq\exp\bigl{(}-v\cdot(t-3B)^{2}\bigr{)}\leq\exp(-v\cdot t^{2}/2)

for t>2M0+6Bt>2M_{0}+6B. Thus, we complete the proof of Lemma A.3. ∎

Appendix B Neural Temporal Difference

In this section, we introduce neural TD (Cai et al., 2019c), which computes ωk\omega_{k} in Algorithm 1. Specifically, neural TD solves the optimization problem in (3.19) via the update in (3.2), which is presented in Algorithm 2.

Algorithm 2 Neural TD
0:  Policy $\pi$, reward function $r$, initialization $W_{0},b$, number of iterations $T_{\mathrm{TD}}$ of neural TD, and stepsize $\alpha$ of neural TD.
1:  Initialization. Set $S_{B_{\omega}}\leftarrow\{W\in\mathbb{R}^{md}\,|\,\|W-W_{0}\|_{2}\leq B_{\omega}\}$ and $\omega(0)\leftarrow W_{0}$.
2:  for $j=0,\ldots,T_{\mathrm{TD}}-1$ do
3:     Sample $(s,a,s^{\prime},a^{\prime})$, where $(s,a)\sim\rho_{\pi}$, $s^{\prime}\sim P(\cdot\,|\,s,a)$, and $a^{\prime}\sim\pi(\cdot\,|\,s^{\prime})$.
4:     Compute the Bellman residual $\delta(j)=Q_{\omega(j)}(s,a)-(1-\gamma)\cdot r(s,a)-\gamma\cdot Q_{\omega(j)}(s^{\prime},a^{\prime})$.
5:     Update $\omega$ via $\omega(j+1)\leftarrow\text{Proj}_{S_{B_{\omega}}}\{\omega(j)-\alpha\cdot\delta(j)\cdot\phi_{\omega(j)}(s,a)\}$.
6:  end for
7:  Output $\bar{\omega}=T_{\mathrm{TD}}^{-1}\sum_{j=0}^{T_{\mathrm{TD}}-1}\omega(j)$.
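The following is a minimal sketch of Algorithm 2 in NumPy, assuming a sampling oracle sample_transition that returns a tuple $(s,a,s^{\prime},a^{\prime})$ as in line 3 and a reward oracle reward(s, a); the state-action value function is the two-layer network $Q_{\omega}(s,a)=\omega^{\top}\phi_{\omega}(s,a)$ with the ReLU features from Appendix A, and the projection onto $S_{B_{\omega}}$ is written out explicitly. All names are illustrative and not part of the paper.

import numpy as np

def project_ball(w, w0, radius):
    # Projection onto S_{B_omega} = {w : ||w - w0||_2 <= radius}.
    diff = w - w0
    norm = np.linalg.norm(diff)
    return w if norm <= radius else w0 + diff * (radius / norm)

def q_and_feature(w, x, m, d):
    # Q_w(s, a) = w^T phi_w(s, a) for the two-layer ReLU network of Appendix A.
    active = (w.reshape(m, d) @ x > 0).astype(float)
    phi = (active[:, None] * x[None, :]).reshape(-1) / np.sqrt(m)
    return float(w @ phi), phi

def neural_td(sample_transition, reward, W0, B_omega, T_td, alpha, gamma):
    m, d = W0.shape
    w0 = W0.reshape(-1)
    w = w0.copy()
    iterates = []
    for _ in range(T_td):
        s, a, s_next, a_next = sample_transition()
        x = np.concatenate([s, a])
        x_next = np.concatenate([s_next, a_next])
        q, phi = q_and_feature(w, x, m, d)
        q_next, _ = q_and_feature(w, x_next, m, d)
        # Bellman residual of line 4 and projected semi-gradient update of line 5.
        delta = q - (1.0 - gamma) * reward(s, a) - gamma * q_next
        w = project_ball(w - alpha * delta * phi, w0, B_omega)
        iterates.append(w.copy())
    # Averaged output, as in the last line of Algorithm 2.
    return np.mean(iterates, axis=0)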

B.1 Proof of Proposition 4.4

Proof.

We obtain the following proposition from Cai et al. (2019c), which characterizes the convergence of Algorithm 2.

Proposition B.1 (Proposition 4.7 in Cai et al. (2019c)).

We set α=min{(1γ)/8,TTD1/2}\alpha=\min\{(1-\gamma)/8,T^{-1/2}_{{\mathrm{TD}}}\} in Algorithm 2. Let Qω¯(s,a)Q_{\bar{\omega}}(s,a) be the state-action value function associated with the output ω¯\bar{\omega}. Under Assumption (a), it holds for any policy π\pi and reward function r(s,a)r(s,a) that

𝔼init[Qω¯(s,a)Qrπ(s,a)2,ρπ2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}Q_{\bar{\omega}}(s,a)-Q^{\pi}_{r}(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]} =2𝔼init[ProjBω,mQrπ(s,a)Qrπ(s,a)2,ρπ2]\displaystyle=2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\text{Proj}_{\mathcal{F}_{B_{\omega},m}}Q^{\pi}_{r}(s,a)-Q^{\pi}_{r}(s,a)\big{\|}_{2,\rho_{\pi}}^{2}\Bigr{]}
+𝒪(Bω2TTD1/2+Bω3m1/2+Bω5/2m1/4),\displaystyle\quad+\mathcal{O}(B_{\omega}^{2}\cdot T^{-1/2}_{{\mathrm{TD}}}+B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}), (B.1)

where Bω,m\mathcal{F}_{B_{\omega},m} is defined in Definition A.1.

Recall that we denote by ϕ0(s,a)\phi_{0}(s,a) the feature vector corresponding to the random initialization in (3.3). We establish the following lemma to upper bound the bias 𝔼init[ProjBω,mQrπ(s,a)Qrπ(s,a)2,ρπ2]\mathbb{E}_{\text{init}}[\|\text{Proj}_{\mathcal{F}_{B_{\omega},m}}Q^{\pi}_{r}(s,a)-Q^{\pi}_{r}(s,a)\|^{2}_{2,\rho_{\pi}}] in (B.1) of Proposition B.1 when the reward function r(s,a)r(s,a) belongs to the reward function class β\mathcal{R}_{\beta}.

Lemma B.2.

We consider any reward function rβ(s,a)βr_{\beta}(s,a)\in\mathcal{R}_{\beta} and policy π\pi. Under Assumptions 4.1 and 4.2, it holds for Bω>Bβ+(1γ)1γBP(2M0+3Bβ)B_{\omega}>B_{\beta}+(1-\gamma)^{-1}\cdot\gamma\cdot B_{P}\cdot(2M_{0}+3B_{\beta}) and an absolute constant Cv=(2γ2BP2)1(1γ)2vC_{v}=(2\cdot\gamma^{2}\cdot B_{P}^{2})^{-1}\cdot(1-\gamma)^{2}\cdot v that

𝔼init[ProjBω,mQrβπ(s,a)Qrβπ(s,a)2,ρπ2]=𝒪(Bβ3m1/2+Bω2m1+Bω2exp(CvBω2)).\displaystyle\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}\text{Proj}_{\mathcal{F}_{B_{\omega},m}}Q_{r_{\beta}}^{\pi}(s,a)-Q_{r_{\beta}}^{\pi}(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]}=\mathcal{O}\bigl{(}B_{\beta}^{3}\cdot m^{-1/2}+B_{\omega}^{2}\cdot m^{-1}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr{)}.
Proof.

See §B.2 for a detailed proof. ∎

Combining Proposition B.1 and Lemma B.2, for Bω>Bβ+(1γ)1γBP(2M0+3Bβ)B_{\omega}>B_{\beta}+(1-\gamma)^{-1}\cdot\gamma\cdot B_{P}\cdot(2M_{0}+3B_{\beta}), we have for any π\pi that

𝔼init[Qω¯(s,a)Qrβπ(s,a)2,ρπ2]=𝒪(Bω2TTD1/2+Bω3m1/2+Bω5/2m1/4+Bω2exp(CvBω2)).\displaystyle\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}Q_{\bar{\omega}}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]}=\mathcal{O}\bigl{(}B_{\omega}^{2}\cdot T^{-1/2}_{{\mathrm{TD}}}+B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr{)}.

Finally, by setting TTD=Ω(m)T_{{\mathrm{TD}}}=\Omega(m), we have that

𝔼init[Qω¯(s,a)Qrβπ(s,a)2,ρπ2]=𝒪(Bω3m1/2+Bω5/2m1/4+Bω2exp(CvBω2)),\displaystyle\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}Q_{\bar{\omega}}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]}=\mathcal{O}\bigl{(}B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr{)},

which completes the proof of Proposition 4.4. ∎

B.2 Proof of Lemma B.2

Proof.

For notational simplicity, we write ϑ(s,a;w)=𝟙{|w(s,a)|>0}(s,a)\vartheta(s,a;w)=\operatorname{\mathds{1}}{\{|w^{\top}(s,a)|>0\}}\cdot(s,a). Under Assumption 4.2, we have that

P(s|s,a)=ϑ(s,a;w)φ(s;w)dq(w),where supwφ(s;w)ds2BP.\displaystyle P(s^{\prime}\,|\,s,a)=\int\vartheta(s,a;w)^{\top}\varphi(s^{\prime};w){\mathrm{d}}q(w),\quad\text{where }\sup_{w}\biggl{\|}\int\varphi(s;w){\mathrm{d}}s\biggr{\|}_{2}\leq B_{P}. (B.2)

Thus, since rβ=(1γ)1uβ(s,a)r_{\beta}=(1-\gamma)^{-1}\cdot u_{\beta}(s,a), we have that

Qrβπ(s,a)\displaystyle Q^{\pi}_{r_{\beta}}(s,a) =(1γ)rβ(s,a)+γ𝒮P(s|s,a)Vrβπ(s)ds\displaystyle=(1-\gamma)\cdot r_{\beta}(s,a)+\gamma\cdot\int_{\mathcal{S}}P(s^{\prime}\,|\,s,a)\cdot V^{\pi}_{r_{\beta}}(s^{\prime}){\mathrm{d}}s^{\prime}
=uβ(s,a)+𝒮γVrβπ(s)ϑ(s,a;w)φ(s;w)dq(w)ds\displaystyle=u_{\beta}(s,a)+\int_{\mathcal{S}}\gamma\cdot V^{\pi}_{r_{\beta}}(s^{\prime})\cdot\int\vartheta(s,a;w)^{\top}\varphi(s^{\prime};w){\mathrm{d}}q(w){\mathrm{d}}s^{\prime}
=uβ(s,a)+ϑ(s,a;w)(γ𝒮φ(s;w)Vrβπ(s)ds)dq(w),\displaystyle=u_{\beta}(s,a)+\int\vartheta(s,a;w)^{\top}\biggl{(}\gamma\cdot\int_{\mathcal{S}}\varphi(s^{\prime};w)V^{\pi}_{r_{\beta}}(s^{\prime}){\mathrm{d}}s^{\prime}\biggr{)}{\mathrm{d}}q(w),

where the second equality follows from (B.2) and the last equality follows from Fubini’s theorem. In the sequel, we define

α(w)=γ𝒮φ(s;w)Vrβπ(s)ds.\displaystyle\alpha(w)=\gamma\cdot\int_{\mathcal{S}}\varphi(s^{\prime};w)V^{\pi}_{r_{\beta}}(s^{\prime}){\mathrm{d}}s^{\prime}. (B.3)

Note that α(w)d\alpha(w)\in\mathbb{R}^{d}. Then, we have that

Qrβπ(s,a)=uβ(s,a)+ϑ(s,a;w)α(w)dq(w).\displaystyle Q^{\pi}_{r_{\beta}}(s,a)=u_{\beta}(s,a)+\int\vartheta(s,a;w)^{\top}\alpha(w){\mathrm{d}}q(w).

To prove Lemma B.2, we first approximate Qrβπ(s,a)Q^{\pi}_{r_{\beta}}(s,a) by

Q¯(s,a)=uβ(s,a)+ϑ(s,a;w)α¯(w)dq(w),\displaystyle\bar{Q}(s,a)=u_{\beta}(s,a)+\int\vartheta(s,a;w)^{\top}\bar{\alpha}(w){\mathrm{d}}q(w), (B.4)

where α¯(w)=α(w)𝟙{α(w)2K}\bar{\alpha}(w)=\alpha(w)\cdot\operatorname{\mathds{1}}\{\|\alpha(w)\|_{2}\leq K\} for an absolute constant K>0K>0 specified later. Then, it holds for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A} that

|Q¯(s,a)Qrβπ(s,a)|\displaystyle\bigl{|}\bar{Q}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\bigr{|} |ϑ(s,a;w)(α¯(w)α(w))|dq(w)\displaystyle\leq\int\Bigl{|}\vartheta(s,a;w)^{\top}\bigl{(}\bar{\alpha}(w)-\alpha(w)\bigr{)}\Bigr{|}{\mathrm{d}}q(w)
ϑ(s,a;w)2α¯(w)α(w)2dq(w)\displaystyle\leq\int\big{\|}\vartheta(s,a;w)\big{\|}_{2}\cdot\big{\|}\bar{\alpha}(w)-\alpha(w)\big{\|}_{2}{\mathrm{d}}q(w)
supwα¯(w)α(w)2,\displaystyle\leq\sup_{w}\big{\|}\bar{\alpha}(w)-\alpha(w)\big{\|}_{2},

where the second inequality follows from the Cauchy-Schwartz inequality and the last inequality follows from the fact that ϑ(s,a;w)21\|\vartheta(s,a;w)\|_{2}\leq 1. Thus, we have that

Q¯(s,a)Qrβπ(s,a)2,ρπQ¯(s,a)Qrβπ(s,a)supwα¯(w)α(w)2.\displaystyle\big{\|}\bar{Q}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\big{\|}_{2,\rho_{\pi}}\leq\big{\|}\bar{Q}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\big{\|}_{\infty}\leq\sup_{w}\big{\|}\bar{\alpha}(w)-\alpha(w)\big{\|}_{2}. (B.5)

We now upper bound the right-hand side of (B.5). To this end, we show that supwα(w)2\sup_{w}\|\alpha(w)\|_{2} is sub-Gaussian in the sequel. By the definition of α(w)\alpha(w) in (B.3), we have that

supwα(w)2\displaystyle\sup_{w}\bigl{\|}\alpha(w)\bigr{\|}_{2} =γ𝒮φ(s;w)Vrβπ(s)ds2\displaystyle=\gamma\cdot\biggl{\|}\int_{\mathcal{S}}\varphi(s^{\prime};w)V^{\pi}_{r_{\beta}}(s^{\prime}){\mathrm{d}}s^{\prime}\biggr{\|}_{2}
γsups𝒮|Vrβπ(s)|supw𝒮φ(s;w)ds2\displaystyle\leq\gamma\cdot\sup_{s^{\prime}\in{\mathcal{S}}}\bigl{|}V^{\pi}_{r_{\beta}}(s^{\prime})\bigr{|}\cdot\sup_{w}\biggl{\|}\int_{\mathcal{S}}\varphi(s^{\prime};w){\mathrm{d}}s^{\prime}\biggr{\|}_{2}
γBPsups𝒮|Vrβπ(s)|\displaystyle\leq\gamma\cdot B_{P}\cdot\sup_{s^{\prime}\in{\mathcal{S}}}\bigl{|}V^{\pi}_{r_{\beta}}(s^{\prime})\bigr{|}
γ(1γ)1BPsup(s,a)𝒮×𝒜|uβ(s,a)|,\displaystyle\leq\gamma\cdot(1-\gamma)^{-1}\cdot B_{P}\cdot\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}u_{\beta}(s,a)\bigr{|}, (B.6)

where the second inequality follows from Assumption 4.2 and the third inequality follows from the fact that $V_{r_{\beta}}^{\pi}(s)=\mathbb{E}_{(s^{\prime},a^{\prime})\sim\nu_{\pi}(s)}[r_{\beta}(s^{\prime},a^{\prime})]$. Here we denote by $\nu_{\pi}(s)$ the state-action visitation measure starting from the state $s$ and following the policy $\pi$. Following from Lemma A.3, we have that $\sup_{w}\|\alpha(w)\|_{2}$ is sub-Gaussian. By Lemma A.3 and (B.6), it holds for $t>(1-\gamma)^{-1}\cdot\gamma\cdot B_{P}\cdot(2M_{0}+3B_{\beta})$ that

(supwα(w)2>t)\displaystyle\mathbb{P}\Bigl{(}\sup_{w}\bigl{\|}\alpha(w)\bigr{\|}_{2}>t\Bigr{)} (γ(1γ)1BPsup(s,a)𝒮×𝒜|uβ(s,a)|>t)\displaystyle\leq\mathbb{P}\Bigl{(}\gamma\cdot(1-\gamma)^{-1}\cdot B_{P}\cdot\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}u_{\beta}(s,a)\bigr{|}>t\Bigr{)}
exp(v(1γ)2t22γ2BP2).\displaystyle\leq\exp\biggl{(}-\frac{v\cdot(1-\gamma)^{2}\cdot t^{2}}{2\gamma^{2}\cdot B_{P}^{2}}\biggr{)}. (B.7)
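For completeness, the constant $C_{v}$ in (B.7) comes from evaluating the tail bound of Lemma A.3 (with $B=B_{\beta}$) at the rescaled level $(1-\gamma)\cdot t/(\gamma\cdot B_{P})$; a one-line check is

\displaystyle\mathbb{P}\Bigl(\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl|u_{\beta}(s,a)\bigr|>\frac{(1-\gamma)\cdot t}{\gamma\cdot B_{P}}\Bigr)\leq\exp\Bigl(-\frac{v}{2}\cdot\frac{(1-\gamma)^{2}\cdot t^{2}}{\gamma^{2}\cdot B_{P}^{2}}\Bigr)=\exp(-C_{v}\cdot t^{2}),

which matches the definition of $C_{v}$ in Lemma B.2.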

Let the absolute constant $K$ satisfy $K>(1-\gamma)^{-1}\cdot\gamma\cdot B_{P}\cdot(2M_{0}+3B_{\beta})$ so that the tail bound in (B.7) applies for $t\geq K$. For notational simplicity, we write $C_{v}=(2\cdot\gamma^{2}\cdot B_{P}^{2})^{-1}\cdot v\cdot(1-\gamma)^{2}$. By the fact that $\|\bar{\alpha}(w)-\alpha(w)\|_{2}=\|\alpha(w)\|_{2}\cdot\operatorname{\mathds{1}}\{\|\alpha(w)\|_{2}>K\}$, we have that

supwα¯(w)α(w)2supwα(w)2𝟙{supwα(w)2>K}.\displaystyle\sup_{w}\big{\|}\bar{\alpha}(w)-\alpha(w)\big{\|}_{2}\leq\sup_{w}\big{\|}\alpha(w)\big{\|}_{2}\cdot\operatorname{\mathds{1}}\Bigl{\{}\sup_{w}\|\alpha(w)\|_{2}>K\Bigr{\}}.

Following from (B.5) and (B.7), we have that

𝔼init[Q¯(s,a)Qrβπ(s,a)2,ρπ]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\bar{Q}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\big{\|}_{2,\rho_{\pi}}\Bigr{]}
𝔼[supwα(w)2𝟙{supwα(w)2>K}]\displaystyle\quad\leq\mathbb{E}\biggl{[}\sup_{w}\big{\|}\alpha(w)\big{\|}_{2}\cdot\operatorname{\mathds{1}}\Bigl{\{}\sup_{w}\|\alpha(w)\|_{2}>K\Bigr{\}}\biggr{]}
0Kt(supwα(w)2>K)dt+Kt(supwα(w)2>t)dt\displaystyle\quad\leq\int_{0}^{K}t\cdot\mathbb{P}\Bigl{(}\sup_{w}\|\alpha(w)\|_{2}>K\Bigr{)}{\mathrm{d}}t+\int_{K}^{\infty}t\cdot\mathbb{P}\Bigl{(}\sup_{w}\|\alpha(w)\|_{2}>t\Bigr{)}{\mathrm{d}}t
=𝒪(K2exp(CvK2)).\displaystyle\quad=\mathcal{O}\bigl{(}K^{2}\cdot\exp(-C_{v}\cdot K^{2})\bigr{)}. (B.8)
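The order of the last step can be made explicit: under the tail bound in (B.7), both integrals above admit closed-form bounds,

\displaystyle\int_{0}^{K}t\cdot\mathbb{P}\Bigl(\sup_{w}\|\alpha(w)\|_{2}>K\Bigr){\mathrm{d}}t\leq\frac{K^{2}}{2}\cdot\exp(-C_{v}\cdot K^{2}),\qquad\int_{K}^{\infty}t\cdot\exp(-C_{v}\cdot t^{2}){\mathrm{d}}t=\frac{\exp(-C_{v}\cdot K^{2})}{2C_{v}},

and since $C_{v}$ is an absolute constant, both terms are of order $K^{2}\cdot\exp(-C_{v}\cdot K^{2})$ (up to absolute constants) for $K\geq 1$, which gives (B.8).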

We now construct $\widehat{Q}(s,a)\in\mathcal{F}_{B_{\beta}+K,m}$, which approximates $\bar{Q}(s,a)$ defined in (B.4). We define

f(s,a)=ϑ(s,a;w)α¯(w)dq(ω).\displaystyle f(s,a)=\int\vartheta(s,a;w)^{\top}\bar{\alpha}(w){\mathrm{d}}q(\omega).

Then, we have that Q¯(s,a)=uβ(s,a)+f(s,a)\bar{Q}(s,a)=u_{\beta}(s,a)+f(s,a). Note that f(s,a)f(s,a) belongs to the following function class,

~K,={ϑ(s,a;w)α(w)dq(ω)|supwα(w)2K}.\displaystyle\widetilde{\mathcal{F}}_{K,\infty}=\biggl{\{}\int\vartheta(s,a;w)^{\top}\alpha(w){\mathrm{d}}q(\omega)\,\bigg{|}\,\sup_{w}\big{\|}\alpha(w)\big{\|}_{2}\leq K\biggr{\}}.

We now show that f(s,a)f(s,a) is well approximated by the following function class,

~K,m={Wϕ0(s,a)=1ml=1m[W]lϑ(s,a;[W]l)|supl[W]l2K/m},\displaystyle\widetilde{\mathcal{F}}_{K,m}=\biggl{\{}W^{\top}\phi_{0}(s,a)=\frac{1}{\sqrt{m}}\sum_{l=1}^{m}[W]_{l}^{\top}\vartheta\bigl{(}s,a;[W]_{l}\bigr{)}\,\bigg{|}\,\sup_{l}\big{\|}[W]_{l}\big{\|}_{2}\leq K/\sqrt{m}\biggr{\}},

where ϕ0(s,a)\phi_{0}(s,a) is the feature vector corresponding to the random initialization. We obtain the following lemma from Rahimi and Recht (2009), which characterizes the approximation error of ~K,\widetilde{\mathcal{F}}_{K,\infty} by ~K,m\widetilde{\mathcal{F}}_{K,m}.

Lemma B.3 (Lemma 1 in Rahimi and Recht (2009)).

For any f(s,a)~K,f(s,a)\in\widetilde{\mathcal{F}}_{K,\infty}, it holds with probability at least 1δ1-\delta that

Proj~K,mf(s,a)f(s,a)2,μKm1/2(1+2log(1/δ)),\displaystyle\big{\|}\text{Proj}_{\widetilde{\mathcal{F}}_{K,m}}f(s,a)-f(s,a)\big{\|}_{2,\mu}\leq K\cdot m^{-1/2}\cdot\bigl{(}1+\sqrt{2\log(1/\delta)}\bigr{)},

where μ𝒫(𝒮×𝒜)\mu\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A}).

Lemma B.3 implies that there exists f^(s,a)~K,m\widehat{f}(s,a)\in\widetilde{\mathcal{F}}_{K,m} such that

𝔼init[f^(s,a)f(s,a)2,ρπ2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\widehat{f}(s,a)-f(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]} =0(f^(s,a)f(s,a)2,ρπ2>y)dy\displaystyle=\int_{0}^{\infty}\mathbb{P}\Bigl{(}\big{\|}\widehat{f}(s,a)-f(s,a)\big{\|}^{2}_{2,\rho_{\pi}}>y\Bigr{)}{\mathrm{d}}y
\displaystyle\quad\leq\int_{0}^{\infty}\exp\bigl(-1/2\cdot(\sqrt{my}/K-1)^{2}\bigr){\mathrm{d}}y=\mathcal{O}(K^{2}/m). (B.9)

By the fact that f^(s,a)~K,m\widehat{f}(s,a)\in\widetilde{\mathcal{F}}_{K,m} and the definition of K,m\mathcal{F}_{K,m} in Definition A.1, we have that f^(s,a)K,mu0(s,a)\widehat{f}(s,a)\in\mathcal{F}_{K,m}-u_{0}(s,a). Let

\displaystyle\widehat{Q}(s,a)=\beta^{\top}\phi_{0}(s,a)+\widehat{f}(s,a)=(\beta+W_{f})^{\top}\phi_{0}(s,a),
where $W_{f}\in\mathbb{R}^{md}$ denotes the parameter of $\widehat{f}(s,a)=W_{f}^{\top}\phi_{0}(s,a)\in\widetilde{\mathcal{F}}_{K,m}$.

We then have that Q^(s,a)Bβ+K,m\widehat{Q}(s,a)\in\mathcal{F}_{B_{\beta}+K,m} and that

𝔼init[Q¯(s,a)Q^(s,a)2,ρk2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\bar{Q}(s,a)-\widehat{Q}(s,a)\big{\|}^{2}_{2,\rho_{k}}\Bigr{]} 2𝔼init[uβ(s,a)βϕ0(s,a)2,ρπ2]+2𝔼init[f^(s,a)f(s,a)2,ρπ2]\displaystyle\leq 2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}u_{\beta}(s,a)-\beta^{\top}\phi_{0}(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]}+2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\widehat{f}(s,a)-f(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]}
=𝒪(Bβ3m1/2+K2m1),\displaystyle=\mathcal{O}(B_{\beta}^{3}\cdot m^{-1/2}+K^{2}\cdot m^{-1}), (B.10)

where the last inequality follows from Assumption (a), Lemma A.2, and (B.9).

Finally, we set $B_{\omega}=K+B_{\beta}>B_{\beta}+(1-\gamma)^{-1}\cdot\gamma\cdot B_{P}\cdot(2M_{0}+3B_{\beta})$. Combining (B.8) and (B.10), we have that

𝔼init[Qrβπ(s,a)Q^(s,a)2,ρk2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}Q_{r_{\beta}}^{\pi}(s,a)-\widehat{Q}(s,a)\big{\|}^{2}_{2,\rho_{k}}\Bigr{]} 2𝔼init[Q¯(s,a)Q^(s,a)2,ρk2]+2𝔼init[Q¯(s,a)Qrβπ(s,a)2,ρk2]\displaystyle\leq 2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\bar{Q}(s,a)-\widehat{Q}(s,a)\big{\|}^{2}_{2,\rho_{k}}\Bigr{]}+2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\bar{Q}(s,a)-Q_{r_{\beta}}^{\pi}(s,a)\big{\|}^{2}_{2,\rho_{k}}\Bigr{]}
=𝒪(Bβ3m1/2+Bω2m1+Bω2exp(CvBω2)),\displaystyle=\mathcal{O}\bigl{(}B_{\beta}^{3}\cdot m^{-1/2}+B_{\omega}^{2}\cdot m^{-1}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr{)},

where Q^(s,a)Bω,m\widehat{Q}(s,a)\in\mathcal{F}_{B_{\omega},m}. Thus, we complete the proof of Lemma B.2. ∎

Appendix C Proofs of Auxiliary Results

In what follows, we present the proofs of the lemmas in §3-5.

C.1 Proof of Proposition 3.1

Proof.

By the definition of the neural network in (3.1), we have for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A} that WuW(s,a)=ϕW(s,a)\nabla_{W}u_{W}(s,a)=\phi_{W}(s,a) almost everywhere. We first calculate θL(θ,β)\nabla_{\theta}L(\theta,\beta). Following from the policy gradient theorem (Sutton and Barto, 2018) and the definition of L(θ,β)L(\theta,\beta) in (2.4), we have that

θL(θ,β)\displaystyle\nabla_{\theta}L(\theta,\beta) =θJ(πθ;rβ)\displaystyle=-\nabla_{\theta}J(\pi_{\theta};r_{\beta})
=𝔼νπθ[Qrβπθ(s,a)θlogπθ(a|s)].\displaystyle=-\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}Q^{\pi_{\theta}}_{r_{\beta}}(s,a)\cdot\nabla_{\theta}\log\pi_{\theta}(a\,|\,s)\bigr{]}. (C.1)

Following from the parameterization of πθ\pi_{\theta} in (3.5) and the definition of ιθ(s,a)\iota_{\theta}(s,a) in (3.8) of Proposition 3.1, we have that

\displaystyle\nabla_{\theta}\log\pi_{\theta}(a\,|\,s)=\tau\cdot\phi_{\theta}(s,a)-\frac{\sum_{a^{\prime}\in\mathcal{A}}\tau\cdot\exp\bigl(\tau\cdot\theta^{\top}\phi_{\theta}(s,a^{\prime})\bigr)\cdot\phi_{\theta}(s,a^{\prime})}{\sum_{a^{\prime}\in\mathcal{A}}\exp\bigl(\tau\cdot\theta^{\top}\phi_{\theta}(s,a^{\prime})\bigr)}
\displaystyle\quad=\tau\cdot\bigl(\phi_{\theta}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi_{\theta}(\cdot\,|\,s)}\bigl[\phi_{\theta}(s,a^{\prime})\bigr]\bigr)=\tau\cdot\iota_{\theta}(s,a). (C.2)

Plugging (C.2) into (C.1), we have that

θL(θ,β)=τ𝔼νπθ[Qrβπθ(s,a)ιθ(s,a)].\displaystyle\nabla_{\theta}L(\theta,\beta)=-\tau\cdot\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}Q^{\pi_{\theta}}_{r_{\beta}}(s,a)\cdot\iota_{\theta}(s,a)\bigr{]}.

It remains to calculate $\mathcal{I}(\theta)$ and $\nabla_{\beta}L(\theta,\beta)$. By (C.2) and the definition of $\mathcal{I}(\theta)$ in (3.7), it holds that

(θ)\displaystyle\mathcal{I}(\theta) =𝔼νπθ[logπθ(a|s)logπθ(a|s)]\displaystyle=\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}\nabla\log\pi_{\theta}(a\,|\,s)\nabla\log\pi_{\theta}(a\,|\,s)^{\top}\bigr{]}
=τ2𝔼νπθ[ιθ(s,a)ιθ(s,a)].\displaystyle=\tau^{2}\cdot\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}\iota_{\theta}(s,a)\iota_{\theta}(s,a)^{\top}\bigr{]}.

By the definition of the objective function L(θ,β)L(\theta,\beta) in (2.4), it holds that

βL(θ,β)\displaystyle\nabla_{\beta}L(\theta,\beta) =βJ(πE;rβ)βJ(πθ;rβ)λβψ(β)\displaystyle=\nabla_{\beta}J(\pi_{\text{{\tiny E}}};r_{\beta})-\nabla_{\beta}J(\pi_{\theta};r_{\beta})-\lambda\cdot\nabla_{\beta}\psi(\beta)
=𝔼νE[βrβ(s,a)]𝔼νπθ[βrβ(s,a)]λβψ(β)\displaystyle=\mathbb{E}_{\nu_{\text{{\tiny E}}}}\bigl{[}\nabla_{\beta}r_{\beta}(s,a)\bigr{]}-\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}\nabla_{\beta}r_{\beta}(s,a)\bigr{]}-\lambda\cdot\nabla_{\beta}\psi(\beta)
=(1γ)1𝔼νE[ϕβ(s,a)](1γ)1𝔼νπθ[ϕβ(s,a)]λβψ(β).\displaystyle=(1-\gamma)^{-1}\cdot\mathbb{E}_{\nu_{\text{{\tiny E}}}}\bigl{[}\phi_{\beta}(s,a)\bigr{]}-(1-\gamma)^{-1}\cdot\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}\phi_{\beta}(s,a)\bigr{]}-\lambda\cdot\nabla_{\beta}\psi(\beta).

Thus, we complete the proof of Proposition 3.1. ∎
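As a numerical sanity check on the score formula (C.2), one can treat the feature vectors $\phi_{\theta}(s,\cdot)$ as fixed (mirroring the identity $\nabla_{W}u_{W}(s,a)=\phi_{W}(s,a)$ used above) and compare a finite-difference gradient of $\log\pi_{\theta}(a\,|\,s)$ with $\tau\cdot\iota_{\theta}(s,a)=\tau\cdot(\phi_{\theta}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi_{\theta}(\cdot|s)}[\phi_{\theta}(s,a^{\prime})])$. The sketch below uses random feature vectors and is illustrative only; it is not part of the paper.

import numpy as np

def softmax_policy(theta, Phi, tau):
    # pi_theta(a|s) proportional to exp(tau * theta^T Phi[a]); Phi has shape (|A|, dim).
    logits = tau * (Phi @ theta)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def score(theta, Phi, tau, a):
    # tau * iota_theta(s, a) = tau * (Phi[a] - E_{a'~pi}[Phi[a']]), as in (C.2).
    pi = softmax_policy(theta, Phi, tau)
    return tau * (Phi[a] - pi @ Phi)

# Finite-difference check of nabla_theta log pi_theta(a|s) against the closed form.
rng = np.random.default_rng(0)
Phi, theta, tau, a = rng.normal(size=(5, 4)), rng.normal(size=4), 2.0, 3
eps, grad_fd = 1e-6, np.zeros(4)
for i in range(4):
    e = np.zeros(4); e[i] = eps
    grad_fd[i] = (np.log(softmax_policy(theta + e, Phi, tau)[a])
                  - np.log(softmax_policy(theta - e, Phi, tau)[a])) / (2 * eps)
print(np.allclose(grad_fd, score(theta, Phi, tau, a), atol=1e-5))  # expected: True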

C.2 Proof of Lemma 5.2

Proof.

The proof of Lemma 5.2 is similar to that of Lemmas 5.4 and 5.5 in Wang et al. (2019). By direct calculation, we have that

η𝔼dE[Qrkπk(s,),πEsπks𝒜]=KLdE(πEπk)KLdE(πEπk+1)+ηΔk(i),\displaystyle\eta\cdot\mathbb{E}_{d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}Q^{\pi_{k}}_{r_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-{\pi_{k}^{s}}\bigr{\rangle}_{\mathcal{A}}\Bigr{]}={\mathrm{KL}}^{d_{\text{{\tiny E}}}}(\pi_{\text{{\tiny E}}}\,\|\,\pi_{k})-{\mathrm{KL}}^{d_{\text{{\tiny E}}}}(\pi_{\text{{\tiny E}}}\,\|\,\pi_{k+1})+\eta\cdot\Delta_{k}^{\text{(i)}},

where Δk(i)\Delta_{k}^{\text{(i)}} takes the form of

Δk(i)\displaystyle\Delta_{k}^{\text{(i)}} =η1{𝔼dE[log(πk+1s/πks)ηQrkπk(s,),πEsπks𝒜+log(πk+1s/πks),πksπk+1s𝒜]KLdE(πk+1sπks)}\displaystyle=\eta^{-1}\cdot\biggl{\{}\mathbb{E}_{d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot Q^{\pi_{k}}_{r_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}+\bigl{\langle}\log({\pi_{k+1}^{s}}/{\pi_{k}^{s}}),\pi_{k}^{s}-\pi_{k+1}^{s}\bigr{\rangle}_{\mathcal{A}}\Bigr{]}-{\mathrm{KL}}^{d_{\text{{\tiny E}}}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})\biggr{\}}
=η1𝔼dE[log(πk+1s/πks)ηQ^ωk(s,),πEsπks𝒜](i.a)+𝔼dE[Q^ωk(s,)Qrkπk(s,),πEsπks𝒜](i.b)\displaystyle=\underbrace{\eta^{-1}\cdot\mathbb{E}_{d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot\widehat{Q}_{\omega_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}\Bigr{]}}_{\displaystyle\text{(i.a)}}+\underbrace{\mathbb{E}_{d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}\widehat{Q}_{\omega_{k}}(s,\cdot)-Q_{r_{k}}^{\pi_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}\Bigr{]}}_{\displaystyle\text{(i.b)}}
+η1𝔼dE[log(πk+1s/πks),πksπk+1s𝒜KL(πk+1sπks)](i.c).\displaystyle\quad+\underbrace{\eta^{-1}\cdot\mathbb{E}_{d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}\log({\pi_{k+1}^{s}}/{\pi_{k}^{s}}),\pi_{k}^{s}-\pi_{k+1}^{s}\bigr{\rangle}_{\mathcal{A}}-{\mathrm{KL}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})\Bigr{]}}_{\displaystyle\text{(i.c)}}. (C.3)

The following lemmas upper bound $\Delta_{k}^{\text{(i)}}$ by upper bounding terms (i.a), (i.b), and (i.c) on the right-hand side of (C.3), respectively. Note that the expectation $\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}$ is taken with respect to the random initialization in (3.3) and $s\sim d_{\text{{\tiny E}}}$.

Lemma C.1 (Upper Bound of Term (i.a) in (C.3)).

Under Assumptions (a) and 4.3, we have that

𝔼init,dE[|log(πk+1s/πks)ηQ^ωk(s,),πEsπks𝒜|]\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot\widehat{Q}_{\omega_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}
=η22ChBθ1/2σ1/2N1/4+𝒪(τk+1Bθ3/2m1/4+ηBθ5/4m1/8),\displaystyle\quad=\eta\cdot 2\sqrt{2}\cdot C_{h}\cdot B_{\theta}^{1/2}\cdot\sigma^{1/2}\cdot N^{-1/4}+\mathcal{O}(\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}+\eta\cdot B_{\theta}^{5/4}\cdot m^{-1/8}),

where ChC_{h} is defined in Assumption (b) and σ\sigma is defined in Assumption 4.3.

Proof.

See §D.1 for a detailed proof. ∎

Lemma C.2 (Upper Bound of Term (i.b) in (C.3)).

Under Assumption (b), we have that

𝔼init,dE[Q^ωk(s,)Qrkπk(s,),πEsπks𝒜]ChϵQ,k,\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}\widehat{Q}_{\omega_{k}}(s,\cdot)-Q_{r_{k}}^{\pi_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}\Bigr{]}\leq C_{h}\cdot{\epsilon}_{Q,k},

where ϵQ,k{\epsilon}_{Q,k} takes the form of

ϵQ,k=𝔼init[Qrkπk(s,a)Q^ωk(s,a)2,ρk].\displaystyle{\epsilon}_{Q,k}=\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}Q_{r_{k}}^{\pi_{k}}(s,a)-\widehat{Q}_{\omega_{k}}(s,a)\big{\|}_{2,\rho_{k}}\Bigr{]}. (C.4)
Proof.

See §D.2 for a detailed proof. ∎

Lemma C.3 (Upper Bound of Term (i.c) in (C.3)).

Under Assumptions (a) and (a), we have that

𝔼init,dE[|log(πk+1s/πks),πksπk+1s𝒜|KL(πk+1sπks)]\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s}),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}-{\mathrm{KL}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})\biggr{]}
=η2(M02+9Bθ2)+𝒪(τk+1Bθ3/2m1/4),\displaystyle\quad=\eta^{2}\cdot(M_{0}^{2}+9B_{\theta}^{2})+\mathcal{O}(\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}),

where M0M_{0} is defined in Assumption (a).

Proof.

See §D.3 for a detailed proof. ∎

Finally, by Lemmas C.1-C.3, under Assumptions 4.2 and 4.3, we obtain from (C.3) that

𝔼init[|Δk(i)|]\displaystyle\mathbb{E}_{{\text{init}}}\bigl{[}|\Delta_{k}^{\text{(i)}}|\bigr{]} =22ChBθ1/2σ1/2N1/4+ChϵQ,k+η(M02+9Bθ2)\displaystyle=2\sqrt{2}C_{h}\cdot B_{\theta}^{1/2}\cdot\sigma^{1/2}\cdot N^{-1/4}+C_{h}\cdot{\epsilon}_{Q,k}+\eta\cdot(M_{0}^{2}+9B_{\theta}^{2})
+𝒪(η1τk+1Bθ3/2m1/4+Bθ5/4m1/8).\displaystyle\quad+\mathcal{O}(\eta^{-1}\cdot\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}+B_{\theta}^{5/4}\cdot m^{-1/8}).

Here M0M_{0} is defined in Assumption (a), τk+1\tau_{k+1} is the inverse temperature parameter of πk+1\pi_{k+1} defined in (3.5), σ\sigma is defined in Assumption 4.3, and ϵQ,k{\epsilon}_{Q,k} is defined in (C.4) of Lemma C.2. Following from Proposition 4.4, we have that

ChϵQ,k=𝒪(Bω3m1/2+Bω5/2m1/4+Bω2exp(CvBω2)).\displaystyle C_{h}\cdot\epsilon_{Q,k}=\mathcal{O}\bigl{(}B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr{)}.

Thus, we complete the proof of Lemma 5.2. ∎

C.3 Proof of Lemma 5.3

Proof.

We consider a fixed βSBβ\beta^{\prime}\in S_{B_{\beta}}. For notational simplicity, we write r=rβ(s,a)r^{\prime}=r_{\beta^{\prime}}(s,a), rk=rk(s,a)r_{k}=r_{k}(s,a) and ϕβ=ϕβ(s,a)\phi_{\beta}=\phi_{\beta}(s,a). By the parameterization of rβ(s,a)r_{\beta}(s,a) in (3.6), we have that

\displaystyle L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})=\langle r^{\prime}-r_{k},\nu_{\text{{\tiny E}}}-\nu_{k}\rangle_{{\mathcal{S}}\times\mathcal{A}}+\lambda\cdot\psi(\beta_{k})-\lambda\cdot\psi(\beta^{\prime})
\displaystyle\quad=(1-\gamma)^{-1}\cdot\Bigl(\bigl\langle\phi_{\beta_{k}}^{\top}(\beta^{\prime}-\beta_{k}),\nu_{\text{{\tiny E}}}-\nu_{k}\bigr\rangle_{{\mathcal{S}}\times\mathcal{A}}+\langle\phi_{\beta^{\prime}}^{\top}\beta^{\prime}-\phi_{\beta_{k}}^{\top}\beta^{\prime},\nu_{\text{{\tiny E}}}-\nu_{k}\rangle_{{\mathcal{S}}\times\mathcal{A}}\Bigr)+\lambda\cdot\bigl(\psi(\beta_{k})-\psi(\beta^{\prime})\bigr)
\displaystyle\quad\leq(\beta^{\prime}-\beta_{k})^{\top}\nabla_{\beta}L(\theta_{k},\beta_{k})+(1-\gamma)^{-1}\cdot\bigl(\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{1,\nu_{k}}+\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{1,\nu_{\text{{\tiny E}}}}\bigr), (C.5)

where the last inequality follows from (3.10) of Proposition 3.1. Then, we have that

𝔼init[L(θk,β)L(θk,βk)]\displaystyle\mathbb{E}_{\text{init}}\bigl{[}L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})\bigr{]}
𝔼init[(ββk)βL(θk,βk)+(1γ)1(ϕβkβϕββ1,νk+ϕβkβϕββ1,νE)]\displaystyle\quad\leq\mathbb{E}_{\text{init}}\Bigl{[}(\beta^{\prime}-\beta_{k})^{\top}\nabla_{\beta}L(\theta_{k},\beta_{k})+(1-\gamma)^{-1}\cdot\bigl{(}\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{1,\nu_{k}}+\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{1,\nu_{\text{{\tiny E}}}}\bigr{)}\Bigr{]}
𝔼init[(ββk)βL(θk,βk)]+𝒪(Bβ3/2m1/4),\displaystyle\quad\leq\mathbb{E}_{\text{init}}\bigl{[}(\beta^{\prime}-\beta_{k})^{\top}\nabla_{\beta}L(\theta_{k},\beta_{k})\bigr{]}+\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4}),

where the last inequality follows from Assumption (a), Lemma A.2, and the fact that β,βkSBβ\beta^{\prime},\beta_{k}\in S_{B_{\beta}}. Thus, we complete the proof of Lemma 5.3. ∎

C.4 Proof of Lemma 5.4

Proof.

By the update of βk\beta_{k} in (3.14), it holds for any βSBβ\beta^{\prime}\in S_{B_{\beta}} that

(βk+η^βL(θk,βk)βk+1)(ββk+1)0,\displaystyle\bigl{(}\beta_{k}+\eta\cdot\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\beta_{k+1}\bigr{)}^{\top}(\beta^{\prime}-\beta_{k+1})\leq 0,

which further implies that

η(ββk)βL(θk,βk)\displaystyle\eta\cdot({\beta^{\prime}-\beta_{k}})^{\top}{\nabla_{\beta}L(\theta_{k},\beta_{k})} βkβ22βk+1β22βk+1βk22\displaystyle\leq\|\beta_{k}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta_{k}\|_{2}^{2} (C.6)
+η((βk+1βk)^βL(θk,βk)+(βkβ)(^βL(θk,βk)βL(θk,βk))).\displaystyle\quad+\eta\cdot\Bigl{(}(\beta_{k+1}-\beta_{k})^{\top}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})+(\beta_{k}-\beta^{\prime})^{\top}\bigl{(}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\nabla_{\beta}L(\theta_{k},\beta_{k})\bigr{)}\Bigr{)}.

Combining (C.5) and (C.6), we have that

η(L(θk,βk)L(θk,β))βkβ22βk+1β22βk+1βk22+ηΔk(ii),\displaystyle\eta\cdot\bigl{(}L(\theta_{k},\beta_{k})-L(\theta_{k},\beta^{\prime})\bigr{)}\leq\|\beta_{k}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta_{k}\|_{2}^{2}+\eta\cdot\Delta_{k}^{\text{(ii)}},

where Δk(ii)\Delta_{k}^{\text{(ii)}} takes the form of

Δk(ii)\displaystyle\Delta_{k}^{\text{(ii)}} =(βk+1βk)^βL(θk,βk)(ii.a)+(βkβ)(^βL(θk,βk)βL(θk,βk))(ii.b)\displaystyle=\underbrace{{(\beta_{k+1}-\beta_{k})}^{\top}{\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})}}_{\displaystyle\text{(ii.a)}}+\underbrace{({\beta_{k}-\beta^{\prime}})^{\top}\bigl{(}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\nabla_{\beta}L(\theta_{k},\beta_{k})\bigr{)}}_{\displaystyle\text{(ii.b)}}
+(1γ)1(ϕβkβϕββ2,νk+ϕβkβϕββ2,νE)(ii.c)\displaystyle\quad+\underbrace{(1-\gamma)^{-1}\cdot\bigl{(}\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{2,\nu_{k}}+\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{2,\nu_{\text{{\tiny E}}}}\bigr{)}}_{\displaystyle\text{(ii.c)}} (C.7)

We now upper bound terms (ii.a), (ii.b), and (ii.c) on the right-hand side of (C.7). Following from Assumption (a) and Lemma A.2, we have that

𝔼init[ϕβkβϕββ2,νk+ϕβkβϕββ2,νE]=𝒪(Bβ3/2m1/4),\displaystyle\mathbb{E}_{\text{init}}\bigl{[}\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{2,\nu_{k}}+\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{2,\nu_{\text{{\tiny E}}}}\bigr{]}=\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4}), (C.8)

which upper bounds term (ii.c) of (C.7). For term (ii.b) of (C.7), we have that

𝔼[|(βkβ)(^βL(θk,βk)βL(θk,βk))|]\displaystyle\mathbb{E}\biggl{[}\Big{|}({\beta_{k}-\beta^{\prime}})^{\top}\bigl{(}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\nabla_{\beta}L(\theta_{k},\beta_{k})\bigr{)}\Big{|}\biggr{]}
𝔼[^βL(θk,βk)βL(θk,βk)2ββk2]2Bβ𝔼[ξk2]2Bβ(σ2/N)1/2,\displaystyle\quad\leq\mathbb{E}\Bigl{[}\big{\|}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\nabla_{\beta}L(\theta_{k},\beta_{k})\big{\|}_{2}\cdot\|\beta^{\prime}-\beta_{k}\|_{2}\Bigr{]}\leq 2B_{\beta}\cdot\mathbb{E}\bigl{[}\|\xi_{k}^{\prime}\|_{2}\bigr{]}\leq 2B_{\beta}\cdot(\sigma^{2}/N)^{1/2}, (C.9)

where we write $\xi_{k}^{\prime}=\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\nabla_{\beta}L(\theta_{k},\beta_{k})$. Here the first inequality follows from the Cauchy-Schwartz inequality, the second inequality follows from the fact that $\beta_{k},\beta^{\prime}\in S_{B_{\beta}}$, and the last inequality follows from Assumption 4.3. To upper bound term (ii.a) in (C.7), we have that

𝔼[|(βk+1βk)^βL(θk,βk)|]\displaystyle\mathbb{E}\Bigl{[}\bigl{|}{(\beta_{k+1}-\beta_{k})}^{\top}{\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})}\bigr{|}\Bigr{]} (C.10)
𝔼[^βL(θk,βk)2βk+1βk2]η𝔼[^βL(θk,βk)22]=2η(βL(θk,βk)22+𝔼[ξk22]),\displaystyle\quad\leq\mathbb{E}\Bigl{[}\big{\|}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})\big{\|}_{2}\cdot\|\beta_{k+1}-\beta_{k}\|_{2}\Bigr{]}\leq\eta\cdot\mathbb{E}\Bigl{[}\big{\|}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})\big{\|}_{2}^{2}\Bigr{]}=2\eta\cdot\Bigl{(}\big{\|}\nabla_{\beta}L(\theta_{k},\beta_{k})\big{\|}^{2}_{2}+\mathbb{E}\bigl{[}\|\xi_{k}^{\prime}\|^{2}_{2}\bigr{]}\Bigr{)},

where the first inequality follows from the Cauchy-Schwartz inequality and the second inequality follows from the update of β\beta in (3.14). Furthermore, we have

\displaystyle\big\|\nabla_{\beta}L(\theta_{k},\beta_{k})\big\|^{2}_{2}=\Big\|\mathbb{E}_{\nu_{k}}\bigl[\phi_{\beta_{k}}(s,a)\bigr]-\mathbb{E}_{\nu_{E}}\bigl[\phi_{\beta_{k}}(s,a)\bigr]+\lambda\cdot\nabla_{\beta}\psi(\beta_{k})\Big\|_{2}^{2}
\displaystyle\quad\leq\biggl(\mathbb{E}_{\nu_{k}}\Bigl[\big\|\phi_{\beta_{k}}(s,a)\big\|_{2}\Bigr]+\mathbb{E}_{\nu_{E}}\Bigl[\big\|\phi_{\beta_{k}}(s,a)\big\|_{2}\Bigr]+\lambda\cdot\big\|\nabla_{\beta}\psi(\beta_{k})\big\|_{2}\biggr)^{2}
\displaystyle\quad\leq(2+\lambda\cdot L_{\psi})^{2}, (C.11)

where the first inequality follows from the triangle inequality and Jensen's inequality, and the second inequality follows from the fact that $\|\phi_{W}(s,a)\|_{2}\leq 1$ and the Lipschitz continuity of $\psi(\beta)$ in Assumption (b). By plugging (C.11) into (C.10), we have that

𝔼[|^βL(θk,βk)(βkβk+1)|]\displaystyle\mathbb{E}\Bigl{[}\bigl{|}{\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})}^{\top}({\beta_{k}-\beta_{k+1}})\bigr{|}\Bigr{]} η((2+λLψ)2+𝔼[ξk22])\displaystyle\leq\eta\cdot\Bigl{(}(2+\lambda\cdot L_{\psi})^{2}+\mathbb{E}\bigl{[}\|\xi_{k}^{\prime}\|^{2}_{2}\bigr{]}\Bigr{)}
η((2+λLψ)2+σ2/N),\displaystyle\leq\eta\cdot\bigl{(}(2+\lambda\cdot L_{\psi})^{2}+\sigma^{2}/N\bigr{)}, (C.12)

where the last inequality follows from Assumption 4.3. Finally, by plugging (C.8), (C.9), and (C.12) into (C.7), we have that

𝔼init[|Δk(ii)|]\displaystyle\mathbb{E}_{\text{init}}\bigl{[}|\Delta_{k}^{\text{(ii)}}|\bigr{]} =η((2+λLψ)2+σ2N1)+2BβσN1/2+𝒪(Bβ3/2m1/4).\displaystyle=\eta\cdot\bigl{(}(2+\lambda\cdot L_{\psi})^{2}+\sigma^{2}\cdot N^{-1}\bigr{)}+2B_{\beta}\cdot\sigma\cdot N^{-1/2}+\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4}).

Thus, we complete the proof of Lemma 5.4. ∎
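The projection step invoked at the beginning of this proof is the update of $\beta$ in (3.14), namely a stochastic gradient step followed by the Euclidean projection onto the ball $S_{B_{\beta}}$ centered at the initialization; the first display of the proof is precisely the first-order optimality condition of that projection. A minimal sketch (with an assumed stochastic gradient oracle grad_hat standing in for $\widehat{\nabla}_{\beta}L(\theta_{k},\cdot)$, and illustrative names throughout) is the following.

import numpy as np

def project_ball(beta, beta0, radius):
    # Euclidean projection onto S_{B_beta} = {beta : ||beta - beta0||_2 <= radius}.
    # Its optimality condition (x - Proj(x))^T (y - Proj(x)) <= 0 for all y in the
    # ball is the inequality used at the start of the proof of Lemma 5.4.
    diff = beta - beta0
    norm = np.linalg.norm(diff)
    return beta if norm <= radius else beta0 + diff * (radius / norm)

def reward_update(beta_k, beta0, grad_hat, eta, B_beta):
    # One projected stochastic gradient ascent step on the reward parameter,
    # mirroring (3.14): beta_{k+1} = Proj_{S_{B_beta}}(beta_k + eta * grad_hat(beta_k)).
    return project_ball(beta_k + eta * grad_hat(beta_k), beta0, B_beta)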

Appendix D Proofs of Supporting Lemmas

In what follows, we present the proofs of the lemmas in §C.

D.1 Proof of Lemma C.1

Proof.

It holds for any policies π,π\pi,\pi^{\prime} that

D(s),πs(π)s𝒜=0,\displaystyle\bigl{\langle}D(s),\pi^{s}-(\pi^{\prime})^{s}\bigr{\rangle}_{\mathcal{A}}=0, (D.1)

where D(s)D(s) only depends on the state ss. Thus, we have that

log(πk+1s/πks)ηQ^ωk(s,),πEsπks𝒜\displaystyle\bigl{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot\widehat{Q}_{\omega_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}
=τk+1ϕθk+1(s,)θk+1τkϕθk(s,)θkηϕωk(s,)ωk,πEsπks𝒜\displaystyle\quad=\bigl{\langle}\tau_{k+1}\cdot\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\tau_{k}\cdot\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k}-\eta\cdot\phi_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}
=τk+1ιθk+1(s,)θk+1τkιθk(s,)θkηιωk(s,)ωk,πEsπks𝒜,\displaystyle\quad=\bigl{\langle}\tau_{k+1}\cdot\iota_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\tau_{k}\cdot\iota_{\theta_{k}}(s,\cdot)^{\top}\theta_{k}-\eta\cdot\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}},

where the first equality follows from the parameterization of $\pi_{\theta}$ and $\widehat{Q}_{\omega}$ in (3.5) and (3.12), respectively, and the second equality follows from (D.1) and the definition of the temperature-adjusted score function $\iota_{\theta}(s,a)$ in (3.8) of Proposition 3.1. Here, with a slight abuse of notation, we define

ιωk(s,a)=ϕωk(s,a)𝔼aπk(|s)[ϕωk(s,a)].\displaystyle\iota_{\omega_{k}}(s,a)=\phi_{\omega_{k}}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi_{k}(\cdot\,|\,s)}\bigl{[}\phi_{\omega_{k}}(s,a^{\prime})\bigr{]}. (D.2)

Then, following from (D.1) and the update τk+1θk+1=τkθkηδk\tau_{k+1}\cdot\theta_{k+1}=\tau_{k}\cdot\theta_{k}-\eta\cdot\delta_{k} in (3.13), we have that

log(πk+1s/πks)ηQ^ωk(s,),πEsπks𝒜\displaystyle\bigl{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot\widehat{Q}_{\omega_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}} (D.3)
=τk+1ιθk+1(s,)θk+1τkιθk(s,)θkηιωk(s,)ωk,πEsπks𝒜\displaystyle\quad=\bigl{\langle}\tau_{k+1}\cdot\iota_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\tau_{k}\cdot\iota_{\theta_{k}}(s,\cdot)^{\top}\theta_{k}-\eta\cdot\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}
=τk+1ιθk+1(s,)θk+1ιθk(s,)θk+1,πEsπks𝒜(i)ηιθk(s,)δk+ιωk(s,)ωk,πEsπks𝒜(ii).\displaystyle\quad=\underbrace{\tau_{k+1}\cdot\big{\langle}\iota_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\iota_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}}_{\displaystyle\text{(i)}}-\underbrace{\eta\cdot\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}}_{\displaystyle\text{(ii)}}.

In what follows, we upper bound terms (i) and (ii) on the right-hand side of (D.3).

Upper bound of term (i) in (D.3). Following from (3.8) of Proposition 3.1 and (D.1) we have that

|ιθk+1(s,)θk+1ιθk(s,)θk+1,πEsπks𝒜|\displaystyle\Bigl{|}\big{\langle}\iota_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\iota_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}
=|ϕθk+1(s,)θk+1ϕθk(s,)θk+1,πEsπks𝒜|\displaystyle\quad=\Bigl{|}\big{\langle}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}
ϕθk+1(s,)θk+1ϕθk(s,)θk+11,πEs+ϕθk+1(s,)θk+1ϕθk(s,)θk+11,πks,\displaystyle\quad\leq\big{\|}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1}\big{\|}_{1,\pi_{\text{{\tiny E}}}^{s}}+\big{\|}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1}\big{\|}_{1,\pi_{k}^{s}}, (D.4)

where the inequality follows from the triangle inequality. Following from Assumption (a) and Lemma A.2, we have that

𝔼init,dE[ϕθk+1(s,)θk+1ϕθk(s,)θk+11,πEs]=𝒪(Bθ3/2m1/4).\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\Bigl{[}\big{\|}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1}\big{\|}_{1,\pi_{\text{{\tiny E}}}^{s}}\Bigr{]}=\mathcal{O}(B_{\theta}^{3/2}\cdot m^{-1/4}). (D.5)

Furthermore, following from Assumption (b), Lemma A.2, and the Cauchy-Schwarz inequality, we have that

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\Bigl{[}\big{\|}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1}\big{\|}_{1,\pi_{k}^{s}}\Bigr{]}
\displaystyle\quad=\mathbb{E}_{{\text{init}},d_{k}}\biggl{[}\big{\|}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1}\big{\|}_{1,\pi_{k}^{s}}\cdot\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}\biggr{]}
\displaystyle\quad\leq\big{\|}\phi_{\theta_{k+1}}(s,a)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,a)^{\top}\theta_{k+1}\big{\|}_{2,\nu_{k}}\cdot\bigg{\|}\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}\bigg{\|}_{2,d_{k}}
\displaystyle\quad=\mathcal{O}(B_{\theta}^{3/2}\cdot m^{-1/4}). (D.6)

Here the expectation $\mathbb{E}_{{\text{init}},d_{k}}$ is taken with respect to the random initialization in (3.3) and $s\sim d_{k}$. Thus, plugging (D.5) and (D.6) into (D.4), we obtain for term (i) of (D.3) that

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\iota_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\iota_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}=\mathcal{O}(B_{\theta}^{3/2}\cdot m^{-1/4}). (D.7)
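The change-of-measure argument in (D.6) is used repeatedly in what follows and amounts to the following elementary bound: for any measurable function $g$ of the state, assuming $d_{\text{{\tiny E}}}$ is absolutely continuous with respect to $d_{k}$ so that the Radon-Nikodym derivative ${\mathrm{d}}d_{\text{{\tiny E}}}/{\mathrm{d}}d_{k}$ exists (which is implicit in the concentrability condition of Assumption 4.1), we have

\displaystyle\mathbb{E}_{d_{\text{{\tiny E}}}}\bigl{[}|g(s)|\bigr{]}=\mathbb{E}_{d_{k}}\biggl{[}|g(s)|\cdot\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}(s)\biggr{]}\leq\|g\|_{2,d_{k}}\cdot\bigg{\|}\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}\bigg{\|}_{2,d_{k}},

where the inequality is the Cauchy-Schwarz inequality under $d_{k}$.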

Upper bound of term (ii) in (D.3). Following from the Cauchy-Schwarz inequality, we have that

\displaystyle\mathbb{E}_{d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}\leq\int_{{\mathcal{S}}\times\mathcal{A}}\big{|}\iota_{\theta_{k}}(s,a)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,a)^{\top}\omega_{k}\big{|}{\mathrm{d}}\nu_{\text{{\tiny E}}}(s,a)
\displaystyle\quad\leq\bigg{\|}\frac{{\mathrm{d}}\nu_{\text{{\tiny E}}}}{{\mathrm{d}}\nu_{k}}\bigg{\|}_{2,\nu_{k}}\cdot\big{\|}\iota_{\theta_{k}}(s,a)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,a)^{\top}\omega_{k}\big{\|}_{2,\nu_{k}}. (D.8)

Similarly, we have that

\displaystyle\mathbb{E}_{d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}\leq\int_{{\mathcal{S}}\times\mathcal{A}}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}{\mathrm{d}}\pi^{s}_{k}(a){\mathrm{d}}d_{\text{{\tiny E}}}(s)
\displaystyle\quad=\int_{{\mathcal{S}}\times\mathcal{A}}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\cdot\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}(s){\mathrm{d}}\nu_{k}(s,a)
\displaystyle\quad\leq\bigg{\|}\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}\bigg{\|}_{2,d_{k}}\cdot\big{\|}\iota_{\theta_{k}}(s,a)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,a)^{\top}\omega_{k}\big{\|}_{2,\nu_{k}}, (D.9)

where the last inequality follows from the Cauchy-Schwarz inequality. Combining (D.8) and (D.9), we obtain for term (ii) of (D.3) that

\displaystyle\mathbb{E}_{d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}
\displaystyle\quad\leq\Biggl{(}\bigg{\|}\frac{{\mathrm{d}}\nu_{\text{{\tiny E}}}}{{\mathrm{d}}\nu_{k}}\bigg{\|}_{2,\nu_{k}}+\bigg{\|}\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}\bigg{\|}_{2,d_{k}}\Biggr{)}\cdot\big{\|}\iota_{\theta_{k}}(s,a)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,a)^{\top}\omega_{k}\big{\|}_{2,\nu_{k}}
\displaystyle\quad\leq C_{h}\cdot\big{\|}\iota_{\theta_{k}}(s,a)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,a)^{\top}\omega_{k}\big{\|}_{2,\nu_{k}}, (D.10)

where the last inequality follows from Assumption 4.1. To upper bound term (ii) of (D.3), it suffices to upper bound the right-hand side of (D.10). For notational simplicity, we write $\iota_{\theta_{k}}=\iota_{\theta_{k}}(s,a)$, $\iota_{\omega_{k}}=\iota_{\omega_{k}}(s,a)$, and $\phi_{\omega_{k}}=\phi_{\omega_{k}}(s,a)$. By the triangle inequality, we have that

\displaystyle\|\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}}\|_{2,\nu_{k}}=\Bigl{(}\mathbb{E}_{\nu_{k}}\bigl{[}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\cdot(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{)}^{1/2}
\displaystyle\quad\leq\underbrace{\Bigl{|}(\delta_{k}-\omega_{k})^{\top}\mathbb{E}_{\nu_{k}}\bigl{[}\iota_{\theta_{k}}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}}_{\displaystyle\text{(ii.a)}}+\underbrace{\Bigl{|}\mathbb{E}_{\nu_{k}}\bigl{[}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\cdot(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}}_{\displaystyle\text{(ii.b)}}. (D.11)
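The two exponents of $1/2$ in (D.11) come from the elementary subadditivity of the square root: for any $a,b\geq 0$,

\displaystyle\bigl{(}\sqrt{a}+\sqrt{b}\bigr{)}^{2}=a+b+2\sqrt{ab}\geq a+b,\qquad\text{so that}\qquad\sqrt{a+b}\leq\sqrt{a}+\sqrt{b},

which is applied to the splitting of the expectation $\mathbb{E}_{\nu_{k}}\bigl{[}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})^{2}\bigr{]}$ into the two terms displayed in (D.11).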

We now upper bound the two terms (ii.a) and (ii.b) on the right-hand side of (D.11). For term (ii.a) of (D.11), following from (3.9) of Proposition 3.1, we have that

\displaystyle\mathcal{I}(\theta_{k})=\tau_{k}^{2}\cdot\mathbb{E}_{\nu_{k}}[\iota_{\theta_{k}}\iota_{\theta_{k}}^{\top}]. (D.12)

Recall that the expectation $\mathbb{E}_{k}$ is taken with respect to the $k$-th batch. Following from the definition of $\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})$ in (3.17), we have that

\displaystyle\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}=-\tau_{k}\cdot\mathbb{E}_{\nu_{k}}[\omega_{k}^{\top}\phi_{\omega_{k}}\cdot\iota_{\theta_{k}}]
\displaystyle\quad=-\tau_{k}\cdot\mathbb{E}_{\nu_{k}}[\omega_{k}^{\top}\iota_{\omega_{k}}\cdot\iota_{\theta_{k}}]-\tau_{k}\cdot\omega_{k}^{\top}\mathbb{E}_{a^{\prime}\sim\pi_{k}^{s}}\bigl{[}\phi_{\omega_{k}}(s,a^{\prime})\bigr{]}\cdot\mathbb{E}_{\nu_{k}}[\iota_{\theta_{k}}]
\displaystyle\quad=-\tau_{k}\cdot\mathbb{E}_{\nu_{k}}[\omega_{k}^{\top}\iota_{\omega_{k}}\cdot\iota_{\theta_{k}}], (D.13)

where the first equality follows from the fact that $\widehat{Q}_{\omega_{k}}(s,a)=\omega_{k}^{\top}\phi_{\omega_{k}}(s,a)$, while the second and third equalities follow from the definition of $\iota_{\omega_{k}}(s,a)$ in (D.2) and the fact that $\mathbb{E}_{\nu_{k}}[\iota_{\theta_{k}}]=0$. Following from (D.12) and (D.13), we have that

\displaystyle\Bigl{|}(\delta_{k}-\omega_{k})^{\top}\mathbb{E}_{\nu_{k}}\bigl{[}\iota_{\theta_{k}}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}=\tau_{k}^{-2}\cdot\biggl{|}(\delta_{k}-\omega_{k})^{\top}\Bigl{(}\mathcal{I}(\theta_{k})\delta_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Bigr{)}\biggr{|}
\displaystyle\quad\leq 2B_{\theta}\cdot\tau_{k}^{-2}\cdot\Big{\|}\mathcal{I}(\theta_{k})\delta_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Big{\|}_{2}. (D.14)

Here the last inequality follows from the Cauchy-Schwarz inequality and the fact that $\|\omega_{k}-\delta_{k}\|_{2}\leq 2B_{\theta}$ as $\omega_{k},\delta_{k}\in S_{B_{\theta}}$. For notational simplicity, we define the following error terms,

\displaystyle\xi_{k}^{(1)}=\widehat{\mathcal{I}}(\theta_{k})\delta_{k}-\mathcal{I}(\theta_{k})\delta_{k}, (D.15)
\displaystyle\xi_{k}^{(2)}=\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})-\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}. (D.16)

Then, we have for term (ii.a) in (D.11) that

\displaystyle\mathbb{E}_{{\text{init}}}\biggl{[}\Bigl{|}(\delta_{k}-\omega_{k})^{\top}\mathbb{E}_{\nu_{k}}\bigl{[}\iota_{\theta_{k}}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}\biggr{]} (D.17)
\displaystyle\quad\leq(2B_{\theta})^{1/2}\cdot\tau_{k}^{-1}\cdot\mathbb{E}_{{\text{init}}}\biggl{[}\Big{\|}\mathcal{I}(\theta_{k})\delta_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Big{\|}_{2}^{1/2}\biggr{]}
\displaystyle\quad\leq(2B_{\theta})^{1/2}\cdot\tau_{k}^{-1}\cdot\mathbb{E}_{\text{init}}\biggl{[}\Bigl{(}\big{\|}\widehat{\mathcal{I}}(\theta_{k})\delta_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2}+\|\xi_{k}^{(1)}\|_{2}+\tau_{k}\cdot\|\xi_{k}^{(2)}\|_{2}\Bigr{)}^{1/2}\biggr{]}
\displaystyle\quad\leq(2B_{\theta})^{1/2}\cdot\tau_{k}^{-1}\cdot\biggl{(}\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}\widehat{\mathcal{I}}(\theta_{k})\delta_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2}\Bigr{]}+\mathbb{E}_{{\text{init}}}\bigl{[}\|\xi_{k}^{(1)}\|_{2}+\tau_{k}\cdot\|\xi_{k}^{(2)}\|_{2}\bigr{]}\biggr{)}^{1/2},

where the first inequality follows from (D.14), the second inequality follows from the triangle inequality, and the last inequality follows from Jensen's inequality. Similarly to (D.15), we define the following error term,

\displaystyle\xi_{k}^{(3)}=\widehat{\mathcal{I}}(\theta_{k})\omega_{k}-\mathcal{I}(\theta_{k})\omega_{k}. (D.18)

We now upper bound the right-hand side of (D.17). Recall the definition of $\delta_{k}$ in (3.15). We have that

\displaystyle\big{\|}\widehat{\mathcal{I}}(\theta_{k})\delta_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2}\leq\big{\|}\widehat{\mathcal{I}}(\theta_{k})\omega_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2} (D.19)
\displaystyle\quad\leq\Big{\|}\mathcal{I}(\theta_{k})\omega_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Big{\|}_{2}+\|\xi_{k}^{(3)}\|_{2}+\tau_{k}\cdot\|\xi_{k}^{(2)}\|_{2}.
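A possible reading of the first inequality in (D.19), under the assumption that $\delta_{k}$ is defined in (3.15) as a minimizer of the fitting error over the same constraint set $S_{B_{\theta}}$ that contains $\omega_{k}$ (the definition in (3.15) is not restated here), is

\displaystyle\big{\|}\widehat{\mathcal{I}}(\theta_{k})\delta_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2}=\min_{\delta\in S_{B_{\theta}}}\big{\|}\widehat{\mathcal{I}}(\theta_{k})\delta-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2}\leq\big{\|}\widehat{\mathcal{I}}(\theta_{k})\omega_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2},

where the inequality uses only that $\omega_{k}\in S_{B_{\theta}}$ is feasible; the second inequality in (D.19) is the triangle inequality together with the definitions of $\xi_{k}^{(2)}$ and $\xi_{k}^{(3)}$ in (D.16) and (D.18).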

Following from (D.12), (D.13), and Jensen's inequality, we have that

\displaystyle\Big{\|}\mathcal{I}(\theta_{k})\omega_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Big{\|}_{2}=\tau_{k}^{2}\cdot\Big{\|}\mathbb{E}_{\nu_{k}}\bigl{[}\iota_{\theta_{k}}\cdot\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\bigr{]}\Big{\|}_{2}
\displaystyle\quad\leq\tau_{k}^{2}\cdot\mathbb{E}_{\nu_{k}}\Bigl{[}\|\iota_{\theta_{k}}\|_{2}\cdot\bigl{|}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\bigr{|}\Bigr{]}
\displaystyle\quad\leq 2\tau_{k}^{2}\cdot\big{\|}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\big{\|}_{1,\nu_{k}},

where the last inequality follows from the fact that $\|\iota_{\theta}\|_{2}\leq 2$ for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$. Following from Assumption (a) and Lemma A.2, we have that

\displaystyle\mathbb{E}_{{\text{init}}}\biggl{[}\Big{\|}\mathcal{I}(\theta_{k})\omega_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Big{\|}_{2}\biggr{]}\leq\mathbb{E}_{{\text{init}}}\Big{[}2\tau_{k}^{2}\cdot\big{\|}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\big{\|}_{1,\nu_{k}}\Big{]}=\mathcal{O}(\tau_{k}^{2}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}). (D.20)

Plugging (D.19) and (D.20) into (D.17), we have that

\displaystyle\mathbb{E}_{{\text{init}}}\biggl{[}\Bigl{|}(\delta_{k}-\omega_{k})^{\top}\mathbb{E}_{\nu_{k}}\bigl{[}\iota_{\theta_{k}}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}\biggr{]}
\displaystyle\quad=(2B_{\theta})^{1/2}\cdot\tau_{k}^{-1}\cdot\Bigl{(}\mathcal{O}(2\tau_{k}^{2}\cdot B_{\theta}^{3/2}\cdot m^{-1/4})+\mathbb{E}_{{\text{init}}}\bigl{[}\|\xi_{k}^{(1)}\|_{2}+2\tau_{k}\cdot\|\xi_{k}^{(2)}\|_{2}+\|\xi_{k}^{(3)}\|_{2}\bigr{]}\Bigr{)}^{1/2}
\displaystyle\quad=\mathcal{O}(\tau_{k}\cdot B_{\theta}^{5/4}\cdot m^{-1/4})+(2B_{\theta})^{1/2}\cdot\tau_{k}^{-1}\cdot\Bigl{(}\mathbb{E}_{{\text{init}}}\bigl{[}\|\xi_{k}^{(1)}\|_{2}+2\tau_{k}\cdot\|\xi_{k}^{(2)}\|_{2}+\|\xi_{k}^{(3)}\|_{2}\bigr{]}\Bigr{)}^{1/2}
\displaystyle\quad\leq\mathcal{O}(\tau_{k}\cdot B_{\theta}^{5/4}\cdot m^{-1/4})+2\sqrt{2}B_{\theta}^{1/2}\cdot(\sigma^{2}/N)^{1/4}, (D.21)

where the last inequality follows from Assumption 4.3. We now upper bound term (ii.b) of (D.11). We have that

\displaystyle\mathbb{E}_{{\text{init}}}\biggl{[}\Bigl{|}\mathbb{E}_{\nu_{k}}\bigl{[}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\cdot(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}\biggr{]}
\displaystyle\quad\leq\mathbb{E}_{{\text{init}},\nu_{k}}\Bigl{[}\bigl{|}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\cdot(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{|}\Bigr{]}^{1/2}
\displaystyle\quad\leq\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}\omega^{\top}_{k}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\big{\|}_{2,\nu_{k}}\Bigr{]}^{1/2}\cdot\mathbb{E}_{{\text{init}}}\bigl{[}\|\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}}\|_{2,\nu_{k}}\bigr{]}^{1/2}, (D.22)

where the expectation $\mathbb{E}_{{\text{init}},\nu_{k}}$ is taken with respect to the random initialization in (3.3) and $(s,a)\sim\nu_{k}$, the first inequality follows from Jensen's inequality, and the second inequality follows from the Cauchy-Schwarz inequality. Following from Assumption (a) and Lemma A.2, we have that

\displaystyle\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}\omega^{\top}_{k}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\big{\|}_{2,\nu_{k}}\Bigr{]}=\mathcal{O}(B_{\theta}^{3/2}\cdot m^{-1/4}). (D.23)

To upper bound the right-hand side of (D.22), it remains to upper bound the term $\mathbb{E}_{{\text{init}}}[\|\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}}\|_{2,\nu_{k}}]$. We have that

\displaystyle\mathbb{E}_{{\text{init}}}\bigl{[}\|\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}}\|_{2,\nu_{k}}\bigr{]}\leq\mathbb{E}_{\text{init}}\bigl{[}\|\delta_{k}\|_{2}\cdot\|\iota_{\theta_{k}}\|_{2}\bigr{]}+\mathbb{E}_{\text{init}}\bigl{[}\|\omega_{k}\|_{2}\cdot\|\iota_{\omega_{k}}\|_{2}\bigr{]}=\mathcal{O}(B_{\theta}), (D.24)

where the inequality follows from the Cauchy-Schwarz inequality and the equality follows from the facts that $\|\iota_{\theta_{k}}\|_{2}\leq 2$, $\|\iota_{\omega_{k}}\|_{2}\leq 2$, and $\delta_{k},\omega_{k}\in S_{B_{\theta}}$. Plugging (D.23) and (D.24) into (D.22), we have that

\displaystyle\mathbb{E}_{{\text{init}}}\biggl{[}\Bigl{|}\mathbb{E}_{\nu_{k}}\bigl{[}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\cdot(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}\biggr{]}=\mathcal{O}(B_{\theta}^{5/4}\cdot m^{-1/8}), (D.25)

which upper bounds term (ii.b) of (D.11). Plugging (D.21) and (D.25) into (D.11) and invoking (D.10), we have that

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}-\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}
\displaystyle\quad=\eta\cdot C_{h}\cdot\bigl{(}\mathcal{O}(B_{\theta}^{5/4}\cdot m^{-1/8})+2\sqrt{2}B_{\theta}^{1/2}\cdot(\sigma^{2}/N)^{1/4}\bigr{)}, (D.26)

which upper bounds term (ii) of (D.3).

Finally, plugging (D.7) and (D.26) into (D.3), we have that

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot\widehat{Q}_{\omega_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}
\displaystyle\quad=\eta\cdot C_{h}\cdot 2\sqrt{2}B_{\theta}^{1/2}\cdot(\sigma^{2}/N)^{1/4}+\mathcal{O}(\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}+\eta\cdot B_{\theta}^{5/4}\cdot m^{-1/8}),

where $\xi_{k}^{(1)}$, $\xi_{k}^{(2)}$, and $\xi^{(3)}_{k}$ are defined in (D.15), (D.16), and (D.18), respectively, and $C_{h}$ is defined in Assumption (b). Thus, we complete the proof of Lemma C.1. ∎

D.2 Proof of Lemma C.2

Proof.

For notational simplicity, for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$, we denote by $\Delta_{Q,k}(s,a)=\widehat{Q}_{\omega_{k}}(s,a)-Q^{\pi_{k}}_{r_{k}}(s,a)$ the error of estimating $Q^{\pi_{k}}_{r_{k}}(s,a)$ by $\widehat{Q}_{\omega_{k}}(s,a)$. Then, we have that

\displaystyle\mathbb{E}_{d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\Delta_{Q,k}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}
\displaystyle\quad\leq\int_{{\mathcal{S}}\times\mathcal{A}}\bigl{|}\Delta_{Q,k}(s,a)\bigr{|}{\mathrm{d}}\pi_{\text{{\tiny E}}}^{s}(a){\mathrm{d}}d_{\text{{\tiny E}}}(s)+\int_{{\mathcal{S}}\times\mathcal{A}}\bigl{|}\Delta_{Q,k}(s,a)\bigr{|}{\mathrm{d}}\pi_{k}^{s}(a){\mathrm{d}}d_{\text{{\tiny E}}}(s)
\displaystyle\quad=\int_{{\mathcal{S}}\times\mathcal{A}}\bigl{|}\Delta_{Q,k}(s,a)\bigr{|}\cdot\frac{{\mathrm{d}}\nu_{\text{{\tiny E}}}}{{\mathrm{d}}\rho_{k}}(s,a){\mathrm{d}}\rho_{k}(s,a)+\int_{{\mathcal{S}}\times\mathcal{A}}\bigl{|}\Delta_{Q,k}(s,a)\bigr{|}\cdot\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}\varrho_{k}}(s){\mathrm{d}}\rho_{k}(s,a)
\displaystyle\quad\leq C_{h}\cdot\|\Delta_{Q,k}\|_{2,\rho_{k}},

where the last inequality follows from the Cauchy-Schwarz inequality and Assumption (b). Thus, we complete the proof of Lemma C.2. ∎

D.3 Proof of Lemma C.3

Proof.

Following from (D.1) and the parameterization of $\pi_{\theta}$ in (3.5), we have that

\displaystyle\big{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s}),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}} (D.27)
\displaystyle\quad=\big{\langle}\tau_{k+1}\cdot\theta_{k+1}^{\top}\phi_{\theta_{k+1}}(s,\cdot)-\tau_{k}\cdot\theta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}
\displaystyle\quad=\big{\langle}(\tau_{k+1}\cdot\theta_{k+1}-\tau_{k}\cdot\theta_{k})^{\top}\phi_{\theta_{k}}(s,\cdot),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}+\tau_{k+1}\cdot\Big{\langle}\theta_{k+1}^{\top}\bigl{(}\phi_{\theta_{k+1}}(s,\cdot)-\phi_{\theta_{k}}(s,\cdot)\bigr{)},\pi_{k}^{s}-\pi_{k+1}^{s}\Big{\rangle}_{\mathcal{A}}.

We now upper bound the two terms on the right-hand side of (D.27). For the first term on the right-hand side of (D.27), recall that we define $\delta_{k}$ in (3.15). Thus, we have that

\displaystyle\bigl{|}(\tau_{k+1}\cdot\theta_{k+1}-\tau_{k}\cdot\theta_{k})^{\top}\phi_{\theta_{k}}(s,a)\bigr{|}=\eta\cdot\bigl{|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,a)\bigr{|}. (D.28)

Following from (D.28) and Hölder's inequality, we have for any $s\in{\mathcal{S}}$ that

\displaystyle\Bigl{|}\big{\langle}(\tau_{k+1}\cdot\theta_{k+1}-\tau_{k}\cdot\theta_{k})^{\top}\phi_{\theta_{k}}(s,\cdot),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}
\displaystyle\quad\leq\eta\cdot\big{\|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot)\big{\|}_{\infty}\cdot\|\pi_{k}^{s}-\pi_{k+1}^{s}\|_{1}.

Then, following from Pinsker’s inequality, we have that

\displaystyle\Bigl{|}\big{\langle}(\tau_{k+1}\cdot\theta_{k+1}-\tau_{k}\cdot\theta_{k})^{\top}\phi_{\theta_{k}}(s,\cdot),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}-{\mathrm{KL}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})
\displaystyle\quad\leq\eta\cdot\big{\|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot)\big{\|}_{\infty}\cdot\|\pi_{k}^{s}-\pi_{k+1}^{s}\|_{1}-1/2\cdot\|\pi_{k}^{s}-\pi_{k+1}^{s}\|_{1}^{2}
\displaystyle\quad\leq 1/2\cdot\eta^{2}\cdot\big{\|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot)\big{\|}_{\infty}^{2}. (D.29)
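The last inequality in (D.29) follows by maximizing the right-hand side of the second line over $t=\|\pi_{k}^{s}-\pi_{k+1}^{s}\|_{1}$: for any $c\geq 0$,

\displaystyle\eta\cdot c\cdot t-1/2\cdot t^{2}\leq\max_{t\in\mathbb{R}}\bigl{(}\eta\cdot c\cdot t-1/2\cdot t^{2}\bigr{)}=1/2\cdot\eta^{2}\cdot c^{2},

with the maximum attained at $t=\eta\cdot c$; here $c=\big{\|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot)\big{\|}_{\infty}$.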

By the update of $\theta_{k}$ in (3.13) and the definition of $\delta_{k}$ in (3.15), we have that $\theta_{k},\delta_{k}\in S_{B_{\theta}}$. Thus, by Lemma A.3, we have that

\displaystyle\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot)\big{\|}_{\infty}^{2}\Bigr{]}\leq 2M_{0}+18B_{\theta}^{2}. (D.30)

Plugging (D.30) into (D.29), we have that

\displaystyle\Bigl{|}\big{\langle}(\tau_{k+1}\cdot\theta_{k+1}-\tau_{k}\cdot\theta_{k})^{\top}\phi_{\theta_{k}}(s,\cdot),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}-{\mathrm{KL}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})\leq\eta^{2}\cdot(M_{0}^{2}+9B_{\theta}^{2}). (D.31)

For the second term on the right-hand side of (D.27), following from Assumption (a) and Lemma A.2, we have

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\Biggl{[}\biggl{|}\Big{\langle}\theta_{k+1}^{\top}\bigl{(}\phi_{\theta_{k+1}}(s,\cdot)-\phi_{\theta_{k}}(s,\cdot)\bigr{)},\pi_{k}^{s}-\pi_{k+1}^{s}\Big{\rangle}_{\mathcal{A}}\biggr{|}\Biggr{]}
\displaystyle\quad\leq\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Big{\|}\theta_{k+1}^{\top}\bigl{(}\phi_{\theta_{k+1}}(s,\cdot)-\phi_{\theta_{k}}(s,\cdot)\bigr{)}\Big{\|}_{1,\pi_{k}^{s}}\biggr{]}+\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Big{\|}\theta_{k+1}^{\top}\bigl{(}\phi_{\theta_{k+1}}(s,\cdot)-\phi_{\theta_{k}}(s,\cdot)\bigr{)}\Big{\|}_{1,\pi_{k+1}^{s}}\biggr{]}
\displaystyle\quad=\mathcal{O}(B_{\theta}^{3/2}\cdot m^{-1/4}). (D.32)

Finally, plugging (D.31) and (D.32) into (D.27), we have that

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s}),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}-{\mathrm{KL}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})\biggr{]}
\displaystyle\quad=\eta^{2}\cdot(M_{0}^{2}+9B_{\theta}^{2})+\mathcal{O}(\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}),

which completes the proof of Lemma C.3. ∎