
Generative Adversarial Imitation Learning with Neural Networks: Global Optimality and Convergence Rate

Yufeng Zhang (Northwestern University; [email protected])   Qi Cai (Northwestern University; [email protected])   Zhuoran Yang (Princeton University; [email protected])   Zhaoran Wang (Northwestern University; [email protected])
Abstract

Generative adversarial imitation learning (GAIL) demonstrates tremendous success in practice, especially when combined with neural networks. In contrast to reinforcement learning, GAIL learns both the policy and the reward function from expert (human) demonstration. Despite its empirical success, it remains unclear whether GAIL with neural networks converges to the globally optimal solution. The major difficulty comes from the nonconvex-nonconcave minimax optimization structure. To bridge the gap between practice and theory, we analyze a gradient-based algorithm with alternating updates and establish its sublinear convergence to the globally optimal solution. To the best of our knowledge, our analysis establishes the global optimality and convergence rate of GAIL with neural networks for the first time.

1 Introduction

The goal of imitation learning (IL) is to learn to perform a task based on expert demonstration (Ho and Ermon, 2016). In contrast to reinforcement learning (RL), the agent only has access to the expert trajectories but not the rewards. The most straightforward approach of IL is behavioral cloning (BC) (Pomerleau, 1991). BC treats IL as the supervised learning problem of predicting the actions based on the states. Despite its simplicity, BC suffers from the compounding errors caused by covariate shift (Ross et al., 2011; Ross and Bagnell, 2010). Another approach of IL is inverse reinforcement learning (IRL) (Russell, 1998; Ng and Russell, 2000; Levine and Koltun, 2012; Finn et al., 2016), which jointly learns the reward function and the corresponding optimal policy. IRL formulates IL as a bilevel optimization problem. Specifically, IRL solves an RL subproblem given a reward function at the inner level and searches for the reward function which makes the expert policy optimal at the outer level. However, IRL is computationally inefficient as it requires fully solving an RL subproblem at each iteration of the outer level. Moreover, the desired reward function may be nonunique. To address such issues of IRL, Ho and Ermon (2016) propose generative adversarial imitation learning (GAIL), which searches for the optimal policy without fully solving an RL subproblem given a reward function at each iteration. GAIL solves IL via minimax optimization with alternating updates. In particular, GAIL alternates between (i) minimizing the discrepancy in expected cumulative reward between the expert policy and the learned policy and (ii) maximizing such a discrepancy over the reward function class. Such an alternating update scheme mirrors the training of generative adversarial networks (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017), where the learned policy acts as the generator while the reward function acts as the discriminator.

Incorporated with neural networks, which parameterize the learned policy and the reward function, GAIL achieves significant empirical success in challenging applications, such as natural language processing (Yu et al., 2016), autonomous driving (Kuefler et al., 2017), human behavior modeling (Merel et al., 2017), and robotics (Tai et al., 2018). Despite its empirical success, GAIL with neural networks remains less understood in theory. The major difficulty arises from the following aspects: (i) GAIL involves minimax optimization, while the existing analysis of policy optimization with neural networks (Anthony and Bartlett, 2009; Liu et al., 2019; Bhandari and Russo, 2019; Wang et al., 2019) only focuses on a minimization or maximization problem. (ii) GAIL with neural networks is nonconvex-nonconcave, and therefore, the existing analysis of convex-concave optimization with alternating updates is not applicable (Nesterov, 2013). There is an emerging body of literature (Rafique et al., 2018; Zhang et al., 2019b) that casts nonconvex-nonconcave optimization as bilevel optimization, where the inner level is solved to a high precision as in IRL. However, such analysis is not applicable to GAIL as it involves alternating updates.

In this paper, we bridge the gap between practice and theory by establishing the global optimality and convergence of GAIL with neural networks. Specifically, we parameterize the learned policy and the reward function with two-layer neural networks and consider solving GAIL by alternately updating the learned policy via a step of natural policy gradient (Kakade, 2002; Peters and Schaal, 2008) and the reward function via a step of gradient ascent. In particular, we parameterize the state-action value function (also known as the Q-function) with a two-layer neural network and apply a variant of the temporal difference algorithm (Sutton and Barto, 2018) to solve the policy evaluation subproblem in natural policy gradient. We prove that the learned policy $\bar{\pi}$ converges to the expert policy $\pi_{\text{E}}$ at a $1/\sqrt{T}$ rate in the $\mathcal{R}$-distance (Chen et al., 2020), which is defined as $\mathbb{D}_{\mathcal{R}}(\pi_{\text{E}},\bar{\pi})=\max_{r\in\mathcal{R}}J(\pi_{\text{E}};r)-J(\bar{\pi};r)$. Here $J(\pi;r)$ is the expected cumulative reward of a policy $\pi$ given a reward function $r(s,a)$ and $\mathcal{R}$ is the reward function class. The core of our analysis is constructing a potential function that tracks the $\mathcal{R}$-distance. Such a rate of convergence implies that the learned policy $\bar{\pi}$ (approximately) outperforms the expert policy $\pi_{\text{E}}$ given any reward function $r\in\mathcal{R}$ within a finite number of iterations $T$. In other words, the learned policy $\bar{\pi}$ is globally optimal. To the best of our knowledge, our analysis establishes the global optimality and convergence of GAIL with neural networks for the first time. It is worth mentioning that our analysis is straightforwardly applicable to linear and tabular settings, which, however, are not our focus.

Related works. Our work extends an emerging body of literature on RL with neural networks (Xu et al., 2019a; Zhang et al., 2019a; Bhandari and Russo, 2019; Liu et al., 2019; Wang et al., 2019; Agarwal et al., 2019) to IL. This line of research analyzes the global optimality and convergence of policy gradient for solving RL, which is a minimization or maximization problem. In contrast, we analyze GAIL, which is a minimax optimization problem.

Our work is also related to the analysis of apprenticeship learning (Syed et al., 2008) and GAIL (Cai et al., 2019a; Chen et al., 2020). Syed et al. (2008) analyze the convergence and generalization of apprenticeship learning. They assume the state space to be finite, and thus, do not require function approximation for the policy and the reward function. In contrast, we assume the state space to be infinite and employ function approximation based on neural networks. Cai et al. (2019a) study the global optimality and convergence of GAIL in the setting of linear-quadratic regulators. In contrast, our analysis handles general MDPs without restrictive assumptions on the transition kernel and the reward function. Chen et al. (2020) study the convergence and generalization of GAIL in the setting of general MDPs. However, they only establish the convergence to a stationary point. In contrast, we establish the global optimality of GAIL.

Notations. Let $[n]=\{1,\ldots,n\}$ for $n\in\mathbb{N}_{+}$ and $[m:n]=\{m,m+1,\ldots,n\}$ for $m<n$. Also, let $N(\mu,\Sigma)$ be the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. We denote by ${\mathcal{P}}({\mathcal{X}})$ the set of all probability measures over the space ${\mathcal{X}}$. For a function $f:{\mathcal{X}}\rightarrow\mathbb{R}$, a constant $p\geq 1$, and a probability measure $\mu\in{\mathcal{P}}({\mathcal{X}})$, we denote by $\|f\|_{p,\mu}=(\int_{\mathcal{X}}|f(x)|^{p}\,{\mathrm{d}}\mu(x))^{1/p}$ the $L_{p}(\mu)$ norm of the function $f$. For any two functions $f,g:{\mathcal{X}}\rightarrow\mathbb{R}$, we denote by $\langle f,g\rangle_{\mathcal{X}}=\int_{\mathcal{X}}f(x)\cdot g(x)\,{\mathrm{d}}x$ the inner product on the space ${\mathcal{X}}$.

2 Background

In this section, we introduce reinforcement learning (RL) and generative adversarial imitation learning (GAIL).

2.1 Reinforcement Learning

We consider a Markov decision process (MDP) $({\mathcal{S}},\mathcal{A},r,P,\rho,\gamma)$. Here ${\mathcal{S}}\subseteq\mathbb{R}^{d_{1}}$ is the state space, $\mathcal{A}\subseteq\mathbb{R}^{d_{2}}$ is the action space, which is assumed to be finite throughout this paper, $r:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function, $P:{\mathcal{S}}\times\mathcal{A}\rightarrow{\mathcal{P}}({\mathcal{S}})$ is the transition kernel, $\rho\in{\mathcal{P}}({\mathcal{S}})$ is the initial state distribution, and $\gamma\in(0,1)$ is the discount factor. Without loss of generality, we assume that ${\mathcal{S}}\times\mathcal{A}$ is compact and that $\|(s,a)\|_{2}\leq 1$ for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}\subseteq\mathbb{R}^{d}$, where $d=d_{1}+d_{2}$. An agent following a policy $\pi:{\mathcal{S}}\rightarrow{\mathcal{P}}(\mathcal{A})$ interacts with the environment in the following manner. At the state $s_{t}\in{\mathcal{S}}$, the agent takes the action $a_{t}\in\mathcal{A}$ with probability $\pi(a_{t}\,|\,s_{t})$ and receives the reward $r_{t}=r(s_{t},a_{t})$. The environment then transitions into the next state $s_{t+1}$ with probability $P(s_{t+1}\,|\,s_{t},a_{t})$. Given a policy $\pi$ and a reward function $r(s,a)$, we define the state-action value function $Q^{\pi}_{r}:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathbb{R}$ as follows,

$$Q^{\pi}_{r}(s,a)=\mathbb{E}_{\pi}\Bigl[(1-\gamma)\cdot\sum_{t=0}^{\infty}\gamma^{t}\cdot r(s_{t},a_{t})\,\Big|\,s_{0}=s,a_{0}=a\Bigr]. \quad (2.1)$$

Here the expectation $\mathbb{E}_{\pi}$ is taken with respect to $a_{t}\sim\pi(\cdot\,|\,s_{t})$ and $s_{t+1}\sim P(\cdot\,|\,s_{t},a_{t})$. Correspondingly, we define the state value function $V^{\pi}_{r}:{\mathcal{S}}\rightarrow\mathbb{R}$ and the advantage function $A^{\pi}_{r}:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathbb{R}$ as follows,

$$V^{\pi}_{r}(s)=\mathbb{E}_{a\sim\pi(\cdot\,|\,s)}\bigl[Q_{r}^{\pi}(s,a)\bigr],\quad A^{\pi}_{r}(s,a)=Q_{r}^{\pi}(s,a)-V^{\pi}_{r}(s).$$

The goal of RL is to maximize the following expected cumulative reward,

$$J(\pi;r)=\mathbb{E}_{s\sim\rho}\bigl[V_{r}^{\pi}(s)\bigr]. \quad (2.2)$$

The policy $\pi$ induces a state visitation measure $d_{\pi}\in{\mathcal{P}}({\mathcal{S}})$ and a state-action visitation measure $\nu_{\pi}\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A})$, which take the forms of

$$d_{\pi}(s)=(1-\gamma)\cdot\sum_{t=0}^{\infty}\gamma^{t}\cdot\mathbb{P}\bigl(s_{t}=s\,\big|\,s_{0}\sim\rho,a_{t}\sim\pi(\cdot\,|\,s_{t})\bigr),\quad\nu_{\pi}(s,a)=d_{\pi}(s)\cdot\pi(a\,|\,s). \quad (2.3)$$

It then holds that $J(\pi;r)=\mathbb{E}_{(s,a)\sim\nu_{\pi}}[r(s,a)]$. Meanwhile, we assume that the policy $\pi$ induces a state stationary distribution $\varrho_{\pi}$ over ${\mathcal{S}}$, which satisfies that

$$\varrho_{\pi}(s)=\mathbb{P}\bigl(s_{t+1}=s\,\big|\,s_{t}\sim\varrho_{\pi},a_{t}\sim\pi(\cdot\,|\,s_{t})\bigr).$$

We denote by $\rho_{\pi}(s,a)=\varrho_{\pi}(s)\cdot\pi(a\,|\,s)$ the state-action stationary distribution over ${\mathcal{S}}\times\mathcal{A}$.
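Since the visitation measure $\nu_{\pi}$ is only accessed through samples in the sequel, it is helpful to keep in mind the Monte Carlo reading of the identity $J(\pi;r)=\mathbb{E}_{(s,a)\sim\nu_{\pi}}[r(s,a)]$. The snippet below is a minimal sketch; `env`, `policy`, and `reward_fn` are hypothetical interfaces that do not appear in the paper.

def estimate_return(env, policy, reward_fn, gamma, n_rollouts=100, horizon=200):
    """Monte Carlo estimate of J(pi; r) = E_{(s,a) ~ nu_pi}[r(s, a)] as in (2.2).

    Hypothetical interfaces (not from the paper):
      env.reset() -> initial state s_0 ~ rho; env.step(a) -> (next_state, done)
      policy(s)   -> action sampled from pi(. | s)
      reward_fn(s, a) -> scalar reward r(s, a)
    """
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset()
        for t in range(horizon):
            a = policy(s)
            # The (1 - gamma) * gamma^t weights match the normalization in (2.1) and (2.3).
            total += (1.0 - gamma) * (gamma ** t) * reward_fn(s, a)
            s, done = env.step(a)
            if done:
                break
    return total / n_rollouts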

2.2 Generative Adversarial Imitation Learning

The goal of imitation learning (IL) is to learn a policy that outperforms the expert policy $\pi_{\text{E}}$ based on the expert trajectory $\{(s_{i}^{\text{E}},a_{i}^{\text{E}})\}_{i\in[T_{\text{E}}]}$. We denote by $\nu_{\text{E}}=\nu_{\pi_{\text{E}}}$ and $d_{\text{E}}=d_{\pi_{\text{E}}}$ the state-action and state visitation measures induced by the expert policy, respectively, and assume that the expert trajectory $\{(s_{i}^{\text{E}},a_{i}^{\text{E}})\}_{i\in[T_{\text{E}}]}$ is drawn from $\nu_{\text{E}}$. To this end, we parameterize the policy and the reward function by $\pi_{\theta}$ for $\theta\in{\mathcal{X}}_{\Pi}$ and $r_{\beta}(s,a)$ for $\beta\in{\mathcal{X}}_{R}$, respectively, and solve the following minimax optimization problem known as GAIL (Ho and Ermon, 2016),

$$\min_{\theta\in{\mathcal{X}}_{\Pi}}\max_{\beta\in{\mathcal{X}}_{R}}L(\theta,\beta),\quad\text{where }L(\theta,\beta)=J(\pi_{\text{E}};r_{\beta})-J(\pi_{\theta};r_{\beta})-\lambda\cdot\psi(\beta). \quad (2.4)$$

Here $J(\pi;r)$ is the expected cumulative reward defined in (2.2), $\psi:{\mathcal{X}}_{R}\rightarrow\mathbb{R}_{+}$ is the regularizer, and $\lambda\geq 0$ is the regularization parameter. Given a reward function class $\mathcal{R}$, we define the $\mathcal{R}$-distance between two policies $\pi_{1}$ and $\pi_{2}$ as follows,

$$\mathbb{D}_{\mathcal{R}}(\pi_{1},\pi_{2})=\max_{r\in\mathcal{R}}J(\pi_{1};r)-J(\pi_{2};r)=\max_{r\in\mathcal{R}}\mathbb{E}_{\nu_{\pi_{1}}}\bigl[r(s,a)\bigr]-\mathbb{E}_{\nu_{\pi_{2}}}\bigl[r(s,a)\bigr]. \quad (2.5)$$

When $\mathcal{R}$ is the class of $1$-Lipschitz functions, $\mathbb{D}_{\mathcal{R}}(\pi_{1},\pi_{2})$ is the Wasserstein-$1$ metric between the state-action visitation measures induced by $\pi_{1}$ and $\pi_{2}$. However, $\mathbb{D}_{\mathcal{R}}(\pi_{1},\pi_{2})$ is not a metric in general. When $\mathbb{D}_{\mathcal{R}}(\pi_{1},\pi_{2})\leq 0$, the policy $\pi_{2}$ outperforms the policy $\pi_{1}$ for any reward function $r\in\mathcal{R}$. Such a notion of $\mathcal{R}$-distance is used in Chen et al. (2020). We denote by $\mathcal{R}_{\beta}=\{r_{\beta}(s,a)\,|\,\beta\in{\mathcal{X}}_{R}\}$ the reward function class parameterized with $\beta$. Hence, the optimization problem in (2.4) minimizes the $\mathcal{R}_{\beta}$-distance $\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\pi_{\theta})$ (up to the regularizer $\lambda\cdot\psi(\beta)$), which searches for a policy $\bar{\pi}$ that (approximately) outperforms the expert policy given any reward function $r_{\beta}\in\mathcal{R}_{\beta}$.
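As a concrete reading of (2.5), the $\mathcal{R}$-distance can be estimated from samples of the two visitation measures by maximizing the empirical reward gap over the reward class. The following sketch assumes a finite collection of candidate reward functions; the names `reward_class`, `samples_pi1`, and `samples_pi2` are illustrative and not from the paper.

def empirical_r_distance(reward_class, samples_pi1, samples_pi2):
    """Empirical version of D_R(pi_1, pi_2) in (2.5).

    reward_class: iterable of candidate reward functions r(s, a) (hypothetical input),
    samples_pi1 / samples_pi2: lists of (s, a) pairs drawn from nu_{pi_1} / nu_{pi_2}.
    """
    best = float("-inf")
    for r in reward_class:
        gap = (sum(r(s, a) for s, a in samples_pi1) / len(samples_pi1)
               - sum(r(s, a) for s, a in samples_pi2) / len(samples_pi2))
        best = max(best, gap)
    return best

In GAIL, the maximization over a finite collection is replaced by gradient ascent over the reward parameter $\beta$, which is the role of the update (3.14) below.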

3 Algorithm

In this section, we introduce an algorithm with alternating updates for GAIL with neural networks, which employs natural policy gradient (NPG) to update the policy $\pi_{\theta}$ and gradient ascent to update the reward function $r_{\beta}(s,a)$.

3.1 Parameterization with Neural Networks

We define a two-layer neural network with rectified linear units (ReLU) as follows,

$$u_{W,b}(s,a)=\frac{1}{\sqrt{m}}\sum_{l=1}^{m}b_{l}\cdot\mathds{1}\bigl\{(s,a)^{\top}[W]_{l}>0\bigr\}\cdot(s,a)^{\top}[W]_{l}=\sum_{l=1}^{m}\bigl[\phi_{W,b}(s,a)\bigr]_{l}^{\top}[W]_{l}. \quad (3.1)$$

Here $m\in\mathbb{N}_{+}$ is the width of the neural network, $b=(b_{1},\ldots,b_{m})^{\top}\in\mathbb{R}^{m}$ and $W=([W]_{1}^{\top},\ldots,[W]_{m}^{\top})^{\top}\in\mathbb{R}^{md}$ are the parameters, and $\phi_{W,b}(s,a)=([\phi_{W,b}(s,a)]^{\top}_{1},\ldots,[\phi_{W,b}(s,a)]^{\top}_{m})^{\top}\in\mathbb{R}^{md}$ is called the feature vector in the sequel, where

$$\bigl[\phi_{W,b}(s,a)\bigr]_{l}=m^{-1/2}\cdot b_{l}\cdot\mathds{1}\bigl\{(s,a)^{\top}[W]_{l}>0\bigr\}\cdot(s,a). \quad (3.2)$$

It then holds that $u_{W,b}(s,a)=W^{\top}\phi_{W,b}(s,a)$. Note that the feature vector $\phi_{W,b}(s,a)$ depends on the parameters $W$ and $b$. We consider the following random initialization,

$$b_{l}\overset{\text{i.i.d.}}{\sim}{\text{Unif}}\bigl(\{-1,1\}\bigr),\quad[W_{0}]_{l}\overset{\text{i.i.d.}}{\sim}N(0,I_{d}/d),\quad\forall l\in[m]. \quad (3.3)$$

Throughout the training process, we keep the parameter $b$ fixed while updating $W$. For notational simplicity, we write $u_{W,b}(s,a)$ as $u_{W}(s,a)$ and $\phi_{W,b}(s,a)$ as $\phi_{W}(s,a)$ in the sequel. We denote by $\mathbb{E}_{\text{init}}$ the expectation taken with respect to the random initialization in (3.3). For an absolute constant $B>0$, we define the parameter domain as

$$S_{B}=\bigl\{W\in\mathbb{R}^{md}\,\big|\,\|W-W_{0}\|_{2}\leq B\bigr\}, \quad (3.4)$$

which is the ball centered at $W_{0}$ with the domain radius $B$.
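For concreteness, the parameterization (3.1)-(3.4) can be written in a few lines of NumPy. The sketch below follows the definitions above; the function names are ours rather than the paper's.

import numpy as np

def init_network(m, d, rng):
    """Random initialization (3.3): b_l ~ Unif{-1, 1}, [W_0]_l ~ N(0, I_d / d)."""
    b = rng.choice([-1.0, 1.0], size=m)
    W0 = rng.normal(0.0, np.sqrt(1.0 / d), size=(m, d))
    return W0, b

def feature(W, b, x):
    """Feature vector phi_{W,b}(s, a) in (3.2); x = (s, a) with ||x||_2 <= 1, output shape (m, d)."""
    act = (W @ x > 0).astype(float)                      # indicator 1{(s,a)^T [W]_l > 0}
    return (b * act)[:, None] * x[None, :] / np.sqrt(W.shape[0])

def u(W, b, x):
    """Two-layer ReLU network u_{W,b}(s, a) = W^T phi_{W,b}(s, a) in (3.1)."""
    return float(np.sum(W * feature(W, b, x)))

def project(W, W0, B):
    """Projection onto the parameter domain S_B in (3.4) (Frobenius norm = flattened l2 norm)."""
    diff = W - W0
    norm = np.linalg.norm(diff)
    return W if norm <= B else W0 + diff * (B / norm)

For instance, `W0, b = init_network(m=256, d=5, rng=np.random.default_rng(0))` initializes a width-$256$ network; keeping `b` fixed and updating `W` within `project(., W0, B)` mirrors how $\theta$, $\beta$, and $\omega$ are constrained to $S_{B_{\theta}}$, $S_{B_{\beta}}$, and $S_{B_{\omega}}$ in Algorithm 1.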

In the sequel, we consider the following energy-based policy,

$$\pi_{\theta}(a\,|\,s)=\frac{\exp\bigl(\tau\cdot u_{\theta}(s,a)\bigr)}{\sum_{a^{\prime}\in\mathcal{A}}\exp\bigl(\tau\cdot u_{\theta}(s,a^{\prime})\bigr)}, \quad (3.5)$$

where $\tau\geq 0$ is the inverse temperature parameter and $u_{\theta}(s,a)$ is the energy function parameterized by the neural network defined in (3.1) with $W=\theta$. In the sequel, we call $\theta$ the policy parameter. Meanwhile, we parameterize the reward function $r_{\beta}(s,a)$ as follows,

$$r_{\beta}(s,a)=(1-\gamma)^{-1}\cdot u_{\beta}(s,a), \quad (3.6)$$

where $u_{\beta}(s,a)$ is the neural network defined in (3.1) with $W=\beta$ and $\gamma$ is the discount factor. Here we use the scaling parameter $(1-\gamma)^{-1}$ to ensure that for any policy $\pi$ the state-action value function $Q^{\pi}_{r_{\beta}}(s,a)$ defined in (2.1) is well approximated by the neural network defined in (3.1). In the sequel, we call $\beta$ the reward parameter and define the reward function class as

$$\mathcal{R}_{\beta}=\bigl\{r_{\beta}(s,a)\,\big|\,\beta\in S_{B_{\beta}}\bigr\},$$

where $S_{B_{\beta}}$ is the parameter domain of $\beta$ defined in (3.4) with domain radius $B_{\beta}$.
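Given the network above, the energy-based policy (3.5) and the reward parameterization (3.6) admit the following direct translation; this is again a sketch, where `u` is the network function from the previous snippet and `actions` is the finite action set, both treated as given.

import numpy as np

def policy_probs(theta, b, s, actions, tau):
    """Energy-based policy pi_theta(. | s) in (3.5) over a finite action set."""
    energies = np.array([tau * u(theta, b, np.concatenate([s, a])) for a in actions])
    energies -= energies.max()        # numerical stability; does not change the softmax
    p = np.exp(energies)
    return p / p.sum()

def reward(beta, b, s, a, gamma):
    """Reward r_beta(s, a) = (1 - gamma)^{-1} * u_beta(s, a) in (3.6)."""
    return u(beta, b, np.concatenate([s, a])) / (1.0 - gamma)

Note that with $\tau_{0}=0$ the softmax above recovers the uniform policy used at initialization in Algorithm 1.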

To facilitate algorithm design, we establish the following proposition, which calculates the explicit expressions of the gradients $\nabla L(\theta,\beta)$ and the Fisher information $\mathcal{I}(\theta)$. Recall that the Fisher information is defined as

$$\mathcal{I}(\theta)=\mathbb{E}_{(s,a)\sim\nu_{\pi_{\theta}}}\bigl[\nabla_{\theta}\log\pi_{\theta}(s,a)\,\nabla_{\theta}\log\pi_{\theta}(s,a)^{\top}\bigr]. \quad (3.7)$$
Proposition 3.1 (Gradients and Fisher Information).

We call $\iota_{\theta}(s,a)=\tau^{-1}\cdot\nabla_{\theta}\log\pi_{\theta}(a\,|\,s)$ the temperature-adjusted score function. It holds that

$$\iota_{\theta}(s,a)=\phi_{\theta}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi_{\theta}(\cdot\,|\,s)}\bigl[\phi_{\theta}(s,a^{\prime})\bigr]. \quad (3.8)$$

It then holds that

$$\mathcal{I}(\theta)=\tau^{2}\cdot\mathbb{E}_{(s,a)\sim\nu_{\pi_{\theta}}}\bigl[\iota_{\theta}(s,a)\,\iota_{\theta}(s,a)^{\top}\bigr], \quad (3.9)$$
$$\nabla_{\theta}L(\theta,\beta)=-\tau\cdot\mathbb{E}_{(s,a)\sim\nu_{\pi_{\theta}}}\bigl[Q^{\pi_{\theta}}_{r_{\beta}}(s,a)\cdot\iota_{\theta}(s,a)\bigr], \quad (3.10)$$
$$\nabla_{\beta}L(\theta,\beta)=(1-\gamma)^{-1}\cdot\mathbb{E}_{(s,a)\sim\nu_{\text{E}}}\bigl[\phi_{\beta}(s,a)\bigr]-(1-\gamma)^{-1}\cdot\mathbb{E}_{(s,a)\sim\nu_{\pi_{\theta}}}\bigl[\phi_{\beta}(s,a)\bigr]-\lambda\cdot\nabla_{\beta}\psi(\beta), \quad (3.11)$$

where $Q^{\pi_{\theta}}_{r_{\beta}}(s,a)$ is the state-action value function defined in (2.1) with $\pi=\pi_{\theta}$ and $r=r_{\beta}$, $\nu_{\pi_{\theta}}$ is the state-action visitation measure defined in (2.3) with $\pi=\pi_{\theta}$, and $\mathcal{I}(\theta)$ is the Fisher information defined in (3.7).

Proof.

See §C.1 for a detailed proof. ∎

Note that the expression of the policy gradient $\nabla_{\theta}L(\theta,\beta)$ in (3.10) of Proposition 3.1 involves the state-action value function $Q^{\pi_{\theta}}_{r_{\beta}}(s,a)$. To this end, we estimate the state-action value function $Q_{r}^{\pi}(s,a)$ by $\widehat{Q}_{\omega}(s,a)$, which is parameterized as follows,

$$\widehat{Q}_{\omega}(s,a)=u_{\omega}(s,a). \quad (3.12)$$

Here $u_{\omega}(s,a)$ is the neural network defined in (3.1) with $W=\omega$. In the sequel, we call $\omega$ the value parameter.

3.2 GAIL with Alternating Updates

We employ an actor-critic scheme with alternating updates of the policy and the reward function, which is presented in Algorithm 1. Specifically, in the actor step, we update the policy parameter $\theta$ via natural policy gradient and update the reward parameter $\beta$ via gradient ascent, while in the critic step, we estimate the state-action value function $Q^{\pi}_{r}(s,a)$ via neural temporal difference learning (TD) (Cai et al., 2019c).

Actor Step. In the $k$-th actor step, we update the policy parameter $\theta$ and the reward parameter $\beta$ as follows,

$$\theta_{k+1}=\tau_{k+1}^{-1}\cdot(\tau_{k}\cdot\theta_{k}-\eta\cdot\delta_{k}), \quad (3.13)$$
$$\beta_{k+1}=\text{Proj}_{S_{B_{\beta}}}\bigl\{\beta_{k}+\eta\cdot\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})\bigr\}, \quad (3.14)$$

where

$$\tau_{k+1}=\eta+\tau_{k},\quad\delta_{k}\in\mathop{\mathrm{argmin}}_{\delta\in S_{B_{\theta}}}\bigl\|\widehat{\mathcal{I}}(\theta_{k})\delta-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr\|_{2}. \quad (3.15)$$

Here $\eta>0$ is the stepsize, $S_{B_{\theta}}$ and $S_{B_{\beta}}$ are the parameter domains of $\theta$ and $\beta$ defined in (3.4) with domain radii $B_{\theta}$ and $B_{\beta}$, respectively, $\text{Proj}_{S_{B_{\beta}}}:\mathbb{R}^{md}\rightarrow S_{B_{\beta}}$ is the projection operator, $\tau_{k}$ is the inverse temperature parameter of $\pi_{\theta_{k}}$, and $\widehat{\mathcal{I}}(\theta_{k}),\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k}),\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})$ are the estimators of $\mathcal{I}(\theta_{k}),\nabla_{\theta}L(\theta_{k},\beta_{k}),\nabla_{\beta}L(\theta_{k},\beta_{k})$, respectively, which are defined in the sequel. In (3.13), we update the policy parameter $\theta_{k}$ along the direction $\delta_{k}$, which approximates the natural policy gradient $\mathcal{I}(\theta)^{-1}\cdot\nabla_{\theta}L(\theta,\beta)$, and in (3.15) we update the inverse temperature parameter $\tau_{k}$ to ensure that $\theta_{k+1}\in S_{B_{\theta}}$. Meanwhile, in (3.14), we update the reward parameter $\beta$ via (projected) gradient ascent. Following from (3.9) and (3.10) of Proposition 3.1, we construct the estimators of $\mathcal{I}(\theta_{k})$ and $\nabla_{\theta}L(\theta_{k},\beta_{k})$ as follows,

$$\widehat{\mathcal{I}}(\theta_{k})=\frac{\tau_{k}^{2}}{N}\sum_{i=1}^{N}\iota_{\theta_{k}}(s_{i},a_{i})\,\iota_{\theta_{k}}(s_{i},a_{i})^{\top}, \quad (3.16)$$
$$\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})=-\frac{\tau_{k}}{N}\sum_{i=1}^{N}\widehat{Q}_{\omega_{k}}(s_{i},a_{i})\cdot\iota_{\theta_{k}}(s_{i},a_{i}), \quad (3.17)$$

where $\{(s_{i},a_{i})\}_{i\in[N]}$ is sampled from the state-action visitation measure $\nu_{\pi_{\theta_{k}}}$ given $\theta_{k}$ with the batch size $N$, and $\widehat{Q}_{\omega_{k}}(s,a)$ is the estimator of $Q^{\pi_{\theta_{k}}}_{r_{\beta_{k}}}(s,a)$ computed in the critic step. Meanwhile, following from (3.11) of Proposition 3.1, we construct the estimator of $\nabla_{\beta}L(\theta_{k},\beta_{k})$ as follows,

$$\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})=\frac{1}{N\cdot(1-\gamma)}\sum_{i=1}^{N}\bigl[\phi_{\beta_{k}}(s_{i}^{\text{E}},a_{i}^{\text{E}})-\phi_{\beta_{k}}(s_{i},a_{i})\bigr]-\lambda\cdot\nabla_{\beta}\psi(\beta_{k}), \quad (3.18)$$

where $\{(s_{i}^{\text{E}},a_{i}^{\text{E}})\}_{i\in[N]}$ are drawn from the expert trajectory. For notational simplicity, we write $\pi_{k}=\pi_{\theta_{k}}$, $r_{k}(s,a)=r_{\beta_{k}}(s,a)$, $d_{k}=d_{\pi_{k}}$, and $\nu_{k}=\nu_{\pi_{k}}$ for the $k$-th step hereafter, where $\pi_{\theta}$ is the policy, $r_{\beta}(s,a)$ is the reward function, and $d_{\pi},\nu_{\pi}$ are the visitation measures defined in (2.3).
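To make the actor step concrete, the following sketch assembles (3.13)-(3.18) from a sampled batch. It is a simplified illustration rather than the exact procedure: in particular, $\delta_{k}$ is obtained by a projected least-squares solve instead of the constrained minimization in (3.15). Here `score`, `q_hat`, `feat_beta`, and `grad_psi` are hypothetical callables standing in for $\iota_{\theta_{k}}$, $\widehat{Q}_{\omega_{k}}$, $\phi_{\beta_{k}}$, and $\nabla_{\beta}\psi$, and `project` is reused from the network sketch in §3.1.

import numpy as np

def actor_step(theta, beta, tau, eta, gamma, lam, batch, expert_batch,
               score, q_hat, feat_beta, grad_psi, W0, B_theta, B_beta):
    """One actor step of Algorithm 1; theta, beta, W0 are flattened parameter arrays."""
    N = len(batch)
    iotas = np.stack([score(s, a) for s, a in batch])                    # iota_{theta_k}, shape (N, m*d)
    I_hat = tau ** 2 * iotas.T @ iotas / N                               # (3.16)
    g_theta = -tau * np.mean([q_hat(s, a) * score(s, a) for s, a in batch], axis=0)   # (3.17)

    delta, *_ = np.linalg.lstsq(I_hat, tau * g_theta, rcond=None)        # surrogate for (3.15)
    delta = project(delta, W0, B_theta)

    tau_next = tau + eta                                                 # (3.15)
    theta_next = (tau * theta - eta * delta) / tau_next                  # (3.13)

    g_beta = ((np.mean([feat_beta(s, a) for s, a in expert_batch], axis=0)
               - np.mean([feat_beta(s, a) for s, a in batch], axis=0)) / (1.0 - gamma)
              - lam * grad_psi(beta))                                    # (3.18)
    beta_next = project(beta + eta * g_beta, W0, B_beta)                 # (3.14)
    return theta_next, beta_next, tau_next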

Critic Step. Note that the estimator $\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})$ in (3.17) involves the estimator $\widehat{Q}_{\omega_{k}}(s,a)$ of $Q^{\pi_{k}}_{r_{k}}(s,a)$. To this end, we parameterize $\widehat{Q}_{\omega}(s,a)$ as in (3.12) and adapt neural TD (Cai et al., 2019c), which solves the following minimization problem,

$$\omega_{k}=\mathop{\mathrm{argmin}}_{\omega\in S_{B_{\omega}}}\mathbb{E}_{(s,a)\sim\rho_{k}}\bigl[\widehat{Q}_{\omega}(s,a)-{\mathcal{T}}^{\pi_{k}}_{r_{k}}\widehat{Q}_{\omega}(s,a)\bigr]^{2}. \quad (3.19)$$

Here $S_{B_{\omega}}$ is the parameter domain with domain radius $B_{\omega}$, $\rho_{k}$ is the state-action stationary distribution induced by $\pi_{k}$, and ${\mathcal{T}}^{\pi_{k}}_{r_{k}}$ is the Bellman operator. Note that the Bellman operator ${\mathcal{T}}^{\pi}_{r}$ is defined as follows,

$${\mathcal{T}}^{\pi}_{r}Q(s,a)=(1-\gamma)\cdot r(s,a)+\gamma\cdot\mathbb{E}_{\pi}\bigl[Q(s^{\prime},a^{\prime})\,\big|\,s,a\bigr],$$

where the expectation is taken with respect to $s^{\prime}\sim P(\cdot\,|\,s,a)$ and $a^{\prime}\sim\pi(\cdot\,|\,s^{\prime})$. In neural TD, we iteratively update the value parameter $\omega$ via

$$\delta(j)=Q_{\omega(j)}(s,a)-r(s,a)-\gamma\cdot Q_{\omega(j)}(s^{\prime},a^{\prime}),$$
$$\omega(j+1)=\text{Proj}_{S_{B_{\omega}}}\bigl\{\omega(j)-\alpha\cdot\delta(j)\cdot\nabla_{\omega}Q_{\omega(j)}(s,a)\bigr\}, \quad (3.20)$$

where $\delta(j)$ is the Bellman residual, $\alpha>0$ is the stepsize, $(s,a)$ is sampled from the state-action stationary distribution $\rho_{k}$, and $s^{\prime}\sim P(\cdot\,|\,s,a),a^{\prime}\sim\pi_{k}(\cdot\,|\,s^{\prime})$ are the subsequent state and action. We defer the detailed discussion of neural TD to §B.
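The critic step can likewise be sketched as the projected stochastic update (3.20). Below, `sample_transition` is a hypothetical sampler that returns $(s,a,r,s^{\prime},a^{\prime})$ with $(s,a)\sim\rho_{k}$, $s^{\prime}\sim P(\cdot\,|\,s,a)$, and $a^{\prime}\sim\pi_{k}(\cdot\,|\,s^{\prime})$, with $r$ the reward fed to the residual in (3.20); `u`, `feature`, and `project` are reused from the network sketch in §3.1, using that $\nabla_{\omega}Q_{\omega}(s,a)=\phi_{\omega}(s,a)$ almost everywhere for the two-layer ReLU network.

import numpy as np

def neural_td(W0, b, B_omega, gamma, alpha, T_td, sample_transition):
    """Sketch of the neural TD critic step, iterating the update (3.20)."""
    omega = W0.copy()
    for _ in range(T_td):
        s, a, r, s_next, a_next = sample_transition()
        x, x_next = np.concatenate([s, a]), np.concatenate([s_next, a_next])
        delta = u(omega, b, x) - r - gamma * u(omega, b, x_next)   # Bellman residual delta(j)
        omega = project(omega - alpha * delta * feature(omega, b, x), W0, B_omega)
    return omega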

To approximately obtain the compatible function approximation (Sutton et al., 2000; Wang et al., 2019), we share the random initialization among the policy $\pi_{\theta}$, the reward function $r_{\beta}(s,a)$, and the state-action value function $\widehat{Q}_{\omega}(s,a)$. In other words, we set $\theta_{0}=\beta_{0}=\omega(0)=W_{0}$ in our algorithm, where $W_{0}$ is the random initialization in (3.3). The output of GAIL is the mixed policy $\bar{\pi}$ (Altman, 1999). Specifically, the mixed policy $\bar{\pi}$ of $\pi_{0},\ldots,\pi_{T-1}$ is executed by randomly selecting a policy $\pi_{k}$ for $k\in[0:T-1]$ with equal probability before time $t=0$ and exclusively following $\pi_{k}$ thereafter. It then holds for any reward function $r(s,a)$ that

$$J(\bar{\pi};r)=\frac{1}{T}\sum_{k=0}^{T-1}J(\pi_{k};r). \quad (3.21)$$
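Executing the mixed policy $\bar{\pi}$ amounts to drawing one index uniformly before the episode starts and committing to it, which is what makes the identity (3.21) hold; a two-line sketch:

import random

def sample_mixed_policy(policies, rng=random):
    """Mixed policy of pi_0, ..., pi_{T-1}: pick one uniformly before t = 0, then follow it."""
    return rng.choice(policies)   # use the returned policy for every step of the episode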
Algorithm 1 GAIL
Input: Expert trajectory $\{(s_{i}^{\text{E}},a_{i}^{\text{E}})\}_{i\in[T_{\text{E}}]}$, number of iterations $T$, number of iterations $T_{\text{TD}}$ of neural TD, stepsize $\eta$, stepsize $\alpha$ of neural TD, batch size $N$, and domain radii $B_{\theta},B_{\omega},B_{\beta}$.
1:  Initialization. Initialize $b_{l}\sim{\text{Unif}}(\{-1,1\})$ and $[W_{0}]_{l}\sim N(0,I_{d}/d)$ for any $l\in[m]$, and set $\tau_{0}\leftarrow 0$, $\theta_{0}\leftarrow W_{0}$, and $\beta_{0}\leftarrow W_{0}$.
2:  for $k=0,1,\ldots,T-1$ do
3:     Update the value parameter $\omega_{k}$ via Algorithm 2 with $\pi_{k}$, $r_{k}$, $W_{0}$, $b$, $T_{\mathrm{TD}}$, and $\alpha$ as the input.
4:     Sample $\{(s_{i},a_{i})\}_{i=1}^{N}$ from the state-action visitation measure $\nu_{k}$, and estimate $\widehat{\mathcal{I}}(\theta_{k})$, $\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})$, and $\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})$ via (3.16), (3.17), and (3.18), respectively.
5:     Solve $\delta_{k}\leftarrow\mathop{\mathrm{argmin}}_{\delta\in S_{B_{\theta}}}\|\widehat{\mathcal{I}}(\theta_{k})\cdot\delta-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\|_{2}$ and set $\tau_{k+1}\leftarrow\tau_{k}+\eta$.
6:     Update the policy parameter $\theta$ via $\theta_{k+1}\leftarrow\tau_{k+1}^{-1}\cdot(\tau_{k}\cdot\theta_{k}-\eta\cdot\delta_{k})$.
7:     Update the reward parameter $\beta$ via $\beta_{k+1}\leftarrow\text{Proj}_{S_{B_{\beta}}}\{\beta_{k}+\eta\cdot\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})\}$.
8:  end for
Output: Mixed policy $\bar{\pi}$ of $\pi_{0},\ldots,\pi_{T-1}$.

4 Main Results

In this section, we first present the assumptions for our analysis. Then, we establish the global optimality and convergence of Algorithm 1.

4.1 Assumptions

We impose the following assumptions on the stationary distributions $\varrho_{k}\in{\mathcal{P}}({\mathcal{S}}),\rho_{k}\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A})$ and the visitation measures $d_{k}\in{\mathcal{P}}({\mathcal{S}}),\nu_{k}\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A})$.

Assumption 4.1.

We assume that the following properties hold.

  • (a) Let $\mu$ be either $\rho_{k}$ or $\nu_{k}$. We assume for an absolute constant $c>0$ that

    $$\mathbb{E}_{(s,a)\sim\mu}\Bigl[\mathds{1}\bigl\{|W^{\top}(s,a)|\leq y\bigr\}\Bigr]\leq\frac{c\cdot y}{\|W\|_{2}},\quad\forall y>0,\ W\neq 0.$$

  • (b) We assume for an absolute constant $C_{h}>0$ that

    $$\max_{k\in\mathbb{N}}\biggl\{\bigg\|\frac{{\mathrm{d}}d_{\text{E}}}{{\mathrm{d}}d_{k}}\bigg\|_{2,d_{k}}+\bigg\|\frac{{\mathrm{d}}\nu_{\text{E}}}{{\mathrm{d}}\nu_{k}}\bigg\|_{2,\nu_{k}}\biggr\}\leq C_{h},\quad\max_{k\in\mathbb{N}}\biggl\{\bigg\|\frac{{\mathrm{d}}d_{\text{E}}}{{\mathrm{d}}\varrho_{k}}\bigg\|_{2,\varrho_{k}}+\bigg\|\frac{{\mathrm{d}}\nu_{\text{E}}}{{\mathrm{d}}\rho_{k}}\bigg\|_{2,\rho_{k}}\biggr\}\leq C_{h}.$$

    Here ${\mathrm{d}}d_{\text{E}}/{\mathrm{d}}d_{k}$, ${\mathrm{d}}\nu_{\text{E}}/{\mathrm{d}}\nu_{k}$, ${\mathrm{d}}d_{\text{E}}/{\mathrm{d}}\varrho_{k}$, and ${\mathrm{d}}\nu_{\text{E}}/{\mathrm{d}}\rho_{k}$ are the Radon-Nikodym derivatives.

Assumption 4.1 (a) holds when the probability density functions of $\rho_{k}$ and $\nu_{k}$ are uniformly upper bounded across $k$. Assumption 4.1 (b) states that the concentrability coefficients are uniformly upper bounded across $k$, which is commonly used in the analysis of RL (Szepesvári and Munos, 2005; Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Scherrer et al., 2015; Farahmand et al., 2016; Lazaric et al., 2016).

For notational simplicity, we write $u_{0}(s,a)=u_{W_{0}}(s,a)$ and $\phi_{0}(s,a)=\phi_{W_{0}}(s,a)$, where $u_{W_{0}}(s,a)$ is the neural network defined in (3.1) with $W=W_{0}$, $\phi_{W_{0}}(s,a)$ is the feature vector defined in (3.2) with $W=W_{0}$, and $W_{0}$ is the random initialization in (3.3). We impose the following assumptions on the neural network $u_{0}(s,a)$ and the transition kernel $P$.

Assumption 4.2.

We assume that the following properties hold.

  • (a) Let $\bar{U}=\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}|u_{0}(s,a)|$. We assume for absolute constants $M_{0}>0$ and $v>0$ that

    $$\mathbb{E}_{\text{init}}[\bar{U}^{2}]\leq M_{0}^{2},\quad\mathbb{P}(\bar{U}>t)\leq\exp(-v\cdot t^{2}),\quad\forall t>2M_{0}. \quad (4.1)$$

  • (b) We assume that the transition kernel $P$ belongs to the following class,

    $$\widetilde{\mathcal{M}}_{\infty,B_{P}}=\biggl\{P(s^{\prime}\,|\,s,a)=\int\vartheta(s,a;w)^{\top}\varphi(s^{\prime};w)\,{\mathrm{d}}q(w)\,\bigg|\,\sup_{w}\biggl\|\int\varphi(s;w)\,{\mathrm{d}}s\biggr\|_{2}\leq B_{P}\biggr\}.$$

    Here $B_{P}>0$ is an absolute constant, $q$ is the probability density function of $N(0,I_{d}/d)$, and $\vartheta(s,a;w)$ is defined as $\vartheta(s,a;w)=\mathds{1}\{w^{\top}(s,a)>0\}\cdot(s,a)$.

Assumption 4.2 (b) states that the MDP belongs to (a variant of) the class of linear MDPs (Yang and Wang, 2019a, b; Jin et al., 2019; Cai et al., 2019b). However, our class of transition kernels is infinite-dimensional, and thus captures a rich class of MDPs. To understand Assumption 4.2 (a), recall that we initialize the neural network with $[W_{0}]_{l}\sim N(0,I_{d}/d)$ and $b_{l}\sim{\text{Unif}}(\{-1,1\})$ for any $l\in[m]$. Thus, the neural network $u_{0}(s,a)$ defined in (3.1) with $W=W_{0}$ converges to a Gaussian process indexed by $(s,a)\in{\mathcal{S}}\times\mathcal{A}$ as $m$ goes to infinity. Following from the facts that the maximum of a Gaussian process over a compact index set is sub-Gaussian (van de Geer and Muro, 2014) and that ${\mathcal{S}}\times\mathcal{A}$ is compact, it is reasonable to assume that $\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}|u_{0}(s,a)|$ is sub-Gaussian, which further implies the existence of the absolute constants $M_{0}$ and $v$ in (4.1) of Assumption 4.2 (a). Moreover, if we assume that $m$ is even and initialize the parameters $W_{0},b$ as follows,

$$\begin{cases}[W_{0}]_{l}\overset{\text{i.i.d.}}{\sim}N(0,I_{d}/d),\quad b_{l}\sim{\text{Unif}}\bigl(\{-1,1\}\bigr),&\forall l=1,\ldots,m/2,\\ [W_{0}]_{l}=[W_{0}]_{l-m/2},\quad b_{l}=-b_{l-m/2},&\forall l=m/2+1,\ldots,m,\end{cases} \quad (4.2)$$

then $u_{0}(s,a)=0$ for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$, which allows us to set $M_{0}=0$ and $v=+\infty$ in Assumption 4.2 (a). Also, it holds that $0=u_{0}(s,a)\in\mathcal{R}_{\beta}$, which implies that $\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{1},\pi_{2})\geq 0$ for any $\pi_{1}$ and $\pi_{2}$. The proof of our results with the random initialization in (4.2) is identical.

Finally, we impose the following assumption on the regularizer $\psi(\beta)$ and the variances of the estimators $\widehat{\mathcal{I}}(\theta)$, $\widehat{\nabla}_{\theta}L(\theta,\beta)$, and $\widehat{\nabla}_{\beta}L(\theta,\beta)$ defined in (3.16), (3.17), and (3.18), respectively.

Assumption 4.3.

We assume that the following properties hold.

  • (a) We assume for an absolute constant $\sigma>0$ that

    $$\mathbb{E}_{k}\Bigl[\bigl\|\widehat{\mathcal{I}}(\theta_{k})W-\mathbb{E}_{k}\bigl[\widehat{\mathcal{I}}(\theta_{k})W\bigr]\bigr\|_{2}^{2}\Bigr]\leq\tau_{k}^{4}\cdot\sigma^{2}/N,\quad\forall W\in S_{B_{\theta}}, \quad (4.3)$$
    $$\mathbb{E}_{k}\Bigl[\bigl\|\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})-\mathbb{E}_{k}\bigl[\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr]\bigr\|_{2}^{2}\Bigr]\leq\tau_{k}^{2}\cdot\sigma^{2}/N, \quad (4.4)$$
    $$\mathbb{E}_{k}\Bigl[\bigl\|\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\mathbb{E}_{k}\bigl[\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})\bigr]\bigr\|_{2}^{2}\Bigr]\leq\sigma^{2}/N, \quad (4.5)$$

    where $\tau_{k}$ is the inverse temperature parameter in (3.5), $N\in\mathbb{N}_{+}$ is the batch size, and $S_{B_{\theta}}$ is the parameter domain of $\theta$ defined in (3.4) with the domain radius $B_{\theta}$. Here the expectation $\mathbb{E}_{k}$ is taken with respect to the $k$-th batch, which is drawn from $\nu_{k}$ given $\theta_{k}$.

  • (b) We assume that the regularizer $\psi(\beta)$ in (2.4) is convex and $L_{\psi}$-Lipschitz continuous over the compact parameter domain $S_{B_{\beta}}$.

Assumption 4.3 (a) holds when $\widehat{Q}_{\omega_{k}}(s_{i},a_{i})\cdot\iota_{\theta_{k}}(s_{i},a_{i})$, $\iota_{\theta_{k}}(s_{i},a_{i})\iota_{\theta_{k}}(s_{i},a_{i})^{\top}$, and $\phi_{\beta_{k}}(s_{i},a_{i})$ have uniformly upper bounded variances across $i\in[N]$ and $k$, and the Markov chain that generates $\{(s_{i},a_{i})\}_{i\in[N]}$ mixes sufficiently fast (Zhang et al., 2019a). Similar assumptions are also used in the analysis of policy optimization (Xu et al., 2019a, b). Also, Assumption 4.3 (b) holds for most commonly used regularizers (Ho and Ermon, 2016).

4.2 Global Optimality and Convergence

In this section, we establish the global optimality and convergence of Algorithm 1. The following proposition, adapted from Cai et al. (2019c), characterizes the global optimality and convergence of neural TD, which is presented in Algorithm 2.

Proposition 4.4 (Global Optimality and Convergence of Neural TD).

In Algorithm 2, we set $T_{\mathrm{TD}}=\Omega(m)$, $\alpha=\min\{(1-\gamma)/8,m^{-1/2}\}$, and $B_{\omega}=c\cdot(B_{\beta}+B_{P}\cdot(M_{0}+B_{\beta}))$ for an absolute constant $c>0$. Let $\pi_{k},r_{k}$ be the input and $\omega_{k}$ be the output of Algorithm 2. Under Assumptions 4.1 and 4.2, it holds for an absolute constant $C_{v}>0$ that

$$\mathbb{E}_{\text{init}}\Bigl[\bigl\|Q_{\omega_{k}}(s,a)-Q^{\pi_{k}}_{r_{k}}(s,a)\bigr\|^{2}_{2,\rho_{k}}\Bigr]=\mathcal{O}\bigl(B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr). \quad (4.6)$$

Here the expectation $\mathbb{E}_{\text{init}}$ is taken with respect to the random initialization in (3.3).

Proof.

See §B.1 for a detailed proof. ∎

The term $B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})$ in (4.6) of Proposition 4.4 characterizes the hardness of estimating the state-action value function $Q^{\pi_{k}}_{r_{k}}(s,a)$ by the neural network defined in (3.1), which arises because $\|Q^{\pi_{k}}_{r_{k}}(s,a)\|_{\infty}$ is not uniformly upper bounded across $k$. Note that if we employ the random initialization in (4.2), we have $C_{v}=+\infty$, and consequently such a term vanishes. We are now ready to establish the global optimality and convergence of Algorithm 1.

Theorem 4.5 (Global Optimality and Convergence of GAIL).

We set $\eta=1/\sqrt{T}$, $B_{\omega}=c\cdot(B_{\beta}+B_{P}\cdot(M_{0}+B_{\beta}))$ for an absolute constant $c>0$, and $B_{\theta}=B_{\omega}$ in Algorithm 1. Let $\bar{\pi}$ be the output of Algorithm 1. Under Assumptions 4.1-4.3, it holds that

$$\mathbb{E}\bigl[\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\bar{\pi})\bigr]\leq\underbrace{\frac{(1-\gamma)^{-1}\cdot\log|\mathcal{A}|+13\bar{B}^{2}+M_{0}^{2}+8}{\sqrt{T}}}_{\text{(i)}}+\underbrace{2\lambda\cdot L_{\psi}\cdot\bar{B}}_{\text{(ii)}}+\underbrace{\frac{1}{T}\sum_{k=0}^{T-1}\varepsilon_{k}}_{\text{(iii)}}. \quad (4.7)$$

Here $\bar{B}=\max\{B_{\theta},B_{\omega},B_{\beta}\}$, $\mathbb{D}_{\mathcal{R}_{\beta}}$ is the $\mathcal{R}_{\beta}$-distance defined in (2.5) with $\mathcal{R}_{\beta}=\{r_{\beta}(s,a)\,|\,\beta\in S_{B_{\beta}}\}$, the expectation is taken with respect to the random initialization in (3.3) and the $T$ batches, and the error term $\varepsilon_{k}$ satisfies

$$\varepsilon_{k}=\underbrace{2\sqrt{2}\cdot C_{h}\cdot\bar{B}\cdot\sigma\cdot N^{-1/2}}_{\text{(iii.a)}}+\underbrace{\epsilon_{Q,k}}_{\text{(iii.b)}}+\underbrace{\mathcal{O}(k\cdot\bar{B}^{3/2}\cdot m^{-1/4}+\bar{B}^{5/4}\cdot m^{-1/8})}_{\text{(iii.c)}}, \quad (4.8)$$

where $C_{h}$ is defined in Assumption 4.1, $L_{\psi}$ and $\sigma$ are defined in Assumption 4.3, and $\epsilon_{Q,k}=\mathcal{O}(B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2}))$ is the error induced by neural TD (Algorithm 2).

Proof.

See §5 for a detailed proof. ∎

The optimality gap in (4.7) of Theorem 4.5 is measured by the expected $\mathcal{R}_{\beta}$-distance $\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\bar{\pi})$ between the expert policy $\pi_{\text{E}}$ and the learned policy $\bar{\pi}$. Thus, by showing that the optimality gap is upper bounded by $\mathcal{O}(1/\sqrt{T})$, we prove that $\bar{\pi}$ (approximately) outperforms the expert policy $\pi_{\text{E}}$ in expectation when the number of iterations $T$ goes to infinity. As shown in (4.7) of Theorem 4.5, the optimality gap is upper bounded by the sum of three terms. Term (i) corresponds to the $1/\sqrt{T}$ rate of convergence of Algorithm 1. Term (ii) corresponds to the bias induced by the regularizer $\lambda\cdot\psi(\beta)$ in the objective function $L(\theta,\beta)$ defined in (2.4). Term (iii) is upper bounded by the sum of the three terms in (4.8) of Theorem 4.5. In detail, term (iii.a) corresponds to the error induced by the variances of $\widehat{\mathcal{I}}(\theta)$, $\widehat{\nabla}_{\theta}L(\theta,\beta)$, and $\widehat{\nabla}_{\beta}L(\theta,\beta)$ in (4.3), (4.4), and (4.5) of Assumption 4.3, which vanishes as the batch size $N$ in Algorithm 1 goes to infinity. Term (iii.b) is the error of estimating $Q^{\pi}_{r}(s,a)$ by $\widehat{Q}_{\omega}(s,a)$ using neural TD (Algorithm 2). As shown in Proposition 4.4, the estimation error $\epsilon_{Q,k}$ vanishes as $m$ and $B_{\omega}$ go to infinity. Term (iii.c) corresponds to the linearization error of the neural network defined in (3.1), which is characterized in Lemma A.2. Following from Theorem 4.5, for $B_{\omega}=\Omega((C_{v}^{-1}\cdot\log T)^{1/2})$, $m=\Omega(\bar{B}^{10}\cdot T^{6})$, and $N=\Omega(\bar{B}^{2}\cdot T\cdot\sigma^{2})$, it holds that $\mathbb{E}[\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\bar{\pi})]=\mathcal{O}(T^{-1/2}+\lambda)$, which implies the $1/\sqrt{T}$ rate of convergence of Algorithm 1 (up to the bias induced by the regularizer).
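To make the final claim concrete, the following back-of-the-envelope calculation is a sketch of how the stated choices of $B_{\omega}$, $m$, and $N$ control each term of (4.8); absolute constants and the remaining terms of $\epsilon_{Q,k}$, which vanish even faster, are suppressed.

\begin{align*}
B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2}) &= \mathcal{O}(T^{-1}\cdot\log T) &&\text{for } B_{\omega}^{2}=\Omega(C_{v}^{-1}\cdot\log T),\\
k\cdot\bar{B}^{3/2}\cdot m^{-1/4}+\bar{B}^{5/4}\cdot m^{-1/8} &= \mathcal{O}(T^{-1/2})\ \text{for all } k\leq T &&\text{for } m=\Omega(\bar{B}^{10}\cdot T^{6}),\\
\bar{B}\cdot\sigma\cdot N^{-1/2} &= \mathcal{O}(T^{-1/2}) &&\text{for } N=\Omega(\bar{B}^{2}\cdot T\cdot\sigma^{2}),
\end{align*}

so that $T^{-1}\sum_{k=0}^{T-1}\varepsilon_{k}=\mathcal{O}(T^{-1/2})$, and together with terms (i) and (ii) of (4.7) the optimality gap is $\mathcal{O}(T^{-1/2}+\lambda)$.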

5 Proof of Main Results

In this section, we present the proof of Theorem 4.5, which establishes the global optimality and convergence of Algorithm 1. For notational simplicity, we write $\pi^{s}(a)=\pi(a\,|\,s)$ for any policy $\pi$, state $s\in{\mathcal{S}}$, and action $a\in\mathcal{A}$. For any policies $\pi_{1},\pi_{2}$ and distribution $\mu$ over ${\mathcal{S}}$, we denote the expected Kullback-Leibler (KL) divergence by ${\mathrm{KL}}^{\mu}$, which is defined as ${\mathrm{KL}}^{\mu}(\pi_{1}\,\|\,\pi_{2})=\mathbb{E}_{s\sim\mu}[{\mathrm{KL}}(\pi_{1}^{s}\,\|\,\pi_{2}^{s})]$. For any visitation measures $d_{\pi}\in{\mathcal{P}}({\mathcal{S}})$ and $\nu_{\pi}\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A})$, we denote by $\mathbb{E}_{d_{\pi}}$ and $\mathbb{E}_{\nu_{\pi}}$ the expectations taken with respect to $s\sim d_{\pi}$ and $(s,a)\sim\nu_{\pi}$, respectively.

Following from the property of the mixed policy $\bar{\pi}$ in (3.21), we have that

$$\mathbb{E}\bigl[\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\bar{\pi})\bigr]=\mathbb{E}\Bigl[\max_{\beta^{\prime}\in S_{B_{\beta}}}J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\bar{\pi};r_{\beta^{\prime}})\Bigr]=\mathbb{E}\biggl[\max_{\beta^{\prime}\in S_{B_{\beta}}}\frac{1}{T}\sum_{k=0}^{T-1}J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\pi_{k};r_{\beta^{\prime}})\biggr]. \quad (5.1)$$

We now upper bound the optimality gap in (5.1) by upper bounding the following difference of expected cumulative rewards,

$$J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\pi_{k};r_{\beta^{\prime}})=\underbrace{J(\pi_{\text{E}};r_{k})-J(\pi_{k};r_{k})}_{\text{(i)}}+\underbrace{L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})}_{\text{(ii)}}+\underbrace{\lambda\cdot\bigl(\psi(\beta^{\prime})-\psi(\beta_{k})\bigr)}_{\text{(iii)}}, \quad (5.2)$$

where $\beta^{\prime}\in S_{B_{\beta}}$ is chosen arbitrarily and $L(\theta,\beta)$ is the objective function defined in (2.4). Following from Assumption 4.3 and the fact that $\beta_{k},\beta^{\prime}\in S_{B_{\beta}}$, we have that

$$\lambda\cdot\bigl(\psi(\beta^{\prime})-\psi(\beta_{k})\bigr)\leq\lambda\cdot L_{\psi}\cdot\|\beta^{\prime}-\beta_{k}\|_{2}\leq 2\lambda\cdot L_{\psi}\cdot B_{\beta}, \quad (5.3)$$

which upper bounds term (iii) of (5.2). It remains to upper bound terms (i) and (ii) of (5.2), which hinges on the one-point convexity of $J(\pi;r)$ with respect to $\pi$ and the (approximate) convexity of $L(\theta,\beta)$ with respect to $\beta$.

Upper bound of term (i) in (5.2). In what follows, we upper bound term (i) of (5.2). We first introduce the following cost difference lemma (Kakade and Langford, 2002), which corresponds to the one-point convexity of $J(\pi;r)$ with respect to $\pi$. Recall that $d_{\text{E}}\in{\mathcal{P}}({\mathcal{S}})$ is the state visitation measure induced by the expert policy $\pi_{\text{E}}$.

Lemma 5.1 (Cost Difference Lemma, Lemma 6.1 in Kakade and Langford (2002)).

For any policy π\pi and reward function r(s,a)r(s,a), it holds that

$$J(\pi_{\text{E}};r)-J(\pi;r)=(1-\gamma)^{-1}\cdot\mathbb{E}_{d_{\text{E}}}\Bigl[\bigl\langle Q^{\pi}_{r}(s,\cdot),\pi_{\text{E}}^{s}-\pi^{s}\bigr\rangle_{\mathcal{A}}\Bigr], \quad (5.4)$$

where γ\gamma is the discount factor.

Furthermore, we establish the following lemma, which upper bounds the right-hand side of (5.4) in Lemma 5.1.

Lemma 5.2.

Under Assumptions 4.1-4.3, we have that

$$\mathbb{E}_{d_{\text{E}}}\Bigl[\bigl\langle Q^{\pi_{k}}_{r_{k}}(s,\cdot),\pi_{\text{E}}^{s}-\pi_{k}^{s}\bigr\rangle_{\mathcal{A}}\Bigr]=\eta^{-1}\cdot{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k})-\eta^{-1}\cdot{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k+1})+\Delta_{k}^{\text{(i)}},$$

where

$$\mathbb{E}\bigl[|\Delta_{k}^{\text{(i)}}|\bigr]=2\sqrt{2}\cdot C_{h}\cdot B_{\theta}^{1/2}\cdot\sigma^{1/2}\cdot N^{-1/4}+\epsilon_{Q,k}+\eta\cdot(M_{0}^{2}+9B_{\theta}^{2})+\mathcal{O}(\eta^{-1}\cdot\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}+B_{\theta}^{5/4}\cdot m^{-1/8}). \quad (5.5)$$

Here $M_{0}$ is defined in Assumption 4.2 (a), $\sigma$ is defined in Assumption 4.3, $N$ is the batch size in (3.16)-(3.18), and $\epsilon_{Q,k}=\mathcal{O}(B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2}))$ for an absolute constant $C_{v}>0$, which depends on the absolute constant $v$ in Assumption 4.2 (a).

Proof.

See §C.2 for a detailed proof. ∎

Combining Lemmas 5.1 and 5.2, we have that

$$J(\pi_{\text{E}};r_{k})-J(\pi_{k};r_{k})\leq\frac{{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k})-{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k+1})+\eta\cdot\Delta_{k}^{\text{(i)}}}{\eta\cdot(1-\gamma)}, \quad (5.6)$$

which upper bounds term (i) of (5.2). Here $\Delta_{k}^{\text{(i)}}$ is upper bounded in (5.5) of Lemma 5.2.

Upper bound of term (ii) in (5.2). In what follows, we upper bound term (ii) of (5.2). We first establish the following lemma, which characterizes the (approximate) convexity of $L(\theta,\beta)$ with respect to $\beta$.

Lemma 5.3.

Under Assumption 4.1 (a), it holds for any $\beta^{\prime}\in S_{B_{\beta}}$ that

$$\mathbb{E}_{\text{init}}\bigl[L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})\bigr]=\mathbb{E}_{\text{init}}\bigl[\nabla_{\beta}L(\theta_{k},\beta_{k})^{\top}(\beta^{\prime}-\beta_{k})\bigr]+\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4}).$$
Proof.

See §C.3 for a detailed proof. ∎

The term $\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4})$ in Lemma 5.3 arises from the linearization error of the neural network, which is characterized in Lemma A.2. Based on Lemma 5.3, we establish the following lemma, which upper bounds term (ii) of (5.2).

Lemma 5.4.

Under Assumptions 4.1 (a) and 4.3, we have that

$$L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})\leq\eta^{-1}\cdot\|\beta_{k}-\beta^{\prime}\|_{2}^{2}-\eta^{-1}\cdot\|\beta_{k+1}-\beta^{\prime}\|_{2}^{2}-\eta^{-1}\cdot\|\beta_{k+1}-\beta_{k}\|_{2}^{2}+\Delta_{k}^{\text{(ii)}},$$

where

$$\mathbb{E}\bigl[|\Delta_{k}^{\text{(ii)}}|\bigr]=\eta\cdot\bigl((2+\lambda\cdot L_{\psi})^{2}+\sigma^{2}\cdot N^{-1}\bigr)+2B_{\beta}\cdot\sigma\cdot N^{-1/2}+\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4}). \quad (5.7)$$
Proof.

See §C.4 for a detailed proof. ∎

By Lemma 5.4, we have that

$$L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})\leq\eta^{-1}\cdot\bigl(\|\beta_{k}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta_{k}\|_{2}^{2}\bigr)+\Delta_{k}^{\text{(ii)}}, \quad (5.8)$$

which upper bounds term (ii) of (5.2). Here $\Delta_{k}^{\text{(ii)}}$ is upper bounded in (5.7) of Lemma 5.4.

Plugging (5.3), (5.6), and (5.8) into (5.2), we obtain that

$$J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\pi_{k};r_{\beta^{\prime}})\leq\frac{{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k})-{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{k+1})}{\eta\cdot(1-\gamma)}+\eta^{-1}\cdot\bigl(\|\beta_{k}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta^{\prime}\|_{2}^{2}\bigr)+2\lambda\cdot L_{\psi}\cdot B_{\beta}+\Delta_{k}. \quad (5.9)$$

Here $\Delta_{k}=\Delta_{k}^{\text{(i)}}+\Delta_{k}^{\text{(ii)}}$, where $\Delta_{k}^{\text{(i)}}$ and $\Delta_{k}^{\text{(ii)}}$ are upper bounded in (5.5) and (5.7) of Lemmas 5.2 and 5.4, respectively. Note that the upper bound of $\Delta_{k}$ does not depend on $\theta$ and $\beta$. Upon telescoping (5.9) with respect to $k$, we obtain that

$$J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\bar{\pi};r_{\beta^{\prime}})=\frac{1}{T}\sum_{k=0}^{T-1}\bigl[J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\pi_{k};r_{\beta^{\prime}})\bigr]\leq\frac{(1-\gamma)^{-1}\cdot{\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{0})+\|\beta_{0}-\beta^{\prime}\|_{2}^{2}}{\eta\cdot T}+2\lambda\cdot L_{\psi}\cdot B_{\beta}+\frac{1}{T}\sum_{k=0}^{T-1}|\Delta_{k}|. \quad (5.10)$$

Following from the fact that $\tau_{0}=0$ and the parameterization of $\pi_{\theta}$ in (3.5), the policy $\pi_{0}^{s}$ is the uniform distribution over $\mathcal{A}$ for any $s\in{\mathcal{S}}$. Thus, we have ${\mathrm{KL}}^{d_{\text{E}}}(\pi_{\text{E}}\,\|\,\pi_{0})\leq\log|\mathcal{A}|$. Meanwhile, following from the fact that $\beta^{\prime}\in S_{B_{\beta}}$, it holds that $\|\beta^{\prime}-\beta_{0}\|_{2}\leq B_{\beta}$. Finally, by setting $\eta=T^{-1/2}$, $\tau_{k}=k\cdot\eta$, and $\bar{B}=\max\{B_{\theta},B_{\omega},B_{\beta}\}$ in (5.10), we have that

$$\mathbb{E}\bigl[\mathbb{D}_{\mathcal{R}_{\beta}}(\pi_{\text{E}},\bar{\pi})\bigr]=\mathbb{E}\Bigl[\max_{\beta^{\prime}\in S_{B_{\beta}}}J(\pi_{\text{E}};r_{\beta^{\prime}})-J(\bar{\pi};r_{\beta^{\prime}})\Bigr]\leq\frac{(1-\gamma)^{-1}\cdot\log|\mathcal{A}|+4B_{\beta}^{2}}{\eta\cdot T}+2\lambda\cdot L_{\psi}\cdot B_{\beta}+\frac{\mathbb{E}\bigl[\max_{\beta^{\prime}}\sum_{k=0}^{T-1}|\Delta_{k}|\bigr]}{T}=\frac{(1-\gamma)^{-1}\cdot\log|\mathcal{A}|+13\bar{B}^{2}+M_{0}^{2}+8}{\sqrt{T}}+2\lambda\cdot L_{\psi}\cdot\bar{B}+\frac{\sum_{k=0}^{T-1}\varepsilon_{k}}{T}.$$

Here $\varepsilon_{k}$ is upper bounded as follows,

$$\varepsilon_{k}=2\sqrt{2}\cdot C_{h}\cdot\bar{B}\cdot\sigma\cdot N^{-1/2}+\epsilon_{Q,k}+\mathcal{O}(k\cdot\bar{B}^{3/2}\cdot m^{-1/4}+\bar{B}^{5/4}\cdot m^{-1/8}),$$

where $\epsilon_{Q,k}=\mathcal{O}(B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2}))$ for an absolute constant $C_{v}>0$. Thus, we complete the proof of Theorem 4.5.

References

  • Agarwal et al. (2019) Agarwal, A., Kakade, S. M., Lee, J. D. and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
  • Altman (1999) Altman, E. (1999). Constrained Markov decision processes, vol. 7. CRC Press.
  • Anthony and Bartlett (2009) Anthony, M. and Bartlett, P. L. (2009). Neural network learning: Theoretical foundations. Cambridge University Press.
  • Antos et al. (2008) Antos, A., Szepesvári, C. and Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71 89–129.
  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S. and Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • Bhandari and Russo (2019) Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
  • Cai et al. (2019a) Cai, Q., Hong, M., Chen, Y. and Wang, Z. (2019a). On the global convergence of imitation learning: A case for linear quadratic regulator. arXiv preprint arXiv:1901.03674.
  • Cai et al. (2019b) Cai, Q., Yang, Z., Jin, C. and Wang, Z. (2019b). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.
  • Cai et al. (2019c) Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019c). Neural temporal-difference learning converges to global optima. arXiv preprint arXiv:1905.10027.
  • Chen et al. (2020) Chen, M., Wang, Y., Liu, T., Yang, Z., Li, X., Wang, Z. and Zhao, T. (2020). On computation and generalization of generative adversarial imitation learning. arXiv preprint arXiv:2001.02792.
  • Farahmand et al. (2016) Farahmand, A.-m., Ghavamzadeh, M., Szepesvári, C. and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research, 17 4809–4874.
  • Farahmand et al. (2010) Farahmand, A.-m., Szepesvári, C. and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems.
  • Finn et al. (2016) Finn, C., Levine, S. and Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems.
  • Ho and Ermon (2016) Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems.
  • Hofmann et al. (2008) Hofmann, T., Schölkopf, B. and Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics 1171–1220.
  • Jin et al. (2019) Jin, C., Yang, Z., Wang, Z. and Jordan, M. I. (2019). Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388.
  • Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, vol. 2.
  • Kakade (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.
  • Kuefler et al. (2017) Kuefler, A., Morton, J., Wheeler, T. and Kochenderfer, M. (2017). Imitating driver behavior with generative adversarial networks. In IEEE Intelligent Vehicles Symposium. IEEE.
  • Lazaric et al. (2016) Lazaric, A., Ghavamzadeh, M. and Munos, R. (2016). Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17 583–612.
  • Levine and Koltun (2012) Levine, S. and Koltun, V. (2012). Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617.
  • Liu et al. (2019) Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
  • Merel et al. (2017) Merel, J., Tassa, Y., TB, D., Srinivasan, S., Lemmon, J., Wang, Z., Wayne, G. and Heess, N. (2017). Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201.
  • Munos and Szepesvári (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. The Journal of Machine Learning Research, 9 815–857.
  • Nesterov (2013) Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media.
  • Ng and Russell (2000) Ng, A. Y. and Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In International Conference on Machine Learning.
  • Peters and Schaal (2008) Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71 1180–1190.
  • Pomerleau (1991) Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3 88–97.
  • Rafique et al. (2018) Rafique, H., Liu, M., Lin, Q. and Yang, T. (2018). Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv:1810.02060.
  • Rahimi and Recht (2008) Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems.
  • Rahimi and Recht (2009) Rahimi, A. and Recht, B. (2009). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in Neural Information Processing Systems 1313–1320.
  • Ross and Bagnell (2010) Ross, S. and Bagnell, D. (2010). Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics.
  • Ross et al. (2011) Ross, S., Gordon, G. and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics.
  • Russell (1998) Russell, S. (1998). Learning agents for uncertain environments. In Conference on Learning Theory.
  • Scherrer et al. (2015) Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. and Geist, M. (2015). Approximate modified policy iteration and its application to the game of Tetris. The Journal of Machine Learning Research, 16 1629–1676.
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
  • Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P. and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
  • Syed et al. (2008) Syed, U., Bowling, M. and Schapire, R. E. (2008). Apprenticeship learning using linear programming. In International Conference on Machine Learning.
  • Szepesvári and Munos (2005) Szepesvári, C. and Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In International Conference on Machine Learning. ACM.
  • Tai et al. (2018) Tai, L., Zhang, J., Liu, M. and Burgard, W. (2018). Socially compliant navigation through raw depth inputs with generative adversarial imitation learning. In IEEE International Conference on Robotics and Automation. IEEE.
  • van de Geer and Muro (2014) van de Geer, S. and Muro, A. (2014). On higher order isotropy conditions and lower bounds for sparse quadratic forms. Electronic Journal of Statistics, 8 3031–3061.
  • Wang et al. (2019) Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
  • Xu et al. (2019a) Xu, P., Gao, F. and Gu, Q. (2019a). An improved convergence analysis of stochastic variance-reduced policy gradient. arXiv preprint arXiv:1905.12615.
  • Xu et al. (2019b) Xu, P., Gao, F. and Gu, Q. (2019b). Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610.
  • Yang and Wang (2019a) Yang, L. and Wang, M. (2019a). Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning.
  • Yang and Wang (2019b) Yang, L. F. and Wang, M. (2019b). Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389.
  • Yu et al. (2016) Yu, L., Zhang, W., Wang, J. and Yu, Y. (2016). SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473.
  • Zhang et al. (2019a) Zhang, K., Koppel, A., Zhu, H. and Başar, T. (2019a). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.
  • Zhang et al. (2019b) Zhang, K., Yang, Z. and Başar, T. (2019b). Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. arXiv preprint arXiv:1906.00729.

Appendix A Neural Networks

In what follows, we present the properties of the neural network defined in (3.1). First, we define the following function class.

Definition A.1 (Function Class).

For B>0B>0 and m+m\in\mathbb{N}_{+}, we define

B,m={Wϕ0(s,a)|Wmd,WW02B},\displaystyle\mathcal{F}_{B,m}=\bigl{\{}W^{\top}\phi_{0}(s,a)\,\big{|}\,W\in\mathbb{R}^{md},\leavevmode\nobreak\ \|W-W_{0}\|_{2}\leq B\bigr{\}},

where ϕ0(s,a)\phi_{0}(s,a) is the feature vector defined in (3.2) with W=W0W=W_{0}.

As shown in Rahimi and Recht (2008), the feature $\phi_{0}(s,a)$ induces a reproducing kernel Hilbert space (RKHS), namely $\mathcal{H}$. When $m$ goes to infinity, $\mathcal{F}_{B,m}$ approximates a ball in $\mathcal{H}$, which captures a rich class of functions (Hofmann et al., 2008; Rahimi and Recht, 2008). Furthermore, we obtain the following lemma from Cai et al. (2019c), which characterizes the linearization error of the neural network defined in (3.1).
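To make the objects in Definition A.1 concrete, the following sketch instantiates the random initialization in (3.3), the feature vector $\phi_{W}(s,a)$ (whose blocks, as used in the proof of Lemma A.2 below, take the form $m^{-1/2}\cdot\operatorname{\mathds{1}}\{(s,a)^{\top}[W]_{l}>0\}\cdot(s,a)$), and an element of $\mathcal{F}_{B,m}$. The function names and the NumPy-based implementation are illustrative only and are not part of the paper.

import numpy as np

def init_weights(m, d, seed=0):
    # Random initialization mirroring (3.3): each block [W_0]_l ~ N(0, I_d / d).
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))

def feature(W, x):
    # phi_W(s, a) with x = (s, a) and ||x||_2 <= 1: stacked blocks
    # m^{-1/2} * 1{x^T [W]_l > 0} * x, as in the proof of Lemma A.2.
    m = W.shape[0]
    active = (W @ x > 0).astype(float)
    return (active[:, None] * x[None, :]) / np.sqrt(m)  # shape (m, d)

def element_of_F_B_m(W, W0, x, B):
    # An element W^T phi_0(s, a) of F_{B,m} in Definition A.1, i.e., a linear
    # function of the feature vector phi_0 = phi_{W_0} at initialization.
    assert np.linalg.norm(W - W0) <= B + 1e-12
    return float(np.sum(W * feature(W0, x)))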

Lemma A.2 (Linearization Error, Lemma 5.1 in Cai et al. (2019c)).

Under Assumption (a), it holds for any W,W1,W2SBW,W_{1},W_{2}\in S_{B} that,

𝔼init[WϕW1(s,a)WϕW2(s,a)2,μ2]=𝒪(B3m1/2),\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\bigl{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{W_{2}}(s,a)\bigr{\|}_{2,\mu}^{2}\Bigr{]}=\mathcal{O}(B^{3}\cdot m^{-1/2}),
𝔼init[WϕW1(s,a)WϕW2(s,a)1,μ]=𝒪(B3/2m1/4),\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{W_{2}}(s,a)\big{\|}_{1,\mu}\Bigr{]}=\mathcal{O}(B^{3/2}\cdot m^{-1/4}),

where ϕW(s,a)\phi_{W}(s,a) is the feature vector defined in (3.2) and μ𝒫(𝒮×𝒜)\mu\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A}) is a distribution that satisfies Assumption (a).

Proof.

See §A.1 for a detailed proof. ∎

Following from Lemma A.2, the function class B,m\mathcal{F}_{B,m} defined in Definition A.1 is a first-order approximation of the class of the neural networks defined in (3.1). Meanwhile, we establish the following lemma to characterize the sub-Gaussian property of the neural network defined in (3.1).

Lemma A.3.

Under Assumption 4.2, for any W,WSBW,W^{\prime}\in S_{B}, it holds that sup(s,a)𝒮×𝒜|WϕW(s,a)|\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}|W^{\top}\phi_{W^{\prime}}(s,a)| is sub-Gaussian, where the randomness comes from the random initialization W0W_{0} in the definition of SBS_{B} in (3.4). Moreover, it holds that

𝔼init[sup(s,a)𝒮×𝒜|WϕW(s,a)|2]2M02+18B2\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}^{2}\Bigr{]}\leq 2M_{0}^{2}+18B^{2}

and that

(sup(s,a)𝒮×𝒜|WϕW(s,a)|>t)exp(vt2/2),t>2M0+6B.\displaystyle\mathbb{P}\Bigl{(}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}>t\Bigr{)}\leq\exp(-v\cdot t^{2}/2),\quad\forall t>2M_{0}+6B.
Proof.

See §A.2 for a detailed proof. ∎

A.1 Proof of Lemma A.2

Proof.

We consider any W,WSBW,W^{\prime}\in S_{B}. By the definition of ϕW(s,a)\phi_{W}(s,a) in (3.2) and the triangle inequality, we have that

|WϕW(s,a)Wϕ0(s,a)|\displaystyle\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)-W^{\top}\phi_{0}(s,a)\bigr{|}
1ml=1m|[W]l(s,a)||𝟙{(s,a)[W]l>0}𝟙{(s,a)[W0]l>0}|.\displaystyle\quad\leq\frac{1}{\sqrt{m}}\sum_{l=1}^{m}\bigl{|}[W]_{l}^{\top}(s,a)\bigr{|}\cdot\Bigl{|}\operatorname{\mathds{1}}\bigl{\{}(s,a)^{\top}[W^{\prime}]_{l}>0\bigr{\}}-\operatorname{\mathds{1}}\bigl{\{}(s,a)^{\top}[W_{0}]_{l}>0\bigr{\}}\Bigr{|}. (A.1)

We now upper bound the right-hand side of (A.1). For the term |[W]l(s,a)||[W]_{l}^{\top}(s,a)| in (A.1), we have that

|[W]l(s,a)|\displaystyle\bigl{|}[W]_{l}^{\top}(s,a)\bigr{|} |[W0]l(s,a)|+|([W]l[W0]l)(s,a)|\displaystyle\leq\bigl{|}[W_{0}]_{l}^{\top}(s,a)\bigr{|}+\Bigl{|}\bigl{(}[W]_{l}-[W_{0}]_{l}\bigr{)}^{\top}(s,a)\Bigr{|}
|[W0]l(s,a)|+[W]l[W0]l2,\displaystyle\leq\bigl{|}[W_{0}]_{l}^{\top}(s,a)\bigr{|}+\big{\|}[W]_{l}-[W_{0}]_{l}\big{\|}_{2}, (A.2)

where the first inequality follows from the triangle inequality and the second inequality follows from the Cauchy-Schwartz inequality and the fact that (s,a)21\|(s,a)\|_{2}\leq 1. To upper bound the term |𝟙{(s,a)[W]l>0}𝟙{(s,a)[W0]l>0}||\operatorname{\mathds{1}}\{(s,a)^{\top}[W^{\prime}]_{l}>0\}-\operatorname{\mathds{1}}\{(s,a)^{\top}[W_{0}]_{l}>0\}| on the right-hand side of (A.1), note that 𝟙{(s,a)[W]l>0}𝟙{(s,a)[W0]l>0}\operatorname{\mathds{1}}\{(s,a)^{\top}[W^{\prime}]_{l}>0\}\neq\operatorname{\mathds{1}}\{(s,a)^{\top}[W_{0}]_{l}>0\} implies that

|[W0]l(s,a)||[W]l(s,a)[W0]l(s,a)|[W]l[W0]l2.\displaystyle\bigl{|}[W_{0}]_{l}^{\top}(s,a)\bigr{|}\leq\bigl{|}[W^{\prime}]_{l}^{\top}(s,a)-[W_{0}]_{l}^{\top}(s,a)\bigr{|}\leq\big{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\big{\|}_{2}.

Thus, we have that

|𝟙{(s,a)[W]l>0}𝟙{(s,a)[W0]l>0}|𝟙{|(s,a)[W0]l|[W]l[W0]l2}.\displaystyle\Bigl{|}\operatorname{\mathds{1}}\bigl{\{}(s,a)^{\top}[W^{\prime}]_{l}>0\bigr{\}}-\operatorname{\mathds{1}}\bigl{\{}(s,a)^{\top}[W_{0}]_{l}>0\bigr{\}}\Bigr{|}\leq\operatorname{\mathds{1}}\Bigl{\{}\bigl{|}(s,a)^{\top}[W_{0}]_{l}\bigr{|}\leq\bigl{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\bigr{\|}_{2}\Bigr{\}}. (A.3)

Plugging (A.2) and (A.3) into (A.1), we have that

|WϕW(s,a)Wϕ0(s,a)|\displaystyle\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)-W^{\top}\phi_{0}(s,a)\bigr{|}
1ml=1m𝟙{|(s,a)[W0]l|[W]l[W0]l2}(|(s,a)[W0]l|+[W]l[W0]l2)\displaystyle\quad\leq\frac{1}{\sqrt{m}}\sum_{l=1}^{m}\operatorname{\mathds{1}}\Bigl{\{}\bigl{|}(s,a)^{\top}[W_{0}]_{l}\bigr{|}\leq\bigl{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\bigr{\|}_{2}\Bigr{\}}\cdot\Bigl{(}\bigl{|}(s,a)^{\top}[W_{0}]_{l}\bigr{|}+\bigl{\|}[W]_{l}-[W_{0}]_{l}\bigr{\|}_{2}\Bigr{)}
1ml=1m𝟙{|(s,a)[W0]l|[W]l[W0]l2}([W]l[W0]l2+[W]l[W0]l2).\displaystyle\quad\leq\frac{1}{\sqrt{m}}\sum_{l=1}^{m}\operatorname{\mathds{1}}\Big{\{}\big{|}(s,a)^{\top}[W_{0}]_{l}\big{|}\leq\big{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\big{\|}_{2}\Big{\}}\cdot\Bigl{(}\big{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\big{\|}_{2}+\big{\|}[W]_{l}-[W_{0}]_{l}\big{\|}_{2}\Bigr{)}.

By the fact that W,WSBW,W^{\prime}\in S_{B}, we obtain that

|WϕW(s,a)Wϕ0(s,a)|24B2ml=1m𝟙{|(s,a)[W0]l|[W]l[W0]l2}.\displaystyle\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)-W^{\top}\phi_{0}(s,a)\bigr{|}^{2}\leq\frac{4B^{2}}{m}\sum_{l=1}^{m}\operatorname{\mathds{1}}\Bigl{\{}\bigl{|}(s,a)^{\top}[W_{0}]_{l}\bigr{|}\leq\bigl{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\bigr{\|}_{2}\Bigr{\}}.

By setting y=[W]l[W0]l2y=\|[W^{\prime}]_{l}-[W_{0}]_{l}\|_{2} in Assumption (a), we have that

WϕW(s,a)Wϕ0(s,a)2,μ28B2ml=1mc[W]l[W0]l2[W0]l2.\displaystyle\bigl{\|}W^{\top}\phi_{W^{\prime}}(s,a)-W^{\top}\phi_{0}(s,a)\bigr{\|}_{2,\mu}^{2}\leq\frac{8B^{2}}{m}\sum_{l=1}^{m}\frac{c\cdot\bigl{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\bigr{\|}_{2}}{\bigl{\|}[W_{0}]_{l}\bigr{\|}_{2}}.

Taking the expectation with respect to the random initialization in (3.3) and using the Cauchy-Schwartz inequality, we have that

𝔼init[WϕW(s,a)Wϕ0(s,a)2,μ2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W^{\prime}}(s,a)-W^{\top}\phi_{0}(s,a)\big{\|}_{2,\mu}^{2}\Bigr{]}
𝔼init[8cB2m(l=1m[W]l[W0]l22)1/2(l=1m1/[W0]l22)1/2]\displaystyle\quad\leq\mathbb{E}_{\text{init}}\biggl{[}\frac{8cB^{2}}{m}\Bigl{(}\sum_{l=1}^{m}\bigl{\|}[W^{\prime}]_{l}-[W_{0}]_{l}\bigr{\|}_{2}^{2}\Bigr{)}^{1/2}\cdot\Bigl{(}\sum_{l=1}^{m}1/\bigl{\|}[W_{0}]_{l}\bigr{\|}_{2}^{2}\Bigr{)}^{1/2}\biggr{]}
8cB3m𝔼init[(l=1m1/[W0]l22)1/2]\displaystyle\quad\leq\frac{8cB^{3}}{m}\mathbb{E}_{\text{init}}\biggl{[}\Bigl{(}\sum_{l=1}^{m}1/\bigl{\|}[W_{0}]_{l}\bigr{\|}_{2}^{2}\Bigr{)}^{1/2}\biggr{]}
8cB3m(𝔼wN(0,Id/d)[1/w22])1/2\displaystyle\quad\leq\frac{8cB^{3}}{\sqrt{m}}\Bigl{(}\mathbb{E}_{w\sim N(0,I_{d}/d)}\bigl{[}1/\|w\|_{2}^{2}\bigr{]}\Bigr{)}^{1/2}
=𝒪(B3m1/2),\displaystyle\quad=\mathcal{O}(B^{3}\cdot m^{-1/2}),

where the second inequality follows from the fact that $\|W^{\prime}-W_{0}\|_{2}\leq B$, the third inequality follows from Jensen's inequality, and the last equality follows from Assumption (a) and the random initialization in (3.3). Thus, for any $W,W_{1},W_{2}\in S_{B}$, we have that

𝔼init[WϕW1(s,a)WϕW2(s,a)2,μ2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{W_{2}}(s,a)\big{\|}_{2,\mu}^{2}\Bigr{]}
2𝔼init[WϕW1(s,a)Wϕ0(s,a)2,μ2]+2𝔼init[WϕW2(s,a)Wϕ0(s,a)2,μ2]\displaystyle\quad\leq 2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{0}(s,a)\big{\|}_{2,\mu}^{2}\Bigr{]}+2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{2}}(s,a)-W^{\top}\phi_{0}(s,a)\big{\|}_{2,\mu}^{2}\Bigr{]}
=𝒪(B3m1/2).\displaystyle\quad=\mathcal{O}(B^{3}\cdot m^{-1/2}).

Moreover, following from the Cauchy-Schwartz inequality, we have that 1,μ2,μ\|\cdot\|_{1,\mu}\leq\|\cdot\|_{2,\mu}. Thus, by Jensen’s inequality, we have that

𝔼init[WϕW1(s,a)WϕW2(s,a)1,μ]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{W_{2}}(s,a)\big{\|}_{1,\mu}\Bigr{]}
𝔼init[WϕW1(s,a)WϕW2(s,a)2,μ]\displaystyle\quad\leq\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}W^{\top}\phi_{W_{1}}(s,a)-W^{\top}\phi_{W_{2}}(s,a)\big{\|}_{2,\mu}\Bigr{]}
=𝒪(B3/2m1/4),\displaystyle\quad=\mathcal{O}(B^{3/2}\cdot m^{-1/4}),

which completes the proof of Lemma A.2. ∎

A.2 Proof of Lemma A.3

In what follows, we present the proof of Lemma A.3.

Proof.

Recall that we write uW(s,a)=WϕW(s,a)u_{W}(s,a)=W^{\top}\phi_{W}(s,a) and u0(s,a)=uW0(s,a)u_{0}(s,a)=u_{W_{0}}(s,a). Then, we have

|WϕW(s,a)|\displaystyle\big{|}W^{\top}\phi_{W^{\prime}}(s,a)\big{|} |u0(s,a)|+|(WW)ϕW(s,a)|+|uW(s,a)u0(s,a)|\displaystyle\leq\bigl{|}u_{0}(s,a)\bigr{|}+\bigl{|}(W-W^{\prime})^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}+\bigl{|}u_{W^{\prime}}(s,a)-u_{0}(s,a)\bigr{|}
|u0(s,a)|+WW2ϕW(s,a)2+|uW(s,a)u0(s,a)|,\displaystyle\leq\bigl{|}u_{0}(s,a)\bigr{|}+\|W-W^{\prime}\|_{2}\cdot\big{\|}\phi_{W^{\prime}}(s,a)\big{\|}_{2}+\bigl{|}u_{W^{\prime}}(s,a)-u_{0}(s,a)\bigr{|}, (A.4)

where the last inequality follows from the Cauchy-Schwartz inequality. It suffices to upper bound the three terms on the right-hand side of (A.4). Since $W,W^{\prime}\in S_{B}$ and $\|\phi_{W^{\prime}}(s,a)\|_{2}\leq 1$, we have that

WW2ϕW(s,a)22B.\displaystyle\|W-W^{\prime}\|_{2}\cdot\big{\|}\phi_{W^{\prime}}(s,a)\big{\|}_{2}\leq 2B. (A.5)

It remains to upper bound the term $|u_{W^{\prime}}(s,a)-u_{0}(s,a)|$ in (A.4). Note that $u_{W}(s,a)$ is almost everywhere differentiable with respect to $W$ and that $\nabla_{W}u_{W}(s,a)=\phi_{W}(s,a)$. Thus, following from the mean-value theorem and the Cauchy-Schwartz inequality, we have that

|uW(s,a)u0(s,a)|supWSBϕW(s,a)2WW02B,\displaystyle\bigl{|}u_{W^{\prime}}(s,a)-u_{0}(s,a)\bigr{|}\leq\sup_{W\in S_{B}}\big{\|}\phi_{W}(s,a)\big{\|}_{2}\cdot\|W^{\prime}-W_{0}\|_{2}\leq B, (A.6)

where the second inequality follows from the fact that $\|\phi_{W}(s,a)\|_{2}\leq 1$ and $W^{\prime}\in S_{B}$. Plugging (A.5) and (A.6) into (A.4), we have that

sup(s,a)𝒮×𝒜|WϕW(s,a)|sup(s,a)𝒮×𝒜|u0(s,a)|+3B.\displaystyle\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}\leq\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}u_{0}(s,a)\bigr{|}+3B.

Following from Assumption 4.2, we have that sup(s,a)𝒮×𝒜|WϕW(s,a)|\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}|W^{\top}\phi_{W^{\prime}}(s,a)| is sub-Gaussian. Furthermore, it holds that

𝔼init[sup(s,a)𝒮×𝒜|WϕW(s,a)|2]2𝔼init[sup(s,a)𝒮×𝒜|u0(s,a)|2]+18B22M02+18B2\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}^{2}\Bigr{]}\leq 2\mathbb{E}_{\text{init}}\Bigl{[}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}u_{0}(s,a)\bigr{|}^{2}\Bigr{]}+18B^{2}\leq 2M_{0}^{2}+18B^{2}

and that

(sup(s,a)𝒮×𝒜|WϕW(s,a)|>t)\displaystyle\mathbb{P}\Bigl{(}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}W^{\top}\phi_{W^{\prime}}(s,a)\bigr{|}>t\Bigr{)} (sup(s,a)𝒮×𝒜|u0(s,a)|+3B>t)\displaystyle\leq\mathbb{P}\Bigl{(}\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}u_{0}(s,a)\bigr{|}+3B>t\Bigr{)}
exp(v(t3B)2)exp(vt2/2)\displaystyle\leq\exp\bigl{(}-v\cdot(t-3B)^{2}\bigr{)}\leq\exp(-v\cdot t^{2}/2)

for t>2M0+6Bt>2M_{0}+6B. Thus, we complete the proof of Lemma A.3. ∎

Appendix B Neural Temporal Difference

In this section, we introduce neural TD (Cai et al., 2019c), which computes ωk\omega_{k} in Algorithm 1. Specifically, neural TD solves the optimization problem in (3.19) via the update in (3.2), which is presented in Algorithm 2.

Algorithm 2 Neural TD
0:  Policy $\pi$, reward function $r$, initialization $W_{0},b$, number of iterations $T_{\mathrm{TD}}$ of neural TD, and stepsize $\alpha$ of neural TD.
1:  Initialization. Set $S_{B_{\omega}}\leftarrow\{W\in\mathbb{R}^{md}\,|\,\|W-W_{0}\|_{2}\leq B_{\omega}\}$ and $\omega(0)\leftarrow W_{0}$.
2:  for $j=0,\ldots,T_{\mathrm{TD}}-1$ do
3:     Sample $(s,a,s^{\prime},a^{\prime})$, where $(s,a)\sim\rho_{\pi}$, $s^{\prime}\sim P(\cdot\,|\,s,a)$, and $a^{\prime}\sim\pi(\cdot\,|\,s^{\prime})$.
4:     Compute the Bellman residual $\delta(j)=Q_{\omega(j)}(s,a)-(1-\gamma)\cdot r(s,a)-\gamma\cdot Q_{\omega(j)}(s^{\prime},a^{\prime})$.
5:     Update $\omega$ via $\omega(j+1)\leftarrow\text{Proj}_{S_{B_{\omega}}}\{\omega(j)-\alpha\cdot\delta(j)\cdot\phi_{\omega(j)}(s,a)\}$.
6:  end for
7:  Output $\bar{\omega}=T_{\mathrm{TD}}^{-1}\sum_{j=0}^{T_{\mathrm{TD}}-1}\omega(j)$.
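The following is a minimal sketch of Algorithm 2 in NumPy, assuming a sampling oracle sample_transition that returns a tuple $(s,a,s^{\prime},a^{\prime})$ as in line 3 and a reward oracle reward(s, a); the state-action value function is the two-layer network $Q_{\omega}(s,a)=\omega^{\top}\phi_{\omega}(s,a)$ with the ReLU features from Appendix A, and the projection onto $S_{B_{\omega}}$ is written out explicitly. All names are illustrative and not part of the paper.

import numpy as np

def project_ball(w, w0, radius):
    # Projection onto S_{B_omega} = {w : ||w - w0||_2 <= radius}.
    diff = w - w0
    norm = np.linalg.norm(diff)
    return w if norm <= radius else w0 + diff * (radius / norm)

def q_and_feature(w, x, m, d):
    # Q_w(s, a) = w^T phi_w(s, a) for the two-layer ReLU network of Appendix A.
    active = (w.reshape(m, d) @ x > 0).astype(float)
    phi = (active[:, None] * x[None, :]).reshape(-1) / np.sqrt(m)
    return float(w @ phi), phi

def neural_td(sample_transition, reward, W0, B_omega, T_td, alpha, gamma):
    m, d = W0.shape
    w0 = W0.reshape(-1)
    w = w0.copy()
    iterates = []
    for _ in range(T_td):
        s, a, s_next, a_next = sample_transition()
        x = np.concatenate([s, a])
        x_next = np.concatenate([s_next, a_next])
        q, phi = q_and_feature(w, x, m, d)
        q_next, _ = q_and_feature(w, x_next, m, d)
        # Bellman residual of line 4 and projected semi-gradient update of line 5.
        delta = q - (1.0 - gamma) * reward(s, a) - gamma * q_next
        w = project_ball(w - alpha * delta * phi, w0, B_omega)
        iterates.append(w.copy())
    # Averaged output, as in the last line of Algorithm 2.
    return np.mean(iterates, axis=0)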

B.1 Proof of Proposition 4.4

Proof.

We obtain the following proposition from Cai et al. (2019c), which characterizes the convergence of Algorithm 2.

Proposition B.1 (Proposition 4.7 in Cai et al. (2019c)).

We set α=min{(1γ)/8,TTD1/2}\alpha=\min\{(1-\gamma)/8,T^{-1/2}_{{\mathrm{TD}}}\} in Algorithm 2. Let Qω¯(s,a)Q_{\bar{\omega}}(s,a) be the state-action value function associated with the output ω¯\bar{\omega}. Under Assumption (a), it holds for any policy π\pi and reward function r(s,a)r(s,a) that

𝔼init[Qω¯(s,a)Qrπ(s,a)2,ρπ2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}Q_{\bar{\omega}}(s,a)-Q^{\pi}_{r}(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]} =2𝔼init[ProjBω,mQrπ(s,a)Qrπ(s,a)2,ρπ2]\displaystyle=2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\text{Proj}_{\mathcal{F}_{B_{\omega},m}}Q^{\pi}_{r}(s,a)-Q^{\pi}_{r}(s,a)\big{\|}_{2,\rho_{\pi}}^{2}\Bigr{]}
+𝒪(Bω2TTD1/2+Bω3m1/2+Bω5/2m1/4),\displaystyle\quad+\mathcal{O}(B_{\omega}^{2}\cdot T^{-1/2}_{{\mathrm{TD}}}+B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}), (B.1)

where Bω,m\mathcal{F}_{B_{\omega},m} is defined in Definition A.1.

Recall that we denote by ϕ0(s,a)\phi_{0}(s,a) the feature vector corresponding to the random initialization in (3.3). We establish the following lemma to upper bound the bias 𝔼init[ProjBω,mQrπ(s,a)Qrπ(s,a)2,ρπ2]\mathbb{E}_{\text{init}}[\|\text{Proj}_{\mathcal{F}_{B_{\omega},m}}Q^{\pi}_{r}(s,a)-Q^{\pi}_{r}(s,a)\|^{2}_{2,\rho_{\pi}}] in (B.1) of Proposition B.1 when the reward function r(s,a)r(s,a) belongs to the reward function class β\mathcal{R}_{\beta}.

Lemma B.2.

We consider any reward function rβ(s,a)βr_{\beta}(s,a)\in\mathcal{R}_{\beta} and policy π\pi. Under Assumptions 4.1 and 4.2, it holds for Bω>Bβ+(1γ)1γBP(2M0+3Bβ)B_{\omega}>B_{\beta}+(1-\gamma)^{-1}\cdot\gamma\cdot B_{P}\cdot(2M_{0}+3B_{\beta}) and an absolute constant Cv=(2γ2BP2)1(1γ)2vC_{v}=(2\cdot\gamma^{2}\cdot B_{P}^{2})^{-1}\cdot(1-\gamma)^{2}\cdot v that

𝔼init[ProjBω,mQrβπ(s,a)Qrβπ(s,a)2,ρπ2]=𝒪(Bβ3m1/2+Bω2m1+Bω2exp(CvBω2)).\displaystyle\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}\text{Proj}_{\mathcal{F}_{B_{\omega},m}}Q_{r_{\beta}}^{\pi}(s,a)-Q_{r_{\beta}}^{\pi}(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]}=\mathcal{O}\bigl{(}B_{\beta}^{3}\cdot m^{-1/2}+B_{\omega}^{2}\cdot m^{-1}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr{)}.
Proof.

See §B.2 for a detailed proof. ∎

Combining Proposition B.1 and Lemma B.2, for Bω>Bβ+(1γ)1γBP(2M0+3Bβ)B_{\omega}>B_{\beta}+(1-\gamma)^{-1}\cdot\gamma\cdot B_{P}\cdot(2M_{0}+3B_{\beta}), we have for any π\pi that

𝔼init[Qω¯(s,a)Qrβπ(s,a)2,ρπ2]=𝒪(Bω2TTD1/2+Bω3m1/2+Bω5/2m1/4+Bω2exp(CvBω2)).\displaystyle\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}Q_{\bar{\omega}}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]}=\mathcal{O}\bigl{(}B_{\omega}^{2}\cdot T^{-1/2}_{{\mathrm{TD}}}+B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr{)}.

Finally, by setting TTD=Ω(m)T_{{\mathrm{TD}}}=\Omega(m), we have that

𝔼init[Qω¯(s,a)Qrβπ(s,a)2,ρπ2]=𝒪(Bω3m1/2+Bω5/2m1/4+Bω2exp(CvBω2)),\displaystyle\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}Q_{\bar{\omega}}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]}=\mathcal{O}\bigl{(}B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr{)},

which completes the proof of Proposition 4.4. ∎

B.2 Proof of Lemma B.2

Proof.

For notational simplicity, we write ϑ(s,a;w)=𝟙{|w(s,a)|>0}(s,a)\vartheta(s,a;w)=\operatorname{\mathds{1}}{\{|w^{\top}(s,a)|>0\}}\cdot(s,a). Under Assumption 4.2, we have that

P(s|s,a)=ϑ(s,a;w)φ(s;w)dq(w),where supwφ(s;w)ds2BP.\displaystyle P(s^{\prime}\,|\,s,a)=\int\vartheta(s,a;w)^{\top}\varphi(s^{\prime};w){\mathrm{d}}q(w),\quad\text{where }\sup_{w}\biggl{\|}\int\varphi(s;w){\mathrm{d}}s\biggr{\|}_{2}\leq B_{P}. (B.2)

Thus, since rβ=(1γ)1uβ(s,a)r_{\beta}=(1-\gamma)^{-1}\cdot u_{\beta}(s,a), we have that

Qrβπ(s,a)\displaystyle Q^{\pi}_{r_{\beta}}(s,a) =(1γ)rβ(s,a)+γ𝒮P(s|s,a)Vrβπ(s)ds\displaystyle=(1-\gamma)\cdot r_{\beta}(s,a)+\gamma\cdot\int_{\mathcal{S}}P(s^{\prime}\,|\,s,a)\cdot V^{\pi}_{r_{\beta}}(s^{\prime}){\mathrm{d}}s^{\prime}
=uβ(s,a)+𝒮γVrβπ(s)ϑ(s,a;w)φ(s;w)dq(w)ds\displaystyle=u_{\beta}(s,a)+\int_{\mathcal{S}}\gamma\cdot V^{\pi}_{r_{\beta}}(s^{\prime})\cdot\int\vartheta(s,a;w)^{\top}\varphi(s^{\prime};w){\mathrm{d}}q(w){\mathrm{d}}s^{\prime}
=uβ(s,a)+ϑ(s,a;w)(γ𝒮φ(s;w)Vrβπ(s)ds)dq(w),\displaystyle=u_{\beta}(s,a)+\int\vartheta(s,a;w)^{\top}\biggl{(}\gamma\cdot\int_{\mathcal{S}}\varphi(s^{\prime};w)V^{\pi}_{r_{\beta}}(s^{\prime}){\mathrm{d}}s^{\prime}\biggr{)}{\mathrm{d}}q(w),

where the second equality follows from (B.2) and the last equality follows from Fubini’s theorem. In the sequel, we define

α(w)=γ𝒮φ(s;w)Vrβπ(s)ds.\displaystyle\alpha(w)=\gamma\cdot\int_{\mathcal{S}}\varphi(s^{\prime};w)V^{\pi}_{r_{\beta}}(s^{\prime}){\mathrm{d}}s^{\prime}. (B.3)

Note that α(w)d\alpha(w)\in\mathbb{R}^{d}. Then, we have that

Qrβπ(s,a)=uβ(s,a)+ϑ(s,a;w)α(w)dq(w).\displaystyle Q^{\pi}_{r_{\beta}}(s,a)=u_{\beta}(s,a)+\int\vartheta(s,a;w)^{\top}\alpha(w){\mathrm{d}}q(w).

To prove Lemma B.2, we first approximate Qrβπ(s,a)Q^{\pi}_{r_{\beta}}(s,a) by

Q¯(s,a)=uβ(s,a)+ϑ(s,a;w)α¯(w)dq(w),\displaystyle\bar{Q}(s,a)=u_{\beta}(s,a)+\int\vartheta(s,a;w)^{\top}\bar{\alpha}(w){\mathrm{d}}q(w), (B.4)

where α¯(w)=α(w)𝟙{α(w)2K}\bar{\alpha}(w)=\alpha(w)\cdot\operatorname{\mathds{1}}\{\|\alpha(w)\|_{2}\leq K\} for an absolute constant K>0K>0 specified later. Then, it holds for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A} that

|Q¯(s,a)Qrβπ(s,a)|\displaystyle\bigl{|}\bar{Q}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\bigr{|} |ϑ(s,a;w)(α¯(w)α(w))|dq(w)\displaystyle\leq\int\Bigl{|}\vartheta(s,a;w)^{\top}\bigl{(}\bar{\alpha}(w)-\alpha(w)\bigr{)}\Bigr{|}{\mathrm{d}}q(w)
ϑ(s,a;w)2α¯(w)α(w)2dq(w)\displaystyle\leq\int\big{\|}\vartheta(s,a;w)\big{\|}_{2}\cdot\big{\|}\bar{\alpha}(w)-\alpha(w)\big{\|}_{2}{\mathrm{d}}q(w)
supwα¯(w)α(w)2,\displaystyle\leq\sup_{w}\big{\|}\bar{\alpha}(w)-\alpha(w)\big{\|}_{2},

where the second inequality follows from the Cauchy-Schwartz inequality and the last inequality follows from the fact that ϑ(s,a;w)21\|\vartheta(s,a;w)\|_{2}\leq 1. Thus, we have that

Q¯(s,a)Qrβπ(s,a)2,ρπQ¯(s,a)Qrβπ(s,a)supwα¯(w)α(w)2.\displaystyle\big{\|}\bar{Q}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\big{\|}_{2,\rho_{\pi}}\leq\big{\|}\bar{Q}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\big{\|}_{\infty}\leq\sup_{w}\big{\|}\bar{\alpha}(w)-\alpha(w)\big{\|}_{2}. (B.5)

We now upper bound the right-hand side of (B.5). To this end, we show that supwα(w)2\sup_{w}\|\alpha(w)\|_{2} is sub-Gaussian in the sequel. By the definition of α(w)\alpha(w) in (B.3), we have that

supwα(w)2\displaystyle\sup_{w}\bigl{\|}\alpha(w)\bigr{\|}_{2} =γ𝒮φ(s;w)Vrβπ(s)ds2\displaystyle=\gamma\cdot\biggl{\|}\int_{\mathcal{S}}\varphi(s^{\prime};w)V^{\pi}_{r_{\beta}}(s^{\prime}){\mathrm{d}}s^{\prime}\biggr{\|}_{2}
γsups𝒮|Vrβπ(s)|supw𝒮φ(s;w)ds2\displaystyle\leq\gamma\cdot\sup_{s^{\prime}\in{\mathcal{S}}}\bigl{|}V^{\pi}_{r_{\beta}}(s^{\prime})\bigr{|}\cdot\sup_{w}\biggl{\|}\int_{\mathcal{S}}\varphi(s^{\prime};w){\mathrm{d}}s^{\prime}\biggr{\|}_{2}
γBPsups𝒮|Vrβπ(s)|\displaystyle\leq\gamma\cdot B_{P}\cdot\sup_{s^{\prime}\in{\mathcal{S}}}\bigl{|}V^{\pi}_{r_{\beta}}(s^{\prime})\bigr{|}
γ(1γ)1BPsup(s,a)𝒮×𝒜|uβ(s,a)|,\displaystyle\leq\gamma\cdot(1-\gamma)^{-1}\cdot B_{P}\cdot\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}u_{\beta}(s,a)\bigr{|}, (B.6)

where the second inequality follows from Assumption 4.2 and the third inequality follows from the fact that $V_{r_{\beta}}^{\pi}(s)=\mathbb{E}_{(s^{\prime},a^{\prime})\sim\nu_{\pi}(s)}[r_{\beta}(s^{\prime},a^{\prime})]$. Here we denote by $\nu_{\pi}(s)$ the state-action visitation measure starting from the state $s$ and following the policy $\pi$. Following from Lemma A.3, we have that $\sup_{w}\|\alpha(w)\|_{2}$ is sub-Gaussian. By Lemma A.3 and (B.6), it holds for $t>(1-\gamma)^{-1}\cdot\gamma\cdot B_{P}\cdot(2M_{0}+3B_{\beta})$ that

(supwα(w)2>t)\displaystyle\mathbb{P}\Bigl{(}\sup_{w}\bigl{\|}\alpha(w)\bigr{\|}_{2}>t\Bigr{)} (γ(1γ)1BPsup(s,a)𝒮×𝒜|uβ(s,a)|>t)\displaystyle\leq\mathbb{P}\Bigl{(}\gamma\cdot(1-\gamma)^{-1}\cdot B_{P}\cdot\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl{|}u_{\beta}(s,a)\bigr{|}>t\Bigr{)}
exp(v(1γ)2t22γ2BP2).\displaystyle\leq\exp\biggl{(}-\frac{v\cdot(1-\gamma)^{2}\cdot t^{2}}{2\gamma^{2}\cdot B_{P}^{2}}\biggr{)}. (B.7)
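For completeness, the constant $C_{v}$ in (B.7) comes from evaluating the tail bound of Lemma A.3 (with $B=B_{\beta}$) at the rescaled level $(1-\gamma)\cdot t/(\gamma\cdot B_{P})$; a one-line check is

\displaystyle\mathbb{P}\Bigl(\sup_{(s,a)\in{\mathcal{S}}\times\mathcal{A}}\bigl|u_{\beta}(s,a)\bigr|>\frac{(1-\gamma)\cdot t}{\gamma\cdot B_{P}}\Bigr)\leq\exp\Bigl(-\frac{v}{2}\cdot\frac{(1-\gamma)^{2}\cdot t^{2}}{\gamma^{2}\cdot B_{P}^{2}}\Bigr)=\exp(-C_{v}\cdot t^{2}),

which matches the definition of $C_{v}$ in Lemma B.2.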

Let the absolute constant $K$ satisfy $K>(1-\gamma)^{-1}\cdot\gamma\cdot B_{P}\cdot(2M_{0}+3B_{\beta})$ so that the tail bound in (B.7) applies for $t\geq K$. For notational simplicity, we write $C_{v}=(2\cdot\gamma^{2}\cdot B_{P}^{2})^{-1}\cdot v\cdot(1-\gamma)^{2}$. By the fact that $\|\bar{\alpha}(w)-\alpha(w)\|_{2}=\|\alpha(w)\|_{2}\cdot\operatorname{\mathds{1}}\{\|\alpha(w)\|_{2}>K\}$, we have that

supwα¯(w)α(w)2supwα(w)2𝟙{supwα(w)2>K}.\displaystyle\sup_{w}\big{\|}\bar{\alpha}(w)-\alpha(w)\big{\|}_{2}\leq\sup_{w}\big{\|}\alpha(w)\big{\|}_{2}\cdot\operatorname{\mathds{1}}\Bigl{\{}\sup_{w}\|\alpha(w)\|_{2}>K\Bigr{\}}.

Following from (B.5) and (B.7), we have that

𝔼init[Q¯(s,a)Qrβπ(s,a)2,ρπ]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\bar{Q}(s,a)-Q^{\pi}_{r_{\beta}}(s,a)\big{\|}_{2,\rho_{\pi}}\Bigr{]}
𝔼[supwα(w)2𝟙{supwα(w)2>K}]\displaystyle\quad\leq\mathbb{E}\biggl{[}\sup_{w}\big{\|}\alpha(w)\big{\|}_{2}\cdot\operatorname{\mathds{1}}\Bigl{\{}\sup_{w}\|\alpha(w)\|_{2}>K\Bigr{\}}\biggr{]}
0Kt(supwα(w)2>K)dt+Kt(supwα(w)2>t)dt\displaystyle\quad\leq\int_{0}^{K}t\cdot\mathbb{P}\Bigl{(}\sup_{w}\|\alpha(w)\|_{2}>K\Bigr{)}{\mathrm{d}}t+\int_{K}^{\infty}t\cdot\mathbb{P}\Bigl{(}\sup_{w}\|\alpha(w)\|_{2}>t\Bigr{)}{\mathrm{d}}t
=𝒪(K2exp(CvK2)).\displaystyle\quad=\mathcal{O}\bigl{(}K^{2}\cdot\exp(-C_{v}\cdot K^{2})\bigr{)}. (B.8)
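The order of the last step can be made explicit: under the tail bound in (B.7), both integrals above admit closed-form bounds,

\displaystyle\int_{0}^{K}t\cdot\mathbb{P}\Bigl(\sup_{w}\|\alpha(w)\|_{2}>K\Bigr){\mathrm{d}}t\leq\frac{K^{2}}{2}\cdot\exp(-C_{v}\cdot K^{2}),\qquad\int_{K}^{\infty}t\cdot\exp(-C_{v}\cdot t^{2}){\mathrm{d}}t=\frac{\exp(-C_{v}\cdot K^{2})}{2C_{v}},

and since $C_{v}$ is an absolute constant, both terms are of order $K^{2}\cdot\exp(-C_{v}\cdot K^{2})$ (up to absolute constants) for $K\geq 1$, which gives (B.8).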

We now construct $\widehat{Q}(s,a)\in\mathcal{F}_{B_{\beta}+K,m}$, which approximates $\bar{Q}(s,a)$ defined in (B.4). We define

f(s,a)=ϑ(s,a;w)α¯(w)dq(ω).\displaystyle f(s,a)=\int\vartheta(s,a;w)^{\top}\bar{\alpha}(w){\mathrm{d}}q(\omega).

Then, we have that Q¯(s,a)=uβ(s,a)+f(s,a)\bar{Q}(s,a)=u_{\beta}(s,a)+f(s,a). Note that f(s,a)f(s,a) belongs to the following function class,

~K,={ϑ(s,a;w)α(w)dq(ω)|supwα(w)2K}.\displaystyle\widetilde{\mathcal{F}}_{K,\infty}=\biggl{\{}\int\vartheta(s,a;w)^{\top}\alpha(w){\mathrm{d}}q(\omega)\,\bigg{|}\,\sup_{w}\big{\|}\alpha(w)\big{\|}_{2}\leq K\biggr{\}}.

We now show that f(s,a)f(s,a) is well approximated by the following function class,

~K,m={Wϕ0(s,a)=1ml=1m[W]lϑ(s,a;[W]l)|supl[W]l2K/m},\displaystyle\widetilde{\mathcal{F}}_{K,m}=\biggl{\{}W^{\top}\phi_{0}(s,a)=\frac{1}{\sqrt{m}}\sum_{l=1}^{m}[W]_{l}^{\top}\vartheta\bigl{(}s,a;[W]_{l}\bigr{)}\,\bigg{|}\,\sup_{l}\big{\|}[W]_{l}\big{\|}_{2}\leq K/\sqrt{m}\biggr{\}},

where ϕ0(s,a)\phi_{0}(s,a) is the feature vector corresponding to the random initialization. We obtain the following lemma from Rahimi and Recht (2009), which characterizes the approximation error of ~K,\widetilde{\mathcal{F}}_{K,\infty} by ~K,m\widetilde{\mathcal{F}}_{K,m}.

Lemma B.3 (Lemma 1 in Rahimi and Recht (2009)).

For any f(s,a)~K,f(s,a)\in\widetilde{\mathcal{F}}_{K,\infty}, it holds with probability at least 1δ1-\delta that

Proj~K,mf(s,a)f(s,a)2,μKm1/2(1+2log(1/δ)),\displaystyle\big{\|}\text{Proj}_{\widetilde{\mathcal{F}}_{K,m}}f(s,a)-f(s,a)\big{\|}_{2,\mu}\leq K\cdot m^{-1/2}\cdot\bigl{(}1+\sqrt{2\log(1/\delta)}\bigr{)},

where μ𝒫(𝒮×𝒜)\mu\in{\mathcal{P}}({\mathcal{S}}\times\mathcal{A}).

Lemma B.3 implies that there exists f^(s,a)~K,m\widehat{f}(s,a)\in\widetilde{\mathcal{F}}_{K,m} such that

𝔼init[f^(s,a)f(s,a)2,ρπ2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\widehat{f}(s,a)-f(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]} =0(f^(s,a)f(s,a)2,ρπ2>y)dy\displaystyle=\int_{0}^{\infty}\mathbb{P}\Bigl{(}\big{\|}\widehat{f}(s,a)-f(s,a)\big{\|}^{2}_{2,\rho_{\pi}}>y\Bigr{)}{\mathrm{d}}y
\displaystyle\quad\leq\int_{0}^{\infty}\exp\bigl(-1/2\cdot(\sqrt{my}/K-1)^{2}\bigr){\mathrm{d}}y=\mathcal{O}(K^{2}/m). (B.9)

By the fact that f^(s,a)~K,m\widehat{f}(s,a)\in\widetilde{\mathcal{F}}_{K,m} and the definition of K,m\mathcal{F}_{K,m} in Definition A.1, we have that f^(s,a)K,mu0(s,a)\widehat{f}(s,a)\in\mathcal{F}_{K,m}-u_{0}(s,a). Let

\displaystyle\widehat{Q}(s,a)=\beta^{\top}\phi_{0}(s,a)+\widehat{f}(s,a)=(\beta+W_{f})^{\top}\phi_{0}(s,a),
where $W_{f}\in\mathbb{R}^{md}$ denotes the parameter of $\widehat{f}(s,a)=W_{f}^{\top}\phi_{0}(s,a)\in\widetilde{\mathcal{F}}_{K,m}$.

We then have that Q^(s,a)Bβ+K,m\widehat{Q}(s,a)\in\mathcal{F}_{B_{\beta}+K,m} and that

𝔼init[Q¯(s,a)Q^(s,a)2,ρk2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\bar{Q}(s,a)-\widehat{Q}(s,a)\big{\|}^{2}_{2,\rho_{k}}\Bigr{]} 2𝔼init[uβ(s,a)βϕ0(s,a)2,ρπ2]+2𝔼init[f^(s,a)f(s,a)2,ρπ2]\displaystyle\leq 2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}u_{\beta}(s,a)-\beta^{\top}\phi_{0}(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]}+2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\widehat{f}(s,a)-f(s,a)\big{\|}^{2}_{2,\rho_{\pi}}\Bigr{]}
=𝒪(Bβ3m1/2+K2m1),\displaystyle=\mathcal{O}(B_{\beta}^{3}\cdot m^{-1/2}+K^{2}\cdot m^{-1}), (B.10)

where the last inequality follows from Assumption (a), Lemma A.2, and (B.9).

Finally, we set $B_{\omega}=K+B_{\beta}>B_{\beta}+(1-\gamma)^{-1}\cdot\gamma\cdot B_{P}\cdot(2M_{0}+3B_{\beta})$. Combining (B.8) and (B.10), we have that

𝔼init[Qrβπ(s,a)Q^(s,a)2,ρk2]\displaystyle\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}Q_{r_{\beta}}^{\pi}(s,a)-\widehat{Q}(s,a)\big{\|}^{2}_{2,\rho_{k}}\Bigr{]} 2𝔼init[Q¯(s,a)Q^(s,a)2,ρk2]+2𝔼init[Q¯(s,a)Qrβπ(s,a)2,ρk2]\displaystyle\leq 2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\bar{Q}(s,a)-\widehat{Q}(s,a)\big{\|}^{2}_{2,\rho_{k}}\Bigr{]}+2\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}\bar{Q}(s,a)-Q_{r_{\beta}}^{\pi}(s,a)\big{\|}^{2}_{2,\rho_{k}}\Bigr{]}
=𝒪(Bβ3m1/2+Bω2m1+Bω2exp(CvBω2)),\displaystyle=\mathcal{O}\bigl{(}B_{\beta}^{3}\cdot m^{-1/2}+B_{\omega}^{2}\cdot m^{-1}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr{)},

where Q^(s,a)Bω,m\widehat{Q}(s,a)\in\mathcal{F}_{B_{\omega},m}. Thus, we complete the proof of Lemma B.2. ∎

Appendix C Proofs of Auxiliary Results

In what follows, we present the proofs of the lemmas in §3-5.

C.1 Proof of Proposition 3.1

Proof.

By the definition of the neural network in (3.1), we have for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A} that WuW(s,a)=ϕW(s,a)\nabla_{W}u_{W}(s,a)=\phi_{W}(s,a) almost everywhere. We first calculate θL(θ,β)\nabla_{\theta}L(\theta,\beta). Following from the policy gradient theorem (Sutton and Barto, 2018) and the definition of L(θ,β)L(\theta,\beta) in (2.4), we have that

θL(θ,β)\displaystyle\nabla_{\theta}L(\theta,\beta) =θJ(πθ;rβ)\displaystyle=-\nabla_{\theta}J(\pi_{\theta};r_{\beta})
=𝔼νπθ[Qrβπθ(s,a)θlogπθ(a|s)].\displaystyle=-\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}Q^{\pi_{\theta}}_{r_{\beta}}(s,a)\cdot\nabla_{\theta}\log\pi_{\theta}(a\,|\,s)\bigr{]}. (C.1)

Following from the parameterization of πθ\pi_{\theta} in (3.5) and the definition of ιθ(s,a)\iota_{\theta}(s,a) in (3.8) of Proposition 3.1, we have that

\displaystyle\nabla_{\theta}\log\pi_{\theta}(a\,|\,s)=\tau\cdot\phi_{\theta}(s,a)-\frac{\sum_{a^{\prime}\in\mathcal{A}}\tau\cdot\exp\bigl(\tau\cdot\theta^{\top}\phi_{\theta}(s,a^{\prime})\bigr)\cdot\phi_{\theta}(s,a^{\prime})}{\sum_{a^{\prime}\in\mathcal{A}}\exp\bigl(\tau\cdot\theta^{\top}\phi_{\theta}(s,a^{\prime})\bigr)}
\displaystyle\quad=\tau\cdot\bigl(\phi_{\theta}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi_{\theta}(\cdot\,|\,s)}\bigl[\phi_{\theta}(s,a^{\prime})\bigr]\bigr)=\tau\cdot\iota_{\theta}(s,a). (C.2)

Plugging (C.2) into (C.1), we have that

θL(θ,β)=τ𝔼νπθ[Qrβπθ(s,a)ιθ(s,a)].\displaystyle\nabla_{\theta}L(\theta,\beta)=-\tau\cdot\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}Q^{\pi_{\theta}}_{r_{\beta}}(s,a)\cdot\iota_{\theta}(s,a)\bigr{]}.

It remains to calculate $\mathcal{I}(\theta)$ and $\nabla_{\beta}L(\theta,\beta)$. By (C.2) and the definition of $\mathcal{I}(\theta)$ in (3.7), it holds that

(θ)\displaystyle\mathcal{I}(\theta) =𝔼νπθ[logπθ(a|s)logπθ(a|s)]\displaystyle=\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}\nabla\log\pi_{\theta}(a\,|\,s)\nabla\log\pi_{\theta}(a\,|\,s)^{\top}\bigr{]}
=τ2𝔼νπθ[ιθ(s,a)ιθ(s,a)].\displaystyle=\tau^{2}\cdot\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}\iota_{\theta}(s,a)\iota_{\theta}(s,a)^{\top}\bigr{]}.

By the definition of the objective function L(θ,β)L(\theta,\beta) in (2.4), it holds that

βL(θ,β)\displaystyle\nabla_{\beta}L(\theta,\beta) =βJ(πE;rβ)βJ(πθ;rβ)λβψ(β)\displaystyle=\nabla_{\beta}J(\pi_{\text{{\tiny E}}};r_{\beta})-\nabla_{\beta}J(\pi_{\theta};r_{\beta})-\lambda\cdot\nabla_{\beta}\psi(\beta)
=𝔼νE[βrβ(s,a)]𝔼νπθ[βrβ(s,a)]λβψ(β)\displaystyle=\mathbb{E}_{\nu_{\text{{\tiny E}}}}\bigl{[}\nabla_{\beta}r_{\beta}(s,a)\bigr{]}-\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}\nabla_{\beta}r_{\beta}(s,a)\bigr{]}-\lambda\cdot\nabla_{\beta}\psi(\beta)
=(1γ)1𝔼νE[ϕβ(s,a)](1γ)1𝔼νπθ[ϕβ(s,a)]λβψ(β).\displaystyle=(1-\gamma)^{-1}\cdot\mathbb{E}_{\nu_{\text{{\tiny E}}}}\bigl{[}\phi_{\beta}(s,a)\bigr{]}-(1-\gamma)^{-1}\cdot\mathbb{E}_{\nu_{\pi_{\theta}}}\bigl{[}\phi_{\beta}(s,a)\bigr{]}-\lambda\cdot\nabla_{\beta}\psi(\beta).

Thus, we complete the proof of Proposition 3.1. ∎
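As a numerical sanity check on the score formula (C.2), one can treat the feature vectors $\phi_{\theta}(s,\cdot)$ as fixed (mirroring the identity $\nabla_{W}u_{W}(s,a)=\phi_{W}(s,a)$ used above) and compare a finite-difference gradient of $\log\pi_{\theta}(a\,|\,s)$ with $\tau\cdot\iota_{\theta}(s,a)=\tau\cdot(\phi_{\theta}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi_{\theta}(\cdot|s)}[\phi_{\theta}(s,a^{\prime})])$. The sketch below uses random feature vectors and is illustrative only; it is not part of the paper.

import numpy as np

def softmax_policy(theta, Phi, tau):
    # pi_theta(a|s) proportional to exp(tau * theta^T Phi[a]); Phi has shape (|A|, dim).
    logits = tau * (Phi @ theta)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def score(theta, Phi, tau, a):
    # tau * iota_theta(s, a) = tau * (Phi[a] - E_{a'~pi}[Phi[a']]), as in (C.2).
    pi = softmax_policy(theta, Phi, tau)
    return tau * (Phi[a] - pi @ Phi)

# Finite-difference check of nabla_theta log pi_theta(a|s) against the closed form.
rng = np.random.default_rng(0)
Phi, theta, tau, a = rng.normal(size=(5, 4)), rng.normal(size=4), 2.0, 3
eps, grad_fd = 1e-6, np.zeros(4)
for i in range(4):
    e = np.zeros(4); e[i] = eps
    grad_fd[i] = (np.log(softmax_policy(theta + e, Phi, tau)[a])
                  - np.log(softmax_policy(theta - e, Phi, tau)[a])) / (2 * eps)
print(np.allclose(grad_fd, score(theta, Phi, tau, a), atol=1e-5))  # expected: True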

C.2 Proof of Lemma 5.2

Proof.

The proof of Lemma 5.2 is similar to that of Lemmas 5.4 and 5.5 in Wang et al. (2019). By direct calculation, we have that

η𝔼dE[Qrkπk(s,),πEsπks𝒜]=KLdE(πEπk)KLdE(πEπk+1)+ηΔk(i),\displaystyle\eta\cdot\mathbb{E}_{d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}Q^{\pi_{k}}_{r_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-{\pi_{k}^{s}}\bigr{\rangle}_{\mathcal{A}}\Bigr{]}={\mathrm{KL}}^{d_{\text{{\tiny E}}}}(\pi_{\text{{\tiny E}}}\,\|\,\pi_{k})-{\mathrm{KL}}^{d_{\text{{\tiny E}}}}(\pi_{\text{{\tiny E}}}\,\|\,\pi_{k+1})+\eta\cdot\Delta_{k}^{\text{(i)}},

where Δk(i)\Delta_{k}^{\text{(i)}} takes the form of

Δk(i)\displaystyle\Delta_{k}^{\text{(i)}} =η1{𝔼dE[log(πk+1s/πks)ηQrkπk(s,),πEsπks𝒜+log(πk+1s/πks),πksπk+1s𝒜]KLdE(πk+1sπks)}\displaystyle=\eta^{-1}\cdot\biggl{\{}\mathbb{E}_{d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot Q^{\pi_{k}}_{r_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}+\bigl{\langle}\log({\pi_{k+1}^{s}}/{\pi_{k}^{s}}),\pi_{k}^{s}-\pi_{k+1}^{s}\bigr{\rangle}_{\mathcal{A}}\Bigr{]}-{\mathrm{KL}}^{d_{\text{{\tiny E}}}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})\biggr{\}}
=η1𝔼dE[log(πk+1s/πks)ηQ^ωk(s,),πEsπks𝒜](i.a)+𝔼dE[Q^ωk(s,)Qrkπk(s,),πEsπks𝒜](i.b)\displaystyle=\underbrace{\eta^{-1}\cdot\mathbb{E}_{d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot\widehat{Q}_{\omega_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}\Bigr{]}}_{\displaystyle\text{(i.a)}}+\underbrace{\mathbb{E}_{d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}\widehat{Q}_{\omega_{k}}(s,\cdot)-Q_{r_{k}}^{\pi_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}\Bigr{]}}_{\displaystyle\text{(i.b)}}
+η1𝔼dE[log(πk+1s/πks),πksπk+1s𝒜KL(πk+1sπks)](i.c).\displaystyle\quad+\underbrace{\eta^{-1}\cdot\mathbb{E}_{d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}\log({\pi_{k+1}^{s}}/{\pi_{k}^{s}}),\pi_{k}^{s}-\pi_{k+1}^{s}\bigr{\rangle}_{\mathcal{A}}-{\mathrm{KL}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})\Bigr{]}}_{\displaystyle\text{(i.c)}}. (C.3)

The following lemmas upper bound $\Delta_{k}^{\text{(i)}}$ by upper bounding terms (i.a), (i.b), and (i.c) on the right-hand side of (C.3), respectively. Note that the expectation $\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}$ is taken with respect to the random initialization in (3.3) and $s\sim d_{\text{{\tiny E}}}$.

Lemma C.1 (Upper Bound of Term (i.a) in (C.3)).

Under Assumptions (a) and 4.3, we have that

𝔼init,dE[|log(πk+1s/πks)ηQ^ωk(s,),πEsπks𝒜|]\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot\widehat{Q}_{\omega_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}
=η22ChBθ1/2σ1/2N1/4+𝒪(τk+1Bθ3/2m1/4+ηBθ5/4m1/8),\displaystyle\quad=\eta\cdot 2\sqrt{2}\cdot C_{h}\cdot B_{\theta}^{1/2}\cdot\sigma^{1/2}\cdot N^{-1/4}+\mathcal{O}(\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}+\eta\cdot B_{\theta}^{5/4}\cdot m^{-1/8}),

where ChC_{h} is defined in Assumption (b) and σ\sigma is defined in Assumption 4.3.

Proof.

See §D.1 for a detailed proof. ∎

Lemma C.2 (Upper Bound of Term (i.b) in (C.3)).

Under Assumption (b), we have that

𝔼init,dE[Q^ωk(s,)Qrkπk(s,),πEsπks𝒜]ChϵQ,k,\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\Bigl{[}\bigl{\langle}\widehat{Q}_{\omega_{k}}(s,\cdot)-Q_{r_{k}}^{\pi_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}\Bigr{]}\leq C_{h}\cdot{\epsilon}_{Q,k},

where ϵQ,k{\epsilon}_{Q,k} takes the form of

ϵQ,k=𝔼init[Qrkπk(s,a)Q^ωk(s,a)2,ρk].\displaystyle{\epsilon}_{Q,k}=\mathbb{E}_{\text{init}}\Bigl{[}\big{\|}Q_{r_{k}}^{\pi_{k}}(s,a)-\widehat{Q}_{\omega_{k}}(s,a)\big{\|}_{2,\rho_{k}}\Bigr{]}. (C.4)
Proof.

See §D.2 for a detailed proof. ∎

Lemma C.3 (Upper Bound of Term (i.c) in (C.3)).

Under Assumptions (a) and (a), we have that

𝔼init,dE[|log(πk+1s/πks),πksπk+1s𝒜|KL(πk+1sπks)]\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s}),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}-{\mathrm{KL}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})\biggr{]}
=η2(M02+9Bθ2)+𝒪(τk+1Bθ3/2m1/4),\displaystyle\quad=\eta^{2}\cdot(M_{0}^{2}+9B_{\theta}^{2})+\mathcal{O}(\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}),

where M0M_{0} is defined in Assumption (a).

Proof.

See §D.3 for a detailed proof. ∎

Finally, by Lemmas C.1-C.3, under Assumptions 4.2 and 4.3, we obtain from (C.3) that

𝔼init[|Δk(i)|]\displaystyle\mathbb{E}_{{\text{init}}}\bigl{[}|\Delta_{k}^{\text{(i)}}|\bigr{]} =22ChBθ1/2σ1/2N1/4+ChϵQ,k+η(M02+9Bθ2)\displaystyle=2\sqrt{2}C_{h}\cdot B_{\theta}^{1/2}\cdot\sigma^{1/2}\cdot N^{-1/4}+C_{h}\cdot{\epsilon}_{Q,k}+\eta\cdot(M_{0}^{2}+9B_{\theta}^{2})
+𝒪(η1τk+1Bθ3/2m1/4+Bθ5/4m1/8).\displaystyle\quad+\mathcal{O}(\eta^{-1}\cdot\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}+B_{\theta}^{5/4}\cdot m^{-1/8}).

Here M0M_{0} is defined in Assumption (a), τk+1\tau_{k+1} is the inverse temperature parameter of πk+1\pi_{k+1} defined in (3.5), σ\sigma is defined in Assumption 4.3, and ϵQ,k{\epsilon}_{Q,k} is defined in (C.4) of Lemma C.2. Following from Proposition 4.4, we have that

ChϵQ,k=𝒪(Bω3m1/2+Bω5/2m1/4+Bω2exp(CvBω2)).\displaystyle C_{h}\cdot\epsilon_{Q,k}=\mathcal{O}\bigl{(}B_{\omega}^{3}\cdot m^{-1/2}+B_{\omega}^{5/2}\cdot m^{-1/4}+B_{\omega}^{2}\cdot\exp(-C_{v}\cdot B_{\omega}^{2})\bigr{)}.

Thus, we complete the proof of Lemma 5.2. ∎

C.3 Proof of Lemma 5.3

Proof.

We consider a fixed βSBβ\beta^{\prime}\in S_{B_{\beta}}. For notational simplicity, we write r=rβ(s,a)r^{\prime}=r_{\beta^{\prime}}(s,a), rk=rk(s,a)r_{k}=r_{k}(s,a) and ϕβ=ϕβ(s,a)\phi_{\beta}=\phi_{\beta}(s,a). By the parameterization of rβ(s,a)r_{\beta}(s,a) in (3.6), we have that

\displaystyle L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})=\langle r^{\prime}-r_{k},\nu_{\text{{\tiny E}}}-\nu_{k}\rangle_{{\mathcal{S}}\times\mathcal{A}}+\lambda\cdot\psi(\beta_{k})-\lambda\cdot\psi(\beta^{\prime})
\displaystyle\quad=(1-\gamma)^{-1}\cdot\Bigl(\bigl\langle\phi_{\beta_{k}}^{\top}(\beta^{\prime}-\beta_{k}),\nu_{\text{{\tiny E}}}-\nu_{k}\bigr\rangle_{{\mathcal{S}}\times\mathcal{A}}+\langle\phi_{\beta^{\prime}}^{\top}\beta^{\prime}-\phi_{\beta_{k}}^{\top}\beta^{\prime},\nu_{\text{{\tiny E}}}-\nu_{k}\rangle_{{\mathcal{S}}\times\mathcal{A}}\Bigr)+\lambda\cdot\bigl(\psi(\beta_{k})-\psi(\beta^{\prime})\bigr)
\displaystyle\quad\leq(\beta^{\prime}-\beta_{k})^{\top}\nabla_{\beta}L(\theta_{k},\beta_{k})+(1-\gamma)^{-1}\cdot\bigl(\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{1,\nu_{k}}+\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{1,\nu_{\text{{\tiny E}}}}\bigr), (C.5)

where the last inequality follows from (3.10) of Proposition 3.1. Then, we have that

𝔼init[L(θk,β)L(θk,βk)]\displaystyle\mathbb{E}_{\text{init}}\bigl{[}L(\theta_{k},\beta^{\prime})-L(\theta_{k},\beta_{k})\bigr{]}
𝔼init[(ββk)βL(θk,βk)+(1γ)1(ϕβkβϕββ1,νk+ϕβkβϕββ1,νE)]\displaystyle\quad\leq\mathbb{E}_{\text{init}}\Bigl{[}(\beta^{\prime}-\beta_{k})^{\top}\nabla_{\beta}L(\theta_{k},\beta_{k})+(1-\gamma)^{-1}\cdot\bigl{(}\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{1,\nu_{k}}+\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{1,\nu_{\text{{\tiny E}}}}\bigr{)}\Bigr{]}
𝔼init[(ββk)βL(θk,βk)]+𝒪(Bβ3/2m1/4),\displaystyle\quad\leq\mathbb{E}_{\text{init}}\bigl{[}(\beta^{\prime}-\beta_{k})^{\top}\nabla_{\beta}L(\theta_{k},\beta_{k})\bigr{]}+\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4}),

where the last inequality follows from Assumption (a), Lemma A.2, and the fact that β,βkSBβ\beta^{\prime},\beta_{k}\in S_{B_{\beta}}. Thus, we complete the proof of Lemma 5.3. ∎

C.4 Proof of Lemma 5.4

Proof.

By the update of βk\beta_{k} in (3.14), it holds for any βSBβ\beta^{\prime}\in S_{B_{\beta}} that

(βk+η^βL(θk,βk)βk+1)(ββk+1)0,\displaystyle\bigl{(}\beta_{k}+\eta\cdot\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\beta_{k+1}\bigr{)}^{\top}(\beta^{\prime}-\beta_{k+1})\leq 0,

which further implies that

η(ββk)βL(θk,βk)\displaystyle\eta\cdot({\beta^{\prime}-\beta_{k}})^{\top}{\nabla_{\beta}L(\theta_{k},\beta_{k})} βkβ22βk+1β22βk+1βk22\displaystyle\leq\|\beta_{k}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta_{k}\|_{2}^{2} (C.6)
+η((βk+1βk)^βL(θk,βk)+(βkβ)(^βL(θk,βk)βL(θk,βk))).\displaystyle\quad+\eta\cdot\Bigl{(}(\beta_{k+1}-\beta_{k})^{\top}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})+(\beta_{k}-\beta^{\prime})^{\top}\bigl{(}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\nabla_{\beta}L(\theta_{k},\beta_{k})\bigr{)}\Bigr{)}.

Combining (C.5) and (C.6), we have that

η(L(θk,βk)L(θk,β))βkβ22βk+1β22βk+1βk22+ηΔk(ii),\displaystyle\eta\cdot\bigl{(}L(\theta_{k},\beta_{k})-L(\theta_{k},\beta^{\prime})\bigr{)}\leq\|\beta_{k}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta^{\prime}\|_{2}^{2}-\|\beta_{k+1}-\beta_{k}\|_{2}^{2}+\eta\cdot\Delta_{k}^{\text{(ii)}},

where Δk(ii)\Delta_{k}^{\text{(ii)}} takes the form of

Δk(ii)\displaystyle\Delta_{k}^{\text{(ii)}} =(βk+1βk)^βL(θk,βk)(ii.a)+(βkβ)(^βL(θk,βk)βL(θk,βk))(ii.b)\displaystyle=\underbrace{{(\beta_{k+1}-\beta_{k})}^{\top}{\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})}}_{\displaystyle\text{(ii.a)}}+\underbrace{({\beta_{k}-\beta^{\prime}})^{\top}\bigl{(}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\nabla_{\beta}L(\theta_{k},\beta_{k})\bigr{)}}_{\displaystyle\text{(ii.b)}}
+(1γ)1(ϕβkβϕββ2,νk+ϕβkβϕββ2,νE)(ii.c)\displaystyle\quad+\underbrace{(1-\gamma)^{-1}\cdot\bigl{(}\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{2,\nu_{k}}+\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{2,\nu_{\text{{\tiny E}}}}\bigr{)}}_{\displaystyle\text{(ii.c)}} (C.7)

We now upper bound terms (ii.a), (ii.b), and (ii.c) on the right-hand side of (C.7). Following from Assumption (a) and Lemma A.2, we have that

𝔼init[ϕβkβϕββ2,νk+ϕβkβϕββ2,νE]=𝒪(Bβ3/2m1/4),\displaystyle\mathbb{E}_{\text{init}}\bigl{[}\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{2,\nu_{k}}+\|\phi_{\beta_{k}}^{\top}\beta^{\prime}-\phi_{\beta^{\prime}}^{\top}\beta^{\prime}\|_{2,\nu_{\text{{\tiny E}}}}\bigr{]}=\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4}), (C.8)

which upper bounds term (ii.c) of (C.7). For term (ii.b) of (C.7), we have that

𝔼[|(βkβ)(^βL(θk,βk)βL(θk,βk))|]\displaystyle\mathbb{E}\biggl{[}\Big{|}({\beta_{k}-\beta^{\prime}})^{\top}\bigl{(}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\nabla_{\beta}L(\theta_{k},\beta_{k})\bigr{)}\Big{|}\biggr{]}
𝔼[^βL(θk,βk)βL(θk,βk)2ββk2]2Bβ𝔼[ξk2]2Bβ(σ2/N)1/2,\displaystyle\quad\leq\mathbb{E}\Bigl{[}\big{\|}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\nabla_{\beta}L(\theta_{k},\beta_{k})\big{\|}_{2}\cdot\|\beta^{\prime}-\beta_{k}\|_{2}\Bigr{]}\leq 2B_{\beta}\cdot\mathbb{E}\bigl{[}\|\xi_{k}^{\prime}\|_{2}\bigr{]}\leq 2B_{\beta}\cdot(\sigma^{2}/N)^{1/2}, (C.9)

where we write $\xi_{k}^{\prime}=\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})-\nabla_{\beta}L(\theta_{k},\beta_{k})$. Here the first inequality follows from the Cauchy-Schwartz inequality, the second inequality follows from the fact that $\beta_{k},\beta^{\prime}\in S_{B_{\beta}}$, and the last inequality follows from Assumption 4.3. To upper bound term (ii.a) in (C.7), we have that

𝔼[|(βk+1βk)^βL(θk,βk)|]\displaystyle\mathbb{E}\Bigl{[}\bigl{|}{(\beta_{k+1}-\beta_{k})}^{\top}{\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})}\bigr{|}\Bigr{]} (C.10)
𝔼[^βL(θk,βk)2βk+1βk2]η𝔼[^βL(θk,βk)22]=2η(βL(θk,βk)22+𝔼[ξk22]),\displaystyle\quad\leq\mathbb{E}\Bigl{[}\big{\|}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})\big{\|}_{2}\cdot\|\beta_{k+1}-\beta_{k}\|_{2}\Bigr{]}\leq\eta\cdot\mathbb{E}\Bigl{[}\big{\|}\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})\big{\|}_{2}^{2}\Bigr{]}=2\eta\cdot\Bigl{(}\big{\|}\nabla_{\beta}L(\theta_{k},\beta_{k})\big{\|}^{2}_{2}+\mathbb{E}\bigl{[}\|\xi_{k}^{\prime}\|^{2}_{2}\bigr{]}\Bigr{)},

where the first inequality follows from the Cauchy-Schwartz inequality and the second inequality follows from the update of β\beta in (3.14). Furthermore, we have

\displaystyle\big\|\nabla_{\beta}L(\theta_{k},\beta_{k})\big\|^{2}_{2}=\Big\|\mathbb{E}_{\nu_{k}}\bigl[\phi_{\beta_{k}}(s,a)\bigr]-\mathbb{E}_{\nu_{E}}\bigl[\phi_{\beta_{k}}(s,a)\bigr]+\lambda\cdot\nabla_{\beta}\psi(\beta_{k})\Big\|_{2}^{2}
\displaystyle\quad\leq\biggl(\mathbb{E}_{\nu_{k}}\Bigl[\big\|\phi_{\beta_{k}}(s,a)\big\|_{2}\Bigr]+\mathbb{E}_{\nu_{E}}\Bigl[\big\|\phi_{\beta_{k}}(s,a)\big\|_{2}\Bigr]+\lambda\cdot\big\|\nabla_{\beta}\psi(\beta_{k})\big\|_{2}\biggr)^{2}
\displaystyle\quad\leq(2+\lambda\cdot L_{\psi})^{2}, (C.11)

where the first inequality follows from the triangle inequality and Jensen's inequality, and the second inequality follows from the fact that $\|\phi_{W}(s,a)\|_{2}\leq 1$ and the Lipschitz continuity of $\psi(\beta)$ in Assumption (b). By plugging (C.11) into (C.10), we have that

𝔼[|^βL(θk,βk)(βkβk+1)|]\displaystyle\mathbb{E}\Bigl{[}\bigl{|}{\widehat{\nabla}_{\beta}L(\theta_{k},\beta_{k})}^{\top}({\beta_{k}-\beta_{k+1}})\bigr{|}\Bigr{]} η((2+λLψ)2+𝔼[ξk22])\displaystyle\leq\eta\cdot\Bigl{(}(2+\lambda\cdot L_{\psi})^{2}+\mathbb{E}\bigl{[}\|\xi_{k}^{\prime}\|^{2}_{2}\bigr{]}\Bigr{)}
η((2+λLψ)2+σ2/N),\displaystyle\leq\eta\cdot\bigl{(}(2+\lambda\cdot L_{\psi})^{2}+\sigma^{2}/N\bigr{)}, (C.12)

where the last inequality follows from Assumption 4.3. Finally, by plugging (C.8), (C.9), and (C.12) into (C.7), we have that

𝔼init[|Δk(ii)|]\displaystyle\mathbb{E}_{\text{init}}\bigl{[}|\Delta_{k}^{\text{(ii)}}|\bigr{]} =η((2+λLψ)2+σ2N1)+2BβσN1/2+𝒪(Bβ3/2m1/4).\displaystyle=\eta\cdot\bigl{(}(2+\lambda\cdot L_{\psi})^{2}+\sigma^{2}\cdot N^{-1}\bigr{)}+2B_{\beta}\cdot\sigma\cdot N^{-1/2}+\mathcal{O}(B_{\beta}^{3/2}\cdot m^{-1/4}).

Thus, we complete the proof of Lemma 5.4. ∎
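The projection step invoked at the beginning of this proof is the update of $\beta$ in (3.14), namely a stochastic gradient step followed by the Euclidean projection onto the ball $S_{B_{\beta}}$ centered at the initialization; the first display of the proof is precisely the first-order optimality condition of that projection. A minimal sketch (with an assumed stochastic gradient oracle grad_hat standing in for $\widehat{\nabla}_{\beta}L(\theta_{k},\cdot)$, and illustrative names throughout) is the following.

import numpy as np

def project_ball(beta, beta0, radius):
    # Euclidean projection onto S_{B_beta} = {beta : ||beta - beta0||_2 <= radius}.
    # Its optimality condition (x - Proj(x))^T (y - Proj(x)) <= 0 for all y in the
    # ball is the inequality used at the start of the proof of Lemma 5.4.
    diff = beta - beta0
    norm = np.linalg.norm(diff)
    return beta if norm <= radius else beta0 + diff * (radius / norm)

def reward_update(beta_k, beta0, grad_hat, eta, B_beta):
    # One projected stochastic gradient ascent step on the reward parameter,
    # mirroring (3.14): beta_{k+1} = Proj_{S_{B_beta}}(beta_k + eta * grad_hat(beta_k)).
    return project_ball(beta_k + eta * grad_hat(beta_k), beta0, B_beta)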

Appendix D Proofs of Supporting Lemmas

In what follows, we present the proofs of the lemmas in §C.

D.1 Proof of Lemma C.1

Proof.

It holds for any policies π,π\pi,\pi^{\prime} that

D(s),πs(π)s𝒜=0,\displaystyle\bigl{\langle}D(s),\pi^{s}-(\pi^{\prime})^{s}\bigr{\rangle}_{\mathcal{A}}=0, (D.1)

where D(s)D(s) only depends on the state ss. Thus, we have that

log(πk+1s/πks)ηQ^ωk(s,),πEsπks𝒜\displaystyle\bigl{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot\widehat{Q}_{\omega_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}
=τk+1ϕθk+1(s,)θk+1τkϕθk(s,)θkηϕωk(s,)ωk,πEsπks𝒜\displaystyle\quad=\bigl{\langle}\tau_{k+1}\cdot\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\tau_{k}\cdot\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k}-\eta\cdot\phi_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}
=τk+1ιθk+1(s,)θk+1τkιθk(s,)θkηιωk(s,)ωk,πEsπks𝒜,\displaystyle\quad=\bigl{\langle}\tau_{k+1}\cdot\iota_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\tau_{k}\cdot\iota_{\theta_{k}}(s,\cdot)^{\top}\theta_{k}-\eta\cdot\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}},

where the first equality follows from the parameterization of $\pi_{\theta}$ and $\widehat{Q}_{\omega}$ in (3.5) and (3.12), respectively, and the second equality follows from (D.1) and the definition of the temperature-adjusted score function $\iota_{\theta}(s,a)$ in (3.8) of Proposition 3.1. Here, with a slight abuse of notation, we define

ιωk(s,a)=ϕωk(s,a)𝔼aπk(|s)[ϕωk(s,a)].\displaystyle\iota_{\omega_{k}}(s,a)=\phi_{\omega_{k}}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi_{k}(\cdot\,|\,s)}\bigl{[}\phi_{\omega_{k}}(s,a^{\prime})\bigr{]}. (D.2)

Then, following from (D.1) and the update τk+1θk+1=τkθkηδk\tau_{k+1}\cdot\theta_{k+1}=\tau_{k}\cdot\theta_{k}-\eta\cdot\delta_{k} in (3.13), we have that

log(πk+1s/πks)ηQ^ωk(s,),πEsπks𝒜\displaystyle\bigl{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot\widehat{Q}_{\omega_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}} (D.3)
=τk+1ιθk+1(s,)θk+1τkιθk(s,)θkηιωk(s,)ωk,πEsπks𝒜\displaystyle\quad=\bigl{\langle}\tau_{k+1}\cdot\iota_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\tau_{k}\cdot\iota_{\theta_{k}}(s,\cdot)^{\top}\theta_{k}-\eta\cdot\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\bigr{\rangle}_{\mathcal{A}}
=τk+1ιθk+1(s,)θk+1ιθk(s,)θk+1,πEsπks𝒜(i)ηιθk(s,)δk+ιωk(s,)ωk,πEsπks𝒜(ii).\displaystyle\quad=\underbrace{\tau_{k+1}\cdot\big{\langle}\iota_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\iota_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}}_{\displaystyle\text{(i)}}-\underbrace{\eta\cdot\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}}_{\displaystyle\text{(ii)}}.

In what follows, we upper bound terms (i) and (ii) on the right-hand side of (D.3).

Upper bound of term (i) in (D.3). Following from (3.8) of Proposition 3.1 and (D.1) we have that

|ιθk+1(s,)θk+1ιθk(s,)θk+1,πEsπks𝒜|\displaystyle\Bigl{|}\big{\langle}\iota_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\iota_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}
=|ϕθk+1(s,)θk+1ϕθk(s,)θk+1,πEsπks𝒜|\displaystyle\quad=\Bigl{|}\big{\langle}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}
ϕθk+1(s,)θk+1ϕθk(s,)θk+11,πEs+ϕθk+1(s,)θk+1ϕθk(s,)θk+11,πks,\displaystyle\quad\leq\big{\|}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1}\big{\|}_{1,\pi_{\text{{\tiny E}}}^{s}}+\big{\|}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1}\big{\|}_{1,\pi_{k}^{s}}, (D.4)

where the inequality follows from the triangle inequality. Following from Assumption (a) and Lemma A.2, we have that

𝔼init,dE[ϕθk+1(s,)θk+1ϕθk(s,)θk+11,πEs]=𝒪(Bθ3/2m1/4).\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\Bigl{[}\big{\|}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1}\big{\|}_{1,\pi_{\text{{\tiny E}}}^{s}}\Bigr{]}=\mathcal{O}(B_{\theta}^{3/2}\cdot m^{-1/4}). (D.5)

Furthermore, following from Assumption (b), Lemma A.2, and the Cauchy-Schwarz inequality, we have that

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\Bigl{[}\big{\|}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1}\big{\|}_{1,\pi_{k}^{s}}\Bigr{]}
\displaystyle\quad=\mathbb{E}_{{\text{init}},d_{k}}\biggl{[}\big{\|}\phi_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1}\big{\|}_{1,\pi_{k}^{s}}\cdot\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}\biggr{]}
\displaystyle\quad\leq\big{\|}\phi_{\theta_{k+1}}(s,a)^{\top}\theta_{k+1}-\phi_{\theta_{k}}(s,a)^{\top}\theta_{k+1}\big{\|}_{2,\nu_{k}}\cdot\bigg{\|}\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}\bigg{\|}_{2,d_{k}}
\displaystyle\quad=\mathcal{O}(B_{\theta}^{3/2}\cdot m^{-1/4}). (D.6)

Here the expectation $\mathbb{E}_{{\text{init}},d_{k}}$ is taken with respect to the random initialization in (3.3) and $s\sim d_{k}$. Thus, plugging (D.5) and (D.6) into (D.4), we obtain for term (i) of (D.3) that

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\iota_{\theta_{k+1}}(s,\cdot)^{\top}\theta_{k+1}-\iota_{\theta_{k}}(s,\cdot)^{\top}\theta_{k+1},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}=\mathcal{O}(B_{\theta}^{3/2}\cdot m^{-1/4}). (D.7)
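The change-of-measure argument in (D.6) is used repeatedly in what follows and amounts to the following elementary bound: for any measurable function $g$ of the state, assuming $d_{\text{{\tiny E}}}$ is absolutely continuous with respect to $d_{k}$ so that the Radon-Nikodym derivative ${\mathrm{d}}d_{\text{{\tiny E}}}/{\mathrm{d}}d_{k}$ exists (which is implicit in the concentrability condition of Assumption 4.1), we have

\displaystyle\mathbb{E}_{d_{\text{{\tiny E}}}}\bigl{[}|g(s)|\bigr{]}=\mathbb{E}_{d_{k}}\biggl{[}|g(s)|\cdot\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}(s)\biggr{]}\leq\|g\|_{2,d_{k}}\cdot\bigg{\|}\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}\bigg{\|}_{2,d_{k}},

where the inequality is the Cauchy-Schwarz inequality under $d_{k}$.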

Upper bound of term (ii) in (D.3). Following from the Cauchy-Schwarz inequality, we have that

\displaystyle\mathbb{E}_{d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}\leq\int_{{\mathcal{S}}\times\mathcal{A}}\big{|}\iota_{\theta_{k}}(s,a)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,a)^{\top}\omega_{k}\big{|}{\mathrm{d}}\nu_{\text{{\tiny E}}}(s,a)
\displaystyle\quad\leq\bigg{\|}\frac{{\mathrm{d}}\nu_{\text{{\tiny E}}}}{{\mathrm{d}}\nu_{k}}\bigg{\|}_{2,\nu_{k}}\cdot\big{\|}\iota_{\theta_{k}}(s,a)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,a)^{\top}\omega_{k}\big{\|}_{2,\nu_{k}}. (D.8)

Similarly, we have that

\displaystyle\mathbb{E}_{d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}\leq\int_{{\mathcal{S}}\times\mathcal{A}}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}{\mathrm{d}}\pi^{s}_{k}(a){\mathrm{d}}d_{\text{{\tiny E}}}(s)
\displaystyle\quad=\int_{{\mathcal{S}}\times\mathcal{A}}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\cdot\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}(s){\mathrm{d}}\nu_{k}(s,a)
\displaystyle\quad\leq\bigg{\|}\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}\bigg{\|}_{2,d_{k}}\cdot\big{\|}\iota_{\theta_{k}}(s,a)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,a)^{\top}\omega_{k}\big{\|}_{2,\nu_{k}}, (D.9)

where the last inequality follows from the Cauchy-Schwarz inequality. Combining (D.8) and (D.9), we obtain for term (ii) of (D.3) that

\displaystyle\mathbb{E}_{d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}
\displaystyle\quad\leq\Biggl{(}\bigg{\|}\frac{{\mathrm{d}}\nu_{\text{{\tiny E}}}}{{\mathrm{d}}\nu_{k}}\bigg{\|}_{2,\nu_{k}}+\bigg{\|}\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}d_{k}}\bigg{\|}_{2,d_{k}}\Biggr{)}\cdot\big{\|}\iota_{\theta_{k}}(s,a)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,a)^{\top}\omega_{k}\big{\|}_{2,\nu_{k}}
\displaystyle\quad\leq C_{h}\cdot\big{\|}\iota_{\theta_{k}}(s,a)^{\top}\delta_{k}+\iota_{\omega_{k}}(s,a)^{\top}\omega_{k}\big{\|}_{2,\nu_{k}}, (D.10)

where the last inequality follows from Assumption 4.1. To upper bound term (ii) of (D.3), it suffices to upper bound the right-hand side of (D.10). For notational simplicity, we write $\iota_{\theta_{k}}=\iota_{\theta_{k}}(s,a)$, $\iota_{\omega_{k}}=\iota_{\omega_{k}}(s,a)$, and $\phi_{\omega_{k}}=\phi_{\omega_{k}}(s,a)$. By the triangle inequality, we have that

\displaystyle\|\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}}\|_{2,\nu_{k}}=\Bigl{(}\mathbb{E}_{\nu_{k}}\bigl{[}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\cdot(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{)}^{1/2}
\displaystyle\quad\leq\underbrace{\Bigl{|}(\delta_{k}-\omega_{k})^{\top}\mathbb{E}_{\nu_{k}}\bigl{[}\iota_{\theta_{k}}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}}_{\displaystyle\text{(ii.a)}}+\underbrace{\Bigl{|}\mathbb{E}_{\nu_{k}}\bigl{[}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\cdot(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}}_{\displaystyle\text{(ii.b)}}. (D.11)
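The two exponents of $1/2$ in (D.11) come from the elementary subadditivity of the square root: for any $a,b\geq 0$,

\displaystyle\bigl{(}\sqrt{a}+\sqrt{b}\bigr{)}^{2}=a+b+2\sqrt{ab}\geq a+b,\qquad\text{so that}\qquad\sqrt{a+b}\leq\sqrt{a}+\sqrt{b},

which is applied to the splitting of the expectation $\mathbb{E}_{\nu_{k}}\bigl{[}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})^{2}\bigr{]}$ into the two terms displayed in (D.11).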

We now upper bound the two terms (ii.a) and (ii.b) on the right-hand side of (D.11). For term (ii.a) of (D.11), following from (3.9) of Proposition 3.1, we have that

\displaystyle\mathcal{I}(\theta_{k})=\tau_{k}^{2}\cdot\mathbb{E}_{\nu_{k}}[\iota_{\theta_{k}}\iota_{\theta_{k}}^{\top}]. (D.12)

Recall that the expectation $\mathbb{E}_{k}$ is taken with respect to the $k$-th batch. Following from the definition of $\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})$ in (3.17), we have that

\displaystyle\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}=-\tau_{k}\cdot\mathbb{E}_{\nu_{k}}[\omega_{k}^{\top}\phi_{\omega_{k}}\cdot\iota_{\theta_{k}}]
\displaystyle\quad=-\tau_{k}\cdot\mathbb{E}_{\nu_{k}}[\omega_{k}^{\top}\iota_{\omega_{k}}\cdot\iota_{\theta_{k}}]-\tau_{k}\cdot\omega_{k}^{\top}\mathbb{E}_{a^{\prime}\sim\pi_{k}^{s}}\bigl{[}\phi_{\omega_{k}}(s,a^{\prime})\bigr{]}\cdot\mathbb{E}_{\nu_{k}}[\iota_{\theta_{k}}]
\displaystyle\quad=-\tau_{k}\cdot\mathbb{E}_{\nu_{k}}[\omega_{k}^{\top}\iota_{\omega_{k}}\cdot\iota_{\theta_{k}}], (D.13)

where the first equality follows from the fact that $\widehat{Q}_{\omega_{k}}(s,a)=\omega_{k}^{\top}\phi_{\omega_{k}}(s,a)$, while the second and third equalities follow from the definition of $\iota_{\omega_{k}}(s,a)$ in (D.2) and the fact that $\mathbb{E}_{\nu_{k}}[\iota_{\theta_{k}}]=0$. Following from (D.12) and (D.13), we have that

\displaystyle\Bigl{|}(\delta_{k}-\omega_{k})^{\top}\mathbb{E}_{\nu_{k}}\bigl{[}\iota_{\theta_{k}}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}=\tau_{k}^{-2}\cdot\biggl{|}(\delta_{k}-\omega_{k})^{\top}\Bigl{(}\mathcal{I}(\theta_{k})\delta_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Bigr{)}\biggr{|}
\displaystyle\quad\leq 2B_{\theta}\cdot\tau_{k}^{-2}\cdot\Big{\|}\mathcal{I}(\theta_{k})\delta_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Big{\|}_{2}. (D.14)

Here the last inequality follows from the Cauchy-Schwarz inequality and the fact that $\|\omega_{k}-\delta_{k}\|_{2}\leq 2B_{\theta}$ as $\omega_{k},\delta_{k}\in S_{B_{\theta}}$. For notational simplicity, we define the following error terms,

\displaystyle\xi_{k}^{(1)}=\widehat{\mathcal{I}}(\theta_{k})\delta_{k}-\mathcal{I}(\theta_{k})\delta_{k}, (D.15)
\displaystyle\xi_{k}^{(2)}=\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})-\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}. (D.16)

Then, we have for term (ii.a) in (D.11) that

\displaystyle\mathbb{E}_{{\text{init}}}\biggl{[}\Bigl{|}(\delta_{k}-\omega_{k})^{\top}\mathbb{E}_{\nu_{k}}\bigl{[}\iota_{\theta_{k}}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}\biggr{]} (D.17)
\displaystyle\quad\leq(2B_{\theta})^{1/2}\cdot\tau_{k}^{-1}\cdot\mathbb{E}_{{\text{init}}}\biggl{[}\Big{\|}\mathcal{I}(\theta_{k})\delta_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Big{\|}_{2}^{1/2}\biggr{]}
\displaystyle\quad\leq(2B_{\theta})^{1/2}\cdot\tau_{k}^{-1}\cdot\mathbb{E}_{\text{init}}\biggl{[}\Bigl{(}\big{\|}\widehat{\mathcal{I}}(\theta_{k})\delta_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2}+\|\xi_{k}^{(1)}\|_{2}+\tau_{k}\cdot\|\xi_{k}^{(2)}\|_{2}\Bigr{)}^{1/2}\biggr{]}
\displaystyle\quad\leq(2B_{\theta})^{1/2}\cdot\tau_{k}^{-1}\cdot\biggl{(}\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}\widehat{\mathcal{I}}(\theta_{k})\delta_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2}\Bigr{]}+\mathbb{E}_{{\text{init}}}\bigl{[}\|\xi_{k}^{(1)}\|_{2}+\tau_{k}\cdot\|\xi_{k}^{(2)}\|_{2}\bigr{]}\biggr{)}^{1/2},

where the first inequality follows from (D.14), the second inequality follows from the triangle inequality, and the last inequality follows from Jensen's inequality. Similarly to (D.15), we define the following error term,

\displaystyle\xi_{k}^{(3)}=\widehat{\mathcal{I}}(\theta_{k})\omega_{k}-\mathcal{I}(\theta_{k})\omega_{k}. (D.18)

We now upper bound the right-hand side of (D.17). Recall the definition of $\delta_{k}$ in (3.15). We have that

\displaystyle\big{\|}\widehat{\mathcal{I}}(\theta_{k})\delta_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2}\leq\big{\|}\widehat{\mathcal{I}}(\theta_{k})\omega_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2} (D.19)
\displaystyle\quad\leq\Big{\|}\mathcal{I}(\theta_{k})\omega_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Big{\|}_{2}+\|\xi_{k}^{(3)}\|_{2}+\tau_{k}\cdot\|\xi_{k}^{(2)}\|_{2}.
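A possible reading of the first inequality in (D.19), under the assumption that $\delta_{k}$ is defined in (3.15) as a minimizer of the fitting error over the same constraint set $S_{B_{\theta}}$ that contains $\omega_{k}$ (the definition in (3.15) is not restated here), is

\displaystyle\big{\|}\widehat{\mathcal{I}}(\theta_{k})\delta_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2}=\min_{\delta\in S_{B_{\theta}}}\big{\|}\widehat{\mathcal{I}}(\theta_{k})\delta-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2}\leq\big{\|}\widehat{\mathcal{I}}(\theta_{k})\omega_{k}-\tau_{k}\cdot\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\big{\|}_{2},

where the inequality uses only that $\omega_{k}\in S_{B_{\theta}}$ is feasible; the second inequality in (D.19) is the triangle inequality together with the definitions of $\xi_{k}^{(2)}$ and $\xi_{k}^{(3)}$ in (D.16) and (D.18).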

Following from (D.12), (D.13), and Jensen's inequality, we have that

\displaystyle\Big{\|}\mathcal{I}(\theta_{k})\omega_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Big{\|}_{2}=\tau_{k}^{2}\cdot\Big{\|}\mathbb{E}_{\nu_{k}}\bigl{[}\iota_{\theta_{k}}\cdot\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\bigr{]}\Big{\|}_{2}
\displaystyle\quad\leq\tau_{k}^{2}\cdot\mathbb{E}_{\nu_{k}}\Bigl{[}\|\iota_{\theta_{k}}\|_{2}\cdot\bigl{|}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\bigr{|}\Bigr{]}
\displaystyle\quad\leq 2\tau_{k}^{2}\cdot\big{\|}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\big{\|}_{1,\nu_{k}},

where the last inequality follows from the fact that $\|\iota_{\theta}\|_{2}\leq 2$ for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$. Following from Assumption (a) and Lemma A.2, we have that

\displaystyle\mathbb{E}_{{\text{init}}}\biggl{[}\Big{\|}\mathcal{I}(\theta_{k})\omega_{k}-\tau_{k}\cdot\mathbb{E}_{k}\bigl{[}\widehat{\nabla}_{\theta}L(\theta_{k},\beta_{k})\bigr{]}\Big{\|}_{2}\biggr{]}\leq\mathbb{E}_{{\text{init}}}\Big{[}2\tau_{k}^{2}\cdot\big{\|}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\big{\|}_{1,\nu_{k}}\Big{]}=\mathcal{O}(\tau_{k}^{2}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}). (D.20)

Plugging (D.19) and (D.20) into (D.17), we have that

\displaystyle\mathbb{E}_{{\text{init}}}\biggl{[}\Bigl{|}(\delta_{k}-\omega_{k})^{\top}\mathbb{E}_{\nu_{k}}\bigl{[}\iota_{\theta_{k}}(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}\biggr{]}
\displaystyle\quad=(2B_{\theta})^{1/2}\cdot\tau_{k}^{-1}\cdot\Bigl{(}\mathcal{O}(2\tau_{k}^{2}\cdot B_{\theta}^{3/2}\cdot m^{-1/4})+\mathbb{E}_{{\text{init}}}\bigl{[}\|\xi_{k}^{(1)}\|_{2}+2\tau_{k}\cdot\|\xi_{k}^{(2)}\|_{2}+\|\xi_{k}^{(3)}\|_{2}\bigr{]}\Bigr{)}^{1/2}
\displaystyle\quad=\mathcal{O}(\tau_{k}\cdot B_{\theta}^{5/4}\cdot m^{-1/4})+(2B_{\theta})^{1/2}\cdot\tau_{k}^{-1}\cdot\Bigl{(}\mathbb{E}_{{\text{init}}}\bigl{[}\|\xi_{k}^{(1)}\|_{2}+2\tau_{k}\cdot\|\xi_{k}^{(2)}\|_{2}+\|\xi_{k}^{(3)}\|_{2}\bigr{]}\Bigr{)}^{1/2}
\displaystyle\quad\leq\mathcal{O}(\tau_{k}\cdot B_{\theta}^{5/4}\cdot m^{-1/4})+2\sqrt{2}B_{\theta}^{1/2}\cdot(\sigma^{2}/N)^{1/4}, (D.21)

where the last inequality follows from Assumption 4.3. We now upper bound term (ii.b) of (D.11). We have that

\displaystyle\mathbb{E}_{{\text{init}}}\biggl{[}\Bigl{|}\mathbb{E}_{\nu_{k}}\bigl{[}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\cdot(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}\biggr{]}
\displaystyle\quad\leq\mathbb{E}_{{\text{init}},\nu_{k}}\Bigl{[}\bigl{|}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\cdot(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{|}\Bigr{]}^{1/2}
\displaystyle\quad\leq\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}\omega^{\top}_{k}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\big{\|}_{2,\nu_{k}}\Bigr{]}^{1/2}\cdot\mathbb{E}_{{\text{init}}}\bigl{[}\|\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}}\|_{2,\nu_{k}}\bigr{]}^{1/2}, (D.22)

where the expectation $\mathbb{E}_{{\text{init}},\nu_{k}}$ is taken with respect to the random initialization in (3.3) and $(s,a)\sim\nu_{k}$, the first inequality follows from Jensen's inequality, and the second inequality follows from the Cauchy-Schwarz inequality. Following from Assumption (a) and Lemma A.2, we have that

\displaystyle\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}\omega^{\top}_{k}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\big{\|}_{2,\nu_{k}}\Bigr{]}=\mathcal{O}(B_{\theta}^{3/2}\cdot m^{-1/4}). (D.23)

To upper bound the right-hand side of (D.22), it remains to upper bound the term $\mathbb{E}_{{\text{init}}}[\|\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}}\|_{2,\nu_{k}}]$. We have that

\displaystyle\mathbb{E}_{{\text{init}}}\bigl{[}\|\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}}\|_{2,\nu_{k}}\bigr{]}\leq\mathbb{E}_{\text{init}}\bigl{[}\|\delta_{k}\|_{2}\cdot\|\iota_{\theta_{k}}\|_{2}\bigr{]}+\mathbb{E}_{\text{init}}\bigl{[}\|\omega_{k}\|_{2}\cdot\|\iota_{\omega_{k}}\|_{2}\bigr{]}=\mathcal{O}(B_{\theta}), (D.24)

where the inequality follows from the Cauchy-Schwarz inequality and the equality follows from the facts that $\|\iota_{\theta_{k}}\|_{2}\leq 2$, $\|\iota_{\omega_{k}}\|_{2}\leq 2$, and $\delta_{k},\omega_{k}\in S_{B_{\theta}}$. Plugging (D.23) and (D.24) into (D.22), we have that

\displaystyle\mathbb{E}_{{\text{init}}}\biggl{[}\Bigl{|}\mathbb{E}_{\nu_{k}}\bigl{[}\omega_{k}^{\top}(\iota_{\theta_{k}}-\iota_{\omega_{k}})\cdot(\delta_{k}^{\top}\iota_{\theta_{k}}+\omega_{k}^{\top}\iota_{\omega_{k}})\bigr{]}\Bigr{|}^{1/2}\biggr{]}=\mathcal{O}(B_{\theta}^{5/4}\cdot m^{-1/8}), (D.25)

which upper bounds term (ii.b) of (D.11). Plugging (D.21) and (D.25) into (D.11) and invoking (D.10), we have that

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\iota_{\theta_{k}}(s,\cdot)^{\top}\delta_{k}-\iota_{\omega_{k}}(s,\cdot)^{\top}\omega_{k},\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}
\displaystyle\quad=\eta\cdot C_{h}\cdot\bigl{(}\mathcal{O}(B_{\theta}^{5/4}\cdot m^{-1/8})+2\sqrt{2}B_{\theta}^{1/2}\cdot(\sigma^{2}/N)^{1/4}\bigr{)}, (D.26)

which upper bounds term (ii) of (D.3).

Finally, plugging (D.7) and (D.26) into (D.3), we have that

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s})-\eta\cdot\widehat{Q}_{\omega_{k}}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}
\displaystyle\quad=\eta\cdot C_{h}\cdot 2\sqrt{2}B_{\theta}^{1/2}\cdot(\sigma^{2}/N)^{1/4}+\mathcal{O}(\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}+\eta\cdot B_{\theta}^{5/4}\cdot m^{-1/8}),

where $\xi_{k}^{(1)}$, $\xi_{k}^{(2)}$, and $\xi^{(3)}_{k}$ are defined in (D.15), (D.16), and (D.18), respectively, and $C_{h}$ is defined in Assumption (b). Thus, we complete the proof of Lemma C.1. ∎

D.2 Proof of Lemma C.2

Proof.

For notational simplicity, for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$, we denote by $\Delta_{Q,k}(s,a)=\widehat{Q}_{\omega_{k}}(s,a)-Q^{\pi_{k}}_{r_{k}}(s,a)$ the error of estimating $Q^{\pi_{k}}_{r_{k}}(s,a)$ by $\widehat{Q}_{\omega_{k}}(s,a)$. Then, we have that

\displaystyle\mathbb{E}_{d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\Delta_{Q,k}(s,\cdot),\pi_{\text{{\tiny E}}}^{s}-\pi_{k}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}\biggr{]}
\displaystyle\quad\leq\int_{{\mathcal{S}}\times\mathcal{A}}\bigl{|}\Delta_{Q,k}(s,a)\bigr{|}{\mathrm{d}}\pi_{\text{{\tiny E}}}^{s}(a){\mathrm{d}}d_{\text{{\tiny E}}}(s)+\int_{{\mathcal{S}}\times\mathcal{A}}\bigl{|}\Delta_{Q,k}(s,a)\bigr{|}{\mathrm{d}}\pi_{k}^{s}(a){\mathrm{d}}d_{\text{{\tiny E}}}(s)
\displaystyle\quad=\int_{{\mathcal{S}}\times\mathcal{A}}\bigl{|}\Delta_{Q,k}(s,a)\bigr{|}\cdot\frac{{\mathrm{d}}\nu_{\text{{\tiny E}}}}{{\mathrm{d}}\rho_{k}}(s,a){\mathrm{d}}\rho_{k}(s,a)+\int_{{\mathcal{S}}\times\mathcal{A}}\bigl{|}\Delta_{Q,k}(s,a)\bigr{|}\cdot\frac{{\mathrm{d}}d_{\text{{\tiny E}}}}{{\mathrm{d}}\varrho_{k}}(s){\mathrm{d}}\rho_{k}(s,a)
\displaystyle\quad\leq C_{h}\cdot\|\Delta_{Q,k}\|_{2,\rho_{k}},

where the last inequality follows from the Cauchy-Schwarz inequality and Assumption (b). Thus, we complete the proof of Lemma C.2. ∎

D.3 Proof of Lemma C.3

Proof.

Following from (D.1) and the parameterization of $\pi_{\theta}$ in (3.5), we have that

\displaystyle\big{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s}),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}} (D.27)
\displaystyle\quad=\big{\langle}\tau_{k+1}\cdot\theta_{k+1}^{\top}\phi_{\theta_{k+1}}(s,\cdot)-\tau_{k}\cdot\theta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}
\displaystyle\quad=\big{\langle}(\tau_{k+1}\cdot\theta_{k+1}-\tau_{k}\cdot\theta_{k})^{\top}\phi_{\theta_{k}}(s,\cdot),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}+\tau_{k+1}\cdot\Big{\langle}\theta_{k+1}^{\top}\bigl{(}\phi_{\theta_{k+1}}(s,\cdot)-\phi_{\theta_{k}}(s,\cdot)\bigr{)},\pi_{k}^{s}-\pi_{k+1}^{s}\Big{\rangle}_{\mathcal{A}}.

We now upper bound the two terms on the right-hand side of (D.27). For the first term on the right-hand side of (D.27), recall that we define $\delta_{k}$ in (3.15). Thus, we have that

\displaystyle\bigl{|}(\tau_{k+1}\cdot\theta_{k+1}-\tau_{k}\cdot\theta_{k})^{\top}\phi_{\theta_{k}}(s,a)\bigr{|}=\eta\cdot\bigl{|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,a)\bigr{|}. (D.28)

Following from (D.28) and Hölder's inequality, we have for any $s\in{\mathcal{S}}$ that

\displaystyle\Bigl{|}\big{\langle}(\tau_{k+1}\cdot\theta_{k+1}-\tau_{k}\cdot\theta_{k})^{\top}\phi_{\theta_{k}}(s,\cdot),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}
\displaystyle\quad\leq\eta\cdot\big{\|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot)\big{\|}_{\infty}\cdot\|\pi_{k}^{s}-\pi_{k+1}^{s}\|_{1}.

Then, following from Pinsker’s inequality, we have that

\displaystyle\Bigl{|}\big{\langle}(\tau_{k+1}\cdot\theta_{k+1}-\tau_{k}\cdot\theta_{k})^{\top}\phi_{\theta_{k}}(s,\cdot),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}-{\mathrm{KL}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})
\displaystyle\quad\leq\eta\cdot\big{\|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot)\big{\|}_{\infty}\cdot\|\pi_{k}^{s}-\pi_{k+1}^{s}\|_{1}-1/2\cdot\|\pi_{k}^{s}-\pi_{k+1}^{s}\|_{1}^{2}
\displaystyle\quad\leq 1/2\cdot\eta^{2}\cdot\big{\|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot)\big{\|}_{\infty}^{2}. (D.29)
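The last inequality in (D.29) follows by maximizing the right-hand side of the second line over $t=\|\pi_{k}^{s}-\pi_{k+1}^{s}\|_{1}$: for any $c\geq 0$,

\displaystyle\eta\cdot c\cdot t-1/2\cdot t^{2}\leq\max_{t\in\mathbb{R}}\bigl{(}\eta\cdot c\cdot t-1/2\cdot t^{2}\bigr{)}=1/2\cdot\eta^{2}\cdot c^{2},

with the maximum attained at $t=\eta\cdot c$; here $c=\big{\|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot)\big{\|}_{\infty}$.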

By the update of $\theta_{k}$ in (3.13) and the definition of $\delta_{k}$ in (3.15), we have that $\theta_{k},\delta_{k}\in S_{B_{\theta}}$. Thus, by Lemma A.3, we have that

\displaystyle\mathbb{E}_{{\text{init}}}\Bigl{[}\big{\|}\delta_{k}^{\top}\phi_{\theta_{k}}(s,\cdot)\big{\|}_{\infty}^{2}\Bigr{]}\leq 2M_{0}+18B_{\theta}^{2}. (D.30)

Plugging (D.30) into (D.29), we have that

\displaystyle\Bigl{|}\big{\langle}(\tau_{k+1}\cdot\theta_{k+1}-\tau_{k}\cdot\theta_{k})^{\top}\phi_{\theta_{k}}(s,\cdot),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}-{\mathrm{KL}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})\leq\eta^{2}\cdot(M_{0}^{2}+9B_{\theta}^{2}). (D.31)

For the second term on the right-hand side of (D.27), following from Assumption (a) and Lemma A.2, we have

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\Biggl{[}\biggl{|}\Big{\langle}\theta_{k+1}^{\top}\bigl{(}\phi_{\theta_{k+1}}(s,\cdot)-\phi_{\theta_{k}}(s,\cdot)\bigr{)},\pi_{k}^{s}-\pi_{k+1}^{s}\Big{\rangle}_{\mathcal{A}}\biggr{|}\Biggr{]}
\displaystyle\quad\leq\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Big{\|}\theta_{k+1}^{\top}\bigl{(}\phi_{\theta_{k+1}}(s,\cdot)-\phi_{\theta_{k}}(s,\cdot)\bigr{)}\Big{\|}_{1,\pi_{k}^{s}}\biggr{]}+\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Big{\|}\theta_{k+1}^{\top}\bigl{(}\phi_{\theta_{k+1}}(s,\cdot)-\phi_{\theta_{k}}(s,\cdot)\bigr{)}\Big{\|}_{1,\pi_{k+1}^{s}}\biggr{]}
\displaystyle\quad=\mathcal{O}(B_{\theta}^{3/2}\cdot m^{-1/4}). (D.32)

Finally, plugging (D.31) and (D.32) into (D.27), we have that

\displaystyle\mathbb{E}_{{\text{init}},d_{\text{{\tiny E}}}}\biggl{[}\Bigl{|}\big{\langle}\log(\pi_{k+1}^{s}/\pi_{k}^{s}),\pi_{k}^{s}-\pi_{k+1}^{s}\big{\rangle}_{\mathcal{A}}\Bigr{|}-{\mathrm{KL}}(\pi_{k+1}^{s}\,\|\,\pi_{k}^{s})\biggr{]}
\displaystyle\quad=\eta^{2}\cdot(M_{0}^{2}+9B_{\theta}^{2})+\mathcal{O}(\tau_{k+1}\cdot B_{\theta}^{3/2}\cdot m^{-1/4}),

which completes the proof of Lemma C.3. ∎