
On the Global Convergence of Natural Actor-Critic with Two-layer Neural Network Parametrization

Mudit Gaur
Amrit Singh Bedi
Di Wang
Vaneet Aggarwal
Abstract

Actor-critic algorithms have shown remarkable success in solving state-of-the-art decision-making problems. However, despite their empirical effectiveness, their theoretical underpinnings remain relatively unexplored, especially with neural network parametrization. In this paper, we delve into the study of a natural actor-critic algorithm that utilizes neural networks to represent the critic. Our aim is to establish sample complexity guarantees for this algorithm, achieving a deeper understanding of its performance characteristics. To achieve that, we propose a Natural Actor-Critic algorithm with 2-Layer critic parametrization (NAC2L). Our approach involves estimating the $Q$-function in each iteration through a convex optimization problem. We establish that our proposed approach attains a sample complexity of $\tilde{\mathcal{O}}\left(\frac{1}{\epsilon^{4}(1-\gamma)^{4}}\right)$. In contrast, the existing sample complexity results in the literature only hold for a tabular or linear MDP. Our result, on the other hand, holds for countable state spaces and does not require a linear or low-rank structure on the MDP.

1 Introduction

Motivation. The use of neural networks in Actor-critic (AC) algorithms is widespread in various machine learning applications, such as games (DBLP:journals/corr/abs-1708-04782; Haliem2021LearningMG), robotics (morgan2021model), autonomous driving (9351818), ride-sharing (zheng2022stochastic), networking (geng2020multi), and recommender systems (li2020video). AC algorithms sequentially update the estimate of the actor (policy) and the critic (value function) based on the data collected at each iteration, as described in (konda1999actor). An empirical and theoretical improvement of AC, known as Natural Actor-Critic (NAC), was proposed in (peters2008natural). NAC replaced the stochastic gradient step of the actor with the natural gradient step described in (kakade2001natural), based on the theory of (rattray1998natural). Despite its widespread use in state-of-the-art reinforcement learning (RL) implementations, there are no finite-time sample complexity results available in the literature for a setup where neural networks (NNs) are used to represent the critic. Such results estimate the amount of data required to achieve a specified accuracy, typically measured as the gap between the average reward of the policy returned by the algorithm being analyzed and that of the optimal policy.

Challenges. The primary challenge in obtaining sample complexity for a NAC is that the critic loss function with NN parametrization becomes non-convex, making finite-time sample guarantees difficult to obtain with standard techniques. Although recent work in (fu2020single) obtains upper bounds on the error in estimating the optimal action-value function in NN critic settings, these upper bounds are asymptotic and depend on the NN parameters. Similarly, other asymptotic sample complexity results, such as those in (kakade2001natural; konda1999actor; bhatnagar2010actor), require i.i.d. sampling of the state action pairs at each iteration of the algorithm. While sample complexity results for linear MDP structures have been derived in (wu2020finite; xu2020improving) (see Related Work), that structure allows for a closed-form critic update. Relaxing the linear MDP assumption requires a critic parametrization to learn the $Q$-function, and the resulting non-convexity of the critic objective limits the ability to guarantee globally optimal policies.

Approach. As the optimal critic update is not available in closed form for general MDPs, a parametrization of the critic network is used. To address the challenge of optimizing the critic parameters, we assume a two-layer neural network structure for the critic parametrization. Recent studies have shown that the parameter optimization for two-layer ReLU neural networks can be transformed into an equivalent convex program, which is computationally tractable and solvable exactly (pmlr-v119-pilanci20a). Convex formulations for convolutions and deeper models have also been investigated (sahiner2020vector; sahiner2020convex). In this work, we use these approaches to estimate the parameterized $Q$-function in the critic. In existing policy gradient analyses, the error caused by the lack of knowledge of the exact gradient term is assumed to be upper bounded by a constant (see Related Work for details), and the samples needed to control this term are not accounted for since no explicit critic estimation is performed. This paper accounts for the sample complexity of estimating the critic when deriving the overall sample complexity of the natural actor-critic algorithm. Resolving the optimality of the critic parameter update, this paper considers the following question:

Is it possible to obtain global convergence sample complexity results for the Natural Actor-Critic algorithm with a two-layer neural network parametrization of the critic?

Contributions. In this work, we answer the above question in the affirmative. We summarize our contributions as follows.

  • We propose a novel actor-critic algorithm, employing a general parametrization for the actor and two-layer ReLU neural network parametrization for the critic.

  • Building upon the insights presented in (agarwal2020optimality), we leverage the inherent smoothness property of the actor parametrization to derive an upper bound on the estimation error of the optimal value function. This upper bound is expressed in terms of the error incurred in attaining the compatible function approximation term, as elucidated in (DBLP:conf/nips/SuttonMSM99). Our analysis dissects this error into distinct components, each of which is individually bounded from above.

  • To address the estimation error arising from the critic function, we leverage the interesting findings of (wang2021hidden; pmlr-v119-pilanci20a), converting the problem of finding critic function parameters into a convex optimization problem. This allows us to establish a finite-time sample complexity bound, a significant improvement over previous results obtained for NAC algorithms. Notably, our approach eliminates the reliance on a linear function approximation for the critic, presenting the first finite-time sample complexity result for NAC without such an approximation.

2 Related Works

Natural Actor-Critic (NAC). Actor critic methods, first conceptualised in (sutton1988learning), aim to combine the benefits of policy gradient methods and $Q$-learning based methods. The policy gradient step in these methods is replaced by the Natural Policy Gradient proposed in (kakade2001natural) to obtain the so-called Natural Actor Critic in (peters2005natural). Sample complexity results for Actor Critic were first obtained for MDPs with finite states and actions in (williams1990mathematical), and more recently in (lan2023policy; zhang2020provably). Finite-time convergence for natural actor critic under a linear MDP assumption has been obtained in (chen2022finite; khodadadian2021finite; xu2020non), with the best known sample complexity of $\tilde{\mathcal{O}}\left(\frac{1}{{\epsilon}^{2}(1-\gamma)^{4}}\right)$ (xu2020improving). Finite-time sample complexity results are, however, not available for Natural Actor Critic setups for general MDPs where neural networks are used to represent the critic. (fu2020single) obtained asymptotic upper bounds on the estimation error for a natural actor critic algorithm where a neural network is used to represent the critic. The key related works are summarized in Table 1.

Sample Complexity of PG Algorithms. Policy Gradient (PG) methods were first introduced in (DBLP:conf/nips/SuttonMSM99), with Natural Policy Gradients following in (kakade2001natural). Sample complexity results were obtained for tabular setups in (even2009online) and then improved in (agarwal2020optimality). Under linear dynamics assumptions, convergence guarantees were obtained in (fazel2018global; dean2020sample; bhandari2019global). (zhang2020global) obtains sample complexity for convergence to second-order stationary point policies. In order to improve the performance of Policy Gradient methods, techniques such as REINFORCE (williams1992simple), function approximation (sutton1999policy), importance weights (papini2018stochastic), adaptive batch sizes (papini2017adaptive), and Adaptive Trust Region Policy Optimization (shani2020adaptive) have been used. We note that the Natural Policy Gradient approach has been shown to achieve a sample complexity of $\tilde{\mathcal{O}}\left(\frac{1}{\epsilon^{3}(1-\gamma)^{6}}\right)$ (liu2020improved). However, the error incurred due to the lack of knowledge of the exact gradient term is assumed there to be upper bounded by a constant (Assumption 4.4, liu2020improved). Additionally, each estimate of the critic requires on average $\left(\frac{1}{1-\gamma}\right)$ samples, and the obtained estimate is only known to be an unbiased estimate of the critic, with no error bounds provided. The explicit dependence of the constant in (Assumption 4.4, liu2020improved) on the number of samples is not considered in their sample complexity results. For the Natural Actor Critic analysis in this paper, we provide the explicit dependence of the error in estimating the gradient update on the number of samples and incorporate the samples required for the estimation in the sample complexity analysis. We describe this difference in detail in Appendix LABEL:Comp_npg.

Table 1: This table summarizes the sample complexities of different Natural Actor-Critic Algorithms. We note that we present the first finite time sample complexity results for general MDPs where both actor and critic are parametrized by a neural network.
References | Actor parametrization | Critic parametrization | MDP assumption | Sample complexity
(williams1990mathematical) | Tabular | Tabular | Finite | Asymptotic
(borkar1997actor) | Tabular | Tabular | Finite | Asymptotic
(xu2020non) | General | None (closed form) | Linear | $\tilde{\mathcal{O}}(\epsilon^{-4}(1-\gamma)^{-9})$
(khodadadian2021finite) | General | None (closed form) | Linear | $\tilde{\mathcal{O}}(\epsilon^{-3}(1-\gamma)^{-8})$
(xu2020improving) | General | None (closed form) | Linear | $\tilde{\mathcal{O}}(\epsilon^{-2}(1-\gamma)^{-4})$
(fu2020single) | Neural network | Neural network | None | Asymptotic
This work | General | Neural network (2-layer) | None | $\tilde{\mathcal{O}}(\epsilon^{-4}(1-\gamma)^{-4})$

3 Problem Setup

We consider a discounted Markov Decision Process (MDP) given by the tuple $\mathcal{M}:=(\mathcal{S},\mathcal{A},P,R,\gamma)$, where $\mathcal{S}$ is a bounded measurable state space and $\mathcal{A}$ is the finite set of actions. $P:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})$ is the probability transition kernel (for a measurable set $\mathcal{X}$, $\mathcal{P}(\mathcal{X})$ denotes the set of all probability measures over $\mathcal{X}$), $R:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{P}([0,R_{\max}])$ is the reward kernel on the state action space with $R_{\max}$ being the absolute value of the maximum reward, and $0<\gamma<1$ is the discount factor. A policy $\pi:\mathcal{S}\rightarrow\mathcal{P}(\mathcal{A})$ maps a state to a probability distribution over the action space. Here, we denote by $\mathcal{P}(\mathcal{S}),\mathcal{P}(\mathcal{A}),\mathcal{P}([a,b])$ the set of all probability distributions over the state space, the action space, and a closed interval $[a,b]$, respectively. With the above notation, we define the action value function for a given policy $\pi$ as

$Q^{\pi}(s,a)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\middle|\,s_{0}=s,a_{0}=a\right],$    (1)

where $r(s_{t},a_{t})\sim R(\cdot|s_{t},a_{t})$, $a_{t}\sim\pi(\cdot|s_{t})$ and $s_{t+1}\sim P(\cdot|s_{t},a_{t})$ for all $t\geq 0$. For a discounted MDP, we define the optimal action value function as $Q^{*}(s,a)=\sup_{\pi}Q^{\pi}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. A policy that achieves the optimal action value function is known as the optimal policy and is denoted as $\pi^{*}$. It holds that $\pi^{*}$ is the greedy policy with respect to $Q^{*}$ (10.5555/1512940). Hence finding $Q^{*}$ is sufficient to obtain the optimal policy. In a similar manner, we define the value function as $V^{\pi}(s)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}=s\right]$, and from the definition in (1), it holds that $V^{\pi}(s)=\mathbb{E}_{a\sim\pi}\left[Q^{\pi}(s,a)\right]$. Hence, we can define the optimal value function as $V^{*}(s)=\sup_{\pi}V^{\pi}(s)$ for all $s\in\mathcal{S}$. We define $\rho_{\pi}(s)$ as the stationary state distribution induced by the policy $\pi$ starting at state $s$, and $\zeta_{\pi}(s,a)$ as the corresponding stationary state action distribution, defined as $\zeta_{\pi}(s,a)=\rho_{\pi}(s)\cdot\pi(a|s)$. We overload notation to define $V^{\pi}(\rho)=\mathbb{E}_{s_{0}\sim\rho}[V^{\pi}(s_{0})]$, where $\rho$ is an initial state distribution. We define the visitation distribution as $d^{\pi}(s_{0})=(1-\gamma)\sum_{t=0}^{\infty}{\gamma}^{t}Pr^{\pi}(s_{t}=s|s_{0})$, where $Pr^{\pi}(s_{t}=s|s_{0})$ denotes the probability that the state at time $t$ is $s$ given a starting state of $s_{0}$. Hence, we can write $d^{\pi}_{\rho}(s)=\mathbb{E}_{s_{0}\sim\rho}[d^{\pi}(s_{0})]$.
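As an illustration of the definition in (1), the following minimal sketch estimates $Q^{\pi}(s,a)$ by averaging truncated discounted returns over rollouts; the interfaces env.step and policy are assumptions made for this sketch and are not part of the setup above.

import numpy as np

def monte_carlo_q(env, policy, s0, a0, gamma, horizon=200, num_rollouts=100, seed=0):
    # Estimate Q^pi(s0, a0) of Eq. (1) by averaging truncated discounted returns.
    # env.step(s, a) is assumed to return (next_state, sampled_reward); policy(s, rng)
    # is assumed to sample an action a ~ pi(.|s). Both are illustrative interfaces.
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(num_rollouts):
        s, a, ret, disc = s0, a0, 0.0, 1.0
        for _ in range(horizon):
            s, r = env.step(s, a)
            ret += disc * r
            disc *= gamma
            a = policy(s, rng)
        returns.append(ret)
    return float(np.mean(returns))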

We now describe our Natural Actor-Critic (NAC) algorithm. In a natural policy gradient algorithm (NIPS2001_4b86abe4), the policy is parameterized as $\{\pi_{\lambda},\lambda\in\Lambda\}$. We run $K$ total iterations of the algorithm. At iteration $k$, the policy parameters are updated using a natural gradient ascent step of the form

$\lambda_{k+1}=\lambda_{k}+{\eta}F^{\dagger}_{\rho}(\lambda_{k}){\nabla}_{\lambda_{k}}V^{\pi_{\lambda_{k}}}(\rho).$    (2)

From the policy gradient theorem (DBLP:conf/nips/SuttonMSM99) we have

${\nabla}_{\lambda_{k}}V^{\pi_{\lambda_{k}}}(\rho) = \mathbb{E}_{s,a}\left[\nabla\log\pi_{\lambda_{k}}(a|s)\,Q^{\pi_{\lambda_{k}}}(s,a)\right],$    (3)
$F_{\rho}(\lambda_{k}) = \mathbb{E}_{s,a}\left[{\nabla}\log{\pi_{\lambda_{k}}(a|s)}\left({\nabla}\log{\pi_{\lambda_{k}}(a|s)}\right)^{T}\right],$    (4)

where $s\sim d^{\pi_{\lambda_{k}}}_{\rho}$, $a\sim\pi_{\lambda_{k}}(\cdot|s)$, and $F^{\dagger}_{\rho}(\lambda_{k})$ denotes the Moore-Penrose pseudoinverse of the Fisher information matrix $F_{\rho}(\lambda_{k})$. From (DBLP:conf/nips/SuttonMSM99), the principle of compatible function approximation implies that we have

$F^{\dagger}_{\rho}(\lambda_{k}){\nabla}_{\lambda_{k}}V^{\pi_{\lambda_{k}}}(\rho) = \frac{1}{1-\gamma}w_{k}^{*},$    (5)
$w_{k}^{*} = \operatorname*{arg\,min}_{w}\mathbb{E}_{s,a}\left(A^{\pi_{\lambda_{k}}}(s,a)-w\,{\nabla}_{\lambda_{k}}\log\pi_{\lambda_{k}}(a|s)\right)^{2},$    (6)

where $s\sim d^{\pi_{\lambda_{k}}}_{\rho}$, $a\sim\pi_{\lambda_{k}}(\cdot|s)$, and $A^{\pi_{\lambda_{k}}}(s,a)=Q^{\pi_{\lambda_{k}}}(s,a)-V^{\pi_{\lambda_{k}}}(s)$ is the advantage function. For natural policy gradient algorithms such as those in (agarwal2020optimality) and (liu2020improved), an estimate of $Q^{\pi_{\lambda_{k}}}$ (and from that an estimate of $A^{\pi_{\lambda_{k}}}(s,a)$) is obtained through a sampling procedure that requires on average $\left(\frac{1}{1-\gamma}\right)$ samples for each estimate of $Q^{\pi_{\lambda_{k}}}$. For the natural actor critic setup, we instead maintain a parameterized estimate of the $Q$-function, which is updated at each step and used to approximate $Q^{\pi_{\lambda_{k}}}$.
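For intuition, the compatible function approximation problem in (6) is an ordinary least-squares problem in $w$. A minimal numpy sketch that forms the empirical Fisher matrix of (4) from sampled score vectors and advantage values is given below; the small ridge term added for invertibility is an assumption of this sketch, not part of the paper.

import numpy as np

def natural_gradient_weights(score_vecs, advantages, ridge=1e-6):
    # score_vecs: (N, d) array whose rows are grad_lambda log pi(a|s);
    # advantages: (N,) array of sampled advantage values.
    # Solves the least-squares problem of Eq. (6): w* = F^{-1} E[grad log pi * A].
    N = score_vecs.shape[0]
    fisher = score_vecs.T @ score_vecs / N            # empirical version of Eq. (4)
    target = score_vecs.T @ advantages / N
    return np.linalg.solve(fisher + ridge * np.eye(fisher.shape[0]), target)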

(xu2020improving) and (wu2020finite) assume that the $Q$ function can be represented as $Q^{\lambda_{k}}(s,a)=w_{t}^{T}\phi(s,a)$, where $\phi(s,a)$ is a known feature vector and $w_{t}$ is a parameter that is updated in what is known as the critic step. The policy gradient step is known as the actor update step. In case a neural network is used to represent the $Q$ function, at each iteration $k$ of the algorithm, an estimate of its parameters is obtained by solving an optimization problem of the form

$\operatorname*{arg\,min}_{\theta\in\Theta}\mathbb{E}_{s,a}\left({Q^{\pi_{\lambda_{k}}}-Q_{\theta}}\right)^{2}.$    (7)

Due to the non-linear activation functions of the neural network, the optimization in Equation (7) is non-convex, and hence finite-time sample guarantees for actor critic algorithms with a neural network representation of the $Q$ function have not been available. Thus, in order to perform a finite-time analysis, a linear MDP structure is typically assumed, which is restrictive and not practical for large state space problems. For a natural policy gradient setup, a Monte Carlo estimate of the $Q$ function is obtained instead, which requires additional samples and causes high variance in the critic estimate.

4 Proposed Approach

4.1 Convex Reformulation of 2 layer Neural Network

We represent the $Q$ function (critic) using a 2-layer ReLU neural network. A 2-layer ReLU neural network with input $x\in\mathbb{R}^{d}$ is defined as $f(x)=\sum_{i=1}^{m}\sigma^{\prime}(x^{T}u_{i})\alpha_{i}$, where $m\geq 1$ is the number of neurons in the neural network, the parameter space is $\Theta_{m}=\mathbb{R}^{d\times m}\times\mathbb{R}^{m}$, and $\theta=(U,\alpha)$ is an element of the parameter space, where $u_{i}$ is the $i^{th}$ column of $U$ and $\alpha_{i}$ is the $i^{th}$ coefficient of $\alpha$. The function $\sigma^{\prime}:\mathbb{R}\rightarrow\mathbb{R}_{\geq 0}$ is the ReLU (rectified linear unit) function defined as $\sigma^{\prime}(x)\triangleq\max(x,0)$. In order to obtain the parameter $\theta$ for a given set of data $X\in\mathbb{R}^{n\times d}$ and the corresponding response values $y\in\mathbb{R}^{n\times 1}$, we seek the parameter that minimizes the squared loss, given by

$\mathcal{L}(\theta)=\Bigg\|\sum_{i=1}^{m}\sigma(Xu_{i})\alpha_{i}-y\Bigg\|_{2}^{2}.$    (8)

In (8), the term $\sigma(Xu_{i})$ is the vector $\{\sigma^{\prime}((x_{j})^{T}u_{i})\}_{j\in\{1,\cdots,n\}}$, where $x_{j}$ is the $j^{th}$ row of $X$; that is, the ReLU function applied to each element of the vector $Xu_{i}$. We note that the optimization in Equation (8) is non-convex in $\theta$ due to the presence of the ReLU activation function. In (wang2021hidden), it is shown that this optimization problem has an equivalent convex form, provided that the number of neurons $m$ exceeds a certain threshold value. This convex problem is obtained by replacing the ReLU functions in the optimization problem with equivalent diagonal operators. The convex problem is given as

$\mathcal{L}^{\prime}_{\beta}(p):=\Bigg\|\sum_{D_{i}\in D_{X}}D_{i}(Xp_{i})-y\Bigg\|^{2}_{2},$    (9)

where $p\in\mathbb{R}^{d\times|D_{X}|}$ and $D_{X}$ is the set of diagonal matrices $D_{i}$ which depend on the data set $X$. Except for the case of $X$ being low rank, it is not computationally feasible to obtain the full set $D_{X}$. We instead use a subset $\tilde{D}\subseteq D_{X}$ to solve the convex problem in (9), where $p$ now lies in $\mathbb{R}^{d\times|\tilde{D}|}$. The relevant details of the formulation and the definition of the diagonal matrices $D_{i}$ are provided in Appendix A. For a set of parameters $\theta=(U,\alpha)\in\Theta$, we denote the neural network represented by these parameters as

$Q_{\theta}(s,a)=\sum_{i=1}^{m}\sigma^{\prime}((s,a)^{T}u_{i})\alpha_{i}.$    (10)
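For concreteness, a minimal sketch of the critic network in (10) and its squared loss (8) is given below; it assumes that $(s,a)$ has been encoded as a single feature vector of dimension $d$, which is an assumption of this sketch.

import numpy as np

def q_theta(sa, U, alpha):
    # Two-layer ReLU critic of Eq. (10): Q_theta(s, a) = sum_i sigma'((s, a)^T u_i) alpha_i.
    # sa is the concatenated (state, action) feature vector, U has shape (d, m), alpha has shape (m,).
    return np.maximum(sa @ U, 0.0) @ alpha

def squared_loss(U, alpha, X, y):
    # Non-convex objective of Eq. (8) over a batch X of shape (n, d) with responses y of shape (n,).
    preds = np.maximum(X @ U, 0.0) @ alpha
    return float(np.sum((preds - y) ** 2))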

4.2 Proposed Natural Actor Critic Algorithm with 2-Layer Critic Parametrization (NAC2L)

Algorithm 1 Natural Actor Critic with 2-Layer Critic Parametrization (NAC2L)

Input: $\mathcal{S}$, $\mathcal{A}$, $\gamma$, time horizon $K\in\mathcal{Z}$, updates per time step $J\in\mathcal{Z}$, starting state action sampling distribution $\nu$, number of convex optimization steps $T_{k,j}$ for $k\in\{1,\cdots,K\}$, $j\in\{1,\cdots,J\}$, actor SGD learning rate $\eta$

1:  Initialize: $\tilde{Q}(s,a)=0\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}$, ${Q}_{0}(s,a)=0\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}$, $\lambda_{0}=\{0\}^{d}$
2:  for $k\in\{1,\cdots,K\}$ do
3:     for $j\in\{1,\cdots,J\}$ do
4:        $X_{k}=\varnothing$
5:        Sample $n_{k,j}$ state action pairs, using $\nu$ as the starting state action distribution and then following policy $\pi_{\lambda_{k}}$.
6:        Set $y_{i}=r_{i}+\gamma\max_{a^{\prime}\in\mathcal{A}}{Q}_{k,j-1}(s_{i+1},a^{\prime})$, where $i\in\{1,\cdots,n_{k,j}\}$
7:        Set $X_{j},Y_{j}$ as the matrix of the sampled state action pairs and the vector of estimated $Q$ values, respectively
8:        $X_{k}=X_{k}\cup X_{j}$
9:        Call Algorithm 2 with input ($X=X_{j}$, $y=Y_{j}$, $T=T_{k,j}$) and return parameter $\theta$
10:        $Q_{k,j}=Q_{\theta}$
11:     end for
12:     for $i\in\{1,\cdots,J\cdot n_{k,j}\}$ do
13:        $A_{k,J}(s_{i},a_{i})=Q_{k,J}(s_{i},a_{i})-\sum_{a\in\mathcal{A}}\pi_{\lambda_{k}}(a|s_{i})Q_{k,J}(s_{i},a)$
14:        $w_{i+1}=w_{i}-2{\beta_{i}}\left(w_{i}\cdot{\nabla}_{\lambda}\log{\pi_{\lambda_{k}}}(a_{i}|s_{i})-A_{k,J}(s_{i},a_{i})\right){\nabla}_{\lambda}\log{\pi_{\lambda_{k}}}(a_{i}|s_{i})$
15:     end for
16:     $w_{k}=w_{J\cdot n_{k,j}}$
17:     Update $\lambda_{k+1}=\lambda_{k}+{\eta}\left(\frac{1}{1-\gamma}\right)w_{k}$
18:  end for
Output: $\pi_{\lambda_{K+1}}$
Algorithm 2 Neural Network Parameter Estimation
1:  Input: data $(X,y,T)$
2:  Sample: $\tilde{D}=\{\mathrm{diag}(\mathbf{1}(Xg_{i}>0)):g_{i}\sim\mathcal{N}(0,I),\,i\in[|\tilde{D}|]\}$
3:  Initialize $y^{1}=0,u^{1}=0$ and define $g(u)=\|\sum_{D_{i}\in\tilde{D}}D_{i}Xu_{i}-y\|^{2}_{2}$
4:  for $i\in\{0,\cdots,T\}$ do
5:     $u^{i+1}=y^{i}-\eta_{i}\nabla{g(y^{i})}$
6:     $y^{i+1}=\operatorname*{arg\,min}_{y:\|y\|_{1}\leq\frac{R_{\max}}{1-\gamma}}\|u^{i+1}-y\|^{2}_{2}$
7:  end for
8:  Set $u^{*}=u^{T+1}$
9:  Solve the cone decomposition: find $\bar{v},\bar{w}$ such that $u_{i}^{*}=\bar{v}_{i}-\bar{w}_{i}$ with $\bar{v}_{i},\bar{w}_{i}\in\mathcal{K}_{i}$ and at least one of $\bar{v}_{i},\bar{w}_{i}$ equal to zero, for $i\in[|\tilde{D}|]$
10:  Construct $\theta=\{u_{i},\alpha_{i}\}$ using the transformation
$\psi(\bar{v}_{i},\bar{w}_{i})=\begin{cases}(\bar{v}_{i},1),&\text{if }\bar{w}_{i}=0\\ (\bar{w}_{i},-1),&\text{if }\bar{v}_{i}=0\\ (0,0),&\text{if }\bar{v}_{i}=\bar{w}_{i}=0\end{cases}$    (14)
for all $i\in\{1,\cdots,m\}$
11:  Return $\theta$

We summarize the proposed approach in Algorithm 1. Algorithm 1 has an outer for loop with two inner for loops. At a fixed iteration $k$ of the outer for loop and iteration $j$ of the first inner for loop, we obtain a sequence of state action pairs and the corresponding next states and rewards by following the policy estimate available at the start of the iteration. In order to perform the critic update, the state action pairs and the corresponding target $Q$ values are stored in matrix form and passed to Algorithm 2 as the input and response values, respectively, to solve the following optimization problem.

$\operatorname*{arg\,min}_{\theta\in\Theta}\frac{1}{n_{k,j}}\sum_{i=1}^{n_{k,j}}\Bigg(Q_{\theta}(s_{i},a_{i})-r(s_{i},a_{i})-{\gamma}\,\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime})\Bigg)^{2},$    (15)

where $Q_{k,j-1}$ is the estimate of the $Q$ function at the $k^{th}$ iteration of the outer for loop and the $(j-1)^{th}$ iteration of the first inner for loop of Algorithm 1, $Q_{\theta}$ is a neural network defined as in (10), and $n_{k,j}$ is the number of state action pairs sampled at the $k^{th}$ iteration of the outer for loop and the $j^{th}$ iteration of the first inner for loop of Algorithm 1. This is done at each iteration of the first inner for loop, performing a fitted Q-iteration step to obtain the estimate of the critic.
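A minimal sketch of the target construction in (15) is given below; q_prev and pi_probs are assumed helper functions (not defined in the paper) that return, for a next state, the vector of $Q_{k,j-1}(s_{i+1},\cdot)$ values and the action probabilities under $\pi_{\lambda_{k}}$, respectively.

import numpy as np

def fqi_targets(rewards, next_states, pi_probs, q_prev, gamma):
    # Regression targets of Eq. (15): y_i = r(s_i, a_i) + gamma * E_{a'~pi}[Q_{k,j-1}(s_{i+1}, a')].
    return np.array([
        r + gamma * np.dot(pi_probs(s_next), q_prev(s_next))
        for r, s_next in zip(rewards, next_states)
    ])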

Algorithm 2 first samples a set of diagonal matrices, denoted by $\tilde{D}$, in line 2. The elements of $\tilde{D}$ act as the diagonal matrix replacements of the ReLU function. Algorithm 2 then solves an optimization of the form given in Equation (15) by converting it to an optimization of the form (34). This convex optimization is solved using projected gradient descent. After obtaining the optimum of this convex program, denoted by $u^{*}=\{u^{*}_{i}\}_{i\in\{1,\cdots,|\tilde{D}|\}}$, in line 10 we transform it into an estimate of the solution of the optimization given in (15), which is then passed back to Algorithm 1. The procedure is described in detail along with the relevant definitions in Appendix A.
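A minimal sketch of the core of Algorithm 2 is shown below, under the sampled convex objective of (34): activation patterns are sampled as in line 2, and plain gradient descent is run on the resulting convex loss (the $\ell_1$-ball projection of line 6 is omitted here for brevity, so this is a simplified illustration rather than the full procedure).

import numpy as np

def sample_patterns(X, num_patterns, rng):
    # Line 2 of Algorithm 2: D_i = diag(1(X g_i > 0)) with g_i ~ N(0, I);
    # column i of the returned matrix stores the diagonal of D_i.
    return (X @ rng.standard_normal((X.shape[1], num_patterns)) > 0).astype(float)

def solve_sampled_convex(X, y, num_patterns=8, steps=500, lr=1e-3, seed=0):
    # Gradient descent on the sampled convex objective || sum_i D_i X p_i - y ||_2^2 of Eq. (34).
    rng = np.random.default_rng(seed)
    masks = sample_patterns(X, num_patterns, rng)         # shape (n, num_patterns)
    P = np.zeros((X.shape[1], num_patterns))              # one vector p_i per sampled pattern
    for _ in range(steps):
        resid = ((X @ P) * masks).sum(axis=1) - y         # sum_i D_i X p_i - y
        P -= lr * 2.0 * (X.T @ (resid[:, None] * masks))  # gradient of the squared loss
    return P, masks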

The estimate of $w_{k}^{*}$ is obtained in the second inner for loop of Algorithm 1, where stochastic gradient descent is performed on a loss function of the form given in Equation (6) using the state action pairs sampled in the first inner for loop. Note that we do not have access to the true $Q$ function required for this update; thus we use the estimate of the $Q$ function obtained at the end of the first inner for loop. After obtaining our estimate of the minimizer of Equation (6), we update the policy parameter using the stochastic gradient update step. The state action pairs used here are the same ones sampled in the first inner for loop.
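The actor-side update of line 14 of Algorithm 1 is a streaming stochastic gradient step on the squared compatible-function-approximation loss; a minimal sketch over the stored samples is given below.

import numpy as np

def actor_weight_sgd(scores, advantages, step_sizes):
    # scores[i] = grad_lambda log pi_lambda_k(a_i|s_i), advantages[i] = A_{k,J}(s_i, a_i).
    # Implements w_{i+1} = w_i - 2 beta_i (w_i . grad log pi - A) grad log pi (line 14 of Algorithm 1).
    w = np.zeros(scores.shape[1])
    for i in range(len(scores)):
        g, A = scores[i], advantages[i]
        w = w - 2.0 * step_sizes[i] * (np.dot(w, g) - A) * g
    return w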

5 Global Convergence Analysis of NAC2L Algorithm

5.1 Assumptions

In this subsection, we formally describe the assumptions that will be used in the results.

Assumption 1

For any $\lambda_{1},\lambda_{2}\in\Lambda$ and $(s,a)\in(\mathcal{S}\times\mathcal{A})$ we have

$\|{\nabla}\log\pi_{\lambda_{1}}(a|s)-{\nabla}\log\pi_{\lambda_{2}}(a|s)\|_{2}\leq\beta\|\lambda_{1}-\lambda_{2}\|_{2},$    (16)

where $\beta>0$.

Such assumptions have been utilised in prior policy gradient based works such as (agarwal2020optimality; liu2020improved). This assumption is satisfied by the softmax policy parameterization

$\pi_{\lambda}(a|s)=\frac{\exp(f_{\lambda}(s,a))}{\sum_{a^{\prime}\in\mathcal{A}}\exp(f_{\lambda}(s,a^{\prime}))},$    (17)

where $f_{\lambda}(s,a)$ is a neural network with a smooth activation function (agarwal2020optimality).
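For example, the softmax parametrization of (17) and its score function $\nabla_{\lambda}\log\pi_{\lambda}(a|s)$ can be computed as in the following sketch, where f_values holds $f_{\lambda}(s,\cdot)$ and f_grads is assumed to hold the per-action gradients $\nabla_{\lambda}f_{\lambda}(s,a^{\prime})$ (both are assumed inputs for this illustration).

import numpy as np

def softmax_policy(f_values):
    # pi_lambda(a|s) proportional to exp(f_lambda(s, a)), cf. Eq. (17).
    z = np.exp(f_values - np.max(f_values))   # shift for numerical stability
    return z / z.sum()

def softmax_score(f_values, f_grads, a):
    # grad log pi(a|s) = grad f(s, a) - sum_{a'} pi(a'|s) grad f(s, a').
    p = softmax_policy(f_values)
    return f_grads[a] - (p[:, None] * f_grads).sum(axis=0)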

Assumption 2

Let $\theta^{*}\triangleq\arg\min_{\theta\in\Theta}\mathcal{L}(\theta)$, where $\mathcal{L}(\theta)$ is defined in (8), and denote by $Q_{\theta^{*}}(\cdot)$ the function $Q_{\theta}(\cdot)$ defined in (10) for $\theta=\theta^{*}$. Also, let $\theta_{|\tilde{D}|}^{*}\triangleq\arg\min_{\theta\in\Theta}\mathcal{L}_{|\tilde{D}|}(\theta)$, where $\mathcal{L}_{|\tilde{D}|}(\theta)$ is the loss function $\mathcal{L}(\theta)$ with the set of diagonal matrices $D_{X}$ replaced by $\tilde{D}\subseteq D_{X}$. Further, denote by $Q_{\theta_{|\tilde{D}|}^{*}}(\cdot)$ the function $Q_{\theta}(\cdot)$ defined in (10) for $\theta=\theta_{|\tilde{D}|}^{*}$. Then we assume

$\mathbb{E}_{s,a}\left(|Q_{\theta^{*}}-Q_{\theta_{|\tilde{D}|}^{*}}|\right)_{\nu}\leq\epsilon_{|\tilde{D}|},$    (18)

for any $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$.

Thus, $\epsilon_{|\tilde{D}|}$ is a measure of the error incurred due to taking a sample of diagonal matrices $\tilde{D}$ rather than the full set $D_{X}$. In practice, setting $|\tilde{D}|$ to be of the same order of magnitude as $d$ (the dimension of the data) gives a sufficient number of diagonal matrices to obtain a reformulation of the non-convex optimization problem that performs comparably to or better than existing gradient descent algorithms; therefore $\epsilon_{|\tilde{D}|}$ is included only for theoretical completeness and will be negligible in practice. This has been demonstrated empirically in (pmlr-v162-mishkin22a; pmlr-v162-bartan22a; pmlr-v162-sahiner22a). Refer to Appendix A for details of $D_{X}$, $\tilde{D}$ and $\mathcal{L}_{|\tilde{D}|}(\theta)$.

Assumption 3

We assume that for all functions $Q:\mathcal{S}\times\mathcal{A}\rightarrow\left[0,\frac{R_{\max}}{1-\gamma}\right]$, there exists a function $Q_{\theta}$ with $\theta\in\Theta$ such that

$\mathbb{E}_{s,a}\left((Q_{\theta}-Q)^{2}\right)_{\nu}\leq\epsilon_{approx},$    (19)

for any $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$.

$\epsilon_{approx}$ reflects the error that is incurred due to the inherent lack of expressiveness of the neural network function class. In the analysis of (pmlr-v120-yang20a), this error is assumed to be zero. Explicit upper bounds on $\epsilon_{approx}$ are given in terms of neural network parameters in works such as (yarotsky2017error).

Assumption 4

We assume that for all functions $Q:\mathcal{S}\times\mathcal{A}\rightarrow\left[0,\frac{R_{\max}}{1-\gamma}\right]$ and $\lambda\in\Lambda$, there exists $w^{*}\in\mathbb{R}^{|\mathcal{A}|}$ such that

$\mathbb{E}_{s,a}\left(|Q-w^{*}{\nabla}\log\pi_{\lambda}(a|s)|\right)_{\nu}\leq\epsilon_{bias},$    (20)

for any $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$.

Assumption 4 is similar to the transfer error assumption in works such as (agarwal2020flambe; liu2020improved). The key difference is that in the referenced works the assumption is stated for the true $Q$ function, and the estimate of $w^{*}$ is obtained using a noisy Monte Carlo estimate. In our case, we use a known parameterised $Q$ function to obtain our estimate of $w^{*}$.

Assumption 5

For any $\lambda\in\Lambda$, denote by $\pi_{\lambda}$ the policy corresponding to the parameter $\lambda$ and by $\mu_{\lambda}$ the corresponding stationary state action distribution of the induced Markov chain. We assume that there exist constants $m>0$ and $\rho\in(0,1)$ such that

$d_{TV}\left(\mathbb{P}\left((s_{\tau},a_{\tau})\in\cdot\,|\,(s_{0},a_{0})=(s,a)\right),\mu_{\lambda}(\cdot)\right)\leq m{\rho}^{\tau},\quad\forall\,\tau\geq 0,\ (s,a)\in\mathcal{S}\times\mathcal{A}.$    (21)

This assumption implies that the Markov chain mixes geometrically. Such assumptions are widely used both in the analysis of stochastic gradient descent with Markovian data, such as (9769873; sun2018markov), and in finite-time analyses of RL algorithms, such as (wu2020finite; NEURIPS2020_2e1b24a6).

Assumption 6

Let $\nu_{1}$ be a probability measure on $\mathcal{S}\times\mathcal{A}$ which is absolutely continuous with respect to the Lebesgue measure. Let $\{\pi_{t}\}$ be a sequence of policies and suppose that the state action pair has an initial distribution of $\nu_{1}$. Then we assume that for all $\nu_{1},\nu_{2}\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$ there exists a constant $\phi_{\nu_{1},\nu_{2}}<\infty$ such that

$\sup_{\pi_{1},\pi_{2},\cdots,\pi_{m}}\left\|\frac{d(P^{\pi_{1}}P^{\pi_{2}}\cdots{P}^{\pi_{m}}\nu_{2})}{d\nu_{1}}\right\|_{\infty}\leq\phi_{\nu_{1},\nu_{2}}$    (22)

for all $m\in\{0,1,\cdots\}$, where $\frac{d(P^{\pi_{1}}P^{\pi_{2}}\cdots{P}^{\pi_{m}}\nu_{2})}{d\nu_{1}}$ denotes the Radon-Nikodym derivative of the state action distribution $P^{\pi_{1}}P^{\pi_{2}}\cdots{P}^{\pi_{m}}\nu_{2}$ with respect to the distribution $\nu_{1}$.

This assumption puts an upper bound on how far the state action distribution obtained by sampling a state action pair from the distribution $\nu_{2}$ and then following any possible sequence of policies for the next $m$ steps can deviate from $\nu_{1}$, for any finite value of $m$. Similar assumptions have been made in (pmlr-v120-yang20a; JMLR:v17:10-364; farahmand2010error).

5.2 Main Result

Theorem 1

Suppose Assumptions 1-6 hold. Let Algorithm 1 run for $K$ iterations and let $J$ be the number of iterations of the first inner loop of Algorithm 1. Let $n_{k,j}$ denote the number of state-action pairs sampled and $T_{k,j}$ the number of iterations of Algorithm 2 at iteration $k$ of the outer for loop and iteration $j$ of the first inner for loop of Algorithm 1. Let $\alpha_{i}$ be the projected gradient descent step size at iteration $i$ of Algorithm 2 and $|\tilde{D}|$ the number of diagonal matrices sampled in Algorithm 2. Let $\beta_{i}$ be the step size of the gradient descent at iteration $i$ of the second inner for loop of Algorithm 1. Let $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$ be the starting state action distribution at each iteration $k$ of Algorithm 1. If we set $\alpha_{i}=\frac{\|u^{*}_{k,j}\|_{2}}{L_{k,j}\sqrt{i+1}}$, $\eta=\frac{1}{\sqrt{K}}$ and $\beta_{i}=\frac{\mu_{k}}{i+1}$, then we obtain

$\min_{k\leq K}(V^{*}(\nu)-V^{\pi_{\lambda_{k}}}(\nu)) \leq {\mathcal{O}}\left(\frac{1}{\sqrt{K}(1-\gamma)}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}\sum_{j=0}^{J-1}\mathcal{O}\left(\frac{\log\log(n_{k,j})}{\sqrt{n_{k,j}}}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}\sum_{j=0}^{J-1}{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}{\mathcal{O}}(\gamma^{J})+\frac{1}{1-\gamma}\left(\epsilon_{bias}+\sqrt{\epsilon_{approx}}+{\epsilon_{|\tilde{D}|}}\right),$    (23)

where $\|u^{*}_{k,j}\|_{2},L_{k,j},\mu_{k},\epsilon_{bias},\epsilon_{approx},{\epsilon_{|\tilde{D}|}}$ are constants.

Hence, for $K=\mathcal{O}(\epsilon^{-2}(1-\gamma)^{-2})$, $J=\mathcal{O}\left(\log\left(\frac{1}{\epsilon}\right)\right)$, $n_{k,j}=\tilde{\mathcal{O}}\left(\epsilon^{-2}(1-\gamma)^{-2}\right)$, and $T_{k,j}=\mathcal{O}(\epsilon^{-2}(1-\gamma)^{-2})$, we have

$\min_{k\leq K}(V^{*}(\nu)-V^{\pi_{\lambda_{k}}}(\nu))\leq\epsilon+\frac{1}{1-\gamma}\left(\epsilon_{bias}+\sqrt{\epsilon_{approx}}+{\epsilon_{|\tilde{D}|}}\right),$    (24)

which implies a sample complexity of $\sum_{k=1}^{K}\sum_{j=1}^{J}n_{k,j}=\tilde{\mathcal{O}}\left({\epsilon^{-4}(1-\gamma)^{-4}}\right)$.
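The count follows directly from the stated choices of $K$, $J$ and $n_{k,j}$:

$\sum_{k=1}^{K}\sum_{j=1}^{J}n_{k,j} = K\cdot J\cdot\tilde{\mathcal{O}}\left(\epsilon^{-2}(1-\gamma)^{-2}\right) = \mathcal{O}\left(\epsilon^{-2}(1-\gamma)^{-2}\right)\cdot\mathcal{O}\left(\log\frac{1}{\epsilon}\right)\cdot\tilde{\mathcal{O}}\left(\epsilon^{-2}(1-\gamma)^{-2}\right) = \tilde{\mathcal{O}}\left(\epsilon^{-4}(1-\gamma)^{-4}\right),$

since the logarithmic factor contributed by $J$ is absorbed into the $\tilde{\mathcal{O}}$ notation.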

6 Proof Sketch of Theorem 1

The detailed proof of Theorem 1 is given in Appendix E. The difference between our estimated value function, denoted by $V^{\pi_{\lambda_{k}}}$, and the optimal value function, denoted by $V^{*}$ (where $\pi_{\lambda_{k}}$ is the policy obtained at step $k$ of Algorithm 1), is first expressed as a function of the compatible function approximation error, which is then split into different components that are analysed separately. The proof thus proceeds in two stages. In the first stage, we show how the difference in value functions is upper bounded as a function of the errors incurred up to the final step $K$. In the second stage, we upper bound the different error components.

Upper Bounding Error in Separate Error Components: We use the smoothness property in Assumption 1 to obtain the following bound on the difference between the optimal value function and our estimated value function.

$\min_{k\in\{1,\cdots,K\}}V^{*}(\nu)-V^{\pi_{\lambda_{k}}}(\nu)\leq\frac{\log(|\mathcal{A}|)}{K{\eta}(1-\gamma)}+\frac{\eta{\beta}W^{2}}{2(1-\gamma)}+\frac{1}{K}\sum_{k=1}^{K}\frac{err_{k}}{1-\gamma},$    (25)

where

$err_{k}=\mathbb{E}_{s,a}\left(A^{\pi_{\lambda_{k}}}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s)\right),$    (26)

and $s\sim d^{\pi^{*}}_{\nu}$, $a\sim\pi^{*}(\cdot|s)$, and $W$ is a constant such that $\|w^{k}\|_{2}\leq W$ for all $k$, where $k$ denotes the iteration of the outer for loop of Algorithm 1. We split the term in (26) into the errors incurred due to the actor and critic steps as follows

$err_{k} = \mathbb{E}_{s,a}\left(A^{\pi_{\lambda_{k}}}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s)\right)$    (27)
$= \underbrace{\mathbb{E}_{s,a}\left(A^{\pi_{\lambda_{k}}}-A_{k,J}\right)}_{I}+\underbrace{\mathbb{E}_{s,a}\left(A_{k,J}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s)\right)}_{II}.$    (28)

Note that $I$ is the difference between the true advantage function $A^{\pi_{\lambda_{k}}}$ corresponding to the policy $\pi_{\lambda_{k}}$ and our estimate $A_{k,J}$. This estimation is carried out in the first inner for loop of Algorithm 1; thus $I$ is the error incurred in the critic step. $II$ is the error incurred in the estimation of the actor update, which comes from the stochastic gradient descent steps in the second inner for loop of Algorithm 1.

Also note that the expectation is with respect to the discounted state action distribution induced by the optimal policy $\pi^{*}$, whereas the state action samples we obtain come from the policy $\pi_{\lambda_{k}}$. Thus, using Assumption 6, we convert the expectation in Equation (28) to an expectation with respect to the stationary state action distribution induced by the policy $\pi_{\lambda_{k}}$.

Upper Bounding Error in Critic Step: For each iteration $k$ of Algorithm 1, we show that minimizing $I$ is equivalent to solving the following problem

$\operatorname*{arg\,min}_{\theta\in\Theta}\mathbb{E}_{s,a}\left({Q^{\pi_{\lambda_{k}}}-Q_{\theta}}\right)^{2}.$    (29)

We rely on the analysis laid out in (farahmand2010error) and, instead of iterating over value functions, apply a similar analysis to the action value function to obtain an upper bound for the error incurred in solving the problem in Equation (29) using fitted Q-iteration. We recreate the result for the value function from Lemma 2 of (munos2003error) for the action value function $Q$ to obtain

$\mathbb{E}_{s,a}(Q^{\pi_{\lambda_{k}}}-Q_{k,J}) \leq \sum_{j=1}^{J-1}\gamma^{J-j-1}(P^{\pi_{\lambda_{k}}})^{J-j-1}{\mathbb{E}}|\epsilon_{k,j}|+\gamma^{J}\left(\frac{R_{\max}}{1-\gamma}\right),$    (30)

where $\epsilon_{k,j}=T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q_{k,j}$ is the Bellman error incurred at iteration $j$, with $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$ and $P^{\pi_{\lambda_{k}}}$ defined as in Equations (51) and (53), respectively. $Q_{k,J}$ denotes the $Q$ estimate at the final iteration $J$ of the first inner for loop and iteration $k$ of the outer for loop of Algorithm 1.

The first term on the right hand side is called the algorithmic error, which depends on how well we approximate the Bellman backup at each iteration. The second term on the right hand side is called the statistical error, which is incurred because only a finite number $J$ of iterations of the FQI procedure are performed and depends only on the parameters of the MDP. Intuitively, the first error depends on how much data is collected at each iteration, how close our solution of the optimization step is to the true solution, and how well our function class can approximate $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$. Building upon this intuition, we split $\epsilon_{k,j}$ into four different components as follows.

$\epsilon_{k,j} = T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q_{k,j} = \underbrace{T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}}_{\epsilon^{1}_{k,j}}+\underbrace{Q^{1}_{k,j}-Q^{2}_{k,j}}_{\epsilon^{2}_{k,j}}+\underbrace{Q^{2}_{k,j}-Q^{3}_{k,j}}_{\epsilon^{3}_{k,j}}+\underbrace{Q^{3}_{k,j}-Q_{k,j}}_{\epsilon^{4}_{k,j}} = {\epsilon^{1}_{k,j}}+{\epsilon^{2}_{k,j}}+{\epsilon^{3}_{k,j}}+{\epsilon^{4}_{k,j}},$    (31)

where $Q^{1}_{k,j}$, $Q^{2}_{k,j}$ and $Q^{3}_{k,j}$ are defined in Appendix B. We use Lemmas 14, 15, 16, and 17 to bound the error terms in Equation (31).

Upper Bounding Error in Actor Step: Note that we require the minimization of the term $\mathbb{E}_{s,a}(A_{k,J}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s))$, where the expectation is with respect to the stationary state action distribution corresponding to $\pi_{\lambda_{k}}$. However, we do not have samples of state action pairs from this stationary distribution; we only have samples from the Markov chain induced by the policy $\pi_{\lambda_{k}}$. We thus use the theory in (9769873) and Assumption 5 to upper bound the error incurred.

7 Conclusions

In this paper, we study a Natural Actor Critic algorithm with a neural network used to represent both the actor and the critic and find sample complexity guarantees for the algorithm. Using the conversion of the optimization of a 2-layer ReLU neural network into a convex problem for estimating the critic, we show that our approach achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-4}(1-\gamma)^{-4})$. This demonstrates the first approach for achieving sample complexity guarantees beyond linear MDP assumptions for the critic.

Limitations: Relaxing the various stated assumptions is an interesting direction for future work. Further, the results assume a 2-layer neural network parametrization for the critic. One could likely use the framework described in (belilovsky2019greedy) to extend the results to a multi-layer setup.

A Convex Reformulation with Two-Layer Neural Networks

For representing the action value function, we use a 2-layer ReLU neural network. In this section, we first lay out the theory behind the convex formulation of the 2-layer ReLU neural network. In the next section, it will be shown how it is utilised for the FQI algorithm.

In order to obtain the parameter $\theta$ for a given set of data $X\in\mathbb{R}^{n\times d}$ and the corresponding response values $y\in\mathbb{R}^{n\times 1}$, we seek the parameter that minimizes the squared loss (with a regularization parameter $\beta\in[0,1]$), given by

$\mathcal{L}(\theta) = \Bigg\|\sum_{i=1}^{m}\sigma(Xu_{i})\alpha_{i}-y\Bigg\|_{2}^{2}.$    (32)

Here, the term $\sigma(Xu_{i})$ is the vector $\{\sigma^{\prime}((x_{j})^{T}u_{i})\}_{j\in\{1,\cdots,n\}}$, where $x_{j}$ is the $j^{th}$ row of $X$; that is, the ReLU function applied to each element of the vector $Xu_{i}$. We note that the optimization in Equation (8) is non-convex in $\theta$ due to the presence of the ReLU activation function. In (wang2021hidden), it is shown that this optimization problem has an equivalent convex form, provided that the number of neurons $m$ exceeds a certain threshold value. This convex problem is obtained by replacing the ReLU functions in the optimization problem with equivalent diagonal operators. The convex problem is given as

$\mathcal{L}^{\prime}_{\beta}(p) := \Bigg\|\sum_{D_{i}\in D_{X}}D_{i}(Xp_{i})-y\Bigg\|^{2}_{2},$    (33)

where $p\in\mathbb{R}^{d\times|D_{X}|}$ and $D_{X}$ is the set of diagonal matrices $D_{i}$ which depend on the data set $X$. Except for the case of $X$ being low rank, it is not computationally feasible to obtain the set $D_{X}$. We instead use a subset $\tilde{D}\subseteq D_{X}$ to solve the convex problem

$\mathcal{L}^{\prime}_{\beta}(p) := \Bigg\|\sum_{D_{i}\in\tilde{D}}D_{i}(Xp_{i})-y\Bigg\|^{2}_{2},$    (34)

where $p\in\mathbb{R}^{d\times|\tilde{D}|}$. In order to understand the convex reformulation of the squared loss optimization problem, consider the vector $\sigma(Xu_{i})$:

$\sigma(Xu_{i})=\begin{bmatrix}\sigma^{\prime}((x_{1})^{T}u_{i})\\ \sigma^{\prime}((x_{2})^{T}u_{i})\\ \vdots\\ \sigma^{\prime}((x_{n})^{T}u_{i})\end{bmatrix}.$    (35)

Now, for a fixed $X\in\mathbb{R}^{n\times d}$, different $u_{i}\in\mathbb{R}^{d\times 1}$ will make different components of $\sigma(Xu_{i})$ non-zero. For example, if we take the set of all $u_{i}$ such that only the first element of $\sigma(Xu_{i})$ is non-zero (i.e., only $(x_{1})^{T}u_{i}\geq 0$ and $(x_{j})^{T}u_{i}<0$ for all $j\in\{2,\cdots,n\}$) and denote it by $\mathcal{K}_{1}$, then we have

$\sigma(Xu_{i})=D_{1}(Xu_{i})\quad\forall u_{i}\in\mathcal{K}_{1},$    (36)

where $D_{1}$ is the $n\times n$ diagonal matrix with only the first diagonal element equal to $1$ and the rest $0$. Similarly, there exist sets of $u$'s which result in $\sigma(Xu)$ having certain components non-zero and the rest zero. For each such combination of zero and non-zero components, we have a corresponding set of $u_{i}$'s and a corresponding $n\times n$ diagonal matrix $D_{i}$. We define the set of such diagonal matrices for a given matrix $X$ as

$D_{X}=\{D=\mathrm{diag}(\mathbf{1}(Xu\geq 0)):u\in\mathbb{R}^{d},\,D\in\mathbb{R}^{n\times n}\},$    (37)

where $\mathrm{diag}(\mathbf{1}(Xu\geq 0))$ represents a matrix given by

$D_{k,j}=\begin{cases}\mathbf{1}(x_{j}^{T}u),&\text{for }k=j\\ 0,&\text{for }k\neq j,\end{cases}$    (38)

where $\mathbf{1}(x)=1$ if $x>0$ and $\mathbf{1}(x)=0$ if $x\leq 0$. Corresponding to each such matrix $D_{i}$, there exists a set of $u$'s given by

$\mathcal{K}_{i}=\{u\in\mathbb{R}^{d}:\sigma(Xu)=D_{i}Xu,\ D_{i}\in D_{X}\},$    (39)

which can equivalently be written as $\mathcal{K}_{i}=\{u\in\mathbb{R}^{d}:(2D_{i}-I)Xu\geq 0\}$, where $I$ is the $n\times n$ identity matrix. The number of these matrices $D_{i}$ is upper bounded by $2^{n}$. From (wang2021hidden), the upper bound is $\mathcal{O}\left(r\left(\frac{n}{r}\right)^{r}\right)$, where $r=\mathrm{rank}(X)$. Also, note that the sets $\mathcal{K}_{i}$ form a partition of the space $\mathbb{R}^{d\times 1}$. Using these definitions, we define the convex problem equivalent to the one in Equation (8) as
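A minimal sketch of how the set $D_{X}$ of Equation (37) can be explored empirically is given below: random directions $u$ are sampled and the distinct sign patterns $\mathbf{1}(Xu\geq 0)$ are collected, giving a sampled (lower-bound) view of $D_{X}$; for low-rank $X$ the number of distinct patterns stays far below the crude $2^{n}$ bound.

import numpy as np

def sampled_patterns(X, num_samples=10000, seed=0):
    # Collect distinct activation patterns diag(1(Xu >= 0)) over random directions u,
    # an empirical exploration of the set D_X in Eq. (37).
    rng = np.random.default_rng(seed)
    patterns = set()
    for _ in range(num_samples):
        u = rng.standard_normal(X.shape[1])
        patterns.add(tuple((X @ u >= 0).astype(int)))
    return patterns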

$\mathcal{L}_{\beta}(v,w):=\Bigg\|\sum_{D_{i}\in D_{X}}D_{i}(X(v_{i}-w_{i}))-y\Bigg\|^{2}_{2},$    (40)

where $v=\{v_{i}\}_{i\in\{1,\cdots,|D_{X}|\}}$, $w=\{w_{i}\}_{i\in\{1,\cdots,|D_{X}|\}}$, and $v_{i},w_{i}\in\mathcal{K}_{i}$; note that, by definition, for any fixed $i\in\{1,\cdots,|D_{X}|\}$ at least one of $v_{i}$ or $w_{i}$ is zero. If $v^{*},w^{*}$ are the minimizers of Equation (40), the number of neurons $m$ of the original problem in Equation (8) should be greater than the number of indices $i$ for which at least one of $v_{i}^{*}$ or $w_{i}^{*}$ is non-zero. We denote this value as $m^{*}_{X,y}$, with the subscripts denoting that this quantity depends upon the data matrix $X$ and response $y$.

We convert $v^{*},w^{*}$ into optimal values of Equation (8), denoted by $\theta^{*}=(U^{*},\alpha^{*})$, using a function $\psi:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}\times\mathbb{R}$ defined as follows

$\psi(v_{i},w_{i})=\begin{cases}(v_{i},1),&\text{if }w_{i}=0\\ (w_{i},-1),&\text{if }v_{i}=0\\ (0,0),&\text{if }v_{i}=w_{i}=0\end{cases}$    (44)

where, according to (pmlr-v119-pilanci20a), we have $(u_{i}^{*},\alpha_{i}^{*})=\psi(v_{i}^{*},w_{i}^{*})$ for all $i\in\{1,\cdots,|{D}_{X}|\}$, with $u^{*}_{i},\alpha^{*}_{i}$ being the elements of $\theta^{*}$. Note that restricting $\alpha_{i}$ to $\{1,-1,0\}$ is shown to be valid in (pmlr-v162-mishkin22a). For $i\in\{|{D}_{X}|+1,\cdots,m\}$ we set $(u_{i}^{*},\alpha_{i}^{*})=(0,0)$.
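A minimal sketch of the transformation in (44), mapping each cone pair back to a neuron of the network in (10), is given below.

import numpy as np

def psi(v_i, w_i):
    # Eq. (44): at most one of (v_i, w_i) is non-zero by construction of Eq. (40).
    if np.any(w_i != 0):
        return w_i, -1.0
    if np.any(v_i != 0):
        return v_i, 1.0
    return np.zeros_like(v_i), 0.0

Applying psi to each optimal pair $(v_{i}^{*},w_{i}^{*})$ and setting the remaining neurons to zero yields the parameters $\theta^{*}=(U^{*},\alpha^{*})$ described above.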

Since $D_{X}$ is hard to obtain computationally unless $X$ is of low rank, we can construct a subset $\tilde{D}\subseteq D_{X}$ and perform the optimization in Equation (40) by replacing $D_{X}$ with $\tilde{D}$ to get

$\mathcal{L}_{\beta}(v,w):=\Bigg\|\sum_{D_{i}\in\tilde{D}}D_{i}(X(v_{i}-w_{i}))-y\Bigg\|^{2}_{2},$    (45)

where $v=\{v_{i}\}_{i\in\{1,\cdots,|\tilde{D}|\}}$, $w=\{w_{i}\}_{i\in\{1,\cdots,|\tilde{D}|\}}$, and $v_{i},w_{i}\in\mathcal{K}_{i}$; by definition, for any fixed $i\in\{1,\cdots,|\tilde{D}|\}$ at least one of $v_{i}$ or $w_{i}$ is zero.

The required condition for $\tilde{D}$ to be a sufficient replacement for $D_{X}$ is as follows. Suppose $(\bar{v}_{i},\bar{w}_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}$ denote the optimal solutions of Equation (45). Then we require

$m\geq\sum_{D_{i}\in\tilde{D}}|\{\bar{v}_{i}:\bar{v}_{i}\neq 0\}\cup\{\bar{w}_{i}:\bar{w}_{i}\neq 0\}|.$    (46)

In other words, the number of neurons in the neural network must be greater than the number of indices $i$ for which at least one of $\bar{v}_{i}$ or $\bar{w}_{i}$ is non-zero. Further,

$\mathrm{diag}(\mathbf{1}(Xu_{i}^{*}\geq 0))\in\tilde{D}\quad\forall\,i\in[m].$    (47)

In other words, the diagonal matrices induced by the optimal $u_{i}^{*}$'s of Equation (8) must be included in our sample of diagonal matrices. This is proved in Theorem 2.1 of (pmlr-v162-mishkin22a).

A computationally efficient method for obtaining $\tilde{D}$ and the optimal values of Equation (8) is laid out in (pmlr-v162-mishkin22a). In this method, we obtain our sample of diagonal matrices $\tilde{D}$ by sampling a fixed number of vectors from a $d$-dimensional standard multivariate Gaussian distribution, multiplying the vectors with the data matrix $X$, and then forming the diagonal matrices based on which co-ordinates are positive. Then we solve an optimization similar to the one in Equation (40), without the constraints that its parameters belong to sets of the form $\mathcal{K}_{i}$, as follows.

$\mathcal{L}^{\prime}_{\beta}(p):=\Bigg\|\sum_{D_{i}\in\tilde{D}}D_{i}(Xp_{i})-y\Bigg\|^{2}_{2},$    (48)

where $p\in\mathbb{R}^{d\times|\tilde{D}|}$. In order to satisfy the constraints of the form given in Equation (40), this step is followed by a cone decomposition step. This is implemented through functions $\{\psi_{i}^{\prime}\}_{i\in\{1,\cdots,|\tilde{D}|\}}$. Let $p^{*}=\{p^{*}_{i}\}_{i\in\{1,\cdots,|\tilde{D}|\}}$ be the optimal solution of Equation (48). For each $i$ we define a function $\psi_{i}^{\prime}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}\times\mathbb{R}^{d}$ as

$\psi_{i}^{\prime}(p_{i}) = (v_{i},w_{i})\ \text{ such that }\ p_{i}={v}_{i}-{w}_{i},\ \text{ with }{v}_{i},{w}_{i}\in\mathcal{K}_{i}.$    (49)

Then we obtain $\psi_{i}^{\prime}(p^{*}_{i})=(\bar{v}_{i},\bar{w}_{i})$. As before, at least one of $\bar{v}_{i},\bar{w}_{i}$ is $0$. Note that in practice we do not know whether the conditions in Equations (46) and (47) are satisfied for a given sampled $\tilde{D}$. We express this as follows. If $\tilde{D}$ were the full set of diagonal matrices, then we would have $(\bar{v}_{i},\bar{w}_{i})=({v}^{*}_{i},{w}^{*}_{i})$ and $\psi(\bar{v}_{i},\bar{w}_{i})=(u_{i}^{*},\alpha_{i}^{*})$ for all $i\in\{1,\cdots,|D_{X}|\}$. However, since that is not the case and $\tilde{D}\subseteq D_{X}$, $\{\psi(\bar{v}_{i},\bar{w}_{i})\}_{i\in\{1,\cdots,|\tilde{D}|\}}$ is instead an optimal solution of a non-convex optimization problem different from the one in Equation (8). We denote the corresponding loss by $\mathcal{L}_{|\tilde{D}|}(\theta)$, defined as

$\mathcal{L}_{|\tilde{D}|}(\theta)=\Bigg\|\sum_{i=1}^{m^{\prime}}\sigma(Xu_{i})\alpha_{i}-y\Bigg\|_{2}^{2},$    (50)

where $m^{\prime}=|\tilde{D}|$ is the size of the sampled set of diagonal matrices. In order to quantify the error incurred due to taking a subset of $D_{X}$, we assume that the expectation of the absolute value of the difference between the neural networks corresponding to the optimal solutions of the non-convex optimizations given in Equations (50) and (8) is upper bounded by a constant depending on the size of $\tilde{D}$. The formal assumption and its justification are given in Assumption 2.

B Error Characterization

Before we define the errors incurred during the actor and critic steps, we define some additional terms as follows

We define the Bellman operator for a policy $\pi$ as follows

$(T^{\pi}Q)(s,a)=r^{\prime}(s,a)+\gamma\int Q(s^{\prime},\pi(s^{\prime}))P(ds^{\prime}|s,a),$    (51)

where $r^{\prime}(s,a)=\mathbb{E}[r(s,a)|(s,a)]$. Similarly, we define the Bellman optimality operator as

$(TQ)(s,a)=r^{\prime}(s,a)+\max_{a^{\prime}\in\mathcal{A}}\gamma\int Q(s^{\prime},a^{\prime})P(ds^{\prime}|s,a).$    (52)

Further, the operator $P^{\pi}$ is defined as

$P^{\pi}Q(s,a)=\mathbb{E}[Q(s^{\prime},a^{\prime})\,|\,s^{\prime}\sim P(\cdot|s,a),a^{\prime}\sim\pi(\cdot|s^{\prime})],$    (53)

which is the one-step Markov transition operator for policy $\pi$ for the Markov chain defined on $\mathcal{S}\times\mathcal{A}$ with the transition dynamics given by $S_{t+1}\sim P(\cdot|S_{t},A_{t})$ and $A_{t+1}\sim\pi(\cdot|S_{t+1})$. It defines a distribution on the state action space after one transition from the initial state. Similarly, $P^{\pi_{1}}P^{\pi_{2}}\cdots{P}^{\pi_{m}}$ is the $m$-step Markov transition operator following policy $\pi_{t}$ at step $t$ for $1\leq t\leq m$. It defines a distribution on the state action space after $m$ transitions from the initial state. We have the relation

$(T^{\pi}Q)(s,a) = r^{\prime}+\gamma\int Q(s^{\prime},\pi(s^{\prime}))P(ds^{\prime}|s,a)$    (54)
$= r^{\prime}+{\gamma}(P^{\pi}Q)(s,a).$    (55)

We thus define $P^{*}$ as

$P^{*}Q(s,a)=\max_{a^{\prime}\in\mathcal{A}}\mathbb{E}[Q(s^{\prime},a^{\prime})\,|\,s^{\prime}\sim P(\cdot|s,a)],$    (56)

in other words, $P^{*}$ is the one-step Markov transition operator with respect to the greedy policy of the function on which it is acting, which implies that

$(TQ)(s,a) = r^{\prime}+{\gamma}(P^{*}Q)(s,a).$    (57)
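For a finite MDP, the operators in (51)-(57) reduce to simple array computations; the following sketch, which assumes tabular arrays for $P$, $r^{\prime}$, $\pi$ and $Q$, illustrates the relations $T^{\pi}Q=r^{\prime}+\gamma P^{\pi}Q$ and $TQ=r^{\prime}+\gamma P^{*}Q$.

import numpy as np

def bellman_operators(P, r, pi, Q, gamma):
    # Assumed tabular shapes: P is (S, A, S) with P[s, a, s'], r is (S, A) expected rewards,
    # pi is (S, A) action probabilities, Q is (S, A).
    v_pi = (pi * Q).sum(axis=1)                            # E_{a'~pi}[Q(s', a')] for each s'
    P_pi_Q = P @ v_pi                                      # (P^pi Q)(s, a), Eq. (53)
    T_pi_Q = r + gamma * P_pi_Q                            # (T^pi Q)(s, a), Eq. (55)
    P_star_Q = np.einsum('xay,yb->xab', P, Q).max(axis=2)  # (P^* Q)(s, a), Eq. (56)
    T_Q = r + gamma * P_star_Q                             # (T Q)(s, a), Eq. (57)
    return T_pi_Q, T_Q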

For any measurable function $f:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, we also define

$\mathbb{E}(f)_{\nu}=\int_{\mathcal{S}\times\mathcal{A}}f\,d\nu,$    (58)

for any distribution $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$.

We now characterize the errors which are incurred in the actor and critic steps. We define $\zeta^{\nu}_{\pi}(s,a)$ as the stationary state action distribution induced by the policy $\pi$ with the starting state action pair drawn from a distribution $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$. For the error incurred in the actor update, we define the related loss function as follows.

Definition 1

For iteration $k$ of the outer for loop of Algorithm 1, we define $w_{k}$ as the estimate of the minimizer of the loss function $\mathbb{E}_{(s,a)\sim\zeta^{\nu}_{\pi}(s,a)}\left(A_{k,J}(s,a)-w^{\top}{\nabla}\log\pi_{\lambda_{k}}(a|s)\right)^{2}$ obtained at the end of the second inner for loop of Algorithm 1. We further define the true minimizer as

w^{*}_{k} = \operatorname*{arg\,min}_{w}\mathbb{E}_{(s,a)\sim\zeta^{\nu}_{\pi}(s,a)}\left(A_{k,J}(s,a)-w^{\top}{\nabla}\log\pi_{\lambda_{k}}(a|s)\right)^{2}. (59)

For finding the estimate $w_{k}$, we re-use the state action pairs sampled in the first inner for loop of Algorithm 1. Note that we have to minimize a loss function whose expectation is with respect to the stationary distribution $\zeta^{\nu}_{\pi}(s,a)$, while our samples come from a Markov chain whose stationary distribution is $\zeta^{\nu}_{\pi}(s,a)$. For the error incurred in the critic update, we first define the various possible $Q$-functions which we can approximate, in decreasing order of accuracy.
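The minimization in Equation (59) is a least-squares regression of advantage estimates on the policy score features. A minimal sketch follows (this is not the paper's Algorithm 1; the score-feature matrix Phi and the advantage estimates A_hat are assumed to have been computed elsewhere from the sampled state-action pairs):

```python
import numpy as np

def natural_gradient_direction(Phi, A_hat, ridge=1e-6):
    """Solve min_w sum_i (A_hat[i] - w^T Phi[i])^2.

    Phi   : (N, d) rows are grad_lambda log pi_{lambda_k}(a_i | s_i)
    A_hat : (N,)   advantage estimates A_{k,J}(s_i, a_i)
    A small ridge term is added only for numerical stability of the solve.
    """
    d = Phi.shape[1]
    G = Phi.T @ Phi + ridge * np.eye(d)   # empirical Fisher-like matrix
    b = Phi.T @ A_hat
    return np.linalg.solve(G, b)

# Toy usage:
# rng = np.random.default_rng(1)
# Phi = rng.standard_normal((256, 10)); A_hat = Phi @ rng.standard_normal(10)
# w_k = natural_gradient_direction(Phi, A_hat)
```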

For the error components incurred during critic estimation, we start by defining the best approximation of the function $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$ obtainable from the class of two-layer ReLU neural networks, measured in expected squared error from the ground truth $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$.

Definition 2

For iteration kk of the outer for loop and iteration jj of the first inner for loop of Algorithm 1, we define

Qk,j1=argminQθ,θΘ𝔼(Qθ(s,a)TπλkQk,j1(s,a))2,Q^{1}_{k,j}=\operatorname*{arg\,min}_{Q_{\theta},\theta\in\Theta}\mathbb{E}(Q_{\theta}(s,a)-T^{\pi_{\lambda_{k}}}Q_{k,j-1}(s,a))^{2}, (60)

where (s,a)ζπν(s,a)(s,a)\sim\zeta^{\nu}_{\pi}(s,a).

Note that we do not have access to the transition probability kernel PP, hence we cannot calculate TπλkT^{\pi_{\lambda_{k}}}. To alleviate this, we use the observed next states to estimate the QQ-value function. Using this, we define Qk,j2Q^{2}_{k,j} as,

Definition 3

For iteration kk of the outer for loop and iteration jj of the first inner for loop of Algorithm 1, we define

Q^{2}_{k,j}=\operatorname*{arg\,min}_{Q_{\theta},\theta\in\Theta}\mathbb{E}\Big{(}Q_{\theta}(s,a)-\big{(}r^{\prime}(s,a)+\gamma{\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}}Q_{k,j-1}(s^{\prime},a^{\prime})\big{)}\Big{)}^{2}, (61)

where $(s,a)\sim\zeta^{\nu}_{\pi}(s,a)$, $s^{\prime}\sim P(\cdot|s,a)$, and $r(s,a)\sim R(\cdot|s,a)$.

Compared to $Q^{1}_{k,j}$, in $Q^{2}_{k,j}$ we are minimizing the expected squared loss from the target function $\big{(}r^{\prime}(s,a)+\gamma{\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}}Q_{k,j-1}(s^{\prime},a^{\prime})\big{)}$.

To obtain $Q^{2}_{k,j}$, we still need to compute the true expected value in Equation (61); however, we do not know the transition kernel $P$. To remove this limitation, we use sampling. Consider a set $\mathcal{X}_{k,j}$ of state-action pairs sampled such that $(s,a)\sim\zeta^{\nu}_{\pi}(s,a)$. We now define $Q^{3}_{k,j}$ as,

Definition 4

For a given set of state action pairs 𝒳k,j\mathcal{X}_{k,j} we define

Qk,j3=argminQθ,θΘ1|𝒳k,j|(si,ai)𝒳k,j(Qθ(si,ai)(r(si,ai)+γ𝔼aπkQk,j1(si+1,a)))2,\displaystyle Q^{3}_{k,j}=\operatorname*{arg\,min}_{Q_{\theta},\theta\in\Theta}\frac{1}{|\mathcal{X}_{k,j}|}\sum_{(s_{i},a_{i})\in\mathcal{X}_{k,j}}\Big{(}Q_{\theta}(s_{i},a_{i})-\big{(}r(s_{i},a_{i})+\gamma{\mathbb{E}_{a^{\prime}\sim\pi_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime})\big{)}\Big{)}^{2}, (62)

where r(si,ai)r(s_{i},a_{i}), and si+1s_{i+1} are the observed reward and the observed next state for state action pair si,ais_{i},a_{i} respectively.

$Q^{3}_{k,j}$ is the best approximation of the $Q$-value function that minimizes the sample average of the squared losses with target values $\big{(}r(s_{i},a_{i})+\gamma{\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}}Q_{k,j-1}(s_{i+1},a^{\prime})\big{)}$, i.e., the empirical loss function. After defining the possible solutions for the $Q$-values using different loss functions, we define the errors.
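The empirical objective in Equation (62) can be sketched as follows. This is illustrative only: the names transitions, Q_prev, and pi_probs are assumptions standing in for the data gathered under $\pi_{\lambda_{k}}$, the previous critic estimate $Q_{k,j-1}$, and the policy probabilities, respectively.

```python
def empirical_critic_loss(Q_theta, Q_prev, pi_probs, transitions, gamma, n_actions):
    """Sample-average squared loss of Eq. (62).

    Q_theta(s, a), Q_prev(s, a) : callables returning scalar Q-values
    pi_probs(s)                 : callable returning the action probabilities of pi_{lambda_k}
    transitions                 : list of (s, a, reward, s_next) tuples
    """
    total = 0.0
    for s, a, r, s_next in transitions:
        # target: r(s_i, a_i) + gamma * E_{a' ~ pi_{lambda_k}}[ Q_{k,j-1}(s_{i+1}, a') ]
        probs = pi_probs(s_next)
        target = r + gamma * sum(probs[ap] * Q_prev(s_next, ap) for ap in range(n_actions))
        total += (Q_theta(s, a) - target) ** 2
    return total / len(transitions)
```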

We first define the approximation error, which represents the difference between $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$ and its best approximation from the class of two-layer ReLU neural networks. We have

Definition 5 (Approximation Error)

For a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, we define $\epsilon^{1}_{k,j}=T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}$, where $Q^{1}_{k,j}$ is as defined in Definition 2 and $Q_{k,j-1}$ is the estimate of the $Q$-function obtained at iteration $j-1$ of the first inner for loop of Algorithm 1.

We also define the estimation error, which denotes the error between the best approximation of $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$ achievable by a two-layer ReLU neural network, $Q^{1}_{k,j}$, and $Q^{2}_{k,j}$. We demonstrate that these two quantities coincide, so this error is zero.

Definition 6 (Estimation Error)

For a given iteration kk of Algorithm 1 and iteration jj of the first for loop of Algorithm 1, ϵk,j2=Qk,j1Qk,j2\epsilon^{2}_{k,j}=Q^{1}_{k,j}-Q^{2}_{k,j}.

We now define the sampling error, which denotes the difference between the minimizer of the expected loss function, $Q^{2}_{k,j}$, and the minimizer of the empirical loss function computed from samples, $Q^{3}_{k,j}$. We will use Rademacher complexity results to upper bound this error.

Definition 7 (Sampling Error)

For a given iteration kk of Algorithm 1 and iteration jj of the first for loop of Algorithm 1, ϵk,j3=Qk,j3Qk,j2\epsilon^{3}_{k,j}=Q^{3}_{k,j}-Q^{2}_{k,j}.

Lastly, we define the optimization error, which denotes the difference between the minimizer of the empirical squared loss function, $Q^{3}_{k,j}$, and our estimate of this minimizer obtained from the projected gradient descent algorithm.

Definition 8 (Optimization Error)

For a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, $\epsilon^{4}_{k,j}=Q^{3}_{k,j}-Q_{k,j}$. Here $Q_{k,j}$ is our estimate of the $Q$-function at iteration $k$ of Algorithm 1 and iteration $j$ of the first inner loop of Algorithm 1.
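Putting Definitions 5–8 together, the total critic error $\epsilon_{k,j}=T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q_{k,j}$ that appears later in the proof of Theorem 1 telescopes across the four quantities defined above:

\epsilon_{k,j}=\underbrace{\big{(}T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}\big{)}}_{\epsilon^{1}_{k,j}}+\underbrace{\big{(}Q^{1}_{k,j}-Q^{2}_{k,j}\big{)}}_{\epsilon^{2}_{k,j}}+\big{(}Q^{2}_{k,j}-Q^{3}_{k,j}\big{)}+\underbrace{\big{(}Q^{3}_{k,j}-Q_{k,j}\big{)}}_{\epsilon^{4}_{k,j}}=\epsilon^{1}_{k,j}+\epsilon^{2}_{k,j}-\epsilon^{3}_{k,j}+\epsilon^{4}_{k,j},

so by the triangle inequality $|\epsilon_{k,j}|\leq|\epsilon^{1}_{k,j}|+|\epsilon^{2}_{k,j}|+|\epsilon^{3}_{k,j}|+|\epsilon^{4}_{k,j}|$, which is the splitting used in Equation (144).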

C Supplementary lemmas and Definitions

Here we provide some definitions and results that will be used to prove the lemmas stated in the paper.

Definition 9

For a given set $Z\subseteq\mathbb{R}^{n}$, we define the Rademacher complexity of the set $Z$ as

Rad(Z)=\mathbb{E}\left(\sup_{z\in Z}\frac{1}{n}\sum_{i=1}^{n}\Omega_{i}z_{i}\right) (63)

where the $\Omega_{i}$ are i.i.d. Rademacher random variables with $P(\Omega_{i}=1)=P(\Omega_{i}=-1)=\frac{1}{2}$, and $z_{i}$ denotes the $i^{th}$ coordinate of $z\in Z$.

Lemma 10

Consider a set of observed data denoted by $z=\{z_{1},z_{2},\cdots,z_{n}\}\in\mathbb{R}^{n}$, a parameter space $\Theta$, and a loss function $l:\Theta\times\mathbb{R}\rightarrow\mathbb{R}$ with $0\leq l(\theta,z)\leq 1$ for all $(\theta,z)\in\Theta\times\mathbb{R}$. Define the empirical risk for the observed data as $R(\theta)=\frac{1}{n}\sum_{i=1}^{n}l(\theta,z_{i})$ and the population risk as $r(\theta)=\mathbb{E}\,l(\theta,\tilde{z_{i}})$, where $\tilde{z_{i}}$ is a coordinate of $\tilde{z}$ sampled from some distribution over $Z$.

We define a set of functions denoted by \mathcal{L} as

={zZl(θ,z):θΘ}\mathcal{L}=\{z\in Z\rightarrow l(\theta,z)\in\mathbb{R}:\theta\in\Theta\} (64)

Given z={z1,z2,z3,zn}z=\{z_{1},z_{2},z_{3}\cdots,z_{n}\} we further define a set z\mathcal{L}\circ z as

z={(l(θ,z1),l(θ,z2),,l(θ,zn))n:θΘ}\mathcal{L}\circ z\ =\{(l(\theta,z_{1}),l(\theta,z_{2}),\cdots,l(\theta,z_{n}))\in\mathbb{R}^{n}:\theta\in\Theta\} (65)

Then, we have

𝔼supθΘ|{r(θ)R(θ)}|2𝔼(Rad(z))\mathbb{E}\sup_{\theta\in\Theta}|\{r(\theta)-R(\theta)\}|\leq 2\mathbb{E}\left(Rad(\mathcal{L}\circ z)\right) (66)

If the data is of the form $z_{i}=(x_{i},y_{i})$ with $x\in X$, $y\in Y$, the loss function is of the form $l(a_{\theta}(x),y)$ and is $L$-Lipschitz, and $a_{\theta}:\Theta\times X\rightarrow\mathbb{R}$, then we have

𝔼supθΘ|{r(θ)R(θ)}|2L𝔼(Rad(𝒜{x1,x2,x3,,xn}))\mathbb{E}\sup_{\theta\in\Theta}|\{r(\theta)-R(\theta)\}|\leq 2{L}\mathbb{E}\left(Rad(\mathcal{A}\circ\{x_{1},x_{2},x_{3},\cdots,x_{n}\})\right) (67)

where

𝒜{x1,x2,,xn}={(a(θ,x1),a(θ,x2),,a(θ,xn))n:θΘ}\mathcal{A}\circ\{x_{1},x_{2},\cdots,x_{n}\}\ =\{(a(\theta,x_{1}),a(\theta,x_{2}),\cdots,a(\theta,x_{n}))\in\mathbb{R}^{n}:\theta\in\Theta\} (68)

The detailed proof of the above statement is given in (Rebeschini, 2022), Algorithmic Foundations of Learning [Lecture Notes], https://www.stats.ox.ac.uk/ rebeschi/teaching/AFoL/20/material/. The upper bound for $\mathbb{E}\sup_{\theta\in\Theta}(\{r(\theta)-R(\theta)\})$ is proved in the aforementioned reference. However, without loss of generality the same proof holds for the upper bound for $\mathbb{E}\sup_{\theta\in\Theta}(\{R(\theta)-r(\theta)\})$. Hence the upper bound for $\mathbb{E}\sup_{\theta\in\Theta}|\{r(\theta)-R(\theta)\}|$ can be established.
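For intuition, the empirical Rademacher complexity of a set of prediction vectors, as used in Lemma 10, can be approximated by Monte Carlo. The sketch below assumes the function class has already been evaluated on the data and stored as a finite array A_x of prediction vectors (an assumption made purely for illustration; the paper works with the two-layer ReLU class).

```python
import numpy as np

def empirical_rademacher(A_x, num_draws=2000, seed=0):
    """Monte-Carlo estimate of Rad(A o {x_1,...,x_n}) for a finite set of prediction vectors.

    A_x : (H, n) array; row h holds (a_theta_h(x_1), ..., a_theta_h(x_n)).
    """
    rng = np.random.default_rng(seed)
    H, n = A_x.shape
    vals = np.empty(num_draws)
    for t in range(num_draws):
        omega = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs Omega_i
        vals[t] = np.max(A_x @ omega) / n         # sup over the (finite) class of (1/n) sum Omega_i z_i
    return vals.mean()

# Example: 50 random "hypotheses" evaluated on 200 points
# A_x = np.random.default_rng(1).standard_normal((50, 200))
# print(empirical_rademacher(A_x))
```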

Lemma 11

Consider random variables $x\in\mathcal{X}$ and $y,y^{\prime}\in\mathcal{Y}$. Let $\mathbb{E}_{x,y}$, $\mathbb{E}_{x}$, $\mathbb{E}_{y|x}$, and $\mathbb{E}_{y^{\prime}|x}$ denote the expectation with respect to the joint distribution of $(x,y)$, the marginal distribution of $x$, the conditional distribution of $y$ given $x$, and the conditional distribution of $y^{\prime}$ given $x$, respectively. Let $f_{\theta}(x)$ denote a bounded measurable function of $x$ parameterised by some parameter $\theta$, and let $g(x,y)$ be a bounded measurable function of both $x$ and $y$.

Then we have

argminfθ𝔼x,y(fθ(x)g(x,y))2=argminfθ(𝔼x,y(fθ(x)𝔼y|x(g(x,y)|x))2)\operatorname*{arg\,min}_{f_{\theta}}\mathbb{E}_{x,y}\left(f_{\theta}(x)-g(x,y)\right)^{2}=\operatorname*{arg\,min}_{f_{\theta}}\left(\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)^{2}\right) (69)

Proof  Denote the left-hand side of Equation (69) by $\mathbb{X}_{\theta}$; adding and subtracting $\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)$ inside the square, we get

𝕏θ=\displaystyle\mathbb{X}_{\theta}= argminfθ(𝔼x,y(fθ(x)𝔼y|x(g(x,y)|x)+𝔼y|x(g(x,y)|x)g(x,y))2)\displaystyle\operatorname*{arg\,min}_{f_{\theta}}\left(\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)+\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)-g(x,y)\right)^{2}\right) (70)
=\operatorname*{arg\,min}_{f_{\theta}}\Big{(}\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)^{2}+\mathbb{E}_{x,y}\left(g(x,y)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)^{2}
2𝔼x,y(fθ(x)𝔼y|x(g(x,y)|x))(g(x,y)𝔼y|x(g(x,y)|x))).\displaystyle\qquad-2\mathbb{E}_{x,y}\Big{(}f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\Big{)}\left(g(x,y)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\Big{)}. (71)

Consider the third term on the right hand side of Equation (71)

2𝔼x,y\displaystyle 2\mathbb{E}_{x,y} (fθ(x)𝔼y|x(g(x,y)|x))(g(x,y)𝔼y|x(g(x,y)|x))\displaystyle\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\left(g(x,y)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)
=\displaystyle= 2𝔼x𝔼y|x(fθ(x)𝔼y|x(g(x,y)|x))(g(x,y)𝔼y|x(g(x,y)|x))\displaystyle 2\mathbb{E}_{x}\mathbb{E}_{y|x}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\left(g(x,y)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right) (72)
=\displaystyle= 2𝔼x(fθ(x)𝔼y|x(g(x,y)|x))𝔼y|x(g(x,y)𝔼y|x(g(x,y)|x))\displaystyle 2\mathbb{E}_{x}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\mathbb{E}_{y|x}\left(g(x,y)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right) (73)
=\displaystyle= 2𝔼x(fθ(x)𝔼y|x(g(x,y)|x))(𝔼y|x(g(x,y))𝔼y|x(𝔼y|x(g(x,y)|x)))\displaystyle 2\mathbb{E}_{x}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\left(\mathbb{E}_{y|x}(g(x,y))-\mathbb{E}_{y|x}\left(\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\right) (74)
= 2\mathbb{E}_{x}\left(f_{\theta}(x)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)\Big{(}\mathbb{E}_{y|x}(g(x,y))-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\Big{)} (75)
=\displaystyle= 0\displaystyle 0 (76)

Equation (72) is obtained by writing $\mathbb{E}_{x,y}=\mathbb{E}_{x}\mathbb{E}_{y|x}$ from the law of total expectation. Equation (73) is obtained from (72) as the term $f_{\theta}(x)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)$ is not a function of $y$. Equation (74) is obtained from (73) as $\mathbb{E}_{y|x}\left(\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)=\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)$, because $\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)$ is not a function of $y$ and hence is constant with respect to the expectation operator $\mathbb{E}_{y|x}$. Finally, Equation (75) equals zero because $y$ and $y^{\prime}$ have the same conditional distribution given $x$, so $\mathbb{E}_{y|x}(g(x,y))=\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)$.

Thus, plugging the value of $2\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)\left(g(x,y)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)$ into Equation (71), we get

\operatorname*{arg\,min}_{f_{\theta}}\mathbb{E}_{x,y}\left(f_{\theta}(x)-g(x,y)\right)^{2}= \operatorname*{arg\,min}_{f_{\theta}}\Big{(}\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)^{2}
+𝔼x,y(g(x,y)𝔼y|x(g(x,y)|x))2).\displaystyle+\mathbb{E}_{x,y}\left(g(x,y)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)^{2}). (77)

Note that the second term on the right-hand side of Equation (77) does not depend on $f_{\theta}(x)$; therefore we can write Equation (77) as

argminfθ𝔼x,y(fθ(x)g(x,y))2=argminfθ(𝔼x,y(fθ(x)𝔼y|x(g(x,y)|x))2)\operatorname*{arg\,min}_{f_{\theta}}\mathbb{E}_{x,y}\left(f_{\theta}(x)-g(x,y)\right)^{2}=\operatorname*{arg\,min}_{f_{\theta}}\left(\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)^{2}\right) (78)

Since the right hand side of Equation (78) is not a function of yy we can replace 𝔼x,y\mathbb{E}_{x,y} with 𝔼x\mathbb{E}_{x} to get

argminfθ𝔼x,y(fθ(x)g(x,y))2=argminfθ(𝔼x(fθ(x)𝔼y|x(g(x,y)|x))2)\operatorname*{arg\,min}_{f_{\theta}}\mathbb{E}_{x,y}\left(f_{\theta}(x)-g(x,y)\right)^{2}=\operatorname*{arg\,min}_{f_{\theta}}\left(\mathbb{E}_{x}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)^{2}\right) (79)

 
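The conclusion of Lemma 11 can also be checked numerically. The toy model below is an assumption introduced only for this sanity check: a linear predictor $f_{\theta}(x)=\theta x$, targets $g(x,y)=2x+y$ with noise $y$ independent of $x$, so that the known conditional mean is $\mathbb{E}[g(x,y^{\prime})|x]=2x$.

```python
import numpy as np

# Sanity check of Lemma 11 (not part of the paper): the least-squares fit against the noisy
# targets g(x, y) and the fit against the conditional mean E[g(x, y') | x] nearly coincide.
rng = np.random.default_rng(0)
n = 50_000
x = rng.uniform(-1.0, 1.0, size=n)
y = rng.standard_normal(n)                   # noise independent of x
g = 2.0 * x + y                              # g(x, y) with E[g | x] = 2x
g_cond_mean = 2.0 * x                        # conditional mean assumed known in this toy model

theta_noisy = np.sum(x * g) / np.sum(x * x)              # argmin_theta sum (theta*x - g)^2
theta_mean = np.sum(x * g_cond_mean) / np.sum(x * x)     # argmin_theta sum (theta*x - E[g|x])^2

print(theta_noisy, theta_mean)               # both approach 2.0 as n grows
```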

Lemma 12

Consider an optimization of the form given in Equation (45) with the regularization term $\beta=0$, denoted by $\mathcal{L}_{|\tilde{D}|}$, and its convex equivalent, denoted by $\mathcal{L}_{0}$. Then the values of these two loss functions evaluated at $\theta=\psi(v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}$ and $(v,w)=(v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}$, respectively, are equal, and thus we have

|D~|(ψ(vi,wi)i{1,,|D~|})=0((vi,wi)i{1,,|D~|})\mathcal{L}_{|\tilde{D}|}(\psi(v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}})=\mathcal{L}_{0}((v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}) (80)

Proof  The loss functions in Equations (40) and (48) with $\beta=0$ are as follows

0((vi,wi)i{1,,|D~|})\displaystyle\mathcal{L}_{0}((v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}) =\displaystyle= DiD~Di(X(viwi))y22\displaystyle||\sum_{D_{i}\in\tilde{D}}D_{i}(X(v_{i}-w_{i}))-y||^{2}_{2} (81)
|D~|(ψ(vi,wi)i{1,,|D~|})\displaystyle\mathcal{L}_{|\tilde{D}|}(\psi(v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}) =\displaystyle= i=1|D~|σ(Xψ(vi,wi)1)ψ(vi,wi)2y22,\displaystyle||\sum_{i=1}^{|\tilde{D}|}\sigma(X\psi(v_{i},w_{i})_{1})\psi(v_{i},w_{i})_{2}-y||_{2}^{2}, (82)

where ψ(vi,wi)1\psi(v_{i},w_{i})_{1}, ψ(vi,wi)2\psi(v_{i},w_{i})_{2} represent the first and second coordinates of ψ(vi,wi)\psi(v_{i},w_{i}) respectively.

For any fixed i{1,,|D~|}i\in\{1,\cdots,|\tilde{D}|\} consider the two terms

Di(X(viwi))\displaystyle D_{i}(X(v_{i}-w_{i})) (83)
σ(Xψ(vi,wi)1)ψ(vi,wi)2\displaystyle\sigma(X\psi(v_{i},w_{i})_{1})\psi(v_{i},w_{i})_{2} (84)

For a fixed $i$, either $v_{i}$ or $w_{i}$ is zero. In case both are zero, both of the terms in Equations (83) and (84) are zero, as $\psi(0,0)=(0,0)$. Assume that for a given $i$ we have $w_{i}=0$. Then $\psi(v_{i},w_{i})=(v_{i},1)$, and Equations (83) and (84) become

D_{i}(Xv_{i}) (85)
\sigma(Xv_{i}) (86)

But by the definition of $v_{i}$ we have $D_{i}(Xv_{i})=\sigma(Xv_{i})$; therefore Equations (85) and (86) are equal. Alternatively, if for a given $i$ we have $v_{i}=0$, then $\psi(v_{i},w_{i})=(w_{i},-1)$ and the terms in Equations (83) and (84) become

-D_{i}(Xw_{i}) (87)
-\sigma(Xw_{i}) (88)

By the definition of $w_{i}$ we have $D_{i}(Xw_{i})=\sigma(Xw_{i})$, so the terms in Equations (87) and (88) are equal. Since this is true for all $i$, we have

|D~|(ψ(vi,wi)i{1,,|D~|})=0((vi,wi)i{1,,|D~|})\mathcal{L}_{|\tilde{D}|}(\psi(v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}})=\mathcal{L}_{0}((v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}) (89)

 
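The equality in Lemma 12 can be verified numerically on a toy instance. The construction below is an assumption reconstructed from the proof: each $D_{i}$ is taken to be the ReLU activation pattern of $v_{i}$ (so that $D_{i}Xv_{i}=\sigma(Xv_{i})$), all $w_{i}$ are set to zero, and $\psi(v_{i},0)=(v_{i},1)$ as in Equations (85)-(86).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

V = rng.standard_normal((d, m))                              # v_i, with w_i = 0 for each i
D = [np.diag((X @ V[:, i] >= 0).astype(float)) for i in range(m)]   # activation-pattern matrices

# Convex-form loss ||sum_i D_i X v_i - y||^2 vs. non-convex loss ||sum_i sigma(X v_i) * 1 - y||^2
loss_convex = np.linalg.norm(sum(D[i] @ X @ V[:, i] for i in range(m)) - y) ** 2
loss_nonconvex = np.linalg.norm(sum(np.maximum(X @ V[:, i], 0.0) * 1.0 for i in range(m)) - y) ** 2

print(np.isclose(loss_convex, loss_nonconvex))               # True: the two losses coincide
```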

Lemma 13

The function $Q_{\theta}(x)$ defined in Equation (10) is Lipschitz continuous in $\theta$, where $\theta$ is considered a vector in $\mathbb{R}^{(d+1)m}$, under the assumption that all possible $\theta$ belong to the set $\mathcal{B}=\{\theta:|\theta^{*}-\theta|_{1}<1\}$, where $\theta^{*}$ is some fixed value.

Proof 

First, we show that for all $\theta_{1}=\{u_{i},\alpha_{i}\},\theta_{2}=\{u^{\prime}_{i},\alpha^{\prime}_{i}\}\in\mathcal{B}$ we have $\alpha_{i}=\alpha^{\prime}_{i}$ for all $i\in\{1,\cdots,m\}$.

Note that

|θ1θ2|1=i=1m|uiui|1+i=1m|αiαi|,|\theta_{1}-\theta_{2}|_{1}=\sum_{i=1}^{m}|u_{i}-u^{{}^{\prime}}_{i}|_{1}+\sum_{i=1}^{m}|\alpha_{i}-\alpha^{{}^{\prime}}_{i}|, (90)

where |uiui|1=j=1d|uijuij||u_{i}-u^{{}^{\prime}}_{i}|_{1}=\sum_{j=1}^{d}|u_{i_{j}}-u^{{}^{\prime}}_{i_{j}}| with uij,uiju_{i_{j}},u^{{}^{\prime}}_{i_{j}} denote the jthj^{th} component of ui,uiu_{i},u^{{}^{\prime}}_{i} respectively.

By construction, $\alpha_{i},\alpha^{\prime}_{i}$ can only be $1$, $-1$, or $0$. Therefore, if $\alpha_{i}\neq\alpha^{\prime}_{i}$ for some $i$, then $|\alpha_{i}-\alpha^{\prime}_{i}|=2$ if both are non-zero or $|\alpha_{i}-\alpha^{\prime}_{i}|=1$ if one is zero, so $|\theta_{1}-\theta_{2}|_{1}\geq 1$, which leads to a contradiction.

Therefore αi=αi\alpha_{i}=\alpha^{{}^{\prime}}_{i} for all ii and we also have

|θ1θ2|1=i=1m|uiui|1|\theta_{1}-\theta_{2}|_{1}=\sum_{i=1}^{m}|u_{i}-u^{{}^{\prime}}_{i}|_{1} (91)

Qθ(x)Q_{\theta}(x) is defined as

Qθ(x)=i=1mσ(xTui)αiQ_{\theta}(x)=\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u_{i})\alpha_{i} (92)

From Proposition 11 in (10.5555/3327144.3327299), the function $Q_{\theta}(x)$ is Lipschitz continuous in $x$; therefore there exists $l>0$ such that

|Qθ(x)Qθ(y)|\displaystyle|Q_{\theta}(x)-Q_{\theta}(y)| \displaystyle\leq l|xy|1\displaystyle l|x-y|_{1} (93)
|i=1mσ(xTui)αii=1mσ(yTui)αi|\displaystyle|\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u_{i})\alpha_{i}-\sum_{i=1}^{m}\sigma^{{}^{\prime}}(y^{T}u_{i})\alpha_{i}| \displaystyle\leq l|xy|1\displaystyle l|x-y|_{1} (94)

If we consider a single neuron of QθQ_{\theta}, for example i=1i=1, we have l1>0l_{1}>0 such that

|σ(xTu1)αiσ(yTu1)αi|\displaystyle|\sigma^{{}^{\prime}}(x^{T}u_{1})\alpha_{i}-\sigma^{{}^{\prime}}(y^{T}u_{1})\alpha_{i}| \displaystyle\leq l1|xy|1\displaystyle l_{1}|x-y|_{1} (95)

Now consider Equation (95), but instead of viewing the left-hand side as a function of $x,y$, view it as a function of $u$, where we consider the difference between $\sigma^{\prime}(x^{T}u)\alpha_{i}$ evaluated at $u_{1}$ and at $u^{\prime}_{1}$, such that

|σ(xTu1)αiσ(xTu1)αi|\displaystyle|\sigma^{{}^{\prime}}(x^{T}u_{1})\alpha_{i}-\sigma^{{}^{\prime}}(x^{T}u^{{}^{\prime}}_{1})\alpha_{i}| \displaystyle\leq l1x|u1u1|1\displaystyle l^{x}_{1}|u_{1}-u^{{}^{\prime}}_{1}|_{1} (96)

for some l1x>0l^{x}_{1}>0.

Similarly, for all other $i$, if we change $u_{i}$ to $u^{\prime}_{i}$ while keeping everything else unchanged, we have

|σ(xTui)αiσ(xTui)αi|\displaystyle|\sigma^{{}^{\prime}}(x^{T}u_{i})\alpha_{i}-\sigma^{{}^{\prime}}(x^{T}u^{{}^{\prime}}_{i})\alpha_{i}| \displaystyle\leq lix|uiui|1\displaystyle l^{x}_{i}|u_{i}-u^{{}^{\prime}}_{i}|_{1} (97)

for all xx if both θ1,θ2\theta_{1},\theta_{2}\in\mathcal{B}.

Therefore we obtain

|\sum_{i=1}^{m}\sigma^{\prime}(x^{T}u_{i})\alpha_{i}-\sum_{i=1}^{m}\sigma^{\prime}(x^{T}u^{\prime}_{i})\alpha_{i}| \leq \sum_{i=1}^{m}|\sigma^{\prime}(x^{T}u_{i})\alpha_{i}-\sigma^{\prime}(x^{T}u^{\prime}_{i})\alpha_{i}| (98)
\displaystyle\leq i=1mlix|uiui|1\displaystyle\sum_{i=1}^{m}l^{x}_{i}|u_{i}-u^{{}^{\prime}}_{i}|_{1} (99)
\displaystyle\leq (supilix)i=1m|uiui|1\displaystyle(\sup_{i}l_{i}^{x})\sum_{i=1}^{m}|u_{i}-u^{{}^{\prime}}_{i}|_{1} (100)
\displaystyle\leq (supilix)|θ1θ2|\displaystyle(\sup_{i}l_{i}^{x})|\theta_{1}-\theta_{2}| (101)

This result holds for a fixed $x$. If we take the supremum over $x$ on both sides, we get

supx|i=1mσ(xTui)αii=1mσ(xTui)αi|\displaystyle\sup_{x}|\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u_{i})\alpha_{i}-\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u^{{}^{\prime}}_{i})\alpha_{i}| \displaystyle\leq (supi,xlix)|θ1θ2|\displaystyle(\sup_{i,x}l_{i}^{x})|\theta_{1}-\theta_{2}| (102)

Denoting (supi,xlix)=l(\sup_{i,x}l_{i}^{x})=l, we get

|i=1mσ(xTui)αii=1mσ(xTui)αi|\displaystyle|\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u_{i})\alpha_{i}-\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u^{{}^{\prime}}_{i})\alpha_{i}| \displaystyle\leq l|θ1θ2|1\displaystyle l|\theta_{1}-\theta_{2}|_{1} (104)
xd\displaystyle\forall x\in\mathbb{R}^{d}

 
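A quick numerical check of the Lipschitz-in-$\theta$ property shown in Lemma 13 is sketched below, under the lemma's setting: the $\alpha_{i}$ are held fixed and only the $u_{i}$ differ, with state-action features assumed to lie in $[0,1]^{d}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 10
alpha = rng.choice([-1.0, 1.0], size=m)

def Q(x, U):
    """sum_i sigma(x^T u_i) alpha_i with sigma = ReLU."""
    return float(np.maximum(x @ U, 0.0) @ alpha)

U1 = 0.1 * rng.standard_normal((d, m))
U2 = U1 + 0.01 * rng.standard_normal((d, m))      # small perturbation of the u_i only

max_ratio = 0.0
for _ in range(1000):
    x = rng.uniform(0.0, 1.0, size=d)             # state-action features in [0, 1]^d
    diff = abs(Q(x, U1) - Q(x, U2))
    max_ratio = max(max_ratio, diff / np.abs(U1 - U2).sum())

# The ratio stays bounded, consistent with |Q_{theta_1}(x) - Q_{theta_2}(x)| <= l |theta_1 - theta_2|_1.
print(max_ratio)
```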

D Supporting Lemmas

We will now state the key lemmas that will be used for finding the sample complexity of the proposed algorithm.

Lemma 14

For a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, the approximation error $\epsilon^{1}_{k,j}$ defined in Definition 5 satisfies

\mathbb{E}\left(|\epsilon^{1}_{k,j}|\right)\leq\sqrt{\epsilon_{approx}}, (105)

where the expectation is with respect to $(s,a)\sim\zeta^{\nu}_{\pi}(s,a)$.

Proof Sketch: We use Assumption 3 and the definition of the variance of a random variable to obtain the required result. The detailed proof is given in Appendix F.1.

Lemma 15

For a given iteration kk of Algorithm 1 and iteration jj of the first for loop of Algorithm 1, Qk,j1=Qk,j2Q^{1}_{k,j}=Q^{2}_{k,j}, or equivalently ϵk,j2=0\epsilon^{2}_{k,j}=0

Proof Sketch: We use Lemma 11 in Appendix C and use the definitions of Qk,j1Q^{1}_{k,j} and Qk,j2Q^{2}_{k,j} to prove this result. The detailed proof is given in Appendix F.2.

Lemma 16

For a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, if the number of state action pairs sampled is denoted by $n_{k,j}$, then the error $\epsilon^{3}_{k,j}$ defined in Definition 7 is upper bounded as

𝔼(|ϵk,j3|)𝒪~(log(log(nk,j))nk,j),\mathbb{E}\left(|\epsilon^{3}_{k,j}|\right)\leq\tilde{\mathcal{O}}\left(\frac{log(log(n_{k,j}))}{\sqrt{n_{k,j}}}\right), (106)

where the expectation is with respect to $(s,a)\sim\zeta^{\nu}_{\pi}(s,a)$.

Proof Sketch: First, we note that for a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, $\mathbb{E}(R_{X_{k,j},Q_{k,j-1}}(\theta))=L_{Q_{k,j-1}}(\theta)$, where $R_{X_{k,j},Q_{k,j-1}}(\theta)$ and $L_{Q_{k,j-1}}(\theta)$ are defined in Appendix F.3. We use this to get a probabilistic bound on the expected value of $|Q^{2}_{k,j}-Q^{3}_{k,j}|$ using Rademacher complexity theory when the samples are drawn from an ergodic Markov chain. The detailed proof is given in Appendix F.3. Note that the $\log(\log(n_{k,j}))$ term is present because the state action samples belong to a Markov chain.

Lemma 17

For a given iteration kk of Algorithm 1 and iteration jj of the first for loop of Algorithm 1, let the number of steps of the projected gradient descent performed by Algorithm 2, denoted by Tk,jT_{k,j}, and the gradient descent step size αk,j\alpha_{k,j} satisfy

αk,j\displaystyle\alpha_{k,j} =\displaystyle= uk,j2Lk,jTk,j+1,\displaystyle\frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}}, (107)

for some constants $L_{k,j}$ and $||u^{*}_{k,j}||_{2}$. Then the error $\epsilon^{4}_{k,j}$ defined in Definition 8 is upper bounded as

𝔼(|ϵk,j4|)𝒪~(1Tk,j)+ϵ|D~|,\mathbb{E}(|\epsilon^{4}_{k,j}|)\leq\tilde{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right)+{\epsilon_{|\tilde{D}|}}, (108)

Where the expectation is with respect to (s,a)ζπν(s,a)(s,a)\sim\zeta^{\nu}_{\pi}(s,a).

Proof Sketch: We use the number of iterations $T_{k,j}$ required to obtain an $\epsilon$ bound on the difference between the minimum objective value and the objective value corresponding to the estimated parameter at iteration $T_{k,j}$. We use the convexity of the objective and the Lipschitz property of the neural network to get a bound on the $Q$ functions corresponding to the estimated parameters. The detailed proof is given in Appendix F.4.

Lemma 18

For a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, let the number of state action pairs sampled be denoted by $n_{k,j}$ and let $\beta_{i}$ be the step size in the projected gradient descent at iteration $i$ of the second inner for loop of Algorithm 1, which satisfies

βi=μki+1,\displaystyle\beta_{i}=\frac{\mu_{k}}{i+1}, (109)

where μk\mu_{k} is the strong convexity parameter of FkF_{k}. Then, it holds that,

(Fk(wi))𝒪~(log(nk,j)nk,j)+Fk.\left(F_{k}(w_{i})\right)\leq\tilde{\mathcal{O}}\left(\frac{log(n_{k,j})}{{n_{k,j}}}\right)+F^{*}_{k}. (110)

Proof Sketch: Note that we do not have access to state action samples from the stationary state action distribution corresponding to the policy $\pi_{\lambda_{k}}$. We only have access to samples from a Markov chain with that stationary distribution. To account for this, we use the results in (9769873) and bound the difference between the optimal loss value and the loss value obtained by performing stochastic gradient descent with samples from a Markov chain.
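A minimal sketch of the decaying-step-size scheme of Equation (109) is given below. This is illustrative only: grad_fn and project are assumed callables standing in for a stochastic gradient of $F_{k}$ and the projection onto the feasible set; the actual samples in Algorithm 1 come from a Markov chain.

```python
import numpy as np

def projected_sgd(w0, samples, grad_fn, project, mu):
    """Projected SGD with step size beta_i = mu / (i + 1), as in Eq. (109)."""
    w = np.array(w0, dtype=float)
    for i, sample in enumerate(samples):
        beta_i = mu / (i + 1)                         # decaying step size for a strongly convex objective
        w = project(w - beta_i * grad_fn(w, sample))  # gradient step followed by projection
    return w

# Example with a strongly convex quadratic F(w) = 0.5 * ||w - 1||^2 and a unit-ball projection:
# grad = lambda w, s: (w - 1.0) + 0.1 * s                    # noisy gradient
# proj = lambda w: w / max(1.0, np.linalg.norm(w))
# w_hat = projected_sgd(np.zeros(3),
#                       np.random.default_rng(0).standard_normal((5000, 3)),
#                       grad, proj, mu=1.0)
```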

E Proof of Theorem 1

Proof 

From Assumption 1, we have

\log\frac{\pi_{\lambda_{k+1}}(a|s)}{\pi_{\lambda_{k}}(a|s)} \geq {\nabla}_{\lambda_{k}}\log{\pi_{\lambda_{k}}}(a|s){\cdot}(\lambda^{k+1}-\lambda^{k})-\frac{\beta}{2}||\lambda^{k+1}-\lambda^{k}||_{2}^{2} (111)
= {\eta}{\nabla}_{\lambda_{k}}\log{\pi_{\lambda_{k}}}(a|s){\cdot}w^{k}-{\eta}^{2}\frac{\beta}{2}||w^{k}||_{2}^{2} (112)

From the definition of KL divergence and the performance difference lemma of (kakade2002approximately), we have

\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\left(KL(\pi^{*}||\pi_{\lambda_{k}})-KL(\pi^{*}||\pi_{\lambda_{k+1}})\right)= \mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\mathbb{E}_{a\sim\pi^{*}(.|s)}\left[\log\frac{\pi_{\lambda_{k+1}}(a|s)}{\pi_{\lambda_{k}}(a|s)}\right] (115)
\geq {\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\mathbb{E}_{a\sim\pi^{*}(.|s)}\left[{\nabla}_{\lambda_{k}}\log{\pi_{\lambda_{k}}}(a|s){\cdot}w^{k}\right]-{\eta}^{2}\frac{\beta}{2}||w^{k}||_{2}^{2} (116)
= {\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\mathbb{E}_{a\sim\pi^{*}(.|s)}\left[A^{\pi_{\lambda_{k}}}(s,a)\right]-{\eta}^{2}\frac{\beta}{2}||w^{k}||_{2}^{2}
-{\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\mathbb{E}_{a\sim\pi^{*}(.|s)}\Big{[}A^{\pi_{\lambda_{k}}}(s,a)-{\nabla}_{\lambda_{k}}\log{\pi_{\lambda_{k}}}(a|s){\cdot}w^{k}\Big{]} (117)
=\displaystyle= (1γ)η(Vπ(ν)Vk(ν))η2β2wk22ηerrk.\displaystyle(1-\gamma){\eta}\left(V^{\pi^{*}}(\nu)-V^{k}(\nu)\right)-{\eta}^{2}\frac{\beta}{2}||w^{k}||_{2}^{2}-{\eta}{\cdot}err_{k}. (118)

Equation (116) is obtained from Equation (115) using the result in Equation (112). Equation (117) is obtained from Equation (116) by adding and subtracting ${\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\mathbb{E}_{a\sim\pi^{*}(.|s)}[A^{\pi_{\lambda_{k}}}(s,a)]$, and Equation (118) follows from the performance difference lemma of (kakade2002approximately) together with the definition of $err_{k}$ in Equation (28), where $A^{\pi_{\lambda_{k}}}$ is the advantage function corresponding to $Q^{\pi_{\lambda_{k}}}$.

Rearranging, we get

\left(V^{\pi^{*}}(\nu)-V^{k}(\nu)\right) \leq \frac{1}{1-\gamma}\left(\frac{1}{\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\left(KL(\pi^{*}||\pi_{\lambda_{k}})-KL(\pi^{*}||\pi_{\lambda_{k+1}})\right)+{\eta}\frac{\beta}{2}{\cdot}{W}^{2}+err_{k}\right)

Summing from 11 to KK and dividing by KK we get

\frac{1}{K}{\sum_{k=1}^{K}}\left(V^{\pi^{*}}(\nu)-V^{k}(\nu)\right)\leq \left(\frac{1}{1-\gamma}\right)\frac{1}{K}{\sum_{k=1}^{K}}\left(\frac{1}{\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\left(KL(\pi^{*}||\pi_{\lambda_{k}})-KL(\pi^{*}||\pi_{\lambda_{k+1}})\right)+err_{k}\right)
+ \left(\frac{1}{1-\gamma}\right){\eta}\frac{\beta}{2}{\cdot}{W}^{2} (120)
\leq \frac{1}{{\eta}(1-\gamma)}\frac{1}{K}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\left(KL(\pi^{*}||\pi_{\lambda_{0}})\right)+\frac{{\eta}{\beta}{\cdot}{W}^{2}}{2(1-\gamma)}+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}err_{k} (121)
\leq \frac{\log(|\mathcal{A}|)}{K{\eta}(1-\gamma)}+\frac{{\eta}{\beta}{\cdot}{W}^{2}}{2(1-\gamma)}+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}err_{k} (122)

If we set η=1K\eta=\frac{1}{\sqrt{K}} in Equation (122) we get

\frac{1}{K}{\sum_{k=1}^{K}}\left(V^{\pi^{*}}(\nu)-V^{k}(\nu)\right) \leq \frac{1}{\sqrt{K}}\left(\frac{2\log(|\mathcal{A}|)+\beta{\cdot}{W}^{2}}{2(1-\gamma)}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}err_{k} (123)
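For completeness, the substitution $\eta=\frac{1}{\sqrt{K}}$ in Equation (122) works out as follows:

\frac{\log(|\mathcal{A}|)}{K{\eta}(1-\gamma)}\Big{|}_{\eta=\frac{1}{\sqrt{K}}}=\frac{\log(|\mathcal{A}|)}{\sqrt{K}(1-\gamma)},\qquad \frac{{\eta}{\beta}{W}^{2}}{2(1-\gamma)}\Big{|}_{\eta=\frac{1}{\sqrt{K}}}=\frac{{\beta}{W}^{2}}{2\sqrt{K}(1-\gamma)},

and adding the two terms gives $\frac{1}{\sqrt{K}}\cdot\frac{2\log(|\mathcal{A}|)+\beta W^{2}}{2(1-\gamma)}$, which is the first term of Equation (123).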

Now consider the term errkerr_{k}, we have from Equation (28)

err_{k} = \mathbb{E}_{s,a}(A^{\pi_{\lambda_{k}}}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s))
= \mathbb{E}_{s,a}(A^{\pi_{\lambda_{k}}}-A_{k,J})+\mathbb{E}_{s,a}(A_{k,J}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s))
\leq \underbrace{|\mathbb{E}_{s,a}(A^{\pi_{\lambda_{k}}}-A_{k,J})|}_{I}+\underbrace{|\mathbb{E}_{s,a}(A_{k,J}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s))|}_{II} (125)

where $A_{k,J}$ is the estimate of $A^{\pi_{\lambda_{k}}}$ obtained at the $k^{th}$ iteration of Algorithm 1, and $s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}$.

We first derive bounds on term $I$. From the definition of the advantage function, we have

|\mathbb{E}(A^{\pi_{\lambda_{k}}}(s,a)-A_{k,J}(s,a))| = |\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(Q^{\pi_{\lambda_{k}}}(s,a)-E_{a^{\prime}\sim\pi_{\lambda_{k}}}Q^{\pi_{\lambda_{k}}}(s,a^{\prime})
-Q_{k,J}(s,a)+E_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,J}(s,a^{\prime}))|
\leq |\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a))|
+ |\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a^{\prime}\sim\pi_{\lambda_{k}}}(Q^{\pi_{\lambda_{k}}}(s,a^{\prime})-Q_{k,J}(s,a^{\prime}))| (128)

We write the second term on the right-hand side of Equation (128) as $\int|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a)|d\mu_{k}$, where $\mu_{k}$ is the measure associated with the state action distribution given by $s\sim d_{\nu}^{\pi^{*}},a\sim\pi_{\lambda_{k}}$. Then we have

|Qπλk(s,a)Qk,J(s,a)|d(μk)dμkdμ|Qπλk(s,a)Qk,J(s,a)|d(μ)\displaystyle\int|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{{k,J}}(s,a)|d(\mu_{k})\leq\Bigg{|}\Bigg{|}\frac{d{\mu_{k}}}{d{\mu^{*}}}\Bigg{|}\Bigg{|}_{\infty}\int|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{{k,J}}(s,a)|d(\mu^{*}) (129)

where μ\mu^{*} is the measure associated with the state action distribution given by sdνπ,aπs\sim d_{\nu}^{\pi^{*}},a^{{}^{\prime}}\sim{\pi^{*}}. Using Assumption 6 we have dμkdμϕμ,μk\Bigg{|}\Bigg{|}\frac{d{\mu_{k}}}{d{\mu^{*}}}\Bigg{|}\Bigg{|}_{\infty}\leq\phi_{\mu^{*},\mu_{k}}. Thus Equation (129) becomes

|Qπλk(s,a)Qk,J(s,a)|d(μk)(ϕμk,μ)(|Qπλk(s,a)Qk,J(s,a)|d(μ)\displaystyle\int|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{{k,J}}(s,a)|d(\mu_{k})\leq(\phi_{\mu_{k},\mu^{*}})\int(|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{{k,J}}(s,a)|d(\mu^{*}) (130)

Since $\int|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a)|d\mu^{*}=\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a)|)$, Equation (128) now becomes

|𝔼(Aπλk(s,a)Ak,J(s,a))|\displaystyle|\mathbb{E}(A^{\pi_{\lambda_{k}}}(s,a)-A_{{k,J}}(s,a))| \displaystyle\leq (1+ϕμk,μ)|𝔼sdνπ,aπ(Qπλk(s,a)Qk,J(s,a))|\displaystyle(1+\phi_{\mu_{k},\mu^{*}})|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(Q^{\pi_{\lambda_{k}}}(s,a)-Q_{{k,J}}(s,a))|

Therefore, an upper bound on $|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a))|$ yields an upper bound on $|\mathbb{E}(A^{\pi_{\lambda_{k}}}(s,a)-A_{k,J}(s,a))|$.

In order to prove the bound on $|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a))|$, we first define some notation. Let $Q_{1},Q_{2}$ be two real-valued functions on the state action space. The expression $Q_{1}\geq Q_{2}$ means $Q_{1}(s,a)\geq Q_{2}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$.

Let $Q_{k,j}$ denote our estimate of the action value function at iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, and let $Q_{k,J}$ denote the final such estimate, where $J$ is the total number of iterations of the first for loop of Algorithm 1. $Q^{\pi_{\lambda_{k}}}$ denotes the action value function induced by the policy $\pi_{\lambda_{k}}$.

Consider ϵk,j+1=TπλkQk,jQk,j+1\epsilon_{k,j+1}=T^{\pi_{\lambda_{k}}}Q_{k,j}-Q_{k,j+1}.

TQk,jTπλkQk,j\displaystyle TQ_{k,j}\geq T^{\pi_{\lambda_{k}}}Q_{k,j} (132)

This follows from the definition of TπλkT^{\pi_{\lambda_{k}}} and TT in Equation (51) and (52), respectively.

Thus we get,

QπλkQk,j+1=\displaystyle Q^{\pi_{\lambda_{k}}}-Q_{k,j+1}= TπλkQπλkQk,j+1\displaystyle T^{\pi_{\lambda_{k}}}Q^{\pi_{\lambda_{k}}}-Q_{k,j+1} (133)
=\displaystyle= TπλkQπλkTπλkQk,j+TπλkQk,jTQk,j+TQk,jQk,j+1\displaystyle T^{\pi_{\lambda_{k}}}Q^{\pi_{\lambda_{k}}}-T^{\pi_{\lambda_{k}}}Q_{k,j}+T^{\pi_{\lambda_{k}}}Q_{k,j}-TQ_{k,j}+TQ_{k,j}-Q_{k,j+1} (134)
=\displaystyle= r(s,a)+γPπλkQπλk(r(s,a)+γPπλkQk,j)+(r(s,a)+γPπλkQk,j)\displaystyle r(s,a)+\gamma P^{\pi_{\lambda_{k}}}Q^{\pi_{\lambda_{k}}}-(r(s,a)+\gamma P^{\pi_{\lambda_{k}}}Q_{k,j})+(r(s,a)+\gamma P^{\pi_{\lambda_{k}}}Q_{k,j})
(r(s,a)+γPQk,j)+ϵk,j+1\displaystyle-(r(s,a)+\gamma P^{*}Q_{k,j})+\epsilon_{k,j+1}
=\displaystyle= γPπλk(QπλkQk,j)+γPπλkQk,jγPQk,j+ϵk,j+1\displaystyle\gamma P^{\pi_{\lambda_{k}}}(Q^{\pi_{\lambda_{k}}}-Q_{k,j})+\gamma P^{\pi_{\lambda_{k}}}Q_{k,j}-\gamma P^{*}Q_{k,j}+\epsilon_{k,j+1} (135)
\displaystyle\leq γ(Pπλk(QπλkQk,j))+ϵk,j+1\displaystyle\gamma(P^{\pi_{\lambda_{k}}}(Q^{\pi_{\lambda_{k}}}-Q_{k,j}))+\epsilon_{k,j+1} (136)

The right-hand side of Equation (133) is obtained by writing $Q^{\pi_{\lambda_{k}}}=T^{\pi_{\lambda_{k}}}Q^{\pi_{\lambda_{k}}}$, since $Q^{\pi_{\lambda_{k}}}$ is a fixed point of the operator $T^{\pi_{\lambda_{k}}}$. Equation (134) is obtained from (133) by adding and subtracting $T^{\pi_{\lambda_{k}}}Q_{k,j}$ and $TQ_{k,j}$. Equation (136) is obtained from (135) as $P^{\pi_{\lambda_{k}}}Q_{k,j}\leq P^{*}Q_{k,j}$, where $P^{*}$ is the operator with respect to the greedy policy of $Q_{k,j}$.

By recursion on $j$, we get,

QπλkQk,Jj=0J1γJj1(Pπλk)Jj1ϵk,j+γJ(Pπλk)J(QπλkQ0)Q^{\pi_{\lambda_{k}}}-Q_{k,J}\leq\sum_{j=0}^{J-1}\gamma^{J-j-1}(P^{\pi_{\lambda_{k}}})^{J-j-1}\epsilon_{k,j}+\gamma^{J}(P^{\pi_{\lambda_{k}}})^{J}(Q^{\pi_{\lambda_{k}}}-Q_{0}) (137)

using $TQ_{k,j}\geq T^{\pi^{*}}Q_{k,j}$ and $TQ_{k,j}\geq T^{\pi_{\lambda_{k}}}Q_{k,j}$, both of which follow from the definition of the operator $T$.

From this we obtain

\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(Q^{\pi_{\lambda_{k}}}-Q_{k,J})}\leq {\phi_{k}}\sum_{j=0}^{J-1}\gamma^{J-j-1}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}((P^{\pi_{\lambda_{k}}})^{J-j-1}\epsilon_{k,j})
+\gamma^{J}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(P^{\pi_{\lambda_{k}}})^{J}(Q^{\pi_{\lambda_{k}}}-Q_{0}) (138)

Taking the absolute value on both sides of Equation (138), we get

|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(Q^{\pi_{\lambda_{k}}}-Q_{k,J})}|\leq {\phi_{k}}\sum_{j=0}^{J-1}\gamma^{J-j-1}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}((P^{\pi_{\lambda_{k}}})^{J-j-1}|\epsilon_{k,j}|)
+\gamma^{J}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(P^{\pi_{\lambda_{k}}})^{J}(|Q^{\pi_{\lambda_{k}}}-Q_{0}|) (139)

For a fixed $j$, consider the term $\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}((P^{\pi_{\lambda_{k}}})^{J-j-1}|\epsilon_{k,j}|)$. Using Assumption 6, we write

|𝔼sdνπ,aπj((Pπλk)Jj1|ϵk,j|)|\displaystyle|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi_{j}}((P^{\pi_{\lambda_{k}}})^{J-j-1}|\epsilon_{k,j}|)| \displaystyle\leq d(Pπλk)Jj1μjdμk|ϵk,j𝑑μk|\displaystyle\Bigg{|}\Bigg{|}\frac{d{(P^{\pi_{\lambda_{k}}})^{J-j-1}\mu_{j}}}{d{\mu^{{}^{\prime}}_{k}}}\Bigg{|}\Bigg{|}_{\infty}\left|{\int}\epsilon_{k,j}d{\mu^{{}^{\prime}}_{k}}\right| (140)
\leq (\phi_{\mu^{\prime}_{k},\mu_{j}})\mathbb{E}_{(s,a)\sim\zeta^{\nu}_{\pi}(s,a)}(|\epsilon_{k,j}|) (141)

Here $\mu_{j}$ is the measure associated with the state action distribution given by sampling from $s\sim d_{\nu}^{\pi^{*}},a\sim{\pi^{*}}$ and then applying the operator $P^{\pi_{\lambda_{k}}}$ $J-j-1$ times. $\mu^{\prime}_{k}$ is the measure associated with the steady state action distribution given by $(s,a)\sim\zeta^{\nu}_{\pi}(s,a)$. Thus Equation (139) becomes

|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(Q^{\pi_{\lambda_{k}}}-Q_{k,J})}|\leq \sum_{j=0}^{J-1}\gamma^{J-j-1}(\phi_{\mu^{\prime}_{k},\mu_{j}})\mathbb{E}_{(s,a)\sim\zeta^{\nu}_{\pi}(s,a)}(|\epsilon_{k,j}|)+\gamma^{J}\left(\frac{R_{max}}{1-\gamma}\right) (143)

We get the second term on the right hand side by noting that (QπλkQ0)Rmax1γ(Q^{\pi_{\lambda_{k}}}-Q_{0})\leq\frac{R_{max}}{1-\gamma}. Now splitting ϵk,j\epsilon_{k,j} as was done in Equation (31) we obtain

𝔼sdνπ,aπ(QπλkQk,j)\displaystyle\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(Q^{\pi_{\lambda_{k}}}-Q_{k,j})}\leq j=0J1γJj1((ϕμk,μj)𝔼|ϵk,j1|+(ϕμk,μj)𝔼|ϵk,j2|\displaystyle\sum_{j=0}^{J-1}\gamma^{J-j-1}\big{(}(\phi_{\mu^{{}^{\prime}}_{k},\mu_{j}})\mathbb{E}{|\epsilon^{1}_{k,j}|}+(\phi_{\mu^{{}^{\prime}}_{k},\mu_{j}})\mathbb{E}{|\epsilon^{2}_{k,j}|}
+(ϕμk,μj)𝔼|ϵk,j3|+(ϕμk,μj)𝔼|ϵk,j4|)+γJ(Rmax1γ)\displaystyle+(\phi_{\mu^{{}^{\prime}}_{k},\mu_{j}})\mathbb{E}{|\epsilon^{3}_{k,j}|}+(\phi_{\mu^{{}^{\prime}}_{k},\mu_{j}})\mathbb{E}{|\epsilon^{4}_{k,j}|}\big{)}+\gamma^{J}\left(\frac{R_{max}}{1-\gamma}\right) (144)

Now, using Lemmas 14, 15, 16, and 17, we have

\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(Q^{\pi_{\lambda_{k}}}-Q_{k,J})} \leq \sum_{j=0}^{J-1}\tilde{\mathcal{O}}\left(\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}}\right)+(\sqrt{\epsilon_{approx}}+{\epsilon_{|\tilde{D}|}})+\tilde{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right) (145)
+\displaystyle+ 𝒪~(γJ)\displaystyle\tilde{\mathcal{O}}(\gamma^{J})

Combining Equation (145) with the bound on $|\mathbb{E}(A^{\pi_{\lambda_{k}}}-A_{k,J})|$ derived above, we get

\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(A^{\pi_{\lambda_{k}}}-A_{k,J})} \leq \sum_{j=0}^{J-1}\tilde{\mathcal{O}}\left(\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}}\right)+(\sqrt{\epsilon_{approx}}+{\epsilon_{|\tilde{D}|}})+\tilde{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right) (146)
+\displaystyle+ 𝒪~(γJ)\displaystyle\tilde{\mathcal{O}}(\gamma^{J})

We now derive bounds on term $II$. From Theorem 2 of (9769873), we have that

wkw2\displaystyle||w_{k}-w^{*}||_{2} \displaystyle\leq 𝒪(log(nk,j)nk,j)\displaystyle\mathcal{O}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right) (147)

Now define the function $F_{k}(w)=\mathbb{E}_{s,a\sim\zeta^{\nu}_{\pi}}\big{(}A_{k,J}-w^{\top}{\nabla}\log\pi_{\lambda_{k}}(a|s)\big{)}^{2}$. From this definition we obtain

Fk(wk)Fk(w)lFkwkw2\displaystyle F_{k}(w_{k})-F_{k}(w^{*})\leq l_{F_{k}}||w_{k}-w^{*}||_{2} \displaystyle\leq 𝒪(log(nk,j)nk,j)\displaystyle\mathcal{O}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right) (148)

where $l_{F_{k}}$ is the Lipschitz constant of $F_{k}(w)$. Thus we obtain

Fk(wk)Fk(w)\displaystyle F_{k}(w_{k})-F_{k}(w^{*}) \displaystyle\leq 𝒪(log(nk,j)nk,j)\displaystyle\mathcal{O}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right) (149)

From Assumption 4 we have

Fk(wk)ϵbias\displaystyle F_{k}(w_{k})-\epsilon_{bias} \displaystyle\leq 𝒪(log(nk,j)nk,j)\displaystyle\mathcal{O}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right) (150)

which gives us

Fk(wk)\displaystyle F_{k}(w_{k}) \displaystyle\leq 𝒪(log(nk,j)nk,j)+ϵbias\displaystyle\mathcal{O}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right)+\epsilon_{bias} (151)

Plugging Equations (151) and (146) in Equation (123) we get

\min_{k\leq K}(V^{*}(\nu)-V^{\pi_{\lambda_{k}}}(\nu)) \leq \frac{1}{K}\sum_{k=1}^{K}(V^{*}(\nu)-V^{\pi_{\lambda_{k}}}(\nu)) (152)
\displaystyle\leq 𝒪(1K(1γ))+1K(1γ)k=1K1j=1J𝒪(loglog(nk,j)nk,j)\displaystyle{\mathcal{O}}\left(\frac{1}{\sqrt{K}(1-\gamma)}\right)\!+\!\frac{1}{K(1-\gamma)}\sum_{k=1}^{K-1}\sum_{j=1}^{J}\mathcal{O}\left(\frac{\log\log(n_{k,j})}{\sqrt{n_{k,j}}}\right)
+k=1K1j=1J𝒪(1Tk,j)+1K(1γ)k=1K1𝒪(γJ)\displaystyle+\sum_{k=1}^{K-1}\sum_{j=1}^{J}{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K-1}{\mathcal{O}}(\gamma^{J})
+k=1K1j=1J(log(nk,j)nk,j)+11γ(ϵbias+(ϵapprox)+ϵ|D~|)\displaystyle+\sum_{k=1}^{K-1}\sum_{j=1}^{J}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right)+\frac{1}{1-\gamma}\left(\epsilon_{bias}+(\sqrt{\epsilon_{approx}})+{\epsilon_{|\tilde{D}|}}\right) (153)
\displaystyle\leq 𝒪(1K(1γ))+1K(1γ)k=1K1j=1J𝒪(loglog(nk,j)nk,j)\displaystyle{\mathcal{O}}\left(\frac{1}{\sqrt{K}(1-\gamma)}\right)\!+\!\frac{1}{K(1-\gamma)}\sum_{k=1}^{K-1}\sum_{j=1}^{J}\mathcal{O}\left(\frac{\log\log(n_{k,j})}{\sqrt{n_{k,j}}}\right)\
+k=1K1j=1J𝒪(1Tk,j)+1K(1γ)k=1K1𝒪(γJ)\displaystyle+\sum_{k=1}^{K-1}\sum_{j=1}^{J}{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K-1}{\mathcal{O}}(\gamma^{J})
+11γ(ϵbias+(ϵapprox)+ϵ|D~|)\displaystyle+\frac{1}{1-\gamma}\left(\epsilon_{bias}+(\sqrt{\epsilon_{approx}})+{\epsilon_{|\tilde{D}|}}\right) (154)

 

F Proof of Supporting Lemmas

F.1 Proof Of Lemma 14

Proof 

Using Assumption 3 and the definition of $Q^{1}_{k,j}$ for a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop, we have

𝔼(TπλkQk,j1Qk,j1)ν2ϵapprox\mathbb{E}(T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j})^{2}_{\nu}\leq\epsilon_{approx} (155)

where ν𝒫(𝒮×𝒜)\nu\in\mathcal{P}(\mathcal{S}{\times}{\mathcal{A}}).

Since |a|2=a2|a|^{2}=a^{2} we obtain

𝔼(|TπλkQk,j1Qk,j1|)ν2ϵapprox\mathbb{E}(|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|)^{2}_{\nu}\leq\epsilon_{approx} (156)

For a random variable $x$, $Var(x)=\mathbb{E}(x^{2})-(\mathbb{E}(x))^{2}$, hence $\mathbb{E}(x)=\sqrt{\mathbb{E}(x^{2})-Var(x)}$. Replacing $x$ with $|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|$, we get

𝔼(|TπλkQk,j1Qk,j1|)ν=𝔼(|TπλkQk,j1Qk,j1|)ν2Var(|TπλkQk,j1Qk,j1|)ν\mathbb{E}(|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|)_{\nu}=\sqrt{\mathbb{E}(|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|)^{2}_{\nu}-Var(|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|)_{\nu}} (157)

Therefore we get

\mathbb{E}(|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|)_{\nu}\leq\sqrt{\epsilon_{approx}} (158)

Since $\epsilon^{1}_{k,j}=T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}$, we have

\mathbb{E}(|\epsilon^{1}_{k,j}|)_{\nu}\leq\sqrt{\epsilon_{approx}} (159)

 

F.2 Proof Of Lemma 15

Proof  From Lemma 11, we have

argminfθ𝔼x,y(fθ(x)g(x,y))2=argminfθ(𝔼x,y(fθ(x)𝔼(g(y,x)|x))2)\operatorname*{arg\,min}_{f_{\theta}}\mathbb{E}_{x,y}\left(f_{\theta}(x)-g(x,y)\right)^{2}=\operatorname*{arg\,min}_{f_{\theta}}\left(\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}(g(y^{{}^{\prime}},x)|x)\right)^{2}\right) (160)

We label $x$ as the state action pair $(s,a)$, the function $f_{\theta}(x)$ as $Q_{\theta}(s,a)$, and $g(x,y)$ as the function $r^{\prime}(s,a)+{\gamma}\sum_{a^{\prime}\in\mathcal{A}}\pi_{\lambda_{k}}(a^{\prime}|s^{\prime})Q_{k,j-1}(s^{\prime},a^{\prime})=r^{\prime}(s,a)+{\gamma}\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s^{\prime},a^{\prime})$, where $y$ is the two dimensional random variable $(r^{\prime}(s,a),s^{\prime})$ with $s\sim d^{\pi_{\lambda_{k}}}_{\nu}$, $a\sim\pi_{\lambda_{k}}(\cdot|s)$, $s^{\prime}\sim P(\cdot|s,a)$, and $r^{\prime}(s,a)\sim R(\cdot|s,a)$.

Then the loss function in (69) becomes

𝔼sdνπλk,aπλk(.|s),sP(s|s,a),r(s,a)(.|s,a)(Qθ(s,a)(r(s,a)+γ𝔼Qk,j1(s,a)))2\mathbb{E}_{s\sim d^{\pi_{{\lambda}_{k}}}_{\nu},a\sim\pi_{\lambda_{k}}(.|s),s^{{}^{\prime}}\sim P(s^{{}^{\prime}}|s,a),r(s,a)\sim\mathcal{R}(.|s,a)}(Q_{\theta}(s,a)-(r^{{}^{\prime}}(s,a)+{\gamma}\mathbb{E}Q_{k,j-1}(s^{{}^{\prime}},a^{{}^{\prime}})))^{2} (161)

Therefore, by Lemma 11, the function $Q_{\theta}(s,a)$ which minimizes Equation (161) also minimizes

𝔼sdνπλk,aπλk(Qθ(s,a)𝔼sP(s|s,a),r(.|s,a))(r(s,a)+γ𝔼Qk,j1(s,a)|s,a))2\mathbb{E}_{s\sim d^{\pi_{{\lambda}_{k}}}_{\nu},a\sim\pi_{\lambda_{k}}}(Q_{\theta}(s,a)-\mathbb{E}_{s^{{}^{\prime}}\sim P(s^{{}^{\prime}}|s,a),r\sim\mathcal{R}(.|s,a))}(r^{{}^{\prime}}(s,a)+{\gamma}\mathbb{E}Q_{k,j-1}(s^{{}^{\prime}},a^{{}^{\prime}})|s,a))^{2} (162)

But from the definition of the Bellman operator in Equation (51) we have

𝔼sP(s|s,a),rR(.|s,a))(r(s,a)+γ𝔼Qk,j1(s,a)|s,a)=TπλkQk,j1\mathbb{E}_{s^{{}^{\prime}}\sim P(s^{{}^{\prime}}|s,a),r\sim{R}(.|s,a))}(r^{{}^{\prime}}(s,a)+{\gamma}\mathbb{E}Q_{k,j-1}(s^{{}^{\prime}},a^{{}^{\prime}})|s,a)=T^{\pi_{\lambda_{k}}}Q_{k,j-1} (163)

Combining Equations (162) and (163), we get

argminQθ𝔼(Qθ(s,a)(r(s,a)+γ𝔼Qk,j1(s,a)))2=argminQθ𝔼(Qθ(s,a)TπλkQk,j1)2\operatorname*{arg\,min}_{Q_{\theta}}\mathbb{E}(Q_{\theta}(s,a)-(r(s,a)+{\gamma}\mathbb{E}Q_{k,j-1}(s^{{}^{\prime}},a^{{}^{\prime}})))^{2}=\operatorname*{arg\,min}_{Q_{\theta}}\mathbb{E}(Q_{\theta}(s,a)-T^{\pi_{\lambda_{k}}}Q_{k,j-1})^{2} (164)

The left hand side of Equation (164) is Qk,j2Q^{2}_{k,j} as defined in Definition 3 and the right hand side is Qk,j1Q^{1}_{k,j} as defined in Definition 2, which gives us

Qk,j2=Qk,j1Q^{2}_{k,j}=Q^{1}_{k,j} (165)

 

F.3 Proof Of Lemma 16

Proof 

We define RXk,j,Qk,j1(θ)R_{X_{k,j},Q_{k,j-1}}({\theta}) as

RXk,j,Qk,j1(θ)=1|Xk,j|(si,ai)Xk,j(Qθ(si,ai)(r(si,ai)+γ𝔼aπλkQk,j1(si+1,a)))2,R_{X_{k,j},Q_{k,j-1}}({\theta})=\frac{1}{|X_{k,j}|}\sum_{(s_{i},a_{i})\in X_{k,j}}\Bigg{(}Q_{\theta}(s_{i},a_{i})-\Bigg{(}r(s_{i},a_{i})\\ +\gamma\mathbb{E}_{a^{{}^{\prime}}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{{}^{\prime}})\Bigg{)}\Bigg{)}^{2},

Here, Xk,j={si,ai}i={1,,|Xk,j|}X_{k,j}=\{s_{i},a_{i}\}_{i=\{1,\cdots,|X_{k,j}|\}}, where s,as,a are sampled from a Markov chain whose stationary distribution is, sdνπλk,aπλks\sim d_{\nu}^{\pi_{\lambda_{k}}},a\sim\pi_{\lambda_{k}}. QθQ_{\theta} is as defined in Equation (10) and Qk,j1Q_{k,j-1} is the estimate of the QQ function obtained at iteration kk of the outer for loop and iteration j1j-1 of the first inner for loop of Algorithm 1.

We also define the term

L_{Q_{k,j-1}}(Q_{\theta})=\mathbb{E}\Big{(}Q_{\theta}(s,a)-\big{(}r^{\prime}(s,a)+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s^{\prime},a^{\prime})\big{)}\Big{)}^{2} (166)

where $s\sim d^{\pi_{\lambda_{k}}}_{\nu}$, $a\sim\pi_{\lambda_{k}}(\cdot|s)$, $s^{\prime}\sim P(\cdot|s,a)$, and $r(s,a)\sim R(\cdot|s,a)$.

We denote by θk,j2,θk,j3\theta^{2}_{k,j},\theta^{3}_{k,j} the parameters of the neural networks Qk,j2,Qk,j3Q^{2}_{k,j},Q^{3}_{k,j} respectively. Qk,j2,Qk,j3Q^{2}_{k,j},Q^{3}_{k,j} are defined in Definition 3 and 4 respectively.

We then obtain,

R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j})-R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j}) \leq R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j})-R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j})
+L_{Q_{k,j-1}}(\theta^{3}_{k,j})-L_{Q_{k,j-1}}(\theta^{2}_{k,j})
= \big{(}R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j})-L_{Q_{k,j-1}}(\theta^{2}_{k,j})\big{)}
-\big{(}R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j})-L_{Q_{k,j-1}}(\theta^{3}_{k,j})\big{)}
\leq \underbrace{|R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j})-L_{Q_{k,j-1}}(\theta^{2}_{k,j})|}_{I}
+\underbrace{|R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j})-L_{Q_{k,j-1}}(\theta^{3}_{k,j})|}_{II}

We get the first inequality above because $L_{Q_{k,j-1}}(\theta^{3}_{k,j})-L_{Q_{k,j-1}}(\theta^{2}_{k,j})\geq 0$, as $Q^{2}_{k,j}$ is the minimizer of the loss function $L_{Q_{k,j-1}}(Q_{\theta})$.

Consider Lemma 10. The loss function $R_{X_{k,j},Q_{k,j-1}}(\theta)$ can be written as the mean of loss functions of the form $l(a_{\theta}(s_{i},a_{i}),y_{i})$, where $l$ is the squared-error function, $a_{\theta}(s_{i},a_{i})=Q_{\theta}(s_{i},a_{i})$, and $y_{i}=\big{(}r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime})\big{)}$. Thus we have

𝔼supθΘ|RXk,j,Qk,j1(θ)LQk,j1(θ)|\displaystyle\mathbb{E}\sup_{\theta\in\Theta}|R_{X_{k,j},Q_{k,j-1}}(\theta)-L_{Q_{k,j-1}}(\theta)|\leq (170)
2η𝔼(Rad(𝒜{(s1,a1),(s2,a2),(s3,a3),,(sn,an)}))\displaystyle 2{\eta}^{{}^{\prime}}\mathbb{E}\left(Rad(\mathcal{A}\circ\{(s_{1},a_{1}),(s_{2},a_{2}),(s_{3},a_{3}),\cdots,(s_{n},a_{n})\})\right)

Note that the expectation is over all $(s_{i},a_{i})$, where $n_{k,j}=|X_{k,j}|$, $\mathcal{A}\circ\{(s_{1},a_{1}),(s_{2},a_{2}),\cdots,(s_{n},a_{n})\}=\{Q_{\theta}(s_{1},a_{1}),Q_{\theta}(s_{2},a_{2}),\cdots,Q_{\theta}(s_{n},a_{n})\}$, and $\eta^{\prime}$ is the Lipschitz constant of the squared-error loss over the state action space $[0,1]^{d}$. The expectation is with respect to $s_{i}\sim d^{\pi_{\lambda_{k}}}_{\nu}$, $a_{i}\sim\pi_{\lambda_{k}}(\cdot|s_{i})$, $s_{i+1}\sim P(\cdot|s_{i},a_{i})$, and $r_{i}\sim R(\cdot|s_{i},a_{i})$ for $i\in(1,\cdots,n_{k,j})$.

From proposition 11 of (article) we have that

(Rad(𝒜{(s1,a1),(s2,a2),(s3,a3),,(sn,an)}))Cklog(log(nk,j))nk,j\left(Rad(\mathcal{A}\circ\{(s_{1},a_{1}),(s_{2},a_{2}),(s_{3},a_{3}),\cdots,(s_{n},a_{n})\})\right)\leq C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} (171)

Note that Proposition 11 of (article) establishes an upper bound on the Rademacher complexity using Theorem 4 of (article) with the aim of applying it to the Metropolis-Hastings algorithm. For our purposes, we only use the upper bound on the Rademacher complexity established in Proposition 11 of (article).

We use this result because the state action pairs are drawn not from the stationary distribution of the policy $\pi_{\lambda_{k}}$ but from a Markov chain with the same stationary distribution. Thus we have

𝔼|(RXk,j,Qk,j1(θk,j2))LQk,j1(θk,j2)|Cklog(log(nk,j))nk,j\displaystyle\mathbb{E}|(R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j}))-L_{Q_{k,j-1}}(\theta^{2}_{k,j})|\leq C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} (172)

The same argument can be applied for Qk,j3Q^{3}_{k,j} to get

𝔼|(RXk,j,Qk,j1(θk,j3))LQk,j1(θk,j3)|Cklog(log(nk,j))nk,j\displaystyle\mathbb{E}|(R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j}))-L_{Q_{k,j-1}}(\theta^{3}_{k,j})|\leq C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} (173)

Then we have

𝔼(RXk,j,Qk,j1(θk,j2)RXk,j,Qk,j1(θk,j3))Cklog(log(nk,j))nk,j\mathbb{E}\left(R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j})-R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j})\right)\leq C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} (174)

Plugging in the definition of RXk,j,Qk,j1(θk,j2),RXk,j,Qk,j1(θk,j3)R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j}),R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j}) in equation (174) and denoting Cklog(log(nk,j))nk,jC_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} as ϵ\epsilon we get

\frac{1}{n_{k,j}}\sum_{i=1}^{n_{k,j}}\Big{(}\mathbb{E}\big{(}Q^{2}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2}
-\mathbb{E}\big{(}Q^{3}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2}\Big{)}\leq\epsilon (175)

Now for a fixed ii consider the term αi\alpha_{i} defined as.

\alpha_{i}=\mathbb{E}_{s_{i+1}\sim P(\cdot|s_{i},a_{i})}\big{(}Q^{2}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2}
-\mathbb{E}_{s_{i+1}\sim P(\cdot|s_{i},a_{i})}\big{(}Q^{3}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2} (176)

where si,ais_{i},a_{i} are drawn from the state action distribution at the ithi^{th} step of the Markov chain induced by following the policy πλk\pi_{\lambda_{k}}.

Now for a fixed ii consider the term βi\beta_{i} defined as.

\beta_{i}=\mathbb{E}_{s_{i+1}\sim P(\cdot|s_{i},a_{i})}\big{(}Q^{2}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2}
-\mathbb{E}_{s_{i+1}\sim P(\cdot|s_{i},a_{i})}\big{(}Q^{3}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2} (177)

where s_{i},a_{i} are drawn from the steady state action distribution, with s_{i}\sim d_{\nu}^{\pi_{\lambda_{k}}} and a_{i}\sim\pi_{\lambda_{k}}. Note here that \alpha_{i} and \beta_{i} are the same function, with only the state action pairs being drawn from different distributions.

Using these definitions we obtain

|\mathbb{E}(\alpha_{i})-\mathbb{E}(\beta_{i})| \leq \sup_{(s_{i},a_{i})}|2\cdot\max(\alpha_{i},\beta_{i})|\,\kappa_{i}  (178)
\leq \left(\frac{4R}{1-\gamma}\right)^{2}m\rho^{i}  (179)

We obtain Equation (178) by using the identity |\int f\,d\mu-\int f\,d\nu|\leq|\max_{\mathcal{S}\times\mathcal{A}}(f)|\int_{\mathcal{S}\times\mathcal{A}}|d\mu-d\nu|=2\,|\max_{\mathcal{S}\times\mathcal{A}}(f)|\,d_{TV}(\mu,\nu), where \mu and \nu are two \sigma-finite state action probability measures and f is a bounded measurable function. We have used \kappa_{i} to denote the total variation distance between the steady state action distribution, given by s_{i}\sim d_{\nu}^{\pi_{\lambda_{k}}}, a_{i}\sim\pi_{\lambda_{k}}, and the state action distribution at the i^{th} step of the Markov chain induced by following the policy \pi_{\lambda_{k}}. The expectation is with respect to (s_{i},a_{i}). We obtain Equation (179) from Equation (178) by using Assumption 5 and the fact that \alpha_{i} and \beta_{i} are upper bounded by \left(\frac{4R}{1-\gamma}\right)^{2}.
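
As a quick sanity check of this inequality (with the convention d_{TV}(\mu,\nu)=\frac{1}{2}\int|d\mu-d\nu|), the following minimal sketch compares the two sides for two arbitrary discrete distributions; the support size, the distributions and the function f below are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
support_size = 20
mu = rng.random(support_size); mu /= mu.sum()    # two arbitrary probability vectors
nu = rng.random(support_size); nu /= nu.sum()
f = rng.uniform(-3.0, 3.0, size=support_size)    # a bounded function on the common support

lhs = abs(f @ mu - f @ nu)                       # |integral of f d(mu) - integral of f d(nu)|
tv = 0.5 * np.abs(mu - nu).sum()                 # total variation distance d_TV(mu, nu)
rhs = 2.0 * np.max(np.abs(f)) * tv               # 2 * max|f| * d_TV(mu, nu)
assert lhs <= rhs + 1e-12
print(lhs, rhs)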

From equation (179) we get

\mathbb{E}(\alpha_{i}) \geq \mathbb{E}(\beta_{i})-4\left(\frac{R}{1-\gamma}\right)^{2}m\rho^{i}  (180)

We get Equation (180) from Equation (179) using the fact that |a-b|\leq c implies -c\leq(a-b)\leq c, which in turn implies a\geq b-c.

Using Equation (180) in Equation (175) we get

\frac{1}{n_{k,j}}\sum_{i=1}^{n_{k,j}}\Big(\mathbb{E}\big(Q^{2}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\mathbb{E}_{a^{\prime}\sim\pi^{\lambda_{k}}}Q^{2}_{k,j}(s_{i+1},a^{\prime}))\big)^{2}
-\mathbb{E}\big(Q^{3}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\mathbb{E}_{a^{\prime}\sim\pi^{\lambda_{k}}}Q^{3}_{k,j}(s_{i+1},a^{\prime}))\big)^{2}\Big)
\leq \epsilon+\frac{1}{n_{k,j}}\sum_{i=1}^{n_{k,j}}4\left(\frac{R}{1-\gamma}\right)^{2}m\rho^{i}
\leq \epsilon+\frac{1}{n_{k,j}}\,4\left(\frac{R}{1-\gamma}\right)^{2}\frac{m}{1-\rho}

In the equation above, (s_{i},a_{i}) are now drawn from s_{i}\sim d_{\nu}^{\pi_{\lambda_{k}}} and a_{i}\sim\pi_{\lambda_{k}} for all i, and the last inequality uses the geometric series bound \sum_{i=1}^{n_{k,j}}\rho^{i}\leq\frac{\rho}{1-\rho}\leq\frac{1}{1-\rho} (since 0<\rho<1).

We ignore the second term on the right hand side, as it is \tilde{\mathcal{O}}\left(\frac{1}{n_{k,j}}\right) compared to the first term, which is \tilde{\mathcal{O}}\left(\frac{\log\log(n_{k,j})}{\sqrt{n_{k,j}}}\right). Additionally, the expectation in the equation above is with respect to s_{i}\sim d^{\pi_{\lambda_{k}}}_{\nu}, a_{i}\sim\pi_{\lambda_{k}}(\cdot|s_{i}), r_{i}\sim R(\cdot|s_{i},a_{i}) and s_{i+1}\sim P(\cdot|s_{i},a_{i}).

Since now we have s_{i}\sim d_{\nu}^{\pi_{\lambda_{k}}} and a_{i}\sim\pi_{\lambda_{k}} for all i, the equation above is equivalent to

\mathbb{E}\,\underbrace{(Q^{2}_{k,j}(s,a)-Q^{3}_{k,j}(s,a))}_{A1}\underbrace{\left(Q^{2}_{k,j}(s,a)+Q^{3}_{k,j}(s,a)-2\left(r(s,a)+\gamma\max_{a\in\mathcal{A}}Q_{k,j-1}(s^{\prime},a)\right)\right)}_{A2} \leq \epsilon

where the expectation is now over s\sim d_{\nu}^{\pi_{\lambda_{k}}}, a\sim\pi_{\lambda_{k}}, r(s,a)\sim R(\cdot|s,a) and s^{\prime}\sim P(\cdot|s,a). We re-write this as

\int\underbrace{(Q^{2}_{k,j}(s,a)-Q^{3}_{k,j}(s,a))}_{A1}\underbrace{\left(Q^{2}_{k,j}(s,a)+Q^{3}_{k,j}(s,a)-2\left(r(s,a)+\gamma\max_{a\in\mathcal{A}}Q_{k,j-1}(s^{\prime},a)\right)\right)}_{A2}\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})\leq\epsilon.  (183)

Here \mu_{1} is the state action distribution with s\sim d_{\nu}^{\pi_{\lambda_{k}}}, a\sim\pi_{\lambda_{k}}, and \mu_{2}, \mu_{3} are the measures of r and s^{\prime} respectively.

Now, for the integral in Equation (183), we split the integral into four different integrals, each over the set of (s,a),r,s^{\prime} corresponding to one of the four combinations of signs of A1 and A2:

\int_{\{(s,a),r,s^{\prime}\}:A1\geq 0,A2\geq 0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})
+\int_{\{(s,a),r,s^{\prime}\}:A1<0,A2<0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})
+\int_{\{(s,a),r,s^{\prime}\}:A1\geq 0,A2<0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})
+\int_{\{(s,a),r,s^{\prime}\}:A1<0,A2\geq 0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})\leq\epsilon  (184)
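
The decomposition above only partitions the domain of integration according to the signs of A1 and A2. As a minimal numeric sanity check of this bookkeeping (with arbitrary illustrative arrays standing in for A1 and A2 evaluated on common samples from the product measure):

import numpy as np

rng = np.random.default_rng(3)
a1 = rng.normal(size=10_000)   # stand-in for A1 evaluated on the samples
a2 = rng.normal(size=10_000)   # stand-in for A2 on the same samples
total = np.mean(a1 * a2)
parts = [np.mean(a1 * a2 * ((a1 >= 0) & (a2 >= 0))),   # both non-negative: contribution >= 0
         np.mean(a1 * a2 * ((a1 < 0) & (a2 < 0))),     # both negative:     contribution >= 0
         np.mean(a1 * a2 * ((a1 >= 0) & (a2 < 0))),    # mixed signs:       contribution <= 0
         np.mean(a1 * a2 * ((a1 < 0) & (a2 >= 0)))]    # mixed signs:       contribution <= 0
assert np.isclose(total, sum(parts))                   # the four sign regions partition the expectation
print(total, parts)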

Now note that the first two terms are non-negative and the last two terms are non-positive. We write the first two terms as

\int_{\{(s,a),r,s^{\prime}\}:A1\geq 0,A2\geq 0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime}) = C_{{k,j}_{1}}\int|Q^{2}_{k,j}-Q^{3}_{k,j}|\,d\mu_{1} = C_{{k,j}_{1}}\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}}
\int_{\{(s,a),r,s^{\prime}\}:A1<0,A2<0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime}) = C_{{k,j}_{2}}\int|Q^{2}_{k,j}-Q^{3}_{k,j}|\,d\mu_{1} = C_{{k,j}_{2}}\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}}

We write the last two terms as

\int_{\{(s,a),r,s^{\prime}\}:A1\geq 0,A2<0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})=-C_{{k,j}_{3}}\epsilon  (187)
\int_{\{(s,a),r,s^{\prime}\}:A1<0,A2\geq 0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})=-C_{{k,j}_{4}}\epsilon  (188)

Here C_{{k,j}_{1}}, C_{{k,j}_{2}}, C_{{k,j}_{3}} and C_{{k,j}_{4}} are positive constants. Plugging the above expressions for the first two terms, together with Equations (187) and (188), into Equation (183), we obtain

(C_{{k,j}_{1}}+C_{{k,j}_{2}})\,\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}}-(C_{{k,j}_{3}}+C_{{k,j}_{4}})\epsilon \leq \epsilon  (189)

which implies

\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}} \leq \left(\frac{1+C_{{k,j}_{3}}+C_{{k,j}_{4}}}{C_{{k,j}_{1}}+C_{{k,j}_{2}}}\right)\epsilon  (191)

Substituting \epsilon=C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} back into Equation (191), we have

\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}} \leq \left(\frac{1+C_{{k,j}_{3}}+C_{{k,j}_{4}}}{C_{{k,j}_{1}}+C_{{k,j}_{2}}}\right)C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}}  (193)

which implies

\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}} \leq \tilde{\mathcal{O}}\left(\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}}\right)  (195)

 

F.4 Proof Of Lemma 4

Proof  For a given iteration k of Algorithm 1 and iteration j of the first inner for loop, the optimization problem to be solved in Algorithm 2 is the following

\mathcal{L}(\theta)=\frac{1}{n_{k,j}}\sum_{i=1}^{n_{k,j}}\left(Q_{\theta}(s_{i},a_{i})-\left(r(s_{i},a_{i})+\gamma\max_{a^{\prime}\in\mathcal{A}}Q_{k,j-1}(s^{\prime},a^{\prime})\right)\right)^{2}  (197)

Here, Q_{k,j-1} is the estimate of the Q function from iteration j-1, and the state action pairs (s_{i},a_{i})_{i=\{1,\cdots,n\}} have been sampled from a distribution over the state action pairs denoted by \nu. Since \min_{\theta}\mathcal{L}(\theta) is a non-convex optimization problem, we instead solve the equivalent convex problem given by

u_{k,j}^{*} = \operatorname*{arg\,min}_{u}g_{k,j}(u)=\operatorname*{arg\,min}_{u}||\sum_{D_{i}\in\tilde{D}}D_{i}X_{k,j}u_{i}-y_{k}||^{2}_{2}  (199)
\textit{subject to}\ |u|_{1}\leq\frac{R_{\max}}{1-\gamma}

Here, X_{k,j}\in\mathbb{R}^{n_{k,j}\times d} is the matrix of sampled state action pairs at iteration k, and y_{k}\in\mathbb{R}^{n_{k,j}\times 1} is the vector of target values at iteration k. \tilde{D} is the set of diagonal matrices obtained from line 2 of Algorithm 2, and u\in\mathbb{R}^{|\tilde{D}|d\times 1} (note that we are treating u as a vector here for notational convenience, instead of a matrix as was done in Section 4).

The constraint in Equation (199) ensures that all the co-ordinates of the vector \sum_{D_{i}\in\tilde{D}}D_{i}X_{k,j}u_{i} are upper bounded by \frac{R_{\max}}{1-\gamma} in absolute value: since every element of X_{k,j} lies between 0 and 1 and each D_{i} is a diagonal matrix with entries in \{0,1\}, Hölder's inequality gives |(\sum_{D_{i}\in\tilde{D}}D_{i}X_{k,j}u_{i})_{j}|\leq\sum_{i}||u_{i}||_{1}=|u|_{1}\leq\frac{R_{\max}}{1-\gamma} for every co-ordinate j. This ensures that the corresponding neural network represented by Equation (10) is also upper bounded by \frac{R_{\max}}{1-\gamma}. We use projected gradient descent to solve the constrained convex optimization problem, which can be written as

u_{k,j}^{*} = \operatorname*{arg\,min}_{u:|u|_{1}\leq\frac{R_{\max}}{1-\gamma}}g_{k,j}(u)=\operatorname*{arg\,min}_{u:|u|_{1}\leq\frac{R_{\max}}{1-\gamma}}||\sum_{D_{i}\in\tilde{D}}D_{i}X_{k,j}u_{i}-y_{k}||^{2}_{2}  (200)

From the notes of Ang (2017), “Continuous Optimization”, https://angms.science/doc/CVX, we have that if the step size is \alpha=\frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}}, then after T_{k,j} iterations of the projected gradient descent algorithm we obtain

g_{k,j}(u_{T_{k,j}})-g_{k,j}(u^{*})\leq L_{k,j}\frac{||u_{k,j}^{*}||_{2}}{\sqrt{T_{k,j}+1}}  (201)

where L_{k,j} is the Lipschitz constant of g_{k,j}(u) and u_{T_{k,j}} is the parameter estimate at step T_{k,j}.

Therefore, if the number of iterations T_{k,j} of the projected gradient descent algorithm and the step-size \alpha satisfy

T_{k,j} \geq L_{k,j}^{2}||u_{k,j}^{*}||^{2}_{2}\epsilon^{-2}-1,  (202)
\alpha = \frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}},  (203)

we have

g_{k,j}(u_{T_{k,j}})-g_{k,j}(u^{*})\leq\epsilon  (204)
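
To make the procedure concrete, here is a minimal Python sketch (not the paper's implementation) of projected gradient descent on a convex objective of the form g_{k,j}(u)=||\sum_{D_{i}\in\tilde{D}}D_{i}X_{k,j}u_{i}-y_{k}||_{2}^{2} with an \ell_{1}-ball constraint. The diagonal 0/1 matrices below are an illustrative stand-in for \tilde{D}, the radius stands in for \frac{R_{\max}}{1-\gamma}, and a constant step size 1/L (with L the gradient Lipschitz constant of this quadratic) is used instead of the schedule in Equations (202) and (203); all sizes and names are assumptions made for the example.

import numpy as np

def project_l1_ball(v, radius):
    # Euclidean projection of v onto {u : ||u||_1 <= radius} (sort-based procedure).
    if np.abs(v).sum() <= radius:
        return v
    w = np.sort(np.abs(v))[::-1]
    css = np.cumsum(w)
    k = np.nonzero(w * np.arange(1, v.size + 1) > (css - radius))[0][-1]
    tau = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
n, d, P = 200, 6, 8                                    # samples, feature dimension, number of patterns
X = rng.uniform(0.0, 1.0, size=(n, d))                 # state action features in [0, 1]^d
y = rng.uniform(0.0, 1.0, size=n)                      # regression targets
gates = rng.normal(size=(d, P))
D = [np.diag((X @ gates[:, i] >= 0).astype(float)) for i in range(P)]  # illustrative diagonal patterns

A = np.hstack([D[i] @ X for i in range(P)])            # A u = sum_i D_i X u_i with u = (u_1, ..., u_P)
g = lambda u: np.sum((A @ u - y) ** 2)                 # convex objective g(u)
grad_g = lambda u: 2.0 * A.T @ (A @ u - y)

radius = 10.0                                          # stands in for R_max / (1 - gamma)
L = 2.0 * np.linalg.norm(A, 2) ** 2                    # gradient Lipschitz constant of g
alpha = 1.0 / L
u = np.zeros(P * d)
for _ in range(2000):                                  # projected gradient descent
    u = project_l1_ball(u - alpha * grad_g(u), radius)
print(g(u))

The fixed 1/L step exploits smoothness of the quadratic; the schedule in Equations (202) and (203) is instead the one tied to the guarantee of Equation (201).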

Let (v^{*}_{i},w^{*}_{i})_{i\in(1,\cdots,|\tilde{D}|)} and (v^{T_{k,j}}_{i},w^{T_{k,j}}_{i})_{i\in(1,\cdots,|\tilde{D}|)} be defined as

(v^{*}_{i},w^{*}_{i})_{i\in(1,\cdots,|\tilde{D}|)}=\psi_{i}^{\prime}(u^{*}_{i})_{i\in(1,\cdots,|\tilde{D}|)}  (205)
(v^{T_{k,j}}_{i},w^{T_{k,j}}_{i})_{i\in(1,\cdots,|\tilde{D}|)}=\psi_{i}^{\prime}(u^{T_{k,j}}_{i})_{i\in(1,\cdots,|\tilde{D}|)}  (206)

where \psi^{\prime} is defined in Equation (49).

Further, we define \theta_{|\tilde{D}|}^{*} and \theta^{T_{k,j}} as

\theta_{|\tilde{D}|}^{*}=\psi(v^{*}_{i},w^{*}_{i})_{i\in(1,\cdots,|\tilde{D}|)}  (207)
\theta^{T_{k,j}}=\psi(v^{T_{k,j}}_{i},w^{T_{k,j}}_{i})_{i\in(1,\cdots,|\tilde{D}|)}  (208)

where \psi is defined in Equation (44) and \theta_{|\tilde{D}|}^{*}=\operatorname*{arg\,min}_{\theta}\mathcal{L}_{|\tilde{D}|}(\theta), with \mathcal{L}_{|\tilde{D}|}(\theta) defined in Appendix A.

Since g_{k,j}(u_{T_{k,j}})-g_{k,j}(u^{*})\leq\epsilon, by Lemma 12 we have

\mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*})\leq\epsilon  (209)

Note that \mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*}) is a constant value. Thus we can always find a constant C^{\prime}_{k,j} such that

C_{k,j}^{\prime}|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*})  (210)
|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\frac{\mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*})}{C_{k,j}^{\prime}}  (211)

Therefore if we have

T_{k,j} \geq L_{k,j}^{2}||u_{k,j}^{*}||^{2}_{2}\epsilon^{-2}-1,  (212)
\alpha_{k,j} = \frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}},  (213)

then we have

|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\frac{\epsilon}{C_{k,j}^{\prime}}  (214)

This follows because, combining Equations (209) and (210), we have

C_{k,j}^{\prime}|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*})\leq\epsilon  (215)

Dividing Equation (215) by C_{k,j}^{\prime} we get

|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\frac{\mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*})}{C_{k,j}^{\prime}}\leq\frac{\epsilon}{C_{k,j}^{\prime}}  (216)

which implies

|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\frac{\epsilon}{C_{k,j}^{\prime}}  (217)

Assuming \epsilon is small enough that \frac{\epsilon}{C_{k,j}^{\prime}}<1, Lemma 13 implies that there exists an L_{k,j}>0 such that

|Q_{\theta^{T_{k,j}}}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)| \leq L_{k,j}|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}  (218)
\leq \frac{L_{k,j}\epsilon}{C_{k,j}^{\prime}}  (219)

for all (s,a)\in\mathcal{S}\times\mathcal{A}. Equation (219) implies that if

T_{k,j} \geq L_{k,j}^{2}||u_{k,j}^{*}||^{2}_{2}\epsilon^{-2}-1,  (220)
\alpha_{k,j} = \frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}},  (221)

then we have

\mathbb{E}(|Q_{\theta^{T_{k,j}}}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|) \leq \frac{L_{k,j}\epsilon}{C_{k,j}^{\prime}}  (222)

By the definition in Section E, Q_{k,j} is our estimate of the Q function at the k^{th} iteration of Algorithm 1, and thus we have Q_{\theta^{T_{k,j}}}=Q_{k,j}, which implies that

\mathbb{E}(|Q_{k,j}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|) \leq \frac{L_{k,j}\epsilon}{C_{k,j}^{\prime}}  (223)

If we replace \epsilon by \frac{C^{\prime}_{k,j}\epsilon}{L_{k,j}} in Equation (222), we get that if

T_{k,j} \geq \left(\frac{C^{\prime}_{k,j}\epsilon}{L_{k,j}}\right)^{-2}L_{k,j}^{2}||u_{k,j}^{*}||^{2}_{2}-1,  (224)
\alpha_{k,j} = \frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}},  (225)

we have

\mathbb{E}(|Q_{k,j}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|) \leq \epsilon  (226)

From Assumption 2, we have that

\mathbb{E}(|Q_{\theta^{*}}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|) \leq \epsilon_{|\tilde{D}|}  (227)

where \theta^{*}=\operatorname*{arg\,min}_{\theta\in\Theta}\mathcal{L}(\theta) and, by the definition of Q^{3}_{k,j} in Definition 7, we have Q^{3}_{k,j}=Q_{\theta^{*}}. Therefore if we have

T_{k,j} \geq \left(\frac{C^{\prime}_{k,j}\epsilon}{L_{k,j}}\right)^{-2}L_{k,j}^{2}||u_{k,j}^{*}||^{2}_{2}-1,  (228)
\alpha_{k,j} = \frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}},  (229)

we have

\mathbb{E}(|Q_{k,j}(s,a)-Q^{3}_{k,j}(s,a)|)_{\nu} \leq \mathbb{E}(|Q_{k,j}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|)+\mathbb{E}(|Q^{3}_{k,j}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|)
\leq \epsilon+\epsilon_{|\tilde{D}|}  (231)

This implies

\mathbb{E}(|Q_{k,j}(s,a)-Q^{3}_{k,j}(s,a)|) \leq \tilde{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right)+\epsilon_{|\tilde{D}|}  (232)