
On the Global Convergence of Natural Actor-Critic with Two-layer Neural Network Parametrization

Mudit Gaur
Amrit Singh Bedi
Di Wang
Vaneet Aggarwal
Abstract

Actor-critic algorithms have shown remarkable success in solving state-of-the-art decision-making problems. However, despite their empirical effectiveness, their theoretical underpinnings remain relatively unexplored, especially with neural network parametrization. In this paper, we delve into the study of a natural actor-critic algorithm that utilizes neural networks to represent the critic. Our aim is to establish sample complexity guarantees for this algorithm, achieving a deeper understanding of its performance characteristics. To achieve that, we propose a Natural Actor-Critic algorithm with 2-Layer critic parametrization (NAC2L). Our approach involves estimating the $Q$-function in each iteration through a convex optimization problem. We establish that our proposed approach attains a sample complexity of $\tilde{\mathcal{O}}\left(\frac{1}{\epsilon^{4}(1-\gamma)^{4}}\right)$. In contrast, the existing sample complexity results in the literature only hold for a tabular or linear MDP. Our result, on the other hand, holds for countable state spaces and does not require a linear or low-rank structure on the MDP.

1 Introduction

Motivation. The use of neural networks in Actor-critic (AC) algorithms is widespread in various machine learning applications, such as games (DBLP:journals/corr/abs-1708-04782; Haliem2021LearningMG), robotics (morgan2021model), autonomous driving (9351818), ride-sharing (zheng2022stochastic), networking (geng2020multi), and recommender systems (li2020video). AC algorithms sequentially update the estimate of the actor (policy) and the critic (value function) based on the data collected at each iteration, as described in (konda1999actor). An empirical and theoretical improvement of AC, known as Natural Actor-Critic (NAC), was proposed in (peters2008natural). NAC replaced the stochastic gradient step of the actor with the natural gradient step described in (kakade2001natural), based on the theory of (rattray1998natural). Despite its widespread use in state-of-the-art reinforcement learning (RL) implementations, there are no finite-time sample complexity results available in the literature for a setup where neural networks (NNs) are used to represent the critic. Such results estimate the amount of data required to achieve a specified accuracy, typically measured as the gap between the average reward of the policy returned by the algorithm being analyzed and that of the optimal policy.

Challenges. The primary challenge in obtaining sample complexity for a NAC is that the critic loss function with NN parametrization becomes non-convex, making finite-time sample guarantees difficult to obtain with standard techniques. Although recent work in (fu2020single) obtains upper bounds on the error in estimating the optimal action-value function in NN critic settings, these upper bounds are asymptotic and depend on the NN parameters. Similarly, other asymptotic sample complexity results, such as those in (kakade2001natural; konda1999actor; bhatnagar2010actor), require i.i.d. sampling of the state action pairs at each iteration of the algorithm. While sample complexity results for linear MDP structures have been derived in (wu2020finite; xu2020improving) (see Related Work), that structure allows for a closed-form critic update. Relaxing the linear MDP assumption requires a critic parametrization to learn the $Q$-function, and the resulting non-convexity of the critic objective limits the ability to guarantee globally optimal policies.

Approach. As the optimal critic update is not available in closed form for general MDPs, a parametrization of the critic network is used. To address the challenge of optimizing the critic parameters, we assume a two-layer neural network structure for the critic parametrization. Recent studies have shown that the parameter optimization for two-layer ReLU neural networks can be transformed into an equivalent convex program, which is computationally tractable and solvable exactly (pmlr-v119-pilanci20a). Convex formulations for convolutions and deeper models have also been investigated (sahiner2020vector; sahiner2020convex). In this work, we use these approaches to estimate the parameterized $Q$-function in the critic. In existing policy gradient analyses, the error caused by the lack of knowledge of the exact gradient term is assumed to be upper bounded by a constant (see Related Work for details), and the samples needed to control this term are not accounted for since no explicit critic estimation is performed. This paper accounts for the sample complexity of estimating the critic when deriving the overall sample complexity of the natural actor-critic algorithm. Resolving the optimality of the critic parameter update, this paper considers the following question:

Is it possible to obtain global convergence sample complexity results for the Natural Actor-Critic algorithm with a two-layer neural network parametrization of the critic?

Contributions. In this work, we answer the above question in the affirmative. We summarize our contributions as follows.

  • We propose a novel actor-critic algorithm, employing a general parametrization for the actor and two-layer ReLU neural network parametrization for the critic.

  • Building upon the insights presented in (agarwal2020optimality), we leverage the inherent smoothness property of the actor parametrization to derive an upper bound on the estimation error of the optimal value function. This upper bound is expressed in terms of the error incurred in attaining the compatible function approximation term, as elucidated in (DBLP:conf/nips/SuttonMSM99). Our analysis dissects this error into distinct components, each of which is individually bounded from above.

  • To address the estimation error arising from the critic function, we leverage the interesting findings of (wang2021hidden; pmlr-v119-pilanci20a), converting the problem of finding critic function parameters into a convex optimization problem. This allows us to establish a finite-time sample complexity bound, a significant improvement over previous results obtained for NAC algorithms. Notably, our approach eliminates the reliance on a linear function approximation for the critic, presenting the first finite-time sample complexity result for NAC without such an approximation.

2 Related Works

Natural Actor-Critic (NAC). Actor critic methods, first conceptualised in (sutton1988learning), aim to combine the benefits of policy gradient methods and $Q$-learning based methods. The policy gradient step in these methods is replaced by the Natural Policy Gradient proposed in (kakade2001natural) to obtain the so-called Natural Actor Critic in (peters2005natural). Sample complexity results for Actor Critic were first obtained for MDPs with finite states and actions in (williams1990mathematical), and more recently in (lan2023policy; zhang2020provably). Finite-time convergence for natural actor critic under a linear MDP assumption has been obtained in (chen2022finite; khodadadian2021finite; xu2020non), with the best known sample complexity of $\tilde{\mathcal{O}}\left(\frac{1}{{\epsilon}^{2}(1-\gamma)^{4}}\right)$ (xu2020improving). Finite-time sample complexity results are, however, not available for Natural Actor Critic setups for general MDPs where neural networks are used to represent the critic. (fu2020single) obtained asymptotic upper bounds on the estimation error for a natural actor critic algorithm where a neural network is used to represent the critic. The key related works are summarized in Table 1.

Sample Complexity of PG Algorithms. Policy Gradient (PG) methods were first introduced in (DBLP:conf/nips/SuttonMSM99), with Natural Policy Gradients following in (kakade2001natural). Sample complexity results were obtained for tabular setups in (even2009online) and then improved in (agarwal2020optimality). Under linear dynamics assumptions, convergence guarantees were obtained in (fazel2018global; dean2020sample; bhandari2019global). (zhang2020global) obtains sample complexity for convergence to second-order stationary point policies. In order to improve the performance of Policy Gradient methods, techniques such as REINFORCE (williams1992simple), function approximation (sutton1999policy), importance weights (papini2018stochastic), adaptive batch sizes (papini2017adaptive), and Adaptive Trust Region Policy Optimization (shani2020adaptive) have been used. We note that the Natural Policy Gradient approach has been shown to achieve a sample complexity of $\tilde{\mathcal{O}}\left(\frac{1}{\epsilon^{3}(1-\gamma)^{6}}\right)$ (liu2020improved). However, the error incurred due to the lack of knowledge of the exact gradient term is assumed there to be upper bounded by a constant (Assumption 4.4, liu2020improved). Additionally, each estimate of the critic requires on average $\left(\frac{1}{1-\gamma}\right)$ samples, and the obtained estimate is only known to be an unbiased estimate of the critic, with no error bounds provided. The explicit dependence of the constant in (Assumption 4.4, liu2020improved) on the number of samples is not considered in their sample complexity results. For the Natural Actor Critic analysis in this paper, we provide the explicit dependence of the error in estimating the gradient update on the number of samples and incorporate the samples required for the estimation in the sample complexity analysis. We describe this difference in detail in Appendix LABEL:Comp_npg.

Table 1: This table summarizes the sample complexities of different Natural Actor-Critic Algorithms. We note that we present the first finite time sample complexity results for general MDPs where both actor and critic are parametrized by a neural network.
References | Actor parametrization | Critic parametrization | MDP assumption | Sample complexity
(williams1990mathematical) | Tabular | Tabular | Finite | Asymptotic
(borkar1997actor) | Tabular | Tabular | Finite | Asymptotic
(xu2020non) | General | None (closed form) | Linear | $\tilde{\mathcal{O}}(\epsilon^{-4}(1-\gamma)^{-9})$
(khodadadian2021finite) | General | None (closed form) | Linear | $\tilde{\mathcal{O}}(\epsilon^{-3}(1-\gamma)^{-8})$
(xu2020improving) | General | None (closed form) | Linear | $\tilde{\mathcal{O}}(\epsilon^{-2}(1-\gamma)^{-4})$
(fu2020single) | Neural network | Neural network | None | Asymptotic
This work | General | Neural network (2-layer) | None | $\tilde{\mathcal{O}}(\epsilon^{-4}(1-\gamma)^{-4})$

3 Problem Setup

We consider a discounted Markov Decision Process (MDP) given by the tuple $\mathcal{M}:=(\mathcal{S},\mathcal{A},P,R,\gamma)$, where $\mathcal{S}$ is a bounded measurable state space and $\mathcal{A}$ is the finite set of actions. $P:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})$ is the probability transition kernel (for a measurable set $\mathcal{X}$, $\mathcal{P}(\mathcal{X})$ denotes the set of all probability measures over $\mathcal{X}$), $R:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{P}([0,R_{\max}])$ is the reward kernel on the state action space with $R_{\max}$ being the absolute value of the maximum reward, and $0<\gamma<1$ is the discount factor. A policy $\pi:\mathcal{S}\rightarrow\mathcal{P}(\mathcal{A})$ maps a state to a probability distribution over the action space. Here, we denote by $\mathcal{P}(\mathcal{S}),\mathcal{P}(\mathcal{A}),\mathcal{P}([a,b])$ the set of all probability distributions over the state space, the action space, and a closed interval $[a,b]$, respectively. With the above notation, we define the action value function for a given policy $\pi$ as

$Q^{\pi}(s,a)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\middle|\,s_{0}=s,a_{0}=a\right],$    (1)

where $r(s_{t},a_{t})\sim R(\cdot|s_{t},a_{t})$, $a_{t}\sim\pi(\cdot|s_{t})$ and $s_{t+1}\sim P(\cdot|s_{t},a_{t})$ for all $t\geq 0$. For a discounted MDP, we define the optimal action value function as $Q^{*}(s,a)=\sup_{\pi}Q^{\pi}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. A policy that achieves the optimal action value function is known as the optimal policy and is denoted as $\pi^{*}$. It holds that $\pi^{*}$ is the greedy policy with respect to $Q^{*}$ (10.5555/1512940). Hence finding $Q^{*}$ is sufficient to obtain the optimal policy. In a similar manner, we define the value function as $V^{\pi}(s)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}=s\right]$, and from the definition in (1), it holds that $V^{\pi}(s)=\mathbb{E}_{a\sim\pi}\left[Q^{\pi}(s,a)\right]$. Hence, we can define the optimal value function as $V^{*}(s)=\sup_{\pi}V^{\pi}(s)$ for all $s\in\mathcal{S}$. We define $\rho_{\pi}(s)$ as the stationary state distribution induced by the policy $\pi$ starting at state $s$, and $\zeta_{\pi}(s,a)$ as the corresponding stationary state action distribution, defined as $\zeta_{\pi}(s,a)=\rho_{\pi}(s)\cdot\pi(a|s)$. We overload notation to define $V^{\pi}(\rho)=\mathbb{E}_{s_{0}\sim\rho}[V^{\pi}(s_{0})]$, where $\rho$ is an initial state distribution. We define the visitation distribution as $d^{\pi}(s_{0})=(1-\gamma)\sum_{t=0}^{\infty}{\gamma}^{t}Pr^{\pi}(s_{t}=s|s_{0})$, where $Pr^{\pi}(s_{t}=s|s_{0})$ denotes the probability that the state at time $t$ is $s$ given a starting state of $s_{0}$. Hence, we can write $d^{\pi}_{\rho}(s)=\mathbb{E}_{s_{0}\sim\rho}[d^{\pi}(s_{0})]$.
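As an illustration of the definition in (1), the following minimal sketch estimates $Q^{\pi}(s,a)$ by averaging truncated discounted returns over rollouts; the interfaces env.step and policy are assumptions made for this sketch and are not part of the setup above.

import numpy as np

def monte_carlo_q(env, policy, s0, a0, gamma, horizon=200, num_rollouts=100, seed=0):
    # Estimate Q^pi(s0, a0) of Eq. (1) by averaging truncated discounted returns.
    # env.step(s, a) is assumed to return (next_state, sampled_reward); policy(s, rng)
    # is assumed to sample an action a ~ pi(.|s). Both are illustrative interfaces.
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(num_rollouts):
        s, a, ret, disc = s0, a0, 0.0, 1.0
        for _ in range(horizon):
            s, r = env.step(s, a)
            ret += disc * r
            disc *= gamma
            a = policy(s, rng)
        returns.append(ret)
    return float(np.mean(returns))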

We now describe our Natural Actor-Critic (NAC) algorithm. In a natural policy gradient algorithm (NIPS2001_4b86abe4), the policy is parameterized as $\{\pi_{\lambda},\lambda\in\Lambda\}$. We run $K$ total iterations of the algorithm. At iteration $k$, the policy parameters are updated using a natural gradient ascent step of the form

$\lambda_{k+1}=\lambda_{k}+{\eta}F^{\dagger}_{\rho}(\lambda_{k}){\nabla}_{\lambda_{k}}V^{\pi_{\lambda_{k}}}(\rho).$    (2)

From the policy gradient theorem (DBLP:conf/nips/SuttonMSM99) we have

${\nabla}_{\lambda_{k}}V^{\pi_{\lambda_{k}}}(\rho) = \mathbb{E}_{s,a}\left[\nabla\log\pi_{\lambda_{k}}(a|s)\,Q^{\pi_{\lambda_{k}}}(s,a)\right],$    (3)
$F_{\rho}(\lambda_{k}) = \mathbb{E}_{s,a}\left[{\nabla}\log{\pi_{\lambda_{k}}(a|s)}\left({\nabla}\log{\pi_{\lambda_{k}}(a|s)}\right)^{T}\right],$    (4)

where $s\sim d^{\pi_{\lambda_{k}}}_{\rho}$, $a\sim\pi_{\lambda_{k}}(\cdot|s)$, and $F^{\dagger}_{\rho}(\lambda_{k})$ denotes the Moore-Penrose pseudoinverse of the Fisher information matrix $F_{\rho}(\lambda_{k})$. From (DBLP:conf/nips/SuttonMSM99), the principle of compatible function approximation implies that we have

$F^{\dagger}_{\rho}(\lambda_{k}){\nabla}_{\lambda_{k}}V^{\pi_{\lambda_{k}}}(\rho) = \frac{1}{1-\gamma}w_{k}^{*},$    (5)
$w_{k}^{*} = \operatorname*{arg\,min}_{w}\mathbb{E}_{s,a}\left(A^{\pi_{\lambda_{k}}}(s,a)-w\,{\nabla}_{\lambda_{k}}\log\pi_{\lambda_{k}}(a|s)\right)^{2},$    (6)

where $s\sim d^{\pi_{\lambda_{k}}}_{\rho}$, $a\sim\pi_{\lambda_{k}}(\cdot|s)$, and $A^{\pi_{\lambda_{k}}}(s,a)=Q^{\pi_{\lambda_{k}}}(s,a)-V^{\pi_{\lambda_{k}}}(s)$ is the advantage function. For natural policy gradient algorithms such as those in (agarwal2020optimality) and (liu2020improved), an estimate of $Q^{\pi_{\lambda_{k}}}$ (and from that an estimate of $A^{\pi_{\lambda_{k}}}(s,a)$) is obtained through a sampling procedure that requires on average $\left(\frac{1}{1-\gamma}\right)$ samples for each estimate of $Q^{\pi_{\lambda_{k}}}$. For the natural actor critic setup, we instead maintain a parameterized estimate of the $Q$-function, which is updated at each step and used to approximate $Q^{\pi_{\lambda_{k}}}$.
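For intuition, the compatible function approximation problem in (6) is an ordinary least-squares problem in $w$. A minimal numpy sketch that forms the empirical Fisher matrix of (4) from sampled score vectors and advantage values is given below; the small ridge term added for invertibility is an assumption of this sketch, not part of the paper.

import numpy as np

def natural_gradient_weights(score_vecs, advantages, ridge=1e-6):
    # score_vecs: (N, d) array whose rows are grad_lambda log pi(a|s);
    # advantages: (N,) array of sampled advantage values.
    # Solves the least-squares problem of Eq. (6): w* = F^{-1} E[grad log pi * A].
    N = score_vecs.shape[0]
    fisher = score_vecs.T @ score_vecs / N            # empirical version of Eq. (4)
    target = score_vecs.T @ advantages / N
    return np.linalg.solve(fisher + ridge * np.eye(fisher.shape[0]), target)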

(xu2020improving) and (wu2020finite) assume that the $Q$ function can be represented as $Q^{\lambda_{k}}(s,a)=w_{t}^{T}\phi(s,a)$, where $\phi(s,a)$ is a known feature vector and $w_{t}$ is a parameter that is updated in what is known as the critic step. The policy gradient step is known as the actor update step. In case a neural network is used to represent the $Q$ function, at each iteration $k$ of the algorithm, an estimate of its parameters is obtained by solving an optimization problem of the form

$\operatorname*{arg\,min}_{\theta\in\Theta}\mathbb{E}_{s,a}\left({Q^{\pi_{\lambda_{k}}}-Q_{\theta}}\right)^{2}.$    (7)

Due to the non-linear activation functions of the neural network, the optimization in Equation (7) is non-convex, and hence finite-time sample guarantees for actor critic algorithms with a neural network representation of the $Q$ function have not been available. Thus, in order to perform a finite-time analysis, a linear MDP structure is typically assumed, which is restrictive and not practical for large state space problems. For a natural policy gradient setup, a Monte Carlo estimate of the $Q$ function is obtained instead, which requires additional samples and causes high variance in the critic estimate.

4 Proposed Approach

4.1 Convex Reformulation of 2 layer Neural Network

We represent the $Q$ function (critic) using a 2-layer ReLU neural network. A 2-layer ReLU neural network with input $x\in\mathbb{R}^{d}$ is defined as $f(x)=\sum_{i=1}^{m}\sigma^{\prime}(x^{T}u_{i})\alpha_{i}$, where $m\geq 1$ is the number of neurons in the neural network, the parameter space is $\Theta_{m}=\mathbb{R}^{d\times m}\times\mathbb{R}^{m}$, and $\theta=(U,\alpha)$ is an element of the parameter space, where $u_{i}$ is the $i^{th}$ column of $U$ and $\alpha_{i}$ is the $i^{th}$ coefficient of $\alpha$. The function $\sigma^{\prime}:\mathbb{R}\rightarrow\mathbb{R}_{\geq 0}$ is the ReLU (rectified linear unit) function defined as $\sigma^{\prime}(x)\triangleq\max(x,0)$. In order to obtain the parameter $\theta$ for a given set of data $X\in\mathbb{R}^{n\times d}$ and the corresponding response values $y\in\mathbb{R}^{n\times 1}$, we seek the parameter that minimizes the squared loss, given by

$\mathcal{L}(\theta)=\Bigg\|\sum_{i=1}^{m}\sigma(Xu_{i})\alpha_{i}-y\Bigg\|_{2}^{2}.$    (8)

In (8), the term $\sigma(Xu_{i})$ is the vector $\{\sigma^{\prime}((x_{j})^{T}u_{i})\}_{j\in\{1,\cdots,n\}}$, where $x_{j}$ is the $j^{th}$ row of $X$; that is, the ReLU function applied to each element of the vector $Xu_{i}$. We note that the optimization in Equation (8) is non-convex in $\theta$ due to the presence of the ReLU activation function. In (wang2021hidden), it is shown that this optimization problem has an equivalent convex form, provided that the number of neurons $m$ exceeds a certain threshold value. This convex problem is obtained by replacing the ReLU functions in the optimization problem with equivalent diagonal operators. The convex problem is given as

$\mathcal{L}^{\prime}_{\beta}(p):=\Bigg\|\sum_{D_{i}\in D_{X}}D_{i}(Xp_{i})-y\Bigg\|^{2}_{2},$    (9)

where $p\in\mathbb{R}^{d\times|D_{X}|}$ and $D_{X}$ is the set of diagonal matrices $D_{i}$ which depend on the data set $X$. Except for the case of $X$ being low rank, it is not computationally feasible to obtain the full set $D_{X}$. We instead use a subset $\tilde{D}\subseteq D_{X}$ to solve the convex problem in (9), where $p$ now lies in $\mathbb{R}^{d\times|\tilde{D}|}$. The relevant details of the formulation and the definition of the diagonal matrices $D_{i}$ are provided in Appendix A. For a set of parameters $\theta=(U,\alpha)\in\Theta$, we denote the neural network represented by these parameters as

$Q_{\theta}(s,a)=\sum_{i=1}^{m}\sigma^{\prime}((s,a)^{T}u_{i})\alpha_{i}.$    (10)
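For concreteness, a minimal sketch of the critic network in (10) and its squared loss (8) is given below; it assumes that $(s,a)$ has been encoded as a single feature vector of dimension $d$, which is an assumption of this sketch.

import numpy as np

def q_theta(sa, U, alpha):
    # Two-layer ReLU critic of Eq. (10): Q_theta(s, a) = sum_i sigma'((s, a)^T u_i) alpha_i.
    # sa is the concatenated (state, action) feature vector, U has shape (d, m), alpha has shape (m,).
    return np.maximum(sa @ U, 0.0) @ alpha

def squared_loss(U, alpha, X, y):
    # Non-convex objective of Eq. (8) over a batch X of shape (n, d) with responses y of shape (n,).
    preds = np.maximum(X @ U, 0.0) @ alpha
    return float(np.sum((preds - y) ** 2))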

4.2 Proposed Natural Actor Critic Algorithm with 2-Layer Critic Parametrization (NAC2L)

Algorithm 1 Natural Actor Critic with 2-Layer Critic Parametrization (NAC2L)

Input: $\mathcal{S}$, $\mathcal{A}$, $\gamma$, time horizon $K\in\mathcal{Z}$, updates per time step $J\in\mathcal{Z}$, starting state action sampling distribution $\nu$, number of convex optimization steps $T_{k,j}$ for $k\in\{1,\cdots,K\}$, $j\in\{1,\cdots,J\}$, actor SGD learning rate $\eta$

1:  Initialize: $\tilde{Q}(s,a)=0\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}$, ${Q}_{0}(s,a)=0\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}$, $\lambda_{0}=\{0\}^{d}$
2:  for $k\in\{1,\cdots,K\}$ do
3:     for $j\in\{1,\cdots,J\}$ do
4:        $X_{k}=\varnothing$
5:        Sample $n_{k,j}$ state action pairs, using $\nu$ as the starting state action distribution and then following policy $\pi_{\lambda_{k}}$.
6:        Set $y_{i}=r_{i}+\gamma\max_{a^{\prime}\in\mathcal{A}}{Q}_{k,j-1}(s_{i+1},a^{\prime})$, where $i\in\{1,\cdots,n_{k,j}\}$
7:        Set $X_{j},Y_{j}$ as the matrix of the sampled state action pairs and the vector of estimated $Q$ values, respectively
8:        $X_{k}=X_{k}\cup X_{j}$
9:        Call Algorithm 2 with input ($X=X_{j}$, $y=Y_{j}$, $T=T_{k,j}$) and return parameter $\theta$
10:        $Q_{k,j}=Q_{\theta}$
11:     end for
12:     for $i\in\{1,\cdots,J\cdot n_{k,j}\}$ do
13:        $A_{k,J}(s_{i},a_{i})=Q_{k,J}(s_{i},a_{i})-\sum_{a\in\mathcal{A}}\pi_{\lambda_{k}}(a|s_{i})Q_{k,J}(s_{i},a)$
14:        $w_{i+1}=w_{i}-2{\beta_{i}}\left(w_{i}\cdot{\nabla}_{\lambda}\log{\pi_{\lambda_{k}}}(a_{i}|s_{i})-A_{k,J}(s_{i},a_{i})\right){\nabla}_{\lambda}\log{\pi_{\lambda_{k}}}(a_{i}|s_{i})$
15:     end for
16:     $w_{k}=w_{J\cdot n_{k,j}}$
17:     Update $\lambda_{k+1}=\lambda_{k}+{\eta}\left(\frac{1}{1-\gamma}\right)w_{k}$
18:  end for
Output: $\pi_{\lambda_{K+1}}$
Algorithm 2 Neural Network Parameter Estimation
1:  Input: data $(X,y,T)$
2:  Sample: $\tilde{D}=\{\mathrm{diag}(\mathbf{1}(Xg_{i}>0)):g_{i}\sim\mathcal{N}(0,I),\,i\in[|\tilde{D}|]\}$
3:  Initialize $y^{1}=0,u^{1}=0$ and define $g(u)=\|\sum_{D_{i}\in\tilde{D}}D_{i}Xu_{i}-y\|^{2}_{2}$
4:  for $i\in\{0,\cdots,T\}$ do
5:     $u^{i+1}=y^{i}-\eta_{i}\nabla{g(y^{i})}$
6:     $y^{i+1}=\operatorname*{arg\,min}_{y:\|y\|_{1}\leq\frac{R_{\max}}{1-\gamma}}\|u^{i+1}-y\|^{2}_{2}$
7:  end for
8:  Set $u^{*}=u^{T+1}$
9:  Solve the cone decomposition: find $\bar{v},\bar{w}$ such that $u_{i}^{*}=\bar{v}_{i}-\bar{w}_{i}$ with $\bar{v}_{i},\bar{w}_{i}\in\mathcal{K}_{i}$ and at least one of $\bar{v}_{i},\bar{w}_{i}$ equal to zero, for $i\in[|\tilde{D}|]$
10:  Construct $\theta=\{u_{i},\alpha_{i}\}$ using the transformation
$\psi(\bar{v}_{i},\bar{w}_{i})=\begin{cases}(\bar{v}_{i},1),&\text{if }\bar{w}_{i}=0\\ (\bar{w}_{i},-1),&\text{if }\bar{v}_{i}=0\\ (0,0),&\text{if }\bar{v}_{i}=\bar{w}_{i}=0\end{cases}$    (14)
for all $i\in\{1,\cdots,m\}$
11:  Return $\theta$

We summarize the proposed approach in Algorithm 1. Algorithm 1 has an outer for loop with two inner for loops. At a fixed iteration $k$ of the outer for loop and iteration $j$ of the first inner for loop, we obtain a sequence of state action pairs and the corresponding next states and rewards by following the policy estimate available at the start of the iteration. In order to perform the critic update, the state action pairs and the corresponding target $Q$ values are stored in matrix form and passed to Algorithm 2 as the input and response values, respectively, to solve the following optimization problem.

$\operatorname*{arg\,min}_{\theta\in\Theta}\frac{1}{n_{k,j}}\sum_{i=1}^{n_{k,j}}\Bigg(Q_{\theta}(s_{i},a_{i})-r(s_{i},a_{i})-{\gamma}\,\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime})\Bigg)^{2},$    (15)

where $Q_{k,j-1}$ is the estimate of the $Q$ function at the $k^{th}$ iteration of the outer for loop and the $(j-1)^{th}$ iteration of the first inner for loop of Algorithm 1, $Q_{\theta}$ is a neural network defined as in (10), and $n_{k,j}$ is the number of state action pairs sampled at the $k^{th}$ iteration of the outer for loop and the $j^{th}$ iteration of the first inner for loop of Algorithm 1. This is done at each iteration of the first inner for loop, performing a fitted Q-iteration step to obtain the estimate of the critic.
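A minimal sketch of the target construction in (15) is given below; q_prev and pi_probs are assumed helper functions (not defined in the paper) that return, for a next state, the vector of $Q_{k,j-1}(s_{i+1},\cdot)$ values and the action probabilities under $\pi_{\lambda_{k}}$, respectively.

import numpy as np

def fqi_targets(rewards, next_states, pi_probs, q_prev, gamma):
    # Regression targets of Eq. (15): y_i = r(s_i, a_i) + gamma * E_{a'~pi}[Q_{k,j-1}(s_{i+1}, a')].
    return np.array([
        r + gamma * np.dot(pi_probs(s_next), q_prev(s_next))
        for r, s_next in zip(rewards, next_states)
    ])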

Algorithm 2 first samples a set of diagonal matrices, denoted by $\tilde{D}$, in line 2. The elements of $\tilde{D}$ act as the diagonal matrix replacements of the ReLU function. Algorithm 2 then solves an optimization of the form given in Equation (15) by converting it to an optimization of the form (34). This convex optimization is solved using projected gradient descent. After obtaining the optimum of this convex program, denoted by $u^{*}=\{u^{*}_{i}\}_{i\in\{1,\cdots,|\tilde{D}|\}}$, in line 10 we transform it into an estimate of the solution of the optimization given in (15), which is then passed back to Algorithm 1. The procedure is described in detail along with the relevant definitions in Appendix A.
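A minimal sketch of the core of Algorithm 2 is shown below, under the sampled convex objective of (34): activation patterns are sampled as in line 2, and plain gradient descent is run on the resulting convex loss (the $\ell_1$-ball projection of line 6 is omitted here for brevity, so this is a simplified illustration rather than the full procedure).

import numpy as np

def sample_patterns(X, num_patterns, rng):
    # Line 2 of Algorithm 2: D_i = diag(1(X g_i > 0)) with g_i ~ N(0, I);
    # column i of the returned matrix stores the diagonal of D_i.
    return (X @ rng.standard_normal((X.shape[1], num_patterns)) > 0).astype(float)

def solve_sampled_convex(X, y, num_patterns=8, steps=500, lr=1e-3, seed=0):
    # Gradient descent on the sampled convex objective || sum_i D_i X p_i - y ||_2^2 of Eq. (34).
    rng = np.random.default_rng(seed)
    masks = sample_patterns(X, num_patterns, rng)         # shape (n, num_patterns)
    P = np.zeros((X.shape[1], num_patterns))              # one vector p_i per sampled pattern
    for _ in range(steps):
        resid = ((X @ P) * masks).sum(axis=1) - y         # sum_i D_i X p_i - y
        P -= lr * 2.0 * (X.T @ (resid[:, None] * masks))  # gradient of the squared loss
    return P, masks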

The estimate of $w_{k}^{*}$ is obtained in the second inner for loop of Algorithm 1, where stochastic gradient descent is performed on a loss function of the form given in Equation (6) using the state action pairs sampled in the first inner for loop. Note that we do not have access to the true $Q$ function required for this update; thus we use the estimate of the $Q$ function obtained at the end of the first inner for loop. After obtaining our estimate of the minimizer of Equation (6), we update the policy parameter using the stochastic gradient update step. The state action pairs used here are the same ones sampled in the first inner for loop.
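The actor-side update of line 14 of Algorithm 1 is a streaming stochastic gradient step on the squared compatible-function-approximation loss; a minimal sketch over the stored samples is given below.

import numpy as np

def actor_weight_sgd(scores, advantages, step_sizes):
    # scores[i] = grad_lambda log pi_lambda_k(a_i|s_i), advantages[i] = A_{k,J}(s_i, a_i).
    # Implements w_{i+1} = w_i - 2 beta_i (w_i . grad log pi - A) grad log pi (line 14 of Algorithm 1).
    w = np.zeros(scores.shape[1])
    for i in range(len(scores)):
        g, A = scores[i], advantages[i]
        w = w - 2.0 * step_sizes[i] * (np.dot(w, g) - A) * g
    return w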

5 Global Convergence Analysis of NAC2L Algorithm

5.1 Assumptions

In this subsection, we formally describe the assumptions that will be used in the results.

Assumption 1

For any $\lambda_{1},\lambda_{2}\in\Lambda$ and $(s,a)\in(\mathcal{S}\times\mathcal{A})$ we have

$\|{\nabla}\log\pi_{\lambda_{1}}(a|s)-{\nabla}\log\pi_{\lambda_{2}}(a|s)\|_{2}\leq\beta\|\lambda_{1}-\lambda_{2}\|_{2},$    (16)

where $\beta>0$.

Such assumptions have been utilised in prior policy gradient based works such as (agarwal2020optimality; liu2020improved). This assumption is satisfied by the softmax policy parameterization

$\pi_{\lambda}(a|s)=\frac{\exp(f_{\lambda}(s,a))}{\sum_{a^{\prime}\in\mathcal{A}}\exp(f_{\lambda}(s,a^{\prime}))},$    (17)

where $f_{\lambda}(s,a)$ is a neural network with a smooth activation function (agarwal2020optimality).
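For example, the softmax parametrization of (17) and its score function $\nabla_{\lambda}\log\pi_{\lambda}(a|s)$ can be computed as in the following sketch, where f_values holds $f_{\lambda}(s,\cdot)$ and f_grads is assumed to hold the per-action gradients $\nabla_{\lambda}f_{\lambda}(s,a^{\prime})$ (both are assumed inputs for this illustration).

import numpy as np

def softmax_policy(f_values):
    # pi_lambda(a|s) proportional to exp(f_lambda(s, a)), cf. Eq. (17).
    z = np.exp(f_values - np.max(f_values))   # shift for numerical stability
    return z / z.sum()

def softmax_score(f_values, f_grads, a):
    # grad log pi(a|s) = grad f(s, a) - sum_{a'} pi(a'|s) grad f(s, a').
    p = softmax_policy(f_values)
    return f_grads[a] - (p[:, None] * f_grads).sum(axis=0)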

Assumption 2

Let $\theta^{*}\triangleq\arg\min_{\theta\in\Theta}\mathcal{L}(\theta)$, where $\mathcal{L}(\theta)$ is defined in (8), and denote by $Q_{\theta^{*}}(\cdot)$ the function $Q_{\theta}(\cdot)$ defined in (10) for $\theta=\theta^{*}$. Also, let $\theta_{|\tilde{D}|}^{*}\triangleq\arg\min_{\theta\in\Theta}\mathcal{L}_{|\tilde{D}|}(\theta)$, where $\mathcal{L}_{|\tilde{D}|}(\theta)$ is the loss function $\mathcal{L}(\theta)$ with the set of diagonal matrices $D_{X}$ replaced by $\tilde{D}\subseteq D_{X}$. Further, denote by $Q_{\theta_{|\tilde{D}|}^{*}}(\cdot)$ the function $Q_{\theta}(\cdot)$ defined in (10) for $\theta=\theta_{|\tilde{D}|}^{*}$. Then we assume

$\mathbb{E}_{s,a}\left(|Q_{\theta^{*}}-Q_{\theta_{|\tilde{D}|}^{*}}|\right)_{\nu}\leq\epsilon_{|\tilde{D}|},$    (18)

for any $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$.

Thus, $\epsilon_{|\tilde{D}|}$ is a measure of the error incurred due to taking a sample of diagonal matrices $\tilde{D}$ rather than the full set $D_{X}$. In practice, setting $|\tilde{D}|$ to be of the same order of magnitude as $d$ (the dimension of the data) gives a sufficient number of diagonal matrices to obtain a reformulation of the non-convex optimization problem that performs comparably to or better than existing gradient descent algorithms; therefore $\epsilon_{|\tilde{D}|}$ is included only for theoretical completeness and will be negligible in practice. This has been demonstrated empirically in (pmlr-v162-mishkin22a; pmlr-v162-bartan22a; pmlr-v162-sahiner22a). Refer to Appendix A for details of $D_{X}$, $\tilde{D}$ and $\mathcal{L}_{|\tilde{D}|}(\theta)$.

Assumption 3

We assume that for all functions $Q:\mathcal{S}\times\mathcal{A}\rightarrow\left[0,\frac{R_{\max}}{1-\gamma}\right]$, there exists a function $Q_{\theta}$ with $\theta\in\Theta$ such that

$\mathbb{E}_{s,a}\left((Q_{\theta}-Q)^{2}\right)_{\nu}\leq\epsilon_{approx},$    (19)

for any $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$.

$\epsilon_{approx}$ reflects the error that is incurred due to the inherent lack of expressiveness of the neural network function class. In the analysis of (pmlr-v120-yang20a), this error is assumed to be zero. Explicit upper bounds on $\epsilon_{approx}$ are given in terms of neural network parameters in works such as (yarotsky2017error).

Assumption 4

We assume that for all functions $Q:\mathcal{S}\times\mathcal{A}\rightarrow\left[0,\frac{R_{\max}}{1-\gamma}\right]$ and $\lambda\in\Lambda$, there exists $w^{*}\in\mathbb{R}^{|\mathcal{A}|}$ such that

$\mathbb{E}_{s,a}\left(|Q-w^{*}{\nabla}\log\pi_{\lambda}(a|s)|\right)_{\nu}\leq\epsilon_{bias},$    (20)

for any $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$.

Assumption 4 is similar to the transfer error assumption in works such as (agarwal2020flambe; liu2020improved). The key difference is that in the referenced works the assumption is stated for the true $Q$ function, and the estimate of $w^{*}$ is obtained using a noisy Monte Carlo estimate. In our case, we use a known parameterised $Q$ function to obtain our estimate of $w^{*}$.

Assumption 5

For any $\lambda\in\Lambda$, denote by $\pi_{\lambda}$ the policy corresponding to the parameter $\lambda$ and by $\mu_{\lambda}$ the corresponding stationary state action distribution of the induced Markov chain. We assume that there exist constants $m>0$ and $\rho\in(0,1)$ such that

$d_{TV}\left(\mathbb{P}\left((s_{\tau},a_{\tau})\in\cdot\,|\,(s_{0},a_{0})=(s,a)\right),\mu_{\lambda}(\cdot)\right)\leq m{\rho}^{\tau},\quad\forall\,\tau\geq 0,\ (s,a)\in\mathcal{S}\times\mathcal{A}.$    (21)

This assumption implies that the Markov chain mixes geometrically. Such assumptions are widely used both in the analysis of stochastic gradient descent with Markovian data, such as (9769873; sun2018markov), and in finite-time analyses of RL algorithms, such as (wu2020finite; NEURIPS2020_2e1b24a6).

Assumption 6

Let $\nu_{1}$ be a probability measure on $\mathcal{S}\times\mathcal{A}$ which is absolutely continuous with respect to the Lebesgue measure. Let $\{\pi_{t}\}$ be a sequence of policies and suppose that the state action pair has an initial distribution of $\nu_{1}$. Then we assume that for all $\nu_{1},\nu_{2}\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$ there exists a constant $\phi_{\nu_{1},\nu_{2}}<\infty$ such that

$\sup_{\pi_{1},\pi_{2},\cdots,\pi_{m}}\left\|\frac{d(P^{\pi_{1}}P^{\pi_{2}}\cdots{P}^{\pi_{m}}\nu_{2})}{d\nu_{1}}\right\|_{\infty}\leq\phi_{\nu_{1},\nu_{2}}$    (22)

for all $m\in\{0,1,\cdots\}$, where $\frac{d(P^{\pi_{1}}P^{\pi_{2}}\cdots{P}^{\pi_{m}}\nu_{2})}{d\nu_{1}}$ denotes the Radon-Nikodym derivative of the state action distribution $P^{\pi_{1}}P^{\pi_{2}}\cdots{P}^{\pi_{m}}\nu_{2}$ with respect to the distribution $\nu_{1}$.

This assumption puts an upper bound on how far the state action distribution obtained by sampling a state action pair from the distribution $\nu_{2}$ and then following any possible sequence of policies for the next $m$ steps can deviate from $\nu_{1}$, for any finite value of $m$. Similar assumptions have been made in (pmlr-v120-yang20a; JMLR:v17:10-364; farahmand2010error).

5.2 Main Result

Theorem 1

Suppose Assumptions 1-6 hold. Let Algorithm 1 run for $K$ iterations and let $J$ be the number of iterations of the first inner loop of Algorithm 1. Let $n_{k,j}$ denote the number of state-action pairs sampled and $T_{k,j}$ the number of iterations of Algorithm 2 at iteration $k$ of the outer for loop and iteration $j$ of the first inner for loop of Algorithm 1. Let $\alpha_{i}$ be the projected gradient descent step size at iteration $i$ of Algorithm 2 and $|\tilde{D}|$ the number of diagonal matrices sampled in Algorithm 2. Let $\beta_{i}$ be the step size of the gradient descent at iteration $i$ of the second inner for loop of Algorithm 1. Let $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$ be the starting state action distribution at each iteration $k$ of Algorithm 1. If we set $\alpha_{i}=\frac{\|u^{*}_{k,j}\|_{2}}{L_{k,j}\sqrt{i+1}}$, $\eta=\frac{1}{\sqrt{K}}$ and $\beta_{i}=\frac{\mu_{k}}{i+1}$, then we obtain

$\min_{k\leq K}(V^{*}(\nu)-V^{\pi_{\lambda_{k}}}(\nu)) \leq {\mathcal{O}}\left(\frac{1}{\sqrt{K}(1-\gamma)}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}\sum_{j=0}^{J-1}\mathcal{O}\left(\frac{\log\log(n_{k,j})}{\sqrt{n_{k,j}}}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}\sum_{j=0}^{J-1}{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}{\mathcal{O}}(\gamma^{J})+\frac{1}{1-\gamma}\left(\epsilon_{bias}+\sqrt{\epsilon_{approx}}+{\epsilon_{|\tilde{D}|}}\right),$    (23)

where $\|u^{*}_{k,j}\|_{2},L_{k,j},\mu_{k},\epsilon_{bias},\epsilon_{approx},{\epsilon_{|\tilde{D}|}}$ are constants.

Hence, for $K=\mathcal{O}(\epsilon^{-2}(1-\gamma)^{-2})$, $J=\mathcal{O}\left(\log\left(\frac{1}{\epsilon}\right)\right)$, $n_{k,j}=\tilde{\mathcal{O}}\left(\epsilon^{-2}(1-\gamma)^{-2}\right)$, and $T_{k,j}=\mathcal{O}(\epsilon^{-2}(1-\gamma)^{-2})$, we have

$\min_{k\leq K}(V^{*}(\nu)-V^{\pi_{\lambda_{k}}}(\nu))\leq\epsilon+\frac{1}{1-\gamma}\left(\epsilon_{bias}+\sqrt{\epsilon_{approx}}+{\epsilon_{|\tilde{D}|}}\right),$    (24)

which implies a sample complexity of $\sum_{k=1}^{K}\sum_{j=1}^{J}n_{k,j}=\tilde{\mathcal{O}}\left({\epsilon^{-4}(1-\gamma)^{-4}}\right)$.
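The count follows directly from the stated choices of $K$, $J$ and $n_{k,j}$:

$\sum_{k=1}^{K}\sum_{j=1}^{J}n_{k,j} = K\cdot J\cdot\tilde{\mathcal{O}}\left(\epsilon^{-2}(1-\gamma)^{-2}\right) = \mathcal{O}\left(\epsilon^{-2}(1-\gamma)^{-2}\right)\cdot\mathcal{O}\left(\log\frac{1}{\epsilon}\right)\cdot\tilde{\mathcal{O}}\left(\epsilon^{-2}(1-\gamma)^{-2}\right) = \tilde{\mathcal{O}}\left(\epsilon^{-4}(1-\gamma)^{-4}\right),$

since the logarithmic factor contributed by $J$ is absorbed into the $\tilde{\mathcal{O}}$ notation.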

6 Proof Sketch of Theorem 1

The detailed proof of Theorem 1 is given in Appendix E. The difference between our estimated value function, denoted by $V^{\pi_{\lambda_{k}}}$, and the optimal value function, denoted by $V^{*}$ (where $\pi_{\lambda_{k}}$ is the policy obtained at step $k$ of Algorithm 1), is first expressed as a function of the compatible function approximation error, which is then split into different components that are analysed separately. The proof thus proceeds in two stages. In the first stage, we show how the difference in value functions is upper bounded as a function of the errors incurred up to the final step $K$. In the second stage, we upper bound the different error components.

Upper Bounding Error in Separate Error Components: We use the smoothness property in Assumption 1 to obtain the following bound on the difference between the optimal value function and our estimated value function.

$\min_{k\in\{1,\cdots,K\}}V^{*}(\nu)-V^{\pi_{\lambda_{k}}}(\nu)\leq\frac{\log(|\mathcal{A}|)}{K{\eta}(1-\gamma)}+\frac{\eta{\beta}W^{2}}{2(1-\gamma)}+\frac{1}{K}\sum_{k=1}^{K}\frac{err_{k}}{1-\gamma},$    (25)

where

$err_{k}=\mathbb{E}_{s,a}\left(A^{\pi_{\lambda_{k}}}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s)\right),$    (26)

and $s\sim d^{\pi^{*}}_{\nu}$, $a\sim\pi^{*}(\cdot|s)$, and $W$ is a constant such that $\|w^{k}\|_{2}\leq W$ for all $k$, where $k$ denotes the iteration of the outer for loop of Algorithm 1. We split the term in (26) into the errors incurred due to the actor and critic steps as follows

$err_{k} = \mathbb{E}_{s,a}\left(A^{\pi_{\lambda_{k}}}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s)\right)$    (27)
$= \underbrace{\mathbb{E}_{s,a}\left(A^{\pi_{\lambda_{k}}}-A_{k,J}\right)}_{I}+\underbrace{\mathbb{E}_{s,a}\left(A_{k,J}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s)\right)}_{II}.$    (28)

Note that $I$ is the difference between the true advantage function $A^{\pi_{\lambda_{k}}}$ corresponding to the policy $\pi_{\lambda_{k}}$ and our estimate $A_{k,J}$. This estimation is carried out in the first inner for loop of Algorithm 1; thus $I$ is the error incurred in the critic step. $II$ is the error incurred in the estimation of the actor update, which comes from the stochastic gradient descent steps in the second inner for loop of Algorithm 1.

Also note that the expectation is with respect to the discounted state action distribution induced by the optimal policy $\pi^{*}$, whereas the state action samples we obtain come from the policy $\pi_{\lambda_{k}}$. Thus, using Assumption 6, we convert the expectation in Equation (28) to an expectation with respect to the stationary state action distribution induced by the policy $\pi_{\lambda_{k}}$.

Upper Bounding Error in Critic Step: For each iteration $k$ of Algorithm 1, we show that minimizing $I$ is equivalent to solving the following problem

$\operatorname*{arg\,min}_{\theta\in\Theta}\mathbb{E}_{s,a}\left({Q^{\pi_{\lambda_{k}}}-Q_{\theta}}\right)^{2}.$    (29)

We rely on the analysis laid out in (farahmand2010error) and, instead of iterating over value functions, apply a similar analysis to the action value function to obtain an upper bound for the error incurred in solving the problem in Equation (29) using fitted Q-iteration. We recreate the result for the value function from Lemma 2 of (munos2003error) for the action value function $Q$ to obtain

$\mathbb{E}_{s,a}(Q^{\pi_{\lambda_{k}}}-Q_{k,J}) \leq \sum_{j=1}^{J-1}\gamma^{J-j-1}(P^{\pi_{\lambda_{k}}})^{J-j-1}{\mathbb{E}}|\epsilon_{k,j}|+\gamma^{J}\left(\frac{R_{\max}}{1-\gamma}\right),$    (30)

where $\epsilon_{k,j}=T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q_{k,j}$ is the Bellman error incurred at iteration $j$, with $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$ and $P^{\pi_{\lambda_{k}}}$ defined as in Equations (51) and (53), respectively. $Q_{k,J}$ denotes the $Q$ estimate at the final iteration $J$ of the first inner for loop and iteration $k$ of the outer for loop of Algorithm 1.

The first term on the right hand side is called the algorithmic error, which depends on how well we approximate the Bellman backup at each iteration. The second term on the right hand side is called the statistical error, which is incurred because only a finite number $J$ of iterations of the FQI procedure are performed and depends only on the parameters of the MDP. Intuitively, the first error depends on how much data is collected at each iteration, how close our solution of the optimization step is to the true solution, and how well our function class can approximate $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$. Building upon this intuition, we split $\epsilon_{k,j}$ into four different components as follows.

$\epsilon_{k,j} = T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q_{k,j} = \underbrace{T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}}_{\epsilon^{1}_{k,j}}+\underbrace{Q^{1}_{k,j}-Q^{2}_{k,j}}_{\epsilon^{2}_{k,j}}+\underbrace{Q^{2}_{k,j}-Q^{3}_{k,j}}_{\epsilon^{3}_{k,j}}+\underbrace{Q^{3}_{k,j}-Q_{k,j}}_{\epsilon^{4}_{k,j}} = {\epsilon^{1}_{k,j}}+{\epsilon^{2}_{k,j}}+{\epsilon^{3}_{k,j}}+{\epsilon^{4}_{k,j}},$    (31)

where $Q^{1}_{k,j}$, $Q^{2}_{k,j}$ and $Q^{3}_{k,j}$ are defined in Appendix B. We use Lemmas 14, 15, 16, and 17 to bound the error terms in Equation (31).

Upper Bounding Error in Actor Step: Note that we require the minimization of the term $\mathbb{E}_{s,a}(A_{k,J}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s))$, where the expectation is with respect to the stationary state action distribution corresponding to $\pi_{\lambda_{k}}$. However, we do not have samples of state action pairs from this stationary distribution; we only have samples from the Markov chain induced by the policy $\pi_{\lambda_{k}}$. We thus use the theory in (9769873) and Assumption 5 to upper bound the error incurred.

7 Conclusions

In this paper, we study a Natural Actor Critic algorithm with a neural network used to represent both the actor and the critic and find sample complexity guarantees for the algorithm. Using the conversion of the optimization of a 2-layer ReLU neural network into a convex problem for estimating the critic, we show that our approach achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-4}(1-\gamma)^{-4})$. This demonstrates the first approach for achieving sample complexity guarantees beyond linear MDP assumptions for the critic.

Limitations: Relaxing the various stated assumptions is an interesting direction for future work. Further, the results assume a 2-layer neural network parametrization for the critic. One could likely use the framework described in (belilovsky2019greedy) to extend the results to a multi-layer setup.

A Convex Reformulation with Two-Layer Neural Networks

For representing the action value function, we use a 2-layer ReLU neural network. In this section, we first lay out the theory behind the convex formulation of the 2-layer ReLU neural network. In the next section, it will be shown how it is utilised for the FQI algorithm.

In order to obtain the parameter $\theta$ for a given set of data $X\in\mathbb{R}^{n\times d}$ and the corresponding response values $y\in\mathbb{R}^{n\times 1}$, we seek the parameter that minimizes the squared loss (with a regularization parameter $\beta\in[0,1]$), given by

$\mathcal{L}(\theta) = \Bigg\|\sum_{i=1}^{m}\sigma(Xu_{i})\alpha_{i}-y\Bigg\|_{2}^{2}.$    (32)

Here, the term $\sigma(Xu_{i})$ is the vector $\{\sigma^{\prime}((x_{j})^{T}u_{i})\}_{j\in\{1,\cdots,n\}}$, where $x_{j}$ is the $j^{th}$ row of $X$; that is, the ReLU function applied to each element of the vector $Xu_{i}$. We note that the optimization in Equation (8) is non-convex in $\theta$ due to the presence of the ReLU activation function. In (wang2021hidden), it is shown that this optimization problem has an equivalent convex form, provided that the number of neurons $m$ exceeds a certain threshold value. This convex problem is obtained by replacing the ReLU functions in the optimization problem with equivalent diagonal operators. The convex problem is given as

$\mathcal{L}^{\prime}_{\beta}(p) := \Bigg\|\sum_{D_{i}\in D_{X}}D_{i}(Xp_{i})-y\Bigg\|^{2}_{2},$    (33)

where $p\in\mathbb{R}^{d\times|D_{X}|}$ and $D_{X}$ is the set of diagonal matrices $D_{i}$ which depend on the data set $X$. Except for the case of $X$ being low rank, it is not computationally feasible to obtain the set $D_{X}$. We instead use a subset $\tilde{D}\subseteq D_{X}$ to solve the convex problem

$\mathcal{L}^{\prime}_{\beta}(p) := \Bigg\|\sum_{D_{i}\in\tilde{D}}D_{i}(Xp_{i})-y\Bigg\|^{2}_{2},$    (34)

where $p\in\mathbb{R}^{d\times|\tilde{D}|}$. In order to understand the convex reformulation of the squared loss optimization problem, consider the vector $\sigma(Xu_{i})$:

$\sigma(Xu_{i})=\begin{bmatrix}\sigma^{\prime}((x_{1})^{T}u_{i})\\ \sigma^{\prime}((x_{2})^{T}u_{i})\\ \vdots\\ \sigma^{\prime}((x_{n})^{T}u_{i})\end{bmatrix}.$    (35)

Now, for a fixed $X\in\mathbb{R}^{n\times d}$, different $u_{i}\in\mathbb{R}^{d\times 1}$ will make different components of $\sigma(Xu_{i})$ non-zero. For example, if we take the set of all $u_{i}$ such that only the first element of $\sigma(Xu_{i})$ is non-zero (i.e., only $(x_{1})^{T}u_{i}\geq 0$ and $(x_{j})^{T}u_{i}<0$ for all $j\in\{2,\cdots,n\}$) and denote it by $\mathcal{K}_{1}$, then we have

$\sigma(Xu_{i})=D_{1}(Xu_{i})\quad\forall u_{i}\in\mathcal{K}_{1},$    (36)

where $D_{1}$ is the $n\times n$ diagonal matrix with only the first diagonal element equal to $1$ and the rest $0$. Similarly, there exist sets of $u$'s which result in $\sigma(Xu)$ having certain components non-zero and the rest zero. For each such combination of zero and non-zero components, we have a corresponding set of $u_{i}$'s and a corresponding $n\times n$ diagonal matrix $D_{i}$. We define the set of such diagonal matrices for a given matrix $X$ as

$D_{X}=\{D=\mathrm{diag}(\mathbf{1}(Xu\geq 0)):u\in\mathbb{R}^{d},\,D\in\mathbb{R}^{n\times n}\},$    (37)

where $\mathrm{diag}(\mathbf{1}(Xu\geq 0))$ represents a matrix given by

$D_{k,j}=\begin{cases}\mathbf{1}(x_{j}^{T}u),&\text{for }k=j\\ 0,&\text{for }k\neq j,\end{cases}$    (38)

where $\mathbf{1}(x)=1$ if $x>0$ and $\mathbf{1}(x)=0$ if $x\leq 0$. Corresponding to each such matrix $D_{i}$, there exists a set of $u$'s given by

$\mathcal{K}_{i}=\{u\in\mathbb{R}^{d}:\sigma(Xu)=D_{i}Xu,\ D_{i}\in D_{X}\},$    (39)

which can equivalently be written as $\mathcal{K}_{i}=\{u\in\mathbb{R}^{d}:(2D_{i}-I)Xu\geq 0\}$, where $I$ is the $n\times n$ identity matrix. The number of these matrices $D_{i}$ is upper bounded by $2^{n}$. From (wang2021hidden), the upper bound is $\mathcal{O}\left(r\left(\frac{n}{r}\right)^{r}\right)$, where $r=\mathrm{rank}(X)$. Also, note that the sets $\mathcal{K}_{i}$ form a partition of the space $\mathbb{R}^{d\times 1}$. Using these definitions, we define the convex problem equivalent to the one in Equation (8) as
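A minimal sketch of how the set $D_{X}$ of Equation (37) can be explored empirically is given below: random directions $u$ are sampled and the distinct sign patterns $\mathbf{1}(Xu\geq 0)$ are collected, giving a sampled (lower-bound) view of $D_{X}$; for low-rank $X$ the number of distinct patterns stays far below the crude $2^{n}$ bound.

import numpy as np

def sampled_patterns(X, num_samples=10000, seed=0):
    # Collect distinct activation patterns diag(1(Xu >= 0)) over random directions u,
    # an empirical exploration of the set D_X in Eq. (37).
    rng = np.random.default_rng(seed)
    patterns = set()
    for _ in range(num_samples):
        u = rng.standard_normal(X.shape[1])
        patterns.add(tuple((X @ u >= 0).astype(int)))
    return patterns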

$\mathcal{L}_{\beta}(v,w):=\Bigg\|\sum_{D_{i}\in D_{X}}D_{i}(X(v_{i}-w_{i}))-y\Bigg\|^{2}_{2},$    (40)

where $v=\{v_{i}\}_{i\in\{1,\cdots,|D_{X}|\}}$, $w=\{w_{i}\}_{i\in\{1,\cdots,|D_{X}|\}}$, and $v_{i},w_{i}\in\mathcal{K}_{i}$; note that, by definition, for any fixed $i\in\{1,\cdots,|D_{X}|\}$ at least one of $v_{i}$ or $w_{i}$ is zero. If $v^{*},w^{*}$ are the minimizers of Equation (40), the number of neurons $m$ of the original problem in Equation (8) should be greater than the number of indices $i$ for which at least one of $v_{i}^{*}$ or $w_{i}^{*}$ is non-zero. We denote this value as $m^{*}_{X,y}$, with the subscripts denoting that this quantity depends upon the data matrix $X$ and response $y$.

We convert $v^{*},w^{*}$ into optimal values of Equation (8), denoted by $\theta^{*}=(U^{*},\alpha^{*})$, using a function $\psi:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}\times\mathbb{R}$ defined as follows

$\psi(v_{i},w_{i})=\begin{cases}(v_{i},1),&\text{if }w_{i}=0\\ (w_{i},-1),&\text{if }v_{i}=0\\ (0,0),&\text{if }v_{i}=w_{i}=0\end{cases}$    (44)

where, according to (pmlr-v119-pilanci20a), we have $(u_{i}^{*},\alpha_{i}^{*})=\psi(v_{i}^{*},w_{i}^{*})$ for all $i\in\{1,\cdots,|{D}_{X}|\}$, with $u^{*}_{i},\alpha^{*}_{i}$ being the elements of $\theta^{*}$. Note that restricting $\alpha_{i}$ to $\{1,-1,0\}$ is shown to be valid in (pmlr-v162-mishkin22a). For $i\in\{|{D}_{X}|+1,\cdots,m\}$ we set $(u_{i}^{*},\alpha_{i}^{*})=(0,0)$.
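A minimal sketch of the transformation in (44), mapping each cone pair back to a neuron of the network in (10), is given below.

import numpy as np

def psi(v_i, w_i):
    # Eq. (44): at most one of (v_i, w_i) is non-zero by construction of Eq. (40).
    if np.any(w_i != 0):
        return w_i, -1.0
    if np.any(v_i != 0):
        return v_i, 1.0
    return np.zeros_like(v_i), 0.0

Applying psi to each optimal pair $(v_{i}^{*},w_{i}^{*})$ and setting the remaining neurons to zero yields the parameters $\theta^{*}=(U^{*},\alpha^{*})$ described above.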

Since $D_{X}$ is hard to obtain computationally unless $X$ is of low rank, we can construct a subset $\tilde{D}\subseteq D_{X}$ and perform the optimization in Equation (40) by replacing $D_{X}$ with $\tilde{D}$ to get

$\mathcal{L}_{\beta}(v,w):=\Bigg\|\sum_{D_{i}\in\tilde{D}}D_{i}(X(v_{i}-w_{i}))-y\Bigg\|^{2}_{2},$    (45)

where $v=\{v_{i}\}_{i\in\{1,\cdots,|\tilde{D}|\}}$, $w=\{w_{i}\}_{i\in\{1,\cdots,|\tilde{D}|\}}$, and $v_{i},w_{i}\in\mathcal{K}_{i}$; by definition, for any fixed $i\in\{1,\cdots,|\tilde{D}|\}$ at least one of $v_{i}$ or $w_{i}$ is zero.

The required condition for $\tilde{D}$ to be a sufficient replacement for $D_{X}$ is as follows. Suppose $(\bar{v}_{i},\bar{w}_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}$ denote the optimal solutions of Equation (45). Then we require

$m\geq\sum_{D_{i}\in\tilde{D}}|\{\bar{v}_{i}:\bar{v}_{i}\neq 0\}\cup\{\bar{w}_{i}:\bar{w}_{i}\neq 0\}|.$    (46)

In other words, the number of neurons in the neural network must be greater than the number of indices $i$ for which at least one of $\bar{v}_{i}$ or $\bar{w}_{i}$ is non-zero. Further,

$\mathrm{diag}(\mathbf{1}(Xu_{i}^{*}\geq 0))\in\tilde{D}\quad\forall\,i\in[m].$    (47)

In other words, the diagonal matrices induced by the optimal $u_{i}^{*}$'s of Equation (8) must be included in our sample of diagonal matrices. This is proved in Theorem 2.1 of (pmlr-v162-mishkin22a).

A computationally efficient method for obtaining $\tilde{D}$ and the optimal values of Equation (8) is laid out in (pmlr-v162-mishkin22a). In this method, we obtain our sample of diagonal matrices $\tilde{D}$ by sampling a fixed number of vectors from a $d$-dimensional standard multivariate Gaussian distribution, multiplying the vectors with the data matrix $X$, and then forming the diagonal matrices based on which co-ordinates are positive. Then we solve an optimization similar to the one in Equation (40), without the constraints that its parameters belong to sets of the form $\mathcal{K}_{i}$, as follows.

$\mathcal{L}^{\prime}_{\beta}(p):=\Bigg\|\sum_{D_{i}\in\tilde{D}}D_{i}(Xp_{i})-y\Bigg\|^{2}_{2},$    (48)

where $p\in\mathbb{R}^{d\times|\tilde{D}|}$. In order to satisfy the constraints of the form given in Equation (40), this step is followed by a cone decomposition step. This is implemented through functions $\{\psi_{i}^{\prime}\}_{i\in\{1,\cdots,|\tilde{D}|\}}$. Let $p^{*}=\{p^{*}_{i}\}_{i\in\{1,\cdots,|\tilde{D}|\}}$ be the optimal solution of Equation (48). For each $i$ we define a function $\psi_{i}^{\prime}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}\times\mathbb{R}^{d}$ as

$\psi_{i}^{\prime}(p_{i}) = (v_{i},w_{i})\ \text{ such that }\ p_{i}={v}_{i}-{w}_{i},\ \text{ with }{v}_{i},{w}_{i}\in\mathcal{K}_{i}.$    (49)

Then we obtain $\psi_{i}^{\prime}(p^{*}_{i})=(\bar{v}_{i},\bar{w}_{i})$. As before, at least one of $\bar{v}_{i},\bar{w}_{i}$ is $0$. Note that in practice we do not know whether the conditions in Equations (46) and (47) are satisfied for a given sampled $\tilde{D}$. We express this as follows. If $\tilde{D}$ were the full set of diagonal matrices, then we would have $(\bar{v}_{i},\bar{w}_{i})=({v}^{*}_{i},{w}^{*}_{i})$ and $\psi(\bar{v}_{i},\bar{w}_{i})=(u_{i}^{*},\alpha_{i}^{*})$ for all $i\in\{1,\cdots,|D_{X}|\}$. However, since that is not the case and $\tilde{D}\subseteq D_{X}$, $\{\psi(\bar{v}_{i},\bar{w}_{i})\}_{i\in\{1,\cdots,|\tilde{D}|\}}$ is instead an optimal solution of a non-convex optimization problem different from the one in Equation (8). We denote the corresponding loss by $\mathcal{L}_{|\tilde{D}|}(\theta)$, defined as

$\mathcal{L}_{|\tilde{D}|}(\theta)=\Bigg\|\sum_{i=1}^{m^{\prime}}\sigma(Xu_{i})\alpha_{i}-y\Bigg\|_{2}^{2},$    (50)

where $m^{\prime}=|\tilde{D}|$ is the size of the sampled set of diagonal matrices. In order to quantify the error incurred due to taking a subset of $D_{X}$, we assume that the expectation of the absolute value of the difference between the neural networks corresponding to the optimal solutions of the non-convex optimizations given in Equations (50) and (8) is upper bounded by a constant depending on the size of $\tilde{D}$. The formal assumption and its justification are given in Assumption 2.

B Error Characterization

Before we define the errors incurred during the actor and critic steps, we define some additional terms as follows

We define the Bellman operator for a policy $\pi$ as follows

$(T^{\pi}Q)(s,a)=r^{\prime}(s,a)+\gamma\int Q(s^{\prime},\pi(s^{\prime}))P(ds^{\prime}|s,a),$    (51)

where $r^{\prime}(s,a)=\mathbb{E}[r(s,a)|(s,a)]$. Similarly, we define the Bellman optimality operator as

$(TQ)(s,a)=r^{\prime}(s,a)+\max_{a^{\prime}\in\mathcal{A}}\gamma\int Q(s^{\prime},a^{\prime})P(ds^{\prime}|s,a).$    (52)

Further, the operator $P^{\pi}$ is defined as

$P^{\pi}Q(s,a)=\mathbb{E}[Q(s^{\prime},a^{\prime})\,|\,s^{\prime}\sim P(\cdot|s,a),a^{\prime}\sim\pi(\cdot|s^{\prime})],$    (53)

which is the one-step Markov transition operator for policy $\pi$ for the Markov chain defined on $\mathcal{S}\times\mathcal{A}$ with the transition dynamics given by $S_{t+1}\sim P(\cdot|S_{t},A_{t})$ and $A_{t+1}\sim\pi(\cdot|S_{t+1})$. It defines a distribution on the state action space after one transition from the initial state. Similarly, $P^{\pi_{1}}P^{\pi_{2}}\cdots{P}^{\pi_{m}}$ is the $m$-step Markov transition operator following policy $\pi_{t}$ at step $t$ for $1\leq t\leq m$. It defines a distribution on the state action space after $m$ transitions from the initial state. We have the relation

$(T^{\pi}Q)(s,a) = r^{\prime}+\gamma\int Q(s^{\prime},\pi(s^{\prime}))P(ds^{\prime}|s,a)$    (54)
$= r^{\prime}+{\gamma}(P^{\pi}Q)(s,a).$    (55)

We thus define $P^{*}$ as

$P^{*}Q(s,a)=\max_{a^{\prime}\in\mathcal{A}}\mathbb{E}[Q(s^{\prime},a^{\prime})\,|\,s^{\prime}\sim P(\cdot|s,a)],$    (56)

in other words, $P^{*}$ is the one-step Markov transition operator with respect to the greedy policy of the function on which it is acting, which implies that

$(TQ)(s,a) = r^{\prime}+{\gamma}(P^{*}Q)(s,a).$    (57)
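For a finite MDP, the operators in (51)-(57) reduce to simple array computations; the following sketch, which assumes tabular arrays for $P$, $r^{\prime}$, $\pi$ and $Q$, illustrates the relations $T^{\pi}Q=r^{\prime}+\gamma P^{\pi}Q$ and $TQ=r^{\prime}+\gamma P^{*}Q$.

import numpy as np

def bellman_operators(P, r, pi, Q, gamma):
    # Assumed tabular shapes: P is (S, A, S) with P[s, a, s'], r is (S, A) expected rewards,
    # pi is (S, A) action probabilities, Q is (S, A).
    v_pi = (pi * Q).sum(axis=1)                            # E_{a'~pi}[Q(s', a')] for each s'
    P_pi_Q = P @ v_pi                                      # (P^pi Q)(s, a), Eq. (53)
    T_pi_Q = r + gamma * P_pi_Q                            # (T^pi Q)(s, a), Eq. (55)
    P_star_Q = np.einsum('xay,yb->xab', P, Q).max(axis=2)  # (P^* Q)(s, a), Eq. (56)
    T_Q = r + gamma * P_star_Q                             # (T Q)(s, a), Eq. (57)
    return T_pi_Q, T_Q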

For any measurable function $f:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, we also define

$\mathbb{E}(f)_{\nu}=\int_{\mathcal{S}\times\mathcal{A}}f\,d\nu,$    (58)

for any distribution $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$.

We now characterize the errors which are incurred in the actor and critic steps. We define $\zeta^{\nu}_{\pi}(s,a)$ as the stationary state action distribution induced by the policy $\pi$ with the starting state action pair drawn from a distribution $\nu\in\mathcal{P}(\mathcal{S}\times\mathcal{A})$. For the error incurred in the actor update, we define the related loss function as follows.

Definition 1

For iteration $k$ of the outer for loop of Algorithm 1, we define $w_{k}$ as the estimate of the minimizer of the loss function $\mathbb{E}_{(s,a)\sim\zeta^{\nu}_{\pi}(s,a)}\left(A_{k,J}(s,a)-w^{\top}{\nabla}\log\pi_{\lambda_{k}}(a|s)\right)^{2}$ obtained at the end of the second inner for loop of Algorithm 1. We further define the true minimizer as

w^{*}_{k} = \operatorname*{arg\,min}_{w}\mathbb{E}_{(s,a)\sim\zeta^{\nu}_{\pi}(s,a)}\left(A_{k,J}(s,a)-w^{\top}{\nabla}\log\pi_{\lambda_{k}}(a|s)\right)^{2}. (59)

For finding the estimate $w_{k}$, we re-use the state action pairs sampled in the first inner for loop of Algorithm 1. Note that we have to minimize a loss function whose expectation is with respect to the stationary distribution $\zeta^{\nu}_{\pi}(s,a)$, while our samples come from a Markov chain whose stationary distribution is $\zeta^{\nu}_{\pi}(s,a)$. For the error incurred in the critic update, we first define the various possible $Q$-functions which we can approximate, in decreasing order of accuracy.
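The minimization in Equation (59) is a least-squares regression of advantage estimates on the policy score features. A minimal sketch follows (this is not the paper's Algorithm 1; the score-feature matrix Phi and the advantage estimates A_hat are assumed to have been computed elsewhere from the sampled state-action pairs):

```python
import numpy as np

def natural_gradient_direction(Phi, A_hat, ridge=1e-6):
    """Solve min_w sum_i (A_hat[i] - w^T Phi[i])^2.

    Phi   : (N, d) rows are grad_lambda log pi_{lambda_k}(a_i | s_i)
    A_hat : (N,)   advantage estimates A_{k,J}(s_i, a_i)
    A small ridge term is added only for numerical stability of the solve.
    """
    d = Phi.shape[1]
    G = Phi.T @ Phi + ridge * np.eye(d)   # empirical Fisher-like matrix
    b = Phi.T @ A_hat
    return np.linalg.solve(G, b)

# Toy usage:
# rng = np.random.default_rng(1)
# Phi = rng.standard_normal((256, 10)); A_hat = Phi @ rng.standard_normal(10)
# w_k = natural_gradient_direction(Phi, A_hat)
```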

For the error components incurred during critic estimation, we start by defining the best approximation of the function $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$ obtainable from the class of two-layer ReLU neural networks, measured in expected squared error from the ground truth $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$.

Definition 2

For iteration kk of the outer for loop and iteration jj of the first inner for loop of Algorithm 1, we define

Qk,j1=argminQθ,θΘ𝔼(Qθ(s,a)TπλkQk,j1(s,a))2,Q^{1}_{k,j}=\operatorname*{arg\,min}_{Q_{\theta},\theta\in\Theta}\mathbb{E}(Q_{\theta}(s,a)-T^{\pi_{\lambda_{k}}}Q_{k,j-1}(s,a))^{2}, (60)

where (s,a)ζπν(s,a)(s,a)\sim\zeta^{\nu}_{\pi}(s,a).

Note that we do not have access to the transition probability kernel PP, hence we cannot calculate TπλkT^{\pi_{\lambda_{k}}}. To alleviate this, we use the observed next states to estimate the QQ-value function. Using this, we define Qk,j2Q^{2}_{k,j} as,

Definition 3

For iteration kk of the outer for loop and iteration jj of the first inner for loop of Algorithm 1, we define

Q^{2}_{k,j}=\operatorname*{arg\,min}_{Q_{\theta},\theta\in\Theta}\mathbb{E}\Big{(}Q_{\theta}(s,a)-\big{(}r^{\prime}(s,a)+\gamma{\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}}Q_{k,j-1}(s^{\prime},a^{\prime})\big{)}\Big{)}^{2}, (61)

where $(s,a)\sim\zeta^{\nu}_{\pi}(s,a)$, $s^{\prime}\sim P(\cdot|s,a)$, and $r(s,a)\sim R(\cdot|s,a)$.

Compared to $Q^{1}_{k,j}$, in $Q^{2}_{k,j}$ we are minimizing the expected squared loss from the target function $\big{(}r^{\prime}(s,a)+\gamma{\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}}Q_{k,j-1}(s^{\prime},a^{\prime})\big{)}$.

To obtain $Q^{2}_{k,j}$, we still need to compute the true expected value in Equation (61); however, we do not know the transition kernel $P$. To remove this limitation, we use sampling. Consider a set $\mathcal{X}_{k,j}$ of state-action pairs sampled such that $(s,a)\sim\zeta^{\nu}_{\pi}(s,a)$. We now define $Q^{3}_{k,j}$ as,

Definition 4

For a given set of state action pairs 𝒳k,j\mathcal{X}_{k,j} we define

Qk,j3=argminQθ,θΘ1|𝒳k,j|(si,ai)𝒳k,j(Qθ(si,ai)(r(si,ai)+γ𝔼aπkQk,j1(si+1,a)))2,\displaystyle Q^{3}_{k,j}=\operatorname*{arg\,min}_{Q_{\theta},\theta\in\Theta}\frac{1}{|\mathcal{X}_{k,j}|}\sum_{(s_{i},a_{i})\in\mathcal{X}_{k,j}}\Big{(}Q_{\theta}(s_{i},a_{i})-\big{(}r(s_{i},a_{i})+\gamma{\mathbb{E}_{a^{\prime}\sim\pi_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime})\big{)}\Big{)}^{2}, (62)

where r(si,ai)r(s_{i},a_{i}), and si+1s_{i+1} are the observed reward and the observed next state for state action pair si,ais_{i},a_{i} respectively.

$Q^{3}_{k,j}$ is the best approximation of the $Q$-value function that minimizes the sample average of the squared losses with target values $\big{(}r(s_{i},a_{i})+\gamma{\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}}Q_{k,j-1}(s_{i+1},a^{\prime})\big{)}$, i.e., the empirical loss function. After defining the possible solutions for the $Q$-values using different loss functions, we define the errors.
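The empirical objective in Equation (62) can be sketched as follows. This is illustrative only: the names transitions, Q_prev, and pi_probs are assumptions standing in for the data gathered under $\pi_{\lambda_{k}}$, the previous critic estimate $Q_{k,j-1}$, and the policy probabilities, respectively.

```python
def empirical_critic_loss(Q_theta, Q_prev, pi_probs, transitions, gamma, n_actions):
    """Sample-average squared loss of Eq. (62).

    Q_theta(s, a), Q_prev(s, a) : callables returning scalar Q-values
    pi_probs(s)                 : callable returning the action probabilities of pi_{lambda_k}
    transitions                 : list of (s, a, reward, s_next) tuples
    """
    total = 0.0
    for s, a, r, s_next in transitions:
        # target: r(s_i, a_i) + gamma * E_{a' ~ pi_{lambda_k}}[ Q_{k,j-1}(s_{i+1}, a') ]
        probs = pi_probs(s_next)
        target = r + gamma * sum(probs[ap] * Q_prev(s_next, ap) for ap in range(n_actions))
        total += (Q_theta(s, a) - target) ** 2
    return total / len(transitions)
```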

We first define the approximation error, which represents the difference between $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$ and its best approximation from the class of two-layer ReLU neural networks. We have

Definition 5 (Approximation Error)

For a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, we define $\epsilon^{1}_{k,j}=T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}$, where $Q^{1}_{k,j}$ is as defined in Definition 2 and $Q_{k,j-1}$ is the estimate of the $Q$-function obtained at iteration $j-1$ of the first inner for loop of Algorithm 1.

We also define the estimation error, which denotes the error between the best approximation of $T^{\pi_{\lambda_{k}}}Q_{k,j-1}$ achievable by a two-layer ReLU neural network, $Q^{1}_{k,j}$, and $Q^{2}_{k,j}$. We demonstrate that these two quantities coincide, so this error is zero.

Definition 6 (Estimation Error)

For a given iteration kk of Algorithm 1 and iteration jj of the first for loop of Algorithm 1, ϵk,j2=Qk,j1Qk,j2\epsilon^{2}_{k,j}=Q^{1}_{k,j}-Q^{2}_{k,j}.

We now define the sampling error, which denotes the difference between the minimizer of the expected loss function, $Q^{2}_{k,j}$, and the minimizer of the empirical loss function computed from samples, $Q^{3}_{k,j}$. We will use Rademacher complexity results to upper bound this error.

Definition 7 (Sampling Error)

For a given iteration kk of Algorithm 1 and iteration jj of the first for loop of Algorithm 1, ϵk,j3=Qk,j3Qk,j2\epsilon^{3}_{k,j}=Q^{3}_{k,j}-Q^{2}_{k,j}.

Lastly, we define the optimization error, which denotes the difference between the minimizer of the empirical squared loss function, $Q^{3}_{k,j}$, and our estimate of this minimizer obtained from the projected gradient descent algorithm.

Definition 8 (Optimization Error)

For a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, $\epsilon^{4}_{k,j}=Q^{3}_{k,j}-Q_{k,j}$. Here $Q_{k,j}$ is our estimate of the $Q$-function at iteration $k$ of Algorithm 1 and iteration $j$ of the first inner loop of Algorithm 1.
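Putting Definitions 5–8 together, the total critic error $\epsilon_{k,j}=T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q_{k,j}$ that appears later in the proof of Theorem 1 telescopes across the four quantities defined above:

\epsilon_{k,j}=\underbrace{\big{(}T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}\big{)}}_{\epsilon^{1}_{k,j}}+\underbrace{\big{(}Q^{1}_{k,j}-Q^{2}_{k,j}\big{)}}_{\epsilon^{2}_{k,j}}+\big{(}Q^{2}_{k,j}-Q^{3}_{k,j}\big{)}+\underbrace{\big{(}Q^{3}_{k,j}-Q_{k,j}\big{)}}_{\epsilon^{4}_{k,j}}=\epsilon^{1}_{k,j}+\epsilon^{2}_{k,j}-\epsilon^{3}_{k,j}+\epsilon^{4}_{k,j},

so by the triangle inequality $|\epsilon_{k,j}|\leq|\epsilon^{1}_{k,j}|+|\epsilon^{2}_{k,j}|+|\epsilon^{3}_{k,j}|+|\epsilon^{4}_{k,j}|$, which is the splitting used in Equation (144).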

C Supplementary lemmas and Definitions

Here we provide some definitions and results that will be used to prove the lemmas stated in the paper.

Definition 9

For a given set $Z\subseteq\mathbb{R}^{n}$, we define the Rademacher complexity of the set $Z$ as

Rad(Z)=\mathbb{E}\left(\sup_{z\in Z}\frac{1}{n}\sum_{i=1}^{n}\Omega_{i}z_{i}\right) (63)

where the $\Omega_{i}$ are i.i.d. Rademacher random variables with $P(\Omega_{i}=1)=P(\Omega_{i}=-1)=\frac{1}{2}$, and $z_{i}$ denotes the $i^{th}$ coordinate of $z\in Z$.

Lemma 10

Consider a set of observed data denoted by $z=\{z_{1},z_{2},\cdots,z_{n}\}\in\mathbb{R}^{n}$, a parameter space $\Theta$, and a loss function $l:\Theta\times\mathbb{R}\rightarrow\mathbb{R}$ with $0\leq l(\theta,z)\leq 1$ for all $(\theta,z)\in\Theta\times\mathbb{R}$. Define the empirical risk for the observed data as $R(\theta)=\frac{1}{n}\sum_{i=1}^{n}l(\theta,z_{i})$ and the population risk as $r(\theta)=\mathbb{E}\,l(\theta,\tilde{z_{i}})$, where $\tilde{z_{i}}$ is a coordinate of $\tilde{z}$ sampled from some distribution over $Z$.

We define a set of functions denoted by \mathcal{L} as

={zZl(θ,z):θΘ}\mathcal{L}=\{z\in Z\rightarrow l(\theta,z)\in\mathbb{R}:\theta\in\Theta\} (64)

Given z={z1,z2,z3,zn}z=\{z_{1},z_{2},z_{3}\cdots,z_{n}\} we further define a set z\mathcal{L}\circ z as

z={(l(θ,z1),l(θ,z2),,l(θ,zn))n:θΘ}\mathcal{L}\circ z\ =\{(l(\theta,z_{1}),l(\theta,z_{2}),\cdots,l(\theta,z_{n}))\in\mathbb{R}^{n}:\theta\in\Theta\} (65)

Then, we have

𝔼supθΘ|{r(θ)R(θ)}|2𝔼(Rad(z))\mathbb{E}\sup_{\theta\in\Theta}|\{r(\theta)-R(\theta)\}|\leq 2\mathbb{E}\left(Rad(\mathcal{L}\circ z)\right) (66)

If the data is of the form $z_{i}=(x_{i},y_{i})$ with $x\in X$, $y\in Y$, the loss function is of the form $l(a_{\theta}(x),y)$ and is $L$-Lipschitz, and $a_{\theta}:\Theta\times X\rightarrow\mathbb{R}$, then we have

𝔼supθΘ|{r(θ)R(θ)}|2L𝔼(Rad(𝒜{x1,x2,x3,,xn}))\mathbb{E}\sup_{\theta\in\Theta}|\{r(\theta)-R(\theta)\}|\leq 2{L}\mathbb{E}\left(Rad(\mathcal{A}\circ\{x_{1},x_{2},x_{3},\cdots,x_{n}\})\right) (67)

where

𝒜{x1,x2,,xn}={(a(θ,x1),a(θ,x2),,a(θ,xn))n:θΘ}\mathcal{A}\circ\{x_{1},x_{2},\cdots,x_{n}\}\ =\{(a(\theta,x_{1}),a(\theta,x_{2}),\cdots,a(\theta,x_{n}))\in\mathbb{R}^{n}:\theta\in\Theta\} (68)

The detailed proof of the above statement is given in (Rebeschini, 2022), Algorithmic Foundations of Learning [Lecture Notes], https://www.stats.ox.ac.uk/ rebeschi/teaching/AFoL/20/material/. The upper bound for $\mathbb{E}\sup_{\theta\in\Theta}(\{r(\theta)-R(\theta)\})$ is proved in the aforementioned reference. However, without loss of generality the same proof holds for the upper bound for $\mathbb{E}\sup_{\theta\in\Theta}(\{R(\theta)-r(\theta)\})$. Hence the upper bound for $\mathbb{E}\sup_{\theta\in\Theta}|\{r(\theta)-R(\theta)\}|$ can be established.
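For intuition, the empirical Rademacher complexity of a set of prediction vectors, as used in Lemma 10, can be approximated by Monte Carlo. The sketch below assumes the function class has already been evaluated on the data and stored as a finite array A_x of prediction vectors (an assumption made purely for illustration; the paper works with the two-layer ReLU class).

```python
import numpy as np

def empirical_rademacher(A_x, num_draws=2000, seed=0):
    """Monte-Carlo estimate of Rad(A o {x_1,...,x_n}) for a finite set of prediction vectors.

    A_x : (H, n) array; row h holds (a_theta_h(x_1), ..., a_theta_h(x_n)).
    """
    rng = np.random.default_rng(seed)
    H, n = A_x.shape
    vals = np.empty(num_draws)
    for t in range(num_draws):
        omega = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs Omega_i
        vals[t] = np.max(A_x @ omega) / n         # sup over the (finite) class of (1/n) sum Omega_i z_i
    return vals.mean()

# Example: 50 random "hypotheses" evaluated on 200 points
# A_x = np.random.default_rng(1).standard_normal((50, 200))
# print(empirical_rademacher(A_x))
```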

Lemma 11

Consider random variables $x\in\mathcal{X}$ and $y,y^{\prime}\in\mathcal{Y}$. Let $\mathbb{E}_{x,y}$, $\mathbb{E}_{x}$, $\mathbb{E}_{y|x}$, and $\mathbb{E}_{y^{\prime}|x}$ denote the expectation with respect to the joint distribution of $(x,y)$, the marginal distribution of $x$, the conditional distribution of $y$ given $x$, and the conditional distribution of $y^{\prime}$ given $x$, respectively. Let $f_{\theta}(x)$ denote a bounded measurable function of $x$ parameterised by some parameter $\theta$, and let $g(x,y)$ be a bounded measurable function of both $x$ and $y$.

Then we have

argminfθ𝔼x,y(fθ(x)g(x,y))2=argminfθ(𝔼x,y(fθ(x)𝔼y|x(g(x,y)|x))2)\operatorname*{arg\,min}_{f_{\theta}}\mathbb{E}_{x,y}\left(f_{\theta}(x)-g(x,y)\right)^{2}=\operatorname*{arg\,min}_{f_{\theta}}\left(\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)^{2}\right) (69)

Proof  Denote the left-hand side of Equation (69) by $\mathbb{X}_{\theta}$; adding and subtracting $\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)$ inside the square, we get

𝕏θ=\displaystyle\mathbb{X}_{\theta}= argminfθ(𝔼x,y(fθ(x)𝔼y|x(g(x,y)|x)+𝔼y|x(g(x,y)|x)g(x,y))2)\displaystyle\operatorname*{arg\,min}_{f_{\theta}}\left(\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)+\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)-g(x,y)\right)^{2}\right) (70)
=\operatorname*{arg\,min}_{f_{\theta}}\Big{(}\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)^{2}+\mathbb{E}_{x,y}\left(g(x,y)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)^{2}
2𝔼x,y(fθ(x)𝔼y|x(g(x,y)|x))(g(x,y)𝔼y|x(g(x,y)|x))).\displaystyle\qquad-2\mathbb{E}_{x,y}\Big{(}f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\Big{)}\left(g(x,y)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\Big{)}. (71)

Consider the third term on the right hand side of Equation (71)

2𝔼x,y\displaystyle 2\mathbb{E}_{x,y} (fθ(x)𝔼y|x(g(x,y)|x))(g(x,y)𝔼y|x(g(x,y)|x))\displaystyle\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\left(g(x,y)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)
=\displaystyle= 2𝔼x𝔼y|x(fθ(x)𝔼y|x(g(x,y)|x))(g(x,y)𝔼y|x(g(x,y)|x))\displaystyle 2\mathbb{E}_{x}\mathbb{E}_{y|x}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\left(g(x,y)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right) (72)
=\displaystyle= 2𝔼x(fθ(x)𝔼y|x(g(x,y)|x))𝔼y|x(g(x,y)𝔼y|x(g(x,y)|x))\displaystyle 2\mathbb{E}_{x}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\mathbb{E}_{y|x}\left(g(x,y)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right) (73)
=\displaystyle= 2𝔼x(fθ(x)𝔼y|x(g(x,y)|x))(𝔼y|x(g(x,y))𝔼y|x(𝔼y|x(g(x,y)|x)))\displaystyle 2\mathbb{E}_{x}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\left(\mathbb{E}_{y|x}(g(x,y))-\mathbb{E}_{y|x}\left(\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)\right) (74)
= 2\mathbb{E}_{x}\left(f_{\theta}(x)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)\Big{(}\mathbb{E}_{y|x}(g(x,y))-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\Big{)} (75)
=\displaystyle= 0\displaystyle 0 (76)

Equation (72) is obtained by writing $\mathbb{E}_{x,y}=\mathbb{E}_{x}\mathbb{E}_{y|x}$ from the law of total expectation. Equation (73) is obtained from (72) as the term $f_{\theta}(x)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)$ is not a function of $y$. Equation (74) is obtained from (73) as $\mathbb{E}_{y|x}\left(\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)=\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)$, because $\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)$ is not a function of $y$ and hence is constant with respect to the expectation operator $\mathbb{E}_{y|x}$. Finally, Equation (75) equals zero because $y$ and $y^{\prime}$ have the same conditional distribution given $x$, so $\mathbb{E}_{y|x}(g(x,y))=\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)$.

Thus, plugging the value of $2\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)\left(g(x,y)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)$ into Equation (71), we get

\operatorname*{arg\,min}_{f_{\theta}}\mathbb{E}_{x,y}\left(f_{\theta}(x)-g(x,y)\right)^{2}= \operatorname*{arg\,min}_{f_{\theta}}\Big{(}\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{\prime}|x}(g(x,y^{\prime})|x)\right)^{2}
+𝔼x,y(g(x,y)𝔼y|x(g(x,y)|x))2).\displaystyle+\mathbb{E}_{x,y}\left(g(x,y)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)^{2}). (77)

Note that the second term on the right-hand side of Equation (77) does not depend on $f_{\theta}(x)$; therefore we can write Equation (77) as

argminfθ𝔼x,y(fθ(x)g(x,y))2=argminfθ(𝔼x,y(fθ(x)𝔼y|x(g(x,y)|x))2)\operatorname*{arg\,min}_{f_{\theta}}\mathbb{E}_{x,y}\left(f_{\theta}(x)-g(x,y)\right)^{2}=\operatorname*{arg\,min}_{f_{\theta}}\left(\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)^{2}\right) (78)

Since the right hand side of Equation (78) is not a function of yy we can replace 𝔼x,y\mathbb{E}_{x,y} with 𝔼x\mathbb{E}_{x} to get

argminfθ𝔼x,y(fθ(x)g(x,y))2=argminfθ(𝔼x(fθ(x)𝔼y|x(g(x,y)|x))2)\operatorname*{arg\,min}_{f_{\theta}}\mathbb{E}_{x,y}\left(f_{\theta}(x)-g(x,y)\right)^{2}=\operatorname*{arg\,min}_{f_{\theta}}\left(\mathbb{E}_{x}\left(f_{\theta}(x)-\mathbb{E}_{y^{{}^{\prime}}|x}(g(x,y^{{}^{\prime}})|x)\right)^{2}\right) (79)

 
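The conclusion of Lemma 11 can also be checked numerically. The toy model below is an assumption introduced only for this sanity check: a linear predictor $f_{\theta}(x)=\theta x$, targets $g(x,y)=2x+y$ with noise $y$ independent of $x$, so that the known conditional mean is $\mathbb{E}[g(x,y^{\prime})|x]=2x$.

```python
import numpy as np

# Sanity check of Lemma 11 (not part of the paper): the least-squares fit against the noisy
# targets g(x, y) and the fit against the conditional mean E[g(x, y') | x] nearly coincide.
rng = np.random.default_rng(0)
n = 50_000
x = rng.uniform(-1.0, 1.0, size=n)
y = rng.standard_normal(n)                   # noise independent of x
g = 2.0 * x + y                              # g(x, y) with E[g | x] = 2x
g_cond_mean = 2.0 * x                        # conditional mean assumed known in this toy model

theta_noisy = np.sum(x * g) / np.sum(x * x)              # argmin_theta sum (theta*x - g)^2
theta_mean = np.sum(x * g_cond_mean) / np.sum(x * x)     # argmin_theta sum (theta*x - E[g|x])^2

print(theta_noisy, theta_mean)               # both approach 2.0 as n grows
```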

Lemma 12

Consider an optimization of the form given in Equation (45) with the regularization term $\beta=0$, denoted by $\mathcal{L}_{|\tilde{D}|}$, and its convex equivalent, denoted by $\mathcal{L}_{0}$. Then the values of these two loss functions evaluated at $\theta=\psi(v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}$ and $(v,w)=(v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}$, respectively, are equal, and thus we have

|D~|(ψ(vi,wi)i{1,,|D~|})=0((vi,wi)i{1,,|D~|})\mathcal{L}_{|\tilde{D}|}(\psi(v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}})=\mathcal{L}_{0}((v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}) (80)

Proof  The loss functions in Equations (40) and (48) with $\beta=0$ are as follows

0((vi,wi)i{1,,|D~|})\displaystyle\mathcal{L}_{0}((v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}) =\displaystyle= DiD~Di(X(viwi))y22\displaystyle||\sum_{D_{i}\in\tilde{D}}D_{i}(X(v_{i}-w_{i}))-y||^{2}_{2} (81)
|D~|(ψ(vi,wi)i{1,,|D~|})\displaystyle\mathcal{L}_{|\tilde{D}|}(\psi(v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}) =\displaystyle= i=1|D~|σ(Xψ(vi,wi)1)ψ(vi,wi)2y22,\displaystyle||\sum_{i=1}^{|\tilde{D}|}\sigma(X\psi(v_{i},w_{i})_{1})\psi(v_{i},w_{i})_{2}-y||_{2}^{2}, (82)

where ψ(vi,wi)1\psi(v_{i},w_{i})_{1}, ψ(vi,wi)2\psi(v_{i},w_{i})_{2} represent the first and second coordinates of ψ(vi,wi)\psi(v_{i},w_{i}) respectively.

For any fixed i{1,,|D~|}i\in\{1,\cdots,|\tilde{D}|\} consider the two terms

Di(X(viwi))\displaystyle D_{i}(X(v_{i}-w_{i})) (83)
σ(Xψ(vi,wi)1)ψ(vi,wi)2\displaystyle\sigma(X\psi(v_{i},w_{i})_{1})\psi(v_{i},w_{i})_{2} (84)

For a fixed $i$, either $v_{i}$ or $w_{i}$ is zero. In case both are zero, both of the terms in Equations (83) and (84) are zero, as $\psi(0,0)=(0,0)$. Assume that for a given $i$ we have $w_{i}=0$. Then $\psi(v_{i},w_{i})=(v_{i},1)$, and Equations (83) and (84) become

D_{i}(Xv_{i}) (85)
\sigma(Xv_{i}) (86)

But by the definition of $v_{i}$ we have $D_{i}(Xv_{i})=\sigma(Xv_{i})$; therefore Equations (85) and (86) are equal. Alternatively, if for a given $i$ we have $v_{i}=0$, then $\psi(v_{i},w_{i})=(w_{i},-1)$ and the terms in Equations (83) and (84) become

-D_{i}(Xw_{i}) (87)
-\sigma(Xw_{i}) (88)

By the definition of $w_{i}$ we have $D_{i}(Xw_{i})=\sigma(Xw_{i})$, so the terms in Equations (87) and (88) are equal. Since this is true for all $i$, we have

|D~|(ψ(vi,wi)i{1,,|D~|})=0((vi,wi)i{1,,|D~|})\mathcal{L}_{|\tilde{D}|}(\psi(v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}})=\mathcal{L}_{0}((v_{i},w_{i})_{i\in\{1,\cdots,|\tilde{D}|\}}) (89)

 
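The equality in Lemma 12 can be verified numerically on a toy instance. The construction below is an assumption reconstructed from the proof: each $D_{i}$ is taken to be the ReLU activation pattern of $v_{i}$ (so that $D_{i}Xv_{i}=\sigma(Xv_{i})$), all $w_{i}$ are set to zero, and $\psi(v_{i},0)=(v_{i},1)$ as in Equations (85)-(86).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

V = rng.standard_normal((d, m))                              # v_i, with w_i = 0 for each i
D = [np.diag((X @ V[:, i] >= 0).astype(float)) for i in range(m)]   # activation-pattern matrices

# Convex-form loss ||sum_i D_i X v_i - y||^2 vs. non-convex loss ||sum_i sigma(X v_i) * 1 - y||^2
loss_convex = np.linalg.norm(sum(D[i] @ X @ V[:, i] for i in range(m)) - y) ** 2
loss_nonconvex = np.linalg.norm(sum(np.maximum(X @ V[:, i], 0.0) * 1.0 for i in range(m)) - y) ** 2

print(np.isclose(loss_convex, loss_nonconvex))               # True: the two losses coincide
```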

Lemma 13

The function $Q_{\theta}(x)$ defined in Equation (10) is Lipschitz continuous in $\theta$, where $\theta$ is considered a vector in $\mathbb{R}^{(d+1)m}$, under the assumption that all possible $\theta$ belong to the set $\mathcal{B}=\{\theta:|\theta^{*}-\theta|_{1}<1\}$, where $\theta^{*}$ is some fixed value.

Proof 

First, we show that for all $\theta_{1}=\{u_{i},\alpha_{i}\},\theta_{2}=\{u^{\prime}_{i},\alpha^{\prime}_{i}\}\in\mathcal{B}$ we have $\alpha_{i}=\alpha^{\prime}_{i}$ for all $i\in\{1,\cdots,m\}$.

Note that

|θ1θ2|1=i=1m|uiui|1+i=1m|αiαi|,|\theta_{1}-\theta_{2}|_{1}=\sum_{i=1}^{m}|u_{i}-u^{{}^{\prime}}_{i}|_{1}+\sum_{i=1}^{m}|\alpha_{i}-\alpha^{{}^{\prime}}_{i}|, (90)

where |uiui|1=j=1d|uijuij||u_{i}-u^{{}^{\prime}}_{i}|_{1}=\sum_{j=1}^{d}|u_{i_{j}}-u^{{}^{\prime}}_{i_{j}}| with uij,uiju_{i_{j}},u^{{}^{\prime}}_{i_{j}} denote the jthj^{th} component of ui,uiu_{i},u^{{}^{\prime}}_{i} respectively.

By construction, $\alpha_{i},\alpha^{\prime}_{i}$ can only be $1$, $-1$, or $0$. Therefore, if $\alpha_{i}\neq\alpha^{\prime}_{i}$ for some $i$, then $|\alpha_{i}-\alpha^{\prime}_{i}|=2$ if both are non-zero or $|\alpha_{i}-\alpha^{\prime}_{i}|=1$ if one is zero, so $|\theta_{1}-\theta_{2}|_{1}\geq 1$, which leads to a contradiction.

Therefore αi=αi\alpha_{i}=\alpha^{{}^{\prime}}_{i} for all ii and we also have

|θ1θ2|1=i=1m|uiui|1|\theta_{1}-\theta_{2}|_{1}=\sum_{i=1}^{m}|u_{i}-u^{{}^{\prime}}_{i}|_{1} (91)

Qθ(x)Q_{\theta}(x) is defined as

Qθ(x)=i=1mσ(xTui)αiQ_{\theta}(x)=\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u_{i})\alpha_{i} (92)

From Proposition 11 in (10.5555/3327144.3327299), the function $Q_{\theta}(x)$ is Lipschitz continuous in $x$; therefore there exists $l>0$ such that

|Qθ(x)Qθ(y)|\displaystyle|Q_{\theta}(x)-Q_{\theta}(y)| \displaystyle\leq l|xy|1\displaystyle l|x-y|_{1} (93)
|i=1mσ(xTui)αii=1mσ(yTui)αi|\displaystyle|\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u_{i})\alpha_{i}-\sum_{i=1}^{m}\sigma^{{}^{\prime}}(y^{T}u_{i})\alpha_{i}| \displaystyle\leq l|xy|1\displaystyle l|x-y|_{1} (94)

If we consider a single neuron of QθQ_{\theta}, for example i=1i=1, we have l1>0l_{1}>0 such that

|σ(xTu1)αiσ(yTu1)αi|\displaystyle|\sigma^{{}^{\prime}}(x^{T}u_{1})\alpha_{i}-\sigma^{{}^{\prime}}(y^{T}u_{1})\alpha_{i}| \displaystyle\leq l1|xy|1\displaystyle l_{1}|x-y|_{1} (95)

Now consider Equation (95), but instead of viewing the left-hand side as a function of $x,y$, view it as a function of $u$, where we consider the difference between $\sigma^{\prime}(x^{T}u)\alpha_{i}$ evaluated at $u_{1}$ and at $u^{\prime}_{1}$, such that

|σ(xTu1)αiσ(xTu1)αi|\displaystyle|\sigma^{{}^{\prime}}(x^{T}u_{1})\alpha_{i}-\sigma^{{}^{\prime}}(x^{T}u^{{}^{\prime}}_{1})\alpha_{i}| \displaystyle\leq l1x|u1u1|1\displaystyle l^{x}_{1}|u_{1}-u^{{}^{\prime}}_{1}|_{1} (96)

for some l1x>0l^{x}_{1}>0.

Similarly, for all other $i$, if we change $u_{i}$ to $u^{\prime}_{i}$ while keeping everything else unchanged, we have

|σ(xTui)αiσ(xTui)αi|\displaystyle|\sigma^{{}^{\prime}}(x^{T}u_{i})\alpha_{i}-\sigma^{{}^{\prime}}(x^{T}u^{{}^{\prime}}_{i})\alpha_{i}| \displaystyle\leq lix|uiui|1\displaystyle l^{x}_{i}|u_{i}-u^{{}^{\prime}}_{i}|_{1} (97)

for all xx if both θ1,θ2\theta_{1},\theta_{2}\in\mathcal{B}.

Therefore we obtain

|\sum_{i=1}^{m}\sigma^{\prime}(x^{T}u_{i})\alpha_{i}-\sum_{i=1}^{m}\sigma^{\prime}(x^{T}u^{\prime}_{i})\alpha_{i}| \leq \sum_{i=1}^{m}|\sigma^{\prime}(x^{T}u_{i})\alpha_{i}-\sigma^{\prime}(x^{T}u^{\prime}_{i})\alpha_{i}| (98)
\displaystyle\leq i=1mlix|uiui|1\displaystyle\sum_{i=1}^{m}l^{x}_{i}|u_{i}-u^{{}^{\prime}}_{i}|_{1} (99)
\displaystyle\leq (supilix)i=1m|uiui|1\displaystyle(\sup_{i}l_{i}^{x})\sum_{i=1}^{m}|u_{i}-u^{{}^{\prime}}_{i}|_{1} (100)
\displaystyle\leq (supilix)|θ1θ2|\displaystyle(\sup_{i}l_{i}^{x})|\theta_{1}-\theta_{2}| (101)

This result holds for a fixed $x$. If we take the supremum over $x$ on both sides, we get

supx|i=1mσ(xTui)αii=1mσ(xTui)αi|\displaystyle\sup_{x}|\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u_{i})\alpha_{i}-\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u^{{}^{\prime}}_{i})\alpha_{i}| \displaystyle\leq (supi,xlix)|θ1θ2|\displaystyle(\sup_{i,x}l_{i}^{x})|\theta_{1}-\theta_{2}| (102)

Denoting (supi,xlix)=l(\sup_{i,x}l_{i}^{x})=l, we get

|i=1mσ(xTui)αii=1mσ(xTui)αi|\displaystyle|\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u_{i})\alpha_{i}-\sum_{i=1}^{m}\sigma^{{}^{\prime}}(x^{T}u^{{}^{\prime}}_{i})\alpha_{i}| \displaystyle\leq l|θ1θ2|1\displaystyle l|\theta_{1}-\theta_{2}|_{1} (104)
xd\displaystyle\forall x\in\mathbb{R}^{d}

 
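A quick numerical check of the Lipschitz-in-$\theta$ property shown in Lemma 13 is sketched below, under the lemma's setting: the $\alpha_{i}$ are held fixed and only the $u_{i}$ differ, with state-action features assumed to lie in $[0,1]^{d}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 10
alpha = rng.choice([-1.0, 1.0], size=m)

def Q(x, U):
    """sum_i sigma(x^T u_i) alpha_i with sigma = ReLU."""
    return float(np.maximum(x @ U, 0.0) @ alpha)

U1 = 0.1 * rng.standard_normal((d, m))
U2 = U1 + 0.01 * rng.standard_normal((d, m))      # small perturbation of the u_i only

max_ratio = 0.0
for _ in range(1000):
    x = rng.uniform(0.0, 1.0, size=d)             # state-action features in [0, 1]^d
    diff = abs(Q(x, U1) - Q(x, U2))
    max_ratio = max(max_ratio, diff / np.abs(U1 - U2).sum())

# The ratio stays bounded, consistent with |Q_{theta_1}(x) - Q_{theta_2}(x)| <= l |theta_1 - theta_2|_1.
print(max_ratio)
```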

D Supporting Lemmas

We will now state the key lemmas that will be used for finding the sample complexity of the proposed algorithm.

Lemma 14

For a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, the approximation error $\epsilon^{1}_{k,j}$ defined in Definition 5 satisfies

\mathbb{E}\left(|\epsilon^{1}_{k,j}|\right)\leq\sqrt{\epsilon_{approx}}, (105)

where the expectation is with respect to $(s,a)\sim\zeta^{\nu}_{\pi}(s,a)$.

Proof Sketch: We use Assumption 3 and the definition of the variance of a random variable to obtain the required result. The detailed proof is given in Appendix F.1.

Lemma 15

For a given iteration kk of Algorithm 1 and iteration jj of the first for loop of Algorithm 1, Qk,j1=Qk,j2Q^{1}_{k,j}=Q^{2}_{k,j}, or equivalently ϵk,j2=0\epsilon^{2}_{k,j}=0

Proof Sketch: We use Lemma 11 in Appendix C and use the definitions of Qk,j1Q^{1}_{k,j} and Qk,j2Q^{2}_{k,j} to prove this result. The detailed proof is given in Appendix F.2.

Lemma 16

For a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, if the number of state action pairs sampled is denoted by $n_{k,j}$, then the error $\epsilon^{3}_{k,j}$ defined in Definition 7 is upper bounded as

𝔼(|ϵk,j3|)𝒪~(log(log(nk,j))nk,j),\mathbb{E}\left(|\epsilon^{3}_{k,j}|\right)\leq\tilde{\mathcal{O}}\left(\frac{log(log(n_{k,j}))}{\sqrt{n_{k,j}}}\right), (106)

where the expectation is with respect to $(s,a)\sim\zeta^{\nu}_{\pi}(s,a)$.

Proof Sketch: First, we note that for a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, $\mathbb{E}(R_{X_{k,j},Q_{k,j-1}}(\theta))=L_{Q_{k,j-1}}(\theta)$, where $R_{X_{k,j},Q_{k,j-1}}(\theta)$ and $L_{Q_{k,j-1}}(\theta)$ are defined in Appendix F.3. We use this to get a probabilistic bound on the expected value of $|Q^{2}_{k,j}-Q^{3}_{k,j}|$ using Rademacher complexity theory when the samples are drawn from an ergodic Markov chain. The detailed proof is given in Appendix F.3. Note that the $\log(\log(n_{k,j}))$ term is present because the state action samples belong to a Markov chain.

Lemma 17

For a given iteration kk of Algorithm 1 and iteration jj of the first for loop of Algorithm 1, let the number of steps of the projected gradient descent performed by Algorithm 2, denoted by Tk,jT_{k,j}, and the gradient descent step size αk,j\alpha_{k,j} satisfy

αk,j\displaystyle\alpha_{k,j} =\displaystyle= uk,j2Lk,jTk,j+1,\displaystyle\frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}}, (107)

for some constants $L_{k,j}$ and $||u^{*}_{k,j}||_{2}$. Then the error $\epsilon^{4}_{k,j}$ defined in Definition 8 is upper bounded as

𝔼(|ϵk,j4|)𝒪~(1Tk,j)+ϵ|D~|,\mathbb{E}(|\epsilon^{4}_{k,j}|)\leq\tilde{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right)+{\epsilon_{|\tilde{D}|}}, (108)

Where the expectation is with respect to (s,a)ζπν(s,a)(s,a)\sim\zeta^{\nu}_{\pi}(s,a).

Proof Sketch: We use the number of iterations $T_{k,j}$ required to obtain an $\epsilon$ bound on the difference between the minimum objective value and the objective value corresponding to the estimated parameter at iteration $T_{k,j}$. We use the convexity of the objective and the Lipschitz property of the neural network to get a bound on the $Q$ functions corresponding to the estimated parameters. The detailed proof is given in Appendix F.4.

Lemma 18

For a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, let the number of state action pairs sampled be denoted by $n_{k,j}$ and let $\beta_{i}$ be the step size in the projected gradient descent at iteration $i$ of the second inner for loop of Algorithm 1, which satisfies

βi=μki+1,\displaystyle\beta_{i}=\frac{\mu_{k}}{i+1}, (109)

where μk\mu_{k} is the strong convexity parameter of FkF_{k}. Then, it holds that,

(Fk(wi))𝒪~(log(nk,j)nk,j)+Fk.\left(F_{k}(w_{i})\right)\leq\tilde{\mathcal{O}}\left(\frac{log(n_{k,j})}{{n_{k,j}}}\right)+F^{*}_{k}. (110)

Proof Sketch: Note that we do not have access to state action samples from the stationary state action distribution corresponding to the policy $\pi_{\lambda_{k}}$. We only have access to samples from a Markov chain with that stationary distribution. To account for this, we use the results in (9769873) and bound the difference between the optimal loss value and the loss value obtained by performing stochastic gradient descent with samples from a Markov chain.
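A minimal sketch of the decaying-step-size scheme of Equation (109) is given below. This is illustrative only: grad_fn and project are assumed callables standing in for a stochastic gradient of $F_{k}$ and the projection onto the feasible set; the actual samples in Algorithm 1 come from a Markov chain.

```python
import numpy as np

def projected_sgd(w0, samples, grad_fn, project, mu):
    """Projected SGD with step size beta_i = mu / (i + 1), as in Eq. (109)."""
    w = np.array(w0, dtype=float)
    for i, sample in enumerate(samples):
        beta_i = mu / (i + 1)                         # decaying step size for a strongly convex objective
        w = project(w - beta_i * grad_fn(w, sample))  # gradient step followed by projection
    return w

# Example with a strongly convex quadratic F(w) = 0.5 * ||w - 1||^2 and a unit-ball projection:
# grad = lambda w, s: (w - 1.0) + 0.1 * s                    # noisy gradient
# proj = lambda w: w / max(1.0, np.linalg.norm(w))
# w_hat = projected_sgd(np.zeros(3),
#                       np.random.default_rng(0).standard_normal((5000, 3)),
#                       grad, proj, mu=1.0)
```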

E Proof of Theorem 1

Proof 

From Assumption 1, we have

\log\frac{\pi_{\lambda_{k+1}}(a|s)}{\pi_{\lambda_{k}}(a|s)} \geq {\nabla}_{\lambda_{k}}\log{\pi_{\lambda_{k}}}(a|s){\cdot}(\lambda^{k+1}-\lambda^{k})-\frac{\beta}{2}||\lambda^{k+1}-\lambda^{k}||_{2}^{2} (111)
= {\eta}{\nabla}_{\lambda_{k}}\log{\pi_{\lambda_{k}}}(a|s){\cdot}w^{k}-{\eta}^{2}\frac{\beta}{2}||w^{k}||_{2}^{2} (112)

From the definition of KL divergence and the performance difference lemma of (kakade2002approximately), we have

\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\left(KL(\pi^{*}||\pi_{\lambda_{k}})-KL(\pi^{*}||\pi_{\lambda_{k+1}})\right)= \mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\mathbb{E}_{a\sim\pi^{*}(.|s)}\left[\log\frac{\pi_{\lambda_{k+1}}(a|s)}{\pi_{\lambda_{k}}(a|s)}\right] (115)
\geq {\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\mathbb{E}_{a\sim\pi^{*}(.|s)}\left[{\nabla}_{\lambda_{k}}\log{\pi_{\lambda_{k}}}(a|s){\cdot}w^{k}\right]-{\eta}^{2}\frac{\beta}{2}||w^{k}||_{2}^{2} (116)
= {\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\mathbb{E}_{a\sim\pi^{*}(.|s)}\left[A^{\pi_{\lambda_{k}}}(s,a)\right]-{\eta}^{2}\frac{\beta}{2}||w^{k}||_{2}^{2}
-{\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\mathbb{E}_{a\sim\pi^{*}(.|s)}\Big{[}A^{\pi_{\lambda_{k}}}(s,a)-{\nabla}_{\lambda_{k}}\log{\pi_{\lambda_{k}}}(a|s){\cdot}w^{k}\Big{]} (117)
=\displaystyle= (1γ)η(Vπ(ν)Vk(ν))η2β2wk22ηerrk.\displaystyle(1-\gamma){\eta}\left(V^{\pi^{*}}(\nu)-V^{k}(\nu)\right)-{\eta}^{2}\frac{\beta}{2}||w^{k}||_{2}^{2}-{\eta}{\cdot}err_{k}. (118)

Equation (116) is obtained from Equation (115) using the result in Equation (112). Equation (117) is obtained from Equation (116) by adding and subtracting ${\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\mathbb{E}_{a\sim\pi^{*}(.|s)}[A^{\pi_{\lambda_{k}}}(s,a)]$, and Equation (118) follows from the performance difference lemma of (kakade2002approximately) together with the definition of $err_{k}$ in Equation (28), where $A^{\pi_{\lambda_{k}}}$ is the advantage function corresponding to $Q^{\pi_{\lambda_{k}}}$.

Rearranging, we get

\left(V^{\pi^{*}}(\nu)-V^{k}(\nu)\right) \leq \frac{1}{1-\gamma}\left(\frac{1}{\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\left(KL(\pi^{*}||\pi_{\lambda_{k}})-KL(\pi^{*}||\pi_{\lambda_{k+1}})\right)+{\eta}\frac{\beta}{2}{\cdot}{W}^{2}+err_{k}\right)

Summing from 11 to KK and dividing by KK we get

\frac{1}{K}{\sum_{k=1}^{K}}\left(V^{\pi^{*}}(\nu)-V^{k}(\nu)\right)\leq \left(\frac{1}{1-\gamma}\right)\frac{1}{K}{\sum_{k=1}^{K}}\left(\frac{1}{\eta}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\left(KL(\pi^{*}||\pi_{\lambda_{k}})-KL(\pi^{*}||\pi_{\lambda_{k+1}})\right)+err_{k}\right)
+ \left(\frac{1}{1-\gamma}\right){\eta}\frac{\beta}{2}{\cdot}{W}^{2} (120)
\leq \frac{1}{{\eta}(1-\gamma)}\frac{1}{K}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}}}\left(KL(\pi^{*}||\pi_{\lambda_{0}})\right)+\frac{{\eta}{\beta}{\cdot}{W}^{2}}{2(1-\gamma)}+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}err_{k} (121)
\leq \frac{\log(|\mathcal{A}|)}{K{\eta}(1-\gamma)}+\frac{{\eta}{\beta}{\cdot}{W}^{2}}{2(1-\gamma)}+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}err_{k} (122)

If we set η=1K\eta=\frac{1}{\sqrt{K}} in Equation (122) we get

\frac{1}{K}{\sum_{k=1}^{K}}\left(V^{\pi^{*}}(\nu)-V^{k}(\nu)\right) \leq \frac{1}{\sqrt{K}}\left(\frac{2\log(|\mathcal{A}|)+\beta{\cdot}{W}^{2}}{2(1-\gamma)}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K}err_{k} (123)
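For completeness, the substitution $\eta=\frac{1}{\sqrt{K}}$ in Equation (122) works out as follows:

\frac{\log(|\mathcal{A}|)}{K{\eta}(1-\gamma)}\Big{|}_{\eta=\frac{1}{\sqrt{K}}}=\frac{\log(|\mathcal{A}|)}{\sqrt{K}(1-\gamma)},\qquad \frac{{\eta}{\beta}{W}^{2}}{2(1-\gamma)}\Big{|}_{\eta=\frac{1}{\sqrt{K}}}=\frac{{\beta}{W}^{2}}{2\sqrt{K}(1-\gamma)},

and adding the two terms gives $\frac{1}{\sqrt{K}}\cdot\frac{2\log(|\mathcal{A}|)+\beta W^{2}}{2(1-\gamma)}$, which is the first term of Equation (123).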

Now consider the term errkerr_{k}, we have from Equation (28)

err_{k} = \mathbb{E}_{s,a}(A^{\pi_{\lambda_{k}}}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s))
= \mathbb{E}_{s,a}(A^{\pi_{\lambda_{k}}}-A_{k,J})+\mathbb{E}_{s,a}(A_{k,J}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s))
\leq \underbrace{|\mathbb{E}_{s,a}(A^{\pi_{\lambda_{k}}}-A_{k,J})|}_{I}+\underbrace{|\mathbb{E}_{s,a}(A_{k,J}-w^{k}{\nabla}\log\pi_{\lambda_{k}}(a|s))|}_{II} (125)

where $A_{k,J}$ is the estimate of $A^{\pi_{\lambda_{k}}}$ obtained at the $k^{th}$ iteration of Algorithm 1, and $s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}$.

We first derive bounds on term $I$. From the definition of the advantage function, we have

|\mathbb{E}(A^{\pi_{\lambda_{k}}}(s,a)-A_{k,J}(s,a))| = |\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(Q^{\pi_{\lambda_{k}}}(s,a)-E_{a^{\prime}\sim\pi_{\lambda_{k}}}Q^{\pi_{\lambda_{k}}}(s,a^{\prime})
-Q_{k,J}(s,a)+E_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,J}(s,a^{\prime}))|
\leq |\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a))|
+ |\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a^{\prime}\sim\pi_{\lambda_{k}}}(Q^{\pi_{\lambda_{k}}}(s,a^{\prime})-Q_{k,J}(s,a^{\prime}))| (128)

We write the second term on the right-hand side of Equation (128) as $\int|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a)|d\mu_{k}$, where $\mu_{k}$ is the measure associated with the state action distribution given by $s\sim d_{\nu}^{\pi^{*}},a\sim\pi_{\lambda_{k}}$. Then we have

|Qπλk(s,a)Qk,J(s,a)|d(μk)dμkdμ|Qπλk(s,a)Qk,J(s,a)|d(μ)\displaystyle\int|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{{k,J}}(s,a)|d(\mu_{k})\leq\Bigg{|}\Bigg{|}\frac{d{\mu_{k}}}{d{\mu^{*}}}\Bigg{|}\Bigg{|}_{\infty}\int|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{{k,J}}(s,a)|d(\mu^{*}) (129)

where μ\mu^{*} is the measure associated with the state action distribution given by sdνπ,aπs\sim d_{\nu}^{\pi^{*}},a^{{}^{\prime}}\sim{\pi^{*}}. Using Assumption 6 we have dμkdμϕμ,μk\Bigg{|}\Bigg{|}\frac{d{\mu_{k}}}{d{\mu^{*}}}\Bigg{|}\Bigg{|}_{\infty}\leq\phi_{\mu^{*},\mu_{k}}. Thus Equation (129) becomes

|Qπλk(s,a)Qk,J(s,a)|d(μk)(ϕμk,μ)(|Qπλk(s,a)Qk,J(s,a)|d(μ)\displaystyle\int|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{{k,J}}(s,a)|d(\mu_{k})\leq(\phi_{\mu_{k},\mu^{*}})\int(|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{{k,J}}(s,a)|d(\mu^{*}) (130)

Since $\int|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a)|d\mu^{*}=\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(|Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a)|)$, Equation (128) now becomes

|𝔼(Aπλk(s,a)Ak,J(s,a))|\displaystyle|\mathbb{E}(A^{\pi_{\lambda_{k}}}(s,a)-A_{{k,J}}(s,a))| \displaystyle\leq (1+ϕμk,μ)|𝔼sdνπ,aπ(Qπλk(s,a)Qk,J(s,a))|\displaystyle(1+\phi_{\mu_{k},\mu^{*}})|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(Q^{\pi_{\lambda_{k}}}(s,a)-Q_{{k,J}}(s,a))|

Therefore, an upper bound on $|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a))|$ yields an upper bound on $|\mathbb{E}(A^{\pi_{\lambda_{k}}}(s,a)-A_{k,J}(s,a))|$.

In order to prove the bound on $|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(Q^{\pi_{\lambda_{k}}}(s,a)-Q_{k,J}(s,a))|$, we first define some notation. Let $Q_{1},Q_{2}$ be two real-valued functions on the state action space. The expression $Q_{1}\geq Q_{2}$ means $Q_{1}(s,a)\geq Q_{2}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$.

Let $Q_{k,j}$ denote our estimate of the action value function at iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop of Algorithm 1, and let $Q_{k,J}$ denote the final such estimate, where $J$ is the total number of iterations of the first for loop of Algorithm 1. $Q^{\pi_{\lambda_{k}}}$ denotes the action value function induced by the policy $\pi_{\lambda_{k}}$.

Consider ϵk,j+1=TπλkQk,jQk,j+1\epsilon_{k,j+1}=T^{\pi_{\lambda_{k}}}Q_{k,j}-Q_{k,j+1}.

TQk,jTπλkQk,j\displaystyle TQ_{k,j}\geq T^{\pi_{\lambda_{k}}}Q_{k,j} (132)

This follows from the definition of TπλkT^{\pi_{\lambda_{k}}} and TT in Equation (51) and (52), respectively.

Thus we get,

QπλkQk,j+1=\displaystyle Q^{\pi_{\lambda_{k}}}-Q_{k,j+1}= TπλkQπλkQk,j+1\displaystyle T^{\pi_{\lambda_{k}}}Q^{\pi_{\lambda_{k}}}-Q_{k,j+1} (133)
=\displaystyle= TπλkQπλkTπλkQk,j+TπλkQk,jTQk,j+TQk,jQk,j+1\displaystyle T^{\pi_{\lambda_{k}}}Q^{\pi_{\lambda_{k}}}-T^{\pi_{\lambda_{k}}}Q_{k,j}+T^{\pi_{\lambda_{k}}}Q_{k,j}-TQ_{k,j}+TQ_{k,j}-Q_{k,j+1} (134)
=\displaystyle= r(s,a)+γPπλkQπλk(r(s,a)+γPπλkQk,j)+(r(s,a)+γPπλkQk,j)\displaystyle r(s,a)+\gamma P^{\pi_{\lambda_{k}}}Q^{\pi_{\lambda_{k}}}-(r(s,a)+\gamma P^{\pi_{\lambda_{k}}}Q_{k,j})+(r(s,a)+\gamma P^{\pi_{\lambda_{k}}}Q_{k,j})
(r(s,a)+γPQk,j)+ϵk,j+1\displaystyle-(r(s,a)+\gamma P^{*}Q_{k,j})+\epsilon_{k,j+1}
=\displaystyle= γPπλk(QπλkQk,j)+γPπλkQk,jγPQk,j+ϵk,j+1\displaystyle\gamma P^{\pi_{\lambda_{k}}}(Q^{\pi_{\lambda_{k}}}-Q_{k,j})+\gamma P^{\pi_{\lambda_{k}}}Q_{k,j}-\gamma P^{*}Q_{k,j}+\epsilon_{k,j+1} (135)
\displaystyle\leq γ(Pπλk(QπλkQk,j))+ϵk,j+1\displaystyle\gamma(P^{\pi_{\lambda_{k}}}(Q^{\pi_{\lambda_{k}}}-Q_{k,j}))+\epsilon_{k,j+1} (136)

The right-hand side of Equation (133) is obtained by writing $Q^{\pi_{\lambda_{k}}}=T^{\pi_{\lambda_{k}}}Q^{\pi_{\lambda_{k}}}$, since $Q^{\pi_{\lambda_{k}}}$ is a fixed point of the operator $T^{\pi_{\lambda_{k}}}$. Equation (134) is obtained from (133) by adding and subtracting $T^{\pi_{\lambda_{k}}}Q_{k,j}$ and $TQ_{k,j}$. Equation (136) is obtained from (135) as $P^{\pi_{\lambda_{k}}}Q_{k,j}\leq P^{*}Q_{k,j}$, where $P^{*}$ is the operator with respect to the greedy policy of $Q_{k,j}$.

By recursion on $j$, we get,

QπλkQk,Jj=0J1γJj1(Pπλk)Jj1ϵk,j+γJ(Pπλk)J(QπλkQ0)Q^{\pi_{\lambda_{k}}}-Q_{k,J}\leq\sum_{j=0}^{J-1}\gamma^{J-j-1}(P^{\pi_{\lambda_{k}}})^{J-j-1}\epsilon_{k,j}+\gamma^{J}(P^{\pi_{\lambda_{k}}})^{J}(Q^{\pi_{\lambda_{k}}}-Q_{0}) (137)

using $TQ_{k,j}\geq T^{\pi^{*}}Q_{k,j}$ and $TQ_{k,j}\geq T^{\pi_{\lambda_{k}}}Q_{k,j}$, both of which follow from the definition of the operator $T$.

From this we obtain

\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(Q^{\pi_{\lambda_{k}}}-Q_{k,J})}\leq {\phi_{k}}\sum_{j=0}^{J-1}\gamma^{J-j-1}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}((P^{\pi_{\lambda_{k}}})^{J-j-1}\epsilon_{k,j})
+\gamma^{J}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(P^{\pi_{\lambda_{k}}})^{J}(Q^{\pi_{\lambda_{k}}}-Q_{0}) (138)

Taking the absolute value on both sides of Equation (138), we get

|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(Q^{\pi_{\lambda_{k}}}-Q_{k,J})}|\leq {\phi_{k}}\sum_{j=0}^{J-1}\gamma^{J-j-1}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}((P^{\pi_{\lambda_{k}}})^{J-j-1}|\epsilon_{k,j}|)
+\gamma^{J}\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}(P^{\pi_{\lambda_{k}}})^{J}(|Q^{\pi_{\lambda_{k}}}-Q_{0}|) (139)

For a fixed $j$, consider the term $\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}((P^{\pi_{\lambda_{k}}})^{J-j-1}|\epsilon_{k,j}|)$. Using Assumption 6, we write

|𝔼sdνπ,aπj((Pπλk)Jj1|ϵk,j|)|\displaystyle|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi_{j}}((P^{\pi_{\lambda_{k}}})^{J-j-1}|\epsilon_{k,j}|)| \displaystyle\leq d(Pπλk)Jj1μjdμk|ϵk,j𝑑μk|\displaystyle\Bigg{|}\Bigg{|}\frac{d{(P^{\pi_{\lambda_{k}}})^{J-j-1}\mu_{j}}}{d{\mu^{{}^{\prime}}_{k}}}\Bigg{|}\Bigg{|}_{\infty}\left|{\int}\epsilon_{k,j}d{\mu^{{}^{\prime}}_{k}}\right| (140)
\leq (\phi_{\mu^{\prime}_{k},\mu_{j}})\mathbb{E}_{(s,a)\sim\zeta^{\nu}_{\pi}(s,a)}(|\epsilon_{k,j}|) (141)

Here $\mu_{j}$ is the measure associated with the state action distribution given by sampling from $s\sim d_{\nu}^{\pi^{*}},a\sim{\pi^{*}}$ and then applying the operator $P^{\pi_{\lambda_{k}}}$ $J-j-1$ times. $\mu^{\prime}_{k}$ is the measure associated with the steady state action distribution given by $(s,a)\sim\zeta^{\nu}_{\pi}(s,a)$. Thus Equation (139) becomes

|\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(Q^{\pi_{\lambda_{k}}}-Q_{k,J})}|\leq \sum_{j=0}^{J-1}\gamma^{J-j-1}(\phi_{\mu^{\prime}_{k},\mu_{j}})\mathbb{E}_{(s,a)\sim\zeta^{\nu}_{\pi}(s,a)}(|\epsilon_{k,j}|)+\gamma^{J}\left(\frac{R_{max}}{1-\gamma}\right) (143)

We get the second term on the right hand side by noting that (QπλkQ0)Rmax1γ(Q^{\pi_{\lambda_{k}}}-Q_{0})\leq\frac{R_{max}}{1-\gamma}. Now splitting ϵk,j\epsilon_{k,j} as was done in Equation (31) we obtain

𝔼sdνπ,aπ(QπλkQk,j)\displaystyle\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(Q^{\pi_{\lambda_{k}}}-Q_{k,j})}\leq j=0J1γJj1((ϕμk,μj)𝔼|ϵk,j1|+(ϕμk,μj)𝔼|ϵk,j2|\displaystyle\sum_{j=0}^{J-1}\gamma^{J-j-1}\big{(}(\phi_{\mu^{{}^{\prime}}_{k},\mu_{j}})\mathbb{E}{|\epsilon^{1}_{k,j}|}+(\phi_{\mu^{{}^{\prime}}_{k},\mu_{j}})\mathbb{E}{|\epsilon^{2}_{k,j}|}
+(ϕμk,μj)𝔼|ϵk,j3|+(ϕμk,μj)𝔼|ϵk,j4|)+γJ(Rmax1γ)\displaystyle+(\phi_{\mu^{{}^{\prime}}_{k},\mu_{j}})\mathbb{E}{|\epsilon^{3}_{k,j}|}+(\phi_{\mu^{{}^{\prime}}_{k},\mu_{j}})\mathbb{E}{|\epsilon^{4}_{k,j}|}\big{)}+\gamma^{J}\left(\frac{R_{max}}{1-\gamma}\right) (144)

Now, using Lemmas 14, 15, 16, and 17, we have

\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(Q^{\pi_{\lambda_{k}}}-Q_{k,J})} \leq \sum_{j=0}^{J-1}\tilde{\mathcal{O}}\left(\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}}\right)+(\sqrt{\epsilon_{approx}}+{\epsilon_{|\tilde{D}|}})+\tilde{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right) (145)
+\displaystyle+ 𝒪~(γJ)\displaystyle\tilde{\mathcal{O}}(\gamma^{J})

Combining Equation (145) with the bound on $|\mathbb{E}(A^{\pi_{\lambda_{k}}}-A_{k,J})|$ derived above, we get

\mathbb{E}_{s\sim d_{\nu}^{\pi^{*}},a\sim\pi^{*}}{(A^{\pi_{\lambda_{k}}}-A_{k,J})} \leq \sum_{j=0}^{J-1}\tilde{\mathcal{O}}\left(\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}}\right)+(\sqrt{\epsilon_{approx}}+{\epsilon_{|\tilde{D}|}})+\tilde{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right) (146)
+\displaystyle+ 𝒪~(γJ)\displaystyle\tilde{\mathcal{O}}(\gamma^{J})

We now derive bounds on term $II$. From Theorem 2 of (9769873), we have that

wkw2\displaystyle||w_{k}-w^{*}||_{2} \displaystyle\leq 𝒪(log(nk,j)nk,j)\displaystyle\mathcal{O}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right) (147)

Now define the function $F_{k}(w)=\mathbb{E}_{s,a\sim\zeta^{\nu}_{\pi}}\big{(}A_{k,J}-w^{\top}{\nabla}\log\pi_{\lambda_{k}}(a|s)\big{)}^{2}$. From this definition we obtain

Fk(wk)Fk(w)lFkwkw2\displaystyle F_{k}(w_{k})-F_{k}(w^{*})\leq l_{F_{k}}||w_{k}-w^{*}||_{2} \displaystyle\leq 𝒪(log(nk,j)nk,j)\displaystyle\mathcal{O}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right) (148)

where $l_{F_{k}}$ is the Lipschitz constant of $F_{k}(w)$. Thus we obtain

Fk(wk)Fk(w)\displaystyle F_{k}(w_{k})-F_{k}(w^{*}) \displaystyle\leq 𝒪(log(nk,j)nk,j)\displaystyle\mathcal{O}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right) (149)

From Assumption 4 we have

Fk(wk)ϵbias\displaystyle F_{k}(w_{k})-\epsilon_{bias} \displaystyle\leq 𝒪(log(nk,j)nk,j)\displaystyle\mathcal{O}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right) (150)

which gives us

Fk(wk)\displaystyle F_{k}(w_{k}) \displaystyle\leq 𝒪(log(nk,j)nk,j)+ϵbias\displaystyle\mathcal{O}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right)+\epsilon_{bias} (151)

Plugging Equations (151) and (146) in Equation (123) we get

\min_{k\leq K}(V^{*}(\nu)-V^{\pi_{\lambda_{k}}}(\nu)) \leq \frac{1}{K}\sum_{k=1}^{K}(V^{*}(\nu)-V^{\pi_{\lambda_{k}}}(\nu)) (152)
\displaystyle\leq 𝒪(1K(1γ))+1K(1γ)k=1K1j=1J𝒪(loglog(nk,j)nk,j)\displaystyle{\mathcal{O}}\left(\frac{1}{\sqrt{K}(1-\gamma)}\right)\!+\!\frac{1}{K(1-\gamma)}\sum_{k=1}^{K-1}\sum_{j=1}^{J}\mathcal{O}\left(\frac{\log\log(n_{k,j})}{\sqrt{n_{k,j}}}\right)
+k=1K1j=1J𝒪(1Tk,j)+1K(1γ)k=1K1𝒪(γJ)\displaystyle+\sum_{k=1}^{K-1}\sum_{j=1}^{J}{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K-1}{\mathcal{O}}(\gamma^{J})
+k=1K1j=1J(log(nk,j)nk,j)+11γ(ϵbias+(ϵapprox)+ϵ|D~|)\displaystyle+\sum_{k=1}^{K-1}\sum_{j=1}^{J}\left(\frac{\log(n_{k,j})}{{n_{k,j}}}\right)+\frac{1}{1-\gamma}\left(\epsilon_{bias}+(\sqrt{\epsilon_{approx}})+{\epsilon_{|\tilde{D}|}}\right) (153)
\displaystyle\leq 𝒪(1K(1γ))+1K(1γ)k=1K1j=1J𝒪(loglog(nk,j)nk,j)\displaystyle{\mathcal{O}}\left(\frac{1}{\sqrt{K}(1-\gamma)}\right)\!+\!\frac{1}{K(1-\gamma)}\sum_{k=1}^{K-1}\sum_{j=1}^{J}\mathcal{O}\left(\frac{\log\log(n_{k,j})}{\sqrt{n_{k,j}}}\right)\
+k=1K1j=1J𝒪(1Tk,j)+1K(1γ)k=1K1𝒪(γJ)\displaystyle+\sum_{k=1}^{K-1}\sum_{j=1}^{J}{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right)+\frac{1}{K(1-\gamma)}\sum_{k=1}^{K-1}{\mathcal{O}}(\gamma^{J})
+11γ(ϵbias+(ϵapprox)+ϵ|D~|)\displaystyle+\frac{1}{1-\gamma}\left(\epsilon_{bias}+(\sqrt{\epsilon_{approx}})+{\epsilon_{|\tilde{D}|}}\right) (154)

 

F Proof of Supporting Lemmas

F.1 Proof Of Lemma 14

Proof 

Using Assumption 3 and the definition of $Q^{1}_{k,j}$ for a given iteration $k$ of Algorithm 1 and iteration $j$ of the first for loop, we have

𝔼(TπλkQk,j1Qk,j1)ν2ϵapprox\mathbb{E}(T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j})^{2}_{\nu}\leq\epsilon_{approx} (155)

where ν𝒫(𝒮×𝒜)\nu\in\mathcal{P}(\mathcal{S}{\times}{\mathcal{A}}).

Since |a|2=a2|a|^{2}=a^{2} we obtain

𝔼(|TπλkQk,j1Qk,j1|)ν2ϵapprox\mathbb{E}(|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|)^{2}_{\nu}\leq\epsilon_{approx} (156)

For a random variable $x$, $Var(x)=\mathbb{E}(x^{2})-(\mathbb{E}(x))^{2}$, hence $\mathbb{E}(x)=\sqrt{\mathbb{E}(x^{2})-Var(x)}$. Replacing $x$ with $|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|$, we get

𝔼(|TπλkQk,j1Qk,j1|)ν=𝔼(|TπλkQk,j1Qk,j1|)ν2Var(|TπλkQk,j1Qk,j1|)ν\mathbb{E}(|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|)_{\nu}=\sqrt{\mathbb{E}(|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|)^{2}_{\nu}-Var(|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|)_{\nu}} (157)

Therefore we get

\mathbb{E}(|T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}|)_{\nu}\leq\sqrt{\epsilon_{approx}} (158)

Since $\epsilon^{1}_{k,j}=T^{\pi_{\lambda_{k}}}Q_{k,j-1}-Q^{1}_{k,j}$, we have

\mathbb{E}(|\epsilon^{1}_{k,j}|)_{\nu}\leq\sqrt{\epsilon_{approx}} (159)

 

F.2 Proof Of Lemma 15

Proof  From Lemma 11, we have

argminfθ𝔼x,y(fθ(x)g(x,y))2=argminfθ(𝔼x,y(fθ(x)𝔼(g(y,x)|x))2)\operatorname*{arg\,min}_{f_{\theta}}\mathbb{E}_{x,y}\left(f_{\theta}(x)-g(x,y)\right)^{2}=\operatorname*{arg\,min}_{f_{\theta}}\left(\mathbb{E}_{x,y}\left(f_{\theta}(x)-\mathbb{E}(g(y^{{}^{\prime}},x)|x)\right)^{2}\right) (160)

We label $x$ as the state action pair $(s,a)$, the function $f_{\theta}(x)$ as $Q_{\theta}(s,a)$, and $g(x,y)$ as the function $r^{\prime}(s,a)+{\gamma}\sum_{a^{\prime}\in\mathcal{A}}\pi_{\lambda_{k}}(a^{\prime}|s^{\prime})Q_{k,j-1}(s^{\prime},a^{\prime})=r^{\prime}(s,a)+{\gamma}\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s^{\prime},a^{\prime})$, where $y$ is the two dimensional random variable $(r^{\prime}(s,a),s^{\prime})$ with $s\sim d^{\pi_{\lambda_{k}}}_{\nu}$, $a\sim\pi_{\lambda_{k}}(\cdot|s)$, $s^{\prime}\sim P(\cdot|s,a)$, and $r^{\prime}(s,a)\sim R(\cdot|s,a)$.

Then the loss function in (69) becomes

𝔼sdνπλk,aπλk(.|s),sP(s|s,a),r(s,a)(.|s,a)(Qθ(s,a)(r(s,a)+γ𝔼Qk,j1(s,a)))2\mathbb{E}_{s\sim d^{\pi_{{\lambda}_{k}}}_{\nu},a\sim\pi_{\lambda_{k}}(.|s),s^{{}^{\prime}}\sim P(s^{{}^{\prime}}|s,a),r(s,a)\sim\mathcal{R}(.|s,a)}(Q_{\theta}(s,a)-(r^{{}^{\prime}}(s,a)+{\gamma}\mathbb{E}Q_{k,j-1}(s^{{}^{\prime}},a^{{}^{\prime}})))^{2} (161)

Therefore, by Lemma 11, the function $Q_{\theta}(s,a)$ which minimizes Equation (161) also minimizes

𝔼sdνπλk,aπλk(Qθ(s,a)𝔼sP(s|s,a),r(.|s,a))(r(s,a)+γ𝔼Qk,j1(s,a)|s,a))2\mathbb{E}_{s\sim d^{\pi_{{\lambda}_{k}}}_{\nu},a\sim\pi_{\lambda_{k}}}(Q_{\theta}(s,a)-\mathbb{E}_{s^{{}^{\prime}}\sim P(s^{{}^{\prime}}|s,a),r\sim\mathcal{R}(.|s,a))}(r^{{}^{\prime}}(s,a)+{\gamma}\mathbb{E}Q_{k,j-1}(s^{{}^{\prime}},a^{{}^{\prime}})|s,a))^{2} (162)

But from the definition of the Bellman operator in Equation (51) we have

𝔼sP(s|s,a),rR(.|s,a))(r(s,a)+γ𝔼Qk,j1(s,a)|s,a)=TπλkQk,j1\mathbb{E}_{s^{{}^{\prime}}\sim P(s^{{}^{\prime}}|s,a),r\sim{R}(.|s,a))}(r^{{}^{\prime}}(s,a)+{\gamma}\mathbb{E}Q_{k,j-1}(s^{{}^{\prime}},a^{{}^{\prime}})|s,a)=T^{\pi_{\lambda_{k}}}Q_{k,j-1} (163)

Combining Equations (162) and (163), we get

argminQθ𝔼(Qθ(s,a)(r(s,a)+γ𝔼Qk,j1(s,a)))2=argminQθ𝔼(Qθ(s,a)TπλkQk,j1)2\operatorname*{arg\,min}_{Q_{\theta}}\mathbb{E}(Q_{\theta}(s,a)-(r(s,a)+{\gamma}\mathbb{E}Q_{k,j-1}(s^{{}^{\prime}},a^{{}^{\prime}})))^{2}=\operatorname*{arg\,min}_{Q_{\theta}}\mathbb{E}(Q_{\theta}(s,a)-T^{\pi_{\lambda_{k}}}Q_{k,j-1})^{2} (164)

The left hand side of Equation (164) is Qk,j2Q^{2}_{k,j} as defined in Definition 3 and the right hand side is Qk,j1Q^{1}_{k,j} as defined in Definition 2, which gives us

Qk,j2=Qk,j1Q^{2}_{k,j}=Q^{1}_{k,j} (165)

 

F.3 Proof Of Lemma 16

Proof 

We define RXk,j,Qk,j1(θ)R_{X_{k,j},Q_{k,j-1}}({\theta}) as

RXk,j,Qk,j1(θ)=1|Xk,j|(si,ai)Xk,j(Qθ(si,ai)(r(si,ai)+γ𝔼aπλkQk,j1(si+1,a)))2,R_{X_{k,j},Q_{k,j-1}}({\theta})=\frac{1}{|X_{k,j}|}\sum_{(s_{i},a_{i})\in X_{k,j}}\Bigg{(}Q_{\theta}(s_{i},a_{i})-\Bigg{(}r(s_{i},a_{i})\\ +\gamma\mathbb{E}_{a^{{}^{\prime}}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{{}^{\prime}})\Bigg{)}\Bigg{)}^{2},

Here, Xk,j={si,ai}i={1,,|Xk,j|}X_{k,j}=\{s_{i},a_{i}\}_{i=\{1,\cdots,|X_{k,j}|\}}, where s,as,a are sampled from a Markov chain whose stationary distribution is, sdνπλk,aπλks\sim d_{\nu}^{\pi_{\lambda_{k}}},a\sim\pi_{\lambda_{k}}. QθQ_{\theta} is as defined in Equation (10) and Qk,j1Q_{k,j-1} is the estimate of the QQ function obtained at iteration kk of the outer for loop and iteration j1j-1 of the first inner for loop of Algorithm 1.

We also define the term

L_{Q_{k,j-1}}(Q_{\theta})=\mathbb{E}\Big{(}Q_{\theta}(s,a)-\big{(}r^{\prime}(s,a)+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s^{\prime},a^{\prime})\big{)}\Big{)}^{2} (166)

where $s\sim d^{\pi_{\lambda_{k}}}_{\nu}$, $a\sim\pi_{\lambda_{k}}(\cdot|s)$, $s^{\prime}\sim P(\cdot|s,a)$, and $r(s,a)\sim R(\cdot|s,a)$.

We denote by θk,j2,θk,j3\theta^{2}_{k,j},\theta^{3}_{k,j} the parameters of the neural networks Qk,j2,Qk,j3Q^{2}_{k,j},Q^{3}_{k,j} respectively. Qk,j2,Qk,j3Q^{2}_{k,j},Q^{3}_{k,j} are defined in Definition 3 and 4 respectively.

We then obtain,

R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j})-R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j}) \leq R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j})-R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j})
+L_{Q_{k,j-1}}(\theta^{3}_{k,j})-L_{Q_{k,j-1}}(\theta^{2}_{k,j})
= \big{(}R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j})-L_{Q_{k,j-1}}(\theta^{2}_{k,j})\big{)}
-\big{(}R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j})-L_{Q_{k,j-1}}(\theta^{3}_{k,j})\big{)}
\leq \underbrace{|R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j})-L_{Q_{k,j-1}}(\theta^{2}_{k,j})|}_{I}
+\underbrace{|R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j})-L_{Q_{k,j-1}}(\theta^{3}_{k,j})|}_{II}

We get the first inequality above because $L_{Q_{k,j-1}}(\theta^{3}_{k,j})-L_{Q_{k,j-1}}(\theta^{2}_{k,j})\geq 0$, as $Q^{2}_{k,j}$ is the minimizer of the loss function $L_{Q_{k,j-1}}(Q_{\theta})$.

Consider Lemma 10. The loss function $R_{X_{k,j},Q_{k,j-1}}(\theta)$ can be written as the mean of loss functions of the form $l(a_{\theta}(s_{i},a_{i}),y_{i})$, where $l$ is the squared-error function, $a_{\theta}(s_{i},a_{i})=Q_{\theta}(s_{i},a_{i})$, and $y_{i}=\big{(}r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime})\big{)}$. Thus we have

𝔼supθΘ|RXk,j,Qk,j1(θ)LQk,j1(θ)|\displaystyle\mathbb{E}\sup_{\theta\in\Theta}|R_{X_{k,j},Q_{k,j-1}}(\theta)-L_{Q_{k,j-1}}(\theta)|\leq (170)
2η𝔼(Rad(𝒜{(s1,a1),(s2,a2),(s3,a3),,(sn,an)}))\displaystyle 2{\eta}^{{}^{\prime}}\mathbb{E}\left(Rad(\mathcal{A}\circ\{(s_{1},a_{1}),(s_{2},a_{2}),(s_{3},a_{3}),\cdots,(s_{n},a_{n})\})\right)

Note that the expectation is over all $(s_{i},a_{i})$, where $n_{k,j}=|X_{k,j}|$, $\mathcal{A}\circ\{(s_{1},a_{1}),(s_{2},a_{2}),\cdots,(s_{n},a_{n})\}=\{Q_{\theta}(s_{1},a_{1}),Q_{\theta}(s_{2},a_{2}),\cdots,Q_{\theta}(s_{n},a_{n})\}$, and $\eta^{\prime}$ is the Lipschitz constant of the squared-error loss over the state action space $[0,1]^{d}$. The expectation is with respect to $s_{i}\sim d^{\pi_{\lambda_{k}}}_{\nu}$, $a_{i}\sim\pi_{\lambda_{k}}(\cdot|s_{i})$, $s_{i+1}\sim P(\cdot|s_{i},a_{i})$, and $r_{i}\sim R(\cdot|s_{i},a_{i})$ for $i\in(1,\cdots,n_{k,j})$.

From proposition 11 of (article) we have that

(Rad(𝒜{(s1,a1),(s2,a2),(s3,a3),,(sn,an)}))Cklog(log(nk,j))nk,j\left(Rad(\mathcal{A}\circ\{(s_{1},a_{1}),(s_{2},a_{2}),(s_{3},a_{3}),\cdots,(s_{n},a_{n})\})\right)\leq C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} (171)

Note that Proposition 11 of (article) establishes an upper bound on the Rademacher complexity using Theorem 4 of (article) with the aim of applying it to the Metropolis-Hastings algorithm. For our purposes, we only use the upper bound on the Rademacher complexity established in Proposition 11 of (article).

We use this result because the state action pairs are drawn not from the stationary distribution of the policy $\pi_{\lambda_{k}}$ but from a Markov chain with the same stationary distribution. Thus we have

𝔼|(RXk,j,Qk,j1(θk,j2))LQk,j1(θk,j2)|Cklog(log(nk,j))nk,j\displaystyle\mathbb{E}|(R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j}))-L_{Q_{k,j-1}}(\theta^{2}_{k,j})|\leq C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} (172)

The same argument can be applied for Qk,j3Q^{3}_{k,j} to get

𝔼|(RXk,j,Qk,j1(θk,j3))LQk,j1(θk,j3)|Cklog(log(nk,j))nk,j\displaystyle\mathbb{E}|(R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j}))-L_{Q_{k,j-1}}(\theta^{3}_{k,j})|\leq C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} (173)

Then we have

𝔼(RXk,j,Qk,j1(θk,j2)RXk,j,Qk,j1(θk,j3))Cklog(log(nk,j))nk,j\mathbb{E}\left(R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j})-R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j})\right)\leq C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} (174)

Plugging in the definition of RXk,j,Qk,j1(θk,j2),RXk,j,Qk,j1(θk,j3)R_{X_{k,j},Q_{k,j-1}}(\theta^{2}_{k,j}),R_{X_{k,j},Q_{k,j-1}}(\theta^{3}_{k,j}) in equation (174) and denoting Cklog(log(nk,j))nk,jC_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} as ϵ\epsilon we get

\frac{1}{n_{k,j}}\sum_{i=1}^{n_{k,j}}\Big{(}\mathbb{E}\big{(}Q^{2}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2}
-\mathbb{E}\big{(}Q^{3}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2}\Big{)}\leq\epsilon (175)

Now for a fixed ii consider the term αi\alpha_{i} defined as.

\alpha_{i}=\mathbb{E}_{s_{i+1}\sim P(\cdot|s_{i},a_{i})}\big{(}Q^{2}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2}
-\mathbb{E}_{s_{i+1}\sim P(\cdot|s_{i},a_{i})}\big{(}Q^{3}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2} (176)

where si,ais_{i},a_{i} are drawn from the state action distribution at the ithi^{th} step of the Markov chain induced by following the policy πλk\pi_{\lambda_{k}}.

Now for a fixed ii consider the term βi\beta_{i} defined as.

\beta_{i}=\mathbb{E}_{s_{i+1}\sim P(\cdot|s_{i},a_{i})}\big{(}Q^{2}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2}
-\mathbb{E}_{s_{i+1}\sim P(\cdot|s_{i},a_{i})}\big{(}Q^{3}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\gamma\mathbb{E}_{a^{\prime}\sim\pi_{\lambda_{k}}}Q_{k,j-1}(s_{i+1},a^{\prime}))\big{)}^{2} (177)

where s_{i},a_{i} are drawn from the steady state action distribution, with s_{i}\sim d_{\nu}^{\pi_{\lambda_{k}}} and a_{i}\sim\pi_{\lambda_{k}}. Note here that \alpha_{i} and \beta_{i} are the same function, with only the state action pairs being drawn from different distributions.

Using these definitions we obtain

|\mathbb{E}(\alpha_{i})-\mathbb{E}(\beta_{i})| \leq \sup_{(s_{i},a_{i})}|2\cdot\max(\alpha_{i},\beta_{i})|\,\kappa_{i}  (178)
\leq \left(\frac{4R}{1-\gamma}\right)^{2}m\rho^{i}  (179)

We obtain Equation (178) by using the identity |\int f\,d\mu-\int f\,d\nu|\leq|\max_{\mathcal{S}\times\mathcal{A}}(f)|\int_{\mathcal{S}\times\mathcal{A}}|d\mu-d\nu|=2\,|\max_{\mathcal{S}\times\mathcal{A}}(f)|\,d_{TV}(\mu,\nu), where \mu and \nu are two \sigma-finite state action probability measures and f is a bounded measurable function. We have used \kappa_{i} to denote the total variation distance between the steady state action distribution, given by s_{i}\sim d_{\nu}^{\pi_{\lambda_{k}}}, a_{i}\sim\pi_{\lambda_{k}}, and the state action distribution at the i^{th} step of the Markov chain induced by following the policy \pi_{\lambda_{k}}. The expectation is with respect to (s_{i},a_{i}). We obtain Equation (179) from Equation (178) by using Assumption 5 and the fact that \alpha_{i} and \beta_{i} are upper bounded by \left(\frac{4R}{1-\gamma}\right)^{2}.
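
As a quick sanity check of this inequality (with the convention d_{TV}(\mu,\nu)=\frac{1}{2}\int|d\mu-d\nu|), the following minimal sketch compares the two sides for two arbitrary discrete distributions; the support size, the distributions and the function f below are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
support_size = 20
mu = rng.random(support_size); mu /= mu.sum()    # two arbitrary probability vectors
nu = rng.random(support_size); nu /= nu.sum()
f = rng.uniform(-3.0, 3.0, size=support_size)    # a bounded function on the common support

lhs = abs(f @ mu - f @ nu)                       # |integral of f d(mu) - integral of f d(nu)|
tv = 0.5 * np.abs(mu - nu).sum()                 # total variation distance d_TV(mu, nu)
rhs = 2.0 * np.max(np.abs(f)) * tv               # 2 * max|f| * d_TV(mu, nu)
assert lhs <= rhs + 1e-12
print(lhs, rhs)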

From equation (179) we get

\mathbb{E}(\alpha_{i}) \geq \mathbb{E}(\beta_{i})-4\left(\frac{R}{1-\gamma}\right)^{2}m\rho^{i}  (180)

We get Equation (180) from Equation (179) using the fact that |a-b|\leq c implies -c\leq(a-b)\leq c, which in turn implies a\geq b-c.

Using Equation (180) in Equation (175) we get

\frac{1}{n_{k,j}}\sum_{i=1}^{n_{k,j}}\Big(\mathbb{E}\big(Q^{2}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\mathbb{E}_{a^{\prime}\sim\pi^{\lambda_{k}}}Q^{2}_{k,j}(s_{i+1},a^{\prime}))\big)^{2}
-\mathbb{E}\big(Q^{3}_{k,j}(s_{i},a_{i})-(r(s_{i},a_{i})+\mathbb{E}_{a^{\prime}\sim\pi^{\lambda_{k}}}Q^{3}_{k,j}(s_{i+1},a^{\prime}))\big)^{2}\Big)
\leq \epsilon+\frac{1}{n_{k,j}}\sum_{i=1}^{n_{k,j}}4\left(\frac{R}{1-\gamma}\right)^{2}m\rho^{i}
\leq \epsilon+\frac{1}{n_{k,j}}\,4\left(\frac{R}{1-\gamma}\right)^{2}\frac{m}{1-\rho}

In the equation above, (s_{i},a_{i}) are now drawn from s_{i}\sim d_{\nu}^{\pi_{\lambda_{k}}} and a_{i}\sim\pi_{\lambda_{k}} for all i, and the last inequality uses the geometric series bound \sum_{i=1}^{n_{k,j}}\rho^{i}\leq\frac{\rho}{1-\rho}\leq\frac{1}{1-\rho} (since 0<\rho<1).

We ignore the second term on the right hand side, as it is \tilde{\mathcal{O}}\left(\frac{1}{n_{k,j}}\right) compared to the first term, which is \tilde{\mathcal{O}}\left(\frac{\log\log(n_{k,j})}{\sqrt{n_{k,j}}}\right). Additionally, the expectation in the equation above is with respect to s_{i}\sim d^{\pi_{\lambda_{k}}}_{\nu}, a_{i}\sim\pi_{\lambda_{k}}(\cdot|s_{i}), r_{i}\sim R(\cdot|s_{i},a_{i}) and s_{i+1}\sim P(\cdot|s_{i},a_{i}).

Since now we have s_{i}\sim d_{\nu}^{\pi_{\lambda_{k}}} and a_{i}\sim\pi_{\lambda_{k}} for all i, the equation above is equivalent to

\mathbb{E}\,\underbrace{(Q^{2}_{k,j}(s,a)-Q^{3}_{k,j}(s,a))}_{A1}\underbrace{\left(Q^{2}_{k,j}(s,a)+Q^{3}_{k,j}(s,a)-2\left(r(s,a)+\gamma\max_{a\in\mathcal{A}}Q_{k,j-1}(s^{\prime},a)\right)\right)}_{A2} \leq \epsilon

where the expectation is now over s\sim d_{\nu}^{\pi_{\lambda_{k}}}, a\sim\pi_{\lambda_{k}}, r(s,a)\sim R(\cdot|s,a) and s^{\prime}\sim P(\cdot|s,a). We re-write this as

\int\underbrace{(Q^{2}_{k,j}(s,a)-Q^{3}_{k,j}(s,a))}_{A1}\underbrace{\left(Q^{2}_{k,j}(s,a)+Q^{3}_{k,j}(s,a)-2\left(r(s,a)+\gamma\max_{a\in\mathcal{A}}Q_{k,j-1}(s^{\prime},a)\right)\right)}_{A2}\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})\leq\epsilon.  (183)

Here \mu_{1} is the state action distribution with s\sim d_{\nu}^{\pi_{\lambda_{k}}}, a\sim\pi_{\lambda_{k}}, and \mu_{2}, \mu_{3} are the measures of r and s^{\prime} respectively.

Now, for the integral in Equation (183), we split the integral into four different integrals, each over the set of (s,a),r,s^{\prime} corresponding to one of the four combinations of signs of A1 and A2:

\int_{\{(s,a),r,s^{\prime}\}:A1\geq 0,A2\geq 0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})
+\int_{\{(s,a),r,s^{\prime}\}:A1<0,A2<0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})
+\int_{\{(s,a),r,s^{\prime}\}:A1\geq 0,A2<0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})
+\int_{\{(s,a),r,s^{\prime}\}:A1<0,A2\geq 0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})\leq\epsilon  (184)
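
The decomposition above only partitions the domain of integration according to the signs of A1 and A2. As a minimal numeric sanity check of this bookkeeping (with arbitrary illustrative arrays standing in for A1 and A2 evaluated on common samples from the product measure):

import numpy as np

rng = np.random.default_rng(3)
a1 = rng.normal(size=10_000)   # stand-in for A1 evaluated on the samples
a2 = rng.normal(size=10_000)   # stand-in for A2 on the same samples
total = np.mean(a1 * a2)
parts = [np.mean(a1 * a2 * ((a1 >= 0) & (a2 >= 0))),   # both non-negative: contribution >= 0
         np.mean(a1 * a2 * ((a1 < 0) & (a2 < 0))),     # both negative:     contribution >= 0
         np.mean(a1 * a2 * ((a1 >= 0) & (a2 < 0))),    # mixed signs:       contribution <= 0
         np.mean(a1 * a2 * ((a1 < 0) & (a2 >= 0)))]    # mixed signs:       contribution <= 0
assert np.isclose(total, sum(parts))                   # the four sign regions partition the expectation
print(total, parts)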

Now note that the first two terms are non-negative and the last two terms are non-positive. We write the first two terms as

\int_{\{(s,a),r,s^{\prime}\}:A1\geq 0,A2\geq 0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime}) = C_{{k,j}_{1}}\int|Q^{2}_{k,j}-Q^{3}_{k,j}|\,d\mu_{1} = C_{{k,j}_{1}}\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}}
\int_{\{(s,a),r,s^{\prime}\}:A1<0,A2<0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime}) = C_{{k,j}_{2}}\int|Q^{2}_{k,j}-Q^{3}_{k,j}|\,d\mu_{1} = C_{{k,j}_{2}}\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}}

We write the last two terms as

\int_{\{(s,a),r,s^{\prime}\}:A1\geq 0,A2<0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})=-C_{{k,j}_{3}}\epsilon  (187)
\int_{\{(s,a),r,s^{\prime}\}:A1<0,A2\geq 0}(A1)(A2)\,d\mu_{1}(s,a)\,d\mu_{2}(r)\,d\mu_{3}(s^{\prime})=-C_{{k,j}_{4}}\epsilon  (188)

Here C_{{k,j}_{1}}, C_{{k,j}_{2}}, C_{{k,j}_{3}} and C_{{k,j}_{4}} are positive constants. Plugging the above expressions for the first two terms, together with Equations (187) and (188), into Equation (183), we obtain

(C_{{k,j}_{1}}+C_{{k,j}_{2}})\,\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}}-(C_{{k,j}_{3}}+C_{{k,j}_{4}})\epsilon \leq \epsilon  (189)

which implies

\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}} \leq \left(\frac{1+C_{{k,j}_{3}}+C_{{k,j}_{4}}}{C_{{k,j}_{1}}+C_{{k,j}_{2}}}\right)\epsilon  (191)

Substituting \epsilon=C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}} back into Equation (191), we have

\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}} \leq \left(\frac{1+C_{{k,j}_{3}}+C_{{k,j}_{4}}}{C_{{k,j}_{1}}+C_{{k,j}_{2}}}\right)C_{k}\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}}  (193)

which implies

\mathbb{E}(|Q^{2}_{k,j}-Q^{3}_{k,j}|)_{\mu_{1}} \leq \tilde{\mathcal{O}}\left(\frac{\log(\log(n_{k,j}))}{\sqrt{n_{k,j}}}\right)  (195)

 

F.4 Proof Of Lemma 4

Proof  For a given iteration k of Algorithm 1 and iteration j of the first inner for loop, the optimization problem to be solved in Algorithm 2 is the following

\mathcal{L}(\theta)=\frac{1}{n_{k,j}}\sum_{i=1}^{n_{k,j}}\left(Q_{\theta}(s_{i},a_{i})-\left(r(s_{i},a_{i})+\gamma\max_{a^{\prime}\in\mathcal{A}}Q_{k,j-1}(s^{\prime},a^{\prime})\right)\right)^{2}  (197)

Here, Q_{k,j-1} is the estimate of the Q function from iteration j-1, and the state action pairs (s_{i},a_{i})_{i=\{1,\cdots,n\}} have been sampled from a distribution over the state action pairs denoted by \nu. Since \min_{\theta}\mathcal{L}(\theta) is a non-convex optimization problem, we instead solve the equivalent convex problem given by

u_{k,j}^{*} = \operatorname*{arg\,min}_{u}g_{k,j}(u)=\operatorname*{arg\,min}_{u}||\sum_{D_{i}\in\tilde{D}}D_{i}X_{k,j}u_{i}-y_{k}||^{2}_{2}  (199)
\textit{subject to}\ |u|_{1}\leq\frac{R_{\max}}{1-\gamma}

Here, X_{k,j}\in\mathbb{R}^{n_{k,j}\times d} is the matrix of sampled state action pairs at iteration k, and y_{k}\in\mathbb{R}^{n_{k,j}\times 1} is the vector of target values at iteration k. \tilde{D} is the set of diagonal matrices obtained from line 2 of Algorithm 2, and u\in\mathbb{R}^{|\tilde{D}|d\times 1} (note that we are treating u as a vector here for notational convenience, instead of a matrix as was done in Section 4).

The constraint in Equation (199) ensures that all the co-ordinates of the vector \sum_{D_{i}\in\tilde{D}}D_{i}X_{k,j}u_{i} are upper bounded by \frac{R_{\max}}{1-\gamma} in absolute value: since every element of X_{k,j} lies between 0 and 1 and each D_{i} is a diagonal matrix with entries in \{0,1\}, Hölder's inequality gives |(\sum_{D_{i}\in\tilde{D}}D_{i}X_{k,j}u_{i})_{j}|\leq\sum_{i}||u_{i}||_{1}=|u|_{1}\leq\frac{R_{\max}}{1-\gamma} for every co-ordinate j. This ensures that the corresponding neural network represented by Equation (10) is also upper bounded by \frac{R_{\max}}{1-\gamma}. We use projected gradient descent to solve the constrained convex optimization problem, which can be written as

u_{k,j}^{*} = \operatorname*{arg\,min}_{u:|u|_{1}\leq\frac{R_{\max}}{1-\gamma}}g_{k,j}(u)=\operatorname*{arg\,min}_{u:|u|_{1}\leq\frac{R_{\max}}{1-\gamma}}||\sum_{D_{i}\in\tilde{D}}D_{i}X_{k,j}u_{i}-y_{k}||^{2}_{2}  (200)

From the notes of Ang (2017), “Continuous Optimization”, https://angms.science/doc/CVX, we have that if the step size is \alpha=\frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}}, then after T_{k,j} iterations of the projected gradient descent algorithm we obtain

g_{k,j}(u_{T_{k,j}})-g_{k,j}(u^{*})\leq L_{k,j}\frac{||u_{k,j}^{*}||_{2}}{\sqrt{T_{k,j}+1}}  (201)

where L_{k,j} is the Lipschitz constant of g_{k,j}(u) and u_{T_{k,j}} is the parameter estimate at step T_{k,j}.

Therefore, if the number of iterations T_{k,j} of the projected gradient descent algorithm and the step-size \alpha satisfy

T_{k,j} \geq L_{k,j}^{2}||u_{k,j}^{*}||^{2}_{2}\epsilon^{-2}-1,  (202)
\alpha = \frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}},  (203)

we have

g_{k,j}(u_{T_{k,j}})-g_{k,j}(u^{*})\leq\epsilon  (204)
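
To make the procedure concrete, here is a minimal Python sketch (not the paper's implementation) of projected gradient descent on a convex objective of the form g_{k,j}(u)=||\sum_{D_{i}\in\tilde{D}}D_{i}X_{k,j}u_{i}-y_{k}||_{2}^{2} with an \ell_{1}-ball constraint. The diagonal 0/1 matrices below are an illustrative stand-in for \tilde{D}, the radius stands in for \frac{R_{\max}}{1-\gamma}, and a constant step size 1/L (with L the gradient Lipschitz constant of this quadratic) is used instead of the schedule in Equations (202) and (203); all sizes and names are assumptions made for the example.

import numpy as np

def project_l1_ball(v, radius):
    # Euclidean projection of v onto {u : ||u||_1 <= radius} (sort-based procedure).
    if np.abs(v).sum() <= radius:
        return v
    w = np.sort(np.abs(v))[::-1]
    css = np.cumsum(w)
    k = np.nonzero(w * np.arange(1, v.size + 1) > (css - radius))[0][-1]
    tau = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
n, d, P = 200, 6, 8                                    # samples, feature dimension, number of patterns
X = rng.uniform(0.0, 1.0, size=(n, d))                 # state action features in [0, 1]^d
y = rng.uniform(0.0, 1.0, size=n)                      # regression targets
gates = rng.normal(size=(d, P))
D = [np.diag((X @ gates[:, i] >= 0).astype(float)) for i in range(P)]  # illustrative diagonal patterns

A = np.hstack([D[i] @ X for i in range(P)])            # A u = sum_i D_i X u_i with u = (u_1, ..., u_P)
g = lambda u: np.sum((A @ u - y) ** 2)                 # convex objective g(u)
grad_g = lambda u: 2.0 * A.T @ (A @ u - y)

radius = 10.0                                          # stands in for R_max / (1 - gamma)
L = 2.0 * np.linalg.norm(A, 2) ** 2                    # gradient Lipschitz constant of g
alpha = 1.0 / L
u = np.zeros(P * d)
for _ in range(2000):                                  # projected gradient descent
    u = project_l1_ball(u - alpha * grad_g(u), radius)
print(g(u))

The fixed 1/L step exploits smoothness of the quadratic; the schedule in Equations (202) and (203) is instead the one tied to the guarantee of Equation (201).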

Let (v^{*}_{i},w^{*}_{i})_{i\in(1,\cdots,|\tilde{D}|)} and (v^{T_{k,j}}_{i},w^{T_{k,j}}_{i})_{i\in(1,\cdots,|\tilde{D}|)} be defined as

(v^{*}_{i},w^{*}_{i})_{i\in(1,\cdots,|\tilde{D}|)}=\psi_{i}^{\prime}(u^{*}_{i})_{i\in(1,\cdots,|\tilde{D}|)}  (205)
(v^{T_{k,j}}_{i},w^{T_{k,j}}_{i})_{i\in(1,\cdots,|\tilde{D}|)}=\psi_{i}^{\prime}(u^{T_{k,j}}_{i})_{i\in(1,\cdots,|\tilde{D}|)}  (206)

where \psi^{\prime} is defined in Equation (49).

Further, we define \theta_{|\tilde{D}|}^{*} and \theta^{T_{k,j}} as

\theta_{|\tilde{D}|}^{*}=\psi(v^{*}_{i},w^{*}_{i})_{i\in(1,\cdots,|\tilde{D}|)}  (207)
\theta^{T_{k,j}}=\psi(v^{T_{k,j}}_{i},w^{T_{k,j}}_{i})_{i\in(1,\cdots,|\tilde{D}|)}  (208)

where \psi is defined in Equation (44) and \theta_{|\tilde{D}|}^{*}=\operatorname*{arg\,min}_{\theta}\mathcal{L}_{|\tilde{D}|}(\theta), with \mathcal{L}_{|\tilde{D}|}(\theta) defined in Appendix A.

Since g_{k,j}(u_{T_{k,j}})-g_{k,j}(u^{*})\leq\epsilon, by Lemma 12 we have

\mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*})\leq\epsilon  (209)

Note that \mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*}) is a constant value. Thus we can always find a constant C^{\prime}_{k,j} such that

C_{k,j}^{\prime}|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*})  (210)
|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\frac{\mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*})}{C_{k,j}^{\prime}}  (211)

Therefore if we have

T_{k,j} \geq L_{k,j}^{2}||u_{k,j}^{*}||^{2}_{2}\epsilon^{-2}-1,  (212)
\alpha_{k,j} = \frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}},  (213)

then we have

|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\frac{\epsilon}{C_{k,j}^{\prime}}  (214)

This follows because, combining Equations (209) and (210), we have

C_{k,j}^{\prime}|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*})\leq\epsilon  (215)

Dividing Equation (215) by C_{k,j}^{\prime} we get

|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\frac{\mathcal{L}_{|\tilde{D}|}(\theta^{T_{k,j}})-\mathcal{L}_{|\tilde{D}|}(\theta_{|\tilde{D}|}^{*})}{C_{k,j}^{\prime}}\leq\frac{\epsilon}{C_{k,j}^{\prime}}  (216)

which implies

|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}\leq\frac{\epsilon}{C_{k,j}^{\prime}}  (217)

Assuming \epsilon is small enough that \frac{\epsilon}{C_{k,j}^{\prime}}<1, Lemma 13 implies that there exists an L_{k,j}>0 such that

|Q_{\theta^{T_{k,j}}}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)| \leq L_{k,j}|\theta^{T_{k,j}}-\theta_{|\tilde{D}|}^{*}|_{1}  (218)
\leq \frac{L_{k,j}\epsilon}{C_{k,j}^{\prime}}  (219)

for all (s,a)\in\mathcal{S}\times\mathcal{A}. Equation (219) implies that if

T_{k,j} \geq L_{k,j}^{2}||u_{k,j}^{*}||^{2}_{2}\epsilon^{-2}-1,  (220)
\alpha_{k,j} = \frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}},  (221)

then we have

\mathbb{E}(|Q_{\theta^{T_{k,j}}}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|) \leq \frac{L_{k,j}\epsilon}{C_{k,j}^{\prime}}  (222)

By the definition in Section E, Q_{k,j} is our estimate of the Q function at the k^{th} iteration of Algorithm 1, and thus we have Q_{\theta^{T_{k,j}}}=Q_{k,j}, which implies that

\mathbb{E}(|Q_{k,j}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|) \leq \frac{L_{k,j}\epsilon}{C_{k,j}^{\prime}}  (223)

If we replace \epsilon by \frac{C^{\prime}_{k,j}\epsilon}{L_{k,j}} in Equation (222), we get that if

T_{k,j} \geq \left(\frac{C^{\prime}_{k,j}\epsilon}{L_{k,j}}\right)^{-2}L_{k,j}^{2}||u_{k,j}^{*}||^{2}_{2}-1,  (224)
\alpha_{k,j} = \frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}},  (225)

we have

\mathbb{E}(|Q_{k,j}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|) \leq \epsilon  (226)

From Assumption 2, we have that

\mathbb{E}(|Q_{\theta^{*}}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|) \leq \epsilon_{|\tilde{D}|}  (227)

where \theta^{*}=\operatorname*{arg\,min}_{\theta\in\Theta}\mathcal{L}(\theta) and, by the definition of Q^{3}_{k,j} in Definition 7, we have Q^{3}_{k,j}=Q_{\theta^{*}}. Therefore if we have

T_{k,j} \geq \left(\frac{C^{\prime}_{k,j}\epsilon}{L_{k,j}}\right)^{-2}L_{k,j}^{2}||u_{k,j}^{*}||^{2}_{2}-1,  (228)
\alpha_{k,j} = \frac{||u^{*}_{k,j}||_{2}}{L_{k,j}\sqrt{T_{k,j}+1}},  (229)

we have

\mathbb{E}(|Q_{k,j}(s,a)-Q^{3}_{k,j}(s,a)|)_{\nu} \leq \mathbb{E}(|Q_{k,j}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|)+\mathbb{E}(|Q^{3}_{k,j}(s,a)-Q_{\theta_{|\tilde{D}|}^{*}}(s,a)|)
\leq \epsilon+\epsilon_{|\tilde{D}|}  (231)

This implies

\mathbb{E}(|Q_{k,j}(s,a)-Q^{3}_{k,j}(s,a)|) \leq \tilde{\mathcal{O}}\left(\frac{1}{\sqrt{T_{k,j}}}\right)+\epsilon_{|\tilde{D}|}  (232)