
Convergence of Q-value in case of Gaussian rewards

Konatsu Miyamoto (Dynamic Pricing Technology, Osaka University), Masaya Suzuki (Kindai University, Dynamic Pricing Technology), Yuma Kigami (MatrixFlow), Kodai Satake (Osaka University, Dynamic Pricing Technology)
Abstract

In this paper, as a study of reinforcement learning, we prove convergence of the Q-function for unbounded rewards such as Gaussian-distributed rewards. By the central limit theorem, it is natural in some real-world applications to assume that rewards follow a Gaussian distribution, but existing proofs cannot guarantee convergence of the Q-function in that case. Furthermore, in distributional reinforcement learning and Bayesian reinforcement learning, which have become popular in recent years, it is desirable to allow the reward to follow a Gaussian distribution. Therefore, in this paper, we prove convergence of the Q-function under the condition $E[r(s,a)^{2}]<\infty$, which is much weaker than in existing work. Finally, as a supplement, a proof of the policy gradient theorem for distributional reinforcement learning is also included.

1 Introduction

In recent years, Reinforcement Learning (RL) has attracted considerable attention. Standard RL methods based on Markov decision processes use state-action value functions [1]. Agents produced by these algorithms follow policies that maximize the expected value of the cumulative reward. In practical use, however, there are many situations where it is necessary to consider not only expected values but also risk. Distributional Reinforcement Learning (DRL), which considers the distribution of the cumulative reward, has therefore also been studied. DRL research includes a particle-based, risk-sensitive algorithm [2]. Related work includes [3][4], which are mathematically equivalent to [2] but use different algorithms, as well as parametric methods [5]. [4] discusses the convergence of measures over discrete steps. Another way to practice DRL is the Bayesian approach. In [22] it is regarded as an estimation of the uncertainty of the expected value, but in fact Bayesian inference can approximate the distribution of an uncertain quantity and can therefore perform distributional reinforcement learning. Among other existing papers on Bayesian reinforcement learning, we take up [6][7] here: they use Gaussian processes, so the reward can be said to follow a Gaussian distribution. [5] also supports unbounded rewards such as Gaussian-distributed ones. We want to show that the approximation of the cumulative reward distribution converges even for unbounded rewards. In this paper, as a preliminary step, we prove convergence of the ordinary state-action value function. In addition, we carry out the convergence proof for Q-functions on a continuous domain, with Deep Q-learning (DQN) in mind.

1.1 Related works

The history of proofs of Q-function convergence is long. For example, there are papers such as [8], [9], [10], and [11], the last of which builds on [10]. An unusual proof method is that of [12], which uses ordinary differential equations. For DQN, there is a study [13] summarizing the approximation error, where the approximation error due to the neural network is analyzed. Other results include [14][15][16][17][18]. All of these studies assume that rewards are bounded, that is, that there is a constant $R_{max}<\infty$ such that

|r(s,a)|\leq R_{max}\quad a.e. \qquad (1.1)

holds. Therefore, a Gaussian distribution cannot be assumed. In this paper, with the normal distribution in mind, we prove the convergence of the Q-function under the condition

\forall(s,a)\in S\times A,\quad E[r(s,a)^{2}]<\infty \qquad (1.2)

which is more relaxed than (1.1). Finally, we prove the convergence of the Q-function on a continuous domain under ideal conditions; this setting arises frequently in reinforcement learning.

2 Background

2.1 Transition kernel

Let $(S,\mathcal{S})$ and $(T,\mathcal{T})$ both be measurable spaces. A transition kernel $k:S\times\mathcal{T}\to\mathbb{R}_{+}$ is defined to satisfy the following two conditions.

\forall B\in\mathcal{T},\quad k(\cdot,B)\ \text{is measurable on}\ S \qquad (2.1)
\forall s\in S,\quad k(s,\cdot)\ \text{is a measure on}\ T \qquad (2.2)

This is used in situations where $s$ is fixed and one considers the resulting distribution on $T$.

2.2 Markov decision process

Assume that both the set of states $S$ and the set of actions $A$ are finite sets. A transition kernel $p$ is defined on $(S\times A,2^{S\times A})$ and $(S\times\mathbb{R},2^{S}\otimes\mathcal{B}(\mathbb{R}))$. That is, $p(r,s^{\prime}|s,a)$ is a probability measure that governs the distribution of the next state $s^{\prime}\in S$ and the immediate reward $r\in\mathbb{R}$ when an action $a\in A$ is taken in state $s\in S$. A policy $\pi:S\to\mathcal{P}(A)$ is, as the definition suggests, the action probability determined from the current state. A policy is deterministic if for every $s$ there is an $a$ with $\pi(a|s)=1$. A family of random variables $s_{t},a_{t},r_{t}$ taking values in $S,A,\mathbb{R}$ is written $(s_{t},a_{t},r_{t})^{\infty}_{t=0}$. This stochastic process is called a Markov decision process (MDP).
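As an illustration, the objects above can be written down directly in code. The following minimal Python sketch (the two-state environment, its Gaussian rewards, and all names are hypothetical, not taken from the paper) represents the transition kernel $p(r,s^{\prime}|s,a)$ as a sampler; the Gaussian reward is exactly the unbounded, square-integrable case treated here.

```python
import numpy as np

# A hypothetical finite MDP with Gaussian (unbounded) rewards.
S, A = [0, 1], [0, 1]                      # states and actions

# Next-state probabilities P[s][a] over S and reward means MU[s][a];
# rewards are N(MU[s][a], 1), so E[r(s,a)^2] < infinity but |r| is unbounded.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
MU = {0: {0: 0.0, 1: 1.0},
      1: {0: 0.5, 1: 2.0}}

rng = np.random.default_rng(0)

def step(s, a):
    """Sample (r, s') from the transition kernel p(r, s' | s, a)."""
    s_next = rng.choice(S, p=P[s][a])
    r = rng.normal(MU[s][a], 1.0)
    return r, s_next
```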

2.3 Optimal policies and state-action value functions

Let $\Pi$ denote the set of all policies. The state-action value function $Q^{\pi}:S\times A\to\mathbb{R}$ for a policy $\pi$ is defined as follows.

Q^{\pi}(s,a):=E\left[\sum^{\infty}_{t=0}\gamma^{t}r_{t}\ \middle|\ s_{0}=s,\ a_{0}=a,\ (r_{t},s_{t+1})\sim p(\cdot,\cdot|s_{t},a_{t}),\ a_{t}\sim\pi(\cdot|s_{t})\right] \qquad (2.3)

Furthermore, the state value function $V^{\pi}(s)$ is defined as follows.

V^{\pi}(s):=\sum_{a\in A}\pi(a|s)Q^{\pi}(s,a) \qquad (2.4)

Define the optimal policy $\pi^{*}$ as

\pi^{*}:=\operatorname{argmax}_{\pi\in\Pi}V^{\pi}(s_{0}) \qquad (2.5)

In addition, the state-action value function $Q^{\pi^{*}}$ for the optimal policy is called the optimal state-action value function and is written simply as $Q^{*}$. A policy that takes the action maximizing the optimal state-action value function is optimal; that is,

\pi^{*}(a|s)=\begin{cases}1 & a=\operatorname{argmax}_{a^{\prime}\in A}Q^{*}(s,a^{\prime})\\ 0 & \text{otherwise}\end{cases} \qquad (2.6)

holds for any $s,a$.
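For concreteness, equations (2.4) and (2.6) translate directly into array operations: the state value is a policy-weighted sum of Q-values, and the optimal policy places all its mass on the greedy action. The sketch below is only an illustration for a tabular setting; the array `Q` and its values are hypothetical.

```python
import numpy as np

def state_value(Q, pi):
    """V^pi(s) = sum_a pi(a|s) Q^pi(s,a), as in (2.4); Q and pi are |S| x |A| arrays."""
    return np.sum(pi * Q, axis=1)

def greedy_policy(Q):
    """Deterministic policy of (2.6): probability 1 on argmax_a Q(s,a), 0 elsewhere."""
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), np.argmax(Q, axis=1)] = 1.0
    return pi

Q = np.array([[1.0, 2.0],     # hypothetical Q-table with 2 states, 2 actions
              [0.5, 0.1]])
pi_star = greedy_policy(Q)
print(state_value(Q, pi_star))  # [2.0, 0.5]
```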

3 Update of the state-action value function and the Robbins-Monro condition

The Q-function is updated as follows.

Q_{t+1}(s,a)=(1-\alpha(s,a,s_{t},a_{t},t))Q_{t}(s,a)+\alpha(s,a,s_{t},a_{t},t)\left[r_{t}(s_{t},a_{t})+\gamma\max_{b\in A}Q_{t}(s_{t+1},b)\right] \qquad (3.1)

A sequence $\{c_{t}\}^{\infty}_{t=0}$ is said to satisfy the Robbins-Monro condition if

\forall t,\quad c_{t}\in[0,1] \qquad (3.2)
\sum^{\infty}_{t=0}c_{t}=\infty \qquad (3.3)
\sum^{\infty}_{t=0}c^{2}_{t}<\infty \qquad (3.4)

Using such a sequence, the mapping $\alpha:S\times A\times S\times A\times\mathbb{N}\to[0,1]$ is defined as follows.

\alpha(s,a,s_{t},a_{t},t)=\begin{cases}c_{t} & s_{t}=s,\ a_{t}=a\\ 0 & \text{otherwise}\end{cases} \qquad (3.5)

In addition, it is assumed that this step size also satisfies the Robbins-Monro condition almost surely for every $s,a$:

\sum^{\infty}_{t=0}\alpha(s,a,s_{t},a_{t},t)=\infty\quad a.e. \qquad (3.6)
\sum^{\infty}_{t=0}\alpha(s,a,s_{t},a_{t},t)^{2}<\infty\quad a.e. \qquad (3.7)
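A minimal tabular sketch of the update (3.1) with the step-size schedule (3.2)-(3.7): taking $c_{t}=1/(1+n(s,a))$, where $n(s,a)$ counts visits to the pair $(s,a)$, satisfies the Robbins-Monro condition for every pair that is visited infinitely often. The environment function `step` (e.g. the sampler sketched in Section 2.2) and all other names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def q_learning(step, n_states, n_actions, gamma=0.9, n_steps=100_000, rng=None):
    """Tabular Q-learning, cf. (3.1): only the visited pair (s_t, a_t) gets a nonzero step."""
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(n_steps):
        a = rng.integers(n_actions)              # behaviour policy: uniform exploration
        r, s_next = step(s, a)                   # sample (r, s') ~ p(., .|s, a)
        visits[s, a] += 1
        c_t = 1.0 / (1.0 + visits[s, a])         # Robbins-Monro: sum c_t = inf, sum c_t^2 < inf
        target = r + gamma * np.max(Q[s_next])   # r_t + gamma * max_b Q_t(s_{t+1}, b)
        Q[s, a] += c_t * (target - Q[s, a])      # update (3.1); all other entries unchanged
        s = s_next
    return Q
```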

4 Proof of Q-function convergence for unbounded rewards

Consider a real-valued function $w_{t}(x)$ on a finite set $\mathcal{X}$.

Theorem 1

Convergence of the Q-value in the case of Gaussian rewards

Let $\mathcal{X}:=S\times A$ be a finite set and let $r_{t}(x)$ be the random reward at $x\in\mathcal{X}$. Let $W$ be the set of functions $f:\mathcal{X}\to\mathbb{R}$, with $||f||_{W}:=\max_{x\in\mathcal{X}}|f(x)|$. If $E[r^{2}(s,a)]<\infty$ for every $s,a$, then

||Q_{t}-Q^{*}||_{W}\to 0 \qquad (4.1)

proof.

The proof follows that of [9]. Since the condition on $F_{t}$ is relaxed and the statement is stronger, the argument must be carried out more carefully. Consider the stochastic process $\Delta_{t}(x):=Q_{t}(x)-Q^{*}(x)$. Since $Q^{*}(x)$ is a constant, $V(\Delta_{t}(x))=V(Q_{t}(x))$. Putting $F_{t}(x):=r_{t}(x)+\gamma\sup_{b}Q_{t}(X(s,a),b)-Q^{*}(x)$, where $X(s,a)$ denotes the next state sampled when action $a$ is taken in state $s$, this is an $\mathcal{F}_{t+1}$-measurable stochastic process. Furthermore, putting $G_{t}(x):=r_{t}(x)+\gamma\sup_{b}Q_{t}(X(s,a),b)$, by definition $G_{t}-E[G_{t}(x)|\mathcal{F}_{t}]=F_{t}-E[F_{t}(x)|\mathcal{F}_{t}]$. Take two stochastic processes $\delta_{t},w_{t}\in W$ such that $\Delta_{0}(x)=\delta_{0}(x)+w_{0}(x)$, and define their time evolution as

\delta_{t+1}(x)=(1-a_{t}(x))\delta_{t}(x)+a_{t}(x)E[F_{t}(x)|\mathcal{F}_{t}] \qquad (4.2)
w_{t+1}(x)=(1-a_{t}(x))w_{t}(x)+a_{t}(x)p_{t}(x) \qquad (4.3)

where $p_{t}(x):=F_{t}(x)-E[F_{t}(x)|\mathcal{F}_{t}]$ and $a_{t}(x)$ denotes the step size $\alpha(s,a,s_{t},a_{t},t)$ for $x=(s,a)$. Then $\Delta_{t}(x)=w_{t}(x)+\delta_{t}(x)$. First, we show that $w_{t}$ converges to 0 on $\mathcal{X}$ with probability 1 using Lemma 2. By definition $E[p_{t}|\mathcal{F}_{t}]=0$, so $\sum_{t}|E[p_{t}|\mathcal{F}_{t}]|=0$ holds. From Lemma 1 and the definitions of $p_{t},G_{t}$, $E[p_{t}^{2}]\leq 4E[G_{t}^{2}]$ holds. Putting $L_{t}(\omega):=\sup_{x}|Q_{t}(x)|$, this random variable is $\mathcal{F}_{t}$-measurable and finite with probability 1. Let $C_{R}$ denote a constant with $E[\sup_{x}|r_{t}(x)|^{2}]\leq C_{R}$ for all $t$; such a constant exists because $\mathcal{X}$ is finite and $E[r^{2}(x)]<\infty$. Since $L_{0}$ is finite, a constant $K_{0}$ can be taken so that $E[L_{0}^{2}]\leq K^{2}_{0}C_{R}$ holds. Writing $b_{t}$ for the step size $c_{t}$ applied at time $t$, the following holds with probability 1.

L_{t+1}\leq\max\left(L_{t},\ (1-b_{t})L_{t}+b_{t}(\sup_{x}|r_{t}(x)|+\gamma L_{t})\right) \qquad (4.4)

Using the above inequality, the following holds.

E[L^{2}_{t+1}]\leq\max\left(E[L^{2}_{t}],\ E\left[\left((1-b_{t})L_{t}+b_{t}(\sup_{x}|r_{t}(x)|+\gamma L_{t})\right)^{2}\right]\right) \qquad (4.5)

Suppose there exists $K_{t}\in\mathbb{R}$ such that $E[L_{t}^{2}]\leq K^{2}_{t}C_{R}$. Putting $H_{t}:=\sup_{x}|r_{t}(x)|+\gamma L_{t}$,

E[H_{t}^{2}] =E[\sup_{x}|r_{t}(x)|^{2}]+2\gamma E[\sup_{x}|r_{t}(x)|L_{t}]+\gamma^{2}E[L_{t}^{2}] \qquad (4.6)
\leq C_{R}+2\gamma\sqrt{C_{R}K_{t}^{2}C_{R}}+\gamma^{2}K^{2}_{t}C_{R} \qquad (4.7)
=(1+\gamma K_{t})^{2}C_{R} \qquad (4.8)

Then,

E[((1-b_{t})L_{t}+b_{t}(\sup_{x}|r_{t}(x)|+\gamma L_{t}))^{2}] \leq(1-b_{t})^{2}E[L_{t}^{2}]+2(1-b_{t})b_{t}\sqrt{E[L_{t}^{2}]E[H_{t}^{2}]}+b_{t}^{2}E[H_{t}^{2}] \qquad (4.9)
\leq(1-b_{t})^{2}K_{t}^{2}C_{R}+2(1-b_{t})b_{t}K_{t}(1+\gamma K_{t})C_{R}+b_{t}^{2}(1+\gamma K_{t})^{2}C_{R} \qquad (4.10)
=((1-b_{t})K_{t}+b_{t}(1+\gamma K_{t}))^{2}C_{R} \qquad (4.11)
=(K_{t}+b_{t}(1-(1-\gamma)K_{t}))^{2}C_{R} \qquad (4.12)

Putting $K_{t+1}=\max(K_{t},K_{t}+b_{t}(1-(1-\gamma)K_{t}))$, we can say $E[L^{2}_{t+1}]\leq K_{t+1}^{2}C_{R}$. Since $K_{0}\in\mathbb{R}$ exists, $K_{t}\in\mathbb{R}$ exists for every $t$, and $E[L_{t}^{2}]\leq K_{t}^{2}C_{R}$ holds. It is clear from the definition that $K_{t+1}=K_{t}$ when $K_{t}>\frac{1}{1-\gamma}$, and that if $K_{t}\leq\frac{1}{1-\gamma}$ then $K_{t+1}\leq\frac{1}{1-\gamma}+1$. Therefore, in addition to the existence of $K_{t}$ for every $t$ shown above, we also have $K_{t}\leq K^{*}:=\max(K_{0},\frac{1}{1-\gamma}+1)$. Since $|G_{t}(x)|\leq|r_{t}(x)|+\gamma L_{t}$, the following holds for all $x$.

\frac{1}{4}E[p_{t}^{2}(x)] \leq E[G_{t}^{2}(x)] \qquad (4.13)
\leq E[r_{t}(x)^{2}]+2\gamma\sqrt{E[r_{t}(x)^{2}]E[L_{t}^{2}]}+\gamma^{2}E[L_{t}^{2}] \qquad (4.14)
\leq(1+\gamma K^{*})^{2}C_{R} \qquad (4.15)

Then,

\sum_{t}E[a_{t}^{2}p_{t}^{2}] \leq\sum_{t}4b^{2}_{t}(1+\gamma K^{*})^{2}C_{R} \qquad (4.16)
\leq 4M(1+\gamma K^{*})^{2}C_{R}<\infty \qquad (4.17)

holds for all $x$, where $M:=\sum_{t}b_{t}^{2}<\infty$ by the Robbins-Monro condition. To apply Lemma 2, put

U_{t}:=a_{t}(x)p_{t}(x) \qquad (4.18)
T(w_{t},\omega):=(1-a_{t}(x))w_{t} \qquad (4.19)

Then $\sum_{t}E[U_{t}^{2}]<\infty$. Since $E[U_{t}|\mathcal{F}_{t}]=0$, $\sum_{t}|E[U_{t}|\mathcal{F}_{t}]|=0$ holds. For any $\epsilon>0$, set $\alpha=\epsilon$, $\beta_{t}(x)=b^{2}_{t}(x)$ and $\gamma_{t}(x)=\epsilon(2a_{t}(x)-a_{t}^{2}(x))$; then

T^{2}(w_{t},\omega)\leq\max(\alpha,(1+\beta_{t})w^{2}_{t}-\gamma_{t}) \qquad (4.20)
\sum_{t}\gamma_{t}=\infty\quad a.e. \qquad (4.21)

holds, where the latter follows from the Robbins-Monro condition. Therefore $w_{t}(x)\to 0$ for every $x$. Next, define the operator $\mathcal{T}:W\to W$ as follows: for $q\in W$,

\mathcal{T}q(s,a) =\int_{\mathbb{R}}\sum_{s^{\prime}}\left[r+\gamma\sup_{b}q(s^{\prime},b)\right]p(dr,s^{\prime}|s,a) \qquad (4.22)
=E\left[r(s,a)+\gamma\sup_{b}q(X(s,a),b)\right] \qquad (4.23)

$Q^{*}$ is a fixed point of this operator. For any $q_{1},q_{2}\in W$,

||\mathcal{T}q_{1}-\mathcal{T}q_{2}||_{W} =\sup_{s,a}\left|\int_{\mathbb{R}}\sum_{s^{\prime}}\left[r+\gamma\sup_{b}q_{1}(s^{\prime},b)\right]p(dr,s^{\prime}|s,a)-\int_{\mathbb{R}}\sum_{s^{\prime}}\left[r+\gamma\sup_{b}q_{2}(s^{\prime},b)\right]p(dr,s^{\prime}|s,a)\right| \qquad (4.24)
\leq\sup_{s,a}\int_{\mathbb{R}}\sum_{s^{\prime}}\gamma\left|\sup_{b}q_{1}(s^{\prime},b)-\sup_{b}q_{2}(s^{\prime},b)\right|p(dr,s^{\prime}|s,a) \qquad (4.25)
\leq\sup_{s,a}\int_{\mathbb{R}}\sum_{s^{\prime}}\gamma\sup_{b}\left|q_{1}(s^{\prime},b)-q_{2}(s^{\prime},b)\right|p(dr,s^{\prime}|s,a) \qquad (4.26)
\leq\gamma||q_{1}-q_{2}||_{W} \qquad (4.27)

Thus $\mathcal{T}$ is a contraction operator.

|E[F_{t}(x)|\mathcal{F}_{t}]| =\left|\int_{\mathbb{R}}\sum_{s^{\prime}}\left[r+\gamma\sup_{b}Q_{t}(s^{\prime},b)-Q^{*}(s,a)\right]p(dr,s^{\prime}|s,a)\right| \qquad (4.28)
=|\mathcal{T}Q_{t}(x)-Q^{*}(x)| \qquad (4.29)
=|\mathcal{T}Q_{t}(x)-\mathcal{T}Q^{*}(x)| \qquad (4.30)
\leq\gamma||\Delta_{t}||_{W} \qquad (4.31)

Then,

||\delta_{t+1}|| \leq(1-a_{t}(x))||\delta_{t}||+a_{t}(x)\gamma||\delta_{t}+w_{t}|| \qquad (4.32)
\leq(1-a_{t}(x))||\delta_{t}||+a_{t}(x)\gamma(||\delta_{t}||+||w_{t}||) \qquad (4.33)

As shown above, $w_{t}(x)$ converges uniformly to 0 with probability 1 for every $x$. Therefore, by Lemma 5, $\delta_{t}(x)\to 0$ for every $x$. That is, $||\Delta_{t}||_{W}\to 0$, which is the assertion of the theorem.
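The contraction property (4.24)-(4.27) used in the proof is easy to check numerically: applying the operator $\mathcal{T}$ of (4.22) to two arbitrary Q-tables shrinks their sup-norm distance by at least a factor $\gamma$, and iterating $\mathcal{T}$ converges to the fixed point $Q^{*}$. The two-state MDP below is a hypothetical example; only the reward means enter, since $\mathcal{T}$ takes an expectation over the Gaussian reward.

```python
import numpy as np

gamma = 0.9
# Hypothetical finite MDP: P[s, a, s'] transition probabilities, MU[s, a] mean rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
MU = np.array([[0.0, 1.0],
               [0.5, 2.0]])

def bellman_T(q):
    """(Tq)(s,a) = E[r(s,a)] + gamma * sum_s' p(s'|s,a) max_b q(s',b), cf. (4.22)."""
    return MU + gamma * P @ np.max(q, axis=1)

rng = np.random.default_rng(0)
q1, q2 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
d_before = np.max(np.abs(q1 - q2))
d_after = np.max(np.abs(bellman_T(q1) - bellman_T(q2)))
assert d_after <= gamma * d_before + 1e-12       # gamma-contraction in ||.||_W

q = np.zeros((2, 2))
for _ in range(200):                             # value iteration: T^n q -> Q*
    q = bellman_T(q)
```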

5 Theorem for SARSA

The method of Section 3 is called Q-learning; the value is updated before the next action is performed. SARSA, on the other hand, updates the value after the next action has been performed.

Q_{t+1}(s,a)=(1-\alpha(s,a,s_{t},a_{t},t))Q_{t}(s,a)+\alpha(s,a,s_{t},a_{t},t)\left[r_{t}(s_{t},a_{t})+\gamma Q_{t}(s_{t+1},a_{t+1})\right] \qquad (5.1)

The next action $a_{t+1}$ is often determined stochastically, for example by a softmax function.
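For illustration, the SARSA rule (5.1) with a softmax behaviour policy can be sketched as follows. The temperature parameter, the environment `step`, and the visit-count step sizes are hypothetical choices, not prescribed by the paper.

```python
import numpy as np

def softmax_action(q_row, tau, rng):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    z = q_row / tau
    p = np.exp(z - np.max(z))
    return rng.choice(len(q_row), p=p / p.sum())

def sarsa(step, n_states, n_actions, gamma=0.9, tau=1.0, n_steps=100_000, rng=None):
    """Tabular SARSA, cf. (5.1): the target uses the action actually taken next."""
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = 0
    a = softmax_action(Q[s], tau, rng)
    for _ in range(n_steps):
        r, s_next = step(s, a)
        a_next = softmax_action(Q[s_next], tau, rng)   # a_{t+1}, chosen before the update
        visits[s, a] += 1
        alpha = 1.0 / (1.0 + visits[s, a])             # Robbins-Monro step size
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
    return Q
```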

Theorem 2

Suppose the Q-function is updated by the SARSA rule above. Then

||Q_{t}-Q^{*}||_{W}\to 0\quad\text{as}\ t\to\infty \qquad (5.2)

proof.

Put $L^{\prime}_{t}:=\max_{x,y\in\mathcal{X}}|Q_{t}(x)-Q_{t}(y)|$. It is clear from the definition that $L^{\prime}_{t}\leq 2L_{t}$. The rest follows along the proof of Theorem 1 using this bound.

6 Convergence proof for unbounded rewards on a continuous domain

In a setting such as DQN, an update for one pair $s,a$ also affects other state-action pairs. As a simple model of such situations, we introduce a ripple function $f(x_{1},x_{2})$ defined on the compact set $\mathcal{X}^{2}$, satisfying the following conditions.

f(x,x)=1 \qquad (6.1)
f(x,y)\ \text{is continuous} \qquad (6.2)

If $Q^{*}$ is a continuous function, this allows us to start from any continuous function and obtain the same convergence on the compact set. Let $\mathcal{X}\subset\mathbb{R}^{d}$ be a simply connected compact set, let $Q^{*},Q_{0}$ be continuous functions on $\mathcal{X}$, and let $W$ be the set of continuous functions on $\mathcal{X}$ with $||f||_{W}:=\max_{x\in\mathcal{X}}|f(x)|$. Update the Q-function by

Q_{t+1}(s,a)=(1-f(s,a,s_{t},a_{t})\alpha(s,a,s_{t},a_{t},t))Q_{t}(s,a)+f(s,a,s_{t},a_{t})\alpha(s,a,s_{t},a_{t},t)\left[r_{t}(s_{t},a_{t})+\gamma\max_{b\in A}Q_{t}(s_{t+1},b)\right] \qquad (6.3)

Then $||Q_{t}-Q^{*}||_{W}\to 0$.

proof. Consider a finite set $K_{N}:=\{x_{1},x_{2},x_{3},\ldots,x_{N}\}\subset\mathcal{X}$. By Theorem 1, the restriction of $Q_{t}$ to $K_{N}$ converges uniformly on $K_{N}$ to the correct function. Since $Q^{*}$ is continuous and a continuous function is uniquely determined by its values on a dense subset, the convergence extends to all of $\mathcal{X}$.
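As a rough illustration of the update (6.3) on a discretised domain: every grid point is pulled toward the sampled target, with an effective step size scaled by a ripple function that equals 1 at the sampled pair and decays with distance. The Gaussian-shaped ripple, the one-dimensional grid, and the numeric values are assumptions made for this sketch; the paper only requires $f(x,x)=1$ and continuity of $f$.

```python
import numpy as np

def ripple(x1, x2, width=0.5):
    """A continuous ripple function with f(x, x) = 1; a Gaussian bump is one possible choice."""
    return np.exp(-np.sum((x1 - x2) ** 2, axis=-1) / (2.0 * width ** 2))

def ripple_update(Q, grid, x_t, target, c_t):
    """One step of (6.3): each grid point moves toward the target with step f(x, x_t) * c_t."""
    w = ripple(grid, x_t) * c_t
    return (1.0 - w) * Q + w * target

grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)   # discretised domain X
Q = np.zeros(101)
x_t = np.array([0.3])                              # sampled pair (s_t, a_t), flattened
target = 1.7                                       # r_t + gamma * max_b Q_t(s_{t+1}, b)
Q = ripple_update(Q, grid, x_t, target, c_t=0.1)
```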

7 Conclusion and Future Work

As mentioned earlier, we want to prove convergence of the distribution itself. An evaluation of the order of convergence of the expected value should also be performed, and we would like to estimate the convergence order for specific neural networks as in [13]. According to [13], in a continuous domain as in Section 6, with $R_{max}:=\sup r(\omega,s,a)$ and constants $C_{1},C_{2},\xi,\alpha$,

||Q^{*}-Q_{n}||_{W}\leq C_{1}(\log n)^{\xi}n^{-\alpha}+C_{2}R_{max} \qquad (7.1)

holds. However, when $r$ follows a normal distribution, $R_{max}=\infty$, so this upper bound on the error is infinite and the expression carries no information. When unbounded rewards are used, stronger inequalities are needed.

References

  • [1] Watkins, C.J.C.H. Learning from delayed rewards. PhD Thesis, University of Cambridge, England, 1989.
  • [2] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Nonparametric return distribution approximation for reinforcement learning. In International Conference on Machine Learning, 2010.
  • [3] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.
  • [4] Rowland, M., Bellemare, M. G., Dabney, W., Munos, R., and Teh, Y. W. . An analysis of categorical distributional reinforcement learning. In Artificial Intelligence and Statistics (AISTATS).2018
  • [5] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Parametric return density estimation for reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, 2010.
  • [6] Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. Efficient exploration through Bayesian deep Q-networks. arXiv preprint arXiv:1802.04412, 2018.
  • [7] Rasmussen, C. E. and Kuss, M.. Gaussian processes in reinforcement learning. In Thrun, S., Saul, L. K., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16, pages 751–759, Cambridge, MA, USA. MIT Press.2004
  • [8] Christopher J.C.H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8, 279-292, Kluwer Academic Publishers, 1992.
  • [9] J. N. Tsitsiklis, Asynchronous Stochastic Approximation and Q-learning, Machine Learning, 16, 1994
  • [10] Tommi Jakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.
  • [11] F. S. Melo, Convergence of Q-learning: A simple proof, Institute Of Systems and Robotics, Tech. Rep.2001
  • [12] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation.and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
  • [13] Zhuoran Yang, Yuchen Xie, and Zhaoran Wang. A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137v2, 2019.
  • [14] Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. and Geist, M. . Approximate modified policy iteration and its application to the game of tetris. Journal of Machine Learning Research, 16 1629–1676.2015
  • [15] Farahmand, A.-m., Szepesvári, C. and Munos, R. . Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems.2010
  • [16] Andras Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008.
  • [17] Remi Munos. Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization, 2007.
  • [18] Remi Munos. Error bounds for approximate policy iteration. In ICML 2003: Proceedings of the 20th Annual International Conference on Machine Learning, 2003.
  • [19] Venter J, On Dvoretzky stochastic approximation theorems, Annals of Mathematical Statistics 37, 1534–1544,1966
  • [20] Berti, P., Crimaldi, I., Pratelli, L. and Rigo, P. Rate of convergence of predictive distributions for dependent data. Bernoulli 15, 1351–1367.2009.
  • [21] BERTI, P., PRATELLI, L. and RIGO, P. . Limit theorems for predictive sequences of random variables. Technical Report 146, Dip. EPMQ, Univ. Pavia. Available at economia.2002
  • [22] Ghavamzadeh, M., Mannor, S., Pineau, J., and Tamar, A. . Bayesian reinforcement learning:a survey. Foundations and Trends in Machine Learning, 8(5-6):359–483.2015
  • [23] Chen Tessler, Guy Tennenholtz, and Shie Mannor. Distributional policy optimization: An alternative approach for continuous control. arXiv preprint arXiv:1905.09855,2019.
  • [24] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, Neural ordinary differential equations, arXiv preprint arXiv:1806.07366, 2018.
  • [25] David Ha, Andrew Dai, Quoc V. Le,HyperNetworks. In International Conference on Learning Representations, 2017

Appendix A Lemmas and proofs

Lemma 1

Consider a random variable $Y$ and a sub-$\sigma$-algebra $\mathcal{G}$. If $Z:=Y-E[Y|\mathcal{G}]$, the following inequality holds.

E[Z^{2}]\leq 4E[Y^{2}] \qquad (A.1)

We quote the following important result.

Lemma 2

Convergence theorem for stochastic systems [19]

Consider the following stochastic process.

X_{t+1}:=T(X_{0},\ldots,X_{t},\omega)+U_{t}(\omega) \qquad (A.2)

Suppose $T$ satisfies the following inequality with probability 1:

|T(x_{1},x_{2},\ldots,x_{t},\omega)|^{2}\leq\max(\alpha,(1+\beta_{t}(\omega))x^{2}_{t}-\gamma_{t}) \qquad (A.3)

where $\alpha>0$, and with probability 1, $\beta_{t}(\omega)<M^{\prime}$ and $\sum_{t}\beta_{t}<\infty$ hold, and with probability 1, $\sum_{t}\gamma_{t}(\omega)=\infty$. Suppose furthermore that

\sum_{t}E[U_{t}^{2}] <\infty \qquad (A.4)
\sum_{t}|E[U_{t}|\mathcal{F}_{t}]| <\infty \qquad (A.5)

At this time, there exists some $N(\omega)$ such that the following holds for any $t>N(\omega)$:

\limsup_{t\to\infty}|X_{t}|^{2}<\alpha\quad a.e. \qquad (A.6)

If, for an arbitrary $\alpha>0$, $\beta_{t},\gamma_{t}$ can be re-chosen so that the same conditions hold, then $X_{t}\to 0$ with probability 1, which is much stronger than the statement above.

Lemma 3

Let $x_{0}\in\mathbb{R}$ be a real number and consider the recursion

x_{n+1}=(1-a_{n})x_{n}+\gamma a_{n}|x_{n}| \qquad (A.7)

where $\gamma\in(0,1)$ is a constant and $\{a_{n}\}$ satisfies the Robbins-Monro condition. Then $x_{n}\to 0$ holds with probability 1.

proof.

Fix each $\omega$; that is, $\{a_{n}\}^{\infty}_{n=0}$ is a deterministic sequence satisfying $\sum^{\infty}_{n=0}a_{n}=\infty$ and $\sum^{\infty}_{n=0}a^{2}_{n}<\infty$. For sufficiently large $n$, $x_{n}$ is nonnegative, so the sequence is bounded below. In addition, since $x_{n}\geq x_{n+1}$ is then apparent from the recursion, $\{x_{n}\}_{n=1}^{\infty}$ is eventually monotonically decreasing, and being bounded below and monotonically decreasing it converges. Putting $b_{n}:=a_{n}-\gamma a_{n}$, this satisfies $\sum^{\infty}_{n=0}b_{n}=\infty$ and $\sum^{\infty}_{n=0}b^{2}_{n}<\infty$. We can write $x_{n}=x_{1}\prod^{n}_{i=1}(1-b_{i})$, so the limit is $x_{1}\prod^{\infty}_{n=1}(1-b_{n})$. Putting $c_{n}:=\prod^{n}_{i=1}(1-b_{i})$, the infinite product diverges to 0 as $n\to\infty$ because $\sum^{\infty}_{n=0}b_{n}=\infty$; since $0\leq c_{n}\leq 1$, we obtain $c_{n}\to 0$ and hence $x_{n}\to 0$.

Lemma 4

Let $\epsilon>0$ and consider the recursion

x_{n+1}=(1-a_{n})x_{n}+\gamma a_{n}|x_{n}+\epsilon| \qquad (A.8)

under the same assumptions on $\{a_{n}\}$ and $\gamma$ as in Lemma 3. Then $x_{n}\to\epsilon\frac{\gamma}{1-\gamma}$ holds.

proof.

x_{n+1}-x_{n} =-a_{n}((1-\gamma)x_{n}-\epsilon\gamma) \qquad (A.9)
=-a_{n}(1-\gamma)\left(x_{n}-\epsilon\frac{\gamma}{1-\gamma}\right) \qquad (A.10)

That is, the difference from $\epsilon\frac{\gamma}{1-\gamma}$ shrinks by the factor $1-a_{n}(1-\gamma)$ at each step. If $y_{n}:=x_{n}-\epsilon\frac{\gamma}{1-\gamma}$, then by definition $y_{n+1}-y_{n}=x_{n+1}-x_{n}$. Moreover,

y_{n+1}-y_{n} =-a_{n}(1-\gamma)y_{n} \qquad (A.11)
y_{n+1} =(1-a_{n}(1-\gamma))y_{n} \qquad (A.12)

By the same argument as in Lemma 3, $y_{n}\to 0$, and therefore $x_{n}\to\epsilon\frac{\gamma}{1-\gamma}$.
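The recursions of Lemmas 3 and 4 are easy to check numerically. With the step sizes $a_{n}=1/(n+1)$, which satisfy the Robbins-Monro condition, the sequence (A.7) drifts to 0 and the perturbed sequence (A.8) drifts to $\epsilon\gamma/(1-\gamma)$; the convergence is slow, of order $n^{-(1-\gamma)}$. This is only a sanity-check sketch with hypothetical numbers, not part of the proofs.

```python
gamma, eps = 0.5, 0.1
x, y = 5.0, 5.0                     # x follows (A.7), y follows (A.8)
for n in range(1_000_000):
    a_n = 1.0 / (n + 1)             # Robbins-Monro step sizes
    x = (1 - a_n) * x + gamma * a_n * abs(x)
    y = (1 - a_n) * y + gamma * a_n * abs(y + eps)

print(x)                            # close to 0 (Lemma 3); decays like n^(-(1-gamma))
print(y, eps * gamma / (1 - gamma)) # y approaches eps*gamma/(1-gamma) = 0.1 (Lemma 4)
```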

Lemma 5

Suppose that the sequence $\{c_{n}\}\subset\mathbb{R}_{+}$ converges uniformly to 0 on a set of probability 1. That is, for any $\epsilon_{1}>0$ there is some $N_{\epsilon_{1}}(\omega)$ such that $|c_{n}|<\epsilon_{1}$ holds for $n>N_{\epsilon_{1}}(\omega)$ with probability 1. Then, for the recursion

x_{n+1}=(1-a_{n})x_{n}+\gamma a_{n}|x_{n}+c_{n}| \qquad (A.13)

the sequence $\{x_{n}\}$ converges to 0.

proof.

Define the comparison sequence $\{z_{n}\}$ by

z_{N_{\epsilon_{1}}} =x_{N_{\epsilon_{1}}} \qquad (A.14)
z_{n+1} =(1-a_{n})z_{n}+\gamma a_{n}|z_{n}+\epsilon_{1}| \qquad (A.15)

Then $|z_{n}|\geq|x_{n}|$ for such $n>N_{\epsilon_{1}}$, and $z_{n}\to\epsilon_{1}\frac{\gamma}{1-\gamma}$ by Lemma 4. That is, for any $\epsilon_{2}>0$ there is some $N_{\epsilon_{2}}>N_{\epsilon_{1}}$ such that $z_{n}<\epsilon_{1}\frac{\gamma}{1-\gamma}+\epsilon_{2}$ for all $n>N_{\epsilon_{2}}$. Since $\epsilon_{1},\epsilon_{2}$ can be taken arbitrarily, the quantity $\epsilon:=\epsilon_{1}\frac{\gamma}{1-\gamma}+\epsilon_{2}>0$ can also be made arbitrarily small. Using $z_{n}\geq x_{n}$, for any $\epsilon>0$ there is some $N_{\epsilon_{2}}$ such that $x_{n}<\epsilon$ for $n>N_{\epsilon_{2}}$.

Appendix B Strict Proof of the Policy Gradient Theorem and its Distributional Version

We prove the well-known policy gradient theorem using the $Q$ function, together with its version in distributional reinforcement learning [23].

Theorem 3

Policy gradient theorem

Consider the gradient of the policy value function $J(\theta):=E[Q(x,\pi_{\theta}(x))]$. Assume that $\pi$ and $Q$ are implemented by neural networks, that the activation functions are Lipschitz continuous, and that $\nabla_{\theta}Q(x,a)=0$ (the critic is not differentiated with respect to $\theta$). Then the following equation holds:

\nabla_{\theta}J(\theta)=E_{\rho}[\nabla_{\theta}\pi_{\theta}(x)\nabla_{a}Q(x,a)|_{a=\pi_{\theta}(x)}] \qquad (B.1)

Here $\rho$ is, in typical implementations, the distribution of the stored memory data. Next, consider the case of distributional reinforcement learning. If the random variable representing the cumulative reward is written $Z$, then $Q(s,a)=E_{n}[Z(s,a)]$ holds, where $E_{n}$ denotes the expectation over the stochastic output of the network. Suppose $Z$ is a neural network with stochastic output:

Z(\omega)(s,a)=f_{\omega}(s,a) \qquad (B.2)

Then

\nabla_{\theta}J(\theta)=E_{\rho}[\nabla_{\theta}\pi(x)E_{n}[\nabla_{a}Z(x,a)]|_{a=\pi(x)}] \qquad (B.3)

proof.

We first recall a condition under which differentiation and Lebesgue integration can be interchanged. Suppose there is a function $f(x,\omega)$ that is Lebesgue integrable over $\Omega$, with $\mu(\Omega)<\infty$, and differentiable in $x$. If there is an integrable function $\phi(\omega)$ such that $f$ is differentiable in $x$ almost everywhere on $\Omega$ and $|\nabla_{x}f(x,\omega)_{i}|\leq\phi(\omega)$ holds, then $\int_{\Omega}f(x,\omega)d\mu(\omega)$ is differentiable in $x$ and

\nabla_{x}\int_{\Omega}f(x,\omega)d\mu(\omega)=\int_{\Omega}\nabla_{x}f(x,\omega)d\mu(\omega) \qquad (B.4)

holds. An example of a function class that satisfies this (when $\mu(\Omega)<\infty$) is the class of Lipschitz continuous functions. Neural networks are in general compositions of linear transformations and Lipschitz continuous activation maps (common activation functions such as sigmoid, ReLU, Leaky ReLU, and Swish are all Lipschitz continuous). Moreover, if the Lipschitz constant of a function $f$ is written $||f||_{L}$, then for two Lipschitz continuous functions $f,g$ we have $||f\circ g||_{L}\leq||f||_{L}||g||_{L}$. From this, $\pi_{\theta}(x)$, $Q(x,a)$ and $Q(x,\pi_{\theta}(x))$ are Lipschitz continuous in $x$ and $a$, respectively. Although $J$ is not Lipschitz continuous in $\theta$ as a whole, it is Lipschitz continuous in each component of $\theta$, and together with the definition of $\nabla$ this allows the exchange of differentiation and integration. That is, the following holds from the chain rule:

\nabla_{\theta}J(\theta)=E_{\rho}[\nabla_{\theta}\pi_{\theta}(x)\nabla_{a}Q(x,a)|_{a=\pi_{\theta}(x)}] \qquad (B.5)

Similarly, since $f_{\omega}$ is Lipschitz continuous for every $\omega$, $\nabla_{a}E[Z(s,a)]=E[\nabla_{a}f_{\omega}(s,a)]$. Hence, for the distributional case,

\nabla_{\theta}J(\theta) =E_{\rho}[\nabla_{\theta}\pi(x)E_{n}[\nabla_{a}f_{\omega}(x,a)]|_{a=\pi(x)}] \qquad (B.6)
=E_{\rho}[\nabla_{\theta}\pi(x)E_{n}[\nabla_{a}Z(x,a)]|_{a=\pi(x)}] \qquad (B.7)

As described above, the policy gradient theorem holds because the policy is Lipschitz continuous in each parameter, whereas it does not obviously hold for policy functions composed of ODENet [24], HyperNetworks [25], or similar architectures that reuse parameters.
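As a sanity check of (B.1), the chain rule for a deterministic policy can be verified numerically on a toy problem. In the hypothetical example below, the critic is $Q(x,a)=-(a-x)^{2}$, the policy is linear, $\pi_{\theta}(x)=\theta x$, and $\rho$ is an empirical sample of states; the analytic expression $E_{\rho}[\nabla_{\theta}\pi_{\theta}(x)\nabla_{a}Q(x,a)|_{a=\pi_{\theta}(x)}]$ is compared against a finite-difference estimate of $\nabla_{\theta}J(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=1000)           # states sampled from rho (a hypothetical replay memory)
theta = 0.3

def Q(x, a):
    return -(a - x) ** 2             # toy critic; grad_theta Q = 0, as assumed in the theorem

def J(theta):
    return np.mean(Q(xs, theta * xs))

# Right-hand side of (B.1): grad_theta pi_theta(x) = x, grad_a Q|_{a=theta*x} = -2(theta*x - x).
pg = np.mean(xs * (-2.0) * (theta * xs - xs))

# Central finite-difference estimate of grad_theta J(theta).
h = 1e-5
fd = (J(theta + h) - J(theta - h)) / (2 * h)

print(pg, fd)                        # the two estimates agree up to O(h^2)
```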

Appendix C Notation

Let $(A,\mathcal{O})$ be a topological space.

  • $\sigma(\mathcal{O})$: the smallest $\sigma$-algebra containing every $o\in\mathcal{O}$.

  • $\mathcal{B}(A):=\sigma(\mathcal{O})$

  • $\mathcal{P}(A)$: if $A$ is a finite set, the set of all probability measures on the measurable space $(A,2^{A})$; if $A$ is an infinite set, on $(A,\mathcal{B}(A))$.

  • $\mathcal{F}\otimes\mathcal{G}:=\sigma(\mathcal{F}\times\mathcal{G})$