
Appendix for: MOORe: Model-based Offline-to-Online Reinforcement Learning

Appendix A Theorem proofs

The full proof of Theorem LABEL:thm1 is below.

Proof.

We begin the proof at the last horizon $h=H$, and then use closeness at horizon $h$ to prove closeness at horizon $h-1$.

For the last horizon $h=H$, $V_{M,H}^{*}(s)=0$ for all $M$ and all $s$, because every state at horizon $H$ is terminal, so $|V^{*}_{M_{1},H}(s)-V^{*}_{M_{2},H}(s)|\leq 0=\epsilon_{H}$. Suppose

\forall s\in\mathcal{S}_{h},\quad |V^{*}_{M_{1},h}(s)-V^{*}_{M_{2},h}(s)|\leq\epsilon_{h}. (1)

We need to prove

\forall s\in\mathcal{S}_{h-1},\quad |V^{*}_{M_{1},h-1}(s)-V^{*}_{M_{2},h-1}(s)|\leq\epsilon_{h-1}. (2)

Equivalently, it suffices to prove

-\epsilon_{h-1}\leq V^{*}_{M_{2},h-1}(s)-V^{*}_{M_{1},h-1}(s)\leq\epsilon_{h-1}. (3)

Without loss of generality, we prove the right-hand inequality in (3); denote its left-hand side by $\mathrm{LHS}=V^{*}_{M_{2},h-1}(s)-V^{*}_{M_{1},h-1}(s)$.

\mathrm{LHS}=\max_{a\in\mathcal{A}}\Big\{\sum_{s^{\prime}\in\mathcal{S}_{h}}p_{M_{2}}(s^{\prime}|s,a)\big(r(s,a)+\gamma V^{*}_{M_{2},h}(s^{\prime})\big)\Big\} (4)
-\max_{a\in\mathcal{A}}\Big\{\sum_{s^{\prime}\in\mathcal{S}_{h}}p_{M_{1}}(s^{\prime}|s,a)\big(r(s,a)+\gamma V^{*}_{M_{1},h}(s^{\prime})\big)\Big\}
=\max_{a\in\mathcal{A}}\Big\{\sum_{s^{\prime}\in\mathcal{S}_{h}}\Big[p_{M_{1}}(s^{\prime}|s,a)\big(r(s,a)+\gamma V^{*}_{M_{1},h}(s^{\prime})\big)
+\big(p_{M_{2}}(s^{\prime}|s,a)-p_{M_{1}}(s^{\prime}|s,a)\big)r(s,a)
+\gamma\big(p_{M_{2}}(s^{\prime}|s,a)V^{*}_{M_{2},h}(s^{\prime})-p_{M_{1}}(s^{\prime}|s,a)V^{*}_{M_{1},h}(s^{\prime})\big)\Big]\Big\}
-\max_{a\in\mathcal{A}}\Big\{\sum_{s^{\prime}\in\mathcal{S}_{h}}p_{M_{1}}(s^{\prime}|s,a)\big(r(s,a)+\gamma V^{*}_{M_{1},h}(s^{\prime})\big)\Big\}
\leq\max_{a\in\mathcal{A}}\Big\{\sum_{s^{\prime}\in\mathcal{S}_{h}}\Big[\big(p_{M_{2}}(s^{\prime}|s,a)-p_{M_{1}}(s^{\prime}|s,a)\big)r(s,a)
+\gamma\big(p_{M_{2}}(s^{\prime}|s,a)V^{*}_{M_{2},h}(s^{\prime})-p_{M_{1}}(s^{\prime}|s,a)V^{*}_{M_{1},h}(s^{\prime})\big)\Big]\Big\}
\leq\max_{a\in\mathcal{A}}\Big\{\sum_{s^{\prime}\in\mathcal{S}_{h}}\big|p_{M_{2}}(s^{\prime}|s,a)-p_{M_{1}}(s^{\prime}|s,a)\big|\Big\}\,r_{max}
+\max_{a\in\mathcal{A}}\Big\{\sum_{s^{\prime}\in\mathcal{S}_{h}}\big(p_{M_{2}}(s^{\prime}|s,a)-p_{M_{1}}(s^{\prime}|s,a)\big)V^{*}_{M_{2},h}(s^{\prime})
+p_{M_{1}}(s^{\prime}|s,a)\big(V^{*}_{M_{2},h}(s^{\prime})-V^{*}_{M_{1},h}(s^{\prime})\big)\Big\}
\leq\max_{a\in\mathcal{A}}\big\{D_{\ell_{1}}\big(p_{M_{1}}(\cdot|s,a),p_{M_{2}}(\cdot|s,a)\big)\big\}(r_{max}+V_{max})+1\cdot\epsilon_{h}
\leq D_{\ell_{1}}(p_{M_{1}},p_{M_{2}})(r_{max}+V_{max})+\epsilon_{h}
=D_{\ell_{1}}(p_{M_{1}},p_{M_{2}})(r_{max}+V_{max})+D_{\ell_{1}}(p_{M_{1}},p_{M_{2}})(r_{max}+V_{max})(H-h)
=D_{\ell_{1}}(p_{M_{1}},p_{M_{2}})(r_{max}+V_{max})(H-h+1).

Here the first inequality uses $\max_{a}\{f(a)+g(a)\}\leq\max_{a}f(a)+\max_{a}g(a)$ and cancels the common term; the second uses $|r(s,a)|\leq r_{max}$ and $\gamma\leq 1$; the third uses $|V^{*}_{M_{2},h}(s^{\prime})|\leq V_{max}$, $\sum_{s^{\prime}}p_{M_{1}}(s^{\prime}|s,a)=1$, and the inductive hypothesis (1); and $D_{\ell_{1}}(p_{M_{1}},p_{M_{2}})$ denotes the supremum of $D_{\ell_{1}}(p_{M_{1}}(\cdot|s,a),p_{M_{2}}(\cdot|s,a))$ over $(s,a)$. The last two equalities substitute $\epsilon_{h}=D_{\ell_{1}}(p_{M_{1}},p_{M_{2}})(r_{max}+V_{max})(H-h)$, so (2) holds with $\epsilon_{h-1}=D_{\ell_{1}}(p_{M_{1}},p_{M_{2}})(r_{max}+V_{max})(H-h+1)$, completing the induction. ∎
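As a numerical sanity check of the resulting bound $\epsilon_{h}=D_{\ell_{1}}(p_{M_{1}},p_{M_{2}})(r_{max}+V_{max})(H-h)$ (illustrative only, not part of the proof; it assumes $\gamma=1$, random finite MDPs, and the crude bound $V_{max}=H\,r_{max}$), the following Python sketch runs backward induction on two MDPs that share a reward function and verifies the bound at every horizon.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 10            # |S|, |A|, horizon; gamma = 1 (undiscounted)

def random_kernel():
    p = rng.random((S, A, S))
    return p / p.sum(axis=-1, keepdims=True)

r = rng.random((S, A))        # shared reward function in [0, 1]
p1, p2 = random_kernel(), random_kernel()

r_max = r.max()
V_max = H * r_max             # crude upper bound: any value is at most H * r_max
D = np.abs(p1 - p2).sum(axis=-1).max()   # worst-case l1 distance between kernels

def optimal_values(p):
    """Backward induction; returns [V*_0, ..., V*_H] with V*_H = 0."""
    Vs = [np.zeros(S)]
    for _ in range(H):                   # h = H-1, ..., 0
        Q = r + p @ Vs[0]                # (S,A,S) @ (S,) -> (S,A)
        Vs.insert(0, Q.max(axis=1))
    return Vs

V1, V2 = optimal_values(p1), optimal_values(p2)
for h in range(H + 1):
    gap = np.abs(V1[h] - V2[h]).max()
    bound = D * (r_max + V_max) * (H - h)    # epsilon_h from the theorem
    assert gap <= bound + 1e-9, (h, gap, bound)
print("epsilon_h bound holds for all h")
```

In practice the empirical gap is far below $\epsilon_h$, since the bound takes the worst case over all state-action pairs.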

The proof of Lemma LABEL:lemma1 is as follows.

Proof.

We first prove that for any estimated MDP $\hat{M}$ and its corresponding uncertainty-penalized MDP $\tilde{M}$,

\eta_{\tilde{M}}(\pi_{\tilde{M}}^{*})=\eta_{\hat{M}}(\pi_{\tilde{M}}^{*})-\lambda U_{\hat{M}}(\pi_{\tilde{M}}^{*}). (5)

We know that $\tilde{M}$ and $\hat{M}$ share the same transition dynamics $p$ but differ in their reward functions, with $\tilde{r}(s,a)=\hat{r}(s,a)-\lambda u(s,a)$. Therefore,

\eta_{\tilde{M}}(\pi_{\tilde{M}}^{*})=\mathbb{E}_{(s,a)\sim\rho_{\tilde{M}}^{\pi_{\tilde{M}}^{*}}(s,a)}[\tilde{r}(s,a)] (6)
=\mathbb{E}_{(s,a)\sim\rho_{\hat{M}}^{\pi_{\tilde{M}}^{*}}(s,a)}[\hat{r}(s,a)-\lambda u(s,a)]
=\mathbb{E}_{(s,a)\sim\rho_{\hat{M}}^{\pi_{\tilde{M}}^{*}}(s,a)}[\hat{r}(s,a)]-\lambda\mathbb{E}_{(s,a)\sim\rho_{\hat{M}}^{\pi_{\tilde{M}}^{*}}(s,a)}[u(s,a)]
=\eta_{\hat{M}}(\pi_{\tilde{M}}^{*})-\lambda U_{\hat{M}}(\pi_{\tilde{M}}^{*}).

The second equality holds because the occupancy measure depends only on the (shared) dynamics, so $\rho_{\tilde{M}}^{\pi}=\rho_{\hat{M}}^{\pi}$ for any policy $\pi$.
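The identity in Equation (5) follows from the linearity of expectation together with the shared dynamics. As an illustrative check (not part of the proof), the following Python sketch evaluates a fixed policy on a small random MDP under both $\hat{r}$ and $\tilde{r}=\hat{r}-\lambda u$; the undiscounted finite-horizon return, the random MDP, and all constants are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, H, lam = 4, 2, 6, 0.7

p = rng.random((S, A, S)); p /= p.sum(-1, keepdims=True)  # shared dynamics
r_hat = rng.random((S, A))          # estimated reward r_hat(s, a)
u = rng.random((S, A))              # uncertainty penalty u(s, a)
r_tilde = r_hat - lam * u           # penalized reward
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)  # any fixed policy
mu0 = np.full(S, 1.0 / S)           # initial state distribution

def eta_and_penalty(reward):
    """Exact finite-horizon E[sum of reward] and E[sum of u] under pi."""
    d, ret, pen = mu0.copy(), 0.0, 0.0
    for _ in range(H):
        sa = d[:, None] * pi                  # state-action occupancy at step
        ret += (sa * reward).sum()
        pen += (sa * u).sum()
        d = np.einsum('sa,sat->t', sa, p)     # propagate through dynamics
    return ret, pen

eta_tilde, _ = eta_and_penalty(r_tilde)
eta_hat, U_hat = eta_and_penalty(r_hat)
assert np.isclose(eta_tilde, eta_hat - lam * U_hat)
print("eta_tilde == eta_hat - lambda * U_hat")
```

The occupancy computation uses only $p$ and $\pi$, never the reward, which is exactly why the same $\rho$ appears on both sides of the identity.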

Next we decompose $|\eta_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})-\eta_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})|$ to obtain the final bound.

|\eta_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})-\eta_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})| (7)
=|(\eta_{\tilde{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})+\lambda U_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*}))-(\eta_{\tilde{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})+\lambda U_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*}))|
\leq|\eta_{\tilde{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})-\eta_{\tilde{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})|+\lambda|U_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})-U_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})|
\leq D_{\ell_{1}}(p_{\tilde{M}_{t}},p_{\tilde{M}_{t+1}})(r_{max}+V_{max})H+\lambda|U_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})-U_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})|
=D_{\ell_{1}}(p_{\hat{M}_{t}},p_{\hat{M}_{t+1}})(r_{max}+V_{max})H+\lambda|U_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})-U_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})|.

The first equality applies Equation (5) to both terms, the second inequality applies Theorem LABEL:thm1 to the optimal returns of $\tilde{M}_{t}$ and $\tilde{M}_{t+1}$, and the last equality holds because $\tilde{M}$ and $\hat{M}$ share dynamics, i.e., $p_{\tilde{M}}=p_{\hat{M}}$. ∎

Remarks on Equation (LABEL:Gineq)

The original inequality in Equation (LABEL:Gineq) was introduced in MOPO (yu2020mopo), using the total variation distance as the distance measure instead of $\delta_{\ell_{1}}(\cdot,\cdot)$. The total variation distance is defined as $\delta_{TV}(P,Q)=\sup_{A\in\mathcal{F}}|P(A)-Q(A)|$, where $P,Q$ are two probability measures on a sigma-algebra $\mathcal{F}$. It is written as

|G_{\hat{M}}^{\pi}(s,a)|\leq V_{max}\,\delta_{TV}(p_{\hat{M}}(s,a),p_{M}(s,a)). (8)

The version in Equation (LABEL:Gineq) has the same meaning, because $\delta_{TV}(P,Q)=\frac{1}{2}\delta_{\ell_{1}}(P,Q)$ when the sigma-algebra $\mathcal{F}$ is countable.
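The identity $\delta_{TV}(P,Q)=\frac{1}{2}\delta_{\ell_{1}}(P,Q)$ for discrete distributions can be verified numerically. The following Python snippet (illustrative only; the distributions and seed are arbitrary choices) computes the supremum over all events by brute force and compares it with half the $\ell_{1}$ distance.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 6
P = rng.random(n); P /= P.sum()       # two arbitrary discrete distributions
Q = rng.random(n); Q /= Q.sum()

# sup_A |P(A) - Q(A)| by brute force over all 2^n events A
tv = max(abs(P[list(A)].sum() - Q[list(A)].sum())
         for k in range(n + 1)
         for A in itertools.combinations(range(n), k))

# half the l1 distance between the probability vectors
half_l1 = 0.5 * np.abs(P - Q).sum()

assert np.isclose(tv, half_l1)
print("delta_TV == 0.5 * delta_l1")
```

The supremum is attained at the event $A=\{i:P(i)>Q(i)\}$, which is why the brute-force maximum matches $\frac{1}{2}\sum_i|P(i)-Q(i)|$ exactly.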

The proof of Theorem LABEL:thm2 follows from Lemma LABEL:lemma1 and Lemma LABEL:tele.

Proof.

The first step inserts $\eta_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})$ and $\eta_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})$ to build the bound. In the second step, Lemma LABEL:tele is used to bound $|\eta_{M}(\pi_{\tilde{M}_{t}}^{*})-\eta_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})|$ and $|\eta_{M}(\pi_{\tilde{M}_{t+1}}^{*})-\eta_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})|$, while $|\eta_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})-\eta_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})|$ is bounded by Lemma LABEL:lemma1.

|\eta_{M}(\pi_{\tilde{M}_{t}}^{*})-\eta_{M}(\pi_{\tilde{M}_{t+1}}^{*})| (9)
\leq|\eta_{M}(\pi_{\tilde{M}_{t}}^{*})-\eta_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})|
+|\eta_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})-\eta_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})|
+|\eta_{M}(\pi_{\tilde{M}_{t+1}}^{*})-\eta_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})|
\leq\Big(\mathbb{E}_{(s,a)\sim\rho_{\hat{M}_{t}}^{\pi_{\tilde{M}_{t}}^{*}}(s,a)}\delta_{\ell_{1}}(p_{\hat{M}_{t}}(s,a),p_{M}(s,a))
+\mathbb{E}_{(s,a)\sim\rho_{\hat{M}_{t+1}}^{\pi_{\tilde{M}_{t+1}}^{*}}(s,a)}\delta_{\ell_{1}}(p_{\hat{M}_{t+1}}(s,a),p_{M}(s,a))\Big)\frac{1}{2}\gamma V_{max}
+\lambda|U_{\hat{M}_{t}}(\pi_{\tilde{M}_{t}}^{*})-U_{\hat{M}_{t+1}}(\pi_{\tilde{M}_{t+1}}^{*})|
+D_{\ell_{1}}(p_{\hat{M}_{t}},p_{\hat{M}_{t+1}})(r_{max}+V_{max})H. ∎