\section*{Appendix for ``MOORe: Model-based Offline-to-Online Reinforcement Learning''}

\subsection*{Appendix A: Theorem Proofs}

The full proof of Theorem~\ref{thm1} is below.

\textit{Proof.} We proceed by backward induction: we start from the last horizon $h=H$ and use closeness at horizon $h$ to prove closeness at horizon $h-1$.

For the last horizon $h=H$, $V^{*}_{M,H}(s)=0$ for all $M$ because $s$ is a terminal state, so
\begin{equation}
\|V^{*}_{M_1,H}-V^{*}_{M_2,H}\|_{\infty}\le 0=\epsilon_H.
\end{equation}
Suppose
\begin{equation}
\forall s\in\mathcal{S}_h,\quad \|V^{*}_{M_1,h}-V^{*}_{M_2,h}\|_{\infty}\le\epsilon_h.
\end{equation}
We need to prove
\begin{equation}
\forall s\in\mathcal{S}_{h-1},\quad \|V^{*}_{M_1,h-1}-V^{*}_{M_2,h-1}\|_{\infty}\le\epsilon_{h-1},
\end{equation}
which is equivalent to proving
\begin{equation}
-\epsilon_{h-1}\le V^{*}_{M_2,h-1}(s)-V^{*}_{M_1,h-1}(s)\le\epsilon_{h-1}.
\end{equation}
Without loss of generality, we prove the right-hand inequality above.
\begin{align}
\mathrm{LHS}={}&\max_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}_{h}}p_{M_2}(s'|s,a)\big(r(s,a)+\gamma V^{*}_{M_2,h}(s')\big)\Big\}\nonumber\\
&-\max_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}_{h}}p_{M_1}(s'|s,a)\big(r(s,a)+\gamma V^{*}_{M_1,h}(s')\big)\Big\}\nonumber\\
={}&\max_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}_{h}}\big[p_{M_1}(s'|s,a)\big(r(s,a)+\gamma V^{*}_{M_1,h}(s')\big)\nonumber\\
&\quad+\big(p_{M_2}(s'|s,a)-p_{M_1}(s'|s,a)\big)r(s,a)\nonumber\\
&\quad+p_{M_2}(s'|s,a)V^{*}_{M_2,h}(s')-p_{M_1}(s'|s,a)V^{*}_{M_1,h}(s')\big]\Big\}\nonumber\\
&-\max_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}_{h}}p_{M_1}(s'|s,a)\big(r(s,a)+\gamma V^{*}_{M_1,h}(s')\big)\Big\}\nonumber\\
\le{}&\max_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}_{h}}p_{M_1}(s'|s,a)\big(r(s,a)+\gamma V^{*}_{M_1,h}(s')\big)\Big\}\nonumber\\
&+\max_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}_{h}}\big[\big(p_{M_2}(s'|s,a)-p_{M_1}(s'|s,a)\big)r(s,a)\nonumber\\
&\quad+p_{M_2}(s'|s,a)V^{*}_{M_2,h}(s')-p_{M_1}(s'|s,a)V^{*}_{M_1,h}(s')\big]\Big\}\nonumber\\
&-\max_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}_{h}}p_{M_1}(s'|s,a)\big(r(s,a)+\gamma V^{*}_{M_1,h}(s')\big)\Big\}\nonumber\\
={}&\max_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}_{h}}\big[\big(p_{M_2}(s'|s,a)-p_{M_1}(s'|s,a)\big)r(s,a)\nonumber\\
&\quad+p_{M_2}(s'|s,a)V^{*}_{M_2,h}(s')-p_{M_1}(s'|s,a)V^{*}_{M_1,h}(s')\big]\Big\}\nonumber\\
\le{}&\max_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}_{h}}\big|p_{M_2}(s'|s,a)-p_{M_1}(s'|s,a)\big|\Big\}\,r_{\max}\nonumber\\
&+\max_{a\in\mathcal{A}}\Big\{\sum_{s'\in\mathcal{S}_{h}}\big(p_{M_2}(s'|s,a)-p_{M_1}(s'|s,a)\big)V^{*}_{M_2,h}(s')\nonumber\\
&\quad+p_{M_1}(s'|s,a)\big(V^{*}_{M_2,h}(s')-V^{*}_{M_1,h}(s')\big)\Big\}\nonumber\\
\le{}&\max_{a\in\mathcal{A}}\big\{D_{\ell_1}\big(p_{M_1}(\cdot|s,a),p_{M_2}(\cdot|s,a)\big)\big\}(r_{\max}+V_{\max})\nonumber\\
&+1\cdot\max_{s'\in\mathcal{S}_{h}}\big(V^{*}_{M_2,h}(s')-V^{*}_{M_1,h}(s')\big)\nonumber\\
\le{}&D_{\ell_1}(p_{M_1},p_{M_2})(r_{\max}+V_{\max})+1\cdot\epsilon_h\nonumber\\
={}&D_{\ell_1}(p_{M_1},p_{M_2})(r_{\max}+V_{\max})\nonumber\\
&+D_{\ell_1}(p_{M_1},p_{M_2})(r_{\max}+V_{\max})(H-h)\nonumber\\
={}&D_{\ell_1}(p_{M_1},p_{M_2})(r_{\max}+V_{\max})(H-h+1)=\epsilon_{h-1},
\end{align}
where the second-to-last equality substitutes the inductive value $\epsilon_h=D_{\ell_1}(p_{M_1},p_{M_2})(r_{\max}+V_{\max})(H-h)$. ∎

The proof of Lemma~\ref{lemma1} is as follows.

\textit{Proof.} We first prove that for any estimated MDP $\hat{M}$ and its corresponding uncertainty-penalized MDP $\tilde{M}$,
\begin{equation}
\eta_{\tilde{M}}(\pi^{*}_{\tilde{M}})=\eta_{\hat{M}}(\pi^{*}_{\tilde{M}})-\lambda U_{\hat{M}}(\pi^{*}_{\tilde{M}}).
\end{equation}
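As an illustrative numerical sketch of the value-closeness bound just proved (not part of the paper: a toy tabular finite-horizon MDP with $\gamma=1$; the sizes, seed, and the helper name `optimal_values` are assumptions for illustration), one can compare the optimal-value gap of two nearby MDPs against $\epsilon_0=D_{\ell_1}(p_{M_1},p_{M_2})(r_{\max}+V_{\max})H$:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 3, 5                       # toy sizes: states, actions, horizon (assumed)
r = rng.uniform(0.0, 1.0, size=(S, A))  # shared reward function, so r_max = r.max()
p1 = rng.dirichlet(np.ones(S), size=(S, A))                     # dynamics of M1
p2 = 0.95 * p1 + 0.05 * rng.dirichlet(np.ones(S), size=(S, A))  # nearby dynamics M2

def optimal_values(p):
    """Backward induction with V*_H = 0 and gamma = 1 (finite horizon)."""
    V = np.zeros(S)
    for _ in range(H):
        V = (r + p @ V).max(axis=1)     # Bellman optimality backup over actions
    return V                            # optimal values at the first step, h = 0

V1, V2 = optimal_values(p1), optimal_values(p2)
D = np.abs(p1 - p2).sum(axis=-1).max()  # D_l1: worst-case l1 gap of next-state dists
r_max = r.max()
V_max = max(V1.max(), V2.max())
gap = np.abs(V1 - V2).max()
bound = D * (r_max + V_max) * H         # epsilon_0 = D_l1 (r_max + V_max)(H - 0)
print(gap <= bound)
```

The bound is loose in practice; the check only confirms the gap never exceeds it.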
We know that $\tilde{M}$ and $\hat{M}$ share the same transition dynamics $p$ but have different reward functions, $\tilde{r}(s,a)=\hat{r}(s,a)-\lambda u(s,a)$. Therefore,
\begin{align}
\eta_{\tilde{M}}(\pi^{*}_{\tilde{M}})={}&\mathbb{E}_{(s,a)\sim\rho^{\pi^{*}_{\tilde{M}}}_{\tilde{M}}(s,a)}\big[\tilde{r}(s,a)\big]\nonumber\\
={}&\mathbb{E}_{(s,a)\sim\rho^{\pi^{*}_{\tilde{M}}}_{\hat{M}}(s,a)}\big[\hat{r}(s,a)-\lambda u(s,a)\big]\nonumber\\
={}&\mathbb{E}_{(s,a)\sim\rho^{\pi^{*}_{\tilde{M}}}_{\hat{M}}(s,a)}\big[\hat{r}(s,a)\big]-\lambda\,\mathbb{E}_{(s,a)\sim\rho^{\pi^{*}_{\tilde{M}}}_{\hat{M}}(s,a)}\big[u(s,a)\big]\nonumber\\
={}&\eta_{\hat{M}}(\pi^{*}_{\tilde{M}})-\lambda U_{\hat{M}}(\pi^{*}_{\tilde{M}}),
\end{align}
where the second equality uses $\rho^{\pi}_{\tilde{M}}=\rho^{\pi}_{\hat{M}}$, since the two MDPs share the same dynamics. Next, we decompose $|\eta_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})-\eta_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})|$ to get the final bound.
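The identity $\eta_{\tilde{M}}(\pi)=\eta_{\hat{M}}(\pi)-\lambda U_{\hat{M}}(\pi)$ above rests only on the shared dynamics and linearity of expectation in the reward; a minimal numerical sketch (a toy tabular MDP with a fixed policy; the sizes, seed, and the helper name `finite_horizon_return` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, H, lam = 4, 2, 6, 0.5
p = rng.dirichlet(np.ones(S), size=(S, A))  # shared dynamics of M_hat and M_tilde
r_hat = rng.uniform(size=(S, A))            # estimated reward r_hat(s, a)
u = rng.uniform(size=(S, A))                # uncertainty penalty u(s, a)
pi = rng.dirichlet(np.ones(A), size=S)      # an arbitrary fixed policy
mu0 = np.full(S, 1.0 / S)                   # initial state distribution

def finite_horizon_return(reward):
    """H-step expected return of pi under dynamics p for a given reward table."""
    d, total = mu0.copy(), 0.0
    for _ in range(H):
        total += np.einsum('s,sa,sa->', d, pi, reward)  # E[reward] at this step
        d = np.einsum('s,sa,sat->t', d, pi, p)          # propagate state distribution
    return total

eta_tilde = finite_horizon_return(r_hat - lam * u)  # return in the penalized MDP
eta_hat = finite_horizon_return(r_hat)              # return in the estimated MDP
U = finite_horizon_return(u)                        # accumulated uncertainty
print(np.isclose(eta_tilde, eta_hat - lam * U))
```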
\begin{align}
\big|\eta_{\hat{M}_t}&(\pi^{*}_{\tilde{M}_t})-\eta_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})\big|\nonumber\\
={}&\big|\big(\eta_{\tilde{M}_t}(\pi^{*}_{\tilde{M}_t})+\lambda U_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})\big)-\big(\eta_{\tilde{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})+\lambda U_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})\big)\big|\nonumber\\
\le{}&\big|\eta_{\tilde{M}_t}(\pi^{*}_{\tilde{M}_t})-\eta_{\tilde{M}_{t+1}}(\pi^{*}_{\tilde{M}_t})\big|+\lambda\big|U_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})-U_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})\big|\nonumber\\
\le{}&D_{\ell_1}(p_{\tilde{M}_t},p_{\tilde{M}_{t+1}})(r_{\max}+V_{\max})H+\lambda\big|U_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})-U_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})\big|\nonumber\\
={}&D_{\ell_1}(p_{\hat{M}_t},p_{\hat{M}_{t+1}})(r_{\max}+V_{\max})H+\lambda\big|U_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})-U_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})\big|.
\end{align}
∎

\paragraph{Remarks on Equation~\eqref{Gineq}.}
The original inequality in Equation~\eqref{Gineq} was introduced in MOPO \citep{yu2020mopo}, using the total variation distance as the distance measure instead of $\delta_{\ell_1}(\cdot,\cdot)$. The total variation distance is defined as $\delta_{TV}(P,Q)=\sup_{A\in\mathcal{F}}|P(A)-Q(A)|$, where $P,Q$ are two probability measures on a sigma-algebra $\mathcal{F}$. In that form, the inequality reads
\begin{equation}
\big|G^{\pi}_{\hat{M}}(s,a)\big|\le V_{\max}\,\delta_{TV}\big(p_{\hat{M}}(s,a),p_{M}(s,a)\big).
\end{equation}
The version in Equation~\eqref{Gineq} carries the same meaning, because $\delta_{TV}(P,Q)=\frac{1}{2}\delta_{\ell_1}(P,Q)$ when the sigma-algebra $\mathcal{F}$ is countable.

The proof of Theorem~\ref{thm2} follows from Lemma~\ref{lemma1} and Lemma~\ref{tele}.

\textit{Proof.} The first step interposes $\eta_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})$ and $\eta_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})$ to build the bound. In the second step, Lemma~\ref{tele} is used to bound $|\eta_{M}(\pi^{*}_{\tilde{M}_t})-\eta_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})|$ and $|\eta_{M}(\pi^{*}_{\tilde{M}_{t+1}})-\eta_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})|$, while $|\eta_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})-\eta_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})|$ is bounded by Lemma~\ref{lemma1}:
\begin{align}
\big|\eta_{M}&(\pi^{*}_{\tilde{M}_t})-\eta_{M}(\pi^{*}_{\tilde{M}_{t+1}})\big|\nonumber\\
\le{}&\big|\eta_{M}(\pi^{*}_{\tilde{M}_t})-\eta_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})\big|+\big|\eta_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})-\eta_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})\big|\nonumber\\
&+\big|\eta_{M}(\pi^{*}_{\tilde{M}_{t+1}})-\eta_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})\big|\nonumber\\
\le{}&\Big(\mathbb{E}_{(s,a)\sim\rho^{\pi^{*}_{\tilde{M}_t}}_{\hat{M}_t}(s,a)}\,\delta_{\ell_1}\big(p_{\hat{M}_t}(s,a),p_{M}(s,a)\big)\nonumber\\
&\quad+\mathbb{E}_{(s,a)\sim\rho^{\pi^{*}_{\tilde{M}_{t+1}}}_{\hat{M}_{t+1}}(s,a)}\,\delta_{\ell_1}\big(p_{\hat{M}_{t+1}}(s,a),p_{M}(s,a)\big)\Big)\frac{1}{2}\gamma V_{\max}\nonumber\\
&+\lambda\big|U_{\hat{M}_t}(\pi^{*}_{\tilde{M}_t})-U_{\hat{M}_{t+1}}(\pi^{*}_{\tilde{M}_{t+1}})\big|\nonumber\\
&+D_{\ell_1}(p_{\hat{M}_t},p_{\hat{M}_{t+1}})(r_{\max}+V_{\max})H.
\end{align}
∎
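The remark above relied on $\delta_{TV}(P,Q)=\frac{1}{2}\delta_{\ell_1}(P,Q)$ for countable spaces; this can be checked by brute force on a small discrete sample space (the size and seed below are illustrative assumptions), enumerating every event $A$ in the supremum:

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(2)
n = 6
P = rng.dirichlet(np.ones(n))  # two random distributions on {0, ..., n-1}
Q = rng.dirichlet(np.ones(n))

# delta_TV as a sup over events A: enumerate every subset of the sample space.
events = chain.from_iterable(combinations(range(n), k) for k in range(n + 1))
tv = max(abs(P[list(A)].sum() - Q[list(A)].sum()) for A in events)

half_l1 = 0.5 * np.abs(P - Q).sum()  # half the l1 distance between P and Q
print(np.isclose(tv, half_l1))
```

The maximizing event is always $\{s: P(s)>Q(s)\}$, which is why the supremum collapses to half the $\ell_1$ distance.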