
Riemannian Proximal Policy Optimization

Shijun Wang ([email protected])
Dept. of Artificial Intelligence, Ant Financial Services Group, Seattle, USA

Baocheng Zhu ([email protected])
Chen Li ([email protected])
Mingzhe Wu ([email protected])
Dept. of Artificial Intelligence, Ant Financial Services Group, Shanghai, China

James Zhang ([email protected])
Dept. of Artificial Intelligence, Ant Financial Services Group, New York, USA

Wei Chu ([email protected])
Yuan Qi ([email protected])
Dept. of Artificial Intelligence, Ant Financial Services Group, Hangzhou, China
Abstract

In this paper, we propose a general Riemannian proximal optimization algorithm with guaranteed convergence to solve Markov decision process (MDP) problems. To model policy functions in MDPs, we employ a Gaussian mixture model (GMM) and formulate policy optimization as a non-convex optimization problem in the Riemannian space of positive semidefinite matrices. For two given policy functions, we also provide a lower bound on the policy improvement by using bounds derived from the Wasserstein distance between GMMs. Preliminary experiments show the efficacy of our proposed Riemannian proximal policy optimization algorithm.

Keywords: Markov Decision Process, Riemannian Proximal Policy Optimization, Gaussian Mixture Model, Wasserstein Distance

1 Introduction

Reinforcement learning studies how agents explore and exploit an environment and take actions to maximize long-term reward. It has broad applications in robot control and game playing (Mnih et al., 2015; Silver et al., 2016; Argall et al., 2009; Silver et al., 2017). Value iteration and policy gradient methods are the mainstream methods for reinforcement learning (Sutton and Barto, 2018; Li, 2017).

Policy gradient methods learn an optimal policy directly from past experience or on the fly. They maximize the expected discounted reward through a parametrized policy whose parameters are updated using gradient ascent. Traditional policy gradient methods suffer from three well-known obstacles: high variance, sample inefficiency, and difficulty in tuning the learning rate. To make the learning algorithm more robust and scalable to large datasets, Schulman et al. proposed the trust region policy optimization (TRPO) algorithm (Schulman et al., 2015). TRPO searches for the optimal policy by maximizing a surrogate function with constraints placed upon the KL divergence between the old and new policy distributions, which guarantees monotonic improvement. To further improve data efficiency and reliability, the proximal policy optimization (PPO) algorithm was proposed, which utilizes first-order optimization and a clipped probability ratio between the new and old policies (Schulman et al., 2017). TRPO was also extended to constrained reinforcement learning: Achiam et al. proposed constrained policy optimization (CPO), which guarantees near-constraint satisfaction at each iteration (Achiam et al., 2017).

Although TRPO, PPO and CPO have shown promising performance on complex decision-making problems, such as continuous-control tasks and playing Atari, like other neural-network-based models they face two typical challenges: a lack of interpretability, and difficulty in converging due to the non-convex optimization in a high-dimensional parameter space. For many real applications, data lying in a high-dimensional ambient space usually have a much lower intrinsic dimension. It may therefore be easier to optimize the policy function on low-dimensional manifolds.

In recent years, many optimization methods have been generalized from Euclidean space to Riemannian space due to the manifold structures present in many machine learning problems (Absil et al., 2007, 2009; Vandereycken, 2013; Huang et al., 2015; Zhang et al., 2016). In this paper, we leverage the merits of TRPO, PPO, and CPO and propose a new algorithm called Riemannian proximal policy optimization (RPPO) by taking manifold learning into account for policy optimization. In order to estimate the policy, we need a density-estimation function; options include kernel density estimation, neural networks, the Gaussian mixture model (GMM), etc. In this study we choose the GMM due to its good analytical characteristics, universal representation power and low computational cost compared with neural networks. It is well known that the covariance matrices of a GMM lie on a Riemannian manifold of positive semidefinite matrices.

To be more specific, we first model policy functions using GMMs. Second, to optimize the GMM and learn the optimal policy function efficiently, we formulate the problem as a non-convex optimization problem in Riemannian space. In this way, our method gains advantages in both interpretability and speed of convergence. Please note that our RPPO algorithm can easily be extended to any other non-GMM density estimator, as long as its parameter space is Riemannian. GMMs have previously been applied to reinforcement learning by embedding a GMM in the Q-learning framework (Agostini and Celaya, 2010); that approach, however, inherits the weakness of Q-learning that it can hardly handle problems with large continuous state-action spaces.

2 Preliminaries

2.1 Reinforcement learning

In this study, we consider the following Markov decision process (MDP), which is defined as a tuple (S, A, P, r, \gamma), where S is the set of states, A is the set of actions, P: S \times A \times S \rightarrow [0,1] is the transition probability function, r: S \times A \times S \rightarrow \mathcal{R} is the reward function, and \gamma is the discount factor which balances future rewards and immediate ones.

To make optimal decisions for MDP problems, reinforcement learning was proposed to learn an optimal value function or policy. A value function is the expected, discounted accumulative reward of a state or state-action pair obtained by following a policy \pi. Here we define the state value function as v_{\pi}(s) = E_{\tau\sim\pi}[r(\tau) \mid s_0 = s], where \tau = (s_0, a_0, s_1, ...) denotes a trajectory generated by playing policy \pi, a_t \sim \pi(a_t \mid s_t), and s_{t+1} \sim P(s_{t+1} \mid s_t, a_t). Similarly, we define the state-action value function as q_{\pi}(s,a) = E_{\tau\sim\pi}[r(\tau) \mid s_0 = s, a_0 = a], and the advantage function as A_{\pi}(s,a) = q_{\pi}(s,a) - v_{\pi}(s).

In reinforcement learning, we try to find or learn an optimal policy \pi which maximizes a given performance metric J(\pi). The infinite-horizon discounted accumulative return is widely used to evaluate a given policy and is defined as J(\pi) = E_{\tau\sim\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t, s_{t+1})\right], where r(s_t, a_t, s_{t+1}) is the reward received for transitioning from s_t to s_{t+1} by taking action a_t. Please note that the expectation is taken over the distribution of trajectories.
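As a concrete illustration of the return defined above, the sketch below estimates J(\pi) by Monte Carlo over sampled trajectories; the sample_trajectory callable and the default values are hypothetical placeholders, not part of the paper's method.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Monte Carlo estimate of the discounted return of one trajectory."""
    g, ret = 1.0, 0.0
    for r in rewards:
        ret += g * r
        g *= gamma
    return ret

def estimate_J(sample_trajectory, gamma=0.99, n_trajectories=100):
    """Estimate J(pi) by averaging discounted returns over sampled trajectories.
    `sample_trajectory` is a hypothetical callable that returns the list of
    rewards obtained by rolling out the current policy pi in the environment."""
    returns = [discounted_return(sample_trajectory(), gamma)
               for _ in range(n_trajectories)]
    return float(np.mean(returns))
```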

In the work of (Kakade and Langford, 2002; Schulman et al., 2017), for two given policies \pi and \pi', their expected accumulative returns over an infinite horizon are linked through the advantage function: J(\pi') = J(\pi) + E_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_t,a_t)\right]. By introducing the discounted visit frequencies \rho_{\pi}(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^{2} P(s_2 = s) + ... (Schulman et al., 2015; Achiam et al., 2017), where s_0 \sim \rho_0 and \rho_0: S \rightarrow \mathcal{R} is the distribution of the initial state s_0, we have J(\pi') = J(\pi) + \sum_{s}\rho_{\pi'}(s)\sum_{a}\pi'(a \mid s)A_{\pi}(s,a). To reduce the complexity of searching for a new policy \pi' which increases J(\pi'), we replace the discounted visit frequencies \rho_{\pi'}(s) to be optimized with the old discounted visit frequencies \rho_{\pi}(s): L(\pi') = L(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\pi'(a \mid s)A_{\pi}(s,a).

Assume that the policy functions \pi(a \mid s) are parametrized by a vector \theta, \pi(a \mid s) = \pi_{\theta}(a \mid s). Searching for a new policy \pi' is then equivalent to searching for new parameters \theta' in the parameter space, so we have L(\pi') = L(\theta') = L(\theta) + \sum_{s}\rho_{\pi_{\theta}}(s)\sum_{a}\pi_{\theta'}(a \mid s)A_{\pi_{\theta}}(s,a).
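In practice the surrogate gain L(\theta') - L(\theta) is estimated from rollouts of the old policy. Below is a minimal sample-based sketch, assuming the advantages have already been estimated and using the standard importance weight \pi_{\theta'}(a\mid s)/\pi_{\theta}(a\mid s) in place of the explicit sum over actions; all function names are illustrative.

```python
import numpy as np

def surrogate_gain(states, actions, advantages, new_policy_pdf, old_policy_pdf):
    """Sample-based estimate of L(theta') - L(theta).
    states/actions/advantages come from rollouts of the old policy pi_theta;
    *_policy_pdf(a, s) return the densities pi_{theta'}(a|s) and pi_theta(a|s)."""
    ratios = np.array([new_policy_pdf(a, s) / old_policy_pdf(a, s)
                       for s, a in zip(states, actions)])
    return float(np.mean(ratios * np.asarray(advantages)))
```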

2.2 Riemannian space

Here we give a brief introduction to Riemannian space; for more details see (Eisenhart, 2016). Let M be a connected, finite-dimensional manifold with dimensionality m. We denote by T_p M the tangent space of M at p. Let M be endowed with a Riemannian metric \langle\cdot,\cdot\rangle, with corresponding norm denoted by \|\cdot\|, so that M is now a Riemannian manifold (Eisenhart, 2016). We use l(\gamma) = \int_{a}^{b}\|\gamma'(t)\|\,dt to denote the length of a piecewise smooth curve \gamma: [a,b] \rightarrow M joining \theta' to \theta, i.e., such that \gamma(a) = \theta' and \gamma(b) = \theta. Minimizing this length functional over the set of all piecewise smooth curves passing through \theta' and \theta, we get a Riemannian distance d(\theta',\theta) which induces the original topology on M. For \theta \in M, the exponential map exp_{\theta}: T_{\theta}M \rightarrow M is defined by exp_{\theta}v = \gamma_{v}(1,\theta), which maps a tangent vector v at \theta to M along the curve \gamma. For any \theta' \in M we define the inverse exponential map exp_{\theta'}^{-1}: M \rightarrow T_{\theta'}M, which is C^{\infty} and maps a point \theta on M to a tangent vector at \theta' with d(\theta',\theta) = \|exp_{\theta'}^{-1}\theta\|. We assume (M,d) is a complete metric space, bounded, and that all closed subsets of M are compact. For a given convex function f: M \rightarrow R and a point \theta' \in M, a vector s \in T_{\theta'}M is called a subgradient of f at \theta' if f(\theta) \geq f(\theta') + \langle s, exp_{\theta'}^{-1}\theta\rangle for all \theta \in M. The set of all subgradients of f at \theta' is called the subdifferential of f at \theta', denoted by \partial f(\theta'). If M is a Hadamard manifold, which is complete, simply connected and has everywhere non-positive sectional curvature, the subdifferential of f at any point on M is non-empty (Ferreira and Oliveira, 2002).
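For concreteness, the sketch below implements the exponential map, its inverse and the induced distance on the manifold of symmetric positive definite matrices under the affine-invariant metric; this particular choice of metric is our assumption for illustration, since the text above leaves the Riemannian metric generic.

```python
import numpy as np
from scipy.linalg import sqrtm, expm, logm

def spd_exp(S, V):
    """Exponential map exp_S(V) on the SPD manifold (affine-invariant metric)."""
    S_half = sqrtm(S)
    S_half_inv = np.linalg.inv(S_half)
    return S_half @ expm(S_half_inv @ V @ S_half_inv) @ S_half

def spd_log(S, T):
    """Inverse exponential map exp_S^{-1}(T): tangent vector at S pointing to T."""
    S_half = sqrtm(S)
    S_half_inv = np.linalg.inv(S_half)
    return S_half @ logm(S_half_inv @ T @ S_half_inv) @ S_half

def spd_dist(S, T):
    """Riemannian distance d(S, T) = ||exp_S^{-1}(T)|| induced by this metric."""
    S_half_inv = np.linalg.inv(sqrtm(S))
    return np.linalg.norm(logm(S_half_inv @ T @ S_half_inv), 'fro')
```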

3 Modeling policy function using Gaussian mixture model

To model policy functions, we employ the Gaussian mixture model, a widely used and statistically mature method for clustering and density estimation. The policy function can be modeled as \pi(a \mid s) = \sum_{i=1}^{K}\alpha_i(s)\mathcal{N}(a;\mu_i(s),S_i(s)), where \mathcal{N} is a (multivariate) Gaussian distribution with mean \mu \in R^{d} and covariance matrix S \succ 0, K is the number of components in the mixture model, and \alpha = (\alpha_1,\alpha_2,...,\alpha_K) are the mixture component weights, which sum to 1. In the following, we drop s from the GMM notation for simplicity; the parameters of the GMM still depend on the state variable s implicitly.
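A minimal sketch of evaluating and sampling such a GMM policy for one state is given below; the state-conditioned parameters (weights, means, covariances) are assumed to be produced elsewhere.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_policy_density(a, weights, means, covs):
    """Evaluate pi(a|s) = sum_i alpha_i(s) N(a; mu_i(s), S_i(s)) for one state.
    weights must sum to 1; means/covs are lists of component parameters."""
    return sum(w * multivariate_normal.pdf(a, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))

def gmm_policy_sample(rng, weights, means, covs):
    """Draw an action a ~ pi(.|s): first sample a component, then its Gaussian."""
    i = rng.choice(len(weights), p=weights)
    return rng.multivariate_normal(means[i], covs[i])
```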

Let us define \theta = ((\alpha_1,\mu_1,S_1),(\alpha_2,\mu_2,S_2),...,(\alpha_K,\mu_K,S_K)), which parametrizes the policy function. For a given policy function \pi_{\theta}(a \mid s), we would like to find a new policy \pi'_{\theta'}(a \mid s) which has higher performance as evaluated by L(\pi') = L(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\pi'(a \mid s)A_{\pi}(s,a), within close proximity of the old policy to avoid dramatic policy updates.

Define g(\theta') = -L(\theta') and \varphi(\theta') = \frac{\beta}{2}d^{2}(\theta',\theta), which keeps the new \theta' within the proximity of the old parameter \theta. Here d(\theta',\theta) is a similarity metric, which can be the Euclidean distance, KL divergence, J-S divergence or the 2nd Wasserstein distance (Arjovsky et al., 2017).

We would like to optimize the following problem with corresponding constraints from GMMs:

\underset{\theta'=((\alpha_1',\mu_1',S_1'),(\alpha_2',\mu_2',S_2'),...,(\alpha_K',\mu_K',S_K'))}{\min} f(\theta') = g(\theta') + \varphi(\theta')   (1)
= -L(\theta) - \sum_{s}\rho_{\pi_{\theta}}(s)\sum_{a}\left(\sum_{i=1}^{K}\alpha_i'\mathcal{N}(a;\mu_i',S_i')\right)A_{\pi_{\theta}}(s,a) + \frac{\beta}{2}d^{2}(\theta',\theta),

s.t. \sum_{i=1}^{K}\alpha_i' = 1, S_i' \succ 0, i = 1,2,...,K.

We employ a reparametrization method to make the Gaussian distributions zero-centered. We augment the action variables by 1 and define a new variable vector a = [a, 1]^{\top} with new covariance matrix S = \begin{bmatrix} S+\mu\mu^{\top} & \mu \\ \mu^{\top} & 1 \end{bmatrix} (Hosseini and Sra, 2015).
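The augmented covariance of (Hosseini and Sra, 2015) can be assembled directly from \mu and S; a small sketch:

```python
import numpy as np

def augment(mu, S):
    """Reparametrize N(mu, S) as a zero-mean Gaussian over a = [a, 1]^T with the
    block covariance [[S + mu mu^T, mu], [mu^T, 1]] (Hosseini and Sra, 2015)."""
    mu = np.asarray(mu, dtype=float).reshape(-1, 1)
    top = np.hstack([S + mu @ mu.T, mu])
    bottom = np.hstack([mu.T, np.ones((1, 1))])
    return np.vstack([top, bottom])
```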

In the Optimization Problem (1), there is a simplex constraint \alpha \in \Delta_K. To convert it to an unconstrained problem, we first define \eta_k = \log(\frac{\alpha_k}{\alpha_K}) for k = 1,2,...,K-1, and let \eta_K = 0 be a constant (Hosseini and Sra, 2015). Then we have the following unconstrained optimization problem:

\underset{\theta'=\{\eta'=\{\eta_i'\}_{i=1}^{K-1},\, S'=\{S_i'\succ 0\}_{i=1}^{K}\}}{\min} f(\theta') = g(\theta') + \varphi(\theta')   (2)
= -L(\theta) - \sum_{s}\rho_{\pi_{\theta}}(s)\sum_{a}\left(\sum_{i=1}^{K}\frac{\exp(\eta_i')}{\sum_{j=1}^{K}\exp(\eta_j')}\mathcal{N}(a;S_i')\right)A_{\pi_{\theta}}(s,a) + \frac{\beta}{2}d^{2}(\theta',\theta).
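Recovering the mixture weights from the unconstrained parameters \eta is a softmax with \eta_K pinned to zero; a short sketch (the max-subtraction is only for numerical stability):

```python
import numpy as np

def weights_from_eta(eta):
    """Map eta_1..eta_{K-1} (eta_K fixed to 0) back to mixture weights:
    alpha_k = exp(eta_k) / sum_j exp(eta_j)."""
    eta_full = np.append(np.asarray(eta, dtype=float), 0.0)   # eta_K = 0
    e = np.exp(eta_full - eta_full.max())                     # numerically stable
    return e / e.sum()
```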

4 Riemannian proximal method for policy optimization

4.1 Riemannian proximal method for a class of non-convex problems

In this section, following the work of (Khamaru and Wainwright, 2018), we tackle a more general class of functions of the form f(\theta) = g(\theta) - h(\theta) + \varphi(\theta). We make the following assumptions:

Assumption 1:

(a) The function g is continuously differentiable and its gradient vector field is Lipschitz continuous with constant L \geq 0. (b) The function h is continuous and convex. (c) The function \varphi(\theta) is proper, convex and lower semi-continuous. (d) The function f is bounded below over a complete Riemannian manifold M of dimension m. (e) The solution set of \min f(\theta) is non-empty and its optimum value is denoted by f^{*}.

Lemma 1. Under Assumption 1, assume u_k and v_k are subgradients of the convex functions h and \varphi, respectively. We have \theta_{k+1} = \exp_{\theta_k}(-\alpha_k(\nabla g(\theta_k)-u_k+v_{k+1})), and f(\theta_k) - f(\theta_{k+1}) \geq \frac{1}{2\alpha_k}d^{2}(\theta_k,\theta_{k+1}), where \alpha_k is the step size.

Proof of Lemma 1 can be found in the Appendix.

Theorem 1. Under Assumption 1, the following statements hold for any sequence \{\theta_k\}_{k\geq 0} generated by Algorithm 1:

(a) Any limit point of the sequence \{\theta_k\}_{k\geq 0} is a critical point, and the sequence of function values \{f(\theta_k)\}_{k\geq 0} is strictly decreasing and convergent.

(b) For k = 0,1,2,...,N, we have \sum_{k=0}^{N}d^{2}(\theta_k,\theta_{k+1}) \leq \frac{2(f(\theta_0)-f^{*})}{L}. In addition, if the function h is M_h-smooth, i.e., \lVert u_{k+1}-u_k\rVert \leq M_h \times d(\theta_k,\theta_{k+1}) where M_h is a constant, then

\sum_{k=0}^{N}\frac{1}{(L+M_h+\frac{1}{\alpha_k})^{2}}\lVert\nabla f(\theta_{k+1})\rVert_{2}^{2} \leq \frac{2(f(\theta_0)-f^{*})}{L}   (3)

Proof of Theorem 1 can be found in the Appendix.

Algorithm 1 Riemannian proximal optimization algorithm

1. Given an initial point \theta_0 \in M, choose step sizes \alpha_k \in (0, \frac{1}{L}], k \in \{0,1,2,...\}.

2. For k=0,1,2,k=0,1,2,..., do

Choose a subgradient u_k \in \partial h(\theta_k), where \partial h(\theta) denotes the subdifferential (set of subgradients) of the convex function h at the point \theta. Update

\theta_{k+1} = \underset{\theta\in M}{\arg\min}\left\{\varphi(\theta) + \frac{1}{2\alpha_k}d^{2}\big(\theta,\ \theta_k - \alpha_k(\nabla g(\theta_k)-u_k)\big)\right\}.   (4)
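For intuition, the sketch below carries out one iteration of Algorithm 1 in the special case where the manifold is Euclidean and \varphi(\theta) = \frac{\beta}{2}\|\theta-\theta_{old}\|^2, so that the proximal subproblem (4) has a closed form; on a curved manifold the same step would go through the exponential map and a retraction instead. This simplification is ours, for illustration only.

```python
import numpy as np

def proximal_step(theta_k, grad_g, u_k, theta_old, alpha_k, beta):
    """One iteration of Algorithm 1 in the Euclidean special case with
    phi(theta) = beta/2 * ||theta - theta_old||^2."""
    z = theta_k - alpha_k * (grad_g - u_k)          # forward (sub)gradient step
    # argmin_theta  beta/2 ||theta - theta_old||^2 + 1/(2 alpha_k) ||theta - z||^2
    return (beta * theta_old + z / alpha_k) / (beta + 1.0 / alpha_k)
```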

4.2 Lower bound of policy improvement

Assume we have two policy functions \pi'(a \mid s) = \sum_i\alpha_i'\mathcal{N}(a;S_i') and \pi(a \mid s) = \sum_i\alpha_i\mathcal{N}(a;S_i) parameterized by GMMs with parameters \theta' = \{\alpha'=\{\alpha_i'\}_{i=1}^{K}, S'=\{S_i'\succ 0\}_{i=1}^{K}\} and \theta = \{\alpha=\{\alpha_i\}_{i=1}^{K}, S=\{S_i\succ 0\}_{i=1}^{K}\}. We would like to bound the performance improvement of \pi'(a \mid s) over \pi(a \mid s) under the limitation of the proximal operator.

In this study, we choose the Wasserstein distance to measure the discrepancy between the policy functions \pi'(a \mid s) and \pi(a \mid s) due to its robustness. For two distributions \mu_0 and \mu_1 of dimension n, the Wasserstein distance is defined as W_2(\mu_0,\mu_1) = \underset{p\in\Pi(\mu_0,\mu_1)}{\inf}\int_{R^{n}\times R^{n}}\|x-y\|^{2}p(x,y)\,dx\,dy, which seeks a joint probability distribution \Pi in R^{2n} whose marginals along the coordinates x and y coincide with \mu_0 and \mu_1, respectively. For two Gaussian distributions N_0(\mu_0,S_0) and N_1(\mu_1,S_1), the Wasserstein distance is W_2(N_0,N_1)^{2} = \|\mu_0-\mu_1\|^{2} + \mathrm{trace}(S_0+S_1-2(S_0^{1/2}S_1S_0^{1/2})^{1/2}).
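The closed-form Gaussian case of this distance translates directly into code; a minimal sketch:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussians_squared(mu0, S0, mu1, S1):
    """Squared 2-Wasserstein distance between N(mu0, S0) and N(mu1, S1):
    ||mu0 - mu1||^2 + tr(S0 + S1 - 2 (S0^{1/2} S1 S0^{1/2})^{1/2})."""
    S0_half = sqrtm(S0)
    cross = np.real(sqrtm(S0_half @ S1 @ S0_half))
    return float(np.sum((np.asarray(mu0) - np.asarray(mu1)) ** 2)
                 + np.trace(S0 + S1 - 2.0 * cross))
```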

First we have the following lemma:

Lemma 2. Given two policies parametrized by GMMs \pi'(a \mid s) = \sum_i\alpha_i'\mathcal{N}(a;S_i') and \pi(a \mid s) = \sum_i\alpha_i\mathcal{N}(a;S_i), let f(\pi') = \beta\sum_{s}\rho_{\pi}(s)\sum_{a}\left(\sum_{i=1}^{K}\frac{\exp(\eta_i')}{\sum_{j=1}^{K}\exp(\eta_j')}\mathcal{N}(a;S_i')\right)A_{\pi}(s,a) - D_{W_2}^{\pi}(\pi',\pi), where D_{W_2}^{\pi}(\pi',\pi) = \sum_{s}\rho^{\pi}(s)W_2(\pi',\pi) and W_2 denotes the Wasserstein distance between two GMMs. Then there exists \tilde{\pi} = \underset{\pi'}{\arg\max}\, f(\pi'), and f(\tilde{\pi}) \geq f(\pi) = 0.

Lemma 2 can be simply proved by applying Theorem 1.

To reduce the computational complexity, we employ the discrete Wasserstein distance obtained by embedding GMMs into a manifold of probability densities with Gaussian mixture structure, as proposed by (Chen et al., 2019). To be more specific, the discrete Wasserstein distance between two GMMs \pi'(a \mid s) and \pi(a \mid s) is:

W_2(\pi',\pi)^{2} = \sum_{i,j}c^{*}(i,j)W_2(\mathcal{N}(a;S_i'),\mathcal{N}(a;S_j))^{2},   (5)

where c^{*}(i,j) = \underset{c\in\prod(\alpha',\alpha)}{\arg\min}\sum_{i,j}c(i,j)W_2(\mathcal{N}(a;S_i'),\mathcal{N}(a;S_j))^{2}.
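The optimal coupling c* is a small linear program over the transport plan between the weight vectors \alpha' and \alpha; a sketch using a generic LP solver is shown below, with the pairwise cost matrix assumed to be precomputed (e.g. with the Gaussian W2 formula above).

```python
import numpy as np
from scipy.optimize import linprog

def gmm_w2_discrete(alpha_p, alpha, cost):
    """Discrete Wasserstein distance (5) between two GMMs: find the optimal
    coupling c* between the weights alpha' and alpha with pairwise cost
    cost[i, j] = W2(N(a; S_i'), N(a; S_j))^2, as a small linear program."""
    K1, K2 = cost.shape
    # equality constraints: row sums = alpha', column sums = alpha
    A_eq = np.zeros((K1 + K2, K1 * K2))
    for i in range(K1):
        A_eq[i, i * K2:(i + 1) * K2] = 1.0
    for j in range(K2):
        A_eq[K1 + j, j::K2] = 1.0
    b_eq = np.concatenate([alpha_p, alpha])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    coupling = res.x.reshape(K1, K2)
    return float(res.fun), coupling   # squared discrete W2 and the plan c*
```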

Lemma 3. Given two policies parametrized by GMMs \pi'(a \mid s) = \sum_i\alpha_i'\mathcal{N}(a;S_i') and \pi(a \mid s) = \sum_i\alpha_i\mathcal{N}(a;S_i), their total variation distance can be bounded as follows:

D_{TV}(\pi'(a\mid s),\pi(a\mid s)) \leq B_{TV}(\pi',\pi) = \sum_{i,j}d^{*}(i,j)B_{TV}(\mathcal{N}(a;S_i^{\pi'}),\mathcal{N}(a;S_j^{\pi})),   (6)

where B_{TV}(N_0(\mu,S_0),N_1(\mu,S_1)) = \frac{3}{2}\min\{1,\|S_0^{-1}S_1-I\|_F\} for Gaussian distributions N_0(\mu,S_0) and N_1(\mu,S_1) (Devroye et al., 2018), and d^{*}(i,j) = \underset{d\in\prod(\alpha',\alpha)}{\arg\min}\sum_{i,j}d(i,j)B_{TV}(\mathcal{N}(a;S_i'),\mathcal{N}(a;S_j)). Please note that the bound B_{TV}(\pi',\pi) is actually the Wasserstein distance between the discrete distributions \alpha' = \{\alpha_1',\alpha_2',...,\alpha_K'\} and \alpha = \{\alpha_1,\alpha_2,...,\alpha_K\} with pairwise cost defined by the bound on the total variation distance between two Gaussian distributions, B_{TV}(N_i(\mu,S_i),N_j(\mu,S_j)), i,j = 1,2,...,K.
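The pairwise Gaussian bound of (Devroye et al., 2018) is straightforward to compute; a short sketch is below, and the GMM-level bound B_TV(\pi',\pi) can then be obtained with the same linear-program coupling as in the sketch after Eq. (5), using this bound as the pairwise cost.

```python
import numpy as np

def btv_gaussians(S0, S1):
    """Bound on the total variation distance between two Gaussians with the
    same mean (Devroye et al., 2018): (3/2) * min{1, ||S0^{-1} S1 - I||_F}."""
    diff = np.linalg.solve(S0, S1) - np.eye(S0.shape[0])
    return 1.5 * min(1.0, np.linalg.norm(diff, 'fro'))
```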

With Lemma 2 and Lemma 3, we have the following theorem:

Theorem 2. Given two policy functions \pi' and \pi parametrized by GMMs \pi'(a \mid s) = \sum_i\alpha_i'\mathcal{N}(a;S_i') and \pi(a \mid s) = \sum_i\alpha_i\mathcal{N}(a;S_i), assume the policy \tilde{\pi} is parametrized by \tilde{\theta} = \underset{\pi'}{\arg\max}\, f(\pi') as shown in Lemma 2. Then we have the following bound for any \pi' within the proximity of \tilde{\pi}:

J(\pi') - J(\pi) \geq -2B_{TV}^{\pi'}(\pi',\tilde{\pi})M^{\tilde{\pi}} + \frac{1}{\beta}D_{W_2}^{\pi}(\tilde{\pi},\pi) - \frac{2\gamma\epsilon_{\pi}^{\tilde{\pi}}}{1-\gamma}B_{TV}^{\pi}(\tilde{\pi},\pi),

where \epsilon_{\pi}^{\tilde{\pi}} = \max_{s}\mid E_{a\sim\tilde{\pi}}A^{\pi}(s,a)\mid, M^{\tilde{\pi}} = \max_{s,a}|A^{\tilde{\pi}}(s,a)|, B_{TV}^{\pi'}(\pi',\tilde{\pi}) = \sum_{s}\rho^{\pi'}(s)B_{TV}(\pi',\tilde{\pi}) and B_{TV}^{\pi}(\tilde{\pi},\pi) = \sum_{s}\rho^{\pi}(s)B_{TV}(\tilde{\pi},\pi) (B_{TV}(\pi',\tilde{\pi}) and B_{TV}(\tilde{\pi},\pi) follow the bound definition B_{TV}(\pi',\pi) in Lemma 3), and D_{W_2}^{\pi}(\tilde{\pi},\pi) = \sum_{s}\rho^{\pi}(s)W_{d2}(\tilde{\pi},\pi) with W_{d2} the discrete Wasserstein distance of Eq. (5).

Proofs of Lemma 3 and Theorem 2 are shown in the appendix.

4.3 Implementation of the Riemannian proximal policy optimization method

Recall that in the optimization problem (2), we are trying to optimize the following objective function: \underset{\theta'=\{\eta'=\{\eta_i'\}_{i=1}^{K-1},\, S'=\{S_i'\succ 0\}_{i=1}^{K}\}}{\min} f(\theta') = g(\theta') + \varphi(\theta').

1) Riemannian Gradient

grad_{S_i'} g(\theta') = \frac{\partial g(\theta')}{\partial S_i'} = -\left(\sum_{s}\rho_{\pi_{\theta}}(s)\sum_{a}A_{\pi_{\theta}}(s,a)\right)\frac{\exp(\eta_i')}{\sum_{j=1}^{K}\exp(\eta_j')}\times\frac{\partial\mathcal{N}(a;S_i')}{\partial S_i'},

\frac{\partial\mathcal{N}(a;S_i')}{\partial S_i'} = \mathcal{N}(a;S_i')\times\frac{1}{2}\left[-S_i'^{-1}+S_i'^{-1}aa^{\top}S_i'^{-1}\right], i = 1,2,...,K.

Let a_i = -\left(\sum_{s}\rho_{\pi_{\theta}}(s)\sum_{a}A_{\pi_{\theta}}(s,a)\right)\mathcal{N}(a;S_i'), i = 1,2,...,K. Then for m = 1,2,...,K-1,

grad_{\eta_m'} g(\theta') = \frac{\partial g(\theta')}{\partial\eta_m'} = \frac{a_m\exp(\eta_m')}{\sum_{j}\exp(\eta_j')} - \sum_{i}\frac{1}{(\sum_{j}\exp(\eta_j'))^{2}}\left\{a_i\exp(\eta_i')\times\exp(\eta_m')\right\}.

For the Euclidean distance d(\theta',\theta) = \|\theta'-\theta\|_2, so that \varphi(\theta') = \frac{\beta}{2}\sum_{i=1}^{K}\|S_i'-S_i\|_2^{2}, we have

\partial_{S_i'}\varphi(\theta') = \frac{\partial}{\partial S_i'}\left(\frac{\beta}{2}\sum_{i=1}^{K}d^{2}(S_i',S_i)\right) = \beta(S_i'-S_i), i = 1,2,...,K. For the discrete Wasserstein distance d = W_{d2}(\theta',\theta), we have

\partial_{S_i'}\varphi(\theta') = \frac{\beta}{2}\frac{\partial}{\partial S_i'}\left(\sum_{i,j}c^{*}(i,j)\,\mathrm{trace}(S_i'+S_j-2(S_i'^{1/2}S_jS_i'^{1/2})^{1/2})\right), where

c^{*}(i,j) = \underset{c\in\prod(\alpha',\alpha)}{\arg\min}\sum_{i,j}c(i,j)W_2(\mathcal{N}(a;S_i'),\mathcal{N}(a;S_j))^{2}, i = 1,2,...,K.
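The covariance gradient of the zero-centered Gaussian density used above can be coded directly; a small sketch (the name dN_dS is ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def dN_dS(a, S):
    """Gradient of the zero-mean Gaussian density N(a; S) with respect to S:
    N(a; S) * 1/2 * (-S^{-1} + S^{-1} a a^T S^{-1})."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    S_inv = np.linalg.inv(S)
    dens = multivariate_normal.pdf(a.ravel(), mean=np.zeros(len(a)), cov=S)
    return dens * 0.5 * (-S_inv + S_inv @ a @ a.T @ S_inv)
```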

2) Retraction

With S_{i,t}' and grad_{S_{i,t}'} g(\theta') given above at iteration t, we would like to compute S_{i,t+1}' using a retraction. From (Cheng, 2013), for any tangent vector \eta \in T_W M, where W is a point in the Riemannian space M, its retraction is R_W(\eta) := \underset{X\in M}{\arg\min}\|W+\eta-X\|_F. For our case, R_{S_{i,t}'}\left(-\alpha_t(grad_{S_{i,t}'}g(\theta')+\partial_{S_{i,t}'}\varphi(\theta'))\right) = \sum_{j=1}^{n}\sigma_j q_j q_j^{\top}, where \sigma_j and q_j are the j-th eigenvalue and eigenvector of the matrix S_{i,t}'-\alpha_t(grad_{S_{i,t}'}g(\theta')+\partial_{S_{i,t}'}\varphi(\theta')).

The \eta_i, i = 1,2,...,K-1, are updated using the standard gradient descent method in Euclidean space. The gradient calculation and retraction shown above are repeated until f(\theta') converges.
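A sketch of one retraction step for a single covariance matrix is given below: take the Euclidean step along the chosen descent direction and project back onto the positive definite cone by eigenvalue truncation, as in the formula above; the small eigenvalue floor eps is our numerical safeguard, not part of the derivation.

```python
import numpy as np

def psd_retraction(S, direction, step, eps=1e-8):
    """Retraction R_S(step * direction): take the Euclidean step and project
    back onto the positive definite cone, i.e. keep sum_j sigma_j q_j q_j^T
    over the (floored) eigenpairs of the stepped matrix."""
    W = S + step * direction
    W = 0.5 * (W + W.T)                        # symmetrize against round-off
    sigma, Q = np.linalg.eigh(W)
    sigma = np.clip(sigma, eps, None)          # enforce S' > 0
    return (Q * sigma) @ Q.T
```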

5 Experimental results

5.1 Simulation environments and baseline methods

We choose TRPO and PPO, which are well known to excel at continuous-control tasks, as baseline algorithms. Each algorithm runs on the following three environments in the OpenAI Gym MuJoCo simulator (Todorov et al., 2012): InvertedPendulum-v2, Hopper-v2, and Walker2d-v2, with increasing task complexity in terms of the size of the state and action spaces. For each run, we compute the average reward over every 50 episodes, and report the mean reward curve and parameter statistics for comparison.

5.2 Preliminary results

Table 1: Number of parameters of each algorithm on each environment, with state and action dimensions listed.

Environment            RPPO   TRPO        PPO         Dim. of states   Dim. of actions
InvertedPendulum-v2    104    124,026     124,026     4                1
Hopper-v2              599    5,281,434   5,281,434   11               3
Walker2d-v2            599    40,404,522  40,404,522  17               6

In Fig. 1 we show the mean reward (column 1) for the PPO, RPPO and TRPO algorithms on three MuJoCo environments, together with screenshots (column 2) and the probability density of the GMM (column 3) for RPPO on each environment. From the learning curves, we can see that as the state-action dimension of the environment increases (shown in Table 1), both the convergence speed and the reward improvement slow down. This is because the higher the dimensionality of the environment, the more difficult the optimization task is for the algorithm. Correspondingly, in the GMM plots, S and A represent the state and action dimensions respectively, and the probability density is shown on the z axis. In the density plots, we can see that as the environment complexity increases, the density pattern becomes more diverse, and the off-diagonal covariance terms also show their importance. The probability density of the GMM shows that RPPO learns a meaningful structure for the policy functions.

TRPO and PPO are pure neural-network-based models with numerous parameters. This makes them vulnerable to overfitting, poor network architecture design and hyper-parameter tuning. RPPO achieves better robustness by having far fewer parameters. In Table 1 we compare the number of parameters of each algorithm on each environment. It can be seen that the GMM has three to five orders of magnitude fewer parameters than TRPO and PPO.


Figure 1: Comparison of PPO, RPPO and TRPO on three MuJoCo environments.

6 Conclusion

We proposed a general Riemannian proximal optimization algorithm with guaranteed convergence to solve Markov decision process (MDP) problems. To model policy functions in MDP, we employed the Gaussian mixture model (GMM) and formulated it as a non-convex optimization problem in the Riemannian space of positive semidefinite matrices. Preliminary experiments on benchmark tasks in OpenAI Gym MuJoCo (Todorov et al., 2012) show the efficacy of the proposed RPPO algorithm.

In Sec. 4.1, Algorithm 1 is capable of optimizing a general class of non-convex functions of the form f(\theta) = g(\theta) - h(\theta) + \varphi(\theta). Due to the page limit, in this study we focus on f(\theta) = g(\theta) + \varphi(\theta), as shown in Optimization Problem (2). In the future, it would be interesting to incorporate constraints in MDP problems, as in constrained policy optimization (Achiam et al., 2017), and encode them as a concave function -h(\theta) in our RPPO algorithm.

Appendix

1. Proof of Lemma 1:

First let us define a convex majorant q(w,\theta_k) of the function f as follows:

q(w,\theta_k) = g(\theta_k) - h(\theta_k) + \langle\nabla g(\theta_k)-u_k, w\rangle + \frac{1}{2\alpha_k}\lVert w\rVert_2^{2} + \varphi(\exp_{\theta_k}(w)), where w \in T_{\theta_k}M. Note that the minimizer of q(w,\theta_k) is the same as \theta_{k+1} = \underset{\theta\in M}{\arg\min}\left\{\varphi(\theta)+\frac{1}{2\alpha_k}d^{2}(\theta,\theta_k-\alpha_k(\nabla g(\theta_k)-u_k))\right\}. The optimality condition of \theta_{k+1} guarantees that there exists a subgradient v_{k+1}\in\partial\varphi(\theta_{k+1}) satisfying \nabla g(\theta_k)-u_k+v_{k+1}+\frac{1}{\alpha_k}w = 0. Letting w = \exp_{\theta_k}^{-1}\theta_{k+1}, we have \theta_{k+1} = \exp_{\theta_k}(-\alpha_k(\nabla g(\theta_k)-u_k+v_{k+1})).

From the convexity of the function \varphi, for any \theta\in M and v_{k+1}\in\partial\varphi(\theta_{k+1}) we have \varphi(\theta_k) \geq \varphi(\theta_{k+1}) + \langle v_{k+1}, w\rangle, w \in T_{\theta_{k+1}}M. To prove the second inequality in Lemma 1, we have

f(\theta_k) - q(w_k,\theta_k) = g(\theta_k) - h(\theta_k) + \varphi(\theta_k) - q(w_k,\theta_k)

\geq g(\theta_k) - h(\theta_k) + \varphi(\theta_{k+1}) + \langle v_{k+1}, w\rangle - q(w_k,\theta_k)

\geq g(\theta_k) - h(\theta_k) + \varphi(\theta_{k+1}) + \langle v_{k+1}, w\rangle - \left(g(\theta_k) - h(\theta_k) + \langle\nabla g(\theta_k)-u_k, w_k\rangle + \frac{1}{2\alpha_k}\lVert w_k\rVert_2^{2} + \varphi(\exp_{\theta_k}(w_k))\right)

\geq \varphi(\theta_{k+1}) + \langle v_{k+1}, w\rangle - \left(\langle\nabla g(\theta_k)-u_k, w_k\rangle + \frac{1}{2\alpha_k}\lVert w_k\rVert_2^{2} + \varphi(\exp_{\theta_k}(w_k))\right).

Since w = -w_k = \alpha_k(\nabla g(\theta_k)-u_k+v_{k+1}), we have

f(\theta_k) - q(w_k,\theta_k) \geq \langle-\nabla g(\theta_k)+u_k-v_{k+1}, w_k\rangle - \frac{1}{2\alpha_k}\lVert w_k\rVert_2^{2}

\geq \frac{1}{\alpha_k}\langle w_k, w_k\rangle - \frac{1}{2\alpha_k}\lVert w_k\rVert_2^{2} = \frac{1}{2\alpha_k}\lVert w_k\rVert_2^{2}.

Recall that q(w_k,\theta_k) is a majorant of the function f, so

f(\theta_k) - f(\theta_{k+1}) \geq f(\theta_k) - q(w_k,\theta_k) \geq \frac{1}{2\alpha_k}\lVert w_k\rVert_2^{2} = \frac{1}{2\alpha_k}d^{2}(\theta_k,\theta_{k+1}).

2. Proof of Theorem 1:

First we would like to prove the convergence of the function values. Since the sequence \{f(\theta_k)\}_{k\geq 0} is bounded below, if \theta_k = \theta_{k+1} for some k, the convergence of the sequence \{f(\theta_k)\}_{k\geq 0} is trivial. Let us assume that \theta_k \neq \theta_{k+1} for all k = 0,1,2,.... Under this assumption, Lemma 1 ensures that f(\theta_k) > f(\theta_{k+1}). Consequently, there must exist some scalar \bar{f} which is the limit of f(\theta_k), \lim_{k\rightarrow\infty}f(\theta_k) = \bar{f}.

Due to page limit, the proof of stationarity of limit points is omitted.

Now let us establish the bound. Since f^{*} = \min f(\theta) is finite, by utilizing Lemma 1 we have

f(\theta_0) - f^{*} \geq f(\theta_0) - f(\theta_{N+1}) = \sum_{k=0}^{N}(f(\theta_k)-f(\theta_{k+1})) \geq \sum_{k=0}^{N}\frac{1}{2\alpha_k}d^{2}(\theta_k,\theta_{k+1}).

Note that \alpha_k \in (0,\frac{1}{L}], k \in \{0,1,2,...\}, so \sum_{k=0}^{N}d^{2}(\theta_k,\theta_{k+1}) \leq \frac{2(f(\theta_0)-f^{*})}{L}.

Recall that the function h is M_h-smooth, so

\lVert\nabla f(\theta_{k+1})\rVert_2 = \lVert\nabla g(\theta_{k+1})-u_{k+1}+v_{k+1}\rVert_2 = \lVert\nabla g(\theta_{k+1})-u_{k+1}-(\nabla g(\theta_k)-u_k+\frac{1}{\alpha_k}d(\theta_k,\theta_{k+1}))\rVert_2

\leq \lVert\nabla g(\theta_{k+1})-\nabla g(\theta_k)\rVert + \lVert u_{k+1}-u_k\rVert + \frac{1}{\alpha_k}d(\theta_k,\theta_{k+1}) \leq (L+M_h+\frac{1}{\alpha_k})d(\theta_k,\theta_{k+1}).

Therefore,

\frac{2(f(\theta_0)-f^{*})}{L} \geq \sum_{k=0}^{N}d^{2}(\theta_k,\theta_{k+1}) \geq \sum_{k=0}^{N}\frac{1}{(L+M_h+\frac{1}{\alpha_k})^{2}}\lVert\nabla f(\theta_{k+1})\rVert_2^{2}.

3. Proof of Lemma 3:

For two Gaussian distributions N_0(\mu,S_0) and N_1(\mu,S_1) with the same mean, their total variation distance is bounded by (Devroye et al., 2018):

D_{TV}(N_0(\mu,S_0),N_1(\mu,S_1)) \leq B_{TV}(N_0(\mu,S_0),N_1(\mu,S_1)) = \frac{3}{2}\min\{1,\|S_0^{-1}S_1-I\|_F\}. By using the Wasserstein metric, we have

D_{TV}(\pi'(a\mid s),\pi(a\mid s)) = \frac{1}{2}\int\mid\sum_i\alpha_i'\mathcal{N}(a;S_i')-\sum_i\alpha_i\mathcal{N}(a;S_i)\mid da

\leq \sum_{i,j}d^{*}(i,j)B_{TV}(\mathcal{N}(a;S_i'),\mathcal{N}(a;S_j)),

where d^{*}(i,j) = \underset{d\in\prod(\alpha',\alpha)}{\arg\min}\sum_{i,j}d(i,j)B_{TV}(\mathcal{N}(a;S_i'),\mathcal{N}(a;S_j)).

4. Proof of Theorem 2:

Our proof follows the idea proposed by Wang et al. (Wang et al., 2018). From Corollary 1 in (Achiam et al., 2017), we have

J(\pi')-J(\pi) \geq L_{\pi}(\pi') - \frac{2\gamma\epsilon_{\pi}^{\pi'}}{1-\gamma}D_{TV}^{\pi}(\pi',\pi), where \epsilon_{\pi}^{\pi'} = \max_{s}\mid E_{a\sim\pi'}A^{\pi}(s,a)\mid, D_{TV}^{\pi}(\pi',\pi) = \sum_{s}\rho_{\pi}(s)D_{TV}(\pi'(a\mid s),\pi(a\mid s)) and L_{\pi}(\pi') = \sum_{s}\rho_{\pi}(s)\sum_{a}\pi'(a|s)A_{\pi}(s,a).

Similarly, we also have

|J(\pi')-J(\pi)| = \sum_{s}\rho_{\pi}(s)|\sum_{a}(\pi'(a|s)-\pi(a|s))A_{\pi}(s,a)| \leq \sum_{s}\rho_{\pi}(s)\sum_{a}|\pi'(a|s)-\pi(a|s)||A_{\pi}(s,a)| \leq 2D_{TV}^{\pi'}(\pi',\pi)M^{\pi}, where M^{\pi} = \max_{s,a}|A^{\pi}(s,a)|.

J(\pi')-J(\pi) = (J(\pi')-J(\tilde{\pi})) + (J(\tilde{\pi})-J(\pi)) \geq -2D_{TV}^{\pi'}(\pi',\tilde{\pi})M^{\tilde{\pi}} + L_{\pi}(\tilde{\pi}) - \frac{2\gamma\epsilon_{\pi}^{\tilde{\pi}}}{1-\gamma}D_{TV}^{\pi}(\tilde{\pi},\pi)

\geq -2D_{TV}^{\pi'}(\pi',\tilde{\pi})M^{\tilde{\pi}} + \frac{1}{\beta}D_{W_2}^{\pi}(\tilde{\pi},\pi) - \frac{2\gamma\epsilon_{\pi}^{\tilde{\pi}}}{1-\gamma}D_{TV}^{\pi}(\tilde{\pi},\pi).


References

  • Absil et al. (2007) P-A Absil, Christopher G Baker, and Kyle A Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
  • Absil et al. (2009) P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
  • Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 22–31. JMLR. org, 2017.
  • Agostini and Celaya (2010) Alejandro Agostini and Enric Celaya. Reinforcement learning with a Gaussian mixture model. In The 2010 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2010.
  • Argall et al. (2009) Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
  • Chen et al. (2019) Yongxin Chen, Tryphon T Georgiou, and Allen Tannenbaum. Optimal transport for Gaussian mixture models. IEEE Access, 7:6269–6278, 2019.
  • Cheng (2013) Li Cheng. Riemannian similarity learning. In International Conference on Machine Learning, pages 540–548, 2013.
  • Devroye et al. (2018) Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional Gaussians. arXiv preprint arXiv:1810.08693, 2018.
  • Eisenhart (2016) Luther Pfahler Eisenhart. Riemannian geometry. Princeton University Press, 2016.
  • Ferreira and Oliveira (2002) OP Ferreira and PR Oliveira. Proximal point algorithm on Riemannian manifolds. Optimization, 51(2):257–270, 2002.
  • Hosseini and Sra (2015) Reshad Hosseini and Suvrit Sra. Matrix manifold optimization for Gaussian mixtures. In Advances in Neural Information Processing Systems, pages 910–918, 2015.
  • Huang et al. (2015) Wen Huang, Kyle A Gallivan, and P-A Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM Journal on Optimization, 25(3):1660–1685, 2015.
  • Kakade and Langford (2002) Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, pages 267–274, 2002.
  • Khamaru and Wainwright (2018) Koulik Khamaru and Martin J Wainwright. Convergence guarantees for a class of non-convex and non-smooth optimization problems. arXiv preprint arXiv:1804.09629, 2018.
  • Li (2017) Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • Vandereycken (2013) Bart Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.
  • Wang et al. (2018) Qing Wang, Jiechao Xiong, Lei Han, Han Liu, Tong Zhang, et al. Exponentially weighted imitation learning for batched historical data. In Advances in Neural Information Processing Systems, pages 6288–6297, 2018.
  • Zhang et al. (2016) Hongyi Zhang, Sashank J Reddi, and Suvrit Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592–4600, 2016.