
Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic

Yufeng Zhang Department of Industrial Engineering and Management Sciences, Northwestern University Siyu Chen Department of Statistics and Data Science, Yale University Zhuoran Yang Department of Operations Research and Financial Engineering, Princeton University Michael I. Jordan Department of EECS and Statistics, University of California, Berkeley Zhaoran Wang Department of Industrial Engineering and Management Sciences, Northwestern University
Abstract

Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years. However, most of the existing theoretical support for AC algorithms focuses on the case of linear function approximations, or linearized neural networks, where the feature representation is fixed throughout training. Such a limitation fails to capture the key aspect of representation learning in neural AC, which is pivotal in practical problems. In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC. Specifically, we consider a version of AC where the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates. The critic is updated by temporal-difference (TD) learning with a larger stepsize while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize. In the continuous-time and infinite-width limiting regime, when the timescales are properly separated, we prove that neural AC finds the globally optimal policy at a sublinear rate. Additionally, we prove that the feature representation induced by the critic network is allowed to evolve within a neighborhood of the initial one.

Equal contribution.

1 Introduction

In reinforcement learning (RL) (Sutton and Barto, 2018), an agent aims to learn the optimal policy that maximizes the expected total reward obtained from interactions with an environment. Policy-based RL algorithms achieve such a goal by directly optimizing the expected total reward as a function of the policy, which often involves two components: policy evaluation and policy improvement. Specifically, policy evaluation refers to estimating the value function of the current policy, which characterizes the performance of the current policy and reveals the updating direction for finding a better policy, which is known as policy improvement. Algorithms that explicitly incorporate these two ingredients are called actor-critic (AC) methods (Konda and Tsitsiklis, 2000), where the actor and the critic refer to the policy and its corresponding value function, respectively.

Recently, in RL applications with large state spaces, actor-critic methods have achieved striking empirical successes when empowered by expressive function approximators such as neural networks (Agostinelli et al., 2019; Akkaya et al., 2019; Berner et al., 2019; Duan et al., 2016; Silver et al., 2016, 2017; Vinyals et al., 2019). These successes benefit from the data-dependent representations learned by neural networks. However, the theoretical understanding of this data-dependent benefit is very limited. The classical theory of AC focuses on the case of linear function approximation, where the actor and critic are represented using linear functions with the feature mapping fixed throughout learning (Bhatnagar et al., 2008, 2009; Konda and Tsitsiklis, 2000). Meanwhile, a few recent works establish convergence and optimality of AC with overparameterized neural networks (Fu et al., 2020; Liu et al., 2019; Wang et al., 2019), where the neural network training is captured by the Neural Tangent Kernel (NTK) (Jacot et al., 2018). Specifically, with properly designed parameter initialization and stepsizes, and sufficiently large network widths, the neural networks employed by both the actor and the critic are well approximated by linear functions of a random feature vector. In other words, from the point of view of representation learning, the features induced by these algorithms are by assumption infinitesimally close to the initial featural representation, which is data-independent.

In this work, we make initial steps towards understanding how representation learning comes into play in neural AC. Specifically, we address the following questions:

Going beyond the NTK regime, does neural AC provably find the globally optimal policy? How does the feature representation associated with the neural network evolve along with neural AC?

We focus on a version of AC where the critic performs temporal-difference (TD) learning (Sutton, 1988) for policy evaluation and the actor improves its policy via proximal policy optimization (PPO) (Schulman et al., 2017), which corresponds to a Kullback-Leibler (KL) divergence regularized optimization problem, with the critic providing the update direction. Moreover, we utilize two-timescale updates where both the actor and critic are updated at each iteration but with the critic having a much larger stepsize. In other words, the critic is updated at a faster timescale. Meanwhile, we represent the critic explicitly as a two-layer overparameterized neural network and parameterize the actor implicitly via the critic and PPO updates. To examine convergence, we study the evolution of the actor and critic in a continuous-time limit with the network width going to infinity. In such a regime, the actor update is closely connected to replicator dynamics (Börgers and Sarin, 1997; Hennes et al., 2020; Schuster and Sigmund, 1983) and the critic update is captured by a semigradient flow in the Wasserstein space (Villani, 2008). Moreover, the semigradient flow runs at a faster timescale according to the two-timescale mechanism.

It turns out that the separation of timescales plays an important role in the convergence analysis. In particular, the continuous-time limit enables us to separately analyze the evolution of the actor and the critic and then combine these results to obtain the final theoretical guarantees. Specifically, focusing solely on the actor, we prove that the time-averaged suboptimality of the actor converges sublinearly to zero up to the time-averaged policy evaluation error associated with the critic updates. Moreover, for the critic, under regularity conditions, we connect the Bellman error to the Wasserstein distance and show that the time-averaged policy evaluation error also converges sublinearly to zero. Therefore, we show that neural AC provably achieves global optimality at a sublinear rate. Furthermore, regarding representation learning, we show that the critic induces a data-dependent feature representation within an $O(1/\alpha)$ neighborhood of the initial representation in terms of the Wasserstein distance, where $\alpha$ is a sufficiently large scaling parameter.

The key to our technical analysis rests on three ingredients: (i) infinite-dimensional variational inequalities with a one-point monotonicity condition (Harker and Pang, 1990), (ii) a mean-field perspective on neural networks (Chizat and Bach, 2018b; Mei et al., 2019, 2018; Sirignano and Spiliopoulos, 2020a, b), and (iii) two-timescale stochastic approximation (Borkar, 2009; Kushner and Yin, 2003). In particular, in the infinite-width limit, the neural network and its induced feature representation are identified with a distribution over the parameter space. The mean-field perspective enables us to characterize the evolution of such a distribution within the Wasserstein space via a certain partial differential equation (PDE) (Ambrosio and Gigli, 2013; Ambrosio et al., 2008; Villani, 2003, 2008). For policy evaluation, this PDE is given by the semigradient flow induced by TD learning. We characterize the policy evaluation error by showing that the mean-field Bellman error satisfies a version of one-point monotonicity tailored to the Wasserstein space. Moreover, our actor analysis utilizes the geometry of policy optimization: the expected total reward, as a function of the policy, also enjoys one-point monotonicity in the policy space. Finally, the actor and critic errors are connected via two-timescale stochastic approximation. To the best of our knowledge, this is the first time that convergence and global optimality guarantees have been obtained for neural AC.

Related Work. AC with linear function approximation has been studied extensively in the literature. In particular, using a two-timescale stochastic approximation via ordinary differential equations, Konda and Tsitsiklis (2000); Bhatnagar et al. (2008) and Bhatnagar et al. (2009) establish asymptotic convergence guarantees in the continuous-time limiting regime. More recently, using more sophisticated optimization techniques, various works (Wu et al., 2020; Xu et al., 2020b, a; Hong et al., 2020; Khodadadian et al., 2021) have established discrete-time convergence guarantees that show that linear AC converges sublinearly to either a stationary point or the globally optimal policy. Furthermore, when overparameterized neural networks are employed, Liu et al. (2019); Wang et al. (2019) and Fu et al. (2020) prove that neural AC converges to the global optimum at a sublinear rate. In these works, the initial value of the network parameters and the learning rates are chosen such that both actor and critic updates are captured by the NTK. In other words, when the network width is sufficiently large, such a version of neural AC is well approximated by its linear counterpart via the neural tangent feature. In comparison, we establish a mean-field analysis that has a different scaling than the NTK regime. We also establish finite-time convergence to global optimality, and more importantly, the feature representation induced by the critic is data-dependent and allowed to evolve within a much larger neighborhood around the initial one.

Our work is also related to a recent line of research on understanding stochastic gradient descent (SGD) for supervised learning problems involving an overparameterized two-layer neural network in the mean-field regime. See, e.g., Chizat and Bach (2018b); Mei et al. (2018, 2019); Javanmard et al. (2019); Wei et al. (2019); Fang et al. (2019b, a); Chen et al. (2020a); Sirignano and Spiliopoulos (2020b, a); Lu et al. (2020) and the references therein. In the continuous-time and infinite-width limit, these works show that SGD for neural network training is captured by a Wasserstein gradient flow (Villani, 2008; Ambrosio et al., 2008; Ambrosio and Gigli, 2013) of an energy function that corresponds to the objective function in supervised learning. In contrast, our analysis combines such a mean-field analysis with TD learning and two-timescale stochastic approximation, which are tailored specifically to AC. Moreover, our critic is updated via TD learning, which is a semigradient algorithm: there is no objective functional whose gradient flow TD learning follows. Thus, in the mean-field regime, our critic is given by a Wasserstein semigradient flow, which also differs from these existing works.

Additionally, our work is closely related to Zhang et al. (2020) and Agazzi and Lu (2019), who provide mean-field analyses of neural TD learning and Q-learning (Watkins and Dayan, 1992). In comparison, we focus on AC, which is a two-timescale policy optimization algorithm. Finally, Agazzi and Lu (2020) study softmax policy gradient with neural network policies in the mean-field regime, where policy gradient is cast as a Wasserstein gradient flow with respect to the expected total reward. Their algorithm assumes that the critic has direct access to the exact value function and is thus single-timescale. Moreover, the convergence guarantee in Agazzi and Lu (2020) is asymptotic. In comparison, our AC algorithm is two-timescale and we establish non-asymptotic sublinear convergence guarantees to global optimality.

Notation. We denote by $\mathscr{P}(\mathcal{X})$ the set of probability measures over the measurable space $\mathcal{X}$. Given a curve $\rho:\mathbb{R}\rightarrow\mathcal{X}$, we denote by $\dot{\rho}_{s}=\partial_{t}\rho_{t}\,|\,_{t=s}$ its derivative with respect to time. For an operator $F:\mathcal{X}\rightarrow\mathcal{X}$ and a measure $\mu\in\mathscr{P}(\mathcal{X})$, we denote by $F_{\sharp}\mu=\mu\circ F^{-1}$ the pushforward of $\mu$ through $F$. We denote by $\chi^{2}(\rho\,\|\,\mu)$ the chi-squared divergence between probability measures $\rho$ and $\mu$, defined as $\chi^{2}(\rho\,\|\,\mu)=\int(\rho/\mu-1)^{2}{\mathrm{d}}\mu$. Given two probability measures $\rho$ and $\mu$, we denote the Kullback-Leibler divergence, or relative entropy, from $\mu$ to $\rho$ by ${\mathrm{KL}}(\rho\,\|\,\mu)=\int\log(\rho/\mu){\mathrm{d}}\rho$. For $\nu_{1},\nu_{2},\mu\in\mathscr{P}(\mathcal{X})$, we define the weighted homogeneous Sobolev norm $\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu)}=\sup\bigl\{|\langle f,\nu_{1}-\nu_{2}\rangle|\,:\,\|f\|_{\dot{H}^{1}(\mu)}\leq 1\bigr\}$. We denote by $\|f(x)\|_{p,\mu}=(\int|f(x)|^{p}\mu({\mathrm{d}}x))^{1/p}$ the $\ell_{p}$-norm with respect to the probability measure $\mu$. We denote by $\otimes$ the semidirect product, i.e., $\mu\otimes K=K(y\,|\,x)\mu(x)$ for $\mu\in\mathscr{P}(\mathcal{X})$ and a transition kernel $K:\mathcal{X}\rightarrow\mathscr{P}(\mathcal{Y})$. For a function $f:\mathcal{X}\rightarrow\mathbb{R}$, we denote by ${\rm Lip}(f)=\sup_{x,y\in\mathcal{X},x\neq y}|f(x)-f(y)|/\|x-y\|$ its Lipschitz constant. We denote a normal distribution on $\mathbb{R}^{D}$ by $N(\mu,\Sigma)$, where $\mu$ is the mean and $\Sigma$ is the covariance matrix.

2 Background

In this section, we first introduce the policy optimization problem and the actor-critic method. We then present the definition of the Wasserstein space.

2.1 Policy Optimization and Actor-Critic

We consider a Markov decision process (MDP) given by $({\mathcal{S}},\mathcal{A},\gamma,P,r,\mathcal{D}_{0})$, where ${\mathcal{S}}\subseteq\mathbb{R}^{d_{1}}$ is the state space, $\mathcal{A}\subseteq\mathbb{R}^{d_{2}}$ is the action space, $\gamma\in(0,1)$ is the discount factor, $P:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathscr{P}({\mathcal{S}})$ is the transition kernel, $r:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathbb{R}_{+}$ is the reward function, and $\mathcal{D}_{0}\in\mathscr{P}({\mathcal{S}})$ is the initial state distribution. Without loss of generality, we assume that ${\mathcal{S}}\times\mathcal{A}\subseteq\mathbb{R}^{d}$ and $\|(s,a)\|_{2}\leq 1$, where $d=d_{1}+d_{2}$; as long as the state-action space is bounded, it can be normalized to lie within the unit ball. Given a policy $\pi:{\mathcal{S}}\rightarrow\mathscr{P}(\mathcal{A})$, at the $m$th step the agent takes an action $a_{m}$ at state $s_{m}$ according to $\pi(\cdot\,|\,s_{m})$ and observes a reward $r_{m}=r(s_{m},a_{m})$. The environment then transits to the next state $s_{m+1}$ according to the transition kernel $P(\cdot\,|\,s_{m},a_{m})$. Note that the policy $\pi$ induces Markov chains on both ${\mathcal{S}}$ and ${\mathcal{S}}\times\mathcal{A}$. For the Markov chain on ${\mathcal{S}}$, we denote the induced transition kernel by $P^{\pi}:{\mathcal{S}}\rightarrow\mathscr{P}({\mathcal{S}})$, defined as $P^{\pi}(s^{\prime}\,|\,s)=\int_{\mathcal{A}}P(s^{\prime}\,|\,s,a)\pi({\mathrm{d}}a\,|\,s)$. Likewise, we denote the transition kernel on ${\mathcal{S}}\times\mathcal{A}$ by $\widetilde{P}^{\pi}:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathscr{P}({\mathcal{S}}\times\mathcal{A})$, defined as $\widetilde{P}^{\pi}(s^{\prime},a^{\prime}\,|\,s,a)=\pi(a^{\prime}\,|\,s^{\prime})P(s^{\prime}\,|\,s,a)$. Let ${\widetilde{\mathcal{D}}}$ be a probability measure on ${\mathcal{S}}\times\mathcal{A}$. We then define the visitation measure induced by policy $\pi$ and starting from ${\widetilde{\mathcal{D}}}$ as

\widetilde{\mathcal{E}}^{\pi}_{\widetilde{\mathcal{D}}}\bigl({\mathrm{d}}(s,a)\bigr)=(1-\gamma)\cdot\sum_{m\geq 0}\gamma^{m}\cdot\mathbb{P}\bigl((s_{m},a_{m})\in{\mathrm{d}}(s,a)\,\big|\,(s_{0},a_{0})\sim\widetilde{\mathcal{D}}\bigr), \qquad (2.1)

where $(s_{m},a_{m})_{m\geq 0}$ is the trajectory generated by starting from $(s_{0},a_{0})\sim{\widetilde{\mathcal{D}}}$ and following policy $\pi$ thereafter. If ${\widetilde{\mathcal{D}}}=\mathcal{D}\otimes\pi$ holds, we denote such a visitation measure by $\widetilde{\mathcal{E}}_{\mathcal{D}}^{\pi}$. Furthermore, we denote by $\mathcal{E}({\mathrm{d}}s)=\int_{\mathcal{A}}{\widetilde{\mathcal{E}}}({\mathrm{d}}s,{\mathrm{d}}a)$ the marginal distribution of the visitation measure ${\widetilde{\mathcal{E}}}$ over ${\mathcal{S}}$. In particular, when $(s_{0},a_{0})\sim\mathcal{D}\otimes\pi$ holds in (2.1), it follows that ${\widetilde{\mathcal{E}}}_{\mathcal{D}}^{\pi}=\mathcal{E}_{\mathcal{D}}^{\pi}\otimes\pi$. In policy optimization, we aim to maximize the expected total reward $J(\pi)$, defined as follows,

J(\pi)=\mathbb{E}^{\pi}\biggl[\sum_{m\geq 0}\gamma^{m}\cdot r(s_{m},a_{m})\,\Big|\,s_{0}\sim\mathcal{D}_{0}\biggr],

where we denote by $\mathbb{E}^{\pi}$ the expectation with respect to $a_{m}\sim\pi(\cdot\,|\,s_{m})$ and $s_{m+1}\sim P(\cdot\,|\,s_{m},a_{m})$ for $m\geq 0$. We define the action value function $Q^{\pi}:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathbb{R}$ and the state value function $V^{\pi}:{\mathcal{S}}\rightarrow\mathbb{R}$ as follows,

Q^{\pi}(s,a)=\mathbb{E}^{\pi}\biggl[\sum_{m\geq 0}\gamma^{m}\cdot r(s_{m},a_{m})\,\Big|\,s_{0}=s,a_{0}=a\biggr],\quad V^{\pi}(s)=\bigl\langle Q^{\pi}(s,\cdot),\pi(\cdot\,|\,s)\bigr\rangle_{\mathcal{A}}, \qquad (2.2)

where we denote by $\langle\cdot,\cdot\rangle_{\mathcal{A}}$ the inner product over the action space $\mathcal{A}$. Correspondingly, the advantage function $A^{\pi}:{\mathcal{S}}\times\mathcal{A}\rightarrow\mathbb{R}$ is defined as

A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s).

The action value function $Q^{\pi}$ corresponds to the minimizer of the following mean-squared Bellman error (MSBE),

\mathrm{MSBE}(Q;\pi)=\frac{1}{2}\,\mathbb{E}_{(s,a)\sim\widetilde{\Phi}^{\pi}}\Bigl[\bigl(Q(s,a)-r(s,a)-\gamma\,\mathbb{E}_{(s^{\prime},a^{\prime})\sim\widetilde{P}^{\pi}(\cdot\,|\,s,a)}[Q(s^{\prime},a^{\prime})]\bigr)^{2}\Bigr], \qquad (2.3)

where $\widetilde{\Phi}^{\pi}$ is a sampling distribution depending on the policy $\pi$, which will be defined in (4.8) in §4.2. Note that when $\widetilde{\Phi}^{\pi}$ has full support, i.e., $\mathrm{supp}(\widetilde{\Phi}^{\pi})={\mathcal{S}}\times\mathcal{A}$, $Q^{\pi}$ is the unique global minimizer of the MSBE. Consequently, the policy optimization problem can be written as the following bilevel optimization problem,

\max_{\pi}J(\pi)=\mathbb{E}_{s\sim\mathcal{D}_{0}}\Bigl[\bigl\langle Q^{\pi}(s,\cdot),\pi(\cdot\,|\,s)\bigr\rangle_{\mathcal{A}}\Bigr],\quad\text{subject to }Q^{\pi}=\mathop{\mathrm{argmin}}_{Q}\mathrm{MSBE}(Q;\pi). \qquad (2.4)

The inner problem in (2.4) is known as the policy evaluation subproblem, while the outer problem is the policy improvement subproblem. One of the most popular ways to solve this policy optimization problem is the actor-critic (AC) method (Sutton and Barto, 2018), in which the critic evaluates the current policy and the actor then updates its policy according to the critic's evaluation.
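To make the inner policy evaluation subproblem concrete, the following sketch estimates $Q^{\pi}(s_{0},a_{0})$ by Monte Carlo rollouts in a small finite MDP; the tabular environment, the truncation horizon, and all function names here are illustrative assumptions rather than part of the paper, and averaging the estimate over $a_{0}\sim\pi(\cdot\,|\,s_{0})$ yields $V^{\pi}$ and hence $A^{\pi}$ in (2.2).

```python
import numpy as np

def mc_q_estimate(P, R, pi, gamma, s0, a0, n_rollouts=2000, horizon=200, seed=0):
    """Monte Carlo estimate of Q^pi(s0, a0) = E[sum_m gamma^m r(s_m, a_m)]
    for a finite MDP with transition tensor P[s, a, s'], reward table R[s, a],
    and policy matrix pi[s, a]."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret, disc = s0, a0, 0.0, 1.0
        for _ in range(horizon):            # truncation error is O(gamma^horizon)
            ret += disc * R[s, a]
            s = rng.choice(n_states, p=P[s, a])
            a = rng.choice(n_actions, p=pi[s])
            disc *= gamma
        total += ret
    return total / n_rollouts
```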

2.2 Wasserstein Space

Let $\Theta\subseteq\mathbb{R}^{D}$ be a Polish space. We denote by $\mathscr{P}_{2}(\Theta)\subseteq\mathscr{P}(\Theta)$ the set of probability measures with finite second moments. Then, the Wasserstein-2 distance between $\mu,\nu\in\mathscr{P}_{2}(\Theta)$ is defined as follows,

W_{2}(\mu,\nu)=\inf\Bigl\{\mathbb{E}\bigl[\|X-Y\|^{2}\bigr]^{1/2}\,\Big|\,\mathop{\mathrm{law}}(X)=\mu,\ \mathop{\mathrm{law}}(Y)=\nu\Bigr\},

where the infimum is taken over random variables $X$ and $Y$ on $\Theta$ and we denote by ${\rm law}(X)$ the distribution of a random variable $X$. We call $\mathcal{M}=(\mathscr{P}_{2}(\Theta),W_{2})$ the Wasserstein ($W_{2}$) space, which is an infinite-dimensional manifold (Villani, 2008). See §A.1 for more details.
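For intuition, the $W_{2}$ distance has a simple closed form in one dimension: between two empirical measures with the same number of atoms, the optimal coupling matches the sorted samples. The snippet below is a minimal illustration of this fact (not an object used in the analysis); the Gaussian example and sample sizes are arbitrary.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """W2 distance between two 1-D empirical measures with equally many atoms:
    sort both samples and match them in order (the monotone optimal coupling)."""
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
mu = rng.normal(0.0, 1.0, size=10_000)   # sample from N(0, 1)
nu = rng.normal(2.0, 1.0, size=10_000)   # sample from N(2, 1)
print(w2_empirical_1d(mu, nu))           # close to 2, since W2(N(0,1), N(2,1)) = 2
```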

3 Algorithm

Two-timescale actor-critic. We consider a two-timescale actor-critic (AC) algorithm (Kakade, 2002; Peters and Schaal, 2008) for the policy optimization problem in (2.4). For policy evaluation, we parameterize the critic $Q$ with a neural network and update the parameters via temporal-difference (TD) learning (Sutton, 1988). For policy improvement, we update the actor's policy $\pi$ via proximal policy optimization (PPO) (Schulman et al., 2017). Our algorithm is two-timescale since both the actor and the critic are updated at each iteration but with different stepsizes. Specifically, we parameterize the critic $Q$ by the following neural network with width $M$ and parameter $\widehat{\bm{\theta}}=(\widehat{\theta}^{(1)},\cdots,\widehat{\theta}^{(M)})\in\mathbb{R}^{D\times M}$,

Q_{\widehat{\bm{\theta}}}(s,a)=\frac{\alpha}{M}\sum_{i=1}^{M}\sigma(s,a;\widehat{\theta}^{(i)}). \qquad (3.1)

Here $\sigma(s,a;\theta):{\mathcal{S}}\times\mathcal{A}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ is the activation function and $\alpha>0$ is the scaling parameter. Such a structure also appears in Chizat and Bach (2018a); Mei et al. (2019) and Chen et al. (2020b). In the discrete-time, finite-width (DF) setting, at the $k$th iteration the critic and actor are updated as follows,

\text{DF-TD:}\quad\widehat{\theta}_{k+1}^{(i)}=\widehat{\theta}_{k}^{(i)}-\frac{\varepsilon^{\prime}}{\alpha}\bigl(Q_{\widehat{\bm{\theta}}_{k}}(s_{k},a_{k})-r(s_{k},a_{k})-\gamma Q_{\widehat{\bm{\theta}}_{k}}(s^{\prime}_{k},a^{\prime}_{k})\bigr)\nabla_{\theta}\sigma(s_{k},a_{k};\widehat{\theta}_{k}^{(i)}), \qquad (3.2)
\text{DF-PPO:}\quad\widehat{\pi}_{k+1}(\cdot\,|\,s)=\mathop{\mathrm{argmax}}_{\pi}\Bigl\{\bigl\langle Q_{\widehat{\bm{\theta}}_{k}}(s,\cdot),\pi(\cdot\,|\,s)\bigr\rangle_{\mathcal{A}}-\varepsilon^{-1}\cdot{\mathrm{KL}}\bigl(\pi(\cdot\,|\,s)\,\|\,\widehat{\pi}_{k}(\cdot\,|\,s)\bigr)\Bigr\}, \qquad (3.3)

where $(s_{k},a_{k})\sim\widetilde{\Phi}^{\widehat{\pi}_{k}}$ and $(s^{\prime}_{k},a^{\prime}_{k})\sim\widetilde{P}^{\widehat{\pi}_{k}}(\cdot\,|\,s_{k},a_{k})$. Here $\widehat{\pi}_{k}$ is the actor's policy at the $k$th iteration, $\widetilde{\Phi}^{\widehat{\pi}_{k}}$ is the corresponding weighting distribution, and $\varepsilon$ and $\varepsilon^{\prime}$ are the stepsizes of the DF-PPO update and the DF-TD update, respectively. In (3.2), the scaling $\alpha^{-1}$ arises since our update falls into the lazy-training regime (Chizat and Bach, 2018a). In the sequel, we denote by $\eta=\varepsilon^{\prime}/\varepsilon$ the relative TD timescale. Note that in a double-loop AC algorithm, the critic can usually be solved to high precision. In two-timescale AC, however, even with the KL-divergence term in (3.3), which regularizes the policy update and helps to improve the local estimation quality of the TD update, the critic $Q_{\widehat{\bm{\theta}}_{k}}$ used to update the actor's policy $\widehat{\pi}_{k}$ can still be far from the true action value function $Q^{\widehat{\pi}_{k}}$. Since the policy evaluation problem is not fully solved at each iteration, two-timescale AC is computationally more efficient but more challenging to characterize theoretically.
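A minimal sketch of one iteration of (3.2)-(3.3) is given below for intuition. It assumes a simple instance of the critic unit, $\sigma(s,a;\theta)=\tanh(\theta^{\top}x(s,a))$ for some feature map $x(s,a)$, a finite action set with a tabular log-policy, and externally supplied samples; these choices are illustrative and not the parameterization analyzed in §4.

```python
import numpy as np

def critic(theta, x, alpha):
    """Finite-width critic (3.1): Q(s, a) = (alpha / M) * sum_i sigma(s, a; theta_i),
    with the illustrative choice sigma(s, a; theta) = tanh(theta^T x(s, a))."""
    return alpha / theta.shape[0] * np.tanh(theta @ x).sum()

def df_td_step(theta, x, x_next, r, gamma, alpha, eps_prime):
    """One DF-TD step (3.2): a semigradient update applied to every particle."""
    delta = critic(theta, x, alpha) - r - gamma * critic(theta, x_next, alpha)
    grad = (1.0 - np.tanh(theta @ x) ** 2)[:, None] * x[None, :]   # d sigma / d theta at (s_k, a_k)
    return theta - eps_prime / alpha * delta * grad

def df_ppo_step(log_pi, q_row, eps):
    """One DF-PPO step (3.3) at a single state: the KL-regularized maximizer is the
    multiplicative-weights update pi_{k+1}(.|s) proportional to pi_k(.|s) * exp(eps * Q_k(s, .))."""
    log_pi = log_pi + eps * q_row
    return log_pi - np.log(np.exp(log_pi).sum())    # renormalize over actions
```

Choosing the TD stepsize eps_prime much larger than the PPO stepsize eps, i.e., a large relative timescale $\eta=\varepsilon^{\prime}/\varepsilon$, is what makes the scheme two-timescale.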

Mean-field (MF) Limit. To analyze the convergence of two-timescale AC with neural networks, we employ an analysis of the mean-field limit regime (Mei et al., 2018, 2019). Here, by the mean-field limit, we refer to an infinite-width limit, i.e., $M\rightarrow\infty$ for the network width $M$ in (3.1), together with a continuous-time limit, i.e., $t=k\varepsilon$ with $\varepsilon\rightarrow 0$ for the stepsize in (3.2) and (3.3). For $\widehat{\bm{\theta}}=\{\widehat{\theta}^{(i)}\}_{i=1}^{M}$ sampled independently from a distribution $\rho$, we can write the infinite-width limit of (3.1) as

Q(s,a;\rho)=\alpha\int\sigma(s,a;\theta)\rho({\mathrm{d}}\theta). \qquad (3.4)

In the sequel, we denote by $\widehat{\rho}_{k}$ the distribution of $\widehat{\theta}_{k}^{(i)}$ in the infinite-width limit of the neural network at the $k$th iteration. We further let $\rho_{t}$ and $\pi_{t}$ be the continuous-time limits of $\widehat{\rho}_{k}$ and $\widehat{\pi}_{k}$, respectively. The existence of $\pi_{t}$ is shown shortly. As derived in Zhang et al. (2020), the mean-field limit of the DF-TD update in (3.2) is

\text{MF-TD:}\quad\partial_{t}\rho_{t}=-\eta\mathop{\mathrm{div}}\bigl(\rho_{t}\cdot g(\cdot;\rho_{t},\pi_{t})\bigr), \qquad (3.5)

where $\eta$ is the relative TD timescale and

g(\theta;\rho,\pi)=-\mathbb{E}_{\widetilde{\Phi}^{\pi}}^{\pi}\Bigl\{\bigl[Q(s,a;\rho)-r(s,a)-\gamma\cdot Q(s^{\prime},a^{\prime};\rho)\bigr]\cdot\alpha^{-1}\nabla_{\theta}\sigma(s,a;\theta)\Bigr\} \qquad (3.6)

is a vector field. Here $\mathbb{E}_{\widetilde{\Phi}^{\pi}}^{\pi}$ is taken with respect to $(s,a)\sim\widetilde{\Phi}^{\pi}$ and $(s^{\prime},a^{\prime})\sim\widetilde{P}^{\pi}(\cdot\,|\,s,a)$. It remains to characterize the mean-field limit of the DF-PPO update in (3.3). Solving the maximization problem in (3.3) yields $\widehat{\pi}_{k+1}(a\,|\,s)\propto\widehat{\pi}_{k}(a\,|\,s)\exp\bigl(\varepsilon\,Q(s,a;\widehat{\rho}_{k})\bigr)$, so the infinite-width limit of the DF-PPO update can be written in closed form as

\varepsilon^{-1}\cdot\Bigl\{\log\bigl[\widehat{\pi}_{k+1}(a\,|\,s)\bigr]-\log\bigl[\widehat{\pi}_{k}(a\,|\,s)\bigr]\Bigr\}=Q(s,a;\widehat{\rho}_{k})-\widehat{Z}_{k}(s),

where $\widehat{Z}_{k}(s)$ is the normalizing factor such that $\int\widehat{\pi}_{k+1}({\mathrm{d}}a\,|\,s)=1$ for any $s\in{\mathcal{S}}$. By letting $t=k\varepsilon$ and $\varepsilon\rightarrow 0$, such a result directly shows that the limit of $\widehat{\pi}_{k}$ exists. Therefore, in the mean-field limit, we have $\partial_{t}\log\pi_{t}=Q_{t}-Z_{t}$, which can be further written as $\partial_{t}\pi_{t}=\pi_{t}\cdot(Q_{t}-Z_{t})$. Here $Q_{t}(s,a)=Q(s,a;\rho_{t})$ and $Z_{t}$ is the continuous-time limit of $\widehat{Z}_{k}$. Furthermore, noting that $\partial_{t}\int\pi_{t}({\mathrm{d}}a\,|\,s)=0$, the mean-field limit of the DF-PPO update in (3.3) is

\text{MF-PPO:}\quad\frac{{\mathrm{d}}}{{\mathrm{d}}t}\pi_{t}=\pi_{t}\cdot A_{t},\qquad\text{where}\ A_{t}(s,a)=Q_{t}(s,a)-\int Q_{t}(s,a^{\prime})\,\pi_{t}({\mathrm{d}}a^{\prime}\,|\,s). \qquad (3.7)

The two updates (3.5) and (3.7) correspond to the mean-field limits of (3.2) and (3.3), respectively, and together serve as the mean-field limit of the two-timescale AC. In particular, we remark that the MF-TD update in (3.5) for the critic is captured by a semigradient flow in the Wasserstein space (Villani, 2008), while the MF-PPO update in (3.7) for the actor resembles the replicator dynamics (Schuster and Sigmund, 1983; Börgers and Sarin, 1997; Hennes et al., 2020). See §A.2 for more discussion of the replicator dynamics. Note that such a framework is applicable to continuous state and action spaces. In this paper, we aim to provide a theoretical analysis of the mean-field limit of the two-timescale AC.
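To see the replicator behavior of (3.7) in isolation, the toy Euler discretization below evolves a policy at a single state with a finite action set while the critic values are held fixed; in the actual algorithm $Q_{t}$ evolves simultaneously via (3.5), so this is only an illustration of the actor dynamics.

```python
import numpy as np

q = np.array([1.0, 0.5, 0.0])       # Q_t(s, .) frozen in time for illustration
pi = np.ones(3) / 3.0               # uniform initial policy pi_0(.|s)
dt = 0.01
for _ in range(2000):
    advantage = q - pi @ q          # A_t(s, .) = Q_t(s, .) - <Q_t(s, .), pi_t(.|s)>
    pi = pi + dt * pi * advantage   # Euler step of d pi_t / dt = pi_t * A_t
    pi = np.clip(pi, 0.0, None)
    pi /= pi.sum()                  # guard against floating-point drift
print(pi)                           # probability mass concentrates on the best action
```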

4 Main Result

In this section, we first establish the convergence of the MF-PPO update in §4.1. Then, under additional assumptions, we establish the optimality and convergence of the mean-field two-timescale AC in §4.2.

4.1 Convergence of Mean-field PPO

For the MF-PPO update in (3.7), we establish the following theorem on global optimality and convergence rate.

Theorem 4.1 (Convergence of MF-PPO).

Let $\pi^{*}=\mathop{\mathrm{argmax}}_{\pi}J(\pi)$ be the optimal policy and $\pi_{0}$ be the initial policy. Then, it holds that

\frac{1}{T}\int_{0}^{T}\bigl(J(\pi^{*})-J(\pi_{t})\bigr){\mathrm{d}}t\leq\frac{\zeta}{T}+4\kappa\cdot\underbrace{\frac{1}{T}\int_{0}^{T}\|Q_{t}-Q^{\pi_{t}}\|_{2,\widetilde{\phi}^{\pi_{t}}}{\mathrm{d}}t}_{\text{policy evaluation error}}, \qquad (4.1)

where $\widetilde{\phi}^{\pi_{t}}\in\mathscr{P}({\mathcal{S}}\times\mathcal{A})$ is an evaluation distribution for the policy evaluation error and $\zeta=\mathbb{E}_{s\sim\mathcal{E}_{\mathcal{D}_{0}}^{\pi^{*}}}\bigl[{\mathrm{KL}}\bigl(\pi^{*}(\cdot\,|\,s)\,\|\,\pi_{0}(\cdot\,|\,s)\bigr)\bigr]$ is the expected KL-divergence between $\pi^{*}$ and $\pi_{0}$. Furthermore, letting $\widetilde{\phi}^{\pi_{t}}=\frac{1}{2}\widetilde{\phi}_{0}+\frac{1}{2}\phi_{0}\otimes\pi_{t}$, where $\widetilde{\phi}_{0}\in\mathscr{P}({\mathcal{S}}\times\mathcal{A})$ is a base distribution and $\phi_{0}=\int_{\mathcal{A}}\widetilde{\phi}_{0}(\cdot,{\mathrm{d}}a)$, the concentrability coefficient $\kappa$ is given by

\kappa=\biggl\|\frac{{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}}{\widetilde{\phi}_{0}}\biggr\|_{\infty}.
Proof.

See §B.1 for a detailed proof. ∎

The concentrability coefficient commonly appears in the reinforcement learning literature (Szepesvári and Munos, 2005; Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Scherrer et al., 2015; Farahmand et al., 2016; Lazaric et al., 2016; Liu et al., 2019; Wang et al., 2019). In contrast to more standard forms of the concentrability coefficient, note that $\kappa$ does not depend on the iterates of the algorithm. Our condition for the convergence of MF-PPO is thus much weaker, since we only need to specify a base distribution $\widetilde{\phi}_{0}$ such that $\kappa<\infty$.

Theorem 4.1 shows that MF-PPO converges to the globally optimal policy at a rate of $O(T^{-1})$ up to the policy evaluation error. The theorem therefore implies the global optimality and convergence of a double-loop AC algorithm, where the critic $Q_{t}$ is solved to high precision so that the policy evaluation error is sufficiently small. In the sequel, we consider the more challenging setting where the critic $Q_{t}$ is updated simultaneously with the actor's policy $\pi_{t}$.

4.2 Global Optimality and Convergence of Two-timescale AC

In this section, we aim to provide an upper bound on the policy evaluation error when the critic and the actor are updated simultaneously. Specifically, the actor is updated via MF-PPO in (3.7) and the critic $Q_{t}=Q(\cdot;\rho_{t})$ is updated via MF-TD in (3.5). The smooth function $\sigma$ in the parameterization of the Q function in (3.4) is taken to be the following two-layer neural network,

\sigma(s,a;\theta)=B_{\beta}\cdot\beta(b)\cdot\widetilde{\sigma}\bigl(w^{\top}(s,a,1)\bigr), \qquad (4.2)

where $\widetilde{\sigma}:\mathbb{R}\rightarrow\mathbb{R}$ is the activation function, $\theta=(b,w)$ is the parameter, and $\beta:\mathbb{R}\rightarrow(-1,1)$ is an odd and invertible rescaling function with scaling hyper-parameter $B_{\beta}>0$. It then holds that $D=d+2$, where $d$ and $D$ are the dimensions of $(s,a)$ and $\theta$, respectively. It is worth noting that the function class $\bigl\{\int\sigma(s,a;\theta)\rho({\mathrm{d}}\theta)\,:\,\rho\in\mathscr{P}_{2}(\mathbb{R}^{D})\bigr\}$ is the same as

\mathcal{F}=\Bigl\{\int\beta^{\prime}\cdot\widetilde{\sigma}\bigl(w^{\top}(s,a,1)\bigr)\nu({\mathrm{d}}\beta^{\prime},{\mathrm{d}}w)\,\Big|\,\nu\in\mathscr{P}_{2}\bigl((-B_{\beta},B_{\beta})\times\mathbb{R}^{d+1}\bigr)\Bigr\}, \qquad (4.3)

which captures a vast function class because of the universal function approximation theorem (Barron, 1993; Pinkus, 1999). We remark that we introduce the rescaling function $\beta$ in (4.2) to avoid studying the space of probability measures over $(-B_{\beta},B_{\beta})\times\mathbb{R}^{d+1}$ in (4.3), which has a boundary and thus lacks the regularity needed in the study of optimal transport. Furthermore, note that we introduce a hyper-parameter $\alpha>1$ in the Q function in (3.4). Thus, we are using $\alpha\cdot\mathcal{F}$ to represent $\mathcal{F}$, which causes an "over-representation" when $\alpha>1$. Such over-representation appears to be essential for our analysis. To anticipate briefly, we note that $\alpha$ controls the gap in the time-averaged total reward when the relative timescale $\eta$ is properly selected according to Theorem 4.6. Furthermore, such an influence is imposed through Lemma 4.4, which shows that the Wasserstein distance between $\rho_{0}$ and $\rho_{\pi_{t}}$ is upper bounded by $O(1/\alpha)$. In what follows, we consider the initialization of the TD update to be $\rho_{0}=N(0,I_{D})$, which implies that $Q(s,a;\rho_{0})=0$. We next impose a regularity assumption on the two-layer neural network $\sigma$.

Assumption 4.2 (Regularity of the Neural Network).

For the two-layer neural network $\sigma$ defined in (4.2), we assume that the following properties hold.

  • (i) The rescaling function $\beta:\mathbb{R}\rightarrow(-1,1)$ is odd, $L_{0,\beta}$-Lipschitz continuous, $L_{1,\beta}$-smooth, and invertible. Meanwhile, the inverse $\beta^{-1}$ is locally Lipschitz continuous. In particular, we assume that $\beta^{-1}$ is $\ell_{\beta}$-Lipschitz continuous on $[-2/3,2/3]$.

  • (ii) The activation function $\widetilde{\sigma}:\mathbb{R}\rightarrow\mathbb{R}$ is odd, $B_{\widetilde{\sigma}}$-bounded, $L_{0,\widetilde{\sigma}}$-Lipschitz continuous, and $L_{1,\widetilde{\sigma}}$-smooth.

We remark that Assumption 4.2 is not restrictive and is satisfied by a large family of neural networks, e.g., $\widetilde{\sigma}(x)=\tanh(x)$ and $\beta(b)=\tanh(b)$. Noting that $\|(s,a)\|_{2}\leq 1$, Assumption 4.2 implies that the function $\sigma(s,a;\theta)$ in (4.2) is odd with respect to $w$ and $b$ and is also bounded, Lipschitz continuous, and smooth over the parameter domain, that is,

|\nabla_{\theta}\sigma(s,a;\theta)|<B_{1},\quad|\nabla^{2}_{\theta\theta}\sigma(s,a;\theta)|<B_{2}. \qquad (4.4)
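As a concrete instantiation of (4.2) satisfying Assumption 4.2, one may take $\beta(b)=\tanh(b)$ and $\widetilde{\sigma}(x)=\tanh(x)$, as mentioned above. The snippet below is a minimal sketch of this choice; the value of $B_{\beta}$ and the input dimension are arbitrary placeholders.

```python
import numpy as np

B_beta = 5.0  # scaling hyper-parameter in (4.2); the numerical value is illustrative

def sigma(s_a, b, w):
    """Two-layer unit (4.2) with beta(b) = tanh(b) and sigma_tilde(x) = tanh(x);
    s_a is the concatenated (s, a) vector with ||(s, a)||_2 <= 1."""
    x = np.append(s_a, 1.0)                     # the augmented input (s, a, 1)
    return B_beta * np.tanh(b) * np.tanh(w @ x)

rng = np.random.default_rng(0)
s_a = rng.normal(size=4)
s_a /= max(1.0, np.linalg.norm(s_a))            # keep the input inside the unit ball
b, w = rng.normal(), rng.normal(size=5)
print(abs(sigma(s_a, b, w)) <= B_beta)          # boundedness: |sigma| <= B_beta always holds
```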

We then impose the following assumption on the MDP.

Assumption 4.3 (Regularity of the MDP).

For the MDP $({\mathcal{S}},\mathcal{A},\gamma,P,r,\mathcal{D}_{0})$, we assume the following properties hold.

  • (i) The reward function $r$ and the transition kernel $P$ admit the following representations with respect to the activation function $\widetilde{\sigma}$,

    r(s,a)=B_{r}\cdot\int\widetilde{\sigma}\bigl((s,a,1)^{\top}w\bigr)\mu({\mathrm{d}}w), \qquad (4.5)
    P(s^{\prime}\,|\,s,a)=\int\widetilde{\sigma}\bigl((s,a,1)^{\top}w\bigr)\varphi(s^{\prime})\psi(s^{\prime};{\mathrm{d}}w), \qquad (4.6)

    where $\mu$ and $\psi(s^{\prime};\cdot)$ are probability measures in $\mathscr{P}_{2}(\mathbb{R}^{d+1})$ for any $s^{\prime}\in{\mathcal{S}}$, $B_{r}$ is a positive scaling parameter, and $\varphi:{\mathcal{S}}\rightarrow\mathbb{R}_{+}$ is a nonnegative function.

  • (ii) The reward function $r$ satisfies $r(s,a)\geq 0$ for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$. For the representation of $r$ in (4.5) and the representation of the transition kernel $P$ in (4.6), we assume that

    \chi^{2}(\mu\,\|\,\rho_{w,0})<M_{\mu},\qquad\chi^{2}\bigl(\psi(s;\cdot)\,\|\,\rho_{w,0}\bigr)<M_{\psi},\quad\forall s\in{\mathcal{S}},
    \int\varphi(s)\,{\mathrm{d}}s\leq M_{1,\varphi},\qquad\int\varphi(s)^{2}\,{\mathrm{d}}s\leq M_{2,\varphi},

    where $\rho_{w,0}$ is the marginal distribution of $\rho_{0}$ with respect to $w$, i.e., $\rho_{w,0}=\int\rho_{0}({\mathrm{d}}b,\cdot)$, $\chi^{2}$ is the chi-squared divergence, and $M_{\mu}$, $M_{\psi}$, $M_{1,\varphi}$, $M_{2,\varphi}$ are absolute constants.

  • (iii) There exists an absolute constant $\mathcal{G}$ such that

    \bigl\|\psi(s;\cdot)-\psi(s^{\prime};\cdot)\bigr\|_{\dot{H}^{-1}(\mu)}<\mathcal{G},\qquad\bigl\|\psi(s;\cdot)-\mu\bigr\|_{\dot{H}^{-1}(\mu)}<\mathcal{G},
    \bigl\|\psi(s;\cdot)-\mu\bigr\|_{\dot{H}^{-1}(\psi(s^{\prime};\cdot))}<\mathcal{G},\qquad\bigl\|\psi(s;\cdot)-\psi(s^{\prime};\cdot)\bigr\|_{\dot{H}^{-1}(\psi(s^{\prime\prime};\cdot))}<\mathcal{G},\quad\forall s,s^{\prime},s^{\prime\prime}\in{\mathcal{S}},

    where $\|\cdot\|_{\dot{H}^{-1}(\cdot)}$ is the weighted homogeneous Sobolev norm.

We remark that by assuming $\psi$ to be a probability measure and $\varphi(s^{\prime})\geq 0$ in (4.6), the representation of the transition kernel does not lose generality. Specifically, the function class of (4.6) is the same as

\mathcal{P}=\Bigl\{\int\widetilde{\sigma}\bigl((s,a,1)^{\top}w\bigr)\widetilde{\psi}(s^{\prime};{\mathrm{d}}w)\,\Big|\,\widetilde{\psi}(s^{\prime};\cdot)\text{ is a signed measure for any }s^{\prime}\in{\mathcal{S}}\Bigr\}.

See §C.1 for a detailed proof. Assumption 4.3 generalizes the linear MDP assumption in Yang and Wang (2019b, a); Cai et al. (2019a) and Jin et al. (2020). In particular, our representation of the reward function and the transition kernel benefits from the universal function approximation theorem and is thus not as restrictive as the original linear MDP assumption. Note that the infinite-width neural network in (4.2) has a two-layer structure. We establish the following lemma on the regularity of the representation of the action value function $Q^{\pi}$ by such a neural network.

Lemma 4.4 (Regularity of Representation of QπQ^{\pi}).

Suppose that Assumptions 4.2 and 4.3 hold. For any policy $\pi$, there exists a probability measure $\rho_{\pi}\in\mathscr{P}_{2}(\mathbb{R}^{D})$ for the representation of $Q^{\pi}$ with the following properties.

  • (i) For the function $Q(s,a;\rho_{\pi})$ defined by (3.4) with $\rho=\rho_{\pi}$ and the action value function $Q^{\pi}(s,a)$ defined by (2.2), we have $Q(s,a;\rho_{\pi})=Q^{\pi}(s,a)$ for any $(s,a)\in{\mathcal{S}}\times\mathcal{A}$.

  • (ii) Letting $B_{\beta}\geq 2\bigl(B_{r}+\gamma(1-\gamma)^{-1}B_{r}M_{1,\varphi}\bigr)$ for the neural network defined in (4.2) and $\rho_{0}=N(0,I_{D})$ for the initial distribution, we have $\widetilde{W}_{2}(\rho_{\pi},\rho_{0})\leq\bar{D}$ for any policy $\pi$, where we define $\widetilde{W}_{2}(\cdot,\cdot)=\alpha W_{2}(\cdot,\cdot)$ as the scaled $W_{2}$ metric. Here the constant $\bar{D}$ depends on the discount factor $\gamma$ and the absolute constants $L_{0,\beta},L_{1,\beta},\ell_{\beta},B_{r},M_{\mu},M_{\psi},M_{1,\varphi},M_{2,\varphi}$ defined in Assumptions 4.2 and 4.3.

Proof.

See §B.2 for a detailed proof. ∎

Property (i) of Lemma 4.4 shows that the action value function $Q^{\pi}$ can be parameterized by the infinite-width two-layer neural network $Q(\cdot;\rho_{\pi})$ in (3.4). Note that a larger $B_{\beta}$ captures a larger function class in (4.3). Without loss of generality, we assume that $B_{\beta}\geq 2\bigl(B_{r}+\gamma(1-\gamma)^{-1}B_{r}M_{1,\varphi}\bigr)$ holds in the sequel. Hence, by Property (ii), we have $\widetilde{W}_{2}(\rho_{\pi},\rho_{0})\leq O(1)$ for any policy $\pi$. In particular, it holds by Property (i) of Lemma 4.4 that $\|Q_{t}-Q^{\pi_{t}}\|_{2,\widetilde{\phi}^{\pi_{t}}}=\|Q(\cdot;\rho_{t})-Q(\cdot;\rho_{\pi_{t}})\|_{2,\widetilde{\phi}^{\pi_{t}}}$, and we have the following theorem characterizing this error with regard to the $W_{2}$ space.

Theorem 4.5 (Upper Bound of Policy Evaluation Error).

Suppose that Assumptions 4.2 and 4.3 hold and $\rho_{0}=N(0,I_{D})$ is the initial distribution. We specify the weighting distribution $\widetilde{\Phi}^{\pi_{t}}$ in the MF-TD update (3.5) as $\widetilde{\Phi}^{\pi_{t}}={\widetilde{\mathcal{E}}}_{\widetilde{\phi}^{\pi_{t}}}^{\pi_{t}}$, where $\widetilde{\phi}^{\pi_{t}}\in\mathscr{P}({\mathcal{S}}\times\mathcal{A})$ is the evaluation distribution for the policy evaluation error in Theorem 4.1. Then, it holds that

(1-\sqrt{\gamma})\cdot\|Q_{t}-Q^{\pi_{t}}\|_{2,\widetilde{\phi}^{\pi_{t}}}^{2}\leq-\frac{{\mathrm{d}}}{{\mathrm{d}}t}\frac{\widetilde{W}_{2}^{2}(\rho_{t},\rho_{\pi_{t}})}{2\eta}+\Delta_{t}, \qquad (4.7)

where

\Delta_{t}=2\alpha^{1/2}\eta^{-1}\mathcal{B}B_{1}\cdot\widetilde{W}_{2}(\rho_{t},\rho_{0})\,\widetilde{W}_{2}(\rho_{t},\rho_{\pi_{t}})+\alpha^{-1}B_{2}\cdot\Bigl(4B_{1}\max\bigl\{\widetilde{W}_{2}(\rho_{\pi_{t}},\rho_{0}),\widetilde{W}_{2}(\rho_{t},\rho_{0})\bigr\}+B_{r}\Bigr)\widetilde{W}_{2}(\rho_{t},\rho_{\pi_{t}})^{2}.

Here $B_{1}$ and $B_{2}$ are defined in (4.4) of Assumption 4.2, $\eta$ is the relative TD timescale, $\alpha$ is the scaling parameter of the neural network, and $\widetilde{W}_{2}=\alpha W_{2}$ is the scaled $W_{2}$ metric. Moreover, the constant $\mathcal{B}$ depends on the discount factor $\gamma$, the scaling parameter $B_{\beta}$ in (4.2), and the absolute constants $\ell_{\beta},B_{r},M_{1,\varphi},\mathcal{G}$ defined in Assumptions 4.2 and 4.3.

Proof.

See §B.3 for a detailed proof. ∎

Here we give a nonrigorous discussion of how to upper bound $\Delta_{t}$ in (4.7). If $\widetilde{W}_{2}(\rho_{t},\rho_{0})\leq O(1)$ holds for any $t\in[0,T]$, then by $\widetilde{W}_{2}(\rho_{\pi_{t}},\rho_{0})\leq O(1)$ from Lemma 4.4 and the triangle inequality for the $W_{2}$ distance (Villani, 2008), it follows that $\widetilde{W}_{2}(\rho_{t},\rho_{\pi_{t}})\leq O(1)$ and $\Delta_{t}\leq O(\alpha^{1/2}\eta^{-1}+\alpha^{-1})$. Integrating both sides of (4.7) over $[0,T]$ and taking the time average, the policy evaluation error $\frac{1}{T}\int_{0}^{T}\|Q_{t}-Q^{\pi_{t}}\|_{2,\widetilde{\phi}^{\pi_{t}}}{\mathrm{d}}t$ is then upper bounded by $O(\eta^{-1}T^{-1}+\alpha^{1/2}\eta^{-1}+\alpha^{-1})$. Inspired by this fact, we introduce the following restarting mechanism to ensure that $\widetilde{W}_{2}(\rho_{t},\rho_{0})\leq O(1)$.

Restarting Mechanism. Let $\widetilde{W}_{0}=\lambda\bar{D}$ be a threshold, where $\bar{D}$ is the upper bound on $\widetilde{W}_{2}(\rho_{\pi},\rho_{0})$ from Lemma 4.4, $\lambda\geq 3$ is a constant scaling parameter for the restarting threshold, $\rho_{t}$ is the distribution of the parameters of the neural network at time $t$, and $\rho_{0}$ is the initial distribution. Whenever we detect that $\widetilde{W}_{2}(\rho_{t},\rho_{0})$ reaches $\widetilde{W}_{0}$ during the update, we pause and reset $\rho_{t}$ to $\rho_{0}$ by resampling the parameters from $\rho_{0}$. We then reset the critic with the newly sampled parameters while keeping the actor's policy $\pi_{t}$ unchanged and continue the update.
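The following sketch is one possible implementation of the restart check, under the assumption that the critic is represented by finitely many particles and that $W_{2}(\rho_{t},\rho_{0})$ is monitored through the root-mean-square displacement of the particles from their own initialization (any such coupling upper bounds $W_{2}$ between the two empirical measures); the function and variable names are illustrative only.

```python
import numpy as np

def maybe_restart(theta, theta0, alpha, lam, D_bar, rng):
    """Restart check: estimate W2_tilde(rho_t, rho_0) = alpha * W2(rho_t, rho_0)
    via the particles' displacement from their initial draw, and resample the
    critic parameters from rho_0 = N(0, I_D) once the threshold lam * D_bar is
    reached, leaving the actor's policy untouched."""
    w2_upper = np.sqrt(np.mean(np.sum((theta - theta0) ** 2, axis=1)))  # upper bound on W2
    if alpha * w2_upper >= lam * D_bar:
        theta0 = rng.standard_normal(theta.shape)  # fresh draw from rho_0
        theta = theta0.copy()                      # reset the critic only
    return theta, theta0
```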

The restarting mechanism guarantees $\widetilde{W}_{2}(\rho_{t},\rho_{0})\leq\lambda\bar{D}$ by restricting the parameter distribution $\rho_{t}$ to stay close to $\rho_{0}$. Moreover, by letting $\lambda\geq 3$, we ensure that $\rho_{\pi_{t}}$ is realizable by $\rho_{t}$, since $\widetilde{W}_{2}(\rho_{\pi_{t}},\rho_{0})\leq\bar{D}\leq\lambda\bar{D}$, which means that the neural network is capable of capturing the representation of the action value function $Q^{\pi_{t}}$. We remark that by letting $\widetilde{W}_{0}=O(1)$, we allow $\rho_{t}$ to deviate from $\rho_{0}$ by up to $W_{2}(\rho_{t},\rho_{0})\leq O(\alpha^{-1})$ under the restarting mechanism. In contrast, the NTK regime (Cai et al., 2019b), which corresponds to letting $\alpha=\sqrt{M}$ in (3.1), only allows $\rho_{t}$ to deviate from $\rho_{0}$ by a chi-squared divergence $\chi^{2}(\rho_{t}\,\|\,\rho_{0})\leq O(M^{-1})=o(1)$. That is, the NTK regime fails to induce a feature representation significantly different from the initial one. Before moving on, we summarize the construction of the weighting distribution $\widetilde{\Phi}^{\pi_{t}}$ in Theorems 4.1 and 4.5 as follows,

\widetilde{\Phi}^{\pi_{t}}={\widetilde{\mathcal{E}}}_{\widetilde{\phi}^{\pi_{t}}}^{\pi_{t}},\qquad\widetilde{\phi}^{\pi_{t}}=\frac{1}{2}\widetilde{\phi}_{0}+\frac{1}{2}\phi_{0}\otimes\pi_{t},\qquad\phi_{0}=\int_{\mathcal{A}}\widetilde{\phi}_{0}(\cdot,{\mathrm{d}}a), \qquad (4.8)

where $\widetilde{\phi}_{0}$ is the base distribution. We now have the following theorem, which characterizes the global optimality and convergence of two-timescale AC with the restarting mechanism.

Theorem 4.6 (Global Optimality and Convergence Rate of Two-timescale AC with Restarting Mechanism).

Suppose that (4.8) and Assumptions 4.2 and 4.3 hold. With the restarting mechanism, it holds that

\frac{1}{T}\int_{0}^{T}\bigl(J(\pi^{*})-J(\pi_{t})\bigr){\mathrm{d}}t\leq\underbrace{\frac{\zeta}{T}}_{\text{(a)}}+\underbrace{4\kappa\sqrt{\alpha^{-1}S_{1}+\alpha^{1/2}\eta^{-1}S_{2}+\frac{\eta^{-1}\bar{D}^{2}}{2T(1-\sqrt{\gamma})}}}_{\text{(b)}}, \qquad (4.9)

where we have

\zeta=\mathbb{E}_{s\sim\mathcal{E}_{\mathcal{D}_{0}}^{\pi^{*}}}\Bigl[{\mathrm{KL}}\bigl(\pi^{*}(\cdot\,|\,s)\,\|\,\pi_{0}(\cdot\,|\,s)\bigr)\Bigr],\qquad\kappa=\biggl\|\frac{{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}}{\widetilde{\phi}_{0}}\biggr\|_{\infty},
S_{1}=\frac{(1+\lambda)^{2}\bar{D}^{2}B_{2}(4B_{1}\lambda\bar{D}+B_{r})}{1-\sqrt{\gamma}},\qquad S_{2}=\frac{2\mathcal{B}B_{1}\lambda(1+\lambda)\bar{D}^{2}}{1-\sqrt{\gamma}}.

Here $B_{r}$, $B_{1}$, and $B_{2}$ are defined in Assumptions 4.2 and 4.3, $\bar{D}$ is the upper bound on $\widetilde{W}_{2}(\rho_{\pi},\rho_{0})$ in Lemma 4.4, $\mathcal{B}$ depends on the discount factor $\gamma$ and the absolute constants defined in Assumptions 4.2 and 4.3, and $\lambda$ is the scaling parameter for the restarting threshold. Additionally, the total number of restarts $N$ satisfies the following inequality,

N\leq(\lambda-2)^{-1}\bigl((\alpha^{-1}\eta S_{1}+\alpha^{1/2}S_{2})\cdot 2T\bar{D}^{-2}(1-\sqrt{\gamma})+1\bigr).
Proof.

See §B.4 for a detailed proof. ∎

Note that for a given MDP with starting distribution $\mathcal{D}_{0}$, the expected KL-divergence $\zeta$ and the concentrability coefficient $\kappa$ are both independent of the two-timescale update. We remark that our condition for (4.9) to be bounded is not restrictive. Specifically, we only need a given $\pi_{0}$ and $\widetilde{\phi}_{0}$ such that the KL-divergence $\zeta<\infty$ and the concentrability coefficient $\kappa<\infty$, which is a weakening relative to the standard usage of the concentrability coefficient in the literature (Szepesvári and Munos, 2005; Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Scherrer et al., 2015; Farahmand et al., 2016; Lazaric et al., 2016; Liu et al., 2019; Wang et al., 2019).

The first term (a) on the right-hand side of (4.9) diminishes as $T\rightarrow\infty$. The second term (b) corresponds to the policy evaluation error. We give an example to demonstrate the convergence of the two-timescale AC. We let the scaling parameter of the restarting threshold be $\lambda=3$. By letting $\eta=\alpha^{3/2}$, it holds that $\frac{1}{T}\int_{0}^{T}\bigl(J(\pi^{*})-J(\pi_{t})\bigr){\mathrm{d}}t$ decreases at a rate of $O(T^{-1}+\alpha^{-1/2}+\alpha^{-3/4}T^{-1/2})$ as $\alpha\rightarrow\infty$ and $T\rightarrow\infty$; a worked substitution is given below. Note that $\eta=\alpha^{3/2}$ means that the critic has a larger relative TD timescale in (3.5). As for the total number of restarts $N$, it holds that $N\leq O(\alpha^{1/2}T)$ as $\alpha\rightarrow\infty$, which induces a tradeoff: a larger $\alpha$ guarantees a smaller gap in $\frac{1}{T}\int_{0}^{T}\bigl(J(\pi^{*})-J(\pi_{t})\bigr){\mathrm{d}}t$ but yields more restarts and potentially requires a larger relative TD timescale.
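For concreteness, substituting $\eta=\alpha^{3/2}$ into term (b) of (4.9) and using $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ recovers the stated rate,

4\kappa\sqrt{\alpha^{-1}S_{1}+\alpha^{-1}S_{2}+\frac{\alpha^{-3/2}\bar{D}^{2}}{2T(1-\sqrt{\gamma})}}\leq 4\kappa\Bigl(\sqrt{S_{1}+S_{2}}\cdot\alpha^{-1/2}+\frac{\bar{D}}{\sqrt{2(1-\sqrt{\gamma})}}\cdot\alpha^{-3/4}T^{-1/2}\Bigr)=O\bigl(\alpha^{-1/2}+\alpha^{-3/4}T^{-1/2}\bigr),

which, together with term (a) of order $O(T^{-1})$, gives the overall $O(T^{-1}+\alpha^{-1/2}+\alpha^{-3/4}T^{-1/2})$ rate. Likewise, $\alpha^{-1}\eta S_{1}+\alpha^{1/2}S_{2}=\alpha^{1/2}(S_{1}+S_{2})$ in the restarting bound yields $N\leq O(\alpha^{1/2}T)$.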

5 Acknowledgement

Zhaoran Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support.

References

  • Agazzi and Lu (2019) Agazzi, A. and Lu, J. (2019). Temporal-difference learning for nonlinear value function approximation in the lazy training regime. arXiv preprint arXiv:1905.10917.
  • Agazzi and Lu (2020) Agazzi, A. and Lu, J. (2020). Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime. arXiv preprint arXiv:2010.11858.
  • Agostinelli et al. (2019) Agostinelli, F., McAleer, S., Shmakov, A. and Baldi, P. (2019). Solving the Rubik’s cube with deep reinforcement learning and search. Nature Machine Intelligence, 1 356–363.
  • Akkaya et al. (2019) Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R. et al. (2019). Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113.
  • Ambrosio and Gigli (2013) Ambrosio, L. and Gigli, N. (2013). A user’s guide to optimal transport. In Modelling and Optimisation of Flows on Networks. Springer, 1–155.
  • Ambrosio et al. (2008) Ambrosio, L., Gigli, N. and Savaré, G. (2008). Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer.
  • Antos et al. (2008) Antos, A., Szepesvári, C. and Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71 89–129.
  • Barron (1993) Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39 930–945.
  • Berner et al. (2019) Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C. et al. (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
  • Bhatnagar et al. (2008) Bhatnagar, S., Ghavamzadeh, M., Lee, M. and Sutton, R. S. (2008). Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems.
  • Bhatnagar et al. (2009) Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M. and Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45 2471–2482.
  • Börgers and Sarin (1997) Börgers, T. and Sarin, R. (1997). Learning through reinforcement and replicator dynamics. Journal of Economic Theory, 77 1–14.
  • Borkar (2009) Borkar, V. S. (2009). Stochastic Approximation: A Dynamical Systems Viewpoint. Springer.
  • Cai et al. (2019a) Cai, Q., Yang, Z., Jin, C. and Wang, Z. (2019a). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.
  • Cai et al. (2019b) Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019b). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems.
  • Chen et al. (2020a) Chen, M., Wang, Y., Liu, T., Yang, Z., Li, X., Wang, Z. and Zhao, T. (2020a). On computation and generalization of generative adversarial imitation learning. arXiv preprint arXiv:2001.02792.
  • Chen et al. (2020b) Chen, Z., Cao, Y., Gu, Q. and Zhang, T. (2020b). Mean-field analysis of two-layer neural networks: Non-asymptotic rates and generalization bounds. arXiv preprint arXiv:2002.04026.
  • Chizat and Bach (2018a) Chizat, L. and Bach, F. (2018a). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.
  • Chizat and Bach (2018b) Chizat, L. and Bach, F. (2018b). On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems.
  • Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J. and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.
  • Fang et al. (2019a) Fang, C., Dong, H. and Zhang, T. (2019a). Over parameterized two-level neural networks can learn near optimal feature representations. arXiv preprint arXiv:1910.11508.
  • Fang et al. (2019b) Fang, C., Gu, Y., Zhang, W. and Zhang, T. (2019b). Convex formulation of overparameterized deep neural networks. arXiv preprint arXiv:1911.07626.
  • Farahmand et al. (2016) Farahmand, A.-m., Ghavamzadeh, M., Szepesvári, C. and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research, 17 4809–4874.
  • Farahmand et al. (2010) Farahmand, A.-m., Szepesvári, C. and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems.
  • Friedrichs (1944) Friedrichs, K. O. (1944). The identity of weak and strong extensions of differential operators. Transactions of the American Mathematical Society, 55 132–151.
  • Fu et al. (2020) Fu, Z., Yang, Z. and Wang, Z. (2020). Single-timescale actor-critic provably finds globally optimal policy. arXiv preprint arXiv:2008.00483.
  • Harker and Pang (1990) Harker, P. T. and Pang, J.-S. (1990). Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Mathematical Programming, 48 161–220.
  • Hennes et al. (2020) Hennes, D., Morrill, D., Omidshafiei, S., Munos, R., Perolat, J., Lanctot, M., Gruslys, A., Lespiau, J.-B., Parmas, P., Duéñez-Guzmán, E. et al. (2020). Neural replicator dynamics: Multiagent learning via hedging policy gradients. In International Conference on Autonomous Agents and MultiAgent Systems.
  • Hong et al. (2020) Hong, M., Wai, H.-T., Wang, Z. and Yang, Z. (2020). A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170.
  • Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems.
  • Javanmard et al. (2019) Javanmard, A., Mondelli, M. and Montanari, A. (2019). Analysis of a two-layer neural network via displacement convexity. arXiv preprint arXiv:1901.01375.
  • Jin et al. (2020) Jin, C., Yang, Z., Wang, Z. and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory.
  • Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, vol. 2.
  • Kakade (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.
  • Khodadadian et al. (2021) Khodadadian, S., Doan, T. T., Maguluri, S. T. and Romberg, J. (2021). Finite sample analysis of two-time-scale natural actor-critic algorithm. arXiv preprint arXiv:2101.10506.
  • Konda and Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems.
  • Kushner and Yin (2003) Kushner, H. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer.
  • Lazaric et al. (2016) Lazaric, A., Ghavamzadeh, M. and Munos, R. (2016). Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17 583–612.
  • Liu et al. (2019) Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
  • Lu et al. (2020) Lu, Y., Ma, C., Lu, Y., Lu, J. and Ying, L. (2020). A mean field analysis of deep resnet and beyond: Towards provably optimization via overparameterization from depth. In International Conference on Machine Learning. PMLR.
  • Mei et al. (2019) Mei, S., Misiakiewicz, T. and Montanari, A. (2019). Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory. PMLR.
  • Mei et al. (2018) Mei, S., Montanari, A. and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115 E7665–E7671.
  • Munos and Szepesvári (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. The Journal of Machine Learning Research, 9 815–857.
  • Otto and Villani (2000) Otto, F. and Villani, C. (2000). Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173 361–400.
  • Peters and Schaal (2008) Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71 1180–1190.
  • Peyre (2011) Peyre, R. (2011). Comparison between W2W_{2} distance and H˙1\dot{H}^{-1} norm, and localisation of Wasserstein distance. arXiv preprint arXiv:1104.4631.
  • Pinkus (1999) Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8 143–195.
  • Scherrer et al. (2015) Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. and Geist, M. (2015). Approximate modified policy iteration and its application to the game of Tetris. The Journal of Machine Learning Research, 16 1629–1676.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Schuster and Sigmund (1983) Schuster, P. and Sigmund, K. (1983). Replicator dynamics. Journal of Theoretical Biology, 100 533–538.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529 484–489.
  • Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017). Mastering the game of Go without human knowledge. Nature, 550 354.
  • Sirignano and Spiliopoulos (2020a) Sirignano, J. and Spiliopoulos, K. (2020a). Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130 1820–1852.
  • Sirignano and Spiliopoulos (2020b) Sirignano, J. and Spiliopoulos, K. (2020b). Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80 725–752.
  • Sutton (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3 9–44.
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT press.
  • Szepesvári and Munos (2005) Szepesvári, C. and Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In International Conference on Machine Learning. ACM.
  • Villani (2003) Villani, C. (2003). Topics in Optimal Transportation. American Mathematical Society.
  • Villani (2008) Villani, C. (2008). Optimal Transport: Old and New. Springer.
  • Vinyals et al. (2019) Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P. et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575 350–354.
  • Wang et al. (2019) Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
  • Watkins and Dayan (1992) Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8 279–292.
  • Wei et al. (2019) Wei, C., Lee, J. D., Liu, Q. and Ma, T. (2019). Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems.
  • Wu et al. (2020) Wu, Y., Zhang, W., Xu, P. and Gu, Q. (2020). A finite time analysis of two time-scale actor critic methods. arXiv preprint arXiv:2005.01350.
  • Xu et al. (2020a) Xu, T., Wang, Z. and Liang, Y. (2020a). Improving sample complexity bounds for actor-critic algorithms. arXiv preprint arXiv:2004.12956.
  • Xu et al. (2020b) Xu, T., Wang, Z. and Liang, Y. (2020b). Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms. arXiv preprint arXiv:2005.03557.
  • Yang and Wang (2019a) Yang, L. F. and Wang, M. (2019a). Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389.
  • Yang and Wang (2019b) Yang, L. F. and Wang, M. (2019b). Sample-optimal parametric q-learning using linearly additive features. arXiv preprint arXiv:1902.04779.
  • Zhang et al. (2020) Zhang, Y., Cai, Q., Yang, Z., Chen, Y. and Wang, Z. (2020). Can temporal-difference and Q-learning learn representation? A mean-field theory. arXiv preprint arXiv:2006.04761.

Appendix A Supplement to the Background

In this section, we present some background on Wasserstein space and replicator dynamics.

A.1 Wasserstein Space

Let ΘD\Theta\subseteq\mathbb{R}^{D} be a Polish space. We denote by 𝒫2(Θ)𝒫(Θ)\mathscr{P}_{2}(\Theta)\subseteq\mathscr{P}(\Theta) the set of probability measures with finite second moments. Then, the Wasserstein-2 (W2W_{2}) distance between μ,ν𝒫2(Θ)\mu,\nu\in\mathscr{P}_{2}(\Theta) is defined as follows,

W2(μ,ν)=inf{𝔼[XY2]1/2|law(X)=μ,law(Y)=ν},\displaystyle W_{2}(\mu,\nu)=\inf\Bigl{\{}\mathbb{E}\bigl{[}\|X-Y\|^{2}\bigr{]}^{1/2}{\,\Big{|}\,}\mathop{\mathrm{law}}(X)=\mu,\mathop{\mathrm{law}}(Y)=\nu\Bigr{\}}, (A.1)

where the infimum is taken over the random variables XX and YY on Θ\Theta and we denote by law(X){\rm law}(X) the distribution of a random variable XX. We call =(𝒫2(Θ),W2)\mathcal{M}=(\mathscr{P}_{2}(\Theta),W_{2}) the Wasserstein space, which is an infinite-dimensional manifold (Villani, 2008). In particular, we define the tangent vector at μ\mu\in\mathcal{M} as ρ˙0\dot{\rho}_{0} for the corresponding curve ρ:[0,1]𝒫2(Θ)\rho:[0,1]\rightarrow\mathscr{P}_{2}(\Theta) with ρ0=μ\rho_{0}=\mu. Under certain regularity conditions, the continuity equation tρt=div(ρtvt)\partial_{t}\rho_{t}=-\mathop{\mathrm{div}}(\rho_{t}v_{t}) corresponds to a vector field v:[0,1]×ΘDv:[0,1]\times\Theta\rightarrow\mathbb{R}^{D}, which endows the infinite-dimensional manifold 𝒫2(Θ)\mathscr{P}_{2}(\Theta) with a weak Riemannian structure in the following sense (Villani, 2008). Given any tangent vectors uu and u~\widetilde{u} at μ\mu\in\mathcal{M} and the corresponding vector fields v,v~v,\widetilde{v}, which satisfy u+div(μv)=0u+\mathop{\mathrm{div}}(\mu v)=0 and u~+div(μv~)=0\widetilde{u}+\mathop{\mathrm{div}}(\mu\widetilde{v})=0, respectively, we define the inner product of uu and u~\widetilde{u} as follows,

u,u~μ,W2=vv~dμ=v,v~μ,\displaystyle\langle u,\widetilde{u}\rangle_{\mu,W_{2}}=\int v\cdot\widetilde{v}{\mathrm{d}}\mu=\langle v,\widetilde{v}\rangle_{\mu}, (A.2)

which yields a Riemannian metric. Such a Riemannian metric further induces a norm uμ,W2=u,uμ,W21/2\|u\|_{\mu,W_{2}}=\langle u,u\rangle_{\mu,W_{2}}^{1/2} for any tangent vector uTμu\in T_{\mu}\mathcal{M} at any μ\mu\in\mathcal{M}, which allows us to write the Wasserstein-2 distance defined in (A.1) as follows,

W2(μ,ν)=inf{(01ρ˙tρt,W22dt)1/2|ρ:[0,1],ρ0=μ,ρ1=ν}.\displaystyle W_{2}(\mu,\nu)=\inf\Biggl{\{}\biggl{(}\int_{0}^{1}\|\dot{\rho}_{t}\|^{2}_{\rho_{t},W_{2}}\,{\mathrm{d}}t\biggr{)}^{1/2}{\,\Bigg{|}\,}\rho:[0,1]\rightarrow\mathcal{M},\rho_{0}=\mu,\rho_{1}=\nu\Biggr{\}}. (A.3)

Here ρ˙s\dot{\rho}_{s} denotes tρt|t=s\partial_{t}\rho_{t}{\,|\,}_{t=s} for any s[0,1]s\in[0,1]. In particular, the infimum in (A.3) is attained by the geodesic ρ~:[0,1]𝒫2(Θ)\widetilde{\rho}:[0,1]\rightarrow\mathscr{P}_{2}(\Theta) connecting μ,ν\mu,\nu\in\mathcal{M}. Moreover, the geodesics on \mathcal{M} are constant-speed, that is,

ρ~t˙ρ~t,W2=W2(μ,ν),t[0,1].\displaystyle\|\dot{\widetilde{\rho}_{t}}\|_{\widetilde{\rho}_{t},W_{2}}=W_{2}(\mu,\nu),\quad\forall t\in[0,1]. (A.4)
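For intuition (a standard fact that we recall here; it is not used in the analysis below), when μ\mu is absolutely continuous, so that an optimal transport map TT from μ\mu to ν\nu exists, the constant-speed geodesic admits the displacement-interpolation form

ρ~t=((1t)id+tT)#μ,t[0,1],\displaystyle\widetilde{\rho}_{t}=\bigl{(}(1-t)\,\mathrm{id}+t\,T\bigr{)}_{\#}\mu,\quad t\in[0,1],

whose tangent vectors have constant norm W2(μ,ν)W_{2}(\mu,\nu), which is consistent with (A.4).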

In the Wasserstein space \mathcal{M}, a curve ρ:(a,b)𝒫2(Θ)\rho:(a,b)\rightarrow\mathscr{P}_{2}(\Theta) is defined to be absolutely continuous if there exists mL1(a,b)m\in L^{1}(a,b), i.e., ab|m(t)|dt<\int_{a}^{b}|m(t)|{\mathrm{d}}t<\infty, such that

W2(ρs,ρt)stm(r)dr,a<st<b.\displaystyle W_{2}(\rho_{s},\rho_{t})\leq\int_{s}^{t}m(r){\mathrm{d}}r,\quad\forall a<s\leq t<b.

Such an absolutely continuous curve ρt\rho_{t} allows us to define the metric derivative in \mathcal{M} as follows,

|ρ˙t|W2=limstW2(ρs,ρt)|st|.\displaystyle|\dot{\rho}_{t}|_{W_{2}}=\lim_{s\rightarrow t}\frac{W_{2}(\rho_{s},\rho_{t})}{|s-t|}. (A.5)

By Ambrosio et al. (2008), the metric derivative |ρ˙t|W2|\dot{\rho}_{t}|_{W_{2}} is connected to the norm of the tangent vector by

|ρ˙t|W2=ρ˙tρt,W2.\displaystyle|\dot{\rho}_{t}|_{W_{2}}=\|\dot{\rho}_{t}\|_{\rho_{t},W_{2}}. (A.6)

Furthermore, we introduce the Wasserstein-1 distance, which is defined as

W1(μ1,μ2)=inf{𝔼[XY]|law(X)=μ1,law(Y)=μ2}\displaystyle W_{1}(\mu^{1},\mu^{2})=\inf\Bigl{\{}\mathbb{E}\bigl{[}\|X-Y\|\bigr{]}{\,\Big{|}\,}{\rm law}(X)=\mu^{1},{\rm law}(Y)=\mu^{2}\Bigr{\}}

for any μ1,μ2𝒫(D)\mu^{1},\mu^{2}\in\mathscr{P}(\mathbb{R}^{D}) with finite first moments. The Wasserstein-1 distance has the following dual representation (Ambrosio et al., 2008),

W1(μ1,μ2)=sup{f(x)d(μ1μ2)(x)|continuousf:D,Lip(f)1}.\displaystyle W_{1}(\mu^{1},\mu^{2})=\sup\biggl{\{}\int f(x)\,{\mathrm{d}}(\mu^{1}-\mu^{2})(x){\,\bigg{|}\,}{\rm continuous}~{}f:\mathbb{R}^{D}\rightarrow\mathbb{R},{\rm Lip}(f)\leq 1\biggr{\}}. (A.7)
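As a concrete numerical illustration of the distances in (A.1) and (A.7) (this sketch is ours and is not part of the analysis), recall that in one dimension the optimal coupling between two empirical measures with the same number of equally weighted atoms is the monotone (sorted) matching. The following minimal Python snippet, with illustrative Gaussian distributions chosen only for this example, estimates W1W_{1} and W2W_{2} from sorted samples and compares the W2W_{2} estimate with the one-dimensional Gaussian closed form W2(N(m1,s12),N(m2,s22))2=(m1m2)2+(s1s2)2W_{2}\bigl{(}N(m_{1},s_{1}^{2}),N(m_{2},s_{2}^{2})\bigr{)}^{2}=(m_{1}-m_{2})^{2}+(s_{1}-s_{2})^{2}.

import numpy as np

# A minimal sketch (not from the paper): estimate W_1 and W_2 between two
# one-dimensional Gaussians from samples. In one dimension, sorting realizes
# the optimal (monotone) coupling between equal-size empirical measures.
rng = np.random.default_rng(0)
n = 200_000
x = np.sort(rng.normal(loc=0.0, scale=1.0, size=n))   # samples from mu = N(0, 1)
y = np.sort(rng.normal(loc=2.0, scale=1.5, size=n))   # samples from nu = N(2, 1.5^2)

w1_hat = np.mean(np.abs(x - y))          # estimate of W_1(mu, nu)
w2_hat = np.sqrt(np.mean((x - y) ** 2))  # estimate of W_2(mu, nu)
w2_exact = np.sqrt((0.0 - 2.0) ** 2 + (1.0 - 1.5) ** 2)  # Gaussian closed form

print(f"W1 estimate: {w1_hat:.4f}")
print(f"W2 estimate: {w2_hat:.4f} (closed form: {w2_exact:.4f})")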

A.2 Replicator Dynamics

The replicator dynamics originally arises in the study of evolutionary game theory (Schuster and Sigmund, 1983). For a fitness function f(a,x)f(a,x), where xx is a probability distribution over a set of types (or actions) aa, the replicator dynamics is given by the differential equation

ddtxt(a)=xt(a)[f(a,xt)ϕ(xt)],\displaystyle\frac{{\mathrm{d}}}{{\mathrm{d}}t}x_{t}(a)=x_{t}(a)[f(a,x_{t})-\phi(x_{t})],

where ϕ(x)=x(a)f(a,x)da\phi(x)=\int x(a)f(a,x){\mathrm{d}}a is the average fitness. As for the PPO update in (3.7), for a fixed ss, letting x(a)=π(a|s)x(a)=\pi(a\,|\,s) and f(a,x)=Qπ(s,a)f(a,x)=Q^{\pi}(s,a), we see that (3.7) corresponds to a replicator dynamics when Qt=QπtQ_{t}=Q^{\pi_{t}}. Note that in the simultaneous update of the critic and the actor, we do not have access to the true action value function QπQ^{\pi}. Thus, we use the estimate QtQ_{t} calculated by the critic step to guide the actor in the PPO update, which takes the form of a replicator dynamics in the continuous-time limit.
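To make the correspondence concrete, the following minimal Python sketch (our own illustration; the fitness vector, stepsize, and horizon are assumptions made only for this example) integrates the replicator dynamics on a finite action set with a fixed fitness vector by forward Euler, which mirrors the PPO update (3.7) with a frozen critic estimate: the probability mass concentrates on the action with the largest fitness while the iterate remains a distribution.

import numpy as np

# A minimal sketch (not from the paper): forward-Euler integration of the
# replicator dynamics d/dt x_t(a) = x_t(a) * (f(a) - <x_t, f>) on a finite
# action set with a fixed fitness vector f.
f = np.array([1.0, 0.5, -0.2, 0.3])   # fixed fitness values (e.g., action values)
x = np.full(4, 0.25)                  # initial distribution x_0: uniform
dt = 0.01

for _ in range(2000):
    avg = x @ f                       # average fitness phi(x_t) = <x_t, f>
    x = x + dt * x * (f - avg)        # Euler step of the replicator ODE
    x = np.clip(x, 0.0, None)
    x /= x.sum()                      # guard against numerical drift

print(np.round(x, 4))                 # mass concentrates on the best action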

Appendix B Proofs of Main Results

In this section, we give detailed proofs of the theorems and present a detailed statement of Lemma 4.4.

B.1 Proof of Theorem 4.1

Proof.

Following from the performance difference lemma (Kakade and Langford, 2002), we have

J(π)J(πt)\displaystyle J(\pi^{*})-J(\pi_{t}) =(1γ)1𝔼s𝒟0π[Aπt(s,),π(|s)πt(|s)𝒜],\displaystyle=(1-\gamma)^{-1}\cdot\mathbb{E}_{s\sim\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}}\bigl{[}\langle A^{\pi_{t}}(s,\cdot),\pi^{*}(\cdot{\,|\,}s)-\pi_{t}(\cdot{\,|\,}s)\rangle_{\mathcal{A}}\bigr{]},

where 𝒟0π\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}} is the visitation measure induced by π\pi^{*} from 𝒟0\mathcal{D}_{0} and AπtA^{\pi_{t}} is the advantage function. Note that the continuous PPO dynamics in (3.7) can be equivalently written as tlogπt=At\partial_{t}\log\pi_{t}=A_{t}. Thus, we have

(1γ)(J(π)J(πt))\displaystyle(1-\gamma)\cdot\bigl{(}J(\pi^{*})-J(\pi_{t})\bigr{)} =𝔼s𝒟0π[ddtlogπt(|s)+Aπt(s,)At(s,),π(|s)πt(|s)𝒜]\displaystyle=\mathbb{E}_{s\sim\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}}\Bigl{[}\big{\langle}\frac{{\mathrm{d}}}{{\mathrm{d}}t}\log\pi_{t}(\cdot{\,|\,}s)+A^{\pi_{t}}(s,\cdot)-A_{t}(s,\cdot),\pi^{*}(\cdot{\,|\,}s)-\pi_{t}(\cdot{\,|\,}s)\big{\rangle}_{\mathcal{A}}\Bigr{]}
=𝔼s𝒟0π[ddtlogπt(|s),π(|s)πt(|s)𝒜]\displaystyle=\mathbb{E}_{s\sim\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}}\Bigl{[}\big{\langle}\frac{{\mathrm{d}}}{{\mathrm{d}}t}\log\pi_{t}(\cdot{\,|\,}s),\pi^{*}(\cdot{\,|\,}s)-\pi_{t}(\cdot{\,|\,}s)\big{\rangle}_{\mathcal{A}}\Bigr{]}
+𝔼s𝒟0π[Qπt(s,)Qt(s,),π(|s)πt(|s)𝒜].\displaystyle\quad+\mathbb{E}_{s\sim\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}}\Bigl{[}\big{\langle}Q^{\pi_{t}}(s,\cdot)-Q_{t}(s,\cdot),\pi^{*}(\cdot{\,|\,}s)-\pi_{t}(\cdot{\,|\,}s)\big{\rangle}_{\mathcal{A}}\Bigr{]}. (B.1)

Here the second equality holds since the advantage gap AπtAtA^{\pi_{t}}-A_{t} and the action-value gap QπtQtQ^{\pi_{t}}-Q_{t} differ only by baseline terms that do not depend on the action, namely Vπt(s)V^{\pi_{t}}(s) and 𝒜Qt(s,a)πt(da|s)\int_{\mathcal{A}}Q_{t}(s,a^{\prime})\pi_{t}({\mathrm{d}}a^{\prime}{\,|\,}s), while π(|s)πt(|s)\pi^{*}(\cdot{\,|\,}s)-\pi_{t}(\cdot{\,|\,}s) integrates to zero over 𝒜\mathcal{A}. For the first term on the right-hand side of (B.1), it holds that,

ddtlogπt(|s),π(|s)πt(|s)𝒜\displaystyle\big{\langle}\frac{{\mathrm{d}}}{{\mathrm{d}}t}\log\pi_{t}(\cdot{\,|\,}s),\pi^{*}(\cdot{\,|\,}s)-\pi_{t}(\cdot{\,|\,}s)\big{\rangle}_{\mathcal{A}}
=ddtlogπt(|s),π(|s)𝒜ddtlogπt(|s),πt(|s)𝒜\displaystyle\quad=\big{\langle}\frac{{\mathrm{d}}}{{\mathrm{d}}t}\log\pi_{t}(\cdot{\,|\,}s),\pi^{*}(\cdot{\,|\,}s)\big{\rangle}_{\mathcal{A}}-\big{\langle}\frac{{\mathrm{d}}}{{\mathrm{d}}t}\log\pi_{t}(\cdot{\,|\,}s),\pi_{t}(\cdot{\,|\,}s)\big{\rangle}_{\mathcal{A}} (B.2)
=ddtKL(π(|s)πt(|s)),\displaystyle\quad=-\frac{{\mathrm{d}}}{{\mathrm{d}}t}{\mathrm{KL}}\bigl{(}\pi^{*}(\cdot{\,|\,}s)\,\|\,\pi_{t}(\cdot{\,|\,}s)\bigr{)},

where the last equality holds by noting that tlogπt(|s),πt(|s)𝒜=t𝒜πt(da|s)=0\big{\langle}\partial_{t}\log\pi_{t}(\cdot{\,|\,}s),\pi_{t}(\cdot{\,|\,}s)\big{\rangle}_{\mathcal{A}}=\partial_{t}\int_{\mathcal{A}}\pi_{t}({\mathrm{d}}a{\,|\,}s)=0. For the second term on the right-hand side of (B.1), by the Cauchy-Schwarz inequality, we have

𝔼s𝒟0π[Qπt(s,)Qt(s,),π(|s)𝒜]\displaystyle\mathbb{E}_{s\sim\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}}\Bigl{[}\big{\langle}Q^{\pi_{t}}(s,\cdot)-Q_{t}(s,\cdot),\pi^{*}(\cdot{\,|\,}s)\big{\rangle}_{\mathcal{A}}\Bigr{]} =𝔼(s,a)ϕ~πt[(Qπt(s,a)Qt(s,a))π(a|s)𝒟0π(s)ϕ~πt(x)]\displaystyle=\mathbb{E}_{(s,a)\sim\widetilde{\phi}^{\pi_{t}}}\Bigl{[}\bigl{(}Q^{\pi_{t}}(s,a)-Q_{t}(s,a)\bigr{)}\pi^{*}(a|s)\frac{\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}(s)}{\widetilde{\phi}^{\pi_{t}}(x)}\Bigr{]}
~𝒟0πϕ~πt2,ϕ~πtQπtQt2,ϕ~πt,\displaystyle\leq\bigg{\|}\frac{{\widetilde{\mathcal{E}}}^{\pi^{*}}_{\mathcal{D}_{0}}}{\widetilde{\phi}^{\pi_{t}}}\bigg{\|}_{2,\widetilde{\phi}^{\pi_{t}}}\cdot\big{\|}Q^{\pi_{t}}-Q_{t}\big{\|}_{2,\widetilde{\phi}^{\pi_{t}}}, (B.3)
𝔼s𝒟0π[Qπt(s,)Qt(s,),πt(|s)𝒜]\displaystyle\mathbb{E}_{s\sim\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}}\Bigl{[}\big{\langle}Q^{\pi_{t}}(s,\cdot)-Q_{t}(s,\cdot),\pi_{t}(\cdot{\,|\,}s)\big{\rangle}_{\mathcal{A}}\Bigr{]} =𝔼xϕ~πt[(Qπt(s,a)Qt(s,a))πt(a|s)𝒟0π(s)ϕ~πt(x)]\displaystyle=\mathbb{E}_{x\sim\widetilde{\phi}^{\pi_{t}}}\Bigl{[}\bigl{(}Q^{\pi_{t}}(s,a)-Q_{t}(s,a)\bigr{)}\pi_{t}(a{\,|\,}s)\frac{\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}(s)}{\widetilde{\phi}^{\pi_{t}}(x)}\Bigr{]}
𝒟0ππtϕ~πt2,ϕ~πtQπtQt2,ϕ~πt.\displaystyle\leq\bigg{\|}\frac{\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}\otimes\pi_{t}}{\widetilde{\phi}^{\pi_{t}}}\bigg{\|}_{2,\widetilde{\phi}^{\pi_{t}}}\cdot\big{\|}Q^{\pi_{t}}-Q_{t}\big{\|}_{2,\widetilde{\phi}^{\pi_{t}}}. (B.4)

Plugging (B.2), (B.3), and (B.4) into (B.1), we have

J(π)J(πt)\displaystyle J(\pi^{*})-J(\pi_{t}) ddt𝔼s𝒟0π[KL(π(|s)πt(|s))]\displaystyle\leq-\frac{{\mathrm{d}}}{{\mathrm{d}}t}\mathbb{E}_{s\sim\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}}\Bigl{[}{\mathrm{KL}}\bigl{(}\pi^{*}(\cdot{\,|\,}s)\,\|\,\pi_{t}(\cdot{\,|\,}s)\bigr{)}\Bigr{]}
+(~𝒟0πϕ~πt2,ϕ~πt+𝒟0ππtϕ~πt2,ϕ~πt)QtQπt2,ϕ~πt.\displaystyle\quad+\Big{(}\bigg{\|}\frac{{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}}{\widetilde{\phi}^{\pi_{t}}}\bigg{\|}_{2,\widetilde{\phi}^{\pi_{t}}}+\bigg{\|}\frac{\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}\otimes\pi_{t}}{\widetilde{\phi}^{\pi_{t}}}\bigg{\|}_{2,\widetilde{\phi}^{\pi_{t}}}\Big{)}\cdot\|Q_{t}-Q^{\pi_{t}}\|_{2,\widetilde{\phi}^{\pi_{t}}}. (B.5)

By further letting ϕ~πt=12ϕ~0+12ϕ0πt\widetilde{\phi}^{\pi_{t}}=\frac{1}{2}\widetilde{\phi}_{0}+\frac{1}{2}\phi_{0}\otimes\pi_{t}, it holds for the concentrability coefficient ~𝒟0πϕ~πt2,ϕ~πt\bigg{\|}\frac{{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}}{\widetilde{\phi}^{\pi_{t}}}\bigg{\|}_{2,\widetilde{\phi}^{\pi_{t}}} in (B.5) that

~𝒟0πϕ~πt2,ϕ~πt2~𝒟0πϕ~02,ϕ~πt2~𝒟0πϕ~0.\displaystyle\bigg{\|}\frac{{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}}{\widetilde{\phi}^{\pi_{t}}}\bigg{\|}_{2,\widetilde{\phi}^{\pi_{t}}}\leq\bigg{\|}\frac{2{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}}{\widetilde{\phi}_{0}}\bigg{\|}_{2,\widetilde{\phi}^{\pi_{t}}}\leq 2\bigg{\|}\frac{{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}}{\widetilde{\phi}_{0}}\bigg{\|}_{\infty}. (B.6)

By further letting ϕ0(s)=𝒜ϕ~0(s,da)\phi_{0}(s)=\int_{\mathcal{A}}\widetilde{\phi}_{0}(s,{\mathrm{d}}a), it holds for the concentrability coefficient 𝒟0ππtϕ~πt2,ϕ~πt\bigg{\|}\frac{\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}\otimes\pi_{t}}{\widetilde{\phi}^{\pi_{t}}}\bigg{\|}_{2,\widetilde{\phi}^{\pi_{t}}} in (B.5) that

𝒟0ππtϕ~πt2,ϕ~πt2𝒟0πϕ02,ϕ~πt2~𝒟0π(x)ϕ~0(x)ϕ~0(s,da)ϕ0(s)2~𝒟0πϕ~0.\displaystyle\bigg{\|}\frac{\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}\otimes\pi_{t}}{\widetilde{\phi}^{\pi_{t}}}\bigg{\|}_{2,\widetilde{\phi}^{\pi_{t}}}\leq\bigg{\|}\frac{2\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}}{\phi_{0}}\bigg{\|}_{2,\widetilde{\phi}^{\pi_{t}}}\leq 2\Bigg{\|}\frac{\int\frac{{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}(x)}{\widetilde{\phi}_{0}(x)}\widetilde{\phi}_{0}(s,{\mathrm{d}}a)}{\phi_{0}(s)}\Bigg{\|}_{\infty}\leq 2\bigg{\|}\frac{{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}}{\widetilde{\phi}_{0}}\bigg{\|}_{\infty}. (B.7)

Plugging (B.6) and (B.7) into (B.5), integrating both sides of (B.5) over t[0,T]t\in[0,T], dividing by TT, and dropping the resulting nonpositive KL term at time TT, we have

1T0T(J(π)J(πt))dt1T𝔼s𝒟0π[KL(π(|s)π0(|s))]+4κT0TQtQπt2,ϕ~πtdt,\displaystyle\frac{1}{T}\int_{0}^{T}\bigl{(}J(\pi^{*})-J(\pi_{t})\bigr{)}{\mathrm{d}}t\leq\frac{1}{T}\mathbb{E}_{s\sim\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}}\Bigl{[}{\mathrm{KL}}\bigl{(}\pi^{*}(\cdot{\,|\,}s)\,\|\,\pi_{0}(\cdot{\,|\,}s)\bigr{)}\Bigr{]}+\frac{4\kappa}{T}\cdot\int_{0}^{T}\|Q_{t}-Q^{\pi_{t}}\|_{2,\widetilde{\phi}^{\pi_{t}}}{\mathrm{d}}t,

where κ=~𝒟0π/ϕ~0\kappa=\big{\|}{{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}}/{\widetilde{\phi}_{0}}\big{\|}_{\infty}. Thus, we complete the proof of Theorem 4.1. ∎
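As an illustrative sanity check of this bound (the example is ours and is not part of the proof), consider a single-state problem with a fixed reward vector rr, objective J(π)=π,rJ(\pi)=\langle\pi,r\rangle, and an exact critic, so that the second term of Theorem 4.1 vanishes. Along the replicator (continuous-time PPO) flow we then have ddtKL(ππt)=(J(π)J(πt))\frac{{\mathrm{d}}}{{\mathrm{d}}t}{\mathrm{KL}}(\pi^{*}\,\|\,\pi_{t})=-(J(\pi^{*})-J(\pi_{t})), so the time-averaged optimality gap is at most KL(ππ0)/T{\mathrm{KL}}(\pi^{*}\,\|\,\pi_{0})/T. The Python sketch below verifies this numerically; the reward vector, stepsize, and horizon are assumptions chosen for illustration.

import numpy as np

# A minimal sketch (not from the paper): with an exact critic in a single-state
# problem, the average optimality gap of the replicator / continuous PPO flow
# stays below KL(pi^* || pi_0) / T, matching the first term of Theorem 4.1.
r = np.array([1.0, 0.4, 0.1])              # assumed reward (action-value) vector
pi = np.full(3, 1.0 / 3.0)                 # pi_0: uniform initial policy
pi_star = np.zeros(3)
pi_star[np.argmax(r)] = 1.0                # optimal policy: point mass on the best action

dt, T = 1e-3, 5.0
gap_integral = 0.0
for _ in range(int(T / dt)):
    gap_integral += (r @ pi_star - r @ pi) * dt
    pi = pi + dt * pi * (r - pi @ r)       # replicator step with exact advantage
    pi = np.clip(pi, 1e-12, None)
    pi /= pi.sum()

kl0 = -np.log(1.0 / 3.0)                   # KL(pi^* || pi_0) for uniform pi_0
print(gap_integral / T, "<=", kl0 / T)     # average gap is below KL(pi^* || pi_0) / T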

B.2 Detailed Statement of Lemma 4.4

We give a detailed version of Lemma 4.4 as follows.

Lemma B.1 (Regularity of Representation of QπQ^{\pi}).

Suppose that Assumptions 4.2 and 4.3 hold. For any policy π\pi, there exists a probability measure ρπ𝒫2(D)\rho_{\pi}\in\mathscr{P}_{2}(\mathbb{R}^{D}) for the representation of QπQ^{\pi} with the following properties.

  • (i)

    For function Q(s,a;ρπ)Q(s,a;\rho_{\pi}) defined by (3.4) with ρ=ρπ\rho=\rho_{\pi} and the action value function Qπ(s,a)Q^{\pi}(s,a) defined by (2.2), we have Q(s,a;ρπ)=Qπ(s,a)Q(s,a;\rho_{\pi})=Q^{\pi}(s,a) for any (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}.

  • (ii)

    For gg defined in (3.6), we have g(;ρπ)=0g(\cdot;\rho_{\pi})=0 for any policy π\pi.

  • (iii)

    By letting Bβ2(Br+γ(1γ)1BrM1,φ)B_{\beta}\geq 2(B_{r}+\gamma(1-\gamma)^{-1}B_{r}M_{1,\varphi}) for the neural network defined in (4.2) and ρ0N(0,ID)\rho_{0}\sim N(0,I_{D}) for the initial distribution, we have W~2(ρπ,ρ0)<D¯\widetilde{W}_{2}(\rho_{\pi},\rho_{0})<\bar{D} for any policy π\pi, where we define W~2(,)=αW2(,)\widetilde{W}_{2}(\cdot,\cdot)=\alpha W_{2}(\cdot,\cdot) as the scaled W2W_{2} metric. Here the constant D¯\bar{D} depends on the discount factor γ\gamma and the absolute constants L0,β,L1,β,lβ,Br,Mμ,Mψ,M1,φ,M2,φL_{0,\beta},L_{1,\beta},l_{\beta},B_{r},M_{\mu},M_{\psi},M_{1,\varphi},M_{2,\varphi} defined in Assumptions 4.2 and 4.3.

  • (iv)

    For any two policies π1\pi_{1} and π2\pi_{2}, it holds that,

    W2(ρπ1,ρπ2)α12sups𝒮𝔼ssπ1[π1(|s)π2(|s)1],W_{2}(\rho_{\pi_{1}},\rho_{\pi_{2}})\leq\alpha^{-\frac{1}{2}}\mathcal{B}\cdot\sup_{s\in{\mathcal{S}}}\mathbb{E}_{s^{\prime}\sim\mathcal{E}_{s}^{\pi_{1}}}\Big{[}\big{\|}\pi_{1}(\cdot{\,|\,}s^{\prime})-\pi_{2}(\cdot{\,|\,}s^{\prime})\big{\|}_{1}\Big{]},

    where \mathcal{B} depends on the discount factor γ\gamma, the scaling parameter BβB_{\beta} in (4.2), and the absolute constants lβ,Br,M1,φ,𝒢l_{\beta},B_{r},M_{1,\varphi},\mathcal{G} defined in Assumptions 4.2 and 4.3.

Proof.

See §C.2 for a detailed proof. ∎

B.3 Proof of Theorem 4.5

Proof.

For notational simplicity, we let x=(s,a)x=(s,a). By Property (i) of Lemma B.1, it holds that QtQπt2,ϕ~πt2=QtQ(;ρπt)2,ϕ~πt2\|Q_{t}-Q^{\pi_{t}}\|_{2,\widetilde{\phi}^{\pi_{t}}}^{2}=\|Q_{t}-Q(\cdot;\rho_{\pi_{t}})\|_{2,\widetilde{\phi}^{\pi_{t}}}^{2}, where Qt=Q(;ρt)Q_{t}=Q(\cdot;\rho_{t}). This prompts us to study the W2W_{2} distance between ρt\rho_{t} and ρπt\rho_{\pi_{t}}. By the first variation formula in Lemma E.2, it holds that

ddtW22(ρt,ρπt)2=ρ˙t,α~t0˙ρt,W2ρ˙πt,β~t0˙ρπt,W2.\displaystyle\frac{{\mathrm{d}}}{{\mathrm{d}}t}\frac{W_{2}^{2}(\rho_{t},\rho_{\pi_{t}})}{2}=-\langle\dot{\rho}_{t},\dot{\widetilde{\alpha}_{t}^{0}}\rangle_{\rho_{t},W_{2}}-\langle\dot{\rho}_{\pi_{t}},\dot{\widetilde{\beta}_{t}^{0}}\rangle_{\rho_{\pi_{t}},W_{2}}. (B.8)

Here α~t[0,1]\widetilde{\alpha}_{t}^{[0,1]} is the geodesic connecting ρt\rho_{t} and ρπt\rho_{\pi_{t}}, and β~t[0,1]\widetilde{\beta}_{t}^{[0,1]} is its time reversal, i.e., β~ts=α~t1s\widetilde{\beta}_{t}^{s}=\widetilde{\alpha}_{t}^{1-s}. Besides, we denote by α~ts˙=sα~ts\dot{\widetilde{\alpha}_{t}^{s}}=\partial_{s}\widetilde{\alpha}_{t}^{s} and β~ts˙=sβ~ts\dot{\widetilde{\beta}_{t}^{s}}=\partial_{s}\widetilde{\beta}_{t}^{s} the derivatives of the geodesics α~ts\widetilde{\alpha}_{t}^{s} and β~ts\widetilde{\beta}_{t}^{s} with respect to ss, respectively. We denote by vsv_{s} the corresponding vector field at α~ts\widetilde{\alpha}_{t}^{s}, which satisfies sα~ts=div(α~tsvs)\partial_{s}\widetilde{\alpha}_{t}^{s}=-\mathop{\mathrm{div}}(\widetilde{\alpha}_{t}^{s}v_{s}). For the first term on the right-hand side of (B.8), it holds that

ρ˙t,α~t0˙ρt,W2\displaystyle-\langle\dot{\rho}_{t},\dot{\widetilde{\alpha}_{t}^{0}}\rangle_{\rho_{t},W_{2}} =ηg(;ρt,πt),v0ρt\displaystyle=-\eta\cdot\big{\langle}g(\cdot;\rho_{t},\pi_{t}),v_{0}\big{\rangle}_{\rho_{t}}
=η01sg(;α~ts,πt),vsα~tsdsηg(;ρπt,πt),v1ρπt\displaystyle=\eta\cdot\int_{0}^{1}\partial_{s}\big{\langle}g(\cdot;\widetilde{\alpha}_{t}^{s},\pi_{t}),v_{s}\big{\rangle}_{\widetilde{\alpha}_{t}^{s}}{\mathrm{d}}s-\eta\cdot\big{\langle}g(\cdot;\rho_{\pi_{t}},\pi_{t}),v_{1}\big{\rangle}_{\rho_{\pi_{t}}}
=η01sg(;α~ts,πt),vsα~tsds+η01g(θ;α~ts,πt)s(vsα~ts)(θ)dθds,\displaystyle=\eta\cdot\int_{0}^{1}\big{\langle}\partial_{s}g(\cdot;\widetilde{\alpha}_{t}^{s},\pi_{t}),v_{s}\big{\rangle}_{\widetilde{\alpha}_{t}^{s}}{\mathrm{d}}s+\eta\cdot\int_{0}^{1}\int g(\theta;\widetilde{\alpha}_{t}^{s},\pi_{t})\cdot\partial_{s}(v_{s}\widetilde{\alpha}_{t}^{s})(\theta){\mathrm{d}}\theta{\mathrm{d}}s, (B.9)

where the first equality follows from (A.2), the second equality follows from the fundamental theorem of calculus, and the third equality follows from g(;ρπt,πt)=0g(\cdot;\rho_{\pi_{t}},\pi_{t})=0, which holds by Property (ii) of Lemma B.1. For the first term on the right-hand side of (B.9), we have

η01sg(;α~ts,πt),vsα~tsds\displaystyle\eta\cdot\int_{0}^{1}\big{\langle}\partial_{s}g(\cdot;\widetilde{\alpha}_{t}^{s},\pi_{t}),v_{s}\big{\rangle}_{\widetilde{\alpha}_{t}^{s}}{\mathrm{d}}s
=α1η01𝔼Φ~πtπt[s(Q(x;α~ts)γQ(x;α~ts))σ(x;θ)],(vsα~ts)(θ)dθds\displaystyle\quad=-\alpha^{-1}\eta\cdot\int_{0}^{1}\int\bigg{\langle}\mathbb{E}_{\widetilde{\Phi}^{\pi_{t}}}^{\pi_{t}}\Bigl{[}\partial_{s}\bigl{(}Q(x;\widetilde{\alpha}_{t}^{s})-\gamma\cdot Q(x^{\prime};\widetilde{\alpha}_{t}^{s})\bigr{)}\nabla\sigma(x;\theta)\Bigr{]},(v_{s}\widetilde{\alpha}_{t}^{s})(\theta)\bigg{\rangle}{\mathrm{d}}\theta{\mathrm{d}}s
=α1η01𝔼Φ~πtπt[s(Q(x;α~ts)γQ(x;α~ts))σ(x;θ)],div(vsα~ts)(θ)dθds,\displaystyle\quad=\alpha^{-1}\eta\cdot\int_{0}^{1}\int\bigg{\langle}\mathbb{E}_{\widetilde{\Phi}^{\pi_{t}}}^{\pi_{t}}\Bigl{[}\partial_{s}\bigl{(}Q(x;\widetilde{\alpha}_{t}^{s})-\gamma\cdot Q(x^{\prime};\widetilde{\alpha}_{t}^{s})\bigr{)}\sigma(x;\theta)\Bigr{]},\mathop{\mathrm{div}}(v_{s}\widetilde{\alpha}_{t}^{s})(\theta)\bigg{\rangle}{\mathrm{d}}\theta{\mathrm{d}}s, (B.10)

where the first equality holds by the definition of gg in (3.6) and the last equality follows from Stokes’ formula. Note that we have div(vsα~ts)=sα~ts\mathop{\mathrm{div}}(v_{s}\widetilde{\alpha}_{t}^{s})=-\partial_{s}\widetilde{\alpha}_{t}^{s} by the definition of the vector field vsv_{s}. Thus, it holds for (B.10) that

η01sg(;α~ts,πt),vsα~tsds\displaystyle\eta\cdot\int_{0}^{1}\big{\langle}\partial_{s}g(\cdot;\widetilde{\alpha}_{t}^{s},\pi_{t}),v_{s}\big{\rangle}_{\widetilde{\alpha}_{t}^{s}}{\mathrm{d}}s
=α1η01𝔼Φ~πtπt[s(Q(x;α~ts)γQ(x;α~ts))σ(x;θ)],sα~ts(θ)dθds\displaystyle\quad=-\alpha^{-1}\eta\cdot\int_{0}^{1}\int\bigg{\langle}\mathbb{E}_{\widetilde{\Phi}^{\pi_{t}}}^{\pi_{t}}\Bigl{[}\partial_{s}\bigl{(}Q(x;\widetilde{\alpha}_{t}^{s})-\gamma\cdot Q(x^{\prime};\widetilde{\alpha}_{t}^{s})\bigr{)}\sigma(x;\theta)\Bigr{]},\partial_{s}\widetilde{\alpha}_{t}^{s}(\theta)\bigg{\rangle}{\mathrm{d}}\theta{\mathrm{d}}s
=α2η01𝔼Φ~πtπt[s(Q(x;α~ts)γQ(x;α~ts))sQ(x;α~ts)]ds,\displaystyle\quad=-\alpha^{-2}\eta\cdot\int_{0}^{1}\mathbb{E}_{\widetilde{\Phi}^{\pi_{t}}}^{\pi_{t}}\Bigl{[}\partial_{s}\bigl{(}Q(x;\widetilde{\alpha}_{t}^{s})-\gamma\cdot Q(x^{\prime};\widetilde{\alpha}_{t}^{s})\bigr{)}\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s})\Bigr{]}{\mathrm{d}}s, (B.11)

where the last equality follows from the definition of QQ in (3.4). For fixed ss and tt, we let f(𝒟~)=(𝔼x~𝒟~πt[(sQ(x;α~ts))2])1/2f(\widetilde{\mathcal{D}})=\big{(}\mathbb{E}_{x\sim{\widetilde{\mathcal{E}}}_{\widetilde{\mathcal{D}}}^{\pi_{t}}}\bigl{[}(\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s}))^{2}\bigr{]}\big{)}^{1/2} for any distribution 𝒟~\widetilde{\mathcal{D}} over 𝒮×𝒜{\mathcal{S}}\times\mathcal{A}. Recall that for the weighting distribution Φ~πt\widetilde{\Phi}^{\pi_{t}}, we set Φ~πt=~ϕ~πtπt\widetilde{\Phi}^{\pi_{t}}={\widetilde{\mathcal{E}}}_{\widetilde{\phi}^{\pi_{t}}}^{\pi_{t}}. Hence, for the integrand of (B.11), we have

𝔼Φ~πtπt[s(Q(x;α~ts)γQ(x;α~ts))sQ(x;α~ts)]\displaystyle-\mathbb{E}_{\widetilde{\Phi}^{\pi_{t}}}^{\pi_{t}}\Bigl{[}\partial_{s}\bigl{(}Q(x;\widetilde{\alpha}_{t}^{s})-\gamma\cdot Q(x^{\prime};\widetilde{\alpha}_{t}^{s})\bigr{)}\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s})\Bigr{]}
=f(ϕ~πt)2+γsQ(x;α~ts)sQ(x;α~ts)P~πt(x|x)~ϕ~πtπt(x)dxdx\displaystyle\quad=-f(\widetilde{\phi}^{\pi_{t}})^{2}+\gamma\cdot\int\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s})\partial_{s}Q(x^{\prime};\widetilde{\alpha}_{t}^{s})\widetilde{P}^{\pi_{t}}(x^{\prime}{\,|\,}x){\widetilde{\mathcal{E}}}_{\widetilde{\phi}^{\pi_{t}}}^{\pi_{t}}(x){\mathrm{d}}x^{\prime}{\mathrm{d}}x
f(ϕ~πt)2+γ(sQ(x;α~ts))2~ϕ~πtπt(dx)(sQ(x;α~ts))2P~πt(dx|x)~ϕ~πtπt(dx),\displaystyle\quad\leq-f(\widetilde{\phi}^{\pi_{t}})^{2}+\gamma\cdot\sqrt{\int\big{(}\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s})\big{)}^{2}{\widetilde{\mathcal{E}}}_{\widetilde{\phi}^{\pi_{t}}}^{\pi_{t}}({\mathrm{d}}x)}\cdot\sqrt{\int\big{(}\partial_{s}Q(x^{\prime};\widetilde{\alpha}_{t}^{s})\big{)}^{2}\widetilde{P}^{\pi_{t}}({\mathrm{d}}x^{\prime}{\,|\,}x){\widetilde{\mathcal{E}}}_{\widetilde{\phi}^{\pi_{t}}}^{\pi_{t}}({\mathrm{d}}x)}, (B.12)

where the equality follows from the definition of 𝔼Φ~πtπt\mathbb{E}_{\widetilde{\Phi}^{\pi_{t}}}^{\pi_{t}} in (3.6) and the inequality follows from the Cauchy-Schwarz inequality. We define the operator 𝒯π:𝒫(𝒮×𝒜)𝒫(𝒮×𝒜){\mathcal{T}}^{\pi}:\mathscr{P}({\mathcal{S}}\times\mathcal{A})\rightarrow\mathscr{P}({\mathcal{S}}\times\mathcal{A}) by (𝒯π𝒟~)(x)=P~π(x|x)𝒟~(dx)({\mathcal{T}}^{\pi}{\widetilde{\mathcal{D}}})(x^{\prime})=\int\widetilde{P}^{\pi}(x^{\prime}{\,|\,}x){\widetilde{\mathcal{D}}}({\mathrm{d}}x). We then rewrite (B.12) as

𝔼Φ~πtπt[s(Q(x;α~ts)γQ(x;α~ts))sQ(x;α~ts)]f(ϕ~πt)2+γf(ϕ~πt)f(𝒯πtϕ~πt).\displaystyle-\mathbb{E}_{\widetilde{\Phi}^{\pi_{t}}}^{\pi_{t}}\bigl{[}\partial_{s}\bigl{(}Q(x;\widetilde{\alpha}_{t}^{s})-\gamma\cdot Q(x^{\prime};\widetilde{\alpha}_{t}^{s})\bigr{)}\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s})\bigr{]}\leq-f(\widetilde{\phi}^{\pi_{t}})^{2}+\gamma\cdot f(\widetilde{\phi}^{\pi_{t}})\cdot f({\mathcal{T}}^{\pi_{t}}\widetilde{\phi}^{\pi_{t}}). (B.13)

By the definition of 𝒯πt{\mathcal{T}}^{\pi_{t}} and the definition of visitation measure in (2.1), it holds that ~ϕ~πtπtγ~𝒯πtϕ~πtπt=(1γ)ϕ~πt{\widetilde{\mathcal{E}}}_{\widetilde{\phi}^{\pi_{t}}}^{\pi_{t}}-\gamma{\widetilde{\mathcal{E}}}_{{\mathcal{T}}^{\pi_{t}}\widetilde{\phi}^{\pi_{t}}}^{\pi_{t}}=(1-\gamma)\widetilde{\phi}^{\pi_{t}}. Hence, we have f(ϕ~πt)2γf2(𝒯πtϕ~πt)=(1γ)𝔼ϕ~πt[(sQ(x;α~ts))2]f(\widetilde{\phi}^{\pi_{t}})^{2}-\gamma f^{2}({\mathcal{T}}^{\pi_{t}}\widetilde{\phi}^{\pi_{t}})=(1-\gamma)\mathbb{E}_{\widetilde{\phi}^{\pi_{t}}}\big{[}\big{(}\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s})\big{)}^{2}\big{]} and it holds for (B.13) that

f(ϕ~πt)2+γf(ϕ~πt)f(𝒯πtϕ~πt)\displaystyle-f(\widetilde{\phi}^{\pi_{t}})^{2}+\gamma f(\widetilde{\phi}^{\pi_{t}})\cdot f({\mathcal{T}}^{\pi_{t}}\widetilde{\phi}^{\pi_{t}}) =f(ϕ~πt)f(ϕ~πt)+γf(𝒯πtϕ~πt)(f(ϕ~πt)2γ2f2(𝒯πtϕ~πt))\displaystyle=-\frac{f(\widetilde{\phi}^{\pi_{t}})}{f(\widetilde{\phi}^{\pi_{t}})+\gamma f({\mathcal{T}}^{\pi_{t}}\widetilde{\phi}^{\pi_{t}})}\cdot\big{(}f(\widetilde{\phi}^{\pi_{t}})^{2}-\gamma^{2}f^{2}({\mathcal{T}}^{\pi_{t}}\widetilde{\phi}^{\pi_{t}})\big{)}
f(ϕ~πt)2γ(f(ϕ~πt)2(1γ)𝔼ϕ~πt[(sQ(x;α~ts))2])1+γ\displaystyle\leq-\frac{f(\widetilde{\phi}^{\pi_{t}})^{2}-\gamma\bigg{(}f(\widetilde{\phi}^{\pi_{t}})^{2}-(1-\gamma)\mathbb{E}_{\widetilde{\phi}^{\pi_{t}}}\Big{[}\big{(}\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s})\big{)}^{2}\Big{]}\bigg{)}}{1+\sqrt{\gamma}}
(1γ)𝔼ϕ~πt[(sQ(x;α~ts))2],\displaystyle\leq-(1-\sqrt{\gamma})\cdot\mathbb{E}_{\widetilde{\phi}^{\pi_{t}}}\Big{[}\big{(}\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s})\big{)}^{2}\Big{]}, (B.14)

where the first inequality holds by noting that f(ϕ~πt)γf(𝒯πtϕ~πt)f(\widetilde{\phi}^{\pi_{t}})\geq\sqrt{\gamma}f({\mathcal{T}}^{\pi_{t}}\widetilde{\phi}^{\pi_{t}}) and the last inequality holds by noting that f(ϕ~πt)2(1γ)𝔼ϕ~πt[(sQ(x;α~ts))2]f(\widetilde{\phi}^{\pi_{t}})^{2}\geq(1-\gamma)\mathbb{E}_{\widetilde{\phi}^{\pi_{t}}}\big{[}\big{(}\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s})\big{)}^{2}\big{]}. Combining (B.11), (B.13), and (B.14), it holds for the first term on the right-hand side of (B.9) that

η01sg(;α~ts,πt),vsα~tsds\displaystyle\eta\cdot\int_{0}^{1}\big{\langle}\partial_{s}g(\cdot;\widetilde{\alpha}_{t}^{s},\pi_{t}),v_{s}\big{\rangle}_{\widetilde{\alpha}_{t}^{s}}{\mathrm{d}}s α2η(1γ)01𝔼ϕ~πt[(sQ(x;α~ts))2]ds\displaystyle\leq-\alpha^{-2}\eta(1-\sqrt{\gamma})\cdot\int_{0}^{1}\mathbb{E}_{\widetilde{\phi}^{\pi_{t}}}\Big{[}\big{(}\partial_{s}Q(x;\widetilde{\alpha}_{t}^{s})\big{)}^{2}\Big{]}{\mathrm{d}}s
α2η(1γ)Q(x;ρπt)Q(x;ρt)2,ϕ~πt2,\displaystyle\leq-\alpha^{-2}\eta(1-\sqrt{\gamma})\cdot\big{\|}Q(x;\rho_{\pi_{t}})-Q(x;\rho_{t})\big{\|}_{2,\widetilde{\phi}^{\pi_{t}}}^{2}, (B.15)

where the last inequality holds by the Cauchy-Schwarz inequality. For the second term on the right-hand side of (B.9), we have

η01g(θ;α~ts,πt)s(vsα~ts)(θ)dθds\displaystyle\eta\cdot\int_{0}^{1}\int g(\theta;\widetilde{\alpha}_{t}^{s},\pi_{t})\cdot\partial_{s}(v_{s}\widetilde{\alpha}_{t}^{s})(\theta){\mathrm{d}}\theta{\mathrm{d}}s =η01g(θ;α~ts,πt),α~ts(θ)vs(θ)vs(θ)dθds\displaystyle=\eta\cdot\int_{0}^{1}\int\big{\langle}\nabla g(\theta;\widetilde{\alpha}_{t}^{s},\pi_{t}),\widetilde{\alpha}_{t}^{s}(\theta)\cdot v_{s}(\theta)\otimes v_{s}(\theta)\big{\rangle}{\mathrm{d}}\theta{\mathrm{d}}s
η01supθg(θ;α~ts,πt)FW2(ρt,ρπt)2ds\displaystyle\leq\eta\cdot\int_{0}^{1}\sup_{\theta}\|\nabla g(\theta;\widetilde{\alpha}_{t}^{s},\pi_{t})\|_{\rm{F}}\cdot W_{2}(\rho_{t},\rho_{\pi_{t}})^{2}{\mathrm{d}}s
=ηsupθ,sg(θ;α~ts,πt)FW2(ρt,ρπt)2,\displaystyle=\eta\cdot\sup_{\theta,s}\|\nabla g(\theta;\widetilde{\alpha}_{t}^{s},\pi_{t})\|_{\rm{F}}\cdot W_{2}(\rho_{t},\rho_{\pi_{t}})^{2}, (B.16)

where the first equality holds by the Eulerian representation of the geodesic in Lemma E.3, namely s(vsα~ts)=div(α~tsvsvs)\partial_{s}(v_{s}\cdot\widetilde{\alpha}_{t}^{s})=-\mathop{\mathrm{div}}(\widetilde{\alpha}_{t}^{s}\cdot v_{s}\otimes v_{s}), and Stokes’ formula. Here, we denote by \otimes the outer product between two vectors. The inequality in (B.16) follows from vs(θ)vs(θ)F=vs(θ)2\|v_{s}(\theta)\otimes v_{s}(\theta)\|_{\rm{F}}=\|v_{s}(\theta)\|^{2} and the property of the geodesic in (A.4) that vsα~ts,W2=W2(ρt,ρπt)\|v_{s}\|_{\widetilde{\alpha}_{t}^{s},W_{2}}=W_{2}(\rho_{t},\rho_{\pi_{t}}).

For the second term on the right-hand side of (B.8), we denote by utu_{t} the corresponding vector field at ρπt\rho_{\pi_{t}} such that tρπt=div(ρπtut)\partial_{t}\rho_{\pi_{t}}=-\mathop{\mathrm{div}}(\rho_{\pi_{t}}u_{t}). Then, it holds that

ρ˙πt,β~t0˙ρπt,W2=ut,v1ρπtρ˙πtρπt,W2W2(ρt,ρπt)=|ρ˙πt|W2W2(ρt,ρπt),\displaystyle-\langle\dot{\rho}_{\pi_{t}},\dot{\widetilde{\beta}_{t}^{0}}\rangle_{\rho_{\pi_{t}},W_{2}}=\langle u_{t},v_{1}\rangle_{\rho_{\pi_{t}}}\leq\|\dot{\rho}_{\pi_{t}}\|_{\rho_{\pi_{t}},W_{2}}\cdot W_{2}(\rho_{t},\rho_{\pi_{t}})=|\dot{\rho}_{\pi_{t}}|_{W_{2}}\cdot W_{2}(\rho_{t},\rho_{\pi_{t}}), (B.17)

where the first equality follows from (A.2), the inequality follows from the Cauchy-Schwarz inequality and the facts that utρπt=ρ˙πtρπt,W2\|u_{t}\|_{\rho_{\pi_{t}}}=\|\dot{\rho}_{\pi_{t}}\|_{\rho_{\pi_{t}},W_{2}} by (A.2) and that v1ρπt=W2(ρt,ρπt)\|v_{1}\|_{\rho_{\pi_{t}}}=W_{2}(\rho_{t},\rho_{\pi_{t}}) by (A.4). The last equality of (B.17) holds by (A.6). Plugging the definition of metric derivative |ρ˙t|W2|\dot{\rho}_{t}|_{W_{2}} in (A.5) into (B.17), we have

ρ˙πt,β~t0˙ρπt,W2\displaystyle-\langle\dot{\rho}_{\pi_{t}},\dot{\widetilde{\beta}_{t}^{0}}\rangle_{\rho_{\pi_{t}},W_{2}} limΔt0W2(ρπt,ρπt+Δt)|Δt|W2(ρt,ρπt)\displaystyle\leq\lim_{\Delta t\rightarrow 0}\frac{W_{2}(\rho_{\pi_{t}},\rho_{\pi_{t+\Delta t}})}{|\Delta t|}\cdot W_{2}(\rho_{t},\rho_{\pi_{t}})
α1/2limΔt0sups𝒮𝔼ssπt[πt+Δt(|s)πt(|s)Δt1]W2(ρt,ρπt)\displaystyle\leq\alpha^{-1/2}\mathcal{B}\cdot\lim_{\Delta t\rightarrow 0}\sup_{s\in{\mathcal{S}}}\mathbb{E}_{s^{\prime}\sim\mathcal{E}_{s}^{\pi_{t}}}\Big{[}\Big{\|}\frac{\pi_{t+\Delta t}(\cdot{\,|\,}s^{\prime})-\pi_{t}(\cdot{\,|\,}s^{\prime})}{\Delta t}\Big{\|}_{1}\Big{]}\cdot W_{2}(\rho_{t},\rho_{\pi_{t}})
=α1/2sups𝒮𝔼ssπt[At(s,)πt(|s)1]W2(ρt,ρπt),\displaystyle=\alpha^{-1/2}\mathcal{B}\cdot\sup_{s\in{\mathcal{S}}}\mathbb{E}_{s^{\prime}\sim\mathcal{E}_{s}^{\pi_{t}}}\Big{[}\big{\|}A_{t}(s^{\prime},\cdot)\pi_{t}(\cdot{\,|\,}s^{\prime})\big{\|}_{1}\Big{]}\cdot W_{2}(\rho_{t},\rho_{\pi_{t}}), (B.18)

where the second inequality follows from Property (iv) of Lemma B.1 and the equality follows from the MF-PPO update in (3.7). For the approximation of the advantage function AtA_{t}, it holds that

supx𝒮×𝒜|At(x)|\displaystyle\sup_{x\in{\mathcal{S}}\times\mathcal{A}}\big{|}A_{t}(x)\big{|} =supx𝒮×𝒜|Q(x;ρt)Q(s,a)πt(da|s)|\displaystyle=\sup_{x\in{\mathcal{S}}\times\mathcal{A}}\bigg{|}Q(x;\rho_{t})-\int Q(s,a^{\prime})\pi_{t}({\mathrm{d}}a^{\prime}{\,|\,}s)\bigg{|}
2supx𝒮×𝒜|Q(x;ρt)|.\displaystyle\leq 2\sup_{x\in{\mathcal{S}}\times\mathcal{A}}\big{|}Q(x;\rho_{t})\big{|}. (B.19)

Plugging (B.19) into (B.18), we have

ρ˙πt,β~t0ρπt\displaystyle-\langle\dot{\rho}_{\pi_{t}},\widetilde{\beta}_{t}^{0}\rangle_{\rho_{\pi_{t}}} α1/2supx𝒳|At(x)|sups𝒮𝔼ssπt[πt(|s)1]W2(ρt,ρπt)\displaystyle\leq\alpha^{-1/2}\mathcal{B}\cdot\sup_{x\in\mathcal{X}}\big{|}A_{t}(x)\big{|}\cdot\sup_{s\in{\mathcal{S}}}\mathbb{E}_{s^{\prime}\sim\mathcal{E}_{s}^{\pi_{t}}}\Big{[}\big{\|}\pi_{t}(\cdot{\,|\,}s^{\prime})\big{\|}_{1}\Big{]}\cdot W_{2}(\rho_{t},\rho_{\pi_{t}})
2α1/2supx𝒳|Q(x;ρt)|W2(ρt,ρπt).\displaystyle\leq 2\alpha^{-1/2}\mathcal{B}\cdot\sup_{x\in\mathcal{X}}{\big{|}Q(x;\rho_{t})\big{|}}\cdot W_{2}(\rho_{t},\rho_{\pi_{t}}). (B.20)

where the last inequality follows from πt(|s)1=1\|\pi_{t}(\cdot{\,|\,}s^{\prime})\|_{1}=1. By first plugging (B.15) and (B.16) into (B.9), and then plugging (B.9) and (B.20) into (B.8), we have

ddtW22(ρt,ρπt)2\displaystyle\frac{{\mathrm{d}}}{{\mathrm{d}}t}\frac{W_{2}^{2}(\rho_{t},\rho_{\pi_{t}})}{2} α2η(1γ)QtQπt2,ϕ~πt2+ηsupθ,sg(θ;α~ts,πt)FW2(ρt,ρπ)2\displaystyle\leq-\alpha^{-2}\eta(1-\sqrt{\gamma})\cdot\big{\|}Q_{t}-Q^{\pi_{t}}\big{\|}_{2,\widetilde{\phi}^{\pi_{t}}}^{2}+\eta\cdot\sup_{\theta,s}\big{\|}\nabla g(\theta;\widetilde{\alpha}_{t}^{s},\pi_{t})\big{\|}_{\rm{F}}\cdot W_{2}(\rho_{t},\rho_{\pi})^{2}
+2α1/2supx𝒳|Q(x;ρt)|W2(ρt,ρπt).\displaystyle\quad+2\alpha^{-1/2}\mathcal{B}\cdot\sup_{x\in\mathcal{X}}{\big{|}Q(x;\rho_{t})\big{|}}\cdot W_{2}(\rho_{t},\rho_{\pi_{t}}). (B.21)

Plugging Lemma D.1 into (B.21), we have

ddtW22(ρt,ρπt)2\displaystyle\frac{{\mathrm{d}}}{{\mathrm{d}}t}\frac{W_{2}^{2}(\rho_{t},\rho_{\pi_{t}})}{2}\leq ηα2(1γ)QtQπt2,ϕ~πt2\displaystyle-\eta\alpha^{-2}(1-\sqrt{\gamma})\cdot\big{\|}Q_{t}-Q^{\pi_{t}}\big{\|}_{2,\widetilde{\phi}^{\pi_{t}}}^{2}
+ηα1B2(2αB1sups[0,1]W2(α~ts,ρ0)+Br)W2(ρt,ρπ)2\displaystyle+\eta\alpha^{-1}B_{2}\cdot\big{(}2\alpha B_{1}\sup_{s\in[0,1]}W_{2}(\widetilde{\alpha}_{t}^{s},\rho_{0})+B_{r}\big{)}\cdot W_{2}(\rho_{t},\rho_{\pi})^{2}
+2α1/2B1W2(ρt,ρ0)W2(ρt,ρπt).\displaystyle+2\alpha^{1/2}\mathcal{B}B_{1}\cdot W_{2}(\rho_{t},\rho_{0})W_{2}(\rho_{t},\rho_{\pi_{t}}). (B.22)

Note that α~ts\widetilde{\alpha}_{t}^{s} is the geodesic connecting ρt\rho_{t} and ρπt\rho_{\pi_{t}}. By Lemma D.2, we have

sups[0,1]W2(α~ts,ρ0)2max{W2(ρπt,ρ0),W2(ρt,ρ0)}.\displaystyle\sup_{s\in[0,1]}W_{2}(\widetilde{\alpha}_{t}^{s},\rho_{0})\leq 2\max\big{\{}W_{2}(\rho_{\pi_{t}},\rho_{0}),W_{2}(\rho_{t},\rho_{0})\big{\}}. (B.23)

Plugging (B.23) into (B.22), it follows that

ddtW~22(ρt,ρπt)2η\displaystyle\frac{{\mathrm{d}}}{{\mathrm{d}}t}\frac{\widetilde{W}_{2}^{2}(\rho_{t},\rho_{\pi_{t}})}{2\eta}\leq (1γ)QtQπt2,ϕ~πt2+2η1α1/2B1W~2(ρt,ρ0)W~2(ρt,ρπt)\displaystyle-(1-\sqrt{\gamma})\cdot\big{\|}Q_{t}-Q^{\pi_{t}}\big{\|}_{2,\widetilde{\phi}^{\pi_{t}}}^{2}+2\eta^{-1}\alpha^{1/2}\mathcal{B}B_{1}\cdot\widetilde{W}_{2}(\rho_{t},\rho_{0})\widetilde{W}_{2}(\rho_{t},\rho_{\pi_{t}})
+α1B2(4B1max{W~2(ρπt,ρ0),W~2(ρt,ρ0)}+Br)W~2(ρt,ρπt)2,\displaystyle+\alpha^{-1}B_{2}\cdot\Big{(}4B_{1}\max\big{\{}\widetilde{W}_{2}(\rho_{\pi_{t}},\rho_{0}),\widetilde{W}_{2}(\rho_{t},\rho_{0})\big{\}}+B_{r}\Big{)}\widetilde{W}_{2}(\rho_{t},\rho_{\pi_{t}})^{2},

where W~2=αW2\widetilde{W}_{2}=\alpha W_{2} is the scaled W2W_{2} metric defined in Lemma B.1. Thus, we complete the proof of Theorem 4.5. ∎

B.4 Proof of Theorem 4.6

Proof.

We remark that the restarting mechanism produces discontinuity in ρt\rho_{t} while πt\pi_{t} remains continuous. Let T0,T1,,TNT_{0},T_{1},\cdots,T_{N} denote the restarting points in [0,T)[0,T), where T0=0T_{0}=0 and NN is the total restarting number in [0,T)[0,T). Let TnT_{n}^{-} and Tn+T_{n}^{+} denote the moments just before and after the restarting occurring at TnT_{n}, respectively. According to the restarting mechanism, we have W~2(ρt,ρ0)λD¯\widetilde{W}_{2}(\rho_{t},\rho_{0})\leq\lambda\bar{D}, W~2(ρTn,ρ0)=λD¯\widetilde{W}_{2}(\rho_{T_{n}^{-}},\rho_{0})=\lambda\bar{D} and ρTn+=ρ0\rho_{T_{n}^{+}}=\rho_{0}. Recall that we set Φ~πt=~ϕ~πtπt\widetilde{\Phi}^{\pi_{t}}={\widetilde{\mathcal{E}}}_{\widetilde{\phi}^{\pi_{t}}}^{\pi_{t}}. By (4.7) of Theorem 4.5, it holds that

(1γ)QtQπt2,ϕ~πt2\displaystyle(1-\sqrt{\gamma})\cdot\|Q_{t}-Q^{\pi_{t}}\|_{2,\widetilde{\phi}^{\pi_{t}}}^{2}\leq ddtW~22(ρt,ρπt)2η+2η1α1/2B1λD¯W~2(ρt,ρπt)\displaystyle-\frac{{\mathrm{d}}}{{\mathrm{d}}t}\frac{\widetilde{W}_{2}^{2}(\rho_{t},\rho_{\pi_{t}})}{2\eta}+2\eta^{-1}\alpha^{1/2}\mathcal{B}B_{1}\lambda\bar{D}\cdot\widetilde{W}_{2}(\rho_{t},\rho_{\pi_{t}})
+α1B2(4B1λD¯+Br)W~2(ρt,ρπ)2.\displaystyle+\alpha^{-1}B_{2}\cdot(4B_{1}\lambda\bar{D}+B_{r})\widetilde{W}_{2}(\rho_{t},\rho_{\pi})^{2}. (B.24)

For simplicity, we let S1=(1+λ)2D¯2B2(4B1λD¯+Br)1γS_{1}=\frac{(1+\lambda)^{2}\bar{D}^{2}B_{2}(4B_{1}\lambda\bar{D}+B_{r})}{1-\sqrt{\gamma}}, S2=2B1λ(1+λ)D¯21γS_{2}=\frac{2\mathcal{B}B_{1}\lambda(1+\lambda)\bar{D}^{2}}{1-\sqrt{\gamma}}, ξ=12(1γ)\xi=\frac{1}{2(1-\sqrt{\gamma})}, Q(t)=QtQπt2,ϕ~πtQ(t)=\|Q_{t}-Q^{\pi_{t}}\|_{2,\widetilde{\phi}^{\pi_{t}}}, and W~(t)=W~2(ρt,ρπt)\widetilde{W}(t)=\widetilde{W}_{2}(\rho_{t},\rho_{\pi_{t}}). Summing the integrals of (B.24) over [T0,T1],,[TN1,TN][T_{0},T_{1}],\cdots,[T_{N-1},T_{N}], and [TN,T][T_{N},T], we have

1T0TQ2(t)dt\displaystyle\frac{1}{T}\int_{0}^{T}Q^{2}(t){\mathrm{d}}t 1T0T(α1S1(1+λ)2D¯2W~(t)2+α1/2η1S2(1+λ)D¯W~(t))dt\displaystyle\leq\frac{1}{T}\int_{0}^{T}\Big{(}\frac{\alpha^{-1}S_{1}}{(1+\lambda)^{2}\bar{D}^{2}}\widetilde{W}(t)^{2}+\frac{\alpha^{1/2}\eta^{-1}S_{2}}{(1+\lambda)\bar{D}}\widetilde{W}(t)\Big{)}{\mathrm{d}}t
ξTη(n=0N1TnTn+1ddtW~(t)2dt+TNTddtW~(t)2dt).\displaystyle\quad-\frac{\xi}{T\eta}\biggl{(}\sum_{n=0}^{N-1}\int_{T_{n}}^{T_{n+1}}\frac{{\mathrm{d}}}{{\mathrm{d}}t}\widetilde{W}(t)^{2}{\mathrm{d}}t+\int_{T_{N}}^{T}\frac{{\mathrm{d}}}{{\mathrm{d}}t}\widetilde{W}(t)^{2}{\mathrm{d}}t\biggr{)}. (B.25)

Note that we have W~(t)=W~2(ρt,ρπt)W~2(ρt,ρ0)+W~2(ρπt,ρ0)(1+λ)D¯\widetilde{W}(t)=\widetilde{W}_{2}(\rho_{t},\rho_{\pi_{t}})\leq\widetilde{W}_{2}(\rho_{t},\rho_{0})+\widetilde{W}_{2}(\rho_{\pi_{t}},\rho_{0})\leq(1+\lambda)\bar{D} by the triangle inequality for the W2W_{2} distance, the restarting mechanism, and Property (iii) of Lemma B.1. It thus holds for (B.25) that

1T0TQ2(t)dt\displaystyle\frac{1}{T}\int_{0}^{T}Q^{2}(t){\mathrm{d}}t 1T0T(α1S1+α1/2η1S2)dt\displaystyle\leq\frac{1}{T}\int_{0}^{T}(\alpha^{-1}S_{1}+\alpha^{1/2}\eta^{-1}S_{2}){\mathrm{d}}t
ξTη(n=0N1(W~2(Tn+1)W~2(Tn+))+(W~2(T)W~2(TN+))).\displaystyle\quad-\frac{\xi}{T\eta}\biggl{(}\sum_{n=0}^{N-1}\big{(}\widetilde{W}^{2}(T_{n+1}^{-})-\widetilde{W}^{2}(T_{n}^{+})\big{)}+\big{(}\widetilde{W}^{2}(T)-\widetilde{W}^{2}(T_{N}^{+})\big{)}\biggr{)}. (B.26)

Note that we have W~(Tn+1)W~2(ρ0,ρTn+1)W~2(ρ0,ρπTn+1)(λ1)D¯D¯W~2(ρ0,ρπTn+)=W~(Tn+)\widetilde{W}(T_{n+1}^{-})\geq\widetilde{W}_{2}(\rho_{0},\rho_{T_{n+1}^{-}})-\widetilde{W}_{2}(\rho_{0},\rho_{\pi_{T_{n+1}^{-}}})\geq(\lambda-1)\bar{D}\geq\bar{D}\geq\widetilde{W}_{2}(\rho_{0},\rho_{\pi_{T_{n}^{+}}})=\widetilde{W}(T_{n}^{+}) by the triangle inequality for the W2W_{2} distance, the restarting mechanism with λ2\lambda\geq 2, and Property (iii) of Lemma B.1. It thus holds for (B.26) that

1T0TQ2(t)dtα1S1+α1/2η1S2+η1D¯22T(1γ).\displaystyle\frac{1}{T}\int_{0}^{T}Q^{2}(t){\mathrm{d}}t\leq\alpha^{-1}S_{1}+\alpha^{1/2}\eta^{-1}S_{2}+\frac{\eta^{-1}\bar{D}^{2}}{2T(1-\sqrt{\gamma})}. (B.27)

By setting ϕ~πt=12ϕ~0+12ϕ0πt\widetilde{\phi}^{\pi_{t}}=\frac{1}{2}\widetilde{\phi}_{0}+\frac{1}{2}\phi_{0}\otimes\pi_{t} and plugging (B.27) into (4.1) of Theorem 4.1, we have

1T0T(J(π)J(πt))dt\displaystyle\frac{1}{T}\int_{0}^{T}\bigl{(}J(\pi^{*})-J(\pi_{t})\bigr{)}{\mathrm{d}}t ζT+4κT0TQ(t)dt\displaystyle\leq\frac{\zeta}{T}+\frac{4\kappa}{T}\int_{0}^{T}Q(t){\mathrm{d}}t
ζT+4κ1T0TQ2(t)dt\displaystyle\leq\frac{\zeta}{T}+4\kappa\sqrt{\frac{1}{T}\int_{0}^{T}Q^{2}(t){\mathrm{d}}t}
ζT+4κα1S1+α1/2η1S2+η1D¯22T(1γ),\displaystyle\leq\frac{\zeta}{T}+4\kappa\sqrt{\alpha^{-1}S_{1}+\alpha^{1/2}\eta^{-1}S_{2}+\frac{\eta^{-1}\bar{D}^{2}}{2T(1-\sqrt{\gamma})}}, (B.28)

where κ=~𝒟0π/ϕ~0\kappa=\big{\|}{\widetilde{\mathcal{E}}}_{\mathcal{D}_{0}}^{\pi^{*}}/\widetilde{\phi}_{0}\big{\|}_{\infty} is the concentrability coefficient and ζ=𝔼s𝒟0π[KL(π(|s)π0(|s))]\zeta=\mathbb{E}_{s\sim\mathcal{E}^{\pi^{*}}_{\mathcal{D}_{0}}}\bigl{[}{\mathrm{KL}}\bigl{(}\pi^{*}(\cdot{\,|\,}s)\,\|\,\pi_{0}(\cdot{\,|\,}s)\bigr{)}\bigr{]} is the expected KL divergence between π\pi^{*} and π0\pi_{0}. Here the second inequality follows from the Cauchy-Schwarz inequality. Therefore, we complete the proof of (4.9) in Theorem 4.6. In what follows, we aim to upper bound the total number of restarts NN. Recall that we have W~(Tn+)D¯\widetilde{W}(T_{n}^{+})\leq\bar{D} and W~(Tn+1)(λ1)D¯\widetilde{W}(T_{n+1}^{-})\geq(\lambda-1)\bar{D} as established above. Thus, it holds for (B.26) that

1T0TQ2(t)dt\displaystyle\frac{1}{T}\int_{0}^{T}Q^{2}(t){\mathrm{d}}t α1S1+α1/2η1S2ξTη(N(λ2)D¯2D¯2).\displaystyle\leq\alpha^{-1}S_{1}+\alpha^{1/2}\eta^{-1}S_{2}-\frac{\xi}{T\eta}\big{(}N(\lambda-2)\bar{D}^{2}-\bar{D}^{2}\big{)}.

Since Q2(t)0Q^{2}(t)\geq 0, it follows that

N\displaystyle N ((α1S1+α1/2η1S2)2T(1γ)η+D¯2)D¯2λ2\displaystyle\leq\big{(}(\alpha^{-1}S_{1}+\alpha^{1/2}\eta^{-1}S_{2})2T(1-\sqrt{\gamma})\eta+\bar{D}^{2}\big{)}\cdot\frac{\bar{D}^{-2}}{\lambda-2}
=(λ2)1((α1ηS1+α1/2S2)2TD¯2(1γ)+1),\displaystyle=(\lambda-2)^{-1}\big{(}(\alpha^{-1}\eta S_{1}+\alpha^{1/2}S_{2})2T\bar{D}^{-2}(1-\sqrt{\gamma})+1\big{)},

which upper bounds the total restarting number NN. Hence, we complete the proof of Theorem 4.6. ∎

Appendix C Proofs of Supporting Lemmas

In this section, we give detailed proofs of the supporting lemmas.

C.1 Generality of Transition Kernel Representation

Recall that we have the following representation of the transition kernel in Assumption 4.3,

P(s|s,a)\displaystyle P(s^{\prime}{\,|\,}s,a) =σ~((s,a,1)w)φ(s)ψ(s;w)dw,\displaystyle=\int\widetilde{\sigma}\bigl{(}(s,a,1)^{\top}w\bigr{)}\varphi(s^{\prime})\psi(s^{\prime};w){\mathrm{d}}w, (C.1)

where φ(s)0\varphi(s^{\prime})\geq 0 and ψ(s;)𝒫(𝒲)\psi(s^{\prime};\cdot)\in\mathscr{P}(\mathcal{W}). Such a representation has the same function class as

P~(s|s,a)=σ~((s,a,1)w)ψ~(s;w)dw,\displaystyle\widetilde{P}(s^{\prime}{\,|\,}s,a)=\int\widetilde{\sigma}\big{(}(s,a,1)^{\top}w\big{)}\widetilde{\psi}(s^{\prime};w){\mathrm{d}}w, (C.2)

where ψ~\widetilde{\psi} is a finite signed measure.

Proof.

For simplicity, let 𝒫\mathcal{P} and 𝒫~\widetilde{\mathcal{P}} denote the function classes represented by (C.1) and (C.2), respectively. Note that for any transition kernel P(s|s,a)P(s^{\prime}{\,|\,}s,a) represented by (C.1), by letting ψ~(s;w)=φ(s)ψ(s;w)\widetilde{\psi}(s^{\prime};w)=\varphi(s^{\prime})\psi(s^{\prime};w), such a transition kernel P(s|s,a)P(s^{\prime}{\,|\,}s,a) can be equivalently represented by P~(s|s,a)\widetilde{P}(s^{\prime}{\,|\,}s,a) in (C.2). Thus, it holds that 𝒫𝒫~\mathcal{P}\subseteq\widetilde{\mathcal{P}}. Therefore, we only need to prove 𝒫~𝒫\widetilde{\mathcal{P}}\subseteq\mathcal{P}, that is, for any P~(s|s,a)𝒫~\widetilde{P}(s^{\prime}{\,|\,}s,a)\in\widetilde{\mathcal{P}} given by (C.2) with a finite signed measure ψ~\widetilde{\psi}, the integral can be rewritten with respect to a non-negative measure, from which a representation of the form (C.1) follows by normalization, as carried out at the end of this proof. Note that there always exist non-negative functions ψ~+\widetilde{\psi}_{+} and ψ~\widetilde{\psi}_{-} such that ψ~=ψ~+ψ~\widetilde{\psi}=\widetilde{\psi}_{+}-\widetilde{\psi}_{-} (the Jordan decomposition). Since σ~\widetilde{\sigma} is an odd function, it holds that

P~(s|s,a)\displaystyle\widetilde{P}(s^{\prime}{\,|\,}s,a) =σ~(wx)ψ~(s;w)dw\displaystyle=\int\widetilde{\sigma}(w^{\top}x){\widetilde{\psi}}(s^{\prime};w){\mathrm{d}}w
=σ~(wx)ψ~+(s;w)dw+σ~(wx)ψ~(s;w)dw\displaystyle=\int\widetilde{\sigma}(w^{\top}x)\widetilde{\psi}_{+}(s^{\prime};w){\mathrm{d}}w+\int\widetilde{\sigma}(-w^{\top}x)\widetilde{\psi}_{-}(s^{\prime};w){\mathrm{d}}w
=σ~(wx)(ψ~+(s;w)+ψ~(s;w))dw.\displaystyle=\int\widetilde{\sigma}(w^{\top}x)\bigl{(}\widetilde{\psi}_{+}(s^{\prime};w)+\widetilde{\psi}_{-}(s^{\prime};-w)\bigr{)}{\mathrm{d}}w.

Thus, by letting φ(s)=(ψ~+(s;w)+ψ~(s;w))dw\varphi(s^{\prime})=\int\big{(}\widetilde{\psi}_{+}(s^{\prime};w)+\widetilde{\psi}_{-}(s^{\prime};-w)\big{)}{\mathrm{d}}w and ψ(s;w)=φ(s)1(ψ~+(s;w)+ψ~(s;w))\psi(s^{\prime};w)=\varphi(s^{\prime})^{-1}\bigl{(}\widetilde{\psi}_{+}(s^{\prime};w)+\widetilde{\psi}_{-}(s^{\prime};-w)\bigr{)}, it holds that P~(s|s,a)=σ~(wx)φ(s)ψ(s;dw)=P(s|s,a)\widetilde{P}(s^{\prime}{\,|\,}s,a)=\int\widetilde{\sigma}(w^{\top}x)\varphi(s^{\prime})\psi(s^{\prime};{\mathrm{d}}w)=P(s^{\prime}{\,|\,}s,a), where ψ𝒫(𝒲)\psi\in\mathscr{P}(\mathcal{W}) and φ0\varphi\geq 0. Hence, any P~(s|s,a)𝒫~\widetilde{P}(s^{\prime}{\,|\,}s,a)\in\widetilde{\mathcal{P}} can be equivalently represented by (C.1) and it follows that 𝒫~𝒫\widetilde{\mathcal{P}}\subseteq\mathcal{P}. Thus, (C.1) has the same function class as (C.2) and we complete the proof. ∎
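As a simple illustration of this sign-flipping argument (the example is ours, and point masses are used only for concreteness), take σ~=tanh\widetilde{\sigma}=\tanh and ψ~(s;)=c(s)(δw1δw2)\widetilde{\psi}(s^{\prime};\cdot)=c(s^{\prime})(\delta_{w_{1}}-\delta_{w_{2}}) with c(s)0c(s^{\prime})\geq 0. Then

P~(s|s,a)=c(s)(tanh(w1x)tanh(w2x))=c(s)(tanh(w1x)+tanh((w2)x))=σ~(wx)φ(s)ψ(s;dw),\displaystyle\widetilde{P}(s^{\prime}{\,|\,}s,a)=c(s^{\prime})\bigl{(}\tanh(w_{1}^{\top}x)-\tanh(w_{2}^{\top}x)\bigr{)}=c(s^{\prime})\bigl{(}\tanh(w_{1}^{\top}x)+\tanh((-w_{2})^{\top}x)\bigr{)}=\int\widetilde{\sigma}(w^{\top}x)\varphi(s^{\prime})\psi(s^{\prime};{\mathrm{d}}w),

with φ(s)=2c(s)0\varphi(s^{\prime})=2c(s^{\prime})\geq 0 and ψ(s;)=12(δw1+δw2)𝒫(𝒲)\psi(s^{\prime};\cdot)=\frac{1}{2}(\delta_{w_{1}}+\delta_{-w_{2}})\in\mathscr{P}(\mathcal{W}), which is exactly the form (C.1); the general case replaces the point masses by the densities ψ~+\widetilde{\psi}_{+} and ψ~\widetilde{\psi}_{-}.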

C.2 Proof of Detailed Version of Lemma 4.4

In this section, we prove a more detailed version of Lemma 4.4, i.e., Lemma B.1.

Proof.

We begin with a sketch of the proof of Lemma B.1. We first construct the functions ZπZ_{\pi} and νπ(w)\nu_{\pi}(w). With the use of mollifiers, we prove that there exists a function pπ(b)p_{\pi}(b) satisfying (C.6), and then construct ρ¯π(θ)\bar{\rho}_{\pi}(\theta), which in turn yields ρπ\rho_{\pi}. For the proof of Property (iii), using Talagrand’s inequality and the chi-squared divergence, we establish a constant upper bound for the scaled distance W~2(ρπ,ρ0)\widetilde{W}_{2}(\rho_{\pi},\rho_{0}). For the proof of Property (iv), by exploiting the inequality between the W2W_{2} distance and the weighted homogeneous Sobolev norm, we upper bound W2(ρπ1,ρπ2)W_{2}(\rho_{\pi_{1}},\rho_{\pi_{2}}) by a term of order O(α1/2)O(\alpha^{-1/2}).

Proof of Property (i) of Lemma B.1. We give a proof of Property (i) by a construction of ρπ\rho_{\pi}. For notational simplicity, we let x=(s,a,1)x=(s,a,1). By definitions of the action value function QπQ^{\pi} and the state value function VπV^{\pi} in (2.2), we have

Qπ(s,a)\displaystyle Q^{\pi}(s,a) =r(s,a)+γP(s|s,a)Vπ(s)ds\displaystyle=r(s,a)+\gamma\cdot\int P(s^{\prime}{\,|\,}s,a)V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime}
=σ~(wx){Brμ(w)+γφ(s)ψ(s;w)Vπ(s)ds}dw\displaystyle=\int\widetilde{\sigma}(w^{\top}x)\Bigl{\{}B_{r}\cdot\mu(w)+\gamma\cdot\int\varphi(s^{\prime})\psi(s^{\prime};w)V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime}\Bigr{\}}{\mathrm{d}}w
=Zπσ~(wx)νπ(w)dw,\displaystyle=\int Z_{\pi}\cdot\widetilde{\sigma}(w^{\top}x)\nu_{\pi}(w){\mathrm{d}}w, (C.3)

where

νπ(w)=Zπ1(Brμ(w)+γφ(s)ψ(s;w)Vπ(s)ds),\displaystyle\nu_{\pi}(w)=Z_{\pi}^{-1}\cdot\Bigl{(}B_{r}\mu(w)+\gamma\cdot\int\varphi(s^{\prime})\psi(s^{\prime};w)V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime}\Bigr{)}, (C.4)
Zπ=Br+γφ(s)Vπ(s)ds.\displaystyle Z_{\pi}=B_{r}+\gamma\cdot\int\varphi(s^{\prime})V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime}. (C.5)

Here, the second equality in (C.3) holds by (i) of Assumption 4.3. We construct ρ¯π\bar{\rho}_{\pi} by ρ¯π=νπ×pπ\bar{\rho}_{\pi}=\nu_{\pi}\times p_{\pi}, i.e., ρ¯π(w,b)=νπ(w)pπ(b)\bar{\rho}_{\pi}(w,b)=\nu_{\pi}(w)p_{\pi}(b), where pπ(b)p_{\pi}(b) is defined to be a probability measure in 𝒫2()\mathscr{P}_{2}(\mathbb{R}) such that

Bββ(b)pπ(db)=Zπ.\displaystyle\int B_{\beta}\cdot\beta(b)p_{\pi}({\mathrm{d}}b)=Z_{\pi}. (C.6)

We remark that such a pπp_{\pi} exists and we will provide a construction later. Since we have Vπ0V^{\pi}\geq 0, φ0\varphi\geq 0, and ψ0\psi\geq 0 by Assumption 4.3, it turns out that νπ\nu_{\pi} is a probability density function according to (C.4), which further implies that ρ¯π𝒫2(D)\bar{\rho}_{\pi}\in\mathscr{P}_{2}(\mathbb{R}^{D}). Plugging (C.6) into (C.3), it holds that

Qπ(x)=σ(x;θ)ρ¯π(θ)dθ,\displaystyle Q^{\pi}(x)=\int\sigma(x;\theta)\bar{\rho}_{\pi}(\theta){\mathrm{d}}\theta,

where the equality holds by noting that σ(x;θ)=Bββ(b)σ~(wx)\sigma(x;\theta)=B_{\beta}\cdot\beta(b)\cdot\widetilde{\sigma}(w^{\top}x) in (4.2) and that ρ¯π=νπ×pπ\bar{\rho}_{\pi}=\nu_{\pi}\times p_{\pi}. Furthermore, by letting ρπ=ρ¯π+(1α1)(ρ0ρ¯π)\rho_{\pi}=\bar{\rho}_{\pi}+(1-\alpha^{-1})(\rho_{0}-\bar{\rho}_{\pi}), we have

Qπ(x)\displaystyle Q^{\pi}(x) =ασ(x;θ)(ρπ(1α1)ρ0)dθ=ασ(x;θ)ρπdθ=Q(x;ρπ),\displaystyle=\alpha\int\sigma(x;\theta)(\rho_{\pi}-(1-\alpha^{-1})\rho_{0}){\mathrm{d}}\theta=\alpha\int\sigma(x;\theta)\rho_{\pi}{\mathrm{d}}\theta=Q(x;\rho_{\pi}),

where the second equality holds by noting that σ\sigma is odd with respect to ww and bb and that ρ0N(0,ID)\rho_{0}\sim N(0,I_{D}) is an even function. Thus, we finish the construction of ρπ\rho_{\pi} and also complete the proof of Property (i) in Lemma B.1.

A Construction for pπ(b)p_{\pi}(b). Recall that we have pπp_{\pi} defined in (C.6). Here, we provide a construction for pπp_{\pi} which has some properties that will facilitate our analysis. Ideally, we want pπp_{\pi} to have global support and concentrate to its mean, which motivates us to consider pπp_{\pi} to be Gaussian distribution with high variance. Recall that we assume that β1\beta^{-1} is β\ell_{\beta}-Lipschitz continuous on [2/3,2/3][-2/3,2/3] in Assumption 4.2. Let q(b)q(b) be the probability density function of the standard Gaussian distribution,i.e., qN(0,1)q\sim N(0,1). Then, qϵ(bz)=ϵ1q((bz)/ϵ)q_{\epsilon}(b-z)=\epsilon^{-1}\cdot q((b-z)/\epsilon) is the probability density function such that qϵN(z,ϵ2)q_{\epsilon}\sim N(z,\epsilon^{2}). We define function βϵ\beta_{\epsilon} as follows,

βϵ(z)\displaystyle\beta_{\epsilon}(z) =β(b)qϵ(bz)db=(βqϵ)(z),\displaystyle=\int\beta(b)q_{\epsilon}(b-z){\mathrm{d}}b=(\beta*q_{\epsilon})(z), (C.7)

where * denotes the convolution. Note that {qϵ}ϵ>0\{q_{\epsilon}\}_{\epsilon>0} can be viewed as a class of mollifiers (Friedrichs, 1944). In particular, let

ϵ¯=min{π216L0,β,π212βL1,β,1},\displaystyle\bar{\epsilon}=\min\Big{\{}\sqrt{\frac{\pi}{2}}\cdot\frac{1}{6L_{0,\beta}},\sqrt{\frac{\pi}{2}}\cdot\frac{1}{2\ell_{\beta}L_{1,\beta}},1\Big{\}}, (C.8)

where L0,βL_{0,\beta} and L1,βL_{1,\beta} characterize the Lipschitz continuity and smoothness of β\beta respectively and β1\beta^{-1} is β\ell_{\beta}-Lipschitz continuous in [2/3,2/3][-2/3,2/3] by Assumption 4.2. For the approximation error of mollifier βϵ\beta_{\epsilon}, it holds that

|β(z)βϵ¯(z)|\displaystyle\big{|}\beta(z)-\beta_{\bar{\epsilon}}(z)\big{|} =|(β(z)β(z))qϵ¯(zz)dz|\displaystyle=\Big{|}\int\big{(}\beta(z)-\beta(z^{\prime})\big{)}q_{{\bar{\epsilon}}}(z-z^{\prime}){\mathrm{d}}z^{\prime}\Big{|}
L0,β|zz|qϵ¯(zz)dz\displaystyle\leq L_{0,\beta}\cdot\int|z-z^{\prime}|q_{{\bar{\epsilon}}}(z-z^{\prime}){\mathrm{d}}z^{\prime}
=L0,βϵ¯2π,\displaystyle=L_{0,\beta}\cdot\bar{\epsilon}\cdot\sqrt{\frac{2}{\pi}},

where the inequality follows from the L0,βL_{0,\beta}-Lipschitz continuity of β\beta and the last equality follows from |z|qϵ(z)dz=ϵ2/π\int|z|q_{\epsilon}(z){\mathrm{d}}z=\epsilon\cdot\sqrt{2/\pi}. Similarly, we have |β˙(z)β˙ϵ¯(z)|L1,βϵ¯2/π|\dot{\beta}(z)-\dot{\beta}_{{\bar{\epsilon}}}(z)|\leq L_{1,\beta}\cdot{\bar{\epsilon}}\cdot\sqrt{2/\pi}. By the definition of ϵ¯{\bar{\epsilon}} in (C.8), it further holds that

supbβ1([2/3,2/3])|β(b)βϵ¯(b)|1/6,\displaystyle\sup_{b\in\beta^{-1}([-2/3,2/3])}\bigl{|}\beta(b)-\beta_{\bar{\epsilon}}(b)\bigr{|}\leq 1/6, (C.9)
supbβ1([2/3,2/3])|β˙(b)β˙ϵ¯(b)|12β.\displaystyle\sup_{b\in\beta^{-1}([-2/3,2/3])}\bigl{|}\dot{\beta}(b)-\dot{\beta}_{\bar{\epsilon}}(b)\bigr{|}\leq\frac{1}{2\ell_{\beta}}. (C.10)

Note that β(b)\beta(b) is a monotonic function with |β˙(b)|1/β|\dot{\beta}(b)|\geq 1/\ell_{\beta} in β1([2/3,2/3])\beta^{-1}([-2/3,2/3]) by Assumption 4.2. With regard to (C.10), it follows that βϵ¯(b)\beta_{\bar{\epsilon}}(b) is also monotonic in β1([2/3,2/3])\beta^{-1}([-2/3,2/3]) and that

|β˙ϵ¯(b)||β˙(b)||β˙(b)β˙ϵ¯(b)|12β,bβ1([2/3,2/3]).\displaystyle|\dot{\beta}_{\bar{\epsilon}}(b)|\geq|\dot{\beta}(b)|-|\dot{\beta}(b)-\dot{\beta}_{{\bar{\epsilon}}}(b)|\geq\frac{1}{2\ell_{\beta}},\quad\forall b\in\beta^{-1}([-2/3,2/3]). (C.11)

Furthermore, by (C.9) and the continuity of βϵ¯\beta_{{\bar{\epsilon}}} in β1([2/3,2/3])\beta^{-1}([-2/3,2/3]), we have

[1/2,1/2]βϵ¯(β1([2/3,2/3])).\displaystyle[-1/2,1/2]\subseteq\beta_{\bar{\epsilon}}(\beta^{-1}([-2/3,2/3])). (C.12)

The monotonicity of βϵ¯(b)\beta_{{\bar{\epsilon}}}(b), (C.11), and (C.12) together show that βϵ¯1\beta_{\bar{\epsilon}}^{-1} exists and is 2β2\ell_{\beta}-Lipschitz continuous in [1/2,1/2][-1/2,1/2]. Moreover, since β\beta is an odd function, it holds by (C.7) that βϵ¯\beta_{\bar{\epsilon}} is also an odd function with βϵ¯(0)=0\beta_{\bar{\epsilon}}(0)=0. Hence, it holds that βϵ¯1([1/2,1/2])[β,β]\beta_{\bar{\epsilon}}^{-1}([-1/2,1/2])\subseteq[-\ell_{\beta},\ell_{\beta}]. Furthermore, by (C.5), it holds that

Zπ\displaystyle Z_{\pi} =Br+γφ(s)Vπ(s)ds\displaystyle=B_{r}+\gamma\cdot\int\varphi(s^{\prime})V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime}
Br+γ(1γ)1Brφ(s)ds\displaystyle\leq B_{r}+\gamma\cdot(1-\gamma)^{-1}\cdot B_{r}\cdot\int\varphi(s^{\prime}){\mathrm{d}}s^{\prime}
Br+γ(1γ)1BrM1,φ,\displaystyle\leq B_{r}+\gamma(1-\gamma)^{-1}B_{r}M_{1,\varphi},

where the first inequality follows from the fact that $V^{\pi}(s)\leq(1-\gamma)^{-1}\cdot\sup_{s,a}r(s,a)\leq(1-\gamma)^{-1}\cdot B_{r}$ for any $s$, and the last inequality follows from Assumption 4.3, which gives $\int\varphi(s^{\prime}){\mathrm{d}}s^{\prime}\leq M_{1,\varphi}$. By setting $B_{\beta}\geq 2(B_{r}+\gamma(1-\gamma)^{-1}B_{r}M_{1,\varphi})$, it holds that $|Z_{\pi}/B_{\beta}|\leq 1/2$, which implies that $\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta})$ is well defined and allows us to take $p_{\pi}(b)=q_{\bar{\epsilon}}(b-\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta}))$, i.e., $p_{\pi}\sim N(\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta}),\bar{\epsilon}^{2})$. For the mean value $\beta_{{\bar{\epsilon}}}^{-1}(Z_{\pi}/B_{\beta})$, recalling that $\beta_{\bar{\epsilon}}^{-1}([-1/2,1/2])\subseteq[-\ell_{\beta},\ell_{\beta}]$, it holds that

|βϵ¯1(Zπ/Bβ)|β.\displaystyle\big{|}\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta})\big{|}\leq\ell_{\beta}. (C.13)

Following from (C.7), we have

β(b)pπ(b)db=β(b)qϵ¯(bβϵ¯1(Zπ/Bβ))db=βϵ¯(βϵ¯1(Zπ/Bβ))=Zπ/Bβ.\displaystyle\int\beta(b)p_{\pi}(b){\mathrm{d}}b=\int\beta(b)q_{\bar{\epsilon}}(b-\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta})){\mathrm{d}}b=\beta_{\bar{\epsilon}}\bigl{(}\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta})\bigr{)}=Z_{\pi}/B_{\beta}.

Hence, our construction of $p_{\pi}$ here is in line with the definition of $p_{\pi}$ in (C.6). In the sequel, we fix $p_{\pi}(b)=q_{\bar{\epsilon}}\big(b-\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta})\big)$ throughout.

Proof of Property (ii) of Lemma B.1. Here we show that g(;π,ρπ)=0g(\cdot;\pi,\rho_{\pi})=0 is a direct result of Qπ(x)=Q(x;ρπ)Q^{\pi}(x)=Q(x;\rho_{\pi}) in Property (i). Note that

Qπ(x)r(x)γ𝔼xP~π(|x)Qπ(x)=0\displaystyle Q^{\pi}(x)-r(x)-\gamma\mathbb{E}_{x^{\prime}\sim\widetilde{P}^{\pi}(\cdot{\,|\,}x)}Q^{\pi}(x^{\prime})=0 (C.14)

holds for any $x\in{\mathcal{S}}\times\mathcal{A}$ by the definition of the action-value function $Q^{\pi}$ in (2.2). Since Property (i) gives $Q^{\pi}(x)=Q(x;\rho_{\pi})$, plugging (C.14) into the definition of $g$ in (3.6), with $Q(\cdot;\rho_{\pi})$ substituted for $Q^{\pi}$, yields $g(x;\pi,\rho_{\pi})=0$ for every $x$. Thus, we complete the proof of Property (ii) of Lemma B.1.

Proof of Property (iii) of Lemma B.1.

In what follows, we aim to upper bound $W_{2}(\rho_{\pi},\rho_{0})$. We summarize the aforementioned constructions as follows:

Zπ=Br+γφ(s)Vπ(s)ds,\displaystyle Z_{\pi}=B_{r}+\gamma\cdot\int\varphi(s^{\prime})V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime},
νπ(w)=Zπ1(Brμ(w)+γφ(s)ψ(s;w)Vπ(s)ds),\displaystyle\nu_{\pi}(w)=Z_{\pi}^{-1}\Big{(}B_{r}\mu(w)+\gamma\cdot\int\varphi(s^{\prime})\psi(s^{\prime};w)V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime}\Big{)}, (C.15)
pπ(b)=qϵ¯(bβϵ¯1(Zπ/Bβ)),\displaystyle p_{\pi}(b)=q_{\bar{\epsilon}}\big{(}b-\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta})\big{)}, (C.16)
ρ¯π(θ)=pπ(b)νπ(w),\displaystyle\bar{\rho}_{\pi}(\theta)=p_{\pi}(b)\nu_{\pi}(w), (C.17)
ρπ=ρ¯π+(1α1)(ρ0ρ¯π).\displaystyle\rho_{\pi}=\bar{\rho}_{\pi}+(1-\alpha^{-1})(\rho_{0}-\bar{\rho}_{\pi}). (C.18)

Plugging (C.16) into the definition of Chi-squared divergence, we have

\displaystyle\chi^{2}\big(p_{\pi}\,\|\,\rho_{0,b}\big)=\int\frac{p_{\pi}^{2}}{\rho_{0,b}}{\mathrm{d}}b-1=\frac{1}{\bar{\epsilon}\sqrt{2-\bar{\epsilon}^{2}}}\cdot\exp\Big\{\frac{\big(\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta})\big)^{2}}{2-\bar{\epsilon}^{2}}\Big\}-1. (C.19)
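
Here we spell out the Gaussian computation behind the closed form in (C.19). Writing $m=\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta})$ so that $p_{\pi}\sim N(m,\bar{\epsilon}^{2})$, and recalling that $\rho_{0,b}$ is the standard Gaussian marginal of $\rho_{0}\sim\mathcal{N}(0,I_{D})$, completing the square gives

\displaystyle\int\frac{p_{\pi}(b)^{2}}{\rho_{0,b}(b)}{\mathrm{d}}b=\frac{1}{\bar{\epsilon}^{2}\sqrt{2\pi}}\int\exp\Big(-\frac{(b-m)^{2}}{\bar{\epsilon}^{2}}+\frac{b^{2}}{2}\Big){\mathrm{d}}b=\frac{1}{\bar{\epsilon}\sqrt{2-\bar{\epsilon}^{2}}}\cdot\exp\Big\{\frac{m^{2}}{2-\bar{\epsilon}^{2}}\Big\},

where the integral is finite since $\bar{\epsilon}\leq 1<\sqrt{2}$; subtracting one yields (C.19).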

Note that we have $\bar{\epsilon}=\min\{\sqrt{\pi/2}\cdot(6L_{0,\beta})^{-1},\sqrt{\pi/2}\cdot(2\ell_{\beta}L_{1,\beta})^{-1},1\}$ by (C.8) and $|\beta_{\bar{\epsilon}}^{-1}(Z_{\pi}/B_{\beta})|\leq\ell_{\beta}$ by (C.13). Hence, $\chi^{2}(p_{\pi}\,\|\,\rho_{0,b})$ is upper bounded by a constant depending only on $\ell_{\beta}$, $L_{0,\beta}$, and $L_{1,\beta}$. As for $\nu_{\pi}$ in (C.15), we have

χ2(νπρ0,w)\displaystyle\chi^{2}(\nu_{\pi}\,\|\,\rho_{0,w}) =χ2(BrZπ1μ+Vπ(s)Zπ1φ(s)ψ(s;w)dsρ0,w)\displaystyle=\chi^{2}\bigg{(}B_{r}Z_{\pi}^{-1}\mu+\int V^{\pi}(s^{\prime})Z_{\pi}^{-1}\varphi(s^{\prime})\psi(s^{\prime};w){\mathrm{d}}s^{\prime}\,\Big{\|}\,\rho_{0,w}\bigg{)}
3(χ2(BrZπ1μρ0,w)+χ2(Vπ(s)Zπ1φ(s)ψ(s;)dsρ0,w)+1),\displaystyle\leq 3\bigg{(}\chi^{2}(B_{r}Z_{\pi}^{-1}\mu\,\|\,\rho_{0,w})+\chi^{2}\Big{(}\int V^{\pi}(s^{\prime})Z_{\pi}^{-1}\varphi(s^{\prime})\psi(s^{\prime};\cdot){\mathrm{d}}s^{\prime}\,\Big{\|}\,\rho_{0,w}\Big{)}+1\bigg{)}, (C.20)

where the inequality holds by Property (iii) of Lemma D.3. For the first term on the right-hand side of (C.20), we have

χ2(BrZπ1μρ0,w)\displaystyle\chi^{2}(B_{r}Z_{\pi}^{-1}\mu\,\|\,\rho_{0,w}) =Br2Zπ2χ2(μρ0,w)+(1BrZπ1)2\displaystyle=B_{r}^{2}Z_{\pi}^{-2}\chi^{2}(\mu\,\|\,\rho_{0,w})+(1-B_{r}Z_{\pi}^{-1})^{2}
χ2(μρ0,w)+1\displaystyle\leq\chi^{2}(\mu\,\|\,\rho_{0,w})+1
Mμ+1,\displaystyle\leq M_{\mu}+1, (C.21)

where the equality holds by Property (i) of Lemma D.3, the first inequality holds by noting that $|B_{r}Z_{\pi}^{-1}|\leq 1$, and the last inequality follows from $\chi^{2}(\mu\,\|\,\rho_{0,w})<M_{\mu}$ by Assumption 4.3. Hence, the first term on the right-hand side of (C.20) is upper bounded. As for the second term, by Property (iv) of Lemma D.3, we have

χ2(Vπ(s)Zπ1φ(s)ψ(s;)dsρ0,w)\displaystyle\chi^{2}\Big{(}\int V^{\pi}(s^{\prime})Z_{\pi}^{-1}\varphi(s^{\prime})\psi(s^{\prime};\cdot){\mathrm{d}}s^{\prime}\,\Big{\|}\,\rho_{0,w}\Big{)}
(Vπ(s)Zπ1φ(s))2dsχ2(ψ(s;)ρ0,w)ds+(Vπ(s)Zπ1φ(s)ds1)2\displaystyle\leq\int\big{(}V^{\pi}(s^{\prime})Z_{\pi}^{-1}\varphi(s^{\prime})\big{)}^{2}{\mathrm{d}}s^{\prime}\cdot\int\chi^{2}(\psi(s^{\prime};\cdot)\,\|\,\rho_{0,w}){\mathrm{d}}s^{\prime}+\Big{(}\int V^{\pi}(s^{\prime})Z_{\pi}^{-1}\varphi(s^{\prime}){\mathrm{d}}s^{\prime}-1\Big{)}^{2}
(1γ)2M2,φMψ+(1γ)2M1,φ2+1,\displaystyle\leq(1-\gamma)^{-2}M_{2,\varphi}\cdot M_{\psi}+(1-\gamma)^{-2}M_{1,\varphi}^{2}+1, (C.22)

where the last inequality holds by noting that $|V^{\pi}|\leq(1-\gamma)^{-1}B_{r}$, $Z_{\pi}\geq B_{r}$, $\|{\mathcal{S}}\|\leq 1$, and $\chi^{2}(\psi(s^{\prime};\cdot)\,\|\,\rho_{0,w})<M_{\psi}$ by Assumption 4.3. Hence, the second term on the right-hand side of (C.20) is also upper bounded. Plugging (C.21) and (C.22) into (C.20), we establish the upper bound for $\chi^{2}(\nu_{\pi}\,\|\,\rho_{0,w})$. Furthermore, by Property (ii) of Lemma D.3 and noting that $\bar{\rho}_{\pi}=p_{\pi}\times\nu_{\pi}$, we have $\chi^{2}(\bar{\rho}_{\pi}\,\|\,\rho_{0})$ upper bounded as well, that is,

χ2(ρ¯πρ0)<12D¯2,\displaystyle\chi^{2}(\bar{\rho}_{\pi}\,\|\,\rho_{0})<\frac{1}{2}\bar{D}^{2},

where $\bar{D}$ depends on the discount factor $\gamma$ and the absolute constants occurring in (C.19), (C.21), and (C.22), i.e., $\ell_{\beta}$, $L_{0,\beta}$, $L_{1,\beta}$, $B_{r}$, $M_{1,\varphi}$, $M_{2,\varphi}$, $M_{\mu}$, and $M_{\psi}$ in Assumptions 4.2 and 4.3. Since $\rho_{0}\sim\mathcal{N}(0,I_{D})$, it holds for any $\rho_{\pi}$ that

12W2(ρπ,ρ0)2\displaystyle\frac{1}{2}W_{2}(\rho_{\pi},\rho_{0})^{2} KL(ρπρ0)(ρπρ01)ρπρ0ρ0(dθ)=(ρπρ01)2ρ0(dθ)\displaystyle\leq{\mathrm{KL}}(\rho_{\pi}\,\|\,\rho_{0})\leq\int\Big{(}\frac{\rho_{\pi}}{\rho_{0}}-1\Big{)}\frac{\rho_{\pi}}{\rho_{0}}\rho_{0}({\mathrm{d}}\theta)=\int\Big{(}\frac{\rho_{\pi}}{\rho_{0}}-1\Big{)}^{2}\rho_{0}({\mathrm{d}}\theta)
=((1α1)ρ0(θ)+α1ρ¯π(θ)ρ0(θ)1)2ρ0(dθ)=α2χ2(ρ¯πρ0)<α22D¯2,\displaystyle=\int\Big{(}\frac{(1-\alpha^{-1})\rho_{0}(\theta)+\alpha^{-1}\bar{\rho}_{\pi}(\theta)}{\rho_{0}(\theta)}-1\Big{)}^{2}\rho_{0}({\mathrm{d}}\theta)=\alpha^{-2}\chi^{2}(\bar{\rho}_{\pi}\,\|\,\rho_{0})<\frac{\alpha^{-2}}{2}\bar{D}^{2}, (C.23)

where the first inequality follows from Talagrand's inequality in Lemma E.4, the second inequality follows from $\log t\leq t-1$ applied to $t=\rho_{\pi}/\rho_{0}$, and the first equality follows from $\int(\rho_{\pi}/\rho_{0}-1)\rho_{0}({\mathrm{d}}\theta)=0$. Plugging $\widetilde{W}_{2}=\alpha W_{2}$ into (C.23), i.e., noting that (C.23) yields $W_{2}(\rho_{\pi},\rho_{0})<\alpha^{-1}\bar{D}$ and hence $\widetilde{W}_{2}(\rho_{\pi},\rho_{0})=\alpha\cdot W_{2}(\rho_{\pi},\rho_{0})<\bar{D}$, we complete the proof of Property (iii) of Lemma B.1.

Proof of Property (iv) of Lemma B.1. Upper Bounding $W_{2}(p_{\pi_{1}},p_{\pi_{2}})$. By the closed form of the $W_{2}$ distance between Gaussian distributions with equal variance (spelled out after (C.24) below), we have $W_{2}(p_{\pi_{1}},p_{\pi_{2}})=\big|\beta_{\bar{\epsilon}}^{-1}(Z_{\pi_{1}}/B_{\beta})-\beta_{\bar{\epsilon}}^{-1}(Z_{\pi_{2}}/B_{\beta})\big|$. Recall that $|Z_{\pi}/B_{\beta}|\leq 1/2$ and that $\beta_{\bar{\epsilon}}^{-1}$ is $2\ell_{\beta}$-Lipschitz continuous on $[-1/2,1/2]$. It then holds that

W2(pπ1,pπ2)=|βϵ¯1(Zπ1/Bβ)βϵ¯1(Zπ2/Bβ)|2β|Zπ1Zπ2|/Bβ.\displaystyle W_{2}(p_{\pi_{1}},p_{\pi_{2}})=\bigl{|}\beta_{\bar{\epsilon}}^{-1}(Z_{\pi_{1}}/B_{\beta})-\beta_{\bar{\epsilon}}^{-1}(Z_{\pi_{2}}/B_{\beta})\bigr{|}\leq 2\ell_{\beta}\cdot|Z_{\pi_{1}}-Z_{\pi_{2}}|/B_{\beta}. (C.24)
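
Here, the first equality in (C.24) is the closed form of the $W_{2}$ distance between one-dimensional Gaussian distributions with a common variance: for any means $m_{1},m_{2}$,

\displaystyle W_{2}\big(N(m_{1},\bar{\epsilon}^{2}),N(m_{2},\bar{\epsilon}^{2})\big)=|m_{1}-m_{2}|,

since the translation map $T(b)=b+(m_{2}-m_{1})$ pushes $N(m_{1},\bar{\epsilon}^{2})$ forward to $N(m_{2},\bar{\epsilon}^{2})$ at cost $|m_{1}-m_{2}|^{2}$, while any coupling $(X,Y)$ satisfies $\mathbb{E}|X-Y|^{2}\geq|\mathbb{E}[X-Y]|^{2}=|m_{1}-m_{2}|^{2}$ by Jensen's inequality.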

Meanwhile, we have

|Zπ1Zπ2|\displaystyle|Z_{\pi_{1}}-Z_{\pi_{2}}| γφ(s)|Vπ1(s)Vπ2(s)|ds\displaystyle\leq\gamma\cdot\int\varphi(s^{\prime})\cdot\bigl{|}V^{\pi_{1}}(s^{\prime})-V^{\pi_{2}}(s^{\prime})\bigr{|}{\mathrm{d}}s^{\prime}
γφ(s)dssups𝒮|Vπ1(s)Vπ2(s)|\displaystyle\leq\gamma\cdot\int\varphi(s^{\prime}){\mathrm{d}}s^{\prime}\cdot\sup_{s^{\prime}\in{\mathcal{S}}}\bigl{|}V^{\pi_{1}}(s^{\prime})-V^{\pi_{2}}(s^{\prime})\bigr{|}
γM1,φsups𝒮|Vπ1(s)Vπ2(s)|,\displaystyle\leq\gamma\cdot M_{1,\varphi}\cdot\sup_{s^{\prime}\in{\mathcal{S}}}\bigl{|}V^{\pi_{1}}(s^{\prime})-V^{\pi_{2}}(s^{\prime})\bigr{|}, (C.25)

where the last inequality holds by (ii) of Assumption 4.3. Plugging (C.25) into (C.24), it holds for $W_{2}(p_{\pi_{1}},p_{\pi_{2}})$ that

W2(pπ1,pπ2)2βγM1,φsups𝒮|Vπ1(s)Vπ2(s)|Bβ.\displaystyle W_{2}(p_{\pi_{1}},p_{\pi_{2}})\leq\frac{2\ell_{\beta}\cdot\gamma\cdot M_{1,\varphi}\cdot\sup_{s^{\prime}\in{\mathcal{S}}}\bigl{|}V^{\pi_{1}}(s^{\prime})-V^{\pi_{2}}(s^{\prime})\bigr{|}}{B_{\beta}}. (C.26)

Upper Bounding W2(νπ1,νπ2)W_{2}(\nu_{\pi_{1}},\nu_{\pi_{2}}). By definition of νπ\nu_{\pi} in (C.15), we have

νπ1νπ2=Br(Zπ2Zπ1)Zπ1Zπ2μ+γφ(s)ψ(s;)(Vπ1Zπ1Vπ2Zπ2)ds.\displaystyle\nu_{\pi_{1}}-\nu_{\pi_{2}}=\frac{B_{r}(Z_{\pi_{2}}-Z_{\pi_{1}})}{Z_{\pi_{1}}Z_{\pi_{2}}}\mu+\gamma\int\varphi(s^{\prime})\psi(s^{\prime};\cdot)\Big{(}\frac{V^{\pi_{1}}}{Z_{\pi_{1}}}-\frac{V^{\pi_{2}}}{Z_{\pi_{2}}}\Big{)}{\mathrm{d}}s^{\prime}. (C.27)

For Vπ1/Zπ1Vπ2/Zπ2V^{\pi_{1}}/Z_{\pi_{1}}-V^{\pi_{2}}/Z_{\pi_{2}}, it holds that

|Vπ1Zπ1Vπ2Zπ2|\displaystyle\biggl{|}\frac{V^{\pi_{1}}}{Z_{\pi_{1}}}-\frac{V^{\pi_{2}}}{Z_{\pi_{2}}}\biggr{|} max{Vπ1,Vπ2}|Zπ1Zπ2|Zπ1Zπ2+max{Zπ11,Zπ21}|Vπ1Vπ2|\displaystyle\leq\max\{V^{\pi_{1}},V^{\pi_{2}}\}\cdot\frac{|Z_{\pi_{1}}-Z_{\pi_{2}}|}{Z_{\pi_{1}}Z_{\pi_{2}}}+\max\{Z^{-1}_{\pi_{1}},Z^{-1}_{\pi_{2}}\}|V^{\pi_{1}}-V^{\pi_{2}}|
(Br1(1γ)1γM1,φ+Br1)supsS|Vπ1(s)Vπ2(s)|,\displaystyle\leq(B_{r}^{-1}(1-\gamma)^{-1}\gamma M_{1,\varphi}+B_{r}^{-1})\sup_{s^{\prime}\in S}\big{|}V^{\pi_{1}}(s^{\prime})-V^{\pi_{2}}(s^{\prime})\big{|}, (C.28)

where the last inequality holds by noting that

|Zπ2Zπ1Zπ1Zπ2|γM1,φBr2supsS|Vπ1(s)Vπ2(s)|.\displaystyle\bigg{|}\frac{Z_{\pi_{2}}-Z_{\pi_{1}}}{Z_{\pi_{1}}Z_{\pi_{2}}}\bigg{|}\leq\gamma M_{1,\varphi}B_{r}^{-2}\sup_{s^{\prime}\in S}\big{|}V^{\pi_{1}}(s^{\prime})-V^{\pi_{2}}(s^{\prime})\big{|}. (C.29)

Here the inequality in (C.29) holds by (C.25) and the fact that $Z_{\pi}\geq B_{r}$. For $W_{2}(\nu_{\pi_{1}},\nu_{\pi_{2}})$, by Lemma E.5, it holds that

W2(νπ1,νπ2)2νπ1νπ2H˙1(νπ1).\displaystyle W_{2}(\nu_{\pi_{1}},\nu_{\pi_{2}})\leq 2\|\nu_{\pi_{1}}-\nu_{\pi_{2}}\|_{\dot{H}^{-1}(\nu_{\pi_{1}})}. (C.30)

Recall that we have νπ\nu_{\pi} defined in (C.15) that

νπ(w)=Zπ1(Brμ(w)+γφ(s)ψ(s;w)Vπ(s)ds),\displaystyle\nu_{\pi}(w)=Z_{\pi}^{-1}\cdot\Bigl{(}B_{r}\mu(w)+\gamma\cdot\int\varphi(s^{\prime})\psi(s^{\prime};w)V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime}\Bigr{)},
Zπ=Br+γφ(s)Vπ(s)ds\displaystyle Z_{\pi}=B_{r}+\gamma\cdot\int\varphi(s^{\prime})V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime}

where Zπ1(Br+γφ(s)Vπ(s)ds)=1Z_{\pi}^{-1}(B_{r}+\gamma\cdot\int\varphi(s^{\prime})V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime})=1, BrZπ10B_{r}Z_{\pi}^{-1}\geq 0, and γφ(s)Vπ(s)Zπ10\gamma\varphi(s^{\prime})V^{\pi}(s^{\prime})Z_{\pi}^{-1}\geq 0, which indicate that νπ\nu_{\pi} is in the convex hull of μ\mu and ψ(s;)\psi(s^{\prime};\cdot). Hence, by Property (i) of Lemma D.4, it holds that

2νπ1νπ2H˙1(νπ1)2max{supsνπ1νπ2H˙1(ψ(s;)),νπ1νπ2H˙1(μ)}.\displaystyle 2\|\nu_{\pi_{1}}-\nu_{\pi_{2}}\|_{\dot{H}^{-1}(\nu_{\pi_{1}})}\leq 2\max\big{\{}\sup_{s}\|\nu_{\pi_{1}}-\nu_{\pi_{2}}\|_{\dot{H}^{-1}(\psi(s;\cdot))},\|\nu_{\pi_{1}}-\nu_{\pi_{2}}\|_{\dot{H}^{-1}(\mu)}\big{\}}. (C.31)

Furthermore, following from (C.27) and Zπ1(Br+γφ(s)Vπ(s)ds)=1Z_{\pi}^{-1}(B_{r}+\gamma\cdot\int\varphi(s^{\prime})V^{\pi}(s^{\prime}){\mathrm{d}}s^{\prime})=1, by Property (ii) of Lemma D.4, it holds for 2νπ1νπ2H˙1(μ)2\|\nu_{\pi_{1}}-\nu_{\pi_{2}}\|_{\dot{H}^{-1}(\mu)} that

2νπ1νπ2H˙1(μ)\displaystyle 2\|\nu_{\pi_{1}}-\nu_{\pi_{2}}\|_{\dot{H}^{-1}(\mu)} max{sups𝒮μψ(s;)H˙1(μ),sup(s,s′′)𝒮×𝒮ψ(s;)ψ(s′′;)H˙1(μ)}\displaystyle\leq\max\{\sup_{s^{\prime}\in{\mathcal{S}}}\|\mu-\psi(s^{\prime};\cdot)\|_{\dot{H}^{-1}(\mu)},\sup_{(s^{\prime},s^{\prime\prime})\in{\mathcal{S}}\times{\mathcal{S}}}\|\psi(s^{\prime};\cdot)-\psi(s^{\prime\prime};\cdot)\|_{\dot{H}^{-1}(\mu)}\}
(|Br(Zπ2Zπ1)Zπ1Zπ2|+γφ(s)|Vπ1Zπ1Vπ2Zπ2|ds).\displaystyle\quad\cdot\bigg{(}\Big{|}\frac{B_{r}(Z_{\pi_{2}}-Z_{\pi_{1}})}{Z_{\pi_{1}}Z_{\pi_{2}}}\Big{|}+\gamma\int\varphi(s^{\prime})\Big{|}\frac{V^{\pi_{1}}}{Z_{\pi_{1}}}-\frac{V^{\pi_{2}}}{Z_{\pi_{2}}}\Big{|}{\mathrm{d}}s^{\prime}\bigg{)}. (C.32)

Plugging (iii) of Assumption 4.3, (C.28), and (C.29) into (C.32), we have

2νπ1νπ2H˙1(μ)\displaystyle 2\|\nu_{\pi_{1}}-\nu_{\pi_{2}}\|_{\dot{H}^{-1}(\mu)} 𝒢(γM1,φBr1+γM1,φ(Br1(1γ)1γM1,φ+Br1))\displaystyle\leq\mathcal{G}\cdot\Big{(}\gamma M_{1,\varphi}B_{r}^{-1}+\gamma M_{1,\varphi}\big{(}B_{r}^{-1}(1-\gamma)^{-1}\gamma M_{1,\varphi}+B_{r}^{-1}\big{)}\Big{)}
supsS|Vπ1(s)Vπ2(s)|.\displaystyle\quad\cdot\sup_{s^{\prime}\in S}\big{|}V^{\pi_{1}}(s^{\prime})-V^{\pi_{2}}(s^{\prime})\big{|}. (C.33)

By substituting $\|\cdot\|_{\dot{H}^{-1}(\psi(s;\cdot))}$ for $\|\cdot\|_{\dot{H}^{-1}(\mu)}$ in both (C.32) and (C.33), it also holds for $2\|\nu_{\pi_{1}}-\nu_{\pi_{2}}\|_{\dot{H}^{-1}(\psi(s;\cdot))}$ that

2νπ1νπ2H˙1(ψ(s;))\displaystyle 2\|\nu_{\pi_{1}}-\nu_{\pi_{2}}\|_{\dot{H}^{-1}(\psi(s;\cdot))} 𝒢(γM1,φBr1+γM1,φ(Br1(1γ)1γM1,φ+Br1))\displaystyle\leq\mathcal{G}\cdot\Big{(}\gamma M_{1,\varphi}B_{r}^{-1}+\gamma M_{1,\varphi}\big{(}B_{r}^{-1}(1-\gamma)^{-1}\gamma M_{1,\varphi}+B_{r}^{-1}\big{)}\Big{)}
supsS|Vπ1(s)Vπ2(s)|.\displaystyle\quad\cdot\sup_{s^{\prime}\in S}\big{|}V^{\pi_{1}}(s^{\prime})-V^{\pi_{2}}(s^{\prime})\big{|}. (C.34)

Combining (C.30), (C.31), (C.33), and (C.34), we have

W2(νπ1,νπ2)\displaystyle W_{2}(\nu_{\pi_{1}},\nu_{\pi_{2}}) 𝒢(γM1,φBr1+γM1,φ(Br1(1γ)1γM1,φ+Br1))\displaystyle\leq\mathcal{G}\cdot\Big{(}\gamma M_{1,\varphi}B_{r}^{-1}+\gamma M_{1,\varphi}\big{(}B_{r}^{-1}(1-\gamma)^{-1}\gamma M_{1,\varphi}+B_{r}^{-1}\big{)}\Big{)}
supsS|Vπ1(s)Vπ2(s)|.\displaystyle\quad\cdot\sup_{s^{\prime}\in S}\big{|}V^{\pi_{1}}(s^{\prime})-V^{\pi_{2}}(s^{\prime})\big{|}. (C.35)

Upper Bounding W2(ρπ1,ρπ2)W_{2}({\rho}_{\pi_{1}},{\rho}_{\pi_{2}}). Note that we have ρ¯π=νπ×pπ\bar{\rho}_{\pi}=\nu_{\pi}\times p_{\pi} in (C.17). By Lemma D.5, we have

W2(ρ¯π1,ρ¯π2)W22(νπ1,νπ2)+W22(pπ1,pπ2)supsS|Vπ1(s)Vπ2(s)|,\displaystyle W_{2}(\bar{\rho}_{\pi_{1}},\bar{\rho}_{\pi_{2}})\leq\sqrt{W_{2}^{2}(\nu_{\pi_{1}},\nu_{\pi_{2}})+W_{2}^{2}(p_{\pi_{1}},p_{\pi_{2}})}\leq\mathcal{B}^{\prime}\sup_{s^{\prime}\in S}\big{|}V^{\pi_{1}}(s^{\prime})-V^{\pi_{2}}(s^{\prime})\big{|}, (C.36)

where $\mathcal{B}^{\prime}$ depends on the discount factor $\gamma$ and the absolute constants $\ell_{\beta},B_{\beta},B_{r},M_{1,\varphi},\mathcal{G}$ in Assumptions 4.2 and 4.3, according to (C.26) and (C.35). By the performance difference lemma (Kakade and Langford, 2002), we have

|Vπ1(s)Vπ2(s)|\displaystyle\bigl{|}V^{\pi_{1}}(s)-V^{\pi_{2}}(s)\bigr{|} =(1γ)1|𝔼ssπ1[Aπ2(s,),π1(|s)π2(|s)𝒜]|\displaystyle=(1-\gamma)^{-1}\cdot\biggl{|}\mathbb{E}_{s^{\prime}\sim\mathcal{E}_{s}^{\pi_{1}}}\Bigl{[}\big{\langle}A^{\pi_{2}}(s^{\prime},\cdot),\pi_{1}(\cdot{\,|\,}s^{\prime})-\pi_{2}(\cdot{\,|\,}s^{\prime})\big{\rangle}_{\mathcal{A}}\Bigr{]}\biggr{|}
=(1γ)1|𝔼ssπ1[Qπ2(s,),π1(|s)π2(|s)𝒜]|\displaystyle=(1-\gamma)^{-1}\cdot\biggl{|}\mathbb{E}_{s^{\prime}\sim\mathcal{E}_{s}^{\pi_{1}}}\Bigl{[}\big{\langle}Q^{\pi_{2}}(s^{\prime},\cdot),\pi_{1}(\cdot{\,|\,}s^{\prime})-\pi_{2}(\cdot{\,|\,}s^{\prime})\big{\rangle}_{\mathcal{A}}\Bigr{]}\biggr{|}
(1γ)2Br𝔼ssπ1[π1(|s)π2(|s)1].\displaystyle\leq(1-\gamma)^{-2}\cdot B_{r}\cdot\mathbb{E}_{s^{\prime}\sim\mathcal{E}_{s}^{\pi_{1}}}\Big{[}\big{\|}\pi_{1}(\cdot{\,|\,}s^{\prime})-\pi_{2}(\cdot{\,|\,}s^{\prime})\big{\|}_{1}\Big{]}. (C.37)

Here the second equality holds since $A^{\pi_{2}}(s^{\prime},\cdot)=Q^{\pi_{2}}(s^{\prime},\cdot)-V^{\pi_{2}}(s^{\prime})$ and $\langle V^{\pi_{2}}(s^{\prime}),\pi_{1}(\cdot{\,|\,}s^{\prime})-\pi_{2}(\cdot{\,|\,}s^{\prime})\rangle_{\mathcal{A}}=0$, as $\pi_{1}(\cdot{\,|\,}s^{\prime})$ and $\pi_{2}(\cdot{\,|\,}s^{\prime})$ are both probability distributions over $\mathcal{A}$, and the inequality follows from $|Q^{\pi}(s,a)|\leq(1-\gamma)^{-1}\cdot B_{r}$. We let $\mathcal{B}=\mathcal{B}^{\prime}B_{r}(1-\gamma)^{-2}$. Plugging (C.37) into (C.36), we have

W2(ρ¯π1,ρ¯π2)sups𝒮𝔼ssπ1[π1(|s)π2(|s)1].\displaystyle W_{2}(\bar{\rho}_{\pi_{1}},\bar{\rho}_{\pi_{2}})\leq\mathcal{B}\cdot\sup_{s\in{\mathcal{S}}}\mathbb{E}_{s^{\prime}\sim\mathcal{E}_{s}^{\pi_{1}}}\Big{[}\big{\|}\pi_{1}(\cdot{\,|\,}s^{\prime})-\pi_{2}(\cdot{\,|\,}s^{\prime})\big{\|}_{1}\Big{]}.

Recall that we have ρπ=ρ¯π+(1α1)(ρ0ρ¯π)\rho_{\pi}=\bar{\rho}_{\pi}+(1-\alpha^{-1})(\rho_{0}-\bar{\rho}_{\pi}) in (C.18). Then, by Lemma D.6, it holds that

W2(ρπ1,ρπ2)α12W2(ρ¯π1,ρ¯π2)α12sups𝒮𝔼ssπ1[π1(|s)π2(|s)1],\displaystyle W_{2}(\rho_{\pi_{1}},\rho_{\pi_{2}})\leq\alpha^{-\frac{1}{2}}W_{2}(\bar{\rho}_{\pi_{1}},\bar{\rho}_{\pi_{2}})\leq\alpha^{-\frac{1}{2}}\mathcal{B}\cdot\sup_{s\in{\mathcal{S}}}\mathbb{E}_{s^{\prime}\sim\mathcal{E}_{s}^{\pi_{1}}}\Big{[}\big{\|}\pi_{1}(\cdot{\,|\,}s^{\prime})-\pi_{2}(\cdot{\,|\,}s^{\prime})\big{\|}_{1}\Big{]},

which completes the proof of Lemma B.1. ∎

Appendix D Technical Results

In this section, we state and prove several technical results used in the proofs of the main theorems and lemmas.

Lemma D.1.

Under Assumptions 4.3 and 4.2, it holds for any ρ𝒫2(D)\rho\in\mathscr{P}_{2}(\mathbb{R}^{D}) that

supx𝒳|Q(x;ρ)|\displaystyle\sup_{x\in\mathcal{X}}\bigl{|}Q(x;\rho)\bigr{|} αB1W2(ρ,ρ0),\displaystyle\leq\alpha\cdot B_{1}\cdot W_{2}(\rho,\rho_{0}), (D.1)
supθDθg(θ;ρ)F\displaystyle\sup_{\theta\in\mathbb{R}^{D}}\big{\|}\nabla_{\theta}g(\theta;\rho)\big{\|}_{\rm F} α1B2(2αB1W2(ρ,ρ0)+Br).\displaystyle\leq\alpha^{-1}\cdot B_{2}\cdot\big{(}2\alpha\cdot B_{1}\cdot W_{2}(\rho,\rho_{0})+B_{r}\big{)}. (D.2)
Proof.

Following from Assumptions 4.3 and 4.2, we have that θσ(x;θ)B1\|\nabla_{\theta}\sigma(x;\theta)\|\leq B_{1} for any x𝒳x\in\mathcal{X} and θD\theta\in\mathbb{R}^{D}, which implies that Lip(σ(x;)/B1)1{\rm Lip}(\sigma(x;\cdot)/B_{1})\leq 1 for any x𝒳x\in\mathcal{X}. Note that Q(x;ρ0)=0Q(x;\rho_{0})=0 for any x𝒳x\in\mathcal{X}. Thus, by (A.7) and the inequality between W1W_{1} distance and W2W_{2} distance (Villani, 2003), we have for any ρ𝒫2(D)\rho\in\mathscr{P}_{2}(\mathbb{R}^{D}) and x𝒳x\in\mathcal{X} that

|Q(x;ρ)|\displaystyle\bigl{|}Q(x;\rho)\bigr{|} =α|σ(x;θ)d(ρρ0)(θ)|αB1W1(ρ,ρ0)αB1W2(ρ,ρ0).\displaystyle=\alpha\cdot\biggl{|}\int\sigma(x;\theta)\cdot\,{\mathrm{d}}(\rho-\rho_{0})(\theta)\biggr{|}\leq\alpha\cdot B_{1}\cdot W_{1}(\rho,\rho_{0})\leq\alpha\cdot B_{1}\cdot W_{2}(\rho,\rho_{0}). (D.3)

This completes the proof of (D.1) in Lemma D.1. Following from the definition of $g$ in (3.6), we have for any $\theta\in\mathbb{R}^{D}$ and $\rho\in\mathscr{P}_{2}(\mathbb{R}^{D})$ that

θg(θ;ρ)F\displaystyle\bigl{\|}\nabla_{\theta}g(\theta;\rho)\bigr{\|}_{\rm F} α1𝔼𝒟~[|Q(x;ρ)rγQ(x;ρ)|θθ2σ(x;θ)F]\displaystyle\leq\alpha^{-1}\cdot\mathbb{E}_{\widetilde{\mathcal{D}}}\Bigl{[}\bigl{|}Q(x;\rho)-r-\gamma\cdot Q(x^{\prime};\rho)\bigr{|}\cdot\big{\|}\nabla_{\theta\theta}^{2}\sigma(x;\theta)\big{\|}_{\rm F}\Bigr{]}
α1B2(2αB1W2(ρ,ρ0)+Br).\displaystyle\leq\alpha^{-1}\cdot B_{2}\cdot\big{(}2\alpha\cdot B_{1}\cdot W_{2}(\rho,\rho_{0})+B_{r}\big{)}.

Here the last inequality follows from (D.3) and the fact that $\|\nabla_{\theta\theta}^{2}\sigma(x;\theta)\|_{\rm F}\leq B_{2}$ for any $x\in\mathcal{X}$ and $\theta\in\mathbb{R}^{D}$, which follows from Assumptions 4.3 and 4.2. Thus, we complete the proof of Lemma D.1. ∎

Lemma D.2.

For ρ0,ρt,ρπt𝒫2(D)\rho_{0},\rho_{t},\rho_{\pi_{t}}\in\mathscr{P}_{2}(\mathbb{R}^{D}) and the geodesic αt[0,1]\alpha_{t}^{[0,1]} connecting ρt\rho_{t} and ρπt\rho_{\pi_{t}}, we have

sups[0,1]W2(αts,ρ0)2max{W2(ρπt,ρ0),W2(ρt,ρ0)}.\displaystyle\sup_{s\in[0,1]}W_{2}(\alpha_{t}^{s},\rho_{0})\leq 2\max\{W_{2}(\rho_{\pi_{t}},\rho_{0}),W_{2}(\rho_{t},\rho_{0})\}. (D.4)
Proof.

We give a proof by contradiction. Note that αts\alpha_{t}^{s} is the geodesic connecting ρπt\rho_{\pi_{t}} and ρt\rho_{t}. Assume there exists tt such that

sups[0,1]W2(αts,ρ0)>2max{W2(ρπt,ρ0),W2(ρt,ρ0)}.\sup_{s\in[0,1]}W_{2}(\alpha_{t}^{s},\rho_{0})>2\max\{W_{2}(\rho_{\pi_{t}},\rho_{0}),W_{2}(\rho_{t},\rho_{0})\}.

Then, there exists $s\in[0,1]$ such that $W_{2}(\alpha_{t}^{s},\rho_{0})>2\max\{W_{2}(\rho_{\pi_{t}},\rho_{0}),W_{2}(\rho_{t},\rho_{0})\}$. According to the triangle inequality for the $W_{2}$ metric (Villani, 2008), we have

W2(ρt,αts)|W2(αts,ρ0)W2(ρt,ρ0)|>|2W2(ρt,ρ0)W2(ρt,ρ0)|=W2(ρt,ρ0),W_{2}(\rho_{t},\alpha_{t}^{s})\geq|W_{2}(\alpha_{t}^{s},\rho_{0})-W_{2}(\rho_{t},\rho_{0})|>|2W_{2}(\rho_{t},\rho_{0})-W_{2}(\rho_{t},\rho_{0})|=W_{2}(\rho_{t},\rho_{0}),

and $W_{2}(\rho_{\pi_{t}},\alpha_{t}^{s})>W_{2}(\rho_{\pi_{t}},\rho_{0})$ by the same argument, which contradicts the definition of the geodesic, namely that $W_{2}(\rho_{t},\alpha_{t}^{s})+W_{2}(\alpha_{t}^{s},\rho_{\pi_{t}})=W_{2}(\rho_{t},\rho_{\pi_{t}})\leq W_{2}(\rho_{t},\rho_{0})+W_{2}(\rho_{\pi_{t}},\rho_{0})$. Hence, such $t$ does not exist and (D.4) holds. Thus, we complete the proof of Lemma D.2. ∎

Lemma D.3.

The Chi-squared divergence has the following properties.

  • (i)

For any probability measure $g\in\mathscr{P}(\Theta)$, function $f:\Theta\rightarrow\mathbb{R}$ with $\int f(\theta){\mathrm{d}}\theta=1$, and $\alpha\in\mathbb{R}$, we have

    χ2(αfg)=α2χ2(fg)+(1α)2.\chi^{2}(\alpha f\,\|\,g)=\alpha^{2}\chi^{2}(f\,\|\,g)+(1-\alpha)^{2}.
  • (ii)

For any probability measures $g_{1}\in\mathscr{P}(\Theta_{1})$ and $g_{2}\in\mathscr{P}(\Theta_{2})$, and functions $f_{1}:\Theta_{1}\rightarrow\mathbb{R}$ and $f_{2}:\Theta_{2}\rightarrow\mathbb{R}$ with $\int f_{1}(\theta_{1}){\mathrm{d}}\theta_{1}=\int f_{2}(\theta_{2}){\mathrm{d}}\theta_{2}=1$, we have

    χ2(f1×f2g1×g2)=χ2(f1g1)χ2(f2g2)+χ2(f1g1)+χ2(f2g2),\chi^{2}(f_{1}\times f_{2}\,\|\,g_{1}\times g_{2})=\chi^{2}(f_{1}\,\|\,g_{1})\cdot\chi^{2}(f_{2}\,\|\,g_{2})+\chi^{2}(f_{1}\,\|\,g_{1})+\chi^{2}(f_{2}\,\|\,g_{2}),

    where f1×f2f_{1}\times f_{2} is the product of f1f_{1} and f2f_{2}, and g1×g2g_{1}\times g_{2} is the product measure of g1g_{1} and g2g_{2}, i.e., (f1×f2)(θ1×θ2)=f1(θ1)f2(θ2)(f_{1}\times f_{2})(\theta_{1}\times\theta_{2})=f_{1}(\theta_{1})f_{2}(\theta_{2}) and (g1×g2)(θ1×θ2)=g1(θ1)g2(θ2)(g_{1}\times g_{2})(\theta_{1}\times\theta_{2})=g_{1}(\theta_{1})g_{2}(\theta_{2}), respectively.

  • (iii)

    For any probability measure g𝒫(Θ)g\in\mathscr{P}(\Theta), functions f1:Θf_{1}:\Theta\rightarrow\mathbb{R} and f2:Θf_{2}:\Theta\rightarrow\mathbb{R}, we have

    χ2(f1+f2g)3(χ2(f1g)+χ2(f2g)+1).\chi^{2}(f_{1}+f_{2}\,\|\,g)\leq 3\big{(}\chi^{2}(f_{1}\,\|\,g)+\chi^{2}(f_{2}\,\|\,g)+1\big{)}.
  • (iv)

For any probability measure $g\in\mathscr{P}(\Theta)$, function $f:\mathcal{X}\times\Theta\rightarrow\mathbb{R}$ with $\int f(x,\theta){\mathrm{d}}\theta=1$ for any $x\in\mathcal{X}$, and function $\alpha:\mathcal{X}\rightarrow\mathbb{R}$, we have

\chi^{2}\Bigl(\int\alpha(x)f(x,\cdot){\mathrm{d}}x\,\Big\|\,g\Bigr)\leq\int\alpha(x)^{2}{\mathrm{d}}x\cdot\int\chi^{2}(f(x,\cdot)\,\|\,g){\mathrm{d}}x+\Big(\int\alpha(x){\mathrm{d}}x-1\Big)^{2}.
Proof.

Proof of Property (i) of Lemma D.3. By definition of the Chi-squared divergence, we have

χ2(αfg)\displaystyle\chi^{2}(\alpha f\,\|\,g) =(αfg1)2dg\displaystyle=\int\Big{(}\frac{\alpha f}{g}-1\Big{)}^{2}{\mathrm{d}}g
=(α(fg1)+α1)2dg\displaystyle=\int\bigg{(}\alpha\Big{(}\frac{f}{g}-1\Big{)}+\alpha-1\bigg{)}^{2}{\mathrm{d}}g
=α2χ2(fg)+(α1)2.\displaystyle=\alpha^{2}\chi^{2}(f\,\|\,g)+(\alpha-1)^{2}.

Here the last equality holds since the cross term $2\alpha(\alpha-1)\int(f/g-1){\mathrm{d}}g$ vanishes, as $\int(f/g-1){\mathrm{d}}g=\int f(\theta){\mathrm{d}}\theta-1=0$. Thus, we complete the proof of Property (i) of Lemma D.3. Proof of Property (ii) of Lemma D.3. Let $\widetilde{f}_{1}=f_{1}/g_{1}-1$ and $\widetilde{f}_{2}=f_{2}/g_{2}-1$. It then holds that

χ2(f1×f2g1×g2)\displaystyle\chi^{2}(f_{1}\times f_{2}\,\|\,g_{1}\times g_{2}) =(f1×f2g1×g21)2d(g1×g2)=(f~1×f~2+f~1+f~2)2d(g1×g2).\displaystyle=\int\Big{(}\frac{f_{1}\times f_{2}}{g_{1}\times g_{2}}-1\Big{)}^{2}{\mathrm{d}}(g_{1}\times g_{2})=\int(\widetilde{f}_{1}\times\widetilde{f}_{2}+\widetilde{f}_{1}+\widetilde{f}_{2})^{2}{\mathrm{d}}(g_{1}\times g_{2}).

By further noting that f~1dg1=0\int\widetilde{f}_{1}{\mathrm{d}}g_{1}=0, f~2dg2=0\int\widetilde{f}_{2}{\mathrm{d}}g_{2}=0, f~12dg1=χ2(f1g1)\int\widetilde{f}_{1}^{2}{\mathrm{d}}g_{1}=\chi^{2}(f_{1}\,\|\,g_{1}), and f~22dg2=χ2(f2g2)\int\widetilde{f}_{2}^{2}{\mathrm{d}}g_{2}=\chi^{2}(f_{2}\,\|\,g_{2}), we have

χ2(f1×f2g1×g2)\displaystyle\chi^{2}(f_{1}\times f_{2}\,\|\,g_{1}\times g_{2})
=(f~12×f~22+f~12+f~22+2(f~1)2×f~2+2(f~2)2×f~1+2f~1×f~2)d(g1×g2)\displaystyle\quad=\int\big{(}{\widetilde{f}}_{1}^{2}\times{\widetilde{f}}_{2}^{2}+{\widetilde{f}}_{1}^{2}+{\widetilde{f}}_{2}^{2}+2(\widetilde{f}_{1})^{2}\times\widetilde{f}_{2}+2(\widetilde{f}_{2})^{2}\times\widetilde{f}_{1}+2\widetilde{f}_{1}\times\widetilde{f}_{2}\big{)}{\mathrm{d}}(g_{1}\times g_{2})
=χ2(f1g1)χ2(f2g2)+χ2(f1g1)+χ2(f2g2).\displaystyle\quad=\chi^{2}(f_{1}\,\|\,g_{1})\cdot\chi^{2}(f_{2}\,\|\,g_{2})+\chi^{2}(f_{1}\,\|\,g_{1})+\chi^{2}(f_{2}\,\|\,g_{2}).

Thus, we complete the proof of Property (ii) of Lemma D.3. Proof of Property (iii) of Lemma D.3. Let $\widetilde{f}_{1}=f_{1}/g-1$ and $\widetilde{f}_{2}=f_{2}/g-1$. It then holds that

χ2(f1+f2g)=(f1+f2g1)2dg=(f~1+f~2+1)2dg.\displaystyle\chi^{2}(f_{1}+f_{2}\,\|\,g)=\int\Big{(}\frac{f_{1}+f_{2}}{g}-1\Big{)}^{2}{\mathrm{d}}g=\int(\widetilde{f}_{1}+\widetilde{f}_{2}+1)^{2}{\mathrm{d}}g.

By further noting that $\int\widetilde{f}_{1}^{2}{\mathrm{d}}g=\chi^{2}(f_{1}\,\|\,g)$ and $\int\widetilde{f}_{2}^{2}{\mathrm{d}}g=\chi^{2}(f_{2}\,\|\,g)$, we have

χ2(f1+f2g)\displaystyle\chi^{2}(f_{1}+f_{2}\,\|\,g) 3((f~1)2+(f~2)2+1)dg\displaystyle\leq 3\int\big{(}(\widetilde{f}_{1})^{2}+(\widetilde{f}_{2})^{2}+1\big{)}{\mathrm{d}}g
=3(χ2(f1g)+χ2(f2g)+1),\displaystyle=3\big{(}\chi^{2}(f_{1}\,\|\,g)+\chi^{2}(f_{2}\,\|\,g)+1\big{)},

where the inequality follows from the elementary bound $(a+b+c)^{2}\leq 3(a^{2}+b^{2}+c^{2})$, a consequence of the Cauchy-Schwarz inequality. Thus, we complete the proof of Property (iii) of Lemma D.3. Proof of Property (iv) of Lemma D.3. Let $\widetilde{f}(x,\cdot)=f(x,\cdot)/g(\cdot)-1$. It then holds that

χ2(α(x)f(x,)dxg)\displaystyle\chi^{2}\Big{(}\int\alpha(x)f(x,\cdot){\mathrm{d}}x\,\big{\|}\,g\Big{)} =(α(x)f~(x,θ)dx)2g(dθ)+(α(x)dx1)2\displaystyle=\int\Big{(}\int\alpha(x)\widetilde{f}(x,\theta){\mathrm{d}}x\Big{)}^{2}g({\mathrm{d}}\theta)+\Big{(}\int\alpha(x){\mathrm{d}}x-1\Big{)}^{2}
(α(x)2dxf~(x,θ)2dx)g(dθ)+(α(x)dx1)2\displaystyle\leq\int\Big{(}\int\alpha(x)^{2}{\mathrm{d}}x\cdot\int{\widetilde{f}}(x,\theta)^{2}{\mathrm{d}}x\Big{)}g({\mathrm{d}}\theta)+\Big{(}\int\alpha(x){\mathrm{d}}x-1\Big{)}^{2}
=α(x)2dxχ2(f(x,)g)dx+(α(x)dx1)2,\displaystyle=\int\alpha(x)^{2}{\mathrm{d}}x\cdot\int\chi^{2}(f(x,\cdot)\,\|\,g){\mathrm{d}}x+\Big{(}\int\alpha(x){\mathrm{d}}x-1\Big{)}^{2},

where the first equality holds by noting that f~(x,)dg=0\int\widetilde{f}(x,\cdot){\mathrm{d}}g=0 and the inequality follows from the Cauchy-Schwarz inequality. Thus, we complete the proof of Property (iv) of Lemma D.3. ∎

Lemma D.4.

For the weighted homogeneous Sobolev norm defined by

\displaystyle\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu)}=\sup\big\{|\langle f,\nu_{1}-\nu_{2}\rangle|\,\big|\,\|f\|_{\dot{H}^{1}(\mu)}\leq 1\big\},

we have the following properties.

  • (i)

    For a group of probability measures μx:𝒳𝒫2(Θ)\mu_{x}:\mathcal{X}\rightarrow\mathscr{P}_{2}(\Theta) and ν1,ν2𝒫2(Θ)\nu_{1},\nu_{2}\in\mathscr{P}_{2}(\Theta), if μ\mu is in the convex hull of μx\mu_{x}, i.e., there exists αx0\alpha_{x}\geq 0 such that both αxdx=1\int\alpha_{x}{\mathrm{d}}x=1 and μ=αxμxdx\mu=\int\alpha_{x}\mu_{x}{\mathrm{d}}x hold, it then holds that

    ν1ν2H˙1(μ)supx𝒳ν1ν2H˙1(μx).\displaystyle\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu)}\leq\sup_{x\in\mathcal{X}}\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu_{x})}.
  • (ii)

    Assume that we have measures μ𝒫2(Θ)\mu\in\mathscr{P}_{2}(\Theta) and νx:𝒳𝒫2(Θ)\nu_{x}:\mathcal{X}\rightarrow\mathscr{P}_{2}(\Theta). Let β1,β2:𝒳\beta_{1},\beta_{2}:\mathcal{X}\rightarrow\mathbb{R} be two functions on 𝒳\mathcal{X} such that β1(x)dx=β2(x)dx\int\beta_{1}(x){\mathrm{d}}x=\int\beta_{2}(x){\mathrm{d}}x. Then, by letting ν1=νxβ1(x)dx\nu_{1}=\int\nu_{x}\beta_{1}(x){\mathrm{d}}x and ν2=νxβ2(x)dx\nu_{2}=\int\nu_{x}\beta_{2}(x){\mathrm{d}}x, we have

    ν1ν2H˙1(μ)12sup(x,x′′)𝒳×𝒳νxνx′′H˙1(μ)|β1(x)β2(x)|dx.\displaystyle\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu)}\leq\frac{1}{2}\sup_{(x^{\prime},x^{\prime\prime})\in\mathcal{X}\times\mathcal{X}}\|\nu_{x^{\prime}}-\nu_{x^{\prime\prime}}\|_{\dot{H}^{-1}(\mu)}\cdot\int\big{|}\beta_{1}(x)-\beta_{2}(x)\big{|}{\mathrm{d}}x.
Proof.

Proof of Property (i) of Lemma D.4. By definition of the weighted homogeneous Sobolev norm, we have

ν1ν2H˙1(μ)\displaystyle\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu)} =supf{|f,ν1ν2|||f|2αxdμxdx1}\displaystyle=\sup_{f}\bigg{\{}\big{|}\langle f,\nu_{1}-\nu_{2}\rangle\big{|}\bigg{|}\int|\nabla f|^{2}\alpha_{x}{\mathrm{d}}\mu_{x}{\mathrm{d}}x\leq 1\bigg{\}}
=supf,λx0{|f,ν1ν2|||f|2dμxλx,x;αxλxdx=1}\displaystyle=\sup_{f,\lambda_{x}\geq 0}\bigg{\{}\big{|}\langle f,\nu_{1}-\nu_{2}\rangle\big{|}\bigg{|}\int|\nabla f|^{2}{\mathrm{d}}\mu_{x}\leq\lambda_{x},\forall x;\int\alpha_{x}\lambda_{x}{\mathrm{d}}x=1\bigg{\}} (D.5)
supλxαxdx=1,λx0infxsupfx{|fx,ν1ν2|||fx|2dμxλx},\displaystyle\leq\sup_{\begin{subarray}{c}\int\lambda_{x}\alpha_{x}{\mathrm{d}}x=1,\\ \lambda_{x}\geq 0\end{subarray}}\inf_{x}\sup_{f_{x}}\bigg{\{}\big{|}\langle f_{x},\nu_{1}-\nu_{2}\rangle\big{|}\bigg{|}\int|\nabla f_{x}|^{2}{\mathrm{d}}\mu_{x}\leq\lambda_{x}\bigg{\}}, (D.6)

where the first equality holds by noting that $\mu=\int\alpha_{x}\mu_{x}{\mathrm{d}}x$. To see the inequality in (D.6), denote by $\mathscr{F}$ and $\mathscr{F}_{x}$ the function classes over which $f$ in (D.5) and $f_{x}$ in (D.6) range, respectively. Note that $\mathscr{F}\subseteq\mathscr{F}_{x}$ for any $x$, since the constraints on each $f_{x}$ in (D.6) are a relaxation of the constraints on $f$ in (D.5). Hence, the supremum taken over $\mathscr{F}$ is no larger than the supremum over $\mathscr{F}_{x}$ for any $x$, and therefore no larger than the smallest such supremum. Thus, (D.6) holds. Furthermore, we have

ν1ν2H˙1(μ)\displaystyle\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu)} supλxαxdx=1,λx0infxsupfx{|fx,ν1ν2|||fx|2dμxλx}\displaystyle\leq\sup_{\begin{subarray}{c}\int\lambda_{x}\alpha_{x}{\mathrm{d}}x=1,\\ \lambda_{x}\geq 0\end{subarray}}\inf_{x}\sup_{f_{x}}\bigg{\{}\big{|}\langle f_{x},\nu_{1}-\nu_{2}\rangle\big{|}\bigg{|}\int|\nabla f_{x}|^{2}{\mathrm{d}}\mu_{x}\leq\lambda_{x}\bigg{\}} (D.7)
=supλxαxdx=1,λx0infx{λxν1ν2H˙1(μx)}\displaystyle=\sup_{\begin{subarray}{c}\int\lambda_{x}\alpha_{x}{\mathrm{d}}x=1,\\ \lambda_{x}\geq 0\end{subarray}}\inf_{x}\big{\{}\sqrt{\lambda_{x}}\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu_{x})}\big{\}} (D.8)
supλxαxdx=1,λx0infx{λx}supxν1ν2H˙1(μx),\displaystyle\leq\sup_{\begin{subarray}{c}\int\lambda_{x}\alpha_{x}{\mathrm{d}}x=1,\\ \lambda_{x}\geq 0\end{subarray}}\inf_{x}\big{\{}\sqrt{\lambda_{x}}\big{\}}\cdot\sup_{x}\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu_{x})},

where the first equality holds since rescaling $f_{x}$ by $\lambda_{x}^{-1/2}$ turns the constraint $\int|\nabla f_{x}|^{2}{\mathrm{d}}\mu_{x}\leq\lambda_{x}$ in (D.7) into the unit-norm constraint defining $\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu_{x})}$, which yields (D.8). Here, we let

y=supλxαxdx=1,λx0infx{λx}.y=\sup_{\begin{subarray}{c}\int\lambda_{x}\alpha_{x}{\mathrm{d}}x=1,\\ \lambda_{x}\geq 0\end{subarray}}\inf_{x}\big{\{}\sqrt{\lambda_{x}}\big{\}}.

Then, it holds that $y\leq 1$. Otherwise, there must exist $\lambda_{x}$ such that $\int\lambda_{x}\alpha_{x}{\mathrm{d}}x=1$ and $\lambda_{x}>1$ for any $x\in\mathcal{X}$, which contradicts the conditions that $\int\alpha_{x}{\mathrm{d}}x=1$ and $\alpha_{x}\geq 0$. Therefore, it further holds that

ν1ν2H˙1(μ)ysupxν1ν2H˙1(μx)supxν1ν2H˙1(μx).\displaystyle\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu)}\leq y\cdot\sup_{x}\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu_{x})}\leq\sup_{x}\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu_{x})}.

Thus, we complete the proof of Property (i) of Lemma D.4. Proof of Property (ii) of Lemma D.4. Let α=β1β2\alpha=\beta_{1}-\beta_{2}. Then, we have α(x)dx=0\int\alpha(x){\mathrm{d}}x=0. Let α+=max{0,α}\alpha^{+}=\max\{0,\alpha\} and α=min{0,α}\alpha^{-}=-\min\{0,\alpha\}. Then, we have α+α=α=β1β2\alpha^{+}-\alpha^{-}=\alpha=\beta_{1}-\beta_{2} and that α++α=|α|=|β1β2|\alpha^{+}+\alpha^{-}=|\alpha|=|\beta_{1}-\beta_{2}|. Since α=0\int\alpha=0, it holds that α+(x)dx=α(x)dx=A\int\alpha^{+}(x){\mathrm{d}}x=\int\alpha^{-}(x){\mathrm{d}}x=A, where A0A\geq 0. We further let λ(x,x)=A1α+(x)α(x)\lambda(x,x^{\prime})=A^{-1}\alpha^{+}(x)\alpha^{-}(x^{\prime}). Then, it holds that α+(x)=λ(x,x)dx\alpha^{+}(x)=\int\lambda(x,x^{\prime}){\mathrm{d}}x^{\prime} and that α(x)=λ(x,x)dx\alpha^{-}(x^{\prime})=\int\lambda(x,x^{\prime}){\mathrm{d}}x. Therefore, we have

ν1ν2H˙1(μ)=𝒳(α+α)νxdxH˙1(μ)=𝒳×𝒳λ(x,x)(νxνx)dxdxH˙1(μ).\displaystyle\|\nu_{1}-\nu_{2}\|_{\dot{H}^{-1}(\mu)}=\Big{\|}\int_{\mathcal{X}}(\alpha^{+}-\alpha^{-})\nu_{x}{\mathrm{d}}x\Big{\|}_{\dot{H}^{-1}(\mu)}=\Big{\|}\int_{\mathcal{X}\times\mathcal{X}}\lambda(x,x^{\prime})(\nu_{x}-\nu_{x^{\prime}}){\mathrm{d}}x{\mathrm{d}}x^{\prime}\Big{\|}_{\dot{H}^{-1}(\mu)}.

By definition of the weighted homogeneous Sobolev norm, it further holds that

𝒳×𝒳λ(x,x)(νxνx)dxdxH˙1(μ)\displaystyle\Big{\|}\int_{\mathcal{X}\times\mathcal{X}}\lambda(x,x^{\prime})(\nu_{x}-\nu_{x^{\prime}}){\mathrm{d}}x{\mathrm{d}}x^{\prime}\Big{\|}_{\dot{H}^{-1}(\mu)}
\displaystyle\quad=\sup_{f}\bigg\{\Big|\big\langle\int\lambda(x,x^{\prime})(\nu_{x}-\nu_{x^{\prime}}){\mathrm{d}}x{\mathrm{d}}x^{\prime},f\big\rangle\Big|\,\bigg|\,\int|\nabla f|^{2}{\mathrm{d}}\mu\leq 1\bigg\}. (D.9)

We assume that the supremum in (D.9) is attained at $f^{*}$ (otherwise, one may pass to a maximizing sequence). Then, we have $\int|\nabla f^{*}|^{2}{\mathrm{d}}\mu\leq 1$ and that

𝒳×𝒳λ(x,x)(νxνx)dxdxH˙1(μ)\displaystyle\Big{\|}\int_{\mathcal{X}\times\mathcal{X}}\lambda(x,x^{\prime})(\nu_{x}-\nu_{x^{\prime}}){\mathrm{d}}x{\mathrm{d}}x^{\prime}\Big{\|}_{\dot{H}^{-1}(\mu)} =|λ(x,x)(νxνx)dxdx,f|\displaystyle=\Big{|}\big{\langle}\int\lambda(x,x^{\prime})(\nu_{x}-\nu_{x^{\prime}}){\mathrm{d}}x{\mathrm{d}}x^{\prime},f^{*}\big{\rangle}\Big{|}
sup(x,x)𝒳×𝒳|νxνx,f|λ(x,x)dxdx\displaystyle\leq\sup_{(x,x^{\prime})\in\mathcal{X}\times\mathcal{X}}\big{|}\langle\nu_{x}-\nu_{x^{\prime}},f^{*}\rangle\big{|}\cdot\int\lambda(x,x^{\prime}){\mathrm{d}}x{\mathrm{d}}x^{\prime}
12sup(x,x)𝒳×𝒳νxνxH˙1(μ)|β1β2|dx,\displaystyle\leq\frac{1}{2}\sup_{(x,x^{\prime})\in\mathcal{X}\times\mathcal{X}}\|\nu_{x}-\nu_{x^{\prime}}\|_{\dot{H}^{-1}(\mu)}\cdot\int|\beta_{1}-\beta_{2}|{\mathrm{d}}x,

where the last inequality holds by noting that λ(x,x)dxdx=A=1/2|α|dx=1/2|β1β2|dx\int\lambda(x,x^{\prime}){\mathrm{d}}x{\mathrm{d}}x^{\prime}=A=1/2\cdot\int|\alpha|{\mathrm{d}}x=1/2\cdot\int|\beta_{1}-\beta_{2}|{\mathrm{d}}x. Thus, we complete the proof of Property (ii) of Lemma D.4. ∎

Lemma D.5.

For probability measures ν1,ν2𝒫2(Θ1)\nu_{1},\nu_{2}\in\mathscr{P}_{2}(\Theta_{1}) and p1,p2𝒫2(Θ2)p_{1},p_{2}\in\mathscr{P}_{2}(\Theta_{2}), it holds that

W22(ν1×p1,ν2×p2)W22(ν1,ν2)+W22(p1,p2).\displaystyle W_{2}^{2}(\nu_{1}\times p_{1},\nu_{2}\times p_{2})\leq W_{2}^{2}(\nu_{1},\nu_{2})+W_{2}^{2}(p_{1},p_{2}).
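
For instance, for Gaussian factors $\nu_{i}=N(m_{i},\sigma^{2})$ and $p_{i}=N(a_{i},\tau^{2})$, both sides coincide:

\displaystyle W_{2}^{2}(\nu_{1}\times p_{1},\nu_{2}\times p_{2})=(m_{1}-m_{2})^{2}+(a_{1}-a_{2})^{2}=W_{2}^{2}(\nu_{1},\nu_{2})+W_{2}^{2}(p_{1},p_{2}),

since all four Gaussians share a common covariance and the $W_{2}$ distance between Gaussians with equal covariance equals the distance between their means.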
Proof.

By the property of optimal transport (Ambrosio et al., 2008), there exist mappings $T_{\nu}$ and $T_{p}$ such that

\displaystyle W_{2}^{2}(\nu_{1},\nu_{2})=\int\|\theta_{1}-T_{\nu}(\theta_{1})\|^{2}\nu_{1}({\mathrm{d}}\theta_{1}),
\displaystyle W_{2}^{2}(p_{1},p_{2})=\int\|\theta_{2}-T_{p}(\theta_{2})\|^{2}p_{1}({\mathrm{d}}\theta_{2}),

and that (Tν)ν1=ν2(T_{\nu})_{\sharp}\nu_{1}=\nu_{2} and (Tp)p1=p2(T_{p})_{\sharp}p_{1}=p_{2}. Note that we have (Tν×Tp)(ν1×p1)=ν2×p2(T_{\nu}\times T_{p})_{\sharp}(\nu_{1}\times p_{1})=\nu_{2}\times p_{2}. Thus, by definition of W2W_{2} distance, it holds that

\displaystyle W_{2}^{2}(\nu_{1}\times p_{1},\nu_{2}\times p_{2})\leq\int\big\|(\theta_{1},\theta_{2})-\big(T_{\nu}(\theta_{1}),T_{p}(\theta_{2})\big)\big\|^{2}\nu_{1}({\mathrm{d}}\theta_{1})p_{1}({\mathrm{d}}\theta_{2})
=(θ1Tν(θ1)2+θ2Tp(θ2)2)ν1(dθ1)p1(dθ2)\displaystyle=\int\Big{(}\big{\|}\theta_{1}-T_{\nu}(\theta_{1})\big{\|}^{2}+\big{\|}\theta_{2}-T_{p}(\theta_{2})\big{\|}^{2}\Big{)}\nu_{1}({\mathrm{d}}\theta_{1})p_{1}({\mathrm{d}}\theta_{2})
=W22(ν1,ν2)+W22(p1,p2).\displaystyle=W_{2}^{2}(\nu_{1},\nu_{2})+W_{2}^{2}(p_{1},p_{2}).

Thus, we complete the proof of Lemma D.5. ∎

Lemma D.6.

For probability density functions $\rho,\rho_{1},\rho_{2}\in\mathscr{P}_{2}(\Theta)$, let $\widetilde{\rho}_{1}=\alpha^{-1}\rho_{1}+(1-\alpha^{-1})\rho$ and $\widetilde{\rho}_{2}=\alpha^{-1}\rho_{2}+(1-\alpha^{-1})\rho$. Then, we have

W2(ρ~1,ρ~2)α1/2W2(ρ1,ρ2).W_{2}(\widetilde{\rho}_{1},\widetilde{\rho}_{2})\leq\alpha^{-1/2}W_{2}(\rho_{1},\rho_{2}).
Proof.

Recall the definition of the Wasserstein-2 distance:

W2(ρ1,ρ2)=(infγΓ(ρ1,ρ2)xy2dγ(x,y))1/2,\displaystyle W_{2}(\rho_{1},\rho_{2})=\Big{(}\inf_{\gamma\in\Gamma(\rho_{1},\rho_{2})}\int\|x-y\|^{2}{\mathrm{d}}\gamma(x,y)\Big{)}^{1/2},

where Γ(ρ1,ρ2)\Gamma(\rho_{1},\rho_{2}) is the set of all couplings of ρ1\rho_{1} and ρ2\rho_{2}. We assume that the infimum is reached by γ(x,y)Γ(ρ1,ρ2)\gamma^{*}(x,y)\in\Gamma(\rho_{1},\rho_{2}), i.e.,

W2(ρ1,ρ2)=(xy2dγ(x,y))1/2.\displaystyle W_{2}(\rho_{1},\rho_{2})=\Big{(}\int\|x-y\|^{2}{\mathrm{d}}\gamma^{*}(x,y)\Big{)}^{1/2}.

We denote by γ(x,y)\gamma^{\prime}(x,y) the distribution such that

γ(x,y)=α1γ(x,y)+(1α1)ρ(x)δ(xy),\displaystyle\gamma^{\prime}(x,y)=\alpha^{-1}\gamma^{*}(x,y)+(1-\alpha^{-1})\rho(x)\delta(x-y),

where $\delta(\cdot)$ denotes the Dirac delta function. The marginals of $\gamma^{\prime}$ satisfy $\int\gamma^{\prime}(x,y){\mathrm{d}}y=\alpha^{-1}\rho_{1}(x)+(1-\alpha^{-1})\rho(x)=\widetilde{\rho}_{1}(x)$ and, similarly, $\int\gamma^{\prime}(x,y){\mathrm{d}}x=\widetilde{\rho}_{2}(y)$, so that $\gamma^{\prime}(x,y)\in\Gamma(\widetilde{\rho}_{1},\widetilde{\rho}_{2})$. Hence, it follows that

W2(ρ~1,ρ~2)\displaystyle W_{2}(\widetilde{\rho}_{1},\widetilde{\rho}_{2}) (xy2dγ(x,y))1/2\displaystyle\leq\Big{(}\int\|x-y\|^{2}{\mathrm{d}}\gamma^{\prime}(x,y)\Big{)}^{1/2}
=(xy2(α1γ(x,y)+(1α1)ρ(x)δ(xy))d(x,y))1/2\displaystyle=\Big{(}\int\|x-y\|^{2}\big{(}\alpha^{-1}\gamma^{*}(x,y)+(1-\alpha^{-1})\rho(x)\delta(x-y)\big{)}{\mathrm{d}}(x,y)\Big{)}^{1/2}
=α1/2W2(ρ1,ρ2).\displaystyle=\alpha^{-1/2}W_{2}(\rho_{1},\rho_{2}).

Thus, we complete the proof of Lemma D.6. ∎

Appendix E Auxiliary Lemmas

We use the definition of absolutely continuous curves in 𝒫2(D)\mathscr{P}_{2}(\mathbb{R}^{D}) in Ambrosio et al. (2008).

Definition E.1 (Absolutely Continuous Curve).

Let β:[a,b]𝒫2(D)\beta:[a,b]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D}) be a curve. Then, we say β\beta is an absolutely continuous curve if there exists a square-integrable function f:[a,b]f:[a,b]\rightarrow\mathbb{R} such that

W2(βs,βt)stf(τ)dτ\displaystyle W_{2}(\beta_{s},\beta_{t})\leq\int_{s}^{t}f(\tau)\,{\mathrm{d}}\tau

for any as<tba\leq s<t\leq b.

Then, we have the following first variation formula.

Lemma E.2 (First Variation Formula, Theorem 8.4.7 in Ambrosio et al. (2008)).

Given ν𝒫2(D)\nu\in\mathscr{P}_{2}(\mathbb{R}^{D}) and an absolutely continuous curve μ:[0,T]𝒫2(D)\mu:[0,T]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D}), let β:[0,1]𝒫2(D)\beta:[0,1]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D}) be the geodesic connecting μt\mu_{t} and ν\nu. It holds that

ddtW2(μt,ν)22=μ˙t,β˙0μt,W2,\displaystyle\frac{{\mathrm{d}}}{{\mathrm{d}}t}\frac{W_{2}(\mu_{t},\nu)^{2}}{2}=-\langle\dot{\mu}_{t},\dot{\beta}_{0}\rangle_{\mu_{t},W_{2}},

where μ˙t=tμt\dot{\mu}_{t}=\partial_{t}\mu_{t}, β˙0=sβs|s=0\dot{\beta}_{0}=\partial_{s}\beta_{s}{\,|\,}_{s=0}, and the inner product is defined in (A.2).

Lemma E.3 (Eulerian Representation of Geodesics, Proposition 5.38 in Villani (2003)).

Let β:[0,1]𝒫2(D)\beta:[0,1]\rightarrow\mathscr{P}_{2}(\mathbb{R}^{D}) be a geodesic and vv be the corresponding vector field such that tβt=div(βtvt)\partial_{t}\beta_{t}=-\mathop{\mathrm{div}}(\beta_{t}\cdot v_{t}). It holds that

ddt(βtvt)=div(βtvtvt),\displaystyle\frac{{\mathrm{d}}}{{\mathrm{d}}t}(\beta_{t}\cdot v_{t})=-\mathop{\mathrm{div}}(\beta_{t}\cdot v_{t}\otimes v_{t}),

where \otimes is the outer product between two vectors.

Lemma E.4 (Talagrand’s Inequality, Corollary 2.1 in Otto and Villani (2000)).

Let ν\nu be N(0,κID)N(0,\kappa\cdot I_{D}). It holds for any μ𝒫2(D)\mu\in\mathscr{P}_{2}(\mathbb{R}^{D}) that

\displaystyle W_{2}(\mu,\nu)^{2}\leq 2\kappa\cdot D_{\rm KL}(\mu\,\|\,\nu).
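
For instance, taking $\mu=N(m,\kappa\cdot I_{D})$ gives

\displaystyle D_{\rm KL}(\mu\,\|\,\nu)=\frac{\|m\|^{2}}{2\kappa},\qquad W_{2}(\mu,\nu)=\|m\|,

so the inequality holds with equality in this case.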
Lemma E.5 (Theorem 1 in Peyre (2011)).

Let $\mu,\nu$ be two probability measures in $\mathscr{P}_{2}(\Theta)$. Then, it holds that

W2(μ,ν)2μνH˙1(ν).\displaystyle W_{2}(\mu,\nu)\leq 2\|\mu-\nu\|_{\dot{H}^{-1}(\nu)}.

Appendix F Conclusions and Limitations

In this work, we study the time evolution of two-timescale AC in the mean-field limit, where the actor and critic are represented by two-layer neural networks. Specifically, the actor updates its policy via proximal policy optimization, which is closely related to replicator dynamics, while the critic updates its estimate via temporal-difference learning, which is captured by a semigradient flow in the Wasserstein space. By introducing a restarting mechanism, we establish the convergence and optimality of AC with overparameterized two-layer neural networks. However, this study has limitations. We only analyze the continuous-time limiting regime, an idealized setting with infinitesimal learning rates, and establish finite-time convergence and optimality guarantees within it. Finite-time results for the more realistic discrete-time setting are left for future research.